jfaleiro/datasciencepatterns

Patterns in Data Science

This is a collection of code excerpts, mostly in R, that address common puzzles in data science. They are organized by body of knowledge and are meant to serve as something like a cheat sheet. I am still gathering them from past projects and course notes, so more content will be added and what is already here may be reorganized, time allowing.

Most of this content is not original: it is either copied from my own other projects or, in the vast majority of cases, taken from lectures and class notes of the Data Science Specialization at Johns Hopkins.

Note: GitHub limits the size of the HTML content it serves, and most of this material is rendered as HTML output from R chunks. If you have problems viewing the content, click on [MD] or [RMD] for the markup document.

[MD] // [RMD]

Reproducible research is the idea that data analyses, and more generally scientific claims, are published with their data and software code so that others may verify the findings and build upon them. The need for reproducibility is increasing dramatically as data analyses become more complex, involving larger datasets and more sophisticated computations. Reproducibility allows people to focus on the actual content of a data analysis rather than on superficial details reported in a written summary. In addition, reproducibility makes an analysis more useful to others because the data and code that actually produced the analysis are available.

Sample Online Research (Browse the Code)

[MD] // [RMD]

Statistical inference is the process of drawing conclusions about populations or scientific truths from data. There are many modes of performing inference, including statistical modeling, data-oriented strategies, and explicit use of designs and randomization in analyses.
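As a minimal sketch of inference by simulation (not taken from the notes themselves), the R snippet below repeatedly draws samples from a population with a known mean and checks how often a 95% t-interval covers that mean. The population parameters, sample size, and variable names are assumptions chosen only for illustration.

```r
# Simulation sketch: how often does a 95% t-interval cover the true mean?
set.seed(42)        # fixed seed so the run is reproducible
true_mean <- 10     # assumed population mean (illustrative value)
n_sims <- 1000      # number of simulated samples
n <- 30             # observations per sample

covered <- replicate(n_sims, {
  x <- rnorm(n, mean = true_mean, sd = 2)        # draw one sample
  ci <- t.test(x, conf.level = 0.95)$conf.int    # 95% interval for the mean
  ci[1] <= true_mean && true_mean <= ci[2]       # TRUE if the interval covers it
})

coverage <- mean(covered)   # empirical coverage; expected to land near 0.95
print(coverage)
```

If the interval procedure behaves as advertised, the printed proportion should be close to the nominal 95% level, with some simulation noise.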

Sample Simulations (Browse the Code)

[MD] // [RMD]

Linear models, as their name implies, relate an outcome to a set of predictors of interest using linear assumptions. Regression models, a subset of linear models, are the most important statistical analysis tool in data science. These notes cover regression analysis, least squares, and inference using regression models.
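To make least squares and regression inference concrete, here is a small sketch in base R. It uses the built-in `mtcars` dataset, chosen here only as a convenient example, fits a simple regression with `lm`, and checks the coefficients against the normal equations.

```r
# Least squares sketch on the built-in mtcars data (illustrative choice)
data(mtcars)

# Fit mpg as a linear function of weight
fit <- lm(mpg ~ wt, data = mtcars)

# Same estimate from the normal equations: solve (X'X) beta = X'y
X <- cbind(1, mtcars$wt)                        # design matrix with intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% mtcars$mpg)

print(coef(fit))                    # intercept and slope from lm
print(drop(beta_hat))               # intercept and slope from the normal equations
print(confint(fit, level = 0.95))   # 95% confidence intervals for both coefficients
```

The two sets of coefficients should agree up to floating-point error. In practice `lm` is preferable because it uses a QR decomposition, which is numerically more stable than forming X'X directly.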

Sample Report (Browse the Code)

Machine Learning

Annotations and patterns related to machine learning turned out to be very long, so we divided them further into 4 groups:

These are annotations on tools and techniques for understanding, building, and testing prediction functions. They cover common tasks performed by data scientists and data analysts in prediction and machine learning; basic grounding concepts related to the selection of training and test sets, overfitting, and error rates; a wide range of model-based and algorithmic machine learning methods, including regression, classification trees, Naive Bayes, and random forests; and the complete process of building prediction functions, including data collection, feature creation, algorithms, and evaluation.
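The idea of separating training and test sets and comparing error rates can be sketched in a few lines of base R. This is an illustrative example, not the approach used in the notes: the dataset (`mtcars`), the 70/30 split, the predictors, and the helper `rmse` are all assumptions, and base R is used instead of a modeling package to keep the snippet free of dependencies.

```r
# Train/test split sketch: compare in-sample and out-of-sample error
set.seed(123)                 # fixed seed so the split is reproducible
data(mtcars)

n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))   # 70% of rows for training
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]                             # remaining rows held out

# Fit the prediction function on the training set only
fit <- lm(mpg ~ wt + hp, data = train)

# Hypothetical helper: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

train_rmse <- rmse(train$mpg, predict(fit, newdata = train))
test_rmse <- rmse(test$mpg, predict(fit, newdata = test))

print(c(train = train_rmse, test = test_rmse))
```

A test error much larger than the training error is a common sign of overfitting. With a dataset this small the gap can also reflect an unlucky split, so the comparison is best repeated over several splits.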

[MD] // [RMD]

A data product is the final result of a statistical analysis, packaged for a larger audience. Data products automate complex analysis tasks or use technology to expand the utility of a data-informed model, algorithm, or inference. These are notes about the basics of creating data products as web applications, R packages, and interactive graphics.

Sample Applications (Browse the Code)
