Patterns in Data Science

This is a collection of code excerpts, mostly in R, that address common puzzles in data science. They are organized by body of knowledge and are meant to serve as something like a cheat sheet. I am still gathering them from a number of past projects and course notes, so more content will be added, and what is already here may be reorganized further (time allowing).

Most of this content is not original: it is either copied from my own other projects or, for the vast majority, taken from lectures and class notes of the Data Science Specialization at Johns Hopkins.

Note: GitHub has size limitations on serving HTML content, and most of this material is rendered as HTML output from R chunks. If you have problems viewing the content, click on [MD] or [RMD] for the markup document.

Reproducible Research

[MD] // [RMD]

Reproducible research is the idea that data analyses, and more generally scientific claims, are published with their data and software code so that others may verify the findings and build upon them. The need for reproducibility is increasing dramatically as data analyses become more complex, involving larger datasets and more sophisticated computations. Reproducibility allows people to focus on the actual content of a data analysis, rather than on superficial details reported in a written summary. In addition, reproducibility makes an analysis more useful to others because the data and code that actually produced the analysis are available.

Sample Online Research (Browse the Code)

Health and Economic Impact of Severe Weather Events

Statistical Inference

[MD] // [RMD]

Statistical inference is the process of drawing conclusions about populations or scientific truths from data. There are many modes of performing inference, including statistical modeling, data-oriented strategies, and the explicit use of designs and randomization in analyses.
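
As a minimal sketch of drawing a conclusion about a population from a sample, the base R code below simulates data and computes a 95% t confidence interval for the mean, first by hand and then with `t.test`. The seed, sample size, and population parameters are illustrative assumptions, not values taken from the notes.

```r
# Simulate a sample from a population with a known mean (assumed values)
set.seed(42)
true_mean <- 10
x <- rnorm(n = 50, mean = true_mean, sd = 2)

# 95% t confidence interval for the population mean, computed by hand
n <- length(x)
se <- sd(x) / sqrt(n)
ci_manual <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se

# The same interval from the built-in one-sample t-test
ci_ttest <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(ci_manual)
print(ci_ttest)
```

Both approaches should return the same interval; the manual version only makes the role of the standard error and the t quantile explicit.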

Sample Simulations (Browse the Code)

Regression Models

[MD] // [RMD]

Linear models, as their name implies, relate an outcome to a set of predictors of interest using linear assumptions. Regression models, a subset of linear models, are the most important statistical analysis tool in data science. These notes cover regression analysis, least squares, and inference using regression models.
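
To illustrate least squares and inference on a regression model, here is a small sketch that uses the built-in `mtcars` dataset. The choice of dataset and of `mpg ~ wt` as the model is an assumption made for illustration and is not necessarily what the notes or the sample report use.

```r
# Fit a simple linear regression of fuel efficiency on weight
data(mtcars)
fit <- lm(mpg ~ wt, data = mtcars)

# Least squares estimates by hand: slope = cov(x, y) / var(x)
slope_manual <- cov(mtcars$wt, mtcars$mpg) / var(mtcars$wt)
intercept_manual <- mean(mtcars$mpg) - slope_manual * mean(mtcars$wt)

# Estimated coefficients and their 95% confidence intervals
print(coef(fit))
print(confint(fit, level = 0.95))
```

The hand-computed slope and intercept should match `coef(fit)`, which shows that `lm` is solving the same least squares problem.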

Sample Report (Browse the Code)

Predicting the Carbon Footprint of Automobiles Through Regression Models

Machine Learning

Annotations and patterns related to machine learning turned out to be very long, so they are divided further into four groups.

These are annotations on tools and techniques for understanding, building, and testing prediction functions. They cover common tasks performed by data scientists and data analysts in prediction and machine learning; basic grounding concepts related to the selection of training and test sets, overfitting, and error rates; a wide range of model-based and algorithmic machine learning methods, including regression, classification trees, Naive Bayes, and random forests; and the complete process of building prediction functions, including data collection, feature creation, algorithms, and evaluation.
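
The idea of separating training and test sets and comparing their errors can be sketched as follows. This example uses only base R and the built-in `mtcars` dataset rather than any particular machine learning package; the 70/30 split, the seed, the predictors, and the `rmse` helper are all assumptions made for illustration.

```r
# Split the data into training and test sets (assumed 70/30 split)
set.seed(123)
data(mtcars)
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# Build the prediction function on the training set only
fit <- lm(mpg ~ wt + hp, data = train)

# Hypothetical helper: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# In-sample error versus out-of-sample error
rmse_train <- rmse(train$mpg, predict(fit, newdata = train))
rmse_test <- rmse(test$mpg, predict(fit, newdata = test))

print(c(train = rmse_train, test = rmse_test))
```

A test error that is much larger than the training error is a common sign of overfitting, which is why the test set is held out until the model is final.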

Data Products

[MD] // [RMD]

A data product is the final result of a statistical analysis, packaged for larger audiences. Data products automate complex analysis tasks or use technology to expand the utility of a data-informed model, algorithm, or inference. These are notes about the basics of creating data products as web applications, R packages, and interactive graphics.

Sample Applications (Browse the Code)


jfaleiro/datasciencepatterns
