Welcome to Computer Learning Using Big Data!

Data come from a wide variety of sources: from physical sensors, social media posts, academic survey responses, polling responses, video feeds, and more. How do we make sense of all that data? How can we learn from the data to better understand our world? This course will introduce computer programming tools that you can use to handle large data sets, visualize the data, and teach the computer how to learn from the data to make predictions about the future.

Course Tools

We will be programming in Python, with an alternative track in R. No prior programming experience is necessary: you will learn everything you need to know over the course of the semester. If you are familiar with a Python or R distribution, you may install the necessary packages as you go along.

Another alternative is to use SageMath Cloud (since renamed CoCalc) for the programming, visualization, and machine learning in this course. All of the packages and tools we use are already installed there.

Homework

Homework will be assigned at the end of each class period. Learning is best accomplished by doing, so give the assignments a try.

Course Project

The course also has a final project. The goal is to combine all of the tools and techniques into a single machine learning project.

Where to get data

We will spend quite a bit of time looking for public data as we get going in this class. Here are a few places to look for data sets to work with:

The UCI repository: https://archive.ics.uci.edu/ml/index.php
Kaggle Public Datasets: https://www.kaggle.com/datasets
Caesar's repository: https://github.com/caesar0301/awesome-public-datasets

You will need to dig into some of these sources to find data sets that you are interested in working with.
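
Many of these sources distribute their data as CSV files. As a minimal sketch of a first look at such a file, the example below uses only the Python standard library. The small in-memory table, its column names, and its values are made up for illustration and stand in for a file you would download yourself.

```python
import csv
import io
import statistics

# Hypothetical stand-in for a downloaded CSV file (made-up values).
raw = io.StringIO(
    "sepal_length,species\n"
    "5.1,setosa\n"
    "7.0,versicolor\n"
    "6.3,virginica\n"
)

# DictReader uses the header row as keys; every value arrives as a string.
rows = list(csv.DictReader(raw))

# Convert the numeric column before doing any arithmetic on it.
lengths = [float(row["sepal_length"]) for row in rows]
mean_length = statistics.mean(lengths)

print(len(rows), "rows; columns:", list(rows[0].keys()))
print("mean sepal_length:", round(mean_length, 2))
```

For a real file, replace the `io.StringIO` object with `open("your_file.csv", newline="")`, where the file name is whatever you saved the download as.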

Course Schedule

Class	Topic
01	Big Data Ingesting: CSVs, Data frames, and Plots: Python / R
02	ML Models: Linear regression + Validation: Python / R
03	Big Data Cleaning: Data Transformations: Python / R (with Python Supplemental Work or R Supplemental Work)
04	ML Models: Naïve Bayes + Evaluation Metrics: Python / R
05	Big Data Enrichment: Joining and Grouping data: Python / R
06	ML Models: SVM + Overfitting: Python / R
07	ML Models: Decision Trees: Python / R
08	ML Techniques: Outliers
09	ML Models: Clustering
10	ML Techniques: Feature scaling
11	ML Techniques: PCA
12	ML Models: Logistic Regression
13	ML Models: Neural Networks
Appendix 1	Regressions using continuous and categorical features: Python / R

The Three Learning Principles

1. Occam's Razor

The general idea from Occam is: Frustra fit per plura quod potest fieri per pauciora (It is futile to do with more things that which can be done with fewer). We will strive, again and again over the semester, to cut unneeded details from our models. As a rule of thumb, the simplest model that fits the data is also the most plausible and tends to have the best out-of-sample performance. We will look at how a model's free parameters affect its in-sample and out-of-sample performance.
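
One way to see the effect of free parameters is to fit polynomials of increasing degree to the same small data set. The sketch below assumes NumPy is installed and uses synthetic data whose true relationship is a straight line plus noise. The data, the noise level, and the degrees tried are all choices made for this illustration, not course material. Adding parameters can only lower the in-sample error, while the out-of-sample error typically stops improving and often gets worse.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Assumed ground truth for this illustration: a straight line plus noise.
    x = rng.uniform(-1.0, 1.0, n)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, n)
    return x, y

x_train, y_train = make_data(15)   # small in-sample set used for fitting
x_test, y_test = make_data(200)    # separate out-of-sample set

def mse(coeffs, x, y):
    # Mean squared error of the fitted polynomial on the points (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    # A degree-d polynomial has d + 1 free parameters.
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: in-sample {results[degree][0]:.3f}, "
          f"out-of-sample {results[degree][1]:.3f}")
```

Compare the two columns of the printout: the gap between them is a rough indication of how much each model has fit the noise rather than the underlying trend.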

2. Sampling Bias

No matter how good your model is, the real-world performance ultimately depends on the data you use to train the model. If the data have an inherent bias, then the model's predictions will reflect that bias. We will continually ask ourselves: "Where do the data come from?" and "How are they collected?" The more we understand about the quality of the data, the more we will know about their predictive power.
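
A small simulation can make the effect concrete. The sketch below invents a population of incomes (the distribution and its parameters are made up) and compares a simple random sample against a sample drawn only from people above the median, as might happen if a survey reached only one group. It uses only the Python standard library.

```python
import random
import statistics

random.seed(1)

# Hypothetical population of 10,000 incomes (made-up lognormal values).
population = [random.lognormvariate(10.5, 0.6) for _ in range(10_000)]
true_mean = statistics.mean(population)

# Fair collection: every member is equally likely to be sampled.
fair_sample = random.sample(population, 500)
fair_mean = statistics.mean(fair_sample)

# Biased collection: only members above the median can be reached.
cutoff = statistics.median(population)
reachable = [value for value in population if value > cutoff]
biased_sample = random.sample(reachable, 500)
biased_mean = statistics.mean(biased_sample)

print(f"population mean:    {true_mean:,.0f}")
print(f"fair sample mean:   {fair_mean:,.0f}")
print(f"biased sample mean: {biased_mean:,.0f}")
```

Both samples have the same size, so the gap in the biased estimate comes from how the data were collected rather than from how much data there is.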

3. Data Snooping

This principle is, perhaps, the most challenging. If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised. We will work hard to separate out "testing" data and lock it away as soon as possible in the learning process. This will help us combat the problems that arise through data snooping.
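
As a minimal sketch of what "locking away" the testing data can look like, the example below splits a made-up data set first and then learns a preprocessing step (standardization) from the training portion only. The data, the 80/20 split, and the `standardize` helper are assumptions chosen for illustration. The point is the order of operations: nothing computed from the test set feeds back into any earlier step.

```python
import random
import statistics

random.seed(0)

# Hypothetical data set: 100 (feature, label) pairs with made-up values.
data = [(random.gauss(50.0, 10.0), random.choice([0, 1])) for _ in range(100)]

# Step 1: shuffle and split before examining the data any further.
random.shuffle(data)
split = len(data) * 4 // 5          # 80% for training, 20% locked away
train, test = data[:split], data[split:]

# Step 2: learn preprocessing parameters from the training set only.
train_features = [x for x, _ in train]
mu = statistics.mean(train_features)
sigma = statistics.stdev(train_features)

def standardize(x):
    # Uses the training mean and standard deviation, never the test set's.
    return (x - mu) / sigma

train_scaled = [(standardize(x), y) for x, y in train]

# Step 3: touch the test set once, at the end, with the same parameters.
test_scaled = [(standardize(x), y) for x, y in test]

print(len(train_scaled), "training rows,", len(test_scaled), "test rows")
```

Computing `mu` and `sigma` from all 100 rows would be a mild form of data snooping, because the test rows would then have influenced the preprocessing.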
