DAT3 Course Repository

Course materials for General Assembly's Data Science course in Washington, DC (10/2/14 - 12/18/14).

Instructors: Josiah Davis and Kevin Markham (Data School blog, email newsletter, YouTube channel)

Course Project information

Week | Tuesday | Thursday
-----|---------|---------
0 | | 10/2: Introduction
1 | 10/7: Git and GitHub | 10/9: Base Python
2 | 10/14: Getting and Cleaning Data | 10/16: Exploratory Data Analysis
3 | 10/21: Linear Regression (Milestone: Question and Data Set) | 10/23: Linear Regression Part 2
4 | 10/28: Machine Learning and KNN | 10/30: Model Evaluation
5 | 11/4: Logistic Regression (Milestone: Data Exploration and Analysis Plan) | 11/6: Logistic Regression Part 2, Clustering
6 | 11/11: Dimension Reduction | 11/13: Clustering Part 2, Naive Bayes
7 | 11/18: Natural Language Processing | 11/20: Decision Trees
8 | 11/25: Recommenders (Milestone: First Draft Due) | Thanksgiving
9 | 12/2: Ensembling | 12/4: Ensembling Part 2, Python Companion Tools
10 | 12/9: Working a Data Problem (Milestone: Second Draft Due) | 12/11: Neural Networks
11 | 12/16: Review | 12/18: Project Presentations

Class 1: Introduction

  • Introduction to General Assembly
  • Course overview and philosophy (slides)
  • What is data science? (slides)
  • Brief demo of Slack

Homework:

Optional:

Class 2: Git and GitHub

  • Homework discussion: Any installation issues? Find any interesting GitHub projects? Any takeaways from "Analyzing the Analyzers"?
  • Introduce yourself: What's your technical background? Why did you join this course? How do you define success in this course?
  • Office hours
  • Git and GitHub lesson (slides)
    • Create a repo on GitHub, clone it, make changes, and push up to GitHub
    • Fork the DAT3-students repo, clone it, add a Markdown file (about.md) in your folder, push up to GitHub, and create a pull request

Homework:

Optional:

  • Clone this repo (DAT3) for easy access to the course files
  • Watch Introduction to Git and GitHub (36 minutes) for a recap of much of today's presentation
  • Read the first two chapters of Pro Git for a much deeper understanding of version control and the basic Git commands
  • Learn some more Markdown and add it to your about.md file, then push those edits to GitHub and send another pull request
  • Read this friendly command line tutorial if you are brand new to the command line
  • For more project inspiration, browse the student projects from Andrew Ng's Machine Learning course at Stanford

Resources:

Class 3: Base Python

  • Any questions about Git/GitHub?
  • Discuss the course project. What's one thing you learned from reviewing student projects?
  • Base Python lesson, with exercises (code)

Homework:

Class 4: Getting and Cleaning Data

Homework:

Optional:

Resources:

Class 5: Exploratory Data Analysis

Homework:

Optional:

Resources:

  • For more web scraping with Beautiful Soup 4, here's a longer example: slides, code
  • Web scraping without writing any code: "turn any website into an API" with import.io or kimono
  • Simple examples of joins in Pandas, for when you need to merge multiple DataFrames together
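As a quick illustration of those joins, the difference between an inner and a left merge shows up clearly on two tiny hand-made DataFrames (the data and column names below are invented for illustration):

```python
import pandas as pd

# Hypothetical data: orders and a customer lookup table sharing a key column
orders = pd.DataFrame({'customer_id': [1, 2, 2, 3],
                       'amount': [10, 20, 15, 30]})
customers = pd.DataFrame({'customer_id': [1, 2, 4],
                          'name': ['Ann', 'Bob', 'Dee']})

# Inner join: keep only customer_ids present in BOTH frames
inner = pd.merge(orders, customers, on='customer_id', how='inner')

# Left join: keep every order, filling missing names with NaN
left = pd.merge(orders, customers, on='customer_id', how='left')
```

Customer 3 has no match, so the inner join drops that order while the left join keeps it with a missing name.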

Class 6: Linear Regression

  • Discuss your project question and data set
  • Pandas for visualization (code)
  • Linear regression (code, slides)
    • What is linear regression?
    • How do I interpret the output?
    • What assumptions does linear regression depend upon?
    • What are multicollinearity and heteroskedasticity, and why should I care?
    • How do I represent categorical variables?
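As a rough sketch of the last question, the snippet below dummy-encodes a categorical variable before fitting a linear regression. The housing-style data, column names, and numbers are made up for illustration:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: predict price from area plus a categorical neighborhood
df = pd.DataFrame({
    'area': [1000, 1500, 1200, 2000, 900, 1700],
    'neighborhood': ['A', 'A', 'B', 'B', 'C', 'C'],
    'price': [200, 280, 250, 370, 180, 330],
})

# Represent the categorical variable as dummy (one-hot) columns,
# dropping one level to avoid perfect multicollinearity
X = pd.get_dummies(df[['area', 'neighborhood']],
                   columns=['neighborhood'], drop_first=True)
y = df['price']

# One coefficient for area, one for each retained dummy level
model = LinearRegression().fit(X, y)
```

Dropping the first level makes neighborhood A the baseline: each dummy coefficient is then the price shift relative to A, holding area constant.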

Optional:

  • Post your favorite visualization in the "viz" channel on Slack, and tell us what you like about it!

Resources:

  • For more on Pandas plotting, browse through this IPython notebook or read the visualization page from the official Pandas documentation
  • To learn how to customize your plots further, browse through this IPython notebook on matplotlib
  • To explore different types of visualizations and when to use them, Choosing a Good Chart is a handy one-page reference, and here is an excellent slide deck from Columbia's Data Mining class
  • If you are already a master of ggplot2 in R, you may prefer "ggplot for Python" over matplotlib: introduction, tutorial

Class 7: Linear Regression Part 2

  • Linear regression, continued

Homework:

  • Complete the exercises at the end of the Python script from class

Resources:

Class 8: Machine Learning and KNN

  • Discuss homework solutions (code)
  • "Human learning" on iris data using Pandas (code)
  • Introduction to numpy (code)
  • Machine learning and K-Nearest Neighbors (slides)
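A minimal sketch of the KNN workflow on the built-in iris data, using the current scikit-learn API (the query measurement below is made up):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Fit KNN with K=5, then classify one unseen flower measurement
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
pred = knn.predict([[3, 5, 4, 2]])
```

The prediction is simply the majority class among the 5 training flowers closest to the query point.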

Homework:

Optional:

  • Walk through the rest of the numpy reference and see if you can understand each of the functions

Resources:

  • For a more thorough introduction to numpy, this guide is quite good

Class 9: Model Evaluation

  • Introduction to scikit-learn with iris data (code)
  • Discuss the article on the bias-variance tradeoff
  • Model evaluation procedures (slides, code)
    • Training error
    • Underfitting and overfitting
    • Test set approach
    • Cross-validation
  • Model evaluation metrics (slides, code)
    • Confusion matrix
  • Introduction to Kaggle
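The procedures and metrics above can be sketched in a few lines. Note this uses the current scikit-learn API (`sklearn.model_selection`), which has been reorganized since the 2014-era `sklearn.cross_validation` module:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)

# Test set approach: hold out 25% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

# Cross-validation: accuracy averaged over 5 train/test splits
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# Confusion matrix of the held-out predictions
cm = confusion_matrix(y_test, clf.predict(X_test))
```

Training accuracy alone rewards overfitting; the held-out and cross-validated scores estimate how the model does on new data.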

Homework:

Optional:

Resources:

Class 10: Logistic Regression

  • Any questions from last time: model evaluation, Kaggle, article on Smart Autofill?
  • Summary of your feedback
  • Discuss your data exploration and analysis plan
  • Logistic Regression (slides, code)
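A minimal logistic regression sketch on made-up hours-studied data, showing predicted probabilities and the odds-ratio reading of a coefficient:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. pass/fail outcome
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(hours, passed)

# predict_proba returns [P(fail), P(pass)] for each input
probs = clf.predict_proba([[1.0], [3.5]])

# The model works in log-odds; exp(coefficient) is the odds ratio
# for passing per additional hour of study
odds_ratio = np.exp(clf.coef_[0][0])
```

Since the fitted coefficient is positive, each extra hour multiplies the odds of passing by `odds_ratio`, and the predicted pass probability rises with hours studied.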

Homework:

  • Continue to work on Part I of the exercise from class and submit your solution to DAT3-students

Class 11: Logistic Regression Part 2, Clustering

  • Logistic Regression, continued (exercise solution)
  • Clustering (slides)
    • Why cluster?
    • Introduction to the K-means algorithm
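K-means alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points. A minimal sketch on made-up 2-D points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two visibly separate groups
X = np.array([[1, 1], [1.5, 2], [2, 1.5],
              [8, 8], [8.5, 9], [9, 8.5]])

# Run the assign/recompute loop from 10 random initializations
# and keep the best result (lowest within-cluster sum of squares)
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
```

Because K must be chosen in advance and the algorithm only finds a local optimum, multiple initializations (`n_init`) are standard practice.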

Homework:

  • Read through section 8.2 on K-means Clustering from Introduction to Data Mining by next Thursday. What are some of the strengths and limitations of k-means clustering?

Resources:

Class 12: Dimension Reduction

Homework:

  • Read Paul Graham's "A Plan for Spam" in preparation for Thursday's class on Naive Bayes

Resources:

Class 13: Clustering Part 2, Naive Bayes

Homework:

  • Open Python, type import nltk, then type nltk.download(); in the "NLTK Downloader" popup window, click "all" and then "Download". Do this at home, since the download is more than 300 MB! If space on your computer is limited, we can tell you next class exactly which packages to download.

Resources:

Class 14: Natural Language Processing

Resources:

Homework:

  • We will use Graphviz to visualize the output of the classification trees. Please install it before class.

Class 15: Decision Trees

At the end of this class, you should be able to do the following:

  • Describe the output of a decision tree to someone without a data science background
  • Describe how the algorithm creates the decision tree
  • Predict the likelihood of a binary event using the decision tree algorithm in scikit-learn
  • Create a decision tree visualization
  • Determine the optimal tree size using a tuning grid and the AUC metric in Python
  • Describe the strengths and weaknesses of a decision tree
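As a sketch of the tuning-grid objective above, the snippet below searches over tree depth with AUC scoring on scikit-learn's built-in (binary) breast cancer data, using the current `sklearn.model_selection` API rather than the 2014-era module layout:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Tuning grid over tree size, scored by AUC with 5-fold cross-validation
param_grid = {'max_depth': [1, 2, 3, 4, 5]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid, scoring='roc_auc', cv=5)
grid.fit(X, y)

best_depth = grid.best_params_['max_depth']  # depth with highest mean AUC
best_auc = grid.best_score_
```

Shallow trees underfit and deep trees overfit; the grid search picks the depth with the best cross-validated AUC rather than the best training fit.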

Homework:

  • Work on your project. The first draft of your project is due on Tuesday at 5 pm.

Resources:

  • Dr. Justin Esarey from Rice University has a nice video lecture on CART that also includes a code walkthrough
  • For those of you with background in javascript, d3.js has a nice tree layout that would make more presentable tree diagrams
    • Here is a link to a static version, as well as a link to a dynamic version with collapsible nodes
    • If this is something you are interested in, Gary Sieling wrote a nice function in Python to take the output of a scikit-learn tree and convert it into JSON format
    • If you are interested in learning d3.js, this is a good tutorial for understanding the building blocks of a decision tree. Here is another tutorial focusing on building a tree diagram in d3.js.
  • Chapter 8.1 of the Introduction to Statistical Learning also covers the basics of Classification and Regression Trees

Class 16: Recommenders

Class 17: Ensembling

Resources:

  • Leo Breiman's paper on Random Forests
  • yhat has a brief primer on Random Forests that can provide a review of many of the topics we covered today.
  • Here is a link to some Kaggle competitions that were won using Random Forests
  • Ensemble models tend to strongly outperform their component models on new data, which seems to violate Occam's razor. In his paper "The Generalization Paradox of Ensembles," John Elder IV argues for a more refined understanding of model complexity.
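A quick illustration of the ensembling payoff: comparing cross-validated accuracy of a single decision tree against a random forest (many trees grown on bootstrap samples with random feature subsets) on scikit-learn's built-in breast cancer data, using the current API:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Single fully-grown tree: low bias, high variance
tree_acc = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

# Averaging 100 decorrelated trees reduces that variance
forest_acc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5).mean()
```

Exact scores vary slightly by scikit-learn version, but the forest reliably beats the single tree here, which is Elder's point: the ensemble behaves as if it were *less* complex than it looks.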

Class 18: Ensembling Part 2, Python Companion Tools

Resources:

Class 19: Working a Data Problem

Class 20: Neural Networks

  • Inspiration for Neural Networks
  • Neural Networks
  • Gradient Descent
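The gradient descent idea (repeatedly step opposite the gradient of the loss) can be shown on a bare-bones least-squares problem with synthetic data:

```python
import numpy as np

# Synthetic data: y = X @ true_w exactly, so the optimum is known
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([3.0, -2.0])
y = X @ true_w

# Minimize mean squared error by gradient descent
w = np.zeros(2)
learning_rate = 0.1
for _ in range(200):
    gradient = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the MSE
    w -= learning_rate * gradient              # step downhill
```

The same update rule, applied to a network's loss via backpropagation, is how neural networks are trained; the learning rate trades off speed against overshooting.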

Resources:

Homework:

Class 21: Review

Resources:

Class 22: Project Presentations

  • Note: Guests are welcome! Invite your friends and family!

Bonus Content