DAT3 Course Repository
Course materials for General Assembly's Data Science course in Washington, DC (10/2/14 - 12/18/14).
|1||10/7: Git and GitHub||10/9: Base Python|
|2||10/14: Getting and Cleaning Data||10/16: Exploratory Data Analysis|
|3||10/21: Linear Regression
Milestone: Question and Data Set
|10/23: Linear Regression Part 2|
|4||10/28: Machine Learning and KNN||10/30: Model Evaluation|
|5||11/4: Logistic Regression
Milestone: Data Exploration and
|11/6: Logistic Regression Part 2, Clustering|
|6||11/11: Dimension Reduction||11/13: Clustering Part 2, Naive Bayes|
|7||11/18: Natural Language Processing||11/20: Decision Trees|
Milestone: First Draft Due
|9||12/2: Ensembling||12/4: Ensembling Part 2, Python Companion Tools|
|10||12/9: Working a Data Problem
Milestone: Second Draft Due
|12/11: Neural Networks|
|11||12/16: Review||12/18: Project Presentations|
Class 1: Introduction
- Introduction to General Assembly
- Course overview and philosophy (slides)
- What is data science? (slides)
- Brief demo of Slack
- Install Anaconda distribution of Python 2.7, Git, and Slack
- Add a photo to your Slack profile
- Create a GitHub account
- Read Analyzing the Analyzers (40 pages) and think about where you'd like to fit in!
- Subscribe to some data-focused newsletters, to keep current: Center for Data Innovation, O'Reilly Data Newsletter, Data Community DC
- Watch Introduction to Data Science and Analysis (50 minutes) for another look at the data science workflow
- Find an open source project hosted on GitHub that interests you
Class 2: Git and GitHub
- Homework discussion: Any installation issues? Find any interesting GitHub projects? Any takeaways from "Analyzing the Analyzers"?
- Introduce yourself: What's your technical background? Why did you join this course? How do you define success in this course?
- Office hours
- Git and GitHub lesson (slides)
- Review the course project information, past projects from other GA students, and public data sources
- Clone this repo (DAT3) for easy access to the course files
- Watch Introduction to Git and GitHub (36 minutes) to repeat a lot of today's presentation
- Read the first two chapters of Pro Git for a much deeper understanding of version control and the basic Git commands
- Learn some more Markdown and add it to your
about.mdfile, then push those edits to GitHub and send another pull request
- Read this friendly command line tutorial if you are brand new to the command line
- For more project inspiration, browse the student projects from Andrew Ng's Machine Learning course at Stanford
- Dillinger is a browser-based Markdown editor, useful for checking your Markdown code
- GitRef is an excellent reference guide for Git commands
- Git quick reference for beginners is a shorter reference guide with commands grouped by workflow
Class 3: Base Python
- Any questions about Git/GitHub?
- Discuss the course project. What's one thing you learned from reviewing student projects?
- Base Python lesson, with exercises (code)
- Complete the exercises at the end of the Python script we went over in class today and add your solutions to your folder in the DAT3-students repo
- Keep thinking about your project, and consult past projects and public data sources for more inspiration
Class 4: Getting and Cleaning Data
- Discuss homework solutions (code)
- File input/output in Python
- Getting data from APIs
- Exercise 2 from file input/output
- Read What I do when I get a new data set as told through tweets
- Watch Look at Your Data (18 minutes)
- Exercise 3 from file input/output
- Read this fun article about using web scraping to analyze Netflix's "micro-genres"
- Online Python Tutor is useful for visualizing (and debugging) your code
- Directory of API wrappers for Python
Class 5: Exploratory Data Analysis
- Discuss homework solutions (code)
- Scraping the web for data
- Pandas for data analysis (code)
- Split-Apply-Combine pattern
- Project milestone: Submit your question and data set to DAT3-students by Tuesday!
- Read through this excellent example of data wrangling and exploration in Pandas
- To learn more Pandas, read through this three-part tutorial (some overlap with today's class), or read through these two excellent (but extremely long) notebooks: Introduction to Pandas, Data Wrangling with Pandas
- For more web scraping with Beautiful Soup 4, here's a longer example: slides, code
- Web scraping without writing any code: "turn any website into an API" with import.io or kimono
- Simple examples of joins in Pandas, for when you need to merge multiple DataFrames together
Class 6: Linear Regression
- Discuss your project question and data set
- Pandas for visualization (code)
- Linear regression (code, slides)
- What is linear regression?
- How to interpret the output?
- What assumptions does linear regression depend upon?
- What is multicollinearity and heteroskedasticity, and why should I care?
- How do I represent categorical variables?
- Post your favorite visualization in the "viz" channel on Slack, and tell us what you like about it!
- For more on Pandas plotting, browse through this IPython notebook or read the visualization page from the official Pandas documentation
- To learn how to customize your plots further, browse through this IPython notebook on matplotlib
- To explore different types of visualizations and when to use them, Choosing a Good Chart is a handy one-page reference, and here is an excellent slide deck from Columbia's Data Mining class
- If you are already a master of ggplot2 in R, you may prefer "ggplot for Python" over matplotlib: introduction, tutorial
Class 7: Linear Regression Part 2
- Linear regression, continued
- Complete the exercises at the end of the python script from class
- One of the best places to go for more information about linear regression is chapter 3 of our course "textbook": An Introduction to Statistical Learning - or just read Kevin's highly abbreviated version
- For more information about core assumptions, check out this article and this one
- For more on log transformations, check out this article
- This handout provides an overview of the computation of the F-test
- This may be a helpful article on how to derive the coefficient estimates
Class 8: Machine Learning and KNN
- Discuss homework solutions (code)
- "Human learning" on iris data using Pandas (code)
- Introduction to numpy (code)
- Machine learning and K-Nearest Neighbors (slides)
- Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it on Thursday
- Walk through the rest of the numpy reference and see if you can understand each of the functions
- For a more thorough introduction to numpy, this guide is quite good
Class 9: Model Evaluation
- Introduction to scikit-learn with iris data (code)
- Discuss the article on the bias-variance tradeoff
- Model evaluation procedures (slides, code)
- Training error
- Underfitting and overfitting
- Test set approach
- Model evaluation metrics (slides, code)
- Confusion matrix
- Introduction to Kaggle
- Project milestone: Submit your "Data Exploration and Analysis Plan" to DAT3-students by Tuesday!
- Read this simple example of machine learning and see if you understand everything in the article
- Watch Kevin's Kaggle project presentation video (16 minutes) for a tour of the machine learning process
- For more on Kaggle, watch the video Kaggle Transforms Data Science Into Competitive Sport (28 minutes)
- For much more on the Kaggle Allstate competition, read Kevin's project paper, read a brief interview with the first place team, review the Python code from the second place team, or skim the solution sharing thread
- If you want to try out the Kaggle Bike Sharing Demand competition, feel free to reuse Kevin's starter code
- If you'd like to see more on today's topics, these videos from Hastie and Tibshirani are excellent: bias-variance tradeoff (10 minutes), test set (aka "validation set") approach (14 minutes), cross-validation (14 minutes) - or just read section 5.1 from their book (free PDF download!)
- Kevin wrote a simple guide to confusion matrix terminology that you can use as a reference guide
- The Kaggle wiki has a decent page describing other common model evaluation metrics
Class 10: Logistic Regression
- Any questions from last time: model evaluation, Kaggle, article on Smart Autofill?
- Summary of your feedback
- Discuss your data exploration and analysis plan
- Logistic Regression (slides, code)
- Continue to work on Part I of the exercise from class and submit your solution to DAT3-students
Class 11: Logistic Regression Part 2, Clustering
- Logistic Regression, continued (exercise solution)
- Clustering (slides)
- Why cluster?
- Introduction to the K-means algorithm
- Read through section 8.2 on K-means Clustering from Introduction to Data Mining by next Thursday. What are some of the strengths and limitations of k-means clustering?
- If you would like a review on the topics we covered today (and Tuesday), the videos from Hastie and Tibshirani from Stanford are very good:
- Introduction to Classification (10 minutes)
- Logistic Regression and Maximum Likelihood (9 minutes)
- Multivariate Logistic Regression and Confounding Variables (10 minutes)
- If you want to understand the math of how coefficients are estimated, check out these notes from CMU's Advanced Data Analysis class. Written by Cosma Shalizi, one of CMU's professors.
- Documentation for plotting math text
- Documentation for plotting scatter plots
Class 12: Dimension Reduction
- Model evaluation metrics, continued
- Dimension Reduction (Guest Lecturer: Sinan Ozdemir)
- Read Paul Graham's "A Plan for Spam" in preparation for Thursday's class on Naive Bayes
- Kevin has a video tutorial (14 minutes) and blog post summarizing ROC curves and AUC
- scikit-learn has extensive documentation on model evaluation
- On Cross Validated, this question has dozens of explanations of PCA, and this question has a useful visualization of what is essentially PCA
Class 13: Clustering Part 2, Naive Bayes
- Clustering Analysis (slides)
- Naive Bayes (slides)
- Open Python, type
import nltk, type
nltk.download(), find the "NLTK Downloader" popup window, click on "all", then click on "Download". Do this at home, since it's more than 300 MB! If you have space constraints on your computer, we can tell you next class exactly which packages to download.
- For clustering, scikit-learn has documentation on K-means clustering, alternative clustering algorithms, and clustering metrics
- Vipin Kumar from the University of Minnesota has a helpful chapter on clustering from his textbook: Introduction to Data Mining
- For an alternative introduction to Bayes' Theorem, Bayes' Rule for Ducks, Bayes' Rule in an animated gif, and this 5-minute video on conditional probability may be helpful
- For more details on Naive Bayes classification, Wikipedia has two useful articles: Naive Bayes classifier, Naive Bayes spam filtering
- If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004
Class 14: Natural Language Processing
- Naive Bayes, continued (code)
- Natural Language Processing (code)
- Natural Language Processing with Python: free online book to go in-depth with NLTK
- NLP online course: no sessions are available, but video lectures and slides are still accessible
- Brief slides on the major task areas of NLP
- Detailed slides on a lot of NLP terminology
- A visual survey of text visualization techniques: for exploration and inspiration
- DC Natural Language Processing: active Meetup group
- Stanford CoreNLP: suite of tools if you want to get serious about NLP
- Getting started with regex: Python introductory lesson and reference guide, real-time regex tester, in-depth tutorials
- We will use Graphviz to visualize the output of the classification trees. Please download this before class.
Class 15: Decision Trees
At the end of this class, you should be able to do the following:
- Describe the output of a decision tree to someone without a data science background
- Describe how the algorithm creates the decision tree
- Predict the likelihood of a binary event using the decision tree algorithm in scikit-learn
- Create a decision tree visualization
- Determine the optimal tree size using a tune grid and the AUC metric in Python
- Describe the strengths and weaknesses of a decision tree
- Work on your project. The first draft of your project is due on Tuesday at 5 pm.
- Dr. Justin Esarey from Rice University has a nice video lecture on CART that also includes a code walkthrough
- Here is a link to a static version, as well as a link to a dynamic version with collapsable nodes
- If this is something you are interested in, Gary Sieling wrote a nice function in python to take the output of a scikit-learn tree and convert into json format
- If you are intersted in learning d3.js, this a good tutorial for understanding the building blocks of a decision tree. Here is another tutorial focusing on building a tree diagram in d3.js.
- Chapter 8.1 of the Introduction to Statistical Learning also covers the basics of Classification and Regression Trees
Class 16: Recommenders
Class 17: Ensembling
- Leo Brieman's paper on Random Forests
- yhat has a brief primer on Random Forests that can provide a review of many of the topics we covered today.
- Here is a link to some Kaggle competitions that were won using Random Forests
- Ensemble models... tend to strongly outperform their component models on new data. Doesn't this violate “Occam’s razor”? In this paper entitled: The Generalization Paradox of Ensembles John Elder IV argues for a more refined understanding of model complexity.
Class 18: Ensembling Part 2, Python Companion Tools
- Review the Random Forest algorithm
- Discuss tuning the Random Forest Algorithm
- Give an overview of the AdaBoost algorithm
- Implement boosted trees in Python
- IPython Notebook
- Chapter 10 of the Elements of Statistical Learning covers Boosting. See page 339 for the algorithm presented in class.
- Dr. Justin Esary has a nice tutorial on Boosting. Watch from 32:00 – 59:00 for relevant material.
- Tutorial by Professor Rob Schapire of Princeston on the AdaBoost Algorithm
- IPython documentation in website form and notebook form: does not focus exclusively on the IPython Notebook
Class 19: Working a Data Problem
Class 20: Neural Networks
- Inspiration for Neural Networks
- Neural Networks
- Gradient Descent
- Michael Neilson has a free book called Neural Networks and Deep Learning that gives thorough introduction to Neural Networks
- Geoffrey Hinton, one of the pioneers of the deep learning movement, has an entire class on Coursera called Neural Networks for Machine Learning
- The Python wiki has a list of Python packages commonly used for Neural Networks
- Read this "classic" paper, which may help you to connect many of the topics we have studied throughout the course: A Few Useful Things to Know about Machine Learning
Class 21: Review
- Review of all of data science (slides)
- Comparing supervised learning algorithms
- Special guest: Laura Lorenz
- scikit-learn "machine learning map": Guide for choosing the optimal estimator
- Choosing a Machine Learning Classifier: Short and highly readable
- Machine Learning Done Wrong: Thoughtful advice on common mistakes to avoid in machine learning
- Practical machine learning tricks from the KDD 2011 best industry paper: More advanced advice than the resources above
- An Empirical Comparison of Supervised Learning Algorithms: Research paper from 2006
- Getting in Shape for the Sport of Data Science: 75-minute video of practical tips for machine learning (by the past president of Kaggle)
- Resources for continued learning
- Bonus content (below)
- Keep using Slack!
Class 22: Project Presentations
- Note: Guests are welcome! Invite your friends and family!
- Python reference guide
- Regular expressions ("regex"):
- Tidy data
- Overview of reproducibility and reproducible research
- Twitter definition of reproducibility
- How to share data with a statistician: practical guide for turning raw data into tidy data in a reproducible and documented manner
- Example of reproducible data processing: includes raw data, processed data, processing scripts, and documentation
- Colbert on reproducibility (8-minute video)
- Domino Data Lab