This course provides an introduction to applied data science including data preparation, data analysis, factor analysis, statistical inference, predictive modeling, and data visualization.
The goal of data science is to extract information from a data set and transform it into an understandable structure for further use.
The course emphasizes understanding the fundamentals, using scripting languages and interactive methods to teach course concepts. Problems and data sets are drawn from a broad range of disciplines of interest to students, faculty, and industry partners.
Lectures are augmented with hands-on tutorials using Jupyter Notebooks. Laboratory assignments will be completed using Python and related data science packages: NumPy, pandas, SciPy, statsmodels, scikit-learn, and Matplotlib.
2-2-3 (class hours/week, laboratory hours/week, credits)
Prerequisites: MA-262 Probability and Statistics, programming maturity, and the ability to program in Python.
ABET: Math/Science, Engineering Topics.
Outcomes:
- Understand the basic process of data science.
- The ability to identify, load, and prepare a data set for a given problem.
- The ability to analyze a data set including the ability to understand which data attributes (dimensions) affect the outcome.
- The ability to perform basic data analysis and statistical inference.
- The ability to perform supervised learning of prediction models.
- The ability to perform unsupervised learning.
- The ability to perform data visualization and report generation.
- The ability to assess the quality of predictions and inferences.
- The ability to apply methods to real world data sets.
Tools: Python and related packages for data analysis, machine learning, and visualization. Jupyter Notebooks.
References:
Python Data Science Handbook (PDSH), Jake VanderPlas, O'Reilly.
Data Science from Scratch (DSfS), Joel Grus, O'Reilly.
Mining of Massive Datasets (MMDS), Anand Rajaraman and Jeffrey David Ullman.
The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, and Jerome Friedman, Springer, 2009.
Data Science end-to-end:
- Reading: Python Data Science Handbook (PDSH) Ch. 1
- Reading optional: Data Science from Scratch (DSfS) Ch. 1
- Reference: git - the simple guide
- Using Jupyter Notebooks
- Python Programming for Data Science (submission required)
- Python Programming Style (optional)
- Dates and Time
- Python Objects, Map, Lambda, and List Comprehensions (submission required)
Note: Begin a walkthrough of the hands-on notebooks with students, then let them complete the submissions on their own.
Outcomes addressed in week 1:
- Understand the basic process of data science
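The map, lambda, and list comprehension topics listed for week 1 can be illustrated with a minimal sketch. This is not taken from the course notebooks; the word list is made-up sample data and the variable names are hypothetical.

```python
# Illustrative only: the word list below is made-up sample data.
words = ["data", "science", "end", "to", "end"]

# map + lambda: apply a function to every element.
# map() is lazy, so wrap it in list() to materialize the results.
lengths_map = list(map(lambda w: len(w), words))

# List comprehension: the same transformation, often considered more idiomatic.
lengths_comp = [len(w) for w in words]

# A comprehension can filter while it transforms.
long_words = [w.upper() for w in words if len(w) > 3]

# Dictionary comprehension: word -> length. Duplicate keys collapse to one entry.
length_by_word = {w: len(w) for w in words}

print(lengths_map)
print(long_words)
print(length_by_word)
```

Both forms produce the same list here; the comprehension is usually easier to read once a filter condition is involved.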
- Lecture with Hands-on Notebooks:
- Python NumPy (submission required)
- Python NumPy Aggregates
- Reading: PDSH Ch. 2
- Reading: DSfS Ch. 2
- Reading: PDSH Ch. 3
- Reading: DSfS Ch. 10
- NumPy Stack (submission required)
- Stanford Low Back Pain Data Analysis (submission required)
Outcomes addressed in week 2:
- Understand the basic process of data science
- The ability to identify, load, and prepare a data set for a given problem.
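As a rough illustration of the NumPy aggregates topic from week 2, the sketch below computes whole-array, per-column, and per-row aggregates. The array is hypothetical sample data, not one of the course data sets.

```python
import numpy as np

# Hypothetical sample data: 3 subjects (rows) x 4 measurements (columns).
scores = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [0.0, 1.0, 0.0, 1.0],
])

total = scores.sum()             # aggregate over every element
col_means = scores.mean(axis=0)  # axis=0 collapses rows: one mean per column
row_max = scores.max(axis=1)     # axis=1 collapses columns: one max per row

# Standardize each column (z-score), a common data preparation step.
# Broadcasting subtracts the per-column mean from every row.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)

print(total)
print(col_means)
print(row_max)
```

The `axis` argument names the dimension that is collapsed, which is the detail most often confused when first using aggregates.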
- Reading: PDSH Ch. 4
- Reading: DSfS Ch. 5, 6
- Visualization Tools (planned addition: state-of-the-art visualization tools such as Tableau and d3.js)
- Data Visualization
- EDA Visualization (submission required)
Outcomes addressed in week 3:
- Understand the basic process of data science
- The ability to identify, load, and prepare a data set for a given problem.
- The ability to perform data visualization and report generation.
- The ability to perform basic data analysis and statistical inference.
- Reading: PDSH Ch. 5 p. 331-375, 390-399
- Reading: An Introduction to Statistical Learning: with Applications in R (ISLR) Ch. 1, 2
- Linear Regression Notebook (use for second lecture)
- Linear Regression 2
- Gradient Descent notebook
- Reading: ISLR Ch. 3
- Reading: PDSH Ch. 5 p. 359-375
- Introduction to Machine Learning with scikit-learn
- Supervised Learning Linear Regression (submission required)
Outcomes addressed in week 4:
- The ability to analyze a data set including the ability to understand which data attributes (dimensions) affect the outcome.
- The ability to perform basic data analysis and statistical inference.
- The ability to perform supervised learning of prediction models.
- The ability to perform data visualization and report generation.
- The ability to apply methods to real world data sets.
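The linear regression and gradient descent topics from week 4 can be connected with a short sketch. It assumes a single feature and a mean squared error loss; the function name `fit_line_gd`, the learning rate, and the synthetic data are illustrative choices, not the contents of the course notebook.

```python
import numpy as np

def fit_line_gd(x, y, lr=0.05, n_iter=5000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        residual = (w * x + b) - y              # prediction error per sample
        grad_w = (2.0 / n) * np.dot(residual, x)  # d(MSE)/dw
        grad_b = (2.0 / n) * residual.sum()       # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic, noise-free data generated from y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = fit_line_gd(x, y)

# Closed-form least squares for comparison; polyfit returns [slope, intercept].
w_ls, b_ls = np.polyfit(x, y, deg=1)

print(w, b)
print(w_ls, b_ls)
```

On this small problem the iterative and closed-form fits agree closely. A learning rate that is too large for the scale of `x` makes the iteration diverge, which is one reason features are often normalized first.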
- Reading: ISLR Ch. 4.6.5
- Reading: ISLR Ch. 4
- Midterm review
- Supervised Learning - Logistic Regression (submission required)
Outcomes addressed in week 5:
- The ability to perform data visualization and report generation.
- The ability to assess the quality of predictions and inferences.
- The ability to apply methods to real world data sets.
- The ability to perform supervised learning of prediction models.
- Scikit-learn ROC Curve notebook
- Reading: PDSH Ch. 5 p. 331-375, 390-399
- Reading: ISLR Ch. 5
- Midterm
- Midterm Exam: Midterm review study guide
- Supervised Learning - Logistic Regression, continued (2-week lab, submission required)
Outcomes addressed in week 6:
- The ability to identify, load, and prepare a data set for a given problem.
- The ability to analyze a data set including the ability to understand which data attributes (dimensions) affect the outcome.
- The ability to perform supervised learning of prediction models.
- The ability to perform data visualization and report generation.
- The ability to assess the quality of predictions and inferences.
- The ability to apply methods to real world data sets.
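For the ROC curve topic in week 6, a minimal sketch using scikit-learn's `roc_curve` and `roc_auc_score` might look like the following. The labels and scores are a tiny made-up example, standing in for the predicted probabilities a classifier such as logistic regression would produce.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up example: true class labels and a classifier's predicted scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (false positive rate, true positive rate) point per decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve: the probability that a randomly chosen positive
# is scored higher than a randomly chosen negative.
auc = roc_auc_score(y_true, y_score)

print(fpr)
print(tpr)
print(auc)  # 3 of the 4 positive/negative pairs are ranked correctly: 0.75
```

Plotting `fpr` against `tpr` with Matplotlib gives the familiar curve; the diagonal corresponds to random guessing.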
- Reading: PDSH Ch. 5 p. 462-475
- Reading: ISLR Ch. 10.5.2
- SciPy Hierarchical Clustering and Dendrogram
- DBSCAN and clustering comparisons
- Reading: ISLR Ch. 10.1, 10.3, 10.5.1
- K-Means Clustering (submission required)
- Data Preprocessing and Normalization
Outcomes addressed in week 7:
- The ability to identify, load, and prepare a data set for a given problem.
- The ability to perform unsupervised learning.
- The ability to perform data visualization and report generation.
- The ability to apply methods to real world data sets.
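The clustering and normalization topics above fit together naturally, since k-means relies on Euclidean distance and is sensitive to feature scale. The sketch below is an illustration under assumed data: the six points are hypothetical, and the parameter choices (`n_clusters=2`, `n_init=10`, `random_state=0`) are for reproducibility only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales, two obvious groups.
X = np.array([
    [1.0, 1000.0], [1.2, 1100.0], [0.8, 900.0],
    [5.0, 5000.0], [5.2, 5100.0], [4.8, 4900.0],
])

# Standardize each feature to zero mean and unit variance so that neither
# feature dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means and get one cluster label per row.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X_scaled)

print(labels)
print(km.cluster_centers_)
```

The numeric label assigned to each group (0 or 1) is arbitrary; only the grouping itself is meaningful.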
- Reading: PDSH Ch. 5 p. 421-432
- Reading: ISLR Ch. 8.1
- Reading: PDSH Ch. 5 p. 421-432
- Reading: ISLR Ch. 8.2
- Introduce Data Science Project (submission required)
- Decision Trees (optional)
- Random Forests (optional)
Outcomes addressed in week 8:
- The ability to identify, load, and prepare a data set for a given problem.
- The ability to analyze a data set including the ability to understand which data attributes (dimensions) affect the outcome.
- The ability to perform supervised learning of prediction models.
- The ability to perform data visualization and report generation.
- The ability to assess the quality of predictions and inferences.
- The ability to apply methods to real world data sets.
- Gradient Boosting, XGBoost
- Time Series
Notebooks:
- Worked example: air passenger prediction
- Time Series Load and Explore Data
Outcomes addressed in week 10:
- Understand the basic process of data science
- The ability to identify, load, and prepare a data set for a given problem.
- The ability to analyze a data set including the ability to understand which data attributes (dimensions) affect the outcome.
- The ability to perform basic data analysis and statistical inference.
- The ability to perform supervised learning of prediction models.
- The ability to perform unsupervised learning.
- The ability to perform data visualization and report generation.
- The ability to assess the quality of predictions and inferences.
- The ability to apply methods to real world data sets.
- Dimensionality Reduction
- Reading: MMDS Ch. 9
- Final Exam: Thursday, 8am-9am, S-107
- Work on Data Science Project
Outcomes addressed in week 9:
- The ability to perform unsupervised learning.
- The ability to perform data visualization and report generation.
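As one possible illustration of dimensionality reduction, the sketch below applies principal component analysis (PCA) with scikit-learn. PCA is an assumed choice of technique here, since the outline does not name a specific method, and the data are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 points in 3-D that lie close to a single line.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2.0 * t + 0.05 * rng.normal(size=200),  # strongly correlated with column 0
    0.05 * rng.normal(size=200),            # small independent noise
])

# Project onto the two directions of greatest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fraction of the total variance captured by each retained component.
ratio = pca.explained_variance_ratio_

print(X_2d.shape)
print(ratio)
```

Because the points lie almost on a line, the first component should capture nearly all of the variance, which is the signal that fewer dimensions suffice.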