jayurbain/DataScienceIntro

Introduction to Data Science

This course provides an introduction to applied data science including data preparation, data analysis, factor analysis, statistical inference, predictive modeling, and data visualization.

The goal of data science is to extract information from a data set and transform it into an understandable structure for further use.

The course emphasizes understanding the fundamentals, using scripting languages and interactive methods to teach course concepts. Problems and data sets are selected from a broad range of disciplines of interest to students, faculty, and industry partners.

Lectures are augmented with hands-on tutorials using Jupyter Notebooks. Laboratory assignments will be completed using Python and related data science packages: NumPy, pandas, SciPy, statsmodels, scikit-learn, and Matplotlib.

2-2-3 (class hours/week, laboratory hours/week, credits)

Prerequisites: MA-262 Probability and Statistics, programming maturity, and the ability to program in Python.

ABET: Math/Science, Engineering Topics.

Outcomes:

  • Understand the basic process of data science.
  • The ability to identify, load, and prepare a data set for a given problem.
  • The ability to analyze a data set including the ability to understand which data attributes (dimensions) affect the outcome.
  • The ability to perform basic data analysis and statistical inference.
  • The ability to perform supervised learning of prediction models.
  • The ability to perform unsupervised learning.
  • The ability to perform data visualization and report generation.
  • The ability to assess the quality of predictions and inferences.
  • The ability to apply methods to real world data sets.

Tools: Python and related packages for data analysis, machine learning, and visualization. Jupyter Notebooks.

References:

An Introduction to Statistical Learning: with Applications in R. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. 2015 Edition, Springer.

Python Data Science Handbook. Jake VanderPlas. O'Reilly.

Data Science from Scratch. Joel Grus. O'Reilly.

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Aurélien Géron. O'Reilly Media.

Mining of Massive Datasets. Anand Rajaraman and Jeffrey David Ullman.

The Elements of Statistical Learning. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Springer, 2009.


Week 1: Intro to Data Science, data science programming in Python

Lecture:

  1. Introduction to Data Science
  2. Data Science end-to-end
  3. Python for Data Science
  • Reading: Python Data Science Handbook (PDSH) Ch. 1
  • Optional reading: Data Science from Scratch (DSfS) Ch. 1
  4. Introduction to Git and GitHub

Lab Notebooks:

Note: Begin the walkthrough of the hands-on notebooks with students, then let them complete submissions on their own.

Outcomes addressed in week 1:

  • Understand the basic process of data science

Week 2: NumPy Stack, Exploratory Data Analysis

Lecture:

  1. Lecture with Hands-on Notebooks:
  2. Exploratory Data Analysis, pandas DataFrame
  • Reading: PDSH Ch. 3
  • Reading: DSfS Ch. 10

Hands-on Notebooks:

Lab Notebooks:

Outcomes addressed in week 2:

  • Understand the basic process of data science
  • The ability to identify, load, and prepare a data set for a given problem.

Week 3: Probability and Statistical Inference, Visualization

Lecture:

  1. Probability, Stats, and Visualization
  • Reading: PDSH Ch. 4
  • Reading: DSfS Ch. 5, 6
  2. Visualization Tools (to be added: state-of-the-art visualization tools such as Tableau and d3.js)

Lab Notebooks:

Outcomes addressed in week 3:

  • Understand the basic process of data science
  • The ability to identify, load, and prepare a data set for a given problem.
  • The ability to perform data visualization and report generation.
  • The ability to perform basic data analysis and statistical inference.

Week 4: Linear Regression, Multivariate Regression

Lecture:

  1. Linear Regression 1
  • Reading: PDSH Ch. 5 p. 331-375, 390-399
  • Reading: An Introduction to Statistical Learning: with Applications in R (ISLR) Ch. 1, 2
  2. Linear Regression Notebook (use for second lecture)
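
As a rough illustration of the simple linear regression topic above, the following sketch fits a one-variable least-squares line using only the Python standard library, in the from-scratch style of DSfS. It is not taken from the course notebooks; the function names and the toy data (`hours`, `scores`) are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Fraction of the variance in ys explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical toy data: hours studied vs. exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]

intercept, slope = fit_simple_linear_regression(hours, scores)
r2 = r_squared(hours, scores, intercept, slope)
print(f"score ~ {intercept:.2f} + {slope:.2f} * hours (R^2 = {r2:.3f})")
```

In the labs, the same fit would more likely be done with statsmodels or scikit-learn; the hand-written version is only meant to show what those libraries compute in the one-variable case.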

Lab Notebooks:

Outcomes addressed in week 4:

  • The ability to analyze a data set including the ability to understand which data attributes (dimensions) affect the outcome.
  • The ability to perform basic data analysis and statistical inference.
  • The ability to perform supervised learning of prediction models.
  • The ability to perform data visualization and report generation.
  • The ability to apply methods to real world data sets.

Week 5: Introduction to Machine Learning, KNN, Model Evaluation and Metrics, Logistic Regression

Lecture:

  1. Introduction to Machine Learning with KNN
  • Reading: ISLR Ch. 4.6.5
  2. Logistic Regression Classification
  • Reading: ISLR Ch. 4
  3. Midterm review

Lab Notebooks:

Outcomes addressed in week 5:

  • The ability to perform data visualization and report generation.
  • The ability to assess the quality of predictions and inferences.
  • The ability to apply methods to real world data sets.
  • The ability to perform supervised learning of prediction models.

Week 6: Midterm

Lecture:

  1. Model Evaluation and Metrics
  2. Midterm
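
As a hedged illustration of the model evaluation metrics topic listed above, the sketch below computes accuracy, precision, recall, and F1 for a binary classifier from scratch. The function names and the label vectors are hypothetical, not drawn from the course materials, and the choice to return 0.0 when a denominator is zero is an assumption made for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 as a dict."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    # Assumption: report 0.0 when a ratio is undefined (zero denominator).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can mislead on imbalanced data, which is why precision and recall are reported alongside it here.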

Lab Notebooks:

Outcomes addressed in week 6:

  • The ability to identify, load, and prepare a data set for a given problem.
  • The ability to analyze a data set including the ability to understand which data attributes (dimensions) affect the outcome.
  • The ability to perform supervised learning of prediction models.
  • The ability to perform data visualization and report generation.
  • The ability to assess the quality of predictions and inferences.
  • The ability to apply methods to real world data sets.

Week 7: Unsupervised learning, clustering

Lecture:

  1. Clustering - K-Means
  • Reading: PDSH Ch. 5 p. 462-475
  • Reading: ISLR Ch. 10.5.2
  2. Clustering - Hierarchical, Probabilistic

Lab Notebooks:

K-Means Clustering (submission required)
Data Preprocessing and Normalization
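
The two lab topics above, normalization and K-Means, fit together naturally: K-Means relies on distances, so features on very different scales are usually normalized first. The following is a minimal from-scratch sketch, not the lab solution. The function names and toy data are hypothetical, and seeding the centroids with the first k points is a simplification assumed here to keep the example deterministic (library implementations typically use randomized or k-means++ initialization).

```python
from statistics import mean, pstdev


def zscore_normalize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100):
    """Cluster points into k groups; return (labels, centroids)."""
    # Simplification (assumption): seed centroids with the first k points.
    centroids = [points[i] for i in range(k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return labels, centroids


# Hypothetical toy data: two features on very different scales.
raw = [(1.0, 1000.0), (5.0, 5000.0), (1.2, 1100.0),
       (5.2, 5200.0), (0.9, 900.0), (4.8, 4900.0)]

scaled = zscore_normalize(raw)
labels, centroids = k_means(scaled, k=2)
print(labels)
```

Without the normalization step, the second feature would dominate every distance simply because its values are about a thousand times larger.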

Outcomes addressed in week 7:

  • The ability to identify, load, and prepare a data set for a given problem.
  • The ability to perform unsupervised learning.
  • The ability to perform data visualization and report generation.
  • The ability to apply methods to real world data sets.

Week 8: Decision Trees

Lecture:

  1. Decision Trees
  • Reading: PDSH Ch. 5 p. 421-432
  • Reading: ISLR Ch. 8.1
  2. Bagging, Random Forests, Boosting
  • Reading: PDSH Ch. 5 p. 421-432
  • Reading: ISLR Ch. 8.2

Lab Notebooks:

Outcomes addressed in week 8:

  • The ability to identify, load, and prepare a data set for a given problem.
  • The ability to analyze a data set including the ability to understand which data attributes (dimensions) affect the outcome.
  • The ability to perform supervised learning of prediction models.
  • The ability to perform data visualization and report generation.
  • The ability to assess the quality of predictions and inferences.
  • The ability to apply methods to real world data sets.

Week 9: XGBoost, Time Series

Lecture:

  1. Gradient Boosting, XGBoost
    Notebooks:
  2. Time Series
    Notebooks:
    Worked example: air passenger prediction
    Time Series Load and Explore Data

Outcomes addressed in week 9:

  • Understand the basic process of data science
  • The ability to identify, load, and prepare a data set for a given problem.
  • The ability to analyze a data set including the ability to understand which data attributes (dimensions) affect the outcome.
  • The ability to perform basic data analysis and statistical inference.
  • The ability to perform supervised learning of prediction models.
  • The ability to perform unsupervised learning.
  • The ability to perform data visualization and report generation.
  • The ability to assess the quality of predictions and inferences.
  • The ability to apply methods to real world data sets.

Week 10: Dimensionality reduction

Lecture:

  1. Dimensionality Reduction
    Notebooks:
  2. Final Exam - Thursday 8am-9am S-107

Lab Notebooks:

Work on Data Science Project

Optional:

Outcomes addressed in week 10:

  • The ability to perform unsupervised learning.
  • The ability to perform data visualization and report generation.

Week 11: Final Project Presentations

Monday, 8:00 AM - 10:00 AM, S107
