IS 606 - Statistics and Probability for Data Analytics - Spring 2016
Instructor: Jason Bryer, Ph.D.
Class Meetup: Thursday 7:00pm to 8:00pm
Office Hours: By appointment
This course covers basic techniques in probability and statistics that are important in the field of data analytics. Discrete probability models, sampling from infinite and finite populations, statistical distributions, basic Bayesian statistics, and non-parametric statistical techniques for categorical data are covered in this course. Each of these statistical concepts will be applied in a variety of real-world scenarios through the use of case studies and customized data sets.
Course Learning Outcomes:
By the end of the course, students should be able to:
- Understand the foundations of probability theory and perform basic probability calculations.
- Build basic stochastic models for commonly encountered business problems.
- Model situations involving uncertainty using appropriate probability distributions and conditional techniques.
- Explore and summarize data using descriptive statistics.
- Test hypotheses using classical and modern computational techniques.
- Construct estimators and calculate intervals using classical and modern computational techniques.
- Perform basic Bayesian statistical techniques for estimation and testing hypotheses.
Program Learning Outcomes addressed by the course:
- Business Understanding. Learn when probabilistic techniques apply to certain categories of business problems, discuss the sorts of solutions that are possible, and understand the limitations of these techniques.
- Foundational Math Skills. Explore and analyze data, build probabilistic and statistical models, construct estimators, and test hypotheses.
- Predictive Modeling. Learn foundational techniques that underlie predictive modeling algorithms, such as Naïve Bayes.
- Presentation. Complete and submit collaborative assignments using techniques from the course.
How is this course relevant for data analytics professionals?
Probabilistic techniques are the foundation of many data science applications from data exploration and visualization to outlier analysis, stochastic modelling, and data mining algorithms. This course will ensure that students have a strong understanding of these foundations.
- Homework (16%)
- Labs (40%)
- Data Project (20%)
- Final exam (18%)
- Meetup Presentation (5%)
- Getting Acquainted (1%)
| Quality of Performance | Letter Grade | Range % | GPA |
|---|---|---|---|
| Excellent - work is of exceptional quality | A | 93 - 100 | 4 |
| Excellent | A- | 90 - 92.9 | 3.7 |
| Good - work is above average | B+ | 87 - 89.9 | 3.3 |
| Satisfactory | B | 83 - 86.9 | 3 |
| Below Average | B- | 80 - 82.9 | 2.7 |
| Poor | C+ | 77 - 79.9 | 2.3 |
| Poor | C | 70 - 76.9 | 2 |
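As a worked example of how the assessment weights and the grade ranges above combine, here is a minimal sketch in R. The component scores are made up for illustration, and treating the lower end of each range as a cutoff is an assumption, not a stated course policy. The table lists no grade below C, so the sketch returns `NA` for percentages under 70.

```r
# Assessment weights from the syllabus (they sum to 100%).
weights <- c(homework = 16, labs = 40, project = 20,
             final_exam = 18, presentation = 5, acquainted = 1) / 100

# Hypothetical component scores on a 0-100 scale (invented for illustration).
scores <- c(homework = 92, labs = 95, project = 88,
            final_exam = 84, presentation = 100, acquainted = 100)

# Weighted course percentage.
course_pct <- sum(weights * scores[names(weights)])

# Lower bound of each letter-grade range in the table (assumed to act as cutoffs).
cutoffs <- c("C" = 70, "C+" = 77, "B-" = 80, "B" = 83,
             "B+" = 87, "A-" = 90, "A" = 93)

# findInterval() counts how many cutoffs the percentage meets or exceeds.
idx <- findInterval(course_pct, cutoffs)
letter <- if (idx == 0) NA_character_ else names(cutoffs)[idx]

print(course_pct)
print(letter)
```

With these invented scores the weighted percentage is 91.44, which falls in the A- range.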
How This Course Works:
This course is conducted entirely online. Each week, you will have various resources made available, including weekly readings from the textbooks and occasionally additional readings provided by the instructor. Most weeks will have homework assignments to be submitted. A meetup presentation and an introductory forum post are also required. You are expected to complete all assignments by their due dates.
For the meetup presentation, you will solve and present to the class one of the suggested practice problems from the weekly materials (not a graded homework problem). Each student must present one problem during the semester. Choose a problem by entering your name and the problem in the Google Spreadsheet. Note that there is a maximum of three presentations per meetup, and each presentation should be no more than five minutes. Prepare your presentation so that the slides or document (I suggest using R Markdown) can be shared on the course website. Problems are assigned first come, first served, so any problem not already chosen by another student is available.
Further details on each of these assignments will be available in Blackboard and/or this GitHub repository.
NOTE: Tentative. Subject to change
| Start | End | Chapter | Topic |
|---|---|---|---|
| Jan-29 | Feb-7 | 1 | Intro to Data |
| Feb-29 | Mar-13 | 4 | Foundation for Inference |
| Mar-14 | Apr-3 | 5 | Inference for Numerical Data |
| Mar-14 | Apr-3 | 6 | Inference for Categorical Data |
| Apr-18 | May-1 | 8 | Multiple & Logistic Regression |
| May-2 | May-15 | Navarro | Introduction to Bayesian Analysis |
| May-17 | May-20 | | Final Exam (due by 5pm on May 20, 2016) |
There will be weekly meetups. You are encouraged to attend as many as you can, but recordings will generally be available within a few days of the meetup.
| Introduction to the course ([Video](https://youtu.be/w6dJ0NQlcX4), [Slides](https://htmlpreview.github.io/?https://github.com/jbryer/IS606Spring2016/blob/master/Slides/2016-01-29-Intro_to_Course.html)) |
| Thursday, Feb 4, 7:00 pm | Introduction to data (Video, Slides) |
| Thursday, Feb 11, 7:00 pm | Probability (Video, Slides) |
| Thursday, Feb 18, 7:00 pm | Distributions Part I (Video, Slides) |
| Thursday, Feb 25, 7:00 pm | Distributions Part II (Video, Slides) |
| Thursday, Mar 3, 7:00 pm | Foundation for Inference (Video, Slides) |
| March 10 | NO MEETUP |
| Thursday, Mar 17, 7:00 pm | Inference for Numerical Data (Video, Slides) |
| Thursday, Mar 24, 7:00 pm | Inference for Categorical Data (Video, Slides) |
| Thursday, Mar 31, 7:00 pm | Presentations and Data Project |
| Thursday, Apr 7, 7:00 pm | Linear Regression (Video, Slides) |
| Thursday, Apr 14, 7:00 pm | Linear Regression 2 (Video, Slides) |
| Thursday, Apr 21, 7:00 pm | Multiple & Logistic Regression (Video, Slides) |
| April 28 | NO MEETUP |
| Thursday, May 5, 7:00 pm | Intro to Bayesian Analysis (Video, Slides) |
| Thursday, May 12, 7:00 pm | Conclusions (Video, Slides) |
Diez, D.M., Barr, C.D., & Çetinkaya-Rundel, M. (2015). OpenIntro Statistics (3rd Ed).
Navarro, D. (2015, version 0.5). Learning Statistics with R
This is a free textbook that supplements a lot of the material covered in Diez and Barr. We will use the chapter on Bayesian analysis. You can download a PDF version, or buy a print copy from Lulu through the author's website.
Kabacoff, R.I. (2011). R in Action. Manning Publications.
Wickham, H. Advanced R. Boca Raton, FL: Taylor & Francis Group.
Kruschke, J.K. (2014). Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan (2nd Ed). London: Academic Press.
This book can be purchased from Amazon, but also check out the author's website (doingbayesiandataanalysis.blogspot.com/) for additional resources.
The solutions to the practice problems are at the end of the book and do not need to be handed in. Graded assignments should be typed (preferably using R Markdown) or neatly handwritten and scanned. Data for the homework assignments and for the chapter examples can be downloaded here.
- Chapter 1 - Introduction to Data (due 2/8/2016)
- Practice: 1.7 (available in R using the `data(iris)` command), 1.9, 1.23, 1.33, 1.55, 1.69
- Graded: 1.8, 1.10, 1.28, 1.36, 1.48, 1.50, 1.56, 1.70
- Chapter 2 - Probability (due 2/15/2016)
- Practice: 2.5, 2.7, 2.19, 2.29, 2.43
- Graded: 2.6, 2.8, 2.20, 2.30, 2.38, 2.44
- Chapter 3 - Distributions of Random Variables (due 2/29/2016)
- Practice: 3.1 (see `normalPlot`), 3.3, 3.17 (use `qqnormsim` from lab 3), 3.21, 3.37, 3.41
- Graded: 3.2 (see `normalPlot`), 3.4, 3.18 (use `qqnormsim` from lab 3), 3.22, 3.38, 3.42
- Chapter 4 - Foundations for Inference (due 3/14/2016)
- Practice: 4.3, 4.13, 4.23, 4.25, 4.39, 4.47
- Graded: 4.4, 4.14, 4.24, 4.26, 4.34, 4.40, 4.48
- Chapter 5 - Inference for Numerical Data (due 4/4/2016)
- Practice: 5.5, 5.13, 5.19, 5.31, 5.45
- Graded: 5.6, 5.14, 5.20, 5.32, 5.48
- Chapter 6 - Inference for Categorical Data (due 4/4/2016)
- Practice: 6.5, 6.11, 6.27, 6.43, 6.47
- Graded: 6.6, 6.12, 6.20, 6.28, 6.44, 6.48
- Chapter 7 - Introduction to Linear Regression (due 4/18/2016)
- Practice: 7.23, 7.25, 7.29, 7.39
- Graded: 7.24, 7.26, 7.30, 7.40
- Chapter 8 - Multiple and Logistic Regression (due 5/2/2016)
- Practice: 8.1, 8.3, 8.7, 8.15, 8.17
- Graded: 8.2, 8.4, 8.8, 8.16, 8.18
- Navarro Chapter 17 - Bayesian Analysis (due 5/16/2016)
- Graded: TBD
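Practice problem 1.7 above points to the `data(iris)` command. As a minimal sketch of what that looks like in an R session, the following loads the built-in data set and takes a first look at it. The particular inspection calls are only a suggestion, not part of the assignment.

```r
# Load the iris data frame that ships with base R (datasets package).
data(iris)

# Structure: variable names, types, and the first few values.
str(iris)

# Number of rows (observations) and columns (variables).
print(dim(iris))

# Count of observations in each species.
print(table(iris$Species))

# Five-number summary plus mean for each numeric variable.
print(summary(iris))
```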
These mini projects will have you explore statistical topics using R. For each project, create an R Markdown file. Name your file using the following format:
LastName-X.Rmd where X is 0 to 8 for the project number.
- Introduction to R and RStudio (Template)
- Introduction to Data (Template)
- Probability (Template)
- Distributions of Random Variables (Template)
- Foundations for Statistical Inference
- Inference for Numerical Data (Template)
- Inference for Categorical Data (Template)
- Introduction to Linear Regression (Template)
- Multiple Linear Regression (Template)
The purpose of the data project is for you to conduct reproducible research using open access data. The final project will include an R Markdown file with all required data files so that anyone else can run your analysis. Your project will be made available to other students on this website. The proposal will be graded on a pass/fail basis. More details on the format of the project including templates are on this page: https://github.com/jbryer/IS606Spring2016/blob/master/Project/IS606_Data_Project.md
- Proposal due March 7, 2016
- Final Project due May 16, 2016
- R - Windows or Mac
- RStudio - Download Windows or Mac version from here
- LaTeX - Windows use MiKTeX or Mac use MacTeX (it is best to use Safari to download this file as Chrome or Firefox will often fail)
If using Windows, you also need to download and install these:
Once everything is installed, execute the following command in RStudio to install the packages we will use for this class (you can copy-and-paste):
    install.packages(c('openintro','OIdata','devtools','ggplot2','psych','reshape2',
                       'knitr','markdown','shiny'))
    devtools::install_github("jbryer/IS606")
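To confirm that the installation worked, a quick check along these lines may help. This is a sketch rather than an official course script: it only tests whether each package named in the command above can be found, and the variable names are illustrative.

```r
# Packages named in the install command above (CRAN packages plus IS606 from GitHub).
pkgs <- c("openintro", "OIdata", "devtools", "ggplot2", "psych",
          "reshape2", "knitr", "markdown", "shiny", "IS606")

# requireNamespace() returns TRUE if a package can be loaded, without attaching it.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed (empty if all are present).
missing_pkgs <- names(installed)[!installed]
print(missing_pkgs)
```

If `missing_pkgs` is not empty, rerun the corresponding install command for those packages.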
IS606 R Package
Many of the course resources are available in the IS606 R package. Here are some commands to get started:
    library('IS606')           # Load the package
    vignette(package='IS606')  # Lists vignettes in the IS606 package
    vignette('os3')            # Loads a PDF of the OpenIntro Statistics book
    data(package='IS606')      # Lists data available in the package
    getLabs()                  # Returns a list of the available labs
    viewLab('Lab0')            # Opens Lab0 in the default web browser
    startLab('Lab0')           # Starts Lab0 (copies to getwd()), opens the Rmd file
    shiny_demo()               # Lists available Shiny apps
- Quick-R. Kabacoff's website. Great reference along with his book, R in Action.
- O'Reilly Try R. Great tutorial on R where you can try R commands directly from the web browser.
- R Reference Card
- Video Overview of RStudio
- Journal of Statistical Software
- The R Journal
- An Introduction to Statistical Learning with Applications in R
Learning R Markdown
- Video on RMarkdown by RStudio - This 26 minute video talks about some updates to RMarkdown.
- Markdown Basics. Markdown is a way of formatting plain text documents, mostly for the web. However, it has become popular for other writing tasks too because it focuses on writing rather than formatting; the formatting is taken care of later. The Markdown Basics page provides a nice introduction to Markdown.
- The R Markdown Website has a nice introduction on how Markdown is extended to allow for the inclusion of R code and output.
- Video Introduction to R Markdown. This short video (under 4 minutes) was recorded with an older version, so not all of the features and dialog boxes will look the same, but it may still be helpful.
Creating Math Equations
Office Hours (cell phone or using GoToMeeting): TBD and also by appointment throughout the week. You’re encouraged to schedule an appointment, but you can try to call anytime.
You are encouraged to ask us questions on the "Ask Your Instructor" forum on the course discussion board, where other students will be able to benefit from your inquiries.
For the most part, you can expect me to respond to questions by email within 24 to 48 hours. If you do not hear back from me within 48 hours of sending an email, please resend your message.
I will be checking in on the course regularly, just about every day and likely several times each day. Please do not hesitate to ask if you have questions or concerns.
Accessibility and Accommodations
The CUNY School of Professional Studies is firmly committed to making higher education accessible to students with disabilities by removing architectural barriers and providing programs and support services necessary for them to benefit from the instruction and resources of the University. Early planning is essential for many of the resources and accommodations provided. Please see: http://sps.cuny.edu/student_services/disabilityservices.html
Online Etiquette and Anti-Harassment Policy
The University strictly prohibits the use of University online resources or facilities, including Blackboard, for the purpose of harassment of any individual or for the posting of any material that is scandalous, libelous, offensive or otherwise against the University’s policies. Please see: http://media.sps.cuny.edu/filestore/8/4/9_d018dae29d76f89/849_3c7d075b32c268e.pdf
Academic dishonesty is unacceptable and will not be tolerated. Cheating, forgery, plagiarism and collusion in dishonest acts undermine the educational mission of the City University of New York and the students' personal and intellectual growth. Please see: http://media.sps.cuny.edu/filestore/8/3/9_dea303d5822ab91/839_1753cee9c9d90e9.pdf
Student Support Services
If you need any additional help, please visit Student Support Services: http://sps.cuny.edu/student_resources/