# jgscott/STA380

STA 380: Predictive Modeling

# STA 380: Predictive Modeling

Welcome to part 2 of STA 380, a course on predictive modeling in the MS program in Business Analytics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.

## Office hours

On Tuesday-Thursday, August 8-10 and August 15-17, I will hold office hours from 9-10 AM in CBA 6.478.

## Exercises

The first set of exercises is available here.

The second set of exercises is available here.

## Topics

### (0) The data scientist's toolbox

Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.

### (1) Exploratory analysis

Contingency tables; basic plots (scatterplot, boxplot, histogram); lattice plots; basic measures of association (relative risk, odds ratio, correlation, rank correlation).
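As a rough illustration of the measures of association listed above, here is a minimal base-R sketch. The 2x2 table uses made-up counts, and the correlation example uses R's built-in `mtcars` data; neither comes from the course materials.

```r
# Hypothetical 2x2 contingency table (made-up counts, for illustration only)
tab <- matrix(c(30, 70,
                10, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("exposed", "unexposed"),
                              outcome = c("event", "no_event")))

# Risk of the event within each group (row proportions)
risk <- tab[, "event"] / rowSums(tab)

# Relative risk: ratio of the two risks
relative_risk <- unname(risk["exposed"] / risk["unexposed"])

# Odds ratio: ratio of the odds of the event in each group
odds <- tab[, "event"] / tab[, "no_event"]
odds_ratio <- unname(odds["exposed"] / odds["unexposed"])

# Correlation and rank correlation on a built-in data set
pearson  <- cor(mtcars$mpg, mtcars$wt)
spearman <- cor(mtcars$mpg, mtcars$wt, method = "spearman")
```

Here the relative risk compares the event probabilities directly, while the odds ratio compares the odds; the two are close only when the event is rare in both groups.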

Scripts and data:

• excerpts from my course notes on statistical modeling
• NIST Handbook, Chapter 1.
• R walkthroughs on basic EDA: see the three walkthroughs in the Exploratory Data Analysis section here.
• Good graphics: scan through some of the New York Times' best data visualizations. There is lots of good stuff here, but for our purposes the best things to look at are those in the "Data Visualizations" section, about 60% of the way down the page. Control-F for "Data Visualization" and you'll find it.

### (2) Foundations of probability

Basic probability, and some fun examples. Joint, marginal, and conditional probability. Law of total probability. Bayes' rule. Independence. These are videos on UT Box. You will need to sign up for UT Box with your UT e-mail account in order to access them.
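To make the law of total probability and Bayes' rule concrete, here is a small worked sketch in R. The prevalence and test-accuracy numbers are assumptions chosen for illustration, not figures from the course videos.

```r
# Hypothetical inputs (assumed for illustration)
prior       <- 0.01   # P(D): prevalence of the condition
sensitivity <- 0.95   # P(+ | D)
false_pos   <- 0.05   # P(+ | not D)

# Law of total probability: P(+) = P(+|D) P(D) + P(+|not D) P(not D)
p_positive <- sensitivity * prior + false_pos * (1 - prior)

# Bayes' rule: P(D | +) = P(+|D) P(D) / P(+)
posterior <- sensitivity * prior / p_positive
```

With these assumed inputs, most positive results are false positives, because the condition is rare relative to the false-positive rate.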

Some optional stuff:

### (3) Resampling methods

The bootstrap and the permutation test; joint distributions; using the bootstrap to approximate value at risk (VaR).
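The two resampling ideas named above can be sketched in a few lines of base R. This is only an illustrative sketch on the built-in `mtcars` data; the number of resamples (`n_boot`) and the seed are arbitrary choices, and the course scripts may do this differently.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

x <- mtcars$mpg
n_boot <- 2000  # number of resamples (arbitrary choice)

# Bootstrap: resample with replacement, recompute the statistic each time
boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
boot_se <- sd(boot_means)                          # bootstrap standard error
boot_ci <- quantile(boot_means, c(0.025, 0.975))   # percentile interval

# Permutation test: difference in mean mpg between manual (am = 1)
# and automatic (am = 0) cars
am <- mtcars$am
obs_diff <- mean(x[am == 1]) - mean(x[am == 0])
perm_diffs <- replicate(n_boot, {
  shuffled <- sample(am)  # shuffle labels, breaking any link with mpg
  mean(x[shuffled == 1]) - mean(x[shuffled == 0])
})

# Two-sided p-value: share of shuffled differences at least as extreme
p_value <- mean(abs(perm_diffs) >= abs(obs_diff))
```

The bootstrap resamples with replacement to approximate sampling variability, whereas the permutation test shuffles labels without replacement to simulate the null hypothesis of no association.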

Scripts:

If time:

• ISL Section 5.2 for a basic overview.
• These notes on bootstrapping and the permutation test.
• Section 2 of these notes, on bootstrap resampling. You can ignore the stuff about utility if you want.
• This R walkthrough on using the bootstrap to estimate the variability of a sample mean.
• Any basic explanation of the concept of value at risk (VaR) for a financial portfolio, e.g. here, here, or here.

Shalizi (Chapter 6) also has a much lengthier treatment of the bootstrap, should you wish to consult it.
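For the value-at-risk application mentioned in this topic, one simple approach is to bootstrap daily returns and look at a lower quantile of the simulated profit. The sketch below assumes simulated returns, a hypothetical starting wealth, and a 20-day horizon, and it resamples days independently, which ignores any serial dependence in real returns.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated daily returns standing in for real data (an assumption of this sketch)
returns <- rnorm(500, mean = 0.0005, sd = 0.01)

initial_wealth <- 10000  # hypothetical portfolio value
horizon <- 20            # trading days in the holding period (assumed)
n_sim <- 2000            # number of bootstrap paths (arbitrary choice)

# Each simulated path resamples `horizon` days of returns with replacement
final_wealth <- replicate(n_sim, {
  path <- sample(returns, horizon, replace = TRUE)
  initial_wealth * prod(1 + path)
})

# 5% value at risk: the loss exceeded in only 5% of simulated paths
profit <- final_wealth - initial_wealth
var_5 <- -unname(quantile(profit, 0.05))
```

For a multi-asset portfolio, the usual refinement is to resample whole days (rows of a returns matrix) so that the joint distribution across assets is preserved.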

### (4) Clustering

Basics of clustering; K-means clustering; hierarchical clustering.
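A minimal sketch of both clustering methods, using base R's `kmeans` and `hclust` on the built-in `iris` measurements. The choice of data set, of three clusters, and of average linkage are assumptions made for illustration.

```r
set.seed(1)  # K-means uses random starts, so fix a seed

# Standardize the four numeric columns so no single variable dominates distances
X <- scale(iris[, 1:4])

# K-means with 3 clusters and several random restarts
km <- kmeans(X, centers = 3, nstart = 25)

# Compare cluster assignments with the known species labels
table(km$cluster, iris$Species)

# Hierarchical clustering on Euclidean distances, average linkage
hc <- hclust(dist(X), method = "average")

# Cut the tree to obtain 3 clusters
hc_clusters <- cutree(hc, k = 3)
```

Calling `plot(hc)` draws the dendrogram, which is often the quickest way to judge where to cut the tree.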

Scripts and data:

### (5) Latent features and structure

Principal component analysis (PCA).
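As a brief illustration, PCA can be run in base R with `prcomp`. The sketch below uses the built-in `USArrests` data, which is an assumption made for the example and not necessarily what the course scripts use.

```r
X <- USArrests

# Center and scale each variable, then compute principal components
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Loadings: how each original variable contributes to each component
loadings <- pca$rotation

# Scores: each observation's coordinates in the component space
scores <- pca$x
```

Scaling matters here because the variables are measured in different units; without it, the component directions would be driven mostly by the variable with the largest variance.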

Scripts and data:

If time:

• ISL Section 10.2 for the basics, or Elements Section 14.5 (more advanced).
• Shalizi Chapters 18 and 19 (more advanced). In particular, Chapter 19 has a lot more advanced material on factor models, beyond what we covered in class.

### (6) Networks and association rules

Networks and association rule mining.
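To show what association rule mining measures, here is a small base-R sketch that computes support, confidence, and lift for a single rule. The baskets are hypothetical, and `support` is a helper defined only for this example; a real analysis would typically use a dedicated package.

```r
# Hypothetical shopping baskets (made up for illustration)
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter"),
  c("bread", "butter", "milk")
)

# Support: fraction of baskets containing every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule {bread} -> {milk}
supp_rule  <- support(c("bread", "milk"))     # P(bread and milk)
confidence <- supp_rule / support("bread")    # P(milk | bread)
lift       <- confidence / support("milk")    # confidence relative to P(milk)
```

A lift near 1 means the two items appear together about as often as they would if they were independent; values well above 1 indicate a positive association.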

Scripts and data: