# Predictive data analysis - a.k.a. models 101

## What do we mean when we talk about models?
At their most basic, models are a means of predicting or estimating what will or would happen without it actually happening. A more nuanced view is that statistical models help us understand relationships that are not readily apparent.

We will denote models by asking: "given information $X$, what can we say about $Y$?"

For example
* Given the time spent driving ($X$), how far have we traveled ($Y$)?
* Given that I reduce my morning shower by 5min, how much energy will I save?
* Given the price of an item and its brand name, will a customer purchase it?
* Given a student's demographic "profile", what major will they choose?
* Given the amount of time spent studying for Physics 1A, what grade will a student get?
* Given the average grade of a major's introductory course, how many students will leave the major?
* Given that a student does well in a Physics course, how do we expect them to perform in a Gender Studies course?

<span style="color:red; font-size:1.5em;">
Please think up other models that may be relevant to your project ideas and include them below.
</span>

* Given ...

## Describing Data 
Discuss the differences between the models described above. 

### Variable types
* **Continuous**:  May take any numerical value within a range, so there are infinitely many possible values even when the range is bounded.
* **Discrete**:    May take only a countable number of numerical values.
* **Logical**:     May take only two values, true or false.
* **Categorical**: May take a finite number of labeled values with no specific order or numerical separation from each other.
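
As a rough illustration, the four variable types map naturally onto pandas column dtypes. The column names and values below are made up for this sketch and are not taken from the course data.

```python
import pandas as pd

# Hypothetical toy table: one column per variable type
toy = pd.DataFrame({
    'ave_grade':     [3.88, 3.38, 2.83],        # continuous -> float
    'student_count': [6, 28, 43],                # discrete   -> integer
    'is_graduate':   [False, False, True],       # logical    -> bool
    'subject':       pd.Categorical(             # categorical -> unordered labels
        ['Aerospace Studies', 'Aerospace Studies', 'Mathematics']),
})

# Show which dtype pandas assigned to each column
print(toy.dtypes)
```

The `category` dtype is unordered by default, which matches the definition of a categorical variable given above.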

<span style="color:red; font-size:1.5em;">
Please give three examples of each type of data below.
</span>

* **Continuous**: 
* **Discrete**:    
* **Logical**:    
* **Categorical**: 

##  Describing Models

### Probabilistic versus Deterministic models
**Deterministic models** predict a definite outcome. E.g., "If you drive for one hour at a speed of 60 miles per hour, you will have traveled __ miles". 

**Probabilistic models** tell you the likelihood of each outcome. E.g., "If you study for an exam you are 50% likely to get an A and 40% likely to get a B" or "If you do well in a course you are more likely to major in the subject matter".
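
The contrast can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the remaining 10% of grade probability (labeled `'other'`) is an assumption, since the example above only specifies the chances of an A and a B.

```python
import random

def distance_traveled(speed_mph, hours):
    """Deterministic model: the same inputs always give the same answer."""
    return speed_mph * hours

# Probabilistic model: a probability for each outcome rather than one answer.
# 'other' is an assumed catch-all so that the probabilities sum to 1.
grade_probabilities = {'A': 0.5, 'B': 0.4, 'other': 0.1}

def sample_grade(rng):
    """Draw one possible grade according to the model's probabilities."""
    grades = list(grade_probabilities)
    weights = list(grade_probabilities.values())
    return rng.choices(grades, weights=weights, k=1)[0]

rng = random.Random(0)                       # seeded so the run is repeatable
draws = [sample_grade(rng) for _ in range(10000)]
frac_a = draws.count('A') / len(draws)       # should be close to 0.5

print(distance_traveled(45, 2))
print(frac_a)
```

Calling `distance_traveled` twice with the same arguments always returns the same number, while repeated calls to `sample_grade` return different grades whose long-run frequencies approach the stated probabilities.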

#### What is a statistic?
sta·tis·tic (stəˈtistik)

noun: a fact or piece of data from a study of a large quantity of numerical data.
e.g., the mean or standard deviation of a set of data. 


### Statistical Models
If we have observations of variables (data), and we estimate relationships between the variables using the data, we call the resulting model a "statistical model".

This is because the relationships between the variables are defined by **parameters** that are fit using statistics.

### Supervised versus unsupervised models
Most models are supervised: they are models for which we observe the outcome variable $Y$. In unsupervised models we do not observe $Y$. In this case a modeler may look for patterns in $X$ that could be related to their desired outcome, $Y$, though it is difficult or impossible to prove that the patterns relate to this outcome.

# Importing data
This is a lot of code to import data on grade distributions for courses and to process it into a format that is useful for us.
I'm including it because you may find it instructive later, but we will not be reviewing it today.

In [153]:
# Packages to import and process data
import os
import numpy as np
import pandas as pd 

## We are loading grade distribution data from courses taught in L&S during the spring of 2013

In [87]:
## Read in a file of grade distributions for the L&S in the spring of 2013.
#  I believe that each row of this file is a set of students in the course who all received the same grade;
#  the column headings are misleading, methinks
folder = 'data'
filename = 'GradeDist_2013Sp.csv'
df = pd.read_csv(os.path.join(folder, filename))
df

In [155]:
# Function to process the raw grade data
def proc_grade_dist_data(df):
    # Take only columns we plan on using, 
    df_limited = df[['Course Subject Short Nm', 'Course Number' ,'Enrollment Cnt','Grade Subtype Desc','Grade Nm','Average Grade']]
    # DataFrame.sort() has been removed from pandas; sort_values() is its replacement
    df_limited = df_limited.sort_values(['Course Subject Short Nm','Course Number'])
    # Rename them
    df_limited.columns = ['Subject','Course Number','Student Count','Grade Subtype','Letter Grade','Numerical Grade']
    
    # Sum all of the enrollments from separate sections (CCNs) of the same course
    # Pandas makes this easy, but we have to be careful not to also sum the numerical grade column
    df_grouped = df_limited.groupby(['Subject','Course Number','Letter Grade'])
    df_agg = df_grouped.agg({'Student Count':'sum', 'Numerical Grade':'mean'}).reset_index()
    
    # Find the total number of students in each class and merge as a column in our original data.frame
    df_sum_course = df_agg[['Subject','Course Number','Student Count']].groupby(['Subject','Course Number']).sum().reset_index()
    df_sum_course.columns = ['Subject','Course Number','Total Student Count']

    # Find the fraction of students with each grade in each course
    df_merged = df_agg.merge(df_sum_course)
    df_merged['Frac w Grade'] = df_merged['Student Count'] / df_merged['Total Student Count']

    ### FIND AVERAGE LETTER GRADE
    # Exclude all Pass/Incomplete/NoPass Grades
    df_letter = df_merged.loc[ (df_merged['Numerical Grade'].notnull()) ]

    # Find a new sum of headcounts in each course, only counting those with letter grades
    df_sum_course_letter         = df_letter[['Subject','Course Number','Student Count']].groupby(['Subject','Course Number']).sum().reset_index()
    df_sum_course_letter.columns = ['Subject','Course Number','Total Student Count Letter']

    # Find the fraction of students with each letter grade in each course
    df_letter_merged = df_letter.merge(df_sum_course_letter)
    df_letter_merged['Frac w Letter Grade'] = df_letter_merged['Student Count'] / df_letter_merged['Total Student Count Letter']

    # Find average letter grade in each course
    df_letter_merged['Contrib to Ave Letter'] = df_letter_merged['Frac w Letter Grade'] * df_letter_merged['Numerical Grade']
    df_aveletter = df_letter_merged[['Subject','Course Number','Contrib to Ave Letter']].groupby(['Subject','Course Number']).sum().reset_index()
    df_aveletter.columns = ['Subject','Course Number','Ave Letter Grade']

    ## Merge average letter grade onto the original dataset as a new column. 
    df_merged_2 = df_merged.merge(df_aveletter, how='left')
    
    ## Categorize classes as lower division, upper division, and graduate
    import re
    # Strip letters from course numbers such as '135B' so they can be compared numerically
    cn_noletter = [float(re.sub('[^0-9]', '', str(cn))) for cn in df_merged_2['Course Number']]

    def get_level(n):
        if n < 100:
            return 'Lower Division'
        elif n < 200:
            return 'Upper Division'
        elif n >= 200:
            # >= rather than >, so that a course numbered exactly 200 counts as graduate
            return 'Graduate'
        else:
            # Fallback; not expected for purely numeric input
            return 'Null'

    df_merged_2['Course Level'] = [get_level(n) for n in cn_noletter]

    return df_merged_2

## Now we can print out the processed data

In [160]:
## Process the raw data
grades2013 = proc_grade_dist_data(df)
grades2013

| | Subject | Course Number | Letter Grade | Student Count | Numerical Grade | Total Student Count | Frac w Grade | Ave Letter Grade | Course Level |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Aerospace Studies | 100 | Not Pass | 2 | NaN | 43 | 0.046512 | NaN | Upper Division |
| 1 | Aerospace Studies | 100 | Pass | 41 | NaN | 43 | 0.953488 | NaN | Upper Division |
| 2 | Aerospace Studies | 135B | A | 3 | 4.0 | 6 | 0.500000 | 3.883333 | Upper Division |
| 3 | Aerospace Studies | 135B | A+ | 2 | 4.0 | 6 | 0.333333 | 3.883333 | Upper Division |
| 4 | Aerospace Studies | 135B | B+ | 1 | 3.3 | 6 | 0.166667 | 3.883333 | Upper Division |
| 5 | Aerospace Studies | 1B | A | 16 | 4.0 | 28 | 0.571429 | 3.382143 | Lower Division |
| 6 | Aerospace Studies | 1B | A+ | 3 | 4.0 | 28 | 0.107143 | 3.382143 | Lower Division |
| 7 | Aerospace Studies | 1B | A- | 1 | 3.7 | 28 | 0.035714 | 3.382143 | Lower Division |
| 8 | Aerospace Studies | 1B | B | 5 | 3.0 | 28 | 0.178571 | 3.382143 | Lower Division |
| 9 | Aerospace Studies | 1B | F | 3 | 0.0 | 28 | 0.107143 | 3.382143 | Lower Division |


# Binned basic statistics (mean, standard deviations, percentiles)
Continuous $Y$, categorical $X$
 
Two statistics you've undoubtedly learned about are means and standard deviations. For those who want the mathematics, we have defined them below.  

* **mean:** the average value of $Y$, $\bar Y = \frac{1}{N}\sum_{n=1}^{N} y_n$
* **standard deviation:** a typical (root-mean-square) deviation from the mean of $Y$, $\sigma_Y = \sqrt{\frac{1}{N}\sum_{n=1}^{N} (y_n - \bar Y)^2}$
* **percentile, $p_i(Y)$:** the value that $i$% of observations of $Y$ lie below.
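
For readers who prefer code to notation, here is a minimal sketch of these three statistics in NumPy. The sample values are made up, and the standard deviation uses the population form (dividing by $N$ inside the square root).

```python
import numpy as np

# Made-up sample of a continuous variable Y
y = np.array([2.0, 3.0, 3.0, 4.0])
N = len(y)

# Mean: sum of the observations divided by their count
mean_y = y.sum() / N

# Standard deviation: square root of the mean squared deviation from the mean
std_y = np.sqrt(((y - mean_y) ** 2).sum() / N)

# 50th percentile (the median): the value half of the observations lie below
p50 = np.percentile(y, 50)

# The hand-written versions agree with NumPy's built-ins
print(mean_y, np.mean(y))
print(std_y, np.std(y))
print(p50)
```

Note that `np.std` divides by $N$ by default, whereas the pandas `.std()` method divides by $N-1$, so the two can give slightly different numbers on the same data.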

"Binning" values of $Y$ by the categories in $X$, and computing these statistics within each bin, gives a simple predictive model.

* "Given the values in $X$, we predict that on average $Y = \bar Y$, and we believe with $i$% likelihood that $Y$ is below $p_i(Y)$."

This might be confusing... let's take a look, using our data.

In [230]:
## Pandas makes it easy to make group statistics using the "groupby" function on a data.frame
# To be "clean" we will take only the appropriate columns of our data here
# Also, each entry for a course will have the same average letter grade, so there will be 
#       duplicate entries.  Here we also drop the duplicates. 
grades_red = grades2013[['Subject','Course Number','Course Level','Ave Letter Grade']].drop_duplicates()
grades_red = grades_red.loc[ grades_red['Ave Letter Grade'].notnull() ]

## Now we can find the mean and standard deviation of average letter grades for
# each subject, stratified by division.
# We begin by creating a dictionary of all of the stats we would like.
# Passing the names as strings uses pandas' built-in group mean and (sample) standard deviation.
aggstats = {'Ave Letter Grade': ['mean', 'std']}

# We then run a groupby stats call on the dataframe
grades_binnedmodel = grades_red.groupby(['Subject','Course Level']).agg(aggstats).reset_index()

# Now let's call out a couple of subjects
subjects = sorted(list(set(grades_binnedmodel['Subject'])))
indices  = list(set(grades_binnedmodel.index))
               
grades_binnedmodel.loc[grades_binnedmodel['Subject'] == 'Mathematics']


| | Subject | Course Level | Ave Letter Grade (mean) | Ave Letter Grade (std) |
|---|---|---|---|---|
| 141 | Mathematics | Graduate | 3.832037 | 0.210564 |
| 142 | Mathematics | Lower Division | 2.827983 | 0.239069 |
| 143 | Mathematics | Upper Division | 3.218393 | 0.422656 |
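
The aggregation above reports only the mean and standard deviation. A minimal sketch of adding percentiles to the same kind of groupby follows; it uses made-up numbers rather than the grade file, and the helper `predict_mean` is a hypothetical name introduced only for illustration.

```python
import pandas as pd

# Made-up course averages, four courses per level
toy = pd.DataFrame({
    'Course Level': ['Lower Division'] * 4 + ['Upper Division'] * 4,
    'Ave Letter Grade': [2.0, 3.0, 3.0, 4.0, 3.0, 3.5, 3.5, 4.0],
})

grouped = toy.groupby('Course Level')['Ave Letter Grade']

# Mean and standard deviation per bin, then the 25th and 75th percentiles
binned = grouped.agg(['mean', 'std'])
binned['p25'] = grouped.quantile(0.25)
binned['p75'] = grouped.quantile(0.75)
print(binned)

def predict_mean(level):
    """Binned model: predict the bin's mean for any course at this level."""
    return binned.loc[level, 'mean']

print(predict_mean('Upper Division'))
```

Read as a model, each row says: given a course at this level, we predict the bin's mean grade, and we expect about 25% of such courses to fall below `p25` and about 75% below `p75`.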


# Linear Regression
Continuous $Y$, any $X$

Linear regression models a continuous $Y$ as a linear function of $X$, with parameters fit from the data.

# Classification (Logistic Regression or SVM)
Logical $Y$, any $X$.

# Clustering (Gaussian mixtures or k-means)
Unobserved, categorical $Y$, any $X$.

# Principal Components Analysis
Unobserved, continuous $Y$, any $X$. 

## In reality the velocity of the car may not be constant; we can model that with a random error.

$$
p = vt + p_0 + \epsilon_t
$$
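
To make this concrete, here is a small simulation, offered as a sketch rather than a definitive treatment. It assumes that $p$ is position, $v$ is velocity, $t$ is time, $p_0$ is the starting position, and that $\epsilon_t$ is Gaussian noise. The numerical values are made up. A straight-line least-squares fit with `np.polyfit` then recovers estimates of $v$ and $p_0$ from the noisy observations.

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded for repeatability

# Assumed "true" parameters for the simulation (made-up values)
v_true, p0_true = 60.0, 5.0              # miles per hour, miles
noise_sd = 2.0                           # standard deviation of epsilon_t, miles

# Simulate noisy position readings over two hours
t = np.linspace(0, 2, 50)                # time in hours
eps = rng.normal(0.0, noise_sd, size=t.size)
p = v_true * t + p0_true + eps

# Fit a degree-1 polynomial: the slope estimates v, the intercept estimates p_0
v_hat, p0_hat = np.polyfit(t, p, 1)
print(v_hat, p0_hat)
```

The fitted values will not equal the true ones exactly, because of the random error, but they should land close to them. That gap is what the probabilistic part of the model describes.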