# Lecture 19: Pipelines and Hypothesis Tests

### Please note: This lecture will be recorded and made available for viewing online. If you do not wish to be recorded, please adjust your camera settings accordingly. 

# Reminders/Announcements:
- Assignment 6 has been collected. Assignment 7 coming soon (almost done with assignments!)
- Quiz 2 is on February 22nd. Please see the Canvas announcement for details (almost done with quizzes!).
    - Lecture on that Monday will just be cancelled. Twenty minutes will not get us very far with any topic these days...

Today we will wrap up our basic stats overview. Note that we are still in the Python kernel!

In [0]:
import pandas as pd
import matplotlib.pyplot as plt

## Pandas and Regression 

Just like numpy arrays, you can use pandas dataframes as inputs into different models, such as a linear regression. Let's examine housing prices in California. This data is taken from the 1990 US Census, and each row describes (essentially) one neighborhood. Data obtained from Aurelien Geron (https://github.com/ageron), who obtained it from Luis Torgo (https://www.dcc.fc.up.pt/~ltorgo/Regression/).

In [0]:
housing = pd.read_csv('housing.csv')
housing.head()

In [0]:
housing.describe()

In [0]:
housing.info()

Uh oh! We are missing some values in the "total_bedrooms" column. Let's *impute* by replacing 'null entries' with the corresponding *average value*. In pandas, blank entries are usually represented by `NaN`s. You can test for this by using the `isna()` command:

In [0]:
housing.loc[housing['total_bedrooms'].isna()].head()

Using our knowledge from last time, we can now "fill in the blanks" using masking and `loc`. Here we fill them with the column's mean:

In [0]:
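# Compute the column mean (NaNs are skipped) instead of typing it in by hand; it is about 537.87
housing.loc[housing['total_bedrooms'].isna(), 'total_bedrooms'] = housing['total_bedrooms'].mean()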

In [0]:
housing.info()

An alternative would have been to simply "drop" (remove from the dataset) any row that had missing information:

In [0]:
housing.dropna() #This doesn't do anything now...since we already took care of the NaNs!
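To see the two strategies side by side, here is a minimal sketch on a made-up three-row dataframe (the `toy` frame and its values are invented for illustration; they are not from the housing data). Note that `dropna()` returns a *new* dataframe and leaves the original alone unless you reassign it.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing entry in 'total_bedrooms'
toy = pd.DataFrame({'total_bedrooms': [2.0, np.nan, 4.0],
                    'households': [1, 2, 3]})

# Strategy 1: drop every row that contains a NaN (returns a new dataframe)
dropped = toy.dropna()

# Strategy 2: impute, i.e. replace the NaN with the mean of the observed values
imputed = toy.fillna({'total_bedrooms': toy['total_bedrooms'].mean()})

print(len(toy), len(dropped), len(imputed))
print(imputed)
```

Dropping loses a whole row of otherwise good data, while imputing keeps the row at the cost of inventing one value.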

Great! Let's keep going. Do you think this really represents California housing?

In [0]:
housing.plot(kind = 'scatter', x = 'longitude', y = 'latitude')
plt.show()

Remember that `alpha` parameter from way back long ago that seemed useless? It is *very* useful to describe *densities* of geographic data!

In [0]:
housing.plot(kind = 'scatter', x = 'longitude', y = 'latitude', alpha = .1)
plt.show()

What about those color map things?

In [0]:
housing.plot(kind = 'scatter', x = 'longitude', y = 'latitude', alpha = .4, s = housing['population']/100, label = 'population',c = 'median_house_value', cmap = 'Purples')
plt.show()

The goal with this data is to predict housing prices (`median_house_value`) from the remaining columns.

In [0]:
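# Recent pandas versions need numeric_only=True to skip the text column 'ocean_proximity'
housing.corr(numeric_only=True)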

Let's try to use this dataset to predict housing prices. But first: does it really make sense to consider the "total rooms" in the district? Is there a better option? Also, do we really want total rooms *and* total bedrooms? Those are probably highly correlated...maybe we can just keep the *ratio* of bedrooms to rooms:

## ************ Participation Check ********************
Add a new column `roomsPerHouse` to the housing dataframe, equal to the ratio `total_rooms/households`. Then add a new column `bedroomsPerRoom` equal to the ratio `total_bedrooms/total_rooms`.

## ***********************************************************

In [0]:
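# Recent pandas versions need numeric_only=True to skip the text column 'ocean_proximity'
housing.corr(numeric_only=True)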

Ok! Let's do a basic linear regression using a few of the variables.

In [0]:
from sklearn.linear_model import LinearRegression
ols = LinearRegression()

In [0]:
ols

In [0]:
ols.fit(housing[['housing_median_age','population','median_income', 'roomsPerHouse','bedroomsPerRoom']], housing['median_house_value'])

In [0]:
ols.coef_

In [0]:
ols.intercept_

In [0]:
c = list(housing.columns)
c

In [0]:
#c.remove('longitude')
housing[c]

Looking good! What if we wanted to handle the categorical attribute? (Proximity to the ocean is probably a big factor in housing prices!)

## Categorical Attributes and "Pipelines"

Scikit Learn offers many preprocessing tools which can be used to form *pipelines* on your dataset. This is essentially automated cleaning and processing of your data; for instance, *imputing* missing values, as we did above, is a common "pipeline" step.
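As a rough illustration of what a pipeline looks like in code, here is a minimal sketch that chains mean imputation with a linear regression using Scikit Learn's `Pipeline` and `SimpleImputer`. The data is made up for the example, and the step names `'impute'` and `'ols'` are arbitrary labels chosen here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Made-up data: one explanatory column with a missing entry
X = pd.DataFrame({'total_bedrooms': [100.0, np.nan, 300.0, 400.0]})
y = pd.Series([1.0, 2.0, 3.0, 4.0])

# Each step is a (name, transformer/model) pair; the steps run in order
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),  # fill NaNs with the column mean
    ('ols', LinearRegression()),                 # then fit the regression
])
pipe.fit(X, y)

predictions = pipe.predict(X)
imputed_value = pipe.named_steps['impute'].statistics_[0]
print(imputed_value)
```

The appeal of this design is that the same cleaning steps get applied automatically to any new data you later pass to `pipe.predict`.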

In [0]:
proximity = housing[['ocean_proximity']]
proximity.head(10)

In [0]:
proximity.sample(n=10)

As with the "Boy/Girl" scenario in Lecture 17, we want to translate this to numeric data somehow. Scikit Learn has several built-in encoders for categorical data. The first method simply labels each category with an integer:

In [0]:
from sklearn.preprocessing import OrdinalEncoder
ordEncode = OrdinalEncoder(categories = [['ISLAND','NEAR OCEAN','NEAR BAY','<1H OCEAN', 'INLAND']])
ordEncode.fit_transform(proximity)

In [0]:
pOrd = proximity.copy()
pOrd['Ord'] = ordEncode.fit_transform(proximity)
pOrd.head()

In [0]:
pOrd.sample(n=10)

This *can be fine* in some cases, such as in a survey: if you had the choices very bad, bad, average, good, very good, then you could reasonably relabel this as 0,1,2,3,4. In this case, this method has *some potential*; we are measuring "distance from the ocean" somehow. But it is not so clear that this is the right method.

An alternative is to use a *one hot encoding*, which creates *5 new binary variables*.

In [0]:
from sklearn.preprocessing import OneHotEncoder
hotEncode = OneHotEncoder()
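onehot = hotEncode.fit_transform(proximity).toarray()
# Name each new column after its category; plain integer column names can cause errors in later model fits
Encoded = pd.DataFrame(onehot, columns=hotEncode.categories_[0])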
Encoded.head()

In [0]:
proximity.join(Encoded).sample(10)

You could then pipe this into your model to explain more of the variation in housing prices. Let's add these one-hot variables to our original dataset.

In [0]:
housing = housing.join(Encoded)
housing.head()

## Training Sets, Test Sets, and Model Evaluation

In practice, a *very* important preprocessing step is to split up your data into a *training set* and a *testing set*. Sometimes the data will already be "shuffled" and you can simply take the first ~80\% as training, and the remaining 20\% as a test set. In other cases you will need to shuffle the data yourself. Thankfully, Scikit Learn has a built-in for this!

In [0]:
from sklearn.model_selection import train_test_split
trainingData, testingData = train_test_split(housing, test_size = .2, random_state = 3141592653)

In [0]:
trainingData.head()

In [0]:
testingData.head()
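For comparison, the "shuffle it yourself, then take the first 80%" approach can be sketched by hand with pandas. This is only an illustration on a made-up dataframe (`toy` and its columns are invented here); in practice `train_test_split` does the same job in one line.

```python
import pandas as pd

# Made-up data with 10 rows
toy = pd.DataFrame({'x': range(10), 'y': [2 * i for i in range(10)]})

# Shuffle all rows; random_state makes the shuffle reproducible
shuffled = toy.sample(frac=1, random_state=0)

# First 80% of the shuffled rows for training, the rest for testing
cutoff = int(0.8 * len(shuffled))
train = shuffled.iloc[:cutoff]
test = shuffled.iloc[cutoff:]

print(len(train), len(test))
```

Every row lands in exactly one of the two sets, which is the property that matters.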

Once you do this, in practice you *completely forget about the test set* until you are done building your "model". Then the test set is used to, well, TEST your model. This is done in part to help prevent "overfitting," which will be explored in the homework.

## ***** Extended Participation Check *********************


Train a linear regression on the *training data* dataframe. Only use `housing_median_age` and `median_income` as the explanatory variables. Once you have done this, do the following:
- use the model to *predict* the housing prices of the houses in the *training data set*, based on `housing_median_age` and `median_income`
- use the imported `mean_squared_error` function to calculate the *training error* (the code for this is already written for you)
- on *the same model*, predict the housing prices of the houses in the *testing data set* based on their `housing_median_age` and `median_income`
- use the imported `mean_squared_error` function to calculate the *testing error*

In [0]:
from sklearn.metrics import mean_squared_error

In [0]:
#Train the model 

In [0]:
#Predict the training set

In [0]:
#Compute the training error

In [0]:
#Predict the testing set

In [0]:
#Compute the testing error

## ***********************************************************************

Note: This Participation Check will be *very* relevant to a problem on the upcoming homework!!

This is a *very large error*, signifying (in part) that the linear model is *underfitting* the data. Even if we add more explanatory variables, we still get a quite poor error:

In [0]:
housing.head()

In [0]:
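# Choose the explanatory columns here; as one example, reuse the five from the earlier regression
explanatory = ['housing_median_age', 'population', 'median_income', 'roomsPerHouse', 'bedroomsPerRoom']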
ols = LinearRegression()
ols.fit(trainingData[explanatory], trainingData['median_house_value'])

In [0]:
trainPredictions = ols.predict(trainingData[explanatory])
# Root mean squared error on the training set (true values first, then predictions)
mean_squared_error(trainingData['median_house_value'], trainPredictions)**(1/2)

In [0]:
testPredictions = ols.predict(testingData[explanatory])
# Root mean squared error on the testing set (true values first, then predictions)
mean_squared_error(testingData['median_house_value'], testPredictions)**(1/2)

This is simply because a linear model is too crude for this data. Perhaps if we have extra time at the end of the quarter we will discuss more powerful techniques for studying this data set.
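A side note on the error computed above: taking `**(1/2)` of the mean squared error gives the *root mean squared error* (RMSE), which is in the same units as the prices themselves (dollars), so it reads as a "typical" size of a prediction miss. Here is a small sketch with made-up numbers checking that the Scikit Learn call agrees with the formula written out by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true values and predictions
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

# RMSE via Scikit Learn: square root of the mean squared error
rmse_sklearn = mean_squared_error(actual, predicted) ** (1 / 2)

# The same quantity by hand: sqrt(mean((actual - predicted)^2))
rmse_by_hand = np.sqrt(np.mean((actual - predicted) ** 2))

print(rmse_sklearn, rmse_by_hand)
```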

## Change of Pace: Hypothesis Testing

If you have a larger background in statistics, you may be concerned that I have not been mentioning things like "null hypotheses." I think getting into the details of that is *a bit* out of the scope for this class, but let me at least mention it briefly. (In particular, this will be *very useful* for a problem on your upcoming homework...)

Suppose you had two collections of data. For example, maybe a biologist provided you with two files containing data on the *weights* of two different species of frog (in this case I will simply be making up a hypothetical data set):

In [0]:
import scipy.stats
import pandas as pd
import matplotlib.pyplot as plt
s = scipy.stats.norm()
a = 1/2.33
t = scipy.stats.tukeylambda(a)
frog1 = pd.DataFrame([i+7 for i in s.rvs(size = 10)])
frog2 = pd.DataFrame([i+6.5 for i in t.rvs(size = 10)])

In [0]:
frog1.head()

In [0]:
frog2.head()

Let's start by plotting out the weights for each frog to look at their distribution...

In [0]:
from matplotlib import pyplot
import numpy
bins = numpy.linspace(0, 12, 20)

pyplot.hist(frog1[0], bins, alpha=0.5, label='Frog 1')
pyplot.hist(frog2[0], bins, alpha=0.5, label='Frog 2')
pyplot.legend(loc='upper right')
pyplot.show()

Do these two species of frogs have the same distribution of weights? Hard to tell...we only sampled 10 frogs of each species! What if we just happened to catch really heavy blue frogs by accident?!

The concept of a *hypothesis test* is to quantify our uncertainty in this area. A basic question you could ask is the following: is the *average weight* of species 1 equal to the *average weight* of species 2?

To do this, we can do a *t-test*. WARNING: there are statistical assumptions that need to be met in order for a t-test to be valid. I will not get into them here, but please be careful if you want to do this in the real world (see more here: https://en.wikipedia.org/wiki/Student%27s_t-test).

To do this with scipy, you can call `scipy.stats.ttest_ind`

In [0]:
scipy.stats.ttest_ind(frog1,frog2)

How should we interpret this? An important thing to look at is the *p value*. The p value is a number between 0 and 1 which measures, roughly, how surprising our data would be if the two species really had the same mean weight; the smaller it is, the stronger the evidence that the means are different. The way to interpret it is: *if* the two means were the same, then the fraction of experiments that would "randomly" produce a difference at least as large as the one we observed would be p.

So *if* the frogs had the same average weight and *if* we reran this experiment 100 times (collecting 10 frogs from each species and weighing them), then we would expect an outcome at least this extreme about 100*p times "by chance".
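This "rerun the experiment many times" idea can be checked with a quick simulation. The sketch below is an illustration under assumptions I am choosing here: both made-up species are drawn from the *same* normal distribution (mean 7, standard deviation 1), the experiment is repeated 2000 times, and we count how often the p value falls below 0.05. If the interpretation above is right, that should happen in roughly 5% of the experiments.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible
n_experiments = 2000

p_values = []
for _ in range(n_experiments):
    # Both samples come from the same distribution, so the means really are equal
    sample1 = rng.normal(loc=7, scale=1, size=10)
    sample2 = rng.normal(loc=7, scale=1, size=10)
    p_values.append(scipy.stats.ttest_ind(sample1, sample2).pvalue)

p_values = np.array(p_values)

# Fraction of experiments that look "significant" purely by chance
fraction_below = np.mean(p_values < 0.05)
print(fraction_below)
```

The exact fraction depends on the random seed, but it should hover near 0.05.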

A common cutoff to "reject the null hypothesis" is p = .05. I.e. *if* you get a p-value less than .05, then you can *reject the null hypothesis that the frogs have the same average weight*.

If your p value is larger than that, then you *cannot* say that the frogs *do* have the same average weight; you simply haven't "proven" that they don't. Maybe you simply need to increase the size of your experiment!

In [0]:
frog1 = pd.DataFrame([i+7 for i in s.rvs(size=10000)])
frog2 = pd.DataFrame([i+6.5 for i in t.rvs(size=10000)])
bins = numpy.linspace(0, 12, 30)
# Select column 0 explicitly, as in the earlier histogram cell
pyplot.hist(frog1[0], bins, alpha=0.5, label='Frog 1')
pyplot.hist(frog2[0], bins, alpha=0.5, label='Frog 2')
pyplot.legend(loc='upper right')
pyplot.show()

In [0]:
scipy.stats.ttest_ind(frog1,frog2)

## Hypothesis Testing and Linear Regression

Hypothesis tests appear all over the place, as you can imagine.
- Do ads on YouTube increase revenue more than ads in newspapers?
- Does tobacco use decrease life expectancy?
- Is family income related to college admissions?
- ...

They are almost *always* employed together with linear regressions. This is what I have been skipping (in part because Scikit Learn's models do not report hypothesis tests).

The idea is, my linear regression gave me a big positive coefficient on some variable. But is that coefficient *statistically significant*, or did it happen by chance? To learn more about that, I recommend a course dedicated to this, such as statistical modelling or econometrics.

If you want to see more detailed linear regressions, you don't necessarily have to use R or Stata! Python's `statsmodels` may be the module for you if you want to keep things "pythonic".