## Week 3-1 - Linear Regression - class notebook

This notebook gives three examples of regression, that is, fitting a linear model to our data to find trends. For the finale, we're going to duplicate the analysis behind the Washington Post story on what motivated Trump voters in 2016.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
%matplotlib inline

## Part 1 - Single variable regression
We'll start with some simple data on height and weight.

In [2]:
hw = pd.read_csv("height-weight.csv")
hw

Unnamed: 0,name,height,weight
0,Joyce,51.3,50.5
1,Louise,56.3,77.0
2,Alice,56.5,84.0
3,James,57.3,83.0
4,Thomas,57.5,85.0
5,John,59.0,99.5
6,Jane,59.8,84.5
7,Jeffrey,62.5,84.0
8,Janet,62.5,112.5
9,Carol,62.8,102.5


Let's look at the distribution of each of these variables.

Really, the interesting thing is to look at them together. For this we use a scatter plot.
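A minimal sketch of both kinds of plot. It assumes only the ten rows displayed above (the real `height-weight.csv` may hold more) and rebuilds them inline so the cell runs on its own:

```python
import pandas as pd
import matplotlib.pyplot as plt

# The ten rows displayed above, rebuilt by hand (the full CSV may be longer)
hw = pd.DataFrame({
    "name": ["Joyce", "Louise", "Alice", "James", "Thomas",
             "John", "Jane", "Jeffrey", "Janet", "Carol"],
    "height": [51.3, 56.3, 56.5, 57.3, 57.5, 59.0, 59.8, 62.5, 62.5, 62.8],
    "weight": [50.5, 77.0, 84.0, 83.0, 85.0, 99.5, 84.5, 84.0, 112.5, 102.5],
})

fig, (ax_h, ax_w, ax_hw) = plt.subplots(1, 3, figsize=(12, 3.5))

# Distribution of each variable on its own
height_counts, _, _ = ax_h.hist(hw["height"])
ax_h.set_xlabel("height")
weight_counts, _, _ = ax_w.hist(hw["weight"])
ax_w.set_xlabel("weight")

# Both variables together: one dot per person
ax_hw.scatter(hw["height"], hw["weight"])
ax_hw.set_xlabel("height")
ax_hw.set_ylabel("weight")
```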

Clearly there's a trend that relates the two. One measure of the strength of that trend is called "correlation". We can compute the correlation between every pair of columns with `corr()`, though in this case it's really only between one pair.
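As a sketch, again using only the ten rows shown above, the correlation matrix can be computed like this. The `name` column is left out because `corr()` works on numeric columns:

```python
import pandas as pd

# Numeric columns from the ten rows displayed above
hw = pd.DataFrame({
    "height": [51.3, 56.3, 56.5, 57.3, 57.5, 59.0, 59.8, 62.5, 62.5, 62.8],
    "weight": [50.5, 77.0, 84.0, 83.0, 85.0, 99.5, 84.5, 84.0, 112.5, 102.5],
})

# Pairwise correlation between every pair of columns
corr_matrix = hw.corr()
print(corr_matrix)

# The single off-diagonal entry is the height/weight correlation
r = corr_matrix.loc["height", "weight"]
```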


If you want to get better at knowing what sort of graph a correlation coefficient corresponds to, play the remarkable 8-bit game [Guess the Correlation](http://guessthecorrelation.com/)

So far so good. Now suppose we want to know what weight we should guess if we know someone is 60" tall. We don't have anyone of that height in our data, and even if we did, they could be above or below the average weight for that height. We need to build some sort of *model* which captures the trend, and guesses the average weight at each height.

*ENTER THE REGRESSION*.

Ok, now we've got a "linear regression." What is it? It's just a line `y=mx+b`, which we can recover like this:
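One way this might look with `sklearn`, fitted to the ten rows shown earlier (an assumption, since the full CSV is not reproduced here). The slope `m` lives in `coef_` and the intercept `b` in `intercept_`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Heights and weights from the ten rows displayed earlier
height = np.array([51.3, 56.3, 56.5, 57.3, 57.5, 59.0, 59.8, 62.5, 62.5, 62.8])
weight = np.array([50.5, 77.0, 84.0, 83.0, 85.0, 99.5, 84.5, 84.0, 112.5, 102.5])

# sklearn expects x as a 2D array of shape (n_samples, n_features)
x = height.reshape(-1, 1)
y = weight

lm = LinearRegression().fit(x, y)

m = lm.coef_[0]      # slope: extra pounds per extra inch
b = lm.intercept_    # intercept: predicted weight at height 0
print(f"weight = {m:.2f} * height + {b:.2f}")
```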

We can plot this line `y=mx+b` on top of the scatterplot to see it.
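A sketch of that overlay, refitting on the ten rows shown earlier so the cell stands alone. Two endpoints are enough to draw the line:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

height = np.array([51.3, 56.3, 56.5, 57.3, 57.5, 59.0, 59.8, 62.5, 62.5, 62.8])
weight = np.array([50.5, 77.0, 84.0, 83.0, 85.0, 99.5, 84.5, 84.0, 112.5, 102.5])

lm = LinearRegression().fit(height.reshape(-1, 1), weight)
m, b = lm.coef_[0], lm.intercept_

fig, ax = plt.subplots()
ax.scatter(height, weight)

# Evaluate y = m*x + b at the shortest and tallest person
line_x = np.array([height.min(), height.max()])
line_y = m * line_x + b
ax.plot(line_x, line_y, color="red")

ax.set_xlabel("height")
ax.set_ylabel("weight")
```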

So if we want to figure out the average weight of someone who is 60" tall, we can compute

There's a shortcut for this, which will come in handy when we add variables
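Both routes can be sketched side by side: the by-hand `m*x + b` and the `predict()` shortcut, here on the ten rows shown earlier. They should agree up to floating-point rounding:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

height = np.array([51.3, 56.3, 56.5, 57.3, 57.5, 59.0, 59.8, 62.5, 62.5, 62.8])
weight = np.array([50.5, 77.0, 84.0, 83.0, 85.0, 99.5, 84.5, 84.0, 112.5, 102.5])

lm = LinearRegression().fit(height.reshape(-1, 1), weight)

# By hand: plug x = 60 into y = m*x + b
by_hand = lm.coef_[0] * 60 + lm.intercept_

# Shortcut: predict() takes a 2D array, one row per person to predict
shortcut = lm.predict(np.array([[60.0]]))[0]

print(by_hand, shortcut)
```

The `predict()` form keeps working unchanged when the model has more than one input column, which is why it pays off later.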

## Part 2 - Multi-variable regression 

We can do essentially the same trick with one more independent variable. Then our regression equation is `y = m1*x1 + m2*x2 + b`. We'll use one of the built-in `sklearn` data sets as demonstration data.

In [14]:
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D
diabetes = datasets.load_diabetes()

print(diabetes.DESCR)

Diabetes dataset

Notes
-----

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

Data Set Characteristics:

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attributes:
    :Age:
    :Sex:
    :Body mass index:
    :Average blood pressure:
    :S1:
    :S2:
    :S3:
    :S4:
    :S5:
    :S6:

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani

In [None]:
# take a look at the predictive (independent) variables

In [None]:
# take a look at the "target" (dependent) variable
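One possible way to fill in the two inspection cells above is to wrap the arrays in pandas objects. The name `progression` is a label chosen here, not something the data set provides:

```python
import pandas as pd
from sklearn import datasets

diabetes = datasets.load_diabetes()

# Predictive (independent) variables: one column per feature
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
print(X.head())

# Target (dependent) variable: disease progression after one year
y = pd.Series(diabetes.target, name="progression")
print(y.describe())
```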

In [None]:
# fit a regression

Ok awesome, we've fit a regression with multiple variables. What did we get? Let's check the coefficients
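A sketch of the fit and the coefficient check. It assumes the two predictors are the `age` and `bmi` columns, which is what the discussion of the coefficients suggests:

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression

diabetes = datasets.load_diabetes()

# Pick out the age and bmi columns by name (assumed choice of predictors)
names = list(diabetes.feature_names)
cols = [names.index("age"), names.index("bmi")]
X = diabetes.data[:, cols]
y = diabetes.target

model = LinearRegression().fit(X, y)

# One coefficient per input column, plus a single intercept
print(dict(zip(["age", "bmi"], model.coef_)))
print(model.intercept_)
```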

Now we have *two* coefficients. They're both positive, which means that both age and BMI are associated with increased disease progression. We have an intercept too, the predicted value of the target variable when both age and BMI are zero. Because the features in this data set are mean centered, zero here corresponds to the average age and BMI rather than a literal zero.

To really see what's going on here, we're going to plot the whole thing in beautiful 3D. Now instead of a regression line, we have a regression *plane.* Are you ready for this?

In [20]:
# Helpful function that we'll use later for making more 3D regression plots
def plot_regression_3d(x, y, z, model, elev=30, azim=30, xlab=None, ylab=None):
    fig = plt.figure()
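    # add_subplot(projection='3d') attaches the 3D axes to the figure;
    # recent matplotlib versions no longer do that for Axes3D(fig)
    ax = fig.add_subplot(projection='3d')
    ax.view_init(elev=elev, azim=azim)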

    # This looks gnarly, but we're just taking four points at the corners of the plot, 
    # and using predict() to determine their vertical position
    xmin = x.min()
    xmax = x.max()
    ymin = y.min()
    ymax = y.max()
    corners_x = np.array([[xmin, xmin], [xmax, xmax]])
    corners_y = np.array([[ymin, ymax], [ymin, ymax]])
    corners_z = model.predict(np.array([[xmin, xmin, xmax, xmax], [ymin, ymax, ymin, ymax]]).T).reshape((2, 2))
    ax.plot_surface(corners_x, corners_y, corners_z, alpha=0.5)

    ax.scatter(x, y, z, alpha=0.3)

    ax.set_xlabel(xlab)
    ax.set_ylabel(ylab)



In [1]:
# Now plot our diabetes data

## Part 3 - Analysis of 2016 voters

Aside from prediction, we can use regression to attempt explanations. The coefficient `m` in the above encodes a guess about the existence and strength of the relationship between `x` and `y`. If it's zero, we guess that they're unrelated. Otherwise, it tells us how they are likely to vary together.

In this section we're going to try to understand what motivated people to vote for Trump by looking at the relationship between vote and other variables in the [2016 American National Election Study data](http://electionstudies.org/project/2016-time-series-study/).

There were quite a few statistical analyses of this "why did Trump win?" kind after the election, by journalists and researchers. 

- [Racism motivated Trump voters more than authoritarianism](https://www.washingtonpost.com/news/monkey-cage/wp/2017/04/17/racism-motivated-trump-voters-more-than-authoritarianism-or-income-inequality) - Washington Post
- [The Rise of American Authoritarianism](https://www.vox.com/2016/3/1/11127424/trump-authoritarianism) - Vox
- [Education, Not Income, Predicted Who Would Vote For Trump](https://fivethirtyeight.com/features/education-not-income-predicted-who-would-vote-for-trump/) - 538
- [Why White Americans Voted for Trump – A Research Psychologist’s Analysis](https://techonomy.com/2018/02/white-americans-voted-trump-research-psychologists-analysis/) - Techonomy
- [Status threat, not economic hardship, explains the 2016 presidential vote](http://www.pnas.org/content/early/2018/04/18/1718155115) - Diana C. Mutz, PNAS
- [Trump thrives in areas that lack traditional news outlets](https://www.politico.com/story/2018/04/08/news-subscriptions-decline-donald-trump-voters-505605) - Politico
- [The Five Types of Trump Voters](https://www.voterstudygroup.org/publications/2016-elections/the-five-types-trump-voters) - Voter Study Group

Many of these used regression, but some did not. My favorite is the Voter Study Group analysis which used clustering -- just like we learned last week. It has a good discussion of the problems with using a regression to answer this question.

We're going to use regression anyway, along the lines of the [Washington Post piece](https://www.washingtonpost.com/news/monkey-cage/wp/2017/04/17/racism-motivated-trump-voters-more-than-authoritarianism-or-income-inequality/?utm_term=.01d9d3764f2c) which also uses ANES data. In particular, a regression on variables representing attitudes about authoritarianism and minorities.


In [2]:
# read 'anes_timeseries_2016_rawdata.csv'


The first thing we need to do is construct indices of "authoritarianism" and "racism" from answers to the survey questions. We're following exactly what the Washington Post did here. Are "authoritarianism" and "racism" accurate and/or useful words for indices constructed of these questions? Our choice of words will hugely shape the impression that readers come away with -- even if we do the exact same calculations.

We start by dropping everything we don't need: we keep only white voters, only people who voted for Trump or Clinton, and only the columns we want.

In [4]:
# drop non-white voters

In [3]:
# keep only Trump, Clinton voters

In [5]:
# keep only columns on authoritarian, racial scales
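The three filtering steps might look roughly like this. Everything here is a sketch on a toy stand-in for the ANES file: the race column name `V161310x`, the code `1` for white respondents, and the vote codes `1` (Clinton) and `2` (Trump) are assumptions to check against the code book, and `other_column` is a made-up placeholder for the many columns we drop.

```python
import pandas as pd

# Toy stand-in for the raw ANES data (values invented for illustration)
raw = pd.DataFrame({
    "V161310x": [1, 1, 2, 1, 5, 1],        # assumed race summary column
    "V162034a": [1, 2, 2, -1, 1, 3],       # presidential vote
    "V162239": [1, 2, 1, 2, 3, 1],         # one child-rearing question
    "V162211": [1, 5, 3, 2, 4, 1],         # one racial-attitudes question
    "other_column": [0, 0, 0, 0, 0, 0],    # hypothetical column we don't need
})

WHITE = 1                  # assumed code, check the code book
CLINTON, TRUMP = 1, 2      # assumed codes, check the code book
keep_cols = ["V162034a", "V162239", "V162211"]

# drop non-white voters
white = raw[raw["V161310x"] == WHITE]

# keep only Trump, Clinton voters
voters = white[white["V162034a"].isin([CLINTON, TRUMP])]

# keep only the columns we want
data = voters[keep_cols]
print(data)
```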

Now we have to decode these values.

For the child-rearing questions, the code book tells us that 1 means the first option and 2 means the second. But 3 means both, and then there are all sorts of codes that mean the question wasn't answered, in different ways. And then there's the issue that the questions have different directions: Option 1 might mean either "more" or "less" authoritarian. So we have a custom translation dictionary for each column. This is the stuff that dreams are made of, people.

In [26]:
# recode the authoritarian variables
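A sketch of the per-column translation dictionaries on toy answers. The directions assigned to each option below are hypothetical (the real ones come from the code book), and coding "both" as 0 is an assumption:

```python
import pandas as pd

# Toy raw answers: 1 = first option, 2 = second, 3 = both, negative = not answered
answers = pd.DataFrame({
    "V162239": [1, 2, 3, -9, 1],
    "V162240": [2, 1, 3, 2, -8],
})

# One dictionary per column, because the "more authoritarian" option differs
# from question to question (directions here are hypothetical)
recode = {
    "V162239": {1: -1, 2: 1, 3: 0},
    "V162240": {1: 1, 2: -1, 3: 0},
}

for col, mapping in recode.items():
    # map() turns any code missing from the dictionary into NaN
    answers[col] = answers[col].map(mapping)

# Drop respondents who skipped any of the questions
answers = answers.dropna()
print(answers)
```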

In [27]:
# recode the racial variables

In [28]:
# check the results

Unnamed: 0,V162034a,V162239,V162240,V162241,V162242,V162211,V162212,V162213,V162214
0,2,1,1,1,-1,2,2,2,2
1,2,-1,1,-1,-1,0,-1,1,0
7,2,1,1,-1,1,-1,2,2,1
13,2,1,1,-1,-1,1,2,1,0
14,1,-1,-1,-1,-1,-1,-2,-1,-2


Finally, add the authority and racial columns together to form the composite indexes.

In [6]:
# sum each group of columns. End up with vote, authority, racial columns
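As a sketch, here is the summing step applied to the five recoded rows shown above. Treating `V162239` through `V162242` as the authority questions and `V162211` through `V162214` as the racial questions is an inference from their value ranges (two-option versus five-point), so check it against the code book:

```python
import pandas as pd

# The five recoded rows displayed above
recoded = pd.DataFrame(
    {
        "V162034a": [2, 2, 2, 2, 1],
        "V162239": [1, -1, 1, 1, -1],
        "V162240": [1, 1, 1, 1, -1],
        "V162241": [1, -1, -1, -1, -1],
        "V162242": [-1, -1, 1, -1, -1],
        "V162211": [2, 0, -1, 1, -1],
        "V162212": [2, -1, 2, 2, -2],
        "V162213": [2, 1, 2, 1, -1],
        "V162214": [2, 0, 1, 0, -2],
    },
    index=[0, 1, 7, 13, 14],
)

authority_cols = ["V162239", "V162240", "V162241", "V162242"]  # assumed grouping
racial_cols = ["V162211", "V162212", "V162213", "V162214"]     # assumed grouping

# Sum across each group of columns, row by row
data = pd.DataFrame({
    "vote": recoded["V162034a"],
    "authority": recoded[authority_cols].sum(axis=1),
    "racial": recoded[racial_cols].sum(axis=1),
})
print(data)
```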



Data prepared at last! Let's first look at the scatter plots

Er, right... all this says is that we've got votes for both candidates at all levels of authoritarianism. To get a sense of how many dots are stacked at each point, we can add some jitter and make the points a bit transparent.

In [7]:
# function to add noise to the values in the array
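One possible jitter helper, shown on randomly generated toy values rather than the ANES data. The name `jitter` and the default `amount` are choices made here for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def jitter(values, amount=0.2):
    """Return a copy of values with uniform noise in [-amount, amount] added."""
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-amount, amount, size=values.shape)

# Toy stand-ins: integer index scores and a two-valued vote column
authority = rng.integers(-4, 5, size=200)
vote = rng.integers(1, 3, size=200)

fig, ax = plt.subplots()
# Jitter spreads out overlapping dots; alpha makes dense spots look darker
ax.scatter(jitter(authority), jitter(vote), alpha=0.3)
ax.set_xlabel("authority")
ax.set_ylabel("vote")
```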


In [8]:
# plot vote vs authoritarian variables with jitter

Note that, generally, as you move to the right (more authoritarian) there are more Trump voters. We can do this same plot with the racial axis.

In [9]:
# plot vote vs racial variables with jitter

Similar deal. The axis is smoother because we are summing numbers from a five-point agree/disagree scale, rather than just the two-option questions of the authoritarianism plot.

Now in glorious 3D.

In [10]:
# 3D plot of both sets of vars

Same problem: everything is on top of each other. Same solution.

In [11]:
# jittered 3D plot

You can definitely see the change along both axes. But which factor matters more? Let's get quantitative by fitting a linear model. Regression to the rescue!

In [36]:
# This is some drudgery to convert the dataframe into the format that sklearn needs: 


In [12]:
# This does the actual regression
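A sketch of the conversion and the fit on synthetic data. The `vote` column below is generated from a made-up formula purely so the cell runs on its own, so the numbers say nothing about real voters:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-in for the prepared dataframe
data = pd.DataFrame({
    "authority": rng.integers(-4, 5, size=n),
    "racial": rng.integers(-8, 9, size=n),
})
data["vote"] = (1.5 + 0.02 * data["authority"] + 0.05 * data["racial"]
                + rng.normal(0, 0.1, size=n))

# sklearn wants x as an (n_samples, n_features) array and y as a 1D array
x = data[["authority", "racial"]].values
y = data["vote"].values

model = LinearRegression().fit(x, y)
print(model.coef_, model.intercept_)
```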


In [13]:
# call plot_regression_3d

Well that looks cool but doesn't really clear it up for me. Let's look at the coefficients.


Looks like the coefficient on `racial` is higher. But wait, we chose the numbers that we turned each response into! We could have coded `racial` on a +/-1 scale instead of a +/-2 scale, or a +/-10 scale. So... we could get any number we want just by changing how we convert the data.

To fix this, we're going to standardize the values (both dependent and independent) to have mean 0 and standard deviation 1. This gives us [standardized coefficients](https://en.wikipedia.org/wiki/Standardized_coefficient).

In [14]:
# normalize the columns and take a look

In [15]:
# fit another regression
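To see why standardizing helps, here is a sketch on synthetic data (made-up numbers, not ANES). It standardizes every column, fits, then multiplies `racial` by 5 as if we had coded the answers on a wider scale and repeats. The helper name `standardized_coefs` is a choice made here:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 500
data = pd.DataFrame({
    "authority": rng.integers(-4, 5, size=n).astype(float),
    "racial": rng.integers(-8, 9, size=n).astype(float),
})
data["vote"] = (1.5 + 0.02 * data["authority"] + 0.05 * data["racial"]
                + rng.normal(0, 0.3, size=n))

def standardized_coefs(df):
    """Scale every column to mean 0, std 1, then regress vote on the rest."""
    z = (df - df.mean()) / df.std()
    model = LinearRegression().fit(z[["authority", "racial"]].values,
                                   z["vote"].values)
    return model.coef_

original = standardized_coefs(data)

# Pretend we had coded the racial answers on a scale five times wider
rescaled = data.copy()
rescaled["racial"] = rescaled["racial"] * 5
after_rescaling = standardized_coefs(rescaled)

print(original, after_rescaling)
```

The two sets of coefficients match, because dividing by the standard deviation cancels whatever scale we picked.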

What we have now is the same data, just scaled in each direction

In [16]:
# call plot_regression_3d

Finally, we can compare the coefficients directly. It doesn't matter what range we used to code the survey answers, because we divided it out during normalization.


So there we have it. For white voters in the 2016 election, the standardized regression coefficient on racial factors is quite a bit bigger than the standardized coefficient on authoritarianism. But what does this actually mean?

In [17]:
# what's the new intercept?
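A sketch of this check on synthetic data (invented numbers, not ANES). Once every column has mean 0, the fitted plane passes through the origin, so the intercept should come out as zero up to rounding error:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

# Two synthetic predictors with different scales and offsets
x = rng.normal(size=(300, 2)) * [2.0, 4.0] + [1.0, -3.0]
y = 0.5 + 0.2 * x[:, 0] + 0.4 * x[:, 1] + rng.normal(0, 0.5, size=300)

# Standardize both the independent and dependent variables
zx = (x - x.mean(axis=0)) / x.std(axis=0)
zy = (y - y.mean()) / y.std()

model = LinearRegression().fit(zx, zy)
print(model.intercept_)
```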