# Lecture 17: Statistics/Data Wrangling (1 of 3?)

### Please note: This lecture will be recorded and made available for viewing online. If you do not wish to be recorded, please adjust your camera settings accordingly. 

# Reminders/Announcements:
- Assignment 5 has been collected. Assignment 6 coming soon to a project near you.
- Please fill out this survey regarding your preferred final project topic: https://docs.google.com/forms/d/e/1FAIpQLSeByoY87ENkYyM0MIeC3MmQKO9JKO7m83jR-6Fk0cyLzucqkA/viewform?usp=sf_link
    - This survey will close *Sunday Feb 14 at 8pm*. If you do not know the best project for your background/interests, please email me or a TA!
    - If you do not fill out this survey, you will 
        - be assigned into a *random* time zone!
        - be assigned a *random* topic!
    - As of right now, only ~45 have signed up for topics!
- Quiz 2 is on February 22nd. It will be very similar in style to Quiz 1.
- Participation checks are back
- Long Weekend!!


## A Few Python Modules

There are many ways to work with statistics computationally these days. One of the most common is the R programming language. Time permitting, we will explore a bit of R at the end of this quarter, but for now we are going to keep things "Pythonic."

Working with statistics in Python will require a few import statements. Some useful libraries are below (you don't always need all of them, but each of them is useful in specific circumstances):
- NumPy 
- Pandas
- Statsmodels
- SciKit Learn
- SciPy

Note that most of this would work in the standard Python kernel! I'm only using SageMath here for easy access to plots.

Today (for our introduction to stats) I will be trying to keep things as simple as possible. So I will mainly be using NumPy, SciPy, and SciKit Learn.

In [0]:
import numpy as np
import scipy.stats

## Stats

From Wikipedia, https://en.wikipedia.org/wiki/Statistics: "Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data."

Warning! I *am not* a statistician. My "data background" is more related to economics and machine learning, which means I have different philosophical viewpoints towards the field. 

So let's study some data! In this directory is a file, `heightData.csv`. It was obtained here: https://www.ncdrisc.org/data-downloads-height.html

The file contains the global average height of children over time, in centimeters. Let's take a look:

In [0]:
with open('heightData.csv','r') as myFile: #r for "read only"
    rawData = myFile.readlines()

Usually data files like this come with a header:

In [0]:
rawData[0]

It is always good practice to look at the "head" and the "tail" of the data, to get a feel for what you are working with:

In [0]:
for line in rawData[:5]:
    print(line)

In [0]:
for line in rawData[-5:]:
    print(line)

It looks like the age groups go from 5 - 19, the years go from 1985 - 2019, and the data is split between boys and girls.

Let's prepare this data a little bit:

In [0]:
data = [line.split(',') for line in rawData]
data[:5]

What would be a good way of formatting this data? How about a dictionary? Keys could be a tuple, `(sex, year, age)`. Values could be the height data.

In [0]:
heightDict = {(line[0].strip('"'), int(line[1]), int(line[2])): float(line[3]) for line in data[1:]}

In [0]:
heightDict[('Boys', 2000, 15)]

In [0]:
heightDict[('Girls', 2000, 15)]

This looks better!
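As an aside, Python's built-in `csv` module can handle the splitting and the quote-stripping for us. Here is a minimal sketch. The header names and the two data rows are made up for illustration (they are not values from `heightData.csv`), and I am assuming the same column order we used above: sex, year, age, height.

```python
import csv
import io

# Made-up sample in the assumed column order (sex, year, age, height).
# The header names and numbers are placeholders, NOT taken from heightData.csv.
sample = io.StringIO(
    '"Sex","Year","Age","Height"\n'
    '"Boys",1985,5,105.0\n'
    '"Girls",1985,5,104.0\n'
)

reader = csv.reader(sample)   # csv.reader removes the surrounding quotes
header = next(reader)         # pull off the header row
sampleDict = {(row[0], int(row[1]), int(row[2])): float(row[3]) for row in reader}

print(header)
print(sampleDict[('Boys', 1985, 5)])
```

With a real file you would pass the open file object to `csv.reader` instead of the `io.StringIO` stand-in.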

A basic way to use statistics is to *summarize* and *visualize* data. A useful habit in statistics is to start small. Instead of analyzing all the data at once, let's fix a sex and an age, and let time vary.

## *********** Participation Check ******************************************
Using the `heightDict` dictionary, create a list of data points of the form `(year, height)` for the years 1985 - 2019, using the heights of 18 year old boys. Call the list `collated`, and make a list plot of it to get a sense for the trend of heights over time.

## *********************************************************************************************

The most basic summarization we can do is with our eyes. If we plot the data over time, what do we see?

In [0]:
list_plot(collated)

Do we see the same trend for girls? How about for younger groups?

In [0]:
sex = 'Girls'
age = 18
collated = np.array([(year,heightDict[(sex,year, age)]) for year in range(1985,2020)])
list_plot(collated)

In [0]:
sex = 'Boys'
age = 12
collated = np.array([(year,heightDict[(sex,year, age)]) for year in range(1985,2020)])
list_plot(collated)

Next we want to do some real calculations. Let's "summarize" the heights in the previous category (12 year old boys).

In [0]:
heights = np.array([d[1] for d in collated])
scipy.stats.describe(heights)

The main takeaways:
- `nobs`: the number of data points
- `minmax`: the minimum and maximum
- `mean`: the sample mean, equal to the average of the data set: `sum(data)/nobs`
- `variance`: the sample variance, which describes how much the data "varies." Equal to the formula below:

In [0]:
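#Sample Variance
mean = heights.mean()   #same value that `describe` reported
nobs = len(heights)     #35 years of data
c = (heights - mean)**2 #squared distance of each data point from the mean
sum(c)/(nobs-1)

In [0]:
#Sample Standard Deviation
(sum(c)/(nobs-1))**(1/2)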

You may be more accustomed to the *sample standard deviation*, obtained by taking the square root of the sample variance.

In this case the variance is very low compared to the mean, so the data is "tightly centered" around the mean.
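To double check the formulas, here is a small sketch comparing the hand computation with NumPy and `scipy.stats.describe`. The heights are made-up numbers, not from the data set. The one thing to watch is `ddof=1`: by default NumPy divides by `nobs`, while the sample variance divides by `nobs - 1`.

```python
import numpy as np
import scipy.stats

# Made-up heights (cm), for illustration only
sample_heights = np.array([150.0, 152.0, 154.0, 156.0, 158.0])

nobs = len(sample_heights)
sample_mean = sample_heights.sum() / nobs
sample_var = ((sample_heights - sample_mean) ** 2).sum() / (nobs - 1)
sample_std = sample_var ** 0.5

# describe() reports the sample variance (it divides by nobs - 1)
summary = scipy.stats.describe(sample_heights)

print(sample_mean, sample_var, sample_std)
print(summary.mean, summary.variance)
print(np.var(sample_heights, ddof=1), np.std(sample_heights, ddof=1))
```

All three approaches should agree on the variance, and the standard deviation is just its square root.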

What if we only fix a *sex*?

In [0]:
sex = 'Boys'
collated = np.array([heightDict[(sex,year, age)] for year in range(1985,2020) for age in range(6,20)])
scipy.stats.describe(collated)

In [0]:
std = (363.4513017342805)**(1/2)
std

Now the standard deviation/variance is much higher, which we'd expect. Most variance in height is probably due to age. 

What do we see for girls? 

In [0]:
sex = 'Girls'
collated = np.array([heightDict[(sex,year, age)] for year in range(1985,2020) for age in range(6,20)])
scipy.stats.describe(collated)

In [0]:
std = (229.38766959375724)**(1/2)
std

## Regression

One of the most important tools in statistics is the concept of a *regression*. This lets us discover relationships in the data and potentially make predictions. 

For example, based on the plots above, what can we say about the average height of children from 1985 to 2019?

It is *increasing*! Let's go back to the first example, where we fixed sex and age:

In [0]:
sex = 'Boys'
age = 18
collated = np.asarray([(year,heightDict[(sex,year, age)]) for year in range(1985,2020)])
collated

In [0]:
list_plot(collated)

From the plot, it *looks* like there is almost a linear relationship here. In other words, the average height of 18 year old boys from 1985 to 2019 seemingly satisfies a relationship
$$
\text{height} = \beta_0 + \beta_1 \cdot \text{year}
$$
How can we find the *best* choice of $\beta_0,\beta_1$?

## Ordinary Least Squares

The concept of OLS is as follows. We will choose $\beta_0,\beta_1$ to *minimize* the squared error on our data set. I think a picture is worth a dozen words here:

![](lr1.png)

Any line approximating the data will give "predictions" which we can compare to reality:

![](lr2.png)

Comparing "predicted" vs "reality" across all of the data points gives a total "error" for our model:

![](lr3.png)
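Written out, if our data points are $(\text{year}_i, \text{height}_i)$ for $i = 1, \dots, n$, the error in the picture is the sum of the squared gaps between reality and prediction. (The symbols $n$, $i$ and $E$ are my own notation here.)
$$
E(\beta_0, \beta_1) = \sum_{i=1}^{n} \left( \text{height}_i - (\beta_0 + \beta_1 \cdot \text{year}_i) \right)^2
$$
OLS picks the $\beta_0, \beta_1$ which make $E$ as small as possible.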

There are formulas for minimizing the error exactly in basic cases. Intuitively though, it should look like the green line below. Blue, orange, and red lines are *bad* choices for this data:

![](lr4.png)

Note! No data point is predicted *perfectly*, but almost all of the data points are predicted *well*.
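For a single explanatory variable, the exact formulas are short enough to try by hand: the slope is the sum of `(x - x_bar)*(y - y_bar)` divided by the sum of `(x - x_bar)**2`, and the intercept is chosen so the line passes through the point of means. Here is a minimal sketch with made-up data (these years and heights are invented, not from the file), cross-checked against `np.polyfit`:

```python
import numpy as np

# Made-up data: NOT from heightData.csv
x = np.array([1985.0, 1990.0, 1995.0, 2000.0, 2005.0])
y = np.array([170.0, 170.5, 170.7, 171.3, 171.5])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of (x - x_bar)(y - y_bar) over sum of (x - x_bar)^2
beta_1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
# Intercept: forces the line through (x_bar, y_bar)
beta_0 = y_bar - beta_1 * x_bar

# np.polyfit with degree 1 solves the same least squares problem
slope_check, intercept_check = np.polyfit(x, y, 1)

print(beta_0, beta_1)
print(intercept_check, slope_check)
```

The two printed lines should match up to floating point noise.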

Thankfully, this is easy to do in Python! There are actually *many ways to do it.*

In [0]:
#Model setup
from sklearn.linear_model import LinearRegression
ols = LinearRegression()
ols

First we want to separate our data into two pieces.
- `X` will denote the *explanatory variables*
- `y` will denote the *explained variables*

In [0]:
collated

In [0]:
#Data setup
X = np.asarray([[data[0]] for data in collated]) #X must be 2-D: a list of one-element lists, one row per data point
y = np.asarray([data[1] for data in collated])   #y is 1-D: one height per data point

In [0]:
print(X[0])
print(y[0])
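About the shape of `X`: scikit-learn expects it to be two dimensional, with one row per data point and one column per explanatory variable, even when there is only one variable. Here is a small sketch with made-up years, showing `reshape` as an alternative to the list-of-lists trick:

```python
import numpy as np

years = np.arange(1985, 1990)                      # 1-D array

X_from_lists = np.asarray([[yr] for yr in years])  # list of one-element lists
X_from_reshape = years.reshape(-1, 1)              # -1 means "infer the number of rows"

print(years.shape)
print(X_from_lists.shape)
print(X_from_reshape.shape)
print(np.array_equal(X_from_lists, X_from_reshape))
```

Either way you end up with one column and as many rows as data points.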

Now we *fit* the model on our data:

In [0]:
#Model fitting
ols.fit(X,y)
ols

Well that was a bit underwhelming...

How do we get the $\beta$s?

In [0]:
ols.coef_

In [0]:
ols.intercept_

Thus the predicted model is that in this time period, the height of 18 year old boys satisfies
$$
\text{height} = 17.78 + 0.075 \cdot \text{year}.
$$
How well does this explain the data?

In [0]:
list_plot([(X[i],y[i]) for i in range(len(X))])+plot(ols.intercept_+ols.coef_[0]*x,(1985,2020),color = 'green')

Looks pretty good! How do we interpret the model? The coefficient .075 says that we predict that every year, the average height of an 18 year old male increased by .075 centimeters. The nice thing about this is that it lets us make predictions! For instance, what do you think the average height of an 18 year old male was in 1984?

In [0]:
ols.predict([[1984]])

What about 2025?

In [0]:
ols.predict([[2025]])

## Quick Warning

As with *everything* in statistics/machine learning, the model can only go so far. For example, here is a prediction which is *obviously wrong*. People in the year 66 AD were not on average 22 centimeters tall:

In [0]:
ols.predict([[66]])

That would have made basketball way too difficult.

![](shaq.png)

People in the year 2000 BC were not negative centimeters tall:

In [0]:
ols.predict([[-2000]])

Here is a more subtle issue you may have with the model. Note the upswing in the data in recent years:

In [0]:
list_plot([(X[i],y[i]) for i in range(len(X))])+plot(ols.intercept_+ols.coef_[0]*x,(1985,2020),color = 'green')

Our prediction for 2025 is probably much too small!

In [0]:
ols.predict([[2025]])

In general, models can do a great job of predicting data *within the sample boundaries*. In other words, *interpolation* is much easier than *extrapolation*. For example, suppose we were missing data on the year 1995. We could do a good job filling that in:

In [0]:
ols.predict([[1995]])

The linear model has done a good job as a whole, *in the period 1985 - 2019*, at modelling the data. But in some sense it is too "simple" to accurately model all of the trends that we see. And it is dangerous to use the model to make predictions that are *too far* outside the range 1985 - 2019. Small extrapolation is ok, but you just have to be careful.

In [0]:
ols.predict([[3000]])
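Here is a toy illustration of why extrapolation is riskier. The data below is synthetic: I am inventing a trend with a slight upward bend (the numbers are not from the height data), fitting a straight line to 1985 - 2019 with `np.polyfit`, and comparing the worst error inside the sample to the error far outside of it.

```python
import numpy as np

def true_trend(year):
    # Invented trend: mostly linear, with a small quadratic bend
    t = year - 1985
    return 170.0 + 0.05 * t + 0.002 * t ** 2

years = np.arange(1985, 2020)
heights_toy = true_trend(years)

# Fit a straight line using the sample years only
slope, intercept = np.polyfit(years, heights_toy, 1)

def line(year):
    return intercept + slope * year

# Largest miss inside the sample vs. the miss at a year far outside it
worst_in_sample = np.max(np.abs(line(years) - heights_toy))
error_2060 = abs(line(2060) - true_trend(2060))

print(worst_in_sample)
print(error_2060)
```

The line tracks the bend reasonably well inside the sample, but the miss grows quickly once we leave it.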

## Multiple Linear Regression

What if we wanted to repeat this analysis, but for girls? One option would be to simply restart the process:

In [0]:
sex = 'Girls'
age = 18
collated = np.asarray([(year,heightDict[(sex,year, age)]) for year in range(1985,2020)])

X = np.asarray([[data[0]] for data in collated])
y = np.asarray([data[1] for data in collated])

ols = LinearRegression()
ols.fit(X,y)

In [0]:
ols.coef_

In [0]:
ols.intercept_

In [0]:
list_plot([(X[i],y[i]) for i in range(len(X))])+plot(ols.intercept_+ols.coef_[0]*x,(1985,2020),color = 'green')

And we can do it for younger age groups as well:

In [0]:
sex = 'Boys'
age = 5
collated = np.asarray([(year,heightDict[(sex,year, age)]) for year in range(1985,2020)])

X = np.asarray([[data[0]] for data in collated])
y = np.asarray([data[1] for data in collated])

ols = LinearRegression()
ols.fit(X,y)

In [0]:
ols.coef_

In [0]:
ols.intercept_

In [0]:
list_plot([(X[i],y[i]) for i in range(len(X))])+plot(ols.intercept_+ols.coef_[0]*x,(1985,2020),color = 'green')

Certainly we can do this, and it paints a useful picture. But wouldn't it be better if we could do it all at once?

This is the concept of a *multiple linear regression*. Let's not fix any of the variables, and instead try to write 
$$
\text{height} = \beta_0 + \beta_1 \cdot \text{sex} + \beta_2 \cdot \text{year} + \beta_3 \cdot \text{age}.
$$
This almost makes sense, except that the first variable is a string!

This is called a *categorical variable*. We can transform it into something numeric using the rule
$$
\text{Girls} \to 0,\;\; \text{Boys} \to 1.
$$

## ******************* Participation Check *******************************
Write a function `categorize(s)` which does the above categorization: `categorize('Girls')` gives `0`, etc.

In [0]:
def categorize(s):
    #Your code here

## ********************************************************************
Now multiple linear regression *as a concept* is more complicated, but in practice it is *just as easy* to fit the model. We start by making an array of explanatory variables and the explained variable again. In this case, we have:
- `X`, an array with entries of the form `(sex, year, age)`
- `y`, an array with heights as entries

Note that we will use our `categorize` function to turn the strings `'Girls'` and `'Boys'` into numbers.
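In case you got stuck on the participation check, here is one possible `categorize`. This is just a sketch of one approach (a dictionary lookup). Raising an error on unexpected strings is my own choice, not a requirement of the exercise.

```python
def categorize(s):
    # Encode the categorical variable using the rule Girls -> 0, Boys -> 1
    codes = {'Girls': 0, 'Boys': 1}
    if s not in codes:
        raise ValueError(f"unexpected category: {s!r}")
    return codes[s]

print(categorize('Girls'), categorize('Boys'))
```

An `if`/`else` would work just as well; the dictionary simply makes it easy to add more categories later.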

In [0]:
X = np.asarray([[categorize(key[0]),key[1],key[2]] for key in heightDict])
y = np.asarray([heightDict[key] for key in heightDict])

print(X[0])
print(y[0])

In [0]:
ols = LinearRegression()
ols.fit(X,y)

In [0]:
ols.coef_

In [0]:
ols.intercept_

How to interpret this? The model is predicting that
$$
\text{height} = 3.48 \cdot \text{sex} + 0.093 \cdot \text{year} + 4.26 \cdot \text{age} - 98.10
$$

It's very important to know how to read this, so let's go through coefficient by coefficient:
- Holding *everything else equal*, a boy is roughly 3.48 cm taller than a girl.
- Holding *everything else equal*, a child in year $y+1$ was roughly .093 cm taller than a child in year $y$.
- Holding *everything else equal*, a child who is $a+1$ years old is roughly 4.26 cm taller than a child who is $a$ years old.

This can be a bit confusing, so let's really drive this home. What does "everything else equal" mean? I mean *everything else about these children is the same*. It is crucial that you only modify one variable at a time. For example:
- The average 8 year old boy in 1994 was roughly 3.48 cm taller than the average 8 year old girl in 1994.
- The average 8 year old boy in 1994 was roughly .093 cm taller than the average 8 year old boy in 1993.
- The average 8 year old boy in 1994 was roughly 4.26 cm taller than the average 7 year old boy in 1994.

Note that in every case above, only one thing changes between these children at a time.

Could you string these together to compare multiple variables at once? In principle, yes. But it just gets a bit messier and harder to analyze.
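To see what stringing these together looks like, here is a quick sketch which plugs the rounded coefficients from the model above into a plain Python function. The name `predicted_height` is just something I made up for illustration, and because the coefficients are rounded, the outputs are only approximate.

```python
def predicted_height(sex, year, age):
    # Rounded coefficients from the fitted model above; sex: Girls = 0, Boys = 1
    return 3.48 * sex + 0.093 * year + 4.26 * age - 98.10

# Change one variable at a time: each difference is a single coefficient
sex_gap = predicted_height(1, 1994, 8) - predicted_height(0, 1994, 8)
year_gap = predicted_height(1, 1994, 8) - predicted_height(1, 1993, 8)
age_gap = predicted_height(1, 1994, 8) - predicted_height(1, 1994, 7)

# Change all three at once: the individual differences simply add up
combined_gap = predicted_height(1, 1994, 8) - predicted_height(0, 1993, 7)

print(round(sex_gap, 3), round(year_gap, 3), round(age_gap, 3))
print(round(combined_gap, 3))
```

So the model predicts that an 8 year old boy in 1994 is taller than a 7 year old girl in 1993 by the sum of the three coefficients, but notice how much harder that sentence is to read than the one-variable comparisons.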

Note: On Quiz 2, I will ask *at least one question* about interpreting a coefficient in a regression.

## What's Next?

Well, in practice you want to do *much more analysis* of the model that comes out of OLS. Maybe you have heard of things like t-tests, p-values, F-tests, R-squared, etc. Since this is just a "user's guide" to regressions, I don't know how much of that we will cover, but surely I'll at least mention some of it. For now, we are going to pretend that all of the coefficients are "statistically significant" (don't tell an actual statistician that I'm doing this, they will yell at me). Your main focus should be getting comfortable with what these coefficients predict.

On Wednesday we will take a quick break from statistical computations to deal with *Pandas*, which is a wonderful module for data handling. It is kinda like Excel but for cool kids, like us. Next Wednesday we will then come back to more statistical stuff, with Pandas in our toolkit. Afterwards we will begin our discussion of natural language processing. After that, I'm not entirely sure (we are in some sense way ahead of schedule).