# Simple Statistics in Python

I'm going to show you how to run some simple statistics using Python.

In general, Python is very powerful for machine learning (e.g., scikit-learn, TensorFlow), while R is designed for statistics, and cutting-edge statistical tools typically show up there first. That being said, all of the basic tools of a social science researcher are available in Python.

In this notebook, I show you how to run some basic statistical tests and models using the [scipy stats module](https://docs.scipy.org/doc/scipy/reference/stats.html). I am assuming that you already have a working knowledge of what these statistical tests do. I am just showing you how to perform them in Python.

* Note: I personally do most of my statistical modeling in R, so I may be missing some of the tools that pure Python researchers would be aware of.



In [None]:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Getting the data

I'm just going to create some random data.

In [None]:
X1 = stats.norm.rvs(size = 100) # 100 random, normally distributed values
X2 = stats.norm.rvs(size = 100)
X3 = stats.norm.rvs(size = 100)
group = np.random.choice(['A','B','C'], size=100)
# Our outcome is influenced by X1, X2, and the group, plus some random noise
Y = 1.5 * X1 - 2.3 * X2 + 3 * (group == 'A') + 1.2 * (group == 'B') + stats.norm.rvs(size = 100)

# We can store these in a data frame
df = pd.DataFrame({'X1':X1,
                   'X2':X2,
                   'X3': X3,
                   'group':group,
                   'Y':Y})
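
The values above change every time the cell runs. If you want numbers you can reproduce, one option is to draw everything from a seeded NumPy generator. This is only a sketch: the seed (42) and the `_seeded` names are my own arbitrary choices, and nothing else in this notebook depends on them.

```python
import numpy as np
import pandas as pd

# A seeded generator produces the same "random" draws on every run
rng = np.random.default_rng(42)

X1_seeded = rng.normal(size=100)
X2_seeded = rng.normal(size=100)
X3_seeded = rng.normal(size=100)
group_seeded = rng.choice(['A', 'B', 'C'], size=100)

# Same recipe as above: Y depends on X1, X2, and the group, plus noise
Y_seeded = (1.5 * X1_seeded - 2.3 * X2_seeded
            + 3 * (group_seeded == 'A') + 1.2 * (group_seeded == 'B')
            + rng.normal(size=100))

df_seeded = pd.DataFrame({'X1': X1_seeded, 'X2': X2_seeded, 'X3': X3_seeded,
                          'group': group_seeded, 'Y': Y_seeded})
print(df_seeded.head())
```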

## Univariate statistics

There are lots of univariate statistics we can get - mean, median, quartiles, quantiles, etc.

In [None]:
# These all use numpy. This is the mean
np.mean(X1)

In [None]:
# And this is how you do the same thing with data in a data frame.
# Data frame columns are backed by numpy arrays, so the numpy functions
# should work on them for any of these statistics.

# Numpy way
np.mean(df.X1)

In [None]:
# Pandas also has a number of statistics built in, which you can apply directly
df.X1.mean()

In [None]:
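# For all numeric columns (the string column `group` has no mean, and
# recent pandas versions raise an error if it is included)
df.mean(numeric_only=True)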

In [None]:
# Pandas is obviously great for doing grouping, which you often want for this
# type of statistics. "aggregate" lets you get multiple statistics

# Passing the statistic names as strings avoids a deprecation warning in
# recent pandas versions
df.groupby('group').aggregate(['mean', 'median'])

In [None]:
# You can even write your own custom functions to aggregate

def mean_plus_1(array):
    array = array + 1
    return np.mean(array)

df.groupby('group').agg(mean_plus_1)
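
If you only want a few statistics for one column, pandas' named aggregation lets you pick the output column names yourself. A small sketch on made-up data (the `demo` frame and the `y_mean`, `y_median`, and `n` names are just for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
demo = pd.DataFrame({'group': rng.choice(['A', 'B', 'C'], size=90),
                     'Y': rng.normal(size=90)})

# Each keyword is an output column: (input column, statistic)
summary = demo.groupby('group').agg(
    y_mean=('Y', 'mean'),
    y_median=('Y', 'median'),
    n=('Y', 'size'),
)
print(summary)
```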

## Exercise 1

Get the median and the 25th percentile value for X1

In [None]:
# Your code here


More built-in functions

In [None]:
# The describe function lists a number of these
stats.describe(X1)

In [None]:
# This also works for dataframes, with a different set of stats
df.describe()
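
The object that `stats.describe` returns has named fields, so you can pull out individual statistics rather than reading them off the printed result. A minimal sketch on its own random sample (the variable `x` is just for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)

summary = stats.describe(x)
print(summary.nobs)                       # number of observations
print(summary.minmax)                     # (minimum, maximum)
print(summary.mean, summary.variance)     # variance uses ddof=1 by default
print(summary.skewness, summary.kurtosis)
```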

## Bi-variate statistics

### Correlations
Scipy has both Pearson's correlation and Spearman's rank correlation.

In [1]:
# These 2 should not be correlated. 
stats.pearsonr(X1, X2)
# the first value returned is R, the second is the p-value

In [None]:
# These should be correlated, on the other hand
stats.pearsonr(X1, Y)

In [None]:
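# For pandas, you can get a correlation matrix. Select the numeric columns
# first, since correlations are undefined for the string column `group`
df[['X1', 'X2', 'X3', 'Y']].corr()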

In [None]:
# Or just pass the columns you are interested in
stats.pearsonr(df.X1, df.X2)

In [None]:
stats.spearmanr(X1, X2)

In [None]:
stats.spearmanr(X1, Y)
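
To see where the two measures differ, it can help to try them on a relationship that is monotonic but not linear. The sketch below uses its own made-up data (`x` and `y = exp(x)`), not the variables from this notebook. Spearman works on ranks, so it should report an essentially perfect correlation, while Pearson comes out lower because the relationship is not a straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = np.exp(x)  # always increasing in x, but strongly curved

# Both functions return the correlation first and the p-value second
r, p_r = stats.pearsonr(x, y)
rho, p_rho = stats.spearmanr(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```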

### T-tests

T-tests test whether 2 distributions have the same mean.

For our data, X1-X3 all should have the same mean, but Y should differ from any of them.

In [None]:
stats.ttest_ind(X1, X2)

In [None]:
stats.ttest_ind(X3, Y)
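
By default `ttest_ind` assumes the two samples have equal variances. If that is doubtful, passing `equal_var=False` switches to Welch's t-test. Here is a sketch on made-up samples with different sizes and spreads (`a` and `b` are illustration names only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(loc=0.0, scale=1.0, size=100)
b = rng.normal(loc=2.0, scale=3.0, size=30)  # shifted mean, larger spread, fewer points

# Default: Student's t-test with a pooled variance estimate
student = stats.ttest_ind(a, b)
# equal_var=False: Welch's t-test, which does not pool the variances
welch = stats.ttest_ind(a, b, equal_var=False)

print(f"Student: t = {student.statistic:.2f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.2f}, p = {welch.pvalue:.4f}")
```

The result object exposes `.statistic` and `.pvalue`, which is handy when you want to use the numbers later instead of just displaying them.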

## Exercise 2

Write some code that compares the correlations of each set of variables and prints the two variables with the highest correlation.

Hint: You will probably want to use two for loops (although there may also be a tricky way to do this with pandas)

In [None]:
## Your code here

## Multivariate statistics

### Chi-squared test

The chi-squared test of independence checks whether the frequency of an outcome is independent of group membership. So, we'll need to change Y into something that has a frequency.

The following code will produce the 2 rows of a table. The first row (`large_y_counts`) is the number of large Y values (above the median) per group. The second (`small_y_counts`) is the number of small Y values per group.

In [None]:
large_y_counts = []
small_y_counts = []
Y_med = np.median(Y)
for g in ['A','B','C']:
    large_y_count = 0
    small_y_count = 0
    # Instead of looping through the values, we loop through the index.
    # That way we can use the same index (i) to get the value of `Y[i]` and
    # the value of `group[i]`
    for i in range(len(Y)):
        if group[i] == g:
            if Y[i] > Y_med:
                large_y_count += 1
            else:
                small_y_count += 1
    large_y_counts.append(large_y_count)
    small_y_counts.append(small_y_count)

In [None]:
# Like many things, this could be done more quickly with pandas
df['large_y'] = df.Y > df.Y.median()

large_y_counts = df.loc[df.large_y==True,:].groupby('group').Y.count()
small_y_counts = df.loc[df.large_y==False,:].groupby('group').Y.count()

In [None]:
# Now, we call the Chi-squared test
stats.chi2_contingency([large_y_counts, small_y_counts])
# This returns the Chi-square value, a p-value, degrees of freedom, and the expected counts.
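
Another way to build the same kind of table is `pd.crosstab`, which counts every combination of two columns in one call. The sketch below runs on its own made-up data frame (`demo` is a hypothetical name), so it does not touch the variables above.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)
demo = pd.DataFrame({'group': rng.choice(['A', 'B', 'C'], size=100),
                     'Y': rng.normal(size=100)})
demo['large_y'] = demo.Y > demo.Y.median()

# Rows: small (False) / large (True) Y; columns: group
table = pd.crosstab(demo.large_y, demo.group)
print(table)

# The result unpacks into the statistic, p-value, degrees of freedom,
# and the table of expected counts
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p, dof)
```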

### ANOVA

This tests whether multiple groups all have the same population mean.

In [None]:
# these all should
stats.f_oneway(X1,X2,X3)

In [None]:
# but adding Y should change it

stats.f_oneway(X1,X2,X3,Y)
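
A more typical use of ANOVA is comparing one outcome across the levels of a grouping variable. Here is a sketch of how that might look with a pandas groupby, using made-up data that follows the same recipe as our group effect (`demo`, `grp`, and `y` are illustration names only):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(4)
grp = rng.choice(['A', 'B', 'C'], size=150)
# Hypothetical outcome whose mean depends on the group
y = 3 * (grp == 'A') + 1.2 * (grp == 'B') + rng.normal(size=150)
demo = pd.DataFrame({'group': grp, 'Y': y})

# One array of Y values per group, passed to f_oneway as separate arguments
samples = [values.to_numpy() for _, values in demo.groupby('group')['Y']]
f_stat, p_value = stats.f_oneway(*samples)
print(f"F = {f_stat:.2f}, p = {p_value:.2g}")
```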

### Linear Regression

In [None]:
# Simple linear regression is possible with scipy stats
stats.linregress(X1, Y)

### Multiple Linear Regression

In [None]:
# For multiple regression we need to use something else. One option is sklearn,
# the machine learning package. Another, maybe simpler, option is statsmodels,
# which I show here:

import pandas as pd
import statsmodels.formula.api as sm

In [None]:
df = pd.DataFrame({'X1':X1,
                   'X2':X2,
                   'X3': X3,
                   'group':group,
                   'Y':Y})

In [None]:
result = sm.ols(formula="Y ~ X1 + X2 + X3 + group", data=df).fit()

In [None]:
result.summary()

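Note the benefit of multiple regression: because the model also accounts for X2 and group, the coefficient for X1 is usually much closer to the true coefficient (1.5) than the slope from the simple regression above.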
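
To make that comparison concrete without statsmodels, here is a sketch that fits both models on made-up data. The multiple regression uses plain least squares via `np.linalg.lstsq`, which I am swapping in for illustration; it is not what statsmodels prints, only the same coefficient estimates computed by hand. The names `x1`, `x2`, `y`, and `design` are my own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.5 * x1 - 2.3 * x2 + rng.normal(size=n)

# Simple regression: the effect of x2 is left in the residual noise
simple = stats.linregress(x1, y)

# Multiple regression by least squares; columns are intercept, x1, x2
design = np.column_stack([np.ones(n), x1, x2])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)

print(f"simple slope for x1:  {simple.slope:.3f} (stderr {simple.stderr:.3f})")
print(f"multiple coef for x1: {coefs[1]:.3f}")
print(f"multiple coef for x2: {coefs[2]:.3f}")
```

Both estimates target 1.5, but the simple slope is noisier because everything the model leaves out counts as error.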

## Exercise 3

Google to figure out how to output this table as text that you could put into a Word document

In [None]:
# Your code here

### Very brief intro to machine learning

Per a request, I'm going to show how to run a random forest model using sklearn.

In [None]:
from sklearn.ensemble import RandomForestRegressor

# sklearn's random forest needs numeric inputs, so we convert the
# categorical column to dummy (indicator) variables
rf_df = pd.get_dummies(df)

Y = rf_df.pop('Y').values
Y_train = Y[:50]
Y_test = Y[50:]
X_train = rf_df[:50]
X_test = rf_df[50:]
clf = RandomForestRegressor(n_estimators=50)
clf.fit(X_train,Y_train)

In [None]:
clf.feature_importances_

In [None]:
feature_importance = pd.Series(clf.feature_importances_, index = X_train.columns.values)

In [None]:
feature_importance.plot.bar()

In [None]:
clf.predict(X_test)

In [None]:
clf.score(X_test, Y_test)
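
Slicing off the first 50 rows works here because the rows are already in random order. With real data you would usually shuffle before splitting. Below is a sketch of one way to do that with sklearn's `train_test_split`, on made-up data; the 25% test size, the `random_state` values, and all variable names are my own choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
demo = pd.DataFrame({'X1': rng.normal(size=200),
                     'X2': rng.normal(size=200),
                     'group': rng.choice(['A', 'B', 'C'], size=200)})
target = (1.5 * demo.X1 - 2.3 * demo.X2
          + 3 * (demo.group == 'A') + rng.normal(size=200))

features = pd.get_dummies(demo)  # one indicator column per group level

# Shuffle and hold out 25% of the rows; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(features, target,
                                          test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # R^2 on the held-out rows
```

For a regressor, `score` reports R^2 on the data you pass in, so values closer to 1 mean better predictions on the held-out rows.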

## Exercise 4

Think of a question in some data we've used before (crash data, Twitter data, reddit data) that a statistical test would help to answer and apply one of those covered above.