# Basic Statistics with Python

1. Simple distribution plots
2. Confidence intervals
3. Hypothesis testing
4. Simple linear models

### 1. Simple Distribution Plots

In [None]:
# Import packages that you need
from sklearn import datasets
import pandas as pd
import numpy as np

In [None]:
# Load the iris dataset from sklearn
iris = datasets.load_iris()

# Convert to a dataframe: feature columns plus a numeric 'target' column
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])

In [None]:
# Take a look
df.head()

In [None]:
# Graph boxplot distributions for the dataframe columns
%matplotlib inline
df.plot.box(figsize=(12, 8))

In [None]:
# Graph a scatterplot matrix to visualize distribution and interaction of each column/component 
%matplotlib inline
pd.plotting.scatter_matrix(df, figsize=(12, 12))

### 2. Confidence Intervals

In [None]:
# First we need some more packages
import scipy.stats  # import the submodule explicitly so scipy.stats is available
import matplotlib.pyplot as plt

In [None]:
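# https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html
# Interval calculated by scipy

# Let's calculate the 95% interval for "sepal width (cm)".
# Note: with scale set to the standard deviation, this is the central range
# expected to hold 95% of observations under a normal model. A confidence
# interval for the mean would instead use the standard error, std / sqrt(n).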
mean_sw = df['sepal width (cm)'].mean()
std_sw = df['sepal width (cm)'].std()

# Store confidence interval in the variable ci
ci = scipy.stats.norm.interval(0.95,
                               loc=mean_sw,
                               scale=std_sw)

In [None]:
print("Mean sepal width (cm): ", mean_sw)
print("Standard deviation sepal width (cm): ", std_sw)
print("LL, UL 95% bounds: ", ci)

In [None]:
# Graph the confidence interval
%matplotlib inline

# Plot histogram
plt.hist(df['sepal width (cm)'], bins=10)

# Plot the interval bounds as red vertical lines.
# axvline spans the full height of the axes by default; ymin and ymax are
# fractions of the axes height (0 to 1), not data values.
plt.axvline(x=ci[0], color='red')
plt.axvline(x=ci[1], color='red')
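
The interval above uses the standard deviation as its scale, so it describes the spread of individual observations. A confidence interval for the mean uses the standard error instead and is much narrower. The following is a minimal sketch of the difference, assuming a synthetic sample (the values are generated for illustration and are not the iris data):

```python
import numpy as np
import scipy.stats

# Hypothetical sample standing in for a measured column (not the iris data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=3.0, scale=0.4, size=150)

n = sample.size
mean = sample.mean()
std = sample.std(ddof=1)      # sample standard deviation, like pandas .std()
sem = std / np.sqrt(n)        # standard error of the mean

# Central 95% range of a normal distribution fitted to the data
spread_95 = scipy.stats.norm.interval(0.95, loc=mean, scale=std)

# 95% confidence interval for the mean, using the t distribution
ci_mean = scipy.stats.t.interval(0.95, n - 1, loc=mean, scale=sem)

print("95% range of observations: ", spread_95)
print("95% CI for the mean: ", ci_mean)
```

The t distribution is used here because the standard deviation is estimated from the sample; for a sample of this size the result is close to the normal-based interval.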

### 3. Hypothesis Testing

The "target" column encodes three different species of iris. In this section we use hypothesis tests to compare the target levels on features such as petal length.

1. Null Hypothesis: There is no difference in mean petal length (cm) between target levels 0.0 and 1.0

We will compare the column values for petal length (cm) between the target levels 0.0 and 1.0. We will first need to get each data set in the form of an array.

In [None]:
# Petal length (cm) values for target level 0.0
petallength_0 = df.loc[df['target'] == 0.0, 'petal length (cm)'].values

# Petal length (cm) values for target level 1.0
petallength_1 = df.loc[df['target'] == 1.0, 'petal length (cm)'].values

In [None]:
# Welch's t-test for unequal variances
t, p = scipy.stats.ttest_ind(petallength_0, petallength_1, equal_var=False)

print("t-value: ", t)
print("p-value: ", p)
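
To see what `ttest_ind` with `equal_var=False` computes, the sketch below reproduces Welch's t statistic and its p-value by hand and compares them with scipy's result. The two samples are hypothetical, generated only for illustration:

```python
import numpy as np
import scipy.stats

# Hypothetical samples with unequal variances (not the iris data)
rng = np.random.default_rng(1)
a = rng.normal(loc=1.5, scale=0.2, size=50)
b = rng.normal(loc=1.8, scale=0.5, size=50)

# Variance of each group mean: s^2 / n, using the sample variance (ddof=1)
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t statistic and Welch-Satterthwaite degrees of freedom
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)
dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-sided p-value from the t distribution
p_manual = 2 * scipy.stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = scipy.stats.ttest_ind(a, b, equal_var=False)
print("manual: ", t_manual, p_manual)
print("scipy:  ", t_scipy, p_scipy)
```

A small p-value is evidence against the null hypothesis of equal means; the threshold (for example 0.05) should be chosen before looking at the data.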

When we have more than two groups to compare, we can use one-way ANOVA, which calculates an F-statistic. The null hypothesis is that all groups share the same mean. A significant F-statistic indicates that at least one target level (0, 1, or 2) has a different mean petal length (cm) from the others; it does not say which one.

In [None]:
# Petal length (cm) values for target level 0.0
petallength_0 = df.loc[df['target'] == 0.0, 'petal length (cm)'].values

# Petal length (cm) values for target level 1.0
petallength_1 = df.loc[df['target'] == 1.0, 'petal length (cm)'].values

# Petal length (cm) values for target level 2.0
petallength_2 = df.loc[df['target'] == 2.0, 'petal length (cm)'].values

In [None]:
# One-way ANOVA across the three target levels
f, p = scipy.stats.f_oneway(petallength_0, petallength_1, petallength_2)

print("f-stat: ", f)
print("p-value: ", p)
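
The F-statistic is the ratio of the variation between group means to the variation within groups. The sketch below computes it by hand and checks it against `f_oneway`, assuming three hypothetical groups generated for illustration:

```python
import numpy as np
import scipy.stats

# Three hypothetical groups (not the iris data)
rng = np.random.default_rng(2)
groups = [
    rng.normal(loc=1.5, scale=0.3, size=50),
    rng.normal(loc=1.6, scale=0.3, size=50),
    rng.normal(loc=1.8, scale=0.3, size=50),
]

k = len(groups)                            # number of groups
n_total = sum(g.size for g in groups)      # total number of observations
grand_mean = np.concatenate(groups).mean()

# Sum of squares between groups and within groups
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = scipy.stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = scipy.stats.f_oneway(*groups)
print("manual: ", f_manual, p_manual)
print("scipy:  ", f_scipy, p_scipy)
```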

### 4. Simple Linear Models

Ordinary least squares (OLS) regression fits a linear model of a response on a set of predictor columns. For each predictor it estimates a coefficient and reports a t-test of whether that coefficient differs from zero, so we can assess all columns in one fit instead of grinding through a separate F-test for each. Note that "target" is a categorical label coded as 0, 1, and 2, so treating it as a numeric response is a simplification made here for illustration.

In [None]:
import statsmodels.api as sm

In [None]:
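# Create linear model: regress target on all other columns.
# Note: sm.OLS does not add an intercept term automatically; wrap the
# predictors in sm.add_constant(...) if an intercept is wanted.
model = sm.OLS(endog=df['target'],
               exog=df[[c for c in df.columns if c != 'target']]).fit()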

In [None]:
print(model.summary())
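
The model above is fitted without an intercept, because `sm.OLS` uses the predictor matrix exactly as given. The sketch below illustrates why that matters, using plain NumPy least squares (`np.linalg.lstsq`) in place of statsmodels and hypothetical noise-free data. Adding a column of ones to the predictors is, in effect, what `sm.add_constant` does:

```python
import numpy as np

# Hypothetical noise-free data: y = 2 + 3x
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 3.0 * x

# Fit without an intercept: the line is forced through the origin
X_no_const = x.reshape(-1, 1)
coef_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# Fit with an intercept: prepend a column of ones to the predictors
X_const = np.column_stack([np.ones_like(x), x])
coef_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print("slope without intercept: ", coef_no_const)
print("intercept and slope: ", coef_const)
```

Without the column of ones the slope is biased upward to absorb the offset, while the fit with it recovers the intercept and slope used to generate the data.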