# Python Statistics

## Collecting and Cleaning Data

### 01 - Loading data


Goals:

1. Load data from a CSV file using the `pd.read_csv` function.
2. Understand how to access and interpret the shape of a DataFrame.
3. Apply the `.describe` method to obtain summary statistics for a DataFrame.

In [None]:
import numpy as np
import pandas as pd
pd.__version__

In [None]:
import pandas as pd
url = 'AmesHousing.csv'
df = pd.read_csv(url, engine='pyarrow', dtype_backend='pyarrow')

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.describe()

### 02 - Strings and Categories


Goals:

* Understand the data types of columns in a DataFrame using the `.dtypes` attribute.
* Select and filter categorical columns using the `.select_dtypes` method.
* Compute and interpret summary statistics for categorical columns using the `.describe` method.
* Determine the memory usage of string columns in a DataFrame.
* Convert string columns to the `'category'` data type to save memory.


In [None]:
df.dtypes

In [None]:
# Categoricals - Pandas 2
df.select_dtypes('string')  # or 'string[pyarrow]'

In [None]:
# Categoricals
df.select_dtypes('string').describe().T

In [None]:
(df
 .select_dtypes('string')
 .memory_usage(deep=True)
 .sum()
)
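
The last goal above (converting string columns to `'category'`) can be sketched on a small made-up frame. The `toy` DataFrame and its `zoning` column are hypothetical stand-ins, not the Ames data; the sketch only assumes that the column holds few distinct values repeated many times.

```python
import pandas as pd

# Hypothetical toy data: a low-cardinality string column repeated many times.
toy = pd.DataFrame({'zoning': ['RL', 'RM', 'FV', 'RL'] * 1_000})

# Plain Python strings stored one object per row.
as_object = toy['zoning'].astype('object')
# Categorical: small integer codes plus one copy of each distinct label.
as_category = toy['zoning'].astype('category')

object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The saving depends on how many distinct values the column has; a column where nearly every value is unique gains little from the conversion.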

## Exploring & Visualizing

### 03 - Categorical Exploration

Goals:

* Explore a categorical column, such as "MS Zoning," by accessing the column and displaying its unique values.
* Visualize the value counts of a categorical column using a bar chart.
* Visualize the value counts of a categorical column using a horizontal bar chart.

In [None]:
import pandas as pd
url = 'data/ames-housing-dataset.zip'
raw = pd.read_csv(url, engine='pyarrow', dtype_backend='pyarrow')

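# make function
def shrink_ints(df):
    """Downcast non-negative int64[pyarrow] columns to the smallest unsigned type."""
    mapping = {}
    for col in df.dtypes[df.dtypes=='int64[pyarrow]'].index:
        max_ = df[col].max()
        min_ = df[col].min()
        # Unsigned types cannot hold negative values, so leave those columns alone
        if min_ < 0:
            continue
        # Use inclusive bounds: 255, 65_535 and 4_294_967_295 still fit
        if max_ <= 255:
            mapping[col] = 'uint8[pyarrow]'
        elif max_ <= 65_535:
            mapping[col] = 'uint16[pyarrow]'
        elif max_ <= 4_294_967_295:
            mapping[col] = 'uint32[pyarrow]'
    return df.astype(mapping)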


def clean_housing(df):
    return (df
     .assign(**df.select_dtypes('string').replace('', 'Missing').astype('category'),
             **{'Garage Yr Blt': df['Garage Yr Blt'].clip(upper=df['Year Built'].max())})
     .pipe(shrink_ints)
    )    

housing = clean_housing(raw)

In [None]:
housing.describe()

In [None]:
# categoricals
(housing
  ['MS Zoning'])

In [None]:
# categoricals
(housing
  ['MS Zoning']
  .value_counts())

In [None]:
# categoricals
(housing
  ['MS Zoning']
  .value_counts()
  .plot.bar())

In [None]:
# categoricals
(housing
  ['MS Zoning']
  .value_counts()
  .plot.barh())

### 04 - Histograms and Distributions

Goals:

* Obtain descriptive statistics of the "SalePrice" column using the `.describe` method.
* Visualize the distribution of the "SalePrice" column using a histogram.
* Customize the histogram by specifying the number of bins using the `bins` parameter.

In [None]:
# Numerical
(housing
 .SalePrice
 .describe()
)

In [None]:
# Numerical
(housing
 .SalePrice
 .hist()
)

In [None]:
# Numerical
(housing
 .SalePrice
 .hist(bins=130)
)

### 05 - Correlations

Goals:

* Calculate the Pearson correlation
* Calculate the Spearman correlation 
* Color a correlation matrix appropriately

In [None]:
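# Pearson correlation (the default method)
# Note: in pandas 2 this may raise if the frame has non-numeric columns;
# pass numeric_only=True to restrict the calculation to numeric columns
housing.corr()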

In [None]:
housing.corr(numeric_only=True)

In [None]:
(housing
 .corr(method='spearman', numeric_only=True)
 .style
 .background_gradient()
)

In [None]:
(housing
 .corr(method='spearman', numeric_only=True)
 .style
 .background_gradient(cmap='RdBu')
)

In [None]:
(housing
 .corr(method='spearman', numeric_only=True)
 .style
 .background_gradient(cmap='RdBu', vmin=-1, vmax=1)
)
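
To see why the cells above switch from Pearson to Spearman, here is a minimal sketch on synthetic data (the `x` and `y` series are made up for illustration). Pearson measures linear association, while Spearman works on ranks and so only asks whether the relationship is monotonic.

```python
import numpy as np
import pandas as pd

# Synthetic data: y always increases with x, but far from linearly.
x = pd.Series(np.arange(1, 21), dtype='float64')
y = np.exp(x / 3)

pearson = x.corr(y)                      # method='pearson' is the default
spearman = x.corr(y, method='spearman')  # correlation of the ranks
print(pearson, spearman)
```

Because the ranks of `x` and `y` agree perfectly, Spearman reports a perfect relationship, while Pearson is pulled down by the curvature.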

### 06 - Scatter Plots

Goals:

* Create a scatter plot
* Set transparency
* Jitter plot values

In [None]:
(housing
 .plot
 .scatter(x='Year Built', y='Overall Cond')
)

In [None]:
housing['Year Built'].corr(housing['Overall Cond'], method='spearman')

In [None]:
(housing
 .plot
 .scatter(x='Year Built', y='Overall Cond', alpha=.1)
)

In [None]:
# with jitter in y
(housing
 .assign(**{'Overall Cond': housing['Overall Cond'] + np.random.random(len(housing))*.8 -.4})
 .plot
 .scatter(x='Year Built', y='Overall Cond', alpha=.1)
)

In [None]:
# make function
def jitter(df_, col, amount=.5):
    return (df_
            [col] + np.random.random(len(df_))*amount - (amount/2))
    
(housing
 .assign(**{'Overall Cond': jitter(housing, 'Overall Cond', amount=.8)})
 .plot
 .scatter(x='Year Built', y='Overall Cond', alpha=.1)
)
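
The jitter above changes on every run because it draws from NumPy's global random state. A reproducible variant might look like the sketch below; `jitter_seeded`, its `seed` parameter, and the `ratings` series are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np
import pandas as pd

def jitter_seeded(ser, amount=0.5, seed=42):
    """Add uniform noise in [-amount/2, amount/2) using a seeded generator."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-amount / 2, amount / 2, size=len(ser))
    return ser + noise

# Made-up integer ratings standing in for 'Overall Cond'
ratings = pd.Series([5, 5, 6, 7, 5], name='Overall Cond')
jittered = jitter_seeded(ratings, amount=0.8)
print(jittered)
```

Keeping `amount` below 1 means jittered points from neighboring integer ratings never overlap.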

In [None]:

(housing
 #.assign(**{'Overall Cond': housing['Overall Cond'] + np.random.random(len(housing))*.8 -.4})
 .plot
 .hexbin(x='Year Built', y='Overall Cond', alpha=1, gridsize=18)
)

### 07 - Visualizing Categoricals and Numerical Values

Goals:

* Create a box plot of a single column
* Create a box plot of multiple columns
* Use the `.pivot` method
* Use Seaborn to create other distribution plots by category

In [None]:
# Numerical and categorical
(housing
 #.assign(**{'Overall Cond': housing['Overall Cond'] + np.random.random(len(housing))*.8 -.4})
 .plot
 .box(x='Year Built', y='Overall Cond')
)

In [None]:
# Make multiple box plots
(housing
 .pivot(columns='Year Built', values='Overall Cond')
 .apply(lambda ser: ser[~ser.isna()].reset_index(drop=True))
 .plot.box()
)
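
The `.pivot` plus `.apply` chain above is easier to follow on a tiny frame. The sketch below uses a made-up `toy` DataFrame with hypothetical `year` and `cond` columns: pivoting spreads each row's value into the column for its year (leaving missing values elsewhere), and the `apply` step drops those gaps so every column is packed from the top.

```python
import pandas as pd

# Made-up data: two houses from 1990, three from 2000
toy = pd.DataFrame({'year': [1990, 1990, 2000, 2000, 2000],
                    'cond': [5, 6, 7, 5, 8]})

# One column per year; each original row has a value in exactly one column
wide = toy.pivot(columns='year', values='cond')

# Drop the gaps in each column and renumber, so columns line up from row 0
compact = wide.apply(lambda ser: ser.dropna().reset_index(drop=True))
print(wide)
print(compact)
```

The compacted frame is only as tall as the largest group, which is the shape `.plot.box()` needs to draw one box per column.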

In [None]:
(housing
 .pivot(columns='Year Built', values='Overall Cond')
 .apply(lambda ser: ser[~ser.isna()].reset_index(drop=True))
 .loc[:, [1900, 1920, 1940, 1960, 1980, 2000]]
 .plot.box()
)

In [None]:
1993 // 10

In [None]:
# Group by decade
(housing
 .assign(decade=(housing['Year Built']//10 ) * 10)
 .pivot(columns='decade', values='Overall Cond')
 .apply(lambda ser: ser[~ser.isna()].reset_index(drop=True))
 .plot.box()
)

In [None]:
# or use seaborn
import seaborn as sns

sns.boxplot(data=housing, x='Year Built', y='Overall Cond')

In [None]:
sns.boxplot?


In [None]:
sns.boxplot(data=housing, x='Year Built', y='Overall Cond',
            order=[1900, 1920, 1940]
)

In [None]:
sns.violinplot(data=housing, x='Year Built', y='Overall Cond',
            order=[1900, 1920, 1940]
)

In [None]:
sns.boxenplot(data=housing, x='Year Built', y='Overall Cond',
            order=[1900, 1920, 1940]
)

## Hypothesis Test

### 08 - Exploring Data

Goals:

* Explore summary statistics by group


In [None]:
from scipy import stats
housing.Neighborhood.value_counts()

In [None]:
(housing
 .groupby('Neighborhood')
 .describe())

In [None]:
(housing
 .groupby('Neighborhood')
 .describe()
 .loc[['CollgCr', 'NAmes'], ['SalePrice']]
)

In [None]:
(housing
 .groupby('Neighborhood')
 .describe()
 .loc[['CollgCr', 'NAmes'], ['SalePrice']]
 .T
)

### 09 - Visualizing Distributions

Goals:

* Make histograms of both distributions
* Make a cumulative distribution plot

In [None]:
n_ames = (housing
          .query('Neighborhood == "NAmes"')
          .SalePrice)
college_cr = (housing
          .query('Neighborhood == "CollgCr"')
          .SalePrice)

In [None]:
ax = n_ames.hist(label='NAmes')
college_cr.hist(ax=ax, label='CollgCr')
ax.legend()

In [None]:
alpha = .7
ax = n_ames.hist(label='NAmes', alpha=alpha)
college_cr.hist(ax=ax, label='CollgCr', alpha=alpha)
ax.legend()

In [None]:
(n_ames
 .to_frame()
 .assign(cdf=n_ames.rank(method='average', pct=True))
 .sort_values(by='SalePrice')
 .plot(x='SalePrice', y='cdf', label='NAmes')
)

In [None]:
def plot_cdf(ser, ax=None, label=''):
    # Use the series name so the helper works for any named column
    name = ser.name
    return (ser
     .to_frame()
     .assign(cdf=ser.rank(method='average', pct=True))
     .sort_values(by=name)
     .plot(x=name, y='cdf', label=label, ax=ax)
    )
plot_cdf(n_ames, label='NAmes')

In [None]:
import matplotlib.pyplot as plt
def plot_cdf(ser, ax=None, label=''):
    # Use the series name so the helper works for any named column
    name = ser.name
    return (ser
     .to_frame()
     .assign(cdf=ser.rank(method='average', pct=True))
     .sort_values(by=name)
     .plot(x=name, y='cdf', label=label, ax=ax)
    )

# Draw both neighborhoods on the same axes for comparison
fig, ax = plt.subplots(figsize=(8,4))
plot_cdf(n_ames, label='NAmes', ax=ax)
plot_cdf(college_cr, label='CollgCr', ax=ax)
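
The cumulative distribution plotted above can also be computed directly, which makes explicit what the y-axis means. This is a minimal sketch with a hypothetical `ecdf` helper and made-up prices: each sorted value is paired with the fraction of observations at or below it.

```python
import numpy as np
import pandas as pd

def ecdf(ser):
    """Return sorted values and the fraction of observations <= each value."""
    values = np.sort(ser.to_numpy())
    fractions = np.arange(1, len(values) + 1) / len(values)
    return values, fractions

# Made-up prices with no repeated values
prices = pd.Series([200, 100, 400, 300], name='SalePrice')
values, fractions = ecdf(prices)

# With no ties this matches the rank-based version used in plot_cdf
ranked = prices.rank(method='average', pct=True).sort_values()
print(values, fractions)
```

When values repeat, `rank(method='average')` assigns tied rows the middle of their rank range, whereas `method='max'` corresponds to the "fraction at or below" definition used in this sketch.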

### 10 - Running Statistical Tests

Goals:

* Use the `scipy.stats` module to run a statistical test

In [None]:
print(dir(stats))

In [None]:
stats.ks_2samp?

In [None]:
ks_statistic, p_value = stats.ks_2samp(n_ames, college_cr)
print(ks_statistic, p_value)

In [None]:
if p_value > 0.05:
    print('Fail to reject null hypothesis: no evidence the distributions differ')
else:
    print('Reject null hypothesis: samples likely come from different distributions')
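
The KS statistic is the largest vertical gap between the two cumulative distribution curves. The sketch below checks that reading against `stats.ks_2samp` on synthetic samples; the arrays `a` and `b` and the helper `ks_statistic_manual` are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
from scipy import stats

# Synthetic samples: same spread, means one standard deviation apart
rng = np.random.default_rng(0)
a = rng.normal(loc=0, scale=1, size=300)
b = rng.normal(loc=1, scale=1, size=300)

def ks_statistic_manual(x, y):
    """Largest absolute gap between the two empirical CDFs."""
    grid = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), grid, side='right') / len(x)
    cdf_y = np.searchsorted(np.sort(y), grid, side='right') / len(y)
    return np.abs(cdf_x - cdf_y).max()

manual = ks_statistic_manual(a, b)
result = stats.ks_2samp(a, b)
print(manual, result.statistic, result.pvalue)
```

A small p-value here says the gap is larger than chance alone would plausibly produce; it does not say how the distributions differ.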


### 11 - Testing for Normality

Goals:

* Use the `scipy.stats` module to test for normality
* Use the `scipy.stats` module to create a probability plot

In [None]:
# Use the Shapiro-Wilk test
shapiro_stat, p_value = stats.shapiro(n_ames)

In [None]:
if p_value > 0.05:
    print("No evidence against normality (fail to reject H0)")
else:
    print("The distribution of the series is likely not normal (reject H0)")


In [None]:
p_value
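
Prices are often right-skewed, which is one common reason a normality test rejects. The sketch below uses a synthetic lognormal sample (the `skewed` array and its parameters are made up, not drawn from the Ames data) to show how the Shapiro-Wilk statistic responds to a log transform.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "prices"; parameters are arbitrary assumptions
rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=12, sigma=0.4, size=500)

# Test the raw values, then the log-transformed values
stat_raw, p_raw = stats.shapiro(skewed)
stat_log, p_log = stats.shapiro(np.log(skewed))
print(stat_raw, p_raw)
print(stat_log, p_log)
```

The statistic moves closer to 1 after the transform, since the log of a lognormal sample is normal by construction. Whether real sale prices behave the same way would need to be checked on the actual data.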

In [None]:
stats.probplot?

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8,4))
_ = stats.probplot(n_ames, plot=ax)

In [None]:
alpha = .7
ax = n_ames.hist(label='NAmes', alpha=alpha)
college_cr.hist(ax=ax, label='CollgCr', alpha=alpha)
ax.legend()

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8,4))
_ = stats.probplot(college_cr, plot=ax)