# Data Distribution

The first four statistical moments of a distribution:

- Location (first moment)

- Variability (second moment)

- Skewness (third moment)

- Kurtosis (fourth moment)

---

### Libraries

- [SciPy](https://docs.scipy.org/doc/scipy/reference/index.html#scipy-api)

- [statsmodels](https://www.statsmodels.org/stable/api.html)

- [wquantiles](https://pypi.org/project/wquantiles/)

In [1]:
# imports

import pandas as pd
import numpy as np
from scipy.stats import trim_mean   # conda install scipy
from statsmodels import robust      # conda install -c conda-forge statsmodels 
import wquantiles                   # pip install wquantiles

import seaborn as sns
import matplotlib.pyplot as plt

---

### Exploring the Data Distribution

So far we have used a single number to describe the location or variability of the data. Now we are going to explore how the data is distributed overall.

- __Boxplot:__ A quick way to visualize the distribution of data (a.k.a. box-and-whisker plot).

- __Frequency table:__ A tally of the count of numeric data values that fall into a set of intervals (bins).

- __Histogram:__ A plot of the frequency table with the bins on the x-axis and the count (or proportion) on the y-axis.

- __Density plot:__ A smoothed version of the histogram.



---

#### Equal-count distributions

In [2]:
state = pd.read_csv('./datasets/state.csv')
state.sort_values(by='Murder.Rate').reset_index(drop=True)


In [None]:
# Percentiles table (equal-count bins). Quartiles and deciles are the most common choices.

percentages = [0.05, 0.25, 0.5, 0.75, 0.95]
percentiles = state['Murder.Rate'].quantile(percentages)
df = pd.DataFrame(percentiles)
df.index = [f'{int(p * 100)}%' for p in percentages]
df.transpose()
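
To make the difference between equal-count and equal-size bins concrete, here is a minimal sketch on synthetic data (the exponential sample and the variable names are assumptions for illustration, not part of the state dataset). `pd.qcut` places bin edges at quantiles, while `pd.cut` spaces them evenly between the minimum and the maximum.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed sample (hypothetical data, not the state dataset)
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=4.0, size=1000))

# Equal-count bins: edges sit at the quartiles, so each bin holds ~25% of the rows
equal_count = pd.qcut(values, q=4)

# Equal-size bins: edges are evenly spaced, so counts follow the shape of the data
equal_size = pd.cut(values, bins=4)

print(equal_count.value_counts().sort_index())
print(equal_size.value_counts().sort_index())
```

On skewed data like this, the equal-size bins end up very unevenly populated, which is exactly the information a histogram is meant to show.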

In [None]:
# Boxplot (by default, Matplotlib whiskers extend at most 1.5 * IQR beyond the box)

data = state['Murder.Rate']
ax = data.plot.box(figsize=(5, 8))
ax.set_ylabel('Murder Rate per 100k')
#ax.boxplot(data, whis=[0, 100])
plt.tight_layout()
plt.grid()
plt.show()
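
The 1.5 IQR whisker rule mentioned in the comment can be worked out by hand. The sketch below uses a small hypothetical sample (not the murder-rate data) and assumes Matplotlib's default `whis=1.5`: whiskers reach the most extreme observations that still fall within 1.5 times the interquartile range of the box, and anything beyond is drawn as an outlier.

```python
import numpy as np

# Hypothetical sample with one obvious outlier
sample = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 12.0])

# Box edges: first and third quartiles
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme points inside the fences
inside = sample[(sample >= lower_fence) & (sample <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences are plotted individually
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(f'Q1={q1}, Q3={q3}, IQR={iqr}')
print(f'whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}')
```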

---

#### Equal-size distributions

In [None]:
# Frequency table (equal-size bins)

binnedPopulation = pd.cut(state['Population'], 10)

binnedPopulation.head()

In [None]:
binnedPopulation.value_counts()

In [None]:
binnedPopulation.name = 'binnedPopulation'
df = pd.concat([state, binnedPopulation], axis=1)
df = df.sort_values(by='Population')
df.head()

In [None]:
groups = []
for group, subset in df.groupby(by='binnedPopulation'):
    groups.append({'BinRange': group,
                   'Count': len(subset),
                   'States': ','.join(subset.Abbreviation)})
groups

In [None]:
binrange = pd.DataFrame(groups)
binrange

__IMPORTANT:__ Bins that are too wide hide important features of the distribution, while bins that are too narrow are too granular to show the big picture.
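
The bin-size trade-off can be seen numerically with `np.histogram`. This is a rough sketch on a hypothetical bimodal sample (two normal clusters, chosen only for illustration): with 2 bins the two modes collapse into two large counts, while with 200 bins many bins are empty or nearly so.

```python
import numpy as np

# Hypothetical bimodal sample: two clusters centred at 0 and 5
rng = np.random.default_rng(42)
sample = np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 1, 500)])

# Count observations per bin for a coarse, a moderate, and a very fine binning
results = {}
for bins in (2, 10, 200):
    counts, edges = np.histogram(sample, bins=bins)
    results[bins] = counts
    width = edges[1] - edges[0]
    print(f'bins={bins:>3}  width={width:.2f}  '
          f'empty bins={np.sum(counts == 0)}  max count={counts.max()}')
```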

In [None]:
# Histogram (or plotting the frequency table)

#data = state['Population'] / 1_000_000
data = state['Murder.Rate']
ax = data.plot.hist(figsize=(12, 8))
#ax.set_xlabel('Population (millions)')
ax.set_xlabel('Murder Rate (per 100,000)')
plt.tight_layout()

In [None]:
# Density plot calculated from data using a kernel density estimate implementation (area under the curve == 1)

ax = data.plot.hist(density=True,
                    #xlim=[0, 12], 
                    #bins=range(1,12),
                    figsize=(12, 8))
data.plot.density(ax=ax)
#ax.set_xlabel('Population (millions)')
ax.set_xlabel('Murder Rate (per 100,000)')
plt.tight_layout()
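
To illustrate what a kernel density estimate does, and why the area under the curve is 1, here is a hand-rolled Gaussian KDE in NumPy. It is a sketch under stated assumptions: the sample is synthetic, `gaussian_kde_1d` is a hypothetical helper written for this example, and the bandwidth follows Scott's rule of thumb (other bandwidth choices are equally valid).

```python
import numpy as np

# Synthetic sample (hypothetical, stands in for a numeric column)
rng = np.random.default_rng(1)
sample = rng.normal(loc=4.0, scale=2.0, size=50)

# Bandwidth from Scott's rule of thumb (an assumption; other rules exist)
bandwidth = sample.std(ddof=1) * len(sample) ** (-1 / 5)

def gaussian_kde_1d(x, data, h):
    """Average of Gaussian bumps of width h centred on each data point."""
    z = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

# Evaluate the estimate on a grid that extends well past the data
grid = np.linspace(sample.min() - 5 * bandwidth,
                   sample.max() + 5 * bandwidth, 2001)
density = gaussian_kde_1d(grid, sample, bandwidth)

# Riemann-sum approximation of the area under the curve
area = density.sum() * (grid[1] - grid[0])
print(f'bandwidth={bandwidth:.3f}, area={area:.4f}')
```

Because each Gaussian bump integrates to 1 and the estimate is their average, the total area stays close to 1 regardless of the bandwidth; the bandwidth only controls how smooth the curve looks.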

---

### Exploring Binary and Categorical Data

Categorical data can be summarized simply as counts or as proportions of the whole.

- __Probability:__ An idealized construct, defined as the proportion of times an event would occur if the situation could be repeated over and over.

- __Mode:__ The most commonly occurring category or value in a dataset.

- __Bar charts:__ The frequency or proportion for each category plotted as bars.
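
As a minimal, self-contained illustration of these terms, the sketch below computes counts, proportions, and the mode for a small hypothetical categorical sample (the category labels are made up for the example).

```python
import pandas as pd

# Hypothetical categorical sample
causes = pd.Series(['Weather', 'Carrier', 'Carrier', 'ATC', 'Carrier', 'Weather'])

# Counts per category (what a bar chart would display)
counts = causes.value_counts()

# Proportions per category; they sum to 1
proportions = causes.value_counts(normalize=True)

# Mode: the most frequent category
mode = causes.mode()[0]

print(counts)
print(proportions)
print('Mode:', mode)
```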


In [None]:
# Category proportions

dfw = pd.read_csv('./datasets/dfw_airline.csv')
dfw_proportions = 100 * dfw / dfw.values.sum()
dfw_proportions

In [None]:
# A bar chart can be understood as a histogram whose bins are unordered (arbitrary) categories.

ax = dfw.transpose().plot.bar(figsize=(12, 8), legend=False)
ax.set_xlabel('Cause of delay')
ax.set_ylabel('Count')

plt.tight_layout()
plt.show()

In [None]:
# Mode 

df_athletes = pd.read_excel('./datasets/Athletes.xlsx')
print('Athletes mode:', df_athletes['NOC'].mode()[0])

df_coaches = pd.read_excel('./datasets/Coaches.xlsx')
print('Coaches mode:', df_coaches['NOC'].mode()[0])

---

### Correlation (bivariate analysis)

Includes the correlation among features (predictors) and between features and a target variable (numeric vs. numeric).

- __Correlation coefficient:__ A metric that measures the extent to which numeric variables are associated with one another (ranges from -1 to +1). The most commonly used is _Pearson's correlation coefficient_.

- __Correlation matrix:__ A table where the variables are shown on both rows and columns, and the cell values are the correlations between the variables.

- __Scatterplot:__ A plot in which the x-axis is the value of one variable, and the y-axis the value of another.
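
Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on synthetic data (the variables `x` and `y` are assumptions for illustration) and checks the result against `np.corrcoef`.

```python
import numpy as np

# Synthetic, positively related variables (hypothetical data)
rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)

# Deviations from the mean
dx, dy = x - x.mean(), y - y.mean()

# Pearson's r: sum of cross-products over the product of the root sums of squares
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Reference value from NumPy's correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(f'manual={r_manual:.4f}, numpy={r_numpy:.4f}')
```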


In [None]:
sp500_sym = pd.read_csv('./datasets/sp500_sectors.csv')
sp500_px = pd.read_csv('./datasets/sp500_data.csv', index_col=0)

In [None]:
telecomSymbols = sp500_sym[sp500_sym['sector'] == 'telecommunications_services']['symbol']
telecom = sp500_px.loc[sp500_px.index >= '2012-07-01', telecomSymbols]
print(telecom.shape)
telecom.head()

In [None]:
# Correlation matrix => https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html

telecom.corr()

In [None]:
etfs = sp500_px.loc[sp500_px.index > '2012-07-01', sp500_sym[sp500_sym['sector'] == 'etf']['symbol']]
print(etfs.shape)
etfs.head()

In [None]:
# Correlation matrix heatmap => https://seaborn.pydata.org/generated/seaborn.heatmap.html

fig, ax = plt.subplots(figsize=(10, 10))
ax = sns.heatmap(etfs.corr(),
                 vmin=-1,
                 vmax=1,
                 cmap=sns.diverging_palette(20, 220, as_cmap=True),
                 # https://seaborn.pydata.org/generated/seaborn.diverging_palette.html
                 ax=ax)
plt.tight_layout()

In [None]:
# Scatterplot (points concentrated in the upper-right and lower-left quadrants indicate positive correlation)

ax = telecom.plot.scatter(x='T',
                          y='VZ',
                          figsize=(10, 10),
                          marker='$\u25EF$',
                          alpha=1)
ax.set_xlabel('ATT (T)')
ax.set_ylabel('Verizon (VZ)')
ax.axhline(0, color='grey', lw=1)
ax.axvline(0, color='grey', lw=1)
plt.xlim([-2, 2])
plt.ylim([-2, 2])
plt.tight_layout()

---

### Exploring Two or More Variables (Multivariate analysis)

Estimates and plots involving two or more features.

__Exploring two variables:__

- __Hexagonal binning:__ A plot of two numeric variables with the records binned into hexagons.

- __Contour plot:__ A plot showing the density of two numeric variables like a topographical map.

- __Contingency table:__ A tally of counts between two or more categorical variables.

__Exploring more than two variables:__

- __Conditioning:__ The types of charts used to compare two variables are readily extended to more variables through the notion of conditioning (i.e., using a conditioning variable).


In [None]:
kc_tax = pd.read_csv('./datasets/kc_tax.csv')
kc_tax0 = kc_tax.loc[(kc_tax.TaxAssessedValue < 750000) & 
                     (kc_tax.SqFtTotLiving > 100) &
                     (kc_tax.SqFtTotLiving < 3500), :]
kc_tax0.shape

In [None]:
# Hexagonal binning plot

ax = kc_tax0.plot.hexbin(x='SqFtTotLiving',
                         y='TaxAssessedValue',
                         gridsize=30,
                         sharex=False,     
                         figsize=(10, 8))
ax.set_xlabel('Finished Square Feet')
ax.set_ylabel('Tax Assessed Value')
plt.tight_layout()

In [None]:
# Contour plot

fig, ax = plt.subplots(figsize=(10, 10))
sns.kdeplot(data=kc_tax0.sample(10000),
            x='SqFtTotLiving',
            y='TaxAssessedValue',
            ax=ax,
            cmap="Reds")
kc_tax0.sample(10000).plot.scatter(x='SqFtTotLiving',
                                   y='TaxAssessedValue',
                                   marker='$\u25EF$',
                                   alpha=0.1,
                                   ax=ax)
ax.set_xlabel('Finished Square Feet')
ax.set_ylabel('Tax Assessed Value')
plt.tight_layout()

In [None]:
lc_loans = pd.read_csv('./datasets/lc_loans.csv')
lc_loans

In [None]:
# Contingency table

crosstab = lc_loans.pivot_table(index=['grade'], columns=['status'],
                                aggfunc=lambda x: len(x), margins=True)
crosstab

In [None]:
# Contingency table (percentages)

perc_crosstab = crosstab.copy().loc['A':'G',:]
perc_crosstab.loc[:,'Charged Off':'Late'] = perc_crosstab.loc[:,'Charged Off':'Late'].div(perc_crosstab['All'], axis=0)
perc_crosstab['All'] = perc_crosstab['All'] / sum(perc_crosstab['All'])
perc_crosstab
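
An alternative way to build the same kind of table is `pd.crosstab`, which can also compute row proportions directly through its `normalize` parameter. The sketch below uses a tiny hypothetical loans frame (the rows are made up; only the column names mirror the ones above) and is not a replacement for the cells above.

```python
import pandas as pd

# Tiny hypothetical dataset with two categorical columns
loans = pd.DataFrame({
    'grade':  ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'status': ['Current', 'Current', 'Fully Paid', 'Current', 'Late',
               'Charged Off', 'Current', 'Late', 'Late'],
})

# Contingency table of counts, with row and column totals ('All')
counts = pd.crosstab(loans['grade'], loans['status'], margins=True)

# Row proportions: each grade's statuses sum to 1
row_share = pd.crosstab(loans['grade'], loans['status'], normalize='index')

print(counts)
print(row_share.round(2))
```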

In [None]:
airline_stats = pd.read_csv('./datasets/airline_stats.csv').sort_values(by='airline').reset_index(drop=True)
airline_stats

In [None]:
airline_stats['airline'].value_counts()

In [None]:
# Categorical and numeric data (Boxplots)

ax = airline_stats.boxplot(by='airline',
                           column='pct_carrier_delay',
                           figsize=(12, 8))
ax.set_xlabel('')
ax.set_ylabel('Daily % of Delayed Flights')
plt.suptitle('')
plt.tight_layout()

In [None]:
# Categorical and numeric data (Violinplot) => https://seaborn.pydata.org/generated/seaborn.violinplot.html

fig, ax = plt.subplots(figsize=(12, 8))
sns.violinplot(data=airline_stats,
               x='airline',
               y='pct_carrier_delay',
               ax=ax,
               inner='quartile',
               color='white')
ax.set_xlabel('')
ax.set_ylabel('Daily % of Delayed Flights')
plt.tight_layout()

In [None]:
# Using a conditioning variable

zip_codes = [98188, 98105, 98108, 98126]
kc_tax_zip = kc_tax0.loc[kc_tax0['ZipCode'].isin(zip_codes),:]
kc_tax_zip

In [None]:
# Conditioning hexagonal binning plot

def hexbin(x, y, color, **kwargs):
    cmap = sns.light_palette(color, as_cmap=True)
    plt.hexbin(x, y, gridsize=30, cmap=cmap, **kwargs)
    
# https://seaborn.pydata.org/generated/seaborn.FacetGrid.html
g = sns.FacetGrid(kc_tax_zip, col='ZipCode', col_wrap=2, height=4)
# https://seaborn.pydata.org/generated/seaborn.FacetGrid.map.html
g.map(hexbin, 'SqFtTotLiving', 'TaxAssessedValue', extent=[0, 3500, 0, 700000])
g.set_axis_labels('Finished Square Feet', 'Tax Assessed Value')
g.set_titles('Zip code {col_name:.0f}')
plt.tight_layout()
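
Conditioning can also be expressed numerically: estimate the relationship between two variables separately within each level of a third. The sketch below is an illustration on synthetic data (the `zone` column, the per-zone rates, and the noise level are all assumptions), fitting the slope of value on square footage within each group.

```python
import numpy as np
import pandas as pd

# Synthetic homes: the value per square foot depends on a hypothetical zone
rng = np.random.default_rng(3)
n = 300
homes = pd.DataFrame({
    'zone': rng.choice(['north', 'south'], size=n),
    'sqft': rng.uniform(500, 3500, size=n),
})
rate = homes['zone'].map({'north': 250.0, 'south': 120.0})
homes['value'] = homes['sqft'] * rate + rng.normal(0, 20000, size=n)

# Condition on zone: fit value ~ sqft separately within each group
slopes = {
    zone: np.polyfit(group['sqft'], group['value'], deg=1)[0]
    for zone, group in homes.groupby('zone')
}

for zone, slope in slopes.items():
    print(f'{zone}: about {slope:.0f} per square foot')
```

A single pooled fit would blur the two slopes together, which is the same effect the faceted hexbin plots make visible.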

---