# Exploratory Data Analysis

Adapted mostly from [Chapter 1, Practical Statistics for Data Scientists](https://github.com/gedeck/practical-statistics-for-data-scientists/tree/master/python/notebooks).

## Setup

The following cell sets up the Colab environment. No changes are made if run locally.

In [None]:
# If running on Colab, set up the environment
import sys
if 'google.colab' in sys.modules:
    !pip install requests wquantiles
    !mkdir -p /content/data
    %cd /content
    !wget -q https://raw.githubusercontent.com/olearydj/INSY7120/refs/heads/main/notebooks/common.py -O common.py

In all environments, use a function from the `common` module (installed above or already local) to download the required data files. They will be placed in `./data`.

In [None]:
import common

STATE_CSV = common.download_gh_file('https://raw.githubusercontent.com/gedeck/practical-statistics-for-data-scientists/refs/heads/master/data/state.csv')

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
from scipy.stats import trim_mean
from statsmodels import robust
import wquantiles

import seaborn as sns
import matplotlib.pyplot as plt

## Estimates of Location

### Location Estimates of Population and Murder Rates

Load the data with the pandas `read_csv` function and look at the first eight rows.

In [None]:
state = pd.read_csv(STATE_CSV)
print(state.head(8))

Get a sense of the data.

In [None]:
# shape and size
print("ndim:", state.ndim)
print("shape:", state.shape)

In [None]:
# data types
state.dtypes

In [None]:
# general information
state.info()

In [None]:
print(state.describe())

Compute the mean, trimmed mean, and median for Population. For `mean` and `median` we can use the _pandas_ methods of the data frame. The trimmed mean requires the `trim_mean` function in _scipy.stats_.

In [None]:
print(state['Population'].mean())

In [None]:
print(trim_mean(state['Population'], 0.1))

In [None]:
print(state['Population'].median())

mean > trimmed mean > median

Why?
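
One way to explore the question above is with a small, made-up right-skewed sample (the values are illustrative and are not taken from `state.csv`). The hand-rolled `trimmed_mean` helper is a sketch of the idea behind `scipy.stats.trim_mean`: sort the values, drop a fixed fraction from each end, then average what is left.

```python
import numpy as np

# Hypothetical right-skewed sample: most values are small, one is very large.
sample = np.array([1, 2, 2, 3, 3, 4, 5, 8, 20, 100], dtype=float)


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values from each end."""
    ordered = np.sort(values)
    k = int(proportion * len(ordered))  # number of values cut from each end
    return ordered[k:len(ordered) - k].mean()


sample_mean = sample.mean()
sample_trimmed = trimmed_mean(sample, 0.1)
sample_median = np.median(sample)

print(sample_mean, sample_trimmed, sample_median)
```

In this sample the single large value pulls the mean up the most, trimming removes it along with the smallest value, and the median depends only on the middle of the sorted data.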

Weighted mean is available with numpy. For weighted median, we can use the specialised package `wquantiles` (https://pypi.org/project/wquantiles/).

In [None]:
print(state['Murder.Rate'].mean())

In [None]:
print(np.average(state['Murder.Rate'], weights=state['Population']))

In [None]:
print(wquantiles.median(state['Murder.Rate'], weights=state['Population']))
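
To make the weighting explicit, here is a minimal sketch on made-up rates and weights (not the values in `state.csv`). The weighted mean is the sum of weight times value divided by the sum of the weights. The `simple_weighted_median` helper is a hypothetical, simplified definition that returns the first sorted value at which the cumulative weight reaches half the total; `wquantiles` may interpolate, so its result can differ.

```python
import numpy as np

# Hypothetical rates and weights, for illustration only.
rates = np.array([2.0, 5.0, 10.0])
weights = np.array([1.0, 1.0, 8.0])

# Weighted mean: sum(weight * value) / sum(weight).
manual_weighted_mean = (rates * weights).sum() / weights.sum()
numpy_weighted_mean = np.average(rates, weights=weights)


def simple_weighted_median(values, weights):
    """First sorted value at which the cumulative weight reaches half the total."""
    order = np.argsort(values)
    cumulative = np.cumsum(weights[order])
    idx = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return values[order][idx]


print(rates.mean(), manual_weighted_mean, numpy_weighted_mean)
print(simple_weighted_median(rates, weights))
```

Because the heaviest weight sits on the largest rate, both weighted estimates land well above the unweighted mean.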

## Estimates of Variability

### Example: Variability Estimates of State Population

Working with the same state data as before (population and murder rate by state).

In [None]:
# Table 1-2
print(state.head(8))

Standard deviation

In [None]:
print(state['Population'].std())

The interquartile range is the difference between the 75th and 25th percentiles.

In [None]:
print(state['Population'].quantile(0.75) - state['Population'].quantile(0.25))

Or use the `iqr` function from SciPy.

In [None]:
from scipy.stats import iqr

iqr(state['Population'])

Median absolute deviation from the median can be calculated with a method in _statsmodels_.

This version is scaled so that, for normally distributed data, the result is comparable to the standard deviation. See [this Wikipedia page for more details.](https://en.wikipedia.org/wiki/Median_absolute_deviation#Relation_to_standard_deviation)

In [None]:
print(robust.scale.mad(state['Population']))
print(abs(state['Population'] - state['Population'].median()).median() / 0.6744897501960817)
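
The constant in the manual calculation above is the 75th percentile of the standard normal distribution. The sketch below recovers it with the standard library and applies the same scaling to a small made-up sample (the values are illustrative only).

```python
from statistics import NormalDist, median

# 75th percentile of the standard normal distribution.
normal_q75 = NormalDist().inv_cdf(0.75)

# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 100]

center = median(sample)
raw_mad = median(abs(x - center) for x in sample)  # unscaled MAD
scaled_mad = raw_mad / normal_q75                  # comparable to a standard deviation

print(normal_q75, raw_mad, scaled_mad)
```

The extreme value barely affects the MAD, whereas it would dominate the standard deviation of the same sample.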

## Percentiles and Boxplots
_Pandas_ provides the `quantile` method for both series and data frames.

In [None]:
print(state['Murder.Rate'].quantile([0.05, 0.25, 0.5, 0.75, 0.95]))

_Pandas_ provides a number of basic exploratory plots; one of them is the boxplot.

In [None]:
ax = (state['Population']/1_000_000).plot.box(figsize=(3, 4))
ax.set_ylabel('Population (millions)')

plt.tight_layout()
plt.show()
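
To read a boxplot it helps to know where the whiskers end. The sketch below assumes matplotlib's default rule (`whis=1.5`), which `plot.box` uses unless overridden: each whisker extends to the most extreme data point within 1.5 times the interquartile range of the box, and points beyond that are drawn individually. The sample values are made up for illustration.

```python
import numpy as np

# Hypothetical sample with one extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 50], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr_value = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr_value
upper_fence = q3 + 1.5 * iqr_value

# Whiskers stop at the most extreme points that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q3, iqr_value)
print(whisker_low, whisker_high, outliers)
```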

## Frequency Table and Histograms

_Pandas_ also supports histograms for exploratory data analysis. By default, Pandas uses 10 bins.

In [None]:
# default, 10 bins
ax = (state['Population'] / 1_000_000).plot.hist(figsize=(4, 4))
ax.set_xlabel('Population (millions)')

plt.tight_layout()
plt.show()

To change the number of bins, use the `bins` parameter of `pd.Series.plot.hist`.

In [None]:
# five bins
ax = (state['Population'] / 1_000_000).plot.hist(bins=5, figsize=(4, 4))
ax.set_xlabel('Population (millions)')

plt.tight_layout()
plt.show()

In [None]:
# twenty bins
ax = (state['Population'] / 1_000_000).plot.hist(bins=20, figsize=(4, 4))
ax.set_xlabel('Population (millions)')

plt.tight_layout()
plt.show()

In [None]:
help(pd.Series.plot.hist)
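
A histogram is a plot of a frequency table. As a minimal sketch on made-up values (not the state data), `pd.cut` splits the range into equal-width bins and `value_counts` tallies each one. Note that `pd.cut` uses right-closed intervals by default, so values that fall exactly on a bin edge may be counted differently than in the plotted histogram.

```python
import pandas as pd

# Hypothetical values, for illustration only.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)

# Split the range into five equal-width bins and count the values in each.
binned = pd.cut(values, bins=5)
freq_table = binned.value_counts().sort_index()

print(freq_table)
```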

## Density Estimates
A density plot is an alternative to a histogram: it shows the distribution as a smooth curve, estimated with a kernel density estimate, rather than as binned counts. Use the `bw_method` argument of `plot.density` to control the smoothness of the curve.

In [None]:
ax = state['Murder.Rate'].plot.hist(density=True, xlim=[0, 12], 
                                    bins=range(1,12), figsize=(4, 4))
state['Murder.Rate'].plot.density(ax=ax)
ax.set_xlabel('Murder Rate (per 100,000)')

plt.tight_layout()
plt.show()
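
The effect of `bw_method` can be seen without plotting. The sketch below uses `scipy.stats.gaussian_kde` directly, on the assumption that `plot.density` passes `bw_method` through to it, and evaluates two estimates on a made-up bimodal sample. A small bandwidth factor produces tall, narrow peaks; a large one flattens the curve.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical bimodal sample, for illustration only.
data = np.array([1.0, 1.1, 1.2, 4.9, 5.0, 5.2])
grid = np.linspace(0, 6, 121)

# A scalar bw_method is used directly as the bandwidth factor.
narrow = gaussian_kde(data, bw_method=0.1)
wide = gaussian_kde(data, bw_method=1.0)

narrow_curve = narrow(grid)
wide_curve = wide(grid)

print(narrow_curve.max(), wide_curve.max())
```

In the notebook, the equivalent call would be something like `state['Murder.Rate'].plot.density(bw_method=0.3)`, where the value `0.3` is an arbitrary choice for illustration.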

## Exploring Binary and Categorical Data

This section uses a new dataset: causes of flight delays at Dallas/Fort Worth Airport. It holds a count for each of the five categories of a single categorical variable.

In [None]:
AIRPORT_DELAYS_CSV = common.download_gh_file('https://raw.githubusercontent.com/gedeck/practical-statistics-for-data-scientists/refs/heads/master/data/dfw_airline.csv')
dfw = pd.read_csv(AIRPORT_DELAYS_CSV)
print(dfw)

This is a very simple data frame, likely aggregated from a larger list of individual incidents.

In [None]:
dfw.shape

In [None]:
dfw.values

To calculate the proportion of each category, simply divide the count in each category by the total number of incidents.

In [None]:
print(100 * dfw / dfw.values.sum())
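
The same calculation on a made-up, one-row table of counts (the column names and values are hypothetical, not the airport data) makes it easy to check that the percentages add up to 100.

```python
import pandas as pd

# Hypothetical one-row table of counts per category.
counts = pd.DataFrame({'Cause A': [30], 'Cause B': [50], 'Cause C': [20]})

# Divide every count by the grand total and express it as a percentage.
percentages = 100 * counts / counts.values.sum()

print(percentages)
print(percentages.values.sum())
```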

Bar charts are commonly used to display the relative proportions of each category in a single categorical variable. Pandas supports this with `plot.bar`.

In [None]:
ax = dfw.transpose().plot.bar(figsize=(4, 4), legend=False)
ax.set_xlabel('Cause of delay')
ax.set_ylabel('Count')

plt.tight_layout()
plt.show()

Note: Pandas expects categories displayed on the x-axis of a bar chart to be represented in rows of a single column. This is why the data was transposed above. The shorthand notation for `transpose()` is `T`:

In [None]:
dfw.T

## Bivariate / Multivariate Analysis

### Correlation

This section uses the red wine file from the Wine Quality dataset in the UCI Machine Learning Repository.

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine = pd.read_csv(url, sep=";")
wine.head()

In [None]:
wine.shape

In [None]:
# Compute the correlation matrix
corr_matrix = wine.corr()
corr_matrix


In [None]:
# Filter for stronger correlations (absolute value > 0.3)
strong_corr = corr_matrix[(corr_matrix > 0.3) | (corr_matrix < -0.3)]

# Plot the heatmap
plt.figure(figsize=(8,6))
sns.heatmap(strong_corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5, mask=strong_corr.isnull())
plt.title("Wine Quality Dataset - Correlation Matrix")
plt.show()
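
By default, `DataFrame.corr` computes the Pearson correlation coefficient. As a sketch of what each cell of the matrix contains, the code below computes it by hand for two made-up columns (not the wine data) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column data frame, for illustration only.
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'y': [2.0, 4.0, 5.0, 4.0, 5.0]})

# Deviations from the column means.
dx = df['x'] - df['x'].mean()
dy = df['y'] - df['y'].mean()

# Pearson r: sum of co-deviations over the root of the product of squared deviations.
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
pandas_r = df.corr().loc['x', 'y']

print(manual_r, pandas_r)
```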

### Scatterplot

In [None]:
# Create a scatterplot of Alcohol vs. Quality with a regression line
plt.figure(figsize=(8,5))
sns.regplot(x=wine["alcohol"], y=wine["quality"], scatter_kws={'alpha':0.5}, line_kws={'color':'red'})

# Labels and title
plt.xlabel("Alcohol Content (%)")
plt.ylabel("Wine Quality Score")
plt.title("Alcohol vs. Wine Quality")

plt.show()
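
With its default settings, the line drawn by `sns.regplot` is an ordinary least-squares fit. To see the numbers behind such a line, a degree-1 `np.polyfit` gives the slope and intercept. The sketch below uses made-up points that lie exactly on a line, not the wine data.

```python
import numpy as np

# Hypothetical points lying exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Degree-1 least-squares fit returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, deg=1)

print(slope, intercept)
```

Applied to the notebook's data, the equivalent call would be `np.polyfit(wine["alcohol"], wine["quality"], deg=1)`.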

In [None]:
# Scatterplot of Fixed Acidity vs. Density
plt.figure(figsize=(8,5))
sns.scatterplot(x=wine["fixed acidity"], y=wine["density"], alpha=0.5)

# Labels and title
plt.xlabel("Fixed Acidity (g/L)")
plt.ylabel("Density (g/cm³)")
plt.title("Fixed Acidity vs. Density in Wine")

plt.show()