In [1]:
import numpy as np

In [2]:
import matplotlib.pyplot as plt

In [4]:
from pathlib import Path

import pandas as pd
import numpy as np
from scipy.stats import trim_mean
from statsmodels import robust

import seaborn as sns
import matplotlib.pyplot as plt

In [5]:
# Data sets path
AIRLINE_STATS_CSV = 'Data/airline_stats.csv'
KC_TAX_CSV =  'Data/kc_tax.csv.gz'
LC_LOANS_CSV =  'Data/lc_loans.csv'
AIRPORT_DELAYS_CSV =  'Data/dfw_airline.csv'
SP500_DATA_CSV = 'Data/sp500_data.csv.gz'
SP500_SECTORS_CSV = 'Data/sp500_sectors.csv'
STATE_CSV = 'Data/state.csv'

In [6]:
def printest(label, value):
    # Print a label followed by the value on its own line
    print("{} : \n {} \n".format(label, value))

# Exploratory Data Analysis

In [7]:
state = pd.read_csv(STATE_CSV)
state.head(3)

Unnamed: 0,State,Population,Murder.Rate,Abbreviation
0,Alabama,4779736,5.7,AL
1,Alaska,710231,5.6,AK
2,Arizona,6392017,4.7,AZ


# Estimates of Location

Variables containing measured or count data can often possess thousands of unique values. An essential part of data exploration involves identifying a "representative value" for each feature - a value that suggests where the majority of the data points tend to cluster (i.e., their central tendency).

**Mean**

The mean, often referred to as the average, is a measure of the central tendency of a dataset. It is calculated by adding all the values in the dataset and then dividing by the total number of values.

$$\mu = \frac{1}{n}\sum_{i = 1}^{n} x_i $$

Here, $x_i$ represents each individual data point, and $n$ is the total number of observations.

The mean provides a useful summary of the data's central location, but it is <mark>sensitive to extreme values (outliers)</mark>.

**Trimmed (Truncated) Mean**

The trimmed mean is a modified version of the mean. It is computed by first sorting the data values, then excluding a fixed number of values from both ends of the sorted list, and finally taking the average of the remaining data points. This approach mitigates the impact of outliers and can provide a <mark>more representative 'central value' when dealing with skewed data</mark>.

The formula to calculate the trimmed mean, omitting the $p$ smallest and the $p$ largest values, is:

$$\mu_t = \frac{1}{n-2p}\sum_{i = p+1}^{n-p} x_{(i)} $$

Here, $n$ is the total number of observations, $p$ is the number of observations discarded from each end, and $x_{(i)}$ is the $i$-th value of the data sorted in ascending order, so the sum runs only over the values kept in the calculation.
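
As a minimal sketch of the formula above, the hypothetical helper `trimmed_mean` below drops the $p$ smallest and $p$ largest values by count. The sample values are made up for illustration. Note that `scipy.stats.trim_mean`, used later in this notebook, takes the proportion to cut from each end (for example `0.1`) rather than a count.

```python
import numpy as np

def trimmed_mean(values, p):
    """Mean after dropping the p smallest and p largest values."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if p < 0 or 2 * p >= n:
        raise ValueError("p must satisfy 0 <= 2p < n")
    return x[p:n - p].mean()

# Hypothetical data: nine moderate values and one extreme value
sample = [2, 3, 3, 4, 5, 5, 6, 7, 8, 100]

plain = np.mean(sample)             # uses every value, including 100
trimmed = trimmed_mean(sample, 1)   # drops 2 and 100, averages the other eight

print(plain, trimmed)
```

The single extreme value pulls the plain mean well above most of the data, while the trimmed mean stays close to the bulk of the values.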

**Weighted Mean**

The weighted mean is a generalization of the arithmetic mean that lets us assign a specific weight, or importance, to each data point. Each data point is multiplied by its weight before summation, and the sum of these products is then divided by the total of the weights rather than by the number of data points. This allows the weighted mean to reflect the relative contribution of each point to the total.

$$\mu_w= \frac{1}{\sum_{i = 1}^{n} w_i}\sum_{i = 1}^{n} w_ix_i $$

Here, $w_i$ is the weight assigned to data point $x_i$. This measure is particularly useful when some data points are intrinsically more significant than others, or when the data collected does not equally represent the different groups that we are interested in measuring.
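
The weighted mean can be sketched with a few made-up numbers. The arrays `rates` and `sizes` below are hypothetical (three groups and their sizes used as weights); `np.average` with the `weights` argument computes the same quantity as the formula.

```python
import numpy as np

# Hypothetical rates for three groups and the group sizes used as weights
rates = np.array([2.0, 4.0, 10.0])
sizes = np.array([100, 300, 600])

# Direct translation of the formula: sum(w_i * x_i) / sum(w_i)
weighted_by_hand = (sizes * rates).sum() / sizes.sum()

# Same computation with NumPy's built-in helper
weighted_numpy = np.average(rates, weights=sizes)

# Plain mean, which treats the three groups as equally important
unweighted = rates.mean()

print(weighted_by_hand, weighted_numpy, unweighted)
```

Because the largest group also has the highest rate, the weighted mean ends up above the unweighted one.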

**Median**

The median is the value that separates the higher half of a data set from the lower half. In other words, the median is the middle point of the sorted data. Unlike the mean, the median is barely affected by outliers or skewed data, because it depends on the order of the values rather than on their magnitudes. This makes the median a more robust measure than the mean when dealing with data that contains extreme values or is not symmetrically distributed.

To compute the median, the data set must first be sorted in ascending order. If the number of observations, $n$, is odd, the median is the value at position $(n+1)/2$. If $n$ is even, the median is the average of the values at positions $n/2$ and $n/2 + 1$.

The median gives a better measure of central tendency when data is skewed or when there are extreme outliers. However, in symmetric distributions with no outliers, the mean and the median are often the same or close to each other.
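
The position rule for odd and even $n$ can be sketched as follows. The helper name `median_by_position` and the small datasets are hypothetical, chosen only to illustrate the rule.

```python
import numpy as np

def median_by_position(values):
    """Median via the sorted-position rule (1-based positions in the comments)."""
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        return x[mid]                    # position (n + 1) / 2
    return (x[mid - 1] + x[mid]) / 2     # positions n / 2 and n / 2 + 1

odd_median = median_by_position([7, 1, 5])        # sorted: 1, 5, 7
even_median = median_by_position([7, 1, 5, 3])    # sorted: 1, 3, 5, 7

# Hypothetical data with one extreme value: the mean is pulled toward 1000,
# while the median stays at the middle value
with_outlier = [1, 3, 5, 7, 1000]

print(odd_median, even_median)
print(np.mean(with_outlier), median_by_position(with_outlier))
```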

**Conclusion**

The median, while robust to extreme values, depends only on the middle value of the sorted data (or the two middle values when the number of observations is even), so it ignores how far the other points lie from the center. The mean, on the other hand, considers all data points but is heavily influenced by outliers or extreme values.

The trimmed mean attempts to balance this by removing a certain percentage of the largest and smallest values in the dataset, and then calculating the mean of the remaining data. This makes it robust to outliers (like the median) while still utilizing more data than the median does.


Consider the data set containing population and murder
rates (in units of murders per 100,000 people per year) for each US state (2010
Census). Let's compute the mean, trimmed mean, and median for the population:

In [None]:
state.head(3)


In [None]:
printest('Mean Population',state['Population'].mean())
printest('Truncated Mean Population',trim_mean(state['Population'], 0.1) )
printest('Median Population',state['Population'].median())


Both the truncated mean and the median provide more "typical" population values than the mean, as they are less influenced by any extreme population numbers.

# Estimates of Variability

Variability, also referred to as dispersion, measures whether the data values are tightly clustered or spread out from the central value.

**Variance and Standard Deviation**

The variance is the average of the squared deviations from the mean,

$$s^2 = \frac{1}{n}\sum_{i = 1}^{n}(x_i - \mu)^2 $$

and the standard deviation is the square root of the variance:

$$s = \sqrt{\frac{1}{n}\sum_{i = 1}^{n}(x_i - \mu)^2} $$

The variance and standard deviation are especially sensitive to outliers because they are based on squared deviations.
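
A short sketch of these two formulas on made-up numbers follows. The array `x` is hypothetical. The formulas above divide by $n$, whereas many libraries can also divide by $n - 1$ (the sample estimate): in NumPy this is controlled by the `ddof` argument, and pandas' `Series.std()` uses `ddof=1` by default.

```python
import numpy as np

# Hypothetical data with mean 5
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()
var_n = ((x - mu) ** 2).mean()   # divide by n, as in the formula above
std_n = np.sqrt(var_n)

# NumPy equivalents: ddof=0 divides by n, ddof=1 divides by n - 1
var_numpy_n = x.var(ddof=0)
var_numpy_n1 = x.var(ddof=1)

print(var_n, std_n, var_numpy_n1)
```

For large $n$ the two conventions give nearly identical results; for small samples the difference is visible.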


# Estimates Based on Percentiles



**The interquartile range (IQR)**

IQR is a measure of variability, based on dividing a dataset into quartiles. It is defined as the difference between the third quartile ($Q_3$) and the first quartile ($Q_1$). 

The quartiles divide the sorted data into four parts, each containing roughly 25% of the data. $Q_1$ is the 25th percentile, meaning that about 25% of data points are less than or equal to this value. $Q_3$ is the 75th percentile, meaning that about 75% of data points are less than or equal to this value.

The IQR gives a sense of how spread out the middle of a dataset is. It is a robust measure of dispersion: because it ignores the lowest and highest 25% of the values, a few extreme observations have little effect on it. It is often used in exploratory data analysis to flag potential outliers and to summarize a large dataset.

- To calculate the IQR:
  1. **Order the data:** Arrange the data in increasing order. This step matters because the IQR depends on values at specific positions in the ordered dataset.
  2. **Determine the quartiles:** Identify the first quartile ($Q_1$) and the third quartile ($Q_3$).
      - $Q_1$ is the median (middle value) of the lower half of the sorted data.
      - $Q_3$ is the median of the upper half of the sorted data.
  3. **Calculate the IQR:** Subtract $Q_1$ from $Q_3$. This difference is the range of the central 50% of the dataset.


- For example, if a dataset has the following values: $[3, 5, 7, 8, 9, 11, 14, 15, 16, 17]$, 
  -  $Q_1$ would be 7 (median of the lower half, $[3, 5, 7, 8, 9]$), 
  -  $Q_3$ would be 15 (median of the upper half, $[11, 14, 15, 16, 17]$), 
  -   IQR would be $15 - 7 = 8$.
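
The median-of-halves procedure can be sketched as below, using the ten-value example dataset. The helper `iqr_median_of_halves` is hypothetical, and dropping the middle value for odd $n$ is one common convention, assumed here. Libraries often use a different quartile rule: `np.percentile` interpolates linearly between neighbouring values by default, so its quartiles can differ slightly from the hand calculation.

```python
import numpy as np

def iqr_median_of_halves(values):
    """Return (Q1, Q3, IQR) using the median of the lower and upper halves."""
    x = np.sort(np.asarray(values, dtype=float))
    half = x.size // 2
    lower = x[:half]             # for odd n the middle value is left out of both halves
    upper = x[x.size - half:]
    q1, q3 = np.median(lower), np.median(upper)
    return q1, q3, q3 - q1

data = [3, 5, 7, 8, 9, 11, 14, 15, 16, 17]

q1, q3, iqr = iqr_median_of_halves(data)

# NumPy's default (linear interpolation) gives slightly different quartiles
q1_np, q3_np = np.percentile(data, [25, 75])
iqr_np = q3_np - q1_np

print(q1, q3, iqr)
print(q1_np, q3_np, iqr_np)
```

Both rules are legitimate; what matters is using the same one consistently when comparing datasets.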

In [None]:
# pandas' std() divides by n - 1 (ddof=1) by default, i.e. the sample standard deviation
printest('Standard Deviation Population', state['Population'].std())
printest('IQR Population', state['Population'].quantile(0.75) - state['Population'].quantile(0.25))



## Percentiles and Boxplots

**Percentiles**

In [None]:
percentages = [0.05, 0.25, 0.5, 0.75, 0.95]
df = pd.DataFrame(state['Murder.Rate'].quantile(percentages))
df.index = [f'{p * 100}%' for p in percentages]
print(df.transpose())


The median is 4 murders per 100,000 people, although there is quite a bit of variability:
the 5th percentile is only 1.6 and the 95th percentile is 6.51.

**Box Plot**

A box plot, also known as a box-and-whisker plot, is a graphical representation of statistical data that illustrates a dataset's key quantiles and potential outliers.

- The box represents the interquartile range (IQR), which is the range between the first quartile (25th percentile) and the third quartile (75th percentile). The line inside the box marks the median (50th percentile).
  
- The whiskers extend from the box to the most extreme data points that lie within 1.5 times the IQR of the box (the common convention, and matplotlib's default), so they never reach beyond the observed data.
  - The lower whisker ends at the smallest value greater than or equal to $Q_1 - 1.5 IQR$
  - The upper whisker ends at the largest value less than or equal to $Q_3 + 1.5 IQR$
  
- Any dots or other markers beyond the whiskers indicate potential outliers.

The box plot provides a summary of a dataset's distribution and is particularly useful for comparing distributions across groups.
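
The quantities a box plot draws can be computed directly. The sketch below assumes the 1.5 × IQR convention and NumPy's default percentile rule; the helper `boxplot_stats` and the sample data are hypothetical.

```python
import numpy as np

def boxplot_stats(values, k=1.5):
    """Quartiles, whisker ends, and flagged outliers under the k * IQR rule."""
    x = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = x[(x >= lower_fence) & (x <= upper_fence)]
    return {
        "q1": q1, "median": median, "q3": q3,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whisker_low": inside.min(),    # most extreme points within the fences
        "whisker_high": inside.max(),
        "outliers": x[(x < lower_fence) | (x > upper_fence)].tolist(),
    }

# Hypothetical data with one value far from the rest
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
stats = boxplot_stats(data)

for name, value in stats.items():
    print(name, value)
```

Here the lower fence is negative, yet the lower whisker stops at the smallest observation, and the value 30 is flagged because it lies beyond the upper fence.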

In [None]:
# Population in millions, computed once and reused below
pop_millions = state['Population'] / 1e6

plt.figure(figsize=(6, 6))
plt.boxplot(pop_millions, widths=0.7, patch_artist=True,
            boxprops=dict(facecolor='lightblue', color='black'),
            capprops=dict(color='black'),
            whiskerprops=dict(color='black'),
            flierprops=dict(color='red', markeredgecolor='red'),
            medianprops=dict(color='black'),
            )
q1, q2, q3 = np.percentile(pop_millions, [25, 50, 75])
iqr = q3 - q1

# Annotate the quartiles and the upper limit Q3 + 1.5*IQR
plt.text(1.35, q1, '$Q_1$', va='center', color='navy', fontsize=12)
plt.text(1.35, q2, '$Q_2$ (Median)', va='center', color='navy', fontsize=12)
plt.text(1.35, q3, '$Q_3$', va='center', color='navy', fontsize=12)
plt.text(1.1, q3 + 1.5 * iqr, '$Q_3 + 1.5IQR$', va='center', color='navy', fontsize=12)

# Q1 - 1.5*IQR is negative for this data, so the lower whisker stops at the
# smallest population; the label is drawn just below zero for readability
plt.text(1.1, -0.5, '$Q_1 - 1.5IQR$', va='center', color='navy', fontsize=12)

plt.ylabel('Population (millions)')
plt.title('Boxplot of Population')
plt.xticks([])

plt.tight_layout()
plt.show()


## Histograms

In [None]:
plt.figure(figsize=(6, 6))
plt.hist(state['Population'] / 1e6, bins=10, edgecolor='navy', color='lightblue')

plt.xlabel('Population (millions)')
plt.ylabel('Frequency')
plt.xticks(range(0,40, 10))

plt.tight_layout()
plt.show()


## Density Plots

Related to the histogram is a density plot, which shows the distribution of data values
as a continuous line. A density plot can be thought of as a smoothed histogram,
although it is typically computed directly from the data through a kernel density estimate.
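
To make the idea of a smoothed histogram concrete, here is a minimal sketch of a Gaussian kernel density estimate written with NumPy only. The helper `gaussian_kde_1d`, the sample values, and the bandwidth of 0.8 are all assumptions for illustration; plotting libraries normally choose the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (samples.size * bandwidth)

# Hypothetical sample: a cluster near 2 and one value further out
samples = [1.0, 1.5, 2.0, 2.2, 3.0, 6.0]
grid = np.linspace(-10, 20, 3001)
density = gaussian_kde_1d(samples, grid, bandwidth=0.8)

# A density integrates to 1 (checked here with a simple Riemann sum)
area = density.sum() * (grid[1] - grid[0])
peak = grid[np.argmax(density)]

print(area, peak)
```

A smaller bandwidth follows the individual points more closely, while a larger one gives a smoother curve.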

In [None]:
plt.figure(figsize=(6, 6))
sns.histplot(state['Population'] / 1e6, bins=10, color='blue', kde=True)
plt.xlabel('Population (millions)')
plt.ylabel('Frequency')
plt.xticks(range(0,40, 10))
plt.tight_layout()
plt.show()


# Correlation

Exploratory data analysis in many modeling projects (whether in data science or in
research) involves examining correlation among predictors, and between predictors
and a target variable. Variables X and Y (each with measured data) are said to be positively
correlated if high values of X go with high values of Y, and low values of X go
with low values of Y. If high values of X go with low values of Y, and vice versa, the
variables are negatively correlated.
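
One common way to quantify this relationship is Pearson's correlation coefficient, which ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship). The text above does not name a specific metric, so treat this choice as an assumption; the helper `pearson_r` and the data are hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-deviation of x and y scaled by their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical data
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 5]       # tends to rise with x
y_down = [10, 8, 6, 4, 2]    # falls exactly linearly as x rises

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)

# NumPy's built-in correlation matrix gives the same coefficient
r_up_numpy = np.corrcoef(x, y_up)[0, 1]

print(r_up, r_down, r_up_numpy)
```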

## Scatterplots

# Exploring Two or More Variables