# Exploratory Data Analysis (EDA)

Exploratory data analysis is the process of reviewing and cleaning data to:
- derive insights
- generate hypotheses for experiments


In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import warnings

warnings.filterwarnings('ignore')

## Loading data

In [None]:
df = pd.read_csv('../data/Auto.csv')

## Exploring

In [None]:
df.head()

In [None]:
df.head(10)

In [None]:
df.info()

In [None]:
df.name.value_counts()

In [None]:
df.describe()

In [None]:
sns.histplot(data=df, x='mpg')
plt.show()

In [None]:
sns.histplot(data=df, x='mpg',binwidth=2)
plt.show()

## Data Validation

Data validation is an important early step in exploratory data analysis. It checks that data types and value ranges match our expectations before we go any further.

In [None]:
df.info()

In [None]:
df.dtypes

What if we are not happy with the type of a column?

In [None]:
df['origin'] = df.origin.astype('category')
df['cylinders'] = df.cylinders.astype('category')

In [None]:
# df['column'] = df['column'].astype(int)
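
When a column that should be numeric contains stray text, a direct `astype(int)` raises an error. The following is a minimal sketch on a toy frame, assuming a hypothetical `'?'` placeholder and a made-up column name `raw_value`; it uses `pd.to_numeric` with `errors='coerce'` to turn unparseable entries into `NaN` so they can be handled as missing data.

```python
import pandas as pd

# Toy data (hypothetical): a numeric column polluted by a '?' placeholder
toy = pd.DataFrame({'raw_value': ['130', '165', '?', '150']})

# astype(int) would raise here, so coerce unparseable entries to NaN instead
toy['raw_value'] = pd.to_numeric(toy['raw_value'], errors='coerce')

print(toy.dtypes)
print(toy['raw_value'].isna().sum())  # number of values that could not be parsed
```

The column ends up as a float type because `NaN` cannot be stored in a plain integer column.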

We can validate categorical data using the isin method, which returns a boolean mask; the ~ operator negates that mask.

In [None]:
# df['categorical_column'].isin(["value_a", "value_b"])
# ~df['categorical_column'].isin(["value_a", "value_b"])
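
A minimal sketch of this check on a toy frame is shown here. The column values and the list of allowed categories are assumptions made up for illustration, not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values) with one unexpected category
toy = pd.DataFrame({'origin': ['usa', 'japan', 'europe', 'mars']})
allowed = ['usa', 'japan', 'europe']

# True where the value is one of the allowed categories
is_valid = toy['origin'].isin(allowed)

# ~ negates the mask, selecting the rows that fail validation
invalid_rows = toy[~is_valid]
print(invalid_rows)
```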

Validating numerical data

In [None]:
df.select_dtypes("number").head()

In [None]:
df.mpg.min()

In [None]:
df.mpg.max()

In [None]:
sns.boxplot(data=df, x='mpg')

In [None]:
sns.boxplot(data=df, x='mpg', y='origin')

In [None]:
sns.boxplot(data=df, x='mpg', hue='origin')
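
Besides inspecting the minimum, the maximum and the boxplots, a range check can be written explicitly. This is a sketch on toy data; the bounds `low` and `high` are assumed values chosen for illustration and should come from domain knowledge.

```python
import pandas as pd

# Toy data (hypothetical): one negative value that cannot be a valid mpg
toy = pd.DataFrame({'mpg': [18.0, 25.5, -1.0, 33.0]})

# Assumed plausible range; adjust to the domain knowledge for your data
low, high = 0, 60

# between() is inclusive on both ends by default
in_range = toy['mpg'].between(low, high)
out_of_range = toy[~in_range]
print(out_of_range)
```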

## Data Summarization

Grouping data helps us understand the characteristics of groups within the data.
An aggregating function specifies how to summarize each group (count, mean, sum, min, max, var, std).

### Numerical

In [None]:
df.groupby('origin')['mpg'].mean()

In [None]:
df[['origin', 'mpg']].groupby('origin').agg(['count', 'mean'])

In [None]:
df.groupby('origin').agg({'mpg':['max', 'min'],'weight':['max', 'min']})

In [None]:
df.groupby('origin').agg(min_mpg=('mpg', 'min'), max_weight=('weight', 'max'))

### Categorical

In [None]:
# countplot shows the number of cars per origin, split by cylinders
sns.countplot(data=df, x='origin', hue='cylinders')

## Addressing Missing Data

Missing data can distort distributions. The remaining data may then be unrepresentative of the whole population, leading to wrong conclusions.

In [None]:
df.isna().sum()

When facing missing data there are several options:
- drop the rows with missing values (if a column is missing 5% of its values or fewer)
- imputation of the mean, median or mode depending on the context
- imputation by subgroup


In [None]:
# Example of how to handle missing values:

threshold = len(df) * 0.05

# Columns missing 5% of their values or fewer: drop the affected rows
cols_to_drop = df.columns[df.isna().sum() <= threshold]
df.dropna(subset=cols_to_drop, inplace=True)

# Columns over the threshold: impute the mode (or the mean, median...)
cols_with_missing_values = df.columns[df.isna().sum() > threshold]

for col in cols_with_missing_values:
    # fillna returns a new Series, so assign the result back
    df[col] = df[col].fillna(df[col].mode()[0])

# Imputing by subgroup (template: 'grouping_col' and 'target_col' are placeholders)
# df_dict = df.groupby('grouping_col')['target_col'].median().to_dict()
# df['target_col'] = df['target_col'].fillna(df['grouping_col'].map(df_dict))
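
The subgroup imputation step can be sketched on a toy frame with real column names. The data here are made up for illustration: each group has one missing `mpg`, which is filled with the median of its own group.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): mpg is missing for one car in each group
toy = pd.DataFrame({
    'origin': ['usa', 'usa', 'usa', 'japan', 'japan', 'japan'],
    'mpg': [15.0, 17.0, np.nan, 30.0, np.nan, 34.0],
})

# Median mpg per group as a {group: median} dictionary (NaN is skipped)
median_by_origin = toy.groupby('origin')['mpg'].median().to_dict()

# Fill each missing value with the median of its own group
toy['mpg'] = toy['mpg'].fillna(toy['origin'].map(median_by_origin))
print(toy)
```

Mapping the grouping column through the dictionary produces a Series aligned with the original rows, so `fillna` only uses it where `mpg` is missing.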


## Converting and analyzing categorical data

In [None]:
df

In [None]:
df['brand'] = df.name.str.split().str[0]

In [None]:
df.brand.nunique()

In [None]:
sns.histplot(data=df, x='brand')

In [None]:
df.brand.str.contains('^chev', case=False).sum()

In [None]:
df.brand.value_counts()

In [None]:
# Canonical brand labels, in the same order as the conditions defined below
brands = [
    'ford', 'chevrolet', 'plymouth', 'dodge', 'amc', 'toyota', 'datsun',
    'buick', 'pontiac', 'volkswagen', 'honda', 'mercury', 'mazda',
    'oldsmobile', 'fiat', 'peugeot', 'audi', 'chrysler', 'volvo',
]

In [None]:
conditions=[
    (df.name.str.contains('^ford', case=False)),
    # Group the alternatives so that ^ anchors every spelling, not just the first
    (df.name.str.contains('^(?:chevrolet|chevroelt|chevy)', case=False)),
    (df.name.str.contains('^plymouth', case=False)),
    (df.name.str.contains('^dodge', case=False)),
    (df.name.str.contains('^amc', case=False)),
    (df.name.str.contains('^(?:toyota|toyouta)', case=False)),
    (df.name.str.contains('^datsun', case=False)),
    (df.name.str.contains('^buick', case=False)),
    (df.name.str.contains('^pontiac', case=False)),
    (df.name.str.contains('^(?:volkswagen|vw|vokswagen)', case=False)),
    (df.name.str.contains('^honda', case=False)),
    (df.name.str.contains('^mercury', case=False)),
    (df.name.str.contains('^mazda', case=False)),
    (df.name.str.contains('^oldsmobile', case=False)),
    (df.name.str.contains('^fiat', case=False)),
    (df.name.str.contains('^peugeot', case=False)),
    (df.name.str.contains('^audi', case=False)),
    (df.name.str.contains('^chrysler', case=False)),
    (df.name.str.contains('^volvo', case=False))
]

In [None]:
df['clean_brand'] = np.select(conditions, brands, default='other')
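
`np.select` evaluates the conditions in order and, for each row, returns the choice paired with the first condition that is true; rows matching no condition get the default. A small sketch with made-up names (assumptions for illustration only) shows a misspelling being mapped to its canonical brand:

```python
import numpy as np
import pandas as pd

# Toy names (hypothetical), including a misspelling and an unknown brand
names = pd.Series(['ford pinto', 'chevy nova', 'chevroelt chevelle', 'saab 99'])

conditions = [
    names.str.contains('^ford', case=False),
    # Non-capturing group so that ^ applies to every alternative
    names.str.contains('^(?:chevrolet|chevroelt|chevy)', case=False),
]
choices = ['ford', 'chevrolet']

# Rows matching no condition receive the default label
labels = np.select(conditions, choices, default='other')
print(labels)
```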

In [None]:
df

In [None]:
# Plot the cleaned labels rather than the raw first word of the name
sns.countplot(data=df, x='clean_brand')
plt.show()

## Working with numeric data




In [None]:
# pd.Series.str.replace('to remove', 'to replace with')

# df['col'] = df['col'].astype(float)

# df['std_dev']
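
The template above can be sketched on toy data. The column name and the comma thousands separator are assumptions for illustration: the unwanted characters are removed with `str.replace` and the cleaned strings are then converted to floats.

```python
import pandas as pd

# Toy data (hypothetical): weights stored as text with a thousands separator
toy = pd.DataFrame({'weight': ['3,504', '3,693', '2,130']})

# Remove the separator literally (regex=False), then convert to float
toy['weight'] = toy['weight'].str.replace(',', '', regex=False).astype(float)

# Numeric summaries such as the standard deviation are now available
print(toy['weight'].std())
```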

In [None]:
df

In [None]:
# Per-brand standard deviation and mean of mpg, broadcast back to every row
df['clean_brand_mean_std'] = df.groupby('clean_brand')['mpg'].transform('std')
df['clean_brand_mean_mpg'] = df.groupby('clean_brand')['mpg'].transform('mean')

In [None]:
df[['mpg', 'clean_brand', 'clean_brand_mean_mpg', 'clean_brand_mean_std']]

## Outliers

The interquartile range (IQR), the difference between the 75th and 25th percentiles, can help us identify outliers.

- Upper outliers are observations greater than the 75th percentile + 1.5 times the IQR
- Lower outliers are observations smaller than the 25th percentile - 1.5 times the IQR

Outliers are extreme values that may not accurately represent our data. They can shift the mean and inflate the standard deviation. Many statistical tests and some machine learning models assume approximately normally distributed data, so the skew introduced by outliers can undermine them.

Once we know we have outliers, we should ask:
- why do we have them?
- are they accurate, or do they represent errors made during data collection?

In [None]:
seventyfifth_q = df['mpg'].quantile(0.75)
twentyfifth_q = df['mpg'].quantile(0.25)

iqr = seventyfifth_q - twentyfifth_q

df['outliers'] = ((df['mpg']<twentyfifth_q-1.5*iqr) | (df['mpg']>seventyfifth_q+1.5*iqr))

In [None]:
# In this example there are no outliers

df.outliers.sum()
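
The same rule can be wrapped in a small reusable helper. This is a sketch: the function name `iqr_outlier_mask` and the toy values are hypothetical, chosen so that one value is clearly extreme.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy data (hypothetical) with one obviously extreme value
toy = pd.Series([10, 12, 11, 13, 12, 100])
mask = iqr_outlier_mask(toy)
print(toy[mask])
```

Keeping the multiplier `k` as a parameter makes it easy to try a stricter or looser rule than the conventional 1.5.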

In [None]:
sns.boxplot(data=df, x='mpg')

In [None]:
sns.histplot(data=df, x='mpg')

In [None]:
sns.histplot(data=df[~df.outliers], x='mpg')

In [None]:
sns.pairplot(data=df)

## Histograms

In [None]:
swing = pd.read_csv('../data/2008_swing_states.csv')
swing.head()

In [None]:
plt.hist(swing['dem_share'])
plt.xlabel('percent of vote for Obama')
plt.ylabel('number of counties')
plt.show()

In [None]:
plt.hist(swing['dem_share'], bins=[0,10,20,30,40,50,60,70,80,90,100])
plt.xlabel('percent of vote for Obama')
plt.ylabel('number of counties')
plt.show()

In [None]:
sns.set()
plt.hist(swing['dem_share'], bins=[0,10,20,30,40,50,60,70,80,90,100])
plt.xlabel('percent of vote for Obama')
plt.ylabel('number of counties')
plt.show()

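**BINNING BIAS:** The shape of a histogram depends heavily on the chosen bins, so the same data can suggest different conclusions under different bin edges or widths.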
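
A minimal sketch of binning bias, using made-up values and `np.histogram` to count points per bin: the same six points look like two equal groups or one single group depending only on where the bin edges fall.

```python
import numpy as np

# Toy data (hypothetical): values clustered just below and just above 50
data = np.array([48, 49, 49, 51, 51, 52])

# Bins with an edge at 50 split the cluster into two bars
counts_a, _ = np.histogram(data, bins=[40, 50, 60])

# Shifting the edges by 5 puts every point into a single bar
counts_b, _ = np.histogram(data, bins=[35, 45, 55])

print(counts_a)  # counts per bin with edges 40, 50, 60
print(counts_b)  # counts per bin with edges 35, 45, 55
```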

## Bee Swarm Plot


In [None]:
sns.swarmplot(y='dem_share', x='state', hue='state', data=swing)
plt.xlabel('state')
plt.ylabel('percent of vote for Obama')
plt.show()

In [None]:
swing

Bee swarm plots become hard to read when there are too many data points, because the points can no longer be placed without overlapping.

## Empirical cumulative distribution functions (ECDFs)

In [None]:
x = np.sort(swing.dem_share)
y = np.arange(1,len(x)+1)/len(x)
plt.plot(x, y, marker='.', linestyle='none')
plt.ylabel('ECDF')
plt.xlabel('percent of vote for Obama')

plt.show()
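
The ECDF computation can be wrapped in a small function for reuse. This is a sketch: the function name `ecdf` and the toy vote shares are hypothetical. For each sorted value `x[i]`, `y[i]` is the fraction of observations less than or equal to it.

```python
import numpy as np

def ecdf(data):
    """Return sorted values x and cumulative fractions y for an ECDF."""
    x = np.sort(np.asarray(data))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Toy data (hypothetical vote shares)
x, y = ecdf([60, 40, 50, 45])

print(x)
print(y)
```

Unlike a histogram, the ECDF shows every data point and does not depend on a choice of bins.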