# Exploratory Data Analysis in Python

## Overview
- Understanding how EDA is done in Python
- Various steps involved in the Exploratory Data Analysis
- Performing EDA on a given dataset

Source: [Exploratory Data Analysis In Python](https://www.analyticsvidhya.com/blog/2022/02/exploratory-data-analysis-in-python/)

## Structure-Based Exploratory Data Analysis

### Import Python Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

### Load dataset

In [None]:
dataset_folder = Path("./<path.to.folder>")
dataset_filename = "<filename>"

df = pd.read_csv(dataset_folder / dataset_filename)

### Display observations

# Display first 5 rows
df.head()

# Display last 5 rows
df.tail()

### Count Number of Non-Missing Values for each Variable

df.count()

## Descriptive Statistics


#### Continuous variables
- count
- mean
- standard deviation
- five-number summary:
    - minimum
    - first quartile
    - second quartile
    - third quartile
    - maximum

In [2]:
# Get descriptive stats of continuous variables
df.describe()


#### Categorical variables

- count
- unique: number of unique values
- top: most frequent value (mode)
- freq: count of the most frequent value

In [None]:
# Get descriptive stats of all variables, categorical ones included
# (numeric-only statistics show NaN for categorical columns and vice versa)
df.describe(include='all')

### Display Complete Meta-Data of dataset

In [None]:
df.info()

## Content-Based Exploratory Data Analysis

### Handling Duplicates

In [None]:
# Check for duplicates in the data
df.duplicated()

In [None]:
# Drop duplicate rows (returns a new DataFrame, so assign the result back)
df = df.drop_duplicates()

# To drop rows that repeat a value in a certain column:
# df = df.drop_duplicates(subset='UserID')
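
As a minimal sketch of the duplicate checks above, the cell below counts fully duplicated rows and compares row-level with column-level de-duplication. The `df_demo` frame and its values are made up for illustration; only the `UserID` and `Purchase` column names come from this notebook.

```python
import pandas as pd

# Made-up data for illustration only
df_demo = pd.DataFrame({
    "UserID": [1, 1, 2, 2],
    "Purchase": [100, 100, 250, 300],
})

# duplicated() marks every repeat of an earlier row as True,
# so summing the mask counts the fully duplicated rows
n_full_dupes = int(df_demo.duplicated().sum())

# Remove rows that are identical in every column
deduped_rows = df_demo.drop_duplicates()

# Keep only the first row seen for each UserID
deduped_users = df_demo.drop_duplicates(subset="UserID", keep="first")

print(n_full_dupes, len(deduped_rows), len(deduped_users))
```

Dropping on a subset is the more aggressive choice: it discards rows that differ in other columns, so it only makes sense when the column is expected to be unique.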

### Handling Outliers

Pick a variable from the DataFrame and determine its lower and upper cutoffs
using one of the following methods:
- Percentile method
- IQR method
- Standard Deviation method
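
The percentile and standard deviation methods can be sketched as follows. The sample values are made up, and the 1st/99th percentiles and the multiplier of 3 are common conventions assumed here, not values taken from this notebook.

```python
import pandas as pd

# Made-up purchase amounts for illustration; 500 is an obvious extreme value
purchase = pd.Series([10, 12, 11, 13, 12, 14, 11, 13, 12, 500], name="Purchase")

# Percentile method: use the 1st and 99th percentiles as cutoffs
# (assumed thresholds; 5th/95th is another common choice)
lc_pct = purchase.quantile(0.01)
uc_pct = purchase.quantile(0.99)

# Standard deviation method: mean +/- 3 standard deviations
# (3 is a conventional multiplier, assumed here)
mean, std = purchase.mean(), purchase.std()
lc_std = mean - 3 * std
uc_std = mean + 3 * std

print(f"percentile cutoffs: {lc_pct:.2f} to {uc_pct:.2f}")
print(f"std-dev cutoffs:    {lc_std:.2f} to {uc_std:.2f}")
```

On a sample this small the single extreme value inflates the standard deviation, so the two methods can disagree about the same observation. Comparing more than one method before treating outliers is usually worthwhile.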

In [None]:
# Using the feature name: Purchase

p0 = df.Purchase.min() # minimum
p100 = df.Purchase.max() # maximum
q1 = df.Purchase.quantile(0.25) # 1st quartile
q2 = df.Purchase.quantile(0.50) # 2nd quartile (median)
q3 = df.Purchase.quantile(0.75) # 3rd quartile
iqr = q3 - q1 # Interquartile range

lc = q1 - 1.5*iqr # lower cutoff value
uc = q3 + 1.5*iqr # upper cutoff value


- If lc <= p0, there are no outliers on the lower side
- If lc > p0, there are outliers on the lower side
- If uc >= p100, there are no outliers on the upper side
- If uc < p100, there are outliers on the upper side
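
Comparing the cutoffs with the minimum and maximum only shows whether outliers exist. A boolean mask, sketched below on made-up values, also shows which observations they are and how many. The IQR cutoffs are recomputed here so the cell runs on its own.

```python
import pandas as pd

# Made-up purchase amounts for illustration only
purchase = pd.Series([10, 12, 11, 13, 12, 14, 11, 13, 12, 500], name="Purchase")

q1 = purchase.quantile(0.25)
q3 = purchase.quantile(0.75)
iqr = q3 - q1

lc = q1 - 1.5 * iqr  # lower cutoff
uc = q3 + 1.5 * iqr  # upper cutoff

# True wherever a value falls outside the [lc, uc] range
is_outlier = (purchase < lc) | (purchase > uc)

n_outliers = int(is_outlier.sum())
outliers = purchase[is_outlier]

print(n_outliers)
print(outliers)
```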

### Visualize outliers using a boxplot

In [None]:
df.Purchase.plot(kind='box')

### Outlier Treatment

Clip method: values that are outside the range of the lower or upper cutoff
are replaced with the cutoff value.

In [None]:
# Clipping all values greater than the upper cutoff
df.Purchase.clip(upper=uc)

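# To treat the outliers and keep the change, assign the clipped values back
# (inplace=True on df.Purchase may act on a temporary copy and leave df unchanged):
df['Purchase'] = df.Purchase.clip(upper=uc)

# Visualize changes
df.Purchase.plot(kind='box')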

## Handling Missing Values

In [None]:
# Detect missing values
df.isna()

# Get percentage of missing values per variable
df.isna().sum() / df.shape[0] * 100

### Missing Value Treatment

Missing values can be treated with the following methods:
- Drop the variable
- Drop the observation(s)
- Missing value imputation

If more than 60% of a variable's values are missing, the variable is a good
candidate for dropping.
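
A minimal sketch of the 60% rule, using a made-up frame: compute the share of missing values per variable and drop only the variables above the threshold. `df_demo` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
df_demo = pd.DataFrame({
    "Purchase": [100.0, 200.0, np.nan, 400.0, 500.0],
    "Product_Category_2": [np.nan, np.nan, np.nan, np.nan, 8.0],
})

# Share of missing values per variable (the mean of a boolean mask)
missing_share = df_demo.isna().mean()

# Variables with more than 60% missing values
threshold = 0.60
cols_to_drop = missing_share[missing_share > threshold].index

df_reduced = df_demo.drop(columns=cols_to_drop)
print(list(df_reduced.columns))
```

Unlike `dropna(axis=1)`, which removes every variable that has any missing value, this keeps variables whose gaps are small enough to impute.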

In [None]:
# Impute the mode for a categorical variable
mode_value = df.Product_Category_2.mode()[0]
df['Product_Category_2'] = df.Product_Category_2.fillna(mode_value)

# Check the remaining missing values per variable
df.isna().sum()

In [None]:
# Drop every variable (column) that still contains missing values
df.dropna(axis=1, inplace=True)

### Check Datatypes

df.dtypes

## Univariate Analysis

Univariate analysis plots one variable at a time. The charts show the
distribution and composition of the data, and the chart type depends on whether
the variable is categorical or numerical.

### Continuous variables

Use box plots and histograms.

In [None]:
# Histogram
df.Purchase.hist()
plt.show()

# Another method
df.Purchase.plot(kind='hist', grid=True)
plt.show()

# Using matplotlib
plt.hist(df.Purchase)
plt.grid(True)
plt.show()

In [3]:
# Box plot
df.Purchase.plot(kind='box')
plt.show()

# Using matplotlib
plt.boxplot(df.Purchase)
plt.show()




### Categorical variables

Use bar charts or horizontal bar charts of the category counts.
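
A minimal sketch of both chart types, assuming a categorical column such as `Product_Category_2`. The category codes below are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up category codes for illustration only
category = pd.Series([2, 8, 8, 14, 8, 2], name="Product_Category_2")

# Frequency of each category, most frequent first
counts = category.value_counts()

# Vertical and horizontal bar charts side by side
fig, (ax_bar, ax_barh) = plt.subplots(1, 2, figsize=(10, 4))
counts.plot(kind="bar", ax=ax_bar, title="Bar chart")
counts.plot(kind="barh", ax=ax_barh, title="Horizontal bar chart")
plt.show()
```

Horizontal bars tend to read better when there are many categories or long category labels.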

## Bivariate Analysis

Bivariate analysis uses two variables at a time to create charts.
Because a variable can be categorical or numerical, there are three types of
bivariate analysis: numerical and numerical, numerical and categorical, and
categorical and categorical.

### Numerical and Numerical

Scatter plots and a correlation matrix drawn as a heatmap can be used to
visualize the relationship between two numerical variables.
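
A minimal sketch of both plots on synthetic data. The `Quantity` column is hypothetical (it does not appear in this notebook), and the values are randomly generated purely for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data: Quantity is a hypothetical column, and Purchase is
# constructed to depend on it so the relationship is visible
rng = np.random.default_rng(0)
quantity = rng.integers(1, 10, size=50)
df_demo = pd.DataFrame({
    "Quantity": quantity,
    "Purchase": quantity * 20 + rng.normal(0, 15, size=50),
})

fig, (ax_scatter, ax_heat) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot of one numerical variable against the other
df_demo.plot(kind="scatter", x="Quantity", y="Purchase", ax=ax_scatter)

# Pearson correlation matrix, drawn as an annotated heatmap
corr = df_demo.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax_heat)
plt.show()
```

Fixing `vmin` and `vmax` at -1 and 1 keeps the color scale comparable across datasets.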

### Numerical and Categorical

Use bar and line charts to see the composition or comparison of the data.
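
One common way to compare a numerical variable across categories is to aggregate it per category and plot the result. The sketch below uses the mean as the aggregate, which is an assumption, and the data is made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data for illustration only
df_demo = pd.DataFrame({
    "Product_Category_2": [2, 2, 8, 8, 14, 14],
    "Purchase": [100, 300, 500, 700, 200, 400],
})

# Mean purchase per category
mean_purchase = df_demo.groupby("Product_Category_2")["Purchase"].mean()

fig, ax = plt.subplots()
mean_purchase.plot(kind="bar", ax=ax)
ax.set_ylabel("Mean Purchase")
plt.show()
```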