# Simple EDA Template

### Overview Steps:
    1. Research and Brainstorm
    2. Preprocessing
    3. Individual Variables Exploration
    4. Variable Correlation Exploration
    5. Conclusion and Next Steps

## 1. Research and Brainstorm

### Research the context of the data
    - how was the data collected?
    - who collected the data?
    - are there any biases that could have come from the data?
    - is there any domain knowledge needed in order to explore this data in an informed way?
    - are there any variables that require research to understand their full meaning?
### Brainstorm questions and concepts that may or may not be answered with the data
    - are there any variables you suspect would correlate?
    - are there any variables that are expected to have certain trends or values?

## 2. Preprocessing

### Important issues to look for in the data

#### Duplicates
    - does the data have duplicates that need to be removed?
    - how does keeping or removing duplicate values change the insight from later EDA?
#### Null Values 
    - does the data have null values that should be removed?
    - how does keeping or removing null values change the insight from later EDA?
#### Outlier Values
    - do categorical variables have reasonable responses? For example, does a "state" variable contain only real states?
    - do quantitative variables have a reasonable range and standard deviation?
#### Inconsistent formats
    - do the variables have data types that make sense for the variable?

In [None]:
#load in packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)

In [None]:
#load in data
data = pd.read_csv("")

#have a large dataset?
# ! pip install datatable
# import datatable as dt
# data = dt.fread("").to_pandas()

In [None]:
data.head()

In [None]:
#number of duplicates
duplicate_rows_df = data[data.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape[0])

#if you would like to remove duplicated values
# data = data.drop_duplicates()
# print("number of duplicate rows after removal: ", data.duplicated().sum())

In [None]:
#number of na values by variable
nulls = data.isnull().sum().sort_values(ascending=False)
print("Number of missing values:")
nulls.head(20)
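
To get a feel for how keeping or removing null values changes later results, it can help to look at the share of missing values per column and to compare a statistic with and without the incomplete rows. This is a minimal sketch on made-up data; the `toy` DataFrame and its column names are assumptions for illustration, not part of the template.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "city": ["Austin", "Boston", None, "Denver", "Boston"],
    "score": [0.5, 0.7, 0.9, 0.4, 0.6],
})

# Fraction of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Compare a summary statistic with and without the incomplete rows
mean_all_rows = toy["score"].mean()
mean_complete_rows = toy.dropna()["score"].mean()
print("mean score, all rows:", mean_all_rows)
print("mean score, complete rows only:", mean_complete_rows)
```

If the two means differ noticeably, dropping rows with nulls is probably changing the picture, and it may be worth asking why those rows are incomplete.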

In [None]:
#observe unique values of variables to see if anything looks suspicious
for variables in data.columns:
    print("----------------------------------------------")
    print("unique values of variable: " + str(variables))
    print(data[variables].unique())
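
When a categorical variable has a known set of valid responses (the "real states" example above), printing unique values can be backed up by a direct comparison against that set. The sketch below assumes a hypothetical `state` column and a hypothetical `valid_states` reference list; both are made up for illustration.

```python
import pandas as pd

# Toy data and reference list, both invented for illustration
toy = pd.DataFrame({"state": ["TX", "CA", "ZZ", "tx", "NY", None]})
valid_states = {"TX", "CA", "NY"}  # hypothetical list of allowed values

# Normalize obvious formatting differences before comparing
cleaned = toy["state"].str.strip().str.upper()

# Non-null values that do not appear in the reference list
unexpected = cleaned[cleaned.notna() & ~cleaned.isin(valid_states)]
print(unexpected.value_counts())
```

Nulls are excluded here on purpose, since they are already counted in the missing-values step.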

In [None]:
#statistical values of numerical variables
data.describe()
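
`describe()` shows the range and standard deviation, but it does not point at specific rows. One common rule of thumb, not the only one, is to flag values more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch on a made-up `income` column:

```python
import pandas as pd

# Toy column, invented for illustration
toy = pd.DataFrame({"income": [42, 45, 47, 50, 52, 55, 400]})

# Interquartile range of the column
q1 = toy["income"].quantile(0.25)
q3 = toy["income"].quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged for review
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = toy[(toy["income"] < lower) | (toy["income"] > upper)]
print(flagged)
```

A flagged value is a candidate for a closer look, not automatically an error; whether to keep it depends on the context researched in step 1.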

In [None]:
#find format of variables
print(data.dtypes)
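
If a numeric or date variable shows up as `object`, it was likely read in as text. A possible fix, sketched here on made-up `price` and `sold_on` columns, is to convert with `errors="coerce"` so unparseable entries become missing values that can be counted afterwards. The date format string is an assumption about how the toy dates are written.

```python
import pandas as pd

# Toy data, invented for illustration: numbers and dates stored as text
toy = pd.DataFrame({
    "price": ["10.5", "7", "n/a", "3.25"],
    "sold_on": ["2021-01-05", "2021-02-10", "not recorded", "2021-03-15"],
})

# errors="coerce" turns unparseable entries into NaN / NaT instead of raising
toy["price"] = pd.to_numeric(toy["price"], errors="coerce")
toy["sold_on"] = pd.to_datetime(toy["sold_on"], format="%Y-%m-%d", errors="coerce")

print(toy.dtypes)
print(toy.isnull().sum())  # number of values per column that failed to parse
```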

## 3. Individual Variables Exploration
    - is there a dominant value for certain variables?
    - is there an interesting distribution within the responses?
    - do the responses make sense based on the context and background of the data?

In [None]:
#value counts of categorical values
data[''].value_counts().plot(kind='bar', figsize=(8,8))
pass

In [None]:
counts = data.groupby('')[''].count()
counts.head(20)

In [None]:
#if you have a continuous float value, might be useful to use a histogram
#I like to use the seaborn histogram 

sns.displot(data, x = '', binwidth=0.2, kind = 'hist')

## 2-3. Alternative Method

### Pandas Profiling 


In [None]:
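# ! pip install pandas_profiling
# note: the pandas_profiling package has since been renamed ydata-profiling;
# with the newer package, use `import ydata_profiling` instead
import pandas_profiling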

profile = data.profile_report(title='Pandas Profiling Report')
profile.to_file(output_file="pandas_profiling.html")

## 4. Variable Correlation Exploration
    - How do variables compare to one another?
    - This is a good section to explore ideas and questions from the brainstorming questions of how certain variables might compare to one another
    - Often important to aggregate many of the features to compare

In [None]:
#start with pairplot to get simple comparisons of variables
sns.pairplot(data)
pass
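
A pairplot gives a visual overview; a correlation matrix puts numbers on the same comparisons. This is a minimal sketch on made-up data (the column names are assumptions). It keeps only numeric columns and computes pairwise Pearson correlations, which capture linear relationships only.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 55, 61, 70, 74],
    "absences": [9, 7, 6, 3, 1],
    "group": ["a", "b", "a", "b", "a"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```

The resulting table can also be passed to `sns.heatmap(corr, annot=True)` if a visual version is easier to read.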

In [None]:
#Often important to aggregate many of the features to compare
aggregate_average = data.groupby('')[''].mean()
aggregate_average.head(20)

In [None]:
#you can also create scatter plots to see how different aggregates compare by combining dataframes
#Merged_Df is a placeholder for a dataframe you build by merging your aggregates

Merged_Df.plot.scatter(x='', y='', c='DarkBlue', figsize=(8,8))
pass
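
The `Merged_Df` in the cell above is not created anywhere in this template. One way it could be built, sketched here on made-up data with hypothetical column names, is to compute two aggregates grouped by the same key and combine them side by side.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "store": ["A", "A", "B", "B", "C"],
    "price": [10.0, 12.0, 20.0, 22.0, 15.0],
    "units": [5, 7, 2, 4, 6],
})

# Two aggregates grouped by the same key
avg_price = toy.groupby("store")["price"].mean().rename("avg_price")
total_units = toy.groupby("store")["units"].sum().rename("total_units")

# Both Series are indexed by store, so concat aligns them on that index
merged_df = pd.concat([avg_price, total_units], axis=1).reset_index()
print(merged_df)
```

With a frame like this, the scatter plot call becomes `merged_df.plot.scatter(x='avg_price', y='total_units')`.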

## 5. Conclusion and Next Steps
    - Document findings at the end of the EDA project so that they are easy to find and review in the future