# Simple EDA Template

### Overview Steps:
    1. Research and Brainstorm
    2. Preprocessing
    3. Individual Variables Exploration
    4. Variable Correlation Exploration
    5. Conclusion and Next Steps

## 1. Research and Brainstorm

### Research the context of the data
    - how was the data collected?
    - who collected the data?
    - are there any biases that could have come from the data?
    - is there any domain knowledge needed in order to explore this data in an informed way?
    - are there any variables that require research to understand their full meaning?
### Brainstorm questions and concepts that may or may not be answered with the data
    - are there any variables you suspect would correlate?
    - are there any variables that are expected to have certain trends or values?

## 2. Preprocessing

### Important issues to look for in the data

#### Duplicates
    - does the data have duplicates that need to be removed?
    - how does keeping or removing duplicate values change the insight from later EDA?
#### Null Values 
    - does the data have null values that should be removed?
    - how does keeping or removing null values change the insight from later EDA?
#### Outlier Values
    - do categorical variables have reasonable responses? For example, does a state variable contain only real states?
    - do quantitative variables have a reasonable range and standard deviation?
#### Inconsistent formats
    - do string variables have any float values?

In [3]:
#load in packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#load in data
data = pd.read_csv("")

#have a large dataset?

# ! pip install datatable
# import datatable as dt
# data = dt.fread("").to_pandas()

In [None]:
#number of duplicates 
duplicate_rows_df = data[data.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape[0])
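
To get a feel for how keeping or removing duplicates changes later results, one option is to compare simple statistics with and without them. The sketch below is only an illustration: the helper name `duplicate_impact` and the toy data are hypothetical, and it assumes that exact duplicate rows are the ones of interest.

```python
import pandas as pd

def duplicate_impact(df: pd.DataFrame) -> pd.DataFrame:
    """Compare numeric column means and row counts with and without duplicate rows."""
    deduped = df.drop_duplicates()
    summary = pd.DataFrame({
        "with_duplicates": df.mean(numeric_only=True),
        "without_duplicates": deduped.mean(numeric_only=True),
    })
    # add the row counts as a final row for reference
    summary.loc["n_rows"] = [len(df), len(deduped)]
    return summary

# toy data (hypothetical) with one exact duplicate row
toy = pd.DataFrame({"state": ["NY", "NY", "CA"], "value": [10.0, 10.0, 40.0]})
print(duplicate_impact(toy))
```

If the two columns differ noticeably, the decision to keep or drop duplicates is worth recording before moving on.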

In [None]:
#number of na values by variable
nulls = data.isnull().sum().sort_values(ascending=False)
print("Number of missing values:")
print(nulls)
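
Raw counts can be hard to judge without the size of the dataset, so a percentage per variable may help when deciding whether nulls should be removed. This is a minimal sketch; the helper name `missing_summary` and the toy data are hypothetical.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, most missing first."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(1),
    })
    return summary.sort_values("n_missing", ascending=False)

# toy data (hypothetical)
toy = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
print(missing_summary(toy))
```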

In [None]:
#observe unique values of variables to see if anything looks suspicious
for variable in data.columns:
    print("unique values of variable: " + str(variable))
    print(data[variable].unique())

In [None]:
#statistical values of numerical variables
data.describe()
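
The summary statistics show the range and standard deviation, but they do not say how many values look extreme. One common rule of thumb (an assumption here, not something this template prescribes) is to flag values more than 1.5 times the interquartile range beyond the quartiles. The helper name `iqr_outlier_counts` and the toy data are hypothetical.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per numeric column lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # compare each column against its own bounds
    outside = numeric.lt(lower, axis=1) | numeric.gt(upper, axis=1)
    return outside.sum()

# toy data (hypothetical): an age of 250 is clearly not reasonable
toy = pd.DataFrame({
    "age": [21, 22, 23, 24, 25, 250],
    "score": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
print(iqr_outlier_counts(toy))
```

A flagged value is only a candidate for review; whether it is an error or a genuine extreme depends on the domain research from step 1.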

In [None]:
#find format (dtype) of each variable
for variable in data.columns:
    print(str(variable) + ": " + str(data[variable].dtype))
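
A column reported as `object` can still mix strings with floats or other types, which is the inconsistent-format issue listed earlier. A rough way to check is to count the Python types of the non-null values in each such column. The helper name `python_types_per_column` and the toy data are hypothetical, and the sketch assumes mixed columns are stored with the `object` dtype.

```python
import pandas as pd

def python_types_per_column(df: pd.DataFrame) -> dict:
    """Map each object-dtype column to a count of the Python types of its non-null values."""
    result = {}
    for col in df.select_dtypes(include="object").columns:
        type_names = df[col].dropna().map(lambda value: type(value).__name__)
        result[col] = type_names.value_counts().to_dict()
    return result

# toy data (hypothetical): a float has slipped into a string column
toy = pd.DataFrame({"state": ["NY", "CA", 3.5, None], "n": [1, 2, 3, 4]})
print(python_types_per_column(toy))
```

Any column that reports more than one type is worth inspecting by hand.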

## 3. Individual Variables Exploration
    - 

In [None]:
#value counts of categorical values
#replace calculated_source with a categorical column from your data
data.calculated_source.value_counts().plot(kind='bar', figsize=(20,10))
pass

In [None]:
data.hist(column='', bins=25, grid=False, figsize=(12,8), color='#86bf91', zorder=2, rwidth=0.9)
pass

## 2-3. Alternative Method

### Pandas Profiling 


In [None]:
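#pandas_profiling has since been renamed ydata-profiling; this import assumes the older package is installed
import pandas_profiling

profile = data.profile_report(title='Pandas Profiling Report')
profile.to_file(output_file="fifa_pandas_profiling.html")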

## 4. Variable Correlation Exploration
    - How do variables compare to one another?
    - This is a good section to explore the ideas and questions from the brainstorming step about how certain variables might relate to one another

In [None]:
data.plot.scatter(x='', y='', c='DarkBlue')

In [None]:
sns.pairplot(data)
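
With many variables a pair plot becomes hard to read, so it can help to rank variable pairs by correlation first and then plot only the interesting ones. The sketch below uses Pearson correlation on numeric columns, which only captures linear relationships. The helper name `strongest_correlations` and the toy data are hypothetical.

```python
import pandas as pd

def strongest_correlations(df: pd.DataFrame, top_n: int = 5) -> pd.DataFrame:
    """List the numeric column pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    # visit each unordered pair of columns once
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            pairs.append((first, second, corr.loc[first, second]))
    result = pd.DataFrame(pairs, columns=["var_1", "var_2", "pearson_r"])
    order = result["pearson_r"].abs().sort_values(ascending=False).index
    return result.loc[order].head(top_n).reset_index(drop=True)

# toy data (hypothetical): y is exactly twice x
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
})
print(strongest_correlations(toy))
```

The top pairs are natural candidates for the scatter plot cell above, and they can be checked against the correlations suspected during brainstorming.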

## 5. Conclusion and Next Steps
    - 