# Simple EDA Template with Example

### Overview Steps:
    1. Research and Brainstorm
    2. Preprocessing
    3. Individual Variables Exploration
    4. Variable Correlation Exploration
    5. Conclusion and Next Steps

## 1. Research and Brainstorm

### Research the context of the data
    - how was the data collected?
    - who collected the data?
    - are there any biases that could have come from the data?
    - is any domain knowledge needed to explore this data in an informed way?
    - are there any variables that require research to understand their full meaning?
### Brainstorm questions and concepts that may or may not be answered with the data
    - are there any variables you suspect would correlate?
    - are there any variables that are expected to have certain trends or values?

## 2. Preprocessing

### Important issues to look for in the data

#### Duplicates
    - does the data have duplicates that need to be removed?
    - how does keeping or removing duplicate values change the insight from later EDA?
#### Null Values 
    - does the data have null values that should be removed?
    - how does keeping or removing null values change the insight from later EDA?
#### Outlier Values
    - do categorical variables have reasonable responses? For example, does a state variable contain only real states?
    - do quantitative variables have a reasonable range and standard deviation?
#### Inconsistent formats
    - do the variables have data types that make sense for the variable?

In [1]:
! pip install jupyter_to_medium


In [2]:
#load in packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)

In [3]:
#https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019
#load in data
#on_bad_lines replaces error_bad_lines, which was deprecated in pandas 1.3
data = pd.read_csv("bestsellers_with_categories.csv", on_bad_lines="skip")

#have a large dataset?
# ! pip install datatable
# import datatable as dt
# data = dt.fread("").to_pandas()

FileNotFoundError: [Errno 2] No such file or directory: 'bestsellers_with_categories.csv'

In [None]:
data.head()

In [None]:
#number of duplicate rows
duplicate_rows_df = data[data.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape[0])

#if you would like to remove duplicated values
# data = data.drop_duplicates()
# print("number of duplicate rows: ", data.duplicated().sum())
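
To see how keeping or removing duplicates changes later results, it can help to compare a summary statistic before and after dropping them. The sketch below uses a small made-up frame (`toy` and its values are invented for illustration, not taken from the bestsellers file).

```python
import pandas as pd

# Hypothetical toy data: the second "B" row is an exact repeat.
toy = pd.DataFrame({
    "Name": ["A", "B", "B", "C"],
    "Price": [10, 20, 20, 6],
})

# Count rows that are identical to an earlier row, then drop them.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print("duplicate rows:", n_duplicates)
print("mean price with duplicates:   ", toy["Price"].mean())
print("mean price without duplicates:", deduped["Price"].mean())
```

If only some columns should define a duplicate, both `duplicated` and `drop_duplicates` accept a `subset` argument.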

In [None]:
#number of na values by variable
nulls = data.isnull().sum().sort_values(ascending=False)
print("Number of missing values:")
nulls.head(20)
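
Raw null counts are easier to judge next to the share of rows they represent and the number of rows that would survive a `dropna()`. A minimal sketch on invented data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
toy = pd.DataFrame({
    "Price": [10.0, np.nan, 8.0, 12.0],
    "Genre": ["Fiction", "Fiction", None, "Non Fiction"],
})

# Fraction of missing values per column (the mean of a boolean mask).
null_share = toy.isnull().mean().sort_values(ascending=False)

# How many rows remain if every row containing a null is dropped?
rows_before = len(toy)
rows_after = len(toy.dropna())

print(null_share)
print("rows before:", rows_before, "rows after dropna:", rows_after)
```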

In [None]:
#observe unique values of each variable to see if anything looks suspicious
for variable in data.columns:
    print("----------------------------------------------")
    print("unique values of variable: " + str(variable))
    print(data[variable].unique())
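
When the valid categories are known in advance, unexpected values can be flagged directly instead of read off a printout. In this sketch the `allowed_genres` set and the `toy` frame are assumptions made for illustration.

```python
import pandas as pd

# Assumed set of valid categories; adjust to whatever the data's documentation says.
allowed_genres = {"Fiction", "Non Fiction"}

# Hypothetical toy data with one badly formatted entry.
toy = pd.DataFrame({"Genre": ["Fiction", "Non Fiction", "fiction ", "Non Fiction"]})

# Keep only the values that are not in the allowed set.
unexpected = toy.loc[~toy["Genre"].isin(allowed_genres), "Genre"]
print(unexpected.tolist())
```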

In [None]:
#statistical values of numerical variables
data.describe()
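
`describe()` shows the range and quartiles, and a common heuristic for turning those into an outlier flag is the 1.5 x IQR rule. This is one convention among several, not something the template prescribes, and the `prices` values below are invented.

```python
import pandas as pd

# Hypothetical prices with one suspiciously large value.
prices = pd.Series([8, 9, 10, 11, 12, 13, 105], name="Price")

# Interquartile range and the usual 1.5 * IQR fences.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidates for a closer look, not automatic removals.
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers.tolist())
```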

In [None]:
#find format of variables
print(data.dtypes)
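
If a variable's dtype does not match its meaning, it can be converted before further exploration. A minimal sketch, assuming a numeric column that was read as text and a text column that is really categorical (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: "Year" was read as strings and contains one bad entry.
toy = pd.DataFrame({
    "Year": ["2009", "2010", "unknown"],
    "Genre": ["Fiction", "Non Fiction", "Fiction"],
})

# errors="coerce" turns unparseable entries into NaN instead of raising.
toy["Year"] = pd.to_numeric(toy["Year"], errors="coerce")

# A small, fixed set of labels is a natural fit for the category dtype.
toy["Genre"] = toy["Genre"].astype("category")

print(toy.dtypes)
```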

## 3. Individual Variables Exploration
    - is there a dominant value for certain variables?
    - is there an interesting distribution within the responses?
    - do the responses make sense based on the context and background of the data?
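
One way to check whether a single value dominates a variable is to look at relative frequencies rather than raw counts. The `genres` series below is invented for illustration.

```python
import pandas as pd

# Hypothetical categorical variable.
genres = pd.Series(
    ["Non Fiction", "Fiction", "Non Fiction", "Non Fiction"], name="Genre"
)

# normalize=True returns shares instead of counts, sorted from most to least common.
shares = genres.value_counts(normalize=True)
top_value, top_share = shares.index[0], shares.iloc[0]

print(f"most common value: {top_value} ({top_share:.0%} of rows)")
```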

In [None]:
#value counts of categorical values
data['Genre'].value_counts().plot(kind='bar', figsize=(8,8))
pass

In [None]:
#value counts of a numeric variable with few distinct values
data['User Rating'].value_counts().plot(kind='bar', figsize=(8,8))
pass

In [None]:
#value counts of a numeric variable (a histogram may be clearer if there are many distinct values)
data['Price'].value_counts().plot(kind='bar', figsize=(8,8))
pass

In [None]:
author_counts = data.groupby('Author')['Name'].count()
author_counts.head(40)

In [None]:
#if you have a continuous float value, might be useful to use a histogram
#I like to use the seaborn histogram 

# sns.displot(data, x = '', binwidth=0.2, kind = 'hist')
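
A histogram plots how many values fall into each bin, so the same numbers can be inspected without a plot. The sketch below uses `numpy.histogram` on invented ratings, with five bins of width 0.2 to mirror the `binwidth=0.2` in the template line above.

```python
import numpy as np

# Hypothetical continuous values.
ratings = np.array([4.1, 4.3, 4.5, 4.5, 4.7, 4.7, 4.7, 4.9])

# Five equal-width bins between 4.0 and 5.0, i.e. a bin width of 0.2.
counts, edges = np.histogram(ratings, bins=5, range=(4.0, 5.0))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count}")
```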

## 2-3. Alternative Method

### Pandas Profiling 


In [None]:
# ! pip install pandas_profiling
#note: this package has since been renamed ydata-profiling
import pandas_profiling

profile = data.profile_report(title='Pandas Profiling Report')
profile.to_file(output_file="pandas_profiling.html")

## 4. Variable Correlation Exploration
    - How do variables compare to one another?
    - This is a good section to revisit the ideas and questions from the brainstorming step about how certain variables might relate
    - It is often important to aggregate features before comparing them

In [None]:
#start with pairplot to get simple comparisons of variables
sns.pairplot(data)
pass
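
A pairplot is visual. A correlation matrix gives the same pairwise comparison as numbers. This sketch restricts the frame to numeric columns first; the `toy` frame and its column values are invented.

```python
import pandas as pd

# Hypothetical toy data with perfectly related columns, so the result is easy to read.
toy = pd.DataFrame({
    "Price": [1, 2, 3, 4],
    "Reviews": [2, 4, 6, 8],
    "User Rating": [4.0, 3.0, 2.0, 1.0],
    "Genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
})

# Pearson correlation between every pair of numeric columns.
corr = toy.select_dtypes("number").corr()
print(corr.round(2))
```

Pearson correlation only captures linear relationships, so it complements the pairplot rather than replacing it.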

In [None]:
#Often important to aggregate many of the features to compare
author_average_price = data.groupby('Author')['Price'].mean()
author_average_price.head(20)

In [None]:
author_counts.head(20)

In [None]:
#how does the number of books an author has in the top 50 compare to their average price?
Author_Price_Df = pd.DataFrame(data = author_average_price)
Author_Bookcount_Df = pd.DataFrame(data = author_counts)

Merged_Df = Author_Price_Df.join(Author_Bookcount_Df, on = 'Author', how = 'left')

In [None]:
Merged_Df.head()

In [None]:
Merged_Df.plot.scatter(x='Price', y='Name', c='DarkBlue', figsize=(8,8))
pass
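
As an alternative to building two frames and joining them, pandas named aggregation can compute both per-author summaries in one step. This is a sketch on invented data; the output column names `mean_price` and `book_count` are illustrative choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy data in the same shape as the bestsellers columns used above.
toy = pd.DataFrame({
    "Author": ["X", "X", "Y"],
    "Name": ["b1", "b2", "b3"],
    "Price": [10, 20, 6],
})

# One groupby, two aggregations, with explicit output column names.
summary = toy.groupby("Author").agg(
    mean_price=("Price", "mean"),
    book_count=("Name", "count"),
)
print(summary)
```

Explicit column names also make a later scatter plot easier to read than `x='Price', y='Name'`.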

## 5. Conclusion and Next Steps
    - Document findings at the end of the EDA project so that they are easy to find and revisit in the future

### Notes from Amazon Top 50 Bestselling Books 2009-2019 EDA
    - 