# Exploratory Data Analysis and Visualization (in Python)

By: [Paul Jeffries](https://twitter.com/ByPaulJ) 

**NOTE: this is an early work in progress. Check back shortly for new additions.**

## Introduction 

The purpose of this document is to serve as a smorgasbord of EDA techniques and visualization tools. 

Once I have had time to flesh out this document more thoroughly, it will more closely resemble the [other markdowns that I have up on GitHub](https://github.com/pmaji/data-science-toolkit/blob/master/classification/logit/logistic_regression.md), which are much more neatly organized.

The counterpart to this notebook, written in R, can [be found here](https://github.com/pmaji/data-science-toolkit/blob/master/eda-and-visualization/eda_and_visualization.md).

## Setup 

To enable a toggle-able table of contents (with a button to enable or disable it), see [this link](https://github.com/minrk/ipython_extensions) for setup instructions.

In [1]:
import datetime
# prints the present date and time as a form of log
print("This notebook was last run: ", datetime.datetime.now())

This notebook was last run:  2019-04-01 23:27:39.593951


In [2]:
# key libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
# cd to correct directory 
%cd '/Users/pauljeffries/Desktop/personal/personal_code/data-science-toolkit' 
# print working directory just to be sure
%pwd

/Users/pauljeffries/Desktop/personal/personal_code/data-science-toolkit


'/Users/pauljeffries/Desktop/personal/personal_code/data-science-toolkit'

## Importing, Exploring, and Cleaning the Data

### Importing the Data

The data used in this document come from a [Kaggle post](https://www.kaggle.com/kemical/kickstarter-projects/home) focused on Kickstarter campaigns. If you are unfamiliar with the notion of a Kickstarter campaign (henceforth just "campaign"), I recommend reading [this FAQ](https://help.kickstarter.com/hc/en-us/categories/115000499013-Kickstarter-basics). I will not spend much time explaining the data, so for more information on the data specifically, I recommend reading the detailed exploration on the [data page for this Kaggle](https://www.kaggle.com/kemical/kickstarter-projects).

In [4]:
# importing the dataset from the CSV
base_df = pd.read_csv(filepath_or_buffer = 'hypothesis-tests/data/ks-projects-201801-sampled.csv')

# print the head of the base dataframe
base_df.head()

| | Unnamed: 0 | ID | name | category | main_category | currency | deadline | goal | launched | pledged | state | backers | country | usd.pledged | usd_pledged_real | usd_goal_real |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 136458340 | Squid Packs: a personalized twist on the class... | Product Design | Design | USD | 2012-11-19 | 5000.0 | 2012-10-18 19:27:07 | 7458.0 | successful | 78 | US | 7458.0 | 7458.0 | 5000.0 |
| 1 | 2 | 381995336 | Smash Monster Rampage | Tabletop Games | Games | USD | 2013-09-30 | 8000.0 | 2013-08-23 15:00:56 | 52693.0 | successful | 566 | US | 52693.0 | 52693.0 | 8000.0 |
| 2 | 3 | 1425707545 | Free Software Solutions For Everyone | Software | Technology | USD | 2014-09-12 | 525.0 | 2014-08-13 02:28:09 | 100.0 | failed | 1 | US | 100.0 | 100.0 | 525.0 |
| 3 | 4 | 960476049 | Everlasting Youth Dish (Canceled) | Food | Food | EUR | 2015-03-07 | 5000.0 | 2015-01-06 00:16:47 | 0.0 | canceled | 0 | IE | 0.0 | 0.0 | 5286.03 |
| 4 | 5 | 962915363 | Grippy Cushion - Stick - Grip - Charge | Gadgets | Technology | AUD | 2016-12-03 | 15000.0 | 2016-10-14 01:59:59 | 6527.0 | canceled | 83 | AU | 181.86 | 4873.81 | 11200.72 |
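
The `Unnamed: 0` column in the output above is what pandas produces when a CSV column has a blank header. A likely cause (an assumption on my part, not something the source data confirms) is a row index that was written out when the sampled file was saved. The sketch below uses a tiny in-memory CSV, not the real file, to show how `index_col=0` would absorb such a column into the index instead of keeping it as data.

```python
import io

import pandas as pd

# Hypothetical three-column CSV whose first header is blank,
# mimicking a file that was saved with its row index included.
csv_text = (
    ",ID,state\n"
    "1,136458340,successful\n"
    "2,381995336,successful\n"
)

# Default read: the blank header becomes a column named 'Unnamed: 0'.
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is used as the index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

Whether this is appropriate for the Kickstarter file depends on whether that first column really is a leftover index, so it is worth checking before changing the import.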


### Preliminary Data Exploration

I emphasize *preliminary* here because, when doing EDA, I prefer to explore the data in two parts. The first stage--the preliminary stage--involves just enough exploration to facilitate basic data cleaning. At this stage, the focus is on things like null values, type cohesion, etc. The second stage--the deeper exploration stage--presumes that our data are clean, so that we can begin more nuanced EDA like exploring data distributions, correlations, etc.

In [5]:
# building a helpful function to get info on the columns in our data frame
def dataframe_explorer(df):
    dict_list = []
    for col in df.columns:
        data = df[col]
        dict_ = {}
        # pulls the null count for a column
        dict_.update({'null_count' : data.isnull().sum()})
        # counts the unique values in a column
        dict_.update({'unique_count' : len(data.unique())})
        # gets the types of the data in a column
        dict_.update({'data_type' : data.dtype})
        dict_list.append(dict_)
            
    col_info_df = pd.DataFrame(dict_list)
    col_info_df.index = df.columns
    col_info_df.sort_values(by=['null_count','unique_count'], ascending=[True, False], inplace=True)
        
    return col_info_df
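
For comparison, the same three per-column summaries can probably be assembled without an explicit loop. The sketch below is an alternative under my own naming (`dataframe_explorer_vectorized` and the toy frame are hypothetical, not part of the original notebook); it uses `nunique(dropna=False)` so that nulls count as a distinct value, matching `len(data.unique())` above.

```python
import pandas as pd


def dataframe_explorer_vectorized(df):
    """Per-column null count, unique count, and dtype, built column-wise."""
    col_info_df = pd.DataFrame({
        # number of null entries in each column
        "null_count": df.isnull().sum(),
        # distinct values per column, counting NaN/None as one value
        "unique_count": df.nunique(dropna=False),
        # dtype of each column
        "data_type": df.dtypes,
    })
    # same ordering as the loop version: fewest nulls first, then most uniques
    return col_info_df.sort_values(
        by=["null_count", "unique_count"], ascending=[True, False]
    )


# Toy frame for illustration only
toy_df = pd.DataFrame({
    "state": ["successful", "failed", None],
    "backers": [78, 566, 1],
})
toy_info = dataframe_explorer_vectorized(toy_df)
print(toy_info)
```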

In [6]:
# checking out our data with the dataframe explorer
dataframe_explorer(base_df)

| | data_type | null_count | unique_count |
|---|---|---|---|
| Unnamed: 0 | int64 | 0 | 189330 |
| ID | int64 | 0 | 189330 |
| launched | object | 0 | 189190 |
| usd_pledged_real | float64 | 0 | 61187 |
| pledged | float64 | 0 | 37885 |
| usd_goal_real | float64 | 0 | 31194 |
| goal | float64 | 0 | 5288 |
| deadline | object | 0 | 3130 |
| backers | int64 | 0 | 2901 |
| category | object | 0 | 159 |
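
The explorer output shows `launched` and `deadline` stored as `object`, even though both hold dates. As one example of a type-cohesion fix, the sketch below converts two sample rows (values copied from the `.head()` output above) with `pd.to_datetime`. It is a minimal illustration on a hypothetical two-row frame, not a step the original notebook performs.

```python
import pandas as pd

# Two rows of date strings taken from the head() output above
sample_df = pd.DataFrame({
    "deadline": ["2012-11-19", "2013-09-30"],
    "launched": ["2012-10-18 19:27:07", "2013-08-23 15:00:56"],
})

# Parse the string columns into proper datetime columns
converted_df = sample_df.assign(
    deadline=pd.to_datetime(sample_df["deadline"]),
    launched=pd.to_datetime(sample_df["launched"]),
)

print(converted_df.dtypes)
```

Once the columns are datetimes, date arithmetic and time-based grouping become available for the deeper exploration stage.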


### Cleaning the Data

We'll start by making use of some of the [excellent functions provided by the janitor package](https://github.com/ericmjl/pyjanitor) to clean up column names and remove any entirely empty rows. In the case of this data, which came already very clean (a rare treat!), it doesn't have a large effect on our df. If we didn't already know that our data were so clean, it would be wise to re-run the dataframe_explorer() function from above on the cleaned data below, in addition to spot-checking with .head() / .tail().

In [7]:
from janitor import clean_names, remove_empty

In [8]:
clean_base_df = (base_df
          .clean_names(strip_underscores=True)
          .remove_empty()
     ) # further method chaining possible

clean_base_df.head()

| | unnamed_0 | id | name | category | main_category | currency | deadline | goal | launched | pledged | state | backers | country | usd_pledged | usd_pledged_real | usd_goal_real |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 136458340 | Squid Packs: a personalized twist on the class... | Product Design | Design | USD | 2012-11-19 | 5000.0 | 2012-10-18 19:27:07 | 7458.0 | successful | 78 | US | 7458.0 | 7458.0 | 5000.0 |
| 1 | 2 | 381995336 | Smash Monster Rampage | Tabletop Games | Games | USD | 2013-09-30 | 8000.0 | 2013-08-23 15:00:56 | 52693.0 | successful | 566 | US | 52693.0 | 52693.0 | 8000.0 |
| 2 | 3 | 1425707545 | Free Software Solutions For Everyone | Software | Technology | USD | 2014-09-12 | 525.0 | 2014-08-13 02:28:09 | 100.0 | failed | 1 | US | 100.0 | 100.0 | 525.0 |
| 3 | 4 | 960476049 | Everlasting Youth Dish (Canceled) | Food | Food | EUR | 2015-03-07 | 5000.0 | 2015-01-06 00:16:47 | 0.0 | canceled | 0 | IE | 0.0 | 0.0 | 5286.03 |
| 4 | 5 | 962915363 | Grippy Cushion - Stick - Grip - Charge | Gadgets | Technology | AUD | 2016-12-03 | 15000.0 | 2016-10-14 01:59:59 | 6527.0 | canceled | 83 | AU | 181.86 | 4873.81 | 11200.72 |
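
To make the effect of the two janitor calls more concrete, here is a rough pandas-only approximation. It is not pyjanitor's implementation: `simple_clean_names` and `simple_remove_empty` are hypothetical helpers written for this illustration, and the real functions handle more cases. The sketch lowercases column names, collapses runs of non-alphanumeric characters into a single underscore, strips leading and trailing underscores, and drops rows and columns that are entirely null.

```python
import re

import pandas as pd


def simple_clean_names(df):
    """Approximate column-name cleanup (illustrative, not pyjanitor's code)."""
    def _clean(name):
        name = str(name).lower()
        # collapse any run of characters other than a-z / 0-9 into "_"
        name = re.sub(r"[^0-9a-z]+", "_", name)
        # drop leading/trailing underscores
        return name.strip("_")

    return df.rename(columns=_clean)


def simple_remove_empty(df):
    """Drop rows and columns that contain only nulls."""
    return df.dropna(axis=0, how="all").dropna(axis=1, how="all")


# Toy frame reusing a few column names from the data; the last row is empty
toy_df = pd.DataFrame({
    "Unnamed: 0": [1, 2, None],
    "ID": [136458340, 381995336, None],
    "usd.pledged": [7458.0, 52693.0, None],
})

cleaned_toy_df = simple_remove_empty(simple_clean_names(toy_df))
print(list(cleaned_toy_df.columns))
print(len(cleaned_toy_df))
```

On this toy frame the column names come out in the same form as in the cleaned output above (`unnamed_0`, `id`, `usd_pledged`), and the all-null row is dropped.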


## Summary Statistics



### High-level summary stats

## Bespoke Visualizations

### Histograms

### Density Plots