# Week 4: Exploratory Data Analysis (Categorical Variables + Numerical Variables) + Data Manipulation

**Sources:**

- Python for Marketing Research and Analytics. J. Schwarz, C. Chapman, and E.M. Feit. Springer 2020.
- Matplotlib: https://matplotlib.org
- Seaborn for Categorical Data: https://seaborn.pydata.org/tutorial/categorical.html

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import squarify

# use ggplot style when plotting
plt.style.use('ggplot')

`pandas` provides data scientists with different methods and APIs (application programming interfaces) to read data from different sources. The following table shows some of these interfaces:

|**Data Source**|**Read Function**|**Write Function**|**Data Type**|
|:--------------|:----------------|:-----------------|:-------------|
| CSV | `read_csv()` | `to_csv()` | structured |
| JSON | `read_json()` | `to_json()` | semi-structured |
| XML | `read_xml()` | `to_xml()` | semi-structured |
| Excel | `read_excel()` | `to_excel()` | structured |
| Python Pickle File Format | `read_pickle()` | `to_pickle()` | Python object |
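
The read/write pairs in the table can be tried without touching the disk. The following is a minimal sketch that round-trips a small made-up DataFrame through CSV and JSON using in-memory text buffers; the data and variable names are illustrative assumptions, not part of the course datasets.

```python
import io
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({"country": ["US", "DE", "US"], "sales": [120, 95, 130]})

# CSV round trip: to_csv() returns a string when no path is given
csv_text = df.to_csv(index=False)
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON round trip, one JSON object per row ("records" orientation)
json_text = df.to_json(orient="records")
df_json = pd.read_json(io.StringIO(json_text), orient="records")

print(df_csv)
print(df_json)
```

Replacing the `io.StringIO` buffers with file paths gives the usual file-based workflow.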

# 4.1 EDA for Two Categorical Variables

In [None]:
store_sales = pd.read_csv('product_data.csv')
store_sales.head()

## 4.1 Distribution of Two Categorical Variables

### 4.1.1 Cross Tabulation (Joint Distribution)

In [None]:
# if we have two categorical variables, we can create a cross-tabulation as follows
# this cross tabulation shows us the joint distribution of the two categorical variables

# the following line of code finds the mean price of product 1 by promotion and country



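One possible way to build a table of mean prices by promotion and country is `DataFrame.pivot_table()`. The sketch below uses a small made-up stand-in for `store_sales`; the column names `p1_price`, `p1_promo` and `country` are assumptions and may differ from those in `product_data.csv`.

```python
import pandas as pd

# Made-up stand-in for store_sales; column names are assumptions
sales = pd.DataFrame({
    "country":  ["US", "US", "DE", "DE", "US", "DE"],
    "p1_promo": [0, 1, 0, 1, 1, 0],
    "p1_price": [2.99, 2.49, 2.99, 2.19, 2.29, 2.79],
})

# Mean price of product 1 for every (promotion, country) combination
mean_price = sales.pivot_table(values="p1_price", index="p1_promo",
                               columns="country", aggfunc="mean")
print(mean_price.round(2))
```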
In [None]:
# the following line of code finds the joint distribution of the number of transactions
# by promotion and country



In [None]:
# pandas has a crosstab() function which makes it easier to create cross tabs
# the output is similar to the previous output so you can use either command



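To illustrate that the two approaches give the same counts, here is a small sketch comparing `groupby(...).size().unstack()` with `pd.crosstab()`. The data is made up and the column names `p1_promo` and `country` are assumptions about the sales file.

```python
import pandas as pd

# Made-up data; column names are assumptions
sales = pd.DataFrame({
    "country":  ["US", "US", "DE", "DE", "US", "DE"],
    "p1_promo": [0, 1, 0, 1, 1, 0],
})

# Count rows per (promotion, country) pair with groupby + unstack ...
counts_groupby = (sales.groupby(["p1_promo", "country"])
                       .size()
                       .unstack(fill_value=0))

# ... or let pd.crosstab build the same frequency table directly
counts_crosstab = pd.crosstab(sales["p1_promo"], sales["country"])

print(counts_crosstab)
```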
In [None]:
#Crosstabs for mpg


### 4.1.2 Joint and Marginal Probability Distribution 

- With `pd.crosstab()`, we can normalize our table using the `normalize` argument:
    - If passed `'all'` or `True`, will normalize over all values.
    - If passed `'index'`, will normalize over each row.
    - If passed `'columns'`, will normalize over each column.
    - If `margins` is `True`, will also normalize margin values.
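
The options above can be sketched on a tiny made-up table. The column names `p1_promo` and `country` are assumptions; the point is only to show how `normalize='all'` and `margins=True` interact.

```python
import pandas as pd

# Made-up data; column names are assumptions
sales = pd.DataFrame({
    "country":  ["US", "US", "DE", "DE", "US", "DE", "US", "DE"],
    "p1_promo": [0, 1, 0, 1, 1, 0, 0, 0],
})

# Joint distribution: every cell is divided by the grand total
joint = pd.crosstab(sales["p1_promo"], sales["country"], normalize="all")

# margins=True appends an "All" row and column with the marginal distributions
joint_margins = pd.crosstab(sales["p1_promo"], sales["country"],
                            normalize="all", margins=True)
print(joint_margins)
```

The cells of `joint` sum to 1, and the `All` row and column of `joint_margins` hold the marginal distributions of country and promotion respectively.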

In [None]:
# normalize by total to get joint probability distribution
# add margins = True to get the marginal probability distribution


In [None]:
# add margins = True to get the marginal probability distribution


### 4.1.3 Conditional Probability Distribution

In [None]:
# normalize by row
# Conditional probability (conditioned on promo here)


In [None]:
# normalize by column

# Conditional probability (conditioned on country here)

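As a rough sketch of the two conditional distributions, assuming promotion is on the rows and country on the columns (made-up data, assumed column names):

```python
import pandas as pd

# Made-up data; column names are assumptions
sales = pd.DataFrame({
    "country":  ["US", "US", "DE", "DE", "US", "DE", "US", "DE"],
    "p1_promo": [0, 1, 0, 1, 1, 0, 0, 0],
})

# Each row sums to 1: distribution of country given promotion
given_promo = pd.crosstab(sales["p1_promo"], sales["country"],
                          normalize="index")

# Each column sums to 1: distribution of promotion given country
given_country = pd.crosstab(sales["p1_promo"], sales["country"],
                            normalize="columns")

print(given_promo)
print(given_country)
```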

- More on crosstabs (https://pythonguides.com/crosstab-in-python-pandas/)

## 4.2 Visual Distribution of Two Categorical Variables

- In this section, we will use the `seaborn` data visualization package to create our plots
- We will also switch to the `tips` dataset that comes with `seaborn`, so there is no need to load the data from a csv file
- `seaborn` methods build on top of `matplotlib` functionality
- We could have done the plots below using `matplotlib`, but we would have to write more code
- Note that `seaborn` does the cross-tabulation for the categorical variables (unlike `matplotlib`)
- Note that we could have also used the `seaborn` methods to visualize one categorical variable
- To create the plots below using `matplotlib`, follow these links:
    - Stacked Barchart: https://matplotlib.org/stable/gallery/lines_bars_and_markers/bar_stacked.html#sphx-glr-gallery-lines-bars-and-markers-bar-stacked-py
    - Grouped Barchart: https://matplotlib.org/stable/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py

In [None]:
# we will use the tips dataset that comes with seaborn


### 4.2.1 Grouped Barchart

In [None]:
# The following barchart shows the distribution of smokers vs. non-smokers



In [None]:
# Interaction of two categorical variables


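A grouped bar chart of two categorical variables can be sketched with `sns.countplot()` and its `hue` argument. To keep the example self-contained, it uses a small made-up DataFrame that borrows the `day` and `smoker` column names from the `tips` dataset rather than loading the dataset itself.

```python
import pandas as pd
import seaborn as sns

# Small made-up sample borrowing column names from seaborn's tips dataset
tips_demo = pd.DataFrame({
    "day":    ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sat", "Sun"],
    "smoker": ["No", "Yes", "No", "Yes", "No", "No", "Yes", "No"],
})

# One group of bars per day, one bar per smoker level within each group;
# seaborn counts the rows in each (day, smoker) cell for us
ax = sns.countplot(data=tips_demo, x="day", hue="smoker")
ax.set_ylabel("Number of bills")
```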

### 4.2.2 Mosaic Plot

- To create the mosaic plot, we will use the `mosaic` function from the `statsmodels` Python package

In [None]:
from statsmodels.graphics.mosaicplot import mosaic


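A minimal sketch of a mosaic plot, assuming a DataFrame with two categorical columns. The data below is made up and only borrows the `day` and `smoker` column names from the `tips` dataset.

```python
import pandas as pd
from statsmodels.graphics.mosaicplot import mosaic

# Made-up sample; every (day, smoker) combination appears at least once
tips_demo = pd.DataFrame({
    "day":    ["Sat", "Sat", "Sat", "Sun", "Sun", "Sun", "Sun", "Sat"],
    "smoker": ["Yes", "No", "No", "No", "Yes", "No", "No", "Yes"],
})

# Tile area is proportional to the count in each (day, smoker) cell;
# mosaic returns the figure and a dict mapping each cell to its rectangle
fig, rects = mosaic(tips_demo, index=["day", "smoker"])
```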

### 4.2.3 Side-by-side boxplot

- This kind of plot shows the three quartile values of the distribution along with extreme values.
- The “whiskers” extend to points that lie within 1.5 IQRs of the lower and upper quartile
- Observations that fall outside this range are displayed independently

- We can add a third (categorical) dimension to our side-by-side box plot using a `hue` semantic
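
A side-by-side boxplot with a `hue` semantic might look like the sketch below. The data is made up and the column names `day`, `smoker` and `total_bill` are borrowed from the `tips` dataset.

```python
import pandas as pd
import seaborn as sns

# Made-up sample borrowing column names from seaborn's tips dataset
tips_demo = pd.DataFrame({
    "day":        ["Sat"] * 6 + ["Sun"] * 6,
    "smoker":     ["Yes", "No"] * 6,
    "total_bill": [10.3, 12.5, 18.2, 20.1, 25.0, 30.4,
                   9.8, 14.7, 16.0, 22.3, 28.9, 35.2],
})

# One box per (day, smoker) pair: x sets the groups, hue splits each group
ax = sns.boxplot(data=tips_demo, x="day", y="total_bill", hue="smoker")
```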

### 4.2.4 Creating a New Variable to Color Plots Based on a Condition
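One common way to do this, sketched here under assumptions, is to derive a categorical column with `np.where()` and pass it as `hue`. The 15% threshold, the `tip_level` column name and the data are all hypothetical.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up sample borrowing column names from seaborn's tips dataset
tips_demo = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip":        [1.0, 4.0, 3.0, 8.0],
})

# Hypothetical rule: label tips of at least 15% of the bill
tip_share = tips_demo["tip"] / tips_demo["total_bill"]
tips_demo["tip_level"] = np.where(tip_share >= 0.15, "15% or more", "under 15%")

# The new column drives the point colors
ax = sns.scatterplot(data=tips_demo, x="total_bill", y="tip", hue="tip_level")
```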

### 4.2.5 Point Plot

# 4.3 EDA for One Numerical Variable

In [None]:
cardio = pd.read_csv('CardioGoodFitness.csv')
cardio.head()

## 4.3.1 Numerical Summaries

In [None]:
cardio.Age.mean()

In [None]:
cardio.describe().round()

## 4.3.2 Histograms

- Histograms are only for numerical variables
- Based on our earlier data exploration, we can study the distributions of Age, Income and Miles using histograms.

https://seaborn.pydata.org/generated/seaborn.distplot.html (note: `distplot` is deprecated since seaborn 0.11; `histplot` and `displot` are its replacements)

## 4.3.3 Boxplots

In [None]:
# Income Distribution

# Based on the boxplot, we can see that:

# income ranges between 30K and 100K
# 50% of the CardioGood customers have income between 45K and 60K
# 25% of the customers have income higher than 60K
# 25% of the customers have income lower than 45K
# the median income for customers is around 50K (i.e. 50% of the customers earn more than 50K)
# we see some customers with outlier incomes (80K and above)


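The numbers read off a boxplot can be checked directly. The sketch below computes the quartiles and the 1.5 IQR fences for a made-up income series, then draws the matching horizontal boxplot; none of the values come from the actual dataset.

```python
import pandas as pd
import seaborn as sns

# Made-up incomes for illustration only
income = pd.Series([30000, 38000, 45000, 48000, 50000, 52000,
                    55000, 60000, 64000, 85000, 100000], name="Income")

# The box spans Q1 to Q3; the line inside it is the median
q1, median, q3 = income.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 IQRs from the box are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = income[(income < lower_fence) | (income > upper_fence)]
print(q1, median, q3, outliers.tolist())

# Passing the data as x gives a horizontal boxplot; use y for a vertical one
ax = sns.boxplot(x=income)
```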

In [None]:
# Horizontal Boxplot



## 4.3.4 Visualizing Interaction between One Numerical Variable and One Categorical Variable

In [None]:
# Interaction: Product x Income

# We can see that product TM798 is bought by customers with higher income compared to the other two products
# The other two products (TM195 and TM498) are mostly bought by people with lower income (less than 70K, with the majority below 55K)



In [None]:
# Interaction: Product x Miles

# It seems that people who expect to walk/run more miles per week are using the product TM798
# 75% of them expect to move more than 140 miles per week
# Customers who bought TM195 and TM498 expect to walk/run less than 140 miles per week

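Statements such as "75% expect more than X miles" can be backed by per-group quartiles. This is a minimal sketch with made-up mileage values; the `Product` and `Miles` column names follow the comments above.

```python
import pandas as pd

# Made-up values for illustration only
cardio_demo = pd.DataFrame({
    "Product": ["TM195"] * 4 + ["TM498"] * 4 + ["TM798"] * 4,
    "Miles":   [56, 75, 85, 94, 64, 85, 95, 106, 120, 160, 180, 200],
})

# Quartiles of Miles within each product: one row per product,
# one column per quantile (0.25, 0.5, 0.75)
miles_quartiles = (cardio_demo.groupby("Product")["Miles"]
                              .quantile([0.25, 0.5, 0.75])
                              .unstack())
print(miles_quartiles)
```

The 0.25 column is the value that 75% of each product's customers exceed, which is the figure a side-by-side boxplot shows as the bottom edge of each box.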

In [None]:
# Interaction: Fitness x Age

# Customers who rated themselves very low on fitness are in their 20's mostly
# That does not mean that all customers in their 20's rated themselves low




In [None]:
# Interaction: Fitness x Miles

# There is a clear positive correlation between fitness ratings and the number of miles customers expect to walk/run
# Customers who rated themselves as "athletes" expect to run a lot of miles (more than 75% expect to walk/run more than 150 miles per week)




In [None]:
# Product x Income



In [None]:
# We can get rid of the confidence interval (ci) around the mean value as follows


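Assuming the plot in question is a seaborn point plot (the same idea applies to `sns.barplot()`), the interval around each group mean can be switched off. In seaborn 0.12 and later this is done with `errorbar=None`; older releases used `ci=None`. The data below is made up.

```python
import pandas as pd
import seaborn as sns

# Made-up values for illustration only
cardio_demo = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM498", "TM798", "TM798"],
    "Income":  [40000, 46000, 48000, 52000, 70000, 90000],
})

# errorbar=None (seaborn 0.12+) draws only the group means, with no interval
ax = sns.pointplot(data=cardio_demo, x="Product", y="Income", errorbar=None)

# The plotted points are the per-product means
group_means = cardio_demo.groupby("Product")["Income"].mean()
print(group_means)
```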

# 4.4 EDA for Two Numerical Variables

## 4.4.1 Correlation and Correlation Matrix

In [None]:
# correlation



In [None]:
# correlation matrix

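A short sketch of both computations: `Series.corr()` for a single pair of variables and `DataFrame.corr()` for the full matrix. The values are made up; the column names follow those mentioned in this notebook.

```python
import pandas as pd

# Made-up values for illustration only
cardio_demo = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM498", "TM798"],
    "Age":     [20, 25, 30, 35, 40],
    "Income":  [30000, 42000, 50000, 61000, 75000],
    "Miles":   [60, 85, 75, 120, 110],
})

# Pearson correlation between two numerical columns
r_age_income = cardio_demo["Age"].corr(cardio_demo["Income"])

# Correlation matrix; keep only numerical columns so text columns are ignored
corr_matrix = cardio_demo.select_dtypes("number").corr()

print(round(r_age_income, 3))
print(corr_matrix.round(3))
```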

## 4.4.2 Visualizing Two Numerical Variables

### Pair plots

### Correlation Heatmap
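A correlation heatmap colors each cell of the correlation matrix by its value. The sketch below uses made-up data; fixing the color scale to the range -1 to 1 is a design choice so that colors are comparable across datasets.

```python
import pandas as pd
import seaborn as sns

# Made-up values for illustration only
cardio_demo = pd.DataFrame({
    "Age":    [20, 25, 30, 35, 40],
    "Income": [30000, 42000, 50000, 61000, 75000],
    "Miles":  [60, 85, 75, 120, 110],
})
corr_matrix = cardio_demo.corr()

# annot=True prints the coefficient inside every cell
ax = sns.heatmap(corr_matrix, annot=True, fmt=".2f",
                 vmin=-1, vmax=1, cmap="coolwarm")
```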

### Scatterplot

### Scatterplot with a Categorical Variable

- Scatterplot for the numerical variables, using the color of the points for one of the categorical variables
- Scatterplot for the numerical variables, using the size of the points for one of the categorical variables
- Scatterplot for the numerical variables, using the shape of the points for one of the categorical variables
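
The encodings listed above map onto the `hue`, `size` and `style` arguments of `sns.scatterplot()`. This sketch combines color and marker shape; the data is made up and the `Gender` column is an assumption about the dataset.

```python
import pandas as pd
import seaborn as sns

# Made-up values; the Gender column is an assumption
cardio_demo = pd.DataFrame({
    "Age":     [22, 25, 28, 33, 38, 45],
    "Income":  [32000, 41000, 47000, 58000, 72000, 90000],
    "Product": ["TM195", "TM195", "TM498", "TM498", "TM798", "TM798"],
    "Gender":  ["Male", "Female", "Male", "Female", "Male", "Female"],
})

# hue sets the point color, style sets the marker shape
ax = sns.scatterplot(data=cardio_demo, x="Age", y="Income",
                     hue="Product", style="Gender")
```

Passing `size="Product"` instead would vary the marker size by category, although size differences are usually harder to read than color for unordered categories.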

In [None]:
# using color 


In [None]:
# using shape/style  


### Facetplots

In [None]:
# Facetplots could be used with other types of plots as well

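One way to build facet plots is `sns.FacetGrid`, which creates one panel per level of a categorical column and then maps a plotting function onto each panel. The sketch below uses made-up data; swapping `sns.scatterplot` for another seaborn function (for example `sns.histplot`) gives faceted versions of other plot types.

```python
import pandas as pd
import seaborn as sns

# Made-up values for illustration only
cardio_demo = pd.DataFrame({
    "Age":     [22, 25, 28, 33, 38, 45],
    "Income":  [32000, 41000, 47000, 58000, 72000, 90000],
    "Product": ["TM195", "TM195", "TM498", "TM498", "TM798", "TM798"],
})

# One column of the grid per product
g = sns.FacetGrid(cardio_demo, col="Product")

# Draw an Age vs. Income scatterplot inside every panel
g.map_dataframe(sns.scatterplot, x="Age", y="Income")
```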
