In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# The Data Analytics Process walkthrough

## Objectives

* Review the CRISP-DM process and its relation to the Data Science Lifecycle
* Do a walkthrough example of this process with a relevant dataset

## CRISP-DM

Recall the Data Analytics/Data Science project cycle discussed in our last class. It is an open-standard process that describes common approaches used by data professionals and data mining experts. For us, as analysts, it is an adaptation of the CRISP-DM (CRoss Industry Standard Process for Data Mining) model, which also includes steps for the statistical modeling of data and the deployment of those models or algorithms in professional environments.

![](./images/crisp-dm.png)

Because we haven't yet dealt with statistical models or algorithms, our process is broadened to a general lifecycle concerned with building domain knowledge through data exploration and data visualization, so that we can provide solutions to a business problem, recommendations to a stakeholder, or deeper insight into a particular domain for its own sake.

## The Data Science/Analytics Life Cycle

![](./images/ds_lifecycle.png)

## 1.) Business Understanding: Building Domain Knowledge

* **How much or how many?** 
  * E.g. distributions of variables; simple counts or aggregate statistics (mean, median, mode, variance); categorical counts (`pd.DataFrame.value_counts()`).


* **Which category?**
  * E.g. “what group(s) am I interested in” “what metrics or features are of importance?”


* **Which Group?**
  * E.g. creating a number of groups (segments) of your customers based on their monetary or domain value.


* **Is this weird?** 
    * E.g. Through visualization, observing strange effects in data: does revenue not track inventory -> could be a loss prevention/theft issue. Overall signals in the data that aren’t what one would expect or require further investigation.


* **Which items would a customer/user/stakeholder prefer?**
    * E.g. “What recommendations can I make to existing customers or stakeholders?”
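
As a rough sketch of how the first few questions above translate into pandas calls, the cell below uses a small toy table. The `toy` DataFrame and all of its values are hypothetical and invented purely for illustration; they are not taken from the dataset used later.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({
    "region": ["East", "West", "East", "South", "West", "East"],
    "sales": [120.0, 80.0, 200.0, 50.0, 95.0, 30.0],
})

# "How much or how many?": counts per category and aggregate statistics.
region_counts = toy["region"].value_counts()
sales_summary = toy["sales"].agg(["mean", "median", "var"])

# "Which group?": average sales per region, highest first.
sales_by_region = toy.groupby("region")["sales"].mean().sort_values(ascending=False)

print(region_counts)
print(sales_summary)
print(sales_by_region)
```

The same three calls (`value_counts`, `agg`, `groupby`) reappear throughout the walkthrough below, applied to the real columns.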


### **Business case/question:**
Suppose your stakeholders would like to know, in general, how their stores are performing across the USA, how their sales and profit are tracking, and which areas might be good to invest in or target to drive growth.

In [None]:
df =  pd.read_csv('superstore.csv')

In [None]:
df.info()

In [None]:
df.head()

### Data Cleaning

There isn't a lot of data cleaning required here, as this dataset has been carefully curated. Here is an example of a simple data cleaning measure one might take:

In [None]:
# The column headers are all capitalized, which makes accessing them slightly harder. I like to lowercase them
# while I perform analysis/visualization.

df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ','_')
df.columns = df.columns.str.replace('-','_')
df.columns
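
The three renaming steps above could also be bundled into one reusable helper. This is only a sketch: `clean_columns` is a hypothetical name, the example column labels are assumed for illustration, and unlike the cell above it also strips surrounding whitespace and collapses runs of spaces or hyphens into a single underscore.

```python
import pandas as pd

def clean_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lowercase, underscore-separated column names."""
    cleaned = frame.copy()
    cleaned.columns = (
        cleaned.columns
        .str.strip()                               # drop leading/trailing spaces
        .str.lower()                               # lowercase every header
        .str.replace(r"[ \-]+", "_", regex=True)   # spaces/hyphens -> underscore
    )
    return cleaned

# Hypothetical headers, assumed only for this example.
example = pd.DataFrame(columns=["Order Date", "Ship Mode", "Sub-Category"])
print(clean_columns(example).columns.tolist())
```

Returning a copy rather than modifying the frame in place keeps the original headers available if you need to compare against the raw file.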

In [None]:
df.region.value_counts()

In [None]:
df.country.value_counts()

In [None]:
df.isna().sum()

## Data Visualization:

### The Histogram: _Understanding Typical Values and Spread in our Numerical Data_

Our numerical data here is rather simple, and `sales` and `profit` make perfect candidates for our success metrics.

In [None]:
# First, let's get only the subset of data we are interested in at the moment.
df_dist = df[['sales','profit']]

In [None]:
# initialize our plot space.
plt.figure(figsize=[20,10])
plt.suptitle("Checking Distribution and Outliers for Sales and Profit", size=20)

# histogram for sales
plt.subplot(2,3,1)
plt.hist(df_dist['sales'], bins=200, color='#03396c')
plt.xlim(0,1000)

# boxplot for sales
plt.subplot(2,3,2)
sns.boxplot(x=df_dist['sales'], color='#b3cde0') # pass the data as x= so the box is drawn horizontally on any seaborn version

# histogram for profit
plt.subplot(2,3,4)
plt.hist(df_dist['profit'], bins=200, color='#03396c')
plt.xlim(-250,300)

# Boxplot for profit
plt.subplot(2,3,5)
sns.boxplot(x=df_dist['profit'], color='#b3cde0') # pass the data as x= so the box is drawn horizontally on any seaborn version


plt.show()

The boxplots look a bit compressed. We can see the presence of many outliers here, which affects the scale of our plot. If we subselect the data to leave out as many of the outliers as possible, we can better visualize the data in the IQR, etc.

In [None]:
# Subselecting the data so that the outliers do not compress the scale

'''
Notice that the data is being cut off here so that we can see certain details better. This is particular to the 
data we have sectioned off for plotting the histogram. In general, you want to be very careful about including/
excluding outliers in your data and have good reasons for doing so, since these outliers may be relevant to 
the summary statistics and the conclusions we draw from them.
'''

df_dist = df_dist[df_dist['sales'] <= 2000]
df_dist = df_dist[df_dist['profit'] <= 2000]
df_dist = df_dist[df_dist['profit'] >= -200]
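
The cut-offs above were chosen by eye. As an alternative sketch (not the method used in this notebook), thresholds can be derived from the data with the common 1.5 × IQR rule, the same rule a standard boxplot uses to place its whiskers. The helper name `iqr_bounds` and the `toy_sales` values are hypothetical.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences located k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values, invented for illustration; 500 is an obvious outlier.
toy_sales = pd.Series([10, 12, 14, 15, 16, 18, 20, 500], name="sales")

low, high = iqr_bounds(toy_sales)
inside = toy_sales[toy_sales.between(low, high)]  # keep only points within the fences
print(low, high, len(inside))
```

The same caution applies here: a rule-based cut is still a cut, so it is best used for plotting views rather than silently applied before computing summary statistics.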

In [None]:
# initialize our plot space.
plt.figure(figsize=[20,10])
plt.suptitle("Checking Distribution and Outliers for Sales and Profit", size=20)

# histogram for sales
plt.subplot(2,3,1)
plt.hist(df_dist.sales, bins=200, color='#03396c')
plt.axvline(df_dist.sales.median())
plt.xlim(0,1000)

# boxplot for sales
plt.subplot(2,3,2)
sns.boxplot(x=df_dist.sales, color='#b3cde0') # pass the data as x= so the box is drawn horizontally on any seaborn version

# histogram for profit
plt.subplot(2,3,4)
plt.hist(df_dist.profit, bins=200, color='#03396c')
plt.axvline(df_dist.profit.median())
plt.xlim(-250,300)

# Boxplot for profit
plt.subplot(2,3,5)
sns.boxplot(x=df_dist.profit, color='#b3cde0') # pass the data as x= so the box is drawn horizontally on any seaborn version



plt.show()

**Add your conclusions Below:**

### The Bar Chart : _Comparing Numerical and Categorical Values_

In our case, `sales` is a good numerical feature and `region` is a good categorical feature, so a bar chart is a natural way to convey insights here. With a bar chart, you can show your categorical feature on one axis and an aggregation of your numerical feature on the other.

In [None]:
# Let's make a bar chart for the data in question.

# First, let's subselect the data we would like to plot.
df_bar = df[['region','sales']]

In [None]:
# Let's aggregate the data and format it so that it plots nicely
df_bar = df_bar.groupby('region').mean().sort_values(by='sales', ascending=False)

In [None]:
# Setting the figure size
plt.figure(figsize=[15,8]) 

# Visualizing using Bar Chart
plt.suptitle("Average Sales Across Different Regions", size=20)

# Plotting the BarChart
plt.subplot(1,2,1)
plt.bar(x=df_bar.index, height='sales', color = ['#011f4b','#03396c','#005b96','#6497b1'], data=df_bar)

# Plotting the Horizontal BarChart (Use this if there are many unique values for a Categorical Feature)
plt.subplot(1,2,2)
plt.barh(y=df_bar.index, width='sales',color = ['#011f4b','#03396c','#005b96','#6497b1'], data=df_bar)
plt.gca().invert_yaxis() # Inverting the Y Axis

plt.show()

* It is always better to display the values in a bar chart in order (preferably highest to lowest).

* You can further improve this plot by adding annotations to each of the bars, that is, showing the `sales` value on top of each bar (consider using `plt.text()`).

* **Both bar charts convey the same insights. A horizontal bar chart is preferable when the number of unique values in the categorical feature is large.**
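
The annotation idea in the second bullet could look roughly like the sketch below. The `toy_bar` table and its region names and values are hypothetical stand-ins, not results from the dataset. It relies on the fact that matplotlib places categorical bars at x positions 0, 1, 2, and so on.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical regional averages, invented for illustration only.
toy_bar = pd.DataFrame(
    {"sales": [250.0, 230.0, 215.0, 190.0]},
    index=["South", "East", "West", "Central"],
)

plt.figure(figsize=[8, 5])
plt.bar(x=toy_bar.index, height=toy_bar["sales"], color="#03396c")

# Write each value just above its bar, centered horizontally.
labels = []
for position, value in enumerate(toy_bar["sales"]):
    labels.append(plt.text(position, value, f"{value:,.0f}", ha="center", va="bottom"))

plt.title("Average Sales by Region (toy data)")
```

In a notebook the figure renders at the end of the cell; in a script, finish with `plt.show()`.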

### The Line Plot: _Visualizing changes over time_

It might be sensible to understand the behavior of sales over time. This will allow us to see whether there are times of year on which to focus company efforts to increase sales. It will also give us a more complete understanding of the market dynamics considered here.

When visualizing data over time, a line chart is best because it allows us to see trends through time, and connecting the data points hints at a dependency from one point to the next. As always, remember that **independent variables** go on the x-axis (the horizontal axis) and **dependent variables** go on the y-axis (the vertical axis). What does this mean? It means that whatever is affected by another variable goes on the y-axis, and whatever is doing the affecting goes on the x-axis.

In [None]:
df.columns

Notice that we have two numerical features that would be useful to plot together: `sales` and `profit`

In [None]:
df.order_date.head()

In [None]:
# First of all, we are going to take only the subset of data for our purpose. (To keep things simple)
df_line = df[['order_date','sales','profit']].copy() # copy so the conversion below does not touch df
df_line['order_date'] = pd.to_datetime(df_line['order_date']) # Converting into DateTime before any ordering
# Groupby to get the average Sales and Profit on each day; the grouped index comes out in chronological order
df_line = df_line.groupby('order_date').mean()

# Visualizing the Line Chart
plt.figure(figsize=[15,8])
plt.plot(df_line.index, df_line['sales'], label='sales', color='#6497b1') # Avg Sales over Time
plt.plot(df_line.index, df_line['profit'], label='profit', color='#03396c') # Avg Profit over Time
plt.title("Average Sales and Profit over Time Period (2014-2018)", size=20, pad=20)
plt.legend()
plt.show()

* It is always better to display values in a line plot in chronological order.

* You can further improve this plot by adding annotations for a certain event in the timeline to catch the audience's or stakeholders' attention.
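
Daily averages can be noisy, which makes "times of year" hard to spot. One possible follow-up, sketched here on hypothetical data, is to average by calendar month before plotting. The `toy_daily` frame and its dates and values are invented for illustration.

```python
import pandas as pd

# Hypothetical daily figures, invented for illustration only.
toy_daily = pd.DataFrame(
    {"sales": [100.0, 300.0, 50.0, 150.0, 400.0]},
    index=pd.to_datetime(
        ["2020-01-05", "2020-01-20", "2020-02-10", "2020-02-25", "2020-03-15"]
    ),
)

# Group the daily rows by calendar month and average within each month.
monthly = toy_daily.groupby(toy_daily.index.to_period("M"))["sales"].mean()
print(monthly)
```

Applied to a frame indexed by date, such as `df_line`, the same grouping would yield one point per month, which is usually easier to read as a line plot.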

**Add conclusions here:**

### The Scatter Plot: _Relationships between Numerical Features_

Let's continue to inspect our numerical features, `sales` and `profit`, to see how they are related and whether any correlation is present. We can also encode additional data in our scatter plot to get a more detailed picture of the relationships between these variables: the `segment` feature.

In [None]:
df.segment.value_counts()

In [None]:
# Let's subselect the data we're interested in plotting
df_scatter = df[['sales','profit','segment']]

# Create the figure space for our scatter plot.
plt.figure(figsize=[15,8])

# Profit on the Y axis and Sales on the X axis. Hue will classify the dots according to Segment.
# The size of the dots is according to the volume of "Sales".
sns.scatterplot(x=df_scatter.sales, y=df_scatter.profit, hue=df_scatter.segment, palette=['#011f4b','#005b96','#6497b1'], size=df_scatter.sales, sizes=(100,1000), legend='auto') 
plt.title("Sales vs Profit Across Different Customer Segments", size=20, pad=20)
plt.show()
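
A scatter plot shows the shape of a relationship; a correlation coefficient puts a number on its linear part. The sketch below computes Pearson's r with pandas on hypothetical values; `toy_scatter` is invented for illustration and says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical sales/profit pairs, invented for illustration only.
toy_scatter = pd.DataFrame({
    "sales": [100.0, 200.0, 300.0, 400.0, 500.0],
    "profit": [10.0, 25.0, 20.0, 45.0, 50.0],
})

# Pearson correlation between the two columns (ranges from -1 to 1).
pearson_r = toy_scatter["sales"].corr(toy_scatter["profit"])

# The full pairwise correlation matrix for all numeric columns.
corr_matrix = toy_scatter.corr()

print(round(pearson_r, 3))
print(corr_matrix)
```

Keep in mind that Pearson's r is sensitive to outliers, so it is worth reading alongside the plot rather than instead of it.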

**Add your conclusions here:**

What other questions might we ask to help our stakeholders learn more about their company's success and where they might consider applying resources? Let's take another look at the columns.

In [None]:
df.columns

In [None]:
df.category

In [None]:
df.ship_mode

Maybe we can visualize sales and profit against `ship_mode`. Because `sales` and `profit` are numerical and would be dependent on the shipping mode, this is a hint to plot sales/profit on the y-axis and shipping mode on the x-axis. A stacked bar chart would be a nice choice here:

In [None]:
# Take a subset of the data we want to visualize.
df_stackb = df[['ship_mode','sales','profit']]

# Let's aggregate sales and profit by ship_mode
df_stackb = df_stackb.groupby(['ship_mode']).sum().reset_index()


In [None]:
# Plot a stacked bar chart
plt.figure(figsize=[8,12])
plt.subplot(3,1,1)
plt.bar(x=df_stackb.ship_mode, height=df_stackb.sales, color='#005b96')
plt.bar(x=df_stackb.ship_mode, height=df_stackb.profit, bottom=df_stackb.sales, color='#6497b1')
plt.title("Sales & Profit Across Ship Modes", size=20, pad=20)
plt.legend(['sales','profit'])
plt.show()

In [None]:
df_sbar = df[['sales','profit','category']]
            
df_sbar = df_sbar.groupby(['category']).sum().reset_index()

plt.bar(x=df_sbar.category, height=df_sbar.sales, color='#005b96')
plt.bar(x=df_sbar.category, height=df_sbar.profit, bottom=df_sbar.sales, color='#6497b1')
plt.title("Sales & Profit Across Category", size=20, pad=20)
plt.legend(['sales','profit'])
plt.show()

**Add conclusion here:**