# Data Wrangling, EDA, and Visualization (66 points)

We are going to investigate the TaFeng transactions in more depth.

In [1]:
# Run this cell to set up the notebook, but please don't change it.

# These lines import the NumPy, pandas, and seaborn modules.
import numpy as np
import pandas as pd
import seaborn as sns

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

# Importing Data
Please copy the necessary code from Workshop 3 to import the three tables and perform the same LEFT join to build the tafeng_full dataframe.

In [2]:
age_class_columns = ['code', 'age_range']
age_classes = pd.read_csv('age_classes.txt', sep=" ",  
                          header=None, names=age_class_columns)

# age_classes

In [3]:
residence_areas = pd.read_csv('residence_area.txt', 
                              delimiter=':',
                              header=None, 
                              names=['code','area'])

residence_areas['area'] = residence_areas['area'].str.strip()

# residence_areas

In [4]:
tafeng_transactions = pd.read_csv('TaFengTransactions.txt', delimiter=';')

# remove potential leading or trailing whitespace
tafeng_transactions['age_code'] = tafeng_transactions['age_code'].str.strip()
tafeng_transactions['residence_area'] = tafeng_transactions['residence_area'].str.strip()

# tafeng_transactions.head()

In [5]:
tafeng_full = pd.merge(tafeng_transactions, age_classes,
                      how='left', left_on='age_code', right_on='code')
tafeng_full = tafeng_full.drop('code', axis=1)

tafeng_full = pd.merge(tafeng_full, residence_areas, 
                      how='left', left_on = 'residence_area', right_on = 'code') 
tafeng_full = tafeng_full.drop('code', axis=1)

In [20]:
#Copy the code here
tafeng_full.head() # output 1 point

,entry_date,transaction_time,customer_id,age_code,residence_area,product_subclass,product_id,amount,asset,sales_price,age_range,area
0,2016-12-26,2001-01-15 00:00:00,1786439,G,H,110109,4710043552065,1,144,190,50-54,unknown
1,2016-12-26,2001-01-15 00:00:00,98946,E,E,100312,4710543111014,1,32,38,40-44,115
2,2016-12-26,2001-01-15 00:00:00,905602,D,E,500206,4710114322115,1,64,79,35-39,115
3,2016-12-26,2001-01-15 00:00:00,1964295,E,E,530106,4713813010123,1,174,147,40-44,115
4,2016-12-26,2001-01-15 00:00:00,2146553,B,D,100217,8801019421013,1,47,52,25-29,114
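
A quick sanity check on a LEFT join is to confirm that no transaction rows were lost or duplicated and to count the rows whose code found no match. The sketch below illustrates the idea on two tiny made-up tables; the names `transactions` and `age_lookup` and all of their values are hypothetical stand-ins, not the real TaFeng files.

```python
import pandas as pd

# Hypothetical miniature tables standing in for the real files.
transactions = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'age_code': ['A', 'B', 'Z'],   # 'Z' has no match in the lookup table
})
age_lookup = pd.DataFrame({
    'code': ['A', 'B'],
    'age_range': ['<25', '25-29'],
})

# validate='many_to_one' raises a MergeError if a code is repeated in the lookup.
# indicator=True adds a '_merge' column recording where each row came from.
joined = pd.merge(transactions, age_lookup, how='left',
                  left_on='age_code', right_on='code',
                  validate='many_to_one', indicator=True)

# A LEFT join keeps every transaction row, matched or not.
rows_preserved = len(joined) == len(transactions)
unmatched = joined[joined['_merge'] == 'left_only']
print(rows_preserved, len(unmatched))
```

The same two checks could be applied to tafeng_full after each merge, assuming the lookup tables contain each code only once.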


# Question 1 (24 points)
Create a data frame called "carts" that contains the three variables above ("num_items", "total_value", and "num_unique"), as well as "customer_id" and "transaction_time". Make sure you use the names specified. (8 points)

In [35]:
num_items = tafeng_full.groupby(['transaction_time', 'customer_id'])['amount'].sum()
# The result is a Series named 'amount' (a Series has no columns to rename);
# it becomes 'num_items' when the columns are renamed after the merge.
num_items.head()

transaction_time     customer_id
2000-11-01 00:00:00  45957          1
                     164252         1
                     217361         1
                     916264         2
                     955188         2
Name: amount, dtype: int64

In [32]:
total_value = tafeng_full.groupby(['transaction_time', 'customer_id'])['sales_price'].sum()
# The result is a Series named 'sales_price' (a Series has no columns to rename);
# it becomes 'total_value' when the columns are renamed after the merge.
total_value.head()

transaction_time     customer_id
2000-11-01 00:00:00  45957          133
                     164252          89
                     217361          65
                     916264          48
                     955188          48
Name: sales_price, dtype: int64

In [33]:
num_unique = tafeng_full.groupby(['transaction_time', 'customer_id'])['product_id'].nunique()
# The result is a Series named 'product_id' (a Series has no columns to rename);
# it becomes 'num_unique' when the columns are renamed after the merge.
num_unique.head()

transaction_time     customer_id
2000-11-01 00:00:00  45957          1
                     164252         1
                     217361         1
                     916264         1
                     955188         1
Name: product_id, dtype: int64

In [39]:
# Merge the three per-cart Series on the shared (transaction_time, customer_id) keys.
merge_keys = ['transaction_time', 'customer_id']
carts = pd.merge(num_items, total_value, how='outer', on=merge_keys)
carts = pd.merge(carts, num_unique, how='outer', on=merge_keys)
carts.rename(columns={'amount':'num_items', 'sales_price':'total_value', 'product_id':'num_unique'}, inplace=True)

carts # output 1 point

transaction_time,customer_id,num_items,total_value,num_unique
2000-11-01 00:00:00,45957,1,133,1
2000-11-01 00:00:00,164252,1,89,1
2000-11-01 00:00:00,217361,1,65,1
2000-11-01 00:00:00,916264,2,48,1
2000-11-01 00:00:00,955188,2,48,1
2000-11-01 00:00:00,1327205,1,102,1
2000-11-01 00:00:00,1790184,1,75,1
2000-11-01 00:00:00,1846225,1,28,1
2000-11-01 00:00:00,1846546,1,19,1
2000-11-01 00:00:00,2004402,1,10,1
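
The three group-bys and two merges above can also be collapsed into a single named aggregation. The sketch below shows this alternative on a few made-up line items; `items` and `carts_sketch` are hypothetical names, and the values are invented for illustration. With `as_index=False`, the grouping keys come back as ordinary columns rather than as an index.

```python
import pandas as pd

# Hypothetical line items: two carts, keyed by (transaction_time, customer_id).
items = pd.DataFrame({
    'transaction_time': ['2000-11-01', '2000-11-01', '2000-11-01'],
    'customer_id': [916264, 916264, 45957],
    'product_id': [111, 222, 333],
    'amount': [1, 2, 1],
    'sales_price': [20, 28, 133],
})

# One pass over the groups; each keyword names an output column and
# maps it to a (source column, aggregation function) pair.
carts_sketch = (
    items.groupby(['transaction_time', 'customer_id'], as_index=False)
         .agg(num_items=('amount', 'sum'),
              total_value=('sales_price', 'sum'),
              num_unique=('product_id', 'nunique'))
)
print(carts_sketch)
```

This is only one possible design; the merge-based version above produces the same three columns with the keys kept in the index.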


Now let's take a look at the relationship between the number of items in a cart and the cart's total value. Intuitively the two should be positively correlated. Make a SCATTER plot that will help us inspect the relationship between these two variables. (6 points) 

In [None]:
# type your code here
# plot output 1 point
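
As a rough illustration of the kind of plot requested, the sketch below draws a scatter plot with matplotlib. It uses a synthetic frame, `demo_carts`, which is a hypothetical stand-in with the same two column names as carts; the numbers are random and say nothing about the real data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for carts (assumption: not the real data).
rng = np.random.default_rng(0)
num_items = rng.integers(1, 50, size=200)
demo_carts = pd.DataFrame({
    'num_items': num_items,
    'total_value': num_items * rng.uniform(20, 120, size=200),
})

fig, ax = plt.subplots(figsize=(7, 5))
# Small, semi-transparent markers keep dense regions readable.
ax.scatter(demo_carts['num_items'], demo_carts['total_value'], s=10, alpha=0.4)
ax.set_xlabel('Number of items in cart')
ax.set_ylabel('Total cart value')
ax.set_title('Cart value vs. number of items (synthetic data)')
```

With real data, overplotting is likely, so a low `alpha` is one reasonable choice among several.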

We might expect from the distribution of the number of trips that there would be a few very large values for the number of items and the total amount spent. Indeed, a handful of observations make it difficult to see the shape of the bulk of the data. Make another scatter plot, but this time log-transform both variables so that both the x and y axes are on a log scale. (2 points)

In [None]:
# type your code here
# plot output 1 point
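
One way to get a log-log view is to keep the raw values and switch both axes to a log scale, so the tick labels stay in the original units. The sketch below assumes a synthetic, hypothetical `demo_carts` frame again. Log scales require strictly positive values, so the filter step is an assumption about how zero or negative rows would be handled.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic, right-skewed stand-in for carts (assumption: not the real data).
rng = np.random.default_rng(0)
num_items = rng.integers(1, 500, size=300)
demo_carts = pd.DataFrame({
    'num_items': num_items,
    'total_value': num_items * rng.lognormal(mean=4.0, sigma=0.6, size=300),
})

# Log scales need strictly positive values, so drop any zero or negative rows.
positive = demo_carts[(demo_carts['num_items'] > 0) & (demo_carts['total_value'] > 0)]

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(positive['num_items'], positive['total_value'], s=10, alpha=0.4)
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('Number of items in cart (log scale)')
ax.set_ylabel('Total cart value (log scale)')
```

An equivalent alternative is to plot `np.log10` of each column on linear axes; the shape of the point cloud is the same, but the ticks then show exponents.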

Please study the functionality of seaborn's lmplot and use it to simultaneously plot the points and the line-of-best-fit for the log-log data. (2 points)

In [None]:
# type your code here
# plot output 1 point
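
seaborn's `lmplot` draws a scatter plot and a fitted regression line in one call, returning a `FacetGrid`. It fits the line to the columns as given, so the sketch below log-transforms the columns first. The frames `demo_carts` and `log_carts` and their column names are hypothetical, and the data are synthetic.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for carts (assumption: not the real data).
rng = np.random.default_rng(0)
num_items = rng.integers(1, 500, size=300)
demo_carts = pd.DataFrame({
    'num_items': num_items,
    'total_value': num_items * rng.lognormal(mean=4.0, sigma=0.6, size=300),
})

# Transform the columns so the regression is fitted in log-log space.
log_carts = pd.DataFrame({
    'log_num_items': np.log10(demo_carts['num_items']),
    'log_total_value': np.log10(demo_carts['total_value']),
})

grid = sns.lmplot(x='log_num_items', y='log_total_value', data=log_carts,
                  scatter_kws={'s': 10, 'alpha': 0.4},
                  line_kws={'color': 'black'})
grid.set_axis_labels('log10(number of items)', 'log10(total cart value)')
```

The slope of the fitted line in log-log space can be read as an elasticity: roughly, the percentage change in cart value associated with a one percent change in the number of items.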

# Self-Directed EDA
These last two questions are intentionally more open-ended and will be graded on the completeness of the plot(s) produced and the insights you gain from them. Be sure to consider NECESSARY transformations, subsets, correlations, reference markers, and lines/curves-of-best-fit to reveal the relationship that you want to learn more about. Also be sure to make plots that are appropriate for the variable types. For completeness, be explicit about any assumptions you make in your analysis.

# Question 2 (14 points)
Visualize and interpret the age distribution of the shoppers. (10 points)

In [None]:
... # replace ... with your code; the output plot is worth 1 point
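
Because age_range is categorical, a bar chart of counts is one appropriate plot type. The sketch below makes one explicit assumption: a "shopper" is a distinct customer_id, so each customer is counted once rather than once per purchased item. The `demo` frame and its values are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical rows: one row per purchased line item.
demo = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3, 4],
    'age_range': ['25-29', '25-29', '35-39', '35-39', '35-39', '35-39', '50-54'],
})

# Assumption: a "shopper" is a distinct customer_id, so each customer counts once.
shoppers = demo.drop_duplicates('customer_id')

# Sorting the labels as strings works for same-width labels such as '25-29';
# other label formats would need an explicit category order.
age_counts = shoppers['age_range'].value_counts().sort_index()

ax = age_counts.plot(kind='bar', figsize=(7, 4), rot=0)
ax.set_xlabel('Age range')
ax.set_ylabel('Number of distinct customers')
ax.set_title('Shoppers by age range (synthetic data)')
```

Counting rows instead of distinct customers would answer a different question, namely which age groups account for the most purchased items.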

# Question 3 (26 points)
Visualize and interpret the relationship between the amount spent on a shopping trip and the customer's age. (20 points)

In [None]:
... # replace ... with your code; the output plot is worth 1 point
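
One possible approach, sketched below on synthetic data, is to total sales_price per trip and compare the per-trip totals across age ranges with box plots on a log scale. The assumptions are that a trip is one (transaction_time, customer_id) pair, as in Question 1, and that each customer has a single age range. The `demo` and `trips` frames and all values are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic line items (assumption: not the real data).
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    'transaction_time': rng.choice(['2000-11-01', '2000-11-02'], size=n),
    'customer_id': rng.integers(1, 60, size=n),
    'sales_price': rng.lognormal(mean=4.5, sigma=0.8, size=n),
})
# Assumption: each customer belongs to exactly one age range.
age_labels = np.array(['25-29', '35-39', '50-54'])
demo['age_range'] = age_labels[(demo['customer_id'] % 3).to_numpy()]

# One row per trip: total spend for each (transaction_time, customer_id) pair.
trips = (
    demo.groupby(['transaction_time', 'customer_id', 'age_range'], as_index=False)
        .agg(trip_value=('sales_price', 'sum'))
)

fig, ax = plt.subplots(figsize=(7, 5))
trips.boxplot(column='trip_value', by='age_range', ax=ax)
ax.set_yscale('log')   # spend is right-skewed, so a log axis shows the bulk
ax.set_xlabel('Age range')
ax.set_ylabel('Amount spent per trip (log scale)')
ax.set_title('Spend per trip by age range (synthetic data)')
fig.suptitle('')       # remove the automatic "grouped by" super-title

# Medians are a robust numeric summary to accompany the plot.
median_by_age = trips.groupby('age_range')['trip_value'].median()
```

Box plots suit a numeric variable split by a categorical one; medians are less sensitive than means to the few very large carts.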

In [None]:
Your observations? (5 points) - 