### Feature Engineering

Dataset: Online Retail-12-2010.csv

Import libraries

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

Load the dataset
- Import the Online Retail-12-2010.csv file into a pandas dataframe
- Parse the date columns using the parse_dates argument

In [None]:
retail = pd.read_csv('../datasets/Online Retail-12-2010.csv', parse_dates=['InvoiceDate'])
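
To see what `parse_dates` does without the real file, here is a minimal sketch on an in-memory CSV. The two rows and the timestamp format are made up for illustration and may not match the actual dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the values are invented for illustration.
sample_csv = io.StringIO(
    "InvoiceNo,InvoiceDate,Quantity,UnitPrice\n"
    "536365,2010-12-01 08:26:00,6,2.55\n"
    "536366,2010-12-01 08:28:00,6,1.85\n"
)

# parse_dates converts the listed columns to datetime instead of leaving them as strings.
sample = pd.read_csv(sample_csv, parse_dates=["InvoiceDate"])

print(sample.dtypes)
```

With a datetime column, the `.dt` accessor used later in this notebook becomes available.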

In [None]:
retail.head()

In [None]:
retail.dtypes

In [None]:
retail.shape

Data Preparation

Convert "CustomerID" to "object" dtype

In [None]:
retail['CustomerID'] = retail['CustomerID'].astype(object)

In [None]:
retail.dtypes

Remove the rows without customer IDs

Check how many CustomerIDs are null

In [None]:
retail.CustomerID.isnull().value_counts()

Remove the rows with null CustomerIDs

In [None]:
retail.dropna(subset=['CustomerID'], axis=0, inplace=True)

In [None]:
retail.shape

In [None]:
retail.head()
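
The null check and removal can be tried on a small frame. This is a sketch on made-up rows, not the notebook's data. It also shows one possible variation: casting the remaining IDs to integers and then to strings, which avoids labels such as `17850.0`. That cast is an assumption about how the IDs should look, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second row has no customer ID.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367"],
    "CustomerID": [17850.0, np.nan, 13047.0],
})

# Count the missing IDs before dropping them.
n_missing = sample["CustomerID"].isnull().sum()

# Drop rows without a customer ID; copy so the original sample is untouched.
cleaned = sample.dropna(subset=["CustomerID"]).copy()

# Optional variation: store IDs as string labels without the trailing ".0".
cleaned["CustomerID"] = cleaned["CustomerID"].astype(int).astype(str)

print(n_missing)
print(cleaned)
```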

Engineer/Create Features

*Amount*

Calculate the Amount from existing features:

Amount = Quantity * UnitPrice

In [None]:
retail['Amount'] = retail['Quantity'] * retail['UnitPrice']
retail.head()

Extracting Date Time Features

*Hour*

In [None]:
retail['Hour'] = retail['InvoiceDate'].dt.hour
retail.head()

*Weekday (Name)*

In [None]:
# Series.dt.weekday_name was removed in pandas 1.0; use dt.day_name() instead
retail['Day_Name'] = retail['InvoiceDate'].dt.day_name()
retail.head()

*Weekday (Number)*

Get the day of the week in numerical form, where Monday=0 and Sunday=6. This helps us identify whether a transaction happened on a weekday or a weekend.

In [None]:
retail['Day_Number'] = retail['InvoiceDate'].dt.weekday
retail.head()

The day number lets us derive further features, such as "Is_Weekend"

In [None]:
# Use the Day_Number feature to generate a binary flag for weekends (Saturday=5, Sunday=6)
retail['Is_Weekend'] = np.where(retail['Day_Number'] >= 5, 1, 0)

retail.head()
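
The weekday features can be checked on a handful of known dates. This is a small sketch with hand-picked timestamps (3 to 6 December 2010, Friday through Monday); the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps: Friday, Saturday, Sunday, Monday.
dates = pd.Series(pd.to_datetime(
    ["2010-12-03", "2010-12-04", "2010-12-05", "2010-12-06"]
))

day_name = dates.dt.day_name()   # e.g. "Friday"
day_number = dates.dt.weekday    # Monday=0 ... Sunday=6

# Saturday (5) and Sunday (6) are flagged as weekend.
is_weekend = np.where(day_number >= 5, 1, 0)

print(pd.DataFrame({
    "Day_Name": day_name,
    "Day_Number": day_number,
    "Is_Weekend": is_weekend,
}))
```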

Analyzing the data

In [None]:
# What is the total amount sold?
retail.Amount.sum()

In [None]:
# What is the total quantity sold?
retail.Quantity.sum()

In [None]:
# What is the total quantity sold for the item with StockCode 10002?
retail[(retail.StockCode == "10002")].Quantity.sum()

In [None]:
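# At what hour of the day is the total sales amount highest?
# barplot shows the mean by default, so sum Amount per hour explicitly.
# (On seaborn >= 0.12, replace ci=None with errorbar=None.)
sns.barplot(x='Hour', y='Amount', data=retail, estimator=np.sum, ci=None)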
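
The same question can also be answered numerically with a `groupby`. Note that seaborn's `barplot` aggregates with the mean by default, so the mean per row and the total per hour can point to different hours. The sketch below uses made-up numbers to show the difference.

```python
import pandas as pd

# Hypothetical sample: one large sale at 8, three smaller sales at 9.
sample = pd.DataFrame({
    "Hour": [8, 9, 9, 9],
    "Amount": [10.0, 4.0, 4.0, 4.0],
})

hourly_total = sample.groupby("Hour")["Amount"].sum()
hourly_mean = sample.groupby("Hour")["Amount"].mean()

# Hour with the highest total versus the highest average amount per row.
busiest_by_total = hourly_total.idxmax()
busiest_by_mean = hourly_mean.idxmax()

print(hourly_total)
print(hourly_mean)
```

Here hour 9 has the highest total while hour 8 has the highest mean, so it is worth being explicit about which one a plot shows.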

Transform Data on a Per Customer Level

In [None]:
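agg = {'InvoiceNo': 'nunique',  # No. of transactions
       'Amount': 'sum',         # Total amount
       'Quantity': 'sum',       # Total quantity
       'Country': 'first',      # Country of the customer's first row
       'Hour': ['min', 'max'],  # Earliest and latest purchase hour
       'Day_Name': 'max'}       # String max: alphabetically last day name, not the most frequent day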


customer_df = retail.groupby(["CustomerID"], as_index=False).agg(agg)

In [None]:
customer_df.head()

In [None]:
customer_df.columns

In [None]:
column_names = ['CustomerID','No_Transactions','Total_Amount','Total_Quantity','Country','Hour_Min','Hour_Max','Day_Max']

customer_df.columns = column_names

In [None]:
customer_df.head()
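
Passing a list for one column (`'Hour': ['min', 'max']`) makes the result's columns a MultiIndex, which is why they are renamed by hand above. An alternative is pandas named aggregation (available since pandas 0.25), which sets the output names directly. The sketch below uses a made-up three-row frame and only some of the columns; it is one possible approach, not the notebook's method.

```python
import pandas as pd

# Hypothetical sample: customer A has two invoices, customer B has one.
sample = pd.DataFrame({
    "CustomerID": ["A", "A", "B"],
    "InvoiceNo": ["1", "2", "3"],
    "Amount": [10.0, 5.0, 7.0],
    "Hour": [8, 12, 9],
})

# Each keyword is an output column name mapped to (input column, aggregation).
customer_sample = sample.groupby("CustomerID", as_index=False).agg(
    No_Transactions=("InvoiceNo", "nunique"),
    Total_Amount=("Amount", "sum"),
    Hour_Min=("Hour", "min"),
    Hour_Max=("Hour", "max"),
)

print(customer_sample)
```

The result has flat column names, so no separate renaming step is needed.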

In [None]:
customer_df.Country.value_counts()