## Data preparation
Note: to simplify the data import process, this notebook was run in Google Colab instead of a local Jupyter notebook.

In [1]:
!pip install pymysql

Collecting pymysql
  Downloading https://files.pythonhosted.org/packages/ed/39/15045ae46f2a123019aa968dfcba0396c161c20f855f11dea6796bcaae95/PyMySQL-0.9.3-py2.py3-none-any.whl (47kB)
Installing collected packages: pymysql
Successfully installed pymysql-0.9.3


In [4]:
#Imports
import pandas as pd
import numpy as np
import datetime
import os
from sqlalchemy import create_engine

## Extraction

In [5]:
# Get the URL of the drive folder
folder_url = '../Data/Customer Loyalty'

In [6]:
# Get filenames
all_files = os.listdir(folder_url)
trans_files = [string for string in all_files if string.startswith('trans')]
time_files = [string for string in all_files if string.startswith('time')]

In [8]:
# Import files one by one and collect them in a list
trans_frames = []
for file in trans_files:
  # Build the path to the file
  temp_string = folder_url + '/' + file
  # Read the file into a temporary pandas dataframe
  temp_df = pd.read_csv(temp_string)
  trans_frames.append(temp_df)
# Concatenate once at the end (DataFrame.append was removed in pandas 2.0)
trans_df = pd.concat(trans_frames, ignore_index=True)

In [7]:
trans_df.head()

Unnamed: 0,SHOP_WEEK,SHOP_DATE,SHOP_WEEKDAY,SHOP_HOUR,QUANTITY,SPEND,PROD_CODE,PROD_CODE_10,PROD_CODE_20,PROD_CODE_30,PROD_CODE_40,CUST_CODE,CUST_PRICE_SENSITIVITY,CUST_LIFESTAGE,BASKET_ID,BASKET_SIZE,BASKET_PRICE_SENSITIVITY,BASKET_TYPE,BASKET_DOMINANT_MISSION,STORE_CODE,STORE_FORMAT,STORE_REGION
0,200607,20060415,7,19,1,0.93,PRD0900033,CL00201,DEP00067,G00021,D00005,CUST0000410727,UM,OT,994100100398294,L,MM,Full Shop,Mixed,STORE00001,LS,E02
1,200607,20060413,5,20,1,1.03,PRD0900097,CL00001,DEP00001,G00001,D00001,CUST0000634693,LA,YF,994100100532898,L,LA,Top Up,Fresh,STORE00001,LS,E02
2,200607,20060416,1,14,1,0.98,PRD0900121,CL00063,DEP00019,G00007,D00002,,,,994100100135562,L,MM,Top Up,Grocery,STORE00001,LS,E02
3,200607,20060415,7,19,1,3.07,PRD0900135,CL00201,DEP00067,G00021,D00005,CUST0000410727,UM,OT,994100100398294,L,MM,Full Shop,Mixed,STORE00001,LS,E02
4,200607,20060415,7,19,1,4.81,PRD0900220,CL00051,DEP00013,G00005,D00002,CUST0000410727,UM,OT,994100100398294,L,MM,Full Shop,Mixed,STORE00001,LS,E02


In [0]:
trans_df.columns = trans_df.columns.str.lower()

In [0]:
# Now get the time table
time_df = pd.read_csv(folder_url + '/' + time_files[0])

In [10]:
trans_df.cust_code.nunique()

5000

In [11]:
trans_df.cust_code.value_counts().describe()

count    5000.0000
mean      508.2038
std       681.1192
min         1.0000
25%        32.0000
50%       205.0000
75%       739.2500
max      5309.0000
Name: cust_code, dtype: float64

## Transformation

### Data cleansing

In [12]:
# Review NAs for the whole dataframe
trans_df.isna().sum()

shop_week                        0
shop_date                        0
shop_weekday                     0
shop_hour                        0
quantity                         0
spend                            0
prod_code                        0
prod_code_10                     0
prod_code_20                     0
prod_code_30                     0
prod_code_40                     0
cust_code                   617450
cust_price_sensitivity      617450
cust_lifestage              924940
basket_id                        0
basket_size                      0
basket_price_sensitivity         0
basket_type                      0
basket_dominant_mission          0
store_code                       0
store_format                     0
store_region                     0
dtype: int64

We have 617,450 transactions with unidentified customers and a further 307,490 transactions (924,940 minus 617,450) with identified customers but no lifestage information.

### Customers table

Next, we extract the customer data from the complete dataset so that it can be cleaned and sent to the database.

In [0]:
# Extract the customer information
cust_df = trans_df[['cust_code', 'cust_price_sensitivity', 'cust_lifestage']]
cust_df.columns = ['cust_id', 'cust_price_sensitivity', 'cust_lifestage']

In [14]:
#Review NAs
cust_df.isna().sum()

cust_id                   617450
cust_price_sensitivity    617450
cust_lifestage            924940
dtype: int64

In [15]:
#Delete the rows without customer information
# Reassign instead of using inplace=True, which raises SettingWithCopyWarning on a column slice
cust_df = cust_df.dropna(how='all')


In [16]:
# Review NAs again
cust_df.isna().sum()

cust_id                        0
cust_price_sensitivity         0
cust_lifestage            307490
dtype: int64

Check the number of identified customers without lifestage information.

In [17]:
cust_df.cust_id[cust_df.cust_lifestage.isna()].nunique()

521

In [18]:
cust_df.cust_id.nunique()

5000

About 10% of the customers (521 of 5,000) have no lifestage information. These records will be kept. However, we can check whether that variable is set somewhere else in the dataframe for the same customer.

In [19]:
# Get the number of unique rows in which cust_lifestage is NA
cust_df[cust_df.cust_lifestage.isna()].drop_duplicates().shape

(521, 3)

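Since the number of unique rows in `cust_df` with a missing lifestage (521) equals the number of unique customers with a missing lifestage (521), each of these customers has a single attribute combination among those rows. The deduplicated table also ends up with one row per customer (5,000 rows for 5,000 customers), so no customer has both a missing and a populated lifestage. We conclude that the dataframe holds no information that could replace the null values of `cust_lifestage`.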
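
The same question can also be asked directly: does any customer appear both with and without a lifestage? The following is a minimal sketch on toy data, not the notebook's own check. The dataframe `toy_cust` and its values are hypothetical stand-ins for `cust_df`.

```python
import pandas as pd

# Toy stand-in for cust_df (hypothetical values, not taken from the dataset)
toy_cust = pd.DataFrame({
    'cust_id': ['C1', 'C1', 'C2', 'C2', 'C3'],
    'cust_lifestage': ['OT', 'OT', None, None, 'YF'],
})

# Customers that appear at least once with a missing lifestage
missing_ids = set(toy_cust.loc[toy_cust.cust_lifestage.isna(), 'cust_id'])
# Customers that appear at least once with a known lifestage
known_ids = set(toy_cust.loc[toy_cust.cust_lifestage.notna(), 'cust_id'])

# Customers whose missing lifestage could be filled from another row
recoverable = missing_ids & known_ids
print(len(recoverable))
```

An empty `recoverable` set means there is nothing in the dataframe to fill the gaps with, which is the situation described for `cust_lifestage`.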

After removing the duplicate entries, this information will be stored in the `Customers` table of the database.

In [0]:
customers = cust_df.drop_duplicates() 

In [87]:
customers.head()

Unnamed: 0,cust_id,cust_price_sensitivity,cust_lifestage
0,CUST0000410727,UM,OT
1,CUST0000634693,LA,YF
47,CUST0000353957,LA,PE
50,CUST0000089820,LA,OT
52,CUST0000715467,MM,OT


We end up with a total of 5,000 identified customers to include.

### Timestamps table

In [49]:
timestamps_df.columns

Index(['index', 'shop_week', 'shop_date', 'shop_hour', 'date_from', 'date_to'], dtype='object')

In [0]:
# Extract the time information
timestamps_df = trans_df[['shop_week', 'shop_date', 'shop_hour']].copy()
timestamps_df.drop_duplicates(inplace=True)
timestamps_df = timestamps_df.merge(time_df, on='shop_week', how='left', sort=True)
timestamps_df.reset_index(inplace=True)
timestamps_df.columns = ['time_id', 'shop_week', 'shop_date', 'shop_hour', 'date_from', 'date_to']

For the purpose of this analysis, dates will not be parsed. They are kept in their raw `YYYYMMDD` form and can easily be converted to proper dates in a later step.
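
If the conversion is needed later, it could look roughly like the sketch below. It assumes the date columns hold `YYYYMMDD` values, as the sample output suggests. `toy_time` and the `shop_date_parsed` column name are hypothetical.

```python
import pandas as pd

# Toy stand-in for the timestamps dataframe (values copied from the sample rows)
toy_time = pd.DataFrame({
    'shop_date': [20060415, 20060413],
    'date_from': [20060410, 20060410],
})

# Cast to string first so the call works whether the column was read as int or str
toy_time['shop_date_parsed'] = pd.to_datetime(
    toy_time['shop_date'].astype(str), format='%Y%m%d'
)
print(toy_time.dtypes)
```

Passing an explicit `format` avoids pandas guessing the layout and makes malformed values fail loudly.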

In [0]:
# And store the dataframe
timestamps = timestamps_df[['time_id','shop_week', 'date_from', 'date_to', 'shop_date', 'shop_hour']]

In [175]:
# Let's take a look at it.
timestamps.head()

Unnamed: 0,time_id,shop_week,date_from,date_to,shop_date,shop_hour
0,0,200607,20060410,20060416,20060415,19
1,1,200607,20060410,20060416,20060413,20
2,2,200607,20060410,20060416,20060416,14
3,3,200607,20060410,20060416,20060412,19
4,4,200607,20060410,20060416,20060413,18


### Products table

In [0]:
# Extract the products information
products_df = trans_df[['prod_code', 'prod_code_10', 'prod_code_20', 'prod_code_30', 'prod_code_40']].copy()
products_df.columns = ['prod_id', 'prod_code_10', 'prod_code_20', 'prod_code_30', 'prod_code_40']
products_df.drop_duplicates(inplace=True)

In [103]:
products_df.head()

Unnamed: 0,prod_id,prod_code_10,prod_code_20,prod_code_30,prod_code_40
0,PRD0900033,CL00201,DEP00067,G00021,D00005
1,PRD0900097,CL00001,DEP00001,G00001,D00001
2,PRD0900121,CL00063,DEP00019,G00007,D00002
3,PRD0900135,CL00201,DEP00067,G00021,D00005
4,PRD0900220,CL00051,DEP00013,G00005,D00002


In [0]:
# And store the dataframe
products = products_df

### Baskets table

In [0]:
# Extract the baskets information
baskets_df = trans_df[['basket_id', 'basket_size', 'basket_price_sensitivity', 'basket_type', 'basket_dominant_mission']].copy()
baskets_df.drop_duplicates(inplace=True)

In [179]:
# Check that no index value reaches the maximum signed 32-bit integer
any(baskets_df.index == 2147483647)

False
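
The constant 2147483647 is the largest signed 32-bit integer. The `basket_id` values in the sample output are far larger, so a related check is whether an identifier column fits in a 32-bit or a 64-bit integer. The sketch below is only an illustration: `toy_baskets` is hypothetical, and the notebook does not show which integer type the database column uses.

```python
import numpy as np
import pandas as pd

# Toy stand-in for baskets_df (ids copied from the sample rows)
toy_baskets = pd.DataFrame({'basket_id': [994100100398294, 994100100532898]})

int32_max = np.iinfo(np.int32).max  # largest signed 32-bit integer
int64_max = np.iinfo(np.int64).max  # largest signed 64-bit integer

# True only if every id could be stored in a signed 32-bit column
fits_int32 = bool((toy_baskets['basket_id'] <= int32_max).all())
# True if every id could be stored in a signed 64-bit column
fits_int64 = bool((toy_baskets['basket_id'] <= int64_max).all())
print(fits_int32, fits_int64)
```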

In [106]:
baskets_df.head()

Unnamed: 0,basket_id,basket_size,basket_price_sensitivity,basket_type,basket_dominant_mission
0,994100100398294,L,MM,Full Shop,Mixed
1,994100100532898,L,LA,Top Up,Fresh
2,994100100135562,L,MM,Top Up,Grocery
5,994100100532897,M,MM,Small Shop,Fresh
6,994100100136577,M,LA,Small Shop,Fresh


In [0]:
# And store the dataframe
baskets = baskets_df

### Stores table

In [0]:
# Extract the stores information
stores_df = trans_df[['store_code', 'store_format', 'store_region']].copy()
stores_df.columns = ['store_id', 'store_format', 'store_region']
stores_df.drop_duplicates(inplace=True)

In [110]:
stores_df.head()

Unnamed: 0,store_id,store_format,store_region
0,STORE00001,LS,E02
47,STORE00002,LS,W01
94,STORE00003,LS,E01
145,STORE00004,MS,E03
155,STORE00006,LS,S01


In [0]:
# And store the dataframe
stores = stores_df

### Transactions table

In [0]:
# Extract the transactions information
transactions_df = trans_df[['shop_week', 'shop_date', 'shop_hour',
                            'quantity', 'spend', 'prod_code', 'cust_code','basket_id', 'store_code']].copy()
transactions_df.columns = ['shop_week', 'shop_date','shop_hour',
                           'quantity', 'spend', 'prod_id', 'cust_id','basket_id', 'store_id']

In [0]:
# Append the time_id
transactions_df = transactions_df.merge(timestamps[['shop_date', 'shop_hour', 'time_id']], 
                                  how='left',  
                                  on=['shop_date', 'shop_hour'])
transactions_df.drop(columns=['shop_week', 'shop_date', 'shop_hour'], inplace=True)

In [0]:
transactions = transactions_df[['time_id','quantity', 'spend', 'prod_id', 'cust_id', 'basket_id', 'store_id']]

In [115]:
transactions.head()

Unnamed: 0,time_id,quantity,spend,prod_id,cust_id,basket_id,store_id
0,0,1,0.93,PRD0900033,CUST0000410727,994100100398294,STORE00001
1,1,1,1.03,PRD0900097,CUST0000634693,994100100532898,STORE00001
2,2,1,0.98,PRD0900121,,994100100135562,STORE00001
3,0,1,3.07,PRD0900135,CUST0000410727,994100100398294,STORE00001
4,0,1,4.81,PRD0900220,CUST0000410727,994100100398294,STORE00001
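
Attaching `time_id` relies on each (`shop_date`, `shop_hour`) pair appearing only once in the timestamps table. Otherwise the left merge would silently duplicate transactions. One way to guard against that is the `validate` argument of `DataFrame.merge`, sketched below on toy data. The `toy_` dataframes are hypothetical.

```python
import pandas as pd

# Toy lookup table: one time_id per (shop_date, shop_hour) pair
toy_timestamps = pd.DataFrame({
    'time_id': [0, 1, 2],
    'shop_date': [20060415, 20060413, 20060416],
    'shop_hour': [19, 20, 14],
})

# Toy transactions: several rows may share the same date and hour
toy_trans = pd.DataFrame({
    'shop_date': [20060415, 20060413, 20060416, 20060415],
    'shop_hour': [19, 20, 14, 19],
    'spend': [0.93, 1.03, 0.98, 3.07],
})

# validate='many_to_one' raises MergeError if the right-hand keys are not unique
merged = toy_trans.merge(
    toy_timestamps,
    how='left',
    on=['shop_date', 'shop_hour'],
    validate='many_to_one',
)

# The row count must be unchanged and every row must have found a time_id
print(len(merged) == len(toy_trans), merged['time_id'].notna().all())
```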


## Loading data

In [0]:
# Connect to the DB
mysql_engine = create_engine('mysql+pymysql://root1000ml:IheartData@learning-1000ml.c0zbrffehjje.us-east-2.rds.amazonaws.com:3306/grocery_db',echo=False)
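
Keeping the user name and password out of the notebook is usually safer. The sketch below assembles the connection URL from environment variables. The variable names (`GROCERY_DB_USER` and so on), the helper `build_mysql_url`, and the example values are all hypothetical.

```python
import os
from urllib.parse import quote

def build_mysql_url(env=os.environ):
    """Assemble a mysql+pymysql connection URL from environment variables."""
    # Percent-encode the credentials so characters such as '@' or '/' do not break the URL
    user = quote(env['GROCERY_DB_USER'], safe='')
    password = quote(env['GROCERY_DB_PASSWORD'], safe='')
    host = env['GROCERY_DB_HOST']
    port = env.get('GROCERY_DB_PORT', '3306')
    name = env.get('GROCERY_DB_NAME', 'grocery_db')
    return f'mysql+pymysql://{user}:{password}@{host}:{port}/{name}'

# Example with made-up values; in practice the real environment would be used
example_env = {
    'GROCERY_DB_USER': 'etl_user',
    'GROCERY_DB_PASSWORD': 'p@ss/word',
    'GROCERY_DB_HOST': 'db.example.com',
}
url = build_mysql_url(example_env)
print(url)
```

The resulting string can be passed to `create_engine` in the same way as the literal URL.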

In [0]:
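# Set table names to iterate
# Engine.table_names() was removed in SQLAlchemy 2.0; use the inspector instead
from sqlalchemy import inspect
sql_tables = inspect(mysql_engine).get_table_names()
pd_tables = [tab.lower() for tab in sql_tables]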

In [194]:
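#Iterate over tables and send to sql database
# Map each lowercase table name to its dataframe explicitly instead of using eval
frames = {'baskets': baskets, 'customers': customers, 'products': products,
          'stores': stores, 'timestamps': timestamps, 'transactions': transactions}
for pdtab, sqltab in zip(pd_tables, sql_tables):
  print(pdtab, sqltab, frames[pdtab].shape[0])
  frames[pdtab].to_sql(sqltab, mysql_engine, if_exists='append', index=False)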

baskets Baskets 490982
customers Customers 5000
products Products 4997
stores Stores 761
timestamps Timestamps 11467
transactions Transactions 3158469
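
With more than three million transaction rows, it may help to write in batches and confirm the row count afterwards. The sketch below shows the idea with the `chunksize` argument of `to_sql`. It uses an in-memory SQLite database as a stand-in for the MySQL engine, and `toy_transactions` and the chunk size of 2 are illustrative only.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the transactions dataframe (values copied from the sample rows)
toy_transactions = pd.DataFrame({
    'time_id': [0, 1, 2, 0, 0],
    'quantity': [1, 1, 1, 1, 1],
    'spend': [0.93, 1.03, 0.98, 3.07, 4.81],
})

# In-memory SQLite connection used here only so the sketch runs without a server
conn = sqlite3.connect(':memory:')

# chunksize limits how many rows are written per batch
toy_transactions.to_sql('Transactions', conn, if_exists='append',
                        index=False, chunksize=2)

# Compare the number of rows in the table with the dataframe length
loaded_rows = conn.execute('SELECT COUNT(*) FROM Transactions').fetchone()[0]
conn.close()
print(loaded_rows == len(toy_transactions))
```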
