Prepare data for analysis. In this part of the process you clean and stage the data so it is ready for a true marketing analysis using SQL.
1. Import pandas 
2. Import random package
3. Read the existing .csv files
4. Join shopping_trends.csv and MOCK_DATA_1.csv on customer_id

In [1]:
import pandas as pd
import numpy as np
import random
import sqlite3
import sqlalchemy


In [2]:
# Read the two CSV files into DataFrames
shopping_trends = pd.read_csv('shopping_trends.csv')
shopping_trends_column_names = ["customer_id","Age","Gender","Item_Purchased","Category","Purchase_Amount_USD","Location","Size","Color","Season","Review_Rating","Subscription_Status","Payment_Method","Shipping_Type","Discount_Applied","Promo_Code_Used","Previous_Purchases","Preferred_Payment_Method","Frequency_of_Purchases"]
shopping_trends.columns= shopping_trends_column_names
mock_data = pd.read_csv('MOCK_DATA_1.csv')


# Merge the DataFrames on the 'customer_id' column (an inner join by default)
merged_data = shopping_trends.merge(mock_data, on='customer_id')
print(merged_data)

     customer_id  Age Gender Item_Purchased     Category  Purchase_Amount_USD  \
0              1   55   Male         Blouse     Clothing                   53   
1              2   19   Male        Sweater     Clothing                   64   
2              3   50   Male          Jeans     Clothing                   73   
3              4   21   Male        Sandals     Footwear                   90   
4              5   45   Male         Blouse     Clothing                   49   
..           ...  ...    ...            ...          ...                  ...   
995          996   44   Male        Jewelry  Accessories                   80   
996          997   29   Male        Sandals     Footwear                   91   
997          998   64   Male          Pants     Clothing                   30   
998          999   51   Male          Shoes     Footwear                   90   
999         1000   50   Male          Socks     Clothing                   28   

          Location Size    
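
Because merge() defaults to an inner join, any customer_id missing from either file is dropped without warning. The sketch below shows one way to check for that. The two small DataFrames are hypothetical stand-ins for the real files, not the actual data.

```python
import pandas as pd

# Hypothetical miniature versions of the two files, for illustration only.
shopping = pd.DataFrame({"customer_id": [1, 2, 3], "Age": [55, 19, 50]})
campaigns = pd.DataFrame({"customer_id": [1, 2, 4], "first_name": ["Ann", "Bo", "Cy"]})

# validate raises MergeError if customer_id repeats on either side;
# indicator adds a _merge column recording where each row came from.
checked = shopping.merge(
    campaigns, on="customer_id", how="outer",
    validate="one_to_one", indicator=True,
)

# Rows that an inner join would silently drop.
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["customer_id", "_merge"]])
```

If unmatched is empty on the real data, the inner join kept every customer.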

1. Now you need to write the merged data to a database.
2. It is a best practice to persist data in a database rather than keeping it only in memory.
3. Your database will need to support queries against the data from the 2 marketing tables.

In [3]:
# Connect to the Database
conn = sqlite3.connect('cl_shopper_trends.db')
merged_data.to_sql('merged_data', conn, if_exists='replace', index=False)

1000
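
The 1000 printed above is the row count that to_sql() reports. As a minimal sketch of a round-trip check, you can also ask SQLite how many rows it stored. The three-row DataFrame and the in-memory database here are assumptions for illustration, so the example does not touch cl_shopper_trends.db.

```python
import sqlite3
import pandas as pd

# Hypothetical three-row frame standing in for merged_data.
demo = pd.DataFrame({"customer_id": [1, 2, 3], "Purchase_Amount_USD": [53, 64, 73]})

# ":memory:" keeps this sketch from touching cl_shopper_trends.db.
demo_conn = sqlite3.connect(":memory:")
demo.to_sql("merged_data", demo_conn, if_exists="replace", index=False)

# Count the rows SQLite actually stored and compare with the DataFrame.
stored_rows = demo_conn.execute("SELECT COUNT(*) FROM merged_data").fetchone()[0]
print(stored_rows == len(demo))
demo_conn.close()
```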

1. You already performed one data cleaning task by renaming the column headers to use underscores instead of spaces.
2. After glancing over the 3 files, you noticed some duplicate descriptions in the Frequency_of_Purchases column of the merged_data file. Change "Fortnightly" to "Bi-Weekly" and "Every 3 Months" to "Quarterly".

In [4]:
# Data cleaning task 2: deduplicate the Frequency_of_Purchases descriptions by passing a replacement dictionary to .replace()
replace_duplicates = {'Fortnightly': 'Bi-Weekly', 'Every 3 Months': 'Quarterly'}
merged_data['Frequency_of_Purchases'] = merged_data['Frequency_of_Purchases'].replace(replace_duplicates)

print(merged_data['Frequency_of_Purchases'])

0      Bi-Weekly
1      Bi-Weekly
2         Weekly
3         Weekly
4       Annually
         ...    
995       Weekly
996    Quarterly
997    Bi-Weekly
998    Bi-Weekly
999      Monthly
Name: Frequency_of_Purchases, Length: 1000, dtype: object
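
Dictionary keys passed to .replace() are matched exactly, including capitalization, and a key that matches nothing is ignored silently. The sketch below, run on a hypothetical sample of the column, checks that every key exists in the data before replacing.

```python
import pandas as pd

# Hypothetical sample of the Frequency_of_Purchases column.
freq = pd.Series(["Fortnightly", "Bi-Weekly", "Every 3 Months", "Quarterly", "Weekly"])

replace_map = {"Fortnightly": "Bi-Weekly", "Every 3 Months": "Quarterly"}

# Keys are matched exactly, so a key such as "Every 3 months" (lowercase m)
# would leave the data unchanged without raising an error.
present = set(freq)
missing_keys = [key for key in replace_map if key not in present]

cleaned = freq.replace(replace_map)
print(missing_keys)
print(cleaned.value_counts())
```

An empty missing_keys list means every replacement had something to act on.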


It looks like you don't have any values for First Purchase and Last Purchase, which will be critical to calculating Customer Lifetime Value and Loyalty. You will need to add these columns using Pandas.

    A. Customer Value is calculated as: Average Order Value multiplied by Average Number of Purchases. You will run this calculation in SQL. 
    B. Customer Lifetime Value is calculated as: (Customer Value) * (Average Customer Lifespan)
    C. Average Customer Lifespan is calculated as: the average time between a customer's first purchase and last purchase.
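
Steps B and C can be sketched in pandas as follows. The three date pairs and the customer_value figure are placeholders invented for illustration. In this workflow the real Customer Value comes from the SQL step, and measuring lifespan in years is an assumption about units.

```python
import pandas as pd

# Hypothetical purchase dates for three customers.
purchases = pd.DataFrame({
    "first_purchase": pd.to_datetime(["2020-01-01", "2020-07-01", "2021-01-01"]),
    "last_purchase": pd.to_datetime(["2022-01-01", "2023-07-01", "2023-01-01"]),
})

# C. Average customer lifespan: mean gap between first and last purchase.
lifespan_days = (purchases["last_purchase"] - purchases["first_purchase"]).dt.days
avg_lifespan_years = lifespan_days.mean() / 365.25

# B. Customer Lifetime Value = Customer Value * Average Customer Lifespan.
customer_value = 120.0  # placeholder; the real figure comes from the SQL step
clv = customer_value * avg_lifespan_years
print(round(avg_lifespan_years, 2), round(clv, 2))
```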

In [5]:
# pandas and random were already imported in cell 1; only the datetime helpers are new here
from datetime import datetime, timedelta

In [6]:
#Create variable start date 
start_date = datetime(2020, 1, 1)
#Create variable end date 
end_date = datetime(2023, 12, 31)

In [7]:
# Compute the full date window (delta) and its midpoint (alpha).
# First purchases are drawn from the first half of the window and last purchases from the second half,
# so a customer's last_purchase is never before their first_purchase.
delta = end_date - start_date
alpha = delta/2
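
As a quick worked check of the arithmetic, the same two variables can be printed to see the size of the window and where its midpoint falls:

```python
from datetime import datetime

start_date = datetime(2020, 1, 1)
end_date = datetime(2023, 12, 31)

delta = end_date - start_date   # full window as a timedelta
alpha = delta / 2               # half of the window

print(delta.days, alpha.days)   # 1460 730
print(start_date + alpha)       # 2021-12-31 00:00:00
```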


In [8]:
# Use delta and alpha with timedelta and random.randrange to build two lists,
# which become the new columns customer_first_purchase and customer_last_purchase
customer_first_purchase, customer_last_purchase = [],[]
for i in range(1000):
    first_purchase = start_date + timedelta(days=random.randrange(0, alpha.days))
    last_purchase = start_date + timedelta(days=random.randrange(alpha.days, delta.days))
    customer_first_purchase.append(first_purchase)
    customer_last_purchase.append(last_purchase)
print(customer_first_purchase, customer_last_purchase) 

#By creating two empty lists and a for loop, we are able to generate random values between our alpha and delta calculations which are based on the start_date and end_date variables.

[datetime.datetime(2021, 9, 29, 0, 0), datetime.datetime(2020, 6, 10, 0, 0), datetime.datetime(2021, 6, 12, 0, 0), datetime.datetime(2020, 11, 18, 0, 0), datetime.datetime(2020, 12, 31, 0, 0), datetime.datetime(2021, 2, 12, 0, 0), datetime.datetime(2021, 4, 12, 0, 0), datetime.datetime(2021, 6, 18, 0, 0), datetime.datetime(2021, 11, 29, 0, 0), datetime.datetime(2021, 5, 15, 0, 0), datetime.datetime(2021, 1, 29, 0, 0), datetime.datetime(2021, 11, 9, 0, 0), datetime.datetime(2020, 9, 23, 0, 0), datetime.datetime(2021, 12, 19, 0, 0), datetime.datetime(2021, 1, 30, 0, 0), datetime.datetime(2021, 12, 3, 0, 0), datetime.datetime(2021, 2, 15, 0, 0), datetime.datetime(2021, 2, 2, 0, 0), datetime.datetime(2021, 7, 5, 0, 0), datetime.datetime(2021, 7, 31, 0, 0), datetime.datetime(2021, 6, 25, 0, 0), datetime.datetime(2021, 12, 19, 0, 0), datetime.datetime(2021, 10, 18, 0, 0), datetime.datetime(2020, 11, 9, 0, 0), datetime.datetime(2020, 9, 10, 0, 0), datetime.datetime(2021, 10, 8, 0, 0), datetim
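
The dates above change on every run because the random module is unseeded. If reproducible output matters, one option is a seeded generator, sketched below. The seed value 42 is arbitrary, and the names rng, first_dates and last_dates are illustrative rather than part of the original notebook.

```python
import random
from datetime import datetime, timedelta

start_date = datetime(2020, 1, 1)
end_date = datetime(2023, 12, 31)
delta = end_date - start_date
alpha = delta / 2

# A private, seeded generator (the seed value is arbitrary) so reruns match.
rng = random.Random(42)

first_dates = [start_date + timedelta(days=rng.randrange(0, alpha.days)) for _ in range(1000)]
last_dates = [start_date + timedelta(days=rng.randrange(alpha.days, delta.days)) for _ in range(1000)]

# The two ranges do not overlap, so every last purchase follows its first purchase.
all_ordered = all(first < last for first, last in zip(first_dates, last_dates))
print(all_ordered)
```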

In [9]:
# Great! The random dates were generated. Add customer_first_purchase to merged_data as a new column.
# .head() only displays output when it is the last line of a cell, so the preview happens once both columns are added.
merged_data['first_purchase'] = customer_first_purchase

In [10]:
# Add customer_last_purchase to merged_data as a new column
merged_data['last_purchase'] = customer_last_purchase

In [11]:
merged_data.to_sql('merged_data', conn, index=False, if_exists='replace')
merged_data.head()

   customer_id  Age  Gender  Item_Purchased  Category  Purchase_Amount_USD  \
0            1   55    Male          Blouse  Clothing                   53
1            2   19    Male         Sweater  Clothing                   64
2            3   50    Male           Jeans  Clothing                   73
3            4   21    Male         Sandals  Footwear                   90
4            5   45    Male          Blouse  Clothing                   49

        Location  Size      Color  Season  ...  \
0       Kentucky     L       Gray  Winter  ...
1          Maine     L     Maroon  Winter  ...
2  Massachusetts     S     Maroon  Spring  ...
3   Rhode Island     M     Maroon  Spring  ...
4         Oregon     M  Turquoise  Spring  ...

      tiktok_campaign  tiktok_cpa  tiktok_conversion  tiktok_dates  \
0       cross product       $5.07              False       7/18/22
1  frequency purchase       $0.31              False        6/7/22
2  frequency purchase       $3.88              False      10/16/22
3       we missed you       $5.57               True        7/4/22
4  frequency purchase       $0.76               True       6/24/22

   email_campaign  email_cpa  email_conversion  email_dates  first_purchase  last_purchase
0  abandoned cart      $3.00              True     10/29/22      2021-09-29     2022-04-02
1         welcome      $6.14              True      5/24/22      2020-06-10     2022-09-18
2   cross product      $8.11              True       8/1/22      2021-06-12     2023-06-15
3         welcome      $7.44              True      1/12/22      2020-11-18     2022-08-27
4         welcome      $7.29             False      1/29/22      2020-12-31     2023-12-09


In [12]:
# Double-check that each column's data type is usable for the calculations ahead
column_data_types = merged_data.dtypes
print(column_data_types)

customer_id                          int64
Age                                  int64
Gender                              object
Item_Purchased                      object
Category                            object
Purchase_Amount_USD                  int64
Location                            object
Size                                object
Color                               object
Season                              object
Review_Rating                      float64
Subscription_Status                 object
Payment_Method                      object
Shipping_Type                       object
Discount_Applied                    object
Promo_Code_Used                     object
Previous_Purchases                   int64
Preferred_Payment_Method            object
Frequency_of_Purchases              object
first_name                          object
last_name                           object
tiktok_campaign                     object
tiktok_cpa                          object
tiktok_conv
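
Scanning a long dtypes listing by eye is easy to get wrong. As a minimal sketch on a hypothetical three-column slice of the data, the pandas type-checking helpers can list the columns that are neither numeric nor datetime, which are the ones most likely to need conversion.

```python
import pandas as pd

# Hypothetical three-column slice of merged_data.
sample = pd.DataFrame({
    "Purchase_Amount_USD": [53, 64],
    "tiktok_cpa": ["$5.07", "$0.31"],
    "first_purchase": pd.to_datetime(["2021-09-29", "2020-06-10"]),
})

# Columns that are neither numeric nor datetime, usually text needing conversion.
non_numeric = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
    and not pd.api.types.is_datetime64_any_dtype(sample[col])
]

# Confirm the date column really is datetime before using the .dt accessor.
is_datetime = pd.api.types.is_datetime64_any_dtype(sample["first_purchase"])
print(non_numeric, is_datetime)
```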

To calculate Customer Lifetime Value, let's go ahead and extract the month from the first_purchase and last_purchase columns. This will make things easier when it comes to performing your marketing analysis. 

In [13]:
merged_data['first_purchase_month'] = merged_data['first_purchase'].dt.month
print(merged_data['first_purchase_month'])

0       9
1       6
2       6
3      11
4      12
       ..
995    12
996     8
997     1
998     4
999     3
Name: first_purchase_month, Length: 1000, dtype: int32


In [14]:
merged_data['last_purchase_month'] = merged_data['last_purchase'].dt.month
print(merged_data['last_purchase_month'])

0       4
1       9
2       6
3       8
4      12
       ..
995     6
996     8
997     9
998    10
999     6
Name: last_purchase_month, Length: 1000, dtype: int32
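
One caveat: .dt.month returns only a number from 1 to 12, so the year is lost and months from different years cannot be told apart. If that matters for the analysis, a year-month period or a month count between the two dates may be useful. The sketch below uses the first two rows of dates shown earlier, and the names first_period and months_active are illustrative.

```python
import pandas as pd

dates = pd.DataFrame({
    "first_purchase": pd.to_datetime(["2021-09-29", "2020-06-10"]),
    "last_purchase": pd.to_datetime(["2022-04-02", "2022-09-18"]),
})

# A monthly Period keeps the year attached to the month.
first_period = dates["first_purchase"].dt.to_period("M")

# Calendar months between first and last purchase (day of month ignored).
months_active = (
    (dates["last_purchase"].dt.year - dates["first_purchase"].dt.year) * 12
    + (dates["last_purchase"].dt.month - dates["first_purchase"].dt.month)
)
print(first_period.astype(str).tolist(), months_active.tolist())
```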


It looks like the CPA fields are being read as objects, but they need to be floats because they hold dollar amounts. Let's change that:

In [15]:
# Remove the dollar sign (and any thousands separator) and convert tiktok_cpa to float.
# The raw string r'...' avoids an invalid-escape warning for \$.
merged_data['tiktok_cpa'] = merged_data['tiktok_cpa'].replace(r'[\$,]', '', regex=True).astype(float)
# Remove the dollar sign (and any thousands separator) and convert email_cpa to float
merged_data['email_cpa'] = merged_data['email_cpa'].replace(r'[\$,]', '', regex=True).astype(float)

column_data_types = merged_data.dtypes
print(column_data_types)

customer_id                          int64
Age                                  int64
Gender                              object
Item_Purchased                      object
Category                            object
Purchase_Amount_USD                  int64
Location                            object
Size                                object
Color                               object
Season                              object
Review_Rating                      float64
Subscription_Status                 object
Payment_Method                      object
Shipping_Type                       object
Discount_Applied                    object
Promo_Code_Used                     object
Previous_Purchases                   int64
Preferred_Payment_Method            object
Frequency_of_Purchases              object
first_name                          object
last_name                           object
tiktok_campaign                     object
tiktok_cpa                         float64
tiktok_conv
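
.astype(float) raises an error on the first value it cannot parse. As an alternative sketch, pd.to_numeric with errors="coerce" turns unparseable values into NaN so they can be counted first. The sample values below, including the "n/a" entry, are hypothetical.

```python
import pandas as pd

# Hypothetical CPA values, including one that is not a clean dollar amount.
raw_cpa = pd.Series(["$5.07", "$1,200.50", "n/a"])

# Strip "$" and thousands separators as literal characters (regex=False).
stripped = raw_cpa.str.replace("$", "", regex=False).str.replace(",", "", regex=False)

# errors="coerce" turns unparseable entries into NaN instead of raising,
# which makes bad rows easy to count before committing to the conversion.
cpa = pd.to_numeric(stripped, errors="coerce")
print(cpa.isna().sum())
```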

Great! It looks like you passed the first requirement of the interview: your data is clean and populated with the fields you need to start your analysis. 

Time to start exploring with SQL in another Jupyter notebook. You will need to assign your SQL code to Python variables and run it from there. 
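
As a minimal sketch of that pattern, a query can be stored in a Python string and handed to pd.read_sql_query. The three-row table, the in-memory database, and the names demo_conn and avg_order_query are assumptions for illustration. The real notebook would connect to cl_shopper_trends.db instead.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the merged_data table; an in-memory database
# keeps the sketch separate from cl_shopper_trends.db.
demo_conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "Category": ["Clothing", "Clothing", "Footwear"],
    "Purchase_Amount_USD": [53, 64, 90],
}).to_sql("merged_data", demo_conn, index=False)

# The SQL lives in an ordinary Python string variable...
avg_order_query = """
    SELECT Category, AVG(Purchase_Amount_USD) AS avg_order_value
    FROM merged_data
    GROUP BY Category
    ORDER BY Category
"""

# ...and pandas runs it and returns the result as a DataFrame.
avg_order = pd.read_sql_query(avg_order_query, demo_conn)
print(avg_order)
demo_conn.close()
```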