## Part 1: Data Cleaning and Preparation

In [1]:
import numpy as np 
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

#### **Importing data**

In [2]:
df_customer = pd.read_csv('data/profile.csv').drop('Unnamed: 0', axis = 1)
df_offer = pd.read_csv('data/portfolio.csv').drop('Unnamed: 0', axis = 1)
df_transcript = pd.read_csv('data/transcript.csv').drop('Unnamed: 0', axis = 1)

#### **Data Cleaning and Manipulation**
Cleaning and addressing issues in each data set individually.

**df_customer**

In [3]:
df_customer.head(3)

Unnamed: 0,gender,age,id,became_member_on,income
0,,118,68be06ca386d4c31939f3a4f0e3dd783,20170212,
1,F,55,0610b486422d4921ae7d2bf64640c50b,20170715,112000.0
2,,118,38fe809add3b4fcf9315a9694bb96ff5,20180712,


In [4]:
# Missing data
missing_percent = round(df_customer.isna().mean() * 100, 1)
pd.DataFrame(missing_percent[missing_percent > 0], columns=["% of Missing Values"])

Unnamed: 0,% of Missing Values
gender,12.8
income,12.8


All of the missing values come from two columns in df_customer: gender and income. Each of these columns is missing 12.8% of its values. Although it's not ideal, the rows with missing values will be removed.
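In the preview above, the rows with a missing gender also have a missing income and an age of 118, which suggests a placeholder for customers who gave no profile details. Before dropping rows, it may be worth confirming that pattern. The following is a minimal sketch on a made-up frame; `toy_customer` and its values are illustrative assumptions, not the real data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_customer; the values are made up for illustration.
toy_customer = pd.DataFrame({
    "gender": [np.nan, "F", np.nan, "M"],
    "age": [118, 55, 118, 68],
    "income": [np.nan, 112000.0, np.nan, 70000.0],
})

# Boolean masks for the rows where each column is missing.
missing_gender = toy_customer["gender"].isna()
missing_income = toy_customer["income"].isna()

# True when the two columns are missing on exactly the same rows.
same_rows = bool((missing_gender == missing_income).all())

# Ages that appear on the rows with missing values.
ages_when_missing = toy_customer.loc[missing_gender, "age"].unique().tolist()

print(same_rows, ages_when_missing)
```

If the same check on the real data shows both columns missing on the same rows, dropping them removes one group of incomplete profiles rather than two unrelated sets of rows.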

In [5]:
df_customer.dropna(inplace = True)
df_customer.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14825 entries, 1 to 16999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            14825 non-null  object 
 1   age               14825 non-null  int64  
 2   id                14825 non-null  object 
 3   became_member_on  14825 non-null  int64  
 4   income            14825 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 694.9+ KB


became_member_on looks like it should be a date, not an integer:

In [6]:
# Data Type Manipulation
df_customer['became_member_on'] = pd.to_datetime(df_customer['became_member_on'], format = '%Y%m%d')

# These will probably be helpful during EDA, so I'm going to go ahead and add day, month, and year columns.
df_customer['year'] = df_customer['became_member_on'].dt.year
df_customer['month_number'] = df_customer['became_member_on'].dt.month
df_customer['day_of_month'] = df_customer['became_member_on'].dt.day
df_customer['month'] = df_customer['became_member_on'].dt.month_name()
df_customer['day_number'] = df_customer['became_member_on'].dt.weekday
df_customer['day'] = df_customer['became_member_on'].dt.day_name()

df_customer.head()

Unnamed: 0,gender,age,id,became_member_on,income,year,month_number,day_of_month,month,day_number,day
1,F,55,0610b486422d4921ae7d2bf64640c50b,2017-07-15,112000.0,2017,7,15,July,5,Saturday
3,F,75,78afa995795e4d85b5d9ceeca43f5fef,2017-05-09,100000.0,2017,5,9,May,1,Tuesday
5,M,68,e2127556f4f64592b11af22de27a7932,2018-04-26,70000.0,2018,4,26,April,3,Thursday
8,M,65,389bc3fa690240e798340f5a15918d5c,2018-02-09,53000.0,2018,2,9,February,4,Friday
12,M,58,2eeac8d8feae4a8cad5a6af0499a211d,2017-11-11,51000.0,2017,11,11,November,5,Saturday
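The conversion above assumes every `became_member_on` value matches `%Y%m%d`. As a hedged sketch, one way to check that assumption is to parse with `errors="coerce"` and count the values that fail. The series `toy_dates` is made up for illustration, and the `"unknown"` entry is a deliberately malformed value:

```python
import pandas as pd

# Toy values: two dates taken from the output above plus one malformed entry.
toy_dates = pd.Series(["20170715", "20180426", "unknown"])

# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(toy_dates, format="%Y%m%d", errors="coerce")

# Number of values that did not match the expected format.
n_failed = int(parsed.isna().sum())

# Day names for the values that parsed successfully.
day_names = parsed.dropna().dt.day_name().tolist()

print(n_failed, day_names)
```

A count of zero on the real column would confirm that the strict conversion loses nothing.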


****

**df_offer**

In [7]:
df_offer.head(3)

Unnamed: 0,reward,channels,difficulty,duration,offer_type,id
0,10,"['email', 'mobile', 'social']",10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd
1,10,"['web', 'email', 'mobile', 'social']",10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0
2,0,"['web', 'email', 'mobile']",0,4,informational,3f207df678b143eea3cee63160fa8bed


In [8]:
# No missing data
df_offer.isna().sum()

reward        0
channels      0
difficulty    0
duration      0
offer_type    0
id            0
dtype: int64

In [9]:
df_offer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   reward      10 non-null     int64 
 1   channels    10 non-null     object
 2   difficulty  10 non-null     int64 
 3   duration    10 non-null     int64 
 4   offer_type  10 non-null     object
 5   id          10 non-null     object
dtypes: int64(3), object(3)
memory usage: 612.0+ bytes


In [10]:
# Looking at the different offer_types
df_offer['offer_type'].unique()

array(['bogo', 'informational', 'discount'], dtype=object)

In [11]:
# Looking at the different channels
df_offer['channels'].unique()

array(["['email', 'mobile', 'social']",
       "['web', 'email', 'mobile', 'social']",
       "['web', 'email', 'mobile']", "['web', 'email']"], dtype=object)

Since there are three different offer types, we're going to make a categorical column `offer_code` so that it's easier to reference each offer type. Similarly, we're creating `channels_code` for `channels`.

In [12]:
# offer_type
offer_code = {"bogo": 'A', "discount": 'B', "informational": 'C'}
df_offer['offer_code'] = df_offer['offer_type'].map(offer_code)

channel_mapping = {
    "['web', 'email', 'mobile', 'social']": 'A',
    "['web', 'email', 'mobile']": 'B',
    "['email', 'mobile', 'social']": 'C',
    "['web', 'email']": 'D'
}

# channels
df_offer['channels_code'] = df_offer['channels'].map(channel_mapping)
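The letter codes above treat each channel combination as one category. An alternative that may be handy later is one indicator column per channel. This is a sketch only: `toy_offer` and the `channel_*` column names are hypothetical, and it assumes each `channels` value is a string that looks like a Python list, as in the output above.

```python
import ast
import pandas as pd

# Toy stand-in for df_offer with channels stored as strings.
toy_offer = pd.DataFrame({
    "channels": [
        "['email', 'mobile', 'social']",
        "['web', 'email', 'mobile', 'social']",
        "['web', 'email']",
    ]
})

# Turn each string such as "['web', 'email']" into a real Python list.
channel_lists = toy_offer["channels"].apply(ast.literal_eval)

# One indicator column per channel: 1 if the offer uses it, else 0.
for channel in ["web", "email", "mobile", "social"]:
    toy_offer["channel_" + channel] = channel_lists.apply(
        lambda chans, c=channel: int(c in chans)
    )

print(toy_offer.drop(columns="channels"))
```

Indicator columns make it possible to compare offers by a single channel (for example, everything sent through social), which the combined letter code cannot do directly.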

****

**df_transcript**

In [13]:
df_transcript.head(3)

Unnamed: 0,person,event,value,time
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'},0
1,a03223e636434f42ac4c3df47e8bac43,offer received,{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'},0
2,e2127556f4f64592b11af22de27a7932,offer received,{'offer id': '2906b810c7d4411798c6938adc9daaa5'},0


In [14]:
print(df_transcript['value'])

0         {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'}
1         {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'}
2         {'offer id': '2906b810c7d4411798c6938adc9daaa5'}
3         {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'}
4         {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'}
                                ...                       
306529                      {'amount': 1.5899999999999999}
306530                                    {'amount': 9.53}
306531                                    {'amount': 3.61}
306532                      {'amount': 3.5300000000000002}
306533                                    {'amount': 4.05}
Name: value, Length: 306534, dtype: object


In [15]:
split_values = df_transcript['value'].str.split(':', n=1, expand=True)

# Renaming the columns
split_values.columns = ['key', 'value']

# Remove curly braces and single quotes from the 'key' and 'value' columns using lambda functions
split_values['key_id'] = split_values['key'].apply(lambda x: x.replace('{', '').replace('}', '').replace("'", '').strip())
split_values['value_id'] = split_values['value'].apply(lambda x: x.replace('{', '').replace('}', '').replace("'", '').strip())

# Keep only the cleaned-up columns
split_values.drop(['key', 'value'], axis = 1, inplace = True)

In [16]:
# Adding the split values back to transcript
df_transcript = pd.concat([df_transcript, split_values], axis=1)

# Dropping the original 'value' column
df_transcript.drop(columns=['value'], inplace=True)
df_transcript.head()

Unnamed: 0,person,event,time,key_id,value_id
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,0,offer id,9b98b8c7a33c4b65b9aebfe6a799e6d9
1,a03223e636434f42ac4c3df47e8bac43,offer received,0,offer id,0b1e1539f2cc45b7b9fa7c272da2e1d7
2,e2127556f4f64592b11af22de27a7932,offer received,0,offer id,2906b810c7d4411798c6938adc9daaa5
3,8ec6ce2a7e7949b1bf142def7d0e0586,offer received,0,offer id,fafdcd668e3743c1bb461111dcafc2a4
4,68617ca6246f4fbc85e91a2a49552598,offer received,0,offer id,4d5c57ea9a6940dd891ad53e9dbe8da0


In [17]:
df_transcript['key_id'].value_counts()

amount      138953
offer id    134002
offer_id     33579
Name: key_id, dtype: int64

In [18]:
df_transcript['key_id'] = df_transcript['key_id'].replace('offer_id', 'offer id')

In [19]:
df_transcript.head(10)

Unnamed: 0,person,event,time,key_id,value_id
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,0,offer id,9b98b8c7a33c4b65b9aebfe6a799e6d9
1,a03223e636434f42ac4c3df47e8bac43,offer received,0,offer id,0b1e1539f2cc45b7b9fa7c272da2e1d7
2,e2127556f4f64592b11af22de27a7932,offer received,0,offer id,2906b810c7d4411798c6938adc9daaa5
3,8ec6ce2a7e7949b1bf142def7d0e0586,offer received,0,offer id,fafdcd668e3743c1bb461111dcafc2a4
4,68617ca6246f4fbc85e91a2a49552598,offer received,0,offer id,4d5c57ea9a6940dd891ad53e9dbe8da0
5,389bc3fa690240e798340f5a15918d5c,offer received,0,offer id,f19421c1d4aa40978ebb69ca19b0e20d
6,c4863c7985cf408faee930f111475da3,offer received,0,offer id,2298d6c36e964ae4a3e7e9706d1fb8c2
7,2eeac8d8feae4a8cad5a6af0499a211d,offer received,0,offer id,3f207df678b143eea3cee63160fa8bed
8,aa4862eba776480b8bb9c68455b8c2e1,offer received,0,offer id,0b1e1539f2cc45b7b9fa7c272da2e1d7
9,31dda685af34476cad5bc968bdb01c53,offer received,0,offer id,0b1e1539f2cc45b7b9fa7c272da2e1d7


In [20]:
df_transcript.tail(10)

Unnamed: 0,person,event,time,key_id,value_id
306524,d613ca9c59dd42f497bdbf6178da54a7,transaction,714,amount,25.14
306525,eec70ab28af74a22a4aeb889c0317944,transaction,714,amount,43.58
306526,24f56b5e1849462093931b164eb803b5,transaction,714,amount,22.64
306527,24f56b5e1849462093931b164eb803b5,offer completed,714,offer id,"fafdcd668e3743c1bb461111dcafc2a4, reward: 2"
306528,5ca2620962114246ab218fc648eb3934,transaction,714,amount,2.2
306529,b3a1272bc9904337b331bf348c3e8c17,transaction,714,amount,1.5899999999999999
306530,68213b08d99a4ae1b0dcb72aebd9aa35,transaction,714,amount,9.53
306531,a00058cf10334a308c68e7631c529907,transaction,714,amount,3.61
306532,76ddbd6576844afe811f1a3c0fbb5bec,transaction,714,amount,3.5300000000000002
306533,c02b10e8752c4d8e9b73f918558531f7,transaction,714,amount,4.05


Right now, `value_id` contains the offer id for each offer event, the amount paid for each transaction, and, for completed offers, the reward appended after the offer id. That information needs to be split up.

In [21]:
transcript = df_transcript.copy()

# Splitting reward values from the offer id where the offer was completed
completed = transcript['event'] == 'offer completed'
transcript.loc[completed, 'reward'] = (
    transcript.loc[completed, 'value_id']
    .str.split(',').str[1]   # e.g. " reward: 2"
    .str.split(':').str[1]   # e.g. " 2"
    .astype(int)
)
transcript['value_id'] = transcript['value_id'].str.split(',').str[0]

# Splitting amount from each transaction
is_transaction = transcript['event'] == 'transaction'
transcript.loc[is_transaction, 'money_spent'] = transcript.loc[is_transaction, 'value_id'].astype(float)

# Rows without a reward or a transaction amount get 0
transcript.fillna(0, inplace = True)

transcript.tail(10)

Unnamed: 0,person,event,time,key_id,value_id,reward,money_spent
306524,d613ca9c59dd42f497bdbf6178da54a7,transaction,714,amount,25.14,0.0,25.14
306525,eec70ab28af74a22a4aeb889c0317944,transaction,714,amount,43.58,0.0,43.58
306526,24f56b5e1849462093931b164eb803b5,transaction,714,amount,22.64,0.0,22.64
306527,24f56b5e1849462093931b164eb803b5,offer completed,714,offer id,fafdcd668e3743c1bb461111dcafc2a4,2.0,0.0
306528,5ca2620962114246ab218fc648eb3934,transaction,714,amount,2.2,0.0,2.2
306529,b3a1272bc9904337b331bf348c3e8c17,transaction,714,amount,1.5899999999999999,0.0,1.59
306530,68213b08d99a4ae1b0dcb72aebd9aa35,transaction,714,amount,9.53,0.0,9.53
306531,a00058cf10334a308c68e7631c529907,transaction,714,amount,3.61,0.0,3.61
306532,76ddbd6576844afe811f1a3c0fbb5bec,transaction,714,amount,3.5300000000000002,0.0,3.53
306533,c02b10e8752c4d8e9b73f918558531f7,transaction,714,amount,4.05,0.0,4.05
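The string splitting above works, but it depends on the exact punctuation of each `value` entry. As a hedged alternative, the raw strings could be parsed as Python dict literals with `ast.literal_eval`. In this sketch, `toy_transcript` and `parse_value` are hypothetical names, and the shape of the "offer completed" entry (an `offer_id` key plus a `reward` key) is inferred from the cleaned output above rather than confirmed against the raw file.

```python
import ast
import pandas as pd

# Toy stand-in for the raw transcript, before the 'value' column is split.
toy_transcript = pd.DataFrame({
    "event": ["offer received", "offer completed", "transaction"],
    "value": [
        "{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'}",
        "{'offer_id': 'fafdcd668e3743c1bb461111dcafc2a4', 'reward': 2}",
        "{'amount': 9.53}",
    ],
})

def parse_value(raw):
    """Parse one raw 'value' string into offer id, reward, and amount."""
    record = ast.literal_eval(raw)
    return {
        # The key appears both as 'offer id' and as 'offer_id'.
        "offer_id": record.get("offer id", record.get("offer_id")),
        "reward": record.get("reward", 0),
        "money_spent": record.get("amount", 0.0),
    }

# Build one row of parsed fields per transcript row, aligned on the index.
parsed = pd.DataFrame(
    [parse_value(raw) for raw in toy_transcript["value"]],
    index=toy_transcript.index,
)
toy_clean = pd.concat([toy_transcript.drop(columns="value"), parsed], axis=1)

print(toy_clean)
```

Parsing the dicts directly keeps the offer id, reward, and amount in separate columns from the start, so transactions never carry an amount in the offer id column.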


In [22]:
# No missing data
transcript.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   person       306534 non-null  object 
 1   event        306534 non-null  object 
 2   time         306534 non-null  int64  
 3   key_id       306534 non-null  object 
 4   value_id     306534 non-null  object 
 5   reward       306534 non-null  float64
 6   money_spent  306534 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 16.4+ MB


In [23]:
# Change time to hours_since_start
transcript = transcript.rename(columns = {'time' : 'hours_since_start'})
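If `hours_since_start` does measure hours, as the rename implies, a day-level column may be easier to read during EDA. A small sketch on made-up values; the frame `toy_times` and the column names `days_since_start` and `day_index` are hypothetical:

```python
import pandas as pd

# Toy hours, including 714, the last value shown in the transcript output.
toy_times = pd.DataFrame({"hours_since_start": [0, 6, 24, 714]})

# Fractional days since the start of the test period.
toy_times["days_since_start"] = toy_times["hours_since_start"] / 24

# Whole-day index (0 for the first day), using integer division.
toy_times["day_index"] = toy_times["hours_since_start"] // 24

print(toy_times)
```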

****

In [24]:
# Save cleaned data for part 2
df_offer.to_csv('data/cleaned_offer.csv', index = False)
df_customer.to_csv('data/cleaned_customer.csv', index = False)
transcript.to_csv('data/cleaned_transcript.csv', index = False)
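One thing to keep in mind for part 2: CSV files do not store dtypes, so a datetime column such as `became_member_on` comes back as plain text unless it is parsed again on load. The sketch below illustrates this with an in-memory buffer instead of a file; `toy_customer` and its values are made up for illustration.

```python
import io
import pandas as pd

# Toy stand-in for the cleaned customer data.
toy_customer = pd.DataFrame({
    "id": ["a", "b"],
    "became_member_on": pd.to_datetime(["2017-07-15", "2018-04-26"]),
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
toy_customer.to_csv(buffer, index=False)

# Reading without parse_dates leaves the column as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with parse_dates restores the datetime dtype.
buffer.seek(0)
typed = pd.read_csv(buffer, parse_dates=["became_member_on"])

print(plain["became_member_on"].dtype, typed["became_member_on"].dtype)
```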