This document describes the analyses conducted on data from Discover, the product recommendation app, and on sales data from Shopify.

The analysis uses a sample dataset, the CSV file "discover_user_input_results.csv".
This file was derived from the partial JSON extract that Jeremy provided earlier.

Partial outputs are shown where necessary.

*note to self: since this is a "partial" dataset, I will need to rerun this experiment with the full dataset*

In [6]:
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", 100)

discover_data = pd.read_csv("discover_user_input_results.csv")
discover_data.head()

Unnamed: 0.1,Unnamed: 0,email,sensitivity,skinType,timestamp,concern_0,concern_1,concern_2,concern_3,concern_4,concern_5,concern_6,concern_7,concern_8,concern_9,concern_10,concern_11,concern_12,concern_13,result_0,result_1,result_2,result_3,result_4,result_5,result_6,result_7,result_8,result_9
0,935,jeremy@paulaschoice.sg,False,Combination,"September 9th 2017, 1:56:08 pm",Dehydration,Sun Damage,Men,Combination,,,,,,,,,,,7830,7780,7660,7740,7860,1720,8740,2760,5800,7880
1,936,jeremy@paulaschoice.sg,False,Oily,"September 9th 2017, 1:58:21 pm",Acne,Wrinkles,PIH,Men,Oily,,,,,,,,,,1150,7670,8720,7740,7870,6130,8740,2760,5700,6240
2,937,ck1411@singnet.com.sg,False,Combination,"September 9th 2017, 2:38:56 pm",Enlarged Pores,Sun Damage,Wrinkles,Dehydration,Loss of Firmness,Dullness,Combination,,,,,,,,7830,7780,7820,7740,7870,7760,7690,2760,7900,7960
3,938,jeremy@paulaschoice.sg,False,Oily,"September 9th 2017, 2:51:12 pm",Clogged Pores,Redness,Uneven Texture,Enlarged Pores,Acne,Wrinkles,Dehydration,Sun Damage,PIH,Dullness,Loss of Firmness,Men,Oily,,7830,7670,8720,7740,7870,7800,8740,2760,5700,7730
4,939,starlites18@gmail.com,False,Combination,"September 9th 2017, 2:59:07 pm",Enlarged Pores,Clogged Pores,Acne,Sun Damage,Uneven Texture,Redness,Combination,,,,,,,,6002,1350,2010,7740,7870,6130,7690,2750,5700,7730


Build a table with one row per email, keeping the EARLIEST Discover attempt of each customer.

In [7]:
# Take an explicit copy so the column assignments below do not trigger SettingWithCopyWarning.
# keep='first' returns the earliest attempt only if the rows are already in chronological order.
discover_first = discover_data.drop_duplicates(subset=['email'], keep='first').copy()
# Strip ordinal suffixes (1st, 2nd, 3rd, 9th) only where they follow a digit, then parse.
discover_first['timestamp2'] = pd.to_datetime(
    discover_first['timestamp'].str.replace(r'(?<=\d)(st|nd|rd|th)', '', regex=True))
discover_first = discover_first.drop('timestamp', axis=1)
# discover_first.to_csv('discover_first.csv')
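As a minimal sketch of the "earliest attempt per email" step, the snippet below sorts by the parsed timestamp before dropping duplicates, so the result does not depend on the row order of the extract. The toy rows and the explicit `format` string are assumptions made for illustration; they are not taken from the real dataset.

```python
import pandas as pd

# Toy rows, deliberately out of chronological order (values are made up).
toy = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com"],
    "timestamp": [
        "September 22nd 2017, 1:56:08 pm",
        "October 1st 2017, 9:00:00 am",
        "September 9th 2017, 2:38:56 pm",
    ],
})

# Remove the ordinal suffix only when it directly follows a digit.
cleaned = toy["timestamp"].str.replace(r"(?<=\d)(st|nd|rd|th)", "", regex=True)
# Assumed format, matching strings such as "September 9 2017, 2:38:56 pm".
toy["timestamp2"] = pd.to_datetime(cleaned, format="%B %d %Y, %I:%M:%S %p")

# Sort first so that keep="first" really is the earliest attempt per email.
earliest = (
    toy.sort_values("timestamp2")
       .drop_duplicates(subset="email", keep="first")
       .reset_index(drop=True)
)
print(earliest[["email", "timestamp2"]])
```

Sorting costs one extra pass over the data, but it removes the hidden assumption that the extract is already in chronological order.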


### Sales Data
1. Import sales_data (dated 2018-02-07) and perform basic formatting.
2. Match the emails in the Discover dataset to the sales data, and create a new column called "discover_first_date" to indicate when the customer first used Discover.
    1. Note: doing this excludes all the customers who used Discover but DID NOT make a purchase.
3. Create a second new column ("used_discover_already") to indicate whether a transaction was made BEFORE or AFTER the customer first tried Discover.
    1. Note: customers who have not tried Discover *at all* but made a purchase will also show "Not yet".
    2. Perhaps those who have not tried Discover at all should have a third status ("Not at all")?
4. Create a third column ("discover_sales_lead_time") to indicate the time between the customer's first Discover attempt and the current transaction.
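The third status raised in the note under step 3 could be sketched as follows. This is only an illustration on made-up rows: the `orders` frame and its values are hypothetical, and `np.select` is one possible way to build the label, not necessarily the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical orders; NaT means the email never appears in the Discover dataset.
orders = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "c@example.com"],
    "Created at": pd.to_datetime(["2017-10-01", "2017-09-01", "2017-11-15"]),
    "discover_first_date": pd.to_datetime(["2017-09-20", "2017-09-20", None]),
})

never_used = orders["discover_first_date"].isna()
# Comparisons against NaT evaluate to False, so check never_used first.
after_discover = orders["Created at"] > orders["discover_first_date"]

# Conditions are tested in order; rows matching none of them get the default.
orders["used_discover_already"] = np.select(
    [never_used, after_discover],
    ["Not at all", "Used Discover"],
    default="Not yet",
)
print(orders[["Email", "used_discover_already"]])
```

Checking `never_used` before the date comparison is what separates "Not at all" from "Not yet"; with a plain `True`/`False` map both groups collapse into the same label.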


In [22]:

    
sales_data = pd.read_csv("shopify_orders_export_20180207.csv", 
                         low_memory=False, 
                         parse_dates=['Paid at', 'Fulfilled at', 'Created at'])

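# Drop the ten columns at positions -11 to -2 (the last column is kept).
sales_data_clean = sales_data.drop(sales_data.columns[-11:-1], axis=1)
# Rows without an email cannot be matched to Discover users.
sales_data_clean = sales_data_clean.dropna(subset=['Email'], axis=0)
# Look up each order's email in discover_first; emails not in the Discover dataset get NaT.
sales_data_clean['discover_first_date'] = sales_data_clean['Email'].map(discover_first.set_index('email')['timestamp2'])
# Comparisons against NaT are False, so customers who never used Discover are also labelled "Not yet".
sales_data_clean['used_discover_already'] = (sales_data_clean['Created at']> sales_data_clean['discover_first_date']).map({True: "Used Discover", False: "Not yet"})
sales_data_clean['discover_sales_lead_time'] = sales_data_clean['Created at'] - sales_data_clean['discover_first_date']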

pre_discover_sales = sales_data_clean[sales_data_clean['Created at']< "2017-09-09"]
post_discover_sales = sales_data_clean[sales_data_clean['Created at']>= "2017-09-09"]

# "Same day" here means the order was created within 24 hours after the customer's first Discover attempt.
same_day_purchase = sales_data_clean[(sales_data_clean['discover_sales_lead_time'] <= pd.Timedelta(days=1)) &
                                     (sales_data_clean['discover_sales_lead_time'] > pd.Timedelta(0))]
same_day_purchases_count = len(same_day_purchase['Email'].unique())

# The denominator is every unique email in the Discover dataset, whether or not the customer purchased.
post_discover_launch_customer_count = len(discover_first['email'].unique())
percentage_same_day_purchase = same_day_purchases_count/post_discover_launch_customer_count

print("same_day_purchases_count:", same_day_purchases_count)
print("post_discover_launch_customer_count:", post_discover_launch_customer_count)
print("Count of customers who made a purchase on the same day after using Discover:", same_day_purchases_count)
print("Number of unique customers who used Discover:", post_discover_launch_customer_count)
# The percent format multiplies the ratio by 100 before appending the % sign.
print("percentage_same_day_purchase: {0:.2%}".format(percentage_same_day_purchase))

same_day_purchases_count: 335
post_discover_launch_customer_count: 2422
Count of customers who made a purchase on the same day after using Discover: 335
Number of unique customers who used Discover: 2422
percentage_same_day_purchase: 13.83%
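The same-day share is a ratio of the two counts printed above (335 and 2422). A minimal sketch of the arithmetic, together with the 24-hour window on made-up lead times, is shown below. The `lead` series is hypothetical; only the two counts come from the output above.

```python
import pandas as pd

# Counts taken from the printed output above.
same_day_purchases_count = 335
discover_user_count = 2422

share = same_day_purchases_count / discover_user_count
# The percent format multiplies by 100 and appends the % sign.
formatted = "{0:.2%}".format(share)
print(formatted)

# Hypothetical lead times; NaT stands for customers who never used Discover.
lead = pd.Series(pd.to_timedelta(["-2 days", "3 hours", "1 days", "5 days", None]))
# Window used in this notebook: strictly after the Discover attempt, up to 24 hours later.
within_24h = (lead > pd.Timedelta(0)) & (lead <= pd.Timedelta(days=1))
print(within_24h.tolist())
```

Note that this window measures 24 hours from the Discover attempt, which is not the same as "same calendar day".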


## How long does it take for brand new customers to buy something after using Discover?
### First, identify customers who had not purchased before Discover launch

1. Create a new column that indicates whether this customer has made any purchase before the launch of Discover
    1. Note: this assumes that the 
2. Create a table containing Emails, First Tried Discover Date and whether they existed before 2017-09-09 (Discover Launch)
    1. Note: this EXCLUDES those customers who *have not tried Discover*, regardless of whether they have made a purchase before/after the launch of Discover

In [35]:
new_customers_test = pd.DataFrame(post_discover_sales['Email'].unique(), columns=["Post Launch Emails"])
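# Test the email column itself; isin on the whole DataFrame returns a DataFrame rather than a boolean Series.
new_customers_test['Exist before launch?'] = new_customers_test['Post Launch Emails'].isin(pre_discover_sales['Email'].unique())

# Inner merge: customers who never tried Discover drop out of the table here.
new_customers_test = new_customers_test.merge(right=discover_first[['email', 'timestamp2']], left_on='Post Launch Emails', right_on='email')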
new_customers_test = new_customers_test.drop('email', axis=1)
new_customers_test = new_customers_test.rename(columns={"timestamp2": "First Tried Discover"})
new_customers_test


Unnamed: 0,Post Launch Emails,Exist before launch?,First Tried Discover
0,jglyj82@gmail.com,False,2018-02-07 12:54:54
1,zarr.gyii@gmail.com,True,2017-09-26 03:45:14
2,ycobonpue@gmail.com,True,2017-10-31 06:06:57
3,karenkhor27@gmail.com,True,2017-09-13 02:03:15
4,boazruth76@yahoo.com,False,2018-01-24 09:57:32
5,bbggf.0901@gmail.com,True,2017-10-27 17:38:14
6,jeronblahblah@gmail.com,True,2017-09-17 05:58:14
7,ruienseah@gmail.com,False,2018-02-06 08:16:14
8,qingshuang111@gmail.com,False,2017-10-01 15:37:11
9,roseleenlua@gmail.com,True,2018-01-16 05:28:25


In [34]:
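# Get the first post-launch transaction date of each customer with a post-launch order.
# keep='last' returns the earliest order only if the export is sorted newest first.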
post_discover_first_purchase = post_discover_sales[['Email', 'Created at']].drop_duplicates(subset='Email', keep='last')
# print("Number of rows:", len(post_discover_first_purchase))
post_discover_first_purchase

Unnamed: 0,Email,Created at
4,gilly.glanville@me.com,2018-02-08 03:59:00
9,alyssacmy@gmail.com,2018-02-07 13:45:51
11,jglyj82@gmail.com,2018-02-07 13:02:33
18,wailing_93@hotmail.com,2018-02-07 11:25:32
21,jesse.dytioco@gmail.com,2018-02-07 11:20:49
25,veron_chinron@hotmail.com,2018-02-07 10:09:37
31,weizen.t@gmail.com,2018-02-07 08:02:53
32,lady_portia3012@yahoo.com.sg,2018-02-07 07:43:11
41,nisha@vstravel.com.sg,2018-02-07 05:24:51
49,hhshelley1@gmail.com,2018-02-07 04:00:22
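As an alternative sketch, the first purchase date per email can be computed with a group-wise minimum, which does not depend on the row order of the export. The `orders` frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical orders in arbitrary order.
orders = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "a@example.com"],
    "Created at": pd.to_datetime([
        "2017-12-01 10:00", "2017-10-05 09:30", "2017-09-15 08:00",
    ]),
})

# Earliest "Created at" per email, regardless of how the rows are sorted.
first_purchase = orders.groupby("Email", as_index=False)["Created at"].min()
print(first_purchase)
```

Unlike `drop_duplicates`, this returns a fresh index, so the original row labels of the export are not preserved.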


### To obtain the list of unique emails of NEW customers who did not take Discover

In [39]:


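post_discover_sales_unique = pd.DataFrame(post_discover_sales['Email'].unique(), columns=["Post Launch Emails"])
# Keep emails with no order before the launch. Note: emails in discover_first are not excluded here,
# so Discover users (for example jglyj82@gmail.com) still appear in the result.
is_new = ~post_discover_sales_unique['Post Launch Emails'].isin(pre_discover_sales['Email'].unique())
new_customers_no_discover = post_discover_sales_unique[is_new]
# rename returns a new DataFrame, so assign the result before displaying it.
new_customers_no_discover = new_customers_no_discover.rename(columns={"Post Launch Emails": "Post Launch non-Discover Takers"})
new_customers_no_discover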

Unnamed: 0,Post Launch non-Discover Takers
0,gilly.glanville@me.com
1,stephaniedata@yahoo.com
2,hsmeaton@hotmail.com
4,alyssacmy@gmail.com
5,jglyj82@gmail.com
8,wailing_93@hotmail.com
9,jesse.dytioco@gmail.com
14,weizen.t@gmail.com
18,serene818@hotmail.com
20,boazruth76@yahoo.com
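To restrict the list to new customers who also never took Discover, a second `isin` filter against the Discover emails would be needed. The sketch below shows the idea on made-up emails; `pre_launch` and `discover_emails` stand in for `pre_discover_sales['Email']` and `discover_first['email']`.

```python
import pandas as pd

# Hypothetical inputs standing in for the notebook's tables.
post_launch = pd.Series(
    ["a@example.com", "b@example.com", "c@example.com", "d@example.com"],
    name="Email",
)
pre_launch = pd.Series(["b@example.com"])        # bought before the launch
discover_emails = pd.Series(["c@example.com"])   # used Discover

is_new = ~post_launch.isin(pre_launch)
took_discover = post_launch.isin(discover_emails)

# New customers who do not appear in the Discover dataset.
new_customers_no_discover = (
    post_launch[is_new & ~took_discover]
    .rename("Post Launch non-Discover Takers")
    .to_frame()
)
print(new_customers_no_discover)
```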


## Lead time to purchase for new customers after taking Discover

In [40]:
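# Merge on "Post Launch Emails"; new_customers_test has no "Post Launch non-Discover Takers" column, which raised a KeyError.
new_customers_test = new_customers_test.merge(post_discover_first_purchase, left_on="Post Launch Emails", right_on="Email")
new_customers_test = new_customers_test.drop('Email', axis=1).rename(index=str, columns={"Created at": "First Purchase Date"})
# 'Time To Buy' was never created; assumed here to be first post-launch purchase minus first Discover attempt.
new_customers_test['Time To Buy'] = new_customers_test['First Purchase Date'] - new_customers_test['First Tried Discover']
print(new_customers_test['Time To Buy'].describe())
print(new_customers_test['Time To Buy'].quantile(0.85))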
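A rough sketch of the lead-time calculation for new customers only is given below. The rows are made up, and two choices are assumptions rather than the notebook's confirmed method: filtering on "Exist before launch?" and discarding purchases made before the first Discover attempt.

```python
import pandas as pd

# Hypothetical table with the same column names as new_customers_test.
customers = pd.DataFrame({
    "Post Launch Emails": ["a@example.com", "b@example.com", "c@example.com", "d@example.com"],
    "Exist before launch?": [False, True, False, False],
    "First Tried Discover": pd.to_datetime(["2017-10-01", "2017-10-10", "2017-11-10", "2017-12-01"]),
    "First Purchase Date": pd.to_datetime(["2017-10-03", "2017-10-11", "2017-11-05", "2017-12-05"]),
})

# Keep only customers with no purchase before the launch.
new_only = customers[~customers["Exist before launch?"]].copy()
new_only["Time To Buy"] = new_only["First Purchase Date"] - new_only["First Tried Discover"]

# Assumption: a negative lead time means the purchase came before Discover, so exclude it.
bought_after = new_only[new_only["Time To Buy"] > pd.Timedelta(0)]

print(bought_after["Time To Buy"].describe())
print(bought_after["Time To Buy"].quantile(0.85))
```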