This Jupyter Notebook describes the analyses conducted on data from Discover, the product recommendation app, and on sales data from Shopify.

### Methodology
1. Import and format Discover and Shopify sales data
2. Merge the details of each Discover taker's **earliest** attempt into the Shopify data
3. Identify NEW customers and match their earliest purchase with their earliest Discover attempt
4. Perform calculations, such as:
    1. Percentage of transactions that actually contain at least one of the recommended SKUs
    2. Average/median/min/max time taken to purchase after attempting Discover
    3. etc
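
Calculation 4.1 could be sketched as follows. This is a minimal illustration on made-up data, not the notebook's actual implementation: the `recommended_skus` column is a hypothetical name for whatever field holds the Discover recommendations, and the `Name` and `Lineitem sku` columns are assumed to be present in the Shopify export.

```python
import pandas as pd

# Hypothetical data: the SKUs that Discover recommended to each customer.
recommended = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "recommended_skus": [{"SKU-1", "SKU-2"}, {"SKU-3"}],
})

# Hypothetical order lines: one row per line item, as in an order export.
order_lines = pd.DataFrame({
    "Name": ["#1001", "#1001", "#1002"],
    "Email": ["a@example.com", "a@example.com", "b@example.com"],
    "Lineitem sku": ["SKU-2", "SKU-9", "SKU-7"],
})

# Collect the set of SKUs bought in each order.
orders = (
    order_lines.groupby(["Name", "Email"])["Lineitem sku"]
    .apply(set)
    .reset_index(name="purchased_skus")
)

# Attach each customer's recommendations, then flag orders where the
# purchased and recommended sets overlap.
orders = orders.merge(recommended, left_on="Email", right_on="email", how="inner")
orders["has_recommended_sku"] = [
    bool(bought & rec)
    for bought, rec in zip(orders["purchased_skus"], orders["recommended_skus"])
]

# Share of orders containing at least one recommended SKU (a fraction in [0, 1]).
share_with_recommended = orders["has_recommended_sku"].mean()
print("{0:.2%}".format(share_with_recommended))
```

The inner merge restricts the calculation to orders placed by Discover takers; whether that is the right denominator depends on the question being asked.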
    
Assumptions involved:
1. 

In [13]:
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", 100)

'''
A sample dataset, the CSV file "discover_user_input_results.csv", is used in this analysis.
The CSV file was derived from the partial JSON extract provided earlier by Jeremy.

Partial outputs are provided where necessary.

*Note to self: since this is a "partial" dataset, this experiment needs to be rerun with the full dataset.*
'''

discover_data = pd.read_csv("discover_user_input_results.csv")

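# Populate a table of emails, keeping the EARLIEST Discover attempt of each customer.
# Note: keep='first' only returns the earliest attempt if the CSV rows are already in chronological order.

discover_first = discover_data.drop_duplicates(subset=['email'], keep='first').copy()
# Strip the ordinal suffix "th" so the timestamp can be parsed (other suffixes such as "st"/"nd"/"rd" are not handled here).
discover_first['timestamp2'] = discover_first['timestamp'].map(lambda x: pd.to_datetime(x.replace('th', "")))
discover_first.drop('timestamp', axis=1, inplace=True)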
# discover_first.to_csv('discover_first.csv')
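
A more defensive variant of the step above might parse the timestamps first and sort before de-duplicating, so the result does not depend on the row order of the CSV. The sketch below uses made-up rows, and the timestamp format shown is an assumption, since the real format of the `timestamp` column is not visible in this notebook.

```python
import re
import pandas as pd

# Hypothetical rows; the timestamp format here is an assumption.
attempts = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com"],
    "timestamp": [
        "22nd Oct 2017 10:00:00",
        "1st Sep 2017 08:30:00",
        "3rd Oct 2017 09:15:00",
    ],
})

# Remove an ordinal suffix only when it directly follows a digit,
# so that other occurrences of "st", "nd", "rd" or "th" are left alone.
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b")

attempts["attempted_at"] = pd.to_datetime(
    attempts["timestamp"].map(lambda s: ORDINAL_SUFFIX.sub("", s)),
    format="%d %b %Y %H:%M:%S",
)

# Sort chronologically, then keep the first (earliest) row per email.
earliest_attempts = (
    attempts.sort_values("attempted_at")
    .drop_duplicates(subset="email", keep="first")
    .drop(columns="timestamp")
)
print(earliest_attempts)
```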

In [5]:
''' 
Sales Data
1. Import sales_data (dated 2018-02-07) and perform basic formatting.
2. Match the emails in the Discover dataset to the sales data, and create a new column called "discover_first_date" to indicate when the customer first used Discover.
    Note: doing this excludes all the customers who used Discover but DID NOT make a purchase.
3. Create a second new column to indicate whether a transaction was made BEFORE/AFTER the customer tried Discover.
    Note: customers who have not tried Discover *at all* but made a purchase will also show "Not yet".
    Perhaps those who have not tried Discover at all should have a third status ("Not at all")?
4. Create a third column to indicate the time elapsed between the customer's first Discover attempt and the current transaction.
'''
    
sales_data = pd.read_csv("shopify_orders_export_20180207.csv", 
                         low_memory=False, 
                         parse_dates=['Paid at', 'Fulfilled at', 'Created at'])

sales_data_clean = sales_data.drop(sales_data.columns.to_series()[-11:-1], axis=1)
sales_data_clean.dropna(subset=['Email'], axis=0, inplace=True)
sales_data_clean['discover_first_date'] = sales_data_clean['Email'].map(discover_first.set_index('email')['timestamp2'])
sales_data_clean['used_discover_already'] = (sales_data_clean['Created at']> sales_data_clean['discover_first_date']).map({True: "Used Discover", False: "Not yet"})
sales_data_clean['discover_sales_lead_time'] = sales_data_clean['Created at'] - sales_data_clean['discover_first_date']

pre_discover_sales = sales_data_clean[sales_data_clean['Created at']< "2017-09-09"]
post_discover_sales = sales_data_clean[sales_data_clean['Created at']>= "2017-09-09"]

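# "Same day" here means a purchase within 24 hours AFTER the first Discover attempt.
same_day_purchase = sales_data_clean[(sales_data_clean['discover_sales_lead_time'] <= pd.Timedelta("1 days")) &
                                     (sales_data_clean['discover_sales_lead_time'] > pd.Timedelta("0 days"))]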
same_day_purchases_count = len(same_day_purchase['Email'].unique())

post_discover_launch_customer_count = len(discover_first['email'].unique())
percentage_same_day_purchase = same_day_purchases_count/post_discover_launch_customer_count

print("same_day_purchases_count:", same_day_purchases_count)
print("post_discover_launch_customer_count:", post_discover_launch_customer_count)
print("Count of customers who made a purchase on the same day after using Discover:",len(same_day_purchase['Email'].unique()))
print("Number of unique customers after the launch of Discover:", len(discover_first['email'].unique()))
print("percentage_same_day_purchase: {0:.2%}".format(percentage_same_day_purchase))

same_day_purchases_count: 335
post_discover_launch_customer_count: 2422
Count of customers who made a purchase on the same day after using Discover: 335
Number of unique customers after the launch of Discover: 2422
percentage_same_day_purchase: 13.83%
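
The third status ("Not at all") raised in the notes of this cell could be added along these lines. This is a sketch on made-up rows rather than the notebook's actual code; the `discover_status` column name is hypothetical, and the other column names mirror the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical orders; a missing discover_first_date means the customer never used Discover.
orders = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "c@example.com"],
    "Created at": pd.to_datetime(["2017-10-01", "2017-09-10", "2017-11-05"]),
    "discover_first_date": pd.to_datetime(["2017-09-20", "2017-09-15", None]),
})

never_used = orders["discover_first_date"].isna()
used_before_order = orders["Created at"] > orders["discover_first_date"]

# np.select takes the first matching condition, so "Not at all" is checked
# before the date comparison (which is False for missing dates).
orders["discover_status"] = np.select(
    [never_used, used_before_order],
    ["Not at all", "Used Discover"],
    default="Not yet",
)
print(orders[["Email", "discover_status"]])
```

Keeping the "never used" check first avoids lumping customers with no Discover record into "Not yet".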


## How long does it take for brand new customers to buy something after using Discover?
### First, identify customers who had not purchased before Discover launch

1. Create a new column that indicates whether this customer has made any purchase before the launch of Discover
    1. Note: this assumes that the 
2. Create a table containing Emails, First Tried Discover Date and whether they existed before 2017-09-09 (Discover Launch)
    1. Note: this EXCLUDES those customers who *have not tried Discover*, regardless of whether they have made a purchase before/after the launch of Discover

In [4]:
new_customers_test = pd.DataFrame(post_discover_sales['Email'].unique(), columns=["Post Launch Emails"])
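# Use Series.isin on the email column so that the result is a boolean Series rather than a DataFrame.
new_customers_test['Exist before launch?'] = new_customers_test['Post Launch Emails'].isin(pre_discover_sales['Email'].unique())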

new_customers_test = new_customers_test.merge(right=discover_first[['email', 'timestamp2']], left_on='Post Launch Emails', right_on='email')
new_customers_test = new_customers_test.drop('email', axis=1)
new_customers_test = new_customers_test.rename(columns={"timestamp2": "First Tried Discover"})

In [5]:
# Get the first post-launch transaction date of each customer.
# Note: keep='last' returns the earliest order only if the export lists orders newest first.
post_discover_first_purchase = post_discover_sales[['Email', 'Created at']].drop_duplicates(subset='Email', keep='last')
# print("Number of rows:", len(post_discover_first_purchase))
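
An order-independent way to get each customer's first purchase is to take the minimum `Created at` per email. The sketch below uses made-up rows and is offered as an alternative under that assumption, not as the method used in this notebook.

```python
import pandas as pd

# Hypothetical orders, deliberately not sorted by date.
orders = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "a@example.com"],
    "Created at": pd.to_datetime([
        "2017-10-21 10:10:01",
        "2017-09-18 07:12:43",
        "2017-09-26 05:10:37",
    ]),
})

# The earliest order per customer, regardless of row order.
first_purchase = (
    orders.groupby("Email", as_index=False)["Created at"].min()
    .rename(columns={"Created at": "First Purchase Date"})
)
print(first_purchase)
```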

### To obtain the list of unique emails of NEW customers who did not take Discover

In [6]:


post_discover_sales_unique = pd.DataFrame(post_discover_sales['Email'].unique(), columns=["Post Launch Emails"])
# Keep only emails with no purchase before the launch of Discover.
# Note: this does not yet exclude customers who took Discover, so Discover takers can still appear in the output.
is_new_customer = ~post_discover_sales_unique['Post Launch Emails'].isin(pre_discover_sales['Email'].unique())
new_customers_no_discover = post_discover_sales_unique[is_new_customer]
new_customers_no_discover.rename(columns={"Post Launch Emails": "Post Launch non-Discover Takers"})

Unnamed: 0,Post Launch non-Discover Takers
0,gilly.glanville@me.com
1,stephaniedata@yahoo.com
2,hsmeaton@hotmail.com
4,alyssacmy@gmail.com
5,jglyj82@gmail.com
8,wailing_93@hotmail.com
9,jesse.dytioco@gmail.com
14,weizen.t@gmail.com
18,serene818@hotmail.com
20,boazruth76@yahoo.com
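
Some of the emails listed above (for example `jglyj82@gmail.com` and `boazruth76@yahoo.com`) also appear in the table of Discover takers, which suggests that the filter only removes pre-launch customers. A minimal sketch of a filter that also excludes Discover takers, using made-up emails and hypothetical variable names, might look like this:

```python
import pandas as pd

# Hypothetical email lists standing in for the notebook's real data.
post_launch_emails = pd.Series(
    ["a@example.com", "b@example.com", "c@example.com", "d@example.com"]
)
pre_launch_emails = pd.Series(["b@example.com"])   # purchased before the launch
discover_emails = pd.Series(["c@example.com"])     # took Discover at least once

is_new = ~post_launch_emails.isin(pre_launch_emails)
took_discover = post_launch_emails.isin(discover_emails)

# New customers who never took Discover.
new_non_discover = post_launch_emails[is_new & ~took_discover]
print(new_non_discover.tolist())
```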


In [14]:
new_customers_test

Unnamed: 0,Post Launch Emails,Exist before launch?,First Tried Discover,First Purchase Date
0,jglyj82@gmail.com,False,2018-02-07 12:54:54,2018-02-07 13:02:33
1,zarr.gyii@gmail.com,True,2017-09-26 03:45:14,2017-09-26 05:10:37
2,ycobonpue@gmail.com,True,2017-10-31 06:06:57,2017-09-19 06:46:41
3,karenkhor27@gmail.com,True,2017-09-13 02:03:15,2017-09-18 07:12:43
4,boazruth76@yahoo.com,False,2018-01-24 09:57:32,2018-01-11 04:18:16
5,bbggf.0901@gmail.com,True,2017-10-27 17:38:14,2017-10-28 16:57:08
6,jeronblahblah@gmail.com,True,2017-09-17 05:58:14,2017-10-21 10:10:01
7,ruienseah@gmail.com,False,2018-02-06 08:16:14,2018-02-06 08:29:58
8,qingshuang111@gmail.com,False,2017-10-01 15:37:11,2017-10-07 07:22:03
9,roseleenlua@gmail.com,True,2018-01-16 05:28:25,2018-01-17 05:26:42


## Lead time to purchase for new customers after taking Discover

In [15]:
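# Drop any "First Purchase Date" column left over from an earlier run of this cell;
# merging twice duplicates that column name, which is the likely cause of the ValueError below.
new_customers_test = new_customers_test.drop(columns='First Purchase Date', errors='ignore')
new_customers_test = new_customers_test.merge(post_discover_first_purchase, left_on="Post Launch Emails", right_on="Email")
new_customers_test = new_customers_test.drop('Email', axis=1).rename(columns={"Created at": "First Purchase Date"})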
new_customers_test['Time To Buy'] = new_customers_test['First Purchase Date']- new_customers_test['First Tried Discover']
print(new_customers_test['Time To Buy'].describe())
print(new_customers_test['Time To Buy'].quantile(0.85))

ValueError: cannot reindex from a duplicate axis
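
The `ValueError` is consistent with the merge cell having been run more than once: the output of `In [14]` already shows a `First Purchase Date` column, so a further merge and rename would create a second column with the same name. The sketch below reproduces that situation on made-up data; `add_first_purchase` is a hypothetical helper that mirrors the merge, drop and rename steps, and the diagnosis itself is an inference rather than something confirmed by the notebook.

```python
import pandas as pd

customers = pd.DataFrame({"Post Launch Emails": ["a@example.com"]})
first_purchase = pd.DataFrame({
    "Email": ["a@example.com"],
    "Created at": pd.to_datetime(["2017-09-26 05:10:37"]),
})


def add_first_purchase(frame):
    """Merge in the first purchase date, mirroring the merge/drop/rename steps."""
    merged = frame.merge(first_purchase, left_on="Post Launch Emails", right_on="Email")
    return merged.drop("Email", axis=1).rename(columns={"Created at": "First Purchase Date"})


once = add_first_purchase(customers)
twice = add_first_purchase(once)  # simulates re-running the cell

# A quick check for duplicated column names before doing column arithmetic.
print(once.columns.is_unique)
print(twice.columns.is_unique)
duplicated_names = twice.columns[twice.columns.duplicated()].tolist()
print(duplicated_names)
```

With duplicated names, selecting `twice["First Purchase Date"]` returns a two-column DataFrame instead of a Series, so the subtraction that computes `Time To Buy` no longer behaves as intended. Checking `columns.is_unique` before the subtraction, or restarting the kernel and running the cells in order, are two ways to rule this out.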