#### Problem Overview:
Johnny’s Jeans is a fashion start-up and a new customer of ActionIQ. The first campaign they want to create on our platform is a Thank-You message to all their first customers. They want to target their first 10,000 customers and send them some stats about their first year of sales.

The Johnny’s Jeans finance team reported that the first 10,000 customers spent a collective $33,904,525 on their site. 

However, when the marketing team summed total revenue on the AIQ platform, the result was lower, at $29,162,044. The Johnny’s Jeans marketing team was concerned about this data discrepancy and has reached out to AIQ.

#### Background of our integration with Johnny’s Jeans:
1. Every few weeks, the Johnny’s Jeans data warehouse exports the most recent rows from its user-summary table. This data is encrypted and dropped into an AWS S3 bucket.

2. AIQ’s server is designed to poll that S3 bucket every 24 hours; when a new file is dropped, AIQ’s system is supposed to retrieve it for decryption.

3. The decrypted user-summary data is in a CSV with these fields:

    a. user_id
    b. total_spend (total this user has spent at Johnny’s Jeans)
    c. count_saved_items (total number of bookmarked items)
    d. loyalty_credits (amount of loyalty points the user can spend)
    e. batch_id
    
4. AIQ’s data pipeline is designed to read through all the data files and de-duplicate the data by choosing the largest batch_id for each user. This ensures that AIQ should only be loading 1 row per user.

5. Once that data is de-duplicated, it is loaded into AIQ’s platform and is ready for users to start running queries.
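
The de-duplication rule in step 4 can be sketched in pandas as follows. This is only an illustration of the stated rule (keep the row with the largest batch_id per user), not AIQ’s actual pipeline code. The helper name `dedupe_latest` is hypothetical, and the sample rows are a handful copied from the delta files supplied with this exercise.

```python
import pandas as pd

COLS = ["user_id", "total_spend", "count_saved_items", "loyalty_credits", "batch_id"]

# Tiny sample: a few rows copied from delta0, delta3 and delta4 (illustration only).
delta0 = pd.DataFrame([[1, 1119, 34, 109, 1536], [5, 70, 9, 154, 6]], columns=COLS)
delta3 = pd.DataFrame([[1, 4070, 10, 179, 4940]], columns=COLS)
delta4 = pd.DataFrame([[5, 6755, 3, 360, 7785]], columns=COLS)


def dedupe_latest(frames):
    """Hypothetical helper: stack all files, keep the max-batch_id row per user."""
    all_rows = pd.concat(frames, ignore_index=True)
    # idxmax returns, for each user_id, the index label of the row with the largest batch_id.
    latest_idx = all_rows.groupby("user_id")["batch_id"].idxmax()
    return all_rows.loc[latest_idx].sort_values("user_id").reset_index(drop=True)


deduped = dedupe_latest([delta0, delta3, delta4])
print(deduped)
```

If the pipeline behaves as described, every user should end up with exactly one row, and that row should come from the most recent file in which the user appears.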

#### Task:

1. Figure out, as best you can, what the problem is. Bear in mind that, as with any software engineering effort, AIQ is not immune to bugs and things may not work exactly as expected.
2. You are writing an email to the AIQ backend engineer describing what you’ve found and what you think the issue here is. Please provide as much information and/or examples as possible.
3. Michael Greene, Johnny’s Jeans’ email marketing director, emailed us asking why our data doesn’t seem to match theirs. Write a response for him.

#### Task Notes:
1. We have provided 5 of the decrypted files that were sent to us by Johnny’s Jeans.

    a. Delta0.csv was their initial base set of data.
    b. Deltas 1-4 were their update files.
2. We have also sent you AIQ’s version of their user-summary table (AIQ-user-summary.csv). Johnny’s Jeans has exported the table on their end, and we’ve sent you that as well (Johnny-user-summary.csv).


In [1]:
import numpy as np
import pandas as pd

In [2]:
AIQ_User_Summary = pd.read_csv('/Users/ir3n3br4t515/Desktop/AIQ-user-summary.csv')
J_User_Summary = pd.read_csv('/Users/ir3n3br4t515/Desktop/Johnny-user-summary.csv')
D_0 = pd.read_csv('/Users/ir3n3br4t515/Desktop/delta0.csv')
D_1 = pd.read_csv('/Users/ir3n3br4t515/Desktop/delta1.csv')
D_2 = pd.read_csv('/Users/ir3n3br4t515/Desktop/delta2.csv')
D_3 = pd.read_csv('/Users/ir3n3br4t515/Desktop/delta3.csv')
D_4 = pd.read_csv('/Users/ir3n3br4t515/Desktop/delta4.csv')




In [3]:
AIQ_User_Summary.head()

Unnamed: 0,user_id,total_spend,count_saved_items,loyalty_credits,batch_id
0,5988,3500,12,517,3164
1,5989,775,17,613,298
2,5982,1749,8,386,1489
3,5983,3061,7,905,4271
4,5980,3852,48,938,3496


In [4]:
AIQ_User_Summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
user_id              10000 non-null int64
total_spend          10000 non-null int64
count_saved_items    10000 non-null int64
loyalty_credits      10000 non-null int64
batch_id             10000 non-null int64
dtypes: int64(5)
memory usage: 390.8 KB


In [5]:
AIQ_Total = AIQ_User_Summary['total_spend'].sum()

print (AIQ_Total)

29162044


In [6]:
J_User_Summary.head()

Unnamed: 0,user_id,total_spend,count_saved_items,loyalty_credits,batch_id
0,5988,4208,20,287,5731
1,5989,775,17,613,298
2,5982,1749,8,386,1489
3,5983,4537,17,151,4623
4,5980,3852,48,938,3496


In [7]:
J_User_Summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
user_id              10000 non-null int64
total_spend          10000 non-null int64
count_saved_items    10000 non-null int64
loyalty_credits      10000 non-null int64
batch_id             10000 non-null int64
dtypes: int64(5)
memory usage: 390.8 KB


In [8]:
J_Total = J_User_Summary['total_spend'].sum()

print (J_Total)

33904525


In [9]:
J_Total - AIQ_Total

4742481

In [10]:

J_User_Summary['total_spend'].equals(AIQ_User_Summary['total_spend'])



False
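
`equals` only confirms that the two columns differ; it does not say which users differ or by how much. One way to get a per-user view is to join the two tables on user_id and compare batch_id and total_spend side by side. The sketch below assumes each table holds one row per user. The sample rows are copied from the two `head()` outputs above, and the `_johnny` / `_aiq` suffixes are naming choices made for this illustration.

```python
import pandas as pd

COLS = ["user_id", "total_spend", "count_saved_items", "loyalty_credits", "batch_id"]

# Three rows from each summary table, copied from the head() outputs (illustration only).
aiq = pd.DataFrame(
    [[5988, 3500, 12, 517, 3164], [5989, 775, 17, 613, 298], [5983, 3061, 7, 905, 4271]],
    columns=COLS,
)
johnny = pd.DataFrame(
    [[5988, 4208, 20, 287, 5731], [5989, 775, 17, 613, 298], [5983, 4537, 17, 151, 4623]],
    columns=COLS,
)

# Join on user_id so each user is compared with their own row in the other table.
merged = johnny.merge(aiq, on="user_id", suffixes=("_johnny", "_aiq"), validate="one_to_one")
merged["spend_gap"] = merged["total_spend_johnny"] - merged["total_spend_aiq"]

# Users whose batch_id differs are candidates for a missed or stale update.
mismatched = merged[merged["batch_id_johnny"] != merged["batch_id_aiq"]]
print(mismatched[["user_id", "batch_id_johnny", "batch_id_aiq", "spend_gap"]])
print(mismatched["spend_gap"].sum())
```

Run against the full tables, the sum of `spend_gap` over the mismatched users should account for the whole revenue difference if stale rows are the only cause.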

In [11]:
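# Caveat: isin() tests whether a spend value appears anywhere in the other column;
# it is not a per-user comparison, so the rows selected here are only a rough signal.
J_User_Summary[~AIQ_User_Summary['total_spend'].isin(J_User_Summary['total_spend'])].dropna()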

Unnamed: 0,user_id,total_spend,count_saved_items,loyalty_credits,batch_id
18,6791,4230,5,486,5893
45,4644,897,20,870,660
70,91,2204,7,447,2250
78,1376,324,48,261,171
91,7189,6383,18,707,7659
...,...,...,...,...,...
9832,3768,4269,6,976,5277
9867,5155,4383,32,313,5315
9887,7274,1007,43,693,77
9946,1741,845,19,700,73


In [12]:
difference = J_User_Summary[~AIQ_User_Summary['total_spend'].isin(J_User_Summary['total_spend'])].dropna()

In [13]:
difference_total = difference['total_spend'].sum()

print (difference_total)

1399958


In [14]:
df1 = J_User_Summary
df2 = AIQ_User_Summary

In [15]:
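# Note: DataFrame subtraction aligns on index labels; a label present in only one operand yields a NaN row.
df2[~df1['total_spend'].isin(df2['total_spend'])].dropna() - df1[~df2['total_spend'].isin(df1['total_spend'])].dropna()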



Unnamed: 0,user_id,total_spend,count_saved_items,loyalty_credits,batch_id
0,,,,,
3,,,,,
11,,,,,
15,,,,,
17,,,,,
...,...,...,...,...,...
9973,,,,,
9977,,,,,
9983,,,,,
9984,,,,,


In [16]:
D0_Total = D_0['total_spend'].sum()

print (D0_Total)

9993346


In [17]:
D_0.head()

Unnamed: 0,user_id,total_spend,count_saved_items,loyalty_credits,batch_id
0,1,1119,34,109,1536
1,2,271,4,305,1223
2,3,390,15,113,105
3,4,1689,39,690,1042
4,5,70,9,154,6


In [18]:
D_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
user_id              10000 non-null int64
total_spend          10000 non-null int64
count_saved_items    10000 non-null int64
loyalty_credits      10000 non-null int64
batch_id             10000 non-null int64
dtypes: int64(5)
memory usage: 390.8 KB


In [19]:
D0_Total = D_0['total_spend'].sum()

print (D0_Total)

9993346


In [20]:
D_1.head()

Unnamed: 0,user_id,total_spend,count_saved_items,loyalty_credits,batch_id
0,8195,2662,36,263,2948
1,8196,2804,19,156,2071
2,10,2862,48,660,2019
3,8203,2231,49,866,2231
4,8206,2731,23,446,2367


In [21]:
D_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2494 entries, 0 to 2493
Data columns (total 5 columns):
user_id              2494 non-null int64
total_spend          2494 non-null int64
count_saved_items    2494 non-null int64
loyalty_credits      2494 non-null int64
batch_id             2494 non-null int64
dtypes: int64(5)
memory usage: 97.5 KB


In [22]:
D1_Total = D_1['total_spend'].sum()

print (D1_Total)

6212911


In [23]:
D_2.head()

Unnamed: 0,user_id,total_spend,count_saved_items,loyalty_credits,batch_id
0,8194,3169,14,123,3456
1,6,3907,11,659,3202
2,7,3561,16,694,3157
3,8,3340,17,289,3489
4,11,3626,41,454,4375


In [24]:
D_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2519 entries, 0 to 2518
Data columns (total 5 columns):
user_id              2519 non-null int64
total_spend          2519 non-null int64
count_saved_items    2519 non-null int64
loyalty_credits      2519 non-null int64
batch_id             2519 non-null int64
dtypes: int64(5)
memory usage: 98.5 KB


In [25]:
D2_Total = D_2['total_spend'].sum()

print (D2_Total)

8820088


In [26]:
D_3.head()

Unnamed: 0,user_id,total_spend,count_saved_items,loyalty_credits,batch_id
0,1,4070,10,179,4940
1,3,4558,43,736,4510
2,8199,4663,8,728,4996
3,8,4991,24,480,5107
4,9,4379,16,148,4673


In [27]:
D_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2473 entries, 0 to 2472
Data columns (total 5 columns):
user_id              2473 non-null int64
total_spend          2473 non-null int64
count_saved_items    2473 non-null int64
loyalty_credits      2473 non-null int64
batch_id             2473 non-null int64
dtypes: int64(5)
memory usage: 96.7 KB


In [28]:
D3_Total = D_3['total_spend'].sum()

print (D3_Total)

11131828


In [29]:
D_4.head()

Unnamed: 0,user_id,total_spend,count_saved_items,loyalty_credits,batch_id
0,8194,5105,40,402,6355
1,5,6755,3,360,7785
2,8198,6331,19,841,7210
3,8,6186,39,947,6789
4,10,5498,39,821,7966


In [30]:
D_4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2483 entries, 0 to 2482
Data columns (total 5 columns):
user_id              2483 non-null int64
total_spend          2483 non-null int64
count_saved_items    2483 non-null int64
loyalty_credits      2483 non-null int64
batch_id             2483 non-null int64
dtypes: int64(5)
memory usage: 97.1 KB


In [31]:
D4_Total = D_4['total_spend'].sum()

print (D4_Total)

14897837
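
The per-file totals alone do not show which updates failed to reach the platform. A possible next step is to tag every delta row with its source file, work out the row each user should have (largest batch_id), and compare that with what was actually loaded. The sketch below is one way to do this. The helper `find_stale_users` is hypothetical, the delta rows are copied from the `head()` outputs above, and the `loaded` table is a HYPOTHETICAL stand-in for the AIQ table, constructed so that one user is stale.

```python
import pandas as pd

COLS = ["user_id", "total_spend", "count_saved_items", "loyalty_credits", "batch_id"]

# Rows for users 1 and 8, copied from the delta head() outputs (illustration only).
deltas = {
    "delta0": pd.DataFrame([[1, 1119, 34, 109, 1536]], columns=COLS),
    "delta2": pd.DataFrame([[8, 3340, 17, 289, 3489]], columns=COLS),
    "delta3": pd.DataFrame([[1, 4070, 10, 179, 4940], [8, 4991, 24, 480, 5107]], columns=COLS),
    "delta4": pd.DataFrame([[8, 6186, 39, 947, 6789]], columns=COLS),
}

# HYPOTHETICAL loaded table: user 1 is current, user 8 is stuck on the delta2 row.
loaded = pd.DataFrame([[1, 4070, 10, 179, 4940], [8, 3340, 17, 289, 3489]], columns=COLS)


def find_stale_users(deltas, loaded):
    """Hypothetical helper: list users whose loaded batch_id is not their latest one."""
    tagged = pd.concat(
        [df.assign(source_file=name) for name, df in deltas.items()], ignore_index=True
    )
    # Expected row per user: the one with the largest batch_id across all files.
    expected = tagged.loc[tagged.groupby("user_id")["batch_id"].idxmax()]
    check = expected.merge(
        loaded[["user_id", "batch_id", "total_spend"]],
        on="user_id",
        suffixes=("_expected", "_loaded"),
    )
    stale = check[check["batch_id_expected"] != check["batch_id_loaded"]]
    return stale[
        ["user_id", "source_file", "batch_id_expected", "batch_id_loaded",
         "total_spend_expected", "total_spend_loaded"]
    ]


stale = find_stale_users(deltas, loaded)
print(stale)
print(stale["source_file"].value_counts())
```

On the real data, the `value_counts()` breakdown would show whether the stale users cluster in one delta file, which would point to a file that was skipped or only partly loaded, or are spread across files, which would point to the de-duplication step itself.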
