# Feature Engineering "transactions"
<img src="https://cdn2.iconfinder.com/data/icons/webstore/512/dollar_money_bag-512.png" border="1" alt="Dataframe transactions" width="200" height="150">

This Python Notebook generates features for "transactions_v2.csv" and exports a file into a folder called data "final_transactions". 
I will do this NEW file as an improved version over previous ones in which:
- Some features have been removed after checking that they did not improve the score (maybe due to **overfitting**).
- New added features.
In the notebook "churn_prediction_vX" then I will load "final_transactions" to create the prediction.

In [1]:
#Import the relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import mpld3
import seaborn as sns
import matplotlib.dates as mdates
import time
import datetime
from datetime import datetime as dt
from pandas.lib import Timestamp

#Configure Panda
pd.options.display.width = 200

  # This is added back by InteractiveShellApp.init_path()


## 1. Data import and Feature Engineering

In [2]:
#Load transactions dataframe
transactions = pd.read_csv("data/transactions_v2.csv")

In [3]:
#Look at the first values in transactions:
transactions.head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
0,++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134=,32,90,298,298,0,20170131,20170504,0
1,++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA=,41,30,149,149,1,20150809,20190412,0
2,+/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc=,36,30,180,180,1,20170303,20170422,0
3,+/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=,36,30,180,180,1,20170329,20170331,1
4,+00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c=,41,30,99,99,1,20170323,20170423,0


In [4]:
#Look at number of unique values in each feature:
print(transactions.nunique())
print('')
print('Number of rows & columns "TRANSACTIONS": ', transactions.shape)

msno                      1197050
payment_method_id              37
payment_plan_days              31
plan_list_price                48
actual_amount_paid             53
is_auto_renew                   2
transaction_date              820
membership_expire_date       1960
is_cancel                       2
dtype: int64

Number of rows & columns "TRANSACTIONS":  (1431009, 9)


<font color='red'>There are 1.43M transactions logs. As you can observe there are around 1.2M unique IDs (users), so some users have multiple transactions in the dataframe given to us.</font>
Next, I will:
- Change dates to _dtype_ to be able to operate with them.

### 1.1. Change dates to _dype_
It takes around 1 min

In [5]:
transactions['transaction_date_dtype'] = transactions.transaction_date.apply(lambda x: dt.strptime(str(int(x)), "%Y%m%d").date() if pd.notnull(x) else "NAN" )
transactions['membership_expire_date_dtype'] = transactions.membership_expire_date.apply(lambda x: dt.strptime(str(int(x)), "%Y%m%d").date() if pd.notnull(x) else "NAN" )
transactions.head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,transaction_date_dtype,membership_expire_date_dtype
0,++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134=,32,90,298,298,0,20170131,20170504,0,2017-01-31,2017-05-04
1,++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA=,41,30,149,149,1,20150809,20190412,0,2015-08-09,2019-04-12
2,+/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc=,36,30,180,180,1,20170303,20170422,0,2017-03-03,2017-04-22
3,+/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=,36,30,180,180,1,20170329,20170331,1,2017-03-29,2017-03-31
4,+00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c=,41,30,99,99,1,20170323,20170423,0,2017-03-23,2017-04-23


### 1.2. Get a dataframe with only _msno_ (ID) and unique values

In [6]:
msno_aux = pd.unique(transactions[['msno']].values.ravel('K'))
msno_aux = {'msno': msno_aux}
#print(len(msno_aux))

msno_unique_values = pd.DataFrame(data = msno_aux)
print(msno_unique_values.head())
print('')
print(msno_unique_values.shape)

                                           msno
0  ++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134=
1  ++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA=
2  +/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc=
3  +/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=
4  +00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c=

(1197050, 1)


### <font color='red'>Now, for each feature I need to compute by **grouping** the sum of the value the feature is related to.</font>

### 1.3. "Dataframe" and "new file": max_expire_date (for a given _msno_)
The following line of code normally takes like 20 min to run! Be patient...
<font color='red'>UPDATE:</font> **I just created a file that is in my data, and uploaded to GitHub (if I forgot to do so, please tell me). Use that one!**

max_expire_date_df = transactions.groupby('msno', as_index=False)['membership_expire_date_dtype'].max()
print('Done!')

#New dataframe
max_expire_date_df.head()
#print(max_expire_date_df.shape)
#print(max_expire_date_df.count())
#print(max_expire_date_df.nunique()) #It is correct!

I will save this as a new .csv file to avoid computing it in the future (as it takes a lot of time).

max_expire_date_df.to_csv('data/max_expire_date_df.csv', index = False)
print('Done :)')

### 1.4. "Feature": membership duration (in total for al records of the same user) <font color='red'>mem_duration</font>

Difference between transaction_date and membership_expire_date. The difference is in terms of days (integer).

Later, I will compute the sum of all for each user (at the end of this file) by grouping them!

In [7]:
#--- difference in days ---
transactions['mem_duration'] = transactions.membership_expire_date_dtype - transactions.transaction_date_dtype
transactions['mem_duration'] = transactions['mem_duration'] / np.timedelta64(1, 'D')
transactions['mem_duration'] = transactions['mem_duration'].astype(int)

In [8]:
transactions.head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,transaction_date_dtype,membership_expire_date_dtype,mem_duration
0,++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134=,32,90,298,298,0,20170131,20170504,0,2017-01-31,2017-05-04,93
1,++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA=,41,30,149,149,1,20150809,20190412,0,2015-08-09,2019-04-12,1342
2,+/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc=,36,30,180,180,1,20170303,20170422,0,2017-03-03,2017-04-22,50
3,+/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=,36,30,180,180,1,20170329,20170331,1,2017-03-29,2017-03-31,2
4,+00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c=,41,30,99,99,1,20170323,20170423,0,2017-03-23,2017-04-23,31


In [9]:
print(transactions.mem_duration.unique())
print('')
print(transactions.mem_duration.describe())
print('')
print(transactions.columns)

[  93 1342   50 ... 2341 2707 2158]

count    1.431009e+06
mean     1.185636e+02
std      2.230717e+02
min     -3.000000e+00
25%      3.000000e+01
50%      3.100000e+01
75%      5.300000e+01
max      7.303000e+03
Name: mem_duration, dtype: float64

Index(['msno', 'payment_method_id', 'payment_plan_days', 'plan_list_price', 'actual_amount_paid', 'is_auto_renew', 'transaction_date', 'membership_expire_date', 'is_cancel', 'transaction_date_dtype',
       'membership_expire_date_dtype', 'mem_duration'],
      dtype='object')


**The following is a graph but I just put it as a text to ease the reading of the whole document**

sns.set(rc={'figure.figsize':(13,4)}) #Horizontal, vertical
sns.distplot( transactions["mem_duration"], bins = 1000 )
plt.ylabel('Count', fontsize = 12)
plt.xlabel('Membership duration', fontsize = 12)
plt.show()

I can see that 30 is the most general membership duration. So I will put as 30 the values above 100.
I will one-hot encode this and get rid of the columns less important.

In [10]:
# Change the mem_duration of all rows with an value greater than 60 to 40
# https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
transactions.loc[transactions['mem_duration'] > 60, "mem_duration"] = 30
transactions.loc[transactions['mem_duration'] < 0, "mem_duration"] = 0

**The following is a graph but I just put it as a text to ease the reading of the whole document**

sns.set(rc={'figure.figsize':(13,4)}) #Horizontal, vertical
sns.distplot( transactions["mem_duration"], bins = 500 )
plt.ylabel('Count', fontsize = 12)
plt.xlabel('Membership duration', fontsize = 12)
plt.show()

In [11]:
transactions.mem_duration.unique()

array([30, 50,  2, 31, 39, 46, 33, 32, 36, 42, 41, 38, 45, 35, 57, 51, 55,
       54, 60, 34, 52,  0, 43, 59, 58, 53, 49, 56,  1, 37, 40, 48,  3, 47,
        8, 44, 15,  4, 23, 18,  6, 19, 25, 29, 17, 11,  5,  9, 24, 22, 14,
       28, 10,  7, 20, 13, 12, 16, 21, 26, 27])

In [12]:
transactions.head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,transaction_date_dtype,membership_expire_date_dtype,mem_duration
0,++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134=,32,90,298,298,0,20170131,20170504,0,2017-01-31,2017-05-04,30
1,++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA=,41,30,149,149,1,20150809,20190412,0,2015-08-09,2019-04-12,30
2,+/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc=,36,30,180,180,1,20170303,20170422,0,2017-03-03,2017-04-22,50
3,+/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=,36,30,180,180,1,20170329,20170331,1,2017-03-29,2017-03-31,2
4,+00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c=,41,30,99,99,1,20170323,20170423,0,2017-03-23,2017-04-23,31


### 1.5. <font color='red'>"Feature Eng."</font>: _trans1_. Hot-encode "payment_method_id" feature <font color='red'>payment_method_id_XX</font>

In [13]:
trans1 = transactions

#One-hot encode payment_method_id and save it into payment_method_id_encode
payment_method_id_encode = pd.get_dummies(trans1['payment_method_id'], prefix='payment_method_id')

#DON'T DROP THE VARIABLE payment_method_id AS I NEED IT FOR NEXT VARIABLE (I will drop it then)

#Join the encoded payment_method_id_encode
trans1 = trans1.join(payment_method_id_encode)

#trans1.head()

In [14]:
#I get rid of columns whose value is below 30 as not significative
trans1 = trans1.drop("payment_method_id_2", axis=1)
trans1 = trans1.drop("payment_method_id_3", axis=1)
trans1 = trans1.drop("payment_method_id_5", axis=1)
trans1 = trans1.drop("payment_method_id_6", axis=1)
trans1 = trans1.drop("payment_method_id_8", axis=1)
trans1 = trans1.drop("payment_method_id_10", axis=1)
trans1 = trans1.drop("payment_method_id_11", axis=1)
trans1 = trans1.drop("payment_method_id_12", axis=1)
trans1 = trans1.drop("payment_method_id_13", axis=1)
trans1 = trans1.drop("payment_method_id_14", axis=1)
trans1 = trans1.drop("payment_method_id_15", axis=1)
trans1 = trans1.drop("payment_method_id_16", axis=1)
trans1 = trans1.drop("payment_method_id_17", axis=1)
trans1 = trans1.drop("payment_method_id_18", axis=1)
trans1 = trans1.drop("payment_method_id_19", axis=1)
trans1 = trans1.drop("payment_method_id_20", axis=1)
trans1 = trans1.drop("payment_method_id_21", axis=1)
trans1 = trans1.drop("payment_method_id_22", axis=1)
trans1 = trans1.drop("payment_method_id_23", axis=1)
trans1 = trans1.drop("payment_method_id_24", axis=1)
trans1 = trans1.drop("payment_method_id_25", axis=1)
trans1 = trans1.drop("payment_method_id_26", axis=1)
trans1 = trans1.drop("payment_method_id_27", axis=1)
trans1 = trans1.drop("payment_method_id_28", axis=1)
trans1 = trans1.drop("payment_method_id_29", axis=1)

In [15]:
trans1.head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,transaction_date_dtype,...,payment_method_id_32,payment_method_id_33,payment_method_id_34,payment_method_id_35,payment_method_id_36,payment_method_id_37,payment_method_id_38,payment_method_id_39,payment_method_id_40,payment_method_id_41
0,++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134=,32,90,298,298,0,20170131,20170504,0,2017-01-31,...,1,0,0,0,0,0,0,0,0,0
1,++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA=,41,30,149,149,1,20150809,20190412,0,2015-08-09,...,0,0,0,0,0,0,0,0,0,1
2,+/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc=,36,30,180,180,1,20170303,20170422,0,2017-03-03,...,0,0,0,0,1,0,0,0,0,0
3,+/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=,36,30,180,180,1,20170329,20170331,1,2017-03-29,...,0,0,0,0,1,0,0,0,0,0
4,+00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c=,41,30,99,99,1,20170323,20170423,0,2017-03-23,...,0,0,0,0,0,0,0,0,0,1


In [16]:
print('Number of rows & columns "TRANSACTIONS": ', transactions.shape)
print('Number of rows & columns "TRANS1": ', trans1.shape)
#Same number of rows!

Number of rows & columns "TRANSACTIONS":  (1431009, 12)
Number of rows & columns "TRANS1":  (1431009, 24)


### 1.6. <font color='red'>"Feature Eng."</font>: <font color='red'>pay_method_churn</font>, create 3 variables for payment method 32, 35, 38 (which are the ones that have most probability to _churn_)

In [17]:
mask = (trans1.payment_method_id == 32)
mask1 = (trans1.payment_method_id == 35)
mask2 = (trans1.payment_method_id == 38)

trans1['pay_method_churn'] = trans1.payment_method_id
trans1.pay_method_churn = 0
column_name = 'pay_method_churn'
trans1.loc[mask,column_name] = 1
trans1.loc[mask1,column_name] = 1
trans1.loc[mask2,column_name] = 1

trans1 = trans1.drop('payment_method_id', axis = 1)

trans1.head()

Unnamed: 0,msno,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,transaction_date_dtype,membership_expire_date_dtype,...,payment_method_id_33,payment_method_id_34,payment_method_id_35,payment_method_id_36,payment_method_id_37,payment_method_id_38,payment_method_id_39,payment_method_id_40,payment_method_id_41,pay_method_churn
0,++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134=,90,298,298,0,20170131,20170504,0,2017-01-31,2017-05-04,...,0,0,0,0,0,0,0,0,0,1
1,++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA=,30,149,149,1,20150809,20190412,0,2015-08-09,2019-04-12,...,0,0,0,0,0,0,0,0,1,0
2,+/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc=,30,180,180,1,20170303,20170422,0,2017-03-03,2017-04-22,...,0,0,0,1,0,0,0,0,0,0
3,+/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=,30,180,180,1,20170329,20170331,1,2017-03-29,2017-03-31,...,0,0,0,1,0,0,0,0,0,0
4,+00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c=,30,99,99,1,20170323,20170423,0,2017-03-23,2017-04-23,...,0,0,0,0,0,0,0,0,1,0


### 1.7. <font color='red'>"Feature Eng."</font>: "trans2". Hot-encode "payment_plan_days" feature and test algorithm <font color='red'>payment_plan_days_XX</font>

In [18]:
trans2 = trans1

payment_plan_days_encode = pd.get_dummies(trans2['payment_plan_days'], prefix='payment_plan_days')

trans2 = trans2.drop('payment_plan_days', axis=1)

trans2 = trans2.join(payment_plan_days_encode)

#trans2.head()

In [19]:
#I get rid of columns whose value is below 30 as not significative
trans2 = trans2.drop("payment_plan_days_0", axis=1)
trans2 = trans2.drop("payment_plan_days_1", axis=1)
trans2 = trans2.drop("payment_plan_days_3", axis=1)
#trans2 = trans2.drop("payment_plan_days_7", axis=1)
trans2 = trans2.drop("payment_plan_days_10", axis=1)
trans2 = trans2.drop("payment_plan_days_14", axis=1)
trans2 = trans2.drop("payment_plan_days_21", axis=1)
trans2 = trans2.drop("payment_plan_days_31", axis=1)
trans2 = trans2.drop("payment_plan_days_35", axis=1)
trans2 = trans2.drop("payment_plan_days_45", axis=1)
trans2 = trans2.drop("payment_plan_days_60", axis=1)
trans2 = trans2.drop("payment_plan_days_70", axis=1)
trans2 = trans2.drop("payment_plan_days_80", axis=1)
trans2 = trans2.drop("payment_plan_days_100", axis=1)
trans2 = trans2.drop("payment_plan_days_110", axis=1)
trans2 = trans2.drop("payment_plan_days_120", axis=1)
trans2 = trans2.drop("payment_plan_days_200", axis=1)
trans2 = trans2.drop("payment_plan_days_230", axis=1)
trans2 = trans2.drop("payment_plan_days_240", axis=1)
trans2 = trans2.drop("payment_plan_days_270", axis=1)
trans2 = trans2.drop("payment_plan_days_360", axis=1)
trans2 = trans2.drop("payment_plan_days_365", axis=1)
trans2 = trans2.drop("payment_plan_days_395", axis=1)
trans2 = trans2.drop("payment_plan_days_400", axis=1)
trans2 = trans2.drop("payment_plan_days_415", axis=1)
trans2 = trans2.drop("payment_plan_days_450", axis=1)
print('Done!')

Done!


In [20]:
trans2.head()

Unnamed: 0,msno,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,transaction_date_dtype,membership_expire_date_dtype,mem_duration,...,payment_method_id_39,payment_method_id_40,payment_method_id_41,pay_method_churn,payment_plan_days_7,payment_plan_days_30,payment_plan_days_90,payment_plan_days_180,payment_plan_days_195,payment_plan_days_410
0,++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134=,298,298,0,20170131,20170504,0,2017-01-31,2017-05-04,30,...,0,0,0,1,0,0,1,0,0,0
1,++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA=,149,149,1,20150809,20190412,0,2015-08-09,2019-04-12,30,...,0,0,1,0,0,1,0,0,0,0
2,+/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc=,180,180,1,20170303,20170422,0,2017-03-03,2017-04-22,50,...,0,0,0,0,0,1,0,0,0,0
3,+/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=,180,180,1,20170329,20170331,1,2017-03-29,2017-03-31,2,...,0,0,0,0,0,1,0,0,0,0
4,+00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c=,99,99,1,20170323,20170423,0,2017-03-23,2017-04-23,31,...,0,0,1,0,0,1,0,0,0,0


In [21]:
print('Number of rows & columns "TRANSACTIONS": ', transactions.shape)
print('Number of rows & columns "TRANS2": ', trans2.shape)
#Same number of rows!

Number of rows & columns "TRANSACTIONS":  (1431009, 12)
Number of rows & columns "TRANS2":  (1431009, 29)


### 1.8. <font color='red'>"Feature Eng."</font>: "trans3". Hot-encode "plan_list_price" feature and test algorithm <font color='red'>plan_list_price_XX</font>

In [22]:
trans2.plan_list_price.unique()

array([ 298,  149,  180,   99, 1788,  536,  129,  100,  894,  480,  300,
        477, 1299, 1599,  699, 1200,    0,  799,  930,  600,    1,   35,
       1399,  150,  119,  447,  450,  210, 1000,  134,  120, 2000,  400,
        131,  500,  350, 1260,  126,  596,   70,  265, 1150,  143,  105,
       1300,   50,   30,   15])

In [23]:
trans3 = trans2

plan_list_price_encode = pd.get_dummies(trans3['plan_list_price'], prefix='plan_list_price')

trans3 = trans3.drop('plan_list_price', axis=1)

trans3 = trans3.join(plan_list_price_encode)

#trans3.head()

In [24]:
#I get rid of columns
trans3 = trans3.drop("plan_list_price_0", axis=1)
trans3 = trans3.drop("plan_list_price_1", axis=1)
trans3 = trans3.drop("plan_list_price_15", axis=1)
trans3 = trans3.drop("plan_list_price_30", axis=1)
trans3 = trans3.drop("plan_list_price_35", axis=1)
trans3 = trans3.drop("plan_list_price_50", axis=1)
trans3 = trans3.drop("plan_list_price_70", axis=1)
trans3 = trans3.drop("plan_list_price_100", axis=1)
trans3 = trans3.drop("plan_list_price_105", axis=1)
trans3 = trans3.drop("plan_list_price_119", axis=1)
trans3 = trans3.drop("plan_list_price_120", axis=1)
trans3 = trans3.drop("plan_list_price_126", axis=1)
trans3 = trans3.drop("plan_list_price_131", axis=1)
trans3 = trans3.drop("plan_list_price_134", axis=1)
trans3 = trans3.drop("plan_list_price_143", axis=1)
trans3 = trans3.drop("plan_list_price_150", axis=1)
trans3 = trans3.drop("plan_list_price_210", axis=1)
trans3 = trans3.drop("plan_list_price_265", axis=1)
trans3 = trans3.drop("plan_list_price_298", axis=1)
trans3 = trans3.drop("plan_list_price_300", axis=1)
trans3 = trans3.drop("plan_list_price_350", axis=1)
trans3 = trans3.drop("plan_list_price_400", axis=1)
trans3 = trans3.drop("plan_list_price_447", axis=1)
trans3 = trans3.drop("plan_list_price_450", axis=1)
trans3 = trans3.drop("plan_list_price_477", axis=1)
trans3 = trans3.drop("plan_list_price_480", axis=1)
trans3 = trans3.drop("plan_list_price_500", axis=1)
trans3 = trans3.drop("plan_list_price_536", axis=1)
trans3 = trans3.drop("plan_list_price_596", axis=1)
trans3 = trans3.drop("plan_list_price_600", axis=1)
trans3 = trans3.drop("plan_list_price_699", axis=1)
trans3 = trans3.drop("plan_list_price_799", axis=1)
trans3 = trans3.drop("plan_list_price_894", axis=1)
trans3 = trans3.drop("plan_list_price_930", axis=1)
trans3 = trans3.drop("plan_list_price_1000", axis=1)
trans3 = trans3.drop("plan_list_price_1150", axis=1)
trans3 = trans3.drop("plan_list_price_1200", axis=1)
trans3 = trans3.drop("plan_list_price_1599", axis=1)
trans3 = trans3.drop("plan_list_price_1788", axis=1)
trans3 = trans3.drop("plan_list_price_2000", axis=1)
print('Done!')

Done!


In [25]:
trans3.head()

Unnamed: 0,msno,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,transaction_date_dtype,membership_expire_date_dtype,mem_duration,payment_method_id_30,...,payment_plan_days_195,payment_plan_days_410,plan_list_price_99,plan_list_price_129,plan_list_price_149,plan_list_price_180,plan_list_price_1260,plan_list_price_1299,plan_list_price_1300,plan_list_price_1399
0,++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134=,298,0,20170131,20170504,0,2017-01-31,2017-05-04,30,0,...,0,0,0,0,0,0,0,0,0,0
1,++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA=,149,1,20150809,20190412,0,2015-08-09,2019-04-12,30,0,...,0,0,0,0,1,0,0,0,0,0
2,+/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc=,180,1,20170303,20170422,0,2017-03-03,2017-04-22,50,0,...,0,0,0,0,0,1,0,0,0,0
3,+/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=,180,1,20170329,20170331,1,2017-03-29,2017-03-31,2,0,...,0,0,0,0,0,1,0,0,0,0
4,+00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c=,99,1,20170323,20170423,0,2017-03-23,2017-04-23,31,0,...,0,0,1,0,0,0,0,0,0,0


In [26]:
print('Number of rows & columns "TRANSACTIONS": ', transactions.shape)
print('Number of rows & columns "TRANS3": ', trans3.shape)
#Same number of rows!

Number of rows & columns "TRANSACTIONS":  (1431009, 12)
Number of rows & columns "TRANS3":  (1431009, 36)


### 1.9. <font color='red'>"Feature Eng."</font>: notAutorenew_&_cancel
Binary feature to predict possible churning if
- auto_renew = 0 and
- is_cancel = 1

In [27]:
trans3['notAutorenew_and_cancel'] = ((trans3.is_auto_renew == 0) == (trans3.is_cancel == 1)).astype(np.int8)

In [28]:
trans3.head()

Unnamed: 0,msno,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,transaction_date_dtype,membership_expire_date_dtype,mem_duration,payment_method_id_30,...,payment_plan_days_410,plan_list_price_99,plan_list_price_129,plan_list_price_149,plan_list_price_180,plan_list_price_1260,plan_list_price_1299,plan_list_price_1300,plan_list_price_1399,notAutorenew_and_cancel
0,++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134=,298,0,20170131,20170504,0,2017-01-31,2017-05-04,30,0,...,0,0,0,0,0,0,0,0,0,0
1,++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA=,149,1,20150809,20190412,0,2015-08-09,2019-04-12,30,0,...,0,0,0,1,0,0,0,0,0,1
2,+/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc=,180,1,20170303,20170422,0,2017-03-03,2017-04-22,50,0,...,0,0,0,0,1,0,0,0,0,1
3,+/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=,180,1,20170329,20170331,1,2017-03-29,2017-03-31,2,0,...,0,0,0,0,1,0,0,0,0,0
4,+00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c=,99,1,20170323,20170423,0,2017-03-23,2017-04-23,31,0,...,0,1,0,0,0,0,0,0,0,1


### 1.10. <font color='red'>"GROUPBY ALL COLUMNS"</font>
I will do this in a new dataframe just in case to have the one I've working with as a backup (and don't need to run all code above).

In [29]:
#BACKUP DATAFRAME
backup_df = trans3

#DATAFRAME I WILL WORK WITH
unwanted = ['transaction_date_dtype','membership_expire_date_dtype','actual_amount_paid','transaction_date','membership_expire_date']
trans3 = trans3.drop(unwanted, axis = 1)

trans3.head()

Unnamed: 0,msno,is_auto_renew,is_cancel,mem_duration,payment_method_id_30,payment_method_id_31,payment_method_id_32,payment_method_id_33,payment_method_id_34,payment_method_id_35,...,payment_plan_days_410,plan_list_price_99,plan_list_price_129,plan_list_price_149,plan_list_price_180,plan_list_price_1260,plan_list_price_1299,plan_list_price_1300,plan_list_price_1399,notAutorenew_and_cancel
0,++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134=,0,0,30,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA=,1,0,30,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
2,+/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc=,1,0,50,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
3,+/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=,1,1,2,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,+00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c=,1,0,31,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1


The following line takes at most 20 seconds...

In [30]:
sums_df = trans3.groupby('msno').sum().reset_index()

In [31]:
sums_df.head()

Unnamed: 0,msno,is_auto_renew,is_cancel,mem_duration,payment_method_id_30,payment_method_id_31,payment_method_id_32,payment_method_id_33,payment_method_id_34,payment_method_id_35,...,payment_plan_days_410,plan_list_price_99,plan_list_price_129,plan_list_price_149,plan_list_price_180,plan_list_price_1260,plan_list_price_1299,plan_list_price_1300,plan_list_price_1399,notAutorenew_and_cancel
0,+++IZseRRiQS9aaSkH6cMYU6bGDcxUieAi/tH67sC5s=,0,0,30,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,+++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o=,1,0,31,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
2,+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw=,2,0,99,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,2
3,+++snpr7pmobhLKUgSHTv/mpkqgBT0tQJ0zQj6qKrqc=,1,0,31,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
4,++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=,1,0,31,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1


In [32]:
print(sums_df.shape)

(1197050, 32)


**Now, I have a dataframe with all my data grouped by msno and added (SUM)**

### 1.11. <font color='red'>"Feature Eng."</font>: cancel_ratio
Next code box will take 20 secs.

In [33]:
counts = trans3.groupby('msno')['is_cancel'].count().reset_index()
counts.columns = ['msno','transactions']
sums_is_cancel = trans3.groupby('msno')['is_cancel'].sum().reset_index()
merged_dataset = sums_is_cancel.merge(counts, how='inner', on='msno')

merged_dataset['is_cancel_number'] = merged_dataset['is_cancel']
merged_dataset = merged_dataset.drop(['is_cancel'], axis = 1)

In [34]:
merged_dataset.head()

Unnamed: 0,msno,transactions,is_cancel_number
0,+++IZseRRiQS9aaSkH6cMYU6bGDcxUieAi/tH67sC5s=,1,0
1,+++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o=,1,0
2,+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw=,2,0
3,+++snpr7pmobhLKUgSHTv/mpkqgBT0tQJ0zQj6qKrqc=,1,0
4,++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=,1,0


In [35]:
print(merged_dataset.shape)

(1197050, 3)


### 1.12. Merge step 1.10. and 1.11.

In [36]:
result = sums_df.merge(merged_dataset, on='msno')

In [37]:
result.head()

Unnamed: 0,msno,is_auto_renew,is_cancel,mem_duration,payment_method_id_30,payment_method_id_31,payment_method_id_32,payment_method_id_33,payment_method_id_34,payment_method_id_35,...,plan_list_price_129,plan_list_price_149,plan_list_price_180,plan_list_price_1260,plan_list_price_1299,plan_list_price_1300,plan_list_price_1399,notAutorenew_and_cancel,transactions,is_cancel_number
0,+++IZseRRiQS9aaSkH6cMYU6bGDcxUieAi/tH67sC5s=,0,0,30,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,+++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o=,1,0,31,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
2,+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw=,2,0,99,0,0,0,0,0,0,...,0,2,0,0,0,0,0,2,2,0
3,+++snpr7pmobhLKUgSHTv/mpkqgBT0tQJ0zQj6qKrqc=,1,0,31,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,1,0
4,++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=,1,0,31,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,1,0


In [38]:
print(result.shape)

(1197050, 34)


### 1.13. Put together 1.12. and the file I created in the first parts of this notebook: "max_expire_date_df.csv" (part 1.3.)

In [39]:
#CHECK IF YOU HAVE THIS FILE! If not, email pablo.depaz@hotmail.es asking for help :)
max_expire_date_dataframe = pd.read_csv('data/max_expire_date_df.csv')

**Now, merge this with the "result" dataframe in section 1.12.**

In [40]:
merging = result.merge(max_expire_date_dataframe, on='msno')

In [41]:
merging.head()

Unnamed: 0,msno,is_auto_renew,is_cancel,mem_duration,payment_method_id_30,payment_method_id_31,payment_method_id_32,payment_method_id_33,payment_method_id_34,payment_method_id_35,...,plan_list_price_149,plan_list_price_180,plan_list_price_1260,plan_list_price_1299,plan_list_price_1300,plan_list_price_1399,notAutorenew_and_cancel,transactions,is_cancel_number,membership_expire_date_dtype
0,+++IZseRRiQS9aaSkH6cMYU6bGDcxUieAi/tH67sC5s=,0,0,30,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,2018-02-06
1,+++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o=,1,0,31,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,2017-04-15
2,+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw=,2,0,99,0,0,0,0,0,0,...,2,0,0,0,0,0,2,2,0,2017-05-19
3,+++snpr7pmobhLKUgSHTv/mpkqgBT0tQJ0zQj6qKrqc=,1,0,31,0,0,0,0,0,0,...,1,0,0,0,0,0,1,1,0,2017-04-26
4,++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=,1,0,31,0,0,0,0,0,0,...,1,0,0,0,0,0,1,1,0,2017-04-15


In [42]:
print(merging.shape)

(1197050, 35)


### 1.14. Check if missing values

In [43]:
merging.isnull().sum()
#We should get 0!

msno                            0
is_auto_renew                   0
is_cancel                       0
mem_duration                    0
payment_method_id_30            0
payment_method_id_31            0
payment_method_id_32            0
payment_method_id_33            0
payment_method_id_34            0
payment_method_id_35            0
payment_method_id_36            0
payment_method_id_37            0
payment_method_id_38            0
payment_method_id_39            0
payment_method_id_40            0
payment_method_id_41            0
pay_method_churn                0
payment_plan_days_7             0
payment_plan_days_30            0
payment_plan_days_90            0
payment_plan_days_180           0
payment_plan_days_195           0
payment_plan_days_410           0
plan_list_price_99              0
plan_list_price_129             0
plan_list_price_149             0
plan_list_price_180             0
plan_list_price_1260            0
plan_list_price_1299            0
plan_list_pric

# 2. Create 3 dataframes with train_v1, train_v2 and submission_file.
I compute the days difference between **max_expire_date** and:
- Feb 28th (**for train_v1**)
- March 31th (**for train_v2**)
- April 30th (**for submission_file**)
<img src="http://tripkendall.com/wp-content/uploads/2018/01/pandas_logo-1080x675.jpg" border="1" alt="Dataframe transactions" width="200" height="150">

In [44]:
#Import train sets and sample_submission_zero.csv
train = pd.read_csv('data/train.csv')
train_v2 = pd.read_csv('data/train_v2.csv')
sample_sub = pd.read_csv('data/sample_submission_zero.csv')

### 2.1. Compute the dates difference

In [45]:
#Compute the difference for "train" (w.r.t. 2017 Feb 28th)
df1 = merging

#type(datetime.date(2017,2,28)) #This is a datetype
#df1.dtypes #This is an object (I think as a result of the export to the file)
df1['membership_expire_date_dtype'] = pd.to_datetime(df1['membership_expire_date_dtype'])
df1.dtypes

df1['diff_feb'] = df1['membership_expire_date_dtype'] - datetime.date(2017,2,28)
df1['diff_feb'] = df1['diff_feb'] / np.timedelta64(1, 'D')
df1['diff_feb'] = df1['diff_feb'].astype(int)

In [46]:
#Compute the difference for "train" (w.r.t. 2017 March 31st)
df2 = merging

df2['diff_mar'] = df2['membership_expire_date_dtype'] - datetime.date(2017,3,31)
df2['diff_mar'] = df2['diff_mar'] / np.timedelta64(1, 'D')
df2['diff_mar'] = df2['diff_mar'].astype(int)

In [47]:
#Compute the difference for "train" (w.r.t. 2017 March 31st)
df3 = merging

df3['diff_apr'] = df3['membership_expire_date_dtype'] - datetime.date(2017,3,31)
df3['diff_apr'] = df3['diff_apr'] / np.timedelta64(1, 'D')
df3['diff_apr'] = df3['diff_apr'].astype(int)

So, here I have 4 dataframes:
- **"merging"**: the good dataframe with the data and feature engineering
- **"df1"**: dataframe _merging_ with the difference of the expiration date w.r.t. 2017 Feb 28th.
- **"df2"**: dataframe _merging_ with the difference of the expiration date w.r.t. 2017 Mar 31st.
- **"df3"**: dataframe _merging_ with the difference of the expiration date w.r.t. 2017 Apr 30th.

**df1** to be merged with train (v1) <br>
**df2** to be merged with train (v2) <br>
**df3** to be merged with sample_submission

## 2.2. Merge the "dfX" dataframes with train, train_v2 and sample_submission

In [56]:
train_final = pd.merge(df1, train, on = 'msno')
train_final = train_final.drop("diff_mar", axis=1)
train_final = train_final.drop("diff_apr", axis=1)
train_final = train_final.drop("membership_expire_date_dtype", axis=1)

train_final.head()

Unnamed: 0,msno,is_auto_renew,is_cancel,mem_duration,payment_method_id_30,payment_method_id_31,payment_method_id_32,payment_method_id_33,payment_method_id_34,payment_method_id_35,...,plan_list_price_180,plan_list_price_1260,plan_list_price_1299,plan_list_price_1300,plan_list_price_1399,notAutorenew_and_cancel,transactions,is_cancel_number,diff_feb,is_churn
0,+++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o=,1,0,31,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,46,0
1,+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw=,2,0,99,0,0,0,0,0,0,...,0,0,0,0,0,2,2,0,80,0
2,+++snpr7pmobhLKUgSHTv/mpkqgBT0tQJ0zQj6qKrqc=,1,0,31,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,57,0
3,++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=,1,0,31,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,46,0
4,++/UDNo9DLrxT8QVGiDi1OnWfczAdEwThaVyD0fXO50=,2,0,107,0,0,0,0,0,0,...,0,0,0,0,0,2,2,0,84,0


In [57]:
print(train_final.shape)

(937147, 36)


<img src="https://www.calshrm.org/images/soft%20slit%20separator.png?crc=4104113101" border="1" alt="Dataframe transactions" width="400" height="100">


In [58]:
train_v2_final = pd.merge(df2, train_v2, on = 'msno')
train_v2_final = train_v2_final.drop("diff_feb", axis=1)
train_v2_final = train_v2_final.drop("diff_apr", axis=1)
train_v2_final = train_v2_final.drop("membership_expire_date_dtype", axis=1)

train_v2_final.head()

Unnamed: 0,msno,is_auto_renew,is_cancel,mem_duration,payment_method_id_30,payment_method_id_31,payment_method_id_32,payment_method_id_33,payment_method_id_34,payment_method_id_35,...,plan_list_price_180,plan_list_price_1260,plan_list_price_1299,plan_list_price_1300,plan_list_price_1399,notAutorenew_and_cancel,transactions,is_cancel_number,diff_mar,is_churn
0,+++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o=,1,0,31,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,15,0
1,+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw=,2,0,99,0,0,0,0,0,0,...,0,0,0,0,0,2,2,0,49,0
2,+++snpr7pmobhLKUgSHTv/mpkqgBT0tQJ0zQj6qKrqc=,1,0,31,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,26,0
3,++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=,1,0,31,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,15,0
4,++/UDNo9DLrxT8QVGiDi1OnWfczAdEwThaVyD0fXO50=,2,0,107,0,0,0,0,0,0,...,0,0,0,0,0,2,2,0,53,0


In [59]:
print(train_v2_final.shape)

(933578, 36)


<img src="https://www.calshrm.org/images/soft%20slit%20separator.png?crc=4104113101" border="1" alt="Dataframe transactions" width="400" height="100">


In [60]:
sample_sub_final = pd.merge(df3, sample_sub, on = 'msno')
sample_sub_final = sample_sub_final.drop("diff_feb", axis=1)
sample_sub_final = sample_sub_final.drop("diff_mar", axis=1)
sample_sub_final = sample_sub_final.drop("membership_expire_date_dtype", axis=1)

sample_sub_final.head()

Unnamed: 0,msno,is_auto_renew,is_cancel,mem_duration,payment_method_id_30,payment_method_id_31,payment_method_id_32,payment_method_id_33,payment_method_id_34,payment_method_id_35,...,plan_list_price_180,plan_list_price_1260,plan_list_price_1299,plan_list_price_1300,plan_list_price_1399,notAutorenew_and_cancel,transactions,is_cancel_number,diff_apr,is_churn
0,+++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o=,1,0,31,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,15,0
1,+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw=,2,0,99,0,0,0,0,0,0,...,0,0,0,0,0,2,2,0,49,0
2,+++snpr7pmobhLKUgSHTv/mpkqgBT0tQJ0zQj6qKrqc=,1,0,31,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,26,0
3,++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=,1,0,31,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,15,0
4,++/UDNo9DLrxT8QVGiDi1OnWfczAdEwThaVyD0fXO50=,2,0,107,0,0,0,0,0,0,...,0,0,0,0,0,2,2,0,53,0


In [61]:
print(sample_sub_final.shape)

(933578, 36)


<img src="https://www.calshrm.org/images/soft%20slit%20separator.png?crc=4104113101" border="1" alt="Dataframe transactions" width="400" height="100">


# DONE UNTIL HERE UP TO NOW!

## 3. Export files
It can take some minutes!

In [62]:
train_final.to_csv('data/final_trans_merged_train.csv', index = False)
print('Done!')

Done!


In [63]:
train_v2_final.to_csv('data/final_trans_merged_train_v2.csv', index = False)
print('Done!')

Done!


In [64]:
sample_sub_final.to_csv('data/final_trans_merged_sample_sub.csv', index = False)
print('Done!')

Done!
