# Feature engineering (Members)

This file generates features for "members.csv" and exports three files called members_train1, members_train2 and members_total (written to disk as members_feb.csv, members_mar.csv and members_apr.csv). These files are already merged with the two training sets and with the submission file. As such, in the algorithm file, we only need to load the feature-engineered member files.

# 5 Members

## 5.1 Loading the data

In [1]:
#Import the relevant libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import mpld3
import seaborn as sns
import matplotlib.dates as mdates
import time
from datetime import datetime
from sklearn import preprocessing

#Configure pandas
pd.options.display.width = 200

In [2]:
#Load members in:
members= pd.read_csv("data/members_v3.csv")

In [3]:
#Look at the first values in members:
print("members:")
print(members.head())

members:
                                           msno  city  bd  gender  registered_via  registration_init_time
0  Rb9UwLQTrxzBVwCB6+bCcSQWZ9JiNLC9dXtM1oEsZA8=     1   0     NaN              11                20110911
1  +tJonkh+O1CA796Fm5X60UMOtB6POHAwPjbTRVl/EuU=     1   0     NaN               7                20110914
2  cV358ssn7a0f7jZOwGNWS07wCKVqxyiImJUX6xcIwKw=     1   0     NaN              11                20110915
3  9bzDeJP6sQodK73K5CBlJ6fgIQzPeLnRl0p5B77XP+g=     1   0     NaN              11                20110915
4  WFLY3s7z4EZsieHCt63XrsdtfTEmJ+2PnnKLH5GY4Tk=     6  32  female               9                20110915


In the next section, we are going to generate features for the members file. In the data analysis of members, we concluded that certain variables such as the birthday or the city users live in have an influence on whether people churn or not. As the quantity as well as the quality of the features have a large influence on the quality of the outcome of the algorithm, it is important to carefully generate and test these features.

## 5.2 Features

### 5.2.1 One-hot encoding of city

To include information about the city of the user as a feature, we one-hot encode the variable "city". City is a categorical integer variable, so the value in "city" has no actual numerical meaning. It is only a label for the city in which the user lives. The cities could just as well have been designated by their actual name or by a different categorical value, such as "A", "B", etc. Summing up the values would make no sense, and a city with a high value does not have a higher importance. Before we hand information about the city to the algorithm, we need to binarize it.
If a variable is one-hot encoded, a new column is generated for every possible value that the variable could take. The variable "city" contains 21 different cities. As a result, one-hot encoding will generate 21 columns containing only 1's and 0's. If a customer lives in city 3, for example, only the newly generated column "city_3" will contain a 1, while the other 20 columns contain a 0.
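
As a minimal sketch of the idea, the toy example below one-hot encodes a made-up "city" column. The data frame `toy` and its values are assumptions for illustration only; `dtype=int` is passed so that the dummy columns contain 0's and 1's regardless of the pandas version.

```python
import pandas as pd

# Toy data: the user ids and city codes are made up for illustration
toy = pd.DataFrame({"msno": ["a", "b", "c", "d"], "city": [1, 3, 1, 6]})

# One new 0/1 column per distinct city code
city_dummies = pd.get_dummies(toy["city"], prefix="city", dtype=int)

# Replace the original column with the encoded columns
toy_encoded = toy.drop(columns="city").join(city_dummies)
print(toy_encoded)
```

Every row has exactly one 1 among the "city_*" columns, and codes that never occur (here 2, 4 and 5) do not get a column at all.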

In [4]:
#How many unique values does the variable city have?
members.city.nunique()

21

In [5]:
#One-hot encode the cities.
#Instead of having a variable called city with integer codes from 1-22, the algorithm performs better with 0's and 1's -> one-hot encoding

#Generate our final data frame called final1 (a copy, so that members itself stays unchanged)
final1 = members.copy()

#One-hot encode city and save it into city_encode
city_encode = pd.get_dummies(final1['city'],prefix='city')

#Drop variable city in final1, as it is no longer needed
final1=final1.drop('city',axis=1)

#Join the encoded city_encode
final1 = final1.join(city_encode)

final1.head()


Unnamed: 0,msno,bd,gender,registered_via,registration_init_time,city_1,city_3,city_4,city_5,city_6,...,city_13,city_14,city_15,city_16,city_17,city_18,city_19,city_20,city_21,city_22
0,Rb9UwLQTrxzBVwCB6+bCcSQWZ9JiNLC9dXtM1oEsZA8=,0,,11,20110911,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,+tJonkh+O1CA796Fm5X60UMOtB6POHAwPjbTRVl/EuU=,0,,7,20110914,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,cV358ssn7a0f7jZOwGNWS07wCKVqxyiImJUX6xcIwKw=,0,,11,20110915,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9bzDeJP6sQodK73K5CBlJ6fgIQzPeLnRl0p5B77XP+g=,0,,11,20110915,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,WFLY3s7z4EZsieHCt63XrsdtfTEmJ+2PnnKLH5GY4Tk=,32,female,9,20110915,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### 5.2.2 One-hot encoding of the year

In this part, we one-hot encode the year of registration. As long as a date is not calculated in relation to another date, its value can be treated as categorical. First, we need to convert the integer date to the data type "date", which allows us to extract the year. We showed in the data exploration that the chance of churning is higher for short-term users. We only keep the registration years from 2012 onwards, because we do not have a lot of data on the earlier years. Adding those years as well would only slow the algorithm down and generate unwanted noise. It is important to provide only relevant data to the algorithm and to avoid flooding it with irrelevant information or with columns that carry data about only a few users.
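
The conversion and the year filter can be sketched on a few made-up dates as follows. This is an illustrative variant that uses `pd.to_datetime` with an explicit format instead of the row-wise `strptime` call in the notebook; the cutoff of 2012 mirrors the columns that are kept in the code cell further down.

```python
import pandas as pd

# Toy integer dates in YYYYMMDD form (made up for illustration)
raw = pd.Series([20110911, 20130102, 20161231], name="registration_init_time")

# Parse the integers as dates and extract the year
parsed = pd.to_datetime(raw.astype(str), format="%Y%m%d")
reg_year = parsed.dt.year.rename("reg_year")

# One-hot encode the year, then keep only the columns for 2012 and later
year_dummies = pd.get_dummies(reg_year, prefix="reg_year", dtype=int)
keep = [c for c in year_dummies.columns if int(c.rsplit("_", 1)[1]) >= 2012]
year_dummies = year_dummies[keep]
print(year_dummies)
```

A user who registered before 2012 ends up with a 0 in every remaining "reg_year_*" column.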

In [6]:
#transform integer dates to data type: date (missing values become NaT, so that the year can still be extracted afterwards)
final1['registration_init_time'] = final1.registration_init_time.apply(lambda x: datetime.strptime(str(int(x)), "%Y%m%d").date() if pd.notnull(x) else pd.NaT )


In [7]:
#Create a dataframe containing the column reg_year
date = pd.DataFrame(columns=['reg_year'])
#save the year of registration in it
date.reg_year=pd.DatetimeIndex(final1['registration_init_time']).year

#Drop variable registration_init_time in final1, as it is no longer needed
final1=final1.drop('registration_init_time',axis=1)
#Join the two dataframes
final1=final1.join(date)

#get one-hot encoding columns
year_encode = pd.get_dummies(final1['reg_year'],prefix='reg_year')
#Drop the less relevant years, see data exploration file for members
year_encode=year_encode.drop(['reg_year_2004','reg_year_2005','reg_year_2006','reg_year_2007','reg_year_2008','reg_year_2009','reg_year_2010','reg_year_2011'],axis=1)

#Join and drop
final1 = final1.join(year_encode)
final1=final1.drop('reg_year',axis=1)

final1.head()


Unnamed: 0,msno,bd,gender,registered_via,city_1,city_3,city_4,city_5,city_6,city_7,...,city_19,city_20,city_21,city_22,reg_year_2012,reg_year_2013,reg_year_2014,reg_year_2015,reg_year_2016,reg_year_2017
0,Rb9UwLQTrxzBVwCB6+bCcSQWZ9JiNLC9dXtM1oEsZA8=,0,,11,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,+tJonkh+O1CA796Fm5X60UMOtB6POHAwPjbTRVl/EuU=,0,,7,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,cV358ssn7a0f7jZOwGNWS07wCKVqxyiImJUX6xcIwKw=,0,,11,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9bzDeJP6sQodK73K5CBlJ6fgIQzPeLnRl0p5B77XP+g=,0,,11,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,WFLY3s7z4EZsieHCt63XrsdtfTEmJ+2PnnKLH5GY4Tk=,32,female,9,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


### 5.2.3 One-hot encoding of registration method

Next, we need to one-hot encode the registration method. Similarly to the "city" variable, this feature contains integer values that have no numerical meaning. The algorithm cannot know that the values in "registered_via" are only labels. It would treat them as ordered numbers and give a higher weight to high values, which would be wrong. This is why we need to one-hot encode the registration method as well. Taking the data exploration into account, we decide to get rid of certain values of "registered_via", as they only represent a very small fraction of the data. As such, they are not relevant and would make the algorithm slow and inefficient.
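
The notebook drops the rare registration methods by listing their columns by hand. As a hedged alternative sketch, the same selection could be derived from the data with a frequency threshold. The threshold `min_share` and the toy values below are assumptions for illustration, not the values used in this project.

```python
import pandas as pd

# Toy registration methods (made up for illustration)
registered_via = pd.Series([7, 7, 9, 9, 9, 3, 13, 7, 9, 4], name="registered_via")

# Hypothetical threshold: keep methods used by at least 15% of the users
min_share = 0.15

# Share of users per registration method
share = registered_via.value_counts(normalize=True)
frequent = sorted(share[share >= min_share].index)

# One-hot encode, then keep only the columns of the frequent methods
reg_dummies = pd.get_dummies(registered_via, prefix="reg", dtype=int)
reg_dummies = reg_dummies[[f"reg_{v}" for v in frequent]]
print(reg_dummies)
```

Users with a rare registration method keep a row of zeros, which is the same effect as dropping the corresponding dummy columns manually.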

In [8]:
#get the dummy columns
reg_encode = pd.get_dummies(final1['registered_via'],prefix='reg')

#drop the not so relevant columns (see data exploration file for members)
reg_encode=reg_encode.drop(['reg_-1','reg_1','reg_2','reg_5','reg_6','reg_8','reg_10','reg_13','reg_14','reg_16','reg_17','reg_18','reg_19'],axis=1)

#Join and drop
final1 = final1.join(reg_encode)
final1=final1.drop('registered_via',axis=1)

final1.head()


Unnamed: 0,msno,bd,gender,city_1,city_3,city_4,city_5,city_6,city_7,city_8,...,reg_year_2013,reg_year_2014,reg_year_2015,reg_year_2016,reg_year_2017,reg_3,reg_4,reg_7,reg_9,reg_11
0,Rb9UwLQTrxzBVwCB6+bCcSQWZ9JiNLC9dXtM1oEsZA8=,0,,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,+tJonkh+O1CA796Fm5X60UMOtB6POHAwPjbTRVl/EuU=,0,,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,cV358ssn7a0f7jZOwGNWS07wCKVqxyiImJUX6xcIwKw=,0,,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,9bzDeJP6sQodK73K5CBlJ6fgIQzPeLnRl0p5B77XP+g=,0,,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,WFLY3s7z4EZsieHCt63XrsdtfTEmJ+2PnnKLH5GY4Tk=,32,female,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0


In [9]:
#print which columns are left for registration method
reg_encode.head()
final1.head()

Unnamed: 0,msno,bd,gender,city_1,city_3,city_4,city_5,city_6,city_7,city_8,...,reg_year_2013,reg_year_2014,reg_year_2015,reg_year_2016,reg_year_2017,reg_3,reg_4,reg_7,reg_9,reg_11
0,Rb9UwLQTrxzBVwCB6+bCcSQWZ9JiNLC9dXtM1oEsZA8=,0,,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,+tJonkh+O1CA796Fm5X60UMOtB6POHAwPjbTRVl/EuU=,0,,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,cV358ssn7a0f7jZOwGNWS07wCKVqxyiImJUX6xcIwKw=,0,,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,9bzDeJP6sQodK73K5CBlJ6fgIQzPeLnRl0p5B77XP+g=,0,,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,WFLY3s7z4EZsieHCt63XrsdtfTEmJ+2PnnKLH5GY4Tk=,32,female,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0


### 5.2.4 Normalization of birthday

The age of the users also plays a role in the likelihood of churning, as mentioned in the data exploration. In a first step, we get rid of the outliers by setting all values that are negative or larger than 90 to 24, which corresponds to the median. In other words, we assume that a user is at most 90 years old. Note that the code below leaves a value of 0 unchanged, as the output of the following cell shows. Next, we normalize the data (min-max normalization with a minimum of 0 and a maximum of 90, i.e. a division by 90). The data is normalized because the range of the raw values varies widely between different features. To prevent a negative effect on the performance of the algorithm, we take the precautionary step of normalizing it. In fact, some objective functions will not work properly if the data is not normalized.
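
The two steps can be sketched compactly on a few made-up ages. The bounds follow the masks in the code cell below (values below 0 or above 90 are replaced); the names `fallback_age` and `max_age` are introduced here for illustration only.

```python
import pandas as pd

# Toy ages, including two implausible values (made up for illustration)
bd = pd.Series([0, 32, -5, 120, 24, 90], name="bd")

fallback_age = 24  # replacement value for outliers, as in the text
max_age = 90       # assumed upper bound for a plausible age

# Keep values inside [0, max_age] and replace everything else
valid = bd.between(0, max_age)
bd_clean = bd.where(valid, fallback_age)

# Min-max normalization with min = 0 and max = max_age
bd_norm = bd_clean / max_age
print(bd_norm)
```

With these bounds an age of 32 maps to 32/90 ≈ 0.3556, which matches the value printed for the fifth user in the notebook output.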


In [10]:
bd=final1.bd.copy()
bd_df = pd.DataFrame(columns=['bd_norm'])
bd_df.bd_norm=bd

#set outliers to 24:
mask=(bd_df.bd_norm<0)
mask2=(bd_df.bd_norm>90)
column_name='bd_norm'


bd_df.loc[mask,column_name]=24
bd_df.loc[mask2,column_name]=24


#normalize
bd_df.bd_norm=bd_df.bd_norm/90

final1 = final1.join(bd_df)


In [11]:
bd_df.bd_norm.head()

0    0.000000
1    0.000000
2    0.000000
3    0.000000
4    0.355556
Name: bd_norm, dtype: float64

### 5.2.5 Young people

#We create a column with young people under 24, because they are more likely to churn
#save bd in a vector for new feature later on
bd=final1.bd

#create indexes
mask=(final1.bd<24)
mask_not=(final1.bd<1)
mask_not2=(final1.bd>23)

column_name='bd'
#change birthday indexes
final1.loc[mask,column_name]=1
final1.loc[mask_not,column_name]=0
final1.loc[mask_not2,column_name]=0

final1.head()


## 5.4 Merge Data 

In this part, the data is merged with the train and submission files and is tested for null values. Since many algorithms cannot handle NULL values (NaN), we need to replace them with values that make sense in the context of the variable.

In [12]:
#drop gender, as it is not used as a feature (registered_via and registration_init_time were already dropped above)
final1=final1.drop(['gender'],axis=1)
final1.head()

Unnamed: 0,msno,bd,city_1,city_3,city_4,city_5,city_6,city_7,city_8,city_9,...,reg_year_2014,reg_year_2015,reg_year_2016,reg_year_2017,reg_3,reg_4,reg_7,reg_9,reg_11,bd_norm
0,Rb9UwLQTrxzBVwCB6+bCcSQWZ9JiNLC9dXtM1oEsZA8=,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0.0
1,+tJonkh+O1CA796Fm5X60UMOtB6POHAwPjbTRVl/EuU=,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0.0
2,cV358ssn7a0f7jZOwGNWS07wCKVqxyiImJUX6xcIwKw=,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0.0
3,9bzDeJP6sQodK73K5CBlJ6fgIQzPeLnRl0p5B77XP+g=,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0.0
4,WFLY3s7z4EZsieHCt63XrsdtfTEmJ+2PnnKLH5GY4Tk=,32,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0.355556


Before merging "members" with the train and test files, we need to check whether "members" contains any null values. As the next cell shows, the sum of the NaN's in the file is 0.

In [13]:
#Check for null-values:
final1.isnull().sum().sum()

0

In [14]:
#Load the training and submission data in, so we can create 3 files to export for the final algorithm

# From train.csv, we will extract the is_churn and use it as the y-label for training. 
train = pd.read_csv('data/train.csv')
print('train loaded!')

# From train_v2.csv (the churn data for march), we will extract the is_churn and use it as the y-label for test
train_v2 = pd.read_csv('data/train_v2.csv')

# From sample_submission_v2.csv, we will extract the msno's 
submission = pd.read_csv('data/sample_submission_v2.csv')
print('test loaded!')

#members_train1
#members_train2
#members_total

train loaded!
test loaded!


After loading the two train files and the submission file, we can merge them with members on "msno" to create three files called "members_train1", "members_train2" and "members_total". We need to merge the files to retrieve the churn information about the users and to combine it with the other data we have. While merging the files together, NULL values may be generated, because we use a left merge on the "train", "train_v2" and "submission" data frames respectively. They may contain msno's that cannot be found in the "members" file, and as such, NULL values are generated in the corresponding rows.
To get rid of the NULL values, we need to iterate through our new feature-engineered data frames and set the NaN's to 0 (or 0.26 for the birthdays). For the variables "city", "registered_via" and "reg_year", a 0 in every feature-engineered column means that we do not have any data. For the birthdays, we chose 0.26 (roughly the normalized representation of 24, since 24/90 ≈ 0.267), because 24 is a high-frequency value for the birthdays (this can be concluded by looking at the histogram in the data exploration section).
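
The effect of the left merge can be reproduced on toy data. The sketch below uses made-up frames `toy_train` and `toy_features`; the `indicator=True` argument is an optional addition that is not used in the notebook and serves only to count the users without a match.

```python
import pandas as pd

# Toy stand-ins for train.csv and the feature-engineered members data
toy_train = pd.DataFrame({"msno": ["a", "b", "c"], "is_churn": [0, 1, 0]})
toy_features = pd.DataFrame({"msno": ["a", "c"],
                             "city_1": [1, 0],
                             "bd_norm": [0.4, 0.3]})

# Left merge: every row of toy_train is kept, user "b" has no features
merged = pd.merge(toy_train, toy_features, on="msno", how="left", indicator=True)
missing = int((merged["_merge"] == "left_only").sum())
merged = merged.drop(columns="_merge")

# Fill the gaps: 0 for one-hot columns, 0.26 for the normalized birthday
merged = merged.fillna({"city_1": 0, "bd_norm": 0.26})
print(missing, "user(s) without member data")
print(merged)
```

The number of rows stays equal to the number of rows in the left frame, which is why the merged files have the same length as the train and submission files.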

In [15]:
#save a copy:
final=final1.copy()

In [16]:
#Drop the bd-column, as it is not needed anymore
final=final.drop('bd',axis=1)

In [17]:
#Merge the files together on "msno"
members_train1 = pd.merge(train,final,on='msno',how='left')
members_train2 = pd.merge(train_v2,final,on='msno',how='left')
members_total = pd.merge(submission ,final,on='msno',how='left')
print('merges done!')

merges done!


In [18]:
#check for NaN's after merging
print(members_train1.isnull().sum().sum())
print(members_train2.isnull().sum().sum())
print(members_total.isnull().sum().sum())

3820410
3629769
3708573


In [19]:
#Get rid of null-values
#Fill the NaN's in the three merged data frames

cities = ['city_1','city_3','city_4','city_5','city_6','city_7','city_8','city_9','city_10','city_11','city_12','city_13','city_14','city_15','city_16','city_17','city_18','city_19','city_20','city_21','city_22']
reg_dates=['reg_year_2012','reg_year_2013','reg_year_2014','reg_year_2015','reg_year_2016','reg_year_2017']
reg_meth=['reg_3','reg_4','reg_7','reg_9','reg_11']
zero_fill = cities + reg_dates + reg_meth

for df in (members_train1, members_train2, members_total):
    #one-hot columns: a 0 in every column means that we have no data on the user
    df[zero_fill] = df[zero_fill].fillna(value=0)
    #normalized birthday: 0.26 is roughly 24/90
    df['bd_norm'] = df['bd_norm'].fillna(value=0.26)



In [20]:
#check for NaN's. Are there any NaN's left?
print(members_train1.isnull().sum().sum())
print(members_train2.isnull().sum().sum())
print(members_total.isnull().sum().sum())

0
0
0


In [21]:
members_train1.head()

Unnamed: 0,msno,is_churn,city_1,city_3,city_4,city_5,city_6,city_7,city_8,city_9,...,reg_year_2014,reg_year_2015,reg_year_2016,reg_year_2017,reg_3,reg_4,reg_7,reg_9,reg_11,bd_norm
0,waLDQMmcOu2jLDaV1ddDkgCrB/jl6sD66Xzs0Vqax1Y=,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.4
1,QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.422222
2,fGwBva6hikQmTJzrbz/2Ezjm5Cth5jZUNvXigKK2AFA=,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.3
3,mT5V8rEpa+8wuqi6x0DoVd3H5icMKkE9Prt49UlmK+4=,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.255556
4,XaPhtGLk/5UvvOYHcONTwsnH97P4eGECeq+BARGItRw=,1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.3


In [22]:
#export file, can take up to 1 min
members_train1.to_csv('data/members_feb.csv', index=False)
members_train2.to_csv('data/members_mar.csv', index=False)
members_total.to_csv('data/members_apr.csv', index=False)
print("Done! :)")


Done! :)


In this section, every feature is tested for its performance with XGBoost. That way, we can decide which features to select and which ones to delete again. Furthermore, we can try other benchmarks, such as taking a small sample of the data or dropping only a single feature.

## 5.5 Feature testing in XGBoost:

In [23]:
#save members for the different months
members_feb = members_train1.copy()
members_mar = members_train2.copy()
members_apr = members_total.copy()

In [24]:
#print out the shapes
print(members_feb.shape)
print(members_mar.shape)
print(members_apr.shape)

(992931, 35)
(970960, 35)
(907471, 35)


Here we create two variables ul_1 and ul_2. These variables contain a sample of "members_feb" and "members_mar" respectively. With frac=1, as in the cell below, the sample is a shuffled copy of the full data. Choosing a smaller fraction generates smaller train and test files, which speeds up testing and shows how the algorithm reacts to smaller data samples.
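
A short sketch of the difference between the two uses of `sample`, on a made-up frame. The fraction of 0.1 and the `random_state` are illustrative assumptions; the notebook itself uses frac=1 without a fixed seed.

```python
import pandas as pd

# Toy frame with 1000 users (made up for illustration)
toy = pd.DataFrame({"msno": [f"user_{i}" for i in range(1000)],
                    "is_churn": [int(i % 10 == 0) for i in range(1000)]})

# frac=1 returns all rows in random order (a shuffle)
shuffled = toy.sample(frac=1, random_state=0)

# A smaller fraction returns a random subset, here 10% of the rows
subset = toy.sample(frac=0.1, random_state=0)

print(shuffled.shape, subset.shape)
```

Fixing `random_state` makes the sample reproducible, which helps when logloss values from different runs are compared.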

In [25]:
print(members_feb.shape)
ul_1 = members_feb.sample(frac=1)
print(ul_1.shape)

print(members_mar.shape)
ul_2 = members_mar.sample(frac=1)
print(ul_2.shape)


(992931, 35)
(992931, 35)
(970960, 35)
(970960, 35)


In [31]:
#import xgboost and logloss
import xgboost as xgb
from sklearn.metrics import log_loss

In [32]:
#Function that calls XGBoost
# Run XGBoost
def xgb_boost(train,test,importance,depth,eta,num_it):
    start = time.time()
    if ('msno' in train.columns):
        train = train.drop('msno',axis=1)
    if ('msno' in test.columns):
        test = test.drop('msno',axis=1)
    X_train = train.drop('is_churn',axis=1)
    y_train = train['is_churn']
    X_test = test.drop('is_churn',axis=1)
    y_test = test['is_churn']
    
    dtrain = xgb.DMatrix(X_train, label = y_train)
    dtest = xgb.DMatrix(X_test, label = y_test)
    param = {
        #'max_depth': 3,  # the maximum depth of each tree. Try with max_depth: 2 to 10.
        'max_depth': depth,
        #'eta': 0.3,  # the training step for each iteration. Try with ETA: 0.1, 0.2, 0.3...
        'eta': eta,
        'silent': 1,  # logging mode - quiet
        'objective': 'multi:softprob',  # returns one probability per class
        'num_class': 3}  # the number of classes passed to XGBoost (is_churn itself only takes the values 0 and 1)
    #num_round = 20  # the number of training iterations. Try with num_round around few hundred!
    num_round = num_it
    #----------------
    bst = xgb.train(param, dtrain, num_round)
    y_pred_xgb = bst.predict(dtest)
    best_preds = np.asarray([np.argmax(line) for line in y_pred_xgb])
    y_pred_xgb = y_pred_xgb[:,1] #Column 2 out of 3
    print("(probs) Logloss for XGD Boost is: %.6f"%log_loss(y_test,y_pred_xgb))
    
    if (importance == 1):
        xgb.plot_importance(bst,max_num_features=10)
        plt.show()
    
    #y_pred_xgb[y_pred_xgb>=0.5] = 1
    #y_pred_xgb[y_pred_xgb<0.5] = 0
    #print("(1 or 0) Logloss for XGD Boost is: %.6f"%log_loss(y_test,y_pred_xgb))

    print('XGB Time = %.0f'%(time.time() - start))
    
    return y_pred_xgb

In [48]:
#call xgboost with our train and test data
#can be used to test out the parameters of xgboost
y_pred_xgb=xgb_boost(ul_1,ul_2,0,3,0.3,40)
print('### over-fitting check: ###')

#xgb_boost(ul_sample_1,ul_sample_1)

(probs) Logloss for XGD Boost is: 0.282287
XGB Time = 37
### over-fitting check: ###


In the next section, we run XGBoost with every feature except one. That way, we can see whether the feature in question actually improves the score. If our logloss gets worse without the feature, we can conclude that the feature is important and should be included in our prediction model. We create a loop that iterates through every single feature and drops it. XGBoost is called without the feature and the logloss is printed out. This is repeated until every feature has been tested.
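
The structure of this test can be sketched independently of XGBoost. In the example below, `leave_one_out` and `toy_score` are hypothetical names: `toy_score` is a made-up stand-in for "train a model on these columns and return the logloss", so the numbers carry no meaning beyond illustrating the loop.

```python
from typing import Callable, Dict, List


def leave_one_out(features: List[str],
                  score_fn: Callable[[List[str]], float]) -> Dict[str, float]:
    """Return, per feature, how much the loss changes when it is removed."""
    baseline = score_fn(features)
    deltas = {}
    for feature in features:
        reduced = [f for f in features if f != feature]
        # Positive delta: the loss got worse without the feature
        deltas[feature] = score_fn(reduced) - baseline
    return deltas


def toy_score(columns: List[str]) -> float:
    """Hypothetical loss: pretends that two of the columns are useful."""
    loss = 0.30
    if "bd_norm" in columns:
        loss -= 0.02
    if "reg_7" in columns:
        loss -= 0.01
    return loss


deltas = leave_one_out(["city_1", "reg_7", "bd_norm"], toy_score)
for name, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {delta:+.4f}")
```

Reporting the difference to a baseline with all features, rather than the raw logloss, makes it easier to see which removals matter.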

In [35]:
# columns selection
eng_columns = np.array(list(ul_1.drop(columns=['is_churn','msno'],axis=1)))

for i in range(0,len(eng_columns)):
    inpt = eng_columns[i]
    data_input = ul_1.drop(inpt,axis=1)
    data_output = ul_2.drop(inpt,axis=1)
    print('\n###',inpt,'removed ###')

    xgb_boost(data_input,data_output,0,3,0.3,20)



### city_1 removed ###
(probs) Logloss for XGD Boost is: 0.285027
XGB Time = 26

### city_3 removed ###
(probs) Logloss for XGD Boost is: 0.284205
XGB Time = 25

### city_4 removed ###
(probs) Logloss for XGD Boost is: 0.284197
XGB Time = 26

### city_5 removed ###
(probs) Logloss for XGD Boost is: 0.284205
XGB Time = 25

### city_6 removed ###
(probs) Logloss for XGD Boost is: 0.284205
XGB Time = 21

### city_7 removed ###
(probs) Logloss for XGD Boost is: 0.284205
XGB Time = 21

### city_8 removed ###
(probs) Logloss for XGD Boost is: 0.284205
XGB Time = 22

### city_9 removed ###
(probs) Logloss for XGD Boost is: 0.284205
XGB Time = 21

### city_10 removed ###
(probs) Logloss for XGD Boost is: 0.284205
XGB Time = 21

### city_11 removed ###
(probs) Logloss for XGD Boost is: 0.284205
XGB Time = 22

### city_12 removed ###
(probs) Logloss for XGD Boost is: 0.284205
XGB Time = 21

### city_13 removed ###
(probs) Logloss for XGD Boost is: 0.284232
XGB Time = 21

### city_14 removed ###

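In this section, we test every feature alone to see its immediate effect on the prediction. Similarly to the previous part, we create a loop that iterates through our features. Note that in the cell below both the train and the test input are taken from ul_1, so the reported logloss is measured on the training data.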
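
To illustrate what a single-feature test measures, the sketch below swaps XGBoost for a much simpler model: it predicts the churn rate observed in the training data for each value of the feature and scores the prediction with the logloss on a separate test frame. The helper names and the toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def log_loss_binary(y_true, p, eps=1e-15):
    """Binary logloss; probabilities are clipped to avoid log(0)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def single_feature_logloss(train, test, feature, target="is_churn"):
    """Predict the per-value churn rate learned on train, score it on test."""
    rates = train.groupby(feature)[target].mean()
    overall = train[target].mean()  # fallback for values unseen in train
    preds = test[feature].map(rates).fillna(overall)
    return log_loss_binary(test[target], preds)


# Toy data: city_1 is related to churn, reg_7 is not
train_toy = pd.DataFrame({"city_1":   [1, 1, 1, 1, 0, 0, 0, 0],
                          "reg_7":    [1, 0, 1, 0, 1, 0, 1, 0],
                          "is_churn": [1, 1, 1, 0, 0, 0, 0, 1]})
test_toy = pd.DataFrame({"city_1":   [1, 1, 0, 0],
                         "reg_7":    [1, 0, 1, 0],
                         "is_churn": [1, 1, 0, 0]})

scores = {f: single_feature_logloss(train_toy, test_toy, f)
          for f in ["city_1", "reg_7"]}
print(scores)
```

A feature that carries no information about churn ends up close to the logloss of a constant prediction, whereas an informative feature scores lower.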

In [37]:
# columns selection
eng_columns = np.array(list(ul_1.drop(columns=['is_churn','msno'],axis=1)))

for i in range(0,len(eng_columns)):
    inpt = eng_columns[i]
    data_input = ul_1[[inpt,'is_churn']]
    data_output = ul_1[[inpt,'is_churn']]
    print('\n###',inpt,'###')
    xgb_boost(data_input,data_output,0,3,0.3,20)



### city_1 ###
(probs) Logloss for XGD Boost is: 0.236006
XGB Time = 6

### city_3 ###
(probs) Logloss for XGD Boost is: 0.237730
XGB Time = 6

### city_4 ###
(probs) Logloss for XGD Boost is: 0.237505
XGB Time = 6

### city_5 ###
(probs) Logloss for XGD Boost is: 0.237413
XGB Time = 6

### city_6 ###
(probs) Logloss for XGD Boost is: 0.237644
XGB Time = 6

### city_7 ###
(probs) Logloss for XGD Boost is: 0.237744
XGB Time = 5

### city_8 ###
(probs) Logloss for XGD Boost is: 0.237683
XGB Time = 6

### city_9 ###
(probs) Logloss for XGD Boost is: 0.237724
XGB Time = 6

### city_10 ###
(probs) Logloss for XGD Boost is: 0.237708
XGB Time = 5

### city_11 ###
(probs) Logloss for XGD Boost is: 0.237716
XGB Time = 5

### city_12 ###
(probs) Logloss for XGD Boost is: 0.237649
XGB Time = 6

### city_13 ###
(probs) Logloss for XGD Boost is: 0.237554
XGB Time = 6

### city_14 ###
(probs) Logloss for XGD Boost is: 0.237710
XGB Time = 5

### city_15 ###
(probs) Logloss for XGD Boost is: 0.237600

In [50]:
#Prepare submission file
#y_pred_xgb = xgb_boost(total_data,user_logs_apr,0,1)
my_submission = pd.DataFrame({'msno': members_apr.msno, 'is_churn': y_pred_xgb})
print(my_submission.head())
cols = my_submission.columns.tolist()
cols = cols[-1:] + cols[:-1]
my_submission = my_submission[cols]
print(my_submission.head())
print(my_submission.count())

my_submission.to_csv('submission.csv', index=False)
print('Done! :-)')

  is_churn                                          msno
0     None  4n+fXlyJvfQnTeKXTWT507Ll4JVYGrOC8LHCfwBmPE4=
1     None  aNmbC1GvFUxQyQUidCVmfbQ0YeCuwkPzEdQ0RwWyeZM=
2     None  rFC9eSG/tMuzpre6cwcMLZHEYM89xY02qcz7HL4//jc=
3     None  WZ59dLyrQcE7ft06MZ5dj40BnlYQY7PHgg/54+HaCSE=
4     None  aky/Iv8hMp1/V/yQHLtaVuEmmAxkB5GuasQZePJ7NU4=
                                           msno is_churn
0  4n+fXlyJvfQnTeKXTWT507Ll4JVYGrOC8LHCfwBmPE4=     None
1  aNmbC1GvFUxQyQUidCVmfbQ0YeCuwkPzEdQ0RwWyeZM=     None
2  rFC9eSG/tMuzpre6cwcMLZHEYM89xY02qcz7HL4//jc=     None
3  WZ59dLyrQcE7ft06MZ5dj40BnlYQY7PHgg/54+HaCSE=     None
4  aky/Iv8hMp1/V/yQHLtaVuEmmAxkB5GuasQZePJ7NU4=     None
msno        907471
is_churn         0
dtype: int64
Done! :-)
