**Team name : *Maxime et les garçons, à table***

**Alexis Carpier, Maxime Seince, Victor Perroux, Théau Pihouée** 

# Kaggle challenge: all the feature engineering strategies we attempted

Here are all the data processing methods that we attempted but that did not work well enough to be kept in our final model (which you can see in another notebook called "Kaggle_VFinal"). 

We still want to show these attempts because they reveal how our understanding of the data evolved and how it led us to the processing strategy that performed best.  

# Import libraries

In [8]:
import pandas as pd
import numpy as np

from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score, log_loss

# Reading files

## Import the data 

In [9]:
train_df = pd.read_csv('train_ml.csv', index_col=0)
test_df = pd.read_csv('test_ml.csv', index_col=0)

In [11]:
train_df.head(3)

Unnamed: 0,date,org,tld,ccs,bcced,mail_type,images,urls,salutations,designation,chars_in_subject,chars_in_body,updates,personal,promotions,forums,purchases,travel,spam,social
0,"Mon, 15 Oct 2018 08:03:09 +0000 (UTC)",researchgatemail,net,0,0,multipart/alternative,4,28,0,1,47.0,25556,0,1,0,0,0,0,0,1
1,"Thu, 17 Apr 2014 09:12:33 -0700 (PDT)",no-ip,com,0,0,multipart/alternative,6,32,0,0,46.0,19930,1,1,0,0,0,0,0,0
2,"Thu, 27 Oct 2016 01:36:28 +0000",mail,goodreads.com,0,0,multipart/mixed,0,0,0,0,21.0,4,0,1,0,0,0,0,0,1


In [12]:
test_df.head()

Unnamed: 0,date,org,tld,ccs,bcced,mail_type,images,urls,salutations,designation,chars_in_subject,chars_in_body
0,"Thu, 13 Jul 2017 08:55:57 +0000",twitter,com,0,0,multipart/alternative,7,56,0,0,67.0,36243
1,"Sun, 30 Sep 2018 14:42:12 +0000",mailer,netflix.com,0,0,multipart/alternative,5,33,0,0,27.0,27015
2,"Mon, 13 Feb 2017 10:47:00 +0530",iiitd,ac.in,0,0,text/plain,0,2,1,0,22.0,788
3,"Thu, 16 Jun 2016 09:56:23 +0000",twitter,com,0,0,multipart/alternative,8,53,0,0,79.0,39504
4,"Mon, 18 Apr 2016 01:51:59 +0530",iiitd,ac.in,0,0,multipart/mixed,0,0,0,0,24.0,178773


## Description of the data 

Let's see how the data is structured : 

### General description 

In [13]:
# Number of distinct email senders ('org' column)
senders = train_df['org'] 
senders_unique = senders.drop_duplicates()
print("There are", senders_unique.count(), "distinct email senders.")

# Number of distinct organizations ('tld' column)
organizations = train_df['tld']
organizations_unique = organizations.drop_duplicates()
print("There are", organizations_unique.count(), "distinct organizations.")

# Mean number of images per email
images = train_df['images']
print("Mean number of images per email:", images.mean()) 

# Mean number of URL links per email
urls = train_df['urls']
print("Mean number of URL links per email:", urls.mean())

# Number of emails in the dataframe that contain a salutation
train_df_salutations = train_df[train_df['salutations']==1]
salutations = train_df_salutations['salutations']
print("Total number of salutations in the dataframe:", salutations.count())

# Number of emails in the dataframe that contain a designation
train_df_designations = train_df[train_df['designation']==1]
designations = train_df_designations['designation']
print("Total number of designations in the dataframe:", designations.count())

# Mean number of characters in the email subject
carac_subject = train_df['chars_in_subject']
print("Mean number of characters in the subject of each email:", carac_subject.mean()) 

# Mean number of characters in the email body
carac_body = train_df['chars_in_body']
print("Mean number of characters in the body of each email:", carac_body.mean())

There are 973 distinct email senders.
There are 271 distinct organizations.
Mean number of images per email: 9.806332081369263
Mean number of URL links per email: 36.73108820044869
Total number of salutations in the dataframe: 15700
Total number of designations in the dataframe: 4059
Mean number of characters in the subject of each email: 51.44203227433182
Mean number of characters in the body of each email: 232178.08081470092


### Description for each label

**Updates**

In [14]:
updates_email = train_df[train_df['updates']==1]
updates_email.describe()

Unnamed: 0,ccs,bcced,images,urls,salutations,designation,chars_in_subject,chars_in_body,updates,personal,promotions,forums,purchases,travel,spam,social
count,14377.0,14377.0,14377.0,14377.0,14377.0,14377.0,14376.0,14377.0,14377.0,14377.0,14377.0,14377.0,14377.0,14377.0,14377.0,14377.0
mean,0.052445,0.000487,6.384503,30.850247,0.294915,0.079989,54.510573,38414.9,1.0,0.719969,0.026292,0.000696,0.022884,0.006121,0.005078,0.000348
std,0.599948,0.022061,9.749108,38.125525,0.456021,0.271285,32.481024,413629.7,0.0,0.449029,0.160008,0.026365,0.149538,0.077999,0.071078,0.018646
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,10.0,0.0,0.0,36.0,5668.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,4.0,19.0,0.0,0.0,44.0,19013.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,7.0,36.0,1.0,0.0,63.0,29035.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
max,22.0,1.0,116.0,617.0,1.0,1.0,423.0,42756900.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


**Personal**

In [15]:
personal_email = train_df[train_df['personal']==1]
personal_email.describe()

Unnamed: 0,ccs,bcced,images,urls,salutations,designation,chars_in_subject,chars_in_body,updates,personal,promotions,forums,purchases,travel,spam,social
count,32118.0,32118.0,32118.0,32118.0,32118.0,32118.0,32108.0,32118.0,32118.0,32118.0,32118.0,32118.0,32118.0,32118.0,32118.0,32118.0
mean,0.432779,0.003487,10.016906,34.564979,0.423563,0.103867,49.537312,269987.3,0.32228,1.0,0.192353,0.150072,0.008936,0.002553,0.000747,0.111651
std,2.791944,0.05895,489.406124,158.099075,0.494131,0.305093,33.563227,2476317.0,0.467357,0.0,0.394155,0.357147,0.094108,0.050464,0.027326,0.314941
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,3.0,0.0,0.0,29.0,4435.5,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,2.0,16.0,0.0,0.0,42.0,19208.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,8.0,44.0,1.0,0.0,61.0,47225.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
max,155.0,1.0,83480.0,21540.0,1.0,1.0,528.0,74381080.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


**Promotions**

In [16]:
promotions_email = train_df[train_df['promotions']==1]
promotions_email.describe()

Unnamed: 0,ccs,bcced,images,urls,salutations,designation,chars_in_subject,chars_in_body,updates,personal,promotions,forums,purchases,travel,spam,social
count,7925.0,7925.0,7925.0,7925.0,7925.0,7925.0,7925.0,7925.0,7925.0,7925.0,7925.0,7925.0,7925.0,7925.0,7925.0,7925.0
mean,0.005678,0.000126,14.167319,65.218549,0.281009,0.097539,56.477603,55734.65,0.047697,0.779558,1.0,0.000379,0.0,0.0,0.00694,0.0
std,0.086102,0.011233,13.242078,57.10563,0.44952,0.29671,30.741118,246932.2,0.213138,0.414571,0.0,0.019454,0.0,0.0,0.083023,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,4.0,25.0,0.0,0.0,37.0,20594.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,11.0,50.0,0.0,0.0,50.0,39910.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,21.0,93.0,1.0,0.0,68.0,67704.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
max,4.0,1.0,178.0,662.0,1.0,1.0,497.0,9388789.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0


**Forums**

In [17]:
forums_email = train_df[train_df['forums']==1]
forums_email.describe()

Unnamed: 0,ccs,bcced,images,urls,salutations,designation,chars_in_subject,chars_in_body,updates,personal,promotions,forums,purchases,travel,spam,social
count,6181.0,6181.0,6181.0,6181.0,6181.0,6181.0,6181.0,6181.0,6181.0,6181.0,6181.0,6181.0,6181.0,6181.0,6181.0,6181.0
mean,1.661705,0.004045,0.568193,7.397508,0.585989,0.081702,49.701505,384791.0,0.001618,0.779809,0.000485,1.0,0.0,0.0,0.0,0.0
std,5.684528,0.063474,3.694667,32.430444,0.49259,0.273932,33.410302,2580932.0,0.040193,0.414409,0.022027,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,30.0,2746.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
50%,1.0,0.0,0.0,3.0,1.0,0.0,40.0,7033.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
75%,2.0,0.0,0.0,9.0,1.0,0.0,59.0,17550.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
max,155.0,1.0,200.0,2234.0,1.0,1.0,415.0,58733750.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0


**Purchases**

In [18]:
purchases_email = train_df[train_df['purchases']==1]
purchases_email.describe()

Unnamed: 0,ccs,bcced,images,urls,salutations,designation,chars_in_subject,chars_in_body,updates,personal,promotions,forums,purchases,travel,spam,social
count,329.0,329.0,329.0,329.0,329.0,329.0,329.0,329.0,329.0,329.0,329.0,329.0,329.0,329.0,329.0,329.0
mean,0.009119,0.0,6.753799,45.303951,0.136778,0.0,67.082067,33489.531915,1.0,0.87234,0.0,0.0,1.0,0.0,0.0,0.0
std,0.095199,0.0,6.583205,34.350134,0.344136,0.0,26.736108,31987.175414,0.0,0.334219,0.0,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,21.0,4.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
25%,0.0,0.0,2.0,22.0,0.0,0.0,49.0,16286.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
50%,0.0,0.0,6.0,43.0,0.0,0.0,61.0,27096.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
75%,0.0,0.0,9.0,60.0,0.0,0.0,96.0,43714.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
max,1.0,0.0,60.0,281.0,1.0,0.0,135.0,295175.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0


**Travel**

In [19]:
travel_email = train_df[train_df['travel']==1]
travel_email.describe()

Unnamed: 0,ccs,bcced,images,urls,salutations,designation,chars_in_subject,chars_in_body,updates,personal,promotions,forums,purchases,travel,spam,social
count,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,0.01,0.0,4.65,21.02,0.53,0.0,68.81,130716.0,0.88,0.82,0.0,0.0,0.0,1.0,0.0,0.0
std,0.1,0.0,7.847505,28.616228,0.501614,0.0,38.057027,227783.7,0.326599,0.386123,0.0,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,18.0,8.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
25%,0.0,0.0,0.0,2.0,0.0,0.0,45.0,13831.75,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
50%,0.0,0.0,0.0,6.0,1.0,0.0,71.0,46711.5,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
75%,0.0,0.0,7.0,31.25,1.0,0.0,74.0,110450.5,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
max,1.0,0.0,32.0,121.0,1.0,0.0,284.0,1272092.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0


**Spam**

In [20]:
spam_email = train_df[train_df['spam']==1]
spam_email.describe()

Unnamed: 0,ccs,bcced,images,urls,salutations,designation,chars_in_subject,chars_in_body,updates,personal,promotions,forums,purchases,travel,spam,social
count,152.0,152.0,152.0,152.0,152.0,152.0,152.0,152.0,152.0,152.0,152.0,152.0,152.0,152.0,152.0,152.0
mean,0.0,0.0,7.171053,34.578947,0.203947,0.013158,54.947368,23008.789474,0.480263,0.157895,0.361842,0.0,0.0,0.0,1.0,0.0
std,0.0,0.0,9.389847,53.886258,0.404262,0.114327,23.516315,26036.211159,0.501262,0.365848,0.482122,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,16.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,0.0,0.0,0.75,6.75,0.0,0.0,38.0,2183.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
50%,0.0,0.0,3.0,13.0,0.0,0.0,49.0,11493.5,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
75%,0.0,0.0,10.0,28.0,0.0,0.0,62.0,36664.5,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
max,0.0,0.0,34.0,212.0,1.0,1.0,138.0,105529.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0


**Social**

In [21]:
social_email = train_df[train_df['social']==1]
social_email.describe()

Unnamed: 0,ccs,bcced,images,urls,salutations,designation,chars_in_subject,chars_in_body,updates,personal,promotions,forums,purchases,travel,spam,social
count,4005.0,4005.0,4005.0,4005.0,4005.0,4005.0,4005.0,4005.0,4005.0,4005.0,4005.0,4005.0,4005.0,4005.0,4005.0,4005.0
mean,0.0,0.0,13.322347,83.941573,0.275905,0.257179,73.619476,59427.322347,0.001248,0.895381,0.0,0.0,0.0,0.0,0.0,1.0
std,0.0,0.0,11.218594,59.386924,0.447025,0.437133,37.852176,34590.307703,0.035316,0.3061,0.0,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,11.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,0.0,0.0,6.0,36.0,0.0,0.0,45.0,29905.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,0.0,0.0,11.0,77.0,0.0,0.0,65.0,57440.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,0.0,0.0,17.0,111.0,1.0,1.0,97.0,88915.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
max,0.0,0.0,122.0,386.0,1.0,1.0,528.0,211580.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0


**We notice that the dataset is highly unbalanced: some labels appear more than 30,000 times (personal: 32,118 emails), whereas others appear only 100 to about 150 times (travel: 100, spam: 152).**
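
The same imbalance can be read directly from the label columns, without going through one `describe()` per label. This is a minimal sketch, assuming the eight label columns are named as in the training frame above; the helper `label_counts` and the small `toy` frame are made up for illustration and stand in for `train_df`.

```python
import pandas as pd

LABELS = ['updates', 'personal', 'promotions', 'forums',
          'purchases', 'travel', 'spam', 'social']

def label_counts(df, labels=LABELS):
    """Return the number of rows carrying each label, most frequent first."""
    return df[labels].sum().sort_values(ascending=False)

# Toy frame (hypothetical values) standing in for train_df.
toy = pd.DataFrame(0, index=range(4), columns=LABELS)
toy.loc[[0, 1, 2], 'personal'] = 1
toy.loc[[1], 'updates'] = 1

counts = label_counts(toy)
print(counts.head(3))
```

On the real data, `label_counts(train_df)` should give the same numbers as the `count` rows of the tables above.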

# Transforming the multilabel problem into a multiclass problem 

The first idea that came to mind was to transform this multilabel problem (each instance can have several labels) into a multiclass problem (each instance belongs to exactly one class), by treating each combination of labels as its own class. 

In [22]:
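def recup_category(df, row):
    """Return the labels of row number `row` joined by ' & ' (e.g. 'updates & personal')."""
    category = ''
    # Only handle the 20-column training frame: columns 12 to 19 hold the eight labels.
    if df.iloc[row, :].shape == (20,):
        for i in range(12, 20):
            if df.iloc[row, i] == 1:
                if len(category) != 0:
                    category = category + ' & ' + list(df)[i]
                else:
                    category = list(df)[i]
    return category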

In [23]:
# Count how many training rows carry each combination of labels.
comptage_outputs = {}
n = train_df.shape[0]
for i in range(n):
    category = recup_category(train_df, i)
    if category in comptage_outputs:
        comptage_outputs[category] += 1
    else:
        comptage_outputs[category] = 1

# Print the combinations from the rarest to the most frequent.
for k, v in sorted(comptage_outputs.items(), key=lambda x: x[1]):
    print("%s: %s" % (k, v))

print(comptage_outputs)

personal & promotions & forums: 3
updates & personal & social: 5
updates & personal & forums: 10
personal & travel: 12
updates & travel: 18
personal & spam: 24
updates & purchases: 42
promotions & spam: 55
updates & personal & travel: 70
updates & spam: 73
updates & personal & purchases: 287
updates & personal & promotions: 378
social: 419
forums: 1361
promotions: 1692
personal & social: 3581
updates: 3893
personal & forums: 4807
personal & promotions: 5797
personal: 7543
updates & personal: 9601
{'personal & social': 3581, 'updates & personal': 9601, 'promotions': 1692, 'personal': 7543, 'personal & promotions': 5797, 'personal & forums': 4807, 'updates': 3893, 'forums': 1361, 'social': 419, 'updates & personal & promotions': 378, 'updates & spam': 73, 'updates & personal & purchases': 287, 'updates & personal & travel': 70, 'updates & purchases': 42, 'personal & spam': 24, 'promotions & spam': 55, 'updates & personal & social': 5, 'updates & personal & forums': 10, 'updates & travel'

However, this approach is too complicated and does not fit our problem well. The imbalance is still very present: some classes have fewer than 10 samples while others have over 7,000. Another big issue is that a label combination may appear in the test set but not in the training set, in which case our algorithm cannot predict this unseen class. Finally, additional work is needed to convert the predictions back to the required form (multiclass => multilabel). 

For all these reasons, we decided to drop this method. 
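
For reference, the two conversions involved (multilabel => multiclass for training, multiclass => multilabel for the submission) can be sketched as follows. This is only an illustration, not the code we ran: the helper names `to_multiclass` and `to_multilabel` and the `toy` frame are hypothetical, and the sketch assumes the label columns are named as in the training set.

```python
import pandas as pd

LABELS = ['updates', 'personal', 'promotions', 'forums',
          'purchases', 'travel', 'spam', 'social']

def to_multiclass(df, labels=LABELS):
    """Join the active labels of each row into one class string."""
    active = df[labels].eq(1)
    return active.apply(lambda row: ' & '.join(row.index[row.to_numpy()]), axis=1)

def to_multilabel(classes, labels=LABELS):
    """Turn class strings back into one 0/1 indicator column per label."""
    rows = []
    for c in classes:
        present = set(c.split(' & ')) if c else set()
        rows.append([int(label in present) for label in labels])
    return pd.DataFrame(rows, columns=labels, index=classes.index)

# Toy frame (hypothetical values) standing in for the label columns of train_df.
toy = pd.DataFrame(0, index=range(3), columns=LABELS)
toy.loc[0, ['updates', 'personal']] = 1
toy.loc[1, 'personal'] = 1
toy.loc[2, ['personal', 'social']] = 1

classes = to_multiclass(toy)
decoded = to_multilabel(classes)
print(classes.tolist())
```

The round trip only works for combinations that can be written as a class string, which is exactly why a combination never seen during training cannot be predicted.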

# Balancing the data : Oversampling and undersampling 

## Undersampling 

In [26]:
# Drop the 'updates & personal' rows whose index lies between 39670 and 23001.
for i in range(39670,23000,-1):
    if train_df.loc[i,'updates'] == 1 and train_df.loc[i,'personal'] == 1:
        train_df.drop(index=i, inplace=True)
# Renumber the rows from 0 so that the next range refers to the reduced dataframe.
train_df = train_df.reset_index(drop=True)

# Drop the 'personal' rows whose index lies between 35383 and 22215.
for i in range(35383,22214,-1):
    if train_df.loc[i,'personal'] == 1:
        train_df.drop(index=i, inplace=True)
train_df = train_df.reset_index(drop=True)

# Drop the 'personal & promotions' rows whose index lies between 25546 and 20067.
for i in range(25546,20066,-1):
    if train_df.loc[i,'personal'] == 1 and train_df.loc[i,'promotions'] == 1:
        train_df.drop(index=i, inplace=True)
train_df = train_df.reset_index(drop=True)
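
Each of these undersampling steps can also be written without a Python loop. The sketch below is a vectorized version of one step, under the assumption that the dataframe has a plain integer index; the helper name `drop_matching_rows` and the `toy` frame are hypothetical.

```python
import pandas as pd

def drop_matching_rows(df, first, last, labels):
    """Drop rows whose index is in [first, last] and whose `labels` columns are all 1."""
    in_range = (df.index >= first) & (df.index <= last)
    has_labels = df[labels].eq(1).all(axis=1)
    return df[~(has_labels & in_range)].reset_index(drop=True)

# Toy frame (hypothetical values) standing in for train_df.
toy = pd.DataFrame({'updates':  [1, 1, 0, 1, 1],
                    'personal': [1, 1, 1, 0, 1]})

# Only row 1 is both inside the range 1..3 and labeled 'updates & personal'.
reduced = drop_matching_rows(toy, first=1, last=3, labels=['updates', 'personal'])
print(reduced)
```

With this helper, a loop over `range(39670,23000,-1)` on `updates` and `personal` would correspond to `drop_matching_rows(train_df, 23001, 39670, ['updates', 'personal'])`.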

## Oversampling

In [None]:
def duplicate_rows(df, labels, times):
    # Store the indices of the rows whose `labels` columns are all equal to 1
    L=[]
    for i in range(df.shape[0]):
        if all(df.loc[i,label] == 1 for label in labels):
            L.append(i)
    # Append `times` copies of each of these rows to the dataframe
    for i in times*L:
        df.loc[df.shape[0]]=df.iloc[i,:]

duplicate_rows(train_df, ['updates','personal','social'], 100)
duplicate_rows(train_df, ['updates','personal','forums'], 50)
duplicate_rows(train_df, ['personal','travel'], 42)
duplicate_rows(train_df, ['updates','travel'], 28)
duplicate_rows(train_df, ['personal','spam'], 21)
duplicate_rows(train_df, ['updates','purchases'], 12)
duplicate_rows(train_df, ['promotions','spam'], 10)
duplicate_rows(train_df, ['updates','personal','travel'], 7)
duplicate_rows(train_df, ['updates','personal','travel'], 7)
duplicate_rows(train_df, ['updates','spam'], 7)
duplicate_rows(train_df, ['updates','personal','purchases'], 2)
duplicate_rows(train_df, ['updates','personal','promotions'], 2)
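
Appending rows one at a time is slow on a dataframe of this size. A possible alternative, sketched here with a hypothetical helper `oversample_concat` and a made-up `toy` frame, selects the rows once and appends all the copies with a single `pd.concat`. The resulting rows are the same as with the loop, only their order at the end of the dataframe differs.

```python
import pandas as pd

def oversample_concat(df, labels, times):
    """Return df with `times` extra copies of every row whose `labels` columns are all 1."""
    selected = df[df[labels].eq(1).all(axis=1)]
    return pd.concat([df] + [selected] * times, ignore_index=True)

# Toy frame (hypothetical values) standing in for train_df.
toy = pd.DataFrame({'personal': [1, 0, 1],
                    'travel':   [1, 0, 0]})

# Row 0 is the only 'personal & travel' row: it is copied 3 more times.
balanced = oversample_concat(toy, ['personal', 'travel'], times=3)
print(balanced)
```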

# Creating a feature Week-End

In [None]:
# Extract the day of the week and split it into weekend/weekday (to be done before extracting the hour!)

WeekEnd = ['Sat', 'Sun']
train_df['WeekEnd'] = 0
dates = train_df['date'].to_list()

for i, date in enumerate(dates) :
    
    # The date string starts with the abbreviated day name, e.g. 'Mon, 15 Oct 2018 ...'
    day = date[:3]
    
    if day in WeekEnd :
        # Use .loc rather than chained indexing so that the assignment reaches train_df
        train_df.loc[i, 'WeekEnd'] = 1
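
The same feature can be computed for the whole column at once with the pandas string accessor. This is a minimal sketch, assuming every date starts with a three-letter English day abbreviation as in the rows shown earlier; the helper name `add_weekend_flag` and the third date of the `toy` frame are made up for illustration.

```python
import pandas as pd

def add_weekend_flag(df, date_col='date'):
    """Return a copy of df with a 0/1 'WeekEnd' column derived from the day abbreviation."""
    out = df.copy()
    day = out[date_col].str[:3]
    out['WeekEnd'] = day.isin(['Sat', 'Sun']).astype(int)
    return out

# Toy frame standing in for train_df (the last date is hypothetical).
toy = pd.DataFrame({'date': ['Mon, 15 Oct 2018 08:03:09 +0000 (UTC)',
                             'Sun, 30 Sep 2018 14:42:12 +0000',
                             'Sat, 01 Jan 2022 00:00:00 +0000']})

flagged = add_weekend_flag(toy)
print(flagged['WeekEnd'].tolist())
```

Missing dates simply get a 0, because `isin` returns `False` for them.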

# Encoding tld and org 

In [None]:
## Rem

In [None]:
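def replace_similar_terms(df, column, string_to_match):
    """Replace, in place, every value of `column` that contains `string_to_match` by `string_to_match`."""
    
    terms = df[column].unique()
    
    # Missing values show up as float NaN: turn them into a string so that `in` works below
    for k, term in enumerate(terms) :
        if isinstance(term, float):
            terms[k]='None'
    
    # Distinct values that contain the string we are looking for
    matches = [term for term in terms if string_to_match in term]
    
    rows_with_matches = df[column].isin(matches)
    
    df.loc[rows_with_matches, column] = string_to_match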

In [None]:
replace_similar_terms(train_df, 'org', 'mail')
replace_similar_terms(train_df, 'org', 'letter')
replace_similar_terms(train_df, 'org', 'info')
replace_similar_terms(train_df, 'org', 'recruit')
replace_similar_terms(train_df, 'org', 'news')
replace_similar_terms(train_df, 'org', 'work')
replace_similar_terms(train_df, 'org', 'code')
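
The order of these calls matters: a value such as `newsletter` is turned into `letter` by the second call, so the later `news` call no longer sees it. The sketch below shows the same idea with `Series.str.contains`; the helper name `collapse_terms` and the `toy` values are hypothetical (only `researchgatemail` and `twitter` come from the rows shown earlier).

```python
import pandas as pd

def collapse_terms(series, keywords):
    """Replace every value containing a keyword by that keyword, applying keywords in order."""
    out = series.copy()
    for keyword in keywords:
        # regex=False: plain substring test; na=False: missing values never match
        contains = out.str.contains(keyword, regex=False, na=False)
        out[contains] = keyword
    return out

# Toy column standing in for train_df['org'].
toy = pd.Series(['researchgatemail', 'newsletter', 'news', None, 'twitter'])

collapsed = collapse_terms(toy, ['mail', 'letter', 'news'])
print(collapsed.tolist())
```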

### For Org

In [None]:
# Create a column that keeps only the org values with at least 180 occurrences.
# Assumption: list_labels is not defined in this notebook; it is taken here to be the value counts of 'org'.
list_labels = train_df['org'].value_counts()
label_count = list_labels.index.tolist()
label_count_180 = []
for i in range(len(label_count)):
    if list_labels.iloc[i]>=180: 
        label_count_180.append(label_count[i])
print(label_count_180)

# Keep the frequent values and leave NaN everywhere else
train_df['org_180'] = train_df['org'].where(train_df['org'].isin(label_count_180))
            
# One-hot encode the remaining values
df_dummies = pd.get_dummies(train_df['org_180'])

train_df = train_df.join(df_dummies)

### For tld 

In [None]:
# Create a column that keeps only the tld values with at least 100 occurrences.
# Assumption: list_labels_tld is not defined in this notebook; it is taken here to be the value counts of 'tld'.
list_labels_tld = train_df['tld'].value_counts()
label_count_tld = list_labels_tld.index.tolist()
label_count_100 = []
for i in range(len(label_count_tld)):
    if list_labels_tld.iloc[i]>=100: 
        label_count_100.append(label_count_tld[i])
print(label_count_100)

# Keep the frequent values and leave NaN everywhere else
train_df['tld_100'] = train_df['tld'].where(train_df['tld'].isin(label_count_100))
            
# One-hot encode the remaining values
df_dummies_tld = pd.get_dummies(train_df['tld_100'])

train_df = train_df.join(df_dummies_tld)
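
Both encodings follow the same pattern: count the values, keep the frequent ones, one-hot encode what is left. A compact sketch of that pattern is given below. The helper name `frequent_dummies` and the `toy` frame are hypothetical, and unlike the cells above it adds the column name as a prefix, so that an `org` dummy and a `tld` dummy can never share a name when they are joined to the same dataframe.

```python
import pandas as pd

def frequent_dummies(df, column, min_count):
    """One-hot encode `column`, keeping only the values seen at least `min_count` times."""
    counts = df[column].value_counts()
    frequent = counts[counts >= min_count].index
    # Rare values become NaN, which get_dummies ignores by default
    kept = df[column].where(df[column].isin(frequent))
    return pd.get_dummies(kept, prefix=column)

# Toy frame (hypothetical values) standing in for train_df.
toy = pd.DataFrame({'tld': ['com', 'com', 'com', 'net', 'org', 'com']})

dummies = frequent_dummies(toy, 'tld', min_count=2)
print(dummies)
```

Rows whose value is too rare end up with zeros in every dummy column.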