# Jumia purchase prediction challenge by HAICK
Using machine learning techniques, this notebook demonstrates how to predict which items Jumia customers are likely to purchase in the next 4 months based on their past purchase history, as part of a Kaggle challenge from a datathon hosted by the `School of AI Algiers`.

PS: You will find details about data files and columns in the <b>Read me</b> file

## Importing packages and data:
Packages used :<br>
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-learn
- LGBM

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.preprocessing import StandardScaler
from lightgbm import LGBMClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

import warnings 
warnings.filterwarnings("ignore")

np.random.seed(42)
np.random.RandomState(42)

RandomState(MT19937) at 0x2262076AA40

In [2]:
train_set = pd.read_csv('jum_train.csv')
test_set = pd.read_csv('jum_test.csv')

In [3]:
train_set.sample(5)

Unnamed: 0,Cust_id,id_purchase,CAT I,CAT II,revenue ( $ Algérien),city,platform,gender,Order date,month
19961,324660,5073DZ8593,Enfants et bébés,Couche,20227.0,A,Web,male,2022-09-06,9
20357,325290,4023DZ8711,Vêtements pour femmes,Vêtements de nuit,428277.0,A,Web,female,2022-08-23,8
2325,176851,3144DZ8490,Appareils ménagers,Réfrigérateurs et congélateurs,163669.0,A,Web,male,2022-08-23,8
20413,325365,3073DZ8716,Vêtements pour femmes,Jeans et jeggings,437889.0,A,Web,male,2022-08-25,8
10153,260716,3197DZ8541,Chaussures pour hommes,Chaussures de sport,430895.0,A,Web,male,2022-09-23,9


In [4]:
test_set.sample(5)

Unnamed: 0,Cust_id,id_purchase,CAT I,CAT II,revenue ( $ Algérien),city,platform,gender,Order date,month
1854,277843,7253DZ8654,Maison,Grils et cuisine en plein air,127398.0,A,Web,male,2022-09-16,9
2370,305633,0793DZ8571,Soins de la maison,Outils de nettoyage,419487.0,B,Web,male,2022-09-13,9
501,175828,4484DZ8590,Appareils ménagers,Réfrigérateurs et congélateurs,185722.0,A,Web,male,2022-09-23,9
1761,273502,9619DZ8648,Informatique,Fournitures,64329.0,B,Web,male,2022-07-21,7
1675,266777,8131DZ8488,Filles,Accessoires,102818.0,A,Web,female,2022-08-05,8


<br>This is a function I generally use to summarize information about my sets :

In [5]:
def summary(data) : 
    """ Take a pandas dataframe and create a summary about missing values """
    result = (pd.DataFrame(data.dtypes, columns=["Data Type"]).reset_index().rename(columns={'index': 'Column'}))
    result["Unique"] = data.nunique().values
    result["Null"] = data.isna().sum().values
    result["%null"] = np.round(data.isna().sum().values / data.shape[0] * 100, decimals=2)
    return result

In [6]:
summary(train_set)

Unnamed: 0,Column,Data Type,Unique,Null,%null
0,Cust_id,int64,10605,0,0.0
1,id_purchase,object,28370,0,0.0
2,CAT I,object,31,0,0.0
3,CAT II,object,47,0,0.0
4,revenue ( $ Algérien),float64,27608,0,0.0
5,city,object,3,0,0.0
6,platform,object,2,0,0.0
7,gender,object,2,0,0.0
8,Order date,object,161,0,0.0
9,month,int64,7,0,0.0


In [7]:
summary(test_set)

Unnamed: 0,Column,Data Type,Unique,Null,%null
0,Cust_id,int64,2500,0,0.0
1,id_purchase,object,4479,0,0.0
2,CAT I,object,28,0,0.0
3,CAT II,object,43,0,0.0
4,revenue ( $ Algérien),float64,4464,0,0.0
5,city,object,3,0,0.0
6,platform,object,2,0,0.0
7,gender,object,2,0,0.0
8,Order date,object,64,0,0.0
9,month,int64,3,0,0.0


After analysing the submission file, I understood that we are asked to frame the problem as a binary classification problem, where the input is the `customer ID` and `item ID`, and the output is `buy` or `not buy`.<br>
After careful consideration, I chose to approach the problem by treating the purchase history data for months 7, 8, and 9 as the training set, and the subsequent months 10, 11, 12, and 1 as the target for modeling.

In [8]:
train = train_set.copy()

In [9]:
train_set = train[train['month'].isin([7, 8, 9])]

In [10]:
train_target = train[~train['month'].isin([7, 8, 9])]

In [11]:
len(np.unique(np.concatenate((train_target['CAT I'].unique(), train_target['CAT II'].unique()), axis=0)))

76

There is 76 unique item category in the train set.<br>
My objective was to create a data structure that would store information on purchased items and their corresponding customers, using a list of tuples and then a pandas dataframe.

In [12]:
rows = []
for index in train_target.index:
    customer = train_target.loc[index, 'Cust_id']
    cat1 = train_target.loc[index, 'CAT I']
    cat2 = train_target.loc[index, 'CAT II']
    rows.append((customer, cat1))
    rows.append((customer, cat2))

In [13]:
rows[:5]

[(181408, 'Appareils ménagers'),
 (181408, 'Réfrigérateurs et congélateurs'),
 (181436, 'Appareils ménagers'),
 (181436, 'Réfrigérateurs et congélateurs'),
 (181766, 'Appareils ménagers')]

In [14]:
df = pd.DataFrame(rows, columns=['Customer', 'Product'])
df.sample(5)

Unnamed: 0,Customer,Product
4892,250133,Chaussures pour hommes
15334,156772,Accessoire femmes
3934,238918,Beauté et parfums
6127,261574,Baskets
11807,314480,Accessoires


The two sets (train and test) have the same columns => I concatenated them.

In [15]:
test_set.index = test_set.index + train_set.tail(1).index.values[0]+1
test_set.head()

Unnamed: 0,Cust_id,id_purchase,CAT I,CAT II,revenue ( $ Algérien),city,platform,gender,Order date,month
28370,153925,9137DZ8685,Accessoire femmes,Lunettes de soleil,181672.0,A,Web,male,2022-07-19,7
28371,153925,3290DZ8514,Accessoire femmes,Lunettes de soleil,482394.0,A,Web,male,2022-07-19,7
28372,15403,5431DZ8470,Accessoire femmes,Lunettes de soleil,116073.0,A,Web,male,2022-07-20,7
28373,15403,6262DZ8700,Accessoire femmes,Lunettes de soleil,183709.0,A,Web,male,2022-07-20,7
28374,15403,8904DZ8715,Accessoire femmes,Lunettes de soleil,273851.0,A,Web,male,2022-07-21,7


In [16]:
all_data = pd.concat([train_set, test_set])
all_data.head()

Unnamed: 0,Cust_id,id_purchase,CAT I,CAT II,revenue ( $ Algérien),city,platform,gender,Order date,month
0,153926,5774DZ8526,Accessoire femmes,Sacs,55251.0,A,Web,male,2022-07-19,7
1,153929,4224DZ8573,Accessoire femmes,Lunettes de soleil,444479.0,A,Web,male,2022-07-19,7
2,153935,7178DZ8657,Accessoire femmes,Lunettes de soleil,242911.0,A,Web,female,2022-07-19,7
3,153935,3598DZ8583,Accessoire femmes,Lunettes de soleil,267374.0,A,Web,female,2022-07-20,7
4,153950,5142DZ8612,Accessoire femmes,Sacs,102590.0,A,Web,female,2022-07-19,7


In [17]:
all_data['city'].value_counts(normalize=True)*100

A    84.612416
B    11.760418
C     3.627165
Name: city, dtype: float64

In [18]:
all_data['platform'].value_counts(normalize=True)*100

Web     75.510204
App     24.489796
Name: platform, dtype: float64

In [19]:
all_data['gender'].value_counts(normalize=True)*100

male      63.196707
female    36.803293
Name: gender, dtype: float64

## Feature Engineering:

In [20]:
all_data['gender'] = all_data['gender'].map({'male':1 , 'female':0})
all_data['platform'] = all_data['platform'].map({'Web':1 , 'App ':0})
for col in ['platform', 'gender']:
    all_data[col] = all_data[col].astype('category')

In [21]:
all_data.sample(5)

Unnamed: 0,Cust_id,id_purchase,CAT I,CAT II,revenue ( $ Algérien),city,platform,gender,Order date,month
8687,247389,7105DZ8556,Chaussures pour hommes,Chaussures de sport,286237.0,A,1,1,2022-09-18,9
30747,305650,9912DZ8594,Soins de la maison,Outils de nettoyage,422924.0,B,1,1,2022-08-08,8
21812,328112,9361DZ8585,Vêtements pour hommes,Polos,460665.0,A,1,0,2022-08-19,8
5603,210637,6959DZ8671,Beauté et parfums,Carrosserie,271947.0,A,1,1,2022-08-14,8
29194,195963,1644DZ8586,Automobile et motocycles,Motos,428139.0,A,1,0,2022-08-09,8


### Encoding categorical features:

In [22]:
data_obj_cat = all_data[['CAT I', 'CAT II']]
data_obj_cat = pd.get_dummies(data_obj_cat , prefix='', prefix_sep='')
for col in data_obj_cat.columns:
    data_obj_cat[col] = data_obj_cat[col].astype('category')

In [23]:
data = pd.concat([all_data.drop(['CAT I', 'CAT II'], axis=1),data_obj_cat],axis=1)

In [24]:
data_obj_city = all_data[['city']]
data_obj_city = pd.get_dummies(data_obj_city)
for col in data_obj_city.columns:
    data_obj_city[col] = data_obj_city[col].astype('category')
data = pd.concat([data.drop(['city'], axis=1),data_obj_city],axis=1)

### Date related features:

In [25]:
data['Order date'] = pd.to_datetime(data['Order date'], errors='coerce')

In [26]:
data['week_day'] = data['Order date'].dt.dayofweek.values + 1
for index in data.index:
    if data.loc[index, 'week_day'] == 7:
        data.loc[index, 'week_day'] = 0

In [27]:
data['day'] = data['Order date'].dt.day.values

In [28]:
data['day'].unique()

array([19, 20, 21, 22, 24, 25, 28, 27, 29, 31,  1, 26,  2,  3,  4,  5,  7,
        8, 10,  9, 11, 12, 14, 15, 16, 17, 18, 23, 30,  6, 13],
      dtype=int64)

In [29]:
data['Order date'].dt.day.unique()

array([19, 20, 21, 22, 24, 25, 28, 27, 29, 31,  1, 26,  2,  3,  4,  5,  7,
        8, 10,  9, 11, 12, 14, 15, 16, 17, 18, 23, 30,  6, 13],
      dtype=int64)

In [30]:
data['dispatch_day_sin'] = np.sin(data['day']*(2.*np.pi/31)) 
data['dispatch_day_cos'] = np.cos(data['day']*(2.*np.pi/31)) 
data['dispatch_day_of_week_sin'] = np.sin(data['week_day']*(2.*np.pi/7)) 
data['dispatch_day_of_week_cos'] = np.cos(data['week_day']*(2.*np.pi/7)) 

In [31]:
data['week_day']=data['week_day'].astype('category')
data['day']=data['day'].astype('category')

### Reseprating sets and more feature engineering:

In [32]:
train_final = data[data.index< train_set.tail(1).index.values[0]]
test_final = data[data.index>= train_set.tail(1).index.values[0]]

I grouped the 2 dataframes by `Customer ID`

In [33]:
dict = {}
cat_col = []
for col in train_final.columns:
    if col == 'gender':
        train_final[col] = train_final[col].astype('int64')
        test_final[col] = test_final[col].astype('int64')
        dict[col]='max'
        cat_col.append(col)
    elif col in ['week_day', 'day']:
        train_final[col] = train_final[col].astype('int64')
        test_final[col] = test_final[col].astype('int64')
        dict[col]= 'max'
        cat_col.append('week_day')
    elif col in ['dispatch_day_sin', 'dispatch_day_cos', 'dispatch_day_of_week_sin', 'dispatch_day_of_week_cos']:
        dict[col]='mean'
    elif col =='revenue ( $ Algérien)':
        dict[col]='sum'
    elif col not in ['platform', 'month', 'week_day', 'Cust_id', 'id_purchase', 'Order date']:
        train_final[col] = train_final[col].astype('int64')
        test_final[col] = test_final[col].astype('int64')
        dict[col]='sum'

In [34]:
train = train_final.groupby("Cust_id").agg(dict)

In [35]:
test = test_final.groupby("Cust_id").agg(dict)

In [36]:
for col in cat_col:
    train[col] = train[col].astype('category')
    test[col] = test[col].astype('category')

In [37]:
train.head()

Unnamed: 0_level_0,revenue ( $ Algérien),gender,Accessoire femmes,Accessoires hommes,Accessoires pour femmes,Accessoires unisexes,Appareils ménagers,Appareils photo et accessoires,Automobile et motocycles,Beauté et parfums,...,iPod et lecteurs MP3,city_A,city_B,city_C,week_day,day,dispatch_day_sin,dispatch_day_cos,dispatch_day_of_week_sin,dispatch_day_of_week_cos
Cust_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
16,378761.0,0,0,3,0,0,0,0,0,0,...,0,3,0,0,4,19,0.17792,-0.150635,0.505324,-0.44867
159,387192.0,0,0,1,0,0,0,0,0,0,...,0,1,0,0,2,9,0.968077,-0.250653,0.974928,-0.222521
170,442711.0,1,0,0,0,1,0,0,0,0,...,0,0,1,0,4,25,-0.937752,0.347305,-0.433884,-0.900969
229,351411.0,1,0,0,0,0,0,0,0,2,...,0,2,0,0,1,22,-0.334357,-0.602396,0.390916,0.811745
230,115296.0,1,0,0,0,0,0,0,0,2,...,0,2,0,0,5,26,-0.574004,-0.212588,-0.270522,-0.561745


In [38]:
test.head()

Unnamed: 0_level_0,revenue ( $ Algérien),gender,Accessoire femmes,Accessoires hommes,Accessoires pour femmes,Accessoires unisexes,Appareils ménagers,Appareils photo et accessoires,Automobile et motocycles,Beauté et parfums,...,iPod et lecteurs MP3,city_A,city_B,city_C,week_day,day,dispatch_day_sin,dispatch_day_cos,dispatch_day_of_week_sin,dispatch_day_of_week_cos
Cust_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1745,490115.0,1,0,0,0,1,0,0,0,0,...,0,1,0,0,4,28,-0.571268,0.820763,-0.433884,-0.900969
1749,1091037.0,1,0,0,0,0,3,0,0,0,...,0,3,0,0,1,18,0.272481,-0.797288,0.521221,0.748993
2003,434958.0,0,0,0,0,0,0,0,0,1,...,0,1,0,0,5,22,-0.968077,-0.250653,-0.974928,-0.222521
2055,119155.0,0,0,0,0,0,0,0,0,1,...,0,1,0,0,1,25,-0.937752,0.347305,0.781831,0.62349
2392,276692.0,0,0,0,0,0,0,0,0,1,...,0,0,1,0,2,27,-0.724793,0.688967,0.974928,-0.222521


In [39]:
cols = ['platform_web', 'platform_app']
for col in cols:
    train[col]=np.zeros(len(train))
    test[col]=np.zeros(len(test))

In [40]:
for index in train_final.index:
    customer = train_final.loc[index, 'Cust_id']
    platform = train_final.loc[index, 'platform']
    if platform == 0:
        train.loc[customer, 'platform_web'] += 1
    if platform == 1:
        train.loc[customer, 'platform_app'] += 1

In [41]:
for index in test_final.index:
    customer = test_final.loc[index, 'Cust_id']
    platform = test_final.loc[index, 'platform']
    if platform == 0:
        test.loc[customer, 'platform_web'] += 1
    if platform == 1:
        test.loc[customer, 'platform_app'] += 1

In [42]:
train.reset_index(inplace=True)
test.reset_index(inplace=True)

The `Number of orders` column indicates the total count of orders made by each customer.

In [43]:
dir = {"train": [train, train_final], "test": [test, test_final]}
for data_set in ["train", "test"] :
    for customer in dir[data_set][0]["Cust_id"] :
        orders = len(dir[data_set][1][dir[data_set][1]["Cust_id"]== customer]["id_purchase"].unique())
        index = dir[data_set][0][dir[data_set][0]["Cust_id"]==customer].index[0]
        dir[data_set][0].loc[index, "Number Of Orders"] = orders

The `Recency` column indicates the time elapsed in days between a customer's most recent order and the latest date available in the data.

In [44]:
for data_set in ["train", "test"] :
    last = data['Order date'].max()
    max = pd.DataFrame(dir[data_set][1].groupby("Cust_id")["Order date"].max())["Order date"]
    dir[data_set][0]["Recency"] = (last - max).dt.days.values

The `Products/Order` column indicates the ratio of items bought by orderfor each customer.

In [45]:
for data_set in ["train", "test"] :
    for customer in dir[data_set][0]["Cust_id"] :
            number = np.round(dir[data_set][1][dir[data_set][1]["Cust_id"]==customer].groupby("id_purchase")["id_purchase"].describe()["count"].mean(), 2)
            index = dir[data_set][0][dir[data_set][0]["Cust_id"]==customer].index[0]
            dir[data_set][0].loc[index, "Products/Order"] = number

### Modeling Strategy:
I used a subset of the available data, consisting of the first three months (7, 8, 9) as the training set, and the subsequent four months (10, 11, 12, 1) as the target set for modeling. By incorporating future purchase data into the target set, I aimed to create a more accurate and effective model for predicting customer purchases in upcoming months.<br>
To ensure the data was structured in a useful way for modeling, I arranged each row to represent a combination of customer and item, with the total number of rows equalling the product of the number of customers and the number of items.<br>
The number of rows = the number of customers X the number of items

In [46]:
train1=train.copy()
for i in range(39):
    train = pd.concat([train,train1],axis=0)

In [47]:
submission = pd.read_csv('sample_submission.csv')

In [48]:
submission.loc[88, 'id'].split('_')[1]

'Chaussures de sport'

In [49]:
items=[]
for index in submission.index:
    items.append(submission.loc[index, 'id'].split('_')[1])

In [50]:
items = np.array(items)
items = np.unique(items)

In [51]:
test1=test.copy()
for i in range(39):
    test = pd.concat([test,test1],axis=0)

In [52]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
encoded_list = le.fit_transform(items)

In [53]:
products = []
encoded = []
for i in range(40):
    for j in range(len(test1)):
        products.append(items[i])
for i in range(40):
    for j in range(len(test1)):
        encoded.append(encoded_list[i])

To enable prediction of whether a customer will purchase a particular item, I transformed the relevant data into an encoded list format. The number of this encoded list present in one of my dataframe's rows represents the item in question.

In [54]:
test['product1'] = products
test['product'] = encoded

In [55]:
products = []
encoded = []
for i in range(40):
    for j in range(len(train1)):
        products.append(items[i])
for i in range(40):
    for j in range(len(train1)):
        encoded.append(encoded_list[i])

In [56]:
train['product1'] = products
train['product'] = encoded

In [57]:
train.reset_index(inplace=True)
test.reset_index(inplace=True)

In [58]:
train.drop('index', axis=1, inplace=True)
test.drop('index', axis=1, inplace=True)

### Making my dataframes looks like the submission file:

In [59]:
for index in train.index:
    name = str(train.loc[index, 'Cust_id']) + '_' + train.loc[index, 'product1']
    train.loc[index, 'index'] = name

In [60]:
for index in test.index:
    name = str(test.loc[index, 'Cust_id']) + '_' + test.loc[index, 'product1']
    test.loc[index, 'index'] = name

In [61]:
train.drop(['Cust_id', 'product1'], axis=1, inplace=True)
test.drop(['Cust_id', 'product1'], axis=1, inplace=True)

In [62]:
train.set_index('index', inplace=True)

In [63]:
test.set_index('index', inplace=True)

In [64]:
train['Target'] = np.zeros(len(train))

There is a little differnce between the items presents in the train set and the submission set.

In [65]:
df.loc[(df['Product'].str.contains('Accessoire')), 'Product'] = 'Accessoires'
df.loc[(df['Product'].str.contains('Chaussures')), 'Product'] = 'Chaussures'
df.loc[(df['Product'].str.contains('ébé')), 'Product'] = 'Bébé'
df.loc[(df['Product'].str.contains('Filles')), 'Product'] = 'Bébé'
df.loc[(df['Product'].str.contains('Garçons')), 'Product'] = 'Bébé'
df.loc[(df['Product'].str.contains('Jeux')), 'Product'] = 'Consoles'
df.loc[(df['Product'].str.contains('ppareils photo et accessoires')), 'Product'] = 'Autres appareils photo et accessoires'
df.loc[(df['Product'].str.contains('Téléphones')), 'Product'] = 'Smartphone'
df.loc[(df['Product'].str.contains('Fitness')), 'Product'] = 'Fitness'

In [66]:
def func(x,y):
    return str(x) + '_' + y

In [67]:
indices = []
for index in df.index:
    ind = func(df.loc[index, 'Customer'], df.loc[index, 'Product'])
    indices.append(ind)

In [68]:
indices = np.array(indices)

In [69]:
condition = train.index.isin(indices)

In [70]:
train.loc[condition, 'Target']=1

In [71]:
train.head()

Unnamed: 0_level_0,revenue ( $ Algérien),gender,Accessoire femmes,Accessoires hommes,Accessoires pour femmes,Accessoires unisexes,Appareils ménagers,Appareils photo et accessoires,Automobile et motocycles,Beauté et parfums,...,dispatch_day_cos,dispatch_day_of_week_sin,dispatch_day_of_week_cos,platform_web,platform_app,Number Of Orders,Recency,Products/Order,product,Target
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
16_Accessoires,378761.0,0,0,3,0,0,0,0,0,0,...,-0.150635,0.505324,-0.44867,0.0,3.0,3.0,50,1.0,0,0.0
159_Accessoires,387192.0,0,0,1,0,0,0,0,0,0,...,-0.250653,0.974928,-0.222521,0.0,1.0,1.0,52,1.0,0,0.0
170_Accessoires,442711.0,1,0,0,0,1,0,0,0,0,...,0.347305,-0.433884,-0.900969,0.0,1.0,1.0,36,1.0,0,0.0
229_Accessoires,351411.0,1,0,0,0,0,0,0,0,2,...,-0.602396,0.390916,0.811745,1.0,1.0,2.0,39,1.0,0,0.0
230_Accessoires,115296.0,1,0,0,0,0,0,0,0,2,...,-0.212588,-0.270522,-0.561745,0.0,2.0,2.0,35,1.0,0,0.0


In [72]:
test.head()

Unnamed: 0_level_0,revenue ( $ Algérien),gender,Accessoire femmes,Accessoires hommes,Accessoires pour femmes,Accessoires unisexes,Appareils ménagers,Appareils photo et accessoires,Automobile et motocycles,Beauté et parfums,...,dispatch_day_sin,dispatch_day_cos,dispatch_day_of_week_sin,dispatch_day_of_week_cos,platform_web,platform_app,Number Of Orders,Recency,Products/Order,product
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1745_Accessoires,490115.0,1,0,0,0,1,0,0,0,0,...,-0.571268,0.820763,-0.433884,-0.900969,0.0,1.0,1.0,64,1.0,0
1749_Accessoires,1091037.0,1,0,0,0,0,3,0,0,0,...,0.272481,-0.797288,0.521221,0.748993,0.0,3.0,3.0,12,1.0,0
2003_Accessoires,434958.0,0,0,0,0,0,0,0,0,1,...,-0.968077,-0.250653,-0.974928,-0.222521,1.0,0.0,1.0,70,1.0,0
2055_Accessoires,119155.0,0,0,0,0,0,0,0,0,1,...,-0.937752,0.347305,0.781831,0.62349,0.0,1.0,1.0,67,1.0,0
2392_Accessoires,276692.0,0,0,0,0,0,0,0,0,1,...,-0.724793,0.688967,0.974928,-0.222521,0.0,1.0,1.0,3,1.0,0


The column names have many special characters => I encoded them in numbers

In [73]:
train.columns = range(len(train.columns))
train = train.rename(columns={94: 'Target'})

In [74]:
test.columns = range(len(test.columns))
test = test.rename(columns={94: 'Target'})

## Machine Learning:

### Spliting train data on 2 datasets to evaluate the model:

In [75]:
from sklearn.model_selection import train_test_split

In [76]:
train1, test1 = train_test_split(train, test_size=0.3 , random_state=101)

In [77]:
train1['Target'].value_counts(normalize=True)

0.0    0.986198
1.0    0.013802
Name: Target, dtype: float64

The target is <b>unbalanced</b>. So, I decided to duplicate rows from the the minority class.

In [78]:
filt = train1['Target'] == 1
s = train1[filt]
for i in range(70):
    train1=train1.append(s)

In [79]:
train1['Target'].value_counts(normalize=True)*100

0.0    50.158966
1.0    49.841034
Name: Target, dtype: float64

Then I shuffled the data

In [80]:
train1=train1.sample(frac=1.0)

In [81]:
x_train = train1.drop(['Target'],axis=1)
x_test = test1.drop(['Target'],axis=1)
y_train = train1['Target']
y_test = test1['Target']

#### Feature selection

In [167]:
from sklearn.feature_selection import RFE
model = LGBMClassifier()
rfe = RFE(model, n_features_to_select=20)
rfe.fit(x_train, y_train)

selected_features = x_train.columns[rfe.support_]

In [168]:
selected_features

Index([0, 3, 6, 9, 11, 12, 13, 21, 23, 29, 31, 32, 33, 69, 78, 84, 85, 86, 91,
       93],
      dtype='object')

#### Modeling

In [83]:
x_train = x_train[selected_features]
x_test = x_test[selected_features]

In [84]:
model = LGBMClassifier(random_seed=42)

In [85]:
model.fit(x_train,y_train)

LGBMClassifier(random_seed=42)

In [86]:
y_pred = model.predict(x_test)

In [87]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.948321388155413

### Generalize the model on all the train set:

In [88]:
train['Target'].value_counts(normalize=True)

0.0    0.986253
1.0    0.013747
Name: Target, dtype: float64

In [89]:
filt = train['Target'] == 1
s = train[filt]
for i in range(70):
    train=train.append(s)

In [90]:
train['Target'].value_counts(normalize=True)

0.0    0.5026
1.0    0.4974
Name: Target, dtype: float64

#### Shuffle the data

In [91]:
train=train.sample(frac=1.0)

Use the feature selected previously

In [92]:
target = train['Target']
train = train[selected_features]
train['Target'] = target

#### Modeling

In [93]:
model = LGBMClassifier(random_seed=42, n_estimators=1000)
model.fit(train.drop('Target', axis=1), train['Target'])

LGBMClassifier(n_estimators=1000, random_seed=42)

In [94]:
test = test[selected_features]

In [95]:
y = model.predict(test)

In [96]:
test['Target']=y

In [97]:
test['Target'].value_counts(normalize=True)

0.0    0.959836
1.0    0.040164
Name: Target, dtype: float64

In [98]:
for index in submission.index :
    ind = submission.loc[index, 'id']
    target = test.loc[ind, 'Target']
    submission.loc[index, 'cat'] = target

In [99]:
for index in submission.index:
    if submission.loc[index,'cat'] == 1:
        submission.loc[index,'cat'] = 'buy'
    else:
        submission.loc[index,'cat'] = 'not buy'

In [100]:
submission['cat'].value_counts(normalize=True)

not buy    0.684396
buy        0.315604
Name: cat, dtype: float64

In [101]:
submission.to_csv('late_submission4.csv', index=False)