### Project by:
Mahmoud Benboubker\
Nicolas Calligaro\
Aïcha Lalhou



# Method overview

We use several models and combine them to predict the latitude and longitude of each message. To do so, we rely on the scikit-learn library, and in particular on its Voting Regressor.

We chose to combine the best-performing models we tested in order to make our estimates more stable by reducing the variance of the estimators.

Five models were implemented and combined:
- Random Forest
- XGBoost
- Gradient Boosting
- Bagging Regressor
- Extra Tree

NB: We decided to model latitude and longitude separately, so that each target can be tuned with its own best hyperparameters.
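To make the averaging concrete, here is a minimal sketch on synthetic data (the data and the two small models are assumptions chosen for illustration, not the project's setup). It shows that, without weights, a `VotingRegressor` prediction is simply the mean of its fitted base models' predictions, which is where the variance reduction comes from.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)

# Synthetic regression data standing in for the real feature matrix (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

voter = VotingRegressor(estimators=[
    ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
    ("gb", GradientBoostingRegressor(n_estimators=20, random_state=0)),
])
voter.fit(X, y)

# Predictions of each fitted base model, one column per model.
individual = np.column_stack([est.predict(X) for est in voter.estimators_])

# With no weights given, the ensemble output is the plain row-wise mean.
manual_mean = individual.mean(axis=1)
ensemble_pred = voter.predict(X)
```

In the notebook, one such ensemble is built per target (one for `lng`, one for `lat`).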

# Setup and data import

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IpyTools import *
from IotTools import *
pd.options.mode.chained_assignment = None  # default='warn'

from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import ExtraTreeRegressor
from sklearn.ensemble import VotingRegressor

from sklearn.model_selection import cross_val_predict

In [2]:
df_mess_train = pd.read_csv('mess_train_list.csv')
df_mess_test = pd.read_csv('mess_test_list.csv')
pos_train = pd.read_csv('pos_train_list.csv')
listOfBs = np.union1d(df_mess_train.bsid.unique(),df_mess_test.bsid.unique())

In [3]:
X_train = df_mess_train

# We apply a guard function that rewrites every base station located outside a given bounding box

In [4]:
X_train = Correct_Bases(df_mess_train)
X_train[X_train.bs_lat>64].shape[0]

We have 27 outlier base stations
Base station 9949 not seen
0 base stations remain with lat >60


0
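`Correct_Bases` comes from `IotTools` and its body is not shown here. The sketch below is only one possible form of such a guard, under loudly stated assumptions: the column `bs_lng`, the threshold default, and the repair rule (overwriting outliers with the median of the in-bounds positions) are all hypothetical and may differ from what the real function does.

```python
import pandas as pd

def correct_bases_sketch(df, lat_max=64.0):
    """Hypothetical guard: repair base stations whose latitude exceeds lat_max."""
    out = df.copy()
    outlier = out["bs_lat"] > lat_max
    # Assumed repair rule: replace outliers with the median in-bounds position.
    out.loc[outlier, "bs_lat"] = out.loc[~outlier, "bs_lat"].median()
    out.loc[outlier, "bs_lng"] = out.loc[~outlier, "bs_lng"].median()
    return out, int(outlier.sum())

# Toy data: the third station sits far outside the area of interest.
toy = pd.DataFrame({
    "bsid": [1, 2, 3, 4],
    "bs_lat": [39.6, 39.8, 65.0, 40.0],
    "bs_lng": [-105.0, -104.9, -11.0, -105.1],
})
fixed, n_outliers = correct_bases_sketch(toy)
```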

# We remove devices that are too 'complex' to predict

In [5]:
a=[476598,476896,476256,476513,476889,476248,473288,476327,476836]
X_train.shape
#2.828589

(39250, 8)

In [6]:
a=[476212., 476830., 476861., 476256,477201, 476829.,476609.,476327,476315,476835,476598,476889,474192,473288]
#40 > 2.4819194000000007

In [7]:
a=[]

In [8]:
# Load the best hyperparameters
df = pd.read_csv("best_params.csv")
df.head()

Unnamed: 0,Model,lng,lat
0,XGBRegressor,"{'booster': 'gbtree', 'gamma': 0.001, 'learnin...","{'booster': 'gbtree', 'gamma': 0, 'learning_ra..."
1,ExtraTreeRegressor,"{'criterion': 'friedman_mse', 'max_depth': 8, ...","{'criterion': 'friedman_mse', 'max_depth': 8,'..."
2,GradientBoostingRegressor,"{'learning_rate': 0.2, 'max_depth': 4, 'n_esti...","{'learning_rate': 0.1, 'max_depth': 4, 'n_esti..."
3,RandomForestRegressor,"{'criterion': 'mae', 'max_depth': 10, 'max_fea...","{'criterion': 'mae', 'max_depth': 10,'max_feat..."
4,BaggingRegressor,{'n_estimators': 100},{'n_estimators': 100}
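The later cells call `get_hyperparameter(model, target)`, which is not defined in this notebook (presumably it is imported from one of the helper modules). Since the table stores each parameter set as the text of a Python dict, a lookup could look like the sketch below. The function name `get_hyperparameter_sketch` and the shortened in-memory table are assumptions made for illustration; the real helper may work differently.

```python
import ast
import pandas as pd

# Shortened in-memory stand-in for best_params.csv (illustrative values only).
best_params = pd.DataFrame({
    "Model": ["BaggingRegressor", "GradientBoostingRegressor"],
    "lng": ["{'n_estimators': 100}", "{'learning_rate': 0.2, 'max_depth': 4}"],
    "lat": ["{'n_estimators': 100}", "{'learning_rate': 0.1, 'max_depth': 4}"],
})

def get_hyperparameter_sketch(table, model_name, target):
    """Return the parameter dict stored as text for (model_name, target)."""
    row = table.loc[table["Model"] == model_name]
    if row.empty:
        raise KeyError(f"no parameters stored for {model_name!r}")
    # literal_eval parses the dict literal without executing arbitrary code.
    return ast.literal_eval(row.iloc[0][target])

params = get_hyperparameter_sketch(best_params, "GradientBoostingRegressor", "lat")
```

The resulting dict can then be unpacked into a constructor with `**params`, as the cells below do.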


# Building the feature and target matrices

In [9]:
X_mod = X_train[~X_train.did.isin(a)]
df_feat, id_list=feat_mat_const(X_mod, listOfBs)

y_full = ground_truth_const(X_mod, pos_train, id_list)
y_full.shape,df_feat.shape


((6068, 3), (6068, 273))
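`feat_mat_const` is also imported rather than defined here. As a rough illustration of the idea of one row per message and one column per base station, the sketch below builds a 0/1 reception matrix. The column name `messid` and the choice of a pure presence indicator are assumptions; the real function may encode other information (the actual matrix has 273 columns).

```python
import numpy as np
import pandas as pd

def feat_mat_sketch(df_mess, list_of_bs):
    """One row per message, one 0/1 column per base station that received it."""
    ids = df_mess["messid"].unique()
    # crosstab counts receptions; clipping at 1 turns counts into indicators.
    seen = pd.crosstab(df_mess["messid"], df_mess["bsid"]).clip(upper=1)
    # Add columns for stations that never appear and keep the message order.
    feat = seen.reindex(index=ids, columns=list_of_bs, fill_value=0)
    return feat, ids

toy = pd.DataFrame({"messid": ["m1", "m1", "m2"], "bsid": [10, 20, 20]})
list_of_bs = np.array([10, 20, 30])
feat, ids = feat_mat_sketch(toy, list_of_bs)
```

Using the union of train and test station ids for the columns, as `listOfBs` does above, keeps the train and test matrices aligned.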

# Applying the learning algorithms and measuring training time

### XGBoost

In [16]:
%%time

model_lng = xgb.XGBRegressor(**get_hyperparameter('XGBRegressor', 'lng'))
model_lat = xgb.XGBRegressor(**get_hyperparameter('XGBRegressor', 'lat'))


y_pred_lng_Xb = cross_val_predict(model_lng, df_feat, y_full.lng, cv=3)
y_pred_lat_Xb = cross_val_predict(model_lat, df_feat, y_full.lat, cv=3)
err_vec = Eval_geoloc(y_full.lat , y_full.lng, y_pred_lat_Xb, y_pred_lng_Xb)
np.percentile(err_vec, 80)
# without tuned params: 3.4274806
# with tuned params: 3.2619

Wall time: 7.67 s


3.2479798000000004

### Extra Tree

In [11]:
%%time

clf_lng=ExtraTreeRegressor(**get_hyperparameter('ExtraTreeRegressor', 'lng'))
clf_lat=ExtraTreeRegressor(**get_hyperparameter('ExtraTreeRegressor', 'lat'))

y_pred_lng_Etr = cross_val_predict(clf_lng, df_feat, y_full.lng, cv=3)
y_pred_lat_Etr = cross_val_predict(clf_lat, df_feat, y_full.lat, cv=3)
err_vec = Eval_geoloc(y_full.lat , y_full.lng, y_pred_lat_Etr, y_pred_lng_Etr)
np.percentile(err_vec, 80)
# without tuned params: 3.25750
# with tuned params: 3.110561

Wall time: 524 ms


3.530254600000001

### Random Forest

In [12]:
%%time
clf_lng=RandomForestRegressor(**get_hyperparameter('RandomForestRegressor', 'lng'))
clf_lat=RandomForestRegressor(**get_hyperparameter('RandomForestRegressor', 'lat'))

y_pred_lng_Rfr = cross_val_predict(clf_lng, df_feat, y_full.lng, cv=3)
y_pred_lat_Rfr = cross_val_predict(clf_lat, df_feat, y_full.lat, cv=3)
err_vec = Eval_geoloc(y_full.lat , y_full.lng, y_pred_lat_Rfr, y_pred_lng_Rfr)
np.percentile(err_vec, 80)
# without tuned params: 3.273
# with tuned params: 3.21965

Wall time: 4min 31s


3.3418512000000007

### Gradient Boosting

In [13]:
%%time
clf_lng=GradientBoostingRegressor(**get_hyperparameter('GradientBoostingRegressor', 'lng'))
clf_lat=GradientBoostingRegressor(**get_hyperparameter('GradientBoostingRegressor', 'lat'))

y_pred_lng_Gbr = cross_val_predict(clf_lng, df_feat, y_full.lng, cv=3)
y_pred_lat_Gbr = cross_val_predict(clf_lat, df_feat, y_full.lat, cv=3)
err_vec = Eval_geoloc(y_full.lat , y_full.lng, y_pred_lat_Gbr, y_pred_lng_Gbr)
np.percentile(err_vec, 80)
# without tuned params: 3.3046138
# with tuned params: 3.305409

Wall time: 11.4 s


3.355931400000002

### Bagging Regressor

In [14]:
%%time

clf_lng=BaggingRegressor(**get_hyperparameter('BaggingRegressor', 'lng'))
clf_lat=BaggingRegressor(**get_hyperparameter('BaggingRegressor', 'lat'))

y_pred_lng_Br = cross_val_predict(clf_lng, df_feat, y_full.lng, cv=3)
y_pred_lat_Br = cross_val_predict(clf_lat, df_feat, y_full.lat, cv=3)
err_vec = Eval_geoloc(y_full.lat , y_full.lng, y_pred_lat_Br, y_pred_lng_Br)
np.percentile(err_vec, 80)

Wall time: 40.9 s


3.2663767999999997

In [17]:
y_pred_lng = (y_pred_lng_Xb+y_pred_lng_Etr+y_pred_lng_Rfr+y_pred_lng_Gbr+y_pred_lng_Br)/5
y_pred_lat = (y_pred_lat_Xb+y_pred_lat_Etr+y_pred_lat_Rfr+y_pred_lat_Gbr+y_pred_lat_Br)/5
err_vec = Eval_geoloc(y_full.lat , y_full.lng, y_pred_lat, y_pred_lng)

In [20]:
np.percentile(err_vec, 80),y_pred_lat.shape
# Reference: 3.1226744
# 1st device subset: 3.2070142
# 2nd device subset: 3.1333602000000007

(3.1581918000000004, (6068,))

In [21]:
r1 = RandomForestRegressor(**get_hyperparameter('RandomForestRegressor', 'lng'))
r3 = ExtraTreeRegressor(**get_hyperparameter('ExtraTreeRegressor', 'lng'))
Vr = VotingRegressor(estimators=[('Et', r3), ('Rf', r1)])  # labels match the models: Extra Tree, Random Forest
y_pred_lng = cross_val_predict(Vr, df_feat, y_full.lng, cv=3)

r1 = RandomForestRegressor(**get_hyperparameter('RandomForestRegressor', 'lat'))
r3 = ExtraTreeRegressor(**get_hyperparameter('ExtraTreeRegressor', 'lat'))
Vr = VotingRegressor(estimators=[('Et', r3), ('Rf', r1)])  # labels match the models: Extra Tree, Random Forest
y_pred_lat = cross_val_predict(Vr, df_feat, y_full.lat, cv=3)

err_vec = Eval_geoloc(y_full.lat , y_full.lng, y_pred_lat, y_pred_lng)
np.percentile(err_vec, 80)

3.2016486

# Combining the best algorithms by averaging with the Voting Regressor

In [23]:
r1 = RandomForestRegressor(**get_hyperparameter('RandomForestRegressor', 'lng'))
r2 = GradientBoostingRegressor(**get_hyperparameter('GradientBoostingRegressor', 'lng'))
r3 = ExtraTreeRegressor(**get_hyperparameter('ExtraTreeRegressor', 'lng'))
r4 = xgb.XGBRegressor(**get_hyperparameter('XGBRegressor', 'lng'))
r5 = BaggingRegressor(**get_hyperparameter('BaggingRegressor', 'lng'))
# Labels now match the models they name (behaviour is unchanged)
Vr_lng = VotingRegressor(estimators=[('Rf', r1), ('Gb', r2), ('Et', r3), ('Xg', r4), ('Br', r5)])
y_pred_lng = cross_val_predict(Vr_lng, df_feat, y_full.lng, cv=3)

In [24]:
r1 = RandomForestRegressor(**get_hyperparameter('RandomForestRegressor', 'lat'))
r2 = GradientBoostingRegressor(**get_hyperparameter('GradientBoostingRegressor', 'lat'))
r3 = ExtraTreeRegressor(**get_hyperparameter('ExtraTreeRegressor', 'lat'))
r4 = xgb.XGBRegressor(**get_hyperparameter('XGBRegressor', 'lat'))
r5 = BaggingRegressor(**get_hyperparameter('BaggingRegressor', 'lat'))
# Labels now match the models they name (behaviour is unchanged)
Vr_lat = VotingRegressor(estimators=[('Rf', r1), ('Gb', r2), ('Et', r3), ('Xg', r4), ('Br', r5)])
y_pred_lat = cross_val_predict(Vr_lat, df_feat, y_full.lat, cv=3)

# Error estimation

In [25]:
err_vec = Eval_geoloc(y_full.lat , y_full.lng, y_pred_lat, y_pred_lng)
np.percentile(err_vec, 80)

3.127853000000001
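`Eval_geoloc` is provided by `IotTools` and not shown in this notebook. Assuming it returns one distance per message between the true and predicted positions, the metric used throughout (the 80th percentile of those distances) can be sketched as follows. The haversine formula, the kilometre unit and the function name `haversine_km` are assumptions; the real helper may use a different geodesic formula.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat_true, lng_true, lat_pred, lng_pred):
    """Great-circle distance in km between true and predicted positions."""
    lat1, lng1, lat2, lng2 = (
        np.radians(np.asarray(v, dtype=float))
        for v in (lat_true, lng_true, lat_pred, lng_pred)
    )
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Two toy messages: a perfect prediction and one that is off by 1 degree of latitude.
errors = haversine_km([39.0, 40.0], [-105.0, -105.0], [39.0, 41.0], [-105.0, -105.0])

# Same summary statistic as in the notebook: 80% of messages have an error below this.
p80 = np.percentile(errors, 80)
```

A percentile is less sensitive to a handful of very badly located messages than a mean would be.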

# Conclusion
In the rest of the lab, we will use the Voting Regressor with our five models. We will also add the tuned hyperparameters.
