# IoT Challenge - Geolocalization

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from geopy.distance import vincenty

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

In [3]:
# load train and test data
df_mess_train = pd.read_csv('mess_train_list.csv') # train set
df_mess_test = pd.read_csv('mess_test_list.csv') # test set
pos_train = pd.read_csv('pos_train_list.csv') # position associated to train set

## Data exploration [Théo]

## Feature Engineering [Mathieu]

### One-Hot (Teacher ++ / Mean / RSSI / etc.)

### Usage of barycentre

#### Standard barycentre

#### Weighted barycentre

## Outlier processing I [André]

After noticing that some bases are reported at positions that do not seem to make sense, we decided to estimate their latitude and longitude approximately, using the coordinates of the (weighted) barycentre of the messages each outlier base received.
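Written as a formula, the estimate for an outlier base $b$ could read as follows. The notation is introduced here for illustration only: $M_b$ is the set of training messages received by $b$, $(\text{lat}_m, \text{lng}_m)$ is the known position of message $m$, and the weight is the `sqrt(exp(rssi))` used in the notebook's code.

```latex
\widehat{\text{lat}}_b = \frac{\sum_{m \in M_b} w_{b,m}\,\text{lat}_m}{\sum_{m \in M_b} w_{b,m}},
\qquad
\widehat{\text{lng}}_b = \frac{\sum_{m \in M_b} w_{b,m}\,\text{lng}_m}{\sum_{m \in M_b} w_{b,m}},
\qquad
w_{b,m} = \sqrt{e^{\text{rssi}_{b,m}}}
```

Since RSSI values are negative, a stronger signal (an RSSI closer to zero) gets a larger weight and pulls the estimate towards that message.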

The code below performs this cleaning.

In [5]:
# List of unique messages
listOfmess = np.unique(df_mess_train['messid'])

# DataFrame with all training_data, including positions
df = pd.concat([df_mess_train, pos_train], axis=1)

In [19]:
# We can notice that the outlier bases have latitude at 64.3 and longitude at -68.5:
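# Select the numeric columns before averaging so that non-numeric columns (e.g. messid) are not aggregated
df.groupby('bsid')[['bs_lat', 'bs_lng']].mean().sort_values('bs_lat', ascending=False).head(10)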

| bsid | bs_lat | bs_lng |
|---|---|---|
| 1772 | 64.3 | -68.5 |
| 4156 | 64.3 | -68.5 |
| 8560 | 64.3 | -68.5 |
| 2943 | 64.3 | -68.5 |
| 8449 | 64.3 | -68.5 |
| 4987 | 64.3 | -68.5 |
| 11951 | 64.3 | -68.5 |
| 2293 | 64.3 | -68.5 |
| 7248 | 64.3 | -68.5 |
| 9784 | 64.3 | -68.5 |


In [45]:
# Selecting these bases
bases_out = df[(df['bs_lat']==64.3) & (df['bs_lng']==-68.5)]['bsid'].unique()
bases_out

array([ 8355, 11007,  1594, 10151, 10162,  8451,  4993,  8560,  2293,
        4959, 10999,  1661,  8449,  4156,  4129,  1743,  4987,  1772,
        1796,  2707,  2943,  4123, 11951,  9784,  1092,  1854,  7248])

In [70]:
# Getting dataframe with all data for the outlier bases
df_out = df[df['bsid'].isin(bases_out)]

# Initialising arrays that will hold the weighted lat, the weighted lng and the weights
# of the messages received by the bases.
# Each row is one record of df_out; each column represents a message
mess_num = len(listOfmess) # number of messages
lat_array = np.zeros((df_out.shape[0], mess_num))
lng_array = np.zeros((df_out.shape[0], mess_num))
weight_array = np.zeros((df_out.shape[0], mess_num)) # weights to be used: sqrt(exp(rssi))
    
# Dictionary to track message id and corresponding column in array
mess_dict = {}
for i, column in enumerate(listOfmess):
    mess_dict[column] = i
    
# assigning values to arrays
for i, ix in enumerate(df_out.index):
    mess = df_out.loc[ix, 'messid']
    column = mess_dict[mess]
    
    # Using sqrt(exp(rssi)) as weight to get weighted centroid (computed once per record)
    weight = np.sqrt(np.exp(df_out.loc[ix, 'rssi']))
    weight_array[i, column] = weight
    lat_array[i, column] = df_out.loc[ix, 'lat'] * weight
    lng_array[i, column] = df_out.loc[ix, 'lng'] * weight
    
# Transforming arrays in dataframe in order to use groupby()
lat_df = pd.DataFrame(lat_array)
lng_df = pd.DataFrame(lng_array)
weight_df = pd.DataFrame(weight_array)

# Adding column bsid for each dataframes in order to perform groupby()
lat_df['bsid'] = lng_df['bsid'] = weight_df['bsid'] = df_out.reset_index()['bsid']

# Grouping and summing --- Note that values for lat and lng are already weighted
lat_df_grouped = lat_df.groupby('bsid').sum()
lng_df_grouped = lng_df.groupby('bsid').sum()
weight_df_grouped = weight_df.groupby('bsid').sum()

# Dividing each row by the sum of the weights for the respective row
lat_df_grouped = lat_df_grouped.divide(weight_df_grouped.sum(axis=1), axis=0)
lng_df_grouped = lng_df_grouped.divide(weight_df_grouped.sum(axis=1), axis=0)

# Getting the final weighted latitudes and longitudes
lat_out = lat_df_grouped.sum(axis=1)
lng_out = lng_df_grouped.sum(axis=1)

# Assigning these new latitudes and longitudes to the bases in the test and training sets
for base in lat_out.index:
    df_mess_train.loc[df_mess_train['bsid']==base, 'bs_lat'] = lat_out.loc[base]
    df_mess_train.loc[df_mess_train['bsid']==base, 'bs_lng'] = lng_out.loc[base]
    
    df_mess_test.loc[df_mess_test['bsid']==base, 'bs_lat'] = lat_out.loc[base]
    df_mess_test.loc[df_mess_test['bsid']==base, 'bs_lng'] = lng_out.loc[base]
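The same weighted barycentre can be obtained without the large per-message arrays, by grouping directly on `bsid`. The following is only a minimal sketch on a toy frame: `toy` and `weighted_barycentre` are hypothetical names, the values are made up, and `np.exp(rssi / 2)` is used because it is mathematically identical to `sqrt(exp(rssi))`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_out: two hypothetical bases, three records (made-up values)
toy = pd.DataFrame({
    'bsid': [1, 1, 2],
    'rssi': [-100.0, -120.0, -110.0],
    'lat':  [39.0, 40.0, 41.0],
    'lng':  [-105.0, -104.0, -103.0],
})

def weighted_barycentre(frame):
    """Return one (bs_lat, bs_lng) row per bsid, weighting by sqrt(exp(rssi))."""
    w = np.exp(frame['rssi'] / 2.0)  # same value as np.sqrt(np.exp(rssi))
    sums = pd.DataFrame({
        'bsid': frame['bsid'],
        'w': w,
        'w_lat': w * frame['lat'],
        'w_lng': w * frame['lng'],
    }).groupby('bsid').sum()
    # Weighted mean = sum of weighted coordinates / sum of weights, per base
    return pd.DataFrame({
        'bs_lat': sums['w_lat'] / sums['w'],
        'bs_lng': sums['w_lng'] / sums['w'],
    })

centres = weighted_barycentre(toy)
print(centres)
```

Applied to the real data, the call would presumably be `weighted_barycentre(df_out)`, with the result indexed by `bsid` in the same way as `lat_out` and `lng_out`.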

## Model [Mathieu / Matyas / André]

### First raw predictions

### Outlier processing II

### Fine-tuning / Model selection

### Blending

## Testset preprocessing [André] 

We can notice that the test set contains some outlier bases that are not present in the training set:

In [72]:
df_mess_test[(df_mess_test['bs_lat']==64.3) & (df_mess_test['bs_lng']==-68.5)]['bsid'].unique()

array([9949, 9941])

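We cannot replace their positions with a barycentre, as we do not have the positions of the messages they received (the test set comes without message positions).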
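One possible way to list outlier bases that occur only in the test set is a set difference on `bsid`. This is a minimal sketch on toy frames: `toy_train`, `toy_test` and `outlier_bases` are hypothetical names and the values are made up. `np.isclose` is used instead of `==` as a precaution against floating-point rounding in the coordinates.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for df_mess_train / df_mess_test (made-up values)
toy_train = pd.DataFrame({
    'bsid': [10, 11, 12],
    'bs_lat': [64.3, 39.7, 64.3],
    'bs_lng': [-68.5, -105.0, -68.5],
})
toy_test = pd.DataFrame({
    'bsid': [10, 13, 14],
    'bs_lat': [64.3, 64.3, 39.8],
    'bs_lng': [-68.5, -68.5, -104.9],
})

def outlier_bases(frame, lat=64.3, lng=-68.5):
    """Return the unique bsid values reported at the suspicious (lat, lng)."""
    mask = np.isclose(frame['bs_lat'], lat) & np.isclose(frame['bs_lng'], lng)
    return frame.loc[mask, 'bsid'].unique()

# Outlier bases seen in the test set that never appear in the training set
test_only = np.setdiff1d(outlier_bases(toy_test), toy_train['bsid'].unique())
print(test_only)
```

Bases returned this way have no training messages, so no barycentre can be computed for them from the training positions.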

## Running final model and computing predictions on test set [André] 