# **Waze Project**

## Feature engineering
This step continues exploring the data and engineers new features in addition to the two generated at the previous step:
* `km_per_driving_day`
* `percent_sessions_in_last_month`

In [80]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

In [106]:
df = pd.read_csv('waze_dataset_no_nans_outlier_to_95th_p.csv')

#### **`drives_per_driving_day`**

At the previous stage, it turned out that some users complete a lot of drives. Let's turn this into a feature: the number of drives per driving day.

In [107]:
df['drives'].describe()

count    14299.000000
mean        63.964683
std         55.127927
min          0.000000
25%         20.000000
50%         48.000000
75%         93.000000
max        200.000000
Name: drives, dtype: float64

In [108]:
df['drives_per_driving_day'] = df['drives'] / df['driving_days']

In [109]:
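# Dividing by zero driving days yields inf; treat those users as having 0 drives per driving day.
# Any 0/0 case yields NaN instead and is left untouched here.
df.loc[df['drives_per_driving_day']==np.inf, 'drives_per_driving_day'] = 0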
df['drives_per_driving_day'].describe()

count    14292.000000
mean         8.918904
std         17.816456
min          0.000000
25%          1.230769
50%          3.666667
75%          8.848901
max        200.000000
Name: drives_per_driving_day, dtype: float64

This is an interesting finding. Some entries have a very large number of drives per driving day, up to a maximum of 200.
Some users who completed 200 drives did so in just one driving day and drove several thousand km. This is probably physically impossible.

In [110]:
df[df['drives_per_driving_day'] == 200].head()

Unnamed: 0.1,Unnamed: 0,ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device,km_per_driving_day,percent_sessions_in_last_month,drives_per_driving_day
63,63,63,retained,243,200,298.673647,1546,0,88,4695.169432,2146.467081,6,1,iPhone,4695.169432,0.85,200.0
360,376,376,churned,243,200,387.088693,2707,0,0,3309.863745,660.564362,1,1,iPhone,3309.863745,0.66,200.0
454,472,472,churned,243,200,257.054881,1341,139,19,8898.716275,4322.142842,1,1,Android,10828.24439,0.97,200.0
1889,1978,1978,retained,243,200,302.915904,1088,20,0,5268.853194,2312.04103,6,1,iPhone,5268.853194,0.86,200.0
2329,2445,2445,retained,243,200,375.045901,2793,43,35,4949.591318,747.623435,1,1,Android,4949.591318,0.86,200.0
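
One way to make the "physically impossible" suspicion concrete is to compute the average speed implied by the monthly distance and duration totals. The sketch below copies three rows from the output above; `implied_kmh` is a hypothetical helper column, not part of the dataset.

```python
import pandas as pd

# Three of the rows shown above (values copied from the output).
suspicious = pd.DataFrame({
    "ID": [63, 376, 2445],
    "driven_km_drives": [4695.169432, 3309.863745, 4949.591318],
    "duration_minutes_drives": [2146.467081, 660.564362, 747.623435],
})

# Implied average speed in km/h = distance / (minutes / 60).
# `implied_kmh` is a hypothetical name introduced for this check only.
suspicious["implied_kmh"] = (
    suspicious["driven_km_drives"] / (suspicious["duration_minutes_drives"] / 60)
)
print(suspicious[["ID", "implied_kmh"]].round(1))
```

For these three users the implied averages are roughly 131, 301 and 397 km/h. User 63 also logs about 35.8 hours of driving against a single driving day, which points to a data issue rather than unusual behaviour.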


It would help to get some elaboration on this data from the stakeholders. For now, I suggest dropping rows where `drives_per_driving_day` is 50 or more (assuming taxi or delivery drivers can still make dozens of drives per day, although that assumption alone needs some research) and treating users above the mean of the remaining values as professional drivers.

In [111]:
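# Keep rows below the cutoff. NaN ratios fail the comparison, so they are dropped as well.
# .copy() avoids SettingWithCopyWarning when new columns are added later.
df = df[df['drives_per_driving_day'] < 50].copy()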

In [112]:
drives_per_day_mean = df['drives_per_driving_day'].mean()
drives_per_day_mean

6.329911306586418
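
The mean computed above could be turned into a binary feature for the suggested professional-driver threshold. This is a minimal sketch on a toy frame standing in for `df`; the column name `is_professional_driver` is an assumption, not something defined in the notebook.

```python
import pandas as pd

# Toy frame standing in for `df` after the `< 50` filter.
toy = pd.DataFrame({"drives_per_driving_day": [0.0, 1.5, 3.7, 9.0, 22.7]})

# Threshold = mean of the remaining values, as suggested in the text.
threshold = toy["drives_per_driving_day"].mean()

# Hypothetical feature name; 1 = more drives per driving day than the mean.
toy["is_professional_driver"] = (
    toy["drives_per_driving_day"] > threshold
).astype(int)
print(toy)
```

Because the distribution is right-skewed (before filtering, the median is about 3.7 and the mean about 8.9), a mean-based cutoff flags a minority of users. Whether that minority really consists of professional drivers still needs confirmation from stakeholders.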

#### **`total_sessions_per_day`**

This column represents the mean number of sessions per day _since onboarding_.

In [113]:
df['total_sessions_per_day'] = df['total_sessions'] / df['n_days_after_onboarding']
df.head()

Unnamed: 0.1,Unnamed: 0,ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device,km_per_driving_day,percent_sessions_in_last_month,drives_per_driving_day,total_sessions_per_day
0,0,0,retained,243,200,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android,138.360267,0.95,10.526316,0.130381
1,1,1,retained,133,107,326.896596,1225,19,64,8898.716275,3160.472914,13,11,iPhone,1246.901868,0.41,9.727273,0.266854
2,2,2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android,382.393602,0.84,11.875,0.051121
3,3,3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone,304.530374,0.72,13.333333,4.505948
4,4,4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android,219.455667,0.5,3.777778,0.107713


#### **`km_per_drive`**

This column shows the mean number of kilometers per drive made in the last month for each user.

In [114]:
df['km_per_drive'] = df['driven_km_drives'] / df['drives']
# Users with zero drives get inf from the division; set their km per drive to 0.
df.loc[df['km_per_drive']==np.inf, 'km_per_drive'] = 0
df['km_per_drive'].describe()

count    13840.000000
mean       232.132611
std        580.874059
min          0.000000
25%         34.121939
50%         75.585745
75%        183.683465
max       8898.716275
Name: km_per_drive, dtype: float64
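
Both ratio features in this notebook divide by a column that can be zero and then patch `inf` afterwards. A small helper could handle `x/0` and `0/0` in one place. This is a sketch, not the method used in the cells above: `safe_ratio` and its `fill` parameter are hypothetical names.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series, fill: float = 0.0) -> pd.Series:
    """Divide two Series, replacing inf (x/0) and NaN (0/0) with `fill`."""
    ratio = numerator / denominator
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(fill)

# Toy data: a normal row, a row with x/0, and a row with 0/0.
toy = pd.DataFrame({"driven_km_drives": [100.0, 50.0, 0.0], "drives": [4, 0, 0]})
toy["km_per_drive"] = safe_ratio(toy["driven_km_drives"], toy["drives"])
print(toy)
```

Note the difference in behaviour: the cells above leave `0/0` results as NaN, while this helper fills them with 0 and therefore keeps those rows in later filters. Which of the two is appropriate is a modelling choice.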

In [115]:
df[df['km_per_drive'] > 1000]

Unnamed: 0.1,Unnamed: 0,ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device,km_per_driving_day,percent_sessions_in_last_month,drives_per_driving_day,total_sessions_per_day,km_per_drive
6,6,6,retained,3,2,236.725314,360,185,18,5249.172828,726.577205,28,23,iPhone,228.224906,0.01,0.086957,0.657570,2624.586414
52,52,52,retained,4,4,62.620853,3291,65,0,4996.233445,1345.047896,28,23,iPhone,217.227541,0.06,0.173913,0.019028,1249.058361
64,64,64,retained,4,3,113.818787,1830,233,0,8898.716275,3213.049582,6,0,Android,0.000000,0.04,0.000000,0.062196,2966.238758
178,185,185,churned,2,2,18.277806,2252,21,0,4267.335798,1968.320404,7,2,iPhone,2133.667899,0.11,1.000000,0.008116,2133.667899
192,200,200,churned,5,5,138.870575,321,0,0,8809.226415,2667.907883,3,0,Android,0.000000,0.04,0.000000,0.432619,1761.845283
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14165,14855,14855,retained,10,8,185.741218,1186,227,23,8398.906585,1637.979318,21,13,Android,646.069737,0.05,0.615385,0.156611,1049.863323
14168,14858,14858,retained,2,2,15.445180,2781,308,6,4875.445642,3457.510250,5,4,Android,1218.861410,0.13,0.500000,0.005554,2437.722821
14173,14863,14863,retained,1,1,79.109928,589,18,63,2298.559088,534.099571,16,16,Android,143.659943,0.01,0.062500,0.134312,2298.559088
14204,14895,14895,retained,1,1,142.095429,3411,0,0,8898.716275,4668.180092,11,11,Android,951.197885,0.01,0.090909,0.041658,8898.716275


In [116]:
df[df['km_per_drive'] > 1000].shape

(610, 19)

There are users whose drives are hundreds or even thousands of km long on average. This may indicate a group of long-haul drivers.
If so, the specifics of their usage need to be studied, but, to deal with the extreme values for now, let's drop the rows with `km_per_drive` > 1000.

In [117]:
# .copy() avoids SettingWithCopyWarning when `is_churned` is added below.
df = df[df['km_per_drive'] <= 1000].copy()

In [118]:
df.shape

(13230, 19)

## Variable encoding

#### **Target encoding**

In [119]:
df['is_churned'] = np.where(df['label'] == 'churned', 1, 0)

In [120]:
df.drop(columns = ['label'], inplace=True)

#### **Dummying features**

The only categorical feature remaining is the device, so we can create dummy variables for it.

In [121]:
# Encode only the `device` column explicitly; produces `device_Android` and `device_iPhone`.
df = pd.get_dummies(df, columns=['device'], dtype=int)
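
Since `device` takes only two values, the two dummy columns carry the same information. As an alternative sketch (not what the cell above does), `drop_first=True` keeps a single indicator column. The toy frame below is illustrative.

```python
import pandas as pd

# Toy frame with the same two device values as the dataset.
toy = pd.DataFrame({
    "sessions": [243, 133, 114],
    "device": ["Android", "iPhone", "Android"],
})

# Encode only `device`; drop_first=True drops the first category (Android),
# leaving one indicator column for iPhone.
encoded = pd.get_dummies(toy, columns=["device"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

A single indicator avoids perfectly collinear columns, which matters for linear models such as logistic regression and is mostly harmless for tree-based models.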

In [104]:
df.head(15)

Unnamed: 0.1,Unnamed: 0,ID,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,...,driving_days,km_per_driving_day,percent_sessions_in_last_month,drives_per_driving_day,total_sessions_per_day,km_per_drive,percent_of_sessions_to_favorite,is_churned,device_Android,device_iPhone
0,0,0,243,200,296.748273,2276,208,0,2628.845068,1985.775061,...,19,138.360267,0.95,10.526316,0.130381,13.144225,0.700931,0,1,0
1,1,1,133,107,326.896596,1225,19,64,8898.716275,3160.472914,...,11,1246.901868,0.41,9.727273,0.266854,83.165573,0.253903,0,0,1
2,2,2,114,95,135.522926,2651,0,0,3059.148818,1610.735904,...,8,382.393602,0.84,11.875,0.051121,32.201567,0.0,0,1,0
3,3,3,49,40,67.589221,15,322,7,913.591123,587.196542,...,3,304.530374,0.72,13.333333,4.505948,22.839778,4.86764,0,0,1
4,4,4,84,68,168.24702,1562,166,5,3950.202008,1219.555924,...,18,219.455667,0.5,3.777778,0.107713,58.091206,1.016363,0,1,0
5,5,5,113,103,279.544437,2637,0,0,901.238699,439.101397,...,11,81.930791,0.4,9.363636,0.106009,8.74989,0.0,0,0,1
7,7,7,39,35,176.072845,2999,0,0,7892.052468,2466.981741,...,20,394.602623,0.22,1.75,0.058711,225.487213,0.0,0,0,1
8,8,8,57,46,183.532018,424,0,26,2651.709764,1594.342984,...,20,132.585488,0.31,2.3,0.432859,57.645864,0.141665,0,1,0
9,9,9,84,68,244.802115,2997,72,0,6043.460295,2341.838528,...,3,2014.486765,0.34,22.666667,0.081682,88.874416,0.294115,1,0,1
10,10,10,23,20,117.225772,1946,0,36,8554.91444,4668.180092,...,9,950.546049,0.2,2.222222,0.060239,427.745722,0.3071,0,0,1
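
The `percent_of_sessions_to_favorite` column in this output has no defining cell in this section. Its displayed values are consistent with the sum of the two favorite-navigation counts divided by `total_sessions`. The sketch below reproduces the first two rows under that assumed definition, which is inferred from the output only.

```python
import pandas as pd

# First two rows of the output above (values copied from the table).
toy = pd.DataFrame({
    "total_navigations_fav1": [208, 19],
    "total_navigations_fav2": [0, 64],
    "total_sessions": [296.748273, 326.896596],
})

# Assumed definition, inferred from the displayed values only.
toy["percent_of_sessions_to_favorite"] = (
    toy["total_navigations_fav1"] + toy["total_navigations_fav2"]
) / toy["total_sessions"]
print(toy["percent_of_sessions_to_favorite"].round(6).tolist())
```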


In [122]:
# index=False avoids writing the row index as yet another `Unnamed: 0` column.
df.to_csv('dataset_encoded_features.csv', index=False)

## Conclusion

While trying to engineer helpful features, I discovered values that raise reasonable suspicion:
- extremely long average distances per drive
- extremely high average numbers of drives per driving day

In my opinion, this might indicate two different groups of drivers: 1) long-haul drivers or travellers and 2) taxi/delivery drivers making lots of drives daily. Their usage patterns need further research, and insights from stakeholders will be very important. However, some of these values do not seem physically possible, so it is crucial to discuss possible errors in the data and to learn how to establish proper thresholds.
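
Until stakeholders weigh in, one data-driven starting point for thresholds is Tukey's upper fence (Q3 + 1.5 × IQR). This technique is swapped in here for illustration and is not what the notebook uses; `upper_fence` is a hypothetical helper shown on toy data.

```python
import pandas as pd

def upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Tukey's upper fence: Q3 + k * IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    return q3 + k * (q3 - q1)

# Toy values with one obvious extreme.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 200.0])
fence = upper_fence(toy)
outliers = toy[toy > fence]
print(fence, outliers.tolist())
```

With the quartiles reported earlier for `drives_per_driving_day` (about 1.23 and 8.85), this rule would put the fence near 20.3, well below the manual cutoff of 50. Domain knowledge is therefore still needed to choose between such candidates.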