# Instructions

In this challenge, you'll try to predict the severity of car accidents, based on features collected during post-crash police investigations.

This [Kaggle challenge](https://www.kaggle.com/c/accident-severity) comprises 1,000,000 accident reports, split across multiple `.csv` files.

**The goal of the model is to predict the severity of car accidents**. The target variable is called `grav` (for 'gravity') in the file `users.csv`. This variable has four levels, but in this challenge, we'll convert it to a binary classification problem. We will:
- Load data into pandas
- Create a single DataFrame for our problem, where each row is a user involved in an accident
- Extract the features you think would be relevant to predict its severity
- Build a data pipeline that gives you a baseline model
- Then, iterate on the different phases and try to get the best model!

🔥 **Today is a special challenge** :
- You will send your best score to your batch Slack channel!
- The winner will present their notebook to the class during the recap session at 5pm 💪

---
⚠️ **Good practices to follow for large exploratory notebooks**
- Build your Notebook linearly so that it can always be run from top to bottom without any errors
- Clean the outputs of your cells that are not needed
- Clean your variables in memory when you don't need them (especially when they are very large). You can use the Python built-in statement `del`, or the **Jupyter nbextensions** `variable_inspector`
- Make heavy use of `table_of_content` and `collapsible_headings`

# Data sourcing

Let's get started! The data we want to use is from the `csv` files in `/data/data_training`

## Loading data

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
cara = pd.read_csv("data/data_training/caracteristics.csv", encoding="ISO-8859-1")
users = pd.read_csv("data/data_training/users.csv", encoding="ISO-8859-1")
places = pd.read_csv("data/data_training/places.csv", encoding="ISO-8859-1")
vehicles = pd.read_csv("data/data_training/vehicles.csv", encoding="ISO-8859-1")

❓ Explore the different tables and their variables using `challenge_variable.md`, which provides a description of the features. More details can be found [here](https://www.data.gouv.fr/fr/datasets/r/8d4df329-bbbb-434c-9f1f-596d78ad529f) if needed, or in the [Kaggle](https://www.kaggle.com/ahmedlahlou/accidents-in-france-from-2005-to-2016/discussion) discussion channel.

In [6]:
# Your code below

## Merge datasets

❓ We will create one single dataset where each row represents a `user` in a car, by merging the data from the different files.  
**Take some time to think about how you would do it yourself**, and only then read carefully through the code below to understand exactly what we did.

In [7]:
# Merge caracteristics and places on 'Num_Acc'
data = cara.merge(places, on='Num_Acc')

In [8]:
# Create a common key to merge users and vehicles on (accident id + vehicle id)
users['Num_Acc_num_veh'] = users['Num_Acc'].astype(str) + users['num_veh']
vehicles['Num_Acc_num_veh'] = vehicles['Num_Acc'].astype(str) + vehicles['num_veh']
# Remove useless columns
vehicles = vehicles.drop(columns=['index'])
users = users.drop(columns=['index', 'Num_Acc', 'num_veh'])
# Merge vehicles and users
tmp = vehicles.merge(users, on='Num_Acc_num_veh', how='inner')

In [9]:
# Merge all datasets on 'Num_Acc'
data = data.merge(tmp, on='Num_Acc', how='inner')
del tmp

In [None]:
data.shape
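
To see what the composite key does, here is a minimal sketch on toy tables. All values, and the `catv` column used as a vehicle feature, are made up for illustration. The `validate="one_to_many"` argument is an optional safety check that assumes each vehicle key appears only once on the vehicle side; whether that holds on the real data is something to verify, not a given.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values are made up)
toy_vehicles = pd.DataFrame({
    "Num_Acc": [1, 1, 2],
    "num_veh": ["A01", "B02", "A01"],
    "catv": [7, 33, 7],
})
toy_users = pd.DataFrame({
    "Num_Acc": [1, 1, 1, 2],
    "num_veh": ["A01", "A01", "B02", "A01"],
    "grav": [1, 3, 4, 2],
})

# Same composite key as above: accident id + vehicle id
for df in (toy_vehicles, toy_users):
    df["Num_Acc_num_veh"] = df["Num_Acc"].astype(str) + df["num_veh"]

toy_merged = toy_vehicles.merge(
    toy_users.drop(columns=["Num_Acc", "num_veh"]),
    on="Num_Acc_num_veh",
    how="inner",
    validate="one_to_many",  # raises MergeError if a vehicle key is duplicated
)
print(toy_merged.shape)
```

Each user ends up on exactly one row, carrying the features of the vehicle they were in, so the merged table has as many rows as `toy_users`.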

# Preprocessing

We will apply some preprocessing methods such as standardization and missing-value removal or imputation.
Remember to look at `challenge_variable.md` for a description of features.

## Clean Dataset

In [11]:
# Drop rows without a target (if any)
data_cleaned = data[data['grav'].notna()]

In [None]:
# Check which features have the highest ratio of NaN values
(data_cleaned.isna().sum() / data_cleaned.shape[0]).sort_values(ascending=False)

In [13]:
# Features that are too incomplete to keep
too_incomplete_features=[
    'locp', 'actp', 'etatp'
]

In [14]:
# Remove features that can be safely considered useless for the predictive power of our model
useless_features=[
    'v2', 'lat', 'long', 'gps', 'pr1', 'pr', 'v1', 'adr', 'voie',
    'index_x', 'Num_Acc', 'Num_Acc_num_veh', 'num_veh', 'index_y',
    'jour', 'an',
    'dep', 'com', 'env1',
]

In [15]:
# Reassign instead of dropping in place, since data_cleaned was sliced from data
data_cleaned = data_cleaned.drop(columns=too_incomplete_features + useless_features)

We now have a `data_cleaned` dataset! Let's now engineer our features as needed.

## Prepare features and target

### Numerical features

In [49]:
# List numerical features to process
features_numerical = []

In [50]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, StandardScaler

def preprocess_numerical_features(X):
    '''
    Returns a new DataFrame with
    - Missing values replaced by Column Mean
    - Features Standard Scaled
    - Original Features names kept in the DataFrame
    '''
    pass

In [None]:
# Check your code below
preprocess_numerical_features(data_cleaned[features_numerical])

❓ Do you get a Warning "A value is trying to be set on a copy of a slice from a DataFrame"?
If so, it may be because you are trying to modify the input DataFrame `data_cleaned`!

Read this [important blog post on copy vs. view](https://www.practicaldatascience.org/html/views_and_copies_in_pandas.html) in pandas DataFrames and try to solve the warning by yourself.



<details>
    <summary>Hint</summary>

`pd.DataFrame.copy()`
</details>
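
If you get stuck, here is one possible shape for such a function, shown on a toy DataFrame. It is only a sketch: the name `preprocess_numerical_sketch` and the `age` and `speed` columns are made up for illustration, and chaining the two steps with `make_pipeline` is one choice among several.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def preprocess_numerical_sketch(X):
    """Mean-impute then standard-scale, returning a new DataFrame."""
    X = X.copy()  # work on a copy so the caller's DataFrame is left untouched
    pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
    # fit_transform returns a NumPy array: rebuild a DataFrame to keep names and index
    return pd.DataFrame(pipe.fit_transform(X), columns=X.columns, index=X.index)

# Toy data with hypothetical column names
toy = pd.DataFrame({"age": [20.0, np.nan, 40.0], "speed": [50.0, 90.0, 130.0]})
toy_scaled = preprocess_numerical_sketch(toy)
print(toy_scaled)
```

The input `toy` still contains its missing value afterwards, which is the behaviour the copy is there to guarantee.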

### (Optional) Cyclical features

In [3]:
# Do you see any cyclical features to process specifically?
# This can be done after a first baseline model is created.
features_cyclical = []

In [57]:
# YOUR CODE BELOW
def preprocess_cyclical_features(X):
    '''
    Input: DataFrame X
    Output: a new DataFrame in which each feature X_i has been replaced by two columns,
    sin(2*Pi / cycle_length * X_i) and cos(2*Pi / cycle_length * X_i), and the initial feature X_i has been dropped.
    '''
    pass
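
As an illustration of the sine/cosine encoding described in the docstring, here is a minimal sketch on a toy `month` column. The function name, the `month` column and the extra `cycle_lengths` argument (a dict mapping each column to its cycle length) are assumptions made for this example, not part of the challenge template.

```python
import numpy as np
import pandas as pd

def encode_cyclical_sketch(X, cycle_lengths):
    """Return a new DataFrame with a sin and a cos column per listed feature."""
    out = pd.DataFrame(index=X.index)
    for col, length in cycle_lengths.items():
        angle = 2 * np.pi * X[col] / length
        out[f"{col}_sin"] = np.sin(angle)
        out[f"{col}_cos"] = np.cos(angle)
    return out

# Hypothetical feature: month of the year, with a cycle length of 12
toy = pd.DataFrame({"month": [1, 6, 12]})
toy_cyc = encode_cyclical_sketch(toy, {"month": 12})
print(toy_cyc)
```

With this encoding, December and January end up close to each other on the unit circle, which a raw 1 to 12 encoding cannot express.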


### Categorical features

❓ Create the last group of features (categorical features) without hardcoding them manually. Then, create the associated preprocessing method.

In [58]:
features_categorical = list(set(data_cleaned.columns) - set(features_numerical) - set(features_cyclical) - {'grav'})

In [60]:
def preprocess_categorical_features(X):   
    ''' Returns a new DataFrame with dummified columns'''
    pass
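
One possible way to dummify, sketched on toy data: cast every column to `category` so that numeric codes are treated as labels, then call `pd.get_dummies`. The function name and the toy columns `lum` and `atm` are placeholders for illustration, and the `uint8` dtype is just a memory-saving choice.

```python
import pandas as pd

def dummify_sketch(X):
    """One-hot encode every column, treating all values as categories."""
    return pd.get_dummies(X.astype("category"), dtype="uint8")

# Placeholder columns holding numeric category codes, one of them with a missing value
toy = pd.DataFrame({"lum": [1, 2, 1], "atm": [1.0, None, 3.0]})
toy_dummies = dummify_sketch(toy)
print(toy_dummies)
```

Note that with the default `dummy_na=False`, a missing value simply gets zeros in all dummy columns of its feature, so no NaN survives this step.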

❓ Create the new `data_preprocessed` dataset by concatenating the outputs of all three preprocessing functions, then drop any remaining NaN values that our preprocessing could not handle.

In [55]:
# YOUR CODE BELOW

(1125397, 216)

## Split Dataset
❓ Create X and y, and don't forget to convert the classification into a binary task.

For instance:

```python
data['grav_binary'] = data['grav'].replace({1: 0, 4: 0, 2: 1, 3: 1})
```

In [27]:
# Create X and y

In [28]:
# Create a smaller dataset (X_small, y_small) for investigation purposes only

In [None]:
# Train/test split using random_state=414
# (let's forget for the sake of this challenge that we are data-leaking a bit on our preprocessing steps above)

In [62]:
# (optional) Create here a train/eval split within the train set itself.
# Some powerful models (XGBoost, neural networks...) are prone to overfitting the training set and need an "early stopping" criterion, so that they stop before descending the gradient completely.
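
The two splits above could look like the sketch below, run here on random placeholder data. The 30% test size matches the contest rules further down; the 20% eval size and the use of `stratify` (to keep the class balance identical in every subset) are assumptions you are free to change.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 5))            # placeholder features
y_toy = (rng.random(1000) < 0.2).astype(int)  # placeholder imbalanced binary target

# 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=414, stratify=y_toy
)
# Train/eval split carved out of the train set only (the test set stays untouched)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_train, y_train, test_size=0.2, random_state=414, stratify=y_train
)
print(X_fit.shape, X_eval.shape, X_test.shape)
```

The eval set is what an early-stopping model would monitor during training, while the test set is only used once for the final score.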

# Features exploration

You now have a dataset ready for training! 
**Skip directly to section 5 to get a baseline model working ASAP**, and only then come back to this section 4 if you want to better understand your X and get inspiration for the best model to use, or for some feature selection to reduce model complexity.

## Visualization

❓ Investigate your X. Are features strongly correlated? Are some features more important than others?

In [63]:
# YOUR CODE HERE

## Forest-based most important features

❓ Fit a default RandomForestClassifier on a small sample to estimate the top 20 feature importances. Do they make intuitive sense to you? Do you see any clear elbow for dimensionality reduction?

In [64]:
# YOUR CODE HERE
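
A minimal sketch of the idea, using a synthetic dataset from `make_classification` in place of your `X_small` and `y_small` (the feature names `f0`...`f29` are made up):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a small sample of the real data
X_arr, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=5, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(30)])

forest = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

# Impurity-based importances, sorted from most to least important
top20 = (
    pd.Series(forest.feature_importances_, index=X_demo.columns)
    .sort_values(ascending=False)
    .head(20)
)
print(top20)
```

Plotting `top20` as a bar chart (for instance with `top20.plot.bar()`) is a quick way to look for an elbow.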

❓ (Optional) There are better ways to estimate feature importance in a RandomForest. Feel free to try the two following options:

**Option 1** : Recursive-method
1. Train a first model, note the top1 feature (computed from the Gini-based explanatory power of the feature in each tree)
2. Remove top1 from your X and retrain a RandomForest. Note the new top1 feature and its relative importance
3. Loop

**Option 2** : Permutation-method ([sklearn.inspection.permutation_importance](https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-importance)), works with any model!
1. Train a first model, keep track of its accuracy
2. Take one feature and shuffle its column. Compute the new accuracy on the corrupted dataset, and note by how much it has been reduced.
3. Loop over all features and rank them by accuracy reduction
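
Option 2 can be sketched as follows with `sklearn.inspection.permutation_importance`, again on a synthetic dataset with made-up feature names. The number of trees and `n_repeats=5` are arbitrary choices to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a small sample of the real data
X_arr, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column 5 times on held-out data and record the accuracy drop
result = permutation_importance(
    model, X_te, y_te, scoring="accuracy", n_repeats=5, random_state=0
)
ranking = pd.Series(result.importances_mean, index=X_demo.columns).sort_values(
    ascending=False
)
print(ranking)
```

Swapping `scoring="accuracy"` for another scorer such as `"f1_macro"` ranks the features by the metric you actually care about.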

In [33]:
# YOUR CODE HERE

# Modeling

## Baseline performance metrics

❓ What is the class balance of your target?  
Do you think accuracy would be a good score?
If you don't want to favor any class over the other, what would be a good performance metric for your problem? 

<details>
    <summary>Answer</summary>

In such an imbalanced problem, accuracy is misleading: a naive model that always predicts 0 would reach a high accuracy, while its precision and recall on class 1 would both be zero!

The unweighted mean of the per-class F1 scores, called `f1_macro`, would be a good measure for this type of problem.
</details>
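
To make this concrete, here is a small sketch comparing accuracy and `f1_macro` for a model that always predicts the majority class. The 90/10 class balance is made up for the example and is not the balance of the challenge data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 90 + [1] * 10)  # made-up 90/10 class balance
X_dummy = np.zeros((100, 1))            # features are ignored by the dummy model

# Baseline that always predicts the most frequent class (here: 0)
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 silences the warning raised because class 1 is never predicted
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.2f}, f1_macro={f1_macro:.2f}")
```

On this toy target the accuracy is 0.90 while `f1_macro` is only about 0.47, because the F1 score of class 1 is zero.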

## Simple Model (A first iteration)

❓ Create a simple model, fast to train, to classify the severity of the accidents. Start simple. Don't forget to fit on your training set and evaluate the score on your test set. Can you beat the Baseline? What about its `f1_macro` score? Measure the time it takes on the full dataset, with `%%time` 

In [66]:
%%time
# YOUR CODE BELOW (the %%time cell magic must stay on the first line of the cell)

# 🔥🔥🔥 Advanced Models - Le Wagon batch contest! 🔥🔥🔥

❓ Now it's your turn to shine! Play with different models and try to find the best one on your training set!
- Send your best `f1_macro` score (as defined above) to your slack channel without saying which model you used!
- ⚠️ Only scores computed on the full-size `y_test` (30% of 1M rows!) will be taken into account
- Feel free to use your X_small for investigation purposes
- If it takes too long to train, simplify your model, or use better feature preprocessing/selection

The winner will present their notebook to the class during the reboot 💪

(Don't forget, your Notebook should be made to be run from top to bottom in one go!)

<details>
    <summary>Some hints</summary>
Take a closer look at feature engineering: Are there some features we haven't correctly preprocessed?  
    
Most of the time, a good dataset trumps a good model!
</details>

In [68]:
# YOUR CODE HERE

### (Optional) - Pipeline most steps (preprocessing & fit) in a single scikit-learn Pipeline
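
As a starting point, here is a minimal sketch of such a pipeline on toy data. The column names `num_a` and `cat_b`, the choice of `LogisticRegression` and the step names are all placeholders; the point is only to show how a `ColumnTransformer` routes each group of features to its own preprocessing before the model is fitted.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one numerical and one categorical placeholder feature
toy_X = pd.DataFrame({
    "num_a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "cat_b": ["x", "y", "x", "y", "x", "y"],
})
toy_y = np.array([0, 1, 0, 1, 0, 1])

preprocessor = ColumnTransformer([
    # Numerical columns: mean imputation, then standard scaling
    ("num", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), ["num_a"]),
    # Categorical columns: one-hot encoding, ignoring categories unseen at fit time
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat_b"]),
])
pipe = Pipeline([
    ("prep", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(toy_X, toy_y)
toy_pred = pipe.predict(toy_X)
print(toy_pred)
```

Because the preprocessing is fitted inside the pipeline, calling `pipe.fit` on the training set only also removes the small data leak mentioned in the split section.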