# Fixing a Corrupted Model
```In this exercise you will practice "debugging" a model. You will analyze the datasets, the differences between them, and the problems in the model's training process.```

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.rcParams['figure.figsize'] = (12, 7)

```Here you will explore a trained model and see where, and more interestingly, why, it fails.```

- ```Completion Rate (CR) is the percentage of orders that were completed successfully. You have a model that predicts this metric in Kazan, a city in Russia.```
- ```Attached to this notebook are two files: `train.csv` and `test.csv` ```
- ```A model to predict CR (`cr_gaus`) was trained on `train.csv` (`DataAnalyst-Model.ipynb`) and saved into `model.pkl` ```
- ```Your task is to analyze the model results and find out if there are any issues with it```
- ```Please answer the questions below```
- ```Provide evidence in the form of statistics and plots```
- ```Write your own conclusion about the results and your suggestions for a solution```
- ```There are at least two major problems, and some other minor problems```

# Load Data

In [59]:
df_train = pd.read_csv('resources/train.csv', encoding='utf-8')
df_test = pd.read_csv('resources/test.csv', encoding='utf-8')

In [60]:
df_train.sample(5)

| Unnamed: 0 | sample_t | area | weekday | weekdaycat | hour | hourcat | cos604800 | sin604800 | cr_weight | cr_gaus | cr_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 73681 | 1522256700 | Kazan улица Воровского | 2.711806 | | 17.083333 | 17 | 0.966728 | -0.255807 | 0.629354 | 1.0 | 0.628231 |
| 4834 | 1521120600 | Kazan улица Патриса Лумумбы | 3.5625 | | 13.5 | 13 | 0.875223 | 0.483719 | 22.786867 | 0.77061 | 0.57218 |
| 101505 | 1522395000 | Kazan проспект Фатыха Амирхана | 4.3125 | | 7.5 | 7 | 0.382683 | 0.92388 | 23.267983 | 0.9316 | 0.830634 |
| 51801 | 1520789100 | Kazan Сибирский Tракт 2 | 6.725694 | | 17.416667 | 17 | -0.97955 | -0.2012 | 7.306992 | 0.46454 | 0.38683 |
| 21392 | 1521093600 | Kazan Аэропорт Казань | 3.25 | 6.0 | 6.0 | 6 | 0.974928 | 0.222521 | 0.619839 | 1.0 | 0.939975 |
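
Several rows in the sample have an empty `weekdaycat`, so a natural first check is whether missing values are spread differently in the two files. The following is a minimal sketch, not part of the original notebook: the helper name `missing_rate_comparison` and the toy frames are made up for illustration.

```python
import pandas as pd

def missing_rate_comparison(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Share of missing values per shared column, train vs. test, side by side."""
    cols = [c for c in train.columns if c in test.columns]
    out = pd.DataFrame({
        "train": train[cols].isna().mean(),
        "test": test[cols].isna().mean(),
    })
    out["diff"] = out["test"] - out["train"]
    return out

# Toy frames for illustration only; the values are invented.
toy_train = pd.DataFrame({"weekdaycat": [None, 6.0, None, None],
                          "hour": [1.0, 2.0, 3.0, 4.0]})
toy_test = pd.DataFrame({"weekdaycat": [1.0, 2.0, None, 3.0],
                         "hour": [1.0, 2.0, 3.0, 4.0]})

rates = missing_rate_comparison(toy_train, toy_test)
print(rates)
```

In the notebook, the equivalent call would be `missing_rate_comparison(df_train, df_test)`; a large `diff` for a feature the model uses is worth plotting.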


# Load model

In [4]:
from lightgbm import LGBMRegressor
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

In [5]:
lgb = joblib.load('resources/model.pkl')

In [6]:
features = lgb.booster_.feature_name()
features

[u'area',
 u'weekday',
 u'weekdaycat',
 u'hour',
 u'hourcat',
 u'cos604800',
 u'sin604800',
 u'cr_mean']

In [61]:
df_train['area'] = df_train['area'].astype('category')
df_train['weekdaycat'] = df_train['weekdaycat'].astype('category')
df_train['hourcat'] = df_train['hourcat'].astype('category')

df_test['area'] = df_test['area'].astype(df_train['area'].dtype)
df_test['weekdaycat'] = df_test['weekdaycat'].astype(df_train['weekdaycat'].dtype)
df_test['hourcat'] = df_test['hourcat'].astype(df_train['hourcat'].dtype)
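
One side effect of this cast deserves a check: when a test value is not among the training categories, pandas silently turns it into NaN. The sketch below illustrates this with invented area names; `unseen_share` is a hypothetical helper, not something the notebook defines.

```python
import pandas as pd

def unseen_share(train_col: pd.Series, test_col: pd.Series) -> float:
    """Fraction of non-null test values that never occur in the training column."""
    known = set(train_col.dropna().unique())
    test_values = test_col.dropna()
    if len(test_values) == 0:
        return 0.0
    return float((~test_values.isin(known)).mean())

# Toy data for illustration only; the area names are invented.
toy_train = pd.Series(["A", "B", "B"], dtype="category")
toy_test = pd.Series(["A", "C", "C", "B"])

share = unseen_share(toy_train, toy_test)

# Casting to the training dtype: "C" is not a training category, so it becomes NaN.
cast = toy_test.astype(toy_train.dtype)
print(share, int(cast.isna().sum()))
```

To apply this to the real data, the helper would need the raw test column, that is, before the cast above overwrites it (for example from a fresh `pd.read_csv` of `test.csv`).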

In [62]:
df_train['cr_pred'] = lgb.predict(df_train[features])
df_test['cr_pred'] = lgb.predict(df_test[features])

# Results

```Please analyze and compare the MAE of the predictions```

1. ```Is there any difference between train and test? Why?```
2. ```Is there any difference between areas? If so, why?```
3. ```The PM still thinks there is a problem with our prediction... I can't find any problem. Can you? Find as many as you can!```

**Don't forget your plots**
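
As a possible starting point for questions 1 and 2, here is a minimal sketch of an overall and a per-group MAE. It assumes the target column is `cr_gaus` (as the task description suggests) and that predictions are stored in `cr_pred`; the helpers `mae` and `mae_by_group` and the toy frame are illustrative only. The plain pandas expression used here computes the same unweighted quantity as `mean_absolute_error`.

```python
import pandas as pd

def mae(df: pd.DataFrame, target: str = "cr_gaus", pred: str = "cr_pred") -> float:
    """Unweighted mean absolute error between target and prediction columns."""
    return float((df[target] - df[pred]).abs().mean())

def mae_by_group(df: pd.DataFrame, group_col: str,
                 target: str = "cr_gaus", pred: str = "cr_pred") -> pd.DataFrame:
    """MAE and row count per group, worst groups first."""
    abs_err = (df[target] - df[pred]).abs()
    out = abs_err.groupby(df[group_col], observed=True).agg(mae="mean", n="size")
    return out.sort_values("mae", ascending=False)

# Toy frame for illustration only; the values are invented.
toy = pd.DataFrame({
    "area": ["A", "A", "B", "B"],
    "cr_gaus": [1.0, 0.5, 0.8, 0.2],
    "cr_pred": [0.9, 0.5, 0.4, 0.4],
})

overall = mae(toy)
per_area = mae_by_group(toy, "area")
print(overall)
print(per_area)
```

In the notebook this could be called as `mae(df_train)`, `mae(df_test)`, and `mae_by_group(df_test, 'area')`. Keeping the row count `n` next to each MAE helps separate genuinely hard areas from areas with too few samples to judge.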