# Base models

This model is [Piotrek](https://twitter.com/pkuchta)'s baseline for their [Kaggle competition](https://inclass.kaggle.com/c/predict-impact-of-air-quality-on-death-rates).

Here we check the RMSE of two simple models that serve as baselines: linear regression and the mean value.


In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn import svm
from sklearn.model_selection import GridSearchCV

import numpy as np

train = pd.read_csv('input/train.csv', parse_dates=['date'])

The training set has the following columns:
* `Id`
* `region`
* `date`
* `mortality_rate`
* `O3`
* `PM10`
* `PM25`
* `NO2`
* `T2M`

In [11]:
train.head()

Unnamed: 0,Id,region,date,mortality_rate,O3,PM10,PM25,NO2,T2M
0,1,E12000001,2007-01-02,2.264,42.358,9.021,,,278.138
1,2,E12000001,2007-01-03,2.03,49.506,5.256,,,281.745
2,3,E12000001,2007-01-04,1.874,51.101,4.946,,,280.523
3,4,E12000001,2007-01-05,2.069,47.478,6.823,,,280.421
4,5,E12000001,2007-01-06,1.913,45.226,7.532,,,278.961


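NO2 and PM25 have missing values for 2007. The simplest way of dealing with this is to remove the affected rows. The cell below keeps only rows dated 2009 or later; the commented-out `dropna` call would instead drop every row that has any missing value.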

In [12]:
#train = train.dropna(axis=0, how='any')
train = train.loc[train['date'].dt.year >= 2009]
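To see which years are actually affected before choosing between the two approaches, the missing values can be counted per year. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in for `train`, and its values are invented), so it only illustrates the technique, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv (values are made up).
toy = pd.DataFrame({
    'date': pd.to_datetime(['2007-01-02', '2007-01-03', '2009-01-02', '2009-01-03']),
    'PM25': [np.nan, np.nan, 11.0, 9.5],
    'NO2': [np.nan, np.nan, 20.1, np.nan],
})

# Count missing values per column, grouped by calendar year.
missing_by_year = toy[['PM25', 'NO2']].isna().groupby(toy['date'].dt.year).sum()
print(missing_by_year)

# Year filter, as in the cell above: rows with NaN can survive it.
by_year = toy.loc[toy['date'].dt.year >= 2009]

# dropna removes every row that still has a missing value.
by_dropna = toy.dropna(axis=0, how='any')
```

In this toy frame the year filter keeps a 2009 row whose `NO2` is missing, while `dropna` removes it. If the real data has gaps after 2008, `LinearRegression.fit` would fail on them, so it may be worth checking.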

We start with the linear regression model.

The competition test set looks like the training set above, except that it does not have `mortality_rate` (that's what we're after). To measure RMSE locally, we hold out 30% of the training data as a test set. Ignore `region` and `date` for now.

In [13]:

seed = 7
np.random.seed(seed)
X_train, X_test, y_train, y_test = train_test_split(train[['O3', 'PM10', 'PM25', 'NO2', 'T2M']],
                                                    train['mortality_rate'], 
                                                    test_size = 0.3,
                                                    random_state=22)

## Linear regression model

In [14]:
regr = LinearRegression()
regr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Linear regression model: RMSE on the test data

In [15]:
np.mean((y_test - regr.predict(X_test))**2)**0.5


0.2303967748219074
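The same RMSE expression is repeated in several cells of this notebook. A minimal sketch of a reusable helper follows; the name `rmse` is an assumption for illustration and is not defined anywhere in the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up numbers: the errors are 0, 0 and -2, so the mean squared error is 4/3.
example = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(example)
```

With such a helper, the cell above could be written as `rmse(y_test, regr.predict(X_test))`.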

## Mean value model

In [16]:
mean_ration = y_test.mean()
mean_ration

1.2639107042253521

Mean value model: RMSE on the test data

In [17]:
np.mean((y_test - mean_ration)**2)**0.5

0.28631876291660596
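Note that the cell above takes the mean from `y_test` itself. A baseline that could be applied to unseen data would take the mean from `y_train` instead. The sketch below shows that variant with scikit-learn's `DummyRegressor` on made-up numbers; the `*_toy` arrays are hypothetical, and the result is not comparable with the RMSE reported above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Made-up target values, for illustration only.
y_train_toy = np.array([1.0, 1.2, 1.4, 1.6])
y_test_toy = np.array([1.1, 1.5])

# DummyRegressor ignores the features; it only needs the right number of rows.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train_toy, y_train_toy)
pred = baseline.predict(X_test_toy)  # the training mean, repeated for each row

baseline_rmse = np.mean((y_test_toy - pred) ** 2) ** 0.5
print(baseline_rmse)
```

On the real split, the equivalent would be to fit on `X_train, y_train` and score against `y_test`.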

## Exploring different models

In [18]:
models = [('kneighbours', KNeighborsRegressor(),
              {'n_neighbors':[2,5,7,10,15,20,30,40,50],
               'leaf_size':[1,2,5,10,20,30,50,100]}),
          ('random_forest', RandomForestRegressor(), 
              {'n_estimators':[10,20,30,50]}),
          ('svr', svm.SVR(), 
              {'kernel':('linear', 'rbf'), 
               'C':[1, 2, 5, 8, 10]})]

def file_name(algo_name, params):
    return algo_name + ':' + ','.join([str(k)+'='+str(v) for k,v in params.items()]) + '.csv'

for algo_name, regressor, parameters in models:
    # http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
    model = GridSearchCV(regressor, parameters, n_jobs = 4, verbose = 0)
    model.fit(X_train, y_train)
    rmse = np.mean((y_test - model.predict(X_test))**2)**0.5
    print (algo_name + ': Checked', parameters, 'the best is:', model.best_params_, 'rmse', rmse, '\n')

    #predictions = X_test[['Id']].copy()
    #predictions['mortality_rate'] = model.predict(X_test)
    


kneighbours: Checked {'n_neighbors': [2, 5, 7, 10, 15, 20, 30, 40, 50], 'leaf_size': [1, 2, 5, 10, 20, 30, 50, 100]} the best is: {'leaf_size': 1, 'n_neighbors': 30} rmse 0.221742500868 

random_forest: Checked {'n_estimators': [10, 20, 30, 50]} the best is: {'n_estimators': 50} rmse 0.224691911236 



KeyboardInterrupt: 
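When `GridSearchCV` is called without a `scoring` argument, it ranks parameter settings by the estimator's default `score`, which for scikit-learn regressors is R², not RMSE. The sketch below selects by RMSE instead, on synthetic data. It assumes scikit-learn 0.22 or later, where the `'neg_root_mean_squared_error'` scorer is available. `X_toy` and `y_toy` are hypothetical stand-ins for `X_train` and `y_train`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data standing in for X_train / y_train (not the competition data).
rng = np.random.RandomState(0)
X_toy = rng.uniform(size=(200, 5))
y_toy = X_toy[:, 0] + 0.1 * rng.normal(size=200)

search = GridSearchCV(
    KNeighborsRegressor(),
    {'n_neighbors': [2, 5, 10, 20]},
    scoring='neg_root_mean_squared_error',  # negated so that higher is better
    cv=5,
)
search.fit(X_toy, y_toy)

# best_score_ is the negated mean cross-validated RMSE of the best setting.
cv_rmse = -search.best_score_
print(search.best_params_, cv_rmse)
```

The grid here leaves out `leaf_size`: according to the scikit-learn documentation it affects the speed and memory use of the neighbour search tree rather than the predictions, so searching over it mostly adds run time.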