# Predicting consomation in ECOMETERING_LIGHT.data_prepared_2011

### Notebook automatically generated from your Model

#### Generated on 2015-01-27 10:11:44.287767

This notebook reproduces all the steps of a regression on ECOMETERING_LIGHT.data_prepared_2011.
The main objective is to predict the variable consomation.

Let's start by importing the required libraries:

In [0]:
import dataiku
import numpy as np
import pandas as pd
import sklearn as sk
import dataiku.core.pandasutils as pdu
from dataiku.doctor.preprocessing import PCA
from collections import defaultdict, Counter

And tune pandas display options:

In [0]:
pd.set_option('display.width', 3000)
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

#### Importing base data

The first step is to get our machine learning dataset:

In [0]:
input_columns = [u'temperature', u'hour', u'NAF_code', u'Site', u'month', u'week_day', u'consomation']
%time ml_dataset = dataiku.Dataset('ECOMETERING_LIGHT.data_prepared_2011').get_dataframe(columns=input_columns)
print 'Base data has %i rows and %i columns' % (ml_dataset.shape[0], ml_dataset.shape[1])
# First five records
ml_dataset.head(5)

#### Initial data management

The preprocessing aims at making the dataset compatible with modeling.
At the end of this step, we will have a matrix of float numbers, with no missing values.
We'll reuse the schema and the preprocessing steps defined in Models.

Let's coerce categorical columns into unicode and numerical features into floats.

In [0]:
categorical_features = [u'hour', u'NAF_code', u'Site', u'month', u'week_day']
numerical_features = [u'temperature']
text_features = []
from dataiku.doctor.utils import datetime_to_epoch
for feature in categorical_features:
    ml_dataset[feature] = ml_dataset[feature].astype('unicode')
for feature in text_features:
    ml_dataset[feature] = ml_dataset[feature].astype('unicode')
for feature in numerical_features:
    if ml_dataset[feature].dtype == np.dtype('<M8[ns]'):
        ml_dataset[feature] = datetime_to_epoch(ml_dataset[feature])
    else:
        ml_dataset[feature] = ml_dataset[feature].astype('double')
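
To make the effect of this coercion concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration, and `str` stands in for the notebook's 'unicode' dtype so that the snippet also runs on Python 3.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    'hour': [0, 13, 23],          # categorical feature stored as integers
    'temperature': [3, 12, 18],   # numerical feature stored as integers
})

# Categorical columns become strings, so each distinct value is a level
toy['hour'] = toy['hour'].astype(str)
# Numerical columns become floats, so missing values can be held as NaN
toy['temperature'] = toy['temperature'].astype('double')
```

After this step, `hour` holds the strings '0', '13' and '23', and `temperature` is a float64 column.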

( Deduplication not required, skipping... )

We rename the target variable to a column named `__target__`:

In [0]:
ml_dataset['__target__'] = ml_dataset['consomation']
del ml_dataset['consomation']


# Remove rows for which the target is unknown.
ml_dataset = ml_dataset[~ml_dataset['__target__'].isnull()]

The dataset needs to be split into two new sets: one used to train the model,
and another used to test how well it generalizes.
This is a simple hold-out validation strategy (here, 80% of the rows go to training).

In [0]:
train, valid = pdu.split_train_valid(ml_dataset, prop=0.8)
print 'Train data has %i rows and %i columns' % (train.shape[0], train.shape[1])
print 'Validation data has %i rows and %i columns' % (valid.shape[0], valid.shape[1])
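
The helper `pdu.split_train_valid` is internal to Dataiku, and its exact behaviour is not shown in this notebook. As a rough idea of what a random hold-out split does, here is a sketch in plain pandas. The function name `split_train_valid_sketch` and its `seed` parameter are hypothetical and are not part of the Dataiku API.

```python
import numpy as np
import pandas as pd

def split_train_valid_sketch(df, prop=0.8, seed=1337):
    """Hypothetical stand-in for a random hold-out split (not Dataiku's code)."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(len(df))   # shuffled row positions
    cut = int(round(prop * len(df)))      # number of rows in the training part
    return df.iloc[shuffled[:cut]], df.iloc[shuffled[cut:]]

# Toy frame, made up for illustration
toy = pd.DataFrame({'x': range(10), '__target__': range(10)})
toy_train, toy_valid = split_train_valid_sketch(toy, prop=0.8)
```

Each row ends up in exactly one of the two parts, and fixing the seed makes the split reproducible.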

#### Features preprocessing

The first thing to do at the feature level is to handle the missing values.
Let's reuse the settings defined in the model.

In [0]:
drop_rows_when_missing = []
flag_when_missing = []
impute_when_missing = [{'impute_with': u'MEDIAN', 'feature': u'temperature'}, {'impute_with': u'MODE', 'feature': u'hour'}, {'impute_with': u'MODE', 'feature': u'NAF_code'}, {'impute_with': u'MODE', 'feature': u'Site'}, {'impute_with': u'MODE', 'feature': u'month'}, {'impute_with': u'MODE', 'feature': u'week_day'}]

# Features for which we drop rows with missing values
for feature in drop_rows_when_missing:
    train = train[train[feature].notnull()]
    valid = valid[valid[feature].notnull()]
    print 'Dropped missing records in %s' % feature

# Features for which we replace the feature by a 'flag feature' indicating whether the value was present
for feature in flag_when_missing:
    train['notMissing_' + feature] = train[feature].map(lambda x: 0 if pd.isnull(x) else 1).astype(np.uint8)
    del train[feature]
    
    valid['notMissing_' + feature] = valid[feature].map(lambda x: 0 if pd.isnull(x) else 1).astype(np.uint8)
    del valid[feature]
    
    print 'Flagged missing values in feature %s' % feature

# Features for which we impute missing values
for feature in impute_when_missing:
    if feature['impute_with'] == 'MEAN':
        v = train[feature['feature']].mean()
    elif feature['impute_with'] == 'MEDIAN':
        v = train[feature['feature']].median()
    elif feature['impute_with'] == 'CREATE_CATEGORY':
        v = 'NULL_CATEGORY'
    elif feature['impute_with'] == 'MODE':
        v = train[feature['feature']].value_counts().index[0]
    train[feature['feature']] = train[feature['feature']].fillna(v)
    valid[feature['feature']] = valid[feature['feature']].fillna(v)
    print 'Imputed missing values in feature %s with value %s' % (feature['feature'], unicode(str(v), 'utf8'))
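
Note that the imputation values are computed on the training set only and then applied to both sets, so no information from the validation set leaks into preprocessing. The sketch below shows this on toy frames. All values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in both sets
toy_train = pd.DataFrame({'temperature': [1.0, np.nan, 3.0, 10.0],
                          'Site': ['a', 'b', 'b', None]})
toy_valid = pd.DataFrame({'temperature': [np.nan, 5.0],
                          'Site': [None, 'a']})

# Statistics are learned on the training set only...
median_temp = toy_train['temperature'].median()         # median of 1, 3, 10
mode_site = toy_train['Site'].value_counts().index[0]   # most frequent level

# ...and then applied to both sets
for part in (toy_train, toy_valid):
    part['temperature'] = part['temperature'].fillna(median_temp)
    part['Site'] = part['Site'].fillna(mode_site)
```

Here the missing temperature in the validation set is filled with the training median, and the missing site with the most frequent training site.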

We can now handle the categorical features (still using the settings defined in Models):

Let's dummify the following features.

In [0]:
LIMIT_DUMMIES = 10

categorical_to_dummify = [u'hour', u'NAF_code', u'Site', u'month', u'week_day']

def select_dummy_values(df, features):
    # For each feature, keep the LIMIT_DUMMIES most frequent values found in df
    dummy_values = {}
    for feature in features:
        values = [
            value
            for (value, _) in Counter(df[feature]).most_common(LIMIT_DUMMIES)
        ]
        dummy_values[feature] = values
    return dummy_values

DUMMY_VALUES = select_dummy_values(train, categorical_to_dummify)

def dummify_dataframe(df):
    for (feature, dummy_values) in DUMMY_VALUES.items():
        for dummy_value in dummy_values:
            dummy_name = '%s_value_%s' % (feature, unicode(dummy_value, 'utf8'))
            df[dummy_name] = (df[feature] == dummy_value).astype(float)
        del df[feature]
        print 'Dummified feature %s' % feature

dummify_dataframe(train)

dummify_dataframe(valid)
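
The dummy values are chosen on the training set and reused as-is on the validation set, so both end up with the same columns. Values outside the kept list simply get zeros everywhere. Here is a small sketch of that behaviour. The toy frames, the limit of 2 and the helper `dummify_sketch` are made up for illustration.

```python
from collections import Counter
import pandas as pd

# Hypothetical toy data: 'd' never appears in training
toy_train = pd.DataFrame({'Site': ['a', 'a', 'b', 'b', 'b', 'c']})
toy_valid = pd.DataFrame({'Site': ['a', 'c', 'd']})

# Keep only the 2 most frequent levels seen in training (illustrative limit)
kept = [value for value, _ in Counter(toy_train['Site']).most_common(2)]

def dummify_sketch(df, feature, values):
    """Return a copy of df where `feature` is replaced by one 0/1 column per kept value."""
    out = df.drop(feature, axis=1)
    for value in values:
        out['%s_value_%s' % (feature, value)] = (df[feature] == value).astype(float)
    return out

train_dummies = dummify_sketch(toy_train, 'Site', kept)
valid_dummies = dummify_sketch(toy_valid, 'Site', kept)
```

The rare level 'c' and the unseen level 'd' both map to a row of zeros, which is the price of capping the number of dummies.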

Rescaling features

In [0]:
# Rescaling is not required
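
Tree-based models such as random forests are insensitive to the scale of their inputs, which is why this step can be skipped here. If you switch to a scale-sensitive algorithm, a standard rescaling could look like the sketch below. This is an assumption about what you might add, not something the generated model does, and the toy values are made up.

```python
import pandas as pd

# Hypothetical toy data
toy_train = pd.DataFrame({'temperature': [0.0, 10.0, 20.0]})
toy_valid = pd.DataFrame({'temperature': [10.0, 30.0]})

# Mean and standard deviation are learned on the training set only
mean = toy_train['temperature'].mean()
std = toy_train['temperature'].std()   # sample standard deviation (ddof=1)

# The same shift and scale are applied to both sets
toy_train['temperature'] = (toy_train['temperature'] - mean) / std
toy_valid['temperature'] = (toy_valid['temperature'] - mean) / std
```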

#### Dimension Reduction

Finally, the last step before training is dimensionality reduction using Principal Component Analysis.

In [0]:
# disabled in this model
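
For reference, here is a minimal sketch of what a PCA step does, written with NumPy only. It is not the `PCA` class imported from `dataiku.doctor.preprocessing`, whose interface is not shown in this notebook. The helpers `pca_fit` and `pca_transform` are hypothetical names.

```python
import numpy as np

def pca_fit(X, n_components):
    """Learn the column means and the top principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(X, mean, components):
    """Project centered data onto the learned directions."""
    return (X - mean).dot(components.T)

# Toy data: two perfectly correlated columns, so one component is enough
toy_train = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
mean, components = pca_fit(toy_train, n_components=1)
reduced = pca_transform(toy_train, mean, components)
```

As with imputation, the projection would be fitted on the training set and then applied unchanged to the validation set.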

#### Modeling

Before actually creating our model, we need to split the datasets into their features and labels parts:

In [0]:
train_X = train.drop('__target__', axis=1)
valid_X = valid.drop('__target__', axis=1)

train_Y = np.array(train['__target__'])
valid_Y = np.array(valid['__target__'])

Now we can finally create our model!

In [0]:
from dataiku.doctor import RandomForestRegressorIML
clf = RandomForestRegressorIML(
    n_jobs=2,
    random_state=1337,
    max_depth=23,
    min_samples_leaf=31,
    verbose=2)

... And train it

In [0]:
%time clf.fit(train_X, train_Y)

Build up our result dataset

In [0]:
%time _predictions = clf.predict(valid_X)
predictions = pd.Series(data=_predictions, index=valid_X.index, name='predicted_value')

# Build scored dataset
results_valid = valid_X.join(predictions, how='left')
results_valid = results_valid.join(valid['__target__'], how='left')
results_valid = results_valid.rename(columns= {'__target__': 'consomation'})

Let's have a look at feature importances

In [0]:
feature_importances_data = []
features = train_X.columns
for feature_name, feature_importance in zip(features, clf.feature_importances_):
    feature_importances_data.append({
        'feature': feature_name,
        'importance': feature_importance
    })

# Plot the results
pd.DataFrame(feature_importances_data)\
    .set_index('feature')\
    .sort('importance')[-10::]\
    .plot(title='Top 10 most important variables',
          kind='barh',
          figsize=(10, 6),
          color='#348ABD',
          alpha=0.6,
          lw='1',
          edgecolor='#348ABD',
          grid=False,)

#### Results

You can measure the model's performance with the Pearson correlation between predicted and actual values:

In [0]:
c =  results_valid[['predicted_value', 'consomation']].corr()
print 'Pearson correlation: %s' % c['predicted_value'][1]
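
Keep in mind that a Pearson correlation only measures how well predictions and actual values move together. It does not penalize a systematic offset. The sketch below, on made-up numbers, computes two complementary metrics (RMSE and R2) alongside it to show the difference.

```python
import numpy as np

# Hypothetical values: every prediction is exactly 0.5 too high
actual = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.5, 2.5, 3.5, 4.5])

# Pearson correlation: unaffected by the constant offset
pearson = np.corrcoef(predicted, actual)[0, 1]

# Root mean squared error: average size of the error, in the target's unit
rmse = np.sqrt(np.mean((predicted - actual) ** 2))

# R2: share of the target's variance explained by the predictions
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
```

On this toy example the correlation is perfect even though every prediction is off, while RMSE and R2 both reflect the error.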

That's it. It's now up to you to tune your preprocessing, your algorithm, and your analysis!
