# Final Model Overview

<img src='mi_diagrama.png'>

In [1]:
import numpy as np
import pandas as pd
import math
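from tools import *
import textstat  # readability tests used in add_clasic_test; imported explicitly instead of relying on the wildcard import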

# tfidf:
from sklearn.feature_extraction.text import TfidfVectorizer
# models:
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
# metrics:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# pipeline:
from sklearn.pipeline import Pipeline
# for pre-processing:
from nltk.stem import WordNetLemmatizer, PorterStemmer

# None means "no truncation"; the old -1 value is deprecated in recent pandas versions
pd.set_option('display.max_colwidth', None)


In [2]:
def add_creative_features(dataframe):
    """
    Applies the tools.py functions I wrote to every excerpt (adds the new columns in place).
    """
    dataframe['num_punct_marks'] = dataframe['excerpt'].apply(num_punct_marks)
    dataframe['num_uniq_words'] = dataframe['excerpt'].apply(num_unique_words)
    dataframe['avg_word_len'] = dataframe['excerpt'].apply(avg_word_len)
    dataframe['rarity'] = dataframe['excerpt'].apply(rarity)


def add_clasic_test(dataframe):
    """
    Applies the publicly available readability tests to every excerpt (adds one column per test, in place).
    """
    clasical_complexity_tests = {'fre_test': textstat.flesch_reading_ease,
                                 'fkg_test': textstat.flesch_kincaid_grade,
                                 'gf_test': textstat.gunning_fog,
                                 'si_test': textstat.smog_index,
                                 'dcrs_test': textstat.dale_chall_readability_score}

    # Creating one text-complexity feature per test:
    for test, test_func in clasical_complexity_tests.items():
        dataframe[test] = dataframe['excerpt'].apply(test_func)

        
def clean_stem_and_lemmatize(dataframe):
    """Lowercase, strip punctuation, then stem and lemmatize the excerpts (in place) to improve the TF-IDF process."""
    # Build the stemmer and the lemmatizer once instead of once per excerpt
    ps = PorterStemmer()
    wnl = WordNetLemmatizer()

    cleaned_texts = []
    for text in dataframe['excerpt'].values:
        text = text.lower()

        # bye bye punctuation marks
        for mark in '?.,:;!()':
            text = text.replace(mark, '')

        # stemming:
        stems = [ps.stem(word) for word in text.split()]

        # lemmatizing (applied to the stems):
        lemmas = [wnl.lemmatize(word) for word in stems]

        cleaned_texts.append(' '.join(lemmas))

    dataframe['excerpt'] = cleaned_texts
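
For context on what the readability columns measure: the Flesch Reading Ease score (`fre_test`) is a closed-form formula over word, sentence, and syllable counts. The sketch below applies the published formula to hand-supplied counts. The function name and the example counts are illustrative assumptions, and `textstat` does its own tokenization and syllable counting, so its numbers on real text will differ somewhat.

```python
def flesch_reading_ease_from_counts(total_words, total_sentences, total_syllables):
    """Published Flesch Reading Ease formula; a higher score means easier text."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts for two 100-word passages.
easy = flesch_reading_ease_from_counts(total_words=100, total_sentences=10, total_syllables=130)
hard = flesch_reading_ease_from_counts(total_words=100, total_sentences=4, total_syllables=180)
print(easy, hard)
```

Shorter sentences and shorter words push the score up, which is why the first passage scores much higher than the second.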

## Final Model

* After a lot of iteration and testing, this is the best model I have found so far.
* The main improvement here is adding TF-IDF (Term Frequency - Inverse Document Frequency) pre-processing to the mix.
* I also tweaked the hyperparameters of the best models to find the best combination.
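
To make the TF-IDF idea concrete, here is a minimal pure-Python sketch of the weighting on a made-up three-sentence corpus. It uses raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows, which is my understanding of the `TfidfVectorizer` defaults. The whitespace tokenizer, the `tfidf_rows` helper, and the corpus are assumptions for illustration only; the notebook itself relies on scikit-learn.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed-IDF weights and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


# Made-up corpus, not taken from train.csv.
docs = ["the quake destroyed one city", "the city is big", "the crust moved"]
vocab, rows = tfidf_rows(docs)
first = dict(zip(vocab, rows[0]))
print(first["quake"], first["city"], first["the"])
```

In the first document, "quake" (found in one document) gets a larger weight than "city" (two documents), which in turn outweighs "the" (all three). That down-weighting of ubiquitous words is what the regressors benefit from.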

#### I'm going to import train.csv again to keep things in order:

In [3]:
train = pd.read_csv('train.csv', usecols=['id', 'excerpt', 'target'])
validation = pd.read_csv('validation.csv', usecols=['id', 'excerpt', 'target'])

#### To illustrate the data-cleaning process, here is a before and after of an excerpt once stemming and lemmatization are applied:

In [4]:
# before:
train['excerpt'].iloc[0]

"An earthquake (also known as a quake, tremor or temblor) is the perceptible shaking of the surface of the Earth, resulting from the sudden release of energy in the Earth's crust that creates seismic waves. Earthquakes can be violent enough to toss people around and destroy whole cities. The seismicity or seismic activity of an area refers to the frequency, type and size of earthquakes experienced over a period of time.\nEarthquakes are measured using observations from seismometers. The moment magnitude is the most common scale on which earthquakes larger than approximately 5 are reported for the entire globe. The more numerous earthquakes smaller than magnitude 5 reported by national seismological observatories are measured mostly on the local magnitude scale, also referred to as the Richter magnitude scale. These two scales are numerically similar over their range of validity. Magnitude 3 or lower earthquakes are mostly imperceptible or weak and magnitude 7 and over potentially cause

In [5]:
add_creative_features(train)
add_clasic_test(train)
clean_stem_and_lemmatize(train)

add_creative_features(validation)
add_clasic_test(validation)
clean_stem_and_lemmatize(validation)

In [6]:
# after:
train['excerpt'].iloc[0]

"an earthquak also known a a quak tremor or temblor is the percept shake of the surfac of the earth result from the sudden releas of energi in the earth' crust that creat seismic wave earthquak can be violent enough to toss peopl around and destroy whole citi the seismic or seismic activ of an area refer to the frequenc type and size of earthquak experienc over a period of time earthquak are measur use observ from seismomet the moment magnitud is the most common scale on which earthquak larger than approxim 5 are report for the entir globe the more numer earthquak smaller than magnitud 5 report by nation seismolog observatori are measur mostli on the local magnitud scale also refer to a the richter magnitud scale these two scale are numer similar over their rang of valid magnitud 3 or lower earthquak are mostli impercept or weak and magnitud 7 and over potenti caus seriou damag over larger area depend on their depth"

##### As you can see, the text has been stemmed and lemmatized properly!

### Let's find the best model using the TF-IDF matrix as input:

#### Note: I'm only using the models that scored best in the first_model notebook

In [7]:
X_train = train['excerpt']
y_train = train['target']

X_val = validation['excerpt']
y_val = validation['target']

models = {'MLPRegressor': MLPRegressor(),
          'SVR': SVR(kernel='rbf'),
          'Ridge': Ridge()}

for model, regr in models.items():
    pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('rgr', regr)])
    pipeline.fit(X_train, y_train)
    
    # validation predictions
    preds = pipeline.predict(X_val)
    
    # metrics
    r2 = round(pipeline.score(X_val, y_val), 3)
    mae = round(mean_absolute_error(y_val, preds), 3)
    rmse = round(math.sqrt(mean_squared_error(y_val, preds)), 3)
    
    # Printing results
    print(f'{model} vs target... R2: {r2}')
    print(f'{model}  vs target... MAE: {mae}')
    print(f'{model}  vs target... RMSE: {rmse}', '\n')

MLPRegressor vs target... R2: 0.517
MLPRegressor  vs target... MAE: 0.569
MLPRegressor  vs target... RMSE: 0.707 

SVR vs target... R2: 0.518
SVR  vs target... MAE: 0.581
SVR  vs target... RMSE: 0.707 

Ridge vs target... R2: 0.551
Ridge  vs target... MAE: 0.566
Ridge  vs target... RMSE: 0.682 
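
For reference, the three metrics printed above can be reproduced by hand. The sketch below computes them from their definitions on made-up numbers; the `regression_metrics` helper and the sample values are assumptions for illustration, while the notebook itself uses the scikit-learn functions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R2, MAE and RMSE from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "mae": mae, "rmse": rmse}


y_true = [0.5, -1.0, 0.0, -2.0]  # made-up targets
y_pred = [0.0, -1.0, 0.5, -1.0]  # made-up predictions
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE penalizes large errors more heavily than MAE, so it is never smaller than MAE on the same predictions; the results above show the same pattern.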



#### So, the Ridge regression yielded the best results! Let's use those predictions + the custom features to improve the final model a little

### Training new model that uses the Ridge predictions + the custom features:

In [8]:
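# NOTE: `pipeline` is the last one fitted in the loop above, i.e. TF-IDF + Ridge
# (Ridge is the last entry in the models dict).
train['ridge_preds'] = pipeline.predict(X_train)
validation['ridge_preds'] = pipeline.predict(X_val)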

variables = ['rarity', 'avg_word_len', 'fre_test', 'dcrs_test', 'ridge_preds']

# Let's iterate over every candidate model, train it and compare results

models = {'MLPRegressor': MLPRegressor(max_iter=1000, learning_rate='adaptive', early_stopping=True),
          'Ridge_a2': Ridge(alpha=2),
          'SVR_C5': SVR(kernel='rbf', C=5),
         }

for model, regressor in models.items():
    # training:
    X_train = train[variables].values
    
    # fitting model
    regressor.fit(X_train, y_train)
    
    # checking the model results in the validation set
    X_val = validation[variables].values
    X_val_pred = regressor.predict(X_val)
    
    # metrics
    r2 = round(regressor.score(X_val, y_val), 3)
    mae = round(mean_absolute_error(y_val, X_val_pred), 3)
    rmse = round(math.sqrt(mean_squared_error(y_val, X_val_pred)), 3)
    
    # Printing results
    print(f'{model} vs target... R2: {r2}')
    print(f'{model}  vs target... MAE: {mae}')
    print(f'{model}  vs target... RMSE: {rmse}', '\n')

MLPRegressor vs target... R2: 0.542
MLPRegressor  vs target... MAE: 0.567
MLPRegressor  vs target... RMSE: 0.689 

Ridge_a2 vs target... R2: 0.525
Ridge_a2  vs target... MAE: 0.582
Ridge_a2  vs target... RMSE: 0.701 

SVR_C5 vs target... R2: 0.571
SVR_C5  vs target... MAE: 0.552
SVR_C5  vs target... RMSE: 0.666 
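
One thing to keep in mind about the stacking step: `ridge_preds` for the training rows comes from a pipeline that was fitted on those same rows, so the second-stage models may see a more optimistic version of that feature during training than they will on new data. A common alternative is to build the training feature from out-of-fold predictions. The sketch below shows that idea with `cross_val_predict` on a tiny made-up corpus; the texts, targets, and `cv=4` are assumptions for illustration and not what this notebook ran.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline

# Tiny made-up corpus and targets, only to make the sketch runnable.
texts = np.array([
    "the cat sat on the mat", "seismic waves travel through the crust",
    "a dog ran in the park", "magnitude scales measure released energy",
    "birds sing in the morning", "tectonic plates shift along fault lines",
    "she read a short story", "observatories record ground motion data",
])
targets = np.array([0.8, -1.2, 0.9, -1.5, 0.7, -1.1, 0.5, -1.4])

base = Pipeline([("tfidf", TfidfVectorizer()), ("rgr", Ridge())])

# Out-of-fold predictions: each row is predicted by a model that never saw it.
oof_preds = cross_val_predict(base, texts, targets, cv=4)

# Refit on all training rows to score validation/test excerpts later.
base.fit(texts, targets)
new_preds = base.predict(np.array(["waves shake the ground"]))
print(oof_preds.shape, new_preds.shape)
```

With this approach, `oof_preds` would play the role of `train['ridge_preds']`, while validation and test rows would still be scored by the pipeline fitted on the full training set.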



#### As you can see, the SVR has an RMSE of 0.666, the best result so far! That is a little better than using the Ridge alone (0.682).
### So this is our final model! Let's see how it performs on the test set

## Comparing results with test set

In [9]:
test = pd.read_csv('test.csv', usecols=['id', 'excerpt', 'target'])
print(test.shape)

(141, 3)


In [10]:
# Let's apply the same pre-processing to the test data set.
add_creative_features(test)
add_clasic_test(test)
clean_stem_and_lemmatize(test)

In [11]:
# Checking pipeline
pipeline

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('rgr', Ridge())])

In [12]:
# Applying the TF-IDF + Ridge Regression pipeline to get the first model's predictions:
X_test = test['excerpt']
y_test = test['target']
pipeline_preds = pipeline.predict(X_test)

# saving results
test['ridge_preds'] = pipeline_preds

In [14]:
# Metrics with just the first model:
r2 = round(pipeline.score(X_test, y_test), 6)
mae = round(mean_absolute_error(y_test, pipeline_preds), 6)
rmse = round(math.sqrt(mean_squared_error(y_test, pipeline_preds)), 6)

# Printing results
# (label set explicitly: `model` still holds 'SVR_C5' from the earlier loop,
# but these are the TF-IDF + Ridge pipeline's metrics)
label = 'TF-IDF + Ridge'
print(f'{label} vs target... R2: {r2}')
print(f'{label}  vs target... MAE: {mae}')
print(f'{label}  vs target... RMSE: {rmse}', '\n')

TF-IDF + Ridge vs target... R2: 0.478639
TF-IDF + Ridge  vs target... MAE: 0.613966
TF-IDF + Ridge  vs target... RMSE: 0.774197 



In [22]:
# we can check the regressor variable is still the trained svr model:
regressor

SVR(C=5)

In [16]:
# applying the SVR model with the extra features ('rarity', 'avg_word_len', 'fre_test', 'dcrs_test')
variables = ['rarity', 'avg_word_len', 'fre_test', 'dcrs_test', 'ridge_preds']

# preparing input:
X_test_final_features = test[variables].values
# applying model:
svr_preds = regressor.predict(X_test_final_features)

# saving results:
test['final_pred'] = svr_preds

In [19]:
test[['ridge_preds', 'final_pred', 'target']].head()

| | ridge_preds | final_pred | target |
|---|---|---|---|
| 0 | 0.112013 | 0.215251 | 0.610319 |
| 1 | -0.955616 | -0.909717 | -1.926422 |
| 2 | -0.07025 | 0.000402 | -0.013471 |
| 3 | -0.234369 | -0.148283 | 0.009684 |
| 4 | -0.370253 | -0.304843 | -0.684945 |


In [18]:
# metrics
r2 = round(regressor.score(X_test_final_features, y_test), 6)
mae = round(mean_absolute_error(y_test, svr_preds), 6)
rmse = round(math.sqrt(mean_squared_error(y_test, svr_preds)), 6)

# Printing results
print(f'{model} vs target... R2: {r2}')
print(f'{model}  vs target... MAE: {mae}')
print(f'{model}  vs target... RMSE: {rmse}', '\n')

SVR_C5 vs target... R2: 0.478906
SVR_C5  vs target... MAE: 0.615625
SVR_C5  vs target... RMSE: 0.773999 



### So these are the metrics of my final model against the test data set!

# Comparison with the Benchmark:

#### Remember that in the exploratory analysis notebook, the benchmark scored as follows:
* test set R2: 0.188189
* test set MAE: 0.767187
* test set RMSE: 0.966072

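#### So this final model improves on every single metric! The RMSE falls from 0.966 to 0.774, a reduction of about 20% (put the other way, the benchmark's RMSE is about 25% higher).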
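
The size of the RMSE improvement depends on which value is used as the denominator, so it is worth computing both ways. A small sketch using the two test-set RMSE figures reported in this notebook (the variable names are mine):

```python
benchmark_rmse = 0.966072  # benchmark on the test set (exploratory analysis notebook)
final_rmse = 0.773999      # final SVR model on the test set

# Reduction relative to the benchmark's RMSE.
reduction = (benchmark_rmse - final_rmse) / benchmark_rmse
# How much higher the benchmark's RMSE is relative to the final model's.
excess = (benchmark_rmse - final_rmse) / final_rmse

print(f"RMSE reduction vs benchmark: {reduction:.1%}")
print(f"Benchmark RMSE excess vs final model: {excess:.1%}")
```

The first figure comes out at roughly 20% and the second at roughly 25%.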


## Comparison with the best Kaggle submission so far:

<img src='kaggle.png'>

The best performer got an RMSE of 0.440, well ahead of my result (lower is better), but obviously they are spending more time on the project than I possibly can.

Bear in mind:
* I trained my model on just a portion of the whole training set.
* The test set that Kaggle uses is private and different from the one I used, so it is not fair to compare RMSE values unless we use the same test set.
* The person at the top of this list may be overfitting the test set (they have submitted a total of 106 times against the same test set).
* I used the test set just once to avoid overfitting it.