## Online News Popularity Data Set:
### Abstract:
The dataset summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The analysis aims to find out which features contribute most to popularity, and to predict the popularity of news articles before they are published online.

The dataset is obtained from: _[Online News Popularity Data Set](https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity)_

### Goal:
Predict the popularity of an online article, i.e. its number of shares.

### Regression models:
Since the goal is to predict the popularity of an online article, we apply regression models to estimate the dependent variable (`shares`) from the remaining independent variables.
We use three regression models:
- Linear Regressor
- Random forest Regressor
- Lasso Regressor

In [2]:
#import libraries
import numpy as np
import pandas as pd
import csv
import operator
from matplotlib import pyplot as plt
from sklearn.feature_selection import RFE
from math import sqrt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
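from sklearn import metrics
# sklearn.cross_validation was removed in scikit-learn 0.20; alias model_selection so later calls keep working
from sklearn import model_selection as cross_validation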
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
%matplotlib inline



In [3]:
#read csv
df = pd.read_csv("OnlineNewsPopularity.csv",delimiter=',',skipinitialspace=True)

In [4]:
df

Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
0,http://mashable.com/2013/01/07/amazon-instant-...,731.0,12.0,219.0,0.663594,1.0,0.815385,4.0,2.0,1.0,...,0.100000,0.700000,-0.350000,-0.600000,-0.200000,0.500000,-0.187500,0.000000,0.187500,593
1,http://mashable.com/2013/01/07/ap-samsung-spon...,731.0,9.0,255.0,0.604743,1.0,0.791946,3.0,1.0,1.0,...,0.033333,0.700000,-0.118750,-0.125000,-0.100000,0.000000,0.000000,0.500000,0.000000,711
2,http://mashable.com/2013/01/07/apple-40-billio...,731.0,9.0,211.0,0.575130,1.0,0.663866,3.0,1.0,1.0,...,0.100000,1.000000,-0.466667,-0.800000,-0.133333,0.000000,0.000000,0.500000,0.000000,1500
3,http://mashable.com/2013/01/07/astronaut-notre...,731.0,9.0,531.0,0.503788,1.0,0.665635,9.0,0.0,1.0,...,0.136364,0.800000,-0.369697,-0.600000,-0.166667,0.000000,0.000000,0.500000,0.000000,1200
4,http://mashable.com/2013/01/07/att-u-verse-apps/,731.0,13.0,1072.0,0.415646,1.0,0.540890,19.0,19.0,20.0,...,0.033333,1.000000,-0.220192,-0.500000,-0.050000,0.454545,0.136364,0.045455,0.136364,505
5,http://mashable.com/2013/01/07/beewi-smart-toys/,731.0,10.0,370.0,0.559889,1.0,0.698198,2.0,2.0,0.0,...,0.136364,0.600000,-0.195000,-0.400000,-0.100000,0.642857,0.214286,0.142857,0.214286,855
6,http://mashable.com/2013/01/07/bodymedia-armba...,731.0,8.0,960.0,0.418163,1.0,0.549834,21.0,20.0,20.0,...,0.100000,1.000000,-0.224479,-0.500000,-0.050000,0.000000,0.000000,0.500000,0.000000,556
7,http://mashable.com/2013/01/07/canon-poweshot-n/,731.0,12.0,989.0,0.433574,1.0,0.572108,20.0,20.0,20.0,...,0.100000,1.000000,-0.242778,-0.500000,-0.050000,1.000000,0.500000,0.500000,0.500000,891
8,http://mashable.com/2013/01/07/car-of-the-futu...,731.0,11.0,97.0,0.670103,1.0,0.836735,2.0,0.0,0.0,...,0.400000,0.800000,-0.125000,-0.125000,-0.125000,0.125000,0.000000,0.375000,0.000000,3600
9,http://mashable.com/2013/01/07/chuck-hagel-web...,731.0,10.0,231.0,0.636364,1.0,0.797101,4.0,1.0,1.0,...,0.100000,0.500000,-0.238095,-0.500000,-0.100000,0.000000,0.000000,0.500000,0.000000,710


### Feature Engineering:
Drop `url` and `timedelta` because they are non-predictive attributes. Also, classify popular and non-popular articles according to the number of shares.

In [5]:
df = df.drop(['url', 'timedelta'], axis=1)

### Normalize data:
Ref: http://hamelg.blogspot.com/2015/11/python-for-data-analysis-part-16.html

In [6]:
colmeans = df.sum()/df.shape[0]  # Get column means

In [7]:
centered = df-colmeans

In [8]:
column_deviations = df.std(axis=0)   # Get column standard deviations
centered_and_scaled = centered/column_deviations

In [9]:
from sklearn import preprocessing
scaled_data = preprocessing.scale(df)  # Scale the data
scaled_data = pd.DataFrame(scaled_data,    # Remake the DataFrame
                           index=df.index,
                           columns=df.columns)
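
The manual centering and scaling above and `preprocessing.scale` are not quite identical: pandas `std()` defaults to the sample standard deviation (`ddof=1`), whereas `preprocessing.scale` divides by the population standard deviation (`ddof=0`). The following is a minimal sketch on a small made-up frame (the `toy` data is an assumption, not part of the dataset) showing that the two agree once `ddof=0` is used:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy data standing in for the real feature table
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

# Manual z-score with the population standard deviation (ddof=0)
manual_ddof0 = (toy - toy.mean()) / toy.std(axis=0, ddof=0)

# Manual z-score with the pandas default (sample standard deviation, ddof=1)
manual_ddof1 = (toy - toy.mean()) / toy.std(axis=0)

# scale() returns a NumPy array, so wrap it back into a DataFrame
scaled = pd.DataFrame(preprocessing.scale(toy), index=toy.index, columns=toy.columns)

same_with_ddof0 = np.allclose(manual_ddof0.values, scaled.values)
same_with_ddof1 = np.allclose(manual_ddof1.values, scaled.values)
print(same_with_ddof0, same_with_ddof1)
```

On a dataset with tens of thousands of rows the difference between the two conventions is tiny, but it explains why the manually scaled values need not match `scaled_data` exactly.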

In [10]:
scaled_data['shares'] = np.cbrt(scaled_data['shares'])
scaled_data['kw_max_min'] = np.cbrt(scaled_data['kw_max_min'])
scaled_data['n_unique_tokens'] = np.cbrt(scaled_data['n_unique_tokens'])
scaled_data['self_reference_min_shares'] = np.cbrt(scaled_data['self_reference_min_shares'])
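
Because `shares` is first standardized and then cube-rooted, model predictions live on that transformed scale. A rough sketch of how a prediction could be mapped back to raw share counts is given here, assuming the mean and population standard deviation of the raw `shares` column were kept (the notebook does not store them; `share_mean`, `share_std` and `to_raw_shares` are hypothetical names). The input uses the first five `shares` values displayed earlier:

```python
import numpy as np

# First five raw share counts shown in the DataFrame preview
raw = np.array([593.0, 711.0, 1500.0, 1200.0, 505.0])

share_mean = raw.mean()
share_std = raw.std()  # population std (ddof=0), matching preprocessing.scale

# Forward transform used in the notebook: standardize, then take the cube root
transformed = np.cbrt((raw - share_mean) / share_std)

def to_raw_shares(t, mean, std):
    """Invert the cube root, then undo the standardization."""
    return np.power(t, 3) * std + mean

recovered = to_raw_shares(transformed, share_mean, share_std)
print(np.allclose(recovered, raw))
```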

In [11]:
scaled_data.head(5)

Unnamed: 0,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,average_token_length,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
0,0.757447,-0.69521,0.320013,0.000675,0.038658,-0.607463,-0.335566,-0.426526,-0.304268,0.156474,...,0.063865,-0.228941,-0.708369,-0.268895,-0.969886,0.671245,-0.975432,-1.810719,0.13892,-0.622332
1,-0.661657,-0.618794,0.252277,0.000675,0.031479,-0.695709,-0.594963,-0.426526,-0.304268,0.432838,...,-0.870968,-0.228941,1.102174,1.367424,0.078642,-0.870807,-0.269076,0.837749,-0.689658,-0.613472
2,-0.661657,-0.712192,0.196993,0.000675,-0.007752,-0.695709,-0.594963,-0.426526,-0.304268,-0.183415,...,0.063865,0.981798,-1.621797,-0.957871,-0.270867,-0.870807,-0.269076,0.837749,-0.689658,-0.546276
3,-0.661657,-0.032933,-0.232815,0.000675,-0.007211,-0.166229,-0.85436,-0.426526,-0.304268,-0.169758,...,0.573773,0.174639,-0.862584,-0.268895,-0.620377,-0.870807,-0.269076,0.837749,-0.689658,-0.573698
4,1.230482,1.115439,-0.335177,0.000675,-0.04542,0.716237,4.074185,1.860061,-0.304268,0.1594,...,-0.870968,0.981798,0.307944,0.075594,0.602906,0.531059,0.244637,-1.569949,-0.087056,-0.628779


### Features list(all):

In [12]:
features= ['kw_avg_avg','data_channel_is_entertainment', 'is_weekend','data_channel_is_tech', 'self_reference_min_shares', 
           'self_reference_avg_sharess','data_channel_is_socmed','kw_max_avg', 'LDA_02', 'n_unique_tokens','kw_max_max',
           'n_tokens_content', 'n_non_stop_unique_tokens','LDA_00', 'kw_avg_max','kw_avg_min','LDA_01','LDA_04',
           'kw_min_min','kw_min_avg','global_subjectivity','LDA_03','kw_max_min','self_reference_max_shares',
           'min_positive_polarity','average_token_length','kw_min_max','global_rate_positive_words','avg_positive_polarity',
           'global_sentiment_polarity', 'num_hrefs','data_channel_is_world','avg_negative_polarity','rate_negative_words',
           'rate_positive_words','num_self_hrefs','title_sentiment_polarity','abs_title_subjectivity', 'n_tokens_title',
           'title_subjectivity','num_keywords','num_videos','abs_title_sentiment_polarity','min_negative_polarity',
          'data_channel_is_bus','max_negative_polarity','max_positive_polarity']
print(features)

['kw_avg_avg', 'data_channel_is_entertainment', 'is_weekend', 'data_channel_is_tech', 'self_reference_min_shares', 'self_reference_avg_sharess', 'data_channel_is_socmed', 'kw_max_avg', 'LDA_02', 'n_unique_tokens', 'kw_max_max', 'n_tokens_content', 'n_non_stop_unique_tokens', 'LDA_00', 'kw_avg_max', 'kw_avg_min', 'LDA_01', 'LDA_04', 'kw_min_min', 'kw_min_avg', 'global_subjectivity', 'LDA_03', 'kw_max_min', 'self_reference_max_shares', 'min_positive_polarity', 'average_token_length', 'kw_min_max', 'global_rate_positive_words', 'avg_positive_polarity', 'global_sentiment_polarity', 'num_hrefs', 'data_channel_is_world', 'avg_negative_polarity', 'rate_negative_words', 'rate_positive_words', 'num_self_hrefs', 'title_sentiment_polarity', 'abs_title_subjectivity', 'n_tokens_title', 'title_subjectivity', 'num_keywords', 'num_videos', 'abs_title_sentiment_polarity', 'min_negative_polarity', 'data_channel_is_bus', 'max_negative_polarity', 'max_positive_polarity']


### Split into train and test datasets (70/30 hold-out split):

In [13]:
X_train, X_test, y_train, y_test = train_test_split(scaled_data[features], scaled_data['shares'],
                                                    train_size=0.70, test_size=0.30, random_state=0)
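
Note that `train_test_split` produces a single 70/30 hold-out split rather than k-fold cross-validation. If cross-validated scores are wanted, `cross_val_score` (imported at the top of the notebook) is one option. The following is a minimal sketch on synthetic data; the data, the choice of 5 folds and the `X_demo`/`y_demo` names are assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data used only for illustration
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 5-fold cross-validation: each fold is held out once for scoring
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print(scores.shape, scores.mean())
```

Averaging over folds gives a less split-dependent estimate than a single hold-out set, at the cost of fitting the model several times.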

### Error metric table for all Regression models:

In [25]:
rmse_dict = {}  # test RMSE per fitted model, used later to pick the best model

def rmse(correct, estimated):
    # Root mean squared error
    return np.sqrt(mean_squared_error(correct, estimated))

def calc_error_metric(name, modelname, model, X_train, y_train, X_test, y_test):
    # name: display label for the model
    # modelname: fitted estimator object (stored in the ModelType column and used as the rmse_dict key)
    # model: estimator used to generate predictions
    y_train_predicted = model.predict(X_train)
    y_test_predicted = model.predict(X_test)

    # R2
    r2_train = r2_score(y_train, y_train_predicted)
    r2_test = r2_score(y_test, y_test_predicted)

    # RMSE
    rms_train = rmse(y_train, y_train_predicted)
    rms_test = rmse(y_test, y_test_predicted)

    # MAE
    mae_train = mean_absolute_error(y_train, y_train_predicted)
    mae_test = mean_absolute_error(y_test, y_test_predicted)

    # MAPE (in percent)
    mape_train = np.mean(np.abs((y_train - y_train_predicted) / y_train)) * 100
    mape_test = np.mean(np.abs((y_test - y_test_predicted) / y_test)) * 100

    rmse_dict[modelname] = rms_test

    # Single-row DataFrame holding all metrics for this model
    error_metric = pd.DataFrame({'Model': [name],
                                 'ModelType': [modelname],
                                 'r2_train': [r2_train],
                                 'r2_test': [r2_test],
                                 'rms_train': [rms_train],
                                 'rms_test': [rms_test],
                                 'mae_train': [mae_train],
                                 'mae_test': [mae_test],
                                 'mape_train': [mape_train],
                                 'mape_test': [mape_test]})
    return error_metric
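
MAPE divides each error by the true value, so on a target with values close to zero (as can happen after standardization) a few points can dominate the average. The following small sketch uses hypothetical numbers to show how the same absolute errors yield very different MAPE values once the targets are shifted towards zero:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (same formula as above)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Hypothetical targets and predictions with a constant absolute error of 10
y_true = np.array([100.0, 200.0, 400.0])
y_pred = y_true + 10.0
mape_raw = mape(y_true, y_pred)

# Shift targets and predictions so one target sits close to zero; errors are unchanged
shift = 199.0
mape_shifted = mape(y_true - shift, y_pred - shift)

print(round(mape_raw, 2), round(mape_shifted, 2))
```

For this reason the MAPE columns in the tables here are best read alongside MAE and RMSE rather than on their own.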

### Prediction models:

In [26]:
error_metric = pd.DataFrame({'Model':[],
                            'ModelType':[],
                            'r2_train': [],
                            'r2_test': [],
                            'rms_train':[], 
                            'rms_test': [],
                            'mae_train': [],
                            'mae_test': [],
                            'mape_train':[],
                            'mape_test':[]})


def models(name, X_train, y_train, X_test, y_test):
    # Fit the three regressors, collect their error metrics and report the best one.
    # Note: `name` (the run label) is accepted but not written to the metrics table,
    # and rmse_dict accumulates entries across calls.
    global error_metric

    # Linear Regressor
    clf = LinearRegression()
    lm = clf.fit(X_train, y_train)
    linear = calc_error_metric('Linear Regression', lm, clf, X_train, y_train, X_test, y_test)
    print('Linear Regression completed')

    # Random Forest Regressor
    rf = RandomForestRegressor(n_estimators=100, max_depth=7)
    rfmodel = rf.fit(X_train, y_train)
    randomforest = calc_error_metric('RandomForest', rfmodel, rf, X_train, y_train, X_test, y_test)
    print('RandomForest completed')

    # Lasso Regressor (linear with L1 regularization)
    from sklearn import linear_model
    clf = linear_model.Lasso(alpha=0.1)
    lasso_clf = clf.fit(X_train, y_train)
    Lasso = calc_error_metric('Lasso', lasso_clf, clf, X_train, y_train, X_test, y_test)
    print('Lasso Regression Completed')

    # Pick the model with the lowest test RMSE
    best_model = min(rmse_dict.items(), key=operator.itemgetter(1))[0]
    print('Best Model is ', best_model)

    # Append this run's rows to the global metrics table
    error_metric = pd.concat([error_metric, linear, randomforest, Lasso])

    # Write the error metrics to disk
    error_metric.to_csv('error_metrics.csv')

    return error_metric
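
Since `rmse_dict` is keyed by the fitted estimator object, the "Best Model" message prints the full estimator repr. One possible alternative, sketched here, is to key the dictionary by a readable label; the `rmse_by_name` dictionary is a hypothetical stand-in, filled with the test RMSE values reported in this notebook's error-metrics table for the run without feature selection:

```python
import operator

# Test RMSE per model label (values as reported in the error-metrics table)
rmse_by_name = {
    "Linear Regression": 28330420.0,
    "RandomForest": 0.4972194,
    "Lasso": 0.5188231,
}

# The best model is the one with the smallest test RMSE
best_name, best_rmse = min(rmse_by_name.items(), key=operator.itemgetter(1))
print(best_name, best_rmse)
```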

### Models prediction(no feature selection):

In [43]:
teston = models('No feature selection',X_train, y_train, X_test, y_test)

Linear Regression completed
RandomForest completed
Lasso Regression Completed
Best Model is  RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=7,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)


### Error metrics (no feature selection):

In [44]:
error_metric

Unnamed: 0,Model,ModelType,mae_test,mae_train,mape_test,mape_train,r2_test,r2_train,rms_test,rms_train
0,Linear Regression,"LinearRegression(copy_X=True, fit_intercept=Tr...",259770.376254,0.365647,43334040.0,64.777803,-2954859000000000.0,0.102609,28330420.0,0.500354
0,RandomForest,"(DecisionTreeRegressor(criterion='mse', max_de...",0.36547,0.35045,64.31252,62.180414,0.08982102,0.187632,0.4972194,0.476061
0,Lasso,"Lasso(alpha=0.1, copy_X=True, fit_intercept=Tr...",0.39629,0.398947,68.8187,69.468138,0.009009953,0.009783,0.5188231,0.525595


In [45]:
pd.read_csv('error_metrics.csv')

Unnamed: 0.1,Unnamed: 0,Model,ModelType,mae_test,mae_train,mape_test,mape_train,r2_test,r2_train,rms_test,rms_train
0,0,Linear Regression,"LinearRegression(copy_X=True, fit_intercept=Tr...",259770.376254,0.365647,43334040.0,64.777803,-2954859000000000.0,0.102609,28330420.0,0.500354
1,0,RandomForest,"RandomForestRegressor(bootstrap=True, criterio...",0.36547,0.35045,64.31252,62.180414,0.08982102,0.187632,0.4972194,0.476061
2,0,Lasso,"Lasso(alpha=0.1, copy_X=True, fit_intercept=Tr...",0.39629,0.398947,68.8187,69.468138,0.009009953,0.009783,0.5188231,0.525595


### Feature Selection using Boruta Package:

In [53]:
X_boruta = scaled_data.drop(['shares'],axis=1)
X_boruta = X_boruta.values

In [58]:
y_boruta = scaled_data
y_boruta = y_boruta['shares']

In [59]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from boruta import BorutaPy

# NOTE: BorutaPy accepts NumPy arrays only, so convert the target to a 1-D array
y_boruta = np.asarray(y_boruta).ravel()

# Define the random forest regressor, using all cores
rf1 = RandomForestRegressor(n_jobs=-1, max_depth=5)

# Define the Boruta feature selection method
feat_selector = BorutaPy(rf1, n_estimators='auto', verbose=5, random_state=1)

# Find all relevant features
feat_selector.fit(X_boruta, y_boruta)

# Boolean mask of the selected (confirmed) features
feat_selector.support_

# Ranking of features (selected features have rank 1)
feat_selector.ranking_

# Call transform() on X_boruta to filter it down to the selected features
#X_filtered = feat_selector.transform(X_boruta)

BorutaPy progress (iterations with identical counts are grouped):

Iterations,Confirmed,Tentative,Rejected
1-7,0,58,0
8-11,5,21,32
12-15,11,13,34
16,14,10,34
17-21,14,9,35
22-25,14,8,36
26-28,15,7,36
29-36,16,6,36
37-45,17,5,36
46-50,18,4,36
51-82,18,3,37
83-99,19,2,37
100,19,1,37

BorutaPy finished running.


array([21,  1,  1, 39,  1,  1, 17,  1, 15,  5, 30, 31,  1, 34, 18,  1, 33,
       16, 10,  1,  3,  1,  1,  1,  1,  1,  1,  8,  1, 35, 38, 36, 39, 37,
        5, 28,  1,  4,  1,  1,  7,  1,  2, 13, 14, 23, 26, 26, 19,  9, 29,
       11, 22, 25, 20, 12, 32, 23])

In [60]:
print(feat_selector.ranking_)

[21  1  1 39  1  1 17  1 15  5 30 31  1 34 18  1 33 16 10  1  3  1  1  1  1
  1  1  8  1 35 38 36 39 37  5 28  1  4  1  1  7  1  2 13 14 23 26 26 19  9
 29 11 22 25 20 12 32 23]


In [67]:
# check selected features
print(feat_selector.support_)

[False  True  True False  True  True False  True False False False False
  True False False  True False False False  True False  True  True  True
  True  True  True False  True False False False False False False False
  True False  True  True False  True False False False False False False
 False False False False False False False False False False]


### Features selected using Boruta:

In [71]:
boruta_features = ['n_tokens_content','n_unique_tokens','n_non_stop_unique_tokens','num_hrefs', 'num_imgs','data_channel_is_entertainment',
                  'data_channel_is_tech','kw_avg_min','kw_max_max', 'kw_avg_max', 'kw_min_avg','kw_max_avg','kw_avg_avg', 
                   'self_reference_min_shares','self_reference_avg_sharess','is_weekend', 'LDA_01', 'LDA_02','LDA_04']
print(boruta_features)

['n_tokens_content', 'n_unique_tokens', 'n_non_stop_unique_tokens', 'num_hrefs', 'num_imgs', 'data_channel_is_entertainment', 'data_channel_is_tech', 'kw_avg_min', 'kw_max_max', 'kw_avg_max', 'kw_min_avg', 'kw_max_avg', 'kw_avg_avg', 'self_reference_min_shares', 'self_reference_avg_sharess', 'is_weekend', 'LDA_01', 'LDA_02', 'LDA_04']
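
The list above is typed out by hand. The same names could instead be derived from the boolean mask with NumPy indexing. The following is a minimal sketch, using the first four feature columns and the first four entries of the `support_` mask printed earlier as stand-ins (in the notebook these would be the columns of `scaled_data` without `shares` and `feat_selector.support_`); the `columns`, `support` and `selected` names are assumptions:

```python
import numpy as np

# Stand-ins for the feature column names and the Boruta support mask
columns = np.array(["n_tokens_title", "n_tokens_content",
                    "n_unique_tokens", "n_non_stop_words"])
support = np.array([False, True, True, False])

# Boolean indexing keeps only the columns flagged as confirmed
selected = columns[support].tolist()
print(selected)
```

Deriving the list this way avoids transcription mistakes if the selection is rerun with a different seed or estimator.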


In [72]:
X_boruta_train, X_boruta_test, y_boruta_train, y_boruta_test = train_test_split(scaled_data[boruta_features],
                                        scaled_data['shares'], train_size=0.70, test_size=0.30, random_state=0)

### Model prediction with features selected:

In [73]:
test2 = models('Boruta feature selection',X_boruta_train, y_boruta_train, X_boruta_test, y_boruta_test)

Linear Regression completed
RandomForest completed
Lasso Regression Completed
Best Model is  RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=7,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)


### Final Error metrics:

In [74]:
error_metric

Unnamed: 0,Model,ModelType,mae_test,mae_train,mape_test,mape_train,r2_test,r2_train,rms_test,rms_train
0,Linear Regression,"LinearRegression(copy_X=True, fit_intercept=Tr...",259770.376254,0.365647,43334040.0,64.777803,-2954859000000000.0,0.102609,28330420.0,0.500354
0,RandomForest,"(DecisionTreeRegressor(criterion='mse', max_de...",0.36547,0.35045,64.31252,62.180414,0.08982102,0.187632,0.4972194,0.476061
0,Lasso,"Lasso(alpha=0.1, copy_X=True, fit_intercept=Tr...",0.39629,0.398947,68.8187,69.468138,0.009009953,0.009783,0.5188231,0.525595
0,Linear Regression,"LinearRegression(copy_X=True, fit_intercept=Tr...",0.375645,0.367938,66.15781,65.101318,-4.143225,0.0956,1.181959,0.502304
0,RandomForest,"(DecisionTreeRegressor(criterion='mse', max_de...",0.365542,0.351354,64.39506,62.31702,0.08864769,0.179594,0.4975397,0.478411
0,Lasso,"Lasso(alpha=0.1, copy_X=True, fit_intercept=Tr...",0.39629,0.398947,68.8187,69.468138,0.009009953,0.009783,0.5188231,0.525595


In [75]:
pd.read_csv('error_metrics.csv')

Unnamed: 0.1,Unnamed: 0,Model,ModelType,mae_test,mae_train,mape_test,mape_train,r2_test,r2_train,rms_test,rms_train
0,0,Linear Regression,"LinearRegression(copy_X=True, fit_intercept=Tr...",259770.376254,0.365647,43334040.0,64.777803,-2954859000000000.0,0.102609,28330420.0,0.500354
1,0,RandomForest,"RandomForestRegressor(bootstrap=True, criterio...",0.36547,0.35045,64.31252,62.180414,0.08982102,0.187632,0.4972194,0.476061
2,0,Lasso,"Lasso(alpha=0.1, copy_X=True, fit_intercept=Tr...",0.39629,0.398947,68.8187,69.468138,0.009009953,0.009783,0.5188231,0.525595
3,0,Linear Regression,"LinearRegression(copy_X=True, fit_intercept=Tr...",0.375645,0.367938,66.15781,65.101318,-4.143225,0.0956,1.181959,0.502304
4,0,RandomForest,"RandomForestRegressor(bootstrap=True, criterio...",0.365542,0.351354,64.39506,62.31702,0.08864769,0.179594,0.4975397,0.478411
5,0,Lasso,"Lasso(alpha=0.1, copy_X=True, fit_intercept=Tr...",0.39629,0.398947,68.8187,69.468138,0.009009953,0.009783,0.5188231,0.525595
