# Training notebook

In [1]:
# Models from the literature to test
from sklearn.neural_network import MLPRegressor
import pandas as pd
import os
import numpy as np
from sklearn.linear_model import LinearRegression
from pyGRNN import GRNN
from sklearn.model_selection import  GridSearchCV
from sklearn.metrics import mean_squared_error as MSE

In [2]:
# Utils
def average(lst): 
    return sum(lst) / len(lst) 

days = [1,2,3,4,5,6,7,14,21,28]

In [3]:
# IMPORTANT: Seeds to try
seeds = [1,2,3,4,5]

### Columns Description

*Each value counts activity over a window of days before a given poll release date (e.g. a value of 24 for XPosts in a `1_1.csv` file means candidate 1 published 24 X posts during the 1 day before that date).*

* XPosts: Number of overall posts in X (Twitter)
* Xcomments: Number of overall comments in X
* XRts: Number of overall retweets (RTs) in X
* Xlikes: Number of overall likes in X
* XCommsPPost: Average number of comments per post for X
* XRTsPPost: Average number of RTs per post for X
* XlikesPPost: Average number of likes per post for X

* FBPosts: Number of overall posts in Facebook
* FBReactions: Number of overall reactions in Facebook
* FBComments: Number of overall comments in Facebook
* FBShares: Number of overall shares in Facebook
* FBCommsPPost: Average number of comments per post for Facebook
* FBReactsPPost: Average number of reactions per post for Facebook
* FBSharesPPost: Average number of shares per post for Facebook

* IGPosts: Number of overall posts in Instagram
* IGLikes: Number of overall likes in Instagram
* IGLikesPPost: Average number of likes per post for Instagram

* YTPosts: Number of overall posts in YouTube
* YTViews: Number of overall views in YouTube
* YTViewsPPost: Average number of views per post for YouTube

* Target: the reported vote share for the candidate
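
The per-post columns above can be read as a total divided by a post count. The sketch below assumes exactly that relationship, which the notebook does not state explicitly; `per_post` is a hypothetical helper, the zero-post handling is an assumption, and the counts are made up. Column names follow the `columns` list in cell In [4].

```python
def per_post(total, posts):
    """Average engagement per post; returns 0.0 when there are no posts (assumed convention)."""
    return total / posts if posts else 0.0

# Hypothetical counts for one candidate over one window (not real data)
row = {"XPosts": 24, "Xlikes": 1200, "XRts": 360}
row["XlikesPPost"] = per_post(row["Xlikes"], row["XPosts"])
row["XRTsPPost"] = per_post(row["XRts"], row["XPosts"])
print(row["XlikesPPost"], row["XRTsPPost"])
```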



In [4]:
#Setting columns to use (see New_DB)
columns = ['XPosts', 'Xcomments', 'XRts', 'Xlikes', 'XCommsPPost', 'XRTsPPost', 'XlikesPPost', 'FBPosts', 'FBReactions', 'FBComments', 'FBShares', 'FBReactsPPost', 'FBCommsPPost', 'FBSharesPPost', 'IGPosts', 'IGLikes', 'IGLikesPPost', 'YTPosts', 'YTViews', 'YTViewsPPost', 'Target']

target = ['Target']

feature_columns = ['XPosts', 'Xcomments', 'XRts', 'Xlikes', 'XCommsPPost', 'XRTsPPost', 'XlikesPPost', 'FBPosts', 'FBReactions', 'FBComments', 'FBShares', 'FBReactsPPost', 'FBCommsPPost', 'FBSharesPPost', 'IGPosts', 'IGLikes', 'IGLikesPPost', 'YTPosts', 'YTViews', 'YTViewsPPost']

## Multi-layer Perceptron (MLP)

MLPRegressor trains iteratively since at each time step the partial derivatives of the loss function with respect to the model parameters are computed to update the parameters.

It can also have a regularization term added to the loss function that shrinks model parameters to prevent overfitting.

This implementation works with data represented as dense and sparse numpy arrays of floating point values.
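
To make the regularization term concrete, here is a minimal NumPy sketch of a one-hidden-layer network and a penalized loss. It assumes a ReLU hidden layer, a mean squared error, and a plain `alpha * ||W||^2` penalty; scikit-learn's exact scaling of the two terms may differ, and the function names and toy data are illustrative only.

```python
import numpy as np

def mlp_forward(X, W1, b1, W2, b2):
    """One hidden layer with ReLU activation and a linear output unit."""
    hidden = np.maximum(0.0, X @ W1 + b1)
    return hidden @ W2 + b2

def penalized_loss(X, y, W1, b1, W2, b2, alpha):
    """Mean squared error plus an L2 penalty on the weights (biases excluded)."""
    residual = mlp_forward(X, W1, b1, W2, b2) - y
    mse = np.mean(residual ** 2)
    l2 = np.sum(W1 ** 2) + np.sum(W2 ** 2)
    return mse + alpha * l2

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                    # 8 toy samples, 4 features
y = rng.normal(size=8)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)  # 3 hidden units, as in hidden_layer_sizes=(3,)
W2, b2 = rng.normal(size=3), 0.0

plain = penalized_loss(X, y, W1, b1, W2, b2, alpha=0.0)
shrunk = penalized_loss(X, y, W1, b1, W2, b2, alpha=0.05)
print(plain, shrunk)
```

With the same weights, the penalized loss is larger by `alpha` times the squared weight norm, so the optimizer is pushed toward smaller weights.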

### MLP

We expect 10 predictions, one per time window (`days`); the final vote share prediction is their average.
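
The averaging scheme can be sketched as follows. This is a minimal illustration with made-up numbers, and `average_over_windows` is a hypothetical helper, not something defined in this notebook: each window's seed predictions are averaged first, and the final estimate is the mean of those per-window averages.

```python
from statistics import mean

def average_over_windows(per_window_predictions):
    """Average seed predictions within each window, then across windows."""
    window_means = [mean(seed_preds) for seed_preds in per_window_predictions]
    return mean(window_means), window_means

# Hypothetical predictions: 3 windows, 2 seeds each (values are illustrative only)
toy = [[50.0, 52.0], [54.0, 56.0], [58.0, 60.0]]
final, window_means = average_over_windows(toy)
print(window_means, final)
```

Note that the loops in this notebook call `average(...)` on a per-candidate list that is never reset inside the `days` loop, so each appended value is a running average over all predictions collected so far rather than a per-window mean.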

#### Claudia

In [85]:
predictions = []
claudia_mlp = []
for i in days:
  # Scan the file and set data
  data = pd.read_csv(f'./claudia/1_{i}.csv', usecols=columns, encoding="utf-8")
  # Training and testing data; Remove last row which is the testing row
  training_data = data.iloc[:-1]
  testing_data = pd.DataFrame(data.iloc[-1])
  testing_data = testing_data.T
  # Splitting
  X_train = training_data[feature_columns]
  y_train = training_data[target]
  y_train = y_train['Target'].values

  X_test = testing_data[feature_columns]
  y_test = testing_data[target]
  y_test = y_test['Target'].values
  # Assign random states (as literature)
  for y in seeds:
    regr = MLPRegressor(hidden_layer_sizes=(3, ), solver="lbfgs", alpha=0.05, random_state=y, max_iter=500).fit(X_train, y_train)
    # Predict
    prediction = regr.predict(X_test)
    # Append prediction
    claudia_mlp.append(prediction[0])
  # Append the running average of all seed predictions so far (claudia_mlp is not reset per window)
  predictions.append(average(claudia_mlp))


In [95]:
print(f"The number of predictions to be averaged is {len(predictions)}, and the prediction is {round(average(predictions))}%, the real result being {y_test[0]}%");

The number of predictions to be averaged is 10, and the prediction is 54%, the real result being 63.0%


#### Gálvez

In [96]:
galvez_mlp = []
predictions = []
for i in days:
  # Scan the file and set data
  data = pd.read_csv(f'./galvez/2_{i}.csv', usecols=columns, encoding="utf-8")
  # Training and testing data; Remove last row which is the testing row
  training_data = data.iloc[:-1]
  testing_data = pd.DataFrame(data.iloc[-1])
  testing_data = testing_data.T
  # Splitting
  X_train = training_data[feature_columns]
  y_train = training_data[target]
  y_train = y_train['Target'].values

  X_test = testing_data[feature_columns]
  y_test = testing_data[target]
  y_test = y_test['Target'].values
  # Assign random states (as literature)
  for y in seeds:
    regr = MLPRegressor(hidden_layer_sizes=(3, ), solver="lbfgs", alpha=0.05, random_state=y, max_iter=500).fit(X_train, y_train)
    # Predict
    prediction = regr.predict(X_test)
    # Append prediction
    galvez_mlp.append(prediction[0])
  # Append the running average of all seed predictions so far (galvez_mlp is not reset per window)
  predictions.append(average(galvez_mlp))

In [98]:
print(f"The number of predictions to be averaged is {len(predictions)} and for that, the predicted result by MLP alone is {round(average(predictions))}%, the real result being {y_test[0]}%");

The number of predictions to be averaged is 10 and for that, the predicted result by MLP alone is 25%, the real result being 22.0%


### MLP Scaled

Next, we repeat the predictions using scaled features.
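
As a reference for what the scaling step does, here is a minimal NumPy sketch of standardization fitted on the training rows only and then applied to the held-out row. `fit_standardizer` and `standardize` are hypothetical helpers and the data is a toy example; the cells in this notebook use scikit-learn's `StandardScaler` instead.

```python
import numpy as np

def fit_standardizer(X_train):
    """Return per-feature mean and standard deviation computed on training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0  # leave constant features unscaled instead of dividing by zero
    return mean, std

def standardize(X, mean, std):
    """Apply previously fitted statistics to any set of rows."""
    return (X - mean) / std

# Toy data (illustrative only): 4 training rows and 1 held-out row, 2 features
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test = np.array([[5.0, 500.0]])

mean, std = fit_standardizer(X_train)
X_train_s = standardize(X_train, mean, std)
X_test_s = standardize(X_test, mean, std)
print(X_train_s.mean(axis=0), X_test_s)
```

Reusing the training statistics for the test row keeps information about the held-out poll from leaking into the preprocessing.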

In [111]:
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaler = StandardScaler()
minmaxer = MinMaxScaler()

#### Claudia

In [118]:
predictions = []
claudia_mlp = []
for i in days:
  # Scan the file and set data
  data = pd.read_csv(f'./claudia/1_{i}.csv', usecols=columns, encoding="utf-8")
  # Training and testing data; Remove last row which is the testing row
  training_data = data.iloc[:-1]
  testing_data = pd.DataFrame(data.iloc[-1])
  testing_data = testing_data.T
  # Splitting
  X_train = training_data[feature_columns]
  y_train = training_data[target]

  X_test = testing_data[feature_columns]
  y_test = testing_data[target]

  # Fit the scaler on X_train only, then transform both X_train and X_test with it
  scaler.fit(X_train)
  X_train = scaler.transform(X_train)
  X_test = scaler.transform(X_test)

  # Assign random states (as literature)
  for y in seeds:
    regr = MLPRegressor(hidden_layer_sizes=(3, ), solver="lbfgs", alpha=0.05, random_state=y, max_iter=100).fit(X_train, y_train)
    # Predict
    prediction = regr.predict(X_test)
    # Append prediction
    claudia_mlp.append(prediction[0])
  # Append the running average of all seed predictions so far (claudia_mlp is not reset per window)
  predictions.append(average(claudia_mlp))

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

In [119]:
print(f"The number of predictions to be averaged is {len(predictions)} and for that, the predicted result by MLP alone is {round(average(predictions))}%, the real vote share was {int(y_test['Target'].iloc[0])}%");

The number of predictions to be averaged is 10 and for that, the predicted result by MLP alone is 56%, the real vote share was 63%
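
The `column_or_1d` lines in the output above come from scikit-learn's `DataConversionWarning`, raised when the target is passed as a single-column 2D structure (here `training_data[target]` with `target = ['Target']`) instead of a 1D array. The sketch below only illustrates the shape difference with toy values; the unscaled MLP cells avoid the warning by using `y_train['Target'].values`.

```python
import numpy as np

# A one-column selection keeps two dimensions, like training_data[['Target']]
y_column = np.array([[61.0], [62.0], [63.0]])
# Flattening gives the (n_samples,) shape that estimators expect for a single target
y_flat = y_column.ravel()
print(y_column.shape, y_flat.shape)
```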


#### Gálvez

In [120]:
galvez_mlp = []
predictions = []
for i in days:
  # Scan the file and set data
  data = pd.read_csv(f'./galvez/2_{i}.csv', usecols=columns, encoding="utf-8")
  # Training and testing data; Remove last row which is the testing row
  training_data = data.iloc[:-1]
  testing_data = pd.DataFrame(data.iloc[-1])
  testing_data = testing_data.T
  # Splitting
  X_train = training_data[feature_columns]
  y_train = training_data[target]

  X_test = testing_data[feature_columns]
  y_test = testing_data[target]

  # Fit the scaler on X_train only, then transform both X_train and X_test with it
  scaler.fit(X_train)
  X_train = scaler.transform(X_train)
  X_test = scaler.transform(X_test)
  
  # Assign random states (as literature)
  for y in seeds:
    regr = MLPRegressor(hidden_layer_sizes=(3, ), solver="lbfgs", alpha=0.05, random_state=y, max_iter=500).fit(X_train, y_train)
    # Predict
    prediction = regr.predict(X_test)
    # Append prediction
    galvez_mlp.append(prediction[0])
  # Append the running average of all seed predictions so far (galvez_mlp is not reset per window)
  predictions.append(average(galvez_mlp))

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

In [121]:
print(f"The number of predictions to be averaged is {len(predictions)} and for that, the predicted result by MLP alone is {round(average(predictions))}%, the real vote share was {int(y_test['Target'].iloc[0])}%");

The number of predictions to be averaged is 10 and for that, the predicted result by MLP alone is 25%, the real vote share was 22%


### MLP w/ PCA

Principal Component Analysis (PCA) is a linear dimensionality reduction technique that can be utilized for extracting information from a high-dimensional space by projecting it into a lower-dimensional sub-space. It tries to preserve the essential parts that have more variation of the data and remove the non-essential parts with fewer variation. - DataCamp

We fit the scaler (and PCA) on the training data only, then applied the transform to both the training and test data.
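
The projection step can be sketched in NumPy as follows. This is a simplified illustration, not scikit-learn's implementation: `fit_pca` and `transform` are hypothetical helpers, the data is synthetic, and the inputs are assumed to be already standardized. The number of components is the smallest one whose cumulative explained variance reaches the target, mirroring the idea behind `n_components=0.95`.

```python
import numpy as np

def fit_pca(X_train, variance_to_keep=0.95):
    """Fit PCA on training rows; keep the fewest components reaching the variance target."""
    center = X_train.mean(axis=0)
    _, singular_values, components = np.linalg.svd(X_train - center, full_matrices=False)
    explained = singular_values ** 2
    ratio = explained / explained.sum()
    n_keep = int(np.searchsorted(np.cumsum(ratio), variance_to_keep) + 1)
    n_keep = min(n_keep, len(ratio))  # guard against rounding at the upper end
    return center, components[:n_keep], ratio[:n_keep]

def transform(X, center, components):
    """Project rows onto the retained principal directions."""
    return (X - center) @ components.T

# Synthetic data: three features driven by one latent factor plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 1))
X_train = latent @ np.array([[1.0, 2.0, 3.0]]) + 0.01 * rng.normal(size=(30, 3))
X_test = np.array([[0.5, 1.0, 1.5]])

center, components, ratio = fit_pca(X_train)
Z_train = transform(X_train, center, components)
Z_test = transform(X_test, center, components)
print(components.shape[0], ratio)
```

Because the three synthetic features are almost perfectly correlated, a single component is enough to pass the 95% threshold here.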

In [122]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

#### Claudia

In [123]:
predictions = []
claudia_mlp = []
for i in days:
  # Scan the file and set data
  data = pd.read_csv(f'./claudia/1_{i}.csv', usecols=columns, encoding="utf-8")
  # Training and testing data; Remove last row which is the testing row
  training_data = data.iloc[:-1]
  testing_data = pd.DataFrame(data.iloc[-1])
  testing_data = testing_data.T
  # Splitting
  X_train = training_data[feature_columns]
  y_train = training_data[target]

  X_test = testing_data[feature_columns]
  y_test = testing_data[target]

  # Instantiating both the standard scaler and PCA
  scaler = StandardScaler()
  pca = PCA(n_components=0.95, svd_solver='full')

  # Fit the scaler on X_train only, then transform both X_train and X_test with it
  scaler.fit(X_train)
  X_train = scaler.transform(X_train)
  X_test = scaler.transform(X_test)

  # Fit PCA on X_train only, then transform both X_train and X_test with it
  pca.fit(X_train)
  X_train = pca.transform(X_train)
  X_test = pca.transform(X_test)

  # Details about the PCA instantiation
  pca_components = pca.n_components_
  pca_variance = pca.explained_variance_ratio_

  print(f"{pca_components} resulting components for this data, which is based on {i} days for candidate 1.")

  # Assign random states (as literature)
  for y in seeds:
    regr = MLPRegressor(hidden_layer_sizes=(3, ), solver="lbfgs", alpha=0.05, random_state=y, max_iter=500).fit(X_train, y_train)
    # Predict
    prediction = regr.predict(X_test)
    # Append prediction
    claudia_mlp.append(prediction[0])
  # Append the running average of all seed predictions so far (claudia_mlp is not reset per window)
  predictions.append(average(claudia_mlp))

3 resulting components for this data, which is based on 1 days for candidate 1.


  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

3 resulting components for this data, which is based on 2 days for candidate 1.
4 resulting components for this data, which is based on 3 days for candidate 1.
4 resulting components for this data, which is based on 4 days for candidate 1.
4 resulting components for this data, which is based on 5 days for candidate 1.
4 resulting components for this data, which is based on 6 days for candidate 1.
4 resulting components for this data, which is based on 7 days for candidate 1.
4 resulting components for this data, which is based on 14 days for candidate 1.
4 resulting components for this data, which is based on 21 days for candidate 1.
4 resulting components for this data, which is based on 28 days for candidate 1.

In [124]:
print(f"The number of predictions to be averaged is {len(predictions)} and for that, the predicted result by MLP alone is {round(average(predictions))}%, the real vote share was {int(y_test['Target'].iloc[0])}%");

The number of predictions to be averaged is 10 and for that, the predicted result by MLP alone is 56%, the real vote share was 63%


#### Gálvez

In [125]:
predictions = []
galvez_mlp = []
for i in days:
  # Scan the file and set data
  data = pd.read_csv(f'./galvez/2_{i}.csv', usecols=columns, encoding="utf-8")
  # Training and testing data; Remove last row which is the testing row
  training_data = data.iloc[:-1]
  testing_data = pd.DataFrame(data.iloc[-1])
  testing_data = testing_data.T
  # Splitting
  X_train = training_data[feature_columns]
  y_train = training_data[target]

  X_test = testing_data[feature_columns]
  y_test = testing_data[target]

  # Instantiating both the standard scaler and PCA
  scaler = StandardScaler()
  pca = PCA(n_components=0.95, svd_solver='full')

  # Fit the scaler on X_train only, then transform both X_train and X_test with it
  scaler.fit(X_train)
  X_train = scaler.transform(X_train)
  X_test = scaler.transform(X_test)

  # Fit PCA on X_train only, then transform both X_train and X_test with it
  pca.fit(X_train)
  X_train = pca.transform(X_train)
  X_test = pca.transform(X_test)

  # Details about the PCA instantiation
  pca_components = pca.n_components_
  pca_variance = pca.explained_variance_ratio_

  print(f"{pca_components} resulting components for this data, which is based on {i} days for candidate 2.")

  # Assign random states (as literature)
  for y in seeds:
    regr = MLPRegressor(hidden_layer_sizes=(3, ), solver="lbfgs", alpha=0.05, random_state=y, max_iter=500).fit(X_train, y_train)
    # Predict
    prediction = regr.predict(X_test)
    # Append prediction
    galvez_mlp.append(prediction[0])
  # Append the running average of all seed predictions so far (galvez_mlp is not reset per window)
  predictions.append(average(galvez_mlp))

3 resulting components for this data, which is based on 1 days for candidate 2.


  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

3 resulting components for this data, which is based on 2 days for candidate 2.
3 resulting components for this data, which is based on 3 days for candidate 2.
3 resulting components for this data, which is based on 4 days for candidate 2.
3 resulting components for this data, which is based on 5 days for candidate 2.
3 resulting components for this data, which is based on 6 days for candidate 2.
3 resulting components for this data, which is based on 7 days for candidate 2.
3 resulting components for this data, which is based on 14 days for candidate 2.
3 resulting components for this data, which is based on 21 days for candidate 2.
3 resulting components for this data, which is based on 28 days for candidate 2.

In [126]:
print(f"The number of predictions to be averaged is {len(predictions)} and for that, the predicted result by MLP alone is {round(average(predictions))}%, the real vote share was {int(y_test['Target'].iloc[0])}%");

The number of predictions to be averaged is 10 and for that, the predicted result by MLP alone is 25%, the real vote share was 22%


## Linear Regression

LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.
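
The least-squares fit described above can be sketched with NumPy alone. This is an illustration on toy data generated from a known linear rule, and `fit_least_squares` is a hypothetical helper; the cells in this notebook use scikit-learn's `LinearRegression` instead.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return (intercept, coefficients) minimizing the residual sum of squares."""
    design = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones for the intercept
    solution, *_ = np.linalg.lstsq(design, y, rcond=None)
    return solution[0], solution[1:]

# Toy data generated from y = 1 + 2*x1 + 3*x2 (illustrative only)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

intercept, coef = fit_least_squares(X, y)
prediction = intercept + np.array([3.0, 2.0]) @ coef
print(intercept, coef, prediction)
```

Since the toy targets are noise-free, the fit recovers the generating coefficients and predicts 13 for the new point `(3, 2)`.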

### LR

#### Claudia

In [127]:
predictions = []
claudia_lr = []
for i in days:
  # Scan the file and set data
  data = pd.read_csv(f'./claudia/1_{i}.csv', usecols=columns, encoding="utf-8")
  # Training and testing data; Remove last row which is the testing row
  training_data = data.iloc[:-1]
  testing_data = pd.DataFrame(data.iloc[-1])
  testing_data = testing_data.T
  # Splitting
  X_train = training_data[feature_columns]
  y_train = training_data[target]

  X_test = testing_data[feature_columns]
  y_test = testing_data[target]
  # LinearRegression is deterministic, so the seed loop repeats the same fit five times
  for y in seeds:
    regr = LinearRegression().fit(X_train, y_train)
    # Predict
    prediction = regr.predict(X_test)
    # Append prediction
    claudia_lr.append(prediction[0][0])
  # Append the running average of all predictions so far (claudia_lr is not reset per window)
  predictions.append(average(claudia_lr))

In [128]:
print(f"The number of predictions to be averaged is {len(predictions)} and for that, the predicted result by LR alone is {round(average(predictions))}%, the real vote share was {int(y_test['Target'].iloc[0])}%");

The number of predictions to be averaged is 10 and for that, the predicted result by LR alone is 56%, the real vote share was 63%


#### Gálvez

In [129]:
predictions = []
galvez_lr = []
for i in days:
  # Scan the file and set data
  data = pd.read_csv(f'./galvez/2_{i}.csv', usecols=columns, encoding="utf-8")
  # Training and testing data; Remove last row which is the testing row
  training_data = data.iloc[:-1]
  testing_data = pd.DataFrame(data.iloc[-1])
  testing_data = testing_data.T
  # Splitting
  X_train = training_data[feature_columns]
  y_train = training_data[target]

  X_test = testing_data[feature_columns]
  y_test = testing_data[target]
  # LinearRegression is deterministic, so the seed loop repeats the same fit five times
  for y in seeds:
    regr = LinearRegression().fit(X_train, y_train)
    # Predict
    prediction = regr.predict(X_test)
    # Append prediction
    galvez_lr.append(prediction[0][0])
  # Append the running average of all predictions so far (galvez_lr is not reset per window)
  predictions.append(average(galvez_lr))

In [130]:
print(f"The number of predictions to be averaged is {len(predictions)} and for that, the predicted result by LR alone is {round(average(predictions))}%, the real vote share was {int(y_test['Target'].iloc[0])}%");

The number of predictions to be averaged is 10 and for that, the predicted result by LR alone is 25%, the real vote share was 22%


Once again, there is a noticeable difference between standardizing the data and not doing so, as above; this may be worth digging into in the EDA notebook. Is there much more variance in Gálvez's data?

### LR Scaled

In [131]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaler = StandardScaler()
minmaxer = MinMaxScaler()

#### Claudia

In [132]:
predictions = []
claudia_lr = []
for i in days:
  # Scan the file and set data
  data = pd.read_csv(f'./claudia/1_{i}.csv', usecols=columns, encoding="utf-8")
  # Training and testing data; Remove last row which is the testing row
  training_data = data.iloc[:-1]
  testing_data = pd.DataFrame(data.iloc[-1])
  testing_data = testing_data.T
  # Splitting
  X_train = training_data[feature_columns]
  y_train = training_data[target]

  X_test = testing_data[feature_columns]
  y_test = testing_data[target]

  # Fit the scaler on X_train only, then transform both X_train and X_test with it
  scaler.fit(X_train)
  X_train = scaler.transform(X_train)
  X_test = scaler.transform(X_test)

  # LinearRegression is deterministic, so the seed loop repeats the same fit five times
  for y in seeds:
    regr = LinearRegression().fit(X_train, y_train)
    # Predict
    prediction = regr.predict(X_test)
    # Append prediction
    claudia_lr.append(prediction[0][0])
  # Append the running average of all predictions so far (claudia_lr is not reset per window)
  predictions.append(average(claudia_lr))

In [133]:
print(f"The number of predictions to be averaged is {len(predictions)} and for that, the predicted result by LR alone is {round(average(predictions))}%, the real vote share was {int(y_test['Target'].iloc[0])}%");

The number of predictions to be averaged is 10 and for that, the predicted result by LR alone is 56%, the real vote share was 63%


#### Gálvez

In [135]:
predictions = []
galvez_lr = []
for i in days:
  # Scan the file and set data
  data = pd.read_csv(f'./galvez/2_{i}.csv', usecols=columns, encoding="utf-8")
  # Training and testing data; Remove last row which is the testing row
  training_data = data.iloc[:-1]
  testing_data = pd.DataFrame(data.iloc[-1])
  testing_data = testing_data.T
  # Splitting
  X_train = training_data[feature_columns]
  y_train = training_data[target]

  X_test = testing_data[feature_columns]
  y_test = testing_data[target]

  # Fit the scaler on X_train only, then transform both X_train and X_test with it
  scaler.fit(X_train)
  X_train = scaler.transform(X_train)
  X_test = scaler.transform(X_test)
  
  # Repeat once per seed (as in the literature). LinearRegression is deterministic,
  # so the seed is unused and the five predictions are identical.
  for seed in seeds:
    regr = LinearRegression().fit(X_train, y_train)
    # Predict
    prediction = regr.predict(X_test)
    # Append prediction
    galvez_lr.append(prediction[0][0])
  # Append the mean of all predictions collected so far (galvez_lr is not reset between days)
  predictions.append(average(galvez_lr))

In [136]:
print(f"The number of predictions to be averaged is {len(predictions)} and for that, the predicted result by LR alone is {round(average(predictions))}%, the real vote share was {int(y_test['Target'].iloc[0])}%");

The number of predictions to be averaged is 10 and for that, the predicted result by LR alone is 25%, the real vote share was 22%


Once again, the importance of scaling...
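
One possible explanation, offered as a hypothesis rather than a finding: ordinary least squares with an intercept gives the same predictions with or without standardization when there are more rows than features and the features are not collinear. When the problem is underdetermined, `LinearRegression` falls back to a minimum-norm solution, and that solution depends on the feature scales. The sketch below illustrates this on synthetic data only; the helper `raw_vs_scaled` and all numbers are assumptions, and whether the CSV files are in the underdetermined regime still has to be checked.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def raw_vs_scaled(n_rows, n_features, seed=0):
    """Return (raw_prediction, scaled_prediction) for one held-out synthetic row."""
    rng = np.random.default_rng(seed)
    # Columns on very different scales, loosely mimicking counts vs. per-post averages.
    scales = np.logspace(0, 4, n_features)
    X = rng.normal(size=(n_rows + 1, n_features)) * scales
    y = rng.normal(size=n_rows + 1)
    X_train, X_test, y_train = X[:-1], X[-1:], y[:-1]

    raw = LinearRegression().fit(X_train, y_train).predict(X_test)[0]

    scaler = StandardScaler().fit(X_train)
    scaled = (
        LinearRegression()
        .fit(scaler.transform(X_train), y_train)
        .predict(scaler.transform(X_test))[0]
    )
    return raw, scaled


well_posed = raw_vs_scaled(n_rows=50, n_features=5)
underdetermined = raw_vs_scaled(n_rows=8, n_features=20)
print("rows > features, predictions match:", np.isclose(*well_posed))
print("rows < features, predictions match:", np.isclose(*underdetermined))
```

A quick check on the real data would be to compare `X_train.shape[0]` with `X_train.shape[1]` for each file.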

### LR w/ PCA

In [137]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

#### Claudia

In [138]:
predictions = []
claudia_lr = []
for i in days:
  # Scan the file and set data
  data = pd.read_csv(f'./claudia/1_{i}.csv', usecols=columns, encoding="utf-8")
  # Training and testing data; Remove last row which is the testing row
  training_data = data.iloc[:-1]
  testing_data = pd.DataFrame(data.iloc[-1])
  testing_data = testing_data.T
  # Splitting
  X_train = training_data[feature_columns]
  y_train = training_data[target]

  X_test = testing_data[feature_columns]
  y_test = testing_data[target]

  # Instantiating both the standard scaler and PCA
  scaler = StandardScaler()
  pca = PCA(n_components=0.95, svd_solver='full')

  # Fit the scaler on X_train only, then transform both X_train and X_test with it
  scaler.fit(X_train)
  X_train = scaler.transform(X_train)
  X_test = scaler.transform(X_test)

  # Fit PCA on the scaled X_train only, then project both X_train and X_test with it
  pca.fit(X_train)
  X_train = pca.transform(X_train)
  X_test = pca.transform(X_test)

  # Details about the PCA instantiation
  pca_components = pca.n_components_
  pca_variance = pca.explained_variance_ratio_

  print(f"{pca_components} resulting components for this data, which is based on {i} days for candidate 1.")
  
  # Repeat once per seed (as in the literature). LinearRegression is deterministic,
  # so the seed is unused and the five predictions are identical.
  for seed in seeds:
    regr = LinearRegression().fit(X_train, y_train)
    # Predict
    prediction = regr.predict(X_test)
    # Append prediction
    claudia_lr.append(prediction[0][0])
  # Append the mean of all predictions collected so far (claudia_lr is not reset between days)
  predictions.append(average(claudia_lr))

3 resulting components for this data, which is based on 1 days for candidate 1.
3 resulting components for this data, which is based on 2 days for candidate 1.
4 resulting components for this data, which is based on 3 days for candidate 1.
4 resulting components for this data, which is based on 4 days for candidate 1.
4 resulting components for this data, which is based on 5 days for candidate 1.
4 resulting components for this data, which is based on 6 days for candidate 1.
4 resulting components for this data, which is based on 7 days for candidate 1.
4 resulting components for this data, which is based on 14 days for candidate 1.
4 resulting components for this data, which is based on 21 days for candidate 1.
4 resulting components for this data, which is based on 28 days for candidate 1.


In [139]:
print(f"The number of predictions to be averaged is {len(predictions)} and for that, the predicted result by LR alone is {round(average(predictions))}%, the real vote share was {int(y_test['Target'].iloc[0])}%");

The number of predictions to be averaged is 10 and for that, the predicted result by LR alone is 55%, the real vote share was 63%


#### Galvez

In [140]:
predictions = []
galvez_lr = []
for i in days:
  # Scan the file and set data
  data = pd.read_csv(f'./galvez/2_{i}.csv', usecols=columns, encoding="utf-8")
  # Training and testing data; Remove last row which is the testing row
  training_data = data.iloc[:-1]
  testing_data = pd.DataFrame(data.iloc[-1])
  testing_data = testing_data.T
  # Splitting
  X_train = training_data[feature_columns]
  y_train = training_data[target]

  X_test = testing_data[feature_columns]
  y_test = testing_data[target]

  # Instantiating both the standard scaler and PCA
  scaler = StandardScaler()
  pca = PCA(n_components=0.95, svd_solver='full')

  # Fit the scaler on X_train only, then transform both X_train and X_test with it
  scaler.fit(X_train)
  X_train = scaler.transform(X_train)
  X_test = scaler.transform(X_test)

  # Fit PCA on the scaled X_train only, then project both X_train and X_test with it
  pca.fit(X_train)
  X_train = pca.transform(X_train)
  X_test = pca.transform(X_test)

  # Details about the PCA instantiation
  pca_components = pca.n_components_
  pca_variance = pca.explained_variance_ratio_

  print(f"{pca_components} resulting components for this data, which is based on {i} days for candidate 2.")
  
  # Repeat once per seed (as in the literature). LinearRegression is deterministic,
  # so the seed is unused and the five predictions are identical.
  for seed in seeds:
    regr = LinearRegression().fit(X_train, y_train)
    # Predict
    prediction = regr.predict(X_test)
    # Append prediction
    galvez_lr.append(prediction[0][0])
  # Append the mean of all predictions collected so far (galvez_lr is not reset between days)
  predictions.append(average(galvez_lr))

3 resulting components for this data, which is based on 1 days for candidate 2.
3 resulting components for this data, which is based on 2 days for candidate 2.
3 resulting components for this data, which is based on 3 days for candidate 2.
3 resulting components for this data, which is based on 4 days for candidate 2.
3 resulting components for this data, which is based on 5 days for candidate 2.
3 resulting components for this data, which is based on 6 days for candidate 2.
3 resulting components for this data, which is based on 7 days for candidate 2.
3 resulting components for this data, which is based on 14 days for candidate 2.
3 resulting components for this data, which is based on 21 days for candidate 2.
3 resulting components for this data, which is based on 28 days for candidate 2.


In [141]:
print(f"The number of predictions to be averaged is {len(predictions)} and for that, the predicted result by LR alone is {round(average(predictions))}%, the real vote share was {int(y_test['Target'].iloc[0])}%");

The number of predictions to be averaged is 10 and for that, the predicted result by LR alone is 25%, the real vote share was 22%
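
The component counts printed above come from `PCA(n_components=0.95, svd_solver='full')`, which keeps the smallest number of components whose cumulative explained-variance ratio passes 0.95. A small sketch on synthetic data shows how that count can be reproduced by hand; the latent-factor data are an assumption chosen only so that a few components dominate.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 30 rows, 8 correlated features driven by 3 latent factors.
latent = rng.normal(size=(30, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.05 * rng.normal(size=(30, 8))
X_scaled = StandardScaler().fit_transform(X)

# Full decomposition: one explained-variance ratio per component.
full_pca = PCA(svd_solver="full").fit(X_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.95.
manual_count = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Same threshold, letting scikit-learn pick the count.
auto_pca = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)
print(manual_count, auto_pca.n_components_)
```

Printing `cumulative` for each of the notebook's files would show how close the 3-component and 4-component cases sit to the 0.95 threshold.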


## General Regression NN (GRNN)

The GRNN is also trained in a supervised fashion and belongs to the family of probabilistic neural networks. These networks are attractive because they learn in a single pass through the data and can generalize from examples as soon as they are stored. A prediction is a kernel-weighted average of the stored training targets, with the bandwidth `sigma` controlling how quickly the weights decay with distance.
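
To make the "one pass" idea concrete, here is a minimal NumPy sketch of an isotropic GRNN prediction with a Gaussian kernel. It is an illustration under stated assumptions, not pyGRNN's implementation: the function name `grnn_predict` and the toy data are made up for this example.

```python
import numpy as np


def grnn_predict(X_train, y_train, X_query, sigma):
    """Kernel-weighted average of stored targets (isotropic Gaussian kernel)."""
    # Squared Euclidean distance between every query row and every stored row.
    diffs = X_query[:, None, :] - X_train[None, :, :]
    sq_dist = np.sum(diffs ** 2, axis=2)
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Normalise the weights so each prediction is a weighted mean of y_train.
    return weights @ y_train / weights.sum(axis=1)


# Toy data: "training" is just storing these three points.
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([10.0, 20.0, 30.0])
X_query = np.array([[1.0], [0.0]])

# A narrow bandwidth reproduces the nearest stored target.
narrow = grnn_predict(X_train, y_train, X_query, sigma=0.05)
# A very wide bandwidth approaches the global mean of y_train.
wide = grnn_predict(X_train, y_train, X_query, sigma=100.0)
print(narrow, wide)
```

This is why the grid search in this section only needs to tune `sigma`: there are no weights to learn, only the smoothness of the average.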

In [5]:
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split
from pyGRNN import GRNN

#### Claudia

In [6]:
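# Scan the file and set data
data = pd.read_csv('./claudia/1_1.csv', usecols=columns, encoding="utf-8")
# Separate features and target
X = np.array(data[feature_columns])
y = np.array(data[target])
# Min-max scale X and y, then split into training and testing sets.
# Note: scaling happens before the split, so the test rows influence the scaling,
# and no random_state is set, so the split changes between runs.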
X_train, X_test, y_train, y_test = train_test_split(preprocessing.minmax_scale(X),
                                                    preprocessing.minmax_scale(y.reshape((-1, 1))),
                                                    test_size=0.05)
# Example 1: use Isotropic GRNN with a Grid Search Cross validation to select the optimal bandwidth
IGRNN = GRNN()
params_IGRNN = {'kernel':["RBF"],
                'sigma' : list(np.arange(0.1, 4, 0.2)),
                'calibration' : ['None']
                 }
grid_IGRNN = GridSearchCV(estimator=IGRNN,
                          param_grid=params_IGRNN,
                          scoring='neg_mean_squared_error',
                          cv=5,
                          verbose=1
                          )
grid_IGRNN.fit(X_train, y_train.ravel())
best_model = grid_IGRNN.best_estimator_
y_pred = best_model.predict(X_test)
mse_IGRNN = MSE(y_test, y_pred)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
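
Because the cell above min-max scales the target before fitting, `y_pred` and `mse_IGRNN` are in scaled units rather than percentage points. A hedged sketch of mapping predictions back with `MinMaxScaler.inverse_transform` follows; the vote shares and prediction errors are made-up numbers for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

# Made-up vote shares (percent) and made-up scaled predictions, for illustration only.
y_true = np.array([[40.0], [45.0], [55.0], [60.0]])
y_scaler = MinMaxScaler().fit(y_true)
y_true_scaled = y_scaler.transform(y_true)
y_pred_scaled = y_true_scaled + np.array([[0.05], [-0.05], [0.10], [0.0]])

# Error in scaled units, as computed in the cell above.
mse_scaled = mean_squared_error(y_true_scaled, y_pred_scaled)

# Map predictions back to percentage points before reporting an error.
y_pred = y_scaler.inverse_transform(y_pred_scaled)
mse_percent = mean_squared_error(y_true, y_pred)

# The two errors differ by the squared range of the target.
target_range = y_true.max() - y_true.min()
print(mse_scaled, mse_percent, target_range)
```

Keeping a fitted scaler object, instead of calling `preprocessing.minmax_scale` directly, is what makes the inverse mapping available.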


## SVMs (Support Vector Machine)