# Political Sentiment Analyzer - Feature Engineering

## Michael Trent

This notebook explores feature engineering for the political sentiment analyzer project.

## Process

### 1. Import packages and load the data

### 2. Explore the count vectorizer in sklearn



In [83]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import PorterStemmer
from nltk import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras import Input
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
import tensorflow
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras import utils

In [7]:
articles = pd.read_pickle('Data/articles.pkl')

Rather than using the AllSides bias indicator directly ("from the left", "from the center", or "from the right"), for simplicity's sake we'll code the bias as -1, 0, or 1 for 'left', 'center', or 'right' respectively. The capitalization in the bias column is inconsistent, so we also lowercase the labels.

In [77]:
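# Tokenize the AllSides label, e.g. "from the left" -> ['from', 'the', 'left']
articles['bias'] = articles.allsides_bias.apply(word_tokenize)
# Keep only the last token, lowercased: 'left', 'center', or 'right'
articles['bias'] = articles.bias.apply(lambda x: x[-1].lower())
# Note: any label other than 'left' or 'center' falls through to 1 ('right')
articles['bias_score'] = articles.bias.apply(lambda x: -1 if x == 'left' else 0 if x == 'center' else 1)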
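
The nested conditional above sends every label that is not 'left' or 'center' to 1, so a missing or misspelled label would silently count as 'right'. Below is a minimal sketch of a stricter mapping, assuming the labels look like the examples quoted earlier. `BIAS_SCORES` and `bias_to_score` are hypothetical names, and plain `str.split` stands in for `word_tokenize`.

```python
# Hypothetical helper: map an AllSides label to a numeric bias score.
BIAS_SCORES = {'left': -1, 'center': 0, 'right': 1}

def bias_to_score(label):
    """Return -1, 0, or 1 for a label such as 'From the Left', else None."""
    tokens = str(label).split()
    if not tokens:
        return None
    # The last word carries the direction; lowercase it for consistency.
    return BIAS_SCORES.get(tokens[-1].lower())

print(bias_to_score('From the Left'))     # -1
print(bias_to_score('from the center'))   # 0
print(bias_to_score('Mixed'))             # None
```

Applied with `articles.allsides_bias.apply(bias_to_score)`, unrecognized labels would surface as missing values that can be inspected before training, rather than being counted as 'right'.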


In [163]:
comp = ''
for word in articles.stemmed[0]:
    comp = comp + ' ' + word
comp

' obama administr alum roger fisk republican strategist chri turner weigh presidenti elect field would look like howard schultz ran independ presid trump monday fire back howard schultz say former starbuck ceo gut run presid announc mull white hous bid independ took shot trump howard schultz gut run presid watch last night i agre smartest besid america alreadi i hope starbuck still pay rent trump tower trump tweet monday morn ceo howard schultz aposseri apo mull presidenti run bash trump as aposnot qualifi apo the presid tweet come schultz aposs interview cb news minut reveal serious think run presid centrist independ challeng trump fit offic we live fragil time fact presid qualifi presid fact parti consist necessari behalf american peopl schultz said specif cite nation debt reckless exampl republican democrat well reckless failur constitut a fieri respons trump expect when told interview trump would tweet respons schultz said i aposv becom bore presid trump tweet schultz though take h

In [169]:
# Join each article's list of stemmed tokens into one space-separated string
articles['content_cleaned'] = articles.stemmed.apply(' '.join)

In [171]:
x_train, x_test, y_train, y_test = train_test_split(articles.content_cleaned, articles.bias)

In [172]:
vect = CountVectorizer()
vect.fit(x_train)

CountVectorizer()

In [173]:
x_train_dtm = vect.transform(x_train)

In [174]:
x_test_dtm = vect.transform(x_test)
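
The count vectorizer learns its vocabulary from the training text only, and `transform` then counts occurrences of those vocabulary words in each document; words that appear only in the test set are ignored. Below is a simplified pure-Python sketch of that idea. Whitespace tokenization is an assumption here (sklearn's default tokenizer also lowercases and drops single-character tokens), and `fit_vocabulary` and `transform_counts` are hypothetical names.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct training word to a column index (alphabetical order)."""
    words = sorted({word for doc in docs for word in doc.split()})
    return {word: idx for idx, word in enumerate(words)}

def transform_counts(docs, vocab):
    """Build a dense document-term matrix; out-of-vocabulary words are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(word for word in doc.split() if word in vocab)
        rows.append([counts.get(word, 0) for word in vocab])
    return rows

train_docs = ['trump tweet trump', 'schultz run presid']
test_docs = ['schultz tweet senat']

vocab = fit_vocabulary(train_docs)
print(vocab)                                # {'presid': 0, 'run': 1, 'schultz': 2, 'trump': 3, 'tweet': 4}
print(transform_counts(train_docs, vocab))  # [[0, 0, 0, 2, 1], [1, 1, 1, 0, 0]]
print(transform_counts(test_docs, vocab))   # [[0, 0, 1, 0, 1]]
```

sklearn returns a sparse matrix rather than nested lists; the dense form here is only for readability.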

In [322]:
max_tokens = 600
max_len = 100
vectorize_layer = TextVectorization(max_tokens = max_tokens, output_mode = 'int', output_sequence_length = max_len)

In [323]:
# I need to convert the data from a pandas series to a tensorflow Dataset

x_train_tf = tensorflow.data.Dataset.from_tensor_slices(x_train)

In [324]:
#Adapt the vectorize layer to the training text
vectorize_layer.adapt(x_train_tf)
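
In `'int'` output mode, the layer replaces each token with its vocabulary index and pads or truncates every sequence to `output_sequence_length`. Below is a rough pure-Python sketch of that behavior. It assumes index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, and the function names are hypothetical; the real layer also lowercases and strips punctuation by default.

```python
from collections import Counter

def adapt_vocabulary(texts, max_tokens):
    """Keep the most frequent words; indices 0 and 1 are reserved."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    kept = ranked[:max_tokens - 2]   # leave room for padding and OOV
    return {word: idx + 2 for idx, word in enumerate(kept)}

def vectorize(text, vocab, max_len):
    """Turn text into a fixed-length list of integer token ids."""
    ids = [vocab.get(word, 1) for word in text.split()]  # 1 = out of vocabulary
    ids = ids[:max_len]                                  # truncate long texts
    return ids + [0] * (max_len - len(ids))              # pad short texts with 0

vocab = adapt_vocabulary(['trump tweet trump', 'schultz tweet'], max_tokens=4)
print(vocab)                                        # {'trump': 2, 'tweet': 3}
print(vectorize('trump schultz tweet', vocab, 5))   # [2, 1, 3, 0, 0]
```

With `max_tokens = 600` and `max_len = 100` as set above, each article becomes 100 integers drawn from at most 600 ids, so any article longer than 100 tokens is cut off.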

In [428]:
y_train_cat2 = pd.get_dummies(y_train).to_numpy()

In [474]:
y_train_cat2

array([[0, 0, 1],
       [0, 1, 0],
       [0, 1, 0],
       ...,
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]], dtype=uint8)

In [367]:
encoder = LabelEncoder()
encoder.fit(y_train)
encoded_y = encoder.transform(y_train)
#Convert to dummy variables
y_train_cat = utils.to_categorical(encoded_y)
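
`pd.get_dummies` and `LabelEncoder` both order the classes by sorting the label values, so for string labels the columns come out as center, left, right rather than left, center, right. Below is a small pure-Python sketch of one-hot encoding with that ordering; `one_hot` is a hypothetical helper.

```python
def one_hot(labels):
    """One-hot encode labels, with columns in sorted class order."""
    classes = sorted(set(labels))
    index = {cls: i for i, cls in enumerate(classes)}
    rows = [[1 if index[label] == i else 0 for i in range(len(classes))]
            for label in labels]
    return classes, rows

classes, rows = one_hot(['right', 'left', 'center', 'left'])
print(classes)  # ['center', 'left', 'right']
print(rows)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Any per-class weights applied to the model's predictions need to follow this same column order.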

In [513]:
#Create the model
model = Sequential()
model.add(Input(shape = (1,), dtype = 'string'))

#Add the layer to the model
model.add(vectorize_layer)
#We now need an embedding layer to convert the integers from the vectorize layer into fixed-length vectors
model.add(Embedding(max_tokens + 1, 10))
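#Now add the recurrent LSTM layer; its first argument defines the dimensionality of the output space
#Note: with every LSTM layer commented out, the Dense layers below receive the 3D embedding output
#of shape (batch, max_len, 10), so the model emits one prediction per token position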
#model.add(LSTM(10, return_sequences=True))
#model.add(LSTM(10, return_sequences=True))
#model.add(LSTM(10, return_sequences=True))
#model.add(LSTM(10))

#I'm going to try adding a hidden layer
model.add(Dense(3, activation = 'relu'))
#model.add(Dense(32, activation = 'relu'))
#model.add(Dense(100, activation = 'relu'))
#model.add(Dense(64, activation = 'relu'))
#Finally we need to add the output layer, which is a fully connected dense layer using the softmax activation function
#so that the three outputs are class probabilities that sum to 1

model.add(Dense(3, activation = 'relu'))
model.add(Dense(3, activation = 'softmax'))

In [514]:
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy', 'mae'])

In [515]:
early_stopping = tensorflow.keras.callbacks.EarlyStopping(patience = 50)
model.fit(x_train, y_train_cat2, epochs = 200, validation_split = 0.25, callbacks = [early_stopping])

Epoch 1/200


ValueError: in user code:

    C:\Users\micha\anaconda3\lib\site-packages\tensorflow\python\keras\engine\training.py:806 train_function  *
        return step_function(self, iterator)
    C:\Users\micha\anaconda3\lib\site-packages\tensorflow\python\keras\engine\training.py:796 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    C:\Users\micha\anaconda3\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:1211 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    C:\Users\micha\anaconda3\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2585 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    C:\Users\micha\anaconda3\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2945 _call_for_each_replica
        return fn(*args, **kwargs)
    C:\Users\micha\anaconda3\lib\site-packages\tensorflow\python\keras\engine\training.py:789 run_step  **
        outputs = model.train_step(data)
    C:\Users\micha\anaconda3\lib\site-packages\tensorflow\python\keras\engine\training.py:749 train_step
        y, y_pred, sample_weight, regularization_losses=self.losses)
    C:\Users\micha\anaconda3\lib\site-packages\tensorflow\python\keras\engine\compile_utils.py:204 __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    C:\Users\micha\anaconda3\lib\site-packages\tensorflow\python\keras\losses.py:149 __call__
        losses = ag_call(y_true, y_pred)
    C:\Users\micha\anaconda3\lib\site-packages\tensorflow\python\keras\losses.py:253 call  **
        return ag_fn(y_true, y_pred, **self._fn_kwargs)
    C:\Users\micha\anaconda3\lib\site-packages\tensorflow\python\util\dispatch.py:201 wrapper
        return target(*args, **kwargs)
    C:\Users\micha\anaconda3\lib\site-packages\tensorflow\python\keras\losses.py:1535 categorical_crossentropy
        return K.categorical_crossentropy(y_true, y_pred, from_logits=from_logits)
    C:\Users\micha\anaconda3\lib\site-packages\tensorflow\python\util\dispatch.py:201 wrapper
        return target(*args, **kwargs)
    C:\Users\micha\anaconda3\lib\site-packages\tensorflow\python\keras\backend.py:4687 categorical_crossentropy
        target.shape.assert_is_compatible_with(output.shape)
    C:\Users\micha\anaconda3\lib\site-packages\tensorflow\python\framework\tensor_shape.py:1134 assert_is_compatible_with
        raise ValueError("Shapes %s and %s are incompatible" % (self, other))

    ValueError: Shapes (None, 3) and (None, 100, 3) are incompatible
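
The error says the targets have shape (None, 3) while the model outputs (None, 100, 3). A likely cause is that, with the LSTM layers commented out, the Dense layers are applied to the embedding output at each of the 100 token positions, so the sequence axis is never collapsed. The pure-Python sketch below uses toy sizes and hypothetical helpers to mimic that shape behavior, and shows that averaging over the sequence axis gives one prediction per article. In Keras, the corresponding change would be to re-enable the final `LSTM(10)` layer or to add a pooling layer such as `GlobalAveragePooling1D` before the Dense layers.

```python
def shape(x):
    """Return the shape of a rectangular nested list."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

def dense(vector, weights):
    """Apply a weight matrix (list of output rows) to one feature vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def dense_per_step(batch, weights):
    """Like a Dense layer on 3D input: acts on the last axis at every time step."""
    return [[dense(step, weights) for step in seq] for seq in batch]

def mean_pool(batch):
    """Average over the sequence axis, leaving one vector per example."""
    return [[sum(col) / len(seq) for col in zip(*seq)] for seq in batch]

# Toy sizes: 2 articles, 4 token positions, embedding size 5, 3 classes.
embedded = [[[0.1] * 5 for _ in range(4)] for _ in range(2)]
weights = [[1.0] * 5 for _ in range(3)]

# Dense applied directly to the sequence keeps the token axis.
per_token = dense_per_step(embedded, weights)
print(shape(per_token))    # (2, 4, 3)

# Pooling first collapses the token axis, so the output matches one-hot targets.
pooled = mean_pool(embedded)
per_article = [dense(vec, weights) for vec in pooled]
print(shape(per_article))  # (2, 3)
```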


In [None]:
def plot_metric(history, metric):
    train_metrics = history.history[metric]
    val_metrics = history.history['val_'+metric]
    epochs = range(1, len(train_metrics) + 1)
    plt.plot(epochs, train_metrics)
    plt.plot(epochs, val_metrics)
    plt.title('Training and validation '+ metric)
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend(["train_"+metric, 'val_'+metric])
    plt.show()

In [None]:
plot_metric(model.history, 'mae')

In [None]:
pred = model.predict(x_test)

In [401]:
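# Weight each class probability by its bias score and sum to get one score per article.
# Caution: the weights [-1, 0, 1] assume the columns are ordered left, center, right, but
# pd.get_dummies and LabelEncoder sort string labels alphabetically (center, left, right).
preds = pred*[-1, 0, 1]

scores = np.sum(a = preds, axis = 1)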

In [402]:
scores


array([0.02289035, 0.10275431, 0.20798171, ..., 0.49131817, 0.38642809,
       0.03334981])
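
The score is the expected bias under the predicted class probabilities: each probability is multiplied by its class's bias score and the products are summed. Below is a minimal sketch that ties the weights to class names instead of column positions, assuming the columns follow the alphabetical order produced by the label encoding. `BIAS_SCORES`, `class_order`, and `expected_bias` are hypothetical names.

```python
BIAS_SCORES = {'left': -1, 'center': 0, 'right': 1}

def expected_bias(probabilities, class_order):
    """Probability-weighted average of the bias scores for one prediction."""
    return sum(p * BIAS_SCORES[cls] for p, cls in zip(probabilities, class_order))

# Assumed column order: alphabetical, as produced by sorting the string labels.
class_order = ['center', 'left', 'right']

print(expected_bias([0.2, 0.7, 0.1], class_order))  # about -0.6: leans left
print(expected_bias([1.0, 0.0, 0.0], class_order))  # 0.0: center
```

A score near -1 or 1 then means the model is confident in 'left' or 'right', while a score near 0 can mean either a confident 'center' or an even split between left and right.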

In [403]:
# Same scores computed row by row; the loop variable is named `row` so it does not overwrite `pred`
p_scores = [sum(row) for row in preds]

In [404]:
min(p_scores)

0.0015632488939445466

In [411]:
np.average(a = pred, axis = 0)

array([0.1806113 , 0.38975903, 0.42962965], dtype=float32)

In [406]:
for features_tensor, target_tensor in training_data:
    print(f'features:{features_tensor} target:{target_tensor}')

features:b'Barack Obama has warned that the US will retaliate for Russian cyberattacks during the presidential election.\nIn an interview on National Public Radio on Friday morning, the US president said he is waiting for a final report he has ordered into a range of Russian hacking attacks, but promised there would be a response.\n\xe2\x80\x9cI think there is no doubt that when any foreign government tries to impact the integrity of our elections \xe2\x80\xa6 we need to take action,\xe2\x80\x9d Obama said. \xe2\x80\x9cAnd we will \xe2\x80\x93 at a time and place of our own choosing.\n\xe2\x80\x9cSome of it may be explicit and publicised; some of it may not be.\xe2\x80\x9d\nThe CIA has judged that the Russian cyber attacks, including the hacking of emails from the Democratic National Committee, were intended to influence the election in Donald Trump\xe2\x80\x99s favour, according to reports. The FBI agrees that there was Russian hacking but has not as yet concluded it was intended to f


