In [41]:
import pandas as pd
import numpy as np
import tensorflow as tf
import sklearn
import os
import json
import sklearn.preprocessing
from sklearn.model_selection import train_test_split

In [42]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [43]:
# References

# https://www.tensorflow.org/api_docs/python/tf/keras/Model
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

**by Brady Deyak**

**"Luke, I am your father" - Darth Vader (fun Star Wars reference)**

In [44]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [45]:
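# creates an instance of the VADER analyzer (the parentheses are needed to instantiate the class)
analyzer = SentimentIntensityAnalyzer()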

**by Brady Deyak**

Because of the nature of the project and the difficulty of getting specific data directories to work with the program, modeling the trend over time from the 1980s to the 2000s did not work as well as I had hoped. Therefore, to show the model and its results, I used the 2010s decade as the testing data.

In [46]:
drive_path = '/content/drive/MyDrive/data/disab/2010s/json'

In [47]:
data = []

**by Brady Deyak**

This searches the directory at the path above, including its subdirectories, and collects the .json files within.

In [48]:
from pathlib import Path
json_path = Path(drive_path)
json_paths = list(json_path.rglob('*.json'))
# prints the number of .json files found under the given directory
print(f'Found {len(json_paths)} json files')

Found 5 json files


**by Brady Deyak**

This iterates through the .json files and reads the data within, which in this case is the words from the .rtf files. The body of each document is then appended to a list holding all of the documents in the decade.

In [49]:
bodies = []
for f in json_paths:
  # opens each .json file and loads its contents; the with block closes the file afterwards
  with f.open() as fh:
    data = json.load(fh)
  # each entry in the file is one document; its 'body' (the document's words) is appended to the bodies list
  for d in data:
    bodies.append(d['body'])
print(f'Found {len(bodies)} bodies')

Found 499 bodies
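
The loading step above can be tried on a small self-contained example. This sketch assumes each .json file holds a list of records with a 'body' key, which is what the loop above implies. The temporary directory, the file names, the sample records, and the `load_bodies` helper are all made up for illustration.

```python
import json
import tempfile
from pathlib import Path

def load_bodies(root):
    """Collect the 'body' field of every record in every .json file under root."""
    bodies = []
    # sorted() makes the order deterministic; rglob also searches subdirectories
    for path in sorted(Path(root).rglob('*.json')):
        with path.open(encoding='utf-8') as fh:
            records = json.load(fh)
        for record in records:
            bodies.append(record['body'])
    return bodies

# hypothetical sample data: two files, three documents in total
with tempfile.TemporaryDirectory() as tmp:
    sample = {
        'a.json': [{'body': ['first', 'document']}, {'body': ['second', 'document']}],
        'b.json': [{'body': ['third', 'document']}],
    }
    for name, records in sample.items():
        (Path(tmp) / name).write_text(json.dumps(records), encoding='utf-8')
    demo_bodies = load_bodies(tmp)

print(f'Found {len(demo_bodies)} bodies')
```

Opening each file in a `with` block closes the file handle as soon as it has been read, which matters more as the number of files grows.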


**by Brady Deyak**

Each body (set of words) represents one document, and 499 documents make up the 2010s decade.

In [50]:
type(bodies)

list

**by Brady Deyak**

The VADER model goes through all of the words in each document and calculates the polarity scores (sentiment) of each document. I then created a DataFrame to store these sentiment scores.

In [51]:
sia = SentimentIntensityAnalyzer()
# joins the words of each document into one string and uses VADER's polarity_scores function to score its sentiment
sents = [sia.polarity_scores(' '.join(b)) for b in bodies]
# creates a new DataFrame with the sentiment scores
df = pd.DataFrame(sents)
df

Unnamed: 0,neg,neu,pos,compound
0,0.012,0.903,0.084,0.9645
1,0.147,0.768,0.085,-0.9968
2,0.028,0.840,0.132,0.9972
3,0.021,0.829,0.150,0.9976
4,0.000,0.831,0.169,0.9881
...,...,...,...,...
494,0.039,0.885,0.076,0.9999
495,0.029,0.825,0.146,0.9976
496,0.048,0.817,0.135,0.9961
497,0.006,0.865,0.130,0.9867
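
Many compound scores in the table above sit very close to 1 or -1. One likely reason is the way VADER normalizes the summed word valences of a text: to my understanding, the reference implementation maps a raw sum x to x / sqrt(x^2 + alpha) with alpha = 15, so long documents with many sentiment-bearing words saturate near the extremes. The sketch below reimplements only that formula. The function name and the raw sums are illustrative, and VADER applies other adjustments before this step.

```python
import math

def normalize_compound(raw_sum, alpha=15):
    """Squash a raw valence sum into [-1, 1] using x / sqrt(x^2 + alpha)."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # clamp in case of floating-point overshoot
    return max(-1.0, min(1.0, score))

# hypothetical raw sums: a short sentence versus increasingly long documents
for raw in (2.0, 20.0, 200.0, -200.0):
    print(raw, round(normalize_compound(raw), 4))
```

If this is the cause, scoring shorter units such as sentences or paragraphs and averaging them might spread the scores out more than scoring each whole document at once.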


**by Brady Deyak**

After running the VADER sentiment analyzer, the compound (overall) sentiment score of each document is used to determine its sentiment label: '1' for positive (a compound score of at least 0.5) and '0' for negative.

In [52]:
# collects the overall sentiment scores
sentiment = df['compound']
# a document is positive if its compound score is greater than or equal to the threshold and negative otherwise
threshold = 0.5
# creates new column in dataframe that takes the overall sentiment score and determines a sentiment label 1 (positive) or 0 (negative)
df['Sentiment Label'] = df['compound'].apply(lambda x: 1 if x >= threshold else 0)
# sorts the documents from the most negative to the most positive compound score
sortedDataFrame = df.sort_values(by='compound', ascending=True)
sortedDataFrame

Unnamed: 0,neg,neu,pos,compound,Sentiment Label
150,0.123,0.805,0.071,-0.9990,0
297,0.170,0.745,0.085,-0.9990,0
199,0.283,0.668,0.048,-0.9989,0
424,0.245,0.693,0.062,-0.9988,0
200,0.159,0.772,0.069,-0.9988,0
...,...,...,...,...,...
116,0.071,0.796,0.132,0.9999,1
287,0.061,0.815,0.124,0.9999,1
446,0.029,0.844,0.127,0.9999,1
389,0.048,0.848,0.104,1.0000,1
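
The labeling rule can also be written as a small function. This sketch uses the same threshold of 0.5 and a few compound scores from the tables above; the function name is made up, and the score of 0.3 is a hypothetical value added to show the edge of the rule.

```python
def label_sentiment(compound, threshold=0.5):
    """Return 1 (positive) if compound >= threshold, otherwise 0 (negative)."""
    return 1 if compound >= threshold else 0

# compound scores from the tables above, plus one hypothetical mildly positive score (0.3)
scores = [0.9645, -0.9968, 0.9972, -0.9990, 0.3]
labels = [label_sentiment(s) for s in scores]
print(labels)
```

Because the rule is binary, a mildly positive score such as 0.3 falls below the threshold and is labeled 0 together with the strongly negative documents.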


**by Brady Deyak**

This was the part of my section that I spent the most time experimenting with, as I am new to TensorFlow as a Python tool and some of the functions were ambiguous to me. After much experimenting, I believe that the settings of this model work with the results.

In [53]:
# initiates a TensorFlow Keras model with a linear stack of layers
model = tf.keras.Sequential()
# the Embedding layer maps non-negative integer indices (such as word IDs) to dense vectors
# The size of the vocabulary is set to 10000 on the assumption that the total number of distinct words is near that
# The embedding dimension is 128, so the output shape is (batch size, sequence length, 128)
model.add(tf.keras.layers.Embedding(10000, 128))
# LSTM is the Long Short-Term Memory layer
# It has 128 units, so its output shape is (batch size, 128)
# dropout sets the fraction of units to drop for the linear transformation of the inputs, in this case 2/10 or 1/5
# recurrent_dropout sets the fraction of units to drop for the linear transformation of the recurrent state, which in this case is also 2/10
model.add(tf.keras.layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2))
# Dense is a fully connected layer
# This hidden layer has 50 neurons
# The activation function is relu
model.add(tf.keras.layers.Dense(50, activation='relu'))
# The output layer is a single neuron, so the shape of the output is (batch size, 1)
# The sigmoid function returns a value between 0 and 1
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
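
To get a feel for the size of this architecture, the trainable parameters of each layer can be counted by hand. This sketch uses the standard parameter formulas for Embedding, LSTM, and Dense layers with the sizes from the cell above. The helper names are my own, and calling `model.summary()` on the built model would be the way to confirm the figures.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

counts = {
    'embedding': embedding_params(10000, 128),
    'lstm': lstm_params(128, 128),
    'dense_hidden': dense_params(128, 50),
    'dense_output': dense_params(50, 1),
}
total = sum(counts.values())
print(counts, total)
```

Most of the parameters sit in the Embedding layer, which is worth keeping in mind given that the data set has only 499 documents.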

In [54]:
# This compiles the model using the binary cross-entropy loss function, which measures the difference between the predicted and real labels
# The optimizer is adam, which minimizes the loss function through iterative training
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
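
As a worked example of what binary cross-entropy measures, the sketch below computes the loss by hand for a few made-up label and prediction pairs. The function is a plain-Python illustration rather than the Keras implementation, and the clipping constant `eps` is my own choice.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # clip so that log(0) is never evaluated
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# hypothetical labels and predicted probabilities
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
```

Confident correct predictions give a loss near 0, while confident wrong predictions are penalized heavily, which is what pushes the sigmoid output toward the true label during training.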

In [55]:
# This splits the inputs and the outputs into training and test sets
# The input is the compound sentiment scores from the DataFrame
# The output is the sentiment label from the DataFrame
# test_size=0.8 assigns 80% of the data to the test set and the remaining 20% to the training set
# random_state seeds the shuffle so that the same split is produced on every run
x_train, x_test, y_train, y_test = train_test_split(sentiment, df["Sentiment Label"], test_size=0.8, random_state=42)
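
The sizes this split produces can be worked out by hand. The sketch below assumes, to my understanding of scikit-learn's behavior, that a fractional `test_size` is rounded up and the remaining samples go to the training set. The helper name is made up.

```python
import math

def split_sizes(n_samples, test_size):
    """Train/test sizes for a fractional test_size when train_size is not given."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# the split used in this notebook: 499 documents with test_size=0.8
n_train, n_test = split_sizes(499, 0.8)
print(n_train, n_test)

# the more common arrangement, for comparison: test_size=0.2
print(split_sizes(499, 0.2))
```

With `test_size=0.8` the model trains on 99 documents and is tested on 400. A `test_size` of 0.2 would reverse those proportions and leave far more data for training.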

**by Brady Deyak**

This fits the model to the sentiment data for 5 epochs. The loss and accuracy scores are promising given the work that I was able to do with this model, although there is still much room for improvement.

In [56]:
# This fits the model to the input and output training data for a specific number of epochs
# The batch size is 48, which is the number of samples per gradient update
# The number of epochs is 5, meaning there are 5 passes over the training data
# validation_split=0.2 holds out 20% of the training data for validation
sentData = model.fit(x_train, y_train, batch_size=48, epochs=5, validation_split=0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
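
The numbers behind each epoch can be sketched with a little arithmetic. This assumes 99 training samples (20% of the 499 documents) and that Keras rounds the training portion down when applying `validation_split`, which is my understanding of its behavior. The helper name is hypothetical.

```python
import math

def fit_batches(n_samples, validation_split, batch_size):
    """Samples kept for training, held out for validation, and batches per epoch."""
    n_fit = math.floor(n_samples * (1.0 - validation_split))
    n_val = n_samples - n_fit
    steps_per_epoch = math.ceil(n_fit / batch_size)
    return n_fit, n_val, steps_per_epoch

n_fit, n_val, steps = fit_batches(99, 0.2, 48)
print(n_fit, n_val, steps)
```

Under these assumptions, each epoch would make only two gradient updates on 79 samples, so 5 epochs amount to about ten updates in total.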


In [57]:
x_train.sort_index()

1     -0.9968
13     0.9638
20     0.9869
21     0.9988
34     0.9873
        ...  
481    0.9990
487    0.9756
491    0.9552
495    0.9976
498    0.9984
Name: compound, Length: 99, dtype: float64

In [58]:
# The evaluate function runs the model on the input and output testing sets to determine the loss and accuracy
loss, accuracy = model.evaluate(x_test, y_test)



**by Brady Deyak**

Once the model is trained, it is used to predict the sentiment of the 2010s decade from the training inputs.

In [59]:
# predicts the sentiment of the input using the trained model
sentimentPrediction = model.predict(x_train)



**by Brady Deyak**

This compares the first prediction with the threshold to label the sentiment as positive or negative.

In [60]:
# labels the sentiment as positive or negative by comparing the first prediction (sentimentPrediction[0]) with the threshold
modelSentiment = "Positive" if sentimentPrediction[0] >= threshold else "Negative"
modelSentiment

'Positive'
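
The check above looks only at the first prediction. One possible way to summarize the whole set is to compare the mean predicted probability, or the share of predictions at or above the threshold, with a cutoff. The prediction values below are made up and the function name is hypothetical; this is a sketch of an alternative, not what the notebook ran.

```python
def summarize_predictions(predictions, threshold=0.5):
    """Label a set of predicted probabilities by its mean and by majority vote."""
    mean_prob = sum(predictions) / len(predictions)
    share_positive = sum(p >= threshold for p in predictions) / len(predictions)
    return {
        'mean_label': 'Positive' if mean_prob >= threshold else 'Negative',
        'majority_label': 'Positive' if share_positive >= 0.5 else 'Negative',
        'share_positive': share_positive,
    }

# hypothetical predicted probabilities for five documents
summary = summarize_predictions([0.91, 0.12, 0.88, 0.95, 0.40])
print(summary)
```

The output of `model.predict` has shape (n, 1), so the values would first need flattening, for example with `sentimentPrediction.ravel().tolist()`, before being passed to a function like this.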

**It appears that there was more positive sentiment towards disability in the 2010s. Yay!**

**by Brady Deyak**

I plan to continue working on this after the class and building up my portions into a fully working model. I was not able to get a fully working representation of the sentiment over time. However, I learned a lot about TensorFlow and other Python tools that are efficient for building neural networks. I think that I got a lot of value from this project and am looking forward to continuing it!