# Intro

Team: BofaBros

In [272]:
# Imports
import pandas as pd
import numpy as np
import plotly.express as px
import string
from collections import Counter
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, GRU
from tensorflow.keras import activations

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential
import keras_tuner as kt

import spacy
from spacy import displacy

import nltk

In [42]:
# Remove automatic formatting with symbol usage (Ex: $ sign -> MathJax)
pd.options.display.html.use_mathjax=False

In [261]:
# Read training data
train_data = pd.read_csv('../data/advanced_trainset.csv')
train_data.head()

Unnamed: 0,Sentence,Sentiment
0,According to the Finnish-Russian Chamber of Co...,neutral
1,The Swedish buyout firm has sold its remaining...,neutral
2,$SPY wouldn't be surprised to see a green close,positive
3,Shell's $70 Billion BG Deal Meets Shareholder ...,negative
4,SSH COMMUNICATIONS SECURITY CORP STOCK EXCHANG...,negative


In [44]:
# Read testing data
test_data = pd.read_csv('../data/advanced_testset.csv')
test_data.head()

Unnamed: 0,Sentence
0,Earnings per share ( EPS ) dropped to EUR 0.21...
1,$SONC Amazing run since middle of March - obvi...
2,"Ruukki Romania , the local arm of Finnish meta..."
3,Self-service and automation are in a bigger ro...
4,Alma Media 's operating profit amounted to EUR...


In [48]:
# Read supplementary stock ticker data
stocks = pd.read_csv('../data/stock_tickers.csv')
stocks.head()

Unnamed: 0,Symbol,Name,Last Sale,Net Change,% Change,Market Cap,Country,IPO Year,Volume,Sector,Industry
0,A,Agilent Technologies Inc. Common Stock,$134.87,-1.06,-0.78%,40476290000.0,United States,1999.0,2070939,Capital Goods,Electrical Products
1,AA,Alcoa Corporation Common Stock,$84.15,-1.95,-2.265%,15519010000.0,,2016.0,4585478,Basic Industries,Metal Fabrications
2,AAC,Ares Acquisition Corporation Class A Ordinary ...,$9.83,0.02,0.204%,1228750000.0,,2021.0,186747,Finance,Business Services
3,AACG,ATA Creativity Global American Depositary Shares,$1.21,-0.06,-4.724%,37966070.0,China,,7154,Miscellaneous,Service to the Health Industry
4,AACI,Armada Acquisition Corp. I Common Stock,$9.9781,0.1181,1.198%,206641500.0,United States,2021.0,174251,Consumer Durables,Consumer Electronics/Appliances


In [171]:
# Set spacy NLP English pipeline
nlp = spacy.load('en_core_web_sm')

# EDA

Look through the dataset for things that catch your eye. What proportion of responses are negative, positive, and neutral? Do you see any imbalances in the data? What else do you find? Please provide charts and visualizations to support your claim.

In [25]:
sentiment_counts = train_data['Sentiment'].value_counts().to_frame().reset_index()
sentiment_counts

Unnamed: 0,index,Sentiment
0,neutral,2363
1,positive,1383
2,negative,636


In [26]:
fig = px.bar(sentiment_counts, x='index', y='Sentiment', \
             title="Sentiment Counts in Training Data", labels={'index':'Sentiment', 'Sentiment': 'Count'})
fig.show()

As we see from this bar chart, there is a significant imbalance in the number of observations we have for neutral, positive, and negative sentences. This will mean... TODO: HERE

## Subject of Sentences

Another point of interest is to identify the subject of the sentence. This gives us an idea of what the sentiment is directed towards. For example, if the sentence is "AAPL is popping off," we would want to identify the sentiment as well as what the sentiment is directed towards. This process is a combination of EDA and feature engineering, so we will include visualizations here and the actual data manipulation in the **Feature Engineering** section.

TODO: talk about tokenization here

In [178]:
sent = train_data.loc[0]['Sentence']
doc=nlp(sent)
displacy.render(doc, style="dep")

In [179]:
displacy.render(doc, style="ent")

With this visualization, we can see the breakdown of the sentence and determine the subjects as well as the relations between different words. However, as we can see in this example, the spacy NLP processing is not quite able to identify complex sentence tokens such as the "Finnish-Russian Chamber of Commerce." Thus, we will have to select multiple groups to identify as subjects...

## Negative Sentences

In [106]:
# Convert Sentence series of negative sentiment into a string for EDA purposes
neg_words = train_data[train_data['Sentiment'] == 'negative']['Sentence'].str.cat(sep=' ')
neg_words = neg_words.split(' ')

In [107]:
# Grab as many words as possible while ignoring numbers, or incorrectly formatted words (preprocessing step)
neg_words = [word.strip().lower() for word in neg_words if not any(c for c in word.strip() if c not in string.ascii_letters + "'")]

In [109]:
Counter(neg_words).most_common()[:10]

[('the', 524),
 ('in', 346),
 ('of', 314),
 ('to', 287),
 ('eur', 228),
 ('a', 185),
 ('mn', 164),
 ('from', 151),
 ('and', 149),
 ('for', 125)]

As we can see from the top 10 most common words in negative sentiment, it's impossible to gauge distinct or important words that correlate with negative sentiment. In order to find the more important words, we can calculate the term frequency - inverse data frequency score for each word and identify highest weighted words.

TODO: discuss tf-idf formula and reasoning here.

In [501]:
string.ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [483]:
important_words = train_data.copy()

In [515]:
important_words['Cleaned Sentence'] = important_words['Sentence'].apply(lambda x: ' '.join([''.join([char for char in word if char in string.ascii_letters + "'"]) for word in x.strip().lower().split()]))

In [516]:
important_words

Unnamed: 0,Sentence,Sentiment,Cleaned Sentence
0,According to the Finnish-Russian Chamber of Co...,neutral,according to the finnishrussian chamber of com...
1,The Swedish buyout firm has sold its remaining...,neutral,the swedish buyout firm has sold its remaining...
2,$SPY wouldn't be surprised to see a green close,positive,spy wouldn't be surprised to see a green close
3,Shell's $70 Billion BG Deal Meets Shareholder ...,negative,shell's billion bg deal meets shareholder ske...
4,SSH COMMUNICATIONS SECURITY CORP STOCK EXCHANG...,negative,ssh communications security corp stock exchang...
...,...,...,...
4377,Investments in product development stood at 6....,neutral,investments in product development stood at m...
4378,HSBC Says Unit to Book $585 Million Charge on ...,negative,hsbc says unit to book million charge on sett...
4379,RISING costs have forced packaging producer Hu...,negative,rising costs have forced packaging producer hu...
4380,"In the building and home improvement trade , s...",neutral,in the building and home improvement trade sa...


In [517]:
vectorizer = TfidfVectorizer(use_idf=True, max_df=0.5,min_df=1, ngram_range=(1,3))
vectors = vectorizer.fit_transform(important_words['Cleaned Sentence'])

In [518]:
dict_of_tokens={i[1]:i[0] for i in vectorizer.vocabulary_.items()}

In [521]:
tfidf_vectors = []  # all vectors by tfidf
for row in vectors:
    tfidf_vectors.append({dict_of_tokens[column]:value for (column,value) in zip(row.indices,row.data)})

In [522]:
doc_sorted_tfidfs =[]  # list of doc features each with tfidf weight
#sort each dict of a document
for dn in tfidf_vectors:
    newD = sorted(dn.items(), key=lambda x: x[1], reverse=True)
    newD = dict(newD)
    doc_sorted_tfidfs.append(newD)

In [534]:
tfidf_kw = [] # get the keyphrases as a list of names without tfidf values
for doc_tfidf in doc_sorted_tfidfs:
    ll = list(doc_tfidf.keys())
    tfidf_kw.append(ll)

In [115]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(neg_words)
feature_names = vectorizer.get_feature_names()
dense = vectors.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names)

In [None]:
imp_words = []
for index in df.index:
    doc = df.iloc[index].to_dict()
    imp_words.append(dict(sorted(doc_0.items(), key=lambda item: item[1], reverse=True)).items[0])

In [554]:
imp_words

{'bg': 1.0,
 'aaland': 0.0,
 'ab': 0.0,
 'aberdeen': 0.0,
 'about': 0.0,
 'above': 0.0,
 'abp': 0.0,
 'acanb': 0.0,
 'acando': 0.0,
 'accommodation': 0.0,
 'according': 0.0,
 'account': 0.0,
 'accounting': 0.0,
 'accused': 0.0,
 'achieve': 0.0,
 'acknowledged': 0.0,
 'acquisition': 0.0,
 'acting': 0.0,
 'action': 0.0,
 'activities': 0.0,
 'activity': 0.0,
 'actual': 0.0,
 'actually': 0.0,
 'added': 0.0,
 'addition': 0.0,
 'additional': 0.0,
 'adds': 0.0,
 'adjusted': 0.0,
 'administration': 0.0,
 'administrators': 0.0,
 'adp': 0.0,
 'adpnews': 0.0,
 'adults': 0.0,
 'advert': 0.0,
 'advertising': 0.0,
 'advice': 0.0,
 'aero': 0.0,
 'affect': 0.0,
 'affected': 0.0,
 'affecting': 0.0,
 'affecto': 0.0,
 'affectogenimap': 0.0,
 'after': 0.0,
 'afternoon': 0.0,
 'ag': 0.0,
 'again': 0.0,
 'against': 0.0,
 'agencies': 0.0,
 'aggregate': 0.0,
 'ago': 0.0,
 'agree': 0.0,
 'agreed': 0.0,
 'agreement': 0.0,
 'ahead': 0.0,
 'aiming': 0.0,
 'air': 0.0,
 'airline': 0.0,
 'airspace': 0.0,
 'aker': 0.

# Feature Engineering

Do you need to make any changes to “Sentence” to make it more digestible for your model? Will you make any restrictions to your sample? Even if you don’t choose to make any changes to the data, please describe your reasoning.

We want to map the sentences to a specific stock, market, or even country. We'll be attempting to create a category that contains information on the `subject` of the sentence.

To do this, we will use spacy's token labels to identify proper nouns or subjects of each sentence.

**Note: We weren't able to incorporate this in our model, but we believe this feature would help us address the prompt in the future.**

In [190]:
def get_subject(sent):
    '''
    Tokenizes and identifies the subject of the sentence using spacy's English pipeline.
    '''
    doc=nlp(sent)
    sub_toks = [tok for tok in doc if (tok.dep_ == "nsubj" or tok.pos_ == "PROPN")]
    return sub_toks

In [191]:
with_subject = train_data.copy()

Unnamed: 0,Sentence,Sentiment
0,According to the Finnish-Russian Chamber of Co...,neutral
1,The Swedish buyout firm has sold its remaining...,neutral
2,$SPY wouldn't be surprised to see a green close,positive
3,Shell's $70 Billion BG Deal Meets Shareholder ...,negative
4,SSH COMMUNICATIONS SECURITY CORP STOCK EXCHANG...,negative
...,...,...
4377,Investments in product development stood at 6....,neutral
4378,HSBC Says Unit to Book $585 Million Charge on ...,negative
4379,RISING costs have forced packaging producer Hu...,negative
4380,"In the building and home improvement trade , s...",neutral


In [192]:
with_subject['Subject'] = with_subject['Sentence'].apply(get_subject)

In [193]:
with_subject.head()

Unnamed: 0,Sentence,Sentiment,Subject
0,According to the Finnish-Russian Chamber of Co...,neutral,"[Chamber, Commerce, companies, Finland, Russia]"
1,The Swedish buyout firm has sold its remaining...,neutral,"[firm, Finland]"
2,$SPY wouldn't be surprised to see a green close,positive,[SPY]
3,Shell's $70 Billion BG Deal Meets Shareholder ...,negative,"[Shell, BG, Shareholder, Skepticism]"
4,SSH COMMUNICATIONS SECURITY CORP STOCK EXCHANG...,negative,"[SSH, COMMUNICATIONS, SECURITY, CORP, STOCK, E..."
...,...,...,...
4377,Investments in product development stood at 6....,neutral,[Investments]
4378,HSBC Says Unit to Book $585 Million Charge on ...,negative,"[HSBC, Unit, Book]"
4379,RISING costs have forced packaging producer Hu...,negative,"[RISING, costs, Huhtamaki, Hampshire]"
4380,"In the building and home improvement trade , s...",neutral,"[sales, EUR, mn]"


Now, we want to clean `Sentences` in a way that enables accurate tokenization. To do this, we use Spacy, string, and Natural Language Toolkit tokenize function.

In [700]:
undesired = string.punctuation.replace('-', '')
def punc_clean(text):
    a=[w for w in text if w not in undesired]
    return ''.join(a)

In [701]:
def remove_stopwords(text):
    stopword = nltk.corpus.stopwords.words('english')
    stopword.remove('not')
    a=[w for w in nltk.word_tokenize(text) if w not in stopword]
    return ' '.join(a)

In [702]:
cleaned_dataset = train_data.copy()

In [703]:
cleaned_dataset['Sentence'] = cleaned_dataset['Sentence'].apply(punc_clean)
cleaned_dataset['Sentence'] = cleaned_dataset['Sentence'].apply(remove_stopwords).str.lower()

# Model Building

Create a NLP model that uses the “Sentence” as an input, using “Sentiment”  as labels. Ideally, you will compare the results of several different models to find the optimal choice. What led you to choose your final model? Did you run into any roadblocks? Please describe your process in depth. Make sure to train your model on the training set only.

The process we are using to build the model involves setting a baseline model using a quick and easy approach of preprocessing the data in a way that improves tokenization of the sentences. To do this, we used the Natural Language Toolkit package to clean and tokenize the `Sentence` data. Then, we used Scikit-Learn's TF-IDF Vectorizer to transform the `Sentence` data into a sparse matrix with dimensions matching the overall vocab count. After this, a simple Logistic Regression classifier is run on the data to perform multi-class classification for the three 

## NLTK and Scikit-Learn Model

In [704]:
x_train, x_test, y_train, y_test = train_test_split(cleaned_dataset['Sentence'], cleaned_dataset['Sentiment'], test_size=0.25)

In [705]:
vector = TfidfVectorizer(ngram_range=(1,2),min_df=1) # TODO: ADJUST PARAMS
vector.fit(x_train)
vect_X = vector.transform(x_train)

In [706]:
vect_x_test = vector.transform(x_test)

In [707]:
model = LogisticRegression()
clf = model.fit(vect_X, y_train)

In [708]:
preds = clf.predict(vect_x_test)

In [709]:
accuracy = accuracy_score(preds, y_test)
f1_score = f1_score(preds, y_test, average='macro')

print(f'accuracy: {accuracy}, f1_score: {f1_score}')

accuracy: 0.6788321167883211, f1_score: 0.515463002636719


## Tensorflow

In [631]:
tf_dataset = train_data.copy()

In [632]:
tf_dataset['Sentence'] = tf_dataset['Sentence'].apply(punc_clean)
tf_dataset['Sentence'] = tf_dataset['Sentence'].apply(remove_stopwords).str.lower()

In [633]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(tf_dataset['Sentence'])

In [634]:
X = tokenizer.texts_to_sequences(tf_dataset['Sentence'])

# pad to same length
X = pad_sequences(X, maxlen=pd.Series(X).apply(len).max())
X.shape

(4382, 48)

In [635]:
y = tf_dataset['Sentiment'].replace(['negative', 'neutral', 'positive'],
                        [0, 1, 2]).to_numpy()

In [636]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [637]:
max_seq_len = tf_dataset['Sentence'].apply(len).max()
emb_dim = 250
cell_dim = 128
num_classes = 3
vocab_size = len(tokenizer.word_index) + 1
penalty = 0.01

### LSTM layers only

In [662]:
class save_weights(keras.callbacks.Callback):
    """Callback to save weights that maximize test accuracy"""
    
    def __init__(self):
        super(save_weights, self).__init__()
        
        self.test_accuracy = []
        
        self.best = {"Weights": None, "acc": float("-inf")}
        
        
    def on_epoch_end(self, epoch, logs=None):
        
        
        # evaluate model loss and accuracy and update best based on evaluation
        loss, acc = self.model.evaluate(x_test, y_test, verbose=False)
        
        if acc > self.best["acc"]:
            self.best["Weights"] = self.model.get_weights()
            self.best["acc"] = acc
            
        self.test_accuracy.append(acc)

In [663]:
save = save_weights()

In [664]:
def lstm_builder(hp):
    l2 = keras.regularizers.l2(penalty)
    model = Sequential()
    model.add(Embedding(vocab_size, emb_dim))
    
    # Tune the dropout rate
    hp_dropout = hp.Choice('dropout', values=[0.1, 0.2, 0.3])
    hp_re_dropout = hp.Choice('recurrent_dropout', values=[0.1, 0.2, 0.3])
    
    model.add(LSTM(cell_dim, return_sequences=True, dropout=hp_dropout, recurrent_dropout=hp_re_dropout))
    model.add(LSTM(cell_dim))
    model.add(Dense(num_classes, activation="softmax", kernel_regularizer=l2)) # try relu?

    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model

In [665]:
lstm_tuner = kt.Hyperband(lstm_builder,
                     objective='val_accuracy',
                     max_epochs=20,
                     factor=3,
                     directory='lstm',
                     project_name='DataHacks2022_tuning')

INFO:tensorflow:Reloading Oracle from existing project lstm\DataHacks2022_tuning\oracle.json


In [666]:
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

In [667]:
lstm_tuner.search(x_train, y_train, epochs=20, validation_split=0.2, callbacks=[stop_early])

# Get the optimal hyperparameters
best_hps = lstm_tuner.get_best_hyperparameters(num_trials=1)[0]

Trial 9 Complete [00h 00m 45s]
val_accuracy: 0.6261398196220398

Best val_accuracy So Far: 0.6352583765983582
Total elapsed time: 00h 06m 48s
INFO:tensorflow:Oracle triggered exit


In [668]:
model = model_builder(best_hps)

In [669]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [670]:
history = model.fit(x_train, y_train, epochs=20, batch_size=64, callbacks=[save])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [694]:
print("Using weights after full training:")
model.evaluate(x_test, y_test)
print("Using best weights from tf2.0 callback:")
model.set_weights(save.best["Weights"])
model.evaluate(x_test, y_test)

Using weights after full training:
Using best weights from tf2.0 callback:


ValueError: Layer sequential_1 weight shape (128, 384) is not compatible with provided weight shape (128, 512).

In [697]:
model.predict(test_data)

UnimplementedError: Graph execution error:

Detected at node 'sequential_1/Cast' defined at (most recent call last):
    File "C:\Users\ericw\anaconda3\lib\runpy.py", line 194, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "C:\Users\ericw\anaconda3\lib\runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "C:\Users\ericw\anaconda3\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
      app.launch_new_instance()
    File "C:\Users\ericw\anaconda3\lib\site-packages\traitlets\config\application.py", line 846, in launch_instance
      app.start()
    File "C:\Users\ericw\anaconda3\lib\site-packages\ipykernel\kernelapp.py", line 677, in start
      self.io_loop.start()
    File "C:\Users\ericw\anaconda3\lib\site-packages\tornado\platform\asyncio.py", line 199, in start
      self.asyncio_loop.run_forever()
    File "C:\Users\ericw\anaconda3\lib\asyncio\base_events.py", line 570, in run_forever
      self._run_once()
    File "C:\Users\ericw\anaconda3\lib\asyncio\base_events.py", line 1859, in _run_once
      handle._run()
    File "C:\Users\ericw\anaconda3\lib\asyncio\events.py", line 81, in _run
      self._context.run(self._callback, *self._args)
    File "C:\Users\ericw\anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 457, in dispatch_queue
      await self.process_one()
    File "C:\Users\ericw\anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 446, in process_one
      await dispatch(*args)
    File "C:\Users\ericw\anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 353, in dispatch_shell
      await result
    File "C:\Users\ericw\anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 648, in execute_request
      reply_content = await reply_content
    File "C:\Users\ericw\anaconda3\lib\site-packages\ipykernel\ipkernel.py", line 353, in do_execute
      res = shell.run_cell(code, store_history=store_history, silent=silent)
    File "C:\Users\ericw\anaconda3\lib\site-packages\ipykernel\zmqshell.py", line 533, in run_cell
      return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
    File "C:\Users\ericw\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2898, in run_cell
      result = self._run_cell(
    File "C:\Users\ericw\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2944, in _run_cell
      return runner(coro)
    File "C:\Users\ericw\anaconda3\lib\site-packages\IPython\core\async_helpers.py", line 68, in _pseudo_sync_runner
      coro.send(None)
    File "C:\Users\ericw\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3169, in run_cell_async
      has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
    File "C:\Users\ericw\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3361, in run_ast_nodes
      if (await self.run_code(code, result,  async_=asy)):
    File "C:\Users\ericw\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3441, in run_code
      exec(code_obj, self.user_global_ns, self.user_ns)
    File "C:\Users\ericw\AppData\Local\Temp/ipykernel_18176/1543604611.py", line 1, in <module>
      model.predict(test_data)
    File "C:\Users\ericw\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\ericw\anaconda3\lib\site-packages\keras\engine\training.py", line 1982, in predict
      tmp_batch_outputs = self.predict_function(iterator)
    File "C:\Users\ericw\anaconda3\lib\site-packages\keras\engine\training.py", line 1801, in predict_function
      return step_function(self, iterator)
    File "C:\Users\ericw\anaconda3\lib\site-packages\keras\engine\training.py", line 1790, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Users\ericw\anaconda3\lib\site-packages\keras\engine\training.py", line 1783, in run_step
      outputs = model.predict_step(data)
    File "C:\Users\ericw\anaconda3\lib\site-packages\keras\engine\training.py", line 1751, in predict_step
      return self(x, training=False)
    File "C:\Users\ericw\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\ericw\anaconda3\lib\site-packages\keras\engine\base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Users\ericw\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\ericw\anaconda3\lib\site-packages\keras\engine\sequential.py", line 374, in call
      return super(Sequential, self).call(inputs, training=training, mask=mask)
    File "C:\Users\ericw\anaconda3\lib\site-packages\keras\engine\functional.py", line 451, in call
      return self._run_internal_graph(
    File "C:\Users\ericw\anaconda3\lib\site-packages\keras\engine\functional.py", line 571, in _run_internal_graph
      y = self._conform_to_reference_input(y, ref_input=x)
    File "C:\Users\ericw\anaconda3\lib\site-packages\keras\engine\functional.py", line 671, in _conform_to_reference_input
      tensor = tf.cast(tensor, dtype=ref_input.dtype)
Node: 'sequential_1/Cast'
Cast string to float is not supported
	 [[{{node sequential_1/Cast}}]] [Op:__inference_predict_function_820759]

### GRU layers only

In [663]:
save = save_weights()

In [673]:
def gru_builder(hp):
    l2 = keras.regularizers.l2(penalty)
    model = Sequential()
    model.add(Embedding(vocab_size, emb_dim))
    
    # Tune the dropout rate
    hp_dropout = hp.Choice('dropout', values=[0.1, 0.2, 0.3])
    hp_re_dropout = hp.Choice('recurrent_dropout', values=[0.1, 0.2, 0.3])
    
    # Tune the activation function
    hp_activation = hp.Choice('activation', values=['tanh', 'relu'])
    
    model.add(GRU(cell_dim, activation=hp_activation, return_sequences=True, dropout=hp_dropout, recurrent_dropout=hp_re_dropout))
    model.add(GRU(cell_dim, activation=hp_activation, dropout=hp_dropout, recurrent_dropout=hp_re_dropout))
    model.add(Dense(num_classes, activation="softmax", kernel_regularizer=l2)) # try relu?

    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model

In [674]:
gru_tuner = kt.Hyperband(gru_builder,
                     objective='val_accuracy',
                     max_epochs=20,
                     factor=3,
                     directory='gru',
                     project_name='DataHacks2022_tuning')

In [675]:
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

In [676]:
gru_tuner.search(x_train, y_train, epochs=20, validation_split=0.2, callbacks=[stop_early])

# Get the optimal hyperparameters
best_hps = gru_tuner.get_best_hyperparameters(num_trials=1)[0]

Trial 23 Complete [00h 01m 58s]
val_accuracy: 0.6337385773658752

Best val_accuracy So Far: 0.6413373947143555
Total elapsed time: 00h 32m 07s
INFO:tensorflow:Oracle triggered exit


In [677]:
model = gru_builder(best_hps)

In [678]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [679]:
history = model.fit(x_train, y_train, epochs=20, batch_size=64, callbacks=[save])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [681]:
print("Using weights after full training:")
model.evaluate(x_test, y_test)
print("Using best weights from tf2.0 callback:")
model.set_weights(save.best["Weights"])
model.evaluate(x_test, y_test)

Using weights after full training:
Using best weights from tf2.0 callback:


ValueError: Layer sequential_1 weight shape (250, 384) is not compatible with provided weight shape (250, 512).

### LSTM and GRU layers combined

In [685]:
def lstm_gru_builder(hp):
    l2 = keras.regularizers.l2(penalty)
    model = Sequential()
    model.add(Embedding(vocab_size, emb_dim))
    
    # Tune the dropout rate
    hp_dropout = hp.Choice('dropout', values=[0.1, 0.2, 0.3])
    hp_re_dropout = hp.Choice('recurrent_dropout', values=[0.1, 0.2, 0.3])
    
    # Tune the activation function
    hp_activation = hp.Choice('activation', values=['tanh', 'relu'])
    
    model.add(LSTM(cell_dim, return_sequences=True, dropout=hp_dropout, recurrent_dropout=hp_re_dropout))
    model.add(GRU(cell_dim, activation=hp_activation, dropout=hp_dropout, recurrent_dropout=hp_re_dropout))
    model.add(Dense(num_classes, activation="softmax", kernel_regularizer=l2))


    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model

In [686]:
lstm_gru_save = save_weights()

In [687]:
lstm_gru_tuner = kt.Hyperband(lstm_gru_builder,
                     objective='val_accuracy',
                     max_epochs=20,
                     factor=3,
                     directory='lstm_gru',
                     project_name='DataHacks2022_tuning')

In [688]:
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

In [689]:
lstm_gru_tuner.search(x_train, y_train, epochs=20, validation_split=0.2, callbacks=[stop_early])

# Get the optimal hyperparameters
best_hps = lstm_gru_tuner.get_best_hyperparameters(num_trials=1)[0]

Trial 15 Complete [00h 01m 15s]
val_accuracy: 0.6413373947143555

Best val_accuracy So Far: 0.6413373947143555
Total elapsed time: 00h 15m 47s

Search: Running Trial #16

Value             |Best Value So Far |Hyperparameter
0.3               |0.3               |dropout
0.1               |0.3               |recurrent_dropout
tanh              |relu              |activation
7                 |7                 |tuner/epochs
3                 |3                 |tuner/initial_epoch
2                 |2                 |tuner/bracket
1                 |1                 |tuner/round
0009              |0007              |tuner/trial_id

Epoch 4/7
Epoch 5/7
Epoch 6/7
18/83 [=====>........................] - ETA: 13s - loss: 0.3396 - accuracy: 0.8785

KeyboardInterrupt: 

In [690]:
model = lstm_gru_builder(best_hps)

In [691]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [692]:
history = model.fit(x_train, y_train, epochs=20, batch_size=64, callbacks=[save])

Epoch 1/20
 6/52 [==>...........................] - ETA: 34s - loss: 1.1230 - accuracy: 0.4479

KeyboardInterrupt: 

In [693]:
print("Using weights after full training:")
model.evaluate(x_test, y_test)
print("Using best weights from tf2.0 callback:")
model.set_weights(lstm_gru_save.best["Weights"])
model.evaluate(x_test, y_test)

Using weights after full training:
Using best weights from tf2.0 callback:


TypeError: object of type 'NoneType' has no len()

# Model Testing

Please report the performance of your model on the training set. How does your model perform? Please report your accuracy and F1 score. Also, using the test set, please provide a CSV of your predicted values for “Sentiment” with your submission.

In [724]:
# Performance was done up above in Model Building section since it was easier to adjust model parameters with it up there.

In [710]:
cleaned_test = test_data.copy()

In [711]:
cleaned_test['Sentence'] = cleaned_test['Sentence'].apply(punc_clean)
cleaned_test['Sentence'] = cleaned_test['Sentence'].apply(remove_stopwords).str.lower()

In [713]:
test_input = vector.transform(cleaned_test['Sentence'])

In [714]:
test_predictions = clf.predict(test_input)

In [722]:
pd.Series(test_predictions).to_frame().to_csv('AdvancedTrack_BofaBros_predictions.csv', header=False,index=False)

# References

https://www.nasdaq.com/market-activity/stocks/screener