# Spark 64 Test
This is my submission for the Spark 64 Natural Language Processing technical test.

I used a pre-trained model based on BERT, taken from the Hugging Face (🤗) model repository. To run the notebook, ***simply drag-and-drop the .csv file that you wish to analyse and press Runtime > Run all***. Appended at the bottom of the notebook is a summary of my work and my key findings from the sample dataset. I recommend reading through the 'Findings and Final Thoughts' chapter whilst the notebook is running.

Please input the name of the .csv file (with the extension) in the cell below before running (it is set to "support_tweets.csv" by default). If no name is given, the first .csv file found in the root directory will be analysed.

In [None]:
# Read in files.
file_dir = 'support_tweets.csv' #@param {type:"string"}

# Imports, File I/O and Configurations

## Install Modules
Install transformers to initialise the pre-trained NLP model pipeline.

In [None]:
%%capture
# Install necessary modules (not pre-loaded on Colab).
# Note: the %%capture cell magic must be the first line of the cell.
!pip install transformers

## Imports
Import necessary modules to run the code.

In [None]:
# Import necessary modules
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures
from transformers import AutoModel, AutoTokenizer, pipeline
from collections import Counter

import tensorflow as tf
import pandas as pd
import numpy as np
import multiprocessing
import plotly.express as px

from time import time, sleep
from glob import glob
from threading import Thread
from itertools import repeat

## File I/O
Read in the appropriate files.

Please provide the filename/directory to the file in the form on the right and click run.

In [None]:
if file_dir == '':
    file_dir = glob('*.csv')
    tweets = pd.read_csv(file_dir[0])
else:
    tweets = pd.read_csv(file_dir)
tweets.isnull().values.any()
tweets = tweets.values.tolist()
len(tweets)

## Preprocessing
Preprocess the input tweets dataset: lower-case all letters and remove excess punctuation.

In [None]:
# Convert input .csv file data into a usable format
tweets_temp = np.array(tweets)
# Lower case all characters (only the first column of each row is used)
master_string = " ".join([str(tweet[0]).lower() for tweet in tweets_temp.tolist()])
# Remove punctuation except for '@' and apostrophe characters
master_string = master_string.translate(str.maketrans('', '', '!"#$%&/()*+,-./:;<=>?[\\]^_`{|}~')).split()
unique_words = list(set(master_string))
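
As a small illustration of the preprocessing step above, the sketch below applies the same lower-casing and `str.translate` punctuation removal to a made-up tweet. The sample text, the `PUNCTUATION` constant and the `tokenize` helper are assumptions for illustration only and are not part of the notebook. Because '@' and apostrophes are not in the removal set, handles and contractions survive.

```python
# Characters to strip; '@' and apostrophes are deliberately absent.
PUNCTUATION = '!"#$%&/()*+,-./:;<=>?[\\]^_`{|}~'

def tokenize(text: str) -> list:
    """Lower-case text, drop punctuation, and split on whitespace."""
    table = str.maketrans('', '', PUNCTUATION)
    return text.lower().translate(table).split()

# Made-up example tweet (not taken from the dataset).
sample = "@AppleSupport My phone won't charge!!! #help"
tokens = tokenize(sample)
print(tokens)
```

Note that the hashtag marker is removed while the word after it is kept, so '#help' and 'help' are counted as the same word.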

## Retrieve the most common word(s)
As we do not need to do sentiment analysis to compute the number of each word in the .csv file, we can find the most common words immediately after string pre-processing.

Input the number of top words you wish to retrieve from the dataset in the form on the right.

In [None]:
# Define how many of the most common words you
# wish to see
number_occurances = 3 #@param {type:"number"}

In [None]:
# Pass the master_string list to an instance of the Counter class.
counter = Counter(master_string)
  
# Output the most common words
most_occur = counter.most_common(int(number_occurances))

# Print the 'number_occurances' amount of most frequent words.
print('The %i most frequent word(s) were(is):'%(number_occurances))
for pair in most_occur:
    print('\'%s\', appearing %i times'% (pair[0], pair[1]))

# BERT Model
Create a pipeline from a pre-trained BERT NLP sentiment analysis model.

## Initialise BERT Model
Run the cell below to create the sentiment analysis polarity model pipeline.

In [None]:
# Create a pipeline for sentiment analysis using the pre-trained DistilBERT (SST-2) model
sentiment_analysis = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

# Average sentiment of tweets
Average sentiment of tweets containing a certain word/handle/string.
This is currently written specifically for the outputs of the `distilbert-base-uncased-finetuned-sst-2-english` pre-trained model.

In [None]:
def average_sentiment_word(word:str, tweets_all:list) -> tuple:
    """ 
        Given a word and a list of strings, output the average sentiment of tweets
        containing the word. The model will use the 'sentiment_analysis' pre-trained
        model pipeline.
        Author: Jae Min Seo
        Inputs:
        word: str
            A Python string for which we will find the average sentiment of in the
            given strings in the tweets_all list of strings.
        tweets_all: list
            A list containing all of the tweets/strings that we will analyse to
            gauge/calculate the average sentiment. 
        Outputs:
        word: str
            The same string as the 'word' input. Repeated for ease of implementation
            with the multiprocessing processes used to speed up code.
        sentiment: float
            A float containing the average sentiment of the tweets containing the
            'word' string. Values in (0,1] have net average positive polarities 
            whilst values in [-1,0) have a net average negative polarity.
            Returns 'None' if the word is not contained in any of the strings in
            'tweets_all'.
    """
    # Ensure that word is lower cased
    word = word.lower()
    sentiment = []
    for tweet in tweets_all:
        tweet = str(tweet).lower()
        if word in tweet:
            result = sentiment_analysis(tweet)
            label = result[0]['label']
            score = result[0]['score']
            # Append results to sentiment list
            if label == 'POSITIVE':
                sentiment.append(score)
            else:
                sentiment.append(-1*score)
    # Account for the case where the word does not exist in the tweets_all list
    if not sentiment:
        return word, None
    return word, sum(sentiment)/len(sentiment)
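
The scoring rule inside `average_sentiment_word()` can be isolated and checked without loading the model. The sketch below is a minimal illustration under that assumption: `signed_score` and `mean_or_none` are hypothetical helper names, and the result dictionaries are hand-written stand-ins that only mimic the `label`/`score` shape returned by the pipeline.

```python
def signed_score(result: dict) -> float:
    """Map one pipeline-style result to [-1, 1]: POSITIVE labels keep
    their score, any other label has its score negated."""
    score = float(result['score'])
    return score if result['label'] == 'POSITIVE' else -score

def mean_or_none(values: list):
    """Return the arithmetic mean, or None for an empty list."""
    return sum(values) / len(values) if values else None

# Hand-written stand-ins for pipeline outputs (not real model scores).
fake_results = [
    {'label': 'POSITIVE', 'score': 0.9},
    {'label': 'NEGATIVE', 'score': 0.7},
    {'label': 'NEGATIVE', 'score': 0.5},
]
signed = [signed_score(r) for r in fake_results]
average = mean_or_none(signed)
print(signed, average)
```

One confident positive tweet is outweighed here by two moderately negative ones, which is the same averaging behaviour that drives the per-word results later in the notebook.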

## Filter out words that appear fewer than a specific number of times
Define the minimum number of occurrences a word must have in the .csv file before it is analysed by the NLP pipeline.

In [None]:
minimum_occurances = 3 #@param {type:"number"}

In [None]:
# Filter out all words that appear less than 'minimum_occurances' number of times
filtered_unique_words = [unique_word for unique_word in unique_words
                         if master_string.count(unique_word) >= minimum_occurances]
print('There are %i unique words that occur more than %i times.'%(len(filtered_unique_words), minimum_occurances))
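
`master_string.count(...)` rescans the full word list once per unique word. A possible alternative, sketched below on a made-up word list, is to count every word once with `collections.Counter` and then filter on the stored counts. The names `frequent_words`, `words` and `min_count` are placeholders rather than variables from this notebook, and the result is sorted only to make the output reproducible.

```python
from collections import Counter

def frequent_words(words: list, min_count: int) -> list:
    """Return the unique words that occur at least min_count times."""
    counts = Counter(words)  # single pass over the word list
    return sorted(word for word, n in counts.items() if n >= min_count)

# Made-up word list for illustration.
words = ['the', 'app', 'the', 'fix', 'app', 'the', '@tesco']
print(frequent_words(words, 2))
```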

## The average polarity of tweets containing certain words.
This cell will attempt to compute the average sentiments of tweets containing a word that is passed into the `average_sentiment_word()` function.

Note that this process may take a while (around 3 minutes with the sample dataset). Every filtered word is checked against every tweet, so the work grows roughly as $O(W \times T)$ for $W$ words and $T$ tweets, which approaches $O(n^2)$ as the vocabulary grows with the size of the dataset.


In [None]:
# Iterate through each word to find the average sentiment of tweets containing each word
# that appears at least 'minimum_occurances' times in the dataset of tweets
num_words = len(filtered_unique_words)
# The context manager releases the worker processes once the map has finished
with multiprocessing.Pool(processes = 8) as pool:
    results = pool.starmap(average_sentiment_word, zip(filtered_unique_words, repeat(tweets)))
results = np.array(results).reshape(num_words,2)
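
Because `average_sentiment_word()` calls the pipeline again for every (word, tweet) match, the same tweet may be scored many times. One possible way to cut the cost, sketched below under the assumption that a tweet's score does not depend on which word is being queried, is to score each tweet once and reuse the stored score for every word. `average_sentiments` is a hypothetical helper, `score_fn` stands in for a function that would wrap `sentiment_analysis` and return a signed score, and the toy scorer is not a real model.

```python
def average_sentiments(words: list, tweets: list, score_fn) -> dict:
    """Average the signed score of the tweets containing each word.

    Each tweet is scored exactly once, so the expensive call runs
    len(tweets) times instead of once per (word, tweet) match.
    """
    lowered = [str(tweet).lower() for tweet in tweets]
    scores = [score_fn(tweet) for tweet in lowered]  # one call per tweet
    averages = {}
    for word in words:
        word = word.lower()
        matched = [s for t, s in zip(lowered, scores) if word in t]
        averages[word] = sum(matched) / len(matched) if matched else None
    return averages

# Toy scorer for demonstration only: +1 if 'great' appears, else -1.
def toy_score(tweet: str) -> float:
    return 1.0 if 'great' in tweet else -1.0

tweets_demo = ['Great service', 'App is broken', 'great app']
result = average_sentiments(['app', 'great', 'refund'], tweets_demo, toy_score)
print(result)
```

The substring checks are still proportional to the number of words times the number of tweets, but those are cheap compared with a model call.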

## Getting the top few most polarised/divisive words
Define how many words from each category (positive, negative, divisive, i.e. closest to neutral) you wish to see from our exploration of the tweets dataset.

Note that handles have been separated from normal words (handles can be processed in following cells).

In [None]:
top_occurances = 3 #@param {type:"number"}

Run the cell below to see the top polarity results for words only.

In [None]:
# Remove all None occurrences and handles
results_filtered = np.array([result for result in results if (result[1] and '@' not in result[0])])
# Ensure results and polarities have the same shape
polarities = [float(result) for result in results_filtered[:,1]]
results_filtered = results_filtered.tolist()

print('The most positive words appearing more than %i times were:'%(minimum_occurances))
for i in range(top_occurances):
    # Descriptive names avoid shadowing the built-in max() and min()
    index = np.argmax(polarities)
    most_positive = results_filtered.pop(index)
    polarities.pop(index)
    print("\'%s\' with a score of: %.4f"%(most_positive[0], float(most_positive[1])))

print('\nThe most negative words appearing more than %i times were:'%(minimum_occurances))
for i in range(top_occurances):
    index = np.argmin(polarities)
    most_negative = results_filtered.pop(index)
    polarities.pop(index)
    print("\'%s\' with a score of: %.4f"%(most_negative[0], float(most_negative[1])))

abs_polarities = [abs(val) for val in polarities]
print('\nThe most divisive words appearing more than %i times were:'%(minimum_occurances))
for i in range(top_occurances):
    div = results_filtered.pop(np.argmin(abs_polarities))
    abs_polarities.pop(np.argmin(abs_polarities))
    print("\'%s\' with a score of: %.4f"%(div[0], float(div[1])))

Run the cell below to see the top polarity results for handles only.

In [None]:
# Remove all None scores
results = np.array([result for result in results if result[1]])
# Get an array of only handles (words including '@')
handles = [word for word in results if '@' in word[0]]
handles_polarity = [float(handle[1]) for handle in handles]

print('The most positive tweets included the handles:')
for i in range(top_occurances):
    # Retrieve the most positive handle
    index = np.argmax(np.array(handles_polarity))
    most_positive = handles.pop(index)
    handles_polarity.pop(index)
    print("\'%s\' with a score of: %.4f"%(most_positive[0], float(most_positive[1])))

print('\nThe most negative tweets included the handles:')
for i in range(top_occurances):
    # Retrieve the most negative handle
    index = np.argmin(np.array(handles_polarity))
    most_negative = handles.pop(index)
    handles_polarity.pop(index)
    print("\'%s\' with a score of: %.4f"%(most_negative[0], float(most_negative[1])))

abs_handles_polarity = [abs(val) for val in handles_polarity]
print('\nThe most divisive tweets included the handles:')
for i in range(top_occurances):
    # Retrieve the handle whose score is closest to zero
    index = np.argmin(np.array(abs_handles_polarity))
    div = handles.pop(index)
    abs_handles_polarity.pop(index)
    print("\'%s\' with a score of: %.4f"%(div[0], float(div[1])))
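
The repeated `argmax`/`pop` loops above can also be expressed with sorting. The sketch below is one possible formulation on made-up (word, score) pairs; `top_k` and the sample scores are illustrative and are not output from the model. 'Divisive' follows the notebook's usage, meaning the scores closest to zero. Unlike the cells above, the three rankings here are computed independently, so a word may appear in more than one list.

```python
def top_k(pairs: list, k: int) -> dict:
    """Rank (word, score) pairs three ways and keep k from each."""
    by_score = sorted(pairs, key=lambda pair: pair[1])
    by_magnitude = sorted(pairs, key=lambda pair: abs(pair[1]))
    return {
        'positive': by_score[::-1][:k],   # highest scores first
        'negative': by_score[:k],         # lowest scores first
        'divisive': by_magnitude[:k],     # closest to zero first
    }

# Made-up scores for illustration (not real model output).
pairs = [('great', 0.99), ('fix', -0.98), ('we', -0.01),
         ('out', 0.05), ('love', -0.40)]
ranked = top_k(pairs, 2)
print(ranked)
```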

# Visualising the data with Plotly Express

## Sort words
Sorting the words in ascending order of sentiment makes them easier to view and makes patterns easier to spot.

In [None]:
# Sort all words, handles according to their average sentiment (ascending).
# Scores are cast to float so the sort is numeric even if the array holds strings.
results_sorted = sorted(results,key=lambda row: float(row[1]))
handles = [word for word in results if '@' in word[0]]
handles_sorted = sorted(handles,key=lambda row: float(row[1]))
words = [result for result in results if (result[1] and '@' not in result[0])]
words_sorted = sorted(words,key=lambda row: float(row[1]))

## Sorted raw output for the sentiment polarity of the most frequent words & handles

In [None]:
data = pd.DataFrame(results_sorted)
data.columns = ['Words/Handles','Sentiment']
fig = px.bar(data, x='Words/Handles', y='Sentiment',
             hover_data=['Words/Handles','Sentiment'],color='Sentiment',
             color_continuous_scale='Agsunset')

fig.update_layout(
    title="Sentiment Score of the most frequent words & handles",
    title_x=0.5,
    xaxis_title="Words/Handles",
    yaxis_title="Sentiment Score")

fig.show(renderer="colab")

## Sorted raw output for the sentiment polarity of the most frequent handles

In [None]:
data = pd.DataFrame(handles_sorted)
data.columns = ['Handles','Sentiment']
fig = px.bar(data, x='Handles', y='Sentiment',
             hover_data=['Handles','Sentiment'],color='Sentiment',
             color_continuous_scale='Agsunset')

fig.update_layout(
    title="Sentiment Score of the most frequent handles",
    title_x=0.5,
    xaxis_title="Handles",
    yaxis_title="Sentiment Score")

fig.show(renderer="colab")

## Sorted raw output for the sentiment polarity of the most frequent words

In [None]:
data = pd.DataFrame(words_sorted)
data.columns = ['Words','Sentiment']
fig = px.bar(data, x='Words', y='Sentiment',
             hover_data=['Words','Sentiment'],color='Sentiment',
             color_continuous_scale='Agsunset')

fig.update_layout(
    title="Sentiment Score of the most frequent words",
    title_x=0.5,
    xaxis_title="Words",
    yaxis_title="Sentiment Score")

fig.show(renderer="colab")

# Findings and Final Thoughts

From the pre-trained BERT model, we can observe some key statistics. Here is a summary of the outputs:

```
There are 249 unique words that occur more than 3 times.
```
Not that interesting, moving on.
```
The 3 most frequent word(s) were(is):
'the', appearing 105 times
'to', appearing 87 times
'you', appearing 65 times
```
Again, not that insightful. However, it is interesting to see that 'you' is one of the more frequently used words. From a brief glance at the dataset, it is fair to say that it has come up a lot in accusatory contexts.
```
The most positive words appearing more than 3 times were:
'great' with a score of: 0.9928
'together' with a score of: 0.9261
'you're' with a score of: 0.6729

The most negative words appearing more than 3 times were:
'it’s' with a score of: -0.9996
'fix' with a score of: -0.9996
'getting' with a score of: -0.9995

The most divisive words appearing more than 3 times were:
'address' with a score of: -0.0001
'we' with a score of: -0.0002
'out' with a score of: -0.0007
```
Not surprisingly, the word 'great' had a high sentiment score. As we are dealing with tech support/complaint tweets, it is fair to see that 'fix' has a very low score, as consumers may be using not-so-polite language to try and resolve their issue. Unlike the example output, the word 'we' appeared to be more divisive on average than net positive.

```
The most positive tweets included the handles:
'@105840' with a score of: 0.5832
'@105856' with a score of: 0.3143
'@115861' with a score of: 0.2974

The most negative tweets included the handles:
'@tesco' with a score of: -0.9983
'@virgintrains' with a score of: -0.9978
'@115858' with a score of: -0.9975

The most divisive tweets included the handles:
'@115855' with a score of: 0.0012
'@spotifycares' with a score of: 0.1133
'@115865' with a score of: -0.1793
```
The most positive handles were all customers, and weren't the most polarised. The top two most negative handles belonged to companies that were receiving criticism, which is fair enough considering the nature of the tweets from which the dataset was sampled, though it was interesting to see that Spotify had a near-neutral average sentiment.

The plotly outputs show that an overwhelming majority of words in the dataset had a net average negative sentiment. It was interesting to note that even words like 'love' were negative. This is because the other words around them had an anchoring effect, or because there was clear sarcasm involved (e.g. "My apps stop working without warning and my phone freezes every five minutes! Love the new update @76099!!!!"), which the BERT model does not appear to be the most competent at recognising.

The plotly outputs are pasted below:
### Plotly outputs for both words and handles
![Words/Handles](https://drive.google.com/uc?export=view&id=1BR5w-cxNQ33zuSGfOS_8P9ma1ME8bnXZ)
(https://drive.google.com/uc?export=view&id=1BR5w-cxNQ33zuSGfOS_8P9ma1ME8bnXZ)
### Plotly outputs for handles only
![Handles](https://drive.google.com/uc?export=view&id=1q0eWUEj8YWJlYF-lqzD5bBWnKqqATB03)
(https://drive.google.com/uc?export=view&id=1q0eWUEj8YWJlYF-lqzD5bBWnKqqATB03)
### Plotly outputs for words only
![Words](https://drive.google.com/uc?export=view&id=1YnV1lgcsYh4EanO2Oi1wwNHtbob_WZU6)
(https://drive.google.com/uc?export=view&id=1YnV1lgcsYh4EanO2Oi1wwNHtbob_WZU6)


Upon inspection of the plotly outputs, we can observe that most of the technical vocabulary had a lower sentiment score, and most of the Twitter handles (and almost all Twitter handles that had company names) were negative. Considering the context (again) in which these tweets were received, it is not that surprising.

As mentioned previously, there were some limitations to the BERT model that was used to create the pipeline. Although a string like (for example) 'This is not amazing!' can be detected well:
```
sentiment_analysis('This is not amazing!')
[{'label': 'NEGATIVE', 'score': 0.9997857213020325}]
```
A slight variation of the string (eg. 'This is amazing... not!') is not well recognised.
```
sentiment_analysis('This is amazing... not!')
[{'label': 'POSITIVE', 'score': 0.9929170608520508}]
```
This is a clear limitation of the model and should be further investigated (perhaps with some fine-tuning of the model).

Whilst other models were considered, this version of the BERT model had the best outcomes overall, outdoing other pre-trained models that had been fine-tuned for tweets, like `vinai/bertweet-base`. This was one of the main reasons I chose the generalised BERT model. Further, there were occasional foreign-language tweets, but I stayed with the BERT model because of its ease of use. There were other models that could handle other/multiple languages, but those did not have the speed and accuracy of the BERT model, which was another reason I used this model.

The dataset contained some very strange/unusual characters such as 'üò°üò°' (these look like mis-decoded emoji). I did not remove/filter these words/characters out for this dataset specifically, as they were relatively few and far between, and did not appear to have much of an adverse effect on the model predictions.

An obvious but crucial point is that most of these words (at least for my run) appeared in only slightly more than 3 tweets. This is obviously not enough to classify the average sentiment of a word (as can be seen in the 'love' example). A larger sample size is crucial to more holistically observe the words of interest in the Twitter tech-support space.
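
To make this sample-size caveat visible in the results, each average could be reported together with the number of tweets behind it. The sketch below is a minimal illustration using made-up signed scores; `summarise` and `min_samples` are hypothetical names, and the threshold value is arbitrary.

```python
def summarise(scores_by_word: dict, min_samples: int) -> dict:
    """Return {word: (mean, n)} for words backed by at least
    min_samples scored tweets; words with fewer are left out."""
    summary = {}
    for word, scores in scores_by_word.items():
        n = len(scores)
        if n >= min_samples:
            summary[word] = (sum(scores) / n, n)
    return summary

# Made-up signed scores per word (not real model output).
scores_by_word = {
    'love': [0.5, -1.0, -1.0],
    'fix': [-1.0, -1.0, -1.0, -1.0, -1.0],
}
summary = summarise(scores_by_word, min_samples=5)
print(summary)
```

With this threshold, 'love' is dropped because only three tweets support it, while 'fix' is kept along with its count.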

Again, I would like to thank Spark 64 for giving me the opportunity to take part in the technical test. I hope that there were no jarring issues with running the Colab Notebook.