<a href="https://colab.research.google.com/github/julrods/aggressive-tweet-analyzer/blob/main/4_Web_app.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Environment" data-toc-modified-id="Environment-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Environment</a></span><ul class="toc-item"><li><span><a href="#Libraries" data-toc-modified-id="Libraries-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Libraries</a></span></li><li><span><a href="#Functions" data-toc-modified-id="Functions-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Functions</a></span></li><li><span><a href="#Twitter-API-authentication" data-toc-modified-id="Twitter-API-authentication-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Twitter API authentication</a></span></li></ul></li><li><span><a href="#BERT-Setup" data-toc-modified-id="BERT-Setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>BERT Setup</a></span></li><li><span><a href="#Tweet-extraction" data-toc-modified-id="Tweet-extraction-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Tweet extraction</a></span></li><li><span><a href="#Preprocessing" data-toc-modified-id="Preprocessing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Preprocessing</a></span></li><li><span><a href="#Tokenization" data-toc-modified-id="Tokenization-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Tokenization</a></span></li><li><span><a href="#Prediction" data-toc-modified-id="Prediction-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Prediction</a></span></li><li><span><a href="#Final-step" data-toc-modified-id="Final-step-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Final step</a></span></li></ul></div>

# Web app

This notebook contains the code that runs in the backend of my web app, Aggressive Tweet Analyzer.

The app takes a Twitter handle as input and requests that user's last 100 tweets from the Twitter API. It then preprocesses and tokenizes the text and predicts a label for each tweet: aggressive or not aggressive. Finally, it returns an aggressiveness score for the user (number of aggressive tweets / 100) and displays all the tweets labeled as aggressive.

This completes my first full-cycle NLP project, from building a model (in this case fine-tuning BERT) to deploying it into production.

## Environment

### Libraries

In [None]:
!pip install transformers

In [None]:
# Base libraries
import os
import pandas as pd
import numpy as np
import itertools

# Libraries to work with the Twitter API
import json
import tweepy

# ML and DL libraries
import tensorflow as tf
from transformers import TFBertModel, TFBertForSequenceClassification, BertTokenizer

# Text processing libraries
import re
import unicodedata
import nltk
from nltk.corpus import stopwords
from tqdm import tqdm

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Functions

In [None]:
def unicode_to_ascii(sentence):
    """ 
    Input: a string (one sentence) that may contain accented characters
    Output: the same string with accents (Unicode combining marks) removed
    """
    return ''.join(character for character in unicodedata.normalize('NFD', sentence) if unicodedata.category(character) != 'Mn')

def clean_stopwords_shortwords(sentence):
    """ 
    Input: a string (one sentence)
    Output: a string (one sentence) without stop words and without words of 2 characters or fewer
    """
    stopwords_list = stopwords.words('english')
    words = sentence.split() 
    clean_words = [word for word in words if (word not in stopwords_list) and len(word) > 2]
    return " ".join(clean_words) 

def preprocess_sentence(sentence):
    """
    Input: a raw sentence
    Output: a clean sentence ready to be passed to a tokenizer
    """
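    # Lowercase, trim, and strip accents
    sentence = unicode_to_ascii(sentence.lower().strip())
    # Replace basic punctuation with a space
    sentence = re.sub(r"([?.!,¿])", r" ", sentence)
    # Collapse runs of spaces and double quotes into a single space
    sentence = re.sub(r'[" "]+', " ", sentence)
    # Replace every remaining run of non-letter characters with a space
    sentence = re.sub(r"[^a-zA-Z?.!,¿]+", " ", sentence)
    # Drop stop words and words of 2 characters or fewer
    sentence = clean_stopwords_shortwords(sentence)
    # Note: '@' was already replaced above, so this pattern no longer matches;
    # the line is kept so the behavior stays unchanged
    sentence = re.sub(r'@\w+', '', sentence)
    # Keep only the text before the first ' http' (the start of a link)
    if 'http' in sentence:
        sentence = sentence.split(' http')[0]
    return sentence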
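
As a quick check of the helpers above, the sketch below uses only the standard library. It redefines `unicode_to_ascii` with the same logic as in this notebook and shows which characters it changes, then illustrates why the position of a mention regex relative to the non-letter cleanup matters. The sample strings are made up for illustration.

```python
import re
import unicodedata


def unicode_to_ascii(sentence):
    # Same logic as the notebook helper: decompose characters (NFD) and
    # drop the combining marks (category 'Mn'), which removes accents.
    return ''.join(ch for ch in unicodedata.normalize('NFD', sentence)
                   if unicodedata.category(ch) != 'Mn')


# Accented letters lose their accents...
accent_example = unicode_to_ascii('café niño')
# ...but characters without a canonical decomposition are left untouched.
eszett_example = unicode_to_ascii('straße')

# Made-up tweet used only for this illustration.
raw = '@someone thanks for sharing'

# Stripping non-letters first replaces '@' with a space, so a later
# r'@\w+' substitution has nothing left to match.
letters_only = re.sub(r'[^a-zA-Z]+', ' ', raw).strip()
mention_after = re.sub(r'@\w+', '', letters_only)

# Removing the mention before stripping non-letters drops the handle.
mention_before = re.sub(r'[^a-zA-Z]+', ' ', re.sub(r'@\w+', '', raw)).strip()

print(accent_example)
print(eszett_example)
print(mention_after)
print(mention_before)
```

If the handle should be removed, the mention substitution would have to run before the non-letter cleanup. Changing the order changes the text the model sees, so it should match whatever preprocessing was used when the model was fine-tuned.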

### Twitter API authentication

In [None]:
credentials_filepath = '/content/gdrive/MyDrive/Cyber-bullying-project/keys/twitter_credentials.json'

In [None]:
with open(credentials_filepath) as data_file:
    credentials = json.load(data_file)
auth = tweepy.OAuthHandler(credentials['api_key'], credentials['api_secret_key'])
auth.set_access_token(credentials['access_token'], credentials['access_token_secret'])
api = tweepy.API(auth)
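
The authentication cell assumes the JSON file contains the four keys it reads. A small check like the one below can fail early with a clearer message when a key is missing. `load_twitter_credentials` is a hypothetical helper, not part of this notebook or of Tweepy, and the demo writes a throwaway file with placeholder values.

```python
import json
import os
import tempfile

# Keys read by the authentication cell in this notebook.
REQUIRED_KEYS = ('api_key', 'api_secret_key', 'access_token', 'access_token_secret')


def load_twitter_credentials(filepath):
    """Hypothetical helper: load the credentials JSON and verify its keys."""
    with open(filepath) as data_file:
        credentials = json.load(data_file)
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise KeyError(f'Missing credential keys: {missing}')
    return credentials


# Demo with a temporary file and placeholder values (not real secrets).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'twitter_credentials.json')
    with open(demo_path, 'w') as demo_file:
        json.dump({key: 'placeholder' for key in REQUIRED_KEYS}, demo_file)
    demo_credentials = load_twitter_credentials(demo_path)

print(sorted(demo_credentials))
```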

## BERT Setup

In [None]:
# Load BERT base model
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Define loss, metric and optimizer
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5,
                                   epsilon=1e-08)
# Compile model
model.compile(loss = loss, optimizer = optimizer, metrics = [metric])

# Load the weights of the final model
model.load_weights('/content/gdrive/MyDrive/Cyber-bullying-project/models/aggression_model_1epoch.h5')

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Load the BERT tokenizer
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

## Tweet extraction

In [None]:
# Define the user handle we want to extract tweets from
user = 'DonaldJTrumpJr'

In [None]:
# Extract the tweets via the Twitter API using Tweepy
tweet_dict = [{'tweet': tweet.full_text, # Tweet's full text
              'created_at': tweet.created_at, # Date
              'username': user, # Username
              'headshot_url': tweet.user.profile_image_url, # Profile picture URL
              'url': f'https://twitter.com/user/status/{tweet.id}' # Tweet URL
               } for tweet in tweepy.Cursor(api.user_timeline, # Extract from a user timeline
                                            screen_name = user, # Specify the user
                                            exclude_replies = True, # Exclude replies
                                            include_rts = False, # Exclude retweets
                                            tweet_mode = "extended" # Include the full text of the tweet
                                            ).items(100)] # Extract 100 tweets

## Preprocessing

In [None]:
# Save all the texts in a list
text_list = [tweet['tweet'] for tweet in tweet_dict]

In [None]:
# Preprocess tweets
tweets_clean = list(map(preprocess_sentence, text_list))

## Tokenization

In [None]:
# Create empty lists to append the tweet vectors
input_ids = []
attention_masks = []

# Tokenize each tweet
for tweet in tweets_clean:
    bert_inp = bert_tokenizer.encode_plus(tweet,
                                          add_special_tokens = True,
                                          max_length = 100,
                                          truncation = True,
                                          padding = 'max_length',
                                          return_attention_mask = True)
    # Append every sentence vector to a list
    input_ids.append(bert_inp['input_ids'])
    attention_masks.append(bert_inp['attention_mask'])

# Convert the lists to arrays so that we can input them into the model
input_ids = np.asarray(input_ids)
attention_masks = np.array(attention_masks)
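
To make the `max_length`, `truncation` and `padding` arguments concrete, here is a toy sketch in plain Python of what fixed-length encoding produces. It is an illustration only, not the `BertTokenizer` implementation: `pad_or_truncate` is a hypothetical function, the token IDs are made up, no special tokens are added, and the padding ID of 0 is an assumption for the example.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Hypothetical helper: return (input_ids, attention_mask) of fixed length."""
    kept = list(token_ids)[:max_length]             # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad             # padding on the right
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)

print(short_ids, short_mask)  # [7, 8, 9, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

Because every row ends up with the same length, the lists can be stacked into rectangular arrays, and the attention mask tells the model which positions hold real tokens.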

## Prediction

In [None]:
# Make predictions
preds = model.predict([input_ids, attention_masks], batch_size=32)

# Find the predicted label
pred_labels = [np.argmax(pred) for pred in preds[0]]
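
`preds[0]` holds one row of two raw scores (logits) per tweet, and `np.argmax` picks the index of the larger one. The sketch below, written in plain Python so it runs without the model, repeats that step on made-up logits and also converts them to probabilities with a softmax, which the notebook itself does not do.

```python
import math


def softmax(logits):
    """Convert a list of logits to probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]  # shift for stability
    total = sum(exps)
    return [value / total for value in exps]


def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Made-up logits for three tweets, two classes each.
example_logits = [[2.0, -1.0], [-0.5, 1.5], [0.0, 0.0]]

example_labels = [argmax(row) for row in example_logits]
example_probs = [softmax(row) for row in example_logits]

print(example_labels)  # [0, 1, 0]
```

Probabilities are optional for labeling, since the argmax of the logits and of the softmax output is the same, but they could be used to apply a stricter threshold than a simple majority.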

In [None]:
# Create a new variable 'label' for each tweet in tweet_dict
for tweet, pred in zip(tweet_dict, pred_labels):
    tweet['label'] = pred
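
The notebook stops at labeling, while the app described at the top also reports a score and lists the aggressive tweets. A minimal sketch of that last step follows. It assumes that label 1 means aggressive (the notebook does not state the label encoding), uses a made-up list in the same shape as `tweet_dict`, and divides by the number of tweets actually retrieved rather than a fixed 100, since a timeline can yield fewer. `aggressiveness_summary` is a hypothetical name.

```python
AGGRESSIVE_LABEL = 1  # assumption: 1 = aggressive, 0 = not aggressive


def aggressiveness_summary(tweets):
    """Hypothetical helper: return (score, aggressive_tweets) for labeled tweets."""
    aggressive = [tweet for tweet in tweets if tweet['label'] == AGGRESSIVE_LABEL]
    # Guard against an empty timeline to avoid dividing by zero.
    score = len(aggressive) / len(tweets) if tweets else 0.0
    return score, aggressive


# Made-up labeled tweets in the same shape as tweet_dict.
sample_tweets = [
    {'tweet': 'first example', 'label': 0},
    {'tweet': 'second example', 'label': 1},
    {'tweet': 'third example', 'label': 0},
    {'tweet': 'fourth example', 'label': 1},
]

sample_score, sample_aggressive = aggressiveness_summary(sample_tweets)
print(sample_score)
print([tweet['tweet'] for tweet in sample_aggressive])
```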

## Final step

The final step is to use this code as the backend of a web app and deploy the model into production. I built the app with Flask in a code editor, outside this notebook. The files can be found in the 'Web app' folder of the GitHub repository.