# TweetNLP Modified Implementation
This colab notebook is an adaptation of the demo notebook for TweetNLP,   in which the developers provide a short introduction of [`tweetnlp`](https://github.com/cardiffnlp/tweetnlp), a python library of NLP models for tweets. 

------------------

In my modified implementation of the tweetnlp library, I first add four additional features to each of the datasets using the emotion recognition, topic classification, irony detection, and offensive language detection tasks from the tweetnlp library. 

Second, I  utilize methods for recognizing stressed language found in another paper. I use the offensive language detection model as a base to create a new model that can detect stressed language in text. I will then create another feature which labels each data point with the presence or absence of stressed language.

From there, I analyze the resulting data and determine patterns between the four features I identified and the presence of stressed language. I compare the findings separately between the pre- and post-covid datasets to make conclusions about how Covid has impacted mental health. 

To conclude, I train the emotion recognition model on the new labeled post-covid dataset and determine if it performs better than in the parent paper. 

# Load Datasets and Packages

In [2]:
import pandas as pd

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

#Pre-Covid Dataset
pre_url = "https://raw.githubusercontent.com/mrp5636/DS340W_Project/main/Project/datasets/train_text.txt"
Pre_Covid_Raw = pd.read_csv(pre_url, delimiter = "\n", header = None)
Pre_Covid_Raw = Pre_Covid_Raw.sample(n = 10000) #take sample of data because there are over 40,000 in original dataset

#Post-Covid
post_url = "https://raw.githubusercontent.com/mrp5636/DS340W_Project/main/Project/datasets/Twitter_Data.csv"
Post_Covid_Raw = pd.read_csv(post_url, delimiter = "\n", header = None)
Post_Covid_Raw = Post_Covid_Raw.drop([0]) #remove header row
Post_Covid_Raw = Post_Covid_Raw.sample(n = 10000) #take sample of data because there are over 100,000 in the original dataset

## Installation
TweetNLP is available on pip or can be installed from source.


In [14]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [3]:
# Fix Colab Error
!pip install --upgrade google-cloud-storage

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting google-cloud-storage
  Downloading google_cloud_storage-2.6.0-py2.py3-none-any.whl (105 kB)
[K     |████████████████████████████████| 105 kB 5.5 MB/s 
Installing collected packages: google-cloud-storage
  Attempting uninstall: google-cloud-storage
    Found existing installation: google-cloud-storage 2.5.0
    Uninstalling google-cloud-storage-2.5.0:
      Successfully uninstalled google-cloud-storage-2.5.0
Successfully installed google-cloud-storage-2.6.0


In [4]:
# via pip
!pip install tweetnlp

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tweetnlp
  Downloading tweetnlp-0.1.2.tar.gz (25 kB)
Collecting allennlp
  Downloading allennlp-2.10.1-py3-none-any.whl (730 kB)
[K     |████████████████████████████████| 730 kB 7.2 MB/s 
Collecting urlextract
  Downloading urlextract-1.7.1-py3-none-any.whl (20 kB)
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
[K     |████████████████████████████████| 5.5 MB 38.5 MB/s 
[?25hCollecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
[K     |████████████████████████████████| 85 kB 4.2 MB/s 
Collecting tensorboardX>=1.2
  Downloading tensorboardX-2.5.1-py2.py3-none-any.whl (125 kB)
[K     |████████████████████████████████| 125 kB 45.4 MB/s 
[?25hCollecting base58>=2.1.1
  Downloading base58-2.1.1-py3-none-any.whl (5.6 kB)
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
[

In [5]:
! pip list | grep tweetnlp

tweetnlp                      0.1.2


All you need is to import `tweetnlp` !

In [6]:
import tweetnlp

## Split Into Testing and Training Sets

In [7]:
#Split each dataset into test and train
train_pre, test_pre = train_test_split(Pre_Covid_Raw, test_size = 0.2)  #train_pre used as pre-covid data for labeling and analysis
train_post, test_post = train_test_split(Post_Covid_Raw, test_size = 0.2) #train_post used as post-covid data for labeling and analysis

#Combine pre and post testing sets to create one testing set to test accuracy
Test_Data = test_pre.append(test_post)

## Labeling Training Sets

I create four new features using the topic classification, irony detection, offensive language detection, and emotion recognition tasks to label each tweet. 

In [8]:
#Convert the dataframes to list of tweets for labeling
pre = train_pre[train_pre.columns[0]].values.tolist()
post = train_post[train_post.columns[0]].values.tolist()

In [9]:
#Import tweetnlp models
topic_model = tweetnlp.load("topic_classification")
irony_model = tweetnlp.load("irony")
offensive_model = tweetnlp.load("offensive")
emotion_model = tweetnlp.load("emotion")

Downloading:   0%|          | 0.00/1.84k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.27k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/780k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/476M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/589 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/150 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/476M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/593 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/150 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/476M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/768 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/150 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/476M [00:00<?, ?B/s]

### Irony Detection 
This is a binary classification task where given a tweet, the goal is to detect whether it is ironic or not. It is based on the Irony Detection dataset from the SemEval 2018 task.

### Offensive Language Identification
This task consists in identifying whether some form of offensive language is present in a tweet. For our benchmark we rely on the SemEval2019 OffensEval dataset.

### Emotion Recognition
Given a tweet, this task consists of associating it with its most appropriate emotion. As a reference dataset we use the SemEval 2018 task on Affect in Tweets, simplified to only four emotions used in TweetEval: anger, joy, sadness and optimism.

In [10]:
#Function to create labels for irony, offensive langauge, and emotion 
def create_labels(tweet_list, model1, model2, model3):
  irony_labels = []
  offensive_labels = []
  emotion_labels = []

  for tweet in tweet_list:
    irony = model1.irony(tweet)
    i_label = irony["label"]
    irony_labels.append(i_label)

    offensive = model2.offensive(tweet)
    o_label = offensive["label"]
    offensive_labels.append(o_label)

    emotion = model3.emotion(tweet)
    e_label = emotion["label"]
    emotion_labels.append(e_label)
    

  labels = [irony_labels, offensive_labels, emotion_labels]

  return labels

In [11]:
#Create labels for pre-covid data
labels = create_labels(pre, irony_model, offensive_model, emotion_model)
Irony = labels[0]
Offensive = labels[1]
Emotion = labels[2]

#Append to original dataframe
train_pre["Irony"] = Irony
train_pre["Offensive"] = Offensive
train_pre["Emotion"] = Emotion

In [16]:
#Create labels for post-covid data
labels = create_labels(post, irony_model, offensive_model, emotion_model)
Irony = labels[0]
Offensive = labels[1]
Emotion = labels[2]

#Append to original dataframe
train_post["Irony"] = Irony
train_post["Offensive"] = Offensive
train_post["Emotion"] = Emotion

### Topic Classification
The aim of this task is, given a tweet to assign topics related to its content. The task is formed as a supervised multi-label classification problem where each tweet is assigned one or more topics from a total of 19 available topics. The topics were carefully curated based on Twitter trends with the aim to be broad and general and consist of classes such as: arts and culture, music, or sports. Our internally-annotated dataset contains over 10K manually-labeled tweets.

In [86]:
#Function to create labels for topic classification, list top 3 topics
topic_list = ["arts_&_culture", "business_&_entrepreneurs", "celebrity_&_pop_culture", "diaries_&_daily_life", "family", "fashion_&_style", "film_tv_&_video", "fitness_&_health", "food_&_dining", "gaming", "learning_&_educational", "music", "news_&_social_concern", "other_hobbies", "relationships", "science_&_technology", "sports", "travel_&_adventure", "youth_&_student_life"]
def create_topic(tweet_list, model):
  label = []
  for tweet in tweet_list:
    #labels = []
    output = model.topic(tweet)
    item = output["probability"]
    pred_list = item.values()
    labels = sorted(zip(pred_list, topic_list), reverse=True)[:3]
    i = []
    for item in labels:
      l = item[1]
      i.append(l)
    label.append(i)
  return label


In [88]:
#Create topic labels for both datasets
labels1 = create_topic(pre, topic_model)
train_pre["Topics"] = labels1

labels2 = create_topic(post, topic_model)
train_post["Topics"] = labels2

## Stressed Language Detection

I utilize a lexicon of stressed language developed from the [Second Parent Paper](https://paperswithcode.com/paper/understanding-and-measuring-psychological) to predict a stress score for each tweet individually. The stress score determines the level of stress within the tweet based on the words used, a higher score means a higher predicted level of stress.

In [64]:
#Load Stress Lexicon
stress_lexicon = "https://raw.githubusercontent.com/chandrasg/lexica/master/stress_1grams_fbtw.csv"
stress_lex = pd.read_csv(stress_lexicon, delimiter = ",")
stress_words = stress_lex[stress_lex.columns[0]].values.tolist() #Create list of words to easily search

In [65]:
#Function to create stress score for each tweet
def stress(tweet_list):
  labels = []
  for tweet in tweet_list:
    score = []
    for word in tweet:
      if word in stress_words:
        i = stress_words.index(word)
        s = stress_lex["weight"]
        idx = s[i]
        score.append(idx)
    score = sum(score)
    labels.append(score)
  return labels

In [66]:
#Label each dataset for stressed language
labels1 = stress(pre)
train_pre["Stress Score"] = labels1

labels2 = stress(post)
train_post["Stress Score"] = labels2

In [93]:
#Save results of labeling to google drive
train_pre.to_csv("/content/drive/My Drive/train_pre.csv")
train_post.to_csv("/content/drive/My Drive/train_post.csv")

## Analysis of Results

In this step of my implementation, I analyze the factors that are most frequently correlated with a high stress score. 

In [97]:
#Load labeled datasets from 
pre_url = "https://raw.githubusercontent.com/mrp5636/DS340W_Project/main/Project/datasets/train_pre.csv"
Pre_Covid = pd.read_csv(pre_url, delimiter = ",")

post_url = "https://raw.githubusercontent.com/mrp5636/DS340W_Project/main/Project/datasets/train_post.csv"
Post_Covid = pd.read_csv(post_url, delimiter = ",")

First, to determine what stress score correlates with "high stress" I evaluate the distribution of Stress Scores over both datasets. Based on these results, it appears that the average stress score for the post-covd data is slightly higher than the

In [108]:
#Investigate range of stress scores for pre-covid data
avg_stress = Pre_Covid["Stress Score"].mean()
min_stress = Pre_Covid["Stress Score"].min()
max_stress = Pre_Covid["Stress Score"].max()
print(f"Average Stress Score: {avg_stress}\n Minimum Stress Score: {min_stress}\n Maximum Stress Score: {max_stress}")

Average Stress Score: 196.73878849300638
 Minimum Stress Score: -725.803942916767
 Maximum Stress Score: 4025.021602340852


In [103]:
#Investigate range of stress scores for post-covid data
avg_stress = Post_Covid["Stress Score"].mean()
min_stress = Post_Covid["Stress Score"].min()
max_stress = Post_Covid["Stress Score"].max()
print(f"Average Stress Score: {avg_stress}\n Minimum Stress Score: {min_stress}\n Maximum Stress Score: {max_stress}")

Average Stress Score: 328.252170569081
 Minimum Stress Score: -762.750041438628
 Maximum Stress Score: 2258.846220172245


In [131]:
#Create Datatables of high stress tweets
High_Stress_Pre = Pre_Covid[Pre_Covid["Stress Score"] > 800]
High_Stress_Post = Post_Covid[Post_Covid["Stress Score"] > 800]

In [202]:
#Determine frequency of offensive language in pre and post covid datasets
offensive_count = High_Stress_Pre["Offensive"].value_counts().to_frame()
offensive_count2 = High_Stress_Post["Offensive"].value_counts().to_frame()
print(f"Pre-Covid: {offensive_count} \n \n Post-Covid: {offensive_count2}")

Pre-Covid:                Offensive
not-offensive        287
offensive             31 
 
 Post-Covid:                Offensive
not-offensive        658
offensive             65


In [203]:
#Determine frequency of irony in pre and post covid datasets
irony_count = High_Stress_Pre["Irony"].value_counts().to_frame()
irony_count2 = High_Stress_Post["Irony"].value_counts().to_frame()
print(f"Pre-Covid: {irony_count} \n \n Post-Covid: {irony_count}")

Pre-Covid:            Irony
irony        169
non_irony    149 
 
 Post-Covid:            Irony
irony        169
non_irony    149


Compare emotions most frequently associated with these high stress tweets

In [140]:
emotions_count = High_Stress_Pre["Emotion"].value_counts()
print(emotions_count)

joy         209
anger        55
sadness      38
optimism     16
Name: Emotion, dtype: int64


In [141]:
emotions_count2 = High_Stress_Post["Emotion"].value_counts()
print(emotions_count2)

anger       473
optimism    102
sadness      82
joy          66
Name: Emotion, dtype: int64


Determine the topics most frequently associated with joy or anger.

In [178]:
#Evaluate Pre-Covid Data for "joy"
topics_count = High_Stress_Pre[High_Stress_Pre["Emotion"] == "joy"]
topics = topics_count["Topics"]

d = {}
for row in topics:
  for item in row:
    if item not in d:
      d[item] = 1
    else:
      d[item] += 1

counts = pd.DataFrame(list(d.items()), columns = ["topic", "count"])
print(counts.sort_values(by = "count", ascending=False))
  

                       topic  count
4    celebrity_&_pop_culture    140
1      news_&_social_concern    102
5                      music     87
0                     sports     86
3            film_tv_&_video     62
6       diaries_&_daily_life     59
9             arts_&_culture     25
10             food_&_dining     11
2   business_&_entrepreneurs      8
7              other_hobbies      7
13                    gaming      7
14        travel_&_adventure      7
8           fitness_&_health      6
15             relationships      6
11      youth_&_student_life      5
12    learning_&_educational      5
17           fashion_&_style      3
16                    family      1


In [182]:
#Evaluate Pre-Covid Dataset for "anger"
topics_count2 = High_Stress_Pre[High_Stress_Pre["Emotion"] == "anger"]
topics2 = topics_count2["Topics"]

d = {}
for row in topics2:
  for item in row:
    if item not in d:
      d[item] = 1
    else:
      d[item] += 1

counts = pd.DataFrame(list(d.items()), columns = ["topic", "count"])
print(counts)

                       topic  count
0    celebrity_&_pop_culture     36
1      news_&_social_concern     46
2            film_tv_&_video     17
3       diaries_&_daily_life     18
4                     sports     22
5   business_&_entrepreneurs      3
6                     gaming      1
7              relationships      2
8           fitness_&_health      1
9                      music      5
10        travel_&_adventure      1
11            arts_&_culture      4
12      youth_&_student_life      1
13    learning_&_educational      1
14             other_hobbies      5
15                    family      1
16      science_&_technology      1


In [189]:
#Evaluate Post-Covid Data for "joy"
topics_count3 = High_Stress_Post[High_Stress_Post["Emotion"] == "joy"]
topics3 = topics_count3["Topics"]

d = {}
for row in topics3:
  row = row.strip("[")
  row = row.strip("]")
  row = row.split(",")
  for item in row:
    if item not in d:
      d[item] = 1
    else:
      d[item] += 1

counts = pd.DataFrame(list(d.items()), columns = ["topic", "count"])
print(counts.sort_values(by = "count", ascending=False))

                          topic  count
0       'news_&_social_concern'     53
11    'celebrity_&_pop_culture'     28
2        'diaries_&_daily_life'     26
3        'science_&_technology'     22
1    'business_&_entrepreneurs'     20
5       'news_&_social_concern'     11
6      'learning_&_educational'      9
10            'film_tv_&_video'      7
4        'science_&_technology'      4
15              'other_hobbies'      4
12             'arts_&_culture'      3
17            'film_tv_&_video'      2
9                      'sports'      2
8               'relationships'      2
13    'celebrity_&_pop_culture'      1
14         'travel_&_adventure'      1
16       'diaries_&_daily_life'      1
7                      'family'      1
18       'youth_&_student_life'      1


In [190]:
#Evaluate Post-Covid Data for "anger"
topics_count4 = High_Stress_Post[High_Stress_Post["Emotion"] == "anger"]
topics4 = topics_count4["Topics"]

d = {}
for row in topics4:
  row = row.strip("[")
  row = row.strip("]")
  row = row.split(",")
  for item in row:
    if item not in d:
      d[item] = 1
    else:
      d[item] += 1

counts = pd.DataFrame(list(d.items()), columns = ["topic", "count"])
print(counts.sort_values(by = "count", ascending=False))

                          topic  count
0       'news_&_social_concern'    457
2        'diaries_&_daily_life'    361
3     'celebrity_&_pop_culture'    295
1    'business_&_entrepreneurs'     75
6        'science_&_technology'     62
7               'other_hobbies'     40
11            'film_tv_&_video'     37
8      'learning_&_educational'     20
14      'news_&_social_concern'     15
12         'travel_&_adventure'     11
5        'youth_&_student_life'     10
9                      'sports'      9
15   'business_&_entrepreneurs'      4
18       'science_&_technology'      4
13            'film_tv_&_video'      3
20             'arts_&_culture'      3
4      'learning_&_educational'      2
16              'food_&_dining'      2
17              'relationships'      2
19                     'sports'      2
21           'fitness_&_health'      2
10                     'family'      1
22                      'music'      1
23              'food_&_dining'      1


# Compare Results