This notebook auto-labels the rest of the tweet corpus (rows 8000 onwards) using our fine-tuned Hugging Face transformer model. The labelled rows are then concatenated with the first 8000 rows and saved as a new CSV.

In [1]:
import time
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv("tweet_corpus.csv")
print("df shape:", df.shape)
# Keep only the rows from position 8000 onwards; the original index labels are preserved.
df = df.iloc[8000:]
df = df.drop(columns=['label2'])
print("df shape:", df.shape)
df.head()

df shape: (50739, 14)
df shape: (42739, 13)


Unnamed: 0,created_at,id_str,full_text,quote_count,reply_count,retweet_count,favorite_count,lang,user_id_str,conversation_id_str,username,tweet_url,Sentiment Analysis (Label)
8000,Sun Nov 28 23:59:03 +0000 2021,1.465108e+18,When is the Biden’s admin suppose to start his...,1.0,1.0,0,2.0,en,1293697365556301837,1465108006912368642,k_matthew26,https://twitter.com/k_matthew26/status/1465108...,
8001,Sun Nov 28 23:59:02 +0000 2021,1.465108e+18,Weird how the hospitals of full of people sick...,0.0,1.0,10,13.0,en,17857130,1465108000050491396,debbterhune,https://twitter.com/debbterhune/status/1465108...,
8002,Sun Nov 28 23:58:39 +0000 2021,1.465108e+18,@BurDarius Bullsht. I have not read anywhere t...,0.0,1.0,0,0.0,en,1222945528704589824,1465102195469205506,felidaeny,https://twitter.com/felidaeny/status/146510790...,
8003,Sun Nov 28 23:58:26 +0000 2021,1.465108e+18,@kylenabecker Why and how: Dr. Sucharit Bhakd...,0.0,0.0,0,8.0,en,1360639012684374021,1464046666969931777,AldenteRoger,https://twitter.com/AldenteRoger/status/146510...,
8004,Sun Nov 28 23:58:21 +0000 2021,1.465108e+18,@henchnips @doritmi @TheFrankmanMN @MaryOskii ...,0.0,2.0,0,2.0,en,3042593172,1464762015931584512,handmadekathy,https://twitter.com/handmadekathy/status/14651...,


In [2]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
# The tokenizer comes from the base checkpoint; the fine-tuned weights are loaded from ./models/
tokenizer = AutoTokenizer.from_pretrained("finiteautomata/bertweet-base-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("./models/")
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)






In [4]:
classifier(df['full_text'][8000])

[{'label': 'NEG', 'score': 0.9970840811729431}]

In [5]:
mappings = {
    'NEG': 'Negative',
    'POS': 'Positive',
    'NEU': 'Neutral'
}

label = []
start_time = time.time()
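# df keeps its original index labels (8000..50738), so iterate over them directly
for i in df.index:
    if i % 500 == 0:
        print('reached', i)
    try:
        labelraw = classifier(df['full_text'][i])[0]['label']
        label.append(mappings[labelraw])
    except Exception:
        # Any tweet the classifier fails on (for example, one that exceeds
        # the model's maximum sequence length) falls back to 'Neutral'
        label.append('Neutral')
print(f"time taken to label {len(label)} tweets (s):", time.time() - start_time)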

reached 8000


Token indices sequence length is longer than the specified maximum sequence length for this model (208 > 128). Running this sequence through the model will result in indexing errors


reached 8500
reached 9000
reached 9500
reached 10000
reached 10500
reached 11000
reached 11500
reached 12000
reached 12500
reached 13000
reached 13500
reached 14000
reached 14500
reached 15000
reached 15500
reached 16000
reached 16500
reached 17000
reached 17500
reached 18000
reached 18500
reached 19000
reached 19500
reached 20000
reached 20500
reached 21000
reached 21500
reached 22000
reached 22500
reached 23000
reached 23500
reached 24000
reached 24500
reached 25000
reached 25500
reached 26000
reached 26500
reached 27000
reached 27500
reached 28000
reached 28500
reached 29000
reached 29500
reached 30000
reached 30500
reached 31000
reached 31500
reached 32000
reached 32500
reached 33000
reached 33500
reached 34000
reached 34500
reached 35000
reached 35500
reached 36000
reached 36500
reached 37000
reached 37500
reached 38000
reached 38500
reached 39000
reached 39500
reached 40000
reached 40500
reached 41000
reached 41500
reached 42000
reached 42500
reached 43000
reached 43500
reached 4
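
The loop above labels any tweet that raises an error as 'Neutral', so the Neutral count mixes genuine predictions with fallbacks. Below is a minimal sketch of how the fallbacks could be tracked separately. The helper `label_texts` and the stand-in `fake_classifier` are hypothetical names introduced only for illustration; they are not part of this notebook or of the transformers library.

```python
def label_texts(texts, classify, mappings, fallback="Neutral"):
    """Label each text and record the positions where the fallback was used."""
    labels, failed = [], []
    for pos, text in enumerate(texts):
        try:
            raw = classify(text)[0]["label"]
            labels.append(mappings[raw])
        except Exception:
            # Same fallback as the notebook loop, but the position is remembered
            labels.append(fallback)
            failed.append(pos)
    return labels, failed


# Hypothetical stand-in for the real pipeline so the sketch runs offline:
# it raises on "long" inputs to mimic a sequence-length failure.
def fake_classifier(text):
    if len(text.split()) > 5:
        raise IndexError("sequence too long")
    return [{"label": "POS" if "good" in text else "NEG", "score": 1.0}]


mappings = {"NEG": "Negative", "POS": "Positive", "NEU": "Neutral"}
labels, failed = label_texts(
    ["good news", "bad news", "one two three four five six"],
    fake_classifier,
    mappings,
)
print(labels)  # ['Positive', 'Negative', 'Neutral']
print(failed)  # [2]
```

With the real pipeline, `classify` would be the `classifier` object defined earlier. The length of `failed` would then show how many of the Neutral labels came from errors rather than from the model. Recent transformers versions may also accept `truncation=True` when calling the pipeline, which could avoid the sequence-length failures altogether, but that is an assumption that has not been checked against this notebook's environment.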

In [10]:
df["Sentiment Analysis (Label)"] = label
df["Sentiment Analysis (Label)"].value_counts()

Sentiment Analysis (Label)
Neutral     26963
Positive     9924
Negative     5852
Name: count, dtype: int64

In [12]:
df2 = pd.read_csv("tweet_corpus2.csv")
# Combine the first 8000 rows from tweet_corpus2.csv with the auto-labelled rows
fulldf = pd.concat([df2.iloc[0:8000], df])
print("fulldf shape:", fulldf.shape)
fulldf["Sentiment Analysis (Label)"].value_counts()

fulldf shape: (50739, 13)


Sentiment Analysis (Label)
Neutral     31304
Positive    11957
Negative     7478
Name: count, dtype: int64
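
As a quick sanity check, the two `value_counts()` outputs can be compared by hand. The sketch below copies the counts printed above into plain dictionaries (the variable names are illustrative), confirms that they sum to the reported row counts, and derives the label distribution of the first 8000 rows by subtraction.

```python
# Counts copied from the value_counts() outputs above
auto = {"Neutral": 26963, "Positive": 9924, "Negative": 5852}   # rows 8000 onwards
full = {"Neutral": 31304, "Positive": 11957, "Negative": 7478}  # full corpus

# Every row should carry exactly one label
assert sum(auto.values()) == 42739
assert sum(full.values()) == 50739

# Labels in the first 8000 rows = full corpus minus auto-labelled rows
first_8000 = {k: full[k] - auto[k] for k in full}
print(first_8000)  # {'Neutral': 4341, 'Positive': 2033, 'Negative': 1626}
assert sum(first_8000.values()) == 8000
```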

In [13]:
fulldf.to_csv('tweet_corpus3.csv', index=False)