In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
from tqdm.notebook import tqdm
import nltk

Followed: https://www.youtube.com/watch?v=QpzMWQvxXWk

Reading in the CSV file. The encoding is set to cp1252 because reading the file with the default encoding (UTF-8) throws errors.

In [36]:
df = pd.read_csv('DataTweets.csv', encoding='cp1252')
df.head()

Unnamed: 0,Id,Name,Tweet
0,1,Trey Benson,I think Trey fits us from a schematic standpoi...
1,2,Trey Benson,And then one thing that stands out about Trey ...
2,3,MarShawn Lloyd,"No, I would like to get him out there as much ..."
3,4,Bucky Irving,The nice thing I like about Bucky is he gets t...
4,5,Bucky Irving,He has taken every detail that we’ve coached t...
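
A minimal sketch of why the default encoding fails, assuming the file was saved as cp1252 (for example from Excel on Windows; this is a guess, not something the file confirms). The tweets contain curly apostrophes, which cp1252 stores as a single byte that is not valid UTF-8 on its own.

```python
# The curly apostrophe (U+2019) is stored as the single byte 0x92 in cp1252.
text = "he\u2019s tough"
raw = text.encode("cp1252")

# Decoding those bytes as UTF-8 fails: 0x92 cannot start a UTF-8 character.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the bytes were written in round-trips cleanly.
decoded = raw.decode("cp1252")
print(utf8_ok, decoded == text)
```

Passing encoding='cp1252' to pd.read_csv applies the same decoding to the whole file.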


In [37]:
example = df['Tweet'][0]
print(example)

I think Trey fits us from a schematic standpoint, in that he’s instinctive, he’s tough, he’s physical, he’s got good contact balance, he’s able to run through and gain tough yards.


A quick look at the nltk word_tokenize function, which splits the text into tokens. This is separate from the roberta model, which uses its own tokenizer.

In [38]:
tokens = nltk.word_tokenize(example)
tokens[:10]

['I',
 'think',
 'Trey',
 'fits',
 'us',
 'from',
 'a',
 'schematic',
 'standpoint',
 ',']

In [39]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax

Pulling in a specific model that has already been pretrained for sentiment analysis, available through Hugging Face. The model was trained on Twitter data, so we don't have to retrain it; the pre-trained weights are loaded as-is.

In [40]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)



Running the roberta model on our example. The first step is encoding the text with the tokenizer, which converts it into integer token IDs (plus an attention mask) that the model can take as input.

In [41]:
encoded_text = tokenizer(example, return_tensors='pt')
output = model(**encoded_text)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
scores_dict = {
    'roberta_neg': scores[0],
    'roberta_neu': scores[1],
    'roberta_pos': scores[2]
}
print(scores_dict)

{'roberta_neg': 0.0035831851, 'roberta_neu': 0.21310855, 'roberta_pos': 0.78330815}
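
The model itself returns raw scores (logits); the softmax call is what turns them into the three probabilities printed above. A rough sketch of that step in plain Python, using made-up logits rather than real model output (softmax_sketch is a hypothetical helper name, not part of scipy):

```python
import math

def softmax_sketch(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in (negative, neutral, positive) order; not taken from the model.
probs = softmax_sketch([-2.0, 1.0, 2.5])
print([round(p, 3) for p in probs])
```

The largest logit always gets the largest probability, which is why the highest of the three roberta scores can be read as the predicted sentiment.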


Creating a function that runs the model on each piece of text we give it.

In [42]:
def polarity_scores_roberta(example):
    # Encode the text, run the model, and convert the raw scores to probabilities
    encoded_text = tokenizer(example, return_tensors='pt')
    output = model(**encoded_text)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    scores_dict = {
        'roberta_neg': scores[0],
        'roberta_neu': scores[1],
        'roberta_pos': scores[2]
    }
    return scores_dict

Storing our results for the entire dataset into a dictionary which we will then convert to a dataframe

In [43]:
res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    tweet = row['Tweet']
    tweetId = row['Id']
    try:
        roberta_result = polarity_scores_roberta(tweet)
        res[tweetId] = roberta_result
    except RuntimeError:
        # Report the Id of the tweet that failed (not the built-in id function)
        print(f'Broke for id {tweetId}')


Converting the dictionary to a dataframe and then merging the original dataframe (df) with our new dataframe to get a side by side of the scores with the tweet.

In [44]:
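# One row per tweet Id, one column per score
results_df = pd.DataFrame(res).T
results_df = results_df.reset_index().rename(columns={'index': 'Id'})
# Join the scores back onto the original tweets using the shared Id column
results_df = results_df.merge(df, on='Id', how='left')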

In [45]:
results_df

Unnamed: 0,Id,roberta_neg,roberta_neu,roberta_pos,Name,Tweet
0,1,0.003583,0.213109,0.783308,Trey Benson,I think Trey fits us from a schematic standpoi...
1,2,0.004016,0.11429,0.881694,Trey Benson,And then one thing that stands out about Trey ...
2,3,0.006738,0.160118,0.833143,MarShawn Lloyd,"No, I would like to get him out there as much ..."
3,4,0.010072,0.089185,0.900744,Bucky Irving,The nice thing I like about Bucky is he gets t...
4,5,0.022405,0.748693,0.228902,Bucky Irving,He has taken every detail that we’ve coached t...
5,6,0.045574,0.840049,0.114377,Breece Hall,"Breece is the unquestioned bellcow, but even t..."
6,7,0.004232,0.098714,0.897054,Malachi Corley,"He is raw, from a route-running ability standp..."
7,8,0.110729,0.74543,0.143841,Drake Maye,"Ultimately, he still has to win that job and w..."
8,9,0.239015,0.594932,0.166052,Drake Maye,I don’t think many rookies are ready to just j...
9,10,0.005282,0.123267,0.871451,Derrick Henry,He ran very well in Tennesee. I think what it’...
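
To make the reshaping step concrete, here is a small sketch of the same dictionary-to-dataframe conversion on toy data. The toy_* names and values are invented for illustration; only the sequence of pandas calls mirrors the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the tweets dataframe.
toy_df = pd.DataFrame({"Id": [1, 2], "Name": ["A", "B"], "Tweet": ["good", "bad"]})

# Scores keyed by Id, in the same shape as `res`.
toy_res = {
    1: {"roberta_neg": 0.1, "roberta_neu": 0.2, "roberta_pos": 0.7},
    2: {"roberta_neg": 0.8, "roberta_neu": 0.1, "roberta_pos": 0.1},
}

# The outer keys become columns, so transpose to get one row per Id.
toy_scores = pd.DataFrame(toy_res).T
toy_scores = toy_scores.reset_index().rename(columns={"index": "Id"})

# Join the scores to the tweets on the shared Id column.
toy_merged = toy_scores.merge(toy_df, on="Id", how="left")
print(toy_merged.columns.tolist())
```

Without the transpose, the Ids would end up as column headers and the merge would have no Id column to join on.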


Average the scores for each of the three columns, grouped by the player names.

In [66]:
grouped_df = results_df[['Name', 'roberta_neg', 'roberta_neu', 'roberta_pos']].groupby('Name')

# Calculate the mean of each group
averages_df = grouped_df.mean().reset_index()
averages_df = averages_df.rename(columns={
    'roberta_neg': 'Negative',
    'roberta_neu': 'Neutral',
    'roberta_pos': 'Positive'
})
# Print the resulting dataframe
print(averages_df)

                  Name  Negative   Neutral  Positive
0   Anthony Richardson  0.004275  0.156407  0.839317
1          Ben Sinnott  0.012392  0.409611  0.577998
2          Breece Hall  0.045574  0.840049  0.114377
3         Bucky Irving  0.016238  0.418939  0.564823
4     Christian Watson  0.019709  0.346437  0.633854
5        Derrick Henry  0.005282  0.123267  0.871451
6      Devontez Walker  0.011445  0.311536  0.677020
7           Drake Maye  0.174872  0.670181  0.154947
8       Jalen McMillan  0.010955  0.317371  0.671674
9           Joe Burrow  0.006339  0.162784  0.830877
10      Malachi Corley  0.004232  0.098714  0.897054
11      MarShawn Lloyd  0.004136  0.092779  0.903085
12          Nick Chubb  0.015956  0.543904  0.440141
13         Trey Benson  0.003800  0.163699  0.832501
14       Treylon Burks  0.014465  0.470272  0.515263
15       Xavier Worthy  0.005397  0.094420  0.900182
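
As a sanity check on the averaging, the sketch below repeats the groupby on three rows copied from the results table (values rounded as displayed there). The sample and sample_avg names are only for this illustration.

```python
import pandas as pd

# Three rows copied from the results table (rounded as displayed).
sample = pd.DataFrame({
    "Name": ["Breece Hall", "Drake Maye", "Drake Maye"],
    "roberta_neg": [0.045574, 0.110729, 0.239015],
    "roberta_neu": [0.840049, 0.745430, 0.594932],
    "roberta_pos": [0.114377, 0.143841, 0.166052],
})

# groupby("Name").mean() averages each score column within each player.
sample_avg = sample.groupby("Name").mean().reset_index()
print(sample_avg.round(6))
```

A player with a single quote, such as Breece Hall, keeps that quote's scores, while Drake Maye's two quotes are averaged, which should line up with the Drake Maye row in the table above up to rounding.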


Writing these scores into the Excel sheet with predefined player names. Previously the file had to be deleted every time before writing to it, likely because appending to a workbook raises an error when the 'Scores' sheet already exists; passing if_sheet_exists='replace' overwrites that sheet instead.

In [65]:
# Read the Excel sheet
excel_file = 'DataTweets.xlsx'
excel_df = pd.read_excel(excel_file, sheet_name='Scores')

# Drop score columns already in the sheet so the merge does not create _x/_y duplicates
replace_columns = [col for col in excel_df.columns if col in averages_df.columns and col != 'Name']
excel_df = excel_df.drop(columns=replace_columns)

# Merge the Excel dataframe with the averages dataframe based on the 'Name' column
merged_df = pd.merge(excel_df, averages_df, on='Name', how='left')

# Write the merged dataframe back, replacing the existing 'Scores' sheet
with pd.ExcelWriter(excel_file, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    merged_df.to_excel(writer, sheet_name='Scores', index=False)


A quick and easy way to run sentiment predictions is through the pretrained pipelines that Hugging Face offers. Other models can be specified, but with no arguments the pipeline falls back to a default model.

In [46]:
from transformers import pipeline
sent_pipeline = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
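
The warning above can be avoided by naming the model explicitly. A minimal sketch, assuming the same checkpoint the pipeline defaulted to; build_sentiment_pipeline is a hypothetical helper name, and the first call downloads the model if it is not already cached.

```python
def build_sentiment_pipeline(
    model_name="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
):
    """Build a sentiment pipeline with the checkpoint named explicitly."""
    # Imported inside the helper so defining it does not load transformers.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model_name)
```

Calling build_sentiment_pipeline() should give the same kind of object as sent_pipeline, and passing a different model name (for example the MODEL string used earlier) swaps in another checkpoint.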


In [49]:
sent_pipeline(df['Tweet'][15])

[{'label': 'POSITIVE', 'score': 0.9919127225875854}]

In [50]:
df['Tweet'][15]

'"[Boyd signing] means you don\'t have to rely on Burks to produce, which takes the pressure off of him and allows him to just go make plays when he gets opportunities (yes as that 4th receiver).'