<a href="https://colab.research.google.com/github/jj0ng/TIL/blob/main/sentiment_analysis_day_2(starter_code).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Agenda

- Questions and Debugging together
- Share our work
- Exercises if time allows

In [None]:
!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.0 tokenizers-0.13.2 transformers-4.26.0
Looking in indexes: https://pypi.org/simple, https://u

In [None]:
# importing miscellaneous packages
import numpy as np # fast manipulation of multidimensional arrays

from tqdm.notebook import tqdm as progress_bar # a little visualization of how fast a loop is running
from scipy.special import softmax
import csv
from datetime import datetime
from matplotlib.dates import date2num

# more packages, tools for getting to google drive
import urllib.request
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

import pandas as pd # basically the excel of python

In [None]:
# deep learning toolkit
from torch.utils.data import DataLoader
from torch.nn import Softmax
import torch

In [None]:
# huggingface's tools for pretrained language models
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer

from datasets import Dataset

# Exercise 1 - Calculate model accuracy

### Load our previous work 

[datasets dropbox](https://drive.google.com/drive/u/0/folders/1bM9JN8U5yxeH0wZdA_xvqjOhjEnRATOK)

In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
pred_id = "1ALwuQAihQhR1XOGGYJZIh7bN6ZnNkEum"
brandwatch_scores_id = "1587Dio_6LPg_Tbqbz2UQUSNVEiw-IP1W"


# This step will move the file from Drive to the workspace
# it will take a little bit of time
download1 = drive.CreateFile({'id':pred_id}) 
download1.GetContentFile('predicted_scores.xlsx')

download2 = drive.CreateFile({'id':brandwatch_scores_id}) 
download2.GetContentFile('brandwatch_scores.xlsx')


# This step will load the file from the workspace to our notebook session
pred_df = pd.read_excel('predicted_scores.xlsx')
true_df = pd.read_excel('brandwatch_scores.xlsx')

In [None]:
pred_df.columns

In [None]:
true_df.columns

### How to calculate the model accuracy

If you have sentiment labels for a set of text samples, you can calculate the accuracy of the model as the number of correct predictions divided by the number of samples (then multiply that fraction by 100 if you want it as a percentage).

How would we do this in numpy? We have two numpy arrays, one holding our model's predictions and one holding the "correct" answers. Then we have these tools:

[np.equal()](https://numpy.org/doc/stable/reference/generated/numpy.equal.html)  
[np.sum()](https://numpy.org/doc/stable/reference/generated/numpy.sum.html)  

[python arithmetic operators](https://www.w3schools.com/python/python_operators.asp)
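
Here is a minimal sketch of that calculation using the tools linked above. The two arrays are made-up toy data (labels encoded as integers, 0 = negative, 1 = neutral, 2 = positive), not the real columns of `pred_df` and `true_df`, so treat the names and values as assumptions.

```python
import numpy as np

# Toy data standing in for the real prediction and label columns.
# Assumed encoding for illustration: 0 = negative, 1 = neutral, 2 = positive.
predictions = np.array([2, 0, 1, 2, 0])
true_labels = np.array([2, 0, 2, 2, 1])

# np.equal compares the arrays element by element and returns booleans.
is_correct = np.equal(predictions, true_labels)

# Summing booleans counts the True values, i.e. the correct predictions.
n_correct = np.sum(is_correct)

# Accuracy is correct predictions divided by the number of samples.
accuracy = n_correct / len(true_labels)
accuracy_percent = accuracy * 100

print(f"{n_correct} of {len(true_labels)} correct, accuracy = {accuracy_percent:.1f}%")
```

With real data you would first pull the matching columns out of the two DataFrames as numpy arrays (for example with `.to_numpy()`) and make sure both use the same label encoding and row order.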





What if we don't have labels? This will be a common scenario in research applications. There are "unsupervised" models that don't need labels to train, but that can open up new questions. For now, I suggest you focus on "supervised" models and either 1) use a model trained on text very similar to yours, or 2) use a model trained on text somewhat similar to yours, label a small set of your own text, and make sure the model gets most of your labels right.

For example, the first sentiment analysis model we used was trained on tweets and applied to tweets. The domains are very close, so we can probably trust the results. The more general sentiment model we used may be less familiar with tweet vernacular, so if we were going to apply the general model to tweets, we should label a couple hundred tweets and check that the model predicts them correctly. A spot check like this at the beginning of the research process can keep you from wasting time on the wrong model for your problem.

[just twitter model](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest?text=Covid+cases+are+increasing+fast%21)
| [all-english model](https://huggingface.co/j-hartmann/sentiment-roberta-large-english-3-classes?text=Oh+no.+This+is+bad..)
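
One way to set up the spot check described above is to draw a random sample of your texts for hand labeling. The sketch below uses a made-up DataFrame (`tweets_df` and its contents are hypothetical) and pandas' `DataFrame.sample`; the sample size of 200 simply mirrors the "couple hundred" suggestion.

```python
import pandas as pd

# Hypothetical stand-in for a table of unlabeled tweets.
tweets_df = pd.DataFrame({"text": [f"example tweet {i}" for i in range(1000)]})

# Draw a random sample to label by hand.
# random_state makes the draw reproducible, so you get the same rows every run.
spot_check_df = tweets_df.sample(n=200, random_state=42)

# Add an empty column to fill in manually (in a spreadsheet, for instance).
spot_check_df = spot_check_df.assign(human_label="")

print(spot_check_df.head())
```

Once the `human_label` column is filled in, you can run the model on the same 200 rows and compare its predictions to your labels with the accuracy calculation from Exercise 1.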

# Review examples of using different models

[Colab with examples of using models](https://colab.research.google.com/drive/1mHuSzPP2MDIvdcNZ_hFi9z2O3PQqv8sb?usp=sharing)

- emotion
- general sentiment

# Exercise 2 - find hate speech

At this point we have a good bit of code in our Colab notebooks. In theory, we should have all the pieces we need to perform an analysis. In the following cells, let's try to apply a new model (a hate speech detector) to a new dataset. Try to put the data and the code you already have together to complete the following steps.

On your own:

1) Load the CSV on hate speech located [in this google folder](https://drive.google.com/drive/folders/1bM9JN8U5yxeH0wZdA_xvqjOhjEnRATOK?usp=sharing)   
2) Get the data from a DataFrame into a batched dataset (use a Hugging Face `Dataset`)   
3) Instantiate the new model [model card](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target?text=I+hate+you)  (note: the authors' coding example is located under "Use in Transformers").  


Together we will:   
4) pass each batch of the dataset through the model.   
5) print out the percent of the data samples that are hate speech
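
If the idea of "batching" in step 2 is new, the sketch below shows what it means in plain Python: cutting a column of texts into consecutive chunks of a fixed size. The helper `make_batches` and the toy DataFrame are hypothetical and only illustrate the concept; they are not the Hugging Face way of doing it.

```python
import pandas as pd


def make_batches(texts, batch_size):
    """Yield consecutive lists of at most `batch_size` texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Toy data standing in for the 'text' column of the real DataFrame.
toy_df = pd.DataFrame({"text": ["first", "second", "third", "fourth", "fifth"]})

batches = list(make_batches(toy_df["text"].tolist(), batch_size=2))

for batch in batches:
    print(batch)
```

The last batch is shorter when the number of texts is not a multiple of the batch size. A list of strings like each of these batches can be passed to a Hugging Face tokenizer in one call; with `padding=True` the tokenizer pads the shorter texts so the whole batch fits in one rectangular tensor.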


In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
hate_id = "1b3qYAj3k-RcLUAlJanygLqBWLxIR2Ne3"

download1 = drive.CreateFile({'id':hate_id}) 
download1.GetContentFile('hate.csv')

hate_df = pd.read_csv('hate.csv')

In [None]:
hate_df.head()

,Unnamed: 0,id,text,label,type,model_wrong,db.model_preds,status,round,split,annotator
0,1,10000,It should be legal for a husband to rape his wife,hate,notgiven,True,0.97457,dynabench entry,1,train,gnzuctwed3
1,2,10001,dalits are lowlife scum,hate,notgiven,False,0.08233,dynabench entry,1,test,cagnluiznm
2,3,10002,dalits are lowlives,hate,notgiven,True,0.92319,dynabench entry,1,train,cagnluiznm
3,4,10003,It was a better world when women dare not ques...,hate,notgiven,True,0.99006,dynabench entry,1,test,gnzuctwed3
4,5,10004,I don't work this hard so that those immigrant...,hate,notgiven,True,0.98836,dynabench entry,1,train,cagnluiznm


In [None]:
hate_df.columns

Index(['Unnamed: 0', 'id', 'text', 'label', 'type', 'model_wrong',
       'db.model_preds', 'status', 'round', 'split', 'annotator'],
      dtype='object')

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("facebook/roberta-hate-speech-dynabench-r4-target")

model = AutoModelForSequenceClassification.from_pretrained("facebook/roberta-hate-speech-dynabench-r4-target")

In [None]:
from datasets import Dataset
data = pd.DataFrame(hate_df['text'])
dataset = Dataset.from_pandas(data)

In [None]:
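# run every text through the model and save the class probabilities in a dataframe

# use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)
model.eval()

# read the label order from the model config so the columns match the model's outputs
labels = [model.config.id2label[i] for i in range(model.config.num_labels)]

all_scores = []
for text in progress_bar(dataset):
  # truncate so long texts do not exceed the model's maximum input length
  tokens = tokenizer(text['text'], return_tensors='pt', truncation=True).to(device)
  with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**tokens)

  # outputs[0] holds the raw logits, shape (1, number of labels)
  logits = outputs[0].cpu().numpy()
  # softmax over the label axis turns the logits into probabilities
  all_scores.append(softmax(logits, axis=-1))

# build the dataframe once at the end; pd.concat inside the loop slows down as it grows
class_df = pd.DataFrame(np.concatenate(all_scores, axis=0), columns=labels)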


  0%|          | 0/40623 [00:00<?, ?it/s]

KeyboardInterrupt: ignored
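
For step 5, once the loop has finished and there is one row of scores per text, the percent of samples classified as hate speech can be computed roughly as follows. The scores below are made up for illustration; `toy_scores` is a hypothetical stand-in for the real `class_df`, assuming one column per label.

```python
import pandas as pd

# Made-up probabilities shaped like the model output table:
# one row per text, one column per label.
toy_scores = pd.DataFrame({
    "nothate": [0.9, 0.2, 0.6, 0.1],
    "hate":    [0.1, 0.8, 0.4, 0.9],
})

# For each row, pick the label with the highest probability.
predicted_label = toy_scores.idxmax(axis=1)

# The mean of a boolean Series is the fraction of True values.
percent_hate = (predicted_label == "hate").mean() * 100

print(f"{percent_hate:.1f}% of the samples are classified as hate speech")
```

Picking the highest-scoring label is the same as using a 0.5 threshold when there are only two classes. If you care more about catching hate speech than about false alarms, you could compare the `hate` column to a lower threshold instead.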