## 04 Stance Detection

This script reads data from `../data` and performs stance detection. The base code is taken from [Hugging Face](https://huggingface.co/rwillh11/mdeberta_NLI_stance_NoContext), with modifications to (1) average across target synonyms and (2) work with our data. Note that torch would not import with Python 3.11, so this script was switched to Python 3.10.9.
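
To make modification (1) concrete, here is a minimal sketch of how per-synonym scores could be combined into one score per stance. The function name `aggregate_scores` and the probabilities are hypothetical and purely illustrative; they are not model output.

```python
from statistics import mean


def aggregate_scores(probs_by_word, how="mean"):
    """Combine per-synonym entailment probabilities into a single score.

    probs_by_word: dict mapping a target synonym to a probability.
    how: "mean" averages across synonyms, "max" keeps the strongest one.
    """
    values = list(probs_by_word.values())
    if how == "mean":
        return mean(values)
    if how == "max":
        return max(values)
    raise ValueError(f"unknown aggregation: {how}")


# Hypothetical probabilities for one text and one stance (illustration only)
example_probs = {"shsat": 0.9, "test": 0.3, "exam": 0.3}

avg_score = aggregate_scores(example_probs, how="mean")
max_score = aggregate_scores(example_probs, how="max")
```

Averaging dampens the effect of a single synonym that happens to score high, whereas taking the maximum lets one synonym decide the stance.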

In [1]:
#libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import pandas as pd
import numpy as np
import json
from sklearn.metrics import confusion_matrix



### Off-the-Shelf Classifier

In [2]:
#read in data
df = pd.read_csv("../data/scraping_results.csv")

In [3]:
#load model and tokenizer
model_name = "rwillh11/mdeberta_NLI_stance_NoContext"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast = False)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [4]:
#set target and hypotheses
stance_list = ["positive", "negative", "neutral"]
target_group = ["shsat", "test", "exam", "specialized high school", "specialized high schools"]

#dict of hypothesis for each stance and word
hypotheses_list = dict.fromkeys(stance_list)

for stance in stance_list:
    hyp = {}
    for target_word in target_group:
        hyp[target_word] = f"The text is {stance} towards {target_word}."

    hypotheses_list[stance] = hyp
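
For reference, the same nested structure can be built with a dict comprehension. This is only an equivalent sketch of the loop above; the name `hypotheses` is introduced here for illustration.

```python
stance_list = ["positive", "negative", "neutral"]
target_group = ["shsat", "test", "exam", "specialized high school", "specialized high schools"]

# Outer keys are stances, inner keys are target words,
# values are the NLI hypothesis sentences.
hypotheses = {
    stance: {word: f"The text is {stance} towards {word}." for word in target_group}
    for stance in stance_list
}

print(hypotheses["negative"]["exam"])
```

Each stance therefore maps to five hypotheses, one per target synonym.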


In [5]:
def stance_detection(headline, hypotheses):
    """
    takes in a headline and computes probability of each
    hypothesis being correct. returns probabilities for 
    each stance as well as final detected stance
    """
    #dictionary of results
    results = {}

    #compute prob of stance for each hypothesis
    with torch.no_grad():
        for stance, hyp_list in hypotheses.items():
            temp_results = {}
            for word, hyp  in hyp_list.items():
                inputs = tokenizer(headline, hyp, return_tensors="pt", truncation=True)
                outputs = model(**inputs)
                probs = torch.softmax(outputs.logits, dim=-1)
                #index 0 is taken to be the entailment class
                temp_results[word] = probs[0][0].item()
            #average prob across target words
            results[stance] = sum(temp_results.values()) / len(temp_results)

    #select stance with highest entailment probability
    results.update({"max": max(results, key = results.get)})
    return results
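
The function reads the entailment probability from class index 0. A minimal sketch of how that index could be checked against a label mapping rather than hardcoded is shown here. The helper `find_entailment_index` and the example mapping are hypothetical; with the loaded model, the mapping would presumably come from `model.config.id2label`, and its actual contents should be inspected before relying on this.

```python
def find_entailment_index(id2label):
    """Return the class index whose label name starts with 'entail'."""
    matches = [
        idx for idx, label in id2label.items()
        if str(label).lower().startswith("entail")
    ]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one entailment label, got {id2label}")
    return matches[0]


# Hypothetical mapping for illustration; not read from the real model
example_id2label = {0: "entailment", 1: "not_entailment"}
entail_idx = find_entailment_index(example_id2label)
```

If the mapping only contains generic names such as `LABEL_0`, the helper raises an error instead of guessing, which is a prompt to consult the model card.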
        

In [6]:
#dict for storage
stance_dict = {}

In [7]:
#get stance for each text
for row in range(0, len(df)):
    temp = stance_detection(df["full_text"].iloc[row], hypotheses = hypotheses_list)
    stance_dict[df["text_id"].iloc[row]] = temp

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.

In [8]:
#transform into df
stance_df = pd.DataFrame.from_dict(stance_dict, orient="index")
stance_df.index.name = "text_id"
stance_df = stance_df.reset_index()

In [9]:
#merge it with text data
df_stance = df.merge(stance_df, on = "text_id", how = "left")

In [10]:
#save df
df_stance.to_csv("../data/stance_detection_df.csv", index=False)

### Validation

In [11]:
#read in data
df_stance = pd.read_csv("../data/stance_detection_df.csv")

In [12]:
#create validation dataset (fixed seed so the same 200 texts are drawn on rerun)
validation_df = df_stance[["text_id", "full_text"]].sample(200, random_state=42)

validation_df["Coder"] = None

In [None]:
#export data
validation_df.to_csv("../validation/validation_uncoded.csv", index = False)
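
Once the exported file has been hand-coded, the human labels can be compared with the detected stance. The sketch below uses the `confusion_matrix` function imported at the top. The two lists are made-up stand-ins for the `Coder` and `max` columns, not real coding results.

```python
from sklearn.metrics import confusion_matrix

labels = ["positive", "negative", "neutral"]

# Hypothetical labels for illustration only
human = ["positive", "negative", "neutral", "negative", "neutral"]
model_pred = ["positive", "negative", "negative", "negative", "neutral"]

# Rows are human labels, columns are model predictions, both ordered as in `labels`
cm = confusion_matrix(human, model_pred, labels=labels)

# Share of texts where the model agrees with the human coder
accuracy = sum(h == m for h, m in zip(human, model_pred)) / len(human)

print(cm)
print(accuracy)
```

Passing `labels` explicitly keeps the row and column order fixed even if one stance never appears in the sample.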