# Sentiment Analysis of Text Data

#### Introduction to sentiment analysis

Sentiment analysis consists of assigning the label "positive", "negative" or "neutral" to a document based on its overall polarity. For example, this sentence would have a negative sentiment: "Climate change is terrible. I am really worried."

While it seems simple, sentiment analysis can be quite complex. For instance, sarcasm is hard to identify. Notice how minor changes to the above sentence would lead a human to understand it differently, while a computer would have a harder time: "Sure, sure, I get it, climate change is terrible. I am really worried..."

Sentiment analysis is related to other natural language processing tasks, such as assigning a topic to a document (topic modeling). It is also related to stance detection, but the two are not the same: sentiment analysis focuses on the overall tone of a document, while stance detection focuses on the tone of a document toward a specific entity.

#### Sentiment analysis for research

Sentiment analysis can be useful for research with text data, for example, to analyze social media posts, open-ended survey responses, news, and political speeches.

This notebook covers different approaches to sentiment analysis focusing on research applications. Many methods and datasets for sentiment analysis are tailored for industry applications and may not work well off-the-shelf in a research context.

#### What to expect from this notebook

This notebook aims to convey that there are different approaches to sentiment analysis, give you a sense of what they are and when to consider them, and show you possible basic implementations. However, **each of the approaches has many more details to consider**, which are outside the scope of this notebook. You are always welcome to [submit a consult request with Research Computing and Data Services](https://app.smartsheet.com/b/form/2f2ec327e6164f83b588b7bbe2e2b56f) if you need help with the data-related aspects of your research.

## Import libraries

In [None]:
# To use dataframes
import pandas as pd

# To use the VADER dictionary and NLTK's tokenizer
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('punkt_tab')
from nltk import tokenize

# To use a pre-trained model from Hugging Face
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# To use structured output
from pydantic import BaseModel
from typing import Literal

# To use OpenAI API's key in Google Colab
# https://drlee.io/how-to-use-secrets-in-google-colab-for-api-key-protection-a-guide-for-openai-huggingface-and-c1ec9e1277e0
from google.colab import userdata
import os

# To use OpenAI's API
from openai import OpenAI

# To train a model from scratch
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score

## Read data

The data are available in [this GitHub repo](https://github.com/emiliolehoucq/trainings/tree/main/data).
- `tweets.csv` is a sample of 5000 rows from [this replication package](https://doi.org/10.7910/DVN/SFQTJZ) for [this article](https://doi.org/10.1073/pnas.2210988119). The original dataset consisted of 18,896,054 publicly available tweets from 1 January 2019 to 31 December 2021 that mention climate change in the text of the post.
- `nyt.csv` is a sample of 5000 rows from [this replication package](https://doi.org/10.7910/DVN/FVRZYU) for [this article](https://doi.org/10.1111/ajps.12702). The original dataset consisted of 9,341 *New York Times* articles that contain phrases related to economic mobility (i.e., "upward mobility", "land of opportunity", "self-made success").

In [None]:
df_nyt = pd.read_csv("https://raw.githubusercontent.com/emiliolehoucq/trainings/refs/heads/main/data/nyt.csv")
df_tweets = pd.read_csv("https://raw.githubusercontent.com/emiliolehoucq/trainings/refs/heads/main/data/tweets.csv")

# Lowercase column names in df_tweets
df_tweets.columns = df_tweets.columns.str.lower()

## Explore data

In [None]:
df_nyt.shape

In [None]:
df_nyt.head(1)

We'll focus on the `text` column.

In [None]:
# Define a function to print "=" to make the output easier to read
def print_format():
  for _ in range(5):
    print("=" * 100)

for i, row in df_nyt.sample(10, random_state=51425).iterrows():
  print(row['text'])
  print_format()

Notice that the text has been processed to some degree.

You can learn more about text processing in this [workshop on parsing text with NLTK](https://github.com/nuitrcs/parsing_text_nltk).

In [None]:
df_tweets.shape

In [None]:
df_tweets.head(1)

We'll focus on the `text` column.

In [None]:
for i, row in df_tweets.sample(10, random_state=51425).iterrows():
  print(row['text'])
  print_format()

Notice that the text has been processed to some degree.

## Overview of approaches to sentiment analysis

This section provides an overview of different approaches to sentiment analysis, their pros and cons, and some possible use cases. **Keep in mind that you can potentially use approaches in combination.**

#### Warning on evaluation

Given data and time constraints, this notebook does not delve deep into evaluation. **However, regardless of which approach you use for a particular project, it is important to evaluate its performance. Typically, evaluation requires manually labeling a sample (for example, a simple random sample or a stratified sample) and comparing those labels against the output of the computational method using [metrics such as accuracy, recall, and precision](https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall). Also, as you (probably iteratively) decide on and implement a particular approach, it is important to understand your corpus well, keep in mind your research focus, and quickly check how a given approach seems to be performing. Do not leave evaluation until the end. Finally, you may want to try different approaches and compare their performance.**
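
As a rough illustration of the comparison described above, here is a minimal sketch that computes accuracy, precision, and recall by hand. The labels are made up for the example, and `evaluate` and `target_class` are hypothetical names, not part of any library. In a real project you would more likely use scikit-learn's metrics.

```python
def evaluate(manual_labels, predicted_labels, target_class):
    """Return accuracy, plus precision and recall for one class of interest."""
    if len(manual_labels) != len(predicted_labels):
        raise ValueError("Both label lists must have the same length.")
    pairs = list(zip(manual_labels, predicted_labels))
    correct = sum(1 for truth, pred in pairs if truth == pred)
    # Documents where both the human and the method chose the target class
    true_pos = sum(1 for truth, pred in pairs if truth == pred == target_class)
    predicted_pos = sum(1 for _, pred in pairs if pred == target_class)
    actual_pos = sum(1 for truth, _ in pairs if truth == target_class)
    return {
        "accuracy": correct / len(pairs),
        "precision": true_pos / predicted_pos if predicted_pos else 0.0,
        "recall": true_pos / actual_pos if actual_pos else 0.0,
    }

# Hypothetical hand labels and method output for six documents
manual = ["negative", "negative", "neutral", "positive", "negative", "positive"]
predicted = ["negative", "neutral", "neutral", "positive", "positive", "positive"]

metrics = evaluate(manual, predicted, target_class="negative")
print(metrics)
```

In this made-up sample the method is always right when it says "negative" (high precision) but misses most of the negative documents (low recall), which accuracy alone would not reveal.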

### Dictionary or lexicon

Definition: assign sentiment based on the prevalence of positive vs. negative words contained in a dictionary or lexicon.<br>
Pros: transparent, easy to implement, doesn't require labeled data.<br>
Cons: doesn't account for how words are used (ambiguity, context, domain-specific connotations, sarcasm, negation), may not identify relevant words or tokens, requires tokenization.<br>
Use cases: baseline to compare other approaches, small dataset, too expensive to label data, secondary analysis.<br>

There are various common dictionaries used in sentiment analysis. They are designed for different use cases and can vary in size and complexity.

For example, the [VADER](https://doi.org/10.1609/icwsm.v8i1.14550) lexicon is designed for short, informal documents such as social media posts or open-ended survey responses. VADER uses a crowdsourced vocabulary of over 7,500 terms including slang, emojis, and unconventional spelling. Further, VADER doesn't rely solely on word counts; it also applies rules. For instance, exclamation marks and all caps act as multipliers, and intensifying adjectives and adverbs can strengthen the sentiment of the term that they modify. Finally, VADER evaluates terms in local-window contexts of three words to capture negations and flip sentiment polarity accordingly.
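
To make the idea of rules on top of word counts more concrete, here is a toy sketch. It is **not** VADER's implementation: the lexicon, the multipliers, and the function `toy_score` are all invented for illustration. It only mimics two of the rules described above, an intensifier right before a term and a negation within the three preceding tokens.

```python
# Toy lexicon and rule values, invented for illustration (not VADER's)
LEXICON = {"terrible": -2.0, "worried": -1.5, "great": 2.0, "love": 2.5}
INTENSIFIERS = {"really": 1.5, "very": 1.3}
NEGATIONS = {"not", "never", "no"}
NEGATION_WINDOW = 3  # How many preceding tokens to check for a negation

def toy_score(text):
    """Sum lexicon valences, applying an intensifier rule and a negation rule."""
    for mark in "!.,":
        text = text.replace(mark, " ")
    tokens = text.lower().split()
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        valence = LEXICON[token]
        # An intensifier directly before the term scales its valence
        if i > 0 and tokens[i - 1] in INTENSIFIERS:
            valence *= INTENSIFIERS[tokens[i - 1]]
        # A negation within the preceding window flips the polarity
        window = tokens[max(0, i - NEGATION_WINDOW):i]
        if any(word in NEGATIONS for word in window):
            valence *= -1
        total += valence
    return total

plain = toy_score("Climate change is terrible. I am worried.")
intensified = toy_score("Climate change is terrible. I am really worried.")
negated = toy_score("I am not worried.")
print(plain, intensified, negated)
```

Even in this toy version, "really" makes the second sentence more negative than the first, and "not" flips the third one to a positive score.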

Another example is [Lexicoder Sentiment Dictionary (LSD)](https://doi.org/10.1080/10584609.2012.671234), which is designed to measure sentiment in political texts by including terms specific to political discourse that do not exist in other lexicons.

For this notebook, we're going to use the [VADER dictionary](https://www.nltk.org/api/nltk.sentiment.vader.html#module-nltk.sentiment.vader) from the [NLTK](https://www.nltk.org/) library. ([More information here](https://www.analyticsvidhya.com/blog/2022/10/sentiment-analysis-using-vader/).)

In [None]:
# Initialize the sentiment analyzer
sia = SentimentIntensityAnalyzer()

In [None]:
# Example
sia.polarity_scores("This workshop is amazing! Love it! Great instructor!")

Notice that the sentiment analyzer takes full sentences. If your document has more than one sentence, you need to [use a sentence tokenizer](https://www.nltk.org/howto/sentiment.html#vader). In this case, since the tweets have already been processed, we're going to pretend that they are only one sentence.

In [None]:
# Apply the sentiment analyzer to the text
df_tweets['sentiment_scores'] = df_tweets['text'].apply(sia.polarity_scores)

In [None]:
df_tweets.columns

In [None]:
for i, row in df_tweets.sample(10, random_state=51425).iterrows():
  print(row['text'])
  print(row['sentiment_scores'])
  print_format()

In [None]:
# Extract compound score
# Learn more about the compound score: https://stackoverflow.com/questions/40325980/how-is-the-vader-compound-polarity-score-calculated-in-python-nltk
df_tweets['compound'] = df_tweets['sentiment_scores'].apply(lambda x: x['compound'])

# Define function to categorize sentiment
# Notice that this is a choice you have to make
# There are other ways of classifying
# You have to think about the "right" way for your project
def categorize_sentiment(compound_score):
    if compound_score >= 0.05:
        return 'positive'
    elif compound_score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

# Apply categorization
df_tweets['sentiment'] = df_tweets['compound'].apply(categorize_sentiment)
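
To see why the cutoff is a choice that matters, here is a small sketch with the threshold as a parameter. The function name `categorize_with_threshold`, the example scores, and the alternative value of 0.25 are all hypothetical and only serve to show how a wider neutral band changes the labels.

```python
def categorize_with_threshold(compound_score, threshold=0.05):
    """Label a compound score, treating scores between -threshold and threshold as neutral."""
    if compound_score >= threshold:
        return "positive"
    if compound_score <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical compound scores
scores = [-0.60, -0.10, 0.02, 0.20, 0.75]

narrow_band = [categorize_with_threshold(s, 0.05) for s in scores]
wide_band = [categorize_with_threshold(s, 0.25) for s in scores]
print(narrow_band)
print(wide_band)
```

With the wider band, the two mildly scored documents move from "negative" and "positive" to "neutral". Comparing the labels against a manually labeled sample is one way to decide which cutoff fits your corpus.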

In [None]:
for i, row in df_tweets.sample(10, random_state=51425).iterrows():
  print(row['text'])
  print(row['sentiment_scores'])
  print(row['sentiment'])
  print_format()

#### Exercise

Apply the approach above to the `df_nyt` dataset.

In [None]:
# Tokenize each document
df_nyt['sentences'] = df_nyt['text'].apply(tokenize.sent_tokenize)

In [None]:
for i, row in df_nyt.sample(10, random_state=51425).iterrows():
  print(row['text'])
  print(row['sentences'])
  print_format()

In [None]:
# Subset dataset for speed
df_nyt_subset = df_nyt.sample(10, random_state=51425)

# Analyze each sentence
df_nyt_subset['sentiment_scores'] = df_nyt_subset['sentences'].apply(lambda x: [sia.polarity_scores(sentence) for sentence in x])

In [None]:
for i, row in df_nyt_subset.sample(10, random_state=51425).iterrows():
  print(row['sentences'])
  print(row['sentiment_scores'])
  print_format()

In [None]:
# Extract the compound score
df_nyt_subset['compound'] = df_nyt_subset['sentiment_scores'].apply(lambda x: [score['compound'] for score in x])

In [None]:
for i, row in df_nyt_subset.sample(10, random_state=51425).iterrows():
  print(row['sentences'])
  print(row['compound'])
  print_format()

In [None]:
# Calculate the average compound score for each document
# Note that this is a choice. There are different options and you have to decide for your project
df_nyt_subset['average_compound'] = df_nyt_subset['compound'].apply(lambda x: sum(x) / len(x))
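
As a sketch of some of those other options, the snippet below compares the mean with the median and with the share of negative sentences. The function `summarize_compounds` and the example scores are hypothetical. The sketch also returns `None` for a document with no sentences, a case where dividing by `len(x)` would raise an error.

```python
from statistics import mean, median

def summarize_compounds(compounds, threshold=0.05):
    """Return several document-level summaries of sentence compound scores."""
    if not compounds:
        return None  # A document with no sentences has no score
    negative_share = sum(1 for c in compounds if c <= -threshold) / len(compounds)
    return {
        "mean": mean(compounds),
        "median": median(compounds),
        "negative_share": negative_share,
    }

# Hypothetical sentence scores: mostly mild, one strongly negative sentence
compounds = [0.1, 0.2, 0.0, -0.9]
summary = summarize_compounds(compounds)
print(summary)
```

Here a single strongly negative sentence pulls the mean below zero while the median stays slightly above zero, so the document-level label can depend on which summary you pick.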

In [None]:
for i, row in df_nyt_subset.sample(10, random_state=51425).iterrows():
  print(row['sentences'])
  print(row['compound'])
  print(row['average_compound'])
  print_format()

In [None]:
# Calculate sentiment label for each document
df_nyt_subset['sentiment'] = df_nyt_subset['average_compound'].apply(categorize_sentiment)

In [None]:
for i, row in df_nyt_subset.sample(10, random_state=51425).iterrows():
  print(row['text'])
  print(row['sentiment'])

### Pre-trained classifier

Definition: using a machine learning model that has already been trained for sentiment analysis.<br>
Pros: easy to implement, potentially better than dictionaries.<br>
Cons: opaque, requires a pre-trained model appropriate for the task, can be biased.<br>
Use cases: many of the same use cases as dictionary-based sentiment analysis--baseline to compare other approaches, small dataset, too expensive to label data, secondary analysis.<br>

Hugging Face is a good place to [find pre-trained models](https://huggingface.co/models) and has a [task page](https://huggingface.co/tasks/text-classification) with information about text classification, including sentiment analysis.

In [None]:
# Load the sentiment analysis pipeline
model_name = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
sentiment_pipeline = pipeline("sentiment-analysis", model=model_name)

You have to look at the model card for information about the specific model. For example, this is the [model card for `distilbert/distilbert-base-uncased-finetuned-sst-2-english`](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).

In [None]:
# Select a subset of the dataset because this can take a bit
df_tweets_subset = df_tweets.sample(50, random_state=51425)

# Apply the sentiment analysis pipeline to each tweet
df_tweets_subset['sentiment'] = df_tweets_subset['text'].apply(lambda x: sentiment_pipeline(x))

In [None]:
for i, row in df_tweets_subset.sample(10, random_state=51425).iterrows():
  print(row['text'])
  print(row['sentiment'])
  print_format()

#### Exercise

Apply the approach above to the `df_nyt` dataset.

In [None]:
# Notice that it's a bit more complicated because the documents are longer than what the model takes
# You need to make a decision about how to deal with that
# Here I'm tokenizing at the sentence level and then calculating the sentiment of each sentence
# You could go about it differently

def calculate_sentiment_scores(sentences, sentiment_pipeline):
    """
    Calculates sentiment scores for a list of sentences using a sentiment analysis pipeline.

    Args:
        sentences (list): A list of text strings to analyze.
        sentiment_pipeline (callable): A sentiment analysis function or pipeline that accepts a sentence and returns a list of dictionaries with 'label' and 'score'.

    Returns:
        list: A list of sentiment analysis results for each sentence.
    """
    # Apply the sentiment pipeline to each sentence and collect results in a list
    sentiment_scores = [sentiment_pipeline(sentence) for sentence in sentences]
    return sentiment_scores


def categorize_scores(input_list):
    """
    Categorizes sentiment scores into POSITIVE and NEGATIVE buckets.

    Args:
        input_list (list): List of sentiment analysis results, each expected to be a list with one dictionary.

    Returns:
        dict: Dictionary with two keys 'POSITIVE' and 'NEGATIVE', each containing a list of scores.
    """
    result = {'POSITIVE': [], 'NEGATIVE': []}  # Initialize result dictionary with empty lists
    for item in input_list:
        label = item[0]['label']  # Extract the sentiment label ('POSITIVE' or 'NEGATIVE')
        score = item[0]['score']  # Extract the associated confidence score
        if label == 'POSITIVE':
            result['POSITIVE'].append(score)  # Append score to 'POSITIVE' list
        elif label == 'NEGATIVE':
            result['NEGATIVE'].append(score)  # Append score to 'NEGATIVE' list
    return result


def key_with_more_elements(input_dict):
    """
    Identifies the key in a dictionary that has more elements in its value list.

    Args:
        input_dict (dict): Dictionary where values are lists.

    Returns:
        str: Key with the most elements.
    """
    # Use max with key function to find the key with the longest list
    return max(input_dict, key=lambda k: len(input_dict[k]))


def key_with_highest_average(scores_dict):
    """
    Finds the key with the highest average score in a dictionary of lists.

    Args:
        scores_dict (dict): Dictionary with keys 'POSITIVE' and 'NEGATIVE' and list of scores as values.

    Returns:
        str: Key with the highest average score.
    """
    max_avg = float('-inf')  # Initialize max average with the lowest possible float
    best_key = None  # Initialize variable to store the best key

    for key, scores in scores_dict.items():
        if not scores:
            continue  # Skip labels with no sentences to avoid dividing by zero
        avg_score = sum(scores) / len(scores)  # Compute average score for each sentiment
        if avg_score > max_avg:
            max_avg = avg_score  # Update max average
            best_key = key  # Update key with highest average

    return best_key


def calculate_sentiment_more_common(sentences, sentiment_pipeline):
    """
    Determines the more common sentiment (POSITIVE or NEGATIVE) based on number of occurrences.

    Args:
        sentences (list): List of text sentences.
        sentiment_pipeline (callable): A sentiment analysis function or pipeline that accepts a sentence and returns a list of dictionaries with 'label' and 'score'.

    Returns:
        str: Sentiment label with more occurrences.
    """
    sentiment_scores = calculate_sentiment_scores(sentences, sentiment_pipeline)  # Get sentiment scores for all sentences
    scores_dict = categorize_scores(sentiment_scores)  # Categorize scores into POSITIVE and NEGATIVE
    sentiment = key_with_more_elements(scores_dict)  # Identify which sentiment appears more frequently
    return sentiment


def calculate_sentiment_higher_average(sentences, sentiment_pipeline):
    """
    Determines the sentiment with the highest average confidence score.

    Args:
        sentences (list): List of text sentences.
        sentiment_pipeline (callable): A sentiment analysis function or pipeline that accepts a sentence and returns a list of dictionaries with 'label' and 'score'.

    Returns:
        str: Sentiment label with the highest average score.
    """
    sentiment_scores = calculate_sentiment_scores(sentences, sentiment_pipeline)  # Get sentiment scores for all sentences
    scores_dict = categorize_scores(sentiment_scores)  # Categorize scores into POSITIVE and NEGATIVE
    sentiment = key_with_highest_average(scores_dict)  # Find sentiment with the highest average score
    return sentiment

sentences = df_nyt_subset['sentences'].to_list()[0][:5]
print(sentences)
print(calculate_sentiment_more_common(sentences, sentiment_pipeline))
print(calculate_sentiment_higher_average(sentences, sentiment_pipeline))

Notice that the answer can differ depending on how you aggregate. These are all choices that you have to think about and decide based on the context of your research.
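
To illustrate with one more option, the sketch below compares a majority vote with a signed average, where the confidence of each `NEGATIVE` sentence counts as a negative number. The per-sentence results are made up (shaped like the pipeline output used above) so the example runs without loading a model, and the function names are hypothetical.

```python
# Hypothetical per-sentence results, shaped like the pipeline output above
results = [
    [{"label": "POSITIVE", "score": 0.51}],
    [{"label": "POSITIVE", "score": 0.55}],
    [{"label": "POSITIVE", "score": 0.53}],
    [{"label": "NEGATIVE", "score": 0.99}],
    [{"label": "NEGATIVE", "score": 0.98}],
]

def majority_label(results):
    """Label that occurs most often across sentences."""
    labels = [item[0]["label"] for item in results]
    return max(set(labels), key=labels.count)

def signed_mean_label(results):
    """Average the confidences after giving NEGATIVE scores a minus sign."""
    signed = [
        item[0]["score"] if item[0]["label"] == "POSITIVE" else -item[0]["score"]
        for item in results
    ]
    average = sum(signed) / len(signed)
    # Treating an average of exactly zero as POSITIVE is an arbitrary choice
    return "POSITIVE" if average >= 0 else "NEGATIVE"

vote = majority_label(results)
signed = signed_mean_label(results)
print(vote, signed)
```

Three weakly positive sentences win the vote, but two confidently negative sentences dominate the signed average, so the two rules disagree on this made-up document.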

In [None]:
# Take even smaller subset for speed
df_nyt_sub_subset = df_nyt.sample(5, random_state=51425)

# Apply calculate_sentiment_more_common to dataset
df_nyt_sub_subset['sentiment'] = df_nyt_sub_subset['sentences'].apply(lambda x: calculate_sentiment_more_common(x, sentiment_pipeline))

In [None]:
for i, row in df_nyt_sub_subset.iterrows():
  print(row['text'])
  print(row['sentiment'])
  print_format()

### Using a decoder model

Definition: using a decoder model (such as the one behind ChatGPT, which goes from text to text) to get sentiment labels.<br>
Pros: easy to implement, flexible, good language understanding.<br>
Cons: opaque, computationally intensive, not optimized for the task, you may have to pay, can be biased.<br>
Use cases: many of the same use cases as dictionary-based sentiment analysis, particularly if there is not a good dictionary or pre-trained model for your task.<br>

For [more information about encoder vs. decoder models, you can consult this workshop](https://github.com/nuitrcs/CoDEx-Choose-Your-LLM), which also provides advice on how to choose an LLM for your research project.

For this notebook, we'll use the [OpenAI API](https://platform.openai.com/docs/overview). (You can see the [billing here](https://platform.openai.com/settings/organization/billing/overview), [get API keys here](https://platform.openai.com/api-keys), and [check your usage here](https://platform.openai.com/settings/organization/usage).) You can find open-source decoder models in [Hugging Face](https://huggingface.co/models) and [Ollama](https://ollama.com/library), both of which you can use locally or on [Quest](https://www.it.northwestern.edu/departments/it-services-support/research/computing/quest/).

**Keep in mind that you may not be able to use APIs (such as OpenAI's) or even open-source models on Quest depending on the privacy and security level of your data. Please consult [Northwestern's Guidance on the Use of Generative AI](https://www.it.northwestern.edu/about/policies/guidance-on-the-use-of-generative-ai.html) and feel free to [submit a consult request with Research Computing and Data Services](https://app.smartsheet.com/b/form/2f2ec327e6164f83b588b7bbe2e2b56f) if you have any questions.**

In [None]:
# Set API key as an environment variable
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Initialize the OpenAI client
client = OpenAI()

# Define a Pydantic model for sentiment analysis response
# https://platform.openai.com/docs/guides/structured-outputs
class SentimentResponse(BaseModel):
    # The 'sentiment' field must be one of the specified literal values
    sentiment: Literal["positive", "negative", "neutral", "unsure"]

# Define function to classify the sentiment
def classify_sentiment(text):
    """
    Classify the sentiment of the given text as positive, negative, neutral, or unsure.

    Input:
        text (str): A string containing the text to analyze for sentiment.

    Output:
        str: One of "positive", "negative", "neutral", or "unsure", representing the sentiment of the input text.
    """
    # Call the OpenAI API using the structured output parsing interface
    response = client.beta.chat.completions.parse(
        model="gpt-4o",  # Use the GPT-4o model for generating the response
        messages=[
            # Provide system instructions to guide the assistant's behavior
            {
                "role": "system",
                "content": "You are a sentiment analysis assistant. Classify the sentiment of the provided text as positive, negative, neutral, or unsure."
            },
            # Include the user-provided text as input for classification
            {"role": "user", "content": text},
        ],
        response_format=SentimentResponse  # Use the SentimentResponse schema to enforce structured output
    )

    # Return the sentiment value parsed from the structured response
    return response.choices[0].message.parsed.sentiment

In [None]:
example_tweet = df_tweets['text'].to_list()[10]
print(example_tweet)
# print(classify_sentiment(example_tweet)) # Commented out to avoid sending repeated requests
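
Since each request can cost money, one possible way to avoid repeated requests is to cache results so that each distinct text is only sent once. The sketch below is an assumption about how you might do that, not part of the OpenAI library. `make_cached_classifier` is a hypothetical helper, and `fake_classify` is a stand-in for `classify_sentiment` so that the example runs without an API key.

```python
def make_cached_classifier(classify_fn):
    """Wrap a classifier so that each distinct text is only classified once."""
    cache = {}

    def cached(text):
        if text not in cache:
            cache[text] = classify_fn(text)  # Only called for unseen texts
        return cache[text]

    return cached

# Stand-in for classify_sentiment; it records every call it receives
calls = []

def fake_classify(text):
    calls.append(text)
    return "negative" if "worried" in text else "unsure"

cached_classify = make_cached_classifier(fake_classify)
labels = [cached_classify(t) for t in ["I am worried", "I am worried", "Hello"]]
print(labels, len(calls))
```

Three texts were classified but the stand-in was only called twice. Note that this cache lives in memory, so it is lost when the runtime restarts. Saving results to a file would be needed to persist them.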

#### Exercise

Use the inference widget for [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), [HuggingChat](https://huggingface.co/chat/), and/or [ChatGPT](https://chatgpt.com/) to copy and paste some of the documents from `df_nyt` and get the sentiment. Compare the output that you get with the output of some of the previous approaches to sentiment analysis.

### Training a classifier from scratch

Definition: supervised learning model trained to classify documents.<br>
Pros: can perform better than off-the-shelf approaches, tailored to the specific task, can be transparent.<br>
Cons: requires more work, requires labeled data, can require feature engineering, can be opaque, can be computationally intensive.<br>
Use cases: easier-to-implement approaches don't work, core analysis, you have a fair amount of labeled data, you want to label a large number of documents, or the model will become part of a pipeline for your lab/future work.<br>

You can find [more information about training a classifier from scratch in this workshop](https://github.com/nuitrcs/scikit-learn-workshop). You can also attend [Research Computing and Data Services' Scikit-Learn workshop this summer](https://www.it.northwestern.edu/departments/it-services-support/research/training-and-consultation/research-code-academy.html). [This free book](https://www.statlearning.com/) is also a good place to learn more about supervised learning.
<br>
<br>
Remember that in `df_tweets` we created a column called `sentiment` using the VADER dictionary:

In [None]:
df_tweets['sentiment'].value_counts()

For the sake of this notebook, we're going to use that column as our label to create a classifier from scratch. **Of course, in real life research it wouldn't make sense to do that. You'd want to have ground truth data to train a classifier from scratch, most likely a set of manually labeled data.**
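
If you do label data manually, you first need to decide which documents to label. As one possibility, the sketch below draws a stratified sample so that rare categories are not missed. Everything here is hypothetical: the records, the `prelim` field (imagine a preliminary dictionary-based label used only to define strata), and the helper `stratified_sample`.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_key, per_stratum, seed=0):
    """Draw up to `per_stratum` records at random from each stratum."""
    rng = random.Random(seed)  # Fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[stratum_key]].append(record)
    sample = []
    for stratum in sorted(groups):
        members = groups[stratum]
        # Small strata contribute all of their members
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# Hypothetical records with an imbalanced preliminary label
records = (
    [{"text": f"tweet {i}", "prelim": "negative"} for i in range(50)]
    + [{"text": f"tweet {i}", "prelim": "neutral"} for i in range(50, 60)]
    + [{"text": f"tweet {i}", "prelim": "positive"} for i in range(60, 63)]
)

to_label = stratified_sample(records, "prelim", per_stratum=5)
print(len(to_label))
```

Keep in mind that sampling strata at different rates means the sample no longer mirrors the corpus, so metrics computed on it may need to be weighted accordingly.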

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df_tweets['text'],
    df_tweets['sentiment'],
    test_size=0.2, # 80% train, 20% test
    random_state=51425
)

In [None]:
X_train.shape

In [None]:
X_train.head()

In [None]:
X_test.shape

In [None]:
X_test.head()

In [None]:
y_train.shape

In [None]:
y_train.head()

In [None]:
y_test.shape

In [None]:
y_test.head()

In [None]:
# Build a pipeline with:
# https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
# https://github.com/nuitrcs/sklearn_pipelines
# 1. TfidfVectorizer: converts text into numerical features using TF-IDF
# https://en.wikipedia.org/wiki/Document-term_matrix
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
# https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/
#    - ngram_range=(1,1): unigrams only
#    - max_features=1000: limit vocabulary to top 1000 terms
# 2. MultinomialNB: Naive Bayes classifier suited for word count features
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
# "The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work."
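# Note: this reuses the name `pipeline`, which shadows the `pipeline` function imported from transformers
pipeline = Pipeline([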
    ('tfidf', TfidfVectorizer(ngram_range=(1, 1), max_features=1000)),
    ('nb', MultinomialNB())
])

pipeline
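
To give a sense of what the TF-IDF step in the pipeline computes, here is a hand-rolled sketch on three made-up documents. It follows the smoothed formula and unit-length scaling that scikit-learn documents as the defaults for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`), but it uses a plain whitespace split instead of scikit-learn's tokenizer, so treat it as an approximation for intuition only.

```python
import math
from collections import Counter

# Hypothetical mini-corpus
docs = [
    "climate change is real",
    "climate policy is hard",
    "great policy",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    """TF-IDF weights for one document, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    weights = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vector = tfidf_vector(tokenized[0])
print(vector)
```

In the first document, "change" appears in only one document of the corpus while "climate" appears in two, so "change" receives the larger weight. This is the sense in which TF-IDF emphasizes terms that distinguish a document from the rest.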

In [None]:
# Train the pipeline
pipeline.fit(X_train, y_train)

In [None]:
# Predict on the test set
y_pred = pipeline.predict(X_test)

y_pred[:10]

In [None]:
# Evaluate the model
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
print("Classification Report:\n", classification_report(y_test, y_pred))

#### Exercise

Apply the approach above to the same dataset, but using a support vector machine and/or logistic regression instead of multinomial Naive Bayes.

In [None]:
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 1), max_features=1000)),
    # https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
    # "LinearSVC is another (faster) implementation of Support Vector Classification for the case of a linear kernel."
    ('svm', LinearSVC())
])

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

print("Classification Report:\n", classification_report(y_test, y_pred))

In [None]:
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 1), max_features=1000)),
    # https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
    ('logreg', LogisticRegression(max_iter=1000))
])

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

print("Classification Report:\n", classification_report(y_test, y_pred))

### Fine-tuning a language model

Definition: using a pre-trained language model and tailoring it to your specific task.<br>
Pros: takes advantage of the knowledge of pre-trained models, can work really well.<br>
Cons: opaque, computationally intensive, requires labeled data, can be biased.<br>
Use cases: similar use cases as for training a classifier from scratch, typically with less data required.<br>

You can find open-source language models to fine-tune in [Hugging Face](https://huggingface.co/models).

Given time constraints, this notebook cannot cover fine-tuning. However, you can always [submit a consult request with Research Computing and Data Services](https://app.smartsheet.com/b/form/2f2ec327e6164f83b588b7bbe2e2b56f) if you need help with the data-related aspects of your research and attend [Research Computing and Data Services' fine-tuning workshop this summer](https://www.it.northwestern.edu/departments/it-services-support/research/training-and-consultation/research-code-academy.html).

#### Exercise

This notebook covered various approaches to sentiment analysis, particularly dictionaries or lexicons, pre-trained classifiers, using decoders, fine-tuning a language model, and training a model from scratch.

What approach(es) to sentiment analysis (if at all) would you use for each of these research projects?

1. A marketing scholar is studying sentiment in Amazon product reviews to compare customer satisfaction between sustainable and non-sustainable products. There are 10,000 labeled reviews.<br>
2. A health researcher has collected open-ended responses from 300 patients about their hospital experience. They want to summarize the overall sentiment to identify major areas of concern.<br>
3. A legal scholar is analyzing judicial opinions to see if courts have become more negative in tone toward environmental regulations over time. They have 1,000 documents and no labeled data.<br>
4. A sociologist is studying public reaction to a major protest movement using tweets. They have 50,000 tweets collected during a two-week period.<br>
5. A political scientist is analyzing how candidates express sentiment about the economy during campaign speeches. They want to track differences between parties and over time.<br>
6. A historian is analyzing personal letters written during the Great Depression to understand emotional tone. The data are unstructured and old-fashioned in language.<br>

## Conclusions and next steps to continue learning

This notebook covered various approaches to sentiment analysis, particularly dictionaries or lexicons, pre-trained classifiers, using decoders, fine-tuning a language model, and training a model from scratch. **Keep in mind that you can potentially use approaches in combination.**

The notebook also provided resources and code that you can use to continue learning about the different approaches and find out the one that works best for your project.

**Keep in mind that, while not elaborated on in this notebook, evaluating your approach is critical.**

You are always welcome to [submit a consult request with Research Computing and Data Services](https://app.smartsheet.com/b/form/2f2ec327e6164f83b588b7bbe2e2b56f) if you need help with the data-related aspects of your research.

This notebook is partly based on these three articles, where you can read more in-depth and find other references:
- Stine, R.A. (2019). Sentiment Analysis. *Annual Review of Statistics and Its Application*, 6, 287-308. (Available [here](https://doi.org/10.1146/annurev-statistics-030718-105242).)
- Bestvater, S.E. & Monroe, B.L. (2023). Sentiment is Not Stance: Target-Aware Opinion Classification for Political Text Analysis. *Political Analysis*, 31, 235-256. (Available [here](https://doi.org/10.1017/pan.2022.10).)
- Wankmüller, S. (2024). Introduction to Neural Transfer Learning With Transformers for Social Science Text Analysis. *Sociological Methods & Research*, 53(4), 1676-1752. (Available [here](https://doi.org/10.1177/00491241221134527).)

This notebook uses code produced by ChatGPT. That's okay to do as long as you consider the privacy, security, and intellectual property implications, as well as understand the code. We do have a [workshop on writing effective prompts for coding with LLMs](https://github.com/nuitrcs/promptEngineering).