# Section 9. Word Embeddings

#### Instructor: Pierre Biscaye 

This is the third of three notebooks covering the foundations for performing **text analysis** in Python. In the previous parts, we learned how to preprocess text and convert it into numeric representations with Bag of Words (BoW) and TF-IDF. These methods rely heavily on word frequency and make little use of the relative positions of words, so rich semantic (relating to meaning) and syntactic (relating to structure) information remains to be captured beyond independent word frequencies.

We need a more powerful tool that has the potential to represent rich semantics (and more) of our text data. In the final part of this series, we will dive into word embeddings, a method widely combined with more advanced Natural Language Processing (NLP) tasks. We'll make extensive use of the `gensim` package, which hosts a range of word embedding models, including `word2vec` and `glove`, the two models we'll explore today.

The content of this notebook is taken from UC Berkeley D-Lab's Python Text Analysis [course](https://github.com/dlab-berkeley/Python-Text-Analysis).

### Sections
1. Understand word embeddings
2. Word similarity and word analogy
3. Semantic axes and associations in word embeddings
4. Bringing it together: tweet sentiment prediction

In [None]:
# Run if you do not have gensim installed
# !pip install gensim

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gensim
import gensim.downloader as api
from gensim.models import KeyedVectors

# 1. Understand Word Embeddings

As famously put by British Linguist J.R. Firth:

> **You shall know a word by the company it keeps.**

This quote summarizes the essence of word embeddings, which take the numerical representation of text a step further to consider word context. 

Recall from notebook 9b that a BoW representation is a **sparse** matrix. Its dimension is determined by vocabulary size and the number of documents. Importantly, a sparse matrix like BoW is interpretable: the cell values refer to the count of a word in a document. Oftentimes the cell values are zeros: many words simply do not appear in a particular document.

We can likewise think of word embeddings as a matrix, but this time a **dense** matrix, where the cell values are real numbers. Word embeddings project a word's meaning onto a high-dimensional vector space, which is why they are also called **word vectors**. A word vector is essentially an array of real numbers, the length of which, as we'll see today, could be as low as 50, or as high as 300 (or even higher in Large Language Models). These real-number vectors do not make explicit sense to us, but they capture aspects of the semantic or syntactic meanings of words.

BoW:
- Sparse matrix
- Dimension: $D$ x $V$, where rows are **D**ocuments and columns are words in the **V**ocabulary.
- Interpretable: e.g., in a financial document, "bank" and "banker" could appear a lot of times but not "bang".

<img src='Images/bow-illustration-2.png' alt="BoW" width="500">

Word embeddings:
- Dense matrix
- Dimension: $V$ x $D$, where rows are **V**ocabulary and columns are vectors with dimension **D**.
- Not immediately interpretable

<img src='Images/bow-illustration-3.png' alt="BoW" width="500">
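To make the contrast concrete, here is a minimal sketch built on a made-up three-document corpus. The toy tokens, the variable names, and the randomly generated "embedding" values are all assumptions for illustration; real embeddings are learned from data, not drawn at random.

```python
import numpy as np

# Toy corpus (hypothetical), already tokenized
docs = [["bank", "banker", "bank"],
        ["banana", "bang"],
        ["bank", "loan"]]

# Vocabulary: sorted unique tokens, with a column index for each
vocab = sorted({tok for doc in docs for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}

# BoW: documents x vocabulary matrix of counts (mostly zeros)
bow = np.zeros((len(docs), len(vocab)), dtype=int)
for d, doc in enumerate(docs):
    for tok in doc:
        bow[d, index[tok]] += 1

# Embeddings: vocabulary x embedding-dimension matrix of real numbers
# (random values stand in for learned ones)
rng = np.random.default_rng(0)
embedding_dim = 4
embeddings = rng.normal(size=(len(vocab), embedding_dim))

print(vocab)
print(bow)
print("Share of zero cells in BoW:", np.mean(bow == 0))
print("Vector for 'bank':", embeddings[index["bank"]])
```

Each BoW cell can be read directly as a count, while the individual numbers in an embedding row have no standalone interpretation.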

Today, we are going to explore two widely used word embedding models, `word2vec` and `glove`. We will use the package `gensim` to access both models.

## `word2vec`

Before diving into `word2vec`, let's talk a bit of history first. The idea of word vectors, i.e., projecting a word's meaning onto a vector space, has been around for a long time. The `word2vec` model, proposed by [Mikolov et al.](https://arxiv.org/abs/1310.4546) in 2013, introduces an efficient model of word embeddings. Since then it has stimulated a new wave of research into this topic.

The key question asked in the 2013 paper is: how do we go about learning a good vector representation from the data?

Mikolov et al. proposed two approaches: the **continuous bag-of-words (CBOW)** and the **skip-gram (SG)**. Both are similar in that we use the vector representation of a token to try and predict what the nearby tokens are with a shallow neural network.   

Take the following sentence for example. If our target token is $w_t$, "banks", the context tokens would be the preceding tokens $w_{t-2}, w_{t-1}$ and the following ones $w_{t+1}, w_{t+2}$. This corresponds to a **window size** of 2: 2 words on either side of the target word. Similarly when we move onto the next target token, the context window (tokens underlined) moves as well.

<img src='Images/target_word.png' alt="Target word" width="500">

In the continuous bag-of-words model, the goal is to predict the target token, given the context tokens. In the skip-gram model, the task is to predict the context tokens from the target token. This is the reverse of the continuous bag-of-words, and is a harder task, since we have to predict more from less information.
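The two prediction tasks can be sketched as the training examples each one would see. This is a rough illustration in plain Python: the sentence and the helper `context_window` are hypothetical, and no actual model training happens here.

```python
def context_window(tokens, position, window_size=2):
    """Return the tokens within `window_size` positions of `position`."""
    start = max(0, position - window_size)
    left = tokens[start:position]
    right = tokens[position + 1:position + 1 + window_size]
    return left + right

# Hypothetical example sentence, already tokenized
sentence = ["the", "river", "banks", "were", "flooded"]

cbow_examples = []      # (context tokens -> target token)
skipgram_examples = []  # (target token -> one context token)

for t, target in enumerate(sentence):
    context = context_window(sentence, t)
    cbow_examples.append((context, target))
    skipgram_examples.extend((target, c) for c in context)

print(cbow_examples[2])       # context and target for "banks"
print(skipgram_examples[:3])  # first few (target, context) pairs
```

CBOW gets one example per target token, while skip-gram gets one example per (target, context) pair, which is one way to see why skip-gram predicts more from less.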

<img src='Images/word2vec-model.png' alt="word2vec" width="550">

**CBOW** (Left):
- **Input**: context tokens
- **Inner dimension**: embedding layer
- **Output**: the target token

**Skip-gram** (Right):
- **Input**: the target token
- **Inner dimension**: embedding layer
- **Output**: context tokens

The above figure illustrates the direction of prediction. It also serves as a schematic representation of a neural network, i.e., the mechanism underlying the training of `word2vec`. The input and output are known to us, represented by **one-hot encodings** in Mikolov et al. The **hidden layer**, the inner dimension in-between the input and the output, is the vector representation that we are trying to find out. 
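To see why the hidden layer doubles as the word vector, consider the input side only. In this sketch (toy vocabulary and random weights are assumptions, and training is omitted), multiplying a one-hot input by the weight matrix simply selects one row of that matrix.

```python
import numpy as np

vocab = ["bank", "river", "money", "water"]  # toy vocabulary (hypothetical)
vocab_size, embedding_dim = len(vocab), 3

# Stand-in for the learned input-to-hidden weights (vocabulary x dimension)
rng = np.random.default_rng(42)
W = rng.normal(size=(vocab_size, embedding_dim))

# One-hot encoding of "river": all zeros except a 1 at its index
one_hot = np.zeros(vocab_size)
one_hot[vocab.index("river")] = 1.0

# Multiplying by the one-hot vector picks out one row of W
hidden = one_hot @ W
print(hidden)
print(np.allclose(hidden, W[vocab.index("river")]))
```

After training, the rows of this weight matrix are what we look up as word embeddings.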

We won't go into the specifics of training, but this gives a brief idea of where embeddings come from. The `word2vec` model we will be interacting with today is **pretrained**, meaning that the embeddings have already been trained on a large corpus (or a number of corpora). The pretrained `word2vec` and `glove`, as well as other models, are available through `gensim`.

Let's take a look at a few word embedding models.

In [None]:
# Get word embedding models
gensim_models = list(api.info()['models'].keys())

for model in gensim_models:
    print(model)

Consider the one named `word2vec-google-news-300`. The model name is usually formatted as `model-corpora-dimension`, so this is a `word2vec` model that is trained on Google News, and the embedding has 300 dimensions. 
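As a small illustration of that naming pattern, the hypothetical helper below splits a name into its three parts. It assumes the name follows `model-corpora-dimension`; names in the list that do not follow the pattern would need separate handling.

```python
def parse_model_name(name):
    """Split a name like 'word2vec-google-news-300' into its three parts.

    Hypothetical helper: it assumes the name follows the
    model-corpora-dimension pattern, which not every entry does.
    """
    model, rest = name.split("-", 1)           # text before the first hyphen
    corpora, dimension = rest.rsplit("-", 1)   # number after the last hyphen
    return model, corpora, int(dimension)

print(parse_model_name("word2vec-google-news-300"))
print(parse_model_name("glove-wiki-gigaword-50"))
```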

We can retrieve this model in two ways:
- Downloading it via `api.load()`
- Downloading the model file beforehand and then loading it in with `KeyedVectors.load_word2vec_format()`

The pretrained word2vec is archived by Google, and is available to download via this [link](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g). 

In [None]:
# Run the following line if your local machine has plenty of memory
#wv = api.load("word2vec-google-news-300")
# Alternatively, load the model in (https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/view?resourcekey=0-wjGZdNAUop6WykTtMip30g)
wv = KeyedVectors.load_word2vec_format('Data/GoogleNews-vectors-negative300.bin', binary=True)
# The parameter `binary` asks whether the model is in the binary format (indicated by the extension `.bin`).

Accessing the actual word vectors can be done by treating the word vector model as a dictionary. 

For example, let's take a look at the word vector for "banana":

In [None]:
wv['banana']

We can take a look at the size of the "banana" vector. As promised, it is a 1-D array that holds 300 values.

In [None]:
wv['banana'].size

These values appear to be random floats. However, now that the word has been transformed into a vector, we can more easily perform computations on it. 

Let's take a look at a few examples!

# 2. Word Similarity and Analogy

Following the example phrase above about the word embedding for "bank", the first question we can ask is: what words are similar to "bank"? In vector space, we'd expect similar words to have vectors that are closer to each other.

There are many metrics for measuring vector similarity, one of the most useful being [**cosine similarity**](https://en.wikipedia.org/wiki/Cosine_similarity). Cosine similarity ranges from -1 to 1: vectors pointing in opposite directions have a cosine similarity of -1, orthogonal vectors have a cosine similarity of 0, and vectors pointing in the same direction have a cosine similarity of 1.
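Cosine similarity is easy to compute by hand. The sketch below uses NumPy and made-up 2-D vectors; the function name `cosine_sim` is our own and is not part of `gensim`.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between vectors u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0])

same_direction = cosine_sim(a, 3 * a)              # longer, but same direction
orthogonal = cosine_sim(a, np.array([-2.0, 1.0]))  # at a right angle to a
opposite = cosine_sim(a, -a)                       # pointing the other way

print(same_direction, orthogonal, opposite)
```

Because the dot product is divided by both vector lengths, only the direction of the vectors matters, not their magnitude.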

`gensim` provides a function called `most_similar()` that lets us find the words most similar to a queried word. The output is a list of tuples, each holding a word and its cosine similarity to the queried word.

Let's give it a shot!

In [None]:
wv.most_similar(['bank'])

It looks like most similar vectors to "bank" are other financial terms! 

Recall that `word2vec` is trained to capture a word's meaning based on contextual information. These results pop up because these words commonly appear in similar contexts as the word "bank". 

In addition to querying the most similar words, we can also ask the model to return the cosine similarity between two words by calling the function `similarity()`.

Let's go ahead and check out the similarities between the following pairs of words.

In [None]:
# banana
wv.similarity('bank', 'banana')

In [None]:
# river 
wv.similarity('bank', 'river')

In [None]:
# bank with a capitalized B
wv.similarity('Bank', 'river')

In [None]:
# the present participle of bank
wv.similarity('banking', 'river')

**Question**: Why do "bank" and "river" appear to have higher similarity than other pairs?

## Word Analogy

One of the most famous usages of `word2vec` is via word analogies. For example:

`man : king :: woman : queen`

Oftentimes, a word analogy like this is visualized with a parallelogram, as shown in the following figure, which is adapted from [Ethayarajh et al. (2019)](https://aclanthology.org/P19-1315.pdf).

<img src='Images/word_analogy.png' alt="Word analogy" width="450">

The upper side (difference between `man` and `woman`) should approximate the lower side (difference between `king` and `queen`); the vector difference represents the meaning of `female`.

- $\mathbf{V}_{\text{man}} - \mathbf{V}_{\text{woman}} \approx \mathbf{V}_{\text{king}} - \mathbf{V}_{\text{queen}}$

Similarly, the left side (difference between `king` and `man`) should approximate the right side (difference between `queen` and `woman`); the vector difference represents the meaning of `royal`.

- $\mathbf{V}_{\text{king}} - \mathbf{V}_{\text{man}} \approx \mathbf{V}_{\text{queen}} - \mathbf{V}_{\text{woman}}$

We can take either equation and rearrange it:

- $\mathbf{V}_{\text{king}} - \mathbf{V}_{\text{man}} + \mathbf{V}_{\text{woman}} \approx \mathbf{V}_{\text{queen}}$

If the vectors of `king`, `man`, and `woman` are known, by vector arithmetic we should be able to get a vector that approximates the meaning of `queen`.

Let's implement it using the word2vec model and the `get_vector` function. Note: In all the operations below, we set `norm=True`, and renormalize. That's because different vectors might be of different lengths, so the normalization puts everything on a common scale.

In [None]:
# Calculate "royal" vector difference
difference = wv.get_vector('king', norm=True) - wv.get_vector('man', norm=True) 

# Add on woman
difference += wv.get_vector('woman', norm=True)

# Renormalize vector
difference = difference / np.linalg.norm(difference)

In [None]:
# What is the most similar vector?
wv.most_similar(difference)

The word "queen" is the second most similar one. 

Carrying out these operations can be done in one swoop with the `most_similar` function. 

We pass in two arguments, `positive` and `negative`, where `positive` holds the words that we want the output to be similar to, and `negative` the words we'd like the output to be dissimilar to.

In [None]:
wv.most_similar(positive=['woman', 'king'], negative=['man'])
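The same arithmetic can be traced on a tiny hand-made example. The 3-D vectors below are invented for illustration and do not come from `word2vec`; the `exclude` argument mimics the way `most_similar(positive=..., negative=...)` leaves the query words out of its results.

```python
import numpy as np

# Hand-made 3-D toy vectors (hypothetical values, not from word2vec)
toy = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.05, 0.3, 0.3]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def nearest(target, vectors, exclude=()):
    """Rank words by cosine similarity to `target`, skipping `exclude`."""
    target = unit(target)
    scores = {w: float(np.dot(unit(v), target))
              for w, v in vectors.items() if w not in exclude}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# king - man + woman, computed on unit-length vectors
target = unit(toy["king"]) - unit(toy["man"]) + unit(toy["woman"])

ranking = nearest(target, toy, exclude=("king", "man", "woman"))
print(ranking)
```

In this toy space "queen" ranks first by construction; with real embeddings the result depends on the model and the corpus it was trained on.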

# 3. Semantic Axes and Associations in Word Embeddings

### `glove` 

As you can work through in the practice code problems, we can show that gender bias is present in the pre-trained word embeddings. Let's take a closer look at it!

We will switch from `word2vec` to a smaller size embedding, the pretrained `glove`, starting from this section. Let's load it with the `api.load()` function. 

The model we load in is trained on Wikipedia and Gigaword (news data). Check out the [documentation](https://nlp.stanford.edu/projects/glove/) if you want to learn more.

In [None]:
glove = api.load('glove-wiki-gigaword-50')

Let's double check the size of the embedding vector.

In [None]:
glove['banana'].size

In [None]:
wv['banana'].size

### Semantic Axis

To investigate gender bias in word embeddings, we first need a vector representation that captures the concept of gender. The idea is to construct **a semantic axis** (or "SemAxis") for the concept. Such a concept is often complex and cannot be denoted by a single word. It is also often fluid, in that its meaning spans a range from one end to the other. Once we have the vector representation of the concept, we can project a list of terms onto that axis and see whether each term is more aligned with one end or the other.

The method comes from [An et al. 2018](https://aclanthology.org/P18-1228/). We first need to come up with two lists of pole words whose meanings oppose each other.

- $\mathbf{S}_{\text{plus}} = \{v_{1}^{+}, v_{2}^{+}, v_{3}^{+}, ..., v_{n}^{+}\}$

- $\mathbf{S}_{\text{minus}} = \{v_{1}^{-}, v_{2}^{-}, v_{3}^{-}, ..., v_{n}^{-}\}$

We take the mean of each vector set to represent the core meaning of that set. 

- $\mathbf{V}_{\text{plus}} = \frac{1}{n}\sum_{i=1}^{n}v_{i}^{+}$

- $\mathbf{V}_{\text{minus}} = \frac{1}{n}\sum_{j=1}^{n}v_{j}^{-}$

Next we take the difference between the two means to represent the corresponding semantic axis. 

- $\mathbf{V}_{\text{axis}} = \mathbf{V}_{\text{plus}} - \mathbf{V}_{\text{minus}}$

Projecting a specific term onto the semantic axis is operationalized as taking the cosine similarity between the word's vector and the semantic axis vector. A positive value indicates that the term is closer to the $\mathbf{V}_{\text{plus}}$ end, and a negative value indicates proximity to the $\mathbf{V}_{\text{minus}}$ end.

- $score(w) = cos(v_{w},  \mathbf{V}_{\text{axis}})$
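Before applying this to real embeddings, the steps can be traced on toy numbers. The 2-D vectors and the helper name `axis_score` below are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Toy 2-D vectors (hypothetical values chosen for illustration)
plus_vectors = np.array([[1.0, 0.2], [0.8, 0.0]])     # pole words, "plus" end
minus_vectors = np.array([[-1.0, 0.1], [-0.8, 0.1]])  # pole words, "minus" end

# Mean of each pole set, then the difference between the two means
v_plus = plus_vectors.mean(axis=0)
v_minus = minus_vectors.mean(axis=0)
v_axis = v_plus - v_minus

def axis_score(word_vector, axis):
    """Cosine similarity between a word vector and the axis."""
    return np.dot(word_vector, axis) / (np.linalg.norm(word_vector) * np.linalg.norm(axis))

score_a = axis_score(np.array([0.5, 0.5]), v_axis)   # leans toward the plus end
score_b = axis_score(np.array([-0.5, 0.5]), v_axis)  # leans toward the minus end

print(v_axis)
print(score_a, score_b)
```

The sign of the score tells us which pole a word leans toward, and the magnitude tells us how strongly.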

*Warning:* A binary distinction of gender is a simplification of the diversity and complexity of gender identities. This method is limited, as it is only capable of constructing two polarities. Along the way, we'll discover how much stereotyping is encoded in it.

We will construct the gender semantic axis using two sets of pole words for "female" and "male" from Bolukbasi et al. (2016). We will get the embeddings for these words from `glove` to calculate the gender axis.

In [None]:
# Define two sets of pole words (examples from Bolukbasi et al., 2016)
female = ['she', 'woman', 'female', 'daughter', 'mother', 'girl']
male = ['he', 'man', 'male', 'son', 'father', 'boy']

In [None]:
def get_semaxis(list1, list2, model, embedding_size):
    '''Calculate the embedding of a semantic axis given two lists of pole words.'''

    # Step 1: Get the embeddings for terms in each list
    vplus = [model[term] for term in list1]
    vminus = [model[term] for term in list2]

    # Step 2: Calculate the mean embeddings for each list
    vplus_mean = np.mean(vplus, axis=0)
    vminus_mean = np.mean(vminus, axis=0)

    # Step 3: Get the difference between two means
    sem_axis = vplus_mean - vminus_mean

    # Sanity check
    assert sem_axis.size == embedding_size
    
    return sem_axis

In [None]:
# Plug in the gender lists to calculate the semantic axis for gender
gender_axis = get_semaxis(list1=female, 
                          list2=male, 
                          model=glove, 
                          embedding_size=50)
gender_axis

We have the gender axis ready! The next step is to project a list of terms onto the gender axis. Let's test it with a set of occupation terms which might be associated with gender stereotypes. 

Before we calculate the cosine similarity, let's first rate the following occupation terms. Use your intuition!

Each rating should fall in $[-1, 1]$: a negative value means the term is closer to the male end, and a positive value means it is closer to the female end.

In [None]:
# Define a list of occupation terms (examples taken from Bolukbasi et al., 2016)
occupations = ['engineer',
               'nurse',
               'designer',
               'receptionist',
               'banker',
               'librarian',
               'architect',
               'hairdresser',
               'philosopher']

In [None]:
# Rate the following occupation terms
occ_rating = {'engineer': ...,
              'nurse': ...,
              'designer': ...,
              'receptionist': ...,
              'banker': ...,
              'librarian': ...,
              'architect': ...,
              'hairdresser': ...,
              'philosopher': ...
             }

In [None]:
# Calculate cosine similarity between a given word and the axis
def get_projection(word, model, axis):
    '''Get the projection of a word onto a semantic axis'''
    
    word_norm = model[word] / np.linalg.norm(model[word])
    axis_norm = axis / np.linalg.norm(axis)
    projection = np.dot(word_norm, axis_norm) 
    
    return projection

In [None]:
occ_projections = {word: get_projection(word, glove, gender_axis) for word in occupations}
occ_projections 

## Visualize the Projection

Now that we have calculated the projection of each occupation term onto the gender axis, let's plot these values to gain a more straightforward understanding of how much gender stereotyping is hidden in these terms.

We will use a bar plot to visualize them, with the color of each bar corresponding to the proximity of a term to an end.

In [None]:
from matplotlib.colors import Normalize

def plot_semantic_axis(projections, title, xlab):
    '''Return a horizontal bar plot of the projections.'''

    # Sort the projections in descending order
    projection_sorted = sorted(projections.items(), key=lambda term: term[1], reverse=True)

    # Extract the terms
    terms = [term_value[0] for term_value in projection_sorted]

    # Extract corresponding values of projections
    values = [term_value[1] for term_value in projection_sorted]

    # Take the absolute values for gradient color fill
    values_abs = np.abs(values)
    norm = Normalize(vmin=min(values_abs), vmax=max(values_abs))
    cmap = plt.get_cmap("YlOrBr")  
    colors = [cmap(norm(value)) for value in values_abs]

    plt.figure(figsize=(8, 6))  
    plt.barh(terms, values, color=colors)
    plt.grid(axis="x", linestyle=":", alpha=0.5)
    plt.xlim(-np.max(values_abs+0.05), np.max(values_abs+0.05))
    plt.xlabel(xlab)
    plt.title(title)
    plt.show();

We will visualize the projections as well as your self-ratings together. 

In [None]:
title1 = 'Projections onto the gender axis'
title2 = 'Self-rated projections onto the gender axis'
xlab = 'Gender-stereotypical occupation terms'

plot_semantic_axis(occ_projections, title1, xlab)
plot_semantic_axis(occ_rating, title2, xlab)

**Question**: Do you find the results surprising or expected? Why does gender stereotyping exist in word embeddings?

## The Class Axis

In addition to projecting terms onto a single axis, we can also project terms onto two axes and plot the results on a scatter plot, where the coordinates correspond to projections onto the two axes.

Social class is another dimension that has been frequently discussed in the literature. In this example, we'll create a semantic axis for social class, using two sets of pole words representing the two ends of class, as described in [Kozlowski et al. 2019](https://journals.sagepub.com/doi/full/10.1177/0003122419877135).

First, we'll project a list of sports terms onto both the gender and social class axes, similar to the method used in [Kozlowski et al. 2019](https://journals.sagepub.com/doi/full/10.1177/0003122419877135). We'll visualize the results on a scatter plot, with the x-axis representing gender and the y-axis representing social class. The coordinates of a term on this plot correspond to its projections onto these axes.

Next, we'll repeat the process to visualize occupation terms, which will give us a rough idea of how much a term is biased towards either end of these two dimensions.

Let's dive in!

In [None]:
# Define two sets of pole words of social class (examples taken from Kozlowski et al., 2019)
poor = ['poor', 'poorer', 'poorest', 'poverty', 'inexpensive', 'impoverished', 'cheap']
rich = ['rich', 'richer', 'richest', 'affluence', 'expensive', 'wealthy', 'luxury']

class_axis = get_semaxis(list1=poor, 
                         list2=rich, 
                         model=glove,
                         embedding_size=50)

We will project sports terms onto the social class axis to see if some sports are more associated with "high" society and others with "low" society. 

In [None]:
# Define a list of sports terms (examples taken from Kozlowski et al., 2019)
sports = ['camping', 
          'boxing', 
          'bowling', 
          'baseball', 
          'soccer', 
          'tennis', 
          'golf', 
          'basketball', 
          'skiing', 
          'sailing', 
          'volleyball']

Next, let's use the `get_projection` function to calculate the cosine similarity between each sport term and the axis (gender and class). 

In [None]:
proj_spt_class = {word: get_projection(word, glove, class_axis) for word in sports}
proj_spt_gender = {word: get_projection(word, glove, gender_axis) for word in sports}

Finally, let's plot the results in a scatter plot!

In [None]:
plt.figure(figsize=(9, 7))

# Use scatter plot to visualize the results
plt.scatter(list(proj_spt_gender.values()), 
            list(proj_spt_class.values()), 
            color='cornflowerblue',
            s=75)

# Add text label to each dot
for term in sports:
    plt.annotate(term, 
                 (proj_spt_gender[term], proj_spt_class[term]), 
                 fontsize=10)

# Add more annotations to four corners of the plot
plt.annotate('Male/High', (-0.48, -0.28), color='gray', horizontalalignment='left')
plt.annotate('Female/High', (0.48, -0.28), color='gray', horizontalalignment='right')
plt.annotate('Male/Low', (-0.48, 0.27), color='gray', horizontalalignment='left')
plt.annotate('Female/Low', (0.48, 0.27), color='gray', horizontalalignment='right')

# Add reference lines to each semantic axis
plt.hlines(xmin=-1, xmax=1, y=0, color='lightcoral', linewidth=1, linestyle=':')
plt.vlines(ymin=-1, ymax=1, x=0, color='lightcoral', linewidth=1, linestyle=':')

# Other parameter settings
plt.xlim(-0.5, 0.5)
plt.ylim(-0.3, 0.3)
plt.grid(True, linestyle=':')
plt.xlabel('Projection onto Gender')
plt.ylabel('Projection onto Class')
plt.show();

**Questions**: 
- Which sport term is most biased towards male and which toward female?
- Which sport seems to be gender-neutral?
- Which sport term is most biased towards high social class, and which towards low social class?
- Which sport seems to be neutral to class?

Ok! Let's go back to occupation terms. We will first add a few new occupations to the list, and then get the projections onto both axes.

In [None]:
occupations = ['engineer',
               'nurse',
               'designer',
               'receptionist',
               'banker',
               'librarian',
               'architect',
               'hairdresser',
               'philosopher',
               'plumber',
               'police',
               'pilot',
               'cashier',
               'janitor']

proj_occ_gender = {word: get_projection(word, glove, gender_axis) for word in occupations}
proj_occ_class = {word: get_projection(word, glove, class_axis) for word in occupations}

Next, let's visualize the results in a scatter plot.

In [None]:
plt.figure(figsize=(9, 7))

# Use scatter plot to visualize the results
plt.scatter(list(proj_occ_gender.values()), 
            list(proj_occ_class.values()), 
            color='tan', 
            s=75)

# Add text label to each dot
for term in occupations:
    plt.annotate(term, 
                 (proj_occ_gender[term], proj_occ_class[term]), 
                 fontsize=10)

# Add more annotations to four corners of the plot
plt.annotate('Male/High', (-0.48, -0.48), color='gray', horizontalalignment='left')
plt.annotate('Female/High', (0.48, -0.48), color='gray', horizontalalignment='right')
plt.annotate('Male/Low', (-0.48, 0.45), color='gray', horizontalalignment='left')
plt.annotate('Female/Low', (0.48, 0.45), color='gray', horizontalalignment='right')

# Add reference lines to each semantic axis
plt.hlines(xmin=-1, xmax=1, y=0, color='lightcoral', linewidth=1, linestyle=':')
plt.vlines(ymin=-1, ymax=1, x=0, color='lightcoral', linewidth=1, linestyle=':')

# Other parameter settings
plt.xlim(-0.5, 0.5)
plt.ylim(-0.5, 0.5)
plt.grid(True, linestyle=':')
plt.xlabel('Projection onto Gender')
plt.ylabel('Projection onto Class')
plt.show();

**Questions**: 
- Which occupation is most biased towards high social class, and which towards low social class?
- Which occupation seems to be neutral to class?
- Are any of the social class mappings surprising?
- Are any of the gender mappings surprising?

Hopefully these plots leave you with some food for thought to further explore word embeddings. Constructing an axis of gender or social class has been widely researched, but with the tool of semantic axis, we can investigate much more. It is useful for capturing the abstract meaning of various notions, such as an axis of coldness, an axis of kindness, and so on.

## Key Points

* Pre-trained word embeddings like `word2vec` and `glove` incorporate contextual information into their representations of words' meanings. 
* Similarity between words is conveniently measured by cosine similarity.
* We can explore biases in word embeddings with the methods of semantic axis.



# 4. Bringing it together: Predicting tweet sentiment

In the previous notebook, we used machine learning with vocabulary as features to predict pre-classified tweet sentiment. But what if we did not have access to pre-classified tweet sentiment and needed to define it ourselves based on the data? 

In this final text analysis activity, we will use the concept of semantic axes to create a vector representation of positive vs negative sentiment. We will then apply this axis to the vocabulary in the airline tweets dataset, and use this as a basis for determining sentiment.

The first thing we will do is load and prep the tweet data following the steps in notebook 9b.

In [None]:
# Load clean dataset in
tweets_path = 'Data/tweets_clean.csv'
tweets = pd.read_csv(tweets_path, sep=',')
tweets.head()

In [None]:
# Import function to create a document-term matrix (DTM)
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Customize the parameter setting
vectorizer = CountVectorizer(lowercase=True,
                             stop_words='english',
                             min_df=2,
                             max_df=0.95,
                             max_features=None)

In [None]:
# Fit, transform, and get tokens
counts = vectorizer.fit_transform(tweets['text_lemmatized'])
tokens = vectorizer.get_feature_names_out()

# Create the DTM
dtm = pd.DataFrame(data=counts.todense(),
                          index=tweets.index,
                          columns=tokens)
print(dtm.shape)
# get most frequent tokens
dtm.sum().sort_values(ascending=False).head(10)

In [None]:
dtm.head()

Now we prepare the semantic axis using terms capturing positivity and negativity. We use words from a list in [An et al. 2018](https://yongyeol.com/papers/an2018semaxis.pdf), who conducted a similar exercise.

In [None]:
# Define two sets of pole words (examples from An et al., 2018)
positive = ['good', 'lovely', 'excellent', 'fortunate', 'pleasant', 'delightful', 'perfect', 'loved', 'love', 'happy',
           'awesome', 'nice', 'amazing', 'best', 'fantastic']
negative = ['bad', 'horrible', 'poor', 'unfortunate', 'unpleasant', 'disgusting', 'evil', 'hated', 'hate', 'unhappy',
           'terrible', 'nasty', 'awful', 'worst', 'sad']

In [None]:
# Plug in the positive/negative lists to calculate the semantic axis for positive sentiment
# Using the same function we created above
positive_axis = get_semaxis(list1=positive, 
                          list2=negative, 
                          model=glove, 
                          embedding_size=50)
positive_axis

Ok, now we have a document-term matrix indicating which terms from the full corpus appear in each tweet/document, and a semantic axis against which to evaluate these terms. We will first compute positivity scores for each term in the DTM. 

In [None]:
# Import cosine_similarity function from scikit-learn
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Compute positivity score for each word
word_scores = {}
for word in tokens:
    if word in glove:
        word_vector = glove[word].reshape(1, -1)
        score = cosine_similarity(word_vector, positive_axis.reshape(1, -1))[0, 0]  # Cosine similarity
        word_scores[word] = score
    else:
        word_scores[word] = 0  # Assign zero if word is missing from embeddings


In [None]:
# Convert word_scores into a DataFrame
word_scores_df = pd.DataFrame(list(word_scores.items()), columns=["word", "positivity_score"])
# Let's browse some of these word scores. How do they look?
word_scores_df[10:20]

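We will now use those scores and the DTM to compute overall positivity for each tweet. Formally, we use matrix-vector multiplication: the dot product of each tweet's row of counts with the word scores sums the positivity scores of the words in that tweet, weighted by how often each word appears. Note that the code below does not normalize these sums, so longer tweets can accumulate larger absolute scores.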

In [None]:
# Compute positivity for each tweet
tweets['positive_score'] = dtm.dot(pd.Series(word_scores))  # Multiply DTM matrix by word scores
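To see what this multiplication does, here is a toy version with two made-up tweets and three tokens. All counts and scores are invented, and the length adjustment at the end is an optional variant that is not part of the notebook's pipeline.

```python
import numpy as np

# Toy document-term matrix: 2 tweets x 3 tokens (hypothetical counts)
toy_tokens = ["great", "delay", "flight"]
toy_dtm = np.array([[2, 0, 1],   # tweet 0: "great" twice, "flight" once
                    [0, 1, 1]])  # tweet 1: "delay" once, "flight" once

# Made-up word-level positivity scores, in the same order as toy_tokens
toy_scores = np.array([0.6, -0.4, 0.0])

# Each tweet's score is the count-weighted sum of its word scores
tweet_scores = toy_dtm @ toy_scores
print(tweet_scores)

# Optional length adjustment: divide by the number of counted tokens
tweet_means = tweet_scores / toy_dtm.sum(axis=1)
print(tweet_means)
```

Dividing by tweet length is one possible way to keep long tweets from dominating; whether it helps on the airline data is an empirical question.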

We've done it! Let's see how these estimated positivity scores line up against the labels provided in the data.

In [None]:
import seaborn as sns

# Set plot style
sns.set_style("whitegrid")

# Create the plot
plt.figure(figsize=(10, 6))

# Plot distributions for positive and negative tweets
sns.kdeplot(tweets.loc[tweets['airline_sentiment'] == 'positive', 'positive_score'], 
            label="Positive Sentiment", fill=True, color='green', alpha=0.5)

sns.kdeplot(tweets.loc[tweets['airline_sentiment'] == 'negative', 'positive_score'], 
            label="Negative Sentiment", fill=True, color='red', alpha=0.5)

# Labels and title
plt.xlabel("Positive Score")
plt.ylabel("Density")
plt.title("Distribution of Positive Score by Sentiment")
plt.legend()

# Show the plot
plt.show()


There's quite a bit of overlap here. Is there any signal in the positivity score we created?

In [None]:
from scipy.stats import ttest_ind

# Compute mean positive scores by airline_sentiment
mean_scores = tweets.groupby("airline_sentiment")["positive_score"].mean().reset_index()

# Perform an independent t-test (assuming only "positive" and "negative" labels exist)
pos_scores = tweets.loc[tweets['airline_sentiment'] == 'positive', 'positive_score']
neg_scores = tweets.loc[tweets['airline_sentiment'] == 'negative', 'positive_score']
t_stat, p_value = ttest_ind(pos_scores, neg_scores, equal_var=False)  # Welch’s t-test

# Create a barplot
plt.figure(figsize=(8, 6))
sns.barplot(data=mean_scores, x="airline_sentiment", y="positive_score", 
            hue="airline_sentiment", palette={"positive": "green", "negative": "red"}, legend=False)

# Annotate with p-value
plt.text(0.5, mean_scores["positive_score"].max(), f'p-value = {p_value:.4f}', 
         ha='center', fontsize=12, color='black', fontweight='bold')

# Labels and title
plt.xlabel("Airline Sentiment")
plt.ylabel("Mean Positive Score")
plt.title("Mean Positive Score by Sentiment Category")
plt.ylim(0, mean_scores["positive_score"].max() + 0.05)  # Adjust y-axis for clarity

# Show the plot
plt.show()


Let's see if we can get a more specific measure of sentiment by calculating tweet-level sentiment using only words with larger absolute values on the positivity semantic axis. Many words may have low positive values without really helping to identify sentiment. Setting those aside may make the distinction clearer. Let's first look at the distribution of word positivity scores.

In [None]:
plt.figure(figsize=(10, 6))

# Plot distributions of word-level positivity scores
sns.kdeplot(word_scores_df['positivity_score'], 
            label="Word positivity score", fill=True, color='green', alpha=0.5)
plt.xlabel("Positive Score")
plt.ylabel("Density")
plt.title("Distribution of Positive Score by Word")
plt.legend()
plt.show()

Let's set a threshold of -0.1 for negative words and 0.2 for positive words, and set everything else equal to 0. Then we'll recompute the tweet-level positivity scores, and see how it lines up with the sentiment classification.

In [None]:
word_scores2 = {word: (0 if -0.1 <= score <= 0.2 else score) for word, score in word_scores.items()}
tweets['positive_score2'] = dtm.dot(pd.Series(word_scores2))  # Multiply DTM matrix by word scores
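The thresholding step can be wrapped in a small function to make the cutoffs easy to vary. The helper name and the word scores below are hypothetical; note that a score exactly equal to a cutoff is treated as weak and set to 0, matching the inclusive comparison used above.

```python
def zero_out_weak_scores(scores, lower=-0.1, upper=0.2):
    """Set scores inside [lower, upper] to 0 and keep the rest unchanged."""
    return {word: (0 if lower <= s <= upper else s) for word, s in scores.items()}

# Made-up word scores for illustration
toy_word_scores = {"awful": -0.35, "gate": 0.05, "thanks": 0.41, "seat": -0.1}

filtered = zero_out_weak_scores(toy_word_scores)
print(filtered)
```

Trying different values of `lower` and `upper` is one simple way to approach the challenge at the end of this notebook.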

In [None]:
# Create the plot
plt.figure(figsize=(10, 6))

# Plot distributions for positive and negative tweets
sns.kdeplot(tweets.loc[tweets['airline_sentiment'] == 'positive', 'positive_score2'], 
            label="Positive Sentiment", fill=True, color='green', alpha=0.5)

sns.kdeplot(tweets.loc[tweets['airline_sentiment'] == 'negative', 'positive_score2'], 
            label="Negative Sentiment", fill=True, color='red', alpha=0.5)

# Labels and title
plt.xlabel("Positive Score, Revised")
plt.ylabel("Density")
plt.title("Distribution of Positive Score by Sentiment")
plt.legend()

# Show the plot
plt.show()

**Question**: How does it look? How does it compare to the initial graph?

The original dataset included a measure of certainty in its sentiment classification. Let's use this to create a continuous measure of positivity, and plot this against our positivity measure.

In [None]:
# Create 'sentiment_continuous' variable
tweets['sentiment_continuous'] = np.where(
    tweets['airline_sentiment'] == 'positive', tweets['airline_sentiment_confidence'],
    np.where(tweets['airline_sentiment'] == 'negative', -tweets['airline_sentiment_confidence'], 0)
)
tweets['sentiment_continuous'].head(10)

In [None]:
import statsmodels.api as sm

# Perform linear regression
X = tweets['positive_score2']
y = tweets['sentiment_continuous']
X = sm.add_constant(X)  # Add intercept for regression
model = sm.OLS(y, X).fit()  # Fit regression model
r_squared = model.rsquared  # Extract R² value

# Plot scatterplot with regression line
plt.figure(figsize=(8, 6))
sns.regplot(x=tweets['positive_score2'], y=tweets['sentiment_continuous'], scatter_kws={'alpha': 0.5}, line_kws={"color": "red"})

# Annotate R² value on the plot
plt.text(0.05, 0.8, f'R² = {r_squared:.4f}', fontsize=12, color='black', fontweight='bold', transform=plt.gca().transAxes)

# Labels and title
plt.xlabel("Positive Score, Revised")
plt.ylabel("Sentiment Continuous")
plt.title("Relationship between Positive Score and Sentiment Confidence")

# Show the plot
plt.show()


**Question:** What do we conclude? What more could we try to do?

**Challenge:** See if you can improve on this process to create a measure of sentiment that better matches the labels provided in the data.