# Semantle

Try this game! https://semantle.com/

Does anyone know what's going on here?

### Distributional semantics

https://en.wikipedia.org/wiki/Distributional_semantics

The core insight of distributional semantics is that a word’s meaning is shaped by the contexts in which it appears. This is often summed up by the phrase: “You shall know a word by the company it keeps” (Firth, 1957).

Think of the sentences:

-- "I fed the *dog* two treats"

-- "I fed the *horse* some hay"

-- "I fed the *cat* then went to the store"

-- "I fed my _____ to keep him distracted"

What else could go in the blank?
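
The example sentences above can be turned into a tiny experiment. The sketch below is only an illustration, assuming plain whitespace tokenization and a hypothetical helper named `left_context`; it collects the two words that come right before each animal word and checks which contexts all three share.

```python
# Three of the example sentences from above
sentences = [
    "I fed the dog two treats",
    "I fed the horse some hay",
    "I fed the cat then went to the store",
]
targets = ["dog", "horse", "cat"]

def left_context(tokens, index, width=2):
    """Return the `width` words immediately before position `index` (hypothetical helper)."""
    return " ".join(tokens[max(0, index - width):index])

# Map each target word to the set of left contexts it was seen with
contexts = {}
for sentence in sentences:
    tokens = sentence.lower().split()
    for i, token in enumerate(tokens):
        if token in targets:
            contexts.setdefault(token, set()).add(left_context(tokens, i))

# Contexts that every target word shares
shared = set.intersection(*contexts.values())
print(contexts)
print("Shared left context:", shared)
```

Words that share a context like this are good candidates for the blank in the last sentence.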

# Introduction to Co-occurrence Matrices

Understanding Embeddings for Computational Social Science: The Foundation of Word Relationships

In this section, we'll explore how computers can understand relationships between words
by examining which words appear together in text. This concept of "co-occurrence" is
fundamental to many text analysis techniques, including word embeddings.
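
As a rough illustration of what "co-occurrence" means, here is a minimal sketch that uses only the standard library. The function name `neighbor_counts` and the `window` parameter are assumptions made for this example, and the tokenization is deliberately crude (split on spaces, strip basic punctuation).

```python
from collections import Counter

def neighbor_counts(sentences, target, window=1):
    """Count words that appear within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        # Crude tokenization: split on whitespace, strip punctuation, lowercase
        tokens = [w.strip(".,!?").lower() for w in sentence.split()]
        tokens = [t for t in tokens if t]
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

example_sentences = [
    "The king drinks tea from a jeweled goblet.",
    "The king eats figs at the royal table.",
]
counts = neighbor_counts(example_sentences, "king")
print(counts)
```

With `window=1`, only the words directly to the left and right of the target are counted.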

First, we'll load some packages.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from IPython.display import display, HTML
from nltk.stem import WordNetLemmatizer

# Initialize lemmatizer once
lemmatizer = WordNetLemmatizer()

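# Download required NLTK data: tokenizer models for word_tokenize and the
# WordNet corpus that WordNetLemmatizer depends on.
# Note: newer NLTK releases may look for 'punkt_tab' rather than 'punkt'.
for resource_path, package in [('tokenizers/punkt', 'punkt'), ('corpora/wordnet', 'wordnet')]:
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package)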

Here we're going to choose the target word, which today will be "king", for simplicity.

In [2]:
# Target word
target_word = "king"

Next, we initialize the dictionaries that keep track of how often words appear immediately before and after the word "king".

### Helper functions

In [3]:
# Storage for co-occurrence counts
left_counts = defaultdict(int)
right_counts = defaultdict(int)

def process_sentence(sentence, print_statements=True):
    """Tokenize and lemmatize a sentence, then count the words directly left and right of target_word."""
    if print_statements:
        print(f"Processing: {sentence}")

    # Tokenize, lowercase, and keep alphabetic tokens only
    raw_tokens = [word.lower() for word in word_tokenize(sentence) if word.isalpha()]
    # Lemmatize every token as a verb (pos="v"), so "loved" -> "love" and "drinks" -> "drink"
    tokens = [lemmatizer.lemmatize(word, pos="v") for word in raw_tokens]
    if print_statements:
        print(f"Tokens: {tokens}")
    
    found_contexts = []

    for i, token in enumerate(tokens):
        if token == target_word:
            if i > 0:
                left_word = tokens[i - 1]
                left_counts[left_word] += 1
            else:
                left_word = None
            if i < len(tokens) - 1:
                right_word = tokens[i + 1]
                right_counts[right_word] += 1
            else:
                right_word = None

            context_str = f"{left_word or '[START]'} -> {token} -> {right_word or '[END]'}"
            found_contexts.append(context_str)

    if print_statements:
        if found_contexts:
            print(f"Found contexts: {found_contexts}")
        else:
            print(f"No instances of '{target_word}' found in sentence")
        print("-" * 50)

def show_counts():
    """Display current co-occurrence counts as DataFrame"""
    # Get all unique context words
    all_words = set(left_counts.keys()) | set(right_counts.keys())
    
    if not all_words:
        print("No co-occurrences found yet.")
        return
    
    # Create single row DataFrame with context words as columns
    row_data = {}
    for word in sorted(all_words):
        row_data[word] = left_counts[word] + right_counts[word]
    
    df = pd.DataFrame([row_data], index=[target_word])
    print("Co-occurrence matrix:")
    print(df)
    print("=" * 50)
    
def show_counts_full():
    """Display current co-occurrence counts as a horizontally scrollable DataFrame, sorted by frequency"""
    # Get all unique context words
    all_words = set(left_counts.keys()) | set(right_counts.keys())
    
    if not all_words:
        print("No co-occurrences found yet.")
        return
    
    # Create dictionary of context word totals
    row_data = {word: left_counts[word] + right_counts[word] for word in all_words}
    
    # Sort by frequency (descending)
    sorted_items = sorted(row_data.items(), key=lambda x: x[1], reverse=True)
    sorted_row_data = {word: count for word, count in sorted_items}
    
    # Create single-row DataFrame
    df = pd.DataFrame([sorted_row_data], index=[target_word])
    
    # Display scrollable table
    display(HTML(df.to_html(notebook=True)))
    display(HTML("<style>div.output_scroll {overflow-x: auto; white-space: nowrap;}</style>"))



Let's try constructing a co-occurrence matrix by hand, one sentence at a time.

In [4]:
print("Enter sentences one at a time. Type 'done' to finish.\n")

# Storage for co-occurrence counts
left_counts = defaultdict(int)
right_counts = defaultdict(int)

target_word = 'king' 

while True:
    sentence = input("Enter a sentence: ")
    if sentence.lower().strip() == "done":
        print("\nFinished input. Final counts:")
        show_counts()
        break
    process_sentence(sentence)
    show_counts()


Enter sentences one at a time. Type 'done' to finish.



Enter a sentence:  Give me a sample sentence


Processing: Give me a sample sentence
Tokens: ['give', 'me', 'a', 'sample', 'sentence']
No instances of 'king' found in sentence
--------------------------------------------------
No co-occurrences found yet.


Enter a sentence:  King didn't appear in that sentence!


Processing: King didn't appear in that sentence!
Tokens: ['king', 'do', 'appear', 'in', 'that', 'sentence']
Found contexts: ['[START] -> king -> do']
--------------------------------------------------
Co-occurrence matrix:
      do
king   1


Enter a sentence:  The king loved me :)


Processing: The king loved me :)
Tokens: ['the', 'king', 'love', 'me']
Found contexts: ['the -> king -> love']
--------------------------------------------------
Co-occurrence matrix:
      do  love  the
king   1     1    1


Enter a sentence:  Done



Finished input. Final counts:
Co-occurrence matrix:
      do  love  the
king   1     1    1


### Now let's read in some sample sentences.
Let's take a look at some sentences containing the word "king".

In [5]:
sentences = pd.read_csv('Sample_Sentences.csv')['sentence'].tolist()

In [7]:
target_word = 'king'

# Reinitialize storage counts
left_counts = defaultdict(int)
right_counts = defaultdict(int)

for i, sentence in enumerate(sentences):
    if i % 20 == 0:
        print(sentence)
    process_sentence(sentence, print_statements=False)

The king dines in silence with his trusted advisors.
The peacetime king commanded reduced military forces during stability
The king drinks tea from a jeweled goblet.
The specialized king ordered experts handle specific technical responsibilities
The king drinks ale from a jeweled goblet.
The foreign king ruled these lands for only three years
The king drinks tea from a jeweled goblet.
The king dines in silence with his trusted advisors.
The king drinks broth from a jeweled goblet.
The king dines by candlelight with his trusted advisors.
The influenced king decided after listening to persuasive advisors
The king eats figs at the royal table.
The king dines by candlelight with his trusted advisors.
The nationalist king ruled by strengthening cultural identity
The patient king reigned by waiting for the right moments
The king eats berries at the royal table.
The king dines with guests with his trusted advisors.
The conservative king decided to maintain traditional ways completely
The king

In [8]:
show_counts_full()

Unnamed: 0,the,eat,drink,din,decide,reign,rule,command,order,hearts,be,joker,spade,diamonds,democratic,military,practical,wise,legendary,constitutional,strategic,patient,pragmatic,conservative,desert,scholarly,cautious,warrior,young,revolutionary,innovative,reformist,independent,impulsive,traditional,economic,siege,religious,local,visionary,scientific,radical,archer,reasonable,exile,supervise,ceremonial,besiege,mysterious,charismatic,systematic,ancient,accessible,customize,educate,enlighten,proud,nationalist,naval,lenient,document,final,intuitive,extremist,regular,flexible,generalize,strong,absolute,generous,reserve,isolate,rational,alliance,inclusive,pessimistic,travel,infantry,peaceful,highland,forest,decentralize,literary,conscript,nepotistic,eastern,social,brave,thoughtful,expensive,gentle,uncertain,collaborative,mountain,specialize,benevolent,coastal,informal,spiritual,philosopher,great,maritime,realistic,winter,happy,autocratic,foreign,professional,moderate,standardize,experience,merchant,reluctant,egalitarian,clever,old,deliberate,permanent,northern,summer,first,regulate,formal,righteous,serious,architectural,handsome,athletic,island,optimistic,confident,active,guerrilla,centralize,reclusive,executive,private,unpopular,demand,temporary,artistic,mercenary,hierarchical,imperial,peacetime,southern,analytical,recruit,humble,influence,immortal,consistent,stern,cavalry,aggressive,victorious,defensive,idealistic,volunteer,popular,puppet,cheerful,administrative,deregulate,inexperienced,autonomous,ambitious,diplomatic,emotional,unify,schedule,bold,meritocratic,progressive,veteran,coordinate,retire,stubborn,spontaneous,elite,tyrant,unpredictable,complex,musical,immediate,defeat,liberate,beloved,casual
king,348,114,98,88,51,50,50,48,47,11,7,7,7,6,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


#### Q: What if we want to consolidate these columns?
##### Which columns would you choose to consolidate? What do they have in common?

Doing this consolidation systematically is computationally intensive, but the intuition is that you don't "need" every column to preserve the information in the matrix: if you keep "decide", you already know most of what you need to know about "rule".

This is what is known as a "low-dimensional representation": it keeps much of the information in the larger matrix while being computationally easier to handle.
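
One simple way to picture the consolidation is to sum related columns into a single column. The sketch below is only an illustration: the grouping in `column_groups` is a hypothetical choice, the counts are a small subset copied from the table above, and summing is just one possible rule, so the totals it prints need not match any hand-entered values elsewhere in this notebook.

```python
import pandas as pd

# A few of the co-occurrence counts for "king" from the table above
counts = pd.DataFrame(
    {
        "eat": 114, "drink": 98, "din": 88,
        "decide": 51, "reign": 50, "rule": 50, "command": 48, "order": 47,
        "hearts": 11, "joker": 7, "spade": 7, "diamonds": 6,
    },
    index=["king"],
)

# Hypothetical grouping of related context words (an assumption for illustration)
column_groups = {
    "food": ["eat", "drink", "din"],
    "power": ["decide", "reign", "rule", "command", "order"],
    "games": ["hearts", "joker", "spade", "diamonds"],
}

# Sum each group of columns into one consolidated column
consolidated = pd.DataFrame(
    {group: counts[columns].sum(axis=1) for group, columns in column_groups.items()}
)
print(consolidated)
```

Twelve columns become three, and the row still tells you that "king" appears mostly near food words and power words.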

#### Consolidated below

In [9]:
# Consolidated columns
data = {
    'food': 100,
    'power': 49,
    'games': 7,
    'democratic': 1,
    'reformist': 1,
    'practical': 1,
    'strategic': 1,
    'constitutional': 1,
    'independent': 1
}

# Create DataFrame
collapsed = pd.DataFrame(data, index=['king'])

collapsed

|      | food | power | games | democratic | reformist | practical | strategic | constitutional | independent |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| king | 100 | 49 | 7 | 1 | 1 | 1 | 1 | 1 | 1 |


Now let's imagine that we ingested a great deal more text. All of Wikipedia, say. Or Google Books. Or the internet!

And we get the numbers below:

In [413]:
# Consolidated columns for "king"
data_king = {
    'food': 13492011,
    'power': 1476662,
    'games': 51018,
    'female': 145,
    '5': 1124100003,
    '6': 944380,
    '7': 31414685,
    '8': 51492,
    '9': 313364,
    '10': 767899
}

# Create DataFrame
collapsed_king = pd.DataFrame(data_king, index=['king'])

collapsed_king


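|      | food | power | games | female | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| king | 13492011 | 1476662 | 51018 | 145 | 1124100003 | 944380 | 31414685 | 51492 | 313364 | 767899 |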


### Q: What are columns 5-10?

#### Then we collect the nearby words for "queen", too.
And we get the values below.

In [414]:
# Consolidated columns for "queen"
data_queen = {
    'food': 11732711,
    'power': 1076561,
    'games': 45833,
    'female': 145345,
    '5': 884100003,
    '6': 964380,
    '7': 14212,
    '8': 114922,
    '9': 3254,
    '10': 557669
}

# Create DataFrame
collapsed_queen = pd.DataFrame(data_queen, index=['queen'])

collapsed_queen


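|      | food | power | games | female | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| queen | 11732711 | 1076561 | 45833 | 145345 | 884100003 | 964380 | 14212 | 114922 | 3254 | 557669 |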


##### Now let's put them together.

In [415]:
combined_kq = pd.concat([collapsed_king, collapsed_queen], axis=0)

In [416]:
combined_kq

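|      | food | power | games | female | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| king | 13492011 | 1476662 | 51018 | 145 | 1124100003 | 944380 | 31414685 | 51492 | 313364 | 767899 |
| queen | 11732711 | 1076561 | 45833 | 145345 | 884100003 | 964380 | 14212 | 114922 | 3254 | 557669 |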


### Q: What do you notice about these values?

In [377]:
# Consolidated columns for "bird"
data_bird = {
    'food': 15732711,
    'power': 672,
    'games': 1876,
    'female': 62194,
    '5': 8882841,
    '6': 952085,
    '7': 467618,
    '8': 734666,
    '9': 645365,
    '10': 396391
}

# Create DataFrame
collapsed_bird = pd.DataFrame(data_bird, index=['bird'])

combined_kqb = pd.concat([combined_kq, collapsed_bird], axis=0)

In [378]:
combined_kqb

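|      | food | power | games | female | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| king | 13492011 | 1476662 | 51018 | 145 | 1124100003 | 944380 | 31414685 | 51492 | 313364 | 767899 |
| queen | 11732711 | 1076561 | 45833 | 145345 | 884100003 | 964380 | 14212 | 114922 | 3254 | 557669 |
| bird | 15732711 | 672 | 1876 | 62194 | 8882841 | 952085 | 467618 | 734666 | 645365 | 396391 |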


In [394]:
# Consolidated columns for "rock"
data_rock = {
    'food': 34111,
    'power': 299011,
    'games': 21221,
    'female': 616,
    '5': 260415,
    '6': 41,
    '7': 904109,
    '8': 7346660,
    '9': 5375571,
    '10': 9819601
}

# Create DataFrame
collapsed_rock = pd.DataFrame(data_rock, index=['rock'])

combined_kqbr = pd.concat([combined_kqb, collapsed_rock], axis=0)

In [395]:
combined_kqbr

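|      | food | power | games | female | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| king | 13492011 | 1476662 | 51018 | 145 | 1124100003 | 944380 | 31414685 | 51492 | 313364 | 767899 |
| queen | 11732711 | 1076561 | 45833 | 145345 | 884100003 | 964380 | 14212 | 114922 | 3254 | 557669 |
| bird | 15732711 | 672 | 1876 | 62194 | 8882841 | 952085 | 467618 | 734666 | 645365 | 396391 |
| rock | 34111 | 299011 | 21221 | 616 | 260415 | 41 | 904109 | 7346660 | 5375571 | 9819601 |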


In [396]:
# Normalize using min-max scaling
normalized_kqbr = (combined_kqbr - combined_kqbr.min()) / (combined_kqbr.max() - combined_kqbr.min())

In [397]:
normalized_kqbr

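|      | food | power | games | female | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| king | 0.857268 | 1.0 | 1.0 | 0.0 | 1.0 | 0.97926 | 1.0 | 0.0 | 0.057724 | 0.039425 |
| queen | 0.7452 | 0.728927 | 0.894489 | 1.0 | 0.786446 | 1.0 | 0.0 | 0.008695 | 0.0 | 0.017115 |
| bird | 1.0 | 0.0 | 0.0 | 0.427335 | 0.007672 | 0.98725 | 0.014439 | 0.093647 | 0.119522 | 0.0 |
| rock | 0.0 | 0.202128 | 0.393655 | 0.003244 | 0.0 | 0.0 | 0.02834 | 1.0 | 1.0 | 1.0 |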
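
Min-max scaling works one column at a time: each value x becomes (x - min) / (max - min), so the smallest value in a column maps to 0 and the largest maps to 1. As a small worked check, the sketch below repeats the calculation for the `food` column alone, using the raw counts from the table above.

```python
import pandas as pd

# Raw "food" counts for the four words, copied from the combined table
food = pd.Series(
    {"king": 13492011, "queen": 11732711, "bird": 15732711, "rock": 34111},
    name="food",
)

# Min-max scaling: subtract the column minimum, divide by the column range
scaled_food = (food - food.min()) / (food.max() - food.min())
print(scaled_food.round(6))
```

Here "rock" has the smallest food count and "bird" the largest, so they land at 0 and 1, with "king" and "queen" in between.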


In [398]:
normalized_kqbr.loc['????'] = [0.912821, 0.050113, 0.88113, 0.081213, 0.155179, 0.101645, 0.669487, 0.740819, 0.900297, 0.239374]

In [399]:
normalized_kqbr

|      | food | power | games | female | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| king | 0.857268 | 1.0 | 1.0 | 0.0 | 1.0 | 0.97926 | 1.0 | 0.0 | 0.057724 | 0.039425 |
| queen | 0.7452 | 0.728927 | 0.894489 | 1.0 | 0.786446 | 1.0 | 0.0 | 0.008695 | 0.0 | 0.017115 |
| bird | 1.0 | 0.0 | 0.0 | 0.427335 | 0.007672 | 0.98725 | 0.014439 | 0.093647 | 0.119522 | 0.0 |
| rock | 0.0 | 0.202128 | 0.393655 | 0.003244 | 0.0 | 0.0 | 0.02834 | 1.0 | 1.0 | 1.0 |
| ???? | 0.912821 | 0.050113 | 0.88113 | 0.081213 | 0.155179 | 0.101645 | 0.669487 | 0.740819 | 0.900297 | 0.239374 |


In [400]:
normalized_kqbr = normalized_kqbr.rename(index={'????': 'pizza'})

In [401]:
normalized_kqbr

|      | food | power | games | female | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| king | 0.857268 | 1.0 | 1.0 | 0.0 | 1.0 | 0.97926 | 1.0 | 0.0 | 0.057724 | 0.039425 |
| queen | 0.7452 | 0.728927 | 0.894489 | 1.0 | 0.786446 | 1.0 | 0.0 | 0.008695 | 0.0 | 0.017115 |
| bird | 1.0 | 0.0 | 0.0 | 0.427335 | 0.007672 | 0.98725 | 0.014439 | 0.093647 | 0.119522 | 0.0 |
| rock | 0.0 | 0.202128 | 0.393655 | 0.003244 | 0.0 | 0.0 | 0.02834 | 1.0 | 1.0 | 1.0 |
| pizza | 0.912821 | 0.050113 | 0.88113 | 0.081213 | 0.155179 | 0.101645 | 0.669487 | 0.740819 | 0.900297 | 0.239374 |
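
Once every word is a row of numbers, rows can be compared with each other. This notebook does not say which measure to use; the sketch below assumes cosine similarity, one common choice for comparing word vectors, and the helper name `cosine_similarity` is introduced here only for illustration. The values are copied from the normalized table above.

```python
import numpy as np
import pandas as pd

columns = ["food", "power", "games", "female", "5", "6", "7", "8", "9", "10"]
rows = {
    "king":  [0.857268, 1.0, 1.0, 0.0, 1.0, 0.97926, 1.0, 0.0, 0.057724, 0.039425],
    "queen": [0.7452, 0.728927, 0.894489, 1.0, 0.786446, 1.0, 0.0, 0.008695, 0.0, 0.017115],
    "bird":  [1.0, 0.0, 0.0, 0.427335, 0.007672, 0.98725, 0.014439, 0.093647, 0.119522, 0.0],
    "rock":  [0.0, 0.202128, 0.393655, 0.003244, 0.0, 0.0, 0.02834, 1.0, 1.0, 1.0],
    "pizza": [0.912821, 0.050113, 0.88113, 0.081213, 0.155179, 0.101645, 0.669487, 0.740819, 0.900297, 0.239374],
}
vectors = pd.DataFrame.from_dict(rows, orient="index", columns=columns)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over product of lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare "king" with every other word in the table
for word in vectors.index:
    if word != "king":
        score = cosine_similarity(vectors.loc["king"], vectors.loc[word])
        print(f"king vs {word}: {score:.3f}")
```

A score near 1 means the two rows point in almost the same direction; a score near 0 means they have little in common.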


### 3D embeddings

https://projector.tensorflow.org/

# Practice math

See the whiteboard for the final lesson on embeddings math!