<img src='https://www.di.uniroma1.it/sites/all/themes/sapienza_bootstrap/logo.png' width="200"/>  

# Part_1_11_Vector Semantics (Sparse)

In Natural Language Processing (`NLP`), vector semantics provides a powerful framework for representing words and documents in numerical form, enabling efficient computation and semantic analysis. Sparse vector representations, such as **Bag of Words (`BoW`)**, **TF-IDF**, and **Positive Pointwise Mutual Information (`PPMI`)**, have been foundational in the evolution of `NLP`. These approaches rely on statistical co-occurrence patterns and word frequency to capture linguistic meaning, forming the basis for more sophisticated methods like dense embeddings and contextualized models.

Sparse representations are particularly useful in understanding the core principles of vector semantics and building intuition about the role of word-document relationships in tasks like text classification, clustering, and retrieval systems.

### **Objectives:**
In this notebook, Parham provides an overview of sparse vector semantics, including the key methods used to represent text data and their significance in `NLP`. Through practical exercises, Parham will demonstrate the implementation of **Bag of Words (`BoW`)** for document representation, **TF-IDF** to highlight significant terms within documents, and **PPMI** to extract meaningful statistical relationships from co-occurrence matrices.

### **References:**
- [https://www.datacamp.com/tutorial/python-bag-of-words-model](https://www.datacamp.com/tutorial/python-bag-of-words-model)  
- [https://spotintelligence.com/2022/12/20/bag-of-words-python](https://spotintelligence.com/2022/12/20/bag-of-words-python/)  
- [https://stackoverflow.com/questions/58701337/how-to-construct-ppmi-matrix-from-a-text-corpus](https://stackoverflow.com/questions/58701337/how-to-construct-ppmi-matrix-from-a-text-corpus)   

### **Tutors**:
- Professor Stefano Faralli
    - <img src="https://upload.wikimedia.org/wikipedia/commons/7/7e/Gmail_icon_%282020%29.svg" alt="Logo" width="20" height="20"> **Email**: Stefano.faralli@uniroma1.it
    - <img src="https://www.iconsdb.com/icons/preview/red/linkedin-6-xxl.png" alt="Logo" width="20" height="20"> **LinkedIn**: [LinkedIn](https://www.linkedin.com/in/stefano-faralli-b1183920/) 
- Professor Iacopo Masi
    - <img src="https://upload.wikimedia.org/wikipedia/commons/7/7e/Gmail_icon_%282020%29.svg" alt="Logo" width="20" height="20"> **Email**: masi@di.uniroma1.it  
    - <img src="https://www.iconsdb.com/icons/preview/red/linkedin-6-xxl.png" alt="Logo" width="20" height="20"> **LinkedIn**: [LinkedIn](https://www.linkedin.com/in/iacopomasi/)  
    - <img src="https://upload.wikimedia.org/wikipedia/commons/a/ae/Github-desktop-logo-symbol.svg" alt="Logo" width="20" height="20"> **GitHub**: [GitHub](https://github.com/iacopomasi)  
    

### **Contributors:**
- Parham Membari  
    - <img src="https://upload.wikimedia.org/wikipedia/commons/7/7e/Gmail_icon_%282020%29.svg" alt="Logo" width="20" height="20"> **Email**: p.membari96@gmail.com  
    - <img src="https://www.iconsdb.com/icons/preview/red/linkedin-6-xxl.png" alt="Logo" width="20" height="20"> **LinkedIn**: [LinkedIn](https://www.linkedin.com/in/p-mem/)  
    - <img src="https://upload.wikimedia.org/wikipedia/commons/a/ae/Github-desktop-logo-symbol.svg" alt="Logo" width="20" height="20"> **GitHub**: [GitHub](https://github.com/parham075)  
    - <img src="https://upload.wikimedia.org/wikipedia/commons/e/ec/Medium_logo_Monogram.svg" alt="Logo" width="20" height="20"> **Medium**: [Medium](https://medium.com/@p.membari96)  

**Table of Contents:**
1. Import Libraries  
2. Introduction to Vector Semantics  
3. Bag of Words (`BoW`) Representation      
4. Term Frequency-Inverse Document Frequency (`TF-IDF`)
5. Pointwise Mutual Information (`PMI`)   
6. Closing Thoughts


## 1. Import Libraries 

In [3]:
import os
import requests
import tarfile
import zipfile
import pandas as pd
import nltk
import numpy as np
import spacy
import torch
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tag import pos_tag
from loguru import logger
from tqdm import tqdm
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from pprint import pprint
import math
from collections import Counter
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/p/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/p/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 2. Introduction to Vector Semantics
Vector semantics is a methodology in `NLP` that uses numerical representations to encode the meaning of words, phrases, or entire documents. These numerical representations are essential for enabling machines to process and analyze human language. By transforming textual data into vectors, computational models can perform mathematical operations to assess similarity, context, and relationships between words or documents.

### Why Vector Semantics?
- **Quantitative Representation**: Text data, being inherently qualitative, is challenging for computers to process directly. Vector semantics bridges this gap by converting text into a mathematical form.
- **Efficient Computation**: Mathematical representations allow for quick calculations of similarity, clustering, and classification tasks.
- **Foundation for Machine Learning Models**: Many machine learning models rely on vectorized representations of data as input.
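To make the idea of "mathematical operations to assess similarity" concrete, here is a minimal sketch of cosine similarity between document vectors. The toy vocabulary and the three count vectors are made up for illustration; they are not part of the notebook's data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical count vectors over the toy vocabulary ["cat", "dog", "car"]
doc_a = [2, 1, 0]
doc_b = [1, 1, 0]
doc_c = [0, 0, 3]

print(cosine_similarity(doc_a, doc_b))  # high: the documents share words
print(cosine_similarity(doc_a, doc_c))  # 0.0: no words in common
```

Cosine similarity depends only on the direction of the vectors, not their length, so a long and a short document with the same word proportions are treated as similar.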


## 3. Bag of Words (`BoW`) Representation 

**Bag of Words** (`BoW`) is a simple and widely used representation of text data in Natural Language Processing. It represents text as a collection of words, ignoring grammar, word order, and context, while preserving word frequencies. `BoW` is broadly used in tasks such as text classification and sentiment analysis. It matters because most machine learning algorithms operate on numbers and cannot process raw text directly. The process of converting text to numbers is known as feature extraction or feature encoding.

### 3.1 Understanding Bag of Words with an Example
Imagine two sentences:
1. Document 1: "Natural Language Processing is amazing."
2. Document 2: "Language models are important for NLP."


The `BoW` model begins by creating a vocabulary: the list of unique words across the corpus. Each document is then represented as a vector of word frequencies. The table below shows the Bag of Words vectors:

| **Vocabulary** | **Document 1** | **Document 2** |
|-----------------|----------------|----------------|
| Natural         | 1              | 0              |
| Language        | 1              | 1              |
| Processing      | 1              | 0              |
| is              | 1              | 0              |
| amazing         | 1              | 0              |
| models          | 0              | 1              |
| are             | 0              | 1              |
| important       | 0              | 1              |
| for             | 0              | 1              |
| NLP             | 0              | 1              |

Each position in the vector corresponds to a word in the vocabulary, and the value represents its frequency in the document.
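The table above can be reproduced with a few lines of plain Python. This is a minimal sketch rather than the notebook's main implementation: the `tokenize` helper and the choice to keep the original casing (so the vocabulary matches the table) are assumptions made for illustration.

```python
import re
from collections import Counter

docs = [
    "Natural Language Processing is amazing.",
    "Language models are important for NLP.",
]

def tokenize(text):
    # Keep alphabetic runs only; casing is preserved to match the table
    return re.findall(r"[A-Za-z]+", text)

# Vocabulary in order of first appearance across the corpus
vocab = []
for doc in docs:
    for token in tokenize(doc):
        if token not in vocab:
            vocab.append(token)

# One count vector per document, aligned with the vocabulary
bow_vectors = []
for doc in docs:
    counts = Counter(tokenize(doc))
    bow_vectors.append([counts[word] for word in vocab])

print(vocab)
for i, vec in enumerate(bow_vectors, start=1):
    print(f"Document {i}: {vec}")
```

Each printed vector corresponds to one column of the table, read from top to bottom.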

### 3.2 How to Implement `BoW`
The steps involved in creating a `BoW` representation are:
- Tokenization: Split the text into individual words or tokens.
- Preprocessing:
    - Convert text to lowercase.
    - Remove special characters, punctuation, and numbers.
    - Remove stopwords (e.g., "the", "is", "and").
    - Apply stemming or lemmatization to normalize words.
- Vocabulary Creation: Build a unique list of words from the corpus.
- Vectorization: Represent each document as a vector of word frequencies based on the vocabulary.

In [4]:

paragraph = """
By tokenizing, you can conveniently split up text by word or by sentence. 
This will allow you to work with smaller pieces of text that are still relatively coherent and meaningful even outside of the context of the rest of the text. 
It’s your first step in turning unstructured data into structured data, which is easier to analyze.

When you’re analyzing text, you’ll be tokenizing by word and tokenizing by sentence. 
Here’s what both types of tokenization bring to the table:

Tokenizing by word: Words are like the atoms of natural language. 
They’re the smallest unit of meaning that still makes sense on its own. 
Tokenizing your text by word allows you to identify words that come up particularly often. 
For example, if you were analyzing a group of job ads, 
then you might find that the word “Python” comes up often. 
That could suggest high demand for Python knowledge, but you’d need to look deeper to know more.

Tokenizing by sentence: When you tokenize by sentence, 
you can analyze how those words relate to one another and see more context. 
Are there a lot of negative words around the word “Python” because the hiring manager doesn’t like Python? 
Are there more terms from the domain of herpetology than the domain of software development, 
suggesting that you may be dealing with an entirely different kind of python than you were expecting?

"""
# Step 1: Split the paragraph into sentences (each sentence is treated as one document)
sentences = nltk.sent_tokenize(paragraph)

# Step 2: Preprocess each sentence
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the stopword set once, not per sentence
corpus = []

for sentence in sentences:
    # Replace everything except ASCII letters (digits, punctuation, symbols) with a space
    review = re.sub('[^a-zA-Z]', ' ', sentence)
    # Convert to lowercase
    review = review.lower()
    # Split into words
    review = review.split()
    # Remove stopwords and apply stemming
    review = [ps.stem(word) for word in review if word not in stop_words]
    # Join the words back into a sentence
    review = ' '.join(review)
    # Add the cleaned sentence to the corpus
    corpus.append(review)

# Step 3: Create Bag of Words model
cv = CountVectorizer(max_features=1500)  # Limit to top 1500 features (if necessary)
X = cv.fit_transform(corpus).toarray()

# Get feature names (vocabulary)
vocabulary = cv.get_feature_names_out()

# Create DataFrame with each sentence as a column
bow_df = pd.DataFrame(X.T, index=vocabulary, columns=[f"Document {i+1}" for i in range(X.shape[0])])
bow_df

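|  | Document 1 | Document 2 | Document 3 | Document 4 | Document 5 | Document 6 | Document 7 | Document 8 | Document 9 | Document 10 | Document 11 | Document 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ad | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| allow | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| analyz | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| anoth | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| around | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| type | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| unit | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| unstructur | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| word | 1 | 0 | 0 | 1 | 2 | 0 | 2 | 1 | 0 | 1 | 2 | 0 |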


## 4. Term Frequency-Inverse Document Frequency (`TF-IDF`)  

While the Bag of Words model provides a simple representation of text data, it does not account for the importance of words within a document or across multiple documents. **Term Frequency-Inverse Document Frequency (TF-IDF)** improves upon this by assigning a weight to each word that reflects its relevance to a specific document in the corpus.


### 4.1 Understanding TF-IDF

The TF-IDF score for a term $ t $ in a document $ d $ is calculated as:

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
$$

Where:
1. **Term Frequency (TF)**: Measures how frequently a term appears in a document.

   $$
   \text{TF}(t, d) = \frac{\text{Number of occurrences of } t \text{ in } d}{\text{Total number of terms in } d}
   $$

2. **Inverse Document Frequency (IDF)**: Measures how informative a term is across the corpus by reducing the weight of words that appear in many documents. Smoothing adds 1 to both counts to avoid division by zero, and $ \log $ denotes the natural logarithm. This is the same smoothed variant that `TfidfVectorizer` from `sklearn` uses by default:

   $$
   \text{IDF}(t) = \log\left(\frac{\text{Total number of documents} + 1}{\text{Number of documents containing } t + 1}\right) + 1
   $$

The result is a weighted score that increases with the frequency of the term in the document but decreases with its frequency across the corpus.
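To get a feel for how the IDF factor behaves, the sketch below evaluates the smoothed formula for terms of increasing document frequency. The corpus size of 1000 and the document frequencies are hypothetical values chosen for illustration, and the natural logarithm (`math.log`) is assumed.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1000  # hypothetical corpus size
for doc_freq in (1, 10, 100, 1000):
    print(f"df={doc_freq:>4}  idf={smoothed_idf(n_docs, doc_freq):.3f}")
```

The rarer the term, the larger its IDF. A term that occurs in every document gets an IDF of exactly 1, so with this variant its TF-IDF weight reduces to its plain term frequency instead of dropping to zero.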

### 4.2 Example: Calculating TF-IDF  

Let’s revisit the two documents:  
1. Document 1: "Natural Language Processing is amazing."  
2. Document 2: "Language models are important for NLP."  

The vocabulary is:  
`['amazing', 'are', 'for', 'important', 'is', 'language', 'models', 'natural', 'nlp', 'processing']`  

We calculate TF, IDF, and TF-IDF for each term. Document 1 contains 5 terms and Document 2 contains 6 terms. The scores in this table are raw (unnormalized); the implementation that follows additionally L2-normalizes each document vector, so its printed values differ from the ones here.

| **Vocabulary** | **TF (Doc 1)**   | **TF (Doc 2)**   | **IDF**                                             | **TF-IDF (Doc 1)**       | **TF-IDF (Doc 2)**       |
|-----------------|------------------|------------------|-----------------------------------------------------|--------------------------|--------------------------|
| amazing         | $ \frac{1}{5} = 0.2 $ | $ \frac{0}{6} = 0.0 $ | $ \log\left(\frac{2+1}{1+1}\right) + 1 \approx 1.405 $  | $ 0.2 \times 1.405 \approx 0.28 $  | $ 0.0 \times 1.405 = 0.0 $  |
| are             | $ \frac{0}{5} = 0.0 $ | $ \frac{1}{6} \approx 0.167 $ | $ \log\left(\frac{2+1}{1+1}\right) + 1 \approx 1.405 $  | $ 0.0 \times 1.405 = 0.0 $  | $ 0.167 \times 1.405 \approx 0.23 $ |
| for             | $ \frac{0}{5} = 0.0 $ | $ \frac{1}{6} \approx 0.167 $ | $ \log\left(\frac{2+1}{1+1}\right) + 1 \approx 1.405 $  | $ 0.0 \times 1.405 = 0.0 $  | $ 0.167 \times 1.405 \approx 0.23 $ |
| important       | $ \frac{0}{5} = 0.0 $ | $ \frac{1}{6} \approx 0.167 $ | $ \log\left(\frac{2+1}{1+1}\right) + 1 \approx 1.405 $  | $ 0.0 \times 1.405 = 0.0 $  | $ 0.167 \times 1.405 \approx 0.23 $ |
| is              | $ \frac{1}{5} = 0.2 $ | $ \frac{0}{6} = 0.0 $ | $ \log\left(\frac{2+1}{1+1}\right) + 1 \approx 1.405 $  | $ 0.2 \times 1.405 \approx 0.28 $  | $ 0.0 \times 1.405 = 0.0 $  |
| language        | $ \frac{1}{5} = 0.2 $ | $ \frac{1}{6} \approx 0.167 $ | $ \log\left(\frac{2+1}{2+1}\right) + 1 = 1.00 $  | $ 0.2 \times 1.00 = 0.20 $  | $ 0.167 \times 1.00 \approx 0.17 $ |
| models          | $ \frac{0}{5} = 0.0 $ | $ \frac{1}{6} \approx 0.167 $ | $ \log\left(\frac{2+1}{1+1}\right) + 1 \approx 1.405 $  | $ 0.0 \times 1.405 = 0.0 $  | $ 0.167 \times 1.405 \approx 0.23 $ |
| natural         | $ \frac{1}{5} = 0.2 $ | $ \frac{0}{6} = 0.0 $ | $ \log\left(\frac{2+1}{1+1}\right) + 1 \approx 1.405 $  | $ 0.2 \times 1.405 \approx 0.28 $  | $ 0.0 \times 1.405 = 0.0 $  |
| nlp             | $ \frac{0}{5} = 0.0 $ | $ \frac{1}{6} \approx 0.167 $ | $ \log\left(\frac{2+1}{1+1}\right) + 1 \approx 1.405 $  | $ 0.0 \times 1.405 = 0.0 $  | $ 0.167 \times 1.405 \approx 0.23 $ |
| processing      | $ \frac{1}{5} = 0.2 $ | $ \frac{0}{6} = 0.0 $ | $ \log\left(\frac{2+1}{1+1}\right) + 1 \approx 1.405 $  | $ 0.2 \times 1.405 \approx 0.28 $  | $ 0.0 \times 1.405 = 0.0 $  |



In [9]:
import math
from collections import Counter
from numpy.linalg import norm

# Step 2: Preprocess the text (lowercase, remove punctuation)
def preprocess_text(text):
    text = text.lower()
    text = ''.join([char if char.isalnum() or char.isspace() else ' ' for char in text])
    return text.split()

# Step 3: Create the vocabulary
def build_vocabulary(corpus):
    vocabulary = set()
    for document in corpus:
        vocabulary.update(preprocess_text(document))
    return sorted(vocabulary)

# Step 4: Calculate Term Frequency (TF)
def compute_tf(document, vocabulary):
    term_count = Counter(preprocess_text(document))
    total_terms = sum(term_count.values())
    return [term_count[word] / total_terms for word in vocabulary]

# Step 5: Calculate Inverse Document Frequency (IDF) with smoothing
def compute_idf(corpus, vocabulary):
    num_documents = len(corpus)
    idf = []
    for word in vocabulary:
        doc_count = sum(1 for document in corpus if word in preprocess_text(document))
        idf.append(math.log((1 + num_documents) / (1 + doc_count)) + 1)
    return idf

# Step 6: Compute TF-IDF for each document
def compute_tfidf(corpus):
    vocabulary = build_vocabulary(corpus)
    tf_matrix = [compute_tf(document, vocabulary) for document in corpus]
    idf = compute_idf(corpus, vocabulary)
    tfidf_matrix = [[tf * idf[idx] for idx, tf in enumerate(tf_doc)] for tf_doc in tf_matrix]
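    # L2-normalize each row (document); an all-zero row is left as zeros
    tfidf_matrix_normalized = []
    for doc in tfidf_matrix:
        doc_norm = norm(doc)  # compute the norm once per document
        tfidf_matrix_normalized.append(
            [value / doc_norm if doc_norm != 0 else 0 for value in doc]
        )
    return vocabulary, tfidf_matrix_normalized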

# Run the implementation
corpus = [
    "Natural Language Processing is amazing.",
    "Language models are important for NLP."
]
vocabulary, tfidf_matrix = compute_tfidf(corpus)

# Step 7: Display the results
print("Vocabulary:", vocabulary)
print("\nTF-IDF Matrix:")
for idx, doc_tfidf in enumerate(tfidf_matrix):
    print(f"Document {idx + 1}: {doc_tfidf}")


Vocabulary: ['amazing', 'are', 'for', 'important', 'is', 'language', 'models', 'natural', 'nlp', 'processing']

TF-IDF Matrix:
Document 1: [0.4710778123316179, 0.0, 0.0, 0.0, 0.4710778123316179, 0.335175743327926, 0.0, 0.4710778123316179, 0.0, 0.4710778123316179]
Document 2: [0.0, 0.42615959880289433, 0.42615959880289433, 0.42615959880289433, 0.0, 0.3032160644503863, 0.42615959880289433, 0.0, 0.42615959880289433, 0.0]


### **Exercise 1**: Representing Text with TF-IDF Using `TfidfVectorizer`  

**Objective**  
In this exercise, you will use the `TfidfVectorizer` from `sklearn` to calculate and visualize the **TF-IDF scores** for a given corpus.  

Given a small corpus of two documents:  
1. Document 1: `"Natural Language Processing is amazing."`  
2. Document 2: `"Language models are important for NLP."`  

**Your Task**  

1. Use the `TfidfVectorizer` to calculate the TF-IDF scores for the given corpus.  
2. Extract the vocabulary and TF-IDF matrix.   


**Expected Output**  

- **Vocabulary**: A list of unique words in the corpus.  
- **TF-IDF Matrix**: A table where rows represent documents and columns represent words, with values showing the TF-IDF score for each word in each document.  

In [None]:
# @title 🧑🏿‍💻 Your code here
# Example Corpus
corpus = [
    "Natural Language Processing is amazing.",
    "Language models are important for NLP."
]

In [8]:
# @title 👀 Solution

# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the corpus
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

# Display the vocabulary and TF-IDF matrix
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Matrix:")

# Create a DataFrame for better visualization
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df

Vocabulary: ['amazing' 'are' 'for' 'important' 'is' 'language' 'models' 'natural'
 'nlp' 'processing']
TF-IDF Matrix:


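|  | amazing | are | for | important | is | language | models | natural | nlp | processing |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.471078 | 0.0 | 0.0 | 0.0 | 0.471078 | 0.335176 | 0.0 | 0.471078 | 0.0 | 0.471078 |
| 1 | 0.0 | 0.42616 | 0.42616 | 0.42616 | 0.0 | 0.303216 | 0.42616 | 0.0 | 0.42616 | 0.0 |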
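By default `TfidfVectorizer` L2-normalizes every row, so the dot product of two rows is already their cosine similarity. The sketch below checks this on the two rows from the output above. The values are copied by hand and rounded to six decimals, so the norms come out only approximately equal to 1; `l2_norm` and `dot` are helper names introduced here for illustration.

```python
import math

# TF-IDF rows copied from the output above (rounded to six decimals)
doc1 = [0.471078, 0.0, 0.0, 0.0, 0.471078, 0.335176, 0.0, 0.471078, 0.0, 0.471078]
doc2 = [0.0, 0.42616, 0.42616, 0.42616, 0.0, 0.303216, 0.42616, 0.0, 0.42616, 0.0]

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Both rows have (approximately) unit length
print(round(l2_norm(doc1), 3), round(l2_norm(doc2), 3))

# With unit-length rows, the dot product is the cosine similarity
print(round(dot(doc1, doc2), 3))
```

The only term the two documents share is `language`, so it is the only position that contributes to the dot product, and the resulting similarity is low.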


## 5. Pointwise Mutual Information (`PMI`) 
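Pointwise Mutual Information compares how often a word $ w $ and a context word $ c $ co-occur with how often they would co-occur if they were independent: $ \text{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w)\,P(c)} $. Positive PMI (`PPMI`) replaces negative values with zero: $ \text{PPMI}(w, c) = \max(0, \text{PMI}(w, c)) $. The following is a minimal sketch of that computation, not a definitive implementation. The toy corpus, the context window of one word on each side, and the names `pair_counts` and `ppmi` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus of pre-tokenized sentences
sentences = [
    ["natural", "language", "processing"],
    ["language", "models", "process", "language"],
    ["natural", "language", "models"],
]
window = 1  # assumed context window: one word on each side

# Count (word, context) co-occurrences within the window
pair_counts = Counter()
for tokens in sentences:
    for i, word in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pair_counts[(word, tokens[j])] += 1

# Marginal counts for words and contexts
total = sum(pair_counts.values())
word_counts = Counter()
context_counts = Counter()
for (word, context), count in pair_counts.items():
    word_counts[word] += count
    context_counts[context] += count

def ppmi(word, context):
    """max(0, log2(P(w, c) / (P(w) * P(c)))); 0.0 for pairs never observed."""
    joint = pair_counts[(word, context)]
    if joint == 0:
        return 0.0
    pmi = math.log2(joint * total / (word_counts[word] * context_counts[context]))
    return max(0.0, pmi)

print(ppmi("natural", "language"))  # the pair co-occurs in two sentences
print(ppmi("natural", "models"))    # never adjacent within the window
```

Pairs that are never observed would have a PMI of negative infinity, which is one practical reason for clipping at zero. On a real corpus the scores are usually stored in a sparse word-by-context matrix.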

## 6. Closing Thoughts 