### 1. Imports and Setup

In [None]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import gensim
from gensim import corpora
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import string


### Download NLTK resources

In [None]:
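nltk.download('punkt')
# Recent NLTK releases load the tokenizer data from 'punkt_tab';
# requesting both keeps this cell working across versions.
nltk.download('punkt_tab')
nltk.download('stopwords')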

stop_words = set(stopwords.words('english'))

### 2. Define a simple test corpus

In [None]:
documents = [
    "I love deep learning and natural language processing.",
    "Natural language models are fascinating.",
    "Topic modeling helps discover themes in text.",
    "Machine learning enables automatic topic discovery.",
    "Neural networks learn embeddings from data.",
    "Artificial Intelligence is transforming industries.",
    "Text analysis techniques improve information retrieval.",
    "Large language models power chatbots and assistants."
]


✅ Simple Preprocessing

In [None]:
# Lowercase, tokenize, then keep only alphabetic tokens that are not stopwords
# (the isalpha() check also drops punctuation and numbers)
def preprocess(doc):
    tokens = word_tokenize(doc.lower())
    return [word for word in tokens if word.isalpha() and word not in stop_words]

processed_docs = [preprocess(doc) for doc in documents]

In [None]:
processed_docs

✅ Create dictionary and BoW corpus

In [None]:
# Create dictionary and corpus
dictionary = corpora.Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]


In [None]:
# Optional: list of all the words in the vocabulary
vocab = [dictionary[i] for i in range(len(dictionary))]

# Document-term matrix
bow_matrix = []
for doc_bow in corpus:
    word_freq = dict(doc_bow)
    row = [word_freq.get(i, 0) for i in range(len(dictionary))]
    bow_matrix.append(row)

df_bow = pd.DataFrame(bow_matrix, columns=vocab)
df_bow.index = [f'Doc {i+1}' for i in range(len(documents))]

In [None]:
df_bow

### 3. Bag of Words

The **Bag-of-Words** model is one of the simplest ways to represent text numerically. It ignores grammar and word order and focuses only on word occurrence.

##### **What is it?**
- Each document is treated as a "bag" of individual words.
- A vocabulary is built from all the unique words in the corpus.
- Each document is then represented as a vector counting how many times each word from the vocabulary appears.

This results in a **document-term matrix**:
- Each row corresponds to a document.
- Each column corresponds to a word from the vocabulary.
- Each cell contains the count of the word in that document.

Although simple, BoW has limitations:
- It does not consider word order or context.
- It can result in very high-dimensional and sparse data.
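
The first limitation can be made concrete with a minimal sketch. The helper `bag_of_words` and the two sentences are illustrative assumptions, not part of the notebook's corpus. Both sentences use the same words in a different order, so their bags are identical even though their meanings differ:

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


# Hypothetical sentences: same words, opposite meaning.
bow_a = bag_of_words("the dog bites the man")
bow_b = bag_of_words("the man bites the dog")

# The two representations are indistinguishable.
print(bow_a == bow_b)  # True
```

Any model built on top of these counts sees the two sentences as the same input.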

##### **Simple Example**
Let's say we have two short documents:

- Document 1: "I love NLP"
- Document 2: "I love machine learning"

The combined vocabulary is: `[I, love, NLP, machine, learning]`

We can represent each document as a vector of word counts:

| Document | I | love | NLP | machine | learning |
|----------|---|------|-----|---------|----------|
| Doc 1    | 1 | 1    | 1   | 0       | 0        |
| Doc 2    | 1 | 1    | 0   | 1       | 1        |

This matrix shows how many times each word appears in each document. No word order is preserved.
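
The table above can be reproduced with a few lines of plain Python. This is only a sketch: it splits on whitespace and keeps the original casing, and the `toy_` names are chosen here so that they do not overwrite the notebook's own variables.

```python
from collections import Counter

toy_docs = ["I love NLP", "I love machine learning"]

# Whitespace tokenization keeps the example minimal; real pipelines
# usually lowercase the text and strip punctuation first.
toy_tokens = [doc.split() for doc in toy_docs]

# Build the vocabulary in order of first appearance.
toy_vocab = []
for tokens in toy_tokens:
    for token in tokens:
        if token not in toy_vocab:
            toy_vocab.append(token)

# One row per document, one column per vocabulary word.
toy_matrix = []
for tokens in toy_tokens:
    counts = Counter(tokens)
    toy_matrix.append([counts[word] for word in toy_vocab])

print(toy_vocab)
for row in toy_matrix:
    print(row)
```

Each printed row lines up with one row of the table, with columns in the order of `toy_vocab`.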

Despite its limitations, BoW is a foundational method that builds intuition for more sophisticated approaches such as TF-IDF and word embeddings.

##### 🛠️ **Code Example**

The cell below uses `CountVectorizer` from `sklearn` to build a Bag-of-Words (BoW) representation of our corpus and displays it as a Pandas DataFrame for readability:
- `CountVectorizer` transforms the documents into a matrix (documents x words).
- Each element in the matrix is the number of times a word appears in a document.
- `fit_transform` builds the vocabulary and generates the counts in a single step.
- Finally, we convert the matrix into a Pandas DataFrame for better visualization.

Note that `CountVectorizer` is applied to the raw `documents`, not to `processed_docs`. With its default settings it lowercases the text and ignores single-character tokens, but it does not remove stop words, so its columns differ from those of `df_bow` above.

In [None]:
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(documents)
pd.DataFrame(X_bow.toarray(), columns=vectorizer.get_feature_names_out())

In [None]:
type(X_bow)