# Bag of Words (BoW) Representation

The **Bag of Words (BoW)** model is one of the most fundamental methods used to represent text data in numerical form. It provides a way to convert a collection of text documents into a set of features that can be used in machine learning models. BoW is simple yet effective for many natural language processing (NLP) tasks, such as classification, clustering, and sentiment analysis.

##### What is Bag of Words?

In the **Bag of Words** model, each document is represented by a vector of word frequencies. BoW disregards grammar and word order, keeping only which vocabulary words occur in the document and how often. Each word is counted on its own, with no record of the words around it, which keeps the representation simple.

For example, consider the following two sentences:

- "The cat sat on the mat."
- "The dog lay on the mat."

The vocabulary for these two sentences, after lowercasing and dropping punctuation, would be: **[the, cat, sat, on, mat, dog, lay]**. Each sentence is then represented by a frequency vector showing how often each vocabulary word appears in it.

##### How Bag of Words Works

1. **Create Vocabulary**: Construct a list of all unique words that appear across all documents in the dataset. This list is referred to as the vocabulary.
2. **Vector Representation**: For each document, create a vector where each element corresponds to a word in the vocabulary. The value in each position indicates how many times the corresponding word appears in that document.

For the example above, the frequency vectors would be:

- "The cat sat on the mat" -> **[2, 1, 1, 1, 1, 0, 0]**
- "The dog lay on the mat" -> **[2, 0, 0, 1, 1, 1, 1]**
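
The two steps above can be sketched in plain Python, without any library. This is a minimal illustration only: the `tokenize` and `to_vector` helpers and the first-appearance ordering of the vocabulary are assumptions made to match the example, not how any particular library does it.

```python
import re
from collections import Counter

sentences = ["The cat sat on the mat.", "The dog lay on the mat."]

def tokenize(text):
    # Lowercase the text and keep only runs of letters (drops punctuation)
    return re.findall(r"[a-z]+", text.lower())

# Step 1: build the vocabulary in order of first appearance
vocab = []
for sentence in sentences:
    for token in tokenize(sentence):
        if token not in vocab:
            vocab.append(token)

# Step 2: count how often each vocabulary word occurs in a sentence
def to_vector(text):
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

vectors = [to_vector(sentence) for sentence in sentences]

print(vocab)
for sentence, vector in zip(sentences, vectors):
    print(sentence, "->", vector)
```

The printed vectors should match the two frequency vectors listed above.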

##### Advantages and Limitations of Bag of Words

**Advantages**:
- **Simplicity**: BoW is easy to implement and understand.
- **Effectiveness**: Works well for a variety of text classification tasks.

**Limitations**:
- **Sparsity**: For large vocabularies, the resulting vectors can be very sparse (most elements are zero).
- **No Semantic Information**: BoW does not consider word order or semantics, so the context is lost.
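
Both limitations can be seen on a toy example. The sketch below uses plain Python with a deliberately naive whitespace tokenizer; the `bow_counts` helper and the sample sentences are made up for illustration.

```python
from collections import Counter

def bow_counts(text):
    # Naive tokenizer: lowercase, then split on whitespace
    return Counter(text.lower().split())

# Lost word order: two sentences with opposite meanings get identical counts
a = bow_counts("the dog chased the cat")
b = bow_counts("the cat chased the dog")
print("Same BoW representation:", a == b)

# Sparsity: each document uses only a small part of the shared vocabulary
corpus = [
    "machine learning is fun",
    "deep learning needs data",
    "cats sleep all day",
]
vocab = sorted({word for doc in corpus for word in doc.lower().split()})
doc_counts = [bow_counts(doc) for doc in corpus]
matrix = [[counts[word] for word in vocab] for counts in doc_counts]

zeros = sum(row.count(0) for row in matrix)
total = len(matrix) * len(vocab)
print(f"{zeros} of {total} entries are zero")
```

Even with three short documents most entries are zero, and the share of zeros generally grows as the vocabulary gets larger.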

### 1. Example of BoW

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "Machine learning is a fascinating field",
    "Data science and machine learning are closely related",
    "Deep learning is a subfield of machine learning",
    "Supervised learning involves labeled data",
    "Unsupervised learning deals with unlabeled data",
    "Feature engineering is crucial for model performance",
    "Data preprocessing is an important step in machine learning",
    "Natural language processing is a key area in AI",
    "Hyperparameter tuning helps to optimize models",
    "Model evaluation is necessary for understanding model accuracy"
]
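# Initialize the CountVectorizer with default settings
vectorizer = CountVectorizer()

# Alternative for larger corpora: drop English stop words and keep only the
# 50 most frequent terms. Uncomment to use it instead of the default above.
# vectorizer = CountVectorizer(stop_words='english', max_features=50)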

# Fit and transform the documents to create the BoW representation
bow_matrix = vectorizer.fit_transform(documents)

# Convert the BoW matrix to an array and print the vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Representation:\n", bow_matrix.toarray())


### 2. Visualize Frequency
You can visualize word frequencies in the Bag of Words model with bar plots or word clouds. Doing so shows at a glance which words are most prominent in the dataset.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Get the vocabulary and word frequencies (we created the vectorizer earlier)
vocabulary = vectorizer.get_feature_names_out()
frequencies = np.array(bow_matrix.sum(axis=0)).flatten()

# Sort the vocabulary and frequencies by frequency in descending order
sorted_indices = np.argsort(-frequencies)
sorted_vocabulary = vocabulary[sorted_indices]
sorted_frequencies = frequencies[sorted_indices]

# Plot the word frequencies
plt.figure(figsize=(10, 5))
plt.bar(sorted_vocabulary, sorted_frequencies)
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Word Frequencies in Bag of Words')
plt.xticks(rotation=45)
plt.show()

### 3. Visualize Wordcloud
Another way to visualize word frequencies is by using a word cloud, which can provide an engaging representation of the most frequent words in a dataset.

In [None]:
%pip install -U wordcloud

In [None]:
from wordcloud import WordCloud

# Create a word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(dict(zip(vocabulary, frequencies)))

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

### 4. Working with a Dataframe

##### 4.1 Load the Data

In [None]:
import pandas as pd

# Load the data
df = pd.read_csv('movies_cleaned.csv')


##### 4.2 Bag-of-Words

In [None]:
# Initialize the CountVectorizer
vectorizer = CountVectorizer(stop_words='english', max_features=50)

# Fit and transform the documents to create the BoW representation
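# fillna('') is a defensive step in case any plot is missing; CountVectorizer cannot handle NaN
bow_matrix = vectorizer.fit_transform(df['Plot'].fillna(''))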

# Create a DataFrame for the BoW representation
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out(), index=df.index)

# Display the vocabulary and BoW DataFrame
print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nBag of Words DataFrame:")
print(bow_df)

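# Merge the BoW DataFrame with the original DataFrame for a complete view.
# The result goes into a new variable so re-running this cell does not
# keep appending duplicate BoW columns to df.
df_with_bow = pd.concat([df, bow_df], axis=1)
print("\nOriginal DataFrame with BoW Representation:")
print(df_with_bow)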

##### 4.3 Visualize Frequency

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Get the vocabulary and word frequencies (we created the vectorizer earlier)
vocabulary = vectorizer.get_feature_names_out()
frequencies = np.array(bow_matrix.sum(axis=0)).flatten()

# Sort the vocabulary and frequencies by frequency in descending order
sorted_indices = np.argsort(-frequencies)
sorted_vocabulary = vocabulary[sorted_indices]
sorted_frequencies = frequencies[sorted_indices]

# Plot the word frequencies
plt.figure(figsize=(10, 5))
plt.bar(sorted_vocabulary, sorted_frequencies)
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Word Frequencies in Bag of Words')
plt.xticks(rotation=90)
plt.show()

##### 4.4 Visualize Wordcloud

In [None]:
from wordcloud import WordCloud

# Create a word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(dict(zip(vocabulary, frequencies)))

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

### 5. Translate to the Case
Go to the case and apply BoW to the news articles.