# Exercises: Understanding and Extending Bag of Words

---

### **1. Modify the Stopword List**
- Add custom words to the stopword list, such as the domain-specific terms "cat", "dog", and "squirrel".
- Remove certain words from the default stopword list.
- Observe how these changes affect the vocabulary and BoW representations.

**Questions to Explore:**
- How does the size of the vocabulary change when stopwords are modified?
- Do certain documents now share more or fewer words in common?
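
As a possible starting point, the sketch below builds a modified stopword set and compares the resulting vocabularies. The `DEFAULT_STOPWORDS` set, the sample `docs`, and the helpers `build_stopwords` and `build_vocabulary` are assumptions made for illustration; the tutorial's own `preprocess` function and stopword list may differ.

```python
import re

# Hypothetical default list; the tutorial's actual stopword list may differ.
DEFAULT_STOPWORDS = {"the", "a", "an", "on", "in", "is", "and", "of", "to"}

def build_stopwords(default, add=(), remove=()):
    """Return a new stopword set with custom additions and removals."""
    return (set(default) | set(add)) - set(remove)

def preprocess(text, stopwords):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]

def build_vocabulary(docs, stopwords):
    """Collect the sorted set of tokens that survive preprocessing."""
    return sorted({t for doc in docs for t in preprocess(doc, stopwords)})

# Toy documents used only for this illustration.
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A squirrel is on the mat.",
]

custom = build_stopwords(
    DEFAULT_STOPWORDS, add={"cat", "dog", "squirrel"}, remove={"on"}
)
vocab_default = build_vocabulary(docs, DEFAULT_STOPWORDS)
vocab_custom = build_vocabulary(docs, custom)

print(len(vocab_default), vocab_default)
# 6 ['cat', 'chased', 'dog', 'mat', 'sat', 'squirrel']
print(len(vocab_custom), vocab_custom)
# 4 ['chased', 'mat', 'on', 'sat']
```

Returning a new set instead of mutating the default keeps both versions available for side-by-side comparison.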

---

### **2. Implement N-grams**
- Extend the `preprocess` function to include **bigrams** (n=2) or **trigrams** (n=3) instead of individual words.
- Update the BoW generation to account for these n-grams.

**Example:**
- Original Text: `"The cat sat on the mat."`
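- Vocabulary with Bigrams: `['the cat', 'cat sat', 'sat on', 'on the', 'the mat']`
- This example assumes lowercasing and punctuation removal, with stopwords kept. Removing stopwords before building n-grams would produce different bigrams.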

**Questions to Explore:**
- How does the vocabulary size change when using n-grams?
- Does the BoW representation become more meaningful or less meaningful?
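
One way to approach this is sketched below. The `tokenize` and `ngrams` helpers and the signatures of `preprocess` and `create_bow` are assumptions for illustration, not the tutorial's actual code. Stopwords are deliberately kept so that the output matches the example above.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n):
    """Join each run of n consecutive tokens into one space-separated term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def preprocess(text, n=1):
    """Return the n-grams of the text (n=1 gives plain words)."""
    return ngrams(tokenize(text), n)

def create_bow(terms, vocabulary):
    """Count how often each vocabulary entry occurs in the list of terms."""
    counts = Counter(terms)
    return [counts[term] for term in vocabulary]

text = "The cat sat on the mat."
bigrams = preprocess(text, n=2)
trigrams = preprocess(text, n=3)

# Vocabulary in first-seen order, without duplicates.
vocabulary = list(dict.fromkeys(bigrams))

print(vocabulary)
# ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
print(create_bow(bigrams, vocabulary))
# [1, 1, 1, 1, 1]
```

Note that the unigram "the" occurs twice in this sentence, while every bigram occurs only once, which hints at how n-grams make vectors sparser.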

---

### **3. Normalize the BoW Representation**
- Modify the `create_bow` function to normalize the BoW vectors by the total word count in each document.
- This converts raw frequencies to relative frequencies.

**Questions to Explore:**
- How does normalization affect the BoW representations for documents with different lengths?
- Is normalization helpful when comparing documents?
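
A minimal sketch of the normalization step follows. The `create_bow` signature with a `normalize` flag is an assumption; the tutorial's function may take different arguments. Here "total word count" is taken to mean the number of tokens in the document after preprocessing.

```python
from collections import Counter

def create_bow(tokens, vocabulary, normalize=False):
    """Return raw counts, or relative frequencies when normalize=True."""
    counts = Counter(tokens)
    vector = [counts[term] for term in vocabulary]
    if not normalize:
        return vector
    total = len(tokens)
    if total == 0:
        # An empty document has no meaningful frequencies; avoid dividing by zero.
        return [0.0] * len(vocabulary)
    return [count / total for count in vector]

vocabulary = ["cat", "sat"]
short_doc = ["cat", "sat"]
long_doc = ["cat", "sat", "cat", "sat"]

print(create_bow(short_doc, vocabulary))                  # [1, 1]
print(create_bow(long_doc, vocabulary))                   # [2, 2]
print(create_bow(short_doc, vocabulary, normalize=True))  # [0.5, 0.5]
print(create_bow(long_doc, vocabulary, normalize=True))   # [0.5, 0.5]
```

The two documents have the same word proportions, so their normalized vectors coincide even though their raw counts differ.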

---

### **4. Visualize the Vocabulary**
- Use a bar chart to plot the most frequent words across all documents.
- Count word frequencies before stopword removal, and highlight the bars for words that the stopword list would exclude.

**Questions to Explore:**
- Which words dominate the vocabulary?
- How does the stopword list affect word frequency distributions?
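
The sketch below shows one possible approach using Matplotlib, which is an assumed choice since the exercise does not name a plotting library. The `STOPWORDS` set, the sample `docs`, and the helpers `top_words` and `plot_top_words` are hypothetical. Frequencies are counted before stopword removal so that excluded words can still appear in the chart.

```python
import re
from collections import Counter

# Hypothetical stopword list; replace it with the one used in the tutorial.
STOPWORDS = {"the", "a", "an", "on", "in", "is", "and", "of", "to"}

def top_words(docs, n=10):
    """Count every token before stopword removal and return the n most frequent."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts.most_common(n)

def plot_top_words(top, stopwords):
    """Draw a bar chart in which stopword bars get a different color."""
    # Imported here so the counting code above runs without Matplotlib installed.
    import matplotlib.pyplot as plt

    words = [word for word, _ in top]
    freqs = [freq for _, freq in top]
    colors = ["tab:red" if word in stopwords else "tab:blue" for word in words]

    fig, ax = plt.subplots()
    ax.bar(words, freqs, color=colors)
    ax.set_ylabel("Frequency across all documents")
    ax.set_title("Most frequent words (red = excluded by the stopword list)")
    ax.tick_params(axis="x", labelrotation=45)
    fig.tight_layout()
    plt.show()

# Toy documents used only for this illustration.
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A squirrel is on the mat.",
]

top = top_words(docs, n=5)
print(top[0])  # ('the', 5)
```

Calling `plot_top_words(top, STOPWORDS)` then draws the chart. Keeping the counting separate from the plotting makes it easy to inspect the numbers before visualizing them.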

---

### **5. Extend the Corpus**
- Use a larger, more diverse corpus (e.g., a collection of news articles or book excerpts).
- Generate the vocabulary and BoW representations for this extended dataset.

**Questions to Explore:**
- How does the size of the vocabulary grow with a larger corpus?
- Are there new challenges with handling a larger vocabulary?
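
To get a feel for these questions, the sketch below tracks vocabulary size as documents are added and stores each BoW vector sparsely, keeping only non-zero counts. The four-sentence `corpus` is a placeholder for a real dataset, and `vocabulary_growth` and `sparse_bow` are hypothetical helper names.

```python
import re
from collections import Counter

def preprocess(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def vocabulary_growth(docs):
    """Return the cumulative vocabulary size after each document."""
    seen = set()
    sizes = []
    for doc in docs:
        seen.update(preprocess(doc))
        sizes.append(len(seen))
    return sizes

def sparse_bow(text):
    """Store only non-zero counts instead of a full-length dense vector."""
    return Counter(preprocess(text))

# Placeholder corpus; substitute news articles or book excerpts here.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "Stock markets fell sharply on Monday.",
    "The central bank raised interest rates.",
]

sizes = vocabulary_growth(corpus)
print(sizes)  # [5, 7, 12, 17]

bow = sparse_bow(corpus[0])
print(len(bow), "non-zero entries out of", sizes[-1])
# 5 non-zero entries out of 17
```

Documents on a new topic add many new words at once, and most entries of a dense vector would be zero, which is why sparse storage becomes attractive as the corpus grows.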

---

### **6. Experiment with Token Filters**
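- Add the following filters to the preprocessing pipeline:
  - **Word length**: Exclude words shorter than 3 characters. This can be done inside the `preprocess` function.
  - **Frequency threshold**: Exclude words that appear in fewer than 2 documents. This requires counts over the whole corpus, so apply it after all documents have been preprocessed.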

**Questions to Explore:**
- How do these filters reduce the size of the vocabulary?
- Do the BoW representations lose or retain meaningful information?
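
Both filters could be sketched as follows. The `min_length` and `min_df` parameter names and the `filter_by_document_frequency` helper are assumptions for illustration. The length filter works on a single document, while the document-frequency filter needs the whole tokenized corpus.

```python
import re
from collections import Counter

def preprocess(text, min_length=3):
    """Lowercase, tokenize, and drop tokens shorter than min_length."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_length]

def filter_by_document_frequency(tokenized_docs, min_df=2):
    """Keep only tokens that occur in at least min_df documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    keep = {t for t, n in doc_freq.items() if n >= min_df}
    filtered = [[t for t in tokens if t in keep] for tokens in tokenized_docs]
    return filtered, sorted(keep)

# Toy documents used only for this illustration.
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A squirrel is on the mat.",
]

tokenized = [preprocess(doc) for doc in docs]
filtered, vocabulary = filter_by_document_frequency(tokenized, min_df=2)

print(vocabulary)   # ['cat', 'mat', 'the']
print(filtered[2])  # ['the', 'mat']
```

In this tiny corpus the threshold removes "squirrel", the only distinctive word in the third document, which illustrates how aggressive filtering can discard meaningful information.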

---

### **7. Compare BoW with Term Frequency (TF)**
- Modify the `create_bow` function to calculate **term frequency (TF)** instead of raw word counts.
- TF divides each word count by the total number of words in the document, which is the same normalization as in Exercise 3.

**Questions to Explore:**
- How do the BoW representations differ when using raw counts vs. term frequency?
- Which representation is more useful for comparing documents?
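
One way to make the comparison concrete is to measure the distance between two documents under each representation. In the sketch below, `raw_counts`, `term_frequency`, and `euclidean` are hypothetical helpers, and Euclidean distance is an assumed choice of metric. The second document is the first one repeated three times, so the two differ only in length.

```python
import math
from collections import Counter

def raw_counts(tokens, vocabulary):
    """Raw BoW vector: number of occurrences of each vocabulary word."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def term_frequency(tokens, vocabulary):
    """TF vector: raw counts divided by the document's total token count."""
    total = len(tokens)
    return [c / total if total else 0.0 for c in raw_counts(tokens, vocabulary)]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

vocabulary = ["cat", "mat", "sat"]
short_doc = ["cat", "sat", "mat"]
long_doc = short_doc * 3  # same content, three times as long

raw_distance = euclidean(
    raw_counts(short_doc, vocabulary), raw_counts(long_doc, vocabulary)
)
tf_distance = euclidean(
    term_frequency(short_doc, vocabulary), term_frequency(long_doc, vocabulary)
)

print(round(raw_distance, 3))  # 3.464
print(round(tf_distance, 3))   # 0.0
```

With raw counts the two documents look far apart purely because of length, while TF places them at the same point. A scale-invariant measure such as cosine similarity would give the same result for both representations, so the choice of metric matters when answering the questions above.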

---

