# Exercises: Understanding and Extending N-Gram Language Models

---

### **1. Experiment with Different N-Gram Sizes**
- Train the model with **different values of `n`** (e.g., 1 for unigrams, 2 for bigrams, 3 for trigrams).
- Compare the quality of the generated text and the next-word predictions for each model.

**Questions to Explore:**
- How does increasing `n` affect the coherence of the generated text?
- Does a higher `n` capture context better, or does it overfit the training text?
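
As a possible starting point, the sketch below counts continuations for several values of `n` on a toy sentence. The `build_counts` helper and the sample text are illustrative assumptions, not part of the model class used in this lesson; adapt the idea to whatever your `train` method stores.

```python
from collections import Counter, defaultdict

def build_counts(tokens, n):
    """Map each (n-1)-token context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])  # empty tuple when n == 1
        counts[context][tokens[i + n - 1]] += 1
    return counts

# Toy corpus (assumption: simple whitespace tokenization).
tokens = "the cat sat on the mat and the cat ate the fish".split()

for n in (1, 2, 3):
    counts = build_counts(tokens, n)
    # Average number of distinct continuations per context. On small corpora
    # this tends to shrink as n grows, a hint that the model is memorizing.
    avg = sum(len(c) for c in counts.values()) / len(counts)
    print(f"n={n}: {len(counts)} contexts, {avg:.2f} continuations per context")
```

When most contexts have only one continuation, generation can do little more than replay the training text, which is worth keeping in mind when judging coherence.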

---

### **2. Extend the Corpus**
- Use a larger and more diverse corpus, such as a collection of news articles, books, or Wikipedia text.
- Train the model on this extended dataset and compare its predictions and generated text with those from the original corpus.

**Questions to Explore:**
- How does the diversity of the corpus affect the model's ability to generate coherent sentences?
- Does the model generalize better with a larger corpus?
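
One way to assemble a larger corpus is to read a folder of plain-text files into a single token list. The sketch below assumes UTF-8 `.txt` files and a simple regex tokenizer; `tokenize`, `load_corpus`, and the directory layout are hypothetical names chosen for illustration.

```python
import re
from pathlib import Path

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters, digits, ')."""
    return re.findall(r"[a-z0-9']+", text.lower())

def load_corpus(directory):
    """Read every .txt file directly inside `directory` into one token list."""
    tokens = []
    # sorted() keeps the file order, and therefore the result, deterministic.
    for path in sorted(Path(directory).glob("*.txt")):
        tokens.extend(tokenize(path.read_text(encoding="utf-8")))
    return tokens
```

Calling `load_corpus("my_corpus")` (a placeholder path) returns tokens ready for training. Note that simply concatenating documents creates a few n-grams that span document boundaries; inserting a boundary token between files is one way to avoid that.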

---

### **3. Add Smoothing**
- Modify the `train` method to implement **additive smoothing**: add a small constant to every n-gram count before normalizing (a constant of 1 gives Laplace smoothing).
- This ensures that the model assigns a small nonzero probability to unseen n-grams.

**Questions to Explore:**
- How does smoothing affect next-word predictions for rare or unseen n-grams?
- Is the generated text more diverse after adding smoothing?
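
A minimal sketch of additive smoothing, assuming counts are stored as a mapping from context tuples to `Counter` objects. The function names and the `alpha` parameter are illustrative, and your `train` method may organize its counts differently. The smoothed estimate is `(count + alpha) / (context_total + alpha * vocabulary_size)`.

```python
from collections import Counter, defaultdict

def train_counts(tokens, n):
    """Count how often each word follows each (n-1)-token context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n - 1])][tokens[i + n - 1]] += 1
    return counts

def smoothed_prob(counts, vocab, context, word, alpha=1.0):
    """Additive-smoothing estimate of P(word | context).

    alpha=1.0 corresponds to Laplace (add-one) smoothing.
    """
    context_counts = counts.get(context, Counter())  # unseen context -> empty
    total = sum(context_counts.values())
    return (context_counts[word] + alpha) / (total + alpha * len(vocab))

tokens = "the cat sat on the mat".split()
vocab = set(tokens)
counts = train_counts(tokens, 2)
print(smoothed_prob(counts, vocab, ("the",), "cat"))  # seen bigram
print(smoothed_prob(counts, vocab, ("the",), "sat"))  # unseen bigram, still > 0
```

With this formula the probabilities for a given context still sum to 1 over the vocabulary, and a context that never appeared in training falls back to a uniform distribution.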

---

### **4. Generate Text from Random Seed Words**
- Use randomly chosen seed words from the corpus to generate text.
- Analyze the variety and coherence of the output for different seeds.

**Questions to Explore:**
- Are some seed words more likely to produce coherent text than others?
- Does the model produce repetitive sequences for certain seeds?
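
The experiment could be set up roughly as follows. This sketch uses a bigram model for brevity, and `build_bigrams` and `generate` are hypothetical helpers rather than the lesson's own methods. A fixed `random.Random` instance keeps runs reproducible while the seed words themselves are still picked at random.

```python
import random
from collections import Counter, defaultdict

def build_bigrams(tokens):
    """Map each word to a Counter of the words that follow it."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following

def generate(following, seed_word, length, rng):
    """Generate up to `length` words, starting from `seed_word`."""
    words = [seed_word]
    for _ in range(length - 1):
        options = following.get(words[-1])
        if not options:
            break  # dead end: this word never had a successor in training
        choices, weights = zip(*options.items())
        # Sample the next word in proportion to its observed frequency.
        words.append(rng.choices(choices, weights=weights)[0])
    return words

tokens = "the cat sat on the mat and the dog sat on the rug".split()
following = build_bigrams(tokens)
rng = random.Random(0)  # fixed RNG state so the experiment can be repeated

for seed_word in rng.sample(sorted(set(tokens)), k=3):
    print(seed_word, "->", " ".join(generate(following, seed_word, 8, rng)))
```

Seed words that occur only at the end of the corpus (here `rug`) stop immediately, which is one concrete reason some seeds behave worse than others.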

---

### **5. Evaluate with Perplexity**
- Implement a method to calculate **perplexity**, a measure of how well the model predicts a test set (lower values mean better predictions).
- Evaluate the model’s perplexity on held-out data.

**Questions to Explore:**
- How does perplexity vary with different values of `n`?
- Does perplexity correlate with the perceived quality of generated text?
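
A minimal sketch of a perplexity calculation for a bigram model, computed as the exponential of the average negative log-probability per predicted word. It assumes add-alpha smoothing so that unseen bigrams do not produce a probability of zero (which would make perplexity infinite), and it assumes test words are in the training vocabulary or have been mapped to an unknown-word token beforehand. All names are illustrative.

```python
import math
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """Map each word to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def perplexity(counts, vocab_size, test_tokens, alpha=1.0):
    """Perplexity of an add-alpha smoothed bigram model on `test_tokens`."""
    log_prob_sum = 0.0
    num_predictions = 0
    for prev, nxt in zip(test_tokens, test_tokens[1:]):
        context_counts = counts.get(prev, Counter())
        total = sum(context_counts.values())
        prob = (context_counts[nxt] + alpha) / (total + alpha * vocab_size)
        log_prob_sum += math.log(prob)
        num_predictions += 1
    if num_predictions == 0:
        raise ValueError("need at least two test tokens")
    # exp of the average negative log-probability per predicted word
    return math.exp(-log_prob_sum / num_predictions)

train = "the cat sat on the mat".split()
counts = train_bigrams(train)
print(perplexity(counts, len(set(train)), "the cat sat".split()))
print(perplexity(counts, len(set(train)), "mat the on".split()))
```

A useful sanity check: a model with no training counts assigns every word probability `1 / vocab_size`, so its perplexity equals the vocabulary size.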

---

### **6. Predict Multiple Next Words**
- Extend the `generate_next_word` method to predict the top `k` most likely next words.
- Generate text by sampling randomly from these top `k` candidates at each step.

**Questions to Explore:**
- Does allowing multiple next words improve the diversity of the generated text?
- How does the generated text quality change with larger values of `k`?
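
One possible shape for this extension, assuming the model can produce a `Counter` of continuation counts for the current context. `top_k_next_words` and `sample_from_top_k` are hypothetical helpers, not the lesson's `generate_next_word` itself, and sampling here is weighted by count; uniform sampling over the top `k` is an equally valid variant.

```python
import random
from collections import Counter

def top_k_next_words(context_counts, k):
    """Return the k most frequent continuations as (word, count) pairs."""
    return context_counts.most_common(k)

def sample_from_top_k(context_counts, k, rng=random):
    """Pick one of the top-k continuations, weighted by its count."""
    candidates = top_k_next_words(context_counts, k)
    if not candidates:
        return None  # no known continuation for this context
    words, weights = zip(*candidates)
    return rng.choices(words, weights=weights)[0]

# Example continuation counts for some context (made-up numbers).
context_counts = Counter({"cat": 5, "dog": 3, "fish": 1})
print(top_k_next_words(context_counts, 2))
print(sample_from_top_k(context_counts, 2, random.Random(0)))
```

With `k=1` this reduces to always choosing the single most likely word; larger `k` admits rarer continuations.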

---

### **7. Handle Case Sensitivity**
- Update the preprocessing step so that differences in capitalization do not split one word into several vocabulary entries.
- For example, treat `The` and `the` as the same word during training but preserve case in the output.

**Questions to Explore:**
- How does case normalization affect the model’s predictions and vocabulary size?
- Is case information necessary for certain types of text (e.g., proper nouns)?
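
A sketch of one approach: train on lowercased tokens, but remember each word's most frequent original spelling and restore it in the output. The helper names are assumptions for illustration, and the heuristic is deliberately simple.

```python
from collections import Counter, defaultdict

def build_case_map(tokens):
    """Map each lowercased token to its most frequent original spelling."""
    forms = defaultdict(Counter)
    for token in tokens:
        forms[token.lower()][token] += 1
    return {lower: seen.most_common(1)[0][0] for lower, seen in forms.items()}

def normalize(tokens):
    """Lowercase every token so 'The' and 'the' share one vocabulary entry."""
    return [token.lower() for token in tokens]

def restore_case(tokens, case_map):
    """Re-apply the remembered spellings and capitalize the first word."""
    restored = [case_map.get(token, token) for token in tokens]
    if restored:
        restored[0] = restored[0][:1].upper() + restored[0][1:]
    return restored

raw = "The cat saw Paris and the dog and the bird saw Paris".split()
case_map = build_case_map(raw)
print(len(set(raw)), "->", len(set(normalize(raw))), "vocabulary entries")
print(restore_case(["the", "dog", "saw", "paris"], case_map))
```

A known limitation: words that mostly appear at the start of sentences will be remembered in their capitalized form, so a real implementation may want to skip sentence-initial tokens when building the map.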

---

### **8. Visualize N-Gram Frequencies**
- Plot a bar chart showing the top 10 most frequent n-grams in the corpus.
- Highlight differences in frequency distributions for different values of `n`.

**Questions to Explore:**
- What are the most frequent n-grams, and do they make sense given the corpus?
- How does the distribution change as `n` increases?
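
A possible starting point for the plot, assuming matplotlib is installed (it is a third-party package, imported lazily so the counting helper works without it). `top_ngrams` and `plot_top_ngrams` are illustrative names.

```python
from collections import Counter

def top_ngrams(tokens, n, k=10):
    """Return the k most frequent n-grams as (ngram_tuple, count) pairs."""
    # zip over n shifted views of the token list yields consecutive n-grams.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams).most_common(k)

def plot_top_ngrams(tokens, n, k=10):
    """Draw a bar chart of the k most frequent n-grams."""
    import matplotlib.pyplot as plt  # third-party dependency (assumption)

    top = top_ngrams(tokens, n, k)
    labels = [" ".join(gram) for gram, _ in top]
    freqs = [count for _, count in top]

    fig, ax = plt.subplots()
    ax.bar(labels, freqs)
    ax.set_xlabel(f"{n}-gram")
    ax.set_ylabel("Frequency")
    ax.set_title(f"Top {k} {n}-grams")
    ax.tick_params(axis="x", labelrotation=45)  # keep long labels readable
    fig.tight_layout()
    plt.show()
```

Calling `plot_top_ngrams(tokens, n)` for `n = 1, 2, 3` on the same token list gives charts that can be compared side by side.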

---
