# Group Names
- Hamad Alqahtani
- Samer Almontasheri
- Ali Alghamdi


Intro to NLP Practical
======================
Students will work through problems on n-grams, probabilities, OOV handling, and classifiers.<br>

In [None]:
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

Toy corpus for language modeling

In [None]:
corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go"
]

--- Part 1: Preprocessing ---

 Q1.1 Sequence notation<br>
Exercise: Write sequence notation for the sentence:<br>
"Mary had a little lamb, its fleece was white as snow"
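One possible way to write this, assuming the comma counts as its own token (as a standard word tokenizer would treat it), so that the sentence has $n = 12$ tokens:

```latex
w_1^{12} = w_1\, w_2 \dots w_{12},
\qquad
w_1 = \text{Mary},\; w_2 = \text{had},\; w_3 = \text{a},\; w_4 = \text{little},\;
w_5 = \text{lamb},\; w_6 = \text{,},\; w_7 = \text{its},\; \dots,\; w_{12} = \text{snow}
```

If punctuation is dropped instead, the same notation applies with $n = 11$.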

 Q1.2 Add start/end tokens<br>
Exercise: Write a function to tokenize the corpus and add `<s>` and `</s>` tokens

In [None]:
import nltk
from nltk import word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
sentence = "Mary had a little lamb, its fleece was white as snow"

tokenized_sentence = word_tokenize(sentence)
print(tokenized_sentence)


['Mary', 'had', 'a', 'little', 'lamb', ',', 'its', 'fleece', 'was', 'white', 'as', 'snow']


In [None]:
def add_start_end_tokens(corpus):
    tokenized_corpus = [word_tokenize(sentence) for sentence in corpus]
    start_end_corpus = [['<s>'] + sentence + ['</s>'] for sentence in tokenized_corpus]
    return start_end_corpus


In [None]:
senx = add_start_end_tokens(corpus)
print(senx)

[['<s>', 'Mary', 'had', 'a', 'little', 'lamb', '</s>'], ['<s>', 'Its', 'fleece', 'was', 'white', 'as', 'snow', '</s>'], ['<s>', 'And', 'everywhere', 'that', 'Mary', 'went', '</s>'], ['<s>', 'The', 'lamb', 'was', 'sure', 'to', 'go', '</s>']]


--- Part 2: N-grams & Probabilities ---

 Q2.1 Extract unigrams, bigrams, trigrams

In [None]:
import nltk
from nltk.util import ngrams

In [None]:
# Reuse tokenized_sentence from Q1.2 ("Mary had a little lamb, its fleece was white as snow").

uni = list(ngrams(tokenized_sentence, 1))
bi = list(ngrams(tokenized_sentence, 2))
tri = list(ngrams(tokenized_sentence, 3))

In [None]:
print("Uni-grams:", uni)
print("Bi-grams:", bi)
print("Tri-grams:", tri)

Uni-grams: [('Mary',), ('had',), ('a',), ('little',), ('lamb',), (',',), ('its',), ('fleece',), ('was',), ('white',), ('as',), ('snow',)]
Bi-grams: [('Mary', 'had'), ('had', 'a'), ('a', 'little'), ('little', 'lamb'), ('lamb', ','), (',', 'its'), ('its', 'fleece'), ('fleece', 'was'), ('was', 'white'), ('white', 'as'), ('as', 'snow')]
Tri-grams: [('Mary', 'had', 'a'), ('had', 'a', 'little'), ('a', 'little', 'lamb'), ('little', 'lamb', ','), ('lamb', ',', 'its'), (',', 'its', 'fleece'), ('its', 'fleece', 'was'), ('fleece', 'was', 'white'), ('was', 'white', 'as'), ('white', 'as', 'snow')]


 Q2.2 Bigram probabilities<br>
Exercise: Write a function to compute P(w_i | w_{i-1})

In [None]:
def bigram_probabilities(corpus):
    # Pad each tokenized sentence with start/end markers.
    sentences = [['<s>'] + word_tokenize(sent) + ['</s>'] for sent in corpus]
    # Count bigrams sentence by sentence so none spans a sentence boundary.
    bi_freq = Counter(gram for sent in sentences for gram in ngrams(sent, 2))
    uni_freq = Counter(word for sent in sentences for word in sent)

    # MLE estimate: P(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})
    bi_prob = {(w1, w2): count / uni_freq[w1] for (w1, w2), count in bi_freq.items()}
    return bi_prob, uni_freq


prob, uni_freq = bigram_probabilities(corpus)
print(prob)
print(uni_freq)



{('<s>', 'Mary'): 0.25, ('Mary', 'had'): 0.5, ('had', 'a'): 1.0, ('a', 'little'): 1.0, ('little', 'lamb'): 1.0, ('lamb', '</s>'): 0.5, ('<s>', 'Its'): 0.25, ('Its', 'fleece'): 1.0, ('fleece', 'was'): 1.0, ('was', 'white'): 0.5, ('white', 'as'): 1.0, ('as', 'snow'): 1.0, ('snow', '</s>'): 1.0, ('<s>', 'And'): 0.25, ('And', 'everywhere'): 1.0, ('everywhere', 'that'): 1.0, ('that', 'Mary'): 1.0, ('Mary', 'went'): 0.5, ('went', '</s>'): 1.0, ('<s>', 'The'): 0.25, ('The', 'lamb'): 1.0, ('lamb', 'was'): 0.5, ('was', 'sure'): 0.5, ('sure', 'to'): 1.0, ('to', 'go'): 1.0, ('go', '</s>'): 1.0}
Counter({'<s>': 4, '</s>': 4, 'Mary': 2, 'lamb': 2, 'was': 2, 'had': 1, 'a': 1, 'little': 1, 'Its': 1, 'fleece': 1, 'white': 1, 'as': 1, 'snow': 1, 'And': 1, 'everywhere': 1, 'that': 1, 'went': 1, 'The': 1, 'sure': 1, 'to': 1, 'go': 1})
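As a check on the output above, the maximum-likelihood estimate can be worked by hand for one bigram. Here $C(\cdot)$ is shorthand (introduced only for this note) for a count in the padded corpus:

```latex
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})},
\qquad
P(\text{had} \mid \text{Mary}) = \frac{C(\text{Mary}, \text{had})}{C(\text{Mary})} = \frac{1}{2}
```

"Mary" occurs twice and is followed by "had" once, which matches the value `0.5` printed for `('Mary', 'had')`.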


 Q2.3 Sentence probability<br>
Exercise: Compute the probability of "Mary had a little lamb"

In [None]:
sen = "Mary had a little lamb"
sen = add_start_end_tokens([sen])[0]
print(sen)
print(prob)
print(uni_freq)
print(len(sen))
print(corpus)

prob1 = 1.0
for i in range(1, len(sen)):
  pair = (sen[i-1], sen[i])
  if pair in prob:
    prob1 *= prob[pair]
  else:
    prob1 *= 0  # if unseen and no smoothing

print(prob1)



['<s>', 'Mary', 'had', 'a', 'little', 'lamb', '</s>']
{('<s>', 'Mary'): 0.25, ('Mary', 'had'): 0.5, ('had', 'a'): 1.0, ('a', 'little'): 1.0, ('little', 'lamb'): 1.0, ('lamb', '</s>'): 0.5, ('<s>', 'Its'): 0.25, ('Its', 'fleece'): 1.0, ('fleece', 'was'): 1.0, ('was', 'white'): 0.5, ('white', 'as'): 1.0, ('as', 'snow'): 1.0, ('snow', '</s>'): 1.0, ('<s>', 'And'): 0.25, ('And', 'everywhere'): 1.0, ('everywhere', 'that'): 1.0, ('that', 'Mary'): 1.0, ('Mary', 'went'): 0.5, ('went', '</s>'): 1.0, ('<s>', 'The'): 0.25, ('The', 'lamb'): 1.0, ('lamb', 'was'): 0.5, ('was', 'sure'): 0.5, ('sure', 'to'): 1.0, ('to', 'go'): 1.0, ('go', '</s>'): 1.0}
Counter({'<s>': 4, '</s>': 4, 'Mary': 2, 'lamb': 2, 'was': 2, 'had': 1, 'a': 1, 'little': 1, 'Its': 1, 'fleece': 1, 'white': 1, 'as': 1, 'snow': 1, 'And': 1, 'everywhere': 1, 'that': 1, 'went': 1, 'The': 1, 'sure': 1, 'to': 1, 'go': 1})
7
['Mary had a little lamb', 'Its fleece was white as snow', 'And everywhere that Mary went', 'The lamb was sure to go']
0.0625
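Multiplying many probabilities underflows quickly on longer sentences, so the same computation is often done in log space. The following is a minimal, self-contained sketch of that idea, not part of the original solution: it uses `str.split` in place of `word_tokenize` (an assumption that works here because the toy corpus has no punctuation), and the helper names `pad`, `train_bigram_model`, and `sentence_log_prob` are hypothetical.

```python
import math
from collections import Counter

corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go",
]

def pad(sentence):
    # Whitespace split stands in for word_tokenize (assumption for this sketch).
    return ["<s>"] + sentence.split() + ["</s>"]

def train_bigram_model(sentences):
    # Unsmoothed MLE bigram probabilities: Count(w1, w2) / Count(w1).
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = pad(sentence)
        context_counts.update(tokens[:-1])          # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {pair: c / context_counts[pair[0]] for pair, c in bigram_counts.items()}

def sentence_log_prob(sentence, bigram_prob):
    # Sum of log-probabilities; -inf if any bigram is unseen (no smoothing).
    tokens = pad(sentence)
    log_p = 0.0
    for pair in zip(tokens, tokens[1:]):
        p = bigram_prob.get(pair, 0.0)
        if p == 0.0:
            return float("-inf")
        log_p += math.log(p)
    return log_p

bigram_prob = train_bigram_model(corpus)
log_p = sentence_log_prob("Mary had a little lamb", bigram_prob)
print(log_p, math.exp(log_p))
```

Exponentiating the log-probability should give approximately the product 0.25 × 0.5 × 0.5 = 0.0625, and a sentence containing an unseen bigram returns negative infinity rather than silently becoming 0.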

Q2.4 Handling OOV/UNK<br>
Exercise: Replace unseen words with `<UNK>` and recompute


In [None]:
def replace_oov_with_unk(corpus, vocabulary):
    unk_corpus = []
    for sentence in corpus:
        unk_sentence = [word if word in vocabulary else '<UNK>' for word in word_tokenize(sentence)]
        unk_corpus.append(unk_sentence)
    return unk_corpus

initial_vocabulary = set([word for sentence in corpus for word in word_tokenize(sentence)])


unk_corpus = replace_oov_with_unk(corpus, initial_vocabulary)
print("Corpus with <UNK> tokens:")
print(unk_corpus)


Corpus with <UNK> tokens:
[['Mary', 'had', 'a', 'little', 'lamb'], ['Its', 'fleece', 'was', 'white', 'as', 'snow'], ['And', 'everywhere', 'that', 'Mary', 'went'], ['The', 'lamb', 'was', 'sure', 'to', 'go']]
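In the output above no word is replaced, because the vocabulary was built from the same corpus it is applied to. One common way to make `<UNK>` do real work, sketched here under assumptions that are not in the original notebook, is to keep only words seen at least `min_count` times in training, map everything else to `<UNK>`, and retrain the bigram model. The threshold of 2, the whitespace tokenizer, the held-out sentence, and the helper names are all illustrative choices.

```python
from collections import Counter

corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go",
]

def build_vocab(sentences, min_count=2):
    # Keep only words seen at least min_count times (threshold is an assumption).
    counts = Counter(w for s in sentences for w in s.split())
    return {w for w, c in counts.items() if c >= min_count}

def map_unk(tokens, vocab):
    return [t if t in vocab else "<UNK>" for t in tokens]

def pad_with_unk(sentence, vocab):
    return ["<s>"] + map_unk(sentence.split(), vocab) + ["</s>"]

def train_bigram_model(sentences, vocab):
    # Same MLE estimate as before, but trained on the <UNK>-mapped corpus.
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = pad_with_unk(sentence, vocab)
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {pair: c / context_counts[pair[0]] for pair, c in bigram_counts.items()}

vocab = build_vocab(corpus)                 # only 'Mary', 'lamb', 'was' occur twice
unk_prob = train_bigram_model(corpus, vocab)

# Hypothetical held-out sentence containing a word never seen in training.
test_tokens = pad_with_unk("Mary had a little goat", vocab)
p = 1.0
for pair in zip(test_tokens, test_tokens[1:]):
    p *= unk_prob.get(pair, 0.0)
print(test_tokens)
print(p)
```

With this setup the held-out sentence gets a small but nonzero probability, whereas the unsmoothed model without `<UNK>` would assign it 0 because "goat" never appears in training. The cost is that rare training words lose their identity, which is why the threshold is usually tuned on real data.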


--- Part 3: Classifier ---

 Q3.1 Naive Bayes sentiment classifier

# 📽 Exercise 3.1: Sentiment Classification on a Toy Dataset

In this exercise, you will build a simple sentiment classification model that predicts whether a given sentence is **positive** or **negative**.

---

## ✏️ Instructions:


### 1️⃣ Perform Feature Extraction
- Use **TF-IDF Vectorization** to convert the sentences into numerical features.


---

### 2️⃣ Train a Machine Learning Classifier
- Use any classifier you are familiar with (e.g., **Logistic Regression** or **Naive Bayes**).
- Split the data into **training** and **testing** sets.
- Train the classifier on the training data.


🚀 **Goal:** By the end of this exercise, you should be able to:
- Apply **feature extraction** to text data.
- Train and evaluate a **text classification model** using **machine learning**.

In [None]:
train_texts = [
    "I love my dog",
    "This food is great",
    "I hate waiting",
    "The movie was boring",
    "Happy with my phone",
    "This is awful"
]
train_labels = ["pos", "pos", "neg", "neg", "pos", "neg"]
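To make the mechanics of Naive Bayes concrete, here is a from-scratch sketch on the toy data above. It deliberately swaps in raw word counts with add-one (Laplace) smoothing instead of the TF-IDF features and scikit-learn classifier the exercise asks for, so it illustrates the model rather than serving as the exercise solution. The helper names `tokenize`, `train_nb`, and `predict_nb` and the two test sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict

train_texts = [
    "I love my dog",
    "This food is great",
    "I hate waiting",
    "The movie was boring",
    "Happy with my phone",
    "This is awful",
]
train_labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

def tokenize(text):
    # Lowercased whitespace split (a simplifying assumption).
    return text.lower().split()

def train_nb(texts, labels):
    class_docs = Counter(labels)            # documents per class, for the priors
    word_counts = defaultdict(Counter)      # word counts per class
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab

def predict_nb(text, model):
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)     # log prior
        for word in tokenize(text):
            if word not in vocab:
                continue                                     # ignore unknown words
            # Add-one smoothed log likelihood of the word given the class.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train_nb(train_texts, train_labels)
print(predict_nb("I love this great food", model))
print(predict_nb("I hate this boring movie", model))
```

With only six training sentences, the predictions are driven by a handful of words such as "love" or "hate", so this should be read as a demonstration of the scoring rule, not as evidence of model quality.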

# 📽 Exercise 3.2: Movie Review Classification using the NLTK Movie Reviews Corpus

In this exercise, you will build a simple text classification model that predicts whether a given **movie review** is **positive** or **negative** using the **NLTK Movie Reviews Corpus**.

This is a classic example of text classification at the **document level**: each example is a full review, not a single sentence.

---

## ✏️ Instructions:

### 1️⃣ Load the Data
- Import the **Movie Reviews corpus** from **NLTK**.
- Create a dataset where each example is a review and the label is either `'positive'` or `'negative'`.

---

### 2️⃣ Perform Feature Extraction
- Use **TF-IDF Vectorization** to convert the reviews into numerical features.


---

### 3️⃣ Train a Machine Learning Classifier
- Use any classifier you are familiar with (e.g., **Logistic Regression** or **Naive Bayes**).
- Split the data into **training** and **testing** sets.
- Train the classifier on the training data.

---

### 4️⃣ Evaluate the Classifier
- Use **accuracy** and a **classification report** to evaluate your model on the test set.
- Think about: How well does the model perform? Which reviews are harder to classify?
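To show what accuracy and the per-class numbers in a classification report actually measure, the sketch below computes them by hand. The `y_true` and `y_pred` lists are made-up illustrative labels, not results from any model in this notebook, and `evaluate` is a hypothetical helper.

```python
from collections import Counter

def evaluate(y_true, y_pred, positive="positive"):
    # Confusion counts keyed by (true label, predicted label).
    confusion = Counter(zip(y_true, y_pred))
    correct = sum(c for (t, p), c in confusion.items() if t == p)
    tp = confusion[(positive, positive)]
    fp = sum(c for (t, p), c in confusion.items() if p == positive and t != positive)
    fn = sum(c for (t, p), c in confusion.items() if t == positive and p != positive)
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "confusion": dict(confusion),
    }

# Made-up labels for illustration only.
y_true = ["positive", "positive", "negative", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "positive", "positive"]

report = evaluate(y_true, y_pred)
print(report)
```

Here four of six predictions are correct, and both precision and recall for the positive class come out to 2/3. The off-diagonal confusion entries are a starting point for the question of which reviews are harder to classify.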

---

✅ You are free to explore:
- Trying different classifiers.
- Visualizing the results (e.g., confusion matrix).

---

🚀 **Goal:** By the end of this exercise, you should be able to:
- Apply **feature extraction** to text data.
- Train and evaluate a **text classification model** using **machine learning**.

 Q3.3 Discussion: Why bigrams vs unigrams?<br>
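One way to ground this discussion is a small sketch: two sentences with opposite meaning can have identical unigram counts while their bigram counts differ. The example sentences and the `ngram_counts` helper are hypothetical and chosen only to illustrate the point.

```python
from collections import Counter

def ngram_counts(text, n):
    # Lowercased whitespace tokens, then sliding windows of length n.
    tokens = text.lower().split()
    return Counter(zip(*(tokens[i:] for i in range(n))))

a = "the movie was good not bad"
b = "the movie was bad not good"

print(ngram_counts(a, 1) == ngram_counts(b, 1))   # same bag of words
print(ngram_counts(a, 2) == ngram_counts(b, 2))   # different word order
print(ngram_counts(a, 2) - ngram_counts(b, 2))    # bigrams only in sentence a
```

A unigram model cannot tell these two sentences apart, while bigrams such as `('not', 'bad')` versus `('not', 'good')` carry the local word order that distinguishes them.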

 Q3.4 Limitations of n-grams
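A key limitation is data sparsity: the number of possible n-grams grows as V^n, while the number actually observed in a corpus does not. The sketch below measures this on the toy corpus from earlier in the notebook, using whitespace tokenization and no start/end padding (both simplifying assumptions); `distinct_ngrams` is a hypothetical helper.

```python
corpus = [
    "Mary had a little lamb",
    "Its fleece was white as snow",
    "And everywhere that Mary went",
    "The lamb was sure to go",
]

def distinct_ngrams(sentences, n):
    # Collect the set of n-grams that occur within any single sentence.
    seen = set()
    for sentence in sentences:
        tokens = sentence.split()
        seen.update(zip(*(tokens[i:] for i in range(n))))
    return seen

vocab_size = len(distinct_ngrams(corpus, 1))
for n in (1, 2, 3):
    observed = len(distinct_ngrams(corpus, n))
    possible = vocab_size ** n
    print(f"n={n}: observed={observed}, possible={possible}, coverage={observed / possible:.4f}")
```

On this corpus the coverage drops from all 19 unigrams, to 18 of 361 possible bigrams, to 14 of 6859 possible trigrams, so an unsmoothed higher-order model assigns zero probability to almost everything it has not seen verbatim.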

--- Part 4: Wrap-up Reflection ---

 Discussion Questions<br>
1. Why do we need `<UNK>` tokens?<br>
2. Why start/end tokens?<br>
3. Why not always use higher n-grams?<br>
4. How do classifiers differ from language models?