# Q1: Probabilistic N-Gram Language Model (50 points)

**Objective:**

The objective of this question is to implement and experiment with an N-Gram language model using the Reuters dataset. The task involves building a probabilistic N-Gram model and creating a text generator based on the trained model with customizable parameters.

**Tasks:**


**1. Text Preprocessing (5 points):**
*   Implement the preprocess_text function to perform necessary text preprocessing. You may use NLTK or other relevant libraries for this task. (Already provided, no modification needed)


**2. Build Probabilistic N-Gram Model (15 points):**

*   Implement the build_probabilistic_ngram_model function to construct a probabilistic N-Gram model from the Reuters dataset.


**3. Generate Text with Customizable Parameters (15 points):**

*   Implement the generate_text function to generate text given a seed text and the probabilistic N-Gram model.
*   The function should have parameters for probability_threshold and min_length to customize the generation process.
*   Ensure that the generation stops when either the specified min_length is reached or the probabilities fall below probability_threshold.


**4. Experimentation and Parameter Tuning (5 points):**

*   Use Google Colab to experiment with different values of n_value, probability_threshold, and min_length. Find the optimal parameters that result in coherent and meaningful generated text.
*   Provide a detailed analysis of the impact of changing each parameter on the generated text's quality.
*   Discuss any challenges faced during parameter tuning and propose potential improvements.


**5. Results and Conclusion (10 points):**

*   Summarize your findings and present the optimal parameter values for n_value, probability_threshold, and min_length.
*   Discuss the trade-offs and considerations when selecting these parameters.
*   Conclude with insights gained from the experimentation.
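
As a minimal sketch of what "probabilistic N-Gram model" means in the tasks above, the snippet below estimates next-word probabilities by relative frequency on a tiny made-up token list. The function name `toy_ngram_probs`, the parameter `context_size`, and the toy sentence are illustrative assumptions, not part of the assignment code.

```python
from collections import Counter, defaultdict

def toy_ngram_probs(tokens, context_size):
    """Estimate P(next_word | previous context_size tokens) by relative frequency."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        counts[context][tokens[i + context_size]] += 1
    # Normalize each context's counts so that they sum to 1
    return {
        context: {word: c / sum(followers.values()) for word, c in followers.items()}
        for context, followers in counts.items()
    }

# Hypothetical toy data, used only for illustration
tokens = "the cat sat on the mat the cat ran".split()
probs = toy_ngram_probs(tokens, context_size=1)
print(probs[("the",)])
```

In this toy data "the" is followed by "cat" twice and by "mat" once, so the estimated probabilities are about 2/3 and 1/3.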

In [None]:
import nltk
from nltk.corpus import reuters
from nltk import ngrams
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import random
import string
from collections import defaultdict, Counter
from gensim.models.fasttext import FastText
import numpy as np

# Download the Reuters dataset if not already downloaded
nltk.download('reuters')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
"""
This function preprocesses the input text by tokenizing it into sentences, converting all words to lower case,
removing punctuation tokens, appending a '.' to each sentence, and then joining the words back into sentences and the sentences back into a text.
"""
# Function to preprocess text
def preprocess_text(text):
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)

    preprocessed_sentences = []

    for sentence in sentences:
        # Tokenize the sentence into words
        tokens = word_tokenize(sentence)

        # Convert to lower case
        tokens = [word.lower() for word in tokens]

        # Remove punctuation tokens (numbers and other non-alphabetic tokens are kept)
        words = [word for word in tokens if word not in string.punctuation]

        # Join the words back into a sentence and add to the list
        words.append('.')
        preprocessed_sentences.append(' '.join(words))

    # Join the sentences back into a text
    preprocessed_text = ' '.join(preprocessed_sentences)

    return preprocessed_text

"""
This function builds a probabilistic n-gram model from the given corpus.
It tokenizes each text into sentences, then into words, and counts how often each word follows a context of n preceding tokens
(so every counted tuple holds n + 1 tokens). It then normalizes the counts into the probability of each word given its context.
"""
# Function to build a probabilistic n-gram model
def build_probabilistic_ngram_model(corpus, n):
    # Map each n-token context to a Counter of the words that follow it
    model = defaultdict(Counter)

    # Populate the dictionary with next-word counts for every context
    for text in corpus:
        for sentence in sent_tokenize(text):
            # Pad the sentence start so that early words also have a context
            tokens = ['<s>'] * (n - 1) + word_tokenize(sentence)
            for n_gram in ngrams(tokens, n + 1):
                context = n_gram[:-1]
                next_word = n_gram[-1]
                model[context][next_word] += 1

    # Convert the counts into conditional probabilities
    for context, next_words in model.items():
        total_count = sum(next_words.values())
        for next_word, count in next_words.items():
            next_words[next_word] = count / total_count

    return model

"""
This function generates text using the probabilistic n-gram model. It tokenizes the seed text and then repeatedly samples the next word
from the probabilities given the current context. Once at least min_length words have been generated, it stops as soon as
the probability of the sampled word is below the threshold. It then joins the generated words into a string.
"""
# Function to generate text using the probabilistic n-gram model with stop criteria
def generate_text(model, seed_text, n, probability_threshold=0.1, min_length=10):
    # Tokenize the seed text and pad it like the training sentences
    seed_tokens = word_tokenize(seed_text.lower())
    generated_words = ['<s>'] * (n - 1) + seed_tokens[:n]
    # Lower-order models used for backoff, cached so each order is built at most once per call
    backoff_models = {}
    num = 0
    while True:
        # Get the last n words as the context
        context = tuple(generated_words[-n:])
        # Get the probabilities of the next word given the context
        # (.get avoids inserting empty entries into the defaultdict)
        next_word_probs = model.get(context)
        if not next_word_probs:
            # If the context is unseen, back off to contexts of n-1, n-2, ..., 1 words
            for i in range(n - 1, 0, -1):
                if i not in backoff_models:
                    # Relies on the global preprocessed_corpus defined in a later cell
                    backoff_models[i] = build_probabilistic_ngram_model(preprocessed_corpus, i)
                next_word_probs = backoff_models[i].get(tuple(generated_words[-i:]))
                if next_word_probs:
                    break
        if next_word_probs:
            next_word = random.choices(list(next_word_probs.keys()), weights=list(next_word_probs.values()))[0]
            next_word_prob = next_word_probs[next_word]
        else:
            # No context matched at any order: take the last word of a random context (probability treated as 0)
            next_word = random.choice(list(model.keys()))[-1]
            next_word_prob = 0

        # Stop once min_length is reached and the probability of the next word is below the threshold
        if next_word_prob < probability_threshold and num >= min_length:
            break

        # Add the next word to the generated words
        generated_words.append(next_word)
        num += 1
    # Drop the '<s>' padding and join the words into a string
    generated_text = ' '.join(generated_words[n - 1:])
    return generated_text



In [None]:
# Load the Reuters dataset
corpus = [reuters.raw(file_id) for file_id in reuters.fileids()]
# Preprocess the entire corpus
preprocessed_corpus = [preprocess_text(text) for text in corpus]

In [None]:
# Choose an n for the n-gram model
n_value = 3  # You may change this value

# Build the probabilistic n-gram model
probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

In [None]:
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.1, min_length=10)
print(f"Generated Text: {generated_text}")

Generated Text: inflation is expected to exceed 125 mln dlrs . 2,239 dlrs for


# **Section 4**

In [None]:
n_values = [2, 3, 4]
probability_thresholds = [0.05, 0.1, 0.2]
min_lengths = [10, 15, 20]

for val in n_values:
  # Build a model that matches the current n; the model built earlier only covers n = 3
  model_for_n = build_probabilistic_ngram_model(preprocessed_corpus, val)
  for prob in probability_thresholds:
    for length in min_lengths:
      generated_text = generate_text(model_for_n, seed_text, val, probability_threshold=prob, min_length=length)
      print(f'n_value={val}, probability_threshold={prob}, min_length={length}: generated_text={generated_text}')
      print('-----------------')


n_value=2, probability_threshold=0.05, min_length=10: generated_text=inflation is perceived target markets and fairer treatment and the gatt and
-----------------
n_value=2, probability_threshold=0.05, min_length=15: generated_text=inflation is payable april and beet crop soybeans 610 billion dlrs . 193,193 dlrs jan 31 against
-----------------
n_value=2, probability_threshold=0.05, min_length=20: generated_text=inflation is a lb effective immediately . s.a.y said any pubilic policy review world 's customers the economy ministers are likely to
-----------------
n_value=2, probability_threshold=0.1, min_length=10: generated_text=inflation is rising . 5,896,322 vs 43.1 billion dlrs a canadian-led oil
-----------------
n_value=2, probability_threshold=0.1, min_length=15: generated_text=inflation is under-utilised because of 297,000 dlrs per day bpd from depths of its world 's new york
-----------------
n_value=2, probability_threshold=0.1, min_length=20: generated_text=inflation is expect

# **Analysis**

**n_value**: the higher the value of n, the more coherent the output tends to be, because each word is chosen given more of the preceding words. For example, the word "inflation", the subject of the sentence, carries much of the meaning. With n equal to 2, it quickly drops out of the context and no longer influences the following choices. The trade-off is that a larger n produces more contexts that never occurred in the training data; in such cases the generator falls back to a shorter context of n-i words.
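
The fallback to shorter contexts can be sketched as follows. This is an illustrative assumption about how such a backoff could be organized, with one model per context length built once up front; the helper names `build_context_model` and `backoff_lookup` and the toy tokens are hypothetical and do not come from the notebook.

```python
from collections import Counter, defaultdict

def build_context_model(tokens, context_size):
    """Map each context of `context_size` tokens to a Counter of next-word counts."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - context_size):
        model[tuple(tokens[i:i + context_size])][tokens[i + context_size]] += 1
    return model

def backoff_lookup(models, history, max_context):
    """Return (context_size, followers) for the longest context seen in training."""
    for size in range(max_context, 0, -1):
        context = tuple(history[-size:])
        # .get does not insert missing keys into the defaultdict
        followers = models[size].get(context)
        if followers:
            return size, followers
    return 0, Counter()

# Hypothetical toy data, used only for illustration
tokens = "inflation is expected to rise . inflation is rising .".split()
models = {size: build_context_model(tokens, size) for size in (1, 2, 3)}
size, followers = backoff_lookup(models, ["prices", "inflation", "is"], max_context=3)
print(size, dict(followers))
```

Here the three-word context was never observed, so the lookup falls back to the two-word context "inflation is", which has two observed continuations.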

**probability_threshold**: the higher this value is, the shorter the generated text tends to be, because generation stops (once min_length has been reached) as soon as the sampled next word has a probability below the threshold.
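
A small sketch of this stopping rule, isolated from the model. The function `stop_index` and the list of probabilities are made-up illustrations; the condition itself mirrors the check used in `generate_text`.

```python
def stop_index(sampled_probs, probability_threshold, min_length):
    """Return how many words are generated before the stop rule fires.

    Generation stops at the first sampled word whose probability is below
    the threshold, but only once min_length words have been generated.
    """
    for num_generated, prob in enumerate(sampled_probs):
        if prob < probability_threshold and num_generated >= min_length:
            return num_generated
    return len(sampled_probs)

# Hypothetical probabilities of successive sampled words
probs = [0.5, 0.03, 0.4, 0.15, 0.08, 0.3, 0.02]
for threshold in (0.05, 0.1, 0.2):
    print(threshold, stop_index(probs, threshold, min_length=3))
```

With this made-up sequence, raising the threshold from 0.05 to 0.1 to 0.2 shortens the output from 6 to 4 to 3 words.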

**min_length**: because a text of fewer than 10 words usually cannot convey a complete meaning, 10 is usually a suitable minimum for this task.

# Q2: Sentiment Analysis with Naive Bayes Classifier (50 Points)

**Objective:**

You are tasked with implementing a Naive Bayes classifier for sentiment analysis. The provided code is incomplete, and your goal is to complete the missing parts. Additionally, you should train the classifier on a small dataset and analyze its performance.

**Tasks:**

1. **Complete the Code (35 points)**: Fill in the missing parts in the provided Python code for the Naive Bayes classifier. Pay special attention to the `extract_features` function.

2. **Train and Test**: Train the Naive Bayes classifier on the training data and test it on a separate test set. Evaluate the accuracy of the classifier.

3. **Analysis (15 points)**: Discuss the results. Identify any misclassifications and try to understand why the classifier may fail in those cases. Provide examples of sentences that were not predicted correctly and explain possible reasons.


In [1]:
import random
import math
import string
from collections import defaultdict

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.corpus import movie_reviews
import nltk

# Download NLTK resources
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
def get_features(tokens):
    # Remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word.lower() not in stop_words]

    # Perform stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]

    return tokens

In [3]:
class NaiveBayesClassifier:
    def __init__(self, classes):
        # Initialize the class labels and the dictionaries to hold class and feature probabilities
        self.classes = classes
        self.class_probs = defaultdict(float)
        self.feature_probs = defaultdict(lambda: defaultdict(float))

    def train(self, training_data):
        # This function trains the Naive Bayes Classifier
        # Initialize counters for classes and features
        alpha = 1
        class_counts = defaultdict(int)
        feature_counts = defaultdict(lambda: defaultdict(int))

        # Count features and classes
        for features, class_ in training_data:
            # Extract useful tokens from the dataset using the get_features function
            features = get_features(features)
            # Increment the count of the current class
            class_counts[class_] += 1
            # Increment the count of the current feature for the current class
            for feature in features:
                feature_counts[feature][class_] += 1

        # Calculate probabilities
        for class_, count in class_counts.items():
            # Calculate the probability of each class
            self.class_probs[class_] = count / len(training_data)
            # Calculate the sum of feature counts for the current class
            total_count = sum(counts[class_] for counts in feature_counts.values())
            # Calculate the probability of each feature given each class (add-alpha smoothing)
            for feature in feature_counts:
                self.feature_probs[feature][class_] = (feature_counts[feature][class_] + alpha) / (total_count + alpha * len(feature_counts))

    def classify(self, features):
        # This function classifies a given set of features
        # Initialize the dictionary to hold class probabilities
        class_probs = defaultdict(float)
        for class_ in self.classes:
            # Calculate the log probability of each class
            class_probs[class_] = math.log(self.class_probs[class_])
            # Extract useful tokens from the features using the get_features function
            for feature in get_features(features):
                # If the feature is in the feature probabilities dictionary, add its log probability to the class probability
                if feature in self.feature_probs:
                    class_probs[class_] += math.log(self.feature_probs[feature][class_])

        # Return the class with the highest probability
        return max(class_probs, key=class_probs.get)


In [10]:
# Load the movie reviews dataset from NLTK
data = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the dataset for randomness
random.shuffle(data)

# Split the dataset into training and testing sets
split_ratio = 0.8
split_index = int(len(data) * split_ratio)
train_set = data[:split_index]
test_set = data[split_index:]
# Train the Naive Bayes classifier
classes = set(sentiment for _, sentiment in train_set)
classifier = NaiveBayesClassifier(classes)
classifier.train(train_set)

def calculate_accuracy(dataset, dataset_type):
    # Evaluate the classifier on the given dataset
    correct_predictions = 0
    for tokens, true_sentiment in dataset:
        # classify() applies get_features itself, so pass the raw tokens
        predicted_sentiment = classifier.classify(tokens)
        if predicted_sentiment == true_sentiment:
            correct_predictions += 1

    accuracy = correct_predictions / len(dataset)
    print(f"{dataset_type} Accuracy: {accuracy}")

calculate_accuracy(train_set, 'Train')
calculate_accuracy(test_set, 'Test')

Train Accuracy: 0.964375
Test Accuracy: 0.8175


In [11]:
tokens = ['the', 'weather', 'is', 'not', 'bad']
features = get_features(tokens)
classifier_var = classifier.classify(features)
classifier_var

'neg'

# **Analysis:**

Because Naive Bayes treats each review as a bag of words, 100% accuracy cannot be expected: word order sometimes matters, and the model ignores it. Only the presence of a word counts, not the structure of the sentence. Using bigrams as features can help with this issue.

For example, the sentence "the weather is not bad" is positive, but because Naive Bayes only looks at individual words and the word "bad" is present (while "not" is removed as a stopword by `get_features`), it predicts this sentence as negative.
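
One possible way to expose negation to the classifier is to add adjacent word pairs as extra features. The sketch below is an assumption about how that could look, not part of the submitted classifier; the function name `unigram_and_bigram_features` is hypothetical. The pairs have to be built before stopword removal, since "not" would otherwise be dropped.

```python
def unigram_and_bigram_features(tokens):
    """Return the lowercased tokens plus adjacent pairs joined with '_'."""
    lowered = [token.lower() for token in tokens]
    # Pair each token with its successor, e.g. ('not', 'bad') -> 'not_bad'
    bigrams = [f"{first}_{second}" for first, second in zip(lowered, lowered[1:])]
    return lowered + bigrams

features = unigram_and_bigram_features(['the', 'weather', 'is', 'not', 'bad'])
print(features)
```

With such features, "not_bad" becomes a token of its own whose class counts can differ from those of "bad" alone.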

# Submission Instructions:


1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help understand the implemented logic.

3. Clearly present the results of your parameter tuning in the notebook.

4. Provide a brief summary of your findings and insights in the conclusion section.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Experiment with various seed texts to showcase the diversity of generated text.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.