***Matin Mahmoodkhani - 99522095***





#Q1: Probabilistic N-Gram Language Model (50 points)

**Objective:**

The objective of this question is to implement and experiment with an N-Gram language model using the Reuters dataset. The task involves building a probabilistic N-Gram model and creating a text generator based on the trained model with customizable parameters.

**Tasks:**


**1.Text Preprocessing (5 points):**
*   Implement the preprocess_text function to perform necessary text preprocessing. You may use NLTK or other relevant libraries for this task. (Already provided, no modification needed)


**2.Build Probabilistic N-Gram Model (15 points):**

*   Implement the build_probabilistic_ngram_model function to construct a probabilistic N-Gram model from the Reuters dataset.


**3.Generate Text with Customizable Parameters (15 points):**

*   Implement the generate_text function to generate text given a seed text and the probabilistic N-Gram model.
*   The function should have parameters for probability_threshold and min_length to customize the generation process.
*   Ensure that the generation stops when either the specified min_length is reached or the probabilities fall below probability_threshold.


**4.Experimentation and Parameter Tuning (5 points):**

*   Use Google Colab to experiment with different values of n_value, probability_threshold, and min_length.
Find the optimal parameters that result in coherent and meaningful generated text.
*   Provide a detailed analysis of the impact of changing each parameter on the generated text's quality.
*   Discuss any challenges faced during parameter tuning and propose potential improvements.


**5.Results and Conclusion (10 points):**

*   Summarize your findings and present the optimal parameter values for n_value, probability_threshold, and min_length.
*   Discuss the trade-offs and considerations when selecting these parameters.
*   Conclude with insights gained from the experimentation.

In [None]:
import nltk
from nltk.corpus import reuters
from nltk import ngrams
import random
import string
from collections import defaultdict

# Download the Reuters dataset if not already downloaded
nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

*Documentation of the functions*


 **preprocess_text:**
This function normalizes the corpus: it converts all letters to lowercase and removes punctuation.
After that, it uses the NLTK tokenizer to split each text in the corpus into word tokens.

**build_probabilistic_ngram_model:**
This function builds the n-gram model. As input, it takes the preprocessed corpus and the order n of the model.

First, it loops over every text in the corpus and extracts all of its n-grams, padding both ends of the text. Here n is 3, so the model is built from every run of 3 consecutive tokens.

To predict the next word, each n-gram is split into two parts: the prefix (the first n-1 tokens) and the suffix (the last token). The function counts how often each suffix follows each prefix.

Finally, the counts for each prefix are divided by their total, so the model stores the probability of each next word given the prefix.
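
To make the counting step concrete, here is a small sketch on a toy corpus. The three token lists and the helper name `padded_ngrams` are made up for illustration; the helper only imitates the `None` padding that `nltk.ngrams(..., pad_left=True, pad_right=True)` produces, so the snippet runs without NLTK.

```python
from collections import defaultdict


def padded_ngrams(tokens, n):
    """Yield n-grams with None padding on both sides (hypothetical helper)."""
    padded = [None] * (n - 1) + list(tokens) + [None] * (n - 1)
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n])


# Made-up toy corpus, already tokenized
toy_corpus = [
    ["inflation", "is", "expected", "to", "rise"],
    ["inflation", "is", "expected", "to", "fall"],
    ["inflation", "is", "high"],
]

# Step 1: count how often each suffix follows each prefix (trigrams, n = 3)
counts = defaultdict(lambda: defaultdict(int))
for sentence in toy_corpus:
    for ngram in padded_ngrams(sentence, 3):
        counts[ngram[:-1]][ngram[-1]] += 1

# Step 2: normalize the counts of each prefix into probabilities
probs = {
    prefix: {word: c / sum(suffixes.values()) for word, c in suffixes.items()}
    for prefix, suffixes in counts.items()
}

print(probs[("inflation", "is")])  # "expected" gets 2/3, "high" gets 1/3
print(probs[("expected", "to")])   # "rise" and "fall" get 0.5 each
```

The prefix `("inflation", "is")` is followed by "expected" in two of the three toy texts, which is where the 2/3 comes from.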

**generate_text:**
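This function first lowercases the seed text and splits it into tokens. At each step it takes the last n-1 tokens as the prefix, looks that prefix up in the model, and appends the continuation with the highest probability. Generation stops when the prefix is not found in the model or when the stopping criteria based on min_length and probability_threshold are met.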
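
Always taking the most probable word makes the output deterministic, and it can fall into repeating phrases. As a point of comparison, the sketch below contrasts that greedy choice with weighted random sampling. The distribution and the function names `pick_greedy` and `pick_sampled` are assumptions made up for this illustration; they are not part of the notebook's code.

```python
import random

# Made-up next-word distribution for a single prefix
next_word_probs = {"expected": 0.6, "high": 0.3, "rising": 0.1}


def pick_greedy(distribution):
    """Return the word with the highest probability."""
    return max(distribution.items(), key=lambda item: item[1])[0]


def pick_sampled(distribution, rng=random):
    """Draw one word at random, weighted by its probability."""
    words = list(distribution)
    weights = [distribution[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


print(pick_greedy(next_word_probs))  # always "expected"

rng = random.Random(0)  # seeded generator so repeated runs give the same draws
print([pick_sampled(next_word_probs, rng) for _ in range(5)])
```

Sampling trades some coherence for variety, which is the same diversity versus coherence trade-off that the parameters control.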




In [None]:
# Function to preprocess text
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    return tokens

# Function to build a probabilistic n-gram model
def build_probabilistic_ngram_model(corpus, n):
    model = defaultdict(lambda: defaultdict(lambda: 0))

    for sentence in corpus:
        ngrams_in_sentence = list(ngrams(sentence, n, pad_left=True, pad_right=True))
        for ngram in ngrams_in_sentence:
            prefix = tuple(ngram[:-1])
            suffix = ngram[-1]
            model[prefix][suffix] += 1

    # Convert counts to probabilities
    for prefix in model:
        total_count = float(sum(model[prefix].values()))
        for suffix in model[prefix]:
            model[prefix][suffix] /= total_count

    return model


# Function to generate text using the probabilistic n-gram model with stop criteria
def generate_text(model, seed_text, n, probability_threshold=0.1, min_length=10):
    generated_text = seed_text.lower().split()
    current_length = len(generated_text)

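    # Always generate up to min_length tokens. After that, keep going only while a
    # uniform random draw exceeds probability_threshold (hard cap of 100 tokens).
    while current_length < min_length or (current_length < 100 and random.uniform(0, 1) > probability_threshold):
        # The prefix is the last n-1 tokens (an empty tuple for a unigram model)
        current_prefix = tuple(generated_text[-(n - 1):]) if n > 1 else ()

        # Check if the current_prefix exists in the model
        if current_prefix in model:
            # Greedy choice: take the most probable next token
            next_word = max(model[current_prefix].items(), key=lambda x: x[1])[0]
            if next_word is None:
                break  # None is the end-of-text padding symbol, so stop here
            generated_text.append(next_word)
            current_length += 1
        else:
            break  # Break if the current_prefix is not in the model

    return ' '.join(generated_text)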




In [None]:
# Load the Reuters dataset
corpus = [reuters.raw(file_id) for file_id in reuters.fileids()]

# Preprocess the entire corpus
preprocessed_corpus = [preprocess_text(text) for text in corpus]

# Choose an n for the n-gram model
n_value = 3  # You may change this value

# Build the probabilistic n-gram model
probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

In [None]:
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.02, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation is expected to be a major trade bill that would be a major



**Parameters :** After experimenting with different values, the best n_value depends on the complexity of the language structure in the corpus. In this example, n_value = 3 seems to be the best option.
The probability_threshold parameter controls when generation stops. In this implementation it is compared with a uniform random number once min_length has been reached, so each additional word is generated with probability 1 - probability_threshold, and a smaller threshold tends to produce longer text. The best value depends on the desired balance between randomness and coherence; 0.02 seems to be the best choice here.
The min_length parameter defines the minimum length of the generated text. The best value depends on the corpus, but 5 worked best for this example.
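
As a rough guide to how probability_threshold affects length: in the `generate_text` loop, every word beyond min_length is attempted only if `random.uniform(0, 1) > probability_threshold`. The sketch below estimates the expected number of extra words under that rule. The function name `expected_extra_words` is hypothetical, and the estimate assumes the seed is no longer than min_length and that the prefix is always found in the model, so it is an upper bound rather than a prediction.

```python
def expected_extra_words(probability_threshold, min_length, max_length=100):
    """Expected number of words generated after min_length is reached.

    Each extra word is produced with probability q = 1 - probability_threshold,
    and generation is capped at max_length tokens, so the expectation is the
    truncated geometric sum q + q^2 + ... + q^cap.
    """
    q = 1.0 - probability_threshold
    cap = max(max_length - min_length, 0)
    return sum(q ** k for k in range(1, cap + 1))


for threshold in (0.02, 0.1, 0.5):
    extra = expected_extra_words(threshold, min_length=5)
    print(f"threshold={threshold}: about {extra:.1f} extra words on average")
```

In practice the generated text is often shorter than this estimate, because generation also stops as soon as the current prefix is missing from the model.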

**Trade-offs and considerations :** When choosing n_value, we should strike a balance between randomness and coherence. If n is too small, each word depends on very little context, so the text is more random and less coherent. If n is too large, the effect is the opposite: the text is locally coherent but mostly repeats the training data, and many prefixes never occur in the corpus, so generation stops early.
probability_threshold behaves similarly: setting it too small or too large may give poor results. The trade-off is between diversity and coherence.
Setting min_length too low may produce very short text, while setting it too high may limit the diversity of the generated text.
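
The task statement also describes a stopping rule driven by the next word's probability. The sketch below shows one possible reading of that rule: min_length acts as a floor, and after it generation stops once the best candidate's probability drops below probability_threshold. The function name `generate_with_probability_stop` and the toy bigram model are assumptions for illustration only; this is not the rule used to produce the output shown in this notebook.

```python
def generate_with_probability_stop(model, seed_tokens, n,
                                   probability_threshold, min_length,
                                   max_length=100):
    """Greedy generation that stops when the best next word is too unlikely."""
    tokens = list(seed_tokens)
    while len(tokens) < max_length:
        prefix = tuple(tokens[-(n - 1):]) if n > 1 else ()
        candidates = model.get(prefix)
        if not candidates:
            break  # unseen prefix: nothing to continue with
        word, prob = max(candidates.items(), key=lambda item: item[1])
        if word is None:
            break  # None marks the end-of-text padding
        if len(tokens) >= min_length and prob < probability_threshold:
            break  # long enough, and the model is no longer confident
        tokens.append(word)
    return tokens


# Made-up bigram model (n = 2): prefix tuple -> {next word: probability}
toy_model = {
    ("oil",): {"prices": 0.9, "output": 0.1},
    ("prices",): {"rose": 0.6, "fell": 0.4},
    ("rose",): {"sharply": 0.3, "again": 0.3, "today": 0.4},
    ("today",): {None: 1.0},
}

print(generate_with_probability_stop(toy_model, ["oil"], 2, 0.5, min_length=2))
print(generate_with_probability_stop(toy_model, ["oil"], 2, 0.0, min_length=2))
```

With a threshold of 0.5 the toy run stops after "rose", because the best continuation there has probability 0.4; with a threshold of 0.0 it continues until the end-of-text symbol.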

**Insights :** Having more training text results in a better model. Beyond that, choosing good values for the parameters is also very important.

#Q2: Sentiment Analysis with Naive Bayes Classifier (50 Points)

**Objective:**

You are tasked with implementing a Naive Bayes classifier for sentiment analysis. The provided code is incomplete, and your goal is to complete the missing parts. Additionally, you should train the classifier on a small dataset and analyze its performance.

**Tasks:**

1.**Complete the Code (35 points)**: Fill in the missing parts in the provided Python code for the Naive Bayes classifier. Pay special attention to the `extract_features` function.

2.**Train and Test**: Train the Naive Bayes classifier on the training data and test it on a separate test set. Evaluate the accuracy of the classifier.

3.**Analysis (15 points)**: Discuss the results. Identify any misclassifications and try to understand why the classifier may fail in those cases. Provide examples of sentences that were not predicted correctly and explain possible reasons.


In [65]:
import random
import math
import string
from collections import defaultdict

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.corpus import movie_reviews
import nltk

# Download NLTK resources
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [66]:
def get_features(tokens):
    # Remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word.lower() not in stop_words]

    # Perform stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]

    return tokens

In [76]:
class NaiveBayesClassifier:
    def __init__(self, classes):
        self.classes = classes
        self.class_probs = defaultdict(float)
        self.feature_probs = defaultdict(lambda: defaultdict(float))

    def train(self, training_data):
        # Implement training here
        # You should use get_features function to extract useful tokens from dataset and use them to train the classifier.

        # Here, first we compute count of positive and negative data
        class_counts = defaultdict(int)
        for _, sentiment in training_data:
            class_counts[sentiment] += 1

        # Then, we calculate class probabilities based on count of each class
        total_examples = len(training_data)
        for sentiment in self.classes:
            self.class_probs[sentiment] = class_counts[sentiment] / total_examples

        # After that, we create a new dict that holds every feature used in the corpus
        # and count how often each feature appears in texts of each sentiment.
        # For example, the word "taken" may appear 3 times in positive texts and 1 time in negative texts,
        # so the dict would look like this: feature_counts = {"taken" : {"pos" : 3, "neg" : 1}}
        feature_counts = defaultdict(lambda: defaultdict(int))
        for tokens, sentiment in training_data:
            features = get_features(tokens)
            for feature in features:
                feature_counts[feature][sentiment] += 1

        # Using the dict above, we compute for every feature the share of its occurrences that fall in each class
        for feature in feature_counts:
            total = sum(feature_counts[feature].values())
            for sentiment in self.classes:
                self.feature_probs[feature][sentiment] = feature_counts[feature][sentiment] / total


    def classify(self, features):
        # Implement classification here
        # First we define two variables that keep track of the best score and the best class so far.
        # Then, for each sentiment, we start from the log of the class probability learned in training
        # and add the log of each feature's probability from the feature_probs dict.
        # The small constant 1e-10 keeps math.log defined when a probability is 0 (e.g. for an unseen feature).
        max_prob = float('-inf')
        best_class = None
        for sentiment in self.classes:
            prob = math.log(self.class_probs[sentiment])
            for feature in features:
                prob += math.log(self.feature_probs[feature][sentiment] + 1e-10)
            if prob > max_prob:
                max_prob = prob
                best_class = sentiment

        return best_class

In [83]:
# Load the movie reviews dataset from NLTK
data = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# The reviews are already tokenized by movie_reviews.words()

# Shuffle the dataset for randomness
random.shuffle(data)

# Split the dataset into training and testing sets
split_ratio = 0.8
split_index = int(len(data) * split_ratio)
train_set = data[:split_index]
test_set = data[split_index:]

# Train the Naive Bayes classifier
classes = set(sentiment for _, sentiment in train_set)
classifier = NaiveBayesClassifier(classes)
classifier.train(train_set)


def calculate_accuracy(dataset, dataset_type):
    # Count how many examples in the given dataset are classified correctly
    correct_predictions = 0
    for example in dataset:
        tokens, true_sentiment = example
        features = get_features(tokens)
        predicted_sentiment = classifier.classify(features)
        if predicted_sentiment == true_sentiment:
            correct_predictions += 1

    accuracy = correct_predictions / len(dataset)
    print(f"{dataset_type} Accuracy: {accuracy}")


calculate_accuracy(train_set, 'Train')
calculate_accuracy(test_set, 'Test')

Train Accuracy: 0.9975
Test Accuracy: 0.705


**Analysis :**

In this exercise, we build a classifier that decides whether a text is positive or negative. To do so, we built a class with two methods, train and classify. Each method is explained in the comments next to its code.

First, we build a dictionary of all features from the training set. Then we calculate the probability of each feature for the negative and positive classes.

Consider the word "high". Suppose it is used 6 times in positive texts and 4 times in negative texts. Then its entry in our dictionary is { "high" : {"pos" : 0.6, "neg" : 0.4}}. Other words are handled the same way. In the classification step, we use these numbers with math.log to compute a log score for every sentiment, and finally we choose the sentiment with the highest score.
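
For comparison, textbook multinomial Naive Bayes estimates P(word | class) with add-one (Laplace) smoothing, rather than the per-word class share described above. The following is a minimal sketch on made-up training data; `train_docs`, `log_likelihood`, and `classify` are hypothetical names, and this is not the code that produced the accuracies reported in this notebook.

```python
import math
from collections import Counter

# Made-up, already tokenized training data
train_docs = [
    (["great", "fun", "great"], "pos"),
    (["fun", "touching"], "pos"),
    (["boring", "bad"], "neg"),
    (["bad", "bad", "plot"], "neg"),
]

word_counts = {"pos": Counter(), "neg": Counter()}
doc_counts = Counter()
for tokens, label in train_docs:
    word_counts[label].update(tokens)
    doc_counts[label] += 1

vocabulary = {word for counts in word_counts.values() for word in counts}


def log_likelihood(word, label, alpha=1.0):
    """log P(word | label) with add-alpha smoothing over the vocabulary."""
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + alpha)
                    / (total + alpha * len(vocabulary)))


def classify(tokens):
    """Return the label with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label in word_counts:
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        score += sum(log_likelihood(word, label) for word in tokens)
        scores[label] = score
    return max(scores, key=scores.get)


print(classify(["great", "plot"]))       # pos
print(classify(["bad", "unseen_word"]))  # neg
```

Because of the smoothing term, a word that never occurred with a class still gets a small non-zero probability, so no extra constant is needed inside the logarithm.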

I tested the classifier with different training set sizes. Usually, more training examples give more accurate probabilities and a larger set of features, which helps increase accuracy.

It may be interesting to ask why the training accuracy is not 100%. In some texts, words are used with the opposite of their usual sentiment. For example, "bad" is normally a negative word, but it can appear in a positive text. Such words shift the probabilities, so the classifier may label a text as positive when it is actually negative, or the other way round.

On the test set we reached 70.5% accuracy, which is not bad but could be improved with more training examples. For example, if the split ratio is changed to 0.9, the test accuracy may rise to 71.5%.

Because the reviews are very long, we cannot show a full misclassified text here. However, we can name some likely reasons for misclassification:

1. Some texts are complex, and words in them are assigned the wrong sentiment.
2. The text may contain words that do not appear in the training examples.
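
One way to inspect wrong predictions despite the length of the reviews is to print only the first few tokens of each misclassified text. The sketch below shows the idea on made-up data; `collect_misclassified`, `toy_predict`, and `toy_test_set` are hypothetical names introduced for this illustration.

```python
def collect_misclassified(dataset, predict, max_tokens=30):
    """Return (true_label, predicted_label, snippet) for every wrong prediction."""
    errors = []
    for tokens, true_label in dataset:
        predicted = predict(tokens)
        if predicted != true_label:
            snippet = " ".join(tokens[:max_tokens])  # keep only the beginning
            errors.append((true_label, predicted, snippet))
    return errors


def toy_predict(tokens):
    """Stand-in classifier: calls a text negative if it contains 'bad'."""
    return "neg" if "bad" in tokens else "pos"


# Made-up test examples in the same (tokens, label) format as the dataset
toy_test_set = [
    (["not", "bad", "at", "all", ",", "a", "real", "surprise"], "pos"),
    (["a", "dull", "and", "lifeless", "film"], "neg"),
    (["bad", "acting", "and", "a", "bad", "script"], "neg"),
]

errors = collect_misclassified(toy_test_set, toy_predict, max_tokens=5)
for true_label, predicted, snippet in errors:
    print(f"true={true_label} predicted={predicted}: {snippet} ...")
```

In this notebook, the same helper could presumably be called with `test_set` and `lambda tokens: classifier.classify(get_features(tokens))` as the `predict` argument. The first toy example also illustrates reason 1: "bad" appears in a positive text because of the negation.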



#Submission Instructions:


1.Submit a Google Colab notebook containing your completed code and experimentation results.

2.Include comments and explanations in your code to help understand the implemented logic.

3.Clearly present the results of your parameter tuning in the notebook.

4.Provide a brief summary of your findings and insights in the conclusion section.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Experiment with various seed texts to showcase the diversity of generated text.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.