# Introduction

The task is to build a Naïve Bayes classifier for sentiment analysis that labels a movie review as either positive or negative. The dataset contains 1000 positive and 1000 negative reviews, so the two classes are balanced for both training and testing. The goal is a reliable classifier that helps gauge public sentiment toward movies, which is useful to stakeholders in the film industry and to prospective audiences.

## Approach to Address the Problem

1. **Data Acquisition**: The first step involves downloading the polarity dataset v2.0 from the provided Cornell University website. This dataset will serve as the foundation for our sentiment analysis model.

2. **Data Preparation**:
* Text Cleaning: The movie reviews, being extracted from web postings, will likely contain a mix of useful and irrelevant textual content. The cleaning process will involve removing punctuation, special characters, and stop words to ensure that only meaningful words are fed into the model.
* Tokenization: This involves breaking down the text into individual words or tokens. This step is crucial for converting text into a format that can be analyzed by the classifier.
* Vectorization: To use textual data for machine learning, we need to convert the text into numerical format. This will be achieved by creating count vectors, representing the frequency of each word's occurrence in the document.
* Creation of Term-Document Matrix: This matrix will represent the frequency of terms that occur in the collection of documents, helping in understanding the significance of words in the dataset.

3. **Frequency Distribution Analysis**: Before building the model, we will analyze the frequency distribution of words to see which words are most common in positive and in negative reviews. This step includes generating plots of word-count frequencies in the dataset.

4. **Model Training**: 
* Selection of Naïve Bayes Variant: Based on the nature of the data, we will decide on the most suitable variant of the Naïve Bayes classifier (e.g., Multinomial, Bernoulli, or Gaussian).
* Train/Test Split: The dataset will be divided into a training set and a held-out test set so that the model's performance is measured on reviews it has not seen during training.
* Model Training: With the training data, the Naïve Bayes classifier will be trained to understand patterns in the data that indicate positive or negative sentiment.

5. **Model Testing and Evaluation**:
* Performance Evaluation: Once the model is trained, it will be tested on unseen data to assess its accuracy along with other performance metrics such as precision, recall, and F1-score.
* Feature Analysis: We will examine the most informative features or words that the model uses to differentiate between positive and negative sentiments.

6. **Application to New Data**: Finally, the trained model will be applied to a new movie review to demonstrate its practical utility and assess its real-world applicability.

7. **Conclusion**: We will summarize the findings, discuss the effectiveness of the model, and propose potential improvements or future directions for enhancing sentiment analysis in movie reviews.
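
To make the steps above concrete, here is a minimal from-scratch sketch of the core pipeline: cleaning and tokenization, word counts per class, and a multinomial Naïve Bayes model with Laplace smoothing. It is only an illustration under stated assumptions, not the implementation used later in this notebook. The function names (`tokenize`, `train`, `predict`), the tiny stop-word list, and the four toy reviews are all hypothetical and are not drawn from the polarity dataset.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to"}


def tokenize(text):
    """Lowercase the text, drop punctuation, split into words, remove stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def train(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with Laplace (add-alpha) smoothing."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter(labels)        # label -> number of documents
    for text, label in zip(docs, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}

    model = {"log_prior": {}, "log_likelihood": {}, "vocab": vocab}
    for label, n_docs in doc_counts.items():
        model["log_prior"][label] = math.log(n_docs / len(docs))
        # Smoothed denominator: class token count plus alpha per vocabulary word.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            w: math.log((word_counts[label][w] + alpha) / total) for w in vocab
        }
    return model


def predict(model, text):
    """Return the label with the highest log-posterior; unseen words are ignored."""
    tokens = [t for t in tokenize(text) if t in model["vocab"]]
    scores = {
        label: prior + sum(model["log_likelihood"][label][t] for t in tokens)
        for label, prior in model["log_prior"].items()
    }
    return max(scores, key=scores.get)


# Toy reviews, invented purely for illustration.
train_docs = [
    "A wonderful, moving film with a brilliant cast.",
    "Brilliant direction and a wonderful story.",
    "A dull, boring plot and terrible acting.",
    "Terrible pacing; the film was boring.",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = train(train_docs, train_labels)
print(predict(model, "What a wonderful and brilliant movie!"))  # pos
print(predict(model, "Boring and terrible."))                   # neg
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term keeps a word that never appears in one class from driving that class's probability to zero.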

In [None]:
# Library imports
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from copy import copy