# Sentiment Analysis Project

## Introduction
Hello! This is my first big project in Natural Language Processing (NLP). I'm going to build a model that can figure out the sentiment (feeling) behind movie review phrases. This is super useful for understanding what people think about movies!

I'll be using TF-IDF (Term Frequency-Inverse Document Frequency) to convert text into numbers and then a Logistic Regression model to classify the sentiments. The sentiments are labeled from 0 (negative) to 4 (positive), with 2 being neutral.

Let's get started!

In [None]:
# First, we need to bring in all the tools we'll use.
import pandas as pd # For working with data tables (DataFrames)
import numpy as np  # For numerical operations, especially with arrays
import matplotlib.pyplot as plt # For making cool charts and graphs
import nltk # The Natural Language Toolkit, essential for text processing
from nltk.corpus import stopwords # To remove common words like 'the', 'is', 'a'
from nltk.stem.snowball import SnowballStemmer # To reduce words to their root form (e.g., 'running' to 'run')
from nltk.tokenize import word_tokenize # To break sentences into individual words
from sklearn.feature_extraction.text import TfidfVectorizer # To convert text into numerical features
from sklearn.linear_model import LogisticRegression # Our chosen model for classification
from sklearn.model_selection import train_test_split # To split our data for training and testing
from sklearn.metrics import accuracy_score # To see how well our model performs

## Loading the Data
I'll load the training and testing datasets. These files contain phrases and their corresponding sentiments (for training) or just phrases (for testing).

In [None]:
# Load the training and test data. sep='\t' means the values are separated by tabs.
try:
    train_df = pd.read_csv('train.tsv', sep='\t')
    test_df = pd.read_csv('test.tsv', sep='\t')
    sample_submission_df = pd.read_csv('sampleSubmission.csv')
except FileNotFoundError:
    print("Make sure 'train.tsv', 'test.tsv', and 'sampleSubmission.csv' are in the same directory!")
    # Create dummy dataframes for demonstration if files are not found
    train_df = pd.DataFrame({
        'PhraseId': [1, 2, 3, 4, 5],
        'SentenceId': [1, 1, 1, 1, 1],
        'Phrase': [
            'A very good movie!',
            'This film was terrible.',
            'It was okay, nothing special.',
            'Absolutely fantastic!',
            'Could have been better.'
        ],
        'Sentiment': [3, 0, 2, 4, 1]
    })
    test_df = pd.DataFrame({
        'PhraseId': [156061, 156062, 156063],
        'SentenceId': [8545, 8545, 8545],
        'Phrase': [
            'An intermittently pleasing but mostly routine effort',
            'Not bad, not great.',
            'The worst movie ever.'
        ]
    })
    sample_submission_df = pd.DataFrame({
        'PhraseId': [156061, 156062, 156063],
        'Sentiment': [2, 2, 2]
    })

# Let's see the first few rows of our training data
print("\n--- Training Data Head ---")
print(train_df.head())

# And the test data
print("\n--- Test Data Head ---")
print(test_df.head())

# And the sample submission file format
print("\n--- Sample Submission Data Head ---")
print(sample_submission_df.head())

## Data Exploration and Visualization
It's good to understand our data. Let's look at the distribution of sentiments in the training set.

In [None]:
# Let's see what share of phrases falls into each sentiment category
# (normalize=True returns proportions rather than raw counts)
sentiment_counts = train_df['Sentiment'].value_counts(normalize=True).sort_index()

print("\n--- Sentiment Distribution ---")
print(sentiment_counts)

# Now, let's visualize it with a bar chart!
plt.figure(figsize=(8, 5))
sentiment_counts.plot(kind='bar', color='skyblue')
plt.title('Distribution of Sentiments in Training Data')
plt.xlabel('Sentiment (0: Negative, 1: Somewhat Negative, 2: Neutral, 3: Somewhat Positive, 4: Positive)')
plt.ylabel('Proportion')
plt.xticks(rotation=0) # Keep labels horizontal
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

print("\nLooks like 'neutral' (sentiment 2) is the most common category. This is important to remember!")

## Text Preprocessing
Before we can feed text to a machine learning model, we need to clean it up and convert it into a format the model understands. This involves:
1.  **Tokenization**: Breaking sentences into words.
2.  **Lowercasing**: Converting all words to lowercase to treat 'The' and 'the' as the same word.
3.  **Removing Punctuation/Numbers**: Keeping only tokens made up entirely of letters.
4.  **Stopword Removal**: Getting rid of common words that don't add much meaning (like 'a', 'an', 'the').
5.  **Stemming**: Reducing the remaining words to their stem (e.g., 'running' and 'runs' both become 'run'). A stemmer only strips suffixes, so an irregular form like 'ran' stays as it is.

In [None]:
# We need to download some NLTK data for tokenization and stopwords
print("Downloading NLTK 'punkt' tokenizer data...")
nltk.download('punkt', quiet=True) # For word_tokenize
nltk.download('punkt_tab', quiet=True) # Newer NLTK releases load the tokenizer from 'punkt_tab' instead
print("Downloading NLTK 'stopwords' data...")
nltk.download('stopwords', quiet=True) # For english stopwords

# Initialize our stemmer and get English stopwords
stemmer = SnowballStemmer(language='english')
english_stopwords = stopwords.words('english')

print("\nExample of a stemmed word: 'Running' becomes '" + stemmer.stem('Running') + "'")
print("\nSome English stopwords: " + ", ".join(english_stopwords[:10]) + "...")

In [None]:
# This function will do all our text cleaning steps!
def custom_tokenizer(text):
    # 1. Tokenize and lowercase
    tokens = word_tokenize(text.lower())
    
    # 2. Keep only alphabetic characters (remove punctuation, numbers)
    alphabetic_tokens = [token for token in tokens if token.isalpha()]
    
    # 3. Remove stopwords and then stem the remaining words
    processed_tokens = [stemmer.stem(token) for token in alphabetic_tokens if token not in english_stopwords]
    
    return processed_tokens

# Let's test our tokenizer with an example sentence
example_sentence = "This is a really great movie, I loved it!"
print(f"\nOriginal sentence: '{example_sentence}'")
print(f"Processed tokens: {custom_tokenizer(example_sentence)}")

## TF-IDF Vectorization
Now that our text is clean, we need to turn it into numbers. TF-IDF helps us do this by giving importance to words that are frequent in one document but not too common across all documents. This way, important words stand out.
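
To make the weighting concrete, here is a minimal pure-Python sketch on three made-up tokenized phrases. It assumes the smoothed IDF formula that scikit-learn documents as the `TfidfVectorizer` default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The names `tfidf_vectors` and `toy_docs` are hypothetical and only exist for this illustration.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights for tokenized documents."""
    n_docs = len(documents)
    vocabulary = sorted({token for doc in documents for token in doc})
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    # Smoothed IDF (assumed to match scikit-learn's default): ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

    rows = []
    for doc in documents:
        counts = Counter(doc)
        raw = [counts[term] * idf[term] for term in vocabulary]
        # Scale each row to unit length so long and short phrases are comparable.
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocabulary, rows

# Made-up phrases, already tokenized, for illustration only.
toy_docs = [
    ["great", "movie"],
    ["terrible", "movie"],
    ["great", "great", "acting"],
]
vocab, weights = tfidf_vectors(toy_docs)
for doc, row in zip(toy_docs, weights):
    print(doc, {term: round(w, 3) for term, w in zip(vocab, row) if w > 0})
```

In the second phrase, 'terrible' ends up with a higher weight than 'movie', because 'movie' appears in more of the phrases and therefore gets a smaller IDF.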

I'll use `ngram_range=(1,2)` to consider single words (unigrams) and pairs of adjacent words (bigrams). I'll also cap the vocabulary with `max_features`, which keeps only the terms with the highest frequency across the corpus.
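
As a rough illustration of what unigrams plus bigrams look like, here is a small sketch that expands a token list into n-grams. `extract_ngrams` is a hypothetical helper written for this example, not the vectorizer's internal code, and the tokens are made up.

```python
def extract_ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams (joined by spaces) for every n in the inclusive range."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

# Hypothetical tokens, as they might look after cleaning.
example_tokens = ["really", "great", "movie"]
features = extract_ngrams(example_tokens)
print(features)
```

Because the n-grams are built from the cleaned tokens, a bigram can join two words that were separated by a removed stopword in the original phrase.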

In [None]:
# Initialize the TF-IDF Vectorizer
# tokenizer=custom_tokenizer: tells it to use our cleaning function
# token_pattern=None: the default regex is unused with a custom tokenizer; None avoids a warning about that
# ngram_range=(1,2): considers single words and two-word phrases
# max_features: keeps only the most frequent words/phrases (to avoid too many features)
vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer,
                           token_pattern=None,
                           ngram_range=(1,2),
                           max_features=2300) # Using 2300 features as seen in the reference notebook

# Fit the vectorizer on our training phrases. This learns the vocabulary and IDF values.
print("\nFitting TF-IDF Vectorizer on training phrases...")
inputs_train_transformed = vectorizer.fit_transform(train_df['Phrase'])

# Transform the test phrases using the same fitted vectorizer
print("Transforming test phrases...")
# Important: replace missing phrases with empty strings instead of dropping the rows,
# so that every PhraseId still receives a prediction in the submission file
test_df['Phrase'] = test_df['Phrase'].fillna('')
inputs_test_transformed = vectorizer.transform(test_df['Phrase'])

print(f"\nShape of transformed training data (samples, features): {inputs_train_transformed.shape}")
print(f"Shape of transformed test data (samples, features): {inputs_test_transformed.shape}")

print("\nFirst 10 feature names (words/phrases) learned by the vectorizer:")
print(vectorizer.get_feature_names_out()[:10])

## Splitting Data for Training and Validation
It's good practice to split our training data into a smaller training set and a validation set. This helps us check how well our model generalizes to new, unseen data before we test it on the actual test set.
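
Since the sentiment classes are not equally common, it helps to split each class separately so the validation set keeps the same mix. Here is a minimal pure-Python sketch of that idea (a stratified split). `stratified_split` and `toy_labels` are hypothetical names, the labels are made up, and this is only an illustration of the concept, not how `train_test_split` is implemented.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, val_fraction=0.2, seed=42):
    """Split row indices so each class contributes about val_fraction to validation."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train_idx, val_idx = [], []
    for label in sorted(indices_by_class):
        indices = indices_by_class[label]
        rng.shuffle(indices)  # Shuffle within the class, then cut off the validation share.
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Made-up imbalanced labels: mostly neutral (2), few extremes (0 and 4).
toy_labels = [2] * 50 + [1] * 20 + [3] * 20 + [0] * 5 + [4] * 5
train_idx, val_idx = stratified_split(toy_labels)
print("train:", Counter(toy_labels[i] for i in train_idx))
print("val:  ", Counter(toy_labels[i] for i in val_idx))
```

With a plain random split, a rare class could end up with very few (or zero) validation examples, which would make the validation score less trustworthy.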

In [None]:
# Let's define our features (X) and targets (y)
X = inputs_train_transformed
y = train_df['Sentiment']

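# Splitting the data into training and validation sets
# test_size=0.2 means 20% of data will be for validation
# random_state=42 ensures we get the same split every time
# stratify=y keeps the sentiment proportions the same in both sets
# (it needs at least two phrases per class, so it fails on the tiny fallback data above)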
print("Splitting training data into training and validation sets...")
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"\nShape of X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"Shape of X_val: {X_val.shape}, y_val: {y_val.shape}")

## Training the Logistic Regression Model
Logistic Regression is a simple yet powerful model for classification tasks. It's a good starting point for sentiment analysis.

In [None]:
# Initialize the Logistic Regression model
# max_iter=1000: raises the cap on solver iterations so it has room to converge
print("Initializing Logistic Regression model...")
model = LogisticRegression(max_iter=1000, solver='liblinear') # 'liblinear' supports L1/L2 regularization and fits one binary (one-vs-rest) classifier per sentiment class

# Train the model using our training data
print("Training the model...")
model.fit(X_train, y_train)
print("Model training complete!")

## Model Evaluation
Now let's see how well our model performs on the training data and, more importantly, on the validation data (which it hasn't seen during training).

In [None]:
# Make predictions on the training set
train_preds = model.predict(X_train)

# Calculate accuracy on the training set
train_accuracy = accuracy_score(y_train, train_preds)
print(f"\nTraining Accuracy: {train_accuracy:.4f}")

# Make predictions on the validation set
val_preds = model.predict(X_val)

# Calculate accuracy on the validation set
val_accuracy = accuracy_score(y_val, val_preds)
print(f"Validation Accuracy: {val_accuracy:.4f}")

print("\nObservation: Validation accuracy is usually a bit lower than training accuracy, and the size of the gap shows how much the model overfits. Also, remember that the dataset has a class imbalance, with many neutral sentiments, so overall accuracy can look good even when the rarer classes are predicted poorly.")
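
Because of the class imbalance, it can help to compare accuracy against a "always predict the most common class" baseline and to look at each class separately. Here is a minimal pure-Python sketch of both checks. The helper names and the labels are made up for illustration; these are not results from this notebook.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Made-up labels for illustration only.
toy_true = [2, 2, 2, 2, 2, 2, 1, 3, 0, 4]
toy_pred = [2, 2, 2, 2, 2, 2, 2, 3, 2, 2]

accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(f"accuracy: {accuracy:.2f}")
print(f"majority baseline: {majority_baseline_accuracy(toy_true):.2f}")
print("recall per class:", per_class_recall(toy_true, toy_pred))
```

In this toy case the accuracy is only slightly above the baseline, and the per-class recall shows that three of the five classes are never predicted correctly. The same two checks could be applied to `y_val` and `val_preds`.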

## Making Predictions on Test Data and Creating Submission File
Finally, we'll use our trained model to predict sentiments for the test dataset and prepare the `submission.csv` file.

In [None]:
# Make predictions on the actual test data
print("Making predictions on the test data...")
test_preds = model.predict(inputs_test_transformed)

# Create the submission DataFrame
submission_df = pd.DataFrame({
    'PhraseId': test_df['PhraseId'],
    'Sentiment': test_preds
})

# Display the first few rows of our submission file
print("\n--- Submission File Head ---")
print(submission_df.head())

# Save the submission file to a CSV without the index
submission_df.to_csv('submission.csv', index=False)
print("\nSubmission file 'submission.csv' created successfully!")

## Conclusion
We successfully built a sentiment analysis model using TF-IDF and Logistic Regression! This notebook covers the full pipeline from data loading and exploration to preprocessing, model training, evaluation, and generating a submission file.

### Next Steps/Possible Improvements:
* **More Advanced Preprocessing**: Try lemmatization instead of stemming.
* **Different Models**: Experiment with other machine learning models like Naive Bayes, SVM, or even deep learning models (RNNs, Transformers).
* **Hyperparameter Tuning**: Optimize the parameters of the `TfidfVectorizer` and `LogisticRegression` for better performance.
* **Handling Imbalance**: Explore techniques to handle the class imbalance (e.g., oversampling, undersampling, using weighted loss functions).
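
As one possible starting point for the imbalance idea, here is a small sketch that computes per-class weights with the `n_samples / (n_classes * class_count)` heuristic, which is the formula scikit-learn documents for `class_weight='balanced'`. `balanced_class_weights` is a hypothetical helper and the labels are made up.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * counts[label]) for label in sorted(counts)}

# Made-up imbalanced labels: mostly neutral (2), few extremes (0 and 4).
toy_labels = [2] * 50 + [1] * 20 + [3] * 20 + [0] * 5 + [4] * 5
class_weights = balanced_class_weights(toy_labels)
for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.2f}")
```

Rare classes get weights above 1 and the common neutral class gets a weight below 1, so mistakes on rare classes count for more during training. A dictionary like this could be passed as the `class_weight` argument of `LogisticRegression`, or `class_weight='balanced'` could be used directly.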

Thanks for checking out my project!