# Chapter 03: Machine Learning for NLP - Sentiment Analysis

Now that we know how to turn text into numbers (vectorization), it's time to build a model that can actually "understand" the mood of a sentence.

In this chapter, we will:
1. **Prepare a Dataset**: Sentences labeled as Positive or Negative.
2. **Train a Classifier**: Using the **Multinomial Naive Bayes** algorithm (a classic for NLP).
3. **Evaluate the Model**: Testing it on new, unseen data.

## 1. The Data
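We'll start with a small, hand-labeled dataset to understand the flow. Note that the two mixed or neutral sentences at the end are labeled negative, so the classes are unbalanced: four positive and six negative.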

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Sample Dataset
data = {
    'text': [
        'I love this movie, it was amazing!',
        'The food was delicious and the service was great.',
        'What a wonderful day for a walk in the park.',
        'I am so happy with my new phone!',
        'This is the worst experience I have ever had.',
        'The movie was boring and way too long.',
        'I hate the rainy weather, it makes me sad.',
        'The service was terrible and I will not return.',
        'Technically impressive but emotionally distant.',
        'It was an okay experience, nothing special.'
    ],
    'label': ['positive', 'positive', 'positive', 'positive', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative']
}

df = pd.DataFrame(data)
print(df)

                                                text     label
0                 I love this movie, it was amazing!  positive
1  The food was delicious and the service was great.  positive
2       What a wonderful day for a walk in the park.  positive
3                   I am so happy with my new phone!  positive
4      This is the worst experience I have ever had.  negative
5             The movie was boring and way too long.  negative
6         I hate the rainy weather, it makes me sad.  negative
7    The service was terrible and I will not return.  negative
8    Technically impressive but emotionally distant.  negative
9        It was an okay experience, nothing special.  negative
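
Since the labels are unbalanced, it can help to check the class counts before training. The snippet below is a minimal sketch that rebuilds only the label column from the dataset above; the variable name `counts` is just for illustration.

```python
import pandas as pd

# Same label column as the sample dataset: 4 positive, then 6 negative
labels = ['positive'] * 4 + ['negative'] * 6

# Count how many sentences fall into each class
counts = pd.Series(labels, name='label').value_counts()
print(counts)

# Share of each class; Naive Bayes estimates its class priors from the training rows in the same way
print(counts / counts.sum())
```

With so few rows, this imbalance matters: when a sentence gives the model no usable evidence, the larger class tends to win.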


## 2. Training the Model
We will convert the text to TF-IDF vectors and then feed them into the Naive Bayes classifier.

In [16]:
# Split the raw text first so the test sentences stay unseen (our dataset is tiny!)
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42
)

# Fit the vectorizer on the training text only, then reuse it for the test text
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Initialize and train the classifier
model = MultinomialNB()
model.fit(X_train, y_train)

print("Model trained successfully!")

Model trained successfully!
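
The chapter outline promises an evaluation step, and `accuracy_score` and `classification_report` are already imported for it. The following is a minimal, self-contained sketch of how the held-out sentences could be scored. It wraps the vectorizer and classifier in a scikit-learn `Pipeline` via `make_pipeline`, which is a convenience swapped in here and not part of the cells above; the names `clf`, `y_pred`, and `acc` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Same ten sentences and labels as the sample dataset
texts = [
    'I love this movie, it was amazing!',
    'The food was delicious and the service was great.',
    'What a wonderful day for a walk in the park.',
    'I am so happy with my new phone!',
    'This is the worst experience I have ever had.',
    'The movie was boring and way too long.',
    'I hate the rainy weather, it makes me sad.',
    'The service was terrible and I will not return.',
    'Technically impressive but emotionally distant.',
    'It was an okay experience, nothing special.',
]
labels = ['positive'] * 4 + ['negative'] * 6

# Hold out 20% of the sentences (2 rows) for testing
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# The pipeline fits TF-IDF on the training text only, then trains Naive Bayes
clf = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
clf.fit(train_texts, y_train)

# Score the model on the sentences it did not see during training
y_pred = clf.predict(test_texts)
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)

# zero_division=0 avoids warnings when a class has no predicted samples
print(classification_report(y_test, y_pred, zero_division=0))
```

With only two test sentences, the accuracy can only be 0%, 50%, or 100%, so treat the numbers as a demonstration of the workflow, not as a measure of model quality.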


## 3. Testing and Prediction
Let's see if the model can predict the sentiment of a sentence it has never seen before.

In [17]:
new_reviews = [
    "I really enjoyed the plot of the film.",
    "The battery life of this device is awful."
]

new_X = vectorizer.transform(new_reviews)
predictions = model.predict(new_X)

for review, pred in zip(new_reviews, predictions):
    print(f"Review: '{review}' -> Prediction: {pred}")

Review: 'I really enjoyed the plot of the film.' -> Prediction: negative
Review: 'The battery life of this device is awful.' -> Prediction: negative
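
The first review is clearly positive, yet the model labels it negative. A likely explanation is that none of the words in either review appear in the training sentences, so both reviews become all-zero vectors and the classifier has nothing to go on except its class priors. The sketch below checks that under the same settings as above (`stop_words='english'`, `test_size=0.2`, `random_state=42`); it rebuilds the data so it can run on its own, and the helper names (`analyzer`, `nonzero_counts`, `priors`, `probs`) are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Same ten sentences and labels as the sample dataset
texts = [
    'I love this movie, it was amazing!',
    'The food was delicious and the service was great.',
    'What a wonderful day for a walk in the park.',
    'I am so happy with my new phone!',
    'This is the worst experience I have ever had.',
    'The movie was boring and way too long.',
    'I hate the rainy weather, it makes me sad.',
    'The service was terrible and I will not return.',
    'Technically impressive but emotionally distant.',
    'It was an okay experience, nothing special.',
]
labels = ['positive'] * 4 + ['negative'] * 6

train_texts, _, y_train, _ = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

vectorizer = TfidfVectorizer(stop_words='english')
model = MultinomialNB().fit(vectorizer.fit_transform(train_texts), y_train)

new_reviews = [
    "I really enjoyed the plot of the film.",
    "The battery life of this device is awful.",
]
new_X = vectorizer.transform(new_reviews)

# The analyzer applies the same tokenization and stop-word removal as the vectorizer
analyzer = vectorizer.build_analyzer()
nonzero_counts = new_X.getnnz(axis=1)  # non-zero features per review

for review, n_nonzero in zip(new_reviews, nonzero_counts):
    tokens = analyzer(review)
    known = [t for t in tokens if t in vectorizer.vocabulary_]
    print(f"tokens={tokens} known={known} nonzero_features={n_nonzero}")

# For an all-zero vector, the predicted probabilities equal the class priors
priors = np.exp(model.class_log_prior_)
probs = model.predict_proba(new_X)
print(dict(zip(model.classes_, priors.round(3))))
print(probs.round(3))
```

If `known` is empty for a review, the prediction reflects only which class was more frequent in the training rows. Adding training sentences that contain words such as "enjoyed" or "awful" is one way to give the model real evidence.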


## Wrap-up Task
1. Add 4 more sentences (2 positive, 2 negative) to the `data` dictionary.
2. Re-run the cells to retrain the model.
3. Try predicting the sentiment of: "This lesson is quite interesting but a bit hard."

In [18]:
# Expanded Sample Dataset
data = {
    'text': [
        'I love this movie, it was amazing!',
        'The food was delicious and the service was great.',
        'What a wonderful day for a walk in the park.',
        'I am so happy with my new phone!',
        'I really enjoyed the plot of the film.', # NEW: Added to help the model learn 'enjoyed'
        'The acting was superb and the plot was engaging.', # NEW: Added
        'This is the worst experience I have ever had.',
        'The movie was boring and way too long.',
        'I hate the rainy weather, it makes me sad.',
        'The service was terrible and I will not return.',
        'I am very disappointed with the quality.', # NEW: Added
        'This product is a total waste of money.', # NEW: Added
        'Technically impressive but emotionally distant.',
        'It was an okay experience, nothing special.'
    ],
    'label': [
        'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 
        'negative', 'negative', 'negative', 'negative', 'negative', 'negative',
        'negative', 'negative'
    ]
}

df = pd.DataFrame(data)

In [19]:
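# Refit the vectorizer and the classifier on the expanded dataset;
# without this step, `vectorizer` and `model` still come from Section 2.
# We fit on every row here because we only want a single prediction.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['text'])
model = MultinomialNB()
model.fit(X, df['label'])

# Test the "mixed" sentiment sentence
test_sentence = ["This lesson is quite interesting but a bit hard."]

# Transform with the SAME vectorizer the model was trained on
test_X = vectorizer.transform(test_sentence)
prediction = model.predict(test_X)

print(f"Sentence: '{test_sentence[0]}'")
print(f"Prediction: {prediction[0]}")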

Sentence: 'This lesson is quite interesting but a bit hard.'
Prediction: negative
