STEP 1: Install Required Libraries

In [14]:
# Install necessary NLP and ML libraries
!pip install nltk scikit-learn pandas






📌 Why?
These libraries help us process text, convert text to numbers, and train models.

🔹 STEP 2: Import Libraries

In [15]:
# Data handling
import pandas as pd

# NLP processing
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


📌 Why?

NLTK → Text processing

Scikit-learn → ML models

Pandas → Dataset handling

🔹 STEP 3: Download NLTK Resources

In [16]:
# Download tokenizer and stopwords
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ENrolment\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ENrolment\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

📌 Why?
NLTK needs the punkt tokenizer data to split text into tokens and the stopwords corpus to filter out common words.

🔹 STEP 4: Load the Sample Dataset

In [17]:
# Load dataset from CSV file
df = pd.read_csv("../dataset/sentiment_dataset.csv")

# Display first 5 rows
df.head()

Unnamed: 0,text,sentiment
0,Worth every penny,positive
1,Very disappointing,negative
2,Nothing special,neutral
3,Worth every penny,positive
4,Poor performance,negative


📌 Why?
This labeled dataset provides the examples from which the model learns which text goes with which sentiment.
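
Before training, it can help to check how the labels are distributed and whether any rows repeat. The sketch below rebuilds the five rows shown above as an inline CSV (the real file has more rows), so `csv_text` and `sample_df` are names made up for the example. Passing `index_col=0` is one way to keep the saved index from showing up as an `Unnamed: 0` column.

```python
import io

import pandas as pd

# Inline copy of the five rows displayed above (stand-in for the real CSV file)
csv_text = """,text,sentiment
0,Worth every penny,positive
1,Very disappointing,negative
2,Nothing special,neutral
3,Worth every penny,positive
4,Poor performance,negative
"""

# index_col=0 treats the first column as the row index instead of data
sample_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# How many rows each sentiment class has
label_counts = sample_df["sentiment"].value_counts()

# How many rows repeat an earlier (text, sentiment) pair
n_duplicates = int(sample_df.duplicated().sum())

print(label_counts)
print("Duplicate rows:", n_duplicates)
```

Repeated rows matter later: if the same sentence lands in both the training and the testing split, the test score no longer measures performance on unseen text.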

🔹 STEP 5: Text Preprocessing (Core NLP Step)

In [18]:
# Load English stopwords
stop_words = set(stopwords.words('english'))

# Function to clean and preprocess text
def preprocess(text):
    text = text.lower()                         # Convert text to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    tokens = word_tokenize(text)               # Split text into words (tokens)
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return ' '.join(tokens)                    # Join words back into sentence


📌 Why?
Lowercasing, stripping punctuation, and dropping filler words such as "is", "the", and "at" leaves the words that carry meaning, which gives the model a smaller and less noisy vocabulary.
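
To see what each stage does without downloading anything, here is a minimal sketch of the same pipeline in plain Python. It is not the notebook's function: `DEMO_STOP_WORDS` is a tiny hand-picked set (the notebook uses NLTK's full English list), and `str.split()` stands in for `word_tokenize`.

```python
import string

# Hypothetical, hand-picked stopword set for illustration only;
# the notebook uses NLTK's full English stopword list instead.
DEMO_STOP_WORDS = {"this", "is", "the", "it", "as", "very"}

def preprocess_demo(text):
    text = text.lower()                                               # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. drop punctuation
    tokens = text.split()                                             # 3. split on whitespace
    tokens = [t for t in tokens if t not in DEMO_STOP_WORDS]          # 4. drop stopwords
    return " ".join(tokens)                                           # 5. rejoin into one string

samples = ["This is amazing!", "It works as expected.", "The product is fine"]
cleaned = [preprocess_demo(s) for s in samples]

for raw, out in zip(samples, cleaned):
    print(f"{raw!r} -> {out!r}")
```

One thing to keep in mind when reusing this step: NLTK's English stopword list includes negations such as "not" and "no", so removing stopwords can also remove words that flip the sentiment of a sentence.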

🔹 STEP 6: Apply Preprocessing

In [19]:
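# Make sure the tokenizer data is available; newer NLTK releases
# look for 'punkt_tab' when word_tokenize is called
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')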

# Apply preprocessing to dataset
df['clean_text'] = df['text'].apply(preprocess)
df


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ENrolment\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\ENrolment\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ENrolment\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,text,sentiment,clean_text
0,Worth every penny,positive,worth every penny
1,Very disappointing,negative,disappointing
2,Nothing special,neutral,nothing special
3,Worth every penny,positive,worth every penny
4,Poor performance,negative,poor performance
...,...,...,...
296,It works as expected,neutral,works expected
297,This is amazing,positive,amazing
298,Very disappointing,negative,disappointing
299,The product is fine,neutral,product fine


📌 Explain:

“Raw text is converted into cleaned text for better understanding.”



🔹 STEP 7: Convert Text to Numbers (TF-IDF)

In [20]:
# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Convert cleaned text into numerical vectors
X = vectorizer.fit_transform(df['clean_text'])

# Target labels
y = df['sentiment']


📌 Why?
A model works on numbers, not raw text, so TF-IDF turns each sentence into a vector that weights a word higher when it is frequent in that sentence but rare across the dataset.
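
The weighting can be worked out by hand on a few sentences. The sketch below is a plain-Python approximation intended to mirror `TfidfVectorizer`'s default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization); the four sentences and the helper `tfidf_vector` are made up for the example.

```python
import math
from collections import Counter

# Four short, already-cleaned sentences (illustrative only)
docs = ["worth every penny", "poor performance", "poor product", "product fine"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many sentences each word appears
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_vector(tokens):
    counts = Counter(tokens)
    # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {
        w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
        for w, c in counts.items()
    }
    # L2-normalize so every sentence vector has length 1
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

vectors = [tfidf_vector(t) for t in tokenized]

for doc, vec in zip(docs, vectors):
    print(doc, {w: round(v, 3) for w, v in vec.items()})
```

In "poor performance", the word "performance" appears in only one sentence while "poor" appears in two, so "performance" ends up with the larger weight.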

🔹 STEP 8: Split Data into Training and Testing Sets

In [21]:
# Split dataset: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


📌 Why?

Training data → Teach the model

Testing data → Check performance
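
Two things are worth checking about a split: whether each class is represented in both parts, and whether the same sentence ends up on both sides. The sketch below uses toy data (three cleaned sentences from the dataset, each repeated five times) and adds `stratify=labels`, which the notebook's cell does not use, to keep class proportions equal in both splits.

```python
from sklearn.model_selection import train_test_split

# Toy data: three cleaned sentences, each repeated five times
texts = ["worth every penny", "disappointing", "nothing special"] * 5
labels = ["positive", "negative", "neutral"] * 5

# stratify=labels keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Sentences that occur in both the training and the testing split
overlap = set(X_tr) & set(X_te)

print("Train size:", len(X_tr), "Test size:", len(X_te))
print("Test labels:", sorted(y_te))
print("Sentences seen in both splits:", sorted(overlap))
```

Here every test sentence also appears in the training split, so a model could score perfectly by memorizing. Dropping duplicate rows before splitting is one way to get a more honest test set.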

🔹 STEP 9: Train the Naive Bayes Model

In [22]:
# Initialize the model
model = MultinomialNB()

# Train the model
model.fit(X_train, y_train)


MultinomialNB()


📌 Explain:

“The model learns patterns between words and sentiments.”
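
One way to look at what the model has learned is to ask for class probabilities instead of a single label. This is a small sketch on six made-up training sentences, with `demo_vectorizer` and `demo_model` as stand-ins for the notebook's `vectorizer` and `model`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Six cleaned sentences, two per class (illustrative only)
train_texts = [
    "worth every penny", "amazing",
    "poor performance", "disappointing",
    "nothing special", "product fine",
]
train_labels = [
    "positive", "positive",
    "negative", "negative",
    "neutral", "neutral",
]

demo_vectorizer = TfidfVectorizer()
demo_model = MultinomialNB()
demo_model.fit(demo_vectorizer.fit_transform(train_texts), train_labels)

# Probability of each class for one sentence
query = demo_vectorizer.transform(["worth every penny"])
predicted = demo_model.predict(query)[0]
probabilities = demo_model.predict_proba(query)[0]

# classes_ gives the label that belongs to each probability column
for label, p in zip(demo_model.classes_, probabilities):
    print(f"{label}: {p:.3f}")
print("Predicted:", predicted)
```

The predicted label is simply the class with the highest probability, so close probabilities are a hint that the model is unsure.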

🔹 STEP 10: Test the Model Accuracy

In [23]:
# Predict sentiments for test data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
accuracy


1.0

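📌 Why?
Accuracy is the fraction of test sentences whose predicted sentiment matches the true label, so 1.0 means every test sentence was classified correctly. The dataset contains repeated sentences (for example, "Worth every penny" appears in rows 0 and 3), so some test sentences were probably also seen during training, and a perfect score here may overstate how well the model handles new text.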
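
A single accuracy number hides which classes get confused with each other. A confusion matrix shows that breakdown. The labels below are hypothetical, not results from this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (not taken from the notebook's run)
y_true = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
y_hat = ["positive", "negative", "neutral", "neutral", "negative", "neutral"]

labels_order = ["negative", "neutral", "positive"]

# Fraction of predictions that match the true label
acc = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes, in labels_order
cm = confusion_matrix(y_true, y_hat, labels=labels_order)

print("Accuracy:", round(acc, 3))
print(cm)
```

In this example the only mistake is one positive sentence predicted as neutral, which shows up as the off-diagonal entry in the last row.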

🔹 STEP 11: Predict New Sentence (Demo)

In [24]:
tests = [
    "I dont like the movie but the plot is okay"
]

for t in tests:
    clean = preprocess(t)
    vec = vectorizer.transform([clean])
    print(t, "→", model.predict(vec)[0])

print("Model Accuracy:", accuracy * 100, "%")

I dont like the movie but the plot is okay → neutral
Model Accuracy: 100.0 %


STEP 1: Save Your Trained Model

You must save both the model and the vectorizer.

In [25]:
import joblib

# Save trained model
joblib.dump(model, "sentiment_model.pkl")

# Save TF-IDF vectorizer
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")


['tfidf_vectorizer.pkl']
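
Loading works the same way in reverse with `joblib.load`. The sketch below trains a tiny stand-in model, saves both objects to a temporary folder (so it does not overwrite the files created above), reloads them, and checks that the reloaded pair gives the same prediction. The training sentences and the names `demo_model` and `demo_vectorizer` are made up for the example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-in for the trained objects from the notebook
demo_vectorizer = TfidfVectorizer()
demo_model = MultinomialNB()
demo_model.fit(
    demo_vectorizer.fit_transform(
        ["worth every penny", "poor performance", "nothing special"]
    ),
    ["positive", "negative", "neutral"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "sentiment_model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "tfidf_vectorizer.pkl")

    # Save both objects, then read them back from disk
    joblib.dump(demo_model, model_path)
    joblib.dump(demo_vectorizer, vectorizer_path)
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# The reloaded pair should behave exactly like the original pair
new_text = ["poor performance"]
original_pred = demo_model.predict(demo_vectorizer.transform(new_text))[0]
reloaded_pred = loaded_model.predict(loaded_vectorizer.transform(new_text))[0]

print("Original:", original_pred, "| Reloaded:", reloaded_pred)
```

The vectorizer has to be saved alongside the model because it holds the vocabulary and IDF weights learned during training. New text should also go through the same `preprocess` function before it is passed to the loaded vectorizer.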