Step 1: Data Collection
For this task, we will use the IMDb Reviews dataset, which is publicly available and frequently used for text classification tasks.

You can download the dataset from the IMDb Dataset on Kaggle.

Alternatively, you can use other publicly available datasets from sources like Amazon product reviews.

Step 2: Data Preprocessing
We'll start by cleaning the text data. This involves:

Removing unnecessary characters (special characters, digits, etc.)
Converting text to lowercase
Tokenizing the text (splitting the text into words)
Removing stopwords (commonly used words that don’t add significant meaning, e.g., “the”, “is”)

In [8]:
import pandas as pd
import re
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


In [9]:
# Download the stopwords (only once)
nltk.download('stopwords')
nltk.download('punkt')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\saten\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\saten\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [10]:
# Load the IMDb dataset
df = pd.read_csv('IMDB Dataset.csv')  # Change to your file path
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [11]:
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

In [12]:
def preprocess_text(text):
    # Remove non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert text to lowercase
    text = text.lower()
    # Tokenize the text using spaCy
    doc = nlp(text)
    # Remove stopwords
    words = [token.text for token in doc if not token.is_stop]
    return ' '.join(words)

In [6]:
# Apply preprocessing to the review column
df['cleaned_review'] = df['review'].apply(preprocess_text)

Step 3: Feature Extraction
For text classification, we need to convert the text data into numerical features that the model can work with. TF-IDF is a commonly used technique for this.

Why TF-IDF over Bag of Words (BoW)?
Bag of Words (BoW) simply counts the frequency of words in a document, which doesn't take into account the importance of words across the entire corpus. This can lead to common words being over-represented.
TF-IDF adjusts the word frequency by how commonly the word appears in all documents. Words that appear frequently in one document but are rare across others get higher weights, which helps the model focus on more informative words.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
tfidf = TfidfVectorizer(max_features=5000)

# Transform the cleaned reviews into TF-IDF features
X = tfidf.fit_transform(df['cleaned_review']).toarray()

# Get labels (positive or negative)
y = df['sentiment'].map({'positive': 1, 'negative': 0}).values


Step 4: Model Selection and Training
We will choose Logistic Regression as our model for this task, but you could also use Naive Bayes or SVM based on your preference.

Train-Test Split
We’ll split the data into 80% training and 20% testing.

In [16]:
from sklearn.model_selection import train_test_split

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Train Logistic Regression Model
Now, we'll train a logistic regression model on the training data.

In [17]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)


Step 5: Evaluation
After training, we will evaluate the model's performance using accuracy, precision, and recall.

In [18]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')


Accuracy: 0.8831
Precision: 0.8734
Recall: 0.8982


Step 6: Model Improvement
Hyperparameter Tuning: You can improve the model's performance by tuning hyperparameters like the regularization strength in Logistic Regression (C).
Try Different Algorithms: Experiment with Naive Bayes or Support Vector Machine (SVM) to compare performance.
Increase Data: If the dataset is small, adding more labeled data can significantly improve the model’s performance.
Use Pretrained Embeddings: If accuracy is still low, you can switch to using pretrained embeddings like Word2Vec or GloVe for better feature representation.