# Building a Bag-of-Words Model with the IMDb Movie Reviews Dataset using Random Forest


This document walks through building a Bag-of-Words (BoW) model for sentiment analysis on the IMDb Movie Reviews Dataset and training a Random Forest classifier on the resulting features. We use Python with pandas and the Scikit-learn library.

## Step 1: Importing Required Libraries

Start by importing the necessary Python libraries.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
```

## Step 2: Loading the Dataset

The IMDb dataset can be loaded from a local file or through a library such as TensorFlow or PyTorch. Here, we assume it is stored in a CSV file named `imdb_dataset.csv` with two columns: `review` (the review text) and `sentiment` (the label).

```python
# Load the dataset
df = pd.read_csv('imdb_dataset.csv')

# Assuming the dataset has two columns: 'review' and 'sentiment'
print(df.head())
```
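Before going further, it can help to confirm that the file really has the assumed layout. The following is a minimal sketch, not part of the original workflow: `check_reviews_frame` is a hypothetical helper, and `sample_df` is a tiny made-up frame standing in for the real CSV.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV, using the same two assumed columns.
sample_df = pd.DataFrame({
    "review": ["Great film, loved it.", "Terrible plot.", "Great film, loved it.", None],
    "sentiment": ["positive", "negative", "positive", "negative"],
})


def check_reviews_frame(frame):
    """Validate the assumed schema, then drop missing and duplicate reviews."""
    missing = {"review", "sentiment"} - set(frame.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    cleaned = frame.dropna(subset=["review", "sentiment"])
    cleaned = cleaned.drop_duplicates(subset="review")
    return cleaned.reset_index(drop=True)


clean_df = check_reviews_frame(sample_df)

# Class balance matters when interpreting accuracy later on.
label_counts = clean_df["sentiment"].value_counts()
print(label_counts)
```

On the real data, the same helper could be called as `df = check_reviews_frame(df)` right after `pd.read_csv`.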

## Step 3: Preprocessing the Data

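Text preprocessing matters a great deal in NLP tasks, but to keep this walkthrough simple we skip custom cleaning and only split the data into training (80%) and test (20%) sets. By default, `CountVectorizer` already lowercases the text and ignores punctuation when it tokenizes.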

```python
# Splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df['review'], df['sentiment'], test_size=0.2, random_state=42
)
```
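One optional refinement is a stratified split, which keeps the proportion of positive and negative reviews the same in both subsets. The sketch below uses a made-up ten-row frame (`toy_df`) purely for illustration; the `stratify` argument is a standard `train_test_split` parameter, but using it here is a suggestion rather than part of the steps above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up balanced data: 5 positive and 5 negative reviews.
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": ["positive"] * 5 + ["negative"] * 5,
})

# stratify keeps the label proportions identical in the train and test subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df["review"],
    toy_df["sentiment"],
    test_size=0.2,
    random_state=42,
    stratify=toy_df["sentiment"],
)

print(y_tr.value_counts())
print(y_te.value_counts())
```

With a balanced dataset the difference is small, but stratification guards against an unlucky split when classes are skewed.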

## Step 4: Creating the Bag-of-Words Model

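We use Scikit-learn's `CountVectorizer` to convert the reviews into a sparse matrix of token counts, with one row per review and one column per vocabulary word. The vocabulary is learned from the training data only; the test data is then transformed with the already-fitted vectorizer so that both matrices share the same columns.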

```python
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the training data to a BoW model
X_train_bow = vectorizer.fit_transform(X_train)

# Transform the test data to the same BoW model
X_test_bow = vectorizer.transform(X_test)
```

## Step 5: Training a Random Forest Classifier

With the BoW features in place, we can train a classifier. Here, we use a Random Forest with 100 trees; fixing `random_state` makes the run reproducible.

```python
# Initialize the Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier
clf.fit(X_train_bow, y_train)
```
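As an alternative way to organize Steps 4 and 5, the vectorizer and the classifier can be chained in a Scikit-learn pipeline, so that raw text goes in and predictions come out. This is a sketch under assumptions: the reviews and labels below are made up, and the forest is kept small (`n_estimators=10`) only to keep the example fast.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Made-up training data for illustration only.
toy_reviews = [
    "a wonderful and moving film",
    "wonderful acting and a great story",
    "a dull and boring film",
    "boring story and terrible acting",
]
toy_labels = ["positive", "positive", "negative", "negative"]

# The pipeline fits the vectorizer on the training text, then the forest on the counts.
pipe = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=10, random_state=42),
)
pipe.fit(toy_reviews, toy_labels)

# New raw text is vectorized with the fitted vocabulary before prediction.
preds = pipe.predict(["a wonderful story", "terrible and dull"])
print(preds)
```

Bundling both stages this way makes it harder to accidentally fit the vectorizer on test data, and the whole object can be passed to tools such as cross-validation.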

## Step 6: Evaluating the Model

Finally, we evaluate the classifier's performance on the test data.

```python
# Predicting the sentiment for test data
y_pred = clf.predict(X_test_bow)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Detailed classification report
print(classification_report(y_test, y_pred))
```
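Accuracy alone hides which class the errors fall on. A confusion matrix makes that visible; the sketch below uses hand-written labels (`y_true`, `y_hat`) rather than real model output, so the numbers are illustrative only.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels standing in for y_test and y_pred.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_hat = ["positive", "negative", "negative", "negative", "positive"]

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
labels = ["negative", "positive"]
cm = confusion_matrix(y_true, y_hat, labels=labels)
acc = accuracy_score(y_true, y_hat)

print(cm)
print(f"Accuracy: {acc:.2f}")
```

Here one positive review is misclassified as negative, which appears in the bottom-left cell. On the real data the same call would be `confusion_matrix(y_test, y_pred)`.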
