<a href="https://colab.research.google.com/github/javlonravshanov/nlp_movie_review_classification/blob/main/movie_review_classifier_model_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Movie Review Classification using FastAI

This notebook demonstrates the process of building a movie review classification model using FastAI.
The model will classify reviews as positive or negative based on their content.
We'll use the IMDb dataset for training and evaluation.

### Steps involved:
1. **Dataset Loading**: Fetching the IMDb dataset from FastAI.
2. **Text Preprocessing**: Preparing the text for the language model.
3. **Language Model Training**: Fine-tuning a pretrained language model on movie reviews.
4. **Classification Model Training**: Fine-tuning the language model's encoder for sentiment analysis.
5. **Evaluation**: Testing the model on unseen data.

---



## Step 1: Dataset Loading

We are using the IMDb dataset, a large collection of movie reviews for binary sentiment classification (positive/negative).
FastAI hosts a copy of this dataset, which is widely used for text classification tasks.

FastAI makes it easy to load the dataset with the `untar_data()` function, which downloads and extracts the dataset if it is not already present locally.
The dataset contains text files organized into `train`, `test`, and `unsup` (unlabeled) folders; `train` and `test` each hold `pos` and `neg` subfolders.


In [None]:
# Step 1: Load the IMDb dataset from FastAI's built-in collection.
from fastai.text.all import *
path = untar_data(URLs.IMDB)


## Step 2: Text Preprocessing

Before training the model, we need to preprocess the text data. This involves tokenization and numericalization:

- **Tokenization**: Splitting the text into individual tokens (words, punctuation, etc.).
- **Numericalization**: Converting tokens into numerical values, which the model can process.

FastAI provides `TextBlock` for text preprocessing, which automatically handles tokenization and numericalization. We use the `get_text_files()` function to retrieve text files from the dataset folders.
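
To make the two preprocessing steps concrete, here is a minimal pure-Python sketch of tokenization and numericalization. It is a simplified stand-in, not FastAI's tokenizer: the functions `tokenize`, `build_vocab`, and `numericalize` are hypothetical helpers, and only the special tokens `xxbos`, `xxmaj`, and `xxunk` are borrowed from the output shown later in this notebook.

```python
import re
from collections import Counter

# Special tokens modeled on those visible in the notebook's batches.
BOS, UNK, MAJ = "xxbos", "xxunk", "xxmaj"

def tokenize(text):
    """Split text into lowercase word and punctuation tokens (simplified)."""
    tokens = [BOS]  # mark the beginning of a document
    for tok in re.findall(r"\w+|[^\w\s]", text):
        if tok[0].isupper() and tok[1:].islower():
            tokens.append(MAJ)  # remember that the next word was capitalized
        tokens.append(tok.lower())
    return tokens

def build_vocab(token_lists, min_freq=1):
    """Return a list of tokens; a token's position is its numeric id."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    specials = [UNK, BOS, MAJ]
    words = [t for t, c in counts.most_common()
             if c >= min_freq and t not in specials]
    return specials + words

def numericalize(tokens, vocab):
    """Map each token to its id; unseen tokens map to the id of xxunk."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, index[UNK]) for tok in tokens]

tokens = tokenize("This movie was great!")
vocab = build_vocab([tokens])
ids = numericalize(tokens, vocab)
unseen = numericalize(tokenize("Terrible film"), vocab)

print(tokens)
print(ids)
print(unseen)
```

Words that were not in the vocabulary when it was built all collapse to the same `xxunk` id, which is why `xxunk` shows up in the batches displayed later.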


In [None]:
# Step 2: Retrieve text files from the dataset (train, test, and unsupervised folders).
files = get_text_files(path, folders=['train', 'test', 'unsup'])

In [None]:
# Open a sample file and display its first 75 characters.
txt = files[0].read_text()
txt[:75]


## Step 3: Language Model Training

A **language model** is trained to predict the next word in a sequence. Here we start from FastAI's pretrained `AWD_LSTM` and fine-tune it on the movie reviews, so the model adapts to the vocabulary and style of this text.

Once the language model is fine-tuned, we can reuse it for classification. This step matters because the language model captures contextual information from the reviews, which is intended to improve performance on downstream tasks like sentiment classification.

We use FastAI's `DataBlock` API to create the data loaders for the language model.


In [None]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks = TextBlock.from_folder(path, is_lm=True),
    get_items = get_imdb, splitter = RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

In [None]:
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos xxmaj as embarrassing as it is to admit , i was listed as production manager on this film … my very first ! xxmaj as a matter of fact , it was the first feature film for almost everyone who participated . xxmaj watch carefully , and you even get to see me in one of the opening scenes , as a soon - to - be - murdered asylum attendant named … "" xxunk "" ( my own","xxmaj as embarrassing as it is to admit , i was listed as production manager on this film … my very first ! xxmaj as a matter of fact , it was the first feature film for almost everyone who participated . xxmaj watch carefully , and you even get to see me in one of the opening scenes , as a soon - to - be - murdered asylum attendant named … "" xxunk "" ( my own last"
1,"a lot of amazing laughs . xxmaj on top of everything , the biggest star are the special effects -- amazing and so important to the movie . xxbos i really do n't have any complaints about this movie , except for the disturbing scenes with the body . i fell upon it while switching around the tv one night . xxmaj the acting was actually amazing , i did n't expect it to be better than it appeared !","lot of amazing laughs . xxmaj on top of everything , the biggest star are the special effects -- amazing and so important to the movie . xxbos i really do n't have any complaints about this movie , except for the disturbing scenes with the body . i fell upon it while switching around the tv one night . xxmaj the acting was actually amazing , i did n't expect it to be better than it appeared ! i"
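
In the batch above, the `text_` column is the `text` column shifted by one token: at each position the target is simply the next token. A minimal sketch of that pairing follows, assuming a single flat token stream; `lm_pairs` is a hypothetical helper, not part of FastAI.

```python
def lm_pairs(tokens, seq_len):
    """Cut a token stream into (input, target) windows offset by one token."""
    pairs = []
    for start in range(0, len(tokens) - seq_len, seq_len):
        x = tokens[start : start + seq_len]          # what the model sees
        y = tokens[start + 1 : start + seq_len + 1]  # what it should predict
        pairs.append((x, y))
    return pairs

stream = ["xxbos", "i", "really", "liked", "this", "movie", ".", "xxbos", "great"]
pairs = lm_pairs(stream, seq_len=4)

for x, y in pairs:
    print(x, "->", y)
```

Because the target is derived from the text itself, the unlabeled `unsup` reviews can be used for this stage as well.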


In [None]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()


In [None]:
learn.fit_one_cycle(1, 2e-2)

In [None]:
learn.save('/content/1epoch')

In [None]:
learn = learn.load('/content/1epoch')

In [None]:
learn.unfreeze()
learn.fit_one_cycle(4, 2e-3)

In [None]:
learn.save_encoder('/content/finetuned')

In [None]:
learn = learn.load_encoder('/content/finetuned')


In [None]:
TEXT = "I didn't like this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75)
         for _ in range(N_SENTENCES)]

In [None]:
print("\n".join(preds))

i did n't like this movie because it seemed to have been made with international funding by Orlando Jones and Lucio Fulci . Another reviewer mentioned that there were n't any hero in the movie at all . They were playing
i did n't like this movie because of its low budget and that alone is n't enough to save it as an heather fan of Disney . 

 As far as Disney goes , it was n't as easy to pick up on a
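
The `temperature=0.75` argument controls how adventurous the sampled text is. The sketch below illustrates the general idea on a made-up three-word distribution: probabilities are raised to the power `1 / temperature` and renormalized before sampling. This is an illustration of temperature sampling under that assumption, not FastAI's implementation, and `apply_temperature` and `sample_token` are hypothetical helpers.

```python
import random

def apply_temperature(probs, temperature):
    """Rescale a distribution; temperature < 1 sharpens it, > 1 flattens it."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_token(vocab, probs, temperature, rng):
    """Draw one token after applying the temperature."""
    weights = apply_temperature(probs, temperature)
    return rng.choices(vocab, weights=weights, k=1)[0]

vocab = ["good", "bad", "boring"]   # made-up candidate next words
probs = [0.5, 0.3, 0.2]             # made-up model probabilities

sharp = apply_temperature(probs, 0.75)  # favors the most likely word more
flat = apply_temperature(probs, 2.0)    # spreads probability more evenly

rng = random.Random(0)
word = sample_token(vocab, probs, 0.75, rng)
print(sharp, flat, word)
```

Sampling, rather than always taking the most likely word, is why the two generated sentences above differ even though they start from the same prompt.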



## Step 4: Classification Model Training

After fine-tuning the language model, we reuse its encoder for classification. The classification model uses the fine-tuned encoder to classify movie reviews as **positive** or **negative**.

This is done by adding a classifier head on top of the encoder and training it on the labeled IMDb data. FastAI's `text_classifier_learner()` function simplifies this process by allowing us to create and fine-tune the model.


In [None]:
dls_class = DataBlock(
    blocks = (TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y = parent_label,
    get_items = partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)
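
In the cell above, `parent_label` takes each review's label from its parent folder, and `GrandparentSplitter(valid_name='test')` assigns each file to the training or validation set by its grandparent folder. The pure-Python sketch below mimics that behavior on a few made-up file paths; `label_from_parent` and `split_by_grandparent` are hypothetical helpers, not the FastAI functions themselves.

```python
from pathlib import PurePosixPath

def label_from_parent(path):
    """Use the name of the containing folder as the label."""
    return PurePosixPath(path).parent.name

def split_by_grandparent(paths, train_name="train", valid_name="test"):
    """Return (train indices, validation indices) based on the folder two levels up."""
    train_idx, valid_idx = [], []
    for i, p in enumerate(paths):
        grandparent = PurePosixPath(p).parent.parent.name
        if grandparent == train_name:
            train_idx.append(i)
        elif grandparent == valid_name:
            valid_idx.append(i)
    return train_idx, valid_idx

# Made-up file names that follow the train/test and pos/neg folder layout.
paths = [
    "imdb/train/pos/a.txt",
    "imdb/train/neg/b.txt",
    "imdb/test/pos/c.txt",
    "imdb/test/neg/d.txt",
]

labels = [label_from_parent(p) for p in paths]
train_idx, valid_idx = split_by_grandparent(paths)
print(labels, train_idx, valid_idx)
```

Note that with this split the IMDb `test` folder serves as the validation set, so the accuracy reported during training is measured on those reviews.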



In [None]:
dls_class.show_batch(max_n=3)

Unnamed: 0,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos xxmaj titanic directed by xxmaj james xxmaj cameron presents a fictional love story on the historical setting of the xxmaj titanic . xxmaj the plot is simple , xxunk , or not for those who love plots that twist and turn and keep you in suspense . xxmaj the end of the movie can be figured out within minutes of the start of the film , but the love story is an interesting one , however . xxmaj kate xxmaj winslett is wonderful as xxmaj rose , an aristocratic young lady betrothed by xxmaj cal ( billy xxmaj zane ) . xxmaj early on the voyage xxmaj rose meets xxmaj jack ( leonardo dicaprio ) , a lower class artist on his way to xxmaj america after winning his ticket aboard xxmaj titanic in a poker game . xxmaj if he wants something , he goes and gets it",pos
2,"xxbos xxrep 3 * xxmaj warning - this review contains "" plot spoilers , "" though nothing could "" spoil "" this movie any more than it already is . xxmaj it really xxup is that bad . xxrep 3 * \n\n xxmaj before i begin , xxmaj i 'd like to let everyone know that this definitely is one of those so - incredibly - bad - that - you - fall - over - laughing movies . xxmaj if you 're in a lighthearted mood and need a very hearty laugh , this is the movie for you . xxmaj now without further ado , my review : \n\n xxmaj this movie was found in a bargain bin at wal - mart . xxmaj that should be the first clue as to how good of a movie it is . xxmaj secondly , it stars the lame action",neg


In [None]:
learn = text_classifier_learner(dls_class, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()


In [None]:
learn = learn.load_encoder('/content/finetuned')


In [None]:
learn.fit_one_cycle(1, 2e-2)


epoch,train_loss,valid_loss,accuracy,time
0,0.290999,0.210447,0.91604,01:14


In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))


epoch,train_loss,valid_loss,accuracy,time
0,0.248308,0.192855,0.92512,01:20


In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))


epoch,train_loss,valid_loss,accuracy,time
0,0.211402,0.171274,0.9348,01:35


In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))


epoch,train_loss,valid_loss,accuracy,time
0,0.178325,0.165058,0.93656,01:52
1,0.157321,0.166849,0.93684,01:52
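
The `slice(lr/(2.6**4), lr)` arguments in the training cells above give the earliest layers a smaller learning rate than the classifier head. The sketch below shows one way such a range could be spread across layer groups, assuming geometric spacing and five parameter groups (both are assumptions, not confirmed by this notebook). `even_multiples` is a hypothetical helper written for this illustration.

```python
def even_multiples(start, stop, n):
    """Return n values from start to stop, each a constant multiple of the last."""
    if n == 1:
        return [stop]
    step = (stop / start) ** (1 / (n - 1))
    return [start * step**i for i in range(n)]

lr_max = 1e-2   # upper end of the slice used after freeze_to(-2)
n_groups = 5    # assumed number of parameter groups

lrs = even_multiples(lr_max / 2.6**4, lr_max, n_groups)
ratios = [later / earlier for earlier, later in zip(lrs, lrs[1:])]

for group, lr in enumerate(lrs):
    print(f"group {group}: lr = {lr:.6f}")
```

Under those assumptions each group trains 2.6 times faster than the one before it, which would explain the `2.6**4` divisor: four steps of 2.6 separate the first group from the last.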


In [None]:
learn.save('/content/final_nlp_model')

Path('/content/final_nlp_model.pth')

In [None]:
learn.validate()


(#2) [0.1668490618467331,0.9368399977684021]

In [None]:
import warnings

# Suppress all FutureWarning and UserWarning messages from this point on
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)



## Step 5: Evaluation

The classifier was validated on the IMDb `test` folder during training, where `learn.validate()` reported a loss of about 0.167 and an **accuracy** of about 93.7%. In this step we also spot-check the model on a few hand-written reviews it has never seen.

Evaluation helps us understand the model's generalization capability and identify areas for improvement.
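
For reference, accuracy is the fraction of reviews whose predicted label matches the true label. A minimal sketch on made-up predictions follows; the `accuracy` function here is a plain-Python illustration, not the FastAI metric of the same name.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the corresponding true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up example: three of four predictions are correct.
predicted = ["pos", "neg", "neg", "pos"]
actual = ["pos", "neg", "pos", "pos"]

score = accuracy(predicted, actual)
print(f"accuracy = {score:.2%}")
```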


In [None]:
# Example movie reviews
movie_reviews = [
    "An absolute masterpiece! The storyline was gripping from start to finish.",
    "I can't believe I wasted two hours of my life on this film. It was dreadful.",
    "The cinematography was stunning, but the plot was confusing and hard to follow.",
    "It was a fun movie to watch with the family, full of laughter and heartwarming moments.",
    "The movie had its moments, but overall it was pretty forgettable.",
    "I loved the acting, but the pacing was too slow for my taste.",
    "One of the worst movies I've ever seen. Don't waste your time."
]

for idx, review in enumerate(movie_reviews, 1):
    pred, _, probs = learn.predict(review)

    # Convert probabilities to percentages; the class vocab is sorted,
    # so index 0 is 'neg' and index 1 is 'pos'
    pos_prob = probs[1].item() * 100
    neg_prob = probs[0].item() * 100

    # Format probabilities and prediction
    result = f"{idx}. Review: {review}\n" \
             f"   Predicted Sentiment: {pred}\n" \
             f"   Probability: {pos_prob:.2f}% (Positive), {neg_prob:.2f}% (Negative)\n"

    print(result)


1. Review: An absolute masterpiece! The storyline was gripping from start to finish.
   Predicted Sentiment: pos
   Probability: 99.99% (Positive), 0.01% (Negative)



2. Review: I can't believe I wasted two hours of my life on this film. It was dreadful.
   Predicted Sentiment: neg
   Probability: 0.00% (Positive), 100.00% (Negative)



3. Review: The cinematography was stunning, but the plot was confusing and hard to follow.
   Predicted Sentiment: neg
   Probability: 0.41% (Positive), 99.59% (Negative)



4. Review: It was a fun movie to watch with the family, full of laughter and heartwarming moments.
   Predicted Sentiment: pos
   Probability: 99.94% (Positive), 0.06% (Negative)



5. Review: The movie had its moments, but overall it was pretty forgettable.
   Predicted Sentiment: neg
   Probability: 0.07% (Positive), 99.93% (Negative)



6. Review: I loved the acting, but the pacing was too slow for my taste.
   Predicted Sentiment: neg
   Probability: 2.23% (Positive), 97.77% (Negative)



7. Review: One of the worst movies I've ever seen. Don't waste your time.
   Predicted Sentiment: neg
   Probability: 0.00% (Positive), 100.00% (Negative)
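
The two percentages printed for each review always sum to 100% because the classifier's raw scores are normalized into probabilities. The sketch below shows that normalization with a softmax over two made-up scores; the score values and the `softmax` helper are illustrative assumptions, not values taken from the model.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["neg", "pos"]   # same order as in the loop above
scores = [2.0, -1.5]       # made-up raw scores for one review

probs = softmax(scores)
pred = classes[probs.index(max(probs))]

print(f"Predicted Sentiment: {pred}")
print(f"Probability: {probs[1] * 100:.2f}% (Positive), {probs[0] * 100:.2f}% (Negative)")
```

A large gap between the raw scores pushes one probability toward 100%, which is one reason the outputs above look so confident even for reviews with mixed sentiment.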




## Conclusion

In this notebook, we trained a movie review classification model using FastAI. We started by fine-tuning a pretrained language model on the IMDb reviews, then fine-tuned its encoder for sentiment classification.

### Key Takeaways:
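- **Language Model Pretraining**: Fine-tuning the language model on domain-specific text before classification is intended to improve classification performance; this notebook does not train a baseline without that step, so the size of the gain is not measured here.
- **FastAI Framework**: FastAI simplifies the process of building and fine-tuning NLP models.
- **Evaluation**: Our model reached about 93.7% accuracy on the IMDb `test` folder. In the spot checks, the three reviews with mixed sentiment (3, 5, and 6) were all labeled negative with more than 97% confidence, so further testing on nuanced reviews would be worthwhile.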

Future work might include experimenting with different architectures or hyperparameters to further improve the results.
