# Text Translation and Sentiment Analysis using Transformers

## Project Overview:

The objective of this project is to analyze the sentiment of movie reviews in three languages: English, French, and Spanish. The dataset covers 30 movies, 10 in each language, with their reviews and synopses stored in separate CSV files named `movie_reviews_eng.csv`, `movie_reviews_fr.csv`, and `movie_reviews_sp.csv`.

- The first step of this project is to convert the French and Spanish reviews and synopses into English. This will allow us to analyze the sentiment of all reviews in the same language. We will be using pre-trained transformers from HuggingFace to achieve this task.

- Once the translations are complete, we will have a single dataframe that contains all the movies from the three language files, along with their reviews, synopses, and year of release. This dataframe will be used to perform sentiment analysis on the reviews of each movie.

- Finally, we will use pretrained transformers from HuggingFace to analyze the sentiment of each review. The sentiment analysis results will be added to the dataframe. The final dataframe will have 30 rows.


The output of the project will be a CSV file with a header row containing the columns **Title**, **Year**, **Synopsis**, **Review**, **Sentiment**, and **Original Language**. The **Original Language** column indicates the language of the review and synopsis (*en/fr/sp*) before translation. The file will contain 30 data rows, one per movie.
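
As a rough sanity check on the output described above, the following sketch reads a CSV and reports structural problems. The helper name `check_output_csv` and the constants are hypothetical, and the column list is an assumption that follows the names the notebook's code writes (the sentiment column is stored as `Sentiment`); adjust it if your file differs.

```python
import csv
import io

# Assumed column names: these follow what the notebook's code writes.
EXPECTED_COLUMNS = ["Title", "Year", "Synopsis", "Review", "Original Language", "Sentiment"]
ALLOWED_LANGUAGES = {"en", "fr", "sp"}


def check_output_csv(stream, expected_rows=30):
    """Return a list of problems found in the CSV read from `stream` (empty if none)."""
    reader = csv.DictReader(stream)
    header = reader.fieldnames or []
    problems = []

    # Every expected column must appear in the header row.
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        problems.append(f"missing columns: {missing}")

    # One data row per movie.
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    # Language codes should be limited to en/fr/sp.
    seen = {row.get("Original Language") or "" for row in rows}
    unexpected = sorted(seen - ALLOWED_LANGUAGES)
    if unexpected:
        problems.append(f"unexpected language codes: {unexpected}")
    return problems


# Tiny in-memory example with two rows instead of 30.
sample = io.StringIO(
    "Title,Year,Synopsis,Review,Original Language,Sentiment\n"
    "Movie A,2000,synopsis,review,en,Positive\n"
    "Movie B,2001,synopsis,review,fr,Negative\n"
)
print(check_output_csv(sample, expected_rows=2))  # prints []
```

To check a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object as `stream`.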

In [2]:
!pip install -U jupyter ipywidgets

Collecting jupyter
  Downloading jupyter-1.1.1-py2.py3-none-any.whl (2.7 kB)
Collecting ipywidgets
  Downloading ipywidgets-8.1.5-py3-none-any.whl (139 kB)
Collecting jupyter-console
  Downloading jupyter_console-6.6.3-py3-none-any.whl (24 kB)
Collecting jupyterlab
  Downloading jupyterlab-4.3.0-py3-none-any.whl (11.7 MB)
Collecting widgetsnbextension~=4.0.12
  Downloading widgetsnbextension-4.0.13-py3-none-any.whl (2.3 MB)
Collecting jupyterlab-widgets~=3.0.12
  Downloading jupyterlab_widgets-3.0.13-py3-none-any.whl (214 kB)

Collecting h11<0.15,>=0.13
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Collecting babel>=2.10
  Downloading babel-2.16.0-py3-none-any.whl (9.6 MB)
Collecting json5>=0.9.0
  Downloading json5-0.9.25-py3-none-any.whl (30 kB)
Installing collected packages: widgetsnbextension, tomli, jupyterlab-widgets, json5, h11, babel, async-lru, httpcore, httpx, ipywidgets, jupyter-console, jupyterlab-server, jupyter-lsp, jupyterlab, jupyter
ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: '/opt/venv/lib/python3.10/site-packages/widgetsnbextension'
Check the permissions.

[notice] A new release of pip is available: 23.0.1 ->

In [1]:
# imports
import pandas as pd
from transformers import MarianMTModel, MarianTokenizer
from transformers import pipeline



### Get data from `.csv` files and then preprocess data

In [13]:
def preprocess_data() -> pd.DataFrame:
    df_eng = pd.read_csv("data/movie_reviews_eng.csv")
    df_fr = pd.read_csv("data/movie_reviews_fr.csv")
    df_sp = pd.read_csv("data/movie_reviews_sp.csv")

    df_eng.columns = ['Title', 'Year', 'Synopsis', 'Review']
    df_fr.columns = ['Title', 'Year', 'Synopsis', 'Review']
    df_sp.columns = ['Title', 'Year', 'Synopsis', 'Review']

    df_eng['Original Language'] = 'en'
    df_fr['Original Language'] = 'fr'
    df_sp['Original Language'] = 'sp'

    df = pd.concat([df_eng, df_fr, df_sp], ignore_index=True)

    return df


df = preprocess_data()
print(df.head())


                       Title  Year  \
0  The Shawshank Redemption   1994   
1           The Dark Knight   2008   
2               Forrest Gump  1994   
3             The Godfather   1972   
4                  Inception  2010   

                                            Synopsis  \
0  Andy Dufresne (Tim Robbins), a successful bank...   
1  Batman (Christian Bale) teams up with District...   
2  Forrest Gump (Tom Hanks) is a simple man with ...   
3  Don Vito Corleone (Marlon Brando) is the head ...   
4  Dom Cobb (Leonardo DiCaprio) is a skilled thie...   

                                              Review Original Language  
0  "The Shawshank Redemption is an inspiring tale...                en  
1  "The Dark Knight is a thrilling and intense su...                en  
2  "Forrest Gump is a heartwarming and inspiratio...                en  
3  "The Godfather is a classic movie that stands ...                en  
4  "Inception is a mind-bending and visually stun...                e

In [14]:
df.sample(10)

| | Title | Year | Synopsis | Review | Original Language |
|---|---|---|---|---|---|
| 17 | Astérix aux Jeux Olympiques | 2008 | Dans cette adaptation cinématographique de la ... | "Ce film est une déception totale. Les blagues... | fr |
| 7 | The Nice Guys | 2016 | In 1970s Los Angeles, a private eye (Ryan Gosl... | "The Nice Guys tries too hard to be funny, and... | en |
| 28 | Torrente: El brazo tonto de la ley | 1998 | En esta comedia española, un policía corrupto ... | "Torrente es una película vulgar y ofensiva qu... | sp |
| 11 | Intouchables | 2011 | Ce film raconte l'histoire de l'amitié improba... | "Intouchables est un film incroyablement touch... | fr |
| 4 | Inception | 2010 | Dom Cobb (Leonardo DiCaprio) is a skilled thie... | "Inception is a mind-bending and visually stun... | en |
| 9 | The Island | 2005 | In a future where people are cloned for organ ... | "The Island is a bland and forgettable sci-fi ... | en |
| 22 | Y tu mamá también | 2001 | Dos amigos adolescentes (Gael García Bernal y ... | "Y tu mamá también es una película que se qued... | sp |
| 13 | Les Choristes | 2004 | Ce film raconte l'histoire d'un professeur de ... | "Les Choristes est un film magnifique qui vous... | fr |
| 21 | La Casa de Papel | (2017-2021) | Esta serie de televisión española sigue a un g... | "La Casa de Papel es una serie emocionante y a... | sp |
| 15 | Le Dîner de Cons | 1998 | Le film suit l'histoire d'un groupe d'amis ric... | "Je n'ai pas aimé ce film du tout. Le concept ... | fr |


### Text translation

Translate the **Review** and **Synopsis** column values to English.

In [15]:
fr_en_model_name = 'Helsinki-NLP/opus-mt-fr-en'
es_en_model_name = 'Helsinki-NLP/opus-mt-es-en'

fr_en_model = MarianMTModel.from_pretrained(fr_en_model_name)
es_en_model = MarianMTModel.from_pretrained(es_en_model_name)
fr_en_tokenizer = MarianTokenizer.from_pretrained(fr_en_model_name)
es_en_tokenizer = MarianTokenizer.from_pretrained(es_en_model_name)



In [16]:
def translate(text: str, model, tokenizer) -> str:
    # Encode the text using the tokenizer
    inputs = tokenizer(text, return_tensors='pt')

    # Generate the translation, capping the output at 512 newly generated tokens
    outputs = model.generate(**inputs, max_new_tokens=512)  # max_new_tokens replaces max_length

    # Decode the generated output and return the translated text
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded


text_fr = "Bonjour tout le monde!"
translation_fr = translate(text_fr, fr_en_model, fr_en_tokenizer)
print(f"Translation from French to English: {translation_fr}")

text_es = "¡Hola, mundo!"
translation_es = translate(text_es, es_en_model, es_en_tokenizer)
print(f"Translation from Spanish to English: {translation_es}")


Translation from French to English: Hello, everybody!
Translation from Spanish to English: Hello, world!


In [17]:
# Filter reviews in French and translate to English
dataframe = preprocess_data()
fr_reviews = dataframe[dataframe['Original Language'] == 'fr']['Review'].tolist()
fr_reviews_en = [translate(review, fr_en_model, fr_en_tokenizer) for review in fr_reviews]

# Filter synopsis in French and translate to English
fr_synopsis = dataframe[dataframe['Original Language'] == 'fr']['Synopsis'].tolist()
fr_synopsis_en = [translate(synopsis, fr_en_model, fr_en_tokenizer) for synopsis in fr_synopsis]

# Filter reviews in Spanish and translate to English
es_reviews = dataframe[dataframe['Original Language'] == 'sp']['Review'].tolist()
es_reviews_en = [translate(review, es_en_model, es_en_tokenizer) for review in es_reviews]

# Filter synopsis in Spanish and translate to English
es_synopsis = dataframe[dataframe['Original Language'] == 'sp']['Synopsis'].tolist()
es_synopsis_en = [translate(synopsis, es_en_model, es_en_tokenizer) for synopsis in es_synopsis]

# Update dataframe with translated text
df_fr_idx = dataframe[dataframe['Original Language'] == 'fr'].index
df_sp_idx = dataframe[dataframe['Original Language'] == 'sp'].index

dataframe.loc[df_fr_idx, 'Review'] = fr_reviews_en
dataframe.loc[df_fr_idx, 'Synopsis'] = fr_synopsis_en
dataframe.loc[df_sp_idx, 'Review'] = es_reviews_en
dataframe.loc[df_sp_idx, 'Synopsis'] = es_synopsis_en
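
The filter-translate-assign steps in the cell above repeat for each language and column. One possible way to fold them into a single helper is sketched below. The name `translate_columns` and its parameters are hypothetical and not part of the notebook; `translate_fn` stands for any `str -> str` function, for example a lambda wrapping the notebook's `translate` with a specific model and tokenizer. The demo uses `str.upper` as a stand-in so that it runs without downloading a model.

```python
import pandas as pd


def translate_columns(df, lang_code, columns, translate_fn):
    """Return a copy of `df` in which `columns` are passed through
    `translate_fn` for rows whose 'Original Language' equals `lang_code`."""
    out = df.copy()
    mask = out["Original Language"] == lang_code
    for col in columns:
        # Only the selected rows are rewritten; other rows stay untouched.
        out.loc[mask, col] = out.loc[mask, col].map(translate_fn)
    return out


# Stand-in data and a stand-in "translator" (upper-casing), for illustration only.
demo = pd.DataFrame(
    {
        "Review": ["bon film", "good film"],
        "Synopsis": ["résumé", "summary"],
        "Original Language": ["fr", "en"],
    }
)
translated = translate_columns(demo, "fr", ["Review", "Synopsis"], str.upper)
print(translated["Review"].tolist())  # ['BON FILM', 'good film']
```

With the notebook's objects, a call might look like `translate_columns(dataframe, 'fr', ['Review', 'Synopsis'], lambda t: translate(t, fr_en_model, fr_en_tokenizer))`. Returning a copy keeps the original dataframe available for comparison.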

In [20]:
dataframe.sample(10)

| | Title | Year | Synopsis | Review | Original Language |
|---|---|---|---|---|---|
| 26 | Toc Toc | 2017 | In this Spanish comedy, a group of people with... | "Toc Toc is a boring and unoriginal film that ... | sp |
| 24 | Amores perros | 2000 | Three stories intertwine in this Mexican film:... | "Amores dogs is an intense and moving film tha... | sp |
| 0 | The Shawshank Redemption | 1994 | Andy Dufresne (Tim Robbins), a successful bank... | "The Shawshank Redemption is an inspiring tale... | en |
| 22 | Y tu mamá también | 2001 | Two teenage friends (Gael García Bernal and Di... | "And your mom is also a movie that stays with ... | sp |
| 18 | Les Visiteurs en Amérique | 2000 | In this continuation of the French comedy The ... | "The film is a total waste of time. The jokes ... | fr |
| 16 | La Tour Montparnasse Infernale | 2001 | Two incompetent office workers find themselves... | "I can't believe I've wasted time watching thi... | fr |
| 5 | Blade Runner 2049 | 2017 | Officer K (Ryan Gosling), a new blade runner f... | "Boring and too long. Nothing like the origina... | en |
| 23 | El Laberinto del Fauno | 2006 | During the Spanish postwar period, Ofelia (Iva... | "The Labyrinth of Fauno is a fascinating and e... | sp |
| 7 | The Nice Guys | 2016 | In 1970s Los Angeles, a private eye (Ryan Gosl... | "The Nice Guys tries too hard to be funny, and... | en |
| 14 | Le Fabuleux Destin d'Amélie Poulain | 2001 | This romantic comedy tells the story of Amélie... | "The Fabulous Destiny of Amélie Poulain is an ... | fr |


### Sentiment Analysis

Use a pretrained HuggingFace model to analyze the sentiment of each review, and store the result (**Positive** or **Negative**) in a new dataframe column titled **Sentiment**.

In [19]:
from transformers import pipeline

# Load sentiment analysis model
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
sentiment_classifier = pipeline('sentiment-analysis', model=model_name)



In [24]:
def analyze_sentiment(text, classifier):
    """
    Function to perform sentiment analysis on a text using a model.
    """
    result = classifier(text)[0]
    sentiment = 'Positive' if result['label'] == 'POSITIVE' else 'Negative'
    return sentiment


dataframe['Sentiment'] = dataframe['Review'].apply(lambda review: analyze_sentiment(review, sentiment_classifier))

dataframe.to_csv('result/reviews_with_sentiment.csv', index=False)
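
The cell above calls the classifier once per review. A HuggingFace text-classification pipeline also accepts a list of strings and returns one result dict per input, so the label mapping can be done over a whole column in one call. The sketch below illustrates that shape. `label_reviews` and `fake_classifier` are hypothetical names, and `fake_classifier` is a keyword-matching stand-in used only so the example runs without a model; it is not a sentiment model.

```python
def label_reviews(reviews, classifier):
    """Map each review to 'Positive' or 'Negative'.

    `classifier` is any callable that takes a list of strings and returns one
    dict per string with a 'label' key (the shape a HuggingFace
    text-classification pipeline returns for list input).
    """
    results = classifier(list(reviews))
    return ["Positive" if r["label"] == "POSITIVE" else "Negative" for r in results]


def fake_classifier(texts):
    # Stand-in for illustration only: flags a text as positive if it contains "good".
    return [
        {"label": "POSITIVE" if "good" in t.lower() else "NEGATIVE", "score": 1.0}
        for t in texts
    ]


print(label_reviews(["A good film.", "Boring and too long."], fake_classifier))
# ['Positive', 'Negative']
```

With the notebook's objects this would read `dataframe['Sentiment'] = label_reviews(dataframe['Review'], sentiment_classifier)`.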

In [26]:
!jupyter nbconvert --to html *.ipynb

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[NbConvertApp] Converting notebook project.ipynb to html
[NbConvertApp] Writing 322060 bytes to project.html
