# Airbnb Listings

## Problem Statement

The aim of this project

- different strategies:
- different models:
- classification algorithms:

## Configuration
We import all the required packages.

In [3]:
!pip install pyspellchecker
!pip install vaderSentiment
!pip install contractions
!pip install gensim
!pip install -U deep-translator

!pip install torch == 2.0.0+cu118 torchvision == 0.15.1+cu118 torchaudio == 2.0.1 --index-url https: // download.pytorch.org/whl/cu118
!pip install transformers requests beautifulsoup4

Collecting vaderSentiment
  Using cached vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
[31mERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: '/opt/conda/lib/python3.11/site-packages/vaderSentiment/__init__.py'
Consider using the `--user` option or check the permissions.
[0m[31m
[31mERROR: Invalid requirement: '=='[0m[31m


In [10]:
# Utility
import numpy as np
import pandas as pd
%matplotlib inline
from matplotlib import style

style.use('ggplot')
import os
import random

# Extra tools for data preprocessing
import nltk

nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from bs4 import BeautifulSoup
import re
from nltk.stem.snowball import SnowballStemmer
from collections import OrderedDict
from langdetect import detect
from deep_translator import GoogleTranslator

# Vader
import vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Downloader for Glove Word Embedding
import gensim
import gensim.downloader as gloader

# WordCloud
from wordcloud import WordCloud, STOPWORDS

# Contractions
import contractions

# Sklearn
from sklearn.base import BaseEstimator, TransformerMixin

# Pytorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

#####################
#  REPRODUCIBILITY  #
#####################

# Seed value
SEED_VALUE = 42

# 1. Set `PYTHONHASHSEED` environment variable at a fixed value
os.environ['PYTHONHASHSEED'] = str(SEED_VALUE)

# 2. Set `python` built-in pseudo-random generator at a fixed value
random.seed(SEED_VALUE)

# 3. Set `numpy` pseudo-random generator at a fixed value
np.random.seed(SEED_VALUE)

import warnings

warnings.filterwarnings("ignore", category=UserWarning, module="bs4")

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [11]:
bert_base_multilingual_tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
bert_base_multilingual_model = AutoModelForSequenceClassification.from_pretrained(
    'nlptown/bert-base-multilingual-uncased-sentiment')

## The Airbnb Unlisting Dataset
The dataset provided for this project is a subset of the Airbnb Unlisting Dataset. It contains information about Airbnb properties in Lisbon, Portugal, and their status (listed or unlisted) in the first quarter of 2018.
The dataset does not provide a sentiment label assigned to each review (*positive* or *negative*); to address this problem we can use the following strategies:
- use **VADER** (Valence Aware Dictionary and sEntiment Reasoner)
- use **BERT** (Bidirectional Encoder Representations from Transformers)

But first let's take a look at the data.

### Dataset structure

__The data is divided in following sets:__

* __Train (train.xlsx) (12,496 lines):__

Contains the Airbnb and host descriptions (“description” and “host_about” columns), as well as the information regarding the property listing status (“unlisted” column). A property is considered unlisted (1) if it got removed from the quarterly Airbnb list and it is considered listed (1) if it remains on that same list.
- **index** (numerical): unique identifier associated to the Airbnb property
- **description** (text): Airbnb description
- **host_about** (text): host description
- **unlisted** (binary): 1 if the property is unlisted, 0 otherwise

**ex:** {"**index**": "1", "**description**": "This is a shared mixed room in our hostel, with shared bathroom.`<br />`We are located right across the street from subway station Parque, we are 5 min walk to Marques de Pombal square.`<br /><br />`...", "**host_about**": "Alojamento Local Registro: 20835/AL", "**unlisted**": "0"}

--------
* __Train Reviews (train_reviews.xlsx) (72,1402):__

This file has all the guests’ comments made to each Airbnb property. Note that there can be more than one comment per property, not all properties have comments, and comments can appear in many languages!
- **index** (numerical): unique identifier associated to the Airbnb property
- **comments** (text): guest comment

**ex:** {"**index**": "1", "**comments**": "The host canceled this reservation 2 days before arrival. This is an automated posting."}

--------
* __Test (test.xlsx) (1,389 lines):__

The structure of this dataset is the same as the train set, except that it does not contain the “unlisted” column. The teaching team is keeping this information secret! You are expected to provide the predicted status (0 or 1) for each Airbnb in this set. Once the projects are delivered, we will compare your predictions with the actual (true) labels.

* __Test Reviews (test_reviews.xlsx) (80,877):__

The structure of this dataset is the same as the train reviews set, but the comments correspond to the properties present on the test set

### Data Import

It starts with data

In [12]:
current_directory = os.getcwd()

corpora_train = pd.read_excel(os.path.join(f"{current_directory}/corpora", 'train.xlsx'))
corpora_train_review = pd.read_excel(os.path.join(f"{current_directory}/corpora", 'test_reviews.xlsx'))
corpora_test = pd.read_excel(os.path.join(f"{current_directory}/corpora", 'test.xlsx'))
corpora_test_review = pd.read_excel(os.path.join(f"{current_directory}/corpora", 'test_reviews.xlsx'))

### Data Inspection

In this section we are going to briefly inspect the data.

**Done in the other notebook**


### Data Understanding

Before working on our dataset, we want to explore it to find out interesting insights and irregularities.

We will look at some features and try to find out interesting facts and patterns from them.
**This Section done in the other notebook|**

### Text Preprocessing

During the **Data Understanding** phase we have already removed some noisy data. Let's continue the exploration of data and perform some preprocessing.

#### Review Preprocessing
In this section we are going to preprocess the **reviews**.

We had to define three types of preprocessing.

In particular:
* The first preprocessing pipeline will be used to extract the `clean_text` to be used later.
* The second preprocessing pipeline will be used to extract the `clean_text` to be exploited in **VADER-labeling**.
* The second preprocessing pipeline will be used to extract the `clean_text` to be exploited in **BERT-labeling**.
* The third preprocessing pipeline will be used to extract the `clean_text` to be exploited in the construction of the **Dense Word Embedding**.

The order of execution within the pipelines is important.


**Pipeline 1**:
* strip html
* strip text
* remove stopwords
* replace special characters
* filter out uncommon symbols
* combine whitespace
* lower text
* stemming


**Pipeline 2**:
* strip html
* strip text

**Pipeline 3**:
* strip html
* strip text
* remove stopwords
* replace special characters
* filter out uncommon symbols
* combine whitespace
* lower text
* expand contractions

**Pipeline 4**:
* strip html
* strip text
* remove stopwords
* replace special characters
* filter out uncommon symbols
* combine whitespace
* lower text
* expand contractions

---------------------------

***NOTES ABOUT PIPELINE 2***:

As you can see from the lists above, in the second pipeline (compared to the first):
- we do not remove the punctuation and special characters
- we do not lower the text
- we do not perform stemming
- we do not remove the stopwords

We need to use this pipeline because Bert assigns a polarity in reference to the following as well:

- Punctuation
- Capitalization
- Conjunctions
- Preceding Tri-gram
- Emojis, Slangs, and Emoticons
---------------------------

***NOTES ABOUT PIPELINE 3***:

As you can see from the lists above, in the third pipeline (compared to the first):
- we do not perform stemming
- we expand contractions

We need to use this pipeline because otherwise we would have too much OOV terms (**Glove-300**, the pretrained word embedding that we will use, is trained on non-stemmed words).


In [39]:
# Check first three reviews
for i in corpora_train_review['comments'][0:3]:
    print(i, '\n')

Thank you very much Antonio ! All has been perfect during our stay, and the appartment is perfectly located in your fabulous city. We would love to visit you again next time :)_x000D_<br/> 

Very nice appartment in the old town of Lissabon, quite central but still calm in a small lane. No traffic noises etc.! There was enough space for 6 people, everything was clean, kitchen full equipped. Nice contact with the owner. Recommended! 

When travelling we're looking for kids friendly places to stay, and Antonios place was such a place. It's spacious and well equipped._x000D_<br/>_x000D_<br/>He's friendly mother was at the apartment to greet us and she had made ready a baby bed, a high chair and bought cookies,fruit and buns. Very nice._x000D_<br/>_x000D_<br/>The apartment had a hint of damp smell upon arriving, but after we have had the heaters on for some time it disappeared. So stay in the apartment for more than 15 minuttes._x000D_<br/>_x000D_<br/>The neighborhood is nice and we found g

#### Reviews Sentiment text classification

There are many words that include not, like needn't. These words are key parts of emotional analysis, so we will remove them from stopwords.

In [17]:
# List of preserved negations
NEG_LIST = ['nor',
            'no',
            'needn',
            'weren',
            'hasn\'t',
            'isn\'t',
            'wasn',
            'don\'t',
            'couldn\'t',
            'don',
            'hasn',
            'won\'t',
            'must',
            'didn',
            'can\'t'
            'haven\'t',
            'weren\'t ',
            'didn\'t',
            'mustn\'t',
            'wouldn\'t',
            'doesn\'t',
            'needn\'t',
            'wasn\'t',
            'aren\'t',
            'couldn',
            'isn',
            'dosen',
            'shouldn\'t',
            'mightn',
            'mightn\'t',
            'not',
            'never'
            'aren',
            "aren't",
            'couldn',
            "couldn't",
            'didn',
            "didn't",
            'doesn',
            "doesn't",
            'hadn',
            "hadn't",
            'hasn',
            "hasn't",
            'haven',
            "haven't",
            'isn',
            "isn't",
            'ma',
            'mightn',
            "mightn't",
            'mustn',
            "mustn't",
            'needn',
            "needn't",
            'shan',
            "shan't",
            'shouldn',
            "shouldn't",
            'wasn',
            "wasn't",
            'weren',
            "weren't",
            'won',
            "won't",
            'wouldn',
            "wouldn't"]

##### Preprocessing Pipelines

In [18]:
train_reviews_df = corpora_train_review.copy()

train_reviews_df['comments'] = train_reviews_df['comments'].astype(str)  # Convert to string

In [19]:
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;:.#+_?!"$%&@~]\t')
GOOD_SYMBOLS_RE = re.compile('[^0-9a-zA-Z ]')
COMBINE_WHITESPACE = re.compile(r"\s+")

translator = GoogleTranslator(source='auto', target='en')

class CleanText:

    # Translation
    def translate(self, text):
        """
        Translates text to english
        """
        translated = translator.translate(text)
        return str(translated)

    # STOPWORDS
    def stopwords(self, text):
        """Removes english stopwords"""
        stop_words = set(nltk.corpus.stopwords.words('english'))
        stop_words_no_neg = stop_words - set(NEG_LIST)
        text = ' '.join([word for word in text.split() if word not in stop_words_no_neg])
        return text

    # HTML TAGS
    def strip_html(self, text):
        """
        Removes any HTML tags
        """
        soup = BeautifulSoup(text, "html.parser")
        return soup.get_text()

    # STRIP TEXT
    def strip_text(self, text):
        """
        Removes any left or right spacing (including carriage return) from text.
        Example:
        Input: '  This assignment is cool\n'
        Output: 'This assignment is cool'
        """
        return text.strip()

    # REPLACE SPECIAL CHARACTER
    def replace_special_characters(self, text):
        """
        Replaces special characters, such as paranthesis,
        with spacing character
        """

        return REPLACE_BY_SPACE_RE.sub('', text)

    # FILTER UNCOMMON SYMBOLS
    def filter_out_uncommon_symbols(self, text):
        """
        Removes any special character that is not in the
        good symbols list (check regular expression)
        """
        return GOOD_SYMBOLS_RE.sub(' ', text)

    # COMBINE WHITESPACE
    def combine_whitespace(self, text):
        """
        Removes multiple white-spaces from text.
        Example:
        Input: 'This    assignment is    cool'
        Output: 'This assignment is cool'
        """
        return COMBINE_WHITESPACE.sub(" ", text).strip()

    # LOWER TEXT
    def lower(self, text):
        """
        Transforms given text to lower case.
        Example:
        Input: 'I really like New York city'
        Output: 'i really like new york city'
        """
        return text.lower()

    # STEMMING
    def stemming(self, text):
        """
        Remove the suffixes from the words to
        get the root form of the word
        Example:
        Input: 'Wording'
        Output: 'Word'
        """
        stemmer = SnowballStemmer('english')
        text = ' '.join(stemmer.stem(token) for token in nltk.word_tokenize(text))
        return text

    # CONTRACTIONS
    def expand_contractions(self, input_text):
        """ Transform contracted words into their standard form. """
        return contractions.fix(input_text)

    def fit(self, X, y=None, **fit_params):
        return self

    # PIPELINE 1
    def transform(self, X, **transform_params):
        clean_X = X.apply(self.strip_html) \
            .apply(self.strip_text) \
            .apply(self.stopwords) \
            .apply(self.replace_special_characters) \
            .apply(self.filter_out_uncommon_symbols) \
            .apply(self.combine_whitespace) \
            .apply(self.lower) \
            .apply(self.stemming)

        return clean_X

    # PIPELINE 2
    def transform_vader(self, X, **transform_params):
        clean_X = X.apply(self.strip_html) \
            .apply(self.strip_text) \
            #.apply(self.translate)

        return clean_X

    # PIPELINE 3
    def transform_embedding(self, X, **transform_params):
        clean_X = X.apply(self.strip_html) \
            .apply(self.strip_text) \
            .apply(self.stopwords) \
            .apply(self.replace_special_characters) \
            .apply(self.filter_out_uncommon_symbols) \
            .apply(self.combine_whitespace) \
            .apply(self.lower) \
            .apply(self.expand_contractions)

        return clean_X

    # PIPELINE 4
    def transform_bert(self, X, **transform_params):
        clean_X = X.apply(self.strip_html) \
            .apply(self.strip_text) \
            .apply(self.stopwords) \
            .apply(self.replace_special_characters) \
            .apply(self.filter_out_uncommon_symbols) \
            .apply(self.combine_whitespace) \
            .apply(self.lower) \
            .apply(self.expand_contractions)

        return clean_X


##### Clean Text Pipeline

In [20]:
%%time

ct = CleanText()
train_reviews_df['clean_text'] = ct.fit(train_reviews_df.comments).transform(train_reviews_df.comments)

  soup = BeautifulSoup(text, "html.parser")


CPU times: user 30.5 s, sys: 207 ms, total: 30.7 s
Wall time: 30.7 s


In [21]:
train_reviews_df.head()

Unnamed: 0,index,comments,clean_text
0,1,Thank you very much Antonio ! All has been per...,thank much antonio all perfect stay appart per...
1,1,Very nice appartment in the old town of Lissab...,veri nice appart old town lissabon quit centra...
2,1,When travelling we're looking for kids friendl...,when travel we re look kid friend place stay a...
3,1,We've been in Lisbon in march 2013 (3 adults a...,we ve lisbon march 2013 3 adult 3 children the...
4,1,Our host Antonio was very helpful with informa...,our host antonio help inform lissabon he pick ...


##### VADER Clean Text Pipeline

In [26]:
%%time

# Vader requires a different preprocessing pipeline
ct = CleanText()
train_reviews_df['VADER_clean_text'] = ct.fit(train_reviews_df.comments).transform_vader(train_reviews_df.comments)

  soup = BeautifulSoup(text, "html.parser")


CPU times: user 2.13 s, sys: 0 ns, total: 2.13 s
Wall time: 2.13 s


##### BERT Clean Text Pipeline

In [25]:
%%time

# BERT requires a different preprocessing pipeline
ct = CleanText()
train_reviews_df['BERT_clean_text'] = ct.fit(train_reviews_df.comments).transform_bert(train_reviews_df.comments)

  soup = BeautifulSoup(text, "html.parser")


CPU times: user 9.1 s, sys: 749 ms, total: 9.84 s
Wall time: 9.86 s


##### Dense Embeddings Clean Text Pipeline

In [22]:
%%time
# Dense Embeddings require a different preprocessing pipeline
ct = CleanText()
train_reviews_df['embedding_clean_text'] = ct.fit(train_reviews_df.comments).transform_embedding(
    train_reviews_df.comments)

  soup = BeautifulSoup(text, "html.parser")


CPU times: user 7.47 s, sys: 529 ms, total: 8 s
Wall time: 8 s


In [27]:
# Keep only those columns that will be used in from now on
train_reviews_df = train_reviews_df[
    ['index', 'clean_text', 'VADER_clean_text', 'BERT_clean_text', 'embedding_clean_text']]

In [28]:
train_reviews_df.head()

Unnamed: 0,index,clean_text,VADER_clean_text,BERT_clean_text,embedding_clean_text
0,1,thank much antonio all perfect stay appart per...,Thank you very much Antonio ! All has been per...,thank much antonio all perfect stay appartment...,thank much antonio all perfect stay appartment...
1,1,veri nice appart old town lissabon quit centra...,Very nice appartment in the old town of Lissab...,very nice appartment old town lissabon quite c...,very nice appartment old town lissabon quite c...
2,1,when travel we re look kid friend place stay a...,When travelling we're looking for kids friendl...,when travelling we re looking kids friendly pl...,when travelling we re looking kids friendly pl...
3,1,we ve lisbon march 2013 3 adult 3 children the...,We've been in Lisbon in march 2013 (3 adults a...,we ve lisbon march 2013 3 adults 3 children th...,we ve lisbon march 2013 3 adults 3 children th...
4,1,our host antonio help inform lissabon he pick ...,Our host Antonio was very helpful with informa...,our host antonio helpful information lissabon ...,our host antonio helpful information lissabon ...


In [32]:
# Save the cleaned dataframe
train_reviews_df.to_csv('corpora/train_reviews_cleaned_df.csv', index=False)

@TODO do same pipelines on test data

## Labeling

##### **VADER based**

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.

VADER uses a combination of a sentiment lexicon (which is a list of lexical features e.g., words) which are generally labelled according to their semantic orientation as either positive or negative.

VADER has been found to be quite successful when dealing with social media texts, NY Times editorials, movie reviews, and product reviews. This is because VADER not only tells about the Positivity and Negativity score but also tells us about how positive or negative a sentiment is.

VADER analyses sentiments primarily based on certain key points:
- **Punctuation**: The use of an exclamation mark(!), increases the magnitude of the intensity without modifying the semantic orientation. For example, “The food here is good!” is more intense than “The food here is good.” and an increase in the number of (!), increases the magnitude accordingly.
- **Capitalization**: Using upper case letters to emphasize a sentiment-relevant word in the presence of other non-capitalized words, increases the magnitude of the sentiment intensity. For example, “The food here is GREAT!” conveys more intensity than “The food here is great!
- **Degree modifiers**: Also called intensifiers, they impact the sentiment intensity by either increasing or decreasing the intensity. For example, “The service here is extremely good” is more intense than “The service here is good”, whereas “The service here is marginally good” reduces the intensity.
- **Conjunctions**: Use of conjunctions like “but” signals a shift in sentiment polarity, with the sentiment of the text following the conjunction being dominant. “The food here is great, but the service is horrible” has mixed sentiment, with the latter half dictating the overall rating.
- **Preceding Tri-gram**: By examining the tri-gram preceding a sentiment-laden lexical feature, we catch nearly 90% of cases where negation flips the polarity of the text. A negated sentence would be “The food here isn’t really all that great”.
- **Emojis, Slangs, and Emoticons**: VADER performs very well with emojis, slangs, and acronyms in sentences.

The Compound score is a metric that calculates the sum of all the lexicon ratings which have been normalized between -1(most extreme negative) and +1 (most extreme positive)

-----

**Strategy**

Convert the compound scores to a normalized sentiment scale ranging from 1 (less positive) to 5 (more positive), with 2.5 representing a neutral sentiment.

In [34]:
train_reviews_cleaned_df = pd.read_csv('corpora/train_reviews_cleaned_df.csv')

In [35]:
analyzer = SentimentIntensityAnalyzer()

def vader_sentiment_score(text):
    score = analyzer.polarity_scores(text)['compound']
    normalized_score = (score + 1) * 2.5 + 1
    return normalized_score

In [None]:
%%time

train_reviews_cleaned_df['VADER_sentiment'] = train_reviews_cleaned_df['VADER_clean_text'].apply(
    lambda x: vader_sentiment_score(x))

In [None]:
%%time

from tqdm import tqdm
import math

# Create a new column to store the VADER sentiment scores
train_reviews_cleaned_df['VADER_sentiment'] = ''

# Define the batch size
batch_size = 100

# Calculate the number of batches
num_batches = math.ceil(len(train_reviews_cleaned_df) / batch_size)


# Define the function for calculating VADER sentiment scores
def calculate_vader_sentiment_batch(batch_df):
    # Perform your VADER sentiment analysis here on the batch_df
    # You can modify this function to fit your specific implementation
    batch_df['VADER_sentiment'] = batch_df['VADER_clean_text'].apply(lambda x: vader_sentiment_score(x))
    return batch_df


# Iterate over the batches with tqdm for progress tracking
for i in tqdm(range(num_batches)):
    start_index = i * batch_size
    end_index = (i + 1) * batch_size
    batch_df = train_reviews_cleaned_df[start_index:end_index].copy()
    batch_df = calculate_vader_sentiment_batch(batch_df)
    train_reviews_cleaned_df[start_index:end_index] = batch_df

In [None]:
train_reviews_cleaned_df.head(10)

In [None]:
positive_num = len(train_reviews_cleaned_df[train_reviews_cleaned_df['VADER_sentiment'] >= 2.5])
negative_num = len(train_reviews_cleaned_df[train_reviews_cleaned_df['VADER_sentiment'] < 2.5])
positive_num, negative_num

In [None]:
train_reviews_cleaned_df.to_csv('corpora/train_reviews_vader_sentiment_df.csv', index=False)

##### **BERT based**

BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google.

##### Select 10 reviews for each index (hotel) as a sample

In [36]:
train_reviews_cleaned_df = pd.read_csv('corpora/train_reviews_cleaned_df.csv')

In [37]:
# Create an empty DataFrame to store the selected comments
selected_df = pd.DataFrame(columns=['index', 'BERT_clean_text'])

# Iterate over each unique index
for index in train_reviews_cleaned_df['index'].unique():
    # Filter the DataFrame to get comments for the current index
    comments = train_reviews_cleaned_df[train_reviews_cleaned_df['index'] == index].head(10)

    # Select only the 'index' and 'BERT_clean_text' columns
    selected_comments = comments[['index', 'BERT_clean_text']]

    # Concatenate the selected comments to the new DataFrame
    selected_df = pd.concat([selected_df, selected_comments])

# Reset the index of the new DataFrame
selected_df.reset_index(drop=True, inplace=True)

selected_df.head(20)

Unnamed: 0,index,BERT_clean_text
0,1,thank much antonio all perfect stay appartment...
1,1,very nice appartment old town lissabon quite c...
2,1,when travelling we re looking kids friendly pl...
3,1,we ve lisbon march 2013 3 adults 3 children th...
4,1,our host antonio helpful information lissabon ...
5,1,very nice place be large clean apartment x000d...
6,1,everything great antonio mother margarida good...
7,1,a comfortable clean nice flat pleasant sunny t...
8,1,s jour id al nous avons t accueillis tr s chal...
9,1,i spent great time this place


In [38]:
# Save the cleaned dataframe
train_reviews_cleaned_df.to_csv('corpora/train_reviews_sentiment_df.csv', index=False)
selected_df.to_csv('corpora/samples_train_reviews_sentiment_df.csv', index=False)

##### Sentiment Analysis with BERT

In [None]:
samples_train_reviews_sentiment = pd.read_csv('corpora/samples_train_reviews_sentiment_df.csv')

In [None]:
def bert_score(review):
    if not review:
        return 2  # Set score to 1 if the review string is empty

    tokens = bert_base_multilingual_tokenizer.encode_plus(
        review,
        return_tensors='pt',
        padding=True,
        truncation=True,
        max_length=512  # Set the maximum length as per your requirement
    )
    input_ids = tokens['input_ids']
    attention_mask = tokens['attention_mask']

    with torch.no_grad():
        outputs = bert_base_multilingual_model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    predicted_label = torch.argmax(logits, dim=1) + 1

    return predicted_label.item()


def sentiment_score(reviews):
    predicted_labels = []
    for review in tqdm(reviews, desc="Processing reviews"):
        if not isinstance(review, str) or not review.strip():
            predicted_labels.append(2)  # Set score to 1 if the review is empty or not a string
        else:
            score = bert_score(review)
            predicted_labels.append(score)

    return predicted_labels


# Process sentiment scores for the reviews
sentiment_scores = sentiment_score(samples_train_reviews_sentiment['BERT_clean_text'])

# Assign sentiment scores to the DataFrame
samples_train_reviews_sentiment['BERT_sentiment'] = np.nan
samples_train_reviews_sentiment.loc[:len(sentiment_scores) - 1, 'BERT_sentiment'] = sentiment_scores

samples_train_reviews_sentiment.to_csv("samples_train_reviews_sentiment_final.csv", index=False)

#### **Sentiment Labels Comparison**
Here we want to discover if the two methods actually have associated a high percentage of different "sentiment" labels.

In the event that the two methods have produced almost the same labels, we will proceed with the use of a single dataset.

----------------------------
**Results**

I manually analyzed some of the reviews that got a different label through the two methodologies (Bert based and VADER based).

**Insights**

From my **non-exhaustive** exploration of the dataset, I was able to ascertain that, according to the feeling I have when reading the selected reviews, the label associated by VADER turns out to be accurate with English text. For example, reviews like "good, comfortable, etc .." obtained the label "1", while it didn't work with same accuracy with non-English text as VADER is primarily designed for sentiment analysis in English text. It relies on a pre-trained lexicon that contains sentiment scores for English words. Therefore, its performance may not be as accurate or reliable when applied to languages other than English.

That's why I decided to use BERT, which is a multilingual model, to label the reviews. The results are more accurate than VADER.

Furthermore, it must be said that this experiment was useful in experimenting with another technique and discovering the existence and use of VADER-Sentiment Analyzer.

#### Aggregate results

In [8]:
# Load the cleaned dataset
sample_test_reviews_sentiments_df = pd.read_csv('corpora/aggregated_sentiments_results.csv')

In [9]:
# Calculate the average sentiment score for each index
avg_scores = sample_test_reviews_sentiments_df.groupby('index')['BERT_sentiment'].mean()

# Create a new DataFrame with the average scores
sample_test_reviews_sentiments_df = pd.DataFrame({'index': avg_scores.index, 'avg_sentiment_score': avg_scores})

# Save the DataFrame
sample_test_reviews_sentiments_df.to_csv('corpora/hotels_test_sentiment_avg_scores_df.csv', index=False)

In [70]:
train_hotels_df = pd.read_excel('corpora/train.xlsx')
train_hotels_df['description'] = train_hotels_df['description'].astype(str)  # Convert to string

# Vader requires a different preprocessing pipeline
ct = CleanText()
train_hotels_df['desc_clean_text'] = ct.fit(train_hotels_df.description).transform(train_hotels_df.description)

  soup = BeautifulSoup(text, "html.parser")


In [71]:
train_hotels_df.head(10)

Unnamed: 0,index,description,host_about,unlisted,desc_clean_text
0,1,"This is a shared mixed room in our hostel, wit...",Alojamento Local Registro: 20835/AL,0,this share mix room hostel share bathroom we l...
1,2,"O meu espaço fica perto de Parque Eduardo VII,...","I am friendly host, and I will try to always b...",1,o meu espa o fica perto de parqu eduardo vii s...
2,3,Trafaria’s House is a cozy and familiar villa ...,"I am a social person liking to communicate, re...",1,trafaria s hous cozi familiar villa facil need...
3,4,"Apartamento Charmoso no Chiado, Entre o Largo ...",Hello!_x000D_\nI m Portuguese and i love to me...,0,apartamento charmoso no chiado entr largo carm...
4,5,Joli appartement en bordure de mer.<br /> 2 m...,Nous sommes une famille avec deux enfants de 1...,0,joli appart en bordur de mer 2 min pie de la p...
5,6,"IMPORTANT: In response to COVID-19, this prope...","Hi, we are Homing - a company that develops it...",0,import in respons covid 19 properti extend cle...
6,7,This is my home that I rent out when I'm trave...,Globe trotter. I'm of Portuguese nationality w...,1,this home i rent i m travel perfect vacat with...
7,8,Find tranquility in this meticulously curated ...,I travel a lot and I love it. _x000D_\nOrigina...,0,find tranquil meticul curat lifestyl space the...
8,9,Charming apartment with one bedroom with doubl...,"Isabel & Helder, portugueses, parents of three...",0,charm apart one bedroom doubl bed doubl sofa b...
9,10,Walk up original wooden stairs to the entrance...,Serviced holiday apartments casa in Azenhas do...,0,walk origin wooden stair entranc apart bath li...


#### Aspects Extraction

In [None]:
import pandas as pd
import torch
from transformers import BertTokenizer, BertForTokenClassification

# Assuming you have a DataFrame called 'df' with a 'hotel_desc' column

# Preprocess the text
preprocessed_text = train_hotels_df['desc_clean_text']

# Load BERT tokenizer and encode the text
encoded_inputs = bert_base_multilingual_tokenizer(preprocessed_text, padding=True, truncation=True, return_tensors='pt')

# Load pre-trained BERT model for token classification
model = bert_base_multilingual_model

# Forward pass through the model
outputs = model(**encoded_inputs)

# Extract predicted amenity labels
predicted_labels = torch.argmax(outputs.logits, dim=2).squeeze()

# Count the number of predicted amenities
num_amenities = (predicted_labels == 1).sum().item()

# Print the number of amenities mentioned in the hotel description
print("Number of amenities mentioned:", num_amenities)
