# __Predicting Airbnb Unlisting Project__ <img align='right' width='150' height='150' src='https://maiseducativa.com/wp-content/uploads/2015/02/Logo_Nova-IMS.jpg'>

## <font color='SeaGreen'>__Text Mining__</font><br>

> __Group O__ composed by:
>> __Daniel Franco, nº20210719__ <p>
>> __João Malho, nº20220696__ <p>

***

## 🧬 __Introduction__

__Your solution should present the following points:__

__1. Data Exploration (1.5 points):__

Here you should analyze the corpora and provide some conclusions and visual information (bar charts, word clouds, etc.) that contextualize the data.

__2. Data Preprocessing (2 points):__

You must apply a method to split your training corpus into train/validation sets to evaluate the performance of your model (you can also resort to KFold cross validation, or other methods). Moreover, you must correctly implement and experiment at least four (4) of the data preprocessing techniques shown in class (stop words, regular expressions, lemmatization, stemming, etc.).

__3. Feature Engineering (5 points):__

You must correctly implement and experiment with two (2) of the feature engineering techniques seen in class (TF-IDF, GloVe embeddings, etc.).

__4. Classification Models (4.5 points):__

You must correctly implement and test three (3) of the classification algorithms seen in class (KNN, LR, MLP, LSTM, etc.). 

__5. Evaluation (1.5 points):__

You must evaluate your models resorting, at least, to Recall, Precision, Accuracy and F1-Score.
Moreover, the development of extra work (more techniques than the minimum required in the previous points and/or techniques not shown in class) is highly recommended and will account for a maximum of 4.5 points divided as follows:
    
>__1. Data Preprocessing__ – 0.25 points for each extra method (unseen in class) used (maximum of 2 extra methods).

>__2. Feature Engineering__ – 1 point for each extra method using Transformed-based embeddings (maximum of 2 extra methods)

>__3. Classification Models__ – 1 point for each extra


## 📖 __Glossary__

__The data is divided in following sets:__

* __Train (train.xlsx) (12,496 lines):__

> Contains the Airbnb and host descriptions (“description” and “host_about” columns), as well as the information regarding the property listing status (“unlisted” column). A property is considered unlisted (1) if it got removed from the quarterly Airbnb list and it is considered listed (1) if it remains on that same list.

* __Train Reviews (train_reviews.xlsx) (72,1402):__ 

> This file has all the guests’ comments made to each Airbnb property. Note that there can be more than one comment per property, not all properties have comments, and comments can appear in many languages!

* __Test (test.xlsx) (1,389 lines):__

> The structure of this dataset is the same as the train set, except that it does not contain the “unlisted” column. The teaching team is keeping this information secret! You are expected to provide the predicted status (0 or 1) for each Airbnb in this set. Once the projects are delivered, we will compare your predictions with the actual (true) labels.

* __Test Reviews (test_reviews.xlsx) (80,877):__ 

> The structure of this dataset is the same as the train reviews set, but the comments correspond to the properties present on the test set

***

## 📈 __Methodology__


***
## 👨🏻‍💻 __Code Changes__



### __1. Libraries and Data import__

In [1]:
#!pip install wordcloud
#!pip install langdetect
#!pip install googletrans

import re
import requests
import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm
import plotly.express as px
from langdetect import detect
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
import plotly.graph_objects as go
from googletrans import Translator
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud, STOPWORDS 
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

#nltk.download('stopwords')
#nltk.download('punkt')
#nltk.download('wordnet')

In [2]:
corpora_test = pd.read_excel('test.xlsx') 
corpora_train = pd.read_excel('train.xlsx') 
corpora_test_review = pd.read_excel('test_reviews.xlsx')
corpora_train_review = pd.read_excel('train_reviews.xlsx')

FileNotFoundError: [Errno 2] No such file or directory: 'test.xlsx'

## 1. Data Exploration 
* data presentation and explanation of the main finding from the exploratory analysis (accounts for 50% of criteria 4.1).

In [3]:
# Check data information
corpora_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1389 entries, 0 to 1388
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   index        1389 non-null   int64 
 1   description  1389 non-null   object
 2   host_about   1389 non-null   object
dtypes: int64(1), object(2)
memory usage: 32.7+ KB


In [4]:
corpora_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12496 entries, 0 to 12495
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   index        12496 non-null  int64 
 1   description  12496 non-null  object
 2   host_about   12496 non-null  object
 3   unlisted     12496 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 390.6+ KB


In [5]:
corpora_test_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80877 entries, 0 to 80876
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   index     80877 non-null  int64 
 1   comments  80877 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.2+ MB


In [6]:
corpora_train_review.head(10)

Unnamed: 0,index,comments
0,1,this is a very cozy and comfortable house to s...
1,1,good<br/>
2,1,"My first hostel experience, and all I have to ..."
3,1,Das Hostel war neu und deshalb funktionierte a...
4,1,"It was fine for a dorm, but I think for the pe..."
5,1,Our stay in Lisbon Tip Hostel was very good. T...
6,1,Close to shops in town and a comfortable place...
7,1,Young and friendly staff. Great location along...
8,1,The place is just off the Parque metro stop wh...
9,1,Had a nice stay at this hostel. The beds were ...


In [7]:
corpora_train.head(2)

# Hotels names, user names, numbers

Unnamed: 0,index,description,host_about,unlisted
0,1,"This is a shared mixed room in our hostel, wit...",Alojamento Local Registro: 20835/AL,0
1,2,"O meu espaço fica perto de Parque Eduardo VII,...","I am friendly host, and I will try to always b...",1


In [8]:
# Calculate the percentage for each category
total = len(corpora_train)
corpora_train_counts = corpora_train['unlisted'].value_counts()
corpora_train_percentages = 100 * corpora_train_counts / total

# Create the horizontal bar chart
fig = go.Figure(data=[
    go.Bar(
        y=corpora_train_percentages.index,
        x=corpora_train_percentages.values,
        orientation='h',
        marker=dict(color='rgb(121, 157, 196)')
    )
])

# Add percentage labels to each bar
for i, percentage in enumerate(corpora_train_percentages.values):
    fig.add_annotation(
        x=percentage + 2,
        y=corpora_train_percentages.index[i],
        text=f'{percentage:.1f}%',
        showarrow=False,
        font=dict(size=12, color='black')
    )

# Set the layout
fig.update_layout(
    title='Count of Unlisted',
    xaxis_title='Percentage',
    yaxis_title='Unlisted',
    yaxis=dict(autorange="reversed"),
    height=400,
    width=800,
    margin=dict(l=100, r=20, t=60, b=20)
)

# Show the plot
fig.show()


In [9]:
# detect language
def detect_textlang(text):
    try:
        src_lang = detect(text)
        if src_lang =='en':
            return 'en'
        else:
        #return "NA"    
            return src_lang
    except:
        return "NA"
corpora_train['description_language']=corpora_train.description.apply(detect_textlang)
#corpora_train_review['comments_language']=corpora_train_review.comments.apply(detect_textlang)

In [10]:
# Calculate the percentage for each category
total = len(corpora_train)
corpora_train_counts = corpora_train['description_language'].value_counts()
corpora_train_percentages = 100 * corpora_train_counts / total

# Create the horizontal bar chart
fig = go.Figure(data=[
    go.Bar(
        y=corpora_train_percentages.index,
        x=corpora_train_percentages.values,
        orientation='h',
        marker=dict(color='rgb(121, 157, 196)')
    )
])

# Add percentage labels to each bar
for i, percentage in enumerate(corpora_train_percentages.values):
    fig.add_annotation(
        x=percentage + 2,
        y=corpora_train_percentages.index[i],
        text=f'{percentage:.1f}%',
        showarrow=False,
        font=dict(size=12, color='black')
    )

# Set the layout
fig.update_layout(
    title='Languages in Train',
    xaxis_title='Percentage',
    yaxis_title='Language',
    yaxis=dict(autorange="reversed"),
    height=400,
    width=800,
    margin=dict(l=100, r=20, t=60, b=20)
)

# Show the plot
fig.show()



__After data analysis can be noted that:__

> File `corpora_train_review`:
__Features__:

1. __[index]__ - (Int) Index correspondency with Index in file `corpora_train`
2. __[comments]__ - (Str) Each comment per line


> File `corpora_test_review`:
__Features__:

1. __[index]__ - (Int) Index correspondency with Index in file `corpora_test`
2. __[comments]__ - (Str) Each comment per line


> File `corpora_train`:
__Features__:

1. __[index]__ - (Int) Index correspondency
2. __[description]__ - (Str) Description of the asset
3. __[host_about]__ - (Str) Description of the asset host
4. __[unlisted]__ - (Bool) Is asset unlisted or not 

> File `corpora_test`:
__Features__:

1. __[index]__ - (Int) Index correspondency
2. __[description]__ - (Str) Description of the asset
3. __[host_about]__ - (Str) Description of the asset host


Is also noted that train data is not balanced, model will be train with ~28% of unlisted individuals.

Our data contains several languages as well were the majority is 81,6% of English although there are more.

## 2. Data Preprocessing 
* explanation of the different preprocessing methods developed (accounts for 25% of criteria 4.Erro! A origem da referência não foi encontrada.).

__Data Cleaning:__

* __Lowercase text__
> Converting everything to lowercase reducing vocabulary size

* __Remove Numerical Data and Punctuation__
> Normalize Dates and Numbers

* __Remove Stop Words__
> Removing words that are not very informative. Less tokens same meaning.

* __Lemmatize__
> Turning words into their root word

* __Stemmer__
> Remove last few characters to obtain the shorter form of each word, this one is setted to False in order to not perform.

***

In [11]:
# Data Pre Processing
stop = stopwords.words('english')

def pre_process(text_list, lemmatize, stemmer):
    
    updates = []
    
    for j in tqdm(text_list):
        
        text = j
        
        # Lowercase text
        text = text.lower()
        
        # Remove tags
        text = re.sub("&lt;/?.*?&gt;", " &lt;&gt; ", text)
        
        # Remove special characters and digits
        text = re.sub("[^a-zA-Z]", " ", text)
        text = re.sub("(\\d|\\W)+", " ", text)
        text = re.sub('[0-9]+', '', text)
        text = re.sub(r'http\S+', '',text)
        
        # Remove stopwords
        text = " ".join([word for word in text.split() if word not in stop])
        
        # Lemmatize
        if lemmatize:
            lemma = WordNetLemmatizer()
            text = " ".join(lemma.lemmatize(word) for word in text.split())
        
        # Stemming
        if stemmer:
            stemmer = SnowballStemmer('english')
            text = " ".join(stemmer.stem(word) for word in text.split())
            
        updates.append(text)
        
    return updates

corpora_train['description_clean'] = pre_process(corpora_train['description'], lemmatize=True, stemmer=False)


100%|██████████| 12496/12496 [00:07<00:00, 1595.84it/s]


In [12]:
corpora_train['host_about_clean'] = pre_process(corpora_train['host_about'], lemmatize=True, stemmer=False)


100%|██████████| 12496/12496 [00:03<00:00, 3658.32it/s]


In [13]:
corpora_test['description_clean'] = pre_process(corpora_test['description'], lemmatize=True, stemmer=False)


100%|██████████| 1389/1389 [00:00<00:00, 1989.67it/s]


In [14]:
corpora_test['host_about_clean'] = pre_process(corpora_test['host_about'], lemmatize=True, stemmer=False)


100%|██████████| 1389/1389 [00:00<00:00, 3738.91it/s]


In [15]:
corpora_test_review['comments_clean'] = pre_process(corpora_test_review['comments'], lemmatize=True, stemmer=False)


100%|██████████| 80877/80877 [00:14<00:00, 5721.94it/s]


In [18]:
corpora_train.head(5)

Unnamed: 0,index,description,host_about,unlisted,description_language,description_clean,host_about_clean
0,1,"This is a shared mixed room in our hostel, wit...",Alojamento Local Registro: 20835/AL,0,en,shared mixed room hostel shared bathroom br lo...,alojamento local registro al
1,2,"O meu espaço fica perto de Parque Eduardo VII,...","I am friendly host, and I will try to always b...",1,pt,meu espa fica perto de parque eduardo vii sald...,friendly host try always around need anything ...
2,3,Trafaria’s House is a cozy and familiar villa ...,"I am a social person liking to communicate, re...",1,en,trafaria house cozy familiar villa facility ne...,social person liking communicate reading trave...
3,4,"Apartamento Charmoso no Chiado, Entre o Largo ...",Hello!_x000D_\nI m Portuguese and i love to me...,0,pt,apartamento charmoso chiado entre largo carmo ...,hello x portuguese love meet people around wor...
4,5,Joli appartement en bordure de mer.<br /> 2 m...,Nous sommes une famille avec deux enfants de 1...,0,fr,joli appartement en bordure de mer br min pied...,nous somme une famille avec deux enfants de et...


In [17]:
# Retorn error probably due the size or strange word
# corpora_train_review['comments_clean'] = pre_process(corpora_train_review['comments'], lemmatize=True, stemmer=False)


1. Were created new features according to each column in each file.

2. New Features regard a sufix in their name as __`"_clean"`__.

3. The new features regard the new data already __Tokenized__ (process of splitting a text into individual words or tokens), in __Lowercase__ (no Uppercase Letters), __without special characters, digits, tags and stopwords__ and __Lemmatized__ (word in their root word). 

## 3. Feature Engineering
* description of the methods implemented (accounts for 25% of
criteria 4.3)

* The oject in study will be file __`"corpora_train"`__ which is the one that will be used for model train

* Feature Importance 
    > Term Frequency 
    
    TF = (Number of times the word appears in the document) / (Total number of words in the document)

    > Inverse Doc. Frequency

    IDF = log((Total number of documents in the corpus) / (Number of documents containing the word))

In [None]:
# #Translate to English
# from googletrans import Translator
# def translate_text(lang,text):
#     translator= Translator()
#     trans_text = translator.translate(text, src=lang).text
#     return trans_text

# corpora_train['translated_text']=corpora_train.apply(lambda x: x.description_clean if x.text_lang == 'en' else translate_text(x.text_lang, x.description_clean), axis=1)
# corpora_train.translated_text = corpora_train.translated_text.str.lower()

In [34]:
# Label
y = np.array(corpora_train['unlisted'])

In [24]:
# 1 Gram
word_tfidf = TfidfVectorizer(max_df=0.8, ngram_range=(1, 1))
X_word = word_tfidf.fit_transform(corpora_train["description_clean"])

In [23]:
# N Gram
ngram_tfidf = TfidfVectorizer(max_df=0.8, ngram_range=(1, 3))
X_ngram = ngram_tfidf.fit_transform(corpora_train["description_clean"])

In [32]:
## 1 Gram Classifier 
knn_model = KNeighborsClassifier(n_neighbors = 10, metric = 'cosine', weights = 'distance')
knn_model.fit(X_word, y)

In [33]:
## N Gram Classifier
modelknn_ngram = KNeighborsClassifier(n_neighbors = 10, metric = 'cosine', weights = 'distance')
modelknn_ngram.fit(X_ngram,y)

### 1 Gram

In [36]:
X_test_word = word_tfidf.transform(corpora_train["description_clean"])

In [38]:
y_pred_word = knn_model.predict(X_test_word)

In [42]:
labels = {"Unlisted":1, "Keep Listed":0}
print(classification_report(y_pred_word, y_test, target_names = labels.keys()))


## Tem de ser a test

NameError: name 'y_test' is not defined

## 4. Classification Models
* description of the models implemented (accounts for 25% of criteria 4.4)