### Goal of data preprocessing

#### Data preparation
1. Replace the placeholder values "No Positive" and "No Negative" with empty strings ("").
2. Remove reviews whose positive and negative fields are both empty.
3. Convert all text to lowercase.
4. If the reviewer score is above 7.5, assign positive sentiment.
5. If the reviewer score is below 4.0, assign negative sentiment.
6. Keep only the reviews that were assigned a positive or a negative sentiment.
7. Combine the positive and negative review text into a new column called `review`.
8. Keep only the `review` text column and the `sentiment` column.
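
The labeling rule in steps 4 to 8 can be sketched in plain Python. This is only an illustration: the records and the field names `positive`, `negative`, and `score` are made up for the example and are not taken from the dataset. Note that both comparisons are strict, so a score of exactly 7.5 or 4.0 falls in the neutral band and is dropped.

```python
from typing import Optional

POSITIVE_THRESHOLD = 7.5  # scores strictly above this are labeled positive
NEGATIVE_THRESHOLD = 4.0  # scores strictly below this are labeled negative


def label_score(score: float) -> Optional[int]:
    """Return 1 for positive, 0 for negative, None for the neutral band."""
    if score > POSITIVE_THRESHOLD:
        return 1
    if score < NEGATIVE_THRESHOLD:
        return 0
    return None


# Toy records (hypothetical values, not taken from the dataset).
records = [
    {"positive": "great location", "negative": "", "score": 9.2},
    {"positive": "", "negative": "dirty room", "score": 3.3},
    {"positive": "ok breakfast", "negative": "small bed", "score": 6.0},
    {"positive": "nice staff", "negative": "", "score": 7.5},
]

labeled = []
for rec in records:
    sentiment = label_score(rec["score"])
    if sentiment is None:
        continue  # step 6: drop reviews without a sentiment label
    # step 7: combine both text fields into a single review string
    review = f'{rec["positive"]} {rec["negative"]}'.strip()
    # step 8: keep only the review text and its sentiment
    labeled.append({"review": review, "sentiment": sentiment})
```

Only the first two toy records survive: the third is in the neutral band and the fourth sits exactly on the 7.5 boundary.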

#### Word Embedding
1. Remove non-letter characters with a regular expression.
2. Tokenize the text using `word_tokenize` from NLTK.
3. Check that all words are English using a language-detection library (which one is still undecided).
4. Remove reviews that are empty or contain non-English words.
5. Embed the words using a word embedding, and pad the sequences so that every review has the same number of vectors.
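
The padding part of step 5 could look roughly like the sketch below. The embedding method itself is not chosen in this notebook, so the sketch stops at the stage just before it: tokens are mapped to integer ids and padded to a common length, which is the input an embedding lookup would typically expect. The helper names `build_vocab` and `encode_and_pad` and the reserved ids are assumptions made for this example.

```python
from typing import Dict, List

PAD_ID = 0  # reserved id used for padding (an assumption of this sketch)
UNK_ID = 1  # reserved id for tokens that are not in the vocabulary


def build_vocab(token_lists: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab) + 2  # ids 0 and 1 are reserved
    return vocab


def encode_and_pad(tokens: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map tokens to ids, truncate to max_len, and right-pad with PAD_ID."""
    ids = [vocab.get(token, UNK_ID) for token in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# Toy token lists standing in for the `tokens` column.
token_lists = [["great", "location"], ["the", "room", "was", "great"]]

vocab = build_vocab(token_lists)
max_len = max(len(tokens) for tokens in token_lists)
padded = [encode_and_pad(tokens, vocab, max_len) for tokens in token_lists]
```

Every row of `padded` now has length `max_len`, so after the embedding lookup each review would yield the same number of vectors.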



In [None]:
import pandas as pd

df = pd.read_csv("../data/Hotel_Reviews_Clean.csv")

# Normalize case and surrounding whitespace so the exact-match replacements below apply
df['Positive_Review'] = df['Positive_Review'].str.lower().str.strip()
df['Negative_Review'] = df['Negative_Review'].str.lower().str.strip()

# Blank out placeholder answers that carry no review text
df['Positive_Review'] = df['Positive_Review'].replace(["no positive", "everything"], "")
df['Negative_Review'] = df['Negative_Review'].replace(["no negative", "nothing", "none", "no"], "")

# Drop rows where both review fields are empty
df = df[~((df['Positive_Review'] == "") & (df['Negative_Review'] == ""))]


In [80]:
import dask.dataframe as dd
import numpy as np

# Use dask to calculate and apply sentiment

def assign_sentiment(score):
    """Return 1 (positive), 0 (negative), or NaN for scores in between."""
    if score > 7.5:
        return 1
    elif score < 4.0:
        return 0
    else:
        return np.nan

ddf = dd.from_pandas(df, npartitions=8)

# The result can contain NaN, so the column dtype is float64 rather than int64
ddf['sentiment'] = ddf['Reviewer_Score'].map(assign_sentiment, meta=('sentiment', 'float64'))

# Drop the reviews without a sentiment label
ddf = ddf[~ddf['sentiment'].isnull()]

# Combine both review fields into one text column
ddf['review'] = ddf['Positive_Review'].fillna('') + " " + ddf['Negative_Review'].fillna('')

ddf = ddf[['review', 'sentiment']]

In [81]:
ddf[['review', 'sentiment']].compute().to_csv('output.csv', index=False)

In [66]:
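import re
from nltk.tokenize import word_tokenize  # needs the NLTK tokenizer data to be downloaded


def tokenize_review(row):
    """Strip non-letter characters from a review and split it into tokens."""
    text = row['review']
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-letters
    text = text.lower()
    tokens = word_tokenize(text)
    return [t for t in tokens if t.strip()]


# Apply the tokenizer row by row within each partition
ddf['tokens'] = ddf.map_partitions(lambda part: part.apply(tokenize_review, axis=1), meta=('tokens', 'object'))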

In [None]:
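# Doesn't work as well as it should: running langdetect on single tokens
# discards many valid English words (compare tokens and language_tokens in the output below)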

# from langdetect import detect

# def filter_english_tokens(tokens):
#     clean_tokens = []
#     for token in tokens:
#         try:
#             if detect(token) == 'en':
#                 clean_tokens.append(token)
#         except Exception:
#             continue  # skip undetectable/garbage tokens
#     return clean_tokens

# ddf['language_tokens'] = ddf['tokens'].map(filter_english_tokens, meta=('language_tokens', 'object')) 

# ddf.compute()

| | review | sentiment | tokens | language_tokens |
|---|---|---|---|---|
| 0 | only the park outside of the hotel was beauti... | 0.0 | [only, the, park, outside, of, the, hotel, was... | [the, of, the, that, this, when, the, of, this... |
| 3 | great location in nice surroundings the bar a... | 0.0 | [great, location, in, nice, surroundings, the,... | [location, surroundings, the, and, and, the, b... |
| 7 | good location set in a lovely park friendly s... | 1.0 | [good, location, set, in, a, lovely, park, fri... | [location, food, high, oth, the, from, the, th... |
| 9 | the room was big enough and the bed is good t... | 1.0 | [the, room, was, big, enough, and, the, bed, i... | [the, enough, and, the, the, food, and, servic... |
| 10 | rooms were stunningly decorated and really sp... | 1.0 | [rooms, were, stunningly, decorated, and, real... | [and, really, the, top, of, the, building, of,... |
| ... | ... | ... | ... | ... |
| 395 | we loved our room especially the huge comfy b... | 1.0 | [we, loved, our, room, especially, the, huge, ... | [the, from] |
| 396 | the room was so luxurious very comfy bed and ... | 1.0 | [the, room, was, so, luxurious, very, comfy, b... | [the, and, character, the, shower, the] |
| 397 | lovely room taseful decor and artwork around ... | 1.0 | [lovely, room, taseful, decor, and, artwork, a... | [and, artwork, around, the, and, helpful, the,... |
| 398 | breakfast service one gruff check in member | 1.0 | [breakfast, service, one, gruff, check, in, me... | [service] |
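
In the output above, per-token detection drops common English words such as "great", "nice", and "room". One possible alternative, sketched here under assumptions, is to run the detector once on the whole review and keep or drop the review as a unit, which also matches step 4 of the plan. The function `is_english_review` is hypothetical, and the detector is passed in as a parameter so the example runs without extra packages. `toy_detect` is a stand-in for illustration only; in the notebook a real detector such as `langdetect.detect` could be passed instead, though whether it performs well enough on these reviews has not been tested here.

```python
from typing import Callable, List


def is_english_review(tokens: List[str], detect: Callable[[str], str]) -> bool:
    """Run the detector once on the joined review instead of once per token."""
    if not tokens:
        return False  # empty reviews are dropped (step 4 of the plan)
    try:
        return detect(" ".join(tokens)) == "en"
    except Exception:
        return False  # treat undetectable text as non-English


def toy_detect(text: str) -> str:
    """Stand-in detector for this example only: a tiny stop-word lookup."""
    english_markers = {"the", "and", "was", "in", "of"}
    return "en" if english_markers & set(text.split()) else "unknown"


reviews = [
    ["the", "room", "was", "big", "enough"],
    ["zimmer", "sehr", "gut"],
    [],
]
kept = [tokens for tokens in reviews if is_english_review(tokens, toy_detect)]
```

With dask, the same predicate could be applied to the `tokens` column to build a boolean mask, so whole reviews are filtered rather than individual words.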
