**Goal: Using the Kaggle Boston Airbnb dataset, I attempt to use sentiment analysis on written reviews to predict the numerical Airbnb rating. (SPOILER ALERT: It doesn't go very well)**

The Airbnb dataset has three CSV files: "calendar.csv", "listings.csv", and "reviews.csv". I first create an LSTM model in Keras and train it on the IMDB sentiment dataset. Then I use the trained model on the comments from "reviews.csv" to predict sentiment: either positive (1) or negative (0). Averaging these predictions, I check how closely the result correlates with the actual ratings, found in "listings.csv".

A couple of things to note: the initial sentiment analysis is a classification task, whereas predicting the rating is really a regression task. My intention is to see how closely the two correlate.
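To make the comparison concrete, here is a minimal sketch of the "average, then correlate" step on toy data. The column names (`listing_id`, `predicted_sentiment`, `id`, `review_scores_rating`) and the grouping by listing are assumptions for illustration, and the numbers are made up; this is not the notebook's actual result.

```python
import pandas as pd

# Toy stand-ins for the real data (values and column names are assumptions)
predictions = pd.DataFrame({
    'listing_id': [1, 1, 1, 2, 2, 3, 3, 3],
    'predicted_sentiment': [1, 1, 0, 1, 1, 0, 0, 1],
})
listings = pd.DataFrame({
    'id': [1, 2, 3],
    'review_scores_rating': [90.0, 98.0, 75.0],
})

# Average the 0/1 predictions per listing: the fraction of positive reviews
mean_sentiment = (predictions.groupby('listing_id')['predicted_sentiment']
                  .mean()
                  .rename('mean_sentiment'))

# Line up each listing's average with its actual rating
merged = listings.merge(mean_sentiment, left_on = 'id', right_index = True)

# Pearson correlation between average sentiment and rating
correlation = merged['mean_sentiment'].corr(merged['review_scores_rating'])
print(merged)
print('Correlation:', round(correlation, 3))
```

Averaging turns the binary classifier output into a continuous number per listing, which is what lets us compare it against a continuous rating at all.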

Also, since this is my first Kaggle submission, note that I really like functions. I primarily program in Spyder, so writing functions allows for easy and discrete code chunks, and makes debugging much easier.

In [1]:
import numpy as np
import pandas as pd
import os
import string
from sklearn.model_selection import train_test_split
import keras
from keras.models import Sequential, save_model, load_model
from keras.layers import Embedding, LSTM, Dense, Flatten
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import matplotlib.pyplot as plt
import datetime
import math

Using TensorFlow backend.


First, we load in the IMDB dataset. This is also available via `keras.datasets`, but it comes preprocessed and already encoded there. I also want to work on the preprocessing and cleaning myself, so I got it directly from the source (http://ai.stanford.edu/~amaas/data/sentiment/). The dataset has 50,000 reviews: 25,000 for the training set and 25,000 for the testing set. Each set is further divided into positive ('pos') and negative ('neg') reviews.

In [6]:
def clean_imdb(directory):
    '''
    Returns cleaned dataframe of IMDB reviews with columns ['review', 'sentiment']
    '''
    sentiment = {'neg': 0, 'pos': 1}
    df_columns = ['review', 'sentiment']
    # Collect rows in a list and build the dataframe once at the end;
    # DataFrame.append was removed in pandas 2.0 and is slow inside a loop
    rows = []
    for i in ('test', 'train'):
        for j in ('neg', 'pos'):
            file_path = os.path.join(directory, i, j)
            for file in os.listdir(file_path):
                with open(os.path.join(file_path, file), 'r',
                          encoding = 'utf-8') as text_file:
                    text = text_file.read()
                rows.append([text, sentiment[j]])
    return pd.DataFrame(rows, columns = df_columns)

directory = 'Data/IMDB/'
cleaned_imdb = clean_imdb(directory)

Let's take a look at what a single review with sentiment looks like:

In [10]:
cleaned_imdb.iloc[13]

review       Man with the Screaming Brain is a story of gre...
sentiment                                                    0
Name: 13, dtype: object

Next we will load in one of the GloVe word embeddings (https://nlp.stanford.edu/projects/glove/). A word embedding represents each word as a vector of numbers, which is what lets a model work with text numerically. There are several versions of GloVe, depending on how many dimensions (and therefore how much memory) you would like to use. I used one of the smaller ones: 50 dimensions, trained on a corpus of 6 billion tokens.
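As a rough back-of-the-envelope check on the memory trade-off, here is a small sketch. It assumes a vocabulary of 400,000 words stored as `float32` values and counts only the raw numbers, so the real footprint of a Python dictionary of arrays will be higher. `embedding_size_mb` is a hypothetical helper, not part of GloVe or Keras.

```python
import numpy as np

def embedding_size_mb(vocab_size, dims, dtype = 'float32'):
    '''
    Returns approximate size in MB of the raw embedding values
    (hypothetical helper; ignores dictionary and string overhead)
    '''
    return vocab_size * dims * np.dtype(dtype).itemsize / 1e6

# glove.6B is distributed in 50, 100, 200 and 300 dimensions
for dims in (50, 100, 200, 300):
    print(str(dims) + ' dimensions: about '
          + str(round(embedding_size_mb(400000, dims))) + ' MB')
```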

In [13]:
def load_GloVe(file_path):
    '''
    Loads word embedding .txt file
    Returns word embedding as dictionary
    '''
    GloVe_dict = dict()
    with open(file_path, encoding = 'utf-8') as GloVe_file:
        for line in GloVe_file:
            values = line.split()
            word = values[0]
            coef = np.asarray(values[1:], dtype = 'float32')
            GloVe_dict[word] = coef
    return GloVe_dict

GloVe_file_path = 'Data/glove.6B.50d.txt'
embedding_dict = load_GloVe(GloVe_file_path)

Let's check out this word embedding a bit. It has 400,000 entries, and every one of them is lowercase. We'll also look at a random slice of the `embedding_dict` keys.

In [14]:
len(embedding_dict)

400000

In [15]:
[word for word in embedding_dict if word != word.lower()]

[]

In [16]:
list(embedding_dict.keys())[100:125]

['so',
 'them',
 'what',
 'him',
 'united',
 'during',
 'before',
 'may',
 'since',
 'many',
 'while',
 'where',
 'states',
 'because',
 'now',
 'city',
 'made',
 'like',
 'between',
 'did',
 'just',
 'national',
 'day',
 'country',
 'under']

Let's check out how `'under'`, the last word in the list we just printed, is represented in vector space.

In [17]:
embedding_dict['under']

array([ 1.3721e-01, -2.9500e-01, -5.9160e-02, -5.9235e-01,  2.3010e-02,
        2.1884e-01, -3.4254e-01, -7.0213e-01, -5.5748e-01, -7.8537e-01,
        4.6417e-01,  4.4733e-01, -7.4178e-01, -4.6287e-01,  4.2665e-01,
        3.9795e-01, -2.1767e-01,  2.6260e-02, -3.1353e-01,  7.8520e-02,
        2.8495e-01,  1.1671e-01,  2.9981e-01, -9.1376e-01, -4.7744e-01,
       -1.6573e+00,  7.4029e-03, -1.1224e-01, -1.0604e-01,  2.9894e-01,
        3.4634e+00, -2.9341e-01, -7.6777e-01, -3.0120e-01, -3.7192e-03,
        2.3122e-01,  4.7334e-01,  1.3078e-01,  5.0225e-02,  1.9911e-01,
       -5.0179e-01, -3.4197e-03,  3.8654e-01,  5.7375e-02, -1.0157e+00,
       -3.3991e-01, -6.1970e-01, -5.9706e-01, -1.1377e-01, -6.4195e-01],
      dtype=float32)
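A single vector is hard to read on its own; the numbers become meaningful when you compare vectors with each other. One common way to do that is cosine similarity. The sketch below uses made-up 3-dimensional vectors rather than real GloVe values, and `cosine_similarity` is a helper I define here for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    '''
    Returns cosine of the angle between vectors a and b
    '''
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors for illustration only (not real GloVe values)
toy_embedding = {
    'good':  np.array([0.9, 0.1, 0.3], dtype = 'float32'),
    'great': np.array([0.8, 0.2, 0.4], dtype = 'float32'),
    'table': np.array([-0.5, 0.9, -0.2], dtype = 'float32'),
}

sim_close = cosine_similarity(toy_embedding['good'], toy_embedding['great'])
sim_far = cosine_similarity(toy_embedding['good'], toy_embedding['table'])
print('good vs great:', round(sim_close, 3))
print('good vs table:', round(sim_far, 3))
```

With the real `embedding_dict`, the same function could be called on any two entries, for example `cosine_similarity(embedding_dict['under'], embedding_dict['during'])`.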

Now that we have our reviews with sentiment and our word embedding loaded, it's time to start cleaning up and preprocessing our reviews. The reviews have all sorts of non-alphanumeric and uppercase characters, as shown in this rather eloquent review:

In [22]:
cleaned_imdb.iloc[489].values

array(['I was lucky enough to grow up surfing in San Diego (not the biggest waves in the world but it was a hell of childhood, I\'ll tell you that) and I have seen A LOT of so-called surfer flicks in my life. After watching NORTH SHORE for the first time just now, all I can say is THANK GOD I never saw this as a kid. If I had seen this and mistakenly thought that this was a realistic portrayal of the surf scene, I would sold my board and totally gotten into, I don\'t know, accounting or something.<br /><br />Seriously, this movie has a as much in common with real surfing as TOP GUN has was real military life. The acting is terrible, the music is worse, the cinematography is iffy at best and OH MY GOD what was Laird Hanilton thinking?! WOW!!! DO NOT SEE THIS MOVIE!!! IT SUCKS!!! If you want a REAL surf flick, see RIDING GIANTS. Hell, watch SURF\'S UP instead of this. Seriously. Sucks. Sucks bad. Sucks REAL bad. Brah. ;)<br /><br />PS: Had to change my summery from "WTF?!" to wtf because

Next, we will strip off all punctuation. This can take a while on 50,000 reviews, so I also added a print statement to update us on how far we've gotten. Note that while the `cleaned_imdb` reviews are a `pd.DataFrame`, this function is also used later when we are stripping the Airbnb reviews, which are a `pd.Series`, hence the `if` statement. We first replace every punctuation character with a space. Then we do the same for every whitespace character other than the space itself (tabs, newlines, and so on), and collapse any run of spaces into a single space.

It's important to note that this is not always best practice. Consider the following sentence from the review above:

>WOW!!! DO NOT SEE THIS MOVIE!!! IT SUCKS!!!

This carries a much stronger sentiment than the same sentence with one exclamation point, or none. There are ways of dealing with punctuation (see [VADER](https://github.com/cjhutto/vaderSentiment) for example), but I did not use them here.
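For a sense of what gets thrown away, here is a minimal sketch that counts exclamation points and all-caps words before any stripping happens. This is a hand-rolled illustration, not how VADER works and not something used in the rest of this notebook; `emphasis_features` is a hypothetical helper.

```python
import re

def emphasis_features(text):
    '''
    Returns simple emphasis counts for a piece of text
    (hypothetical helper, not part of the pipeline in this notebook)
    '''
    # Keep only the letters of each word so 'SUCKS!!!' counts as 'SUCKS'
    letters_only = [re.sub(r'[^A-Za-z]', '', word) for word in text.split()]
    # Words of two or more letters written entirely in capitals
    shouted = [word for word in letters_only
               if len(word) >= 2 and word.isupper()]
    return {'exclamations': text.count('!'),
            'all_caps_words': len(shouted)}

features = emphasis_features('WOW!!! DO NOT SEE THIS MOVIE!!! IT SUCKS!!!')
print(features)
# {'exclamations': 9, 'all_caps_words': 8}
```

Counts like these could be kept as extra columns alongside the cleaned text, though I have not tested whether they would help here.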

In [23]:
def strip_punctuation_and_whitespace(reviews_df, verbose = True):
    '''
    Replaces punctuation and whitespace in reviews with single spaces
    Removes "<br />"
    Accepts dataframe with columns ['review', 'sentiment'] or series of reviews
    Returns dataframe of cleaned reviews with columns ['review', 'sentiment']
    '''
    trans_punc = str.maketrans(string.punctuation,
                               ' ' * len(string.punctuation))
    whitespace_except_space = string.whitespace.replace(' ', '')
    trans_white = str.maketrans(whitespace_except_space,
                                ' ' * len(whitespace_except_space))
    # Collect rows in a list; concatenating dataframes inside the loop
    # copies everything on each pass and is much slower
    stripped_rows = []
    for i, row in enumerate(reviews_df.values):
        if verbose and i % 5000 == 0:
            print('Stripping review: ' + str(i) + ' of ' + str(len(reviews_df)))
        if isinstance(reviews_df, pd.DataFrame):
            review = row[0]
            sentiment = row[1]
        elif isinstance(reviews_df, pd.Series):
            review = row
            sentiment = np.nan  # np.NaN was removed in NumPy 2.0
        try:
            # str.replace returns a new string, so the result must be assigned
            review = review.replace('<br />', ' ')
        except AttributeError:
            # Skip entries that are not strings (e.g. missing comments)
            continue
        for trans in [trans_punc, trans_white]:
            review = ' '.join(review.translate(trans).split())
        stripped_rows.append([review, sentiment])
    return pd.DataFrame(stripped_rows, columns = ['review', 'sentiment'])

stripped_imdb = strip_punctuation_and_whitespace(cleaned_imdb)

Stripping review: 0 of 50000
Stripping review: 5000 of 50000
Stripping review: 10000 of 50000
Stripping review: 15000 of 50000
Stripping review: 20000 of 50000
Stripping review: 25000 of 50000
Stripping review: 30000 of 50000
Stripping review: 35000 of 50000
Stripping review: 40000 of 50000
Stripping review: 45000 of 50000
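To see what the punctuation step does to a single piece of text, here is a small standalone sketch using only the standard library. The sample string is adapted from the review shown earlier. It also shows why the `<br />` tags have to go before the punctuation does: otherwise the angle brackets and slash are stripped and a stray `br` is left behind as a word.

```python
import string

# Map every punctuation character to a space
trans_punc = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

sample = 'Sucks REAL bad. Brah. ;)<br /><br />PS: Had to change my summary'

# Remove the HTML line breaks first, then translate and collapse spaces
without_breaks = sample.replace('<br />', ' ')
stripped = ' '.join(without_breaks.translate(trans_punc).split())
print(stripped)
# Sucks REAL bad Brah PS Had to change my summary

# Translating without removing the tags leaves 'br' behind as a token
stripped_with_tags = ' '.join(sample.translate(trans_punc).split())
print(stripped_with_tags)
# Sucks REAL bad Brah br br PS Had to change my summary
```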
