# Fellowship ai Sentiment Analysis Challenge with the Yelp Polarity dataset

__Yelp Review Polarity Dataset__

Version 1, Updated 09/09/2015

ORIGIN

The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data. For more information, please refer to http://www.yelp.com/dataset_challenge

The Yelp reviews polarity dataset was constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the above dataset. It was first used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).


DESCRIPTION

The Yelp reviews polarity dataset is constructed by considering stars 1 and 2 negative, and 3 and 4 positive. For each polarity, 280,000 training samples and 19,000 testing samples are taken randomly. In total there are 560,000 training samples and 38,000 testing samples. Negative polarity is class 1, and positive is class 2.

The files train.csv and test.csv contain the training and testing samples, respectively, as comma-separated values. Each file has 2 columns, corresponding to the class index (1 or 2) and the review text. The review texts are enclosed in double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
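As a minimal sketch of the escaping rules described above, the snippet below parses two made-up rows with Python's standard `csv` module. The rows are hypothetical and only mimic the documented format; the doubled quotes are handled by the reader itself, while the backslash-n sequence has to be converted by hand.

```python
import csv
import io

# Two hypothetical rows written in the format described above:
# doubled double quotes inside the text, and a literal backslash + "n".
raw = (
    '"2","She said ""great"" twice.\\nSecond line."\n'
    '"1","Plain review, with a comma."\n'
)

rows = []
for label, text in csv.reader(io.StringIO(raw)):
    # csv.reader turns "" into a single " by default (doublequote=True).
    # The two-character sequence backslash + n is left untouched,
    # so it is converted to a real newline here.
    rows.append((int(label), text.replace("\\n", "\n")))

for label, text in rows:
    print(label, repr(text))
```

`pd.read_csv` applies the same quote handling by default, so only the backslash-n conversion needs an explicit step when loading with pandas.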



In [1]:
%matplotlib inline
import os
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from fastai.metrics import accuracy
from fastai.text import *
from fastai import *

In [2]:
def read_in_data(df_names, directory = './'):
    '''Read each CSV file in `directory` into a separate global dataframe'''
    
    # change directory to data storage location
    os.chdir(directory)
    
    # list of filenames, sorted because glob returns files in arbitrary order
    # and the names in df_names are matched to files by position
    extension = 'csv'
    filenames = sorted(glob.glob('*.{}'.format(extension)))
    
    # create global dfs within function
    for name, file in zip(df_names, filenames):
        globals()[name] = pd.read_csv(file, header=None, names = ['sentiment', 'text'])

In [3]:
# list of preferred dataframe names, in the same order as the CSV files in the directory
df_names = ['test', 'train']
directory = '/Users/mattmerrill/Springboard/fellowship-ai_challenge/data'

read_in_data(df_names, directory)
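An alternative to writing dataframes into `globals()` is to return them in a dictionary keyed by split name. The sketch below assumes the files are named `train.csv` and `test.csv`, as the dataset description states; the helper name `load_splits` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_splits(directory, names=("train", "test")):
    """Return {name: DataFrame} by reading <directory>/<name>.csv for each name."""
    data_dir = Path(directory)
    return {
        name: pd.read_csv(
            data_dir / f"{name}.csv",
            header=None,                  # the files have no header row
            names=["sentiment", "text"],  # same column names as above
        )
        for name in names
    }
```

Calling `splits = load_splits(directory)` and then reading `splits["train"]` and `splits["test"]` ties each dataframe to its file by name instead of by position, and leaves the working directory unchanged.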

In [4]:
train.head()

Unnamed: 0,sentiment,text
0,1,"Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff. It seems that his staff simply never answers the phone. It usually takes 2 hours of repeated calling to get an answer. Who has time for that or wants to deal with it? I have run into this problem with many other doctors and I just don't get it. You have office workers, you have patients with medical needs, why isn't anyone answering the phone? It's incomprehensible and not work the aggravation. It's with regret that I..."
1,2,"Been going to Dr. Goldberg for over 10 years. I think I was one of his 1st patients when he started at MHMG. He's been great over the years and is really all about the big picture. It is because of him, not my now former gyn Dr. Markoff, that I found out I have fibroids. He explores all options with you and is very patient and understanding. He doesn't judge and asks all the right questions. Very thorough and wants to be kept in the loop on every aspect of your medical health and your life."
2,1,"I don't know what Dr. Goldberg was like before moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. I was going to Dr. Johnson before he left and Goldberg took over when Johnson left. He is not a caring doctor. He is only interested in the co-pay and having you come in for medication refills every month. He will not give refills and could less about patients's financial situations. Trying to get your 90 days mail away pharmacy prescriptions through this guy is a joke. And to make matters even worse, his office staff is incompetent. 90% of the time when you c..."
3,1,"I'm writing this review to give you a heads up before you see this Doctor. The office staff and administration are very unprofessional. I left a message with multiple people regarding my bill, and no one ever called me back. I had to hound them to get an answer about my bill. \n\nSecond, and most important, make sure your insurance is going to cover Dr. Goldberg's visits and blood work. He recommended to me that I get a physical, and he knew I was a student because I told him. I got the physical done. Later, I found out my health insurance doesn't pay for preventative visits. I received an..."
4,2,"All the food is great here. But the best thing they have is their wings. Their wings are simply fantastic!! The \""Wet Cajun\"" are by the best & most popular. I also like the seasoned salt wings. Wing Night is Monday & Wednesday night, $0.75 whole wings!\n\nThe dining area is nice. Very family friendly! The bar is very nice is well. This place is truly a Yinzer's dream!! \""Pittsburgh Dad\"" would love this place n'at!!"


In [5]:
test.head()

Unnamed: 0,sentiment,text
0,2,"Contrary to other reviews, I have zero complaints about the service or the prices. I have been getting tire service here for the past 5 years now, and compared to my experience with places like Pep Boys, these guys are experienced and know what they're doing. \nAlso, this is one place that I do not feel like I am being taken advantage of, just because of my gender. Other auto mechanics have been notorious for capitalizing on my ignorance of cars, and have sucked my bank account dry. But here, my service and road coverage has all been well explained - and let up to me to decide. \nAnd they ..."
1,1,"Last summer I had an appointment to get new tires and had to wait a super long time. I also went in this week for them to fix a minor problem with a tire they put on. They \""fixed\"" it for free, and the very next morning I had the same issue. I called to complain, and the \""manager\"" didn't even apologize!!! So frustrated. Never going back. They seem overpriced, too."
2,2,"Friendly staff, same starbucks fair you get anywhere else. Sometimes the lines can get long."
3,1,"The food is good. Unfortunately the service is very hit or miss. The main issue seems to be with the kitchen, the waiters and waitresses are often very apologetic for the long waits and it's pretty obvious that some of them avoid the tables after taking the initial order to avoid hearing complaints."
4,2,"Even when we didn't have a car Filene's Basement was worth the bus trip to the Waterfront. I always find something (usually I find 3-4 things and spend about $60) and better still, I am always still wearing the clothes and shoes 3 months later. \n\nI kind of suspect this is the best shopping in Pittsburgh; it's much better than the usual department stores, better than Marshall's and TJ Maxx and better than the Saks downtown, even when it has a sale. Selection, bargains AND quality.\n\nI like this Filene's better than Gabriel Brothers, which are harder to get to. Gabriel Brothers are a real..."


In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 560000 entries, 0 to 559999
Data columns (total 2 columns):
sentiment    560000 non-null int64
text         560000 non-null object
dtypes: int64(1), object(1)
memory usage: 8.5+ MB


In [7]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38000 entries, 0 to 37999
Data columns (total 2 columns):
sentiment    38000 non-null int64
text         38000 non-null object
dtypes: int64(1), object(1)
memory usage: 593.9+ KB


## Fellowship ai Challenge Description

Perform sentiment analysis on the Yelp Reviews Polarity dataset using ULMFiT, and apply EDA augmentation. Report the performance metrics before and after augmentation, and show examples of the augmented data.
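Assuming "EDA" in the challenge description refers to Easy Data Augmentation (synonym replacement, random insertion, random swap and random deletion), the sketch below illustrates the two operations that need no external resources. The function names, the sample sentence and the parameter values are illustrative assumptions, not the implementation used for the challenge; the two synonym-based operations are omitted because they need a thesaurus such as WordNet.

```python
import random


def random_swap(words, n_swaps, rng):
    """Swap the words at two randomly chosen positions, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # fixed seed so the augmented examples are reproducible
review = "the food is great but the service is slow".split()

swapped = random_swap(review, n_swaps=2, rng=rng)
shortened = random_deletion(review, p=0.2, rng=rng)

print(" ".join(swapped))
print(" ".join(shortened))
```

Both operations keep the label of the original review, which is why they can be used to enlarge a labeled training set.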

## Exploratory Analysis

Before processing with ULMFiT, let's perform some exploratory analysis to find the most common words. In the process we will also strip URLs and punctuation from the text and lowercase all words to avoid unnecessary repeats.

In [8]:
import itertools
import collections

import nltk
from nltk.corpus import stopwords
import re
import networkx

import warnings
warnings.filterwarnings("ignore")

sns.set(font_scale=1.5)
sns.set_style("whitegrid")

In [9]:
all_reviews = [text for text in train.text]

In [10]:
all_reviews[:5]

["Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  You have office workers, you have patients with medical needs, why isn't anyone answering the phone?  It's incomprehensible and not work the aggravation.  It's with regret that I feel that I have to give Dr. Goldberg 2 stars.",
 "Been going to Dr. Goldberg for over 10 years. I think I was one of his 1st patients when he started at MHMG. He's been great over the years and is really all about the big picture. It is because of him, not my now former gyn Dr. Markoff, that I found out I have fibroids. He explores all options with you and is very patient and understanding. He doe

In [11]:
def remove_url(txt):
    """Remove URLs from a text string, along with every character that is
    not an ASCII letter, digit, space or tab, then collapse whitespace.

    Parameters
    ----------
    txt : string
        A text string that you want to parse and remove urls.

    Returns
    -------
    The same txt string with URLs and punctuation removed.
    """

    # raw string so that \w and \S reach the regex engine unchanged
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+://\S+)", "", txt).split())
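One side effect worth noting: the reviews store line breaks as a literal backslash followed by "n", and the pattern above removes the backslash but keeps the "n", so `wings!\n\nThe` would come out as `wingsnnThe`. A possible variant, sketched here under the assumption that those escaped newlines should count as word boundaries, replaces them with a space first. The name `clean_review` is hypothetical.

```python
import re

URL_RE = re.compile(r"\w+://\S+")
NON_ALNUM_RE = re.compile(r"[^0-9A-Za-z \t]")


def clean_review(txt):
    """Hypothetical variant of remove_url that treats a literal backslash-n as a space."""
    txt = txt.replace("\\n", " ")    # escaped newline -> word boundary
    txt = URL_RE.sub("", txt)        # drop URLs before punctuation is stripped
    txt = NON_ALNUM_RE.sub("", txt)  # drop remaining punctuation and symbols
    return " ".join(txt.split())     # collapse runs of whitespace


print(clean_review("Great wings!\\n\\nSee http://example.com for the menu."))
# prints: Great wings See for the menu
```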

In [12]:
all_reviews_no_urls = [remove_url(text) for text in all_reviews]
all_reviews_no_urls[:5]

['Unfortunately the frustration of being Dr Goldbergs patient is a repeat of the experience Ive had with so many other doctors in NYC good doctor terrible staff It seems that his staff simply never answers the phone It usually takes 2 hours of repeated calling to get an answer Who has time for that or wants to deal with it I have run into this problem with many other doctors and I just dont get it You have office workers you have patients with medical needs why isnt anyone answering the phone Its incomprehensible and not work the aggravation Its with regret that I feel that I have to give Dr Goldberg 2 stars',
 'Been going to Dr Goldberg for over 10 years I think I was one of his 1st patients when he started at MHMG Hes been great over the years and is really all about the big picture It is because of him not my now former gyn Dr Markoff that I found out I have fibroids He explores all options with you and is very patient and understanding He doesnt judge and asks all the right questio

In [13]:
# Create a list of lists containing lowercase words for each review
words_in_review = [review.lower().split() for review in all_reviews_no_urls]
words_in_review[:1]

[['unfortunately',
  'the',
  'frustration',
  'of',
  'being',
  'dr',
  'goldbergs',
  'patient',
  'is',
  'a',
  'repeat',
  'of',
  'the',
  'experience',
  'ive',
  'had',
  'with',
  'so',
  'many',
  'other',
  'doctors',
  'in',
  'nyc',
  'good',
  'doctor',
  'terrible',
  'staff',
  'it',
  'seems',
  'that',
  'his',
  'staff',
  'simply',
  'never',
  'answers',
  'the',
  'phone',
  'it',
  'usually',
  'takes',
  '2',
  'hours',
  'of',
  'repeated',
  'calling',
  'to',
  'get',
  'an',
  'answer',
  'who',
  'has',
  'time',
  'for',
  'that',
  'or',
  'wants',
  'to',
  'deal',
  'with',
  'it',
  'i',
  'have',
  'run',
  'into',
  'this',
  'problem',
  'with',
  'many',
  'other',
  'doctors',
  'and',
  'i',
  'just',
  'dont',
  'get',
  'it',
  'you',
  'have',
  'office',
  'workers',
  'you',
  'have',
  'patients',
  'with',
  'medical',
  'needs',
  'why',
  'isnt',
  'anyone',
  'answering',
  'the',
  'phone',
  'its',
  'incomprehensible',
  'and',
  'n

In [14]:
# List of all words across reviews
all_words_no_urls = list(itertools.chain(*words_in_review))

In [15]:
# collections.Counter tallies how often each word occurs across all reviews
counts_no_urls = collections.Counter(all_words_no_urls)
counts_no_urls.most_common(15)

[('the', 3679294),
 ('and', 2403526),
 ('i', 1928076),
 ('to', 1883800),
 ('a', 1877056),
 ('was', 1325476),
 ('of', 1107741),
 ('it', 950032),
 ('for', 860446),
 ('in', 845959),
 ('is', 827323),
 ('that', 702757),
 ('my', 664211),
 ('we', 597584),
 ('this', 590778)]
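With 560,000 reviews, flattening every token into one list before counting holds all of the words in memory at once. A lighter option, sketched below with a hypothetical helper `count_words` and toy data, is to update a single `Counter` one review at a time; the resulting counts are the same.

```python
from collections import Counter


def count_words(token_lists):
    """Tally word frequencies without first building one flat list of all tokens."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)  # adds 1 for every token in this review
    return counts


# Toy stand-in for words_in_review
sample = [["good", "food", "good"], ["bad", "food"]]
counts = count_words(sample)
print(counts.most_common(2))
```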

> Let's remove the stop words and analyze the top words again. This may give us more insight into the text.

In [16]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mattmerrill/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
stop_words = set(stopwords.words('english'))

In [18]:
words_in_review[0]

['unfortunately',
 'the',
 'frustration',
 'of',
 'being',
 'dr',
 'goldbergs',
 'patient',
 'is',
 'a',
 'repeat',
 'of',
 'the',
 'experience',
 'ive',
 'had',
 'with',
 'so',
 'many',
 'other',
 'doctors',
 'in',
 'nyc',
 'good',
 'doctor',
 'terrible',
 'staff',
 'it',
 'seems',
 'that',
 'his',
 'staff',
 'simply',
 'never',
 'answers',
 'the',
 'phone',
 'it',
 'usually',
 'takes',
 '2',
 'hours',
 'of',
 'repeated',
 'calling',
 'to',
 'get',
 'an',
 'answer',
 'who',
 'has',
 'time',
 'for',
 'that',
 'or',
 'wants',
 'to',
 'deal',
 'with',
 'it',
 'i',
 'have',
 'run',
 'into',
 'this',
 'problem',
 'with',
 'many',
 'other',
 'doctors',
 'and',
 'i',
 'just',
 'dont',
 'get',
 'it',
 'you',
 'have',
 'office',
 'workers',
 'you',
 'have',
 'patients',
 'with',
 'medical',
 'needs',
 'why',
 'isnt',
 'anyone',
 'answering',
 'the',
 'phone',
 'its',
 'incomprehensible',
 'and',
 'not',
 'work',
 'the',
 'aggravation',
 'its',
 'with',
 'regret',
 'that',
 'i',
 'feel',
 'that

In [19]:
# Remove stop words from each review's list of words
reviews_nsw = [[word for word in review_words if word not in stop_words]
              for review_words in words_in_review]
reviews_nsw[0]

['unfortunately',
 'frustration',
 'dr',
 'goldbergs',
 'patient',
 'repeat',
 'experience',
 'ive',
 'many',
 'doctors',
 'nyc',
 'good',
 'doctor',
 'terrible',
 'staff',
 'seems',
 'staff',
 'simply',
 'never',
 'answers',
 'phone',
 'usually',
 'takes',
 '2',
 'hours',
 'repeated',
 'calling',
 'get',
 'answer',
 'time',
 'wants',
 'deal',
 'run',
 'problem',
 'many',
 'doctors',
 'dont',
 'get',
 'office',
 'workers',
 'patients',
 'medical',
 'needs',
 'isnt',
 'anyone',
 'answering',
 'phone',
 'incomprehensible',
 'work',
 'aggravation',
 'regret',
 'feel',
 'give',
 'dr',
 'goldberg',
 '2',
 'stars']
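Tokens such as 'dont' and 'isnt' survive the filter in the output above, presumably because the punctuation step removed their apostrophes while the stop word list stores contractions with the apostrophe intact. One possible fix, sketched here with a small hypothetical stand-in for the NLTK list, is to add apostrophe-free variants to the stop word set.

```python
# Small stand-in for stopwords.words('english'); the real list is much longer.
base_stop_words = {"the", "i", "don't", "isn't", "it's"}

# Add apostrophe-free variants so they match tokens whose punctuation was stripped.
stop_words_ext = base_stop_words | {w.replace("'", "") for w in base_stop_words}

tokens = ["i", "dont", "get", "it", "isnt", "anyone", "answering"]
filtered = [t for t in tokens if t not in stop_words_ext]
print(filtered)
```

In the notebook the same idea would apply to the `stop_words` set built from `stopwords.words('english')`.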

In [None]:
# let's take a look again at the most common words
all_words_nsw = list(itertools.chain(*reviews_nsw))

counts_nsw = collections.Counter(all_words_nsw)

counts_nsw.most_common(15)

In [None]:
clean_reviews_nsw = pd.DataFrame(counts_nsw.most_common(15),
                             columns=['words', 'count'])

fig, ax = plt.subplots(figsize=(8, 8))

# Plot horizontal bar graph
clean_reviews_nsw.sort_values(by='count').plot.barh(x='words',
                      y='count',
                      ax=ax,
                      color="purple")

ax.set_title("Common Words Found in Reviews (Without Stop Words)")

plt.show()