# Midterm Assignment: McDonald's Sentiment Data Analysis - Solution

## Problem

McDonald’s receives thousands of consumer comments on its website every day, and many of them are negative. Its corporate employees do not have time to read every comment, but they do want to read the subset they are most interested in. In particular, articles about rude service by McDonald’s employees have recently surfaced on social media. To take appropriate action, the company would now like to review comments about **rude service**.

You are hired to develop a system that ranks each comment by the **likelihood that it is referring to rude service**. They will use this system to build a “rudeness dashboard” for their corporate employees, so that the employees can spend a few minutes each day examining the **most relevant recent comments**.


## Data

McDonald’s used the CrowdFlower platform to pay humans to hand-annotate approximately 1500 comments with the type of complaint. The list of complaint types can be found below, with the encoding used listed in parentheses: 
- Bad Food (BadFood)
- Bad Neighborhood (ScaryMcDs)
- Cost (Cost)
- Dirty Location (Filthy)
- Missing Item (MissingFood)
- Problem with Order (OrderProblem)
- Rude Service (RudeService)
- Slow Service (SlowService)
- None of the above (na) 

You will be asked to perform a series of tasks. Multiple-choice questions (MCQs) are interspersed among the tasks; for each one, select the best available option.

In [1]:
# for Python 2: use print only as a function
from __future__ import print_function

## Task 1

Read **'mcdonalds.csv'** into a pandas DataFrame and examine it. (Instructions: mcdonalds.csv can be found in “IVLE Workbin > Midterm Assignment > data”) 

A description of the more important columns to get you started: 
- The **policies_violated** column lists the type of complaint. If there is more than one type, the types are separated by newline characters.
- The **policies_violated:confidence** column lists CrowdFlower's confidence in the judgments of its human annotators for that row (higher is better).
- The **city** column is the McDonald's location.
- The **review** column is the actual text comment.

**Please answer Question 1 as in midterm.pdf.** 

In [2]:
# read mcdonalds.csv using a relative path
import pandas as pd
path = 'mcdonalds.csv'
mcd = pd.read_csv(path)

In [3]:
# examine the first three rows
mcd.head(3)

| | _unit_id | _golden | _unit_state | _trusted_judgments | _last_judgment_at | policies_violated | policies_violated:confidence | city | policies_violated_gold | review | Unnamed: 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 679455653 | False | finalized | 3 | 2/21/15 0:36 | RudeService\nOrderProblem\nFilthy | 1.0\n0.6667\n0.6667 | Atlanta | NaN | I'm not a huge mcds lover, but I've been to be... | NaN |
| 1 | 679455654 | False | finalized | 3 | 2/21/15 0:27 | RudeService | 1 | Atlanta | NaN | Terrible customer service. I came in at 9:30... | NaN |
| 2 | 679455655 | False | finalized | 3 | 2/21/15 0:26 | SlowService\nOrderProblem | 1.0\n1.0 | Atlanta | NaN | First they "lost" my order, actually they gave... | NaN |


In [4]:
# examine the text of the 128th review
mcd.loc[127, 'review']

"Ok I'm waiting for like 10 minutes to place my order with the staff walking back & forth just looking at me like I'm crazy. And another 10 minutes or so before i got my food, This location use to be my stop in the mornings when I worked near here but they have fallen way off."

## Task 2

Remove any rows from the DataFrame in which the policies_violated column has a null value.
- **Note**: Null values are also known as “missing values”, and are encoded in pandas with the special value “NaN”. This is different from the “na” encoding used by CrowdFlower to denote “None of the above”. Rows that contain “na” should not be removed.

**Please answer Questions 2 and 3 as in midterm.pdf.**
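The distinction between a true missing value and the string "na" can be checked on a small example. The sketch below uses a made-up four-row stand-in for the policies_violated column (the values are assumptions, not taken from mcdonalds.csv) and shows that only the real NaN is flagged as null.

```python
import numpy as np
import pandas as pd

# toy stand-in for the policies_violated column (values are made up)
toy = pd.DataFrame({'policies_violated': ['RudeService', 'na', np.nan, 'Cost']})

# isnull() flags only the true missing value, not the string 'na'
null_mask = toy.policies_violated.isnull()
print(null_mask.tolist())

# keep every row that has a label, including the 'na' label
kept = toy[toy.policies_violated.notnull()]
print(kept.policies_violated.tolist())
```

The row labelled "na" survives the filter because it is an ordinary string; only the NaN row is dropped.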

In [5]:
# examine the shape before removing any rows
mcd.shape

(1525, 11)

In [6]:
# count the number of null values in each column
mcd.isnull().sum()

_unit_id                           0
_golden                            0
_unit_state                        0
_trusted_judgments                 0
_last_judgment_at                  0
policies_violated                 54
policies_violated:confidence      54
city                              87
policies_violated_gold          1525
review                             0
Unnamed: 10                     1525
dtype: int64

In [7]:
# filter the DataFrame to only include rows in which policies_violated is not null
mcd = mcd[mcd.policies_violated.notnull()]

# alternatively, use the 'dropna' method to accomplish the same thing
# (reassigning the result avoids modifying a filtered slice in place)
mcd = mcd.dropna(subset=['policies_violated'])

In [8]:
# examine the shape after removing rows
mcd.shape

(1471, 11)

## Task 3

Add a new column to the DataFrame called **"rude"** that is 1 if the **policies_violated** column contains the text "RudeService", and 0 if the **policies_violated** column does not contain "RudeService". The "rude" column is going to be your response variable, so check how many zeros and ones it contains.

**Please answer Question 4 as in midterm.pdf.**
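As a rough illustration of how the response variable is derived, the sketch below applies `Series.str.contains()` to a few made-up label strings (assumed values, not rows from the dataset). It also shows the `na=False` argument, which may be handy if null labels had not already been removed in Task 2.

```python
import pandas as pd

# made-up label strings in the same newline-separated format
labels = pd.Series(['RudeService\nOrderProblem', 'SlowService', 'na', 'RudeService'])

# True/False for a substring match, then cast to 1/0
rude = labels.str.contains('RudeService').astype(int)
print(rude.tolist())

# with a missing label present, na=False treats it as "no match"
with_null = pd.Series(['RudeService', None])
rude_with_null = with_null.str.contains('RudeService', na=False).astype(int)
print(rude_with_null.tolist())
```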

In [9]:
# search for the string 'RudeService', and convert the boolean results to integers
mcd['rude'] = mcd.policies_violated.str.contains('RudeService').astype(int)

In [10]:
# confirm that it worked
mcd.loc[0:4, ['policies_violated', 'rude']]

| | policies_violated | rude |
|---|---|---|
| 0 | RudeService\nOrderProblem\nFilthy | 1 |
| 1 | RudeService | 1 |
| 2 | SlowService\nOrderProblem | 0 |
| 3 | na | 0 |
| 4 | RudeService | 1 |


In [11]:
# examine the class distribution
mcd.rude.value_counts()

0    968
1    503
Name: rude, dtype: int64

In [12]:
# examine the class distribution as proportions
mcd.rude.value_counts(normalize=True)

0    0.658056
1    0.341944
Name: rude, dtype: float64

## Task 4

Define X using the **review** column and y using the **rude** column. Split X and y into training and testing sets (using the parameter **`random_state=1`**). Use CountVectorizer (with the **default parameters**) to create document-term matrices from X_train and X_test. 
- Note: Please remember to follow the instructions carefully by setting the parameters as required for reproducibility of results. 

**Please answer Questions 5 and 6 as in midterm.pdf.**
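To make the fit/transform distinction concrete, here is a minimal sketch on a three-sentence toy corpus (the sentences are invented for illustration). The vocabulary is learned from the training documents only, so words that appear only in the test document are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# invented toy documents, not taken from the dataset
train_docs = ['the service was rude', 'the food was cold']
test_docs = ['rude staff and cold fries']

vect = CountVectorizer()
train_dtm = vect.fit_transform(train_docs)  # learns the vocabulary and counts terms
test_dtm = vect.transform(test_docs)        # reuses the training vocabulary

# columns are ordered alphabetically by term
vocab = sorted(vect.vocabulary_)
print(vocab)
print(test_dtm.toarray())
```

The test row has non-zero counts only for "cold" and "rude"; "staff", "and", and "fries" were never seen during fitting, which is why both matrices in this task share the same number of columns.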

In [13]:
# define X and y
X = mcd.review
y = mcd.rude

In [14]:
# split X and y into training and testing sets
# (sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)



In [15]:
# import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [16]:
# fit and transform X_train into X_train_dtm
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(1103, 7300)

In [17]:
# transform X_test into X_test_dtm
X_test_dtm = vect.transform(X_test)
X_test_dtm.shape

(368, 7300)

In [18]:
# examine the sparse row for the 25th testing set observation
X_test_dtm[24]

<1x7300 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>

## Task 5

Fit a Multinomial Naive Bayes model to the training set, calculate the **predicted probabilities** for the testing set, and then calculate the AUC. Repeat this task using a logistic regression model to compare which of the two models achieves a better AUC. 
- **Note**: McDonald’s requires you to rank the comments by the likelihood that they refer to rude service. In this case, classification accuracy is NOT the relevant evaluation metric. Area Under Curve (AUC) is a more useful evaluation metric for this scenario, since it measures the ability of the classifier to assign higher predicted probabilities to positive instances than to negative instances. 

**Please answer Questions 7, 8 and 9 as in midterm.pdf.** 
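The ranking interpretation of AUC in the note above can be verified by hand. The following sketch uses six invented labels and probabilities (assumptions for illustration only) and computes the fraction of positive/negative pairs in which the positive instance receives the higher score, counting ties as half. On this data it matches `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# invented labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]

# compare every positive score with every negative score
diffs = pos[:, None] - neg[None, :]
pairwise_auc = ((diffs > 0).sum() + 0.5 * (diffs == 0).sum()) / diffs.size

sk_auc = roc_auc_score(y_true, y_prob)
print(pairwise_auc, sk_auc)
```

Here 8 of the 9 pairs are ordered correctly, so both values are 8/9. Because only the ordering of the scores matters, AUC suits a dashboard that shows the highest-ranked comments first.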

In [19]:
# import/instantiate/fit a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [20]:
# calculate the predicted probability of rude=1 for each testing set observation
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]

In [21]:
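# predict the class for each testing set observation,
# then compute the fraction of the first five that are predicted as rude
y_pred = nb.predict(X_test_dtm)
sum(y_pred[0:5])/5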

0.40000000000000002

In [22]:
# calculate the AUC
from sklearn import metrics
nb_AUC = metrics.roc_auc_score(y_test, y_pred_prob)
nb_AUC

0.84260054045461774

In [23]:
# repeat this task using a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
logreg_AUC = metrics.roc_auc_score(y_test, y_pred_prob)
logreg_AUC

0.82339850580193941

In [24]:
# difference in AUC between Naive Bayes and logistic regression
nb_AUC - logreg_AUC

0.019202034652678335

## Task 6

Using Naive Bayes, try **tuning CountVectorizer** using some of the techniques we learned in class. Check the testing set AUC after each change, and find the set of parameters that increases AUC the most. (This is meant for your own learning experience.)
- **Hint**: It is highly recommended that you adapt the **`tokenize_test()`** function from class for this purpose, since it will allow you to iterate quickly through different sets of parameters. 

**Please answer Questions 10 and 11 as in midterm.pdf.**
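Before tuning on the real data, it may help to see what each CountVectorizer parameter does to the vocabulary. The sketch below fits several settings on a four-sentence toy corpus (invented for illustration; the setting names in the dictionary are just labels) and records the resulting terms.

```python
from sklearn.feature_extraction.text import CountVectorizer

# invented toy corpus
docs = [
    'the staff was rude',
    'the staff was slow',
    'the food was cold',
    'the manager was rude',
]

# each entry maps a label to the keyword arguments being tried
settings = {
    'default': {},
    'stop_words': {'stop_words': 'english'},
    'min_df=2': {'min_df': 2},
    'max_df=0.5': {'max_df': 0.5},
    'min_df=2, max_df=0.5': {'min_df': 2, 'max_df': 0.5},
}

vocabularies = {}
for name, params in settings.items():
    vect = CountVectorizer(**params)
    vect.fit(docs)
    vocabularies[name] = sorted(vect.vocabulary_)
    print(name, '->', len(vocabularies[name]), 'features:', vocabularies[name])
```

`min_df=2` keeps only terms found in at least two documents, while `max_df=0.5` drops terms found in more than half of them, so combining the two leaves just "rude" and "staff" here.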

In [25]:
# define a function that accepts a vectorizer and calculates the AUC
def tokenize_test(vect):
    
    # create document-term matrices using the vectorizer
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    
    # print the number of features that were generated
    print('Features:', X_train_dtm.shape[1])
    
    # use Multinomial Naive Bayes to calculate predicted probabilities
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
    
    # print the AUC
    print('AUC:', metrics.roc_auc_score(y_test, y_pred_prob))

In [26]:
# confirm that the AUC is identical to task 5 when using the default parameters
vect = CountVectorizer()
tokenize_test(vect)

Features: 7300
AUC: 0.842600540455


In [27]:
# tune CountVectorizer to increase the AUC
vect = CountVectorizer(stop_words='english', max_df=0.3, min_df=4)
tokenize_test(vect)

Features: 1732
AUC: 0.862152281036


## Task 7 

The city column might be predictive of the response, but we are not currently using it as a feature. We will now explore whether adding city to the model increases the AUC. Do the following:
1. Create a new DataFrame column, review_city, that concatenates the review text with the city text. One easy way to combine string columns in pandas is by using the `Series.str.cat()` method. Make sure to use the whitespace character as a separator, as well as replacing null city values with a reasonable string value such as ‘na’. 
2. Redefine X using the review_city column, and re-split X and y into training and testing sets (using the parameter `random_state=1`). 
3. Using CountVectorizer with English stop word removal and the parameters `max_df=0.3` and `min_df=4`, check whether adding city increased or decreased the AUC.

**Please answer Question 12 as in midterm.pdf.** 
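The role of the replacement string for null cities in step 1 can be seen on a two-row toy frame (made-up values, not from the dataset). Without `na_rep`, a missing city makes the whole concatenated value missing, which would later break vectorization.

```python
import numpy as np
import pandas as pd

# made-up reviews; the second one has no city
toy = pd.DataFrame({
    'review': ['Rude cashier', 'Cold fries'],
    'city': ['Atlanta', np.nan],
})

# with na_rep, the missing city becomes the string 'na'
toy['review_city'] = toy.review.str.cat(toy.city, sep=' ', na_rep='na')
print(toy.review_city.tolist())

# without na_rep, the second result is itself missing
without_rep = toy.review.str.cat(toy.city, sep=' ')
print(without_rep.isnull().tolist())
```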

In [28]:
# concatenate review and city, separated by a space, replacing nulls with 'na'
mcd['review_city'] = mcd.review.str.cat(mcd.city, sep=' ', na_rep='na')

In [29]:
# examine review_city for the first row to confirm that it worked
mcd.loc[0, 'review_city']

"I'm not a huge mcds lover, but I've been to better ones. This is by far the worst one I've ever been too! It's filthy inside and if you get drive through they completely screw up your order every time! The staff is terribly unfriendly and nobody seems to care. Atlanta"

In [30]:
# redefine X and y
X = mcd.review_city
y = mcd.rude

In [31]:
# re-split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [32]:
# check whether it increased or decreased the AUC of my best model
vect = CountVectorizer(stop_words='english', max_df=0.3, min_df=4)
tokenize_test(vect)

Features: 1739
AUC: 0.864854554125


In [33]:
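# AUC gain from adding city
# (these hard-coded values differ slightly from the AUCs printed in cells 27 and 32,
# but the difference, about 0.0027, is essentially the same)
0.864934032745 - 0.862231759657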

0.0027022730879999735

## Task 8 

The **policies_violated:confidence** column may be useful as it is a measure of the training data quality. You are to calculate the **mean confidence** score for each row of your McDonald’s dataset (i.e., X_train together with X_test) and store these mean scores in a new column. For example, the confidence scores for the first row are 1.0\r\n0.6667\r\n0.6667, so you should calculate a mean of 0.7778. Here are some of the steps you can follow:
1. Using the `Series.str.split()` method, convert the policies_violated:confidence column into lists of one or more “confidence scores”. Save the results as a new DataFrame column called **confidence_list**. 
2. Apply a function that can calculate the mean of a list of numbers, and pass that function to the `Series.apply()` method of the **confidence_list** column. Save those scores in a new DataFrame column called **confidence_mean**. 

**Please answer Question 13 as in midterm.pdf.**
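One detail worth checking is the separator: the task text shows scores separated by `\r\n`, while the DataFrame output displays `\n`. The sketch below, on three hand-written strings (assumed inputs), suggests that `Series.str.split()` with no arguments handles both, since it splits on any run of whitespace.

```python
import numpy as np
import pandas as pd

# hand-written examples covering both separators and a single score
raw = pd.Series(['1.0\n0.6667\n0.6667', '1.0\r\n0.6667\r\n0.6667', '1'])

# no separator argument: split on any whitespace
as_lists = raw.str.split()
print(as_lists.tolist())

# convert each list of strings to floats and average
means = as_lists.apply(lambda scores: np.mean(np.array(scores, dtype=float)))
print(means.round(4).tolist())
```

The first two rows produce identical lists, and their mean is (1.0 + 0.6667 + 0.6667) / 3 = 0.7778, matching the worked example in the task.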

In [34]:
# examine the existing confidence column
mcd['policies_violated:confidence'].head()

0    1.0\n0.6667\n0.6667
1                      1
2               1.0\n1.0
3                 0.6667
4                      1
Name: policies_violated:confidence, dtype: object

In [35]:
# split the column into lists of one or more confidence scores
mcd['confidence_list'] = mcd['policies_violated:confidence'].str.split()
mcd.confidence_list.head()

0    [1.0, 0.6667, 0.6667]
1                      [1]
2               [1.0, 1.0]
3                 [0.6667]
4                      [1]
Name: confidence_list, dtype: object

In [36]:
import numpy as np

# define a function that accepts a list of strings and returns the mean
def mean_of_list(conf_list):
    
    # convert the list to a NumPy array of floats
    conf_array = np.array(conf_list, dtype=float)
    
    # return the mean of the array
    return np.mean(conf_array)

In [37]:
# calculate the mean confidence score for each row
mcd['confidence_mean'] = mcd.confidence_list.apply(mean_of_list)
mcd.confidence_mean.head()

0    0.7778
1    1.0000
2    1.0000
3    0.6667
4    1.0000
Name: confidence_mean, dtype: float64

In [38]:
# count the rows whose mean confidence is exactly 1
sum(mcd.confidence_mean==1)

785

We would now like to remove lower-quality rows from the training set to reduce noise. You are to remove all rows from X_train and y_train that have a confidence_mean lower than 0.75.

**Please answer Questions 14 and 15 as in midterm.pdf.**
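Filtering the training set by a column that lives in the full DataFrame works because the split preserves the original row labels. The sketch below imitates this with a five-row toy frame and a hand-picked "training" subset (all names and values here are hypothetical), looking up the confidence scores by index label so the mask lines up with the training rows.

```python
import pandas as pd

# hypothetical full dataset with a mean confidence per row
full = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd', 'e'],
    'confidence_mean': [1.0, 0.5, 0.8, 0.6667, 0.75],
})

# hand-picked, shuffled "training" rows that keep their original labels
X_train_toy = full.text.loc[[4, 0, 1]]
y_train_toy = pd.Series([1, 0, 1], index=[4, 0, 1])

# look up confidence for the training rows only, then build the mask
keep = full.loc[X_train_toy.index, 'confidence_mean'] >= 0.75

X_kept = X_train_toy[keep]
y_kept = y_train_toy[keep]
print(X_kept.tolist(), y_kept.tolist())
```

Row 1 (confidence 0.5) is dropped, while row 4 (exactly 0.75) is kept because the condition removes only values lower than the threshold. The testing set is left untouched, so the AUC is still measured on all test comments.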

In [39]:
# check the shapes of X_train and y_train before removing any rows
print(X_train.shape)
print(y_train.shape)

(1103,)
(1103,)


In [40]:
# remove any rows from X_train and y_train that have a confidence_mean lower than 0.75
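# (look up confidence_mean by the training rows' index labels so the mask aligns exactly)
keep = mcd.loc[X_train.index, 'confidence_mean'] >= 0.75
X_train = X_train[keep]
y_train = y_train[keep]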
print(X_train.shape)
print(y_train.shape)

(799,)
(799,)


In [41]:
# check whether it increased or decreased the AUC of my best model
vect = CountVectorizer(stop_words='english', max_df=0.3, min_df=4)
tokenize_test(vect)

Features: 1353
AUC: 0.849690033381
