# Homework with McDonald's sentiment data

In [82]:
import glob

import pandas as pd
import numpy as np

import utils
import funcs

In [83]:
datafiles = glob.glob('data/**')
datafiles

['data/mcdonalds.csv', 'data/mcdonalds_new.txt']

In [84]:
mcdo_csv = datafiles[0]
mcdo_test = datafiles[1]

## Imaginary problem statement

McDonald's receives **thousands of customer comments** on their website per day, and many of them are negative. Their corporate employees don't have time to read every single comment, but they do want to read a subset of comments that they are most interested in. In particular, the media has recently portrayed their employees as being rude, and so they want to read any comments about **rude service** so that they can update their employee training accordingly.

McDonald's has hired you to develop a system that ranks each comment by the **likelihood that it is referring to rude service**. They will use your system to build a "rudeness dashboard" for their corporate employees, so that employees can spend a few minutes each day examining the **most relevant recent comments**.

## Description of the data

Before hiring you, McDonald's used the [CrowdFlower platform](http://www.crowdflower.com/data-for-everyone) to pay humans to **hand-annotate** about 1500 comments with the **type of complaint**. The complaint types are listed below, with the encoding used in the data listed in parentheses:

- Bad Food (BadFood)
- Bad Neighborhood (ScaryMcDs)
- Cost (Cost)
- Dirty Location (Filthy)
- Missing Item (MissingFood)
- Problem with Order (OrderProblem)
- Rude Service (RudeService)
- Slow Service (SlowService)
- None of the above (na)

## Task 1

Read **`mcdonalds.csv`** into a Pandas DataFrame and examine it. (It can be found in the **`data`** directory of the course repository.)

- The **policies_violated** column lists the type of complaint. If there is more than one type, the types are separated by newline characters.
- The **policies_violated:confidence** column lists CrowdFlower's confidence in the judgments of its human annotators for that row (higher is better).
- The **city** column is the McDonald's location.
- The **review** column is the actual text comment.

In [85]:
df = pd.read_csv(mcdo_csv)
df = utils.clean_columns(df)
df.columns

Index([u'_unit_id', u'_golden', u'_unit_state', u'_trusted_judgments',
       u'_last_judgment_at', u'policies_violated',
       u'policies_violated_confidence', u'city', u'policies_violated_gold',
       u'review', u'unnamed__10'],
      dtype='object')

In [86]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1525 entries, 0 to 1524
Data columns (total 11 columns):
_unit_id                        1525 non-null int64
_golden                         1525 non-null bool
_unit_state                     1525 non-null object
_trusted_judgments              1525 non-null int64
_last_judgment_at               1525 non-null object
policies_violated               1471 non-null object
policies_violated_confidence    1471 non-null object
city                            1438 non-null object
policies_violated_gold          0 non-null float64
review                          1525 non-null object
unnamed__10                     0 non-null float64
dtypes: bool(1), float64(2), int64(2), object(6)
memory usage: 132.5+ KB


In [87]:
try:
    del df['unnamed__10']
    del df['policies_violated_gold']
except KeyError:
    pass


In [88]:
df['policies_violated'] = df.policies_violated.str.replace('\n', ',')
df['policies_violated_confidence'] = df['policies_violated_confidence'].str.replace('\n', ',')

## Task 2

Remove any rows from the DataFrame in which the **policies_violated** column has a **null value**. Check the shape of the DataFrame before and after to confirm that you only removed about 50 rows.

- **Note:** Null values are also known as "missing values", and are encoded in Pandas with the special value "NaN". This is distinct from the "na" encoding used by CrowdFlower to denote "None of the above". Rows that contain "na" should **not** be removed.
- **Hint:** This [code snippet](http://chrisalbon.com/python/pandas_missing_data.html) shows different ways for handling missing data in Pandas, and includes one strategy that will work for this task.

In [89]:
df = df.dropna(thresh=8)
df.info()



<class 'pandas.core.frame.DataFrame'>
Int64Index: 1471 entries, 0 to 1524
Data columns (total 9 columns):
_unit_id                        1471 non-null int64
_golden                         1471 non-null bool
_unit_state                     1471 non-null object
_trusted_judgments              1471 non-null int64
_last_judgment_at               1471 non-null object
policies_violated               1471 non-null object
policies_violated_confidence    1471 non-null object
city                            1385 non-null object
review                          1471 non-null object
dtypes: bool(1), int64(2), object(6)
memory usage: 104.9+ KB


In [90]:
df = df.reset_index()

## Task 3

Add a new column to the DataFrame called **"rude"** that is 1 if the **policies_violated** column contains "RudeService", and 0 if the **policies_violated** column does not contain "RudeService". The "rude" column is going to be your response variable, so check how many zeros and ones it contains.

- **Hint:** This [code snippet](http://chrisalbon.com/python/pandas_string_munging.html) shows how to use a Pandas string method to search for the presence of a sub-string. You will also have to figure out how to convert the boolean results (True/False) to integers (1/0).

In [91]:
df = utils.make_dummies(df, columns=['city'])

In [168]:
### This might be overkill, but useful. 

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
vect = CountVectorizer()

In [93]:
vect = vect.fit(df.policies_violated)

In [94]:
dtm = vect.transform(df.policies_violated)

In [95]:
policies_violated_df = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())

In [96]:
df = pd.concat([df, policies_violated_df], axis=1)

In [97]:
df.head(3)

Unnamed: 0,index,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,policies_violated,policies_violated_confidence,city,review,...,city_portland,badfood,cost,filthy,missingfood,na,orderproblem,rudeservice,scarymcds,slowservice
0,0,679455653,False,finalized,3,2/21/15 0:36,"RudeService,OrderProblem,Filthy","1.0,0.6667,0.6667",Atlanta,"I'm not a huge mcds lover, but I've been to be...",...,0,0,0,1,0,0,1,1,0,0
1,1,679455654,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,Terrible customer service. ŒæI came in at 9:30...,...,0,0,0,0,0,0,0,1,0,0
2,2,679455655,False,finalized,3,2/21/15 0:26,"SlowService,OrderProblem","1.0,1.0",Atlanta,"First they ""lost"" my order, actually they gave...",...,0,0,0,0,0,0,1,0,0,1


## Task 4

1. Define X (the **review** column) and y (the **rude** column).
2. Split X and y into training and testing sets (using the parameter **`random_state=1`**).
3. Use CountVectorizer (with the **default parameters**) to create document-term matrices from X_train and X_test.

In [17]:
X = df.review
y = df.rudeservice


In [18]:
# split X and y into training and testing sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [19]:
vect = CountVectorizer()    
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

## Task 5

Fit a Multinomial Naive Bayes model to the training set, calculate the **predicted probabilites** (not the class predictions) for the testing set, and then calculate the **AUC**. Repeat this task using a logistic regression model to see which of the two models achieves a better AUC.

- **Note:** Because McDonald's only cares about ranking the comments by the likelihood that they refer to rude service, **classification accuracy** is not the relevant evaluation metric. **Area Under the Curve (AUC)** is a more useful evaluation metric for this scenario, since it measures the ability of the classifier to assign higher predicted probabilities to positive instances than to negative instances.
- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to calculate predicted probabilities and AUC, and my [blog post and video](http://www.dataschool.io/roc-curves-and-auc-explained/) explain AUC in-depth.

In [27]:
reload(funcs)
for model in  funcs.many_nb_models(X_train_dtm, X_test_dtm, y_train, y_test):
    print model

Model: <class 'sklearn.naive_bayes.BernoulliNB'> 
 
    Accuracy Score: 0.673913043478 
 
    ROC AUC Score: 0.656541090447 
 
    Classification Report: 
              precision    recall  f1-score   support

          0       0.70      0.86      0.77       233
          1       0.59      0.36      0.44       135

avg / total       0.66      0.67      0.65       368

    
    
Model: <class 'sklearn.linear_model.logistic.LogisticRegression'> 
 
    Accuracy Score: 0.766304347826 
 
    ROC AUC Score: 0.823398505802 
 
    Classification Report: 
              precision    recall  f1-score   support

          0       0.78      0.88      0.83       233
          1       0.73      0.57      0.64       135

avg / total       0.76      0.77      0.76       368

    
    
Model: <class 'sklearn.naive_bayes.MultinomialNB'> 
 
    Accuracy Score: 0.798913043478 
 
    ROC AUC Score: 0.841964711493 
 
    Classification Report: 
              precision    recall  f1-score   support

         

## Task 6

Using either Naive Bayes or logistic regression (whichever one had a better AUC in the previous step), try **tuning CountVectorizer** using some of the techniques we learned in class. Check the testing set **AUC** after each change, and find the set of parameters that increases AUC the most.

- **Hint:** It is highly recommended that you adapt the **`tokenize_test()`** function from class for this purpose, since it will allow you to iterate quickly through different sets of parameters.

In [34]:
reload(tune_countvectorizer)
import tune_countvectorizer as tc


tc.find_best_cvect_params(X, y)

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__alpha': (1e-05, 1e-06),
 'clf__penalty': ('l2', 'elasticnet'),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=-1)]: Done   1 jobs       | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done  50 jobs       | elapsed:    7.7s
[Parallel(n_jobs=-1)]: Done  66 out of  72 | elapsed:   10.8s remaining:    1.0s
[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed:   11.5s finished


done in 12.275s

Best score: 0.787
Best parameters set:
	clf__alpha: 1e-05
	clf__penalty: 'elasticnet'
	vect__max_df: 0.5
	vect__ngram_range: (1, 2)


In [61]:
reload(tune_countvectorizer)
import tune_countvectorizer as tc

tc.find_best_cvect_params(X, y)

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__alpha': (1e-05,),
 'clf__n_iter': (80,),
 'clf__penalty': ('l2',),
 'tfidf__norm': ('l1', 'l2'),
 'tfidf__use_idf': (True, False),
 'vect__analyzer': ('word', 'char_wb'),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__max_features': (None, 5000, 10000, 50000),
 'vect__ngram_range': ((1, 1), (1, 2), (2, 2), (2, 3), (2, 4)),
 'vect__stop_words': ('english',)}
Fitting 3 folds for each of 480 candidates, totalling 1440 fits


[Parallel(n_jobs=-1)]: Done   1 jobs       | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done  50 jobs       | elapsed:   11.7s
[Parallel(n_jobs=-1)]: Done 200 jobs       | elapsed:   53.9s
[Parallel(n_jobs=-1)]: Done 450 jobs       | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 800 jobs       | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 1250 jobs       | elapsed:  8.0min
[Parallel(n_jobs=-1)]: Done 1434 out of 1440 | elapsed:  9.4min remaining:    2.4s
[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed:  9.4min finished


done in 567.201s

Best score: 0.787
Best parameters set:
	clf__alpha: 1e-05
	clf__n_iter: 80
	clf__penalty: 'l2'
	tfidf__norm: 'l2'
	tfidf__use_idf: True
	vect__analyzer: 'word'
	vect__max_df: 0.75
	vect__max_features: 50000
	vect__ngram_range: (1, 2)
	vect__stop_words: 'english'


## Task 7 (Challenge)

The **city** column might be predictive of the response, but we are not currently using it as a feature. Let's see whether we can increase the AUC by adding it to the model:

1. Create a new DataFrame column, **review_city**, that includes both the **review** text and the **city** text. One easy way to combine string columns in Pandas is by using the [`Series.str.cat()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.cat.html) method. Make sure to use the **space character** as a separator, as well as replacing **null city values** with a reasonable string value (such as 'na').
2. Redefine X as the **review_city** column, and re-split X and y into training and testing sets.
3. When you run **`tokenize_test()`**, CountVectorizer will simply treat the city as an extra word in the review, and thus it will automatically be included in the model! Check to see whether it increased or decreased the AUC of your **best model**.

In [109]:
df['review_with_city'] = df['review'].str.cat(df['city'].values.astype(str), sep=' ')


In [110]:
X = df.review_with_city
y = df.rudeservice

In [129]:
# split X and y into training and testing sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [136]:
params = {'analyzer':'word', 
          'max_df':0.75, 
          'max_features':50000, 
          'ngram_range':(1, 3), 
          'lowercase': True,

          'stop_words':'english'}

vect = CountVectorizer(params)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [137]:
reload(funcs)
for model in  funcs.many_nb_models(X_train_dtm, X_test_dtm, y_train, y_test):
    print model

Model: <class 'sklearn.naive_bayes.BernoulliNB'> 
 
    Accuracy Score: 0.671195652174 
 
    ROC AUC Score: 0.65773326975 
 
    Classification Report: 
              precision    recall  f1-score   support

          0       0.70      0.85      0.77       233
          1       0.59      0.36      0.44       135

avg / total       0.66      0.67      0.65       368

    
    
Model: <class 'sklearn.linear_model.logistic.LogisticRegression'> 
 
    Accuracy Score: 0.766304347826 
 
    ROC AUC Score: 0.820696232713 
 
    Classification Report: 
              precision    recall  f1-score   support

          0       0.78      0.88      0.83       233
          1       0.73      0.57      0.64       135

avg / total       0.76      0.77      0.76       368

    
    
Model: <class 'sklearn.naive_bayes.MultinomialNB'> 
 
    Accuracy Score: 0.785326086957 
 
    ROC AUC Score: 0.842568749007 
 
    Classification Report: 
              precision    recall  f1-score   support

          

## Task 8 (Challenge)

The **policies_violated:confidence** column may be useful, since it essentially represents a measurement of the training data quality. Let's see whether we can improve the AUC by only training the model using the highest-quality rows:

1. Using the [`Series.str.split()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html) method, convert the **policies_violated:confidence** column into lists of one or more "confidence scores". Save the results as a new DataFrame column called **confidence_list**.
2. Define a function that calculates the mean of a list of numbers, and pass that function to the [`Series.apply()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html) method for the **confidence_list** column. That will calculate the mean confidence score for each row. Save those scores in a new DataFrame column called **confidence_mean**.
3. Create a new DataFrame that only includes the rows with a **confidence_mean** of 1. Compare the shapes of the original and new DataFrames.
4. Redefine X and y using the new DataFrame, and re-split X and y into training and testing sets.
5. Check to see whether this process increased or decreased the AUC of your **best model**.
6. Try **re-tuning** CountVectorizer to maximize the AUC, to see if this strategy was worthwhile.

In [155]:
df['confidence_list'] =df.policies_violated_confidence.str.split(',').apply(lambda x: np.mean([float(n) for n in x ]))

In [159]:
fullconf_df = df[df.confidence_list==1]

In [160]:
X = fullconf_df.review_with_city
y = fullconf_df.rudeservice


In [161]:
reload(tune_countvectorizer)
import tune_countvectorizer as tc

tc.find_best_cvect_params(X, y)

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__alpha': (1e-05,),
 'clf__n_iter': (80,),
 'clf__penalty': ('l2',),
 'tfidf__norm': ('l1', 'l2'),
 'tfidf__use_idf': (True, False),
 'vect__analyzer': ('word', 'char_wb'),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__max_features': (None, 5000, 10000, 50000),
 'vect__ngram_range': ((1, 1), (1, 2), (2, 2), (2, 3), (2, 4)),
 'vect__stop_words': ('english',)}
Fitting 3 folds for each of 480 candidates, totalling 1440 fits


[Parallel(n_jobs=-1)]: Done   1 jobs       | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done  50 jobs       | elapsed:    5.5s
[Parallel(n_jobs=-1)]: Done 200 jobs       | elapsed:   23.2s
[Parallel(n_jobs=-1)]: Done 450 jobs       | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 800 jobs       | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 1250 jobs       | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done 1434 out of 1440 | elapsed:  4.5min remaining:    1.1s
[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed:  4.6min finished


done in 273.563s

Best score: 0.868
Best parameters set:
	clf__alpha: 1e-05
	clf__n_iter: 80
	clf__penalty: 'l2'
	tfidf__norm: 'l2'
	tfidf__use_idf: False
	vect__analyzer: 'word'
	vect__max_df: 0.5
	vect__max_features: None
	vect__ngram_range: (1, 2)
	vect__stop_words: 'english'


In [None]:
# split X and y into training and testing sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [175]:
params = {'analyzer':'word', 
          'max_df':0.5, 
          'max_features':None, 
          'ngram_range':(1, 2), 
          'lowercase': True,
          'norm': 'l2',
          'use_idf': False,
          'stop_words':'english',
         }

vect = TfidfVectorizer(params)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [176]:
reload(funcs)
for model in  funcs.many_nb_models(X_train_dtm, X_test_dtm, y_train, y_test):
    print model

Model: <class 'sklearn.naive_bayes.BernoulliNB'> 
 
    Accuracy Score: 0.671195652174 
 
    ROC AUC Score: 0.65773326975 
 
    Classification Report: 
              precision    recall  f1-score   support

          0       0.70      0.85      0.77       233
          1       0.59      0.36      0.44       135

avg / total       0.66      0.67      0.65       368

    
    
Model: <class 'sklearn.linear_model.logistic.LogisticRegression'> 
 
    Accuracy Score: 0.736413043478 
 
    ROC AUC Score: 0.846383722779 
 
    Classification Report: 
              precision    recall  f1-score   support

          0       0.72      0.96      0.82       233
          1       0.83      0.36      0.50       135

avg / total       0.76      0.74      0.70       368

    
    
Model: <class 'sklearn.naive_bayes.MultinomialNB'> 
 
    Accuracy Score: 0.638586956522 
 
    ROC AUC Score: 0.81589572405 
 
    Classification Report: 
              precision    recall  f1-score   support

          0

## Task 9 (Challenge)

New comments have been submitted to the McDonald's website, and you need to **score them with the likelihood** that they are referring to rude service.

1. Before making predictions on out-of-sample data, it is important to re-train your model on all relevant data using the tuning parameters and preprocessing steps that produced the best AUC above.
    - In other words, X should be defined using either **all rows** or **only those rows with a confidence_mean of 1**, whichever produced a better AUC above.
    - X should refer to either the **review column** or the **review_city column**, whichever produced a better AUC above.
    - CountVectorizer should be instantiated with the **tuning parameters** that produced the best AUC above.
    - **`train_test_split()`** should not be used during this process.
2. Build a document-term matrix (from X) called **X_dtm**, and examine its shape.
3. Read the new comments stored in **`mcdonalds_new.csv`** into DataFrame called **new_comments**, and examine it.
4. If your model uses a **review_city** column, create that column in the new_comments DataFrame. (Otherwise, skip this step.)
5. Build a document_term matrix (from the **new_comments** DataFrame) called **new_dtm**, and examine its shape.
6. Train your best model (Naive Bayes or logistic regression) using **X_dtm** and **y**.
7. Predict the "rude probability" for each comment in **new_dtm**, and store the probabilities in an object called **new_pred_prob**.
8. Print the **full text** for each new comment alongside its **"rude probability"**. Examine the results, and comment on how well you think the model performed!

In [187]:
X = fullconf_df.review_with_city
y = fullconf_df.rudeservice

X.shape == y.shape

True

In [183]:
params = {'analyzer':'word', 
          'max_df':0.5, 
          'max_features':None, 
          'ngram_range':(1, 2), 
          'lowercase': True,
          'norm': 'l2',
          'use_idf': False,
          'stop_words':'english',
         }

vect = TfidfVectorizer(params)
X_train_dtm = vect.fit_transform(X)


In [182]:
new_comments = pd.read_csv(mcdo_test)
new_comments.head(3)

Unnamed: 0,city,review
0,Las Vegas,Went through the drive through and ordered a #...
1,Chicago,Phenomenal experience. Efficient and friendly ...
2,Los Angeles,Ghetto lady helped me at the drive thru. Very ...


In [195]:
new_comments['review_with_city'] = new_comments['review'].str.cat(new_comments['city'].values.astype(str), sep=' ')
new_comments.review



0    Went through the drive through and ordered a #...
1    Phenomenal experience. Efficient and friendly ...
2    Ghetto lady helped me at the drive thru. Very ...
3    Close to my workplace. It was well manged befo...
4    I've made at least 3 visits to this particular...
5    Why did I revisited this McDonald's  again.  I...
6    This specific McDonald's is the bar I hold all...
7    My friend and I stopped in to get a late night...
8    Friendly people but completely unable to deliv...
9    Having visited many McDonald's over the years,...
Name: review, dtype: object

In [191]:
new_reviews = new_comments['review_with_city']

new_comments_dtm = vect.transform(new_reviews)

In [222]:
from sklearn.linear_model import LogisticRegression


from sklearn.naive_bayes import MultinomialNB, BernoulliNB

nb = LogisticRegression()
nb.fit(X_train_dtm, y)
y_pred_class = nb.predict(new_comments_dtm)

y_pred_prob = nb.predict_proba(new_comments_dtm)[:, 1]

new_comments['rude_proba'] =  y_pred_prob

In [223]:

new_comments.sort_values('rude_proba', ascending=False)

Unnamed: 0,city,review,review_with_city,rude_proba
7,Dallas,My friend and I stopped in to get a late night...,My friend and I stopped in to get a late night...,0.573713
2,Los Angeles,Ghetto lady helped me at the drive thru. Very ...,Ghetto lady helped me at the drive thru. Very ...,0.466396
0,Las Vegas,Went through the drive through and ordered a #...,Went through the drive through and ordered a #...,0.344322
4,Portland,I've made at least 3 visits to this particular...,I've made at least 3 visits to this particular...,0.266857
8,Cleveland,Friendly people but completely unable to deliv...,Friendly people but completely unable to deliv...,0.250615
1,Chicago,Phenomenal experience. Efficient and friendly ...,Phenomenal experience. Efficient and friendly ...,0.238424
9,,"Having visited many McDonald's over the years,...","Having visited many McDonald's over the years,...",0.197025
6,Atlanta,This specific McDonald's is the bar I hold all...,This specific McDonald's is the bar I hold all...,0.190381
5,Houston,Why did I revisited this McDonald's again. I...,Why did I revisited this McDonald's again. I...,0.17167
3,New York,Close to my workplace. It was well manged befo...,Close to my workplace. It was well manged befo...,0.140397


In [245]:
for index in new_comments.index:
    row = new_comments.loc[index]
    print 'Rude Proba: %s \n' % (row.rude_proba) 
    print row.review
    print '\n' + '_'*20 + '\n'

Rude Proba: 0.344321810886 

Went through the drive through and ordered a #10 (cripsy sweet chili chicken wrap) without fries- the lady couldn't understand that I did not want fries and charged me for them anyways. I got the wrong order- a chicken sandwich and a large fries- my boyfriend took it back inside to get the correct order. The gentleman that ordered the chicken sandwich was standing there as well and she took the bag from my bf- glanced at the insides and handed it to the man without even offering to replace. I mean with all the scares about viruses going around... ugh DISGUSTING SERVICE. Then when she gave him the correct order my wrap not only had the sweet chili sauce on it, but the nasty (just not my first choice) ranch dressing on it!!!! I mean seriously... how lazy can you get!!!! I worked at McDonalds in Texas when I was 17 for about 8 months and I guess I was spoiled with good management. This was absolutely ridiculous. I was beyond disappointed.

____________________

In [207]:
new_comments.loc[2].review

'Ghetto lady helped me at the drive thru. Very rude and disrespectful to the co workers. Never coming back. Yuck!'

In [209]:
new_comments.loc[0].review

"Went through the drive through and ordered a #10 (cripsy sweet chili chicken wrap) without fries- the lady couldn't understand that I did not want fries and charged me for them anyways. I got the wrong order- a chicken sandwich and a large fries- my boyfriend took it back inside to get the correct order. The gentleman that ordered the chicken sandwich was standing there as well and she took the bag from my bf- glanced at the insides and handed it to the man without even offering to replace. I mean with all the scares about viruses going around... ugh DISGUSTING SERVICE. Then when she gave him the correct order my wrap not only had the sweet chili sauce on it, but the nasty (just not my first choice) ranch dressing on it!!!! I mean seriously... how lazy can you get!!!! I worked at McDonalds in Texas when I was 17 for about 8 months and I guess I was spoiled with good management. This was absolutely ridiculous. I was beyond disappointed."

In [214]:
new_comments.loc[4].review

"I've made at least 3 visits to this particular location just because it's right next to my office building.. and all my experience have been consistently bad.  There are a few helpers taking your orders throughout the drive-thru route and they are the worst. They rush you in placing an order and gets impatient once the order gets a tad bit complicated.  Don't even bother changing your mind oh NO! They will glare at you and snap at you if you want to change something.  I understand its FAST food, but I want my order placed right.  Not going back if I can help it."

In [224]:
new_comments.loc[8].review

'Friendly people but completely unable to deliver what was ordered at the drive through.  Out of my last 6 orders they got it right 3 times.  Incidentally, the billing was always correct - they just could not read the order and deliver.  Very frustrating!'