#Pandas Express: NLP
###An express guide to becoming a Kung Fu Pandas master

<img src="http://vignette1.wikia.nocookie.net/kungfupanda/images/8/88/Po2.jpg/revision/latest?cb=20100726062228" width="300"/>

After defeating the evil snow leopard Tai Lung, our favorite kung fu panda master Po returns to the Valley of Peace to help his father Mr. Ping with his noodle restaurant. Mr. Ping's noodle restaurant hasn't been doing so well, so Po is determined to help his dad figure out what he can do to improve his restaurant. Luckily, Po has been trained in the revered and ancient Python style of Shaolin martial arts and will analyze a dataset from Yelp to save his father's restaurant, like a true Kung Fu Pandas master.

###The Tools
This tutorial will walk you through doing some basic data cleaning and exploratory analysis with Pandas and a suite of other Python data analysis tools. Below are a few of the tools we will be using:

* [numpy](http://docs.scipy.org/doc/numpy-dev/user/index.html), for arrays
* [pandas](http://pandas.pydata.org/), for data frames
* [matplotlib](http://matplotlib.org/), for plotting
* [seaborn](http://stanford.edu/~mwaskom/software/seaborn/), for making plots pretty
* [statsmodels](http://statsmodels.sourceforge.net/), for statistical analysis
* [sklearn](http://scikit-learn.org), for machine learning

In [2]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels
import sklearn 
import nltk

# Download NLTK text datasets
nltk.download() 

# iPython command to format matplotlib plots
%matplotlib inline 

If you have trouble importing any of the packages, you might need to install them first, either from the package's website or, if you're on Mac OS or Ubuntu, from the console with: `pip install <name of package>`

###The Dataset
We will be using a dataset of Yelp reviews provided by the [Yelp Dataset Challenge](http://www.yelp.com/dataset_challenge). The download consists of the following files in JSON format:
* business.json - information on businesses
* review.json - text and metadata of reviews
* tip.json - text and metadata of tips
* user.json - information on users
* checkin.json - number of checkins at each business

In this tutorial, we will focus primarily on the business.json file, after a first look at the review text in review.json.

###Loading in and cleaning the data

####Load in data

In [3]:
import json

def load_data(filepath):
    """
    Given a filepath to a file with one JSON object per line,
    loads the file and returns a list of parsed records (dicts).
    """
    data = []
    
    # Open file and read in line by line
    with open(filepath) as file:
        for line in file:
            # Strip out trailing whitespace at the end of the line
            data.append(json.loads(line.rstrip()))

    return data

In [7]:
data = load_data('data/review.json')

In [8]:
review_df = pd.DataFrame.from_dict(data)
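
As an alternative to the hand-written loader, pandas can usually parse this one-object-per-line layout directly. The following is a minimal sketch, assuming a reasonably recent pandas where `pd.read_json` accepts `lines=True`. The two records and the names `sample` and `sample_df` are made up for illustration and are not part of the Yelp data.

```python
import io

import pandas as pd

# Two made-up records in the same one-JSON-object-per-line layout
sample = io.StringIO(
    '{"business_id": "b1", "stars": 5, "text": "great noodles"}\n'
    '{"business_id": "b2", "stars": 2, "text": "slow service"}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object
sample_df = pd.read_json(sample, lines=True)

print(sample_df.shape)
```

With the real data you would pass the file path (for example `'data/review.json'`) instead of the in-memory buffer.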

####Now let's take a peek inside
The [Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/api.html#dataframe) has a full list of functions, but below are some helpful ones for doing some initial poking around. 

In [10]:
review_df.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes
0,vcNAWiLM4dR7D2nwwJ7nCA,2007-05-17,15SdjuK7DmYqUAj6rjGowg,5,dr. goldberg offers everything i look for in a...,review,Xqd0DzHaiyRqVH3WRG7hzg,"{u'funny': 0, u'useful': 2, u'cool': 1}"
1,vcNAWiLM4dR7D2nwwJ7nCA,2010-03-22,RF6UnRTtG7tWMcrO2GEoAg,2,"Unfortunately, the frustration of being Dr. Go...",review,H1kH6QZV7Le4zqTRNxoZow,"{u'funny': 0, u'useful': 2, u'cool': 0}"
2,vcNAWiLM4dR7D2nwwJ7nCA,2012-02-14,-TsVN230RCkLYKBeLsuz7A,4,Dr. Goldberg has been my doctor for years and ...,review,zvJCcrpm2yOZrxKffwGQLA,"{u'funny': 0, u'useful': 1, u'cool': 1}"
3,vcNAWiLM4dR7D2nwwJ7nCA,2012-03-02,dNocEAyUucjT371NNND41Q,4,Been going to Dr. Goldberg for over 10 years. ...,review,KBLW4wJA_fwoWmMhiHRVOA,"{u'funny': 0, u'useful': 0, u'cool': 0}"
4,vcNAWiLM4dR7D2nwwJ7nCA,2012-05-15,ebcN2aqmNUuYNoyvQErgnA,4,Got a letter in the mail last week that said D...,review,zvJCcrpm2yOZrxKffwGQLA,"{u'funny': 0, u'useful': 2, u'cool': 1}"


In [12]:
# Note: newer pandas versions use show_counts; older ones call it null_counts
review_df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1569264 entries, 0 to 1569263
Data columns (total 8 columns):
business_id    1569264 non-null object
date           1569264 non-null object
review_id      1569264 non-null object
stars          1569264 non-null int64
text           1569264 non-null object
type           1569264 non-null object
user_id        1569264 non-null object
votes          1569264 non-null object
dtypes: int64(1), object(7)
memory usage: 107.8+ MB


In [14]:
review_df.shape

(1569264, 8)

In [17]:
review_df['text'][0]

u"dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."

###Cleaning data and pre-processing text

####Flatten votes column

In [18]:
# Format the votes as a list of dict objects
votes_dict = [{'votes': x} for x in review_df['votes'].values]

In [19]:
# Create a DataFrame with json_normalize
# (pd.json_normalize in pandas 1.0+; older versions expose it as pd.io.json.json_normalize)
votes_df = pd.json_normalize(votes_dict)

In [None]:
# Merge the formatted votes_df with our original review_df
review_df = pd.merge(review_df, votes_df, left_index = True, right_index = True)

# Drop the votes column
review_df = review_df.drop('votes', axis=1)
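
To see what the flattening step does, here is a small sketch on a made-up two-row frame. It is not the notebook's exact code: it assumes every entry in `votes` is a plain dict and builds the new columns with the `DataFrame` constructor instead of `json_normalize`. The names `toy_df`, `toy_votes` and `toy_flat` are hypothetical.

```python
import pandas as pd

# Made-up reviews with a dict-valued 'votes' column, as in review.json
toy_df = pd.DataFrame({
    'review_id': ['r1', 'r2'],
    'votes': [{'funny': 0, 'useful': 2, 'cool': 1},
              {'funny': 1, 'useful': 0, 'cool': 0}],
})

# Build one column per dict key, keeping the original row index
toy_votes = pd.DataFrame(toy_df['votes'].tolist(), index=toy_df.index)

# Prefix the names so they mirror the 'votes.<key>' style of json_normalize
toy_votes = toy_votes.add_prefix('votes.')

# Replace the dict column with the flattened columns
toy_flat = toy_df.drop('votes', axis=1).join(toy_votes)

print(toy_flat.columns.tolist())
```

Joining on the index, as the merge above does, relies on both frames sharing the same row labels.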

####Tokenize

In [None]:
# Convert text to lowercase and tokenize words
review_df['text'] = review_df['text'].str.lower().str.split()

In [None]:
review_df['text']
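
Splitting on whitespace is the simplest possible tokenizer, and it leaves punctuation glued to the words. The sketch below compares it with a regex-based variant on two made-up sentences. The regex pattern and the names `toy_text`, `whitespace_tokens` and `regex_tokens` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

toy_text = pd.Series(["Really, what more do you need?", "He's always on time."])

# Whitespace split: punctuation stays attached to the neighbouring word
whitespace_tokens = toy_text.str.lower().str.split()

# Regex alternative: keep only runs of letters, digits and apostrophes
regex_tokens = toy_text.str.lower().str.findall(r"[a-z0-9']+")

print(whitespace_tokens[0])
print(regex_tokens[0])
```

Which variant is preferable depends on the later analysis. For word counts, stripping punctuation usually avoids treating `need?` and `need` as different tokens.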

####Create dummy/indicator variables for categories column

Next up, if we now look at the `categories` column of the business data (assumed here to be loaded from business.json into `business_df` the same way we loaded the reviews), we see that the categories are stored as lists. While that's easy to read, it's not the most usable format for data analysis (for example, if we wanted to know how many Chinese restaurants we had in our dataset). We want to create dummy variables for the categories similar to what we did for attributes, but the categories pose an interesting dilemma because they are stored as lists. So we are going to use a slightly modified version of [`get_dummies`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html): we first join each list into a single string using a string operator and then create the dummy variables from that string.

In [None]:
# Create dummy variables for categories
categories_df = business_df['categories'].str.join(sep=',').str.get_dummies(sep=',')

In [None]:
# Save the list of categories for future use
categories = categories_df.columns.values

In [None]:
# Merge it with our original dataframe
business_df = pd.merge(business_df, categories_df, left_index = True, right_index = True)

Instead of dropping the `categories` column, we're going to keep it around, but reformat it as a tuple.

In [None]:
business_df['categories'] = business_df['categories'].apply(lambda x: tuple(x))

Now we can do things like, say, filter `business_df` for all Chinese restaurants, or do a count of the number of Chinese restaurants to size up the competition.

In [None]:
business_df[business_df['Chinese'] == 1].head()

In [None]:
business_df['Chinese'].sum()
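
The join-then-split trick is easier to follow on a tiny example. This sketch uses three made-up businesses and `|` as the separator, on the assumption that it is unlikely to appear inside a category name. The names `toy_business` and `toy_dummies` are hypothetical.

```python
import pandas as pd

# Made-up businesses whose categories are stored as lists
toy_business = pd.DataFrame({
    'name': ["Mr. Ping's Noodles", 'Shear Delight', 'Slice of Peace'],
    'categories': [['Chinese', 'Restaurants'],
                   ['Hair Salons'],
                   ['Pizza', 'Restaurants']],
})

# Join each list into one string, then split it back into 0/1 columns
toy_dummies = toy_business['categories'].str.join('|').str.get_dummies(sep='|')

print(toy_dummies)
print(toy_dummies['Restaurants'].sum())
```

Each distinct category becomes its own indicator column, so counting the businesses in a category is just a column sum.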

####So far so good!
There is definitely more clean-up work to be done with our datasets (we can continue to work with the `neighbors` or `hour` columns), but for now, we're ready to start doing some analysis!

###Descriptive Statistics
First, we might be interested in some basic descriptive statistics about our dataset. With a series of filters and statistical functions, we can do some initial exploratory analysis.

####Looking at relevant attributes
If we look at our attributes again, we see that there is a good amount of missing info (because, for example, certain attributes like `Hair Types Specialized In` simply aren't going to be applicable to any businesses other than hair salons). Since we are looking at restaurants for now, we might want to know the attributes that have the most non-null values, and therefore potentially the more important attributes for restaurants.

In [None]:
# Count the number of non-null attributes 
nonnull_attributes_count = business_df[business_df['Restaurants'] == 1][numeric_attributes].notnull().sum()

# Sort the attribute counts
sorted_attributes = nonnull_attributes_count.sort_values(ascending=False)

# Print the top 20
sorted_attributes[:20]
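
The counting step can be checked on a small made-up frame. In this sketch the column names echo Yelp-style attributes but the values are invented, and `toy_attrs` and `toy_counts` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up attribute columns with different amounts of missing data
toy_attrs = pd.DataFrame({
    'Price Range': [1, 2, 2, np.nan],
    'Take-out': [1, 1, np.nan, np.nan],
    'Hair Types Specialized In': [np.nan, np.nan, np.nan, np.nan],
})

# notnull() marks filled cells as True; sum() counts them per column
toy_counts = toy_attrs.notnull().sum().sort_values(ascending=False)

print(toy_counts)
```

Attributes that are never filled in for the selected rows end up at the bottom with a count of 0.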

####Top restaurant categories

In [None]:
# Count the number of restaurants in each category
restaurant_category_counts = business_df[business_df['Restaurants'] == 1][categories].sum()

# Sort the category counts
sorted_categories = restaurant_category_counts.sort_values(ascending=False)

# Print the top 20
sorted_categories[:20]

In [None]:
# Get the categories that are not relevant to restaurants 
non_restaurant_categories = restaurant_category_counts[restaurant_category_counts <= 0].index.values

###Visualizing the data

We might want to also generate some plots to visualize our data. Python has a number of visualization libraries, some built on top of others. We will primarily be using [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/index.html), which is a library based on [matplotlib](http://matplotlib.org/), but feel free to check out some of the other options as well!

####Ratings Distribution

In [None]:
sns.set(rc={"figure.figsize": (8, 4)})

data = business_df['stars']
sns.distplot(data, kde=False, bins=10)

# Add headers and labels to the plot
plt.title('Ratings Distribution')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

# Print some descriptive statistics
print("Mean: %f" % data.mean())
print("Min: %f" % data.min())
print("Max: %f" % data.max())

####Ratings Distribution for Chinese Restaurants

In [None]:
sns.set(rc={"figure.figsize": (8, 4)})

data = business_df[business_df['Chinese'] == 1]['stars']
sns.distplot(data, kde=False, bins=10)

# Add headers and labels to the plot
plt.title('Ratings Distribution for Chinese Restaurants')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

# Print some descriptive statistics
print("Mean: %f" % data.mean())
print("Min: %f" % data.min())
print("Max: %f" % data.max())

###Analysis
There are a variety of methods that we could use to conduct the analysis we want to do, but here, we just fit a very simple classifier to see which features are important for creating a good restaurant.

####Select the data we want and format it for use with sklearn

In [None]:
# Get just the numeric columns
numeric_only = business_df.select_dtypes(exclude=['object'])

In [None]:
# Filter for the attributes and categories we have most information about
filtered_df = (numeric_only
                .drop('open', axis=1)
                .drop(sorted_attributes[20:].index.values, axis=1)
                .drop(['latitude', 'longitude'], axis=1)
                .drop(non_restaurant_categories, axis=1))

For now, we just replace any NaN values with 0, but in practice, there are better ways of filling in missing data.

In [None]:
# Fill any na values with 0
filtered_df = filtered_df.fillna(0)
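
One common alternative to filling with 0 is to fill each column with its own median. The sketch below contrasts the two on made-up numbers. It illustrates that option only and is not a recommendation specific to this dataset. The names `toy`, `zero_filled` and `median_filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns with gaps
toy = pd.DataFrame({
    'Price Range': [1.0, 2.0, np.nan, 3.0],
    'review_count': [10.0, np.nan, 30.0, 50.0],
})

# What the notebook does: every gap becomes 0
zero_filled = toy.fillna(0)

# Alternative: every gap becomes the median of its own column
median_filled = toy.fillna(toy.median())

print(median_filled)
```

Filling with 0 can look like a real value (a price range of 0, for instance), whereas the median keeps the filled cell within the range of the observed data.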

Create training and test sets and pull out the labels (in this case, we are looking at stars).

In [None]:
# Split into data and labels
data = filtered_df[filtered_df['Restaurants'] == 1].drop('stars', axis=1)
labels = filtered_df[filtered_df['Restaurants'] == 1]['stars']

In [None]:
# Format labels as dummy variables for classification
labels = labels.astype(str).str.get_dummies()

In [None]:
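# train_test_split lives in sklearn.model_selection (the older sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split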

# Split into test and train sets
train_data, test_data, train_labels, test_labels = train_test_split(
    data.values, labels.values, test_size=0.3, random_state=42)

####Using a random forest classifier to look at feature importance

In [None]:
from sklearn.ensemble import RandomForestClassifier

features = data.columns.values

# Instantiate the classifier
# max_features='sqrt' is what 'auto' meant for classifiers; newer sklearn no longer accepts 'auto'
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt', max_depth=4)

# Fit the classifier to our training data
clf = clf.fit(train_data,train_labels)

# look at feature importance
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")
sorted_features = []
for f in range(len(indices)):
    print("%s | %f" % (features[indices[f]], importances[indices[f]]))
    sorted_features.append(features[indices[f]])
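
The cells above never score the model on the held-out test set, so the importances come without a sense of how well the forest actually predicts. The sketch below shows the full loop (split, fit, score, rank) on synthetic data, assuming a scikit-learn version that provides `sklearn.model_selection`. The features `signal`, `noise_a` and `noise_b` are invented so that only `signal` determines the label, and every `toy_` name is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on 'signal' only
rng = np.random.RandomState(42)
toy_X = pd.DataFrame({
    'signal': rng.normal(size=300),
    'noise_a': rng.normal(size=300),
    'noise_b': rng.normal(size=300),
})
toy_y = (toy_X['signal'] > 0).astype(int)

# Hold out 30% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    toy_X.values, toy_y.values, test_size=0.3, random_state=42)

# Fit a small forest and measure accuracy on the held-out rows
toy_clf = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=42)
toy_clf.fit(X_train, y_train)
test_accuracy = toy_clf.score(X_test, y_test)

# Pair each feature name with its importance, largest first
toy_ranking = sorted(zip(toy_X.columns, toy_clf.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)

print("Test accuracy: %f" % test_accuracy)
for name, importance in toy_ranking:
    print("%s | %f" % (name, importance))
```

A feature ranking is only as trustworthy as the model behind it, so it is worth checking the test accuracy before reading much into the importances.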

Based on this analysis, Po might try to focus on the attributes that had large importance in determining a restaurant's rating. There are, however, many additional statistical and machine learning techniques we can use to better help Po conduct his analysis.

Stay tuned for future tutorials on how Po can use techniques like natural language processing or network analysis to better help his father's restaurant!

<img src="http://img3.wikia.nocookie.net/__cb20100727192424/kungfupanda/images/4/47/Po%26Mr.Ping.jpg"/>