# **Part 0: Introduction to Yelp Dataset Project**

### Part 0.0: Yelp Academic Dataset

This project delves into exploratory analysis and building predictive models using the [Yelp academic dataset](https://www.yelp.com/dataset_challenge/). It is an opportunity for you to explore a machine learning problem in the context of a real-world data set using big data analysis tools. In order to use the dataset and finish this project, you must agree to the dataset's terms of use provided [here](https://www.yelp.com/html/pdf/Dataset_Challenge_Academic_Dataset_Agreement.pdf).

We have chosen a subset of the Yelp academic dataset for you to work with. This subsampled data is loaded into RDDs in section (1). The complete dataset is available from Yelp's website [here](https://www.yelp.com/dataset_challenge/dataset). Remember that you are limited by the DataBricks Community Edition's limits on memory and computation. Yelp has provided some example code at their GitHub repository [here](https://github.com/Yelp/dataset-examples) that might be helpful in getting started. However, these examples are pure Python rather than Spark code, so they do not provide parallelism.

By design, the project is open-ended; you are free to decide how you want to approach the problem and what tools you want to employ. We want to see a best-effort solution that uses what you learned in class and, potentially, tries new things beyond it. Your project will be worth 20% of your final class grade.

### Part 0.1: Grading Rubric:

**Course staff will use the following rubric when grading your final project reports:**


*  *Introduction/Motivation/Problem Definition (10%)*
  * Identify, define, and motivate the problem that you are addressing.
  * How (precisely) will a machine learning solution address the problem?

*  *Data Understanding and Preparation (15%)*
  * What preliminary analyses have you performed on the data? What observations have you made? How did those observations help shape your approach?
  * Provide the preliminary data analysis results and your observations.
  * Specify how the data will be transformed to the format required for machine learning. 

*  *Methodology (35%)*
  * This is where you give a detailed description of your primary contributions. It is especially important that this part be clear and well written so that we can fully understand what you did.
  * Specify the type of model(s) built and/or information/knowledge extracted.
  * Discuss choices for machine learning algorithm: what are other alternatives, and what are their pros and cons (in the context of the problem and as compared to your proposed solution)?
  * Discuss why and how this model should "solve" the problem (i.e., improve along some dimension of interest). 
  * Outline the big data analysis tools and libraries you have used. 

It is not so important how well your method performs but rather, (a) how thorough and careful your methodology is, and (b) how interesting and clever the approaches you took and the tools you have used are. 

*  *Evaluation and Results (30%)*
  * We are interested in seeing a clear and conclusive set of experiments which successfully evaluate the problem you set out to solve. Make sure to interpret the results and talk about what we can conclude and learn from your approach.
  * How do you evaluate your machine learning solution to the specific question(s) you have addressed?
  * What do these results tell you about your solution?
  * Present and discuss your evaluation results and findings. You may use tables or figures (e.g. ROC plot) to visualize your results.

*  *Style and writing (10%)*
  * Overall writing, grammar, organization, figures and illustrations.
 
Note that, for reference, you can look up the details of the relevant Spark methods in [Spark's Python API](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD) and the relevant NumPy methods in the [NumPy Reference](http://docs.scipy.org/doc/numpy/reference/index.html).

### **Part 0.2: Code of Conduct**

**Please follow the following guidelines with respect to collaboration:**

* You have to use the data we have provided you. You cannot choose your own dataset. By using the dataset, you agree to Yelp's terms of use available [here](https://www.yelp.com/html/pdf/Dataset_Challenge_Academic_Dataset_Agreement.pdf).
* You will be given 48 hours to work on the project. Use of late days is not allowed for this submission.
* You are free to use the Web, APIs, ML toolkits, etc. in this project to your best benefit. Please credit any online or offline sources (even casual sources like StackOverflow) if you use them in the project.
* The project is to be done individually. No collaboration is allowed between students. No discussion about the project is allowed with anyone except the class instructors. Students who use each other's ideas or code will be heavily penalized.

### **Part 0.3: Project Suggestions**

Before you embark on the project, please plan out your task by breaking it into smaller chunks that incrementally build on top of each other. For example, you may begin with a simpler set of features and then add more complex features to the dataset. Such modular planning will ensure that you will have a working deliverable in case you run out of time tackling more complicated aspects of the project you had planned to complete. Try to have a barebones but working version of the project after 24 hours, and build on it in the next 24 hours leading up to the deadline. Create a backup version of your notebook after finishing a substantial chunk of the work so you can go back to a working version in case of a catastrophe.

**Here is a list of potential aspects you can tackle in this final open-ended project:**

*  *Exploratory Data Analysis (Perform those that help you get started with your chosen business question, not all of these.)*
  * Plot a map showing the locations of various businesses. Helper code for creating maps using "mpl_toolkits" is provided in a later section on exploratory data analysis.
  * Plot a map showing the locations of business checkins made by Yelp users.
  * Plot a histogram of ratings that the businesses get i.e. see how many businesses got ratings of 1-5 each. Is this distribution skewed? Are there ratings that are used rarely as compared to others?
  * What are the most popular keywords that reviewers use by city or state?
  * What are the most popular keywords that reviewers use for American/Thai/Chinese restaurants by state?
  * What are the most popular keywords or adjectives that reviewers use for American/Thai/Chinese restaurants by state?
  * What are the most popular keywords or adjectives that reviewers use in 5 star reviews for American/Thai/Chinese restaurants by state?
  * What are the most frequent keywords or adjectives that reviewers use in 1 star reviews for American/Thai/Chinese restaurants by state?
  * Does the distribution of restaurants with parking space or outdoor seating differ from state to state or city to city?
  * Are there temporal trends (daily, weekly, holidays) associated with business checkins?
  * Does the number of checkins per restaurant differ across various restaurant categories?
  * Is there a correlation between how long a user has been "yelping" and the number of reviews he has written?
  * Is there a correlation between how many friends/fans a user has and the number of votes his reviews get?
  * What are the 5 most common types of restaurants in each city?
  * What is the fraction of businesses that accounts for restaurants?
  * Do the typical business hours vary by city and by type of business?
  * What does the histogram of number of friends/fans of Yelp users look like? Is it long-tailed or does it follow a certain distribution?
  * What is the distribution of number of reviews by neighborhood?
  * Is there a correlation between the star rating and length of reviews?
  * What are the top keywords or adjectives used by the two genders (male and female, sorry for being binary) in their reviews?
  * What fraction of Yelp users is male? What fraction is female? What is the fraction of users for whom gender cannot be determined based on the list of male and female names provided in this notebook?
  * What is the average number of friends/fans for male and female users?
  * What is the average number of reviews written by male and female users?
*  *Classification (Any classification task should also include a description of all the features used and which of these features impacted classification performance the most and why.)*
  * Classify businesses into various business categories (restaurant, dry cleaner, auto body, etc.).
  * Classify businesses by the type of parking they provide (street, garage, valet, etc.).
  * Predict the location of a reviewer (east or west coast or mid-country).
  * Predict the location of a business (east or west coast or mid-country).
  * Predict if a review is funny, cool, or useful (label should be based on the corresponding votes associated with the review, votes therefore may not be used as features).
  * Predict which type of restaurant a user reviews most based on the restaurant types reviewed by his friends.
  * Given the current categories of a business, predict a new category that it could be labeled as. You will need businesses with multiple categories, so that some categories can be randomly held out from each business for testing purposes.
  * Predict if two users are friends based on the locations of businesses for which they have written reviews and other user characteristics.
  * Based on businesses reviewed by a user until a certain timepoint, predict the type of business the user might review next.
  * Predict if the ratings for a business are going to increase or decline with time. Are some types of restaurants more inclined to suffer from declining ratings?
  * Predict the gender of Yelp users based on businesses they have written reviews for. Examine a few examples of Yelp users where your classifier is incorrect, and provide any insightful suggestions for improving the classifier.
  * Predict the gender of Yelp users based on their business reviews. Examine your model to determine if and how the two genders use different words when writing reviews.
  * Predict the gender of Yelp users based on the numbers of various types of votes their reviews get and the numbers of various types of compliments they receive. Examine a few examples of Yelp users where your classifier is incorrect, and provide any insightful suggestions for improving the classifier.
*  *Regression (Any regression task should also include a description of all the features used and which of these features impacted regression performance the most and why.)*
  * Predict the average rating of a business from its reviews and other business characteristics such as location.
  * Predict the total number of reviews on a given week for each business.
  * Predict the total number of checkins based on business location, type, and other business characteristics present in "attributes" such as "Happy Hour", "Accepts Credit Cards", "Good For Groups", "Outdoor Seating", and "Price Range".
  * Predict the number of compliments received by a user.
  * Predict the number of friends a user has on Yelp based on user characteristics like number of reviews written by him, compliments received, etc.
  * Based on reviews written by a user until a certain timepoint, predict the star rating the user will give as part of his next review. Are certain users more likely to give extreme ratings to reviewed businesses than others?
  * Predict the number of funny/cool/useful votes sent by a user. Does it depend significantly on how long the user has been "yelping" or on gender?
  * Predict the number of funny/cool/useful votes received by a review. Does it depend significantly on how long the user has been "yelping" or on gender?
* *Clustering*
  * Cluster businesses by their features using a clustering algorithm such as [K-Means](https://spark.apache.org/docs/2.1.0/mllib-clustering.html#k-means). Choose the number of clusters in a data-driven fashion such as by using the elbow heuristic. Analyze the clusters and see if they are homogeneous, i.e. the businesses within each cluster look similar, as if they belong to the same group.
  * Cluster users based on their characteristics. See if users in the same cluster patronize similar businesses.
* *Recommendation Systems*
  * Given previous ratings by a Yelp user, recommend other businesses that the user might like. See [Collaborative Filtering on Spark](https://spark.apache.org/docs/2.1.0/mllib-collaborative-filtering.html).
  * You may also think of restaurant/business recommendation as a link prediction problem. You can use GraphX for this task.
* *Discovering Insights using Unsupervised Algorithms*
  * In the Yelp user dataset, you are provided a social network in the form of friends of each Yelp user. You may perform social network analysis using GraphX. This can help you discover the most influential users by eigen-centrality. You may use other measures of network centrality besides eigen-centrality.
  * Discover dense clusters of closely connected friends and see if they patronize the same businesses.
  * Verify if the dense clusters of closely connected friends are also homogeneous in terms of gender.
  * Learn a set of topics by applying topic modeling algorithms such as [LDA](https://spark.apache.org/docs/2.1.0/mllib-clustering.html#latent-dirichlet-allocation-lda) on textual reviews of businesses. Choose the number of topics in a data-driven fashion such as by using a figure that plots perplexity versus number of topics. Explore if the topics are insightful and whether or not they can be used as inputs to some predictive algorithms (see Classification tasks above).
  * Perform the above topic modeling procedure on the reviews of male and female reviewers separately to obtain two topic models. Explore if the topics are insightful and/or useful in any predictive tasks.
  * Apply PCA to the matrix where rows are businesses and columns are features of the businesses such as parking type, location, etc. Choose the number of components in a data-driven fashion such as by using a scree plot. Explore if the top components are insightful and can be used as inputs to any predictive algorithms (see classification tasks above).
  
Your task is to choose one of these ML problems, or define your own, on the provided dataset and address the problem of your choice with the big data analysis tools you learned during the course as well as others you explore based on the APIs in Spark. If you create your own question, please be sure to state it clearly at the beginning of section 4 (methodology).

### **Part 0.4: Set up your DataBricks CE Spark Cluster and IPython Notebook**

Step I: Visit the web interface of DataBricks Community Edition at https://community.cloud.databricks.com/.

Step II: Start a DataBricks Community Edition cluster by selecting "New Cluster" from the homepage.

Step III: Give your cluster a name and click on "Create Cluster". This creates a single node cluster with 6GB memory for your account.

Step IV: Go back to homepage, and choose "Import Notebook." Upload the IPython assignment notebook by following the prompts and open the notebook. Rename the notebook from "project_yelp_dataset.ipynb" to "andrewid_project_yelp_dataset.ipynb" where "andrewid" is your actual Andrew ID.

Step V: By default, the notebook is not attached to a Spark cluster and will show "Detached" as its status at the top of the notebook in the browser. Click on the "Detached" status and attach it to your cluster. It should now show the status as "Attached (cluster name)".

Step VI: You can now import pyspark into the notebook. Also, attaching to the DataBricks Community Edition cluster automatically provides the SparkContext variable "sc" to the Python code in your notebook. Use it to create RDDs and write further Spark code.

These instructions are detailed with screenshots in slides 10-16 of the setup recitation available at https://www.andrew.cmu.edu/user/amaurya/docs/95869/hadoop-spark-setup-recitation.pdf

### **Part 0.5: Submission Instructions**

You will submit both a zipped file on Blackboard and a hardcopy to the TA Abhinav Maurya in HBH 3026. If the TA's office is closed, please slide the hardcopy under the door.

Please complete the project, and feel free to add new cells as required. Upon completion, execute all cells in the completed notebook, and make sure all results show up. Export the contents of the notebook by choosing "File > Export > HTML" and saving the resulting file as "andrewid_project_yelp_dataset.html". Place the two files "andrewid_project_yelp_dataset.ipynb" and "andrewid_project_yelp_dataset.html" in a folder, zip the folder to a zipped file named "andrewid_project.zip", and submit it to Blackboard by the deadline. In addition, print the HTML file and submit the hardcopy to the TA Abhinav Maurya in HBH 3026 by the deadline. If the TA's office is closed, please slide the hardcopy under the door.

# **Part 1: Load the datasets required for the project**

We will load four datasets for this project. Please feel free to reasonably subsample the dataset depending on the question you are answering and its complexity. If you choose to subsample any of the four datasets, please explain why you subsampled it and how many data points remained in the subsampled version. In addition to the four datasets, we will also load two lists which contain names by gender. These lists are helpful in assigning a gender to a Yelp user by their name, since gender is not available in the Yelp dataset.

In [10]:
import json
import os
import sys
import os.path
import pyspark
import urllib2
import numpy as np

# helper function to load a JSON dataset from a publicly accessible url
def get_rdd_from_url(url):
  response = urllib2.urlopen(url)
  str_contents = response.read().strip().split('\n')
  json_contents = [json.loads(x) for x in str_contents]
  rdd = sc.parallelize(json_contents)
  return rdd
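
The helper above assumes that every line of the response is one JSON object. A minimal sketch of that parsing step on an in-memory string, with a guard for blank lines, might look as follows. The function name `parse_json_lines` and the sample records are hypothetical and not part of the provided notebook.

```python
import json

def parse_json_lines(text):
    """Parse newline-delimited JSON into a list of dicts, skipping blank lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # tolerate empty lines, e.g. a trailing newline
            records.append(json.loads(line))
    return records

# Hypothetical two-record sample in the same one-object-per-line layout.
sample = '{"business_id": "b1", "stars": 4.5}\n\n{"business_id": "b2", "stars": 3.0}\n'
records = parse_json_lines(sample)
print(len(records))
```

The resulting list could then be passed to `sc.parallelize` in the same way `get_rdd_from_url` does.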

The first dataset we are going to load is information about Yelp businesses. The information of each business will be stored as a Python dictionary within an RDD. The dictionary consists of the following fields:

* "business_id":"encrypted business id"
* "name":"business name"
* "neighborhood":"hood name"
* "address":"full address"
* "city":"city"
* "state":"state -- if applicable --"
* "postal code":"postal code"
* "latitude":latitude
* "longitude":longitude
* "stars":star rating, rounded to half-stars
* "review_count":number of reviews
* "is_open":0/1 (closed/open)
* "attributes":["an array of strings: each array element is an attribute"]
* "categories":["an array of strings of business categories"]
* "hours":["an array of strings of business hours"]
* "type": "business"

In [12]:
# load the data about Yelp businesses in an RDD
# each RDD element is a Python dictionary parsed from JSON using json.loads()
# if your chosen project does not need this data, please comment out the lines below
businesses_rdd = get_rdd_from_url('https://www.andrew.cmu.edu/user/amaurya/docs/95869/yelp_academic_dataset_business.json')
print businesses_rdd.count()
print businesses_rdd.take(2)

The second dataset we are going to load is information about Yelp users. Each user's information will be stored as a Python dictionary within an RDD. The dictionary consists of the following fields:

*  "user_id":"encrypted user id"
*  "name":"first name"
*  "review_count":number of reviews
*  "yelping_since": date formatted like "2009-12-19"
*  "friends":["an array of encrypted ids of friends"]
*  "useful":"number of useful votes sent by the user"
*  "funny":"number of funny votes sent by the user"
*  "cool":"number of cool votes sent by the user"
*  "fans":"number of fans the user has"
*  "elite":["an array of years the user was elite"]
*  "average_stars":floating point average like 4.31
*  "compliment_hot":number of hot compliments received by the user
*  "compliment_more":number of more compliments received by the user
*  "compliment_profile": number of profile compliments received by the user
*  "compliment_cute": number of cute compliments received by the user
*  "compliment_list": number of list compliments received by the user
*  "compliment_note": number of note compliments received by the user
*  "compliment_plain": number of plain compliments received by the user
*  "compliment_cool": number of cool compliments received by the user
*  "compliment_funny": number of funny compliments received by the user
*  "compliment_writer": number of writer compliments received by the user
*  "compliment_photos": number of photo compliments received by the user
*  "type":"user"

In [14]:
# load the data about Yelp users in an RDD
# each RDD element is a Python dictionary parsed from JSON using json.loads()
# if your chosen project does not need this data, please comment out the lines below
users_rdd = get_rdd_from_url('https://www.andrew.cmu.edu/user/amaurya/docs/95869/yelp_academic_dataset_user.json')
print users_rdd.count()
print users_rdd.take(2)

The third dataset we are going to load is information about business checkins reported by users on Yelp. Each checkin's information will be stored as a Python dictionary within an RDD. The dictionary consists of the following fields:

*  "checkin_info":["an array of check ins with the format day-hour:number of check ins from hour to hour+1"]
*  "business_id":"encrypted business id"
*  "type":"checkin"

In [16]:
# load the data about business checkins reported by users on Yelp in an RDD
# each RDD element is a Python dictionary parsed from JSON using json.loads()
# if your chosen project does not need this data, please comment out the lines below
checkins_rdd = get_rdd_from_url('https://www.andrew.cmu.edu/user/amaurya/docs/95869/yelp_academic_dataset_checkin.json')
print checkins_rdd.count()
print checkins_rdd.take(2)
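
The `checkin_info` field described above packs a day, an hour, and a count into each entry. As a rough sketch, assuming each entry is a string such as `"Fri-14:3"` (this exact layout is an assumption; inspect the output of `take(2)` before relying on it), the entries could be unpacked like this. The helper names are hypothetical.

```python
def parse_checkin_entry(entry):
    """Split a 'day-hour:count' string into (day, hour, count).

    ASSUMPTION: entries look like 'Fri-14:3'; adjust if the real data differs.
    """
    slot, count = entry.rsplit(":", 1)
    day, hour = slot.split("-", 1)
    return day, int(hour), int(count)

def checkins_per_day(entries):
    """Sum check-in counts per day label."""
    totals = {}
    for entry in entries:
        day, _hour, count = parse_checkin_entry(entry)
        totals[day] = totals.get(day, 0) + count
    return totals

# Hypothetical sample in the assumed format.
sample_checkins = ["Fri-14:3", "Fri-15:1", "Sat-0:2"]
print(checkins_per_day(sample_checkins))
```

A per-day or per-hour total of this kind is one possible starting point for the temporal-trend questions listed in Part 0.3.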

The fourth dataset we are going to load is information about business reviews written by users on Yelp. Each review's data will be stored as a Python dictionary within an RDD. The dictionary consists of the following fields:

*  "review_id":"encrypted review id"
*  "user_id":"encrypted user id"
*  "business_id":"encrypted business id"
*  "stars":star rating rounded to half-stars
*  "date":"date formatted like 2009-12-19"
*  "text":"review text"
*  "useful":number of useful votes received
*  "funny":number of funny votes received
*  "cool": number of cool review votes received
*  "type": "review"

In [18]:
# load the data about business reviews written by users on Yelp in an RDD, limited to businesses in Pittsburgh due to DataBricks computational limits
# each RDD element is a Python dictionary parsed from JSON using json.loads()
# if your chosen project does not need this data, please comment out the lines below
reviews_rdd = get_rdd_from_url('https://www.andrew.cmu.edu/user/amaurya/docs/95869/yelp_academic_dataset_review_pittsburgh.json')
print reviews_rdd.count()
print reviews_rdd.take(2)

Finally, we will load two lists. The first list consists of male names, and the second list consists of female names. You can use these lists to predict the gender of Yelp users if you plan to do any gender-based analysis of users or their reviews.

In [20]:
# helper function to load a list of names from a publicly accessible url
def get_names_from_url(url):
  response = urllib2.urlopen(url)
  str_contents = response.read().strip().split('\n')
  result = str_contents[6:]
  return result

male_names = get_names_from_url('https://www.andrew.cmu.edu/user/amaurya/docs/95869/male.txt')
# Python 2 print statement, consistent with the other cells (print(a, b) would print a tuple)
print 'First five male names: ', male_names[:5]

female_names = get_names_from_url('https://www.andrew.cmu.edu/user/amaurya/docs/95869/female.txt')
print 'First five female names: ', female_names[:5]

# **Part 2: Introduction, Motivation, and Problem Definition**

Please write your answer here. Add additional IPython code/markup cells as needed. Please see the grading rubric at the top of this notebook to understand expectations from this section.

Describe your chosen problem and why you think it is interesting from a business perspective. Also mention which of the four datasets you will use for the analysis. What metric(s) will you use to evaluate methods on your chosen task?

The chosen problem for my study was:

*"Predict the gender of Yelp users based on businesses they have written reviews for. Examine a few examples of Yelp users where your classifier is incorrect, and provide any insightful suggestions for improving the classifier."*

Purpose: The problem is to predict the gender of Yelp users who review businesses. This gives insight into the demographics of the population that actively reviews various types of businesses, and shows which types of business attract more interest from female or male users. Visualizing the gender distribution for each business also shows which businesses each gender likes most, which can reveal hidden user demographics and show how well a product succeeds or fails within each group. Gender-based analysis of the Yelp data can also help detect which businesses need improvement and which are doing well in terms of reviews and location. Being able to accurately predict gender from business data would provide valuable information for marketing, sales, and evaluation.

Chosen Datasets: The datasets of interest for observation, model development, and analysis are businesses, reviews, and users, together with the two name lists used to assign gender.

For evaluation purposes, I will first join and explore the datasets to get them into the desired (features, label) format. I will then quickly try different binary classification methods to gauge which one best suits the dataset, using log-loss and accuracy as metrics. Once I settle on a model that seems to work well, I will try to improve it through hyperparameter tuning and dataset-specific fixes (such as down-sampling or up-sampling a specific category). Because the task calls for a classification-based machine learning approach, I will use MLlib through PySpark extensively for model building, along with Python libraries such as Matplotlib and NumPy.
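
To make the two metrics concrete, here is a minimal pure-Python sketch of accuracy and binary log-loss. The labels and probabilities are made-up illustration values, the encoding 1 = female and 0 = male is an assumption, and the functions are not the MLlib evaluators themselves.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))

def log_loss(y_true, p_positive, eps=1e-15):
    """Mean negative log-likelihood for binary labels in {0, 1}."""
    total = 0.0
    for t, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical labels (assumed encoding: 1 = female, 0 = male) and probabilities.
y_true = [1, 0, 1, 1]
p_pos = [0.9, 0.2, 0.6, 0.4]
y_pred = [1 if p >= 0.5 else 0 for p in p_pos]
print(accuracy(y_true, y_pred))
print(log_loss(y_true, p_pos))
```

Accuracy only looks at the thresholded prediction, while log-loss also penalizes confident wrong probabilities, which is why the two can rank models differently on a skewed dataset.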

# **Part 3: Data Understanding and Preparation**

Please write your answer here. Add additional IPython code/markup cells as needed. Please see the grading rubric at the top of this notebook to understand expectations from this section.

Describe your exploratory data analysis in this section. This is really important because it establishes that the datasets you are exploring are capable of answering your chosen project question. Make this section rich with visualization to give the reader a comprehensive understanding of the datasets you have chosen to use.

Below, you are provided helper code to install matplotlib's extra toolkit "mpl_toolkits" required for drawing maps. Also provided is an example map created using mpl_toolkits. You can refer to Matplotlib Basemap Toolkit documentation [here](https://matplotlib.org/basemap/).

In [26]:
%sh -e

# shell commands to install mpl_toolkits
sudo pip install matplotlib
cd /databricks
mkdir -p mpl_toolkit
cd mpl_toolkit

wget https://www.andrew.cmu.edu/user/amaurya/docs/95869/basemap-1.0.7.tar.gz
tar -xvf basemap-1.0.7.tar.gz

cd basemap-1.0.7/geos-3.3.3
export GEOS_DIR=/usr/local
./configure --prefix=$GEOS_DIR
make; make install
cd ..
python setup.py install

In [27]:
import matplotlib.pyplot as plt
import mpl_toolkits
mpl_toolkits.__path__.append('/usr/local/lib/python2.7/dist-packages/mpl_toolkits/')
from mpl_toolkits.basemap import Basemap


def preparePlot(xticks, yticks, figsize=(10.5, 6), hideLabels=False, gridColor='#999999',
                gridWidth=1.0):
    """Template for generating the plot layout."""
    plt.close()
    fig, ax = plt.subplots(figsize=figsize, facecolor='white', edgecolor='white')
    ax.axes.tick_params(labelcolor='#999999', labelsize='10')
    for axis, ticks in [(ax.get_xaxis(), xticks), (ax.get_yaxis(), yticks)]:
        axis.set_ticks_position('none')
        axis.set_ticks(ticks)
        axis.label.set_color('#999999')
        if hideLabels: axis.set_ticklabels([])
    plt.grid(color=gridColor, linewidth=gridWidth, linestyle='-')
    for position in ['bottom', 'top', 'left', 'right']:
        ax.spines[position].set_visible(False)
    return fig, ax

In [28]:
fig, ax = preparePlot(np.arange(0, 100, 20), np.arange(0, 100, 20))
m = Basemap(projection='merc',llcrnrlat=-80,urcrnrlat=80, llcrnrlon=-180,urcrnrlon=180,lat_ts=20,resolution='c')
m.drawcoastlines()
m.fillcontinents(color='coral',lake_color='aqua')
m.drawparallels(np.arange(-90.,91.,30.))
m.drawmeridians(np.arange(-180.,181.,60.))
m.drawmapboundary(fill_color='aqua')
plt.title("Mercator Projection")
display(fig)

In [29]:
fig, ax = preparePlot(np.arange(0, 100, 20), np.arange(0, 100, 20))
m = Basemap(width=12000000,height=9000000,projection='lcc', resolution=None,lat_1=45.,lat_2=55,lat_0=50,lon_0=-107.)
m.bluemarble()
display(fig)

In [30]:
# Data Visualization & Preliminary analysis

# Start by caching the required RDDs in memory. Since we will
# be using them over and over again, it's best to cache them to speed
# up computations.
users_rdd.cache()
reviews_rdd.cache()
businesses_rdd.cache()
checkins_rdd.cache()

In [31]:
# Since different datasets share column names, to avoid confusion
# I prefix every column with its dataset identifier. This way, when we
# do joins in the following steps, all the columns are uniquely identified.
def prefix_columns(df, prefix):
  field_name = map(lambda field: field.name, df.schema.fields)
  for name in field_name:
    if name.startswith(prefix):
      continue
      
    df = df.withColumnRenamed(name, prefix + name)
    
  return df

In [32]:
# The neighborhoods field has mixed-type data (bool & string), which
# breaks Spark's automatic schema inference.
def drop_neighborhoods(v):
  del v["neighborhoods"]
  return v

users_df = prefix_columns(spark.createDataFrame(users_rdd), "user_")
reviews_df = prefix_columns(spark.createDataFrame(reviews_rdd), "review_")
checkins_df = prefix_columns(spark.createDataFrame(checkins_rdd), "checkin_")
businesses_df = prefix_columns(spark.createDataFrame(businesses_rdd.map(drop_neighborhoods)), "business_")

In [33]:
MALE = "MALE"
FEMALE = "FEMALE"
UNKNOWN = "UNKNOWN_GENDER"

users_gender_by_name = {}
for name in male_names:
  users_gender_by_name[name] = MALE
for name in female_names:
  users_gender_by_name[name] = FEMALE
    
users_gender_by_name_df = spark.createDataFrame(users_gender_by_name.items(), ("user_name", "gender"))
print users_gender_by_name_df.take(2)
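
In the cell above, a name that appears in both lists ends up labeled FEMALE, because the second loop overwrites the first. One possible alternative, sketched here on made-up names, is to normalize case and treat names found in both lists as unknown. The helper names `build_gender_lookup` and `guess_gender` are hypothetical, and this is not the approach the rest of the notebook uses.

```python
MALE, FEMALE, UNKNOWN = "MALE", "FEMALE", "UNKNOWN_GENDER"

def build_gender_lookup(male_names, female_names):
    """Map lower-cased first name -> gender, leaving ambiguous names out."""
    male = set(n.strip().lower() for n in male_names if n.strip())
    female = set(n.strip().lower() for n in female_names if n.strip())
    lookup = {}
    for name in male - female:
        lookup[name] = MALE
    for name in female - male:
        lookup[name] = FEMALE
    return lookup

def guess_gender(first_name, lookup):
    """Return MALE, FEMALE, or UNKNOWN for a first name."""
    return lookup.get(first_name.strip().lower(), UNKNOWN)

# Hypothetical short lists; "Pat" appears in both and is treated as unknown.
lookup = build_gender_lookup(["John", "Pat"], ["Mary", "Pat"])
print(guess_gender("john", lookup))
print(guess_gender("Pat", lookup))
```

Whether ambiguous names should be dropped or assigned to one class is a modeling choice; dropping them trades sample size for cleaner labels.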

In [34]:
# We are missing gender for some user names, so I do left_outer join here to retain those rows in the output.
df = users_df.join(users_gender_by_name_df, users_df.user_name == users_gender_by_name_df.user_name, 'left_outer')
df = df.join(reviews_df, df.user_id == reviews_df.review_user_id)
df = df.join(businesses_df, df.review_business_id == businesses_df.business_id)
df.cache()

print df.count()
print df.take(1)
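
The effect of the `left_outer` join above can be illustrated on plain Python records: every user row is kept, and users whose name is not in the lookup get a missing gender. The function `left_outer_join` and the sample rows are hypothetical and only mimic the join semantics; they are not Spark code.

```python
def left_outer_join(left_rows, right_lookup, key):
    """Attach right_lookup[row[key]] as 'gender' to each left row; None when missing."""
    joined = []
    for row in left_rows:
        merged = dict(row)
        merged["gender"] = right_lookup.get(row[key])  # None mirrors SQL NULL
        joined.append(merged)
    return joined

# Hypothetical users; "Xyzzy" has no entry in the name list.
users = [{"user_id": "u1", "user_name": "Mary"}, {"user_id": "u2", "user_name": "Xyzzy"}]
name_to_gender = {"Mary": "FEMALE", "John": "MALE"}
joined = left_outer_join(users, name_to_gender, "user_name")
print(len(joined))
```

Note that the two later joins in the cell use Spark's default inner join, so users without any review in the Pittsburgh subset still drop out of `df`.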

In [35]:
# Here is a plot showing the number of business reviews grouped by
# user gender.
#
# We notice that the dataset is highly skewed towards females.
# Therefore we should expect the model to be biased towards
# females and to over-predict that class.
#
# Later, in the evaluation section, I will try to handle this imbalance.

from matplotlib import pyplot as plt
 
plt.close()
by_gender = df.groupBy('gender').count().collect()
categories = [i[0] for i in by_gender]
counts = [i[1] for i in by_gender]
 
ind = np.array(range(len(categories)))
width = 0.50
plt.bar(ind, counts, width=width, color='r')
 
plt.ylabel('reviews counts')
plt.title('Review count by gender')
plt.xticks(ind + width/2., categories)
display(plt.gcf())

In [36]:
# Import ggplot, datetime, and the Spark UDF helper
from ggplot import *
from datetime import datetime
from pyspark.sql.functions import udf

In [37]:
# 1. Day of week vs. review count, broken down by gender
to_day = udf(lambda x: datetime.strptime(x, '%Y-%m-%d').strftime("%A"))
day_gender_df = df.select(df.gender, to_day(df.review_date).alias('day')).groupby("gender", "day").count()
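
The date-to-weekday conversion inside the UDF can be checked on its own in plain Python. This is a minimal sketch; `to_day_name` is a hypothetical name, and the weekday string assumes the default (English) locale.

```python
from datetime import datetime

def to_day_name(date_string):
    # Parse a YYYY-MM-DD date and return the weekday name.
    return datetime.strptime(date_string, "%Y-%m-%d").strftime("%A")

print(to_day_name("2017-01-02"))
```

Because the UDF is created without an explicit return type, Spark treats its result as a string column, which is what the later `groupby("gender", "day")` needs.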

In [38]:
# This suggests that females review more businesses than males on every day of the week
p = ggplot(day_gender_df.toPandas(), aes('day', 'count', fill='gender')) + geom_bar(stat="identity")
display(p)

In [39]:
# 2. Review counts by city
reviews_by_city = df.groupby('business_city').count()

In [40]:
# The majority of data samples are from Pittsburgh, so later observations about city can be ignored.
p = ggplot(reviews_by_city.toPandas(), aes('business_city', 'count')) + geom_bar(stat="identity") + scale_y_log() + theme(axis_text_x  = element_text(angle = 45, hjust = 1))
display(p)

In [41]:
# 3. review star rating distribution by gender
review_stars_by_gender = df.groupby('review_stars', 'gender').count()

In [42]:
p = ggplot(review_stars_by_gender.toPandas(), aes(x='gender', y='count', fill='review_stars')) + geom_bar(stat="identity")
display(p)

In [43]:
# Here I define the features used by the prediction model. All features are
# made categorical.
#
# These include:
# * business attributes (each attribute is treated as a top-level feature)
# * business hours (made categorical by treating <day>-<open/close> as a top-level feature)
# * business city
# * business name
# * business stars (categorical, since there are only 9 possible values)
# * business state
# * business longitude/latitude (made categorical by mapping each business onto a grid)
# * business review count (made categorical by taking the base-10 logarithm)
#
# User gender is treated as the label.

# Making all the business attributes a top level column
def extract_business_attributes(row):
  names = []
  for attr in row['attributes'].iterkeys():
    names.append(attr)
  return names

business_attributes = businesses_rdd.flatMap(extract_business_attributes).distinct().collect()
#print business_attributes[:5]

def extract_business_hours(row):
  names = []
  for day, kv in row['hours'].iteritems():
    for k in kv.iterkeys():
      names.append(day + '-' + k)
  return names

business_hours = businesses_rdd.flatMap(extract_business_hours).distinct().collect()
#print business_hours[:5]

business_top_level_features = [
  'business_city',
  'business_name',
  'business_stars',
  'business_state',
  'business_latitude',
  'business_longitude',
  'business_review_count',
]

business_features = business_attributes + business_hours + business_top_level_features

business_features_index = {}
for idx, feature in enumerate(business_features):
  business_features_index[feature] = idx
  
# Only using features from businesses users have written reviews for.
print "Total number of features:", len(business_features)
print "Sample features:", business_features[:5]

From the output above, I observed that a few business features had a small set of distinct, well-defined levels, so I flattened those attributes and assigned their respective values without affecting the row count.
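
The flattening of nested attributes and hours into a flat list of feature names can be sketched without Spark. The business record below is invented for illustration, and `attribute_names`, `hour_names`, and `feature_index` are hypothetical names that mirror the roles of `extract_business_attributes`, `extract_business_hours`, and `business_features_index`.

```python
# Hypothetical business record; the field values are invented for illustration.
business = {
    "attributes": {"Take-out": True, "Alcohol": "full_bar"},
    "hours": {"Monday": {"open": "11:00", "close": "22:00"}},
}

def attribute_names(record):
    # Every attribute key becomes a top-level feature name.
    return sorted(record["attributes"].keys())

def hour_names(record):
    # Each (day, open/close) pair becomes its own feature name.
    return sorted(day + "-" + key
                  for day, times in record["hours"].items()
                  for key in times)

feature_names = attribute_names(business) + hour_names(business)
feature_index = {name: idx for idx, name in enumerate(feature_names)}
print(feature_index)
```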

In [45]:
from math import log

# Converts a row from joined dataset (where some values are nested dictionaries)
# to a pair of (label, features).
def flatten(row):
  label = row['gender']
  
  features = [None] * len(business_features)
  for attr, value in row['business_attributes'].iteritems():
    features[business_features_index[attr]] = value
  
  for day, kv in row['business_hours'].iteritems():
    for k, v in kv.iteritems():
      feature = day + '-' + k
      features[business_features_index[feature]] = v
  
  for feature in business_top_level_features:
    val = row[feature]
    
    # For latitude/longitude, place each business on a coarse grid by keeping
    # only the integer part. A finer grid might give better results.
    if feature == 'business_latitude':
      val = int(val)
    elif feature == 'business_longitude':
      val = int(val)
    elif feature == 'business_review_count':
      # For review_count, take the base-10 log of (count + 1), which groups
      # businesses roughly by order of magnitude: single digits (very few
      # reviews), tens (few reviews), hundreds (a good number of reviews), etc.
      val = int(log(val + 1, 10))
      
    features[business_features_index[feature]] = val
  
  return (label, features)

rawRdd = df.rdd.map(flatten)
rawRdd.cache()
print df.take(1)
print business_features
print rawRdd.take(1)
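
The two bucketing rules used in `flatten` can be tried in isolation. This is a sketch with hypothetical helper names (`grid_cell`, `review_count_bucket`). It swaps in `math.log10` for the two-argument `log(x, 10)`, since the latter can return a value just under an integer at exact powers of ten (for example 1000), which would shift the bucket boundary.

```python
import math

def grid_cell(latitude, longitude):
    # int() truncates toward zero, so -79.99 maps to -79, not -80.
    return int(latitude), int(longitude)

def review_count_bucket(review_count):
    # Order-of-magnitude bucket of (review_count + 1).
    return int(math.log10(review_count + 1))

print(grid_cell(40.44, -79.99))
print([review_count_bucket(n) for n in (5, 50, 500)])
```

Truncation toward zero means the cells on either side of the equator or prime meridian merge into one double-width cell; `math.floor` would avoid that if it matters for the data.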

Here, I apply one-hot encoding (OHE) to convert the categorical features into numerical features, making the dataset ready for classification.

In [47]:
# Apply OHE to the categorical features, ignoring None values.
# OHE maps each (feature, value) pair, e.g. for 'BYOB' or 'Take-out', to an integer index.
oheFeatures = rawRdd \
  .flatMap(lambda (label, features): zip(business_features, features)) \
  .filter(lambda (key, value): value is not None) \
  .distinct()

# Features with the most distinct values
print oheFeatures.mapValues(lambda x: 1).reduceByKey(lambda acc, val: acc + val).top(5, key=lambda x: x[1])

# Total number of OHE features
print "Total number of OHE features:", oheFeatures.count()
print "Sample OHE features:", oheFeatures.take(5)

# OHE indexes for all (categorical features, values) & ordinal values
oheDict = {}
for idx, oheFeature in enumerate(oheFeatures.collect()):
  oheDict[oheFeature] = idx
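
Building the OHE dictionary amounts to collecting the distinct (feature, value) pairs and numbering them. The sketch below does this in plain Python on invented rows. Sorting before numbering is an addition of the sketch (not something the notebook does) that makes the index assignment reproducible between runs.

```python
# Hypothetical flattened rows: one value per feature, None when missing.
feature_names = ["business_city", "Take-out"]
rows = [
    ["Pittsburgh", True],
    ["Pittsburgh", None],
    ["Charlotte", False],
]

# Collect the distinct (feature, value) pairs, skipping missing values.
pairs = set()
for row in rows:
    for name, value in zip(feature_names, row):
        if value is not None:
            pairs.add((name, value))

# Values mix types, so sort on the string form to get a stable order.
ohe_dict = {pair: idx for idx, pair in enumerate(sorted(pairs, key=repr))}
print(len(ohe_dict))
```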

In [48]:
# Each data point typically has only a small number of non-zero OHE features
# relative to the total number of features in the dataset. A sparse vector
# representation exploits this sparsity and reduces storage and computation.
# Therefore, I build LabeledPoints with SparseVectors, using gender as the
# label and all other fields as the feature set.
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import SparseVector

def extract_label(row):
  gender = row[0]
  if gender == MALE:
    return 0
  elif gender == FEMALE:
    return 1
  else:
    return 2

def extract_features(row):
  d = {}
  
  for feature in business_features:
    idx = business_features_index[feature]
    val = row[1][idx]
    if val is not None:
      d[oheDict[(feature, val)]] = 1
    
  return SparseVector(len(oheDict), d)

labeled_points = rawRdd.map(lambda row: LabeledPoint(extract_label(row), extract_features(row)))
print labeled_points.take(5)
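
What `extract_features` produces can be illustrated without MLlib: a vector size plus the indexes of the active OHE columns, each with value 1. The dictionary and rows below are invented, and `to_sparse` is a hypothetical stand-in for building a `SparseVector`.

```python
# Hypothetical OHE dictionary mapping (feature, value) pairs to column indexes.
ohe_dict = {
    ("business_city", "Pittsburgh"): 0,
    ("business_city", "Charlotte"): 1,
    ("Take-out", True): 2,
}
feature_names = ["business_city", "Take-out"]

def to_sparse(row):
    # Return (size, sorted active indexes); every active entry has value 1.
    active = sorted(ohe_dict[(name, value)]
                    for name, value in zip(feature_names, row)
                    if value is not None)
    return len(ohe_dict), active

print(to_sparse(["Pittsburgh", True]))
print(to_sparse(["Charlotte", None]))
```

Each row activates at most one column per original feature, so the number of stored entries stays bounded by the number of features no matter how large the OHE dictionary grows.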

This completes the feature engineering; the feature set is now ready for classification model development using MLlib.

# ** Part 4: Methodology **

Please write your answer here. Add additional IPython code/markup cells as needed. Please see the grading rubric at the top of this notebook to understand expectations from this section.

In this section, explain what method you have chosen to address the chosen problem. Are you going to be using regression, classification, clustering, topic modeling, collaborative filtering, or a combination of some of these? Describe why this method is suitable for answering your problem.

In [52]:
# I start by randomly splitting the complete dataset into three parts:
# training data, validation data, and test data.
weights = [.8, .1, .1]
seed = 42

# Division of complete dataset
trainData, valData, testData = labeled_points.randomSplit(weights, seed)
trainData.cache()
valData.cache()
testData.cache()

# Train/validation/test datasets that exclude reviews with UNKNOWN gender
binaryTrainData, binaryValData, binaryTestData = labeled_points.filter(lambda lp: lp.label != 2).randomSplit(weights, seed)
binaryTrainData.cache()
binaryValData.cache()
binaryTestData.cache()
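
A weighted random split can be sketched in plain Python to show why the resulting sizes are only approximately 80/10/10: each element is assigned independently by a random draw. `random_split` below is a hypothetical helper written for illustration; it is not Spark's implementation, and its output for a given seed will differ from `randomSplit`.

```python
import random

def random_split(items, weights, seed):
    # Normalize weights into cumulative boundaries, e.g. [0.8, 0.9, 1.0].
    total = float(sum(weights))
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random()
        # Place the item in the first bucket whose upper bound exceeds the draw;
        # the last bucket catches any floating-point shortfall.
        for idx, bound in enumerate(bounds):
            if draw < bound or idx == len(bounds) - 1:
                splits[idx].append(item)
                break
    return splits

train, val, test = random_split(range(1000), [.8, .1, .1], seed=42)
print(len(train), len(val), len(test))
```

Every item lands in exactly one split, so the three parts are disjoint and together cover the input.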

In [53]:
# Function to display evaluation Metrics for our classifier
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.mllib.evaluation import BinaryClassificationMetrics

def evaluate_model(name, model_cls, trainData, testData, **kwargs):
  model = model_cls.train(trainData, **kwargs)
  predictionAndLabels = testData.map(lambda lp: (float(model.predict(lp.features)), lp.label))

  # ref: https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html
  print("Summary Stats for model")
  labels = testData.map(lambda lp: lp.label).distinct().collect()
  if len(labels) == 2:
    metrics = BinaryClassificationMetrics(predictionAndLabels)
    print("Area under PR = %s" % metrics.areaUnderPR)
    print("Area under ROC = %s" % metrics.areaUnderROC)
  else:
    metrics = MulticlassMetrics(predictionAndLabels)
    precision = metrics.precision()
    recall = metrics.recall()
    f1Score = metrics.fMeasure()
    print("Precision = %s" % precision)
    print("Recall = %s" % recall)
    print("F1 Score = %s" % f1Score)

    # Statistics by class
    for label in sorted(labels):
      print("Class %s precision = %s" % (label, metrics.precision(label)))
      print("Class %s recall = %s" % (label, metrics.recall(label)))
      print("Class %s F1 Measure = %s" % (label, metrics.fMeasure(label, beta=1.0)))

    # Weighted stats
    print("Weighted recall = %s" % metrics.weightedRecall)
    print("Weighted precision = %s" % metrics.weightedPrecision)
    print("Weighted F(1) Score = %s" % metrics.weightedFMeasure())
    print("Weighted F(0.5) Score = %s" % metrics.weightedFMeasure(beta=0.5))
    print("Weighted false positive rate = %s" % metrics.weightedFalsePositiveRate)
  
  print ("")
  return model, predictionAndLabels
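
The per-class precision, recall, and F1 values that `MulticlassMetrics` reports can be reproduced by hand on a small list of (prediction, label) pairs. This is a sketch with a hypothetical helper, `binary_scores`, and invented pairs.

```python
def binary_scores(prediction_and_labels, positive=1.0):
    # Count true positives, false positives, and false negatives for one class.
    tp = sum(1 for p, y in prediction_and_labels if p == positive and y == positive)
    fp = sum(1 for p, y in prediction_and_labels if p == positive and y != positive)
    fn = sum(1 for p, y in prediction_and_labels if p != positive and y == positive)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

pairs = [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
print(binary_scores(pairs))
```

Here class 1 has two true positives, one false positive, and one false negative, so precision, recall, and F1 all come out to 2/3.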

In [54]:
from math import log

# Calculates the log loss for a given probability and label.
def computeLogLoss(p, y):
    """

    Note:
        log(0) is undefined, so when p is 0 we need to add a small value (epsilon) to it
        and when p is 1 we need to subtract a small value (epsilon) from it.

    Args:
        p (float): A probability between 0 and 1.
        y (int): A label.  Takes on the values 0 and 1.

    Returns:
        float: The log loss value.
    """
    epsilon = 10e-12
    
    if p == 0:
      p = p + epsilon
    if p == 1:
      p = p - epsilon
    
    if y == 1:
      logLoss = -log(p)
    else:
      logLoss = -log(1-p)
      
    return logLoss

# Function to get raw predictions from a model. These are used to compute log loss
# and plot ROC graphs.
def get_raw_predictions(model, data):
  threshold = model.threshold
  model.clearThreshold()
  
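  # Cache before forcing evaluation, so the raw scores computed while the
  # threshold is cleared are stored and not recomputed later.
  rawPredictionsAndLabels = data.map(lambda lp: (float(model.predict(lp.features)), lp.label))
  rawPredictionsAndLabels.cache()
  rawPredictionsAndLabels.count()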
  
  model.setThreshold(threshold)
  return rawPredictionsAndLabels

# Function to print some basic stats about model predictions.
def print_stats(model, valData):
  predictionsAndLabels = valData.map(lambda lp: (float(model.predict(lp.features)), lp.label))
  accuracy = predictionsAndLabels.map(lambda (pred, label): pred == label).mean()
  
  rawPredictionsAndLabel = get_raw_predictions(model, valData)
  logLoss = rawPredictionsAndLabel.map(lambda (pred, label): computeLogLoss(pred, label)).mean()
  print "LogLoss: {} Accuracy: {}".format(logLoss, accuracy)
  return logLoss, accuracy
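
A few hand-checked values help when reading the log loss numbers later in the notebook. The sketch below is a compact variant of `computeLogLoss`; it clamps `p` into a range (an assumption of the sketch) where the notebook adjusts only the exact values 0 and 1.

```python
from math import log

def log_loss(p, y, epsilon=1e-12):
    # Clamp p into [epsilon, 1 - epsilon] so log() never sees 0.
    p = min(max(p, epsilon), 1 - epsilon)
    return -log(p) if y == 1 else -log(1 - p)

print(log_loss(0.5, 1))   # about 0.693, the loss of an uninformed guess
print(log_loss(0.9, 1))   # small loss: confident and right
print(log_loss(0.9, 0))   # large loss: confident and wrong
```

For reference, predicting 0.5 for every sample gives a mean log loss of ln 2, roughly 0.693, so values only slightly below that indicate a model that is barely better than an uninformed guess.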

In [55]:
# Function to plot ROC curves for our selected model and on the selected dataset.
def plot_roc(model, valData):
  rawPredictionsAndLabels = get_raw_predictions(model, valData)
  labelsAndWeights = rawPredictionsAndLabels.map(lambda (pred, label): (label, pred)).collect()
  labelsAndWeights.sort(key=lambda (k, v): v, reverse=True)
  labelsByWeight = np.array([k for (k, v) in labelsAndWeights])

  length = labelsByWeight.size
  truePositives = labelsByWeight.cumsum()
  numPositive = truePositives[-1]
  falsePositives = np.arange(1.0, length + 1, 1.) - truePositives

  truePositiveRate = truePositives / numPositive
  falsePositiveRate = falsePositives / (length - numPositive)

  # Generate layout and plot data
  fig, ax = preparePlot(np.arange(0., 1.1, 0.1), np.arange(0., 1.1, 0.1))
  ax.set_xlim(-.05, 1.05), ax.set_ylim(-.05, 1.05)
  ax.set_ylabel('True Positive Rate (Sensitivity)')
  ax.set_xlabel('False Positive Rate (1 - Specificity)')
  plt.plot(falsePositiveRate, truePositiveRate, color='#8cbfd0', linestyle='-', linewidth=3.)
  plt.plot((0., 1.), (0., 1.), linestyle='--', color='#d6ebf2', linewidth=2.)  # Baseline model
  return fig
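
The cumulative-sum construction used by `plot_roc` can be illustrated without Spark, numpy, or matplotlib. The sketch below is plain Python 3 (it uses `itertools.accumulate`); `roc_points` and the sample scores are made up for illustration, and it assumes both classes are present in the input.

```python
from itertools import accumulate

def roc_points(scores_and_labels):
    # Sort by score, highest first, then sweep the threshold downward.
    ranked = sorted(scores_and_labels, key=lambda pair: pair[0], reverse=True)
    labels = [label for _, label in ranked]
    true_pos = list(accumulate(labels))
    num_pos = true_pos[-1]
    num_neg = len(labels) - num_pos
    # After i + 1 samples, everything that is not a true positive is a false positive.
    false_pos = [(i + 1) - tp for i, tp in enumerate(true_pos)]
    tpr = [tp / float(num_pos) for tp in true_pos]
    fpr = [fp / float(num_neg) for fp in false_pos]
    return fpr, tpr

fpr, tpr = roc_points([(0.9, 1), (0.8, 1), (0.3, 0), (0.1, 0)])
print(fpr, tpr)
```

With perfectly ranked scores, as in this example, the true positive rate reaches 1 before the false positive rate leaves 0, which is the top-left corner of the ROC plot.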

To select the strongest classifier for the dataset produced by feature engineering, I trained a variety of classifiers available in Spark MLlib on the training dataset.

In [57]:
# Logistic Regression with SGD applied on training data with only 2 classes for gender: "FEMALE" & "MALE"
from pyspark.mllib.classification import LogisticRegressionWithSGD

lr_model = LogisticRegressionWithSGD.train(binaryTrainData)

In [58]:
# Compute the logloss and accuracy for logistic regression with SGD on Validation dataset.
print_stats(lr_model, binaryValData)

In [59]:
# Plot ROC curve for Logistic Regression SGD
fig = plot_roc(lr_model, binaryValData)
display(fig)

In [60]:
# Classification model training based on SVM SGD
from pyspark.mllib.classification import SVMWithSGD

svm_model = SVMWithSGD.train(binaryTrainData)
predictionsAndLabels = binaryValData.map(lambda lp: (float(svm_model.predict(lp.features)), lp.label))
accuracy = predictionsAndLabels.map(lambda (pred, label): pred == label).mean()
print "Accuracy: {}".format(accuracy)

In [61]:
# Naive Bayes classifier model implementation on training data
from pyspark.mllib.classification import NaiveBayes

nb_model = NaiveBayes.train(binaryTrainData)
predictionsAndLabels = binaryValData.map(lambda lp: (float(nb_model.predict(lp.features)), lp.label))
accuracy = predictionsAndLabels.map(lambda (pred, label): pred == label).mean()
print "Accuracy: {}".format(accuracy)

In [62]:
# Random Forest model building on training dataset to classify gender
from pyspark.mllib.tree import RandomForest

rf_model = RandomForest.trainClassifier(
  binaryTrainData,
  numClasses=2,
  categoricalFeaturesInfo=dict((idx, 2) for idx in oheDict.itervalues()),
  numTrees=16)

predictions = rf_model.predict(binaryValData.map(lambda x: x.features))
predictionsAndLabels = predictions.zip(binaryValData.map(lambda lp: lp.label))
accuracy = predictionsAndLabels.map(lambda (pred, label): pred == label).mean()
print "Accuracy: {}".format(accuracy)

In [63]:
# Gradient Boosted Trees as classifier for gender classification.
from pyspark.mllib.tree import GradientBoostedTrees

gbt_model = GradientBoostedTrees.trainClassifier(
  binaryTrainData, 
  categoricalFeaturesInfo=dict((idx, 2) for idx in oheDict.itervalues()),
  numIterations=5)

predictions = gbt_model.predict(binaryValData.map(lambda x: x.features))
predictionsAndLabels = predictions.zip(binaryValData.map(lambda lp: lp.label))
accuracy = predictionsAndLabels.map(lambda (pred, label): pred == label).mean()
print "Accuracy: {}".format(accuracy)

Summary of stats from the different classification methods on the binary dataset:

|Method|Log loss|Accuracy|
|----|----|----|
|LogisticRegression|0.666|0.610|
|SVM|n/a|0.610|
|NaiveBayes|n/a|0.577|
|RandomForest|n/a|0.610|
|GradientBoostedTrees|n/a|0.611|

I will proceed with the evaluation using LogisticRegressionWithSGD since it is one of
the simpler models, is very fast to train and test, and gives results almost as good
as any other model.
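
When several models land on nearly the same accuracy, it can be worth comparing against a majority-class baseline, that is, the accuracy of always predicting the most frequent label. The notebook does not report the class proportions of the validation set, so the labels below are invented; `majority_baseline_accuracy` is a hypothetical helper, not part of MLlib.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    # Accuracy of a classifier that always predicts the most common label.
    counts = Counter(labels)
    return max(counts.values()) / float(len(labels))

# Invented labels: 6 of 10 reviews carry label 1.
sample_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(majority_baseline_accuracy(sample_labels))
```

A model whose accuracy matches this baseline may be doing no more than predicting the majority class, which is something the per-class error counts can confirm or rule out.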

# ** Part 5: Evaluation and Results **

Please write your answer here. Add additional IPython code/markup cells as needed. Please see the grading rubric at the top of this notebook to understand expectations from this section.

In this section, describe all experimental parameters such as those used for grid search on hyperparameters. Include results of the chosen methods on your task. How does the metric of interest vary with changes in important method hyperparameters such as regularization, number of iterations, etc.?

In [67]:
# Hyperparameter tuning: here I try to optimize the LogisticRegression model
# by trying different parameters.
import itertools

includeIntercept = [True, False]
regType = ['l1', 'l2']
numIters = [10, 100, 500]
stepSizes = [1, 10]
regParams = [1e-6, 1e-3]

gridParams = list(itertools.product(includeIntercept, regType, numIters, stepSizes, regParams))
print "Total combinations to try: ", len(gridParams)
print gridParams
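
The size and ordering of the grid follow directly from `itertools.product`. The sketch below repeats the same value lists under hypothetical snake_case names to show how the combinations are counted and in which order they are generated.

```python
import itertools

include_intercept = [True, False]
reg_types = ["l1", "l2"]
num_iters = [10, 100, 500]
step_sizes = [1, 10]
reg_params = [1e-6, 1e-3]

grid = list(itertools.product(
    include_intercept, reg_types, num_iters, step_sizes, reg_params))

# 2 * 2 * 3 * 2 * 2 combinations; the last list varies fastest.
print(len(grid))
print(grid[0], grid[1])
```

Because the last list varies fastest, consecutive combinations differ only in the regularization parameter, which makes it easy to compare such pairs in the output of a grid search.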

In [68]:
# Initialize variables using values from initial model training
bestModel = None
bestLogLoss = 1e10

for params in gridParams:
  # Unpack into singular names so the grid lists defined above are not
  # overwritten and regParam actually refers to the current combination.
  useIntercept, reg, iters, stepSize, regParam = params
  
  model = LogisticRegressionWithSGD.train(
    binaryTrainData,
    iterations=iters,
    step=stepSize,
    regType=reg,
    regParam=regParam,
    intercept=useIntercept)
  
  logLoss, accuracy = print_stats(model, binaryValData)
  print ('params: {}: logloss = {:.3f}, accuracy = {:.3f}'
    .format(params, logLoss, accuracy))
        
  # I choose best model by log loss
  if (logLoss < bestLogLoss):
    bestModel = model
    bestLogLoss = logLoss

The following is the output of the grid search over hyperparameters. Runs that differ only in `regParam` report identical log loss and accuracy.
```
params: (True, 'l1', 10, 1, 1e-06): logloss = 0.668, accuracy = 0.610
params: (True, 'l1', 10, 1, 0.001): logloss = 0.668, accuracy = 0.610
params: (True, 'l1', 10, 10, 1e-06): logloss = 1.163, accuracy = 0.445
params: (True, 'l1', 10, 10, 0.001): logloss = 1.163, accuracy = 0.445
params: (True, 'l1', 100, 1, 1e-06): logloss = 0.667, accuracy = 0.610
params: (True, 'l1', 100, 1, 0.001): logloss = 0.667, accuracy = 0.610
params: (True, 'l1', 100, 10, 1e-06): logloss = 1.063, accuracy = 0.610
params: (True, 'l1', 100, 10, 0.001): logloss = 1.063, accuracy = 0.610
params: (True, 'l1', 500, 1, 1e-06): logloss = 0.667, accuracy = 0.610
params: (True, 'l1', 500, 1, 0.001): logloss = 0.667, accuracy = 0.610
params: (True, 'l1', 500, 10, 1e-06): logloss = 0.666, accuracy = 0.610
params: (True, 'l1', 500, 10, 0.001): logloss = 0.666, accuracy = 0.610
params: (True, 'l2', 10, 1, 1e-06): logloss = 0.667, accuracy = 0.610
params: (True, 'l2', 10, 1, 0.001): logloss = 0.667, accuracy = 0.610
params: (True, 'l2', 10, 10, 1e-06): logloss = 1.007, accuracy = 0.457
params: (True, 'l2', 10, 10, 0.001): logloss = 1.007, accuracy = 0.457
params: (True, 'l2', 100, 1, 1e-06): logloss = 0.666, accuracy = 0.610
params: (True, 'l2', 100, 1, 0.001): logloss = 0.666, accuracy = 0.610
params: (True, 'l2', 100, 10, 1e-06): logloss = 0.826, accuracy = 0.421
params: (True, 'l2', 100, 10, 0.001): logloss = 0.826, accuracy = 0.421
params: (True, 'l2', 500, 1, 1e-06): logloss = 0.666, accuracy = 0.610
params: (True, 'l2', 500, 1, 0.001): logloss = 0.666, accuracy = 0.610
params: (True, 'l2', 500, 10, 1e-06): logloss = 0.665, accuracy = 0.609
params: (True, 'l2', 500, 10, 0.001): logloss = 0.665, accuracy = 0.609
params: (False, 'l1', 10, 1, 1e-06): logloss = 0.667, accuracy = 0.610
params: (False, 'l1', 10, 1, 0.001): logloss = 0.667, accuracy = 0.610
params: (False, 'l1', 10, 10, 1e-06): logloss = 4.440, accuracy = 0.610
params: (False, 'l1', 10, 10, 0.001): logloss = 4.440, accuracy = 0.610
params: (False, 'l1', 100, 1, 1e-06): logloss = 0.667, accuracy = 0.610
params: (False, 'l1', 100, 1, 0.001): logloss = 0.667, accuracy = 0.610
params: (False, 'l1', 100, 10, 1e-06): logloss = 0.785, accuracy = 0.443
params: (False, 'l1', 100, 10, 0.001): logloss = 0.785, accuracy = 0.443
params: (False, 'l1', 500, 1, 1e-06): logloss = 0.667, accuracy = 0.610
params: (False, 'l1', 500, 1, 0.001): logloss = 0.667, accuracy = 0.610
params: (False, 'l1', 500, 10, 1e-06): logloss = 0.666, accuracy = 0.610
params: (False, 'l1', 500, 10, 0.001): logloss = 0.666, accuracy = 0.610
params: (False, 'l2', 10, 1, 1e-06): logloss = 0.667, accuracy = 0.610
params: (False, 'l2', 10, 1, 0.001): logloss = 0.667, accuracy = 0.610
params: (False, 'l2', 10, 10, 1e-06): logloss = 4.731, accuracy = 0.610
params: (False, 'l2', 10, 10, 0.001): logloss = 4.731, accuracy = 0.610
params: (False, 'l2', 100, 1, 1e-06): logloss = 0.666, accuracy = 0.609
params: (False, 'l2', 100, 1, 0.001): logloss = 0.666, accuracy = 0.609
params: (False, 'l2', 100, 10, 1e-06): logloss = 0.924, accuracy = 0.610
params: (False, 'l2', 100, 10, 0.001): logloss = 0.924, accuracy = 0.610
params: (False, 'l2', 500, 1, 1e-06): logloss = 0.666, accuracy = 0.609
params: (False, 'l2', 500, 1, 0.001): logloss = 0.666, accuracy = 0.609
params: (False, 'l2', 500, 10, 1e-06): logloss = 0.665, accuracy = 0.609
params: (False, 'l2', 500, 10, 0.001): logloss = 0.665, accuracy = 0.609
```

In [70]:
# Hyperparameter tuning reduced our log loss only slightly
fig = plot_roc(bestModel, binaryValData)
display(fig)

In [71]:
# We know from the earlier analysis that our data is highly skewed towards FEMALE.
# Let's try balancing the training dataset and see if the model performs better.
# I down-sample the female reviews so both labels have roughly equal sample counts.

maleTrainData = binaryTrainData.filter(lambda lp: lp.label == 0)
femaleTrainData = binaryTrainData.filter(lambda lp: lp.label != 0)
print maleTrainData.count(), femaleTrainData.count()

trainDataWithEqualSamples = maleTrainData.union(femaleTrainData.sample(False, 1.0 * maleTrainData.count() / femaleTrainData.count()))
print trainDataWithEqualSamples.count()
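
Down-sampling the majority class can be sketched in plain Python. As with `sample(False, fraction)`, each majority item is kept independently with the given probability, so the class sizes match only in expectation. `downsample_majority` and the sample data are hypothetical.

```python
import random

def downsample_majority(minority, majority, seed=0):
    # Keep each majority item with probability len(minority) / len(majority),
    # so the expected class sizes match (the exact count varies by draw).
    fraction = float(len(minority)) / len(majority)
    rng = random.Random(seed)
    kept = [item for item in majority if rng.random() < fraction]
    return minority + kept

males = [("m", i) for i in range(100)]
females = [("f", i) for i in range(400)]
balanced = downsample_majority(males, females)
print(len(balanced))
```

All minority samples are retained; only the majority class loses data, which is the main cost of this approach compared with per-sample weighting.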

In [72]:
# Logistic Regression with SGD trained on the balanced (down-sampled) dataset.
model_with_equal_weights = LogisticRegressionWithSGD.train(
  trainDataWithEqualSamples, 
  iterations=500, 
  step=10.0,
  regParam=1e-03, 
  regType='l2',
  intercept=True)

print_stats(model_with_equal_weights, binaryValData)

In [73]:
# ROC plot on validation data for the logistic regression model trained on balanced data.
fig = plot_roc(model_with_equal_weights, binaryValData)
display(fig)

In [74]:
# Run bestModel on binaryTestData and show the final result.
print_stats(bestModel, binaryTestData)

In [75]:
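# Features with the largest weights in the best model. This shows which
# features are most important to our model.
# Order the (feature, value) pairs by OHE index so they line up with the weights.
ohe_features_by_index = sorted(oheDict, key=oheDict.get)
features_with_weight = list(zip(ohe_features_by_index, bestModel.weights))

# MALE is label 0, so the most negative weights point towards males.
features_with_weight.sort(key=lambda x: x[1])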
top_features_for_male = features_with_weight[:10]
print "Top feature for males:"
for feature, weight in top_features_for_male:
  print feature, weight

features_with_weight.sort(key=lambda x: x[1], reverse=True)
top_features_for_female = features_with_weight[:10]
print "Top feature for females:"
for feature, weight in top_features_for_female:
  print feature, weight
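
How the sign of a weight relates to the two labels can be seen on a toy list. The (feature, weight) pairs below are invented; real weights come from the trained model.

```python
# Hypothetical (feature, weight) pairs; real weights come from the trained model.
features_with_weight = [
    (("business_state", "PA"), -0.8),
    (("Has TV", True), -0.3),
    (("By Appointment Only", True), 0.6),
    (("business_stars", 4.5), 0.2),
]

# With MALE = 0 and FEMALE = 1, negative weights push toward MALE
# and positive weights push toward FEMALE.
by_weight = sorted(features_with_weight, key=lambda fw: fw[1])
top_male = by_weight[:2]
top_female = by_weight[::-1][:2]
print(top_male)
print(top_female)
```

A single ascending sort is enough: the head of the list gives the strongest MALE indicators and the tail gives the strongest FEMALE indicators.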

In [76]:
# Analysing where our model goes wrong.
wrong_predictions = binaryTestData.map(lambda lp: (float(bestModel.predict(lp.features)), lp)).filter(lambda (pred, lp): pred != lp.label)
female_false_positives = wrong_predictions.filter(lambda (pred, lp): pred == 1.0).count()
male_false_positives = wrong_predictions.filter(lambda (pred, lp): pred == 0.0).count()

# Very few male false positives and large number of female false positives.
print male_false_positives, female_false_positives

In [77]:
# Some samples of wrong predictions
# Build the index-to-feature lookup once instead of collecting the RDD per sample.
ohe_features_by_index = sorted(oheDict, key=oheDict.get)
for wrong in wrong_predictions.take(5):
  prediction = wrong[0]
  label = wrong[1].label
  features = wrong[1].features
  
  # SparseVector.indices lists the positions of the active (non-zero) OHE features.
  print "Label: {}, Prediction: {}".format(label, prediction)
  for idx in features.indices:
    print ohe_features_by_index[idx]
  print ""

UNSUCCESSFUL: Here is a failed attempt to handle the skewness of the dataset using the sample-weight feature of the LogisticRegression model in the ml library. The code ran very slowly, and I abandoned it to avoid spending too much time running the model.

```
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.types import StructType, DoubleType, StructField

schema = StructType([
    StructField("label", DoubleType(), True),
    StructField("features", VectorUDT(), True)
])

def to_ml_vector(sv):
  return Vectors.sparse(sv.size, sv.indices, sv.values)

train_df = spark.createDataFrame(binaryTrainData.map(lambda lp: (lp.label, to_ml_vector(lp.features))), schema)
val_df = spark.createDataFrame(binaryValData.map(lambda lp: (lp.label, to_ml_vector(lp.features))), schema)
test_df = spark.createDataFrame(binaryTestData.map(lambda lp: (lp.label, to_ml_vector(lp.features))), schema)

# Since we will be using TrainValidationSplit to fit our model,
# we merge our validation and train data.
train_df = train_df.unionAll(val_df)
print train_df.count()

# We will be doing hyperparameter tuning on logistic regression since the
# model is simple, fast to execute and has given results almost as good as any
# other method. We use the non-deprecated LogisticRegression class.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder

lr = LogisticRegression()
paramGrid = ParamGridBuilder() \
  .addGrid(lr.regParam, [1.0, 1e-3, 1e-6]) \
  .build()
  
# Use a classification metric (area under ROC by default) to pick the best
# model; RegressionEvaluator would score the class predictions with RMSE.
tvs = TrainValidationSplit(
  estimator=lr,
  estimatorParamMaps=paramGrid,
  evaluator=BinaryClassificationEvaluator(),
  trainRatio=0.8)

tuned_model = tvs.fit(train_df)

# Let us check how tuning has helped
tuned_result = tuned_model.transform(test_df)
# computeLogLoss expects the predicted probability of label 1.
log_loss = tuned_result.rdd.map(lambda row: computeLogLoss(row.probability[1], row.label)).mean()
print "log_loss: {}".format(log_loss)

# We know from the earlier analysis that our data is highly skewed towards FEMALE.
# Let's try assigning weights to the training dataset and see if it performs better.
from pyspark.sql import functions as F

total_count = train_df.count()
count_by_gender = dict((int(k), v) for k, v in train_df.groupby("label").count().collect())
print count_by_gender

male_ratio = 1.0 * count_by_gender[0] / total_count
female_ratio = 1.0 * count_by_gender[1] / total_count
print male_ratio, female_ratio

# Weight each class by the other class's share so the minority class counts more.
train_df_with_weights = train_df.withColumn("weights", F.when(train_df.label == 0, female_ratio).otherwise(male_ratio))
print train_df_with_weights.head()

lr_with_weights = LogisticRegression(weightCol="weights")
lr_model_with_weights = lr_with_weights.fit(train_df_with_weights)


result_with_weights = lr_model_with_weights.transform(test_df)
# computeLogLoss expects the predicted probability of label 1.
log_loss = result_with_weights.rdd.map(lambda row: computeLogLoss(row.probability[1], row.label)).mean()
print "log_loss: {}".format(log_loss)
```
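
A common way to counter class skew with sample weights is to weight each class inversely to its frequency. The sketch below shows the arithmetic for the binary case with invented counts; `inverse_frequency_weights` is a hypothetical helper, not a Spark API.

```python
def inverse_frequency_weights(count_by_label):
    # Weight each class by the share of the *other* class, so that
    # count * weight is the same for both labels in the binary case.
    total = float(sum(count_by_label.values()))
    return {label: 1.0 - count / total
            for label, count in count_by_label.items()}

# Invented counts: label 0 (male) is the minority.
counts = {0: 400, 1: 600}
weights = inverse_frequency_weights(counts)
print(weights)
print(counts[0] * weights[0], counts[1] * weights[1])
```

With these weights the two classes contribute equally to the weighted loss, without discarding any training samples.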

# ** Part 6: Conclusions **

Please write your answer here. Add additional IPython code/markup cells as needed. Please see the grading rubric at the top of this notebook to understand expectations from this section.

This should be a short final section stating whether the methods you explored on Yelp dataset were able to satisfactorily solve the problem you set out to solve. Discuss any business implications of the performance metrics you obtained such as accuracy, RMSE, runtime, etc. Finally, state if you think this implementation is ready for production-deployment or if there are kinks that need to be worked through before it is usable.

## Observations

* Even though LogisticRegressionWithSGD is a very simple model, it seems to work as well on this problem as any other model.
* Hyperparameter tuning seems to help very little. As we increase the number of iterations or reduce regParam, the runtime to train the model keeps increasing.
* Handling bias by down-sampling the majority class doesn't seem to help in this case.
  * Some other technique for handling skewed datasets, such as assigning a weight to each sample, might work better.
* Finally, the accuracy on the test dataset is ~62.5% (slightly better than the ~61% on the validation dataset).
  * This suggests that the model is not overfitting to the train/validation dataset.
* From the top features, it looks like businesses that are in PA or have a TV (and other things listed above) seem to get many more reviews from males, while businesses that close late at night, have a high rating, and specify whether they take appointments seem to get many more reviews from females.
* Very few male false positives and a high number of female false positives suggest that most of the error is due to training data skewness. If we can handle that well in our model, we can improve its accuracy.
* Another option we could have pursued is to include the UNKNOWN gender data in our training dataset using one-vs-rest or other multiclass classification techniques.
  * For example, to use the one-vs-rest technique, we train one classifier per class, with the samples of that class as positive (1) samples and all other samples as negatives (0). For gender classification into 3 categories (Female, Male, and Unknown), we build a classifier for each of the 3 classes and, when predicting, use the class with the maximum score among the three classifiers.
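
The one-vs-rest prediction step described in the last bullet can be sketched as follows. The three scorers are hypothetical stand-ins for trained binary classifiers, and `one_vs_rest_predict` is an illustrative helper rather than an MLlib function.

```python
def one_vs_rest_predict(scorers, features):
    # scorers maps each class name to a function returning a confidence score;
    # the predicted class is the one whose scorer is most confident.
    return max(scorers, key=lambda label: scorers[label](features))

# Hypothetical stand-in scorers; real ones would be trained binary classifiers.
scorers = {
    "MALE": lambda x: 0.2 * x[0],
    "FEMALE": lambda x: 0.5 * x[1],
    "UNKNOWN_GENDER": lambda x: 0.1,
}

print(one_vs_rest_predict(scorers, [1.0, 1.0]))
print(one_vs_rest_predict(scorers, [1.0, 0.0]))
```

This only works well if the per-class scores are comparable, for example raw probabilities from models of the same type trained with the threshold cleared.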

In conclusion, based on the accuracy results for the best classifier selected, I offer the following
misclassification analysis: overall, our classifier did worse at classifying men than women. The major factors I observed were:

* Overall, female-associated business reviews had a much higher weight than male-associated business reviews.
* Also, the population of Yelp users who actually wrote reviews was highly skewed towards the female gender.
* Another reason is that, apart from the two gender categories of female and male, a significant number of business review observations did not have a particular gender (those whose gender was UNKNOWN).