[View in Colaboratory](https://colab.research.google.com/github/jagatfx/turicreate-colab/blob/master/turicreate_classify_data.ipynb)

# Classify Data
*   https://apple.github.io/turicreate/docs/userguide/supervised-learning/classifier.html
*   https://turi.com/learn/userguide/supervised-learning/classifier.html

## Classification
Classification is a core task in machine learning. 

The pieces of information fed to a classifier for each data point are called features, and the category they belong to is a ‘target’ or ‘label’. Typically, the classifier is given data points with both features and labels, so that it can learn the correspondence between the two. 

Later, the classifier is queried with a data point and predicts the category it belongs to. A large group of these query data points constitutes a prediction set, and the classifier is usually evaluated on its accuracy: the fraction of prediction queries it gets correct.

Currently, the following models are supported for classification:

*   [Logistic regression](https://apple.github.io/turicreate/docs/userguide/supervised-learning/logistic-regression.html)
*   [Nearest neighbor classifier](https://apple.github.io/turicreate/docs/userguide/supervised-learning/knn_classifier.html)
*   [Support vector machines](https://apple.github.io/turicreate/docs/userguide/supervised-learning/svm.html) (SVM)
*   [Boosted Decision Trees](https://apple.github.io/turicreate/docs/userguide/supervised-learning/boosted_trees_classifier.html)
*   [Random Forests](https://apple.github.io/turicreate/docs/userguide/supervised-learning/random_forest_classifier.html)
*   [Decision Tree](https://apple.github.io/turicreate/docs/userguide/supervised-learning/decision_tree_classifier.html)
*   [Image Classifier](https://apple.github.io/turicreate/docs/userguide/image_classifier/)

These algorithms differ in how they make predictions, but conform to the same API. With all models, call
*   create() to create a model
*   predict() to make flexible predictions with the returned model
*   classify() to get the predicted class and its probability for each data point
*   evaluate() to measure performance of the predictions

Models can incorporate:

*   Numeric features
*   Categorical variables
*   Dictionary features (i.e., sparse features)
*   List features (i.e., dense arrays)
*   Text data
*   Images (using the image classifier)

## Model Selector

It isn't always clear which model is suitable for a given task. Turi Create's model selector automatically picks a suitable model for you based on statistics collected from the data set. You can also manually create a model with a particular classifier.



# Google Drive Access

You will be asked to click a link to generate a secret key to access your Google Drive. 

Copy and paste the secret key into the space provided in the notebook.

In [2]:
# Install a Drive FUSE wrapper.
# https://github.com/astrada/google-drive-ocamlfuse
!apt-get update -qq 2>&1 > /dev/null
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse

Preconfiguring packages ...
Selecting previously unselected package cron.
(Reading database ... 18408 files and directories currently installed.)
Preparing to unpack .../00-cron_3.0pl1-128ubuntu5_amd64.deb ...
Unpacking cron (3.0pl1-128ubuntu5) ...
Selecting previously unselected package libapparmor1:amd64.
Preparing to unpack .../01-libapparmor1_2.11.0-2ubuntu17.1_amd64.deb ...
Unpacking libapparmor1:amd64 (2.11.0-2ubuntu17.1) ...
Selecting previously unselected package libdbus-1-3:amd64.
Preparing to unpack .../02-libdbus-1-3_1.10.22-1ubuntu1_amd64.deb ...
Unpacking libdbus-1-3:amd64 (1.10.22-1ubuntu1) ...
Selecting previously unselected package dbus.
Preparing to unpack .../03-dbus_1.10.22-1ubuntu1_amd64.deb ...
Unpacking dbus (1.10.22-1ubuntu1) ...
Selecting previously unselected package dirmngr.
Preparing to unpack .../04-dirmngr_2.1.15-1ubuntu8.1_amd64.deb ...
Unpacking dirmngr (2.1.15-1ubuntu8.1) ...
Selecting previously unselected package distro-info-data.
Preparing to unpack .

In [0]:
# Generate auth tokens for Colab
from google.colab import auth
auth.authenticate_user()

In [4]:
# Generate creds for the Drive FUSE library.
from google.colab import output
from oauth2client.client import GoogleCredentials
import time
creds = GoogleCredentials.get_application_default()
import getpass
# Determine if Drive Fuse credential setup is already complete.
fuse_credentials_configured = False
with output.temporary():
  !google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1
  # _exit_code is set to the result of the last "!" command.
  fuse_credentials_configured = _exit_code == 0

# Sleep for a short period to ensure that the previous output has been cleared.
time.sleep(1)
  
if fuse_credentials_configured:
  print('Drive FUSE credentials already configured!')
else:
  # Work around misordering of STREAM and STDIN in Jupyter.
  # https://github.com/jupyter/notebook/issues/3159
  prompt = !google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
  vcode = getpass.getpass(prompt[0] + '\n\nEnter verification code: ')
  !echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}


Please, open the following URL in a web browser: https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&response_type=code&access_type=offline&approval_prompt=force

Enter verification code: ··········
Please, open the following URL in a web browser: https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&response_type=code&access_type=offline&approval_prompt=force
Please enter the verification code: Access token retrieved correctly.


In [0]:
# Create a directory and mount Google Drive using that directory.
!mkdir -p drive
!google-drive-ocamlfuse drive

In [13]:
!ls

adc.json  drive  sample_data  wget-log	yelp-data.csv


# Setup GPU

In [14]:
!apt install libnvrtc8.0
!pip uninstall -y mxnet-cu80 && pip install mxnet-cu80==1.1.0
!pip install turicreate

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  libnvrtc8.0
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 6,225 kB of archives.
After this operation, 28.3 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu artful/multiverse amd64 libnvrtc8.0 amd64 8.0.61-1 [6,225 kB]
Fetched 6,225 kB in 1s (3,898 kB/s)
Selecting previously unselected package libnvrtc8.0:amd64.
(Reading database ... 19845 files and directories currently installed.)
Preparing to unpack .../libnvrtc8.0_8.0.61-1_amd64.deb ...
Unpacking libnvrtc8.0:amd64 (8.0.61-1) ...
Setting up libnvrtc8.0:amd64 (8.0.61-1) ...
Processing triggers for libc-bin (2.26-0ubuntu2.1) ...
Skipping mxnet-cu80 as it is not installed.
Collecting mxnet-cu80==1.1.0
  Downloading https://files.pythonhosted.org/packages/9c/55/bcfd26fd408a4bab27bca1ef5dc1df42954509c904699a6c371d5a4c23ab/

# Fetch Example Data

*   UCI Mushroom data: https://archive.ics.uci.edu/ml/datasets/mushroom
*   Yelp review data: https://www.yelp.com/dataset

In [15]:
!if [ -f "/content/drive/Colab Notebooks/data/yelp-data.csv.zip" ]; then echo "already downloaded yelp data, copying to workspace" && cp "/content/drive/Colab Notebooks/data/yelp-data.csv.zip" . && unzip yelp-data.csv.zip; else echo "downloading yelp..." && mkdir -p "/content/drive/Colab Notebooks/data" && wget "https://static.turi.com/datasets/regression/yelp-data.csv"; fi

already downloaded yelp data, copying to workspace
Archive:  yelp-data.csv.zip
replace yelp-data.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: yelp-data.csv           
   creating: __MACOSX/
  inflating: __MACOSX/._yelp-data.csv  


In [16]:
!if [ -f "/content/drive/Colab Notebooks/data/mushroom.csv" ]; then echo "already downloaded mushroom data, copying to workspace" && cp "/content/drive/Colab Notebooks/data/mushroom.csv" .; else echo "downloading mushroom..." && mkdir -p "/content/drive/Colab Notebooks/data" && wget "https://static.turi.com/datasets/xgboost/mushroom.csv"; fi

downloading mushroom...

Redirecting output to ‘wget-log.1’.


# Setup Turi Create

In [0]:
import mxnet as mx
import turicreate as tc

In [0]:
# Use all GPUs (default)
tc.config.set_num_gpus(-1)

# Use only 1 GPU
#tc.config.set_num_gpus(1)

# Use CPU
#tc.config.set_num_gpus(0)

# Logistic Regression Classifier

Logistic regression is a regression model that is popularly used for classification tasks. In logistic regression, the probability that a binary target is True is modeled as a logistic function of a linear combination of features.

## Background

Given a set of features and a label, logistic regression interprets the probability that the label is in one class as a logistic function of a linear combination of the features.
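
As a compact way to read the sentence above, the standard logistic regression formula can be written as follows. The notation is an assumption introduced here, not taken from the notebook: $x_i$ is the feature vector of example $i$, $y_i \in \{0, 1\}$ is its label, and $\theta$ is the vector of learned coefficients.

$$
p(y_i = 1 \mid x_i) = \frac{1}{1 + \exp(-\theta^{T} x_i)}
$$

The linear combination $\theta^{T} x_i$ can be any real number, and the logistic function squashes it into the interval $(0, 1)$ so that it can be read as a probability.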

The following figure illustrates how logistic regression is used to train a 1-dimensional classifier. The training data consists of positive examples (depicted in blue) and negative examples (in orange). The decision boundary (depicted in pink) separates out the data into two classes.

![logistic regression](https://i.imgur.com/56D0V9f.png)

Logistic regression predictions can take one of three forms:

*   Classes (default): Thresholds the probability estimate at 0.5 to predict a class label, i.e. 0 or 1.
*   Probabilities: A probability estimate (in the range [0,1]) that the example is in the True class. Note that this is not the same as the probability estimate in the classify function, which reports the probability of the predicted class.
*   Margins: Distance to the linear decision boundary learned by the model. The larger the distance, the more confidence we have that it belongs to one class or the other.
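
The relationship between the three forms can be sketched in plain Python. This is a minimal illustration rather than Turi Create's implementation: the helper names `logistic` and `predict_forms`, the coefficients, and the feature values are all hypothetical, and the margin is taken to be the raw linear score (which is proportional to the distance from the decision boundary).

```python
import math

def logistic(margin):
    """Map a real-valued margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

def predict_forms(weights, intercept, features, threshold=0.5):
    """Return (margin, probability, class) for a single example."""
    # Margin: the linear combination of the features (signed score).
    margin = intercept + sum(w * x for w, x in zip(weights, features))
    # Probability: the logistic function applied to the margin.
    probability = logistic(margin)
    # Class: threshold the probability (0.5 by default).
    predicted_class = 1 if probability >= threshold else 0
    return margin, probability, predicted_class

# Hypothetical coefficients and feature values, for illustration only.
weights = [1.2, 0.8]
intercept = -6.0
margin, probability, predicted_class = predict_forms(weights, intercept, [4.0, 3.5])
print(margin, probability, predicted_class)
```

A positive margin always corresponds to a probability above 0.5 and therefore to class 1, so the three outputs carry the same decision at different levels of detail.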

## Example for Yelp Reviews

Let's construct a binary target variable. In this example, we will predict if a restaurant is good or bad, with 1 and 2 star ratings indicating a bad business and 3-5 star ratings indicating a good one. We will use the following features.

*   Average rating of a given business
*   Average rating made by a user
*   Number of reviews made by a user
*   Number of reviews that concern a business

In [19]:
# Load the data
ydata = tc.SFrame('yelp-data.csv')
print(ydata)

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,int,str,str,str,dict,int,int,int,list,str,str,float,float,str,int,int,float,str,str,float,str,int,str,int,int,int,dict]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


business_id,date,review_id,stars,text,type
9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for break ...,review
ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad reviews ...,review
6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I also ...,review
_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!! ...",review
6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!! ...,review
-yxfBYGB6SEqszmxJxd97A,2007-12-13,m2CKSsepBCoRYWxiRUsxAg,4,"Quiessence is, simply put, beautiful. Full ...",review
zp713qNhx8d9KCJJnrw1xA,2010-02-12,riFQ3vxNpP4rWLk_CSri2A,5,Drop what you're doing and drive here. After I ...,review
hW0Ne_HTHEAgGF1rAdmR-g,2012-07-12,JL7GXJ9u4YMx7Rzs05NfiQ,4,"Luckily, I didn't have to travel far to make my ...",review
wNUea3IXZWD63bbOQaOH-g,2012-08-17,XtnfnYmnJYi71yIuGsXIUA,4,Definitely come for Happy hour! Prices are amaz ...,review
nMHhuYan8e3cONo3PornJA,2010-08-11,jJAIXA46pU1swYyRCdfXtQ,5,Nobuo shows his unique talents with everything ...,review

user_id,votes,year,month,day,categories,city
rLtl8ZkDX5vH5nAx9C3q5Q,"{'funny': 0, 'useful': 5, 'cool': 2} ...",2011,1,26,"[Breakfast & Brunch, Restaurants] ...",Phoenix
0a2KyEL0d3Yb1V6aivbIuQ,"{'funny': 0, 'useful': 0, 'cool': 0} ...",2011,7,27,"[Italian, Pizza, Restaurants] ...",Phoenix
0hT2KtfLiobPvh6cDC8JQg,"{'funny': 0, 'useful': 1, 'cool': 0} ...",2012,6,14,"[Middle Eastern, Restaurants] ...",Tempe
uZetl9T0NcROGOyFfughhg,"{'funny': 0, 'useful': 2, 'cool': 1} ...",2010,5,27,"[Active Life, Dog Parks, Parks] ...",Scottsdale
vYmM4KTsC8ZfQBg-j5MWkw,"{'funny': 0, 'useful': 0, 'cool': 0} ...",2012,1,5,"[Tires, Automotive]",Mesa
sqYN3lNgvPbPCTRsMFu27g,"{'funny': 1, 'useful': 3, 'cool': 4} ...",2007,12,13,"[Wine Bars, Bars, American (New), ...",Phoenix
wFweIWhv2fREZV_dYkz_1g,"{'funny': 4, 'useful': 7, 'cool': 7} ...",2010,2,12,"[Mexican, Restaurants]",Phoenix
1ieuYcKS7zeAv_U15AB13A,"{'funny': 0, 'useful': 1, 'cool': 0} ...",2012,7,12,"[Hotels & Travel, Airports] ...",Phoenix
Vh_DlizgGhSqQh4qfZ2h6A,"{'funny': 0, 'useful': 0, 'cool': 0} ...",2012,8,17,"[Sushi Bars, Restaurants]",Phoenix
sUNkXg8-KFtCMQDV6zRzQg,"{'funny': 0, 'useful': 1, 'cool': 0} ...",2010,8,11,"[Food, Tea Rooms, Japanese, Restaurants] ...",Phoenix

full_address,latitude,longitude,name,open,business_review_count,business_avg_stars
"6106 S 32nd St\nPhoenix, AZ 85042 ...",33.3908,-112.013,Morning Glory Cafe,1,116,4.0
"4848 E Chandler Blvd\nPhoenix, AZ 85044 ...",33.3056,-111.979,Spinato's Pizzeria,1,102,4.0
"1513 E Apache Blvd\nTempe, AZ 85281 ...",33.4143,-111.913,Haji-Baba,1,265,4.5
"5401 N Hayden Rd\nScottsdale, AZ 85250 ...",33.5229,-111.908,Chaparral Dog Park,1,88,4.5
"1357 S Power Road\nMesa, AZ 85206 ...",33.391,-111.684,Discount Tire,1,5,4.5
"6106 S 32nd St\nPhoenix, AZ 85042 ...",33.3908,-112.013,Quiessence Restaurant,1,109,3.5
"1919 N 16th St\nPhoenix, AZ 85006 ...",33.4691,-112.048,La Condesa Gourmet Taco Shop ...,1,307,4.0
"3400 E Sky Harbor Blvd\nPhoenix, AZ 85034 ...",33.4348,-112.006,Phoenix Sky Harbor International Airport ...,1,862,3.0
"2574 E Camelback Rd\nPhoenix, AZ 85016 ...",33.5096,-112.026,Stingray Sushi,1,163,3.0
"622 E Adams St\nPhoenix, AZ 85004 ...",33.4495,-112.066,Nobuo At Teeter House,1,189,4.5

state,business_type,user_avg_stars,user_name,user_review_count,user_type,votes_funny,votes_cool,votes_useful
AZ,business,3.72,Jason,376,user,331,322,1034
AZ,business,5.0,Paul,2,user,2,0,0
AZ,business,4.33,Nicole,3,user,0,0,3
AZ,business,4.29,lindsey,31,user,18,36,75
AZ,business,3.25,Roger,28,user,3,8,32
AZ,business,3.54,Deborah,654,user,743,1121,1584
AZ,business,3.79,Monique,295,user,1187,1200,1376
AZ,business,3.42,Heather,173,user,85,89,164
AZ,business,3.56,Sherri,18,user,9,8,11
AZ,business,4.17,Mark,6,user,0,3,4

categories_dict
"{'Breakfast & Brunch': 1, 'Restaurants': 1} ..."
"{'Italian': 1, 'Pizza': 1, 'Restaurants': 1} ..."
"{'Middle Eastern': 1, 'Restaurants': 1} ..."
"{'Dog Parks': 1, 'Parks': 1, 'Active Life': 1} ..."
"{'Tires': 1, 'Automotive': 1} ..."
"{'Bars': 1, 'American (New)': 1, 'Nightlife': ..."
"{'Mexican': 1, 'Restaurants': 1} ..."
"{'Airports': 1, 'Hotels & Travel': 1} ..."
"{'Sushi Bars': 1, 'Restaurants': 1} ..."
"{'Food': 1, 'Tea Rooms': 1, 'Japanese': 1, ..."


In [0]:
# Restaurants with rating >=3 are good
ydata['is_good'] = ydata['stars'] >= 3

In [39]:
# Make a train-test split
train_data, test_data = ydata.random_split(0.8)
print(test_data[0])

{'business_id': '9yKzy9PApeiPPOUJEtnvkg', 'date': '2011-01-26', 'review_id': 'fWKvX83p0-ka4JS3dc6E5A', 'stars': 5, 'text': 'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I 

In [41]:
# Create a model
model = tc.logistic_classifier.create(train_data, target='is_good',
                                    features = ['user_avg_stars',
                                                'business_avg_stars',
                                                'user_review_count',
                                                'business_review_count'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [42]:
# Save predictions (class and probability estimates) to an SFrame
predictions = model.classify(test_data)
print(predictions)

class,probability
1,0.9352182507518976
1,0.8601645079588497
1,0.9612867897113871
1,0.8379834309295088
1,0.9880353703324491
1,0.9877780626948478
1,0.8712578317287561
1,0.8994915127423796
1,0.9446488884450592
1,0.9973268126220596


In [43]:
# Evaluate the model and save the results into a dictionary
results = model.evaluate(test_data)
print(results)

{'accuracy': 0.8658375904394258,
 'auc': 0.8322696873681188,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        0        |  948  |
 |      0       |        0        |  2302 |
 |      0       |        1        |  4856 |
 |      1       |        1        | 35155 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.9237459600073573,
 'log_loss': 0.3301879520996757,
 'precision': 0.878633375821649,
 'recall': 0.973741794310722,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+-------+------+
 | threshold | fpr | tpr |   p   |  n   |
 +-----------+-----+-----+-------+------+
 |    0.0    | 1.0 | 1.0 | 36103 | 7158 |
 |   1e-05   | 1.0 | 1.0 | 36103 | 7158 |
 |   2e-05   | 1

In [44]:
print("Accuracy         : %s" % results['accuracy'])
print("Confusion Matrix : \n%s" % results['confusion_matrix'])

Accuracy         : 0.8658375904394258
Confusion Matrix : 
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      1       |        0        |  948  |
|      0       |        0        |  2302 |
|      0       |        1        |  4856 |
|      1       |        1        | 35155 |
+--------------+-----------------+-------+
[4 rows x 3 columns]
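
To see how the summary metrics relate to the confusion matrix, they can be recomputed by hand from the four counts printed above. This is a small sketch in plain Python (Python 3 division assumed); the variable names are chosen here for illustration, and class 1 ("good") is treated as the positive class.

```python
# Counts copied from the confusion matrix printed above,
# keyed by (target_label, predicted_label).
confusion = {
    (1, 0): 948,    # false negatives: good businesses predicted bad
    (0, 0): 2302,   # true negatives
    (0, 1): 4856,   # false positives: bad businesses predicted good
    (1, 1): 35155,  # true positives
}

tp = confusion[(1, 1)]
tn = confusion[(0, 0)]
fp = confusion[(0, 1)]
fn = confusion[(1, 0)]
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * precision * recall / (precision + recall)

print("accuracy : %.4f" % accuracy)
print("precision: %.4f" % precision)
print("recall   : %.4f" % recall)
print("f1_score : %.4f" % f1_score)
```

The values agree with the `accuracy`, `precision`, `recall`, and `f1_score` entries returned by `evaluate()`. They also show that most of the errors are false positives, which is expected when the positive class dominates the data.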



Using basic SFrame operations, we can also isolate the examples in the test data where the model made mistakes:

In [45]:
predictions = model.predict(test_data)
print(predictions)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... ]


In [47]:
# Compute a boolean SArray of whether or not the model was right
mistakes_filter = predictions != test_data[model.target]
correct_filter = predictions == test_data[model.target]

# Apply the logical filter on the data
mistakes = test_data[mistakes_filter]
correct = test_data[correct_filter]

print(mistakes[0])

{'business_id': 'vvA3fbps4F9nGlAEYKk_sA', 'date': '2012-05-04', 'review_id': 'S9OVpXat8k5YwWCn6FAgXg', 'stars': 1, 'text': "Disgusting!  Had a Groupon so my daughter and I tried it out.  Very outdated and gaudy 80's style interior made me feel like I was in an episode of Sopranos.  The food itself was pretty bad.  We ordered pretty simple dishes but they just had no flavor at all!  After trying it out I'm positive all the good reviews on here are employees or owners creating them.", 'type': 'review', 'user_id': '8AMn6644NmBf96xGO3w6OA', 'votes': {'funny': 0, 'useful': 1, 'cool': 0}, 'year': 2012, 'month': 5, 'day': 4, 'categories': ['Restaurants', 'Italian'], 'city': 'Phoenix', 'full_address': '2728 E Thomas Rd\nPhoenix, AZ 85016', 'latitude': 33.4805, 'longitude': -112.022, 'name': 'Avanti Restaurant of Distinction', 'open': 1, 'business_review_count': 50, 'business_avg_stars': 4.0, 'state': 'AZ', 'business_type': 'business', 'user_avg_stars': 4.0, 'user_name': 'Lori', 'user_review_co

In [48]:
# Compute boolean filters
false_positive_filter = (predictions == 1) & (test_data[model.target] == 0)
false_negative_filter = (predictions == 0) & (test_data[model.target] == 1)

false_negatives = test_data[false_negative_filter]
print(false_negatives[0])

{'business_id': '06kfoeRs9Acj82Yl3i9p_w', 'date': '2010-07-03', 'review_id': 'g0PIxO362sPzzvoYXNU4KQ', 'stars': 5, 'text': 'Recently moved back to Mesa, I have been on the hunt for a place to get a good sandwich and fresh bread. This place now tops my list of places in the East Valley to get a good sandwich. My first sandwich was the Atlantic Haddock Provencal. A sandwich made with seared, lightly breaded haddock, sliced hard-boiled egg, fresh basil, romaine lettuce, tomato, Provencal tartar sauce on a grilled Brioche roll. The brioche was amazing. It was light and fluffy. The haddock was perfectly cooked with just the right amount of breading, and the vegetables were fresh and in perfect proportions as to not overpower the bread and haddock. Sometimes when you go to a sandwich shop they overload it with everything else that the meat and bread are just there to accessorize the vegetables and sauce. This was perfectly done. The atmosphere is pleasant and open with nice music as the ligh

In [49]:
false_positives = test_data[false_positive_filter]
print(false_positives[0])

{'business_id': 'vvA3fbps4F9nGlAEYKk_sA', 'date': '2012-05-04', 'review_id': 'S9OVpXat8k5YwWCn6FAgXg', 'stars': 1, 'text': "Disgusting!  Had a Groupon so my daughter and I tried it out.  Very outdated and gaudy 80's style interior made me feel like I was in an episode of Sopranos.  The food itself was pretty bad.  We ordered pretty simple dishes but they just had no flavor at all!  After trying it out I'm positive all the good reviews on here are employees or owners creating them.", 'type': 'review', 'user_id': '8AMn6644NmBf96xGO3w6OA', 'votes': {'funny': 0, 'useful': 1, 'cool': 0}, 'year': 2012, 'month': 5, 'day': 4, 'categories': ['Restaurants', 'Italian'], 'city': 'Phoenix', 'full_address': '2728 E Thomas Rd\nPhoenix, AZ 85016', 'latitude': 33.4805, 'longitude': -112.022, 'name': 'Avanti Restaurant of Distinction', 'open': 1, 'business_review_count': 50, 'business_avg_stars': 4.0, 'state': 'AZ', 'business_type': 'business', 'user_avg_stars': 4.0, 'user_name': 'Lori', 'user_review_co

# Nearest Neighbor Classifier

## Background
The nearest neighbors classifier predicts the class of a data point to be the most common class among that point's neighbors.

Defining the criteria for the **neighborhood** of a prediction data point requires careful thought and domain knowledge. A function must be specified to measure the distance between any two data points, and then the size of "neighborhoods" relative to this distance function must be set.

For the first step, there are many standard distance functions (e.g. Euclidean, Jaccard, Levenshtein) that work well for data whose features are all of the same type, but for heterogeneous data the task is a bit trickier. Turi Create overcomes this problem with **composite distances**, which are weighted sums of standard distance functions applied to appropriate subsets of features. For more about distance functions in Turi Create, including composite distances, please see the [API documentation](https://apple.github.io/turicreate/docs/api/turicreate.toolkits.distances.html) for the distances module. The end of this chapter describes how to use a composite distance with the nearest neighbor classifier in particular.

Once the distance function is defined, the user must indicate the criteria for deciding when training data are in the "neighborhood" of a prediction point. This is done by setting two constraints:

1.  radius - the maximum distance a training example can be from the prediction point and still be considered a neighbor, and

2.   max_neighbors - the maximum number of neighbors for the prediction point. If there are more points within radius of the prediction point, the closest max_neighbors are used.

Unlike the other classifiers in the Turi Create classifier toolkit, the nearest neighbors classifier is an instance-based method, which means that the model must store all of the training data. For each prediction, the model must search the training data to find the neighbors of the prediction point. Turi Create performs this search intelligently, but predictions are nevertheless typically slower than with other classification models.
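
The interaction of `radius` and `max_neighbors` can be sketched with a brute-force search in plain Python. This is an illustrative sketch, not Turi Create's implementation: the function `knn_classify`, the toy points, and the labels are hypothetical, and the reported probability is simply the fraction of neighbors that voted for the winning class.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_points, train_labels, query,
                 max_neighbors=10, radius=None, distance=euclidean):
    """Return (predicted_class, fraction_of_neighbors) for one query point."""
    # Instance-based: every stored training point is compared to the query.
    scored = sorted(
        ((distance(point, query), label)
         for point, label in zip(train_points, train_labels)),
        key=lambda pair: pair[0])
    # Constraint 1: drop training points farther away than `radius`.
    if radius is not None:
        scored = [pair for pair in scored if pair[0] <= radius]
    # Constraint 2: keep only the closest `max_neighbors` points.
    neighbors = scored[:max_neighbors]
    if not neighbors:
        return None, 0.0  # no training point lies in the neighborhood
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbors)

# Toy data, for illustration only.
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9)]
train_labels = ["low", "low", "low", "high", "high"]

print(knn_classify(train_points, train_labels, (0.15, 0.15), max_neighbors=3))
print(knn_classify(train_points, train_labels, (0.15, 0.15), max_neighbors=5))
print(knn_classify(train_points, train_labels, (0.15, 0.15), radius=0.01))
```

With `max_neighbors=5` the two distant "high" points are allowed to vote, which lowers the winning fraction; a very small `radius` can leave the neighborhood empty, in which case this sketch returns no prediction.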

## Example for Yelp Ratings

Use the Yelp restaurant review data with the goal of predicting how many "stars" a user will give a particular business.

In [50]:
print(ydata)

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,int,str,str,str,dict,int,int,int,list,str,str,float,float,str,int,int,float,str,str,float,str,int,str,int,int,int,dict]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


+------------------------+------------+------------------------+-------+
|      business_id       |    date    |       review_id        | stars |
+------------------------+------------+------------------------+-------+
| 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A |   5   |
| ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA |   5   |
| 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ |   4   |
| _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA |   5   |
| 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw |   5   |
| -yxfBYGB6SEqszmxJxd97A | 2007-12-13 | m2CKSsepBCoRYWxiRUsxAg |   4   |
| zp713qNhx8d9KCJJnrw1xA | 2010-02-12 | riFQ3vxNpP4rWLk_CSri2A |   5   |
| hW0Ne_HTHEAgGF1rAdmR-g | 2012-07-12 | JL7GXJ9u4YMx7Rzs05NfiQ |   4   |
| wNUea3IXZWD63bbOQaOH-g | 2012-08-17 | XtnfnYmnJYi71yIuGsXIUA |   4   |
| nMHhuYan8e3cONo3PornJA | 2010-08-11 | jJAIXA46pU1swYyRCdfXtQ |   5   |
+------------------------+------------+------------

In [0]:
train_data, test_data = ydata.random_split(0.9)

### Features

The review count features are typically much larger than the average star features, which would cause the review counts to dominate standard numeric distance functions. To avoid this, we standardize the features before creating the model, using the mean and standard deviation of the training data for both the training and test sets.

In [0]:
numeric_features = ['user_avg_stars', 
                    'business_avg_stars', 
                    'user_review_count', 
                    'business_review_count']

for ftr in numeric_features:
    mean = train_data[ftr].mean()
    stdev = train_data[ftr].std()
    train_data[ftr] = (train_data[ftr] - mean) / stdev
    test_data[ftr] = (test_data[ftr] - mean) / stdev

In [59]:
print(train_data.groupby('stars', [tc.aggregate.COUNT]).sort("stars", ascending = False))

stars,Count
5,64771
4,67451
3,29634
2,17574
1,14753


In [60]:
print(test_data.groupby('stars', [tc.aggregate.COUNT]).sort("stars", ascending = False))

+-------+-------+
| stars | Count |
+-------+-------+
|   5   |  7333 |
|   4   |  7571 |
|   3   |  3263 |
|   2   |  1894 |
|   1   |  1635 |
+-------+-------+
[5 rows x 2 columns]



In [53]:
# create classifier
model2 = tc.nearest_neighbor_classifier.create(train_data, target='stars',
                                          features=numeric_features)

In [54]:
predictions2 = model2.classify(test_data, max_neighbors=20, radius=None)
print(predictions2)

+-------+-------------+
| class | probability |
+-------+-------------+
|   5   |     0.65    |
|   4   |     0.5     |
|   5   |     1.0     |
|   4   |     0.35    |
|   2   |     0.35    |
|   5   |     0.45    |
|   4   |     0.65    |
|   4   |     0.45    |
|   4   |     0.5     |
|   5   |     0.5     |
+-------+-------------+
[21696 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [55]:
# Get top k rows according to the given column to see the fraction of neighbors belonging to every target class
topk = model2.predict_topk(test_data[:5], max_neighbors=20, k=3)
print(topk)

+--------+-------+-------------+
| row_id | class | probability |
+--------+-------+-------------+
|   2    |   5   |     1.0     |
|   4    |   2   |     0.35    |
|   4    |   4   |     0.35    |
|   4    |   1   |     0.1     |
|   0    |   5   |     0.65    |
|   0    |   4   |     0.35    |
|   3    |   3   |     0.35    |
|   3    |   4   |     0.35    |
|   3    |   2   |     0.2     |
|   1    |   4   |     0.5     |
+--------+-------+-------------+
[12 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [56]:
# Evaluate the model’s predictive accuracy by predicting the target class for instances in a new dataset and comparing to known target values.
evals = model2.evaluate(test_data[:3000])
print(evals['accuracy'])



0.44866666666666666


The accuracy seems low, but remember that we are in a multi-class classification setting. The most common class (4 stars) only occurs in 34.9% of the test data (7,571 of 21,696 rows), so our model has indeed learned something. The confusion matrix produced by the evaluate method can help us to better understand the model performance. In this case we see that 83.5% of our predictions on the 3,000 evaluated rows are within 1 star of the true number of stars.

In [61]:
conf_matrix = evals['confusion_matrix']
conf_matrix['within_one'] = conf_matrix.apply(
    lambda x: abs(x['target_label'] - x['predicted_label']) <= 1)
num_within_one = conf_matrix[conf_matrix['within_one']]['count'].sum()
# Divide by the number of evaluated rows (test_data[:3000]), not len(test_data)
print(float(num_within_one) / conf_matrix['count'].sum())

0.835


Suppose we want to add the text column as a feature. One way to do this is to treat each entry as a "bag of words" by simply counting the number of times each word appears but ignoring the order (see the text analytics chapter for more detail).

In [62]:
train_data['word_counts'] = tc.text_analytics.count_words(train_data['text'],
                                                          to_lower=True)
test_data['word_counts'] = tc.text_analytics.count_words(test_data['text'],
                                                         to_lower=True)
print(train_data['text'][0])
print(train_data['word_counts'][0])

My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!
{'back': 1, 'to': 1, 'wait': 1, 'complete': 1, 'meal': 1, 'of': 1, 'pieces': 1, 'came': 1, 'delicious': 1, 'sk
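To make the bag-of-words idea concrete, here is a minimal pure-Python sketch. The function below is a hypothetical stand-in, not the Turi Create implementation: it assumes a simple regular-expression tokenizer, so its tokens may differ from those produced by `tc.text_analytics.count_words`.

```python
import re
from collections import Counter


def bag_of_words(text, to_lower=True):
    """Count how often each word occurs, discarding word order."""
    if to_lower:
        text = text.lower()
    # Assumed tokenizer: runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    return dict(Counter(tokens))


counts = bag_of_words("It was amazing. It was the best toast.")
print(counts)
```

Two reviews that use the same words in a different order produce the same dictionary, which is exactly the information that the bag-of-words representation throws away.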

The `weighted_jaccard` distance measures the difference between two sets, weighted by the counts of each element (see the API documentation for details). To combine it with the numeric distance we used above, we specify a composite distance: a list in which each element contains a list (or tuple) of feature names, a standard distance function name, and a numeric weight. The weight on each component can be adjusted to produce the same effect as normalizing features.

In [63]:
my_dist = [
    [numeric_features, 'euclidean', 1.0],
    [['word_counts'], 'weighted_jaccard', 1.0]
    ]

model2 = tc.nearest_neighbor_classifier.create(train_data, target='stars', distance=my_dist)

Defaulting to brute force instead of ball tree because there are multiple distance components.


In [64]:
accuracy = model2.evaluate(test_data[:3000], metric='accuracy')
print(accuracy)



{'accuracy': 0.4713333333333333}
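The composite distance used above can be sketched in plain Python. This is an illustration under two assumptions: that the weighted Jaccard distance between two count dictionaries is one minus the ratio of summed minimum counts to summed maximum counts, and that a composite distance is the weighted sum of its components. The feature names `votes` and `review_count` are hypothetical stand-ins for the contents of `numeric_features`.

```python
import math


def euclidean(row_a, row_b, features):
    """Euclidean distance over the named numeric features."""
    return math.sqrt(sum((row_a[f] - row_b[f]) ** 2 for f in features))


def weighted_jaccard(row_a, row_b, features):
    """1 - sum(min counts) / sum(max counts) over the union of keys."""
    a, b = row_a[features[0]], row_b[features[0]]
    keys = set(a) | set(b)
    overlap = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    union = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return 1.0 - overlap / union if union else 0.0


def composite_distance(row_a, row_b, components):
    """Weighted sum of the component distances."""
    return sum(weight * dist(row_a, row_b, features)
               for features, dist, weight in components)


a = {"votes": 3.0, "review_count": 10.0,
     "word_counts": {"great": 2, "food": 1}}
b = {"votes": 0.0, "review_count": 14.0,
     "word_counts": {"great": 1, "service": 1}}

equal_weights = [(["votes", "review_count"], euclidean, 1.0),
                 (["word_counts"], weighted_jaccard, 1.0)]
rescaled = [(["votes", "review_count"], euclidean, 0.1),
            (["word_counts"], weighted_jaccard, 1.0)]

print(composite_distance(a, b, equal_weights))
print(composite_distance(a, b, rescaled))
```

With equal weights the unbounded Euclidean term dominates the Jaccard term, which is bounded by 1. Lowering the Euclidean weight has an effect similar to normalizing the numeric features.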


# The Gradient Boosted Regression Tree

The Gradient Boosted Regression Trees (GBRT) model (also called Gradient Boosted Machine or GBM) is one of the most effective machine learning models for predictive analytics, making it an industrial workhorse for machine learning.

Its prediction is based on a collection of base learners, i.e. decision trees, which it combines through a technique called gradient boosting.

Unlike linear models such as logistic regression or SVM, gradient boosted trees can model non-linear interactions between the features and the target. The model handles numerical features and categorical features with tens of categories well, but is less suitable for highly sparse features (such as text data) or for categorical variables that encode a large number of categories.
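The idea of combining many small trees can be illustrated with a short sketch. It is not the Turi Create implementation: it assumes a single numeric feature, depth-one trees (stumps), and a squared-error loss on a regression target for simplicity, whereas a classifier would boost on a classification loss. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, max_iterations, step_size):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    trees = []
    preds = [base] * len(xs)
    for _ in range(max_iterations):
        # For squared error, the residuals point along the negative gradient.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + step_size * tree(x) for x, p in zip(xs, preds)]
    return lambda x: base + step_size * sum(t(x) for t in trees)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# A target that rises and then falls, which no single line can follow.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.0, 0.1, 0.9, 1.1, 3.0, 3.2, 1.0, 0.8]

mse_constant = mse(boost(xs, ys, 0, 0.5), xs, ys)
mse_one_tree = mse(boost(xs, ys, 1, 0.5), xs, ys)
mse_many_trees = mse(boost(xs, ys, 20, 0.5), xs, ys)
print(mse_constant, mse_one_tree, mse_many_trees)
```

Each stump alone is a weak model, but because every new stump is fitted to what the ensemble still gets wrong, the training error keeps shrinking as trees are added.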

## Example for Mushrooms

Although this dataset was originally contributed to the UCI Machine Learning repository nearly 30 years ago, mushroom hunting (otherwise known as "shrooming") is enjoying new peaks in popularity. Learn which features spell certain death and which are most palatable in this dataset of mushroom characteristics. And how certain can your model be?

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

This example uses the UCI Mushroom dataset: https://archive.ics.uci.edu/ml/datasets/mushroom

### Attribute Information: 

(classes: edible=e, poisonous=p)

cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s

cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s

cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y

bruises: bruises=t,no=f

odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s

gill-attachment: attached=a,descending=d,free=f,notched=n

gill-spacing: close=c,crowded=w,distant=d

gill-size: broad=b,narrow=n

gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y

stalk-shape: enlarging=e,tapering=t

stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?

stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s

stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s

stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

veil-type: partial=p,universal=u

veil-color: brown=n,orange=o,white=w,yellow=y

ring-number: none=n,one=o,two=t

ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z

spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y

population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y

habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

In [89]:
# Load the data
mdata = tc.SFrame('mushroom.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [90]:
# Label 'c' is edible
mdata['edible'] = mdata['label'] == 'c'
del mdata['label']
print(mdata[0])

{'cap-shape': 'x', 'cap-surface': 's', 'cap-color': 'n', 'bruises?': 't', 'odor': 'p', 'gill-attachment': 'f', 'gill-spacing': 'c', 'gill-size': 'n', 'gill-color': 'k', 'stalk-shape': 'e', 'stalk-root': 'e', 'stalk-surface-above-ring': 's', 'stalk-surface-below-ring': 's', 'stalk-color-above-ring': 'w', 'stalk-color-below-ring': 'w', 'veil-color': 'w', 'ring-number': 'o', 'ring-type': 'p', 'spore-print-color': 'k', 'population': 's', 'habitat': 'u', 'veil-type': 'p', 'edible': 1}


In [0]:
# Make a train-test split
train_data, test_data = mdata.random_split(0.8)

In [109]:
# Create a model.
model = tc.boosted_trees_classifier.create(train_data, target='edible',
                                           max_iterations=4,
                                           max_depth=3)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [110]:
model.get_feature_importance().sort("count", ascending=False)

name,index,count
odor,n,4
spore-print-color,r,4
stalk-surface-below-ring,y,3
bruises?,t,3
stalk-root,r,2
stalk-root,c,2
odor,l,1
stalk-color-below-ring,y,1
gill-spacing,c,1
odor,a,1


In [111]:
# Save predictions to an SFrame (class and corresponding class-probabilities)
predictions = model.classify(test_data)
print(predictions)

+-------+--------------------+
| class |    probability     |
+-------+--------------------+
|   0   | 0.8398730754852295 |
|   1   | 0.7440834045410156 |
|   1   | 0.7440834045410156 |
|   0   | 0.850181058049202  |
|   0   | 0.5245263874530792 |
|   0   | 0.8398730754852295 |
|   0   | 0.8374579846858978 |
|   0   | 0.850181058049202  |
|   0   | 0.850181058049202  |
|   0   | 0.850181058049202  |
+-------+--------------------+
[1653 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
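The two columns can be read together as in the sketch below. It assumes that `probability` is the probability of the predicted class (not of class 1), which is how the output above appears to behave. The helper name and the 0.6 cut-off are hypothetical choices for illustration.

```python
# (class, probability) pairs copied from the first rows printed above.
predictions = [
    (0, 0.8398730754852295),
    (1, 0.7440834045410156),
    (0, 0.5245263874530792),
]


def prob_of_class_one(predicted_class, probability):
    """Convert 'probability of the predicted class' into P(class == 1)."""
    return probability if predicted_class == 1 else 1.0 - probability


# Flag predictions that are only slightly more confident than a coin flip.
uncertain = [row for row in predictions if row[1] < 0.6]

print([round(prob_of_class_one(c, p), 3) for c, p in predictions])
print(uncertain)
```

Rows such as the one with probability 0.52 are worth inspecting before acting on them, especially when the cost of a wrong "edible" prediction is high.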


In [0]:
# Evaluate the model and save the results into a dictionary
results = model.evaluate(test_data)

In [113]:
print("Accuracy         : %s" % results['accuracy'])
print("Confusion Matrix : \n%s" % results['confusion_matrix'])

Accuracy         : 0.9921355111917726
Confusion Matrix : 
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      0       |        0        |  858  |
|      1       |        1        |  782  |
|      0       |        1        |   13  |
+--------------+-----------------+-------+
[3 rows x 3 columns]
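The reported accuracy can be recomputed directly from the confusion matrix above, along with per-class precision and recall. This is a plain-Python sketch using the three printed rows; the helper names are hypothetical.

```python
# (target_label, predicted_label) -> count, copied from the table above.
confusion = {(0, 0): 858, (1, 1): 782, (0, 1): 13}

total = sum(confusion.values())
correct = sum(c for (target, pred), c in confusion.items() if target == pred)
accuracy = correct / total


def precision(label):
    """Of the rows predicted as `label`, the share that truly are `label`."""
    predicted = sum(c for (_, pred), c in confusion.items() if pred == label)
    return confusion.get((label, label), 0) / predicted


def recall(label):
    """Of the rows that truly are `label`, the share predicted as `label`."""
    actual = sum(c for (target, _), c in confusion.items() if target == label)
    return confusion.get((label, label), 0) / actual


print(total, accuracy)
print(precision(1), recall(1))
print(precision(0), recall(0))
```

All 13 errors are rows with target 0 that were predicted as 1, so recall for class 1 and precision for class 0 are both perfect, while precision for class 1 and recall for class 0 are slightly lower.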



## Tuning hyperparameters

The Gradient Boosted Trees model has many tuning parameters. Here we provide a simple guideline for tuning the model.

*   `max_iterations` Controls the number of trees in the final model. Usually, more trees give higher accuracy. However, both training and prediction time grow linearly with the number of trees.

*   `max_depth` Restricts the depth of each individual tree to prevent overfitting.

*   `step_size` Also called shrinkage; it scales the contribution of each new tree. It works like the learning rate in gradient descent: a smaller value takes more iterations to reach the same level of training error as a larger step size. So there is a trade-off between `step_size` and the number of iterations.

*   `min_child_weight` One of the pruning criteria for decision tree construction. In classification problems, this corresponds to the minimum number of observations required at a leaf node. Larger values produce simpler trees.

*   `min_loss_reduction` Another pruning criterion for decision tree construction. This sets the minimum reduction in the loss function required for a node split. Larger values produce simpler trees.

*   `row_subsample` Use only a fraction of the data at each iteration. This is similar to mini-batch stochastic gradient descent, which not only reduces the computation cost of each iteration but may also produce a more robust model.

*   `column_subsample` Use only a subset of the columns at each iteration.
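One simple way to explore these parameters is a small grid search. The sketch below is an assumption-laden outline, not a recommended recipe: the candidate values are arbitrary, the helper names are hypothetical, and `valid_data` stands for a held-out SFrame that this notebook does not create. Turi Create is imported inside `fit_and_score` so that the grid itself can be built without it.

```python
from itertools import product

# Arbitrary candidate values, chosen only for illustration.
search_space = {
    "max_iterations": [10, 50],
    "max_depth": [3, 6],
    "step_size": [0.1, 0.3],
}


def parameter_grid(space):
    """Yield one dict per combination of candidate values."""
    names = sorted(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def fit_and_score(train, valid, params, target="edible"):
    """Train one boosted trees model and return its validation accuracy."""
    import turicreate as tc  # lazy import: only needed when a model is trained
    model = tc.boosted_trees_classifier.create(
        train, target=target, validation_set=None, verbose=False, **params)
    return model.evaluate(valid)["accuracy"]


grid = list(parameter_grid(search_space))
print(len(grid))

# Usage, assuming train_data and a held-out valid_data SFrame exist:
# best = max(grid, key=lambda p: fit_and_score(train_data, valid_data, p))
```

Scoring on a validation set that is separate from the final test set keeps the test accuracy an honest estimate after tuning.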

# Blogs, Articles, Notebooks
*   https://github.com/AFathi/turicreate-notebooks/blob/master/notebooks/Classify%20Data%20Example-kNN.ipynb
*   https://www.kaggle.com/uciml/mushroom-classification
