
## **Connect to Database**

In [1]:
# Install Java, Spark, and Findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www-us.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
!tar xf spark-2.4.6-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.6-bin-hadoop2.7"

# Make the Spark installation importable from Python
import findspark
findspark.init()

# Download the PostgreSQL JDBC driver so Spark can read from the database
!wget https://jdbc.postgresql.org/download/postgresql-42.2.9.jar

# Start a SparkSession (appName() sets the application name; the JDBC driver is added to the driver classpath)
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("database_transformation")
    .config("spark.driver.memory", "5g")
    .config("spark.driver.extraClassPath", "/content/postgresql-42.2.9.jar")
    .getOrCreate()
)


--2020-07-26 18:00:36--  https://jdbc.postgresql.org/download/postgresql-42.2.9.jar
Resolving jdbc.postgresql.org (jdbc.postgresql.org)... 72.32.157.228, 2001:4800:3e1:1::228
Connecting to jdbc.postgresql.org (jdbc.postgresql.org)|72.32.157.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 914037 (893K) [application/java-archive]
Saving to: ‘postgresql-42.2.9.jar’


2020-07-26 18:00:36 (8.35 MB/s) - ‘postgresql-42.2.9.jar’ saved [914037/914037]



In [2]:
# gcloud login and check the DB
!gcloud auth login
!gcloud config set project 'xy-yelp'
!gcloud sql instances describe 'xy-yelp'

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&code_challenge=pozFi1Jvjjxj6Wps5KvjGszCSal1-0jaKnZNcjaooxU&code_challenge_method=S256&access_type=offline&response_type=code&prompt=select_account


Enter verification code: 4/2QEcr2gGT3USv-zEFfd722-0VQeKwtOJU0mPCAPrEqIwmBQrXosqRvM

You are now logged in as [helenly25@gmail.com].
Your current project is [None].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


To take a quick anonymous survey, run:
  $ gcloud survey

Updated property [core/project].
backendType: SECOND_GEN
connecti

In [3]:
# download and initialize the psql proxy
!wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy
!chmod +x cloud_sql_proxy
# "connectionName" is from the previous block
!nohup ./cloud_sql_proxy -instances="xy-yelp:northamerica-northeast1:xy-yelp"=tcp:5432 &
!sleep 30s

--2020-07-26 18:01:19--  https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64
Resolving dl.google.com (dl.google.com)... 172.217.214.93, 172.217.214.136, 172.217.214.190, ...
Connecting to dl.google.com (dl.google.com)|172.217.214.93|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14492253 (14M) [application/octet-stream]
Saving to: ‘cloud_sql_proxy’


2020-07-26 18:01:19 (242 MB/s) - ‘cloud_sql_proxy’ saved [14492253/14492253]

nohup: appending output to 'nohup.out'


In [4]:
db_password = 'kjhbyelpdb'
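Hard-coding the password in a cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the password has been exported in an environment variable beforehand (the variable name `XY_DB_PASSWORD` and the helper `get_db_password` are hypothetical, not part of the original notebook):

```python
import os

def get_db_password(env_var="XY_DB_PASSWORD"):
    """Read the database password from an environment variable."""
    password = os.environ.get(env_var)
    if not password:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return password
```

With this in place, `db_password = get_db_password()` could stand in for the literal string.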

In [5]:
# Configure JDBC connection settings for the Cloud SQL (PostgreSQL) instance, reached through the local proxy
mode = "append"
jdbc_url="jdbc:postgresql://127.0.0.1:5432/xy_yelp_db"
config = {"user":"postgres", 
          "password": db_password, 
          "driver":"org.postgresql.Driver"}

## **Extract tables**

In [6]:
# Read data from database
review_df2 = spark.read \
    .jdbc(url=jdbc_url, table='review_two',
          properties=config)
review_df2.show(5)

+--------------------+--------------------+-----+----+------+-----+-----------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|
+--------------------+--------------------+-----+----+------+-----+-----------+
|K8avYPWsh45v7VoZg...|another pie place...|    3|   0|     0|    0| 2008-01-23|
|BkiZn5XSzAv9q7J7_...|I came with my si...|    4|   1|     1|    0| 2017-01-30|
|L6kc7Nr7hWiqo7ZvW...|I am very disappo...|    1|   0|     0|    0| 2017-01-19|
|y35xKzutHXT985mUp...|Stopped for lunch...|    4|   1|     0|    0| 2017-01-04|
|UqQGtBDEfkYMLV-Fy...|DON'T DO IT!  You...|    1|   0|     1|    0| 2010-01-20|
+--------------------+--------------------+-----+----+------+-----+-----------+
only showing top 5 rows



In [7]:
# Pull data from business table
business_df2 = spark.read \
    .jdbc(url=jdbc_url, table='business_two',
          properties=config)
business_df2.show(5)

+--------------------+--------------------+
|           review_id|         business_id|
+--------------------+--------------------+
|K8avYPWsh45v7VoZg...|UGyEr_PMA-v1cuim0...|
|BkiZn5XSzAv9q7J7_...|N_2yEZ41g9zDW_gWA...|
|L6kc7Nr7hWiqo7ZvW...|XhLM_OtYslzyd4Gyv...|
|y35xKzutHXT985mUp...|ILa-Xv5-h23A9OMrY...|
|UqQGtBDEfkYMLV-Fy...|gBfPyzPRmeOaj3Sdc...|
+--------------------+--------------------+
only showing top 5 rows



In [8]:
# Pull data from yelp_user table

user_df2 = spark.read \
    .jdbc(url=jdbc_url, table='yelp_user_two',
          properties=config)
user_df2.show(5)

+--------------------+--------------------+
|           review_id|             user_id|
+--------------------+--------------------+
|4gHv8mFFL77vdr6_-...|suiXZ_6jjf9YriAEl...|
|CHcdI_ZDxt2L7Ju5v...|Zoec9wehLFa8CV1Jn...|
|W5Zkqs8RtShQK8u-m...|Iye9krZCjW79lB324...|
|ZVoX65BkaRN0Sr349...|hJ2BkfY_iOhtIizGO...|
|e1EEHis4eT6XwagD2...|7msjG0EeNnaef-tWD...|
+--------------------+--------------------+
only showing top 5 rows



In [9]:
# Join review_df2 and business_df2
review_df2 = review_df2.join(business_df2, on="review_id", how="inner")

In [10]:
# Join review_df2 and user_df2
review_df2 = review_df2.join(user_df2, on="review_id", how="inner")
review_df2.show(5)

+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|         business_id|             user_id|
+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+
|06FL63x1PSHK1IE3i...|An hour and a hal...|    1|   0|     0|    0| 2016-01-24|Z3ZSar8IVAR2qIupq...|1luyQBuF2iH1Tbqs3...|
|1lGcbt9vMSWY5NLbW...|J'ai été séduite ...|    2|   0|     2|    0| 2015-01-27|frVru1HZYyGZ9sfbO...|AK4k713ocyWht0W47...|
|1xXPggQNNBjkwxxwH...|I'm always game t...|    1|   0|     2|    0| 2014-01-22|6tY0tn39Mb8FCLYBA...|gaPf1qNX7PAf14wIP...|
|37Ci4Q8bRm3PyYHZH...|Hmmm, it was okay...|    3|   1|     1|    0| 2010-01-17|Rj-7ymdw8aNZBRqGR...|uj4iopBWA0RjpqoJ5...|
|37FEOT7W5jpApoad7...|My wife and I had...|    1|   0|     0|    0| 2017-01-25|TTDMJetAQKfxVzKZy...|x20piGQtvm8hOKe8E...|
+--------------------+--

In [11]:
# Create DF with selected columns

col_list = ['business_id', 'review_date', 'review_id', 'stars', 'review_text', 'user_id', 'cool', 'useful', 'funny']
df = review_df2.select(col_list)
df.show(5)

+--------------------+-----------+--------------------+-----+--------------------+--------------------+----+------+-----+
|         business_id|review_date|           review_id|stars|         review_text|             user_id|cool|useful|funny|
+--------------------+-----------+--------------------+-----+--------------------+--------------------+----+------+-----+
|Z3ZSar8IVAR2qIupq...| 2016-01-24|06FL63x1PSHK1IE3i...|    1|An hour and a hal...|1luyQBuF2iH1Tbqs3...|   0|     0|    0|
|frVru1HZYyGZ9sfbO...| 2015-01-27|1lGcbt9vMSWY5NLbW...|    2|J'ai été séduite ...|AK4k713ocyWht0W47...|   0|     2|    0|
|6tY0tn39Mb8FCLYBA...| 2014-01-22|1xXPggQNNBjkwxxwH...|    1|I'm always game t...|gaPf1qNX7PAf14wIP...|   0|     2|    0|
|Rj-7ymdw8aNZBRqGR...| 2010-01-17|37Ci4Q8bRm3PyYHZH...|    3|Hmmm, it was okay...|uj4iopBWA0RjpqoJ5...|   1|     1|    0|
|TTDMJetAQKfxVzKZy...| 2017-01-25|37FEOT7W5jpApoad7...|    1|My wife and I had...|x20piGQtvm8hOKe8E...|   0|     0|    0|
+--------------------+--

In [12]:
import pandas as pd

# Convert the Spark DataFrame to a pandas DataFrame (this collects all rows onto the driver)
pandas_df = df.toPandas()
pandas_df.head()

Unnamed: 0,business_id,review_date,review_id,stars,review_text,user_id,cool,useful,funny
0,Z3ZSar8IVAR2qIupqxMynA,2016-01-24,06FL63x1PSHK1IE3iQ3yqg,1,An hour and a half waiting for a pizza!!!!!! T...,1luyQBuF2iH1Tbqs331uGA,0,0,0
1,frVru1HZYyGZ9sfbOchaXg,2015-01-27,1lGcbt9vMSWY5NLbW5jx3g,2,J'ai été séduite par l'originalité du lieu qui...,AK4k713ocyWht0W47DvV_g,0,2,0
2,6tY0tn39Mb8FCLYBAXXOUw,2014-01-22,1xXPggQNNBjkwxxwHnSHfQ,1,I'm always game to trying all Chinese take-out...,gaPf1qNX7PAf14wIPBUmVg,0,2,0
3,Rj-7ymdw8aNZBRqGRAjR3Q,2010-01-17,37Ci4Q8bRm3PyYHZHwbFFQ,3,"Hmmm, it was okay I guess. Nothing wrong, but ...",uj4iopBWA0RjpqoJ5xz_vQ,1,1,0
4,TTDMJetAQKfxVzKZy4Z_2Q,2017-01-25,37FEOT7W5jpApoad7d-23Q,1,My wife and I had chosen to fly with your airl...,x20piGQtvm8hOKe8EkR0VQ,0,0,0


## **Transformation**

In [13]:
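# Keep only 1-, 3- and 5-star reviews; .copy() avoids SettingWithCopyWarning when columns are added later
pandas_df = pandas_df.loc[(pandas_df['stars'] != 2) & (pandas_df['stars'] != 4)].copy()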
pandas_df.head()

Unnamed: 0,business_id,review_date,review_id,stars,review_text,user_id,cool,useful,funny
0,Z3ZSar8IVAR2qIupqxMynA,2016-01-24,06FL63x1PSHK1IE3iQ3yqg,1,An hour and a half waiting for a pizza!!!!!! T...,1luyQBuF2iH1Tbqs331uGA,0,0,0
2,6tY0tn39Mb8FCLYBAXXOUw,2014-01-22,1xXPggQNNBjkwxxwHnSHfQ,1,I'm always game to trying all Chinese take-out...,gaPf1qNX7PAf14wIPBUmVg,0,2,0
3,Rj-7ymdw8aNZBRqGRAjR3Q,2010-01-17,37Ci4Q8bRm3PyYHZHwbFFQ,3,"Hmmm, it was okay I guess. Nothing wrong, but ...",uj4iopBWA0RjpqoJ5xz_vQ,1,1,0
4,TTDMJetAQKfxVzKZy4Z_2Q,2017-01-25,37FEOT7W5jpApoad7d-23Q,1,My wife and I had chosen to fly with your airl...,x20piGQtvm8hOKe8EkR0VQ,0,0,0
5,99TrGqU8ngQphSkvoe6zgg,2017-01-03,6lQS-_8VbWtUkZ3ZSr_qjA,5,Love this place! They are always friendly and...,vWlZqhUfeN8J0_k2NEnDIw,0,0,0


In [14]:
# Import dependencies for nltk
# https://towardsdatascience.com/natural-language-processing-nlp-for-machine-learning-d44498845d5b
import nltk

In [15]:
# Import string and inspect the punctuation characters it defines
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [17]:
# Function to remove punctuation
def remove_punct(text):

  # Drop every character that appears in string.punctuation
  text_nopunct = ''.join([char for char in text if char not in string.punctuation])
  return text_nopunct

pandas_df['body_text_clean'] = pandas_df['review_text'].apply(remove_punct)

pandas_df.head()

Unnamed: 0,business_id,review_date,review_id,stars,review_text,user_id,cool,useful,funny,body_text_clean
0,Z3ZSar8IVAR2qIupqxMynA,2016-01-24,06FL63x1PSHK1IE3iQ3yqg,1,An hour and a half waiting for a pizza!!!!!! T...,1luyQBuF2iH1Tbqs331uGA,0,0,0,An hour and a half waiting for a pizza This is...
2,6tY0tn39Mb8FCLYBAXXOUw,2014-01-22,1xXPggQNNBjkwxxwHnSHfQ,1,I'm always game to trying all Chinese take-out...,gaPf1qNX7PAf14wIPBUmVg,0,2,0,Im always game to trying all Chinese takeout p...
3,Rj-7ymdw8aNZBRqGRAjR3Q,2010-01-17,37Ci4Q8bRm3PyYHZHwbFFQ,3,"Hmmm, it was okay I guess. Nothing wrong, but ...",uj4iopBWA0RjpqoJ5xz_vQ,1,1,0,Hmmm it was okay I guess Nothing wrong but not...
4,TTDMJetAQKfxVzKZy4Z_2Q,2017-01-25,37FEOT7W5jpApoad7d-23Q,1,My wife and I had chosen to fly with your airl...,x20piGQtvm8hOKe8EkR0VQ,0,0,0,My wife and I had chosen to fly with your airl...
5,99TrGqU8ngQphSkvoe6zgg,2017-01-03,6lQS-_8VbWtUkZ3ZSr_qjA,5,Love this place! They are always friendly and...,vWlZqhUfeN8J0_k2NEnDIw,0,0,0,Love this place They are always friendly and ...
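The character-by-character join above works, but on a large review table a translation table is usually faster. A sketch of an equivalent helper using only the standard library (the name `remove_punct_fast` is hypothetical). Note that `string.punctuation` only covers ASCII, so typographic quotes such as `’` pass through either version unchanged.

```python
import string

# Translation table that deletes every character in string.punctuation
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_fast(text):
    """Remove ASCII punctuation using a precomputed translation table."""
    return text.translate(_PUNCT_TABLE)

print(remove_punct_fast("I'm always game to trying all Chinese take-out!"))
# Im always game to trying all Chinese takeout
```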


In [18]:
# Import re
import re

# Function to tokenize words
def tokenize(text):

  # \W+ matches one or more non-word characters (anything that is not a letter, digit, or underscore),
  # so the text is split wherever such a run occurs
  tokens = re.split(r'\W+', text)
  return tokens

# Lowercase first so that "Pizza" and "pizza" become the same token
pandas_df['body_text_tokenized'] = pandas_df['body_text_clean'].apply(lambda x: tokenize(x.lower()))

pandas_df.head()

Unnamed: 0,business_id,review_date,review_id,stars,review_text,user_id,cool,useful,funny,body_text_clean,body_text_tokenized
0,Z3ZSar8IVAR2qIupqxMynA,2016-01-24,06FL63x1PSHK1IE3iQ3yqg,1,An hour and a half waiting for a pizza!!!!!! T...,1luyQBuF2iH1Tbqs331uGA,0,0,0,An hour and a half waiting for a pizza This is...,"[an, hour, and, a, half, waiting, for, a, pizz..."
2,6tY0tn39Mb8FCLYBAXXOUw,2014-01-22,1xXPggQNNBjkwxxwHnSHfQ,1,I'm always game to trying all Chinese take-out...,gaPf1qNX7PAf14wIPBUmVg,0,2,0,Im always game to trying all Chinese takeout p...,"[im, always, game, to, trying, all, chinese, t..."
3,Rj-7ymdw8aNZBRqGRAjR3Q,2010-01-17,37Ci4Q8bRm3PyYHZHwbFFQ,3,"Hmmm, it was okay I guess. Nothing wrong, but ...",uj4iopBWA0RjpqoJ5xz_vQ,1,1,0,Hmmm it was okay I guess Nothing wrong but not...,"[hmmm, it, was, okay, i, guess, nothing, wrong..."
4,TTDMJetAQKfxVzKZy4Z_2Q,2017-01-25,37FEOT7W5jpApoad7d-23Q,1,My wife and I had chosen to fly with your airl...,x20piGQtvm8hOKe8EkR0VQ,0,0,0,My wife and I had chosen to fly with your airl...,"[my, wife, and, i, had, chosen, to, fly, with,..."
5,99TrGqU8ngQphSkvoe6zgg,2017-01-03,6lQS-_8VbWtUkZ3ZSr_qjA,5,Love this place! They are always friendly and...,vWlZqhUfeN8J0_k2NEnDIw,0,0,0,Love this place They are always friendly and ...,"[love, this, place, they, are, always, friendl..."
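One detail of `re.split` worth knowing: when the text starts or ends with a separator, the result contains an empty string, which then travels through the later steps as a token. A small sketch of a variant that filters those out (the name `tokenize_nonempty` is hypothetical):

```python
import re

def tokenize_nonempty(text):
    """Split on runs of non-word characters and drop empty strings."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

sample = "Love this place They are always friendly "
print(re.split(r"\W+", sample.lower()))  # the last element is an empty string
print(tokenize_nonempty(sample))
```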


In [19]:
# Download and load the list of English stopwords
import nltk
nltk.download('stopwords')
stopword = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [20]:
# Function to remove stopwords
def remove_stopwords(tokenized_list):

  # Remove all stopwords
  text = [word for word in tokenized_list if word not in stopword]
  return text

pandas_df['body_text_nostop'] = pandas_df['body_text_tokenized'].apply(remove_stopwords)

pandas_df.head()

Unnamed: 0,business_id,review_date,review_id,stars,review_text,user_id,cool,useful,funny,body_text_clean,body_text_tokenized,body_text_nostop
0,Z3ZSar8IVAR2qIupqxMynA,2016-01-24,06FL63x1PSHK1IE3iQ3yqg,1,An hour and a half waiting for a pizza!!!!!! T...,1luyQBuF2iH1Tbqs331uGA,0,0,0,An hour and a half waiting for a pizza This is...,"[an, hour, and, a, half, waiting, for, a, pizz...","[hour, half, waiting, pizza, ridiculous, idea,..."
2,6tY0tn39Mb8FCLYBAXXOUw,2014-01-22,1xXPggQNNBjkwxxwHnSHfQ,1,I'm always game to trying all Chinese take-out...,gaPf1qNX7PAf14wIPBUmVg,0,2,0,Im always game to trying all Chinese takeout p...,"[im, always, game, to, trying, all, chinese, t...","[im, always, game, trying, chinese, takeout, p..."
3,Rj-7ymdw8aNZBRqGRAjR3Q,2010-01-17,37Ci4Q8bRm3PyYHZHwbFFQ,3,"Hmmm, it was okay I guess. Nothing wrong, but ...",uj4iopBWA0RjpqoJ5xz_vQ,1,1,0,Hmmm it was okay I guess Nothing wrong but not...,"[hmmm, it, was, okay, i, guess, nothing, wrong...","[hmmm, okay, guess, nothing, wrong, nothing, o..."
4,TTDMJetAQKfxVzKZy4Z_2Q,2017-01-25,37FEOT7W5jpApoad7d-23Q,1,My wife and I had chosen to fly with your airl...,x20piGQtvm8hOKe8EkR0VQ,0,0,0,My wife and I had chosen to fly with your airl...,"[my, wife, and, i, had, chosen, to, fly, with,...","[wife, chosen, fly, airline, affordable, rates..."
5,99TrGqU8ngQphSkvoe6zgg,2017-01-03,6lQS-_8VbWtUkZ3ZSr_qjA,5,Love this place! They are always friendly and...,vWlZqhUfeN8J0_k2NEnDIw,0,0,0,Love this place They are always friendly and ...,"[love, this, place, they, are, always, friendl...","[love, place, always, friendly, cera, great, j..."
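In the output above, `im` survives stopword removal. Punctuation was stripped before tokenizing, so contractions lose their apostrophe and no longer match stopword entries that are stored with one. A sketch of one way to handle this, applying the same stripping to the stopwords and using a set for faster lookups. The short stopword list here is a hand-written stand-in, not the NLTK list, and the helper name is hypothetical.

```python
import string

# Tiny hand-written stand-in for a stopword list (not the NLTK list)
RAW_STOPWORDS = ["i", "a", "the", "to", "don't", "it's"]

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Keep both the raw and the punctuation-free form of every stopword
STOPWORDS = {w.translate(_PUNCT_TABLE) for w in RAW_STOPWORDS} | set(RAW_STOPWORDS)

def remove_stopwords_normalized(tokens):
    """Drop tokens that match a stopword in either its raw or punctuation-free form."""
    return [tok for tok in tokens if tok not in STOPWORDS]

print(remove_stopwords_normalized(["dont", "do", "it", "its", "ridiculous"]))
# ['do', 'it', 'ridiculous']
```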


In [21]:
# Import PorterStemmer
from nltk.stem import PorterStemmer

# Create a stemmer instance
ps = PorterStemmer()

# Function for stemming
def stemming(tokenized_text):

  # Reduce each token to its stem (e.g. "waiting" -> "wait")
  text = [ps.stem(word) for word in tokenized_text]
  return text

pandas_df['body_text_stemmed'] = pandas_df['body_text_nostop'].apply(stemming)

pandas_df.head()

Unnamed: 0,business_id,review_date,review_id,stars,review_text,user_id,cool,useful,funny,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed
0,Z3ZSar8IVAR2qIupqxMynA,2016-01-24,06FL63x1PSHK1IE3iQ3yqg,1,An hour and a half waiting for a pizza!!!!!! T...,1luyQBuF2iH1Tbqs331uGA,0,0,0,An hour and a half waiting for a pizza This is...,"[an, hour, and, a, half, waiting, for, a, pizz...","[hour, half, waiting, pizza, ridiculous, idea,...","[hour, half, wait, pizza, ridicul, idea, order..."
2,6tY0tn39Mb8FCLYBAXXOUw,2014-01-22,1xXPggQNNBjkwxxwHnSHfQ,1,I'm always game to trying all Chinese take-out...,gaPf1qNX7PAf14wIPBUmVg,0,2,0,Im always game to trying all Chinese takeout p...,"[im, always, game, to, trying, all, chinese, t...","[im, always, game, trying, chinese, takeout, p...","[im, alway, game, tri, chines, takeout, place,..."
3,Rj-7ymdw8aNZBRqGRAjR3Q,2010-01-17,37Ci4Q8bRm3PyYHZHwbFFQ,3,"Hmmm, it was okay I guess. Nothing wrong, but ...",uj4iopBWA0RjpqoJ5xz_vQ,1,1,0,Hmmm it was okay I guess Nothing wrong but not...,"[hmmm, it, was, okay, i, guess, nothing, wrong...","[hmmm, okay, guess, nothing, wrong, nothing, o...","[hmmm, okay, guess, noth, wrong, noth, outstan..."
4,TTDMJetAQKfxVzKZy4Z_2Q,2017-01-25,37FEOT7W5jpApoad7d-23Q,1,My wife and I had chosen to fly with your airl...,x20piGQtvm8hOKe8EkR0VQ,0,0,0,My wife and I had chosen to fly with your airl...,"[my, wife, and, i, had, chosen, to, fly, with,...","[wife, chosen, fly, airline, affordable, rates...","[wife, chosen, fli, airlin, afford, rate, non,..."
5,99TrGqU8ngQphSkvoe6zgg,2017-01-03,6lQS-_8VbWtUkZ3ZSr_qjA,5,Love this place! They are always friendly and...,vWlZqhUfeN8J0_k2NEnDIw,0,0,0,Love this place They are always friendly and ...,"[love, this, place, they, are, always, friendl...","[love, place, always, friendly, cera, great, j...","[love, place, alway, friendli, cera, great, jo..."


In [22]:
# Download WordNet and import WordNetLemmatizer
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# Create a lemmatizer instance
wn = WordNetLemmatizer()

# Function for lemmatization
def lemmatizing(tokenized_text):

  # Map each token to its dictionary form (e.g. "rates" -> "rate")
  text = [wn.lemmatize(word) for word in tokenized_text]
  return text

pandas_df['body_text_lemmatized'] = pandas_df['body_text_nostop'].apply(lemmatizing)

pandas_df.head()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


Unnamed: 0,business_id,review_date,review_id,stars,review_text,user_id,cool,useful,funny,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed,body_text_lemmatized
0,Z3ZSar8IVAR2qIupqxMynA,2016-01-24,06FL63x1PSHK1IE3iQ3yqg,1,An hour and a half waiting for a pizza!!!!!! T...,1luyQBuF2iH1Tbqs331uGA,0,0,0,An hour and a half waiting for a pizza This is...,"[an, hour, and, a, half, waiting, for, a, pizz...","[hour, half, waiting, pizza, ridiculous, idea,...","[hour, half, wait, pizza, ridicul, idea, order...","[hour, half, waiting, pizza, ridiculous, idea,..."
2,6tY0tn39Mb8FCLYBAXXOUw,2014-01-22,1xXPggQNNBjkwxxwHnSHfQ,1,I'm always game to trying all Chinese take-out...,gaPf1qNX7PAf14wIPBUmVg,0,2,0,Im always game to trying all Chinese takeout p...,"[im, always, game, to, trying, all, chinese, t...","[im, always, game, trying, chinese, takeout, p...","[im, alway, game, tri, chines, takeout, place,...","[im, always, game, trying, chinese, takeout, p..."
3,Rj-7ymdw8aNZBRqGRAjR3Q,2010-01-17,37Ci4Q8bRm3PyYHZHwbFFQ,3,"Hmmm, it was okay I guess. Nothing wrong, but ...",uj4iopBWA0RjpqoJ5xz_vQ,1,1,0,Hmmm it was okay I guess Nothing wrong but not...,"[hmmm, it, was, okay, i, guess, nothing, wrong...","[hmmm, okay, guess, nothing, wrong, nothing, o...","[hmmm, okay, guess, noth, wrong, noth, outstan...","[hmmm, okay, guess, nothing, wrong, nothing, o..."
4,TTDMJetAQKfxVzKZy4Z_2Q,2017-01-25,37FEOT7W5jpApoad7d-23Q,1,My wife and I had chosen to fly with your airl...,x20piGQtvm8hOKe8EkR0VQ,0,0,0,My wife and I had chosen to fly with your airl...,"[my, wife, and, i, had, chosen, to, fly, with,...","[wife, chosen, fly, airline, affordable, rates...","[wife, chosen, fli, airlin, afford, rate, non,...","[wife, chosen, fly, airline, affordable, rate,..."
5,99TrGqU8ngQphSkvoe6zgg,2017-01-03,6lQS-_8VbWtUkZ3ZSr_qjA,5,Love this place! They are always friendly and...,vWlZqhUfeN8J0_k2NEnDIw,0,0,0,Love this place They are always friendly and ...,"[love, this, place, they, are, always, friendl...","[love, place, always, friendly, cera, great, j...","[love, place, alway, friendli, cera, great, jo...","[love, place, always, friendly, cera, great, j..."


In [24]:
# Add a column holding the character length of each raw review
pandas_df['length'] = pandas_df['review_text'].apply(len)
pandas_df.head()

Unnamed: 0,business_id,review_date,review_id,stars,review_text,user_id,cool,useful,funny,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed,body_text_lemmatized,length
0,Z3ZSar8IVAR2qIupqxMynA,2016-01-24,06FL63x1PSHK1IE3iQ3yqg,1,An hour and a half waiting for a pizza!!!!!! T...,1luyQBuF2iH1Tbqs331uGA,0,0,0,An hour and a half waiting for a pizza This is...,"[an, hour, and, a, half, waiting, for, a, pizz...","[hour, half, waiting, pizza, ridiculous, idea,...","[hour, half, wait, pizza, ridicul, idea, order...","[hour, half, waiting, pizza, ridiculous, idea,...",113
2,6tY0tn39Mb8FCLYBAXXOUw,2014-01-22,1xXPggQNNBjkwxxwHnSHfQ,1,I'm always game to trying all Chinese take-out...,gaPf1qNX7PAf14wIPBUmVg,0,2,0,Im always game to trying all Chinese takeout p...,"[im, always, game, to, trying, all, chinese, t...","[im, always, game, trying, chinese, takeout, p...","[im, alway, game, tri, chines, takeout, place,...","[im, always, game, trying, chinese, takeout, p...",858
3,Rj-7ymdw8aNZBRqGRAjR3Q,2010-01-17,37Ci4Q8bRm3PyYHZHwbFFQ,3,"Hmmm, it was okay I guess. Nothing wrong, but ...",uj4iopBWA0RjpqoJ5xz_vQ,1,1,0,Hmmm it was okay I guess Nothing wrong but not...,"[hmmm, it, was, okay, i, guess, nothing, wrong...","[hmmm, okay, guess, nothing, wrong, nothing, o...","[hmmm, okay, guess, noth, wrong, noth, outstan...","[hmmm, okay, guess, nothing, wrong, nothing, o...",1453
4,TTDMJetAQKfxVzKZy4Z_2Q,2017-01-25,37FEOT7W5jpApoad7d-23Q,1,My wife and I had chosen to fly with your airl...,x20piGQtvm8hOKe8EkR0VQ,0,0,0,My wife and I had chosen to fly with your airl...,"[my, wife, and, i, had, chosen, to, fly, with,...","[wife, chosen, fly, airline, affordable, rates...","[wife, chosen, fli, airlin, afford, rate, non,...","[wife, chosen, fly, airline, affordable, rate,...",2539
5,99TrGqU8ngQphSkvoe6zgg,2017-01-03,6lQS-_8VbWtUkZ3ZSr_qjA,5,Love this place! They are always friendly and...,vWlZqhUfeN8J0_k2NEnDIw,0,0,0,Love this place They are always friendly and ...,"[love, this, place, they, are, always, friendl...","[love, place, always, friendly, cera, great, j...","[love, place, alway, friendli, cera, great, jo...","[love, place, always, friendly, cera, great, j...",206


## **Pipeline**

In [25]:
# Make a copy of data
pandas_df_copy = pandas_df.copy()

# Select columns for new DataFrame
pandas_df_copy = pandas_df_copy[['review_id', 'review_text', 'stars', 'cool', 'useful', 'funny', 'review_date', 'business_id', 'user_id', 'length', 'body_text_nostop', 'body_text_stemmed', 'body_text_lemmatized']]

# Convert the pandas DataFrame back to a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df_copy)
spark_df.show(5)

+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|         business_id|             user_id|length|    body_text_nostop|   body_text_stemmed|body_text_lemmatized|
+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+
|06FL63x1PSHK1IE3i...|An hour and a hal...|    1|   0|     0|    0| 2016-01-24|Z3ZSar8IVAR2qIupq...|1luyQBuF2iH1Tbqs3...|   113|[hour, half, wait...|[hour, half, wait...|[hour, half, wait...|
|1xXPggQNNBjkwxxwH...|I'm always game t...|    1|   0|     2|    0| 2014-01-22|6tY0tn39Mb8FCLYBA...|gaPf1qNX7PAf14wIP...|   858|[im, always, game...|[im, alway, game,...|[im, always, game...|
|37Ci4Q8bRm3PyYHZH...|Hmmm, it was okay.

In [26]:
# Import functions
from pyspark.ml.feature import HashingTF, IDF, StringIndexer

In [27]:
# Wrap each stars value in a single-element array so CountVectorizer can consume it
from pyspark.sql.functions import col, split
spark_df = spark_df.withColumn("star_array", split(col("stars"), " "))
spark_df.show()

+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+----------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|         business_id|             user_id|length|    body_text_nostop|   body_text_stemmed|body_text_lemmatized|star_array|
+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+----------+
|06FL63x1PSHK1IE3i...|An hour and a hal...|    1|   0|     0|    0| 2016-01-24|Z3ZSar8IVAR2qIupq...|1luyQBuF2iH1Tbqs3...|   113|[hour, half, wait...|[hour, half, wait...|[hour, half, wait...|       [1]|
|1xXPggQNNBjkwxxwH...|I'm always game t...|    1|   0|     2|    0| 2014-01-22|6tY0tn39Mb8FCLYBA...|gaPf1qNX7PAf14wIP...|   858|[im, always, game...|[im, alway, game,...|[im, always, game.

In [28]:
# Initialize a CountVectorizer
from pyspark.ml.feature import CountVectorizer
star_vectorizer = CountVectorizer(inputCol="star_array", outputCol="stars_one_hot", vocabSize=5, minDF=1.0)

In [29]:
# Fit the CountVectorizer to learn the vocabulary of star values
star_vector_model = star_vectorizer.fit(spark_df)

In [30]:
# Add the one-hot encoded stars column
df_ohe = star_vector_model.transform(spark_df)
df_ohe.show(3)

+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+----------+-------------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|         business_id|             user_id|length|    body_text_nostop|   body_text_stemmed|body_text_lemmatized|star_array|stars_one_hot|
+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+----------+-------------+
|06FL63x1PSHK1IE3i...|An hour and a hal...|    1|   0|     0|    0| 2016-01-24|Z3ZSar8IVAR2qIupq...|1luyQBuF2iH1Tbqs3...|   113|[hour, half, wait...|[hour, half, wait...|[hour, half, wait...|       [1]|(3,[2],[1.0])|
|1xXPggQNNBjkwxxwH...|I'm always game t...|    1|   0|     2|    0| 2014-01-22|6tY0tn39Mb8FCLYBA...|gaPf1qNX7PAf14wIP...|   858|[im,
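Spark prints sparse vectors as `(size,[indices],[values])`, so `(3,[2],[1.0])` above is a length-3 vector with a single 1.0 at index 2. A plain-Python sketch of expanding that notation (the helper `to_dense` is hypothetical and only for illustration):

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for idx, val in zip(indices, values):
        dense[idx] = val
    return dense

print(to_dense(3, [2], [1.0]))  # [0.0, 0.0, 1.0]
```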

In [31]:
# Define the feature stages: index the star rating as the label, hash the stemmed tokens into term frequencies, then apply IDF weighting
star_rating = StringIndexer(inputCol='stars',outputCol='label')
hashingTF = HashingTF(inputCol="body_text_stemmed", outputCol='hash_token')
idf = IDF(inputCol='hash_token', outputCol='idf_token')
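The `HashingTF` and `IDF` stages turn each token list into a TF-IDF vector: tokens are hashed into a fixed number of buckets and counted, then each bucket is down-weighted by how many documents it appears in. A plain-Python sketch of the idea under simplifying assumptions: `zlib.crc32` stands in for Spark's own hash function, the bucket count is tiny (Spark's default is 262144), and the smoothed formula `log((n_docs + 1) / (doc_freq + 1))` follows the one given in the Spark MLlib documentation. All function names are hypothetical.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # tiny stand-in for the real bucket count

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map each token to a bucket and count occurrences per bucket."""
    counts = Counter()
    for tok in tokens:
        bucket = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[bucket] += 1
    return dict(counts)

def idf_weights(tf_docs):
    """Smoothed IDF per bucket: log((n_docs + 1) / (doc_freq + 1))."""
    n_docs = len(tf_docs)
    doc_freq = Counter()
    for tf in tf_docs:
        doc_freq.update(tf.keys())
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}

# Two toy documents made up for illustration
docs = [["hour", "half", "wait", "pizza"], ["love", "place", "pizza"]]
tf_docs = [hashing_tf(d) for d in docs]
idf = idf_weights(tf_docs)
tfidf_docs = [{b: c * idf[b] for b, c in tf.items()} for tf in tf_docs]
print(tfidf_docs)
```

A bucket that occurs in every document (the one holding `pizza` here) gets an IDF of zero, which is how very common terms are suppressed.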

In [32]:
from pyspark.ml.feature import VectorAssembler
# Combine the TF-IDF vector and the review length into a single feature vector
clean_up = VectorAssembler(inputCols=['idf_token', 'length'], outputCol='features')

In [33]:
# Create and run a data processing Pipeline
from pyspark.ml import Pipeline
data_prep_pipeline = Pipeline(stages=[star_rating, hashingTF, idf, clean_up])

In [34]:
# Fit and transform the pipeline
cleaner = data_prep_pipeline.fit(df_ohe)
cleaned = cleaner.transform(df_ohe)

In [35]:
cleaned.show(5)

+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+----------+-------------+-----+--------------------+--------------------+--------------------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|         business_id|             user_id|length|    body_text_nostop|   body_text_stemmed|body_text_lemmatized|star_array|stars_one_hot|label|          hash_token|           idf_token|            features|
+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+----------+-------------+-----+--------------------+--------------------+--------------------+
|06FL63x1PSHK1IE3i...|An hour and a hal...|    1|   0|     0|    0| 2016-01-24|Z3ZSar8IVAR2qIupq...|1luyQBuF2iH1Tbqs3...|   113|[hour, half, w

## **Machine Learning Models**

In [36]:
# Keep only the feature vector and the label
x = cleaned.select('features', 'label')
x.show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(262145,[17252,27...|  2.0|
|(262145,[353,1353...|  2.0|
|(262145,[353,1353...|  1.0|
|(262145,[1707,232...|  2.0|
|(262145,[22567,35...|  0.0|
+--------------------+-----+
only showing top 5 rows
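The `label` values are indices rather than star ratings. By default `StringIndexer` assigns indices in order of descending frequency, so the most common star value becomes 0.0, which appears to be why the rows above do not map 1, 3, 5 to 0, 1, 2 in numeric order. A plain-Python sketch of that ordering (the helper name and the star counts are hypothetical, and the alphabetical tie-break is an assumption of this sketch):

```python
from collections import Counter

def fit_label_index(values):
    """Map each distinct value to an index, most frequent first."""
    counts = Counter(str(v) for v in values)
    # Ties are broken alphabetically here; that tie-break is an assumption of this sketch
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Hypothetical star counts, chosen only to illustrate the ordering
mapping = fit_label_index([5, 5, 5, 3, 3, 1])
print(mapping)  # {'5': 0.0, '3': 1.0, '1': 2.0}
```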



In [37]:
# Check dtypes
x.dtypes

[('features', 'vector'), ('label', 'double')]

In [38]:
# Import col
from pyspark.sql.functions import col

# Cast the label column from double to int
x = x.withColumn('label', col('label').cast('int'))

x.show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(262145,[17252,27...|    2|
|(262145,[353,1353...|    2|
|(262145,[353,1353...|    1|
|(262145,[1707,232...|    2|
|(262145,[22567,35...|    0|
+--------------------+-----+
only showing top 5 rows



**Naive Bayes**

In [39]:
# Break data down into a training set and a testing set
training, testing = x.randomSplit([0.8, 0.2], 21)

In [40]:
from pyspark.ml.classification import NaiveBayes
# Create a Naive Bayes model and fit training data
nb = NaiveBayes()
predictor = nb.fit(training)

In [41]:
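# Evaluate the predictions on the test set
# Note: MulticlassClassificationEvaluator defaults to metricName="f1", so the value printed below is a weighted F1 score;
# pass metricName="accuracy" to report accuracy instead
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(predictor.transform(testing))
print("Accuracy of model at predicting reviews was: %f" % acc)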

Accuracy of model at predicting reviews was: 0.586072
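`MulticlassClassificationEvaluator()` with no arguments reports a weighted F1 score (its default `metricName` is `"f1"`), so the figure printed above is not strictly accuracy. The two metrics can differ, as this plain-Python sketch on made-up labels shows (function names are hypothetical):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)

# Toy labels, made up for illustration
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

In the notebook itself, passing `metricName="accuracy"` to the evaluator would report accuracy directly.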


**Logistic Regression**

In [42]:
# Break data down into a training set and a testing set
training, testing = x.randomSplit([0.8, 0.2], 21)

In [43]:
from pyspark.ml.classification import LogisticRegression
# Create a logistic regression model and fit training data
lg = LogisticRegression()
predictor = lg.fit(training)

In [44]:
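# Note: MulticlassClassificationEvaluator defaults to metricName="f1", so the value printed below is a weighted F1 score;
# pass metricName="accuracy" to report accuracy instead
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(predictor.transform(testing))
print("Accuracy of model at predicting reviews was: %f" % acc)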

Accuracy of model at predicting reviews was: 0.750741


**Multilayer Perceptron**

In [None]:
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [None]:
# Split the data into train and test
splits = x.randomSplit([0.8, 0.2], 1234)
train = splits[0]
test = splits[1]

In [None]:
train.show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(262145,[97,13957...|    1|
|(262145,[98,6646,...|    2|
|(262145,[170,427,...|    2|
|(262145,[170,976,...|    0|
|(262145,[234,1395...|    2|
|(262145,[234,1729...|    1|
|(262145,[353,783,...|    0|
|(262145,[353,976,...|    2|
|(262145,[353,1353...|    0|
|(262145,[353,1707...|    1|
|(262145,[353,3028...|    2|
|(262145,[353,1034...|    1|
|(262145,[353,2234...|    1|
|(262145,[382,2437...|    2|
|(262145,[427,1560...|    0|
|(262145,[427,2325...|    0|
|(262145,[427,3811...|    0|
|(262145,[535,2089...|    2|
|(262145,[584,1353...|    2|
|(262145,[604,6981...|    0|
+--------------------+-----+
only showing top 20 rows



In [None]:
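# Specify layers for the neural network: input size, one hidden layer, three output classes.
# The input layer must match the feature vector length (262145, as shown in the features column);
# the original run used 62000, which does not match and is the likely cause of the Py4JJavaError recorded below.
layers = [262145, 256, 3]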
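The first entry of `layers` has to equal the length of the `features` vector, which the outputs above show as 262145. A sketch of where that number comes from and how large the resulting network would be, assuming the default `HashingTF` bucket count of 2^18 plus the single `length` column, and a standard fully connected layout with one bias per output unit (the helper `count_parameters` is hypothetical):

```python
NUM_HASH_BUCKETS = 2 ** 18      # HashingTF default numFeatures (262144)
EXTRA_FEATURES = 1              # the 'length' column added by VectorAssembler

input_size = NUM_HASH_BUCKETS + EXTRA_FEATURES
layers = [input_size, 256, 3]

def count_parameters(layer_sizes):
    """Weights plus one bias per output unit for each fully connected layer."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(input_size)                 # 262145
print(count_parameters(layers))   # 67110147
```

At roughly 67 million parameters, training may be slow or memory-hungry on a 5g driver; lowering `numFeatures` on `HashingTF` would shrink the input layer accordingly.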

In [None]:
# create the trainer and set its parameters
trainer = MultilayerPerceptronClassifier(maxIter=10, layers=layers, blockSize=128, seed=1234)

In [None]:
# train the model
model = trainer.fit(train)

Py4JJavaError: ignored