<a href="https://colab.research.google.com/github/karenbennis/Xy/blob/ml_model/pyspark_pipeline_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <br>**Connect to Database**<br><br>

In [1]:
# Install Java, Spark, and Findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
!tar xf spark-2.4.6-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.6-bin-hadoop2.7"

# Locate the Spark installation so that pyspark can be imported
import findspark
findspark.init()

# Download the PostgreSQL JDBC driver
!wget https://jdbc.postgresql.org/download/postgresql-42.2.9.jar

# Start a SparkSession named by appName(), with the JDBC driver on the driver classpath
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("database_transformation").config("spark.driver.memory","5g").config("spark.driver.extraClassPath","/content/postgresql-42.2.9.jar").getOrCreate()


--2020-07-26 15:26:39--  https://jdbc.postgresql.org/download/postgresql-42.2.9.jar
Resolving jdbc.postgresql.org (jdbc.postgresql.org)... 72.32.157.228, 2001:4800:3e1:1::228
Connecting to jdbc.postgresql.org (jdbc.postgresql.org)|72.32.157.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 914037 (893K) [application/java-archive]
Saving to: ‘postgresql-42.2.9.jar’


2020-07-26 15:26:40 (1.43 MB/s) - ‘postgresql-42.2.9.jar’ saved [914037/914037]



In [None]:
# gcloud login and check the DB
!gcloud auth login
!gcloud config set project 'xy-yelp'
!gcloud sql instances describe 'xy-yelp'

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&code_challenge=TGHQzzth5e5AMSwqAw6CpiH3wec9voSu-hui-MK-he0&code_challenge_method=S256&access_type=offline&response_type=code&prompt=select_account


Enter verification code: 4/2QHe7YUZLypxZoQQ6PhcY2igPX1F0aY8znLdi8Pv6YIxOPyVrLPM_vU

You are now logged in as [helenly25@gmail.com].
Your current project is [xy-yelp].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID
Updated property [core/project].
backendType: SECOND_GEN
connectionName: xy-yelp:northamerica-northeast1:xy-yelp
databaseV

In [None]:
# download and initialize the psql proxy
!wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy
!chmod +x cloud_sql_proxy
# "connectionName" is from the previous block
!nohup ./cloud_sql_proxy -instances="xy-yelp:northamerica-northeast1:xy-yelp"=tcp:5432 &
!sleep 30s

cloud_sql_proxy: Text file busy
nohup: appending output to 'nohup.out'
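
The fixed `sleep 30s` above assumes the proxy is ready after 30 seconds. As a minimal sketch of an alternative, the helper below polls the local port until it accepts a TCP connection. `wait_for_port` is a hypothetical name introduced here; it is not part of the Cloud SQL proxy or of this notebook.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Under these assumptions, `wait_for_port("127.0.0.1", 5432)` could replace the fixed sleep and would return as soon as the proxy is listening.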


In [None]:
db_password = 'kjhbyelpdb'
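
Storing the password as a literal in the notebook exposes it to anyone who can read the file. One possible alternative, sketched below, is to read it from an environment variable. The variable name `XY_DB_PASSWORD` and the helper `read_db_password` are assumptions made for illustration only.

```python
import os


def read_db_password(env_var="XY_DB_PASSWORD"):
    """Return the database password stored in the given environment variable."""
    password = os.environ.get(env_var)
    if not password:
        # Fail early with a clear message instead of connecting with an empty password
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return password
```

With this helper, the cell would become `db_password = read_db_password()`.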

In [None]:
# Configure JDBC settings for the Cloud SQL Postgres instance (reached through the local proxy)
mode = "append"
jdbc_url="jdbc:postgresql://127.0.0.1:5432/xy_yelp_db"
config = {"user":"postgres", 
          "password": db_password, 
          "driver":"org.postgresql.Driver"}

## **Extract tables**

In [2]:
# Load the uniform Yelp sample from GitHub into a pandas DataFrame
import pandas as pd
pandas_df = pd.read_csv("https://raw.githubusercontent.com/karenbennis/Xy/storyboard/uniform_yelp.csv")

# Set index
# pandas_df = pandas_df.set_index('review_id')

pandas_df.head()

Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool
0,f5pGCpvkpRpJixZ0zA3hCg,8p2nss7UoZmIVZTr1IjR3w,caWUE0ItqsG51OaBVlr4Eg,2,2016-11-13,We went here three more times for lunch and tw...,0,0,0
1,8lPKsNFBiLmVL5nbsUXaZw,O3pSxv1SyHpY4qi4Q16KzA,dc3uoAmNo5STqKV6mlD_aA,1,2017-05-30,My husband and I went to the Drake for lunch t...,2,1,1
2,CANhCLzOoZ0mkL3mpnUSNg,ffC9zmbY4pBOS9ByrWoXxQ,sR9hPrIaG-J-GLcl4yaiLw,1,2017-11-19,"Very Problematic\nI'm gay, I'm not ashamed to ...",0,0,0
3,bZ02moAXlosgWPM3pXSHWw,zE49S2Em3l7vgIlvFzZFOw,NF6di6YcQxN0rDAleE7SyQ,3,2014-12-30,Today was the first time I sat down a table. I...,0,0,0
4,TmIla5Eh5SSLJ_bKgH4Syg,xde2rO3XVt0Do8kLRIt2Dw,Wc9UpJhOcdSj7olZkz7SJA,2,2013-02-19,I ordered chicken tacos with no cheese to go. ...,4,1,0


In [None]:
# Pull review table
review_df2 = spark.read.jdbc(url=jdbc_url, table='review',properties=config)
review_df2.show(5)

+--------------------+--------------------+-----+----+------+-----+-----------+-----------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|review_type|
+--------------------+--------------------+-----+----+------+-----+-----------+-----------+
|cALYebKb5hygdKHql...|This is a very in...|    4|   0|     0|    0| 2011-01-12|     review|
|SawdMXLYD5ytRmMFv...|I LOVE Chic Nails...|    5|   0|     2|    0| 2011-01-20|     review|
|j-jMQdELr6AFkCcEH...|After the Padres ...|    5|   0|     0|    0| 2011-01-06|     review|
|SmUMyCUNrT9HEo_DX...|I have to admit t...|    4|   0|     1|    0| 2010-01-17|     review|
|oTB_mpCKcu-8wayQQ...|Best food, super ...|    5|   0|     1|    0| 2011-01-14|     review|
+--------------------+--------------------+-----+----+------+-----+-----------+-----------+
only showing top 5 rows



In [None]:
# Pull business table
business_df2 = spark.read.jdbc(url=jdbc_url, table='business',properties=config)
business_df2.show(5)

+--------------------+--------------------+
|           review_id|         business_id|
+--------------------+--------------------+
|fWKvX83p0-ka4JS3d...|9yKzy9PApeiPPOUJE...|
|IjZ33sJrzXqU-0X6U...|ZRJwVLyzEJq1VAihD...|
|IESLBzqUCLdSzSqm0...|6oRAC4uyJCsJl1X0W...|
|G-WvGaISbqqaMHlNn...|_1QQZuf4zZOyFCvXc...|
|1uJFq2r5QfJG_6ExM...|6ozycU1RpktNG2-1B...|
+--------------------+--------------------+
only showing top 5 rows



In [None]:
# Pull yelp_user table
user_df2 = spark.read.jdbc(url=jdbc_url, table='yelp_user',properties=config)
user_df2.show(5)

+--------------------+--------------------+
|           review_id|             user_id|
+--------------------+--------------------+
|GJGUHAAONtBSBj53c...|Z3c7xGRfeV-uMkSbA...|
|nQH2KAvAeOJOYKX99...|ryjqXdp68i2I9JPOp...|
|-yKcbjWSlmKC1zTMT...|5W-ruHmpkwLyI6Lla...|
|20aES_-g5Vyqfzojn...|vhxFLqRok6r-D_aQz...|
|W_d9w7yr3koSUXHco...|aBnKTxZzdhabTXfzt...|
+--------------------+--------------------+
only showing top 5 rows



In [None]:
# Join tables
spark_df = review_df2.join(business_df2, on="review_id", how="inner")
spark_df = spark_df.join(user_df2, on="review_id", how="inner")
spark_df.show(5)

+--------------------+--------------------+-----+----+------+-----+-----------+-----------+--------------------+--------------------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|review_type|         business_id|             user_id|
+--------------------+--------------------+-----+----+------+-----+-----------+-----------+--------------------+--------------------+
|-7yxrdY13ay15rGB7...|I have been going...|    5|   0|     0|    0| 2010-01-16|     review|Lh9nz0KYyzE-YRbKu...|ayKW9eWwGFcrtJaHc...|
|-Be0UUGYuiDJVAM_Y...|Since Im big into...|    4|   0|     2|    2| 2011-01-25|     review|pa6K7DGByxBXxcVJ5...|_4lqpCYCqOQzbB6xQ...|
|-nQHHXi-d_yuW301_...|A pleasant place ...|    2|   0|     0|    0| 2011-01-12|     review|GIGI8bJfN6HyPzmEW...|4QORbyhfN01oKR_Gg...|
|2L30O7G8IQ6HILpR0...|part of a social ...|    5|   0|     0|    0| 2010-01-24|     review|qiwajZigq_2twTmYo...|ST8Yzlk2MqKlcaLqL...|
|4x5yLG7_yGLuN-w6f...|I love every plac...|    4|   0|     1| 

## **Transformation**

In [None]:
import pyspark.sql.functions as F

spark_df=spark_df.withColumn('length',F.length('review_text'))

In [None]:
# Keep only 1-, 3-, and 5-star reviews (drop 2- and 4-star reviews)
spark_df = spark_df.filter((spark_df.stars != 2) & (spark_df.stars != 4))
spark_df.show()

+--------------------+--------------------+-----+----+------+-----+-----------+-----------+--------------------+--------------------+------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|review_type|         business_id|             user_id|length|
+--------------------+--------------------+-----+----+------+-----+-----------+-----------+--------------------+--------------------+------+
|-7yxrdY13ay15rGB7...|I have been going...|    5|   0|     0|    0| 2010-01-16|     review|Lh9nz0KYyzE-YRbKu...|ayKW9eWwGFcrtJaHc...|   670|
|2L30O7G8IQ6HILpR0...|part of a social ...|    5|   0|     0|    0| 2010-01-24|     review|qiwajZigq_2twTmYo...|ST8Yzlk2MqKlcaLqL...|   415|
|5h0EVAee-RDbbKfhd...|A great value for...|    5|   2|     1|    0| 2012-01-21|     review|4VzaYvZntWBRbr8Vm...|feQpvbp8jGBWMuG5u...|   284|
|6jV77Bs_Vu_rHkdUx...|This review is fo...|    3|   1|     2|    1| 2011-01-23|     review|kwq3bK7BzPKLwXKqV...|RwVaQNP1Ag-Seu3U9...|  1092|
|8uV26l7-ktb4

In [None]:
# spark_df=spark_df.withColumn('class',F.when( (spark_df["stars"]==1), 0).when((spark_df["stars"]))
# spark_df.show()

In [3]:
# Import dependencies for nltk
# https://towardsdatascience.com/natural-language-processing-nlp-for-machine-learning-d44498845d5b
import nltk

In [4]:
# Import string and punctuations
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [5]:
# Function to remove Punctuation
def remove_punct(text):

  # Discard all punctuations
  text_nopunct = ''.join([char for char in text if char not in string.punctuation])
  return text_nopunct

pandas_df['body_text_clean'] = pandas_df['text'].apply(lambda x: remove_punct(x))

pandas_df.head()

Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool,body_text_clean
0,f5pGCpvkpRpJixZ0zA3hCg,8p2nss7UoZmIVZTr1IjR3w,caWUE0ItqsG51OaBVlr4Eg,2,2016-11-13,We went here three more times for lunch and tw...,0,0,0,We went here three more times for lunch and tw...
1,8lPKsNFBiLmVL5nbsUXaZw,O3pSxv1SyHpY4qi4Q16KzA,dc3uoAmNo5STqKV6mlD_aA,1,2017-05-30,My husband and I went to the Drake for lunch t...,2,1,1,My husband and I went to the Drake for lunch t...
2,CANhCLzOoZ0mkL3mpnUSNg,ffC9zmbY4pBOS9ByrWoXxQ,sR9hPrIaG-J-GLcl4yaiLw,1,2017-11-19,"Very Problematic\nI'm gay, I'm not ashamed to ...",0,0,0,Very Problematic\nIm gay Im not ashamed to say...
3,bZ02moAXlosgWPM3pXSHWw,zE49S2Em3l7vgIlvFzZFOw,NF6di6YcQxN0rDAleE7SyQ,3,2014-12-30,Today was the first time I sat down a table. I...,0,0,0,Today was the first time I sat down a table I ...
4,TmIla5Eh5SSLJ_bKgH4Syg,xde2rO3XVt0Do8kLRIt2Dw,Wc9UpJhOcdSj7olZkz7SJA,2,2013-02-19,I ordered chicken tacos with no cheese to go. ...,4,1,0,I ordered chicken tacos with no cheese to go ...
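
The character-by-character filter in `remove_punct` can also be expressed with `str.translate`, which deletes all punctuation in a single pass. This is only a sketch of an equivalent approach; `remove_punct_translate` is a hypothetical name and not part of the notebook.

```python
import string

# Translation table that maps every character in string.punctuation to "delete"
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punct_translate(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)
```

It could be applied the same way as above, for example `pandas_df['text'].apply(remove_punct_translate)`.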


In [6]:
# Tokenization
import re

# Function to Tokenize words
def tokenize(text):

  # \W+ matches one or more non-word characters (anything other than A-Za-z0-9_),
  # so the text is split wherever such a run occurs
  tokens = re.split(r'\W+', text)
  return tokens

# Convert to lowercase so that words differing only in case map to the same token
pandas_df['body_text_tokenized'] = pandas_df['body_text_clean'].apply(lambda x: tokenize(x.lower()))

pandas_df.head()

Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool,body_text_clean,body_text_tokenized
0,f5pGCpvkpRpJixZ0zA3hCg,8p2nss7UoZmIVZTr1IjR3w,caWUE0ItqsG51OaBVlr4Eg,2,2016-11-13,We went here three more times for lunch and tw...,0,0,0,We went here three more times for lunch and tw...,"[we, went, here, three, more, times, for, lunc..."
1,8lPKsNFBiLmVL5nbsUXaZw,O3pSxv1SyHpY4qi4Q16KzA,dc3uoAmNo5STqKV6mlD_aA,1,2017-05-30,My husband and I went to the Drake for lunch t...,2,1,1,My husband and I went to the Drake for lunch t...,"[my, husband, and, i, went, to, the, drake, fo..."
2,CANhCLzOoZ0mkL3mpnUSNg,ffC9zmbY4pBOS9ByrWoXxQ,sR9hPrIaG-J-GLcl4yaiLw,1,2017-11-19,"Very Problematic\nI'm gay, I'm not ashamed to ...",0,0,0,Very Problematic\nIm gay Im not ashamed to say...,"[very, problematic, im, gay, im, not, ashamed,..."
3,bZ02moAXlosgWPM3pXSHWw,zE49S2Em3l7vgIlvFzZFOw,NF6di6YcQxN0rDAleE7SyQ,3,2014-12-30,Today was the first time I sat down a table. I...,0,0,0,Today was the first time I sat down a table I ...,"[today, was, the, first, time, i, sat, down, a..."
4,TmIla5Eh5SSLJ_bKgH4Syg,xde2rO3XVt0Do8kLRIt2Dw,Wc9UpJhOcdSj7olZkz7SJA,2,2013-02-19,I ordered chicken tacos with no cheese to go. ...,4,1,0,I ordered chicken tacos with no cheese to go ...,"[i, ordered, chicken, tacos, with, no, cheese,..."
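
One detail of splitting on `\W+` is that text starting or ending with a separator produces an empty string in the token list. The short sketch below shows that behaviour and one way to filter it out; `tokenize_demo` and `tokenize_nonempty` are hypothetical helpers, not functions used elsewhere in this notebook.

```python
import re


def tokenize_demo(text):
    # Runs of non-word characters act as separators
    return re.split(r"\W+", text)


def tokenize_nonempty(text):
    # Same split, but drop empty strings left at the string boundaries
    return [tok for tok in re.split(r"\W+", text) if tok]


print(tokenize_demo("today was great "))
print(tokenize_nonempty("today was great "))
```

The first call keeps a trailing empty token because the input ends with a space; the second drops it.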


In [7]:
# Remove all English stopwords
import nltk
nltk.download('stopwords')
stopword = nltk.corpus.stopwords.words('english') 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
# Function to remove stopwords
def remove_stopwords(tokenized_list):

  # Remove all stopwords
  text = [word for word in tokenized_list if word not in stopword]
  return text

pandas_df['body_text_nostop'] = pandas_df['body_text_tokenized'].apply(lambda x: remove_stopwords(x))

pandas_df.head()

Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool,body_text_clean,body_text_tokenized,body_text_nostop
0,f5pGCpvkpRpJixZ0zA3hCg,8p2nss7UoZmIVZTr1IjR3w,caWUE0ItqsG51OaBVlr4Eg,2,2016-11-13,We went here three more times for lunch and tw...,0,0,0,We went here three more times for lunch and tw...,"[we, went, here, three, more, times, for, lunc...","[went, three, times, lunch, twice, dinner, thi..."
1,8lPKsNFBiLmVL5nbsUXaZw,O3pSxv1SyHpY4qi4Q16KzA,dc3uoAmNo5STqKV6mlD_aA,1,2017-05-30,My husband and I went to the Drake for lunch t...,2,1,1,My husband and I went to the Drake for lunch t...,"[my, husband, and, i, went, to, the, drake, fo...","[husband, went, drake, lunch, today, sat, pati..."
2,CANhCLzOoZ0mkL3mpnUSNg,ffC9zmbY4pBOS9ByrWoXxQ,sR9hPrIaG-J-GLcl4yaiLw,1,2017-11-19,"Very Problematic\nI'm gay, I'm not ashamed to ...",0,0,0,Very Problematic\nIm gay Im not ashamed to say...,"[very, problematic, im, gay, im, not, ashamed,...","[problematic, im, gay, im, ashamed, say, felt,..."
3,bZ02moAXlosgWPM3pXSHWw,zE49S2Em3l7vgIlvFzZFOw,NF6di6YcQxN0rDAleE7SyQ,3,2014-12-30,Today was the first time I sat down a table. I...,0,0,0,Today was the first time I sat down a table I ...,"[today, was, the, first, time, i, sat, down, a...","[today, first, time, sat, table, like, wait, w..."
4,TmIla5Eh5SSLJ_bKgH4Syg,xde2rO3XVt0Do8kLRIt2Dw,Wc9UpJhOcdSj7olZkz7SJA,2,2013-02-19,I ordered chicken tacos with no cheese to go. ...,4,1,0,I ordered chicken tacos with no cheese to go ...,"[i, ordered, chicken, tacos, with, no, cheese,...","[ordered, chicken, tacos, cheese, go, got, off..."
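
`remove_stopwords` tests membership against a list, which scans the list for every token. Converting the stopwords to a set makes each lookup constant time on average. The sketch below uses a small hypothetical stopword set so that it runs without NLTK; in the notebook the set would be built from `stopword` instead.

```python
# Hypothetical mini stopword set for illustration; the notebook uses NLTK's English list
STOPWORDS = frozenset({"we", "here", "more", "for", "and", "the", "i", "to"})


def remove_stopwords_set(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stopword set."""
    return [word for word in tokens if word not in stopwords]
```

For example, `remove_stopwords_set(tokens, frozenset(stopword))` would give the same result as the list-based version above.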


In [9]:
# Stemming
from nltk.stem import PorterStemmer

# Create a stemmer instance from the class imported above
ps = PorterStemmer()

# Function for stemming
def stemming(tokenized_text):

  text = [ps.stem(word) for word in tokenized_text]
  return text

pandas_df['body_text_stemmed'] = pandas_df['body_text_nostop'].apply(lambda x: stemming(x))

pandas_df.head()

Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed
0,f5pGCpvkpRpJixZ0zA3hCg,8p2nss7UoZmIVZTr1IjR3w,caWUE0ItqsG51OaBVlr4Eg,2,2016-11-13,We went here three more times for lunch and tw...,0,0,0,We went here three more times for lunch and tw...,"[we, went, here, three, more, times, for, lunc...","[went, three, times, lunch, twice, dinner, thi...","[went, three, time, lunch, twice, dinner, thin..."
1,8lPKsNFBiLmVL5nbsUXaZw,O3pSxv1SyHpY4qi4Q16KzA,dc3uoAmNo5STqKV6mlD_aA,1,2017-05-30,My husband and I went to the Drake for lunch t...,2,1,1,My husband and I went to the Drake for lunch t...,"[my, husband, and, i, went, to, the, drake, fo...","[husband, went, drake, lunch, today, sat, pati...","[husband, went, drake, lunch, today, sat, pati..."
2,CANhCLzOoZ0mkL3mpnUSNg,ffC9zmbY4pBOS9ByrWoXxQ,sR9hPrIaG-J-GLcl4yaiLw,1,2017-11-19,"Very Problematic\nI'm gay, I'm not ashamed to ...",0,0,0,Very Problematic\nIm gay Im not ashamed to say...,"[very, problematic, im, gay, im, not, ashamed,...","[problematic, im, gay, im, ashamed, say, felt,...","[problemat, im, gay, im, asham, say, felt, ash..."
3,bZ02moAXlosgWPM3pXSHWw,zE49S2Em3l7vgIlvFzZFOw,NF6di6YcQxN0rDAleE7SyQ,3,2014-12-30,Today was the first time I sat down a table. I...,0,0,0,Today was the first time I sat down a table I ...,"[today, was, the, first, time, i, sat, down, a...","[today, first, time, sat, table, like, wait, w...","[today, first, time, sat, tabl, like, wait, we..."
4,TmIla5Eh5SSLJ_bKgH4Syg,xde2rO3XVt0Do8kLRIt2Dw,Wc9UpJhOcdSj7olZkz7SJA,2,2013-02-19,I ordered chicken tacos with no cheese to go. ...,4,1,0,I ordered chicken tacos with no cheese to go ...,"[i, ordered, chicken, tacos, with, no, cheese,...","[ordered, chicken, tacos, cheese, go, got, off...","[order, chicken, taco, chees, go, got, offic, ..."


In [11]:
# Lemmatization
# import these modules 
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer 

# Create a lemmatizer instance from the class imported above
wn = WordNetLemmatizer()

def lemmatizing(tokenized_text):

  text = [wn.lemmatize(word) for word in tokenized_text]
  return text

pandas_df['body_text_lemmatized'] = pandas_df['body_text_nostop'].apply(lambda x: lemmatizing(x))

pandas_df.head()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed,length,body_text_lemmatized
0,f5pGCpvkpRpJixZ0zA3hCg,8p2nss7UoZmIVZTr1IjR3w,caWUE0ItqsG51OaBVlr4Eg,2,2016-11-13,We went here three more times for lunch and tw...,0,0,0,We went here three more times for lunch and tw...,"[we, went, here, three, more, times, for, lunc...","[went, three, times, lunch, twice, dinner, thi...","[went, three, time, lunch, twice, dinner, thin...",572,"[went, three, time, lunch, twice, dinner, thin..."
1,8lPKsNFBiLmVL5nbsUXaZw,O3pSxv1SyHpY4qi4Q16KzA,dc3uoAmNo5STqKV6mlD_aA,1,2017-05-30,My husband and I went to the Drake for lunch t...,2,1,1,My husband and I went to the Drake for lunch t...,"[my, husband, and, i, went, to, the, drake, fo...","[husband, went, drake, lunch, today, sat, pati...","[husband, went, drake, lunch, today, sat, pati...",407,"[husband, went, drake, lunch, today, sat, pati..."
2,CANhCLzOoZ0mkL3mpnUSNg,ffC9zmbY4pBOS9ByrWoXxQ,sR9hPrIaG-J-GLcl4yaiLw,1,2017-11-19,"Very Problematic\nI'm gay, I'm not ashamed to ...",0,0,0,Very Problematic\nIm gay Im not ashamed to say...,"[very, problematic, im, gay, im, not, ashamed,...","[problematic, im, gay, im, ashamed, say, felt,...","[problemat, im, gay, im, asham, say, felt, ash...",483,"[problematic, im, gay, im, ashamed, say, felt,..."
3,bZ02moAXlosgWPM3pXSHWw,zE49S2Em3l7vgIlvFzZFOw,NF6di6YcQxN0rDAleE7SyQ,3,2014-12-30,Today was the first time I sat down a table. I...,0,0,0,Today was the first time I sat down a table I ...,"[today, was, the, first, time, i, sat, down, a...","[today, first, time, sat, table, like, wait, w...","[today, first, time, sat, tabl, like, wait, we...",1682,"[today, first, time, sat, table, like, wait, w..."
4,TmIla5Eh5SSLJ_bKgH4Syg,xde2rO3XVt0Do8kLRIt2Dw,Wc9UpJhOcdSj7olZkz7SJA,2,2013-02-19,I ordered chicken tacos with no cheese to go. ...,4,1,0,I ordered chicken tacos with no cheese to go ...,"[i, ordered, chicken, tacos, with, no, cheese,...","[ordered, chicken, tacos, cheese, go, got, off...","[order, chicken, taco, chees, go, got, offic, ...",324,"[ordered, chicken, taco, cheese, go, got, offi..."


In [12]:
pandas_df['length'] = pandas_df['text'].apply(len)
pandas_df.head()

Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed,length,body_text_lemmatized
0,f5pGCpvkpRpJixZ0zA3hCg,8p2nss7UoZmIVZTr1IjR3w,caWUE0ItqsG51OaBVlr4Eg,2,2016-11-13,We went here three more times for lunch and tw...,0,0,0,We went here three more times for lunch and tw...,"[we, went, here, three, more, times, for, lunc...","[went, three, times, lunch, twice, dinner, thi...","[went, three, time, lunch, twice, dinner, thin...",572,"[went, three, time, lunch, twice, dinner, thin..."
1,8lPKsNFBiLmVL5nbsUXaZw,O3pSxv1SyHpY4qi4Q16KzA,dc3uoAmNo5STqKV6mlD_aA,1,2017-05-30,My husband and I went to the Drake for lunch t...,2,1,1,My husband and I went to the Drake for lunch t...,"[my, husband, and, i, went, to, the, drake, fo...","[husband, went, drake, lunch, today, sat, pati...","[husband, went, drake, lunch, today, sat, pati...",407,"[husband, went, drake, lunch, today, sat, pati..."
2,CANhCLzOoZ0mkL3mpnUSNg,ffC9zmbY4pBOS9ByrWoXxQ,sR9hPrIaG-J-GLcl4yaiLw,1,2017-11-19,"Very Problematic\nI'm gay, I'm not ashamed to ...",0,0,0,Very Problematic\nIm gay Im not ashamed to say...,"[very, problematic, im, gay, im, not, ashamed,...","[problematic, im, gay, im, ashamed, say, felt,...","[problemat, im, gay, im, asham, say, felt, ash...",483,"[problematic, im, gay, im, ashamed, say, felt,..."
3,bZ02moAXlosgWPM3pXSHWw,zE49S2Em3l7vgIlvFzZFOw,NF6di6YcQxN0rDAleE7SyQ,3,2014-12-30,Today was the first time I sat down a table. I...,0,0,0,Today was the first time I sat down a table I ...,"[today, was, the, first, time, i, sat, down, a...","[today, first, time, sat, table, like, wait, w...","[today, first, time, sat, tabl, like, wait, we...",1682,"[today, first, time, sat, table, like, wait, w..."
4,TmIla5Eh5SSLJ_bKgH4Syg,xde2rO3XVt0Do8kLRIt2Dw,Wc9UpJhOcdSj7olZkz7SJA,2,2013-02-19,I ordered chicken tacos with no cheese to go. ...,4,1,0,I ordered chicken tacos with no cheese to go ...,"[i, ordered, chicken, tacos, with, no, cheese,...","[ordered, chicken, tacos, cheese, go, got, off...","[order, chicken, taco, chees, go, got, offic, ...",324,"[ordered, chicken, taco, cheese, go, got, offi..."


## <br></br>**Pipeline**<br></br>

In [13]:
pandas_df_copy = pandas_df.copy()

pandas_df_copy = pandas_df_copy[['review_id', 'text', 'stars', 'cool', 'useful', 'funny', 'date', 'business_id', 'user_id', 'length', 'body_text_nostop', 'body_text_stemmed', 'body_text_lemmatized']]

# Convert the pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df_copy)
spark_df.show(5)

+--------------------+--------------------+-----+----+------+-----+----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+
|           review_id|                text|stars|cool|useful|funny|      date|         business_id|             user_id|length|    body_text_nostop|   body_text_stemmed|body_text_lemmatized|
+--------------------+--------------------+-----+----+------+-----+----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+
|f5pGCpvkpRpJixZ0z...|We went here thre...|    2|   0|     0|    0|2016-11-13|caWUE0ItqsG51OaBV...|8p2nss7UoZmIVZTr1...|   572|[went, three, tim...|[went, three, tim...|[went, three, tim...|
|8lPKsNFBiLmVL5nbs...|My husband and I ...|    1|   1|     2|    1|2017-05-30|dc3uoAmNo5STqKV6m...|O3pSxv1SyHpY4qi4Q...|   407|[husband, went, d...|[husband, went, d...|[husband, went, d...|
|CANhCLzOoZ0mkL3mp...|Very Problematic
...|  

In [14]:
# Import functions
from pyspark.ml.feature import HashingTF, IDF, StringIndexer

In [15]:
# Wrap each stars value in a single-element array, since CountVectorizer expects an array column
from pyspark.sql.functions import col, split
spark_df = spark_df.withColumn("star_array", split(col("stars"), " "))
spark_df.show()

+--------------------+--------------------+-----+----+------+-----+----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+----------+
|           review_id|                text|stars|cool|useful|funny|      date|         business_id|             user_id|length|    body_text_nostop|   body_text_stemmed|body_text_lemmatized|star_array|
+--------------------+--------------------+-----+----+------+-----+----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+----------+
|f5pGCpvkpRpJixZ0z...|We went here thre...|    2|   0|     0|    0|2016-11-13|caWUE0ItqsG51OaBV...|8p2nss7UoZmIVZTr1...|   572|[went, three, tim...|[went, three, tim...|[went, three, tim...|       [2]|
|8lPKsNFBiLmVL5nbs...|My husband and I ...|    1|   1|     2|    1|2017-05-30|dc3uoAmNo5STqKV6m...|O3pSxv1SyHpY4qi4Q...|   407|[husband, went, d...|[husband, went, d...|[husband, went, d...|  

In [16]:
# Initialize a CountVectorizer
from pyspark.ml.feature import CountVectorizer
star_vectorizer = CountVectorizer(inputCol="star_array", outputCol="stars_one_hot", vocabSize=5, minDF=1.0)

In [17]:
# Create a vector model
star_vector_model = star_vectorizer.fit(spark_df)

In [18]:
# One hot encoded column
df_ohe = star_vector_model.transform(spark_df)
df_ohe.show(3)

+--------------------+--------------------+-----+----+------+-----+----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+----------+-------------+
|           review_id|                text|stars|cool|useful|funny|      date|         business_id|             user_id|length|    body_text_nostop|   body_text_stemmed|body_text_lemmatized|star_array|stars_one_hot|
+--------------------+--------------------+-----+----+------+-----+----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+----------+-------------+
|f5pGCpvkpRpJixZ0z...|We went here thre...|    2|   0|     0|    0|2016-11-13|caWUE0ItqsG51OaBV...|8p2nss7UoZmIVZTr1...|   572|[went, three, tim...|[went, three, tim...|[went, three, tim...|       [2]|(5,[4],[1.0])|
|8lPKsNFBiLmVL5nbs...|My husband and I ...|    1|   1|     2|    1|2017-05-30|dc3uoAmNo5STqKV6m...|O3pSxv1SyHpY4qi4Q...|   407|[husband,
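
In the output above, `(5,[4],[1.0])` is a sparse vector of length 5 with the value 1.0 at index 4. The index is the position of the star value in the fitted vocabulary, which `CountVectorizer` orders by term frequency. A rough pure-Python sketch of that idea follows. The sample rows, the alphabetical tie-break, and the helper names `fit_vocabulary` and `one_hot` are assumptions made for illustration; this is not Spark's implementation.

```python
from collections import Counter


def fit_vocabulary(rows):
    """Map each term to an index, most frequent term first."""
    counts = Counter(term for row in rows for term in row)
    # Ties are broken alphabetically here; that tie-break is an assumption of this sketch
    ordered = sorted(counts, key=lambda term: (-counts[term], term))
    return {term: index for index, term in enumerate(ordered)}


def one_hot(row, vocab):
    """Dense count vector for one row of terms."""
    vector = [0.0] * len(vocab)
    for term in row:
        vector[vocab[term]] += 1.0
    return vector


# Hypothetical single-element rows, shaped like the star_array column
star_rows = [["5"], ["5"], ["5"], ["1"], ["1"], ["2"]]
vocab = fit_vocabulary(star_rows)
print(vocab)
print(one_hot(["2"], vocab))
```

Because each row holds exactly one star value, every resulting vector has a single non-zero entry, which is why the column behaves like a one-hot encoding.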

In [19]:
# Define the feature stages: label index, hashed term frequencies, and IDF weighting
star_rating = StringIndexer(inputCol='stars',outputCol='label')
hashingTF = HashingTF(inputCol="body_text_stemmed", outputCol='hash_token')
idf = IDF(inputCol='hash_token', outputCol='idf_token')
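
To make the `HashingTF` and `IDF` stages more concrete, here is a small pure-Python sketch of the same two steps: hash each token into a fixed number of buckets and count, then scale each bucket by an inverse document frequency of the form log((m + 1) / (df + 1)), where m is the number of documents. The use of `zlib.crc32` as the hash, the 16 buckets, and all function names are assumptions of this sketch; Spark uses its own hash function and a much larger default feature space.

```python
import math
import zlib
from collections import Counter


def bucket_of(token, num_features=16):
    # Stand-in hash for illustration; not the hash function Spark uses
    return zlib.crc32(token.encode("utf-8")) % num_features


def hashing_tf(tokens, num_features=16):
    """Count how many tokens of one document fall into each hash bucket."""
    return dict(Counter(bucket_of(tok, num_features) for tok in tokens))


def idf_weights(tf_rows):
    """IDF per bucket: log((m + 1) / (df + 1)), with m = number of documents."""
    m = len(tf_rows)
    doc_freq = Counter(bucket for row in tf_rows for bucket in row)
    return {bucket: math.log((m + 1) / (df + 1)) for bucket, df in doc_freq.items()}


def tf_idf(tf_row, idf):
    """Scale each bucket's term count by its IDF weight."""
    return {bucket: count * idf[bucket] for bucket, count in tf_row.items()}


# Hypothetical stemmed documents
docs = [["taco", "chees"], ["taco", "lunch"]]
tf_rows = [hashing_tf(doc) for doc in docs]
idf = idf_weights(tf_rows)
print([tf_idf(row, idf) for row in tf_rows])
```

A bucket that occurs in every document gets an IDF of zero, so terms such as "taco" in this toy corpus contribute nothing to the weighted vector.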

In [20]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vector
# Create feature vector 
clean_up = VectorAssembler(inputCols=['idf_token', 'length'], outputCol='features')

In [21]:
# Create and run a data processing Pipeline
from pyspark.ml import Pipeline
data_prep_pipeline = Pipeline(stages=[star_rating, hashingTF, idf, clean_up])

In [22]:
# Fit and transform the pipeline
cleaner = data_prep_pipeline.fit(df_ohe)
cleaned = cleaner.transform(df_ohe)

In [23]:
cleaned.show(5)

+--------------------+--------------------+-----+----+------+-----+----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+----------+-------------+-----+--------------------+--------------------+--------------------+
|           review_id|                text|stars|cool|useful|funny|      date|         business_id|             user_id|length|    body_text_nostop|   body_text_stemmed|body_text_lemmatized|star_array|stars_one_hot|label|          hash_token|           idf_token|            features|
+--------------------+--------------------+-----+----+------+-----+----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+----------+-------------+-----+--------------------+--------------------+--------------------+
|f5pGCpvkpRpJixZ0z...|We went here thre...|    2|   0|     0|    0|2016-11-13|caWUE0ItqsG51OaBV...|8p2nss7UoZmIVZTr1...|   572|[went, three, tim.

## **Machine Learning Models**

In [24]:
#Drop intermediate columns
x=cleaned.select('features', 'label')
x.show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(262145,[4200,299...|  4.0|
|(262145,[6872,156...|  3.0|
|(262145,[3067,690...|  3.0|
|(262145,[1353,232...|  1.0|
|(262145,[27564,31...|  4.0|
+--------------------+-----+
only showing top 5 rows



In [24]:
x.dtypes

[('features', 'vector'), ('label', 'double')]

In [25]:
# Import col
from pyspark.sql.functions import col

# Cast the label column from double to int
x = x.withColumn('label', col('label').cast('int'))

x.show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(262145,[4200,299...|    4|
|(262145,[6872,156...|    3|
|(262145,[3067,690...|    3|
|(262145,[1353,232...|    1|
|(262145,[27564,31...|    4|
+--------------------+-----+
only showing top 5 rows



**Naive Bayes**

In [31]:
# Break data down into a training set and a testing set
training, testing = x.randomSplit([0.8, 0.2], 21)

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:36333)
Traceback (most recent call last):
  File "/content/spark-2.4.6-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 929, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/spark-2.4.6-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1067, in start
    self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused


Py4JNetworkError: ignored

In [29]:
from pyspark.ml.classification import NaiveBayes
# Create a Naive Bayes model and fit training data
nb = NaiveBayes()
predictor = nb.fit(training)

In [30]:
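# Evaluate predictions on the test set.
# Note: MulticlassClassificationEvaluator defaults to metricName="f1" (weighted F1),
# so the value printed below is a weighted F1 score; pass metricName="accuracy" for accuracy.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(predictor.transform(testing))
print("Accuracy of model at predicting reviews was: %f" % acc)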

Accuracy of model at predicting reviews was: 0.358936
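
`MulticlassClassificationEvaluator` uses `metricName="f1"` unless told otherwise, so the figure printed above is a weighted F1 score rather than plain accuracy. The two can differ noticeably, as the pure-Python sketch below illustrates on a tiny made-up example. The labels, predictions, and function names are hypothetical and only serve to show the difference.

```python
from collections import Counter


def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights equal to each class's share of true labels."""
    support = Counter(labels)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        n_pred = sum(1 for p in predictions if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true / len(labels)
    return total


# Made-up example: the model always predicts class 0
labels = [0, 0, 0, 1]
predictions = [0, 0, 0, 0]
print(accuracy(labels, predictions))
print(weighted_f1(labels, predictions))
```

Here accuracy is 0.75 while the weighted F1 is lower, because the class that is never predicted contributes an F1 of zero.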


**Logistic Regression**

In [None]:
# Break data down into a training set and a testing set
training, testing = x.randomSplit([0.8, 0.2], 21)

In [29]:
from pyspark.ml.classification import LogisticRegression
# Create a logistic regression model and fit training data
lg = LogisticRegression()
predictor = lg.fit(training)

In [30]:
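# Note: the evaluator's default metricName is "f1" (weighted F1), so the printed value
# is a weighted F1 score; pass metricName="accuracy" for accuracy.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(predictor.transform(testing))
print("Accuracy of model at predicting reviews was: %f" % acc)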

Accuracy of model at predicting reviews was: 0.444839


**Multilayer Perceptron**

In [26]:
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [27]:
# Split the data into train and test
splits = x.randomSplit([0.8, 0.2], 1234)
train = splits[0]
test = splits[1]

In [27]:
train.show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(262145,[47,1176,...|    0|
|(262145,[97,1353,...|    4|
|(262145,[97,3831,...|    0|
|(262145,[97,4402,...|    3|
|(262145,[97,5765,...|    1|
|(262145,[97,13957...|    2|
|(262145,[98,6646,...|    3|
|(262145,[170,427,...|    3|
|(262145,[170,976,...|    1|
|(262145,[170,1353...|    4|
|(262145,[198,427,...|    4|
|(262145,[198,991,...|    0|
|(262145,[234,1288...|    4|
|(262145,[234,1395...|    3|
|(262145,[234,1729...|    2|
|(262145,[251,976,...|    4|
|(262145,[320,1076...|    2|
|(262145,[323,2544...|    0|
|(262145,[323,2742...|    0|
|(262145,[343,991,...|    0|
+--------------------+-----+
only showing top 20 rows



In [28]:
# Specify layers for the neural network:
# input layer of size 262145 (262144 hashed TF-IDF features plus the length column),
# one hidden layer of size 256, and an output layer of size 5 (classes)
layers = [262145, 256, 5]
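
With an input layer this wide, the network has a very large number of weights. The quick calculation below counts them for a fully connected network with one bias per output unit; `mlp_weight_count` is a hypothetical helper, and the memory figure assumes 64-bit floats. This is only a back-of-the-envelope estimate and may or may not be related to the training failure shown further down.

```python
def mlp_weight_count(layers):
    """Number of weights in a fully connected network, with one bias per output unit."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


layers = [262145, 256, 5]
n_weights = mlp_weight_count(layers)
print(n_weights)
# Approximate size in megabytes if each weight is stored as a 64-bit float
print(n_weights * 8 / 1e6)
```

Almost all of the weights sit between the input and the hidden layer, so reducing the number of hashed features (for example through the `numFeatures` parameter of `HashingTF`) would shrink the model roughly in proportion.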

In [29]:
# create the trainer and set its parameters
trainer = MultilayerPerceptronClassifier(maxIter=10, layers=layers, blockSize=128, seed=1234)

In [30]:
# train the model
model = trainer.fit(train)

----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 47210)
Traceback (most recent call last):
  File "/usr/lib/python3.6/socketserver.py", line 320, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.6/socketserver.py", line 351, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python3.6/socketserver.py", line 364, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python3.6/socketserver.py", line 724, in __init__
    self.handle()
  File "/content/spark-2.4.6-bin-hadoop2.7/python/pyspark/accumulators.py", line 269, in handle
    poll(accum_updates)
  File "/content/spark-2.4.6-bin-hadoop2.7/python/pyspark/accumulators.py", line 241, in poll
    if func():
  File "/content/spark-2.4.6-bin-hadoop2.7/python/pyspark/accumulators.py", line 245, in accum_updates
    num_updates = read_int(self.rfile)
  File

Py4JNetworkError: ignored