<a href="https://colab.research.google.com/github/karenbennis/Xy/blob/ml_model/pyspark_pipeline_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<br>**Connect to Database**<br><br>

In [28]:
# Install Java, Spark, and Findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www-us.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
!tar xf spark-2.4.6-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.6-bin-hadoop2.7"

# Locate the Spark installation and add PySpark to sys.path
import findspark
findspark.init()

# Download the PostgreSQL JDBC driver so Spark can read from the database
!wget https://jdbc.postgresql.org/download/postgresql-42.2.9.jar

# Start a SparkSession (the application name is set by appName()) with the PostgreSQL driver on the driver classpath
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("database_transformation").config("spark.driver.extraClassPath","/content/postgresql-42.2.9.jar").getOrCreate()


--2020-07-23 03:57:14--  https://jdbc.postgresql.org/download/postgresql-42.2.9.jar
Resolving jdbc.postgresql.org (jdbc.postgresql.org)... 72.32.157.228, 2001:4800:3e1:1::228
Connecting to jdbc.postgresql.org (jdbc.postgresql.org)|72.32.157.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 914037 (893K) [application/java-archive]
Saving to: ‘postgresql-42.2.9.jar.1’


2020-07-23 03:57:14 (3.65 MB/s) - ‘postgresql-42.2.9.jar.1’ saved [914037/914037]



In [29]:
# gcloud login and check the DB
!gcloud auth login
!gcloud config set project 'xy-yelp'
!gcloud sql instances describe 'xy-yelp'

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&code_challenge=5e_NC6__d4G1UIO_ekA8ewmD0Cm-NgDSydYS_Z0pYw0&code_challenge_method=S256&access_type=offline&response_type=code&prompt=select_account


Enter verification code: 4/2AHd8t8p9HuJ_kX5ZUzaVi18KprPHWOUeNL2kBpEZUDCUQdnwEhlt7I

You are now logged in as [helenly25@gmail.com].
Your current project is [xy-yelp].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID
Updated property [core/project].
backendType: SECOND_GEN
connectionName: xy-yelp:northamerica-northeast1:xy-yelp
databaseV

In [30]:
# download and initialize the psql proxy
!wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy
!chmod +x cloud_sql_proxy
# "connectionName" is from the previous block
!nohup ./cloud_sql_proxy -instances="xy-yelp:northamerica-northeast1:xy-yelp"=tcp:5432 &
!sleep 30s

cloud_sql_proxy: Text file busy
nohup: appending output to 'nohup.out'
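
The fixed `sleep 30s` above assumes the proxy is ready within 30 seconds. As a hedged alternative, the sketch below polls the local port until a TCP connection succeeds. `wait_for_port` is a hypothetical helper, not part of the Cloud SQL proxy or the original notebook, and the defaults for `timeout` and `interval` are illustrative.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or refused); wait briefly and retry.
            time.sleep(interval)
    return False
```

In this notebook the call would be `wait_for_port("127.0.0.1", 5432)`, matching the `tcp:5432` mapping passed to the proxy.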


In [31]:
db_password = 'kjhbyelpdb'
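
Hard-coding the password leaves it in the notebook and its history. One possible alternative, sketched below, reads it from an environment variable and falls back to an interactive prompt. The variable name `XY_DB_PASSWORD` and the helper `get_db_password` are assumptions made for illustration; they do not appear in the original notebook.

```python
import os
from getpass import getpass


def get_db_password(env_var="XY_DB_PASSWORD"):
    """Read the password from an environment variable, prompting only if it is unset."""
    password = os.environ.get(env_var)
    if password is None:
        # getpass hides the typed value, so it does not appear in the cell output.
        password = getpass("Database password: ")
    return password
```

With this helper, the cell above would become `db_password = get_db_password()`.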

In [32]:
# Configure JDBC connection settings for the Cloud SQL (PostgreSQL) database, reached through the local proxy
mode = "append"
jdbc_url="jdbc:postgresql://127.0.0.1:5432/xy_yelp_db"
config = {"user":"postgres", 
          "password": db_password, 
          "driver":"org.postgresql.Driver"}

**Extract tables**

In [33]:
# Pull review table
review_df2 = spark.read.jdbc(url=jdbc_url, table='review',properties=config)
review_df2.show(5)

+--------------------+--------------------+-----+----+------+-----+-----------+-----------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|review_type|
+--------------------+--------------------+-----+----+------+-----+-----------+-----------+
|cALYebKb5hygdKHql...|This is a very in...|    4|   0|     0|    0| 2011-01-12|     review|
|SawdMXLYD5ytRmMFv...|I LOVE Chic Nails...|    5|   0|     2|    0| 2011-01-20|     review|
|j-jMQdELr6AFkCcEH...|After the Padres ...|    5|   0|     0|    0| 2011-01-06|     review|
|SmUMyCUNrT9HEo_DX...|I have to admit t...|    4|   0|     1|    0| 2010-01-17|     review|
|oTB_mpCKcu-8wayQQ...|Best food, super ...|    5|   0|     1|    0| 2011-01-14|     review|
+--------------------+--------------------+-----+----+------+-----+-----------+-----------+
only showing top 5 rows



In [34]:
# Pull business table
business_df2 = spark.read.jdbc(url=jdbc_url, table='business',properties=config)
business_df2.show(5)

+--------------------+--------------------+
|           review_id|         business_id|
+--------------------+--------------------+
|fWKvX83p0-ka4JS3d...|9yKzy9PApeiPPOUJE...|
|IjZ33sJrzXqU-0X6U...|ZRJwVLyzEJq1VAihD...|
|IESLBzqUCLdSzSqm0...|6oRAC4uyJCsJl1X0W...|
|G-WvGaISbqqaMHlNn...|_1QQZuf4zZOyFCvXc...|
|1uJFq2r5QfJG_6ExM...|6ozycU1RpktNG2-1B...|
+--------------------+--------------------+
only showing top 5 rows



In [35]:
# Pull yelp_user table
user_df2 = spark.read.jdbc(url=jdbc_url, table='yelp_user',properties=config)
user_df2.show(5)

+--------------------+--------------------+
|           review_id|             user_id|
+--------------------+--------------------+
|GJGUHAAONtBSBj53c...|Z3c7xGRfeV-uMkSbA...|
|nQH2KAvAeOJOYKX99...|ryjqXdp68i2I9JPOp...|
|-yKcbjWSlmKC1zTMT...|5W-ruHmpkwLyI6Lla...|
|20aES_-g5Vyqfzojn...|vhxFLqRok6r-D_aQz...|
|W_d9w7yr3koSUXHco...|aBnKTxZzdhabTXfzt...|
+--------------------+--------------------+
only showing top 5 rows



In [36]:
# Join tables
spark_df = review_df2.join(business_df2, on="review_id", how="inner")
spark_df = spark_df.join(user_df2, on="review_id", how="inner")
spark_df.show(5)

+--------------------+--------------------+-----+----+------+-----+-----------+-----------+--------------------+--------------------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|review_type|         business_id|             user_id|
+--------------------+--------------------+-----+----+------+-----+-----------+-----------+--------------------+--------------------+
|-7yxrdY13ay15rGB7...|I have been going...|    5|   0|     0|    0| 2010-01-16|     review|Lh9nz0KYyzE-YRbKu...|ayKW9eWwGFcrtJaHc...|
|-Be0UUGYuiDJVAM_Y...|Since Im big into...|    4|   0|     2|    2| 2011-01-25|     review|pa6K7DGByxBXxcVJ5...|_4lqpCYCqOQzbB6xQ...|
|-nQHHXi-d_yuW301_...|A pleasant place ...|    2|   0|     0|    0| 2011-01-12|     review|GIGI8bJfN6HyPzmEW...|4QORbyhfN01oKR_Gg...|
|2L30O7G8IQ6HILpR0...|part of a social ...|    5|   0|     0|    0| 2010-01-24|     review|qiwajZigq_2twTmYo...|ST8Yzlk2MqKlcaLqL...|
|4x5yLG7_yGLuN-w6f...|I love every plac...|    4|   0|     1| 

**Transformation**

In [37]:
import pyspark.sql.functions as F

spark_df=spark_df.withColumn('length',F.length('review_text'))

In [38]:
spark_df=spark_df.withColumn('class',F.when( (spark_df["stars"]>3), 1).otherwise(0))
spark_df.show()

+--------------------+--------------------+-----+----+------+-----+-----------+-----------+--------------------+--------------------+------+-----+
|           review_id|         review_text|stars|cool|useful|funny|review_date|review_type|         business_id|             user_id|length|class|
+--------------------+--------------------+-----+----+------+-----+-----------+-----------+--------------------+--------------------+------+-----+
|-7yxrdY13ay15rGB7...|I have been going...|    5|   0|     0|    0| 2010-01-16|     review|Lh9nz0KYyzE-YRbKu...|ayKW9eWwGFcrtJaHc...|   670|    1|
|-Be0UUGYuiDJVAM_Y...|Since Im big into...|    4|   0|     2|    2| 2011-01-25|     review|pa6K7DGByxBXxcVJ5...|_4lqpCYCqOQzbB6xQ...|  1348|    1|
|-nQHHXi-d_yuW301_...|A pleasant place ...|    2|   0|     0|    0| 2011-01-12|     review|GIGI8bJfN6HyPzmEW...|4QORbyhfN01oKR_Gg...|   813|    0|
|2L30O7G8IQ6HILpR0...|part of a social ...|    5|   0|     0|    0| 2010-01-24|     review|qiwajZigq_2twTmYo...|ST8Yzl

In [39]:
# convert to pandas
pandas_df = spark_df.toPandas()

# Set index
# pandas_df = pandas_df.set_index('review_id')

pandas_df.head()

Unnamed: 0,review_id,review_text,stars,cool,useful,funny,review_date,review_type,business_id,user_id,length,class
0,-7yxrdY13ay15rGB7WibMA,I have been going to Arizona Auto Care since a...,5,0,0,0,2010-01-16,review,Lh9nz0KYyzE-YRbKuCYeUw,ayKW9eWwGFcrtJaHcwZUCw,670,1
1,-Be0UUGYuiDJVAM_YqeQuA,"Since Im big into breakfast foods, Im always o...",4,0,2,2,2011-01-25,review,pa6K7DGByxBXxcVJ59nWMw,_4lqpCYCqOQzbB6xQGGhrQ,1348,1
2,-nQHHXi-d_yuW301_Y0EZQ,"A pleasant place in Kierland Center, but has g...",2,0,0,0,2011-01-12,review,GIGI8bJfN6HyPzmEW-QqjA,4QORbyhfN01oKR_GgBstfQ,813,0
3,2L30O7G8IQ6HILpR0t5RFA,"part of a social event, we only had app's here...",5,0,0,0,2010-01-24,review,qiwajZigq_2twTmYofPmDQ,ST8Yzlk2MqKlcaLqL2djBg,415,1
4,4x5yLG7_yGLuN-w6fV0eBw,I love every place on South Mountain. I've bee...,4,0,1,0,2011-01-02,review,9yKzy9PApeiPPOUJEtnvkg,Vk-hJ1i5ZagPM87Kv9FOnA,302,1


In [40]:
# Import dependencies for nltk
# https://towardsdatascience.com/natural-language-processing-nlp-for-machine-learning-d44498845d5b
import nltk

In [41]:
# Import string and punctuations
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [42]:
# Function to remove Punctuation
def remove_punct(text):

  # Discard all punctuations
  text_nopunct = ''.join([char for char in text if char not in string.punctuation])
  return text_nopunct

pandas_df['body_text_clean'] = pandas_df['review_text'].apply(lambda x: remove_punct(x))

pandas_df.head()

Unnamed: 0,review_id,review_text,stars,cool,useful,funny,review_date,review_type,business_id,user_id,length,class,body_text_clean
0,-7yxrdY13ay15rGB7WibMA,I have been going to Arizona Auto Care since a...,5,0,0,0,2010-01-16,review,Lh9nz0KYyzE-YRbKuCYeUw,ayKW9eWwGFcrtJaHcwZUCw,670,1,I have been going to Arizona Auto Care since a...
1,-Be0UUGYuiDJVAM_YqeQuA,"Since Im big into breakfast foods, Im always o...",4,0,2,2,2011-01-25,review,pa6K7DGByxBXxcVJ59nWMw,_4lqpCYCqOQzbB6xQGGhrQ,1348,1,Since Im big into breakfast foods Im always on...
2,-nQHHXi-d_yuW301_Y0EZQ,"A pleasant place in Kierland Center, but has g...",2,0,0,0,2011-01-12,review,GIGI8bJfN6HyPzmEW-QqjA,4QORbyhfN01oKR_GgBstfQ,813,0,A pleasant place in Kierland Center but has go...
3,2L30O7G8IQ6HILpR0t5RFA,"part of a social event, we only had app's here...",5,0,0,0,2010-01-24,review,qiwajZigq_2twTmYofPmDQ,ST8Yzlk2MqKlcaLqL2djBg,415,1,part of a social event we only had apps here q...
4,4x5yLG7_yGLuN-w6fV0eBw,I love every place on South Mountain. I've bee...,4,0,1,0,2011-01-02,review,9yKzy9PApeiPPOUJEtnvkg,Vk-hJ1i5ZagPM87Kv9FOnA,302,1,I love every place on South Mountain Ive been ...


In [43]:
# Tokenization
import re

# Function to Tokenize words
def tokenize(text):

  # \W+ matches one or more non-word characters (anything other than A-Za-z0-9_), so the text is split on those runs
  tokens = re.split(r'\W+', text)
  return tokens

# Lowercase first so tokens match the lowercase stopword list regardless of original casing
pandas_df['body_text_tokenized'] = pandas_df['body_text_clean'].apply(lambda x: tokenize(x.lower()))

pandas_df.head()

Unnamed: 0,review_id,review_text,stars,cool,useful,funny,review_date,review_type,business_id,user_id,length,class,body_text_clean,body_text_tokenized
0,-7yxrdY13ay15rGB7WibMA,I have been going to Arizona Auto Care since a...,5,0,0,0,2010-01-16,review,Lh9nz0KYyzE-YRbKuCYeUw,ayKW9eWwGFcrtJaHcwZUCw,670,1,I have been going to Arizona Auto Care since a...,"[i, have, been, going, to, arizona, auto, care..."
1,-Be0UUGYuiDJVAM_YqeQuA,"Since Im big into breakfast foods, Im always o...",4,0,2,2,2011-01-25,review,pa6K7DGByxBXxcVJ59nWMw,_4lqpCYCqOQzbB6xQGGhrQ,1348,1,Since Im big into breakfast foods Im always on...,"[since, im, big, into, breakfast, foods, im, a..."
2,-nQHHXi-d_yuW301_Y0EZQ,"A pleasant place in Kierland Center, but has g...",2,0,0,0,2011-01-12,review,GIGI8bJfN6HyPzmEW-QqjA,4QORbyhfN01oKR_GgBstfQ,813,0,A pleasant place in Kierland Center but has go...,"[a, pleasant, place, in, kierland, center, but..."
3,2L30O7G8IQ6HILpR0t5RFA,"part of a social event, we only had app's here...",5,0,0,0,2010-01-24,review,qiwajZigq_2twTmYofPmDQ,ST8Yzlk2MqKlcaLqL2djBg,415,1,part of a social event we only had apps here q...,"[part, of, a, social, event, we, only, had, ap..."
4,4x5yLG7_yGLuN-w6fV0eBw,I love every place on South Mountain. I've bee...,4,0,1,0,2011-01-02,review,9yKzy9PApeiPPOUJEtnvkg,Vk-hJ1i5ZagPM87Kv9FOnA,302,1,I love every place on South Mountain Ive been ...,"[i, love, every, place, on, south, mountain, i..."
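
One edge case of splitting on `\W+` is that text with leading or trailing whitespace produces empty-string tokens. The sketch below contrasts it with `re.findall(r"\w+", ...)`, which collects word runs and never returns empty strings. The function names and the sample string are illustrative only.

```python
import re


def tokenize_split(text):
    # Split on runs of non-word characters; may yield empty strings at the edges.
    return re.split(r"\W+", text)


def tokenize_findall(text):
    # Collect runs of word characters; never yields empty strings.
    return re.findall(r"\w+", text)


sample = " great food "
split_tokens = tokenize_split(sample)      # ['', 'great', 'food', '']
findall_tokens = tokenize_findall(sample)  # ['great', 'food']
```

Whether the empty tokens matter depends on the later steps; filtering them out (or using the `findall` form) keeps them out of the term counts.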


In [44]:
# Remove all English stopwords
import nltk
nltk.download('stopwords')
stopword = nltk.corpus.stopwords.words('english') 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [45]:
# Function to remove stopwords
def remove_stopwords(tokenized_list):

  # Remove all stopwords
  text = [word for word in tokenized_list if word not in stopword]
  return text

pandas_df['body_text_nostop'] = pandas_df['body_text_tokenized'].apply(lambda x: remove_stopwords(x))

pandas_df.head()

Unnamed: 0,review_id,review_text,stars,cool,useful,funny,review_date,review_type,business_id,user_id,length,class,body_text_clean,body_text_tokenized,body_text_nostop
0,-7yxrdY13ay15rGB7WibMA,I have been going to Arizona Auto Care since a...,5,0,0,0,2010-01-16,review,Lh9nz0KYyzE-YRbKuCYeUw,ayKW9eWwGFcrtJaHcwZUCw,670,1,I have been going to Arizona Auto Care since a...,"[i, have, been, going, to, arizona, auto, care...","[going, arizona, auto, care, since, 2002, work..."
1,-Be0UUGYuiDJVAM_YqeQuA,"Since Im big into breakfast foods, Im always o...",4,0,2,2,2011-01-25,review,pa6K7DGByxBXxcVJ59nWMw,_4lqpCYCqOQzbB6xQGGhrQ,1348,1,Since Im big into breakfast foods Im always on...,"[since, im, big, into, breakfast, foods, im, a...","[since, im, big, breakfast, foods, im, always,..."
2,-nQHHXi-d_yuW301_Y0EZQ,"A pleasant place in Kierland Center, but has g...",2,0,0,0,2011-01-12,review,GIGI8bJfN6HyPzmEW-QqjA,4QORbyhfN01oKR_GgBstfQ,813,0,A pleasant place in Kierland Center but has go...,"[a, pleasant, place, in, kierland, center, but...","[pleasant, place, kierland, center, gone, hill..."
3,2L30O7G8IQ6HILpR0t5RFA,"part of a social event, we only had app's here...",5,0,0,0,2010-01-24,review,qiwajZigq_2twTmYofPmDQ,ST8Yzlk2MqKlcaLqL2djBg,415,1,part of a social event we only had apps here q...,"[part, of, a, social, event, we, only, had, ap...","[part, social, event, apps, quite, delicious, ..."
4,4x5yLG7_yGLuN-w6fV0eBw,I love every place on South Mountain. I've bee...,4,0,1,0,2011-01-02,review,9yKzy9PApeiPPOUJEtnvkg,Vk-hJ1i5ZagPM87Kv9FOnA,302,1,I love every place on South Mountain Ive been ...,"[i, love, every, place, on, south, mountain, i...","[love, every, place, south, mountain, ive, wan..."
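
Two observations about the step above, sketched with toy data. First, checking membership in a list scans it linearly, while a set lookup is constant time on average. Second, because punctuation was stripped before tokenizing, contractions such as "I've" arrive as `ive`, which is why `im` and `ive` survive in the output above. The small `STOPWORDS` set below is a made-up stand-in for NLTK's English list.

```python
# Toy stop-word set for illustration only; the notebook uses NLTK's English list.
STOPWORDS = frozenset({"i", "have", "been", "to", "on", "into"})


def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Set membership is O(1) on average, versus a linear scan for a list.
    return [tok for tok in tokens if tok not in stopwords]


tokens = ["i", "have", "been", "going", "to", "arizona", "ive", "im"]
filtered = remove_stopwords(tokens)  # ['going', 'arizona', 'ive', 'im']
```

In the notebook, the equivalent change would be wrapping the NLTK list once, for example `stopword = set(stopword)`, before applying the function.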


In [46]:
# Stemming
from nltk.stem import PorterStemmer

# Create a stemmer instance
ps = PorterStemmer()

# Function for stemming
def stemming(tokenized_text):

  text = [ps.stem(word) for word in tokenized_text]
  return text

pandas_df['body_text_stemmed'] = pandas_df['body_text_nostop'].apply(lambda x: stemming(x))

pandas_df.head()

Unnamed: 0,review_id,review_text,stars,cool,useful,funny,review_date,review_type,business_id,user_id,length,class,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed
0,-7yxrdY13ay15rGB7WibMA,I have been going to Arizona Auto Care since a...,5,0,0,0,2010-01-16,review,Lh9nz0KYyzE-YRbKuCYeUw,ayKW9eWwGFcrtJaHcwZUCw,670,1,I have been going to Arizona Auto Care since a...,"[i, have, been, going, to, arizona, auto, care...","[going, arizona, auto, care, since, 2002, work...","[go, arizona, auto, care, sinc, 2002, work, co..."
1,-Be0UUGYuiDJVAM_YqeQuA,"Since Im big into breakfast foods, Im always o...",4,0,2,2,2011-01-25,review,pa6K7DGByxBXxcVJ59nWMw,_4lqpCYCqOQzbB6xQGGhrQ,1348,1,Since Im big into breakfast foods Im always on...,"[since, im, big, into, breakfast, foods, im, a...","[since, im, big, breakfast, foods, im, always,...","[sinc, im, big, breakfast, food, im, alway, lo..."
2,-nQHHXi-d_yuW301_Y0EZQ,"A pleasant place in Kierland Center, but has g...",2,0,0,0,2011-01-12,review,GIGI8bJfN6HyPzmEW-QqjA,4QORbyhfN01oKR_GgBstfQ,813,0,A pleasant place in Kierland Center but has go...,"[a, pleasant, place, in, kierland, center, but...","[pleasant, place, kierland, center, gone, hill...","[pleasant, place, kierland, center, gone, hill..."
3,2L30O7G8IQ6HILpR0t5RFA,"part of a social event, we only had app's here...",5,0,0,0,2010-01-24,review,qiwajZigq_2twTmYofPmDQ,ST8Yzlk2MqKlcaLqL2djBg,415,1,part of a social event we only had apps here q...,"[part, of, a, social, event, we, only, had, ap...","[part, social, event, apps, quite, delicious, ...","[part, social, event, app, quit, delici, crab,..."
4,4x5yLG7_yGLuN-w6fV0eBw,I love every place on South Mountain. I've bee...,4,0,1,0,2011-01-02,review,9yKzy9PApeiPPOUJEtnvkg,Vk-hJ1i5ZagPM87Kv9FOnA,302,1,I love every place on South Mountain Ive been ...,"[i, love, every, place, on, south, mountain, i...","[love, every, place, south, mountain, ive, wan...","[love, everi, place, south, mountain, ive, wan..."


In [47]:
# Lemmatization
# Download the WordNet corpus that the lemmatizer depends on
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# Create a lemmatizer instance
wn = WordNetLemmatizer()

def lemmatizing(tokenized_text):

  text = [wn.lemmatize(word) for word in tokenized_text]
  return text

pandas_df['body_text_lemmatized'] = pandas_df['body_text_nostop'].apply(lambda x: lemmatizing(x))

pandas_df.head()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,review_id,review_text,stars,cool,useful,funny,review_date,review_type,business_id,user_id,length,class,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed,body_text_lemmatized
0,-7yxrdY13ay15rGB7WibMA,I have been going to Arizona Auto Care since a...,5,0,0,0,2010-01-16,review,Lh9nz0KYyzE-YRbKuCYeUw,ayKW9eWwGFcrtJaHcwZUCw,670,1,I have been going to Arizona Auto Care since a...,"[i, have, been, going, to, arizona, auto, care...","[going, arizona, auto, care, since, 2002, work...","[go, arizona, auto, care, sinc, 2002, work, co...","[going, arizona, auto, care, since, 2002, work..."
1,-Be0UUGYuiDJVAM_YqeQuA,"Since Im big into breakfast foods, Im always o...",4,0,2,2,2011-01-25,review,pa6K7DGByxBXxcVJ59nWMw,_4lqpCYCqOQzbB6xQGGhrQ,1348,1,Since Im big into breakfast foods Im always on...,"[since, im, big, into, breakfast, foods, im, a...","[since, im, big, breakfast, foods, im, always,...","[sinc, im, big, breakfast, food, im, alway, lo...","[since, im, big, breakfast, food, im, always, ..."
2,-nQHHXi-d_yuW301_Y0EZQ,"A pleasant place in Kierland Center, but has g...",2,0,0,0,2011-01-12,review,GIGI8bJfN6HyPzmEW-QqjA,4QORbyhfN01oKR_GgBstfQ,813,0,A pleasant place in Kierland Center but has go...,"[a, pleasant, place, in, kierland, center, but...","[pleasant, place, kierland, center, gone, hill...","[pleasant, place, kierland, center, gone, hill...","[pleasant, place, kierland, center, gone, hill..."
3,2L30O7G8IQ6HILpR0t5RFA,"part of a social event, we only had app's here...",5,0,0,0,2010-01-24,review,qiwajZigq_2twTmYofPmDQ,ST8Yzlk2MqKlcaLqL2djBg,415,1,part of a social event we only had apps here q...,"[part, of, a, social, event, we, only, had, ap...","[part, social, event, apps, quite, delicious, ...","[part, social, event, app, quit, delici, crab,...","[part, social, event, apps, quite, delicious, ..."
4,4x5yLG7_yGLuN-w6fV0eBw,I love every place on South Mountain. I've bee...,4,0,1,0,2011-01-02,review,9yKzy9PApeiPPOUJEtnvkg,Vk-hJ1i5ZagPM87Kv9FOnA,302,1,I love every place on South Mountain Ive been ...,"[i, love, every, place, on, south, mountain, i...","[love, every, place, south, mountain, ive, wan...","[love, everi, place, south, mountain, ive, wan...","[love, every, place, south, mountain, ive, wan..."


In [49]:
# Apply CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(pandas_df['review_text'])

In [51]:
import pandas as pd

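# Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2; use get_feature_names_out() on newer versions
X_counts_df = pd.DataFrame(X_counts.toarray(), columns=count_vect.get_feature_names())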
X_counts_df.head(10)

Unnamed: 0,00,000,007,00a,00am,00pm,01,02,03,03342,04,05,06,07,08,09,0buxoc0crqjpvkezo3bqog,0l,0tzg,10,100,1000,1000x,1001,100lbs,100s,100th,101,102,102729,1030,104,105,1070,107f,108,109,10am,10ish,10k,...,zoftik,zola,zombi,zombie,zombies,zone,zoned,zoners,zones,zoning,zoo,zoom,zoomed,zooming,zoos,zoya,zoyo,zpizza,zu,zucca,zucchini,zuccini,zuch,zuchinni,zuma,zumba,zupa,zupas,zur,zuzu,zuzus,zweigel,zwiebel,zy,zzed,zzzzzzzzzzzzzzzzz,éclairs,école,ém,òc
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
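
To make the table above easier to read, here is a minimal pure-Python sketch of what a bag-of-words count matrix contains: one column per vocabulary term (sorted alphabetically) and one row per document. It uses the same default token pattern as scikit-learn's `CountVectorizer` (lowercased tokens of two or more word characters), but it is a simplified illustration, not the library's implementation. The two sample documents are invented.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for tok, count in Counter(toks).items():
            row[index[tok]] = count
        matrix.append(row)
    return vocabulary, matrix


docs = ["Best food, super friendly staff", "The food was a let down"]
vocabulary, matrix = bag_of_words(docs)
```

Note that the single-character token "a" is dropped by the pattern, and that most entries are zero, which is why the real matrix is stored sparsely until `toarray()` is called.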


In [56]:
pandas_df_copy = pandas_df.copy()

pandas_df_copy = pandas_df_copy[['review_id', 'review_text', 'stars', 'cool', 'useful', 'funny', 'review_date', 'business_id', 'user_id', 'length', 'class', 'body_text_nostop', 'body_text_stemmed', 'body_text_lemmatized']]

# Convert the pandas DataFrame back to a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df_copy)
spark_df.show(5)

+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+-----+--------------------+--------------------+--------------------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|         business_id|             user_id|length|class|    body_text_nostop|   body_text_stemmed|body_text_lemmatized|
+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+-----+--------------------+--------------------+--------------------+
|-7yxrdY13ay15rGB7...|I have been going...|    5|   0|     0|    0| 2010-01-16|Lh9nz0KYyzE-YRbKu...|ayKW9eWwGFcrtJaHc...|   670|    1|[going, arizona, ...|[go, arizona, aut...|[going, arizona, ...|
|-Be0UUGYuiDJVAM_Y...|Since Im big into...|    4|   0|     2|    2| 2011-01-25|pa6K7DGByxBXxcVJ5...|_4lqpCYCqOQzbB6xQ...|  1348|    1|[since, im, big, ...|[sinc, im, big, b...|[since, im, big, ...|
|-nQHHXi-d

In [57]:
# Import functions
from pyspark.ml.feature import HashingTF, IDF, StringIndexer

In [58]:
# Wrap each stars value in a single-element array, the input type CountVectorizer expects
from pyspark.sql.functions import col, split
spark_df = spark_df.withColumn("star_array", split(col("stars"), " "))
spark_df.show()

+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+-----+--------------------+--------------------+--------------------+----------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|         business_id|             user_id|length|class|    body_text_nostop|   body_text_stemmed|body_text_lemmatized|star_array|
+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+-----+--------------------+--------------------+--------------------+----------+
|-7yxrdY13ay15rGB7...|I have been going...|    5|   0|     0|    0| 2010-01-16|Lh9nz0KYyzE-YRbKu...|ayKW9eWwGFcrtJaHc...|   670|    1|[going, arizona, ...|[go, arizona, aut...|[going, arizona, ...|       [5]|
|-Be0UUGYuiDJVAM_Y...|Since Im big into...|    4|   0|     2|    2| 2011-01-25|pa6K7DGByxBXxcVJ5...|_4lqpCYCqOQzbB6xQ...|  1348|    1|[since, im, big, ...|[sinc, im

In [59]:
# Initialize a CountVectorizer (this import shadows the scikit-learn CountVectorizer imported earlier)
from pyspark.ml.feature import CountVectorizer
star_vectorizer = CountVectorizer(inputCol="star_array", outputCol="stars_one_hot", vocabSize=5, minDF=1.0)

In [60]:
# Create a vector model
star_vector_model = star_vectorizer.fit(spark_df)

In [61]:
# One hot encoded column
df_ohe = star_vector_model.transform(spark_df)
df_ohe.show(3)

+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+-----+--------------------+--------------------+--------------------+----------+-------------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|         business_id|             user_id|length|class|    body_text_nostop|   body_text_stemmed|body_text_lemmatized|star_array|stars_one_hot|
+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+-----+--------------------+--------------------+--------------------+----------+-------------+
|-7yxrdY13ay15rGB7...|I have been going...|    5|   0|     0|    0| 2010-01-16|Lh9nz0KYyzE-YRbKu...|ayKW9eWwGFcrtJaHc...|   670|    1|[going, arizona, ...|[go, arizona, aut...|[going, arizona, ...|       [5]|(5,[1],[1.0])|
|-Be0UUGYuiDJVAM_Y...|Since Im big into...|    4|   0|     2|    2| 2011-01-25|pa6K7DGByxBXxcVJ5...|_4lqpCYC
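
The `stars_one_hot` values such as `(5,[1],[1.0])` are sparse vectors printed as (size, indices, values). Spark's `CountVectorizer` assigns vocabulary indices by descending term frequency, so the most common rating gets index 0. The sketch below imitates that encoding in plain Python; the sample ratings are invented, and tie-breaking between equally frequent values may differ from Spark.

```python
from collections import Counter


def fit_vocabulary(values):
    # Most frequent value gets index 0, mirroring frequency-ordered vocabularies.
    return {value: i for i, (value, _) in enumerate(Counter(values).most_common())}


def one_hot_sparse(value, vocabulary):
    # Returns (size, indices, values), the triple shown for a sparse vector.
    return (len(vocabulary), [vocabulary[value]], [1.0])


stars = ["4", "5", "4", "2", "5", "4", "1", "3"]  # hypothetical ratings
vocab = fit_vocabulary(stars)
encoded = one_hot_sparse("5", vocab)  # (5, [1], [1.0])
```

With these toy ratings, "4" is the most frequent value and maps to index 0, while "5" maps to index 1.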

In [62]:
# Define the TF-IDF stages: hash the stemmed tokens into term-frequency vectors, then rescale by inverse document frequency
hashingTF = HashingTF(inputCol="body_text_stemmed", outputCol='hash_token')
idf = IDF(inputCol='hash_token', outputCol='idf_token')
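
As a rough illustration of what these two stages compute, the sketch below hashes tokens into a fixed number of buckets and applies the smoothed IDF formula documented for Spark, `log((m + 1) / (df + 1))`, where `m` is the number of documents. `HashingTF` defaults to 2^18 = 262,144 buckets, which together with the single `length` column accounts for the 262,145-dimensional `features` vectors shown later. The hash function here (`zlib.crc32`) is a stand-in chosen for the example; Spark uses MurmurHash3, so bucket numbers will not match.

```python
import math
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=1 << 18):
    # Map each token to a bucket and count occurrences per bucket.
    # zlib.crc32 is a stand-in hash; Spark itself uses MurmurHash3.
    counts = Counter(zlib.crc32(tok.encode("utf-8")) % num_features for tok in tokens)
    return dict(counts)


def idf_weights(tf_vectors):
    # Smoothed IDF: log((m + 1) / (df + 1)), with m the number of documents
    # and df the number of documents in which the bucket is non-zero.
    m = len(tf_vectors)
    doc_freq = Counter(bucket for vec in tf_vectors for bucket in vec)
    return {bucket: math.log((m + 1) / (df + 1)) for bucket, df in doc_freq.items()}


def tf_idf(tf_vector, idf):
    # Scale each term frequency by the IDF weight of its bucket.
    return {bucket: tf * idf[bucket] for bucket, tf in tf_vector.items()}


docs = [["go", "arizona", "auto", "care"], ["love", "place"], ["love", "auto"]]
tf_vectors = [hashing_tf(doc) for doc in docs]
idf = idf_weights(tf_vectors)
weighted = [tf_idf(vec, idf) for vec in tf_vectors]
```

Because distinct tokens can hash to the same bucket, the representation is lossy; a larger `num_features` reduces collisions at the cost of a wider vector.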

In [63]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vector
# Create feature vector 
clean_up = VectorAssembler(inputCols=['idf_token', 'length'], outputCol='features')

In [64]:
# Create and run a data processing Pipeline
from pyspark.ml import Pipeline
data_prep_pipeline = Pipeline(stages=[hashingTF, idf, clean_up])

In [65]:
# Fit and transform the pipeline
cleaner = data_prep_pipeline.fit(df_ohe)
cleaned = cleaner.transform(df_ohe)

In [66]:
cleaned.show(5)

+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+-----+--------------------+--------------------+--------------------+----------+-------------+--------------------+--------------------+--------------------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|         business_id|             user_id|length|class|    body_text_nostop|   body_text_stemmed|body_text_lemmatized|star_array|stars_one_hot|          hash_token|           idf_token|            features|
+--------------------+--------------------+-----+----+------+-----+-----------+--------------------+--------------------+------+-----+--------------------+--------------------+--------------------+----------+-------------+--------------------+--------------------+--------------------+
|-7yxrdY13ay15rGB7...|I have been going...|    5|   0|     0|    0| 2010-01-16|Lh9nz0KYyzE-YRbKu...|ayKW9eWwGFcrtJaHc...|   670|    1|[going, 

<br>**Pipeline**<br><br>

In [51]:
# Import functions
# from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF, StringIndexer

In [52]:
# Make stars values a list
# from pyspark.sql.functions import col, split
# spark_df = spark_df.withColumn("star_array", split(col("stars"), " "))
# spark_df.show()

+--------------------+--------------------+-----+----+------+-----+-----------+-----------+--------------------+--------------------+------+-----+----------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|review_type|         business_id|             user_id|length|class|star_array|
+--------------------+--------------------+-----+----+------+-----+-----------+-----------+--------------------+--------------------+------+-----+----------+
|-7yxrdY13ay15rGB7...|I have been going...|    5|   0|     0|    0| 2010-01-16|     review|Lh9nz0KYyzE-YRbKu...|ayKW9eWwGFcrtJaHc...|   670|    1|       [5]|
|-Be0UUGYuiDJVAM_Y...|Since Im big into...|    4|   0|     2|    2| 2011-01-25|     review|pa6K7DGByxBXxcVJ5...|_4lqpCYCqOQzbB6xQ...|  1348|    1|       [4]|
|-nQHHXi-d_yuW301_...|A pleasant place ...|    2|   0|     0|    0| 2011-01-12|     review|GIGI8bJfN6HyPzmEW...|4QORbyhfN01oKR_Gg...|   813|    0|       [2]|
|2L30O7G8IQ6HILpR0...|part of a social ...|    5|   

In [53]:
# Initialize a CountVectorizer
# from pyspark.ml.feature import CountVectorizer
# star_vectorizer = CountVectorizer(inputCol="star_array", outputCol="stars_one_hot", vocabSize=5, minDF=1.0)

In [54]:
# Create a vector model
# star_vector_model = star_vectorizer.fit(spark_df)

In [55]:
# # One hot encoded column
# df_ohe = star_vector_model.transform(spark_df)
# df_ohe.show(3)

+--------------------+--------------------+-----+----+------+-----+-----------+-----------+--------------------+--------------------+------+-----+----------+-------------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|review_type|         business_id|             user_id|length|class|star_array|stars_one_hot|
+--------------------+--------------------+-----+----+------+-----+-----------+-----------+--------------------+--------------------+------+-----+----------+-------------+
|-7yxrdY13ay15rGB7...|I have been going...|    5|   0|     0|    0| 2010-01-16|     review|Lh9nz0KYyzE-YRbKu...|ayKW9eWwGFcrtJaHc...|   670|    1|       [5]|(5,[1],[1.0])|
|-Be0UUGYuiDJVAM_Y...|Since Im big into...|    4|   0|     2|    2| 2011-01-25|     review|pa6K7DGByxBXxcVJ5...|_4lqpCYCqOQzbB6xQ...|  1348|    1|       [4]|(5,[0],[1.0])|
|-nQHHXi-d_yuW301_...|A pleasant place ...|    2|   0|     0|    0| 2011-01-12|     review|GIGI8bJfN6HyPzmEW...|4QORbyhfN01oKR_Gg...|   813|

In [56]:
# Define the feature stages: tokenize, remove stop words, hash to term frequencies, then apply IDF
#star_rating = StringIndexer(inputCol='stars_one_hot',outputCol='label')
# tokenizer = Tokenizer(inputCol="review_text", outputCol="token_text")
# stopremove = StopWordsRemover(inputCol='token_text',outputCol='stop_tokens')
# hashingTF = HashingTF(inputCol="stop_tokens", outputCol='hash_token')
# idf = IDF(inputCol='hash_token', outputCol='idf_token')

In [57]:
# from pyspark.ml.feature import VectorAssembler
# from pyspark.ml.linalg import Vector
# # Create feature vector 
# clean_up = VectorAssembler(inputCols=['idf_token', 'length'], outputCol='features')

In [58]:
# Create and run a data processing Pipeline
# from pyspark.ml import Pipeline
# data_prep_pipeline = Pipeline(stages=[tokenizer, stopremove, hashingTF, idf, clean_up])

In [59]:
# Fit and transform the pipeline
# cleaner = data_prep_pipeline.fit(df_ohe)
# cleaned = cleaner.transform(df_ohe)

In [60]:
# cleaned.show()

+--------------------+--------------------+-----+----+------+-----+-----------+-----------+--------------------+--------------------+------+-----+----------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|           review_id|         review_text|stars|cool|useful|funny|review_date|review_type|         business_id|             user_id|length|class|star_array|stars_one_hot|          token_text|         stop_tokens|          hash_token|           idf_token|            features|
+--------------------+--------------------+-----+----+------+-----+-----------+-----------+--------------------+--------------------+------+-----+----------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|-7yxrdY13ay15rGB7...|I have been going...|    5|   0|     0|    0| 2010-01-16|     review|Lh9nz0KYyzE-YRbKu...|ayKW9eWwGFcrtJaHc...|   670|    1|       [5]|(5,[1],[1.0]

In [61]:
# Drop redundant column
# x=cleaned.drop('review_type')

# **Naive Bayes**

In [67]:
# Keep only the feature vector and the class column; drop the intermediate columns
x=cleaned.select('features', 'class')
x.show(5)

+--------------------+-----+
|            features|class|
+--------------------+-----+
|(262145,[7327,133...|    1|
|(262145,[4200,538...|    1|
|(262145,[4106,791...|    0|
|(262145,[8804,130...|    1|
|(262145,[535,1566...|    1|
+--------------------+-----+
only showing top 5 rows



In [68]:
# Rename class to label
x = x.withColumnRenamed('class', 'label')
x.show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(262145,[7327,133...|    1|
|(262145,[4200,538...|    1|
|(262145,[4106,791...|    0|
|(262145,[8804,130...|    1|
|(262145,[535,1566...|    1|
+--------------------+-----+
only showing top 5 rows



In [63]:
#import pandas as pd
# data_df = x.select('*').toPandas()
# x = spark.createDataFrame(data_df)

KeyboardInterrupt: ignored

In [69]:
# Split the data into a training set (70%) and a testing set (30%), using a fixed seed for reproducibility
training, testing = x.randomSplit([0.7, 0.3], seed=21)

**Naive Bayes**

In [70]:
from pyspark.ml.classification import NaiveBayes
# Create a Naive Bayes model and fit training data
nb = NaiveBayes()
predictor = nb.fit(training)

In [71]:
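# Score the predictions with MulticlassClassificationEvaluator.
# Its default metricName is "f1" (weighted F1), so the value printed below is a weighted F1 score;
# pass metricName="accuracy" to report accuracy instead.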
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(predictor.transform(testing))
print("Accuracy of model at predicting reviews was: %f" % acc)

Accuracy of model at predicting reviews was: 0.732877
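
Since `MulticlassClassificationEvaluator` defaults to the weighted F1 metric, the figure above is not plain accuracy. The sketch below computes both metrics in pure Python on a small made-up example to show how they can differ; it is an illustration of the definitions, not Spark's implementation.

```python
from collections import Counter


def accuracy(labels, predictions):
    # Fraction of predictions that match the true label.
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def weighted_f1(labels, predictions):
    # Per-class F1, averaged with weights proportional to each class's true count.
    support = Counter(labels)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, predictions) if y != cls and p == cls)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n / len(labels)
    return total


labels = [1, 1, 1, 1, 0, 0]        # hypothetical true classes
predictions = [1, 1, 1, 1, 1, 0]   # hypothetical model output
acc = accuracy(labels, predictions)    # 5/6
f1 = weighted_f1(labels, predictions)  # 22/27
```

In PySpark, accuracy can be requested explicitly with `MulticlassClassificationEvaluator(metricName="accuracy")`.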


**Logistic Regression**

In [72]:
from pyspark.ml.classification import LogisticRegression
# Create a logistic regression model and fit it to the training data
lg = LogisticRegression()
predictor = lg.fit(training)

In [73]:
# Evaluate with the same default metric as for Naive Bayes (weighted F1 unless metricName is set)
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(predictor.transform(testing))
print("Accuracy of model at predicting reviews was: %f" % acc)

Accuracy of model at predicting reviews was: 0.772261


In [None]:
# Authenticate user
from google.colab import auth
auth.authenticate_user()

In [None]:
# Set project id
project_id = 'xy-yelp'

In [None]:
# Set project
!gcloud config set project {project_id}

Updated property [core/project].


In [None]:
# File path to save json file to
filepath = '/tmp/ml_j.json'

In [None]:
# Save as JSON. By default Spark raises an error if the path already exists, so overwrite mode keeps this cell re-runnable
x.coalesce(1).write.format('json').mode('overwrite').save(filepath)

In [None]:
# Bucket name
bucket_name = 'xy-bucket'

In [None]:
# Copy saved file from /tmp to bucket
!gsutil cp -r /tmp/ml_j.json/ gs://{bucket_name}/json_files/

Copying file:///tmp/ml_j.json/.part-00000-f8a8f21c-0ebe-434c-a8b3-a5c8988dd298-c000.json.crc [Content-Type=application/octet-stream]...
Copying file:///tmp/ml_j.json/._SUCCESS.crc [Content-Type=application/octet-stream]...
Copying file:///tmp/ml_j.json/_SUCCESS [Content-Type=application/octet-stream]...
Copying file:///tmp/ml_j.json/part-00000-f8a8f21c-0ebe-434c-a8b3-a5c8988dd298-c000.json [Content-Type=application/json]...
Operation completed over 4 objects/63.0 MiB.                                     
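
As the copy log shows, `/tmp/ml_j.json` is a directory containing a `part-*.json` file plus marker files, not a single JSON document. Spark writes one JSON object per line, so a downstream consumer could load the records roughly as sketched below. `read_spark_json_dir` is a hypothetical helper written for this example.

```python
import glob
import json
import os


def read_spark_json_dir(path):
    """Load every record from the part-*.json files in a Spark output directory."""
    records = []
    for part in sorted(glob.glob(os.path.join(path, "part-*.json"))):
        with open(part, encoding="utf-8") as handle:
            # Each non-empty line is an independent JSON object (JSON Lines).
            records.extend(json.loads(line) for line in handle if line.strip())
    return records
```

The `_SUCCESS` and `.crc` files are skipped because they do not match the `part-*.json` pattern.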
