<a href="https://colab.research.google.com/github/karenbennis/Xy/blob/mess_management/ml_model_1_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Connect to Database**

In [43]:
# Install Java, Spark, and Findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www-us.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
!tar xf spark-2.4.6-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.6-bin-hadoop2.7"

# Locate the Spark installation so that pyspark can be imported
import findspark
findspark.init()

# Download the PostgreSQL JDBC driver
!wget https://jdbc.postgresql.org/download/postgresql-42.2.9.jar

# Start a Spark session (the application name is set by appName()); included in every Colab notebook
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("database_transformation")
    .config("spark.driver.memory", "5g")
    .config("spark.driver.extraClassPath", "/content/postgresql-42.2.9.jar")
    .getOrCreate()
)


--2020-07-31 03:03:24--  https://jdbc.postgresql.org/download/postgresql-42.2.9.jar
Resolving jdbc.postgresql.org (jdbc.postgresql.org)... 72.32.157.228, 2001:4800:3e1:1::228
Connecting to jdbc.postgresql.org (jdbc.postgresql.org)|72.32.157.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 914037 (893K) [application/java-archive]
Saving to: ‘postgresql-42.2.9.jar.1’


2020-07-31 03:03:25 (4.56 MB/s) - ‘postgresql-42.2.9.jar.1’ saved [914037/914037]



In [44]:
# gcloud login and check the DB
!gcloud auth login
!gcloud config set project 'xy-yelp'
!gcloud sql instances describe 'xy-yelp'

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&code_challenge=_KODDG4-sXbhqkRJw9tvJV9Mth4QS5KyGQUYY1RWRpY&code_challenge_method=S256&access_type=offline&response_type=code&prompt=select_account


Enter verification code: 4/2gH5aLzW3YHQNfbaZ0UWID-Ajq-lhb9vGOIzudVIV4rEEW3-bLDkikI

You are now logged in as [jasmeersangha@gmail.com].
Your current project is [xy-yelp].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID
Updated property [core/project].
backendType: SECOND_GEN
connectionName: xy-yelp:northamerica-northeast1:xy-yelp
datab

In [45]:
# Download and start the Cloud SQL proxy
!wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy
!chmod +x cloud_sql_proxy
# "connectionName" comes from the output of the previous cell
!nohup ./cloud_sql_proxy -instances="xy-yelp:northamerica-northeast1:xy-yelp"=tcp:5432 &
!sleep 30s

--2020-07-31 03:03:46--  https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64
Resolving dl.google.com (dl.google.com)... 173.194.216.93, 173.194.216.136, 173.194.216.190, ...
Connecting to dl.google.com (dl.google.com)|173.194.216.93|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14492253 (14M) [application/octet-stream]
Saving to: ‘cloud_sql_proxy’


2020-07-31 03:03:46 (250 MB/s) - ‘cloud_sql_proxy’ saved [14492253/14492253]

nohup: appending output to 'nohup.out'
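
The fixed `sleep 30s` above waits the full 30 seconds whether or not the proxy is ready. A minimal sketch of an alternative, assuming the proxy listens on `127.0.0.1:5432` as configured above, polls the port until it accepts a connection. `wait_for_port` is a hypothetical helper, not part of the Cloud SQL proxy:

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not ready yet: wait briefly and retry
            time.sleep(interval)
    return False
```

Calling `wait_for_port("127.0.0.1", 5432)` after launching the proxy would return as soon as the port opens, and returns `False` if it never does within the timeout.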


In [46]:
db_password = 'kjhbyelpdb'

In [47]:
# Configure JDBC settings for the Cloud SQL database (reached through the local proxy)
mode = "append"
jdbc_url = "jdbc:postgresql://127.0.0.1:5432/xy_yelp_db"
config = {"user": "postgres",
          "password": db_password,
          "driver": "org.postgresql.Driver"}
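
Hard-coding the password in a notebook cell exposes it to anyone who can read the file. One alternative, sketched here under the assumption that the password is exported in an environment variable named `XY_DB_PASSWORD` (a hypothetical name), falls back to an interactive prompt when the variable is unset. Both helper functions are illustrative and not part of the original notebook:

```python
import os
from getpass import getpass


def load_db_password(env_var="XY_DB_PASSWORD"):
    """Read the password from the environment, prompting only if it is unset."""
    password = os.environ.get(env_var)
    if not password:
        password = getpass("Database password: ")
    return password


def build_jdbc_config(password, user="postgres"):
    """Assemble the same properties dict the notebook passes to spark.read.jdbc."""
    return {"user": user, "password": password, "driver": "org.postgresql.Driver"}
```

With these helpers, `config = build_jdbc_config(load_db_password())` would produce the same dictionary without the secret appearing in the notebook.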

## **Extract tables**

In [48]:
# Read the review table from the database
review_df2 = spark.read.jdbc(url=jdbc_url, table='review_two', properties=config)
# review_df2.show(5)

In [50]:
import pandas as pd
import nltk
from nltk.stem import PorterStemmer
import string
import re

#PySpark ml imports
from pyspark.ml.feature import HashingTF, IDF, StringIndexer
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vector
from pyspark.ml import Pipeline
from pyspark.sql.functions import col
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


In [51]:
# Convert to a pandas df
pandas_df = review_df2.toPandas()
# pandas_df.head()

## **Transformation**

In [52]:
# Function to remove Punctuation
def remove_punct(text):
  text_nopunct = ''.join([char for char in text if char not in string.punctuation])
  return text_nopunct

# Function to tokenize words (raw string avoids an invalid-escape warning for \W)
def tokenize(text):
  tokens = re.split(r'\W+', text)
  return tokens

nltk.download('stopwords')
# Use a set for constant-time membership tests
stopword = set(nltk.corpus.stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(tokenized_list):
  text = [word for word in tokenized_list if word not in stopword]
  return text

# Create a stemmer instance
ps = PorterStemmer()

# Function for stemming
def stemming(tokenized_text):
  text = [ps.stem(word) for word in tokenized_text]
  return text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
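
To see what the cleaning steps above do to a single review, the following sketch chains punctuation removal, lowercasing, tokenization, and stopword removal on one sentence. It is an illustration under stated assumptions: `STOPWORDS` is a tiny stand-in for the nltk list, stemming is omitted, and, unlike the notebook functions, empty tokens are dropped. `clean` is a hypothetical helper name:

```python
import re
import string

# Tiny stand-in for nltk's English stopword list (assumption, for illustration only)
STOPWORDS = {"do", "it", "you", "will"}


def clean(text):
    """Strip punctuation, lowercase, tokenize, then drop stopwords and empty tokens."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = re.split(r"\W+", no_punct.lower())
    return [tok for tok in tokens if tok and tok not in STOPWORDS]


print(clean("DON'T DO IT! You will regret it."))  # ['dont', 'regret']
```

Because punctuation is stripped before the stopword check, a contraction such as "DON'T" becomes `dont` and passes through a filter that only lists the apostrophe form.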


In [53]:
# Run NLP functions
pandas_df['body_text_clean'] = pandas_df['review_text'].apply(remove_punct)
pandas_df['body_text_tokenized'] = pandas_df['body_text_clean'].apply(lambda x: tokenize(x.lower()))
pandas_df['body_text_nostop'] = pandas_df['body_text_tokenized'].apply(remove_stopwords)
pandas_df['body_text_stemmed'] = pandas_df['body_text_nostop'].apply(stemming)

# Add a length column to DataFrame
pandas_df['length'] = pandas_df['review_text'].apply(len)

# pandas_df.head()

## **Pipeline**


In [54]:
# Select columns for new DataFrame
pandas_df_useful = pandas_df[['review_id', 'review_text', 'stars',  'length', 'body_text_stemmed']]

# Convert the pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df_useful)
spark_df.show(5)

+--------------------+--------------------+-----+------+--------------------+
|           review_id|         review_text|stars|length|   body_text_stemmed|
+--------------------+--------------------+-----+------+--------------------+
|K8avYPWsh45v7VoZg...|another pie place...|    3|   256|[anoth, pie, plac...|
|BkiZn5XSzAv9q7J7_...|I came with my si...|    4|   480|[came, sister, ex...|
|L6kc7Nr7hWiqo7ZvW...|I am very disappo...|    1|   236|[disappoint, braz...|
|y35xKzutHXT985mUp...|Stopped for lunch...|    4|   265|[stop, lunch, wif...|
|UqQGtBDEfkYMLV-Fy...|DON'T DO IT!  You...|    1|  3123|[dont, your, vega...|
+--------------------+--------------------+-----+------+--------------------+
only showing top 5 rows



In [55]:
# Create the feature transformers for the data set
star_rating = StringIndexer(inputCol='stars', outputCol='label')
hashingTF = HashingTF(inputCol='body_text_stemmed', outputCol='hash_token')
idf = IDF(inputCol='hash_token', outputCol='idf_token')
clean_up = VectorAssembler(inputCols=['idf_token', 'length'], outputCol='features')
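
`HashingTF` maps each token to one of a fixed number of buckets (262144 by default) and counts how many tokens land in each; adding the `length` column then gives the 262145-entry feature vector. The idea can be sketched in plain Python as below. This is an illustration only: Spark uses MurmurHash3 internally, while this sketch substitutes MD5, so the bucket indices will not match Spark's. `hashing_tf` is a hypothetical function:

```python
import hashlib
from collections import Counter


def hashing_tf(tokens, num_features=16):
    """Map tokens to a sparse {bucket_index: count} dict via the hashing trick."""
    counts = Counter()
    for tok in tokens:
        # Hash the token and fold the result into the available buckets
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_features
        counts[bucket] += 1
    return dict(counts)


vector = hashing_tf(["pie", "place", "pie"])
print(vector)
```

No vocabulary needs to be stored, at the cost of occasional collisions when two different tokens share a bucket.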

In [56]:
# Create a data processing Pipeline
data_prep_pipeline = Pipeline(stages=[star_rating, hashingTF, idf, clean_up])

In [57]:
# Fit and transform the pipeline
cleaner = data_prep_pipeline.fit(spark_df)
cleaned = cleaner.transform(spark_df)

# cleaned.show(5)

## **Machine Learning Models**

In [58]:
# Keep only the feature and label columns
x = cleaned.select('features', 'label')
x.show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(262145,[22567,27...|  2.0|
|(262145,[427,1253...|  0.0|
|(262145,[17893,28...|  3.0|
|(262145,[31463,65...|  0.0|
|(262145,[2396,392...|  3.0|
+--------------------+-----+
only showing top 5 rows



**Naive Bayes**

In [59]:
# Break data down into a training set and a testing set
training, testing = x.randomSplit([0.8, 0.2], seed=21)

In [60]:
# Create a Naive Bayes model and fit training data
nb = NaiveBayes()
predictor = nb.fit(training)

In [64]:
# Evaluate on the test set. MulticlassClassificationEvaluator defaults to
# metricName="f1", so the value reported here is a weighted F1 score.
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(predictor.transform(testing))
print("F1 score of model at predicting reviews was: %f" % acc)

F1 score of model at predicting reviews was: 0.391433
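
The evaluator above is created with default settings, and in PySpark the default `metricName` is `"f1"` (a support-weighted F1 score); passing `metricName="accuracy"` is needed to get plain accuracy. The two metrics generally differ, as this small pure-Python sketch on made-up labels shows. The labels and both function names are illustrative assumptions, not data from the notebook:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 scores averaged with weights equal to each class's support."""
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += y_true.count(label) * f1
    return total / len(y_true)


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
print(f"{accuracy(y_true, y_pred):.3f}")     # 0.667
print(f"{weighted_f1(y_true, y_pred):.3f}")  # 0.600
```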


**Logistic Regression**

In [65]:
# Create a Logistic Regression model and fit training data
lg = LogisticRegression()
predictor = lg.fit(training)

In [66]:
# The default metric is a weighted F1 score (metricName="f1")
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(predictor.transform(testing))
print("F1 score of model at predicting reviews was: %f" % acc)

F1 score of model at predicting reviews was: 0.453939


**Multilayer Perceptron**

In [37]:
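# Specify layers for the neural network: input, hidden, output.
# Note: the input layer must match the feature vector size (262145 in the
# output above) and the output layer the number of label classes, so these
# values would need adjusting before training.
layers = [62000, 256, 3]

# Create the trainer and set its parameters
trainer = MultilayerPerceptronClassifier(maxIter=10, layers=layers, blockSize=128, seed=1234)

# Train the model
# model = trainer.fit(training)

# The input layer proved too large for our machines to handle, so training is left commented out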
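
A rough count of trainable parameters helps explain why the input layer was too large. The sketch below counts weights and biases for a fully connected network from its layer sizes. It assumes an input size of 262145 (the feature vector size shown in the `features` column above) and 5 output classes (one per star rating, an assumption); `mlp_weight_count` is a hypothetical helper:

```python
def mlp_weight_count(layers):
    """Count weights and biases of a fully connected network with these layer sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


# 262145 inputs (hashed features plus length), 256 hidden units, 5 classes (assumed)
full = mlp_weight_count([262145, 256, 5])
print(full)                    # 67110661
print(round(full * 8 / 1e6))   # approximate size in MB at 8 bytes per weight: 537

# A smaller hash space shrinks the first layer dramatically
small = mlp_weight_count([4096 + 1, 256, 5])
print(small)                   # 1050373
```

One way to get the smaller input, under these assumptions, would be to pass a lower `numFeatures` (for example 4096) to `HashingTF`, at the cost of more hash collisions.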