# Part 2 Text Representation

#### This notebook builds several vector representations of the tokens from the previously processed ``review`` data.

We transform tokens into vectors with the following methods:

- #### Bag of Words (BOW)
    - represent tokens as BOW count vectors
    - calculate tf-idf on the BOW vectors
- #### Bigrams
    - generate bigram tokens
    - calculate tf-idf on the bigrams
- #### Word Embeddings

<br></br>

In [1]:
import numpy as np
import pyspark
from pyspark.sql import SparkSession

from pyspark.ml.feature import CountVectorizer, HashingTF, IDF, NGram, Word2Vec
from pyspark.ml import Pipeline

SparkSession available as 'spark'.

#### Load preprocessed tokens

In [1]:
clean_tokens = spark.read.parquet("s3://dse230-project-data1/final_token.parquet")


In [4]:
clean_tokens.printSchema()


root
 |-- unigrams: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- is_helpful: integer (nullable = true)

<br></br>

## Bag of Words

#### Create BOW representation

In [4]:
# define & fit `bow` transformer
countVectorizer = CountVectorizer(inputCol="unigrams", outputCol="bow", vocabSize=10000, minDF=5)
bow_transformer = countVectorizer.fit(clean_tokens)

# transform tokens to `bow`
df = bow_transformer.transform(clean_tokens)
df.printSchema()
df.show()


root
 |-- unigrams: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- is_helpful: integer (nullable = true)
 |-- bow: vector (nullable = true)

+--------------------+----------+--------------------+
|            unigrams|is_helpful|                 bow|
+--------------------+----------+--------------------+
|[stay, king, suit...|         0|(10000,[0,1,2,3,4...|
|[everi, visit, ny...|         0|(10000,[0,1,2,3,4...|
|[great, properti,...|         0|(10000,[1,2,3,8,1...|
|[andaz, nice, hot...|         0|(10000,[0,1,2,5,8...|
|[stay, andaz, pro...|         0|(10000,[0,1,2,3,6...|
|[excel, staff, re...|         0|(10000,[0,1,2,4,5...|
|[stay, setai, nig...|         0|(10000,[0,1,2,6,1...|
|[husband, stay, c...|         0|(10000,[0,1,2,3,5...|
|[wonder, boutiqu,...|         0|(10000,[0,5,7,15,...|
|[hotel, nice, sta...|         0|(10000,[0,1,2,3,4...|
|[ive, stay, star,...|         0|(10000,[0,1,2,4,5...|
|[stay, hotel, fou...|         0|(10000,[0,1,2,4,5...|
|[ho
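
The `bow` entries printed above, such as `(10000,[0,1,2,3,4...`, are sparse vectors: the vector size, then the indices of the non-zero terms, then their counts. The plain-Python sketch below mimics that on a toy corpus, assuming the vocabulary is ordered by corpus frequency and filtered by a minimum document frequency, as `vocabSize` and `minDF` do. The helper names `fit_vocabulary` and `to_sparse_bow` are hypothetical, and the alphabetical tie-break is only there to make the toy deterministic.

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size, min_df):
    """Return terms ordered by corpus frequency, keeping those in >= min_df docs."""
    term_counts = Counter()  # total occurrences across the corpus
    doc_counts = Counter()   # number of documents containing each term
    for tokens in docs:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    kept = [t for t in term_counts if doc_counts[t] >= min_df]
    # Most frequent first; ties broken alphabetically (toy assumption).
    kept.sort(key=lambda t: (-term_counts[t], t))
    return kept[:vocab_size]

def to_sparse_bow(tokens, vocabulary):
    """Return (size, indices, values), the same layout as the printed vectors."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[t] for t in tokens if t in index)
    indices = sorted(counts)
    return len(vocabulary), indices, [float(counts[i]) for i in indices]

docs = [
    ["stay", "hotel", "nice", "hotel"],
    ["hotel", "staff", "nice"],
    ["stay", "staff", "rude"],
]
vocab = fit_vocabulary(docs, vocab_size=4, min_df=2)
bow = to_sparse_bow(docs[0], vocab)
print(vocab)  # ['hotel', 'nice', 'staff', 'stay']
print(bow)    # (4, [0, 1, 3], [2.0, 1.0, 1.0])
```

Here `rude` is dropped because it appears in only one document, and index 0 goes to the most frequent term.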

#### Calculate tf-idf for BOW
Note: ``CountVectorizer`` already produces term-frequency vectors, the role ``HashingTF`` would otherwise play, so we compute the ``idf`` term directly on the ``bow`` column from the previous step.
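
As a worked illustration of the weighting, Spark's documentation gives the inverse document frequency as `idf(t) = log((|D| + 1) / (DF(t) + 1))`, and tf-idf as the term count times that value. The sketch below applies the formula in plain Python. The function names are hypothetical, and the `min_doc_freq` handling (an idf of 0 for terms below the threshold) reflects our reading of the `minDocFreq` parameter.

```python
import math

def spark_style_idf(num_docs, doc_freq, min_doc_freq=0):
    """Smoothed idf; terms seen in fewer than min_doc_freq documents get 0."""
    if doc_freq < min_doc_freq:
        return 0.0
    return math.log((num_docs + 1) / (doc_freq + 1))

def tfidf(term_counts, doc_freqs, num_docs, min_doc_freq=0):
    """Scale each raw term count by the idf of that term."""
    return {
        term: tf * spark_style_idf(num_docs, doc_freqs[term], min_doc_freq)
        for term, tf in term_counts.items()
    }

# Toy corpus of 3 documents: "hotel" is in all of them, "staff" in one.
doc_freqs = {"hotel": 3, "staff": 1}
weights = tfidf({"hotel": 2, "staff": 1}, doc_freqs, num_docs=3)
print(weights)
```

A term that occurs in every document gets a weight of 0 (`log(4/4)`), while the rarer term keeps a positive weight (`log(4/2)`).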

In [5]:
# define idf transformer
idf = IDF(inputCol="bow", outputCol="bow_tfidf", minDocFreq=5) # minDocFreq: terms in fewer than 5 documents get an idf of 0
idf_transformer = idf.fit(df)

# transform tfidf
df = idf_transformer.transform(df)


In [6]:
df.printSchema()
df.show()


root
 |-- unigrams: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- is_helpful: integer (nullable = true)
 |-- bow: vector (nullable = true)
 |-- bow_tfidf: vector (nullable = true)

+--------------------+----------+--------------------+--------------------+
|            unigrams|is_helpful|                 bow|           bow_tfidf|
+--------------------+----------+--------------------+--------------------+
|[stay, king, suit...|         0|(10000,[0,1,2,3,4...|(10000,[0,1,2,3,4...|
|[everi, visit, ny...|         0|(10000,[0,1,2,3,4...|(10000,[0,1,2,3,4...|
|[great, properti,...|         0|(10000,[1,2,3,8,1...|(10000,[1,2,3,8,1...|
|[andaz, nice, hot...|         0|(10000,[0,1,2,5,8...|(10000,[0,1,2,5,8...|
|[stay, andaz, pro...|         0|(10000,[0,1,2,3,6...|(10000,[0,1,2,3,6...|
|[excel, staff, re...|         0|(10000,[0,1,2,4,5...|(10000,[0,1,2,4,5...|
|[stay, setai, nig...|         0|(10000,[0,1,2,6,1...|(10000,[0,1,2,6,1...|
|[husband, stay, c...|      

<br></br>

## Bigrams
- First, generate bigram representation
- Then calculate bigram tfidf

#### Create bigram vectors

In [6]:
# define NGram transformer
ngram = NGram(n=2, inputCol="unigrams", outputCol="bigrams")

# add a `bigrams` column to `df` using the NGram transformer
df = ngram.transform(df)

# check result
df.printSchema()
df.show()


root
 |-- unigrams: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- is_helpful: integer (nullable = true)
 |-- bow: vector (nullable = true)
 |-- bow_tfidf: vector (nullable = true)
 |-- bigrams: array (nullable = true)
 |    |-- element: string (containsNull = false)

+--------------------+----------+--------------------+--------------------+--------------------+
|            unigrams|is_helpful|                 bow|           bow_tfidf|             bigrams|
+--------------------+----------+--------------------+--------------------+--------------------+
|[stay, king, suit...|         0|(10000,[0,1,2,3,4...|(10000,[0,1,2,3,4...|[stay king, king ...|
|[everi, visit, ny...|         0|(10000,[0,1,2,3,4...|(10000,[0,1,2,3,4...|[everi visit, vis...|
|[great, properti,...|         0|(10000,[1,2,3,8,1...|(10000,[1,2,3,8,1...|[great properti, ...|
|[andaz, nice, hot...|         0|(10000,[0,1,2,5,8...|(10000,[0,1,2,5,8...|[andaz nice, nice...|
|[stay, andaz, pro...|
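
The `bigrams` column above pairs each token with its successor, joined by a space. A minimal plain-Python sketch of that sliding window follows. The helper name `ngrams` is hypothetical, and the empty result for inputs shorter than `n` matches our understanding of how `NGram` behaves.

```python
def ngrams(tokens, n=2):
    """Join every run of n consecutive tokens with a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["stay", "king", "suit", "night"])
print(bigrams)           # ['stay king', 'king suit', 'suit night']
print(ngrams(["stay"]))  # [] (fewer than n tokens, so nothing to pair)
```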

#### Calculate tf-idf for bigram

In [7]:
hashingTF = HashingTF(inputCol="bigrams", outputCol="bigram_tf", numFeatures=10000)
df = hashingTF.transform(df)


In [8]:
bigram_idf = IDF(inputCol="bigram_tf", outputCol="bigram_tfidf")
bigram_idf_transformer = bigram_idf.fit(df)
df = bigram_idf_transformer.transform(df)
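
Unlike `CountVectorizer`, `HashingTF` keeps no vocabulary: each bigram is hashed straight to a bucket in `[0, numFeatures)`, so different bigrams can collide in the same bucket. The sketch below shows the idea in plain Python. It uses `zlib.crc32` as a stand-in hash (Spark uses MurmurHash3, so the indices will not match Spark's), and the helper name `hashing_tf` is hypothetical.

```python
import zlib
from collections import Counter

def hashing_tf(terms, num_features):
    """Count terms per bucket, where bucket = hash(term) % num_features."""
    counts = Counter(
        zlib.crc32(term.encode("utf-8")) % num_features  # stand-in hash
        for term in terms
    )
    indices = sorted(counts)
    return num_features, indices, [float(counts[i]) for i in indices]

size, indices, values = hashing_tf(["stay king", "king suit", "stay king"], 10000)
print(size, indices, values)

# With a single bucket every term collides, which shows why numFeatures matters.
print(hashing_tf(["a b", "c d"], 1))  # (1, [0], [2.0])
```

A larger `numFeatures` lowers the collision rate at the cost of wider vectors.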


<br></br>

## Word embeddings

In [10]:
#create an average word vector for each document
word2vec = Word2Vec(vectorSize = 100, minCount = 5, inputCol = 'unigrams', outputCol = 'word2vec')
# fit the model; it is needed to add the `word2vec` column selected below
word2vec_transformer = word2vec.fit(df)

In [None]:
df = word2vec_transformer.transform(df)
text_df = df.select(['is_helpful','unigrams','bow','bow_tfidf','bigrams','bigram_tfidf','word2vec'])

In [21]:
text_df.printSchema()
text_df.show()


root
 |-- is_helpful: integer (nullable = true)
 |-- unigrams: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- bow: vector (nullable = true)
 |-- bow_tfidf: vector (nullable = true)
 |-- bigrams: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- bigram_tfidf: vector (nullable = true)
 |-- word2vec: vector (nullable = true)

+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|is_helpful|            unigrams|                 bow|           bow_tfidf|             bigrams|        bigram_tfidf|            word2vec|
+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|         0|[stay, one, room,...|(10000,[1,2,3,5,6...|(10000,[1,2,3,5,6...|[stay one, one ro...|(10000,[82,90,150...|[-0.0535272059800...|
|         0|[hotel, locat, we...|(10000,[0,1,3,5,7...|(10000,
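
Each `word2vec` entry above is one dense vector per review, obtained by averaging the vectors of its tokens. The plain-Python sketch below shows that averaging on a toy two-dimensional embedding table. The name `average_word_vectors` and the toy embeddings are hypothetical. Dividing by the total token count, with unseen tokens contributing zeros, follows our reading of Spark's implementation and should be treated as an assumption.

```python
def average_word_vectors(tokens, embeddings, size):
    """Sum the vectors of known tokens and divide by the number of tokens."""
    if not tokens:
        return [0.0] * size
    total = [0.0] * size
    for token in tokens:
        vector = embeddings.get(token)  # unseen tokens add nothing to the sum
        if vector is not None:
            total = [acc + v for acc, v in zip(total, vector)]
    return [acc / len(tokens) for acc in total]

toy_embeddings = {"stay": [1.0, 0.0], "hotel": [0.0, 1.0]}
doc_vector = average_word_vectors(
    ["stay", "hotel", "stay", "unseen"], toy_embeddings, size=2
)
print(doc_vector)  # [0.5, 0.25]
```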

In [22]:
# save processed result to parquet for modeling
#text_df.coalesce(5).write.parquet('s3://dse230-project-data1/text_df.parquet', mode="overwrite")


<br></br>

## An all-included Pipeline

In [8]:
countVectorizer = CountVectorizer(inputCol="unigrams", outputCol="bow", vocabSize=10000, minDF=5)
idf = IDF(inputCol="bow", outputCol="bow_tfidf", minDocFreq=5)
ngram = NGram(n=2, inputCol="unigrams", outputCol="bigrams")
hashingTF = HashingTF(inputCol="bigrams", outputCol="bigram_tf", numFeatures=10000)
bigram_idf = IDF(inputCol="bigram_tf", outputCol="bigram_tfidf")
word2vec = Word2Vec(vectorSize = 100, minCount = 5, inputCol = 'unigrams', outputCol = 'word2vec')

pipeline = Pipeline(stages=[countVectorizer, idf, ngram, hashingTF, bigram_idf, word2vec])
#pipelineFit = pipeline.fit(clean_tokens)
#df_for_test = pipelineFit.transform(clean_tokens)
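
Conceptually, fitting a pipeline walks the stages in order: a stage that can be fit (such as `CountVectorizer` or `IDF`) is fit on the data produced by the stages before it, and every stage then transforms the data for the next one. That is why `bigram_idf` can rely on the `bigram_tf` column created by `hashingTF`. The toy sketch below imitates this with plain Python lists of dicts. All class and function names in it are hypothetical and are not Spark APIs.

```python
from collections import Counter

class Bigrams:
    """Transformer: only transform(), nothing to learn."""
    def transform(self, rows):
        return [dict(r, bigrams=[" ".join(p) for p in zip(r["unigrams"], r["unigrams"][1:])])
                for r in rows]

class DocFreq:
    """Estimator: fit() learns document frequencies and returns a model."""
    def __init__(self, input_col):
        self.input_col = input_col
    def fit(self, rows):
        freqs = Counter(t for r in rows for t in set(r[self.input_col]))
        return DocFreqModel(self.input_col, freqs)

class DocFreqModel:
    def __init__(self, input_col, freqs):
        self.input_col, self.freqs = input_col, freqs
    def transform(self, rows):
        out_col = self.input_col + "_df"
        return [dict(r, **{out_col: [self.freqs[t] for t in r[self.input_col]]}) for r in rows]

def fit_pipeline(stages, rows):
    """Fit estimators in order, feeding each stage the output of the previous ones."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(rows)   # estimator becomes a fitted model
        fitted.append(stage)
        rows = stage.transform(rows)  # later stages see the new columns
    return fitted

def transform_pipeline(fitted, rows):
    for stage in fitted:
        rows = stage.transform(rows)
    return rows

rows = [{"unigrams": ["nice", "hotel", "stay"]}, {"unigrams": ["nice", "hotel"]}]
fitted = fit_pipeline([Bigrams(), DocFreq("bigrams")], rows)
result = transform_pipeline(fitted, rows)
print(result[0]["bigrams_df"])  # [2, 1]
```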


#### We will use only a `CountVectorizer` to transform the Airbnb reviews

In [6]:
airbnb_tokens = spark.read.parquet("s3://dse230-project-data1/final_review.parquet")


In [7]:
airbnb_tokens.printSchema()


root
 |-- unigrams: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- id: string (nullable = true)

In [11]:
# define & fit `bow` transformer
countVectorizer2 = CountVectorizer(inputCol="unigrams", outputCol="bow", vocabSize=10000, minDF=5)
bow_transformer2 = countVectorizer2.fit(airbnb_tokens)

# transform tokens to `bow`
text_df2 = bow_transformer2.transform(airbnb_tokens)


In [13]:
text_df2.show()


+--------------------+---------+--------------------+
|            unigrams|       id|                 bow|
+--------------------+---------+--------------------+
|[respons, help, h...|214904211|(10000,[1,2,10,12...|
|[hous, perfect, c...|218048729|(10000,[0,1,2,10,...|
|[famili, babi, sp...|219935056|(10000,[7,14,20,2...|
|[great, hous, hil...|222829306|(10000,[0,2,4,18,...|
|[airbnb, exactli,...|227447978|(10000,[10,12,23,...|
|[group, wonder, s...|229828619|(10000,[0,2,3,4,5...|
|[realli, enjoy, s...|231585560|(10000,[0,2,6,7,1...|
|[thoroughli, enjo...|233713124|(10000,[0,2,8,9,1...|
|[place, real, get...|236677318|(10000,[0,1,4,16,...|
|[open, concept, h...|239212816|(10000,[2,6,7,11,...|
|[host, cancel, re...|243379080|(10000,[10,32,106...|
|[mahi, great, hos...|249212151|(10000,[0,2,10,21...|
|[hous, beauti, pi...|251934875|(10000,[13,17,21,...|
|[thank, everyth, ...|262128150|(10000,[0,7,8,19,...|
|[home, truli, gor...|264815252|(10000,[0,7,8,20,...|
|[mahi, hous, beau...|267497
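
Because `bow_transformer2` is fit on the Airbnb tokens, its vocabulary is learned independently of the one fit on `clean_tokens`, so the same index in the two `bow` columns can stand for different terms. The toy sketch below shows this with two small corpora in plain Python. The helper `fit_vocab` is hypothetical and orders terms by frequency with an alphabetical tie-break.

```python
from collections import Counter

def fit_vocab(docs):
    """Order terms by corpus frequency, most frequent first."""
    counts = Counter(t for tokens in docs for t in tokens)
    return sorted(counts, key=lambda t: (-counts[t], t))

hotel_vocab = fit_vocab([["hotel", "stay", "hotel"], ["stay", "staff", "hotel"]])
airbnb_vocab = fit_vocab([["hous", "host", "hous"], ["host", "stay", "hous"]])

# Shared terms that land on different indices in the two vocabularies.
shifted = {
    term: (hotel_vocab.index(term), airbnb_vocab.index(term))
    for term in set(hotel_vocab) & set(airbnb_vocab)
    if hotel_vocab.index(term) != airbnb_vocab.index(term)
}
print(hotel_vocab)   # ['hotel', 'stay', 'staff']
print(airbnb_vocab)  # ['hous', 'host', 'stay']
print(shifted)       # {'stay': (1, 2)}
```

If the two sets of vectors ever need to share one index space, applying the already fitted `bow_transformer` to `airbnb_tokens` would be one option to consider.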

In [12]:
# save processed result to parquet for modeling
#text_df2.coalesce(5).write.parquet('s3://dse230-project-data1/text_df2.parquet', mode="overwrite")
