# **TEXT ANALYSIS: SENTIMENT ANALYSIS PREPROCESSING**

**For sentiment analysis, we first need to preprocess our comments**


**STEP 1. Remove punctuation (along with digits, carriage returns, tabs, and newlines)**

In [32]:
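import re
import string

def remove_punct(text):
    # Replace punctuation, digits, carriage returns, tabs, and newlines with a space
    regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
    nopunct = regex.sub(" ", text)
    return nopunct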
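
To make the effect of the character class concrete, here is a small standalone sketch that applies the same pattern to a made-up sentence (the sample string is an assumption, not taken from the dataset). Every matched character becomes a single space, so digits disappear along with punctuation and runs of spaces are left behind.

```python
import re
import string

# Same character class as remove_punct: punctuation, digits, \r, \t, \n
pattern = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')

sample = "I've taken 3 trips!"  # hypothetical input, not from the Yelp data
cleaned = pattern.sub(" ", sample)

# repr() makes the leftover spaces visible
print(repr(cleaned))
```

The apostrophe in "I've" is replaced by a space, which is why fragments such as `ve` appear as separate tokens later on.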

**STEP 2. Split the ratings at a threshold of 3 stars: ratings of 3 or more are labelled positive (1), lower ratings negative (0)**

In [33]:
def convert_rating(rating):
    # Ratings of 3 stars or more count as positive (1); lower ratings as negative (0)
    rating = int(rating)
    return 1 if rating >= 3 else 0

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType

# Wrap the Python functions as UDFs with explicit return types
punct_remover = udf(remove_punct, StringType())
rating_convert = udf(convert_rating, IntegerType())

# Apply to the raw review data, naming the output columns directly
review_df = yelp_review.select(
    'review_id',
    punct_remover('text').alias('text'),
    rating_convert('stars').alias('label')
).limit(1000000)
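
As a quick sanity check of the threshold rule from Step 2, the sketch below maps each possible star value to its label. The list of star values is illustrative only. Note that a 3-star review lands in the positive class under this rule.

```python
def convert_rating(rating):
    # Same rule as Step 2: 3 stars or more is positive (1), otherwise negative (0)
    return 1 if int(rating) >= 3 else 0

stars = [1, 2, 3, 4, 5]  # illustrative star values
labels = {s: convert_rating(s) for s in stars}
print(labels)
```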

In [34]:
review_df.show(5)

+--------------------+--------------------+-----+
|           review_id|                text|label|
+--------------------+--------------------+-----+
|KU_O5udG6zpxOg-Vc...|If you decide to ...|    1|
|BiTunyQ73aT9WBnpR...|I ve taken a lot ...|    1|
|saUsX_uimxRlCVr67...|Family diner  Had...|    1|
|AqPFMleE6RsU23_au...|Wow   Yummy  diff...|    1|
|Sx8TMOWLNuJBWer-0...|Cute interior and...|    1|
+--------------------+--------------------+-----+
only showing top 5 rows



**STEP 3. Tokenizing the comments and removing stop words**

In [35]:
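from pyspark.ml.feature import Tokenizer, StopWordsRemover

# tokenize: lowercase the text and split it on whitespace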
tok = Tokenizer(inputCol="text", outputCol="words")
review_tokenized = tok.transform(review_df)

# remove stop words
stopword_rm = StopWordsRemover(inputCol='words', outputCol='words_new')
review_tokenized = stopword_rm.transform(review_tokenized)

In [36]:
review_tokenized.show(5)

+--------------------+--------------------+-----+--------------------+--------------------+
|           review_id|                text|label|               words|           words_new|
+--------------------+--------------------+-----+--------------------+--------------------+
|KU_O5udG6zpxOg-Vc...|If you decide to ...|    1|[if, you, decide,...|[decide, eat, , a...|
|BiTunyQ73aT9WBnpR...|I ve taken a lot ...|    1|[i, ve, taken, a,...|[ve, taken, lot, ...|
|saUsX_uimxRlCVr67...|Family diner  Had...|    1|[family, diner, ,...|[family, diner, ,...|
|AqPFMleE6RsU23_au...|Wow   Yummy  diff...|    1|[wow, , , yummy, ...|[wow, , , yummy, ...|
|Sx8TMOWLNuJBWer-0...|Cute interior and...|    1|[cute, interior, ...|[cute, interior, ...|
+--------------------+--------------------+-----+--------------------+--------------------+
only showing top 5 rows
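
The empty entries in `words` and `words_new` (for example `[wow, , , yummy, ...`) follow from Step 1: each punctuation mark was replaced by a space, and splitting on single whitespace characters turns consecutive spaces into empty tokens. The plain-Python sketch below approximates that behaviour with `re.split`; it is not Spark's implementation, and the stop-word set is a tiny hypothetical stand-in for Spark's default list.

```python
import re

text = "Wow   Yummy  different"  # spacing assumed for illustration

# Lowercase, then split on every single whitespace character
tokens = re.split(r"\s", text.lower())

# One possible cleanup: drop empty tokens and stop words
stop_words = {"a", "the", "and"}  # hypothetical, far smaller than a real list
filtered = [t for t in tokens if t and t not in stop_words]

print(tokens)
print(filtered)
```

If the empty tokens are unwanted in the Spark pipeline itself, one option worth trying is `RegexTokenizer`, which splits on a regular expression rather than on single whitespace characters.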



**FEATURE EXTRACTION**

For feature extraction, we use count vectorization and TF-IDF (term frequency-inverse document frequency).

**STEP 4: Count vectorization** extracts features by converting each document's tokens into a vector of term counts.
Applied to a collection of documents, it produces a document-term matrix with one row per document and one column per vocabulary term.
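
The idea can be sketched in plain Python on a toy corpus (the three token lists are made up). The vocabulary is ordered by corpus frequency, with ties broken alphabetically here as an assumption, and each document is stored in the `(size, [indices], [counts])` sparse form that Spark uses when it displays a count vector.

```python
from collections import Counter

# Toy corpus of already-tokenized documents (hypothetical data)
docs = [["good", "food", "good"], ["bad", "food"], ["good", "service"]]

# Vocabulary ordered by total frequency; alphabetical tie-break is an assumption
corpus_counts = Counter(t for d in docs for t in d)
vocab = [t for t, _ in sorted(corpus_counts.items(), key=lambda kv: (-kv[1], kv[0]))]
index = {t: i for i, t in enumerate(vocab)}

def to_sparse(doc):
    # Count each term, then keep only the non-zero entries
    counts = Counter(index[t] for t in doc)
    idx = sorted(counts)
    return (len(vocab), idx, [float(counts[i]) for i in idx])

vectors = [to_sparse(d) for d in docs]
for v in vectors:
    print(v)
```

Read this way, the leading number in a displayed count vector is the vocabulary size, followed by the indices of the terms present in that document and their counts.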

In [37]:
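from pyspark.ml.feature import CountVectorizer

# count vectorizer: learn the vocabulary, then map each token list to a sparse count vector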
cv = CountVectorizer(inputCol='words_new', outputCol='tf')
cvModel = cv.fit(review_tokenized)
count_vectorized = cvModel.transform(review_tokenized)

In [38]:
count_vectorized.show(5)

+--------------------+--------------------+-----+--------------------+--------------------+--------------------+
|           review_id|                text|label|               words|           words_new|                  tf|
+--------------------+--------------------+-----+--------------------+--------------------+--------------------+
|KU_O5udG6zpxOg-Vc...|If you decide to ...|    1|[if, you, decide,...|[decide, eat, , a...|(202063,[0,1,2,6,...|
|BiTunyQ73aT9WBnpR...|I ve taken a lot ...|    1|[i, ve, taken, a,...|[ve, taken, lot, ...|(202063,[0,7,14,1...|
|saUsX_uimxRlCVr67...|Family diner  Had...|    1|[family, diner, ,...|[family, diner, ,...|(202063,[0,2,3,13...|
|AqPFMleE6RsU23_au...|Wow   Yummy  diff...|    1|[wow, , , yummy, ...|[wow, , , yummy, ...|(202063,[0,11,28,...|
|Sx8TMOWLNuJBWer-0...|Cute interior and...|    1|[cute, interior, ...|[cute, interior, ...|(202063,[0,2,4,6,...|
+--------------------+--------------------+-----+--------------------+--------------------+--------------------+

In [39]:
from pyspark.ml.feature import IDF

# tf-idf: rescale the term counts by inverse document frequency
idf = IDF().setInputCol('tf').setOutputCol('tfidf')
tfidfModel = idf.fit(count_vectorized)
tfidf_df = tfidfModel.transform(count_vectorized)
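
For reference, Spark's `IDF` documentation gives the smoothed formula idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t; each term count is then multiplied by its IDF. A toy calculation in plain Python, using a made-up corpus, might look like this:

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (hypothetical data)
docs = [["good", "food", "good"], ["bad", "food"], ["good", "service"]]
m = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(t for d in docs for t in set(d))

# Smoothed inverse document frequency
idf = {t: math.log((m + 1) / (n + 1)) for t, n in df.items()}

def tfidf(doc):
    # Raw term count multiplied by the term's IDF weight
    tf = Counter(doc)
    return {t: c * idf[t] for t, c in tf.items()}

weights = [tfidf(d) for d in docs]
print(weights[0])
```

Terms that occur in many documents, such as `good` and `food` here, receive a smaller weight than terms confined to a single document.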

In [40]:
tfidf_df.show(5)

+--------------------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+
|           review_id|                text|label|               words|           words_new|                  tf|               tfidf|
+--------------------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+
|KU_O5udG6zpxOg-Vc...|If you decide to ...|    1|[if, you, decide,...|[decide, eat, , a...|(202063,[0,1,2,6,...|(202063,[0,1,2,6,...|
|BiTunyQ73aT9WBnpR...|I ve taken a lot ...|    1|[i, ve, taken, a,...|[ve, taken, lot, ...|(202063,[0,7,14,1...|(202063,[0,7,14,1...|
|saUsX_uimxRlCVr67...|Family diner  Had...|    1|[family, diner, ,...|[family, diner, ,...|(202063,[0,2,3,13...|(202063,[0,2,3,13...|
|AqPFMleE6RsU23_au...|Wow   Yummy  diff...|    1|[wow, , , yummy, ...|[wow, , , yummy, ...|(202063,[0,11,28,...|(202063,[0,11,28,...|
|Sx8TMOWLNuJBWer-0...|Cute interior and...|    1|[cute, interi