# Data Preparation
The dataframe consists of Amazon product reviews.

The data was downloaded from [Kaggle amazon reviews](https://www.kaggle.com/datasets/bittlingmayer/amazonreviews).

The input data is in the following format:
Id | ShortReview | ReviewContent 
--- | --- | ---
001 | Text | Review Description 
011 | Text | Review Description

 
Since the dataset contains nothing beyond the review text, the following columns are added to make the data more realistic:
- Index: unique ID
- Date: a randomly generated timestamp in the format YYYY-MM-DD HH:mm:ss (e.g. 2021-05-07 05:10:34) between 2020/05/09 and 2023/03/25 (the range can be changed as needed)
- ProductName: name of the product being reviewed. It is created by concatenating "ProductName" with the existing Id column.
- Price: a randomly generated value, rounded to two decimal places.
- Category: one of 4 randomly chosen values, representing the category the product belongs to. It can also be thought of as a Business Vertical or a Location/Region. The only constraint is that it should preferably have no more than 5-6 distinct values, since more would increase the computing time for the topic modeling. Details are discussed in the Topic modeling notebooks.

The output has the following format:

Index | Date | ProductName | Category | Price | Content
--- | --- | --- | --- | --- | ---
001 | 2020-09-27 09:11:04 | name1 | Category1 | 62.36 | Text Description
011 | 2022-12-13 15:00:54 | name2 | Category2 | 219.08 | Text Description

This data is saved as a Table in the Hive Metastore under the **test_db** schema.

## Import packages

In [0]:
# spark packages
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

import re
import pandas as pd
import numpy as np
from datetime import datetime
import random

# Warnings
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)



## Load Data
The data is currently loaded from DBFS. With the correct permissions, it can instead be read directly from ADLS or from hive_metastore.

In [0]:
df = spark.read.format("csv").option("header", "false").load("dbfs:/FileStore/tables/nlp/reviewAll.csv") 

In [0]:
df.display()

_c0,_c1,_c2,_c3
3,more like funchuck,"""Gave this to my dad for a gag gift after directing """"Nunsense",""""" he got a reall kick out of it!"""
5,Inspiring,"I hope a lot of people hear this cd. We need more strong and positive vibes like this. Great vocals, fresh tunes, cross-cultural happiness. Her blues is from the gut. The pop sounds are catchy and mature.",
5,The best soundtrack ever to anything.,"I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.",
4,Chrono Cross OST,"""The music of Yasunori Misuda is without question my close second below the great Nobuo Uematsu.Chrono Cross OST is a wonderful creation filled with rich orchestra and synthesized sounds. While ambiance is one of the music's major factors, yet at times it's very uplifting and vigorous. Some of my favourite tracks include; """"Scars Left by Time",The Girl who Stole the Stars
5,Too good to be true,Probably the greatest soundtrack in history! Usually it's better to have played the game first but this is so enjoyable anyway! I worked so hard getting this soundtrack and after spending [money] to get it it was really worth every penny!! Get this OST! it's amazing! The first few tracks will have you dancing around with delight (especially Scars Left by Time)!! BUY IT NOW!!,
5,There's a reason for the price,"There's a reason this CD is so expensive, even the version that's not an import.Some of the best music ever. I could listen to every track every minute of every day. That's about all i can say.",
1,Buyer beware,"""This is a self-published book, and if you want to know why--read a few paragraphs! Those 5 star reviews must have been written by Ms. Haddon's family and friends--or perhaps, by herself! I can't imagine anyone reading the whole thing--I spent an evening with the book and a friend and we were in hysterics reading bits and pieces of it to one another. It is most definitely bad enough to be entered into some kind of a """"worst book"""" contest. I can't believe Amazon even sells this kind of thing. Maybe I can offer them my 8th grade term paper on """"To Kill a Mockingbird""""--a book I am quite sure Ms. Haddon never heard of. Anyway",unless you are in a mood to send a book to someone as a joke---stay far
4,"Errors, but great story","I was a dissapointed to see errors on the back cover, but since I paid for the book I read it anyway. I have to say I love it. I couldn't put it down. I read the whole book in two hours. I say buy it. I say read it. It is sad, but it gives an interesting point of view on church today. We spend too much time looking at the faults of others. I also enjoyed beloved.Sincerly,Jaylynn R",
1,The Worst!,"A complete waste of time. Typographical errors, poor grammar, and a totally pathetic plot add up to absolutely nothing. I'm embarrassed for this author and very disappointed I actually paid for this book.",
1,Oh please,"I guess you have to be a romance novel lover for this one, and not a very discerning one. All others beware! It is absolute drivel. I figured I was in trouble when a typo is prominently featured on the back cover, but the first page of the book removed all doubt. Wait - maybe I'm missing the point. A quick re-read of the beginning now makes it clear. This has to be an intentional churning of over-heated prose for satiric purposes. Phew, so glad I didn't waste $10.95 after all.",
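
Several rows in the output above carry text in `_c3`, which suggests that reviews containing doubled quotes (`""`) and commas were split into an extra column during parsing. The sketch below uses Python's standard `csv` module, not Spark, to show how such a line parses when doubled quotes are treated as escaped quotes. The sample line is a hypothetical reconstruction, not a row copied from the file.

```python
import csv
import io

# Hypothetical raw line: the review text contains doubled quotes ("") and a comma.
raw_line = '3,more like funchuck,"Gave this to my dad after directing ""Nunsense,"" he got a real kick out of it!"\n'

# csv.reader treats "" inside a quoted field as one literal quote
# (doublequote=True is the default dialect setting).
rows = list(csv.reader(io.StringIO(raw_line)))
fields = rows[0]

print(len(fields))  # number of parsed columns
print(fields[2])    # review body with the inner quotes preserved
```

Spark's CSV reader exposes `quote` and `escape` options for this situation. Whether setting `escape` to `"` is the right choice for this file is an assumption that would need checking against the raw data.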


### Data Clean-up/Processing

In [0]:
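@udf("string")
def cleanup(string):
    '''function to remove the additional double quotes (")
    from the text reviews'''
    if string is None:
        # keep nulls as nulls instead of failing inside re.sub
        return None
    clean_string = re.sub(r"\"", "", string)
    return clean_string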

In [0]:
# data clean-up: removing null rows, renaming columns,
# combining the two review-related columns,
# adding index, date, productname, category columns

df = df.filter(col("_c2").isNotNull() | col("_c1").isNotNull()) \
        .withColumn("c3", when( \
                           ((col("_c1").isNotNull()) & (col("_c2").isNotNull())), concat_ws(". ", col("_c1"), col("_c2"))) \
                        .when( ((col("_c1").isNull()) & (col("_c2").isNotNull())), col("_c2") ) \
                        .when( ((col("_c1").isNotNull()) & (col("_c2").isNull())), col("_c1") ) ) \
        .select(cleanup(col("c3")).alias("Content"), "_c0") \
        .filter(col("Content").isNotNull() & col("_c0").isNotNull()) \
        .filter( (col("Content") != "") & (col("Content") != " ") ) \
        .withColumn("ProductName", concat_ws("-", lit("ProductName"), col("_c0"))) \
        .withColumn("id1", row_number().over(Window.orderBy(monotonically_increasing_id()))) \
        .withColumn("Index", monotonically_increasing_id()) \
        .withColumn("Category", array(lit("Category1"), lit("Category2"), lit("Category3"), lit("Category4"), ) \
              .getItem((rand()*4).cast("int"))) \
        .withColumn("Price", round(rand()*(101),2)) 
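
To make the chain above easier to reason about, the sketch below mimics two of its steps in plain Python: merging the title and body while skipping missing values, and turning a uniform random number into a category label. The helper names are hypothetical, and the code is an illustration rather than a replacement for the Spark expressions.

```python
import random

CATEGORIES = ["Category1", "Category2", "Category3", "Category4"]


def combine_review(title, body, sep=". "):
    """Join the non-missing parts, like the three when() branches above."""
    parts = [p for p in (title, body) if p is not None]
    return sep.join(parts) if parts else None


def pick_category(u):
    """Map a uniform value u in [0, 1) to a label, like getItem((rand()*4).cast('int'))."""
    return CATEGORIES[int(u * len(CATEGORIES))]


print(combine_review("Inspiring", "Great vocals, fresh tunes"))
print(combine_review(None, "Great vocals, fresh tunes"))
print(pick_category(random.random()))
```

Spark's `concat_ws` skips null inputs, so for rows where at least one part is present it behaves like `combine_review`. This suggests the `when` branches could be collapsed into a single `concat_ws` call, though that simplification is not tested here.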



In [0]:
def random_date(first_date, second_date):
    '''
    function to generate a random timestamp
    between two given dates
    '''
    first_timestamp = int(first_date.timestamp())
    second_timestamp = int(second_date.timestamp())
    random_timestamp = random.randint(first_timestamp, second_timestamp)
    return datetime.fromtimestamp(random_timestamp)

d1 = datetime.strptime("2020/05/09", "%Y/%m/%d")
d2 = datetime.strptime("2023/03/25", "%Y/%m/%d")

# print(random_date(d1, d2))
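
`random_date` converts naive datetimes to POSIX timestamps and back, which goes through the machine's local time zone. As an alternative sketch, a value of the same kind can be drawn by adding a random number of seconds to the start date, which avoids timestamps altogether. The function name `random_date_offset` and the seeded generator are assumptions made for this example.

```python
import random
from datetime import datetime, timedelta


def random_date_offset(first_date, second_date, rng=random):
    """Return a random datetime between two dates (inclusive), at one-second resolution."""
    span_seconds = int((second_date - first_date).total_seconds())
    return first_date + timedelta(seconds=rng.randint(0, span_seconds))


start = datetime.strptime("2020/05/09", "%Y/%m/%d")
end = datetime.strptime("2023/03/25", "%Y/%m/%d")

rng = random.Random(42)  # seeded only so the sketch is repeatable
samples = [random_date_offset(start, end, rng) for _ in range(1000)]
print(min(samples), max(samples))
```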

In [0]:
# create a date dataframe of the same length as the input dataframe

n_rows = df.count()  # count once; each count() call triggers a Spark job
dates = [random_date(d1, d2) for _ in range(n_rows)]
ids = range(1, n_rows + 1)
datedf = spark.createDataFrame(pd.DataFrame({"Date": dates, "id2": ids}))

In [0]:
# join the input dataframe with the date dataframe to get the final dataframe

df = df.join(datedf, col("id1")==col("id2"),"inner") \
        .select("Index", "Date", "ProductName", "Category", "Price", "Content")

In [0]:
df.show(5)

+-----+-------------------+-------------+---------+-----+--------------------+
|Index|               Date|  ProductName| Category|Price|             Content|
+-----+-------------------+-------------+---------+-----+--------------------+
|    6|2022-12-23 03:52:04|ProductName-1|Category2|76.68|Buyer beware. Thi...|
|   18|2023-03-14 16:40:26|ProductName-4|Category2|15.52|i liked this albu...|
|   21|2021-02-14 15:35:50|ProductName-2|Category3|91.25|Problem with char...|
|   25|2021-10-29 14:49:19|ProductName-1|Category1|10.86|Batteries died wi...|
|   28|2020-08-28 19:38:12|ProductName-4|Category3|61.41|Excellent choice ...|
+-----+-------------------+-------------+---------+-----+--------------------+
only showing top 5 rows



In [0]:
print("number of rows in the dataset =", df.count())
print("number of categories in the dataset")
df.groupby("Category").count().display()
print("date range for the data")
df.agg(min("Date"), max("Date")).show()

number of rows in the dataset = 2999999
number of categories in the dataset


Category,count
Category2,750393
Category1,749865
Category3,750448
Category4,749293


date range for the data
+-------------------+-------------------+
|          min(Date)|          max(Date)|
+-------------------+-------------------+
|2020-05-09 00:00:20|2023-03-24 23:59:10|
+-------------------+-------------------+
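
As a quick sanity check on the figures above, the category counts should add up to the total row count and, if categories are drawn uniformly, each should sit close to a quarter of the rows. The sketch below computes this from the printed numbers using a binomial standard deviation. Reading a deviation of under 3 standard deviations as "close enough" is an assumption, not part of the notebook.

```python
import math

# Counts copied from the output above.
counts = {
    "Category1": 749865,
    "Category2": 750393,
    "Category3": 750448,
    "Category4": 749293,
}
total_rows = 2999999

p = 1 / len(counts)                          # uniform probability per category
expected = total_rows * p                    # expected rows per category
sigma = math.sqrt(total_rows * p * (1 - p))  # binomial standard deviation

# How many standard deviations each count is away from the expected value.
z_scores = {name: (n - expected) / sigma for name, n in counts.items()}

print("counts sum to total:", sum(counts.values()) == total_rows)
for name, z in z_scores.items():
    print(f"{name}: z = {z:+.2f}")
```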



## Write data to Databricks Hive metastore - test_db schema

In [0]:
# the overwriteSchema option is set to true when writing the table

df.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable("test_db.reviewsData")

## Validation of Data

In [0]:
%sql
select * from test_db.reviewsData

Index,Date,ProductName,Category,Price,Content,Unnamed: 6
6,2022-12-23T03:52:04.000+0000,ProductName-1,Category2,76.68,"Buyer beware. This is a self-published book, and if you want to know why--read a few paragraphs! Those 5 star reviews must have been written by Ms. Haddon's family and friends--or perhaps, by herself! I can't imagine anyone reading the whole thing--I spent an evening with the book and a friend and we were in hysterics reading bits and pieces of it to one another. It is most definitely bad enough to be entered into some kind of a worst book contest. I can't believe Amazon even sells this kind of thing. Maybe I can offer them my 8th grade term paper on To Kill a Mockingbird--a book I am quite sure Ms. Haddon never heard of. Anyway",
18,2023-03-14T16:40:26.000+0000,ProductName-4,Category2,15.52,"i liked this album more then i thought i would. I heard a song or two and thought same o same o,but when i listened to songs like blue angel",
21,2021-02-14T15:35:50.000+0000,ProductName-2,Category3,91.25,"Problem with charging smaller AAAs. I have had the charger for more than two years. It charges AA batteries just fine, but has a huge problem securing smaller AAA batteries. To charge the smaller batteries you need to flip down the little button at the positive end. In the beginning one of the four AAA batteries would pop up, and now three out of the four won't hold. The problem is the flip mechanism became loose, and any horizontal pressure would push the buttons back up. What I have to do now is using duct tape and a segment of crayon, apply the crayon on the buttons, and wrap the tape around. You know how painful that is.",
25,2021-10-29T14:49:19.000+0000,ProductName-1,Category1,10.86,"Batteries died within a year .... I bought this charger in Jul 2003 and it worked OK for a while. The design is nice and convenient. However, after about a year, the batteries would not hold a charge. Might as well just get alkaline disposables, or look elsewhere for a charger that comes with batteries that have better staying power.",
28,2020-08-28T19:38:12.000+0000,ProductName-4,Category3,61.41,"Excellent choice for combination. After reading several reviews on this item, I purchased it as a Christmas gift. My brother liked it a lot, so I decided to get one for my wife and me. I'm really glad I did. It's pretty easy to set up and use, and the playback is excellent in both the VCR and DVD modes. The remote also operates my JVC TV. This is a great choice if you're looking for a quality combination player from a trusted name in electronics.",
53,2022-01-30T00:33:52.000+0000,ProductName-2,Category3,8.55,*** BEWARE ***. This TV is set so that it is not capable of a recall function. If you want to flash back between channels,
64,2022-08-25T06:28:57.000+0000,ProductName-2,Category3,96.85,"Scraped across the whole top.. Purchased this screen last week and it came with noticeable scrapes across the top. The box it came in was gigantic with plenty of packaging material, but the screen was packaged loosely at the bottom. However, since it was in another smaller box, I'm thinking that the scraping was there to begin with, and not from shipping. Still deciding whether to go through the trouble of exchanging it since another reviewer said they had the same damage. I wouldn't want to risk getting another with the same damage and I really need a screen right now. Other than the damage, it's a very nice screen.",
76,2022-07-09T20:24:44.000+0000,ProductName-3,Category4,99.57,Have yet to watch it yet.. I bought this movie to watch with my Thai girlfriend and did not get teh chance. She told me kn the phone she was watching it.She seemed to Enjoy it overall. When is howed this one and 1-2 to her she said its not for her. She lieks scary movies but this one did not apeal to her.Well she did watch it and enjoyed it. I wont get to watch it untill Aug. But i am looking forward to it.I gave the 3 star because i have not seen it. I would have givin it a 1 or a 2. It was moved to a three because of my girl friends opinon. First she said not for her than she said good movie. So it changed her mind.I am looking forward to it though,
111,2021-06-01T01:19:22.000+0000,ProductName-4,Category4,23.61,"Ok reference book. Dated (1980's) so don't expect a lot of current technology for this ancient craft. A pretty good read and part of my collection. This book is also co-authored by Sobon. Of the three (Build a Classic Timber-Framed House, Timber Frame Construction: All About Post-and-Beam Building and this one), you need this one the least. Interesting history but not so much on the construction aspect.",
112,2021-01-16T06:15:25.000+0000,ProductName-2,Category2,9.99,no technical information. this book is a great over view of the joints used in building but has no information on beam spans or loads to actually build a building,


In [0]:
%sql
select count(*) from test_db.reviewsData

count(1)
2999999
