# DATA603 Big Data Processing Project 
Group 3: Pooja Kangokar Pranesh, Yun-Zih Chen, Elizabeth Cardosa

The goal of this project is to leverage big data technologies to train a model on the UCI ML Drug Review dataset that predicts the star rating of a drug from the sentiment of its review. The model will then perform streaming inference on ‘real-time’ reviews as they arrive. The application can help potential customers understand the overall sentiment towards a drug and whether it might be useful for them.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
working_folder = "/content/drive/My Drive/UMBC Fall 2022/DATA603 Big Data Processing/Project/"

# Install Libraries and Dependencies

In [3]:
!pip install -qq pyspark
!pip install -qq spark-nlp
!pip install -q findspark

In [4]:
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2022-12-11 18:15:43--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://setup.johnsnowlabs.com/colab.sh [following]
--2022-12-11 18:15:44--  https://setup.johnsnowlabs.com/colab.sh
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2022-12-11 18:15:45--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:44

In [5]:
import pyspark.pandas as ps
import pandas as pd



In [6]:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext

In [7]:
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

In [8]:
spark = sparknlp.start()

print("Spark NLP version: {}".format(sparknlp.version()))
print("Apache Spark version: {}".format(spark.version))

Spark NLP version: 4.2.4
Apache Spark version: 3.2.1


In [9]:
sc = SparkContext.getOrCreate()

# Read-in Dataset
Dataset: https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29

The dataset provides patient reviews on specific drugs along with related conditions and a 10 star patient rating reflecting overall patient satisfaction. The data was obtained by crawling online pharmaceutical review sites. The intention was to study

- sentiment analysis of drug experience over multiple facets, i.e. sentiments learned on specific aspects such as effectiveness and side effects,
- the transferability of models among domains, i.e. conditions, and
- the transferability of models among different data sources (see 'Drug Review Dataset (Druglib.com)').

The data is split into a train (75%) and a test (25%) partition (see publication) and stored in two .tsv (tab-separated-values) files, respectively.

Attribute Information:

1. drugName (categorical): name of drug
2. condition (categorical): name of condition
3. review (text): patient review
4. rating (numerical): 10 star patient rating
5. date (date): date of review entry
6. usefulCount (numerical): number of users who found review useful


Important notes:

When using this dataset, you agree that you
1. only use the data for research purposes
2. don't use the data for any commercial purposes
3. don't distribute the data to anyone else
4. cite us

Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. In Proceedings of the 2018 International Conference on Digital Health (DH '18). ACM, New York, NY, USA, 121-125. DOI: [Web Link] 

## Load in Test Data

In [10]:
# Define the schema shared by the test and training data files
customschema = StructType([
  StructField("UniqueID", IntegerType(), True)
  ,StructField("drugName", StringType(), True)
  ,StructField("condition", StringType(), True)
  ,StructField("review", StringType(), True)
  ,StructField("rating", DoubleType(), True)
  ,StructField("date", StringType(), True)
  ,StructField("usefulCount", IntegerType(), True)
  ])

In [11]:
df_test = spark.read.format("csv")\
           .option("delimiter", "\t")\
           .option("header", "true")\
           .option("quote", "\"")\
           .option("escape", "\"")\
           .option("multiLine","true")\
           .option("quoteMode","ALL")\
           .option("mode","PERMISSIVE")\
           .option("ignoreLeadingWhiteSpace","true")\
           .option("ignoreTrailingWhiteSpace","true")\
           .option("parserLib","UNIVOCITY")\
           .schema(customschema)\
           .load(working_folder + "Data/drugsComTest_raw.tsv")

In [12]:
df_test.show(5)

+--------+---------------+--------------------+--------------------+------+------------------+-----------+
|UniqueID|       drugName|           condition|              review|rating|              date|usefulCount|
+--------+---------------+--------------------+--------------------+------+------------------+-----------+
|  163740|    Mirtazapine|          Depression|"I&#039;ve tried ...|  10.0| February 28, 2012|         22|
|  206473|     Mesalamine|Crohn's Disease, ...|"My son has Crohn...|   8.0|      May 17, 2009|         17|
|  159672|        Bactrim|Urinary Tract Inf...|"Quick reduction ...|   9.0|September 29, 2017|          3|
|   39293|       Contrave|         Weight Loss|"Contrave combine...|   9.0|     March 5, 2017|         35|
|   97768|Cyclafem 1 / 35|       Birth Control|"I have been on t...|   9.0|  October 22, 2015|          4|
+--------+---------------+--------------------+--------------------+------+------------------+-----------+
only showing top 5 rows
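
The `review` values shown above still carry HTML entities such as `&#039;` and are wrapped in literal double quotes. A minimal sketch of one way to clean them before modeling follows. The helper name `clean_review` is hypothetical and not part of the original notebook, and the sample string is a fragment of a review quoted later in this notebook with wrapping quotes added for illustration.

```python
import html


def clean_review(raw: str) -> str:
    """Decode HTML entities and strip the wrapping double quotes from a review."""
    text = html.unescape(raw)
    # Reviews in the file appear wrapped in literal double quotes; drop one pair.
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        text = text[1:-1]
    return text.strip()


# Sample fragment (wrapping quotes added here for illustration).
sample = '"If you don&#039;t need this drug- DON&#039;T START."'
print(clean_review(sample))
```

The same function could be applied to the `review` column by wrapping it in a Spark UDF; that step is left out here so the sketch runs without a Spark session.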



## Load in Training Data

In [13]:
# Read in training data file
customschema = StructType([
  StructField("UniqueID", IntegerType(), True)
  ,StructField("drugName", StringType(), True)
  ,StructField("condition", StringType(), True)
  ,StructField("review", StringType(), True)
  ,StructField("rating", DoubleType(), True)
  ,StructField("date", StringType(), True)
  ,StructField("usefulCount", IntegerType(), True)
  ])

df = spark.read.format("csv")\
           .option("delimiter", "\t")\
           .option("header", "true")\
           .option("quote", "\"")\
           .option("escape", "\"")\
           .option("multiLine","true")\
           .option("quoteMode","ALL")\
           .option("mode","PERMISSIVE")\
           .option("ignoreLeadingWhiteSpace","true")\
           .option("ignoreTrailingWhiteSpace","true")\
           .option("parserLib","UNIVOCITY")\
           .schema(customschema)\
           .load(working_folder + "Data/drugsComTrain_raw.tsv")

In [14]:
df.groupby('rating').count().show()

+------+-----+
|rating|count|
+------+-----+
|   8.0|18890|
|   7.0| 9456|
|   1.0|21619|
|   4.0| 5012|
|   3.0| 6513|
|   2.0| 6931|
|  10.0|50989|
|   6.0| 6343|
|   5.0| 8013|
|   9.0|27531|
+------+-----+
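
The counts above are skewed toward the extremes: 10-star reviews alone make up roughly 32% of the training set. The sketch below recomputes the per-rating share from the printed counts (copied by hand from the output above, not read from Spark), which may help when deciding whether to reweight or regroup classes.

```python
# Counts copied from the groupby output above.
rating_counts = {
    1.0: 21619, 2.0: 6931, 3.0: 6513, 4.0: 5012, 5.0: 8013,
    6.0: 6343, 7.0: 9456, 8.0: 18890, 9.0: 27531, 10.0: 50989,
}

total = sum(rating_counts.values())

# Print each rating with its count and share of the training set.
for rating in sorted(rating_counts):
    count = rating_counts[rating]
    print(f"{rating:>4.1f}  {count:>6}  {count / total:6.1%}")

print("total:", total)
```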



In [15]:
df.show(5)

+--------+--------------------+--------------------+--------------------+------+-----------------+-----------+
|UniqueID|            drugName|           condition|              review|rating|             date|usefulCount|
+--------+--------------------+--------------------+--------------------+------+-----------------+-----------+
|  206461|           Valsartan|Left Ventricular ...|"It has no side e...|   9.0|     May 20, 2012|         27|
|   95260|          Guanfacine|                ADHD|"My son is halfwa...|   8.0|   April 27, 2010|        192|
|   92703|              Lybrel|       Birth Control|"I used to take a...|   5.0|December 14, 2009|         17|
|  138000|          Ortho Evra|       Birth Control|"This is my first...|   8.0| November 3, 2015|         10|
|   35696|Buprenorphine / n...|   Opiate Dependence|"Suboxone has com...|   9.0|November 27, 2016|         37|
+--------+--------------------+--------------------+--------------------+------+-----------------+-----------+
only showing top 5 rows
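
The `date` column is read as a string in the form `May 20, 2012`. If a proper date type is needed later (for example, to order reviews for streaming), one option is to parse it as sketched below. The helper name `parse_review_date` is hypothetical, and the `%B` directive assumes an English locale for month names.

```python
from datetime import date, datetime


def parse_review_date(raw: str) -> date:
    """Parse dates written like 'May 20, 2012' (full month name, day, year)."""
    return datetime.strptime(raw.strip(), "%B %d, %Y").date()


# Values taken from the `date` column shown in the outputs above.
for raw in ["May 20, 2012", "November 3, 2015", " March 5, 2017"]:
    print(raw.strip(), "->", parse_review_date(raw).isoformat())
```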

In [16]:
df.count()

161297

# Attempt at Using John Snow Labs pretrained sentiment model pipeline
https://nlp.johnsnowlabs.com/

Medium Article: 
https://medium.com/analytics-vidhya/sentiment-analysis-with-sparknlp-couldnt-be-easier-2a8ea3b728a0

John Snow Labs Reference Notebook: 
https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/quick_start_google_colab.ipynb#scrollTo=tyMMD_upEfIa

This BioBERT-based model would potentially perform better, but it is not available on the free tier:
https://nlp.johnsnowlabs.com/2022/07/28/bert_sequence_classifier_drug_reviews_webmd_en_3_0.html

Breakdown of how the pretrained pipeline works under the hood: https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN.ipynb

In [17]:
pipeline = PretrainedPipeline('analyze_sentimentdl_use_twitter', 'en')

analyze_sentimentdl_use_twitter download started this may take some time.
Approx size to download 935.1 MB
[OK!]


In [18]:
pipeline.model.stages

[DocumentAssembler_bc88e2e1b2ae,
 UNIVERSAL_SENTENCE_ENCODER_4de71669b7ec,
 SentimentDLModel_eca587b575f7]

Universal Sentence Encoder: https://nlp.johnsnowlabs.com/2020/04/17/tfhub_use.html

In [19]:
# Rename the 'review' column to 'text', which the pipeline expects as its input column
df_result = pipeline.transform(df.withColumnRenamed("review", "text"))

In [20]:
test = pipeline.annotate("Holy Hell is exactly how I feel. I had been taking Brisdelle for 1.5 years. The hot flashes did indeed subside - however, the side affects of this medicine coupled with the fact Noven was acquired by YET another pharmaceutical company - YOU CAN&#039;T PLACE A REP IN THE AREA, DISTRIBUTE YOUR DRUGS, AND THEN FIRE HER-AND NOT REPLACE THEREFORE there is NO medicine or support here. You dumped this drug in the Dr&#039;s hands and walked away. After calling Sebula - you act like you don&#039;t even care. You have made it impossible to obtain this. I happen to think this is illegal.  I just decided to wean myself off this and Premarin. It has been nothing short of a nightmare. If you don&#039;t need this drug- DON&#039;T START. Seriously.")

In [21]:
test

{'document': ['Holy Hell is exactly how I feel. I had been taking Brisdelle for 1.5 years. The hot flashes did indeed subside - however, the side affects of this medicine coupled with the fact Noven was acquired by YET another pharmaceutical company - YOU CAN&#039;T PLACE A REP IN THE AREA, DISTRIBUTE YOUR DRUGS, AND THEN FIRE HER-AND NOT REPLACE THEREFORE there is NO medicine or support here. You dumped this drug in the Dr&#039;s hands and walked away. After calling Sebula - you act like you don&#039;t even care. You have made it impossible to obtain this. I happen to think this is illegal.  I just decided to wean myself off this and Premarin. It has been nothing short of a nightmare. If you don&#039;t need this drug- DON&#039;T START. Seriously.'],
 'sentence_embeddings': ['Holy Hell is exactly how I feel. I had been taking Brisdelle for 1.5 years. The hot flashes did indeed subside - however, the side affects of this medicine coupled with the fact Noven was acquired by YET another p

In [22]:
print(df_result)

DataFrame[UniqueID: int, drugName: string, condition: string, text: string, rating: double, date: string, usefulCount: int, document: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, sentence_embeddings: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, sentiment: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>]


In [23]:
# Extract the predicted label from the "sentiment" column (one row per annotation result)
df_result = df_result.withColumn("sentiment", explode('sentiment.result'))
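
Once the label is a plain string, it can be compared against the star rating. The sketch below is one possible sanity check, not the notebook's method: it maps ratings to coarse sentiment classes and measures agreement with predicted labels. The function name `rating_to_sentiment` and both thresholds are assumptions chosen for illustration; the three sample pairs are the first rows of the training data with the labels the pipeline assigned to them.

```python
def rating_to_sentiment(rating: float,
                        positive_from: float = 7.0,
                        negative_to: float = 4.0) -> str:
    """Map a 1-10 star rating to a coarse sentiment class (thresholds are assumptions)."""
    if rating >= positive_from:
        return "positive"
    if rating <= negative_to:
        return "negative"
    return "neutral"


# (rating, predicted sentiment) pairs for the first three training rows.
rows = [(9.0, "negative"), (8.0, "negative"), (5.0, "negative")]

# Fraction of rows where the rating-derived class matches the prediction.
matches = sum(rating_to_sentiment(rating) == predicted for rating, predicted in rows)
agreement = matches / len(rows)
print(f"agreement on sample rows: {agreement:.0%}")
```

On these three rows the rating-derived class never matches the predicted label, which hints that a sentiment model trained on Twitter data may not transfer well to drug reviews. A larger sample would be needed to draw a firm conclusion.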

In [24]:
df_result.show()

+--------+--------------------+--------------------+--------------------+------+------------------+-----------+--------------------+--------------------+---------+
|UniqueID|            drugName|           condition|                text|rating|              date|usefulCount|            document| sentence_embeddings|sentiment|
+--------+--------------------+--------------------+--------------------+------+------------------+-----------+--------------------+--------------------+---------+
|  206461|           Valsartan|Left Ventricular ...|"It has no side e...|   9.0|      May 20, 2012|         27|[{document, 0, 78...|[{sentence_embedd...| negative|
|   95260|          Guanfacine|                ADHD|"My son is halfwa...|   8.0|    April 27, 2010|        192|[{document, 0, 73...|[{sentence_embedd...| negative|
|   92703|              Lybrel|       Birth Control|"I used to take a...|   5.0| December 14, 2009|         17|[{document, 0, 75...|[{sentence_embedd...| negative|
|  138000|      