<a href="https://colab.research.google.com/github/nbchan/INMR96-Digital-Health-and-Data-Analytics/blob/main/Week_09_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INMR96-Digital Health and Data Analytics
## Week 9: Natural Language Processing (NLP) in Healthcare

### Dataset Preparation

**Dataset Description:**

The [Drug Review Dataset](https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29) provides patient reviews on specific drugs along with related conditions and a 10 star patient rating reflecting overall patient satisfaction. The data was obtained by crawling online pharmaceutical review sites. The intention was to study:

1.   Sentiment analysis of drug experience over multiple facets, i.e. sentiments learned on specific aspects such as effectiveness and side effects,
2.   The transferability of models among domains, i.e. conditions, and
3.   The transferability of models among different data sources (see 'Drug Review Dataset (Druglib.com)').

**Dataset Split:**

The data is split into a train (75%) and a test (25%) partition (see publication) and stored in two .tsv (tab-separated values) files, respectively.

**Attribute Information**:

1. drugName (categorical): name of drug
2. condition (categorical): name of condition
3. review (text): patient review
4. rating (numerical): 10 star patient rating
5. date (date): date of review entry
6. usefulCount (numerical): number of users who found review useful

**Citation:**

Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. In Proceedings of the 2018 International Conference on Digital Health (DH '18). ACM, New York, NY, USA, 121-125.

In [None]:
# Mount Google Drive. Google Drive is used to
# store the data and results.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Print the current working folder.
!pwd

/content


In [None]:
# Set up working folder path.
work_path = "/content/drive/MyDrive/INMR96/"
print(work_path)

/content/drive/MyDrive/INMR96/


In [None]:
# Download Dataset to working folder.
!wget -P $work_path https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip

--2023-03-12 18:30:51--  https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42989872 (41M) [application/x-httpd-php]
Saving to: ‘/content/drive/MyDrive/INMR96/drugsCom_raw.zip.1’


2023-03-12 18:30:52 (26.1 MB/s) - ‘/content/drive/MyDrive/INMR96/drugsCom_raw.zip.1’ saved [42989872/42989872]



In [None]:
# Unzip the raw data into the working folder using Python's
# zipfile package.
import os
from zipfile import ZipFile

# Open drugsCom_raw.zip as a ZipFile object and extract its contents.
# The with statement closes the archive automatically on exit, so no
# explicit close() call is needed.
with ZipFile(os.path.join(work_path, "drugsCom_raw.zip"), 'r') as zo:
  # Extract every file in the archive into the working folder.
  zo.extractall(path=work_path)

## Data Preprocessing

In [None]:
import pandas as pd


In [None]:
# Read the raw drugsComTrain_raw.tsv file into a pandas DataFrame.
train_set = pd.read_csv(os.path.join(work_path, "drugsComTrain_raw.tsv"), sep="\t")


In [None]:
# Print the size of the train set, and have a look at the whole dataset.
print(train_set.shape)
train_set

(161297, 7)


Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9.0,"May 20, 2012",27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5.0,"December 14, 2009",17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8.0,"November 3, 2015",10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9.0,"November 27, 2016",37
...,...,...,...,...,...,...,...
161292,191035,Campral,Alcohol Dependence,"""I wrote my first report in Mid-October of 201...",10.0,"May 31, 2015",125
161293,127085,Metoclopramide,Nausea/Vomiting,"""I was given this in IV before surgey. I immed...",1.0,"November 1, 2011",34
161294,187382,Orencia,Rheumatoid Arthritis,"""Limited improvement after 4 months, developed...",2.0,"March 15, 2014",35
161295,47128,Thyroid desiccated,Underactive Thyroid,"""I&#039;ve been on thyroid medication 49 years...",10.0,"September 19, 2015",79


In [None]:
# Read drugsComTest_raw.tsv into a pandas DataFrame as the test set.
test_set = pd.read_csv(os.path.join(work_path, "drugsComTest_raw.tsv"), sep="\t")


In [None]:
# Have a glimpse of the test dataset.
print(test_set.shape)
test_set

(53766, 7)


Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,163740,Mirtazapine,Depression,"""I&#039;ve tried a few antidepressants over th...",10.0,"February 28, 2012",22
1,206473,Mesalamine,"Crohn's Disease, Maintenance","""My son has Crohn&#039;s disease and has done ...",8.0,"May 17, 2009",17
2,159672,Bactrim,Urinary Tract Infection,"""Quick reduction of symptoms""",9.0,"September 29, 2017",3
3,39293,Contrave,Weight Loss,"""Contrave combines drugs that were used for al...",9.0,"March 5, 2017",35
4,97768,Cyclafem 1 / 35,Birth Control,"""I have been on this birth control for one cyc...",9.0,"October 22, 2015",4
...,...,...,...,...,...,...,...
53761,159999,Tamoxifen,"Breast Cancer, Prevention","""I have taken Tamoxifen for 5 years. Side effe...",10.0,"September 13, 2014",43
53762,140714,Escitalopram,Anxiety,"""I&#039;ve been taking Lexapro (escitaploprgra...",9.0,"October 8, 2016",11
53763,130945,Levonorgestrel,Birth Control,"""I&#039;m married, 34 years old and I have no ...",8.0,"November 15, 2010",7
53764,47656,Tapentadol,Pain,"""I was prescribed Nucynta for severe neck/shou...",1.0,"November 28, 2011",20


In [None]:
# Keep only the review and rating columns of the train set.
# Randomly sample a small portion of the train data (800 records).
# Then, reset the index.
train_set = train_set[['review', 'rating']]
train_set = train_set.sample(n=800, random_state=1)
train_set = train_set.reset_index(drop=True)


In [None]:
# Print the sub train set information.
print(train_set.shape)
train_set

(800, 2)


Unnamed: 0,review,rating
0,"""This was my first pill I have ever tried, and...",1.0
1,"""Hi there,\r\n\r\nI also wanted to write a pos...",10.0
2,"""I&#039;m basically a very hard sell when it c...",10.0
3,"""My son is 5 1/2. We&#039;ve always noticed hi...",8.0
4,"""I&#039;ve had it about 2 months now, and my e...",10.0
...,...,...
795,"""I completely stopped having panic attacks aft...",8.0
796,"""I&#039;m 23, never given birth, had it insert...",6.0
797,"""I had Implanon for 3 years now. It&#039;s tim...",10.0
798,"""Simply wonderful. I took 20 mg for the first ...",9.0


In [None]:
# Keep only the review and rating columns, as for the train set.
# Here only 200 records are chosen at random.
# The index is reset as well.
test_set = test_set[['review', 'rating']]
test_set = test_set.sample(n=200, random_state=1)
test_set = test_set.reset_index(drop=True)


In [None]:
# Print out sub test set information.
print(test_set.shape)
test_set

(200, 2)


Unnamed: 0,review,rating
0,"""I turned into a total zombie on this. Litera...",1.0
1,"""I took this yesterday after a whole night wit...",10.0
2,"""I have been on Latuda 80mg for almost 2 month...",1.0
3,"""I have been taking Victoza 1.2mg for about a ...",5.0
4,"""I started Siliq 6 months ago at 90 % body cov...",10.0
...,...,...
195,"""I took this pill next day after having unprot...",10.0
196,"""I&#039;ve loved Nexplanon. The insertion was ...",10.0
197,"""I receive the highest dose of this drug for s...",8.0
198,"""I absolutely hate this medication. I had cold...",1.0


## Using TF-IDF to Embed the Text Data

### Data Preprocessing without Extra Manipulations.

In [None]:
import numpy as np

# Convert the rating column to int, then binarize it.
# If the rating is greater than or equal to 5, set it to 1.
# If the rating is smaller than 5, set it to 0.
train_set['rating'] = train_set['rating'].apply(np.int64)
train_set['rating'] = train_set['rating'].apply(lambda x: 1 if x >= 5 else 0)

test_set['rating'] = test_set['rating'].apply(np.int64)
test_set['rating'] = test_set['rating'].apply(lambda x: 1 if x >= 5 else 0)

# Count the number of reviews in each rating class.
print(f"Train set:\n{train_set['rating'].value_counts()}")
print(f"Test set:\n{test_set['rating'].value_counts()}")

Train set:
1    608
0    192
Name: rating, dtype: int64
Test set:
1    150
0     50
Name: rating, dtype: int64
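
The thresholding above can also be written as a single vectorized comparison. The following is a minimal sketch on made-up ratings (the values are illustrative, not taken from the dataset), assuming the same rule: ratings of 5 or more map to 1, the rest to 0.

```python
import pandas as pd

# Toy ratings on the same 1-10 scale as the dataset (illustrative values only).
ratings = pd.Series([1.0, 4.0, 5.0, 8.0, 10.0])

# The comparison yields booleans; astype(int) maps True to 1 and False to 0.
labels = (ratings >= 5).astype(int)
print(labels.tolist())
```

Applied to the notebook's data, the same expression would take `train_set['rating']` in place of the toy series.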


In [None]:
# Print the shape of the dataset.
print(f"Train set: {train_set.shape}")
print(f"Test set: {test_set.shape}")

Train set: (800, 2)
Test set: (200, 2)


In [None]:
# Prepare dataset for vectorization.
X_train, y_train = train_set['review'], train_set['rating']
X_test, y_test = test_set['review'], test_set['rating']

In [None]:
# Set up TF-IDF vectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

In [None]:
# Perform the TF-IDF vectorization.
tf_x_train = vectorizer.fit_transform(X_train)
tf_x_test = vectorizer.transform(X_test)

# Check the shape of the vectorized dataset.
print(f"Train set: {tf_x_train.shape}")
print(f"Test set: {tf_x_test.shape}")

Train set: (800, 5200)
Test set: (200, 5200)
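
To make the fit/transform split above more concrete, here is a small sketch on a toy corpus (the three sentences are invented for illustration). It shows that the vocabulary is learned only from the fitted documents, that words appearing in fewer documents get a larger weight, and that words unseen at fit time are dropped when transforming new text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus standing in for the training reviews.
docs = [
    "no side effects at all",
    "severe side effects and nausea",
    "nausea went away",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)      # learn the vocabulary and IDF weights
vocab = vec.vocabulary_          # maps each term to its column index
print(X.shape)                   # (number of documents, vocabulary size)

# In the first document, "no" occurs in one document only, while "side"
# occurs in two, so "no" receives the larger TF-IDF weight.
row0 = X[0].toarray().ravel()
print(row0[vocab["no"]] > row0[vocab["side"]])

# A word that was not seen during fit has no column, so it is ignored.
unseen = vec.transform(["headache"])
print(unseen.nnz)
```

This is why the train and test matrices in the notebook share the same number of columns: both are projected onto the vocabulary learned from the train set.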


### Data Preprocessing with Extra Manipulations.

In [None]:
# Convert to lower case.
X_train = X_train.apply(lambda x: " ".join(x.lower() for x in str(x).split()))
X_test = X_test.apply(lambda x: " ".join(x.lower() for x in str(x).split()))

In [None]:
# Remove the stop words from the corpus to reduce the dimension.
# Stop words are commonly used words (such as "the", "a", "an", "in")
# that search engines are typically programmed to ignore, both when
# indexing entries for searching and when retrieving them as the
# result of a search query. We do not want these words to take
# up space in our feature matrix or valuable processing time.
import nltk
import re
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
stop = stopwords.words("english")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [None]:
# Print out the English stop words.
print(len(stop))
print(stop)


179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than

In [None]:
# Remove non-alpha characters.
X_train = X_train.apply(lambda x: " ".join([re.sub("[^A-Za-z]+","", x) for x in nltk.word_tokenize(x)]))
X_test = X_test.apply(lambda x: " ".join([re.sub("[^A-Za-z]+","", x) for x in nltk.word_tokenize(x)]))

In [None]:
# Remove extra spaces.
X_train = X_train.apply(lambda x: re.sub(" +", " ", x))
X_test = X_test.apply(lambda x: re.sub(" +", " ", x))

In [None]:
# Remove stopwords.
X_train = X_train.apply(lambda x: " ".join([x for x in x.split() if x not in stop]))
X_test = X_test.apply(lambda x: " ".join([x for x in x.split() if x not in stop]))

In [None]:
# Lemmatization.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [None]:
X_train = X_train.apply(lambda x: " ".join([lemmatizer.lemmatize(w) for w in nltk.word_tokenize(x)]))
X_test = X_test.apply(lambda x: " ".join([lemmatizer.lemmatize(w) for w in nltk.word_tokenize(x)]))

In [None]:
# Perform the TF-IDF vectorization.
tf_x_train = vectorizer.fit_transform(X_train)
tf_x_test = vectorizer.transform(X_test)

# Check the shape of the vectorized dataset.
print(f"Train set: {tf_x_train.shape}")
print(f"Test set: {tf_x_test.shape}")

Train set: (800, 4509)
Test set: (200, 4509)
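
The cleaning steps in this section can be gathered into one function. The sketch below is a simplified stand-in, not the notebook's exact pipeline: `clean_review` and `STOP_WORDS` are hypothetical names, the stop word set is a tiny illustrative subset rather than NLTK's full list, non-letters are replaced by a space instead of being deleted inside each token, and lemmatization is left out so that the example runs without downloading NLTK data.

```python
import re

# Tiny illustrative stop word set; the notebook uses NLTK's English list.
STOP_WORDS = {"i", "the", "a", "an", "and", "it", "was", "this", "my"}

def clean_review(text, stop_words=STOP_WORDS):
    """Lowercase, strip non-letters, collapse spaces, drop stop words."""
    text = text.lower()
    # Replace every run of non-letter characters with a single space.
    text = re.sub(r"[^a-z]+", " ", text)
    # split() without arguments also collapses repeated whitespace.
    tokens = [t for t in text.split() if t not in stop_words]
    return " ".join(tokens)

print(clean_review("This was my FIRST pill, and it worked!"))
```

Wrapping the steps in a function makes it easier to apply the identical transformation to both `X_train` and `X_test`, for example with `X_train.apply(clean_review)`.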


### Model Development and Evaluation

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [None]:
clf_ti = RandomForestClassifier()

In [None]:
clf_ti.fit(tf_x_train, y_train)


In [None]:
y_pred_ti = clf_ti.predict(tf_x_test)

In [None]:
print(classification_report(y_test, y_pred_ti))

              precision    recall  f1-score   support

           0       0.67      0.04      0.08        50
           1       0.76      0.99      0.86       150

    accuracy                           0.76       200
   macro avg       0.71      0.52      0.47       200
weighted avg       0.73      0.76      0.66       200
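
The class 0 row of the report above can be reproduced by hand, which helps when reading the low recall. The counts below are not printed by the notebook; they are inferred from the rounded figures and the support of 50, so treat them as an assumption: 2 low-rated reviews predicted correctly, 48 low-rated reviews predicted as class 1, and 1 review from class 1 predicted as class 0.

```python
# Confusion counts for class 0, inferred from the rounded report (assumption).
tp = 2    # class 0 reviews predicted as class 0
fp = 1    # class 1 reviews predicted as class 0
fn = 48   # class 0 reviews predicted as class 1

precision = tp / (tp + fp)   # share of class 0 predictions that are correct
recall = tp / (tp + fn)      # share of true class 0 reviews that are found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Under these assumed counts the model labels almost every review as class 1, which is consistent with the 608 to 192 class imbalance in the sampled train set.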



## Using OpenAI to Embed the Text Data

### Set up OpenAI Env

In [None]:
# Install the openai package first.
!pip install openai

In [None]:
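import openai
# Note: openai.embeddings_utils ships with openai package versions before 1.0.
from openai.embeddings_utils import get_embedding

# Set up the OpenAI API key. Avoid saving a real key in the notebook;
# reading it from an environment variable is a safer alternative:
# openai.api_key = os.getenv("OPENAI_API_KEY")
openai.api_key = "Your_OpenAI_API_Key"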


# Set up Embedding model.
embedding_model = "text-embedding-ada-002"

# Define the embedding function.
def openai_embedding(sent):
  ebd = get_embedding(sent, engine=embedding_model) # get_embedding returns a list.
  return ebd


### Data Preprocessing

In [None]:
# Embedding the review data for each entry.
train_set['embedding'] = train_set['review'].apply(lambda x: openai_embedding(x))
test_set['embedding'] = test_set['review'].apply(lambda x: openai_embedding(x))


In [None]:
train_set.to_csv(os.path.join(work_path, "train_ebd.csv"), index=False)
train_set


Unnamed: 0,review,rating,embedding
0,"""This was my first pill I have ever tried, and...",1.0,"[-0.020021311938762665, -0.0066970461048185825..."
1,"""Hi there,\r\n\r\nI also wanted to write a pos...",10.0,"[-0.03102271445095539, -0.01036074012517929, 0..."
2,"""I&#039;m basically a very hard sell when it c...",10.0,"[-0.009960674680769444, -0.001272476278245449,..."
3,"""My son is 5 1/2. We&#039;ve always noticed hi...",8.0,"[-0.019227905198931694, 0.017923634499311447, ..."
4,"""I&#039;ve had it about 2 months now, and my e...",10.0,"[-0.04033423960208893, -0.005322061944752932, ..."
...,...,...,...
795,"""I completely stopped having panic attacks aft...",8.0,"[-0.012320872396230698, -0.004545219708234072,..."
796,"""I&#039;m 23, never given birth, had it insert...",6.0,"[-0.03953072428703308, -0.01564646326005459, 0..."
797,"""I had Implanon for 3 years now. It&#039;s tim...",10.0,"[-0.03473212197422981, -0.013803791254758835, ..."
798,"""Simply wonderful. I took 20 mg for the first ...",9.0,"[-0.012393943965435028, 0.01435157936066389, 0..."


In [None]:
test_set.to_csv(os.path.join(work_path, "test_ebd.csv"), index=False)
test_set


Unnamed: 0,review,rating,embedding
0,"""I turned into a total zombie on this. Litera...",1.0,"[-0.016931885853409767, -0.0051310695707798, 0..."
1,"""I took this yesterday after a whole night wit...",10.0,"[-0.00977715291082859, 0.00031594507163390517,..."
2,"""I have been on Latuda 80mg for almost 2 month...",1.0,"[-0.007015461102128029, 0.01202203519642353, 0..."
3,"""I have been taking Victoza 1.2mg for about a ...",5.0,"[-0.011468385346233845, -0.028590109199285507,..."
4,"""I started Siliq 6 months ago at 90 % body cov...",10.0,"[0.007876160554587841, 0.003719479078426957, 0..."
...,...,...,...
195,"""I took this pill next day after having unprot...",10.0,"[-0.03533007577061653, -0.009008128196001053, ..."
196,"""I&#039;ve loved Nexplanon. The insertion was ...",10.0,"[-0.04100845381617546, 0.0008877760265022516, ..."
197,"""I receive the highest dose of this drug for s...",8.0,"[-0.01644211634993553, 0.01941230334341526, 0...."
198,"""I absolutely hate this medication. I had cold...",1.0,"[-0.00036835757782682776, 0.021770551800727844..."


In [None]:
train_ebd = pd.read_csv(os.path.join(work_path, "train_ebd.csv"))
train_ebd['embedding'] = train_ebd.embedding.apply(eval).apply(np.array)

train_ebd = train_ebd[['rating', 'embedding']]
train_ebd

Unnamed: 0,rating,embedding
0,1.0,"[-0.020021311938762665, -0.0066970461048185825..."
1,10.0,"[-0.03102271445095539, -0.01036074012517929, 0..."
2,10.0,"[-0.009960674680769444, -0.001272476278245449,..."
3,8.0,"[-0.019227905198931694, 0.017923634499311447, ..."
4,10.0,"[-0.04033423960208893, -0.005322061944752932, ..."
...,...,...
795,8.0,"[-0.012320872396230698, -0.004545219708234072,..."
796,6.0,"[-0.03953072428703308, -0.01564646326005459, 0..."
797,10.0,"[-0.03473212197422981, -0.013803791254758835, ..."
798,9.0,"[-0.012393943965435028, 0.01435157936066389, 0..."


In [None]:
test_ebd = pd.read_csv(os.path.join(work_path, "test_ebd.csv"))
test_ebd['embedding'] = test_ebd.embedding.apply(eval).apply(np.array)

test_ebd = test_ebd[['rating', 'embedding']]
test_ebd

Unnamed: 0,rating,embedding
0,1.0,"[-0.016931885853409767, -0.0051310695707798, 0..."
1,10.0,"[-0.00977715291082859, 0.00031594507163390517,..."
2,1.0,"[-0.007015461102128029, 0.01202203519642353, 0..."
3,5.0,"[-0.011468385346233845, -0.028590109199285507,..."
4,10.0,"[0.007876160554587841, 0.003719479078426957, 0..."
...,...,...
195,10.0,"[-0.03533007577061653, -0.009008128196001053, ..."
196,10.0,"[-0.04100845381617546, 0.0008877760265022516, ..."
197,8.0,"[-0.01644211634993553, 0.01941230334341526, 0...."
198,1.0,"[-0.00036835757782682776, 0.021770551800727844..."
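
When a DataFrame is written to CSV, each embedding list is stored as text, which is why the cells above parse it back after loading. As a hedged alternative to the built-in `eval`, the standard library's `ast.literal_eval` parses Python literals only and refuses arbitrary expressions. The sketch below uses a short made-up vector rather than a real 1536-dimensional embedding.

```python
import ast

# A Python list written to CSV is stored as its string form.
saved = str([-0.02, 0.0067, 0.0337])

# literal_eval accepts Python literals only (lists, numbers, strings, ...).
parsed = ast.literal_eval(saved)
print(parsed)

# Anything that is not a literal, such as a function call, is rejected.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

In the notebook this would correspond to `train_ebd.embedding.apply(ast.literal_eval)` in place of `apply(eval)`.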


In [None]:
ratings = train_ebd.rating
train_ebd = train_ebd.embedding.apply(pd.Series)
train_ebd['rating'] = ratings

train_ebd['rating'] = train_ebd['rating'].apply(np.int64)
train_ebd['rating'] = train_ebd['rating'].apply(lambda x: 1 if x >= 5 else 0)
train_ebd

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1527,1528,1529,1530,1531,1532,1533,1534,1535,rating
0,-0.020021,-0.006697,0.033669,-0.018549,-0.018701,0.011528,-0.002480,-0.017749,-0.017723,-0.029175,...,-0.015083,0.021964,-0.028134,-0.007357,0.022332,-0.017216,-0.007205,-0.010982,-0.013356,0
1,-0.031023,-0.010361,0.008490,-0.020616,-0.011690,0.007412,-0.024173,-0.027161,-0.011762,-0.031340,...,-0.014136,0.048399,-0.032266,0.001012,0.009567,0.001682,0.001412,-0.019822,-0.027241,1
2,-0.009961,-0.001272,0.037983,-0.025499,-0.007265,0.020824,-0.022737,-0.047864,-0.025154,-0.019457,...,-0.019284,0.041304,-0.021210,-0.019549,0.002432,0.009317,-0.034052,0.022538,-0.011581,1
3,-0.019228,0.017924,0.017561,-0.046577,-0.015342,0.026650,-0.027564,-0.021420,-0.032002,0.011510,...,-0.011355,0.010562,-0.016996,0.006595,-0.022710,-0.029985,-0.012498,-0.011893,-0.024996,1
4,-0.040334,-0.005322,0.027250,-0.014337,-0.016342,0.009061,-0.022106,-0.049303,-0.018703,-0.015115,...,-0.004663,0.057296,-0.037037,-0.028332,-0.004607,-0.023900,-0.007729,-0.000958,-0.030046,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,-0.012321,-0.004545,0.039024,-0.042771,-0.024332,0.034527,-0.020714,-0.046751,-0.020546,-0.020701,...,-0.006448,0.024500,-0.010738,0.006451,0.017121,-0.032537,0.003263,-0.017703,-0.042978,1
796,-0.039531,-0.015646,0.007930,-0.011842,-0.016890,0.013948,-0.040708,-0.032470,-0.037311,-0.014229,...,-0.019404,0.040734,-0.032095,0.000709,-0.000961,-0.021450,-0.000255,0.004794,-0.022788,1
797,-0.034732,-0.013804,0.012576,0.013257,-0.026183,-0.007016,-0.019224,-0.027582,-0.026361,0.007385,...,-0.032926,0.033358,-0.019745,-0.012792,0.010604,-0.019516,-0.018791,0.002063,-0.012264,1
798,-0.012394,0.014352,0.014709,-0.018135,-0.011018,0.009365,-0.006653,-0.044179,-0.024484,-0.038941,...,-0.017830,0.050846,-0.004785,-0.007751,0.003707,-0.016018,-0.029550,-0.007599,-0.002832,1


In [None]:
train_ebd['rating'].value_counts()

1    608
0    192
Name: rating, dtype: int64

In [None]:
ratings = test_ebd.rating
test_ebd = test_ebd.embedding.apply(pd.Series)
test_ebd['rating'] = ratings

test_ebd['rating'] = test_ebd['rating'].apply(np.int64)
test_ebd['rating'] = test_ebd['rating'].apply(lambda x: 1 if x >= 5 else 0)
test_ebd

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1527,1528,1529,1530,1531,1532,1533,1534,1535,rating
0,-0.016932,-0.005131,0.002047,-0.013610,-0.013855,0.000470,-0.023524,-0.045066,-0.028250,-0.020563,...,-0.009953,0.043933,-0.010751,-0.011994,0.006312,-0.043624,0.008556,-0.009773,-0.050242,0
1,-0.009777,0.000316,0.012801,-0.002560,-0.025576,0.004477,-0.002910,-0.056116,-0.021017,-0.020064,...,-0.017569,0.033022,-0.030357,0.012977,0.006825,-0.032577,-0.003893,-0.006002,-0.028241,1
2,-0.007015,0.012022,0.037599,-0.029489,-0.035647,0.008280,-0.021941,-0.009331,-0.019926,-0.003395,...,-0.007466,0.022730,-0.017798,-0.006621,-0.000213,-0.023381,0.004159,-0.009043,-0.031617,0
3,-0.011468,-0.028590,0.043700,-0.012219,-0.015136,0.029030,-0.020026,-0.027348,-0.025123,-0.010013,...,-0.001472,0.037077,-0.011080,0.007387,0.005213,-0.029211,-0.008176,-0.005857,-0.019897,1
4,0.007876,0.003719,0.022017,-0.042311,-0.000964,0.022199,-0.031479,-0.035028,-0.011530,-0.017984,...,-0.012144,0.037717,-0.028268,0.005083,0.006114,-0.015426,-0.006956,0.005677,-0.016953,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,-0.035330,-0.009008,0.019616,-0.012345,-0.013463,0.009450,-0.006179,-0.036267,-0.032442,-0.019356,...,-0.017327,0.037359,-0.028800,-0.019200,0.002080,-0.029867,-0.019486,-0.010153,-0.028696,1
196,-0.041008,0.000888,0.015570,-0.008389,-0.002173,0.003969,-0.030292,-0.030876,-0.016008,-0.005643,...,-0.011698,0.032653,-0.025345,-0.017202,-0.001524,-0.023011,-0.021433,0.003614,-0.028648,1
197,-0.016442,0.019412,0.014321,-0.032884,-0.012239,0.036093,-0.005954,-0.017887,-0.041715,-0.001186,...,-0.023523,0.034767,-0.025008,-0.003345,0.005741,-0.029675,-0.013160,-0.000735,-0.016721,1
198,-0.000368,0.021771,0.020743,-0.026222,-0.033137,0.011162,-0.008376,-0.040433,0.014553,0.007948,...,-0.011748,0.048493,-0.037957,-0.002395,-0.011294,-0.027368,-0.011122,0.008291,-0.019518,0


In [None]:
test_ebd['rating'].value_counts()

1    150
0     50
Name: rating, dtype: int64
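
Expanding the embedding column with `apply(pd.Series)` produces one DataFrame column per dimension, which may be slow on larger samples. One possible alternative, sketched here on two made-up 3-dimensional vectors, is to stack the arrays directly into a NumPy feature matrix; scikit-learn estimators accept such arrays as input.

```python
import numpy as np
import pandas as pd

# Made-up frame mimicking the layout of train_ebd before expansion.
df = pd.DataFrame({
    "rating": [1, 0],
    "embedding": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
})

# Stack the per-row vectors into a (n_samples, n_dimensions) matrix.
X = np.vstack(df["embedding"].tolist())
y = df["rating"].to_numpy()
print(X.shape, y.shape)
```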

### Model Development and Evaluation

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [None]:
clf_oa = RandomForestClassifier()

In [None]:
X_train, y_train = train_ebd.loc[:, ~train_ebd.columns.isin(['rating'])], train_ebd['rating']
print(X_train.shape, y_train.shape)

(800, 1536) (800,)


In [None]:
X_test, y_test = test_ebd.loc[:, ~test_ebd.columns.isin(['rating'])], test_ebd['rating']
print(X_test.shape, y_test.shape)

(200, 1536) (200,)


In [None]:
clf_oa.fit(X_train, y_train)

In [None]:
y_pred = clf_oa.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.24      0.38        50
           1       0.80      0.99      0.88       150

    accuracy                           0.81       200
   macro avg       0.86      0.62      0.63       200
weighted avg       0.83      0.81      0.76       200



## Exercise


In the sections above, only the review text is used to build the model. However, the drug name and the condition likely carry useful information as well, which may further improve the model's predictive performance. In this exercise, use drugName, condition, and review together to build a machine learning model that predicts the rating, and compare its performance with the models that use only the review text.



In [None]:
# Step 1: Concatenate the drugName, condition and review columns
# to generate a new column. You are encouraged to try this on a small
# portion of the data, such as the 1000 reviews (800 train, 200 test)
# used in the sections above, because more data requires more
# computation time.



In [None]:
# Step 2: Use the TF-IDF or the OpenAI embedding method
# to encode the newly generated column. If you choose
# the OpenAI embedding method, please do not use too
# much data. Otherwise, it will take a long time to compute
# the embeddings.



In [None]:
# Step 3: Define a classifier, such as the Random Forest Classifier,
# and train the model. You may try any model you like, whether or
# not it has been covered in the course.


In [None]:
# Step 4: Test the model on the test dataset, and compare its
# performance with the models that use only the review data.

