## Sentiment Prediction of Drug Reviews

In this section, we predict the sentiment of customers' drug reviews using machine learning, deep learning, and transformer models.

In [1]:
import numpy as np
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
file_path = '/content/drive/MyDrive/DrugReviews/DrugReviews_cleaned.csv'
df = pd.read_csv(file_path)

In [4]:
df

Unnamed: 0,MedicineUsedFor,MedicineBrandName,MedicineGenericName,ReviewDate,UserName,IntakeTime,Reviews,ReviewLength,Rating,NumberOfLikes
0,Cough,Acetaminophen / Codeine,Not Mentioned,1-Apr-08,smoore,Not Specified,Works good as a cough suppressant.,34,9,24
1,Cough,Benzonatate,Not Mentioned,1-Apr-08,Anonymous,Not Specified,Pneumonia cough was non-stop - gave almost ins...,210,9,39
2,Dermatologic Lesion,Methylprednisolone Dose Pack,Methylprednisolone,1-Apr-08,Anonymous,Not Specified,This steriod helped kill the pain of my condit...,162,8,24
3,"Hypogonadism, Male",Androgel,Not Mentioned,1-Apr-08,MikeC...,Not Specified,I'm a 35 year old male and I had no idea that ...,105,9,380
4,Depression,Celexa,Not Mentioned,1-Apr-08,Cherpie,Not Specified,It is so nice to have my life back!!!,37,10,206
...,...,...,...,...,...,...,...,...,...,...
255945,Birth Control,Isibloom,Desogestrel / Ethinyl Estradiol,9-Sep-22,Skylar,Not Specified,This birth control is awful severe nausea and ...,108,1,0
255946,Underactive Thyroid,Unithroid,Levothyroxine,9-Sep-22,Syd,Taken for less than 1 month,Post partial thyroidectomy due to a large beni...,224,2,7
255947,Bacterial Infection,Amoxicillin / Clavulanate,Not Mentioned,9-Sep-22,FLgirl,Taken for less than 1 month,I was given this for a tooth abscess. I was sc...,957,9,1
255948,Strep Throat,Augmentin,Amoxicillin / Clavulanate,9-Sep-22,peein...,Taken for less than 1 month,This stuff is great if you wanna pee out of yo...,263,1,0


In [5]:
import re
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

stop_words = set(stopwords.words('english'))
stop_words.remove('not')

lemmatizer = WordNetLemmatizer()

# Lowercase the text once, outside the function (vectorized, so faster than per row)
df['Reviews'] = df['Reviews'].str.lower()

def clean_text(text):
    # Remove punctuation
    text = re.sub(r'[^\w\s]', ' ', text)
    # Tokenize text
    text = word_tokenize(text)
    # Remove stopwords
    text = [word for word in text if word not in stop_words]
    # Lemmatization
    text = [lemmatizer.lemmatize(word=word, pos='v') for word in text]
    # Join all text
    text = ' '.join(text)

    return text

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [6]:
# Clean text data
df['ReviewsClean'] = df['Reviews'].apply(lambda x: clean_text(str(x)))

In [7]:
df['Rating'].value_counts().sort_index()

Rating
1     53670
2     13442
3     11466
4      8119
5     12867
6      9543
7     13266
8     25803
9     35341
10    72433
Name: count, dtype: int64

In [8]:
df['is_positive'] = np.where(df['Rating'] > 5, 1, 0)
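
The threshold `Rating > 5` splits the ten rating levels into two classes. As a quick check, the class sizes can be recomputed from the `value_counts()` output above. The sketch below hard-codes those counts (the dictionary name `rating_counts` is only illustrative) and derives the share of positive reviews, which is also the accuracy of a classifier that always predicts the majority class.

```python
# Rating counts copied from the value_counts() output above.
rating_counts = {
    1: 53670, 2: 13442, 3: 11466, 4: 8119, 5: 12867,
    6: 9543, 7: 13266, 8: 25803, 9: 35341, 10: 72433,
}

# Same rule as np.where(df['Rating'] > 5, 1, 0).
positive = sum(n for rating, n in rating_counts.items() if rating > 5)
negative = sum(n for rating, n in rating_counts.items() if rating <= 5)

positive_share = positive / (positive + negative)
print(positive, negative, round(positive_share, 4))
```

Roughly 61% of the reviews are positive, so any model should be judged against that figure rather than against 50%.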

In [9]:
df.sample(5)

Unnamed: 0,MedicineUsedFor,MedicineBrandName,MedicineGenericName,ReviewDate,UserName,IntakeTime,Reviews,ReviewLength,Rating,NumberOfLikes,ReviewsClean,is_positive
55950,Psoriasis,Raptiva,Not Mentioned,15-May-09,Anonymous,Not Specified,it worked great and should not have been pulle...,64,10,1,work great not pull market,1
146999,Insomnia,Sonata,Zaleplon,25-Jun-08,duwa,Not Specified,amazing. helps me fall asleep. i don't have a ...,207,10,66,amaze help fall asleep problem stay asleep use...,1
121363,Adhd,Amphetamine / Dextroamphetamine,Not Mentioned,22-Jan-22,eric,Taken for 1 to 6 months,i just started taking adderall again after ten...,124,8,0,start take adderall ten years without realize ...,1
123311,Depression,Mirtazapine,Not Mentioned,22-Mar-18,Anonymous,Taken for 1 to 6 months,deeper sleep but not sleeping any more than be...,74,5,3,deeper sleep not sleep start take 15mg,0
168587,Hepatitis C,Harvoni,Not Mentioned,28-Aug-16,ShazDog,Not Specified,i contracted hep c 1a thirty years ago.,39,10,36,contract hep c 1a thirty years ago,1


In [10]:
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
sentiments = df['Reviews'].apply(lambda x: sid.polarity_scores(x))

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [11]:
sentiments = pd.DataFrame(sentiments.tolist())

In [12]:
sentiments

Unnamed: 0,neg,neu,pos,compound
0,0.000,0.580,0.420,0.4404
1,0.000,0.947,0.053,0.2289
2,0.390,0.610,0.000,-0.9402
3,0.191,0.809,0.000,-0.5423
4,0.000,0.640,0.360,0.6697
...,...,...,...,...
255945,0.248,0.752,0.000,-0.6808
255946,0.213,0.686,0.100,-0.5574
255947,0.063,0.863,0.074,0.2731
255948,0.034,0.816,0.151,0.7650


In [13]:
df = pd.concat([df, sentiments], axis=1)
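
The `compound` column is VADER's normalized overall score in the range -1 to 1. A common convention, assumed here rather than taken from this notebook, treats scores of at least 0.05 as positive and at most -0.05 as negative. The sketch below applies that rule to the five `compound` values and ratings shown for rows 0 to 4 above and measures how often it agrees with `is_positive`. The function name `vader_label` is hypothetical.

```python
def vader_label(compound, threshold=0.05):
    """Map a VADER compound score to 1 (positive), 0 (negative) or None (neutral)."""
    if compound >= threshold:
        return 1
    if compound <= -threshold:
        return 0
    return None

# (compound, Rating) pairs for rows 0-4, copied from the outputs above.
rows = [(0.4404, 9), (0.2289, 9), (-0.9402, 8), (-0.5423, 9), (0.6697, 10)]

vader_labels = [vader_label(c) for c, _ in rows]
rating_labels = [1 if rating > 5 else 0 for _, rating in rows]

# Fraction of rows where the lexicon-based label matches the rating-based label.
agreement = sum(v == r for v, r in zip(vader_labels, rating_labels)) / len(rows)
print(vader_labels, rating_labels, agreement)
```

Rows 2 and 3 are rated 8 and 9 but receive negative compound scores, which suggests that lexicon scores alone are a noisy proxy for the rating.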

In [14]:
df.sample(5)

Unnamed: 0,MedicineUsedFor,MedicineBrandName,MedicineGenericName,ReviewDate,UserName,IntakeTime,Reviews,ReviewLength,Rating,NumberOfLikes,ReviewsClean,is_positive,neg,neu,pos,compound
234245,Birth Control,Loryna,Drospirenone / Ethinyl Estradiol,7-Jul-18,Boano,Taken for 1 to 6 months,i have been on a generic form of yaz for about...,245,1,2,generic form yaz 3 years switch vestura nikki ...,0,0.107,0.893,0.0,-0.5873
84098,Birth Control,Depo-Provera,Not Mentioned,18-Sep-19,MiMi,Not Specified,i'm sharing this to everyone who is considerin...,964,2,9,share everyone consider depo provera shoot ble...,0,0.116,0.781,0.103,-0.5417
87300,Bladder Infection,Macrobid,Not Mentioned,19-Jan-14,mommy...,Taken for less than 1 month,had a severe bladder infection..passing blood ...,672,8,77,severe bladder infection pass blood clot urine...,1,0.099,0.831,0.07,-0.3025
35162,Acne,Yaz,Drospirenone / Ethinyl Estradiol,13-Dec-16,Anonymous,Not Specified,i've had mild to moderate acne since i was 14 ...,721,1,16,mild moderate acne since 14 24 10 yrs try ever...,0,0.198,0.732,0.069,-0.9705
15722,Abnormal Uterine Bleeding,Depo-Provera,Not Mentioned,10-Oct-19,End...,Taken for 1 to 6 months,i'm so over this shot! i got this in mid-septe...,416,1,9,shoot get mid september wait exit body many di...,0,0.229,0.737,0.034,-0.9539


## Modelling Cleaned Reviews with Deep Learning

In [15]:
train_df = df[['ReviewsClean', 'is_positive']]

In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_df['ReviewsClean'], train_df['is_positive'],
    test_size = 0.2, random_state = 100)

In [17]:
y_train.value_counts()

is_positive
1    125041
0     79719
Name: count, dtype: int64

In [18]:
y_test.value_counts()

is_positive
1    31345
0    19845
Name: count, dtype: int64

In [19]:
import tensorflow as tf

from keras import Input
from keras.models import Sequential
from keras.layers import (
    TextVectorization, Embedding, LSTM, Dense, Bidirectional, Dropout)

from keras.optimizers import Adam
from keras.regularizers import L1, L2, L1L2

from transformers import TFAutoModelForSequenceClassification

### Shallow Neural Network

In [20]:
max_tokens   = 7500
input_length = 128
output_dim   = 128

vectorizer_layer = TextVectorization(
    max_tokens  = max_tokens,
    output_mode = 'int',
    standardize = 'lower_and_strip_punctuation',
    output_sequence_length = input_length
)

vectorizer_layer.adapt(X_train)

embedding_layer = Embedding(
    input_dim    = max_tokens,
    output_dim   = output_dim,
    input_length = input_length
)

In [21]:
# Define model
model = Sequential([
    Input(shape=(1,), dtype=tf.string),
    vectorizer_layer,
    embedding_layer,
    Dense(1, activation='sigmoid')
])

# Compile model
model.compile(optimizer=Adam(learning_rate=0.001),
    loss='binary_crossentropy', metrics=['accuracy'])

model.summary()

# Fit model
model.fit(X_train, y_train, epochs=5)

# Evaluate accuracy on the test set
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f'Test set accuracy: {test_acc}')

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVe  (None, 128)               0         
 ctorization)                                                    
                                                                 
 embedding (Embedding)       (None, 128, 128)          960000    
                                                                 
 dense (Dense)               (None, 128, 1)            129       
                                                                 
Total params: 960129 (3.66 MB)
Trainable params: 960129 (3.66 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test set accuracy: 0.6149800419807434
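
The summary above shows the `Dense` layer producing an output of shape `(None, 128, 1)`, that is, one sigmoid value per token position rather than one per review. A common remedy, not used in this notebook, is to pool over the token axis before the final layer (in Keras, for example, with `GlobalAveragePooling1D`). The NumPy sketch below only illustrates the shapes involved, using random stand-in arrays instead of trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# seq_len and emb_dim mirror input_length and output_dim from the cell above.
batch, seq_len, emb_dim = 4, 128, 128
embedded = rng.normal(size=(batch, seq_len, emb_dim))  # stand-in for the Embedding output
kernel = rng.normal(size=(emb_dim, 1))                 # stand-in for the Dense(1) weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Without pooling: Dense acts on the last axis only, giving one value per token.
per_token = sigmoid(embedded @ kernel)
print(per_token.shape)

# With average pooling over the token axis: one value per review.
pooled = embedded.mean(axis=1)
per_review = sigmoid(pooled @ kernel)
print(per_review.shape)

# Parameter counts reported in the summary: embedding table and Dense(1).
print(7500 * emb_dim, emb_dim * 1 + 1)
```

The same consideration applies to any model that stacks `Dense` layers directly on the embedding output.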


### Multi-Layer Deep Text Classification Model

In [22]:
# Define model
model_reg = Sequential([
    Input(shape=(1,), dtype=tf.string),
    vectorizer_layer,
    embedding_layer,
    Dense(128, activation='relu',
         kernel_regularizer=L1(l1=0.0005)),
    Dropout(rate=0.6),
    Dense( 64, activation='relu',
         kernel_regularizer=L1L2(l1=0.0005, l2=0.0005)),
    Dense( 32, activation='relu',
         kernel_regularizer=L2(l2=0.0005)),
    Dense( 16, activation='relu',
         kernel_regularizer=L2(l2=0.0005)),
    Dense(  8, activation='relu',
         kernel_regularizer=L2(l2=0.0005)),
    Dense(  1, activation='sigmoid')
])

# Compile model
model_reg.compile(
    optimizer=Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy'])

model_reg.summary()

# Fit model
model_reg.fit(X_train, y_train, epochs=5)

# Evaluate accuracy on the test set
reg_test_loss, reg_test_acc = model_reg.evaluate(X_test, y_test)
print(f'Test set accuracy: {reg_test_acc}')

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVe  (None, 128)               0         
 ctorization)                                                    
                                                                 
 embedding (Embedding)       (None, 128, 128)          960000    
                                                                 
 dense_1 (Dense)             (None, 128, 128)          16512     
                                                                 
 dropout (Dropout)           (None, 128, 128)          0         
                                                                 
 dense_2 (Dense)             (None, 128, 64)           8256      
                                                                 
 dense_3 (Dense)             (None, 128, 32)           2080      
                                                      

In [23]:
# Scratch work for the dashboard: rating factor used for ranking

# rating_factor = df.groupby(['MedicineBrandName', 'MedicineUsedFor']).agg(
#     avg_rating = ('Rating', lambda x: np.mean(x)),
#     std_rating = ('Rating', lambda x: np.std(x)),
#     count_reviews = ('Reviews', lambda x: x.count())
# ).reset_index()

# rating_factor.sort_values(
#     ['count_reviews', 'avg_rating', 'std_rating'],
#     ascending=[False, False, True])[:50]

# x = (rating_factor['avg_rating'] / rating_factor['std_rating'])

# rating_factor['impact_factor'] = (rating_factor['count_reviews'] *
#                                   (1 - np.exp(-x)) / (1 + np.exp(-x)))

# rating_factor.sort_values(
#     ['count_reviews', 'avg_rating', 'impact_factor'],
#     ascending=False).loc[rating_factor['MedicineUsedFor'] ==
#         'Weight Loss (Obesity/Overweight)'][:25]

# rating_factor.sort_values(
#     ['count_reviews', 'avg_rating', 'impact_factor'],
#     ascending=False)[:25]

# rating_factor.sort_values(
#     ['count_reviews', 'avg_rating', 'impact_factor'],
#     ascending=False).loc[rating_factor['MedicineUsedFor'] == 'Birth Control'][:25]
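
The commented-out `impact_factor` above multiplies the review count by `(1 - exp(-x)) / (1 + exp(-x))`, where `x` is the mean rating divided by its standard deviation. That fraction is algebraically equal to `tanh(x / 2)`. A small sketch with made-up numbers checks the identity (the function name and inputs are illustrative, not taken from the dataset):

```python
import math

def impact_factor(count_reviews, avg_rating, std_rating):
    """Review count scaled by tanh(x / 2), with x = avg_rating / std_rating."""
    x = avg_rating / std_rating
    return count_reviews * math.tanh(x / 2)

# Check the identity against the original expression for a hypothetical drug.
count, avg, std = 100, 8.0, 2.0
x = avg / std
original = count * (1 - math.exp(-x)) / (1 + math.exp(-x))
print(impact_factor(count, avg, std), original)
```

In this pure-Python version a standard deviation of zero (for example, a group with a single review) raises `ZeroDivisionError`, so such groups would need separate handling.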

### Multi-Layer Bidirectional LSTM Model

In [24]:
# Define model
# Some tweaks:
## The model was originally meant to use activation='elu'.
## However, an LSTM with that activation does not meet the
## criteria for the fast (cuDNN) kernel on Colab's GPU.

## Therefore, all 'elu' are changed into 'tanh'.

model_ml_bi_lstm = Sequential([

    Input(shape=(1,), dtype=tf.string),

    vectorizer_layer,
    embedding_layer,

    Bidirectional(LSTM(128,
        activation='tanh',
        return_sequences=True)),
    Bidirectional(LSTM(128,
        activation='tanh',
        return_sequences=True)),
    Bidirectional(LSTM(64,
        activation='tanh')),

    Dense( 64, activation='tanh',
         kernel_regularizer=L1L2(
             l1=0.0001, l2=0.0001)),
    Dense( 32, activation='tanh',
         kernel_regularizer=L2(l2=0.0001)),
    Dense(  8, activation='tanh',
         kernel_regularizer=L2(l2=0.0005)),
    Dense(  8, activation='tanh',
         kernel_regularizer=L2(l2=0.0005)),
    Dense(  8, activation='tanh'),
    Dense(  4, activation='tanh'),

    Dense(  1, activation='sigmoid')

])

# Compile model
model_ml_bi_lstm.compile(optimizer=Adam(learning_rate=0.0001),
              loss='binary_crossentropy', metrics=['accuracy'])

model_ml_bi_lstm.summary()

# Fit
model_ml_bi_lstm.fit(X_train, y_train, epochs=5)

# Evaluate
bi_lstm_test_loss, bi_lstm_test_acc = model_ml_bi_lstm.evaluate(X_test, y_test)
print(f'Test set accuracy: {bi_lstm_test_acc}')

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVe  (None, 128)               0         
 ctorization)                                                    
                                                                 
 embedding (Embedding)       (None, 128, 128)          960000    
                                                                 
 bidirectional (Bidirection  (None, 128, 256)          263168    
 al)                                                             
                                                                 
 bidirectional_1 (Bidirecti  (None, 128, 256)          394240    
 onal)                                                           
                                                                 
 bidirectional_2 (Bidirecti  (None, 128)               164352    
 onal)                                                
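
The parameter counts in the summary above can be reproduced by hand. An LSTM layer has four gates, each with an input kernel, a recurrent kernel and a bias, so one direction holds `4 * (units * (input_dim + units) + units)` parameters, and `Bidirectional` doubles that. The helper name below is illustrative.

```python
def bidirectional_lstm_params(input_dim, units):
    """Parameter count of Bidirectional(LSTM(units)) for a given input width."""
    one_direction = 4 * (units * (input_dim + units) + units)
    return 2 * one_direction

# Input widths: 128 from the embedding, then 256 from each preceding
# bidirectional layer (2 * 128 units).
first = bidirectional_lstm_params(128, 128)
second = bidirectional_lstm_params(256, 128)
third = bidirectional_lstm_params(256, 64)
print(first, second, third)
```

These evaluate to 263168, 394240 and 164352, matching the three `bidirectional` rows in the summary.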

## Building a Transformer Model: `distilbert-base-uncased`

In [25]:
import os

# !pip install --upgrade transformers
# !pip install tf-keras
# os.environ['TF_USE_LEGACY_KERAS'] = '1'

In [26]:
import transformers
from transformers import DistilBertTokenizer

print(transformers.__version__)

4.42.4


In [27]:
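# Findings: Characteristics of 'distilbert-base-uncased':
## model_max_length = 512       ## vocab_size = 30522
## DistilBertTokenizer is the Python ("slow") tokenizer, so the output
## below reports is_fast=False; the Rust-based, more efficient variant
## is DistilBertTokenizerFast (is_fast=True).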

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
tokenizer

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DistilBertTokenizer(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [28]:
# Max length: 128 tokens per review.
# Truncation: text is cut to max_length when it is longer.
# Padding   : shorter sequences are filled with padding tokens up to
#             the longest sequence in the call (at most 128), so all
#             sequences have the same length.

# Tokenize both training and test data
train_encodings = tokenizer(list(X_train),
  max_length=128, truncation=True, padding=True)
test_encodings = tokenizer(list(X_test),
  max_length=128, truncation=True, padding=True)
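
To make the effect of `max_length`, `truncation` and `padding` concrete, here is a simplified pure-Python illustration of what happens to one sequence of token ids. It is not the library's implementation; it only mimics the result for a single sequence padded to a fixed length, using the special-token ids shown in the tokenizer output above (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0). The function name and the example ids are made up.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def encode_fixed_length(content_ids, max_length):
    """Wrap ids in [CLS]/[SEP], truncate to max_length, then right-pad."""
    # Truncate the content so the two special tokens still fit.
    kept = content_ids[: max_length - 2]
    input_ids = [CLS_ID] + kept + [SEP_ID]
    attention_mask = [1] * len(input_ids)

    # Pad on the right (padding_side='right') up to max_length.
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

short_ids, short_mask = encode_fixed_length([2147, 2307], max_length=8)
long_ids, long_mask = encode_fixed_length(list(range(1000, 1020)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask marks real tokens with 1 and padding with 0, so the model can ignore the padded positions.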

In [29]:
# Convert the data into TensorFlow datasets for
# more effective computation

train_dataset = tf.data.Dataset.from_tensor_slices(
    ( dict(train_encodings), tf.constant(y_train.values, dtype=tf.int32) )
)

test_dataset  = tf.data.Dataset.from_tensor_slices(
    ( dict( test_encodings), tf.constant(y_test.values, dtype=tf.int32) )
)

# Shuffle the training set only; the test set keeps its original order.
# Batch both to make training and evaluation more efficient.
train_dataset = train_dataset.shuffle(len(X_train)).batch(16)
test_dataset  = test_dataset.batch(16)
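
With a batch size of 16, the number of batches per epoch follows directly from the split sizes shown earlier (204,760 training and 51,190 test reviews). A quick calculation:

```python
import math

batch_size = 16
n_train, n_test = 204760, 51190  # totals of the y_train / y_test value_counts above

# The last batch may be smaller than batch_size, hence the ceiling.
train_steps = math.ceil(n_train / batch_size)
test_steps = math.ceil(n_test / batch_size)
print(train_steps, test_steps)
```

Note that `shuffle(len(X_train))` uses a buffer as large as the training set, which gives a full shuffle at the cost of memory.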

In [30]:
model_distilbert = (
    TFAutoModelForSequenceClassification
        .from_pretrained('distilbert-base-uncased', num_labels=2))

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [33]:
# model_distilbert = (
#     TFAutoModelForSequenceClassification
#         .from_pretrained('distilbert-base-uncased', num_labels=2))

# Compile the model
## In some setups the transformers TF models reject customized
## optimizers (e.g. a non-default learning rate); the string alias
## 'adam' is then the fallback. Here an explicit Adam instance is used.

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
# The model outputs raw logits for the two classes, hence from_logits=True
loss_fn   = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics   = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]

model_distilbert.compile(
    optimizer = optimizer,
    loss      = loss_fn,
    metrics   = metrics
)

model_distilbert.summary()

# Validate on held-out data; validating on train_dataset would only
# repeat the training metrics.
model_distilbert.fit(train_dataset,
    epochs=5, validation_data=test_dataset)

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66955010 (255.41 MB)
Trainable params: 66955010 (255.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 

<tf_keras.src.callbacks.History at 0x7fb31c2e8970>
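
Because the loss is configured with `from_logits=True`, the model returns two raw scores (logits) per review rather than probabilities. A softmax converts them to class probabilities, and the sparse categorical cross-entropy is the negative log of the probability assigned to the true class. The sketch below works through one example with made-up logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax for a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.0, 2.0]   # hypothetical scores for classes 0 (negative) and 1 (positive)
true_label = 1

probs = softmax(logits)
predicted = probs.index(max(probs))   # class with the highest probability
loss = -math.log(probs[true_label])   # cross-entropy for this single example
print(probs, predicted, loss)
```

The predicted class is simply the index of the larger logit, so the softmax is only needed when calibrated probabilities are of interest.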

In [61]:
distilbert_test_loss, distilbert_test_acc = (
  model_distilbert.evaluate(test_dataset)
)
print(f'Test set accuracy: {distilbert_test_acc}')

Test set accuracy: 0.8097870945930481


## Conclusion

We built four text classification models. Their accuracies compare as follows:

| Model | `model` | `model_reg` | `model_ml_bi_lstm` | `model_distilbert` |
|-------|---------|-------------|--------------------|--------------------|
| Training Acc | `61.36%` | `61.07%` | `81.86%` | `92.96%` |
| Test Acc | `61.50%` | `61.23%` | `79.10%` | `80.98%` |

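The pre-trained DistilBERT model from `transformers` reached the highest test accuracy (80.98%), slightly ahead of the bidirectional LSTM (79.10%). Its training accuracy is about 12 points higher than its test accuracy, which points to overfitting. The two dense models scored about 61%, which is roughly the share of positive reviews in the test set (31,345 of 51,190, or 61.2%), so they do little better than always predicting the majority class. The fine-tuned model can be used to classify new drug reviews.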

## Suggestions

Although the project ends here, there is room for further development. Some suggestions for higher accuracy:
* Increase the number of epochs.
* Tune the hyperparameters of the best-performing models, from the number of layers to the optimizer's learning rate.
* Several models were modified so that they could run on Google Colab, such as the bidirectional LSTM, whose activation function was changed from `elu` to `tanh`. Fitting the data with the original configuration would show what accuracy it reaches.
* Experiment with other model types (more LSTM variants or other transformers).