
In [None]:
!wget -O data.zip https://archive.ics.uci.edu/static/public/461/drug+review+dataset+druglib+com.zip

In [None]:
!mkdir -p lab2
!unzip -qq data.zip -d lab2


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import os
from keras.models import Sequential
from keras.layers import Dense, Dropout, SimpleRNN, Embedding, LSTM, GRU, Bidirectional
from keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import classification_report, accuracy_score, balanced_accuracy_score, f1_score, precision_score, recall_score, confusion_matrix, roc_auc_score, ConfusionMatrixDisplay
from sklearn.utils.class_weight import compute_class_weight

In [None]:
df_train = pd.read_csv('lab2/drugLibTrain_raw.tsv', delimiter='\t')
df_test = pd.read_csv('lab2/drugLibTest_raw.tsv', delimiter='\t')

In [None]:
df_train.head()

To make predictions based on the reviews, the three review columns (benefitsReview, sideEffectsReview, and commentsReview) are combined into a single text field.

In [None]:
def combine_reviews(df):
    return (
        df['benefitsReview'].fillna('') + ' ' +
        df['sideEffectsReview'].fillna('') + ' ' +
        df['commentsReview'].fillna('')
    )
X_train_text = combine_reviews(df_train)
X_test_text = combine_reviews(df_test)
print(X_train_text[0])

In the first case, we predict the medication rating as a binary classification task. All ratings less than or equal to 5 are converted to 0, and those above 5 are converted to 1.

In [None]:
def split_rating(df):
    df['rating_category'] = np.where(df['rating'] <= 5, 0, 1)
    return df

df_train = split_rating(df_train)
df_test = split_rating(df_test)

y_train = df_train['rating_category']
y_test = df_test['rating_category']

In [None]:
df_train['rating_category'].value_counts() # class imbalance

Since models can only work with numbers, we need to convert text into numerical form. Using a tokenizer, we split the text into tokens (usually words) and assign them indices based on frequency, with 1 being the most frequent word.
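
To make the frequency-based indexing concrete, here is a minimal sketch in plain Python. It is a simplified stand-in for the Keras Tokenizer, not its actual implementation: the helper names `build_word_index` and `texts_to_sequences` and the toy sentences are assumptions made for illustration, and the sketch only lowercases and splits on whitespace, without the punctuation filtering the real tokenizer applies.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer index, where 1 is the most frequent word."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by descending count; index 0 stays free for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace known words by their index and silently drop unknown words."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

# Hypothetical toy reviews, not taken from the dataset
toy_texts = ["the drug helped", "the drug caused nausea", "the nausea stopped"]
toy_index = build_word_index(toy_texts)
toy_seqs = texts_to_sequences(["the drug stopped working"], toy_index)

print(toy_index)  # "the" appears most often, so it gets index 1
print(toy_seqs)   # "working" was never seen during fitting, so it is dropped
```

Because index 0 is never assigned to a word, it remains available as the padding value, which is why the Embedding layers later use vocab_size + 1 as their input dimension.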

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train_text)

vocab_size = len(tokenizer.word_index)

#print(tokenizer.index_word)
print(f"Vocabulary size: {vocab_size}")

The next step is converting the text into numerical sequences and bringing them to the same length.
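
The following sketch illustrates what bringing sequences to the same length means under the pad_sequences defaults, which pad and truncate at the front of the sequence. The function `pad_pre` is a hypothetical plain-Python helper written for this illustration, not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` tokens of long ones."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # truncation removes tokens from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_padded = pad_pre([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy_padded)  # [[0, 0, 5, 3], [3, 4, 5, 6]]
```

With maxlen=200 and the default settings, any combined review longer than 200 tokens loses its beginning, which is the benefitsReview part given the concatenation order used above.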

In [None]:
X_train_seq = tokenizer.texts_to_sequences(X_train_text)
X_test_seq = tokenizer.texts_to_sequences(X_test_text)

X_train_pad = pad_sequences(X_train_seq, maxlen=200)
X_test_pad = pad_sequences(X_test_seq, maxlen=200)

Next, we define a helper that computes classification metrics on the test set and plots the confusion matrix together with the loss curves recorded during training.
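
As a small illustration of why balanced accuracy is reported next to plain accuracy, the sketch below computes both by hand on made-up labels. The helper functions and the toy data are assumptions for this example; balanced accuracy is taken here as the mean of the per-class recalls.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical imbalanced labels: 8 positive, 2 negative
y_true_toy = [1] * 8 + [0] * 2
# A degenerate classifier that always predicts the majority class
y_pred_toy = [1] * 10

print(accuracy(y_true_toy, y_pred_toy))           # 0.8
print(balanced_accuracy(y_true_toy, y_pred_toy))  # 0.5
```

The always-positive classifier looks reasonable on accuracy alone, while balanced accuracy shows that it is no better than chance on the minority class.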

In [None]:
def evaluate_model(model, history, X_test, y_test, is_binary=True):

    # Predictions
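    y_pred_probs = model.predict(X_test)
    if is_binary:
        y_pred_probs = y_pred_probs.ravel()  # (n, 1) -> (n,) so sklearn receives 1-D scores
        y_pred = (y_pred_probs > 0.5).astype(int)  # class labels 0 or 1
    else:
        y_pred = np.argmax(y_pred_probs, axis=1)  # class labels 0, 1 or 2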

    # Metrics
    acc = accuracy_score(y_test, y_pred)
    bal_acc = balanced_accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    roc = roc_auc_score(y_test, y_pred_probs) if is_binary else roc_auc_score(y_test, y_pred_probs, multi_class='ovr')
    prec = precision_score(y_test, y_pred, average='weighted')
    rec = recall_score(y_test, y_pred, average='weighted')

    print(f"Accuracy: {acc:.3f}")
    print(f"Balanced Accuracy: {bal_acc:.3f}")
    print(f"F1 Score: {f1:.3f}")
    print(f"ROC AUC: {roc:.3f}")
    print(f"Precision: {prec:.3f}")
    print(f"Recall: {rec:.3f}")

    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title("Confusion Matrix")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()

    # Plot Loss Curve
    if 'loss' in history.history:
        plt.plot(history.history['loss'], color='blue', label='Train Loss')
        if 'val_loss' in history.history:  # only recorded when validation_data is passed to fit
            plt.plot(history.history['val_loss'], color='red', label='Validation Loss')
        plt.title('Loss Curve')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.legend()
        plt.show()


In our binary classification task based on text reviews, we use several key components.

First, we apply an Embedding layer to convert words into dense vector representations. This is necessary so the model can work with numerical features rather than raw text, capturing semantic relationships between words.

Next, we use a Bidirectional RNN, which allows the model to analyze both the previous and the next context of each word in the sentence. This is especially important for understanding the meaning of phrases within the text.

At the output, we use a layer with a sigmoid activation function, which returns the probability that a review belongs to the positive class.

Since the dataset may be imbalanced (e.g., more positive reviews than negative ones), we also use compute_class_weight to derive class weights automatically. Errors on the rarer class then contribute more to the loss, which helps the model learn both classes fairly.
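
The 'balanced' weighting can be reproduced by hand with the formula n_samples / (n_classes * class_count). The sketch below does this in plain Python; `balanced_class_weights` and the toy label counts are assumptions for illustration and do not reflect the actual class distribution of this dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Hypothetical split: 75 positive and 25 negative reviews
toy_labels = [1] * 75 + [0] * 25
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)  # class 0 gets weight 2.0, class 1 gets about 0.667
```

The rarer class receives the larger weight, and with perfectly balanced classes every weight would equal 1.0.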

In [None]:
def train_binary_model(model_name, RNNLayer):
    print(f"Training: {model_name}")
    model = Sequential([
        Embedding(vocab_size + 1, 100),
        Bidirectional(RNNLayer(64, return_sequences=True)),
        Dropout(0.3),
        Bidirectional(RNNLayer(64)), # second layer
        Dropout(0.3),
        Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
    class_weights = dict(enumerate(class_weights))

    # Note: the test set doubles as validation data here, so val_loss is not an independent check
    history = model.fit(
        X_train_pad, y_train,
        epochs=10, batch_size=128,
        validation_data=(X_test_pad, y_test),
        class_weight=class_weights,
        verbose=1,
    )

    evaluate_model(model, history, X_test_pad, y_test, is_binary=True)


In [None]:
train_binary_model("SimpleRNN", SimpleRNN)

In [None]:
train_binary_model("LSTM", LSTM)

In [None]:
train_binary_model("GRU", GRU)

Multiclass classification (Effectiveness)
We reduce the number of classes to three levels of effectiveness: Low, Moderate, and High.

In [None]:
def map_effectiveness(x):
    if x in ['Highly Effective', 'Considerably Effective']:
        return 'High'
    elif x in ['Moderately Effective', 'Marginally Effective']:
        return 'Moderate'
    else:
        return 'Low'

eff_map = {'Low': 0, 'Moderate': 1, 'High': 2}

df_train['eff_label'] = df_train['effectiveness'].apply(map_effectiveness).map(eff_map)
df_test['eff_label'] = df_test['effectiveness'].apply(map_effectiveness).map(eff_map)
df_train['eff_label'].value_counts()

The effectiveness labels (eff_label), encoded as integers 0, 1, and 2, are converted into one-hot vectors using to_categorical.
Each label becomes a vector of length 3, with a 1 at the position of the class index and 0s elsewhere.
This is the format that the categorical_crossentropy loss used for training expects. The integer labels are kept as well, because the sklearn metrics in evaluate_model work on class indices.
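
For illustration, one-hot encoding can be sketched in a few lines of plain Python. The `one_hot` helper below is a hypothetical stand-in that mirrors what to_categorical produces for integer labels; it is not the Keras implementation.

```python
def one_hot(labels, num_classes):
    """Turn integer class indices into one-hot rows of length num_classes."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the class index
        rows.append(row)
    return rows

toy_one_hot = one_hot([0, 2, 1], num_classes=3)
print(toy_one_hot)  # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```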

In [None]:
y_train_eff = df_train['eff_label']
y_test_eff = df_test['eff_label']
y_train_eff_cat = to_categorical(df_train['eff_label'], num_classes=3)
y_test_eff_cat = to_categorical(df_test['eff_label'], num_classes=3)

print(y_train_eff[0])
print(y_train_eff_cat[0])

In [None]:
y_test_labels = df_test['eff_label'].values

In this multiclass classification task we use a sequential neural model.

As before, we use an Embedding layer and a Bidirectional RNN, this time with a single recurrent layer. Dropout is applied inside the recurrent layer through its dropout (0.3, on the layer inputs) and recurrent_dropout (0.2, on the recurrent state) parameters to reduce overfitting. Additionally, a separate Dropout layer randomly deactivates some units during training to further limit overfitting.

The output layer is a Dense layer with softmax activation, which returns probabilities for each of the three classes.
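
As a rough sketch of what the softmax output represents, the snippet below turns three raw scores into probabilities and picks the most likely class, which is the same argmax step that evaluate_model applies to the model output. The logits are made-up numbers and the `softmax` helper is written here only for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the classes Low (0), Moderate (1), High (2)
probs = softmax([0.5, 1.0, 2.5])
predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability

print([round(p, 3) for p in probs])
print(predicted)  # 2, the class with the largest score
```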

In [None]:
def train_multiclass_model(name, RNNLayer):
    print(f"Training Multiclass: {name}")
    model = Sequential([
        Embedding(vocab_size + 1, 100),
        #Bidirectional(RNNLayer(64, return_sequences=True, dropout=0.3, recurrent_dropout=0.2)),
        Bidirectional(RNNLayer(64, dropout=0.3, recurrent_dropout=0.2)),
        Dropout(0.3),
        Dense(3, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    history = model.fit(
        X_train_pad, y_train_eff_cat,
        epochs=10, batch_size=128,
        validation_data=(X_test_pad, y_test_eff_cat),
        verbose=1,
    )

    # evaluate_model runs the prediction itself, so no separate predict call is needed here
    evaluate_model(model, history, X_test_pad, y_test_labels, is_binary=False)


In [None]:
train_multiclass_model("SimpleRNN Effectiveness", SimpleRNN)

In [None]:
train_multiclass_model("LSTM Effectiveness", LSTM)

In [None]:
train_multiclass_model("GRU Effectiveness", GRU)

In conclusion, when comparing the SimpleRNN, LSTM, and GRU architectures, LSTM tends to perform better. Its more complex structure, including additional gates, gives it finer control over the flow of information and makes it more effective at capturing long-term dependencies in text. Despite this advantage, overall performance remains modest: with LSTM achieving around 65% accuracy, the model still misclassifies approximately 35% of the cases. The model also shows signs of overfitting, learning the training data well but struggling to generalize to new, unseen data.

This performance limitation may be attributed to several factors: limited or imbalanced training data, noisy or unstructured text (e.g., spelling errors, informal language), insufficient preprocessing, overly simple model architecture, or the use of randomly initialized embeddings rather than pretrained ones.

To improve results, several strategies could be applied: incorporating pretrained word embeddings (such as GloVe or Word2Vec), expanding and cleaning the dataset, applying regularization techniques (like dropout or L2), fine-tuning hyperparameters, and ensuring stratified sampling to preserve class balance during training. These enhancements could help the model generalize better and increase classification performance.