# Drug Recommendation System

### Problem Statement:

Build a Drug recommender system that recommends the most effective drug for the given condition based on the reviews of various drugs used for that condition.

### Data Overview

- The dataset used is ‘UCI ML Drug Review Dataset’.
- Data source: https://www.kaggle.com/datasets/jessicali9530/kuc-hackathon-winter-2018
- The data is provided in two files:
  - <b>drugsComTrain_raw.csv</b> contains 7 columns: uniqueID, drugName, condition, review, rating, date, usefulCount
  - <b>drugsComTest_raw.csv</b> contains the same columns
- The Train dataset has 161297 rows and the Test dataset has 53766 rows.


### Mapping the real-world problem to ML problem

<b>Objective</b>: Analyse a review and decide whether it is positive or negative.
To determine the most effective drug to recommend, a recommendation score must be determined based on the classification of the reviews.

We can use the given ratings to classify the reviews. The target feature is created by classifying the reviews as positive with ratings 6-10 and negative with ratings 1-5.

#### Type of Machine Learning Problem

Here the reviews need to be classified into Positive and Negative classes. Hence it is a Binary Classification problem, for which logistic regression is one suitable model.


#### Performance metrics

- F1 Score, Precision, Recall and Confusion Matrix
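
As a rough illustration of how these metrics are computed, the sketch below uses scikit-learn on a handful of made-up labels. `y_true` and `y_pred` are hypothetical values, not results from this dataset.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = positive review, 0 = negative review
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)        # rows: actual class, columns: predicted class

print(precision, recall, f1)
print(cm)
```

Because the positive class dominates this dataset, precision and recall for the negative class are worth inspecting alongside the overall F1 score.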

## Exploratory Data Analysis

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from wordcloud import WordCloud
from wordcloud import STOPWORDS
import nltk
import regex as re
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from tqdm import tqdm

In [2]:
data_train = pd.read_csv(r"C:\Users\LENOVO\Desktop\drugsComTrain_raw.csv")
data_test = pd.read_csv(r"C:\Users\LENOVO\Desktop\drugsComTest_raw.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\LENOVO\\Desktop\\drugsComTrain_raw.csv'

In [None]:
print('Size of Train dataset is:',data_train.shape)
print('Size of Test dataset is:',data_test.shape)

In [None]:
data_train.values.shape[0] / data_test.values.shape[0]

In [None]:
print('Columns of the dataset are:\n',data_train.columns)

In [None]:
print('Overview of Train dataset:\n')
data_train.head(5)

In [None]:
print('Overview of Test dataset:\n')
data_test.head(5)

- The Train and Test datasets have the same features. Both need to be preprocessed and both need the target variable.
- So we can concatenate the datasets, then preprocess them and create the target variable in one pass.
- Before modelling, the combined data can be split into train and test sets.
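
The later split could look roughly like the sketch below. It assumes scikit-learn's `train_test_split` with stratification on the target. The `toy` frame, the 70/30 ratio and the random seed are illustrative choices, not values from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the combined, preprocessed data
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "rating": [10, 9, 8, 1, 2, 7, 10, 3, 9, 6],
})
toy["review_sentiment"] = (toy["rating"] > 5).astype(int)

# stratify keeps the positive/negative ratio similar in both splits
train_df, test_df = train_test_split(
    toy, test_size=0.3, stratify=toy["review_sentiment"], random_state=42
)
print(train_df.shape, test_df.shape)
```

Stratifying matters here because the classes are imbalanced; a plain random split could leave the test set with very few negative reviews.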

In [None]:
data = pd.concat([data_train,data_test])
print('The size of the combined data is:',data.shape)

In [None]:
# resetting the index after concatenation
data.reset_index(inplace=True,drop=True)
data.tail()

In [None]:
#checking for data types in the dataset
data.dtypes

We have 3 numerical and 4 categorical (object-typed) features.

In [None]:
#checking the description of the data
data.describe()

The average rating is ~7 and the average usefulCount (upvotes) is ~28.

In [None]:
#checking for Null values
data.isnull().any()

Only the condition feature has null values in the dataset.

In [None]:
#checking for the number of null values and percentage in given dataset

null_size = data.isnull().sum()['condition']
print('Total null values are:',null_size)
data_size = data.shape[0]
print('Percentage of null values are:',(null_size/data_size)*100)

- Null values make up about 0.5% of the combined Train and Test data points, so we can drop the rows that contain them.

In [None]:
# dropping the rows with null values

data = data.dropna(axis=0)
print('Size of the dataset after dropping null values:',data.shape)

In [None]:
# checking for number of unique conditions

print('Number of unique conditions are:',data['condition'].unique().shape[0])

In [None]:
data['condition'].unique()

In [None]:
# plotting the top 10 conditions
conditions = dict(data['condition'].value_counts())
top_conditions = list(conditions.keys())[0:10]
values = list(conditions.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style='darkgrid')
sns.barplot(x=top_conditions,y=values,palette='summer')
plt.title('Top 10 Conditions')
plt.xlabel('Conditions')
plt.ylabel('Count')
plt.show()

- This plot shows that Birth Control is the most frequently reviewed condition in the dataset, followed by Depression, Pain, Anxiety and so on.

In [None]:
# plotting number of drugs for top 10 condition
val=[]
for c in list(conditions.keys()):
    val.append(data[data['condition']==c]['drugName'].nunique())

drug_cond = dict(zip(list(conditions.keys()),val))

top_conditions = list(drug_cond.keys())[0:10]
values = list(drug_cond.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style='darkgrid')
sns.barplot(x=top_conditions,y=values,palette='summer')
plt.title('Number of Drugs for each Top 10 Conditions')
plt.xlabel('Conditions')
plt.ylabel('Count of Drugs')
plt.show()

- Patients use multiple drugs for each condition.
- Among the top 10 conditions, Pain and Birth Control have the highest number of different drugs.
- This shows why it is necessary to analyze and recommend the most effective drug for each condition from the available drugs.
- There are a few conditions for which patients use only one drug.
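
The per-condition drug counts, including the single-drug conditions, can also be obtained without an explicit loop. The sketch below uses a small made-up frame (`toy`) in place of `data`; the rows are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for data[['condition', 'drugName']]
toy = pd.DataFrame({
    "condition": ["Pain", "Pain", "Pain", "Acne", "Acne", "Cough"],
    "drugName": ["Tramadol", "Ibuprofen", "Tramadol", "Doxycycline", "Tretinoin", "Benzonatate"],
})

# number of distinct drugs reviewed for each condition
drugs_per_condition = (
    toy.groupby("condition")["drugName"].nunique().sort_values(ascending=False)
)

# conditions for which only one drug appears in the reviews
single_drug_conditions = drugs_per_condition[drugs_per_condition == 1].index.tolist()
print(drugs_per_condition)
print(single_drug_conditions)
```

For conditions with a single drug there is nothing to rank, so a recommender may want to treat them separately.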

In [None]:
#plotting the most used drug for Birth Control

drugs_birth = dict(data[data['condition']=='Birth Control']['drugName'].value_counts())

top_drugs = list(drugs_birth.keys())[0:10]
values = list(drugs_birth.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style='darkgrid')
sns.barplot(x=values,y=top_drugs,palette='summer')
plt.title('Top 10 Drugs used for Birth Control')
plt.ylabel('Drug Names')
plt.xlabel('Count of Patients used')
plt.show()

- This plot shows that even when a condition has a wide variety of drugs, only a few of them account for most of the usage. Etonogestrel is the drug most used by patients.

In [None]:
#plotting the most used drug for Pain

drugs_pain = dict(data[data['condition']=='Pain']['drugName'].value_counts())
top_drugs = list(drugs_pain.keys())[0:10]
values = list(drugs_pain.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style='darkgrid')
sns.barplot(x=values,y=top_drugs,palette='summer')
plt.title('Top 10 Drugs used for Pain')
plt.ylabel('Drug Names')
plt.xlabel('Count of Patients used')
plt.show()

- Unlike the plot above, here each of the top drugs has a substantial usage count.

In [None]:
# plotting the top 10 drugs rated as 10
drugs_rating = dict(data[data['rating']==10]['drugName'].value_counts())

top_drugs = list(drugs_rating.keys())[0:10]
values = list(drugs_rating.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style='darkgrid')
sns.barplot(x=values,y=top_drugs,palette='summer')
plt.title('Top 10 Drugs rated as 10')
plt.ylabel('Drug Names')
plt.xlabel('Count of Ratings')
plt.show()

- Birth Control and Weight Loss/Obesity drugs are top rated.

In [None]:
# plotting the top 10 drugs rated as 1
drugs_rating = dict(data[data['rating']==1]['drugName'].value_counts())

top_drugs = list(drugs_rating.keys())[0:10]
values = list(drugs_rating.values())[0:10]
plt.figure(figsize=(16,8))
sns.set_style(style='darkgrid')
sns.barplot(x=values,y=top_drugs,palette='summer')
plt.title('Top 10 Drugs rated as 1')
plt.ylabel('Drug Names')
plt.xlabel('Count of Ratings')
plt.show()

- Two drugs, Levonorgestrel and Etonogestrel, appear in the top 10 for both rating '10' and rating '1'.
- This implies that for some patients these drugs were not effective or caused severe side effects, which led to low ratings.
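
Rather than comparing the two plots by eye, the overlap can be computed with a set intersection. This is a minimal sketch on made-up data; `top_drugs_for_rating` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Made-up rows standing in for data[['drugName', 'rating']]
toy = pd.DataFrame({
    "drugName": ["A", "A", "A", "B", "B", "C", "A", "B", "D"],
    "rating":   [10, 10, 1, 10, 1, 10, 1, 9, 1],
})

def top_drugs_for_rating(df, rating, n=10):
    """Return the n drugs with the most reviews at the given rating."""
    return df.loc[df["rating"] == rating, "drugName"].value_counts().head(n).index

top_at_10 = set(top_drugs_for_rating(toy, 10))
top_at_1 = set(top_drugs_for_rating(toy, 1))

# drugs that are among the most frequent at both extremes
in_both = sorted(top_at_10 & top_at_1)
print(in_both)
```

Heavily reviewed drugs tend to appear in both lists simply because they have many reviews, which is one reason raw counts alone are not a good effectiveness signal.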

In [None]:
#plotting the distribution of ratings
f,ax = plt.subplots(1,2,figsize=(16,8))
ax1= sns.histplot(data['rating'],ax=ax[0])
ax1.set_title('Count of Ratings')
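# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
ax2= sns.histplot(data['rating'],kde=True,stat='density',ax=ax[1])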
ax2.set_title('Distribution of Ratings density')
plt.show()

- Most reviews carry a rating of 10, 9, 8 or 1.

In [None]:
#plotting the percentage distribution of ratings using pie chart

ratings_count = dict(data['rating'].value_counts())
count = list(ratings_count.values())
labels = list(ratings_count.keys())
plt.figure(figsize=(18,9))
plt.pie(count,labels=labels, autopct='%1.1f%%')
plt.title('Pie Chart Representation of Ratings')
plt.legend(title='Ratings')
plt.show()

We can see that ~75% of the reviews carry a rating of 10, 9, 8 or 1.

In [None]:
# changing to datetime format

data['date']= pd.to_datetime(data['date'])

In [None]:
#checking for ratings given in each year

year_ratings = dict(data['date'].dt.year.value_counts())
years = list(year_ratings.keys())
values = list(year_ratings.values())
plt.figure(figsize=(18,9))
sns.barplot(x=years,y=values,palette='summer')
plt.xlabel('Years')
plt.ylabel('Count of Ratings')
plt.title('Count of Ratings in each Year')
plt.show()

- Patients started giving more reviews and ratings from 2015 onwards.
- We need to analyze whether the date of the review has any impact on predicting the review sentiment.
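
One simple way to start that analysis is to look at the share of positive reviews per year. The sketch below assumes the same rating threshold used for the target (rating above 5 is positive) and runs on made-up rows.

```python
import pandas as pd

# Made-up rows standing in for data[['date', 'rating']]
toy = pd.DataFrame({
    "date": ["2012-05-20", "2015-04-27", "2015-12-14", "2016-11-03"],
    "rating": [9, 2, 8, 1],
})
toy["date"] = pd.to_datetime(toy["date"])
toy["year"] = toy["date"].dt.year
toy["review_sentiment"] = (toy["rating"] > 5).astype(int)

# fraction of positive reviews in each year
positive_share_by_year = toy.groupby("year")["review_sentiment"].mean()
print(positive_share_by_year)
```

If the share stays roughly flat across years, the year is unlikely to add much signal; a clear trend would suggest keeping it as a feature.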

In [None]:
#checking the distribution of usefulCount feature

plt.figure(figsize=(16,8))
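# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
ax =sns.histplot(data['usefulCount'],kde=True,stat='density')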

plt.title('Distribution of usefulCount')
plt.show()

- Most drug reviews have no more than 200 upvotes.
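
A reading taken from a plot like this can be backed by a direct calculation. The values below are made up and only show the calculation; on the real data `useful` would be `data['usefulCount']`.

```python
import pandas as pd

# Made-up usefulCount values
useful = pd.Series([0, 5, 12, 30, 45, 80, 150, 199, 400, 1291], name="usefulCount")

# fraction of reviews with at most 200 upvotes
share_at_most_200 = (useful <= 200).mean()

# median and 90th percentile give a sense of how long the right tail is
q50, q90 = useful.quantile([0.5, 0.9])
print(share_at_most_200, q50, q90)
```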

### Data Preprocessing

In [None]:
# creating the target feature using ratings
# here 1 represents positive and 0 - represents negative

data['review_sentiment'] = data['rating'].apply(lambda x: 1 if x > 5 else 0)

In [None]:
data.head(5)

In [None]:
# Plotting the pie chart for review sentiments

plt.figure(figsize=(14,7))
plt.pie(data['review_sentiment'].value_counts(),labels=['Positive','Negative'],autopct='%1.1f%%')
plt.title('Pie Chart representation of Review Sentiment')
plt.show()

- Positive reviews make up 70% of the data, so the dataset is imbalanced.
- The minority class needs to be oversampled to overcome the problems of imbalanced data.
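
A minimal sketch of random oversampling is shown below, assuming `sklearn.utils.resample` and a made-up training frame. Other techniques (for example SMOTE on vectorized features) would also fit. Oversampling is normally applied to the training split only, so that duplicated rows do not leak into the test set.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up training rows; 1 = positive, 0 = negative
train = pd.DataFrame({
    "cleaned_review": ["work great", "no side effect", "help lot", "love it",
                       "bad headach", "made sick"],
    "review_sentiment": [1, 1, 1, 1, 0, 0],
})

majority = train[train["review_sentiment"] == 1]
minority = train[train["review_sentiment"] == 0]

# sample the minority class with replacement until it matches the majority size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# combine and shuffle
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["review_sentiment"].value_counts())
```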

<b>Building the word cloud for Positive and Negative Reviews</b>

In [None]:
# word cloud for positive reviews

positive_reviews = " ".join([review for review in data['review'][data['review_sentiment'] == 1]])


stop_words = set(STOPWORDS)

wordcloud = WordCloud(width = 1200, height = 800,background_color ='white',stopwords = stop_words,min_font_size = 10).generate(positive_reviews)

# plot the WordCloud image
plt.figure(figsize = (12, 8), facecolor = None)
plt.imshow(wordcloud)
plt.title('WordCloud for positive reviews')
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()

In [None]:
# word cloud for negative reviews

negative_reviews = " ".join([review for review in data['review'][data['review_sentiment'] == 0]])

wordcloud = WordCloud(width = 1200, height = 800,background_color ='white',stopwords = stop_words,min_font_size = 10).generate(negative_reviews)

# plot the WordCloud image
plt.figure(figsize = (12, 8), facecolor = None)
plt.imshow(wordcloud)
plt.title('WordCloud for negative reviews')
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()

- In the word clouds, the most frequent words are largely the same in the positive and negative reviews.
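
The overlap between the two word clouds can be quantified by intersecting the most frequent words of each class. The sketch below uses a few made-up reviews and a hypothetical helper `top_words`; it does no stop-word removal, which the real analysis would need.

```python
from collections import Counter

# Made-up reviews for each class
positive = ["it works great no side effects", "works great for pain", "great relief no pain"]
negative = ["side effects were bad", "bad pain and bad side effects", "no relief bad pain"]

def top_words(reviews, n):
    """Return the set of the n most frequent lower-cased words."""
    counts = Counter(word for review in reviews for word in review.lower().split())
    return {word for word, _ in counts.most_common(n)}

# words that are frequent in both classes carry little sentiment signal
shared = top_words(positive, 4) & top_words(negative, 4)
print(shared)
```

Words that dominate both classes, such as condition or symptom names, are candidates for down-weighting, which is part of what TF-IDF does.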

<b>Removing malformed conditions (values containing 'helpful' or 'Listed')</b>

In [None]:
# Find rows whose condition is malformed, i.e. contains 'helpful' or 'Listed'.
# A vectorized mask avoids re-scanning the whole frame for every row.

mask = data['condition'].str.contains('helpful|Listed')
del_index = data.index[mask]                   # row labels to drop
conds = data.loc[mask, 'condition'].tolist()   # matching condition values, one per row

In [None]:
print('Size of the data before removing the conditions:',data.shape)

In [None]:
print('The removable conditions count is:',len(conds))

In [None]:
data.drop(del_index,inplace=True)
print('Size of the data after dropping the conditions:',data.shape)

In [None]:
data.reset_index(inplace=True,drop=True)
data.tail()

<b>Preprocessing the Reviews</b>

In [None]:
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [None]:
def preprocess_text(text_data):

    text_data = decontracted(text_data)

    text_data = text_data.replace('\n',' ')
    text_data = text_data.replace('\r',' ')
    text_data = text_data.replace('\t',' ')
    text_data = text_data.replace('-',' ')
    text_data = text_data.replace("/",' ')
    text_data = text_data.replace(">",' ')
    text_data = text_data.replace('"',' ')
    text_data = text_data.replace('?',' ')
    return text_data

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
# loading stop words from nltk library

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

#removing 'no' from the stop words list as there is an importance of 'side effects' and 'no side effects' in review
stop_words.remove('no')

def nlp_preprocessing(review):
    '''Preprocess a review: expand contractions, keep letters only, collapse
    extra spaces, convert to lower case, drop stop words and stem the rest.'''

    if type(review) is not int:
        string = ""
        review = preprocess_text(review)
        # keep alphabetic characters only (this also removes digits and punctuation)
        review = re.sub(r'[^a-zA-Z]', ' ', review)
        # collapse repeated whitespace; the raw string avoids an invalid-escape warning
        review = re.sub(r'\s+',' ', review)

        review = review.lower()

        for word in review.split():

            if word not in stop_words:
                word = stemmer.stem(word)
                string += word + " "

        return string

In [None]:
data['cleaned_review'] = data['review'].apply(nlp_preprocessing)

In [None]:
# converting to lower case
data['drugName'] = data['drugName'].apply(lambda x:x.lower())

In [None]:
data['condition'] = data['condition'].apply(lambda x:x.lower())

In [None]:
data.head()

In [None]:
data.shape

In [None]:
import nltk
nltk.download('vader_lexicon')

In [None]:
# adding the sentiment scores for reviews and preprocessed reviews as new features

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
data['sentiment_score'] = [sid.polarity_scores(v)['compound'] for v in data['review']]
data['sentiment_score_clean'] = [sid.polarity_scores(v)['compound'] for v in data['cleaned_review']]

In [None]:
data.head()

In [None]:
#checking the correlation of features

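# numeric_only=True skips the text columns, which recent pandas versions no longer drop silently
data.corr(numeric_only=True)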

In [None]:
#data.to_csv('new_data_processed.csv',index=False)

In [None]:
# Specify the file path and name for saving the CSV file
file_path = '/content/drive/MyDrive/Python Practice/Prathamesh/DRUG RECOMMENDATION SYSTEM/csv_data.csv'
# write the processed data directly to disk
data.to_csv(file_path, index=False)
print("CSV file saved successfully at", file_path)

- The useful features now are usefulCount, sentiment_score and sentiment_score_clean.

<b>Conclusion:</b>
<br>Based on the above analysis, the following features are important for the next implementation stages:
- condition - Can be used after label encoding.
- review - New features can be extracted from the raw reviews before preprocessing, such as word count, character length, average word length and stop word count.
- date - A new feature can be created by extracting only the year and label encoding it, following the analysis of review counts per year.
- usefulCount - Important according to the correlation matrix above.
- cleaned_review - The preprocessed reviews are used after vectorization with BoW or TF-IDF.
- sentiment_score - Important and closely correlated with the target feature.
- sentiment_score_clean - Important and closely correlated with the target feature.
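
The review-level features listed above could be derived roughly as follows. This is a sketch on two made-up reviews; `STOP` is a tiny illustrative stop-word set used so the example has no downloads, whereas the notebook itself uses the NLTK list.

```python
import pandas as pd

# Tiny illustrative stop-word set (assumption; the notebook uses NLTK's English list)
STOP = {"i", "the", "a", "and", "it", "no", "for", "my"}

toy = pd.DataFrame({"review": ["It works and I have no side effects", "Made my pain worse"]})

words = toy["review"].str.split()                   # list of tokens per review
toy["word_count"] = words.str.len()
toy["char_length"] = toy["review"].str.len()
toy["avg_word_length"] = words.apply(lambda ws: sum(len(w) for w in ws) / len(ws))
toy["stopword_count"] = words.apply(lambda ws: sum(w.lower() in STOP for w in ws))
print(toy)
```

These features are computed on the raw review because preprocessing removes stop words and changes the lengths.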

<b>Next Steps</b>
- Split the data into Train and Test sets.
- Encode the categorical features.
- Vectorize the cleaned reviews using BoW and TF-IDF, and extract a few additional features from the raw and cleaned reviews.
- Normalize the numerical features.
- Feed the encoded features to various classification algorithms to find the best models.
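
As one possible shape for the vectorize-and-classify steps, the sketch below chains scikit-learn's `TfidfVectorizer` with `LogisticRegression`. The reviews and labels are made up, and the model choice and `max_iter` value are assumptions rather than the notebook's final configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up cleaned reviews; 1 = positive, 0 = negative
reviews = [
    "work great no side effect",
    "love this pill feel better",
    "help pain lot",
    "made sick terribl headach",
    "gain weight bad mood",
    "not work wast money",
]
labels = [1, 1, 1, 0, 0, 0]

# the pipeline fits the vectorizer and the classifier together, so the
# vocabulary is learned from the training reviews only
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(reviews, labels)

pred = model.predict(["work great no side effect"])
proba = model.predict_proba(["work great no side effect"])
print(pred, proba)
```

Swapping `TfidfVectorizer` for `CountVectorizer` gives the BoW variant with the same pipeline structure.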
<br>
<b>Recommendation Approach</b>:
- Select the best model for each feature set used while building the models (for example, the best model on BoW-encoded reviews, the best model on TF-IDF features, and the best model on the extracted features).
- For each drug, add up the predicted values of these best models to get a combined value, then multiply it by usefulCount to create a new feature called the recommendation score.
- For each condition, recommend the drug with the highest recommendation score among the available drugs.
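
The scoring steps above can be sketched as follows. This is one interpretation: the score is computed per review and then summed per drug, which the text does not spell out. The prediction columns (`pred_bow`, `pred_tfidf`, `pred_features`) and all values are hypothetical.

```python
import pandas as pd

# Made-up per-review predictions from three hypothetical best models
toy = pd.DataFrame({
    "condition": ["pain", "pain", "pain", "acne", "acne"],
    "drugName": ["tramadol", "tramadol", "ibuprofen", "tretinoin", "doxycycline"],
    "pred_bow": [1, 0, 1, 1, 0],
    "pred_tfidf": [1, 1, 1, 1, 0],
    "pred_features": [1, 0, 0, 1, 1],
    "usefulCount": [10, 4, 20, 7, 30],
})
pred_cols = ["pred_bow", "pred_tfidf", "pred_features"]

# combined prediction per review, weighted by how useful readers found the review
toy["recommendation_score"] = toy[pred_cols].sum(axis=1) * toy["usefulCount"]

# aggregate per drug within each condition (summing is an assumption)
drug_scores = toy.groupby(["condition", "drugName"], as_index=False)["recommendation_score"].sum()

# pick the highest-scoring drug for each condition
best = drug_scores.loc[drug_scores.groupby("condition")["recommendation_score"].idxmax()]
print(best)
```

Summing favours drugs with many reviews; averaging per drug would be an alternative if review volume should not dominate the ranking.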
