For this project I will use a dataset downloaded from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php), a great public repository of datasets for machine learning projects. The dataset contains patient reviews of specific drugs, the related conditions, and a 10-star patient rating reflecting overall patient satisfaction.

In [1]:
import pandas as pd
import re

train = pd.read_csv('drugsCom_raw/drugsComTrain_raw.tsv', delimiter="\t")
test = pd.read_csv('drugsCom_raw/drugsComTest_raw.tsv', delimiter="\t")

Let's look at the first few records of the data.

In [2]:
train.head()

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9.0,"May 20, 2012",27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5.0,"December 14, 2009",17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8.0,"November 3, 2015",10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9.0,"November 27, 2016",37


The ratings distribution shows a high proportion of 8-, 9- and 10-star reviews, as well as 1-star reviews. Everything in between is less frequent.

In [3]:
train.groupby('rating')['rating'].count()

rating
1.0     21619
2.0      6931
3.0      6513
4.0      5012
5.0      8013
6.0      6343
7.0      9456
8.0     18890
9.0     27531
10.0    50989
Name: rating, dtype: int64
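To put numbers on this, the shares can be computed directly from the counts printed above. This is a minimal sketch in plain Python; the `counts` dictionary is copied by hand from the output, and the variable names are only illustrative.

```python
# Rating counts copied from the groupby output above.
counts = {1: 21619, 2: 6931, 3: 6513, 4: 5012, 5: 8013,
          6: 6343, 7: 9456, 8: 18890, 9: 27531, 10: 50989}

total = sum(counts.values())

# Share of each star rating as a fraction of all training reviews.
shares = {stars: n / total for stars, n in counts.items()}

# Combined share of the three highest ratings.
high_share = shares[8] + shares[9] + shares[10]

for stars in sorted(shares):
    print(f"{stars:>2} stars: {shares[stars]:.1%}")
print(f"8-10 stars combined: {high_share:.1%}")
```

In the notebook itself, `train['rating'].value_counts(normalize=True).sort_index()` should give the same fractions.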

The idea behind many natural language processing algorithms is to represent text as feature vectors. Every distinct word gets its own feature, and the number of occurrences of each word is counted in each review. First, let's do some text preprocessing: remove non-letter characters, switch to lower case, remove one-letter words, remove stop words and apply stemming.

Stop words such as "and", "but", etc. add little useful information. The NLTK library (Natural Language Toolkit, https://www.nltk.org/) provides a list of stop words, which I am going to filter against.

Stemming is a technique that reduces a word to its stem by stripping suffixes; for example, "going" becomes "go" and "turned" becomes "turn". The Porter stemmer used here is rule-based, so an irregular form such as "went" is left unchanged (mapping it to "go" would require lemmatization), and some stems, such as "complet" for "completely", are not dictionary words. Stemming reduces the dimensionality of the data while preserving most of the useful information. NLTK provides a stemmer as well.

In [4]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

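ps = PorterStemmer()
# Build the stop-word set once; calling stopwords.words() for every word is very slow.
stop_words = set(stopwords.words('english'))

def t_transform(review):
    # Keep letters only, lower-case the text and split it into words.
    review = re.sub('[^a-zA-Z]', ' ', review)
    review = review.lower()
    review = review.split()
    # Drop one-letter words and stop words, then stem what is left.
    return [ps.stem(word) for word in review if len(word) > 1 and word not in stop_words]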

[nltk_data] Downloading package stopwords to /home/kostya/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
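The first steps of `t_transform` can be tried without NLTK. The sketch below is a simplified stand-in, not the function used in this notebook: `TINY_STOPWORDS` is a hypothetical hand-picked list (far shorter than NLTK's), `simple_clean` is an illustrative name, the sample sentence is made up, and stemming is skipped.

```python
import re

# Hypothetical mini stop-word list, for illustration only.
TINY_STOPWORDS = {"it", "has", "no", "i", "in", "with", "and", "of", "the"}

def simple_clean(review):
    """Keep letters only, lower-case, drop one-letter words and stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", review)
    tokens = letters_only.lower().split()
    return [t for t in tokens if len(t) > 1 and t not in TINY_STOPWORDS]

# Made-up sample sentence (not taken from the dataset).
sample = "I took 2 pills and it's working GREAT!"
cleaned = simple_clean(sample)
print(cleaned)  # ['took', 'pills', 'working', 'great']
```

The digit and punctuation turn into spaces, so "it's" splits into "it" and "s"; the first is removed as a stop word and the second as a one-letter word.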


This is what the reviews look like after applying the preprocessing steps described above.

In [5]:
train['review'].head(5).apply(t_transform)

0    [side, effect, take, combin, bystol, mg, fish,...
1    [son, halfway, fourth, week, intuniv, becam, c...
2    [use, take, anoth, oral, contracept, pill, cyc...
3    [first, time, use, form, birth, control, glad,...
4    [suboxon, complet, turn, life, around, feel, h...
Name: review, dtype: object

I would like to see whether an NLP model can differentiate between positive and negative reviews. Reviews with 7 or more stars are considered positive and reviews with 4 or fewer stars negative. I am going to label all reviews this way and remove the 5- and 6-star reviews from both the train and test datasets.

In [6]:
# .copy() makes the filtered frames independent and avoids pandas' SettingWithCopyWarning.
train = train[(train['rating'] <= 4) | (train['rating'] >= 7)].copy()
train['rating2'] = [0 if x <= 4 else 1 for x in train['rating']]
test = test[(test['rating'] <= 4) | (test['rating'] >= 7)].copy()
test['rating2'] = [0 if x <= 4 else 1 for x in test['rating']]
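The labeling rule can also be written as a small standalone function, which makes the three cases explicit. This is only a sketch of the rule described above; the function name `label` and the sample ratings are made up for illustration.

```python
def label(rating):
    """Return 1 for positive (>= 7 stars), 0 for negative (<= 4 stars), None otherwise."""
    if rating >= 7:
        return 1
    if rating <= 4:
        return 0
    return None  # 5 and 6 stars are treated as neutral and dropped

# Made-up ratings covering all three cases.
ratings = [9.0, 8.0, 5.0, 1.0, 6.0, 4.0]
labels = [label(r) for r in ratings]

# Keep only the reviews that received a label.
kept = [(r, y) for r, y in zip(ratings, labels) if y is not None]
print(kept)  # [(9.0, 1), (8.0, 1), (1.0, 0), (4.0, 0)]
```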

Let's import all necessary tools for NLP analysis from sklearn.

CountVectorizer creates a separate feature for each word in the corpus and counts word occurrences in each review. Every review is therefore represented by a very long vector in which most features are zero. This model is called bag of words.
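To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It is not how CountVectorizer is implemented (that class returns a sparse matrix); the two tokenized "reviews" are made up.

```python
from collections import Counter

# Two tiny made-up "reviews", already tokenized and stemmed.
docs = [["pill", "work", "great", "great"],
        ["pill", "caus", "headach"]]

# The vocabulary assigns one column to every distinct word.
vocab = sorted({word for doc in docs for word in doc})

# Each review becomes a vector of word counts over the vocabulary.
vectors = []
for doc in docs:
    word_counts = Counter(doc)
    vectors.append([word_counts[word] for word in vocab])

print(vocab)    # ['caus', 'great', 'headach', 'pill', 'work']
print(vectors)  # [[0, 2, 0, 1, 1], [1, 0, 1, 1, 0]]
```

Even with two short reviews some entries are zero; with a vocabulary of many thousands of words, almost all of them are.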

TfidfTransformer implements a technique called term frequency-inverse document frequency (TF-IDF). Instead of just counting word frequencies as in the step above, TF-IDF assigns each word a weight. The weight is the product TF*IDF, where TF is the word's frequency relative to the length of the review and IDF depends on how many reviews contain the word, giving more weight to rarer words. In its textbook form:

TF(t) = (Number of occurrences of word t in a review) / (Total number of words in the review)

IDF(t) = log_e(Total number of reviews / Number of reviews containing word t)
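The two formulas can be worked through on a toy corpus. The sketch below follows the textbook definitions given above; scikit-learn's TfidfTransformer differs in detail (by default it uses raw counts for TF, smooths the IDF and L2-normalizes each row), so its numbers will not match these exactly. The documents and function names are made up.

```python
import math

# Three tiny made-up "reviews", already tokenized and stemmed.
docs = [["pill", "work", "great", "great"],
        ["pill", "caus", "headach"],
        ["pill", "work"]]

def tf(term, doc):
    """Occurrences of the term divided by the length of the review."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of (number of reviews / number of reviews containing the term).

    Assumes the term occurs in at least one review.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["pill", "great"]:
    print(term, round(tfidf(term, docs[0], docs), 4))
```

"pill" appears in every review, so its IDF is log(1) = 0 and its weight vanishes, while "great" appears in only one review and receives a positive weight.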

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import time

Pipeline chains several steps together and lets me supply parameters for each of them. I will sequentially apply vectorizing and TF-IDF, and then fit a model.

In [8]:
pipe = Pipeline([
    ('vec', CountVectorizer()),  
    ('tfidf', TfidfTransformer()),  
    ('clf', MultinomialNB()),  
])

Parameters are addressed by the step name followed by a double underscore. "vec__analyzer" supplies the text preprocessing function that I created earlier. "vec__ngram_range" specifies whether contiguous sequences of words should be taken into account; for example, "vec__ngram_range" = [(1,2)] would look at both 1-word and 2-word combinations. Note, however, that according to the scikit-learn documentation ngram_range only applies when the analyzer is not a callable, so with a custom analyzer function it has no effect. "vec__max_df" ignores words that appear in more than the given fraction of reviews. "clf__alpha" is the smoothing parameter of the multinomial naive Bayes classifier, MultinomialNB. More on this: https://en.wikipedia.org/wiki/Additive_smoothing

In [9]:
tuned_params = dict(vec__analyzer=[t_transform],
                    vec__ngram_range=[(1, 1), (1, 2)],
                    vec__max_df=[0.5, 0.75],
                    clf__alpha=(1e-2, 1e-3))
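To see what the smoothing parameter does, here is a small sketch of additive smoothing for a single word probability. The formula (count + alpha) / (total + alpha * vocabulary size) is the standard one; the function name and all numbers are made up for illustration.

```python
def smoothed_prob(word_count, total_count, vocab_size, alpha):
    """Additive (Laplace/Lidstone) smoothing of a word probability within one class."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# Made-up numbers: a class with 1000 word occurrences over a 50-word vocabulary.
total, vocab_size = 1000, 50

for alpha in (1.0, 1e-2, 1e-3):
    unseen = smoothed_prob(0, total, vocab_size, alpha)   # word never seen in this class
    seen = smoothed_prob(20, total, vocab_size, alpha)    # word seen 20 times
    print(f"alpha={alpha}: unseen={unseen:.6f}, seen={seen:.6f}")
```

A smaller alpha keeps the estimate for seen words closer to the raw frequency (20/1000 here) while still giving unseen words a non-zero probability.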

GridSearchCV is used here to find the best combination of hyper-parameters. In the scikit-learn version used here it runs 3-fold cross validation by default (newer releases default to 5 folds). Three folds is on the low side for checking that there is no overfitting, but I had already trained this model with a higher number of folds and found no overfitting, so the default is used.

GridSearchCV is supplied with the pipe object and the parameters dictionary created above.

In [10]:
grid = GridSearchCV(pipe, 
                    param_grid=tuned_params,
                    scoring='accuracy',
                    n_jobs=-1,
                    verbose=1)

Let's train our model and see what results we can get.

In [11]:
start_time = time.time()
grid.fit(train['review'], train['rating2'])
print("Time spent: ", time.time() - start_time)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed: 519.6min finished


Time spent:  32720.129016160965
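The "8 candidates, 24 fits" in the log follows from the size of the parameter grid. A minimal sketch of that bookkeeping with `itertools.product`; this only mirrors the counting, not GridSearchCV's internals, and `grid_values` is an illustrative name. The analyzer is left out because it has a single fixed value.

```python
from itertools import product

# Same value lists as tuned_params, minus the analyzer (one fixed choice).
grid_values = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "vec__max_df": [0.5, 0.75],
    "clf__alpha": [1e-2, 1e-3],
}

# Every combination of one value per parameter is one candidate.
names = list(grid_values)
candidates = [dict(zip(names, combo)) for combo in product(*grid_values.values())]

n_folds = 3  # the fold count reported in the log above
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
# 8 candidates, 24 fits
```

Each additional value for any parameter multiplies the number of fits, which matters when a full run takes hours.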


The best parameter combination is shown below. There were 8 combinations in total, and the mean accuracy is shown for each of them. The scores take only two distinct values, four combinations each, which is consistent with only one parameter (clf__alpha) changing the result while vec__max_df and vec__ngram_range made no difference. Accuracy is simply the number of reviews for which the model predicted the target variable correctly, divided by the total number of reviews. Accuracy is around 81%.

GridSearchCV trains a model with each hyper-parameter combination using the k-fold technique and selects the best combination based on the "scoring" parameter (in this case accuracy). The whole train dataset is then used to re-train the model with the best combination just found.

In [12]:
print("Best params: ", grid.best_params_)
print("Best score: ", grid.best_score_)
print(grid.cv_results_['mean_test_score'])
print(grid.cv_results_['std_test_score'])

Best params:  {'clf__alpha': 0.01, 'vec__analyzer': <function t_transform at 0x7f632a344ea0>, 'vec__max_df': 0.5, 'vec__ngram_range': (1, 1)}
Best score:  0.8103864816490972
[0.81038648 0.81038648 0.81038648 0.81038648 0.80996454 0.80996454
 0.80996454 0.80996454]
[0.0006044  0.0006044  0.0006044  0.0006044  0.00050035 0.00050035
 0.00050035 0.00050035]
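The k-fold technique behind these scores can be sketched in a few lines. This is a simplified illustration with contiguous, unshuffled folds; for classifiers GridSearchCV uses a stratified variant that also preserves class proportions in each fold. The function name `k_fold_indices` is made up.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

# Seven samples split into three folds.
folds = list(k_fold_indices(7, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Every sample is used for validation exactly once, and the reported mean_test_score is the average of the per-fold scores.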


Finally, we can evaluate the model on the test dataset. The classification report and confusion matrix provide important information about the model's performance: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support

The model is very good at identifying positive reviews: based on the recall metric, 98% of positive reviews (target variable 1) were predicted correctly. At the same time, only 39% of negative reviews were identified correctly. Could it be because 3- and 4-star reviews are more difficult to predict? The classes are also imbalanced (35440 positive versus 13497 negative test reviews), which may bias the classifier toward the positive class. It would be interesting to see whether the model can predict 1- and 2-star reviews better than 1-, 2-, 3- and 4-star reviews together.

The confusion matrix below shows the number of true negatives, false positives, false negatives and true positives. Accuracy can be calculated from it: (5228+34618)/48937 = 81.4%, which is in line with the accuracy obtained during cross validation.

In [13]:
from sklearn.metrics import classification_report, confusion_matrix

pred = grid.predict(test['review'])

print(classification_report(test['rating2'], pred))
print(confusion_matrix(test['rating2'], pred))

             precision    recall  f1-score   support

          0       0.86      0.39      0.53     13497
          1       0.81      0.98      0.88     35440

avg / total       0.82      0.81      0.79     48937

[[ 5228  8269]
 [  822 34618]]
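The figures in the classification report can be recomputed by hand from the confusion matrix, where rows are true labels and columns are predicted labels. A short sketch using the four numbers printed above; the variable names are only illustrative.

```python
# Confusion matrix from the output above: rows are true labels, columns are predictions.
tn, fp = 5228, 8269    # true class 0 (negative reviews)
fn, tp = 822, 34618    # true class 1 (positive reviews)

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Recall: share of a true class that was predicted correctly.
recall_pos = tp / (tp + fn)
recall_neg = tn / (tn + fp)

# Precision: share of the predictions for a class that were correct.
precision_pos = tp / (tp + fp)
precision_neg = tn / (tn + fn)

print(f"accuracy:           {accuracy:.3f}")
print(f"class 1 prec / rec: {precision_pos:.2f} / {recall_pos:.2f}")
print(f"class 0 prec / rec: {precision_neg:.2f} / {recall_neg:.2f}")
```

The low recall for class 0 comes from the 8269 negative reviews that were predicted as positive.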
