# Medication Sentiment Analysis
In this notebook we aim to estimate which medications have proven most successful according to the provided corpus.<br>
To do so, we downloaded a list of medication names (https://www.kaggle.com/iancornish/drug-data) and searched the Covid-19 dataset for them. For each occurrence, we extract a snippet (50 characters before the occurrence and 100 characters after it, dropping the first and last word, as those will most likely be clipped) and run sentiment analysis on it. We then list the medications sorted from most to least positive, together with the 10 most important terms according to tf-idf computed over all extracted snippets.

# Import and install packages

In [1]:
import nltk
# Fetch only the resources used below instead of opening the interactive downloader
for resource in ["stopwords", "wordnet", "vader_lexicon"]:
    nltk.download(resource)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nltk import ngrams
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer



In [3]:
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))

In [4]:
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

In [5]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

In [6]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
SIA = SentimentIntensityAnalyzer()


## Dataset

In [7]:
df_clean_pmc = pd.read_csv("data/medication _sentiment_analysis_data/clean_pmc.csv", skiprows=0, header=None)
df_clean_noncomm_use = pd.read_csv("data/medication _sentiment_analysis_data/clean_noncomm_use.csv", skiprows=1, header=None)
df_clean_comm_use = pd.read_csv("data/medication _sentiment_analysis_data/clean_comm_use.csv", skiprows=1, header=None)
df_biorxiv_clean = pd.read_csv("data/medication _sentiment_analysis_data/biorxiv_clean.csv", skiprows=1, header=None)

In [8]:
df_concat = pd.concat([df_clean_pmc, df_clean_comm_use, df_clean_noncomm_use, df_biorxiv_clean])

In [9]:
data = df_concat.to_numpy()

In [10]:
data.shape

(13203, 9)

In [11]:
data[0]

array(['paper_id', 'title', 'authors', 'affiliations', 'abstract', 'text',
       'bibliography', 'raw_authors', 'raw_bibliography'], dtype=object)

## Medication names

In [12]:
df_drug_ratings = pd.read_csv("data/medication _sentiment_analysis_data/drugsComTest_raw.csv", skiprows=0, header=None)

In [13]:
drug_ratings = df_drug_ratings.to_numpy()

In [14]:
drug_ratings.shape

(53767, 7)

In [15]:
drug_ratings[1]

array(['163740', 'Mirtazapine', 'Depression',
       '"I&#039;ve tried a few antidepressants over the years (citalopram, fluoxetine, amitriptyline), but none of those helped with my depression, insomnia &amp; anxiety. My doctor suggested and changed me onto 45mg mirtazapine and this medicine has saved my life. Thankfully I have had no side effects especially the most common - weight gain, I&#039;ve actually lost alot of weight. I still have suicidal thoughts but mirtazapine has saved me."',
       '10', '28-Feb-12', '22'], dtype=object)

In [16]:
drugs = np.unique(drug_ratings[1:,1])
conditions = np.unique(drug_ratings[1:,2].astype(str))[44:]

In [17]:
print(drugs.shape, conditions.shape)

(2637,) (665,)


In [18]:
# Given a document and a regex match object (`index`), extract a snippet of up to
# 50 characters before and up to 100 characters after the match.
# The matched drug name itself is not part of the snippet.
def get_snippet(index, doc):
    return doc[max(index.start()-50, 0):max(index.start()-1, 0)] +\
            doc[index.end():min(index.end()+100, len(doc))]
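
To make the window concrete, the sketch below applies the same slicing to a made-up sentence. The sentence and the drug name are illustrative assumptions, not taken from the corpus. Note that the matched name and the character directly in front of it are left out of the snippet.

```python
import re

def get_snippet(match, doc):
    # Same slicing as in the notebook: up to 50 characters before the match
    # (minus the character directly in front of it) plus up to 100 after it.
    return doc[max(match.start() - 50, 0):max(match.start() - 1, 0)] + \
        doc[match.end():min(match.end() + 100, len(doc))]

# Hypothetical example text, not from the dataset.
doc = "patients treated with remdesivir showed faster recovery"
match = re.search("remdesivir", doc)
snippet = get_snippet(match, doc)
print(repr(snippet))

# Dropping the first and last token mimics the clipping guard used later on.
inner_tokens = snippet.split()[1:-1]
print(inner_tokens)
```

On real documents the window usually starts and ends in the middle of a word, which is why the first and last token are discarded before further processing.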

In [19]:
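# Given a drug name and a document, return the compound sentiment score of every
# snippet around the drug name, the concatenated (lemmatized, stop-word-filtered)
# snippet text, and the number of occurrences.
def process(drug, doc):
    text = str(doc)  # fields may be non-string (e.g. NaN), so coerce once
    s = []
    o = 0
    d = ""
    # NOTE: the lower-cased name is used as an unescaped regex on the original text,
    # so only lower-case occurrences match, and they also match inside longer words.
    for index in re.finditer(drug.lower(), text):
        snippet = get_snippet(index, text)
        sentiment = SIA.polarity_scores(snippet)
        s.append(sentiment['compound'])
        # Drop the first and last token, which are most likely clipped.
        # NOTE: snippets are appended without a separator, so neighbouring words fuse.
        d += ' '.join([lem.lemmatize(w) for w in tokenizer.tokenize(snippet)[1:-1] if w not in stop_words])
        o += 1

    return s, d, o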
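
Plain substring matching lets short names match inside ordinary words. The very high count reported for 'Ovide' further down is plausibly inflated this way, since "ovide" occurs inside "provide"; this is a likely explanation, not something verified in this notebook. A minimal sketch of a stricter matcher follows; `find_mentions` is a hypothetical helper and the example sentence is made up.

```python
import re

def find_mentions(drug, text):
    """Return matches of `drug` as a whole word, ignoring case.

    Hypothetical helper for illustration; it is not used by the notebook.
    """
    # re.escape guards against names containing regex metacharacters;
    # the lookarounds reject matches embedded in a longer word.
    pattern = re.compile(r"(?<!\w)" + re.escape(drug) + r"(?!\w)", re.IGNORECASE)
    return list(pattern.finditer(text))

# Made-up sentence, not from the corpus.
text = "We provide evidence that Ovide was provided to patients."

substring_hits = len(re.findall("ovide", text))   # notebook-style matching
word_hits = len(find_mentions("Ovide", text))     # whole-word matching
print(substring_hits, word_hits)
```

Lookarounds are used instead of `\b` so that names starting or ending with a non-word character (for example a closing parenthesis) are still handled.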

In [20]:
drugnames = []
sentimemts = []
documents = []
occurrences = []

print("Total:", len(drugs))
for i, drug in enumerate(drugs):
    print("Progress:", i, end="\r")
    s = []
    o = 0
    d = ""
    for doc in data[1:]:
        # process the abstract
        p = process(drug, doc[4])
        if p[2] > 0: 
            s += p[0]
            d += " " + p[1]
            o += p[2]
        # process the content
        p = process(drug, doc[5])
        if p[2] > 0: 
            s += p[0]
            d += " " + p[1]
            o += p[2]
    
    # keep only the drugs that occurred more than 10 times
    if o > 10:
        drugnames.append(drug)
        sentimemts.append(np.array(s))
        documents.append(d)  
        occurrences.append(o)

Total: 2637
Progress: 2636

In [21]:
# Example of a collection of snippets for a certain medication
documents[256]

' expressed chronic stage induced epilepsy medically refractory MTLE 4 Other experimental microarray studyacid tetanus toxin pentylenetetrazol directly injected induce primary focus epileptogenic activity immediate areaTwo study chemically induced epilepsy investigated proteomic profile following epileptogenesis whole hippocampus alonegene expression kainic acid model epileptogenesis despite studied exacting environment intraperitoneally injected 100 ml sterile solution 10 mg ml order collect saliva sample These fluid stored 270uC lymphocytic choriomeningitis virus LCMV induced BBB permeabilization subsequent seizure Fabene et al 2008 Kim et al 2009 observed following kainic acid 39 mediated seizure induction 40 rodent CCL2 CCR2 increased tissue also exerted neuroprotective activity induced epileptic mouse After preventive administration biflavonoid three consecutive total salivary secretion determined stimulation Blazsek Varga 1999 The salivary secretion tightly regulatedBrenner Stant

Perform tf-idf on the collections of snippets.

In [22]:
# stop_words must be a list (recent scikit-learn versions reject a set)
cvec = CountVectorizer(max_df=.85, stop_words=list(stop_words), ngram_range=(1, 3))
wordcounts = cvec.fit_transform(documents)

In [23]:
tfidf_trans = TfidfTransformer(smooth_idf=True, use_idf=True)
tf_idf = tfidf_trans.fit_transform(wordcounts)
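
For reference, with `smooth_idf=True` and the default L2 normalization, scikit-learn documents the weighting as idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by normalizing each row to unit length. The sketch below reproduces that formula in plain Python on two made-up token lists. `tfidf_rows` is a hypothetical helper, and the `max_df` filtering and n-grams used above are left out.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed tf-idf with L2-normalised rows for already tokenised documents."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + df_t)) + 1 for t, df_t in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # Fall back to 1.0 so an empty document does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Made-up token lists standing in for two snippet collections.
docs = [["bile", "acid", "sequestrant"], ["kainic", "acid", "seizure"]]
rows = tfidf_rows(docs)
print(rows[0])
```

Terms shared by both collections ("acid") receive a lower weight than terms that are specific to one of them, which is what makes the top-weighted terms useful as a per-medication summary.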

In [24]:
# scikit-learn >= 1.0; older releases use get_feature_names() instead
voc = np.asarray(cvec.get_feature_names_out())

In [25]:
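# Collect results per medication: name, number of occurrences, mean and variance
# of the compound sentiment, and the 10 highest-weighted tf-idf terms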
res = []
for i, d in enumerate(tf_idf):
    if len(documents[i]) > 0:
        res.append([drugnames[i], occurrences[i], sentimemts[i].mean().round(2),
                    np.asarray(sentimemts[i]).var().round(2), voc[d.toarray().argsort()[0, -10:]]])
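
The expression `argsort()[0, -10:]` sorts the term weights in ascending order and keeps the last ten indices, so the selected terms are listed from lower to higher weight, with the most important term last. A minimal plain-Python sketch of the same selection, using a made-up vocabulary and weights (`top_k_terms` is a hypothetical helper):

```python
def top_k_terms(weights, vocabulary, k=10):
    """Return the k highest-weighted terms in ascending order of weight."""
    # Indices sorted by weight, smallest first, like an ascending argsort.
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    return [vocabulary[i] for i in order[-k:]]

# Made-up vocabulary and tf-idf weights for a single medication.
vocabulary = ["acid", "bile", "study", "virus"]
weights = [0.10, 0.70, 0.05, 0.40]

top2 = top_k_terms(weights, vocabulary, k=2)
print(top2)
```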

In [26]:
res = sorted(res, reverse=True, key=lambda x: x[2])

In [27]:
# Medications sorted by sentiment score
# (the fourth column holds the variance of the compound scores, computed with .var())
labels = ['Name', 'Occs', 'Sentiment Mean', 'Variance', 'tf-idf']
print(pd.DataFrame(res, range(len(res)), labels))

               Name  Occs  Sentiment Mean  Variance  \
0         Ivacaftor    11            0.34      0.09   
1    Cholestyramine    16            0.29      0.10   
2        Zinc oxide    22            0.23      0.07   
3            Testim    61            0.22      0.11   
4            Vantin    35            0.22      0.08   
..              ...   ...             ...       ...   
344       Celecoxib    33           -0.17      0.11   
345      Sucralfate    12           -0.17      0.12   
346       Clozapine    18           -0.18      0.09   
347     Amphetamine    51           -0.19      0.17   
348       Meloxicam    30           -0.20      0.21   

                                                tf-idf  
0    [cftr channel, highlighted importance, al high...  
1    [two week daily, bile acid sequestrant, admini...  
2    [medicinal product, fumaric, medicinal product...  
3    [read real, real quote, php read real, real qu...  
4    [plasmid, pam, adjg lipopeptide, self adjg lip...  
..                   

In [28]:
# Medication with highest number of occurrences
res = sorted(res, reverse=True, key=lambda x: x[1])
print(res[0])

['Ovide', 44161, 0.16, 0.11, array(['virus', 'study', 'evidence', 'health', 'data', 'kindly pd', 'pr',
       'insight', 'information', 'pd'], dtype='<U134')]


In [29]:
# Most controversial medication (highest sentiment variance)
res = sorted(res, reverse=True, key=lambda x: x[3])
print(res[0])

['Lactulose', 17, 0.05, 0.27, array(['microbiota', 'faecal', 'kg', 'faecal microbiota',
       'intestinal permeability', 'intestinal', 'oligosaccharide', 'po',
       'po q8h', 'q8h'], dtype='<U134')]
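
The spread used for this ranking is computed with `.var()`, which is the population variance of the compound scores. A standard deviation, which is expressed in the same units as the scores themselves, would be its square root. A small sketch with made-up scores:

```python
import math
import statistics

# Hypothetical compound scores for one medication, not taken from the results.
scores = [0.6, -0.4, 0.1, 0.5]

# Population variance (divides by n), matching the default of ndarray.var().
variance = statistics.pvariance(scores)
stddev = math.sqrt(variance)
print(round(variance, 3), round(stddev, 3))
```

Sorting by either quantity yields the same order, since the square root is monotonic; only the reported magnitude differs.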


# Conclusion
Unfortunately, some important medication names, such as chloroquine, are not contained in the list of medications. We were not able to find a more complete, publicly available list for this experiment, but the analysis could easily be repeated should one become available.<br>
Sentiment analysis in the medical domain is very complex and the subject of many studies, as discussed in this paper: https://www.sciencedirect.com/science/article/pii/S0933365715000299<br>
We are well aware that our approach is basic and sometimes inaccurate; for instance, it assigns a negative sentiment to the sentence "This medication heals cancer". This is also reflected in the results: none of the top-ranked medications seems to have a connection with coronavirus. For this method to work, the sentiment analysis would therefore need to be greatly improved.
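
The failure mode can be illustrated with a toy lexicon scorer. This is a deliberately simplified sketch, not VADER itself, and the valences in `TOY_LEXICON` are invented for illustration: a word such as "cancer" carries a negative valence regardless of the verb that precedes it.

```python
# Invented valences for illustration only; real lexicons are far larger.
TOY_LEXICON = {"cancer": -3.0, "death": -2.5, "recovery": 2.0}

def toy_sentiment(sentence):
    """Sum word valences, ignoring context such as verbs that reverse meaning."""
    words = sentence.lower().replace(".", " ").split()
    return sum(TOY_LEXICON.get(w, 0.0) for w in words)

score = toy_sentiment("This medication heals cancer")
print(score)
```

A domain-aware approach would have to recognize that a disease term appearing as the object of a verb like "heals" signals a positive outcome.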