![](./banner.png)

# Data Collection 3: UCI Drugs.com reviews dataset

##### Citation
Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. In Proceedings of the 2018 International Conference on Digital Health (DH '18). ACM, New York, NY, USA, 121-125. DOI: [Web Link](https://dl.acm.org/doi/10.1145/3194658.3194677)


## Overview
The dataset includes patient reviews on medicinal products, conditions that the product is indicated for, and a 10-star rating reflecting overall patient satisfaction.
For full description, please refer to [UCI Machine Learning Repositories](https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29).


Let's first import the necessary packages:

In [401]:
import numpy as np
import pandas as pd

## Load the data
- The original dataset is split into two separate `csvs`, a training set (75%) and a test set (25%). 
- I loaded the datasets using the Pandas function `pd.read_csv()`.
- The files are in `tsv` format (tab-separated values), therefore I loaded the datasets with `delimiter = '\t'`.

In [402]:
# import training set csv
reviews_train = pd.read_csv('/Users/JocelynHo/Desktop/drugs_com_reviews/drugsComTrain_raw.tsv',
                            delimiter = '\t', index_col = 'Unnamed: 0')

# check number of rows, columns
print(reviews_train.shape)

(161297, 6)


In [403]:
# import test set csv
reviews_test = pd.read_csv('/Users/JocelynHo/Desktop/drugs_com_reviews/drugsComTest_raw.tsv',
                            delimiter = '\t', index_col = 'Unnamed: 0')

# check number of rows, columns
print(reviews_test.shape)

(53766, 6)


## Combine training and test datasets
In order to combine this dataset to the Yellow Card dataset I created, I combined them together using `pd.concat()`. Since the original datasets are shuffled (row indices are not in order), I sorted the indices using `.sort_index(inplace = True)`.

In [404]:
# combine both together
reviews = pd.concat([reviews_test, reviews_train])

# check number of rows and columns
print(reviews.shape)

# resort indices to be in order
reviews.sort_index(inplace = True)

# check first 5 rows
reviews.head()

(215063, 6)


Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,Medroxyprogesterone,Abnormal Uterine Bleeding,"""Been on the depo injection since January 2015...",3.0,"October 28, 2015",4
2,Medroxyprogesterone,Amenorrhea,"""I&#039;m 21 years old and recently found out ...",10.0,"October 27, 2015",11
3,Medroxyprogesterone,Abnormal Uterine Bleeding,"""I have been on the shot 11 years and until a ...",8.0,"October 27, 2015",7
4,Medroxyprogesterone,Birth Control,"""Ive had four shots at this point. I was on bi...",9.0,"October 26, 2015",12
5,Medroxyprogesterone,Abnormal Uterine Bleeding,"""I had a total of 3 shots. I got my first one ...",1.0,"October 25, 2015",4


Let's take a look at the columns:
1. `drugName` (categorical): Brand name of drug
2. `condition` (categorical): Name of condition drug is used for
3. `review` (text): Patient's review
4. `rating` (numerical): Patient's rating out of 10 stars
5. `date` (date): Date of review entry
6. `usefulCount` (numerical): Number of users who found review useful

## Data Cleaning
### Drop date column
`date` column is not needed, so I decided to drop it.

In [405]:
# drop date column
reviews.drop(columns = 'date', inplace = True)

### Rename columns
I didn't like how the column names were set up, so I decided to rename them.

In [406]:
# rename columns
reviews.columns = ['drug', 'condition', 'review', 'rating', 'useful_count']

### Convert all text to lower case
I created a custom function `to_lower()` that converts all strings of texts to lower case, for easier manipulation at later stages. The try-except clause makes sure that if certain values cannot be converted (e.g. nan values), instead of raising an error, the original value will be returned.

In [407]:
# create function to convert to lowercase
def to_lower(x):
    try:
        return x.lower()
    except:
        return x

# apply custom function to the entire dataframe
reviews = reviews.applymap(to_lower)

### Convert brand names to drug names
The biggest task of this dataset is to address the issue of brand names vs drug names. This dataset is based on each medicinal product's BRAND NAME (e.g. Calpol), instead of the DRUG NAME or active ingredient (e.g. paracetamol).

In order to combine this dataset to my Yellow Card dataset, I have to convert all the brand names to their respective drug names.

##### Trial - manual conversion
Initially I tried to manually convert the brand names to drug names, but quickly realised it is too time consuming and decided to then automate the entire process.

Here is a snippet of my attempt to manually convert the values

In [408]:
# change brand names to drug names
reviews.loc[reviews[reviews['drug'] == 'acanya'].index, 'drug'] = 'benzoyl peroxide / clindamycin'
reviews.loc[reviews[reviews['drug'] == 'abilify discmelt'].index, 'drug'] = 'aripiprazole'
reviews.loc[reviews[reviews['drug'] == 'acnex'].index, 'drug'] = 'salicyic acid'

In [409]:
{'acanya': 'benzoyl peroxide / clindamycin',
'abilify discmelt', 'aripiprazole',
'acnex': 'salicyic acid',
'a / b otic': 'antipyrine / benzocaine',
'a + d cracked skin relief': 'benzalkonium chloride / lidocaine',
'actifed': 'chlorphenamine / phenylephrine',
'activella': 'estradiol / norethindrone',
'actoplus met': 'metformin / pioglitazone',
'actonel with calcium': 'risedronate / calcium',
'adderall xr': 'amphetamine',
'addyi': 'flibanserin',
'advicor': 'niacin / lovastatin',
'advil cold and sinus': 'ibuprofen / pseudoephedrine',
'advil liqui-gels': 'ibuprofen / pseudoephedrine',
'advil migraine': 'ibuprofen',
'afrin original': 'oxymetazoline',
'afrin sinus': 'oxymetazoline',
'ala-quin': 'clioquinol / hydrocortisone',
'alahist lq': 'diphenhydramine / phenylephrine',
'alavert': 'loratidine',
'aleve-d sinus & cold': 'naproxen / pseudoephedrine',
'alka-seltzer plus cold formula sparkling original effervescent tablets': 'aspirin / chlorpheniramine / phenylephrine'}

SyntaxError: invalid syntax (<ipython-input-409-cb7c6a22faf2>, line 2)

#### Extract unique brand names
In order to automate the process, I first extracted all the unique brand names using `.unique()` on the `drug` column.

In [410]:
brands = list(reviews.drug.unique())
print(len(brands))
brands[:20]

3669


['medroxyprogesterone',
 'phenylephrine',
 'silodosin',
 'resorcinol / sulfur',
 'methylin er',
 'clemastine',
 'pentosan polysulfate sodium',
 'metoprolol tartrate',
 'xeljanz xr',
 'rituxan',
 'torsemide',
 'diphenhydramine / hydrocortisone',
 'levo-dromoran',
 'everolimus',
 'tolak',
 'fenoprofen',
 'metronidazole',
 'sumatriptan',
 'ibudone',
 'triple antibiotic']

#### Clean names to use in Drugs.com
First, I need to change the format of the brand names, so that they follow the correct format of Drugs.com urls. This is done using a list comprehension with the following:
- Replacing `'/'` with `'-'`
- Replacing `' #'` with `'-'`
- Replacing `"'"` with `'-'`
- Replacing `' + '` with `'-'`
- Replacing `'.'` with `'-'`
- Replacing `'. '` with `'-'`
- Replacing `' '` (space) with `'-'`
- Removing extra spaces

In [411]:
# create dictionary containing symbols to be removed
x = {' / ': '-', ' #': '-', "'": "-", ' + ': '-', '+': '-', '.': '-', '. ': '-', ' ': '-'}
# create empty lsit to store new brand names
brands2 = []

# loop through each brand name and remove symbols
for i in brands:
    for a,b in x.items():
        i = i.replace(a, b)
    brands2.append(i)

# check length
print(len(brands2))
# show top 20 values
brands2[:20]

3669


['medroxyprogesterone',
 'phenylephrine',
 'silodosin',
 'resorcinol-sulfur',
 'methylin-er',
 'clemastine',
 'pentosan-polysulfate-sodium',
 'metoprolol-tartrate',
 'xeljanz-xr',
 'rituxan',
 'torsemide',
 'diphenhydramine-hydrocortisone',
 'levo-dromoran',
 'everolimus',
 'tolak',
 'fenoprofen',
 'metronidazole',
 'sumatriptan',
 'ibudone',
 'triple-antibiotic']

#### Retrieve active ingredients of products using BeautifulSoup
##### Import libraries
BeautifulSoup lets users extract useful data from html. I first imported the libraries:

In [412]:
import requests
import bs4
from bs4 import BeautifulSoup
from time import sleep
import random
import string
from tqdm import tqdm

##### Extract drug names from Drugs.com
Since reviews from this dataset are extracted from Drugs.com, I decided to extract the respective active ingredients (drug names) from the same site. Work flow will be as follows:
- I first created an empty lists `actives`, this will store all the data I extract from Drugs.com.
- Then, for each brand in the `brands` list I created above, I try to access their Drugs.com pages by placing each brand name into the url.
- If the webpage is accessible, I then extract the `'Generic Name'` listed on the page.
- If not, I then try different web pages (e.g. drug interactions page or side effects page) to extract the same information.
- If that doesn't work, I simply append a `nan` value to the `actives` list, indicating that the information was not retrievable using this particular method.

In [413]:
# create empty list to store active ingredients
actives = []

# loop through each brand name
for i in tqdm(brands2):
    
    # wait for a random number of seconds, from 3 to 7 seconds
    sleep(random.randint(3, 7))
    
    try:
        
        # insert each brand name into the url
        url = f'https://www.drugs.com/mtm/{i}.html'
        # get a request using the url
        r = requests.get(url)
        # find the necessary information using beautifulsoup
        soup = BeautifulSoup(r.text, 'html.parser')
        results = soup.find("div", class_ = 'contentBox')
        names = results.find_next("p").text


        # if webpage exists, text starts with 'Generic name:'
        if names.startswith('Generic'):

            # some texts have a long list, we only need the first part before '\n'
            try:
                try:
                    generic = names.split(' (')[0].split(': ')[1].split('\n')[0]
                    actives.append(generic)
                except:
                    generic = names.split(' (')[0].split(': ')[1]
                    actives.append(generic)
            except:
                actives.append(np.nan)

        # if the /mtm/ url does not work, try with side-effects.html
        elif names.startswith('Sorry'):
            try:
                # insert each brand name into the url
                url = f'https://www.drugs.com/sfx/{i}-side-effects.html'
                # get a request using the url
                r = requests.get(url)
                # find the necessary information using beautifulsoup
                soup = BeautifulSoup(r.text, 'html.parser')
                results = soup.find("div", class_ = 'contentBox')
                names = results.find_next("p").text

                # only extract the generic names after 'Generic Name: '
                try:
                    actives.append(names.split(': ')[1])

                # safety net to catch any errors and return nan value
                except:
                    actives.append(np.nan)

            # if webpage not accesible, return nan value
            except:
                actives.append(np.nan)

        else:
            actives.append(np.nan)
            
    except:
        actives.append(np.nan)

100%|██████████| 3669/3669 [6:09:25<00:00,  6.04s/it]  


In [414]:
print(len(actives))
actives[:5]

3669


['medroxyprogesterone', 'phenylephrine', 'silodosin', nan, 'methylphenidate']

##### Create dictionary of brand names to drug names
After retrieving the active ingredients (drug names) of each brand, I then created a dictionary that pairs each brand name to the drug names.

`zip` pairs each element of the same index in both the `brands` and `actives` list together, and returns a tuple of (brand name, drug names). `i[0]` extracts the brand name, `i[1]` extracts the drug names. I then used the dictionary comprehension method to create the desired dictionary.

In [415]:
brand_drug_dict = dict(zip(brands2, actives))

##### Check for null values
I checked for the brand names that I was unable to scrape from Drugs.com:

In [416]:
not_found = [i for i, j in brand_drug_dict.items() if str(j) == 'nan']
print(len(not_found))
not_found[:20]

465


['resorcinol-sulfur',
 'desloratadine-pseudoephedrine',
 'brinzolamide',
 'mestranol-norethindrone',
 'mupirocin',
 'acetaminophen-dextromethorphan-doxylamine-pseudoephedrine',
 'acrivastine-pseudoephedrine',
 'dextromethorphan-quinidine',
 'formoterol',
 'piperonyl-butoxide-pyrethrins',
 'nafarelin',
 'norvir',
 'atropine-diphenoxylate',
 'acetaminophen-aspirin-caffeine-salicylamide',
 'capsaicin',
 'penciclovir',
 'nystatin-triamcinolone',
 'piperacillin-tazobactam',
 'acetaminophen-pamabrom-pyrilamine',
 'tryptophan']

##### Re-scrape
There is a third url that I can use for scraping, and I decided to try it with the brand names I was not able to find information on, which are the drugs listed in `not_found`.

In [417]:
# try to re-scrape using a different url
actives2 = []

for i in tqdm(not_found):
    try:
        # insert each brand name into the url
        url = f'https://www.drugs.com/pro/{i}.html'
        # get a request using the url
        r = requests.get(url)
        # find the necessary information using beautifulsoup
        soup = BeautifulSoup(r.text, 'html.parser')
        results = soup.find("div", class_ = 'contentBox')
        names = results.find_next("p").text

        # if webpage exists, text starts with 'Generic name:'
        if names.startswith('Generic'):

            # some texts have a long list, we only need the first part before '\n'
            try:
                try:
                    generic = names.split(' (')[0].split(': ')[1].split('\n')[0]
                    actives2.append(generic)
                except:
                    generic = names.split(' (')[0].split(': ')[1]
                    actives2.append(generic)
            except:
                actives2.append(np.nan)
        else:
            actives2.append(np.nan)
            
    # if webpage not accesible, return nan value
    except:
        actives2.append(np.nan)

100%|██████████| 465/465 [04:50<00:00,  1.60it/s]


In [418]:
print(len(actives2))
actives2[:20]

465


[nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'ritonavir',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan]

##### Update dictionary
I managed to obtain a few non-null values from the second scrape, so I updated the original `brand_drug_dict` with the new scraped values. Then, I updated the dictionary keys, so that they contain the original brand names (`brands`) instead of the altered brand names used for scraping (`brands2`).

In [419]:
# update dictionary if second scrape values are not nan
for i, j in enumerate(actives2):
    if str(j) != 'nan':
        brand_drug_dict[brands2[i]] = j

In [420]:
# create dictionary with original brand names
brand_drug_dict = dict(zip(brands, brand_drug_dict.values()))

##### Add brand name and drug name columns
Now that I have the brand names with their respective drug names (active ingredients), I added those to the reviews data frame. For rows that I wasn't able to scrape the drug names, I found out that most of the brand names were already drug names, so I simply copied the `brand` values over to the `drug` column using `.fillna()`.

In [421]:
# rename drug column as brand
reviews.rename(columns = {'drug': 'brand'}, inplace = True)

# add new drug column by converting brand name to drug names scraped
reviews['drug'] = reviews.brand.map(brand_drug_dict)

# copy brand name to drug column if null
reviews['drug'].fillna(reviews.brand, inplace = True)
reviews.head()

Unnamed: 0,brand,condition,review,rating,useful_count,drug
0,medroxyprogesterone,abnormal uterine bleeding,"""been on the depo injection since january 2015...",3.0,4,medroxyprogesterone
2,medroxyprogesterone,amenorrhea,"""i&#039;m 21 years old and recently found out ...",10.0,11,medroxyprogesterone
3,medroxyprogesterone,abnormal uterine bleeding,"""i have been on the shot 11 years and until a ...",8.0,7,medroxyprogesterone
4,medroxyprogesterone,birth control,"""ive had four shots at this point. i was on bi...",9.0,12,medroxyprogesterone
5,medroxyprogesterone,abnormal uterine bleeding,"""i had a total of 3 shots. i got my first one ...",1.0,4,medroxyprogesterone


In [422]:
# check null value sum
reviews.isnull().sum()

brand              0
condition       1194
review             0
rating             0
useful_count       0
drug               0
dtype: int64

### Condition column
- Some values are incorrect, where the condition says `'_ number of users found this comment useful.'`. I first extracted those and replaced them with null values.
- Then, I converted the null values in `condition` column to a string of `'nan'`, so that they will be a separate category when I dummify the column at a later stage.

In [424]:
# convert nan to a string
reviews.condition = reviews.condition.apply(lambda x: 'nan' if str(x) == 'nan' else x)
reviews.isnull().sum()

brand           0
condition       0
review          0
rating          0
useful_count    0
drug            0
dtype: int64

In [425]:
# extract rows with incorrect condition
tmp = reviews[reviews.condition.str.contains('</span>')]
tmp.head()

Unnamed: 0,brand,condition,review,rating,useful_count,drug
1471,triple antibiotic,11</span> users found this comment helpful.,"""used this product as well as having used neos...",7.0,11,"bacitracin, neomycin, and polymyxin B"
1480,ultram odt,44</span> users found this comment helpful.,"""my experience with ultram is fantastic. i sev...",10.0,44,tramadol
1675,uricalm,46</span> users found this comment helpful.,"""great medication. works fast and lasts all da...",10.0,46,"chlorpheniramine maleate, methscopolamine nitrate"
2458,enulose,12</span> users found this comment helpful.,"""headache, flatulence""",10.0,12,lactulose
2560,camrese,2</span> users found this comment helpful.,"""i&#039;m 46 and i have been having menopausal...",2.0,2,ethinyl estradiol / levonorgestrel


In [427]:
# replace with np.nan
reviews.loc[tmp.index, 'condition'] = 'nan'
reviews.loc[tmp.index].head()

Unnamed: 0,brand,condition,review,rating,useful_count,drug
1471,triple antibiotic,,"""used this product as well as having used neos...",7.0,11,"bacitracin, neomycin, and polymyxin B"
1480,ultram odt,,"""my experience with ultram is fantastic. i sev...",10.0,44,tramadol
1675,uricalm,,"""great medication. works fast and lasts all da...",10.0,46,"chlorpheniramine maleate, methscopolamine nitrate"
2458,enulose,,"""headache, flatulence""",10.0,12,lactulose
2560,camrese,,"""i&#039;m 46 and i have been having menopausal...",2.0,2,ethinyl estradiol / levonorgestrel


## Address combination of drugs within a product
Some products have a combination of two or more active ingredients (drugs), for instance `acetaminophen / aspirin / caffeine / salicylamide`.
In order to add these reviews to the Yellow Card data frame, each row has to include one unique drug and all its reviews. This involves the following steps:
- Identify brands that have a combination of drugs and create a subset data frame `combos`
- `.explode()` on `combos`, so that the combinations are separated, where each drug in that combination has a copy of the original reviews
- Map drug names to DrugBank IDs
- Regroup the reviews by DrugBank ID 

### Address combination products
#### Convert ',' and 'and' to '/'
Most of the combinations are separated by `'/'`, however some use `','`, `'with'` and `'and'`. I decided to first convert these to `'/'`:

In [428]:
# look at rows containing the word and
reviews[reviews.drug.str.contains('and')].head()

Unnamed: 0,brand,condition,review,rating,useful_count,drug
1469,ibudone,pain,"""i just started taking this medicine. i have ...",7.0,27,hydrocodone and ibuprofen
1470,triple antibiotic,bacterial skin infection,"""aaa ointment. it like... gives you wolverine ...",10.0,0,"bacitracin, neomycin, and polymyxin B"
1471,triple antibiotic,,"""used this product as well as having used neos...",7.0,11,"bacitracin, neomycin, and polymyxin B"
2774,vanoxide-hc,acne,"""i use vanoxide after i have been around anima...",10.0,0,benzoyl peroxide and hydrocortisone topical
2984,claritin-d 24 hour,nasal congestion,"""big mistake taking my first claritin d 24 at ...",5.0,1,loratadine and pseudoephedrine


In [429]:
# create dictionary for replacements
combo_replace = {'/': ' / ', ', and ': ' and ', ', ': ' and ',
                 ' and ': ' / ', ' with ': ' / '}

for i, j in combo_replace.items():
    reviews.drug = reviews.drug.str.replace(i, j)

# show first 10 values
list(reviews.drug.unique())[:20]

['medroxyprogesterone',
 'phenylephrine',
 'silodosin',
 'resorcinol  /  sulfur',
 'methylphenidate',
 'clemastine',
 'pentosan polysulfate sodium',
 'metoprolol',
 'tofacitinib',
 'rituximab',
 'torsemide',
 'ritonavir',
 'levorphanol tartrate',
 'everolimus',
 'fluorouracil topical',
 'fenoprofen',
 'metronidazole',
 'sumatriptan',
 'hydrocodone / ibuprofen',
 'bacitracin / neomycin / polymyxin B']

#### Corrections
##### Oral
I noticed that there are drugs called `oral liquid` from an error while scraping. So, I extracted the rows with the word `oral`, and copied the `brand` values to overwrite the incorrect `drug` values.

In [430]:
tmp = reviews[reviews['drug'].str.contains('oral')]
tmp.head()

Unnamed: 0,brand,condition,review,rating,useful_count,drug
20922,aspirin / chlorpheniramine / phenylephrine,sinus symptoms,"""i love taking this before bedtime, or when i ...",8.0,2,oral tablet effervescent
27673,diphenhydramine / phenylephrine,allergic rhinitis,"""this medicine for hayfever and cold leaky sin...",10.0,2,oral liquid / oral suspension extended release...
38108,enteragam,irritable bowel syndrome,"""i was diagnosed with collagenous colitis/micr...",10.0,15,immune globulin oral
38109,enteragam,irritable bowel syndrome,"""really liked how enteragam affected my digest...",9.0,7,immune globulin oral
38110,enteragam,irritable bowel syndrome,"""it took almost two weeks of two packets a day...",9.0,12,immune globulin oral


In [431]:
reviews.loc[tmp.index, 'drug'] = tmp.brand
reviews.loc[tmp.index].head()

Unnamed: 0,brand,condition,review,rating,useful_count,drug
20922,aspirin / chlorpheniramine / phenylephrine,sinus symptoms,"""i love taking this before bedtime, or when i ...",8.0,2,aspirin / chlorpheniramine / phenylephrine
27673,diphenhydramine / phenylephrine,allergic rhinitis,"""this medicine for hayfever and cold leaky sin...",10.0,2,diphenhydramine / phenylephrine
38108,enteragam,irritable bowel syndrome,"""i was diagnosed with collagenous colitis/micr...",10.0,15,enteragam
38109,enteragam,irritable bowel syndrome,"""really liked how enteragam affected my digest...",9.0,7,enteragam
38110,enteragam,irritable bowel syndrome,"""it took almost two weeks of two packets a day...",9.0,12,enteragam


##### Other routes of administration
I decided to also check for other routes of administration:
- Topical
- Injection
- Ophthalmic
- Subcutaneous
- Rectal
- Sublingual
- Intravenous
- Nasal
- Ocular
- Buccal
- Inhalation
- Transdermal
- Otic

`drug` column values with these words included have the same format of drug name + route, e.g. fluticasone inhalation, estradiol transdermal. I simply removed the words above from the `drug` column using `.replace()`.


In [432]:
# create dictionary for .replace()
routes_replace = {'topical': '', 'injection': '', 'ophthalmic': '', 
                  'subcutaneous': '', 'rectal': '', 'sublingual': '',
                  'intravenous': '', 'nasal': '', 'ocular': '', 
                  'buccal': '', 'inhalation': '', 'transdermal': '',
                  ' otic': '', 'extended release tablets': '',
                  'tablet': '', 'tablets': '', 'combination': '',
                  'electrolyte solution': '',  'vaginal': '',
                  'intrauterine system': '', 'system': '',
                  'implant': '', ' oral': ''}

# replace drug column looping through each key-value pair in dictionary
for i, j in routes_replace.items():
    reviews.drug = reviews.drug.str.replace(i, j)

# remove space
reviews.drug = reviews.drug.apply(lambda x: x.strip())

### Manual corrections
While skimming through the drug values, I noticed a few minor errors and performed the following manual corrections.

In [433]:
manual_replace = {'(klye-oh-kwin-ol)': '', '(klye-oh-KWIN-ol)': '',
                  'diltiazembrand name': 'diltiazem', 'diltiazemBrand Name': 'diltiazem',
                  'polymyxin B sulfate': 'polymyxin B',
                  'polymyxin B': 'polymyxin b',
                  'ethinyl estradiol': 'ethinylestradiol',
                  'atorvastatin calcium': 'atorvastatin',
                  'polymyxin B': 'polymyxin b',
                  'ethinyl estradiol': 'ethinylestradiol',
                  'pseudoephedrine sulfate': 'pseudoephedrine',
                  'amlodipine besylate': 'amlodipine',
                  'ethyl estradiol': 'ethinylestradiol',
                  'von Willebrand factor complex': 'von willebrand factor',
                  'esomeprazole sodium': 'esomeprazole',
                  'acyclovir': 'aciclovir',
                  'emergency contraceptive': '',
                  'viscous': '',
                  '200mg': '',
                  'protein-bound': '',
                  'omega 3 supplement': 'omega-3',
                  '>': '',
                  'for reconstitution / cream / gel / liquid / lotion / soap / spray': '',
                  'hypertrichosis of eyelid.  see below for a comprehensive list of adverse effects.': 'bimatoprost',
                  'hypertrichosis of eyelid.  See below for a comprehensive list of adverse effects.': 'bimatoprost',
                  '(': '', ')': ''}

# replace drug column looping through each key-value pair in dictionary
for i, j in manual_replace.items():
    reviews.drug = reviews.drug.str.replace(i, j)

# remove space
reviews.drug = reviews.drug.apply(lambda x: x.strip())

#### Separate combination and non-combination products
Here I extracted rows where the product is a combination product (more than one active ingredient/ drug), and saved it as a new data frame `combos`.

In [434]:
# separate names with multiple ingredients (contains a slash /)
combos = reviews[reviews['drug'].str.contains(' / ')]
print(len(combos))
combos.head()

41885


Unnamed: 0,brand,condition,review,rating,useful_count,drug
868,resorcinol / sulfur,acne,"""have loved this product since i began using i...",10.0,3,resorcinol / sulfur
1469,ibudone,pain,"""i just started taking this medicine. i have ...",7.0,27,hydrocodone / ibuprofen
1470,triple antibiotic,bacterial skin infection,"""aaa ointment. it like... gives you wolverine ...",10.0,0,bacitracin / neomycin / polymyxin b
1471,triple antibiotic,,"""used this product as well as having used neos...",7.0,11,bacitracin / neomycin / polymyxin b
1657,uricalm,dysuria,"""i had a uti and it helped with my pain but on...",1.0,0,chlorpheniramine maleate / methscopolamine nit...


In [435]:
non_combos = reviews.copy()
non_combos.drop(combos.index, inplace = True)
print(len(non_combos))
non_combos.head()

173178


Unnamed: 0,brand,condition,review,rating,useful_count,drug
0,medroxyprogesterone,abnormal uterine bleeding,"""been on the depo injection since january 2015...",3.0,4,medroxyprogesterone
2,medroxyprogesterone,amenorrhea,"""i&#039;m 21 years old and recently found out ...",10.0,11,medroxyprogesterone
3,medroxyprogesterone,abnormal uterine bleeding,"""i have been on the shot 11 years and until a ...",8.0,7,medroxyprogesterone
4,medroxyprogesterone,birth control,"""ive had four shots at this point. i was on bi...",9.0,12,medroxyprogesterone
5,medroxyprogesterone,abnormal uterine bleeding,"""i had a total of 3 shots. i got my first one ...",1.0,4,medroxyprogesterone


In [436]:
# check that original data frame is separated correctly
len(non_combos) + len(combos) == len(reviews)

True

#### .explode() on combos
- I created a custom function `separate_name()` that converts the drugs in `a / b` form to a list `['a', 'b']`.
- Then, I applied that function to the drug column.
- This way, I was able to use the `.explode()` function to separated out the individual drugs from the combination products.

In [437]:
# function to separate drug name into list of different drugs
def separate_name(x):
    return x['drug'].split(' / ')

# apply function to drug column
combos['drug'] = combos.apply(separate_name, axis = 1)

# duplicate row so that each row has one drug from list of drugs
combos = combos.apply(lambda x: x.explode())

# remove space
combos.drug = combos.drug.apply(lambda x: x.strip())

# convert to lower case
combos.drug = combos.drug.str.lower()

# reset index
combos.reset_index(drop = True, inplace = True)

print(len(combos))
combos.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  combos['drug'] = combos.apply(separate_name, axis = 1)


91501


Unnamed: 0,brand,condition,review,rating,useful_count,drug
0,resorcinol / sulfur,acne,"""have loved this product since i began using i...",10.0,3,resorcinol
1,resorcinol / sulfur,acne,"""have loved this product since i began using i...",10.0,3,sulfur
2,ibudone,pain,"""i just started taking this medicine. i have ...",7.0,27,hydrocodone
3,ibudone,pain,"""i just started taking this medicine. i have ...",7.0,27,ibuprofen
4,triple antibiotic,bacterial skin infection,"""aaa ointment. it like... gives you wolverine ...",10.0,0,bacitracin


In [438]:
# view first 20 drug values
list(combos.drug.unique())[:20]

['resorcinol',
 'sulfur',
 'hydrocodone',
 'ibuprofen',
 'bacitracin',
 'neomycin',
 'polymyxin b',
 'chlorpheniramine maleate',
 'methscopolamine nitrate',
 'desloratadine',
 'pseudoephedrine',
 'mestranol',
 'norethindrone',
 'ethinylestradiol',
 'levonorgestrel',
 'acetaminophen',
 'dextromethorphan',
 'doxylamine',
 'benzoyl peroxide',
 'hydrocortisone']

#### Non-combos data frame
Here's a brief overview of the other half of the reviews `non_combos`, with minor adjustments.

In [439]:
non_combos.drug = non_combos.drug.str.lower()
non_combos.drug = non_combos.drug.str.replace('omega 3 supplement', 'omega-3')
non_combos.head()

Unnamed: 0,brand,condition,review,rating,useful_count,drug
0,medroxyprogesterone,abnormal uterine bleeding,"""been on the depo injection since january 2015...",3.0,4,medroxyprogesterone
2,medroxyprogesterone,amenorrhea,"""i&#039;m 21 years old and recently found out ...",10.0,11,medroxyprogesterone
3,medroxyprogesterone,abnormal uterine bleeding,"""i have been on the shot 11 years and until a ...",8.0,7,medroxyprogesterone
4,medroxyprogesterone,birth control,"""ive had four shots at this point. i was on bi...",9.0,12,medroxyprogesterone
5,medroxyprogesterone,abnormal uterine bleeding,"""i had a total of 3 shots. i got my first one ...",1.0,4,medroxyprogesterone


### Combine combos and non_combos back together
After exploding the combinations, I now concatinate the two data frames, `combos` and `non_combos` back together.

In [440]:
# combine new combos and non_combos back together
rv_exploded = pd.concat([combos, non_combos], axis = 0, ignore_index = True)
rv_exploded.reset_index(drop = True, inplace = True)
print(len(rv_exploded))
rv_exploded.head()

264679


Unnamed: 0,brand,condition,review,rating,useful_count,drug
0,resorcinol / sulfur,acne,"""have loved this product since i began using i...",10.0,3,resorcinol
1,resorcinol / sulfur,acne,"""have loved this product since i began using i...",10.0,3,sulfur
2,ibudone,pain,"""i just started taking this medicine. i have ...",7.0,27,hydrocodone
3,ibudone,pain,"""i just started taking this medicine. i have ...",7.0,27,ibuprofen
4,triple antibiotic,bacterial skin infection,"""aaa ointment. it like... gives you wolverine ...",10.0,0,bacitracin


### Update drugbank.csv
#### Scrape (again) on DrugBank
Now we rescrape to get more DrugBank IDs, using the same method from notebook 1b.

In [446]:
# create list to scrape
not_found = list(rv_exploded[rv_exploded.db_id.isnull()].drug.unique())
not_found[:20]

['sulfur',
 'polymyxin b',
 'chlorpheniramine maleate',
 'methscopolamine nitrate',
 'norethindrone',
 'acetaminophen',
 'niacin',
 'bacitracin',
 'propoxyphene',
 'piperonyl butoxide',
 'pyrethrins',
 'sulfacetamide',
 'influenza virus vaccine',
 'inactivated',
 'methscopolamine',
 'belladonna',
 'ethynodiol diacetate',
 'norethindrone acetate',
 'ferrous fumarate',
 'formoterol']

In [447]:
# import libraries
import requests
import bs4
from bs4 import BeautifulSoup
import numpy as np
from time import sleep
import random
import pandas as pd
from tqdm import tqdm

In [448]:
# set up selenium
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# PATH = "/Applications/chromedriver"
# driver = webdriver.Chrome(PATH)

options = webdriver.ChromeOptions()
chrome_options=options
options.add_argument('--enable-javascript')
driver = webdriver.Chrome(executable_path='/Applications/chromedriver', options=options)

driver.get('https://go.drugbank.com/drugs/DB00316')

In [449]:
db_ids = []

for x in tqdm(not_found):
    try:
        # click on search box
        driver.find_element_by_xpath('/html/body/header/nav[2]/div[1]/form/div[2]').click()
        # enter search keyword
        searchbox = driver.find_element_by_xpath('//*[@id="query"]')
        # clear search box
        searchbox.clear()
        # enter search word
        searchbox.send_keys(x)
        # click enter
        searchbox.send_keys(Keys.ENTER)
        
        # wait 5 seconds
        sleep(5)
        
        try:
            # get DrugBank number and add to dataframe
            num = driver.find_element_by_xpath('/html/body/main/div/div/div[2]/div[2]/dl[1]/dd[2]').text
            db_ids.append(num)
        
        except:
            # get DrugBank number using class
            num = driver.find_element_by_class_name("col-xl-4 col-md-9 col-sm-8").text
            db_ids.append(num)
            
        # sleep
        sleep(random.randint(5, 8))
    
    except:
        db_ids.append(np.nan)

100%|██████████| 511/511 [1:35:49<00:00, 11.25s/it]


In [450]:
print(len(db_ids))
db_ids[:10]

511


[nan,
 'DB00781',
 'DB01114',
 'DB11315',
 'DB00717',
 'DB00316',
 'DB00627',
 'DB00626',
 'DB00647',
 'DB09350']

I added the scraped DrugBank IDs to the original drugbank dataframe `db` using `.append()`.

In [455]:
# load csv
db = pd.read_csv('drugbank_final.csv', dtype = object)

for i in range(len(db_ids)):
    db = db.append({'drug': not_found[i], 'yc_id': np.nan, 'db_id': db_ids[i],
                   'target': np.nan, 'drug_cat': np.nan}, ignore_index = True)
db.shape

(3518, 5)

#### Remove duplicate and null rows
Some more cleaning out rows with duplicated drugs and those without drugbank ids.

In [456]:
# showing rows with null values
db.tail()

Unnamed: 0,drug,yc_id,db_id,target,drug_cat
3513,eyebright,,,,
3514,humulin r u-500 concentrated,,,,
3515,red yeast rice,,,,
3516,amcinonide,,DB00288,,
3517,meningococcal group b vaccine,,,,


In [457]:
# showing rows with the same drug
db[db.drug.str.contains('bacitracin')]

Unnamed: 0,drug,yc_id,db_id,target,drug_cat
169,bacitracin,255876943.0,DB00626,"['C55-isoprenyl pyrophosphate', 'Insulin-degra...","['X gen pharmaceuticals inc', 'App pharmaceuti..."
1994,bacitracin,,DB00626,,
2503,bacitracin,,DB00626,,
3014,bacitracin,,DB00626,,


In [466]:
# drop rows without db id
db = db[db.db_id.notnull()]

# look at drugs that appear in multiple rows
dups = list(db.drug.value_counts()[db.drug.value_counts().values > 1].index)
dups[:10]

['bacitracin',
 'bendamustine hydrochloride',
 'ferrous fumarate',
 'aldesleukin',
 'metaproterenol',
 'flibanserin',
 'ery-tab',
 'oxiconazole',
 'terconazole',
 'ramelteon']

In [480]:
# for each duplicated drug, drop all but first row
for drug in dups:
    n = db[db.drug == drug].index
    db.drop(index = n[1:], inplace = True)
    
# check for duplicate drugs
db.drug.value_counts()[db.drug.value_counts().values > 1]

Series([], Name: drug, dtype: int64)

In [482]:
db.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2380 entries, 0 to 3005
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   drug      2380 non-null   object
 1   yc_id     1987 non-null   object
 2   db_id     2380 non-null   object
 3   target    1565 non-null   object
 4   drug_cat  1987 non-null   object
dtypes: object(5)
memory usage: 111.6+ KB


#### Find missing information on target and categories
A final scrape to fill in the missing values on `target` and `drug_cat` columns.

In [511]:
# drugbank ids where row has no information on target/ drug category
x = list(db.db_id.values[db.target.isnull() | db.drug_cat.isnull()])
x[:10]

['DB13573',
 'DB03166',
 'DB09055',
 'DB11205',
 'DB13518',
 'DB11100',
 'DB13595',
 'DB15477',
 'DB13509',
 'DB15573']

In [497]:
# set up headers
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36"
accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
encoding = "gzip, deflate, br"
lang = "en-GB,en;q=0.9"
headers = {'accept': accept, 'accept-encoding': encoding,
           'accept-language': lang, 'user-agent': user_agent}

# create empty lists to store scraped information
targets = []
drug_cats = []

# loop throguh each DrugBank ID scraped earlier above
for db_id in tqdm(x):

    # add the drugbank id to the url to access each drug's page
    url = f'https://go.drugbank.com/drugs/{db_id}'
    r = requests.get(url, headers=headers)
    
    # get ahfs code (drug category)
    try:
        soup = BeautifulSoup(r.text, 'html.parser')
        results = soup.find_all("ul", class_ = "list-unstyled table-list")
        drug_cats.append(results[3].text)
    except:
        drug_cats.append(np.nan)
      
    
    # read all the tables on the webpage
    try:    
        # get dataframes
        dfs = pd.read_html(r.text)
    
    # error message that there was no table for the specific url
    except:
        print('no tables')
        targets.append(np.nan)
    
    else:
        target = 0
        
        # loop through each dataframe stored in dfs
        for df in dfs:
            
            # extract targets
            if df.columns[0] == 'Target':
                target = 1
                targets.append([x[1:] for x in df['Target'].values])

            else:
                pass
            
        # once all the dfs are looped through for each drug,
        # if no target table is found, append nan value
        if target == 0:
            targets.append(np.nan)
    
        # sleep
        sleep(random.randint(3, 6))

100%|██████████| 815/815 [1:12:29<00:00,  5.34s/it]


In [498]:
# check length
len(targets), len(drug_cats)

(815, 815)

In [512]:
# update db

# loop through each id in list x (those with missing target or category)
for e, value in tqdm(enumerate(x)):
    
    # get index for each db id
    i = db[db.db_id == value].index
    
    # loop through each index (some may return multiple indices)
    for j in i:
        
        # change target and drug_cat values frmo scraped lists
        db.loc[j, 'target'] = targets[e]
        db.loc[j, 'drug_cat'] = drug_cats[e]

815it [00:01, 544.25it/s]


In [514]:
db.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2380 entries, 0 to 3005
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   drug      2380 non-null   object
 1   yc_id     1987 non-null   object
 2   db_id     2380 non-null   object
 3   target    1876 non-null   object
 4   drug_cat  1799 non-null   object
dtypes: object(5)
memory usage: 191.6+ KB


In [522]:
db.describe()

Unnamed: 0,drug,yc_id,db_id,target,drug_cat
count,2380,1987,2380,1876,1799
unique,2380,1987,2195,1330,759
top,triazolam,724484677,DB00066,['DNA'],['nan']
freq,1,1,4,20,502


In [516]:
db.head()

Unnamed: 0,drug,yc_id,db_id,target,drug_cat
0,abacavir,40046536,DB01048,"['Reverse transcriptase/RNaseH', 'HLA class I ...",['08:18.08.20 — Nucleoside and Nucleotide Reve...
1,abatacept,561378321,DB01281,"['T-lymphocyte activation antigen CD80', 'T-ly...","['Bristol-Myers Squibb Co.', 'Celltrion Inc.',..."
2,abciximab,231911819,DB00054,"['Integrin beta-3', 'Integrin alpha-IIb', 'Low...",['92:00.00 — Miscellaneous Therapeutic Agents']
3,abemaciclib,369408139,DB12001,"['Cyclin-dependent kinase 4', 'Cyclin-dependen...",['10:00.00 — Antineoplastic Agents']
4,abiraterone,968368347,DB05812,"['Steroid 17-alpha-hydroxylase/17,20 lyase']",['10:00.00 — Antineoplastic Agents']


#### Extract lists of single values
I made a custome function that extracts lists with single elements (eg. `['nan']`) to `'nan'`.

In [552]:
#  create custom function
def extract_list(x):
    try:
        if len(x) == 1:
            return x[0]
        else:
            return x
    except:
        return x

# apply function to db_id, target and drug_cat columns
db.db_id = db.db_id.apply(extract_list)
db.target = db.target.apply(extract_list)
db.drug_cat = db.drug_cat.apply(extract_list)
db.head()

Unnamed: 0,drug,yc_id,db_id,target,drug_cat
0,abacavir,40046536,DB01048,"[Reverse transcriptase/RNaseH, HLA class I his...",08:18.08.20 — Nucleoside and Nucleotide Revers...
1,abatacept,561378321,DB01281,"[T-lymphocyte activation antigen CD80, T-lymph...","[Bristol-Myers Squibb Co., Celltrion Inc., E.R..."
2,abciximab,231911819,DB00054,"[Integrin beta-3, Integrin alpha-IIb, Low affi...",92:00.00 — Miscellaneous Therapeutic Agents
3,abemaciclib,369408139,DB12001,"[Cyclin-dependent kinase 4, Cyclin-dependent k...",10:00.00 — Antineoplastic Agents
4,abiraterone,968368347,DB05812,"Steroid 17-alpha-hydroxylase/17,20 lyase",10:00.00 — Antineoplastic Agents


In [553]:
db.describe()

Unnamed: 0,drug,yc_id,db_id,target,drug_cat
count,2380,1987,2380,1876,1799.0
unique,2380,1987,2195,1271,700.0
top,triazolam,724484677,DB00066,DNA,
freq,1,1,4,26,502.0


#### Extract drug category codes
As we can see from row index 1 above, the drug category value is incorrect. The manufacturer name, `Bristol-Myers Squibb Co.`, was scraped instead of the actual AHFS drug category code.

I made a custom function that:
- Extracts the code from each cell (e.g. `10:00.00`)
- Replaces manufacturer names with `nan`
- If there are multiple codes, return the codes as a list

In [563]:
# example
db.loc[1,'drug_cat'][0]

'Bristol-Myers Squibb Co.'

In [583]:
# create function
def extract_category_codes(x):
    try:
        # to extract single vategory value
        if x[0].isdigit():
            return x.split('—')[0].strip()
        # to extact multiple category values as a list
        elif x[0][0].isdigit():
            return [y.split('—')[0].strip() for y in x]
        # return nan if company names, e.g. 'Roche pharmaceuticals'
        else:
            return np.nan
    except:
        return np.nan

# apply to drug_cat column
db.drug_cat = db.drug_cat.apply(extract_category_codes)
db.head()

Unnamed: 0,drug,yc_id,db_id,target,drug_cat
0,abacavir,40046536,DB01048,"[Reverse transcriptase/RNaseH, HLA class I his...",08:18.08.20
1,abatacept,561378321,DB01281,"[T-lymphocyte activation antigen CD80, T-lymph...",
2,abciximab,231911819,DB00054,"[Integrin beta-3, Integrin alpha-IIb, Low affi...",92:00.00
3,abemaciclib,369408139,DB12001,"[Cyclin-dependent kinase 4, Cyclin-dependent k...",10:00.00
4,abiraterone,968368347,DB05812,"Steroid 17-alpha-hydroxylase/17,20 lyase",10:00.00


In [584]:
# save as new db_final.csv
# db.to_csv('drugbank_final_2.csv')

### Add DrugBank ID
#### Add existing IDs
- To add drugbank ids to `rv_exploded`, I first load in the csv and create a dictionary, the same way we have done previously.
- Then, I created a new column `db_id` in `rv_exploded` that maps drug names to the DrugBank IDs.

In [586]:
# create dictionary
drug_id = dict(zip(db.drug, db.db_id))

# create drugbank id column
rv_exploded['db_id'] = rv_exploded.drug.map(drug_id)
rv_exploded.head()

Unnamed: 0,brand,condition,review,rating,useful_count,drug,db_id
0,resorcinol / sulfur,acne,"""have loved this product since i began using i...",10.0,3,resorcinol,DB11085
1,resorcinol / sulfur,acne,"""have loved this product since i began using i...",10.0,3,sulfur,
2,ibudone,pain,"""i just started taking this medicine. i have ...",7.0,27,hydrocodone,DB00956
3,ibudone,pain,"""i just started taking this medicine. i have ...",7.0,27,ibuprofen,DB01050
4,triple antibiotic,bacterial skin infection,"""aaa ointment. it like... gives you wolverine ...",10.0,0,bacitracin,DB00626


In [587]:
# more manual corrections
rv_exploded.drug = rv_exploded.drug.str.replace('bacitracin zinc', 'bacitracin')
rv_exploded.drug = rv_exploded.drug.str.replace('belladonna alkaloids', 'belladonna')
rv_exploded.drug = rv_exploded.drug.str.replace('chondroitin', 'chondroitin sulfate')
rv_exploded.condition = rv_exploded.condition.str.replace('cance', 'cancer')

In [589]:
rv_exploded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264679 entries, 0 to 264678
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   brand         264679 non-null  object 
 1   condition     264679 non-null  object 
 2   review        264679 non-null  object 
 3   rating        264679 non-null  float64
 4   useful_count  264679 non-null  int64  
 5   drug          264679 non-null  object 
 6   db_id         259655 non-null  object 
dtypes: float64(1), int64(1), object(5)
memory usage: 14.1+ MB


In [595]:
# remove rows where db_id is null
rv_exploded.dropna(inplace = True)
rv_exploded.reset_index(inplace = True, drop = True)
rv_exploded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259655 entries, 0 to 259654
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   brand         259655 non-null  object 
 1   condition     259655 non-null  object 
 2   review        259655 non-null  object 
 3   rating        259655 non-null  float64
 4   useful_count  259655 non-null  int64  
 5   drug          259655 non-null  object 
 6   db_id         259655 non-null  object 
dtypes: float64(1), int64(1), object(5)
memory usage: 13.9+ MB


### Group rv_exploded by db_id
Now that the dataframe has no null values, and all the gaps are filled in, I group the reviews by drugbank ID using `.grouby()`. `.agg(pd.Seriies.tolist)` returns, for each db_id, a list of all the grouped values.

`pd.unique` retains only the unique values within each aggregated list of values.

In [602]:
# group rv_exploded on db_id
rv_group = rv_exploded.groupby('db_id').agg(pd.Series.tolist)

# get only unique values for drug, condition, brand columns
rv_group.drug = rv_group.drug.apply(pd.unique)
rv_group.condition = rv_group.condition.apply(pd.unique)
rv_group.brand = rv_group.brand.apply(pd.unique)

rv_group.head()

Unnamed: 0_level_0,brand,condition,review,rating,useful_count,drug
db_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
DB00002,"[erbitux, cetuximab]","[colorectal cancer, head and neck cancer, squa...","[""i have stage 4 colon cancer with liver mets....","[9.0, 6.0, 1.0, 7.0, 8.0, 9.0, 8.0, 7.0, 9.0, ...","[2, 2, 1, 8, 7, 13, 47, 43, 2, 2, 1, 8, 7, 13,...",[cetuximab]
DB00003,"[dornase alfa, pulmozyme]",[cystic fibrosis],"[""my outcome/experience with pulmozyme was goo...","[10.0, 10.0]","[9, 9]",[dornase alfa]
DB00005,"[etanercept, enbrel]","[rheumatoid arthritis, psoriatic arthritis, pl...","[""took enbrel with methotrexate for four year...","[5.0, 5.0, 10.0, 1.0, 10.0, 5.0, 6.0, 9.0, 10....","[18, 5, 10, 14, 5, 40, 5, 5, 47, 11, 5, 30, 36...",[etanercept]
DB00006,"[bivalirudin, angiomax]",[percutaneous coronary intervention],"[""excellent one time treatment"", ""excellent on...","[10.0, 10.0]","[0, 0]",[bivalirudin]
DB00007,"[leuprolide, lupron depot-ped, lupron, eligard...","[uterine fibroids, prostate cancer, endometrio...","[""i was so skeptical about these shots, scared...","[10.0, 10.0, 3.0, 7.0, 2.0, 6.0, 1.0, 4.0, 6.0...","[1, 3, 0, 0, 0, 2, 10, 14, 0, 8, 4, 12, 20, 5,...",[leuprolide]


In [603]:
# rv_group.to_csv('rv_grouped.csv')

## Data cleaning on rv_group
### Rating column
Initally, I planned on utilising the `useful_count` column to calculate a weighted mean for each set rating scores. However, because a) not all ratings have a useful count, and b) some useful counts are 0, incorporating the useful counts only resulted in inaccurate means of near-zero values.

Therefore, I decided to drop the `useful_count` column and calculate the arithmetic mean using the `.mean()` function in the `statistics` library.

In [613]:
# # create function to normalize counts
# def normalize_counts(x):
#     return [(i / x['total_count']) for i in x['useful_count']]

# rv_counts['useful_count'] = rv_counts.apply(normalize_counts, axis = 1)
# rv_counts.head()

In [614]:
# def weighted_mean_rating(x):
#     scores = [i * j for i in x['rating'] for j in x['useful_count']]
#     return sum(scores) / len(scores)

# rv_counts['rating'] = rv_counts.apply(weighted_mean_rating, axis = 1)
# rv_counts.head()

In [612]:
# import library
import statistics

# apply on rating column
rv_group.rating = rv_group.rating.apply(lambda x: statistics.mean(x))

# drop useful_count column
rv_group.drop(columns = 'useful_count', inplace = True)
rv_group.head()

Unnamed: 0_level_0,brand,condition,review,rating,drug
db_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
DB00002,"[erbitux, cetuximab]","[colorectal cancer, head and neck cancer, squa...","[""i have stage 4 colon cancer with liver mets....",6.875,[cetuximab]
DB00003,"[dornase alfa, pulmozyme]",[cystic fibrosis],"[""my outcome/experience with pulmozyme was goo...",10.0,[dornase alfa]
DB00005,"[etanercept, enbrel]","[rheumatoid arthritis, psoriatic arthritis, pl...","[""took enbrel with methotrexate for four year...",8.220253,[etanercept]
DB00006,"[bivalirudin, angiomax]",[percutaneous coronary intervention],"[""excellent one time treatment"", ""excellent on...",10.0,[bivalirudin]
DB00007,"[leuprolide, lupron depot-ped, lupron, eligard...","[uterine fibroids, prostate cancer, endometrio...","[""i was so skeptical about these shots, scared...",6.154856,[leuprolide]


### Manual addition of sulfur
I later found more information on sulfur, and therefore added that in manually.

In [624]:
sulfur = combos[combos.drug == 'sulfur']
sulfur = sulfur.groupby('drug').agg(pd.Series.tolist)
sulfur.condition = sulfur.condition.apply(pd.unique)
sulfur.brand = sulfur.brand.apply(pd.unique)
sulfur.rating = sulfur.rating.apply(lambda x: statistics.mean(x))
sulfur.drop(columns = 'useful_count', inplace = True)
sulfur.reset_index(inplace = True)
sulfur['db_id'] = 'DB11104'
sulfur.set_index('db_id', inplace = True)
sulfur

Unnamed: 0_level_0,drug,brand,condition,review,rating
db_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
DB11104,sulfur,"[resorcinol / sulfur, coal tar / salicylic aci...","[acne, psoriasis, dandruff, seborrheic dermati...","[""have loved this product since i began using ...",8.71831


In [625]:
rv_group = rv_group.append(sulfur)

### Create number of reviews column
I also decided to create a new column that records the total number of reviews each drug has received.

In [633]:
rv_group['n_reviews'] = rv_group.review.apply(lambda x: len(x))
rv_group.head()

Unnamed: 0_level_0,brand,condition,review,rating,drug,n_reviews
db_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
DB00002,"[erbitux, cetuximab]","[colorectal cancer, head and neck cancer, squa...","[""i have stage 4 colon cancer with liver mets....",6.875,[cetuximab],16
DB00003,"[dornase alfa, pulmozyme]",[cystic fibrosis],"[""my outcome/experience with pulmozyme was goo...",10.0,[dornase alfa],2
DB00005,"[etanercept, enbrel]","[rheumatoid arthritis, psoriatic arthritis, pl...","[""took enbrel with methotrexate for four year...",8.220253,[etanercept],395
DB00006,"[bivalirudin, angiomax]",[percutaneous coronary intervention],"[""excellent one time treatment"", ""excellent on...",10.0,[bivalirudin],2
DB00007,"[leuprolide, lupron depot-ped, lupron, eligard...","[uterine fibroids, prostate cancer, endometrio...","[""i was so skeptical about these shots, scared...",6.154856,[leuprolide],381


In [634]:
rv_group.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1133 entries, DB00002 to DB11104
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   brand      1133 non-null   object 
 1   condition  1133 non-null   object 
 2   review     1133 non-null   object 
 3   rating     1133 non-null   float64
 4   drug       1133 non-null   object 
 5   n_reviews  1133 non-null   int64  
dtypes: float64(1), int64(1), object(4)
memory usage: 102.0+ KB


In [636]:
# rv_group.to_csv('reviews_final.csv')

# Next step:
Finally, reviews data frame is ready. Let's move onto notebook `2a_sql_yellow_card`.