# Customer Reviews on Drugs Purchasing and Satisfaction

Data science applications in healthcare can offer stakeholders many useful insights. Beyond more advanced applications aimed at health workers, such as predictive analytics, medical imaging, and drug research, data such as sales figures and customer reviews of healthcare products can also help pharma and medical device companies evaluate their market or product.

In this project, we'll explore the dataset from Ahmed et al. (2023). It contains 392510 unique reviews, along with the name of the drug, the user rating, the credibility of the review in the form of likes, the length of the review, the condition for which the drug was taken, and how long the drug was taken for.

The project will be divided into 4 main sections:
1. `01-EDA-and-data-cleaning` *(this one)*, where we clean the data for the later sentiment prediction and dashboard work,
2. `02-import-csv-to-postgresql`, where we load the cleaned data into a PostgreSQL database as a lighter-weight copy for the dashboard,
3. `03-drug-reviews-dashboard`, where the data is presented on a Tableau dashboard for more convenient exploration, and lastly
4. `04-sentiment-prediction`, where we use deep learning to predict sentiment towards a product.

Let's dive in, shall we?

## EDA: Exploratory Data Analysis + Data Cleaning

In [None]:
import re
import numpy as np
import pandas as pd
from textblob import TextBlob

import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('DrugReviews.csv')
pd.options.display.max_colwidth = 50

In [3]:
df

Unnamed: 0,MedicineName,MedicineFor,ReviewDate,UserName,IntakeTime,Reviews,ReviewLength,Rating,NumberOfLikes
0,12 Hour Nasal Decongestant Spray,For Nasal Congestion,26-Jan-21,xano,Not Specified,This is very effective IF you can get the cove...,52,6,0
1,12 Hour Nasal Decongestant Spray,For Nasal Congestion,19-Aug-22,Breat...,Taken for 1 to 2 years,Actually I use the generic brand of the 12 hou...,319,10,0
2,12 Hour Nasal Decongestant Spray,For Nasal Congestion,28-Apr-18,Abe,Taken for less than 1 month,Cap took 20 minutes to open process was frustr...,373,1,0
3,5-HTP,For Anxiety,3-May-20,Andres,Taken for less than 1 month,Hi everyone\n'10 / 105-HTPFor Anxiety23129-Oct...,623,10,345
4,5-HTP,For Anxiety,11-Jul-19,Shawn,Not Specified,Took SSRI (Prozac) for Anxiety/Depression for ...,156,9,229
...,...,...,...,...,...,...,...,...,...
392505,ZzzQuil for Insomnia,Not Mentioned,23-Sep-20,emano,Taken for less than 1 month,I hate zzzquill. I took multiple doses one nig...,218,1,8
392506,ZzzQuil for Insomnia,Not Mentioned,26-Sep-22,ZzzQu...,Taken for less than 1 month,Desperately in need of sleep & decided to take...,234,1,3
392507,ZzzQuil for Insomnia,Not Mentioned,12-Jun-20,Annon...,12-Jun-20,I have insomnia for the past several months an...,228,1,8
392508,ZzzQuil for Insomnia,Not Mentioned,18-Aug-20,Sleepy,Taken for less than 1 month,I personally wouldn’t recommend this to my wor...,325,1,6


In [4]:
df.shape

(392510, 9)

In [5]:
df['Rating'].value_counts()

Rating
10    110256
1      82266
9      54327
8      39585
2      20853
7      20323
5      19822
3      17802
6      14731
4      12545
Name: count, dtype: int64

In [6]:
df.sort_values(['NumberOfLikes', 'ReviewLength', 'Rating'], ascending=False)

Unnamed: 0,MedicineName,MedicineFor,ReviewDate,UserName,IntakeTime,Reviews,ReviewLength,Rating,NumberOfLikes
160841,Fluoxetine,Prozac (fluoxetine) for Depression,20-Aug-20,READT...,Not Specified,Even though I always said I was never going to...,487,10,3555
305810,Prozac,For Depression,20-Aug-20,READT...,Not Specified,Even though I always said I was never going to...,487,10,3555
160864,Fluoxetine,Prozac (fluoxetine) for Depression,20-Aug-20,READT...,20-Aug-20,Even though I always said I was never going to...,487,10,3545
386906,Zoloft,For Depression,9-Sep-19,Saved...,Not Specified,I’m 36 and I’ve dealt with depression my entir...,186,10,3199
324768,Sertraline,Zoloft (sertraline) for Depression,9-Sep-19,Saved...,9-Sep-19,I’m 36 and I’ve dealt with depression my entir...,186,10,3190
...,...,...,...,...,...,...,...,...,...
58783,Cefdinir,For Strep Throat,10-Jan-21,Ruth,Taken for less than 1 month,I,1,2,0
155653,Ezetimibe,For High Cholesterol,16-Dec-22,1347,Not Specified,I,1,2,0
151559,Etonogestrel,Nexplanon (etonogestrel) for Birth Control,22-Feb-17,Kirst...,Not Specified,I,1,1,0
255479,Multivitamin,Lipoflavonoid (multivitamin) for Dietary Suppl...,7-Nov-21,Anonymous,Taken for 1 to 6 months,I,1,1,0


## `IntakeTime`

In [7]:
df['IntakeTime'].value_counts()

IntakeTime
Not Specified                   143399
Taken for less than 1 month      83287
Taken for 1 to 6 months          65065
Taken for 6 months to 1 year     21971
Taken for 1 to 2 years           18733
                                 ...  
18-Jun-23                            1
7-Jun-23                             1
25-Oct-22                            1
3-Nov-21                             1
8-Sep-20                             1
Name: count, Length: 5032, dtype: int64

In [8]:
df['IntakeTime'].value_counts()[8:]

IntakeTime
2-Nov-16     22
5-Nov-09     21
28-Jul-16    20
20-Jan-12    20
27-Jan-16    19
             ..
18-Jun-23     1
7-Jun-23      1
25-Oct-22     1
3-Nov-21      1
8-Sep-20      1
Name: count, Length: 5024, dtype: int64

In [9]:
regex_find_date = re.compile(r'\d{1,2}-.{3}-\d{2}')  # raw string avoids invalid-escape warnings

df.loc[df['IntakeTime'].str.contains(regex_find_date), :]

Unnamed: 0,MedicineName,MedicineFor,ReviewDate,UserName,IntakeTime,Reviews,ReviewLength,Rating,NumberOfLikes
88,Abilify,For Bipolar Disorder,4-Jan-12,Anonymous,4-Jan-12,Starting on day two of treatment I felt progre...,112,1,13
101,Abilify,For Schizoaffective Disorder,28-Dec-20,Skeptic,28-Dec-20,Terrible drug. Caused me to have seizures with...,102,1,4
103,Abilify,For Depression,22-Oct-09,becky...,22-Oct-09,I don't think this drug is working for my depr...,90,1,14
118,Acetaminophen,Tylenol (acetaminophen) for Pain,13-Oct-09,la23409,13-Oct-09,Best pain killer ever known.,28,10,30
128,Abilify,For Major Depressive Disorder,15-Feb-18,Twing...,15-Feb-18,I haven't experienced any changes with this me...,55,1,9
...,...,...,...,...,...,...,...,...,...
392491,ZzzQuil for Insomnia,Not Mentioned,31-Jan-21,jasmine,31-Jan-21,I suffer from depression & anxiety so I also h...,352,10,24
392492,ZzzQuil for Insomnia,Not Mentioned,25-Aug-19,Feli_...,25-Aug-19,I used to work overnight so now I suffer from ...,651,1,32
392494,ZzzQuil for Insomnia,Not Mentioned,3-Oct-18,Lucky,3-Oct-18,Diphenhydramine is a fantastic sleep aid but y...,989,8,25
392499,ZzzQuil for Insomnia,Not Mentioned,27-May-17,LAG,27-May-17,I usually work a 10 hr shift in the afternoon ...,169,4,23


In `IntakeTime`, some values are dates rather than durations, and they appear to repeat the value in `ReviewDate`. Since we can't know for sure how long these users took the drug, we'll fold them into `Not Specified`.
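The "same date as `ReviewDate`" observation can be checked rather than eyeballed. The sketch below does so on a handful of toy rows modelled on the output above. The names `toy`, `date_like`, and `same_as_review` are illustrative and not part of the notebook, and the stricter `[A-Za-z]{3}` month pattern is an assumption about the date format.

```python
import pandas as pd

# Toy rows modelled on the output above (not the full dataset)
toy = pd.DataFrame({
    'ReviewDate': ['4-Jan-12', '28-Dec-20', '19-Aug-22', '26-Jan-21'],
    'IntakeTime': ['4-Jan-12', '28-Dec-20',
                   'Taken for 1 to 2 years', 'Not Specified'],
})

# Assumed date shape: day, three-letter month, two-digit year
date_like = r'\d{1,2}-[A-Za-z]{3}-\d{2}'
is_date = toy['IntakeTime'].str.contains(date_like)

# Share of date-like IntakeTime values that simply repeat ReviewDate
same_as_review = (
    toy.loc[is_date, 'IntakeTime'] == toy.loc[is_date, 'ReviewDate']
).mean()

print(int(is_date.sum()), same_as_review)
```

On the real `df`, a share close to 1.0 would support treating these values as `Not Specified`.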

In [10]:
df.loc[df['IntakeTime'].str.contains(regex_find_date),
       'IntakeTime'] = 'Not Specified'

In [11]:
df['IntakeTime'].value_counts()

IntakeTime
Not Specified                   168435
Taken for less than 1 month      83287
Taken for 1 to 6 months          65065
Taken for 6 months to 1 year     21971
Taken for 1 to 2 years           18733
Taken for 2 to 5 years           18049
Taken for 5 to 10 years           8694
Taken for 10 years or more        8276
Name: count, dtype: int64

## Drug Names: `MedicineFor` and `MedicineName`

In [12]:
pd.DataFrame(df['MedicineFor'].value_counts())[:25]

MedicineFor,count
For Birth Control,26772
For Depression,12009
For Anxiety,9791
For Weight Loss (Obesity/Overweight),8157
Not Mentioned,7573
For Pain,6796
For Acne,6666
For Insomnia,6575
For Bipolar Disorder,5614
Nexplanon (etonogestrel) for Birth Control,4472


In [13]:
pd.DataFrame(
    df.loc[df['MedicineFor']
      .str.contains('Birth Control'), 
              'MedicineFor'].value_counts())

MedicineFor,count
For Birth Control,26772
Nexplanon (etonogestrel) for Birth Control,4472
Mirena (levonorgestrel) for Birth Control,1798
Implanon (etonogestrel) for Birth Control,1570
Lo Loestrin Fe (ethinyl estradiol / norethindrone) for Birth Control,1473
...,...
Estrostep Fe (ethinyl estradiol / norethindrone) for Birth Control,1
Gemmily (ethinyl estradiol / norethindrone) for Birth Control,1
Larin 1.5/30 (ethinyl estradiol / norethindrone) for Birth Control,1
Brevicon (ethinyl estradiol / norethindrone) for Birth Control,1


In [14]:
pd.DataFrame(
    df.loc[~df['MedicineFor']
      .str.contains(' for |For '), 
               'MedicineFor'].value_counts(ascending=False))

MedicineFor,count
Not Mentioned,7573
Yaz (drospirenone / ethinyl estradiol),910
Yasmin (drospirenone / ethinyl estradiol),555
Geodon (ziprasidone),354
Gianvi (drospirenone / ethinyl estradiol),279
...,...
Zetia (ezetimibe),1
Repatha (evolocumab),1
Lodine (etodolac),1
Lo/Ovral-28 (ethinyl estradiol / norgestrel),1


In [15]:
pd.DataFrame(df['MedicineName'].value_counts(ascending=False))

MedicineName,count
Levonorgestrel,9369
Ethinyl estradiol / norethindrone,7273
Etonogestrel,6082
Ethinyl estradiol / norgestimate,5343
Ethinyl estradiol / levonorgestrel,4708
...,...
Sabril,1
Epsom Salt,1
Sodium chloride ophthalmic,1
"Rabies vaccine, purified chick embryo cell for Rabies Prophylaxis",1


In [16]:
pd.DataFrame(
    df.loc[df['MedicineName']
      .str.contains(' for '), 
              'MedicineName'].value_counts(ascending=False))[:20]

MedicineName,count
Drospirenone / ethinyl estradiol for Birth Control,1523
Drospirenone / ethinyl estradiol for Acne,647
Yaz for Birth Control,453
Yasmin for Birth Control,318
Mavyret for Hepatitis C,307
Ziprasidone for Bipolar Disorder,273
Yaz for Acne,258
Drospirenone / ethinyl estradiol for Premenstrual Dysphoric Disorder,229
Zoledronic acid for Osteoporosis,225
Zofran for Nausea/Vomiting,195


In [17]:
pd.DataFrame(
    df.loc[df['MedicineName']
      .str.contains(r'\('), 
              'MedicineName'].value_counts(ascending=False))

MedicineName,count
Sars-cov-2 (covid-19) mrna-1273 vaccine,22
Chorionic gonadotropin (hcg),14
Humulin R U-500 (Concentrated),8
C1 esterase inhibitor (human),7
Sars-cov-2 (covid-19) mrna-1273 vaccine,5
Rho (d) immune globulin,4
Boostrix (Tdap),3


### Insights for Data Cleaning

As we can see, there are four formats of data across the `MedicineName` and `MedicineFor` columns:

* Format 1 -- complete: `<Brand> (<GenericName>) for <Usage>`,
* Format 2 -- no brand: `For <Usage>`,
* Format 3 -- no usage: `<Brand> (<GenericName>)`, and
* Format 4 -- nothing given: `Not Mentioned`

To get better insight into the data, we'll split it into three new columns:

* `MedicineUsedFor`,
* `MedicineBrandName`, and
* `MedicineGenericName`

We'll parse both `MedicineFor` and `MedicineName`, then fill the three columns with whatever each of them provides.
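Before the step-by-step cleaning, here is a compact sketch of the same idea on a few sample strings. It is an alternative illustration, not the code this notebook uses: `split_medicine_for` is a hypothetical helper, and it assumes the generic name is always the trailing parenthesised part.

```python
import pandas as pd

def split_medicine_for(value):
    """Return (brand, generic, usage) for one `MedicineFor` string."""
    if value == 'Not Mentioned':                  # nothing given
        return (None, None, None)
    if value.startswith('For '):                  # usage only
        return (None, None, value[len('For '):])
    name, _, usage = value.partition(' for ')     # usage may be absent
    brand, _, generic = name.partition(' (')
    generic = generic.rstrip(')')
    return (brand, generic or None, usage or None)

examples = pd.Series([
    'Nexplanon (etonogestrel) for Birth Control',
    'For Weight Loss (Obesity/Overweight)',
    'Yaz (drospirenone / ethinyl estradiol)',
    'Not Mentioned',
])

parts = pd.DataFrame(
    [split_medicine_for(v) for v in examples],
    columns=['MedicineBrandName', 'MedicineGenericName', 'MedicineUsedFor'],
    index=examples.index,
)
print(parts)
```

Checking the `For ` prefix first keeps a usage such as `Weight Loss (Obesity/Overweight)` from being mistaken for a brand with a generic name in parentheses.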

### Data Cleaning: `MedicineFor`

In [18]:
mask_mf_1 =  df['MedicineFor'].str.contains(' for ')       ## Format 1
mask_mf_2 =  df['MedicineFor'].str.contains('For ')        ## Format 2
mask_mf_3 = ~df['MedicineFor'].str.contains(' for |For ')  ## Format 3 & 4

In [19]:
## Split `MedicineFor` into two parts:

## Treatment 1: `<Brand> (<GenericName>) for <MedicineFor>`
medicinefor_with_name = (
    df.loc[mask_mf_1, 'MedicineFor']
        .str.split(' for ', expand=True, n=1)
).rename(columns={0: "Medicine", 1: 'MedicineUsedFor'})

medicinefor_with_name[[
    'MedicineBrandName', 'MedicineGenericName']] = (
        medicinefor_with_name['Medicine']
            .str.replace(')', '', regex=False)
            .str.split(r' \(', n=1, expand=True)
)

In [20]:
medicinefor_with_name

Unnamed: 0,Medicine,MedicineUsedFor,MedicineBrandName,MedicineGenericName
70,Orencia (abatacept),Rheumatoid Arthritis,Orencia,abatacept
79,Orencia (abatacept),Rheumatoid Arthritis,Orencia,abatacept
80,Orencia (abatacept),Rheumatoid Arthritis,Orencia,abatacept
81,Orencia (abatacept),Rheumatoid Arthritis,Orencia,abatacept
110,5-HTP (5-hydroxytryptophan),Anxiety,5-HTP,5-hydroxytryptophan
...,...,...,...,...
391341,"Shingrix (zoster vaccine, inactivated)","Herpes Zoster, Prophylaxis",Shingrix,"zoster vaccine, inactivated"
391342,"Shingrix (zoster vaccine, inactivated)","Herpes Zoster, Prophylaxis",Shingrix,"zoster vaccine, inactivated"
391343,"Shingrix (zoster vaccine, inactivated)","Herpes Zoster, Prophylaxis",Shingrix,"zoster vaccine, inactivated"
391344,"Shingrix (zoster vaccine, inactivated)","Herpes Zoster, Prophylaxis",Shingrix,"zoster vaccine, inactivated"


In [21]:
medicinefor_with_name['MedicineBrandName'].unique()

array(['Orencia', '5-HTP', 'Tylenol', ..., 'Edluar', 'Zolpimist',
       'Zostavax'], dtype=object)

In [22]:
medicinefor_with_name.loc[
    medicinefor_with_name['MedicineBrandName'].str.endswith(' '),
'MedicineBrandName'].unique()

array([], dtype=object)

In [23]:
## Treatment 2: `For <MedicineFor>`
medicinefor_no_name = (
    df.loc[mask_mf_2, 'MedicineFor']
        .str.split('For ', expand=True, n=1)
).rename(columns={0: "Medicine", 1: 'MedicineUsedFor'})

## Fill the remaining columns with NaN
medicinefor_no_name[[
    'Medicine', 'MedicineBrandName', 'MedicineGenericName']] = np.nan

In [24]:
medicinefor_no_name

Unnamed: 0,Medicine,MedicineUsedFor,MedicineBrandName,MedicineGenericName
0,,Nasal Congestion,,
1,,Nasal Congestion,,
2,,Nasal Congestion,,
3,,Anxiety,,
4,,Anxiety,,
...,...,...,...,...
392448,,Insomnia,,
392449,,Insomnia,,
392450,,Insomnia,,
392451,,Insomnia,,


In [25]:
medicinefor_no_name['MedicineBrandName'].unique()

array([nan])

In [26]:
## Treatment 3: `<Brand> (<GenericName>)` and `Not Mentioned`
medicinefor_no_for = pd.DataFrame(
    df.loc[mask_mf_3, 'MedicineFor']
).rename(columns={'MedicineFor': 'Medicine'})

medicinefor_no_for[[
    'MedicineUsedFor', 'MedicineBrandName', 'MedicineGenericName']] = np.nan

## 3.1: <Brand> (<GenericName>)
medicinefor_no_for[[
    'MedicineBrandName', 'MedicineGenericName']] = (
        medicinefor_no_for['Medicine']
            .str.replace(')', '', regex=False)
            .str.split(r' \(', n=1, expand=True)
)

## 3.2: `Not Mentioned` 
medicinefor_no_for.loc[
    medicinefor_no_for['Medicine'] == 'Not Mentioned',
   ['MedicineBrandName', 'MedicineGenericName']] = np.nan

medicinefor_no_for.loc[
    medicinefor_no_for['Medicine'] != 'Not Mentioned',
                              'MedicineUsedFor'] = np.nan

In [27]:
medicinefor_no_for[medicinefor_no_for['Medicine'].str.contains(r' \(')]

Unnamed: 0,Medicine,MedicineUsedFor,MedicineBrandName,MedicineGenericName
279,5-HTP (5-hydroxytryptophan),,5-HTP,5-hydroxytryptophan
293,5-HTP (5-hydroxytryptophan),,5-HTP,5-hydroxytryptophan
306,5-HTP (5-hydroxytryptophan),,5-HTP,5-hydroxytryptophan
315,5-HTP (5-hydroxytryptophan),,5-HTP,5-hydroxytryptophan
343,5-HTP (5-hydroxytryptophan),,5-HTP,5-hydroxytryptophan
...,...,...,...,...
391277,"Shingrix (zoster vaccine, inactivated)",,Shingrix,"zoster vaccine, inactivated"
391296,"Shingrix (zoster vaccine, inactivated)",,Shingrix,"zoster vaccine, inactivated"
391298,"Shingrix (zoster vaccine, inactivated)",,Shingrix,"zoster vaccine, inactivated"
391300,"Shingrix (zoster vaccine, inactivated)",,Shingrix,"zoster vaccine, inactivated"


In [28]:
medicinefor_no_for['MedicineBrandName'].value_counts().index.unique()[:10]

Index(['Yaz', 'Yasmin', 'Geodon', 'Gianvi', 'Nikki', 'Reclast', 'Loryna',
       'Ocella', 'Triumeq', 'Xiidra'],
      dtype='object', name='MedicineBrandName')

In [29]:
medicinefor_no_for.loc[medicinefor_no_for['MedicineUsedFor'].isnull()]

Unnamed: 0,Medicine,MedicineUsedFor,MedicineBrandName,MedicineGenericName
37,Not Mentioned,,,
99,Not Mentioned,,,
116,Not Mentioned,,,
130,Not Mentioned,,,
138,Not Mentioned,,,
...,...,...,...,...
392505,Not Mentioned,,,
392506,Not Mentioned,,,
392507,Not Mentioned,,,
392508,Not Mentioned,,,


In [30]:
medicinefor_all = [
    medicinefor_with_name, 
    medicinefor_no_name, 
    medicinefor_no_for
]

medicinefor_cleaned = pd.concat(medicinefor_all).sort_index()
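Concatenating the three pieces only reproduces the original rows if the three masks claim each row exactly once. Otherwise `pd.concat` would silently duplicate or drop index labels. A minimal sketch of that check on toy data (the names `toy`, `hits`, `overlapping`, and `unclaimed` are illustrative):

```python
import pandas as pd

toy = pd.Series([
    'Nexplanon (etonogestrel) for Birth Control',
    'For Depression',
    'Yaz (drospirenone / ethinyl estradiol)',
    'Not Mentioned',
], name='MedicineFor')

# Same patterns as the masks defined for `MedicineFor`
mask_1 = toy.str.contains(' for ')
mask_2 = toy.str.contains('For ')
mask_3 = ~toy.str.contains(' for |For ')

# Number of masks that claim each row; it should be exactly 1 everywhere
hits = mask_1.astype(int) + mask_2.astype(int) + mask_3.astype(int)
overlapping = toy[hits > 1]   # rows that would be duplicated by concat
unclaimed = toy[hits == 0]    # rows that would be lost

print(len(overlapping), len(unclaimed))
```

Running the same check on the real masks would show whether any value contains both ` for ` and `For `, in which case it would be split twice.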

In [31]:
medicinefor_cleaned.sort_index()

Unnamed: 0,Medicine,MedicineUsedFor,MedicineBrandName,MedicineGenericName
0,,Nasal Congestion,,
1,,Nasal Congestion,,
2,,Nasal Congestion,,
3,,Anxiety,,
4,,Anxiety,,
...,...,...,...,...
392505,Not Mentioned,,,
392506,Not Mentioned,,,
392507,Not Mentioned,,,
392508,Not Mentioned,,,


In [32]:
## HW: Clean the code above so that it can directly change
##     the data straight from the df, not through other variables.

medicinefor_cleaned[~medicinefor_cleaned.index.isin(df.index)]

Unnamed: 0,Medicine,MedicineUsedFor,MedicineBrandName,MedicineGenericName


### Data Cleaning: `MedicineName`

In [33]:
mask_mn_1 =  df['MedicineName'].str.contains(' for ')
mask_mn_2 = ~df['MedicineName'].str.contains(' for ')

In [34]:
## Split `MedicineName` into two parts:

## Treatment 1: `<Brand/GenericName> for <Usage>`
medicinename_with_generic = pd.DataFrame(
    df.loc[mask_mn_1, 'MedicineName']
        .str.split(' for ', expand=True, n=1)
).rename(columns={0: "MedicineBrandName", 1: 'MedicineUsedFor'})

medicinename_with_generic[['Medicine', 'MedicineGenericName']] = np.nan

medicinename_with_generic = (
    medicinename_with_generic[[
        'Medicine', 'MedicineUsedFor',
        'MedicineBrandName', 'MedicineGenericName'
    ]]
)

In [35]:
medicinename_with_generic

Unnamed: 0,Medicine,MedicineUsedFor,MedicineBrandName,MedicineGenericName
374,,HIV Infection,Abacavir / dolutegravir / lamivudine,
375,,HIV Infection,Abacavir / dolutegravir / lamivudine,
376,,HIV Infection,Abacavir / dolutegravir / lamivudine,
377,,HIV Infection,Abacavir / dolutegravir / lamivudine,
378,,HIV Infection,Abacavir / dolutegravir / lamivudine,
...,...,...,...,...
392505,,Insomnia,ZzzQuil,
392506,,Insomnia,ZzzQuil,
392507,,Insomnia,ZzzQuil,
392508,,Insomnia,ZzzQuil,


In [36]:
medicinename_with_generic['MedicineBrandName'].value_counts().index.unique()[:10]

Index(['Drospirenone / ethinyl estradiol', 'Yaz', 'Yasmin', 'Ziprasidone',
       'Zoledronic acid', 'Mavyret', 'Zofran', 'Ibuprofen', 'Rybelsus',
       'Zolmitriptan'],
      dtype='object', name='MedicineBrandName')

In [37]:
pd.DataFrame(medicinename_with_generic['MedicineBrandName'].value_counts())

MedicineBrandName,count
Drospirenone / ethinyl estradiol,2521
Yaz,832
Yasmin,520
Ziprasidone,420
Zoledronic acid,312
...,...
Rhinaris,1
Rhinocort Allergy,1
Ribasphere,1
Riomet,1


In [38]:
pd.DataFrame(medicinename_with_generic.loc[
    medicinename_with_generic['MedicineBrandName']
        .str.contains(' / '), 'MedicineBrandName'].value_counts())

MedicineBrandName,count
Drospirenone / ethinyl estradiol,2521
Abacavir / dolutegravir / lamivudine,107
Acetaminophen / butalbital,18
Capsaicin / lidocaine / menthol / methyl salicylate topical,2
Resorcinol / sulfur topical,2
Hydralazine / hydrochlorothiazide / reserpine,1
Letrozole / ribociclib,1


As we can see above, several values in `MedicineName` list more than one ingredient, not to mention their "pharma-esque" names. These are generic names, so we'll move them to `MedicineGenericName`, where they belong.

In [39]:
data_move_to_generic = medicinename_with_generic.loc[
    medicinename_with_generic['MedicineBrandName']
        .str.contains(' / '), 'MedicineBrandName'].unique()
data_move_to_generic

array(['Abacavir / dolutegravir / lamivudine',
       'Acetaminophen / butalbital', 'Drospirenone / ethinyl estradiol',
       'Capsaicin / lidocaine / menthol / methyl salicylate topical',
       'Hydralazine / hydrochlorothiazide / reserpine',
       'Letrozole / ribociclib', 'Resorcinol / sulfur topical'],
      dtype=object)

In [40]:
# Move the combination (generic) names from
# `MedicineBrandName` to `MedicineGenericName`

is_generic = medicinename_with_generic['MedicineBrandName'].isin(
    data_move_to_generic)

# Copy the name into the generic column...
medicinename_with_generic.loc[is_generic, 'MedicineGenericName'] = (
    medicinename_with_generic.loc[is_generic, 'MedicineBrandName']
)

# ...then blank it out of the brand column
medicinename_with_generic.loc[is_generic, 'MedicineBrandName'] = np.nan

In [41]:
medicinename_with_generic

Unnamed: 0,Medicine,MedicineUsedFor,MedicineBrandName,MedicineGenericName
374,,HIV Infection,,Abacavir / dolutegravir / lamivudine
375,,HIV Infection,,Abacavir / dolutegravir / lamivudine
376,,HIV Infection,,Abacavir / dolutegravir / lamivudine
377,,HIV Infection,,Abacavir / dolutegravir / lamivudine
378,,HIV Infection,,Abacavir / dolutegravir / lamivudine
...,...,...,...,...
392505,,Insomnia,ZzzQuil,
392506,,Insomnia,ZzzQuil,
392507,,Insomnia,ZzzQuil,
392508,,Insomnia,ZzzQuil,


In [42]:
pd.DataFrame(medicinename_with_generic['MedicineGenericName'].value_counts())

MedicineGenericName,count
Drospirenone / ethinyl estradiol,2521
Abacavir / dolutegravir / lamivudine,107
Acetaminophen / butalbital,18
Capsaicin / lidocaine / menthol / methyl salicylate topical,2
Resorcinol / sulfur topical,2
Hydralazine / hydrochlorothiazide / reserpine,1
Letrozole / ribociclib,1


In [43]:
## Split `MedicineName` into two parts:

## Treatment 2: `<Brand>`

medicinename_brand_only = pd.DataFrame(
    df.loc[mask_mn_2, 'MedicineName']
).rename(columns={'MedicineName': 'MedicineBrandName'})

medicinename_brand_only[['Medicine', 
    'MedicineUsedFor', 'MedicineGenericName']] = np.nan

medicinename_brand_only = (
    medicinename_brand_only[[
        'Medicine', 'MedicineUsedFor',
        'MedicineBrandName', 'MedicineGenericName'
    ]]
)

In [44]:
medicinename_brand_only

Unnamed: 0,Medicine,MedicineUsedFor,MedicineBrandName,MedicineGenericName
0,,,12 Hour Nasal Decongestant Spray,
1,,,12 Hour Nasal Decongestant Spray,
2,,,12 Hour Nasal Decongestant Spray,
3,,,5-HTP,
4,,,5-HTP,
...,...,...,...,...
392448,,,ZzzQuil,
392449,,,ZzzQuil,
392450,,,ZzzQuil,
392451,,,ZzzQuil,


In [45]:
medicinename_brand_only['MedicineBrandName'].value_counts().index.unique()[:10]

Index(['Levonorgestrel ', 'Ethinyl estradiol / norethindrone ',
       'Etonogestrel ', 'Ethinyl estradiol / norgestimate ',
       'Ethinyl estradiol / levonorgestrel ', 'Nexplanon ',
       'Miconazole topical ', 'Sertraline ', 'Escitalopram ', 'Fluoxetine '],
      dtype='object', name='MedicineBrandName')

In [46]:
pd.DataFrame(medicinename_brand_only['MedicineBrandName'].value_counts())

MedicineBrandName,count
Levonorgestrel,9369
Ethinyl estradiol / norethindrone,7273
Etonogestrel,6082
Ethinyl estradiol / norgestimate,5343
Ethinyl estradiol / levonorgestrel,4708
...,...
Epsom Salt,1
Nitroprusside,1
Acetaminophen / dextromethorphan / guaifenesin / pseudoephedrine,1
Tiazac,1


In [47]:
medicinename_all = [
    medicinename_with_generic,
    medicinename_brand_only
]

medicinename_cleaned = pd.concat(medicinename_all).sort_index()

In [48]:
medicinename_cleaned

Unnamed: 0,Medicine,MedicineUsedFor,MedicineBrandName,MedicineGenericName
0,,,12 Hour Nasal Decongestant Spray,
1,,,12 Hour Nasal Decongestant Spray,
2,,,12 Hour Nasal Decongestant Spray,
3,,,5-HTP,
4,,,5-HTP,
...,...,...,...,...
392505,,Insomnia,ZzzQuil,
392506,,Insomnia,ZzzQuil,
392507,,Insomnia,ZzzQuil,
392508,,Insomnia,ZzzQuil,


### Add the Clean Columns!

In [49]:
meds_clean = medicinefor_cleaned.fillna(medicinename_cleaned)

for col in meds_clean.columns:
    meds_clean[col] = meds_clean[col].str.title()

meds_clean = meds_clean.fillna('Not Mentioned')
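One side effect of `.str.title()` is worth knowing: it lowercases every letter that does not start a word, and it treats an apostrophe as a word boundary. That is why names such as `5-HTP` and `ZzzQuil` show up as `5-Htp` and `Zzzquil` in the output that follows. The sketch below shows the effect alongside a gentler alternative. `gentle_title` is a hypothetical helper, not something this notebook uses.

```python
def gentle_title(text):
    """Capitalise only words that are entirely lowercase.

    Mixed-case or upper-case tokens (e.g. '5-HTP', 'ZzzQuil') are left alone.
    """
    return ' '.join(
        word.capitalize() if word.islower() else word
        for word in text.split(' ')
    )

# What str.title() does to already-cased names
print('5-HTP'.title())      # → 5-Htp
print('ZzzQuil'.title())    # → Zzzquil
print("Women's".title())    # → Women'S

# The gentler variant keeps them and still capitalises lowercase names
print(gentle_title('5-HTP'))                             # → 5-HTP
print(gentle_title('drospirenone / ethinyl estradiol'))  # → Drospirenone / Ethinyl Estradiol
```

Whether this matters depends on how the names are used downstream. For grouping and counting, consistent casing is what counts, so `.str.title()` may well be good enough.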

In [50]:
meds_clean

Unnamed: 0,Medicine,MedicineUsedFor,MedicineBrandName,MedicineGenericName
0,Not Mentioned,Nasal Congestion,12 Hour Nasal Decongestant Spray,Not Mentioned
1,Not Mentioned,Nasal Congestion,12 Hour Nasal Decongestant Spray,Not Mentioned
2,Not Mentioned,Nasal Congestion,12 Hour Nasal Decongestant Spray,Not Mentioned
3,Not Mentioned,Anxiety,5-Htp,Not Mentioned
4,Not Mentioned,Anxiety,5-Htp,Not Mentioned
...,...,...,...,...
392505,Not Mentioned,Insomnia,Zzzquil,Not Mentioned
392506,Not Mentioned,Insomnia,Zzzquil,Not Mentioned
392507,Not Mentioned,Insomnia,Zzzquil,Not Mentioned
392508,Not Mentioned,Insomnia,Zzzquil,Not Mentioned


In [51]:
meds_clean[~meds_clean.index.isin(df.index)]

Unnamed: 0,Medicine,MedicineUsedFor,MedicineBrandName,MedicineGenericName


In [52]:
df = pd.merge(meds_clean, df, left_index=True, right_index=True)
df = df.drop(['Medicine', 'MedicineName', 'MedicineFor'], axis=1)
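The index-on-index merge assumes each index label appears once on both sides. `pd.merge` can enforce that through its `validate` argument, which may be a useful safeguard here. A minimal sketch with toy frames (`left`, `right`, and `dup` are illustrative names):

```python
import pandas as pd

left = pd.DataFrame({'MedicineUsedFor': ['Anxiety', 'Insomnia']},
                    index=[3, 392505])
right = pd.DataFrame({'Rating': [10, 1]}, index=[3, 392505])

# Passes: every index label is unique on both sides
merged = pd.merge(left, right, left_index=True, right_index=True,
                  validate='one_to_one')

# A duplicated label on the left makes the same call raise MergeError
dup = pd.concat([left, left.iloc[[0]]])
try:
    pd.merge(dup, right, left_index=True, right_index=True,
             validate='one_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True

print(merged.shape, raised)
```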

In [53]:
df

Unnamed: 0,MedicineUsedFor,MedicineBrandName,MedicineGenericName,ReviewDate,UserName,IntakeTime,Reviews,ReviewLength,Rating,NumberOfLikes
0,Nasal Congestion,12 Hour Nasal Decongestant Spray,Not Mentioned,26-Jan-21,xano,Not Specified,This is very effective IF you can get the cove...,52,6,0
1,Nasal Congestion,12 Hour Nasal Decongestant Spray,Not Mentioned,19-Aug-22,Breat...,Taken for 1 to 2 years,Actually I use the generic brand of the 12 hou...,319,10,0
2,Nasal Congestion,12 Hour Nasal Decongestant Spray,Not Mentioned,28-Apr-18,Abe,Taken for less than 1 month,Cap took 20 minutes to open process was frustr...,373,1,0
3,Anxiety,5-Htp,Not Mentioned,3-May-20,Andres,Taken for less than 1 month,Hi everyone\n'10 / 105-HTPFor Anxiety23129-Oct...,623,10,345
4,Anxiety,5-Htp,Not Mentioned,11-Jul-19,Shawn,Not Specified,Took SSRI (Prozac) for Anxiety/Depression for ...,156,9,229
...,...,...,...,...,...,...,...,...,...,...
392505,Insomnia,Zzzquil,Not Mentioned,23-Sep-20,emano,Taken for less than 1 month,I hate zzzquill. I took multiple doses one nig...,218,1,8
392506,Insomnia,Zzzquil,Not Mentioned,26-Sep-22,ZzzQu...,Taken for less than 1 month,Desperately in need of sleep & decided to take...,234,1,3
392507,Insomnia,Zzzquil,Not Mentioned,12-Jun-20,Annon...,Not Specified,I have insomnia for the past several months an...,228,1,8
392508,Insomnia,Zzzquil,Not Mentioned,18-Aug-20,Sleepy,Taken for less than 1 month,I personally wouldn’t recommend this to my wor...,325,1,6


In [54]:
df['MedicineBrandName'].value_counts()

MedicineBrandName
Nexplanon           4515
Nexplanon           4478
Yaz                 2645
Plan B One-Step     2491
Plan B One-Step     2458
                    ... 
Baricitinib            1
Magnevist              1
Prednicot              1
Sterapred              1
Monopril               1
Name: count, Length: 5380, dtype: int64

In [55]:
df['MedicineBrandName'].value_counts().index.unique()

Index(['Nexplanon', 'Nexplanon ', 'Yaz', 'Plan B One-Step', 'Plan B One-Step ',
       'Mirena', 'Sertraline ', 'Lexapro', 'Lexapro ', 'Gabapentin ',
       ...
       'Praziquantel', 'Dehydroepiandrosterone ',
       'Bazedoxifene / Conjugated Estrogens', 'Silq Vanilla', 'Demeclocycline',
       'Baricitinib', 'Magnevist', 'Prednicot', 'Sterapred', 'Monopril'],
      dtype='object', name='MedicineBrandName', length=5380)

In [56]:
df.loc[
    df['MedicineBrandName'].str.endswith(' '),
    'MedicineBrandName'].value_counts().index.unique()

Index(['Nexplanon ', 'Plan B One-Step ', 'Sertraline ', 'Lexapro ',
       'Gabapentin ', 'Phentermine ', 'Miconazole Topical ', 'Zoloft ',
       'Depo-Provera ', 'Metronidazole ',
       ...
       'Ethinyl Estradiol / Segesterone ', 'Tenormin ',
       'Estradiol / Levonorgestrel ', 'Estazolam ', 'Pramoxine Topical ',
       'Viloxazine ', 'Mefloquine ', 'Elagolix ', 'Edarbyclor ',
       'Gemfibrozil '],
      dtype='object', name='MedicineBrandName', length=1403)

In [57]:
## Strip trailing spaces from every brand name
df['MedicineBrandName'] = df['MedicineBrandName'].str.rstrip(' ')

In [58]:
df.loc[
    df['MedicineBrandName'].str.endswith(' '),
    'MedicineBrandName'].value_counts().index.unique()

Index([], dtype='object', name='MedicineBrandName')
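The repeated `Nexplanon` and `Plan B One-Step` rows in the earlier `value_counts()` output came from these trailing spaces: `'Nexplanon'` and `'Nexplanon '` are different strings, so they are counted separately. A toy illustration (the `names` series is made up for the example):

```python
import pandas as pd

names = pd.Series(['Nexplanon', 'Nexplanon ', 'Yaz', 'Lexapro ', 'Lexapro'])

before = names.value_counts()                   # 5 distinct strings
after = names.str.rstrip(' ').value_counts()    # 3 distinct names

print(len(before), len(after))
print(after['Nexplanon'])
```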

In [59]:
df['MedicineBrandName'].value_counts()

MedicineBrandName
Nexplanon                             8993
Plan B One-Step                       4949
Lexapro                               4230
Zoloft                                3557
Depo-Provera                          3539
                                      ... 
Cinacalcet                               1
Stanback Analgesic                       1
Flumist Quadrivalent                     1
Bayer Women'S Aspirin With Calcium       1
Symlinpen 120                            1
Name: count, Length: 4256, dtype: int64

In [60]:
df['MedicineUsedFor'].value_counts()

MedicineUsedFor
Birth Control                                  60144
Depression                                     19521
Weight Loss (Obesity/Overweight)               14106
Anxiety                                        14085
Pain                                           11261
                                               ...  
Epididymitis, Non-Specific                         1
Multiple Endocrine Adenomas                        1
Hyperlipoproteinemia Type Iib, Elevated Ldl        1
Occupational Exposure                              1
Cluster-Tic Syndrome                               1
Name: count, Length: 1086, dtype: int64

## `Reviews`

In [61]:
pd.options.display.max_colwidth = 1500

In [62]:
pd.DataFrame(df['Reviews'].head())

Unnamed: 0,Reviews
0,This is very effective IF you can get the cover off.
1,Actually I use the generic brand of the 12 hour nasal spray. I'm not addicted! My nose is! Probably because it's worth the nasty taste to breathe freely again. The generic is a mere (maybe was) $1.88 + tax per bottle. I actually dilute the bottle with a 50/50 purified water. Works great. Doesn't burn and cost pennies.
2,Cap took 20 minutes to open process was frustrating and painful.'5-HTP User Reviews & Ratings (Page 2)For Anxiety11830-Jun-17Code9nTaken for less than 1 monthI have Opioid withdrawal induced anxiety / low energy. After decades of opioid usage I've suffered from anxiety - I'm assuming from excessive noradrenaline (norepinephrine) that's known to be produced in withdrawal.
3,Hi everyone\n'10 / 105-HTPFor Anxiety23129-Oct-19GannuTaken for less than 1 monthYes it works. I am floating after taking 5-HTP. I take 100 mg after breakfast and 100 mg after dinner. Also I stopped smoking. Good for my anxiety and stress. In the past talking to people was hard. I am shy guy. I think too much which is not at all required. After taking 5-HTP I feel the confidence . I am skinny guy and used to worry people around me is watching. Now I don’t care it’s my life and got my confidence now . I used to get tears automatically while talking to people because of nervousness. 5 HTP gave me my life back . Thanks
4,Took SSRI (Prozac) for Anxiety/Depression for 18 years with the only side effects being needing more sleep (8.5-9 hours a night minimum) than normal people.
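Rows 2 and 3 above show scraping residue glued onto the end of a review: an apostrophe followed by either a rating such as `10 / 10` or a `... User Reviews & Ratings` page header, and then the text of another review. One possible way to cut the text at that point is sketched below. This is an assumption-laden illustration, not necessarily how the rest of the notebook handles it: `RESIDUE` and `trim_residue` are hypothetical names, and the pattern is inferred from the few rows shown here.

```python
import re
import pandas as pd

# Residue starts at an optional line break plus an apostrophe, followed by
# either "<n> / 10" or a "... User Reviews & Ratings" page header.
RESIDUE = re.compile(
    r"(?:\r?\n)?'(?:\d{1,2} / 10|[^']*User Reviews & Ratings)"
)

def trim_residue(review):
    """Keep only the text before the first residue marker, if any."""
    match = RESIDUE.search(review)
    if match is None:
        return review
    return review[:match.start()].rstrip()

toy = pd.Series([
    "This is very effective IF you can get the cover off.",
    "I'm not addicted! My nose is!",
    "Cap took 20 minutes to open process was frustrating and painful."
    "'5-HTP User Reviews & Ratings (Page 2)For Anxiety11830-Jun-17Code9n",
    "Hi everyone\n'10 / 105-HTPFor Anxiety23129-Oct-19GannuTaken for less than 1 month",
])

cleaned = toy.map(trim_residue)
print(cleaned.tolist())
```

Ordinary apostrophes such as the one in `I'm` are left alone, because the marker must be followed by a rating or a page header. The trimmed tail contains other users' reviews, so it may be worth flagging these rows for inspection before discarding anything.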


In [63]:
(df.loc[df['Reviews']
      .str.contains('\n'), 
        ['ReviewDate', 'UserName', "Reviews"]
]).sample(15)

Unnamed: 0,ReviewDate,UserName,Reviews
49797,16-Aug-17,Snk,This medication has been a major game changer for me. Before starting this medication i was so severly depressed. I was taking other antidepressants at that time. They were helping to make me at least functional most days just doing the bare minimum seemed too much. Eventually my PCP added 150 mg of Wellbutrin XL on top of the medication I was already taking. Since then we have up to 300 mg. I finally have energy\r\n'8 / 10Bupropion User Reviews & Ratings (Page 55)Wellbutrin XL (bupropion) for Depression289-Nov-15theDo...Taken for 1 to 6 monthsI've been on and off this medication but nothing related to the drug. I was diagnosed with dysthymia one year ago and then it aggravated with a major depression so I started sertraline after the adjustment period it was the best I was always happy with lots of energy and lost 30 pounds in like 3 months loved it. then my psych decided to change me to wellbutrin. I felt fine with energy and wanting to do stuff. because before I couldnt get up of my bed but yes I felt the wellbutrin rage (those days when u look like u have PMS) I never lost a single pound on it (which i dont like).
349456,29-Sep-16,Madii...,When I took it\r\n'10 / 10Tramadol User Reviews & Ratings (Page 61)Ultram (tramadol) for Pain261-Apr-12AnonymousGood medicine it gets rid of your pain without that drowsy sick feeling.
74059,15-Nov-20,Anonymous,clonazepam help is huge it helps me sleep helps migraine PTSD\r\n'8 / 10Clonazepam User Reviews & Ratings (Page 77)Klonopin (clonazepam) for Anxiety174-Sep-09scand...Works well but I need it 3 x a day.
61118,25-Jun-23,Grace,I get UTI/bladder infections with blood in my urine at least twice a year. I ask for cephalexin because it works great for me\r\n'10 / 10Cephalexin User Reviews & Ratings (Page 7)Keflex (cephalexin) for Upper Respiratory Tract Infection410-Nov-22JannyTaken for 6 months to 1 yearBeing a
289077,2-Jun-15,Caeti,So I posted on here on my 6th day on adipex\r\n'10 / 10Phentermine User Reviews & Ratings (Page 12)Adipex-P (phentermine) for Weight Loss (Obesity/Overweight)6530-Mar-15ErkaI would recommend this pill to anyone! I started taking adipex 37.5 mg only last week and i am already down 15 lbs. thats crazy! It doesnt make me want to eat i actually have to remind myself sometimes to eat. And i was always the girl with food in her hand. I havent done much excersise just walking on nature trails. Bit for anyone wondering if it actually works the answer is yes!
297386,22-Feb-18,Grampa,We live in the wine country \r\n'5 / 10Pradaxa User Reviews & Ratings (Page 2)For Prevention of Thromboembolism in Atrial Fibrillation1221-May-20GrewTaken for 1 to 6 monthsI have been on Pradaxa for about 6 weeks now after taking warfarin for almost a decade. It has been a rough transition. Upset stomach abdominal pain and cramps extreme nausea (I haven't been able to eat much) and diarrhea. These side effects have decreased significantly over the weeks but I'm still struggling. I'm holding out hope that this drug will work out for me but at this point I'm not convinced at all. I've literally lost 20 pounds in 4 weeks and am fairly miserable on most days.
44484,17-May-18,cobweb,Rexulti has stopped my delusions and obsessive ruminationing where Seroquel XR amongst others failed\r\n'Brexpiprazole User Reviews & Ratings (Page 11)Rexulti (brexpiprazole) for Major Depressive Disorder209-Nov-17HolliiTaken for 6 months to 1 yearUsed for MDD and anxiety in conjunction with 40mg Prozac. Lifted me out of depression only taking 0.5 mg initially made me sleepy but felt like a miracle drug. Told my doctor I never realized I could feel this happy. Problem is I’ve gained almost 40lbs and so now my doctor is looking to take me off of it and try trintellix. Hoping I can actually start losing weight I’ve never had an issue with weight my whole life I’ve
383776,16-Apr-22,Carl...,I live in covington Louisiana where Boudreaux's Butt Paste was first made around 1970's\r\n'10 / 10Zinc oxide topicalFor Dermatologic Lesion726-Dec-13Maryam...Taken for less than 1 monthI am 24 years old and had very big chicken pox spots so I used Sudocream containing zinc oxide to dry the spots after the first day of my chicken pox. I also used it when I terribly burned my hand during cooking. In both cases it totally helped to calm down my skin condition and help it heal faster.
81840,4-Sep-22,atsirt,9-3-22 I took my first dose of Cymbalta (Duloxetine) 30mg. I took the dose @9:30pm EST. About 2 hrs after I took it I started feeling tired so I laid down & went to sleep with no problem. A few hours later (@4am) I woke up profusely sweating trembling/shaking & feeling EXTREMELY NAUSEOUS (didn't throw up though) I literally just squirmed around trying to get comfy waiting for it to pass. Eventually I was able to fall back asleep. I woke up @10:30am with a massive headache foggy head my heart was beating really fast my anxiety was through the roof I felt very anxious/agitated my skin felt prickly (the best way I can describe it is that I felt like I couldn't stand to be in my own skin. If that makes any sense?) As the day went on (it's now 6:30pm) most of the symptoms I woke up with have gradually faded. The anxiety headache\r\n'9 / 10Cymbalta User Reviews & Ratings (Page 48)For Major Depressive Disorder1016-May-20NickTaken for 1 to 6 monthsI haven't been on Cymbalta very long it seems to be helping my major depression I was put on Wellbutrin first and the anxiety I didn't know that I had went off the rails. The only side effect that have noticed is a strong bitter taste when ever I eat.
269448,21-Apr-20,Eric...,I was getting ECT treatment for depression when my doctor told me Nortriptyline pairs well with the ECT therapy. I once had a bad reaction to Trintillex . Without having much recollection of why I decided to start this new med I did. I felt ECT was starting to heal my depression. As a few weeks passed\r\n'10 / 10Nortriptyline User Reviews & Ratings (Page 13)For Migraine Prevention363-Oct-10chesn...I have been migraine free for over a month now. I was getting 3-5 migraines a week with a continuous headache for 5 months straight and through the heat of the summer.


In [64]:
df.loc[df['Reviews'].str.contains('\r\n', regex=True), 'Reviews'].shape

(5214,)
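
The count above only covers `\r\n`. As a minimal sketch on a toy series (the three strings are made up for illustration), the same check can be written with `regex=False` and split into Windows-style breaks and bare `\n` breaks, since a row such as index 3 carries only `\n`:

```python
import pandas as pd

# Toy data, not taken from DrugReviews.csv.
toy = pd.Series([
    "line one\r\nline two",   # Windows-style line break
    "line one\nline two",     # bare line feed
    "single line",            # no line break at all
])

# regex=False treats the pattern as a literal substring.
has_crlf = toy.str.contains('\r\n', regex=False)
has_lf = toy.str.contains('\n', regex=False)

crlf_count = int(has_crlf.sum())
lf_only_count = int((has_lf & ~has_crlf).sum())
print(crlf_count, lf_only_count)
```

Summing the boolean mask gives the number of matching rows directly, without building the filtered series first.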

In the `Reviews` column, a number of entries contain stray appended text in this format:

```
## X: Any digit, (...): Optional characters,
## <...>: Formattings, {...}: Number of characters

<(\r)\n>'<(X / 10)><MedicineName> <User Reviews & Ratings (Page XX)><MedicineFor>X{1-5}-<Month>{3}-XX(<Name>)(<IntakeTime>)

## <MedicineName>, <MedicineFor>, <IntakeTime> are already
## stored in other columns, while <Name> is not.

```

Given the `\r\n` at the beginning, we may assume that the user hit Enter to start a new paragraph. Instead of the rest of the review, however, what follows appears to be the header and text of another review, appended in full.

We need to clean this out for the purpose of sentiment analysis so that the sentences can be tokenized cleanly. There is also sensitive data, an actual user name placed after the date, and we need to maintain patient privacy in healthcare data.

Let's clean it by using a regular expression, or **regex**, to capture the format. The process is a bit tricky because of the optional parts `(<Name>)(<IntakeTime>)`: some reviews have both, some have one, and others have none at all. We'll clean it in two steps: **(1) the main format, and (2) the optional format.**



In [65]:
df.loc[df.index == 163266, 'Reviews']

163266    I am 59 y old male with anxiety on and off since I remember. I work on the computer as a full time job. Last year I had severe tension headache for a month from work related stress. Couldn't drive the car at night panic attacks dizziness vertigo issuebody shaking etc. My GC provider listened to my story. She put me on 10mg fluox for 1 month. First 4 days ok. Then headache started so bad I was ready to give up after a week then 4 sleepless nights in the row. Insane. I had panic attacks\r\n'6 / 10Fluoxetine User Reviews & Ratings (Page 50)Prozac (fluoxetine) for Anxiety and Stress26Anonymous29-Nov-11Was on 10mg for 1 yr... not much change. Switched to 20mg for two weeks then doctor bumped it up to 30mg/day. I think it might help depression but I feel its doing very little to help anxiety. Don't like the idea of having to be dependent on medicines to make me feel normal. Stick with it awhile longer.
Name: Reviews, dtype: object

In [66]:
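## Regex mask for the main format: everything from the line break
## (or the "'X / 10" rating) up to and including the review date.
## A raw string keeps the backslashes intact for the regex engine.
regex_mask_1 = re.compile(
    r"(\r\n.+|\r.+|'\d{1,2} \/ 10.+)(\d{1,6}-.{3}-\d{2})"
)

df['Reviews'] = df['Reviews'].str.replace(regex_mask_1, ' --filtered_1 ', regex=True)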
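
To see what the main-format mask does, here is a minimal sketch that applies the same pattern to a single string with plain `re`. The first sentence of `sample` is made up; the appended part mirrors one of the sampled rows above.

```python
import re

# Same pattern as regex_mask_1, written as a raw string.
mask = re.compile(r"(\r\n.+|\r.+|'\d{1,2} \/ 10.+)(\d{1,6}-.{3}-\d{2})")

# Illustrative input: a review, a line break, then the appended header
# (rating, page title, condition, likes fused with the date, user name).
sample = (
    "Works well for me\r\n'8 / 10Clonazepam User Reviews & Ratings (Page 77)"
    "Klonopin (clonazepam) for Anxiety174-Sep-09scand...Works well but I need it 3 x a day."
)

# Everything from the line break up to and including the date is replaced.
cleaned = mask.sub(" --filtered_1 ", sample)
print(cleaned)
# Works well for me --filtered_1 scand...Works well but I need it 3 x a day.
```

The user name and the second review survive this step, which is why the tag is needed as an anchor for the second pass.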

In [67]:
# index examples:
index_sample = [7659, 209873, 242716, 269576, 198296]
pd.DataFrame(df.loc[df.index.isin(index_sample), 'Reviews'])

Unnamed: 0,Reviews
7659,"As a heathcare professional having given Mucomyst to patients over more than 30 year career I've yet to observe any benefit from the aerosolization of this med. I don't want to minimize the previous comments. Used during a bronchoscopy in liquid wash it will dissolve thick mucus quickly and effectively but the benefit of aerosolization has not been demonstrated or established in any controlled study. Also after more than 24 hours of use the lung tissue will begin to weep this is a very caustic substance to lung. Too often it's order for a specific problem area of the lung. As the laws of nature take effect in that gas follows the path of least resistance, Mucomyst is delivered to all areas of lung with minimal effect to the problem. --filtered_1 richardTaken for 6 months to 1 yearI take this for COPD I am wheezing constantly after regular use of NAC cysteine it helps a little reducing mucus but creating a difficulty in breathing and wheezing daily . whereas this did not happen before taking this drug l revert back to MSM powder with vitamin C liquid and feel better ."
198296,I got it about two months ago. The insertion does hurt but it’s nothing out of ordinary and the moment where I felt pain lasted less than a minute it felt more like a sharp pain than a severe cramp. My first period after the insertion lasted for 11 days (usually it’s like 5) and I have been a bit more hormonal and tired than normal but I guess it’s just my body accommodating. Two weeks after the insertion I saw my doctor for a check-up and she said my IUD had moved farther up than where she placed it don’t know why but apparently it’s nothing to worry about. I did feel pretty odd in the beginning --filtered_1 ChalkyTaken for 1 to 2 years34yo mom & exercise daily. Retroverted uterus. Skyla
209873,I would have a bowel motion maybe one a week or every 2 weeks --filtered_1 ScribeTaken for less than 1 monthI was honestly VERY nervous to start taking this but my GI doctor highly recommended at least trying it for 2 weeks so I did. I will say within an hour of taking one 145 mcg pill I was running to the bathroom and let's just say it was a close call. Thankfully I took my first dose on a Sunday as I had no plans. I had diarrhea bloating abdominal pain nausea and fatigue all day. As my body adjusted to it my symptoms seemed to lessen BUT I still have the explosive diarrhea about 1-2 hours after taking it. Also I work a very busy job with very few chances for a break but somehow I have always been able to go when I feel the urge. I am going to continue taking this as long as my insurance covers it!!
242716,I developed a YI while on I was on my period and initially tried to cure my YI naturally by eating plain yogurt with live cultures drinking water with ACV and taking probiotics. They didn't help so I picked up M7 b/c I read that it has the least side effects. LIES!!! The first night I was in so much pain. My legs were shaking and I was breathing like I was going into labor. The burning was horrible!! What helped: Running a clean wash cloth in cold water --filtered_1 LyzTaken for less than 1 monthI had a severe yeast infection after taking amoxicillin for an ear infection. I was literally going insane from how horrible it was! And on top of that I was on vacation hundreds of miles away! I picked this up on a whim and got the cream version (they were out of suppositories). The first night I wanted to SCREAM it felt like FIRE had been lit inside me! I wanted to cut my lady parts off. I was in tears in bed for the second night as well. I have never hated being a woman in my whole LIFE. But on the third morning I felt normal again! I didn’t even have to use the 3rd dose! The burning was worth it!
269576,I can't recommend nortriptyline for IBS I was on this for over a month --filtered_1 DJ 1Taken for 1 to 6 monthsI have had a positive experience thus far . Currently taking this for migraines with vertigo symptoms and nausea. The migraines have subsided as well as the vertigo symptoms. Highly recommended


In [68]:
pd.DataFrame(
    df.loc[df['Reviews']
           .str.contains(' --filtered_1 '), 
              'Reviews'])

Unnamed: 0,Reviews
3,Hi everyone\n --filtered_1 GannuTaken for less than 1 monthYes it works. I am floating after taking 5-HTP. I take 100 mg after breakfast and 100 mg after dinner. Also I stopped smoking. Good for my anxiety and stress. In the past talking to people was hard. I am shy guy. I think too much which is not at all required. After taking 5-HTP I feel the confidence . I am skinny guy and used to worry people around me is watching. Now I don’t care it’s my life and got my confidence now . I used to get tears automatically while talking to people because of nervousness. 5 HTP gave me my life back . Thanks
9,5HTP was absolutely terrible --filtered_1 worldTaken for 2 to 5 yearsChoose a sustained/time release 5-HTP product for a smoother and longer lasting effect. Combine with a good Vitamin B complex possibly in the biologically active form.
65,So far so good as only on it for about 5 days at 300mg per day. I have combined it with multi vitamin vitamin b complex --filtered_1 anony...This supplement has really worked for me! The only problem is after about a month of taking it I have started experiencing ringing in my ears and pressure headaches. But it does boost my mood and I haven't mood swings or anxiety like before I started taking this. Also I find I haven't been as hungry and eating as much. This supplement would be perfect if I hadn't started having the problems with my ears and the headaches.
160,Hi everyone --filtered_1 GannuTaken for less than 1 monthYes it works. I am floating after taking 5-HTP. I take 100 mg after breakfast and 100 mg after dinner. Also I stopped smoking. Good for my anxiety and stress. In the past talking to people was hard. I am shy guy. I think too much which is not at all required. After taking 5-HTP I feel the confidence . I am skinny guy and used to worry people around me is watching. Now I don’t care it’s my life and got my confidence now . I used to get tears automatically while talking to people because of nervousness. 5 HTP gave me my life back . Thanks
228,So far so good as only on it for about 5 days at 300mg per day. I have combined it with multi vitamin vitamin b complex --filtered_1 anony...This supplement has really worked for me! The only problem is after about a month of taking it I have started experiencing ringing in my ears and pressure headaches. But it does boost my mood and I haven't mood swings or anxiety like before I started taking this. Also I find I haven't been as hungry and eating as much. This supplement would be perfect if I hadn't started having the problems with my ears and the headaches.
...,...
392321,I was given zyrtec for cold symptoms --filtered_1 AZJJTaken for 1 to 2 yearsI took this on and off for several years and then recently it started to make me feel like a zombie.
392329,Extreme itchynes all over body (every body part ) cannot sleep cannot stop itching myself everywhere --filtered_1 PaulaTaken for 1 to 6 monthsThis medication has helped me a lot. I take it only once per day but I notice that after 12 hrs the congestion comes back so I will try it every 12 hrs now. It works very well for me .
392346,Had PSA of 70 and Gleason score of 9 January 2021. After Lupron for 9 months --filtered_1 John...Taken for less than 1 monthSevere joint pain after 2 weeks so far. Moves between shoulders elbows hips knees. lasting 2-minutes per joint then moving to the next.
392416,I have been having a lot of issues with falling asleep lately --filtered_1 DDeas...Taken for less than 1 monthIf you have actual REAL insomnia ZzzQuil won't do a thing for you; I breathe great when I take it but no sleep. Your best bet would be to contact your doctor for some prescription sleep aids; maybe a sleep study if you got the insurance.


In [69]:
## Check whether any tagged review has a <Date> where
## <IntakeTime> would be,
## e.g.: "...richard2-Nov-16I have consumed the drug since ..."
## The date group is non-capturing, so pandas does not warn about match groups.

regex_check = re.compile(r' --filtered_1 .+ (?:\d{1,2}-.{3}-\d{2})')
pd.DataFrame(df.loc[df['Reviews'].str.contains(regex_check), 'Reviews'])



Unnamed: 0,Reviews


In [70]:
pd.DataFrame(
    df.loc[df['Reviews']
           .str.contains(' --filtered_1 '), 
              'Reviews']).sample(10)

Unnamed: 0,Reviews
295398,I drank it pretty quickly it made me nausea I felt like throwing up and I felt extremely cold but it worked very quickly. I didn't eat the day before or day off based on the instructions from the Drs office . I started 1st dose at 7 30 PM and 2nd dose at 7 30 AM meanwhile the 1st dose was still working when I woke up causes stomach cramps gurgling --filtered_1 EliseTaken for less than 1 monthMy first colonoscopy and was unsure what to expect and felt a bit scared. For three days before I followed a low-fiber diet and ate minimally which I believe helped with this. Liquid diet 24 hours before. Took the first dose at 1 pm took an hour for bowel movements to start and went maybe 8-10 times very watery. Didn't think the taste was that bad tbh I just followed it with lemonade and water. Took the second dose at 5 pm and this was harder did feel nauseous taking this but drank it slow and kept it down don't panic or think you're going to vomit just drink fresh water too. Drank another 3 liters or so on top of this and went to the loo maybe every 5-10mins for around 4.5 hours. Managed to get to sleep at 11:30 pm and then had to wake at 5:30 am for 2 further bowel movements. Left for an appointment at 7 am and went once again at the hospital and had a colonoscopy at 8 am. Still have watery BM after but it's going to take a few days for my stomach to return to normal I'd expect. Not a bad experience but obviously not fun.
378395,I was born with curvature of the spine then my job as nurses aide for about 20 years kind of finished my back off. I've been using pain meds with varying degrees of success for about 40 years. About a year ago the pain clinic decided to put me on Xtampza 9. They know I cannot sleep after taking pain meds as they give me energy so I take both in the morning. I get no pain relief but what surprised me was also no energy I can sleep even after taking 2 but cannot sleep after 1/2 hydrocodone 5/325. I can take 3 days off of the Xtampza and feel no difference. 1/2 a hydrocodone 5/325 gives me better pain relief something is not right here. Please --filtered_1 SparkyTaken for 1 to 2 yearsApparently have stage 4 prostate cancer. Had radiation no surgery; went on Lupron for 2 years. Prescribed term ended and then PSA climbed in 3 months from 0.23 to 4-ish. Back on the Lupron. PSA dropped at first then climbed to 6.4. Started Xtandi (simultaneously
214358,I took lisinopril for a few years. It works to control my high blood pressure. Occasionally --filtered_1 AnonymousTaken for less than 1 monthBad rash on feet
310530,First night --filtered_1 MarilynTaken for less than 1 monthDoes not work for me.
314012,This drug has been a major miracle drug for my 18 year old son with Autism spectrum disorder (ASD). He had been experiencing what I believe was a major depressive disorder for over 3 years. He stopped doing the things he once loved --filtered_1 AJ63Taken for 2 to 5 yearsI took Rexulti for almost 2 years for depression. I had tried many different medications and it was the one medication that I responded to at the time. I started out on 0.5mg and slowly increased my dosage to 2mg. It worked for a while but then stopped working. It also made me feel flat and sedated. My doctor added a stimulant (Adderall) to counteract the sedation which only helped a little. We tried adding on other antidepressants with not much response. Any time I started a new antidepressant I would feel more sedated (now I realize it was a drug interaction with the Rexulti).
133447,Awful --filtered_1 I...Taken for 1 to 6 monthsI started using the patch around 2 months ago. I loved it at first. I needed the hormones because I have hypothyroidism and I needed more estrogen. I mainly took it because my health condition caused me to have acne all the time 2 periods a month & I started growing facial hair rapidly. I tried this and at first it was pretty cool my hair stopped growing so fast my periods were predictable my acne was gone and my small boobs grew. But all of a sudden after 2 weeks I started getting crazy mood swings and sad.
245720,(mirabegron 50mg Taken In the morning Born 1960) I have been a night-time bedwetter all my life daytime I had no problem holding my urine when I was awake I had test after test at a local hospital then up to guys hospital after more examination's they almost said you are doing It on purpose there's nothing more we can do. I asked my Dr In 2012 Is there anything new? And he started me on these --filtered_1 AnonymousTaken for less than 1 monthOn 25 mg not working this is only after 2 weeks visiting my Dr next week he already suggested for me to take 50mg after I called him to tell him how I was doing my pcp suggested the this drug is the best but expensive he also said I should be on 50 mg . I will wait to see urologist next week . I just had a tourp done and this occurred ????
258246,I have to say it been a miracle for me. I have bad allergies and hayfever. I feel the difference and this does not dry out my nose --filtered_1 AnonymousThis product was by far one of the best I have used.
6662,I've read most posts --filtered_1 DebisisTaken for 1 to 6 monthsI'm disgusted after reading these comments...because I too had a chronic pain issue for years and have been taking a generic oxycodone 5mg but was fine with it. I then had
215717,I got the pill from my family doctor as a sample pack. He said i should try it since i have been having acne problems since i gave birth. First month was fine on my 2nd pack a week after my period I started bleeding again. I was just on the 4th pill on the first row. The blood was browny in color not the usual red. I am also having hip pains really bad ones --filtered_1 Merry...Only taken this birth control pill for 1 week so far. The only noticeable side effect is extreme irritability most of the time and the occasional feeling of needing to cry out of the blue.


The main format has now been replaced with the tag ` --filtered_1 `.

### Insights for Cleaning `Reviews`

To clean the optional formats, we'll go through two steps:

* build another regex for the complete optional format `(<Name>)(<IntakeTime>)`,
* and then clean the remaining ` --filtered_1 ` tags in the rest of the data.

But here's the tricky part. From observation, `(<Name>)` has a total of 3-10 characters and is followed, without whitespace, by the capitalized first word of a new paragraph, and then by whitespace before the second word.

However, the casing of `(<Name>)` varies. It could be:

* Capitalized such as `Hope` and `Anonymous`, 
* all lowercase such as `richard`, 
* mixed characters such as `CarlyP`, `Kirsten...` and `Lc122...`, 
* or all uppercase such as `JAM` and `LJRI`.

Given the limitless combinations of `(<Name>)`, capturing all of them in a regex would be impossible. Because of that, we need to sacrifice the first word. These words include `I`, `After`, `At (first)`, or even the `(<BrandName>)` itself, which has already been written in the previous paragraph. In tokenization, many of these are called `stopwords`: inconsequential words with little value in Natural Language Processing (NLP). 

Given that:
* we have to hide personal information in healthcare data, 
* NLP is where this project is heading, 
* and only a small share of the data is affected (`5014/392514` or `~1.277%`), 

the author decides to do the second step by removing the first word together with the name in the rest of the data. We may inevitably remove a word that matters to the review, but the affected rows cover only about 1% of the total, so the impact on NLP accuracy later on should be small.

In [71]:
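# Step 1: tag + name + "Taken for <duration>".
# `2 to 5 years` is added because it appears in the sampled rows above;
# without it only `Taken for ` is removed and the duration stays in the text.
regex_group_1 = re.compile(
    '( --filtered_1 )(.+[^A-Z])(Taken for (less than 1 month|1 to 6 months|'
    '6 months to 1 year|1 to 2 years|2 to 5 years|10 years or more)?)'
)

# Step 2: tag + name fused with the first word, up to the first whitespace.
regex_group_2 = re.compile(' --filtered_1 (.+?) ')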

In [72]:
# Replace it with `. ` to start a new sentence.

df['Reviews'] = df['Reviews'].str.replace(regex_group_1, '. ', regex=True)
df['Reviews'] = df['Reviews'].str.replace(regex_group_2, '. ', regex=True)
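
The trade-off described above can be seen on two short strings. This is a minimal sketch with plain `re`: `group_1` mirrors `regex_group_1` with a shortened duration list for readability, `group_2` is the same as `regex_group_2`, and both inputs are abridged from the sampled rows.

```python
import re

# Shortened variant of regex_group_1 (only two durations listed here).
group_1 = re.compile(
    '( --filtered_1 )(.+[^A-Z])(Taken for (less than 1 month|1 to 6 months)?)'
)
# Same pattern as regex_group_2.
group_2 = re.compile(' --filtered_1 (.+?) ')

with_intake = (
    "First night --filtered_1 MarilynTaken for less than 1 month"
    "Does not work for me."
)
without_intake = (
    "I feel the difference --filtered_1 AnonymousThis product was by far "
    "one of the best I have used."
)

# With <IntakeTime> present, name and duration are removed and the
# first word of the second paragraph survives.
step_1 = group_1.sub('. ', with_intake)

# Without <IntakeTime>, the name is fused with the first word ("This"),
# so both are removed together.
step_2 = group_2.sub('. ', without_intake)

print(step_1)  # First night. Does not work for me.
print(step_2)  # I feel the difference. product was by far one of the best I have used.
```

The second result shows the sacrificed first word: `This` is lost along with `Anonymous`.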

In [73]:
pd.DataFrame(
    df.loc[df['Reviews']
           .str.contains(' --filtered_1 '), 
              'Reviews'])

Unnamed: 0,Reviews
37311,I took Belviq XR for 3 weeks. My hands & feet started swelling --filtered_1 Walking...I
146703,I've been taking Mono-Linyah for 5 months now and I am so disappointed and unhappy with it. I get frequent headaches they last all day sometimes they last for 2-3 days at a time They're almost unbearable. My periods have gotten even heavier than they were before I started taking this and they last for about 6 or 7 days. I just got off my period 5 days ago and I'm back on it again right now. I have had a period literally every other week since I started taking Mono-linyah. Most people notice weight gain while taking most birth controls --filtered_1 Natal...I
252555,I have been on this pill a little over a month now. This is the first birth control pill I've ever been on. So far --filtered_1 atxMonsternessa.
279307,Roxicodone is the most effective pain medication I've had so far. 20 mg every 4 hours. Also 25 mcg patches of Fentanyl. I'm stage 4 with bone cancer and am a recent 'incomplete quadriplegic' from a cervical spinal cord injury but learned to walk again. I'll take the cancer any day over the paralysis. I was taking 4mg of Dilaudid every 4 hours in rehab which was debilitating (to me). I'm starting to build up a tolerance to roxicodone --filtered_1 Anonymous0016unregistered:
287334,I have been getting cold sores for years. I got a tube of Penciclovir once I felt the tell tail tingle and after 5 days it had gone it didn't even blister. I only paid R 60.00 for it --filtered_1 sangaRelief!
315859,My son was put on this mediicne when he was ten --filtered_1 HORRIBLE!


In [74]:
# Clean the remaining ` --filtered_1 ` tags. These precede a second
# paragraph that contains no whitespace, so regex_group_2 
# could not match them.

regex_residue = re.compile(' --filtered_1 (.+)')

df['Reviews'] = df['Reviews'].str.replace(regex_residue, '. ', regex=True)

In [75]:
pd.DataFrame(
    df.loc[df['Reviews']
           .str.contains(' --filtered_1 '), 
              'Reviews'])

Unnamed: 0,Reviews


## Final Touch: Drop Duplicates

In [76]:
pd.options.display.max_colwidth = 50

In [77]:
df.sort_values(['NumberOfLikes', 'ReviewLength', 'Rating'], ascending=False)[:20]

Unnamed: 0,MedicineUsedFor,MedicineBrandName,MedicineGenericName,ReviewDate,UserName,IntakeTime,Reviews,ReviewLength,Rating,NumberOfLikes
160841,Depression,Prozac,Fluoxetine,20-Aug-20,READT...,Not Specified,Even though I always said I was never going to...,487,10,3555
305810,Depression,Prozac,Not Mentioned,20-Aug-20,READT...,Not Specified,Even though I always said I was never going to...,487,10,3555
160864,Depression,Prozac,Fluoxetine,20-Aug-20,READT...,Not Specified,Even though I always said I was never going to...,487,10,3545
386906,Depression,Zoloft,Not Mentioned,9-Sep-19,Saved...,Not Specified,I’m 36 and I’ve dealt with depression my entir...,186,10,3199
324768,Depression,Zoloft,Sertraline,9-Sep-19,Saved...,Not Specified,I’m 36 and I’ve dealt with depression my entir...,186,10,3190
324769,Anxiety And Stress,Sertraline,Not Mentioned,26-Jan-20,Feari...,Taken for 1 to 6 months,I would literally re read the high rated reviews,48,10,2767
120404,Anxiety,Lexapro,Escitalopram,4-Sep-20,oksy,Taken for 1 to 2 years,I want to thank the people here that have shar...,379,10,2336
205637,Anxiety,Lexapro,Not Mentioned,4-Sep-20,oksy,Taken for 1 to 2 years,I want to thank the people here that have shar...,379,10,2336
280798,Not Mentioned,Ozempic,Not Mentioned,26-Sep-20,dana,Taken for less than 1 month,Started taking this medication off label for w...,529,9,2305
321967,Not Mentioned,Ozempic,Semaglutide,26-Sep-20,dana,Taken for less than 1 month,Started taking this medication off label for w...,529,9,2300


You may notice in the listing above that there are duplicated rows. We drop them last, now that all columns have been sorted out. 

In [78]:
df = (df.sort_values(
            ['NumberOfLikes', 'ReviewLength', 'Rating',
             'MedicineBrandName', 'MedicineUsedFor'], 
              ascending=[False, False, False, True, True])
        .drop_duplicates(
            subset=['UserName', 'ReviewDate', 'MedicineBrandName', 
                    'MedicineUsedFor', 'ReviewLength'], 
              keep='first')
        .sort_values(['ReviewDate'], ignore_index=True)
)
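
The sort-then-drop pattern in this cell can be sketched on a toy frame. The column names follow the dataframe above, the three rows are abridged from the listing, and the subset is shortened for illustration. Sorting by `NumberOfLikes` in descending order before `drop_duplicates(keep='first')` means the copy with the most likes is the one that is kept.

```python
import pandas as pd

# Toy rows: the first two are the same review with different like counts.
toy = pd.DataFrame({
    'UserName': ['READT...', 'READT...', 'oksy'],
    'ReviewDate': ['20-Aug-20', '20-Aug-20', '4-Sep-20'],
    'MedicineBrandName': ['Prozac', 'Prozac', 'Lexapro'],
    'NumberOfLikes': [3545, 3555, 2336],
})

deduped = (
    toy.sort_values('NumberOfLikes', ascending=False)
       .drop_duplicates(
           subset=['UserName', 'ReviewDate', 'MedicineBrandName'],
           keep='first')
       .reset_index(drop=True)
)
print(deduped)
```

Only one `READT...` row remains, and it is the one with 3555 likes.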

In [79]:
df

Unnamed: 0,MedicineUsedFor,MedicineBrandName,MedicineGenericName,ReviewDate,UserName,IntakeTime,Reviews,ReviewLength,Rating,NumberOfLikes
0,Cough,Acetaminophen / Codeine,Not Mentioned,1-Apr-08,smoore,Not Specified,Works good as a cough suppressant.,34,9,24
1,Cough,Benzonatate,Not Mentioned,1-Apr-08,Anonymous,Not Specified,Pneumonia cough was non-stop - gave almost ins...,210,9,39
2,Dermatologic Lesion,Methylprednisolone Dose Pack,Methylprednisolone,1-Apr-08,Anonymous,Not Specified,This steriod helped kill the pain of my condit...,162,8,24
3,"Hypogonadism, Male",Androgel,Not Mentioned,1-Apr-08,MikeC...,Not Specified,I'm a 35 year old male and I had no idea that ...,105,9,380
4,Depression,Celexa,Not Mentioned,1-Apr-08,Cherpie,Not Specified,It is so nice to have my life back!!!,37,10,206
...,...,...,...,...,...,...,...,...,...,...
255945,Birth Control,Isibloom,Desogestrel / Ethinyl Estradiol,9-Sep-22,Skylar,Not Specified,This birth control is awful severe nausea and ...,108,1,0
255946,Underactive Thyroid,Unithroid,Levothyroxine,9-Sep-22,Syd,Taken for less than 1 month,Post partial thyroidectomy due to a large beni...,224,2,7
255947,Bacterial Infection,Amoxicillin / Clavulanate,Not Mentioned,9-Sep-22,FLgirl,Taken for less than 1 month,I was given this for a tooth abscess. I was sc...,957,9,1
255948,Strep Throat,Augmentin,Amoxicillin / Clavulanate,9-Sep-22,peein...,Taken for less than 1 month,This stuff is great if you wanna pee out of yo...,263,1,0


The data went down from 392514 to 255950 rows, so there were a lot of duplicates. With several columns in the subset, the duplicates should have been dropped reliably. So hey, the data's now clean!
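
One caveat on the final `sort_values(['ReviewDate'])`: `ReviewDate` holds strings such as `1-Apr-08`, so the order is alphabetical rather than chronological (the output above runs from `1-Apr-08` to `9-Sep-22`). If chronological order is wanted, a minimal sketch follows, assuming every date uses the day, abbreviated month, two-digit year layout. The `ReviewDateParsed` column is a hypothetical helper, not part of the dataset.

```python
import pandas as pd

# Toy dates in the same layout as the ReviewDate column.
toy = pd.DataFrame(
    {'ReviewDate': ['9-Sep-22', '1-Apr-08', '26-Jan-21', '19-Aug-22']}
)

# Parse the strings into real timestamps (hypothetical helper column).
toy['ReviewDateParsed'] = pd.to_datetime(toy['ReviewDate'], format='%d-%b-%y')

by_string = toy.sort_values('ReviewDate')['ReviewDate'].tolist()
by_date = toy.sort_values('ReviewDateParsed')['ReviewDate'].tolist()

print(by_string)  # ['1-Apr-08', '19-Aug-22', '26-Jan-21', '9-Sep-22']
print(by_date)    # ['1-Apr-08', '26-Jan-21', '19-Aug-22', '9-Sep-22']
```

Whether to keep the parsed column in the saved CSV is a design choice; the dashboard and the sentiment model may only need the original string.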

## Next Section

In the next sections, the cleaned data will feed two projects:

* **Drug Recommendations Dashboard from Customer Reviews**, and
* **Sentiment Analysis of Customer Reviews in Drug Medications**

We need to save the cleaned data beforehand so that both can use it.

In [80]:
df.to_csv('DrugReviews_cleaned.csv', index=False)

## Outro

In this section, we have cleaned data from `DrugReviews.csv` through:

* Changing the date-formatted values in `IntakeTime` into `Not Specified`, 
* Sorting the mixed classifications of drugs into their respective columns `MedicineUsedFor`, `MedicineBrandName`, and `MedicineGenericName`, and
* Cleaning glitched characters in `Reviews` for visualization and machine learning purposes.