# Using Sentiment Analysis for Pharmacovigilance of Contraceptives

## Web Scraping

In order to gather testing data, we need to collect data from the website Drugs.com. Scraping data from the web simply means programmatically collecting specific data from the raw HTML content of websites. A common Python library for web scraping is Beautiful Soup.

**Load Libraries**

In [21]:
import requests
from bs4 import BeautifulSoup as bs
import re
import pandas as pd

**Load our first page**

In [2]:
# load webpage content
r = requests.get("https://keithgalli.github.io/web-scraping/example.html")

# Convert to a Beautiful Soup object (name the parser explicitly to avoid a warning)
soup = bs(r.content, "html.parser")

# print out our html
print(soup.prettify())



<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>



**Start using beautiful soup**

The find method returns the first element that matches a given HTML tag. 
The find_all method returns a list of all elements that match a given HTML tag. 



In [3]:
first_header = soup.find("h2")
print(first_header)

headers = soup.find_all("h2")
print(headers)

<h2>A Header</h2>
[<h2>A Header</h2>, <h2>Another header</h2>]
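
One detail worth knowing before building on these calls: when nothing matches, `find` returns `None`, while `find_all` returns an empty list (never `None`). The sketch below illustrates this on a small inline HTML string that is made up for the example, parsed with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet (not the example page above)
html = "<html><body><h2>A Header</h2><p>Some text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find returns None when no element matches
missing = soup.find("h3")
print(missing is None)

# find_all returns an empty list-like ResultSet when no element matches
missing_all = soup.find_all("h3")
print(len(missing_all))

# when elements do match, find_all returns all of them in document order
found = soup.find_all("h2")
print([tag.string for tag in found])
```

In practice this means an `is None` test suits the result of `find`, while the result of `find_all` is better checked with `len(...)` or plain truthiness.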


In [4]:
# pass in a list of tags to look for

# find returns the first element in the document that matches any tag in the list
first_header = soup.find(["h1", "h2"])
print(first_header)

# find_all returns every element that matches any tag in the list
headers = soup.find_all(["h1", "h2"])
print(headers)

<h1>HTML Webpage</h1>
[<h1>HTML Webpage</h1>, <h2>A Header</h2>, <h2>Another header</h2>]


In [5]:
# pass in attributes to the find / find_all functions
paragraph = soup.find_all("p")
print(paragraph)

# find the paragraph with a specific id
paragraph = soup.find_all("p", attrs={"id":"paragraph-id"})
print(paragraph)

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>, <p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<p id="paragraph-id"><b>Some bold text</b></p>]


In [6]:
# we can nest find and find_all calls
body = soup.find("body")
# print the tag name to confirm which element was found
print(body.name)
# look for a div within the body
div = body.find("div")
print(div.name)

body
div


In [7]:
# we can search for specific strings within find / find_all calls
import re

# find any paragraph with the string "Some"
paragraphs = soup.find_all("p", string=re.compile("Some"))
print(paragraphs)

# find any header containing the string "header" or "Header"
headers = soup.find_all("h2", string=re.compile("(H|h)eader"))
print(headers)

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<h2>A Header</h2>, <h2>Another header</h2>]


**Select (CSS selector)**

In [8]:
# select is similar to find_all, but takes a CSS selector
# "div p" matches every paragraph nested inside a div
content = soup.select("div p")
print(content)

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>]


In [9]:
# "h2 ~ p" matches paragraphs that follow an h2 at the same level (siblings)
paragraphs = soup.select("h2 ~ p")
paragraphs

[<p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [10]:
# grab the bold text element inside the paragraph with the id "paragraph-id"
bold_text = soup.select("p#paragraph-id b")
print(bold_text)

[<b>Some bold text</b>]


In [11]:
# select paragraphs that are direct children of the body
paragraphs = soup.select("body > p")
print(paragraphs)

for paragraph in paragraphs:
    print(paragraph.select("i"))

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<i>Some italicized text</i>]
[]
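
Two related selector features can be useful alongside `select`: `select_one`, which returns only the first match (or `None`), and CSS attribute selectors. The following is a small sketch on an invented HTML string, not the example page; the URLs in it are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up HTML with one absolute and one relative link
html = (
    '<body><p>See <a href="https://example.com/a">A</a> and '
    '<a href="/local">B</a></p><p id="paragraph-id"><b>Bold</b></p></body>'
)
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match instead of a list
first_bold = soup.select_one("p#paragraph-id b")
print(first_bold.string)

# attribute selector: links whose href starts with "https"
external = [a["href"] for a in soup.select('a[href^="https"]')]
print(external)

# no match: select gives an empty list, select_one gives None
print(soup.select("table"), soup.select_one("table"))
```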


**Getting different properties of the html**

In [12]:
# let's get just the text, not the full element
header = soup.find("h2")
print(header.string)

div = soup.find("div")
print(div.prettify())

A Header
<div align="middle">
 <h1>
  HTML Webpage
 </h1>
 <p>
  Link to more interesting example:
  <a href="https://keithgalli.github.io/web-scraping/webpage.html">
   keithgalli.github.io/web-scraping/webpage.html
  </a>
 </p>
</div>
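
Note that `.string` only returns text when an element has a single child; for an element with several children, such as the div above, it is `None`. A minimal sketch of the difference, using `get_text()` as the alternative, on a made-up snippet with a similar shape:

```python
from bs4 import BeautifulSoup

# Made-up snippet mirroring the structure of the div above
html = '<div><h1>HTML Webpage</h1><p>Link: <a href="https://example.com">example</a></p></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div")
# .string is None because the div has more than one child
print(div.string)

# get_text() joins the text of all descendants
text = div.get_text(separator=" ", strip=True)
print(text)
```

Here `separator=" "` and `strip=True` are optional arguments that trim each piece of text and join the pieces with a single space.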



**Get attributes of an element**

In [13]:
# Get a specific attribute from an element
link = soup.find("a")
link['href']

# only the value of the last expression in a cell is displayed
paragraphs = soup.select("p#paragraph-id")
paragraphs[0]["id"]

'paragraph-id'

**Code navigation**

In [14]:
# tags can be reached like a path, as attributes of the soup (soup.body)
print(soup.body.prettify())

<body>
 <div align="middle">
  <h1>
   HTML Webpage
  </h1>
  <p>
   Link to more interesting example:
   <a href="https://keithgalli.github.io/web-scraping/webpage.html">
    keithgalli.github.io/web-scraping/webpage.html
   </a>
  </p>
 </div>
 <h2>
  A Header
 </h2>
 <p>
  <i>
   Some italicized text
  </i>
 </p>
 <h2>
  Another header
 </h2>
 <p id="paragraph-id">
  <b>
   Some bold text
  </b>
 </p>
</body>



In [15]:
# know the terms parent, sibling, child
# elements on the same level are siblings
soup.body.find("div").find_next_siblings()

[<h2>A Header</h2>,
 <p><i>Some italicized text</i></p>,
 <h2>Another header</h2>,
 <p id="paragraph-id"><b>Some bold text</b></p>]
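
The cell above covers siblings; parents and children can be reached in a similar way. The sketch below uses an invented HTML string (written without whitespace between tags, so that no whitespace text nodes appear among the children) to show `.parent`, `.children`, and `find_next_siblings()` side by side.

```python
from bs4 import BeautifulSoup

# Made-up HTML, written without whitespace between the tags
html = "<body><div><h1>Title</h1><p>Intro</p></div><h2>A Header</h2></body>"
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1")
# parent: the element that directly contains h1
print(h1.parent.name)

# children: the nodes directly inside the div
div = soup.find("div")
child_names = [child.name for child in div.children]
print(child_names)

# siblings: elements that come after the div at the same level
sibling_names = [s.name for s in div.find_next_siblings()]
print(sibling_names)
```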

## Scrape drug reviews from Drugs.com

We create a dataframe with the following format:

#### | date | review | sentiment |
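
As a rough illustration of the target structure, the sketch below builds a pandas dataframe from rows in the `[date, review, sentiment]` layout that `get_reviews` appends further down. The row values are invented placeholders, not scraped data.

```python
import pandas as pd

# Placeholder rows in the [date, review, sentiment] layout; the values are invented
rows = [
    ["January 5, 2022", "Worked well for me.", "pos"],
    ["February 9, 2022", "Too many side effects.", "neg"],
]

# One row per review, with named columns
df = pd.DataFrame(rows, columns=["date", "review", "sentiment"])
print(df.shape)
print(df["sentiment"].value_counts().to_dict())
```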

In [None]:
def main():
    # 1. Get data from the "contraceptives" page on drugs.com
    url = "https://www.drugs.com/drug-class/contraceptives.html"
    r, webpage = get_webpage(url)

    # 2. Get table with drug names, reviews, etc.
    table_body = webpage.find("table", {"class":"ddc-table-sortable"})
    # collect the reviews of every qualifying drug in one list
    all_reviews = []
    
    # loop through the table with drug names (skipping the first and last rows)
    for row in table_body.find_all("tr")[1:-1]:
        # get the drug name
        drug_name = row.td.a.b.string
        
        # get the link to the reviews page, if the drug has one
        review_link = row.find("a", {"class":"ddc-text-nowrap"}, href=True)
        if review_link:
            # grab the number of reviews and convert the string into an integer
            drug_reviews = review_link.string.split(' ')[0]
            drug_reviews = int(drug_reviews.replace(',', ''))
        
            url_ = "https://www.drugs.com" + str(review_link['href'])
            # if there are at least 100 reviews, we collect them
            if drug_reviews >= 100:
                # get the dates, reviews, and sentiments of all drug reviews posted after 2021
                all_reviews.extend(get_reviews(url_, drug_name, drug_reviews, '2021'))
            
    # build the dataframe from the collected [date, review, sentiment] rows
    return pd.DataFrame(all_reviews, columns=['date', 'review', 'sentiment'])
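
The review count is read in `main` by taking the first token of the link text and removing the thousands separator. That step can be isolated and checked on its own, as in the sketch below. `parse_review_count` is a hypothetical helper name, and the sample strings are assumptions about what the link text looks like, not values taken from the site.

```python
def parse_review_count(link_text):
    """Return the leading number of a string like '1,234 reviews' as an int."""
    # take the first whitespace-separated token, e.g. "1,234"
    number = link_text.split()[0]
    # drop thousands separators before converting
    return int(number.replace(",", ""))


print(parse_review_count("1,234 reviews"))         # 1234
print(parse_review_count("87 reviews"))            # 87
print(parse_review_count("1,234 reviews") >= 100)  # True
```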

In [None]:
def get_webpage(link):
    '''Get the contents of a webpage.
    Input: 
        link = Link to the desired webpage
    Output: The requests response and a Beautiful Soup object containing the HTML data
    '''
    # load the webpage content 
    r = requests.get(link)
    # convert to a Beautiful Soup object 
    webpage = bs(r.content, "html.parser")
    return r, webpage


def get_reviews(url_, drug_name, drug_reviews, date_cutoff=False):
    '''Get all of the reviews of a specific drug
    Input: 
        url_ = link to the first page of the drug's reviews
        drug_name = name of the drug
        drug_reviews = number of reviews
        date_cutoff = year (as a string, e.g. '2021') at which to stop collecting reviews
    Output: A list containing a list for each review: [date, review text, sentiment]
    '''
    #-------------------------#
    # 1. initialize variables #
    #-------------------------#
    reviews, isHaveNextPage, page = [], True, 0
    
    #---------------------------------------------------------#
    # 2. "clicking" the "#### Reviews" hyperlink in the table #
    #---------------------------------------------------------#
    review_page_url = url_
    
    #---------------------------------------#
    # 3. cycle through each page of reviews #
    #---------------------------------------#
    while isHaveNextPage: 
        
        # access the page and sort the reviews by most recent reviews
        r_, review_page_content = get_webpage(review_page_url + f"?sort_reviews=most_recent&page={page}")
        
        # TO-DO: There is an issue here with grabbing reviews after page 5
        list_of_review_boxes = review_page_content.find_all("div", {"class":"ddc-comment ddc-box ddc-mgb-2"})
        
        # grab the date, review paragraph, and rating
        for review in list_of_review_boxes:
            # 1. find the date; stop once we reach a review from the cutoff year
            head = review.find("div", {"class":"ddc-comment-header"})
            date = head.find("span", string=re.compile(", ")).string
            # only apply the cutoff when one was given (the default is False)
            if date_cutoff and date.endswith(date_cutoff):
                return reviews
            
            # 2. get the review paragraph; [1:-1] drops the first and last characters of the text
            if review.p.b:
                review_paragraph = str(review.p.b.next_sibling).strip()[1:-1]
            else:
                review_paragraph = review.p.get_text().strip()[1:-1]
            
            # 3. find the review rating; skip reviews that have none
            rating_box = review.find("div", {"class":"ddc-rating-summary"})
            if rating_box is None:
                continue
            rating = int(rating_box.span.b.string)
            # ratings run from 1 to 10 (higher is better), so 5 and up counts as positive
            sentiment = 'pos' if rating >= 5 else 'neg'
    
            reviews.append([date, review_paragraph, sentiment])
            
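        # go to the next page; stop when the page has no reviews or no "next" link
        # (find returns None when nothing matches; find_all never does)
        if not list_of_review_boxes or review_page_content.find("li", class_='ddc-paging-item-next') is None:
            isHaveNextPage=False
        page += 1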
    print("Done with ", drug_name)
    return reviews
    

if __name__ == "__main__":
    reviews = main()
    # print explicitly: an expression inside an if block is not displayed
    print(reviews.head())
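
The cutoff in `get_reviews` relies on the date string ending with the cutoff year, which only triggers if a review from exactly that year appears. A slightly more defensive variant compares years numerically, as sketched below. The helper names `review_year` and `reached_cutoff` are hypothetical, and the date format (`"Month D, YYYY"`) is an assumption based on the `", "` pattern the scraper searches for; the dates themselves are invented.

```python
def review_year(date_text):
    """Extract the year from a date string that ends in ', YYYY'."""
    return int(date_text.rsplit(", ", 1)[-1])


def reached_cutoff(date_text, cutoff_year):
    """True once a review is from the cutoff year or earlier."""
    return review_year(date_text) <= int(cutoff_year)


# Invented example dates, newest first, as on pages sorted by most recent
dates = ["March 3, 2023", "July 19, 2022", "December 30, 2021", "May 1, 2020"]

kept = []
for d in dates:
    if reached_cutoff(d, "2021"):
        break  # every later review in the sorted list is older still
    kept.append(d)

print(kept)  # ['March 3, 2023', 'July 19, 2022']
```

Comparing with `<=` rather than matching the year exactly keeps the loop from running past the cutoff when a drug happens to have no reviews in the cutoff year itself.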