# Data Scraping Notebook
The scraper here focuses on doctors in Saskatchewan, but it can readily be adapted to any other region by changing the URL. Our focus will be the pages under https://www.ratemds.com/best-doctors/sk. Working from these listing pages keeps the number of requests relatively small. A request is what happens whenever a web page is accessed: we `request` the content of a page from the server. The more requests we make, the longer our script needs to run and the greater the strain on the server.

One way to get all the data we need is to compile a list of specialties and use it to access the corresponding web pages.
If we go to the [ratemds](https://www.ratemds.com/best-doctors/sk/regina) site, we can see that the specialties are listed. Upon exploring it, we note that each specialty page displays up to 10 doctors and their ratings.
The data will be restricted to medical personnel with at least one review.

### Identifying the URL structure
Our challenge now is to understand how the URL changes as we move between the pages we want to scrape. This will help us extract the parameters we want. For now, we are going to extract the __name, specialty, rating, votes, and gender__. The votes are the number of people who left reviews; the other fields are self-explanatory.
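
As a rough illustration of the URL logic, the sketch below assembles listing URLs from a province code and a city name, following the `best-doctors/sk/regina/` pattern used in this notebook. The helper name `build_url` is hypothetical, and the `page` query parameter is an assumption that should be checked against the site in the browser before relying on it.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ratemds.com/best-doctors"


def build_url(province, city=None, page=None):
    """Return a listing URL for a province and, optionally, a city and page."""
    parts = [BASE_URL, province.lower()]
    if city:
        parts.append(city.lower())
    url = "/".join(parts) + "/"
    # ASSUMPTION: pagination uses a "page" query parameter; verify in the browser.
    if page is not None and page > 1:
        url += "?" + urlencode({"page": page})
    return url


print(build_url("sk", "regina"))
print(build_url("sk", "regina", page=2))
```

Keeping the URL construction in one place means only this helper needs to change if the site's layout turns out to differ from the assumption.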

Let's further limit ourselves to doctors in the Regina region. The URL in this case is <em>https://www.ratemds.com/best-doctors/sk/regina/</em>. We fetch the page with the `get` function from the `requests` package.

In [None]:
from requests import get
url = 'https://www.ratemds.com/best-doctors/sk/regina/'
response = get(url)
print(response.text[:500])
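
Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch, not part of the original workflow: the helper `fetch_html` is hypothetical, and it takes the getter as an argument (any callable that behaves like `requests.get`) so that it can be tried without network access.

```python
def fetch_html(url, getter, timeout=10):
    """Fetch url with a requests-style getter and return the body text.

    getter is any callable with the same interface as requests.get.
    """
    response = getter(url, timeout=timeout)
    # With requests.get, this raises requests.HTTPError on 4xx/5xx responses.
    response.raise_for_status()
    return response.text


# Usage with the real library (this performs a network request):
# from requests import get
# html = fetch_html('https://www.ratemds.com/best-doctors/sk/regina/', get)
```

The `timeout` keeps the script from hanging indefinitely if the server does not answer; the value of 10 seconds is an arbitrary choice.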

### Understanding the HTML structure of a single page

The first line of `response.text` indicates that the server sent us an HTML document. The document describes the overall structure of the web page along with its specific content.
Upon inspection, we notice that the pages we want to scrape share the same overall layout and therefore the same HTML structure. So the script only needs to understand the HTML structure of a single page. The browser’s Developer Tools can be used to inspect it.

Each page has 11 health practitioners, and we can navigate between pages by clicking the page numbers displayed underneath. To parse the HTML document and extract the 11 health practitioner `div` containers, we use Python's BeautifulSoup module:
  -  Import the `BeautifulSoup` class from the `bs4` package.
  -  Parse `response.text` by creating a `BeautifulSoup` object.

In [None]:
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

In [None]:
doctor_container = html_soup.find_all('div', class_ = 'search-item doctor-profile')
print(type(doctor_container))
print(len(doctor_container))
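
One detail worth knowing about the search above: when `class_` is given a string containing a space, BeautifulSoup compares it with the whole value of the `class` attribute, so the same classes written in a different order do not match. The sketch below shows this on a small piece of hypothetical markup (not copied from the site) and compares it with a CSS selector, which matches any element carrying both classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes, written in different orders.
sample_html = """
<div class="search-item doctor-profile"><a>Dr. A</a></div>
<div class="doctor-profile search-item"><a>Dr. B</a></div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# A class_ string containing a space is compared with the whole attribute value.
exact = sample_soup.find_all('div', class_='search-item doctor-profile')
# A CSS selector matches elements that carry both classes, in any order.
both = sample_soup.select('div.search-item.doctor-profile')

print(len(exact), len(both))
```

If the number of containers found on a real page is lower than expected, switching to `select()` is one thing to try.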

Let's select only the first container and extract each item of interest: the __name, specialty, rating, votes, and gender__.

##### a. Name
Let's concentrate on the first item. Using the DevTools, we note that the name is contained within an anchor tag `<a>` inside the `doctor_container[0]` object. To extract it, use the command:

In [None]:
doctor_container[0].a.get_text()

##### b. Specialty
This data is stored within the `<div>` tag below the `<a>` that contains the name. Dot notation only accesses the first `<div>` element, so we search for the second `<div>` by its distinctive mark using the `find()` method. Note that `find()` is equivalent to `find_all(limit = 1)`, where the `limit` argument restricts the output to the first match, except that `find()` returns the tag itself rather than a one-item list. The distinguishing mark is the value __search-item-specialty__ assigned to the `class` attribute.
To extract it, use the command:

In [None]:
doctor_container[0].find('div', class_ = 'search-item-specialty').a.get_text()
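
The relationship between `find()` and `find_all(limit=1)` can be checked on a small piece of hypothetical markup that reuses the class names from this notebook. The sketch also shows that `find()` returns `None` when nothing matches, in which case chaining `.a.get_text()` would raise an `AttributeError`; the helper `text_or_none` is a hypothetical guard against that.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the class names referenced in this notebook.
sample_html = """
<div class="search-item doctor-profile">
  <a>Dr. Example</a>
  <div class="search-item-specialty"><a>Family Doctor</a></div>
</div>
"""
container = BeautifulSoup(sample_html, 'html.parser').div

# find() returns the tag itself; find_all(limit=1) returns a one-item list.
first = container.find('div', class_='search-item-specialty')
as_list = container.find_all('div', class_='search-item-specialty', limit=1)


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


missing = container.find('div', class_='no-such-class')
print(text_or_none(first.a), text_or_none(missing))
```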

##### c. Rating
As above, the rating is found in a tag, this time a `<span>`. We use the `find()` method with the distinguishing mark, the value __star-rating__ assigned to the `class` attribute. The rating is stored in the tag's `title` attribute, which is accessed like a dictionary key via `'title'`. Extract it using:

In [None]:
doctor_container[0].find('span', class_ = 'star-rating')['title'] 

##### d. Votes
The votes are in a `<div>` tag identified by the value __star-rating-count__. Using the `find()` method, the text can be extracted using:

In [None]:
doctor_container[0].find('div', class_ = "star-rating-count").get_text()
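
The extracted text still has to be turned into a number. Below is a minimal sketch of one way to do that with a regular expression; the helper name `parse_vote_count` and the sample strings are hypothetical, since the exact wording of the count text on the site is not shown here.

```python
import re


def parse_vote_count(text):
    """Return the first integer found in text, or 0 if there is none."""
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return 0
    # Drop thousands separators such as "1,204" before converting.
    return int(match.group().replace(',', ''))


# Hypothetical examples of count text.
for sample in ["12 reviews", " 1,204 reviews ", "No reviews"]:
    print(repr(sample), parse_vote_count(sample))
```

Compared with taking the first whitespace-separated token, this tolerates leading words and thousands separators, at the cost of silently returning 0 when no digits are present.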

##### e. Gender
The gender can be extracted in multiple ways, including navigating to the individual doctor's profile. Here we extract it from the profile picture, using the `find()` method to locate the `<img>` tag identified by the value __search-item-image__ and reading its `src` attribute. The following returns the first letter of the sex:

In [None]:
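import os
# URL of the profile picture
sexurl = doctor_container[0].find('img', class_="search-item-image")['src']
# The hard-coded index 44 depends on the exact layout of the image URL
os.path.dirname(sexurl)[44]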
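
Indexing a fixed character position stops working as soon as the image URL changes length. A possible alternative, sketched below, looks for a keyword in the URL instead. This rests on an assumption that is not confirmed by this notebook: that the image URL spells out the word "male" or "female". The helper name and the example URLs are hypothetical.

```python
def gender_from_image_url(image_url):
    """Return 'f', 'm', or None based on a keyword in the image URL.

    ASSUMPTION: the URL contains the word "female" or "male".
    """
    lowered = image_url.lower()
    # Check "female" first because "male" is a substring of it.
    if 'female' in lowered:
        return 'f'
    if 'male' in lowered:
        return 'm'
    return None


# Hypothetical URLs, for illustration only.
print(gender_from_image_url("https://example.invalid/img/doctor-female/avatar.png"))
print(gender_from_image_url("https://example.invalid/img/doctor-male/avatar.png"))
print(gender_from_image_url("https://example.invalid/img/photo-123.png"))
```

Returning `None` for unrecognised URLs makes it easy to spot doctors whose gender could not be inferred, for example those with a custom photo.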

In [None]:
# Lists to store the scraped data in
names = []
specialty = []
ratings = []
ratings_count = []
gender = []

# Extract data from each doctor container
for container in doctor_container:
    # Only keep doctors with a non-zero rating
    rating = float(container.find('span', class_='star-rating')['title'])
    if rating != 0:
        # Name
        names.append(container.a.get_text())
        # Specialty
        special = container.find('div', class_='search-item-specialty').a.get_text()
        specialty.append(special)
        # Rating
        ratings.append(rating)
        # Number of ratings (first token of the count text)
        num_ratings = container.find('div', class_='star-rating-count').get_text()
        ratings_count.append(int(num_ratings.split()[0]))
        # Gender (first letter, taken from the profile image URL)
        sexurl = container.find('img', class_='search-item-image')['src']
        gender.append(os.path.dirname(sexurl)[44])
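
The same extraction can also be written as a function that returns one record per doctor, which avoids keeping several lists in step. The sketch below is an alternative arrangement, not the notebook's own code: `parse_container` is a hypothetical helper, the markup is invented to mirror the class names used here, and gender is left out because it depends on the image URL layout.

```python
from bs4 import BeautifulSoup


def parse_container(container):
    """Return a dict for one doctor container, or None if it has no rating."""
    star = container.find('span', class_='star-rating')
    if star is None or float(star['title']) == 0:
        return None
    votes_text = container.find('div', class_='star-rating-count').get_text()
    specialty_div = container.find('div', class_='search-item-specialty')
    return {
        'Name': container.a.get_text(strip=True),
        'Specialty': specialty_div.a.get_text(strip=True),
        'Ratings': float(star['title']),
        'Rates': int(votes_text.split()[0]),
    }


# Hypothetical markup using the class names referenced in this notebook.
sample_html = """
<div class="search-item doctor-profile">
  <a>Dr. Example</a>
  <div class="search-item-specialty"><a>Family Doctor</a></div>
  <span class="star-rating" title="4.5"></span>
  <div class="star-rating-count">12 reviews</div>
</div>
<div class="search-item doctor-profile">
  <a>Dr. Unrated</a>
  <div class="search-item-specialty"><a>Dentist</a></div>
  <span class="star-rating" title="0"></span>
  <div class="star-rating-count">0 reviews</div>
</div>
"""
soup = BeautifulSoup(sample_html, 'html.parser')
containers = soup.find_all('div', class_='search-item doctor-profile')

# Keep only the doctors that have a rating.
records = [r for r in map(parse_container, containers) if r is not None]
print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`.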

Let’s check the data collected so far. Pandas makes it easy for us to see whether we’ve scraped our data successfully.

In [None]:
import pandas as pd
page_1 = pd.DataFrame({'Name': names,
                        'Specialty': specialty,
                        'Ratings': ratings,
                        'Rates': ratings_count,
                        'Gender': gender
                        })

In [None]:
page_1.head()

### The script for multiple pages
At this point, follow the markdown file or use the .py script.
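
For orientation only, the general shape of a multi-page loop might look like the sketch below. It is not the script referred to above: `scrape_pages`, `fetch`, and `parse` are hypothetical names, and the one-second pause is an arbitrary choice meant to limit the strain on the server.

```python
import time


def scrape_pages(urls, fetch, parse, delay_seconds=1.0):
    """Fetch and parse each URL in turn, pausing between requests.

    fetch(url) should return the page HTML; parse(html) should return
    a list of rows for that page.
    """
    rows = []
    for index, url in enumerate(urls):
        if index > 0:
            # Spread the requests out to limit the load on the server.
            time.sleep(delay_seconds)
        rows.extend(parse(fetch(url)))
    return rows


# Stand-in functions so the loop can be tried without network access.
demo_rows = scrape_pages(
    ["page-1", "page-2"],
    fetch=lambda url: url.upper(),
    parse=lambda html: [html],
    delay_seconds=0,
)
print(demo_rows)
```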