In [1]:
import pandas as pd
import datetime
import re
import requests
import bs4

# Web Scraping

### Steps:
- Send a request to `coursereport.com` to get the HTML
    - *Replace our browser*
- Parse the HTML
    - *Understand HTML*
- Find the right tags
    - *Use the browser inspector*
- Extract the required information from those tags
- Build little functions that do these things

### 1. Get html

In [2]:
url = "https://www.coursereport.com/best-coding-bootcamps"

In [3]:
requests.get(url)

<Response [200]>

In [4]:
type(requests.get(url))

requests.models.Response

In [5]:
resp = requests.get(url)

In [6]:
resp.content



The raw content is one long bytes string of HTML and is hard to read as is. We need to parse it before we can work with it.
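The request step can be made a little more defensive. The sketch below is only an illustration: the helper name `fetch_html`, the 10-second timeout, and the injectable `get` parameter are assumptions, not part of the original notebook. `raise_for_status()` and the `timeout` argument are standard `requests` features.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the raw HTML bytes for `url`, failing loudly on HTTP errors.

    `get` defaults to requests.get; passing another callable makes the
    helper usable without network access (hypothetical convenience).
    """
    resp = get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently handing back an error page.
    resp.raise_for_status()
    return resp.content
```

With this helper, `fetch_html(url)` would stand in for `requests.get(url).content`.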

### 2. Parse html and make a soup

In [7]:
suppe = bs4.BeautifulSoup(resp.content, "html.parser")

In [8]:
type(suppe)

bs4.BeautifulSoup

In [9]:
suppe

<!DOCTYPE html>
<!--[if IE 8]><html class="ie ie8 lt-ie9"><![endif]--><!--[if IE 9]><html class="ie ie9"><![endif]--><!--[if (gte IE 9)|!(IE)]<!--><html><!--<![endif]--><head>
<script>window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"04fb2dfdee","applicationID":"3752730","transactionName":"clsKERQNDlxQRB0EA0JAOwYJBgteUmlQCQlFVwUIFhFNWVtSVx4=","queueTime":4,"applicationTime":135,"agent":""}</script>
<script>(window.NREUM||(NREUM={})).loader_config={xpid:"UA8GVVFQGwAHUVNVBAE=",licenseKey:"04fb2dfdee",applicationID:"3752730"};window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o||n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);return r}({1:[function(t,n,e){function r(t){try{s.console&&console.log(t)}catch(n){}}var o,i=t("ee"),a=t(21),s={};try{

The entire HTML document is now stored as a `BeautifulSoup` object, so we can find and access the individual tags.

### 3. Get data

We want to have a table that has the following columns:

- Rank
- Name
- Overall Rating
- Stars
- Number of reviews
- Locations
- Description


Approach:

- Start with one school
- First, just try to extract each piece of data interactively
- Then wrap those steps into functions
- etc.
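To make the target table concrete, one row can be described as a small record type. This is only a sketch: the class name `SchoolRecord`, the field names, and the placeholder values are assumptions chosen for illustration, not scraped data.

```python
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class SchoolRecord:
    """One row of the target table (hypothetical container)."""
    rank: int
    name: str
    overall_rating: float
    stars: float
    n_reviews: int
    locations: List[str] = field(default_factory=list)
    description: str = ""


# Placeholder values purely for illustration, not scraped data.
example = SchoolRecord(
    rank=1,
    name="Example School",
    overall_rating=4.5,
    stars=4.5,
    n_reviews=100,
    locations=["Online"],
    description="Placeholder description.",
)

# asdict() turns the record into a plain dict, one key per column.
row = asdict(example)
print(list(row))
```

A list of such dicts can later be passed to `pd.DataFrame(...)` to build the table.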

#### 3.1. Get data for Flatiron School

In [28]:
# Walk the DOM tree
school_items = (suppe
                .body
                .find("div", class_="main-body")
                .find("div", class_="longform-body container")
                .find("div", class_="row")
                .find("div", class_="col-md-11")
                .find("ul", id="schools")
                .find_all("li"))

In [29]:
school_items

[<li><div class="info-container"><a href="/schools/flatiron-school"><div class="school-image"><img alt="flatiron-school-logo" src="https://course_report_production.s3.amazonaws.com/rich/rich_files/rich_files/999/s100/flatironschool.png" title="Flatiron School Logo"/></div></a><h3><a href="/schools/flatiron-school">1. Flatiron School</a></h3><span class="banner-container"><img alt="Established school badge" class="banner" src="https://coursereport-production-herokuapp-com.global.ssl.fastly.net/assets/established_school_badge-d099e568a815b527a609dbea7ca07bb9.png"/><img alt="Large alumni network badge" class="banner" src="https://coursereport-production-herokuapp-com.global.ssl.fastly.net/assets/large_alumni_network_badge-00d96602124ae1fffd2e993b116ea803.png"/><img alt="Transparent outcomes badge" class="banner" src="https://coursereport-production-herokuapp-com.global.ssl.fastly.net/assets/transparent_outcomes_badge-d69b653a170c89c52079317b1b985664.png"/><div class="ratings title-rating"

In [31]:
len(school_items)

48

This matches the 48 schools listed on the page.

In [32]:
school_items[0]

<li><div class="info-container"><a href="/schools/flatiron-school"><div class="school-image"><img alt="flatiron-school-logo" src="https://course_report_production.s3.amazonaws.com/rich/rich_files/rich_files/999/s100/flatironschool.png" title="Flatiron School Logo"/></div></a><h3><a href="/schools/flatiron-school">1. Flatiron School</a></h3><span class="banner-container"><img alt="Established school badge" class="banner" src="https://coursereport-production-herokuapp-com.global.ssl.fastly.net/assets/established_school_badge-d099e568a815b527a609dbea7ca07bb9.png"/><img alt="Large alumni network badge" class="banner" src="https://coursereport-production-herokuapp-com.global.ssl.fastly.net/assets/large_alumni_network_badge-00d96602124ae1fffd2e993b116ea803.png"/><img alt="Transparent outcomes badge" class="banner" src="https://coursereport-production-herokuapp-com.global.ssl.fastly.net/assets/transparent_outcomes_badge-d69b653a170c89c52079317b1b985664.png"/><div class="ratings title-rating">

The first element refers to Flatiron School!

We don't have to walk down the whole DOM tree if the
tags we want to access are uniquely identifiable, for example by their `id`.

In [35]:
(suppe
 .find("ul", id="schools")
 .find_all("li")) == school_items

True

In [36]:
school_items = (suppe
                .find("ul", id="schools")
                .find_all("li"))
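The same shortcut can also be written as a CSS selector with `select`, a standard BeautifulSoup method. The sketch below runs on a tiny stand-in HTML snippet (made up for illustration) rather than the live page, and the variable names are assumptions.

```python
import bs4

# Stand-in markup, loosely mimicking the structure of the real page.
html = """
<body><div class="main-body">
  <ul id="schools">
    <li><h3><a href="/schools/a">1. School A</a></h3></li>
    <li><h3><a href="/schools/b">2. School B</a></h3></li>
  </ul>
  <ul id="other"><li>not a school</li></ul>
</div></body>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Chained find / find_all, as used in the notebook.
by_find = soup.find("ul", id="schools").find_all("li")
# Equivalent CSS selector: every <li> inside the <ul> with id "schools".
by_select = soup.select("ul#schools li")

print(by_find == by_select)
```

Both approaches return the `<li>` tags in document order, so which one to use is mostly a matter of taste.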

In [37]:
flatiron_item = school_items[0] 

#### 3.2. Get rank and name

In [48]:
# This returns the wrong tag: find("a") picks the first <a> (the logo link), but the one we want sits inside the <h3>
flatiron_item.find("a")

<a href="/schools/flatiron-school"><div class="school-image"><img alt="flatiron-school-logo" src="https://course_report_production.s3.amazonaws.com/rich/rich_files/rich_files/999/s100/flatironschool.png" title="Flatiron School Logo"/></div></a>

In [44]:
flatiron_item.find("h3").find("a")

<a href="/schools/flatiron-school">1. Flatiron School</a>

In [45]:
# Access the content of a tag
flatiron_item.find("h3").find("a").contents

['1. Flatiron School']

In [47]:
# Access text
flatiron_item.find("h3").find("a").text

'1. Flatiron School'

In [49]:
rank_pattern = r"^(\d{1,2})\."
name_pattern = r"^\d{1,2}\.\s(.+)"

In [54]:
re.findall(rank_pattern,
           (flatiron_item
            .find("h3")
            .find("a")
            .text))[0]

'1'

In [55]:
re.findall(name_pattern,
           (flatiron_item
            .find("h3")
            .find("a")
            .text))[0]

'Flatiron School'
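Following the plan to wrap each step into a small function, rank and name can be pulled out with a single pattern. This is a sketch, not the notebook's final code: the function name `parse_rank_and_name` and the fallback for headings without a rank are assumptions.

```python
import re

# One pattern with two groups: the leading number and the rest of the heading.
HEADING_PATTERN = re.compile(r"^(\d+)\.\s+(.+)$")


def parse_rank_and_name(heading):
    """Split a heading like '1. Flatiron School' into (rank, name)."""
    heading = heading.strip()
    match = HEADING_PATTERN.match(heading)
    if match is None:
        # No leading rank found: return the text unchanged as the name.
        return None, heading
    return int(match.group(1)), match.group(2)


print(parse_rank_and_name("1. Flatiron School"))
```

Converting the rank to `int` here means it will sort numerically in the final table.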

#### 3.3. Get overall rating

In [59]:
rating = (flatiron_item
          .find("span",
                class_="longform-rating-text")
          .text)
rating

'Overall Rating: (4.71) '

In [60]:
rating_pattern = r"\((.+)\)"
re.findall(rating_pattern, rating)[0]

'4.71'
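The rating comes back as a string, while the table most likely wants a number. A minimal sketch of a helper that does the conversion, assuming the name `parse_rating` and a stricter digits-only pattern than the one used above:

```python
import re


def parse_rating(text):
    """Extract the number in parentheses, e.g. 'Overall Rating: (4.71) ' -> 4.71."""
    # Only digits with an optional decimal part are accepted inside the parentheses.
    match = re.search(r"\((\d+(?:\.\d+)?)\)", text)
    return float(match.group(1)) if match else None


print(parse_rating("Overall Rating: (4.71) "))
```

Returning `None` when nothing matches keeps a school without a rating from raising an `IndexError`.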

#### 3.4. Get number of reviews

In [70]:
flatiron_item.find_all("span",
                     class_="longform-rating-text")

[<span class="longform-rating-text">Overall Rating: (4.71) </span>,
 <span class="longform-rating-text"><a href="/schools/flatiron-school#reviews">435 Reviews</a></span>,
 <span class="longform-rating-text">Reviewer's Score: </span>]

In [66]:
reviews = (flatiron_item
           .find_all("span",
                     class_="longform-rating-text")[1]
           .find("a")
           .text)
reviews

'435 Reviews'

In [68]:
reviews_pattern = r"^(\d+)\s"  # \d+ requires at least one digit
re.findall(reviews_pattern, reviews)[0]

'435'
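Picking the second span by position (`[1]`) breaks if the page reorders or drops a span. One possible alternative, sketched here on a copy of the three spans shown above, is to look for the link whose `href` ends in `#reviews`. The variable names are assumptions; passing a compiled regex as an attribute filter is standard BeautifulSoup behaviour.

```python
import re

import bs4

# Copy of the three spans from the Flatiron list item.
html = """<li>
<span class="longform-rating-text">Overall Rating: (4.71) </span>
<span class="longform-rating-text"><a href="/schools/flatiron-school#reviews">435 Reviews</a></span>
<span class="longform-rating-text">Reviewer's Score: </span>
</li>"""
item = bs4.BeautifulSoup(html, "html.parser")

# Find the <a> whose href ends with "#reviews", wherever it sits in the item.
link = item.find("a", href=re.compile(r"#reviews$"))
# The link text looks like "435 Reviews"; the first token is the count.
review_count = int(link.text.split()[0]) if link else None

print(review_count)
```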

#### 3.5. Get Locations

In [74]:
(flatiron_item
 .find("span",
       class_="location")
 .find_all("a"))[0].text

'Brooklyn'

In [77]:
location_list = (flatiron_item
                 .find("span",
                       class_="location")
                 .find_all("a"))
location_list

[<a href="/cities/brooklyn">Brooklyn</a>,
 <a href="/cities/houston">Houston</a>,
 <a href="/cities/london">London</a>,
 <a href="/cities/austin-coding-bootcamps">Austin</a>,
 <a href="/cities/denver">Denver</a>,
 <a href="/cities/seattle-coding-bootcamps">Seattle</a>,
 <a href="/cities/new-york-city-coding-bootcamps">New York City</a>,
 <a href="/cities/chicago">Chicago</a>,
 <a href="/cities/washington-coding-bootcamps">Washington</a>,
 <a href="/cities/online-coding-bootcamps">Online</a>,
 <a href="/cities/atlanta">Atlanta</a>,
 <a href="/cities/san-francisco-coding-bootcamps">San Francisco</a>]

In [79]:
[loc.text for loc in location_list]

['Brooklyn',
 'Houston',
 'London',
 'Austin',
 'Denver',
 'Seattle',
 'New York City',
 'Chicago',
 'Washington',
 'Online',
 'Atlanta',
 'San Francisco']

#### 3.6. Get description

In [82]:
(flatiron_item
 .find("div",
       class_="desc-container")
 .find_all("p"))[1].text

'Flatiron School offers immersive on-campus and online programs in software engineering, data science, UX/UI design, and cybersecurity in NYC, Brooklyn, Washington DC, London, Houston, Atlanta, Austin, Seattle, Chicago, Denver, and Online. Flatiron School’s immersive courses aim to launch students into fulfilling careers as software engineers, data scientists, and UX/UI designers through rigorous, market-aligned curricula, and the support of seasoned instructors and personal career coaches. Through test-driven labs and portfolio projects, Flatiron teaches students\xa0to think\xa0and build\xa0like software engineers and data scientists. Flatiron School’s UX/UI Design Immersive includes a client project to give students client-facing experience and an industry-vetted portfolio.'
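The description contains non-breaking spaces (`\xa0`). If plain spaces are preferred in the final table, a small clean-up helper is one option. The name `clean_text` is an assumption; the sketch relies on `str.split()` treating `\xa0` as whitespace.

```python
def clean_text(text):
    """Collapse every run of whitespace, including non-breaking spaces, into one space."""
    return " ".join(text.split())


# Fragment taken from the description output above.
raw = "Flatiron teaches students\xa0to think\xa0and build\xa0like software engineers"
print(clean_text(raw))
```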

#### 3.7. Get stars

In [87]:
stars = (flatiron_item
         .find("div",
               class_="ratings title-rating")
         .find_all("span"))[1:]
stars

[<span class="icon-full_star"></span>,
 <span class="icon-full_star"></span>,
 <span class="icon-full_star"></span>,
 <span class="icon-full_star"></span>,
 <span class="icon-half_star"></span>]

In [114]:
type(stars[0])

bs4.element.Tag

In [95]:
stars[0]["class"][0]

'icon-full_star'

In [96]:
[star["class"][0] for star in stars]

['icon-full_star',
 'icon-full_star',
 'icon-full_star',
 'icon-full_star',
 'icon-half_star']

In [97]:
# translate class names to numeric star values
stars_dict = {"icon-full_star": 1,
              "icon-half_star": .5}

In [104]:
stars[0]["class"][0]

'icon-full_star'

In [105]:
stars_dict[stars[0]["class"][0]]

1

In [113]:
sum([stars_dict[star["class"][0]] for star in stars])

4.5
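The star logic can be wrapped into a function as well. This is a minimal sketch: the name `count_stars` is an assumption, and so is the idea that other class names (for example an empty-star icon) may appear and should count as zero.

```python
# Same mapping as stars_dict above, kept as floats for consistency.
STAR_VALUES = {"icon-full_star": 1.0, "icon-half_star": 0.5}


def count_stars(class_names):
    """Sum the star value of each CSS class name; unknown names count as 0."""
    return sum(STAR_VALUES.get(name, 0.0) for name in class_names)


classes = ["icon-full_star"] * 4 + ["icon-half_star"]
print(count_stars(classes))
```

In the notebook, the input would be `[star["class"][0] for star in stars]`.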