# Web Scraping Demo - User Profile

#### Author: Yu-Chang Ho (Andy), UC Davis
#### Latest Update: 2019 10/13


This notebook demonstrates a basic workflow for scraping data with the BeautifulSoup4 library. Please use the website [https://hipposerver.ddns.net/webscraping/](https://hipposerver.ddns.net/webscraping/) as the sample webpage to practice web scraping.

- Target website: [https://hipposerver.ddns.net/webscraping/](https://hipposerver.ddns.net/webscraping/)
- Objective: Extract the user profile data into a clean, tabular format that can be saved as CSV

## I. Library import

In [1]:
### import the required libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

## II. Retrieve the webpage source code

In [2]:
### connect to the server and get the content of the webpage

url = 'https://hipposerver.ddns.net/webscraping/'

# get the source code of the webpage
r = requests.get( url )
# check it out!
print( r.text )

<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<title>Web-scraping Demo - User Profile Example</title>

<script src="assets/js/syntax_highlight.js"></script>

<style>
body {
    font-size: 1.5em;
}

.bold-text {
    font-weight: bold;    
}
</style>

</head>
<body>
    <h1>User Profile Example</h1>
    <hr />
    <div class="freelancer">
        <div id="name">Andy Ho</div>
        <div id="location">Davis, CA, USA</div>
        <div class="bold-text">$50/hr</div>
        <div class="experience">1.5 years</div>
        <div class="rating" data="5.0">Excellent</div>
        <div class="domain">Software</div>
    </div>

    <!-- ********** IGNORE THE BELOW PART!! ********** -->

    <pre id="code">
        &lt;div class="freelancer"&gt;
            &lt;div id="name"&gt;Andy Ho&lt;/div&gt;
            &lt;div id="location"&gt;Davis, CA, USA&lt;/div&gt;
            &lt;div class="bold-text"&gt;$50/hr&lt;/div&gt;
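
The request above assumes the server responds successfully. As a minimal sketch (the helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original notebook), the same call can be wrapped so that a failed request raises an error instead of silently returning an error page:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page source as text, or raise if the request fails."""
    # a timeout keeps the call from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

Calling `fetch_html( url )` would then take the place of `requests.get( url ).text` in the cell above.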

## III. Create a BeautifulSoup4 Parser

In [3]:
### create a Beautiful Soup parser
'''
'html.parser' is Python's built-in parser and needs no extra installation;
many users suggest 'html5lib' instead, which must be installed separately
'''
soup = BeautifulSoup( r.text, 'html.parser' )
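
If you would like to try `html5lib` without breaking the notebook on machines where it is not installed, one possible approach is to fall back to the built-in parser. This is only a sketch: the helper name `make_soup` and the inline HTML snippet are assumptions for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = '<div class="freelancer"><div id="name">Andy Ho</div></div>'


def make_soup(markup, preferred="html5lib"):
    """Parse with the preferred parser, falling back to the built-in one."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # the preferred parser is not installed; use Python's built-in parser
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html)
print(soup.find("div", {"id": "name"}).text)
```

For a simple page like this one, both parsers should locate the same elements; the difference mostly matters for malformed HTML.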

## IV. Locate the Target Element

In [4]:
### search for a specific element
# find() returns the first matching element; here, the first div
e = soup.find( 'div' )
print( e )

<div class="freelancer">
<div id="name">Andy Ho</div>
<div id="location">Davis, CA, USA</div>
<div class="bold-text">$50/hr</div>
<div class="experience">1.5 years</div>
<div class="rating" data="5.0">Excellent</div>
<div class="domain">Software</div>
</div>


In [5]:
# search for an element by a specific class name
e = soup.find( 'div', class_='bold-text' )
print( e )

<div class="bold-text">$50/hr</div>


In [6]:
# search for an element by a specific id
e = soup.find( 'div', { "id": "name" } )
print( e )

<div id="name">Andy Ho</div>
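
As an alternative to passing a class name or an id to `find()`, BeautifulSoup also accepts CSS selectors through `select_one()`. The sketch below uses a shortened inline copy of the sample HTML (an assumption, so that the example runs without the server):

```python
from bs4 import BeautifulSoup

html = """
<div class="freelancer">
    <div id="name">Andy Ho</div>
    <div class="bold-text">$50/hr</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# '.bold-text' matches by class, '#name' matches by id
by_class = soup.select_one("div.bold-text")
by_id = soup.select_one("div#name")

print(by_class)
print(by_id)
```

Both styles return the same kind of element object, so which one to use is mostly a matter of preference.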


### Recall the HTML Structure

```html
<div class="freelancer">
    <div id="name">Andy Ho</div>
    <div id="location">Davis, CA, USA</div>
    <div class="bold-text">$50/hr</div>
    <div class="experience">1.5 years</div>
    <div class="rating" data="5.0">Excellent</div>
    <div class="domain">Software</div>
</div>
```

In [7]:
### search for multiple elements
# find_all() returns every match; enumerate() numbers them starting from 1
for idx, e in enumerate( soup.find_all( 'div' ), start=1 ):
    print( f"No.{idx}:\t{e}" )

No.1:	<div class="freelancer">
<div id="name">Andy Ho</div>
<div id="location">Davis, CA, USA</div>
<div class="bold-text">$50/hr</div>
<div class="experience">1.5 years</div>
<div class="rating" data="5.0">Excellent</div>
<div class="domain">Software</div>
</div>
No.2:	<div id="name">Andy Ho</div>
No.3:	<div id="location">Davis, CA, USA</div>
No.4:	<div class="bold-text">$50/hr</div>
No.5:	<div class="experience">1.5 years</div>
No.6:	<div class="rating" data="5.0">Excellent</div>
No.7:	<div class="domain">Software</div>
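
Note that `find_all( 'div' )` searches all nesting levels, which is why No.1 (the parent) contains the other six elements. If only the direct children of an element are wanted, `find_all()` accepts `recursive=False`. A minimal sketch on a shortened inline copy of the sample HTML (an assumption for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div class="freelancer">
    <div id="name">Andy Ho</div>
    <div id="location">Davis, CA, USA</div>
    <div class="bold-text">$50/hr</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", class_="freelancer")

# recursive=False restricts the search to direct children of `parent`
children = parent.find_all("div", recursive=False)
for idx, child in enumerate(children, start=1):
    print(f"No.{idx}:\t{child}")
```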


### Find the elements under a parent element

In the code above, we have the line:

`soup = BeautifulSoup( r.text, 'html.parser' )`,

which makes the variable `soup` hold the entire parsed content of the webpage.

Later, we used `soup.find()` to locate a specific element.

`find()` returns an element object that supports the same search methods, so to find a sub-element under a retrieved element, we call `find()` on that element in the same way.

### Recall the HTML Structure

```html
<div class="freelancer">
    <div id="name">Andy Ho</div>
    <div id="location">Davis, CA, USA</div>
    <div class="bold-text">$50/hr</div>
    <div class="experience">1.5 years</div>
    <div class="rating" data="5.0">Excellent</div>
    <div class="domain">Software</div>
</div>
```

In [8]:
### Assuming we want to find the elements under <div class="freelancer">
# 1. get the parent
parent = soup.find( 'div', class_='freelancer' )

# 2. call the find() function on the located element
name = parent.find( 'div', { "id": "name" } )
print( name )

<div id="name">Andy Ho</div>
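
One thing to keep in mind: `find()` returns `None` when nothing matches, so reading `.text` on a missing element raises an `AttributeError`. The sketch below shows one way to guard against that; the helper name `text_or_none` and the `phone` id are made up for illustration and do not exist on the sample page.

```python
from bs4 import BeautifulSoup

html = '<div class="freelancer"><div id="name">Andy Ho</div></div>'
soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", class_="freelancer")


def text_or_none(element):
    """Return the stripped text of an element, or None if it was not found."""
    return element.get_text(strip=True) if element is not None else None


# find() returns None when nothing matches, so guard before reading the text
name = text_or_none(parent.find("div", {"id": "name"}))
phone = text_or_none(parent.find("div", {"id": "phone"}))  # no such element

print(name)
print(phone)
```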


## V. Get the Value We Want

In [9]:
### retrieve the value
# get the name
# the value we want is placed within the element block

'''
<div id="name">Andy Ho</div>
'''

e = soup.find( 'div', { "id": "name" } )
print( e.text )

Andy Ho


In [10]:
# get the rating
# the value we want is placed in the element attribute

'''
<div class="rating" data="5.0">Excellent</div>
'''

e = soup.find( 'div', class_="rating" )
print( e[ 'data' ] )

5.0
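
Besides the `e[ 'data' ]` lookup shown above, an element also exposes all of its attributes as a dictionary and offers a `get()` method that does not fail on a missing attribute. A small sketch on the rating element alone (the `score` attribute is hypothetical and only used to show the default value):

```python
from bs4 import BeautifulSoup

html = '<div class="rating" data="5.0">Excellent</div>'
soup = BeautifulSoup(html, "html.parser")
e = soup.find("div", class_="rating")

print(e.attrs)                # all attributes as a dict
print(e["data"])              # raises KeyError if the attribute is missing
print(e.get("data"))          # returns None if the attribute is missing
print(e.get("score", "n/a"))  # 'score' does not exist, so the default is used
```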


## VI. Create DataFrame

In [11]:
### prepare the output dataframe

# 1. find the parent that holds all the elements we want
parent = soup.find( 'div', class_='freelancer' )
print( parent )

<div class="freelancer">
<div id="name">Andy Ho</div>
<div id="location">Davis, CA, USA</div>
<div class="bold-text">$50/hr</div>
<div class="experience">1.5 years</div>
<div class="rating" data="5.0">Excellent</div>
<div class="domain">Software</div>
</div>


In [12]:
# 2. retrieve the data
name   = parent.find( 'div', { 'id': 'name' } ).text
loc    = parent.find( 'div', { 'id': 'location' } ).text
salary = parent.find( 'div', class_='bold-text' ).text
exp    = parent.find( 'div', class_='experience' ).text
rating = parent.find( 'div', class_='rating' )[ "data" ]
domain = parent.find( 'div', class_='domain' ).text

print( name )
print( loc )
print( salary )
print( exp )
print( rating )
print( domain )

Andy Ho
Davis, CA, USA
$50/hr
1.5 years
5.0
Software


In [13]:
# 3. data cleaning and transformation
parts = str(loc).split( ", " )
city = parts[ 0 ]
state = parts[ 1 ]
country = parts[ 2 ]

exp = float(str(exp).replace( " years", "" ))

rating = float(str(rating))
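
The notebook keeps the salary as the string `$50/hr`. If a numeric value is preferred, a regular expression is one option. This is a sketch under the assumption that the value always looks like `$<number>/hr`; the variable name `hourly_rate` is made up for illustration.

```python
import re

salary = "$50/hr"

# capture the numeric part between the dollar sign and '/hr'
match = re.fullmatch(r"\$(\d+(?:\.\d+)?)/hr", salary)
# fall back to None when the text does not follow the expected pattern
hourly_rate = float(match.group(1)) if match else None

print(hourly_rate)
```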

In [14]:
# 4. collect the data as a row
row = [ name, city, state, country, salary, exp, rating, domain ]

print( row )

['Andy Ho', 'Davis', 'CA', 'USA', '$50/hr', 1.5, 5.0, 'Software']


In [15]:
# 5. create a dataframe and append the data row
header = [ "name", "city", "state", "country", "expected_salary", "experience", "rating", "domain" ]
df = pd.DataFrame( [row], columns=header )
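
With a separate `header` list and `row` list, the two must stay in the same order. A possible alternative, sketched below with the values from this notebook typed in directly, is to build the row as a dictionary so that each value sits next to its column name (the variable name `record` is an assumption for illustration):

```python
import pandas as pd

# each key becomes a column name, so values cannot drift out of order
record = {
    "name": "Andy Ho",
    "city": "Davis",
    "state": "CA",
    "country": "USA",
    "expected_salary": "$50/hr",
    "experience": 1.5,
    "rating": 5.0,
    "domain": "Software",
}

# a list with one dict produces a DataFrame with one row
df = pd.DataFrame([record])
print(df)
```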

In [17]:
# 6. the result
print( df )

      name   city state country expected_salary  experience  rating    domain
0  Andy Ho  Davis    CA     USA          $50/hr         1.5     5.0  Software
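
The objective of this notebook is CSV output, so the last step would be to export the DataFrame. A minimal sketch follows; the DataFrame is rebuilt from the values above so that the example runs on its own, and the filename in the final comment is only an example.

```python
import pandas as pd

header = ["name", "city", "state", "country", "expected_salary", "experience", "rating", "domain"]
row = ["Andy Ho", "Davis", "CA", "USA", "$50/hr", 1.5, 5.0, "Software"]
df = pd.DataFrame([row], columns=header)

# with no path, to_csv() returns the CSV text; index=False drops the row labels
csv_text = df.to_csv(index=False)
print(csv_text)

# to write a file instead, pass a path (the filename here is just an example):
# df.to_csv("user_profile.csv", index=False)
```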
