# Web Scraping With Python
Web scraping is the use of a program to extract and process data from web pages. Whether you are a data scientist, an engineer, or anyone else who analyzes large datasets, the ability to scrape data from the web is a useful skill. When you find data on a web page and there is no direct way to download it, you can use web scraping with Python to extract the data into a form that can be imported for analysis.

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [2]:
url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)

The Beautiful Soup package parses the HTML: it takes the raw HTML text and breaks it into Python objects. The second argument, 'lxml', names the HTML parser to use. The resulting soup object lets you extract information from the page.

In [3]:
soup = BeautifulSoup(html, 'lxml')
type(soup)

bs4.BeautifulSoup

In [4]:
# Get the title
title = soup.title
print(title)

<title>2017 Intel Great Place to Run 10K \ Urban Clash Games Race Results</title>
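
To experiment with parsing without hitting the network, the same steps can be tried on a small inline document. This is only a sketch: `sample_html` and its contents are made up for illustration, and it uses Python's built-in "html.parser" instead of 'lxml' so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page, so the example runs offline.
sample_html = """
<html>
  <head><title>Sample Race Results</title></head>
  <body><h1>10K</h1><p>Finishers: 3</p></body>
</html>
"""

# "html.parser" ships with Python; 'lxml' would need a separate install.
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_soup.title             # a Tag object: <title>...</title>
title_text = title_tag.get_text()         # only the text inside the tag
heading_text = sample_soup.h1.get_text()  # first <h1> in the document

print(title_tag)
print(title_text)
```

Printing the tag shows the markup, while `get_text()` returns only the text it contains.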


## Save the page as txt

In [5]:
# Save the prettified HTML to a text file named after the last URL segment
file_name = f'{url.split("/")[-1]}.txt'
with open(file_name, 'w', encoding='utf-8') as f_out:
    f_out.write(soup.prettify())
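
Taking the last piece of `url.split("/")` yields an empty name when the URL ends with a slash, and it keeps any query string. One possible way to make the file name more robust is sketched below; `filename_from_url` and its `default` and `suffix` parameters are hypothetical helpers, not part of the original notebook.

```python
from urllib.parse import urlsplit


def filename_from_url(page_url, default="index", suffix=".txt"):
    """Build a file name from the last non-empty segment of the URL path."""
    path = urlsplit(page_url).path                # ignores scheme, host, query, fragment
    segments = [s for s in path.split("/") if s]  # drop empty pieces from extra slashes
    stem = segments[-1] if segments else default  # fall back when the path is just "/"
    return stem + suffix


print(filename_from_url("http://www.hubertiming.com/results/2017GPTR10K"))
```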

You can use the soup object's find_all() method to extract HTML tags from a webpage. Useful tags include `<a>` for hyperlinks, `<table>` for tables, `<tr>` for table rows, `<th>` for table headers, and `<td>` for table cells. The code below extracts all the hyperlinks on the page.

In [6]:
soup.find_all('a')

[<a href="mailto:timing@hubertiming.com">timing@hubertiming.com</a>,
 <a href="https://www.hubertiming.com/">Huber Timing Home</a>,
 <a class="btn btn-primary btn-lg" href="/results/2017GPTR" role="button" style="margin: 0px 0px 5px 5px">5K</a>,
 <a class="btn btn-primary btn-lg" href="/results/summary/2017GPTR10K" role="button" style="margin: 0px 0px 5px 5px">Summary</a>,
 <a class="btn btn-secondary btn-sm" href="#team" role="button"><i aria-hidden="true" class="fa fa-users"></i> Team Results</a>,
 <a class="btn btn-secondary btn-sm" href="#individual" role="button"><i aria-hidden="true" class="fa fa-user"></i> Individual Results</a>,
 <a name="team"></a>,
 <a id="individual" name="individual"></a>,
 <a href="#tabs-1" style="font-size: 18px">10K Results</a>,
 <a href="https://www.hubertiming.com/"><img height="65" src="/sites/all/themes/hubertiming/images/clockWithFinishSign_small.png" width="50"/>Huber Timing</a>,
 <a href="https://facebook.com/hubertiming/"><img src="/results/FB-f-
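
find_all() also accepts filters that narrow the result. The sketch below runs on a small made-up document (`sample_html` is illustrative, not taken from the race page) and uses the built-in "html.parser" so it works without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
sample_html = """
<a class="btn" href="/results/5k">5K</a>
<a href="https://example.com/">Home</a>
<a name="team"></a>
<table>
  <tr><th>Place</th><th>Name</th></tr>
  <tr><td>1</td><td>Runner A</td></tr>
  <tr><td>2</td><td>Runner B</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

all_links = sample_soup.find_all("a")                   # every <a> tag
links_with_href = sample_soup.find_all("a", href=True)  # only <a> tags that have an href
button_links = sample_soup.find_all("a", class_="btn")  # filter by CSS class
first_two_cells = sample_soup.find_all("td", limit=2)   # stop after two matches
header_and_data = sample_soup.find_all(["th", "td"])    # several tag names at once

print(len(all_links), len(links_with_href), len(button_links))
```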

## Filtering the links of the page
As the output above shows, HTML tags often carry attributes such as class, href, and src, which provide additional information about the element. You can use a for loop and the get("href") method to extract and print only the hyperlink targets.
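
A minimal sketch of that loop, run on a made-up snippet (`sample_html` is illustrative) with the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the last anchor has no href attribute.
sample_html = """
<a href="mailto:someone@example.com">Email</a>
<a href="/results/summary">Summary</a>
<a name="team"></a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

hrefs = []
for link in sample_soup.find_all("a"):
    href = link.get("href")  # returns None when the attribute is missing
    if href is not None:
        hrefs.append(href)
        print(href)
```

Using `link.get("href")` rather than `link["href"]` matters here: the subscript form raises a KeyError for anchors that have no href.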

The links can be grouped into email addresses, external links, and internal links, as shown below.

In [7]:
from urllib.parse import urljoin

# Group every link on the page by type
mail_address = []
external_link = []
internal_link = []
for tag in soup.find_all("a"):
    href = tag.get("href")
    if not href:
        continue  # anchors without an href, e.g. <a name="team">
    if href.startswith("mailto:"):
        mail_address.append(href)
    elif "www" in href or ".com" in href:
        # Rough heuristic: any link containing "www" or ".com" counts as
        # external, including absolute links back to the same site
        external_link.append(href)
    else:
        # urljoin resolves "#fragment" and "/path" links against the page URL
        # without producing a double slash
        internal_link.append(urljoin(url, href))

print(mail_address)
print(external_link)
print(internal_link)

['mailto:timing@hubertiming.com']
['https://www.hubertiming.com/', 'https://www.hubertiming.com/', 'https://facebook.com/hubertiming/']
['http://www.hubertiming.com/results/2017GPTR', 'http://www.hubertiming.com/results/summary/2017GPTR10K', 'http://www.hubertiming.com/results/2017GPTR10K#team', 'http://www.hubertiming.com/results/2017GPTR10K#individual', 'http://www.hubertiming.com/results/2017GPTR10K#tabs-1']
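
Matching on "www" or ".com" labels absolute links back to the same site as external. A stricter alternative, sketched here under the assumption that "internal" means "same host as the page", compares hostnames after resolving each link. `classify_link` is a hypothetical helper, not part of the notebook above.

```python
from urllib.parse import urljoin, urlsplit


def classify_link(href, page_url):
    """Return a (kind, absolute_url) pair; kind is "mail", "internal" or "external"."""
    if href.startswith("mailto:"):
        return "mail", href
    absolute = urljoin(page_url, href)  # resolves "/path" and "#fragment" links
    same_host = urlsplit(absolute).netloc == urlsplit(page_url).netloc
    return ("internal" if same_host else "external"), absolute


page = "http://www.hubertiming.com/results/2017GPTR10K"
for href in ["/results/2017GPTR", "#team", "https://facebook.com/hubertiming/"]:
    print(classify_link(href, page))
```

With this rule, a link to https://www.hubertiming.com/ is classified as internal because its host matches the page's host, even though the scheme differs.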


## Working with Tables
To extract only the table rows, pass 'tr' to soup.find_all().

In [8]:
# Print the first 5 rows for sanity check
rows = soup.find_all('tr')
print(rows[:5])

[<tr colspan="2"><b>10K:</b></tr>, <tr><td>Finishers:</td><td>577</td></tr>, <tr><td>Male:</td><td>414</td></tr>, <tr><td>Female:</td><td>163</td></tr>, <tr>
<td>Award</td>
<td>Name</td>
<td>Combined Time</td>
<td>1</td><td>2</td><td>3</td><td>4</td></tr>]
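
Each row is still a tag object, so the cell text has to be pulled out separately. The sketch below rebuilds the three summary rows from the output above as an inline snippet and converts them to a dictionary; the `summary` name and the use of "html.parser" are choices made for this example only.

```python
from bs4 import BeautifulSoup

# Summary rows copied from the printed output, wrapped in a table for parsing.
sample_html = """
<table>
  <tr><td>Finishers:</td><td>577</td></tr>
  <tr><td>Male:</td><td>414</td></tr>
  <tr><td>Female:</td><td>163</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

summary = {}
for row in sample_soup.find_all("tr"):
    # strip=True trims surrounding whitespace from each cell's text
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) == 2:  # only label/value rows
        label, value = cells
        summary[label.rstrip(":")] = int(value)

print(summary)
```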


In [9]:
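# Re-fetch the page and select the individual results table by its id
url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')  # name the parser explicitly to avoid a warning
table = soup.find("table", id='individualResults')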
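
find() returns None when nothing matches, so it can help to check the result before using it. A small sketch on made-up markup follows; the "summary" and "noSuchTable" ids and the runner data are hypothetical, and "html.parser" is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only one has the id we want.
sample_html = """
<table id="summary"><tr><td>Finishers:</td><td>577</td></tr></table>
<table id="individualResults">
  <tr><th>Place</th><th>Name</th></tr>
  <tr><td>1</td><td>Runner A</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

results_table = sample_soup.find("table", id="individualResults")
missing_table = sample_soup.find("table", id="noSuchTable")  # no match, so None

if results_table is None:
    raise ValueError("results table not found; the page layout may have changed")

# Searching from the table tag only sees rows inside that table.
row_count = len(results_table.find_all("tr"))
print(row_count)
```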

In [10]:
import csv

# Get the header of the table
table_header = [th.get_text() for th in table.find_all('th')]

# Get the data rows; rows without <td> cells (such as the header row) are skipped
output_rows = []
for table_row in table.find_all('tr'):
    output_row = [td.get_text() for td in table_row.find_all('td')]
    if output_row:
        output_rows.append(output_row)

# Save the table as CSV; newline='' prevents blank lines between rows on Windows
with open('output.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(table_header)
    writer.writerows(output_rows)
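
As a quick sanity check, the CSV can be read back to confirm that the header and rows line up. This sketch uses made-up rows and a temporary directory instead of the scraped table, so the names and times below are illustrative only.

```python
import csv
import os
import tempfile

# Hypothetical header and rows standing in for the scraped table.
header = ["Place", "Name", "Time"]
rows = [["1", "Runner A", "36:21"], ["2", "Runner B", "36:42"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "output.csv")

    # Write the header followed by the data rows.
    with open(csv_path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(header)
        writer.writerows(rows)

    # Read the file back; DictReader uses the first row as the field names.
    with open(csv_path, newline="", encoding="utf-8") as csvfile:
        records = list(csv.DictReader(csvfile))

print(len(records), records[0]["Name"])
```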