# Lab04. Web Scraping

This notebook shows how to download web pages with the Python package `requests` and extract data from them with `BeautifulSoup`.

<a id='sec0'></a>
## Pre-Setup

`__future__` is a pseudo-module that programmers can use to enable new language features that are not compatible with the current interpreter. For example, in Python 2 the expression 11 over 4 (`11/4`) evaluates to `2`. If the module in which it is executed enables true division by executing `from __future__ import division`, the expression `11/4` evaluates to `2.75`. In Python 3, true division is already the default, so the import has no effect.

In [None]:
from __future__ import division
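
As a quick check of the behaviour described above, the following minimal sketch compares the two division operators. The variable names are only illustrative, and the comments assume Python 3 (or Python 2 with the `__future__` import already executed).

```python
# True division returns a float; floor division rounds down to a whole number.
true_quotient = 11 / 4    # true division
floor_quotient = 11 // 4  # floor division, the old Python 2 behaviour of 11/4

print(true_quotient)
print(floor_quotient)
```

Use `//` explicitly whenever the rounded-down result is what you want.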

<a id='sec1'></a>
# Webscraping intro

In order to scrape content from a website we first need to download the HTML contents of the website. This can be done with the Python library **requests** (with its `.get` method).

Then, to extract specific information from the website, we use the scraping tool **BeautifulSoup4** (`import bs4`). To extract information with BeautifulSoup, we first have to create a soup object from the HTML source code of the website.

In [None]:
import requests # The requests library is an 
# HTTP library for getting content and posting etc.

import bs4 as bs # BeautifulSoup4 is a Python library 
# for pulling data out of HTML and XML code.
# we can query markup languages for specific content

# Scraping a simple website

In [None]:
source = requests.get("http://www.comp.hkbu.edu.hk/~hugolee/") 
# a GET request will download the HTML webpage.

print(source) # If <Response [200]> then 
# the website has been downloaded successfully

**Different types of responses:**
Generally, a status code starting with 2 indicates success, while a status code starting with 4 or 5 indicates an error. Frequently receiving status codes such as 404 (Not Found), 403 (Forbidden), or 408 (Request Timeout) might indicate that you have been blocked.
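
The status code families can be sketched with the standard library alone. `describe_status` is a hypothetical helper written for this illustration, not part of `requests`; it only works for codes that `http.HTTPStatus` knows about.

```python
from http import HTTPStatus

def describe_status(code):
    """Hypothetical helper: label a standard HTTP status code by its family."""
    phrase = HTTPStatus(code).phrase  # standard reason phrase for the code
    if 200 <= code < 300:
        kind = 'success'
    elif 400 <= code < 500:
        kind = 'client error'
    elif 500 <= code < 600:
        kind = 'server error'
    else:
        kind = 'other'
    return '{} {} ({})'.format(code, phrase, kind)

# Print a label for a few common codes
for code in (200, 403, 404, 408, 503):
    print(describe_status(code))
```

With a `requests` response, the numeric code is available as `source.status_code`, and `source.raise_for_status()` raises an exception for 4xx and 5xx responses.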

In [None]:
print(source.content) # This is the HTML content of the website,
# as you can see it's quite hard to decipher

In [None]:
# Convert source.content to a beautifulsoup object 
# beautifulsoup can parse (extract specific information) HTML code

soup = bs.BeautifulSoup(source.content, features='html.parser') 
# we pass in the source content
# features specifies what type of code we are parsing, 
# here 'html.parser' specifies that we want beautiful soup to parse HTML code

In [None]:
print(soup) # looks a lot nicer!

In [None]:
print(soup.prettify()) 
# .prettify() method makes the HTML code more readable

Above we printed the HTML code of the website, parsed into a BeautifulSoup object.

### HTML tags
`<xxx> </xxx>`: these are HTML tags, which mark up the sections, styling, etc. of the website. For more info: 
https://www.w3schools.com/tags/ref_byfunc.asp

## **class and id:**

`class` and `id` are attributes of HTML tags. They are used as hooks to style certain elements and to identify sections or parts of the page.

- **id:** a unique identifier for a specific element (this often does not change)
- **class:** specifies a class of objects. Several elements in the HTML code can have the same class.

Full list of HTML tags: https://developer.mozilla.org/en-US/docs/Web/HTML/Element
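
The difference between `id` and `class` can be tried offline on a small HTML string. The snippet below is a minimal sketch: the HTML and its `id` and `class` values are made up for illustration and do not come from the lab website.

```python
import bs4 as bs

# Made-up HTML snippet, used only for illustration
html = '''
<div id="intro" class="section">Welcome</div>
<p class="note">First note</p>
<p class="note">Second note</p>
'''

soup = bs.BeautifulSoup(html, features='html.parser')

intro = soup.find(id='intro')         # an id should match at most one element
notes = soup.find_all(class_='note')  # a class can match several elements

intro_text = intro.text
note_texts = [note.text for note in notes]
print(intro_text)
print(note_texts)
```

The keyword is spelled `class_` because `class` is a reserved word in Python.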

### Suppose we want to extract content that is shown on the website

In [None]:
# Inside the <body> tag of the website is where all the main content is
print(soup.body)

In [None]:
print(soup.title) # Title of the website
print(soup.find('title')) # same as .title

In [None]:
# If we want to extract specific text
print(soup.find('p')) # will only return first <p> tag

In [None]:
print(soup.find('p').text) # extracts the string within the <p> tag, strips it of tag

In [None]:
for p in soup.find_all('p'): # print all text paragraphs on the webpage
    print(p.text)

In [None]:
# Extract links / urls
# Links in HTML are usually coded as <a href="url">
# where url is the link target

print(soup.a)

In [None]:
soup.a.get('href') 
# to get the link from href attribute

In [None]:
# if we want to list links and their text info

links = soup.find_all('a')

for l in links:
    print("Info about {}: ".format(l.text), \
          l.get('href')) 
# then we have extracted the link

### Find table:
Organized data on a website is usually stored in HTML tables built from `<table>`, `<tr>`, and `<td>` tags. Here we want to extract every table on the website.

In [None]:
# We can get the table
full_table = soup.find_all('table')

In [None]:
full_table

In [None]:
# A new row in an HTML table starts with <tr> tag
# A new column entry is defined by <td> tag
table_result = list()
for table in full_table:
    for row in table.find_all('tr'):
        row_cells = row.find_all('td') # find all table data
        row_entries = [cell.text for cell in row_cells]
        print(row_entries) 
        table_result.append(row_entries)
        # get all the table data into a list
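
The loop above collects only `<td>` cells, so a header row built from `<th>` cells comes back as an empty list. The following sketch, run on a made-up table rather than the lab website, shows one possible way to keep header cells as well by passing both tag names to `find_all`.

```python
import bs4 as bs

# Made-up table, used only for illustration
html = '''
<table>
  <tr><th>Lab</th><th>Topic</th></tr>
  <tr><td>Lab01</td><td>Python basics</td></tr>
  <tr><td>Lab02</td><td>Pandas</td></tr>
</table>
'''

soup = bs.BeautifulSoup(html, features='html.parser')

rows = []
for row in soup.find('table').find_all('tr'):
    cells = row.find_all(['th', 'td'])  # header cells and data cells
    rows.append([cell.get_text(strip=True) for cell in cells])

header, data = rows[0], rows[1:]
print(header)
print(data)
```

Treating the first row as the header is an assumption that holds for this toy table but not for every page.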

In [None]:
# Pandas can also grab tables from a website automatically

import pandas as pd

import html5lib
# pd.read_html needs an HTML parser such as html5lib: 
#!conda install --yes html5lib
dfs = pd.read_html('http://www.comp.hkbu.edu.hk/~hugolee/') 
# returns a list of all tables at url



In [None]:
len(dfs)

In [None]:
type(dfs[0])

In [None]:
print(type(dfs)) #list of tables
df = pd.concat(dfs,ignore_index=True)

In [None]:
# Looks so-so, but it has been stripped of line-break characters etc.
df

In [None]:
# Make it nicer

# Assign column names
df.columns=  ['Lab','Detailed Description']

# Assign week number
weeks = list()
for i in range(1,5):
    weeks = weeks+['Week{}'.format(i) for tmp in range(2)]
df['Week'] = weeks
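
The cell above hard-codes 4 weeks with 2 rows each, so it only works when the table has exactly 8 rows; pandas raises a `ValueError` when the list length does not match the number of rows. A minimal sketch of a more flexible variant follows. `week_labels` and its `labs_per_week` parameter are hypothetical names introduced here.

```python
def week_labels(n_rows, labs_per_week=2):
    """Hypothetical helper: build Week1, Week1, Week2, ... labels for n_rows rows."""
    return ['Week{}'.format(i // labs_per_week + 1) for i in range(n_rows)]

labels = week_labels(8)
print(labels)
```

In the notebook this could be used as `df['Week'] = week_labels(len(df))`, assuming each week really does occupy two consecutive rows.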

In [None]:
df.head(10)

In [None]:
# Set Week and Lab as the row index (a MultiIndex)
df = df.set_index(['Week','Lab'])

In [None]:
df.dropna().head(10)

In [None]:
# Export to excel
df.to_excel('labSchedule.xlsx')

<a id='sec3'></a>

## Scraping function to download files of any type from a website

Below is a function that takes a website URL and a file type, and downloads up to `max` files of that type from the website.

In [None]:
# Extended scraping function of any file format
import os # To interact with operating system and format file name
import shutil # To copy file object from python to disk
import requests
import bs4 as bs

def py_file_scraper(url, html_tag='img', source_tag='src', file_type='.jpg',max=-1):
    
    '''
    Function that scrapes a website for certain file formats.
    The files will be placed in a folder called "files" 
    in the working directory.
    
    url = the url we want to scrape from
    html_tag = the file tag (usually img for images or 
    a for file links)
    
    source_tag = the source tag for the file url 
    (usually src for images or href for files)
    
    file_type = .png, .jpg, .pdf, .csv, .xls etc.
    
    max = integer (max number of files to scrape, 
    if = -1 it will scrape all files)
    '''
    
    # make a directory called 'files' 
    # for the files if it does not exist
    if not os.path.exists('files/'):
        os.makedirs('files/')
    print('Loading content from the url...')
    source = requests.get(url).content
    print('Creating content soup...')
    soup = bs.BeautifulSoup(source,'html.parser')
    
    i=0
    print('Finding tag:%s...'%html_tag)
    for n, link in enumerate(soup.find_all(html_tag)):
        file_url=link.get(source_tag)
        print ('\n',n+1,'. File url',file_url)
        
        
        if file_url and 'http' in file_url: # skip tags without the attribute; keep only absolute links
            print('It is a valid url..')
            
            
            if file_type in file_url: #only check for specific 
                # file type
                
                print('%s FILE TYPE FOUND IN THE URL...'%file_type)
                file_name = os.path.splitext(os.path.basename(file_url))[0] + file_type 
                #extract file name from url

                file_source = requests.get(file_url, stream = True)
             
                # open new stream connection

                with open('./files/'+file_name, 'wb') as file: 
                    # open file connection, create file and 
                    # write to it
                    
                    shutil.copyfileobj(file_source.raw, file) 
                    # save the raw file object
                    
                    print('DOWNLOADED:',file_name)
                    
                    i+=1
                    
                del file_source # delete from memory
            else:
                print('%s file type NOT found in url:'%file_type)
                print('EXCLUDED:',file_url) 
                # urls not downloaded from
                
        if i == max:
            print('Max reached')
            break
            

    print('Done!')
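
The function above keeps only links that contain `http`, so relative links such as `files/report.pdf` are skipped, and it accepts any URL that contains the file type anywhere in the string. The sketch below shows one possible way to handle both points with the standard library. `resolve_file_url` is a hypothetical helper and the `example.com` URLs are placeholders.

```python
import os
from urllib.parse import urljoin, urlparse

def resolve_file_url(page_url, href, file_type):
    """Hypothetical helper: absolute URL if href points to file_type, else None."""
    if not href:  # the tag had no src/href attribute
        return None
    absolute = urljoin(page_url, href)  # resolve relative links against the page URL
    extension = os.path.splitext(urlparse(absolute).path)[1]
    if extension.lower() == file_type.lower():
        return absolute
    return None

page = 'http://example.com/data/index.html'  # placeholder page URL
print(resolve_file_url(page, 'files/report.pdf', '.pdf'))
print(resolve_file_url(page, 'http://example.com/a.pdf.html', '.pdf'))
print(resolve_file_url(page, None, '.pdf'))
```

Comparing the extension of the URL path, rather than searching the whole string, avoids downloading pages such as `a.pdf.html` by mistake.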

# Scrape funny cat pictures

In [None]:
py_file_scraper('https://funcatpictures.com/') 
# scrape cats

You can find the downloaded cat pictures in the `files` folder under the current working directory.

# Scrape PDFs from a website

In [None]:
py_file_scraper('---place the url here---',
                html_tag='a',source_tag='href',file_type='.pdf', \
                max=5)

# Scrape real data CSV files from websites

In [None]:
py_file_scraper('---place the url here---',
                html_tag='a', # R data sets
                source_tag='href', file_type='.csv',max=5)

# Exercise 1


In this exercise, you should extract live weather data from:

http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168

* Task: scrape
    * the period / day (e.g. Tonight, Friday, FridayNight)
    * the temperature for the period (as Low, High)
    * the long weather description (e.g. Partly cloudy, with a low around 49..)
    
Store the scraped data strings in a Pandas DataFrame.



**Hint:** The weather information is found in a div tag with `id='seven-day-forecast'`



In [None]:
import requests
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")

In [None]:
# BeautifulSoup4 is a Python library for pulling data out of HTML and XML code
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())

In [None]:
# Extract the name of the forecast item, the short description, and the temperature for the first day
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)

In [None]:
img = tonight.find("img")
desc = img['title']

print(desc)

In [None]:
# Use get_text method on each BeautifulSoup object
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

In [None]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

In [None]:
# Combining our data into a Pandas Dataframe
import pandas as pd
weather = pd.DataFrame({
        "period": periods, 
        "short_desc": short_descs, 
        "temp": temps, 
        "desc":descs
    })
weather

In [None]:
# We can use a regular expression and the Series.str.extract method to pull out the numeric temperature values
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
temp_nums
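
The same pattern can be tried without pandas or a network connection. The sketch below applies it with the standard `re` module to a few strings that merely imitate the format of the forecast page; they are not real scraped data.

```python
import re

# Illustrative strings in the style of the forecast page; not real scraped data
sample_temps = ['Low: 49 °F', 'High: 63 °F', 'Low: 50 °F']

# Same named group as in the cell above: the first run of digits in the string
pattern = re.compile(r'(?P<temp_num>\d+)')

temp_values = [int(pattern.search(text).group('temp_num')) for text in sample_temps]
mean_temp = sum(temp_values) / len(temp_values)

print(temp_values)
print(mean_temp)
```

Writing the pattern as a raw string (`r'...'`) keeps Python from treating `\d` as a string escape sequence.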

In [None]:
# Calculate mean temperature
weather["temp_num"].mean()

# Exercise 2


Start from https://en.wikipedia.org/wiki/Data_analysis and get all the article links on the Data_analysis page. Then iterate over this list of articles and repeat the same steps for each article link, recursively, stopping when the total number of collected records exceeds 3000. Each of these records should contain the article's own title and the title of another article linked from it. Save the result as a text file.
![ex2.png](attachment:ex2.png)

In [None]:
from bs4 import BeautifulSoup
import requests

start_url = 'https://en.wikipedia.org/wiki/Data_analysis'
domain = 'https://en.wikipedia.org'

''' get soup '''
def get_soup(url):
    # get contents from url
    content = requests.get(url).content
    # get soup
    return BeautifulSoup(content,'lxml') # choose lxml parser


''' return a list of links to other wiki articles '''
def extract_articles(url=start_url):
    # get soup
    soup = get_soup(url)
    # find all the paragraph tags
    p_tags = soup.find_all('p')
    # gather all <a> tags 
    a_tags = []
    for p_tag in p_tags:
        a_tags.extend(p_tag.find_all('a'))
    # filter the list : remove invalid links
    a_tags = [ a_tag for a_tag in a_tags if 'title' in a_tag.attrs and 'href' in a_tag.attrs ]
    # get all the article titles
    titles = [ a_tag.get('title') for a_tag in a_tags ] 
    # get all the article links
    links  = [ a_tag.get('href')  for a_tag in a_tags ] 
    # get own title
    self_title = soup.find('h1', {'class' : 'firstHeading'}).text
    return self_title, titles, links


''' main section '''
if __name__ == '__main__':
    # list of scraped items
    items = []
    title, ext_titles, ext_links = extract_articles(url=start_url)
    items.extend(zip([title]*len(ext_titles), ext_titles))
    for ext_link in ext_links:
        title, ext_titles, ext_links = extract_articles(domain + ext_link)
        items.extend(zip([title]*len(ext_titles), ext_titles))
        if len(items) > 3000:
            break
    # write to file
    with open('result.txt','w', encoding='utf-8') as f:
        for item in items:
            print(item[0] + '->' + item[1] + '\n')
            f.write(item[0] + '->' + item[1] + '\n')
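
The main section above follows links only one level deep from the start page. If that is not enough to reach the record limit, a breadth-first crawl with a queue and a visited set is one possible way to go deeper. The sketch below is an assumption-laden illustration, not the exercise's reference solution: `crawl_links` and `get_links` are hypothetical names, the crawl runs on a tiny made-up link graph instead of Wikipedia, and it stops at exactly `max_records` records rather than just past them.

```python
from collections import deque

def crawl_links(start, get_links, max_records=3000):
    """Breadth-first crawl that returns (source, target) pairs.

    get_links is a hypothetical callable: page -> list of linked pages.
    """
    records = []
    visited = {start}
    queue = deque([start])
    while queue and len(records) < max_records:
        current = queue.popleft()
        for target in get_links(current):
            records.append((current, target))
            if len(records) >= max_records:
                break
            if target not in visited:  # never queue the same page twice
                visited.add(target)
                queue.append(target)
    return records

# Tiny made-up link graph standing in for Wikipedia
toy_graph = {'A': ['B', 'C'], 'B': ['A', 'D'], 'C': ['D'], 'D': []}
pairs = crawl_links('A', lambda page: toy_graph.get(page, []), max_records=4)
print(pairs)
```

In the notebook, `get_links` would have to wrap `extract_articles` and work with URLs rather than titles; that adaptation is left open here.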