### Web Scraping using Python

In this tutorial, you'll learn how to extract data from the web and then manipulate and clean it using Python's Pandas library.


Web scraping is the use of a program or algorithm to extract and process large amounts of data from the web. Whether you are a data scientist, an engineer, or anyone who analyzes large datasets, the ability to scrape data from the web is a useful skill to have. When you find data on the web and there is no direct way to download it, web scraping with Python lets you extract that data into a useful form that can be imported.

#### Web Scraping using Beautiful Soup & urllib

To perform web scraping, import the libraries shown below. The urllib module (```urllib2``` in Python 2, ```urllib.request``` in Python 3) is used to open URLs and fetch their contents, much as a browser would. The Beautiful Soup package is used to extract data from HTML files. It is imported as ```bs4```, which stands for Beautiful Soup, version 4.

In [6]:
import time, datetime, re
import urllib2 as urllib
# import urllib.request # Python 3.X
from bs4 import BeautifulSoup

import pandas as pd

After importing the necessary modules, specify the URL containing the dataset and set the necessary defaults.

* The headers are standard HTTP header parameters that a browser sends with its requests
* The base keys are the query parameters used when searching the site, for example:
```
'd1':'2019-05-18',
's':'start_time',
'sd':'desc',
'l': 10,
't':'article',
'nsa':'eedition' 
```


In [7]:
base = "https://www.trinidadexpress.com/search/?"
basekeys = ['d1', 's', 'sd', 'l', 't', 'nsa']

In [8]:
# This function is only for Python 3.X
def get_url(url):
    with urllib.request.urlopen(url) as response:
        html = response.read()
    return html
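
The header defaults mentioned earlier are not defined anywhere in the notebook's code. As a rough sketch for Python 3, they could be attached to a request along these lines. The `DEFAULT_HEADERS` values and the `build_request` and `fetch` names are assumptions made for illustration, not part of the original notebook.

```python
import urllib.request

# Hypothetical defaults: illustrative header values, not taken from the original notebook
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)",
    "Accept": "text/html",
}

def build_request(url, headers=None):
    """Return a urllib Request carrying browser-like headers."""
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})
    return urllib.request.Request(url, headers=merged)

def fetch(url, timeout=10):
    """Open the URL with the default headers and return the body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Building the request does not touch the network; only fetch() does
req = build_request("https://www.trinidadexpress.com/search/?")
print(req.get_header("User-agent"))  # urllib stores header names capitalized
```

Keeping request construction separate from the network call makes it easy to inspect the headers before anything is sent.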

Based on how the site's search works, we create a function, ```build_daily_search```, that dynamically builds a search URL from the given parameters.

In [9]:
def build_daily_search(keys = None, query = None):
    q_list = []
    for k in keys:
        q_list.append(str(k)+"="+str(query[k]))
    q = "&".join(q_list)
    return base+q
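
For Python 3, the standard library's `urllib.parse.urlencode` can build the same kind of query string and also escapes special characters in the values. This is an alternative sketch rather than the notebook's own method, and the `build_search_url` name is hypothetical.

```python
from urllib.parse import urlencode

BASE = "https://www.trinidadexpress.com/search/?"

def build_search_url(query, keys=None, base=BASE):
    """Build a search URL from a dict of query parameters.

    If keys is given, only those parameters are used, in that order.
    """
    if keys is None:
        keys = list(query)
    pairs = [(k, query[k]) for k in keys]
    return base + urlencode(pairs)  # urlencode converts values with str() and escapes them

query = {'d1': '2019-05-20', 's': 'start_time', 'sd': 'desc',
         'l': 10, 't': 'article', 'nsa': 'eedition'}
print(build_search_url(query))
```

Passing an explicit `keys` list keeps the parameter order predictable, which plain string concatenation over `query.keys()` does not guarantee on older Python versions.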

The ```process_link``` function carries out several operations:

```Python
# e.g. /news/local/charged-with-killing-wife-s-ex-boyfriend/article_0dba43a0-7b39-11e9-a62e-ef0d1c3758f4.html
# Open the URL above and store the HTML content in the variable link
response = urllib.urlopen("https://www.trinidadexpress.com"+url_link)
link = response.read()
# BeautifulSoup parses the HTML page so it can be processed
soup = BeautifulSoup(link, 'html.parser')
# To find the title, author, creation date and image URL, we inspect the content of the page
# and look for the HTML attributes that are linked to the values we want.
# We then use BeautifulSoup to find these values by their HTML attributes
# (a dict such as {'class': 'headline'} names the attribute explicitly).
title = soup.find('h1', {'class': "headline"}).text.encode('ascii', 'ignore')
author = soup.select('span[itemprop="author"]')
dateCreated = soup.find('time', {'class': "asset-date text-muted"})
imgUrl = soup.find('div', {'class': 'image'})
```

Once we have the desired values, we store them in a ```data``` variable. The next step is to get the actual text content from the page. This is done in a similar fashion to the other values: we inspect the ```html``` to find the elements that hold the content we want, which in this case are the ```<p>``` paragraph tags.

We get all the paragraph tags and extract the text from each one. Once we have all the text, we add it to the ```data``` variable and return it.
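
To see these lookups in isolation, here is a minimal sketch that runs the same kind of queries against a small inline HTML string rather than a live page. The sample markup is invented for illustration and only mimics the class names used in this tutorial; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Invented sample markup that mimics the class names used in this tutorial
SAMPLE_HTML = """
<html><body>
  <h1 class="headline"> Sample headline </h1>
  <span itemprop="author">Jane Doe</span>
  <time class="asset-date text-muted" datetime="2019-05-20T00:00:00-04:00">May 20</time>
  <div class="asset-content subscriber-premium">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Passing a dict for attrs matches on the named attribute
title = soup.find('h1', {'class': 'headline'}).text.strip()
author = soup.select('span[itemprop="author"]')
date_created = soup.find('time', {'class': 'asset-date text-muted'})

# Collect the text of every <p> inside the article body
body = soup.find('div', {'class': 'asset-content subscriber-premium'})
article_text = " ".join(p.text.strip() for p in body.find_all('p'))

print(title)
print(author[0].text if author else "undefined")
print(date_created['datetime'] if date_created is not None else "undefined")
print(article_text)
```

Testing selectors against a saved or hand-written snippet like this avoids hitting the site repeatedly while the extraction logic is still being worked out.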

In [10]:
def process_link(url_link):
    print("processing: ", url_link)
    try:
#         link = get_url("https://www.trinidadexpress.com"+url_link) # Python3.X
        response = urllib.urlopen("https://www.trinidadexpress.com"+url_link)
        link = response.read()
        soup = BeautifulSoup(link, 'html.parser')
        title = soup.find('h1', {'class': "headline"}).text.encode('ascii', 'ignore')
        author = soup.select('span[itemprop="author"]')
        dateCreated = soup.find('time', {'class': "asset-date text-muted"})
        imgUrl = soup.find('div', {'class': 'image'})
        data = {
            'url':url_link,
            'title': re.sub('\W+ ','', title),
            'author':author[0].text.encode('ascii', 'ignore') if len(author) > 0 else "undefined",
            'dateCreated': dateCreated['datetime'] if dateCreated is not None\
                            else datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
            'imgUrl': imgUrl.find('img')['src'] if imgUrl is not None else "undefined"
        }
        article = soup.find('div', {'class':'asset-content subscriber-premium'}).find_all('p')
        article_text = ""
        for pgraph in article:
            article_text+=pgraph.text.encode('ascii', 'ignore').replace('\n','')
        if imgUrl is not None:
            caption = soup.find('figcaption', {'class': "caption"})
            data['caption'] = re.sub('\W+ ','', caption.text.encode('ascii', 'ignore')) \
            if caption is not None else "undefined"
        data['article_text'] = article_text
        return data
    except Exception as e:
        print(e)
        return {}
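
One caveat when porting this function to Python 3: `.encode('ascii', 'ignore')` returns `bytes` there, so the later `.replace('\n','')` and `re.sub` calls, which use `str` arguments, would raise a `TypeError`. A small helper along these lines could handle the cleanup in one place. `clean_text` is a hypothetical name, and collapsing whitespace is an assumption about the intended result, not the notebook's exact behaviour.

```python
import re

def clean_text(text):
    """Drop non-ASCII characters and collapse runs of whitespace into single spaces."""
    # encode/decode round trip keeps the result a str in Python 3
    ascii_only = text.encode('ascii', 'ignore').decode('ascii')
    return re.sub(r'\s+', ' ', ascii_only).strip()

print(clean_text("  Sando\u2019s waterfront\n project   begins "))
```

The same helper could then be applied to the title, author, caption, and each paragraph instead of repeating the encode and replace calls.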

We've created a function above that, given a link, opens it and extracts the desired content. However, to dynamically collect all the links listed on the main search page, we need a separate function.

The ```get_page_links``` function takes care of that for us. Given the query parameters outlined above, we build the URL in the format that the site expects and then, as before, we:

* Open the link 
```
response = urllib.urlopen(url)
open_url = response.read()
```
* Parse the result for processing ```soup = BeautifulSoup(open_url, 'html.parser')```
* Get the element that contains the links we want ```content = soup.find(class_="results-container")```
* And for each headline in that element, extract the link and process it:
```Python
articles = []
# each headline on the results page is an <h3 class="tnt-headline"> wrapping the link
link_containers = content.find_all('h3', {'class': "tnt-headline"})
for link in link_containers:
    l = link.find('a')['href']
    article = process_link(l)
    articles.append(article)
```

In [11]:
def get_page_links(query):
    print(query)
    q = query
    k = query.keys()
    url = build_daily_search(k, q)
    print(url)
    response = urllib.urlopen(url)
    open_url = response.read()
    #open_url = get_url(url) #Python 3.X
    soup = BeautifulSoup(open_url, 'html.parser')
    content = soup.find(class_="results-container")
    link_containers = content.find_all('h3', {'class': "tnt-headline"})
    articles = []
    for link in link_containers:
        l = link.find('a')['href']
        article = process_link(l)
        articles.append(article)
    return articles
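
The links found on the results page are site-relative paths, which `process_link` turns into full URLs by string concatenation. As a hedged alternative for Python 3, `urllib.parse.urljoin` handles both site-relative and already absolute `href` values. The `SITE` constant and the `absolute_url` name are introduced here for illustration only.

```python
from urllib.parse import urljoin

SITE = "https://www.trinidadexpress.com"

def absolute_url(href, site=SITE):
    """Resolve an href from the results page against the site root."""
    return urljoin(site, href)

# A site-relative path gains the scheme and host
print(absolute_url("/sports/local/getting-started/article_31bb254a-7c1e-11e9-b2ce-db2c9b90612a.html"))
# An absolute URL is returned unchanged
print(absolute_url("https://example.com/other.html"))
```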

In [12]:
qry = {
        'd1':'2019-05-20', #start date
        's':'start_time', #parameter name
        'sd':'desc', # order
        'l': 10, # number of links to display
        't':'article', # type of articles
        'nsa':'eedition' # parameter name
    }
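
Because `d1` holds the date, the same dictionary can be reused for several days by changing only that key. The sketch below generates one query per day with the standard `datetime` module. `daily_queries` is a hypothetical helper, and it assumes the site accepts dates in `YYYY-MM-DD` form as in the example above; how the site interprets `d1` beyond being a start date is not confirmed here.

```python
import datetime

def daily_queries(template, start, days):
    """Yield a copy of the query template for each of `days` consecutive dates."""
    first = datetime.datetime.strptime(start, '%Y-%m-%d').date()
    for offset in range(days):
        query = dict(template)  # copy so the template is left untouched
        query['d1'] = (first + datetime.timedelta(days=offset)).strftime('%Y-%m-%d')
        yield query

template = {'d1': '2019-05-20', 's': 'start_time', 'sd': 'desc',
            'l': 10, 't': 'article', 'nsa': 'eedition'}
for q in daily_queries(template, '2019-05-30', 3):
    print(q['d1'])
```

Each generated query could then be passed to `get_page_links`, ideally with a short `time.sleep` pause between requests to avoid overloading the site.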

In [13]:
end_data = get_page_links(qry)
df = pd.DataFrame(end_data)
df.to_csv(qry['d1']+'-'+str(qry['l'])+'.csv', sep='^')

{'nsa': 'eedition', 'l': 10, 's': 'start_time', 't': 'article', 'd1': '2019-05-20', 'sd': 'desc'}
https://www.trinidadexpress.com/search/?nsa=eedition&l=10&s=start_time&t=article&d1=2019-05-20&sd=desc
('processing: ', u'/business/local/sando-waterfront-project-begins/article_00e69a40-7c22-11e9-9e99-574806223954.html')
('processing: ', u'/sports/local/getting-started/article_31bb254a-7c1e-11e9-b2ce-db2c9b90612a.html')
('processing: ', u'/sports/local/club-sando-la-horquetta-win-ypl-titles/article_98140696-7c1d-11e9-a0b4-4fc3ac0bf4cd.html')
('processing: ', u'/sports/local/combat-gold-for-t-t/article_eba59f8c-7c1c-11e9-b647-e7d9ec57b764.html')
('processing: ', u'/sports/local/costello-ross-cop-caribbean-age-group-triathlon-titles/article_46461748-7c1b-11e9-9895-db55dee68deb.html')
('processing: ', u'/features/local/shastri-s-arrival/article_eebcc0e0-7c19-11e9-9ec3-c3e3a4f2161c.html')
('processing: ', u'/opinion/columnists/regional-foreign-policy-imperatives/article_25c8308e-7c19-11e9-b4b