### Web Scraping using Python

In this tutorial, you'll learn how to extract data from the web and then manipulate and clean it using Python's pandas library.


Web scraping is the use of a program or algorithm to extract and process large amounts of data from the web. Whether you are a data scientist, an engineer, or anybody who analyzes large datasets, the ability to scrape data from the web is a useful skill to have. If you find data on the web and there is no direct way to download it, web scraping with Python lets you extract the data into a form that can be imported and analyzed.

#### Web Scraping using Beautiful Soup & Mechanize

To perform web scraping, import the libraries shown below. The mechanize module opens URLs and acts as a browser. The Beautiful Soup package extracts data from HTML files. It is imported as bs4, which stands for Beautiful Soup, version 4.

In [53]:
import mechanize, time, datetime, re
from bs4 import BeautifulSoup

import pandas as pd

After importing the necessary modules, specify the URL that serves the data and set the necessary defaults.

* The headers are standard request headers that a site expects from a browser
* The base keys are the query parameters used when searching the site, for example:
```
'd1':'2019-05-18',
's':'start_time',
'sd':'desc',
'l': 10,
't':'article',
'nsa':'eedition' 
```


In [47]:
base = "https://www.trinidadexpress.com/search/?"
headers = ("User-agent","Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")
basekeys = ['d1', 's', 'sd', 'l', 't', 'nsa']
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [headers]  # attach the headers to the browser so they are sent with each request

Based on how the site's search works, we create a function ```build_daily_search``` that dynamically builds a search URL from the given parameters.

In [48]:
def build_daily_search(keys = None, query = None):
    q_list = []
    for k in keys:
        q_list.append(str(k)+"="+str(query[k]))
    q = "&".join(q_list)
    return base+q
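
The string concatenation above does not escape special characters in the parameter values. As a minimal sketch of an alternative, assuming Python 3, the standard library's `urllib.parse.urlencode` can build the query string instead. The names `BASE` and `build_search_url` are hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

BASE = "https://www.trinidadexpress.com/search/?"

def build_search_url(query, keys=None):
    """Build a search URL from a dict of query parameters.

    `keys` optionally fixes the parameter order; by default the dict's own
    insertion order is used. urlencode percent-encodes the values.
    """
    keys = list(query) if keys is None else keys
    return BASE + urlencode([(k, query[k]) for k in keys])

# Hypothetical parameters, modeled on the ones listed earlier
example = {'d1': '2019-05-18', 's': 'start_time', 'l': 10}
print(build_search_url(example))
```

For plain values such as dates and numbers the result should match what ```build_daily_search``` produces; the difference only shows up when a value contains spaces or characters like `&`.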

The ```process_link``` function carries out a couple of operations:

```Python
# e.g. /news/local/charged-with-killing-wife-s-ex-boyfriend/article_0dba43a0-7b39-11e9-a62e-ef0d1c3758f4.html
# open the above URL and store the HTML content in the variable link
link = br.open("https://www.trinidadexpress.com"+url_link).read()
# BeautifulSoup parses the HTML page so it can be processed
soup = BeautifulSoup(link, 'html.parser')
# to find the title, author, date created and image URL, we inspect the content of the page
# and look for the HTML attributes that are linked to the values we want;
# we then use BeautifulSoup to find these values by their HTML attributes
# (encode/decode drops non-ASCII characters while keeping a text string)
title = soup.find('h1', {'class': "headline"}).text.encode('ascii', 'ignore').decode('ascii')
author = soup.select('span[itemprop="author"]')
dateCreated = soup.find('time', {'class': "asset-date text-muted"})
imgUrl = soup.find('div', {'class': 'image'})
```

Once we have the desired values, we store them in a ```data``` variable. The next step is to get the actual text content of the article. This is done in a similar fashion to the other values: we inspect the page for the HTML elements that hold the content we want, which in this case are the ```<p>``` paragraph tags.

We find all paragraph tags inside the article container and extract the text from each one. Once we have all the text, we add it to the ```data``` variable and return it.
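
To try these lookups without hitting the network, the sketch below runs the same kind of Beautiful Soup calls against a small inline HTML string. The markup in `SAMPLE_HTML` is made up to imitate the structure described above; it is an assumption, not a copy of the real site, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure described in the text
SAMPLE_HTML = """
<html><body>
  <h1 class="headline">Sample headline</h1>
  <span itemprop="author">Jane Doe</span>
  <time class="asset-date text-muted" datetime="2019-05-18T10:00:00-04:00">May 18</time>
  <div class="asset-content subscriber-premium">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Look up single elements by tag name and class attribute
title = soup.find('h1', {'class': 'headline'}).get_text(strip=True)
date_tag = soup.find('time', {'class': 'asset-date text-muted'})
date_created = date_tag['datetime'] if date_tag is not None else None

# select() takes a CSS selector and always returns a list
authors = soup.select('span[itemprop="author"]')
author = authors[0].get_text(strip=True) if authors else "undefined"

# Collect the text of every <p> inside the article container
body = soup.find('div', {'class': 'asset-content subscriber-premium'})
article_text = " ".join(p.get_text(strip=True) for p in body.find_all('p'))

print(title, author, date_created, article_text, sep=" | ")
```

Unlike the notebook code, this sketch joins paragraphs with a space so that sentences from adjacent paragraphs do not run together.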

In [54]:
def process_link(url_link):
    print("processing: ", url_link)
    try:
        link = br.open("https://www.trinidadexpress.com"+url_link).read()
        soup = BeautifulSoup(link, 'html.parser')
        # encode/decode drops non-ASCII characters while keeping a text string
        title = soup.find('h1', {'class': "headline"}).text.encode('ascii', 'ignore').decode('ascii')
        author = soup.select('span[itemprop="author"]')
        dateCreated = soup.find('time', {'class': "asset-date text-muted"})
        imgUrl = soup.find('div', {'class': 'image'})
        data = {
            'url':url_link,
            'title': re.sub(r'\W+ ','', title),
            'author':author[0].text.encode('ascii', 'ignore').decode('ascii') if len(author) > 0 else "undefined",
            # fall back to the current time when the page has no date element
            'dateCreated': dateCreated['datetime'] if dateCreated is not None\
                            else datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
            'imgUrl': imgUrl.find('img')['src'] if imgUrl is not None else "undefined"
        }
        # the article body is the set of <p> tags inside the content container
        article = soup.find('div', {'class':'asset-content subscriber-premium'}).find_all('p')
        article_text = ""
        for pgraph in article:
            article_text+=pgraph.text.encode('ascii', 'ignore').decode('ascii').replace('\n','')
        if imgUrl is not None:
            caption = soup.find('figcaption', {'class': "caption"})
            data['caption'] = re.sub(r'\W+ ','', caption.text.encode('ascii', 'ignore').decode('ascii')) \
            if caption is not None else "undefined"
        data['article_text'] = article_text
        return data
    except Exception as e:
        print(e)
        return {}
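
Because ```process_link``` returns an empty dictionary whenever an exception is caught, a list of results can contain empty entries. A minimal sketch of filtering them out before further processing follows; the helper name `drop_failed` and the sample records are hypothetical.

```python
def drop_failed(articles):
    """Return only the non-empty dicts from a list of scrape results."""
    return [a for a in articles if a]

# Hypothetical results: the empty dict stands for a page that failed to parse
results = [
    {'url': '/news/a.html', 'title': 'A'},
    {},
    {'url': '/news/b.html', 'title': 'B'},
]

kept = drop_failed(results)
print(len(kept), "of", len(results), "articles kept")
```

The filtered list can then be passed to `pd.DataFrame` in place of the raw list, which avoids rows in which every column is empty.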

The function above, when given a link, opens it and extracts the desired content. However, to dynamically collect all the links listed on the search results page, we need a separate function.

The ```get_page_links``` function takes care of that for us. Given the query parameters outlined above, we build the URL in the format the site expects and then, as before, we:

* Open the link ```open_url = br.open(url)```
* Parse the result for processing ```soup = BeautifulSoup(open_url.read(), 'html.parser')```
* Get the attribute that has the links we want ```content = soup.find(class_="results-container")```
* For each headline container in the results, extract the link and process it:
```Python
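# each search result headline is an <h3 class="tnt-headline"> wrapping a link
link_containers = content.find_all('h3', {'class': "tnt-headline"})
articles = []
for link in link_containers:
    l = link.find('a')['href']
    article = process_link(l)
    articles.append(article)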
```
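
The link extraction step can also be tried offline. The sketch below applies the same calls to an inline HTML string; `SAMPLE_RESULTS` is invented markup that assumes the results page has the structure described in the list above, and the `None` check is an addition that the notebook code does not have.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating a search results page
SAMPLE_RESULTS = """
<div class="results-container">
  <h3 class="tnt-headline"><a href="/news/local/first-story/article_1.html">First</a></h3>
  <h3 class="tnt-headline"><a href="/sports/second-story/article_2.html">Second</a></h3>
  <h3 class="tnt-headline">No link here</h3>
</div>
"""

soup = BeautifulSoup(SAMPLE_RESULTS, 'html.parser')
content = soup.find(class_="results-container")

hrefs = []
for container in content.find_all('h3', {'class': 'tnt-headline'}):
    anchor = container.find('a')
    # Skip headlines without a link instead of raising a TypeError
    if anchor is not None and anchor.has_attr('href'):
        hrefs.append(anchor['href'])

print(hrefs)
```

Each collected `href` is a site-relative path, which is why ```process_link``` prepends the site's domain before opening it.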

In [59]:
def get_page_links(query):
    print(query)
    q = query
    k = query.keys()
    url = build_daily_search(k, q)
    print(url)
    open_url = br.open(url)
    soup = BeautifulSoup(open_url.read(), 'html.parser')
    content = soup.find(class_="results-container")
    link_containers = content.find_all('h3', {'class': "tnt-headline"})
    articles = []
    for link in link_containers:
        l = link.find('a')['href']
        article = process_link(l)
        articles.append(article)
    return articles

In [62]:
qry = {
        'd1':'2019-05-01', #start date
        's':'start_time', #parameter name
        'sd':'desc', # order
        'l': 200, # number of links to display
        't':'article', # type of articles
        'nsa':'eedition' # parameter name
    }

In [63]:
end_data = get_page_links(qry)
df = pd.DataFrame(end_data)
df.to_csv(qry['d1']+'-'+str(qry['l'])+'.csv', sep='^')

{'nsa': 'eedition', 'l': 200, 's': 'start_time', 't': 'article', 'd1': '2019-05-01', 'sd': 'desc'}
https://www.trinidadexpress.com/search/?nsa=eedition&l=200&s=start_time&t=article&d1=2019-05-01&sd=desc
('processing: ', u'/sports/ocm-s-laurence-defends-aips-titles/article_dca5646a-7b4d-11e9-8bc7-5faa7cbfc555.html')
('processing: ', u'/news/local/charged-with-killing-wife-s-ex-boyfriend/article_0dba43a0-7b39-11e9-a62e-ef0d1c3758f4.html')
('processing: ', u'/news/local/year-old-gemma-killed-by-burglar-say-cops/article_7983d308-7b2f-11e9-a4ef-7f92e3e99d2c.html')
('processing: ', u'/news/local/coconut-vendor-rapist-makes-escape-by-sea/article_c86a4bbe-7b1b-11e9-98b2-3791636abee6.html')
('processing: ', u'/news/local/ganja-farmer-loses-crop-in-tobago/article_0e12af9c-7b0f-11e9-9baa-1b7e0b8cd068.html')
('processing: ', u'/news/local/sympathy-for-fugitives-gary-does-not-care/article_cd3f94c0-7b0a-11e9-9748-8768d6ceaed5.html')
('processing: ', u'/news/local/venezuelans-linked-ar-rifles/article

('processing: ', u'/news/local/government-to-refurbish-prison-fence-after-escape/article_db688e4e-790c-11e9-bc37-b788098c87a4.html')
('processing: ', u'/news/local/reward-for-fugitives/article_94b2f710-790b-11e9-9fe6-a7a63045c577.html')
('processing: ', u'/news/local/miss-the-molester/article_17def4a0-790b-11e9-9cea-cfb7cc27af19.html')
('processing: ', u'/business/local/imbert-up-to-m-from-tax-amnesty/article_be196ca8-7909-11e9-8252-d37adc654b08.html')
('processing: ', u'/opinion/editorials/celebrating-harold-hoyte/article_0fc3e99e-78fa-11e9-936a-978337a314a3.html')
('processing: ', u'/features/local/bonding-with-your-mother-in-law/article_30459816-78fa-11e9-8f38-9b2f7eecdd2c.html')
('processing: ', u'/opinion/letters/leave-comfort-zone-behind/article_66f20134-78f9-11e9-8ae4-5b9c50ee981e.html')
('processing: ', u'/opinion/columnists/adjusting-the-region-s-ties-with-europe/article_8219fba6-78f9-11e9-95f6-eb7c4fb99d5e.html')
('processing: ', u'/opinion/columnists/a-song-a-dance-a-passage