## Web scraping with BeautifulSoup in Python - Liberals Blog
- __Date__: July 25, 2020
- __Author__: Karim Khan
- __Description__: This notebook scrapes the Liberal Party of Canada's news website using a function that collects posts across multiple pages: either the x most recent posts or all posts published since a date t specified by the user.

### Import libraries

In [None]:
import requests
from requests import get
import datetime
from bs4 import BeautifulSoup as Soup
import pandas as pd
import numpy as np
import csv
print("Libraries have been imported.")

Libraries have been imported.


### Get top-level coverpage blogs for a single page

In [None]:
url = 'https://www.liberal.ca/blog/page/1/'
# Pass the header dict by keyword; a second positional argument is treated as query params
coverpage_blogs = get(url, headers={"Accept-Language": "en-US, en;q=0.5"}).content
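
One detail worth checking in the request above: the second positional argument of `requests.get` is `params`, so a header dict is normally passed with the `headers=` keyword. The following is a minimal offline sketch (no request is sent) that uses `requests.Request(...).prepare()` to compare the two; the names `lang`, `as_params` and `as_headers` are illustrative and not part of the notebook.

```python
import requests

url = 'https://www.liberal.ca/blog/page/1/'
lang = {"Accept-Language": "en-US, en;q=0.5"}  # illustrative name

# Given as params, the dict is encoded into the query string.
as_params = requests.Request('GET', url, params=lang).prepare()

# Given as headers, the URL is unchanged and the header is set.
as_headers = requests.Request('GET', url, headers=lang).prepare()

print(as_params.url)
print(as_headers.url)
print(as_headers.headers['Accept-Language'])
```

Nothing is fetched here; the prepared requests are only inspected, so the sketch can be run without touching the site.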

### Parse coverpage news with BS

In [None]:
coverpage_soup = Soup(coverpage_blogs, 'lxml')
# coverpage_soup

### Extract Element Tag containing essential information

In [None]:
coverpage_news = coverpage_soup.find_all('div', class_ = 'cell large-4 medium-6 small-12')
coverpage_news[0:2] #Display first 2 cards on coverpage 

[<div class="cell large-4 medium-6 small-12">
 <article class="post-listing-item">
 <a class="post-listing-item__link" href="https://www2.liberal.ca/fiscal-snapshot/">
 <div class="post-listing-item__featured-image" data-bg="https://s31184.pcdn.co/wp-content/uploads/sites/292/2020/07/fiscal-snapshot-1-EN-OG-1200x630-1-1024x538.jpg"></div>
 <div class="post-listing-item__card-content">
 <div class="post-listing-item__card-content-top">
 <h3 class="post-listing-item__title-text text--red">Snapshot: Supporting Canadians and stabilizing our economy</h3>
 <div class="post-listing-item__excerpt">
 <p>Throughout this pandemic, Justin Trudeau and the Liberal government have been working hard to make sure you and your family are safe and supported during this challenging time.</p>
 </div>
 </div>
 <div class="post-listing-item__card-content-bottom">
 <span class="button transparent-bg-button arrow-button">
 			Read More		</span>
 </div>
 </div>
 </a>
 </article>
 </div>, <div class="cell large-
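
The card structure printed above can be explored without a network connection. This is a minimal sketch, assuming a trimmed copy of the first card's markup (the excerpt text is shortened for illustration) and Python's built-in `html.parser` in place of `lxml`; `sample_html` and `card` are hypothetical names.

```python
from bs4 import BeautifulSoup as Soup

# Trimmed, illustrative copy of one card from the coverpage output.
sample_html = """
<div class="cell large-4 medium-6 small-12">
 <article class="post-listing-item">
  <a class="post-listing-item__link" href="https://www2.liberal.ca/fiscal-snapshot/">
   <h3 class="post-listing-item__title-text text--red">Snapshot: Supporting Canadians and stabilizing our economy</h3>
   <div class="post-listing-item__excerpt">
    <p>Shortened excerpt used only for this sketch.</p>
   </div>
  </a>
 </article>
</div>
"""

soup = Soup(sample_html, 'html.parser')

# Same lookup as the notebook: match the full class string of the card wrapper.
cards = soup.find_all('div', class_='cell large-4 medium-6 small-12')
card = cards[0]

title = card.h3.get_text().strip()    # first <h3> inside the card
subtitle = card.p.get_text().strip()  # first <p> inside the card
link = card.a['href']                 # href of the first <a> inside the card

print(title)
print(subtitle)
print(link)
```

Working from a saved snippet like this makes it easier to confirm which tag holds which field before looping over live pages.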

### Test "cleanliness" of BeautifulSoup scraping

In [None]:
print(coverpage_news[0].h3.get_text().strip()) #title
print(coverpage_news[0].p.get_text()) #subtitle
print(coverpage_news[0].a['href']) #link
url = 'https://www.liberal.ca/justin-trudeaus-address-to-parliament-on-anti-black-racism-in-canada/'#example article
article = get(url, headers = {"Accept-Language": "en-US, en;q=0.5"}).content
soup = Soup(article, 'lxml')
body = soup.find('div', class_='post-content-container').get_text().strip()
print(body) #body of example article

Snapshot: Supporting Canadians and stabilizing our economy
Throughout this pandemic, Justin Trudeau and the Liberal government have been working hard to make sure you and your family are safe and supported during this challenging time.
https://www2.liberal.ca/fiscal-snapshot/
Justin Trudeau’s address to Parliament on anti-Black racism in Canada									


								June 2, 2020								


										Share									



















Check against delivery
June 2, 2020
I rise today to address what so many people of colour live with every day.
Over the past few days, we’ve seen horrific reports of police violence against Black men and women south of the border.
But these are not isolated incidents or elsewhere problems.
Prejudice, discrimination, and violence is a lived reality for far too many people.
It is the result of systems which far too often condone, normalize, perpetrate, and perpetuate inequality and injustice against people of colour.
As a country, we are not concerned bystanders 

### Helper Function 1 - Get X Posts

In [None]:
# Helper Function 1 - Get X Posts
def getNumPosts(number_of_posts: int):
    
    #create a blog counter
    blog_count = 0
           
    url = 'https://www.liberal.ca/blog/'
    coverpage_blogs = get(url, headers={"Accept-Language": "en-US, en;q=0.5"}).content
    
    #Parse coverpage news with BS
    coverpage_soup = Soup(coverpage_blogs, 'lxml')
    
    # get total number of pages:
    pages = [i.text for i in coverpage_soup.find_all('a',class_='page') if 'https://www2.liberal.ca/category/blog/page/' in str(i)]
    total_pages = int(pages[-1])

    filename = 'Liberals_Num_Posts.csv' #File name to store the scraped data
    f = open(filename, 'w', encoding='utf-8')

    #Defining header for csv
    headers = 'Blog No.,Title,Subtitle,Date,Link,Content\n'  # no spaces, so column names match the DictWriter fieldnames
    f.write(headers)
    f.close()
    print("New CSV file opened successfully")
    
    for page in range(1, total_pages+1):

        # Make a get request
        url = ('https://www.liberal.ca/blog/page/' + str(page) + '/')
        
        coverpage_blogs = get(url, headers={"Accept-Language": "en-US, en;q=0.5"}).content

        #Parse coverpage news with BS
        coverpage_soup = Soup(coverpage_blogs, 'lxml')

        coverpage_news = coverpage_soup.find_all('div', class_ = 'cell large-4 medium-6 small-12')
               
        for n in range(0, 9):
            
            # Getting the link of the article
            try:
                link = coverpage_news[n].a['href']
                
            except:
                link = np.nan

            # Getting the title
            try:
                title = coverpage_news[n].h3.get_text().strip()

            except:
                title = np.nan

            # Getting the subtitle
            try:
                subtitle = coverpage_news[n].p.get_text().strip()

            except:
                subtitle = np.nan

            # Reading the content (it is divided in paragraphs)
            article = get(link, headers={"Accept-Language": "en-US, en;q=0.5"})
            article_content = article.content
            soup_article = Soup(article_content, 'lxml')

            # Getting the date
            try:
                date = soup_article.find('p', class_='single__date').get_text()

            except:
                date = np.nan

            # Getting the content

            try:
                body = soup_article.find('div', class_='post-content-container').get_text().strip()

            except:
                body = np.nan
                          
            blog_count += 1
            
            if blog_count > number_of_posts:
                break
                
            # Append data sequentially to CSV file after each loop
            # (newline='' lets the csv module control line endings)
            with open('Liberals_Num_Posts.csv', mode='a', encoding='utf-8', newline='') as csv_file:
                fieldnames = ['Blog No.','Title','Subtitle','Date', 'Link', 'Content']
                writer = csv.DictWriter(csv_file, fieldnames = fieldnames) 

                # Assign to each row
                writer.writerow({'Blog No.':blog_count,'Title':title,'Subtitle':subtitle, 'Date':date,'Link':link, 'Content':body})

                # print Success every time new row is added
                print('Blog {} written successfully'.format(blog_count))
                           
        if blog_count > number_of_posts:
            print("Provided number of posts reached")
            break
    
    return
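
The stopping rule in the function above (walk the pages in order and stop once `number_of_posts` posts have been handled) can also be expressed with a generator and `itertools.islice`. This is only a sketch of that idea on made-up data, not the notebook's implementation; `iter_cards` and `fake_pages` are hypothetical names.

```python
from itertools import islice

def iter_cards(pages):
    """Yield cards one at a time across pages (each page is a list of cards)."""
    for cards in pages:
        for card in cards:
            yield card

# Stand-in for the scraped coverpages: two pages of three cards each.
fake_pages = [['post 1', 'post 2', 'post 3'],
              ['post 4', 'post 5', 'post 6']]

# islice stops pulling from the generator after four cards,
# so later cards and pages are never visited.
first_four = list(islice(iter_cards(fake_pages), 4))
print(first_four)
```

In a real scraper the generator would fetch each page lazily, so stopping early would also avoid the extra requests.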

### Test Helper Function 1

In [None]:
# # Commented out since it will be demonstrated in the Main Function below.
# getNumPosts(5)
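
The helper writes the CSV header by hand and reopens the file for every row. One possible alternative, sketched here under the assumption that all rows can be written through a single open handle, lets `csv.DictWriter` emit the header itself. `write_rows` and the sample row are hypothetical, and an in-memory `io.StringIO` buffer stands in for the file.

```python
import csv
import io

FIELDNAMES = ['Blog No.', 'Title', 'Subtitle', 'Date', 'Link', 'Content']

def write_rows(csv_file, rows):
    """Write a header line followed by one CSV line per row dict."""
    writer = csv.DictWriter(csv_file, fieldnames=FIELDNAMES)
    writer.writeheader()  # header comes from FIELDNAMES, so it cannot drift
    for row in rows:
        writer.writerow(row)

# Made-up row; values containing commas are quoted automatically.
sample_rows = [{
    'Blog No.': 1,
    'Title': 'Hello, world',
    'Subtitle': 'Sub',
    'Date': 'June 2, 2020',
    'Link': 'https://example.com/post',
    'Content': 'Body text',
}]

buffer = io.StringIO(newline='')  # same newline handling as open(..., newline='')
write_rows(buffer, sample_rows)
print(buffer.getvalue())
```

With a real file, the equivalent call would be made inside `with open(filename, 'w', encoding='utf-8', newline='') as f:`.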

### Helper Function 2 - Get Posts since time t

In [None]:
# Helper Function 2 - Get Posts since time t
def getDatePosts():
    
    # Keep prompting until the date matches the expected format
    while True:
        date_since = input('Enter date to scrape posts from in [e.g. January 01, 2020]: ')
        try:
            # Convert the date from string to datetime format
            date_since_convert = datetime.datetime.strptime(date_since.strip(), '%B %d, %Y')
            break
        except ValueError:
            print("Incorrect format, please try again")
           
    #create a blog counter
    blog_count = 0
       
    # get total number of pages:
    url = 'https://www.liberal.ca/blog/'
    coverpage_blogs = get(url, headers={"Accept-Language": "en-US, en;q=0.5"}).content
    
    #Parse coverpage news with BS
    coverpage_soup = Soup(coverpage_blogs, 'lxml')
    pages = [i.text for i in coverpage_soup.find_all('a',class_='page') if 'https://www2.liberal.ca/category/blog/page/' in str(i)]
    total_pages = int(pages[-1])
    
    filename = 'Liberals_Dated_Posts.csv' #File name to store the scraped data
    f = open(filename, 'w', encoding='utf-8')

    #Defining header for csv
    headers = 'Blog No.,Title,Subtitle,Date,Link,Content\n'  # no spaces, so column names match the DictWriter fieldnames
    f.write(headers)
    f.close()
    print("New CSV file opened successfully")

    for page in range(1, total_pages+1):

        # Make a get request
        url = ('https://www.liberal.ca/blog/page/' + str(page) + '/')
        
        coverpage_blogs = get(url, headers={"Accept-Language": "en-US, en;q=0.5"}).content

        #Parse coverpage news with BS
        coverpage_soup = Soup(coverpage_blogs, 'lxml')

        coverpage_news = coverpage_soup.find_all('div', class_ = 'cell large-4 medium-6 small-12')
                        
        for n in range(0, 9):
            
            # Getting the link of the article
            try:
                link = coverpage_news[n].a['href']
                
            except:
                link = np.nan

            # Getting the title
            try:
                title = coverpage_news[n].h3.get_text().strip()
            
            except:
                title = np.nan

            # Getting the subtitle
            try:
                subtitle = coverpage_news[n].p.get_text().strip()
            
            except:
                subtitle = np.nan

            # Reading the content
            article = get(link, headers={"Accept-Language": "en-US, en;q=0.5"})
            article_content = article.content
            soup_article = Soup(article_content, 'lxml')

            # Getting the date
            try:
                date = soup_article.find('p', class_='single__date').get_text().strip()
                date_convert = datetime.datetime.strptime(date, '%B %d, %Y') # Converted date from string to datetime format
                
            except (AttributeError, ValueError):
                # Date missing or unparseable: record NaN and treat the post as in range
                date = np.nan
                date_convert = date_since_convert

            # Getting the content
            try:
                body = soup_article.find('div', class_='post-content-container').get_text().strip()
                
            except:
                body = np.nan

            blog_count += 1
            
            if date_convert < date_since_convert:
                break
                           
            # Append data sequentially to CSV file after each loop  

            with open('Liberals_Dated_Posts.csv', mode='a', encoding='utf-8') as csv_file:
                fieldnames = ['Blog No.','Title','Subtitle','Date', 'Link', 'Content']
                writer = csv.DictWriter(csv_file, fieldnames = fieldnames) 

                # Assign to each row
                writer.writerow({'Blog No.':blog_count,'Title':title,'Subtitle':subtitle, 'Date':date,'Link':link, 'Content':body})

            # print success message every time new row is added
            print('Blog {} written successfully'.format(blog_count))

        if date_convert < date_since_convert:
            print("Provided range of posts reached")
            break

    return

### Test Helper Function 2

In [None]:
# # Commented out since it will be demonstrated in the Main Function below.
# getDatePosts()
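
The date handling in the helper relies on `datetime.datetime.strptime` with the `'%B %d, %Y'` format and on comparing the resulting `datetime` objects. Here is a small offline sketch of that step; `parse_post_date` is a hypothetical helper that returns `None` instead of raising when the text does not match, and it assumes English month names (the default locale).

```python
import datetime

DATE_FORMAT = '%B %d, %Y'  # e.g. "June 2, 2020"

def parse_post_date(text):
    """Return a datetime for text like 'June 2, 2020', or None if it does not match."""
    try:
        return datetime.datetime.strptime(text.strip(), DATE_FORMAT)
    except (AttributeError, ValueError):
        # AttributeError: text was None; ValueError: wrong format
        return None

cutoff = parse_post_date('May 01, 2020')     # date entered by the user
post_date = parse_post_date('June 2, 2020')  # date scraped from an article

# A post is kept when it is on or after the cutoff date.
print(post_date >= cutoff)

# A string in another format is reported as None rather than raising.
print(parse_post_date('2020-06-02'))
```

Returning `None` leaves the caller free to decide whether an undated post should be skipped or kept.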

### Main Function - Gives User the option to either scrape specific number of posts or posts since a specific date

In [None]:
# Main Function - Including Helper Functions 1 and 2.
def WebScraping():
    user_input = str(input("Enter Your Choice: \n A) Scrape specific number of posts \n B) Scrape posts since a specific date \n "))

    if user_input == 'A':
        number_of_posts = int(input("Enter number of blogs to be scraped (from most recent backwards): "))
        # Progress messages are printed by the helper itself
        return getNumPosts(number_of_posts)

    elif user_input == 'B':
        # Progress messages are printed by the helper itself
        return getDatePosts()

    else:
        print("Invalid selection, please try again.")

### Test Main Function Works - X Posts

In [None]:
WebScraping()

Enter Your Choice: 
 A) Scrape specific number of posts 
 B) Scrape posts since a specific date 
 A
Enter number of blogs to be scraped (from most recent backwards): 10
New CSV file opened successfully
Blog 1 written successfully
Blog 2 written successfully
Blog 3 written successfully
Blog 4 written successfully
Blog 5 written successfully
Blog 6 written successfully
Blog 7 written successfully
Blog 8 written successfully
Blog 9 written successfully
Blog 10 written successfully
Provided number of posts reached


### Test Main Function Works - Get Posts since time t

In [None]:
WebScraping()

Enter Your Choice: 
 A) Scrape specific number of posts 
 B) Scrape posts since a specific date 
 B
Enter date to scrape posts from in [e.g. January 01, 2020]: May 01, 2020
New CSV file opened successfully
Blog 1 written successfully
Blog 2 written successfully
Blog 3 written successfully
Provided range of posts reached
