# Final Year Project - Times of Malta Scraper Section

## Overview

In this section, a dataset is created by scraping article images and their corresponding information from the Times of Malta website. The code iterates through the articles returned when searching with the keyword 'inset', as well as through several listing sections of the site, and opens them one by one. For each article it opens, the following information is retrieved:

- URL
- Article Title
- Author
- Date Published
- Categories
- Images
- Image Captions

This information is stored in two CSV files:

- Article_Information.csv - This CSV file stores the URL, Article Title, Author, Date Published, Categories and the names of the images in the article.
- Image_Information.csv - This CSV file stores the URL, Image Name and Image Caption, as well as an attribute called 'Inset'. This attribute records whether the corresponding image is an inset image or not.

All information is retrieved by accessing the respective HTML elements. The only exception is the 'Inset' attribute, which is set to True whenever the word 'inset' is found in the corresponding image caption.
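
The 'Inset' rule described above can be sketched as a small helper. This is only an illustration: the name `isInsetCaption` is hypothetical, the captions are made up, and the check is case-insensitive, which is slightly broader than testing for 'inset' and 'Inset' only.

```python
def isInsetCaption(caption):
    """Return True when the caption contains the text 'inset' in any casing."""
    # Treat a missing caption (None) as an empty string
    return "inset" in (caption or "").lower()


# Made-up captions, used purely for illustration
print(isInsetCaption("Inset: the accused leaving court"))  # True
print(isInsetCaption("A view of Valletta"))                # False
```

Because this is a plain substring test, a caption containing a longer word that includes 'inset' would also be flagged.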

## Installing / Importing Packages

The following packages are required for the notebook to work.

In [1]:
#Installing and Importing Packages
import os
import requests
import selenium
import pandas as pd
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from datetime import datetime
from datetime import date as dateToday
from datetime import timedelta

## Creating WebScraper Class

This class is used to define paths which will be used throughout the rest of the code.

In [2]:
#Defining class called WebScraper which has a path to the ChromeDriver executable file, 
#the folder in which images will be saved and the path where the CSV files will be saved.
class WebScraper:
    def __init__(self, folderName = 'data'):
        #Raw string so that the backslashes in the Windows path are not treated as escape sequences
        self.CHROME_DRIVER_PATH = r".\chromedriver_win32\chromedriver.exe"
        self.NEWS_IMG_PATH = os.path.join(folderName, 'img')
        self.NEWS_PATH = folderName
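
The notebook does not show whether the dataset folder and its `img` subfolder exist before scraping starts, and neither `open(..., 'wb')` nor `DataFrame.to_csv` creates missing directories. A minimal sketch of creating them up front, assuming a hypothetical helper named `ensureOutputFolders` that is not part of the original class:

```python
import os
import tempfile


def ensureOutputFolders(folderName):
    """Create the dataset folder and its 'img' subfolder if missing; return both paths."""
    imgPath = os.path.join(folderName, 'img')
    # makedirs also creates intermediate folders; exist_ok avoids an error on reruns
    os.makedirs(imgPath, exist_ok=True)
    return folderName, imgPath


# Usage with a throwaway directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    newsPath, imgPath = ensureOutputFolders(os.path.join(tmp, 'TOM_Dataset_Example'))
    print(os.path.isdir(imgPath))  # True
```

A call like this could be placed in `WebScraper.__init__`, using the same `folderName` argument.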

## Creating Function to get Image Extension

This function checks the image extension type and returns it. If the link does not end in a recognised extension, an empty string is returned.

In [3]:
def getImageExtension(imageLink):
    if imageLink.endswith('.jpeg'):
        return '.jpeg'
    elif imageLink.endswith('.jpg'):
        return '.jpg'
    elif imageLink.endswith('.png'):
        return '.png'
    else: return ""
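
The function above matches on the end of the whole link, so a link that carries a query string or an upper-case extension would fall through to the empty string. Whether the Times of Malta image links ever look like that is not established here. As a hedged alternative, the sketch below parses only the URL path; the name `getImageExtensionFromUrl` and the example URL are hypothetical.

```python
import os
from urllib.parse import urlparse


def getImageExtensionFromUrl(imageLink):
    """Return '.jpeg', '.jpg' or '.png' based on the URL path, else an empty string."""
    # Look only at the path so that query strings such as '?w=800' are ignored
    path = urlparse(imageLink).path
    extension = os.path.splitext(path)[1].lower()
    return extension if extension in ('.jpeg', '.jpg', '.png') else ""


print(getImageExtensionFromUrl('https://example.com/media/photo.JPG?w=800'))  # .jpg
```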

## Creating Function to Scrape Website and Save Information in CSV Files

In this section, an instance of the WebScraper class defined earlier is created and the Chrome driver is configured to connect to the Times of Malta website. The information specified above is retrieved and saved in the CSV files, and all images are saved in the img folder. Missing elements, intercepted clicks and other exceptions are handled so that the scraper can skip the affected article or page and continue.

In [4]:
def scrapingFunction(folderName, website, numberOfArticles):

    #Create an instance of the WebScraper Class
    webScraper = WebScraper(folderName = folderName)

    #Getting Chrome options and disabling Chrome Logging Messages
    options = webdriver.ChromeOptions()
    options.add_experimental_option('excludeSwitches', ['enable-logging'])

    #Create an instance of a Service Object
    service = Service(executable_path = webScraper.CHROME_DRIVER_PATH)

    #Create an instance of a driver used to control Chrome
    driver = webdriver.Chrome(service = service, options = options)

    #Opening the website from which content will be scraped
    driver.get(website)

    #Create lists of data to be stored
    articleData = []
    imageData = []

    #Create counters
    count = 0
    articleIndex = 0
    pageIndex = 0

    #Check for Cookies Consent button, if it is displayed, click the Consent Button
    try:
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='fc-consent-root']//button[@aria-label='Consent']//p[@class='fc-button-label' and text()='Consent']"))).click()
    except Exception:
        #No consent button was shown within 20 seconds, so continue without clicking
        pass
 
    #Loop for n amount of times, where n is the number of articles to be scraped
    for i in range(numberOfArticles):
        try: 
            myobj = datetime.now()
            today = dateToday.today()
            #Click on the Article
            driver.find_element(By.XPATH,f'//*[@id="listing-articles"]/div[{str(2+articleIndex)}]/a').click()
            #Increment the Article Index
            articleIndex += 1

            #Wait for the contents to load
            sleep(1)
            #Get the current URL
            url = driver.current_url
            #Get the Article Title
            title = driver.find_element(By.XPATH,'//*[@id="article-head"]/div/h1').text
            #Get the Article Date of Publication
            date = driver.find_element(By.CLASS_NAME,'wi-WidgetMeta-time').text

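            #Convert relative timestamps (text ending in "ago") to an absolute date, treating the leading number as hours
            if date.endswith("ago"):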
                if int(myobj.hour) - int(date[0:2]) >= 0:
                    date = today.strftime("%d-%b-%Y")
                else:
                    date = (today - timedelta(days = 1)).strftime("%d-%b-%Y")

            #Get the Category Names and replace any new lines with commas as delimiters
            categoryNames = driver.find_elements(By.XPATH, '//*[@id="article-head"]/div/div')[0].text
            categories = categoryNames.replace('\n', ',')
            
            #Get Article Thumbnail and Images
            imageLinks = driver.find_elements(By.XPATH,'//*[@id="observer"]/main/article/div[2]/div/*/img') + driver.find_elements(By.XPATH,'//*[@id="article-head"]/div/picture/img')
            
            #Create images and author variables as empty strings
            images = ""
            author = ""

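            #Get author name, dropping the first two characters of the displayed text
            try:
                author = driver.find_element(By.CLASS_NAME,'wi-WidgetMeta-author').text[2:]
            except Exception:
                #No author element was found for this article
                author = "N/A"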

            #Write images to disk
            for imageLink in imageLinks:
                #Get Image Source
                imageSource = imageLink.get_attribute('src')
                #Get Image Caption (empty string if the image has no alt text)
                caption = imageLink.get_attribute('alt') or ""
                #Get Image Extension
                imageExtension = getImageExtension(imageSource)
                #Create Image Name by appending the extension to the zero-padded count value
                imageName = f'img{str(count).zfill(5)}' + imageExtension
                #Append Image Name to the comma-separated string of image names
                images += imageName + ','
                #Download Image, giving up after 30 seconds rather than waiting indefinitely
                img_data = requests.get(imageSource, timeout=30).content
                #Set the Inset attribute to True if 'inset' or 'Inset' is in the caption, else False
                insetBool = "inset" in caption or "Inset" in caption
                #Append a list containing the URL, Image Name, Caption and Inset Boolean variable to the imageData list
                imageData.append([url, imageName, caption, insetBool])
                #Creating a dataframe using imageData and saving it in a CSV file named Image_Information.csv
                pd.DataFrame(columns=['URL', 'Image Name', 'Caption', 'Inset'], data=imageData).to_csv(os.path.join(webScraper.NEWS_PATH,'Image_Information.csv'), index=False)
                #Incrementing the image counter
                count += 1

                #Save the images
                with open(os.path.join(webScraper.NEWS_IMG_PATH, imageName),'wb') as file:
                    file.write(img_data)

            #Append a list containing the URL, Article Title, Author, Article Date of Publication, Categories and Image Names to the articleData list
            articleData.append([url, title, author, date, categories, images[:-1]])
            #Creating a dataframe using articleData and saving it in a CSV file named Article_Information.csv
            pd.DataFrame(columns=['URL', 'Article Name', 'Author', 'Date', 'Categories', 'Images'], data=articleData).to_csv(os.path.join(webScraper.NEWS_PATH,'Article_Information.csv'), index=False)

            #Go back to previous page
            driver.back()

        #Exception where element is not found
        except selenium.common.exceptions.NoSuchElementException as e:
            #Wait for the contents to load
            sleep(1.5)

            try: 
                #Go to the Next Page
                driver.find_element(By.XPATH,f'//*[@id="observer"]/main/div/div[2]/div/span[{str(2+pageIndex)}]').click()
            except Exception:
                #Display Error Message and go back to Google Home Page
                print(f"ERROR: Reloading page {pageIndex+2}")
                driver.get("https://google.com")

                #Wait for the contents to load
                sleep(2)

                #Increment Page Index
                pageIndex += 1
                #Load the website again, skipping the previous page
                driver.get(website.split("page")[0] + f'page={str(pageIndex+2)}')
                #Reset articleIndex
                articleIndex = 0
                
                continue

            #Increment pageIndex
            pageIndex += 1
            #Reset articleIndex
            articleIndex = 0
        
        #Exception where element could not be clicked
        except selenium.common.exceptions.ElementClickInterceptedException as e:
            #Display Error Message
            print('ERROR: Click Intercepted - Skipping')
            #Increment articleIndex
            articleIndex += 1

            continue
        
        #Exception for other cases
        except Exception as e:
            #Display Error Message
            print(str(e))

            #Increment articleIndex
            articleIndex += 1
            #Increment pageIndex
            pageIndex += 1
            
            continue

    #Close the browser once all requested articles have been processed
    driver.quit()
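
The function above handles relative timestamps by comparing the current hour with the first two characters of the date text, which treats that number as hours. A hedged alternative is to subtract the stated amount from the current time. The sketch below assumes the site shows text of the form "5 hours ago" or "12 minutes ago"; that format is inferred from the code rather than confirmed, and `resolvePublicationDate` is a hypothetical name.

```python
import re
from datetime import datetime, timedelta


def resolvePublicationDate(dateText, now):
    """Convert text such as '5 hours ago' to 'dd-Mon-YYYY'; return other text unchanged."""
    # Assumed format: '<number> <minute|hour|day>[s] ago'
    match = re.match(r'(\d+)\s+(minute|hour|day)s?\s+ago$', dateText.strip())
    if match is None:
        return dateText
    amount, unit = int(match.group(1)), match.group(2)
    # Subtracting a timedelta handles the roll-over past midnight
    published = now - timedelta(**{unit + 's': amount})
    return published.strftime("%d-%b-%Y")


# At 03:00, an article published 5 hours ago belongs to the previous day
now = datetime(2023, 8, 10, 3, 0)
print(resolvePublicationDate('5 hours ago', now))
print(resolvePublicationDate('12 minutes ago', now))
```

Passing `now` as an argument keeps the helper easy to test; inside the scraper it would be `datetime.now()`.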

## Calling Scraping Functions

In this section, we call scrapingFunction() and pass the folder in which the CSV files and images will be saved, the website to scrape from and the number of articles to scrape. Articles are scraped from the list of results returned when querying the 'inset' keyword on the Times of Malta website, as well as from its National, World, Opinion, Sport and Business sections.
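
Every URL passed to scrapingFunction() ends in `page=1`, and the function rebuilds that suffix when it has to skip a page by splitting the URL on the text "page". A slightly more defensive sketch, assuming the page number is always the final `page=<n>` in the URL, is shown here; `buildPageUrl` is a hypothetical helper, not part of the notebook.

```python
import re


def buildPageUrl(website, pageNumber):
    """Replace the trailing 'page=<n>' of a listing URL with the requested page number."""
    # Only the final 'page=<digits>' is rewritten, so 'page' elsewhere in the URL is untouched
    return re.sub(r'page=\d+$', f'page={pageNumber}', website)


print(buildPageUrl('https://timesofmalta.com/articles/listing/national/page=1', 3))
# https://timesofmalta.com/articles/listing/national/page=3
```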

In [5]:
scrapingFunction('.\\TOM_Dataset\\TOM_Dataset_Inset', 'https://timesofmalta.com/search?keywords=inset&author=0&tags=0&sort=date&order=desc&fields%5B0%5D=title&fields%5B1%5D=body&page=1', 100)
scrapingFunction('.\\TOM_Dataset\\TOM_Dataset_National', 'https://timesofmalta.com/articles/listing/national/page=1', 200)
scrapingFunction('.\\TOM_Dataset\\TOM_Dataset_World', 'https://timesofmalta.com/articles/listing/world/page=1', 200)
scrapingFunction('.\\TOM_Dataset\\TOM_Dataset_Opinion', 'https://timesofmalta.com/articles/listing/opinion/page=1', 200)
scrapingFunction('.\\TOM_Dataset\\TOM_Dataset_Sport', 'https://timesofmalta.com/articles/listing/sport/page=1', 200)
scrapingFunction('.\\TOM_Dataset\\TOM_Dataset_Business', 'https://timesofmalta.com/articles/listing/business/page=1', 200)

ERROR: Click Intercepted - Skipping
ERROR: Reloading page 3
ERROR: Click Intercepted - Skipping
ERROR: Reloading page 12
ERROR: Click Intercepted - Skipping
ERROR: Reloading page 8
ERROR: Reloading page 10
ERROR: Reloading page 11
ERROR: Reloading page 12
ERROR: Reloading page 13
ERROR: Click Intercepted - Skipping
ERROR: Reloading page 12
ERROR: Click Intercepted - Skipping
ERROR: Reloading page 12
ERROR: Reloading page 3
ERROR: Click Intercepted - Skipping
Message: unknown error: cannot determine loading status
from no such window
  (Session info: chrome=115.0.5790.171)
Stacktrace:
Backtrace:
	GetHandleVerifier [0x010FA813+48355]
	(No symbol) [0x0108C4B1]
	(No symbol) [0x00F95220]
	(No symbol) [0x00F888E2]
	(No symbol) [0x00F87138]
	(No symbol) [0x00F877AA]
	(No symbol) [0x00F908A9]
	(No symbol) [0x00F9C668]
	(No symbol) [0x00F9F566]
	(No symbol) [0x00F87BC3]
	(No symbol) [0x00F9C37A]
	(No symbol) [0x00FECB1F]
	(No symbol) [0x00FDA536]
	(No symbol) [0x00FB82DC]
	(No symbol) [0x00FB93