# TO DO:

### STEP 1
- Contents CSV 1 (Image Name, Caption, Inset Boolean)
- Contents CSV 2 (Article URL, Article Name, Categories, Date, List of Image Names)
- Draw ERD

### STEP 2
- Get most recent articles (example: past week or past 100 articles)
- Perform same functionality as in step 1
- Calculate statistics (To be discussed in further detail)

### STEP 3
- Research Image-in-Image and Picture-in-Picture Decomposition

# Final Year Project - Emoji Inset Creator Section

## Overview

In this section, a dataset is created by scraping article images and their corresponding information from the Times of Malta website. The code searches the website with the keyword 'inset' and opens each resulting article in turn. For every article opened, the following information is retrieved:

- URL
- Article Title
- Date Published
- Categories
- Images
- Image Captions

This information is stored in two CSV files:

- Article_Information.csv - This CSV file stores the URL, Article Title, Date Published, Categories and the names of the images in the article.
- Image_Information.csv - This CSV file stores the URL, Image Name and Image Caption, as well as an attribute called 'Inset', which records whether the corresponding image is an inset image.

All information is retrieved by accessing the respective HTML elements. The only exception is the 'Inset' attribute, which is set to True whenever the word 'inset' (or 'Inset') is found in the corresponding image caption.
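
The caption rule described above can be sketched as a small standalone function. This is only an illustration: the name `isInsetCaption` is hypothetical, and the check is case-insensitive, which is slightly broader than matching 'inset' and 'Inset' alone.

```python
def isInsetCaption(caption):
    """Return True when the caption mentions 'inset' in any letter case."""
    # Assumption: a missing caption may arrive as None, so fall back to an empty string
    return 'inset' in (caption or '').lower()

# Made-up captions, for illustration only
print(isInsetCaption('The court building. Inset: the accused'))
print(isInsetCaption('A view of Valletta'))
```

Because this is a substring match, captions containing words such as 'insets' would also be flagged.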

## Installing / Importing Packages

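The following packages are required for the notebook to work. The os and time modules are part of the Python standard library, whereas requests, selenium and pandas must be installed beforehand.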

In [13]:
#Installing and Importing Packages
import os
import requests
import selenium
import pandas as pd
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service

## Creating WebScraper Class

This class defines the paths used throughout the rest of the code: the ChromeDriver executable, the folder in which images are saved and the folder in which the CSV files are saved.

In [14]:
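#Defining a class called WebScraper which stores the path to the ChromeDriver executable file,
#the folder in which images will be saved and the folder in which the CSV files will be saved.
class WebScraper:
    def __init__(self, folderName = 'data'):
        #Raw string so that the backslashes are not treated as escape sequences
        self.CHROME_DRIVER_PATH = r".\chromedriver_win32\chromedriver.exe"
        self.NEWS_IMG_PATH = os.path.join(folderName, 'img')
        self.NEWS_PATH = folderName
        #Create the output folders if they do not exist yet, so that saving files does not fail
        os.makedirs(self.NEWS_IMG_PATH, exist_ok = True)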

## Creating Function to get Image Extension

This function checks the extension of an image link and returns it. An empty string is returned when the link does not end in '.jpeg', '.jpg' or '.png'.

In [15]:
def getImageExtension(imageLink):
    if imageLink.endswith('.jpeg'):
        return '.jpeg'
    elif imageLink.endswith('.jpg'):
        return '.jpg'
    elif imageLink.endswith('.png'):
        return '.png'
    else: return ""
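
Image links sometimes carry a query string or an upper-case extension, in which case a plain `endswith` check returns an empty string. The following is a minimal sketch of a more tolerant variant, assuming such links occur on the website. The name `getImageExtensionFromUrl` is hypothetical and is not used elsewhere in this notebook.

```python
import os
from urllib.parse import urlparse

# Extensions recognised by the original getImageExtension function
KNOWN_EXTENSIONS = ('.jpeg', '.jpg', '.png')

def getImageExtensionFromUrl(imageLink):
    """Return the lower-cased image extension, ignoring any query string."""
    # Keep only the path part of the URL, dropping '?width=...' and similar
    path = urlparse(imageLink).path
    extension = os.path.splitext(path)[1].lower()
    return extension if extension in KNOWN_EXTENSIONS else ""

# Made-up link, for illustration only
print(getImageExtensionFromUrl('https://example.com/media/photo.JPG?width=800'))
```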

## Connecting Chrome Driver to Times of Malta Website

In this section, an instance of the WebScraper class defined earlier is created, Chrome is configured with logging messages disabled, and the driver opens the Times of Malta search page, using the keyword 'inset' to search through articles.

In [16]:
#Create an instance of the WebScraper Class
webScraper = WebScraper(folderName = '.\\times_of_malta')

#Getting Chrome options and disabling Chrome Logging Messages
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])

#Create an instance of a Service Object
service = Service(executable_path = webScraper.CHROME_DRIVER_PATH)

#Create an instance of a driver used to control Chrome
driver = webdriver.Chrome(service = service, options = options)

#Opening the website from which content will be scraped
driver.get('https://timesofmalta.com/search?keywords=inset&author=0&tags=0&sort=date&order=desc&fields%5B0%5D=title&fields%5B1%5D=body&page=1')
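
The search URL used above can also be assembled from its parts, which makes it easier to change the keyword or the page number. This is a minimal sketch using only the standard library. The name `buildSearchUrl` is hypothetical, and the meaning of each parameter is assumed from its name in the URL.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://timesofmalta.com/search'

def buildSearchUrl(keyword, page):
    """Build the search results URL for a keyword and a page number."""
    parameters = [
        ('keywords', keyword),
        ('author', 0),
        ('tags', 0),
        ('sort', 'date'),
        ('order', 'desc'),
        ('fields[0]', 'title'),
        ('fields[1]', 'body'),
        ('page', page),
    ]
    # urlencode percent-encodes the square brackets as %5B and %5D
    return SEARCH_URL + '?' + urlencode(parameters)

print(buildSearchUrl('inset', 1))
```

For the keyword 'inset' and page 1, this yields the same URL as the one passed to `driver.get` above.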

## Scraping Website and Saving Information in CSV Files

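In this section, the information specified above is retrieved and saved in the CSV files, whilst all images are saved in the img folder. Exceptions are handled as follows: when an article element is not found, the code moves on to the next results page; when a click is intercepted, the article is skipped; any other exception is printed before continuing.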

In [17]:
#Create lists of data to be stored
articleData = []
imageData = []

#Create counters
count = 0
articleIndex = 0
pageIndex = 0

#Check for the Cookies Consent button; if it is displayed, click it
try: 
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='fc-consent-root']//button[@aria-label='Consent']//p[@class='fc-button-label' and text()='Consent']"))).click()
except Exception: 
    #No clickable Consent button was found within 20 seconds, so continue without it
    pass

#Loop for n amount of times, where n is the number of articles to be scraped
for i in range(5):
    try: 

        #Click on the Article
        driver.find_element(By.XPATH,f'//*[@id="listing-articles"]/div[{str(2+articleIndex)}]/a').click()
        #Increment the Article Index
        articleIndex += 1

        #Wait for the contents to load
        sleep(1)
        #Get the current URL
        url = driver.current_url
        #Get the Article Title
        title = driver.find_element(By.XPATH,'//*[@id="article-head"]/div/h1').text
        #Get the Article Date of Publication
        date = driver.find_element(By.CLASS_NAME,'wi-WidgetMeta-time').text
        #Get the Category Names and replace any new lines with commas as delimiters
        categoryNames = driver.find_elements(By.XPATH, '//*[@id="article-head"]/div/div')[0].text
        categories = categoryNames.replace('\n', ',')
        
        #Get Article Thumbnail and Images
        imageLinks = [image for image in driver.find_elements(By.XPATH,'//*[@id="observer"]/main/article/div[2]/div/*/img') + driver.find_elements(By.XPATH,'//*[@id="article-head"]/div/picture/img')]
        
        #Create images and captions variables as empty strings
        captions = ""
        images = ""

        #Write images to disk
        for imageLink in imageLinks:
            #Get Image Source
            imageSource = imageLink.get_attribute('src')
            #Get Image Caption, falling back to an empty string if no alt text is returned
            caption = imageLink.get_attribute('alt') or ""
            #Get Image Extension
            imageExtension  = getImageExtension(imageSource)      
            #Create Image Name by appending the extension to the count value
            imageName = f'img{str(count).zfill(5)}' + imageExtension
            #Append Image Name to list
            images  += imageName + ','           
            #Download Image          
            img_data = requests.get(imageSource).content       
            #Check if 'inset' or 'Inset' is in the caption; the attribute is True if it is, else False
            insetBool = "inset" in caption or "Inset" in caption
            #Append a list containing the URL, Image Name, Caption and Inset Boolean variable to the imageData list
            imageData.append([url, imageName, caption, insetBool])
            #Creating a dataframe using imageData and saving it in a CSV file named Image_Information.csv
            pd.DataFrame(columns=['URL', 'Image Name', 'Caption', 'Inset'], data=imageData).to_csv(os.path.join(webScraper.NEWS_PATH,'Image_Information.csv'), index=False)
            #Incrementing the row counter
            count += 1

            #Save the images
            with open(os.path.join(webScraper.NEWS_IMG_PATH, imageName),'wb') as file:
                file.write(img_data)

        #Append a list containing the URL, Article Title, Article Date of Publication, Categories and Image Names to the articleData list
        articleData.append([url, title, date, categories, images[:-1]])
        #Creating a dataframe using articleData and saving it in a CSV file named Article_Information.csv
        pd.DataFrame(columns=['URL', 'Article Name', 'Date', 'Categories', 'Images'], data=articleData).to_csv(os.path.join(webScraper.NEWS_PATH,'Article_Information.csv'), index=False)

        #Go back to previous page
        driver.back()

    #Exception where element is not found
    except selenium.common.exceptions.NoSuchElementException as e:
        #Wait for the contents to load
        sleep(1.5)

        try: 
            #Go to the Next Page
            driver.find_element(By.XPATH,f'//*[@id="observer"]/main/div/div[2]/div/span[{str(2+pageIndex)}]').click()
        except: 
            #Display Error Message and navigate to the Google home page before reloading
            print(f"ERROR: Reloading page {pageIndex+2}")
            driver.get("https://google.com")

            #Wait for the contents to load
            sleep(2)

            #Increment Page Index
            pageIndex += 1
            #Load the website again, skipping the previous page
            driver.get(f'https://timesofmalta.com/search?keywords=inset&author=0&tags=0&sort=date&order=desc&fields%5B0%5D=title&fields%5B1%5D=body&page={str(pageIndex+2)}') #Skip page

            #Reset articleIndex
            articleIndex = 0
            
            continue

        #Increment pageIndex
        pageIndex += 1
        #Reset articleIndex
        articleIndex = 0
    
    #Exception where element could not be clicked
    except selenium.common.exceptions.ElementClickInterceptedException as e:
        #Display Error Message
        print('ERROR: Click Intercepted - Skipping')
        #Increment articleIndex
        articleIndex += 1

        continue
    
    #Exception for other cases
    except Exception as e:
        #Display Error Message
        print(str(e))

        #Increment articleIndex
        articleIndex += 1
        #Increment pageIndex
        pageIndex += 1
        
        continue
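
Once both CSV files have been written, the rows in Image_Information.csv can be linked back to their article through the URL column and the comma-separated Images column. The following is a minimal sketch using the standard csv module, with in-memory stand-ins for the two files and made-up rows. The name `linkImagesToArticles` is hypothetical.

```python
import csv
import io

def linkImagesToArticles(articleFile, imageFile):
    """Map each article URL to its title and the image rows that belong to it."""
    # Index the image rows by (URL, Image Name)
    imageRows = {(row['URL'], row['Image Name']): row for row in csv.DictReader(imageFile)}
    linked = {}
    for article in csv.DictReader(articleFile):
        # The Images column holds comma-separated image names (empty if none)
        names = [name for name in article['Images'].split(',') if name]
        linked[article['URL']] = {
            'title': article['Article Name'],
            'images': [imageRows[(article['URL'], name)] for name in names
                       if (article['URL'], name) in imageRows],
        }
    return linked

# In-memory stand-ins for the two CSV files, with made-up rows
articleCsv = io.StringIO(
    'URL,Article Name,Date,Categories,Images\n'
    'https://example.com/a,Example,1 January 2023,"National,Court","img00000.jpg,img00001.png"\n'
)
imageCsv = io.StringIO(
    'URL,Image Name,Caption,Inset\n'
    'https://example.com/a,img00000.jpg,"Main photo. Inset: the accused",True\n'
    'https://example.com/a,img00001.png,A street view,False\n'
)

linked = linkImagesToArticles(articleCsv, imageCsv)
# Boolean values come back from the CSV as the strings 'True' and 'False'
insetCount = sum(row['Inset'] == 'True' for row in linked['https://example.com/a']['images'])
print(insetCount)
```

With the real files, the two `io.StringIO` objects would be replaced by files opened with `open(path, newline='', encoding='utf-8')`.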

In [18]:
# m = WebScraper(folder_name = '.\\times_of_malta')

# # def signal_handler(sig, frame):
# #     pd.DataFrame(columns=['Title','Image Name','Caption','Body'], data=data).to_csv(os.path.join(m.NEWS_PATH,'data.csv'), index=False)
# #     sys.exit()

# def get_img_ext(img_link):
#     if img_link.endswith('.jpeg'):
#         return '.jpeg'
#     elif img_link.endswith('.jpg'):
#         return '.jpg'
#     elif img_link.endswith('.png'):
#         return '.png'
#     else: return ""

# options = webdriver.ChromeOptions()
# options.add_experimental_option('excludeSwitches', ['enable-logging'])

# service = Service(executable_path=m.CHROME_DRIVER_PATH)

# # Since chromedriver is in PATH we don't have to specify its location; otherwise use webdriver.Chrome('path/chromedriver.exe')
# driver = webdriver.Chrome(service = service, options = options)

# # Opening required website to scrape content 
# driver.get('https://timesofmalta.com/search?keywords=inset&author=0&tags=0&sort=date&order=desc&fields%5B0%5D=title&fields%5B1%5D=body&page=1')


# #Closing pop-ups
# # print('Closing initial pop-ups: ',end='')
# # driver.find_element(By.XPATH,'/html/body/div[4]/div[2]/div[1]/div[2]/div[2]/button[1]').click()
# # print('[OK]')


# data = []
# count = 0
# article_index = 0
# page_index = 0

# #Setup ^C Handler to save on signal.
# try: 
#     # print(driver.find_element(By.CLASS_NAME,'fc-button fc-cta-consent fc-primary-button'))
#     # print("FOUND BUT NOT CLICKED")
#     WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, 
#                                                                 "//div[@class='fc-consent-root']//button[@aria-label='Consent']//p[@class='fc-button-label' and text()='Consent']"))).click()
#     #driver.find_element(By.CLASS_NAME,"fc-button fc-cta-consent fc-primary-button").click()
# except: 
#     pass
# for i in range(10):

#     # signal.signal(signal.SIGINT, signal_handler)
#     try: 
#         #Remove donation message
#         try: driver.find_element(By.XPATH,'//*[@id="eng-accept"]').click()
#         except: pass

#         #Click on article
#         driver.find_element(By.XPATH,f'//*[@id="listing-articles"]/div[{str(2+article_index)}]/a').click()
#         #Increment article index
#         article_index += 1

#         sleep(1)

#         url = driver.current_url
#         #Get Title
#         title = driver.find_element(By.XPATH,'//*[@id="article-head"]/div/h1').text

#         print(f'Title: {title}')
        
#         #Get Thumbnail + Images
#         img_links = [img for img in \
#                      driver.find_elements(By.XPATH,'//*[@id="observer"]/main/article/div[2]/div/*/img') +\
#                      driver.find_elements(By.XPATH,'//*[@id="article-head"]/div/picture/img')]
        
#         captions = ""
#         images   = ""

#         #Write images to disk
#         for img_link in img_links:
#             img_src   = img_link.get_attribute('src')        #Get source
#             captions  += img_link.get_attribute('alt') + '☺' #Append current caption

#             img_ext  = get_img_ext(img_src)                #Get image extension
#             img_name = f'img{str(count).zfill(5)}'+img_ext #Get image name
#             images  += img_name + ','                      #Append image name to list
#             img_data = requests.get(img_src).content       #Download image

#             count += 1

#             #Write current image to disk
#             with open(os.path.join(m.NEWS_IMG_PATH,img_name),'wb') as f:
#                 f.write(img_data)

#         #Get Body
#         text_content = driver.find_element(By.XPATH,'//*[@id="observer"]/main/article/div[2]/div')                                                     
#         body = " ".join(p.text for p in text_content.find_elements(By.TAG_NAME,'p'))

#         #Add row
#         data.append([url, images[:-1], captions[:-1]])

#         #Save to csv
#         # if i%50 == 0:
#         print('Saving...\n')
#         for image in images:
#             pd.DataFrame(columns=['Image Name','Caption', 'Inset'], data=data).to_csv(os.path.join(m.NEWS_PATH,'contents.csv'), index=False)

#     #Go back
#         print('Back to home page\n')
#         driver.back()

#     except selenium.common.exceptions.NoSuchElementException as e:
#         #Switch to next page
#         sleep(1.5)

#         try: driver.find_element(By.XPATH,f'//*[@id="observer"]/main/div/div[2]/div/span[{str(2+page_index)}]').click()
#         except: 
#             print(f'Error: Reloading page {page_index+2}')
#             driver.get(f'https://google.com')
#             sleep(2)
#             page_index += 1
#             driver.get(f'https://timesofmalta.com/search?keywords=inset&author=0&tags=0&sort=date&order=desc&fields%5B0%5D=title&fields%5B1%5D=body&page={str(page_index+2)}') #Skip page

#             try: driver.find_element(By.XPATH,'//*[@id="qc-cmp2-ui"]/div[2]/div/button[2]').click() #Close pop-up
#             except: pass

#             article_index = 0 #Invoke next-page switch
            
#             continue #Go back to beginning

        

#         page_index   += 1
#         article_index = 0
#         print(f'Next Page! -> {page_index+1}')
    
#     except selenium.common.exceptions.ElementClickInterceptedException as e:
#         print('Click Intercepted - Skipping')
#         article_index += 1
#         continue

#     except Exception as e:
#         print(str(e))
#         article_index += 1
#         page_index += 1
#         continue