## Python Crawler to receive notification when an update occurs - Script Module

### Requirements:
* Python 3 (install it from python.org or your package manager; it is not installed with pip)
* pip install sendgrid
* pip install requests==2.22.0 beautifulsoup4==4.8.1

This script expects the following files in the project directory:

* emailInformation.txt: put the From email address on the first line and the To email address on the second line
* sendgrid.env: must be listed in your .gitignore file; the API key itself must be saved in the SENDGRID_API_KEY environment variable
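
Because the API key is read from the environment, a missing variable only shows up when the SendGrid call fails. The sketch below is one possible early check, not part of the original script: the helper name loadApiKey and the example key are hypothetical, while SENDGRID_API_KEY is the variable name the script reads.

```python
import os

def loadApiKey(_env=None):
    # hypothetical helper, not part of the original script
    _env = os.environ if _env is None else _env
    key = _env.get('SENDGRID_API_KEY')
    if not key:
        raise RuntimeError('SENDGRID_API_KEY is not set in the environment')
    return key

# demo with plain dictionaries standing in for os.environ
demoKey = loadApiKey({'SENDGRID_API_KEY': 'SG.example-key'})
try:
    loadApiKey({})
    missingDetected = False
except RuntimeError:
    missingDetected = True
```

Calling loadApiKey() with no argument reads the real environment, so it could run once at startup before any page is fetched.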

Sources:
* https://www.twilio.com/blog/web-scraping-and-parsing-html-in-python-with-beautiful-soup
* https://www.geeksforgeeks.org/scheduling-python-scripts-on-linux/
* https://app.sendgrid.com/guide/integrate/langs/python

### Script exploration

First, these are the libraries required to run the script.

To use SendGrid, go to https://sendgrid.com/, create your account, and generate your API key.

Source: https://app.sendgrid.com/guide/integrate/langs/python

In [1]:
#pip install sendgrid

In [2]:
import re

import requests
import filecmp    
import os

# if your environment does not pick up the correct library folder, uncomment and adapt the path below
import sys
#sys.path.append("/Users/pauloalves/workspace/crawler/crawler/lib/python3.9/site-packages")

# used to parse HTML source code
from bs4 import BeautifulSoup

# using SendGrid's Python Library
# https://github.com/sendgrid/sendgrid-python
from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail

# used to generate and compare files
from os.path import exists
from datetime import date, datetime 
import shutil

Now, let's start exploring the methods.

*getContent* is a method that gets data from a website using BeautifulSoup.

It takes the website URL, a lookup mode ('div' to match div elements by class, or 'id' to match a single element by its id attribute), and the class or id value that identifies where the information you are looking for is located.

In [3]:
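# params: 
#  _url: website address that you want to check
#  _tag: lookup mode: 'div' matches div elements by class, 'id' matches one element by its id
#  _value: class string (for 'div') or id (for 'id') to look for
# returns: list of matching tags for 'div', element text for 'id', None for any other _tag
def getContent(_url,_tag,_value):    
    html_text = requests.get(_url).text
    soup = BeautifulSoup(html_text, 'html.parser')
    if _tag=='div':
        return soup.find_all(_tag,class_=_value)
    elif _tag=='id':
        return soup.find(id=_value).text
#     else:
#         res = requests.get(_url)
#         return res.text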
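
To make the two lookup modes concrete, here is a small sketch that runs the same BeautifulSoup calls on an inline HTML string instead of a live page. The HTML snippet, the class names, and the id are invented for illustration.

```python
from bs4 import BeautifulSoup

# made-up markup standing in for a downloaded page
html = """
<div id="main">Edital 436</div>
<div class="box round">First update</div>
<div class="round box">Second update</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# 'id' mode: one element, returned as plain text
byId = soup.find(id='main').text

# 'div' mode with a multi-class string: only the exact string value matches
exact = soup.find_all('div', class_='box round')

# a single class name matches every div that carries that class
single = soup.find_all('div', class_='box')

# find() returns None when nothing matches, so .text would raise AttributeError
missing = soup.find(id='does-not-exist')
```

Two things follow for the script: a class string passed in 'div' mode has to match the page's class attribute exactly, including the order of the classes, and the 'id' mode fails with an AttributeError if the page no longer contains that id.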

*saveContent* is a method to save the data from a website into a txt file.

Return: void

In [4]:
# params: 
#  _fileName: name of the file where the content will be stored
#  _content: content extracted from the website mentioned before 
def saveContent(_fileName,_content):
    # open the file; the with block closes it even if the write fails
    with open(_fileName, "w") as file:
        # repr() converts the content to a string
        file.write(repr(_content))
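
Because the file receives repr(_content) rather than the raw text, it stores Python's printable form of the value: quotes included and newlines escaped. That is enough for change detection, but the file is not a readable copy of the page. A minimal sketch, assuming a temporary directory and a made-up file name:

```python
import os
import tempfile

def saveContentDemo(_fileName, _content):
    # same idea as saveContent: store repr() of the value
    with open(_fileName, 'w') as file:
        file.write(repr(_content))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demoNew.txt')
    saveContentDemo(path, 'line one\nline two')
    with open(path) as f:
        stored = f.read()

# the real newline was written as a backslash followed by the letter n
hasRealNewline = '\n' in stored
```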

*compareContent* is a method used to compare two files.

Return: Boolean: True if the files are equal and False otherwise.

In [5]:
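# params: 
#  _file1: first file to compare
#  _file2: second file to compare
def compareContent(_file1, _file2):
    # shallow=False forces a content comparison even when the
    # os.stat() signatures (type, size, modification time) match
    result = filecmp.cmp(_file1, _file2, shallow=False)
    
    return result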
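
The behavior of filecmp.cmp can be checked without touching any website. The sketch below writes three throwaway files to a temporary directory (the file names and contents are invented) and compares them by content.

```python
import filecmp
import os
import tempfile

def writeText(_path, _text):
    # small helper used only by this demo
    with open(_path, 'w') as f:
        f.write(_text)

with tempfile.TemporaryDirectory() as tmp:
    oldFile = os.path.join(tmp, 'demoOld.txt')
    sameFile = os.path.join(tmp, 'demoSame.txt')
    changedFile = os.path.join(tmp, 'demoChanged.txt')

    writeText(oldFile, 'list of results: A, B')
    writeText(sameFile, 'list of results: A, B')
    writeText(changedFile, 'list of results: A, B, C')

    # shallow=False compares the file contents themselves
    unchanged = filecmp.cmp(oldFile, sameFile, shallow=False)
    updated = filecmp.cmp(oldFile, changedFile, shallow=False)
```

Here unchanged is True and updated is False, which is the signal the crawler uses to decide whether to send a notification.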

*sendMail* is a method used to send a notification by email. It expects a subject and the URL of the website that you are monitoring.

The sender and recipient addresses are read from *emailInformation.txt*, which should be stored in the root folder of this project.

Return: void

In [6]:
# params: 
#  _subject: String related to the subject that this crawler will verify
#  _url: address of the monitored website, included in the message body
def sendMail(_subject,_url):
    # strip() removes the trailing newline that readline() keeps
    with open("emailInformation.txt", "r") as f:
        _emailFrom = f.readline().strip()
        _emailTo = f.readline().strip()

    message = Mail(    
        from_email=_emailFrom,
        to_emails=_emailTo,
        subject='Crawler Notification - '+_subject,
        html_content='<strong>You have an update in the site that you are monitoring.</strong><br>'+_url+'<br>##webCrawler##')
    try:
        sg = SendGridAPIClient(os.environ.get('SENDGRID_API_KEY'))
        response = sg.send(message)
    except Exception as e:
        # generic exceptions have no .message attribute in Python 3
        print(e)
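
One detail of reading *emailInformation.txt* line by line is easy to miss: readline() keeps the line terminator. The sketch below uses an in-memory file with placeholder addresses (both invented) to show the raw and the cleaned values.

```python
import io

# in-memory stand-in for emailInformation.txt; the addresses are placeholders
fakeFile = io.StringIO('sender@example.com\nrecipient@example.com\n')

rawFrom = fakeFile.readline()          # still ends with '\n'
cleanTo = fakeFile.readline().strip()  # newline and surrounding spaces removed

cleanFrom = rawFrom.strip()
```

Stripping both values before building the message avoids passing an address with a trailing newline to the mail library.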

*compareFiles* is a method designed to persist the source file, and generate a new version of it if needed.

Return: Void

In [7]:
def compareFiles(_subjects,_contents,_urls):
    for subject,content,url in zip(_subjects,_contents,_urls): 
        newFile = str(subject)+'New.txt'
        oldFile = str(subject)+'Old.txt'

        # verify if there is a source file to compare with the current website state
        if not exists(oldFile):
            saveContent(oldFile, content)

        # save the current website state    
        saveContent(newFile, content)

        # store the result of file comparison. True if they are equal, False if they are not
        result = compareContent(oldFile,newFile)
        
        # verify the result, if false, send an email to notify the stakeholders informing that there is an update
        now = datetime.now().time()
        if not result:
            print(str(date.today())+'-'+str(now)+'-['+subject+']-'+ 'Do something, there is an update in your site')
            shutil.copyfile(oldFile, str(date.today())+'-'+oldFile)
            shutil.copyfile(newFile,oldFile)
            sendMail(str(subject),str(url))
        else:
            print(str(date.today())+'-'+str(now)+'-['+subject+']-'+'No changes')
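
The snapshot rotation can be tried offline. The following sketch reproduces the same steps for a single subject inside a temporary directory, with the email step left out. The function name checkForUpdate, the subject "demo", the contents, and the fixed date are all assumptions made for the demo.

```python
import filecmp
import os
import shutil
import tempfile
from datetime import date

def checkForUpdate(_dir, _subject, _content, _today):
    # hypothetical single-subject variant of compareFiles, without email
    oldFile = os.path.join(_dir, _subject + 'Old.txt')
    newFile = os.path.join(_dir, _subject + 'New.txt')

    # first run: the current state becomes the baseline
    if not os.path.exists(oldFile):
        with open(oldFile, 'w') as f:
            f.write(repr(_content))

    # always store the current state
    with open(newFile, 'w') as f:
        f.write(repr(_content))

    if filecmp.cmp(oldFile, newFile, shallow=False):
        return False

    # archive the previous baseline under a dated name, then promote the new one
    archive = os.path.join(_dir, str(_today) + '-' + _subject + 'Old.txt')
    shutil.copyfile(oldFile, archive)
    shutil.copyfile(newFile, oldFile)
    return True

with tempfile.TemporaryDirectory() as tmp:
    day = date(2022, 9, 14)
    firstRun = checkForUpdate(tmp, 'demo', 'v1', day)
    secondRun = checkForUpdate(tmp, 'demo', 'v1', day)
    thirdRun = checkForUpdate(tmp, 'demo', 'v2 changed', day)
    archived = os.path.exists(os.path.join(tmp, '2022-09-14-demoOld.txt'))
```

The first run never reports a change because the baseline is created from the same content it is compared against. Only the third run, with different content, reports an update and leaves a dated archive of the old baseline.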

In [8]:
## MAIN ##
aeroUrl = 'https://convocacaotemporarios.fab.mil.br/candidato/index.php'
pmerjUrl = 'https://otvs.pmerj.rj.gov.br/atualizacoes.php'
ufrjUrl = 'https://concursos.pr4.ufrj.br/index.php/56-concursos/concursos-em-andamento/edital-436-de-07-de-junho-de-2022/523-edital-n-436-de-07-de-junho-de-2022'
urls = [aeroUrl,pmerjUrl,ufrjUrl]

aeroContent = getContent(aeroUrl,'id','convocacao-recentes')
pmerjContent = getContent(pmerjUrl, 'div', 'box-round bg-white-1 p-4 text-start')
ufrjContent = getContent(ufrjUrl,'id','main')

contents = [aeroContent,pmerjContent,ufrjContent]

subjects = ["aero","PMERJ","UFRJ"]

compareFiles(subjects,contents,urls)

2022-09-14-10:18:38.504670-[aero]-No changes
2022-09-14-10:18:38.521153-[PMERJ]-No changes
2022-09-14-10:18:38.521974-[UFRJ]-No changes
