# Programming Project - Unit 3,2
*by Igor A. Brandão and Leandro Antonio Feliciano da Silva*

**Goals**
The purpose of this project is to explore the following:

- Access web content through web scraping;
- Collect the following data about a person:


1. About
2. RG
3. CPF
4. CNPJ
5. Name
6. Marital status
7. Birthdate
8. Age
9. Education level
10. Language
11. Workplace
12. Salary

# Global Imports section

#### Import the necessary libraries to handle:

- Requests;
- urlopen;
- HTTPError;
- BeautifulSoup
- Regular expression
- Tqdm progress bar

In [1]:
### Library necessary to run this IPython Notebook
!pip install tqdm

### Library necessary to tabulate python outputs
!pip install tabulate



In [2]:
# Imports
from urllib.request import Request, urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import re
import requests
from tqdm import tqdm

# Global variables section

#### This section defines the content variables

In [3]:
# Content variables
about_me = ""
rg = ""
cpf = ""
cnpj = ""
name = ""
marital_state = ""
birthdate = ""
age = ""
education_level = []
language = []
workplace = []
salary = []

# Function section

#### This section contains all the generic functions used in this project

In [4]:
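# Return a BeautifulSoup object, or None if the page cannot be fetched or parsed
def getSoup(url):
    try:
        html = requests.get(url)
    except requests.exceptions.RequestException as e:
        # requests raises its own exception types, not urllib's HTTPError
        return None
    try:
        soup = BeautifulSoup(html.content, 'html.parser')
    except AttributeError as e:
        return None
    return soup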

In [5]:
# Remove the dirty part from link
def getLink(url, dirty):
    result = url.split(dirty)
    return result[0]
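
As a quick illustration of what `getLink()` does, the sketch below applies it to a Google-style redirect link. The `href` value is a made-up example (the `ved` parameter in particular is hypothetical); only the `/url?q=` prefix and the `&sa=` marker come from the code used later in this notebook.

```python
# Same helper as above: keep everything before the first occurrence of `dirty`
def getLink(url, dirty):
    result = url.split(dirty)
    return result[0]

# Hypothetical href as it might appear in a search result anchor
href = "/url?q=http://www.dca.ufrn.br/~ivan/&sa=U&ved=example"

# Drop the redirect prefix, then cut the tracking suffix
clean = getLink(href.replace("/url?q=", ""), "&sa=")
print(clean)
```

If the `dirty` marker is not present, `split()` returns the whole string as its first element, so the link is returned unchanged.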

# Input section

#### This section covers all the basic parameters used to perform the web scraping operation

In [6]:
# [Global parameters]

# Search term (person name)
search_term = "Ivanovitch Medeiros Dantas da Silva"

# Search limiter
search_results_number = "20"

# Number of recursion levels (how deep the search will dig)
recursion_number = 2

# Base URI
search_url = "https://www.google.com.br/search?q="

#### Here we assemble the final search URI

In [7]:
# First of all, replace blank spaces with plus signs to perform the search
search_term_adapted = re.sub(" ", "+", search_term)

# Assemble the search URI
url = (search_url + search_term_adapted + "&num=" + search_results_number)

# Access the page content
soup = getSoup(url)
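
The query string can also be assembled with the standard library instead of a manual substitution. This is a minimal sketch, not the approach used in this notebook: `urlencode` replaces spaces with `+` and also escapes other reserved characters. The `params` dictionary is a name introduced here for illustration.

```python
from urllib.parse import urlencode

# Base URI without the parameter name, which urlencode() adds itself
search_base = "https://www.google.com.br/search?"

# Query parameters (illustrative dictionary; values taken from the cell above)
params = {"q": "Ivanovitch Medeiros Dantas da Silva", "num": "20"}

# urlencode() turns spaces into '+' and joins the pairs with '&'
url = search_base + urlencode(params)
print(url)
```

For this search term the result is the same URI as the one built above; the difference only shows up for names containing characters such as `&` or accented letters.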

# Processing section

#### This section is responsible for handling the processing

#### *Warning:* The *url* must be properly assembled for this section to work

In [8]:
# [Processing parameters]

# Search method
# Defines how we'll look for the results section
# Possible methods {id, tag, className}
search_method = "id"

# Result tag ID (the idea here is just getting the search result and ignore what's left over)
result_tag_id = "res"

# HTML container element
element = "div"

# Class name
class_name = ""

# Content variable
content = ""

#### Check the chosen search method

In [9]:
# Check the search method
if search_method == "id":
    content = soup.findAll(element, id=result_tag_id)[0]
elif search_method == "tag":
    # find() returns a single element, so content.find_all() works later on
    content = soup.find(element)
elif search_method == "className":
    content = soup.find(class_=class_name)
else:
    print("You must choose one search method")

#### Get the list of result links and put them into a list

In [10]:
# Link list
link_list = []
link_list_content = []
link = ""

# Filter (to avoid receiving undesirable URIs)
filters = ['webcache', '.pdf', '.doc', '.docx', '.xls', '.xlsx']

# Add the links to the list
for item in tqdm(content.find_all('a')):

    # Check whether an href attribute exists
    if 'href' in item.attrs:

        # Check if it's a link
        if 'http' in str(item.attrs):

            # Filter the results according to the filter list
            if not any(filter_item in str(item.attrs) for filter_item in filters):

                # Remove the dirty part from the link using the defined function getLink()
                link = getLink(item.attrs['href'].replace("/url?q=", ""), "&sa=")
                link_list.append(link)

100%|██████████| 45/45 [00:00<00:00, 20609.71it/s]
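
The links collected this way stay percent-encoded (for example `%3F` instead of `?`). One possible alternative, sketched below with the standard library, reads the `q` parameter of the redirect link with `parse_qs`, which decodes it at the same time. The helper names `extract_target` and `keep_link` and the sample `hrefs` are hypothetical, and the sketch assumes the target URL is fully percent-encoded inside `q`.

```python
from urllib.parse import urlparse, parse_qs

# Same filter list as in the cell above
filters = ['webcache', '.pdf', '.doc', '.docx', '.xls', '.xlsx']

def extract_target(href):
    """Return the decoded target of a '/url?q=...' href, or the href itself."""
    query = parse_qs(urlparse(href).query)
    if href.startswith("/url?") and "q" in query:
        return query["q"][0]
    return href

def keep_link(url):
    """True for http(s) links that match none of the filter terms."""
    return url.startswith("http") and not any(term in url for term in filters)

# Hypothetical hrefs shaped like the ones found in the search page
hrefs = [
    "/url?q=https://sigaa.ufrn.br/sigaa/public/docente/portal.jsf%3Fsiape%3D2885532&sa=U",
    "/url?q=http://example.com/paper.pdf&sa=U",
    "/search?q=related",
]

links = [u for u in map(extract_target, hrefs) if keep_link(u)]
print(links)
```

Only the first entry survives: the second is dropped by the `.pdf` filter and the third is not an external link.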


#### Follow the inner links n levels deep (depends on recursion_number)

**Important:** recursion_number = 2 means read the search page and its sub-links (1 level)
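
To make the meaning of the depth parameter concrete, here is a minimal sketch of a depth-limited traversal. It is not the code used in this notebook: the `crawl` function, the `get_links` callback and the in-memory `graph` are assumptions made for illustration, with the start page counted as level 1.

```python
def crawl(start, get_links, max_depth):
    """Visit pages breadth-first; the start page counts as level 1."""
    seen = [start]
    frontier = [start]
    for _ in range(max_depth - 1):
        next_frontier = []
        for page in frontier:
            for link in get_links(page):
                # Skip pages that were already visited
                if link not in seen:
                    seen.append(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen

# Hypothetical in-memory link graph standing in for real pages
graph = {"search": ["a", "b"], "a": ["c"], "b": [], "c": ["d"]}

# Depth 2: the search page plus the pages it links to
pages = crawl("search", lambda page: graph.get(page, []), max_depth=2)
print(pages)
```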

In [11]:
# Fetch each link and store its parsed content
for idx, item in tqdm(enumerate(link_list)):
    link_list_content.append(getSoup(item))
    print(item)
    print("===============================================================================================")

1it [00:00,  1.62it/s]

https://sigaa.ufrn.br/sigaa/public/docente/portal.jsf%3Fsiape%3D2885532


2it [00:00,  2.00it/s]

http://www.dca.ufrn.br/~ivan/


3it [00:01,  2.36it/s]

http://www.dca.ufrn.br/~ivan/index.php%3Fcorpo%3Dproducao.php


4it [00:01,  2.48it/s]

http://www.dca.ufrn.br/~ivan/index.php%3Fcorpo%3Dapresenta.php


5it [00:03,  1.20it/s]

https://scholar.google.com/citations%3Fuser%3Daa5xs_0AAAAJ


6it [00:05,  1.12s/it]

https://www.escavador.com/sobre/7485210/ivanovitch-medeiros-dantas-da-silva


7it [00:07,  1.47s/it]

https://www.escavador.com/sobre/11169212/ivanovitch-medeiros-dantas-da-silva


8it [00:07,  1.18s/it]

http://www.dca.ufrn.br/~ivan/


9it [00:08,  1.02s/it]

https://sigaa.ufrn.br/sigaa/public/departamento/professores.jsf%3Fid%3D6069


10it [00:09,  1.03s/it]

https://portal.imd.ufrn.br/2017/02/16/aluno-do-imdufrn-e-convidado-para-trabalhar-na-microsoft/


11it [00:10,  1.01s/it]

http://www.ccet.ufrn.br/prh22/discentes_ex_alunos.htm


12it [00:11,  1.01it/s]

https://portal.imd.ufrn.br/2017/02/16/aluno-do-imdufrn-e-convidado-para-trabalhar-na-microsoft/


13it [00:13,  1.32s/it]

http://www.nossaciencia.com.br/espalhando-a-cultura-empreendedora


14it [00:15,  1.65s/it]

https://sigarra.up.pt/feup/pt/func_geral.formview%3Fp_codigo%3D470835


15it [00:18,  1.94s/it]

https://pt-br.facebook.com/lii.ufrn/


16it [00:25,  3.54s/it]

http://www.nossaciencia.com.br/espalhando-a-cultura-empreendedora


17it [00:26,  2.72s/it]

http://bdtd.ibict.br/vufind/Author/Home%3Fauthor%3DSilva%252C%2BIvanovitch%2BMedeiros%2BDantas%2Bda


18it [00:27,  2.17s/it]

http://mossorohoje.com.br/noticias/15468/valeu-a-pena-gastar-horas-estudando-diz-potiguar-convidado-para-trabalhar-na-microsoft


19it [00:27,  1.65s/it]

https://www.natal.rn.gov.br/bvn/detalheNoticia.php%3FvalorRegistro%3D22649


20it [00:31,  2.22s/it]

http://www.jusbrasil.com.br/topicos/84719252/jacinta-fabiara-cordeiro-campos


21it [00:32,  1.91s/it]

http://scholar.google.com.br/citations%3Fuser%3DYItyGFkAAAAJ%26hl%3Dpt-BR


22it [00:34,  1.86s/it]

https://www.lsec.icmc.usp.br/en/wocces


23it [00:35,  1.64s/it]

https://contas.tcu.gov.br/juris/SvlHighLight%3Bjsessionid%3D1BFE4D3AF3E3C59E723D6E7B7EA070CF%3Fkey%3DACORDAO-RELACAO-LEGADO-109358-23-2012-55162012%26texto%3D50524f43253341313738363532303132332a%26sort%3DRELEVANCIA%26ordem%3DDESC%26bases%3DACORDAO-LEGADO%3BDECISAO-LEGADO%3BRELACAO-LEGADO%3BACORDAO-RELACAO-LEGADO%3B%26highlight%3D%26posicaoDocumento%3D0
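
The output above shows that some URLs appear more than once, so the same page is downloaded twice. A small sketch of one way to remove duplicates while keeping the original order is given below; the shortened `link_list` is copied from a few of the entries above, and `unique_links` is a name introduced here.

```python
# A few entries copied from the output above, including repeated ones
link_list = [
    "http://www.dca.ufrn.br/~ivan/",
    "https://www.escavador.com/sobre/7485210/ivanovitch-medeiros-dantas-da-silva",
    "http://www.dca.ufrn.br/~ivan/",
    "http://www.nossaciencia.com.br/espalhando-a-cultura-empreendedora",
    "http://www.nossaciencia.com.br/espalhando-a-cultura-empreendedora",
]

# dict keys are unique and keep insertion order, so this drops later repeats
unique_links = list(dict.fromkeys(link_list))
print(len(link_list), "->", len(unique_links))
```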





# Page scraping function section

#### This section contains the functions that extract the data from each source

In [12]:
# Extract the requested field (contentType_) from an Escavador page
def getFromEscavador(content_, contentType_):
    
    # ==========================================================
    # About
    # ==========================================================
    if contentType_ == "about":
        
        # Dig the elements
        first_div = content_.find('div',{"class" : "box -flushHorizontal"})
        child_element = first_div.findAll('p', limit=1)
        
        # Assign the result
        return(child_element[0].text)
    
    # ==========================================================
    # Name
    # ==========================================================
    elif contentType_ == "name":
        
        # Dig the elements and return the result
        return(content_.find(class_="heading name").text)
    
    # ==========================================================
    # Education level
    # ==========================================================
    elif contentType_ == "education_level":
        
        # Array with education
        education = []
        
        # Dig the elements
        for item in content_.findAll(class_='heading -likeH5 inline-edit-item inline-edit-item-formacao'):
            if item.text:
                education.append(item.text)
        
        # Return the education list
        return(education)
    
    # ==========================================================
    # Language
    # ==========================================================
    elif contentType_ == "language":
        
        # Array with languages
        language_list = []
        
        # Dig the elements
        for item in content_.findAll('div', id='idiomas'):
            if item:
                # Look for all languages
                for subitem in item.findAll(class_='col-sm-6 clearfix-box'):
                    # Get the language description
                    child_element = subitem.findAll('p', limit=2)

                    # Both paragraphs (language and proficiency) must be present
                    if len(child_element) == 2:
                        # Append it to the language list
                        language_list.append(child_element[0].text + " - " + child_element[1].text.replace('\n', ' '))

        # Return the language list
        return(language_list)
    
    # ==========================================================
    # Workplace
    # ==========================================================
    elif contentType_ == "workplace":
        
        # Array with workplaces
        workplace_list = []
        
        # Dig the elements
        for item in content_.findAll('div', id='endereco-profissional'):
            if item:
                # Look for all workplaces
                for subitem in item.findAll(class_='item'):
                    # Get the workplace description
                    child_element = subitem.find('p')

                    if child_element:
                        # Append it to the workplace list
                        workplace_list.append(child_element.text)

        # Return the workplace list
        return(workplace_list)
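
Several branches of the function above call `.text` directly on the result of `find()`, which is `None` when the element is missing and then raises `AttributeError`. The sketch below shows one defensive pattern; `text_or_default` is a hypothetical helper, and `SimpleNamespace` objects stand in for parsed elements so the example runs without downloading a page. Unlike the original code, it also strips surrounding whitespace.

```python
from types import SimpleNamespace

def text_or_default(node, default=""):
    """Return the stripped text of a node, or a default when the node is missing."""
    if node is None:
        return default
    return node.text.strip()

# Stand-ins for what find() might return (an element, or None when not found)
found = SimpleNamespace(text="  Ivanovitch Medeiros Dantas da Silva \n")
missing = None

print(repr(text_or_default(found)))
print(repr(text_or_default(missing)))
```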

In [13]:
# Return the worker information (id and salary details) as a list
# Important: This function works just with public workers
def getFromPortalDaTransparencia(contentType_):
    
    if contentType_ == "salary":
    
        # ==========================================================
        # Public worker ID and CPF
        # ==========================================================
    
        # Handle global search term
        global search_term
        
        # Worker information handler
        # (the id is kept as a string because it is concatenated into a URL below)
        worker_id = ""
        worker_info = []
        
        # Base URI
        search_url = "http://www.portaltransparencia.gov.br/servidores/Servidor-ListaServidores.asp?bogus=1&Pagina=1&TextoPesquisa="
        
        # First of all, replace blank spaces with %20 to perform the search
        search_term_adapted = re.sub(" ", "%20", search_term)

        # Assemble the search URI
        url = (search_url + search_term_adapted)

        # Access the page content
        salary_content = getSoup(url)
        
        # Dig the elements
        for item in salary_content.findAll('div', id='listagem'):
            if item:

                # Run through the table cells
                for subitem in item.findAll('td'):
                    
                    # Get the worker CPF
                    if 'class' in subitem.attrs:
                        worker_info.append(subitem.text)
    
                    # Get the worker id
                    for sub_subitem in subitem.findAll('a'):
                    
                        if 'href' in sub_subitem.attrs:
                
                            # Check if it's a link
                            if 'IdServidor' in str(sub_subitem.attrs):

                                # Remove the dirty part from the link
                                result = sub_subitem.attrs['href'].split('IdServidor=')

                                # Store the worker id
                                worker_id = result[1]
                                worker_info.append(worker_id)
                    
        # ==========================================================
        # Public worker salary
        # ==========================================================
        
        # Check whether worker information was found
        if worker_info:
            
            # Renew worker_info array
            worker_info = []
            worker_info.append(worker_id)
            
            # Base URI
            search_url = "http://www.portaltransparencia.gov.br/servidores/Servidor-DetalhaRemuneracao.asp?Op=1&IdServidor=" + worker_id + "&bInformacaoFinanceira=True"

            # Access the page content
            salary_detail_content = getSoup(search_url)
            
            # Dig the elements
            for item in salary_detail_content.findAll('td',{"class" : "colunaValor"}):
                
                if item.text.strip():
                    worker_info.append(item.text.strip())
        
        return(worker_info)
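
The function above obtains the worker id by splitting the link on `IdServidor=`, which keeps anything that follows the id in the link. A possible alternative is sketched below using `parse_qs`; the helper name `get_worker_id` and the sample hrefs are hypothetical, since the exact shape of the portal's links is not shown in this notebook.

```python
from urllib.parse import urlparse, parse_qs

def get_worker_id(href):
    """Return the IdServidor query parameter of a link, or None if absent."""
    query = parse_qs(urlparse(href).query)
    values = query.get("IdServidor")
    return values[0] if values else None

# Hypothetical hrefs: one with an extra parameter after the id, one without an id
print(get_worker_id("Servidor-DetalhaServidor.asp?IdServidor=1500347&extra=1"))
print(get_worker_id("Servidor-ListaServidores.asp?Pagina=1"))
```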

#### Get the information (this is the main flow)

In [14]:
# First of all, run through link list content
for idx, item in tqdm(enumerate(link_list_content)):
    
    # Try to look for the information in available sources
    
    # ==========================================================
    # Escavador source
    # ==========================================================
    if "Escavador" in str(item) or "escavador" in str(item):
        
        # Get information from Escavador
        if not name:
            name = getFromEscavador(item, "name")
            
        if not about_me:
            about_me = getFromEscavador(item, "about")
            
        if not education_level:
            education_level = getFromEscavador(item, "education_level")
            
        if not language:
            language = getFromEscavador(item, "language")
            
        if not workplace:
            workplace = getFromEscavador(item, "workplace")
    
    # ==========================================================
    # Other source
    # ==========================================================

23it [00:01, 15.38it/s]
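
The check in the loop above looks for the word "escavador" anywhere in the page HTML, so a page that merely mentions the site would also match. A sketch of an alternative is to decide by the host name of the link instead; `is_from` is a hypothetical helper, and using it would require iterating over the links together with their content, for example with `zip(link_list, link_list_content)`.

```python
from urllib.parse import urlparse

def is_from(url, domain):
    """True when the URL's host is the given domain or one of its subdomains."""
    host = urlparse(url).netloc.lower()
    return host == domain or host.endswith("." + domain)

# Two links taken from the list printed earlier in this notebook
escavador_link = "https://www.escavador.com/sobre/7485210/ivanovitch-medeiros-dantas-da-silva"
other_link = "http://www.dca.ufrn.br/~ivan/"

print(is_from(escavador_link, "escavador.com"))
print(is_from(other_link, "escavador.com"))
```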


In [15]:
# Try to get the salary (just work with public workers)
salary = getFromPortalDaTransparencia("salary")

# Result section

#### Now it's the moment of truth: let's see what our program can find about you ;-)

In [16]:
# Imports to output the result as a Markdown
from IPython.display import display, Markdown

# Imports to tabulate outputs
from tabulate import tabulate

# ==========================================================
# Display information
# ==========================================================
display(Markdown('# Hello, ' + name + '!'))
display(Markdown('<hr>'))

# Name
if name:
    display(Markdown('**Your name:** ' + name))

# About you
if about_me:
    display(Markdown('**Some about you:** ' + about_me))

# Education level (print as a tabular list)
if education_level:
    display(Markdown('**Education level:**'))

    # zip() wraps each item in a one-element row; avoid shadowing the built-in list
    rows = zip(education_level)
    print(tabulate(rows))

# Languages (print as a tabular list)
if language:
    display(Markdown('**Language(s):**'))

    rows = zip(language)
    print(tabulate(rows))

# Workplaces (print as a tabular list)
if workplace:
    display(Markdown('**Workplace(s):**'))

    rows = zip(workplace)
    print(tabulate(rows))

# Salary (indexes up to 8 are read below, so at least nine entries are required)
if len(salary) >= 9:
    display(Markdown('**Salary informations:**'))
    
    display(Markdown('*Worker ID: *' + salary[0]))
    display(Markdown('*Worker CPF: *' + salary[2]))
    display(Markdown('*Worker Type: *' + salary[3]))
    display(Markdown('<hr>'))
    display(Markdown('***Base salary: *' + 'R$ ' + salary[4]))
    display(Markdown('*IRRF: *' + 'R$ ' + salary[5]))
    display(Markdown('*PSS/RPGS: *' + 'R$ ' + salary[6]))
    display(Markdown('*Other discounts: *' + 'R$ ' + salary[7]))
    display(Markdown('*Final salary: *' + 'R$ ' + salary[8]))
    
    

# Hello, Ivanovitch Medeiros Dantas da Silva!

<hr>

**Your name:** Ivanovitch Medeiros Dantas da Silva

**Some about you:** Possui graduação em Engenharia de Computação (2006), mestrado (2008) e doutorado (2013) em Engenharia Elétrica e Computação pela Universidade Federal do Rio Grande do Norte e co-participação da Universidade do Porto (Sanduíche). Desde 2013 é docente da Universidade Federal do Rio Grande do Norte sendo lotado no Instituto Metrópole Digital (IMD). Atua no Programa de Pós Graduação em Engenharia Elétrica e de Computação (PPGEEC) da UFRN, orientando trabalhos de mestrado e doutorado. Seus interesses em pesquisa incluem: modelagem e análise científica de dados, dependabilidade, redes de sensores sem fio, internet das coisas industriais (IIoT) e desenvolvimento de aplicações para Cidades Inteligentes.

**Education level:**

--------------------------------------------
Doutorado em Engenharia Elétrica
Mestrado em Engenharia Elétrica e Computação
Graduação em Engenharia de Computação
--------------------------------------------


**Language(s):**

--------------------------------------------------------
Inglês -  Compreende Bem, Fala Bem, Lê Bem, Escreve Bem.
--------------------------------------------------------


**Workplace(s):**

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Universidade Federal do Rio Grande do Norte, Instituto Metropole Digital. , Campus Universitario Lagoa Nova, Lagoa Nova, 59072970 - Natal, RN - Brasil, Telefone: (084) 32422216, Ramal: 131, URL da Homepage:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


**Salary informations:**

*Worker ID: *1500347

*Worker CPF: ****.654.994-**

*Worker Type: *Civil

<hr>

***Base salary: *R$ 13.308,31

*IRRF: *R$ -2.435,93

*PSS/RPGS: *R$ -608,44

*Other discounts: *R$ -32,10

*Final salary: *R$ 10.231,84
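
The amounts above are strings in Brazilian number format (dot as thousands separator, comma as decimal separator). If they need to be used in calculations, a conversion along the following lines may help; `parse_brl` is a hypothetical helper that is not part of this notebook, and the values are the ones displayed above.

```python
from decimal import Decimal

def parse_brl(amount):
    """Convert a Brazilian-formatted amount such as '13.308,31' to Decimal."""
    return Decimal(amount.replace(".", "").replace(",", "."))

# Values copied from the output above
base = parse_brl("13.308,31")
deductions = [parse_brl(value) for value in ("-2.435,93", "-608,44", "-32,10")]

# The deductions are already negative, so they are simply added to the base
net = base + sum(deductions)
print(net)
```

The computed value agrees with the final salary shown above, which is a simple way to sanity-check the scraped fields.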