## Web Scraping using Python - Demographic and Economic Database - Wikipedia

- In this project, my goal is to scrape economic data from Wikipedia for future exploratory data analysis and machine learning projects.

- Web scraping lets me extract data that would otherwise take much longer to gather, because it is spread across several different sources.

- I chose Wikipedia because it concentrates the data on country-specific pages that follow a standardized layout.

In [1]:
# Import the libraries

import numpy as np
import pandas as pd

from urllib.request import urlopen  # Libraries used to access and parse HTML pages.
from bs4 import BeautifulSoup

- First, I defined the function getHTMLContent, which accepts a URL and uses BeautifulSoup to fetch and parse the page's HTML.

In [2]:
def getHTMLContent(link):
    html = urlopen(link)
    soup = BeautifulSoup(html, 'html.parser')
    return soup

- Understanding the data: the data is returned as HTML, so we need to understand its structure before extracting any information. The page may contain many tables, so we need to find the class of the table we want in order to access its data.

- Before using the function below, inspect the page's source code to find the tag we are looking for, which holds the country table (the one with the headers Rank, Country/Territory and US$). We then use the find_all method with the table tag to retrieve the tables.
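
To see which tables a page holds before choosing one, it can help to list the class attribute of each table. The sketch below is a minimal offline illustration: `sample_html` is a made-up snippet standing in for the real page, not the actual Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Made-up snippet (assumption): one layout table without a class,
# one data table with two classes.
sample_html = """
<table width="100%"><tr><td>legend</td></tr></table>
<table class="wikitable sortable"><tr><th>Rank</th></tr></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# .get("class") returns None when the attribute is absent,
# and a list of class names when it is present.
table_classes = [t.get("class") for t in soup.find_all("table")]
print(table_classes)
```

A table without a class attribute yields None, while a table with several classes yields a list of names, which makes the data table easy to spot.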

In [3]:
# Using the function - first I access the list of countries, so that each country page can then be analyzed separately.

content = getHTMLContent("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita")
tables = content.find_all('table')
for table in tables:
    print(table.prettify())

<table width="100%">
 <tbody>
  <tr>
   <td valign="top">
    <div class="legend" style="-webkit-column-break-inside: avoid;page-break-inside: avoid;break-inside: avoid-column">
     <span class="legend-color" style="display:inline-block; width:1.5em; height:1.5em; margin:1px 0; border:1px solid black; background-color: #003C00; color:black; font-size:100%; text-align:center;">
     </span>
     &gt;$60,000
    </div>
    <div class="legend" style="-webkit-column-break-inside: avoid;page-break-inside: avoid;break-inside: avoid-column">
     <span class="legend-color" style="display:inline-block; width:1.5em; height:1.5em; margin:1px 0; border:1px solid black; background-color: #007F00; color:black; font-size:100%; text-align:center;">
     </span>
     $50,000 - $60,000
    </div>
    <div class="legend" style="-webkit-column-break-inside: avoid;page-break-inside: avoid;break-inside: avoid-column">
     <span class="legend-color" style="display:inline-block; width:1.5em; height:1.5em; 

<table class="wikitable sortable" style="margin-left:auto;margin-right:auto;text-align: right">
 <tbody>
  <tr>
   <th data-sort-type="number">
    Rank
   </th>
   <th>
    Country/Territory
   </th>
   <th>
    <a href="/wiki/United_States_dollar" title="United States dollar">
     US$
    </a>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <img alt="" class="thumbborder" data-file-height="800" data-file-width="1000" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Flag_of_Monaco.svg/19px-Flag_of_Monaco.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Flag_of_Monaco.svg/29px-Flag_of_Monaco.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Flag_of_Monaco.svg/38px-Flag_of_Monaco.svg.png 2x" width="19"/>
    </span>
    <a href="/wiki/Monaco" title="Monaco">
     Monaco
    </a>
    (2018)
   </td>
   <td align="left">
    185,741
   </td>
  </tr>
  <tr>
   <td>

- The table we will use has the class 'wikitable sortable'. Its first row holds the column titles, and each following row holds the information for one country.
- I will visit the page of each country. In each row, the cell with the country name contains the link to that country's Wikipedia page.
- The steps below show how to reach that link.
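
The whole path from table to link can be sketched on a small inline snippet. This is only an illustration: `sample_html` is a simplified, made-up stand-in for the real table, and the CSS selector `table.wikitable.sortable` is an alternative to matching the exact class string, chosen here because it does not depend on the order of the class names.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the real table (assumption, not the actual markup).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Country/Territory</th><th>US$</th></tr>
  <tr><td>1</td><td><a href="/wiki/Luxembourg">Luxembourg</a></td><td>113,196</td></tr>
  <tr><td>2</td><td><a href="/wiki/Switzerland">Switzerland</a></td><td>83,716</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.select_one("table.wikitable.sortable")

links = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    # The header row has only <th> cells, so it is skipped here.
    if len(cells) > 1:
        anchor = cells[1].find("a")
        if anchor is not None:
            links.append(anchor.get("href"))

print(links)
```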

In [4]:
# First, find the block with the class 'wikitable sortable', which contains the links

table = content.find('table', {'class': 'wikitable sortable'})
table

<table class="wikitable sortable" style="margin-left:auto;margin-right:auto;text-align: right">
<tbody><tr>
<th data-sort-type="number">Rank
</th>
<th>Country/Territory
</th>
<th><a href="/wiki/United_States_dollar" title="United States dollar">US$</a>
</th></tr>
<tr>
<td>1</td>
<td align="left"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="1000" decoding="async" height="14" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/da/Flag_of_Luxembourg.svg/23px-Flag_of_Luxembourg.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/da/Flag_of_Luxembourg.svg/35px-Flag_of_Luxembourg.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/da/Flag_of_Luxembourg.svg/46px-Flag_of_Luxembourg.svg.png 2x" width="23"/> </span><a href="/wiki/Luxembourg" title="Luxembourg">Luxembourg</a></td>
<td>113,196
</td></tr>
<tr>
<td>2</td>
<td align="left"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="512" data-fil

In [5]:
# Next, go one level deeper by finding the <tr> tags

rows = table.find_all('tr')
rows

[<tr>
 <th data-sort-type="number">Rank
 </th>
 <th>Country/Territory
 </th>
 <th><a href="/wiki/United_States_dollar" title="United States dollar">US$</a>
 </th></tr>,
 <tr>
 <td>1</td>
 <td align="left"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="1000" decoding="async" height="14" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/da/Flag_of_Luxembourg.svg/23px-Flag_of_Luxembourg.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/da/Flag_of_Luxembourg.svg/35px-Flag_of_Luxembourg.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/da/Flag_of_Luxembourg.svg/46px-Flag_of_Luxembourg.svg.png 2x" width="23"/> </span><a href="/wiki/Luxembourg" title="Luxembourg">Luxembourg</a></td>
 <td>113,196
 </td></tr>,
 <tr>
 <td>2</td>
 <td align="left"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="512" data-file-width="512" decoding="async" height="16" src="//upload.wikimedia.org/wikipedia/commo

In [6]:
# Now isolate the cells (<td>) that hold the information, and therefore the link.
# After the loop, cells keeps only the cells of the last row.

for row in rows:
    cells = row.find_all('td')
    
cells

[<td>186</td>,
 <td align="left"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="500" data-file-width="1000" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/7/7a/Flag_of_South_Sudan.svg/23px-Flag_of_South_Sudan.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/7/7a/Flag_of_South_Sudan.svg/35px-Flag_of_South_Sudan.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/7/7a/Flag_of_South_Sudan.svg/46px-Flag_of_South_Sudan.svg.png 2x" width="23"/> </span><a href="/wiki/South_Sudan" title="South Sudan">South Sudan</a></td>,
 <td>275
 </td>]

In [7]:
# Finally, the <a> tag, whose href holds the link we are looking for

country_link = cells[1].find('a')
country_link

<a href="/wiki/South_Sudan" title="South Sudan">South Sudan</a>

In [8]:
table = content.find('table', {'class': 'wikitable sortable'})
rows = table.find_all('tr')

# List every link found through the tags above
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 1:
        country_link = cells[1].find('a')
        print(country_link.get('href'))
        

/wiki/Luxembourg
/wiki/Switzerland
/wiki/Macau
/wiki/Norway
/wiki/Republic_of_Ireland
/wiki/Qatar
/wiki/Iceland
/wiki/United_States
/wiki/Singapore
/wiki/Denmark
/wiki/Australia
/wiki/Netherlands
/wiki/Sweden
/wiki/Austria
/wiki/Hong_Kong
/wiki/Finland
/wiki/San_Marino
/wiki/Germany
/wiki/Canada
/wiki/Belgium
/wiki/Israel
/wiki/France
/wiki/United_Kingdom
/wiki/Japan
/wiki/New_Zealand
/wiki/United_Arab_Emirates
/wiki/The_Bahamas
/wiki/Italy
/wiki/Puerto_Rico
/wiki/South_Korea
/wiki/Malta
/wiki/Spain
/wiki/Kuwait
/wiki/Brunei
/wiki/Cyprus
/wiki/Slovenia
/wiki/Aruba
/wiki/Bahrain
/wiki/Taiwan
/wiki/Estonia
/wiki/Czech_Republic
/wiki/Portugal
/wiki/Saudi_Arabia
/wiki/Greece
/wiki/Slovakia
/wiki/Lithuania
/wiki/Saint_Kitts_and_Nevis
/wiki/Latvia
/wiki/Antigua_and_Barbuda
/wiki/Barbados
/wiki/Oman
/wiki/Hungary
/wiki/Seychelles
/wiki/Uruguay
/wiki/Palau
/wiki/Trinidad_and_Tobago
/wiki/Panama
/wiki/Chile
/wiki/Maldives
/wiki/Croatia
/wiki/Poland
/wiki/Romania
/wiki/Costa_Rica
/wiki/Grenada
/

- Each row has a link to the corresponding country's Wikipedia page. However, the link is relative and lacks the site's base URL, so we have to prepend it. Let's understand the page content using one country page as an example.

- The pieces of information we are looking for should be used as keywords when searching the HTML. The page source is very long, so just reading through it can take a while; use the search tools instead.
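
As a small illustration of completing the relative links, the sketch below uses `urljoin` from the standard library instead of plain string concatenation. The names `BASE_URL`, `relative_links` and `full_links` are illustrative only and do not appear elsewhere in this notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"  # site root that the scraped hrefs lack

# Two relative links of the kind printed by the previous cell.
relative_links = ["/wiki/Luxembourg", "/wiki/South_Sudan"]

# urljoin resolves each relative path against the base URL.
full_links = [urljoin(BASE_URL, href) for href in relative_links]
print(full_links)
```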

In [26]:
def getAdditionalDetails(url):
    try:
        country_page = getHTMLContent('https://en.wikipedia.org' + url)
        table = country_page.find('table', {'class': 'infobox geography vcard'})  # Class chosen after inspecting the page source
        additional_details = []
        read_content = False
        for tr in table.find_all('tr'):
            if (tr.get('class') == ['mergedtoprow'] and not read_content):  # This class marks the path to the economic information on the page
                link = tr.find('a')
                if (link and (link.get_text().strip() == 'Area' or
                   (link.get_text().strip() == 'GDP' and tr.find('span').get_text().strip() == '(nominal)'))):
                    read_content = True
                if (link and (link.get_text().strip() == 'Population')):
                    read_content = False
            elif ((tr.get('class') == ['mergedrow'] or tr.get('class') == ['mergedbottomrow']) and read_content):
                additional_details.append(tr.find('td').get_text().strip('\n'))
                if (tr.find('div').get_text().strip() != '•\xa0Total area' and
                   tr.find('div').get_text().strip() != '•\xa0Total'):
                    read_content = False
        return additional_details
    except Exception as error:
        print('Error occured: {}'.format(error))
        return []
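
When a page has no table with that class, `find` returns None, and calling `find_all` on it raises an AttributeError that the `except` branch catches. A more explicit alternative is to check for None first. The sketch below assumes a hypothetical helper, `find_infobox_rows`, and two made-up snippets; it is not part of the function above.

```python
from bs4 import BeautifulSoup

def find_infobox_rows(soup):
    """Hypothetical helper: return the infobox rows, or [] when there is no infobox."""
    table = soup.find('table', {'class': 'infobox geography vcard'})
    if table is None:
        return []
    return table.find_all('tr')

# Made-up snippets (assumption): one page with an infobox, one without.
with_box = BeautifulSoup(
    '<table class="infobox geography vcard"><tr><td>x</td></tr></table>',
    'html.parser')
without_box = BeautifulSoup('<p>no infobox here</p>', 'html.parser')

rows_found = len(find_infobox_rows(with_box))
rows_missing = find_infobox_rows(without_box)
print(rows_found, rows_missing)
```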

**Creating the Dataset**

- We have now identified which information needs to be extracted and how, and wrapped the whole process in the function above. Now we simply go through each row of the country list and compile its data.

In [27]:
data_content = []
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 1:
        print(cells[1].get_text())
        country_link = cells[1].find('a')
        country_info = [cell.text.strip('\n') for cell in cells]
        additional_details = getAdditionalDetails(country_link.get('href'))
        if (len(additional_details) == 4):
            country_info += additional_details
            data_content.append(country_info)

            
dataset = pd.DataFrame(data_content)

 Luxembourg
  Switzerland
 Macau
 Norway
 Ireland
 Qatar
 Iceland
 United States
 Singapore
 Denmark
 Australia
 Netherlands
 Sweden
 Austria
 Hong Kong
 Finland
 San Marino
 Germany
 Canada
 Belgium
 Israel
 France
 United Kingdom
 Japan
 New Zealand
 United Arab Emirates
 Bahamas, The
 Italy
 Puerto Rico
 Korea, South
 Malta
 Spain
 Kuwait
 Brunei
 Cyprus
 Slovenia
 Aruba
 Bahrain
 Taiwan
 Estonia
 Czech Republic
 Portugal
 Saudi Arabia
 Greece
 Slovakia
 Lithuania
 Saint Kitts and Nevis
 Latvia
 Antigua and Barbuda
 Barbados
 Oman
 Hungary
 Seychelles
 Uruguay
 Palau
 Trinidad and Tobago
 Panama
 Chile
 Maldives
 Croatia
 Poland
 Romania
 Costa Rica
 Grenada
 Mauritius
 World
Error occured: 'NoneType' object has no attribute 'find_all'
 Russia
 Malaysia
 Saint Lucia
 Mexico
 China
 Argentina
 Lebanon
 Bulgaria
 Kazakhstan
 Turkey
 Equatorial Guinea
 Brazil
 Montenegro
 Dominican Republic
 Dominica
 Nauru
 Gabon
 Botswana
 Turkmenistan
 Thailand
 Saint Vincent and the Grenadines
 Ser

In [29]:
dataset.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Luxembourg | 113196 | 2,586.4 km2 (998.6 sq mi) (167th) | 0.60% | $69.453 billion[3] (69th) | $113,196[3] (1st) |
| 1 | 2 | Switzerland | 83716 | 41,285 km2 (15,940 sq mi) (132nd) | 4.2 | $704 billion[7] (20th) | $82,950[7] (2nd) |
| 2 | — | Macau | 81151 | 115.3 km2 (44.5 sq mi) | 73.7 | $54.545 billion[3] (83rd) | $81,728[3] (3rd) |
| 3 | 3 | Norway | 77975 | 385,207 km2 (148,729 sq mi)[7] (67thb) | 5.7c | $443 billion[9] (22nd) | $82,711[9] (3rd) |
| 4 | 4 | Ireland | 77771 | 70,273 km2 (27,133 sq mi) (118th) | 2.00 | $384.940 billion[5] (32nd) | $77,771[5] (4th) |


- Our dataset is now compiled, but its columns have no headers. We will add those headers and remove the columns that add no value.

In [30]:
headers = rows[0].find_all('th')
headers = [header.get_text().strip('\n') for header in headers]
headers += ['Total Area', 'Percentage Water', 'Total Nominal GDP', 'Per Capita GDP']
dataset.columns = headers

drop_columns = ['Rank']
dataset.drop(drop_columns, axis = 1, inplace = True)
dataset.sample(3)

dataset.to_csv("Dataset.csv", index = False)
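
The header extraction used in this cell can be checked offline. The sketch below is a minimal illustration: `header_html` is a trimmed copy of the header row printed earlier, and `base_headers` and `all_headers` are illustrative names that are not used in the notebook.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the header row shown in the earlier output.
header_html = """<table><tr>
<th data-sort-type="number">Rank
</th>
<th>Country/Territory
</th>
<th><a href="/wiki/United_States_dollar">US$</a>
</th></tr></table>"""

header_row = BeautifulSoup(header_html, "html.parser").find("tr")

# Each <th> text ends with a newline, which strip('\n') removes.
base_headers = [th.get_text().strip("\n") for th in header_row.find_all("th")]

# The four scraped fields are appended, giving seven column names in total.
all_headers = base_headers + ['Total Area', 'Percentage Water',
                              'Total Nominal GDP', 'Per Capita GDP']
print(all_headers)
```

The seven names line up with the seven unnamed columns (0 to 6) of the DataFrame built in the previous step.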

In [31]:
dataset.head()

| | Country/Territory | US$ | Total Area | Percentage Water | Total Nominal GDP | Per Capita GDP |
|---|---|---|---|---|---|---|
| 0 | Luxembourg | 113196 | 2,586.4 km2 (998.6 sq mi) (167th) | 0.60% | $69.453 billion[3] (69th) | $113,196[3] (1st) |
| 1 | Switzerland | 83716 | 41,285 km2 (15,940 sq mi) (132nd) | 4.2 | $704 billion[7] (20th) | $82,950[7] (2nd) |
| 2 | Macau | 81151 | 115.3 km2 (44.5 sq mi) | 73.7 | $54.545 billion[3] (83rd) | $81,728[3] (3rd) |
| 3 | Norway | 77975 | 385,207 km2 (148,729 sq mi)[7] (67thb) | 5.7c | $443 billion[9] (22nd) | $82,711[9] (3rd) |
| 4 | Ireland | 77771 | 70,273 km2 (27,133 sq mi) (118th) | 2.00 | $384.940 billion[5] (32nd) | $77,771[5] (4th) |


In [32]:
dataset.info()  # TODO: add Gini and HDI - check the country pages

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Country/Territory  180 non-null    object
 1   US$                180 non-null    object
 2   Total Area         180 non-null    object
 3   Percentage Water   180 non-null    object
 4   Total Nominal GDP  180 non-null    object
 5   Per Capita GDP     180 non-null    object
dtypes: object(6)
memory usage: 8.6+ KB
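
Since every column is still stored as text (dtype object), a later analysis step would probably need to strip the footnote markers and rank suffixes from the scraped values. The following is only a sketch under that assumption: `strip_annotations` is a hypothetical helper, not part of the original notebook, and its regular expressions are tuned to the sample values shown above.

```python
import re

def strip_annotations(value):
    """Hypothetical helper: drop footnotes like [3] and a trailing rank like (69th)."""
    value = re.sub(r"\[[^\]]*\]", "", value)                       # bracketed footnotes
    value = re.sub(r"\s*\(\d+(st|nd|rd|th)\w*\)\s*$", "", value)   # trailing rank
    return value.strip()

cleaned_gdp = strip_annotations("$69.453 billion[3] (69th)")
cleaned_area = strip_annotations("385,207 km2 (148,729 sq mi)[7] (67thb)")
print(cleaned_gdp)
print(cleaned_area)
```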


- Using a new function to scrape the population.



In [36]:
def getAdditionalDetailsPop(url):
    try:
        country_page = getHTMLContent('https://en.wikipedia.org' + url)
        table = country_page.find('table', {'class': 'infobox geography vcard'})  # Class chosen after inspecting the page source
        additional_details = []
        read_content = False
        for tr in table.find_all('tr'):
            if (tr.get('class') == ['mergedtoprow'] and not read_content):  # This class marks the path to the economic information on the page
                link = tr.find('a')
                if (link and (link.get_text().strip() == 'Area' or
                   (link.get_text().strip() == 'GDP' and tr.find('span').get_text().strip() == '(nominal)'))):
                    read_content = True
                if (link and (link.get_text().strip() == 'Population')):
                    read_content = True
            elif ((tr.get('class') == ['mergedrow'] or tr.get('class') == ['mergedbottomrow']) and read_content):
                additional_details.append(tr.find('td').get_text().strip('\n'))
                if (tr.find('div').get_text().strip() != '•\xa0Total area' and
                   tr.find('div').get_text().strip() != '•\xa0Total'):
                    read_content = False
        return additional_details
    except Exception as error:
        print('Error occured: {}'.format(error))
        return []

In [37]:
# Creating the dataset with the population

data_content = []
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 1:
        print(cells[1].get_text())
        country_link = cells[1].find('a')
        country_info = [cell.text.strip('\n') for cell in cells]
        additional_details = getAdditionalDetailsPop(country_link.get('href'))
        if (len(additional_details) == 4):
            country_info += additional_details
            data_content.append(country_info)

            
dataset = pd.DataFrame(data_content)

 Luxembourg
  Switzerland
 Macau
 Norway
 Ireland
 Qatar
 Iceland
 United States
 Singapore
 Denmark
 Australia
 Netherlands
 Sweden
 Austria
 Hong Kong
 Finland
 San Marino
 Germany
 Canada
 Belgium
 Israel
 France
 United Kingdom
 Japan
 New Zealand
 United Arab Emirates
 Bahamas, The
 Italy
 Puerto Rico
 Korea, South
 Malta
 Spain
 Kuwait
 Brunei
 Cyprus
 Slovenia
 Aruba
 Bahrain
 Taiwan
 Estonia
 Czech Republic
 Portugal
 Saudi Arabia
 Greece
 Slovakia
 Lithuania
 Saint Kitts and Nevis
 Latvia
 Antigua and Barbuda
 Barbados
 Oman
 Hungary
 Seychelles
 Uruguay
 Palau
 Trinidad and Tobago
 Panama
 Chile
 Maldives
 Croatia
 Poland
 Romania
 Costa Rica
 Grenada
 Mauritius
 World
Error occured: 'NoneType' object has no attribute 'find_all'
 Russia
 Malaysia
 Saint Lucia
 Mexico
 China
 Argentina
 Lebanon
 Bulgaria
 Kazakhstan
 Turkey
 Equatorial Guinea
 Brazil
 Montenegro
 Dominican Republic
 Dominica
 Nauru
 Gabon
 Botswana
 Turkmenistan
 Thailand
 Saint Vincent and the Grenadines
 Ser

In [38]:
dataset.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 8 | Singapore | 63987 | 725.7 km2 (280.2 sq mi)[3] (176th) | 5,703,600[4][Note 1] (115th) | $391.875 billion[5] (31st) | $68,487[5] (7th) |
| 1 | 9 | Denmark | 59795 | 42,933 km2 (16,577 sq mi)[3] (130th) | 5,824,857[4] (114th) | $370 billion[7][N 6] (34th) | $63,829[7] (6th) |
| 2 | 16 | Germany | 46563 | 357,022 km2 (137,847 sq mi)[4] (62nd) | 83,166,711[5] (18th) | $3.863 trillion[6] (4th) | $46,653[6] (16th) |
| 3 | 22 | Japan | 40846 | 377,975 km2 (145,937 sq mi)[2] (61st) | 125,930,000[3] (11th) | $5.413 trillion[5] (3rd) | $43,043 (22nd) |
| 4 | 32 | Cyprus | 27719 | 9,251 km2 (3,572 sq mi) (162nd) | 1,189,265[c][5][6] (158th) | $24.996 billion[9] (114th) | $28,888[9] (33rd) |


In [39]:
headers = rows[0].find_all('th')
headers = [header.get_text().strip('\n') for header in headers]
headers += ['Total Area', 'Population', 'Total Nominal GDP', 'Per Capita GDP']
dataset.columns = headers

drop_columns = ['Rank', 'US$', 'Total Area', 'Total Nominal GDP', 'Per Capita GDP' ]
dataset.drop(drop_columns, axis = 1, inplace = True)
dataset.sample(3)

dataset.to_csv("Dataset_pop.csv", index = False)

In [43]:
dataset.head(20)

| | Country/Territory | Population |
|---|---|---|
| 0 | Singapore | 5,703,600[4][Note 1] (115th) |
| 1 | Denmark | 5,824,857[4] (114th) |
| 2 | Germany | 83,166,711[5] (18th) |
| 3 | Japan | 125,930,000[3] (11th) |
| 4 | Cyprus | 1,189,265[c][5][6] (158th) |
| 5 | Taiwan | 23,780,452[9] (56th) |
| 6 | Maldives | 379,270[8] (178th) |
| 7 | Serbia | 6,963,764 (excluding Kosovo)[2]8,963,753\n(inc... |
| 8 | Libya | 6,871,287[3] (108th) |
| 9 | Georgia | 3,716,858 [a][3] (131st) |


In [41]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Country/Territory  17 non-null     object
 1   Population         17 non-null     object
dtypes: object(2)
memory usage: 400.0+ bytes
