# 04 - More BeautifulSoup practice
Using the [sovereign states](https://en.wikipedia.org/wiki/List_of_sovereign_states) Wikipedia page, you're going to identify different parts of the unconventionally 

## 1. Select all the rows of the first table on the page

In [302]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [303]:
# sovereign states wikipedia url
states_url = 'https://en.wikipedia.org/wiki/List_of_sovereign_states'

In [304]:
# request the page
states_r = requests.get(states_url)

In [305]:
# create a beautifulsoup object
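# name the parser explicitly so results do not depend on which parsers are installed
states_bs = BeautifulSoup(states_r.text, 'html.parser')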

In [306]:
# select the first table
states_table = states_bs.find('table')

In [307]:
# select all the rows
states_trs = states_table.find_all('tr')

## 2. Extract text only from all td tags
You'll do this to identify where the two tables separate.

In [136]:
# loop through all the rows and extract only the text
full_table_data = []
for tr in states_trs[1:]:
    # select all the td tags nested within each tr object
    tds = tr.find_all('td')
    cells = []
    for td in tds:
        cells.append(td.text)
    full_table_data.append(cells)
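`td.text` returns every character inside a cell, including the newline that closes each `td` in the page source. As a minimal sketch, assuming a small hand-written fragment rather than the live page, `get_text` with `strip=True` could be used to tidy the cells at extraction time:

```python
from bs4 import BeautifulSoup

# hypothetical fragment standing in for one row of the real table
html = "<table><tr><td><b>Niue</b>\n</td><td><span>A</span> None\n</td></tr></table>"
row = BeautifulSoup(html, "html.parser").find("tr")

# .text keeps the trailing newline of each cell
raw_cells = [td.text for td in row.find_all("td")]
# get_text(" ", strip=True) strips each text fragment and joins them with a space
clean_cells = [td.get_text(" ", strip=True) for td in row.find_all("td")]

print(raw_cells)
print(clean_cells)
```

Whether to strip here or later in pandas is a design choice; stripping early keeps the lists easier to read while exploring.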

In [151]:
# separate the first table from the second with list indexing
states_df = pd.DataFrame(full_table_data[3:226], columns=None)

In [150]:
states_df.head()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Albania – Republic of Albania\n | A UN member state\n | A None\n | \n |
| 1 | Algeria – People's Democratic Republic of Alg... | A UN member state\n | A None\n | \n |
| 2 | Andorra – Principality of Andorra\n | A UN member state\n | A None\n | Andorra is a co-principality in which the offi... |
| 3 | Angola – Republic of Angola\n | A UN member state\n | A None\n | \n |
| 4 | Antigua and Barbuda\n | A UN member state\n | A None\n | Antigua and Barbuda is a Commonwealth realm[e]... |


In [139]:
# separate the second table from the first with list indexing
other_states_df = pd.DataFrame(full_table_data[229:-2])

In [152]:
other_states_df.head()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Abkhazia – Republic of Abkhazia\n | D No membership\n | B Claimed by Georgia Claimed by North Korea Cl... | Recognised by Russia, Nauru, Nicaragua, Syria,... |
| 1 | Artsakh – Republic of Artsakh[ag]\n | D No membership\n | B Claimed by Georgia Claimed by North Korea Cl... | A de facto independent state,[56][57][58] reco... |
| 2 | Cook Islands\n | D Member of eight UN specialized agencies\n | A None(See political status)\n | A state in free association with New Zealand, ... |
| 3 | Kosovo – Republic of Kosovo\n | D Member of two UN specialized agencies\n | B Claimed by Georgia Claimed by North Korea Cl... | Pursuant to United Nations Security Council Re... |
| 4 | Niue\n | D Member of five UN specialized agencies\n | A None(See political status)\n | A state in free association with New Zealand, ... |
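The slice positions above were found by inspecting the data frames by eye. As a rough sketch of an alternative, assuming that header rows contain no `td` tags and that separator rows start with a `td` carrying a `colspan` attribute (the miniature table here is invented for illustration), the candidate boundaries could be listed programmatically:

```python
from bs4 import BeautifulSoup

# hypothetical miniature of the layout: data rows split by a colspan separator row
html = """
<table>
  <tr><th>Name</th><th>Membership</th></tr>
  <tr><td>Albania</td><td>UN member state</td></tr>
  <tr><td>Algeria</td><td>UN member state</td></tr>
  <tr><td colspan="2">Other states</td></tr>
  <tr><td>Abkhazia</td><td>No membership</td></tr>
</table>
"""
rows = BeautifulSoup(html, "html.parser").find_all("tr")

# a row is a boundary candidate if it has no td cells
# or if its first td spans several columns
boundaries = []
for i, tr in enumerate(rows):
    tds = tr.find_all("td")
    if not tds or tds[0].has_attr("colspan"):
        boundaries.append(i)

print(boundaries)
```

On the real page, the printed indices would still need to be checked against the rendered table before being used as slice positions.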


## 3. Explore further the contents of each table
### a. States (table 1)
![image](assets/sovereign-states-tr.png)

In [144]:
# now separate the list based on the information we found from the data frames above and gather the nested tags from each td tag
# we also don't want rows where there is a td of colspan 4
states_list = []
for tr in states_trs[4:226]:
    tds = tr.find_all('td')
    # skip rows with no td tags and rows whose first td has a colspan attribute
    if not tds or tds[0].has_attr('colspan'):
        continue
    # keep the td tags themselves (not just their text) so nested tags stay accessible
    states_list.append(list(tds))

This is the same information as the image above, but held as BeautifulSoup objects rather than raw HTML. Identify what information might be useful apart from the text within each td tag.

- 
- 
- 

Find those attributes in a single row, using Antigua and Barbuda as an example.

In [157]:
antigua_barbuda = states_list[5]

In [158]:
antigua_barbuda

[<td style="vertical-align:top;"><span id="Antigua_and_Barbuda"></span><b><span class="flagicon"><img alt="" class="thumbborder" data-file-height="460" data-file-width="690" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/89/Flag_of_Antigua_and_Barbuda.svg/23px-Flag_of_Antigua_and_Barbuda.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/89/Flag_of_Antigua_and_Barbuda.svg/35px-Flag_of_Antigua_and_Barbuda.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/89/Flag_of_Antigua_and_Barbuda.svg/45px-Flag_of_Antigua_and_Barbuda.svg.png 2x" width="23"/> </span><a href="/wiki/Antigua_and_Barbuda" title="Antigua and Barbuda">Antigua and Barbuda</a></b>
 </td>,
 <td><span style="display:none">A</span> UN member state
 </td>,
 <td><span style="display:none">A</span> None
 </td>,
 <td style="vertical-align:top;text-align:left;font-size:90%;">Antigua and Barbuda is a <a href="/wiki/Commonwealth_realm" title="Commonwealth realm">Commonwea

In [164]:
# flag image src
antigua_barbuda[0].find('img')['src']

'//upload.wikimedia.org/wikipedia/commons/thumb/8/89/Flag_of_Antigua_and_Barbuda.svg/23px-Flag_of_Antigua_and_Barbuda.svg.png'

In [167]:
# link to wikipedia page
antigua_barbuda[0].find('a')['href']

'/wiki/Antigua_and_Barbuda'
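Both values are relative: the `src` is protocol-relative (it starts with `//`) and the `href` is relative to the site root. A short sketch of how they could be turned into full URLs with the standard library's `urljoin`, using the page URL from step 1 as the base:

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states"
flag_src = (
    "//upload.wikimedia.org/wikipedia/commons/thumb/8/89/"
    "Flag_of_Antigua_and_Barbuda.svg/23px-Flag_of_Antigua_and_Barbuda.svg.png"
)
wiki_href = "/wiki/Antigua_and_Barbuda"

# urljoin borrows the scheme (and, for the href, the host) from the base URL
flag_url = urljoin(page_url, flag_src)
wiki_url = urljoin(page_url, wiki_href)

print(flag_url)
print(wiki_url)
```

Storing absolute URLs makes the columns usable outside the context of the original page.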

In [174]:
# list of citations
for sup in antigua_barbuda[3].find_all('sup'):
    print(sup)

<sup class="reference" id="cite_ref-realm_7-0"><a href="#cite_note-realm-7">[e]</a></sup>
<sup class="reference" id="cite_ref-autonomous_9-0"><a href="#cite_note-autonomous-9">[f]</a></sup>
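Each `sup` carries three pieces of information: its own `id`, the visible label, and the anchor it points to. A minimal sketch of collecting all three, assuming a shortened copy of the cell (the two `sup` tags are the ones printed above; the surrounding text is cut down):

```python
from bs4 import BeautifulSoup

# shortened stand-in for antigua_barbuda[3]
html = (
    '<td>Antigua and Barbuda is a Commonwealth realm'
    '<sup class="reference" id="cite_ref-realm_7-0">'
    '<a href="#cite_note-realm-7">[e]</a></sup>'
    '<sup class="reference" id="cite_ref-autonomous_9-0">'
    '<a href="#cite_note-autonomous-9">[f]</a></sup>'
    '</td>'
)
td = BeautifulSoup(html, "html.parser").find("td")

# one dict per citation: the sup's id, its visible label, and the note it links to
citations = [
    {"id": sup["id"], "label": sup.get_text(), "target": sup.a["href"]}
    for sup in td.find_all("sup", class_="reference")
]

for citation in citations:
    print(citation)
```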


### b. Other states (table 2)
![image](assets/sovereign-states-other-tr-1.png)
![image](assets/sovereign-states-other-tr-2.png)

In [187]:
other_states_list = []
for tr in states_trs[230:-2]:
    tds = tr.find_all('td')
    cells = []
    for td in tds:
        cells.append(td)
    other_states_list.append(cells)

This is the same information as the images above, but held as BeautifulSoup objects rather than raw HTML. Identify what information might be useful apart from the text within each td tag.

- 
- 
- 

Find those attributes in a single row, using Abkhazia as an example.

In [189]:
abkhazia = other_states_list[0]

In [195]:
abkhazia

[<td style="vertical-align:top;"><span id="Abkhazia"></span><b><span class="flagicon"><img alt="" class="thumbborder" data-file-height="300" data-file-width="600" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/7/7a/Flag_of_the_Republic_of_Abkhazia.svg/23px-Flag_of_the_Republic_of_Abkhazia.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/7/7a/Flag_of_the_Republic_of_Abkhazia.svg/35px-Flag_of_the_Republic_of_Abkhazia.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/7/7a/Flag_of_the_Republic_of_Abkhazia.svg/46px-Flag_of_the_Republic_of_Abkhazia.svg.png 2x" width="23"/> </span><a href="/wiki/Abkhazia" title="Abkhazia">Abkhazia</a></b> – Republic of Abkhazia
 </td>,
 <td style="background:LemonChiffon;"><span style="display:none">D</span> No membership
 </td>,
 <td style="background:LightCoral;"><span style="display:none">B </span>Claimed by <a href="#Georgia">Georgia</a><span style="display:none"> Claimed by <a href="#Korea_North

In [222]:
# what is in the "display:none" span? it is labeled "claimed by/disputed by"; is this valuable information?
str(abkhazia[2]).split('</a>')

['<td style="background:LightCoral;"><span style="display:none">B </span>Claimed by <a href="#Georgia">Georgia',
 '<span style="display:none"> Claimed by <a href="#Korea_North">North Korea',
 ' Claimed by <a href="#Serbia">Serbia',
 ' Claimed by <a href="#Somalia">Somalia',
 ' Claimed by the <a href="#China">People\'s Republic of China',
 ' Claimed by the <a href="#Taiwan">Republic of China',
 ' Claimed by <a href="#Korea_South">South Korea',
 ' Claimed by <a href="#Azerbaijan">Azerbaijan',
 ' Claimed by the <a href="#Cyprus">Republic of Cyprus',
 ' Disputed by <a href="#Israel">Israel',
 ' Claimed by <a href="#Mauritius">Mauritius',
 ' Claimed by <a href="#Morocco">Morocco',
 ' Claimed by <a href="#Moldova">Moldova',
 ' Claimed by <a href="#Mali">Mali',
 ' Claimed by <a href="#Spain">Spain',
 ' Claimed by <a href="#Argentina">Argentina',
 '</span>\n</td>']
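Splitting the string on `</a>` is fine for a quick look, but the link tags can also be walked directly. The sketch below, which assumes a shortened copy of the cell with only three of the links, separates links a reader would see from links nested inside a `display:none` span:

```python
from bs4 import BeautifulSoup

# shortened copy of abkhazia[2]: one visible link, two inside the hidden span
html = (
    '<td style="background:LightCoral;"><span style="display:none">B </span>'
    'Claimed by <a href="#Georgia">Georgia</a>'
    '<span style="display:none"> Claimed by <a href="#Korea_North">North Korea</a>'
    ' Claimed by <a href="#Serbia">Serbia</a></span>\n</td>'
)
td = BeautifulSoup(html, "html.parser").find("td")

visible, hidden = [], []
for a in td.find_all("a"):
    # a link is hidden if any enclosing span has style="display:none"
    if a.find_parent("span", attrs={"style": "display:none"}):
        hidden.append(a.get_text())
    else:
        visible.append(a.get_text())

print(visible)
print(hidden)
```

Because the same hidden text shows up for several rows in the data frame from step 2, it may be a sorting aid rather than data about the row itself; that is an inference from the output, not something the page states.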

In [203]:
# what about the background color of the membership column? 
abkhazia[1]['style']

'background:LemonChiffon;'
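The `style` attribute is a plain string of `property:value;` pairs. If only the colour name is wanted, a small helper could split it into a dictionary. `parse_style` is a hypothetical name for this sketch and is not part of BeautifulSoup:

```python
def parse_style(style):
    """Turn an inline style string into a dict of property -> value."""
    declarations = {}
    for part in style.split(";"):
        # skip the empty piece left after the final semicolon
        if ":" in part:
            prop, value = part.split(":", 1)
            declarations[prop.strip().lower()] = value.strip()
    return declarations

membership_style = parse_style("background:LemonChiffon;")
notes_style = parse_style("vertical-align:top;text-align:left;font-size:90%;")

print(membership_style)
print(notes_style)
```

With this, `parse_style(abkhazia[1]['style']).get('background')` would yield the colour name alone.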

In [233]:
# "secondary" name? after " – "
str(abkhazia[0]).split('\xa0–')[1]

' Republic of Abkhazia\n</td>'
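Splitting the HTML string leaves `</td>` in the result and raises an `IndexError` for cells without a dash. A sketch of a gentler variant that works on the cell's text instead, assuming the text looks like what `td.text` returned in step 2 (`split_name` is a hypothetical helper name):

```python
def split_name(cell_text):
    """Split 'Short name – Formal name' into (short_name, formal_name)."""
    # partition never fails: without a dash the second and third parts are empty
    short_name, _, formal_name = cell_text.partition("–")
    # strip() also removes the non-breaking space (\xa0) before the dash
    return short_name.strip(), (formal_name.strip() or None)

abkhazia_name = split_name("Abkhazia\xa0– Republic of Abkhazia\n")
niue_name = split_name("Niue\n")

print(abkhazia_name)
print(niue_name)
```

Citation markers such as `[ag]` would still be attached to the formal name and would need separate handling.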

## 4. Extract the information from step 3 for each row
### a. States

In [269]:
states_list_4 = []
# for each tr
for tr in states_trs[4:226]:
    tds = tr.find_all('td')
    # tds[0] is the first td tag; skip rows with no td tags or where it has a colspan
    if not tds or tds[0].has_attr('colspan'):
        continue
    # placeholders: flag image src, wiki link, list of citation ids
    new_row = [None, None, []]
    img = tds[0].find('img')
    if img:
        new_row[0] = img['src']
    link = tds[0].find('a')
    if link:
        new_row[1] = link['href']
    for sup in tds[3].find_all('sup'):
        new_row[2].append(sup['id'])
    # then the text of every td in the row
    for td in tds:
        new_row.append(td.text)
    states_list_4.append(new_row)

In [270]:
states_df_4 = pd.DataFrame(states_list_4)

In [275]:
states_columns_4 = ['flag_img', 'wiki_link', 'citations', 'name', 'un_membership', 'dispute', 'more_info']
states_df_4.columns = states_columns_4

In [278]:
states_df_4.head(5)

| | flag_img | wiki_link | citations | name | un_membership | dispute | more_info |
|---|---|---|---|---|---|---|---|
| 0 | //upload.wikimedia.org/wikipedia/commons/thumb... | /wiki/Afghanistan | [] | Afghanistan – Islamic Republic of Afghanistan\n | A UN member state\n | A None\n | \n |
| 1 | //upload.wikimedia.org/wikipedia/commons/thumb... | /wiki/Albania | [] | Albania – Republic of Albania\n | A UN member state\n | A None\n | \n |
| 2 | //upload.wikimedia.org/wikipedia/commons/thumb... | /wiki/Algeria | [] | Algeria – People's Democratic Republic of Alg... | A UN member state\n | A None\n | \n |
| 3 | //upload.wikimedia.org/wikipedia/commons/thumb... | /wiki/Andorra | [cite_ref-6] | Andorra – Principality of Andorra\n | A UN member state\n | A None\n | Andorra is a co-principality in which the offi... |
| 4 | //upload.wikimedia.org/wikipedia/commons/thumb... | /wiki/Angola | [] | Angola – Republic of Angola\n | A UN member state\n | A None\n | \n |
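The text columns still carry trailing newlines, and `un_membership` and `dispute` start with a single letter that comes from the `display:none` span seen in step 3. A possible clean-up with pandas string methods, sketched on two rows copied from the output above (long columns omitted):

```python
import pandas as pd

# two rows copied from the output above, text columns only
df = pd.DataFrame({
    "name": ["Albania – Republic of Albania\n", "Angola – Republic of Angola\n"],
    "un_membership": ["A UN member state\n", "A UN member state\n"],
    "dispute": ["A None\n", "A None\n"],
})

# remove leading and trailing whitespace, including the newline
for col in ["name", "un_membership", "dispute"]:
    df[col] = df[col].str.strip()

# drop the single-letter prefix that comes from the hidden span
for col in ["un_membership", "dispute"]:
    df[col] = df[col].str.replace(r"^[A-Z]\s+", "", regex=True)

print(df)
```

The regular expression assumes the prefix is always one capital letter followed by whitespace, which matches the rows shown here but has not been checked against the whole table.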


### b. Other states

In [291]:
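other_states_list_4 = []
# for each tr
for tr in states_trs[230:-2]:
    tds = tr.find_all('td')
    # tds[0] is the first td tag; skip rows with no td tags or where it has a colspan
    if not tds or tds[0].has_attr('colspan'):
        continue
    new_row = [None, None, []]
    # style attribute of the membership cell, e.g. 'background:LemonChiffon;'
    new_row[0] = tds[1].get('style')
    # raw HTML of the name cell split on the dash; when a dash is present,
    # the piece after it holds the secondary name
    new_row[1] = str(tds[0]).split('–')
    # raw HTML of the dispute cell split after each link, when it has links
    if tds[2].find('a'):
        new_row[2] = str(tds[2]).split('</a>')
    for td in tds:
        new_row.append(td.text)
    other_states_list_4.append(new_row)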

In [293]:
other_states_df_4 = pd.DataFrame(other_states_list_4)

In [297]:
other_states_df_4.columns = ['background_color', 'secondary_name', 'claimed_disputed', 'name', 'un_membership', 'notes', 'other_notes']

In [300]:
other_states_df_4.head()

| | background_color | secondary_name | claimed_disputed | name | un_membership | notes | other_notes |
|---|---|---|---|---|---|---|---|
| 0 | background:LemonChiffon; | `[<td style="vertical-align:top;"><span id="Abk...` | `[<td style="background:LightCoral;"><span styl...` | Abkhazia – Republic of Abkhazia\n | D No membership\n | B Claimed by Georgia Claimed by North Korea Cl... | Recognised by Russia, Nauru, Nicaragua, Syria,... |
| 1 | background:LemonChiffon; | `[<td style="vertical-align:top;"><span id="Art...` | `[<td style="background:LightCoral;"><span styl...` | Artsakh – Republic of Artsakh[ag]\n | D No membership\n | B Claimed by Georgia Claimed by North Korea Cl... | A de facto independent state,[56][57][58] reco... |
| 2 | background:lightgreen; | `[<td style="vertical-align:top;"><span id="Coo...` | `[<td><span style="display:none">A</span> None<...` | Cook Islands\n | D Member of eight UN specialized agencies\n | A None(See political status)\n | A state in free association with New Zealand, ... |
| 3 | background:lightgreen; | `[<td style="vertical-align:top;"><span id="Kos...` | `[<td style="background:LightCoral;"><span styl...` | Kosovo – Republic of Kosovo\n | D Member of two UN specialized agencies\n | B Claimed by Georgia Claimed by North Korea Cl... | Pursuant to United Nations Security Council Re... |
| 4 | background:lightgreen; | `[<td style="vertical-align:top;"><span id="Niu...` | `[<td><span style="display:none">A</span> None<...` | Niue\n | D Member of five UN specialized agencies\n | A None(See political status)\n | A state in free association with New Zealand, ... |
