<a href="https://colab.research.google.com/github/drshahizan/python-web/blob/main/Malaysia_state.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping - Beautiful Soup

In this notebook, we scrape a web page using the Beautiful Soup library. The page we chose is [States and federal territories of Malaysia](https://en.wikipedia.org/wiki/States_and_federal_territories_of_Malaysia). We extract all the data from the States table, which includes the flag, emblem, state, capital, population, and other fields.

**Team Members:**

1.   MUHAMMAD DINIE HAZIM BIN AZALI
2.   RADIN DAFINA BINTI RADIN ZULKAR NAIN
3.   ADRINA ASYIQIN BINTI MD ADHA
4.   KELVIN EE

First, we install the required libraries so that they are available in our Colab environment.

In [9]:
!pip install beautifulsoup4
!pip install requests

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Next, we import the libraries that the rest of the notebook uses.

In [10]:
# Importing the required libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup

We then download the HTML of the chosen page into our Colab session.

In [11]:
# Downloading contents of the web page
url = 'https://en.wikipedia.org/wiki/States_and_federal_territories_of_Malaysia'
data = requests.get(url).text
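
The request above assumes that the download succeeds. A minimal sketch of a more defensive version is shown here, assuming we want to fail early on HTTP errors and avoid waiting forever on a slow connection. The function name fetch_html, the 10-second timeout, and the User-Agent string are illustrative assumptions and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text of url, raising an exception on HTTP errors."""
    # Hypothetical User-Agent string; replace it with one that identifies your project
    headers = {'User-Agent': 'malaysia-states-notebook/0.1 (educational example)'}
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text


# Usage (requires network access):
# data = fetch_html('https://en.wikipedia.org/wiki/States_and_federal_territories_of_Malaysia')
```

With this version, a failed download stops the notebook at the request step instead of producing confusing errors later when the table cannot be found.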

Next, we create a BeautifulSoup object from the downloaded HTML so that we can inspect the page and find the table that we want.

In [12]:
soup = BeautifulSoup(data, 'html.parser')
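
Once the object exists, a few calls help to inspect it. The sketch below uses a small made-up page (sample_html is a hypothetical stand-in, not the Wikipedia page) so that it runs without a network connection; the same calls should work on the soup object created above.

```python
from bs4 import BeautifulSoup

# Small hypothetical page, used only to illustrate the inspection calls
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <table class="wikitable"><tr><td>A</td></tr></table>
    <table class="wikitable sortable"><tr><td>B</td></tr></table>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# The <title> text is a quick check that the expected page was parsed
page_title = sample_soup.title.string
print(page_title)

# Counting the <table> tags shows how many candidate tables there are
table_count = len(sample_soup.find_all('table'))
print(table_count)

# prettify() returns the parsed document as an indented string
print(sample_soup.prettify()[:200])
```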

The page contains more than one table. We list the class attribute of each table so that we can use it to pick the correct one.

In [13]:
# Verifying tables and their classes
print('Classes of each table:')
for table in soup.find_all('table'):
    print(table.get('class'))

Classes of each table:
['box-More_citations_needed', 'plainlinks', 'metadata', 'ambox', 'ambox-content', 'ambox-Refimprove']
['sidebar', 'sidebar-collapse', 'nomobile', 'nowraplinks', 'vcard', 'hlist']
['wikitable']
['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner']
['navbox-columns-table']
None
None
None
['wikitable', 'noresize', 'sortable']
['wikitable', 'noresize', 'sortable']
['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner']
['nowraplinks', 'hlist', 'mw-collapsible', 'autocollapse', 'navbox-inner']
['nowraplinks', 'navbox-subgroup']
['nowraplinks', 'navbox-subgroup']
['nowraplinks', 'navbox-subgroup']
['nowraplinks', 'navbox-subgroup']
['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner']


Two tables have the class attribute wikitable, noresize and sortable. We want the first of them, which is the ninth table in the list above.

In [14]:
# Creating list with all tables
tables = soup.find_all('table')

# Looking for the first table whose class attribute is exactly 'wikitable noresize sortable'
table = soup.find('table', {'class':"wikitable noresize sortable"})
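
Passing the full class string to find matches the class attribute as an exact string, so the order of the class names matters. The sketch below, which uses a made-up sample_html, contrasts this with a CSS selector that matches the same classes in any order. It is an alternative we could use, not the method of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical tables; the second and third carry the same classes in a different order
sample_html = """
<table class="wikitable"><tr><td>first</td></tr></table>
<table class="sortable wikitable noresize"><tr><td>second</td></tr></table>
<table class="wikitable noresize sortable"><tr><td>third</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Exact string match: only a table whose class attribute is written in this order
exact = sample_soup.find('table', {'class': 'wikitable noresize sortable'})
exact_text = exact.get_text(strip=True)

# CSS selector: the first table that carries all three classes, in any order
by_selector = sample_soup.select_one('table.wikitable.noresize.sortable')
selector_text = by_selector.get_text(strip=True)

print(exact_text, selector_text)
```

If the page ever reorders the class names, the exact string match would return None, while the selector would still find the table.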

Once we have the correct table, we can extract its data to create our own dataframe.

In [15]:
# Column names of the dataframe
column_names = ['Flag', 'Emblem', 'State', 'Capital', 'Royal_capital', 'Population', 'Area_km2', 'Licence_plate', 'Area_code', 'Abbr', 'ISO', 'HDI', 'Region', 'Head_of_state', 'Head_of_government']

# Collecting data
rows = []
for row in table.tbody.find_all('tr'):
    # Find all data cells in the row; header rows have no <td> cells
    columns = row.find_all('td')

    if columns:
        # The first two cells hold images, so we keep the image source URL
        flag = columns[0].find('img')['src']
        emblem = columns[1].find('img')['src']
        # The remaining cells hold text
        state = columns[2].text.strip()
        capital = columns[3].text.strip()
        royal_capital = columns[4].text.strip()
        population = columns[5].text.strip()
        area_km2 = columns[6].text.strip()
        licence_plate = columns[7].text.strip()
        area_code = columns[8].text.strip()
        abbr = columns[9].text.strip()
        iso = columns[10].text.strip()
        hdi = columns[11].text.strip()
        region = columns[12].text.strip()
        head_of_state = columns[13].text.strip()
        head_of_government = columns[14].text.strip()

        # Store the row as a dictionary keyed by column name
        rows.append({'Flag': flag, 'Emblem': emblem, 'State': state, 'Capital': capital, 'Royal_capital': royal_capital, 'Population': population, 'Area_km2': area_km2, 'Licence_plate': licence_plate, 'Area_code': area_code, 'Abbr': abbr, 'ISO': iso, 'HDI': hdi, 'Region': region, 'Head_of_state': head_of_state, 'Head_of_government': head_of_government})

# DataFrame.append was removed in pandas 2.0, so we build the dataframe once from all rows
df = pd.DataFrame(rows, columns=column_names)
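
The loop above assumes that the first two cells of every data row contain an image. If a cell has no image, find('img') returns None and indexing it with ['src'] raises a TypeError. A minimal sketch of a safer lookup follows; the helper name img_src and the sample row are hypothetical.

```python
from bs4 import BeautifulSoup


def img_src(cell):
    """Return the src of the first <img> in cell, or None if there is none."""
    img = cell.find('img')
    return img.get('src') if img is not None else None


# Hypothetical row: the first cell has an image, the second does not
sample_row = BeautifulSoup(
    '<table><tr><td><img src="//example.org/flag.png"></td><td>no image</td></tr></table>',
    'html.parser',
)
cells = sample_row.find_all('td')

with_image = img_src(cells[0])
without_image = img_src(cells[1])
print(with_image, without_image)
```

In the loop, flag and emblem could then be read with img_src(columns[0]) and img_src(columns[1]), leaving missing images as empty values in the dataframe.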

In [16]:
df

Unnamed: 0,Flag,Emblem,State,Capital,Royal_capital,Population,Area_km2,Licence_plate,Area_code,Abbr,ISO,HDI,Region,Head_of_state,Head_of_government
0,//upload.wikimedia.org/wikipedia/commons/thumb...,//upload.wikimedia.org/wikipedia/commons/thumb...,Johor,Johor Bahru,Muar,3794000,19166,J,"07, 06 (Muar & Tangkak)",JHR,MY-01,0.825,Peninsular Malaysia,Sultan,Menteri Besar
1,//upload.wikimedia.org/wikipedia/commons/thumb...,//upload.wikimedia.org/wikipedia/commons/thumb...,Kedah,Alor Setar,Anak Bukit,2194100,9492,K,04,KDH,MY-02,0.808,Peninsular Malaysia,Sultan,Menteri Besar
2,//upload.wikimedia.org/wikipedia/commons/thumb...,//upload.wikimedia.org/wikipedia/commons/thumb...,Kelantan,Kota Bharu,Kubang Kerian,1928800,15040,D,09,KTN,MY-03,0.779,Peninsular Malaysia,Sultan,Menteri Besar
3,//upload.wikimedia.org/wikipedia/commons/thumb...,//upload.wikimedia.org/wikipedia/commons/thumb...,Malacca,Malacca City,—,937500,1712,M,06,MLK,MY-04,0.835,Peninsular Malaysia,Yang di-Pertua Negeri (Governor),Chief Minister
4,//upload.wikimedia.org/wikipedia/commons/thumb...,//upload.wikimedia.org/wikipedia/commons/thumb...,Negeri Sembilan,Seremban,Seri Menanti,1129100,6658,N,06,NSN,MY-05,0.829,Peninsular Malaysia,Yang di-Pertuan Besar(Grand Ruler),Menteri Besar
5,//upload.wikimedia.org/wikipedia/commons/thumb...,//upload.wikimedia.org/wikipedia/commons/thumb...,Pahang,Kuantan,Pekan,1684600,35965,C,"09, 03 (Genting Highlands), 05 (Cameron)",PHG,MY-06,0.804,Peninsular Malaysia,Sultan,Menteri Besar
6,//upload.wikimedia.org/wikipedia/commons/thumb...,//upload.wikimedia.org/wikipedia/commons/thumb...,Penang,George Town,—,1774400,1049,P,04,PNG,MY-07,0.845,Peninsular Malaysia,Yang di-Pertua Negeri (Governor),Chief Minister
7,//upload.wikimedia.org/wikipedia/commons/thumb...,//upload.wikimedia.org/wikipedia/commons/thumb...,Perak,Ipoh,Kuala Kangsar,2508900,21146,A,05,PRK,MY-08,0.816,Peninsular Malaysia,Sultan,Menteri Besar
8,//upload.wikimedia.org/wikipedia/commons/thumb...,//upload.wikimedia.org/wikipedia/commons/thumb...,Perlis,Kangar,Arau,255400,819,R,04,PLS,MY-09,0.805,Peninsular Malaysia,Raja,Menteri Besar
9,//upload.wikimedia.org/wikipedia/commons/thumb...,//upload.wikimedia.org/wikipedia/commons/thumb...,Sabah,Kota Kinabalu,—,3833000,73621,S,087–089,SBH,MY-12,0.71,East Malaysia,Yang di-Pertua Negeri (Governor),Chief Minister
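
The Flag and Emblem values start with //, which means they are protocol-relative URLs that cannot be opened on their own. A short sketch of how they could be turned into absolute URLs with the standard library is shown here; the relative_src value is a made-up path shaped like the ones in the table.

```python
from urllib.parse import urljoin

page_url = 'https://en.wikipedia.org/wiki/States_and_federal_territories_of_Malaysia'

# Hypothetical protocol-relative path, shaped like the values in the Flag column
relative_src = '//upload.wikimedia.org/example/flag.png'

# urljoin takes the scheme (https) from the page URL and keeps the host of the path
absolute_src = urljoin(page_url, relative_src)
print(absolute_src)
```

The same call could be applied to a whole column, for example with df['Flag'].apply(lambda src: urljoin(url, src)).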


In [28]:
df.to_csv('Malaysia_states.csv', index=False)
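
When the CSV file is read back later, pandas infers a type for each column, so text values that look like numbers may change. The sketch below uses a tiny made-up dataframe and an in-memory buffer in place of Malaysia_states.csv to show how an area code such as 04 can lose its leading zero, and how the dtype argument of read_csv keeps it as text.

```python
import io

import pandas as pd

# Tiny hypothetical sample with the same kind of values as the Area_code column
sample = pd.DataFrame({'State': ['Kedah', 'Perak'], 'Area_code': ['04', '05']})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default parsing infers integers, so the leading zero is dropped
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Forcing the column to str keeps the value exactly as it was written
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={'Area_code': str})

print(inferred['Area_code'].tolist())
print(as_text['Area_code'].tolist())
```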