<a href="https://colab.research.google.com/github/ashioyajotham/CIA-Factbook-Data-Dashboard/blob/main/angolafactbook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scraping Data on Angola from CIA's World Fact Book

## Scraping Foundations

### Downloading the webpage using `requests` library

In [None]:
!pip install requests --upgrade --quiet


In [None]:
import requests

In [None]:
angola_url="https://www.cia.gov/the-world-factbook/countries/angola/"

In [None]:
# you might need to mount google drive on colab
# run the following code:
# from google.colab import drive
# drive.mount('/content/gdrive')

In [None]:
response=requests.get(angola_url)

In [None]:
help(requests.get)

Help on function get in module requests.api:

get(url, params=None, **kwargs)
    Sends a GET request.
    
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response



`requests.get()` returns a `requests.Response` object, which holds the server's reply to our request. This includes the `content`, `headers`, `status_code`, etc.

Check out the [requests.Response object](https://www.w3schools.com/python/ref_requests_response.asp) reference for the full list.
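
Before using the body of a response, it is worth checking that the request actually succeeded. The sketch below is one way to do that, not part of the original notebook: the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, while `timeout` and `raise_for_status()` are standard parts of the `requests` API.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its decoded text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of failing silently
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html = fetch_html("https://www.cia.gov/the-world-factbook/countries/angola/")
```

Without the `raise_for_status()` call, an error page would be returned and saved as if it were the real content.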


`response.content` gives the page body as raw bytes (note the `b'...'` prefix in the output below). `response.text` gives the same body decoded to a Unicode string.
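
The bytes versus text distinction can be tried without any network access. In this small sketch the byte string is a hypothetical stand-in for `response.content`; it contains one non-ASCII character, which takes two bytes in UTF-8, so the two lengths differ.

```python
# Hypothetical bytes standing in for response.content
raw = "Côte d'Ivoire".encode("utf-8")

# Decoding the bytes gives the kind of value response.text holds
text = raw.decode("utf-8")

print(type(raw).__name__, len(raw))    # length counted in bytes
print(type(text).__name__, len(text))  # length counted in characters
```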

In [None]:
response.content[:500]

b'<!DOCTYPE html><html lang="en"><head><meta charSet="utf-8"/><meta http-equiv="x-ua-compatible" content="ie=edge"/><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"/><style id="typography.js">html{font-family:sans-serif;-ms-text-size-adjust:100%;-webkit-text-size-adjust:100%}body{margin:0}article,aside,details,figcaption,figure,footer,header,main,menu,nav,section,summary{display:block}audio,canvas,progress,video{display:inline-block}audio:not([controls]){displa'

In [None]:
response.text[:500]

'<!DOCTYPE html><html lang="en"><head><meta charSet="utf-8"/><meta http-equiv="x-ua-compatible" content="ie=edge"/><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"/><style id="typography.js">html{font-family:sans-serif;-ms-text-size-adjust:100%;-webkit-text-size-adjust:100%}body{margin:0}article,aside,details,figcaption,figure,footer,header,main,menu,nav,section,summary{display:block}audio,canvas,progress,video{display:inline-block}audio:not([controls]){displa'

In [None]:
len(response.text)

351469

We can see that the response is about 350,000 characters long. Let's save the webpage in a file with the extension `.html`, so that it can be opened in a web browser.

In [None]:
with open ("angolafacts.html","w", encoding="utf-8") as file:
    file.write(response.text)

If we go to the directory in which our file is saved, we find the webpage saved as `angolafacts.html`. We now have the webpage on our machine, and it only took a few steps!

Hooray!!

### Inspecting the HTML in a webpage

All websites are created using a language called [HTML](https://www.w3schools.com/html/). We can see the HTML used to create a page by right-clicking the page and choosing `View Page Source` or `Inspect Element`.

![](https://drive.google.com/uc?export=view&id=1UCXqucHehA0PYdXrCTZXwXqqnQQ75yeh)

The Page Source contains the HTML

![](https://drive.google.com/uc?export=view&id=1uWjmMs5XKHBhEcAKcQ1B3T9klQiks4cg)

From the source code, we can see the various tags used to construct the HTML, which we'll use in our scraping. These tags include
* `html`
* `head`
* `title`
* `body`
* `div`
* `span`
* `h1` to `h6`
* `p`
* `img`
* `ul`, `ol` and `li`
* `table`, `tr`, `th` and `td`
* `style`
* ...

Each tag means something specific in HTML. For example, the `h1` tag holds a top-level heading, which is usually the main heading of the page. Scraping allows us to target each one of these tags.

Each HTML tag also supports several attributes. The following are some common attributes used to modify the behavior of tags:

* `id`
* `style`
* `class`
* `href` (used with `<a>`)
* `src` (used with `<img>`)

We can also target each one of these attributes during scraping.

### Extracting info using BeautifulSoup

`BeautifulSoup` is a library used to extract information from the HTML source code of a webpage. To use it, we have to install it first. See the [Beautiful Soup Docs](https://beautiful-soup-4.readthedocs.io/en/latest/).

In [None]:
!pip install beautifulsoup4 --upgrade --quiet


In [None]:
from bs4 import BeautifulSoup as bs

Now, we convert the `response` to a BeautifulSoup object

In [None]:
# name the parser explicitly so bs4 does not have to guess one
factsangola=bs(response.content,"html.parser")

In [None]:
# print the bs object
# print(factsangola.prettify())

### Targeting specific parts during scraping

To get the `<title>` tag, we can use the `title` attribute.

In [None]:
factsangola.title

<title data-react-helmet="true">Angola - The World Factbook</title>

The title of a website is what is shown on the browser tab of a page.
If we only wanted to get the text from the title, we could use the `string` attribute.

In [None]:
factsangola.title.string

'Angola - The World Factbook'

In [None]:
title_tag=factsangola.title.string

In [None]:
# To get the head
# factsangola.head

In [None]:
# To get h1 tag
factsangola.h1

<h1 class="hero-title">Angola</h1>

In [None]:
# To get h2 tag
factsangola.h2

<h2>Photos of Angola</h2>

In [None]:
# To get <p> tag
factsangola.p

<p>From the late 14th to the mid 19th century a Kingdom of Kongo stretched across central Africa from present-day northern Angola into the current Congo republics. It traded heavily with the Portuguese who, beginning in the 16th century, established coastal colonies and trading posts and introduced Christianity. By the 19th century, Portuguese settlement had spread to the interior; in 1914, Portugal abolished the last vestiges of the Kongo Kingdom and Angola became a Portuguese colony. <br/><br/>Angola scores low on human development indexes despite using its large oil reserves to rebuild since the end of a 27-year civil war in 2002. Fighting between the Popular Movement for the Liberation of Angola (MPLA), led by Jose Eduardo DOS SANTOS, and the National Union for the Total Independence of Angola (UNITA), led by Jonas SAVIMBI, followed independence from Portugal in 1975. Peace seemed imminent in 1992 when Angola held national elections, but fighting picked up again in 1993. Up to 1.5 

Now, you might realise that even though we have multiple `<h2>` tags and `<p>` tags, only one of each is returned. This is because accessing a tag by name returns only the **first** tag BeautifulSoup finds. In this case, the first `h2`, the first `p`, and so on.
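
The first-match behaviour can be checked on a tiny document. This is a minimal sketch using a hypothetical HTML snippet rather than the Factbook page; it also shows that a tag which does not exist comes back as `None`, and that the shortcut gives the same result as `find`, which the next section introduces.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two <p> tags and no <h1>
html = "<p>first</p><p>second</p>"
soup = BeautifulSoup(html, "html.parser")

print(soup.p)                    # only the first <p> is returned
print(soup.p == soup.find("p"))  # the shortcut matches find("p")
print(soup.h1)                   # a missing tag gives None
```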

### find and find_all

To target tags more specifically, bs4 provides the methods `find` and `find_all`. Let's see how each of these works.

In [None]:
factsangola.find_all('h2')

[<h2>Photos of Angola</h2>,
 <h2>Introduction</h2>,
 <h2>Geography</h2>,
 <h2>People and Society</h2>,
 <h2>Environment</h2>,
 <h2>Government</h2>,
 <h2>Economy</h2>,
 <h2>Energy</h2>,
 <h2>Communications</h2>,
 <h2>Transportation</h2>,
 <h2>Military and Security</h2>,
 <h2>Transnational Issues</h2>]

`find_all` returns all elements with the tag `h2`

In [None]:
factsangola.find('img')

<img alt="" aria-hidden="true" role="presentation" src="data:image/svg+xml;charset=utf-8,%3Csvg height='600' width='600' xmlns='http://www.w3.org/2000/svg' version='1.1'%3E%3C/svg%3E" style="max-width:100%;display:block;position:static"/>

In [None]:
import re

In [None]:
# find the img element whose alt text mentions Cabo Ledo and read its data-src attribute
imgsrc=factsangola.find_all('img',attrs={"alt":re.compile("Cabo Ledo")})[0]['data-src']

In [None]:
imgurl='https://www.cia.gov'+imgsrc

In [None]:
# save an image
with open("Cabo Ledo.jpg","wb") as f:
    f.write(requests.get(imgurl).content)

`find`, however, returns only the first matching element, as with the `img` tag above.

In [None]:
# help(factsangola.find)
help(factsangola.find_all)

Help on method find_all in module bs4.element:

find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs) method of bs4.BeautifulSoup instance
    Look in the children of this PageElement and find all
    PageElements that match the given criteria.
    
    All find_* methods take a common set of arguments. See the online
    documentation for detailed explanations.
    
    :param name: A filter on tag name.
    :param attrs: A dictionary of filters on attribute values.
    :param recursive: If this is True, find_all() will perform a
        recursive search of this PageElement's children. Otherwise,
        only the direct children will be considered.
    :param limit: Stop looking after finding this many results.
    :kwargs: A dictionary of filters on attribute values.
    :return: A ResultSet of PageElements.
    :rtype: bs4.element.ResultSet
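
The difference between `find` and `find_all`, and the `limit` parameter listed in the help text, can be seen on a small example. This is a sketch on a hypothetical snippet, not on the Factbook page itself.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not the Factbook page
html = "<h2>Introduction</h2><h2>Geography</h2><h2>Economy</h2>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("h2")                   # a single Tag, or None if nothing matches
every = soup.find_all("h2")               # a list-like ResultSet of all matches
first_two = soup.find_all("h2", limit=2)  # stop searching after two matches

print(first.text)
print([h2.text for h2 in every])
print(len(first_two))
```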



You can pass attributes to the `find_all` and `find` methods to narrow down what you want to scrape. The code below returns all `h3` elements with the `class` `mt30`.

In [None]:
h3mt30=factsangola.find_all("h3",attrs={"class":"mt30"})

In [None]:
h3mt30[:5]

[<h3 class="mt30"><a href="/the-world-factbook/field/background">Background</a></h3>,
 <h3 class="mt30"><a href="/the-world-factbook/field/location">Location</a></h3>,
 <h3 class="mt30"><a href="/the-world-factbook/field/geographic-coordinates">Geographic coordinates</a></h3>,
 <h3 class="mt30"><a href="/the-world-factbook/field/map-references">Map references</a></h3>,
 <h3 class="mt30"><a href="/the-world-factbook/field/area">Area</a></h3>]

In [None]:
# find which class the elements in h3mt30 belong to
h3mt30[0]['class']

['mt30']
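
Filtering by class and reading attributes can be tried on a small snippet. The HTML below is hypothetical and only modelled on the headings shown above. Besides the `attrs` dictionary used in the notebook, Beautiful Soup also accepts the `class_` keyword and CSS selectors through `select`; all three should pick the same element here.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the h3 headings shown above
html = """
<h3 class="mt30"><a href="/field/area">Area</a></h3>
<h3 class="other"><a href="/field/note">Note</a></h3>
"""
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all("h3", attrs={"class": "mt30"})
by_keyword = soup.find_all("h3", class_="mt30")  # class_ avoids the Python keyword
by_css = soup.select("h3.mt30")                  # CSS selector form

print([h3.text for h3 in by_attrs])
print([h3.text for h3 in by_keyword])
print([h3.text for h3 in by_css])
print(by_attrs[0].a["href"])  # attributes are read like dictionary keys
print(by_attrs[0].get("id"))  # .get() returns None for a missing attribute
```

Indexing with `['id']` on a tag that has no `id` raises a `KeyError`, so `.get()` is the safer choice when an attribute may be absent.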

In [None]:
h3mt30[0].find('a').text

'Background'

In [None]:
for h3 in h3mt30:
    print(h3.find('a').text)

Background
Location
Geographic coordinates
Map references
Area
Area - comparative
Land boundaries
Coastline
Maritime claims
Climate
Terrain
Elevation
Natural resources
Land use
Irrigated land
Major rivers (by length in km)
Major watersheds (area sq km)
Major aquifers
Population distribution
Natural hazards
Geography - note
Population
Nationality
Ethnic groups
Languages
Religions
Demographic profile
Age structure
Dependency ratios
Median age
Population growth rate
Birth rate
Death rate
Net migration rate
Population distribution
Urbanization
Major urban areas - population
Sex ratio
Mother's mean age at first birth
Maternal mortality ratio
Infant mortality rate
Life expectancy at birth
Total fertility rate
Contraceptive prevalence rate
Drinking water source
Current health expenditure
Physicians density
Sanitation facility access
HIV/AIDS - adult prevalence rate
HIV/AIDS - people living with HIV/AIDS
HIV/AIDS - deaths
Major infectious diseases
Obesity - adult prevalence rate
Alcohol consum

## Getting the Data we need

For this exercise, we want to scrape all the facts about Angola and save them in a CSV file. However, we need to do this in an orderly way, so that anyone reading the data can understand it almost as well as they would on the website. This means retaining the headings and linking them with the data below them.

Let us first find all the main headings, which are under the tag `h2`

### Finding all the main headings (`h2`)

In [None]:
h2sangola=factsangola.find_all('h2')

In [None]:
# keep only the heading text; skip the first h2, 'Photos of Angola', which has no sub-headings
h2sangola=[h2.text for h2 in h2sangola[1:]]
h2sangola

['Introduction',
 'Geography',
 'People and Society',
 'Environment',
 'Government',
 'Economy',
 'Energy',
 'Communications',
 'Transportation',
 'Military and Security',
 'Transnational Issues']

### Find all h3 headings

Now, each `h2` heading has sub-headings underneath it, which are `h3` elements. Can we find the `h3`s under Geography?

In [None]:
geoh3=factsangola.find("div",attrs={"id":"geography"}).find_all('h3')
geoh3[:5]


[<h3 class="mt30"><a href="/the-world-factbook/field/location">Location</a></h3>,
 <h3 class="mt30"><a href="/the-world-factbook/field/geographic-coordinates">Geographic coordinates</a></h3>,
 <h3 class="mt30"><a href="/the-world-factbook/field/map-references">Map references</a></h3>,
 <h3 class="mt30"><a href="/the-world-factbook/field/area">Area</a></h3>,
 <h3 class="mt30"><a href="/the-world-factbook/field/area-comparative">Area - comparative</a></h3>]

Can we do the same with Introduction?

In [None]:
introsh3=factsangola.find("div",attrs={"id":"introduction"}).find_all('h3')
introsh3[0].get_text()

'Background'

Now, can we make our work easier by creating a function to do the rest of the headings?

In [None]:
def geth3s(h2name):
    allh3=[]
    h3s=factsangola.find("div",id=h2name).find_all('h3')
    for h3 in h3s:
        allh3.append(h3.text)
    return allh3

In [None]:
geth3s("geography")

['Location',
 'Geographic coordinates',
 'Map references',
 'Area',
 'Area - comparative',
 'Land boundaries',
 'Coastline',
 'Maritime claims',
 'Climate',
 'Terrain',
 'Elevation',
 'Natural resources',
 'Land use',
 'Irrigated land',
 'Major rivers (by length in km)',
 'Major watersheds (area sq km)',
 'Major aquifers',
 'Population distribution',
 'Natural hazards',
 'Geography - note']

In [None]:
# Get the text of every h2 on the page, including 'Photos of Angola'
h2slist=[h2.text for h2 in factsangola.find_all('h2')]
h2slist

['Photos of Angola',
 'Introduction',
 'Geography',
 'People and Society',
 'Environment',
 'Government',
 'Economy',
 'Energy',
 'Communications',
 'Transportation',
 'Military and Security',
 'Transnational Issues']

In [None]:
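def geth3(h2lis,bsobject):
    # Return one list of h3 texts per h2 heading, skipping 'Photos of Angola'
    totalh3=[]
    for h2 in h2lis[1:]:
        allh3=[]
        # each section sits in a div whose id is the heading in lower case with hyphens
        h3s=bsobject.find("div",attrs={"id":str(h2).lower().replace(" ","-")}).find_all('h3')
        for h3 in h3s:
            allh3.append(h3.text)
        totalh3.append(allh3)
    return totalh3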

In [None]:
allh3s=geth3(h2slist,factsangola)
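
`geth3` depends on each section's `div` having an `id` equal to the heading in lower case with spaces replaced by hyphens. The sketch below isolates that conversion so it can be checked on its own; the helper name `heading_to_id` is an illustrative assumption, not something from the notebook.

```python
def heading_to_id(heading):
    """Turn a section heading into the div id that geth3 looks up (hypothetical helper)."""
    return heading.lower().replace(" ", "-")


for heading in ["Geography", "People and Society", "Military and Security"]:
    print(heading, "->", heading_to_id(heading))
```

If a page has no `div` with the computed `id`, `find` returns `None` and the chained `.find_all('h3')` raises an `AttributeError`, so this naming rule is worth verifying before reusing the function on other pages.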

### Link main heading to sub-heading

In order to link each main heading with a sub-heading, we will concatenate the names, adding a ': ' in between

In [None]:
# 'Photos of Angola' has no sub-headings, so drop it from h2sangola if it is still there
if h2sangola and h2sangola[0]=='Photos of Angola':
    h2sangola.pop(0)

Now, the length of `h2sangola` should equal the length of `allh3s`

In [None]:
print(zip(h2sangola,allh3s))

<zip object at 0x000002196EAEC400>


In [None]:
allh3s[0]

['Background']

In [None]:
h2sh3= [[ h2sangola[i]+ ": " +allh3s[i][j] for j in range(len(allh3s[i]))] for i in range(len(h2sangola))]

In [None]:
h2sh3

[['Introduction: Background'],
 ['Geography: Location',
  'Geography: Geographic coordinates',
  'Geography: Map references',
  'Geography: Area',
  'Geography: Area - comparative',
  'Geography: Land boundaries',
  'Geography: Coastline',
  'Geography: Maritime claims',
  'Geography: Climate',
  'Geography: Terrain',
  'Geography: Elevation',
  'Geography: Natural resources',
  'Geography: Land use',
  'Geography: Irrigated land',
  'Geography: Major rivers (by length in km)',
  'Geography: Major watersheds (area sq km)',
  'Geography: Major aquifers',
  'Geography: Population distribution',
  'Geography: Natural hazards',
  'Geography: Geography - note'],
 ['People and Society: Population',
  'People and Society: Nationality',
  'People and Society: Ethnic groups',
  'People and Society: Languages',
  'People and Society: Religions',
  'People and Society: Demographic profile',
  'People and Society: Age structure',
  'People and Society: Dependency ratios',
  'People and Society: Me

Amazing! We now have every sub-heading linked to its heading!! If you think about it, these could be our column heads when we store our data in either a pandas dataframe or csv!
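
The same linking step can be written by pairing each heading with its list of sub-headings using `zip`. This is a minimal sketch on hypothetical, shortened lists; the names `headings`, `subheadings` and `linked` are illustrative stand-ins for `h2sangola`, `allh3s` and `h2sh3`.

```python
# Hypothetical, shortened versions of h2sangola and allh3s
headings = ["Introduction", "Geography"]
subheadings = [["Background"], ["Location", "Area"]]

# Pair each heading with its own sub-headings, then join them with ": "
linked = [
    [f"{h2}: {h3}" for h3 in h3s]
    for h2, h3s in zip(headings, subheadings)
]
print(linked)
```

Note that `zip` stops at the shorter input, so a length mismatch between the two lists would drop items silently rather than raise an error.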

### Getting the paragraphs under the h3 elements

Now that we have headings and sub-headings, we can move on to the paragraphs under each sub-heading. Note that each sub-heading has at least one paragraph below it.

In [None]:
# We can adapt the sub-heading function to collect the first paragraph after each h3
def getp(h2lis,bsobject):
    totalp=[]
    for h2 in h2lis[1:]:
        allp=[]
        h3s=bsobject.find("div",attrs={"id":str(h2).lower().replace(" ","-")}).find_all('h3')
        for h3 in h3s:
            allp.append(h3.find_next("p").get_text())
        totalp.append(allp)   
    return totalp

In [None]:
allps=getp(h2slist,factsangola)
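
The note that every sub-heading has at least one paragraph matters because `find_next` searches forward through the whole document, not only inside the current section. The sketch below uses a hypothetical snippet in which the second `h3` has no paragraph of its own, to show what would happen in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <h3> has no paragraph of its own
html = """
<div><h3>First</h3><p>first paragraph</p></div>
<div><h3>Second</h3></div>
<div><h3>Third</h3><p>third paragraph</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# For each h3, take the next <p> that appears anywhere later in the document
paragraphs = [h3.find_next("p").get_text() for h3 in soup.find_all("h3")]
for h3, text in zip(soup.find_all("h3"), paragraphs):
    print(h3.text, "->", text)
```

Here the heading without a paragraph picks up the text that belongs to the following heading, so on other pages it may be worth checking that the paragraph found sits in the same section.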

In [None]:
allps[0]

["From the late 14th to the mid 19th century a Kingdom of Kongo stretched across central Africa from present-day northern Angola into the current Congo republics. It traded heavily with the Portuguese who, beginning in the 16th century, established coastal colonies and trading posts and introduced Christianity. By the 19th century, Portuguese settlement had spread to the interior; in 1914, Portugal abolished the last vestiges of the Kongo Kingdom and Angola became a Portuguese colony. Angola scores low on human development indexes despite using its large oil reserves to rebuild since the end of a 27-year civil war in 2002. Fighting between the Popular Movement for the Liberation of Angola (MPLA), led by Jose Eduardo DOS SANTOS, and the National Union for the Total Independence of Angola (UNITA), led by Jonas SAVIMBI, followed independence from Portugal in 1975. Peace seemed imminent in 1992 when Angola held national elections, but fighting picked up again in 1993. Up to 1.5 million liv

### Saving scraped Data

After getting the data we need, we can store it. Let's first save it in a pandas dataframe. Once it is in a dataframe, it's easy to save as CSV using the `to_csv` method.

In [None]:
import pandas as pd
import numpy as np

We can use [np.hstack](https://numpy.org/doc/stable/reference/generated/numpy.hstack.html) to flatten each of our lists of lists into a single one-dimensional array.

In [None]:
columnsarr=np.hstack(h2sh3)

In [None]:
dataarr=np.hstack(allps)

In [None]:
np.shape(columnsarr)

(171,)
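
What `np.hstack` does to a list of lists is easier to see on a small input. This is a sketch with a hypothetical, shortened stand-in for `h2sh3`, whose inner lists have different lengths.

```python
import numpy as np

# Hypothetical, shortened version of h2sh3: inner lists of different lengths
nested = [["Introduction: Background"], ["Geography: Location", "Geography: Area"]]

# Each inner list is treated as a 1-D array and the arrays are joined end to end
flat = np.hstack(nested)
print(flat.shape)
print(flat.tolist())
```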

In [None]:
angoladf=pd.DataFrame([dataarr],columns=columnsarr)

In [None]:
angoladf.to_csv('angola_factbook.csv')
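
The shape of the resulting table can be checked on a small example. In this sketch the column names and values are hypothetical stand-ins for `columnsarr` and `dataarr`, and the CSV is written to an in-memory buffer instead of a file. It also passes `index=False`, which the notebook's call does not, so the row label `0` is left out of the output.

```python
import io

import pandas as pd

# Hypothetical two-column stand-in for columnsarr and dataarr
columns = ["Introduction: Background", "Geography: Location"]
values = ["Some background text", "Some location text"]

# Wrapping the values in a list makes them a single row
df = pd.DataFrame([values], columns=columns)

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves out the row label column
print(df.shape)
print(buffer.getvalue())
```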

In [None]:
!pip install jovian

In [None]:
import jovian

In [None]:
jovian.commit(filename="angolafactbook.ipynb")

<IPython.core.display.Javascript object>

[jovian] Updating notebook "andrewkamaukim/angolafactbook" on https://jovian.ai/
[jovian] Committed successfully! https://jovian.ai/andrewkamaukim/angolafactbook


'https://jovian.ai/andrewkamaukim/angolafactbook'

### Bringing it all together

Now, we can summarize all that we have done to get the data from the CIA's Angola Factbook. We:
* Used the `requests` library to get the webpage
* Parsed the page's HTML using BeautifulSoup
* Looked for all the main headings (`h2`) on the page and saved them as a list, `h2slist`
* Found all sub-headings (`h3`) under each heading and saved them as a list of lists, `allh3s`
* Linked the main headings to sub-headings using a colon and a space (`: `) and saved them as `h2sh3`
* Got the first paragraph (`p`) after each sub-heading `h3` as a list of lists, `allps`
* Flattened each list of lists into an array using `np.hstack`
* Used `h2sh3` as the column heads and `allps` as a single row of a pandas dataframe
* Saved the pandas dataframe as a CSV

Now, these may seem like a lot of steps, but they were all vital to get the final csv. However, imagine how painful it would be to do this, over and over again for each country. Is there a way we could get the data for all the countries without repeating this process? 

This is what we'll be looking into in the next notebook!
