# Scraping The Ghana Contracts Portal

This Jupyter notebook details the steps taken to scrape contract data from the Ghana Public Procurement Authority's contracts website: http://tenders.ppa.gov.gh/contracts. This is part of the data acquisition stage in a data engineering internship.

The goal of scraping the data is to have it available in CSV format so that it can either be:
1. Analysed with data analysis tools in pandas, R or Excel
2. Populated into a relational database such as PostgreSQL or MySQL

The main scraping library used in the lab is Beautiful Soup. For more details about Beautiful Soup, check out the [documentation page](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). For a detailed introductory tutorial, check out this [freeCodeCamp video](https://www.youtube.com/watch?v=87Gx3U0BDlo).

## Step 1: Setting Up Your Environment

In order to seamlessly run the code below, ensure that you perform the following steps:

1. Create and activate a virtual environment for web scraping. We recommend using Anaconda to create your virtual environment; check out the [Anaconda Cheat Sheet](https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf) for more information.
2. Install the libraries listed in the requirements.txt file with the command:
**`pip install -r requirements.txt`**

After this, you should be set to run the cells below.
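A quick way to confirm that the environment is ready is to check that each package the notebook imports can be located. The sketch below is not part of the original notebook: the helper name `missing_packages` is hypothetical, and the package list is inferred from the notebook's import cell, not from requirements.txt, whose contents are not shown here.

```python
from importlib.util import find_spec

# Top-level modules imported by this notebook (inferred from its import cell,
# not read from requirements.txt).
REQUIRED = ["bs4", "requests", "pandas", "IPython"]


def missing_packages(names):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, re-running the `pip install` command inside the activated environment is usually the first thing to try.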


In [1]:
#relevant libraries

from bs4 import BeautifulSoup
import requests
import pandas as pd
from time import sleep
from random import randint
from time import time
from IPython.display import clear_output
from warnings import warn

## Step 2: Let's Start Scraping

The [Ghana PPA website](http://tenders.ppa.gov.gh/contracts) is made up of 552 pages, each containing 10 blocks of contract information. Each block contains a URL that links to the detailed contract information we would like to scrape. Take a moment to check out a couple of the blocks on the contracts website. For example, this is what contract [number 9511](http://tenders.ppa.gov.gh/contracts/9511) looks like. Notice that each contract block contains 17 different fields or values. We want to scrape each of these fields into a tabular format (CSV).


Here are the steps we will take:
1. Get the URL for each block from all 552 pages
2. Navigate to each contract URL page and scrape the contract details
3. Save and export the data as CSV
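
The page and block counts above determine how many requests each step needs. As a minimal sketch, assuming the counts still match the site (they reflect the portal at the time of scraping), the listing URLs and the expected number of contracts can be worked out up front. The constant names are illustrative only.

```python
BASE_URL = "http://tenders.ppa.gov.gh/contracts"
N_PAGES = 552          # number of listing pages stated above
BLOCKS_PER_PAGE = 10   # contract blocks per listing page stated above

# One URL per listing page, using the same ?page=N pattern as the loop in section 2.1.
page_urls = ["{}?page={}".format(BASE_URL, page) for page in range(1, N_PAGES + 1)]

# Step 1 makes one request per listing page; step 2 makes one per contract.
expected_contracts = N_PAGES * BLOCKS_PER_PAGE

print(len(page_urls))        # 552
print(page_urls[0])          # http://tenders.ppa.gov.gh/contracts?page=1
print(expected_contracts)    # 5520
```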

### 2.1 - Get The URL For Each Block

In [2]:
#  a loop that iterates over each page and scrapes all contract title urls to a list called 'links'

# declaring lists to store data in
links = []
start_time = time()
req = 0

#For all the 552 pages, 
for page in range(1,553):
    url = "http://tenders.ppa.gov.gh/contracts?page={}".format(page)
    
    #Make a request
    r = requests.get(url)
    
    #Pause the loop
    sleep(randint(0,1))
    
    #Monitor the requests
    req += 1
    elapsed_time = time() - start_time
    print('Req:{}; Frequency: {} req/s'.format(req, req/elapsed_time))
    clear_output(wait = True)
    
    #Throw a warning for non-200 status codes
    if r.status_code != 200:
        warn('Req: {}; Status code: {}'.format(req, r.status_code))
    
    # Break the loop if the number of requests is greater than the 552 expected
    if req > 552:
        warn('Number of requests was greater than expected.')
        break
    
    # Parse the content of the request with BeautifulSoup
    soup = BeautifulSoup(r.content,"html.parser")
    
    #Find the contract urls
    page_soup = soup.find('div', class_='content')
    title_url = page_soup.select(".list-wrap .list-title a")
    
    # Scrape the url
    for link in title_url:
        links.append(link.get('href'))    

Req:552; Frequency: 0.8635106088821604 req/s
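
The loop above warns on a non-200 status code but still goes on to parse that response. One possible alternative, sketched here and not taken from the original notebook, is to retry a failed request a few times and skip the page if it never succeeds. The helper name `fetch_with_retry` and its parameters are hypothetical; `get` is assumed to be any callable that behaves like `requests.get`, that is, it returns an object with a `status_code` attribute. Connection errors raised by the callable are not handled in this sketch.

```python
from time import sleep


def fetch_with_retry(get, url, max_attempts=3, pause=1.0):
    """Call get(url) until it returns status 200 or the attempts run out.

    Returns the response on success, or None if every attempt failed,
    so the caller can decide to skip the page.
    """
    for attempt in range(1, max_attempts + 1):
        response = get(url)
        if response.status_code == 200:
            return response
        if attempt < max_attempts:
            # Wait a little longer after each failed attempt.
            sleep(pause * attempt)
    return None


# Hypothetical usage inside the page loop:
#     r = fetch_with_retry(requests.get, url)
#     if r is None:
#         continue  # skip this page instead of parsing an error response
```

Passing the request function in as an argument keeps the helper easy to try out with a stand-in function before pointing it at the live site.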


### 2.2 - Navigate To Each Block And Scrape Data

In [5]:
#  a loop that iterates over each contract link and scrapes data from each in batches

# declaring a dataframe to store data
contract_data_gh = pd.DataFrame()
start_time = time()
req = 0
batchsize = 920

#For all the 5,520 contract links,
for i in range(0, len(links), batchsize):
    batch = links[i:i+batchsize]
    
    #pause interval between each batch scraped
    sleep(randint(600, 1000))
    
    for link in batch:
        table_header,table_content=[],[]

        #Make a request
        response = requests.get(link)

        #Monitor the requests
        req += 1
        elapsed_time = time() - start_time
        print('Req:{}; Frequency: {} req/s'.format(req, req/elapsed_time))
        clear_output(wait = True)

        #Throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Req: {}; Status code: {}'.format(req, response.status_code))

        # Stop the current batch if the number of requests is greater than expected
        if req > 5520:
            warn('Number of requests was greater than expected.')
            break

        # Parse the content of the request with BeautifulSoup
        soup = BeautifulSoup(response.content, "html.parser")

        #Find the data
        link_soup = soup.find('div', class_='content')
        head = link_soup.select("dl dt")
        content = link_soup.select("dl dd")

        #Scrape the data and append to a dataframe
        for item in head:
            table_header.append(item.string)

        for item in content:
            table_content.append(item.string)

        contracts = pd.DataFrame(dict(zip(table_header, table_content)), index=[0])   

        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        contract_data_gh = pd.concat([contract_data_gh, contracts])

Req:5520; Frequency: 0.7684969499796461 req/s
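
The loop above depends on every `dt` label lining up with a `dd` value. The sketch below shows that pairing on a made-up HTML fragment; the markup is an assumption based on the selectors used above, not a copy of the real page. It uses `get_text(strip=True)` where the notebook uses `.string`, because `.string` returns `None` when a tag has more than one child (as in the email row of the sample). The function name `parse_contract` is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure the selectors assume;
# the real page markup may differ.
SAMPLE_HTML = """
<div class="content">
  <dl>
    <dt>Awarding Agency:</dt><dd>Example Agency - Accra</dd>
    <dt>Lot #:</dt><dd>LOT 9</dd>
    <dt>Company Email:</dt><dd><a href="mailto:info@example.com">info@example.com</a><br></dd>
  </dl>
</div>
"""


def parse_contract(html):
    """Return a {label: value} dict built from the dt/dd pairs in the page."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="content")
    if container is None:
        return {}  # layout not as expected, nothing to extract
    labels = [dt.get_text(strip=True) for dt in container.select("dl dt")]
    values = [dd.get_text(strip=True) for dd in container.select("dl dd")]
    return dict(zip(labels, values))


record = parse_contract(SAMPLE_HTML)
print(record)
```

Returning an empty dict when the `content` div is missing avoids the `AttributeError` that calling `.select` on `None` would otherwise raise.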


## Step 3: Save and Export Data

In [6]:
# Number of rows and columns in dataframe

contract_data_gh.shape

(5520, 17)

In [7]:
#view first five rows of dataframe

contract_data_gh.head()

| | Awarding Agency: | Tender Package No: | Tender Type: | Contract Type: | Lot #: | Tender Description: | Approval Auth: | Justification: | Contract Date: | Completion Date: | Contract Currency: | Contract Award Price: | Contract Awarded To: | Company Email: | Company Address: | Company Tel: | Supplier #: |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ghana Irrigation Development Authority - Accra | MOFA/GIDA-IFT/LOT 9-SAN-DAG/12/18 | - | Restricted Tender | LOT 9 | Construction of small earth dams to store wate... | ETC/PPA | 40 (1)b - Urgency | 10th May, 2019 | 20th May, 2020 | Ghana Cedi | Gh¢4962525.82 | Ashcarl Investment Ltd | | P. O. Box 2310 Tamale | | - |
| 0 | Ghana Irrigation Development Authority - Accra | MOFA/GIDA-IFT/LOT 1-VUN-NAM/12/18 | - | Restricted Tender | LOT 1 | Construction of Small earth dams at Vunania an... | ETC/PPA | 40 (1)b - Urgency | 10th May, 2019 | 12th May, 2020 | Ghana Cedi | Gh¢5087867.52 | ALZAK COMPANY LIMITED | info@alzak.com | - | - | - |
| 0 | Ghana Irrigation Development Authority - Accra | MOFA/GIDA-IFT/LOT 3-KATAA/12/18 | - | Restricted Tender | LOT 3 | Construction of small earth dam at Kataa for a... | ETC/PPA | 40 (1)b - Urgency | 10th May, 2019 | 14th January, 2020 | Ghana Cedi | Gh¢2489286.20 | QUALITY ASSURED ENGINEERING COMPANY | info@qualityassured.com | - | 0 | - |
| 0 | Ghana Irrigation Development Authority - Accra | MOFA/GIDA-IFT/LOT 4-DUONG/12/18 | - | Restricted Tender | LOT 4 | Construction of small earth dam at Doung to st... | ETC/PPA | 40 (1)b - Urgency | 10th May, 2019 | 14th January, 2020 | Ghana Cedi | Gh¢2576940.88 | HANDOSKY INTERNATIONAL LIMITED | support@HANDOSKY.com | - | - | - |
| 0 | Ghana Irrigation Development Authority - Accra | MOFA/GIDA-IFT/LOT 5-DOUSE/12/18 | - | Restricted Tender | LOT 5 | Construction of small earth dams at Douse to s... | ETC/PPA | 40 (1)b - Urgency | 10th May, 2019 | 14th January, 2020 | Ghana Cedi | Gh¢2435596.41 | WAALE CONSTRUCTION WORKS LIMITED | WAALE@yahoo.com | - | - | - |
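
The preview shows that dates are stored as text such as `10th May, 2019` and prices carry a currency prefix such as `Gh¢4962525.82`. Before analysis these values usually need converting. The sketch below uses only the standard library and assumes every value follows the formats seen in the five rows above; the helper names `parse_contract_date` and `parse_price` are hypothetical and not part of the original notebook.

```python
import re
from datetime import datetime
from decimal import Decimal


def parse_contract_date(text):
    """Convert text such as '10th May, 2019' to a datetime.date."""
    # Drop the ordinal suffix (st, nd, rd, th) that follows the day number.
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    return datetime.strptime(cleaned, "%d %B, %Y").date()


def parse_price(text):
    """Convert text such as 'Gh¢4962525.82' to a Decimal amount."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    if match is None:
        raise ValueError("no numeric amount in {!r}".format(text))
    return Decimal(match.group())


print(parse_contract_date("10th May, 2019"))      # 2019-05-10
print(parse_contract_date("14th January, 2020"))  # 2020-01-14
print(parse_price("Gh¢4962525.82"))               # 4962525.82
```

`Decimal` is used for the price so that currency amounts are not subject to floating-point rounding. Placeholder values such as `-` would raise an error here and would need to be handled separately.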


In [8]:
# setting new index for dataframe and deleting current one

contract_data_gh.reset_index(drop=True, inplace=True)

In [9]:
#converting dataframe to csv

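# index=False keeps the row index out of the file, so re-reading it does not create an 'Unnamed: 0' column
contract_data_gh.to_csv('contract_data_ppa_gh.csv', index=False)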
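
The second goal stated at the top of this notebook is to populate a relational database from the CSV. The sketch below illustrates the idea with SQLite from the standard library as a stand-in for PostgreSQL or MySQL. Everything in it is an assumption for illustration: the table name `contracts`, the column names, and the tiny two-column sample file it writes to a temporary directory in place of the real export.

```python
import csv
import os
import sqlite3
import tempfile

# Tiny stand-in CSV with two of the scraped columns; the real export has 17.
csv_path = os.path.join(tempfile.mkdtemp(), "sample_contracts.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Awarding Agency:", "Contract Currency:"])
    writer.writerow(["Example Agency - Accra", "Ghana Cedi"])
    writer.writerow(["Another Agency - Tamale", "Ghana Cedi"])

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.execute("CREATE TABLE contracts (awarding_agency TEXT, contract_currency TEXT)")

with open(csv_path, newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    rows = [(row["Awarding Agency:"], row["Contract Currency:"]) for row in reader]

# Parameterised insert: values are passed separately from the SQL text.
conn.executemany("INSERT INTO contracts VALUES (?, ?)", rows)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM contracts").fetchone()[0]
print(row_count)  # 2
```

For PostgreSQL or MySQL the same pattern applies with the corresponding database driver, though the connection call and placeholder syntax differ between drivers.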