# Scraping data on the world's top insurance companies by `market capitalization`

![](https://i.imgur.com/lA4LqC7.png)


**Data** is a collection of facts.

_**Web scraping**_ is a technique for automatically extracting large amounts of data from websites and saving it to a file or database. The scraped data is usually stored in a tabular or spreadsheet format (e.g., a CSV file).


In this project, we will scrape data from [value.today](https://www.value.today/world-top-companies/insurance).

We'll use the Python libraries `requests` and `beautifulsoup4` to scrape the web page.



Here's an outline of the steps we'll follow:

1. Download the webpage using `requests`
2. Parse the HTML source code using `beautifulsoup4`
3. Extract company names, CEOs, world ranks, market capitalization, annual revenue, number of employees, and company URLs
4. Compile the extracted information into Python lists and dictionaries
5. Extract and combine data from multiple pages
6. Save the extracted information to a CSV file.


By the end of the project, we'll create a CSV file in the following format:

![](https://i.imgur.com/a0Bllr6.png)


## How to Run the Code
You can execute the code by clicking the "Run" button at the top of this page and selecting "Run on Binder". You can make changes and save your version of the notebook to [Jovian](https://www.jovian.ai) by executing the following cells:

In [44]:
!pip install jovian --upgrade --quiet

In [45]:
import jovian

In [46]:
# Execute this to save new versions of the notebook
#jovian.commit(project="web-scrapping-finally-final")

## Download the webpage using `requests`

We'll use the `requests` library to download the web page.

The library can be installed using `pip`

In [47]:
!pip install requests --upgrade --quiet

In [48]:
import requests

The library is now installed and imported.

To download a page, we can use the `get` function from requests, which returns a response object.

In [49]:
topics_url = 'https://www.value.today/world-top-companies/insurance'
response = requests.get(topics_url)

`requests.get` returns a response object containing the data from the web page and some other information.

The `.status_code` property can be used to check if the response was successful. A successful response will have an [HTTP status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) between 200 and 299.
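
As a minimal sketch of that rule, the helper below checks whether a status code falls in the 200 to 299 range. The name `is_success` is hypothetical and not part of `requests`. The library offers a related shortcut, `response.ok`, although that attribute is true for any status code below 400.

```python
from http import HTTPStatus


def is_success(status_code):
    """Return True for 2xx HTTP status codes (hypothetical helper)."""
    return 200 <= status_code <= 299


print(is_success(HTTPStatus.OK))         # 200 is inside the 2xx range
print(is_success(HTTPStatus.NOT_FOUND))  # 404 is outside the 2xx range
```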

In [50]:
response.status_code

200

The request was successful. We can get the contents of the page using `response.text`.

In [51]:
page_content = response.text

Let's check the number of characters of the page. 

In [52]:
len(page_content)

161485

The page contains over 160,000 characters!

Here are the first 1000 characters of the page:

In [53]:
page_content[:1000]

'<!DOCTYPE html>\n<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">\n  <head>\n    <meta charset="utf-8"/>\n<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>\n<script>(adsbygoogle=window.adsbygoogle||[]).push({google_ad_client:"ca-pub-2407955258669770",enable_page_level_ads:true});</script><script>window.google_analytics_uacct="UA-121331115-1";(function(i,s,o,g,r,a,m){i["GoogleAnalyticsObject"]=r;i[r]=i[r]||function(){(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)})(window,document,"script","https

In the above cell `page_content[:1000]` contains the [HTML](https://en.wikipedia.org/wiki/HTML) of the webpage [value.today](https://www.value.today/world-top-companies/insurance)

We can also save it to a file and view the page locally within Jupyter using "File > Open".

In [54]:
with open('world-insurance.html','w',encoding = "utf-8") as file:
    file.write(page_content)

The page looks similar to the original page

![](https://i.imgur.com/Kmmbnlo.png)

In this section, we used the `requests` library to download a web page as HTML.

## Parse the HTML source code using `beautifulsoup4`

In [55]:
!pip install beautifulsoup4 --upgrade --quiet

In [56]:
from bs4 import BeautifulSoup

In [57]:
doc = BeautifulSoup(response.text, 'html.parser')

With this `doc` object, we can navigate and search through the `HTML` for data that we want.

In [58]:
type(doc)

bs4.BeautifulSoup

The `doc` object contains several properties and methods for extracting information from the HTML document.

See [the BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for details.

In [59]:
doc.find('title')

<title>World Top Insurance Companies by Market Value as on 2021</title>

Here, `doc.find('title')` will give the title of the web page.

![](https://i.imgur.com/of9igZA.png) 
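
To see how `find` behaves without downloading anything, here is a small sketch that parses a made-up HTML snippet (the names `sample_html` and `sample_doc` are only for illustration). It shows the difference between the tag itself and its text, and what happens when nothing matches.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for the downloaded page
sample_html = "<html><head><title>Sample Page</title></head><body><p>Hello</p></body></html>"
sample_doc = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_doc.find("title")  # the whole <title> tag
title_text = title_tag.text           # only the text inside the tag
missing = sample_doc.find("h1")       # the snippet has no <h1>, so this is None

print(title_tag)
print(title_text)
print(missing)
```

Because `find` returns `None` when no tag matches, any later `.text` or `.find` call on that result raises an `AttributeError`. This matters for the extraction functions later in the notebook.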

In [60]:
def get_page(url):
    """Download a web page and return a BeautifulSoup document."""
    # Download the web page
    response = requests.get(url)

    # Check if the download was successful
    if response.status_code != 200:
        raise Exception('Unable to download page {}'.format(url))

    # Parse the page HTML into a BeautifulSoup document
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [61]:
doc2 = get_page(topics_url)

In [62]:
doc2.find('title')

<title>World Top Insurance Companies by Market Value as on 2021</title>

`doc` and `doc2` have the same title: `World Top Insurance Companies by Market Value as on 2021`.

We can use the function `get_page` to download any web page and parse it using Beautiful Soup.

## Extract company names, CEOs, world ranks, Market capitalization, Annual revenue, number of employees, company URLs

Inspecting the page shows that all the information we need for each company sits inside an `li` tag whose `class` attribute is set to `row well clearfix`.

![](https://i.imgur.com/i3Srr2K.png)

Let's find all the `li` tags matching this class.

In [63]:
company_block = doc.find_all('li',class_='row well clearfix')

In [64]:
len(company_block)

10

The web page contains 10 such `li` tags, one per company.
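
One detail worth knowing: when `class_` is given a string with several class names, Beautiful Soup compares it with the exact value of the `class` attribute, so the order of the names matters. The sketch below uses an invented HTML snippet to compare this with a CSS selector passed to `select`, which matches the classes in any order. It is only an illustration, and the notebook itself keeps using `find_all`.

```python
from bs4 import BeautifulSoup

# Invented snippet: the second <li> lists the same classes in a different order
sample_html = """
<ul>
  <li class="row well clearfix"><a href="/a">ALPHA</a></li>
  <li class="well row clearfix"><a href="/b">BETA</a></li>
  <li class="other"><a href="/c">GAMMA</a></li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Matches only tags whose class attribute is exactly this string
exact = soup.find_all("li", class_="row well clearfix")

# Matches tags that carry all three classes, in any order
by_selector = soup.select("li.row.well.clearfix")

print([tag.find("a").text for tag in exact])
print([tag.find("a").text for tag in by_selector])
```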

## Company Name

Now let's create a `function` that extracts all the company names on the first page using a `for` loop.

In [65]:
def name_of_companies(company_block):
    company_names = []
    for tag in company_block:
        c_name = tag.find('div',class_='field field--name-node-title field--type-ds field--label-hidden field--item')

        company_names.append(c_name.find('a').text)
    return company_names

We can call the function `name_of_companies` to get the company names.

In [66]:
#Let's check the function
name_of_companies(company_block)

['BERKSHIRE HATHAWAY',
 'UNITEDHEALTH GROUP',
 'BANK OF AMERICA CORPORATION',
 'WELLS FARGO & COMPANY',
 'AIA GROUP',
 'ROYAL BANK OF CANADA',
 'PING AN INSURANCE (GROUP) COMPANY OF CHINA',
 'BANK OF CHINA',
 'STATE FARM',
 'TORONTO-DOMINION BANK']

## CEOs

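The name of the `CEO` is the text of the `a` tag (the link) inside the `div` whose class contains `field--name-field-ceo`.

![](https://i.imgur.com/UxAUb2n.png)

Let's create a `function` that extracts all the CEO names on the first page using a `for` loop. Some companies have no CEO listed, so the function stores `None` for them.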

In [67]:
def name_of_CEOs(company_block):
    CEO_names = []
    for tag in company_block:
        names = tag.find('div',class_='clearfix col-sm-12 field field--name-field-ceo field--type-entity-reference field--label-above')
        try:
            ceo = names.find('a').text
            CEO_names.append(ceo)
        except AttributeError:
            CEO_names.append(None)
    return CEO_names

We can call the function `name_of_CEOs` to get the CEO names.

In [68]:
# Let's call the function
name_of_CEOs(company_block)

['Warren Buffett',
 'David S. Wichmann',
 'Brian Moynihan',
 'Charles W. Scharf',
 'Lee Yuan Siong',
 'David I. McKay',
 'Ma Mingzhe',
 'Gao Yingxin',
 None,
 'Bharat Masrani']
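
The `None` in the output comes from a company block that has no CEO field. `find` returns `None` in that case, and calling `.find('a')` on `None` raises the `AttributeError` that the function catches. As a sketch of the same idea with explicit `None` checks instead of `try`/`except`, here is a version that runs on an invented snippet (the class names `company` and `ceo` are made up and much shorter than the real ones).

```python
from bs4 import BeautifulSoup

# Invented snippet: the second company has no CEO field at all
sample_html = """
<ul>
  <li class="company"><div class="ceo"><a href="/ceo/1">Jane Doe</a></div></li>
  <li class="company"><div class="name">NO CEO LISTED</div></li>
</ul>
"""
blocks = BeautifulSoup(sample_html, "html.parser").find_all("li", class_="company")

ceo_names = []
for block in blocks:
    ceo_div = block.find("div", class_="ceo")  # None when the field is absent
    link = ceo_div.find("a") if ceo_div is not None else None
    ceo_names.append(link.text if link is not None else None)

print(ceo_names)
```

Either style keeps one entry per company, which is what lets the lists line up as columns later.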

## World Rank

Let's create a `function` that extracts all the `world ranks` on the first page using a `for` loop.

In [69]:
def ranks_of_world(company_block):
    world_ranks = []

    for tag in company_block:
        rank = tag.find('div', class_='clearfix col-sm-6 field field--name-field-world-rank-sep-01-2021- field--type-integer field--label-above')

        world_ranks.append(rank.find('div',class_='field--item').text)
    return world_ranks

We can call the function `ranks_of_world` to get the `world ranks`.

In [70]:
ranks_of_world(company_block)

['8', '18', '20', '65', '91', '93', '94', '111', '131', '133']
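
Note that the ranks come back as strings. If they are needed as numbers, for sorting for example, they can be converted with `int`. The sketch below reuses the output above and shows why the difference matters. It is an optional extra step and not part of the original function.

```python
world_ranks = ['8', '18', '20', '65', '91', '93', '94', '111', '131', '133']

# Convert each rank string to an integer
rank_numbers = [int(rank.strip()) for rank in world_ranks]

# Strings sort character by character, so '111' comes before '8'
print(sorted(world_ranks)[:3])

# Integers sort by value
print(sorted(rank_numbers)[:3])
```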

## Market Capitalization

Let's create a `function` that extracts the `market capitalization` of every company on the first page using a `for` loop. The value is converted to a number in billions of US dollars.

In [71]:
def market_caps(company_block):
    market_capitalization_in_dollars = []

    for tag in company_block:
        market_cap = tag.find('div',class_='clearfix col-sm-6 field field--name-field-market-value-jan012021 field--type-float field--label-above')
        try:
            caps = market_cap.find('div',class_='field--item').text
            replace_caps = caps.replace(' Billion USD',"")
            market_capitalization_in_dollars.append(float(replace_caps))
        except AttributeError:
            market_capitalization_in_dollars.append(None)
    return market_capitalization_in_dollars

We can call the function `market_caps` to get the `market capitalization`.

In [72]:
# Let's call the function
market_caps(company_block)

[543.68, 332.73, 262.2, 124.78, 152.33, 116.72, 233.34, 129.25, None, 102.4]
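
The function above catches only `AttributeError`, so a value in an unexpected format would make `float` raise a `ValueError`. As a sketch of a more defensive conversion, the hypothetical helper `parse_amount` below strips the unit and any thousands separators and returns `None` when the text is not a number. The example strings are assumptions based on the formats handled in this notebook.

```python
def parse_amount(text, unit):
    """Turn text like '543.68 Billion USD' into a float, or None if that fails.

    Hypothetical helper: `unit` is the suffix to strip, e.g. 'Billion USD'.
    """
    if text is None:
        return None
    cleaned = text.replace(unit, "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_amount("543.68 Billion USD", "Billion USD"))
print(parse_amount("286,260 Million USD", "Million USD"))
print(parse_amount("N/A", "Billion USD"))
```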

## Annual Revenue

Let's create a `function` that extracts the `annual revenue` of every company on the first page using a `for` loop. The value is converted to a number in millions of US dollars.

In [73]:
def annual_rev(company_block):
    annual_revenue_in_dollars = []

    for tag in company_block:
        annual_revenue = tag.find('div',class_='clearfix col-sm-12 field field--name-field-revenue-in-usd field--type-float field--label-inline')
        try:
            revenue = annual_revenue.find('div',class_='field--item').text
            replace_string = revenue.replace(',',"").replace(' Million USD',"")

            annual_revenue_in_dollars.append(int(replace_string))

        except AttributeError:
            annual_revenue_in_dollars.append(None)
    return annual_revenue_in_dollars

We can call the function `annual_rev` to get the `annual revenue`.

In [74]:
# Let's call the function
annual_rev(company_block)

[286260, 255630, 85530, 72340, 50360, 37367, 166950, 82215, 81730, 34568]

## Number of Employees

Let's create a `function` that extracts the number of `employees` of every company on the first page using a `for` loop.

In [75]:
def employees(company_block):
    no_of_employees = []

    for tag in company_block:
        employee = tag.find('div',class_='clearfix col-sm-12 field field--name-field-employee-count field--type-integer field--label-inline')
        try:
            n_employee = employee.find('div',class_='field--item').text
            replace_string = n_employee.replace(',',"")
            no_of_employees.append(int(replace_string))
        except AttributeError:
            no_of_employees.append(None)
    return no_of_employees

We can call the function `employees` to get the `number of employees`.

In [76]:
# Let's call the function
employees(company_block)

[391500, 320000, 208000, 258700, 23000, 83842, 376900, 309384, 59000, 89598]

## Company URL

![](https://i.imgur.com/umXAZPT.png)

Let's create a `function` that extracts all the `Company URLs` on the first page using a `for` loop.

In [77]:
def extract_urls(company_block):    
    company_urls = []

    for tag in company_block:
        c_url = tag.find('div',class_='clearfix col-sm-12 field field--name-field-company-website field--type-link field--label-above')
        try:

            company_urls.append(c_url.find('a')['href'])
        except AttributeError:
            company_urls.append(None)
    return company_urls

We can call the function `extract_urls` to get the `Company URLs`.

In [78]:
extract_urls(company_block)

['https://www.berkshirehathaway.com/',
 'https://www.unitedhealthgroup.com/',
 'https://www.bankofamerica.com/',
 'https://www.wellsfargo.com/',
 'http://www.aia.com/',
 'https://www.rbcroyalbank.com',
 'http://www.pingan.com/',
 'https://www.boc.cn/en/',
 'https://www.statefarm.com/',
 'https://www.td.com']

So far, we have created `7` functions: `name_of_companies`, `name_of_CEOs`, `ranks_of_world`, `market_caps`, `annual_rev`, `employees`, and `extract_urls`. Together, they give us a way to extract the data from each company block.

## Compile the extracted information into Python lists and dictionaries
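
Each function returns one list, and position `i` in every list refers to the same company. The sketch below shows, on three values taken from the outputs above, how such lists can be kept as columns in a dictionary and also viewed as one record per company. The variable names `columns` and `records` are only for illustration.

```python
# Three columns, using values from the outputs above
columns = {
    'companies_name': ['BERKSHIRE HATHAWAY', 'UNITEDHEALTH GROUP', 'STATE FARM'],
    'CEOs_name': ['Warren Buffett', 'David S. Wichmann', None],
    'world_ranks': ['8', '18', '131'],
}

# zip(*...) walks the lists in parallel, producing one row per company
records = [dict(zip(columns, row)) for row in zip(*columns.values())]

for record in records:
    print(record)
```

This only works because every function appends exactly one value per company block, using `None` when a field is missing.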

## Extract and combine data from multiple pages

Let's create a `dictionary` using all the functions.

![](https://i.imgur.com/cPlfU16.png)

As there are `53` pages on the website, we need to `loop` through all of them to extract the data from every page. The pages are numbered from `0` to `52` in the URL.
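
As a sketch of how the page URLs can be generated, the hypothetical helper `page_url` below builds one URL per page number with `urllib.parse.urlencode`. It assumes that the site accepts the `page` parameter on its own, without the empty filter parameters that appear in the full URL used in this notebook. That assumption has not been verified here.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.value.today/world-top-companies/insurance"


def page_url(page):
    """Build the URL for a zero-based page number (assumes only `page` is needed)."""
    return f"{BASE_URL}?{urlencode({'page': page})}"


# Pages are numbered 0 to 52, so range(53) covers all 53 pages
urls = [page_url(page) for page in range(53)]

print(len(urls))
print(urls[0])
print(urls[-1])
```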

In [79]:
def scrape_page():
    """Scrape all 53 pages and return a dictionary of column lists."""
    all_info_dict = {
            'companies_name':[],
            'CEOs_name':[],
            'world_ranks':[],
            'market_capitalizations_in_billion_dollars':[],
            'annual_revenues_in_million_dollars':[],
            'number_of_employees':[],
            'companies_URLs':[]
            }
    # range(0, 53) already advances the page number, so no manual increment is needed
    for page in range(0, 53):

        url = f"https://www.value.today/world-top-companies/insurance?title=&field_headquarters_of_company_target_id&field_company_category_primary_target_id&field_company_website_uri=&field_market_cap_aug_01_2021__value=&page={page}"
        company_block = get_page(url).find_all('li',class_='row well clearfix')

        # Append the values from this page to each column
        all_info_dict['companies_name'] += name_of_companies(company_block)
        all_info_dict['CEOs_name'] += name_of_CEOs(company_block)
        all_info_dict['world_ranks'] += ranks_of_world(company_block)
        all_info_dict['market_capitalizations_in_billion_dollars'] += market_caps(company_block)
        all_info_dict['annual_revenues_in_million_dollars'] += annual_rev(company_block)
        all_info_dict['number_of_employees'] += employees(company_block)
        all_info_dict['companies_URLs'] += extract_urls(company_block)
    return all_info_dict

In [81]:
# Create pandas dataframe from dictionary
import pandas as pd
scrape_page_dataframe = pd.DataFrame(scrape_page())
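
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which would happen if one function skipped a company instead of appending `None`. As a sketch of a quick sanity check that could be run before building the DataFrame, the hypothetical helper `check_equal_lengths` below reports the length of every column when they disagree.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists, or raise ValueError if they differ.

    Hypothetical helper, not part of pandas.
    """
    lengths = {key: len(values) for key, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


good = {'companies_name': ['A', 'B'], 'CEOs_name': ['X', None]}
bad = {'companies_name': ['A', 'B'], 'CEOs_name': ['X']}

print(check_equal_lengths(good))

try:
    check_equal_lengths(bad)
except ValueError as error:
    print(error)
```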

In [82]:
# Let's view the first and last 5 rows
scrape_page_dataframe

| | companies_name | CEOs_name | world_ranks | market_capitalizations_in_billion_dollars | annual_revenues_in_million_dollars | number_of_employees | companies_URLs |
|---|---|---|---|---|---|---|---|
| 0 | BERKSHIRE HATHAWAY | Warren Buffett | 8 | 543.680 | 286260.0 | 391500.0 | https://www.berkshirehathaway.com/ |
| 1 | UNITEDHEALTH GROUP | David S. Wichmann | 18 | 332.730 | 255630.0 | 320000.0 | https://www.unitedhealthgroup.com/ |
| 2 | BANK OF AMERICA CORPORATION | Brian Moynihan | 20 | 262.200 | 85530.0 | 208000.0 | https://www.bankofamerica.com/ |
| 3 | WELLS FARGO & COMPANY | Charles W. Scharf | 65 | 124.780 | 72340.0 | 258700.0 | https://www.wellsfargo.com/ |
| 4 | AIA GROUP | Lee Yuan Siong | 91 | 152.330 | 50360.0 | 23000.0 | http://www.aia.com/ |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 522 | NOVUS ACQUISITION & DEVELOPMENT | | 36479 | 0.005 | | | |
| 523 | INSR INSURANCE GROUP ASA | | 36703 | 0.010 | | | |
| 524 | ATLAS FINANCIAL HOLDINGS, INC. | | 37108 | | | | |
| 525 | HEALTH REVENUE ASSURANCE HOLDINGS, INC. | | 37682 | | | | |


## Save the extracted information to a CSV file

In [83]:
scrape_page_dataframe.to_csv('scrape_page_dataframe.csv',index=None)
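
To make the resulting file layout concrete, the sketch below writes a tiny dictionary of columns as CSV text. It deliberately uses the standard library `csv` module and an in-memory buffer instead of pandas, so it is an illustration of the format rather than the method used in this notebook. Missing values written as `None` become empty fields, which is also how pandas writes missing values by default.

```python
import csv
import io

# Two rows with values taken from the outputs above
columns = {
    'companies_name': ['BERKSHIRE HATHAWAY', 'STATE FARM'],
    'CEOs_name': ['Warren Buffett', None],
    'world_ranks': ['8', '131'],
}

buffer = io.StringIO()  # in-memory stand-in for a file on disk
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(columns.keys())            # header row
writer.writerows(zip(*columns.values()))   # one row per company

csv_text = buffer.getvalue()
print(csv_text)
```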

## Summary

Here's what we've covered in this notebook:

1. Downloaded the webpage using `requests`
2. Parsed the HTML source code using `beautifulsoup4`
3. Extracted company names, CEOs, world ranks, market capitalization, annual revenue, number of employees, and company URLs
4. Compiled the extracted information into Python lists and dictionaries
5. Extracted and combined data from multiple pages
6. Saved the extracted information to a CSV file.


The CSV file we created has this format:

![](https://i.imgur.com/a0Bllr6.png)

Here's the complete code for this project:

In [84]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_page(url):
    """Download a web page and return a BeautifulSoup document."""
    # Download the web page
    response = requests.get(url)
    # Check if the download was successful
    if response.status_code != 200:
        raise Exception('Unable to download page {}'.format(url))
    # Parse the page HTML into a BeautifulSoup document
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc


def name_of_companies(company_block):
    company_names = []
    for tag in company_block:
        c_name = tag.find('div',class_='field field--name-node-title field--type-ds field--label-hidden field--item')

        company_names.append(c_name.find('a').text)
    return company_names


def name_of_CEOs(company_block):
    CEO_names = []
    for tag in company_block:
        names = tag.find('div',class_='clearfix col-sm-12 field field--name-field-ceo field--type-entity-reference field--label-above')
        try:
            ceo = names.find('a').text
            CEO_names.append(ceo)
        except AttributeError:
            CEO_names.append(None)
    return CEO_names


def ranks_of_world(company_block):
    world_ranks = []

    for tag in company_block:
        rank = tag.find('div', class_='clearfix col-sm-6 field field--name-field-world-rank-sep-01-2021- field--type-integer field--label-above')

        world_ranks.append(rank.find('div',class_='field--item').text)
    return world_ranks


def market_caps(company_block):
    market_capitalization_in_dollars = []

    for tag in company_block:
        market_cap = tag.find('div',class_='clearfix col-sm-6 field field--name-field-market-value-jan012021 field--type-float field--label-above')
        try:
            caps = market_cap.find('div',class_='field--item').text
            replace_caps = caps.replace(' Billion USD',"")
            market_capitalization_in_dollars.append(float(replace_caps))
        except AttributeError:
            market_capitalization_in_dollars.append(None)
    return market_capitalization_in_dollars


def annual_rev(company_block):
    annual_revenue_in_dollars = []

    for tag in company_block:
        annual_revenue = tag.find('div',class_='clearfix col-sm-12 field field--name-field-revenue-in-usd field--type-float field--label-inline')
        try:
            revenue = annual_revenue.find('div',class_='field--item').text
            replace_string = revenue.replace(',',"").replace(' Million USD',"")

            annual_revenue_in_dollars.append(int(replace_string))

        except AttributeError:
            annual_revenue_in_dollars.append(None)
    return annual_revenue_in_dollars


def employees(company_block):
    no_of_employees = []

    for tag in company_block:
        employee = tag.find('div',class_='clearfix col-sm-12 field field--name-field-employee-count field--type-integer field--label-inline')
        try:
            n_employee = employee.find('div',class_='field--item').text
            replace_string = n_employee.replace(',',"")
            no_of_employees.append(int(replace_string))
        except AttributeError:
            no_of_employees.append(None)
    return no_of_employees



def extract_urls(company_block):    
    company_urls = []

    for tag in company_block:
        c_url = tag.find('div',class_='clearfix col-sm-12 field field--name-field-company-website field--type-link field--label-above')
        try:

            company_urls.append(c_url.find('a')['href'])
        except AttributeError:
            company_urls.append(None)
    return company_urls



def scrape_page():
    """Scrape all 53 pages and return a dictionary of column lists."""
    # The keys must match the ones used in the loop below
    all_info_dict = {
            'companies_name':[],
            'CEOs_name':[],
            'world_ranks':[],
            'market_capitalizations_in_billion_dollars':[],
            'annual_revenues_in_million_dollars':[],
            'number_of_employees':[],
            'companies_URLs':[]
            }
    # range(0, 53) already advances the page number, so no manual increment is needed
    for page in range(0, 53):

        url = f"https://www.value.today/world-top-companies/insurance?title=&field_headquarters_of_company_target_id&field_company_category_primary_target_id&field_company_website_uri=&field_market_cap_aug_01_2021__value=&page={page}"
        company_block = get_page(url).find_all('li',class_='row well clearfix')

        all_info_dict['companies_name'] += name_of_companies(company_block)
        all_info_dict['CEOs_name'] += name_of_CEOs(company_block)
        all_info_dict['world_ranks'] += ranks_of_world(company_block)
        all_info_dict['market_capitalizations_in_billion_dollars'] += market_caps(company_block)
        all_info_dict['annual_revenues_in_million_dollars'] += annual_rev(company_block)
        all_info_dict['number_of_employees'] += employees(company_block)
        all_info_dict['companies_URLs'] += extract_urls(company_block)
    return all_info_dict

## Future Work

* We can now fetch individual topic pages and get the list of top insurance companies
* We can scrape the page to get additional information
* We can use this data for further analysis
* We can extract the data for two or more audit months and compare them

## References

* [Jovian](https://jovian.ai/): a platform to learn data science

* This project was made under the guidance of [Aakash N S](https://aakashns.medium.com/)

* A YouTube video by `Aakash N S`: [Let's Build a Python Web Scraping Project from Scratch | Hands-On Tutorial](https://www.youtube.com/watch?v=RKsLLG-bzEY&t=6677s)

* [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

* [Pandas Documentation](https://pandas.pydata.org/docs/)

In [85]:
jovian.commit(files = ['scrape_page_dataframe.csv'])

[jovian] Updating notebook "pankajthakur3999/web-scrapping-of-top-insurance-companies" on https://jovian.ai
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/pankajthakur3999/web-scrapping-of-top-insurance-companies


'https://jovian.ai/pankajthakur3999/web-scrapping-of-top-insurance-companies'
