# Data extraction

**This project scrapes a company's details and board members from [`this page`](https://www.asx.com.au/asx/share-price-research/company/CBA/details) on the ASX website.**

First, we import the `required libraries`. This project uses `selenium` and `BeautifulSoup`, two libraries commonly used for web scraping, along with `pandas` for handling the tables.

In [1]:
from selenium import webdriver
import chromedriver_binary
from bs4 import BeautifulSoup
import pandas as pd

Next, get the content of the page in `html` format

In [2]:
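# Start Chrome. Note: `executable_path` works on Selenium 3; recent Selenium 4 releases
# expect `service=Service(path)` (from selenium.webdriver.chrome.service) instead
driver = webdriver.Chrome(executable_path=r'C:\Users\Admin PC\Desktop\Project\chromedriver_win32\chromedriver.exe')

# Load the company details page
driver.get('https://www.asx.com.au/asx/share-price-research/company/CBA/details')

# Print the rendered page source
print(driver.page_source)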

<html xmlns:ng="https://angularjs.org" class="js flexbox flexboxlegacy canvas canvastext webgl no-touch geolocation postmessage websqldatabase indexeddb hashchange history draganddrop websockets rgba hsla multiplebgs backgroundsize borderimage borderradius boxshadow textshadow opacity cssanimations csscolumns cssgradients cssreflections csstransforms csstransforms3d csstransitions fontface generatedcontent video audio localstorage sessionstorage webworkers applicationcache svg inlinesvg smil svgclippaths no-mobile no-phone no-tablet mobilegradea ng-scope" ng-app="companyInfoApp" id="ng-app"><head><script src="https://securepubads.g.doubleclick.net/gpt/pubads_impl_rendering_2020012701.js"></script><script type="text/javascript" async="async" src="https://dpm.demdex.net/id?d_visid_ver=1.5.7&amp;d_rtbd=json&amp;d_ver=2&amp;d_orgid=FD20401053DA8A4F0A490D4C%40AdobeOrg&amp;d_nsid=0&amp;d_mid=04643428276588511455408225277379410280&amp;d_cb=s_c_il%5B0%5D._setAudienceManagerFields"></script><st

In [3]:
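# Parse the rendered HTML; naming the parser avoids bs4's "no parser specified" warning
soup = BeautifulSoup(driver.page_source, 'html.parser')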

After inspecting the page, we know that the information we want is stored in `table` tags. Therefore, we find all the tables and convert each of them into a `dataframe`.

In [4]:
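from io import StringIO

# Find all the tables on the page
table = soup.find_all('table')

# Read all tables into dataframes (StringIO avoids the pandas deprecation warning for literal HTML strings)
df = pd.read_html(StringIO(str(table)))  # df is a list of dataframes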
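
To see what this step produces without launching a browser, here is a minimal sketch on a small hand-written HTML snippet. The snippet and the helper name `table_to_frame` are made up for illustration (the rows are borrowed from the outputs in this notebook), and the rows are collected by hand with BeautifulSoup rather than with `pd.read_html`, so it runs without an extra HTML parser installed.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written stand-in for the rendered page (not the real ASX markup)
html = """
<table>
  <tr><td>Issuer code</td><td>CBA</td></tr>
  <tr><td>GICS industry group</td><td>Banks</td></tr>
</table>
<table>
  <tr><td>Mr Matthew Comyn</td><td>Managing Director, CEO, Director</td></tr>
</table>
"""

def table_to_frame(table):
    """Turn one <table> tag into a DataFrame, one row per <tr>."""
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        for tr in table.find_all("tr")
    ]
    return pd.DataFrame(rows)

soup = BeautifulSoup(html, "html.parser")
frames = [table_to_frame(t) for t in soup.find_all("table")]

print(len(frames))   # number of tables found
print(frames[0])     # first table as a dataframe
```

As in the notebook, the result is a list with one dataframe per `table` tag, and the columns get the default labels `0` and `1` because the tables have no header row.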

Inspect our `df` variable to make sure we captured the right information

In [7]:
# Print all the tables
df

[                            0  \
 0                 Issuer code   
 1       Official listing date   
 2                 Fiscal year   
 3         GICS industry group   
 4             Exempt foreign?   
 5            Internet address   
 6   Registered office address   
 7       Head office telephone   
 8             Head office fax   
 9              Share registry   
 10   Share registry telephone   
 
                                                     1  
 0                                                 CBA  
 1                                          12/09/1991  
 2                                                 NaN  
 3                                               Banks  
 4                                                  No  
 5                         http://www.commbank.com.au/  
 6   Ground Floor, Tower 1, 201 Sussex Street, SYDN...  
 7                                      (02) 9378 2000  
 8                                      (02) 9118 7192  
 9   LINK MARKET SER

***We can see that there are 3 dataframes in our list. Let's have a look at each dataframe!***
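
A quick way to confirm the number and size of the tables before opening them one by one is to loop over the list. The sketch below uses a stand-in list named `tables`, built from a few rows of this notebook's outputs, in place of the real `df`; the variable names are illustrative only.

```python
import pandas as pd

# Stand-in for the list returned by pd.read_html (a few rows from this notebook's outputs)
tables = [
    pd.DataFrame([["Issuer code", "CBA"], ["GICS industry group", "Banks"]]),
    pd.DataFrame([["Mr Matthew Comyn", "Managing Director, CEO, Director"]]),
    pd.DataFrame([["Ms Kristy Huxtable", "Company Secretary"]]),
]

# One entry per table: position in the list, (rows, columns), and the first cell
summary = [(i, t.shape, t.iloc[0, 0]) for i, t in enumerate(tables)]
for position, shape, first_cell in summary:
    print(position, shape, first_cell)
```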

In [9]:
# First dataframe
df[0]

|   | 0 | 1 |
|---|---|---|
| 0 | Issuer code | CBA |
| 1 | Official listing date | 12/09/1991 |
| 2 | Fiscal year | NaN |
| 3 | GICS industry group | Banks |
| 4 | Exempt foreign? | No |
| 5 | Internet address | http://www.commbank.com.au/ |
| 6 | Registered office address | Ground Floor, Tower 1, 201 Sussex Street, SYDN... |
| 7 | Head office telephone | (02) 9378 2000 |
| 8 | Head office fax | (02) 9118 7192 |
| 9 | Share registry | LINK MARKET SERVICES LTD LEVEL 12, 680 GEORGE... |


In [10]:
# Second dataframe
df[1]

|   | 0 | 1 |
|---|---|---|
| 0 | Ms Catherine Livingstone | Chairman, Non Exec. Director |
| 1 | Mr Matthew Comyn | Managing Director, CEO, Director |
| 2 | Mr Shirish Apte | Non Exec. Director |
| 3 | Prof. Genevieve Bell | Non Exec. Director |
| 4 | Mr Paul O'Malley | Non Exec. Director |
| 5 | Ms Mary Padbury | Non Exec. Director |
| 6 | Ms Wendy Stops | Non Exec. Director |
| 7 | Ms Anne Templeman-Jones | Non Exec. Director |
| 8 | Mr Rob Whitfield | Non Exec. Director |


In [11]:
# Last dataframe in list
df[2]

|   | 0 | 1 |
|---|---|---|
| 0 | Ms Kristy Huxtable | Company Secretary |
| 1 | Ms Kara Nicholls | Company Secretary |


Now that we have all the information we want, the last step is to export the `dataframes` to an `excel workbook`, with each sheet storing a different set of information about the company.

In [13]:
# Export all the tables into excel workbook
with pd.ExcelWriter('Company details.xlsx') as writer:  
    df[0].to_excel(writer, sheet_name='Service_info')
    df[1].to_excel(writer, sheet_name='Board_of_directors')
    df[2].to_excel(writer, sheet_name='Secretaries')
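
Because the tables have no header row, the exported sheets carry the default column labels `0` and `1` plus the row index. One possible refinement, sketched here with made-up column names (`Field`/`Value`, `Name`/`Role`) and two hypothetical helpers (`label_columns`, `export_workbook`), is to name the columns first and pass `index=False` when writing. The sample rows are taken from this notebook's outputs.

```python
import pandas as pd

def label_columns(frame, names):
    """Return a copy of `frame` with its columns renamed to `names`."""
    labelled = frame.copy()
    labelled.columns = names
    return labelled

def export_workbook(sheets, path):
    """Write each {sheet name: DataFrame} pair to its own sheet, without the index."""
    with pd.ExcelWriter(path) as writer:
        for sheet_name, frame in sheets.items():
            frame.to_excel(writer, sheet_name=sheet_name, index=False)

# Stand-in rows (illustrative subset of the scraped tables)
details = label_columns(pd.DataFrame([["Issuer code", "CBA"]]), ["Field", "Value"])
directors = label_columns(
    pd.DataFrame([["Mr Matthew Comyn", "Managing Director, CEO, Director"]]),
    ["Name", "Role"],
)
print(details)
```

Calling `export_workbook({'Service_info': details, 'Board_of_directors': directors}, 'Company details.xlsx')` would then write the labelled sheets; writing `.xlsx` files requires an Excel engine such as `openpyxl` to be installed.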

The workbook is saved in the current working directory, which here is the project directory.

## Thanks for reading!