prototype_ngo_scraping_single_ngo

# Prototype NGO Scraping: Scraping Information for a Single NGO
The objective of this notebook is to experiment with scraping the NGO website in order to find a good way to scrape the full list of NGOs.

## imports

In [1]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)  # to mute DeprecationWarnings

#-------FOR WORKING WITH DATA IN A DATAFRAME--------

import pandas as pd #To store scraped data

#-------SCRAPING SPECIFIC MODULES--------
from selenium import webdriver #to automate the navigating within the browser
# Selenium needs a browser-specific driver to interact with the browser. The driver can be installed manually, but it then also has to be placed on the PATH to be accessible to Python. webdriver-manager takes care of both steps.
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys # to send keyboard keys (such as Enter) to web elements
from selenium.webdriver.support.ui import Select # to choose options in drop-down menus on the website
from selenium.webdriver.support.ui import WebDriverWait # to wait explicitly until a condition is met, rather than for a hard-coded time
from selenium.webdriver.common.by import By # to specify how an element is located, such as search-by-id or search-by-xpath
from selenium.webdriver.support import expected_conditions as EC # this allows us to specify that we're expecting certain elements to be present on the webpage, such as a close-button, and to specify conditions concerning those
from selenium.webdriver.chrome.options import Options # to configure the behaviour of the Chrome browser

#----------MISCELLANEOUS----------------------------
import random # to pick items at random
import time # to add hard-coded sleep times, as well as to time the script

## noting start time, to use for timing the code

In [2]:
start_time = time.time() # noting the time at which this command is executed, and storing it as the "start time", in order to time the code.

## scraping

### set-up

#### setting up selenium to use chrome as the browser

In [3]:
options = Options() # to modify the behaviour of the browser we're going to use, and store those modifications
options.headless = False # True hides the navigating of the browser by the scraper, False shows you the tab/window opening and stuff getting clicked
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)



Current google-chrome version is 99.0.4844
Get LATEST chromedriver version for 99.0.4844 google-chrome
Driver [/Users/garima/.wdm/drivers/chromedriver/mac64/99.0.4844.51/chromedriver] found in cache


#### specifying the url to be scraped

In [4]:
base_url = 'https://ngodarpan.gov.in/index.php/home/sectorwise' # this is the url to the homepage with the list of 43 sectors

In [5]:
driver.get(base_url) # driver is the browser; get opens the URL given in the parentheses

### obtaining links to the pages containing NGOs for each sector
The organization of the website is as follows: 
- the homepage consists of a list of 43 sectors, each element in the list is a hyperlink to the "sector page" 
- on clicking on a sector on the homepage, we arrive at the sector page, which consists of a summary table of NGOs under that sector
- the summary table is several pages long, and the default setting is to display 10 table entries per page
- each entry in the summary table is a hyperlink to a pop-up box containing detailed information about the NGO in the table entry

#### collect all the URLs to each of the 43 sectors and store them in a list
These URLs are independent of each other; that is, each sector page can be opened directly without working through the others in sequence.

In [6]:
sector_urls = [] # initialising a list to store the sector-wise URLs. As of 28/Feb/2022, the page has 43 sectors, so we should end up with a list of length 43 after the completion of this step.

In [7]:
sector_elems = driver.find_elements_by_class_name('bluelink11px') # each of the sector hyperlinks is stored under the class name 'bluelink11px' in the HTML code for the webpage

In [8]:
for elem in sector_elems: 
    sector_url = elem.get_attribute('href')
    sector_urls.append(sector_url)

#### select a sector from which to pick an NGO to scrape

- picking the 10th sector to scrape
- also, changing the default number of entries displayed per page to 100

In [9]:
test_url = sector_urls[9] # 9 here refers to the 10th sector, since python has zero-indexing
test_url = test_url + '?per_page=100' # the default number of entries displayed per page is 10, adding "?per_page=100" to the url changes it to 100 per page. 

In [10]:
driver.get(test_url) # this opens up the browser to the "sector page" for the 10th sector

#### getting total number of pages with NGOs for the selected sector
- the number of table entries varies by sector, so the total number of pages varies too. We extract and store the total number of pages for the sector being scraped.
- the page number is available in the URL itself, so it is most convenient to extract it from there. An example URL is "https://ngodarpan.gov.in/index.php/home/sectorwise_ngo/18167/7/22?per_page=100". Here, "https://ngodarpan.gov.in/index.php/home/sectorwise_ngo/" is the base URL, "18167/7" is a code that changes with each sector, and "22" is the current page number within the sector. To extract the page number, we split the URL at "?", getting part 1: "https://ngodarpan.gov.in/index.php/home/sectorwise_ngo/18167/7/22" and part 2: "per_page=100". We then split part 1 once from the right at "/", getting part 1.1: "https://ngodarpan.gov.in/index.php/home/sectorwise_ngo/18167/7" and part 1.2: "22". Part 1.2 is the page number we want.
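
The splitting steps described above can be tried on the example URL without opening a browser. This is a minimal sketch; the function name `extract_page_number` is made up for illustration and assumes the page number is always the last path segment, as in the example.

```python
def extract_page_number(url: str) -> int:
    """Return the page number stored as the last path segment of a sector-page URL."""
    path = url.split('?', 1)[0]             # part 1: everything before the query string
    last_segment = path.rsplit('/', 1)[1]   # part 1.2: the text after the final "/"
    return int(last_segment)


example_url = "https://ngodarpan.gov.in/index.php/home/sectorwise_ngo/18167/7/22?per_page=100"
page_number = extract_page_number(example_url)
print(page_number)  # prints 22
```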

In [11]:
page_num = driver.find_elements_by_partial_link_text('Last') # the last page for a given sector is denoted by 'Last' so looking for the last-page element using that text
page_num[0].click() # clicking on the button for the last page
last_url = driver.current_url # we've now been taken to the last page for the sector, the page number is present in the URL so storing the URL to extract the page number from it
last_page_num = int(last_url.split('?')[0].rsplit('/',1)[1]) # extracting page number from the URL as described in the markdown cell above

#### selecting NGO data to scrape from a randomly selected page

In [12]:
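page_to_scrape = random.randint(1, last_page_num) # picking a page at random; randint includes both endpoints, so this also works when the sector has only one page
scrape_url = test_url.split('?', 1)[0].rsplit('/', 1)[0] + f'/{page_to_scrape}?per_page=100' # replacing the last path segment of the URL (the page number) with the page to scrape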
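
The page substitution can also be isolated in a small helper that is easy to check on its own. This is a sketch under the assumption that the last path segment of a sector URL is the page number; `build_page_url` is a hypothetical name, not something the website or Selenium provides.

```python
def build_page_url(sector_url: str, page: int, per_page: int = 100) -> str:
    """Return the sector URL pointing at the given page, with per_page entries per page."""
    path = sector_url.split('?', 1)[0]   # drop any existing query string
    prefix = path.rsplit('/', 1)[0]      # drop the current page number (last path segment)
    return f"{prefix}/{page}?per_page={per_page}"


first_page = "https://ngodarpan.gov.in/index.php/home/sectorwise_ngo/18167/7/1?per_page=100"
page_22 = build_page_url(first_page, 22)
print(page_22)
```

Splitting on "/" rather than slicing off a fixed number of characters keeps the helper correct when the current page number has more than one digit.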

In [13]:
driver.get(scrape_url) # opening up the page to scrape with the browser

#### getting links of all NGOs on the selected page

In [14]:
ngo_list_on_page= driver.find_elements_by_xpath("//a[contains(@onclick,'show_ngo_info')]") # finding all elements on the page that contain "show_ngo_info", this returns the list of NGO "elements" on the given page
len(ngo_list_on_page) #since we change the setting to 100 per page, we expect this to be 100, unless it's the last page for the sector

100

#### scraping data for a single NGO

##### clicking the link for the NGO to get the information popup

In [15]:
ngo_list_on_page[0].click()  # going to the first NGO element on the page, this will open the pop-up info-box for the NGO
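
The pop-up may need a moment to fill in after the click (an assumption; the notebook does not confirm how the page loads its data), in which case fields read immediately afterwards could come back empty. Below is a minimal polling sketch in plain Python; `wait_until` and `popup_ready` are hypothetical names. Selenium's `WebDriverWait(driver, timeout).until(...)`, imported at the top of this notebook, serves the same purpose against a real browser.

```python
import time


def wait_until(predicate, timeout=10.0, poll_interval=0.25):
    """Call predicate() repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "the pop-up title is no longer empty": becomes true on the third check.
attempts = {"count": 0}

def popup_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3

ready = wait_until(popup_ready, timeout=2.0, poll_interval=0.01)
print(ready, attempts["count"])  # prints: True 3
```

In the notebook, the predicate could be something like `lambda: driver.find_element_by_id('ngo_name_title').get_attribute('innerHTML').strip()`, so that extraction starts only once the name field is populated.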

##### getting information from the pop-up for all attributes that are not variable
The Name, Unique ID, Registration, FCRA and Contact Details tables have a fixed layout, that is, they contain the same fields for every NGO.
The Members and Source of Funds tables, in contrast, have a different number of entries for each NGO.

In [16]:
name = driver.find_element_by_id('ngo_name_title').get_attribute('innerHTML')
uid = driver.find_element_by_id('UniqueID').get_attribute('innerHTML')
reg_with = driver.find_element_by_id('reg_with').get_attribute('innerHTML')
ngo_type = driver.find_element_by_id('ngo_type').get_attribute('innerHTML')
ngo_regno = driver.find_element_by_id('ngo_regno').get_attribute('innerHTML')
rc_upload = driver.find_element_by_id('rc_upload').get_attribute('innerHTML')
pc_upload = driver.find_element_by_id('pc_upload').get_attribute('innerHTML')
act_name = driver.find_element_by_id('ngo_act_name').get_attribute('innerHTML')
city_reg = driver.find_element_by_id('ngo_city_p').get_attribute('innerHTML')
state_reg = driver.find_element_by_id('ngo_state_p').get_attribute('innerHTML')
reg_date = driver.find_element_by_id('ngo_reg_date').get_attribute('innerHTML')
key_issues = driver.find_element_by_id('key_issues').get_attribute('innerHTML')
operational_states = driver.find_element_by_id('operational_states').get_attribute('innerHTML')
operational_districts = driver.find_element_by_id('operational_district').get_attribute('innerHTML')
fcra_details = driver.find_element_by_id('FCRA_details').get_attribute('innerHTML')
fcra_regno = driver.find_element_by_id('FCRA_reg_no').get_attribute('innerHTML')
details_achievement = driver.find_element_by_id('activities_achieve').get_attribute('innerHTML')
contact_address = driver.find_element_by_id('address').get_attribute('innerHTML')
contact_city = driver.find_element_by_id('city').get_attribute('innerHTML')
contact_state = driver.find_element_by_id('state_p_ngo').get_attribute('innerHTML')
contact_telephone = driver.find_element_by_id('phone_n').get_attribute('innerHTML')
contact_mobile = driver.find_element_by_id('mobile_n').get_attribute('innerHTML')
contact_website = driver.find_element_by_id('ngo_web_url').get_attribute('innerText')
contact_email = driver.find_element_by_id('email_n').get_attribute('innerHTML')
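
Because every fixed field is read the same way, the repetition can be replaced by a mapping from output name to element id. The sketch below shows the idea on a fake page so that it runs without a browser. `FIELD_IDS` lists only three of the ids used above, and `collect_fields`, `fake_page` and the "Example NGO" value are made up for illustration.

```python
# Output column name -> HTML element id (a subset of the ids used in the cell above).
FIELD_IDS = {
    "ngo_name": "ngo_name_title",
    "unique_id": "UniqueID",
    "registered_with": "reg_with",
}


def collect_fields(field_ids, read_element):
    """Read every field via read_element(element_id) and return a dict keyed by column name."""
    return {column: read_element(element_id) for column, element_id in field_ids.items()}


# Stand-in for the browser: element id -> innerHTML (placeholder values).
fake_page = {
    "ngo_name_title": "Example NGO",
    "UniqueID": "AP/2016/0112003",
    "reg_with": "Registrar of Societies",
}

record = collect_fields(FIELD_IDS, fake_page.get)
print(record)
```

With Selenium, `read_element` could be `lambda element_id: driver.find_element_by_id(element_id).get_attribute('innerHTML')`.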

##### extracting details from the Members table
The number of members differs across NGOs, so the table has variable length. When extracted from the HTML, the table data comes as a flat list. The table contains 'n' rows and 4 columns (name, designation, PAN availability, and Aadhar availability), so the extracted data is a list of length 4n:
- starting from the 1st element, every 4th element in the list is the name of a member,
- starting from the 2nd element, every 4th element is the designation of the member,
- starting from the 3rd element, every 4th element is the PAN availability of the member,
- starting from the 4th element, every 4th element is the Aadhar availability of the member.

So, I exploit this structure to extract information from this table.
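
The structure can be illustrated on a plain list before applying it to the scraped cells. This is a small sketch with placeholder values (the member names and designations are invented); `rows_from_flat_cells` is a hypothetical helper name.

```python
def rows_from_flat_cells(cells, n_columns):
    """Group a flat list of table cells into rows of n_columns cells each."""
    if len(cells) % n_columns != 0:
        raise ValueError("number of cells is not a multiple of the number of columns")
    return [cells[i:i + n_columns] for i in range(0, len(cells), n_columns)]


# Two members, four columns each: name, designation, PAN availability, Aadhar availability.
flat_cells = [
    "Member A", "Chairman", "Available", "Available",
    "Member B", "Secretary", "Available", "Not Available",
]

rows = rows_from_flat_cells(flat_cells, 4)
names = flat_cells[0::4]          # every 4th element, starting from the 1st
designations = flat_cells[1::4]   # every 4th element, starting from the 2nd
print(rows)
print(names, designations)
```

Slicing with a stride gives one list per column, while grouping gives one list per member; both rely on the same fixed column count.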

In [17]:
members_table = driver.find_element_by_id('member_table')
member_cells = [i.get_attribute('innerHTML') for i in members_table.find_elements_by_xpath('.//tr//td')] # extracting all table cells once, as a flat list
member_names = member_cells[::4] # every 4th cell, starting from the 1st
member_designations = member_cells[1::4] # every 4th cell, starting from the 2nd
member_pan = member_cells[2::4] # every 4th cell, starting from the 3rd
member_aadhar = member_cells[3::4] # every 4th cell, starting from the 4th
member_name_designation_dict = dict(zip(member_names, member_designations)) # storing member names and their designations as dictionaries
member_name_pan_dict = dict(zip(member_names, member_pan)) # storing member names and their PAN availabilities as dictionaries
member_name_aadhar_dict = dict(zip(member_names, member_aadhar)) # storing member names and their Aadhar availabilities as dictionaries

##### extracting details from the Source of Funds table
The number of sources differs across NGOs, so the table has variable length. When extracted from the HTML, the table data comes as a flat list. The table contains 'n' rows and 5 columns (department name, source, financial year, amount sanctioned and purpose), so the extracted data is a list of length 5n:
- starting from the 1st element, every 5th element in the list is the department name of the source,
- starting from the 2nd element, every 5th element in the list is the source,
- starting from the 3rd element, every 5th element in the list is the financial year of the source,
- starting from the 4th element, every 5th element in the list is the amount sanctioned by the source,
- starting from the 5th element, every 5th element in the list is the purpose of the funds.

So, I exploit this structure to extract information from this table.

In [18]:
sof_table = driver.find_element_by_id('source_table')
sof_cells = [i.get_attribute('innerHTML') for i in sof_table.find_elements_by_xpath('.//tr//td')] # extracting the table essentially as a vector (a flat list of cells)
dept_name = sof_cells[::5] # every 5th cell, starting from the 1st
source = sof_cells[1::5] # every 5th cell, starting from the 2nd
financial_year = sof_cells[2::5] # every 5th cell, starting from the 3rd
amount_sanctioned = sof_cells[3::5] # every 5th cell, starting from the 4th
purpose = sof_cells[4::5] # every 5th cell, starting from the 5th
# the dictionaries below are keyed by financial year, so if a year appears in more than one row only the last row for that year is kept
year_amount_dict = dict(zip(financial_year, amount_sanctioned))
year_dept_dict = dict(zip(financial_year, dept_name))
year_source_dict = dict(zip(financial_year, source))
year_purpose_dict = dict(zip(financial_year, purpose))
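
A dictionary keyed by financial year can hold only one row per year. If an NGO lists several sources for the same year, storing one record per table row keeps all of them. The sketch below is an alternative under that assumption, not what this notebook does. `records_from_flat_cells` is a hypothetical helper and the cell values are placeholders.

```python
SOURCE_COLUMNS = ["department_name", "source", "financial_year", "amount_sanctioned", "purpose"]


def records_from_flat_cells(cells, columns):
    """Turn a flat list of table cells into one dict per table row."""
    n = len(columns)
    return [dict(zip(columns, cells[i:i + n])) for i in range(0, len(cells), n)]


# Two rows that share the same financial year (placeholder values).
flat_cells = [
    "Dept X", "Government", "2014-2015", "1000", "Purpose one",
    "Dept Y", "Any Other", "2014-2015", "2500", "Purpose two",
]

records = records_from_flat_cells(flat_cells, SOURCE_COLUMNS)
by_year = dict(zip(flat_cells[2::5], flat_cells[3::5]))  # year -> amount, as in the cell above

print(len(records))  # prints 2: both rows are kept
print(by_year)       # prints {'2014-2015': '2500'}: the first row's amount is overwritten
```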

#### storing data into a dataframe

All the data is now assigned to variables. Next, we create a dataframe and store each variable in its own column.

In [19]:
df = pd.DataFrame() # creating an empty dataframe to write the data into

In [20]:
df['ngo_name'] = [name]
df['unique_id'] = uid
df['registered_with'] = reg_with
df['type_of_ngo'] = ngo_type
df['registration_number'] = ngo_regno
df['copy_of_registration_certificate'] = rc_upload
df['copy_of_pan_card'] = pc_upload
df['act_name'] = act_name
df['city_of_registration'] = city_reg
df['state_of_registration'] = state_reg
df['registration_date'] = reg_date
df['key_issues'] = key_issues
df['operational_areas_states'] = operational_states
df['operational_areas_districts'] = operational_districts
df['FCRA_details'] = fcra_details
df['FCRA_registration_num'] = fcra_regno
df['details_of_achievement'] = details_achievement
df['contact_details_address'] = contact_address
df['contact_details_city'] = contact_city
df['contact_details_state'] = contact_state
df['contact_details_telephone'] = contact_telephone # note: contact_mobile is scraped above but is not stored in a column here
df['contact_details_website'] = contact_website
df['contact_details_email'] = contact_email
df['members_names_designations'] = [member_name_designation_dict]
df['members_names_pan_availability'] = [member_name_pan_dict]
df['members_names_aadhar_availability'] = [member_name_aadhar_dict]
df['source_of_funds_amount_sanctioned'] = [year_amount_dict]
df['source_of_funds_department_name'] = [year_dept_dict]
df['source_of_funds_source'] = [year_source_dict]
df['source_of_funds_purpose'] = [year_purpose_dict]

## noting end time to see how long it took to scrape a single NGO's data

In [21]:
print("--- The script took  %s seconds ---" % (time.time() - start_time))

--- The script took  14.198443174362183 seconds ---


## displaying output

In [22]:
df

| | ngo_name | unique_id | registered_with | type_of_ngo | registration_number | copy_of_registration_certificate | copy_of_pan_card | act_name | city_of_registration | state_of_registration | ... | contact_details_telephone | contact_details_website | contact_details_email | members_names_designations | members_names_pan_availability | members_names_aadhar_availability | source_of_funds_amount_sanctioned | source_of_funds_department_name | source_of_funds_source | source_of_funds_purpose |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | AP/2016/0112003 | Registrar of Societies | Registered Societies (Non-Government) | 281-2005 | Available | Available | XXI OF 1860 | CHITTOOR | ANDHRA PRADESH | ... | 08572-221457 | http://pewsindia.in | pewsindia4u(at)gmail[dot]com | {' K PATTABHI REDDY': 'Member', ' A Santosh Ku... | {' K PATTABHI REDDY': 'Available', ' A Santosh... | {' K PATTABHI REDDY': 'Available', ' A Santosh... | {'2014-2015': 'Not Specified', '2015-2016': 'N... | {'2014-2015': 'Not Specified', '2015-2016': 'N... | {'2014-2015': 'Any Other', '2015-2016': 'Any O... | {'2014-2015': 'Income generated through retail... |
