First things first - making all the necessary imports

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

Checking whether the table can be extracted directly with pandas' read_html, i.e. without BeautifulSoup

In [2]:
df = pd.read_html('https://cza.nic.in/information-about-zoos/en')
df[0]

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)>
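The URLError comes from the download step: pd.read_html fetches the URL itself, so the certificate check fails before any table parsing happens. One possible workaround, sketched here on a small inline HTML string rather than the live page, is to hand read_html text that has already been downloaded by other means. The sample table below is made up from the zoo names that appear later in this notebook, and its column headers are my own labels, not necessarily the site's.

```python
from io import StringIO

import pandas as pd

# Stand-in for HTML that has already been downloaded (e.g. response.text).
# The headers and rows here are illustrative sample data, not the live page.
html_text = """
<table>
  <tr><th>Zoo name</th><th>Location</th><th>State</th></tr>
  <tr><td>Biological Park, Chidiyatapu</td><td>Port Blair</td><td>Andaman &amp; Nicobar Islands</td></tr>
  <tr><td>Deer Park, Chittoor</td><td>Chittoor (East) Division</td><td>Andhra Pradesh</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html_text))
zoos = tables[0]
print(zoos)
```

Note that read_html keeps only the cell text by default, so the PDF links in the real table would be dropped unless they are extracted separately (recent pandas versions have an extract_links option for this).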

Since the table could not be extracted directly this way, I start scraping the webpage with requests and BeautifulSoup

In [26]:
response = requests.get("https://cza.nic.in/information-about-zoos/en")
# Name the parser explicitly so BeautifulSoup does not have to guess one
doc = BeautifulSoup(response.text, "html.parser")

SSLError: HTTPSConnectionPool(host='cza.nic.in', port=443): Max retries exceeded with url: /information-about-zoos/en (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')))

Oh no! It gives an error again: an SSLError with the same CERTIFICATE_VERIFY_FAILED cause as the read_html attempt.

It is basically saying that it can't verify the website's SSL certificate. So, as a security measure, it is preventing me from connecting to a potentially unsafe website.

I ask ChatGPT for a solution, and it suggests adding 'verify=False', which disables SSL certificate verification for that request.
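Turning verification off removes the very check that raised the error. A narrower alternative, sketched below, keeps verification on and tells requests which certificates to trust. The helper name make_session and the file name cza-chain.pem are hypothetical; the PEM file with the missing issuer certificate would have to be obtained and checked separately, and I have not tried this against the site.

```python
import requests


def make_session(ca_bundle=None):
    """Return a requests.Session that still verifies certificates.

    ca_bundle is a hypothetical path to a PEM file holding the issuer
    certificate(s) the default bundle is missing. None keeps the default.
    """
    session = requests.Session()
    if ca_bundle is not None:
        # Session.verify accepts True/False or a path to a CA bundle.
        session.verify = ca_bundle
    return session


session = make_session("cza-chain.pem")  # hypothetical file name

# Intended use (not executed here, since it needs network access and the file):
# response = session.get("https://cza.nic.in/information-about-zoos/en", timeout=30)
```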

In [32]:
# verify=False skips the certificate check that failed above
response = requests.get("https://cza.nic.in/information-about-zoos/en/", verify=False)
doc = BeautifulSoup(response.text, "html.parser")
doc



<!DOCTYPE html>
<html lang="en">
<head>
<title>Information about Zoos</title>
<!-- for-mobile-apps -->
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="Central Zoo Authority, Zoos of India, Animals of Zoos" name="keywords"/>
<script type="text/javascript"> 
            addEventListener("load", function() { setTimeout(hideURLbar, 0); }, false);
            function hideURLbar()
            { 
                window.scrollTo(0,1); 
            } 
            var BASE_URL = "https://cza.nic.in/";
        </script>
<!-- //for-mobile-apps -->
<link href="https://cza.nic.in/assets/frontend/css/bootstrap.css" media="all" rel="stylesheet" type="text/css"/>
<link href="https://cza.nic.in/assets/frontend/css/flexslider.css" media="screen" property="" rel="stylesheet" type="text/css"/>
<link href="https://cza.nic.in/assets/frontend/css/zoomslider.css" rel="stylesheet" type="text/css"/>
<link h

It works, but it gives a warning (requests flags the HTTPS request as unverified)! I ignore the warning.
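If the warning gets noisy, one option is to silence only that warning category, and only around the unverified calls, instead of ignoring all warnings. The sketch below assumes the warning in question is urllib3's InsecureRequestWarning, which is what requests emits for verify=False; the helper name unverified_requests_quietly is made up for illustration.

```python
import warnings
from contextlib import contextmanager

from urllib3.exceptions import InsecureRequestWarning


@contextmanager
def unverified_requests_quietly():
    """Silence InsecureRequestWarning only inside the with-block."""
    with warnings.catch_warnings():
        # The filter is dropped again when the with-block exits.
        warnings.simplefilter("ignore", InsecureRequestWarning)
        yield


# Intended use (not executed here, since it needs network access):
# with unverified_requests_quietly():
#     response = requests.get(url, verify=False)
```

Other warnings, and the same warning outside the block, are still shown.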

Now, I inspect the web page to find which div class hosts the elements of the table. It's 'table-responsive'. I check it by counting all the rows (tr elements) inside it.

In [33]:
items = doc.find_all('div', attrs={'class': 'table-responsive'})[0].find_all('tr')

len(items)

11

It correctly prints the total number of rows in the table, i.e. 11. So, I check whether it's reading the content of the table properly.

In [35]:
items[1]

<tr class="item">
<th scope="row">1</th>
<td class="text">Biological Park, Chidiyatapu</td>
<td class="text">Port Blair</td>
<td class="text">Andaman &amp; Nicobar Islands</td>
<td><a href="https://cza.nic.in/uploads/documents/zoos/zoo-status/english/331.pdf" target="_blank">Recognized</a></td>
<td><a href="https://cza.nic.in/uploads/documents/zoos/information/english/331.pdf" target="_blank"><img src="https://cza.nic.in/assets/images/pdf-icon.png" width="30"/> Download</a></td>
</tr>

It is! Now the issue is that the table is paginated, so I need to scrape the tables from all the pages, 53 in total. Since I haven't done something like this before, I take help from Yip (TA). He helps me loop over the page URLs.

In [38]:
rows = []
for i in range(53):
  print(f"https://cza.nic.in/information-about-zoos/en/page/{i+1}")

https://cza.nic.in/information-about-zoos/en/page/1
https://cza.nic.in/information-about-zoos/en/page/2
https://cza.nic.in/information-about-zoos/en/page/3
https://cza.nic.in/information-about-zoos/en/page/4
https://cza.nic.in/information-about-zoos/en/page/5
https://cza.nic.in/information-about-zoos/en/page/6
https://cza.nic.in/information-about-zoos/en/page/7
https://cza.nic.in/information-about-zoos/en/page/8
https://cza.nic.in/information-about-zoos/en/page/9
https://cza.nic.in/information-about-zoos/en/page/10
https://cza.nic.in/information-about-zoos/en/page/11
https://cza.nic.in/information-about-zoos/en/page/12
https://cza.nic.in/information-about-zoos/en/page/13
https://cza.nic.in/information-about-zoos/en/page/14
https://cza.nic.in/information-about-zoos/en/page/15
https://cza.nic.in/information-about-zoos/en/page/16
https://cza.nic.in/information-about-zoos/en/page/17
https://cza.nic.in/information-about-zoos/en/page/18
https://cza.nic.in/information-about-zoos/en/page/19
ht
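The page count of 53 is hard-coded in the loop. A possible alternative is to keep requesting pages until one comes back with no data rows. This assumes that a page number past the end returns an empty table, which I have not checked against this site. In the sketch, fetch_rows and fake_fetch are hypothetical stand-ins for the real download-and-parse step, and max_pages is a safety cap.

```python
def scrape_until_empty(fetch_rows, max_pages=200):
    """Collect rows page by page, stopping at the first page with no rows.

    fetch_rows(page_number) is any callable returning a list of row dicts.
    max_pages caps the loop in case the site never returns an empty page.
    """
    all_rows = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_rows(page)
        if not page_rows:
            break
        all_rows.extend(page_rows)
    return all_rows


def fake_fetch(page):
    """Stand-in for a real page fetch: three pages with two rows each."""
    if page > 3:
        return []
    return [{"page": page, "row": r} for r in (1, 2)]


rows_demo = scrape_until_empty(fake_fetch)
print(len(rows_demo))  # prints 6
```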

After looping through the URLs, Yip helped me scrape the data from each URL, which required another for loop nested inside the first.

In [39]:
rows = []
for i in range(53):
    # Pages are numbered from 1, hence i+1; verify=False skips the failing certificate check
    response = requests.get(f"https://cza.nic.in/information-about-zoos/en/page/{i+1}", verify=False)
    doc = BeautifulSoup(response.text, "html.parser")
    items = doc.find_all('div', attrs={'class': 'table-responsive'})[0].find_all('tr')

    for item in items:
        try:
            cols = item.find_all('td')
            number = item.find_all('th')
            links = item.find_all('a')
            row = {
                "Number": number[0].get_text(),
                "Zoo name": cols[0].get_text(),
                "Location": cols[1].get_text(),
                "State": cols[2].get_text(strip=True),
                "Recognition": cols[3].get_text(strip=True),
                "Download": cols[4].get_text(strip=True),
                # The second link in the row is the information PDF
                "url": links[1]['href']
            }
            rows.append(row)
        except (IndexError, KeyError):
            # Skip rows that lack the expected cells or links
            pass
rows



[{'Number': '1',
  'Zoo name': 'Biological Park, Chidiyatapu',
  'Location': 'Port Blair',
  'State': 'Andaman & Nicobar Islands',
  'Recognition': 'Recognized',
  'Download': 'Download',
  'url': 'https://cza.nic.in/uploads/documents/zoos/information/english/331.pdf'},
 {'Number': '2',
  'Zoo name': 'Deer Park, Chittoor',
  'Location': 'Chittoor (East) Division',
  'State': 'Andhra Pradesh',
  'Recognition': 'Recognized',
  'Download': 'Download',
  'url': 'https://cza.nic.in/uploads/documents/zoos/information/english/23.pdf'},
 {'Number': '3',
  'Zoo name': 'Deer Park, Kandaleru',
  'Location': 'Kandaleru',
  'State': 'Andhra Pradesh',
  'Recognition': 'Recognized',
  'Download': 'Download',
  'url': 'https://cza.nic.in/uploads/documents/zoos/information/english/365.pdf'},
 {'Number': '4',
  'Zoo name': 'Deer Park, Municipal Park',
  'Location': 'Rajahmundry',
  'State': 'Andhra Pradesh',
  'Recognition': 'Derecognized',
  'Download': 'Download',
  'url': 'https://cza.nic.in/uploads/
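The try/except in the loop skips rows that lack the expected cells, presumably the header row. An alternative sketch checks the row shape explicitly, so that a skipped row is a deliberate decision rather than a swallowed error. It runs on the single row printed earlier in this notebook, wrapped in a table tag; parse_row is a hypothetical helper name, and the five-cell layout is assumed from that one row.

```python
from bs4 import BeautifulSoup

# The row shown earlier in this notebook, wrapped in <table> for parsing.
ROW_HTML = """<table><tr class="item">
<th scope="row">1</th>
<td class="text">Biological Park, Chidiyatapu</td>
<td class="text">Port Blair</td>
<td class="text">Andaman &amp; Nicobar Islands</td>
<td><a href="https://cza.nic.in/uploads/documents/zoos/zoo-status/english/331.pdf" target="_blank">Recognized</a></td>
<td><a href="https://cza.nic.in/uploads/documents/zoos/information/english/331.pdf" target="_blank"><img src="https://cza.nic.in/assets/images/pdf-icon.png" width="30"/> Download</a></td>
</tr></table>"""


def parse_row(tr):
    """Turn one <tr> into a dict, or return None if it is not a data row."""
    cells = tr.find_all("td")
    if len(cells) < 5:
        return None  # header row or an unexpected layout
    number = tr.find("th")
    info_link = cells[4].find("a")
    return {
        "Number": number.get_text(strip=True) if number else None,
        "Zoo name": cells[0].get_text(strip=True),
        "Location": cells[1].get_text(strip=True),
        "State": cells[2].get_text(strip=True),
        "Recognition": cells[3].get_text(strip=True),
        "url": info_link.get("href") if info_link else None,
    }


tr = BeautifulSoup(ROW_HTML, "html.parser").find("tr")
parsed = parse_row(tr)
print(parsed)
```

Reading the link from the fifth cell, rather than taking the second link in the whole row, also keeps working if a row ever lacks the status link.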

In [42]:
df = pd.json_normalize(rows)
df.head()

   Number  Zoo name                      Location                  State                      Recognition   Download  url
0  1       Biological Park, Chidiyatapu  Port Blair                Andaman & Nicobar Islands  Recognized    Download  https://cza.nic.in/uploads/documents/zoos/info...
1  2       Deer Park, Chittoor           Chittoor (East) Division  Andhra Pradesh             Recognized    Download  https://cza.nic.in/uploads/documents/zoos/info...
2  3       Deer Park, Kandaleru          Kandaleru                 Andhra Pradesh             Recognized    Download  https://cza.nic.in/uploads/documents/zoos/info...
3  4       Deer Park, Municipal Park     Rajahmundry               Andhra Pradesh             Derecognized  Download  https://cza.nic.in/uploads/documents/zoos/info...
4  5       Deer Park, Tirumala Hills     Chittoor                  Andhra Pradesh             Derecognized  Download  https://cza.nic.in/uploads/documents/zoos/info...


In [43]:
df['Recognition'].value_counts()

Recognition
Derecognized    369
Recognized      156
Name: count, dtype: int64

In [44]:
df.groupby(['State', 'Recognition']).size().unstack(fill_value=0)

Recognition                Derecognized  Recognized
State
Andaman & Nicobar Islands  0             1
Andhra Pradesh             16            5
Arunachal Pradesh          0             4
Assam                      3             3
Bihar                      14            2
Chhattisgarh               3             3
Dadra & Nagar Haveli       3             1
Daman & Diu                2             0
Delhi                      5             1
Goa                        0             1
