# Web Scraping Aerosol Optical Thickness (AOT)
Tiny solid and liquid particles suspended in the atmosphere are called aerosols. Examples of aerosols include windblown dust, sea salts, volcanic ash, smoke from fires, and pollution from factories. These particles are important to scientists because they can affect climate, weather, and people's health. Aerosols affect climate by scattering sunlight back into space and cooling the surface. Aerosols also help cool Earth in another way: they act like "seeds" that help form clouds. The particles give water droplets something to cling to as the droplets form and gather in the air to make clouds. Clouds shade the surface by reflecting sunlight back into space. People's health is affected when they breathe in smoke or pollution particles. Such aerosols in our lungs can cause asthma, cancer, or other serious health problems. Scientists do not yet fully understand all of the ways that aerosols affect Earth's environment. To help them in their studies, they use satellites to map where there were large amounts of aerosols on a given day, or over a span of days.

Source: https://neo.gsfc.nasa.gov/view.php?datasetId=MYDAL2_M_AER_OD

In [None]:
# importing relevant libraries
import glob, re, requests

import numpy  as np
import pandas as pd

from bs4 import BeautifulSoup

In [None]:
# unzipping the data files
!unzip '/content/drive/MyDrive/PrimaryData_Group21.zip'
!unzip '/content/PrimaryData_Group21/Climate/AOD Data 2021.zip'

Archive:  /content/drive/MyDrive/PrimaryData_Group21.zip
   creating: PrimaryData_Group21/Climate/
 extracting: PrimaryData_Group21/Climate/AOD Data 2019.zip  
 extracting: PrimaryData_Group21/Climate/AOD Data 2021.zip  
   creating: PrimaryData_Group21/Education/
 extracting: PrimaryData_Group21/Education/ASER Data 2019.zip  
 extracting: PrimaryData_Group21/Education/ASER Data 2021.zip  
Archive:  /content/PrimaryData_Group21/Climate/AOD Data 2021.zip
  inflating: AOD Data 2021/MYDAL2_M_AER_OD_2021-01-01_rgb_3600x1800.SS.CSV  
  inflating: AOD Data 2021/MYDAL2_M_AER_OD_2021-02-01_rgb_3600x1800.SS.CSV  
  inflating: AOD Data 2021/MYDAL2_M_AER_OD_2021-03-01_rgb_3600x1800.SS.CSV  
  inflating: AOD Data 2021/MYDAL2_M_AER_OD_2021-04-01_rgb_3600x1800.SS.CSV  
  inflating: AOD Data 2021/MYDAL2_M_AER_OD_2021-05-01_rgb_3600x1800.SS.CSV  
  inflating: AOD Data 2021/MYDAL2_M_AER_OD_2021-06-01_rgb_3600x1800.SS.CSV  
  inflating: AOD Data 2021/MYDAL2_M_AER_OD_2021-07-01_rgb_3600x1800.SS.CSV  
  i

In [None]:
# aggregating districts [names copied from the data set] and their wikipedia links
a = np.array(['BAGH', 'BHIMBER', 'HATTIAN', 'HAVELI', 'KOTLI', 'MIRPUR',
              'MUZAFFARABAD', 'NEELUM', 'SUDHNATI', 'AWARAN', 'BARKHAN',
              'BOLAN', 'CHAGHI', 'DERA BUGTI', 'DUKI', 'GWADAR',
              'HARNAI', 'JAFARABAD', 'JHAL MAGSI', 'KALAT', 'KECH',
              'KHARAN', 'KHUZDAR', 'KOHLU', 'LASBELA', 'LORALAI',
              'MASTUNG', 'MUSAKHAIL', 'NASIRABAD', 'NUSHKI', 'PANJGUR',
              'PISHIN', 'QILLA ABDULLAH', 'QILLA SAIFULLAH', 'QUETTA',
              'SIBI', 'WASHUK', 'ZHOB',
              'ZIARAT', 'ASTORE', 'DAREL', 'DIAMER', 'GHANCHE',
              'GILGIT', 'GUPIS YASIN', 'HUNZA', 'KHARMANG', 'NAGAR', 'RONDU',
              'SHIGAR', 'SKARDU', 'ISLAMABAD', 'ABBOTTABAD', 'BAJAUR',
              'BANNU', 'BATTAGRAM', 'BUNER', 'CHITRAL',
              'DERA ISMAIL KHAN', 'HARIPUR',
              'KOHAT', 'KURRAM', 'LAKKI MARWAT', 'LOWER DIR',
              'MARDAN', 'MOHMAND', 'NORTH WAZIRISTAN',
              'NOWSHERA', 'ORAKZAI', 'PESHAWAR',
              'SWABI', 'SWAT', 'TANK', 'TOR GHAR', 'UPPER DIR', 'ATTOCK',
              'BAHAWALNAGAR', 'BAHAWALPUR', 'BHAKKAR', 'CHAKWAL', 'CHINIOT',
              'DERA GHAZI KHAN', 'FAISALABAD', 'GUJRANWALA', 'GUJRAT',
              'HAFIZABAD', 'JEHLUM', 'JHANG', 'KASUR',
              'LAYYAH', 'LODHRAN', 'MANDI BAHAUDDIN', 'MIANWALI', 'MULTAN',
              'MUZAFFARGARH', 'NANKANA SAHIB', 'NAROWAL', 'OKARA', 'PAKPATTAN',
              'RAHIM YAR KHAN', 'RAJANPUR', 'RAWALPINDI', 'SAHIWAL', 'SARGODHA',
              'SHEIKHUPURA', 'SIALKOT', 'TOBA TEK SINGH',
              'DADU', 'GHOTKI', 'JACOBABAD', 'JAMSHORO', 'MALIR',
              'KARACHI WEST', 'KASHMORE', 'KHAIRPUR', 'LARKANA', 'MATIARI',
              'MIRPURKHAS', 'NAUSHAHRO FEROZE', 'QAMBAR SHAHDADKOT', 'SAJAWAL',
              'SANGHAR', 'SHAHEED BENAZIRABAD', 'SHIKARPUR', 'SUKKUR',
              'TANDO ALLAHYAR', 'TANDO MUHAMMAD KHAN', 'THATTA', 'UMERKOT'])
# standard '<Name>_District' pages, followed by pages whose titles do not fit that pattern
b = ['https://en.wikipedia.org/wiki/' + i.replace(' ', '_').title() + '_District' for i in a] + [
    'https://en.wikipedia.org/wiki/Poonch_District,_Pakistan',
    'https://en.wikipedia.org/wiki/Ghizer_District_(2019%E2%80%93)',
    'https://en.wikipedia.org/wiki/Kohistan_District,_Pakistan',
    'https://en.wikipedia.org/wiki/Hyderabad_District,_Sindh',
    'https://en.wikipedia.org/wiki/Tharparkar',
    'https://en.wikipedia.org/wiki/Chaman',
    'https://en.wikipedia.org/wiki/Lehri,_Balochistan',
    'https://en.wikipedia.org/wiki/File:Sherani_District.svg',
    'https://en.wikipedia.org/wiki/Sohbatpur',
    'https://en.wikipedia.org/wiki/Surab,_Pakistan',
    'https://en.wikipedia.org/wiki/Charsadda',
    'https://en.wikipedia.org/wiki/Hangu,_Pakistan',
    'https://en.wikipedia.org/wiki/Karak,_Pakistan',
    'https://en.wikipedia.org/wiki/Landi_Kotal',
    'https://en.wikipedia.org/wiki/Batkhela',
    'https://en.wikipedia.org/wiki/Mansehra',
    'https://en.wikipedia.org/wiki/Alpuri',
    'https://en.wikipedia.org/wiki/Wanna,_Pakistan',
    'https://en.wikipedia.org/wiki/Khanewal',
    'https://en.wikipedia.org/wiki/Jauharabad',
    'https://en.wikipedia.org/wiki/Vehari',
    'https://en.wikipedia.org/wiki/Badin',
]
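
To illustrate how the standard URLs are built, the sketch below applies the same string operations to a few names taken from `a`. It only builds strings and does not check that the pages exist; the helper name `district_url` is introduced here purely for illustration.

```python
def district_url(name):
    # same transformation as in the list comprehension for b:
    # spaces become underscores, each word is title-cased, '_District' is appended
    return 'https://en.wikipedia.org/wiki/' + name.replace(' ', '_').title() + '_District'

for name in ['BAGH', 'DERA ISMAIL KHAN', 'TOBA TEK SINGH']:
    print(district_url(name))
```

Names whose Wikipedia titles do not follow this pattern are the reason for the hand-written URLs appended to `b`.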

In [None]:
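# snapping each coordinate to the .05 midpoint of its 0.1-degree interval (e.g. 35.62 -> 35.65);
# int() truncates toward zero, so this assumes non-negative coordinates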
def rounder(z):
    return tuple([np.round((int(z_i * 10) / 10) + 0.05, 2) for z_i in z])
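
As a worked example, the sketch below runs `rounder` on the Tangir coordinates used later in this notebook and compares it with a floor-based variant. The variant, `cell_center`, is a hypothetical helper added here only to show what changes for negative coordinates; it is not part of the original pipeline.

```python
import numpy as np

def rounder(z):
    # copied from the cell above: truncate to one decimal, then add 0.05
    return tuple([np.round((int(z_i * 10) / 10) + 0.05, 2) for z_i in z])

def cell_center(z):
    # hypothetical variant: floor instead of truncation, so negatives snap downward
    return tuple([np.round(np.floor(z_i * 10) / 10 + 0.05, 2) for z_i in z])

tangir = [35.619722222222, 73.4625]
print(rounder(tangir))       # both helpers agree for positive coordinates
print(cell_center(tangir))

print(rounder([-0.32]))      # truncation moves toward zero
print(cell_center([-0.32]))  # floor stays inside the interval [-0.4, -0.3)
```

All districts in this notebook have positive latitude and longitude, so the two helpers give the same result here.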

In [None]:
# creating a list for each pair of coordinates and its respective district
coords    = []
districts = []
others    = []

# web scraping coordinates and districts
for b_i in b:
    # taking the first external link on the page, which is expected to be the GeoHack coordinates link
    source_code1 = requests.get(b_i)
    soup1        = BeautifulSoup(source_code1.content, 'lxml')
    url1         = soup1.find("a", {"class": "external text", "href": True})['href']

    if url1.startswith('https://geohack.toolforge.org/'):
        # the GeoHack page provides both the district name and the coordinates, so it is fetched once
        source_code2 = requests.get(url1)
        soup2        = BeautifulSoup(source_code2.content, 'lxml')
        url2         = soup2.find("h1", {"id": "firstHeading"}).contents[0][10:] # dropping the 10-character prefix of the heading

        districts.append(url2)

        url3         = soup2.find("a", {"rel": "nofollow", "href": True})['href']

        # extracting every number in the link and snapping it to the grid
        coords.append(rounder([float(number) for number in re.findall(r'\d+\.*\d*', url3)]))

    else: others.append(b_i) # pages without a GeoHack link are set aside

# for Tangir: https://www.mindat.org/maps.php?id=157133
coords.append(rounder([35.619722222222, 73.4625]))
districts.append('Tangir District')
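
The coordinate extraction relies on the regular expression `\d+\.*\d*` picking up exactly two numbers from a link. The sketch below shows that pattern on two made-up URLs (both are hypothetical and not real GeoHack links) to illustrate what happens when a link carries extra digits; `extract_numbers` is a helper name introduced for this example.

```python
import re

def extract_numbers(url):
    # same pattern as in the scraping loop: a run of digits, optionally followed by a decimal part
    return [float(number) for number in re.findall(r'\d+\.*\d*', url)]

clean = 'https://example.org/map?lat=33.9735&lon=73.7918'          # hypothetical URL
noisy = 'https://example.org/map?lat=33.9735&lon=73.7918&zoom=10'  # hypothetical URL

print(extract_numbers(clean))
print(extract_numbers(noisy))

# a cheap guard before appending to coords
for url in (clean, noisy):
    numbers = extract_numbers(url)
    print(url, '->', 'ok' if len(numbers) == 2 else f'unexpected {len(numbers)} numbers')
```

A link that yields more than two numbers would produce a longer tuple in `coords`, and any later code that unpacks each entry as `(lat, lon)` would then fail, so a length check of this kind can be worth adding.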

In [None]:
# initializing containers
aot   = {district: [] for district in districts}
files = glob.glob('/content/AOD Data 2021/*.CSV')

files.sort() # sorting files

for csv in files:
    # reading the monthly grid: each row is a latitude, each column a longitude, and each cell
    # the monthly AOT value at that point; labels are normalized to two decimals for lookup
    df            = pd.read_csv(csv)
    df.columns    = ['lat/lon'] + [format(np.round(float(col), 2), '.2f') for col in df.columns if col != 'lat/lon']
    df['lat/lon'] = df['lat/lon'].round(2)

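    # extracting monthly AOT values for each district based on its coordinates;
    # .iloc[0] takes the single matching cell, since float() on a Series is deprecated in recent pandas
    for i, (lat, lon) in enumerate(coords):
        aot[districts[i]].append(float(df.loc[np.round(df['lat/lon'], 2) == lat, format(lon, '.2f')].iloc[0]))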

print(f'aot: {aot}') # printing monthly AOT values for each district

aot: {'Bagh District': [0.201, 0.315, 0.52, 0.142, 99999.0, 99999.0, 0.315, 0.26, 0.256, 0.118, 0.228, 99999.0], 'Bhimber District': [0.232, 0.335, 0.26, 0.217, 0.39, 99999.0, 0.311, 0.453, 0.209, 0.236, 0.26, 0.213], 'Hattian Bala District': [99999.0, 99999.0, 99999.0, 0.146, 99999.0, 99999.0, 0.319, 0.272, 0.201, 0.091, 99999.0, 99999.0], 'Haveli District': [99999.0, 99999.0, 99999.0, 99999.0, 99999.0, 99999.0, 99999.0, 99999.0, 99999.0, 0.059, 99999.0, 99999.0], 'Kotli District': [0.209, 0.28, 0.295, 0.205, 0.39, 99999.0, 99999.0, 0.327, 0.287, 0.193, 0.248, 0.161], 'Mirpur District': [0.299, 0.382, 0.606, 0.134, 0.457, 99999.0, 1.0, 0.634, 0.531, 0.315, 0.374, 0.161], 'Muzaffarabad District': [99999.0, 99999.0, 99999.0, 0.165, 0.382, 99999.0, 0.339, 0.252, 0.232, 0.157, 99999.0, 99999.0], 'Neelum District': [99999.0, 99999.0, 99999.0, 99999.0, 99999.0, 99999.0, 99999.0, 99999.0, 99999.0, 0.047, 99999.0, 99999.0], 'Sudhanoti District': [0.197, 0.268, 0.264, 0.173, 0.429, 99999.0, 99
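
The grid lookup can be easier to follow on a toy example. The sketch below builds a tiny synthetic grid in the same layout (a `lat/lon` column plus one column per longitude label) and reads one cell from it. All values are made up, and `lookup` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# synthetic 2 x 2 grid; 99999.0 marks a missing value, as in the output above
grid = pd.DataFrame({
    'lat/lon': [33.95, 33.85],
    '73.75':   [0.20, 99999.0],
    '73.85':   [0.31, 0.27],
})

def lookup(df, lat, lon):
    # select the row whose latitude matches and the column whose label matches the longitude
    cell = df.loc[df['lat/lon'].round(2) == lat, format(lon, '.2f')]
    # return NaN rather than raising when no row matches
    return float(cell.iloc[0]) if len(cell) else np.nan

print(lookup(grid, 33.95, 73.85))
print(lookup(grid, 33.85, 73.75))
print(lookup(grid, 10.05, 73.85))
```

Because the column labels are strings, the longitude has to be formatted in exactly the same way as the labels (two decimals here) for the lookup to succeed.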

In [None]:
temp = pd.DataFrame.from_dict(aot).replace(99999.000, np.nan) # creating climate data set

temp.fillna(temp.mean(), inplace = True) # replacing each district's missing months with that district's mean over its available months

# computing mean AOT values to create the dataframe, assuming that the first six
# months are clear, while the remaining six experience smog
clear_mean = temp.iloc[:6, :].mean()
smog_mean  = temp.iloc[6:, :].mean()
climate_df = pd.concat([clear_mean, smog_mean], axis = 1).reset_index().rename(columns = {'index': 'district', 0: 'clear_mean', 1: 'smog_mean'})

climate_df # printing the dataframe

Unnamed: 0,district,clear_mean,smog_mean
0,Bagh District,0.283556,0.239778
1,Bhimber District,0.286212,0.280333
2,Hattian Bala District,0.195833,0.215767
3,Haveli District,0.059000,0.059000
4,Kotli District,0.273083,0.245917
...,...,...,...
147,Khanewal,0.632667,0.793333
148,Jauharabad,0.552803,0.728833
149,Vehari,0.585424,0.769667
150,Badin,0.474417,0.596583
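
One consequence of filling with the column mean is visible in the Haveli row above: a district with a single valid month ends up with identical `clear_mean` and `smog_mean`. The sketch below reproduces that effect on synthetic data and, as one possible alternative rather than the method used here, computes the half-year means with missing months simply skipped. District names `A` and `B` and all values are made up.

```python
import numpy as np
import pandas as pd

monthly = pd.DataFrame({
    'A': [0.2, 0.3, 99999.0, 0.1, 0.2, 0.4, 0.5, 0.6, 0.4, 0.3, 0.5, 0.7],
    'B': [99999.0] * 9 + [0.059] + [99999.0] * 2,  # only one valid month
}).replace(99999.0, np.nan)

filled = monthly.fillna(monthly.mean())  # approach used in the cell above

summary = pd.DataFrame({
    'clear_mean':        filled.iloc[:6].mean(),
    'smog_mean':         filled.iloc[6:].mean(),
    'clear_mean_skipna': monthly.iloc[:6].mean(),  # alternative: ignore missing months
    'smog_mean_skipna':  monthly.iloc[6:].mean(),
})

print(summary)
```

With the skip-missing alternative, a half-year with no valid months stays missing instead of inheriting the value from the other half, which makes the lack of data explicit.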


In [None]:
# preparing to rename districts to match the rest of data
climate_df['district'] = climate_df['district'].apply(lambda x: x.upper().replace(' DISTRICT', ''))
climate_df.sort_values('district', inplace = True, ignore_index = True)

In [None]:
# listing district names as spelled in the rest of the data, to check that the scraped names match
c = ['BAGH', 'BHIMBER', 'HATTIAN', 'HAVELI', 'KOTLI', 'MIRPUR',
     'MUZAFFARABAD', 'NEELUM', 'POONCH', 'SUDHNATI', 'AWARAN',
     'BARKHAN', 'BOLAN', 'CHAGHI', 'CHAMAN', 'DERA BUGTI', 'DUKKI',
     'GWADAR', 'HARNAI', 'JAFARABAD', 'JHAL MAGSI', 'KALLAT',
     'KECH (TURBAT)', 'KHARAN', 'KHUZDAR', 'KOHLU', 'LASBELA', 'LEHRI',
     'LORALAI', 'MASTUNG', 'MUSAKHEL', 'NASIRABAD', 'NUSHKI', 'PANJGUR',
     'PISHIN', 'QILLA ABDULLAH', 'QILLA SAIFULLAH', 'QUETTA', 'SHERANI',
     'SIBI', 'SOHBATPUR', 'SURAB', 'WASHUK', 'ZHOB', 'ZIARAT', 'ASTORE',
     'DARIAL', 'DIAMER', 'GHANCHE', 'GHIZER', 'GILGIT', 'GUPIS YASIN',
     'Hunza-Nagar', 'KHARMANG', 'NAGAR', 'RONDU', 'SHIGAR', 'SKARDU',
     'TANGIR', 'ISLAMABAD', 'ABBOTTABAD', 'BAJAUR AGENCY', 'BANNU',
     'BATTAGRAM', 'BUNER', 'CHARSADDA', 'CHITRAL', 'DERA ISMAIL KHAN',
     'HANGU', 'HARIPUR', 'KARAK', 'KHYBER AGENCY', 'KOHAT', 'KOHISTAN',
     'KURRAM AGENCY', 'LAKKI MARWAT', 'LOWER DIR', 'MALAKAND',
     'MANSEHRA', 'MARDAN', 'MOHMAND AGENCY', 'NORTH WAZIRISTAN',
     'NOWSHERA', 'ORAKZAI AGENCY', 'PESHAWAR', 'SHANGLA',
     'SOUTH WAZIRISTAN', 'SWABI', 'SWAT', 'TANK', 'TOR GHAR',
     'UPPER DIR', 'ATTOCK', 'BAHAWALNAGER', 'BAHAWALPUR', 'BHAKKAR',
     'CHAKWAL', 'CHINIOT', 'DERA GHAZI KHAN', 'FAISALABAD',
     'GUJRANWALA', 'GUJRAT', 'HAFIZABAD', 'JEHLUM', 'JHANG', 'KASUR',
     'KHANEWAL', 'KHUSHAB', 'LAYYAH', 'LODHRAN', 'MANDI BAHUDDIN',
     'MIANWALI', 'MULTAN', 'MUZAFFAR GARH', 'NANKANA SAHIB', 'NAROWAL',
     'OKARA', 'PAKPATTAN', 'RAHIM YAR KHAN', 'RAJANPUR', 'RAWALPINDI',
     'SAHIWAL', 'SARGODHA', 'SHEIKHUPURA', 'SIALKOT', 'T.T SINGH',
     'VEHARI', 'BADIN', 'DADU', 'GOTKI', 'HYDERABAD', 'JACOBABAD',
     'JAMSHORO', 'KARACHI-MALIR-RURAL', 'KARACHI-WEST-RURAL',
     'KASHMORE', 'KHAIRPUR', 'LARKANA', 'MATIARI', 'MIRPURKHAS',
     'MITHI', 'NOWSHERO FEROZE', 'QAMBAR SHAHDADKOT', 'SAJAWAL',
     'SANGHAR', 'SHAHEED BENAZIRABAD', 'SHIKARPUR', 'SUKKUR',
     'TANDO ALLAH YAR', 'TANDO MUHAMMAD KHAN', 'THATTA', 'UMER KOT']
c.sort()

In [None]:
# mapping scraped names to the spellings used in the rest of the data
climate_df['district'] = climate_df['district'].replace({'ALPURI' : 'SHANGLA',
                                                         'BAHAWALNAGAR' : 'BAHAWALNAGER',
                                                         'BAJAUR' : 'BAJAUR AGENCY',
                                                         'BATKHELA' : 'MALAKAND',
                                                         'CHAGAI' : 'CHAGHI',
                                                         'DAREL' : 'DARIAL',
                                                         'DUKI' : 'DUKKI',
                                                         'FILE:SHERANI.SVG' : 'SHERANI',
                                                         'GHIZER (2019–)' : 'GHIZER',
                                                         'GHOTKI' : 'GOTKI',
                                                         'GUPIS-YASIN' : 'GUPIS YASIN',
                                                         'HANGU, PAKISTAN' : 'HANGU',
                                                         'HATTIAN BALA' : 'HATTIAN',
                                                         'HUNZA' : 'Hunza-Nagar',
                                                         'HYDERABAD, SINDH' : 'HYDERABAD',
                                                         'JAUHARABAD' : 'KHUSHAB',
                                                         'JHELUM' : 'JEHLUM',
                                                         'KACHHI' : 'BOLAN',
                                                         'KALAT' : 'KALLAT',
                                                         'KARAK, PAKISTAN' : 'KARAK',
                                                         'KECH' : 'KECH (TURBAT)',
                                                         'KILLA ABDULLAH' : 'QILLA ABDULLAH',
                                                         'KILLA SAIFULLAH' : 'QILLA SAIFULLAH',
                                                         'KOHISTAN, PAKISTAN' : 'KOHISTAN',
                                                         'KURRAM' : 'KURRAM AGENCY',
                                                         'LANDI KOTAL' : 'KHYBER AGENCY',
                                                         'LEHRI, BALOCHISTAN' : 'LEHRI',
                                                         'MALIR' : 'KARACHI-MALIR-RURAL',
                                                         'MANDI BAHAUDDIN' : 'MANDI BAHUDDIN',
                                                         'MIRPUR KHAS' : 'MIRPURKHAS',
                                                         'MOHMAND' : 'MOHMAND AGENCY',
                                                         'MUSAKHAIL' : 'MUSAKHEL',
                                                         'MUZAFFARGARH' : 'MUZAFFAR GARH',
                                                         'NAUSHAHRO FEROZE' : 'NOWSHERO FEROZE',
                                                         'ORAKZAI' : 'ORAKZAI AGENCY',
                                                         'ORANGI (KARACHI WEST)' : 'KARACHI-WEST-RURAL',
                                                         'POONCH, PAKISTAN' : 'POONCH',
                                                         'ROUNDU' : 'RONDU',
                                                         'SUDHANOTI' : 'SUDHNATI',
                                                         'SUJAWAL' : 'SAJAWAL',
                                                         'SURAB, PAKISTAN' : 'SURAB',
                                                         'TANDO ALLAHYAR' : 'TANDO ALLAH YAR',
                                                         'THARPARKAR' : 'MITHI',
                                                         'TOBA TEK SINGH' : 'T.T SINGH',
                                                         'TORGHAR' : 'TOR GHAR',
                                                         'UMERKOT' : 'UMER KOT',
                                                         'WANNA, PAKISTAN' : 'SOUTH WAZIRISTAN'})

In [None]:
district_set = set(climate_df['district'])
unmatched1   = [name for name in c if name not in district_set]           # in list, but not in df
unmatched2   = [name for name in climate_df['district'] if name not in c] # in df, but not in list

len(unmatched1), len(unmatched2) # printing the number of unmatched entries

(0, 0)
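
The normalize, rename, and check steps can be tried end to end on a few names. The sketch below uses three scraped-style names and their target spellings taken from the mapping above; the variable names are introduced for illustration, and set differences stand in for the two list comprehensions.

```python
import pandas as pd

scraped   = pd.Series(['Bagh District', 'Hangu, Pakistan', 'Alpuri'])
reference = ['BAGH', 'HANGU', 'SHANGLA']

# upper-casing and dropping the ' DISTRICT' suffix, as in the renaming cell
normalized = scraped.apply(lambda x: x.upper().replace(' DISTRICT', ''))

# mapping the remaining mismatches by hand
renamed = normalized.replace({'HANGU, PAKISTAN': 'HANGU', 'ALPURI': 'SHANGLA'})

missing_from_df   = sorted(set(reference) - set(renamed))  # in list, but not in df
missing_from_list = sorted(set(renamed) - set(reference))  # in df, but not in list

print(list(renamed))
print(len(missing_from_df), len(missing_from_list))
```

Printing the two sorted lists instead of only their lengths shows which names still need an entry in the mapping.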

In [None]:
climate_df # printing the final dataframe

Unnamed: 0,district,clear_mean,smog_mean
0,ABBOTTABAD,0.231621,0.305833
1,SHANGLA,0.094000,0.094000
2,ASTORE,,
3,ATTOCK,0.387848,0.478333
4,AWARAN,0.294152,0.313667
...,...,...,...
147,VEHARI,0.585424,0.769667
148,SOUTH WAZIRISTAN,0.227258,0.263833
149,WASHUK,0.304894,0.347833
150,ZHOB,0.074467,0.099933


In [None]:
climate_df.to_csv('/content/climate_df.csv') # saving the dataframe as csv
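
By default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. The sketch below demonstrates this on a small made-up frame, using an in-memory buffer instead of a file; whether the index is wanted in the saved file depends on how the CSV is used downstream.

```python
import io
import pandas as pd

demo = pd.DataFrame({'district': ['BAGH', 'BHIMBER'], 'clear_mean': [0.28, 0.29]})  # made-up values

# round trip with the default settings (index written as an unnamed first column)
with_index    = pd.read_csv(io.StringIO(demo.to_csv()))

# round trip without the index
without_index = pd.read_csv(io.StringIO(demo.to_csv(index = False)))

print(list(with_index.columns))
print(list(without_index.columns))
```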