# Data Collection

<h2>Importing libraries for web scraping the export data</h2>

To scrape India’s export and import data, we used Selenium, a web browser automation tool.

The website has a nested structure: the user selects a country from a dropdown list and the year of interest, submits the query, and is then redirected to a page showing the export or import trade data for that country and year. The results on this page are laid out as a table, so we extract them from the HTML `table` tag. We loop over the years 1997-2019 and over every country in the dropdown list to retrieve the data, and we follow the same procedure for both import and export data.
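The two ideas in the paragraph above, reading rows out of an HTML table and issuing one query per year and country, can be sketched without a browser. This is only an illustration: it uses the standard library's `html.parser` rather than the Selenium and pandas calls used in this notebook, and the class name `TableRows`, the sample page `SAMPLE_HTML`, its column headers, and its placeholder values are all assumptions made up for the example.

```python
from html.parser import HTMLParser
from itertools import product


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical results page; headers and values are placeholders.
SAMPLE_HTML = """
<table>
  <tr><th>HSCode</th><th>Commodity</th><th>Value</th></tr>
  <tr><td>01</td><td>EXAMPLE COMMODITY</td><td>0.00</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows

# One query per (year, country) pair, mirroring the loop described above.
years = range(1997, 2020)           # 1997 through 2019 inclusive
countries = ["ALBANIA", "ALGERIA"]  # shortened list for illustration
queries = list(product(years, countries))
```

With the full country list, `queries` would hold one entry for every form submission the scraper has to make, which gives a quick estimate of how long a complete run will take.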

In [1]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options
import pandas as pd
import html5lib
from webdriver_manager.chrome import ChromeDriverManager

We build the list of countries by reading every available option from the dropdown element named "cntcode", which is the form field used to choose the country. The same element has the id "select3", which the subsequent code uses when selecting a country.
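To make the lookup concrete, here is a minimal sketch of reading the option labels of one named `<select>` element from static HTML. It is not the Selenium call used in this notebook. It relies on the standard library's `html.parser`, and the class name `SelectOptions`, the sample form, the year dropdown's name `year`, and every option `value` are hypothetical. Only the names "cntcode", "select2", "select3" and the country labels come from the text and output in this notebook.

```python
from html.parser import HTMLParser


class SelectOptions(HTMLParser):
    """Collect the visible text of each <option> inside one named <select>."""

    def __init__(self, select_name):
        super().__init__()
        self.select_name = select_name
        self.options = []
        self._in_select = False  # True while inside the target <select>
        self._text = None        # text pieces of the current option, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("name") == self.select_name:
            self._in_select = True
        elif tag == "option" and self._in_select:
            self._flush()  # HTML allows <option> without a closing tag
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "option":
            self._flush()
        elif tag == "select":
            self._flush()
            self._in_select = False

    def _flush(self):
        # Store the option collected so far, if any.
        if self._text is not None:
            self.options.append("".join(self._text).strip())
            self._text = None


# Hypothetical form markup; the option values are placeholders.
SAMPLE_FORM = """
<select name="year" id="select2">
  <option value="2019">2019</option>
</select>
<select name="cntcode" id="select3">
  <option value="1">AFGHANISTAN TIS</option>
  <option value="2">ALBANIA</option>
  <option value="3">ALGERIA</option>
</select>
"""

option_parser = SelectOptions("cntcode")
option_parser.feed(SAMPLE_FORM)
sample_countries = option_parser.options
```

Only options inside the "cntcode" dropdown are collected; the year dropdown is ignored because its `name` does not match.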

In [2]:
export_comm_url = "https://commerce-app.gov.in/eidb/ecntcomq.asp"

# Setting up the driver (headless: no browser window is opened)
chrome_options = Options()
chrome_options.add_argument("--headless")

get_country_list = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)

get_country_list.get(export_comm_url)

# Populating country list from the options of the "cntcode" dropdown
country_names = []
for o in Select(get_country_list.find_element_by_name('cntcode')).options:
    country_names.append(o.text)

get_country_list.quit()  # end the browser session and the chromedriver process


Looking for [chromedriver 78.0.3904.105 mac64] driver in cache 
File found in cache by path [/Users/anjaliagrawal/.wdm/drivers/chromedriver/78.0.3904.105/mac64/chromedriver]


In [3]:
export_comm_data = pd.DataFrame()

In [4]:
# Scraping commodity data
def get_export_comm(year, upper_limit=len(country_names)):
    """Scrape export data for `year` for the first `upper_limit` countries into export_comm_data."""
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
    driver.get(export_comm_url)

    global export_comm_data

    for ix in range(0, upper_limit):

        country_name = country_names[ix]  # chooses the country from the list
        print(year, country_name)
        Select(driver.find_element_by_id('select2')).select_by_value(str(year)) # Select year

        Select(driver.find_element_by_id('select3')).select_by_visible_text(country_name) # Select country name

        # Additional options    
        driver.find_element_by_id('radioDAll').click()
        driver.find_element_by_id('radiousd').click()

        driver.find_element_by_id('button1').click()

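        try:
            tbl = driver.find_element_by_tag_name('table').get_attribute('outerHTML')
            df = pd.read_html(tbl)[0]

            # Tag every row with the query that produced it
            df['country'] = country_name
            df['year'] = year
            # Give column 4 a fixed name so that frames from different queries line up
            df.rename(columns={df.columns[4]: 'value'}, inplace=True)

            # Keep all rows except the last three, and columns 1, 2, 4, 6 and 7.
            # pd.concat is used because DataFrame.append was removed in pandas 2.0.
            export_comm_data = pd.concat([export_comm_data, df.iloc[:-3, [1, 2, 4, 6, 7]]], sort=False)

        except Exception:
            # No usable table for this year and country; skip it and continue
            pass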
        
        # Reload the query form for the next country
        driver.get(export_comm_url)

    driver.quit()  # end the browser session and the chromedriver process

In [6]:
for year in range(1997, 2020):
    get_export_comm(year)


Looking for [chromedriver 78.0.3904.105 mac64] driver in cache 
File found in cache by path [/Users/anjaliagrawal/.wdm/drivers/chromedriver/78.0.3904.105/mac64/chromedriver]
1997 AFGHANISTAN TIS
1997 ALBANIA
1997 ALGERIA
1997 AMERI SAMOA
1997 ANDORRA
1997 ANGOLA
1997 ANGUILLA
1997 ANTARTICA
1997 ANTIGUA
1997 ARGENTINA
1997 ARMENIA
1997 ARUBA
1997 AUSTRALIA
1997 AUSTRIA
1997 AZERBAIJAN
1997 BAHAMAS
1997 BAHARAIN IS
1997 BANGLADESH PR
1997 BARBADOS
1997 BELARUS
1997 BELGIUM
1997 BELIZE
1997 BENIN
1997 BERMUDA
1997 BHUTAN
1997 BOLIVIA
1997 BOSNIA-HRZGOVIN
1997 BOTSWANA
1997 BR VIRGN IS
1997 BRAZIL
1997 BRUNEI
1997 BULGARIA
1997 BURKINA FASO
1997 BURUNDI
1997 C AFRI REP
1997 CAMBODIA
1997 CAMEROON
1997 CANADA
1997 CANARY IS
1997 CAPE VERDE IS
1997 CAYMAN IS
1997 CHAD
1997 CHANNEL IS
1997 CHILE
1997 CHINA P RP
1997 CHRISTMAS IS.
1997 COCOS IS
1997 COLOMBIA
1997 COMOROS
1997 CONGO D. REP.
1997 CONGO P REP
1997 COOK IS
1997 COSTA RICA
1997 COTE D' IVOIRE
1997 CROATIA
1997 CUBA
1997 CURACAO
...

2001 C AFRI REP
2001 CAMBODIA
2001 CAMEROON
2001 CANADA
2001 CANARY IS
2001 CAPE VERDE IS
2001 CAYMAN IS
2001 CHAD
2001 CHANNEL IS
2001 CHILE
2001 CHINA P RP
2001 CHRISTMAS IS.
2001 COCOS IS
2001 COLOMBIA
2001 COMOROS
2001 CONGO D. REP.
2001 CONGO P REP
2001 COOK IS
2001 COSTA RICA
2001 COTE D' IVOIRE
2001 CROATIA
2001 CUBA
2001 CURACAO
2001 CYPRUS
2001 CZECH REPUBLIC
2001 DENMARK
2001 DJIBOUTI
2001 DOMINIC REP
2001 DOMINICA
2001 ECUADOR
2001 EGYPT A RP
2001 EL SALVADOR
2001 EQUTL GUINEA
2001 ERITREA
2001 ESTONIA
2001 ETHIOPIA
2001 FALKLAND IS
2001 FAROE IS.
2001 FIJI IS
2001 FINLAND
2001 FR GUIANA
2001 FR POLYNESIA
2001 FR S ANT TR
2001 FRANCE
2001 GABON
2001 GAMBIA
2001 GEORGIA
2001 GERMANY
2001 GHANA
2001 GIBRALTAR
2001 GREECE
2001 GREENLAND
2001 GRENADA
2001 GUADELOUPE
2001 GUAM
2001 GUATEMALA
2001 GUERNSEY
2001 GUINEA
2001 GUINEA BISSAU
2001 GUYANA
2001 HAITI
2001 HEARD MACDONALD
2001 HONDURAS
2001 HONG KONG
2001 HUNGARY
2001 ICELAND
2001 INDONESIA
2001 INSTALLATIONS IN INTERNATIO...

2005 GABON
2005 GAMBIA
2005 GEORGIA
2005 GERMANY
2005 GHANA
2005 GIBRALTAR
2005 GREECE
2005 GREENLAND
2005 GRENADA
2005 GUADELOUPE
2005 GUAM
2005 GUATEMALA
2005 GUERNSEY
2005 GUINEA
2005 GUINEA BISSAU
2005 GUYANA
2005 HAITI
2005 HEARD MACDONALD
2005 HONDURAS
2005 HONG KONG
2005 HUNGARY
2005 ICELAND
2005 INDONESIA
2005 INSTALLATIONS IN INTERNATIONAL WATERS   
2005 IRAN
2005 IRAQ
2005 IRELAND
2005 ISRAEL
2005 ITALY
2005 JAMAICA
2005 JAPAN
2005 JERSEY         
2005 JORDAN
2005 KAZAKHSTAN
2005 KENYA
2005 KIRIBATI REP
2005 KOREA DP RP
2005 KOREA RP
2005 KUWAIT
2005 KYRGHYZSTAN
2005 LAO PD RP
2005 LATVIA
2005 LEBANON
2005 LESOTHO
2005 LIBERIA
2005 LIBYA
2005 LIECHTENSTEIN
2005 LITHUANIA
2005 LUXEMBOURG
2005 MACAO
2005 MACEDONIA
2005 MADAGASCAR
2005 MALAWI
2005 MALAYSIA
2005 MALDIVES
2005 MALI
2005 MALTA
2005 MARSHALL ISLAND
2005 MARTINIQUE
2005 MAURITANIA
2005 MAURITIUS
2005 MAYOTTE
2005 MEXICO
2005 MICRONESIA
2005 MOLDOVA
2005 MONACO
2005 MONGOLIA
2005 MONTENEGRO
2005 MONTSERRAT
2005 MOROCC...

2009 LIBYA
2009 LIECHTENSTEIN
2009 LITHUANIA
2009 LUXEMBOURG
2009 MACAO
2009 MACEDONIA
2009 MADAGASCAR
2009 MALAWI
2009 MALAYSIA
2009 MALDIVES
2009 MALI
2009 MALTA
2009 MARSHALL ISLAND
2009 MARTINIQUE
2009 MAURITANIA
2009 MAURITIUS
2009 MAYOTTE
2009 MEXICO
2009 MICRONESIA
2009 MOLDOVA
2009 MONACO
2009 MONGOLIA
2009 MONTENEGRO
2009 MONTSERRAT
2009 MOROCCO
2009 MOZAMBIQUE
2009 MYANMAR
2009 N. MARIANA IS.
2009 NAMIBIA
2009 NAURU RP
2009 NEPAL
2009 NETHERLAND
2009 NETHERLANDANTIL
2009 NEUTRAL ZONE
2009 NEW CALEDONIA
2009 NEW ZEALAND
2009 NICARAGUA
2009 NIGER
2009 NIGERIA
2009 NIUE IS
2009 NORFOLK IS
2009 NORWAY
2009 OMAN
2009 PACIFIC IS
2009 PAKISTAN IR
2009 PALAU
2009 PANAMA C Z
2009 PANAMA REPUBLIC
2009 PAPUA N GNA
2009 PARAGUAY
2009 PERU
2009 Petroleum Products
2009 PHILIPPINES
2009 PITCAIRN IS.
2009 POLAND
2009 PORTUGAL
2009 PUERTO RICO
2009 QATAR
2009 REUNION
2009 ROMANIA
2009 RUSSIA
2009 RWANDA
2009 SAHARWI A.DM RP
2009 SAMOA
2009 SAN MARINO
2009 SAO TOME
2009 SAUDI ARAB
2009 SENEGAL

2013 PALAU
2013 PANAMA C Z
2013 PANAMA REPUBLIC
2013 PAPUA N GNA
2013 PARAGUAY
2013 PERU
2013 Petroleum Products
2013 PHILIPPINES
2013 PITCAIRN IS.
2013 POLAND
2013 PORTUGAL
2013 PUERTO RICO
2013 QATAR
2013 REUNION
2013 ROMANIA
2013 RUSSIA
2013 RWANDA
2013 SAHARWI A.DM RP
2013 SAMOA
2013 SAN MARINO
2013 SAO TOME
2013 SAUDI ARAB
2013 SENEGAL
2013 SERBIA
2013 SEYCHELLES
2013 SIERRA LEONE
2013 SINGAPORE
2013 SINT MAARTEN (DUTCH PART)
2013 SLOVAK REP
2013 SLOVENIA
2013 SOLOMON IS
2013 SOMALIA
2013 SOUTH AFRICA
2013 SOUTH SUDAN 
2013 SPAIN
2013 SRI LANKA DSR
2013 ST HELENA
2013 ST KITT N A
2013 ST LUCIA
2013 ST PIERRE
2013 ST VINCENT
2013 STATE OF PALEST
2013 SUDAN
2013 SURINAME
2013 SWAZILAND
2013 SWEDEN
2013 SWITZERLAND
2013 SYRIA
2013 TAIWAN
2013 TAJIKISTAN
2013 TANZANIA REP
2013 THAILAND
2013 TIMOR LESTE
2013 TOGO
2013 TOKELAU IS
2013 TONGA
2013 Trade to Unspecified Countries
2013 TRINIDAD
2013 TUNISIA
2013 TURKEY
2013 TURKMENISTAN
2013 TURKS C IS
2013 TUVALU
2013 U ARAB EMTS
2013 U K
...

2017 SUDAN
2017 SURINAME
2017 SWAZILAND
2017 SWEDEN
2017 SWITZERLAND
2017 SYRIA
2017 TAIWAN
2017 TAJIKISTAN
2017 TANZANIA REP
2017 THAILAND
2017 TIMOR LESTE
2017 TOGO
2017 TOKELAU IS
2017 TONGA
2017 Trade to Unspecified Countries
2017 TRINIDAD
2017 TUNISIA
2017 TURKEY
2017 TURKMENISTAN
2017 TURKS C IS
2017 TUVALU
2017 U ARAB EMTS
2017 U K
2017 U S A
2017 UGANDA
2017 UKRAINE
2017 UNION OF SERBIA & MONTENEGRO
2017 UNSPECIFIED
2017 URUGUAY
2017 US MINOR OUTLYING ISLANDS               
2017 UZBEKISTAN
2017 VANUATU REP
2017 VATICAN CITY
2017 VENEZUELA
2017 VIETNAM SOC REP
2017 VIRGIN IS US
2017 WALLIS F IS
2017 YEMEN REPUBLC
2017 ZAMBIA
2017 ZIMBABWE

Looking for [chromedriver 78.0.3904.105 mac64] driver in cache 
File found in cache by path [/Users/anjaliagrawal/.wdm/drivers/chromedriver/78.0.3904.105/mac64/chromedriver]
2018 AFGHANISTAN TIS
2018 ALBANIA
2018 ALGERIA
2018 AMERI SAMOA
2018 ANDORRA
2018 ANGOLA
2018 ANGUILLA
2018 ANTARTICA
2018 ANTIGUA
2018 ARGENTINA
2018 ARMENIA
2018 ARUBA
...

In [7]:
export_comm_data.reset_index(drop=True).to_csv('india_export.csv', index=False)