# International Days Scraper


This scraper downloads the list of international days from www.un.org
to build a dataset for research purposes.

The dataset has the following attributes:
- date
- event
- url
- doc
- url_doc

The code is organized into the following sections:

- Setup of the env (install libraries, ...)
- Setup variables
- Open the browser and get the web page (with Selenium and ChromeDriver)
- Parse DOM of the web page and download each single international day
- Store the data in a CSV file
- Close the browser

This notebook uses ChromeDriver to simulate user interaction with the United Nations website.
To set up ChromeDriver on your machine, please refer to https://chromedriver.chromium.org/downloads

The notebook is tested with
`ChromeDriver 91.0.4472.19`

Please set `chromedriver_path` to the location of your ChromeDriver executable.
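
Before starting the browser, it can help to confirm that the configured path actually points to a driver. The following is a minimal sketch, not part of the original notebook: `resolve_chromedriver` is a hypothetical helper, and the fallback to a `chromedriver` binary on the `PATH` is an assumption about how you may have installed it.

```python
import os
import shutil

def resolve_chromedriver(path):
    """Return a usable ChromeDriver path, or None if nothing is found.

    Hypothetical helper for illustration only.
    """
    # Accept the configured path when it points to an executable file.
    if os.path.isfile(path) and os.access(path, os.X_OK):
        return os.path.abspath(path)
    # Otherwise fall back to a `chromedriver` binary on the PATH, if any.
    return shutil.which("chromedriver")

# Example: check the default value used by this notebook.
driver_path = resolve_chromedriver("./chromedriver")
if driver_path is None:
    print("ChromeDriver not found; check chromedriver_path")
else:
    print(f"Using ChromeDriver at {driver_path}")
```

If the helper returns `None`, the browser cell further down would most likely fail, so it is cheaper to catch the problem here.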

### Setup of the env

Install and import the Python libraries.

In [399]:
!pip3 install selenium
!pip3 install pandas

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [400]:
import pandas as pd
import datetime
import time  # needed by time.sleep() in get_page()
from selenium import webdriver as wd
import selenium
import json

### Set up variables

In [401]:
chromedriver_path =  './chromedriver'
sleep_time = 0
url = 'https://www.un.org/en/observances/list-days-weeks'
log = True

### Start Chrome and get the web page

In [402]:
# from https://github.com/MatthewChatham/glassdoor-review-scraper/blob/master/main.py

def get_browser():
    chrome_options = wd.ChromeOptions()
    chrome_options.add_argument('log-level=3')
    browser = wd.Chrome(chromedriver_path, options=chrome_options)
    return browser

def get_page():
    global url
    print(f'Getting {url}')
    browser.get(url)

    # cookie_btn = browser.find_element_by_id('_evidon-accept-button')
    # cookie_btn.click()
    time.sleep(sleep_time)

def quit_browser():
    browser.quit()

browser = get_browser()

get_page()

Getting https://www.un.org/en/observances/list-days-weeks


### Get International days

The `get_days()` function parses the loaded page and returns a list of dicts, one per international day, each shaped like this:

{
  'date': '04-01',
  'event': 'World Braille Day',
  'url': 'https://www.un.org/en/observances/braille-day',
  'doc': 'A/RES/73/161',
  'url_doc': 'https://undocs.org/en/A/RES/73/161'
}
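
The two string transformations behind the `date` and `doc` fields can be tried without a browser. This is a minimal sketch: the helper names `to_day_month` and `strip_wrapping` are hypothetical, and the sample inputs assume the page shows dates as a day plus an abbreviated English month and wraps the document symbol in one leading and one trailing character (for example parentheses), which is what the `"%d %b"` format and the `[1:-1]` slice in `get_days()` suggest.

```python
import datetime

def to_day_month(label):
    """Convert a label such as '4 Jan' to 'DD-MM' (hypothetical helper)."""
    # With only a day and a month, strptime defaults the year to 1900.
    # 1900 is not a leap year, so a '29 Feb' label would raise ValueError.
    parsed = datetime.datetime.strptime(label, "%d %b")
    return parsed.strftime("%d-%m")

def strip_wrapping(text):
    """Drop the first and last character, as get_days() does for 'doc'."""
    return text[1:-1]

print(to_day_month("4 Jan"))             # assumed sample label
print(strip_wrapping("(A/RES/73/161)"))  # assumed sample text
```

Note that `%b` depends on the active locale, so the parsing assumes English month abbreviations.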

In [403]:
def get_days():
    days_list = []

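    # find_elements_* returns an empty list when nothing matches,
    # so no try/except is needed around this call.
    days_el = browser.find_elements_by_class_name("views-row")

    for day_el in days_el:
        day = dict()
        # Use a name that does not shadow the built-in `str`.
        date_text = day_el.find_element_by_xpath(".//div[1]/div/span").text
        date = datetime.datetime.strptime(date_text, "%d %b")
        day["date"] = date.strftime("%d-%m")  # day-month, e.g. "04-01"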
        day["event"] = day_el.find_element_by_xpath(".//span[1]/span[@class='field-content']").text # ok
        try:
            day["url"] = day_el.find_element_by_xpath(".//span[1]/span[@class='field-content']/a").get_attribute('href')
        except:
            day["url"] = ""
        
        try:
            day["doc"] = day_el.find_element_by_xpath(".//span[2]/span[@class='field-content']").text[1:-1]
        except:
            day["doc"] = ""
        
        try:
            day["url_doc"] = day_el.find_element_by_xpath(".//span[2]/span[@class='field-content']/a").get_attribute('href')
        except:
            day["url_doc"] = ""
        
        if (log):
            print(day["date"])

        days_list.append(day)
    
    return days_list

days = get_days()

print(f"International days found: {len(days)}")


04-01
24-01
27-01
01-02
04-02
06-02
10-02
11-02
13-02
20-02
21-02
01-03
03-03
08-03
10-03
20-03
20-03
21-03
21-03
21-03
21-03
21-03
21-03
22-03
23-03
24-03
24-03
25-03
25-03
02-04
04-04
05-04
06-04
07-04
07-04
12-04
14-04
20-04
21-04
22-04
22-04
23-04
23-04
23-04
24-04
24-04
25-04
25-04
26-04
26-04
28-04
30-04
02-05
03-05
05-05
08-05
08-05
10-05
15-05
16-05
16-05
17-05
17-05
20-05
21-05
21-05
22-05
23-05
25-05
26-05
29-05
31-05
01-06
03-06
04-06
05-06
05-06
06-06
07-06
08-06
12-06
13-06
14-06
15-06
16-06
17-06
18-06
19-06
20-06
21-06
21-06
23-06
23-06
25-06
26-06
26-06
27-06
29-06
30-06
30-06
03-07
11-07
15-07
18-07
20-07
28-07
30-07
30-07
01-08
09-08
12-08
19-08
21-08
22-08
23-08
29-08
30-08
31-08
05-09
07-09
08-09
09-09
12-09
15-09
16-09
17-09
18-09
21-09
23-09
26-09
27-09
28-09
29-09
30-09
30-09
01-10
02-10
02-10
04-10
04-10
05-10
09-10
09-10
10-10
11-10
13-10
15-10
16-10
17-10
24-10
24-10
24-10
24-10
27-10
31-10
02-11
05-11
06-11
09-11
10-11
13-11
14-11
16-11
18-11
19-11
20-11
20-1

### Store the data in a CSV file

In [404]:
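# `days` is the list returned by get_days(); `days_list` only exists inside that function.
df = pd.DataFrame(days)
# to_csv() writes the file and returns None, so its result is not assigned.
df.to_csv('international_days.csv', columns=['date','event','url', 'doc','url_doc'], index=False)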
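
To see the expected layout of `international_days.csv` without running the scraper, the sketch below writes one record to an in-memory buffer. It deliberately uses the standard library `csv` module instead of pandas, so it is an illustration of the file format rather than the notebook's own code path; `days_to_csv` and the sample record are assumptions made for the example.

```python
import csv
import io

# Same column order as the to_csv() call in the notebook.
COLUMNS = ["date", "event", "url", "doc", "url_doc"]

def days_to_csv(days):
    """Serialize a list of day dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(days)
    return buffer.getvalue()

# Illustrative record, not scraped output.
sample = [{
    "date": "04-01",
    "event": "World Braille Day",
    "url": "https://www.un.org/en/observances/braille-day",
    "doc": "A/RES/73/161",
    "url_doc": "https://undocs.org/en/A/RES/73/161",
}]

csv_text = days_to_csv(sample)
print(csv_text)
```

The first line of the output is the header row, followed by one line per international day.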

### Close the browser

In [405]:
quit_browser()