# 2. Data gathering

This part covers the data collection process. Its output is raw data, which is transformed into the final dataset in section 3, Dataset construction.

We divided the data into four categories:

* pick-up points data
* spatial shapes data
* demographic data
* points of interest data.

Pick-up points data comes from two websites: [Bliskapaczka.pl](https://bliskapaczka.pl) and [DHL](https://www.dhl.com/pl-pl/home.html?locale=true).

Spatial shapes data comes from [GUGIK](https://gis-support.pl/baza-wiedzy-2/dane-do-pobrania/granice-administracyjne/) (the Head Office of Geodesy and Cartography in Poland).

Demographic data is taken from the [Inspire repository](https://geo.stat.gov.pl/inspire) and contains indicators for 1 km² grid cells covering Poland.

Finally, we obtained points of interest data from the [OSM](https://download.geofabrik.de/europe/poland.html) repository. This site stores snapshots of OSM as shapefiles.

## Import dependencies

In [None]:
from google_drive_downloader import GoogleDriveDownloader
import requests
import json
import numpy as np
import pandas as pd

%config Completer.use_jedi = False

## Utilities

We define some utilities for code reproducibility. 

In [None]:
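def download_gd_data_from_dict(dictionary: dict):
    '''
    Download, save and unzip data from Google Drive.

    `dictionary` maps a dataset name to its Google Drive file ID.
    '''
    for name, file_id in dictionary.items():
        # Each archive is saved to, and unzipped in, its own folder under raw_data
        GoogleDriveDownloader.download_file_from_google_drive(file_id=file_id,
                                            dest_path=f"../datasets/raw_data/{name}/{name}.zip",
                                            unzip=True,
                                            showsize=True,
                                            overwrite=False)

def download_json_data_from_url(name: str, url: str):
    '''
    Download data from a JSON API and save it as a CSV file
    '''
    response = requests.get(url)
    # Fail early on HTTP errors instead of trying to parse an error page
    response.raise_for_status()
    df = pd.DataFrame(json.loads(response.text))
    df.to_csv(f"../datasets/raw_data/{name}.csv")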

## Download data collected and stored on our Google Drive

We downloaded the data from GUGIK, OSM and Inspire (using the links given in the introduction to this stage of the study) and stored it on our academic Google Drive so that the results can be reproduced at any time. With the utilities defined above, we only need to pass a file ID to download, store and unzip each file. These data are not stored in our remote git repository because of their size, but thanks to Google Drive they are available to anyone.

You can also inspect a file in the browser by combining https://drive.google.com/file/d/ with the file_id, for instance: https://drive.google.com/file/d/1BZCmADIZhJuf1_Jh-f6D8vSpI8p5-2wd
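
The URL rule above can be sketched as a small helper. The function name `drive_view_url` and the constant `DRIVE_FILE_PREFIX` are our own illustrative names and are not used elsewhere in this notebook; the file IDs are copied from the GUGIK cell.

```python
DRIVE_FILE_PREFIX = "https://drive.google.com/file/d/"

def drive_view_url(file_id: str) -> str:
    """Build the browser URL for a Google Drive file ID."""
    return DRIVE_FILE_PREFIX + file_id

# File IDs copied from the GUGIK cell of this notebook
gugik_ids = {'guigk_voi': '1BZCmADIZhJuf1_Jh-f6D8vSpI8p5-2wd',
             'guigk_pov': '1wX99dmNUbiEKYKh-qAfxDipT9oC6DLzE',
             'guigk_com': '1URjb9NM6Fm_qES5kC4QPPXZGERzarUIa'}

# Print one inspectable link per dataset
for name, file_id in gugik_ids.items():
    print(name, drive_view_url(file_id))
```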

### Source: GUGIK

In [None]:
gugik = {'guigk_voi':'1BZCmADIZhJuf1_Jh-f6D8vSpI8p5-2wd',
        'guigk_pov':'1wX99dmNUbiEKYKh-qAfxDipT9oC6DLzE',
        'guigk_com':'1URjb9NM6Fm_qES5kC4QPPXZGERzarUIa'}

download_gd_data_from_dict(gugik)

### Source: OSM

In [None]:
osm = {'osm_mazowieckie':'195E_n9JlgavFWp4mbaOCHAYKFWziBkc0',
        'osm_malopolskie':'1KG6uPhCZ-jKDgEpBU46WKHXVG_Mc-dBS'}

download_gd_data_from_dict(osm)

### Source: Inspire

In [None]:
inspire = {"inspire":"1avnBMziIn9uLetSbucMrZlZadhnSUvPE"}
download_gd_data_from_dict(inspire)
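
After the three download cells it may be worth confirming that every archive was actually extracted. The sketch below is not part of the original pipeline: `missing_datasets` is a hypothetical helper, and it assumes the folder layout produced by `download_gd_data_from_dict`, that is one folder per dataset name under `../datasets/raw_data`.

```python
from pathlib import Path

def missing_datasets(names, root="../datasets/raw_data"):
    """Return the dataset names whose folder is absent or holds no extracted files."""
    missing = []
    for name in names:
        folder = Path(root) / name
        extracted = []
        if folder.is_dir():
            # Ignore the downloaded archive itself; we want at least one extracted file
            extracted = [p for p in folder.iterdir() if p.name != f"{name}.zip"]
        if not extracted:
            missing.append(name)
    return missing

# Dataset names used as dictionary keys in the download cells
expected = ["guigk_voi", "guigk_pov", "guigk_com",
            "osm_mazowieckie", "osm_malopolskie", "inspire"]
print(missing_datasets(expected))
```

An empty list means that every dataset folder contains at least one file besides the zip archive.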

## Scrape pick-up points data from websites

We collect data about pick-up points from two websites, or more precisely from their APIs. This lets us gather the data in seconds.

### Source: Bliskapaczka.pl

In [None]:
url = 'https://pos.bliskapaczka.pl/api/v1/pos?fields=operator%2Ccode%2Clatitude%2Clongitude%2Cbrand%2CbrandPretty%2CoperatorPretty%2Ccod%2Cavailable%2C+city%2C+street&operators=RUCH%2CINPOST%2CPOCZTA%2CDPD%2CUPS%2CFEDEX'
download_json_data_from_url("bliska_paczka", url)
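
The query string in the URL above is percent-encoded, which makes it hard to see which fields and operators are requested. A minimal sketch that decodes it with the standard library follows; `query_lists` is a hypothetical helper name, and the code only parses the URL string without contacting the API.

```python
from urllib.parse import parse_qs, urlsplit

url = 'https://pos.bliskapaczka.pl/api/v1/pos?fields=operator%2Ccode%2Clatitude%2Clongitude%2Cbrand%2CbrandPretty%2CoperatorPretty%2Ccod%2Cavailable%2C+city%2C+street&operators=RUCH%2CINPOST%2CPOCZTA%2CDPD%2CUPS%2CFEDEX'

def query_lists(url: str) -> dict:
    """Decode a URL's query string into {parameter: list of comma-separated values}."""
    params = parse_qs(urlsplit(url).query)
    # Each parameter appears once here, so only the first value is split
    return {key: [item.strip() for item in values[0].split(",")]
            for key, values in params.items()}

decoded = query_lists(url)
print(decoded["operators"])
print(decoded["fields"])
```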

### Source: DHL

In [None]:
url = 'https://parcelshop.dhl.pl/index.php/points?type=lm&country=pl&ptype=parcelShop&hours_from=10&hours_to=16&week_days_PON=T&week_days_WT=T&week_days_SR=T&week_days_CZW=T&week_days_PT=T&week_days_SOB=N&week_days_NIEDZ=N&options_pickup_cod=N&show_on_map_parcelshop=T&show_on_map_parcelstation=T&show_on_map_pok=T&tab=pickup'
download_json_data_from_url("dhl", url)
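
The DHL request carries its filters as plain query parameters. As a rough sketch, they can be listed with the standard library to review what the request asks for; the code only parses the URL string, and we make no claim about how the DHL service interprets each parameter.

```python
from urllib.parse import parse_qsl, urlsplit

url = 'https://parcelshop.dhl.pl/index.php/points?type=lm&country=pl&ptype=parcelShop&hours_from=10&hours_to=16&week_days_PON=T&week_days_WT=T&week_days_SR=T&week_days_CZW=T&week_days_PT=T&week_days_SOB=N&week_days_NIEDZ=N&options_pickup_cod=N&show_on_map_parcelshop=T&show_on_map_parcelstation=T&show_on_map_pok=T&tab=pickup'

# Turn the query string into a {parameter: value} dictionary
dhl_params = dict(parse_qsl(urlsplit(url).query))

for key, value in dhl_params.items():
    print(f"{key} = {value}")
```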