## Energy Label Classification
### Web scraping for Copenhagen buildings

__Digging for data__: I found a public company in Denmark, [https://sparenergi.dk/](https://sparenergi.dk/), which tries to guide people toward more sustainable choices about the energy sources they use at home.

As a first step in this process, they provide [a page](https://sparenergi.dk/forbruger/vaerktoejer/find-dit-energimaerke) where you can enter an address and get the corresponding energy classification.

After some more digging around, I found out that they also have [a map](https://sparenergi.dk/demo/addresses/map) where such an overview is accessible, so I decided to see if I could scrape it.

After a lot of digging, I worked out the parameters of the HTTP requests behind the map. The process involved opening my browser's developer tools ('Inspect'), switching to the 'Network' tab, and watching for the relevant HTTP requests while I played with the map. The most important parameter was a geographical polygon describing the area for which the information is to be fetched. An example of such an HTTP request can be found in [curl_request.txt](curl_request.txt). To turn it into a call to Python's [requests](https://pypi.org/project/requests/) module, I used [https://curl.trillworks.com/](https://curl.trillworks.com/), which does just this: it takes in a curl command and spits out a Python version.
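To make the conversion concrete, here is a minimal sketch of the shape such converted code usually takes: three dictionaries (headers, cookies, form data) passed to `requests.post`. The URL, the header values, the `polygon` field name and its encoding are all placeholders made up for illustration, and the sketch assumes the server answers with JSON; the real values are the ones in curl_request.txt.

```python
# Hypothetical values throughout: the endpoint, the header values and the
# "polygon" field name are placeholders, not the real sparenergi.dk API.
URL = "https://example.invalid/addresses/map"


def build_request_parts(polygon_points):
    """Return (headers, cookies, data) for one polygon.

    polygon_points is a list of (lat, lon) tuples describing the area.
    """
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    cookies = {}  # a curl-to-Python converter copies the browser cookies here
    # Encode the polygon as "lat lon,lat lon,..." (an assumed format)
    polygon = ",".join(f"{lat} {lon}" for lat, lon in polygon_points)
    data = {"polygon": polygon}
    return headers, cookies, data


def send(polygon_points, timeout=30):
    """POST one polygon and return the decoded response (assumed to be JSON)."""
    import requests  # third-party; imported lazily so the builder works without it

    headers, cookies, data = build_request_parts(polygon_points)
    response = requests.post(URL, headers=headers, cookies=cookies,
                             data=data, timeout=timeout)
    response.raise_for_status()
    return response.json()
```

Keeping the payload construction separate from the network call makes it easy to build all the requests up front and send them later.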

So I defined a polygon large enough to enclose the Municipality of Copenhagen and, in order to parallelize the scraping, I broke it down into a 200x200 grid. I then made a request for each of the individual polygons in this grid and combined the information into a single file.

Here is the large polygon I used for this process: ![Image](data/img/ibm_cph_map.PNG)

In [1]:
import utils
from datetime import datetime
import time
import requests
import multiprocessing as mp

I decided to parallelize the scraping as much as possible, since I wanted to avoid sending off a single request and waiting for the server to send one large response back (I was also skeptical about whether my Python setup would be able to process it in one go).

After a bit of googling, I found [this](https://www.jpytr.com/post/analysinggeographicdatawithfolium/) very nice approach, which essentially constructs an NxN grid from a 'parent' polygon.

#### 1. Split the larger area for which we want to retrieve building energy classes into a grid

In [2]:
# Form the grid
lower_left = [utils.REGION_LAT[0], utils.REGION_LON[0]]
upper_right = [utils.REGION_LAT[2], utils.REGION_LON[2]]
grid = utils.get_geojson_grid(upper_right, lower_left, n=200)
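Since `utils.get_geojson_grid` is not shown here, the following is only a minimal sketch of the underlying idea: dividing a bounding box into n x n rectangular cells. The function name `split_bbox` and the plain list-of-corners output are assumptions for illustration; the real helper may return a different structure (for example GeoJSON features).

```python
def split_bbox(lower_left, upper_right, n):
    """Split a bounding box into an n x n grid of rectangular cells.

    lower_left and upper_right are [lat, lon] pairs. Each cell is returned
    as a closed ring of five [lat, lon] corners.
    """
    lat0, lon0 = lower_left
    lat1, lon1 = upper_right
    lat_step = (lat1 - lat0) / n
    lon_step = (lon1 - lon0) / n

    cells = []
    for row in range(n):
        for col in range(n):
            south = lat0 + row * lat_step
            north = lat0 + (row + 1) * lat_step
            west = lon0 + col * lon_step
            east = lon0 + (col + 1) * lon_step
            cells.append([
                [south, west],
                [south, east],
                [north, east],
                [north, west],
                [south, west],  # repeat the first corner to close the ring
            ])
    return cells


cells = split_bbox([55.0, 12.0], [56.0, 13.0], n=4)
print(len(cells))  # 16
```

Each cell can then be turned into the polygon parameter of its own request.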

#### 2. Create the HTTP requests

In [3]:
start = datetime.now()
headers_l = list()
cookies_l = list()
data_l = list()

# Build one (headers, cookies, data) triple per grid cell
for i in range(200):
    h, c, d = utils.fill_request(grid[i])
    headers_l.append(h)
    cookies_l.append(c)
    data_l.append(d)

print(datetime.now()-start)

0:00:00.001349


#### 3. Fire off the requests in a parallelized way and store the responses to file (one per process)

In [4]:
# One worker per CPU core; the context manager shuts the pool down afterwards
with mp.Pool(mp.cpu_count()) as pool:
    results = pool.map(utils.make_request, zip(cookies_l, headers_l, data_l))
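As an illustration of the fan-out pattern in this step, here is a self-contained sketch with a stand-in worker, so it runs without network access. It uses `multiprocessing.pool.ThreadPool`, which offers the same `map` interface as `mp.Pool` but runs the workers as threads; that is a substitution made for the sketch (threads are a common fit for I/O-bound work such as HTTP requests), not what the notebook does. `fake_make_request`, `scrape_all` and the file naming are hypothetical, and the sketch writes one file per job rather than one per process.

```python
import json
import tempfile
from multiprocessing.pool import ThreadPool
from pathlib import Path


def fake_make_request(args):
    """Stand-in for utils.make_request: takes one (cookies, headers, data) tuple."""
    cookies, headers, data = args
    # A real worker would send the HTTP request here and return its payload.
    return {"polygon": data["polygon"], "buildings": []}


def scrape_all(jobs, out_dir, workers=4):
    """Run every job in a pool, write one JSON file per job, return the paths."""
    with ThreadPool(workers) as pool:
        results = pool.map(fake_make_request, jobs)  # order matches `jobs`
    paths = []
    for i, result in enumerate(results):
        path = Path(out_dir) / f"cell_{i:05d}.json"
        path.write_text(json.dumps(result))
        paths.append(path)
    return paths


jobs = [({}, {}, {"polygon": f"cell-{i}"}) for i in range(8)]
with tempfile.TemporaryDirectory() as tmp:
    paths = scrape_all(jobs, tmp)
    # Combine the per-cell files back into a single list
    combined = [json.loads(p.read_text()) for p in paths]
print(len(combined))
```

Because `map` returns results in the same order as its input, each output file can be traced back to the grid cell that produced it.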