## Create dataset 

This notebook creates the initial dataset used in later steps of the analysis pipeline (preprocessing, visualization, analysis, etc.). It combines three data sources:

1. Noise data for the year 2022 in eight locations in Leuven, obtained from the city of Leuven.

2. Meteo data for the year 2022, obtained from the authors of the paper *Quality control and correction method for air temperature data from citizen science weather station network in Leuven, Belgium*.

3. Opening hours of bars, pubs, clubs, etc. around the locations of interest, obtained from the Google Maps API.


Note that this notebook is not "automatically reproducible" because it requires: 

* A Google Maps API key, which has not been shared here. To obtain one, create a Google Cloud account (free trial available), create a project, and enable the "Places API" in that project. Then copy-paste your key into the `create_final_csv_functions.py` file.

* Write access to the S3 bucket, which stores the csvs of the intermediate steps and the final csv. All files in the S3 bucket "s3://mda.project.monaco/" are publicly readable; however, to write files to the bucket you need to be given access as an IAM user.
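
As an alternative to pasting the key into a source file, the key could be read from an environment variable so that it never ends up in version control. The sketch below is only an illustration: the variable name `GOOGLE_MAPS_API_KEY` and the helper `load_api_key` are hypothetical and are not part of the project's function files.

```python
import os


def load_api_key(env_var: str = "GOOGLE_MAPS_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Google Maps API key."
        )
    return key
```

With this approach, the key would be set once in the shell (for example `export GOOGLE_MAPS_API_KEY=...`) before starting the notebook.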

In [1]:
# Import packages, functions and API key
%run 'create_dataset_functions.py'

In [None]:
# Using the provided noise and meteo data, build one csv per month with the noise data grouped into 10-minute intervals alongside the meteo data, then concatenate the monthly csvs into a single csv covering the whole year.

months = ["Jan", "Feb", "March", "April", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

for m in months: 
    create_noise_meteo_csv_by_month(m)

concatenate_noise_meteo_sub_csv()

# The resulting csv was uploaded to the S3 bucket at: "s3://mda.project.monaco/project_data_v1.csv"
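
The body of `create_noise_meteo_csv_by_month` is not shown in this notebook. As a rough illustration of what grouping noise data every 10 minutes can look like in pandas, here is a minimal sketch. The column names `timestamp` and `laeq`, the sample values, and the use of a plain arithmetic mean are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw noise readings (column names and values are illustrative).
raw = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2022-01-01 00:01:00", "2022-01-01 00:07:00", "2022-01-01 00:12:00"]
        ),
        "laeq": [50.0, 54.0, 60.0],
    }
)

# Bin the readings into 10-minute intervals and average within each bin.
noise_10min = (
    raw.set_index("timestamp")
    .resample("10min")["laeq"]
    .mean()
    .reset_index()
)

print(noise_10min)
```

Here the first two readings fall into the 00:00 bin and the third into the 00:10 bin, so the result has two rows.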

In [3]:
# Query the Google Maps API for the number of bars, pubs, clubs, etc. open within a 100 m radius of each location, for each day of the week and time of day.

df_coordinates = coordinates_locations() 

# Extract location names as list to be used for later naming of csv files. 
names = df_coordinates['location_name'].tolist()

# For each location, obtain url to send request to Google Maps API
urls = urls_locations(df_coordinates)

# Create one csv per location with opening hours. Save in s3 bucket: "s3://mda.project.monaco/location_%s.csv" %(location_id) 
opening_hours_locations_csv(urls, names)

# Create one dataframe with opening hours for all locations, save as csv, and upload to s3 bucket: s3://mda.project.monaco/openings.csv
# Build all per-location frames first and concatenate once, instead of growing a dataframe inside the loop.
df_out = pd.concat([count_open_bars(n) for n in names])

df_out.to_csv("data/openings.csv", index=False) 
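
The implementation of `count_open_bars` lives in the functions file and is not shown here. The sketch below illustrates one possible way to turn opening periods into a count of open venues per weekday and hour. The table layout (`venue`, `weekday`, `open_hour`, `close_hour`), the helper name `count_open_by_hour`, and the simplification that no period crosses midnight are assumptions made for this example.

```python
import pandas as pd

# Hypothetical opening periods: weekday 0 = Monday, hours on a 24 h clock.
periods = pd.DataFrame(
    {
        "venue": ["bar_a", "bar_b", "bar_b"],
        "weekday": [4, 4, 5],
        "open_hour": [18, 20, 20],
        "close_hour": [23, 24, 24],
    }
)


def count_open_by_hour(periods: pd.DataFrame) -> pd.DataFrame:
    """Count distinct open venues for every (weekday, hour) pair."""
    rows = []
    for p in periods.itertuples(index=False):
        # A venue counts as open for every full hour from open_hour up to close_hour.
        for hour in range(p.open_hour, p.close_hour):
            rows.append({"weekday": p.weekday, "hour": hour, "venue": p.venue})
    expanded = pd.DataFrame(rows)
    return (
        expanded.groupby(["weekday", "hour"])["venue"]
        .nunique()
        .rename("n_open")
        .reset_index()
    )


counts = count_open_by_hour(periods)
```

In this toy data, both venues are open on Friday (weekday 4) at 20:00, so that slot gets a count of 2, while Friday 18:00 gets a count of 1.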


In [None]:
# Merge noise and meteo data with counts of open bars, pubs, clubs etc.
df = pd.read_csv("s3://mda.project.monaco/project_data_v1.csv") 
df_openings = pd.read_csv("s3://mda.project.monaco/openings.csv")

df_merged = merger(df, df_openings)

# Save final dataset and upload to s3 bucket at: "s3://mda.project.monaco/project_data.csv"
df_merged.to_csv("data/project_data.csv", index=False) 
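
The `merger` function is defined in the functions file and is not reproduced in this notebook. As a hedged illustration of the general idea, the sketch below derives the weekday and hour from each noise timestamp and left-joins the opening counts on location, weekday, and hour. All column names and values are hypothetical, and filling missing counts with 0 is an assumption of the example rather than a documented behavior of `merger`.

```python
import pandas as pd

# Hypothetical noise/meteo rows (2022-01-07 is a Friday, 2022-01-08 a Saturday).
noise = pd.DataFrame(
    {
        "location_name": ["loc_a", "loc_a"],
        "timestamp": pd.to_datetime(["2022-01-07 20:10:00", "2022-01-08 03:00:00"]),
        "laeq": [61.2, 48.5],
    }
)

# Hypothetical opening counts per location, weekday (0 = Monday), and hour.
openings = pd.DataFrame(
    {"location_name": ["loc_a"], "weekday": [4], "hour": [20], "n_open": [3]}
)

# Derive the join keys from the timestamp.
noise["weekday"] = noise["timestamp"].dt.dayofweek
noise["hour"] = noise["timestamp"].dt.hour

# Left join keeps every noise row; validate guards against duplicate opening rows.
merged = noise.merge(
    openings,
    on=["location_name", "weekday", "hour"],
    how="left",
    validate="many_to_one",
)

# Slots without a matching opening row are treated as having no open venues.
merged["n_open"] = merged["n_open"].fillna(0).astype(int)
```

A left join is used in the sketch so that noise measurements taken when no venue is open are kept rather than dropped.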