## Final Assignment


Before working on this assignment, please read these instructions fully. Use Blackboard to submit a link to your repository.

On Blackboard you will find the assessment criteria. Please familiarize yourself with the criteria before beginning the assignment.

This assignment requires you to find at least two related datasets on the web and to build an application that visualizes these datasets to answer a research question on the broad topic of **health** or **agriculture** in the **region where you were born**. The region can be a city, a town, or a province.

The research question should be causal in nature, for instance: How does the independent variable X influence the dependent variable Y?

The code should be programmed efficiently. Identify the most critical part and write a software test for it. Take the performance of the data processing into account.
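As an illustration of what a test for a critical part could look like, the sketch below tests a per-100,000 normalization step. The function name `per_100k` and the chosen test cases are assumptions made for this example, not part of the assignment. The tests are plain functions, so a runner such as pytest could collect them.

```python
import math


def per_100k(count, population):
    """Return `count` expressed per 100,000 inhabitants (hypothetical helper)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population * 1e5


def test_known_value():
    # 500 cases among 50,000 inhabitants is 1,000 per 100,000.
    assert math.isclose(per_100k(500, 50_000), 1000.0)


def test_zero_cases():
    assert per_100k(0, 50_000) == 0.0


def test_rejects_non_positive_population():
    # A population of zero would otherwise raise ZeroDivisionError deep in the pipeline.
    try:
        per_100k(10, 0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a population of 0")
```

Keeping the critical calculation in a small pure function like this makes it testable without loading any dataset.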

### About the data

You can merge these datasets with data from different regions if you like. For instance, you might want to compare the health effects of earthquakes in Groningen with those in Los Angeles, USA.

You are welcome to choose datasets at your discretion, but keep in mind that they will be shared with others, so choose appropriate datasets. You are welcome to use datasets of your own as well, but at least two datasets should come from the web and/or APIs.

Also, you are welcome to preserve data in its original language, but for the purposes of grading you should provide English translations in your visualization.

### Instructions:

Define a research question, select data, and code your data acquisition, data processing, data analysis, and visualization. Write code to test the most critical parts. Use a repository with a commit strategy and write a README file.

Write a small document with the following:
- State the region and the domain category that your datasets are about
- State the research question 
- Justify the chosen data storage and processing approach
- Justify the chosen analysis approach
- Justify the chosen data visualization approach

Upload your document and the link to your repository to Blackboard.

In [1]:
import dask
import dask.dataframe as dd
import yaml
from dask.distributed import Client

In [3]:
# Read the obesity, overweight, and corona case files; their paths come from the YAML config
def get_config():
    """Load the data file paths from FA_config.yaml."""
    with open("FA_config.yaml", "r") as stream:
        config = yaml.safe_load(stream)
    return config
config = get_config()
obese = dd.read_csv(config["obese_cases"], sep=";")
overweight = dd.read_csv(config["overweight_cases"], sep=";")
cor_cases = dd.read_csv(config["corona_cases"], sep=";")
towns_provinces = dd.read_csv(config["town_province"])
civilian_count_region = dd.read_csv(config["civilian_count_region"], sep=";")

provinces = ["Groningen", "Drenthe"]
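A missing config entry only surfaces as a `KeyError` at the moment the key is first used. A minimal sketch of an up-front check follows, assuming `FA_config.yaml` should contain exactly the five keys read above. The helper name `missing_config_keys` and the example path are hypothetical.

```python
# Keys that the loading cell above reads from the config.
REQUIRED_KEYS = (
    "obese_cases",
    "overweight_cases",
    "corona_cases",
    "town_province",
    "civilian_count_region",
)


def missing_config_keys(config):
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Usage with a deliberately incomplete config (the path is made up for illustration).
example_config = {"obese_cases": "data/obese.csv"}
missing = missing_config_keys(example_config)
if missing:
    print("Config is missing:", ", ".join(missing))
```

Reporting all missing keys at once avoids fixing the config file one error at a time.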


In [4]:
"""Sources: https://www.cbs.nl/nl-nl/nieuws/2020/53/aantal-gemeenten-daalt-in-2021-verder-tot-352
https://www.cbs.nl/nl-nl/onze-diensten/methoden/classificaties/overig/gemeentelijke-indelingen-per-jaar/indeling-per-jaar/gemeentelijke-indeling-op-1-januari-2019"""
# Manually apply the municipality updates from 2017 and 2021: old names are replaced by the name of the merged municipality

obese = obese.mask(obese == "Bedum", "Het Hogeland")
obese = obese.mask(obese == "Eemsmond", "Het Hogeland")
obese = obese.mask(obese == "De Marne", "Het Hogeland")
obese = obese.mask(obese == "Winsum", "Het Hogeland")

obese = obese.mask(obese == "Grootegast", "Westerkwartier")
obese = obese.mask(obese == "Leek", "Westerkwartier")
obese = obese.mask(obese == "Marum", "Westerkwartier")
obese = obese.mask(obese == "Zuidhorn", "Westerkwartier")

obese = obese.mask(obese == "Hoogezand-Sappemeer", "Midden-Groningen")
obese = obese.mask(obese == "Slochteren", "Midden-Groningen")
obese = obese.mask(obese == "Menterwolde", "Midden-Groningen")



overweight = overweight.mask(overweight == "Bedum", "Het Hogeland")
overweight = overweight.mask(overweight == "Eemsmond", "Het Hogeland")
overweight = overweight.mask(overweight == "De Marne", "Het Hogeland")
overweight = overweight.mask(overweight == "Winsum", "Het Hogeland")

overweight = overweight.mask(overweight == "Grootegast", "Westerkwartier")
overweight = overweight.mask(overweight == "Leek", "Westerkwartier")
overweight = overweight.mask(overweight == "Marum", "Westerkwartier")
overweight = overweight.mask(overweight == "Zuidhorn", "Westerkwartier")

overweight = overweight.mask(overweight == "Hoogezand-Sappemeer", "Midden-Groningen")
overweight = overweight.mask(overweight == "Slochteren", "Midden-Groningen")
overweight = overweight.mask(overweight == "Menterwolde", "Midden-Groningen")

# Ten boer and Haren merged into Groningen; rows already named "Groningen" need no change
overweight = overweight.mask(overweight == "Ten boer", "Groningen")
overweight = overweight.mask(overweight == "Haren", "Groningen")
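The repeated `mask` calls above all encode one lookup from old municipality name to current name. A sketch of the same lookup as a single dictionary follows, using the spellings from the cell above. The names `MERGED_INTO` and `current_municipality` are introduced here for illustration. The Groningen entries appear only in the `overweight` block of the original code, so applying them to `obese` as well would be a change in behavior.

```python
# Old municipality name -> municipality it was merged into (taken from the cell above).
MERGED_INTO = {
    "Bedum": "Het Hogeland",
    "Eemsmond": "Het Hogeland",
    "De Marne": "Het Hogeland",
    "Winsum": "Het Hogeland",
    "Grootegast": "Westerkwartier",
    "Leek": "Westerkwartier",
    "Marum": "Westerkwartier",
    "Zuidhorn": "Westerkwartier",
    "Hoogezand-Sappemeer": "Midden-Groningen",
    "Slochteren": "Midden-Groningen",
    "Menterwolde": "Midden-Groningen",
    "Ten boer": "Groningen",
    "Haren": "Groningen",
}


def current_municipality(name):
    """Return the current municipality name; unknown names pass through unchanged."""
    return MERGED_INTO.get(name, name)


print(current_municipality("Leek"))
print(current_municipality("Assen"))
```

A dictionary like this keeps the mapping in one place, so both datasets can be renamed from the same source of truth.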



In [5]:
overweight_added_province = dd.merge(overweight,towns_provinces[["Gemeentenaam", "Provincienaam"]], left_on="Gemeente", right_on="Gemeentenaam")
obese_added_province = dd.merge(obese,towns_provinces[["Gemeentenaam", "Provincienaam"]], left_on="Gemeente", right_on="Gemeentenaam")

In [6]:
obese_Gr_Dr = obese_added_province[obese_added_province["Provincienaam"].isin(provinces)]
overweight_Gr_Dr = overweight_added_province[overweight_added_province["Provincienaam"].isin(provinces)]

In [7]:
obese_Gr_Dr = obese_Gr_Dr.rename(columns={"Ernstig overgewicht (%)":"Obese(%)",
                                          "Provincienaam":"Province",
                                          "Gebied":"Area","Gemeente":"Region",
                                          "idID":"RegionID"})
overweight_Gr_Dr = overweight_Gr_Dr.rename(columns={"Overgewicht (%)":"Overweight(%)", 
                                                    "Provincienaam":"Province", 
                                                    "Gebied":"Area","Gemeente":"Region",
                                                    "idID":"RegionID"})

In [8]:
obese_overweight = dd.merge(obese_Gr_Dr, overweight_Gr_Dr)[["Province", "RegionID","Region","Area", "Obese(%)", "Overweight(%)"]].dropna()


In [9]:
obese_overweight.head()

| | Province | RegionID | Region | Area | Obese(%) | Overweight(%) |
|---|---|---|---|---|---|---|
| 1 | Drenthe | 1680 | Aa en Hunze | Wijk 00 Annen | 15.0 | 53.0 |
| 2 | Drenthe | 1680 | Aa en Hunze | Wijk 01 Eext | 15.0 | 54.0 |
| 3 | Drenthe | 1680 | Aa en Hunze | Wijk 02 Anloo | 17.0 | 55.0 |
| 4 | Drenthe | 1680 | Aa en Hunze | Wijk 03 Gasteren | 13.0 | 50.0 |
| 5 | Drenthe | 1680 | Aa en Hunze | Wijk 04 Anderen | 15.0 | 55.0 |


In [10]:
obese_overweight[["Obese(%)", "Overweight(%)"]] = obese_overweight[["Obese(%)", "Overweight(%)"]].astype(float)

In [11]:
obese_overweight.head()

| | Province | RegionID | Region | Area | Obese(%) | Overweight(%) |
|---|---|---|---|---|---|---|
| 1 | Drenthe | 1680 | Aa en Hunze | Wijk 00 Annen | 15.0 | 53.0 |
| 2 | Drenthe | 1680 | Aa en Hunze | Wijk 01 Eext | 15.0 | 54.0 |
| 3 | Drenthe | 1680 | Aa en Hunze | Wijk 02 Anloo | 17.0 | 55.0 |
| 4 | Drenthe | 1680 | Aa en Hunze | Wijk 03 Gasteren | 13.0 | 50.0 |
| 5 | Drenthe | 1680 | Aa en Hunze | Wijk 04 Anderen | 15.0 | 55.0 |


In [12]:
mean_obese_overweight_region = obese_overweight.groupby(["Province", "Region", "RegionID"]).mean()

In [13]:
mean_obese_overweight_region = mean_obese_overweight_region.reset_index()
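The group-by step averages the area-level percentages within each municipality. The plain-Python sketch below reproduces that logic on the five Aa en Hunze rows shown by `head()` earlier. The function name `mean_by_region` is hypothetical. Because these are only the first five areas, the result differs from the municipality mean computed over all areas.

```python
from collections import defaultdict

# (Region, Obese(%), Overweight(%)) for the five rows shown by head().
rows = [
    ("Aa en Hunze", 15.0, 53.0),
    ("Aa en Hunze", 15.0, 54.0),
    ("Aa en Hunze", 17.0, 55.0),
    ("Aa en Hunze", 13.0, 50.0),
    ("Aa en Hunze", 15.0, 55.0),
]


def mean_by_region(records):
    """Return {region: (mean obese %, mean overweight %)} as an unweighted mean."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for region, obese_pct, overweight_pct in records:
        acc = totals[region]
        acc[0] += obese_pct
        acc[1] += overweight_pct
        acc[2] += 1
    return {region: (o / n, w / n) for region, (o, w, n) in totals.items()}


means = mean_by_region(rows)
print(means["Aa en Hunze"])
```

Note that this is an unweighted mean: every area counts equally regardless of how many people live in it, which is also how the `groupby(...).mean()` call above behaves.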

In [14]:
# Recode RegionID values (old code -> new code); code "14" is already correct and needs no change
mean_obese_overweight_region = mean_obese_overweight_region.mask(mean_obese_overweight_region == "416", "1952")
mean_obese_overweight_region = mean_obese_overweight_region.mask(mean_obese_overweight_region == "28", "1969")
mean_obese_overweight_region = mean_obese_overweight_region.mask(mean_obese_overweight_region == "1032", "1966")

In [15]:
mean_obese_overweight_region["RegionID"] = mean_obese_overweight_region["RegionID"].astype(int)

In [16]:
civilian_count_region["RegioS"] = civilian_count_region["RegioS"].str.strip("GM").astype(int)
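`str.strip("GM")` removes any leading or trailing `G` and `M` characters; it does not remove the literal prefix `GM`. That works here because the remainder is numeric, but a stricter parser makes the intent explicit. The sketch below assumes codes of the form `GM` followed by digits (for example `GM1680`), which is an assumption about the input files. The function name `municipality_code_to_int` is hypothetical.

```python
def municipality_code_to_int(code):
    """Convert a code such as 'GM1680' to the integer 1680.

    Assumes the format 'GM' + digits; surrounding whitespace is tolerated.
    """
    cleaned = code.strip()
    if not cleaned.startswith("GM"):
        raise ValueError(f"unexpected municipality code: {code!r}")
    return int(cleaned[2:])


print(municipality_code_to_int("GM1680"))
print(municipality_code_to_int("GM0014"))
```

Raising on an unexpected format surfaces malformed rows early instead of letting them fail later in the merge on `RegionID`.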

In [17]:
#Adding civilian count per region for normalization
civilian_count_region = civilian_count_region.rename(columns={"RegioS":"RegionID"})


In [18]:
mean_obese_overweight_region = mean_obese_overweight_region.merge(civilian_count_region[["RegionID", "TotaleBevolking_1"]],on="RegionID")

In [19]:
#Extracting Groningen and Drenthe
cor_dr_gr = cor_cases[cor_cases["Province"].isin(provinces)]

In [20]:
cor_dr_gr["Date_of_report"] = dd.to_datetime(cor_dr_gr["Date_of_report"], errors="ignore")
up_to_data_cor = cor_dr_gr[cor_dr_gr["Date_of_report"] == cor_dr_gr["Date_of_report"].max()].dropna()
up_to_data_cor = up_to_data_cor.rename(columns={"Municipality_code":"RegionID"})
up_to_data_cor["RegionID"] = up_to_data_cor["RegionID"].str.strip("GM").astype(int)
COVID_OvOb = mean_obese_overweight_region.merge(up_to_data_cor[["RegionID", "Total_reported","Hospital_admission", "Deceased"]], on="RegionID")
COVID_OvOb["TotaleBevolking_1"] = COVID_OvOb["TotaleBevolking_1"].astype(int)
COVID_OvOb = COVID_OvOb.rename(columns={"TotaleBevolking_1":"Civilian_number"})
COVID_OvOb["Normalized_Corona"] = COVID_OvOb["Total_reported"]/COVID_OvOb["Civilian_number"]*1e5
COVID_OvOb["Normalized_hospitalized"] = COVID_OvOb["Hospital_admission"]/COVID_OvOb["Civilian_number"]*1e5
COVID_OvOb["Normalized_deceased"] = COVID_OvOb["Deceased"]/COVID_OvOb["Civilian_number"]*1e5

In [21]:
COVID_OvOb.head()

| | Province | Region | RegionID | Obese(%) | Overweight(%) | Civilian_number | Total_reported | Hospital_admission | Deceased | Normalized_Corona | Normalized_hospitalized | Normalized_deceased |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Drenthe | Aa en Hunze | 1680 | 16.047619 | 54.714286 | 25445 | 912 | 5 | 15 | 3584.201218 | 19.650226 | 58.950678 |
| 1 | Drenthe | Assen | 106 | 14.8 | 51.6 | 68599 | 2107 | 39 | 12 | 3071.473345 | 56.852141 | 17.492966 |
| 2 | Drenthe | Borger-Odoorn | 1681 | 17.058824 | 56.588235 | 25559 | 993 | 18 | 12 | 3885.128526 | 70.425291 | 46.950194 |
| 3 | Drenthe | Coevorden | 109 | 16.333333 | 55.777778 | 35297 | 1874 | 43 | 25 | 5309.233079 | 121.823384 | 70.827549 |
| 4 | Drenthe | De Wolden | 1690 | 14.857143 | 52.571429 | 24330 | 1134 | 24 | 19 | 4660.912454 | 98.64365 | 78.092889 |
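One possible first look at the relationship between obesity and COVID-19 outcomes is a Pearson correlation. The sketch below uses only the five municipalities shown in the output above, so the value is illustrative and says nothing reliable about the full dataset. A correlation also does not establish the causal link the research question asks about. The function name `pearson_r` is introduced here for illustration.

```python
import math

# Values copied from the five rows shown above.
obese_pct = [16.047619, 14.8, 17.058824, 16.333333, 14.857143]
deceased_per_100k = [58.950678, 17.492966, 46.950194, 70.827549, 78.092889]


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(obese_pct, deceased_per_100k)
print(f"Pearson r over five municipalities: {r:.2f}")
```

For the full analysis, the same calculation would run over all municipalities in Groningen and Drenthe rather than this five-row excerpt.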
