# Explainer Page

### Motivation:
* What is your dataset?
  * Socio-economic data for Copenhagen: [data.kk.dk](https://data.kk.dk/dataset/samlede-socio-data-kbh)
  * The dataset contains measurements for different areas of Copenhagen from 2008 to 2014. Some areas are missing a lot of data.
  * It covers the educational level, income level and age distribution of the people living in each area.
* Why did you choose this/these particular dataset(s)?
  * Because we are interested in how education and age influence income level and where people live.
  * The dataset lets us explore several aspects of the population: it includes different age groups and educational levels, so we can find both the wealthiest areas and the poorer neighborhoods.
* What was your goal for the end user’s experience?
  * The purpose of our visualization is to give the user an overview of where the poor and rich neighborhoods are, and of the educational level and age of the people living there, so that people moving to Copenhagen can find a neighborhood that suits them and where they fit in.

### Basic stats:
* Write about your choices in data cleaning and preprocessing
  * We removed the rows containing NaNs.
  * We kept only the columns regarding educational level, income and age. Our processed dataset can be found [here](processed_data.csv).
* Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.
  * The total size of our data is 596.9 KB.
  * The number of rows is 2675 (including the header row).
  * The date range is 2008-2014.
  * The educational levels, income levels and age groups are given in percent.
  * We created an extra column containing the mean income level of each area, computed as $\frac{1 \times low + 2 \times medium + 3 \times high}{100}$, where low, medium and high are the percentages of people in the area with low, medium and high income, respectively.
  * We created a Random Forest regression model to predict the average income level for 2014, since this was not included in the dataset.
  * The predictions were based on the age groups and educational levels.
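
The mean income level formula can be checked with a small worked example. This is only a sketch: the function name `avg_income_level` and the sample percentages are made up for illustration, and only the weights (1, 2, 3, divided by 100) come from the formula above.

```python
def avg_income_level(low_pct, medium_pct, high_pct):
    """Weighted mean income level on a 1-3 scale from percentage shares."""
    return (1 * low_pct + 2 * medium_pct + 3 * high_pct) / 100


# Hypothetical area: 20% low, 50% medium and 30% high income.
score = avg_income_level(20, 50, 30)
print(score)  # 2.1
```

As long as the three shares sum to 100, the score lies between 1 (everyone has low income) and 3 (everyone has high income).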

### Genre: 
* Which genre of data story did you use? 
  * Annotated chart and animation 
* Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer). Why?
  * Visual structuring: Checklist progress tracker
  * Highlighting: Zooming 
  * Transition Guidance: Animated transitions
  * We used these because they are easy to visualize.
* Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer). Why?
  * Ordering: Random access
  * Interactivity: Filtering/selection/search
  * Messaging: Captions/headline
  * We used these tools to let users explore the data further, more quickly and more easily.

### Visualizations:
* Explain the visualizations you’ve chosen.
  * We have chosen to use a map to show the areas, a histogram to display the age groups and a donut chart to display the educational levels.
* Why are they right for the story you want to tell?
  * Because interactivity lets users explore the data themselves and get meaningful information out of it, and the charts are simple to read.
  * The map also gives an easy way to interpret the location of the areas. The histogram gives a good overview of the ages of the people living in the area. Finally, the donut chart is a fun and smart way to show how educational levels are distributed among the people in the area.


### Discussion:
* What went well?
  * All the visualizations shown went pretty well, as did the model used for predicting the income level for 2014.
* What is still missing? What could be improved? Why?
  * One thing we wanted to add was a plot of the predicted income values.
  * Another improvement would be to update the histogram and donut chart automatically when another year is selected.
  * These were not done because we ran out of time.

### Contributions:
* Who did what?
  * Oldouz:
    * Wrote the "Story behind the data"
    * Wrote "Motivation", "Genre" and "Visualizations".
  * Martin:
    * Created all the visualizations and the webpage.
    * Wrote the "Explanation of the visualization and how to use it" on the webpage.
    * Wrote the code included below and the script "train_model.py"
    * Wrote "Basic stats" and "Discussion".

## Code

In this part of the page we will go through the function calls that prepare the data for our visualization. The code can be found [here](train_model.py).

### Imports

In [1]:
from train_model import getColumns, prepareData, trainModel, predict

### Get and prepare the data

In [2]:
save_as = "processed_data.csv"
df = getColumns(file_name="samlede_socio_data_kbh.csv")
df["avg_income_level"] = (1*df["pct_lav_indkomst"] + 2*df["pct_middel_indkomst"] + 3*df["pct_hoj_indkomst"]) / 100
age_cols = ["alder_pct_0_5", "alder_pct_6_17", "alder_pct_18_29", "alder_pct_30_39", "alder_pct_40_49", "alder_pct_50_64", "alder_pct_over_65"]
df = df[~(df[age_cols] == 0).all(axis=1)]
X_train, y_train, X_eval, y_eval, X_test, y_test = prepareData(df)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  dfTrain.dropna(inplace=True)
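
The warning above is raised by pandas when `dropna(inplace=True)` is called on a frame that was itself sliced from another frame. A minimal sketch of one way to avoid it is shown here. The internals of `prepareData` are not reproduced on this page, so the `year` column, the split condition and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
df = pd.DataFrame({
    "year": [2012, 2013, 2013, 2014],
    "pct_lav_indkomst": [30.0, np.nan, 25.0, np.nan],
})

# Take an explicit copy of the slice, then reassign instead of using inplace=True.
df_train = df[df["year"] < 2014].copy()
df_train = df_train.dropna()

print(len(df_train))  # 2
```

Copying the slice makes it clear that `df_train` is an independent frame, so dropping rows from it cannot be confused with modifying `df`.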


### Train the model

We train a Random Forest regression model to predict the average income level.

In [3]:
model = trainModel(X_train,y_train,X_eval,y_eval)

Train Score 0.9672282385695855
Eval Score 0.724750977087156
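
For readers without access to `train_model.py`, the following is a minimal sketch of what a training step of this kind could look like with scikit-learn's `RandomForestRegressor`. It is an assumption, not the actual `trainModel`: the function name `train_model_sketch`, the hyperparameters and the synthetic data are all chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def train_model_sketch(X_train, y_train, X_eval, y_eval, seed=0):
    """Fit a random forest and return it with its train and eval R^2 scores."""
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    return model, model.score(X_train, y_train), model.score(X_eval, y_eval)


# Synthetic stand-in data: 5 percentage-like features, target on roughly a 1-3 scale.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 5))
y = 1 + 2 * X[:, 0] / 100 + rng.normal(0, 0.05, size=200)

model, train_r2, eval_r2 = train_model_sketch(X[:150], y[:150], X[150:], y[150:])
print(f"Train R^2: {train_r2:.3f}, Eval R^2: {eval_r2:.3f}")
```

A training score that is clearly higher than the evaluation score, as in the output above, usually suggests some overfitting. Limiting tree depth or requiring more samples per leaf are common ways to narrow that gap.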


### Predict and save the results

In [4]:
y_test_pred = predict(model,X_test)
res = df
res.loc["2014-01-01","avg_income_level"] = y_test_pred
res.to_csv(save_as)
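
The assignment above appears to rely on the frame being indexed by date, so that the label `"2014-01-01"` selects every 2014 row. The same pattern can be sketched with an explicit boolean mask. The toy frame and the predictions below are made up, and only the index label mirrors the code above.

```python
import numpy as np
import pandas as pd

# Toy frame indexed by date strings (hypothetical values).
res = pd.DataFrame(
    {"avg_income_level": [1.8, 2.1, np.nan, np.nan]},
    index=["2013-01-01", "2013-01-01", "2014-01-01", "2014-01-01"],
)
y_test_pred = np.array([1.9, 2.2])  # made-up predictions, one per 2014 row

mask = res.index == "2014-01-01"
# The number of predictions must match the number of selected rows.
assert mask.sum() == len(y_test_pred)
res.loc[mask, "avg_income_level"] = y_test_pred

print(res["avg_income_level"].isna().sum())  # 0
```

Checking the lengths before assigning catches the case where the test set and the 2014 rows have drifted apart, for example after rows were dropped during cleaning.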