<h1> Preprocessing for the urban diversity web tool </h1>

The urban superdiversity web tool reads its data from JSON in a specific format. You can create that format yourself or use the tools provided in this notebook. 
Below is an example of the data used in the web app, with explanatory comments. If you only wanted to know what the data should look like, you have found it and won't need the rest of this notebook!



```
{"type":"FeatureCollection",    //geojson format
"cityYear":"Vancouver-2006",
"features":[                    //each data point is in the features array. 
  {"type":"Feature",            //each data point is its own object, always type feature
  "properties":{                //properties are copied from the shape file information
    "DAUID":"59150004",
    "CSDUID":"5915055",
    "CCSUID":"5915020",
    "CDUID":"5915",
    "ERUID":"5920",
    "PRUID":"59",
    "CTUID":"9330133.01",
    "CMAUID":"933",
    "indices":{                 //indices is what was provided in the data csv file.
      "Population":353,
      "Ethnicity-raw-count":22,
      "Ethnicity-raw-normalized":0.062323,
      "Mobility-raw-pct":36.619718,
      "Generation-raw-SI":0.732222,
      "Education-raw-SI":0.796875,
      "Income-raw-SI":0.865077}
      },
    "geometry":{               // point is the centroid of the polygon in the shapefile
      "type":"Point",
      "coordinates":[-123.25887520351195,49.38854863564009]},
    "geom_store":{             // the original polygon from the shapefile
      "type":"Polygon",
      "coordinates":[[[a,b],   //proper coordinates left out; this is an array of many [lon, lat] pairs
        [c,d],
        [...]]]}},
  {...},
  {...},
  {...}]}                      //this array of objects contains all data points used on the map later.
```
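
To check a file against this layout before loading it into the web tool, a minimal sketch like the one below may help. The helper name `check_feature_collection` is hypothetical and not part of the tool, and it only tests for the keys shown in the example above. The polygon ring in the sample is a placeholder.

```python
def check_feature_collection(data):
    """Return a list of problems found; an empty list means the layout matches."""
    problems = []
    if data.get("type") != "FeatureCollection":
        problems.append("top-level 'type' should be 'FeatureCollection'")
    if "cityYear" not in data:
        problems.append("missing 'cityYear'")
    for i, feature in enumerate(data.get("features", [])):
        if "indices" not in feature.get("properties", {}):
            problems.append(f"feature {i}: missing properties.indices")
        if feature.get("geometry", {}).get("type") != "Point":
            problems.append(f"feature {i}: geometry should be a Point (the centroid)")
        if "geom_store" not in feature:
            problems.append(f"feature {i}: missing geom_store")
    return problems


sample = {
    "type": "FeatureCollection",
    "cityYear": "Vancouver-2006",
    "features": [{
        "type": "Feature",
        "properties": {"DAUID": "59150004", "indices": {"Population": 353}},
        "geometry": {"type": "Point",
                     "coordinates": [-123.25887520351195, 49.38854863564009]},
        # placeholder ring, not real coordinates
        "geom_store": {"type": "Polygon",
                       "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]},
    }],
}

print(check_feature_collection(sample))  # []
```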



<h3>Data format needed for these scripts to work</h3>



*   A shapefile, converted to geojson (a program like ArcGIS can do this)
*   A csv file containing all the parameters/indices you want to analyse. One column of this csv file MUST hold the identifier that matches a property of the features in the geojson file. The two files are merged on this identifier.

<h4> Step 1 </h4>
Upload the two files by clicking on the upload icon below "Files" in the left menu. The scripts expect one file ending in .csv and one ending in .json (a name ending in .geojson is matched as well).

<h4> Step 2</h4>
Provide the names of the matching parameter in both files. Type the names in the form below, which has been populated with sample names. `csv_comparator` is the column name in your csv file that contains the list of entities, while `geojson_comparator` is the property in the geojson file that matches the data in that csv column. 
You can also provide a meaningful name for your dataset.
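
Before running the merge, it can be worth confirming that both names actually exist. The sketch below uses small in-memory stand-ins for the two uploaded files. `missing_comparators` is a hypothetical helper, not part of this notebook, and the sample values are only for illustration.

```python
import csv
import io


def missing_comparators(csv_text, geojson, csv_comparator, geojson_comparator):
    """Report which comparator name, if any, is absent from its file."""
    problems = []
    header = next(csv.reader(io.StringIO(csv_text)), [])
    if csv_comparator not in header:
        problems.append(f"csv has no column named {csv_comparator!r}")
    features = geojson.get("features", [])
    if not all(geojson_comparator in f.get("properties", {}) for f in features):
        problems.append(f"not every feature has a property named {geojson_comparator!r}")
    return problems


csv_text = "DA-ID,Population\n59150004,353\n"
geojson = {"features": [{"properties": {"DAUID": "59150004"}}]}

print(missing_comparators(csv_text, geojson, "DA-ID", "DAUID"))  # [] means both names were found
print(missing_comparators(csv_text, geojson, "da-id", "DAUID"))  # names are case sensitive
```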



In [None]:
#@title Names of matching parameters
geojson_comparator = 'DAUID' #@param {type:"string"}
csv_comparator = 'DA-ID' #@param {type:"string"}
name = 'City-2700' #@param {type:"string"}



<h4> Step 3 </h4>
Execute the code cell below. If everything goes right, a new json file is saved in the folder you uploaded your data to, and it can be used in the MultiViz tool.

<h5> Troubleshooting </h5>


*   Code fails to execute: Please read the error message. The most likely culprit is that the csv or json file could not be found.
*   Resulting file does not show any of the indices in the csv: Make sure the csv column name and the geojson property name you entered in Step 2 are spelled exactly as they appear in the files (both are case sensitive).
*   Jenks calculation fails: Only numerical values work. 
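
Another possible cause of missing indices is a type mismatch. This is an assumption about typical data, not something the notebook checks. Identifiers in geojson properties are often stored as strings, while a csv reader may parse the same column as numbers, so the comparison never matches. A minimal sketch of normalizing both sides to strings before comparing, with `normalize_id` as a hypothetical helper:

```python
def normalize_id(value):
    """Turn an identifier into a comparable string, e.g. 59150004 or 59150004.0 -> '59150004'."""
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return str(value).strip()


geojson_id = "59150004"   # as stored in the geojson properties
csv_id = 59150004         # as a csv reader might parse the same cell

print(geojson_id == csv_id)                              # False
print(normalize_id(geojson_id) == normalize_id(csv_id))  # True
```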


In [None]:
#some required libraries for the conversion.
import numpy as np
import pandas as pd
!pip install jenkspy
import jenkspy
import json
import sys
from pathlib import Path
import os
import copy
import math
from shapely.geometry import Polygon

#read the csv
for filename in Path(".").glob("*.csv"):
  try:
    df_indices = pd.read_csv(filename, index_col=False)
  except IOError:
    print("Error loading csv. Please upload one csv file and one geojson file to the main directory of this script.") 

#read the geojson (sometimes only json ending)
for filename in Path(".").glob("*json"):
  try:
    with open(filename) as f:  #close the file once it has been read
      shape = json.load(f)
  except IOError:
    print("Error while loading geojson. Please upload one csv file and one geojson file to the main directory of this script.") 

#df_indices only exists if a csv file was found and read above.
try:
  df_indices
  print("Successfully loaded csv. ")
  #print(df_indices)
except NameError:
  print("Could not find csv data.")

try:
  df_shape = pd.DataFrame(shape["features"])
  print("Successfully loaded geojson. ")
  #print(df_shape)
except BaseException:
  print("Could not find geojson data.")

#converts the value to a number (ints stay ints, everything else becomes a float).
#returns -1 if:
#   a) the value is not numeric (strings that represent numbers, e.g. "3.5", are converted)
#   b) the value is nan
def makeFloat(val):
  if isinstance(val, int):  #ints cannot be nan, so return them unchanged
    return val
  try:
    res = float(val)
  except (ValueError, TypeError):  #casting not possible --> not a numeric value
    return -1
  return -1 if math.isnan(res) else res

#Merging code. Grabs the row from the csv that matches the geojson_comparator from each entry in the geojson and puts the indices in a new  indices property.
for prop in df_shape["properties"]:  
  prop["indices"] = {}  #the indices from the csv file will be put in this indices property, when the comparators match.

  
  row = df_indices.loc[df_indices[csv_comparator]==prop[geojson_comparator]]  #find the correct row

  rowDict = row.to_dict(orient='records')

  if len(rowDict)>0:
    for key in rowDict[0]:  #go through the keys and add them to the json object.
      if ("ndex" not in key) and (key != csv_comparator):  #skip the comparator column and any column with "ndex" in its name
        val = makeFloat(rowDict[0][key])  #scalar from the first matching row. TODO: This needs to change if we're ever going to use non-numerical values in our evaluations. 
        prop["indices"][key] = round(val,6)


#create the new resulting json.
newJson = {}
newJson["type"] = "FeatureCollection"
newJson["cityYear"] = name
newJson["features"] = []

#write the data to the new geojson object. Also adds centroid property for the geometry
for index in df_shape.index:
  obj = {"type": "Feature", "properties": df_shape["properties"][index],"geometry":df_shape["geometry"][index]}
   #create the centroid for the data points
  geometry = obj["geometry"]
  #print(geometry)
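  if isinstance(geometry['coordinates'][0][0][0], (int, float)):  #Polygon: coordinates[0] is the outer ring
    P = Polygon(geometry['coordinates'][0])
  else:  #MultiPolygon: use the outer ring of the first polygon
    P = Polygon(geometry['coordinates'][0][0])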
  circle = P.centroid

  geometry_ = {"type":"Point","coordinates":[circle.x,circle.y]}
  obj["geometry"]=geometry_
  obj["geom_store"] = geometry
  newJson["features"].append(obj)

df_shape = newJson

with open("./"+ name +'.json', 'w') as f: #dump the data in the new file.
    f.write(json.dumps(df_shape, separators=(',', ':')))
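
The centroid written to `geometry` in the cell above comes from shapely's `Polygon.centroid`. For a simple polygon without holes, the sketch below shows what that computation amounts to, using the standard area-weighted (shoelace) formula. It is an illustration only, not the code the notebook runs, and `ring_centroid` is a hypothetical name.

```python
def ring_centroid(ring):
    """Area-weighted centroid of a closed ring of [x, y] pairs (first point == last point)."""
    area2 = 0.0  # twice the signed area of the ring
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate ring with zero area")
    return [cx / (3 * area2), cy / (3 * area2)]


square = [[0, 0], [2, 0], [2, 2], [0, 2], [0, 0]]
print(ring_centroid(square))  # [1.0, 1.0]
```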


<h4> Jenks values </h4>
If you want to add Jenks values to your data, execute the cell below. This only works when every parameter holds numerical values. The cell creates a new file with the same name plus a _withjenks suffix.
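
One detail to keep in mind when reading the resulting file back: the cell below stores the breaks under the number of classes (the integer 12), and JSON object keys are always strings. The sketch below shows the shape of the `jenks` entry and the effect of a round trip through `json`. The break values are placeholders, not real results.

```python
import json

# Shape of the "jenks" entry: parameter -> number of classes -> list of break values.
# Placeholder breaks only; a real run with 12 classes yields more values.
jenks = {"all": {"Income-raw-SI": {12: [0.0, 0.5, 1.0]}}}

restored = json.loads(json.dumps({"jenks": jenks}))
keys = list(restored["jenks"]["all"]["Income-raw-SI"].keys())
print(keys)  # ['12'], a string rather than the int 12
```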

In [None]:
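#mediator function to call jenkspy jenks calculation
def getJenks(arr,num):
  #the class count is passed positionally, because the keyword was renamed from nb_class to n_classes in newer jenkspy releases
  return jenkspy.jenks_breaks(arr, num)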

#get the jenks values from the jenksVals, which have been aggregated per parameters.
def makeJenks(data, jenksVals):
  currJson={}
  steps = [12] #Jenks steps to be used (not named range, to avoid shadowing the built-in)
  print(data.keys())
  currJson["all"]={}
  for key in jenksVals.keys():
    currJson["all"][key] = {} 
    for val in steps:
      currJson["all"][key][val] = getJenks(jenksVals[key], val)  

  data["jenks"] = currJson  #attach the jenks values to the data.
  #print(data.keys())
  with open("./"+ name +'_withjenks.json', 'w') as f: #dump the data in the new file.
    f.write(json.dumps(data, separators=(',', ':')))


arr = {"name":name, "all": {}}
#aggregate all values for all parameters in an array per parameter, which will be used for the jenks calculation.
#negative values are skipped, because makeFloat uses -1 to mark missing or non-numeric entries.
for feat in df_shape["features"]:

  for key in feat['properties']['indices'].keys(): #go through all the indices
      val = feat['properties']['indices'][key]

      if (val >= 0):  #create the list for this index on first use, then add the value
        arr['all'].setdefault(key, []).append(val)

makeJenks(df_shape,arr["all"]) 

dict_keys(['Total-pop', 'Generation-raw-SI', 'Income-raw-SI', 'Ethnicity-raw-count', 'Ethnicity-raw-norm', 'Education-raw-SI', 'Mobility-raw-pct_old', 'Mobility-raw-pct'])
dict_keys(['type', 'cityYear', 'features'])
