### Interactive Data Visualization of Geospatial Data using D3.js, DC.js, Leaflet.js and Python
Taken from: http://adilmoujahid.com/posts/2016/08/interactive-data-visualization-geospatial-d3-dc-leaflet-python/?utm_campaign=Data%2BElixir&utm_medium=email&utm_source=Data_Elixir_93

The goal of this tutorial is to introduce the steps for building an interactive visualization of geospatial data.

To do this, we will use a dataset from a Kaggle competition to build a data visualization that shows the distribution of mobile phone users in China. We will also create additional charts that show the usage patterns, the most popular phone brands, and users’ age segments and gender. We will be able to filter the data by the different attributes and see the results reflected in the map and all charts.

We will cover a wide range of technologies in this tutorial: Pandas for cleaning the data, Flask for building the server, Javascript libraries d3.js, dc.js and crossfilter.js for building the charts and Leaflet.js for building the map.

#### 1. The Case Study

In this tutorial, we will use a dataset from a Kaggle competition called "TalkingData Mobile User Demographics". This dataset is provided by TalkingData, China’s largest third-party mobile data platform. It contains app usage data, geolocation data and mobile device properties. The goal of the competition is to predict the gender and age segments of users based on the data provided.

Data visualization is an important first step in the data analysis workflow. It enables us to effectively discover patterns through graphical means, and to represent these findings in a meaningful and effective way.

The dataset that we will use contains various attributes that can be combined to build interesting data visualizations. Geospatial data is particularly interesting, as it allows us to see how user profiles and usage behavior change with location.

In this tutorial, we will build a data visualization that combines a map showing user locations with various charts that summarize users’ information and usage behavior. We will make this visualization interactive, so we can drill down into a particular user segment or location.

#### 2. System Architecture

For our data visualization, we need a system architecture that handles the following:

    1) Cleaning and structuring data for visualization. We will mainly use Python’s Pandas library for this.  
    2) Serving static files (HTML, CSS and Javascript files) and data to the browser. We will use a lightweight Python server called Flask for this.  
    3) Building the charts and map. We will mainly use 3 Javascript libraries for this: DC.js, D3.js and Leaflet.js.

#### 3. Data Preparation

We start by downloading the dataset from the competition website. You need to create a Kaggle account and agree to the competition rules to download the data.

We will be using 3 csv files: gender_age_train.csv, events.csv and phone_brand_device_model.csv.

gender_age_train.csv: This file contains the device id, gender and age of users.  
events.csv: This file contains information about phone events triggered by the users. Each event has an id, a timestamp and location lat/long.  
phone_brand_device_model.csv: This file contains the brand and model for each device.
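
The three files share the device_id column, which is what lets us join them. As a minimal sketch (the rows below are made up for illustration and are not taken from the dataset), a left merge keeps every device from the first table and repeats it once per matching event:

```python
import pandas as pd

# Made-up rows, for illustration only; the real files are much larger.
users = pd.DataFrame({"device_id": [1, 2], "gender": ["F", "M"], "age": [48, 27]})
events = pd.DataFrame({
    "event_id": [10, 11, 12],
    "device_id": [1, 1, 3],
    "longitude": [112.36, 112.40, 0.0],
    "latitude": [28.84, 28.80, 0.0],
})
brands = pd.DataFrame({"device_id": [1, 2], "phone_brand": ["vivo", "HTC"]})

# Left merges keep every row of `users`; devices without events get NaN
# in the event columns, and events for unknown devices are dropped.
merged = users.merge(events, how="left", on="device_id")
merged = merged.merge(brands, how="left", on="device_id")
print(merged)
```

Device 1 appears twice because it has two events, while device 2 appears once with missing event columns. This is why the merged DataFrame can be much longer than gender_age_train.csv.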

In [1]:
import pandas as pd
import json
from shapely.geometry import Point, shape
from flask import Flask
from flask import render_template

n_samples = 1500
#Set 'data_path' to the location of the required datasets taken from https://www.kaggle.com/c/talkingdata-mobile-user-demographics/data
data_path = "/home/jovyan/work/_Core/Projects/Datasets/TalkingData/"

In [2]:
#Next, we read and merge the different datasets into a single Pandas DataFrame that we call df.
gen_age_tr = pd.read_csv(data_path + 'gender_age_train.csv')
ev = pd.read_csv(data_path + 'events.csv')
ph_br_dev_model = pd.read_csv(data_path + 'phone_brand_device_model.csv')

df = gen_age_tr.merge(ev, how='left', on='device_id')
df = df.merge(ph_br_dev_model, how='left', on='device_id')

#Drop events with a longitude of 0, then limit the number of samples to n_samples
df = df[df['longitude'] != 0].sample(n=n_samples)
df.head()

Unnamed: 0,device_id,gender,age,group,event_id,timestamp,longitude,latitude,phone_brand,device_model
198920,-1212279021374798828,F,48,F43+,2241815.0,2016-05-03 16:42:47,112.36,28.84,vivo,X5Max+
153664,-3887952842073188037,F,43,F43+,2973645.0,2016-05-02 13:55:49,120.23,30.31,小米,MI 3
434917,4276371718574135903,M,34,M32-38,983999.0,2016-05-04 10:12:05,121.44,31.13,华为,荣耀畅玩4X
260860,-8340098378141155823,M,28,M27-28,2530198.0,2016-05-01 20:41:44,111.92,34.74,华为,荣耀畅玩4X
988734,-7400049074130525357,M,27,M27-28,851480.0,2016-05-03 16:07:43,118.96,25.12,vivo,X5Pro
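
One detail of the filter above is worth spelling out: in Pandas, a missing value compared with `!= 0` evaluates to True, so devices that have no event at all pass the longitude filter and are only removed later when missing values are dropped. The sketch below uses made-up coordinates to show this, and also shows the `random_state` argument of `sample`, which is an optional addition (not used in this tutorial) for making the draw repeatable:

```python
import pandas as pd

# Made-up coordinates, for illustration only.
df_demo = pd.DataFrame({
    "longitude": [112.36, 0.0, 120.23, None, 121.44],
    "latitude": [28.84, 0.0, 30.31, None, 31.13],
})

# NaN != 0 evaluates to True, so this filter alone keeps the empty row.
loose = df_demo[df_demo["longitude"] != 0]

# Adding notna() drops rows that have no event attached.
has_location = df_demo["longitude"].notna() & (df_demo["longitude"] != 0)
located = df_demo[has_location]

# random_state makes the draw repeatable between runs.
sample_a = located.sample(n=2, random_state=42)
sample_b = located.sample(n=2, random_state=42)
print(len(loose), len(located), sample_a.equals(sample_b))
```

A consequence is that the cleaned DataFrame built later in this section may end up with fewer than n_samples rows.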


In [3]:
#Add the English names of the top 10 phone brands to our DataFrame; all other brands are labelled 'Other'.
top_10_brands_en = {'华为':'Huawei', '小米':'Xiaomi', '三星':'Samsung', 'vivo':'vivo', 'OPPO':'OPPO', 
                    '魅族':'Meizu', '酷派':'Coolpad', '乐视':'LeEco', '联想':'Lenovo', 'HTC':'HTC'}

df['phone_brand_en'] = df['phone_brand'].apply(lambda phone_brand: top_10_brands_en.get(phone_brand, 'Other'))
df.head()

Unnamed: 0,device_id,gender,age,group,event_id,timestamp,longitude,latitude,phone_brand,device_model,phone_brand_en
198920,-1212279021374798828,F,48,F43+,2241815.0,2016-05-03 16:42:47,112.36,28.84,vivo,X5Max+,vivo
153664,-3887952842073188037,F,43,F43+,2973645.0,2016-05-02 13:55:49,120.23,30.31,小米,MI 3,Xiaomi
434917,4276371718574135903,M,34,M32-38,983999.0,2016-05-04 10:12:05,121.44,31.13,华为,荣耀畅玩4X,Huawei
260860,-8340098378141155823,M,28,M27-28,2530198.0,2016-05-01 20:41:44,111.92,34.74,华为,荣耀畅玩4X,Huawei
988734,-7400049074130525357,M,27,M27-28,851480.0,2016-05-03 16:07:43,118.96,25.12,vivo,X5Pro,vivo
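
The same translation can also be written without a lambda. This is a minimal sketch of an alternative, assuming a shortened dictionary and made-up brand values; it relies on `Series.map` returning NaN for values that are not dictionary keys:

```python
import pandas as pd

# A shortened, illustrative version of the brand dictionary.
brands_en = {"华为": "Huawei", "小米": "Xiaomi", "vivo": "vivo"}

# Made-up values; "unknown_brand" stands for any brand outside the dictionary.
phone_brand = pd.Series(["华为", "vivo", "unknown_brand", None])

# map() looks each value up in the dictionary and yields NaN on a miss,
# which fillna() then turns into the catch-all label.
phone_brand_en = phone_brand.map(brands_en).fillna("Other")
print(phone_brand_en.tolist())
```

Either form produces the same column; the `map` version just avoids calling a Python function for each row.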


In [4]:
#Add the age segment of each user to the DataFrame.
def get_age_segment(age):
    if age <= 22:
        return '22-'
    elif age <= 26:
        return '23-26'
    elif age <= 28:
        return '27-28'
    elif age <= 32:
        return '29-32'
    elif age <= 38:
        return '33-38'
    else:
        return '39+'

df['age_segment'] = df['age'].apply(get_age_segment)
df.head()

Unnamed: 0,device_id,gender,age,group,event_id,timestamp,longitude,latitude,phone_brand,device_model,phone_brand_en,age_segment
198920,-1212279021374798828,F,48,F43+,2241815.0,2016-05-03 16:42:47,112.36,28.84,vivo,X5Max+,vivo,39+
153664,-3887952842073188037,F,43,F43+,2973645.0,2016-05-02 13:55:49,120.23,30.31,小米,MI 3,Xiaomi,39+
434917,4276371718574135903,M,34,M32-38,983999.0,2016-05-04 10:12:05,121.44,31.13,华为,荣耀畅玩4X,Huawei,33-38
260860,-8340098378141155823,M,28,M27-28,2530198.0,2016-05-01 20:41:44,111.92,34.74,华为,荣耀畅玩4X,Huawei,27-28
988734,-7400049074130525357,M,27,M27-28,851480.0,2016-05-03 16:07:43,118.96,25.12,vivo,X5Pro,vivo,27-28
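
The chain of `elif` branches is a binning operation, so it could also be expressed with `pd.cut`. The following is a sketch of that alternative rather than the code used in this tutorial; the bin edges are copied from get_age_segment and the ages are made up:

```python
import numpy as np
import pandas as pd

# Upper bin edges mirror the thresholds in get_age_segment; with the default
# right=True each bin includes its right edge, matching the `<=` comparisons.
bins = [-np.inf, 22, 26, 28, 32, 38, np.inf]
labels = ["22-", "23-26", "27-28", "29-32", "33-38", "39+"]

ages = pd.Series([48, 43, 34, 28, 27, 22])  # made-up sample ages
segments = pd.cut(ages, bins=bins, labels=labels).astype(str)
print(segments.tolist())
```

Keeping the edges and labels in two lists makes it easier to change the segmentation later without touching any control flow.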


In the next step, we add to each record the Chinese province where the event was recorded. To do this, we need 2 elements:

    1) China provinces' borders. These are stored in a JSON file called china_provinces_en.json.  
    2) A function that takes the longitude and latitude of an event as input and returns the Chinese province where the event was recorded. The function is called get_location and uses a Python library called shapely.

In [5]:
def get_location(longitude, latitude, provinces_json):
    point = Point(longitude, latitude)
    for record in provinces_json['features']:
        polygon = shape(record['geometry'])
        if polygon.contains(point):
            return record['properties']['name']
    return 'other'

with open(data_path + 'china_provinces_en.json') as data_file:
    provinces_json = json.load(data_file)


In [6]:
df['location'] = df.apply(lambda row: get_location(row['longitude'], row['latitude'], provinces_json), axis=1)
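
To give an intuition for what a point-in-polygon test such as `polygon.contains(point)` decides, here is a pure-Python sketch of the classic ray-casting approach. It is only an illustration of the idea and makes no claim about how shapely implements it; the rectangular regions and the `locate` helper are hypothetical, and points that lie exactly on a border may be classified differently than by shapely.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: count how many polygon edges a horizontal ray
    starting at (x, y) crosses; an odd count means the point is inside."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangular "provinces" given as (longitude, latitude) rings.
regions = {
    "West": [(100, 20), (110, 20), (110, 40), (100, 40)],
    "East": [(110, 20), (120, 20), (120, 40), (110, 40)],
}


def locate(longitude, latitude):
    """Return the first region containing the point, or 'other'."""
    for name, ring in regions.items():
        if point_in_polygon(longitude, latitude, ring):
            return name
    return "other"


print(locate(105.0, 30.0), locate(115.0, 30.0), locate(130.0, 30.0))
```

As in get_location, the lookup walks the regions in order and falls back to 'other' when no region matches, which is what happens for events recorded outside the provinces in the JSON file.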

In [None]:
#Select the columns that we need for the data visualization and delete the records with missing values.
cols_to_keep = ['timestamp', 'longitude', 'latitude', 'phone_brand_en', 'gender', 'age_segment', 'location']
df_clean = df[cols_to_keep].dropna()
df_clean.head()

#To send the data to the browser, we need to convert the Pandas DataFrame to a JSON string.
#We can do that by calling the to_json() method on our Pandas DataFrame.
df_clean.to_json(orient='records')

'[{"timestamp":"2016-05-03 16:42:47","longitude":112.36,"latitude":28.84,"phone_brand_en":"vivo","gender":"F","age_segment":"39+","location":"Hunan"},{"timestamp":"2016-05-02 13:55:49","longitude":120.23,"latitude":30.31,"phone_brand_en":"Xiaomi","gender":"F","age_segment":"39+","location":"Zhejiang"},{"timestamp":"2016-05-04 10:12:05","longitude":121.44,"latitude":31.13,"phone_brand_en":"Huawei","gender":"M","age_segment":"33-38","location":"Shanghai"},{"timestamp":"2016-05-01 20:41:44","longitude":111.92,"latitude":34.74,"phone_brand_en":"Huawei","gender":"M","age_segment":"27-28","location":"Henan"},{"timestamp":"2016-05-03 16:07:43","longitude":118.96,"latitude":25.12,"phone_brand_en":"vivo","gender":"M","age_segment":"27-28","location":"Fujian"},{"timestamp":"2016-05-02 02:20:18","longitude":118.81,"latitude":32.11,"phone_brand_en":"Huawei","gender":"F","age_segment":"39+","location":"Jiangsu"},{"timestamp":"2016-05-05 07:10:38","longitude":121.59,"latitude":31.07,"phone_brand_en"
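
With `orient='records'`, each row becomes one JSON object keyed by column name, which is the shape the charting code expects. A small sketch with two made-up rows and a subset of the columns shows the round trip:

```python
import json
import pandas as pd

# Two made-up rows with a subset of the tutorial's columns.
df_demo = pd.DataFrame({
    "longitude": [112.36, 120.23],
    "latitude": [28.84, 30.31],
    "gender": ["F", "F"],
})

payload = df_demo.to_json(orient="records")
records = json.loads(payload)  # what the browser sees after parsing

print(type(payload).__name__, len(records), records[0]["gender"])
```

Note that `to_json` returns a string, so nothing needs to be serialized again before sending it to the browser.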

#### 4. Building the server

To build the server, we will use a Python library called Flask. The server's code is stored in app.py.  
Our server will have 2 routes:

    1) The first route is used for serving the html file (that we will build in the next section).  
    2) The second route serves the data that we prepared in the previous section in json format.

In [None]:
# -*- coding: utf-8 -*-
app = Flask(__name__)

@app.route("/")
def index():
    return render_template("index.html")

@app.route("/data")
def get_data():
    #df_clean is the DataFrame prepared in section 3. The preparation steps
    #could be repeated here to draw a fresh sample on every request; this
    #version serves the DataFrame that was built earlier.
    print('Data requested on localhost')
    return df_clean.to_json(orient='records')


if __name__ == "__main__":
    app.run(host='0.0.0.0',port=5000)
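
Before wiring up the front end, it can help to check the data route in isolation. The sketch below is an assumption-laden variant of the server above, not the tutorial's app.py: it uses a hard-coded stand-in payload named FAKE_PAYLOAD instead of df_clean, returns it with an explicit JSON MIME type, and exercises the route with Flask's built-in test client.

```python
from flask import Flask, Response

app = Flask(__name__)

# Stand-in for df_clean.to_json(orient='records'); the real payload comes
# from the DataFrame prepared in section 3.
FAKE_PAYLOAD = '[{"longitude": 112.36, "latitude": 28.84, "gender": "F"}]'


@app.route("/data")
def get_data():
    # Declaring the MIME type lets clients treat the body as JSON.
    return Response(FAKE_PAYLOAD, mimetype="application/json")


# The test client issues requests without starting a real server.
client = app.test_client()
response = client.get("/data")
print(response.status_code, response.get_json())
```

Returning the bare string, as the server above does, also works for the front-end code in the next section; setting the MIME type is simply a more explicit choice.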

#### 5. Front-end Side Preparation

Now that we have the data processing and server side code ready, we can start building the front-end code.  
    This will be hosted at: http://localhost:7778
    

We will be using a great responsive dashboard template from keen.io. keen.io templates provide the skeleton for analytics dashboards. With these pre-built templates, we only need to focus on building the charts without spending much effort in customizing the layout. For this tutorial, I created a new layout based on keen.io Javascript and css libraries.

For building the charts and the map, we will mainly be using 4 Javascript libraries: Crossfilter.js, D3.js, DC.js and Leaflet.js.
  
    1) Crossfilter.js is a Javascript library for grouping, filtering, and aggregating large datasets.  
    2) D3.js is a Javascript library for controlling the data and building charts.  
    3) DC.js is a Javascript charting library that leverages both crossfilter.js and d3.js, and makes the creation of highly interactive data visualization simple.  
    4) Leaflet.js is a Javascript library for interactive maps. Leaflet has many plugins that can be used to extend its functionality. We will use the heatmap plugin to show the distribution of users on the map.  

In addition to the Javascript libraries above, we will use queue.js, which is an asynchronous helper library for Javascript, and underscore.js, which is a Javascript library that contains useful functional programming helpers.
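
The way crossfilter.js ties the charts together can be sketched in a few lines of Python. This is only a conceptual model with made-up records and a hypothetical `group_counts` helper, not a port of the library: each chart shows counts for its own dimension, computed over the records that pass the filters set on all the other dimensions.

```python
from collections import Counter

# Made-up records with two of the tutorial's attributes.
records = [
    {"gender": "F", "phone_brand_en": "vivo"},
    {"gender": "F", "phone_brand_en": "Xiaomi"},
    {"gender": "M", "phone_brand_en": "Huawei"},
    {"gender": "M", "phone_brand_en": "vivo"},
]


def group_counts(records, dimension, filters):
    """Count records per value of `dimension`, honouring the filters set on
    every *other* dimension (a chart does not filter itself)."""
    counts = Counter()
    for record in records:
        if all(record[dim] in allowed
               for dim, allowed in filters.items() if dim != dimension):
            counts[record[dimension]] += 1
    return counts


# Clicking the "F" bar on the gender chart corresponds to this filter.
filters = {"gender": {"F"}}
print(group_counts(records, "phone_brand_en", filters))
print(group_counts(records, "gender", filters))
```

The brand counts shrink to the female users, while the gender chart itself still shows both bars, so the user can switch the selection. Crossfilter does this incrementally and far more efficiently than the loop above.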

Note that the only files that we need to create from scratch are:
  
    app.py: Server side code for rendering html pages and serving data to the browser
    graphs.js: Javascript file that will contain the code of our charts and map
    custom.css: css file that will contain our custom css code

We also need to make a few modifications to index.html. Inside index.html, we need to load graphs.js and provide the div elements, identified by id, that the charts and the map render into.

In [None]:
%%javascript
queue()
    .defer(d3.json, "/data")
    .await(makeGraphs);

function makeGraphs(error, recordsJson) {
	
	//Clean data
	var records = recordsJson;
	var dateFormat = d3.time.format("%Y-%m-%d %H:%M:%S");
	
	records.forEach(function(d) {
		d["timestamp"] = dateFormat.parse(d["timestamp"]);
		d["timestamp"].setMinutes(0);
		d["timestamp"].setSeconds(0);
		d["longitude"] = +d["longitude"];
		d["latitude"] = +d["latitude"];
	});

	//Create a Crossfilter instance
	var ndx = crossfilter(records);

	//Define Dimensions
	var dateDim = ndx.dimension(function(d) { return d["timestamp"]; });
	var genderDim = ndx.dimension(function(d) { return d["gender"]; });
	var ageSegmentDim = ndx.dimension(function(d) { return d["age_segment"]; });
	var phoneBrandDim = ndx.dimension(function(d) { return d["phone_brand_en"]; });
	var locationdDim = ndx.dimension(function(d) { return d["location"]; });
	var allDim = ndx.dimension(function(d) {return d;});


	//Group Data
	var numRecordsByDate = dateDim.group();
	var genderGroup = genderDim.group();
	var ageSegmentGroup = ageSegmentDim.group();
	var phoneBrandGroup = phoneBrandDim.group();
	var locationGroup = locationdDim.group();
	var all = ndx.groupAll();


	//Define values (to be used in charts)
	var minDate = dateDim.bottom(1)[0]["timestamp"];
	var maxDate = dateDim.top(1)[0]["timestamp"];


    //Charts
    var numberRecordsND = dc.numberDisplay("#number-records-nd");
	var timeChart = dc.barChart("#time-chart");
	var genderChart = dc.rowChart("#gender-row-chart");
	var ageSegmentChart = dc.rowChart("#age-segment-row-chart");
	var phoneBrandChart = dc.rowChart("#phone-brand-row-chart");
	var locationChart = dc.rowChart("#location-row-chart");



	numberRecordsND
		.formatNumber(d3.format("d"))
		.valueAccessor(function(d){return d; })
		.group(all);


	timeChart
		.width(650)
		.height(140)
		.margins({top: 10, right: 50, bottom: 20, left: 20})
		.dimension(dateDim)
		.group(numRecordsByDate)
		.transitionDuration(500)
		.x(d3.time.scale().domain([minDate, maxDate]))
		.elasticY(true)
		.yAxis().ticks(4);

	genderChart
        .width(300)
        .height(1200)
        .dimension(genderDim)
        .group(genderGroup)
        .ordering(function(d) { return -d.value })
        .colors(['#6baed6'])
        .elasticX(true)
        .xAxis().ticks(4);

	ageSegmentChart
		.width(300)
		.height(150)
        .dimension(ageSegmentDim)
        .group(ageSegmentGroup)
        .colors(['#6baed6'])
        .elasticX(true)
        .labelOffsetY(10)
        .xAxis().ticks(4);

	phoneBrandChart
		.width(300)
		.height(310)
        .dimension(phoneBrandDim)
        .group(phoneBrandGroup)
        .ordering(function(d) { return -d.value })
        .colors(['#6baed6'])
        .elasticX(true)
        .xAxis().ticks(4);

    locationChart
    	.width(200)
		.height(510)
        .dimension(locationdDim)
        .group(locationGroup)
        .ordering(function(d) { return -d.value })
        .colors(['#6baed6'])
        .elasticX(true)
        .labelOffsetY(10)
        .xAxis().ticks(4);

    var map = L.map('map');

	var drawMap = function(){

	    map.setView([31.75, 110], 4);
		var mapLink = '<a href="http://openstreetmap.org">OpenStreetMap</a>';
		L.tileLayer(
			'http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
				attribution: '&copy; ' + mapLink + ' Contributors',
				maxZoom: 15,
			}).addTo(map);

		//HeatMap
		var geoData = [];
		_.each(allDim.top(Infinity), function (d) {
			geoData.push([d["latitude"], d["longitude"], 1]);
	      });
		var heat = L.heatLayer(geoData,{
			radius: 10,
			blur: 20, 
			maxZoom: 1,
		}).addTo(map);

	};

	//Draw Map
	drawMap();

	//Update the heatmap whenever any dc chart gets filtered
	var dcCharts = [timeChart, genderChart, ageSegmentChart, phoneBrandChart, locationChart];

	_.each(dcCharts, function (dcChart) {
		dcChart.on("filtered", function (chart, filter) {
			map.eachLayer(function (layer) {
				map.removeLayer(layer)
			}); 
			drawMap();
		});
	});

	dc.renderAll();

};
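
Two data transformations in the script above are easy to miss: timestamps are truncated to the hour so that the time chart counts events per hour, and the heat layer receives plain [latitude, longitude, intensity] triples. The following Python sketch mirrors both steps on made-up records, purely to make the resulting data shapes explicit; the tutorial itself performs them in the browser.

```python
from collections import Counter
from datetime import datetime

# Made-up records in the same shape as the /data response.
records = [
    {"timestamp": "2016-05-03 16:42:47", "latitude": 28.84, "longitude": 112.36},
    {"timestamp": "2016-05-03 16:07:43", "latitude": 25.12, "longitude": 118.96},
    {"timestamp": "2016-05-02 13:55:49", "latitude": 30.31, "longitude": 120.23},
]

hourly = Counter()
heat_points = []
for record in records:
    parsed = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    # Same effect as setMinutes(0) / setSeconds(0) in the Javascript above.
    hourly[parsed.replace(minute=0, second=0)] += 1
    # [latitude, longitude, intensity], the layout passed to the heat layer.
    heat_points.append([record["latitude"], record["longitude"], 1])

print(hourly[datetime(2016, 5, 3, 16)], len(heat_points))
```

Note the order of the coordinates: the heat layer takes latitude first, whereas the shapely Point in get_location takes longitude first.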

In [None]:
%%html
<!DOCTYPE html>
<html>
<head>
  <title>Data visualisation</title>
  <link rel="stylesheet" href="./static/lib/css/bootstrap.min.css">
  <link rel="stylesheet" href="./static/lib/css/keen-dashboards.css">
  <link rel="stylesheet" href="./static/lib/css/dc.min.css">
  <link rel="stylesheet" href="./static/lib/css/leaflet.css">
  <link rel="stylesheet" href="./static/css/custom.css">
  


</head>
<body class="application">

  <div class="navbar navbar-inverse navbar-fixed-top" role="navigation">
    <div class="container-fluid">
      <div class="navbar-header">
        <a class="navbar-brand" href="./">Awwh shit it's josh's maddog viz</a>
      </div>
    </div>
  </div>

  <div class="container-fluid">

    <div class="row">

      <div class="col-sm-6">
        <div class="row">

          <!-- Time Chart --> 
          <div class="col-sm-12">
            <div class="chart-wrapper">
              <div class="chart-title">
                Number of Events
              </div>
              <div class="chart-stage">
                <div id="time-chart"></div>
              </div>
            </div>
          </div>
          <!-- Time Chart --> 

          <!-- Brand -->
          <div class="col-sm-6">
            <div class="chart-wrapper">
              <div class="chart-title">
                Brand
              </div>
              <div class="chart-stage">
                <div id="phone-brand-row-chart"></div>
              </div>
            </div>
          </div>
          <!-- Brand -->


          <div class="col-sm-6">

            <div class="row">
            <!-- Gender -->
            <div class="col-sm-12">
              <div class="chart-wrapper">
                <div class="chart-title">
                  Gender
                </div>
                <div class="chart-stage">
                  <div id="gender-row-chart"></div>
                </div>
              </div>
            </div>
            </div>
            <!-- Gender -->

            <!-- Age Segment -->
            <div class="row">
            <div class="col-sm-12">
              <div class="chart-wrapper">
                <div class="chart-title">
                  Age Segment
                </div>
                <div class="chart-stage">
                  <div id="age-segment-row-chart"></div>
                </div>
              </div>
            </div>
            <!-- Age Segment -->
          </div>
          </div>


        </div>
      </div>


      <div class="col-sm-2">
        <div class="row">

          <!-- Chinese Province -->  
          <div class="col-sm-12">
            <div class="chart-wrapper">
              <div class="chart-title">
                Chinese Province
              </div>
              <div class="chart-stage">
                <div id="location-row-chart"></div>
              </div>
            </div>
          </div>
          <!-- Chinese Province -->  
        </div>
      </div>



      <div class="col-sm-4">

        <div class="row">
          <!-- Map -->  
          <div class="col-sm-12"> 
            <div class="chart-wrapper">
              <div class="chart-title">
                Map
              </div>
              <div class="chart-stage">
                <div id="map" style="width: 400px; height: 380px"></div>
              </div>
            </div>
          </div>
          <!-- Map -->  
        </div>

        <div class="row">
          <!-- Number of events -->  
          <div class="col-sm-12"> 
            <div class="chart-wrapper">
              <div class="chart-title">
                Number of Events
              </div>
              <div class="chart-stage">
                <div id="number-records-nd"></div>
              </div>
            </div>
          </div>
          <!-- Number of events -->  
        </div>

      </div>

    </div>

  </div>

  <hr>
  <p class="small text-muted">Built with &#9829; by <a href="https://keen.io">Keen IO</a></p>

  <script src="./static/lib/js/jquery.min.js"></script>
  <script src="./static/lib/js/bootstrap.min.js"></script>
  <script src="./static/lib/js/underscore-min.js"></script>
  <script src="./static/lib/js/crossfilter.js"></script>
  <script src="./static/lib/js/d3.min.js"></script>
  <script src="./static/lib/js/dc.min.js"></script>
  <script src="./static/lib/js/queue.js"></script>
  <script src="./static/lib/js/leaflet.js"></script>
  <script src="./static/lib/js/leaflet-heat.js"></script>
  <script src="./static/lib/js/keen.min.js"></script>
  <script src='./static/js/graphs.js' type='text/javascript'></script>


</body>
</html>
