## Intro to Data Science Visualization with datashader
Visualization is a key part of any data science project. It gives us a global sense of our data and helps us understand our results better.
There are many free and commercial tools for data visualization. One of my favourites is [datashader](https://github.com/bokeh/datashader), an open source Python library that can visualize large amounts of data through a clean and simple API.

In [None]:
#Load all libraries
import os,sys  
import pandas as pd
import numpy as np
import xarray as xr
import datashader as ds
import datashader.transfer_functions as tf
from datashader import reductions
from datashader.colors import colormap_select, Hot, inferno
from datashader.bokeh_ext import InteractiveImage
from bokeh.palettes import Greens3, Blues3, Blues9
from bokeh.plotting import figure, output_notebook
from bokeh.tile_providers import WMTSTileSource, STAMEN_TONER, STAMEN_TERRAIN
from functools import partial
import wget
import zipfile
import math
from difflib import SequenceMatcher

output_notebook()
#print(sys.path)
print(sys.version)

Let's start with something simple: drawing two categories of points, each sampled from a Gaussian distribution. Each category consists of 100,000 points. The first one has a large standard deviation, so it is widely scattered; the second has a small one, so it is tightly clustered.

In [None]:
# Dataset generation
np.random.seed(1)
num=100000
dists = {cat: pd.DataFrame(dict(x=np.random.normal(x,std,num),
                                y=np.random.normal(y,std,num),
                                val=val,cat=cat))
         for x,y,std,val,cat in 
         [(3,3,5,10,"d1"), (2,-5,0.2,50,"d2")]}
df = pd.concat(dists,ignore_index=True)
df["cat"]=df["cat"].astype("category")
df.tail()

With datashader we can control the canvas size and how the data is displayed. Here we aggregate the points by counting how many fall into each pixel. You can also control the colors and the background; in this case we render the data in green on black, as if you were in The Matrix.

In [None]:
%%time 
canvas = ds.Canvas(plot_width=400, plot_height=400, x_range=(-10,10), y_range=(-10,10))
agg = canvas.points(df,'x','y',agg=reductions.count())
img = tf.shade(agg, cmap=Greens3, how='eq_hist')
background = "black"
img = tf.set_background(img, background)

In [None]:
img
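
To get an intuition for what the cell above computes, the aggregation step can be approximated with plain NumPy. This is only a sketch of the idea, not datashader's implementation: `count_points` and `equalize` are hypothetical helpers, and the rank-based rescaling is a simplified stand-in for `how='eq_hist'`.

```python
import numpy as np

def count_points(x, y, x_range, y_range, width, height):
    """Count how many points fall into each pixel of a height x width grid."""
    counts, _, _ = np.histogram2d(y, x, bins=(height, width),
                                  range=(y_range, x_range))
    return counts

def equalize(counts):
    """Rescale non-zero counts by rank to (0, 1]; empty pixels stay at 0."""
    out = np.zeros_like(counts, dtype=float)
    mask = counts > 0
    values = counts[mask]
    sorted_vals = np.sort(values)
    # Fraction of non-empty pixels whose count is <= the count of this pixel
    out[mask] = np.searchsorted(sorted_vals, values, side="right") / values.size
    return out

# Same two distributions as in the dataset generation cell
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(3, 5, 100_000), rng.normal(2, 0.2, 100_000)])
y = np.concatenate([rng.normal(3, 5, 100_000), rng.normal(-5, 0.2, 100_000)])

counts = count_points(x, y, (-10, 10), (-10, 10), 400, 400)
scaled = equalize(counts)
print(counts.shape, counts.max(), scaled.max())
```

The rescaling is what keeps the scattered category visible: without it, the few very dense pixels of the compact category would dominate the color range.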

## Visualization of census data in the US
We are going to look at data on about 300 million people living in the US. For that we will use the [Racial dot map](http://www.coopercenter.org/demographics/Racial-Dot-Map) dataset created by the Cooper Center, which captures the population density and ethnic makeup of the USA. The first step is to download the data, which is distributed as a zipped HDF5 file.

Disclaimer: Even though the data and its creators label each ethnicity as "race", there is no scientific evidence for that distinction. The [notion of human races is a myth](http://www.americanscientist.org/bookshelf/pub/race-finished), as science has shown; it is just another excuse to set us apart. In this [post](https://miguelgfierro.com/blog/2016/how-human-intelligence-works-and-why-that-makes-us-racists/) I argue that the origin of racism lies in how human intelligence is built and how stupid we are. Therefore, I will use the word ethnicity to refer to people from different communities.

In [None]:
def download_data(infile):
    if(os.path.isfile(infile)):
        print("File %s already downloaded" % infile)
    else:
        url = 'http://s3.amazonaws.com/bokeh_data/census.zip'
        wget.download(url)
        output_path = os.path.basename(url)
        with zipfile.ZipFile(output_path, 'r') as zipf:
            zipf.extractall()
        os.remove(output_path)
census_data = 'census.h5'  # relative path: the archive is extracted into the working directory
download_data(census_data)

The next step is to import the data using pandas. The data consists of more than 300 million data points, and each ethnicity is encoded as a single character: 'w' is white, 'b' is black, 'a' is Asian, 'h' is Hispanic, and 'o' is other (typically Native American).

The coordinates in the dataset use a format called [web mercator](https://en.wikipedia.org/wiki/Web_Mercator), which is a cylindrical projection of the world coordinates. The original Mercator projection was presented in 1569 by [Gerardus Mercator](https://en.wikipedia.org/wiki/Mercator_projection) and became the standard for nautical purposes. Web mercator is an adaptation of it and is currently used by most modern map systems, such as Google Maps, Bing Maps or OpenStreetMap.

In [None]:
%%time
df = pd.read_hdf('census.h5', 'census')
df = df.rename(columns = {'race':'ethnic_group'})
df.ethnic_group = df.ethnic_group.astype('category')
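
The conversion to `category` in the last line matters at this scale, because each value is then stored as a small integer code instead of a string. The following sketch illustrates the effect on a toy column; the 100,000 random codes are made up for the example, and the exact byte counts depend on the pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in for the census column: one single-character code per person
codes = pd.Series(rng.choice(list("wbaho"), size=100_000))

as_category = codes.astype("category")

bytes_plain = codes.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)
print("plain strings: %d bytes, category: %d bytes" % (bytes_plain, bytes_category))
```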

Now we are going to prepare the canvas and plot our first map. This map shows all the points in the dataset using datashader's Hot colormap.

In [None]:
%%time
USA = ((-13884029,  -7453304), (2698291, 6455972))
plot_width  = int(900)
plot_height = int(plot_width*7.0/12)
cvs = ds.Canvas(plot_width, plot_height, *USA)
agg = cvs.points(df, 'meterswest', 'metersnorth')
cm = partial(colormap_select, reverse=(background!="black"))
img = tf.shade(agg, cmap = cm(Hot,0.2), how='eq_hist')
img = tf.set_background(img, background)

In [None]:
img

## Visualization of Champions League matches
We can easily create a visualization of the Champions League matches from 1955 to 2016 using datashader. For that we need a dataset of the matches, such as [this one](https://github.com/jalapic/engsoccerdata/blob/master/data-raw/champs.csv), and the coordinates of the teams' stadiums, which you can find [here](http://opisthokonta.net/?cat=34). 

The first step is to prepare the data.

In [None]:
df_stadium = pd.read_csv("stadiums.csv", usecols=['Team','Stadium','Latitude','Longitude'])
print("Number of rows: %d" % df_stadium.shape[0])
dd1 = df_stadium.take([0,99, 64, 121])
dd1

The next step is to match the club names in the dataset of coordinates with those in the dataset of matches. They are similar but not always exactly the same. For example, in the dataset of coordinates we have `Real Madrid FC` and in the dataset of matches we have `Real Madrid`. Furthermore, the dataset of coordinates has several entries for some teams, like `Atletico Madrid`, `Atletico Madrid B` or `Atletico Madrid C`, which correspond to the first-division team and to its teams in other divisions. 

In [None]:
df_match = pd.read_csv('champions.csv', usecols=['Date','home','visitor'])
df_teams_champions = pd.concat([df_match['home'], df_match['visitor']])
teams_champions = set(df_teams_champions)
print("Number of teams that have participated in the Champions League: %d" % len(teams_champions))
print("Number of matches in the dataset: %d" % df_match.shape[0])
df_match.head()

To measure string similarity you can use different methods. Here we use a simple one based on `SequenceMatcher` from `difflib`.

In [None]:
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def get_info_similar_team(team, df_stadium, threshold=0.6, verbose=False):
    max_rank = 0
    max_idx = -1
    stadium = "Unknown"
    latitude = np.nan  # np.NaN was removed in NumPy 2.0
    longitude = np.nan
    for idx, val in enumerate(df_stadium['Team']):
        rank = similar(team, val)
        if rank > threshold:
            if(verbose): print("%s and %s(Idx=%d) are %f similar." % (team, val, idx, rank))
            if rank > max_rank:
                if(verbose): print("New maximum rank: %f" %rank)
                max_rank = rank
                max_idx = idx
                stadium = df_stadium['Stadium'].iloc[max_idx]
                latitude = df_stadium['Latitude'].iloc[max_idx]
                longitude = df_stadium['Longitude'].iloc[max_idx]
    return stadium, latitude, longitude
print(get_info_similar_team("Real Madrid FC", df_stadium, verbose=True))
print(get_info_similar_team("Atletico de Madrid FC", df_stadium, verbose=True))
print(get_info_similar_team("Inter Milan", df_stadium, verbose=True))
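
As an alternative to the explicit loop, the standard library offers `difflib.get_close_matches`, which is built on the same `SequenceMatcher` ratio and returns candidates sorted from most to least similar. A minimal sketch follows; the list of club names and the `best_match` helper are hypothetical and only stand in for `df_stadium['Team']`.

```python
from difflib import get_close_matches

# Hypothetical club names, standing in for df_stadium['Team']
stadium_teams = ["Real Madrid FC", "Atletico Madrid", "Atletico Madrid B", "FC Barcelona"]

def best_match(team, candidates, cutoff=0.6):
    """Return the most similar candidate, or None if nothing reaches the cutoff."""
    matches = get_close_matches(team, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(best_match("Real Madrid", stadium_teams))
print(best_match("Ajax", stadium_teams))
```

The `cutoff` argument plays the same role as `threshold` in `get_info_similar_team`: names below it are treated as unknown.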
 

The next step is to create a dataframe that relates each match to the stadium coordinates of both teams.

In [None]:
%%time
df_match_stadium = df_match
home_stadium_index = df_match_stadium['home'].map(lambda x: get_info_similar_team(x, df_stadium))
visitor_stadium_index = df_match_stadium['visitor'].map(lambda x: get_info_similar_team(x, df_stadium))
df_home = pd.DataFrame(home_stadium_index.tolist(), columns=['home_stadium', 'home_latitude', 'home_longitude'])
df_visitor = pd.DataFrame(visitor_stadium_index.tolist(), columns=['visitor_stadium', 'visitor_latitude', 'visitor_longitude'])
df_match_stadium = pd.concat([df_match_stadium, df_home, df_visitor], axis=1, ignore_index=False)

In [None]:
print("Number of missing values: %d out of %d" % (df_match_stadium['home_stadium'].value_counts()['Unknown'], df_match_stadium.shape[0]))
df1 = df_match_stadium['home_stadium'] == 'Unknown'
df2 = df_match_stadium['visitor_stadium'] == 'Unknown'
n_complete_matches = df_match_stadium.shape[0] - df_match_stadium[df1 | df2].shape[0]
print("Number of matches with complete data: %d" % n_complete_matches)
df_match_stadium.head()

Even though many entries in the dataset could not be matched to a stadium, we will build a dataframe from the matches where both teams have coordinates and move on. For each such match it stores the home coordinates, the visitor coordinates and an empty (NaN) row, which separates one trajectory from the next when drawing the map.

In [None]:
def aggregate_dataframe_coordinates(dataframe):
    df = pd.DataFrame(index=np.arange(0, n_complete_matches*3), columns=['Latitude','Longitude'])
    count = 0
    for ii in range(dataframe.shape[0]):
        if dataframe['home_stadium'].loc[ii]!= 'Unknown' and dataframe['visitor_stadium'].loc[ii]!= 'Unknown':
            df.loc[count] = [dataframe['home_latitude'].loc[ii], dataframe['home_longitude'].loc[ii]]
            df.loc[count+1] = [dataframe['visitor_latitude'].loc[ii], dataframe['visitor_longitude'].loc[ii]]
            df.loc[count+2] = [np.nan, np.nan]  # NaN row separates consecutive matches
            count += 3
    return df
df_agg = aggregate_dataframe_coordinates(df_match_stadium)
df_agg.head()
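
The row-by-row loop above can become slow for large tables. The same layout (home row, visitor row, NaN row) could be built in one step with NumPy, as in the sketch below. `segments_with_breaks` is a hypothetical helper and the toy matches and coordinates are made up; it assumes the column names produced in the previous cells.

```python
import numpy as np
import pandas as pd

def segments_with_breaks(matches):
    """Stack home/visitor coordinates into rows, with a NaN row after each pair."""
    complete = matches[(matches['home_stadium'] != 'Unknown') &
                       (matches['visitor_stadium'] != 'Unknown')]
    coords = np.full((len(complete), 3, 2), np.nan)
    coords[:, 0, :] = complete[['home_latitude', 'home_longitude']].to_numpy(dtype=float)
    coords[:, 1, :] = complete[['visitor_latitude', 'visitor_longitude']].to_numpy(dtype=float)
    # coords[:, 2, :] stays NaN and acts as the break between consecutive lines
    return pd.DataFrame(coords.reshape(-1, 2), columns=['Latitude', 'Longitude'])

# Made-up matches and coordinates, only to show the output layout
toy = pd.DataFrame({
    'home_stadium': ['A', 'Unknown', 'C'],
    'home_latitude': [40.0, np.nan, 51.5],
    'home_longitude': [-3.7, np.nan, -0.1],
    'visitor_stadium': ['B', 'D', 'E'],
    'visitor_latitude': [41.4, 48.9, 45.5],
    'visitor_longitude': [2.2, 2.3, 9.2],
})
segments = segments_with_breaks(toy)
print(segments)
```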

We have to transform the latitude and longitude coordinates to web mercator format in order to represent them on a map using bokeh. 

In [None]:
def to_web_mercator(yLat, xLon):
    # Reject coordinates outside the valid latitude/longitude range
    if (abs(xLon) > 180) or (abs(yLat) > 90):
        return
 
    semimajorAxis = 6378137.0  # WGS84 spheroid semimajor axis
    east = xLon * 0.017453292519943295
    north = yLat * 0.017453292519943295
 
    northing = 3189068.5 * math.log((1.0 + math.sin(north)) / (1.0 - math.sin(north)))
    easting = semimajorAxis * east
 
    return [easting, northing]
df_agg_mercator = df_agg.apply(lambda row: to_web_mercator(row['Latitude'], row['Longitude']), axis=1)
df_agg_mercator.head()
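
The same projection can be written for whole arrays at once. The sketch below uses the equivalent form `northing = R * ln(tan(pi/4 + lat/2))`, which matches the formula above because `tan(pi/4 + lat/2) = (1 + sin(lat)) / cos(lat)`. The function name `to_web_mercator_vectorized` and the sample coordinates are assumptions made for the example.

```python
import numpy as np

EARTH_RADIUS = 6378137.0  # WGS84 semimajor axis in meters

def to_web_mercator_vectorized(lat, lon):
    """Convert arrays of latitude/longitude in degrees to web mercator meters."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    easting = EARTH_RADIUS * np.radians(lon)
    northing = EARTH_RADIUS * np.log(np.tan(np.pi / 4.0 + np.radians(lat) / 2.0))
    return easting, northing

# Approximate coordinates of Madrid, a NaN separator row, and the origin
easting, northing = to_web_mercator_vectorized([40.45, np.nan, 0.0],
                                               [-3.69, np.nan, 0.0])
print(easting, northing)
```

NaN rows pass through unchanged, so the separators between matches are preserved.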

The next step is to plot the trajectories on the map using datashader.

In [None]:
plot_width  = 850
plot_height = 600
x_range = (-2.0e6, 2.5e6)
y_range = (4.1e6, 7.8e6)
def create_image(x_range=x_range, y_range=y_range, w=plot_width, h=plot_height):
    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
    agg = cvs.line(df_agg_mercator, 'Latitude', 'Longitude',  ds.count())
    img = tf.shade(agg, cmap=Blues9, how='eq_hist')
        
    return img

def base_plot(tools='pan,wheel_zoom,reset',plot_width=plot_width, plot_height=plot_height,**plot_args):
    p = figure(tools=tools, plot_width=plot_width, plot_height=plot_height,
        x_range=x_range, y_range=y_range, outline_line_color=None,
        min_border=0, min_border_left=0, min_border_right=0,
        min_border_top=0, min_border_bottom=0, **plot_args)
    
    p.axis.visible = False
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    
    return p

#img = create_image(dd1)
ArcGIS=WMTSTileSource(url='http://server.arcgisonline.com/ArcGIS/rest/services/World_Street_Map/MapServer/tile/{Z}/{Y}/{X}.png')
p = base_plot()
p.add_tile(ArcGIS)
InteractiveImage(p, create_image)



Now that we have the map we can start to improve it. If you are into football, you will notice several points in the north of Spain, which correspond to Sporting de Gijon. Sadly for Sporting supporters, the team has never reached the Champions League. Sporting Clube de Portugal, on the other hand, has participated several times, but since the current dataset doesn't include teams from Portugal, the system mistakenly thinks that `Sporting CP` from `champions.csv` is the Sporting de Gijon from `stadiums.csv`. So let's fix this issue by getting the stadium coordinates for the rest of the countries in Europe.  
We can get that info from Wikidata. 
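
The false match can be reproduced in isolation with the same `SequenceMatcher` ratio used earlier. The two names share the prefix `Sporting ` and nothing else, which is enough to pass the default threshold of 0.6. The stricter value of 0.8 below is only a hypothetical alternative for comparison.

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

score = similar("Sporting CP", "Sporting de Gijon")
print("Similarity: %.3f" % score)

# The score clears the 0.6 threshold, hence the false match;
# a stricter (hypothetical) threshold of 0.8 would have rejected it.
print(score > 0.6, score > 0.8)
```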

In [None]:
df_stadium_read = pd.read_csv('stadiums_wikidata.csv', usecols=['clubLabel','venueLabel','coordinates'])
df_stadium_read.head()

We have to clean the `coordinates` column. For that we will use a regex pattern. The pattern `[-+]?[0-9]*\.?[0-9]+` finds any signed float in a string. We use two copies of it separated by a space, each wrapped in a named group such as `(?P<Longitude>...)`, so that the extracted columns get those names. Finally we have to concatenate the club name with the coordinates.

In [None]:
df_temp = df_stadium_read['coordinates'].str.extract(r'(?P<Longitude>[-+]?[0-9]*\.?[0-9]+) (?P<Latitude>[-+]?[0-9]*\.?[0-9]+)', expand=True)
df_stadium_new = pd.concat([df_stadium_read['clubLabel'],df_stadium_read['venueLabel'], df_temp], axis=1) 
print("Number of rows: %d" % df_stadium_new.shape[0])
unique_teams_stadium = list(set(df_stadium_new['clubLabel']))  # clubLabel exists only in the new dataframe
print("Number of unique team names: %d" % len(unique_teams_stadium))
df_stadium_new.head()
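
To see what the pattern captures, it can be tried on a single string with the standard `re` module. The raw value below is an assumed example of how a coordinate might look in the file (longitude first, then latitude); the real column may be formatted differently.

```python
import re

# Same float pattern as above, with a named group for each coordinate
FLOAT = r'[-+]?[0-9]*\.?[0-9]+'
PATTERN = re.compile(r'(?P<Longitude>%s) (?P<Latitude>%s)' % (FLOAT, FLOAT))

# Assumed example of a raw value, not taken from the dataset
raw = "Point(-3.68835 40.45306)"
match = PATTERN.search(raw)
longitude = float(match.group('Longitude'))
latitude = float(match.group('Latitude'))
print(longitude, latitude)
```

Note that `str.extract` returns the captured groups as strings, so the `Longitude` and `Latitude` columns need to be converted with `astype(float)` before they are used as numbers.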

Attribution: Part of the code in this notebook has been taken from the example folder of [datashader](https://github.com/bokeh/datashader/tree/master/examples).