# Data Visualization in Python

Inspired by:
- [Coursera](https://www.coursera.org/) course on [Data Visualization with Python](https://www.coursera.org/learn/python-for-data-visualization).
- [Matplotlib Tutorial](https://matplotlib.org/tutorials/introductory/pyplot.html)
- [Seaborn Tutorial](https://seaborn.pydata.org/tutorial.html)
- [Folium Quickstart](https://python-visualization.github.io/folium/quickstart.html)

Data visualization is a way to present complex data in a graphical form that is easy to understand.

A picture is worth a thousand words: plots and graphs can be very effective in conveying a clear description of the data, especially when presenting findings to an audience.

In [Darkhorse Analytics](https://www.darkhorseanalytics.com/)' words: **less** is more **effective**, **attractive**, and **impactive**. Any feature or design you incorporate in your plot to make it more attractive or pleasing should support the message that the plot is meant to get across and not distract from it.

## Matplotlib

[Matplotlib](https://matplotlib.org/) is probably the most popular data visualization library in Python. It was initially created to replicate MATLAB plots in a Python environment and is now a de facto standard (for a bit of history, see [this chapter](http://aosabook.org/en/matplotlib.html)).

It is composed of three main layers:
- **Back-end** layer: provides the main building blocks of a figure, i.e. the canvas to draw on, the renderer that knows how to draw, and the event handling.
- **Artist** layer: the objects that make up a figure (titles, lines, tick labels and so on). Working with it directly is the appropriate paradigm when writing a web application server, a UI application, or a script to be shared with other developers.
- **Script** layer: a lighter scripting interface for everyday purposes that simplifies common tasks and makes generating graphics and plots quick and easy.
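
To make the layering a bit more concrete, here is a minimal sketch that skips pyplot entirely and wires a back-end canvas to Artist-layer objects by hand. It assumes the Agg (raster) back-end that ships with Matplotlib; the file name in the last comment is only an illustrative choice.

```python
import numpy as np
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure

# Artist layer: the Figure is the top-level container of all artists
fig = Figure(figsize=(6, 3))
# back-end layer: the canvas is the surface the figure is rendered on
canvas = FigureCanvasAgg(fig)

# Artist layer again: an Axes, and the Line2D created by plot()
ax = fig.add_subplot(111)
t = np.arange(10)
(line,) = ax.plot(t, t**2, marker='o')
ax.set_title('drawn without pyplot')

# ask the back-end to render the figure in memory
canvas.draw()
# fig.savefig('artist_layer.png')  # hypothetical file name, uncomment to write to disk
```

pyplot performs the same steps behind the scenes, which is why the script layer is usually enough for interactive work.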

Let's focus for now on the script layer, which essentially consists of the **pyplot** sub-package.

In [None]:
import matplotlib.pyplot as plt # import pyplot
import numpy as np              # without numpy we're not going anywhere

### First Line plot

Let's start with the most basic type of chart: the **line plot**. It displays information as a series of data points called *markers* connected by straight line segments.

**When to use it?**
When you have a quantity (y-axis) that depends on a numeric variable (x-axis) and you want to show a trend, that is, how the observed quantity depends on the independent variable. The most common independent variable is of course time.

To create a line plot we need to:
- create a figure with [`figure()`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.figure.html)
- plot the line in the figure by means of the method [`plot()`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.plot.html)


In [None]:
# define our independent variable time t
t = np.arange(10)  # build an array whose elements are 0, 1, ..., 9
x = t**2           # observed quantity

# create a figure using pyplot
plt.figure()
# plot x depending on t
plt.plot(t, x);

### Plot personalization

You can personalize the line plot, for example, by changing the color, the linestyle, the linewidth, the markers' color, shape, size and so on (have a look at the documentation for [`matplotlib.lines.Line2D`](https://matplotlib.org/3.1.3/api/_as_gen/matplotlib.lines.Line2D.html) objects).

You can also change figure properties such as size and background color, and add components such as axis labels, a title, a legend, a grid or just text.

In [None]:
# create figure with custom size
plt.figure(figsize=(10, 4))

# plotting the same line but changing some of the default properties
plt.plot(t, x, 
         color='r', linestyle='-.', linewidth=2, 
         marker='o', markersize=6, markerfacecolor='b')

# adding components to the axes
plt.xlabel('$t$ [s]')               # x-axis label, it supports Latex syntax!!
plt.ylabel('$x$')                   # y-axis label
plt.title('the title')              # title
plt.text(0, 60, 'some random text') # text needs the coordinates
plt.grid()                          # grid
plt.legend(['$x(t) = t^2$']);       # legend, it supports latex as well

### Multiple plots

Of course, you can add more plots in the same figure.

In [None]:
# define quantities to plot
t = np.arange(20)
x = t**2
y = 1 + 0.5*t**2
z = 0.5*(1 + np.sin(np.pi*t/4))*t**2


plt.figure(figsize=(10, 4))

# to help legend creation a label can be assigned to each plot
plt.plot(t, x, linestyle=':', linewidth=2, marker='s',
         label='$x(t) = t^2$')
# line properties support abbreviation (example: linestyle -> ls)
plt.plot(t, y, ls='--', lw=3, marker='o', ms=8, mfc='C0',
         label=r'$y(t) = 1+\frac{1}{2}t^2$')
# matplotlib supports MATLAB format string
plt.plot(t, z, '-.dg',
         label=r'$z(t) = \frac{1}{2}\left(1+\sin(\frac{\pi}{4} t)\right)t^2$')

plt.xlabel('$t$ [s]')
plt.ylabel('Amplitude')
plt.title('multiple plots')
plt.grid()
plt.legend();

The same result can be obtained by jumping down to the Artist layer of matplotlib.

The plots above are drawn on an Axes object ([`matplotlib.axes.Axes`](https://matplotlib.org/3.1.3/api/axes_api.html)), which in turn sits on a Figure object ([`matplotlib.figure.Figure`](https://matplotlib.org/3.1.3/api/_as_gen/matplotlib.figure.Figure.html)). Normally, you don't have to worry about this, because it is all taken care of behind the scenes. Yet, sometimes, getting full control over the objects can be useful.

In [None]:
# define quantities to plot
t = np.arange(20)
x = t**2
y = 1 + 0.5*t**2
z = 0.5*(1 + np.sin(np.pi*t/4))*t**2

# still using pyplot but to get Figure and Axes objects
fig, ax = plt.subplots(figsize=(10, 4))

# directly plot from the Axes object
ax.plot(t, x, lw=2, marker='o', label='$x(t)$')
ax.plot(t, y, lw=2, marker='s', label='$y(t)$')
ax.plot(t, z, lw=2, marker='d', label='$z(t)$')

# directly modify Axes properties
ax.set(
    title='multiple plots', 
    xlabel='$t$ [s]', 
    ylabel='Amplitude', 
    xticks=t[::2]
)
ax.grid()
ax.legend();

### Multiple figures and axes

pyplot, similarly to MATLAB, has the concept of the current figure and the current axes. All plotting commands apply to the current axes. 

The function [`gca()`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.gca.html) returns the current Axes, and [`gcf()`](https://matplotlib.org/3.1.3/api/_as_gen/matplotlib.pyplot.gcf.html) returns the current Figure. 
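
A small sketch of how the "current" figure and axes change as figures are created and re-activated. The variable names are illustrative only.

```python
import matplotlib.pyplot as plt

fig1 = plt.figure()
ax1 = plt.gca()           # no Axes exists yet, so pyplot creates one in fig1
fig2 = plt.figure()       # a newly created figure becomes the current one

current_is_fig2 = plt.gcf() is fig2

plt.figure(fig1.number)   # re-activate the first figure by its number
current_is_fig1 = plt.gcf() is fig1
current_ax_is_ax1 = plt.gca() is ax1

plt.close('all')          # tidy up the figures created by this sketch
```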

In the same Figure, more than one Axes can be created by means of the [`subplot()`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplot.html) function. This command specifies the number of rows (`numrows`) and columns (`numcols`) into which the figure is divided, and the plot number, which ranges from 1 to `numrows*numcols`. The commas in the subplot command are optional if `numrows*numcols<10`, so that `subplot(211)` is identical to `subplot(2, 1, 1)`.

In [None]:
t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

fx = lambda t: np.exp(-t)
fy = lambda t: np.cos(2*np.pi*t)

# 1st figure
plt.figure(figsize=(10,6))

# 1st plot in the 1st figure
plt.subplot(211) # 2 rows, 1 col, 1st plot
plt.plot(t1, fx(t1), 'bd', t2, fx(t2), 'b:')
plt.plot(t1, fy(t1), 'r^', t2, fy(t2), 'r--')
plt.title('$f_x(t), f_y(t)$')
plt.grid()

# 2nd plot in the 1st figure
plt.subplot(212) # 2 rows, 1 col, 2nd plot
plt.plot(t2, fx(t2)*fy(t2), 'g')
plt.title('$f_x(t) f_y(t)$')
plt.grid()


# 2nd figure
plt.figure(figsize=(10,6))

# 1st plot in the 2nd figure
plt.subplot(221) # 2 rows, 2 cols, 1st plot
plt.plot(t1, fx(t1), 'bd', t2, fx(t2), 'b:')
plt.title('$f_x(t)$')
plt.grid()

# 2nd plot in the 2nd figure
plt.subplot(223) # 2 rows, 2 cols, 3rd plot
plt.plot(t1, fy(t1), 'r^', t2, fy(t2), 'r--')
plt.title('$f_y(t)$')
plt.grid()

# 3rd plot in the 2nd figure
plt.subplot(122) # 1 row, 2 cols, 2nd plot
plt.plot(t2, fy(t2)/fx(t2), 'g')
plt.title('$f_y(t) / f_x(t)$')
plt.grid()

You can create an arbitrary number of subplots and axes. If you want to place an axes manually, i.e., not on a rectangular grid, use the [`axes()`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.axes.html#matplotlib.pyplot.axes) command, which allows you to specify the location as `axes([left, bottom, width, height])`, where all values are in fractional (0 to 1) coordinates.

The same can be done using Axes objects.

In [None]:
# Figure with 2 subplots
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10,6))

ax1.plot(t1, fx(t1), 'bd', t2, fx(t2), 'b:')
ax1.plot(t1, fy(t1), 'r^', t2, fy(t2), 'r--')
ax1.set(title='$f_x(t), f_y(t)$')
ax1.grid()

ax2.plot(t2, fx(t2)*fy(t2), 'g')
ax2.set(title='$f_x(t) f_y(t)$')
ax2.grid()
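
Going back to the manual placement with `axes()` mentioned earlier, here is a minimal sketch of an inset plot. The positions and the plotted signal are arbitrary choices made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(0.0, 5.0, 0.02)
signal = np.exp(-t) * np.cos(2*np.pi*t)

plt.figure(figsize=(10, 4))

# main axes: [left, bottom, width, height] as fractions of the figure
main_ax = plt.axes([0.1, 0.15, 0.85, 0.75])
main_ax.plot(t, signal, 'g')
main_ax.set(title='damped oscillation', xlabel='$t$ [s]')

# inset axes placed by hand in the upper-right corner
inset_ax = plt.axes([0.6, 0.55, 0.3, 0.3])
inset_ax.plot(t[:50], signal[:50], 'g')
inset_ax.set(title='zoom on the first second')
```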

### Bar plot

Bar plots are used to present categorical data with rectangular bars with heights (vertical bar plot) or lengths (horizontal bar plot) proportional to the values that they represent. One axis of the chart shows the specific categories being compared, and the other axis represents a measured value.

The functions to draw bar plots are [`bar()`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.bar.html) for the vertical bar plot and [`barh()`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.barh.html) for the horizontal version.

In [None]:
names = ['apple', 'pear', 'banana', 'kiwi']
values = [10, 12, 8, 4]

plt.figure(figsize=(10, 4))

plt.subplot(121)
plt.bar(names, values)
plt.xlabel('category')
plt.ylabel('value')

plt.subplot(122)
plt.barh(names, values)
plt.xlabel('value')
plt.ylabel('category')

A useful extension of the bar plot is the **stacked bar plot** that can add a second category in the same plot.

In [None]:
names = ['alpha', 'bravo', 'charlie', 'delta']
apple = np.array([10, 8, 4, 8])
pear = np.array([12, 6, 16, 2])
banana = np.array([8, 10, 16, 4])
kiwi = np.array([4, 2, 8, 12])


plt.figure(figsize=(10, 4))

# each bar starts where the sum of the previous ones ends
plt.barh(names, apple, label='apple')
plt.barh(names, pear, left=apple, label='pear')
plt.barh(names, banana, left=apple + pear, label='banana')
plt.barh(names, kiwi, left=apple + pear + banana, label='kiwi')
plt.xlabel('fruit')
plt.ylabel('people')
plt.legend();
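
When stacking more than two series, each bar has to start where the sum of all the previous ones ends. A minimal sketch that keeps a running offset (the dictionary layout and the variable names are illustrative choices):

```python
import matplotlib.pyplot as plt
import numpy as np

names = ['alpha', 'bravo', 'charlie', 'delta']
fruits = {
    'apple': [10, 8, 4, 8],
    'pear': [12, 6, 16, 2],
    'banana': [8, 10, 16, 4],
    'kiwi': [4, 2, 8, 12],
}

fig, ax = plt.subplots(figsize=(10, 4))
offset = np.zeros(len(names))       # where the next segment starts
for fruit, counts in fruits.items():
    ax.barh(names, counts, left=offset, label=fruit)
    offset = offset + np.asarray(counts)
ax.set(xlabel='fruit', ylabel='people')
ax.legend()
```

After the loop, `offset` holds the total length of each stacked bar.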

### Scatter plot

A **scatter plot** uses dots to represent data points whose coordinates refer to two different numeric variables. Often a third numerical variable can be displayed by varying the size of each dot, and a fourth categorical variable can affect the dot color.

**When to use it?** When we want to observe relationships between variables. While line plots are useful when the relationship between variables is deterministic and can be described as a function, scatter plots are often used when the data points represent samples of a population.

In [None]:
from numpy import random

# create first population
n1 = 100
x1 = random.rand(n1)
y1 = (1+np.cos(2*np.pi*x1))/2 + 0.1*random.randn(n1)
s1 = (50*np.abs(np.cos(2*np.pi*x1))).astype(int)

# create second population
n2 = 50
x2 = random.rand(n2)
y2 = 1-x2 + 0.1*random.randn(n2)
s2 = (50*np.abs(np.sin(2*np.pi*x2))).astype(int)

# create third population
n3 = 75
x3 = random.rand(n3)
y3 = x3**2 + 0.1*random.randn(n3)
s3 = (50*(1-x3)).astype(int)

plt.figure(figsize=(10,5))
plt.scatter(x1, y1, s=s1, label='first')
plt.scatter(x2, y2, s=s2, label='second')
plt.scatter(x3, y3, s=s3, label='third')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid()

### Histogram

A **histogram** is a way of representing the frequency distribution of a numeric dataset. The range of the numeric data is partitioned into **bins**, and then each datapoint in the dataset is assigned to a bin. The bins are on the x-axis, while the y-axis shows the frequency, i.e. the number of datapoints in each bin.

The method to display a histogram is [`hist`](https://matplotlib.org/3.1.3/api/_as_gen/matplotlib.pyplot.hist.html).

In [None]:
x = random.randn(10000)

plt.figure(figsize=(10, 4))

plt.subplot(121)
# just specify the number of bins
plt.hist(x, bins=50, label='normal')
plt.title('histogram')
plt.xlabel('x')
plt.grid()

plt.subplot(122)
# specify the values of the bin edges and display pdf
bin_edges = np.linspace(min(x), max(x), 40)
plt.hist(x, bins=bin_edges, label='normal', density=True)
plt.title('probability density function (pdf)')
plt.xlabel('x')
plt.grid()
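
To look at the binning itself, independently of the drawing, `numpy.histogram` returns the counts and the bin edges. The sketch below uses synthetic data with a fixed seed (an assumption made only for reproducibility) and checks what `density=True` does.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, only for reproducibility
x = rng.standard_normal(10_000)

# raw counts: 20 bins are delimited by 21 edges
counts, edges = np.histogram(x, bins=20)

# density=True rescales the bars so that their areas sum to 1
density, _ = np.histogram(x, bins=edges, density=True)
widths = np.diff(edges)
total_area = (density * widths).sum()
```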

## DataSet

Let's consider a real dataset to visualize more meaningful information. We consider the [Ergast](http://ergast.com/mrd/) database that is also available as [Formula 1 Race Data](https://www.kaggle.com/cjgdev/formula-1-race-data-19502017) on [Kaggle](https://www.kaggle.com/) which is an online community of data scientists and machine learning practitioners.

This dataset contains data from 1950 all the way through the current season, and consists of tables describing constructors, race drivers, lap times, pit stops and more.

Here is the list of tables composing the dataset:
- **circuits**: list of every formula 1 circuit, including name, location and geographic data;
- **constructor_results**: details of the results for every race, including race, constructor, and awarded points;
- **constructor_standings**: the constructor standings after every race;
- **constructors**: list of every constructor team including name and nationality;
- **driver_standings**: the driver standings after every race;
- **drivers**: list of every Formula 1 driver, including full name, dob, and nationality;
- **lap_times**: details the lap times for every race, including driver, lap number, position and time;
- **pit_stops**: details of every pitstop in Formula 1, including the time of the pit stop, the duration, the race and driver;
- **qualifying**: the results of every qualifying session, including the race, driver, constructor, position, and times for Q1, Q2 and Q3;
- **races**: details of every race, including year, date, time, circuit and round;
- **results**: details of the results for every race;
- **seasons**: list of every season and corresponding Wikipedia link;
- **status**: table of status codes and their status.

In [None]:
import pandas as pd
import os

In [None]:
# local folder where data is stored
data_folder = os.path.join('data', 'f1db_csv')

In [None]:
# loading data related to races
races = pd.read_csv(
    os.path.join(data_folder, 'races.csv'), 
    usecols=['raceId', 'year', 'circuitId', 'name', 'date', 'time'],
    index_col='raceId',
    parse_dates=['date'],
)

races.info()
races.head()

In [None]:
# loading data related to circuits
circuits = pd.read_csv(
    os.path.join(data_folder, 'circuits.csv'), 
    encoding='latin1',
    usecols=['circuitId', 'circuitRef', 'name', 'location', 'country', 'lat', 'lng'],
    index_col='circuitId',
)

circuits.info()
circuits.head()

In [None]:
# loading data related to drivers
drivers = pd.read_csv(
    os.path.join(data_folder, 'drivers.csv'), 
    encoding='latin1', 
    usecols=['driverId', 'code', 'forename', 'surname', 'dob', 'nationality'],
    index_col='driverId',
    parse_dates=['dob'],
)

drivers.info()
drivers.head()

In [None]:
dstand = pd.read_csv(
    os.path.join(data_folder, 'driver_standings.csv'), 
    index_col='driverStandingsId'
)

dstand.info()
dstand.head()

### Drivers Standing

The `driver_standings` table contains the driver standing after every race since 1950.

In [None]:
# consider only one season
year = 2021

# manipulate the data to get a dataframe with races as index, drivers as
# columns and points in each cell.
ydstand = dstand[dstand['raceId'].replace(races['year']) == year]\
            .set_index(['raceId'])\
            .groupby('driverId')\
            .apply(lambda x: x['points'], include_groups=False)\
            .unstack(level='driverId', fill_value=0.)\
            .rename(index=races['circuitId'].replace(circuits['country']), 
                    columns=drivers['code'])\
            .rename_axis(index='race', columns='driver')\
            .astype(int)
ydstand = ydstand.sort_values(ydstand.columns.tolist())
ydstand.head(10)

Most of the plotting functions available in matplotlib can also be called directly on a DataFrame through its `plot` accessor, e.g. `plot()` with `kind='line'`, `'bar'`, `'barh'`, `'scatter'`, `'hist'` and so on. The input parameters are the same, with subtle changes.

In [None]:
# select the final driver standing by choosing the last row
final_ydstand = ydstand.iloc[-1].sort_values()

plt.figure(figsize=(12,6))
plt.barh(final_ydstand.index, final_ydstand.values)
plt.title('F1 {} - driver final standing'.format(year))
plt.xlabel('points')
plt.ylabel('driver')
plt.grid()
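
A comparable horizontal bar chart can also be drawn through the pandas plotting accessor. This is a minimal sketch on a made-up standing: the driver codes and the points are hypothetical and not taken from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical final standing, only for illustration
standing = pd.Series({'AAA': 120, 'BBB': 95, 'CCC': 60}, name='points')

# Series.plot returns the matplotlib Axes it drew on
ax = standing.sort_values().plot(
    kind='barh',
    figsize=(8, 3),
    title='final standing (toy data)',
    grid=True,
)
ax.set(xlabel='points', ylabel='driver')
```

Since the call returns an ordinary Axes object, everything shown so far for matplotlib still applies to it.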

Why not use the whole dataframe to display how drivers gained those points? Here the `stacked` parameter comes into play.

In [None]:
# create a dataframe that contains the points gained in each race
# instead of the current standing. To do that the `diff()` method can help
# for computing the difference between a row and the previous one
race_points = pd.concat([ydstand.iloc[[0]], ydstand.diff()[1:]])

# plot a bar graph directly from the dataframe with the `plot()` method
ax = race_points.T.reindex(final_ydstand.index).plot(
    kind='barh', stacked=True, 
    figsize=(12,6),
    title='F1 {} - drivers final standing'.format(year),
)
ax.set(xlabel='points', ylabel='driver',)
ax.legend(ncol=3, loc='lower right') # manually arrange the legend
ax.grid()
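
The step from cumulative standings to per-race points can be checked on a tiny example. The table below is hypothetical and only mirrors the layout used above (races as rows, drivers as columns).

```python
import pandas as pd

# hypothetical cumulative standing after three races
standing = pd.DataFrame(
    {'AAA': [25, 43, 68], 'BBB': [18, 43, 61]},
    index=pd.Index(['race 1', 'race 2', 'race 3'], name='race'),
)

# diff() leaves NaN in the first row, so that row is taken from the original
race_points = pd.concat([standing.iloc[[0]], standing.diff().iloc[1:]])

# sanity check: accumulating the per-race points gives back the standing
rebuilt = race_points.cumsum()
```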

We can also display how the standing of the top drivers evolved during the season with a line plot.

In [None]:
ndrivers = 6 # number of drivers to display

# select the top drivers from the final standing
top_drivers = list(final_ydstand.sort_values(ascending=False)[:ndrivers].index)

# plot the top_driver standing
ax = ydstand[top_drivers].plot(
    kind='line', 
    figsize=(12, 5), 
    marker='s',                   
    grid=True, 
    rot=90
)

# need to manually insert the name of the races in the x-axis
ax.set(
    title='F1 {} - drivers standing'.format(year), 
    ylabel='points',
    xticks=range(len(ydstand)), 
    xticklabels=ydstand.index
);

### Lap times

Let's now consider the table containing all the lap times for each race and driver.

In [None]:
# load the data
laptimes = pd.read_csv(
    os.path.join(data_folder, 'lap_times.csv'),
)

laptimes.info()
laptimes.head()

In [None]:
# pandas can handle Timestamp and Timedelta types, so let's use them
laptimes['time'] = pd.to_timedelta(laptimes['milliseconds'], unit='ms')
# add a column with the driver's code
laptimes['driver'] = laptimes['driverId'].replace(drivers['code'])

laptimes.info()
laptimes.head()
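
A quick sketch of what the `Timedelta` conversion gives you, using two made-up lap times in milliseconds:

```python
import pandas as pd

ms = pd.Series([100_000, 83_456])            # hypothetical lap times in ms
lap_time = pd.to_timedelta(ms, unit='ms')

seconds = lap_time.dt.total_seconds()        # back to plain floats
gap = lap_time.iloc[0] - lap_time.iloc[1]    # Timedelta arithmetic
```

Differences between laps remain `Timedelta` objects, which is convenient when comparing drivers.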

Let's select one race.

In [None]:
# select race 
race_year, race_name = 2021, 'Qatar Grand Prix'

# select data referring to the race of interest
raceId = races[(races['year'] == race_year) 
               & (races['name'] == race_name)].index[0]
laps = laptimes[(laptimes['raceId'] == raceId)]

In [None]:
laps.sort_values('milliseconds').head()

Now we can plot the distribution of the lap times throughout the race for every driver.

In [None]:
def strfy_ms(ms):
    ''' convert milliseconds to a string, e.g., 100_000 -> "1:40.000" '''
    mins, ms = divmod(ms, 60*1000)
    secs, ms = divmod(ms, 1000)
    return '{}:{:02d}.{:03d}'.format(int(mins), int(secs), int(ms))

fig, ax = plt.subplots(figsize=(12,5))
ax.hist(laps['milliseconds'], bins=50)
ax.set(title='{} {} - Lap Times distribution'.format(race_year, race_name), 
       xlabel='lap time', ylabel='frequency',
       xticks=ax.get_xticks(), # just a work-around to avoid Warnings
       xticklabels=[strfy_ms(ms) for ms in ax.get_xticks()],)
ax.grid()
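
An alternative to setting the ticks and tick labels by hand is to attach a formatter to the axis, so that labels are recomputed whenever the ticks change. This is a sketch that uses `matplotlib.ticker.FuncFormatter` on synthetic lap times; the data and the seed are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import FuncFormatter

def strfy_ms(ms):
    ''' convert milliseconds to a string, e.g., 100_000 -> "1:40.000" '''
    mins, ms = divmod(ms, 60*1000)
    secs, ms = divmod(ms, 1000)
    return '{}:{:02d}.{:03d}'.format(int(mins), int(secs), int(ms))

# hypothetical lap times around 1:25, in milliseconds
rng = np.random.default_rng(0)
lap_ms = rng.normal(85_000, 1_500, size=300)

fig, ax = plt.subplots(figsize=(10, 4))
ax.hist(lap_ms, bins=30)
# the formatter is called with (tick value, tick position) for every tick
formatter = FuncFormatter(lambda ms, pos: strfy_ms(ms))
ax.xaxis.set_major_formatter(formatter)
ax.set(xlabel='lap time', ylabel='frequency')
```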

Now we can visualize how the lap times of the top drivers evolve over the race and see if we can infer some valuable information.

In [None]:
ndrivers = 6 # number of top drivers to analyze

# select top drivers in the specific race
top_drivers = laps[(laps['lap'] == laps['lap'].max()) 
                   & (laps['position'] <= ndrivers)
                  ].sort_values('position')['driver']
laps = laps[laps['driver'].isin(top_drivers)]


# plot a scatter plot for each top driver
fig, ax = plt.subplots(figsize=(12,5))
for driver, group in laps.groupby('driver'):
    ax.scatter(x=group['lap'], y=group['milliseconds'], 
               label=driver, alpha=0.8)

# make the plot nicer
ax.set(title='{} {} - Lap Times'.format(race_year, race_name), 
       xlabel='lap', ylabel='lap time',
       ylim=(laps['milliseconds'].min()*0.99, 
             laps['milliseconds'].quantile(0.95)),)
ax.set(yticks=ax.get_yticks(), # just a work-around to avoid Warnings
       yticklabels=[strfy_ms(ms) for ms in ax.get_yticks()],)
ax.legend(ncol=3)
ax.grid()


## Seaborn

[**Seaborn**](https://seaborn.pydata.org/) is another data visualization library, built on top of Matplotlib. It was designed primarily to provide a high-level interface for drawing attractive statistical graphics, such as regression plots, box plots, and so on. Because it takes care of many details for you, a plot that needs many lines of Matplotlib code can often be produced with a single Seaborn call.

In [None]:
import seaborn as sns
sns.set() # set parameters for prettier plots

### regplot

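[regplot](https://seaborn.pydata.org/generated/seaborn.regplot.html) plots data as a scatter plot and adds a line representing a **regression model** fitted to the data. By default the model is linear; with `order` greater than 1 a polynomial of that order is fitted instead, as in the cell below where `order=3`.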

In [None]:
# create figure with matplotlib.pyplot
plt.figure(figsize=(12,5))

# plot data with seaborn
ax = sns.regplot(data=laps, x='lap', y='milliseconds', marker='x', order=3)
ax.set(title='{} {} - Lap Times'.format(race_year, race_name), 
      xlabel='lap', ylabel='lap time',
      ylim=(laps['milliseconds'].min()*0.99, 
            laps['milliseconds'].quantile(0.95)),)
ax.set(yticks=ax.get_yticks(), # just a work-around to avoid Warnings
       yticklabels=[strfy_ms(ms) for ms in ax.get_yticks()]);
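
What a polynomial fit of a given order amounts to can be sketched with `numpy.polyfit` alone. The data below is a synthetic noisy cubic (an assumption for illustration, unrelated to the lap times).

```python
import numpy as np

# hypothetical noisy cubic trend, only for illustration
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = 2*x**3 - x + 0.5 + 0.05*rng.standard_normal(x.size)

# least-squares fit of a 3rd-order polynomial (highest power first)
coeffs = np.polyfit(x, y, deg=3)
y_fit = np.polyval(coeffs, x)

# spread of the data around the fitted curve
residual_std = np.std(y - y_fit)
```

The fitted coefficients should land close to the ones used to generate the data, and the residual spread close to the noise level.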

### histplot

[**histplot**](https://seaborn.pydata.org/generated/seaborn.histplot.html) flexibly plots the univariate distribution of a set of observations. It replaces the older `distplot` function, which is deprecated in recent seaborn releases.

It draws a histogram (with automatic calculation of a good default bin size) and, with `kde=True`, overlays a univariate kernel density estimate.

In [None]:
ax = sns.histplot(laps['milliseconds'], kde=True)
ax.set(title=f'{race_year} {race_name} - Lap Times distribution', 
       xlabel='lap time', ylabel='count',
       xticks=ax.get_xticks(), # just a work-around to avoid Warnings
       xticklabels=[strfy_ms(ms)[:-4] for ms in ax.get_xticks()],);
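
To give an idea of what a kernel density estimate is, here is a hand-rolled Gaussian version in plain numpy. It is only a sketch: the samples are synthetic and the bandwidth comes from a common rule of thumb, which is not necessarily the choice seaborn makes.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=85.0, scale=1.5, size=500)   # hypothetical lap times [s]

# rule-of-thumb bandwidth for a Gaussian kernel (an assumption of this sketch)
h = 1.06 * samples.std(ddof=1) * samples.size ** (-1 / 5)

# evaluate the estimate on a regular grid: one Gaussian bump per sample
grid = np.linspace(samples.min() - 4*h, samples.max() + 4*h, 400)
z = (grid[:, None] - samples[None, :]) / h
density = np.exp(-0.5 * z**2).sum(axis=1) / (samples.size * h * np.sqrt(2*np.pi))

# a density should integrate to (approximately) one
area = (density * (grid[1] - grid[0])).sum()
```

A larger `h` gives a smoother curve, a smaller one follows the individual samples more closely.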

### boxplot

A boxplot is a standardized way of displaying the distribution of data based on a five number summary: *minimum*, *first quartile* (Q1 or 25th percentile), *median* (Q2 or 50th percentile), *third quartile* (Q3 or 75th percentile), and *maximum*. 

<img src="figures/boxplot.png" width=400px>

It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.

This kind of plot is also available in matplotlib with [`boxplot()`](https://matplotlib.org/3.1.3/api/_as_gen/matplotlib.pyplot.boxplot.html), but seaborn's [`boxplot()`](https://seaborn.pydata.org/generated/seaborn.boxplot.html) is much easier to use.

In [None]:
# add the age column to the drivers dataframe
drivers['age'] = (pd.Timestamp.today() - drivers['dob']).dt.days//365

# list of nationality sorted by number of drivers in descending order
sorted_natl =  drivers.groupby('nationality')['nationality'].count()\
                      .sort_values(ascending=False).index.to_list()

# select only the countries that have given birth to most drivers
nnatl = 10 # number of nationalities to consider
df = drivers[drivers['nationality'].isin(sorted_natl[:nnatl])]
df

In [None]:
plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x='nationality', y='age')
plt.xticks(rotation=45);
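
The numbers behind a box can be computed by hand. The sketch below uses toy data with one extreme value and the usual 1.5 IQR rule, which is the default whisker length in both matplotlib and seaborn.

```python
import numpy as np

ages = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])   # toy data with one extreme value

q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1                                      # interquartile range

# points beyond 1.5*IQR from the box are conventionally drawn as outliers
lower_fence = q1 - 1.5*iqr
upper_fence = q3 + 1.5*iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
```

With this rule the whiskers stop at the most extreme points inside the fences, and anything outside is drawn as an individual marker.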

### countplot

With a boxplot we get information about the distribution in the different categories, but we lose the information about quantity. So let's add a bar plot.

seaborn not only provides the traditional barplot as matplotlib does, but also has the [`countplot()`](https://seaborn.pydata.org/generated/seaborn.countplot.html) function, which takes the whole dataframe as input and counts the elements in each category for you.

In [None]:
fig, (axbox, axbar) = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
sns.boxplot(ax=axbox, data=df, x='nationality', y='age')
sns.countplot(ax=axbar, data=df, x='nationality')
plt.xticks(rotation=45);
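
The counting step can also be done explicitly in pandas. A minimal sketch on a made-up table (the nationalities below are placeholders, not the real data):

```python
import pandas as pd

# hypothetical drivers table with one categorical column
toy = pd.DataFrame({'nationality': ['British', 'Italian', 'British',
                                    'French', 'Italian', 'British']})

# number of rows per category, sorted in descending order
counts = toy['nationality'].value_counts()
```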

### violinplot

Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator.

This plot gives you a clearer idea of the distribution of the values, which is especially useful when the distribution has more than one mode/population.

In [None]:
# select only the countries that have given birth to most drivers
nnatl = 6 # number of nationalities to consider
df = drivers[drivers['nationality'].isin(sorted_natl[:nnatl])]

fig, ax = plt.subplots(figsize=(10, 4), sharex=True)
sns.violinplot(ax=ax, data=df, x='nationality', y='age')
plt.xticks(rotation=45);

### pairplot

[**pairplot**](https://seaborn.pydata.org/generated/seaborn.pairplot.html) displays pairwise relationships in a dataset. 

By default, this function will create a grid of Axes such that each numeric variable in data will be shared on the y-axis across a single row and on the x-axis across a single column. The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.

This is a high-level interface for [PairGrid](https://seaborn.pydata.org/generated/seaborn.PairGrid.html#seaborn.PairGrid) that is intended to make it easy to draw a few common styles.

In [None]:
# load dataset
cars = pd.read_csv(
    os.path.join('data', 'car-data.csv'),
)
cars.info()
cars.head()

In [None]:
# select only few makes and few categories to visualize
# mpg: miles per gallon
# MSRP: Manufacturer's Suggested Retail Price
makes = ['BMW', 'Audi', 'Mercedes-Benz', 'Ferrari', 'Toyota', 'Volkswagen']
cats = ['Make', 'Engine HP', 'city mpg', 'MSRP']
df = cars[cars['Make'].isin(makes)][cats]
df.head()

In [None]:
# plot the pairplot
pg = sns.pairplot(df, vars=['Engine HP', 'city mpg', 'MSRP'], hue='Make')
pg.axes[1,1].set(xlim=(0, 50))
pg.axes[2,2].set(
    xlim=(1000, 1000000), xscale='log',
    ylim=(1000, 1000000), yscale='log',
);

## Geospatial Data with Folium

[**folium**](https://python-visualization.github.io/folium/index.html) is a Python package for geospatial data visualization that builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the [**leaflet.js**](https://leafletjs.com/) library.

`folium` makes it easy to visualize data that has been manipulated in Python on an interactive leaflet map. It enables both the binding of data to a map for [**choropleth**](https://en.wikipedia.org/wiki/Choropleth_map) visualizations as well as passing rich vector/raster/HTML visualizations as **markers** on the map.

The library has a number of built-in tilesets from [OpenStreetMap](https://www.openstreetmap.org/), [Mapbox](https://www.mapbox.com/), and [Stamen](http://maps.stamen.com/), and supports custom tilesets with Mapbox or Cloudmade API keys. `folium` supports Image, Video, GeoJSON and TopoJSON overlays.

To create a base map, simply pass your starting coordinates (latitude and longitude) to the `Map` class. You can also choose the starting zoom level. The default tileset is `OpenStreetMap`. Depending on the installed folium version, other tilesets such as `Stamen Terrain`, `Stamen Toner`, `Mapbox Bright`, `Mapbox Control Room`, or `Cartodb Positron` are built in as well.

In [None]:
import folium

In [None]:
# create our map object
m = folium.Map(
    location=[44.4874, 11.3279],
    zoom_start=19,
    tiles='OpenStreetMap',
)

# save your map and open it from outside the Jupyter notebook
m.save('first_map.html')

### Markers

It is possible to add markers to the map.

There are numerous marker types, starting with a simple Leaflet style location marker with a popup and tooltip HTML.

In [None]:
# let's build a map with a marker for each F1 circuit since 1950

# loading data related to F1 circuits
circuits = pd.read_csv(
    os.path.join(data_folder, 'circuits.csv'), 
    encoding='latin1',
    usecols=['circuitId', 'circuitRef', 'name', 'location', 'country', 'lat', 'lng'],
    index_col='circuitId',
)

# create a generic world map
world_map = folium.Map(tiles='Cartodb Positron')

# iterate over the F1 circuits and add a marker for each one
for i, circuit in circuits.iterrows():
    folium.Marker(
        circuit[['lat', 'lng']].tolist(), # coordinates
        tooltip=circuit['name'], # string that pops up when hovering over the marker
    ).add_to(world_map)

world_map.fit_bounds(world_map.get_bounds())
world_map.save('f1_circuits.html')

In [None]:
# create a generic world map
world_map = folium.Map(tiles='OpenStreetMap')

recent_circuits = races[races['year'] > 2015]['circuitId'].unique()

# iterate over the F1 circuits and add a marker for each one
for circuitId, circuit in circuits.iterrows():
    # choose icon depending on the circuit
    if circuitId in recent_circuits:
        icon = folium.Icon(icon='ok-circle', color='green')
    else:
        icon = folium.Icon(icon='remove-circle', color='red')
        
    # create marker
    folium.Marker(
        circuit[['lat', 'lng']].tolist(), # coordinates
        tooltip=circuit['name'], # string that pops up when hovering over the marker
        icon=icon,               # marker's icon
    ).add_to(world_map)

world_map.fit_bounds(world_map.get_bounds())
world_map.save('f1_circuits_recent.html')

### Choropleth maps

A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map.

Here we can color the countries depending on how many of their circuits have hosted an F1 Grand Prix.

In order to create a choropleth map of a region of interest, folium requires a [GeoJSON](https://en.wikipedia.org/wiki/GeoJSON) file that includes geospatial data of the region. For a choropleth map of the world, we would need a GeoJSON file that lists each country along with any geospatial data to define its borders and boundaries.

Here is an example of GeoJSON for Italian borders:
```json
{
    "type": "Feature",
    "properties": {
        "name": "Italy"
    },
    "geometry": {
        "type": "MultiPolygon",
        "coordinates": [
            [[[15.520376, 38.231155], ... , [15.520376, 38.231155]]],
            [[[ 9.210012, 41.209991], ... , [ 9.210012, 41.209991]]],
            [[[12.376485, 46.767559], ... , [12.376485, 46.767559]]]
        ]
    },
    "id": "ITA"
}
```

And here is a GeoJSON for world countries 
[world-countries.json](https://github.com/python-visualization/folium/blob/master/examples/data/world-countries.json) provided by folium.

In [None]:
import json
import urllib.request # library to download data from the internet

# link to GeoJSON
url = 'https://raw.githubusercontent.com/python-visualization/'\
      'folium/master/examples/data/world-countries.json'
# countries = json.loads(urllib.request.urlopen(url).read()) # load json as dict

# data to display
# change name to countries for uniformity with GeoJSON
circuits_ = circuits.replace({
    'UK': 'United Kingdom',
    'USA': 'United States of America',
    'UAE': 'United Arab Emirates',
    'Korea': 'South Korea'})
data = circuits_.groupby('country')['country'].count().rename('circuits')


# create Choropleth map
folium.Choropleth(
    name='choropleth',           # object name
    geo_data=url,                # geo data for world countries
    data=data,                   # data to display
    key_on='properties.name',    # where labels in data can be found in GeoJSON
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='number of circuits'
).add_to(world_map)

world_map.save('f1_circuits_choropleth.html')

display(data.sort_values(ascending=False).head(12))
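
Since the labels in `data` have to match the names stored in the GeoJSON exactly, it can help to list the labels that have no counterpart before drawing the map. The sketch below works on a tiny hypothetical FeatureCollection with the same layout as the example above; the helper `lookup` and the counts are made up for illustration.

```python
# tiny hypothetical FeatureCollection (geometries omitted for brevity)
geojson = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'id': 'ITA',
         'properties': {'name': 'Italy'}, 'geometry': None},
        {'type': 'Feature', 'id': 'GBR',
         'properties': {'name': 'United Kingdom'}, 'geometry': None},
    ],
}

def lookup(feature, dotted_key):
    ''' follow a dotted path such as "properties.name" inside a feature '''
    value = feature
    for part in dotted_key.split('.'):
        value = value[part]
    return value

# hypothetical number of circuits per country label
data = {'Italy': 3, 'UK': 2}

known = {lookup(f, 'properties.name') for f in geojson['features']}
missing = sorted(set(data) - known)   # labels that would not be colored
```

Here `missing` contains `'UK'`, which is the kind of mismatch that the renaming step before the choropleth takes care of.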

---