# Visualization and Statistics

At this point in the course, you have had some experience in getting and processing data, and exporting your results in a useful format. But after that stage, you also need to be able to *analyze* and *communicate* your results. Programming-wise, this is relatively easy. There are tons of great modules out there for doing statistics and making pretty graphs. The hard part is finding out what is the best way to communicate your findings.

**At the end of this week, you will be able to:**
- Perform exploratory data analysis, using both visual and statistical means.
- Communicate your results using visualizations, that is:
    - Make line plots.
    - Make bar charts.
    - Create maps.
    - Create networks.

**This requires that you already have (some) knowledge about:**
- Loading and manipulating data.

**If you want to learn more about these topics, you might find the following links useful:**
- Visualization blog: http://gravyanecdote.com/ 
- List of more blogs: https://flowingdata.com/2012/04/27/data-and-visualization-blogs-worth-following/

## What kind of visualization to choose

The following chart was made by [Abela (2006)](http://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.html). It provides a first intuition on what kind of visualization to choose for your data. He also asks exactly the right question: **What do you want to show?** It is essential for any piece of communication to first consider: what is my main point? And after creating a visualization, to ask yourself: does this visualization indeed communicate what I want to communicate? (Ideally, also ask others: what kind of message am I conveying here?)

![chart chooser](./images/chart_chooser.jpg)

It's also apt to call this a 'Thought-starter'. There are many great kinds of visualizations that aren't in this diagram. To get some more inspiration, check out the example galleries for these libraries:

* [D3.js](https://d3js.org/)
* [Seaborn](https://seaborn.github.io/examples/index.html)
* [Bokeh](http://bokeh.pydata.org/en/latest/docs/gallery.html)
* [Pandas](http://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html)
* [Matplotlib](http://matplotlib.org/gallery.html)
* [Vis.js](http://visjs.org/index.html)

But before you get carried away, do realize that **sometimes all you need is a good table**. Tables are visualizations, too! For a good guide on how to make tables, read the first three pages of [the LaTeX booktabs package documentation](http://ctan.cs.uu.nl/macros/latex/contrib/booktabs/booktabs.pdf). Also see [this guide](https://www.behance.net/gallery/Designing-Effective-Data-Tables/885004) with some practical tips.

## What kind of visualizations *not* to choose

As a warm-up exercise, take some time to browse [wtf-viz](http://viz.wtf/). For each of the examples, think about the following questions:

1. What is the author trying to convey here?
2. How did they try to achieve this?
3. What went wrong?
4. How could the visualization be improved? Or can you think of a better way to visualize this data?
5. What is the take-home message here for you?

For in-depth critiques of visualizations, see [Graphic Violence](https://graphicviolence.wordpress.com/). [Here](http://hanswisbrun.nl/tag/lieggrafiek/)'s a page in Dutch.



## A little history of visualization in Python

As you've seen in the [State of the tools](https://www.youtube.com/watch?v=5GlNDD7qbP4) video, `Matplotlib` is one of the core libraries for visualization. It's feature-rich, and there are many tutorials and examples showing you how to make nice graphs. It's also fairly clunky, however, and the default settings don't make for very nice graphs. But because `Matplotlib` is so powerful, no one wanted to throw the library away. So now there are several modules that provide wrapper functions around `Matplotlib`, so as to make it easier to use and produce nice-looking graphs.

* `Seaborn` is a visualization library that adds a lot of functionality and good-looking defaults to Matplotlib.
* `Pandas` is a data analysis library that provides plotting methods for its `dataframe` objects.

Behind the scenes, it's all still Matplotlib. So if you use any of these libraries to create a graph, and you want to customize the graph a little, it's usually a good idea to go through the `Matplotlib` documentation. Meanwhile, the developers of `Matplotlib` are still improving the library. If you have 20 minutes to spare, watch [this video](https://www.youtube.com/watch?v=xAoljeRJ3lU) on the new default colormap that will be used in Matplotlib 2.0. It's a nice talk that highlights the importance of color theory in creating visualizations.

With the web becoming more and more popular, there are now also several libraries offering interactive visualizations using JavaScript instead of Matplotlib. These are, among others:

* [Bokeh](http://bokeh.pydata.org/en/latest/)
* [NVD3](http://nvd3.org/)
* [Lightning](http://lightning-viz.org/)
* [MPLD3](http://mpld3.github.io/) (Also using Matplotlib)
* [Plotly](https://plot.ly/)
* [Vincent](https://vincent.readthedocs.io/en/latest/)

# Getting started

Run the cell below. This will load relevant packages to use visualizations inside the notebook.

In [None]:
# This is special Jupyter notebook syntax (a 'magic' command), enabling inline plotting mode.
# In this mode, all plots are shown inside the notebook!
# If you are not using notebooks (e.g. in a standalone script), don't include this.
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

## Tables

There are (at least) two ways to output your data as a formatted table:

* Using the `tabulate` package. (Use `pip install tabulate` to install it)
* Using one of the `pandas` DataFrame methods `df.to_latex(...)`, `df.to_string(...)`, or even `df.to_clipboard(...)`.

This is extremely useful if you're writing a paper. First version of the 'results' section: done!

In [None]:
from tabulate import tabulate

table = [["spam",42],["eggs",451],["bacon",0]]
headers = ["item", "qty"]

# Documentation: https://pypi.python.org/pypi/tabulate
print(tabulate(table, headers, tablefmt="latex_booktabs"))

In [None]:
import pandas as pd

# Documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
df = pd.DataFrame(data=table, columns=headers)
print(df.to_latex(index=False))

Once you've produced your LaTeX table, it's *almost* ready to put in your paper. If you're writing an NLP paper and your table contains scores for different system outputs, you might want to make the best scores **bold**, so that they stand out from the other numbers in the table.

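As a minimal sketch of that last step, the snippet below wraps the highest score in `\textbf{...}` before the rows are written out as LaTeX. The system names and scores are made up for illustration, and `bold_best` is a hypothetical helper (it is not part of `tabulate` or `pandas`).

```python
def bold_best(scores, fmt="{:.2f}"):
    """Format scores as strings, wrapping the highest one(s) in \\textbf{}."""
    best = max(scores)
    cells = []
    for score in scores:
        text = fmt.format(score)
        if score == best:
            text = "\\textbf{" + text + "}"
        cells.append(text)
    return cells

# Made-up scores for three hypothetical systems.
systems = ["baseline", "system A", "system B"]
system_scores = [0.61, 0.74, 0.70]

for name, cell in zip(systems, bold_best(system_scores)):
    # Each printed line is one row of a LaTeX tabular environment.
    print(name + " & " + cell + " \\\\")
```
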
### More to explore

The `pandas` library is *really* useful if you work with a lot of data (we'll also use it below). As Jake Vanderplas said in the [State of the tools](https://www.youtube.com/watch?v=5GlNDD7qbP4) video from Week 1, the `pandas` DataFrame is becoming the central format in the Python ecosystem. [Here](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) is a page with `pandas` tutorials.

## Plots

This section shows you how to make plots using Matplotlib and Seaborn.

In [None]:
# Even if you don't use Seaborn's plotting functions, setting a Seaborn style changes the Matplotlib defaults.
# The effect of this is that Matplotlib plots look prettier!
import seaborn as sns
sns.set_style("whitegrid")

### Illustrating Zipf's law

We'll look at word frequencies to illustrate [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law): "the frequency of any word is inversely proportional to its rank in the frequency table." Now what does that mean?

For this illustration, we'll use the SUBTLEX-US frequency dataset, which is based on a huge collection of movie subtitles. One of the authors, Marc Brysbaert (Professor of Psychology at the University of Ghent), notes that word frequencies in movie subtitles are the best approximation of the actual frequency distribution of the words that we hear every day. For this reason, these word frequencies are useful for psycholinguistic experiments.

First we need to load the data. We'll use the `csv` module.

In [None]:
import csv

# We'll open the file using the DictReader class, which turns each row into a dictionary. 
# Keys in the dictionary are determined by the header of the file.

entries = []
with open('../Data/SUBTLEX-US/SUBTLEXus74286wordstextversion.txt') as f:
    reader = csv.DictReader(f, delimiter='\t')
    for entry in reader:
        # Turn the numbers into floats.
        entry['SUBTLWF'] = float(entry['SUBTLWF'])
        entry['Lg10WF'] = float(entry['Lg10WF'])
        # And append the entry to the list.
        entries.append(entry)

# Sort the list of entries by frequency.
entries = sorted(entries,
                 key=lambda d:d['SUBTLWF'], # Sort by the word frequency
                 reverse=True)              # Order the list from high to low 

Now we'll visualize the relationship between the frequency of the words and their rank, with the words ordered by their frequency.

In [None]:
# We use a list comprehension to get all the frequency values.
frequencies = [e['SUBTLWF'] for e in entries]

# Rank is just a list of numbers between 0 and the number of entries.
ranks = list(range(len(entries)))

# Plot the relationship (plt.plot draws a line through the points).
plt.plot(ranks, frequencies)

This graph looks nearly empty, but if you look really closely, you'll see a blue line hugging the X and Y axes. To see the relation, we need to transform both axes using the `log` scale. On log-log axes, a power law (such as "frequency is inversely proportional to rank") becomes a straight line. (You don't need to know this for the exam!) After transforming the ranks and frequencies, the graph should (more or less) look like a straight line!

In [None]:
from math import log10

# The CSV already has (base-10) log word frequencies.
log_frequencies = [e['Lg10WF'] for e in entries]

# We'll take the log of the rank, starting at 1 (because the log function isn't defined for 0).
# We use base 10 here as well, so that both axes use the same kind of logarithm.
log_rank = [log10(i) for i in range(1,len(log_frequencies)+1)]

# And plot the graph again. This should be a (more or less) straight line!
plt.plot(log_rank, log_frequencies)

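To see why the transformed plot is a straight line, it helps to look at an idealized case. If frequency is exactly inversely proportional to rank, then frequency = C / rank, so log(frequency) = log(C) - log(rank): a line with slope -1. The sketch below checks this on made-up numbers. The constant `C` is an arbitrary assumption, and real word frequencies only follow this pattern approximately.

```python
from math import log10

C = 1000000  # Arbitrary constant: the frequency of the most frequent word.

# Idealized Zipfian frequencies for ranks 1 to 1000.
ideal_ranks = list(range(1, 1001))
ideal_frequencies = [C / rank for rank in ideal_ranks]

ideal_log_ranks = [log10(rank) for rank in ideal_ranks]
ideal_log_frequencies = [log10(freq) for freq in ideal_frequencies]

# Slope of the line between the first and the last point.
rise = ideal_log_frequencies[-1] - ideal_log_frequencies[0]
run = ideal_log_ranks[-1] - ideal_log_ranks[0]
slope = rise / run
print('Slope on the log-log scale:', round(slope, 3))
```
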
### Correlation

Let's look at correlation between values in Python. We'll explore two measures: Pearson and Spearman correlation. Given two lists of numbers, Pearson measures whether there is a *linear relation* between those numbers. Spearman, by contrast, measures whether there is a *monotonic relation*. Monotonic is the less strict of the two: every linear relation is monotonic, but not the other way around.

* Monotonic: the two lists consistently move in the same direction, or consistently in opposite directions.
    1. if a number in one list increases, so does the number in the other list, or 
    2. if a number in one list increases, the number in the other list decreases.
* Linear: similar to monotonic, but the increase or decrease can be modeled by a straight line.

Here is a small example to illustrate the difference.

In [None]:
# Scipy offers many statistical functions, among which the Pearson and Spearman correlation measures.
from scipy.stats import pearsonr, spearmanr

# X is equal to [0,1,2,...,98,99]
x = list(range(100))

# Y is equal to [0^2, 1^2, 2^2, ..., 98^2, 99^2]
y = [i**2 for i in x]

# Z is equal to [0,100,200, ..., 9800, 9900]
z = [i*100 for i in x]

# Plot x and y.
plt.plot(x, y, label="X and Y")

# Plot x and z in the same plot.
plt.plot(x, z, label="X and Z")

# Add a legend.
plt.legend(loc='upper left')

In [None]:
correlation, significance = pearsonr(x,y)
print('The Pearson correlation between X and Y is:', correlation)

correlation, significance = spearmanr(x,y)
print('The Spearman correlation between X and Y is:', correlation)

print('----------------------------------------------------------')

correlation, significance = pearsonr(x,z)
print('The Pearson correlation between X and Z is:', correlation)

correlation, significance = spearmanr(x,z)
print('The Spearman correlation between X and Z is:', correlation)

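The Spearman correlation is perfect in both cases, because with each increase in X, there is an increase in Y (and in Z). The Pearson correlation between X and Z is perfect as well, because Z is a straight-line function of X. But because the increase in Y isn't the same at each step, the Pearson correlation between X and Y is slightly lower.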

In Natural Language Processing, people typically use the Spearman correlation because they are interested in *relative scores*: does the model score A higher than B? The exact score often doesn't matter. Hence Spearman provides a better measure, because it doesn't penalize models for non-linear behavior.

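One way to make the "relative scores" idea concrete: when there are no ties, the Spearman correlation is simply the Pearson correlation computed on the *ranks* of the values instead of on the values themselves. The sketch below shows this in plain Python. The helper functions `pearson` and `ranks` are written for this illustration only, and `ranks` assumes that there are no tied values.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((i - mean_a) * (j - mean_b) for i, j in zip(a, b))
    var_a = sum((i - mean_a) ** 2 for i in a)
    var_b = sum((j - mean_b) ** 2 for j in b)
    return cov / sqrt(var_a * var_b)

def ranks(values):
    """Rank of each value (1 = smallest). Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

xs = list(range(100))
ys = [i ** 2 for i in xs]

print('Pearson on the raw values:', pearson(xs, ys))
print('Pearson on the ranks (= Spearman):', pearson(ranks(xs), ranks(ys)))
```
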
### Exploratory visualization

Before you start working on a particular dataset, it's often a good idea to explore the data first. If you have text data, open the file and see what it looks like. If you have numeric data, it's a good idea to visualize what's going on. This section shows you some ways to do exactly that. We'll work with another data file by Brysbaert and colleagues, consisting of concreteness ratings, i.e. how abstract or concrete participants think a given word is.

In [None]:
# Let's load the data first.
concreteness_entries = []
with open('../Data/concreteness/Concreteness_ratings_Brysbaert_et_al_BRM.txt') as f:
    reader = csv.DictReader(f, delimiter='\t')
    for entry in reader:
        entry['Conc.M'] = float(entry['Conc.M'])
        concreteness_entries.append(entry)

For any kind of ratings, you can typically expect the data to have a normal-ish distribution: most of the data in the middle, and increasingly fewer scores on the extreme ends of the scale. We can check whether the data matches our expectation using a histogram.

In [None]:
scores = []
for entry in concreteness_entries:
    scores.append(entry['Conc.M'])

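# Plot the distribution of the scores as a histogram (kde=False leaves out the density curve).
# Note: recent Seaborn versions deprecate distplot in favor of sns.histplot(scores).
sns.distplot(scores, kde=False)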

.

.

.

.

Surprise! It doesn't. This is a typical *bimodal* distribution with two peaks. Going back to [the original article](http://link.springer.com/sharelink/10.3758/s13428-013-0403-5), this is also mentioned in their discussion:

> One concern, for instance, is that concreteness and abstractness may be not the two extremes of a quantitative continuum (reflecting the degree of sensory involvement, the degree to which words meanings are experience based, or the degree of contextual availability), but two qualitatively different characteristics. One argument for this view is that the distribution of concreteness ratings is bimodal, with separate peaks for concrete and abstract words, whereas ratings on a single, quantitative dimension usually are unimodal, with the majority of observations in the middle (Della Rosa et al., 2010; Ghio, Vaghi, & Tettamanti, 2013).

To compare, here are sentiment scores for English (from [Dodds et al. 2014](http://www.uvm.edu/storylab/share/papers/dodds2014a/)), where native speakers rated a list of 10,022 words on a scale from 1 (negative) to 9 (positive).

In [None]:
# Load the data (one score per line, words are in a separate file).
with open('../Data/Dodds2014/data/labMTscores-english.csv') as f:
    scores = [float(line.strip()) for line in f]

# Plot the histogram
sns.distplot(scores, kde=False)

Because Dodds et al. collected data from several languages, we can plot the distributions for multiple languages and see whether they all have normally distributed scores. We will do this with a [Kernel Density Estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) plot. Basically, such a plot shows you the probability distribution (the chance of getting a particular score) as a continuous line. Because it's a line rather than a set of bars, you can show many of them in the same graph.

In [None]:
# This is necessary because the kdeplot function only accepts arrays.
import numpy as np

# This is necessary to get all the separate files.
import glob

# Get all the score files.
filenames = glob.glob('../Data/Dodds2014/data/labMTscores-*.csv')

# Showing the first 5, because else you can't keep track of all the lines.
for filename in filenames[:5]:
    # Read the language from the filename
    language = filename.split('-')[1]
    language = language.split('.')[0]
    with open(filename) as f:
        scores = [float(line.strip()) for line in f]
        sns.kdeplot(np.array(scores), label=language)

plt.legend()

Look at all those unimodal distributions!

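If you're curious how a kernel density estimate is computed: every data point contributes a small Gaussian "bump", and the estimated density at a point is the average of all those bumps. Below is a minimal sketch in plain Python. The function `simple_kde` is written for this illustration, the scores are made up, and the bandwidth of 0.5 (the width of each bump) is an arbitrary assumption; plotting libraries normally pick a bandwidth for you.

```python
from math import exp, pi, sqrt

def simple_kde(samples, bandwidth):
    """Return a function that estimates the density at any point x."""
    def density(x):
        total = 0.0
        for sample in samples:
            # Each sample contributes a Gaussian 'bump' centered on itself.
            z = (x - sample) / bandwidth
            total += exp(-0.5 * z * z) / sqrt(2 * pi)
        return total / (len(samples) * bandwidth)
    return density

# Made-up ratings with two clusters (around 2 and around 8).
toy_scores = [1.5, 2.0, 2.2, 2.5, 7.5, 8.0, 8.1, 8.6]
toy_density = simple_kde(toy_scores, bandwidth=0.5)

# The estimated density is high near the clusters and low in between.
for x in [2, 5, 8]:
    print(x, round(toy_density(x), 3))
```
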
**Continuing with the concreteness dataset**

It is commonly known in the literature on concreteness that concreteness ratings are (negatively) correlated with word length: the longer a word, the more abstract it typically is. Let's verify this by visualizing the relation in a regression plot. We'll use a Pandas DataFrame to plot the data, but you could also just use `sns.regplot(x=word_length, y=rating, x_jitter=0.4)`.

In [None]:
# Create two lists of scores to correlate.
word_length = []
rating = []
for entry in concreteness_entries:
    word_length.append(len(entry['Word']))
    rating.append(entry['Conc.M'])

# Create a Pandas Dataframe. 
# I am using this here, because Seaborn adds text to the axes if you use DataFrames.
# You could also use pd.read_csv(filename,delimiter='\t') if you have a file ready to plot.
df = pd.DataFrame.from_dict({"Word length": word_length, "Rating": rating})

# Plot a regression line and (by default) the scatterplot. 
# We're adding some horizontal jitter because word lengths are whole numbers,
# so all the points for a given length fall on one vertical line. 
# This makes it difficult to see how densely 'populated' the area is.
# But with some random noise added to the scatterplot, you can see more clearly 
# where there are many dots and where there are fewer dots.
# (We pass x and y as keywords, because recent Seaborn versions require this.)
sns.regplot(x='Word length', y='Rating', data=df, x_jitter=0.4)

That doesn't look like a super strong correlation. We can check by using the correlation measures from SciPy.

In [None]:
# If we're interested in predicting the actual rating.
corr, sig = pearsonr(word_length, rating)
print('Correlation, according to Pearsonr:', corr)

# If we're interested in ranking the words by their concreteness.
corr, sig = spearmanr(word_length, rating)
print('Correlation, according to Spearmanr:', corr)

# Because word length is bound to result in ties (many words have the same length), 
# some people argue you should use Kendall's Tau instead of Spearman's R:
from scipy.stats import kendalltau

corr, sig = kendalltau(word_length, rating)
print("Correlation, according to Kendall's Tau:", corr)

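To get a feeling for how Kendall's Tau deals with ties, here is a minimal sketch of the tau-b variant in plain Python. It compares every pair of items: a pair is *concordant* if both values move in the same direction, *discordant* if they move in opposite directions, and tied otherwise. The function `kendall_tau_b` and the toy data are made up for this illustration; because it loops over all pairs, it is only practical for short lists.

```python
from math import sqrt

def kendall_tau_b(xs, ys):
    """Kendall's tau-b, computed by comparing every pair of items."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
            elif dx == 0 and dy != 0:
                ties_x += 1  # Tied on x only.
            elif dy == 0 and dx != 0:
                ties_y += 1  # Tied on y only.
            # Pairs that are tied on both values are ignored.
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + ties_x) * (untied + ties_y))

# Made-up word lengths (with one tie) and concreteness ratings.
toy_lengths = [3, 3, 5, 7]
toy_ratings = [4.8, 4.1, 3.0, 1.5]
print("Kendall's tau-b:", round(kendall_tau_b(toy_lengths, toy_ratings), 3))
```
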
Now that you've seen several different plots, hopefully the general pattern is becoming clear: visualization typically consists of three steps:

1. Load the data.
2. Organize the data in such a way that you can feed it to the visualization function.
3. Plot the data using the function of your choice.

There's also an optional **fourth step**: After plotting the data, tweak the plot until you're satisfied. Of these steps, the second and fourth are usually the most involved. Now let's try a slightly more difficult graph: **the bar plot**. The following example shows you how to draw a bar plot and customize it.

In [None]:
# We want to visualize how far I've walked this week (using some random numbers).
# Here's a dictionary that can be loaded as a pandas dataframe. Each item corresponds to a COLUMN.
distance_walked = {'days': ['Monday','Tuesday','Wednesday','Thursday','Friday'],
                   'km': [5,6,5,19,4]}

# Turn it into a dataframe.
df = pd.DataFrame.from_dict(distance_walked)

# Plot the data using seaborn's built-in barplot function.
# To select the color, I used the color chart from here: 
# http://stackoverflow.com/questions/22408237/named-colors-in-matplotlib
ax = sns.barplot(x='days',y='km',color='lightsteelblue',data=df)

# Here's a first customization. 
# Using the Matplotlib object returned by the plotting function, we can change the X- and Y-labels.
ax.set_ylabel('km')
ax.set_xlabel('')

# Each matplotlib object consists of lines and patches that you can modify.
# Each bar is a rectangle that you can access through the list of patches.
# To make Thursday stand out even more, I changed its face color.
ax.patches[3].set_facecolor('palevioletred')

In [None]:
# You can also plot a similar chart using Pandas.
ax = df.plot(x='days',y='km',kind='barh') # or kind='bar'

# Remove the Y label.
ax.set_ylabel('')

### On your own

We'll work with data from Donald Trump's Facebook page. The relevant file is `Data/Trump-Facebook/FacebookStatuses.tsv`. Try to create a visualization that answers one of the following questions:

1. How does the number of responses to Trump's posts change over time?
2. What webpages does Donald Trump link to, and does this change over time? Which is the most popular? Are there any recent newcomers?
3. What entities does Trump talk about?
4. Starting March 2016 (when the emotional responses were introduced on Facebook), how have the emotional responses to Trump's messages developed?
5. [Question of your own.]

Try to at least think about what kind of visualization might be suitable to answer these questions, and we'll discuss this question in class on Monday. More specific questions:

* What kind of preprocessing is necessary before you can start visualizing the data?
* What kind of visualization is suitable for answering these questions?
    - What sort of chart would you choose?
    - How could you use color to improve your visualization?
* What might be difficult about visualizing this data? How could you overcome those difficulties?

In [None]:
# Open the data.


# Process the data so that it can be visualized.



In [None]:
# Plot the data.


# Modify the plot.



I want to leave you with a note on bar plots: while they're super useful, don't use them to visualize distributions. There was even a meme with the hashtag \#barbarplots to draw attention to this issue, using the image below. The people behind it also had a [Kickstarter](https://www.kickstarter.com/projects/1474588473/barbarplots) to raise money for sending T-shirts with this image to the editorial boards of big journals!

![barbarplots](./images/barbarplots.jpg)

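Here is a small illustration of why that matters, using two made-up samples. A bar plot of the means would show two identical bars, even though the underlying distributions are completely different.

```python
from statistics import mean, stdev

# Two made-up samples with the same mean but very different distributions.
group_a = [5, 5, 5, 5, 5, 5]
group_b = [1, 1, 1, 9, 9, 9]

for name, values in [('A', group_a), ('B', group_b)]:
    print(name,
          'mean:', mean(values),
          'stdev:', round(stdev(values), 2),
          'min:', min(values),
          'max:', max(values))
```
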
## Maps

Maps are a *huge* subject that we won't cover in detail. We'll only discuss a very simple use case: suppose you have some locations that you want to show on a map. How do you do that?

First we need to import the relevant library:

In [None]:
# We'll use the Basemap module.
from mpl_toolkits.basemap import Basemap

Next, we need to create a `Basemap` instance. This instance contains all the data that is necessary to draw the area you're interested in. You can create a `Basemap` instance by calling the `Basemap` class with 6 parameters:

* The width of the map in meters. You can typically find a rough estimate online. Start from there, and find the optimal width by trial and error.
* The height of the map in meters. You can find the optimal height through the same procedure.
* The projection. Different projections are [listed here](http://matplotlib.org/basemap/users/mapsetup.html).
* The resolution. How detailed you want the borders to be drawn. Possible values are `c` (crude), `l` (low), `i` (intermediate), `h` (high), `f` (full) and `None`. The more detailed the borders are, the slower the drawing process becomes. So during development, you should use a lower resolution so that you see the results more quickly.
* Latitude at the center of the map. You can find this value (or a rough approximation) online.
* Longitude at the center of the map.

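If you can't find the width and height online, you can get a rough first estimate from a bounding box in longitude and latitude. The sketch below uses the rule of thumb that one degree of latitude is roughly 111 km, and that a degree of longitude shrinks with the cosine of the latitude. The function `approximate_size` and the bounding box for the Netherlands are assumptions made for this illustration; you will still need some trial and error (and a margin around the country) to find good values.

```python
from math import cos, radians

METERS_PER_DEGREE = 111320  # Approximate length of one degree of latitude.

def approximate_size(min_lon, max_lon, min_lat, max_lat):
    """Rough width and height (in meters) of a longitude/latitude bounding box."""
    center_lat = (min_lat + max_lat) / 2
    # Degrees of longitude get shorter as you move away from the equator.
    box_width = (max_lon - min_lon) * METERS_PER_DEGREE * cos(radians(center_lat))
    box_height = (max_lat - min_lat) * METERS_PER_DEGREE
    return box_width, box_height

# Rough bounding box for the Netherlands (approximate values, for illustration only).
approx_width, approx_height = approximate_size(3.3, 7.2, 50.75, 53.6)
print(round(approx_width), round(approx_height))
```
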
Using the `Basemap` object, you can draw the coastlines and the border lines, and then you have a nice map!

In [None]:
# Get the map. This may take a while..
m = Basemap(width=275000,height=360000,projection='lcc',
            resolution='h',lat_0=52.25,lon_0=5.2)

# Draw coastlines and borders.
m.drawcoastlines(linewidth=1)
m.drawcountries(linewidth=1)

The `Basemap` [documentation](http://matplotlib.org/basemap/users/geography.html) also shows you how to draw more detailed maps. Here's one of their examples:

In [None]:
# setup Lambert Conformal basemap.
# set resolution=None to skip processing of boundary datasets.
m = Basemap(width=12000000,height=9000000,projection='lcc',
            resolution=None,lat_1=45.,lat_2=55,lat_0=50,lon_0=-107.)
m.shadedrelief()

This visualization works very well for the USA, since it's such a large area. But since the Netherlands is much smaller, you end up with a very blurry map (because you need to zoom in so much). One option would be to add coastlines and borders again (play around with this by uncommenting the commands below). But for publications, I would probably use the first (black-and-white) map, because it's so clear.

In [None]:
m = Basemap(width=275000,height=360000,projection='lcc',
            resolution='h',lat_0=52.25,lon_0=5.2)
m.shadedrelief()
# m.drawcoastlines(linewidth=1,color='white')
# m.drawcountries(linewidth=1,color='white')

### Intermezzo: degrees, minutes, seconds

Latitude and longitude are sometimes given in decimal degrees, and sometimes in degree-minute-second (DMS) notation. E.g. Amsterdam is located at 52°22′N 4°54′E. The first number corresponds to the latitude, while the second number corresponds to the longitude. This is how you convert from DMS to decimal degrees:

In [None]:
def decimal_degrees(degrees, minutes, seconds):
    "Convert DMS-formatted degrees to decimal degrees."
    dd = degrees + (minutes/60) + (seconds/3600)
    return dd

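As a usage example, the sketch below applies the same conversion to the coordinates of Amsterdam given above, and adds a hypothetical helper `dms` for going in the other direction. Two assumptions to be aware of: `seconds` gets a default value of 0 here (the coordinates above don't specify seconds), and `dms` only handles non-negative values. Because of floating point rounding, the seconds it returns may be slightly off.

```python
def decimal_degrees(degrees, minutes, seconds=0):
    "Convert DMS-formatted degrees to decimal degrees."
    return degrees + (minutes / 60) + (seconds / 3600)

def dms(decimal):
    "Convert non-negative decimal degrees to (degrees, minutes, seconds)."
    degrees = int(decimal)
    minutes = int((decimal - degrees) * 60)
    seconds = (decimal - degrees - minutes / 60) * 3600
    return degrees, minutes, seconds

# Amsterdam: 52°22′N 4°54′E.
amsterdam_lat = decimal_degrees(52, 22)
amsterdam_lon = decimal_degrees(4, 54)
print(round(amsterdam_lat, 4), round(amsterdam_lon, 4))

# By convention, southern latitudes and western longitudes are negative,
# so a latitude of 33°52′S becomes:
print(round(-decimal_degrees(33, 52), 4))

# And back again, for a value that converts cleanly:
print(dms(52.5))
```
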
### Plotting points and values on a map

Now let's plot some points on the map! We'll use the plot function for this, but for collections of points you may want to use `m.scatter(...)`. See [this page](http://matplotlib.org/api/markers_api.html) for more instructions on how to control the way markers look.

One of the most basic things you can do is put a dot on the map corresponding to the capital city. Here's how to do that for the Netherlands.

In [None]:
# Longitude and latitude for Amsterdam.
lon, lat = 4.8979956033677, 52.374436

# Draw the map again.
m.drawcoastlines(linewidth=1)
m.drawcountries(linewidth=1)

# Plot Amsterdam on the map.
# The latlon keyword tells Basemap that lon and lat are longitude and latitude values.
# These are automatically converted to the right coordinates for the current map projection.
# If you leave out the latlon keyword, you need to separately convert the coordinates, like so:
# lon, lat = m(lon, lat)
m.plot(lon, lat, 'ro', latlon=True)

OK, that looks good, but what if we wanted to plot *values* on the map? We can do that as well:

In [None]:
value = '42'

# Convert longitude and latitude to map coordinates.
x,y = m(lon, lat)

# Draw the map again.
m.drawcoastlines(linewidth=1)
m.drawcountries(linewidth=1)

# Plot the value on the map. Note that we're using a Matplotlib function now!
plt.text(x,y,value,weight='extra bold', color='red',size=14, va='center', ha='center')

Hmm, that doesn't look quite right. The coastline makes the number very hard to read. We can solve this by putting a white marker behind the number. (This is sort of cheating, but I found this trick to be very useful.)

In [None]:
m.drawcoastlines(linewidth=1)
m.drawcountries(linewidth=1)

# Plot a marker.
m.plot(x,y,'wo',mec='white',markersize=15)

# Plot the value on the map.
plt.text(x,y,value,weight='extra bold', color='red',size=14, va='center', ha='center')

Much more readable!

### Your turn

Try to create a map for some other country than the USA or the Netherlands, and mark the 5 biggest cities on the map. You can use Google/Wikipedia or the `Geopy` module to get the latitudes and longitudes. Here's a reminder for the `geopy` module:

```python
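from geopy.geocoders import Nominatim

# Nominatim asks for a user agent; replace this placeholder with a name for your own application.
geolocator = Nominatim(user_agent="my-visualization-notebook")

place = "Amsterdam"  # Example query.
location = geolocator.geocode(place)
lon,lat  = location.longitude, location.latitude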
```

If you do use `geopy`, please don't forget to cache your results. Store coordinates in a dictionary or a JSON file, with key: placename, value: (longitude, latitude).

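Caching could look something like the sketch below. All names here (`load_cache`, `save_cache`, `get_coordinates`) are hypothetical helpers written for this illustration. The `lookup` argument is any function that takes a place name and returns a `(longitude, latitude)` pair, for example a small wrapper around `geolocator.geocode`. To keep the example runnable without an internet connection, it uses a stand-in function with made-up coordinates.

```python
import json
import os

def load_cache(path):
    """Load previously stored coordinates, or start with an empty cache."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_cache(cache, path):
    """Write the cache to a JSON file."""
    with open(path, 'w') as f:
        json.dump(cache, f, indent=2)

def get_coordinates(place, cache, lookup):
    """Return (longitude, latitude), calling lookup(place) only for new places."""
    if place not in cache:
        lon, lat = lookup(place)
        cache[place] = [lon, lat]  # JSON has no tuples, so we store a list.
    return tuple(cache[place])

# Stand-in for a real geocoder call, so that this sketch runs offline.
def fake_lookup(place):
    print('Looking up', place)
    return 4.9, 52.37  # Made-up coordinates.

place_cache = {}  # Or: place_cache = load_cache('coordinates.json')
print(get_coordinates('Amsterdam', place_cache, fake_lookup))  # Calls fake_lookup.
print(get_coordinates('Amsterdam', place_cache, fake_lookup))  # Served from the cache.
```
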
[Here](http://maxberggren.se/2015/08/04/basemap/) is another tutorial for drawing a map using Basemap.

### More to explore

Other libraries for visualizing data on a map are:

* [Vincent](https://vincent.readthedocs.io/en/latest/)
* [Folium](https://github.com/python-visualization/folium)
* [Cartopy](http://scitools.org.uk/cartopy/docs/latest/)
* [Geoplotlib](https://github.com/andrea-cuttone/geoplotlib)
* [Kartograph](http://kartograph.org/)

Beyond displaying points on a map, you might also want to create [choropleth maps](https://en.wikipedia.org/wiki/Choropleth_map). We won't cover this subject in detail, but for anything more detailed than countries (states, provinces, municipalities, etc), you typically need a file that describes the boundaries of those regions, such as a **shapefile** or a GeoJSON file, so that the mapping library knows what the relevant regions are. In those files, regions are represented as polygons: complex shapes that can be overlaid on a map.

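To give an idea of what "regions as polygons" looks like in practice, here is a minimal sketch of a GeoJSON structure, built as a Python dictionary. The region is a made-up square and the property names are arbitrary; real region files contain many more points per polygon.

```python
import json

# A made-up square 'region'. Coordinates are [longitude, latitude] pairs.
ring = [[5.0, 52.0], [5.5, 52.0], [5.5, 52.3], [5.0, 52.3], [5.0, 52.0]]

feature = {
    "type": "Feature",
    "properties": {"name": "Example region", "value": 42},
    "geometry": {
        "type": "Polygon",
        # A polygon is a list of rings; the first ring is the outer boundary.
        "coordinates": [ring],
    },
}

collection = {"type": "FeatureCollection", "features": [feature]}

# The first and last point of a ring are identical, which closes the shape.
print('Ring is closed:', ring[0] == ring[-1])
print(json.dumps(collection, indent=2))
```
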
## Networks

Some data is best visualized as a network. There are several options out there for doing this. The easiest is to use the NetworkX library and either plot the network using Matplotlib, or export it to JSON or GEXF (Graph EXchange Format) and visualize the network using external tools.

Let's explore a bit of WordNet today. For this, we'll want to import the NetworkX library, as well as the WordNet module. We'll look at the first synset for *dog*: `dog.n.01`, and how it's positioned in the WordNet taxonomy. All credits for this idea go to [this blog](http://www.randomhacks.net/2009/12/29/visualizing-wordnet-relationships-as-graphs/).

In [None]:
import networkx as nx
from nltk.corpus import wordnet as wn
from nltk.util import bigrams # This is a useful function.

Networks are made up out of *edges*: connections between *nodes* (also called *vertices*). To build a graph of the WordNet-taxonomy, we need to generate a set of edges. This is what the function below does.

In [None]:
def hypernym_edges(synset):
    """
    Function that generates a set of edges 
    based on the path between the synset and entity.n.01
    """
    edges = set()
    for path in synset.hypernym_paths():
        synset_names = [s.name() for s in path]
        # bigrams turns a list of arbitrary length into tuples: [(0,1),(1,2),(2,3),...]
        # edges.update adds novel edges to the set.
        edges.update(bigrams(synset_names))
    return edges

In [None]:
# Use the synset 'dog.n.01'
dog = wn.synset('dog.n.01')

# Generate a set of edges connecting the synset for 'dog' to the root node (entity.n.01)
edges = hypernym_edges(dog)

# Create a graph object.
G = nx.Graph()

# Add all the edges that we generated earlier.
G.add_edges_from(edges)

Now we can actually start drawing the graph. We'll increase the figure size, and use the `draw_spring` function (which uses the Fruchterman-Reingold layout algorithm).

In [None]:
# Increasing figure size for better display of the graph.
from pylab import rcParams
rcParams['figure.figsize'] = 11, 11

# Draw the actual graph.
nx.draw_spring(G,with_labels=True)

What is interesting about this is that there is a *cycle* in the graph! This is because *dog* has two hypernyms, and both of those hypernyms have *animal.n.01* as a (direct or indirect) hypernym.

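The sketch below shows how such a cycle arises, using two simplified, made-up hypernym paths (the real WordNet paths are longer). It builds the edges the same way `bigrams` does, and then uses a simple counting argument: a connected graph without cycles has exactly one edge fewer than it has nodes, so any extra edge means there is a cycle. This check assumes the graph is connected, which is the case here because both paths share the same start and end points.

```python
def path_edges(path):
    """Turn a path [a, b, c] into edges [(a, b), (b, c)], like bigrams does."""
    return list(zip(path, path[1:]))

# Two simplified, made-up hypernym paths from the root to 'dog'.
toy_paths = [['entity', 'animal', 'canine', 'dog'],
             ['entity', 'animal', 'domestic_animal', 'dog']]

# Collect the edges in a set, so that shared edges are only stored once.
toy_edges = set()
for path in toy_paths:
    toy_edges.update(path_edges(path))

toy_nodes = {node for edge in toy_edges for node in edge}

print(len(toy_nodes), 'nodes,', len(toy_edges), 'edges')
print('Contains a cycle:', len(toy_edges) > len(toy_nodes) - 1)
```
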
What is not so good is that the graph looks pretty ugly: there are several crossing edges, which is totally unnecessary. There are better layouts implemented in NetworkX, but they do require you to install `pygraphviz`. Once you've done that, you can execute the next cell. (And if not, then just assume it looks much prettier!)

In [None]:
# Install pygraphviz first: pip install pygraphviz
from networkx.drawing.nx_agraph import graphviz_layout

# Let's add 'cat' to the bunch as well.
cat = wn.synset('cat.n.01')
cat_edges = hypernym_edges(cat)
G.add_edges_from(cat_edges)

# Use the graphviz layout. First compute the node positions..
positioning = graphviz_layout(G)

# And then pass node positions to the drawing function.
nx.draw_networkx(G,pos=positioning)

**Question**

How do dogs differ from cats, according to WordNet?

**Question**

Can you think of any data other than WordNet-synsets that could be visualized as a network?

### More to explore

* Python's network visualization tools are fairly limited, though I haven't really explored Pygraphviz, and Graphviz itself is able to create [examples like these](http://www.graphviz.org/Gallery.php). It's usually easier to export the graph to GEXF and visualize it using [Gephi](https://gephi.org/) or [SigmaJS](http://sigmajs.org/). Gephi also features plugins, which enable you to create interactive visualizations. See [here](https://github.com/evanmiltenburg/dm-graphs/) for code and a link to a demo that I made.

* For analyzing graphs, I like to use either Gephi, or the [python-louvain](http://perso.crans.org/aynaud/communities/) library, which enables you to cluster nodes in a network.

* Some of the map-making libraries listed above also provide some cool functionality to create graphs on a map. This is nice to visualize e.g. relations between countries.