# Legal Analytics (LAW3025) - Tutorial 5: 'Visual Exploratory Data Analysis'

*Version*: 2021/2022

This notebook accompanies Chapter 6 of [Epstein and Martin's *An Introduction to Empirical Legal Research*](https://maastrichtuniversity.on.worldcat.org/oclc/891136365), where Epstein and Martin explore the International Criminal Tribunals (ICT) dataset (available here: https://raw.githubusercontent.com/maastrichtlawtech/law3025-legal-analytics/main/data/dataset_ictData.csv).

Last week, you delved deeper into data understanding with *numerical* Exploratory Data Analysis. Now, you will use *visual* methods of analysis that will give you an even better understanding of the data.


## 1. Preliminaries

In [None]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Load the dataset into a DataFrame named 'data'

## 2. Visual EDA

Visual exploratory data analysis can give you a very good grasp of your data. In this section, we will provide instructions and programming tasks for the following data visualisations:

- frequency distributions
- bar charts
- histograms
- boxplots
- scatterplots

Epstein and Martin discuss two types of displays for summarizing data: *frequency distributions* and *histograms* in section 6.3 (p. 124 ff.). Let's start with replicating their tables and visualisations.

### 2.1.  Frequency distributions

Frequency distributions are particularly helpful to display categorical data. Epstein and Martin state at p. 125: 'A frequency distribution is a table that contains the number of observations that fall into each of the variable's categories.'

To count the values for each category of a variable, you can call the `value_counts()` method on the relevant column of the DataFrame. By default it returns absolute frequencies; set the parameter `normalize=True` to get relative frequencies instead.
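As a minimal sketch of this technique, the cell below builds a frequency table for a small made-up Series. The data and the names `colours`, `absolute`, `relative` and `freq_table` are invented for illustration and are not taken from the ICT dataset.

```python
import pandas as pd

# Toy data, invented for illustration; not taken from the ICT dataset.
colours = pd.Series(["red", "blue", "red", "green", "red", "blue"], name="colour")

absolute = colours.value_counts()                # number of observations per category
relative = colours.value_counts(normalize=True)  # proportions that sum to 1

# axis=1 places the two Series side by side as columns;
# keys supplies the column names of the combined table.
freq_table = pd.concat([absolute, relative], axis=1, keys=["count", "proportion"])
print(freq_table)
```

With `axis=0` (the default), `concat()` would stack the two Series on top of each other instead of aligning them by category.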

Now, reproduce table 6.3 on page 125.

In [None]:
# Count the frequencies for each value in crimRank and print the frequency table
# Hint: you will need to define a table for both the absolute and relative frequencies and then concatenate them.
# Hint: when using the concat() function, make sure to set the axis parameter right.


### 2.2. Bar charts

#### Bar charts are good

A bar chart is an alternative to a frequency table that Epstein and Martin do not mention in Chapter 6. It is easiest to use the `DataFrame.plot.bar()` function. For horizontal bars, replace `bar` with `barh`.
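The following sketch shows the idea on a small made-up table of counts. The categories, numbers and labels are invented for illustration and are not taken from the ICT dataset.

```python
import pandas as pd

# Toy counts, invented for illustration; not taken from the ICT dataset.
counts = pd.DataFrame({"count": [12, 7, 3]}, index=["low", "middle", "high"])

# One horizontal bar per row; use counts.plot.bar() for vertical bars.
ax = counts.plot.barh(legend=False)
ax.set_xlabel("Number of observations")
ax.set_ylabel("Category")
ax.set_title("Observations per category")
```

The plotting call returns a matplotlib `Axes` object, which is why the labels and title can be set on `ax` afterwards.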

#### Pie charts are bad

It is true that one can also use a [pie chart](https://en.wikipedia.org/wiki/Pie_chart), but pie charts are generally considered a poor visualisation method. Bar charts are usually more insightful because the lengths of bars are easier to compare than the sizes of pie slices.

In [None]:
# Plot the absolute frequency of crimRank in a horizontal bar chart
# Hint: do not forget the axis labels and the chart title.


### 2.3. Histograms

A histogram is a representation of the distribution of numerical data. The x-axis is divided into intervals (bins) of values, and the y-axis shows how many observations fall into each bin.
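A minimal sketch on made-up numbers is shown below. The column name `value` and the choice of `bins=4` are assumptions for illustration, and the data are not taken from the ICT dataset.

```python
import pandas as pd

# Toy values, invented for illustration; not taken from the ICT dataset.
df = pd.DataFrame({"value": [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]})

# bins=4 splits the observed range (1 to 9) into four equal-width intervals.
ax = df["value"].plot.hist(bins=4)
ax.set_xlabel("value")
ax.set_ylabel("Frequency")
```

Changing `bins` can change the shape of the histogram noticeably, so it is worth trying a few values before drawing conclusions.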

In [None]:
# Plot a histogram of sentence


#### Kernel Density Estimate plot

Alternatively, you can make a kernel density estimate (KDE) plot. Think of it as a smoothed, continuous version of a histogram. Use the `DataFrame.plot()` function and set `kind = 'kde'`. Read the [Wikipedia page on kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) for more (technical) background information. The [example](https://en.wikipedia.org/wiki/Kernel_density_estimation#Example) is easiest to understand.

One of the advantages of a KDE plot is that it is much easier to visually infer whether the distribution is distinctly skewed.
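To illustrate what "smoothed" means here, the sketch below computes a Gaussian kernel density estimate by hand with NumPy. It is only an illustration of the idea: the sample, the grid and the fixed `bandwidth` are invented, whereas pandas relies on SciPy and chooses a bandwidth automatically when you set `kind = 'kde'`.

```python
import numpy as np

# Toy sample, invented for illustration; not taken from the ICT dataset.
sample = np.array([2.0, 3.0, 3.5, 4.0, 10.0])
bandwidth = 1.0                       # hypothetical smoothing width
grid = np.linspace(-5.0, 17.0, 441)   # points at which the density is evaluated

# Place one Gaussian bump on each observation and average the bumps.
z = (grid[:, None] - sample[None, :]) / bandwidth
bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
density = bumps.mean(axis=1)

# The area under a density curve should be close to 1.
dx = grid[1] - grid[0]
area = density.sum() * dx
print(round(area, 3))
```

A larger `bandwidth` gives a smoother curve that hides detail; a smaller one gives a bumpier curve that follows individual observations more closely.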

In [None]:
# Make a KDE plot of sentence


#### Cumulative Distribution Function (CDF)

An alternative way to describe the distribution of data is to plot a Cumulative Distribution Function (CDF). This can be easily done with the `cumulative` parameter of the `DataFrame.hist()` function.
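The sketch below shows one way to do this on made-up numbers. The column name `value` is invented, the data are not taken from the ICT dataset, and `density=True` is an extra assumption so that the curve ends at 1 rather than at the number of observations.

```python
import pandas as pd

# Toy values, invented for illustration; not taken from the ICT dataset.
df = pd.DataFrame({"value": [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]})

# cumulative and density are passed through to matplotlib's hist function.
axes = df.hist(column="value", bins=4, cumulative=True, density=True)

# DataFrame.hist() returns an array of Axes, even for a single column.
ax = axes.ravel()[0]
ax.set_xlabel("value")
ax.set_ylabel("Cumulative proportion")
```

Each bar shows the proportion of observations that fall in that bin or in any bin to its left.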

In [None]:
# Plot a CDF of sentence


#### Which tribunal's sentences are more dispersed?

Epstein and Martin ask at page 130 which tribunal's sentences are more dispersed. One way to investigate this question is to make a histogram of `sentence` for each category of `tribunal`. Use the `DataFrame.hist()` function here.
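As a sketch of grouped histograms, the cell below uses a made-up DataFrame with two groups. The column names `group` and `value` and all numbers are invented for illustration; `sharex=True` is an optional choice that puts both panels on the same x-axis so that their spread can be compared by eye.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; not taken from the ICT dataset.
df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5,
    "value": [4, 5, 5, 6, 5, 1, 3, 5, 7, 9],
})

# One histogram panel per category of 'group'.
axes = df.hist(column="value", by="group", bins=5, sharex=True)
n_panels = len(np.ravel(axes))

# A numerical check on dispersion to compare with the visual impression.
spread = df.groupby("group")["value"].std()
print(spread)
```

Both groups have the same mean here, so any difference between the panels reflects dispersion only.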

In [None]:
# Make histograms of sentence for each tribunal.
# Hint: use the `by` parameter.


Perhaps you'll find it difficult to answer the question on the basis of the histograms. Why not try comparing KDE plots for each tribunal?

In [None]:
# Step 1: subset the data by tribunal using the DataFrame.loc[] indexer


# Step 2: make the KDE plots


# Step 3: define the legend, the label for the x-axis and show the plot


### 2.4. Boxplots

You can visualise the distribution of data effectively in a boxplot. Epstein and Martin discuss boxplots at pages 247 and 262 to 264. Go to [Wikipedia](https://en.wikipedia.org/wiki/Box_plot) to learn more about boxplots.

You can use the `DataFrame.boxplot()` function here.

By default, the whiskers extend to the most extreme data points that lie within 1.5 times the interquartile range (IQR) from the edges of the box. Points beyond the whiskers are drawn individually as outliers; set the parameter `showfliers` to `False` to hide them. See [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html) for more info.

By including the parameter `by = 'category'` in the code, you can make a separate boxplot for each category.
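The 1.5 times IQR rule can be checked by hand. The sketch below does so for a small made-up column and then draws the corresponding boxplot; the column name `value`, the numbers and the names `lower_fence` and `upper_fence` are invented for illustration and are not taken from the ICT dataset.

```python
import pandas as pd

# Toy values, invented for illustration; not taken from the ICT dataset.
df = pd.DataFrame({"value": [2, 4, 5, 6, 7, 8, 30]})

q1 = df["value"].quantile(0.25)   # lower edge of the box
q3 = df["value"].quantile(0.75)   # upper edge of the box
iqr = q3 - q1

# Whiskers cannot extend beyond these limits.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations outside the limits are drawn as individual points.
is_outlier = (df["value"] < lower_fence) | (df["value"] > upper_fence)
outliers = df.loc[is_outlier, "value"]
print(q1, q3, lower_fence, upper_fence, outliers.tolist())

ax = df.boxplot(column="value")
```

In this toy example only the value 30 lies outside the limits, so the upper whisker stops at 8 and 30 appears as a separate point.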

Let's analyse `sentence` by `tribunal` again.

In [None]:
# Make a boxplot of the column sentence


In [None]:
# Make boxplots of sentence for each category in tribunal


### 2.5. Scatterplots

If you want to make a scatterplot, you can use the following bit of code:

    DataFrame.plot.scatter(x = 'VARIABLE', y = 'VARIABLE')

`x` and `y` name the columns whose values are used as coordinates of the points on the horizontal and vertical axis, respectively.
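A minimal sketch with made-up data is given below. The column names `hours` and `score` and all numbers are invented for illustration and are not taken from the ICT dataset; the correlation coefficient is added only as a numerical companion to the plot.

```python
import pandas as pd

# Toy data, invented for illustration; not taken from the ICT dataset.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 60, 68, 71],
})

# Each row becomes one point at (hours, score).
ax = df.plot.scatter(x="hours", y="score")
ax.set_title("Score against hours")

# Pearson correlation between the two columns.
r = df["hours"].corr(df["score"])
print(round(r, 2))
```

When many points share the same coordinates, passing a value below 1 for matplotlib's `alpha` parameter makes overlapping points visible as darker spots.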

#### Application to the data
One would expect that sentence length increases with the number of counts on which the defendant was found guilty.

In [None]:
# Make an insightful scatterplot that explores the expectation that
# sentence length increases with the number of counts on which the defendant was found guilty.


## 3. Visualizations: the Good, the Bad and the Ugly

Information visualisation is not as easy as it might seem. Delivering a "good" visual representation of the information you extracted from data is not always straightforward, especially when examining complex datasets.

When drawing a chart, it is important to assess its quality along the following dimensions:

* Does it do a good job or a bad job of presenting the data? Why?
* Does the presentation appear to be biased?
* Are the axes labeled in a clear and informative way?
* Is the color used effectively?
* Is there *chartjunk* in the figure (i.e., visual elements that are not necessary to comprehend the represented information or that distract the viewer from this information)?

Let's take an example. What is wrong with the following visualisations? How can we make them better?

<p align="center">
    <img src='../img/bad2.png' height="400" width="auto">
    <img src='../img/bad3.jpg' height="400" width="auto">
    <img src='../img/bad1.png' height="400" width="auto">
</p>

**1st figure**:
* ...
* ...

**2nd figure**:
* ...
* ...

**3rd figure**:
* ...
* ...

It's your turn now. Search online for two great and two bad visualisations. Assess each one of them along the previously mentioned dimensions. If you cannot find examples of bad visualisations, visit http://viz.wtf.