# Visualization for Empirical Analysis
---

### The most popular libraries for visualization in Python (The big four)  

- matplotlib.org
- seaborn.pydata.org
- bokeh.pydata.org
- plot.ly/python

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

**Histogram**  
A histogram is a graphical representation of the distribution of numerical data. It takes a single numerical variable as input. The variable is cut into several bins, and the number of observations per bin is represented by the height of the bar. Note that the shape of the histogram can change considerably with the number of bins you set, so try several values before drawing any conclusion. A histogram is closely related to a density plot.

In [None]:
normal = np.random.normal(0,2,100)

In [None]:
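# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(normal, kde=True, stat='density')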
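To illustrate how much the number of bins changes the picture, here is a minimal sketch that draws the same sample with three bin counts using plain matplotlib. The seed and the bin values (5, 20, 50) are assumptions chosen only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)      # fixed seed so the figure is reproducible
sample = rng.normal(0, 2, 100)      # same parameters as `normal` above

bin_settings = [5, 20, 50]          # assumed values, for illustration only
fig, axes = plt.subplots(1, len(bin_settings), figsize=(12, 3), sharey=True)

counts_per_setting = {}
for ax, n_bins in zip(axes, bin_settings):
    # ax.hist returns the bar heights, the bin edges and the drawn patches
    counts, edges, _ = ax.hist(sample, bins=n_bins)
    counts_per_setting[n_bins] = counts
    ax.set_title(f"{n_bins} bins")
```

With few bins the distribution looks blocky; with many bins the bars become noisy. Both extremes can hide the overall shape.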

**Density**  
A density plot shows the distribution of a numerical variable. It takes a single set of numeric values as input and is closely related to a histogram. Because it is such a common technique, most visualization libraries can draw it. It is highly recommended to experiment with the bandwidth argument so that you do not miss a specific pattern in the data.

In [None]:
normal2 = np.random.normal(2,3,100)

In [None]:
# `fill` replaces the deprecated `shade` argument (seaborn 0.11 and later)
sns.kdeplot(normal, color='r', fill=True)
sns.kdeplot(normal2, color='b', fill=True)

In [None]:
# recent seaborn versions require x and y to be passed as keyword arguments
sns.jointplot(x=normal, y=normal2)

**Boxplot**   
The boxplot is probably one of the most common types of graphic. It gives a compact summary of one or several numeric variables. The line that divides the box into two parts represents the median of the data. The ends of the box show the upper and lower quartiles. The extreme lines (the whiskers) show the highest and lowest values excluding outliers. Note that a boxplot hides the number of values behind each variable. It is therefore highly advised to print the number of observations, add the individual observations with jitter, or use a violin plot if you have many observations.

In [None]:
normal_df = pd.DataFrame({'value':normal,'dist':['normal']*len(normal)})
normal2_df = pd.DataFrame({'value':normal2,'dist':['normal2']*len(normal2)})
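# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([normal2_df, normal_df], ignore_index=True)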

In [None]:
sns.boxplot(x='dist', y='value', data=df)

In [None]:
sns.swarmplot(x='dist', y='value', data=df)
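The advice above (print the group sizes, overlay jittered observations, or use a violin plot) can be sketched as follows. The data is regenerated under a separate, hypothetical name (`demo_df`) so the example does not depend on earlier cells.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)      # fixed seed for reproducibility
demo_df = pd.concat([
    pd.DataFrame({'value': rng.normal(0, 2, 100), 'dist': 'normal'}),
    pd.DataFrame({'value': rng.normal(2, 3, 100), 'dist': 'normal2'}),
], ignore_index=True)

# Number of observations behind each box
group_sizes = demo_df.groupby('dist').size()
print(group_sizes)

fig, ax = plt.subplots(figsize=(6, 4))
# Violin shows the shape of each distribution; inner=None hides the default inner marks
sns.violinplot(x='dist', y='value', data=demo_df, inner=None, color='lightgrey', ax=ax)
# Jittered points show every individual observation on top of the violin
sns.stripplot(x='dist', y='value', data=demo_df, jitter=True, size=3,
              color='black', alpha=0.5, ax=ax)
```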

**Scatter Plot**  
A scatter plot displays the values of two variables in two dimensions. Each dot represents an observation, and its position on the X (horizontal) and Y (vertical) axes gives the values of the two variables. It is very useful for studying the relationship between the two variables. It is common to convey more information with colors or shapes (to show groups, or a third variable). It is also possible to map another variable to the size of each dot, which makes a bubble plot. If you have many dots and struggle with overplotting, consider using a 2D density plot.

In [None]:
sns.regplot(x=normal, y=normal2)
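One possible form of 2D density plot for the overplotting case is matplotlib's `hexbin`, which counts the points falling in each hexagonal cell. The sample size, the linear relationship and the grid size below are assumptions made up for this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)      # fixed seed for reproducibility
n_points = 5000                     # assumed size, large enough to cause overplotting
x_dense = rng.normal(0, 2, n_points)
y_dense = 0.5 * x_dense + rng.normal(0, 1, n_points)

fig, ax = plt.subplots(figsize=(6, 5))
# Each hexagon is colored by the number of points it contains
hb = ax.hexbin(x_dense, y_dense, gridsize=30, cmap='Blues')
fig.colorbar(hb, ax=ax, label='points per hexagon')
ax.set_xlabel('x')
ax.set_ylabel('y')

bin_counts = hb.get_array()         # one count per hexagon
```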

**Bubble Plot**  
A bubble plot is a scatter plot with a third dimension added: the value of an additional variable is represented by the size of the dots. You need three numerical variables as input: one is represented by the X axis, one by the Y axis, and one by the size. Do not forget to provide a legend that links the size to the value. Note that too many bubbles make the chart hard to read, so this type of representation is usually not recommended for large amounts of data. Last but not least, the area of each circle, not its radius, must be proportional to the value, to avoid exaggerating the variation in your data. In matplotlib, the `s` argument of `plt.scatter` is already an area (in points squared).

In [None]:
x = np.random.rand(15)
y = x+np.random.rand(15)
z = x+np.random.rand(15)
z=z*z

Change the color with `c` and the transparency with `alpha`. Here the color is mapped to the X axis value.

In [None]:
plt.scatter(x, y, s=z*2000, c=x, cmap="Blues", alpha=0.4, edgecolors="grey", linewidth=2)
plt.xlabel("the X axis")
plt.ylabel("the Y axis")
plt.title("A colored bubble plot")
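A size legend can be sketched with `legend_elements` on the object returned by `scatter` (available in matplotlib 3.1 and later). The scale factor of 500 and the variable names are assumptions for this example; `func` converts marker areas back to the original values so the legend labels are in data units.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)      # fixed seed for reproducibility
bx = rng.random(15)
by = bx + rng.random(15)
bz = (bx + rng.random(15)) ** 2     # third variable, shown as bubble area

scale = 500                         # assumed factor from data value to marker area
fig, ax = plt.subplots(figsize=(7, 5))
bubbles = ax.scatter(bx, by, s=bz * scale, c=bx, cmap='Blues',
                     alpha=0.4, edgecolors='grey', linewidth=2)

# Build legend entries from the marker sizes; func maps area back to the z value
handles, labels = bubbles.legend_elements(prop='sizes', num=4, alpha=0.4,
                                          func=lambda s: s / scale)
ax.legend(handles, labels, title='z value', labelspacing=2, borderpad=1.5)
ax.set_xlabel('the X axis')
ax.set_ylabel('the Y axis')
```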

**Heatmap**  
A heat map (or heatmap) is a graphical representation of data in which the individual values of a matrix are represented as colors. It is a bit like looking at a data table from above. It is very useful for giving a general view of numerical data, but not for extracting specific data points. Making a heat map is quite straightforward, as the examples below show. However, take care to understand the underlying mechanisms. You will probably need to normalise your matrix, choose a relevant colour palette, and use cluster analysis to permute the rows and columns of the matrix so that similar values are placed near each other.

In [None]:
df = pd.DataFrame(np.random.random((10,10)), columns=["a","b","c","d","e","f","g","h","i","j"])

In [None]:
sns.heatmap(df, annot=True, annot_kws={"size": 7})
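The normalisation and clustering steps mentioned above can be sketched as follows. This is one possible approach, not the only one: it assumes per-column min-max normalisation and uses `sns.clustermap`, which needs SciPy installed and applies hierarchical clustering with its default settings.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(5)      # fixed seed for reproducibility
raw = pd.DataFrame(rng.random((10, 10)), columns=list("abcdefghij"))

# Min-max normalisation: every column is rescaled to the range [0, 1]
normalised = (raw - raw.min()) / (raw.max() - raw.min())

# clustermap reorders rows and columns so that similar ones sit next to each other
grid = sns.clustermap(normalised, cmap='viridis', figsize=(6, 6))

# Row order after clustering (positions in the original matrix)
row_order = grid.dendrogram_row.reordered_ind
print(row_order)
```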

**Line chart**  
A line chart or line graph displays information as a series of data points called ‘markers’ connected by straight line segments. It is a basic type of chart common in many fields. It is similar to a scatter plot, except that the measurement points are ordered (typically by their x-axis value) and joined with straight line segments. A line chart is often used to visualize a trend in data over intervals of time (a time series), so the line is often drawn chronologically.

In [None]:
values = np.cumsum(np.random.randn(1000))

In [None]:
plt.plot(values)

In [None]:
plt.figure(figsize=(12,8))
plt.plot(values, color='green', marker='o', linestyle='dashed', linewidth=1, markersize=3, alpha=.6)

In [None]:
values = np.cumsum(np.random.randn(1000))
values2 = np.cumsum(np.random.randn(1000))

In [None]:
plt.figure(figsize=(10,6))
plt.plot(values)
plt.plot(values2)
#plt.savefig('lines.png')
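Since line charts are often drawn chronologically, here is a minimal sketch of the same kind of random walk plotted against a date index. The start date and daily frequency are hypothetical values chosen for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)      # fixed seed for reproducibility
# Hypothetical daily dates, for illustration only
dates = pd.date_range('2020-01-01', periods=1000, freq='D')
series = pd.Series(np.cumsum(rng.standard_normal(1000)), index=dates)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(series.index, series.values, label='random walk')
ax.set_xlabel('date')
ax.set_ylabel('value')
ax.legend()
fig.autofmt_xdate()                 # rotate the date labels so they do not overlap
```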