# Visualization

[Effectively Using matplotlib](https://pbpython.com/effective-matplotlib.html)

[matplotlib FAQ](https://matplotlib.org/faq/usage_faq.html)

[How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/)

General advice:
- look at your data - manually inspect it
- any kind of results summary can be useful
- people are good at spotting visual problems
- log data to text files, then use notebooks to load and view that data from disk
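
As a minimal sketch of the last point, the snippet below writes a few made-up results to a CSV file and then loads them back with pandas, the way a notebook would. The file location, the column names `step` and `loss`, and the values are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path

import pandas as pd

# hypothetical log location; a real experiment would use a fixed results folder
log_path = Path(tempfile.mkdtemp()) / "results.csv"

# the "experiment" side: append one row of results per step
with log_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["step", "loss"])
    for step in range(5):
        writer.writerow([step, 1.0 / (step + 1)])  # made-up values

# the "notebook" side: load the log from disk and summarise it
results = pd.read_csv(log_path)
results.describe()
```

Keeping the log on disk means the notebook can be rerun or rewritten without repeating the experiment.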

Important visualization tools:
- matplotlib
- seaborn (wrapper around matplotlib)
- Plotly
- D3 (JavaScript)

## Types of charts

- line
- scatter
- histogram
- bar
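
The four chart types above can be sketched on a single figure with matplotlib. The data here is random and only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # seeded so the figure is reproducible
x = np.arange(50)
y = np.cumsum(rng.normal(size=50))  # a random walk

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))

# line: an ordered series, usually over time
axes[0, 0].plot(x, y)
axes[0, 0].set_title("line")

# scatter: relationship between two variables
axes[0, 1].scatter(rng.normal(size=50), rng.normal(size=50))
axes[0, 1].set_title("scatter")

# histogram: distribution of one variable
axes[1, 0].hist(rng.normal(size=500), bins=20)
axes[1, 0].set_title("histogram")

# bar: one value per category
axes[1, 1].bar(["a", "b", "c"], [3, 7, 5])
axes[1, 1].set_title("bar")

fig.tight_layout()
```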

## "Mistakes, we’ve drawn a few"
- The Economist discusses how they could improve charts they made in the past

https://medium.economist.com/mistakes-weve-drawn-a-few-8cdd8a42d368

### Truncating the scale

Truncating the scale or putting break points in it distorts the chart: differences look larger than they are

![](assets/truncating.png)
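
A small sketch of the effect, using two made-up values: the same bars are drawn once with a truncated y-axis and once with an axis that starts at zero.

```python
import matplotlib.pyplot as plt

labels = ["A", "B"]
values = [96, 100]  # made-up numbers, about 4% apart

fig, (ax_trunc, ax_full) = plt.subplots(ncols=2, figsize=(8, 3))

# truncated axis: B looks several times larger than A
ax_trunc.bar(labels, values)
ax_trunc.set_ylim(95, 101)
ax_trunc.set_title("truncated scale")

# axis from zero: the bars look almost the same, which matches the data
ax_full.bar(labels, values)
ax_full.set_ylim(0, 101)
ax_full.set_title("scale starts at zero")
```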

### Choosing scales to force relationships

![](assets/scales.png)
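
One way this can happen, assumed here because the notes only show an image, is a dual-axis chart where each axis range is picked independently. The sketch below uses two made-up series; `twinx()` adds the second y-axis.

```python
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(2010, 2020)
series_a = np.linspace(10, 12, 10)    # made-up series, rises by 20%
series_b = np.linspace(100, 101, 10)  # made-up series, rises by 1%

fig, ax_left = plt.subplots(figsize=(6, 3))
ax_right = ax_left.twinx()  # second y-axis sharing the same x-axis

ax_left.plot(years, series_a, color="tab:blue")
ax_right.plot(years, series_b, color="tab:red")

# choosing each range separately makes the two lines sit on top of
# each other, even though their relative changes are very different
ax_left.set_ylim(10, 12)
ax_right.set_ylim(100, 101)
```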

### Using a line chart to show trend

Instead, use dots for the individual points and a smoothed line for the trend

![](assets/trend.png)
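
A minimal sketch of the dots-plus-smoothed-line idea on synthetic data. The smoother is a centered rolling mean, and the window of 15 is an arbitrary choice for illustration, not something taken from the article.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.arange(100)
noisy = pd.Series(0.05 * x + rng.normal(scale=1.0, size=100))  # trend + noise

# centered rolling mean as a simple smoother; min_periods=1 avoids NaN at the edges
smoothed = noisy.rolling(window=15, center=True, min_periods=1).mean()

fig, ax = plt.subplots(figsize=(8, 3))
ax.scatter(x, noisy, s=10, alpha=0.5, label="observations")  # individual points
ax.plot(x, smoothed, color="tab:red", label="rolling mean")  # the trend
ax.legend()
```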


## Matplotlib - a tale of three (?) APIs
Matplotlib is one of the most well-known plotting libraries for Python. However, at the beginning, it can be difficult to wrap your head around.

It has 3 different APIs (ways of writing code to draw graphs).

- MATLAB-style / state-based interface (`plt.plot()`)
- object-oriented interface (`plt.subplots()`)
- pandas plotting interface (`df.plot()`), which draws with matplotlib

The two main abstractions in matplotlib are the **Figure** and **Axes**
- Figure = final image (can have many Axes)
- Axes = individual plot


![](assets/mpl-faq.png)
[From the matplotlib FAQ](https://matplotlib.org/faq/usage_faq.html)
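
The relationship between the two abstractions can be sketched directly: one Figure is created, and two Axes are added to it.

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 3))    # the Figure: the whole image
ax_left = fig.add_subplot(1, 2, 1)  # first Axes in a 1 x 2 grid
ax_right = fig.add_subplot(1, 2, 2) # second Axes

# each Axes is an individual plot with its own data, title and labels
ax_left.plot([0, 1, 2], [0, 1, 4])
ax_left.set_title("left Axes")
ax_right.plot([0, 1, 2], [4, 1, 0])
ax_right.set_title("right Axes")

# fig.axes lists the Axes that belong to the Figure,
# and each Axes keeps a reference back to its Figure
n_axes = len(fig.axes)
same_figure = ax_left.figure is fig
```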

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [None]:
# loading data to plot
kickstarter_projects = pd.read_csv("data/ks-projects-201801.csv", parse_dates=True)

In [None]:
kickstarter_projects.head()

### API One - `plt.plot()`

This is the pyplot-level API.

In this API, the figure is created automatically, and `plt.` calls always refer to the most recent figure.

This is a quick and dirty way to make a plot.

It is shown only for reference (and so that you can understand other people's code).

**It is not recommended.**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x = np.random.uniform(0, 100, size=100)
y = np.random.uniform(0, 100, size=100)
line = plt.plot(x)

#  to get access to the current figure and its axes objects
fig = plt.gcf()
axes = fig.axes  # a list of Axes

#  common operations
plt.title('API One')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
#plt.savefig('./one.png')

### API Two - `plt.subplots()`

The recommended API
- more explicit & clear
- more typing
- multiple axes on the same figure
- more options

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True)

In [None]:
# using the subplot syntax to show change in project type over year

In [None]:
kickstarter_projects.columns

In [None]:
# assign whole columns so their dtype becomes datetime64 (needed for .dt below)
kickstarter_projects["launched"] = pd.to_datetime(kickstarter_projects["launched"])
kickstarter_projects["deadline"] = pd.to_datetime(kickstarter_projects["deadline"])

In [None]:
# making new year column to filter on
kickstarter_projects.loc[:, "project_year"] = kickstarter_projects.loc[:, "deadline"].dt.year

In [None]:
kickstarter_projects.groupby("project_year")["ID"].count()

In [None]:
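# year_2016 has to exist before it can be grouped
year_2016 = kickstarter_projects.loc[kickstarter_projects.loc[:, "project_year"]==2016]
year_2016_project_types = year_2016.groupby("main_category").count()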

In [None]:
year_2010 = kickstarter_projects.loc[kickstarter_projects.loc[:, "project_year"]==2010]
year_2013 = kickstarter_projects.loc[kickstarter_projects.loc[:, "project_year"]==2013]
year_2016 = kickstarter_projects.loc[kickstarter_projects.loc[:, "project_year"]==2016]

year_2010_project_types = year_2010.groupby("main_category")["ID"].count()
year_2013_project_types = year_2013.groupby("main_category")["ID"].count()
year_2016_project_types = year_2016.groupby("main_category")["ID"].count()

fig, ax = plt.subplots(nrows=3, figsize=(15,10), sharex=True, sharey=True)

year_2010_project_types.plot(ax=ax[0], kind='bar')

year_2013_project_types.plot(ax=ax[1], kind='bar')

year_2016_project_types.plot(ax=ax[2], kind='bar')

ax[0].set_xlabel('')
ax[0].set_title('Projects by type - 2010')
ax[1].set_title('Projects by type - 2013')
ax[2].set_title('Projects by type - 2016')

fig;

### Exercise:
Plot the amount raised by successful and unsuccessful projects over time.

The `axes` object returned by `plt.subplots()` is a `np.ndarray` of Axes:

In [None]:
axes[0]
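
The shape of that array depends on the grid that was requested, which the sketch below illustrates. The `squeeze=False` argument is a standard `plt.subplots()` option that always returns a 2D array.

```python
import numpy as np
import matplotlib.pyplot as plt

# a 2 x 3 grid gives a 2D array, indexed as grid[row, col]
fig_grid, grid = plt.subplots(nrows=2, ncols=3)
grid_shape = grid.shape

# a single row or column gives a 1D array, indexed as column[i]
fig_col, column = plt.subplots(nrows=2, ncols=1)
column_shape = column.shape

# with no grid at all, a single Axes is returned instead of an array
fig_one, single = plt.subplots()

# squeeze=False always returns a 2D array, which keeps indexing uniform
fig_sq, always_2d = plt.subplots(squeeze=False)

# .flat iterates over every Axes regardless of the grid shape
for ax in grid.flat:
    ax.set_xlabel("x")
```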

### API Three - `df.plot()`

A combination of pandas and matplotlib: `df.plot()` draws with matplotlib and returns the Axes

In [None]:
tech_df = kickstarter_projects.loc[kickstarter_projects.loc[:, "main_category"]=="Technology"]
# count projects per sub-category, largest first; [1:] drops the single largest entry
sub_categories_tech = pd.DataFrame(tech_df.groupby("category")["ID"].count().sort_values(ascending=False)[1:])
sub_categories_tech.rename(columns={"ID":"Tech Type"}, inplace=True)

In [None]:
ax = sub_categories_tech.plot(kind='barh')
ax.legend().set_visible(False)
ax.set(title='Kickstarter Technology Projects by Sub-Type', 
       xlabel='Total Number of Projects', ylabel='Tech Type');

## Changing plot style

In [None]:
plt.style.available
plt.style.use('ggplot')
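
`plt.style.use()` changes the style for every plot that follows. As a sketch of a more local alternative, `plt.style.context()` applies a style only inside a `with` block.

```python
import matplotlib.pyplot as plt

# the style applies only to figures created inside the block
with plt.style.context("ggplot"):
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
    ax.set_title("styled with ggplot")

# figures created here use the previous settings again
fig_default, ax_default = plt.subplots(figsize=(6, 3))
ax_default.plot([0, 1, 2, 3], [0, 1, 4, 9])
ax_default.set_title("previous style")
```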

# Changing dimensionality for visualization

Most commonly, we want a 2D representation of data to be able to plot it.  

Often our data has more than two dimensions
- sometimes it has fewer (for example, plotting a 1D latent space in 2D)

## t-SNE

Can also be used to increase dimensionality!
- use case = transforming a 1D latent space of an autoencoder to 2D

In [None]:
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import seaborn as sns


digits = load_digits()

In [None]:
x = digits['data']
y = digits['target']
x.shape

In [None]:
tsne = TSNE(n_components=2)
trans = tsne.fit_transform(x)
trans.shape

In [None]:
# recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x=trans[:,0], y=trans[:,1], hue=y, legend='full', palette=sns.color_palette("bright", 10))

[How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/)

- cluster sizes & distances are unstable

The perplexity hyperparameter
- balances between the local & global aspects of the data
- original paper suggests t-SNE is robust to values between 5 and 50
- in practice, you should look at a few different perplexities
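
A sketch of looking at several perplexities side by side. To keep it fast, it uses only the first 300 digits; the subset size, the perplexity values and the fixed `random_state` are choices made for this illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
x_small = digits["data"][:300]    # subset, only to keep the run time short
y_small = digits["target"][:300]

perplexities = [5, 30, 50]  # illustrative values; must be below the sample count
fig, axes = plt.subplots(ncols=len(perplexities), figsize=(15, 4))

embeddings = {}
for ax, perplexity in zip(axes, perplexities):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(x_small)
    points = embeddings[perplexity]
    # colour each point by its digit label
    ax.scatter(points[:, 0], points[:, 1], c=y_small, cmap="tab10", s=10)
    ax.set_title(f"perplexity = {perplexity}")
```

Comparing the panels shows which structure persists across settings and which appears at only one perplexity.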


In [None]:
tsne = TSNE(n_components=2, perplexity=25)
trans = tsne.fit_transform(x)
sns.scatterplot(x=trans[:,0], y=trans[:,1], hue=y, legend='full', palette=sns.color_palette("bright", 10))

In [None]:
tsne = TSNE(n_components=2, perplexity=50)
trans = tsne.fit_transform(x)
sns.scatterplot(x=trans[:,0], y=trans[:,1], hue=y, legend='full', palette=sns.color_palette("bright", 10))

## PCA

Only used for dimensionality reduction

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
trans = pca.fit_transform(x)
sns.scatterplot(x=trans[:,0], y=trans[:,1], hue=y, legend='full', palette=sns.color_palette("bright", 10))

In [None]:
# fraction of the total variance captured by the two components
sum(pca.explained_variance_ratio_)
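
The same attribute can be used to see how much variance each additional component adds. The sketch below fits PCA with all components on the digits data and finds how many are needed to reach 90% of the variance; the 90% threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

x = load_digits()["data"]

# with no n_components given, PCA keeps every component
pca_full = PCA().fit(x)

# running total of the variance explained by the first k components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# smallest number of components whose cumulative ratio reaches 0.90
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)
```

If that number is far above 2, a 2D PCA plot is hiding most of the structure in the data, which is worth keeping in mind when reading the scatter plot.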