# Introduction to Python - Basic Data Visualization

1. `seaborn` - quick data visualization
    - Categorical
    - Continuous
    - Categorical vs. continuous
    - Continuous vs. continuous
    - \>2 variables
    - "meta" plotting functions
2. `matplotlib` - full control of figures
    - reproduce a seaborn figure
3. Customize your plots
    - with `seaborn`
    - with `matplotlib`
4. Create and save your plot 

In Python, there are two main packages used to visualize data: [`matplotlib`](https://matplotlib.org/) and [`seaborn`](https://seaborn.pydata.org/).
Although `matplotlib` laid the foundation for data visualization in Python, making quick plots with it can be cumbersome. Hence, the `seaborn` library was born as a wrapper around `matplotlib`, designed to make plotting simple and quick.

Their usage is so widespread that many other packages, such as `pandas`, rely on them to produce their visualizations.

In this session, we will use the [`palmerpenguins`](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) dataset to work out how we can easily visualize data with these libraries and how they can be further leveraged to edit every bit of your plot.

In [None]:
# first, load the essential packages to read and wrangle tables
import pandas as pd
import numpy as np

In [None]:
# read the table from a local copy of the dataset and drop rows with missing values
data = pd.read_csv("data/penguins.csv")
data = data.dropna()
data

This dataset contains different types of information on these penguins. We can classify the variables as either categorical or continuous. Below you'll find a summary of the continuous (numeric) variables; by default, `describe()` skips the categorical ones.

In [None]:
data.describe()
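
To see which variables fall into each type, one option is to split the columns by dtype. The following is a minimal sketch on a small made-up table standing in for `data`; the values are invented and the `sex` column name is an assumption about the CSV layout.

```python
import pandas as pd

# Toy stand-in for the penguins table; values are invented for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "sex": ["male", "female", "female", "male"],  # assumed column name
    "body_mass_g": [3750.0, 3800.0, 5000.0, 3700.0],
})

# Numeric columns are treated as continuous, everything else as categorical
continuous_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# describe() reports count, mean, std and quartiles for the numeric columns
numeric_summary = toy[continuous_cols].describe()

# For a categorical column, the counts per class are usually more informative
species_counts = toy["species"].value_counts()
```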

## `seaborn` - quick data visualization

As mentioned above, `seaborn` was born as a `matplotlib` wrapper to facilitate data visualization from dataframes.

Now, we will explore some of the different functions that we can use to quickly explore the dataset by asking questions.

But first, as usual, we need to load the package:

In [None]:
import seaborn as sns

### Categorical - `countplot` and `barplot`

Categorical variables tell us how observations can be grouped into buckets or classes. Bar plots are a great tool for visualizing them.

**How many penguins of each species are there?**

**1. Use `sns.countplot` setting the 'data' and 'x' arguments.**

In [None]:
# try using sns.countplot using the parameters 'data' and 'x'
# to introduce the dataframe and the variable to plot counts from, respectively

**2. Save your plot into an object called 'g'.**

**3. Relabel the X and Y axes using the methods `.set_xlabel` and `.set_ylabel`**
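
One possible way to work through steps 1 to 3 is sketched below. It uses a tiny made-up table in place of `data`, and the axis label texts are arbitrary choices.

```python
import pandas as pd
import seaborn as sns

# Toy stand-in for `data`; one row per (invented) penguin
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"]
})

# countplot tallies the rows per category itself and returns a matplotlib Axes
g = sns.countplot(data=toy, x="species")

# the returned Axes exposes the usual matplotlib label setters
g.set_xlabel("Species")
g.set_ylabel("Number of penguins")
```

Because `g` is a regular matplotlib `Axes`, any other `Axes` method can be applied to it in the same way.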

Note that we didn't need to provide the counts for each species to the function! That sped up the process! However, for the next visualizations we'll need to be more explicit... So, let's count the classes first.

**4. Switch to `sns.barplot` to visualize the number of penguins of each species.** First, you'll need to create a new dataframe counting the number of times each penguin species appears. Try combining the `pandas.DataFrame` methods `.groupby`, `.size` and `.reset_index`. This time, you'll have to pass both axis variables to the plotting function.

In [None]:
# alternatively, count the penguins for each species and plot
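
A minimal sketch of the counting step followed by the bar plot, again on a made-up table. The column name `n_penguins` is a hypothetical choice introduced for this example.

```python
import pandas as pd
import seaborn as sns

# Toy stand-in for `data`
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"]
})

# groupby + size counts the rows per species; reset_index turns the resulting
# Series back into a dataframe with a named count column
counts = toy.groupby("species").size().reset_index(name="n_penguins")

# barplot needs both axes: the category on x and the precomputed count on y
g = sns.barplot(data=counts, x="species", y="n_penguins")
g.set_ylabel("Number of penguins")
```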

**How many penguins from each island and species are there?**

Note that colors and x-axis are redundant in the plot above. We can exploit that to visualize more information: the `hue` parameter further splits the bars by a second category. See how grouping our data makes it simpler to ask more complicated questions.

In [None]:
# try using the 'hue' parameter in sns.barplot
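
As a sketch of how the grouped counts and `hue` could fit together, the example below counts penguins per island and species on a made-up table. The `island` column name is an assumption about the dataset, and `n_penguins` is a hypothetical name.

```python
import pandas as pd
import seaborn as sns

# Toy stand-in for `data`; rows are invented
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Biscoe", "Dream", "Dream", "Dream",
               "Torgersen", "Torgersen"],
    "species": ["Adelie", "Gentoo", "Gentoo", "Adelie", "Chinstrap", "Chinstrap",
                "Adelie", "Adelie"],
})

# count rows for every observed (island, species) combination
counts = toy.groupby(["island", "species"]).size().reset_index(name="n_penguins")

# x places one group of bars per island; hue splits each group by species
g = sns.barplot(data=counts, x="island", y="n_penguins", hue="species")
g.set_ylabel("Number of penguins")
```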

### Continuous - `histplot` and `kdeplot`

Continuous variables give us numerical information on the observations, providing a distribution of values for a certain feature, such as a penguin's bill length.

**What is the distribution of bill lengths?**

In [None]:
# the 'bins' parameter sets the number of bins used to partition our continuous variable

**What is the distribution of bill lengths across species?**

### Categorical vs. Continuous - `boxplot`, `violinplot`,`stripplot`,`swarmplot`

There is also a family of plots for comparing how a continuous variable differs between groups.

**What are the distributions of bill length between sexes across species?**

Now try using violin plots.

And strip plots, a.k.a. jitter plots.

Or swarm plots.
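
The four plot types take the same core arguments, so they can be compared side by side. The sketch below does this on randomly generated stand-in data; the `sex` column name is an assumption about the dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in for `data` (not real measurements)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], 20),
    "sex": np.tile(["female", "male"], 30),  # assumed column name
    "bill_length_mm": rng.normal(45, 5, 60),
})

# arguments common to all four plotting functions
shared = dict(data=toy, x="species", y="bill_length_mm", hue="sex")

fig, axes = plt.subplots(2, 2, figsize=(9, 7))
sns.boxplot(ax=axes[0, 0], **shared)
sns.violinplot(ax=axes[0, 1], **shared)
# dodge=True separates the hue levels instead of overlaying them
sns.stripplot(ax=axes[1, 0], dodge=True, **shared)
sns.swarmplot(ax=axes[1, 1], dodge=True, **shared)
fig.tight_layout()
```

Box and violin plots summarize each group, while strip and swarm plots draw every observation, which tends to be more informative for small groups.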

### Continuous vs. Continuous - `scatterplot`, `kdeplot`, `jointplot`

Finally, sometimes we are interested in how two continuous variables covary.

**What is the relationship between bill length and body mass in penguins of different species?**

With `jointplot` we get three for the price of one!

### >2 variables

**What is the relationship between all continuous variables considering all penguin species?**

#### `pairplot`

#### `heatmap`
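
One common use of `heatmap` for this question is to display the correlation matrix of the continuous variables. The sketch below assumes that approach and uses randomly generated stand-in data; `flipper_length_mm` is an assumed column name.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in for `data` (not real measurements)
rng = np.random.default_rng(0)
n = 60
body_mass = rng.normal(4200, 600, n)
toy = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], n // 3),
    "body_mass_g": body_mass,
    "flipper_length_mm": 140 + 0.015 * body_mass + rng.normal(0, 5, n),  # assumed name
    "bill_length_mm": 30 + 0.004 * body_mass + rng.normal(0, 3, n),
})

# pairwise Pearson correlations between the numeric columns only
corr = toy.select_dtypes(include="number").corr()

# fix the color scale to [-1, 1] so that 0 sits at the middle of the colormap
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
```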

#### `clustermap`

### "meta" plotting functions

#### `catplot`

**How many penguins from each island and species of each sex are there?**

In [None]:
# try using the 'hue', 'col' and 'col_wrap' parameters in sns.catplot
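
A possible sketch of this exercise: `col` creates one panel per island, `col_wrap` limits how many panels sit in a row, and `hue` splits the bars by sex. The table is made up, and the `island` and `sex` column names are assumptions about the dataset.

```python
import pandas as pd
import seaborn as sns

# Toy stand-in for `data`; rows are invented
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Adelie",
                "Chinstrap", "Chinstrap", "Adelie", "Adelie", "Adelie"],
    "island": ["Biscoe", "Biscoe", "Biscoe", "Biscoe", "Dream",
               "Dream", "Dream", "Dream", "Torgersen", "Torgersen"],
    "sex": ["female", "male", "female", "male", "female",
            "female", "male", "male", "female", "male"],
})

# kind="count" makes catplot behave like countplot inside every panel
g = sns.catplot(data=toy, x="species", hue="sex", col="island",
                col_wrap=2, kind="count")
g.set_axis_labels("Species", "Number of penguins")
```

Unlike `countplot`, `catplot` returns a `FacetGrid` that manages the whole figure, so labels are set through grid methods such as `set_axis_labels`.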

#### `lineplot`

**What is the relationship between body mass and bill length across species and sexes?**

## `matplotlib` - full control of the plot

In [None]:
import matplotlib.pyplot as plt

### reproduce a `seaborn` figure
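
As an illustration, the species count plot can be rebuilt with `matplotlib` alone. In this sketch the counting that `sns.countplot` did for us has to be done by hand; the table is a made-up stand-in for `data`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for `data`
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"]
})

# count the rows per species, sorted alphabetically by species name
counts = toy["species"].value_counts().sort_index()

# create the figure and axes explicitly, then draw one bar per species
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(list(counts.index), counts.values)
ax.set_xlabel("species")
ax.set_ylabel("count")
```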

## Customize your plots

### `matplotlib`

In [None]:
import matplotlib.font_manager as font_manager
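
A small sketch of the kind of element-level control `matplotlib` offers, including a font defined through `font_manager`. The data points, font settings and title text are arbitrary choices for illustration.

```python
import matplotlib.font_manager as font_manager
import matplotlib.pyplot as plt

# describe a font once, then reuse it wherever text is drawn
title_font = font_manager.FontProperties(family="serif", weight="bold", size=14)

fig, ax = plt.subplots(figsize=(4, 3))
# three invented points, just to have something on the axes
ax.scatter([3500, 4200, 5000], [38.0, 46.5, 48.0], color="tab:blue", alpha=0.7)

ax.set_title("Bill length vs. body mass", fontproperties=title_font)
ax.set_xlabel("Body Mass (g)")
ax.set_ylabel("Bill Length (mm)")

# hide the top and right borders and shrink the tick labels
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.tick_params(axis="both", labelsize=8)
```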

### `seaborn`
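
On the `seaborn` side, styling is usually applied through themes, palettes and helper functions rather than by editing individual elements. A minimal sketch on a made-up table, with the style and palette names chosen arbitrarily from those that ship with `seaborn`:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for `data`; values are invented
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "body_mass_g": [3600, 3900, 3700, 3800, 5000, 5300],
    "bill_length_mm": [38.5, 40.1, 48.7, 49.9, 46.2, 48.4],
})

# axes_style applies a style only to axes created inside the 'with' block
with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots(figsize=(4, 4))

sns.scatterplot(data=toy, x="body_mass_g", y="bill_length_mm",
                hue="species", palette="colorblind", s=60, ax=ax)

# despine removes the top and right borders of the axes
sns.despine(ax=ax)
```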

## Create and save your plot

In [None]:
help(plt.savefig)

In [None]:
# set your figure size
plt.figure(figsize=(4,4))
# place your plot here
g = sns.scatterplot(data=data, x="body_mass_g", y="bill_length_mm", hue="species", alpha=0.5)
g.set_xlabel("Body Mass (g)")
g.set_ylabel("Bill Length (mm)")
# save
plt.savefig("myfigure.png", dpi=300)
# show
plt.show()

![](myfigure.png)