# Notebook 4: Pandas and visualization I

Here we will start using Pandas, the standard Python library for working with columnar (tabular) data. There is a substantial learning curve, however. If you want to learn more about Pandas, the project site is a useful starting point: http://pandas.pydata.org/

## Installation and downloads

In [None]:
# Download the example dataset that we'll use for this class.
# Pandas is already installed in Colab by default because it is so frequently used.
!pip -q install palmerpenguins

<img src = 'https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png'>
<img src = 'https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png'>

In [None]:
import numpy as np
import pandas as pd
# Visualization package we are going to use today
import seaborn as sns


## Introduction

Pandas is a library for working with tabular data. Its dataframe was originally modeled on R's `data.frame`, but with a slightly different grammar and some different functionality.


In [None]:
from palmerpenguins import load_penguins
df = load_penguins()
print(type(df))
df

* A 2-D object
* Has row (index) and column names
* The orientation of rows vs columns matters a lot
* Generally, you want features as columns and observations as rows
* Features are variables: the things you measure, whether quantitatively or qualitatively. Observations are the individual data points; in this case, each observation is one penguin
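To make the rows-as-observations layout concrete, here is a minimal sketch that builds a tiny dataframe by hand. The name `toy` and all of its values are made up for illustration; they are not taken from the penguins dataset.

```python
import pandas as pd

# Hypothetical toy data: each row is one observation (a penguin),
# each column is one feature (something we measured).
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
        "bill_length_mm": [39.1, 46.1, 46.5, 38.2],
        "body_mass_g": [3750, 4500, 3500, 3600],
    }
)

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names
print(list(toy.index))    # row names (the index)

# Transposing swaps the orientation: features become rows.
print(toy.T.shape)
```

With four observations and three features, the transposed frame has three rows and four columns, which is usually not the orientation plotting and statistics functions expect.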

A few very simple methods and attributes give us a good first look at the dataframe:

In [None]:
# Method (called with parentheses): summary statistics for the numeric columns
df.describe()

In [None]:
# Attribute (no parentheses): the data type of each column
df.dtypes
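As a rough sketch of a few more first-glance tools, the example below uses a small hand-made dataframe (the name `toy` and its values are hypothetical) so that it runs without downloading anything. It shows the difference between an attribute such as `shape` and methods such as `head()` and `isna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with one missing value (np.nan).
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "bill_length_mm": [39.1, np.nan, 46.1, 50.0],
        "year": [2007, 2008, 2008, 2009],
    }
)

print(toy.shape)         # attribute: (rows, columns)
print(toy.head(2))       # method: first two rows
print(toy.isna().sum())  # method chain: number of missing values per column

# describe() skips missing values, so a column's count can be
# smaller than the number of rows.
summary = toy.describe()
print(summary.loc["count"])
```

Note that, by default, `describe()` on a dataframe with mixed column types reports only the numeric columns, so `species` does not appear in `summary`.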

## Several different ways to subset/query a dataframe

### By Column

In [None]:
# Can query by column name, either a single column or multiple columns
print(df.columns)  # printed explicitly: a cell only displays its last expression
df[['bill_length_mm','island']]

In [None]:
print(type(df['bill_length_mm']))

* A `pd.Series` is a single column of data
* It contains row names but no column names; the row names are always referred to as `pd.Series.index`
* It can have a `pd.Series.name` attribute that serves as the column name
* It works a lot like a Python dictionary
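The points above can be sketched with a small hand-made example. The row names (`penguin_a` and so on) and the values are hypothetical and chosen only to make the index visible.

```python
import pandas as pd

toy = pd.DataFrame(
    {"bill_length_mm": [39.1, 46.1, 46.5]},
    index=["penguin_a", "penguin_b", "penguin_c"],  # hypothetical row names
)

s = toy["bill_length_mm"]  # selecting one column returns a Series
print(s.name)              # the column name it came from
print(list(s.index))       # the row names

# Dictionary-like behaviour: look up by label, test membership, iterate.
print(s["penguin_b"])
print("penguin_a" in s)    # membership checks the index, like dict keys
for label, value in s.items():
    print(label, value)
```

Unlike a dictionary, a Series also supports vectorized arithmetic and positional access, so the analogy only covers label-based lookup.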

What if we want a better view of the distribution of this single variable?

In [None]:
# First Python plot!
# We can customize the title, figure size, and x/y axis labels,
# and save the figure in different formats (e.g. PDF or JPEG).
sns.histplot(df['bill_length_mm'],kde=True,bins=20)

A very good resource for data visualization in Python: https://python-graph-gallery.com

### Exercise 1:

In [None]:
#Another common example dataset
iris = sns.load_dataset("iris")

<img src = 'https://ars.els-cdn.com/content/image/3-s2.0-B9780128147610000034-f03-01-9780128147610.jpg'>

In [None]:
#Look at it and make a distribution plot for one of the variables

### By Row

In [None]:
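# Similarly, we can query for a single row or multiple rows by index label
print(df.loc[3])  # printed explicitly: a cell only displays its last expression
df.loc[0:3]  # label slices include both endpoints, so this returns 4 rows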
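The behaviour of `.loc` can be sketched on a small hand-made dataframe (the name `toy` and its values are hypothetical). The main surprise for people used to Python lists is that a label slice includes its end label.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
        "year": [2007, 2008, 2008, 2009, 2009],
    }
)

row = toy.loc[3]                       # one label -> a Series holding that row
block = toy.loc[0:3]                   # label slice -> a DataFrame, end label INCLUDED
picked = toy.loc[[0, 4], ["species"]]  # lists of row labels and column names

print(type(row).__name__, type(block).__name__)
print(len(block))                      # labels 0, 1, 2 and 3
print(picked)
```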

### By index

In [None]:
# Treat all the values in df as a 2-D NumPy array
print(df.values)
# Index by integer position: third row, third column
df.iloc[2,2]
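The difference between position-based `.iloc` and label-based `.loc` is easiest to see when the row labels are not simply 0, 1, 2. The sketch below assumes a hypothetical dataframe with labels 10, 20 and 30.

```python
import pandas as pd

# Hypothetical frame whose row labels are NOT 0, 1, 2, ...
toy = pd.DataFrame(
    {"bill_length_mm": [39.1, 46.1, 46.5], "year": [2007, 2008, 2009]},
    index=[10, 20, 30],
)

print(toy.iloc[0, 1])       # position: first row, second column
print(toy.loc[10, "year"])  # label: row labelled 10, column "year"

# Position slices exclude the end, like Python lists and NumPy arrays.
print(len(toy.iloc[0:2]))

# to_numpy() returns the values as a 2-D NumPy array, like .values
arr = toy.to_numpy()
print(arr.shape)
```

Here `toy.iloc[0, 1]` and `toy.loc[10, "year"]` address the same cell, once by position and once by label.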

### By condition

In [None]:
df.query("year > 2008")

In [None]:
# Boolean vectors: one True/False value per row
yr_vec = df['year'] > 2008
island_vec = df['island'] == 'Biscoe'

combine_vec = np.logical_and(yr_vec,island_vec)
df[combine_vec]
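Boolean vectors can also be combined with the operators `&` (and), `|` (or) and `~` (not), which for an "and" gives the same result as `np.logical_and`. A minimal sketch on a made-up dataframe, including the equivalent `query` string:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "island": ["Biscoe", "Dream", "Biscoe", "Torgersen"],
        "year": [2007, 2009, 2009, 2009],
    }
)

# Comparing a column with a value gives a boolean Series, one entry per row.
late = toy["year"] > 2008
biscoe = toy["island"] == "Biscoe"

both = toy[late & biscoe]    # AND
either = toy[late | biscoe]  # OR
not_biscoe = toy[~biscoe]    # NOT

# The same AND filter written as a query string.
same = toy.query("year > 2008 and island == 'Biscoe'")
print(both.equals(same))
```

When writing the comparisons inline, each one needs its own parentheses, for example `toy[(toy["year"] > 2008) & (toy["island"] == "Biscoe")]`, because `&` binds more tightly than `>` and `==`.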

Histograms allowed us to visualize the distribution of a single feature. What about the relationship between features? How do we decide on a filtering condition that involves multiple features?

In [None]:
sns.jointplot(data=df, x="bill_length_mm", y="bill_depth_mm")

In [None]:
sns.pairplot(df)

### Exercise 2:

In [None]:
## Do a pairplot between all variables for the iris dataset 

## Basic stats/functions for dataframe

In [None]:
#unique
df['species'].unique()

In [None]:
#value counts
df['species'].value_counts()

In [None]:
#mode/mean/sum
df['bill_length_mm'].mode()
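A small sketch of how these summary methods behave, using made-up numbers rather than the penguins data. Two details are worth knowing: missing values are skipped by default, and `mode()` returns a Series rather than a single number because several values can be tied.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one missing value.
s = pd.Series([40.0, 42.0, 42.0, 50.0, np.nan], name="bill_length_mm")

print(s.mean())    # average of the four non-missing values
print(s.sum())     # total of the four non-missing values
print(s.median())

# mode() returns a Series because there can be ties.
print(s.mode().tolist())
tied = pd.Series([1, 1, 2, 2, 3])
print(tied.mode().tolist())
```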

Although summary statistics give you a certain level of understanding of how the data are distributed, they can be misleading. In general, publishers will no longer even let you publish a bar plot like the bottom-right one, because it clearly does not show the data faithfully.


<img src="./figure/xkcd_plot.jpeg" width=700/>
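To illustrate why a single summary number can hide the shape of the data, here is a minimal sketch with two made-up groups (the names `tight` and `spread` and all values are hypothetical). Both groups have the same mean, so bars showing only the mean would have the same height, yet the underlying values look nothing alike.

```python
import pandas as pd

# Made-up measurements: same mean, very different spread.
toy = pd.DataFrame(
    {
        "group": ["tight"] * 4 + ["spread"] * 4,
        "value": [49, 50, 50, 51, 10, 20, 80, 90],
    }
)

# Select each group's values with a boolean condition.
tight = toy.loc[toy["group"] == "tight", "value"]
spread = toy.loc[toy["group"] == "spread", "value"]

print(tight.mean(), spread.mean())  # identical means
print(tight.std(), spread.std())    # very different standard deviations
print(tight.min(), tight.max())
print(spread.min(), spread.max())
```

Plots that draw every point or the full distribution, such as swarm, violin and box plots, make this kind of difference visible.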




In [None]:
# Different ways of visualizing the same data can lead to different interpretations
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(15, 12), ncols=2, nrows=2)
sns.swarmplot(data=df,x='species',y='body_mass_g',ax=ax[0,0],hue='species')
sns.violinplot(data=df,x='species',y='body_mass_g',ax=ax[0,1])
sns.boxplot(data=df,x='species',y='body_mass_g',ax=ax[1,0])
sns.barplot(data=df,x='species',y='body_mass_g',ax=ax[1,1])
plt.show()

### Exercise 3:

In [None]:
# Plot a variable in the iris dataset with different plot types