# Pandas tutorial

This notebook uses examples from the official [Pandas: Getting started tutorials](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html).

## Load the libraries that you plan to use

It's good practice to load every library you plan to use at the top of the notebook. This tells the reader which commands and outputs to expect. It also ensures that the required functions are available to every later code cell, because the imports run as the notebook's first set of instructions.

This notebook will use [`pandas`](https://pandas.pydata.org/) and [`matplotlib`](https://matplotlib.org/), two very popular libraries for analyzing and visualizing data, respectively. It will also use [`numpy`](https://numpy.org/) for calculating some values, and [`seaborn`](https://seaborn.pydata.org/) for some of the visualizations.

Run the cell below to load all the libraries for this notebook.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

## Load your data

You should store your data on the same server as your notebook whenever possible. Comma-separated values (or `csv`) files are a very common format for storing table-like information. The `read_csv` function in the `pandas` library (which the import cell above shortened to `pd`) can read a `csv` file if you pass it a string that contains the location of the file.

For example, the `pd.read_csv` command in the cell below will look in the `data` folder found next to this notebook, and load the file named `penguins.csv` as a dataframe named `penguins`.

In [None]:
penguins = pd.read_csv("data/penguins.csv", encoding='utf-8')

You can take a look at the first few rows of the dataframe by calling its name, followed by `.head()`. Running the cell below will show you the first 5 rows of data, as well as the corresponding column labels.

In [None]:
penguins.head()

This data is about the Palmer Penguins.

![](https://github.com/allisonhorst/palmerpenguins/raw/main/man/figures/lter_penguins.png)

Data were collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pallter.marine.rutgers.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/).


The data set contains measurements on over 300 penguins, including their species, the island they were found on, the length and depth of their bills, the length of their flipper, their weight, their sex, and the year they were measured.

![](https://github.com/allisonhorst/palmerpenguins/raw/main/man/figures/culmen_depth.png)

For more information, you can visit [palmerpenguins by Allison Horst](https://allisonhorst.github.io/palmerpenguins/), who also made the images in this notebook.

## Selecting columns

Once you have your data loaded, you can start to create subsets of that data set for analysis. 

For example, the following command will create a new dataframe named `island` which only contains the column labeled `"island"` from the `penguins` dataframe.

In [None]:
island = penguins[["island"]]
island.head()

And the following command will create a new dataframe named `mass_sex` which only keeps the columns related to the body mass and sex of the penguins.

In [None]:
mass_sex = penguins[["body_mass_g", "sex"]]
mass_sex.head()
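
The double square brackets in the cells above matter. As a small sketch, the block below uses a made-up three-row table named `toy` (a hypothetical stand-in for `penguins`, not the real data) to show that a list of labels returns a dataframe, while a single label returns a `pandas` Series.

```python
import pandas as pd

# Made-up stand-in for the penguins data (values are illustrative only)
toy = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Torgersen"],
    "body_mass_g": [3750.0, 3800.0, 3250.0],
    "sex": ["male", "female", "female"],
})

as_frame = toy[["island"]]   # a list of labels returns a DataFrame
as_series = toy["island"]    # a single label returns a Series

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series
```

Either form works for a quick look at the values, but methods that expect a table, such as selecting several columns at once, need the list form.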

## Selecting rows

You can also filter your dataframe to only include certain observations that meet particular criteria. Here are some common things you might want to do with this data set.

### Select rows if a column contains a specific categorical value

For example, the following command will only keep rows that are about male penguins.

In [None]:
male_penguins = penguins[ penguins["sex"] == "male" ]
male_penguins

### Select rows if a column contains one of several categorical values

You can select rows that have multiple categorical values. In this example, we are keeping only penguins that were measured in 2008 and 2009.

In [None]:
recent_penguins = penguins[ penguins["year"].isin([2008, 2009]) ]
recent_penguins

In [None]:
# Same as above, just written differently
recent_penguins = penguins[(penguins["year"] == 2008) | (penguins["year"] == 2009)]
recent_penguins
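
The `|` operator above means "or". As a hedged sketch, the block below shows its counterpart `&` ("and") on a made-up table named `toy` (hypothetical values, not the real penguin data), keeping only rows that satisfy both conditions.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "year": [2007, 2008, 2009, 2009],
})

# Wrap each condition in parentheses: & and | bind more tightly than ==
recent_males = toy[(toy["sex"] == "male") & (toy["year"].isin([2008, 2009]))]
print(recent_males)
```

Only the male row from 2009 satisfies both conditions here, so the result has a single row.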

### Select only rows for which a value is known (discard NA)

Datasets are often missing values for certain observations. To discard rows for which a particular column is missing data, you can use the `.notna()` method.

In [None]:
has_bill_length = penguins[penguins["bill_length_mm"].notna()]
has_bill_length
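
To make the filtering step more concrete, the sketch below builds a made-up table named `toy` (hypothetical values) and shows the True/False mask that `.notna()` produces. It also shows `dropna(subset=...)`, an alternative `pandas` method that is not used elsewhere in this notebook and gives the same rows.

```python
import numpy as np
import pandas as pd

# Made-up rows; the second one is missing its bill length
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3],
    "sex": ["male", None, "female"],
})

mask = toy["bill_length_mm"].notna()   # True where a value is present
kept = toy[mask]                       # keep only the rows marked True

# Alternative spelling that drops rows with a missing value in that column
same = toy.dropna(subset=["bill_length_mm"])

print(mask.tolist())      # [True, False, True]
print(kept.equals(same))  # True
```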

### Select rows and columns by their index (row and column number)

When you create a dataframe, `pandas` will by default assign each row and column a number called an index, counting from 0. This is different from the `rowid` that came with this specific data set. You should already know how to pick a row based on the `rowid` column using commands further up in this notebook.

To use the indices provided by `pandas` to select the rows at positions 9 through 24 and the columns at positions 2 through 4, you can use the following command.

In [None]:
penguins.iloc[9:25, 2:5]

**Note:** `pandas` will *include* the first index value, but *not include* the second index value in the selected rows and columns returned. For example, the row with index 25 and the column with index 5 are not included in this subset of data.
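
The include-the-start, exclude-the-end rule is easier to see on a tiny table. The sketch below uses a made-up 5 x 4 dataframe named `toy` (hypothetical, not the penguin data) whose values encode their own position: the row number times 10 plus the column number.

```python
import pandas as pd

# Each value is row * 10 + column, so positions are easy to read off
toy = pd.DataFrame(
    [[r * 10 + c for c in range(4)] for r in range(5)],
    columns=["a", "b", "c", "d"],
)

# Rows at positions 1 and 2, columns at positions 0 and 1
subset = toy.iloc[1:3, 0:2]
print(subset.values.tolist())  # [[10, 11], [20, 21]]
```

The row at position 3 and the column at position 2 are left out, which matches the note above.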

### Sort by a column or columns

You can sort your dataframe by specifying the column or columns that you wish to sort by.

For example, to sort by flipper length you can use the following command.

In [None]:
penguins.sort_values(by='flipper_length_mm')

You can see the default behavior is to sort from smallest to largest, but you can change that by passing the optional `ascending=False` argument to the `sort_values()` method as shown below.

In [None]:
penguins.sort_values(by='flipper_length_mm', ascending=False)

And to sort by more than one column, you can run the following command.

In [None]:
# First, only keep the rows where a value for sex is defined, then sort
penguins[penguins['sex'].notna()].sort_values(by=['sex', 'flipper_length_mm'])
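
When sorting by several columns, `ascending` can also take a list with one True/False value per column. The sketch below, on a made-up table named `toy` (hypothetical values), sorts `sex` alphabetically and then sorts flipper length from largest to smallest within each sex.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "flipper_length_mm": [190.0, 185.0, 201.0, 210.0],
})

# One direction per sort column: sex ascending, flipper length descending
ordered = toy.sort_values(
    by=["sex", "flipper_length_mm"],
    ascending=[True, False],
)
print(ordered["flipper_length_mm"].tolist())  # [210.0, 185.0, 201.0, 190.0]
```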

## Calculating values from the dataframe

Now that you can wrangle the data in your dataframe a bit to create the perfect subset of data for analysis, you'll want to know how to crunch some important numbers. Suppose you wanted to compute the mean weight for male penguins and compare that to female penguins. 

First, create a dataframe with only male penguins in it, and then use the `np.mean()` function on just the column of data that contains the mass of the penguins. The commands would look like those below.

In [None]:
# subset the original data
male_penguins = penguins[ penguins["sex"] == "male" ]

# compute the mean value for the specified column
np.mean( male_penguins["body_mass_g"] )

In [None]:
# subset the original data
female_penguins = penguins[ penguins["sex"] == "female" ]

# compute the mean value for the specified column
np.mean( female_penguins["body_mass_g"] )

Of course, you can do more than just compute a mean. NumPy has many functions that you can call on an array of values (such as a column in a table). For a full reference of available `numpy` functions, see [Numpy Quickstart](https://numpy.org/doc/stable/user/quickstart.html).

Some common functions to analyze numerical columns of data are:
* `np.max()`. Returns the largest value in the column.
* `np.min()`. Returns the smallest value in the column.
* `np.median()`. Returns the median value in the column.
* `np.mean()`. Returns the mean value in the column.
* `np.std()`. Returns the standard deviation of the values in the column.
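
As a quick sketch of the functions listed above, the block below applies each one to a made-up column of four body masses (the name `masses` and its values are hypothetical, chosen so the results are easy to check by hand).

```python
import numpy as np
import pandas as pd

# Made-up body masses in grams, with no missing values
masses = pd.Series([3000.0, 3500.0, 4000.0, 4500.0], name="body_mass_g")

largest = np.max(masses)     # 4500.0
smallest = np.min(masses)    # 3000.0
middle = np.median(masses)   # 3750.0, halfway between 3500 and 4000
average = np.mean(masses)    # 3750.0
spread = np.std(masses)      # standard deviation of the four values

print(largest, smallest, middle, average)
```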

## Visualizing Data

You might also want to visualize some of the data in your dataframe. We'll take a look at a few common visualizations to highlight the necessary `pandas` commands.

### Scatter plot

You can call the `plot` command on a dataframe. You'll need to specify "scatter" as the `kind` argument, and the columns you wish to plot as the `x` and `y` arguments. Then, call `plt.show()` to render the graph in the notebook. In this example, the optional argument `alpha` is set to 0.5 to give each point a little transparency, which makes overlapping points easier to see.

In [None]:
penguins.plot(kind='scatter', x='bill_length_mm', y="flipper_length_mm", alpha=0.5)
plt.show()

If you want to color code by a categorical variable, it is actually a little bit easier to use a different visualization library called [`seaborn`](https://seaborn.pydata.org/), because its scatterplot function allows you to specify the categorical variable column with the `hue` argument. You can create a similar effect using `pandas`, but the approach is a bit more complicated.

In [None]:
sns.scatterplot(data=penguins, x='bill_length_mm', y="flipper_length_mm", hue='species', palette="viridis")
plt.show()

### Box plots

You can create a box plot by specifying the column that contains the numerical data using the `column` argument, and the column that contains the categorical variable you want to group by using the `by` argument.

In [None]:
penguins.plot(kind="box", column="bill_length_mm", by="species")
plt.show()

### Histograms

You can specify the bins for your histogram by creating an array that contains the edges of each bin. The `np.arange()` function lets you create an array with specified starting and ending values, as well as the size of the increment to use to get from the starting value to the ending value.

For example:

```python
np.arange(6, 15, 3)
```

would create an array that contains 6, 9, and 12, because the command says to start at 6, stop before 15, and count by 3. The ending value is not included in the array.
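
The sketch below runs that call and then builds the bin edges used in the next cell, to show where the last edge lands. The variable names `edges` and `bin_edges` are chosen here for illustration.

```python
import numpy as np

# Start at 6, stop before 15, step by 3
edges = np.arange(6, 15, 3)
print(edges.tolist())  # [6, 9, 12]

# Edges from 170 up to, but not including, 240 in steps of 5
bin_edges = np.arange(170, 240, 5)
print(bin_edges[0], bin_edges[-1])  # 170 235
print(len(bin_edges) - 1)           # 13 bins between 14 edges
```

Because the ending value is excluded, the last bin ends at 235, so any value above 235 would not be counted in the histogram.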

In [None]:
penguins.plot(kind="hist", column="flipper_length_mm", bins=np.arange(170,240,5))
plt.show()

For comparison's sake, here's a histogram of the same column made with `seaborn`.

In [None]:
sns.histplot(data=penguins, x="flipper_length_mm")
plt.show()

In [None]:
sns.histplot(data=penguins, x="flipper_length_mm", hue='species', bins=20)
plt.show()