<a href="https://colab.research.google.com/github/nmagee/ds1002/blob/main/notebooks/10-pandas-more.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas DataFrames II

```
  University of Virginia
  Programming for Data Science
  Last Updated: September 22, 2023
```  

### PREREQUISITES
- variables
- data types
- operators
- numpy arrays


### SOURCES

- sort_values()  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html


- value_counts()  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html


- to_csv() : saving to CSV file  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html


- read_csv() : load CSV file into DataFrame  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html


- dropna() : drop missing data  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html


- fillna() : impute missing data  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html


### OBJECTIVES
- Introduce pandas dataframes and the essential operations

In [None]:
#import dependencies
import pandas as pd

# Load Iris Dataset

Let's load a bigger data set to explore more functionality.

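The function `load_dataset()` in the `seaborn` package loads one of seaborn's example datasets. It fetches the data from seaborn's online repository, so it needs an internet connection.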

You may need to install `seaborn` first:

`!pip install seaborn`

In [None]:
import seaborn as sns
iris = sns.load_dataset('iris')

# Notice there is no local CSV for the "iris" data; it's one of Seaborn's example datasets

Check the data type of `iris`:

In [None]:
type(iris)

**`.head()`**
- first records in dataframe (5 by default; pass a number to change that)

In [None]:
iris.head()

In [None]:
iris.head(10)

**`.tail()`**
* last records in dataframe (5 by default; pass a number to change that)

In [None]:
iris.tail()

In [None]:
iris.tail(10)

## Inspect metadata

**`.dtypes`**

In [None]:
iris.dtypes

**`.shape`**
* (rows, columns):

In [None]:
iris.shape

**`len()`**
* returns row (record) count:

In [None]:
len(iris)

**`.columns`**  
* column names:

In [None]:
iris.columns

**`.info()`**

In [None]:
iris.info()

## Set the index

**`.index`**

In [None]:
iris.index

**`.name`**
* name the index 'obs_id'

In [None]:
iris.index.name = 'obs_id'
iris

**`.reset_index()`**
* moves the index into a regular column and returns a new dataframe

In [None]:
iris.reset_index()

We can also redefine indexes to reflect the logic of our data.

In this data set, the species of the flower is part of its **identity**, so it can be part of the index.

(Note that it is also a label that can be used for training a model to predict the species of an iris flower. In that use case, the column would be pulled out into a separate vector.)

**`.set_index()`**

In [None]:
iris_w_idx = iris.reset_index().set_index(['species','obs_id'])

In [None]:
iris_w_idx
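
As a minimal sketch of the index operations above, here they are on a small hand-made frame. The frame `df` and its values are invented for illustration and are not taken from the iris data.

```python
import pandas as pd

# Toy data standing in for iris; the values are made up for illustration.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "sepal_length": [5.1, 4.9, 6.3],
})

df.index.name = "obs_id"      # name the default integer index
flat = df.reset_index()       # returns a NEW frame; obs_id becomes a column
df_w_idx = flat.set_index(["species", "obs_id"])  # two-level index

print(flat.columns.tolist())          # ['obs_id', 'species', 'sepal_length']
print(list(df_w_idx.index.names))     # ['species', 'obs_id']
```

Note that `reset_index()` and `set_index()` both return new dataframes here, so `df` itself keeps its original index.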

# Row Selection (Filtering)

**`iloc[]`**

You can extract rows by zero-based integer **position** with `.iloc[]`.



In [None]:
# This fetches the third row (position 2), and all columns:

iris.iloc[2]

Fetch the rows at positions 1 and 2 (the right endpoint is exclusive), and all columns.

In [None]:
iris.iloc[1:3]

Fetch the rows at positions 1 and 2, and the first three columns (positions 0, 1, 2).

In [None]:
iris.iloc[1:3, 0:3]

You can apply slices to column names too. You don't need `.iloc[]` here.

In [None]:
iris.columns[0:3]

## `.loc[]`

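Filtering can also be done with `.loc[]`. This uses row and column labels (names) rather than positions.

Here we ask for rows with labels (indexes) 1 through 3, and we get exactly those three rows, because a `.loc[]` slice includes both endpoints.  
By contrast, `.iloc[1:3]` returned only the rows at positions 1 and 2.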

**Author note: This is by far the more useful of the two in my experience.**

In [None]:
iris.loc[1:3]
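
The difference between position-based and label-based slicing is easier to see on a tiny frame. This is only a sketch: the frame `df` and the name `shuffled` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40, 50]})   # default labels 0..4

by_position = df.iloc[1:3]   # positions 1 and 2 (right endpoint excluded)
by_label = df.loc[1:3]       # labels 1, 2 and 3 (both endpoints included)

print(by_position["x"].tolist())   # [20, 30]
print(by_label["x"].tolist())      # [20, 30, 40]

# After reordering, labels and positions no longer line up.
shuffled = df.sort_values("x", ascending=False)   # labels now run 4..0
print(shuffled.iloc[0]["x"])   # 50: the first row by position
print(shuffled.loc[0]["x"])    # 10: the row whose label is 0
```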

Subset on columns with column name (as a string) or list of strings

In [None]:
iris.loc[1:3, ['sepal_length','petal_width']]

Select all rows, specific columns

In [None]:
iris.loc[:, ['sepal_length','petal_width']]

## `.loc[]` with MultiIndex

In [None]:
iris_w_idx.loc['versicolor']

In [None]:
iris_w_idx.loc['setosa', 'sepal_length'].head()

In [None]:
iris_w_idx.loc['setosa', 'sepal_length'].to_frame().head()

We use a tuple to index multiple index levels.

Note that you can't pass slices written with `:` inside the tuple, and this is where indexing can get sticky.

In [None]:
iris_w_idx.loc[('versicolor', 52)]
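
One way around the slice limitation is `pd.IndexSlice`, which lets you write `:` slices for individual index levels. The sketch below uses a small invented frame (`df`, with made-up values) rather than `iris_w_idx`, and it sorts the index first because slicing a MultiIndex generally expects a sorted index.

```python
import pandas as pd

# Toy two-level index; values are invented for illustration.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor", "versicolor"],
    "obs_id": [0, 1, 2, 3, 4],
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.9],
}).set_index(["species", "obs_id"]).sort_index()

# A tuple picks one exact (species, obs_id) combination.
one_row = df.loc[("versicolor", 3)]

# IndexSlice allows a label slice on the second level (endpoints included).
idx = pd.IndexSlice
subset = df.loc[idx["versicolor", 2:3], :]

print(one_row["sepal_length"])            # 6.4
print(subset["sepal_length"].tolist())    # [7.0, 6.4]
```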

## Another Example

In [None]:
df_cat = pd.DataFrame(
    index=['burmese', 'persian', 'maine_coone'],
    columns=['x'],
    data=[2,1,3]
)

In [None]:
df_cat

In [None]:
df_cat.iloc[:2]

In [None]:
df_cat.iloc[0:1]

In [None]:
df_cat.loc['burmese']

In [None]:
df_cat.loc[['burmese','maine_coone']]

# Boolean Filtering

It's very common to subset a dataframe based on some condition on the data.

🔑 Note that even though we are filtering rows, we are not using `.loc[]` or `.iloc[]` here.

Pandas knows what to do if you pass a boolean structure.

In [None]:
iris.sepal_length >= 7.5

In [None]:
iris[iris.sepal_length >= 7.5]

In [None]:
iris[(iris['sepal_length'] >= 4.5) & (iris['sepal_length'] <= 4.7)]
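
Conditions can be combined in other ways as well. The following is a small sketch on an invented frame (`df` and its values are assumptions for illustration). Each condition needs its own parentheses because `&`, `|` and `~` bind more tightly than the comparison operators.

```python
import pandas as pd

# Toy data; values are made up for illustration.
df = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica", "virginica"],
    "sepal_length": [4.6, 6.1, 7.7, 5.8],
})

# OR: rows that are very short or very long
extremes = df[(df["sepal_length"] < 5.0) | (df["sepal_length"] > 7.5)]

# NOT: everything except setosa
not_setosa = df[~(df["species"] == "setosa")]

# .isin() tests membership in a list of values
subset = df[df["species"].isin(["setosa", "versicolor"])]

# A boolean mask can also be the row selector in .loc[], together with a column list
long_lengths = df.loc[df["sepal_length"] >= 6.0, ["sepal_length"]]

print(extremes.index.tolist())                  # [0, 2]
print(long_lengths["sepal_length"].tolist())    # [6.1, 7.7]
```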

## Masking

Here's an example of **masking** using boolean conditions passed to the dataframe selector:

Here are the **values** for the feature `sepal length`:

In [None]:
iris.sepal_length.values

And here are **the boolean values** generated by applying a comparison operator to those values:

In [None]:
mask = iris.sepal_length >= 7.5

In [None]:
mask.values

The two sets of values have the same shape.

We can now overlay the logical values over the numeric ones and keep only what is `True`:

In [None]:
iris.sepal_length[mask].values

# Working with Missing Data

Pandas primarily uses the special floating-point value `np.nan` ("not a number") from NumPy to represent missing data.

In [None]:
import numpy as np

In [None]:
df_miss = pd.DataFrame({
    'x':[2, np.nan, 1],
    'y':[np.nan, np.nan, 6]}
)

In [None]:
df_miss

## `.dropna()`

This will drop all rows with missing data in any column.

[Details](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)

In [None]:
df_drop_all = df_miss.dropna()
df_drop_all

The `subset` parameter takes a list of column names to check for missing values. Only rows with missing data in those columns are dropped.

In [None]:
df_drop_x = df_miss.dropna(subset=['x'])
df_drop_x
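
Two other `dropna()` parameters, `how` and `thresh`, control how strict the dropping is. This sketch rebuilds the same small `df_miss` frame so it runs on its own; the result names are made up for illustration.

```python
import numpy as np
import pandas as pd

df_miss = pd.DataFrame({
    "x": [2, np.nan, 1],
    "y": [np.nan, np.nan, 6],
})

# Count missing values per column before deciding what to drop:
# x has 1 missing value, y has 2.
n_missing = df_miss.isna().sum()

# how="all": drop a row only if every column is missing
df_drop_empty = df_miss.dropna(how="all")

# thresh=2: keep only rows with at least 2 non-missing values
df_keep_full = df_miss.dropna(thresh=2)

print(df_drop_empty.index.tolist())   # [0, 2]
print(df_keep_full.index.tolist())    # [2]
```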

## `.fillna()`

This will replace missing values with whatever you set it to, e.g. $0$s.

[Details](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)

We can also pass the result of an operation. For example, to perform simple imputation, we can replace missing values in each column with the median value of that column:

In [None]:
df_filled = df_miss.fillna(df_miss.median())

In [None]:
df_filled
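
For comparison, here is a sketch of two other common fills: a single constant for every column, and a dictionary giving a different fill value per column. The frame is rebuilt so the example is self-contained, and the choice of fill values is only illustrative.

```python
import numpy as np
import pandas as pd

df_miss = pd.DataFrame({
    "x": [2, np.nan, 1],
    "y": [np.nan, np.nan, 6],
})

# One constant for all missing values
df_zero = df_miss.fillna(0)

# A dict maps column name -> fill value (here: mean of x, and 0 for y)
df_per_col = df_miss.fillna({"x": df_miss["x"].mean(), "y": 0})

print(df_zero["x"].tolist())      # [2.0, 0.0, 1.0]
print(df_per_col["x"].tolist())   # [2.0, 1.5, 1.0]
print(df_per_col["y"].tolist())   # [0.0, 0.0, 6.0]
```

Like `dropna()`, `fillna()` returns a new dataframe by default, so `df_miss` still contains its missing values afterwards.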

# Sorting

**`.sort_values()`**

Sort by values
- `by` parameter takes string or list of strings
- `ascending` takes True or False
- `inplace=True` sorts the df itself instead of returning a sorted copy

[Details](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)

In [None]:
iris.sort_values(by=['sepal_length','petal_width'])
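
The parameters listed above can be combined. As a sketch on an invented frame (`df` and its values are made up), `ascending` also accepts a list with one entry per sort column:

```python
import pandas as pd

# Toy data; values are made up for illustration.
df = pd.DataFrame({
    "species": ["setosa", "virginica", "setosa", "virginica"],
    "sepal_length": [5.1, 6.3, 4.9, 7.1],
})

# species A-Z, then sepal_length from largest to smallest within each species
ordered = df.sort_values(by=["species", "sepal_length"], ascending=[True, False])
print(ordered.index.tolist())   # [0, 2, 3, 1]

# The original frame is untouched because inplace defaults to False
print(df.index.tolist())        # [0, 1, 2, 3]

# Renumber the rows after sorting if the old labels are not needed
renumbered = ordered.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```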

## `.sort_index()`

Sort by index. This example sorts by descending index.

In [None]:
iris.sort_index(axis=0, ascending=False)

# Statistics

**`describe()`**

In [None]:
iris.describe()

In [None]:
iris.describe().T

In [None]:
iris.species.describe()

In [None]:
iris.sepal_length.describe()

**`value_counts()`**

This is **a highly useful** function for showing the frequency for each distinct value.  

Parameters give the ability to sort by count or index, normalize, and more.  

[Details](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)

In [None]:
iris.species.value_counts()

Show proportions (fractions of the total, summing to 1) instead of counts

In [None]:
iris.species.value_counts(normalize=True)
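
If you want actual percentages, one simple approach is to scale the normalized result by 100. This is a sketch on a short made-up series, not the iris column:

```python
import pandas as pd

# Toy series; values are made up for illustration.
species = pd.Series(["setosa", "setosa", "virginica", "versicolor"])

proportions = species.value_counts(normalize=True)   # fractions that sum to 1
percentages = (proportions * 100).round(1)           # same information on a 0-100 scale

print(proportions["setosa"])    # 0.5
print(percentages["setosa"])    # 50.0
```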

The method returns a series that can be converted into a dataframe.

In [None]:
SEPAL_LENGTH = iris.sepal_length.value_counts().to_frame('n')

In [None]:
SEPAL_LENGTH

You can run `.value_counts()` on a column to get a kind of histogram:

In [None]:
SEPAL_LENGTH.sort_index().plot.bar(figsize=(8,4), rot=45);

**`.mean()`**

Operations like this generally exclude missing data.

So, it is important to convert missing data to values if they need to be considered in the denominator.

In [None]:
iris.sepal_length.mean()
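
The effect of missing data on the denominator can be seen in a three-value sketch (the series `s` is made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 4.0])

# Default: the NaN is skipped, so the mean is (2 + 4) / 2
print(s.mean())               # 3.0

# Filling with 0 first puts the missing record back in the denominator: (2 + 0 + 4) / 3
print(s.fillna(0).mean())     # 2.0

# skipna=False propagates the missing value instead of ignoring it
print(s.mean(skipna=False))   # nan
```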

**`.max()`**

In [None]:
iris.sepal_length.max()

**`.std()`**

This is the sample standard deviation (pandas divides by n - 1 by default, i.e. `ddof=1`).

In [None]:
iris.sepal_length.std()

**`.corr()`**

In [None]:
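# numeric_only=True skips the text column `species`; without it, pandas 2.x raises an error
iris.corr(numeric_only=True)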

Correlation can be computed on two fields by subsetting on them:

In [None]:
iris[['sepal_length','petal_length']].corr()

In [None]:
iris[['sepal_length','petal_length','sepal_width']].corr()

# Styling

In [None]:
iris.corr(numeric_only=True).style.background_gradient(cmap="Spectral", axis=None)

In [None]:
iris.corr(numeric_only=True).style.bar(axis=None)

# Visualization

Scatterplot using the dataframe's built-in pandas `.plot` interface on the df columns `sepal_length`, `petal_length`.


In [None]:
iris.plot.scatter('sepal_length', 'petal_length');

In [None]:
iris.sort_values(list(iris.columns)).plot(style='o', figsize=(10,10));

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
scatter_matrix(iris, figsize=(10,10));

# Save to CSV File

It is common to save a df to a CSV file. Pass the full path (path + filename); a relative path such as `./iris_data.csv` is resolved against the current working directory.

There are also options to save to a database and to other file formats.

Common optional parameters:
- `sep` - delimiter
- `index` - saving index column or not

[Details](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)

In [None]:
iris.to_csv('./iris_data.csv')
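
To show the optional parameters together with `read_csv()` from the sources list, here is a sketch of a save-and-reload round trip. The frame `df`, the file name `demo.csv`, and the choice of `;` as delimiter are all assumptions for illustration; a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

import pandas as pd

# Toy data; values are made up for illustration.
df = pd.DataFrame({
    "species": ["setosa", "virginica"],
    "sepal_length": [5.1, 6.3],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.csv")   # hypothetical file name

    # index=False leaves the row labels out of the file; sep sets the delimiter
    df.to_csv(path, sep=";", index=False)

    # Read it back with the same delimiter
    df_back = pd.read_csv(path, sep=";")

print(df_back.columns.tolist())   # ['species', 'sepal_length']
print(df_back.shape)              # (2, 2)
```

If the file is written with the default `index=True`, the index is saved as an extra first column, and you would typically pass `index_col=0` to `read_csv()` to restore it.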