# NB: Introducing Pandas

Programming for Data Science

## What is Pandas?

Pandas is a Python library designed to work with **dataframes**.

Essentially, it **adds a ton of usability features to NumPy**.

It has become **a standard library** in data science.

## Why Pandas?

Since we already have NumPy as a powerful analytical tool to work with data, why do we need Pandas?

Recall one of the problems we faced when using NumPy &mdash; if we want to work with **labeled data**, say a matrix with named columns and rows, we have to create **separate arrays** and manage the relationship between the three arrays **in our heads**.

It would be nice if we could have an object which **contained all three** together.

This is one of the things Pandas offers.
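
As a rough preview of the difference, the sketch below looks up one labeled cell twice: first with three separate NumPy arrays, then with a single labeled object. The toy values and the names `values`, `row_names`, `col_names`, and `labeled` are made up for illustration; `pd.DataFrame` and `.loc[]` are introduced properly later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a 2x3 matrix plus two separate label arrays.
values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
row_names = np.array(['r1', 'r2'])
col_names = np.array(['a', 'b', 'c'])

# With NumPy alone, we translate labels into positions ourselves.
i = np.where(row_names == 'r2')[0][0]
j = np.where(col_names == 'b')[0][0]
numpy_value = values[i, j]

# A DataFrame keeps the values and both sets of labels in one object.
labeled = pd.DataFrame(values, index=row_names, columns=col_names)
pandas_value = labeled.loc['r2', 'b']
```

Both lookups return the same number; the second simply does not require us to track positions by hand.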

## Structured Arrays

In fairness, NumPy does offer a partial solution to this problem &mdash; **structured arrays** &mdash; which we have not covered.

Structured arrays allow you to create arrays with **labeled columns**, and these columns may have **different data types**.

For example, here is a simple structured array:

In [13]:
import numpy as np

my_data = [('Rex', 9, 81.), ('Fido', 3, 27.), ('Pluto', 4, 55.)]
my_struct = np.array(
    my_data,
    dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]
)
my_struct

array([('Rex', 9, 81.), ('Fido', 3, 27.), ('Pluto', 4, 55.)],
      dtype=[('name', '<U10'), ('age', '<i4'), ('weight', '<f4')])

In [14]:
my_struct['name'] # Gets a column

array(['Rex', 'Fido', 'Pluto'], dtype='<U10')

In [15]:
my_struct[my_struct['name'] == 'Pluto'] # Gets a row

array([('Pluto', 4, 55.)],
      dtype=[('name', '<U10'), ('age', '<i4'), ('weight', '<f4')])

However, NumPy's documentation has this warning:

> Users looking to manipulate tabular data, such as stored in csv files, may find other pydata projects more suitable, such as xarray, pandas, or DataArray. These provide a high-level interface for tabular data analysis and are better optimized for that use. For instance, the C-struct-like memory layout of structured arrays in numpy can lead to poor cache behavior in comparison. \
[NumPy documentation](https://numpy.org/doc/1.26/user/basics.rec.html)

## Pandas Data Structures

In a way, Pandas takes the concept of the structured array and runs with it (although the two were developed independently).

In doing so, it makes a strong **design decision** to only work with $1$ and $2$ dimensional arrays:

A 1-dimensional labeled array capable of holding any data type is called a **Series**.

A 2-dimensional labeled array with columns of potentially different types is called a **DataFrame**.
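
A minimal sketch of the two structures, reusing the dog data from the structured array above (the variable names `ages` and `dogs` are ours, not part of Pandas):

```python
import pandas as pd

# A Series: one labeled dimension, one data type.
ages = pd.Series([9, 3, 4], index=['Rex', 'Fido', 'Pluto'], name='age')

# A DataFrame: two labeled dimensions, one data type per column.
dogs = pd.DataFrame(
    {'age': [9, 3, 4], 'weight': [81.0, 27.0, 55.0]},
    index=['Rex', 'Fido', 'Pluto'],
)

ages.ndim, dogs.ndim   # number of dimensions of each object
dogs.dtypes            # the columns keep their own types
```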

As a side note, Pandas used to have a $3$-dimensional structure called a **panel**, but it has been removed from the library.
    
Ironically, the name "pandas" was partly derived from the word "panel", as in "pan(el)-da(ta)-s".
    
To handle higher dimensional data, the Pandas team suggests using [XArray](https://xarray.pydata.org/en/stable/), which also builds on NumPy arrays.

## The Data Frame

By far, the most **important** data structure in Pandas is the **dataframe** (sometimes spelled "data frame"), with the **series** playing a **supporting**, but crucial, role. 

In fact, **dataframe objects are built out of series objects**.

Dataframes are **inspired by the R structure** of the same name. 

They have many similarities, but there are fundamental differences between the two that go beyond mere language differences. 

Most importantly, Pandas dataframes have **indexes**, whereas R dataframes do not (R offers only simple row names).

It is helpful to think of **Pandas as a wrapper around NumPy and Matplotlib** that makes it much easier to perform common operations, like selecting data by column name or drawing plots. 

But this comes at a cost &mdash; Pandas is **slower** than NumPy. 

This represents the classic trade-off between **ease-of-use** for humans and machine **performance**.

## Data Structure Design

Let's look at how data frame and series objects are **designed and built**.

It is essential to develop a **mental model** of what you are working with so operations and functions associated with them make sense.

Remember &mdash; data structure design is king.

## The Series

A Series is at heart a **one-dimensional array** with **labels** along its axis.

(We'll capitalize "Series" and "DataFrame" to signify their status as object **classes** within Pandas.)

**Labels** are essentially names that, ideally, uniquely identify each row (observation).

Its data must be of a **single type**, as in a NumPy array (which is how the data is usually stored internally).

## The Index

The **axis labels** are referred to as the **index**.

Think of **the index as a separate data structure** that is attached to the array. 

The Series **array** holds the **data**. 

The Series **index** holds the names of the observations or things that the data are about.

Some consider the index to be **metadata** &mdash; data about data.
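
The split between data and index can be seen directly on a small Series. This is only a sketch; the Series `weights` is a made-up example.

```python
import numpy as np
import pandas as pd

weights = pd.Series([81.0, 27.0, 55.0], index=['Rex', 'Fido', 'Pluto'])

labels = weights.index       # the index: names of the observations
data = weights.to_numpy()    # the data: a plain NumPy array

by_label = weights['Fido']      # look up by label, via the index
by_position = weights.iloc[1]   # look up by position, ignoring labels
```

Both lookups reach the same value; the index just offers a second, named route to it.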

## The Data Frame

You can think of a DataFrame as **a bundle of Series objects that share an index**.

**Column labels** (also called an index) can be thought of as **Series names**.

<img src="https://pynative.com/wp-content/uploads/2021/02/dataframe.png" width="50%" height="50%"/>

<img src="https://miro.medium.com/max/700/1*KOBhtOeFntu6CyJUsCdN0g.jpeg" width="40%"/>

Image from [Nantasenamat 2021](https://towardsdatascience.com/how-to-master-pandas-for-data-science-b8ab0a9b1042).
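
To make the "bundle of Series" picture concrete, here is a small sketch that builds a DataFrame from two Series whose labels are listed in different orders. The names `age`, `weight`, and `dogs` are illustrative only.

```python
import pandas as pd

age = pd.Series({'Rex': 9, 'Fido': 3, 'Pluto': 4}, name='age')
weight = pd.Series({'Pluto': 55.0, 'Rex': 81.0, 'Fido': 27.0}, name='weight')

# Each Series becomes a column; rows are matched by label, not by position.
dogs = pd.DataFrame({'age': age, 'weight': weight})

pluto_age = dogs.loc['Pluto', 'age']
pluto_weight = dogs.loc['Pluto', 'weight']

# Pulling a column back out gives a Series that carries the shared index.
col = dogs['weight']
```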

Let's dive into how Pandas objects work in practice.

## Using Pandas

We import pandas like this, using the alias `pd` by convention:

In [16]:
import pandas as pd

We almost always import NumPy, too, since we use many of its functions with Pandas.

In [2]:
import numpy as np

## Data Frame Constructors

There are several ways to create pandas data frames.

Here, we create one by passing a dictionary of lists:

In [17]:
df = pd.DataFrame({
    'x': [0, 2, 1, 5], 
    'y': [1, 1, 0, 0], 
    'z': [True, False, False, False]
})

In [18]:
df

|   | x | y | z |
|---|---|---|---|
| 0 | 0 | 1 | True |
| 1 | 2 | 1 | False |
| 2 | 1 | 0 | False |
| 3 | 5 | 0 | False |


In [19]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
list(df.index)

[0, 1, 2, 3]

In [7]:
df.columns

Index(['x', 'y', 'z'], dtype='object')

In [8]:
list(df.columns)

['x', 'y', 'z']

In [9]:
df.values

array([[0, 1, True],
       [2, 1, False],
       [1, 0, False],
       [5, 0, False]], dtype=object)

In [10]:
type(df.values)

numpy.ndarray

**Passing a list of tuples:**

In [11]:
my_data = [
    ('a', 1, True),
    ('b', 2, False)
]
df2 = pd.DataFrame(my_data, columns=['f1', 'f2', 'f3'])

In [12]:
df2

|   | f1 | f2 | f3 |
|---|---|---|---|
| 0 | a | 1 | True |
| 1 | b | 2 | False |


**Passing the three required pieces:**
- columns as list
- index as list
- data as list of lists (2D)

In [13]:
df3 = pd.DataFrame(
    columns=['x','y'], 
    index=['row1','row2','row3'], 
    data=[[9,3],[1,2],[4,6]])

In [14]:
df3

|   | x | y |
|---|---|---|
| row1 | 9 | 3 |
| row2 | 1 | 2 |
| row3 | 4 | 6 |


## Naming indexes

It is helpful to name your indexes.

In [15]:
df2.index.name = 'obs_id'

In [16]:
df2

| obs_id | f1 | f2 | f3 |
|---|---|---|---|
| 0 | a | 1 | True |
| 1 | b | 2 | False |


## Why have an index?

Indexes provide a way to access elements of the array by name.

They allow series and data frame objects that share index labels to be combined, through joins and other data operations.

And many other things!
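
For instance, arithmetic between two Series is aligned on index labels rather than on position. A minimal sketch with made-up Series `a` and `b`:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'w'])

# Values are paired by label; labels found in only one Series yield NaN.
total = a + b

y_total = total['y']                 # 2 + 20
z_total = total['z']                 # 3 + 10
n_unmatched = total.isna().sum()     # 'x' and 'w' have no partner
```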

In fact, **a dataframe is a collection of series** with a common index. 

To this collection of series, the dataframe also adds a set of labels along the horizontal axis.
* The row index is **axis 0**.
* The column index is called **axis 1**.

<div class="callout">
    The row index is usually just called the index, while the column index is just called the columns.
</div>

Note that both index and column labels can be **multidimensional**.

* These are called **hierarchical indexes** and go by the technical name `MultiIndex`.
* As an example, consider that a table of text data might have a two-column index: `(book_id, chap_id)`
* See [the Pandas documentation](https://pandas.pydata.org/docs/user_guide/advanced.html).
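
As a small sketch of the `(book_id, chap_id)` idea, the frame below is keyed by two index levels. The column `n_words` and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical table of chapters, keyed by (book_id, chap_id).
chapters = pd.DataFrame({
    'book_id': [1, 1, 2],
    'chap_id': [1, 2, 1],
    'n_words': [2500, 3100, 1800],
}).set_index(['book_id', 'chap_id'])

n_levels = chapters.index.nlevels          # number of index levels
one_cell = chapters.loc[(1, 2), 'n_words'] # full key given as a tuple
book_one = chapters.loc[1]                 # partial key: all chapters of book 1
```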

**It is crucial to understand the difference between the index of a dataframe and its data in order to understand how dataframes work.**

Many a headache is caused by not understanding this difference :-)

**Indexes are powerful and controversial.**

* They allow for all kinds of magic to take place when combining and accessing data.

* But they are expensive and sometimes hard to work with (especially multiindexes).
* They are especially difficult if you are coming from R and expecting dataframes to behave a certain way.

## Copying DataFrames with `copy()`

Use `copy()` to give the new df a clean break from the original.  

Plain assignment (`df2 = df`) copies nothing: both names then point to the same object.

In [17]:
df = pd.DataFrame(
    {
        'x':[0,2,1,5], 
        'y':[1,1,0,0], 
        'z':[True,False,False,False]
    }
) 

In [18]:
df

|   | x | y | z |
|---|---|---|---|
| 0 | 0 | 1 | True |
| 1 | 2 | 1 | False |
| 2 | 1 | 0 | False |
| 3 | 5 | 0 | False |


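We create a true (deep) copy and, for comparison, a second name bound to the same object. The second is not really a copy at all, although this notebook loosely calls it "shallow". (A genuine shallow copy is made with `df.copy(deep=False)`.)

In [19]:
df_deep    = df.copy()  # deep copy (the default); changes to df will not pass through
df_shallow = df         # plain assignment: a second name for the same object, so changes pass through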

If we alter a value in the original ...

In [20]:
df.x = 1

In [21]:
df

|   | x | y | z |
|---|---|---|---|
| 0 | 1 | 1 | True |
| 1 | 1 | 1 | False |
| 2 | 1 | 0 | False |
| 3 | 1 | 0 | False |


... then the shallow copy is also changed ...

In [22]:
df_shallow

|   | x | y | z |
|---|---|---|---|
| 0 | 1 | 1 | True |
| 1 | 1 | 1 | False |
| 2 | 1 | 0 | False |
| 3 | 1 | 0 | False |


... while the deep copy is not.

In [23]:
df_deep

|   | x | y | z |
|---|---|---|---|
| 0 | 0 | 1 | True |
| 1 | 2 | 1 | False |
| 2 | 1 | 0 | False |
| 3 | 5 | 0 | False |


Of course, the reverse is true too -- changes to the shallow copy affect the original:

In [24]:
df_shallow.y = 99

In [25]:
df

|   | x | y | z |
|---|---|---|---|
| 0 | 1 | 99 | True |
| 1 | 1 | 99 | False |
| 2 | 1 | 99 | False |
| 3 | 1 | 99 | False |


So, `df_shallow` mirrors changes to `df`, since both names refer to the same object, indexes and data included.  
`df_deep` does not reference `df`, and so changes to `df` do not impact `df_deep`.
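
One way to check which situation you are in is Python's `is` operator, which tests whether two names refer to the same object. A minimal sketch, with made-up names `original`, `alias`, and `independent`:

```python
import pandas as pd

original = pd.DataFrame({'x': [0, 2, 1, 5]})
alias = original               # a second name for the same object
independent = original.copy()  # a separate object (deep copy by default)

same_object = alias is original
separate_object = independent is original

original['x'] = 1              # overwrite the column in the original

alias_x = alias['x'].tolist()              # follows the change
independent_x = independent['x'].tolist()  # keeps the old values
```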

Let's reset our dataframe.

In [26]:
df = pd.DataFrame({'x':[0,2,1,5], 'y':[1,1,0,0], 'z':[True,False,False,False]}) 

## Column Data Types

### With `.dtypes`

In [27]:
df.dtypes

x    int64
y    int64
z     bool
dtype: object

### With `.info()`

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   x       4 non-null      int64
 1   y       4 non-null      int64
 2   z       4 non-null      bool 
dtypes: bool(1), int64(2)
memory usage: 200.0 bytes


## Column Renaming

You can rename one or more fields at once by passing a dict to `.rename()`.  

Rename the field `z` to `is_label`:

In [29]:
df = df.rename(columns={'z': 'is_label'})

In [30]:
df

|   | x | y | is_label |
|---|---|---|---|
| 0 | 0 | 1 | True |
| 1 | 2 | 1 | False |
| 2 | 1 | 0 | False |
| 3 | 5 | 0 | False |


You can also change column names this way:

In [31]:
old_cols = df.columns # Keep a copy so we can revert
df.columns = ['X','Y', 'LABEL']

In [32]:
df

|   | X | Y | LABEL |
|---|---|---|---|
| 0 | 0 | 1 | True |
| 1 | 2 | 1 | False |
| 2 | 1 | 0 | False |
| 3 | 5 | 0 | False |


In [33]:
df.columns = old_cols # Reset things

In [34]:
df

|   | x | y | is_label |
|---|---|---|---|
| 0 | 0 | 1 | True |
| 1 | 2 | 1 | False |
| 2 | 1 | 0 | False |
| 3 | 5 | 0 | False |


You can also transform column names easily:

In [35]:
df3

|   | x | y |
|---|---|---|
| row1 | 9 | 3 |
| row2 | 1 | 2 |
| row3 | 4 | 6 |


In [36]:
df3.columns = df3.columns.str.upper()

In [37]:
df3

|   | X | Y |
|---|---|---|
| row1 | 9 | 3 |
| row2 | 1 | 2 |
| row3 | 4 | 6 |


## Column Referencing

Pandas supports both **bracket notation** and **dot notation**.  

**Bracket**

In [38]:
df['y']

0    1
1    1
2    0
3    0
Name: y, dtype: int64

**Dot** (i.e. as object attribute)

In [39]:
df.y

0    1
1    1
2    0
3    0
Name: y, dtype: int64

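Dot notation is very convenient, since columns exposed as object attributes can be tab-completed in various editing environments.

But:
- It only works if the column name is a **valid Python identifier** (no spaces, not a reserved word) and does not collide with an existing DataFrame attribute or method, such as `count` or `shape`.
- It can't be used when creating a **new column** (see below).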

It is convenient to name columns with a prefix, e.g. `doc_title`, `doc_year`, `doc_author`, etc. to avoid name collisions.
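
The limits of dot notation can be sketched with a deliberately awkward frame. The column names `count` and `doc title` and the attribute `total` are chosen here only to trigger the problems.

```python
import warnings
import pandas as pd

frame = pd.DataFrame({'count': [1, 2], 'doc title': ['a', 'b']})

# 'count' collides with the DataFrame.count() method, so dot access
# returns the method rather than the column.
dot_gives_method = callable(frame.count)
count_values = frame['count'].tolist()

# A name containing a space is not a valid identifier: brackets are required.
titles = frame['doc title'].tolist()

# Assigning to a new attribute does not create a column (Pandas warns about it).
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    frame.total = [3, 4]
created_column = 'total' in frame.columns
```

Bracket notation works in every one of these cases, which is why it is the safer default in reusable code.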

Column attributes and methods work with both:

In [40]:
df.y.values, df['y'].values

(array([1, 1, 0, 0]), array([1, 1, 0, 0]))

Show only the first value, by indexing:

In [41]:
df.y.values[0]

1

## Column Selection

You select columns from a dataframe by passing a value or list (or any expression that evaluates to a list).

Selecting a column with a scalar returns a Series:

In [42]:
df['x']

0    0
1    2
2    1
3    5
Name: x, dtype: int64

In [43]:
type(df['x'])

pandas.core.series.Series

Calling a column with a list returns a dataframe:

In [44]:
df[['x']]

|   | x |
|---|---|
| 0 | 0 |
| 1 | 2 |
| 2 | 1 |
| 3 | 5 |


In [45]:
type(df[['x']])

pandas.core.frame.DataFrame

In Pandas, we can use "fancy indexing" with labels:

In [46]:
df[['y', 'x']]

|   | y | x |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 2 |
| 2 | 0 | 1 |
| 3 | 0 | 5 |


We can put in a list comprehension, too:

In [47]:
df[[col for col in df.columns if col not in ['x','y']]]

|   | is_label |
|---|---|
| 0 | True |
| 1 | False |
| 2 | False |
| 3 | False |


## Adding New Columns

It is typical to create a new column from existing columns.  

In this example, a new column (or field) is created by summing `x` and `y`:

In [48]:
df['x_plus_y'] = df.x + df.y

In [49]:
df

|   | x | y | is_label | x_plus_y |
|---|---|---|---|---|
| 0 | 0 | 1 | True | 1 |
| 1 | 2 | 1 | False | 3 |
| 2 | 1 | 0 | False | 1 |
| 3 | 5 | 0 | False | 5 |


Note the use of bracket notation on the left.

When new columns are created, you **must** use bracket notation.

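## Removing Columns with `del` and `.drop()`

### `del`

`del` is a Python statement that removes a name binding or an item from a container.

It is often written `del(...)`, but the parentheses are optional: `del(x)` and `del x` do the same thing.

`del` can remove a whole DataFrame variable or a single column from the frame.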

In [50]:
df_drop = df.copy()

In [51]:
df_drop.head(2)

|   | x | y | is_label | x_plus_y |
|---|---|---|---|---|
| 0 | 0 | 1 | True | 1 |
| 1 | 2 | 1 | False | 3 |


In [52]:
del df_drop['x']  # del is a statement; parentheses are not needed

In [53]:
df_drop

|   | y | is_label | x_plus_y |
|---|---|---|---|
| 0 | 1 | True | 1 |
| 1 | 1 | False | 3 |
| 2 | 0 | False | 1 |
| 3 | 0 | False | 5 |


### `.drop()`

`.drop()` can drop one or more columns (or rows) and returns a new dataframe.

It takes an `axis` parameter:
- `axis=0` refers to rows  
- `axis=1` refers to columns  

In [54]:
df_drop = df_drop.drop(['x_plus_y', 'is_label'], axis=1)

In [55]:
df_drop

|   | y |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 0 |
| 3 | 0 |
3,0


## Load Iris Dataset

Let's load a bigger data set to explore more functionality.

The function `load_dataset()` in the `seaborn` package loads one of Seaborn's example datasets by name (it downloads the file on first use, so it needs an internet connection).

In [56]:
import seaborn as sns
iris = sns.load_dataset('iris')

Check the data type of `iris`:

In [57]:
type(iris)

pandas.core.frame.DataFrame

### See the first and last records with `.head()` and `.tail()`

In [58]:
iris.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [59]:
iris.head(10)

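|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |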


In [60]:
iris.tail()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |


### Inspect metadata

In [61]:
iris.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

Shape (rows, columns):

In [62]:
iris.shape

(150, 5)

Alternatively, `len()` returns row (record) count:

In [63]:
len(iris)

150

Column names:

In [64]:
iris.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

### Get it all with `.info()`

In [65]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


## The Index

In [66]:
iris.index

RangeIndex(start=0, stop=150, step=1)

We can name indexes, and it is important to do so in many cases.

In [67]:
iris.index.name = 'obs_id' # Each observation is a unique plant

In [68]:
iris

| obs_id | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |


We can also redefine indexes to reflect **the logic of our data**.

In this data set, the species of the flower is part of its **identity**, so it can be part of the index.

The other features vary by individual. 

Note that `species` is also a **label** that can be used for training a model to predict the species of an iris flower. In that use case, the column would be pulled out into a separate vector.

In [69]:
iris_w_idx = iris.reset_index().set_index(['species','obs_id'])

In [70]:
iris_w_idx

| species | obs_id | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|---|
| setosa | 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| setosa | 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| setosa | 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| setosa | 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| setosa | 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... | ... |
| virginica | 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| virginica | 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| virginica | 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| virginica | 148 | 6.2 | 3.4 | 5.4 | 2.3 |


## Row Selection (Filtering) 

### `.iloc[]`

You can extract rows by **integer position** with `.iloc[]`. 

This fetches the third row (position 2), with all columns.

In [71]:
iris.iloc[2]

sepal_length       4.7
sepal_width        3.2
petal_length       1.3
petal_width        0.2
species         setosa
Name: 2, dtype: object

Fetch the rows at positions 1 and 2 (the right endpoint is exclusive), with all columns.

In [72]:
iris.iloc[1:3]

| obs_id | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |


Next, we fetch the rows at positions 1 and 2 and the first three columns (positions 0, 1, 2).

### Combining Filtering and Selecting

So, remember the **comma notation** from NumPy -- it is used here.

The first element is a **row selector**, the second a **column selector**.

In database terminology, row selection is called filtering.

In [73]:
iris.iloc[1:3, 0:3]

| obs_id | sepal_length | sepal_width | petal_length |
|---|---|---|---|
| 1 | 4.9 | 3.0 | 1.4 |
| 2 | 4.7 | 3.2 | 1.3 |


You can apply slices to column names too. You don't need `.iloc[]` here.

In [74]:
iris.columns[0:3]

Index(['sepal_length', 'sepal_width', 'petal_length'], dtype='object')

### `.loc[]`

Filtering can also be done with `.loc[]`, which uses row and column labels rather than positions.

Here we ask for the rows labeled 1 through 3 and get exactly those three rows, whereas `.iloc[1:3]` returned only the rows at positions 1 and 2.

**Author note: This is by far the more useful of the two in my experience.**

In [75]:
iris.loc[1:3]

| obs_id | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |


In [76]:
iris.loc[1:3, ['sepal_width','sepal_length']]

| obs_id | sepal_width | sepal_length |
|---|---|---|
| 1 | 3.0 | 4.9 |
| 2 | 3.2 | 4.7 |
| 3 | 3.1 | 4.6 |


Note the different behavior of the slice here -- with `.loc`, `1:3` selects by label and includes both endpoints, so on this index it is short-hand for `[1,2,3]`, not a range of offsets.

In [77]:
iris.loc[[1,2,3]]

| obs_id | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |


A slice can only be written directly inside the brackets. Wrapping it in a list is invalid Python syntax, whatever the indexer:

In [78]:
iris.loc[[:-1]]

SyntaxError: invalid syntax (170941475.py, line 1)

Although this works:

In [None]:
iris.loc[:]

Subset on columns with a column name (as a string) or a list of strings:

In [None]:
iris.loc[1:3, ['sepal_length','petal_width']]

Select all rows, specific columns:

In [None]:
iris.loc[:, ['sepal_length','petal_width']]
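
The difference between `.iloc[]` and `.loc[]` is easy to miss on `iris`, because its labels happen to equal its positions. The sketch below uses a made-up frame whose integer labels do not match the row positions.

```python
import pandas as pd

# Hypothetical frame: the labels are 10, 20, 30, 40; the positions are 0, 1, 2, 3.
frame = pd.DataFrame({'v': ['a', 'b', 'c', 'd']}, index=[10, 20, 30, 40])

by_position = frame.iloc[1:3]   # positions 1 and 2; the end is excluded
by_label = frame.loc[10:30]     # labels 10 through 30; the end is included

position_labels = list(by_position.index)
label_labels = list(by_label.index)
```

Here `frame.loc[1:3]` would return no rows at all, since no labels fall between 1 and 3.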

### `.loc[]` with MultiIndex

Recall our dataframe with a two element index:

In [None]:
iris_w_idx

Selecting a single observation by its key, i.e. full label, uses a tuple:

In [None]:
iris_w_idx.loc[('setosa',0)] # df.at[r,c]

Selecting just the setosas:

In [None]:
iris_w_idx.loc['setosa']

Grabbing one species and one feature:

In [None]:
iris_w_idx.loc['setosa', 'sepal_length'].head()

This returns a series. If we want a dataframe back, we can use `.to_frame()`:

In [None]:
iris_w_idx.loc['setosa', 'sepal_length'].to_frame().head()

We use a tuple to index multiple index levels.

In [None]:
iris_w_idx.loc[('setosa', 5)]

Or a list to get multiple rows, a la fancy indexing.

In [None]:
iris_w_idx.loc[['setosa','virginica']]

### Another Example

In [None]:
df_cat = pd.DataFrame(
    index=['burmese', 'persian', 'maine_coone'],
    columns=['x'],
    data=[2,1,3]
)

In [None]:
df_cat

In [None]:
df_cat.iloc[:2]

In [None]:
df_cat.iloc[0:1]

In [None]:
df_cat.loc['burmese']

In [None]:
df_cat.loc[['burmese','maine_coone']]

## Boolean Filtering

It's very common to subset a dataframe based on some condition on the data.

Note that even though we are filtering rows, we are not using `.loc[]` or `.iloc[]` here.

Pandas knows what to do if you pass a boolean structure.

In [None]:
iris.sepal_length >= 7.5

In [None]:
iris[iris.sepal_length >= 7.5]

In [None]:
iris[(iris.sepal_length >= 4.5) & (iris.sepal_length <= 4.7)]

In [None]:
iris.loc[(iris.sepal_length >= 4.5) & (iris.sepal_length <= 4.7)]

In [None]:
iris.loc[(iris.sepal_length >= 4.5) & (iris.sepal_length <= 4.7), ['sepal_length']]
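
Conditions can also be combined with `|` (or) and `~` (not), and `.isin()` tests membership in a list. Each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>=` or `==`. A small sketch on a made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({
    'length': [4.4, 4.6, 5.0, 7.7],
    'species': ['setosa', 'setosa', 'setosa', 'virginica'],
})

short = frame['length'] < 4.7
is_virginica = frame['species'] == 'virginica'

either = frame[short | is_virginica]        # OR: at least one condition holds
neither = frame[~(short | is_virginica)]    # NOT: negate the combined mask
listed = frame[frame['species'].isin(['virginica', 'versicolor'])]

either_rows = list(either.index)
neither_rows = list(neither.index)
listed_rows = list(listed.index)
```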

### Masking

Here's an example of **masking** using boolean conditions passed to the dataframe selector:

Here are the **values** for the feature `sepal_length`:

In [None]:
iris.sepal_length.values

And here are **the boolean values** generated by applying a comparison operator to those values:

In [None]:
mask = iris.sepal_length >= 7.5

In [None]:
mask.values

In [None]:
mask.values.astype('int')

The two sets of values have the same shape.

We can now overlay the logical values over the numeric ones and keep only what is `True`:

In [None]:
iris.sepal_length[mask].values

## Working with Missing Data

Pandas primarily uses the special floating-point value `np.nan` ("not a number") from NumPy to represent missing data.

In [None]:
df_miss = pd.DataFrame({
    'x': [2, np.nan, 1], 
    'y': [np.nan, np.nan, 6]}
)

These values appear as `NaN`s:

In [None]:
df_miss

### `.dropna()` 

This will drop all rows with missing data in any column.

[Details](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)

In [None]:
df_drop_all = df_miss.dropna()
df_drop_all

The `subset` parameter takes a list of column names; only those columns are checked for missing values.

In [None]:
df_drop_x = df_miss.dropna(subset=['x'])
df_drop_x

### `.fillna()`

This will replace missing values with whatever you set it to, e.g. $0$s.

We can pass the results of an operation -- for example, to perform **simple imputation**, we can replace missing values in each column with the median value of the respective column:

In [None]:
df_filled = df_miss.fillna(df_miss.median())

In [None]:
df_filled
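
To see what the imputation does, it helps to count the missing values and look at the medians first. This sketch rebuilds the same small frame so that it runs on its own; the names `n_missing` and `medians` are ours.

```python
import numpy as np
import pandas as pd

df_miss = pd.DataFrame({
    'x': [2, np.nan, 1],
    'y': [np.nan, np.nan, 6],
})

n_missing = df_miss.isna().sum()   # missing values per column
medians = df_miss.median()         # medians ignore NaN by default

# fillna() matches the medians to the columns by label.
df_filled = df_miss.fillna(medians)

filled_x = df_filled['x'].tolist()
filled_y = df_filled['y'].tolist()
```

Column `x` is filled with the median of 2 and 1, and column `y` with its only observed value, 6.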

## Sorting

### `.sort_values()`

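Sort by values:
- `by` takes a column name or a list of column names
- `ascending` takes `True` or `False` (or a list with one value per column in `by`)
- `inplace=True` sorts the dataframe itself and returns `None`; by default a sorted copy is returned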

[Details](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)

In [None]:
iris.sort_values(by=['sepal_length','petal_width'], ascending=False)

### `.sort_index()`

Sort by index. This example sorts by descending index:

In [None]:
iris.sort_index(axis=0, ascending=False)

## Statistics

### `.describe()`

In [None]:
iris.describe()

In [None]:
iris.describe().T

In [None]:
iris.species.describe()

In [None]:
iris.sepal_length.describe()

### `value_counts()`

This is **a highly useful** function for showing the frequency for each distinct value.  

Parameters give the ability to sort by count or index, normalize, and more.  

[Details](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)

In [None]:
iris.species.value_counts()

In [None]:
SPECIES = iris.species.value_counts().to_frame('n')

In [None]:
SPECIES

Show proportions (fractions of the total) instead of counts:

In [None]:
iris.species.value_counts(normalize=True)

The method returns a series that can be converted into a dataframe.

In [None]:
SEPAL_LENGTH = iris.sepal_length.value_counts().to_frame('n')

In [None]:
SEPAL_LENGTH.head()

You can run `.value_counts()` on a column to get a kind of histogram:

In [None]:
SEPAL_LENGTH.sort_index().plot.bar(figsize=(8,4), rot=45);

In [None]:
iris.sepal_length.hist();

### `.mean()`

Operations like this generally exclude missing data.

So, it is important to convert missing data to values if they need to be counted in the denominator.

In [None]:
iris.sepal_length.mean()
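
Since `iris` has no missing values, the effect is easier to see on a tiny made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

mean_skipping = s.mean()               # NaN excluded: (1 + 3) / 2
mean_with_zero = s.fillna(0).mean()    # NaN counted as 0: (1 + 0 + 3) / 3
mean_strict = s.mean(skipna=False)     # NaN propagates to the result
```

Which of the three is appropriate depends on what a missing value means in your data.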

### `.max()`

In [None]:
iris.sepal_length.max()

### `.std()`

This computes the standard deviation. By default Pandas uses the sample formula (`ddof=1`).

In [None]:
iris.sepal_length.std()
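
Pandas and NumPy use different defaults here: `Series.std()` divides by `n - 1`, while `np.std()` divides by `n`. A short sketch on made-up values shows how to reconcile them with the `ddof` parameter:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_std = values.std()               # Pandas default: ddof=1
population_std = values.std(ddof=0)     # divide by n instead of n - 1
numpy_std = np.std(values.to_numpy())   # NumPy default: ddof=0
```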

### `.corr()`

In [None]:
# iris.corr() # Won't work because of string column

In [None]:
iris.corr(numeric_only=True)

In [None]:
iris_w_idx.corr()

Correlation can be computed on two fields by subsetting on them:

In [None]:
iris[['sepal_length','petal_length']].corr()

In [None]:
iris[['sepal_length','petal_length','sepal_width']].corr()

## Styling

In [None]:
iris.corr(numeric_only=True).style.background_gradient(cmap="Blues", axis=None)

In [None]:
iris.corr(numeric_only=True).style.bar(axis=None)

## Visualization

Scatter plot of the df columns `sepal_length` and `petal_length`, using Pandas' built-in plotting methods (which wrap Matplotlib).

Visualization will be covered separately in more detail.

In [None]:
iris.plot.scatter('sepal_length', 'petal_length');

In [None]:
iris.sort_values(list(iris.columns)).plot(style='o', figsize=(10,10));

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
scatter_matrix(iris, figsize=(10,10));

## Save to CSV File

It is common to save a dataframe to a CSV file. Pass the file path (relative or absolute), including the filename.  

There are also options to save to a database and to other file formats.

Common optional parameters:
- `sep` - delimiter
- `index` - saving index column or not

[Details](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)

In [None]:
iris.to_csv('./iris_data.csv')

## Read from CSV File

`read_csv()` reads a CSV file into a DataFrame.

It takes a file path (or URL).

[Details](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

In [None]:
iris_loaded = pd.read_csv('./iris_data.csv').set_index('obs_id')

In [None]:
iris_loaded.head(2)
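
The round trip can be tried without touching the disk by writing to an in-memory text buffer. This is a sketch on a made-up frame; `index_col` is a `read_csv()` parameter that does the same job as the `.set_index('obs_id')` call above.

```python
import io
import pandas as pd

frame = pd.DataFrame(
    {'x': [9, 1], 'y': [3, 2]},
    index=pd.Index(['row1', 'row2'], name='obs_id'),
)

buffer = io.StringIO()
frame.to_csv(buffer)              # the named index is written as the first column
csv_text = buffer.getvalue()

# Read it back, restoring the index from the 'obs_id' column.
restored = pd.read_csv(io.StringIO(csv_text), index_col='obs_id')

# With index=False the index is left out; with no path, a string is returned.
no_index_text = frame.to_csv(index=False)
```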