# Pandas I

![Red Panda](images/red_panda.webp)

## <span class="objectives">Objectives</span>

+ Understand what the **Pandas** module provides
+ Load data from **CSV** and other files
+ Access data tables
+ Extract rows and columns using conditions
+ Calculate statistics for rows or columns

## About Pandas
Pandas can:

+ Read data from file, database, or other sources
+ Deal with real-life issues such as invalid data
+ Clean and reshape data
+ Select and query data
+ Use built-in statistical functions
+ Work with NumPy and SciPy routines
+ Export datasets to many formats

*Pandas* is a package designed to make it easy to get, organize, and analyze large datasets. Its strengths lie in its ability to read from many different data sources, and to deal with real-life issues, such as missing,  incomplete, or invalid data. 

Pandas also contains functions for calculating means, sums and other kinds of analysis.

For selecting desired data, Pandas has many ways to select and filter rows and columns. 

It is easy to integrate Pandas with NumPy, SciPy, Matplotlib, and other scientific packages. 

Pandas works best with two-dimensional (row/column) data, which can be visualized like a spreadsheet. However, using multiindexes, you can simulate three- or more dimensional data. 

The `groupby` method provides split-apply-combine operations. Combined with aggregate methods, it produces powerful data summaries. It also supports transformations and gives easy access to plotting functions.
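As a rough illustration of split-apply-combine, the sketch below groups a small table by one column and summarizes another. The table, its column names (`region`, `units`), and its values are invented for this example.

```python
import pandas as pd

# Hypothetical sample data, invented for this sketch
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "units": [10, 20, 30, 40, 50],
})

# Split the rows by region, apply sum() to each group, combine into one Series
units_by_region = sales.groupby("region")["units"].sum()
print(units_by_region)

# agg() computes several summaries per group at once
summary = sales.groupby("region")["units"].agg(["sum", "mean", "count"])
print(summary)
```

The result of `groupby(...).sum()` is indexed by the group labels, so `units_by_region["East"]` retrieves the total for one group.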

<div class="alert alert-block alert-info">
<b>TIP:</b> It is easy to emulate R's `plyr` package via pandas. 
</div>

<div class="alert alert-block alert-info">
<b>NOTE:</b>
Here are some links that compare Pandas features to the equivalents in R:

+ https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html
+ https://towardsdatascience.com/cheat-sheet-for-python-dataframe-r-dataframe-syntax-conversions-450f656b44ca
+ https://heads0rtai1s.github.io/2020/11/05/r-python-dplyr-pandas/
</div>

<details>
    <summary>
        <b>Where did the name "Pandas" come from?</b>
    </summary>
    
**PAN**el **DA**ta **S**ystem

</details>   


## Tidy data

+ Tidy data is neatly grouped
+ Data
    - __Value__ = a single measurement
    - __Column__ = "variable"
    - __Row__ = "observation" (a group of related values)
+ Pandas works best with tidy data

A dataset contains _values_. Those values can be either numbers or strings. Values are grouped into _variables_, which are usually represented as _columns_. For instance, a column might contain "unit price" or "percentage of NaCl". A group of related values is called an _observation_. A _row_ represents an observation. Every combination of row and column is a single value.

When data is arranged this way, it is said to be "tidy". Pandas is designed to work best with tidy data. 



For instance, 

    Product    SalesYTD
    oranges    5000
    bananas    1000      
    grapefruit 10000

is tidy data. The variables are "Product" and "SalesYTD", and each observation (row) pairs the name of a fruit with its sales figure.


The following dataset is NOT tidy:

    Fruit     oranges bananas grapefruit
    SalesYTD  5000    1000    10000 

To make selecting data easy, Pandas dataframes always have variable labels (columns) and observation labels (row indexes). A row index could be something simple like increasing integers, but it could also be a time series, or any set of strings, including a column pulled from the data set. 
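One possible way to turn the non-tidy layout above into the tidy one is the `melt()` method. This is only a sketch: the variable names `wide`, `tidy`, and `tidy_indexed` are made up here, and `melt()` is just one of several reshaping tools.

```python
import pandas as pd

# The non-tidy layout from above: one column per fruit
wide = pd.DataFrame(
    {"oranges": [5000], "bananas": [1000], "grapefruit": [10000]},
    index=["SalesYTD"],
)

# melt() turns the column labels into the values of a "Product" variable
tidy = wide.melt(var_name="Product", value_name="SalesYTD")
print(tidy)

# Use the Product column as the row index (the observation labels)
tidy_indexed = tidy.set_index("Product")
print(tidy_indexed.loc["bananas", "SalesYTD"])
```

After `set_index("Product")`, the row index is a column pulled from the data set, as described above.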

<div class="alert alert-block alert-info">
<b>TIP</b> Variables are sometimes called "features", and observations are sometimes called "samples".
</div>

<div class="alert alert-block alert-info">
<b>NOTE</b>
See <a href="https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html">The "informal and code heavy version" of Hadley Wickham's paper on tidy data</a> for a detailed discussion.
</div>

## Pandas architecture
+ Two main structures: Series and DataFrame
+ Series – one-dimensional
+ DataFrame – two-dimensional

The two main data structures in pandas are the *Series* and the *DataFrame*. A Series is a one-dimensional indexed list of values, something like an ordered dictionary. A DataFrame is a two-dimensional grid, with both row and column indexes (like the rows and columns of a spreadsheet, but more flexible).

You can specify the indexes, or pandas will use successive integers. Each row or column of a DataFrame is a Series. 

<div class="alert alert-block alert-info">
<b>NOTE</b> Pandas used to support the <b>Panel</b> type, which was more or less a collection of DataFrames, but Panel was deprecated and then removed in favor of DataFrames with a MultiIndex.
</div>

## Setting up the notebook
Import modules and configure settings needed for this notebook

In [None]:
import pandas as pd
from numpy.random import default_rng  # random number generator

## Series
* Indexed list of values
* Similar to a dictionary, but ordered
* Can get sum(), mean(), etc.
* Use index to get individual values
* Indexes are not positional

A Series is an indexed sequence of values. Each item in the sequence has an index. The default index is a set of increasing integer values, but any set of values can be used.

For example, you can create a series with the values 5, 10, and 15 as follows:

    s1 = pd.Series([5,10,15])

This will create a Series indexed by [0, 1, 2]. To provide index values, add a second list:

    s2 = pd.Series([5,10,15], ['a','b','c'])

This specifies the indexes as 'a', 'b',  and 'c'. 

You can also create a Series from a dictionary. The keys become the index values, in the same order as in the dictionary:

    s3 = pd.Series({'b':10, 'a':5, 'c':15})

Most of the time, however, you will get a Series by selecting one column or one row from a DataFrame. 
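The three ways of creating a Series described above can be compared side by side. This sketch uses illustrative variable names (`s_default`, `s_labeled`, `s_from_dict`) and assumes a current pandas version, where a Series built from a dictionary keeps the dictionary's key order.

```python
import pandas as pd

s_default = pd.Series([5, 10, 15])                         # index 0, 1, 2
s_labeled = pd.Series([5, 10, 15], index=["a", "b", "c"])  # explicit labels
s_from_dict = pd.Series({"b": 10, "a": 5, "c": 15})        # keys become the index

print(list(s_default.index))
print(list(s_labeled.index))
print(list(s_from_dict.index))  # same order as the dictionary keys

# sort_index() returns a new Series ordered by label
print(s_from_dict.sort_index())

# Values are looked up by label
print(s_labeled["b"])
```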

In [None]:
NUM_DATA_POINTS = 10
index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

rng = default_rng()
data = rng.standard_normal(NUM_DATA_POINTS)

In [None]:
s1 = pd.Series(data, index=index)  # create series with specified index
s1

In [None]:
s2 = pd.Series(data)  # create series with auto-generated index (0, 1, 2, 3, ...)
s2

### Selecting elements

Select items from series

In [None]:
s1[['h', 'b']]

Select slice of elements

In [None]:
print(s1['b':'d'], "\n")

Select by expression

In [None]:
s1[s1 > 1]

In [None]:
print("sum(), mean(), min(), max():")
print(s1.sum(), s1.mean(), s1.min(), s1.max(), "\n")  # get stats on series

print("cumsum(), cumprod():")
print(s1.cumsum(), s1.cumprod(), "\n")  # running sum and running product

print('a' in s1)  # test for existence of label
print('m' in s1)  # test for existence of label
print()

s3 = s1 * 10  # create new series with every element of s1 multiplied by 10
print("s3 (which is s1 * 10)")
print(s3, "\n")

s1['e'] *= 5  # update one element of s1 in place (s3 is a separate Series and is unchanged)

print("boolean mask where s3 > 0:")
print(s3 > 0, "\n")  # create boolean mask from series

print("assign -1 where s3 < 5")
s3[s3 < 5] = -1  # set element to -1 wherever the mask (s3 < 5) is True
print(s3, "\n")

s4 = pd.Series([-0.204708, 0.478943, -0.519439])  # create new series
print("s4.max(), .min(), etc.")
print(s4.max(), s4.min(), s4.max() - s4.min(), '\n')  # print stats

s = pd.Series([5, 10, 15], ['a', 'b', 'c'])  # create new series with index
print("creating series with index")
print(s)

## DataFrames
+ Two-dimensional grid of values
+ Row and column labels (indexes)
+ Rich set of methods
+ Powerful indexing

A DataFrame is the workhorse of Pandas. It represents a two-dimensional grid of values, containing indexed rows and columns, something like a spreadsheet.

There are many ways to create a DataFrame. They can be modified to add or remove rows/columns. Missing or invalid data can be eliminated or normalized. 

DataFrames can be initialized from many kinds of data. See the table on the next page for a list of possibilities.
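As a small sketch of two common starting points, the example below builds one DataFrame from a dictionary of columns and another from a list of row dictionaries, then adds a column and removes a row. All names and values are invented for illustration.

```python
import pandas as pd

# From a dictionary of columns (the keys become column labels)
df_from_dict = pd.DataFrame(
    {"alpha": [100, 200, 300], "beta": [110, 210, 310]},
    index=["a", "b", "c"],
)

# From a list of row dictionaries (default integer row index)
df_from_rows = pd.DataFrame([
    {"alpha": 100, "beta": 110},
    {"alpha": 200, "beta": 210},
])

# Add a column, then remove a row; drop() returns a new DataFrame
df_from_dict["gamma"] = [120, 220, 320]
df_smaller = df_from_dict.drop("b")

print(df_from_dict)
print(df_from_rows)
print(df_smaller)
```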


<div class="alert alert-block alert-warning">
<b>IMPORTANT</b> Most of the time you will create Series and DataFrames by reading data.
</div>

<div class="alert alert-block alert-info">
<b>FUN FACT</b> The DataFrame object is modeled after R's data.frame 
</div>


In [None]:
columns = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
rows = ['a', 'b', 'c', 'd', 'e', 'f']

values = [  # sample data
    [100, 110, 120, 130, 140],
    [200, 210, 220, 230, 240],
    [300, 310, 320, 330, 340],
    [400, 410, 420, 430, 440],
    [500, 510, 520, 530, 540],
    [600, 610, 620, 630, 640],
]

df_simple = pd.DataFrame(values, index=rows, columns=columns) 
df_simple

Just column `gamma`

In [None]:
df_simple['gamma']

## Reading Data
* Supports many data formats
* Reads headings to create column indexes
* Auto-creates indexes as needed
* Can use a specified column as the row index

Pandas supports many different input formats. It will read file headings and use them to create column indexes. By default, it will use integers for row indexes, but you can specify a column to use as the index, or provide a list of index values.

The **read_...()** functions have many options for controlling and parsing input. For instance, if large integers in the file contain commas, the `thousands` option lets you declare the comma as the thousands separator (as used in the US), so the values are still parsed as numbers.

`read_csv()` is the most frequently used function, and has many options. It can also be used to read generic flat-file formats. `read_table()` is similar to `read_csv()`, but its default delimiter is a tab rather than a comma.

There are corresponding `to_...()` functions for many of the read functions. `to_csv()` and `to_excel()` are very useful. 
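The options mentioned above can be tried without a file on disk. This sketch assumes a made-up snippet of CSV text, wrapped in `io.StringIO` so that `read_csv()` can treat it like a file. It uses the `thousands` and `index_col` options, then writes the result back out with `to_csv()`.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = '''Product,SalesYTD
oranges,"5,000"
bananas,"1,000"
grapefruit,"10,000"
'''

# thousands="," parses "5,000" as the integer 5000;
# index_col makes the Product column the row index
df = pd.read_csv(io.StringIO(csv_text), thousands=",", index_col="Product")
print(df)
print(df.dtypes)

# Called without a path, to_csv() returns the CSV text as a string
print(df.to_csv())
```

Passing a file name to `to_csv()` instead would write the same text to that file.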


<div class="alert alert-block alert-info">
<b>NOTE:</b> See the notebook <b>PandaInputDemo</b> for examples of reading most types of input.
</div>

<div class="alert alert-block alert-info">
<b>TIP:</b> 
    See <a href="https://pandas.pydata.org/docs/user_guide/io.html">the Pandas I/O documentation</a> for more information
</div>


In [None]:
df_sales = pd.read_csv('../DATA/sales_records.csv')  # Read CSV data into dataframe. Pandas automatically uses the first row as column names
df_sales

## Data summaries
+ `describe()`  __basic statistical details__
+ `info()` __per-column details (shallow memory use)__
+ `info(memory_usage='deep')` __actual memory use__

You can call the `describe()` and `info()` methods on a dataframe to get summaries of the kind of data contained.

The `describe()` method, by default, shows statistics on all numeric columns. Add `include='int'` or `include='float'` to restrict the output to those types. `include='all'` will show all types, including "objects" (AKA text). 

To show just objects (strings), use `include='O'`. This will show all text columns. You can compare the *count* and *unique* values to check the _cardinality_ of the column, or how many distinct values there are. Columns with few unique values are said to have low cardinality, and are candidates for saving space by using the `Categorical` data type. 

The `info()` method will show the names and types of each column, as well as the count of non-null values. Adding `memory_usage='deep'` will display the total memory actually used by the dataframe. (Otherwise, it's only the memory used by the top-level data structures). 

Both methods can be called on either a Series or a DataFrame.
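The cardinality check described above might look like the following. This is a sketch on an invented six-row table. It calls `describe()` on the text column itself, which reports *count* and *unique*, and then converts the column to the categorical type.

```python
import pandas as pd

# Hypothetical sample data, invented for this sketch
df = pd.DataFrame({
    "region": ["East", "West", "East", "East", "West", "East"],
    "units": [10, 20, 30, 40, 50, 60],
})

# For a text column, describe() reports count, unique, top, and freq.
# Here count is 6 and unique is 2, so the column has low cardinality.
text_summary = df["region"].describe()
print(text_summary)

# Convert the low-cardinality text column to the categorical type
df["region"] = df["region"].astype("category")

# memory_usage="deep" reports the memory actually used
df.info(memory_usage="deep")
```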

### .info()
Display all columns, their types, and the number of valid (non-null) values in each column of a dataframe (or in a Series).

In [None]:
df_sales.info()

###  `DATAFRAME.describe()`
Get statistics on all numeric columns

In [None]:
print(df_sales.describe())

### `DATAFRAME.describe()` only integers
Only describe integer columns

In [None]:
df_sales.describe(include='int')

### `DATAFRAME.describe()` all columns

In [None]:
df_sales.describe(include='all')

## Basic Selecting

+ Similar to normal Python or numpy
+ Slices select rows

One of the real strengths of pandas is the ability to easily select desired rows and columns. This can be done with simple subscripting, like normal Python, or extended subscripting, similar to numpy. In addition, pandas has special methods and attributes for selecting  data. 

For selecting  columns, use the column name as the subscript value. This selects the entire column. To select multiple columns, use a sequence (list, tuple, etc.) of column names.

For selecting rows, use slice notation. This does not behave like subscripting in normal Python, where a slice and a single subscript address the same kind of element. In pandas, `dataframe[x:y]` selects rows x through y, but `dataframe[x]` selects column x.
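The difference between a single subscript and a slice can be seen on a small table. This sketch uses an invented four-row dataframe; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"alpha": [100, 200, 300, 400], "beta": [110, 210, 310, 410]},
    index=["a", "b", "c", "d"],
)

one_column = df["alpha"]             # a single name selects a column (a Series)
two_columns = df[["beta", "alpha"]]  # a list of names selects columns (a DataFrame)
first_rows = df[0:2]                 # an integer slice selects rows by position, end excluded
label_rows = df["b":"c"]             # a slice of labels selects rows, end included

print(one_column, two_columns, first_rows, label_rows, sep="\n\n")
```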


### Selecting columns

#### Selecting multiple columns


In [None]:
df_sales[['Region', 'Item Type', 'Units Sold']]

#### Selecting with dot notation

In [None]:
df_sales.Country

#### Selecting multiple columns

In [None]:
print(df_sales[['Region', 'Country', 'Item Type']])


## Using .loc and .iloc

### Using .loc
+ `.loc[row-spec, col-spec]` for names (strings or numbers)
+ `.loc[]` row or column specs can be
    - single name
    - iterable of names
    - range (inclusive) of names

The `.loc` and `.iloc` indexers provide more extensive and consistent selection of rows and columns for dataframes. They share the same syntax, but `.loc` selects by row and column _names_ (labels), and `.iloc` selects by integer _positions_.

Both indexers use the _getitem_ operator `[]`, with the syntax `[row-specifier, column-specifier]`. 

For `.loc[]`, the specifier can be either a single name, an iterable of names, or a range of names. The end of a range is inclusive. 

For `.iloc[]`, the specifier can be either a single numeric index (0-based), iterable of indexes, or a range of indexes. The end of a range is exclusive. 

To select all rows, or all columns, use `:`.

The `.at[]` property can be used to select a single value at a given row and column: `df.at[47, "color"]`. This is a faster equivalent of `.loc[row, col]` for a single value.

For both `.loc[]` and `.iloc[]`, omit the column specifier to select all columns.
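To make the inclusive and exclusive range rules concrete, this sketch selects the same block of a small invented dataframe with both indexers, and then shows position lists, a negative position, and `.at[]`.

```python
import pandas as pd

df = pd.DataFrame(
    [[100, 110, 120], [200, 210, 220], [300, 310, 320], [400, 410, 420]],
    index=["a", "b", "c", "d"],
    columns=["alpha", "beta", "gamma"],
)

by_label = df.loc["b":"c", "alpha":"beta"]  # label ranges include both ends
by_position = df.iloc[1:3, 0:2]             # position ranges exclude the end
print(by_label.equals(by_position))         # both select the same 2x2 block

picked = df.iloc[[0, 3], [2, 0]]  # lists of positions, in the order given
last_row = df.iloc[-1]            # last row, all columns (a Series)
single = df.at["c", "gamma"]      # single value by label

print(picked)
print(last_row)
print(single)
```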

### Using .iloc

+ `.iloc[row-spec, col-spec]` for 0-based position (integers only)
+ `.iloc[]` row or column specs can be
    - single number
    - iterable of numbers
    - range (exclusive) of numbers


In [None]:
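# These examples use df_simple, the small dataframe created earlier
print(df_simple.loc['b', 'delta'])  # single value

print(df_simple.loc['b'])  # one row, all columns
print(df_simple.loc[:, 'delta'])  # one column, all rows

print(df_simple.loc['b':'d', :])  # range of rows (inclusive), all columns
print(df_simple.loc['b':'d'])  # same result; the column specifier is omitted

print(df_simple.loc[:, 'beta':'delta'])  # range of columns
print(df_simple.loc['b':'d', 'beta':'delta'])  # range of rows and columns

print(df_simple.loc[['b', 'e', 'a']])  # list of rows
print(df_simple.loc[:, ['gamma', 'alpha', 'epsilon']])  # list of columns
print(df_simple.loc[['b', 'e', 'a'], ['gamma', 'alpha', 'epsilon']])  # both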



In [None]:
# Use real data ...


## Broadcasting
* Operation is applied across rows and columns
* Can be restricted to selected rows/columns
* Sometimes called vectorization
* Use apply() for more complex operations

If you multiply a dataframe by some number, the operation is broadcast, or vectorized, across all values. This is true for all basic math operations.

The operation can be restricted to selected columns. 

For more complex operations, the `apply()` method calls a function on each column of a dataframe (or on each row, with `axis=1`), or on each element of a Series. You can use the name of an existing function, or supply a user-defined function.

<div class="alert alert-block alert-info">
<b>TIP:</b> For simple functions, use the <em>lambda</em> (inline) syntax for defining the function.
</div>
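A minimal sketch of broadcasting and `apply()`, using a small invented dataframe rather than real data. The lambda functions are illustrative; any function that accepts a Series would work in their place.

```python
import pandas as pd

df = pd.DataFrame(
    {"alpha": [100, 200, 300], "beta": [110, 210, 310]},
    index=["a", "b", "c"],
)

doubled = df * 2             # the operation is applied to every value
df["beta"] = df["beta"] + 1  # the operation is restricted to one column

# By default apply() calls the function once per column;
# each call receives that column as a Series
col_range = df.apply(lambda col: col.max() - col.min())

# With axis=1 the function is called once per row instead
row_total = df.apply(lambda row: row["alpha"] + row["beta"], axis=1)

print(doubled, col_range, row_total, sep="\n\n")
```

For simple arithmetic like the row total, the broadcast form `df["alpha"] + df["beta"]` gives the same result and is usually faster; `apply()` is for logic that cannot be written as column operations.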


In [None]:
# EXAMPLE
# TODO: add an example that uses real data


## Counting unique occurrences
+ Use `.value_counts()`
+ Called from column or dataframe

To count the unique occurrences within a column, call the method `value_counts()` on the column.

It returns a `Series` object with the column values and their counts.



In [None]:
vc = df_sales['Region'].value_counts()
vc

You can also call `value_counts()` on the dataframe and pass the column name as an argument.

In [None]:
vc = df_sales.value_counts('Region')
vc

To show the relative frequency of each value (a fraction between 0 and 1) instead of the raw count, set the `normalize` argument to `True`.

In [None]:
vc = df_sales.value_counts('Region', normalize=True)
vc

## Creating new columns
+ Assign to column with new name
+ Use normal operators with other columns
  
For simple cases, it's easy to create new columns. Just assign a Series-like object to a new column name. The easy way to do this is to combine other columns with an operator or function. 

Any iterable object can be used, as long as its length matches the number of rows in the dataframe. 

<div class="alert alert-block alert-info">
<b>TIP:</b> If you assign a single value to a column, it will be replicated on all rows. 
</div>


In [None]:
cols = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
index = ['a', 'b', 'c', 'd', 'e', 'f']

values = [
    [100, 110, 120, 130, 140],
    [200, 210, 220, 230, 240],
    [300, 310, 320, 330, 340],
    [400, 410, 420, 430, 440],
    [500, 510, 520, 530, 540],
    [600, 610, 620, 630, 640],
]

df = pd.DataFrame(values, index=index, columns=cols)

def times_ten(x):
    return x * 10

df['zeta'] = df['delta'] * df['epsilon'] # product of two columns
df['eta'] = times_ten(df.alpha) # user-defined function
df['theta'] = df.sum(axis=1)  # sum each row (all columns so far, including zeta and eta)
df['iota'] = df.mean(axis=1)  # avg of each row (all columns so far, including theta)
# column kappa is avg of selected columns only
df['kappa'] = df.loc[:,'alpha':'epsilon'].mean(axis=1)

df['junk'] = "JUNK"
df['toast'] = range(6)

df


## Removing entries
* Remove rows or columns
* Use drop() method

To remove columns or rows, use the `drop()` method with the appropriate labels. Use `axis=1` to drop columns, or `axis=0` to drop rows.

%load ~/py/common/examples/pandas_drop.py
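The contents of the loaded example file are not shown here, so the following is only a sketch of `drop()` on a small invented dataframe. It shows the `axis` argument described above, plus the equivalent `columns=` keyword.

```python
import pandas as pd

df = pd.DataFrame(
    {"alpha": [100, 200, 300], "beta": [110, 210, 310], "gamma": [120, 220, 320]},
    index=["a", "b", "c"],
)

no_gamma = df.drop("gamma", axis=1)          # drop one column
no_a_c = df.drop(["a", "c"], axis=0)         # drop two rows
same_as_no_gamma = df.drop(columns="gamma")  # keyword form, no axis needed

print(no_gamma)
print(no_a_c)
print(df.shape)  # drop() returns a new DataFrame; df itself is unchanged
```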

## Useful pandas methods

See **Methods and attributes for fetching DataFrame/Series data** in the **PandasResources** notebook for DataFrame/series methods and attributes

## Chapter Review

<details>
    <summary>
        <b>What two-dimensional data structure is the "workhorse" of Pandas?</b>
    </summary>
DataFrame
</details> 

<details>
    <summary>
        <b>What method gives a statistical summary of numeric columns?</b>
    </summary>
.describe()
</details> 

<details>
    <summary>
        <b>Are DataFrame values read-only?</b>
    </summary>
No. Values can be changed by assignment, and columns can be added or removed.
</details> 

<details>
    <summary>
        <b>What method will count the distinct values in a column?</b>
    </summary>
value_counts()
</details> 

<details>
    <summary>
        <b>What _accessor_ makes it easy to select rows and columns by labels?</b>
    </summary>
.loc
</details> 

<details>
    <summary>
        <b>What method shows all columns and their data types?</b>
    </summary>
.info()
</details> 

<details>
    <summary>
        <b>What method can delete entire rows or columns?</b>
    </summary>
.drop()
</details> 

TODO: better labs

TODO: read in a simple csv

## Exercises

### Exercise 1

Read in the file `sales_records.csv` as shown in the early part of the chapter. Add three new columns to the dataframe:

+ Total Revenue (__units sold x unit price__)
+ Total Cost (__units sold x unit cost__)
+ Total Profit (__total revenue - total cost__)

### Exercise 2

The file `parasite_data.csv`, in the DATA folder, has some results from analysis on some intestinal parasites (not that it matters for this exercise...). 
Read parasite_data.csv into a DataFrame. Print out all rows where the Shannon Diversity is >= 1.0.

In [None]:
from IPython.display import HTML

def css_styling():
    """Load the notebook's custom stylesheet and return it as HTML."""
    with open("./styles/custom.css", "r") as css_file:
        styles = css_file.read()
    return HTML(styles)

css_styling()