# Pandas

* Jake VanderPlas. 2016. *Python Data Science Handbook: Essential Tools for Working with Data*. O'Reilly Media, Inc.
* Chapter 3 - Data Manipulation with Pandas
* https://github.com/jakevdp/PythonDataScienceHandbook

Pandas provides:

* Rich I/O capabilities (read/write data from/to CSV, Excel, SQL, JSON, etc.).
* 1-dimensional (**Series**) and 2-dimensional tabular (**DataFrame**) data structures.
* Data flexibility (handles missing data, time series, and heterogeneous data types).
* Labeled rows and columns for data alignment.
* Flexible indexing, slicing, fancy indexing, and subsetting of large datasets.
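
As a minimal sketch of the I/O capabilities listed above, the following round-trips a small table through CSV text. The sample data and the use of an in-memory `io.StringIO` buffer (instead of a file on disk) are assumptions made only for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"name": ["a", "b", "c"], "value": [1.5, None, 3.0]})

# Write the table as CSV text into an in-memory buffer (no file needed).
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the same text back into a new DataFrame.
buffer.seek(0)
df_back = pd.read_csv(buffer)

print(df_back)
print(df_back.dtypes)
```

The missing entry is written as an empty CSV field and comes back as `NaN`, which is one instance of the missing-data handling mentioned in the list.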

In [1]:
import numpy as np
import pandas as pd
pd.__version__

'2.2.3'

In [2]:
# type TAB to get the pandas namespace
#pd.

## Pandas Series

* One-dimensional array of indexed data
* Two attributes:
   * `values` : NumPy array
   * `index` : an array-like object of type `pd.Index`

In [3]:
d = pd.Series([0.25, 0.5, 0.75, 1.0])
print(f'd =\n{d}')
print(f'{d.values = }')
print(f'{ d.index = }')

d =
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
d.values = array([0.25, 0.5 , 0.75, 1.  ])
 d.index = RangeIndex(start=0, stop=4, step=1)


Series can be created from NumPy arrays:

In [4]:
d = pd.Series(np.linspace(0,4,6))
d

0    0.0
1    0.8
2    1.6
3    2.4
4    3.2
5    4.0
dtype: float64

A series can be indexed just like a NumPy array:

In [5]:
d = pd.Series(np.arange(10,15))
print(f'{      d[3] = }')  # simple index --> scalar
print(f'{     d[3:] = }')  # slice --> series
print(f'{    d[3:4] = }')  # slice --> series
print(f'{d[[1,3,0]] = }')  # fancy index --> series
print(f'{    d[[1]] = }')  # fancy index --> series

      d[3] = 13
     d[3:] = 3    13
4    14
dtype: int64
    d[3:4] = 3    13
dtype: int64
d[[1,3,0]] = 1    11
3    13
0    10
dtype: int64
    d[[1]] = 1    11
dtype: int64


Unlike a NumPy array, whose integer index is implicit, a Pandas Series has an explicit **index**
   * It defaults to an integer index, but other label types can be used

In [6]:
d = pd.Series(np.arange(10,15), index=["a","b","c","d","e"])
print(f'd =\n{d}')
# WARNING "treating keys as positions is deprecated"  --> use series.iloc[pos]
#print(f'{d[1] = }')
print(f'{d["b"] = }')

d =
a    10
b    11
c    12
d    13
e    14
dtype: int64
d["b"] = 11
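
Because positional access through `d[1]` is deprecated on a labeled Series, the two intents can be spelled out with `.loc` (labels) and `.iloc` (positions). A minimal sketch using the same Series as above:

```python
import numpy as np
import pandas as pd

d = pd.Series(np.arange(10, 15), index=["a", "b", "c", "d", "e"])

by_label = d.loc["b"]     # label-based lookup
by_position = d.iloc[1]   # position-based lookup (0-based)

label_slice = d.loc["b":"d"]   # label slices include the end label
position_slice = d.iloc[1:3]   # positional slices exclude the end position

print(by_label, by_position)         # both are the element labeled "b"
print(list(label_slice.index))       # ['b', 'c', 'd']
print(list(position_slice.index))    # ['b', 'c']
```

Note the asymmetry: a label slice includes its end label, while a positional slice follows the usual Python half-open convention.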


In [7]:
d = pd.Series(np.arange(10,15), index=[3,23,1,2,456])
print(f'd =\n{d}')
print(f'{d[456] = }')

d =
3      10
23     11
1      12
2      13
456    14
dtype: int64
d[456] = 14


* A Series is a kind of specialized Python dictionary $\{index_{typed \,\&\, ordered} \to value_{typed}\}$
* `pd.Series(dict)` &rarr; create a Series from a dictionary
* `pd.Series.to_dict()` &rarr; create a dictionary from a Series<br><br>

In [8]:
d1 = pd.Series(np.arange(10,15), index=[3,23,1,2,456])
x = d1.to_dict()
d2 = pd.Series(x)
print(f'd1 =\n{d1}')
print(f'{x = }')
print(f'd2 =\n{d2}')

d1 =
3      10
23     11
1      12
2      13
456    14
dtype: int64
x = {3: 10, 23: 11, 1: 12, 2: 13, 456: 14}
d2 =
3      10
23     11
1      12
2      13
456    14
dtype: int64
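
The dictionary analogy also extends to a few dictionary-style operations. The sketch below reuses the Series from the cell above; the variable names are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

d1 = pd.Series(np.arange(10, 15), index=[3, 23, 1, 2, 456])

has_label = 23 in d1         # membership tests the index labels (the "keys")
has_value = 14 in d1         # 14 is a value, not a label
labels = list(d1.keys())     # keys() returns the index
pairs = list(d1.items())     # (label, value) pairs, in index order
fallback = d1.get(999, -1)   # like dict.get: default for a missing label

print(has_label, has_value)  # True False
print(labels)                # [3, 23, 1, 2, 456]
print(fallback)              # -1
```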


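**Slicing & Fancy Indexing** work with <u>any kind of index</u>, but with an integer index a slice such as `d[2:]` selects by position, while `d[2]` and `d[[1,23,3]]` select by label<br/><br/>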

In [9]:
d = pd.Series(np.arange(10,15), index=[3,23,1,2,456])
print(f'{        d[2] = }')  # simple index (label 2) --> scalar
print(f'{       d[2:] = }')  # slice (positional) --> series
print(f'{     d[23:2] = }')  # slice (positional, hence empty) --> series
print(f'{ d[[1,23,3]] = }')  # fancy index (labels) --> series
print(f'{    d[[456]] = }')  # fancy index (labels) --> series

        d[2] = 13
       d[2:] = 1      12
2      13
456    14
dtype: int64
     d[23:2] = Series([], dtype: int64)
 d[[1,23,3]] = 1     12
23    11
3     10
dtype: int64
    d[[456]] = 456    14
dtype: int64
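
With an integer index, plain `[]` mixes two conventions: a single key is looked up by label, while a slice is taken by position (which is why `d[23:2]` is empty in the output above). A sketch that removes the ambiguity with `.loc` and `.iloc`, on the same Series:

```python
import numpy as np
import pandas as pd

d = pd.Series(np.arange(10, 15), index=[3, 23, 1, 2, 456])

value_at_label_2 = d.loc[2]        # label 2
value_at_position_2 = d.iloc[2]    # third element

label_slice = d.loc[23:2]     # from label 23 through label 2, end included
position_slice = d.iloc[2:]   # from the third element to the end

print(value_at_label_2, value_at_position_2)   # 13 12
print(list(label_slice.index))                 # [23, 1, 2]
print(list(position_slice.index))              # [1, 2, 456]
```

The label slice works here because both labels exist and are unique, even though the index is not sorted.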


## Pandas DataFrame

* A DataFrame is a kind of specialized Python dictionary $\{column_{typed \,\&\, ordered} \to series\}$ whose series share a common row index
* It can also be seen as a kind of *Series of Series* with a common row index

<br>

In [31]:
cities = ['California','Texas','Florida','New York']
population = pd.Series([39538223,29145505,21538187,20201249], index=cities)
area = pd.Series([423967, 695662, 170312, 141297], index=cities)
states = pd.DataFrame({'population':population, 'area':area})
states

| | population | area |
|---|---:|---:|
| California | 39538223 | 423967 |
| Texas | 29145505 | 695662 |
| Florida | 21538187 | 170312 |
| New York | 20201249 | 141297 |
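
The dictionary view of a DataFrame can be checked directly: indexing with a column name returns a Series, and membership tests look at column names rather than row labels. A minimal sketch with the same data; the variable names are illustrative.

```python
import pandas as pd

cities = ['California', 'Texas', 'Florida', 'New York']
population = pd.Series([39538223, 29145505, 21538187, 20201249], index=cities)
area = pd.Series([423967, 695662, 170312, 141297], index=cities)
states = pd.DataFrame({'population': population, 'area': area})

area_col = states['area']              # dictionary-style access --> Series
has_area_column = 'area' in states     # membership tests the column names
has_texas_column = 'Texas' in states   # row labels are not "keys"
column_names = list(states.keys())     # keys() returns the columns

print(type(area_col).__name__)             # Series
print(has_area_column, has_texas_column)   # True False
print(column_names)                        # ['population', 'area']
```

Each column Series carries the same row index as the DataFrame itself.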


* Like the Series, DataFrame has an `index` attribute (*row index*)
* DataFrame has a `columns` attribute (*column index*)
* Both are of type `pd.Index`

<br>

In [33]:
states.index

Index(['California', 'Texas', 'Florida', 'New York'], dtype='object')

In [34]:
states.columns

Index(['population', 'area'], dtype='object')

In [35]:
type(states.index) == type(states.columns) == pd.Index

True

A DataFrame can be constructed from a single series:<br><br>

In [39]:
cities = ['California','Texas','Florida','New York']
population = pd.Series([39538223,29145505,21538187,20201249], index=cities)
states = pd.DataFrame(population, columns=['population'])
states

| | population |
|---|---:|
| California | 39538223 |
| Texas | 29145505 |
| Florida | 21538187 |
| New York | 20201249 |


A DataFrame can be constructed from a dictionary of series:<br><br>

In [37]:
cities = ['California','Texas','Florida','New York']
population = pd.Series([39538223,29145505,21538187,20201249], index=cities)
area = pd.Series([423967, 695662, 170312, 141297], index=cities)
states = pd.DataFrame({'population':population, 'area':area})
states

| | population | area |
|---|---:|---:|
| California | 39538223 | 423967 |
| Texas | 29145505 | 695662 |
| Florida | 21538187 | 170312 |
| New York | 20201249 | 141297 |


A DataFrame can be constructed from a list of dictionaries:<br><br>

In [49]:
data = [
    {'population':39538223, 'area':423967},
    {'population':29145505, 'area':695662},
    {'population':21538187, 'area':170312},
    {'population':20201249, 'area':141297}
]
states = pd.DataFrame(data, index=["California","Texas","Florida","New York"])
states

| | population | area |
|---|---:|---:|
| California | 39538223 | 423967 |
| Texas | 29145505 | 695662 |
| Florida | 21538187 | 170312 |
| New York | 20201249 | 141297 |


A DataFrame can be constructed from a two-dimensional NumPy array:<br><br>

In [50]:
data = np.array([[39538223,   423967],
          [29145505,   695662],
          [21538187,   170312],
          [20201249,   141297]])
states = pd.DataFrame(data,
                      columns=["population","area"],
                      index=["California","Texas","Florida","New York"])
states

| | population | area |
|---|---:|---:|
| California | 39538223 | 423967 |
| Texas | 29145505 | 695662 |
| Florida | 21538187 | 170312 |
| New York | 20201249 | 141297 |


A DataFrame can combine Series whose indexes are in different orders: rows are matched by label, not by position<br><br>

In [62]:
cities1 = ['California','Texas','Florida','New York']
population = pd.Series([39538223,29145505,21538187,20201249], index=cities1)
cities2 = ['Texas','Florida','New York','California']
area = pd.Series([695662, 170312, 141297, 423967], index=cities2)
states = pd.DataFrame({'population':population, 'area':area})
states

| | population | area |
|---|---:|---:|
| California | 39538223 | 423967 |
| Florida | 21538187 | 170312 |
| New York | 20201249 | 141297 |
| Texas | 29145505 | 695662 |
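
In the output above the rows come out in sorted label order rather than in the order of either input Series. If a specific row order matters, one option (a sketch, not the only approach) is to `reindex` the result with the desired labels:

```python
import pandas as pd

cities1 = ['California', 'Texas', 'Florida', 'New York']
population = pd.Series([39538223, 29145505, 21538187, 20201249], index=cities1)
cities2 = ['Texas', 'Florida', 'New York', 'California']
area = pd.Series([695662, 170312, 141297, 423967], index=cities2)

# Values are aligned by label, regardless of the order in each Series.
states = pd.DataFrame({'population': population, 'area': area})

# Restore the row order of the first Series.
ordered = states.reindex(cities1)

print(list(ordered.index))
print(ordered.loc['Texas', 'area'])
```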


A DataFrame can combine Series whose indexes hold different values
* *Missing* values are filled with `NaN`, which also converts the integer columns to floating point <br><br>

In [64]:
cities1 = ['California','Texas','Florida']
population = pd.Series([39538223,29145505,21538187], index=cities1)
cities2 = ['Texas','Florida','New York']
area = pd.Series([695662, 170312, 141297], index=cities2)
states = pd.DataFrame({'population':population, 'area':area})
states

| | population | area |
|---|---:|---:|
| California | 39538223.0 | NaN |
| Florida | 21538187.0 | 170312.0 |
| New York | NaN | 141297.0 |
| Texas | 29145505.0 | 695662.0 |
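
The `NaN` entries can be located and, if desired, dropped. The sketch below rebuilds the same DataFrame and inspects it; using `dropna()` is just one possible way to treat the incomplete rows.

```python
import pandas as pd

cities1 = ['California', 'Texas', 'Florida']
population = pd.Series([39538223, 29145505, 21538187], index=cities1)
cities2 = ['Texas', 'Florida', 'New York']
area = pd.Series([695662, 170312, 141297], index=cities2)
states = pd.DataFrame({'population': population, 'area': area})

column_dtypes = states.dtypes              # integer inputs become float64
missing_per_column = states.isna().sum()   # count of NaN per column
complete = states.dropna()                 # keep rows with no missing value

print(column_dtypes)
print(missing_per_column)
print(sorted(complete.index))              # ['Florida', 'Texas']
```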


## Pandas Index

## Indexing and Selection

## Operating on Data

## Handling Missing Data

## Hierarchical Indexing???

## Combining Datasets: Concat and Append