# Lab 6 Introduction to Pandas

- What is Pandas?
- Series
    - Time Series
- Introduction of DataFrame
    - Index Object
    - Generating and Viewing
    - Indexing and Selection
    - Operations
    - Handling Missing Data
- Advanced DataFrame
    - Advanced Operations
    - Hierarchical indexing (MultiIndex)
    - Data Reshaping and Pivot Table

## What is Pandas?
- High performance open-source Python library for data manipulation and analysis
- Developed (on top of NumPy) by Wes McKinney in 2008
- Popular versions: 0.20 (2017) ~ 0.25 (2019)
- Latest version (Jan 2020): 1.0
- We usually use an alias pd for pandas, like np for numpy

In [None]:
import pandas as pd
pd.__version__

import numpy as np

## Why Pandas?
- NumPy is good for well-formatted and homogeneous data. But real-world data is more complex and noisy.
- Pandas is designed for data wrangling (or munging): transforming data from its raw form into another format that is more appropriate for data analytics. It works similarly to spreadsheet programs.

**Installation**
- with Conda: `conda install pandas`
- with PyPI (Python Package Index): `pip install pandas`

**Reference**
- https://pandas.pydata.org/docs/index.html 
- Getting started: https://pandas.pydata.org/docs/getting_started/index.html
- User guide: https://pandas.pydata.org/docs/user_guide/index.html

## Pandas Data Structures

- **1-dimensional: _Series_**
    - 1D labeled homogeneously-typed array
- **2-dimensional: _DataFrame_**
    - General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
- _**Remark**: there was a 3-dimensional data structure **Panel** in earlier versions of Pandas (0.24 or before), which was removed in version 0.25._
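
As a minimal sketch of the two structures, the following builds one Series and one DataFrame and prints their dtypes. The names `grades` and `students` and all values are made up for illustration.

```python
import pandas as pd

# A Series: one labeled column of values sharing a single dtype
grades = pd.Series([86.5, 72.0, 91.0], index=['amy', 'ben', 'cat'])

# A DataFrame: several labeled columns, each with its own dtype
students = pd.DataFrame({'birth': [2001, 2000, 2002],
                         'grade': [86.5, 72.0, 91.0]},
                        index=['amy', 'ben', 'cat'])

print(grades.dtype)      # dtype of the single Series
print(students.dtypes)   # one dtype per column
print(students.shape)    # (rows, columns)
```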


### A Quick Review of Dict (Dictionary)
**Dict** in Python is a flexibly-sized collection of key-value pairs, where key and value are Python objects.

In [None]:
# E.g., the following statement creates a dict with three key-value pairs:
d = {'name':'CHAN TAI MAN', 'birth':2001, 'grade':86.5}
print(d)

In [None]:
# To access a value in a dict, we need to know its corresponding key:
print(d['name'], d['birth'], d['grade'])

In [None]:
# We can add new key-value pair directly by assignment:
d['phone'] = '65658888'
print(d)

## Series
**Series** is a 1-dimensional **labeled array**, capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
- All data items in a Series share one data type (dtype); mixed values are stored with the generic `object` dtype
- Each item in the array has an associated **label**, also called its **index**.
- Why do we need labels?
    - Like a dict, a data item can be quickly located by its label.
    - **Remark 1**: Unlike a dict, labels in a Series don't need to be unique
    - **Remark 2**: Unlike a dict, a Series is normally treated as fixed-size after its creation; most operations that add or remove items return a new Series
- Important attributes of a Series **s**:

|attribute|description|
|---------|-----------|
|_s.values_|its data|
| _s.index_|its labels|
|_s.dtype_|its data type|
|_s.size_|its size|
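
The attributes in the table, together with Remark 1 on non-unique labels, can be tried on a small hand-made Series. This is only a sketch; the values are arbitrary.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'a'])  # label 'a' appears twice

print(s.values)   # underlying data as a NumPy array
print(s.index)    # the labels
print(s.dtype)    # data type shared by all items
print(s.size)     # number of items

# A repeated label selects every matching item, so the result is a Series
print(s['a'])
# A unique label selects a single value
print(s['b'])
```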

## Examples: Create Series by pd.Series()

Create a Series object from an ndarray with random values

In [None]:
s = pd.Series(np.random.randn(3))
print(s)

Create a Series object from an ndarray with a user-specified index.

In [None]:
s = pd.Series(np.random.randn(3), index=['a', 'b', 'c'])
print(s)

Create a Series object from an ndarray with repeated (non-unique) index labels.

In [None]:
s = pd.Series(np.random.randn(3), index=['a', 'a', 'a'])
print(s)

### Examples: Create Series by pd.Series()

Create a Series object from a scalar with a user-specified index

In [None]:
s = pd.Series(10.0, index=['a', 'b', 'c', 'd', 'e'])
print(s)
print(s.size)
print(s.values)
print(s.index)

Create a Series object from a dict without specifying the index

In [None]:
d = {'b': 1, 'a': 0, 'c': 2}
s = pd.Series(d)
print(s)

### Examples: Create Series by pd.Series()

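Create a Series object from a dict with a user-specified index. **Caution**: the index may not match the keys of your dict! Keys that are missing from the index are dropped, and index labels that are missing from the dict get NaN.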

In [None]:
d = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s)

**Remarks**
- NaN (not a number) is the standard missing data marker used in pandas.
- NumPy explicitly defines numpy.nan (or np.nan). It is very useful in data analytics
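
A short sketch of how NaN behaves in practice, using a hand-made Series: NaN never compares equal to anything, so missing values are detected with `isna()`, and reductions such as `sum()` skip them unless told otherwise.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])

print(np.nan == np.nan)      # NaN never compares equal, even to itself
print(s.isna())              # so use isna() to detect missing values
print(s.sum())               # pandas skips NaN by default
print(s.sum(skipna=False))   # NaN propagates when skipping is disabled
```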

## Examples: Series acts similarly to a 1-D ndarray

Although we defined our own index, we can still access the values by integer position 0, 1, …, 4 through `s.iloc`.

In [None]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
print(s.iloc[0])   # first item by position (plain s[0] is deprecated for a non-integer index)
print(s.iloc[:3])  # first three items by position

In [None]:
print(s['a'])
s['c'] = 1.2345
print(s)
print('e' in s)
print('f' in s)

## Examples: Series acts similarly to a 1-D ndarray

Vectorized operations and label alignment

In [None]:
print(s)
print(s*2)
print(np.exp(s))
print(s[1:])
print(s[:-1])
print(s[1:] + s[:-1]) # be careful

### Remark
- Operations between Series automatically align the data based on label.
- If a label is found in only one of the operands, the result for that label is NaN
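
The alignment rule can be seen on two small Series whose labels only partly overlap. The sketch also shows `Series.add()` with `fill_value`, one way to avoid the NaN results when a missing operand should count as 0; the values are made up.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
y = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

total = x + y                     # labels 'a' and 'd' exist in only one operand
print(total)

filled = x.add(y, fill_value=0)   # treat a missing operand as 0 instead
print(filled)
```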

## Time Series
Time series data is a sequence of data measured or observed at many points in time.

- Very often, a time series has a fixed frequency, i.e., the time interval between two consecutive measurements is constant.

The Python standard library provides the **datetime** module to handle date and time data

In [None]:
# the 1st datetime is the module name, the 2nd is class name
from datetime import datetime 
from datetime import timedelta

- A **datetime** object stores the **date** and **time** information. 
- A **timedelta** object represents the **time difference** between two datetime objects
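
A minimal sketch of how the two classes interact, with arbitrary example dates: subtracting two datetime objects yields a timedelta, and adding a timedelta to a datetime yields another datetime.

```python
from datetime import datetime, timedelta

start = datetime(2020, 3, 5, 9, 0, 0)
end = datetime(2020, 3, 7, 21, 30, 0)

gap = end - start                 # subtracting datetimes gives a timedelta
print(gap)
print(gap.days, gap.seconds)      # whole days, plus the leftover seconds
print(gap.total_seconds())        # the full difference in seconds

later = start + timedelta(hours=36)   # adding a timedelta gives a datetime
print(later)
```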

## Time Series Examples

Get the current datetime

In [None]:
now1 = datetime.now()
print(now1)
print(now1.year, now1.month, now1.day)
print(now1.hour, now1.minute, now1.second, now1.microsecond)

Get the current datetime again, a bit later

In [None]:
now2 = datetime.now()
print(now2)

## Time Series Examples

To see the difference between the two time values
- ~1000 microseconds (0.001 second) when the two cells run back to back
- the exact value differs from run to run and across hardware

In [None]:
delta = now2 - now1
print(delta)

Time calculation

In [None]:
start = datetime(2020, 3, 5)
print(start)
end = start + timedelta(days=12)  # 12 days after start
print(end)

## Time Series Examples

Pandas provides the **date_range()** function to easily create a sequence of time points as an index.

In [None]:
# generate 100 time points with an interval of 1 second
rng = pd.date_range('1/1/2020', periods=100, freq='S')
rng

In [None]:
# generate a time Series with 100 values, using rng as the index
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts

In [None]:
# generate 100 time points with an interval of 1 day
rng2 = pd.date_range('1/1/2020', periods=100, freq='D')
ts2 = pd.Series(np.random.randint(0, 500, len(rng2)), index=rng2)
ts2

## Using COVID-19 data as an example

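Use pd.read_csv() to load the data. The file has a single data column, so `.squeeze('columns')` converts the resulting one-column DataFrame into a Series. (The older `squeeze=True` argument of read_csv() was removed in pandas 2.0.)

In [None]:
ser = pd.read_csv('cases_mainland_china.csv', parse_dates=['date'], index_col='date').squeeze('columns')
print(ser.head())
print(ser.tail())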

In [None]:
# Time Series Visualization
ser.plot.line()

In [None]:
ser.plot.bar()

In [None]:
# Time Series Analysis
# How to find the new cases every day?
diff=ser.diff()
diff.plot.bar()
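
Since the CSV file may not be at hand, here is a self-contained sketch of what `diff()` computes. The cumulative counts below are made up for illustration and are not taken from the data set.

```python
import pandas as pd

# Made-up cumulative counts, for illustration only
days = pd.date_range('2020-01-01', periods=5, freq='D')
cumulative = pd.Series([10, 15, 15, 27, 40], index=days)

new_per_day = cumulative.diff()   # value[t] - value[t-1]
print(new_per_day)                # the first entry is NaN: no previous day

# Equivalent formulation with shift()
same = cumulative - cumulative.shift(1)
print(new_per_day.equals(same))
```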

## Series Summary

- Attributes of a Series object s:
    - s.values, s.index, s.dtype, s.size
- To create a Pandas Series object by pd.Series():
    - from an ndarray with and without index
    - from a scalar with index
    - from a dict with and without index
- To access data in a Series object s:
    - by integer position: `s.iloc[0]`, `s.iloc[1]`, …
    - by the user-specified index like a dict
    - vectorized operations like an ndarray
- Time Series
    - datetime and timedelta objects (in the datetime module)
    - **pd.date_range()** can generate a sequence of time points
    - **pd.read_csv()** can read CSV data as a Series
    - Visualization: **Series.plot.line()** and **Series.plot.bar()**

## DataFrame

A 2-dimensional labeled data structure with columns of potentially different types
- Similar to a spreadsheet in Excel
- Each column is actually a Series object, so a DataFrame can be regarded as a dict of Series objects

A DataFrame can be generated by **pd.DataFrame()** from different sources:
- Dict of 1-D ndarrays, lists, dicts, or Series
- 2-D ndarrays
- A Series
- Another DataFrame 
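
The four kinds of source listed above can each be passed to `pd.DataFrame()`. The following sketch uses arbitrary small inputs; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

from_dict = pd.DataFrame({'x': [1, 2], 'y': [3.0, 4.0]})          # dict of lists
from_array = pd.DataFrame(np.zeros((2, 3)), columns=list('abc'))  # 2-D ndarray
from_series = pd.DataFrame(pd.Series([5, 6], name='z'))           # a Series
from_frame = pd.DataFrame(from_dict, columns=['y'])               # another DataFrame

for frame in (from_dict, from_array, from_series, from_frame):
    print(frame.shape, list(frame.columns))
```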

## Index object

**Index object** is used to organize your data so that a data item can be quickly accessed through its index. 

In Pandas, a Series or DataFrame object contains an explicit **Index object** that allows us to quickly reference a data item. 
*Remark: an Index object contains all the indices of the data.*

In Pandas, there are many types of Index object, such as
- RangeIndex, Int64Index, Float64Index, DatetimeIndex (in pandas 2.0 and later, Int64Index and Float64Index are folded into the plain Index class)
- CategoricalIndex, IntervalIndex
- MultiIndex

The Index object can be thought of either as an **immutable array** or an **ordered set**.
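
A brief sketch of what "immutable" means here: assigning to a single element of an Index raises a TypeError, so labels are changed by assigning a whole new Index. The flag `immutable` is a helper name introduced only for this illustration.

```python
import pandas as pd

ind = pd.Index([2, 4, 6])

immutable = False
try:
    ind[0] = 100                 # item assignment is not allowed
except TypeError as err:
    immutable = True
    print('TypeError:', err)

# To change labels, build a new Index and assign it as a whole
s = pd.Series([1.0, 2.0, 3.0], index=ind)
s.index = pd.Index(['a', 'b', 'c'])
print(s)
```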

### Examples of Index Object in Series

In [None]:
# create a Series object from a dict
d = {'b': 1, 'a': 0, 'c': 2}
s = pd.Series(d)
print(s)
print(s.index)

In [None]:
# reindex (Remark: s will not be affected)
s2 = s.reindex(['a', 'b', 'c', 'd'])
print(s2)

In [None]:
# create a Series object from an ndarray
s = pd.Series(np.random.randn(5))
print(s.index)

### Examples of Creating an Index Object Explicitly

In [None]:
# create integer index
ind = pd.Index([1, 3, 5, 7, 9])
print(ind)

In [None]:
# create object index
strd = pd.Index(['Tom', 'Bob', 'Rose'])
print(strd)

In [None]:
# create a DatetimeIndex
rng = pd.date_range('1/1/2020', periods=4, freq='M')
print(rng)

In [None]:
# An index object can operate like an immutable array
ind = pd.Index([2, 4, 6, 8, 10])
print(ind[1])
print(ind[2:])
print(ind[:3])
print(ind[1:5:2])
print(ind.size, ind.shape, ind.ndim, ind.dtype)

### Index as an Ordered Set ###
- Sometimes we need to process multiple Series or DataFrames using set arithmetic
- Let's introduce four basic set operations:
    1. Union (or)
    2. Intersection (and)
    3. Difference (-)
    4. Symmetric difference (xor)
    
**Remark**: Symmetric difference is equal to (Union - Intersection)

In [None]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
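# use the set methods: in pandas 2.x the operators |, & and ^ act element-wise instead
print(indA.union(indB))                 # union ( A or B )
print(indA.intersection(indB))          # intersection ( A and B )
print(indA.difference(indB))            # difference ( A - B )
print(indA.symmetric_difference(indB))  # symmetric difference ( (A or B) - (A and B) )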
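
The remark that the symmetric difference equals (Union - Intersection) can be checked directly with the Index set methods. This is a small sketch that reuses the same two example indexes.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

union = indA.union(indB)
common = indA.intersection(indB)
sym = indA.symmetric_difference(indB)

# (A or B) minus (A and B) should contain the same labels as the symmetric difference
print(sym)
print(union.difference(common))
print(sym.equals(union.difference(common)))
```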

## Examples: Generate DataFrame from dict of ndarrays or lists

Generate a DataFrame from a dict of lists

In [None]:
# without passing the index, the default index range(n) will be used
#     where n is the array length
d = {'one':[1., 2., 3., 4.], 
     'two':[4., 3., 2., 1.]}
df = pd.DataFrame(d)
print(df)

In [None]:
df = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
print(df)

In [None]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)

## Examples: Generate DataFrame from dict of Series

In [None]:
# the index specifies which rows should be used
df = pd.DataFrame(d, index=['d', 'b', 'a'])
print(df)

In [None]:
df = df.reindex(['a', 'b', 'c', 'd'])
print(df)

In [None]:
# columns specifies which columns should be used; 'three' is not a key of d, so it is filled with NaN
df = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
print(df)

In [None]:
population_dict = {'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135}
population = pd.Series(population_dict)

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

states = pd.DataFrame({'population': population, 'area': area})
print(states)

## Examples: Generate DataFrame from 2-D ndarray

In [None]:
dates = pd.date_range('20200305', periods=5)
print(dates)

In [None]:
df = pd.DataFrame(np.random.randn(5, 4), index=dates, columns=list('ABCD'))
print(df)

**Remark**: list('ABCD') gives us `['A', 'B', 'C', 'D']`

## Examples Viewing Raw Data

We can view the DataFrame as an enhanced two-dimensional array. The raw underlying data array of a DataFrame can be accessed through **df.values**.


In [None]:
print(df.values)
type(df.values)
print(df.values[0])

In [None]:
# Viewing Frame Data
df.head(3)
df.tail(3)

In [None]:
# Display the Index and Column Labels
df.index # row index
df.columns # column index

## Examples Statistic Summary

**df.describe()** displays summary statistics of the numeric columns

In [None]:
df.describe()

In the 2-D DataFrame, there are two axes.
- axis = 0: the vertical direction, running down the rows (along the row index)
- axis = 1: the horizontal direction, running across the columns (along the column labels)
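
To make the two axes concrete, this sketch sums a small made-up DataFrame along each of them. With axis=0 the rows are collapsed and one value per column remains; with axis=1 the columns are collapsed and one value per row remains.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]}, index=['x', 'y', 'z'])

col_totals = df.sum(axis=0)   # collapse the rows: one total per column (A, B)
row_totals = df.sum(axis=1)   # collapse the columns: one total per row (x, y, z)

print(col_totals)
print(row_totals)
```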

In [None]:
df.sort_index(axis=1, ascending=False) # Sorting by an axis

In [None]:
df.sort_values(by='B') # Given many rows, how to sort them according to the values in some column?

In [None]:
df.sort_values(by='B', ascending=False) # try it

DataFrame offers the df.loc and df.iloc attributes to access the data.
- `df.loc[]` allows indexing and slicing using explicit index (i.e., label)
- `df.iloc[]` allows indexing and slicing using the implicit integer index (i.e., location)

|Operation|Example|Return type|
|---------|-------|-----------|
|Select a column by label|`df[label]`|Series|
|Select a row by label|`df.loc[label]`|Series|
|Select a row by integer location|`df.iloc[loc]`|Series|
|Slice rows|`df[5:10]`<br>`df.iloc[5:10]`<br>`df.loc[a:b]`|DataFrame|
|Select some columns in rows|`df.iloc[a:b, c:d]`|DataFrame|

## Examples  Indexing and Selection

A DataFrame can be regarded as a dict of Series objects, where each column is a Series that can be located through its column label.

In [None]:
df['A']

In [None]:
df.A #try it

Select a Single Row

In [None]:
df.loc['2020-03-06'] # by row label

In [None]:
df.iloc[0] # by row position

## Examples Indexing and Selection

Slicing Rows

In [None]:
df[1:3]

In [None]:
# the same as df[1:3]
# row 3 is excluded
df.iloc[1:3]

In [None]:
# row '2020-03-07' is included
df.loc['2020-03-06':'2020-03-07']

## Examples Indexing and Selection

Select Some Columns of Some Rows

In [None]:
# select the first 2 columns in Row 3-4
df.iloc[3:5, 0:2]

In [None]:
# select the first and the third columns in Rows[1,2,4]
df.iloc[[1, 2, 4], [0, 2]]

Add a New Column

In [None]:
s1 = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('20200305', periods=5))
print(s1)

In [None]:
df['F'] = s1
print(df)

## Examples Indexing and Selection

Set Values by `at[]`, `iat[]`, `loc[]`, `iloc[]`

In [None]:
df.at['2020-03-05', 'A'] = 0
df.iat[0, 1] = 0
print(df)

In [None]:
df.loc[:, 'D'] = np.array([5] * len(df))
print(df)

In [None]:
#try it
df.iloc[:, 1] = np.array([4]*len(df))
print(df)

## Operations on DataFrames

If we apply a NumPy universal function on a Pandas object, the result will be another Pandas object with the Index preserved.


Universal functions: https://docs.scipy.org/doc/numpy/reference/ufuncs.html#ufunc 

In [None]:
rng = np.random.RandomState(42)
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
print(df)

np.sin(df * np.pi / 4)

**np.random.RandomState(42)** creates a random number generator with a fixed seed, so the example produces the same numbers on every run.

### Index Alignment

For binary operations on two DataFrame objects, Pandas will align indices in the process of performing the operation.

In [None]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
print(A)

In [None]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
print(B)

- the new row Index is the union of the two row Index
- the new column label is also the union
- NaN for missing values

In [None]:
print(A+B)
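
With fixed values instead of random ones, the union behaviour is easier to follow. The sketch below also uses `DataFrame.add()` with `fill_value=0`, which treats a value missing from one frame as 0 rather than producing NaN; the numbers are arbitrary.

```python
import pandas as pd

A = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
B = pd.DataFrame([[10, 20, 30], [40, 50, 60], [70, 80, 90]], columns=list('BAC'))

summed = A + B                     # NaN wherever only one frame has a value
filled = A.add(B, fill_value=0)    # a value missing in one frame is treated as 0

print(summed)
print(filled)
```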

### Index Alignment and subtraction

When we perform operations between a DataFrame and a Series, the index and column alignment is similarly maintained.

In [None]:
A = rng.randint(10, size=(3, 4))
df = pd.DataFrame(A, columns=list('QRST'))
print(df)

subtraction between a DataFrame and one of its rows is applied row-wise

In [None]:
df - df.iloc[0]

subtraction between a DataFrame and one of its columns is applied column-wise

In [None]:
df.subtract(df['R'], axis=0)

### Applying function

**df.apply(func, axis)** applies a given function along an axis of the DataFrame: to each column when axis=0 (the default), or to each row when axis=1.

In [None]:
df = pd.DataFrame(np.random.randint(0,100,size = (6,2)), 
                  columns = ['first','second'])
print(df)

In [None]:
#try it
df1 = df.apply(np.sqrt)
print(df1)

In [None]:
# axis=0 (default): func receives each column, so the sum runs downward
df2 = df.apply(np.sum)
print(df2)

In [None]:
# axis=1: func receives each row, so the sum runs rightward
df3 = df.apply(np.sum, axis=1)
print(df3)
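
`apply()` also accepts user-defined functions. As a sketch, the lambda below computes the range (max minus min) of whatever it receives: a column by default, or a row with axis=1. The data is made up.

```python
import pandas as pd

df = pd.DataFrame({'first': [4, 9, 16], 'second': [1, 25, 36]})

col_range = df.apply(lambda col: col.max() - col.min())          # one value per column
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)  # one value per row

print(col_range)
print(row_range)
```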

## Handling Missing Data

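The two indexes do not fully align: `df` starts at '20200301' while `s1` starts at '20200302'. After the assignment, column 'F' is NaN for the first date, and the last value of `s1` is dropped.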

In [None]:
dates = pd.date_range('20200301', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20200302', periods=6))
print(s1)
df['F'] = s1
print(df)

**df.reindex(index, columns)** returns a new DataFrame conformed to the specified index and columns. Labels that are not present in the original data are filled with `NaN`.

In [None]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
print(df1)
df1.loc[dates[0]:dates[1], 'E'] = 1
print(df1)

## Handling Missing Data

**df.dropna()** and **df.fillna()** are used to handle missing data.

In [None]:
# drop rows with NaN
df2=df1.dropna()
print(df2)

In [None]:
# drop columns with NaN
df3=df1.dropna(axis=1)
print(df3)

In [None]:
# replace NaN with a value
df4=df1.fillna(value=5)
print(df4)
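
Both methods have options beyond the defaults shown above. The following sketch, on a small made-up DataFrame, drops rows more selectively with `how` and `thresh`, fills each column with its own mean, and carries values forward with `ffill()`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0],
                   'C': [7.0, np.nan, 9.0]})

not_all_nan = df.dropna(how='all')   # drop a row only if every value is NaN
at_least_two = df.dropna(thresh=2)   # keep rows with at least 2 non-NaN values
mean_filled = df.fillna(df.mean())   # fill each column with its own mean
forward = df.ffill()                 # carry the last valid value forward

print(not_all_nan)
print(at_least_two)
print(mean_filled)
print(forward)
```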

## Advanced DataFrame Operations

- Concatenate two or more DataFrames
- Join two DataFrames
- Groupby

### **Question: How to combine two or more DataFrames?** ####

**pd.concat()** performs **concatenation operation** along an axis (by default, axis = 0) while performing optional set logic (union or intersection) of the indexes (if any) on the other axes

## Concatenation by pd.concat()

### A simple example: the input DataFrames have same column labels

In [None]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                    index=[0, 1, 2, 3])
frames = [df1, df2]

In [None]:
result1 = pd.concat(frames)
print(result1)

result2 = pd.concat(frames, ignore_index = True)
print(result2)

## Concatenation by pd.concat()

### A complex example: the input DataFrames have different column labels

By default, the union of the column labels is used. Entries that do not exist in an input DataFrame are filled with NaN

In [None]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])
df2 = pd.DataFrame({'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7'],
                    'E': ['E4', 'E5', 'E6', 'E7']},
                    index=[0, 1, 2, 3])
frames = [df2, df1]

In [None]:
result = pd.concat(frames, ignore_index = True, sort=False)
print(result)

In [None]:
frames = [df1, df2]
result = pd.concat(frames, ignore_index = True, sort=False)
print(result)
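
The set logic on the other axis is chosen with the `join` parameter of pd.concat(). As a sketch with two small made-up frames, the default `join='outer'` keeps the union of the column labels, while `join='inner'` keeps only the columns shared by all inputs.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'B': ['B2', 'B3'], 'C': ['C2', 'C3']})

outer = pd.concat([df1, df2], ignore_index=True)                # union of columns (default)
inner = pd.concat([df1, df2], ignore_index=True, join='inner')  # intersection of columns

print(outer)
print(inner)
```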

## Join DataFrames by pd.merge()

Given two DataFrame objects (say left and right) that share a common column label (say 'key'), we can merge them based on the common column:  **pd.merge(left, right, on='key')**

**A simple scenario**: the key columns of the two DataFrames contain the same values.

In [None]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
print(left)

In [None]:
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
print(right)

In [None]:
merged = pd.merge(left, right, on='key')
print(merged)

## Join DataFrames by pd.merge()

### **But in many cases, the two columns may not be exactly the same. Then how do we merge them?** ####
- **Intersection** (inner join): the result only includes the key values that appear in both input DataFrames. This is the default behavior of merge()
- **Union** (outer join): the result includes all key values from the two input DataFrames
- **Left join**: keep all key values from the left DataFrame
- **Right join**: keep all key values from the right DataFrame

In [None]:
age = pd.DataFrame({'name': ['Bob', 'Jack', 'Peter', 'Tom'], 'Age': [20, 23, 18, 19]})
print(age)

In [None]:
grade = pd.DataFrame({'name': ['Bob', 'Jack', 'Peter', 'Zoe'], 'Grade': [95, 87, 92, 78]})
print(grade)

In [None]:
merged = pd.merge(age, grade, on='name')
print(merged)

## Join DataFrames by pd.merge()
- **Union** (outer join): the result includes all key values from the two input DataFrames

In [None]:
o_merged = pd.merge(age, grade, on='name', how='outer')
print(o_merged)

- **Left join**: keep all key values from the left DataFrame

In [None]:
l_merged = pd.merge(age, grade, on='name', how='left')
print(l_merged)


- **Right join**: keep all key values from the right DataFrame

In [None]:
r_merged = pd.merge(age, grade, on='name', how='right')
print(r_merged)
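
To compare the four join types side by side, the sketch below counts the rows each one produces for the same `age` and `grade` tables. It also passes `indicator=True`, which adds a `_merge` column showing whether each row came from the left table, the right table, or both.

```python
import pandas as pd

age = pd.DataFrame({'name': ['Bob', 'Jack', 'Peter', 'Tom'], 'Age': [20, 23, 18, 19]})
grade = pd.DataFrame({'name': ['Bob', 'Jack', 'Peter', 'Zoe'], 'Grade': [95, 87, 92, 78]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
checked = pd.merge(age, grade, on='name', how='outer', indicator=True)
print(checked)

# Count the rows produced by each join type
counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    counts[how] = len(pd.merge(age, grade, on='name', how=how))
print(counts)
```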

## Grouping by df.groupby()

In [None]:
fortune = pd.read_csv('files/lab6_fortune1000.csv', index_col='Rank')
fortune.head(5)

**Groupby** is an important data analysis method. It usually involves one or more of the following three steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
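
The three steps can be followed on a tiny table. The data below is made up for illustration and is unrelated to the Fortune 1000 file.

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({'Sector': ['Tech', 'Retail', 'Tech', 'Retail', 'Energy'],
                   'Revenue': [100, 50, 300, 70, 90]})

# Split: iterate over the groups to see how the rows were partitioned
for name, group in df.groupby('Sector'):
    print(name, list(group['Revenue']))

# Apply + combine: one aggregated value per group, collected into a Series
totals = df.groupby('Sector')['Revenue'].sum()
print(totals)
```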


In [None]:
# What are the different sectors?
fortune['Sector'].unique()

## Grouping by df.groupby()

In [None]:
# Group by sector
sector = fortune.groupby('Sector')

In [None]:
# details of the groups
sector.groups

In [None]:
# How many rows per group?
sector.size()

In [None]:
# Numeric statistics for each group
sector.describe()

In [None]:
# Apply the sum() function to the numeric columns
sector.sum(numeric_only=True)

## Grouping by df.groupby()

We can access a specific group from a groupby object with the get_group() method. The returned group is a DataFrame.

In [None]:
# Get the 'Technology' group
tech = sector.get_group('Technology')

In [None]:
# what's the data type?
type(tech)

In [None]:
# display some contents
tech.head()

## Hierarchical indexing (MultiIndex)

So far we have learned about the 1D Series and the 2D DataFrame. How about *3-dimensional* data? E.g.,

<img src="data.png" width="40%" style="display:inline" /> <img src="table.png" width="40%" style="display:inline"   />

We can transform the 3D data into 2D table with a new column 'Year'

Potential row index:
- Year
- Company
- (Year, Company)

**Question**: which column(s) should we use as the new row index?

### MultiIndex Examples

The tuple (Year, Company) can uniquely identify a row, so we can use the tuple (Year, Company) as the index. This is called MultiIndex in Pandas.

We can create a MultiIndex by **pd.MultiIndex.from_tuples()**

In [None]:
tuples = [('2017', 'Company A'), ('2017', 'Company B'),('2017', 'Company C'),
          ('2018', 'Company A'), ('2018', 'Company B'),('2018', 'Company C'),
          ('2019', 'Company A'), ('2019', 'Company B'),('2019', 'Company C')]
mindex = pd.MultiIndex.from_tuples(tuples, names=['Year', 'Company'])
mindex

Since our MultiIndex is a combination of 'Year' and 'Company', there is an easier way to create the same MultiIndex by **pd.MultiIndex.from_product()**

In [None]:
Year = ['2017', '2018', '2019']
Company = ['Company A', 'Company B', 'Company C']
mindex = pd.MultiIndex.from_product([Year, Company], names=['Year', 'Company'])
mindex

### MultiIndex Examples

We can generate a DataFrame using the created MultiIndex object.

**Remark**: in this example, we use MultiIndex in axis = 0. It is also possible to use MultiIndex in axis = 1.

In [None]:
df = pd.DataFrame(np.random.randint(1000, 10000, (9, 3)), index=mindex, columns=['Hong Kong Island', 'Kowloon', 'NT'])
df

In [None]:
 # try it
df.index
df.columns

### Access Data with MultiIndex

we can select data by a "partial" label identifying a subgroup in the data

In [None]:
# select one group of rows
print(df.loc['2017'])


In [None]:
# select one row
print(df.loc[('2017', 'Company C')])

In [None]:
# select a single value
print(df.loc[('2017', 'Company C'), 'Kowloon'])
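
A partial label with `loc` addresses the outer level ('Year'). To select by the inner level ('Company') instead, one option is the `xs()` method with the `level` argument. The sketch below uses a smaller table filled with consecutive integers so the selected rows are easy to recognize.

```python
import numpy as np
import pandas as pd

mindex = pd.MultiIndex.from_product([['2017', '2018'], ['Company A', 'Company B']],
                                    names=['Year', 'Company'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=mindex,
                  columns=['Kowloon', 'NT'])

# Outer level: a partial label with loc
one_year = df.loc['2018']
print(one_year)

# Inner level: xs() selects a cross-section by level name
one_company = df.xs('Company B', level='Company')
print(one_company)
```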

### Data Reshaping and Pivot Table

Reshaping means the transformation of the structure of a table to make it suitable for further analysis

Consider the following table:
![](reshaping.png)

**Question**: Can we compare the cost of different sizes more easily?

### Reshaping by df.pivot()

The **df.pivot(index, columns, values)** method creates a new DataFrame out of a given one. We need to specify the index, columns, and values for the new DataFrame, using the column labels of the original DataFrame.

In [None]:
shop = {'Product':['shoe', 'shoe', 'shoe', 'dress', 'dress', 'dress'], 
        'Size':['L', 'M', 'S', 'L', 'M', 'S'],
        'Cost':[50, 45, 40, 100, 90, 80]}
df = pd.DataFrame(shop)
print(df)
df.pivot(index='Product', columns='Size', values='Cost')

### Reshaping by df.pivot()

The pivot() method works only if each (index, columns) pair is unique in the original DataFrame. In many cases the pairs are not unique, e.g.,
<img src="pivot.png" width="50%">

A **pivot table** can be used to aggregate the data from rows with the same (State, City), e.g., by mean, max, min, sum, std, or count. The **df.pivot_table(values, index, columns, aggfunc='mean')** method creates a new DataFrame out of a given one by applying aggfunc to each group of rows with the same (index, columns) pair.

**Remark**: The values should be **numerical** types. Commonly used aggfunc choices include *np.sum, np.mean, np.max, np.min, np.std, len, etc.*; recent pandas versions prefer the equivalent string names such as *'sum'* and *'mean'*.
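
The uniqueness requirement can be seen on a small made-up table in which one (State, Year) pair occurs twice: pivot() raises a ValueError, while pivot_table() aggregates the duplicates. The flag `pivot_failed` is a helper name used only in this sketch.

```python
import pandas as pd

df = pd.DataFrame({'State': ['California', 'California', 'Washington'],
                   'Year': ['2018', '2018', '2018'],
                   'Salary': [65700, 63500, 58700]})

# ('California', '2018') occurs twice, so pivot() cannot place both values in one cell
pivot_failed = False
try:
    df.pivot(index='State', columns='Year', values='Salary')
except ValueError as err:
    pivot_failed = True
    print('pivot failed:', err)

# pivot_table() aggregates the duplicates instead (here by their mean)
table = df.pivot_table(index='State', columns='Year', values='Salary', aggfunc='mean')
print(table)
```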

In [None]:
salary = {'State':['California', 'California', 'California', 'California', 'Washington', 'Washington'],
          'City':['Los Angeles', 'Los Angeles', 'San Jose', 'San Jose', 'Seattle', 'Seattle'],
          'Year':['2018', '2019', '2018', '2019', '2018', '2019'],
          'Salary':[65700, 67400, 63500, 63800, 58700, 59300]}

df = pd.DataFrame(salary)
df

### Examples of Pivot Table

There are two unique values in 'Year':'2018', '2019'. They will become the column labels.

In [None]:
df.pivot_table(index='State', columns='Year', values='Salary', aggfunc='sum')

There are two unique values in 'State': 'California', 'Washington'. They will become the column labels.

**Remark**: the default aggfunc is 'mean'

In [None]:
df.pivot_table(index='Year', columns='State', values='Salary')
df.pivot_table(index=['State', 'City'], columns='Year', values='Salary')

### Pivot Table with MultiIndex

In [None]:
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
df

Column 'C' has two unique values: 'foo' and 'bar'. They will be the column labels in the pivot table. For some (index, column) combinations there is no data in the original DataFrame, e.g. ('three', 'A', 'foo'), so NaN is used to fill the gap.

In [None]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

In [None]:
pd.pivot_table(df, values=['D', 'E'], index=['A', 'B'], columns=['C'])