# Mini Intro to Pandas


Official Pandas documentation:
- [pandas.pydata.org](https://pandas.pydata.org/)

### Pandas -- "Panel Data"
Built on `NumPy`, `Pandas` is the key Python package for data manipulation, and it is particularly good at working with `tabular` data. This intro covers:
1. Pandas basics, `Series` and `DataFrame` objects
2. Data Cleaning
3. Data Wrangling
4. Data Aggregating and grouping

### 1. Getting started with Pandas

In [None]:
import numpy as np
import pandas as pd
# Remember you can also import sub-objects under one package:
from pandas import Series, DataFrame

### Pandas Series

A Pandas `Series` is a one-dimensional labeled array, more flexible and functional than a plain list or NumPy array. It combines:
- a sequence of values
- associated data index
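
Because every value carries a label, operations between two Series match values by label rather than by position. A minimal sketch with made-up values (`s1`, `s2` are illustrative names only):

```python
import pandas as pd

# Two Series holding the same labels in a different order (values are made up)
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'b', 'a'])

# Arithmetic pairs values by label, not by position
total = s1 + s2
print(total)
```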


In [None]:
# create a random numpy array as data
arr = np.random.randint(-5, 6, 5) # 5 random integers from -5 to 5 (the upper bound 6 is exclusive)
arr

#### Create Pandas Series (1)

In [None]:
# by default, a numerical index is assigned, starting from 0
ser1 = pd.Series(arr)
ser1

#### Create Pandas Series (2)

In [None]:
# we can also specify an index (labels) for the values
ser2 = pd.Series(arr, index = ['e', 'd', 'c', 'b', 'a'])
ser2

#### Series Indexing

In [None]:
# retrieve a single value using "label", loc, and iloc
ser2['a'], ser2.loc['a'], ser2.iloc[len(ser2)-1]
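
The difference between `loc` (label) and `iloc` (position) matters most when the labels are themselves integers. A small sketch, assuming a hypothetical Series `s` whose labels do not match positions:

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions
s = pd.Series([100, 200, 300], index=[10, 20, 30])

by_label = s.loc[10]      # label-based: the value stored under label 10
by_position = s.iloc[0]   # position-based: the first value
last = s.iloc[-1]         # negative positions count from the end

print(by_label, by_position, last)
```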

In [None]:
# update a single value using "label"
ser2['a'] = -99

In [None]:
# retrieve multiple elements using a list of "label"
ser2[['a', 'e']]

In [None]:
# boolean indexing
ser2[ser2 < 0]
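
Boolean conditions can also be combined. The sketch below uses fixed, made-up values so the result is reproducible; note that each condition needs its own parentheses:

```python
import pandas as pd

s = pd.Series([-3, 0, 4, -1, 2], index=['e', 'd', 'c', 'b', 'a'])

# & means "and", | means "or", ~ means "not"
small = s[(s > -2) & (s < 3)]   # values strictly between -2 and 3
not_negative = s[~(s < 0)]      # everything that is not negative

print(small)
print(not_negative)
```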

#### Series Attributes

In [None]:
# return values of a Series
ser2.values

In [None]:
# return the index of a Series; dtype 'object' is the generic dtype used for strings
ser2.index

In [None]:
# change labels
ser2.index = ['Lancaster', 'York', 'Manchester', 'Edinburgh', 'Liverpool']

In [None]:
# assign a name to the Series and to its index
ser2.name = 'example'
ser2.index.name = 'city'
ser2

### Pandas DataFrame

A Pandas `DataFrame` represents a rectangular table of data:
- has two dimensions;
- contains an ordered (indexed) collection of columns (Series);
- each column can be a different value type (`int`, `float`, `string`, `boolean`, etc);
- has a row and column index
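
These properties can be inspected directly through attributes. A minimal sketch on a tiny table (`df_demo` and its `score` column are made up for illustration):

```python
import pandas as pd

# A small illustrative table; the 'score' values are hypothetical
df_demo = pd.DataFrame({'uni': ['Lancaster', 'Manchester'],
                        'rank': [9, 25],
                        'score': [88.5, 79.0],
                        'top10': [True, False]})

print(df_demo.shape)     # (number of rows, number of columns)
print(df_demo.dtypes)    # each column has its own value type
print(df_demo.index)     # row index
print(df_demo.columns)   # column index
```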

#### Create a Pandas DataFrame

In [None]:
# UK University League
# A dictionary of equal-length lists or NumPy arrays
data = {'uni':['Lancaster', 'Lancaster', 'Lancaster', 'Manchester', 'Manchester', 'Manchester'],
        'year':[2017, 2018, 2019, 2017, 2018, 2019],
        'rank':[9, 9, 8, 25, 22, 18]}

In [None]:
# from Python dictionary to Pandas DataFrame
df = pd.DataFrame(data, columns = ['uni', 'year', 'rank'])

#### What does a Pandas DataFrame look like?

In [None]:
data # <== this is the raw dictionary of data

In [None]:
# Jupyter displays a DataFrame as a nice-looking HTML table
# Same as with a Series, a numerical index is assigned to a DataFrame by default
df

#### "Label" Indexing

In [None]:
# we can of course change the index, Pandas is very flexible
df.index = ['zero', 'one', 'two', 'three', 'four', 'five']

In [None]:
# Now let's have another look at the DataFrame
df

#### Retrieve rows and columns

In [None]:
# retrieve a column
df['rank'] # note: we still have index with this single column

In [None]:
# selection using "label"
df.loc['two']
# df.iloc[2] # or selection using integer index
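
`loc` and `iloc` also accept a row selector and a column selector together. A short sketch on a three-row table (`df_demo` is an illustrative name):

```python
import pandas as pd

df_demo = pd.DataFrame({'uni': ['Lancaster', 'Lancaster', 'Manchester'],
                        'year': [2018, 2019, 2019],
                        'rank': [9, 8, 18]},
                       index=['zero', 'one', 'two'])

cell = df_demo.loc['one', 'rank']                    # one value: row label, column label
sub = df_demo.loc[['zero', 'two'], ['uni', 'rank']]  # a sub-table: lists of labels
same_cell = df_demo.iloc[1, 2]                       # the same value by integer position

print(cell, same_cell)
print(sub)
```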

#### Create new columns

In [None]:
# An empty or new column can be assigned a scalar value, an array or a Series
df['new_col1'] = 99
df['new_col2'] = np.arange(6.)
# a Series is aligned on index labels; rows without a matching label get NaN
df['new_col3'] = pd.Series([0.5, 0.7], index = ['zero', 'five'])
# or create a dummy variable based on other column(s)
df['top10'] = df['rank'] <= 10
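
When a Series is assigned to a column, Pandas matches it to the rows by index label, not by position. The sketch below (with an illustrative `df_demo`) shows what happens when the labels match and when they do not:

```python
import pandas as pd

df_demo = pd.DataFrame({'rank': [9, 25, 18]}, index=['zero', 'one', 'two'])

# Labels 'zero' and 'two' exist in df_demo, so those rows receive values
df_demo['matched'] = pd.Series([0.5, 0.7], index=['zero', 'two'])

# Labels 0 and 2 do not exist in df_demo, so the whole column is NaN
df_demo['unmatched'] = pd.Series([0.5, 0.7], index=[0, 2])

print(df_demo)
```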

In [None]:
df

In [None]:
# let's delete the nonsense columns:
df.drop(['new_col1', 'new_col2', 'new_col3'], axis=1, inplace=True)
# or, one column at a time, using
# del df['new_col1']

#### Sorting and Ranking (1)

In [None]:
# sort_index: rows
df.sort_index() # by default, axis = 0

In [None]:
# sort_index: columns
df.sort_index(axis=1) # note, this is a copy

In [None]:
# let's switch back to numeric index
df.reset_index(drop=True, inplace=True)

#### Sorting and Ranking (2)

In [None]:
# sort_values
df.sort_values(by=['rank'])

In [None]:
# sort_values
df.sort_values(by=['year', 'rank'])
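
Each sort key can have its own direction via the `ascending` argument. A small sketch, sorting the most recent year first and, within a year, the best rank first:

```python
import pandas as pd

df_demo = pd.DataFrame({'uni': ['Lancaster', 'Lancaster', 'Manchester', 'Manchester'],
                        'year': [2018, 2019, 2018, 2019],
                        'rank': [9, 8, 22, 18]})

# year descending, then rank ascending; the result is a sorted copy
ordered = df_demo.sort_values(by=['year', 'rank'], ascending=[False, True])
print(ordered)
```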

In [None]:
# let's get back the original order
df.sort_index(axis=0, inplace=True)

#### Summarizing Descriptive Statistics

In [None]:
df.info()

In [None]:
is_lancs = df['uni'] == 'Lancaster'
df[is_lancs].describe() # only numeric columns are summarized; year is included too, although it is really a label
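
Since `year` is stored as an integer, `describe()` treats it like any other number. One way around this, sketched below, is to select only the column(s) worth summarizing before calling `describe()`:

```python
import pandas as pd

df_demo = pd.DataFrame({'uni': ['Lancaster', 'Lancaster', 'Lancaster',
                                'Manchester', 'Manchester', 'Manchester'],
                        'year': [2017, 2018, 2019, 2017, 2018, 2019],
                        'rank': [9, 9, 8, 25, 22, 18]})

is_lancs = df_demo['uni'] == 'Lancaster'

# Filter the rows and pick the column in one .loc call, then summarize
rank_stats = df_demo.loc[is_lancs, 'rank'].describe()
print(rank_stats)
```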

#### Unique values, Value counts, and Membership

In [None]:
# unique values, such as unique university in dataset
uniques = df['uni'].unique()
uniques

In [None]:
# how many values/observations for each unique university
df['uni'].value_counts()

In [None]:
# vectorized set membership check
mask = df['uni'].isin(['Warwick', 'Bath'])
mask # we can then use this mask object as boolean to filter data
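
A sketch of how such a mask can be used to filter rows, and how `~` inverts it (`df_demo` is an illustrative table):

```python
import pandas as pd

df_demo = pd.DataFrame({'uni': ['Lancaster', 'Manchester', 'Lancaster'],
                        'year': [2019, 2019, 2018],
                        'rank': [8, 18, 9]})

# True for rows whose 'uni' is in the list; names absent from the data are simply ignored
mask = df_demo['uni'].isin(['Lancaster', 'Warwick'])

selected = df_demo[mask]    # rows where the mask is True
others = df_demo[~mask]     # rows where the mask is False

print(selected)
print(others)
```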

### 2. Data Cleaning

#### Missing data
- Python built-in: `None`
- NumPy and Pandas: `NaN` (Not a Number)

In [None]:
ser = pd.Series([None, 1, 3, 5, np.nan])
ser.isnull() # or notnull()
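
In a numeric Series, `None` is converted to `NaN`, so both are detected the same way. Note also that `NaN` is not equal to itself, which is why `isnull()` / `isna()` should be used instead of `==`. A quick check:

```python
import numpy as np
import pandas as pd

s = pd.Series([None, 1, 3, 5, np.nan])

print(s.dtype)                 # None was converted to NaN, so the Series holds floats
n_missing = s.isna().sum()     # isna() is an alias of isnull()
nan_equals_nan = (np.nan == np.nan)

print(n_missing, nan_equals_nan)
```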

#### Drop missing data

In [None]:
# drop missing data from Series
ser.dropna() # same as ser[ser.notnull()]

In [None]:
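# dropping missing data from a DataFrame is a bit more complex;
# first, introduce some missing values using .loc
df.loc[(df['uni'] == 'Manchester') & (df['year'] >= 2018), 'rank'] = np.nan
df.loc[(df['uni'] == 'Manchester') & (df['year'] == 2019), 'top10'] = None
# avoid chained indexing like the line below: it assigns to a temporary copy and may leave df unchanged
# df[(df['uni'] == 'Manchester') & (df['year'] == 2018)]['rank'] = np.nan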

In [None]:
df.dropna() # drop every row that contains at least one missing value
# df.dropna(axis=1) # also play with arg: how='all'
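
The main `dropna` options can be compared side by side. A sketch on a tiny table (the `score` column is made up for illustration):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'rank': [9, np.nan, np.nan],
                        'score': [1.0, 2.0, np.nan]})

any_missing = df_demo.dropna()                 # drop rows with at least one NaN
all_missing = df_demo.dropna(how='all')        # drop rows only if every value is NaN
by_column = df_demo.dropna(subset=['score'])   # only look at 'score' when deciding
no_nan_cols = df_demo.dropna(axis=1)           # drop columns that contain any NaN

print(len(any_missing), len(all_missing), len(by_column), no_nan_cols.shape)
```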

#### Fill missing data

In [None]:
# fill missing data with 0
df.fillna(0) # you can also pass a computed value, e.g., df['rank'].mean(), or a dict of per-column values
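
Besides a single scalar, `fillna` accepts a computed value or a dictionary that maps column names to fill values. A minimal sketch with illustrative numbers:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'rank': [8.0, np.nan, 18.0],
                        'acf_rank': [10.0, 35.0, np.nan]})

# Fill one column with its own mean (NaN is skipped when computing the mean)
filled_mean = df_demo['rank'].fillna(df_demo['rank'].mean())

# Fill each column with a different value
filled_dict = df_demo.fillna({'rank': 0, 'acf_rank': -1})

print(filled_mean)
print(filled_dict)
```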

In [None]:
# fill missing data with the most "recent" data point
# (fillna(method='ffill') is deprecated in recent Pandas versions)
df.ffill() # forward fill; use df.bfill() for backward fill
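
Forward and backward filling are also available as the dedicated methods `ffill()` and `bfill()`. A short sketch on a Series with a gap of two missing values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()          # carry the last known value forward
backward = s.bfill()         # pull the next known value backward
limited = s.ffill(limit=1)   # fill at most one consecutive NaN

print(forward.tolist(), backward.tolist())
print(limited)
```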

#### Replace data

In [None]:
# replace obvious errors with np.nan
# 99, or -99 by construction are errors
ser = pd.Series([-99, -0.05, 0.01, 0.03, 0.02, 99])
ser.replace([99, -99], np.nan)

In [None]:
# replace according to a dictionary
ser.replace({-99: np.nan, 99: 0})

### 3. Data Wrangling

#### Merge data

In [None]:
# left and right dataframe
left = pd.DataFrame({'uni':['Lancaster', 'Lancaster', 'Lancaster', 'Manchester', 'Manchester', 'Manchester'],
                     'year':[2017, 2018, 2019, 2017, 2018, 2019],
                     'rank':[9, 9, 8, 25, 22, 18]})
right = pd.DataFrame({'uni':['Lancaster', 'Lancaster', 'Manchester', 'Manchester'],
                     'year':[2018, 2019, 2018, 2019],
                     'acf_rank':[7, 10, 35, 20]})

In [None]:
pd.merge(left, right, on=['uni', 'year'], how='left') # keep all left, get matched right

In [None]:
pd.merge(left, right, on=['uni', 'year'], how='inner') # get matched from both dataframes
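
To keep unmatched rows from both sides, `how='outer'` can be used, and `indicator=True` adds a `_merge` column showing where each row came from. A small sketch (`left_demo` and `right_demo` are shortened, illustrative tables):

```python
import pandas as pd

left_demo = pd.DataFrame({'uni': ['Lancaster', 'Manchester'],
                          'year': [2017, 2018],
                          'rank': [9, 22]})
right_demo = pd.DataFrame({'uni': ['Manchester', 'Manchester'],
                           'year': [2018, 2019],
                           'acf_rank': [35, 20]})

# Keep all keys from both tables; _merge is 'left_only', 'right_only' or 'both'
merged = pd.merge(left_demo, right_demo, on=['uni', 'year'],
                  how='outer', indicator=True)
print(merged)
```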

#### Concat data

In [None]:
# up and down dataframes (same columns, different years)
up = pd.DataFrame({'uni':['Lancaster', 'Lancaster', 'Lancaster', 'Manchester', 'Manchester', 'Manchester'],
                     'year':[2017, 2018, 2019, 2017, 2018, 2019],
                     'rank':[9, 9, 8, 25, 22, 18]})
down = pd.DataFrame({'uni':['Lancaster', 'Lancaster', 'Manchester', 'Manchester'],
                     'year':[2015, 2016, 2015, 2016],
                     'rank':[11, 9, 28, 28]})

In [None]:
pd.concat([up, down]).sort_values(by=['uni', 'year']).reset_index(drop=True)
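
By default `concat` keeps each table's original index, so labels can repeat. If no sorting is needed, `ignore_index=True` renumbers the rows directly. A sketch with two one-row tables (illustrative names):

```python
import pandas as pd

up_demo = pd.DataFrame({'uni': ['Lancaster'], 'year': [2017], 'rank': [9]})
down_demo = pd.DataFrame({'uni': ['Lancaster'], 'year': [2016], 'rank': [9]})

stacked = pd.concat([up_demo, down_demo])                        # index labels repeat
renumbered = pd.concat([up_demo, down_demo], ignore_index=True)  # fresh 0..n-1 index

print(stacked.index.tolist(), renumbered.index.tolist())
```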