# Introduction to Pandas
Pandas is one of the most popular tools in Python for data analytics.  It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy.

We start by importing pandas and giving it a short name, "pd".  We also import numpy, which we use alongside pandas in several examples.

In [None]:
import pandas as pd
import numpy as np

## Data Structure - Series
A Series is a one-dimensional array-like object containing a sequence of values of the same type.  It works like an associative array: each value has a data label, and the labels together are called the **index**.

### Create a Series
There are many approaches to creating a Series.  We first create a Series with 4 elements from a list:

Index | Data (int)
:---: | :---:
0 | 4
1 | 7
2 | -5
3 | 3

Note that the **integer index starts from 0**.

In [None]:
# create Series from a list
s1 = pd.Series([4, 7, -5, 3])
s1

#### Note that:
```
dtype: int64
```
- dtype is the data type of the data values, which is 64-bit integer in this case

In [None]:
s1.values

s1.index

We can create a Series with labels as index.  A label can be an arbitrary string.  Although we assign labels to the Series, each value can still be reached by its integer position.

Index | Label | Data (int)
:---: | :---: | :---:
0 | d | 4
1 | b | 7
2 | a | -5
3 | c | 3

Note that when we display a Series, it shows the labels if they exist and the integer index otherwise.

In [None]:
# create series with index, which can be numbers or strings
s2 = pd.Series([4, 7, -5, 3], index=['d','b','a','c'])
s2

In [None]:
s2.index

Notice the difference between the integer index in s1 (a range of numbers) and the labels in s2 (arbitrary strings).
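
The difference can also be checked by looking at the type of each index. The sketch below rebuilds the two Series under hypothetical names (`default_idx`, `label_idx`) so that it does not disturb `s1` and `s2`.

```python
import pandas as pd

# hypothetical names, used only for this illustration
default_idx = pd.Series([4, 7, -5, 3])
label_idx = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# the default index is a RangeIndex; explicit labels are stored in a generic Index
print(type(default_idx.index).__name__)
print(type(label_idx.index).__name__)

# converting to a list shows the actual labels
print(list(default_idx.index))
print(list(label_idx.index))
```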

### Series data operations

In [None]:
# access single data
s2['a']

In [None]:
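# we can also use the integer position; iloc makes this explicit
# (plain s2[2] on a labeled Series is deprecated in recent pandas)
s2.iloc[2]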

In [None]:
# change data value
s2['d'] = 6

In [None]:
# select multiple data
s2[['c','a','d']]


In [None]:
# filtering - boolean data accessing
s2[s2 > 3]

In [None]:
# vector operations
s2 * 2

In [None]:
s2

In [None]:
# manipulate with numpy
np.exp(s2)

In [None]:
# test if value in Series index
'c' in s2

In [None]:
'k' in s2

### Index and data alignment

In [None]:
# create series from dict
sdata = { 'Chiang Mai': 1687971, 'Lamphun': 403896, 'Phrae':  421653 , 'Lampang': 730980 }
s3 = pd.Series(sdata)
s3

Note that index labels can be arbitrary strings (even Thai text).

In [None]:
# control the order of index keys
provinces = ['Lamphun', 'Chiang Mai', 'Lampang', 'Chiang Rai', 'Phrae']
s4 = pd.Series(sdata, index=provinces)
s4

Notice the order of the index.

In addition:
```
Chiang Rai          NaN
```
***NaN*** (Not a Number) represents missing data.  We can detect missing data with *isnull* and *notnull*.

In [None]:
s4.isnull()

In [None]:
s4.notnull()

In [None]:
sum(s4.notnull())

In [None]:
s4[s4.isnull()]

In [None]:
# the same check is also available as a top-level pandas function
pd.isnull(s4)

When we perform an operation between two Series, pandas aligns the data by index label.  Thus, even though two Series may have different index ordering, we can perform operations between them with ease.

In [None]:
s3

In [None]:
s4

In [None]:
s3 + s4

Notice the different order of the index in s3 and s4, and check the result of the addition.

In [None]:
# set name for index and data values
s4.name = 'population'
s4.index.name = 'province'
s4

In [None]:
# alter Series's index
s1

In [None]:
s1.index = ['dave', 'jane', 'george', 'kelvin']
s1

## Data Structure - DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).  A DataFrame has both row and column indices.  Although a DataFrame is mainly used for two-dimensional data, we can use hierarchical indexing to represent more complicated data.
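
As a minimal sketch of hierarchical indexing, the example below builds a small DataFrame whose rows are labeled by a (province, year) pair. The name `multi` is hypothetical, and `pd.MultiIndex.from_tuples` is only one of several ways to build such an index.

```python
import pandas as pd

# hypothetical example: population keyed by (province, year)
multi = pd.DataFrame(
    {'population': [1630428, 1664012, 398936, 410382]},
    index=pd.MultiIndex.from_tuples(
        [('Chiang Mai', 2016), ('Chiang Mai', 2017),
         ('Phrae', 2016), ('Phrae', 2017)],
        names=['province', 'year'],
    ),
)

# select all rows for one outer label
print(multi.loc['Phrae'])

# select a single (province, year) cell
print(multi.loc[('Chiang Mai', 2017), 'population'])
```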

### Create a DataFrame
There are several approaches to creating a DataFrame.  A simple one is to create it from a dict of equal-length lists.
We create the following DataFrame:

Index | province | year | population
:---: | :---: | :---: | :---:
0 | Chiang Mai | 2016 | 1630428
1 | Chiang Mai | 2017 | 1664012
2 | Chiang Mai | 2018 | 1687971
3 | Phrae | 2016 | 398936
4 | Phrae | 2017 | 410382
5 | Phrae | 2018 | 421653

In [None]:
# create from a dict
data = {
    'province': ['Chiang Mai', 'Chiang Mai', 'Chiang Mai', 'Phrae', 'Phrae', 'Phrae'],
    'year': [2016, 2017, 2018, 2016, 2017, 2018],
    'population': [1630428, 1664012, 1687971, 398936, 410382, 421653]
}
df = pd.DataFrame(data)
df

In [None]:
df.shape

In [None]:
df.head(4)

In [None]:
# assign column names and their sequence
df2 = pd.DataFrame(data, columns=['year', 'province', 'population'])
df2

In [None]:
# index can also be other data types
df2 = pd.DataFrame(data, columns=['year', 'province', 'population', 'household'], index=['one', 'two', 'three', 'four', 'five', 'six'])
df2

In [None]:
df2.columns

In [None]:
df2.index

### Operation on DataFrame

In [None]:
df2['province']

In [None]:
df2.year

In [None]:
# refer to a row using index value
df2.loc['three']

In [None]:
df2

In [None]:
# we can assign value to the entire column
df2['household'] = 10
df2

In [None]:
# the assigned value can be a list or array whose length matches the DataFrame
df2.household = np.arange(6.)
df2

In [None]:
# we can assign a Series to a column; values are aligned by index label,
# and rows without a matching label become NaN
hh = pd.Series([15, 17, 12], index=['two', 'four', 'one'])

In [None]:
df2.household = hh
df2

If we assign data to a column that does not exist, pandas will create a new column.

In [None]:
df2['bigcity'] = df2.province == 'Chiang Mai'
df2

In [None]:
df2.columns

In [None]:
# remove column with del
del df2['bigcity']
df2.columns
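
One caveat: creating a column needs the bracket syntax. Attribute-style assignment, as in `df2.household = hh` above, works only for a column that already exists; on a new name it sets a plain Python attribute instead. The sketch below shows this on a small throwaway frame (`demo` and `region` are hypothetical names).

```python
import warnings

import pandas as pd

demo = pd.DataFrame({'province': ['Chiang Mai', 'Phrae']})

# bracket syntax creates a new column
demo['bigcity'] = demo['province'] == 'Chiang Mai'

# attribute syntax on a new name does NOT create a column
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas warns about this pattern
    demo.region = ['north', 'north']

# only 'province' and 'bigcity' are columns
print(list(demo.columns))
```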

We can also create a DataFrame from a nested dict of dicts.  The outer keys become the columns and the inner keys become the row index.

In [None]:
pop = {
    'Chiang Mai': { 2016: 1630428, 2017: 1664012, 2018: 1687971},
    'Phrae': { 2016: 398936, 2017: 410382, 2018: 421653}
}
df3 = pd.DataFrame(pop)
df3

In [None]:
# transpose a DataFrame
df3.T

In [None]:
# convert DataFrame to array
df3.values

In [None]:
# with a different structure (mixed column types), the conversion gives a different array
df2

In [None]:
df2.values

## Operation on Index
Index manipulation is very important, especially for time-series data, where timestamps are used as the index.

### Index Objects

In [None]:
s1

In [None]:
index = s1.index
index

In [None]:
index[2:]

Note that index objects are immutable and cannot be modified in place.
Thus:
```
index[1] = 'delan'
```
will raise a `TypeError`.
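
A minimal sketch of this behavior, using a hypothetical Series named `people` so that `s1` is left untouched: changing one element of the index fails, while replacing the whole index is allowed.

```python
import pandas as pd

people = pd.Series([4, 7, -5, 3], index=['dave', 'jane', 'george', 'kelvin'])

try:
    people.index[1] = 'delan'  # in-place modification is not allowed
except TypeError as err:
    print('TypeError:', err)

# assigning a complete new index works
people.index = ['dave', 'delan', 'george', 'kelvin']
print(people.index)
```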

### Reindexing
We can make a pandas object conform to a new index with reindexing.  Labels that are not present in the original index get NaN.

In [None]:
s2

In [None]:
s3 = s2.reindex(['a', 'b', 'c', 'd', 'e'])
s3

For ordered data like time series, using timestamps as the index may leave gaps where no observation exists.  When reindexing, we can fill these gaps with a method such as *ffill*, which forward-fills the last known value.

In [None]:
s4 = pd.Series(['red', 'green', 'blue'], index=[0, 3, 5])
s4

In [None]:
s4.reindex(range(7), method='ffill')
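
The same idea applies when the index holds real timestamps. The sketch below is an illustration with made-up readings (`readings`, `daily` and `filled` are hypothetical names): three observations on irregular dates are reindexed onto a regular daily index built with `pd.date_range`.

```python
import pandas as pd

# hypothetical readings taken on irregular dates
readings = pd.Series(
    [20.5, 21.0, 19.8],
    index=pd.to_datetime(['2018-01-01', '2018-01-04', '2018-01-06']),
)

# build a regular daily index and carry the last known value forward
daily = pd.date_range('2018-01-01', '2018-01-07', freq='D')
filled = readings.reindex(daily, method='ffill')
print(filled)
```

Note that `method='ffill'` requires the original index to be sorted, which is the case here.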

We can also reindex both rows and columns in DataFrame.

In [None]:
df3

In [None]:
df3.reindex([2019, 2018, 2016])

In [None]:
df3.reindex(columns=['Lamphun', 'Chiang Mai', 'Lampang'])

Sometimes we have to rename index and column names.  This can be done with *rename*, which returns a new object and leaves the original unchanged.

In [None]:
df3.rename(columns={'Chiang Mai': 'CNX', 'Phrae': 'PRH'})

## Indexing, Selection, and Filtering
These operations are frequently used in data exploration.

### Series

In [None]:
s2

In [None]:
s2['b']

In [None]:
# select by integer position
s2.iloc[1]

In [None]:
s2[2:4]

In [None]:
s2[['b', 'c', 'd']]

In [None]:
# select several integer positions
s2.iloc[[1, 3]]

Note that slicing with labels behaves differently from normal Python slicing: the end label is included.

In [None]:
s2

In [None]:
s2['b':'c']

In [None]:
s2['a':'d']

In [None]:
s2['a':'c'] = 100
s2
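
To see the two slicing rules side by side, the sketch below uses a hypothetical Series named `letters` with an alphabetically ordered index, which makes the endpoints easier to follow than in `s2`.

```python
import pandas as pd

letters = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# positional slice: the end position is excluded, as in plain Python
print(letters.iloc[1:3])     # rows 'b' and 'c'

# label slice: the end label is included
print(letters.loc['b':'c'])  # rows 'b' and 'c'
print(letters.loc['b':'d'])  # rows 'b', 'c' and 'd'
```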

In [None]:
s2 < 50

In [None]:
s2[s2 < 50]

### DataFrame

In [None]:
df2

In [None]:
df2['population']

In [None]:
df2[['province', 'year', 'household']]

In [None]:
df2[:2]

In [None]:
df2['population']> 1500000

In [None]:
df2[df2['population']> 1500000]

### Data Referencing with *loc* and *iloc*
*loc* and *iloc* can be used for selecting a subset of rows and columns in a DataFrame.  *loc* is for label indexing and *iloc* is for integer indexing.  The selection can be used for both read and write operations.

In [None]:
df2

In [None]:
df2.loc['two', 'year']

In [None]:
df2.loc[['one', 'three', 'six'], ['population', 'household']]

In [None]:
df2.loc['three']

In [None]:
df2.loc['three', :]

In [None]:
df2.loc[:,'population']

In [None]:
df2.population += df2.year

In [None]:
df2

In [None]:
df2.loc[df2['household'] < 15, 'household'] = 10
df2

In [None]:
df2.iloc[0]

In [None]:
df2.iloc[[0,1], [1, 3]]

In [None]:
df2.iloc[:4, 1:3]

## Arithmetic and Data Alignment
An important pandas feature is the behavior of arithmetic between objects with different indexes.  The index of the result is the union of the two indexes, and labels found in only one object get NaN.

In [None]:
s1 = pd.Series([4, 7, -5, 3], index=['d','b','a','c'])
s1

In [None]:
s2 = pd.Series([1, 0, 9], index=['a','c','x'])
s2

In [None]:
s1 + s2

This is also true for DataFrame, where alignment happens on both rows and columns.
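
A minimal sketch with two hypothetical frames (`left` and `right`) that overlap in only one row label and one column label:

```python
import pandas as pd

# hypothetical frames with partly overlapping rows and columns
left = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['a', 'b'])
right = pd.DataFrame({'y': [10, 20], 'z': [30, 40]}, index=['b', 'c'])

# rows and columns are both aligned by label before adding
total = left + right
print(total)
```

Only the cell at row `b`, column `y` exists in both frames, so it is the only one with a numeric result; every other cell is NaN.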

### Arithmetic methods with fill values

In [None]:
s1.add(s2, fill_value=0)

## Sorting and Ranking

In [None]:
s1 = pd.Series([4, 7, -5, 3], index=['d','b','a','c'])
s1

In [None]:
s1.sort_index()

In [None]:
s1.sort_values()

In [None]:
# np.nan is the supported spelling (the np.NaN alias was removed in NumPy 2.0)
s1['c'] = np.nan

In [None]:
s1

In [None]:
s1.sort_values()

In [None]:
s1.rank()

In [None]:
s1.rank(ascending=False)

## Summarizing and Descriptive Statistics

In [None]:
df2

In [None]:
df2.shape

In [None]:
df2.count()

In [None]:
df2.min()

In [None]:
df2.max()

In [None]:
df2.sum()

In [None]:
df2

In [None]:
# sum across each row; numeric_only skips the string column 'province'
df2.sum(axis='columns', numeric_only=True)

In [None]:
# the mean is only defined for numeric columns
df2.mean(numeric_only=True)

In [None]:
df2.describe()

In [None]:
df2

In [None]:
df2.year.value_counts()

In [None]:
df2.province.value_counts()