# Introduction to Pandas
Pandas is one of the most popular tools in Python for data analytics.  It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy.

We start by importing pandas under the short name "pd".  We also import NumPy, which pandas is built on, as "np".

In [1]:
import pandas as pd
import numpy as np

## Data Structure - Series
A Series is a one-dimensional, array-like object containing a sequence of values of the same type.  Each value is associated with a data label, and together the labels are called the **index**.

### Create a Series
There are many approaches to creating a Series.  We first create a Series with 4 elements from a list:

Index | Data (int)
:---: | :---:
0 | 4
1 | 7
2 | -5
3 | 3

Note that the **integer index starts from 0**.

In [2]:
# create Series from a list
s1 = pd.Series([4, 7, -5, 3])
s1

0    4
1    7
2   -5
3    3
dtype: int64

#### Note that:
```
dtype: int64
```
- dtype is the data type of the values, which is a 64-bit integer in this case

In [3]:
s1.values

array([ 4,  7, -5,  3], dtype=int64)

In [4]:
s1.index

RangeIndex(start=0, stop=4, step=1)

We can also create a Series with labels as its index.  A label can be an arbitrary string.  Although we assign labels to the Series, the integer positions still exist.

Index | Label | Data (int)
:---: | :---: | :---:
0 | d | 4
1 | b | 7
2 | a | -5
3 | c | 3

Note that when we display a Series, it shows the labels if they exist and the integer index otherwise.

In [5]:
# create series with index, which can be numbers or strings
s2 = pd.Series([4, 7, -5, 3], index=['d','b','a','c'])
s2

d    4
b    7
a   -5
c    3
dtype: int64

In [6]:
s2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Notice the difference between the integer index in s1 (a range of numbers) and the labels in s2 (arbitrary strings).
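
As a small sketch of the point that positions remain usable alongside labels, the cell below rebuilds the same data under the hypothetical name `labeled` and reads one element both ways. It uses *loc* (by label) and *iloc* (by integer position), which this notebook covers later for DataFrame.

```python
import pandas as pd

# same data as s2 above; the name `labeled` is only for this illustration
labeled = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

by_label = labeled.loc['b']      # look up by label
by_position = labeled.iloc[1]    # look up by integer position (0-based)

print(by_label, by_position)
```

Both lookups refer to the same element, because 'b' is the second label.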

### Index and data alignment

In [7]:
# create series from dict
sdata = { 'Chiang Mai': 1687971, 'Lamphun': 403896, 'Phrae':  421653 , 'Lampang': 730980 }
s3 = pd.Series(sdata)
s3

Chiang Mai    1687971
Lamphun        403896
Phrae          421653
Lampang        730980
dtype: int64

Note that index labels can be arbitrary strings (even Thai).

In [8]:
# control the order of index keys
provinces = ['Lamphun', 'Chiang Mai', 'Lampang', 'Chiang Rai', 'Phrae']
s4 = pd.Series(sdata, index=provinces)
s4

Lamphun        403896.0
Chiang Mai    1687971.0
Lampang        730980.0
Chiang Rai          NaN
Phrae          421653.0
dtype: float64

Notice the order of the index.

In addition:
```
Chiang Rai          NaN
```
***NaN*** (Not a Number) represents missing data.  We can detect missing data with *isnull* and *notnull*.

In [9]:
s4.isnull()

Lamphun       False
Chiang Mai    False
Lampang       False
Chiang Rai     True
Phrae         False
dtype: bool

In [10]:
s4.notnull()

Lamphun        True
Chiang Mai     True
Lampang        True
Chiang Rai    False
Phrae          True
dtype: bool

In [11]:
sum(s4.notnull())

4

In [12]:
s4[s4.isnull()]

Chiang Rai   NaN
dtype: float64

In [13]:
# isnull is an instance method of Series; the top-level function pd.isnull(s4) gives the same result
s4.isnull()

Lamphun       False
Chiang Mai    False
Lampang       False
Chiang Rai     True
Phrae         False
dtype: bool
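
Once missing values are detected, a common next step is to count, drop, or replace them. The following is a minimal sketch on a hypothetical three-entry Series named `population` (not one of the objects above), using the standard Series methods *dropna* and *fillna*. Filling with 0 is only an illustration and is rarely the right choice for real population data.

```python
import numpy as np
import pandas as pd

# hypothetical Series with one missing value, mirroring the shape of s4
population = pd.Series({'Lamphun': 403896.0, 'Chiang Rai': np.nan, 'Phrae': 421653.0})

n_missing = int(population.isnull().sum())   # each True counts as 1
without_missing = population.dropna()        # drop entries whose value is NaN
filled = population.fillna(0)                # or replace NaN with a chosen value

print(n_missing)
print(without_missing)
print(filled)
```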

When we perform an operation between Series, pandas aligns the data by index label.  Thus, even though two Series may have different index ordering, we can perform operations between them with ease.

In [14]:
s3

Chiang Mai    1687971
Lamphun        403896
Phrae          421653
Lampang        730980
dtype: int64

In [15]:
s4

Lamphun        403896.0
Chiang Mai    1687971.0
Lampang        730980.0
Chiang Rai          NaN
Phrae          421653.0
dtype: float64

In [16]:
s3 + s4

Chiang Mai    3375942.0
Chiang Rai          NaN
Lampang       1461960.0
Lamphun        807792.0
Phrae          843306.0
dtype: float64

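Notice the different index orders in s3 and s4, and check the result of the addition: values are matched by label, and Chiang Rai, which is NaN in s4 and absent from s3, is NaN in the result.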
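
To make the alignment step visible, here is a sketch with two small hypothetical Series, `left` and `right`. It spells out what the `+` operator appears to do: put both operands on the union of their labels, then add element by element.

```python
import pandas as pd

# hypothetical values, chosen only to make the arithmetic easy to follow
left = pd.Series({'Chiang Mai': 1, 'Lamphun': 2})
right = pd.Series({'Lamphun': 10, 'Chiang Rai': 20, 'Chiang Mai': 30})

total = left + right   # values are matched by label, not by position

# the same result, spelled out: reindex both onto the union of labels, then add
labels = left.index.union(right.index)
manual = left.reindex(labels) + right.reindex(labels)

print(total)
print(total.equals(manual))
```

'Chiang Rai' exists only in `right`, so its sum is NaN in both versions.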

## Data Structure - DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, etc.).  A DataFrame has both row and column indices.  Although a DataFrame is mainly used for two-dimensional data, we can use hierarchical indexing to represent more complicated data.
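
As a brief illustration of hierarchical indexing, the sketch below builds a hypothetical DataFrame named `records` and moves two of its columns into a two-level row index with *set_index*. This is one way to do it, shown under the assumption that each (province, year) pair appears only once.

```python
import pandas as pd

# hypothetical long-format records: one row per (province, year)
records = pd.DataFrame({
    'province': ['Chiang Mai', 'Chiang Mai', 'Phrae', 'Phrae'],
    'year': [2016, 2017, 2016, 2017],
    'population': [1630428, 1664012, 398936, 410382],
})

# two index levels: province first, then year
by_province_year = records.set_index(['province', 'year'])

print(by_province_year)
# a tuple selects one row across both levels
print(by_province_year.loc[('Phrae', 2017), 'population'])
```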

### Create a DataFrame
There are several ways to create a DataFrame.  A simple one is to create it from a dict of equal-length lists.
We create the following DataFrame:

Index | province | year | population
:---: | :---: | :---: | :---:
0 | Chiang Mai | 2016 | 1630428
1 | Chiang Mai | 2017 | 1664012
2 | Chiang Mai | 2018 | 1687971
3 | Phrae | 2016 | 398936
4 | Phrae | 2017 | 410382
5 | Phrae | 2018 | 421653

In [17]:
# create from a dict
data = {
    'province': ['Chiang Mai', 'Chiang Mai', 'Chiang Mai', 'Phrae', 'Phrae', 'Phrae'],
    'year': [2016, 2017, 2018, 2016, 2017, 2018],
    'population': [1630428, 1664012, 1687971, 398936, 410382, 421653]
}
df = pd.DataFrame(data)
df

Index | province | year | population
:---: | :---: | :---: | :---:
0 | Chiang Mai | 2016 | 1630428
1 | Chiang Mai | 2017 | 1664012
2 | Chiang Mai | 2018 | 1687971
3 | Phrae | 2016 | 398936
4 | Phrae | 2017 | 410382
5 | Phrae | 2018 | 421653


In [18]:
df.shape

(6, 3)

In [19]:
df.head(4)

Index | province | year | population
:---: | :---: | :---: | :---:
0 | Chiang Mai | 2016 | 1630428
1 | Chiang Mai | 2017 | 1664012
2 | Chiang Mai | 2018 | 1687971
3 | Phrae | 2016 | 398936


In [20]:
# assign column names and their sequence
df2 = pd.DataFrame(data, columns=['year', 'province', 'population'])
df2

Index | year | province | population
:---: | :---: | :---: | :---:
0 | 2016 | Chiang Mai | 1630428
1 | 2017 | Chiang Mai | 1664012
2 | 2018 | Chiang Mai | 1687971
3 | 2016 | Phrae | 398936
4 | 2017 | Phrae | 410382
5 | 2018 | Phrae | 421653


In [21]:
# index can also be other data types
df2 = pd.DataFrame(data, columns=['year', 'province', 'population', 'household'], index=['one', 'two', 'three', 'four', 'five', 'six'])
df2

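Index | year | province | population | household
:---: | :---: | :---: | :---: | :---:
one | 2016 | Chiang Mai | 1630428 | NaN
two | 2017 | Chiang Mai | 1664012 | NaN
three | 2018 | Chiang Mai | 1687971 | NaN
four | 2016 | Phrae | 398936 | NaN
five | 2017 | Phrae | 410382 | NaN
six | 2018 | Phrae | 421653 | NaN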


In [22]:
df2.columns

Index(['year', 'province', 'population', 'household'], dtype='object')

In [23]:
df2.index

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')
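
Alongside *columns* and *index*, the *dtypes* attribute reports the value type of each column, which shows that one DataFrame can mix types. The sketch below uses a hypothetical two-row DataFrame named `frame`. The exact dtype names printed may vary between pandas versions, so no output is shown.

```python
import pandas as pd

# hypothetical DataFrame with one text column and two integer columns
frame = pd.DataFrame({
    'province': ['Chiang Mai', 'Phrae'],
    'year': [2016, 2016],
    'population': [1630428, 398936],
})

# one dtype per column, returned as a Series indexed by column name
print(frame.dtypes)
```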

## Indexing, Selection, and Filtering
These operations are frequently used in data exploration.

### Series

In [24]:
s2

d    4
b    7
a   -5
c    3
dtype: int64

In [25]:
s2['b']

7

In [26]:
# select by integer position; plain s2[1] on a labeled Series is deprecated in recent pandas
s2.iloc[1]

7

In [27]:
s2[2:4]

a   -5
c    3
dtype: int64

In [28]:
s2[['b', 'c', 'd']]

b    7
c    3
d    4
dtype: int64

In [29]:
# select several integer positions at once
s2.iloc[[1, 3]]

b    7
c    3
dtype: int64

Note that slicing with labels behaves differently from normal Python slicing: the end label is included in the result.

In [30]:
s2

d    4
b    7
a   -5
c    3
dtype: int64

In [31]:
s2['b':'c']

b    7
a   -5
c    3
dtype: int64

In [32]:
s2['a':'d']

Series([], dtype: int64)
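
The empty result above can be surprising. A label slice follows the order of the index rather than alphabetical order, and 'a' sits after 'd' in this index. The sketch below reproduces this on a hypothetical copy named `letters` and shows that sorting the index first gives a non-empty slice.

```python
import pandas as pd

# same data as s2 before any assignment; `letters` is an illustrative name
letters = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# 'a' comes after 'd' in the index, so there is nothing between them
unsorted_slice = letters.loc['a':'d']

# after sorting the index, 'a' comes first and the slice includes both ends
sorted_slice = letters.sort_index().loc['a':'d']

print(len(unsorted_slice), len(sorted_slice))
```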

In [33]:
s2['a':'c'] = 100
s2

d      4
b      7
a    100
c    100
dtype: int64

In [34]:
s2 < 50

d     True
b     True
a    False
c    False
dtype: bool

In [35]:
s2[s2 < 50]

d    4
b    7
dtype: int64
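
Boolean conditions like the one above can be combined. The following sketch, on a hypothetical Series named `values` with the same contents as s2 at this point, uses the element-wise operators `&` (and), `|` (or), and `~` (not).

```python
import pandas as pd

# hypothetical copy of the current s2 values
values = pd.Series([4, 7, 100, 100], index=['d', 'b', 'a', 'c'])

# parentheses are required because & and | bind tighter than comparisons
between = values[(values > 5) & (values < 50)]
outside = values[(values <= 5) | (values >= 50)]
not_small = values[~(values < 50)]

print(between)
print(outside)
print(not_small)
```

Python's `and` and `or` do not work here, because they expect a single truth value rather than a Series of them.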

### DataFrame

In [36]:
df2

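Index | year | province | population | household
:---: | :---: | :---: | :---: | :---:
one | 2016 | Chiang Mai | 1630428 | NaN
two | 2017 | Chiang Mai | 1664012 | NaN
three | 2018 | Chiang Mai | 1687971 | NaN
four | 2016 | Phrae | 398936 | NaN
five | 2017 | Phrae | 410382 | NaN
six | 2018 | Phrae | 421653 | NaN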


In [37]:
df2['population']

one      1630428
two      1664012
three    1687971
four      398936
five      410382
six       421653
Name: population, dtype: int64

In [38]:
df2[['province', 'year', 'household']]

Index | province | year | household
:---: | :---: | :---: | :---:
one | Chiang Mai | 2016 | NaN
two | Chiang Mai | 2017 | NaN
three | Chiang Mai | 2018 | NaN
four | Phrae | 2016 | NaN
five | Phrae | 2017 | NaN
six | Phrae | 2018 | NaN


In [39]:
df2[:2]

Index | year | province | population | household
:---: | :---: | :---: | :---: | :---:
one | 2016 | Chiang Mai | 1630428 | NaN
two | 2017 | Chiang Mai | 1664012 | NaN


In [40]:
df2['population'] > 1500000

one       True
two       True
three     True
four     False
five     False
six      False
Name: population, dtype: bool

In [41]:
df2[df2['population'] > 1500000]

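Index | year | province | population | household
:---: | :---: | :---: | :---: | :---:
one | 2016 | Chiang Mai | 1630428 | NaN
two | 2017 | Chiang Mai | 1664012 | NaN
three | 2018 | Chiang Mai | 1687971 | NaN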
three,2018,Chiang Mai,1687971,


### Data Referencing with *loc* and *iloc*
*loc* and *iloc* can be used for selecting a subset of rows and columns in a DataFrame.  *loc* is for label indexing and *iloc* is for integer indexing.  The selection can be used for both read and write operations.

In [42]:
df2

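Index | year | province | population | household
:---: | :---: | :---: | :---: | :---:
one | 2016 | Chiang Mai | 1630428 | NaN
two | 2017 | Chiang Mai | 1664012 | NaN
three | 2018 | Chiang Mai | 1687971 | NaN
four | 2016 | Phrae | 398936 | NaN
five | 2017 | Phrae | 410382 | NaN
six | 2018 | Phrae | 421653 | NaN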


In [43]:
df2.loc['two', 'year']

2017

In [44]:
df2.loc[['one', 'three', 'six'], ['population', 'household']]

Index | population | household
:---: | :---: | :---:
one | 1630428 | NaN
three | 1687971 | NaN
six | 421653 | NaN


In [45]:
df2.loc['three']

year                2018
province      Chiang Mai
population       1687971
household            NaN
Name: three, dtype: object

In [46]:
df2.loc['three', :]

year                2018
province      Chiang Mai
population       1687971
household            NaN
Name: three, dtype: object

In [47]:
df2.loc[:,'population']

one      1630428
two      1664012
three    1687971
four      398936
five      410382
six       421653
Name: population, dtype: int64

In [48]:
df2.population += df2.year

In [49]:
df2

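Index | year | province | population | household
:---: | :---: | :---: | :---: | :---:
one | 2016 | Chiang Mai | 1632444 | NaN
two | 2017 | Chiang Mai | 1666029 | NaN
three | 2018 | Chiang Mai | 1689989 | NaN
four | 2016 | Phrae | 400952 | NaN
five | 2017 | Phrae | 412399 | NaN
six | 2018 | Phrae | 423671 | NaN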


In [50]:
# household is all NaN, and NaN < 15 is False, so no row matches and nothing changes
df2.loc[df2['household'] < 15, 'household'] = 10
df2

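Index | year | province | population | household
:---: | :---: | :---: | :---: | :---:
one | 2016 | Chiang Mai | 1632444 | NaN
two | 2017 | Chiang Mai | 1666029 | NaN
three | 2018 | Chiang Mai | 1689989 | NaN
four | 2016 | Phrae | 400952 | NaN
five | 2017 | Phrae | 412399 | NaN
six | 2018 | Phrae | 423671 | NaN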
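
The household column is unchanged above because every comparison with NaN evaluates to False, so the condition selects no rows. The sketch below shows the same conditional write on a hypothetical DataFrame named `frame` that has some real values, and then assigns to the missing entries by selecting them explicitly with *isnull*. The replacement values 10 and 0 are arbitrary.

```python
import numpy as np
import pandas as pd

# hypothetical household figures: one missing, two present
frame = pd.DataFrame({'household': [np.nan, 12.0, 20.0]},
                     index=['one', 'two', 'three'])

mask = frame['household'] < 15      # NaN < 15 evaluates to False
frame.loc[mask, 'household'] = 10   # only row 'two' is updated

# to give missing entries a value as well, select them explicitly
frame.loc[frame['household'].isnull(), 'household'] = 0

print(frame)
```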


In [51]:
df2.iloc[0]

year                2016
province      Chiang Mai
population       1632444
household            NaN
Name: one, dtype: object

In [52]:
df2.iloc[[0,1], [1, 3]]

Index | province | household
:---: | :---: | :---:
one | Chiang Mai | NaN
two | Chiang Mai | NaN


In [53]:
df2.iloc[:4, 1:3]

Index | province | population
:---: | :---: | :---:
one | Chiang Mai | 1632444
two | Chiang Mai | 1666029
three | Chiang Mai | 1689989
four | Phrae | 400952
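
One difference between the two accessors is easy to miss: a slice in *loc* includes its end label, while a slice in *iloc* excludes its stop position, as in normal Python slicing. The following sketch compares them on a hypothetical three-row DataFrame named `frame`.

```python
import pandas as pd

# hypothetical DataFrame with labeled rows
frame = pd.DataFrame(
    {'year': [2016, 2017, 2018], 'population': [1630428, 1664012, 1687971]},
    index=['one', 'two', 'three'],
)

by_label = frame.loc['one':'two', 'year':'population']   # both end labels included
by_position = frame.iloc[0:2, 0:2]                       # stop positions excluded

print(by_label)
print(by_label.equals(by_position))
```

Both selections return the first two rows and both columns.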


## Arithmetic and Data Alignment
An important pandas feature is the behavior of arithmetic between objects with different indexes.  The index of the result is the union of the two indexes, and labels that appear in only one object get NaN.

In [54]:
s1 = pd.Series([4, 7, -5, 3], index=['d','b','a','c'])
s1

d    4
b    7
a   -5
c    3
dtype: int64

In [55]:
s2 = pd.Series([1, 0, 9], index=['a','c','x'])
s2

a    1
c    0
x    9
dtype: int64

In [56]:
s1 + s2

a   -4.0
b    NaN
c    3.0
d    NaN
x    NaN
dtype: float64

This is also true for DataFrame, where alignment happens on both rows and columns.
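
As a sketch of the DataFrame case, the two hypothetical frames below, `left` and `right`, share only the row label 'y' and the column 'b', so that is the only cell with a number in the sum.

```python
import pandas as pd

# hypothetical frames that overlap in one row label and one column
left = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])
right = pd.DataFrame({'b': [10, 20], 'c': [30, 40]}, index=['y', 'z'])

# the result covers the union of row labels and the union of columns
total = left + right

print(total)
```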

### Arithmetic methods with fill values

In [57]:
s1.add(s2, fill_value=0)

a   -4.0
b    7.0
c    3.0
d    4.0
x    9.0
dtype: float64
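
The other arithmetic methods accept *fill_value* in the same way. Below is a short sketch with *sub* and *mul* on two hypothetical Series, `first` and `second`. Note that a sensible fill value depends on the operation: 0 is neutral for addition and subtraction, while 1 is neutral for multiplication.

```python
import pandas as pd

# hypothetical operands sharing only the label 'a'
first = pd.Series([4, 7], index=['a', 'b'])
second = pd.Series([1, 9], index=['a', 'x'])

# a label missing from one side is treated as fill_value on that side
difference = first.sub(second, fill_value=0)
product = first.mul(second, fill_value=1)

print(difference)
print(product)
```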

## Sorting

In [58]:
s1 = pd.Series([4, 7, -5, 3], index=['d','b','a','c'])
s1

d    4
b    7
a   -5
c    3
dtype: int64

In [59]:
s1.sort_index()

a   -5
b    7
c    3
d    4
dtype: int64

In [60]:
s1.sort_values()

a   -5
c    3
d    4
b    7
dtype: int64

In [61]:
# np.nan is the portable spelling; the alias np.NaN was removed in NumPy 2.0
s1['c'] = np.nan

In [62]:
s1

d    4.0
b    7.0
a   -5.0
c    NaN
dtype: float64

In [77]:
s1.sort_values()

a   -5.0
d    4.0
b    7.0
c    NaN
dtype: float64
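
As the output shows, missing values are placed at the end by default. The sketch below, on a hypothetical Series named `values`, uses the *ascending* and *na_position* parameters of *sort_values* to change the direction and the placement of NaN.

```python
import numpy as np
import pandas as pd

# hypothetical Series with one missing value
values = pd.Series([4, 7, -5, np.nan], index=['d', 'b', 'a', 'c'])

descending = values.sort_values(ascending=False)      # NaN still goes last
nan_first = values.sort_values(na_position='first')   # put missing values first

print(descending)
print(nan_first)
```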

In [78]:
df2

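Index | year | province | population | household
:---: | :---: | :---: | :---: | :---:
one | 2016 | Chiang Mai | 1632444 | NaN
two | 2017 | Chiang Mai | 1666029 | NaN
three | 2018 | Chiang Mai | 1689989 | NaN
four | 2016 | Phrae | 400952 | NaN
five | 2017 | Phrae | 412399 | NaN
six | 2018 | Phrae | 423671 | NaN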


In [None]:
# sort rows by a column; the column is named "population" ("pop" would raise a KeyError)
df2.sort_values("population")
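
A DataFrame can also be sorted by several columns at once. Here is a sketch on a hypothetical DataFrame named `frame`, passing a list of column names and a matching list of directions to *sort_values*.

```python
import pandas as pd

# hypothetical rows in no particular order
frame = pd.DataFrame({
    'province': ['Phrae', 'Chiang Mai', 'Phrae', 'Chiang Mai'],
    'year': [2016, 2017, 2017, 2016],
    'population': [398936, 1664012, 410382, 1630428],
})

# province ascending, then year descending within each province
ordered = frame.sort_values(['province', 'year'], ascending=[True, False])

print(ordered)
```

The original row labels travel with their rows, so the index of the result is no longer in ascending order.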

## Summarizing and Descriptive Statistics

In [64]:
df2

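Index | year | province | population | household
:---: | :---: | :---: | :---: | :---:
one | 2016 | Chiang Mai | 1632444 | NaN
two | 2017 | Chiang Mai | 1666029 | NaN
three | 2018 | Chiang Mai | 1689989 | NaN
four | 2016 | Phrae | 400952 | NaN
five | 2017 | Phrae | 412399 | NaN
six | 2018 | Phrae | 423671 | NaN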


In [65]:
df2.shape

(6, 4)

In [66]:
df2.count()

year          6
province      6
population    6
household     0
dtype: int64

In [67]:
df2.min()

year                2016
province      Chiang Mai
population        400952
household           None
dtype: object

In [68]:
df2.max()

year             2018
province        Phrae
population    1689989
household        None
dtype: object

In [69]:
df2.sum()

year                                                  12102
province      Chiang MaiChiang MaiChiang MaiPhraePhraePhrae
population                                          6225484
household                                                 0
dtype: object

In [70]:
df2

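Index | year | province | population | household
:---: | :---: | :---: | :---: | :---:
one | 2016 | Chiang Mai | 1632444 | NaN
two | 2017 | Chiang Mai | 1666029 | NaN
three | 2018 | Chiang Mai | 1689989 | NaN
four | 2016 | Phrae | 400952 | NaN
five | 2017 | Phrae | 412399 | NaN
six | 2018 | Phrae | 423671 | NaN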


In [71]:
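# numeric_only=True skips the text column, which pandas 2.0 and later would otherwise try to add to the numbers
df2.sum(axis='columns', numeric_only=True)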

one      1634460
two      1668046
three    1692007
four      402968
five      414416
six       425689
dtype: int64
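
The *axis* argument names the direction that is collapsed. The following sketch on a hypothetical all-numeric DataFrame named `frame` puts the two directions side by side.

```python
import pandas as pd

# hypothetical numeric-only DataFrame
frame = pd.DataFrame({'year': [2016, 2017], 'population': [398936, 410382]},
                     index=['four', 'five'])

per_column = frame.sum(axis='index')     # collapse rows: one total per column
per_row = frame.sum(axis='columns')      # collapse columns: one total per row

print(per_column)
print(per_row)
```

Adding a year to a population has no real meaning; the row-wise sum is used here only to show the mechanics.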

In [72]:
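# on pandas 2.0 and later, the text column makes this raise a TypeError unless numeric_only=True is passed
df2.mean()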

year          2.017000e+03
population    1.037581e+06
household              NaN
dtype: float64

In [73]:
df2.describe()

Index | year | population
:---: | :---: | :---:
count | 6.0 | 6.0
mean | 2017.0 | 1037581.0
std | 0.894427 | 685197.7
min | 2016.0 | 400952.0
25% | 2016.25 | 415217.0
50% | 2017.0 | 1028058.0
75% | 2017.75 | 1657633.0
max | 2018.0 | 1689989.0


In [74]:
df2

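Index | year | province | population | household
:---: | :---: | :---: | :---: | :---:
one | 2016 | Chiang Mai | 1632444 | NaN
two | 2017 | Chiang Mai | 1666029 | NaN
three | 2018 | Chiang Mai | 1689989 | NaN
four | 2016 | Phrae | 400952 | NaN
five | 2017 | Phrae | 412399 | NaN
six | 2018 | Phrae | 423671 | NaN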


In [75]:
df2.year.value_counts()

2018    2
2017    2
2016    2
Name: year, dtype: int64

In [76]:
df2.province.value_counts()

Chiang Mai    3
Phrae         3
Name: province, dtype: int64
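
*value_counts* can also report proportions instead of raw counts. Below is a minimal sketch on a hypothetical Series named `provinces`, using the *normalize* parameter.

```python
import pandas as pd

# hypothetical column of repeated labels
provinces = pd.Series(['Chiang Mai', 'Chiang Mai', 'Chiang Mai', 'Phrae'])

counts = provinces.value_counts()                  # most frequent value first
shares = provinces.value_counts(normalize=True)    # fractions that sum to 1

print(counts)
print(shares)
```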