
# Introduction to Pandas
- Pandas is an open source library built on top of NumPy
- Allows fast analysis, cleaning and preparation of data
- High performance and productivity
- Built-in visualization capability
- Work with data from many sources


## Overview of Pandas Capabilities
- Series
- DataFrames
- Selection
- Missing Data
- Operations
- Merging, Joining, and Concatenating
- GroupBy
- Reshaping Data and Pivot Tables
- Time Series
- Data Input/Output


# Pandas
This is a basic introduction to the pandas module. 

First we start off with the customary imports. 


In [0]:
import numpy as np
import pandas as pd

np.random.seed(42)


# Object Creation

## Series
- Similar to a NumPy array
    - Built on top of it
- Can have axis labels (an index)


## Series Creation
- Here we will show a few ways to create series
    - Throughout the course we will be primarily dealing with DataFrames
    - DataFrames will be discussed shortly

## Series
- Series can hold a variety of object types
    - Numbers, strings, etc.

Create a Series by passing a list and letting pandas create a default integer index.


In [0]:
s = pd.Series([1,2,3,np.nan, 4,5])

In [0]:
s


0    1.0
1    2.0
2    3.0
3    NaN
4    4.0
5    5.0
dtype: float64
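
A list is not the only possible input. As a minimal sketch of two other common ways to build a Series, the example below uses a dictionary and a NumPy array. The names `s_dict` and `s_arr` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels.
# (Names and values are hypothetical, chosen only for this example.)
s_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

# From a NumPy array, with an explicit index supplied alongside the data.
s_arr = pd.Series(np.array([1.5, 2.5, 3.5]), index=['x', 'y', 'z'])

print(s_dict)
print(s_arr)
```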

## Series
- Key to using a series is understanding its index
    - Pandas makes use of these index names/numbers
    - Allows fast lookups of information
    - Works like a hash table or dictionary
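
The dictionary analogy can be sketched as follows. The Series `s_demo` is a hypothetical example; the point is only that lookups, membership tests, and `.get` all go through the index labels.

```python
import pandas as pd

# Hypothetical Series with string labels as its index.
s_demo = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'China', 'Japan'])

value = s_demo['Germany']          # label lookup, like a dictionary key
has_italy = 'Italy' in s_demo      # membership checks the index, not the values
fallback = s_demo.get('Italy', 0)  # .get returns a default instead of raising KeyError

print(value, has_italy, fallback)
```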
    

## Series Examples


In [0]:
s_1 = pd.Series([1,2,3,4],
                ['USA', 'Germany', 'China', 
                 'Japan'])

In [0]:
s_1

USA        1
Germany    2
China      3
Japan      4
dtype: int64

## Series Examples

In [0]:
s_2 = pd.Series([1,2,5,6],['USA', 'Germany', 'Italy', 'China'])

In [0]:
s_2

USA        1
Germany    2
Italy      5
China      6
dtype: int64

In [0]:
s_2['China'] # Label-based lookup: the key must match an index label

6

In [0]:
s_2.index

Index(['USA', 'Germany', 'Italy', 'China'], dtype='object')

## Series Example

In [0]:
labels = ['a', 'b', 'c'] 
s_3 = pd.Series(data=labels)

In [0]:
s_3

0    a
1    b
2    c
dtype: object

In [0]:
s_3[2]


'c'

## Series
- Arithmetic operations are aligned on the index labels
- Labels present in only one Series produce NaN
- Integers are converted to floats, because NaN is a floating-point value


In [0]:
s_1

USA        1
Germany    2
China      3
Japan      4
dtype: int64

In [0]:
s_2

USA        1
Germany    2
Italy      5
China      6
dtype: int64

In [0]:
s_1 + s_2



China      9.0
Germany    4.0
Italy      NaN
Japan      NaN
USA        2.0
dtype: float64
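
If NaN is not the result you want for unmatched labels, one option is the `add` method with `fill_value`. This is a sketch of that alternative, rebuilding `s_1` and `s_2` with the same values so the block runs on its own; whether 0 is a sensible substitute depends on your data.

```python
import pandas as pd

s_1 = pd.Series([1, 2, 3, 4], ['USA', 'Germany', 'China', 'Japan'])
s_2 = pd.Series([1, 2, 5, 6], ['USA', 'Germany', 'Italy', 'China'])

# Treat a label that is missing from one Series as 0 instead of producing NaN.
total = s_1.add(s_2, fill_value=0)
print(total)
```

With this call, `Italy` and `Japan` keep the value from the one Series that contains them.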

# DataFrame Creation

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns.

In [0]:
dates = pd.date_range('20190101',periods=10)

In [0]:
dates

DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06', '2019-01-07', '2019-01-08',
               '2019-01-09', '2019-01-10'],
              dtype='datetime64[ns]', freq='D')

In [0]:
df = pd.DataFrame(np.random.randn(10,4), 
                  index=dates, 
                  columns=list('ABCD'))

In [0]:
df

,A,B,C,D
2019-01-01,0.496714,-0.138264,0.647689,1.52303
2019-01-02,-0.234153,-0.234137,1.579213,0.767435
2019-01-03,-0.469474,0.54256,-0.463418,-0.46573
2019-01-04,0.241962,-1.91328,-1.724918,-0.562288
2019-01-05,-1.012831,0.314247,-0.908024,-1.412304
2019-01-06,1.465649,-0.225776,0.067528,-1.424748
2019-01-07,-0.544383,0.110923,-1.150994,0.375698
2019-01-08,-0.600639,-0.291694,-0.601707,1.852278
2019-01-09,-0.013497,-1.057711,0.822545,-1.220844
2019-01-10,0.208864,-1.95967,-1.328186,0.196861


Creating a DataFrame by passing a dictionary of objects, each of which can be converted to a Series-like column:

In [0]:
df2 = pd.DataFrame({'A':1.,
                    'B': pd.Timestamp('20190101'),
                    'C':pd.Series(1, index=list(range(4)),dtype='float32'),
                    'D':np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(['test','train','test','train']),
                    'F':'foo'})

In [0]:
df2

,A,B,C,D,E,F
0,1.0,2019-01-01,1.0,3,test,foo
1,1.0,2019-01-01,1.0,3,train,foo
2,1.0,2019-01-01,1.0,3,test,foo
3,1.0,2019-01-01,1.0,3,train,foo


The columns of the DataFrame have different `dtypes`.

In [0]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
A    4 non-null float64
B    4 non-null datetime64[ns]
C    4 non-null float32
D    4 non-null int32
E    4 non-null category
F    4 non-null object
dtypes: category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1)
memory usage: 260.0+ bytes
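
For a more compact view than `info()`, the `dtypes` attribute returns one dtype per column, and `select_dtypes` filters columns by type. The sketch below rebuilds `df2` so it runs on its own; the name `numeric_only` is just an illustrative choice.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20190101'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(['test', 'train', 'test', 'train']),
                    'F': 'foo'})

# One dtype per column, returned as a Series indexed by column name.
print(df2.dtypes)

# Keep only the numeric columns (here A, C, and D).
numeric_only = df2.select_dtypes(include='number')
print(numeric_only.columns.tolist())
```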



# Viewing Data

Here is how to view the top and bottom rows of the frame:

In [0]:
df.head()

,A,B,C,D
2019-01-01,0.496714,-0.138264,0.647689,1.52303
2019-01-02,-0.234153,-0.234137,1.579213,0.767435
2019-01-03,-0.469474,0.54256,-0.463418,-0.46573
2019-01-04,0.241962,-1.91328,-1.724918,-0.562288
2019-01-05,-1.012831,0.314247,-0.908024,-1.412304


In [0]:
df.tail(2)

,A,B,C,D
2019-01-09,-0.013497,-1.057711,0.822545,-1.220844
2019-01-10,0.208864,-1.95967,-1.328186,0.196861


Display the index, columns, and shape:

In [0]:
df.index

DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06', '2019-01-07', '2019-01-08',
               '2019-01-09', '2019-01-10'],
              dtype='datetime64[ns]', freq='D')

In [0]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [0]:
df.shape

(10, 4)

We can convert our DataFrame (which holds only floating-point values) to a NumPy array.

In [0]:
df.to_numpy() # This can be an expensive operation if the columns are not all floats
# df.values   # older attribute that returns the same array here

array([[ 0.49671415, -0.1382643 ,  0.64768854,  1.52302986],
       [-0.23415337, -0.23413696,  1.57921282,  0.76743473],
       [-0.46947439,  0.54256004, -0.46341769, -0.46572975],
       [ 0.24196227, -1.91328024, -1.72491783, -0.56228753],
       [-1.01283112,  0.31424733, -0.90802408, -1.4123037 ],
       [ 1.46564877, -0.2257763 ,  0.0675282 , -1.42474819],
       [-0.54438272,  0.11092259, -1.15099358,  0.37569802],
       [-0.60063869, -0.29169375, -0.60170661,  1.85227818],
       [-0.01349722, -1.05771093,  0.82254491, -1.22084365],
       [ 0.2088636 , -1.95967012, -1.32818605,  0.19686124]])

In [0]:
df2.to_numpy() # Columns have different dtypes, so the result is an object array

array([[1.0, Timestamp('2019-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2019-01-01 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2019-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2019-01-01 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
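
The reason for the `object` result is that a NumPy array has a single dtype, so pandas must find one that can hold every column. A small sketch with a hypothetical frame named `mixed` shows the difference between converting all columns and converting only the numeric ones.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a float column, an integer column, and a string column.
mixed = pd.DataFrame({'x': [1.0, 2.0], 'y': [3, 4], 'label': ['a', 'b']})

all_cols = mixed.to_numpy()             # object array: one dtype must fit every column
numeric = mixed[['x', 'y']].to_numpy()  # float64: the integers are upcast to floats

print(all_cols.dtype, numeric.dtype)
```

Selecting the numeric columns first is one way to avoid an `object` array when you only need the numbers.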

`describe()` shows a quick statistical summary of your data:

In [0]:
df.describe()

,A,B,C,D
count,10.0,10.0,10.0,10.0
mean,-0.046179,-0.48528,-0.306027,-0.037061
std,0.701907,0.874335,1.060584,1.181041
min,-1.012831,-1.95967,-1.724918,-1.424748
25%,-0.525656,-0.866207,-1.090251,-1.056205
50%,-0.123825,-0.229957,-0.532562,-0.134434
75%,0.233688,0.048626,0.502648,0.669501
max,1.465649,0.54256,1.579213,1.852278
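
The summary is itself a DataFrame, so it can be customized and indexed like any other. As a sketch, the example below requests different percentiles and pulls out a single row; the frame `demo` and its random values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10 rows of standard-normal values in two columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((10, 2)), columns=['A', 'B'])

# Ask for the 10th and 90th percentiles in the summary.
summary = demo.describe(percentiles=[0.1, 0.9])
print(summary)

# Each statistic is a row label, so .loc retrieves it directly.
print(summary.loc['mean'])
```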


Transposing your data:

In [0]:
df.T

,2019-01-01,2019-01-02,2019-01-03,2019-01-04,2019-01-05,2019-01-06,2019-01-07,2019-01-08,2019-01-09,2019-01-10
A,0.496714,-0.234153,-0.469474,0.241962,-1.012831,1.465649,-0.544383,-0.600639,-0.013497,0.208864
B,-0.138264,-0.234137,0.54256,-1.91328,0.314247,-0.225776,0.110923,-0.291694,-1.057711,-1.95967
C,0.647689,1.579213,-0.463418,-1.724918,-0.908024,0.067528,-1.150994,-0.601707,0.822545,-1.328186
D,1.52303,0.767435,-0.46573,-0.562288,-1.412304,-1.424748,0.375698,1.852278,-1.220844,0.196861


Sorting by an axis:

In [0]:
df.sort_index(axis=0, ascending=False) # axis=0 sorts by the row index; axis=1 would sort by the column labels

,A,B,C,D
2019-01-10,0.208864,-1.95967,-1.328186,0.196861
2019-01-09,-0.013497,-1.057711,0.822545,-1.220844
2019-01-08,-0.600639,-0.291694,-0.601707,1.852278
2019-01-07,-0.544383,0.110923,-1.150994,0.375698
2019-01-06,1.465649,-0.225776,0.067528,-1.424748
2019-01-05,-1.012831,0.314247,-0.908024,-1.412304
2019-01-04,0.241962,-1.91328,-1.724918,-0.562288
2019-01-03,-0.469474,0.54256,-0.463418,-0.46573
2019-01-02,-0.234153,-0.234137,1.579213,0.767435
2019-01-01,0.496714,-0.138264,0.647689,1.52303
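
Sorting along the other axis reorders the columns rather than the rows. The following sketch uses a small hypothetical frame whose columns start out of alphabetical order.

```python
import pandas as pd

# Hypothetical frame with columns deliberately out of order.
demo = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['B', 'C', 'A'])

by_columns = demo.sort_index(axis=1)                      # columns become A, B, C
reversed_cols = demo.sort_index(axis=1, ascending=False)  # columns become C, B, A

print(by_columns)
print(reversed_cols)
```

In both cases `sort_index` returns a new DataFrame and leaves `demo` unchanged.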


Sorting by values:

In [0]:
df.sort_values(by='B', ascending=False)

,A,B,C,D
2019-01-03,-0.469474,0.54256,-0.463418,-0.46573
2019-01-05,-1.012831,0.314247,-0.908024,-1.412304
2019-01-07,-0.544383,0.110923,-1.150994,0.375698
2019-01-01,0.496714,-0.138264,0.647689,1.52303
2019-01-06,1.465649,-0.225776,0.067528,-1.424748
2019-01-02,-0.234153,-0.234137,1.579213,0.767435
2019-01-08,-0.600639,-0.291694,-0.601707,1.852278
2019-01-09,-0.013497,-1.057711,0.822545,-1.220844
2019-01-04,0.241962,-1.91328,-1.724918,-0.562288
2019-01-10,0.208864,-1.95967,-1.328186,0.196861
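
`sort_values` also accepts a list of columns, with a matching list of sort directions. This is a minimal sketch on a hypothetical frame with `group` and `score` columns, sorting groups in ascending order and scores in descending order within each group.

```python
import pandas as pd

# Hypothetical data: two groups with two scores each.
demo = pd.DataFrame({'group': ['b', 'a', 'b', 'a'],
                     'score': [1, 4, 3, 2]})

# Sort by group (ascending), then by score (descending) inside each group.
ordered = demo.sort_values(by=['group', 'score'], ascending=[True, False])
print(ordered)
```

The original row labels travel with their rows, which makes it easy to see how the rows were rearranged.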



# Selection
Selecting a single column yields a `Series`:

In [0]:
df['A']

2019-01-01    0.496714
2019-01-02   -0.234153
2019-01-03   -0.469474
2019-01-04    0.241962
2019-01-05   -1.012831
2019-01-06    1.465649
2019-01-07   -0.544383
2019-01-08   -0.600639
2019-01-09   -0.013497
2019-01-10    0.208864
Freq: D, Name: A, dtype: float64
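
Passing a list of column names instead of a single name returns a DataFrame. The sketch below uses a small hypothetical frame named `demo` to show the difference in return type.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 3 dates, 4 columns of consecutive integers.
dates = pd.date_range('20190101', periods=3)
demo = pd.DataFrame(np.arange(12).reshape(3, 4), index=dates, columns=list('ABCD'))

one_col = demo['A']          # a single name returns a Series
two_cols = demo[['A', 'C']]  # a list of names returns a DataFrame
still_frame = demo[['A']]    # a one-element list still returns a DataFrame

print(type(one_col).__name__, type(two_cols).__name__, type(still_frame).__name__)
```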

Passing a slice to `[]` selects rows. A positional slice excludes its end position, while a label slice includes both endpoints:

In [0]:
df[0:3]

,A,B,C,D
2019-01-01,0.496714,-0.138264,0.647689,1.52303
2019-01-02,-0.234153,-0.234137,1.579213,0.767435
2019-01-03,-0.469474,0.54256,-0.463418,-0.46573


In [0]:
df['20190101':'20190103']



,A,B,C,D
2019-01-01,0.496714,-0.138264,0.647689,1.52303
2019-01-02,-0.234153,-0.234137,1.579213,0.767435
2019-01-03,-0.469474,0.54256,-0.463418,-0.46573
