# DataFrames

* DataFrames are the central data structure of pandas and are directly inspired by data frames in the R programming language.
* A DataFrame can be thought of as a collection of Series objects that share the same index.
___
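
As a rough illustration of the "Series sharing one index" idea, the sketch below builds a DataFrame from two small Series. The names `w`, `x`, and `frame` and the values are made up for this example.

```python
import pandas as pd

# Two Series with the same index labels (illustrative values only).
w = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
x = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Combining them column-wise gives a DataFrame aligned on the shared index.
frame = pd.DataFrame({'w': w, 'x': x})
print(frame)

# Pulling a column back out returns a Series again.
print(type(frame['w']))
```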

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Seed NumPy's random number generator with 103 so the random numbers are reproducible
from numpy.random import randn
np.random.seed(103)

In [5]:
dfx = pd.DataFrame(randn(5,4),columns='Australia Thailand Japan Korea'.split())
dfx

,Australia,Thailand,Japan,Korea
0,0.665646,1.697165,1.053685,0.29262
1,-0.824499,0.101015,0.248726,1.715312
2,-0.377291,-2.098652,-0.602954,-0.518908
3,-2.212747,-0.406055,-0.296496,0.271738
4,-0.458887,0.463878,-1.027579,0.172251
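
A minimal sketch of what seeding does, assuming the legacy `np.random.seed` interface used in this notebook: re-seeding with the same value restarts the same pseudo-random sequence. The names `first` and `second` are only for illustration.

```python
import numpy as np

# Seed, then draw a 5x4 array of standard-normal values.
np.random.seed(103)
first = np.random.randn(5, 4)

# Re-seed with the same value and draw again: the sequence repeats.
np.random.seed(103)
second = np.random.randn(5, 4)

print(np.array_equal(first, second))
```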


In [6]:
# The arguments are the data, the index (row labels) and the column labels
# Each column is a pandas Series
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df

,W,X,Y,Z
A,0.593688,-0.308152,-0.923724,-0.702789
B,1.013443,1.528611,0.526076,1.688804
C,0.012561,0.840668,-0.090803,1.03216
D,0.484239,-0.389983,-0.101988,1.847697
E,0.674614,-1.392793,-0.578982,-0.248501


___
### Selection and Indexing

Methods of obtaining data from a DataFrame.

In [8]:
# Obtain data from the 'Japan' column
# Try type(dfx['Japan'])
# Attribute access (dfx.Japan) also works but is not recommended: it can clash with DataFrame method names
dfx['Japan']

0    1.053685
1    0.248726
2   -0.602954
3   -0.296496
4   -1.027579
Name: Japan, dtype: float64
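
One reason bracket notation is usually preferred over attribute access can be sketched as follows. The DataFrame `df_demo` and its column named `count` are hypothetical, chosen because `count` is also a DataFrame method.

```python
import pandas as pd

# A column deliberately named like an existing DataFrame method.
df_demo = pd.DataFrame({'count': [1, 2, 3], 'X': [4, 5, 6]})

by_bracket = df_demo['count']   # the column, as a Series
by_attr = df_demo.count         # the DataFrame.count method, not the column

print(type(by_bracket))
print(callable(by_attr))
```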

In [9]:
# Obtain data from multiple columns
df[['W','Z']]

,W,Z
A,0.593688,-0.702789
B,1.013443,1.688804
C,0.012561,1.03216
D,0.484239,1.847697
E,0.674614,-0.248501


DataFrame columns are Series objects:

In [10]:
type(df['X'])

pandas.core.series.Series

**Creating a new column:**

In [11]:
# This modifies the existing df by adding the new column to it
df['huh'] = df['W'] + df['Y']
df

,W,X,Y,Z,huh
A,0.593688,-0.308152,-0.923724,-0.702789,-0.330036
B,1.013443,1.528611,0.526076,1.688804,1.539519
C,0.012561,0.840668,-0.090803,1.03216,-0.078242
D,0.484239,-0.389983,-0.101988,1.847697,0.38225
E,0.674614,-1.392793,-0.578982,-0.248501,0.095631
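
If the original DataFrame should stay untouched, `DataFrame.assign` is one alternative to item assignment. This is a small sketch with made-up values; `df_demo` and `with_sum` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1.0, 2.0], 'Y': [0.5, -0.5]}, index=['A', 'B'])

# assign() returns a new DataFrame with the extra column; df_demo is unchanged.
with_sum = df_demo.assign(huh=df_demo['W'] + df_demo['Y'])

print(with_sum)
print(list(df_demo.columns))
```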


**Removing Columns**

In [14]:
# inplace=True modifies df itself (the column is removed permanently) instead of returning a new DataFrame
df.drop('huh',axis=1,inplace=True)

In [5]:
# axis=1 refers to columns; without inplace=True, drop() returns a new DataFrame
df.drop('huh',axis=1)

,W,X,Y,Z
A,-1.249278,-0.260331,0.383793,-0.385461
B,-1.085137,2.327219,0.430793,0.432316
C,-0.980011,-0.631965,0.577442,-0.124758
D,0.978948,1.594922,-1.201945,-1.376369
E,1.054346,-0.038853,0.680286,1.329175


In [40]:
df

,W,X,Y,Z
A,1.655699,-1.016102,-0.796458,0.79172
B,0.254532,-0.810877,-0.689845,-0.512344
C,0.336987,0.055036,-1.585869,-0.764266
D,-1.142788,-0.558285,-1.013816,1.15316
E,0.833378,2.4304,0.74685,1.442444


Using .drop() to remove rows:

In [18]:
# axis=0 refers to rows; df.drop('E',axis=0) would remove row E in the same way
df.drop('B',axis=0)

,W,X,Y,Z
A,0.593688,-0.308152,-0.923724,-0.702789
C,0.012561,0.840668,-0.090803,1.03216
D,0.484239,-0.389983,-0.101988,1.847697
E,0.674614,-1.392793,-0.578982,-0.248501


In [19]:
# Axis numbers follow the order of df.shape: for a 2D DataFrame, rows are axis 0 and columns are axis 1
df.shape

(5, 4)
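
The relationship between `axis` and `shape` can be checked on a small example. The sketch below uses a hypothetical `df_demo`; it also shows the `columns=` keyword as an alternative spelling of `axis=1`.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2, 3], 'X': [4, 5, 6]}, index=['A', 'B', 'C'])

no_row = df_demo.drop('B', axis=0)      # removes a row label
no_col = df_demo.drop('X', axis=1)      # removes a column label
same_col = df_demo.drop(columns='X')    # keyword form of axis=1

# Without inplace=True, df_demo itself keeps its original shape.
print(df_demo.shape, no_row.shape, no_col.shape)
```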

**Select rows - Method One: .loc[] based on row index name**

In [44]:
df.loc['A']

W    1.655699
X   -1.016102
Y   -0.796458
Z    0.791720
Name: A, dtype: float64

**Select rows - Method Two: .iloc[] based on numeric index position**

In [45]:
# Select the third row; positions are zero-based (0, 1, 2, ...)
df.iloc[2]

W    0.336987
X    0.055036
Y   -1.585869
Z   -0.764266
Name: C, dtype: float64
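
A short sketch contrasting label-based and position-based row selection, using a hypothetical `df_demo`. It also shows one difference that is easy to miss: `.loc` slices include the end label, while `.iloc` slices exclude the end position.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [10, 20, 30], 'X': [1, 2, 3]}, index=['A', 'B', 'C'])

by_label = df_demo.loc['C']      # row whose index label is 'C'
by_position = df_demo.iloc[2]    # row at integer position 2 (the third row)
print(by_label.equals(by_position))

# Label slices include the end label; position slices exclude the end position.
label_slice = df_demo.loc['A':'B']
position_slice = df_demo.iloc[0:1]
print(len(label_slice), len(position_slice))
```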

**Specifying row and column position(s) to obtain value(s)**

In [47]:
# Obtain the value at a specific row label and column label
df.loc['B','Y']

-0.6898447458335051

In [48]:
# Return the subset made of rows A, B and columns W, Y
df.loc[['A','B'],['W','Y']]

,W,Y
A,1.655699,-0.796458
B,0.254532,-0.689845
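
The same row-and-column selections can be written by position with `.iloc`. This is a minimal sketch on a hypothetical `df_demo` with integer values, so the results are easy to check by eye.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [1, 2, 3], 'X': [4, 5, 6], 'Y': [7, 8, 9]},
    index=['A', 'B', 'C'],
)

value = df_demo.loc['B', 'Y']                  # single cell by labels
same_value = df_demo.iloc[1, 2]                # the same cell by positions

block = df_demo.loc[['A', 'B'], ['W', 'Y']]    # sub-DataFrame by labels
same_block = df_demo.iloc[[0, 1], [0, 2]]      # the same sub-DataFrame by positions

print(value, same_value)
print(block)
```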


___
### Conditional Selection with Bracket Notation

In [20]:
df

,W,X,Y,Z
A,0.593688,-0.308152,-0.923724,-0.702789
B,1.013443,1.528611,0.526076,1.688804
C,0.012561,0.840668,-0.090803,1.03216
D,0.484239,-0.389983,-0.101988,1.847697
E,0.674614,-1.392793,-0.578982,-0.248501


In [21]:
# Which values in the DataFrame are larger than one?
df>1

,W,X,Y,Z
A,False,False,False,False
B,True,True,False,True
C,False,False,False,True
D,False,False,False,True
E,False,False,False,False


In [51]:
# Conditional selection on the whole DataFrame: values where the condition is False become NaN
# Equivalent: booldf = df>0, then df[booldf]
df[df>0]

,W,X,Y,Z
A,1.655699,,,0.79172
B,0.254532,,,
C,0.336987,0.055036,,
D,,,,1.15316
E,0.833378,2.4304,0.74685,1.442444
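
To make the masking behaviour concrete, here is a small sketch with a hypothetical 2x2 `df_demo`: the comparison yields a boolean DataFrame of the same shape, and indexing with it replaces the entries marked False with NaN.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1.0, -2.0], 'X': [-3.0, 4.0]}, index=['A', 'B'])

mask = df_demo > 0        # boolean DataFrame, same shape as df_demo
kept = df_demo[mask]      # entries where mask is False become NaN

print(mask)
print(kept)
```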


In [53]:
# Filter rows with a condition on one column: keep only the rows where W > 0
df[df['W']>0]

,W,X,Y,Z
A,1.655699,-1.016102,-0.796458,0.79172
B,0.254532,-0.810877,-0.689845,-0.512344
C,0.336987,0.055036,-1.585869,-0.764266
E,0.833378,2.4304,0.74685,1.442444


In [22]:
# Another example: store the filtered rows in a variable
abc = df[df['Z']<0]
abc

,W,X,Y,Z
A,0.593688,-0.308152,-0.923724,-0.702789
E,0.674614,-1.392793,-0.578982,-0.248501


In [23]:
# Show the X column of the rows isolated above
abc['X']

A   -0.308152
E   -1.392793
Name: X, dtype: float64

In [60]:
# All in one step by chaining the two selections: filter the rows, then select the column
df[df['W']>0]['Y']

A   -0.796458
B   -0.689845
C   -1.585869
E    0.746850
Name: Y, dtype: float64

In [61]:
df[df['W']>0][['Y','X']]

,Y,X
A,-0.796458,-1.016102
B,-0.689845,-0.810877
C,-1.585869,0.055036
E,0.74685,2.4304
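
The chained form performs two indexing steps. A single `.loc` call with a boolean row mask and a column list gives the same result; this is sketched below on a hypothetical `df_demo` and is not necessarily how the original notebook would write it.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [1, -1, 2], 'X': [5, 6, 7], 'Y': [8, 9, 10]},
    index=['A', 'B', 'C'],
)

chained = df_demo[df_demo['W'] > 0][['Y', 'X']]       # two indexing steps
single = df_demo.loc[df_demo['W'] > 0, ['Y', 'X']]    # one .loc call

print(single)
```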


**For multiple (two or more) conditions, use the operators & (and) and | (or), with each condition wrapped in parentheses:**

In [63]:
df[(df['W']>0) & (df['Y']>0)]

,W,X,Y,Z
E,0.833378,2.4304,0.74685,1.442444
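
The following sketch, on a hypothetical `df_demo`, shows both operators and why Python's `and` keyword cannot be used here: it asks for the truth value of a whole Series, which pandas treats as ambiguous and rejects with a `ValueError`.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, -1, 2], 'Y': [-3, 4, 5]}, index=['A', 'B', 'C'])

both = df_demo[(df_demo['W'] > 0) & (df_demo['Y'] > 0)]      # rows meeting both conditions
either = df_demo[(df_demo['W'] > 0) | (df_demo['Y'] > 0)]    # rows meeting at least one

# `and` tries to convert each Series to a single bool, which raises an error.
try:
    df_demo[(df_demo['W'] > 0) and (df_demo['Y'] > 0)]
    and_failed = False
except ValueError:
    and_failed = True

print(list(both.index), list(either.index), and_failed)
```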


___
### Index Features: Reset & Set

In [24]:
# Index reset: revert to the default 0, 1, ..., n-1 index; the old index becomes a column named 'index'
df.reset_index()

,index,W,X,Y,Z
0,A,0.593688,-0.308152,-0.923724,-0.702789
1,B,1.013443,1.528611,0.526076,1.688804
2,C,0.012561,0.840668,-0.090803,1.03216
3,D,0.484239,-0.389983,-0.101988,1.847697
4,E,0.674614,-1.392793,-0.578982,-0.248501
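
A small sketch of `reset_index` on a hypothetical `df_demo`. By default the old labels are kept as a new column; passing `drop=True` discards them instead. In both cases a new DataFrame is returned and the original is left as it was.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2]}, index=['A', 'B'])

kept_labels = df_demo.reset_index()                # old labels move into an 'index' column
dropped_labels = df_demo.reset_index(drop=True)    # old labels are discarded

print(kept_labels)
print(dropped_labels)
```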


In [5]:
# Index set: build the new index labels by splitting a string on whitespace
ni = 'KR JP FR DE NZ'.split()
ni

['KR', 'JP', 'FR', 'DE', 'NZ']

In [6]:
# Add the new labels to df as a column; the five items match the five rows of df
df['Countries'] = ni
df

,W,X,Y,Z,Countries
A,-1.249278,-0.260331,0.383793,-0.385461,KR
B,-1.085137,2.327219,0.430793,0.432316,JP
C,-0.980011,-0.631965,0.577442,-0.124758,FR
D,0.978948,1.594922,-1.201945,-1.376369,DE
E,1.054346,-0.038853,0.680286,1.329175,NZ


In [82]:
# Set the new column as the index (returns a new DataFrame; df itself is unchanged)
df.set_index('Countries')

Countries,W,X,Y,Z
KR,-1.827965,-0.031068,-1.411191,2.659326
JP,1.101063,1.057788,-1.440619,-1.739719
FR,-0.177101,0.007524,-3.176829,-0.465
DE,-0.94361,-1.54628,0.779946,-0.390912
NZ,-1.113685,0.148858,-0.049329,-0.209643


In [7]:
# Apply the change in place so that df itself is modified
df.set_index('Countries',inplace=True)

In [8]:
df

Countries,W,X,Y,Z
KR,-1.249278,-0.260331,0.383793,-0.385461
JP,-1.085137,2.327219,0.430793,0.432316
FR,-0.980011,-0.631965,0.577442,-0.124758
DE,0.978948,1.594922,-1.201945,-1.376369
NZ,1.054346,-0.038853,0.680286,1.329175
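
As an alternative to `inplace=True`, the result of `set_index` can be bound to a new name. The sketch below uses a hypothetical two-row `df_demo`; note that the previous index labels (A, B) are not kept in the result.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2], 'Countries': ['KR', 'JP']}, index=['A', 'B'])

# set_index returns a new DataFrame; df_demo keeps its A/B index.
by_country = df_demo.set_index('Countries')

print(by_country)
print(list(df_demo.index))
```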


In [10]:
df.columns

Index(['W', 'X', 'Y', 'Z'], dtype='object')

In [12]:
df.head(2)

Countries,W,X,Y,Z
KR,-1.249278,-0.260331,0.383793,-0.385461
JP,-1.085137,2.327219,0.430793,0.432316


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, KR to NZ
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   W       5 non-null      float64
 1   X       5 non-null      float64
 2   Y       5 non-null      float64
 3   Z       5 non-null      float64
dtypes: float64(4)
memory usage: 200.0+ bytes


___
### Advanced Topics: Multi-Index (multi-level index) and Index Hierarchy

In [13]:
df

Countries,W,X,Y,Z
KR,-1.249278,-0.260331,0.383793,-0.385461
JP,-1.085137,2.327219,0.430793,0.432316
FR,-0.980011,-0.631965,0.577442,-0.124758
DE,0.978948,1.594922,-1.201945,-1.376369
NZ,1.054346,-0.038853,0.680286,1.329175


In [16]:
# The code below constructs a DataFrame with a multi-level index (the construction details are not too important)
# Start with two lists, then use zip and list to pair them into a list of tuples
o = ['G1','G1','G1','G2','G2','G2']
i = [1,2,3,1,2,3]
hi = list(zip(o,i))
hi

[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]

In [17]:
# Create multi level index
hi = pd.MultiIndex.from_tuples(hi)
hi

MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )

In [18]:
# Index hierarchy: an outer level with two groups (G1, G2), an inner level with three labels (1, 2, 3), and two columns
df = pd.DataFrame(np.random.randn(6,2),index=hi,columns=['A','B'])
df

,,A,B
G1,1,1.28345,-1.758254
G1,2,0.614306,1.516358
G1,3,-0.195977,-0.817206
G2,1,-0.946128,0.220639
G2,2,-0.600734,-0.152566
G2,3,-1.187443,0.299138
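
When the index is the full combination of outer and inner labels, as it is here, `pd.MultiIndex.from_product` is a shorter way to build it than zipping lists into tuples. This is a sketch with illustrative names (`hi_product`, `hi_tuples`, `df_demo`) and placeholder integer data instead of random numbers.

```python
import numpy as np
import pandas as pd

# Every combination of the outer labels with the inner labels.
hi_product = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])

# The tuple-based construction used above, for comparison.
outer = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inner = [1, 2, 3, 1, 2, 3]
hi_tuples = pd.MultiIndex.from_tuples(list(zip(outer, inner)))

df_demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=hi_product, columns=['A', 'B'])
print(df_demo)
```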


**How to retrieve data from a multilevel index?**

Indexing:
* For rows in an index hierarchy, use df.loc[].
* Use normal bracket notation df[] for the columns axis.
* Selecting one label of the outer level returns the sub-DataFrame under it.

In [19]:
# Obtain everything under one label of the outermost level
df.loc['G1']

,A,B
1,1.28345,-1.758254
2,0.614306,1.516358
3,-0.195977,-0.817206


In [20]:
# Chain a second .loc to reach the inner level; the row (G1, 1) is returned as a Series
df.loc['G1'].loc[1]

A    1.283450
B   -1.758254
Name: 1, dtype: float64

In [21]:
# Show the names of the index levels; currently there are none
df.index.names

FrozenList([None, None])

In [22]:
# Pass strings as the new index names (two names due to two layers)
df.index.names = ['Group','Num']
df

Group,Num,A,B
G1,1,1.28345,-1.758254
G1,2,0.614306,1.516358
G1,3,-0.195977,-0.817206
G2,1,-0.946128,0.220639
G2,2,-0.600734,-0.152566
G2,3,-1.187443,0.299138


In [103]:
# Another example: Retrieve the value located in G2, 2, B
df.loc['G2'].loc[2]['B']

1.5257208705705767
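
The chained lookup can also be written as a single `.loc` call by passing the row key as a tuple. The sketch below uses a hypothetical `df_demo` filled with the integers 0 to 11, so the result is easy to verify by hand.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]], names=['Group', 'Num'])
df_demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=index, columns=['A', 'B'])

chained = df_demo.loc['G2'].loc[2]['B']    # three successive lookups
direct = df_demo.loc[('G2', 2), 'B']       # one lookup: tuple row key, then column label

print(chained, direct)
```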

In [104]:
# Cross section (xs) for a multi-level index
# Get everything under G1; equivalent to df.loc['G1']:
df.xs('G1')

Num,A,B
1,-1.357456,-0.337351
2,-0.525621,0.422249
3,0.231313,0.818303


In [106]:
# Obtain values under G1, 1
df.xs(('G1',1))

A   -1.357456
B   -0.337351
Name: (G1, 1), dtype: float64

In [107]:
# Cross section of row 1 in both G1 and G2 by naming the inner level, which is harder to express with .loc[]
df.xs(1,level='Num')

Group,A,B
G1,-1.357456,-0.337351
G2,-0.29927,-1.195671
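
For comparison, the sketch below selects on the inner level with both `.xs` and `.loc`. It uses a hypothetical `df_demo` with integer data; the `.loc` form needs `slice(None)` as a placeholder for "all groups", and `drop_level=False` keeps the selected level in the `.xs` result.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]], names=['Group', 'Num'])
df_demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=index, columns=['A', 'B'])

first_rows = df_demo.xs(1, level='Num')                      # 'Num' level is dropped
with_level = df_demo.xs(1, level='Num', drop_level=False)    # both levels are kept
via_loc = df_demo.loc[(slice(None), 1), :]                   # .loc equivalent, both levels kept

print(first_rows)
print(via_loc)
```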
