In this section, we study the functionality of the 'pandas' library in Python. 'pandas' provides several kinds of objects. If you are familiar with R, 'pandas' will feel very similar to the base R packages, with objects such as the 'DataFrame'. We will go over these objects one by one.

In [1]:
import pandas as pd
import numpy as np

The first main data type we will learn from 'pandas' is the 'Series' data type. A 'Series' object is very similar to a 'NumPy' array (in fact, it is built on top of the 'NumPy' array object). What differentiates a 'Series' from a 'NumPy' array is that a 'Series' can have axis labels, meaning its elements can be looked up by a label instead of just a numeric position. A 'Series' also doesn't need to hold numeric data; it can hold arbitrary Python objects, such as Python functions.

We start by learning how to create a 'Series' object from an existing array, list or dictionary. The Series() constructor accepts all of these common data types:

In [2]:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

In [3]:
series1=pd.Series(data=my_list)
series1

0    10
1    20
2    30
dtype: int64

In [4]:
series2=pd.Series(data=my_list,index=labels)
series2

a    10
b    20
c    30
dtype: int64

In [5]:
series3=pd.Series(my_list,labels)
series3

a    10
b    20
c    30
dtype: int64

In [6]:
series4=pd.Series(arr)
series4

0    10
1    20
2    30
dtype: int32

In [7]:
series5=pd.Series(arr,labels)
series5

a    10
b    20
c    30
dtype: int32

In [8]:
series6=pd.Series(d)
series6

a    10
b    20
c    30
dtype: int64

One thing to keep in mind is that a Python list can contain different data types, so when we convert a list into a 'Series' object, 'pandas' automatically chooses a data type for us, but in a more complicated way than SAS. In the example below, we have a list that contains a float, a string and an integer. The interesting thing is that when we convert the list to a new 'Series', the object 'ser0' as a whole gets the 'object' dtype, but each element still keeps its original data type:

In [9]:
labels = ['a','b','c']
my_list2 = [10.6743,'Got ya!',30]
ser0=pd.Series(my_list2, index=labels) # use the mixed-type list my_list2, not my_list
print(ser0, '\n')

for j in range(3):
    print(type(ser0.iloc[j])) # iloc gives access by position

a    10.6743
b    Got ya!
c         30
dtype: object 

<class 'float'>
<class 'str'>
<class 'int'>
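
The casting behaviour described above can be checked on a few small inputs. The sketch below is only an illustration with made-up values (the names 'ints_only', 'ints_and_floats' and 'mixed' are not part of the original example). It assumes the usual inference rule: purely numeric lists get a numeric dtype, while a list with no common numeric type falls back to 'object'.

```python
import pandas as pd

# Assumed illustrative inputs (not from the original text)
ints_only = pd.Series([1, 2, 3])
ints_and_floats = pd.Series([1, 2.5, 3])
mixed = pd.Series([1, 'two', 3.0])

print(ints_only.dtype)        # an integer dtype
print(ints_and_floats.dtype)  # float64: the integers are upcast to floats
print(mixed.dtype)            # object: no common numeric type exists

# Elements of an object Series keep their original Python types
print([type(v).__name__ for v in mixed])
```

In practice this means a single stray string in a numeric list is enough to turn the whole 'Series' into the 'object' dtype.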


The nicest thing about a 'Series' object is that you can assign flexible index labels, and it can hold a variety of object types. The key to using a 'Series' is understanding its index. 'pandas' uses these index names or numbers for fast information look-ups (the index works like a hash table or dictionary). You can also perform arithmetic operations on 'Series' objects. But keep in mind that operations between two 'Series' objects are aligned on index labels, not on element positions, and a label that appears in only one of the two produces NaN. Below are some examples:

In [10]:
series7=pd.Series(data=labels) # creating a series of labels
print(series7)
series8=pd.Series(data=[sum, print, len]) # creating a series of functions
print(series8)

0    a
1    b
2    c
dtype: object
0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object


In [11]:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])   
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])   
print('First Series:\n', ser1, '\n')
print('Second Series:\n', ser2, '\n')
ser1 + ser2 

First Series:
 USA        1
Germany    2
USSR       3
Japan      4
dtype: int64 

Second Series:
 USA        1
Germany    2
Italy      5
Japan      4
dtype: int64 



Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
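
If the NaN values produced by non-matching labels are not wanted, one possible approach is the add() method with its 'fill_value' argument, which treats a label missing from one side as the given value. The sketch below reuses the two 'Series' from the example above. Whether 0 is a sensible substitute is an assumption that depends on the data.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# Labels missing from one side are treated as 0 instead of producing NaN
total = ser1.add(ser2, fill_value=0)
print(total)
```

Here 'Italy' and 'USSR' keep the value from the one 'Series' they appear in, instead of becoming NaN.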

Next, we study 'DataFrame' objects, which are the core components of 'pandas' and are directly inspired by the R programming language. We can think of a 'DataFrame' as a bunch of 'Series' objects put together that share the same index. 

Below is an example of a 'DataFrame', which consists of the data, an index (row labels) and columns, just like what you encounter in SAS and R. Since a 'DataFrame' is a collection of 'Series' sharing a common index, each column can be treated as a 'Series' (in this sense, a column vector in mathematics is the equivalent of a 'Series').

In [12]:
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split()) # sampling from a standard normal distribution
print(type(df))
print('DataFrame shape:', df.shape)
df

<class 'pandas.core.frame.DataFrame'>
DataFrame shape: (5, 4)


          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
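
To make the idea of a 'DataFrame' as a bundle of 'Series' concrete, the following sketch builds a small frame from two 'Series' that share the same labels. The names 'height', 'weight' and 'people' and all values are made up for illustration.

```python
import pandas as pd

# Two Series with the same labels (illustrative values, not from the text)
height = pd.Series([1.62, 1.80, 1.75], index=['a', 'b', 'c'])
weight = pd.Series([55.0, 82.5, 70.1], index=['a', 'b', 'c'])

# Each dictionary key becomes a column name; the shared labels become the index
people = pd.DataFrame({'height': height, 'weight': weight})
print(people)

# Selecting a single column gives back a Series
print(type(people['height']))
```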


Now that we have created a 'DataFrame' from scratch, let's learn the various methods to manipulate the data. Specifically, we will learn the following things:

   1. Subsetting on a column
   2. Creating a new column
   3. Dropping an existing column
   4. Dropping a particular row
   5. Subsetting on a particular row
   6. Conditional subsetting in general
   7. Renaming an existing column
   8. Manipulating indices

In [13]:
df[['Z','W']] # subsetting on certain columns

          Z         W
A  0.503826  2.706850
B  0.605965  0.651118
C -0.589001 -2.018168
D  0.955057  0.188695
E  0.683509  0.190794


In [14]:
df['new'] = df['W'] + df['Y'] # creating a new column
df

          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762


In [15]:
df.drop('X',axis=1) # dropping a column (axis=1 means column and axis=0 means row)

          W         Y         Z       new
A  2.706850  0.907969  0.503826  3.614819
B  0.651118 -0.848077  0.605965 -0.196959
C -2.018168  0.528813 -0.589001 -1.489355
D  0.188695 -0.933237  0.955057 -0.744542
E  0.190794  2.605967  0.683509  2.796762


In [16]:
df.drop('E',axis=0) # dropping a row

          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542


There is a caveat here before we proceed. In 'pandas', the drop() method has an 'inplace' argument. When this argument is not specified, the default is inplace=False, meaning that drop() returns a new 'DataFrame' and leaves the existing one unchanged. However, if inplace=True, the existing object itself is modified and the dropped column(s) are permanently gone from it. Compare the two blocks of code below and you will see the power of 'inplace':

In [17]:
df.drop('new', axis=1)
print(df) # notice that the column 'new' is still there even after you dropped the variable

          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762


In [18]:
df.drop('new', axis=1, inplace=True)
print(df) # permanently dropping the variable

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


If you would like to permanently remove a column from a 'DataFrame' object, you can also use the following command: del df['column_name']. This has the same effect as df.drop('column_name', axis=1, inplace=True).
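
As a side-by-side comparison, the sketch below shows three ways of ending up with a frame that lacks a column. The frame 'base' and its values are made up. The 'columns=' keyword is a drop() argument that does not appear in the text above, shown here as an alternative to axis=1.

```python
import pandas as pd

# Small illustrative frame (values are made up)
base = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [5, 6]})

# Option 1: keep the result of drop(); base itself is untouched
reassigned = base.drop('new', axis=1)

# Option 2: the columns= keyword, which avoids having to remember the axis number
by_keyword = base.drop(columns='new')

# Option 3: del on a copy, which modifies that copy directly
deleted = base.copy()
del deleted['new']

print(list(base.columns))        # still contains 'new'
print(list(reassigned.columns))  # 'new' is gone
```

Keeping the returned frame under a name (option 1 or 2) is often easier to follow than inplace=True, because the original object stays available.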

To select certain rows, we can use the loc and iloc indexers. loc requires the labels of the rows and iloc requires the integer positions of the rows. We generally prefer loc over iloc since that coding style may be more readable. One thing that may look bizarre is that both loc and iloc are used with brackets '[]' rather than the usual parentheses '()'. This is because they are indexing attributes rather than ordinary methods, so they follow the same bracket syntax as list or dictionary look-ups. For simplicity, you can think of loc as playing a role similar to the 'where' statement in SAS or SQL. For now, the following two examples give us equivalent results of row selection:

In [19]:
df.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

In [20]:
df.iloc[0] # the starting position is zero

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

These two indexers can also be used to extract certain rows and columns simultaneously for subsetting. Always remember that positions in iloc start from 0:

In [21]:
print(df.loc['B','Y'], '\n') # extracting a single value from the DataFrame
print(df.loc[['A','B'],['Z','Y']], '\n') # extracting column Z and Y and row A and B
print(df.iloc[[0,1],[0,1]]) # extracting the first 2 rows and first 2 columns

-0.8480769834036315 

          Z         Y
A  0.503826  0.907969
B  0.605965 -0.848077 

          W         X
A  2.706850  0.628133
B  0.651118 -0.319318


An important feature of 'pandas' is conditional selection using bracket notation, very similar to the syntax of 'NumPy' and R:

In [22]:
print(df, '\n') # original DataFrame
print(df>0, '\n') # truth table where elements are greater than 0
print(df[df>0], '\n') # getting a dataset where all elements are greater than 0 (notice there are NaN values)
print(df[df['W']>0]['Y'], '\n') # getting the Y column where the values of the W columns are greater than 0
print(df[(df['W']>0) & (df['Y'] > 1)], '\n')

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509 

       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True 

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118       NaN       NaN  0.605965
C       NaN  0.740122  0.528813       NaN
D  0.188695       NaN       NaN  0.955057
E  0.190794  1.978757  2.605967  0.683509 

A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64 

          W         X         Y         Z
E  0.190794  1.978757  2.605967  0.683509 



Notice that in the last line of code above, we have to use '&' to denote the logical 'AND'. The reason is that Python's 'and' operator can only handle a single boolean value on each side. Since each comparison here returns a whole 'Series' of booleans, 'and' cannot decide on one truth value and raises a ValueError. The same applies to the 'or' operator: to avoid the error, we should use the pipe symbol '|' to denote the logical 'OR'. The example below shows the error raised by 'and':

In [23]:
try:
    print(df[(df['W']>0) and (df['Y'] > 1)], '\n')
except ValueError:
    print("ValueError: The truth value of a 'Series' is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()")

ValueError: The truth value of a 'Series' is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
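
For completeness, here is a minimal sketch of the element-wise operators that do work on boolean 'Series': '|' for OR and '~' for NOT, alongside a variant that passes the row condition and the column label to loc in one step. The frame 'demo' and its values are made up. Each comparison needs its own parentheses because '&', '|' and '~' bind more tightly than '>'.

```python
import pandas as pd

# Illustrative frame (values are made up)
demo = pd.DataFrame({'W': [2.7, -2.0, 0.2], 'Y': [0.9, 0.5, 2.6]},
                    index=['A', 'C', 'E'])

either = demo[(demo['W'] > 0) | (demo['Y'] > 1)]  # element-wise OR
negated = demo[~(demo['W'] > 0)]                  # element-wise NOT

# Row condition and column label in a single loc call
y_where_w_positive = demo.loc[demo['W'] > 0, 'Y']

print(either)
print(negated)
print(y_where_w_positive)
```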


To rename a column in 'pandas', we can use the rename() method:

In [24]:
df.rename(columns={'W': 'newName1', 'X': 'newName2'}, inplace=True)
print(df)
df.rename(columns={'newName1':'W', 'newName2':'X'}, inplace=True)
print(df)

   newName1  newName2         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


Now let's discuss more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy.

Two methods are mainly involved here: reset_index() and set_index(). The reset_index() method of a 'DataFrame' object resets the index to the default numerical values (0, 1, 2, ...). While doing this, the method also moves the old index into an additional column, which is called 'index' when the old index has no name. The second method, set_index(), designates a new index using an existing column in the 'DataFrame' object. While doing this, the method by default removes the column that is being converted into the index.

In [25]:
df.reset_index(inplace=True)
print('After resetting the index for the data frame: \n', df, '\n')
df.drop('index', axis=1, inplace=True)
print('Now let us return the df to its original state:\n', df)

After resetting the index for the data frame: 
   index         W         X         Y         Z
0     A  2.706850  0.628133  0.907969  0.503826
1     B  0.651118 -0.319318 -0.848077  0.605965
2     C -2.018168  0.740122  0.528813 -0.589001
3     D  0.188695 -0.758872 -0.933237  0.955057
4     E  0.190794  1.978757  2.605967  0.683509 

Now let us return the df to its original state:
           W         X         Y         Z
0  2.706850  0.628133  0.907969  0.503826
1  0.651118 -0.319318 -0.848077  0.605965
2 -2.018168  0.740122  0.528813 -0.589001
3  0.188695 -0.758872 -0.933237  0.955057
4  0.190794  1.978757  2.605967  0.683509


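Above is an example for reset_index(). Notice that in the example, resetting the index created a new column called 'index' while the actual index of the data frame became numeric. Also note that dropping the 'index' column afterwards discards the original labels A to E, so the data frame keeps its numeric index rather than fully returning to its original state.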

Below is an example for set_index(): we first create a list called 'newind', and then we create a new column called 'States' in the original data, and finally we set the index using the information from 'States':

In [26]:
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
print('Creating a new column for the data frame: \n', df)
df.set_index('States', inplace=True)
print('After setting the index: \n', df)

Creating a new column for the data frame: 
           W         X         Y         Z States
0  2.706850  0.628133  0.907969  0.503826     CA
1  0.651118 -0.319318 -0.848077  0.605965     NY
2 -2.018168  0.740122  0.528813 -0.589001     WY
3  0.188695 -0.758872 -0.933237  0.955057     OR
4  0.190794  1.978757  2.605967  0.683509     CO
After setting the index: 
                W         X         Y         Z
States                                        
CA      2.706850  0.628133  0.907969  0.503826
NY      0.651118 -0.319318 -0.848077  0.605965
WY     -2.018168  0.740122  0.528813 -0.589001
OR      0.188695 -0.758872 -0.933237  0.955057
CO      0.190794  1.978757  2.605967  0.683509
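
Both methods have a 'drop' argument that changes the behaviour shown above. The sketch below uses a made-up frame 'demo' to illustrate it: reset_index(drop=True) discards the old labels instead of storing them in a new column, and set_index(..., drop=False) keeps the column in the data after using it as the index.

```python
import pandas as pd

# Illustrative frame (values are made up)
demo = pd.DataFrame({'W': [1.0, 2.0], 'States': ['CA', 'NY']}, index=['A', 'B'])

# drop=True discards the old labels instead of adding an 'index' column
renumbered = demo.reset_index(drop=True)

# drop=False keeps 'States' as a regular column as well as using it as the index
kept = demo.set_index('States', drop=False)

print(renumbered)
print(kept)
```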


Now, let us go over how to work with 'Multi-Index'. To begin with, we'll create a quick example of what a multi-indexed 'DataFrame' object would look like. The first step of this is to create a multi-level index:

In [27]:
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
print(hier_index)
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index

[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]


MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

Now using the index created above, we can create a multi-level indexed 'DataFrame' (there is more than one level of index, but the columns are still unique). Below, the example has two levels of indices: an outer level with G1 and G2, and within each of them an inner level with 1, 2, 3:

In [28]:
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df

             A         B
G1 1  0.302665  1.693723
   2 -1.706086 -1.159119
   3 -0.134841  0.390528
G2 1  0.166905  0.184502
   2  0.807706  0.072960
   3  0.638787  0.329646


Notice that there are two index levels in the example above, yet neither of them has a name. To assign names to the index levels, we can do the following:

In [29]:
df.index.names = ['Group','Num']
print(df)

                  A         B
Group Num                    
G1    1    0.302665  1.693723
      2   -1.706086 -1.159119
      3   -0.134841  0.390528
G2    1    0.166905  0.184502
      2    0.807706  0.072960
      3    0.638787  0.329646
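
When the index consists of every combination of the outer and inner labels, as it does here, MultiIndex.from_product() is a possible shortcut that builds the same kind of index and names its levels in one step. This is only an alternative sketch. The frame 'demo' is filled with zeros as a placeholder instead of the random numbers used above.

```python
import numpy as np
import pandas as pd

# Every combination of the two levels, in order; names are set at creation
idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                 names=['Group', 'Num'])

# Placeholder values; the original example uses random numbers instead
demo = pd.DataFrame(np.zeros((6, 2)), index=idx, columns=['A', 'B'])
print(demo)
```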


Now let's learn how to extract elements from a 'DataFrame' with hierarchical indices. The general rule is that extraction goes from the outside level to the inside level.

In [30]:
print(df.loc['G1'], '\n')
print(df.loc['G1'].loc[1], '\n')
print(df.loc['G1'].loc[1]['B'], '\n') 

            A         B
Num                    
1    0.302665  1.693723
2   -1.706086 -1.159119
3   -0.134841  0.390528 

A    0.302665
B    1.693723
Name: 1, dtype: float64 

1.693722925204035 
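
The chained look-ups above can also be written as a single loc call by passing the outer and inner labels together as a tuple. The sketch below uses a made-up frame 'demo' holding the values 0 to 11 so that the results are easy to verify by eye.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                 names=['Group', 'Num'])
# Illustrative values 0..11, two per row
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=['A', 'B'])

row = demo.loc[('G1', 2), :]      # one row, returned as a Series
value = demo.loc[('G2', 1), 'B']  # one scalar: (outer label, inner label), column

print(row)
print(value)
```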



Next, we study the xs() method of a 'DataFrame'. The name stands for 'cross section', and the method facilitates grabbing elements at a specific level of the index. In other words, it returns a cross-section (row(s) or column(s)) from a 'Series' or 'DataFrame' object. The method defaults to taking the cross-section on the rows (axis=0).

In [31]:
print(df)
print(df.xs('G1'))
print(df.xs(1, level='Num'))
print(df.xs(('G2', 3), level=[0, 1]))

                  A         B
Group Num                    
G1    1    0.302665  1.693723
      2   -1.706086 -1.159119
      3   -0.134841  0.390528
G2    1    0.166905  0.184502
      2    0.807706  0.072960
      3    0.638787  0.329646
            A         B
Num                    
1    0.302665  1.693723
2   -1.706086 -1.159119
3   -0.134841  0.390528
              A         B
Group                    
G1     0.302665  1.693723
G2     0.166905  0.184502
                  A         B
Group Num                    
G2    3    0.638787  0.329646
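
By default, xs() removes the level that was used for the selection from the result. If that level should stay visible, the 'drop_level' argument can be set to False. The following sketch contrasts the two on a made-up frame 'demo'.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                 names=['Group', 'Num'])
# Illustrative values 0..11, two per row
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=['A', 'B'])

without_level = demo.xs(1, level='Num')                # 'Num' is removed
with_level = demo.xs(1, level='Num', drop_level=False) # 'Num' is kept

print(without_level)
print(with_level)
```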


Lastly, let's show a few convenient methods to deal with missing data in 'pandas'. 

To start with, the dropna() method drops rows or columns that contain null values. The method also has optional arguments that facilitate data manipulation:

In [32]:
df = pd.DataFrame({'A':[1.23,2.49,np.nan,0],'B':[5.08,np.nan,np.nan, np.nan],'C':[1,2,3,4], 'D':[7.12,1.66,np.nan,np.nan]})
print(df)

      A     B  C     D
0  1.23  5.08  1  7.12
1  2.49   NaN  2  1.66
2   NaN   NaN  3   NaN
3  0.00   NaN  4   NaN


In [33]:
df.dropna(axis=0) # dropping rows that have null values

      A     B  C     D
0  1.23  5.08  1  7.12


In [34]:
df.dropna(axis=1) # dropping columns that have null values

   C
0  1
1  2
2  3
3  4


The dropna() method can take additional arguments. The 'thresh' argument keeps only the rows with at least that many non-NA values. The 'how' argument controls whether a row is dropped when any of its elements is null (how='any', the default) or only when all of its elements are null (how='all'). If there is no row to drop, the 'DataFrame' will stay the same:

In [35]:
df.dropna(thresh=2) # keeping only the rows with at least 2 non-NA values

      A     B  C     D
0  1.23  5.08  1  7.12
1  2.49   NaN  2  1.66
3  0.00   NaN  4   NaN


In [36]:
df.dropna(axis=0, how='all') # dropping the rows where all of the elements are null

      A     B  C     D
0  1.23  5.08  1  7.12
1  2.49   NaN  2  1.66
2   NaN   NaN  3   NaN
3  0.00   NaN  4   NaN
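
Another dropna() argument that may be useful is 'subset', which restricts the null check to the listed columns. The sketch below applies it to the same data as above, stored under the illustrative name 'frame'. Choosing columns A and D is an arbitrary assumption for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.23, 2.49, np.nan, 0],
                      'B': [5.08, np.nan, np.nan, np.nan],
                      'C': [1, 2, 3, 4],
                      'D': [7.12, 1.66, np.nan, np.nan]})

# Only columns A and D are inspected when deciding which rows to drop
cleaned = frame.dropna(subset=['A', 'D'])
print(cleaned)
```

Row 1 survives even though its B value is null, because B is not part of the subset.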


The fillna() method can help us impute missing data. For example:

In [37]:
df.fillna('Missing') # filling in the null value

         A        B  C        D
0     1.23     5.08  1     7.12
1     2.49  Missing  2     1.66
2  Missing  Missing  3  Missing
3        0  Missing  4  Missing


In [38]:
df['A'].fillna(value=df['A'].mean()) # using the mean value to populate (the mean is computed over the non-missing values only)

0    1.23
1    2.49
2    1.24
3    0.00
Name: A, dtype: float64
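
fillna() also accepts a dictionary that maps column names to fill values, so several columns can be imputed in one call, each with its own rule. The sketch below is one possible combination (mean for A, zero for B, median for D). These choices are assumptions made for illustration, not recommendations.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.23, 2.49, np.nan, 0],
                      'B': [5.08, np.nan, np.nan, np.nan],
                      'C': [1, 2, 3, 4],
                      'D': [7.12, 1.66, np.nan, np.nan]})

# One fill value per column; columns not listed are left as they are
filled = frame.fillna({'A': frame['A'].mean(),
                       'B': 0,
                       'D': frame['D'].median()})
print(filled)
print(filled.dtypes)
```

Because every fill value is numeric, the filled columns stay float64.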

One thing we need to be cautious about is that when we impute missing values using the fillna() method, 'pandas' may automatically change the data type of a column, as we discussed before. In the example below, we see that after the imputation, the relevant columns change from float64 to object. But the data type of the individual elements of each column is unchanged:

In [39]:
df_orig = pd.DataFrame({'A':[1.23,2.49,np.nan,0],'B':[5.08,np.nan,np.nan, np.nan],'C':[1,2,3,4], 'D':[7.12,1.66,np.nan,np.nan]})
print(df_orig, '\n')
print(df_orig.info(), '\n')
df2=df_orig.copy() # copy, so that the inplace fill below does not also change df_orig
df2.fillna('Missing', inplace=True) # filling in the null value
print(df2)
print(df2.info())

      A     B  C     D
0  1.23  5.08  1  7.12
1  2.49   NaN  2  1.66
2   NaN   NaN  3   NaN
3  0.00   NaN  4   NaN 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
A    3 non-null float64
B    1 non-null float64
C    4 non-null int64
D    2 non-null float64
dtypes: float64(3), int64(1)
memory usage: 208.0 bytes
None 

         A        B  C        D
0     1.23     5.08  1     7.12
1     2.49  Missing  2     1.66
2  Missing  Missing  3  Missing
3        0  Missing  4  Missing
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
A    4 non-null object
B    4 non-null object
C    4 non-null int64
D    4 non-null object
dtypes: int64(1), object(3)
memory usage: 208.0+ bytes
None


In [40]:
extract_series=df2['A']
for j in range(3):
    print(extract_series[j], '---', type(extract_series[j]))
extract_series

1.23 --- <class 'float'>
2.49 --- <class 'float'>
Missing --- <class 'str'>


0       1.23
1       2.49
2    Missing
3          0
Name: A, dtype: object
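
If a column has already been turned into the 'object' dtype in this way, one possible way back to a numeric column is pd.to_numeric() with errors='coerce', which converts anything that cannot be parsed as a number into NaN. The sketch below rebuilds the column from the output above under the illustrative name 'mixed'.

```python
import pandas as pd

# Same content as column A after filling with the string 'Missing'
mixed = pd.Series([1.23, 2.49, 'Missing', 0], dtype=object)

# Non-numeric entries become NaN and the result has a numeric dtype again
numeric = pd.to_numeric(mixed, errors='coerce')

print(numeric.dtype)
print(numeric.isna().sum())
```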