# Different ways to create, subset, and combine dataframes using pandas

__What is a package?__ <br>
Many real-world applications require a lot of code to solve a relatively common problem. Machine learning is one example: it is used by people all over the world, yet the approach to most problems is fairly standard. To save time for everyone who would otherwise write the same code again, such code is written once and published online, and most of it is open source. A collection of code like this is called a package.

__How to install and call packages?__ <br>
Pandas is one such package, and it is among the most widely used in the world. A package has to be downloaded before it can be used, which is easily done by running the pip command in a terminal. Once downloaded, the code sits on your computer but cannot be used as is. It first has to be "imported". In simple terms, the import statement tells the computer: "Hey computer, I will be using the downloaded code with this <i>name</i> in this file". The computer then knows to look in the downloaded files for the functionality that the package offers.

<i>Format to install packages using pip command: pip install package-name<br>
Calling packages: import package-name as alias</i>

__What is pandas?__ <br>
Pandas is a collection of functions and custom classes, the most important of which are the dataframe and the series. It is one of the most widely used packages, and many data scientists around the world rely on it for their analysis. It is also the first package that most data science students learn. Let us look in detail at what can be done with it.

__Note:__ We will not look at every functionality offered by pandas. Instead, we will look at a few useful functions that people often need in their day-to-day work.

## Importing required packages 

Importing the packages needed for this notebook with an alias. We use an alias to avoid typing the full package name every time we use it.

In [1]:
import pandas as pd
import numpy as np

Checking pandas version installed in the system

In [2]:
pd.__version__

'1.0.3'

## DataFrame vs Pandas Series 

The most used data types in pandas are the dataframe and the series. A series can be thought of as a single column of a dataframe. A series is one-dimensional, whereas a dataframe is two-dimensional.

In [3]:
pd.Series([["hi","bye"],[1,2]])

0    [hi, bye]
1       [1, 2]
dtype: object

In [4]:
pd.DataFrame([["hi","bye"],[1,2]])

Unnamed: 0,0,1
0,hi,bye
1,1,2
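
As a quick check of the one-dimensional versus two-dimensional distinction, the sketch below inspects `ndim` and `shape`. The variable names `ser`, `frame`, and `col` are illustrative and not part of the notebook.

```python
import pandas as pd

ser = pd.Series(["hi", "bye"])                 # one-dimensional: values plus an index
frame = pd.DataFrame([["hi", "bye"], [1, 2]])  # two-dimensional: rows and columns

print(ser.ndim, ser.shape)      # number of dimensions and shape of the series
print(frame.ndim, frame.shape)  # number of dimensions and shape of the dataframe

# Selecting a single column of a dataframe gives back a series
col = frame[0]
print(type(col))
```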


## Creating basic dataframes 

Pandas operates on a very important data type called the dataframe. More often than not, people in data science use dataframes to store data and work on it. Let us now look at how to initialize a basic dataframe.

In [5]:
# Let's create a dataframe whose values are some names
# The data should be in a list or a similar format; let us see what happens when we give data in different formats
# Giving input as list
pd.DataFrame(['Tom','Jerry']) 

Unnamed: 0,0
0,Tom
1,Jerry


In [6]:
#Giving input as tuple
pd.DataFrame(('Tom','Jerry')) #List and tuple give same result

Unnamed: 0,0
0,Tom
1,Jerry


In [7]:
#Giving input as set
pd.DataFrame({'Tom','Jerry'}) # A set holds the same values, but sets are unordered, so the row order is not guaranteed

Unnamed: 0,0
0,Jerry
1,Tom


In [8]:
#Giving input as dict
pd.DataFrame({'name':'Tom','name2':'Jerry'}) # This throws a ValueError: every value is a scalar, so pandas cannot tell how many rows to create unless an index is passed

ValueError: If using all scalar values, you must pass an index
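
The error message itself suggests one fix: pass an index so that pandas knows how many rows to build. A minimal sketch, where the index value `0` and the name `single_row` are arbitrary choices:

```python
import pandas as pd

# All values are scalars, so an explicit index tells pandas to create one row
single_row = pd.DataFrame({'name': 'Tom', 'name2': 'Jerry'}, index=[0])
print(single_row)
```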

In [9]:
pd.DataFrame.from_records([{'name':'Tom','name2':'Jerry'}])

Unnamed: 0,name,name2
0,Tom,Jerry


In [10]:
#Giving input as dict
pd.DataFrame({'name':['Tom'],'name2':['Jerry']}) # Now this works, but the layout differs from the list input because each key-value pair is treated as a column

Unnamed: 0,name,name2
0,Tom,Jerry


This raises a question: instead of using a dictionary, can we specify column names while passing a list as input? Yes, we can. Let us have a look at how to do so.

In [11]:
#Let us specify column name by taking in a list input
pd.DataFrame(['Tom','Jerry'],columns=['Name']) # Note that the column names should not be a scalar value either; pass them as a list

Unnamed: 0,Name
0,Tom
1,Jerry


Now let us see how to bring multiple columns into a dataframe without using dictionaries.

In [12]:
# Here data is a list of lists, and every inner list is treated as a separate row
pd.DataFrame([['Tom','Cat'],['Jerry','Mouse']],columns=['Name','Animal'])

Unnamed: 0,Name,Animal
0,Tom,Cat
1,Jerry,Mouse


In [13]:
#This works for tuple as well
pd.DataFrame((('Tom','Cat'),('Jerry','Mouse')),columns=['Name','Animal'])

Unnamed: 0,Name,Animal
0,Tom,Cat
1,Jerry,Mouse


## Selecting or Indexing the data within dataframe or series

Selecting or indexing is how people extract the values they want from a dataframe or series. It is interesting to note that indexing a dataframe with a single label returns a series, and indexing a series with a single label usually returns an individual data point.

__Note:__ Indexing is not to be confused with the index, which is the set of labels that address each row of a dataframe or series.

In [14]:
# Let us declare a series and dataframe each to understand how subsetting works
df=pd.DataFrame(np.random.randn(10,4),columns=['A','B','C','D'],index=[9,8,7,6,5,4,3,2,1,0])
df

Unnamed: 0,A,B,C,D
9,-2.307873,2.141098,0.923685,1.074306
8,-0.235183,-1.646698,-0.54041,0.696396
7,-1.041741,-2.787609,0.560987,1.084208
6,0.246745,1.918956,-0.237109,0.86724
5,-0.390702,-0.510395,0.261368,-0.052081
4,0.782527,0.761266,-0.219027,-0.747414
3,-0.439519,0.704513,0.938185,0.629322
2,0.372363,0.034795,0.671337,0.471281
1,-3.935592,0.043111,-0.193167,1.393288
0,-0.47895,-1.838009,0.36328,-0.13786


In [15]:
s=pd.Series(np.random.randn(10),index=[9,8,7,6,5,4,3,2,1,0])
s

9   -1.592681
8   -0.608740
7    1.440607
6   -0.811629
5   -0.329097
4   -0.227123
3   -0.644074
2    0.922882
1   -0.501884
0    0.048950
dtype: float64

__There are three popular ways of indexing or selecting information__
- loc
- iloc
- [] slicing

### loc 

loc fetches data using the index labels of the dataframe or series. Let us look at dataframes and series separately to better understand how loc is used with each.

#### Dataframe 

In [16]:
# Let us try to understand the use of loc in dataframes

print(df.loc[0]) # Notice that this corresponds to index label 0 and not to the first row of df; also note that the output is a series
print(type(df.loc[0]))

A   -0.478950
B   -1.838009
C    0.363280
D   -0.137860
Name: 0, dtype: float64
<class 'pandas.core.series.Series'>


In [17]:
# It is still possible to access elements within the series; the following code can be used

print(df.loc[0]['A']) # Extracting using the index in series
print(df.loc[0][0]) # Extracting using the position, here 0 corresponds to 1st position, 1 to 2nd position, etc.

-0.4789501265992803
-0.4789501265992803
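
`df.loc[0]['A']` performs two lookups in a row: first a row, then an element of the resulting series. `loc` also accepts a row label and a column label together, and `at` is available for single values. The sketch below uses a small stand-in dataframe named `demo` (an assumption, so that the values are reproducible) rather than the random `df` above.

```python
import pandas as pd

demo = pd.DataFrame({'A': [10, 20, 30], 'B': [1, 2, 3]}, index=[2, 1, 0])

chained = demo.loc[0]['A']   # row lookup, then element lookup
direct = demo.loc[0, 'A']    # row label and column label in a single call
fast = demo.at[0, 'A']       # scalar accessor for exactly one cell

print(chained, direct, fast)
```

All three expressions read the cell at index label 0 and column A; the single-call forms avoid building the intermediate series.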


__Try for yourself__ <br>
Q. Extract the value in column B and the 4th row of dataframe df using loc

In [None]:
# Your answer


Click __here__ to see answer
<!--
Both the answers given below are correct
A1. df.loc[6]['B']
A2. df.loc[6][1]
-->

#### Series

In [18]:
# Let us try to understand the use of loc in series

s.loc[0] # Notice that this corresponds to index label 0 and not to the first row of series s; also note that the output is a scalar, i.e. an individual data point

0.04895003367006049

__Try for yourself__ <br>
Q. Extract the value in the 1st row of series s using loc

In [None]:
# Your answer


Click __here__ to see answer
<!--
There is only one answer that is correct and is given below
A. s.loc[9]
-->

### iloc

iloc fetches data using the position of rows in the dataframe or series. Let us look at dataframes and series separately to better understand how iloc is used with each.

__Note:__ The only difference between loc and iloc is that loc takes index labels, whereas iloc takes integer positions.

#### DataFrame 

In [19]:
# Let us try to understand the use of iloc in dataframes

print(df.iloc[0]) # Notice that this corresponds to the first row of df; also note that the output is a series
print(type(df.iloc[0]))

A   -2.307873
B    2.141098
C    0.923685
D    1.074306
Name: 9, dtype: float64
<class 'pandas.core.series.Series'>


In [20]:
# Just like with loc, we can still access elements within the series; the following code can be used

print(df.iloc[0]['A']) # Extracting using the index in series
print(df.iloc[0][0]) # Extracting using the position, here 0 corresponds to 1st position, 1 to 2nd position, etc.

-2.3078734091828874
-2.3078734091828874


#### Series 

In [21]:
# Similar to dataframe, we can directly specify the location to fetch information from a series
s

9   -1.592681
8   -0.608740
7    1.440607
6   -0.811629
5   -0.329097
4   -0.227123
3   -0.644074
2    0.922882
1   -0.501884
0    0.048950
dtype: float64

In [22]:
# Let us fetch the information from the 3rd row; we pass 2 because numbering in Python starts from 0
s.iloc[2]

1.4406074160733833
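
To see the label-versus-position distinction side by side, the following sketch builds a small series whose index runs in reverse, as the index of `s` does. The name `demo_s` and its values are illustrative.

```python
import pandas as pd

demo_s = pd.Series([100, 200, 300], index=[2, 1, 0])

by_label = demo_s.loc[0]      # label 0 belongs to the last row
by_position = demo_s.iloc[0]  # position 0 is the first row

print(by_label, by_position)
```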

### Slicing and Filtering

Slicing in Python is done using brackets: []. There are multiple ways to slice data according to need. Let us look at how to use slicing effectively.

#### DataFrame 

In [23]:
# Assume that you would like to extract everything but the first row; then you would type the following
df[1:] # Here, 1: specifies rows from the 2nd row till the end

Unnamed: 0,A,B,C,D
8,-0.235183,-1.646698,-0.54041,0.696396
7,-1.041741,-2.787609,0.560987,1.084208
6,0.246745,1.918956,-0.237109,0.86724
5,-0.390702,-0.510395,0.261368,-0.052081
4,0.782527,0.761266,-0.219027,-0.747414
3,-0.439519,0.704513,0.938185,0.629322
2,0.372363,0.034795,0.671337,0.471281
1,-3.935592,0.043111,-0.193167,1.393288
0,-0.47895,-1.838009,0.36328,-0.13786


In [24]:
# Assume that you would like to extract everything but the last row; then you would type the following
df[:-1] # Here, :-1 specifies rows from the 1st row up to, but not including, the last row

Unnamed: 0,A,B,C,D
9,-2.307873,2.141098,0.923685,1.074306
8,-0.235183,-1.646698,-0.54041,0.696396
7,-1.041741,-2.787609,0.560987,1.084208
6,0.246745,1.918956,-0.237109,0.86724
5,-0.390702,-0.510395,0.261368,-0.052081
4,0.782527,0.761266,-0.219027,-0.747414
3,-0.439519,0.704513,0.938185,0.629322
2,0.372363,0.034795,0.671337,0.471281
1,-3.935592,0.043111,-0.193167,1.393288


In [25]:
# We can also take out only the 2nd to 5th rows by typing the following
df[1:5] # Here, 1:5 extracts rows from position 1 (the 2nd row) up to, but not including, position 5 (the 6th row, which has index label 4)

Unnamed: 0,A,B,C,D
8,-0.235183,-1.646698,-0.54041,0.696396
7,-1.041741,-2.787609,0.560987,1.084208
6,0.246745,1.918956,-0.237109,0.86724
5,-0.390702,-0.510395,0.261368,-0.052081
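
Slicing by position, as `df[1:5]` does, leaves out the end position. Slicing with `loc` works on index labels and includes the end label. A sketch with a stand-in dataframe `demo` (hypothetical, with a reversed index like `df`), using `iloc` to make the positional slice explicit:

```python
import pandas as pd

demo = pd.DataFrame({'A': range(10)}, index=[9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

by_position = demo.iloc[1:5]  # positions 1, 2, 3, 4 (position 5 is excluded)
by_label = demo.loc[8:5]      # labels 8, 7, 6, 5 (label 5 is included)

print(by_position.index.tolist())
print(by_label.index.tolist())
```

Both slices select the same four rows here, which makes the different treatment of the end point easy to see.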


In [26]:
# We can also extract information of columns using slicing, let us have a look at how to do it

df["A"] # This extracts information in series format

9   -2.307873
8   -0.235183
7   -1.041741
6    0.246745
5   -0.390702
4    0.782527
3   -0.439519
2    0.372363
1   -3.935592
0   -0.478950
Name: A, dtype: float64

In [27]:
# What if we need the information in dataframe form? We can use the following methods

# method 1
df[['A']]

Unnamed: 0,A
9,-2.307873
8,-0.235183
7,-1.041741
6,0.246745
5,-0.390702
4,0.782527
3,-0.439519
2,0.372363
1,-3.935592
0,-0.47895


In [28]:
# method 2
pd.DataFrame(df['A'])

Unnamed: 0,A
9,-2.307873
8,-0.235183
7,-1.041741
6,0.246745
5,-0.390702
4,0.782527
3,-0.439519
2,0.372363
1,-3.935592
0,-0.47895


However, using method 1 is suggested as it is pretty straightforward.

In [29]:
# Using method 1 we can also slice dataframe for multiple columns
df[['A','B']] #Slicing data for columns A and B, here output is a dataframe

Unnamed: 0,A,B
9,-2.307873,2.141098
8,-0.235183,-1.646698
7,-1.041741,-2.787609
6,0.246745,1.918956
5,-0.390702,-0.510395
4,0.782527,0.761266
3,-0.439519,0.704513
2,0.372363,0.034795
1,-3.935592,0.043111
0,-0.47895,-1.838009


In [30]:
# We can also filter rows with a condition on a column. Let's say we want all columns, but only rows where column A is greater than 0
df['A']>0 # This returns a boolean series with one value per row; passed inside [], it keeps the rows of df where column A is greater than 0

9    False
8    False
7    False
6     True
5    False
4     True
3    False
2     True
1    False
0    False
Name: A, dtype: bool

In [31]:
df[df['A']>0] # Filtering df for rows where column A is greater than 0

Unnamed: 0,A,B,C,D
6,0.246745,1.918956,-0.237109,0.86724
4,0.782527,0.761266,-0.219027,-0.747414
2,0.372363,0.034795,0.671337,0.471281


In [32]:
# We can filter on multiple conditions; let's say we want rows where columns A and B are both greater than 0
# Each condition is wrapped in () and we choose how they combine: & for AND, | for OR
df[(df['A']>0) & (df['B']>0)] # We use AND as we want rows with A "AND" B greater than 0

Unnamed: 0,A,B,C,D
6,0.246745,1.918956,-0.237109,0.86724
4,0.782527,0.761266,-0.219027,-0.747414
2,0.372363,0.034795,0.671337,0.471281


Notice that the outputs do not have a continuous index, which can cause problems later. It is therefore advisable to use reset_index to get a continuous index in the filtered dataframe.

In [34]:
# Using reset_index
# The default for drop is False, which keeps the old index as a column. More often than not we do not need the old index, so we pass drop=True to remove it completely
df[(df['A']>0) & (df['B']>0)].reset_index(drop=True) 

Unnamed: 0,A,B,C,D
0,0.246745,1.918956,-0.237109,0.86724
1,0.782527,0.761266,-0.219027,-0.747414
2,0.372363,0.034795,0.671337,0.471281
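
Conditions can also be combined with `|` (OR) and inverted with `~` (NOT). A minimal sketch on a small hand-written dataframe `demo` (illustrative values, not the random `df`):

```python
import pandas as pd

demo = pd.DataFrame({'A': [1, -2, 3, -4], 'B': [-1, -1, 5, 6]})

either = demo[(demo['A'] > 0) | (demo['B'] > 0)]  # A positive OR B positive
not_a = demo[~(demo['A'] > 0)]                    # rows where A is NOT positive

print(either.index.tolist())
print(not_a.reset_index(drop=True))
```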


__Try for yourself__ <br>
Q. Filter the dataframe for rows that hold negative values in column C or column A

In [None]:
# Your answer


Click __here__ to see answer
<!--
The answer is given below
A. df[(df['C']<0) | (df['A']<0)]
-->

#### Series 

In [35]:
# Printing series s to see what it looks like
s

9   -1.592681
8   -0.608740
7    1.440607
6   -0.811629
5   -0.329097
4   -0.227123
3   -0.644074
2    0.922882
1   -0.501884
0    0.048950
dtype: float64

In [36]:
# Slicing a series with [] mixes the behaviour of loc and iloc. When extracting with a single value we get the following
s[0]

0.04895003367006049

As we can see, a series looks up the value by index label when a single value is passed.

In [37]:
# When slicing a range of values, [] acts like iloc: it selects by row position instead of by index label, unlike loc and single-value lookups
s[0:3]

9   -1.592681
8   -0.608740
7    1.440607
dtype: float64
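
Because `[]` on a series switches between labels and positions depending on what is passed, being explicit with `loc` and `iloc` avoids surprises. A sketch with an illustrative series `demo_s`:

```python
import pandas as pd

demo_s = pd.Series([10, 20, 30, 40], index=[3, 2, 1, 0])

first_three = demo_s.iloc[0:3]  # positions 0, 1, 2 (end position excluded)
label_range = demo_s.loc[3:1]   # labels 3, 2, 1 (end label included)

print(first_three.tolist())
print(label_range.tolist())
```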

## Combining two dataframes 

Any high-level analysis is likely to involve multiple data sources, with the information often extracted into different files that we later have to combine into a single view. Let us have a look at some methods for combining two or more dataframes.

### Concat 

According to Pandas documentation <i>"The concat() function (in the main pandas namespace) does all of the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes. Note that I say “if any” because there is only a single possible axis of concatenation for Series."</i>

If you have difficulty understanding the above statement, it basically means that concat can handle most of the combining operations available in pandas. However, some easier-to-use functions are also available, and we will learn about them in the next steps.

In [38]:
# Declaring new dataframe df1
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [39]:
# Declaring new dataframe df2
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

df2

Unnamed: 0,A,B,C,D
0,A4,B4,C4,D4
1,A5,B5,C5,D5
2,A6,B6,C6,D6
3,A7,B7,C7,D7


In [40]:
# Declaring new dataframe df3
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']})

df3

Unnamed: 0,A,B,C,D
0,A8,B8,C8,D8
1,A9,B9,C9,D9
2,A10,B10,C10,D10
3,A11,B11,C11,D11


In [41]:
# Using a simple concat statement to understand how it works
pd.concat([df1,df2]) # Notice how we passed the dataframes in a list. If we don't, it throws an error

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
0,A4,B4,C4,D4
1,A5,B5,C5,D5
2,A6,B6,C6,D6
3,A7,B7,C7,D7


With a simple concat statement, the dataframes are stacked one below another based on column names, as shown above. They still retain their original index values, which we can use to tell where each row came from if we want to.

In [42]:
# We utilise axis argument and specify its value as 1. Please note that the default value of this argument is 0
pd.concat([df1,df2],axis=1) # Here, the information is joined side by side based on the index available

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1
0,A0,B0,C0,D0,A4,B4,C4,D4
1,A1,B1,C1,D1,A5,B5,C5,D5
2,A2,B2,C2,D2,A6,B6,C6,D6
3,A3,B3,C3,D3,A7,B7,C7,D7


There is one important argument in concat called join, which specifies how the dataframes are combined. If you are new to the concept, there are four types of joins. Note that concat itself accepts only 'inner' and 'outer'; left and right joins are available through the join method covered later.

One is <i>inner join</i>, which keeps only the information common to the two dataframes being joined. In concat, the common information that is looked for is either the columns or the index.

The second type is <i>left join</i>, where we join data1 with data2 and keep every row of data1. Rows are matched on the index, and matching information from data2 is added where available.

The third type is <i>right join</i>, where we join data1 with data2 and keep every row of data2. Rows are matched on the index, and matching information from data1 is added where available.

The fourth type is <i>outer join</i>, which is also the default. It keeps every row from both dataframes and marks values as missing where there is no match.

Let us look at examples below to understand them clearly.

In [43]:
# Defining df1 and df2 to be used in example
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},index=[0, 5, 6, 7])

# Using the join argument, we specify 'inner'; this results in only one row, as there is only one common index value between the two dataframes
pd.concat([df1,df2],axis=1,join='inner') # The join used here is similar to a SQL join. The default value of the join argument is outer

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1
0,A0,B0,C0,D0,A4,B4,C4,D4
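
For contrast, the default `join='outer'` keeps every index value found in either dataframe and fills the gaps with NaN. A reduced sketch of the example above, with one column per dataframe to keep the output short (`left` and `right` are illustrative names):

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
right = pd.DataFrame({'B': ['B0', 'B5']}, index=[0, 5])

outer = pd.concat([left, right], axis=1)                # join='outer' is the default
inner = pd.concat([left, right], axis=1, join='inner')  # only index values present in both

print(outer)
print(inner)
```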


In [44]:
# Defining df1 and df2 to be used in example
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'E': ['D4', 'D5', 'D6', 'D7']},index=[0, 1, 2, 3])

# When using join as inner for axis=0, the output shows only information of common columns from two dataframes
pd.concat([df1,df2],axis=0,join='inner')

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2
3,A3,B3,C3
0,A4,B4,C4
1,A5,B5,C5
2,A6,B6,C6
3,A7,B7,C7


In [45]:
# We can use the ignore_index argument with value True as shown. This resets the index to a new sequential one.
pd.concat([df1,df2],axis=0,join='inner',ignore_index=True)

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2
3,A3,B3,C3
4,A4,B4,C4
5,A5,B5,C5
6,A6,B6,C6
7,A7,B7,C7


In [46]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

# We can specify argument keys to add another index or column (specified in axis argument) which indicates the origin of data in combined view
pd.concat([df1,df2],axis=0,join='inner',keys=["DF1","DF2"]) # Here, keys are shown in index as we join information vertically as axis=0

Unnamed: 0,Unnamed: 1,A,B,C,D
DF1,0,A0,B0,C0,D0
DF1,1,A1,B1,C1,D1
DF1,2,A2,B2,C2,D2
DF1,3,A3,B3,C3,D3
DF2,0,A4,B4,C4,D4
DF2,1,A5,B5,C5,D5
DF2,2,A6,B6,C6,D6
DF2,3,A7,B7,C7,D7


In [48]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

pd.concat([df1,df2],axis=1,join='inner',keys=["DF1","DF2"]) # Here, keys are shown in column as we join information horizontally as axis=1

Unnamed: 0_level_0,DF1,DF1,DF1,DF1,DF2,DF2,DF2,DF2
Unnamed: 0_level_1,A,B,C,D,A,B,C,D
0,A0,B0,C0,D0,A4,B4,C4,D4
1,A1,B1,C1,D1,A5,B5,C5,D5
2,A2,B2,C2,D2,A6,B6,C6,D6
3,A3,B3,C3,D3,A7,B7,C7,D7


In [49]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

# Be mindful that if we use ignore_index along with keys, the keys are discarded and a new sequential index is created
pd.concat([df1,df2],axis=0,join='inner',ignore_index=True,keys=["DF1","DF2"])

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [50]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

# For the additional index or column level added by the keys argument, we can specify custom names to differentiate the original index from the newly added one.
# This helps users understand what each index/column level means just by looking at its name
pd.concat([df1,df2],axis=0,join='inner',keys=["DF1","DF2"],names=["Dataframe_info","Original_Index"])

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
Dataframe_info,Original_Index,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
DF1,0,A0,B0,C0,D0
DF1,1,A1,B1,C1,D1
DF1,2,A2,B2,C2,D2
DF1,3,A3,B3,C3,D3
DF2,0,A4,B4,C4,D4
DF2,1,A5,B5,C5,D5
DF2,2,A6,B6,C6,D6
DF2,3,A7,B7,C7,D7
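
One practical use of `keys` is that the rows of each source dataframe can be pulled back out of the combined result by label. A sketch with two small illustrative dataframes (`first` and `second` are assumed names):

```python
import pandas as pd

first = pd.DataFrame({'A': ['A0', 'A1']})
second = pd.DataFrame({'A': ['A4', 'A5']})

combined = pd.concat([first, second], keys=['DF1', 'DF2'],
                     names=['Dataframe_info', 'Original_Index'])

from_second = combined.loc['DF2']          # all rows that came from `second`
one_value = combined.loc[('DF1', 1), 'A']  # outer key, original index, then column

print(from_second)
print(one_value)
```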


### Append 

According to pandas documentation "A useful shortcut to concat() are the append() instance methods on Series and DataFrame. These methods actually predated concat. They concatenate along axis=0, namely the index". Read more about it <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html">here</a>.

If you are having a tough time understanding the above statement, it basically means that append is a shortcut for concat that can only stack dataframes or series one below another (vertical stacking).

In [51]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

df1.append(df2) # Using append we can do vertical stacking as stated above. This is done based on column names

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
0,A4,B4,C4,D4
1,A5,B5,C5,D5
2,A6,B6,C6,D6
3,A7,B7,C7,D7


In [52]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']})

df1.append(df2,df3) # Multiple dataframes cannot be passed as separate arguments: the second positional argument is ignore_index, so df3 is evaluated as a boolean and the following error is thrown

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [53]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']})

df1.append([df2,df3]) # As with concat, when there is more than one dataframe to append, they have to be passed in a list

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
0,A4,B4,C4,D4
1,A5,B5,C5,D5
2,A6,B6,C6,D6
3,A7,B7,C7,D7
0,A8,B8,C8,D8
1,A9,B9,C9,D9
2,A10,B10,C10,D10
3,A11,B11,C11,D11


In [54]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']})

# The ignore_index argument is available in append as well and does the same thing: it removes the old index and creates a new sequential one
df1.append([df2,df3],ignore_index=True)

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


In [55]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']})

# When the verify_integrity argument is set to True (default is False), an error is thrown if the joined dataframe contains duplicate index values
df1.append([df2,df3],ignore_index=True,verify_integrity=True) # Here, it shows no error as we have ignore_index as True

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


In [56]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']})

df1.append([df2,df3],verify_integrity=True) # Here, it shows an error because the dataframes contribute duplicate index values and ignore_index is not True

ValueError: Indexes have overlapping values: Int64Index([0, 1, 2, 3], dtype='int64')
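
Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the append cells above only run on older versions such as the 1.0.3 used here. On newer versions, `pd.concat` covers the same cases. A sketch of the equivalents, using small illustrative dataframes `df_a` and `df_b`:

```python
import pandas as pd

df_a = pd.DataFrame({'A': ['A0', 'A1']})
df_b = pd.DataFrame({'A': ['A4', 'A5']})

# Equivalent of df_a.append(df_b, ignore_index=True)
stacked = pd.concat([df_a, df_b], ignore_index=True)
print(stacked)

# Equivalent of df_a.append(df_b, verify_integrity=True): duplicate index values raise
try:
    pd.concat([df_a, df_b], verify_integrity=True)
except ValueError as err:
    print("ValueError:", err)
```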

### Join 

According to pandas documentation <i>"DataFrame.join() is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame"</i>. This means that join combines any two dataframes based on their index by default.

In other words, the join method stacks dataframes horizontally.

In [57]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})

df2 = pd.DataFrame({'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

df1.join(df2) # As stated above simple join statement combines dataframes based on their index value

Unnamed: 0,A,B,C,D
0,A0,B0,C4,D4
1,A1,B1,C5,D5
2,A2,B2,C6,D6
3,A3,B3,C7,D7


In [58]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']},index=[0, 1, 2, 3])

df2 = pd.DataFrame({'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},index=[0, 5, 6, 7])

df1.join(df2) # When the dataframes being joined have different indexes, a left join is applied by default: all rows from the left dataframe are kept, and matching rows from the right dataframe are added

Unnamed: 0,A,B,C,D
0,A0,B0,C4,D4
1,A1,B1,,
2,A2,B2,,
3,A3,B3,,


__Note:__ The empty cells in the output above are NaN values, which indicate missing data. NaN is only a placeholder and does not hold any other meaning.

In [59]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']},index=[0, 1, 2, 3])

df2 = pd.DataFrame({'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},index=[0, 5, 6, 7])

# With how="outer" we ask for an outer join, which keeps all rows from both dataframes
df1.join(df2, how="outer") # how plays the same role as the join argument of concat: it sets the type of join to perform

Unnamed: 0,A,B,C,D
0,A0,B0,C4,D4
1,A1,B1,,
2,A2,B2,,
3,A3,B3,,
5,,,C5,D5
6,,,C6,D6
7,,,C7,D7


In [60]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']},index=[0, 1, 2, 3])

df2 = pd.DataFrame({'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},index=[0, 5, 6, 7])

# Specifying "right" in the how argument gives a right join, where all rows of df2 are retained along with the matching information from df1
df1.join(df2,how="right")

Unnamed: 0,A,B,C,D
0,A0,B0,C4,D4
5,,,C5,D5
6,,,C6,D6
7,,,C7,D7


In [61]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']},index=[0, 1, 2, 3])

df2 = pd.DataFrame({'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},index=[0, 5, 6, 7])

# Specifying "inner" in the how argument gives an inner join, where only the index values common to both dataframes are retained
df1.join(df2,how="inner")

Unnamed: 0,A,B,C,D
0,A0,B0,C4,D4


In [62]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})

df2 = pd.DataFrame({'A': ['C4', 'C5', 'C6', 'C7'],
                    'B': ['D4', 'D5', 'D6', 'D7']})

# A plain join cannot be used on dataframes that share column names; it raises an error.
# The result would need two columns with the same name, and pandas has no way of telling them apart unless a suffix is given
df1.join(df2)

ValueError: columns overlap but no suffix specified: Index(['A', 'B'], dtype='object')

In [63]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})

df2 = pd.DataFrame({'A': ['C4', 'C5', 'C6', 'C7'],
                    'B': ['D4', 'D5', 'D6', 'D7']})

# Two dataframes with the same column names can be joined only when lsuffix, rsuffix, or both are given.
# The suffixes are appended to the overlapping column names of the left and right dataframes respectively, so every column in the result has a unique name and the join becomes possible.
df1.join(df2,lsuffix="_df1",rsuffix="_df2")

Unnamed: 0,A_df1,B_df1,A_df2,B_df2
0,A0,B0,C4,D4
1,A1,B1,C5,D5
2,A2,B2,C6,D6
3,A3,B3,C7,D7
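
All the join examples above match rows on the index of both dataframes. join can also match a column of the left dataframe against the index of the right one through its `on` argument. Here is a minimal sketch, assuming two small illustrative dataframes named `records` and `lookup` that are not part of the examples above.

```python
import pandas as pd

# Illustrative data: 'key' is an ordinary column in records but the index in lookup
records = pd.DataFrame({'key': ['K0', 'K1', 'K0', 'K2'],
                        'A': ['A0', 'A1', 'A2', 'A3']})
lookup = pd.DataFrame({'C': ['C0', 'C1']}, index=['K0', 'K1'])

# on='key' compares the values in records['key'] with the index labels of lookup
# The default left join keeps every row of records; 'K2' has no match, so its C value is NaN
joined = records.join(lookup, on='key')
print(joined)
```

This is handy when the right dataframe is a lookup table indexed by the key, because no `set_index` call is needed on the left dataframe first.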


### Merge 

According to pandas documentation <i>"pandas provides a single function, merge(), as the entry point for all standard database join operations between DataFrame or named Series objects"</i>.

In most cases, people prefer merge because it works much like a SQL join statement and can join dataframes on the values of one or more columns. This is often more practical than joining dataframes on their index. Let us look at some examples to understand it better.

In [64]:
# Creating two dataframes, left and right, with a common column key to demonstrate merge
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# on names the column or list of columns to merge on. It is optional: when omitted, merge uses the columns common to both dataframes.
pd.merge(left, right, on='key') # Here, merging happens based on the key column.

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3
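
Strictly speaking, `on` can be left out: pandas then merges on the columns that both dataframes have in common. The sketch below, which assumes shortened two-row versions of `left` and `right`, compares the two calls; the names `implicit` and `explicit` are illustrative.

```python
import pandas as pd

# Shortened stand-ins for the left and right dataframes above
left = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})
right = pd.DataFrame({'key': ['K0', 'K1'], 'C': ['C0', 'C1']})

# Without on, merge falls back to the shared column names; here that is only 'key'
implicit = pd.merge(left, right)
explicit = pd.merge(left, right, on='key')

print(implicit.equals(explicit))
```

Passing `on` explicitly is still a sensible habit, since it protects against accidentally merging on an extra column that happens to share a name.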


In [65]:
# Creating two dataframes, left and right, whose columns key1 and key2 hold the same values to demonstrate merge
left = pd.DataFrame({'key1': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key2': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# When the columns to merge on have different names, use the left_on and right_on arguments to give the column name in the left and right dataframe respectively
pd.merge(left, right, left_on='key1',right_on='key2')

Unnamed: 0,key1,A,B,key2,C,D
0,K0,A0,B0,K0,C0,D0
1,K1,A1,B1,K1,C1,D1
2,K2,A2,B2,K2,C2,D2
3,K3,A3,B3,K3,C3,D3
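
Because the key columns have different names, both of them appear in the result even though they hold identical values. One possible clean-up, sketched here with shortened two-row dataframes and an illustrative variable name `tidy`, is to drop the redundant column after merging.

```python
import pandas as pd

# Shortened stand-ins for the left and right dataframes above
left = pd.DataFrame({'key1': ['K0', 'K1'], 'A': ['A0', 'A1']})
right = pd.DataFrame({'key2': ['K0', 'K1'], 'C': ['C0', 'C1']})

merged = pd.merge(left, right, left_on='key1', right_on='key2')

# key1 and key2 are equal on every matched row, so one of them can be removed
tidy = merged.drop(columns='key2')
print(tidy)
```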


In [66]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# We can also merge two dataframes on several columns at once by passing the column names as a list
pd.merge(left, right, on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2


In [67]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# The how argument specifies the type of join we want
pd.merge(left, right, on=['key1', 'key2'],how='left') # Here, the statement performs a left join

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,


In [68]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

pd.merge(left, right, on=['key1', 'key2'],how='right') # Here, the statement performs a right join

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2
3,K2,K0,,,C3,D3


In [69]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

pd.merge(left, right, on=['key1', 'key2'],how='outer') # Here, the statement performs an outer join

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,
5,K2,K0,,,C3,D3


In [70]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# To see the status of the merge for each row, we can use the indicator argument as given below
pd.merge(left, right, on=['key1', 'key2'],how='outer',indicator=True) # The _merge column shows whether each row got its information from both dataframes, only the left dataframe, or only the right dataframe

Unnamed: 0,key1,key2,A,B,C,D,_merge
0,K0,K0,A0,B0,C0,D0,both
1,K0,K1,A1,B1,,,left_only
2,K1,K0,A2,B2,C1,D1,both
3,K1,K0,A2,B2,C2,D2,both
4,K2,K1,A3,B3,,,left_only
5,K2,K0,,,C3,D3,right_only
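
A common use of the `_merge` column is to keep only the rows of one dataframe that have no match in the other. The following is a minimal sketch, assuming small illustrative dataframes with a single `key` column; the name `left_only` is hypothetical.

```python
import pandas as pd

# Illustrative data: K1 and K2 exist only in left, K3 exists only in right
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K3'], 'C': ['C0', 'C3']})

merged = pd.merge(left, right, on='key', how='outer', indicator=True)

# Keep the rows found only in left, then drop the columns that came from right and the indicator
left_only = merged[merged['_merge'] == 'left_only'].drop(columns=['C', '_merge'])
print(left_only)
```

Filtering on `'right_only'` instead gives the rows that exist only in the right dataframe.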
