# Pandas

While **numpy** deals only with homogeneous data ( every element of an array has the same type, for example all integers or all floats ), **_Pandas_** can handle heterogeneous data. Think of Pandas as a library for manipulating heterogeneous data grids ( pretty much like Excel ).

## Table of Contents

- ### [Introduction](#Introduction)
 - #### [What is Pandas ](#What-is-Pandas)
 - #### [Why learn Pandas](#Why-learn-Pandas)
 - #### [Our approach to Pandas](#Our-approach-to-Pandas)
- ### [Getting Started](#Getting-Started)
 - #### [Installing Pandas](#Installing-Pandas)
- ### [Dataframes](#Dataframes)
 - #### [Create Dataframe](#Create-Dataframe)
   - ##### [From List or Dictionary](#From-List-or-Dictionary)
   - ##### [From an Empty Dataframe](#From-an-Empty-Dataframe)
   - ##### [From Files](#From-Files)
 - #### [Display Dataframe](#Display-Dataframe)
 - #### [Dataframe size](#Dataframe-size)
 - #### [Selecting Data from Dataframe](#Selecting-Data-from-Dataframe)
   - ##### [loc and iloc functions](#loc-and-iloc-functions)
   - ##### [Boolean Mask](#Boolean-Mask)
 - #### [Rows](#Rows)
   - ##### [Add rows to Dataframe](#Add-rows-to-Dataframe)
   - ##### [Delete rows from Dataframe](#Delete-rows-from-Dataframe)
 - #### [Columns](#Columns)
   - ##### [Add columns to Dataframe](#Add-columns-to-Dataframe)
   - ##### [Delete columns from Dataframe](#Delete-columns-from-Dataframe)
 - #### [Grouping](#Grouping)
 - #### [Merge Dataframes](#Merge-Dataframes)
   - ##### [Concatenate Dataframes](#Concatenate-Dataframes)
   - ##### [Merge](#Merge)
     - ##### [Inner Join](#Inner-Join)
     - ##### [Left Join](#Left-Join)
     - ##### [Right Join](#Right-Join)
     - ##### [Outer Join](#Outer-Join)

## Introduction

Most data is heterogeneous and tabular in nature. For example, look at the following data, which shows some stats from the Google Play Store. There is text, integers, floats etc.

<img src="./pics/google_play_store_data.png"/>

_numpy_ is not suited to manipulating this kind of data. For that we need **Pandas**.

### What is Pandas

Pandas is pretty much like a data manipulation tool ( think data munging, wrangling, preparation etc ) on a grid of data ( text, numbers, floats etc ). For example, if you look at the data grid above, and say you want to

_Filter_
- a particular category ( say only ART_AND_DESIGN ) 
- or all rows with Rating > 4.1  
- or all rows where the category is ART_AND_DESIGN and rating > 4.1 


_Collapse or Group by_
- and find how many rows are there in a particular category 
- or find how many rows are there with Rating > 4.1
- or a combination of both

_Read_
- data from different formats ( excel, csv, SQL databases etc )

_Handle_
- missing data ( like NAs, blanks etc )
- erroneous data ( data that does not comply with the data type of the column ) etc

_Manipulate_
- combine data from different sources into one
- or split data into a set of rows or columns or both
- or extract a sub-set of data into another data frame ( say create a new data set only for category ART_AND_DESIGN)
- or slice the dataset based on a variety of parameters
- or insert/delete columns or rows to/from the dataset
 - for example, add a new app category or delete the rating column

Think of **Pandas** as _Excel_ on Steroids. 

### Why learn Pandas

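In the context of Machine Learning and Python, **Pandas** is the de facto standard for in-memory data management ( reading or manipulating data ). Its performance-critical parts are written in C and Cython, so most built-in operations run at close to compiled speed. Since Pandas works in memory, the amount of data it can handle depends on the available RAM; on a machine with enough memory it is not uncommon for Pandas to handle large data sets ( around 5 to 10 GB ) without a hitch.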

### Our approach to Pandas

We will be using Pandas quite extensively in this Machine Learning course. This chapter covers the essential aspects of Pandas and leaves the more complicated options to later chapters, where we will explore them as the need arises. For now, we will keep it simple and stick to small test datasets.

## Getting Started

### Installing Pandas

- pip

<pre>
    > pip install pandas
</pre>

If you are using the Anaconda distribution, pandas is installed by default - you just have to enable it ( if necessary ). If you are just using the _conda_ package manager for Python, install pandas with
- conda

<pre>
    > conda install pandas
</pre>

You can verify if _Pandas_ is already installed on your python installation using the Python console.
<pre>
    >>> help("pandas")
</pre>
If it is installed, you will get a help message on Pandas.
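
Another quick check, sketched here on the assumption that pandas is already installed, is to import the library and print its version. The variable name `version` is only for illustration.

```python
import pandas as pd  # "pd" is the alias used throughout this chapter

# Every pandas release exposes its version string as pd.__version__
version = pd.__version__
print("pandas version:", version)
```

If the import fails with a `ModuleNotFoundError`, pandas is not installed in the Python environment you are running.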

## Dataframes

A **_Data Frame_** is the main data structure in pandas. Think of a data frame as an excel grid. It is quite simply just rows and columns. 

<img src="./pics/dataframe-like-excel.png"/>

You can create, add, delete, filter data very easily in pandas. For starters, let's see how easy it is to create a data set.

### Create Dataframe

#### From List or Dictionary

For example, if you wanted to create a simple grid of data like this,

<img src="./pics/simple_dataframe.png"/>

just create the columns ( names, population_m ) as 2 separate lists, combine them into a dictionary and pass it as an argument to the DataFrame() function.

In [2]:
import pandas as pd

names        = ["India","United States","Canada"]
population_m = [1500,300,36]

d = {"names" :names , "population" : population_m}

df = pd.DataFrame(d)
df

Unnamed: 0,names,population
0,India,1500
1,United States,300
2,Canada,36
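
The dictionary above is column-oriented: each key holds a whole column. When the data arrives row by row instead, a list of dictionaries works too. The following is a minimal sketch using the same three countries; the name `records` is only for illustration.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names.
records = [
    {"names": "India", "population": 1500},
    {"names": "United States", "population": 300},
    {"names": "Canada", "population": 36},
]

df_records = pd.DataFrame(records)
print(df_records)
print(df_records.shape)  # (3, 2)
```

Both forms produce the same grid, so pick whichever matches the shape of the data you already have.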


#### From an Empty Dataframe

You can create an empty data frame and start adding columns one by one. 

In [3]:
df = pd.DataFrame()
df["names"]      = names
df["population"] = population_m
print ( df )

           names  population
0          India        1500
1  United States         300
2         Canada          36


#### From Files

Here is a simple file with just 3 entries. It can directly be read into pandas using the read_csv ( ) function.

<img src="./pics/simple-dataframe.png"/>

In [16]:
df = pd.read_csv("../data/simple_dataframe.csv")
df

Unnamed: 0,names,population_m
0,India,1500
1,United States,300
2,Canada,36
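
The cell above depends on a file being present on disk. To try **read_csv ( )** without that file, the sketch below feeds it an in-memory string instead. The contents of `csv_text` are an assumption, reconstructed from the output shown above rather than copied from the real file.

```python
import io
import pandas as pd

# Stand-in for ../data/simple_dataframe.csv (contents assumed, see lead-in)
csv_text = """names,population_m
India,1500
United States,300
Canada,36
"""

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```

The first line of the text is taken as the header, and the numeric column is parsed as integers without any extra arguments.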


### Display Dataframe

Once you read a dataframe, you would typically want to examine it, usually by looking at just the first few or the last few rows. For that, you use the **head ( )** or **tail ( )** functions. For a change, instead of reading a CSV file, let's read an Excel file. Reading _.xlsx_ files requires the python module **openpyxl** ( **xlrd** version 2.0 and later reads only the older _.xls_ format ). In case this does not work, please install it using 
<pre>
    > pip install openpyxl
</pre>
Say, we have read an excel file like this ( contains a list of all countries and their population ). Since the list is big, we just want to display the first few entries. 

<img src="./pics/countries-population.png"/>

In [4]:
df = pd.read_excel("../data/countries_population.xlsx")
df.head() # Shows the first few rows by default. 

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011
2,3,United States,Americas,322179605,0.007
3,4,Indonesia,Asia,261115456,0.011
4,5,Brazil,Americas,207652865,0.008


In [21]:
df.tail() # Shows the last few rows by default.

Unnamed: 0,rank,country,continent,population,Change
228,229,"Saint Helena, Ascension and Tristan da Cunha",Africa,4035,0.003
229,230,Falkland Islands,Americas,2910,0
230,231,Niue,Oceania,1624,−0.4%
231,232,Tokelau,Oceania,1282,0.014
232,233,Vatican City,Europe,801,−1.1%


In [5]:
df.head(10)  # You can very well ask for a specific number of rows to be displayed.

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011
2,3,United States,Americas,322179605,0.007
3,4,Indonesia,Asia,261115456,0.011
4,5,Brazil,Americas,207652865,0.008
5,6,Pakistan,Asia,193203476,0.02
6,7,Nigeria,Africa,185989640,0.026
7,8,Bangladesh,Asia,162951560,0.011
8,9,Russia,Europe,146864513,−2.0%
9,10,Mexico,Americas,127540423,0.013


But do you know how big the dataframe is that you have read from the excel file ?

### Dataframe size

There are a couple of ways to find out how big the dataframe is. Like we discussed, a dataframe has rows and columns, right ?

<img src="./pics/rows-columns-dataframe.png"/>

The **shape** tuple tells us the number of rows and columns in the dataframe. So, there are 233 rows and 5 columns in the population table. 

In [25]:
df.shape

(233, 5)

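If you wanted to find out the total number of data points in the data frame ( think of all the cells in the Excel grid ), then you can just use the **size** attribute. Unlike **shape**, it is a single integer: the number of rows multiplied by the number of columns.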

In [27]:
df.size

1165
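
To see how **shape**, **size** and the built-in `len()` relate, here is a small sketch on the three-row frame from earlier in the chapter. The name `df_demo` is only for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "names": ["India", "United States", "Canada"],
    "population": [1500, 300, 36],
})

n_rows, n_cols = df_demo.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)           # 3 2
print(df_demo.size)             # 6, i.e. rows * columns
print(len(df_demo))             # 3, len() counts rows only
```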

### Selecting Data from Dataframe

Selecting Data from Dataframes is also called **indexing** - because we use some form of indices. Let's see that with an example. 

In [4]:
import pandas as pd

df = pd.read_excel("../data/countries_population.xlsx")
#------------ select a subset of the data to start with ---------------

df_small = df.head()
df_small

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011
2,3,United States,Americas,322179605,0.007
3,4,Indonesia,Asia,261115456,0.011
4,5,Brazil,Americas,207652865,0.008


Now, what if you want to just select the second column - _country_ ?

<img src="./pics/dataframe_second_column.png"/>

In [30]:
df_small["country"]

0            China
1            India
2    United States
3        Indonesia
4           Brazil
Name: country, dtype: object

Another way to do this is to use **loc** ( note the square brackets ).

In [36]:
df_small.loc[:,"country"]

0            China
1            India
2    United States
3        Indonesia
4           Brazil
Name: country, dtype: object
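
One detail worth knowing when selecting columns: a single label returns a one-dimensional Series, while a list of labels returns a DataFrame. A minimal sketch, with `df_demo` as an illustrative stand-in for the population table:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "rank": [1, 2, 3],
    "country": ["China", "India", "United States"],
    "continent": ["Asia", "Asia", "Americas"],
})

one_column = df_demo["country"]                 # single label -> Series
sub_frame = df_demo[["country", "continent"]]   # list of labels -> DataFrame

print(type(one_column).__name__)  # Series
print(type(sub_frame).__name__)   # DataFrame
```

This is why the output above ends with a `Name: country` line instead of showing a column header.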

This deserves some explanation. Let's dive deeper into **loc** and **iloc**.

#### loc and iloc functions

The best way to extract data from a dataframe is via **loc** or **iloc**. Although this chapter calls them functions, they are indexers and are used with square brackets rather than parentheses. To understand how to use them, look at the picture below. It shows the indexing of the rows and columns starting with 0.

<img src="./pics/dataframe_row_column_indices.png"/>

##### iloc - Integer Location

iloc ( or integer location ) is one way to extract data from a data frame. The syntax is shown below. 

<img src="./pics/iloc-function.png"/>

How do you specify the rows or columns ? Using integers or slices. Let's see some examples. 

- Get the first two rows and all columns

In [42]:
df.iloc[[0,1],[0,1,2,3,4]]

Unnamed: 0,rank,country,continent,population,Change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011


<img src="./pics/iloc-syntax-1.png"/>

Now, instead of specifying all the column numbers, you can very well use the slicing notation and just use the : operator. So, the following would also yield the same result. 

In [44]:
df.iloc[[0,1], :]

Unnamed: 0,rank,country,continent,population,Change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011


<img src="./pics/iloc-syntax-2.png"/>

Along the same lines, if you wanted to select all the rows as well, you could use the : operator. It brings out the entire dataframe. 

In [46]:
df.iloc[:,:]

Unnamed: 0,rank,country,continent,population,Change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011
2,3,United States,Americas,322179605,0.007
3,4,Indonesia,Asia,261115456,0.011
4,5,Brazil,Americas,207652865,0.008
5,6,Pakistan,Asia,193203476,0.02
6,7,Nigeria,Africa,185989640,0.026
7,8,Bangladesh,Asia,162951560,0.011
8,9,Russia,Europe,146864513,−2.0%
9,10,Mexico,Americas,127540423,0.013


<img src="./pics/iloc-syntax-3.png"/>

What if you wanted to select the 2nd, 4th and 5th rows ?

In [47]:
df.iloc[[1,3,4],:]

Unnamed: 0,rank,country,continent,population,Change
1,2,India,Asia,1324171354,0.011
3,4,Indonesia,Asia,261115456,0.011
4,5,Brazil,Americas,207652865,0.008


<img src="./pics/iloc-syntax-4.png"/>

Say you wanted just these highlighted cells ( specific rows and specific columns ), how would you do it ? The same way: pass a list of row indices and a list of column indices.

<img src="./pics/iloc-syntax-5.png"/>

In [48]:
df.iloc[[1,3,4],[1,4]]

Unnamed: 0,country,Change
1,India,0.011
3,Indonesia,0.011
4,Brazil,0.008


How about a chunk like this ?

<img src="./pics/iloc-syntax-6.png"/>

In [53]:
df.iloc[1:4,0:4]

Unnamed: 0,rank,country,continent,population
1,2,India,Asia,1324171354
2,3,United States,Americas,322179605
3,4,Indonesia,Asia,261115456


Instead of spelling out every index in a list, we are using slices to specify the indices. This results in a compact syntax. However, remember that in Python, slices exclude the stop index, so 1:4 selects positions 1, 2 and 3. 

##### loc - Location

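Another way to select data from a dataframe is using the labels ( row or column names - as opposed to numeric indices ). Unlike **iloc**, a label slice in **loc** includes the end label, so 1:4 below returns the rows labelled 1 through 4. For example, 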

<img src="./pics/loc-syntax-1.png"/>

In [66]:
df.loc[ 1:4 , ["rank","country","population","change"] ]

Unnamed: 0,rank,country,population,change
1,2,India,1324171354,0.011
2,3,United States,322179605,0.007
3,4,Indonesia,261115456,0.011
4,5,Brazil,207652865,0.008


###### Rownames

Sometimes, dataframes have row names. For example the same dataframe could be looking like this. 

<img src="./pics/row-names.png"/>

In [70]:
df_country = df.set_index("country")
df_country.head()

Unnamed: 0_level_0,rank,continent,population,change
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
China,1,Asia,1403500365,0.004
India,2,Asia,1324171354,0.011
United States,3,Americas,322179605,0.007
Indonesia,4,Asia,261115456,0.011
Brazil,5,Americas,207652865,0.008


Now, you can use these label based indices to select rows ( as opposed to numeric indices )

<img src="./pics/loc-syntax-2.png"/>

In [74]:
df_country.loc[ ["India","Indonesia"], ["rank","continent","change"]]

Unnamed: 0_level_0,rank,continent,change
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
India,2,Asia,0.011
Indonesia,4,Asia,0.011


You can even use slicing. 

<img src="./pics/loc-syntax-3.png"/>

In [76]:
df_country.loc[ "India":"Brazil", ["rank","continent","change"]]

Unnamed: 0_level_0,rank,continent,change
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
India,2,Asia,0.011
United States,3,Americas,0.007
Indonesia,4,Asia,0.011
Brazil,5,Americas,0.008
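
As a compact recap of row names, the sketch below sets the index, picks a single cell by its two labels, slices by label, and then turns the row names back into a column. `df_demo` and the other variable names are only for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "rank": [1, 2, 3, 4, 5],
    "country": ["China", "India", "United States", "Indonesia", "Brazil"],
    "continent": ["Asia", "Asia", "Americas", "Asia", "Americas"],
})

by_country = df_demo.set_index("country")

# A single cell: row label first, column label second
india_rank = by_country.loc["India", "rank"]
print(india_rank)  # 2

# Label slices follow the order of the rows, not alphabetical order
print(by_country.loc["India":"Indonesia", "continent"])

# reset_index turns the row names back into an ordinary column
restored = by_country.reset_index()
print(restored.columns.tolist())  # ['country', 'rank', 'continent']
```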


#### Boolean Mask

If you have heard about the **WHERE** clause in SQL, you will be right at home with **boolean mask** in dataframes. This concept has been borrowed from other math/statistical languages like MATLAB and R. Let's take an example. 

Get all the rows where "continent" = "Asia"

In [4]:
df_1 = df.iloc[0:5,:]
df_1

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011
2,3,United States,Americas,322179605,0.007
3,4,Indonesia,Asia,261115456,0.011
4,5,Brazil,Americas,207652865,0.008


<img src="./pics/dataframe-boolean-mask.png"/>

In [9]:
df_1.loc[ df_1["continent"] == "Asia", :]

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011
3,4,Indonesia,Asia,261115456,0.011


How did this work ? **loc** accepts not just column/row labels, but also a vector of boolean values. The expression

<pre>
    df_1["continent"] == "Asia"
</pre>

results in a True/False ( boolean ) vector like below. 

In [10]:
df_1["continent"] == "Asia"

0     True
1     True
2    False
3     True
4    False
Name: continent, dtype: bool

And all **True** rows are returned and **False** rows are suppressed. This is equivalent to the SQL **WHERE** clause. An equivalent SQL statement would be

<pre>
   SELECT * FROM df WHERE continent = 'Asia'
</pre>

You are not limited to a single condition. Using boolean operations, you can make this as complicated as you want. 

<img src="./pics/dataframe-boolean-mask-2.png"/>

In [23]:
continent = df_1["continent"] == "Asia" 
continent

0     True
1     True
2    False
3     True
4    False
Name: continent, dtype: bool

In [25]:
rank =  df_1["rank"] <= 3
rank

0     True
1     True
2     True
3    False
4    False
Name: rank, dtype: bool

In [31]:
condition = continent & rank
condition

0     True
1     True
2    False
3    False
4    False
dtype: bool

In [33]:
df_1.loc[condition, :]

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011


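You can also write the conditions inline, wrapping each one in parentheses. Note that the example below combines them with **|** ( or ) instead of **&** ( and ), so it returns the rows that satisfy either condition. 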

In [47]:
df_1.loc[ (df_1["rank"] <=3 ) | (df_1["continent"] == "Asia") , :]

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011
2,3,United States,Americas,322179605,0.007
3,4,Indonesia,Asia,261115456,0.011


### Rows

#### Add rows to Dataframe

<img src="./pics/dataframe-append.png"/>

In [6]:
df_small = df.head()
df_small

#--------- what is the next row ? -----------
new_rows = df.iloc[5:6,:]

#--------- Add rows from one dataframe to another
df_small.append(new_rows)
df_small

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011
2,3,United States,Americas,322179605,0.007
3,4,Indonesia,Asia,261115456,0.011
4,5,Brazil,Americas,207652865,0.008


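Why didn't it work ? Because **append ( )** does not modify the dataframe in place. Instead, it returns a new dataframe. So, try this. ( Note that **append ( )** was deprecated in pandas 1.4 and removed in pandas 2.0; on recent versions use **pd.concat ( )**, covered later in this chapter, which likewise returns a new dataframe. ) 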

In [27]:
df_new = df_small.append(new_rows)
df_new

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011
2,3,United States,Americas,322179605,0.007
3,4,Indonesia,Asia,261115456,0.011
4,5,Brazil,Americas,207652865,0.008
5,6,Pakistan,Asia,193203476,0.02


#### Delete rows from Dataframe

Use the **drop ( )** function to delete rows from a dataframe. Here is how you drop a row. 

<img src="./pics/dataframe-drop-a-row.png"/>

In [9]:
df_new = df_new.drop(df.index[1])
df_new

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
2,3,United States,Americas,322179605,0.007
3,4,Indonesia,Asia,261115456,0.011
4,5,Brazil,Americas,207652865,0.008
5,6,Pakistan,Asia,193203476,0.02


or drop multiple rows. 

<img src="./pics/dataframe-drop-multiple-rows.png"/>

In [17]:
df_new = df_new.drop(df.index[[1,2,4]])
df_new

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
3,4,Indonesia,Asia,261115456,0.011
5,6,Pakistan,Asia,193203476,0.02


or drop a slice of rows. 

<img src="./pics/dataframe-drop-slice-of-rows.png"/>

In [19]:
df_new = df_new.drop(df.index[1:4])
df_new

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
4,5,Brazil,Americas,207652865,0.008
5,6,Pakistan,Asia,193203476,0.02
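
Two points about deleting rows are worth a short sketch: **drop ( )** works on row labels ( not positions ), and rows are often removed by condition rather than by label. The frame `df_demo` below is only for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "rank": [1, 2, 3, 4, 5],
    "country": ["China", "India", "United States", "Indonesia", "Brazil"],
    "continent": ["Asia", "Asia", "Americas", "Asia", "Americas"],
})

# drop removes rows by label and returns a new DataFrame
without_india = df_demo.drop(index=[1])
print(without_india.index.tolist())  # [0, 2, 3, 4]

# Dropping by condition is usually written as "keep the rows that do not match"
not_americas = df_demo.loc[df_demo["continent"] != "Americas"]
print(not_americas["country"].tolist())  # ['China', 'India', 'Indonesia']

# reset_index(drop=True) renumbers the remaining rows from 0
print(without_india.reset_index(drop=True).index.tolist())  # [0, 1, 2, 3]
```

Because the labels of deleted rows are gone, asking **drop ( )** to remove a label that no longer exists raises a `KeyError`.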


### Columns

#### Add columns to Dataframe

We have already seen some examples of adding new columns to a dataframe. Just use the indexing syntax. For example, if you have a list that you wanted to add as a column, you can use the column name as follows. 

In [25]:
df_1 = df_small.iloc[:,0:4]
df_1

Unnamed: 0,rank,country,continent,population
0,1,China,Asia,1403500365
1,2,India,Asia,1324171354
2,3,United States,Americas,322179605
3,4,Indonesia,Asia,261115456
4,5,Brazil,Americas,207652865


In [24]:
change = [0.004,0.011,0.007,0.011,0.008]
df_1["change"] = change
df_1

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011
2,3,United States,Americas,322179605,0.007
3,4,Indonesia,Asia,261115456,0.011
4,5,Brazil,Americas,207652865,0.008


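or use an integer as the column label. Note that this does not place the column at position 4; it creates a column whose name is the integer 4, as the output shows.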

In [26]:
change = [0.004,0.011,0.007,0.011,0.008]
df_1[4] = change
df_1

Unnamed: 0,rank,country,continent,population,4
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011
2,3,United States,Americas,322179605,0.007
3,4,Indonesia,Asia,261115456,0.011
4,5,Brazil,Americas,207652865,0.008


If you wanted to insert the column at a specific position, use the **insert ( )** function. Just specify the position at which the column should be located, the column name and the actual data, in that order.

In [32]:
df_1 = df_small.iloc[:,0:4]
df_1

Unnamed: 0,rank,country,continent,population
0,1,China,Asia,1403500365
1,2,India,Asia,1324171354
2,3,United States,Americas,322179605
3,4,Indonesia,Asia,261115456
4,5,Brazil,Americas,207652865


In [33]:
change = [0.004,0.011,0.007,0.011,0.008]
df_1.insert(3,"change",change)
df_1

Unnamed: 0,rank,country,continent,change,population
0,1,China,Asia,0.004,1403500365
1,2,India,Asia,0.011,1324171354
2,3,United States,Americas,0.007,322179605
3,4,Indonesia,Asia,0.011,261115456
4,5,Brazil,Americas,0.008,207652865
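
When the frame you are adding a column to is itself a subset of another frame, pandas may warn that you are writing to a copy. A sketch of two ways around that, under the assumption that you want the original left untouched: take an explicit **copy ( )**, or use **assign ( )**, which returns a new frame. All names below are only for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "rank": [1, 2, 3],
    "country": ["China", "India", "United States"],
    "population": [1403500365, 1324171354, 322179605],
})

# .copy() makes the subset independent of df_demo
subset = df_demo.iloc[:, 0:2].copy()
subset["continent"] = ["Asia", "Asia", "Americas"]

# assign returns a new DataFrame with the extra column and leaves its input alone
with_millions = df_demo.assign(population_m=df_demo["population"] / 1_000_000)

print(subset.columns.tolist())         # ['rank', 'country', 'continent']
print(df_demo.columns.tolist())        # ['rank', 'country', 'population']
print(with_millions.columns.tolist())  # ['rank', 'country', 'population', 'population_m']
```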


#### Delete columns from Dataframe

Deleting columns from a dataframe is just as easy. For example, if you wanted to drop the "rank" column from the dataframe above, use the **drop ( )** function. Just don't forget to include the **axis** parameter: axis=1 means along the **columns** ( the default, axis=0, drops rows ).

<img src="./pics/dataframe-drop-column.png"/>

In [56]:
df_new = df_new.drop("rank",axis=1)
df_new

Unnamed: 0,country,continent,population,change
0,China,Asia,1403500365,0.004
1,India,Asia,1324171354,0.011
2,United States,Americas,322179605,0.007
3,Indonesia,Asia,261115456,0.011
4,Brazil,Americas,207652865,0.008
5,Pakistan,Asia,193203476,0.02


To do the same using the column index rather than the column name, look up the name through the **df.columns** attribute.

In [60]:
df_new = df_new.drop(df_new.columns[0],axis=1)

You could delete multiple columns as well. 

<img src="./pics/dataframe-drop-multiple-columns.png"/>

In [35]:
df_new = df_new.drop(["rank","change","population"],axis=1)
# if you wanted to do it by column indices, index df_new.columns with a list
# df_new = df_new.drop(df_new.columns[[0,3,4]],axis=1)
df_new

Unnamed: 0,country,continent
0,China,Asia
1,India,Asia
2,United States,Americas
3,Indonesia,Asia
4,Brazil,Americas
5,Pakistan,Asia
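
Here is a self-contained sketch of both ways of dropping several columns, by name and by position. It uses the `columns=` keyword, which is an alternative spelling of `axis=1`; `df_demo` is only for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "rank": [1, 2],
    "country": ["China", "India"],
    "continent": ["Asia", "Asia"],
    "population": [1403500365, 1324171354],
    "change": [0.004, 0.011],
})

# By name
by_name = df_demo.drop(columns=["rank", "change", "population"])

# By position: index df.columns with a list of integers to get the names first
by_position = df_demo.drop(columns=df_demo.columns[[0, 3, 4]])

print(by_name.columns.tolist())      # ['country', 'continent']
print(by_position.columns.tolist())  # ['country', 'continent']
```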


### Grouping

In Pandas, there is an equivalent to the SQL **GROUP BY** syntax. Look at the example below. How do you compute the average population of each continent ? Pandas has a built-in **groupby ( )** function that can do it for us. 

<img src="./pics/dataframe-groupby.png"/>

In [7]:
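df_small.groupby("continent").mean(numeric_only=True)  # average the numeric columns only; pandas 2.0+ raises an error on text columns otherwise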

Unnamed: 0_level_0,rank,population
continent,Unnamed: 1_level_1,Unnamed: 2_level_1
Americas,4.0,264916200.0
Asia,2.333333,996262400.0


In [8]:
d = df_small.groupby("continent")
d

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0A33BD70>

Without the aggregator function ( mean() in this case ), **groupby ( )** returns a DataFrameGroupBy object. It is not a dataframe until you apply an aggregator function. **mean ( )** is just one example. You can use generic functions like

- size ( )     ## like count()
- sum ( )
- first ( )
- last ( )

or statistical functions like

- mean ( )
- std ( ) ## Standard Deviation
- var ( ) ## Variance
- min ( )
- max ( )

etc

In [34]:
df_small.groupby("continent").size()

continent
Americas    2
Asia        3
dtype: int64

In [14]:
df_small.groupby("continent").sum(numeric_only=True)  # sum the numeric columns only

Unnamed: 0_level_0,rank,population
continent,Unnamed: 1_level_1,Unnamed: 2_level_1
Americas,8,529832470
Asia,7,2988787175
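
Several aggregations can also be computed in one pass with **agg ( )**. The sketch below uses what pandas calls named aggregation, where each keyword is an output column name paired with a ( source column, function ) tuple. The output names `countries`, `total_population` and `largest` are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "country": ["China", "India", "United States", "Indonesia", "Brazil"],
    "continent": ["Asia", "Asia", "Americas", "Asia", "Americas"],
    "population": [1403500365, 1324171354, 322179605, 261115456, 207652865],
})

summary = df_demo.groupby("continent").agg(
    countries=("country", "count"),          # how many rows per continent
    total_population=("population", "sum"),  # same totals as sum() above
    largest=("population", "max"),           # biggest country per continent
)
print(summary)
```

Naming the source column for each aggregation also sidesteps the question of which columns are numeric.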


You are not limited to just a single criterion for grouping or aggregation. For example, 

<img src="./pics/dataframe-groupby-multiple-columns.png"/>

In [32]:
df_1 = df.iloc[0:5,:].copy()  # copy, so that adding a column does not touch df
df_1.loc[:,"nato"] = [False, False, True, False , False]
df_1.groupby(["continent","nato"]).mean(numeric_only=True)


Unnamed: 0_level_0,Unnamed: 1_level_0,rank,population
continent,nato,Unnamed: 2_level_1,Unnamed: 3_level_1
Americas,False,5.0,207652900.0
Americas,True,3.0,322179600.0
Asia,False,2.333333,996262400.0


### Merge Dataframes

Merging dataframes is a bit involved. We will start with the simplest of cases and move towards more complicated ones. 

#### Concatenate Dataframes

<img src="./pics/dataframe-concatenation.png"/>

In [27]:
df1 = df.iloc[0:2,0:3]
df2 = df.iloc[2:4,0:3]
df3 = df.iloc[4:6,0:3]

df_new = pd.concat([df1,df2,df3])
df_new

Unnamed: 0,rank,country,continent
0,1,China,Asia
1,2,India,Asia
2,3,United States,Americas
3,4,Indonesia,Asia
4,5,Brazil,Americas
5,6,Pakistan,Asia


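What then is the difference between **append ( )** and **concat ( )** ? Think of **concat** as the modern replacement for **append**: **append ( )** was deprecated in pandas 1.4 and removed in pandas 2.0, so the example below only runs on older versions. By the way, **append ( )** is not limited to just 2 dataframes. For example, 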

In [28]:
df_new = df1.append([df2,df3])
df_new

Unnamed: 0,rank,country,continent
0,1,China,Asia
1,2,India,Asia
2,3,United States,Americas
3,4,Indonesia,Asia
4,5,Brazil,Americas
5,6,Pakistan,Asia


The major difference between them is the flexibility that **concat** provides. For example, what if you wanted to concatenate along the columns ? Like this

<img src="./pics/dataframe-concatenate-columns.png"/>

In [32]:
df1 = df.iloc[0:3,0:3]
df2 = df.iloc[0:3,3:5]
print ( df1 ) 
print ( df2 )

   rank        country continent
0     1          China      Asia
1     2          India      Asia
2     3  United States  Americas
   population change
0  1403500365  0.004
1  1324171354  0.011
2   322179605  0.007


In [35]:
df_new = pd.concat([df1,df2],axis=1)
df_new

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011
2,3,United States,Americas,322179605,0.007
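
Concatenating along the columns works cleanly above because both pieces carry the same row labels. The sketch below, built on two small made-up frames named `left` and `right`, shows what happens when they do not: rows are lined up by label, and gaps are filled with NaN.

```python
import pandas as pd

left = pd.DataFrame({"country": ["China", "India", "United States"]}, index=[0, 1, 2])
right = pd.DataFrame({"population": [1324171354, 322179605]}, index=[1, 2])

# axis=1 lines rows up by index label, not by position
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)

# Row 0 has no partner in `right`, so its population is NaN
print(side_by_side["population"].isna().tolist())  # [True, False, False]
```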


In [37]:
df_new["rank"] = None

In [38]:
df_new

Unnamed: 0,rank,country,continent,population,change
0,,China,Asia,1403500365,0.004
1,,India,Asia,1324171354,0.011
2,,United States,Americas,322179605,0.007


#### Merge

**Merge** is similar to database joins. Look at this example. 

<img src="./pics/dataframe-merge-1.png"/>

In [46]:
df_1 = df.iloc[0:5,:]
df_2 = pd.DataFrame()
df_2["country"] = [ "China","India","United States","Indonesia","Brazil"]
df_2["nato"] = [False,False,True,False,False]
df_1

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011
2,3,United States,Americas,322179605,0.007
3,4,Indonesia,Asia,261115456,0.011
4,5,Brazil,Americas,207652865,0.008


In [38]:
df_2

Unnamed: 0,country,nato
0,China,False
1,India,False
2,United States,True
3,Indonesia,False
4,Brazil,False


In [39]:
df_1.merge(df_2)

Unnamed: 0,rank,country,continent,population,change,nato
0,1,China,Asia,1403500365,0.004,False
1,2,India,Asia,1324171354,0.011,False
2,3,United States,Americas,322179605,0.007,True
3,4,Indonesia,Asia,261115456,0.011,False
4,5,Brazil,Americas,207652865,0.008,False
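
In the cell above, **merge ( )** joins on the columns the two frames have in common, which here is just `country`. Naming the key explicitly makes the intent clearer. The sketch below does that, and also passes `validate="one_to_one"` so that pandas raises an error if a country is repeated on either side. The frame names `countries` and `membership` are only for illustration.

```python
import pandas as pd

countries = pd.DataFrame({
    "rank": [1, 2, 3],
    "country": ["China", "India", "United States"],
})
membership = pd.DataFrame({
    "country": ["China", "India", "United States"],
    "nato": [False, False, True],
})

# on= names the key column; validate= checks that keys are unique on both sides
merged = countries.merge(membership, on="country", validate="one_to_one")
print(merged)
```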


##### Inner Join

By default, **merge ( )** performs an **inner join**, which keeps only the rows whose key appears in both dataframes. Think of this like an intersection. Expanding on the example above, 

<img src="./pics/dataframe-inner-join.png"/>

In [42]:
df_3 = pd.DataFrame()
df_3["country"] = [ "China","India","United States"]
df_3["nato"] = [False,False,True]
df_3

Unnamed: 0,country,nato
0,China,False
1,India,False
2,United States,True


In [45]:
df_1.merge(df_3)

Unnamed: 0,rank,country,continent,population,change,nato
0,1,China,Asia,1403500365,0.004,False
1,2,India,Asia,1324171354,0.011,False
2,3,United States,Americas,322179605,0.007,True


##### Left Join

What if the left dataframe has more rows and we want to retain all of them ? Then we use a **left join**

<img src="./pics/dataframe-left-join.png"/>

In [47]:
df_1.merge(df_3,how="left")

Unnamed: 0,rank,country,continent,population,change,nato
0,1,China,Asia,1403500365,0.004,False
1,2,India,Asia,1324171354,0.011,False
2,3,United States,Americas,322179605,0.007,True
3,4,Indonesia,Asia,261115456,0.011,
4,5,Brazil,Americas,207652865,0.008,


##### Right Join

Conversely, if the right dataframe has more rows and you want to preserve all of them, use a **right join**.

<img src="./pics/dataframe-right-join.png"/>

In [54]:
df_1 = df.iloc[0:3,:]
df_1

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011
2,3,United States,Americas,322179605,0.007


In [55]:
df_3 = pd.DataFrame()
df_3["country"] = [ "China","India","United States","Indonesia","Brazil"]
df_3["nato"] = [False,False,True,False,False]
df_3

Unnamed: 0,country,nato
0,China,False
1,India,False
2,United States,True
3,Indonesia,False
4,Brazil,False


In [53]:
df_1.merge(df_3,how="right")

Unnamed: 0,rank,country,continent,population,change,nato
0,1.0,China,Asia,1403500000.0,0.004,False
1,2.0,India,Asia,1324171000.0,0.011,False
2,3.0,United States,Americas,322179600.0,0.007,True
3,,Indonesia,,,,False
4,,Brazil,,,,False


##### Outer Join

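There is another type of join called **outer join**, which keeps all the rows from both dataframes and leaves the values missing where one side has no match. Let's try this on the same datasets as above. 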

In [57]:
df_1.merge(df_3,how="outer")

Unnamed: 0,rank,country,continent,population,change,nato
0,1.0,China,Asia,1403500000.0,0.004,False
1,2.0,India,Asia,1324171000.0,0.011,False
2,3.0,United States,Americas,322179600.0,0.007,True
3,,Indonesia,,,,False
4,,Brazil,,,,False


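Surprisingly, you see the same results as the right join, right ? That is because every country in the left dataframe also appears in the right one, so the outer join has no extra rows to add. Let's take a better example to illustrate this. 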

<img src="./pics/dataframe-outer-join.png"/>

In [58]:
df_1 = df.iloc[[0,1,2,5,6],:]
df_1

Unnamed: 0,rank,country,continent,population,change
0,1,China,Asia,1403500365,0.004
1,2,India,Asia,1324171354,0.011
2,3,United States,Americas,322179605,0.007
5,6,Pakistan,Asia,193203476,0.02
6,7,Nigeria,Africa,185989640,0.026


In [60]:
df_3 = df_3.iloc[[0,2,3],:]
df_3

Unnamed: 0,country,nato
0,China,False
2,United States,True
3,Indonesia,False


In [61]:
df_1.merge(df_3,how="outer")

Unnamed: 0,rank,country,continent,population,change,nato
0,1.0,China,Asia,1403500000.0,0.004,False
1,2.0,India,Asia,1324171000.0,0.011,
2,3.0,United States,Americas,322179600.0,0.007,True
3,6.0,Pakistan,Asia,193203500.0,0.02,
4,7.0,Nigeria,Africa,185989600.0,0.026,
5,,Indonesia,,,,False
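
To see where each row of an outer join came from, **merge ( )** accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only` or `both`. A minimal sketch on reduced versions of the two frames above; `left`, `right` and `origin` are illustrative names.

```python
import pandas as pd

left = pd.DataFrame({
    "country": ["China", "India", "United States", "Pakistan", "Nigeria"],
    "rank": [1, 2, 3, 6, 7],
})
right = pd.DataFrame({
    "country": ["China", "United States", "Indonesia"],
    "nato": [False, True, False],
})

# indicator=True adds a _merge column describing the source of each row
outer = left.merge(right, on="country", how="outer", indicator=True)
origin = dict(zip(outer["country"], outer["_merge"].astype(str)))

print(origin["China"])      # both
print(origin["India"])      # left_only
print(origin["Indonesia"])  # right_only
```

Filtering on `_merge` is a handy way to find the keys that failed to match on one side.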
