# Concatenation
* Often the data you need exists in two separate sources. Fortunately, Pandas makes it easy to combine them.
* The simplest case is when both sources already share the <b>same format</b>; then a <b>concatenation</b> with <b>pd.concat()</b> is all that is needed.



* Concatenation is simply "pasting" the DataFrames together, either by columns (side by side) or by rows (stacked)

<table class="center">
<tr>
<th>Labeled Index</th>
<th>Year</th>
<th>Pop</th>
</tr>

<tr>
<th>USA</th>
<th>1776</th>
<th>328</th>

</tr>

<tr>
<th>CANADA</th>
<th>1867</th>
<th>38</th>

</tr>

<tr>
<th>MEXICO</th>
<th>1821</th>
<th>1.7</th>

</tr>

</table>

<table class="center">
<tr>
<th>Labeled Index</th>
<th>GDP</th>
<th>Carriers</th>
</tr>

<tr>
<th>USA</th>
<th>20.5</th>
<th>11</th>
</tr>

<tr>
<th>CANADA</th>
<th>1.7</th>
<th>NaN</th>
</tr>

<tr>
<th>MEXICO</th>
<th>1.22</th>
<th>NaN</th>
</tr>
</table>

* Gluing the two tables together by column gives all the information for the four columns

<table class="center">
<tr>
<th>Labeled Index</th>
<th>Year</th>
<th>Pop</th>
<th>GDP</th>
<th>Carriers</th>
</tr>

<tr>
<th>USA</th>
<th>1776</th>
<th>328</th>
<th>20.5</th>
<th>11</th>
</tr>

<tr>
<th>CANADA</th>
<th>1867</th>
<th>38</th>
<th>1.7</th>
<th>NaN</th>
</tr>

<tr>
<th>MEXICO</th>
<th>1821</th>
<th>1.7</th>
<th>1.22</th>
<th>NaN</th>
</tr>
</table>



Remember that Pandas will automatically fill in NaN wherever a value is missing.
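As a minimal sketch of this idea, the two country tables can be rebuilt by hand and pasted together side by side. The names `left`, `right`, and `combined` are illustrative only, and the CANADA row is deliberately left out of `right` (an assumption made for this sketch) so that the NaN filling becomes visible.

```python
import numpy as np
import pandas as pd

# Values copied from the tables above; the index holds the country labels
left = pd.DataFrame(
    {"Year": [1776, 1867, 1821], "Pop": [328, 38, 1.7]},
    index=["USA", "CANADA", "MEXICO"],
)

# Assumption for this sketch: CANADA is omitted and the rows are in a different order
right = pd.DataFrame(
    {"GDP": [1.22, 20.5], "Carriers": [np.nan, 11]},
    index=["MEXICO", "USA"],
)

# axis=1 pastes the columns side by side and aligns the rows on the index labels
combined = pd.concat([left, right], axis=1)
print(combined)
```

Rows are matched by label, not by position, so the different row order in `right` does not matter, and CANADA ends up with NaN in the columns that come from `right`.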


In [1]:
import pandas as pd
import numpy as np

In [2]:
data_one = {'A': ['A0', 'A1', 'A2', 'A3'],'B': ['B0', 'B1', 'B2', 'B3']}

In [3]:
data_two = {'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']}

In [4]:
one = pd.DataFrame(data_one)

In [5]:
two = pd.DataFrame(data_two)

In [6]:
one

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3


In [7]:
two

Unnamed: 0,C,D
0,C0,D0
1,C1,D1
2,C2,D2
3,C3,D3


In [8]:
# concatenate the one and two DataFrames along the columns (axis=1)
# concat takes the DataFrames as a list
pd.concat([one, two], axis=1)

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [9]:
# reverse the list if you want the columns of two to come first
pd.concat([two, one], axis=1)


Unnamed: 0,C,D,A,B
0,C0,D0,A0,B0
1,C1,D1,A1,B1
2,C2,D2,A2,B2
3,C3,D3,A3,B3


In [10]:
# concatenate along the rows (axis=0) instead
# the column names differ, so pandas fills the gaps with NaN,
# and each DataFrame keeps its own index, so the labels 0-3 appear twice
pd.concat([one, two], axis=0)

Unnamed: 0,A,B,C,D
0,A0,B0,,
1,A1,B1,,
2,A2,B2,,
3,A3,B3,,
0,,,C0,D0
1,,,C1,D1
2,,,C2,D2
3,,,C3,D3


In [11]:
# the column names must match for the rows to line up when concatenating along the rows
# rename the columns of two (C, D) to the columns of one (A, B)
two.columns = one.columns


In [12]:
two

Unnamed: 0,A,B
0,C0,D0
1,C1,D1
2,C2,D2
3,C3,D3


In [13]:
mydf = pd.concat([one, two], axis=0)

In [14]:
mydf

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3
0,C0,D0
1,C1,D1
2,C2,D2
3,C3,D3


In [15]:
# replace the duplicated index with a fresh 0..n-1 range
mydf.index = range(len(mydf))

In [16]:
mydf

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3
4,C0,D0
5,C1,D1
6,C2,D2
7,C3,D3
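As an alternative to assigning a new index by hand, `pd.concat()` accepts `ignore_index=True`. The sketch below uses two small illustrative DataFrames (`top` and `bottom` are names chosen for this example) rather than the notebook's `one` and `two`.

```python
import pandas as pd

# Two small frames that already share the same column names
top = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
bottom = pd.DataFrame({"A": ["C0", "C1"], "B": ["D0", "D1"]})

# ignore_index=True discards the original row labels and numbers the result 0..n-1
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)
print(stacked)
```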


# "Inner" Merge
* Often DataFrames are not in exactly the same order or format, so we cannot simply concatenate them
* In this case, we need to <b>merge</b> the DataFrames
* This is analogous to a JOIN command in SQL

* The <b>merge()</b> method takes a key argument named <b>how</b>
* There are three main ways of merging tables using the <b>how</b> parameter:
    * Inner
    * Outer
    * Left or Right
* The main idea behind the argument is to decide <b>how</b> to deal with information only present in one of the joined tables
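To get a feel for the options before the full example, here is a minimal sketch that runs each `how` value on two tiny hypothetical tables (`left_df` and `right_df` are made up for this illustration) and records how many rows survive.

```python
import pandas as pd

# Hypothetical tables: key "b" is in both, "a" only in left_df, "c" only in right_df
left_df = pd.DataFrame({"key": ["a", "b"], "left_val": [1, 2]})
right_df = pd.DataFrame({"key": ["b", "c"], "right_val": [3, 4]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left_df, right_df, how=how, on="key")
    row_counts[how] = len(merged)

print(row_counts)
```

The inner merge keeps only the shared key, left and right each keep all the keys of one table, and outer keeps every key from both.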

# The example
* The company is holding a conference for people in the movie rental industry
* People register online beforehand and then log in on the day of the conference



In [17]:
# After the conference we have these two tables
# The respective id columns indicate the order in which people registered or logged in on site
# Assume names are unique
# The registration names start with A, B, C, D
registrations = pd.DataFrame({'reg_id':[1,2,3,4],'name':['Andrew','Bobo','Claire','David']})
logins = pd.DataFrame({'log_id':[1,2,3,4],'name':['Xavier','Andrew','Yolanda','Bobo']})


In [18]:
# the people who registered
# notice that some people registered but never logged in
registrations

Unnamed: 0,reg_id,name
0,1,Andrew
1,2,Bobo
2,3,Claire
3,4,David


In [19]:
# the people who logged in
# some people logged in without registering
logins

Unnamed: 0,log_id,name
0,1,Xavier
1,2,Andrew
2,3,Yolanda
3,4,Bobo


* First we need to decide <b>on</b> which column to merge
* There are two rules:
    * The <b>on</b> column should be a <i>primary</i> identifier, meaning a unique identifier per row
    * The <b>on</b> column should be present in both tables being merged
* Here the name column holds the name of the person who registered or logged in
* Since we assume names are unique, we will merge with on="name"
* Next we need to decide <b>how</b> to merge the tables <b>on</b> the <b>name</b> column
* With <b>how="inner"</b> the result will be the set of records that <b>match in both</b> tables

<b>Merges are often shown as a Venn diagram</b>

In [20]:
# help(pd.merge)

In [21]:
pd.merge(registrations, logins, how='inner', on='name')


Unnamed: 0,reg_id,name,log_id
0,1,Andrew,2
1,2,Bobo,4


In [22]:
# for an inner merge the order of the tables does not change which rows are returned, only the column order
pd.merge(logins, registrations, how='inner', on='name')


Unnamed: 0,log_id,name,reg_id
0,2,Andrew,1
1,4,Bobo,2
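A quick way to confirm that the two inner merges hold the same records is to put both results into the same column and row order and compare them. The names `reg_df`, `log_df`, `first`, and `second` are stand-ins chosen for this sketch.

```python
import pandas as pd

reg_df = pd.DataFrame({"reg_id": [1, 2, 3, 4],
                       "name": ["Andrew", "Bobo", "Claire", "David"]})
log_df = pd.DataFrame({"log_id": [1, 2, 3, 4],
                       "name": ["Xavier", "Andrew", "Yolanda", "Bobo"]})

first = pd.merge(reg_df, log_df, how="inner", on="name")
second = pd.merge(log_df, reg_df, how="inner", on="name")

# Normalise the column order and row order before comparing
cols = ["name", "reg_id", "log_id"]
first_sorted = first[cols].sort_values("name").reset_index(drop=True)
second_sorted = second[cols].sort_values("name").reset_index(drop=True)

same_rows = first_sorted.equals(second_sorted)
print(same_rows)
```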


# Left and Right Merge
* Now that we understand an 'inner' merge, let's explore "left" versus "right" merge conditions.
* Note! Order of the tables passed in as arguments does matter here




* Let's explore a <b>how="left"</b> condition with our two example tables
* Note: registrations is the left table and logins is the right table
* With <b>how="left"</b> we keep every name in the registrations table, whether or not it appears in the logins table
* Every name in the registrations table will be present in the result along with its corresponding column values
* A name that is not present in logins gets NaN in the columns that come from logins (here log_id)

In [23]:
# left= is the left table
# right= is the right table
# how= is the way to merge the two tables
pd.merge(left=registrations, right=logins, how='left', on='name')

Unnamed: 0,reg_id,name,log_id
0,1,Andrew,2.0
1,2,Bobo,4.0
2,3,Claire,
3,4,David,
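To see where each row of a left merge came from, `merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below, using stand-in tables `reg_df` and `log_df`, also converts `log_id` to the nullable `Int64` dtype; this is an optional step for readers who prefer integers over the floats that NaN forces.

```python
import pandas as pd

reg_df = pd.DataFrame({"reg_id": [1, 2, 3, 4],
                       "name": ["Andrew", "Bobo", "Claire", "David"]})
log_df = pd.DataFrame({"log_id": [1, 2, 3, 4],
                       "name": ["Xavier", "Andrew", "Yolanda", "Bobo"]})

merged = pd.merge(reg_df, log_df, how="left", on="name", indicator=True)

# "_merge" reads "both" for matched names and "left_only" for names missing from log_df
no_show = merged.loc[merged["_merge"] == "left_only", "name"].tolist()

# Optional: the nullable integer dtype keeps log_id as integers despite the missing values
merged["log_id"] = merged["log_id"].astype("Int64")

print(no_show)
print(merged)
```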


* Let's explore a <b>how="right"</b> condition with our two example tables
* This mirrors the left merge: every name in the logins table is kept, names missing from registrations get NaN in reg_id, and the rows follow the order of the right-hand table



In [24]:
# name is the only column the two tables share
pd.merge(left=registrations, right=logins, how='right', on='name')

Unnamed: 0,reg_id,name,log_id
0,,Xavier,1
1,1.0,Andrew,2
2,,Yolanda,3
3,2.0,Bobo,4


In [25]:
# pandas can infer the 'on' column when exactly one column name is shared by both tables
# this is not recommended in practice: being explicit helps other people who read your code
pd.merge(left=registrations, right=logins, how='right')


Unnamed: 0,reg_id,name,log_id
0,,Xavier,1
1,1.0,Andrew,2
2,,Yolanda,3
3,2.0,Bobo,4
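When `on` is omitted, pandas merges on the column names the two tables have in common. The sketch below, with stand-in tables `reg_df` and `log_df`, lists those shared names and checks that the implicit and explicit calls agree.

```python
import pandas as pd

reg_df = pd.DataFrame({"reg_id": [1, 2, 3, 4],
                       "name": ["Andrew", "Bobo", "Claire", "David"]})
log_df = pd.DataFrame({"log_id": [1, 2, 3, 4],
                       "name": ["Xavier", "Andrew", "Yolanda", "Bobo"]})

# The column names present in both tables; these become the default merge keys
shared = list(reg_df.columns.intersection(log_df.columns))

implicit = pd.merge(left=reg_df, right=log_df, how="right")
explicit = pd.merge(left=reg_df, right=log_df, how="right", on="name")

print(shared)
print(implicit.equals(explicit))
```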


# Outer Merge
* Setting <b>how='outer'</b> allows us to include everything present in either table
* Some names appear in both tables and some appear in only one; an outer merge keeps them all and fills the missing values with NaN

In [26]:
# swapping left= and right= only changes the order of the rows and columns in the result
pd.merge(left=registrations, right=logins, how='outer', on='name')


Unnamed: 0,reg_id,name,log_id
0,1.0,Andrew,2.0
1,2.0,Bobo,4.0
2,3.0,Claire,
3,4.0,David,
4,,Xavier,1.0
5,,Yolanda,3.0


In [27]:
pd.merge(left=logins, right=registrations, how='outer', on='name')



Unnamed: 0,log_id,name,reg_id
0,1.0,Xavier,
1,2.0,Andrew,1.0
2,3.0,Yolanda,
3,4.0,Bobo,2.0
4,,Claire,3.0
5,,David,4.0
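An outer merge combined with `indicator=True` gives a compact way to split people into groups. This is a sketch with stand-in tables `reg_df` and `log_df`; it uses sets because the row order of an outer merge can differ between pandas versions.

```python
import pandas as pd

reg_df = pd.DataFrame({"reg_id": [1, 2, 3, 4],
                       "name": ["Andrew", "Bobo", "Claire", "David"]})
log_df = pd.DataFrame({"log_id": [1, 2, 3, 4],
                       "name": ["Xavier", "Andrew", "Yolanda", "Bobo"]})

everyone = pd.merge(reg_df, log_df, how="outer", on="name", indicator=True)

# left_only: registered but never logged in; right_only: logged in without registering
registered_only = set(everyone.loc[everyone["_merge"] == "left_only", "name"])
logged_in_only = set(everyone.loc[everyone["_merge"] == "right_only", "name"])

print(registered_only)
print(logged_in_only)
```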


In [28]:
# merge on an index instead of a column
# move name from a column to the labeled index
registrations = registrations.set_index('name')

In [29]:
registrations

name,reg_id
Andrew,1
Bobo,2
Claire,3
David,4


In [30]:
logins

Unnamed: 0,log_id,name
0,1,Xavier
1,2,Andrew
2,3,Yolanda
3,4,Bobo


In [31]:
# join the labeled index of registrations with the name column of logins
# on= is for a key that is a column in both tables
# left_on= / right_on= name the key column in the left / right table
# left_index=True / right_index=True use the index of that table as the key
pd.merge(left=registrations, right=logins, left_index=True, right_on='name', how='inner')

Unnamed: 0,reg_id,log_id,name
1,1,2,Andrew
3,2,4,Bobo
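When both tables carry the key in their index, `DataFrame.join()` is a shorter way to express the same inner merge. In this sketch `reg_by_name` and `log_by_name` are stand-in tables indexed by name, so the notebook's own variables are left untouched.

```python
import pandas as pd

reg_df = pd.DataFrame({"reg_id": [1, 2, 3, 4],
                       "name": ["Andrew", "Bobo", "Claire", "David"]})
log_df = pd.DataFrame({"log_id": [1, 2, 3, 4],
                       "name": ["Xavier", "Andrew", "Yolanda", "Bobo"]})

# Move the key into the index of both tables
reg_by_name = reg_df.set_index("name")
log_by_name = log_df.set_index("name")

# join() matches rows on the index; how="inner" keeps only the names present in both
joined = reg_by_name.join(log_by_name, how="inner")
print(joined)
```

Here the result is indexed by name, whereas the `left_index=True, right_on='name'` call above keeps the row labels of the logins table.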


In [32]:
registrations

name,reg_id
Andrew,1
Bobo,2
Claire,3
David,4


In [33]:
# reset index
registrations = registrations.reset_index()

In [34]:
registrations

Unnamed: 0,name,reg_id
0,Andrew,1
1,Bobo,2
2,Claire,3
3,David,4


In [35]:
# dealing with key columns that have different names in the two tables
registrations.columns = ['reg_name', 'reg_id']

In [36]:
registrations

Unnamed: 0,reg_name,reg_id
0,Andrew,1
1,Bobo,2
2,Claire,3
3,David,4


In [37]:
logins


Unnamed: 0,log_id,name
0,1,Xavier
1,2,Andrew
2,3,Yolanda
3,4,Bobo


In [38]:
# reg_name in registrations holds the same key as the name column in logins
result = pd.merge(registrations, logins, how='inner', left_on='reg_name', right_on='name')

In [39]:
result

Unnamed: 0,reg_name,reg_id,log_id,name
0,Andrew,1,2,Andrew
1,Bobo,2,4,Bobo


In [40]:
# drop the duplicate key column (this returns a new DataFrame; result itself is unchanged)
result.drop('reg_name', axis=1)


Unnamed: 0,reg_id,log_id,name
0,1,2,Andrew
1,2,4,Bobo
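Another option is to rename the key column before merging so that a plain `on=` works and no duplicate column appears. The sketch below uses stand-in tables `reg_df` and `log_df`; `renamed` and `tidy` are names chosen for this example.

```python
import pandas as pd

reg_df = pd.DataFrame({"reg_name": ["Andrew", "Bobo", "Claire", "David"],
                       "reg_id": [1, 2, 3, 4]})
log_df = pd.DataFrame({"log_id": [1, 2, 3, 4],
                       "name": ["Xavier", "Andrew", "Yolanda", "Bobo"]})

# rename() returns a new DataFrame, so reg_df keeps its original column names
renamed = reg_df.rename(columns={"reg_name": "name"})

# The key now has the same name in both tables, so only one name column comes back
tidy = pd.merge(renamed, log_df, how="inner", on="name")
print(tidy)
```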


In [41]:
# give both tables an id column so the merge has duplicate column names to tag
registrations.columns = ['name', 'id']

In [42]:
logins.columns = ['id', 'name']

In [43]:
registrations

Unnamed: 0,name,id
0,Andrew,1
1,Bobo,2
2,Claire,3
3,David,4


In [44]:
logins


Unnamed: 0,id,name
0,1,Xavier
1,2,Andrew
2,3,Yolanda
3,4,Bobo


In [45]:
# both tables now have a column called id, so the result must say which is which
# pandas automatically tags the duplicate column names with a suffix (_x and _y by default)
# suffixes= takes a tuple: (suffix for the left table, suffix for the right table)
pd.merge(registrations, logins, how='inner', on='name', suffixes=('_reg','_log'))

Unnamed: 0,name,id_reg,id_log
0,Andrew,1,2
1,Bobo,2,4
