# Joining Pandas Dataframes

## Joining Dataframes

In [2]:
import pandas as pd 

data_questionnaire = pd.read_csv("national-health-and-nutrition-examination-survey/questionnaire.csv")
data_examination = pd.read_csv("national-health-and-nutrition-examination-survey/examination.csv")

In [3]:
print(data_questionnaire)

        SEQN  ACD011A  ACD011B  ACD011C  ACD040  ACD110  ALQ101  ALQ110  \
0      73557      1.0      NaN      NaN     NaN     NaN     1.0     NaN   
1      73558      1.0      NaN      NaN     NaN     NaN     1.0     NaN   
2      73559      1.0      NaN      NaN     NaN     NaN     1.0     NaN   
3      73560      1.0      NaN      NaN     NaN     NaN     NaN     NaN   
4      73561      1.0      NaN      NaN     NaN     NaN     1.0     NaN   
5      73562      NaN      NaN      NaN     4.0     NaN     1.0     NaN   
6      73563      NaN      NaN      NaN     NaN     NaN     NaN     NaN   
7      73564      1.0      NaN      NaN     NaN     NaN     2.0     1.0   
8      73565      NaN      NaN      NaN     5.0     NaN     NaN     NaN   
9      73566      1.0      NaN      NaN     NaN     NaN     1.0     NaN   
10     73567      1.0      NaN      NaN     NaN     NaN     1.0     NaN   
11     73568      1.0      NaN      NaN     NaN     NaN     1.0     NaN   
12     73569      NaN    

In [4]:
print(data_examination)

       SEQN  PEASCST1  PEASCTM1  PEASCCT1  BPXCHR  BPAARM  BPACSZ  BPXPLS  \
0     73557         1     620.0       NaN     NaN     1.0     4.0    86.0   
1     73558         1     766.0       NaN     NaN     1.0     4.0    74.0   
2     73559         1     665.0       NaN     NaN     1.0     4.0    68.0   
3     73560         1     803.0       NaN     NaN     1.0     2.0    64.0   
4     73561         1     949.0       NaN     NaN     1.0     3.0    92.0   
5     73562         1    1064.0       NaN     NaN     1.0     5.0    60.0   
6     73563         1      90.0       NaN   152.0     NaN     NaN     NaN   
7     73564         1     954.0       NaN     NaN     1.0     5.0    82.0   
8     73566         1     625.0       NaN     NaN     1.0     4.0    86.0   
9     73567         1     932.0       NaN     NaN     1.0     3.0    70.0   
10    73568         1     585.0       NaN     NaN     1.0     3.0    70.0   
11    73570         1     710.0       NaN     NaN     1.0     2.0    78.0   

## SCENARIO 1 : Two data sets containing the same columns but different rows of data 

The concat() function appends the rows from the two Dataframes (df_SN7577i_a and df_SN7577i_b) to create the df_all_rows Dataframe. When you list this out you can see that all of the data rows are there; however, there is a problem with the index.

We didn’t explicitly set an index for any of the Dataframes we have used, so pandas created a default index (0, 1, 2, ...) for both df_SN7577i_a and df_SN7577i_b. When we concatenated the Dataframes the indexes were also concatenated, resulting in duplicate index entries.

This is really only a problem if you need to access a row by its index. We can fix the problem by passing ignore_index=True to concat(), or by calling reset_index(drop=True) on the result.
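
The duplicate-index behaviour described above can be reproduced with a minimal sketch. The two small Dataframes here (`df_a`, `df_b`) and their columns are made up for illustration and merely stand in for df_SN7577i_a and df_SN7577i_b.

```python
import pandas as pd

# Hypothetical toy data: same columns, different rows.
df_a = pd.DataFrame({"Id": [1, 2, 3], "Q1": [10, 20, 30]})
df_b = pd.DataFrame({"Id": [4, 5], "Q1": [40, 50]})

# Default behaviour: each frame keeps its own 0-based index labels.
df_duplicated = pd.concat([df_a, df_b])
print(df_duplicated.index.tolist())  # [0, 1, 2, 0, 1]

# ignore_index=True discards the old labels and builds a fresh index.
df_all_rows = pd.concat([df_a, df_b], ignore_index=True)
print(df_all_rows.index.tolist())  # [0, 1, 2, 3, 4]

# Equivalent repair applied after the concatenation.
df_fixed = df_duplicated.reset_index(drop=True)
```

With the duplicated index, `df_duplicated.loc[0]` returns two rows rather than one, which is why the fix matters when rows are looked up by label.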

## SCENARIO 2 : Adding the columns from one Dataframe to those of another Dataframe

We use the axis=1 parameter to indicate that it is the columns that need to be joined together.

Notice that the SEQN column appears twice, because it was a column in each dataset. This is not particularly desirable, but it is not necessarily a problem. However, there are better ways of combining columns from two Dataframes which avoid this.

In [16]:
data_concat_by_columns = pd.concat([data_questionnaire, data_examination], axis=1)
data_concat_by_columns

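|  | SEQN | ACD011A | ACD011B | ACD011C | ACD040 | ACD110 | ALQ101 | ALQ110 | ALQ120Q | ALQ120U | ... | CSXLEAOD | CSXSOAOD | CSXGRAOD | CSXONOD | CSXNGSOD | CSXSLTRT | CSXSLTRG | CSXNART | CSXNARG | CSAEFFRT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73557 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN | 1.0 | 3.0 | ... | 2.0 | 1.0 | 1.0 | 1.0 | 4.0 | 62.0 | 1.0 | NaN | NaN | 1.0 |
| 1 | 73558 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN | 7.0 | 1.0 | ... | 3.0 | 1.0 | 2.0 | 3.0 | 4.0 | 28.0 | 1.0 | NaN | NaN | 1.0 |
| 2 | 73559 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN | 0.0 | NaN | ... | 2.0 | 1.0 | 2.0 | 3.0 | 4.0 | 49.0 | 1.0 | NaN | NaN | 3.0 |
| 3 | 73560 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 73561 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN | 0.0 | NaN | ... | 3.0 | 1.0 | 4.0 | 3.0 | 4.0 | NaN | NaN | NaN | NaN | 1.0 |
| 5 | 73562 | NaN | NaN | NaN | 4.0 | NaN | 1.0 | NaN | 5.0 | 3.0 | ... | 3.0 | 1.0 | 2.0 | 3.0 | 4.0 | 21.0 | 1.0 | NaN | NaN | 1.0 |
| 6 | 73563 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | 73564 | 1.0 | NaN | NaN | NaN | NaN | 2.0 | 1.0 | 2.0 | 3.0 | ... | 3.0 | 1.0 | 2.0 | 3.0 | 4.0 | NaN | NaN | 12.0 | 1.0 | 1.0 |
| 8 | 73565 | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | ... | 3.0 | 1.0 | 2.0 | 3.0 | 4.0 | NaN | NaN | 20.0 | 1.0 | 1.0 |
| 9 | 73566 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN | 1.0 | 1.0 | ... | 3.0 | 1.0 | 2.0 | 3.0 | 4.0 | NaN | NaN | 54.0 | 1.0 | 1.0 |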
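
One further caveat is worth a sketch: concat() with axis=1 matches rows by index label, not by SEQN. In the printed data above, index 8 of data_questionnaire is SEQN 73565 while index 8 of data_examination is SEQN 73566, so from that row onward the two halves of the combined row describe different respondents. The toy Dataframes below (`left`, `right`) are hypothetical and only illustrate the effect.

```python
import pandas as pd

# Hypothetical toy frames keyed by SEQN; respondent 3 is missing from `right`.
left = pd.DataFrame({"SEQN": [1, 2, 3, 4], "answer": ["a", "b", "c", "d"]})
right = pd.DataFrame({"SEQN": [1, 2, 4], "pulse": [86.0, 74.0, 68.0]})

# axis=1 places the frames side by side, matching rows by index label only.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
# Index 2 pairs SEQN 3 (left) with SEQN 4 (right); index 3 has no partner.

# Making SEQN the index first lets concat align on the key instead.
aligned = pd.concat([left.set_index("SEQN"), right.set_index("SEQN")], axis=1)
print(aligned)
```

Setting the key as the index also removes the duplicated SEQN column, although merge() is usually the more direct tool for key-based joins.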


## SCENARIO 3 : Using merge to join columns

This is similar to the SQL ‘join’ functionality.

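If the on parameter is omitted, merge() joins on every column name that the two Dataframes have in common. Leaving the join column to default in this way is not best practice. It is better to explicitly name the column using the on parameter.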

In [14]:
data_merged = pd.merge(data_questionnaire, data_examination, how='inner', on='SEQN')
data_merged

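|  | SEQN | ACD011A | ACD011B | ACD011C | ACD040 | ACD110 | ALQ101 | ALQ110 | ALQ120Q | ALQ120U | ... | CSXLEAOD | CSXSOAOD | CSXGRAOD | CSXONOD | CSXNGSOD | CSXSLTRT | CSXSLTRG | CSXNART | CSXNARG | CSAEFFRT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73557 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN | 1.0 | 3.0 | ... | 2.0 | 1.0 | 1.0 | 1.0 | 4.0 | 62.0 | 1.0 | NaN | NaN | 1.0 |
| 1 | 73558 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN | 7.0 | 1.0 | ... | 3.0 | 1.0 | 2.0 | 3.0 | 4.0 | 28.0 | 1.0 | NaN | NaN | 1.0 |
| 2 | 73559 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN | 0.0 | NaN | ... | 2.0 | 1.0 | 2.0 | 3.0 | 4.0 | 49.0 | 1.0 | NaN | NaN | 3.0 |
| 3 | 73560 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 73561 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN | 0.0 | NaN | ... | 3.0 | 1.0 | 4.0 | 3.0 | 4.0 | NaN | NaN | NaN | NaN | 1.0 |
| 5 | 73562 | NaN | NaN | NaN | 4.0 | NaN | 1.0 | NaN | 5.0 | 3.0 | ... | 3.0 | 1.0 | 2.0 | 3.0 | 4.0 | 21.0 | 1.0 | NaN | NaN | 1.0 |
| 6 | 73563 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | 73564 | 1.0 | NaN | NaN | NaN | NaN | 2.0 | 1.0 | 2.0 | 3.0 | ... | 3.0 | 1.0 | 2.0 | 3.0 | 4.0 | NaN | NaN | 12.0 | 1.0 | 1.0 |
| 8 | 73566 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN | 1.0 | 1.0 | ... | 3.0 | 1.0 | 2.0 | 3.0 | 4.0 | NaN | NaN | 20.0 | 1.0 | 1.0 |
| 9 | 73567 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN | 4.0 | 1.0 | ... | 3.0 | 1.0 | 2.0 | 3.0 | 4.0 | NaN | NaN | 54.0 | 1.0 | 1.0 |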
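
To illustrate why naming the join column matters, here is a minimal sketch. The Dataframes `questionnaire` and `examination` and the shared `status` column are hypothetical; the real survey files may not share any column other than SEQN.

```python
import pandas as pd

# Hypothetical toy frames sharing two column names: SEQN and status.
questionnaire = pd.DataFrame(
    {"SEQN": [1, 2, 3], "status": [1, 1, 2], "ALQ101": [1.0, 2.0, 1.0]}
)
examination = pd.DataFrame(
    {"SEQN": [1, 2, 4], "status": [1, 2, 1], "BPXPLS": [86.0, 74.0, 68.0]}
)

# With no `on`, merge joins on every shared column name (SEQN and status).
implicit = pd.merge(questionnaire, examination)
print(len(implicit))  # 1, only SEQN 1 agrees on both columns

# Naming the key joins on SEQN alone; the other shared column gets suffixes.
explicit = pd.merge(questionnaire, examination, on="SEQN", suffixes=("_q", "_e"))
print(explicit.columns.tolist())
```

The implicit version silently drops SEQN 2 because its `status` values differ, which is the kind of surprise that an explicit `on` avoids.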


You specify the type of join you want using the how parameter. The default is the inner join, which returns only the rows whose key (common column) values appear in both Dataframes, together with the columns from both.

The possible values of the how parameter are shown in the picture below (taken from the Pandas documentation).

![image.png](attachment:image.png)
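
As a rough sketch of what the common how values do, the loop below merges two hypothetical toy Dataframes (`left` and `right`, not the survey files) and records which SEQN keys survive each join type.

```python
import pandas as pd

# Hypothetical toy frames: SEQN 1 is only in `left`, SEQN 4 only in `right`.
left = pd.DataFrame({"SEQN": [1, 2, 3], "ALQ101": [1.0, 2.0, 1.0]})
right = pd.DataFrame({"SEQN": [2, 3, 4], "BPXPLS": [74.0, 68.0, 92.0]})

row_keys = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, how=how, on="SEQN")
    row_keys[how] = merged["SEQN"].tolist()
    print(how, row_keys[how])

# inner -> [2, 3]        keys present in both frames
# left  -> [1, 2, 3]     every key from `left`
# right -> [2, 3, 4]     every key from `right`
# outer -> [1, 2, 3, 4]  keys from either frame
```

Rows that have no match on the other side are kept by the left, right and outer joins, with NaN filling the columns that come from the missing side.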