# Merging tables with different join types

## 1.1 Left join
A left join returns all rows from the left table and only those rows from the right table where key columns match. 

Left join syntax:

df1_df2 = df1.merge(df2, on = "common_column", how = "left")

A left join keeps every row from the left table. Rows with no match in the right table are kept, with missing values in the right table's columns. If a row in the left table matches multiple rows in the right table, one output row is returned for each match. The result therefore has at least as many rows as the left table.
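A minimal sketch of the row-count behaviour, using two made-up tables (`customers` and `orders` and their columns are hypothetical, not from the course data):

```python
import pandas as pd

# Hypothetical tables: one customer can have several orders.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "order_total": [20.0, 35.5, 12.0]})

customers_orders = customers.merge(orders, on="customer_id", how="left")

# Customer 1 appears twice (two matching orders).
# Customer 2 has no order but is still kept, with NaN in order_total.
print(customers_orders)
print(len(customers), len(customers_orders))  # 3 rows in, 4 rows out
```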


## 1.2 Right join

A right join returns all rows from the right table and includes only those rows from the left table that have matching values. It is the mirror image of the left join.

If the key columns have different names in the two tables, use the left_on and right_on arguments instead of on.

Right join syntax:

df1_df2 = df1.merge(df2, how = "right", left_on = "df1_column_name", right_on = "df2_column_name")
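As a rough illustration, assume two small made-up tables whose key columns are named differently (`id` in `movies`, `movie_id` in `ratings`; both tables are hypothetical):

```python
import pandas as pd

# Hypothetical tables with differently named key columns.
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({"movie_id": [2, 3, 4], "rating": [7.1, 8.3, 6.0]})

movies_ratings = movies.merge(ratings, how="right", left_on="id", right_on="movie_id")

# Every row of ratings is kept. movie_id 4 has no match in movies,
# so its title (and id) are missing. Both key columns appear in the result.
print(movies_ratings)
```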


## 1.3 Outer join
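An outer join returns all rows from both tables, regardless of whether there is a match between them. The suffixes argument used below distinguishes non-key columns that have the same name in both tables.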

Outer join syntax:
 
df1_df2 = df1.merge(df2, on = "common_column", how = "outer", suffixes = ("_tab1", "_tab2"))
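A small sketch of what the suffixes do, with two hypothetical tables that share a non-key column called `score`:

```python
import pandas as pd

# Hypothetical tables: both have an "id" key and a "score" column.
tab1 = pd.DataFrame({"id": [1, 2], "score": [10, 20]})
tab2 = pd.DataFrame({"id": [2, 3], "score": [200, 300]})

merged = tab1.merge(tab2, on="id", how="outer", suffixes=("_tab1", "_tab2"))

# ids 1, 2 and 3 are all present.
# score_tab1 is missing for id 3, score_tab2 is missing for id 1.
print(merged)
```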


## 1.4 Self join

A self join merges a table with itself. By default it is an inner join, but other join types, e.g. a left join, work too.
It is useful when you want to compare values in a column to other values in the same column.

You are likely to merge a table with itself:
1. When working with tables that have a hierarchical relationship, e.g. employee and manager.

2. On sequential relationships, such as logistic movements.

3. On graph data, such as a network of friends.
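The hierarchical case can be sketched as follows. The `employees` table, its columns, and the use of 0 as a "no manager" marker are all assumptions made for illustration:

```python
import pandas as pd

# Hypothetical table: manager_id points at another row's emp_id.
employees = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "name": ["Dana", "Eli", "Fay"],
    "manager_id": [0, 1, 1],  # 0 is a made-up marker meaning "no manager"
})

# Left self join: match each employee's manager_id to the manager's emp_id.
emp_mgr = employees.merge(
    employees,
    how="left",
    left_on="manager_id",
    right_on="emp_id",
    suffixes=("_emp", "_mgr"),
)

# Dana has no manager, so name_mgr is missing for her row.
print(emp_mgr[["name_emp", "name_mgr"]])
```

A left join is used here so that the employee without a manager is not dropped, which an inner join would do.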




We have a table crews that contains data about the people working to produce a film.
Use a self join to pair each director with the other crew members who worked on the same movie.

In [1]:
# import libraries
import pandas as pd

In [2]:
# load the dataframe
crews = pd.read_pickle("crews.p")

In [3]:
# view the head
crews.head()

Unnamed: 0,id,department,job,name
0,19995,Editing,Editor,Stephen E. Rivkin
2,19995,Sound,Sound Designer,Christopher Boyes
4,19995,Production,Casting,Mali Finn
6,19995,Directing,Director,James Cameron
7,19995,Writing,Writer,James Cameron


In [4]:
# inspect the dataframe information
crews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42502 entries, 0 to 129580
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          42502 non-null  int64 
 1   department  42502 non-null  object
 2   job         42502 non-null  object
 3   name        42502 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.6+ MB


In [5]:
# To a variable called crews_self_merged, merge the crews table to itself on the id column using an inner join, 
# setting the suffixes to '_dir' and '_crew' for the left and right tables respectively.

crews_self_merged = crews.merge(crews, on = "id", how = "inner", suffixes = ("_dir", "_crew"))
crews_self_merged.head()

Unnamed: 0,id,department_dir,job_dir,name_dir,department_crew,job_crew,name_crew
0,19995,Editing,Editor,Stephen E. Rivkin,Editing,Editor,Stephen E. Rivkin
1,19995,Editing,Editor,Stephen E. Rivkin,Sound,Sound Designer,Christopher Boyes
2,19995,Editing,Editor,Stephen E. Rivkin,Production,Casting,Mali Finn
3,19995,Editing,Editor,Stephen E. Rivkin,Directing,Director,James Cameron
4,19995,Editing,Editor,Stephen E. Rivkin,Writing,Writer,James Cameron


In [6]:
# Create a Boolean index, named boolean_filter, that selects rows from the left table with the job of 'Director'
#  and avoids rows with the job of 'Director' in the right table.

boolean_filter = ((crews_self_merged["job_dir"] == "Director") & (crews_self_merged["job_crew"] != "Director"))
direct_crews = crews_self_merged[boolean_filter]
direct_crews.head()

# With the output, you can quickly see different movie directors and the people they worked with in the same movie.

Unnamed: 0,id,department_dir,job_dir,name_dir,department_crew,job_crew,name_crew
156,19995,Directing,Director,James Cameron,Editing,Editor,Stephen E. Rivkin
157,19995,Directing,Director,James Cameron,Sound,Sound Designer,Christopher Boyes
158,19995,Directing,Director,James Cameron,Production,Casting,Mali Finn
160,19995,Directing,Director,James Cameron,Writing,Writer,James Cameron
161,19995,Directing,Director,James Cameron,Art,Set Designer,Richard F. Mays


## 1.5 Merging on indexes

DataFrame indexes often hold a unique id that we can use when merging two tables together. If no index is set, pandas uses the default integer index 0, 1, 2, 3, ...

You can set a column as the index when importing a file, e.g. by passing index_col = "id" to pd.read_csv.

### 1.5.1 Merging on a single index
df1_df2 = df1.merge(df2, on = "index_column", how = "left")
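A minimal sketch, assuming two hypothetical tables that both use `id` as their index name:

```python
import pandas as pd

# Hypothetical tables, both indexed by "id".
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]}).set_index("id")
taglines = pd.DataFrame({"id": [1, 3], "tagline": ["x", "y"]}).set_index("id")

# "on" may name an index level as well as a column.
movies_taglines = movies.merge(taglines, on="id", how="left")

# Movie 2 has no tagline, so it is kept with a missing value.
print(movies_taglines)
```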

### 1.5.2 Merging on a multiIndex
df1_df2 = df1.merge(df2, on = ["index_col1", "index_col2"])
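The same idea with a two-level index can be sketched like this (the `cast` and `roles` tables and their values are made up for illustration):

```python
import pandas as pd

# Hypothetical tables, both indexed by (movie_id, cast_id).
cast = pd.DataFrame(
    {"movie_id": [10, 20], "cast_id": [1, 2], "name": ["P", "Q"]}
).set_index(["movie_id", "cast_id"])
roles = pd.DataFrame(
    {"movie_id": [10, 20, 20], "cast_id": [1, 2, 3], "character": ["X", "Y", "Z"]}
).set_index(["movie_id", "cast_id"])

# A row matches only when BOTH index levels agree (default inner join).
cast_roles = cast.merge(roles, on=["movie_id", "cast_id"])
print(cast_roles)
```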

### 1.5.3 Index merge with left_on and right_on
This applies when the index names differ between the two tables. Because we are merging on indexes, set left_index and right_index to True. They tell the merge method to match on each table's index, whatever it is named. Recent pandas versions raise a MergeError if left_on is passed together with left_index = True (or right_on together with right_index = True), so pass left_on or right_on only for a side whose key is a column rather than the index.

df1_df2 = df1.merge(df2, left_index = True, right_index = True)
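A hedged sketch of an index merge where the index names differ. The tables are hypothetical, and the sketch assumes a recent pandas version, where left_on cannot be combined with left_index = True on the same side:

```python
import pandas as pd

# Hypothetical tables: the indexes are named "id" and "movie_id".
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]}).set_index("id")
genres = pd.DataFrame(
    {"movie_id": [1, 1, 3], "genre": ["Drama", "Action", "Comedy"]}
).set_index("movie_id")

# Both keys are indexes: match index to index, names do not need to agree.
movies_genres = movies.merge(genres, left_index=True, right_index=True)
print(movies_genres)

# Mixed case: the key is a column on the left and the index on the right.
movies_col = movies.reset_index()
movies_genres_mixed = movies_col.merge(genres, left_on="id", right_index=True)
print(movies_genres_mixed)
```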
