# Left Join

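A *left join* retrieves all records from the left table (A) and the matching records from the right table (B). Where a left-table record has no match, the columns that come from the right table are filled with `NaN`.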
![left join venn diagram](../images/left-join.png)
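
As a minimal sketch of that behaviour, the snippet below runs a left join on two tiny tables. The tables `left` and `right` and their values are hypothetical toy data made up for illustration; they are not the movie datasets used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical toy tables (illustration only, not the notebook's datasets).
left = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
right = pd.DataFrame({"id": [1, 3, 4], "budget": [100, 300, 400]})

# Left join: keep every row of `left`, attach `budget` where `id` matches.
left_joined = left.merge(right, on="id", how="left")

# id 2 has no match, so its budget is NaN; id 4 exists only in `right`
# and therefore does not appear in the result.
print(left_joined)
```

The result has the same number of rows as `left` here because each `id` appears at most once in `right`.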

In [1]:
# Import packages.
import pandas as pd

### Counting missing rows with left join

In [5]:
# Import datasets
movies = pd.read_pickle("../datasets/movies.pkl")
financials = pd.read_pickle("../datasets/financials.pkl")
# Print the first few rows of each dataset.
print(movies.head())
print(financials.head())

      id                 title  popularity release_date
0    257          Oliver Twist   20.415572   2005-09-23
1  14290  Better Luck Tomorrow    3.877036   2002-01-12
2  38365             Grown Ups   38.864027   2010-06-24
3   9672              Infamous    3.680896   2006-11-16
4  12819       Alpha and Omega   12.300789   2010-09-17
       id     budget       revenue
0   19995  237000000  2.787965e+09
1     285  300000000  9.610000e+08
2  206647  245000000  8.806746e+08
3   49026  250000000  1.084939e+09
4   49529  260000000  2.841391e+08


In [6]:
# Merge the movies table with the financials table with a left join
movies_financials = movies.merge(financials, on='id', how='left')

# Count the rows where the budget column is missing (no match in financials)
number_of_missing_fin = movies_financials['budget'].isnull().sum()

# Print the number of movies missing financials
print(number_of_missing_fin)

1574
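
Counting nulls in `budget` works as long as `budget` is never null in the financials table itself. A possible alternative, sketched below on hypothetical toy tables (`movies_toy` and `financials_toy` are made-up names and data), is the `indicator=True` option of `merge`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical toy tables (illustration only).
movies_toy = pd.DataFrame({"id": [1, 2, 3, 4], "title": ["A", "B", "C", "D"]})
financials_toy = pd.DataFrame({"id": [1, 3], "budget": [100, 300]})

# indicator=True adds a `_merge` column with the values
# "left_only", "right_only" or "both".
merged = movies_toy.merge(financials_toy, on="id", how="left", indicator=True)

# Rows marked "left_only" are movies without a financials record.
n_unmatched = (merged["_merge"] == "left_only").sum()
print(n_unmatched)
```

On this toy data both approaches agree: two movies have no financials record.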


### Enriching a dataset

In [10]:
# Subset movies to the Toy Story titles and import the taglines dataset.
toy_story = movies[movies["title"].str.contains("toy story", case=False)]
taglines = pd.read_pickle("../datasets/taglines.pkl")
# Print the first few rows of these datasets.
print(toy_story.head())
print(taglines.head())

         id        title  popularity release_date
103   10193  Toy Story 3   59.995418   2010-06-16
2637    863  Toy Story 2   73.575118   1999-10-30
3716    862    Toy Story   73.640445   1995-10-30
       id                                         tagline
0   19995                     Enter the World of Pandora.
1     285  At the end of the world, the adventure begins.
2  206647                           A Plan No One Escapes
3   49026                                 The Legend Ends
4   49529            Lost in our world, found in another.


An *inner join* retrieves only the records from the left table (A) and the right table (B) that match each other; this is also known as an intersection.
![inner join venn diagram](../images/inner-join.png)
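
To make the difference in row counts concrete, here is a small sketch that runs both join types on the same pair of tables. The tables `left` and `right` are hypothetical toy data, not the notebook's datasets.

```python
import pandas as pd

# Hypothetical toy tables (illustration only).
left = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
right = pd.DataFrame({"id": [1, 3, 4], "tagline": ["x", "y", "z"]})

# Inner join: only ids present in both tables survive (1 and 3).
inner_joined = left.merge(right, on="id", how="inner")

# Left join: every id from `left` survives; id 2 gets a NaN tagline.
left_joined = left.merge(right, on="id", how="left")

print(inner_joined.shape)  # (2, 3)
print(left_joined.shape)   # (3, 3)
```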

In [11]:
# Merge the toy_story and taglines tables with a left join
toystory_tag = toy_story.merge(taglines, how="left", on="id")

# Print the rows and shape of toystory_tag
print(toystory_tag)
print(toystory_tag.shape)

      id        title  popularity release_date                   tagline
0  10193  Toy Story 3   59.995418   2010-06-16  No toy gets left behind.
1    863  Toy Story 2   73.575118   1999-10-30        The toys are back!
2    862    Toy Story   73.640445   1995-10-30                       NaN
(3, 5)


In [12]:
# Merge the toy_story and taglines tables with an inner join
toystory_tag = toy_story.merge(taglines, how="inner", on="id")

# Print the rows and shape of toystory_tag
print(toystory_tag)
print(toystory_tag.shape)

      id        title  popularity release_date                   tagline
0  10193  Toy Story 3   59.995418   2010-06-16  No toy gets left behind.
1    863  Toy Story 2   73.575118   1999-10-30        The toys are back!
(2, 5)


# Other Joins

### Right join to find unique movies

A *right join* retrieves the matching records from the left table (A) and all records from the right table (B).
![right join venn diagram](../images/right-join.png)
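
The sketch below illustrates this on hypothetical toy tables (`left` and `right` are made up for illustration). It also shows that a right join keeps the same rows as a left join with the two tables swapped; only the column order differs.

```python
import pandas as pd

# Hypothetical toy tables (illustration only).
left = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
right = pd.DataFrame({"id": [1, 3, 4], "genre": ["x", "y", "z"]})

# Right join: keep every row of `right`; id 4 has no title, so it is NaN.
right_joined = left.merge(right, on="id", how="right")

# The same rows come out of a left join with the tables swapped.
swapped = right.merge(left, on="id", how="left")

print(right_joined)
print(swapped)
```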

In [18]:
# Import the movie-to-genre mapping dataset
genres = pd.read_pickle("../datasets/movie_to_genres.pkl")
# subset the genres table for action and science fiction
action_movies = genres[genres["genre"].str.contains("action", case=False)]
scifi_movies = genres[genres["genre"].str.contains("science fiction", case=False)]
# print the subsets.
print(action_movies.head())
print(scifi_movies.head())

    movie_id   genre
3         11  Action
14        18  Action
25        22  Action
26        24  Action
42        58  Action
    movie_id            genre
2         11  Science Fiction
17        18  Science Fiction
20        19  Science Fiction
38        38  Science Fiction
49        62  Science Fiction


In [19]:
# Merge action_movies with scifi_movies using a right join
action_scifi = action_movies.merge(scifi_movies, on='movie_id', how='right',
                                   suffixes=('_act','_sci'))

# From action_scifi, select only the rows where the genre_act column is null
scifi_only = action_scifi[action_scifi['genre_act'].isnull()]

# Merge the movies and scifi_only tables with an inner join
movies_and_scifi_only = movies.merge(scifi_only, left_on="id", right_on="movie_id", how="inner")

# Print the first few rows and shape of movies_and_scifi_only
print(movies_and_scifi_only.head())
print(movies_and_scifi_only.shape)

      id                         title  popularity release_date  movie_id  \
0  18841  The Lost Skeleton of Cadavra    1.680525   2001-09-12     18841   
1  26672     The Thief and the Cobbler    2.439184   1993-09-23     26672   
2  15301      Twilight Zone: The Movie   12.902975   1983-06-24     15301   
3   8452                   The 6th Day   18.447479   2000-11-17      8452   
4   1649    Bill & Ted's Bogus Journey   11.349664   1991-07-19      1649   

  genre_act        genre_sci  
0       NaN  Science Fiction  
1       NaN  Science Fiction  
2       NaN  Science Fiction  
3       NaN  Science Fiction  
4       NaN  Science Fiction  
(258, 7)
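
The right join followed by a null filter is one way to find science fiction movies that are not also action movies. As a cross-check, the sketch below compares it with a plain `isin` filter on hypothetical toy tables (`action` and `scifi` are made-up data, not the notebook's `genres` table). It is an alternative for illustration, not a replacement for the approach above.

```python
import pandas as pd

# Hypothetical toy genre tables (illustration only).
action = pd.DataFrame({"movie_id": [11, 18, 22], "genre": ["Action"] * 3})
scifi = pd.DataFrame(
    {"movie_id": [11, 18, 19, 38], "genre": ["Science Fiction"] * 4}
)

# Approach used in the cell above: right join, then keep rows whose
# action-side genre is null.
joined = action.merge(scifi, on="movie_id", how="right",
                      suffixes=("_act", "_sci"))
scifi_only_join = joined[joined["genre_act"].isnull()]

# Alternative: keep scifi rows whose movie_id is not among the action ids.
scifi_only_isin = scifi[~scifi["movie_id"].isin(action["movie_id"])]

print(sorted(scifi_only_join["movie_id"]))  # [19, 38]
print(sorted(scifi_only_isin["movie_id"]))  # [19, 38]
```

The join-based version keeps the suffixed `genre_act` and `genre_sci` columns, which the later merge with `movies` displays; the `isin` version returns only the original `scifi` columns.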
