# Pandas merge()

*sources: Pandas.pydata.org, Kyle Stratis (Real Python), Cameron Riddell (Two Sigma), data sourced from nflverse github: https://github.com/nflverse/nfldata/blob/master/DATASETS.md*

### Concat vs Merge
We have already reviewed how to concatenate datasets. Now we're going to talk about `merge()`, which does something similar.

Quick refresher: with `concat()` you can combine DataFrames across rows or columns.

You can use `pandas.concat` to join *many* DataFrames or Series, whereas `merge` is used to join **two** DataFrames or named Series.

### The key differences are:
- `pandas.concat`: join many DataFrames according to either their row or column indices (horizontally or vertically).
- `pandas.merge`: join two DataFrames horizontally on specific column(s) (or the index [optional])

There is a third (and similar) method, `join`, which we are largely going to ignore for now. Under the hood `join` does the same work as `merge`, but by default it joins solely on the indices:
- `DataFrame.join`: join two DataFrames horizontally on their indices.
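As a rough illustration of that relationship, here is a minimal sketch on two made-up frames (`left` and `right` are illustrative names, not part of the NFL data). It assumes the documented default of `how="left"` for `join`, so the matching `merge` call passes `how="left"` together with `left_index=True` and `right_index=True`.

```python
import pandas as pd

# Hypothetical toy frames; the index labels overlap only on 'b' and 'c'
left = pd.DataFrame({"x": [0, 2, 4]}, index=["a", "b", "c"])
right = pd.DataFrame({"z": [7, 9, 11]}, index=["b", "c", "d"])

# join aligns on the index and keeps every row of `left` by default
joined = left.join(right)

# the same alignment spelled out with merge
merged = left.merge(right, left_index=True, right_index=True, how="left")

print(joined)
print(joined.equals(merged))
```

Row `a` has no partner in `right`, so its `z` value comes back as `NaN` in both results.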

We use `merge()` for combining data on common columns or indices, and for those familiar with SQL or database theory it defaults to an `inner` join.



In [4]:
from numpy import arange
import pandas as pd

data = arange(0, 18).reshape(3, 3, 2)
# take note where these indexes and columns overlap and where they do not

dfs = {
    'df1' : pd.DataFrame(data[0], index=['a', 'b', 'c'          ], columns=['x', 'y']), 
    'df2' : pd.DataFrame(data[1], index=[     'b', 'c', 'd'     ], columns=['x', 'z']),
    'df3' : pd.DataFrame(data[2], index=[     'b',      'd', 'e'], columns=['x', 'z'])
}

print(
    dfs['df1'], dfs['df2'], dfs['df3'],
    #    x  y   #    x  z   #    x   z
    # a  0  1   # b  6  7   # b  12  13 
    # b  2  3   # c  8  9   # d  14  15
    # c  4  5   # d  10 11  # e  16  17 
    sep='\n\n'
)

# passing a dictionary of DataFrames (or, equivalently, a list plus pd.concat(..., keys=...))
# will create a MultiIndex along the concatenation axis to label where each part originated
# the levels of that index can be named via the pd.concat(..., names=...) argument.
print(
    pd.concat(dfs, axis=0, join='outer'), #align vertically, preserving all columns
    #          x    y   z 
    # df1  a   0  1.0   NaN
    #      b   2  3.0   NaN  
    #      c   4  5.0   NaN
    # df2  b   6  NaN   7.0
    #      c   8  NaN   9.0
    #      d  10  NaN  11.0
    # df3  b  12  NaN  13.0
    #      d  14  NaN  15.0
    #      e  16  NaN  17.0

    pd.concat(dfs, axis=0, join='inner'), # align vertically, preserving fully overlapping columns
    #          x    
    # df1  a   0  
    #      b   2  
    #      c   4  
    # df2  b   6  
    #      c   8  
    #      d  10  
    # df3  b  12  
    #      d  14  
    #      e  16  

    pd.concat(dfs, axis=1, join='outer'), # align horizontally, preserving all row indices
    #       df1       df2          df3
    #       x     y     x     z      x     z 
    #  a  0.0   1.0   NaN   NaN    NaN    NaN
    #  b  2.0   3.0   6.0   7.0   12.0   13.0
    #  c  4.0   5.0   8.0   9.0    NaN    NaN
    #  d  NaN   NaN  10.0  11.0   14.0   15.0
    #  e  NaN   NaN   NaN   NaN   16.0   17.0


    pd.concat(dfs, axis=1, join='inner'), # align horizontally preserving fully shared row indices
    #   df1      df2      df3
    #     x  y   x   z   x    z
    #  b  2  3   6   7   12  13
    sep='\n\n',
)

   x  y
a  0  1
b  2  3
c  4  5

    x   z
b   6   7
c   8   9
d  10  11

    x   z
b  12  13
d  14  15
e  16  17
        x    y     z
df1 a   0  1.0   NaN
    b   2  3.0   NaN
    c   4  5.0   NaN
df2 b   6  NaN   7.0
    c   8  NaN   9.0
    d  10  NaN  11.0
df3 b  12  NaN  13.0
    d  14  NaN  15.0
    e  16  NaN  17.0

        x
df1 a   0
    b   2
    c   4
df2 b   6
    c   8
    d  10
df3 b  12
    d  14
    e  16

   df1        df2         df3      
     x    y     x     z     x     z
a  0.0  1.0   NaN   NaN   NaN   NaN
b  2.0  3.0   6.0   7.0  12.0  13.0
c  4.0  5.0   8.0   9.0   NaN   NaN
d  NaN  NaN  10.0  11.0  14.0  15.0
e  NaN  NaN   NaN   NaN  16.0  17.0

  df1    df2    df3    
    x  y   x  z   x   z
b   2  3   6  7  12  13


## Merge Syntax 

When you want to combine data objects based on one or more keys, similar to what you'd do in a relational database, `merge()` is the tool you need. More specifically, `merge()` is most useful when you want to combine rows that share data.

**Syntax:**

`DataFrame.merge(parameters)`

`DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)`

**Merge Parameters:**
There are a few parameters you can leverage; these are some of the most common and powerful ones:
- **right** : the Object to merge with (either a DataFrame or named Series)


- **how** : The type of merge to be performed, one of `{'left', 'right', 'outer', 'inner', 'cross'}`; the default is `'inner'`
    - *left*: use only keys from left frame, similar to a SQL left outer join; preserve key order.

    - *right*: use only keys from right frame, similar to a SQL right outer join; preserve key order.

    - *outer*: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

    - *inner*: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

    - *cross*: creates the cartesian product from both frames, preserves the order of the left keys.
    

- **on**: a label or list. The **column** or **index level names** to join on. 
    - *These must be found in both DataFrames.* 
    - If **on** is `None` and you are not merging on indexes, it defaults to the intersection of the columns in both DataFrames, that is, all the columns with the same name.
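To make the `how` options concrete, here is a small sketch that counts the rows each one returns. The frames, the `wins` numbers, the `XYZ` team and the hex values are all invented placeholders, and `how="cross"` assumes pandas 1.2 or newer.

```python
import pandas as pd

# Hypothetical toy frames that share one column name, "team"
left = pd.DataFrame({"team": ["BUF", "PIT", "SEA"], "wins": [1, 2, 3]})
right = pd.DataFrame({"team": ["PIT", "SEA", "XYZ"],
                      "color": ["#111111", "#222222", "#333333"]})

# `on` is omitted, so merge falls back to the shared column "team"
# (a cross merge ignores keys and pairs every left row with every right row)
row_counts = {
    how: len(left.merge(right, how=how))
    for how in ["inner", "left", "right", "outer", "cross"]
}

print(row_counts)
# {'inner': 2, 'left': 3, 'right': 3, 'outer': 4, 'cross': 9}
```

Only `PIT` and `SEA` appear in both frames, which is why the inner merge returns two rows while the outer merge returns the four distinct keys.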

Let's see some of these parameters in practice. 
We are going to read three CSV files and save them as dataframes:
1. NFL Roster data from 2006-2019 from `rosters.csv`
2. NFL Team Colors from `teamcolors.csv`
3. Team Data from 2019 from `2019_team_data.csv`


In [1]:
import pandas as pd

roster = pd.read_csv("rosters.csv")
colors = pd.read_csv("teamcolors.csv")
team_data_2019 = pd.read_csv("2019_team_data.csv")

Let's check out what each of these CSVs contains:

In [2]:
roster

| Unnamed: 0 | season | team | playerid | full_name | name | side | category | position | games | starts | years | av |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | ARI | LeinMa00 | Matt Leinart | M.Leinart | O | QB | QB | 12.0 | 11.0 | 0 | 8.0 |
| 1 | 2006 | ARI | LewiJo22 | Jonathan Lewis | J.Lewis | | | | 4.0 | 0.0 | 0 | 0.0 |
| 2 | 2006 | ARI | LiwiCh20 | Chris Liwienski | C.Liwienski | O | OL | LG | 16.0 | 6.0 | 8 | 4.0 |
| 3 | 2006 | ARI | ArriJ.00 | J.J. Arrington | J.Arrington | | | | 16.0 | 0.0 | 1 | 2.0 |
| 4 | 2006 | ARI | LutuDe20 | Deuce Lutui | D.Lutui | O | OL | RG | 15.0 | 9.0 | 0 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28612 | 2019 | WAS | IoanMa01 | Matthew Ioannidis | M.Ioannidis | D | DL | DT | | | 3 | |
| 28613 | 2019 | WAS | AlexAd00 | Adonis Alexander | A.Alexander | | | | | | 1 | |
| 28614 | 2019 | WAS | BergTo00 | Tony Bergstrom | T.Bergstrom | O | OL | OT | | | 7 | |
| 28615 | 2019 | WAS | WoodJo01 | Josh Woodrum | J.Woodrum | O | QB | QB | | | 1 | |


The `roster` dataframe includes team rosters (by season year and team name), with players' names, positions, games played, games started, and years playing.

In [None]:
colors

The `colors` dataframe includes a list of the NFL teams and their corresponding:
- color: The team's primary color in hexadecimal
- color2: The team's secondary color in hexadecimal
- color3: The team's tertiary color in hexadecimal
- color4: The team's quaternary color in hexadecimal

NOTE: there are only 32 teams in the NFL. Take a look at the index range for this dataframe. Anything strange?

In [None]:
colors.index

40? Where are the 8 extra teams coming from? You will notice this dataset repeats some teams under the team abbreviation used by Pro Football Focus.
Don't worry about this for now; we will clean this up ourselves.
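One way to investigate a row count that is higher than expected is to look for rows that repeat another row's values under a different label. The sketch below uses an invented stand-in (`toy_colors`, with placeholder hex values), not the real `teamcolors.csv`, and it assumes that an alias row carries the same color as the row it duplicates.

```python
import pandas as pd

# Hypothetical stand-in for `colors`: one club appears under two abbreviations
toy_colors = pd.DataFrame({
    "team": ["BUF", "LA", "LAR", "PIT"],
    "color": ["#111111", "#222222", "#222222", "#333333"],  # placeholder values
})

print(len(toy_colors))               # number of rows
print(toy_colors["team"].nunique())  # number of distinct abbreviations

# keep=False flags every member of a repeated group, not just the later copies
aliases = toy_colors[toy_colors.duplicated(subset=["color"], keep=False)]
print(aliases)
```

On real data two different teams could share a color, so a check like this only suggests candidates to inspect by eye.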

In [None]:
team_data_2019

The `team_data_2019` dataframe lists all 32 of the NFL teams, with the following columns:

- team: The team in question. This consistently favors JAX over JAC, and LAR over LA.
- nfl: The team abbreviation used by the NFL.
- pfr: The team abbreviation used by Pro Football Reference.
- pff: The team number used by Pro Football Focus.
- pfflabel: The team abbreviation used by Pro Football Focus.
- fo: The team abbreviation used by Football Outsiders.
- full: The long form name of the team, with both the location and nickname written out.
- location: The part of the team name that identifies where it plays. Uses St Louis for Saint Louis. Still includes the nickname if it is ambiguous.
- short_location: The part of the team name that identifies where it plays, except Los Angeles is shortened to LA and New York is shortened to NY. Uses St. Louis for Saint Louis. Still includes the nickname if it is ambiguous.
- nickname: The part of the team name that identifies its mascot.

We're going to ignore the last 4 columns.  

## Merge Examples

First, let's take a look at `team_data_2019` and see how we can use pandas' `merge` to add each team's colors to this dataframe.

A refresher on the syntax:

`DataFrame.merge(dataframe, parameters)`


Let's try merging the dataframes with the default `merge` settings.

In [None]:
# merge by default performs an `inner` join when merging

#                    V data that is being merged
team_data_2019.merge(colors)
# ^ data being merged on to

By default, `merge()` will perform an `inner` join when merging.

What happens when we set the parameter to `how="left"`?

In [None]:
# how="left" will return the same results as an inner join in this case

team_data_2019.merge(colors, how="left")

# how="left" keeps every row of the data object to the left of .merge (team_data_2019);
# rows from colors are attached only where their keys match

What about when we merge using the `on=` parameter?

In [None]:
# the `on=` parameter lets you name a specific key column that is present in BOTH dataframes

team_data_2019.merge(colors, on="team")
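Naming the key matters most when the two frames share more than one column name. The sketch below uses two invented frames (`teams` and `palette`) that both have `team` and `name` columns. Without `on`, both shared columns become keys; with `on="team"`, the leftover `name` columns are kept side by side using the default `suffixes=('_x', '_y')` from the signature.

```python
import pandas as pd

# Hypothetical frames that share TWO column names: "team" and "name"
teams = pd.DataFrame({"team": ["BUF", "PIT"], "name": ["Bills", "Steelers"]})
palette = pd.DataFrame({"team": ["BUF", "PIT"], "name": ["blue", "black"]})

# no `on`: both shared columns are used as keys, and no row matches on both
default_merge = teams.merge(palette)

# on="team": only "team" is a key; the two "name" columns get suffixes
keyed_merge = teams.merge(palette, on="team")

print(len(default_merge))         # 0
print(list(keyed_merge.columns))  # ['team', 'name_x', 'name_y']
```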


Let's look at the `how="left"` parameter once more, this time paying attention to which rows are kept:

In [None]:
# what happens when we perform a merge with the parameter how="left"?

team_data_2019.merge(colors, how="left")
# you've kept every row (every key) of the dataframe to the left of the merge

What about when we merge using the `how="right"` parameter?

In [None]:
# what happens when we perform a merge with the parameter how="right"?

team_data_2019.merge(colors, how="right")
# you've kept every row (every key) of the dataframe to the right of the merge, in its original order
# fields with no match in the left dataframe will be filled in with NaN

What about when we merge using the `how="outer"` parameter?

In [None]:
# what happens when we perform a merge with the parameter how="outer"?

team_data_2019.merge(colors, how="outer")
# you've kept every row from BOTH dataframes (the union of their keys).
# fields with no match on the other side will be filled in with NaN.
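One way to check which side each row of an outer merge came from is the `indicator` parameter listed in the signature earlier. A minimal sketch on invented frames (the `XYZ` team and hex values are placeholders):

```python
import pandas as pd

# Hypothetical frames: BUF exists only on the left, XYZ only on the right
left = pd.DataFrame({"team": ["BUF", "PIT"], "nickname": ["Bills", "Steelers"]})
right = pd.DataFrame({"team": ["PIT", "XYZ"], "color": ["#111111", "#222222"]})

# indicator=True adds a "_merge" column: left_only, right_only, or both
outer = left.merge(right, how="outer", indicator=True)

origin = dict(zip(outer["team"], outer["_merge"].astype(str)))
print(outer)
print(origin)
```

Filtering on `outer["_merge"] != "both"` is a quick way to list the keys that failed to match.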

#### Let's Switch things up.  Literally.

What if we flip the order of the dataframes when defining our merge?

Instead of calling `team_data_2019.merge()`, let's try `colors.merge()` :

In [None]:
colors.merge(team_data_2019, how="left")

What do you notice?


### QUICK NOTE: 
Performing these merges didn't alter either of the original dataframes. If you want to use the merged data again, save the result in a new variable, or assign it back to `team_data_2019` or `colors` to overwrite the original.
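A quick way to convince yourself of this is to compare a frame's columns before and after a merge. The frames below are invented for illustration and use placeholder hex values.

```python
import pandas as pd

# Hypothetical frames standing in for team data and colors
left = pd.DataFrame({"team": ["BUF", "PIT"], "nickname": ["Bills", "Steelers"]})
right = pd.DataFrame({"team": ["BUF", "PIT"], "color": ["#111111", "#222222"]})

columns_before = list(left.columns)

# merge returns a brand-new DataFrame; nothing is written back to `left`
merged = left.merge(right)

print(list(left.columns) == columns_before)  # True
print(list(merged.columns))                  # ['team', 'nickname', 'color']
```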

For our purposes, let's:
1. create a new variable, that
2. returns the 32 NFL teams data merged with their team colors.

In [None]:
new_team_data = team_data_2019.merge(colors)
new_team_data

# Exercises

## Exercise #1: preparing for the big game


Let's pretend it is 2019, and we have the game schedule.  Leif & Sarah's favorite teams are scheduled to play against each other.

They are organizing a game viewing party, and want to decorate the game snacks accordingly.

Sarah's favorite team is the Pittsburgh Steelers.
Leif's favorite team is the Buffalo Bills.

We have a dataframe that includes all of the team rosters from 2006-2019 saved in `roster`.  
Start by creating a dataframe for each team's full 2019 roster:

In [None]:
season2019 = roster[roster["season"]==2019]

steelers_2019_roster = season2019[season2019["team"]=="PIT"]
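The two filtering steps can also be written as a single boolean mask. The sketch below runs on an invented miniature (`toy_roster`) rather than the real `roster` data. Each condition needs its own parentheses because `&` binds more tightly than `==`.

```python
import pandas as pd

# Hypothetical miniature of `roster` with only the two columns used for filtering
toy_roster = pd.DataFrame({
    "season": [2018, 2019, 2019, 2019],
    "team": ["PIT", "PIT", "BUF", "PIT"],
})

# two steps, as in the cell above
two_step = toy_roster[toy_roster["season"] == 2019]
two_step = two_step[two_step["team"] == "PIT"]

# one step: combine both conditions with &
one_step = toy_roster[(toy_roster["season"] == 2019) & (toy_roster["team"] == "PIT")]

print(one_step.equals(two_step))  # True
```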

In [None]:
steelers_2019_roster

In [None]:
bills_2019_roster = season2019[season2019["team"]=="BUF"]

In [None]:
bills_2019_roster

OK, we can't see the full rosters in either dataframe.

You can use `pd.set_option` to define the number of `rows` or `columns` you want visible.

In [None]:
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 30)
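`pd.set_option` changes the display settings for the rest of the session. If you only want a wider view for one cell, `pd.option_context` is an alternative worth knowing. The sketch below only reads the option back with `pd.get_option` to show that the change is temporary.

```python
import pandas as pd

before = pd.get_option("display.max_rows")

# the new value applies only inside the `with` block
with pd.option_context("display.max_rows", 300):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")

print(inside)           # 300
print(after == before)  # True
```

To undo an earlier `pd.set_option` call, `pd.reset_option("display.max_rows")` restores the default.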

Let's re-run our team variables:

In [None]:
steelers_2019_roster 

In [None]:
bills_2019_roster

They've decided they are going to decorate cupcakes with jersey names for their favorite players, and do it together.

So they aren't working off of two lists, let's use `concat` to create one master list with both teams' full 2019 rosters.

In [None]:
steelers_and_bills = pd.concat([steelers_2019_roster, bills_2019_roster], axis="index")
steelers_and_bills
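Note that `concat` keeps each row's original index label, so the combined roster will not be numbered from zero. A small sketch with invented frames and player names shows the difference that `ignore_index=True` makes:

```python
import pandas as pd

# Hypothetical roster slices that kept the index labels of a larger frame
pit = pd.DataFrame({"full_name": ["Player A", "Player B"]}, index=[10, 11])
buf = pd.DataFrame({"full_name": ["Player C"]}, index=[3])

stacked = pd.concat([pit, buf], axis="index")
renumbered = pd.concat([pit, buf], axis="index", ignore_index=True)

print(list(stacked.index))     # [10, 11, 3]
print(list(renumbered.index))  # [0, 1, 2]
```

Either form works for the merge that follows, since merging on columns does not use the index.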


## Exercise 2: getting the decorating committee organized

Before Leif and Sarah can start decorating, they need to know the exact hex colors to make the cupcake icing.

Use the dataframes we have created thus far to create a master list for Leif and Sarah of all of the players from each team (in 2019) and the full details for each player's team (location, colors, etc.).

In [None]:
master_list = steelers_and_bills.merge(new_team_data) 
master_list 
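This is a many-to-one merge: many players map to one team row. The `validate` parameter from the signature can assert that shape, which guards against duplicated team rows silently multiplying players. The frames and player names below are invented for illustration.

```python
import pandas as pd

# Hypothetical frames: several players per team, one row per team
players = pd.DataFrame({
    "team": ["PIT", "PIT", "BUF"],
    "full_name": ["Player A", "Player B", "Player C"],
})
teams = pd.DataFrame({"team": ["PIT", "BUF"], "nickname": ["Steelers", "Bills"]})

# "m:1" checks that each key appears at most once in the right frame
checked = players.merge(teams, on="team", validate="m:1")

# with a duplicated team row on the right, the same call raises MergeError
dup_teams = pd.concat([teams, teams.iloc[[0]]], ignore_index=True)
try:
    players.merge(dup_teams, on="team", validate="m:1")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(checked), raised)  # 3 True
```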

Whoa, that is a lot of information that we probably don't need.

We only care about the following columns:
- season
- team
- full_name
- name
- side
- position
- nickname
- color
- color2
- color3
- color4

In [None]:
master_list[["season", "team", "full_name", "name", "side", "position", "nickname", "color", "color2", "color3", "color4"]] 

In [None]:
color_code = master_list[["season", "team", "full_name", "name", "side", "position", "nickname", "color", "color2", "color3", "color4"]] 
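# shade only the hex-color columns; in pandas 2.1+ Styler.applymap is deprecated in favor of Styler.map
color_code.style.applymap(lambda hex_color: f"background-color: {hex_color}", subset=["color", "color2", "color3", "color4"])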

### Merge Method (and Function): 
Thus far we have been using the following method syntax to `merge`:
`left_DataFrame.merge(right_DataFrame)`

There is a second way to use `merge`, as a top-level function!
`pd.merge(left_DataFrame, right_DataFrame)`

When you use the `pd.merge(left_DataFrame, right_DataFrame)` function, the first argument is treated as the left object in the join. 
So...
`pd.merge(left_DataFrame, right_DataFrame)` is the same as
`left_DataFrame.merge(right=right_DataFrame)`
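A short check of that equivalence, using two invented frames with placeholder hex values:

```python
import pandas as pd

# Hypothetical frames; the row order differs between them on purpose
left = pd.DataFrame({"team": ["BUF", "PIT"], "nickname": ["Bills", "Steelers"]})
right = pd.DataFrame({"team": ["PIT", "BUF"], "color": ["#111111", "#222222"]})

via_function = pd.merge(left, right, how="left")
via_method = left.merge(right, how="left")

print(via_function.equals(via_method))  # True
```

Which spelling to use is mostly a matter of taste; the method form chains more naturally after other DataFrame operations.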
