# Advanced Pandas: combining data
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp25&branch=main&urlpath=tree%2Fdata271_sp25%2Flectures%2Fdata271_lec29_live.ipynb) instead. 
If you don't want to type and want to follow along just by executing the cells, stay in this notebook. 

In [None]:
import numpy as np
import pandas as pd

## Combining Data

In [None]:
# Create the first dataframe
df1 = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Luke Danes', 'Emily Gilmore'],
    'Occupation': ['Manager', 'Student', 'Owner', 'Socialite'],
    'Age': [32, 20, 40, 60]
})

# Create the second dataframe
df2 = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Sookie St. James', 'Richard Gilmore'],
    'Home': ['Stars Hollow', 'Stars Hollow', 'Stars Hollow', 'Hartford']
})

In [None]:
df1

In [None]:
df2

### Merge

In [None]:
# A standard merge (inner)
df1.merge(df2)

In [None]:
# Explicitly specifying what to merge by (same as before)
df1.merge(df2, on = 'Name')
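
When `on` is omitted, `merge` uses the column names the two dataframes share as the keys. The sketch below checks that on copies of the frames above. The names `people`, `homes`, and `shared` are hypothetical, chosen so the cell does not overwrite `df1` or `df2`.

```python
import pandas as pd

# Copies of the lecture data under different (hypothetical) names
people = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Luke Danes', 'Emily Gilmore'],
    'Occupation': ['Manager', 'Student', 'Owner', 'Socialite'],
    'Age': [32, 20, 40, 60]
})
homes = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Sookie St. James', 'Richard Gilmore'],
    'Home': ['Stars Hollow', 'Stars Hollow', 'Stars Hollow', 'Hartford']
})

# Column names that appear in both frames
shared = people.columns.intersection(homes.columns).tolist()

# Merge without `on`, then with the shared columns spelled out
implicit = people.merge(homes)
explicit = people.merge(homes, on=shared, how='inner')

print(shared)
print(implicit.equals(explicit))
```

Only the two names that appear in both frames survive the inner merge.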

In [None]:
# What if they had different column names?
df2.rename(columns = {'Name':'Character Name'},inplace=True)
df2

In [None]:
df1.columns

In [None]:
df2.columns

In [None]:
# Try merging now (error!)
df1.merge(df2)

In [None]:
# Specify one of the column names (error!)
df1.merge(df2, on = 'Name')

In [None]:
# How to specify both names if they are different
df1.merge(df2, left_on = 'Name',right_on = 'Character Name')

In [None]:
# Can drop the redundant column
df1.merge(df2, left_on = 'Name',right_on = 'Character Name').drop(columns = 'Character Name')

In [None]:
# Reset it back to original
df2.rename(columns = {'Character Name':'Name'},inplace=True)

In [None]:
# Outer
df1.merge(df2, how = 'outer')

In [None]:
# Use the indicator parameter to keep track of where the rows came from (good for debugging)
df1.merge(df2, how = 'outer',indicator=True)
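
One way to use the indicator for debugging is to filter on the `_merge` column it adds. This is a minimal sketch on copies of the frames above; `people`, `homes`, `counts`, and `left_only` are hypothetical names.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Luke Danes', 'Emily Gilmore'],
    'Occupation': ['Manager', 'Student', 'Owner', 'Socialite'],
    'Age': [32, 20, 40, 60]
})
homes = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Sookie St. James', 'Richard Gilmore'],
    'Home': ['Stars Hollow', 'Stars Hollow', 'Stars Hollow', 'Hartford']
})

merged = people.merge(homes, how='outer', indicator=True)

# How many rows came from the left frame only, the right frame only, or both
counts = merged['_merge'].astype(str).value_counts().to_dict()

# Names that exist in `people` but have no match in `homes`
left_only = merged.loc[merged['_merge'] == 'left_only', 'Name'].tolist()

print(counts)
print(left_only)
```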

In [None]:
# Left
df1.merge(df2,how = 'left')

In [None]:
# Right
df1.merge(df2,how = 'right')

In [None]:
# Cross join (less common, but occasionally handy)
df1.merge(df2,how='cross')
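
A cross join pairs every row of the left frame with every row of the right frame, so the result has `len(left) * len(right)` rows. The sketch below checks this on copies of the frames above (hypothetical names `people` and `homes`); it assumes a pandas version that supports `how='cross'`.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Luke Danes', 'Emily Gilmore'],
    'Occupation': ['Manager', 'Student', 'Owner', 'Socialite'],
    'Age': [32, 20, 40, 60]
})
homes = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Sookie St. James', 'Richard Gilmore'],
    'Home': ['Stars Hollow', 'Stars Hollow', 'Stars Hollow', 'Hartford']
})

cross = people.merge(homes, how='cross')

# No key is used, so the shared 'Name' column appears twice with suffixes
print(cross.shape)
print(cross.columns.tolist())
```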

In [None]:
# What if there are two common columns?
df1['School'] = ['Hartford Community College','Yale','Stars Hollow High','Smith College']
df2['School'] = ['Hartford Community College','Yale','Unknown','Yale']

In [None]:
# Test what a standard merge does
df1.merge(df2)

In [None]:
# Outer merge keeps every (Name, School) pair found in either dataframe
df1.merge(df2, how = 'outer')

In [None]:
# What if we only specify one column to merge on?
df1.merge(df2, on = 'Name',how = 'outer')
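
When a shared column is not used as a key, `merge` keeps both copies and appends `_x` and `_y` by default. The `suffixes` argument lets you pick clearer labels. A small sketch on copies of the frames above; `people`, `homes`, and the suffix strings are hypothetical choices.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Luke Danes', 'Emily Gilmore'],
    'Occupation': ['Manager', 'Student', 'Owner', 'Socialite'],
    'Age': [32, 20, 40, 60],
    'School': ['Hartford Community College', 'Yale', 'Stars Hollow High', 'Smith College']
})
homes = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Sookie St. James', 'Richard Gilmore'],
    'Home': ['Stars Hollow', 'Stars Hollow', 'Stars Hollow', 'Hartford'],
    'School': ['Hartford Community College', 'Yale', 'Unknown', 'Yale']
})

# Merge on Name only and label the two School columns by their source
labeled = people.merge(homes, on='Name', how='outer',
                       suffixes=('_people', '_homes'))

print(labeled.columns.tolist())
```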

In [None]:
# Merge on both
df1.merge(df2, on = ['Name','School'],how = 'outer')

In [None]:
# If one of the columns is actually an index
df2.set_index('Name',inplace=True)
df2

In [None]:
# Match the 'Name' column on the left against the index on the right
df1.merge(df2, left_on='Name', right_index=True)

In [None]:
df2.reset_index(inplace=True)

### Join

In [None]:
# Standard join won't work (error!): join aligns on the index, and the shared column names clash
df1.join(df2)

In [None]:
# Join works when we just want to join on the index
df1.set_index('Name',inplace=True)
df2.set_index('Name',inplace=True)

In [None]:
df1

In [None]:
df2

In [None]:
# Still doesn't work (error!): both dataframes have a 'School' column and no suffixes are given
df1.join(df2)

In [None]:
# We have to define suffixes when we use join (merge did it automatically)
df1.join(df2, lsuffix='_left',rsuffix='_right')

In [None]:
# Can select different "how" with join
df1.join(df2,how='outer', lsuffix='_left',rsuffix='_right')
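
The cells above set the index on both dataframes before joining. As an alternative sketch, `join` also accepts an `on` argument that matches a column of the left frame against the index of the right frame, so only the right frame needs its index set. The example uses copies of the original frames without the `School` column (hypothetical names `people` and `homes`), so no suffixes are needed.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Luke Danes', 'Emily Gilmore'],
    'Occupation': ['Manager', 'Student', 'Owner', 'Socialite'],
    'Age': [32, 20, 40, 60]
})
homes = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Sookie St. James', 'Richard Gilmore'],
    'Home': ['Stars Hollow', 'Stars Hollow', 'Stars Hollow', 'Hartford']
})

# join defaults to how='left': every row of `people` is kept
joined = people.join(homes.set_index('Name'), on='Name')

# Rows of `people` with no match in `homes` get a missing Home
missing_home = joined.loc[joined['Home'].isna(), 'Name'].tolist()

print(joined)
print(missing_home)
```

Because `set_index` is called without `inplace=True`, `homes` itself is left unchanged.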

### Concatenate

In [None]:
# First let's reset the indices
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)

In [None]:
# A standard concatenate
pd.concat([df1,df2])

In [None]:
# Explicitly state how to concatenate (keep all columns that appear in either dataset)
pd.concat([df1,df2],join='outer')

In [None]:
# Only keep common columns
pd.concat([df1,df2],join='inner')

In [None]:
# Left/right combinations are not supported (error!)
pd.concat([df1,df2],join='left')

In [None]:
# Be careful with indices!
concat_gilmores = pd.concat([df1,df2])
concat_gilmores

In [None]:
# Index labels can be duplicated, so .loc[2] returns one row from each dataframe
concat_gilmores.loc[2]

In [None]:
# You can prevent this by ignoring the index (this will give new indices)
pd.concat([df1,df2],ignore_index=True)
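
Besides `ignore_index`, two other `pd.concat` arguments can help with duplicated labels: `keys` tags each row with its source, and `verify_integrity=True` raises an error if labels overlap. A minimal sketch on copies of the original frames; `people`, `homes`, and the key labels are hypothetical.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Luke Danes', 'Emily Gilmore'],
    'Occupation': ['Manager', 'Student', 'Owner', 'Socialite'],
    'Age': [32, 20, 40, 60]
})
homes = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Sookie St. James', 'Richard Gilmore'],
    'Home': ['Stars Hollow', 'Stars Hollow', 'Stars Hollow', 'Hartford']
})

# keys adds an outer index level recording which frame each row came from
stacked = pd.concat([people, homes], keys=['people', 'homes'])
from_homes = stacked.loc['homes']

# verify_integrity refuses to build a result with repeated index labels
try:
    pd.concat([people, homes], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

print(stacked.index.nlevels)
print(overlap_detected)
```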

In [None]:
# Concatenate along the columns, side by side (can get very confusing if you aren't careful!)
pd.concat([df1,df2],axis=1)

In [None]:
# Concat is useful for adding rows: first build a one-row dataframe
new_row = pd.DataFrame([{'Name':'Kurk', 'Home':'Stars Hollow', 'School': 'Unknown'}])
new_row

In [None]:
# Then append it to df2, giving the combined rows fresh indices
df2 = pd.concat([df2,new_row],ignore_index=True)
df2

In [None]:
# Concatenating columns does an outer join on the index by default
pd.concat([df1,df2],axis=1)

In [None]:
# Change it to inner
pd.concat([df1,df2],axis=1,join='inner')
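
With `axis=1`, `pd.concat` lines rows up by index label, not by position, and `join` decides which labels are kept. The sketch below makes that visible with two tiny frames indexed by first name. The frames `ages` and `towns` and the shortened names are hypothetical; the values are taken from the data above.

```python
import pandas as pd

ages = pd.DataFrame({'Age': [32, 20, 40]},
                    index=['Lorelai', 'Rory', 'Luke'])
towns = pd.DataFrame({'Home': ['Stars Hollow', 'Hartford']},
                     index=['Rory', 'Richard'])

# Outer: every label from either frame, with NaN where a frame has no row
outer = pd.concat([ages, towns], axis=1)

# Inner: only labels present in both frames
inner = pd.concat([ages, towns], axis=1, join='inner')

print(outer)
print(inner)
```

Only `Rory` appears in both indices, so the inner result has a single row.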

## Activity

**1.** Run the following cells to read in Cal Poly Humboldt student data. Check what would happen if you did not include the `skiprows` argument.

In [None]:
passing = pd.read_csv('humboldt_data/Humboldt_Passing_Fa23.csv',skiprows=5)

In [None]:
passing.head()

In [None]:
first_gen = pd.read_csv('humboldt_data/FirstGenData_Fa23.csv',skiprows=5)

In [None]:
first_gen.head()

**2.** Merge the two dataframes. Try different `how` arguments.

**3.** Recreate the figure from the discussion question.