# Combining Data Wrap-up
To type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp24&branch=main&urlpath=tree%2Fdata271_sp24%2Fdemos%2Fdata271_demo30_live.ipynb) instead.
To follow along by just executing the cells, stay in this notebook.

In [None]:
import numpy as np
import pandas as pd

## Combining Data

In [None]:
# Create the first dataframe
df1 = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Luke Danes', 'Emily Gilmore'],
    'Occupation': ['Manager', 'Student', 'Owner', 'Socialite'],
    'Age': [32, 20, 40, 60]
})

# Create the second dataframe
df2 = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Sookie St. James', 'Richard Gilmore'],
    'Home': ['Stars Hollow', 'Stars Hollow', 'Stars Hollow', 'Hartford']
})

In [None]:
df1

In [None]:
df2

### Merge

In [None]:
# A standard merge (inner)
df1.merge(df2)

In [None]:
# Explicitly specifying what to merge by (same as before)
df1.merge(df2, on = 'Name')

In [None]:
# What if they had different column names?
df2.rename(columns = {'Name':'Character Name'},inplace=True)
df2

In [None]:
# Merge on columns that have different names in each dataframe
df1.merge(df2, left_on = 'Name',right_on = 'Character Name')

In [None]:
# Can drop the redundant column
df1.merge(df2, left_on = 'Name',right_on = 'Character Name').drop(columns = 'Character Name')

In [None]:
# Reset it back to original
df2.rename(columns = {'Character Name':'Name'},inplace=True)

In [None]:
# Outer
df1.merge(df2, how = 'outer')

In [None]:
# Left
df1.merge(df2,how = 'left')

In [None]:
# Right
df1.merge(df2,how = 'right')

In [None]:
# Cross join (not super common, but occasionally handy)
df1.merge(df2,how='cross')
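
To compare the `how` options side by side, here is a minimal, self-contained sketch. It rebuilds the two example dataframes under the illustrative names `people` and `homes` (my own names, chosen so the cell does not overwrite `df1` and `df2`), counts the rows each merge type returns, and uses the `indicator=True` argument of `merge` to label where each row of an outer merge came from.

```python
import pandas as pd

# Illustrative copies of the example data (names are assumptions, not from the demo)
people = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Luke Danes', 'Emily Gilmore'],
    'Occupation': ['Manager', 'Student', 'Owner', 'Socialite'],
    'Age': [32, 20, 40, 60]
})
homes = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Sookie St. James', 'Richard Gilmore'],
    'Home': ['Stars Hollow', 'Stars Hollow', 'Stars Hollow', 'Hartford']
})

# Number of rows returned by each kind of merge
merge_sizes = {
    how: len(people.merge(homes, how=how))
    for how in ['inner', 'outer', 'left', 'right', 'cross']
}
print(merge_sizes)
# {'inner': 2, 'outer': 6, 'left': 4, 'right': 4, 'cross': 16}

# indicator=True adds a _merge column recording the source of each row
outer = people.merge(homes, how='outer', indicator=True)
print(outer[['Name', '_merge']])
```

The `_merge` column takes the values `left_only`, `right_only`, or `both`, which makes it a quick check for rows that found no match.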

In [None]:
# What if there are two common columns?
df1['School'] = ['Hartford Community College','Yale','Stars Hollow High','Smith College']
df2['School'] = ['Hartford Community College','Yale','Unknown','Yale']

In [None]:
# Test what a standard merge does


In [None]:
# Outer merge behaves as expected


In [None]:
# What if we only specify one column to merge on?


In [None]:
# Merge on both

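One possible way to fill in the cells above, written as a self-contained sketch rather than the exact code from the live demo. The dataframes `people` and `homes` are illustrative stand-ins for `df1` and `df2` with the `School` column added. By default, `merge` uses every column name the two dataframes share; merging on only one of them keeps both versions of the other column with `_x` and `_y` suffixes.

```python
import pandas as pd

# Illustrative stand-ins for df1 and df2 (names are assumptions)
people = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Luke Danes', 'Emily Gilmore'],
    'Age': [32, 20, 40, 60],
    'School': ['Hartford Community College', 'Yale', 'Stars Hollow High', 'Smith College']
})
homes = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Sookie St. James', 'Richard Gilmore'],
    'Home': ['Stars Hollow', 'Stars Hollow', 'Stars Hollow', 'Hartford'],
    'School': ['Hartford Community College', 'Yale', 'Unknown', 'Yale']
})

# Default merge: inner join on all shared columns (Name and School)
default_merge = people.merge(homes)

# Merge on Name only: School appears twice, as School_x and School_y
on_name = people.merge(homes, on='Name')

# Merge on both columns explicitly: same result as the default
on_both = people.merge(homes, on=['Name', 'School'])

print(default_merge)
print(on_name.columns.tolist())
print(default_merge.equals(on_both))
```
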

### Join

In [None]:
# Standard join won't work


In [None]:
# Join works when we just want to join on the index


In [None]:
# Still doesn't work


In [None]:
# We have to define suffixes when we use join (merge did it automatically)


In [None]:
# Can select different "how" with join

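The `join` steps above could look something like the following. This is a sketch under my own assumptions (illustrative dataframes `people` and `homes`, and `Name` used as the index), not necessarily what was typed in the live demo. `join` aligns on the index, and it raises a `ValueError` when the two dataframes share column names unless `lsuffix` and `rsuffix` are supplied.

```python
import pandas as pd

# Illustrative stand-ins for df1 and df2 (names are assumptions)
people = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Luke Danes', 'Emily Gilmore'],
    'Age': [32, 20, 40, 60],
    'School': ['Hartford Community College', 'Yale', 'Stars Hollow High', 'Smith College']
})
homes = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Sookie St. James', 'Richard Gilmore'],
    'Home': ['Stars Hollow', 'Stars Hollow', 'Stars Hollow', 'Hartford'],
    'School': ['Hartford Community College', 'Yale', 'Unknown', 'Yale']
})

# Overlapping column names with no suffixes: join raises a ValueError
try:
    people.join(homes)
except ValueError as err:
    print('join failed:', err)

# Move Name into the index so join can align the rows on it
people_idx = people.set_index('Name')
homes_idx = homes.set_index('Name')

# School still overlaps, so suffixes are required; join is a left join by default
joined = people_idx.join(homes_idx, lsuffix='_left', rsuffix='_right')

# The how argument works as it does for merge
joined_outer = people_idx.join(homes_idx, how='outer', lsuffix='_left', rsuffix='_right')

print(joined)
print(joined_outer)
```
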

### Concatenate

In [None]:
# First, let's reset the indices
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)

In [None]:
# A standard concatenate


In [None]:
# Explicitly state how to concatenate (keep all columns that appear in either dataset)


In [None]:
# Only keep common columns


In [None]:
# Concatenate the rows (can get very confusing if you aren't careful!)


In [None]:
# A better way to do that


In [None]:
# Another way

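Here is a small sketch of the main `pd.concat` options, using two tiny illustrative dataframes (`a` and `b` are my own names and data subsets, not the demo's). It shows the default stacking behavior, the `join` argument for choosing which columns to keep, `ignore_index` for a clean index, and `axis=1` for placing dataframes side by side.

```python
import pandas as pd

# Two small illustrative dataframes that share only the Name column
a = pd.DataFrame({'Name': ['Lorelai Gilmore', 'Rory Gilmore'], 'Age': [32, 20]})
b = pd.DataFrame({'Name': ['Sookie St. James', 'Richard Gilmore'],
                  'Home': ['Stars Hollow', 'Hartford']})

# Default: stack vertically, keep all columns, keep the original index labels
stacked = pd.concat([a, b])

# join='outer' is the default; ignore_index=True renumbers the rows 0..n-1
stacked_outer = pd.concat([a, b], join='outer', ignore_index=True)

# join='inner' keeps only the columns present in every dataframe
stacked_inner = pd.concat([a, b], join='inner', ignore_index=True)

# axis=1 places the dataframes side by side, aligned on the index (not on Name)
side_by_side = pd.concat([a, b], axis=1)

print(stacked)
print(stacked_inner)
print(side_by_side)
```

Note that `axis=1` matches rows by index label only, so the result has two `Name` columns whose values do not correspond. When rows should be matched on a key column, `merge` is usually the safer tool.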

### Compare

In [None]:
# Create a similar dataframe
df3 = df1.copy()
df3.loc[1,'Occupation'] = 'Journalist'
df3['Age'] = df3.Age+1
df3

In [None]:
# Compare dataframes to spot differences


In [None]:
# Compare dataframes to spot similarities and differences


In [None]:
# If you don't need specific information about differences

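The comparisons above might be sketched as follows. The dataframes `before` and `after` are illustrative stand-ins for `df1` and `df3`, so treat this as one possible version rather than the demo's exact code. `DataFrame.compare` reports differences cell by cell, with `self` and `other` sub-columns, while `DataFrame.equals` returns a single `True` or `False`.

```python
import pandas as pd

# Illustrative stand-ins for df1 and df3 (names are assumptions)
before = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Luke Danes', 'Emily Gilmore'],
    'Occupation': ['Manager', 'Student', 'Owner', 'Socialite'],
    'Age': [32, 20, 40, 60]
})
after = before.copy()
after.loc[1, 'Occupation'] = 'Journalist'
after['Age'] = after['Age'] + 1

# Only the rows and columns that differ; unchanged cells show as NaN
diff = before.compare(after)

# Keep every row and column, and show equal values instead of NaN
full = before.compare(after, keep_shape=True, keep_equal=True)

# A single yes/no answer when the details don't matter
same = before.equals(after)

print(diff)
print(full)
print(same)
```

`compare` requires the two dataframes to have identical row and column labels, which holds here because `after` starts as a copy of `before`.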

## Activity

**1.** Run the following cells to read in Cal Poly Humboldt student data. Check what would happen if you did not include the `skiprows` argument.

In [None]:
pd.read_csv('humboldt_data/Humboldt_Passing_Fa23.csv',skiprows=5)

In [None]:
passing = pd.read_csv('humboldt_data/Humboldt_Passing_Fa23.csv',skiprows=5)

In [None]:
passing.head()

In [None]:
first_gen = pd.read_csv('humboldt_data/FirstGenData_Fa23.csv',skiprows=5)

In [None]:
first_gen.head()
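
If it helps to see what `skiprows` does in isolation, here is a minimal sketch on made-up data. The text in `raw` is entirely hypothetical (it is not the layout of the Humboldt files); it just mimics an export with five lines of notes above the real header row.

```python
import io
import pandas as pd

# Hypothetical file contents: five lines of notes, then the real table
raw = (
    "Report title (made up)\n"
    "Term: example\n"
    "Source: example\n"
    "Notes line one\n"
    "Notes line two\n"
    "Major,Students\n"
    "Biology,120\n"
    "Data Science,45\n"
)

# skiprows=5 drops the first five lines, so the sixth line becomes the header
clean = pd.read_csv(io.StringIO(raw), skiprows=5)
print(clean)

# Without skiprows, the first note line is treated as the header
try:
    messy = pd.read_csv(io.StringIO(raw))
except pd.errors.ParserError as err:
    print('Parsing failed:', err)
else:
    print(messy.columns.tolist())
```
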

**2.** Merge the two dataframes. Try different `how` arguments.

**3.** Recreate the figure from the discussion question.

**4.** There are additional datasets related to Cal Poly Humboldt students from Fall 2023 in the `humboldt_data` folder in this directory. Choose one or more to combine with `passing` and `first_gen`. 