# Lab03 Extra Practice

In lab03, the joins turned out pretty... pretty. But depending on the type of join you're doing, things may not always be so. It's very helpful to be well-acquainted with all the types of joins.


## Creating some data

In [21]:
import pandas as pd

t1 = pd.DataFrame({
    'Letter': ['a', 'b', 'c', 'd'],
    'Number': [42, 12, 6, 3]
})

t2 = pd.DataFrame({
    'Letter': ['b', 'b', 'c', 'd', 'e', 'e'],
    'Special': ['!','#', '?', '$', '@', '&']
})

display(t1, t2)

Unnamed: 0,Letter,Number
0,a,42
1,b,12
2,c,6
3,d,3


Unnamed: 0,Letter,Special
0,b,!
1,b,#
2,c,?
3,d,$
4,e,@
5,e,&


## Joins

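### (FULL) OUTER

"If you're merging ON the Letter column, you will keep all rows from the left table, and all rows from the right table."

Rows with no match in the other table get missing values (NaN, shown as blanks below) in the columns that come from that table.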


In [36]:
pd.merge(left = t1, right = t2, on = 'Letter', how = 'outer')

Unnamed: 0,Letter,Number,Special
0,a,42.0,
1,b,12.0,!
2,b,12.0,#
3,c,6.0,?
4,d,3.0,$
5,e,,@
6,e,,&


In [37]:
pd.merge(left = t2, right = t1, on = 'Letter', how = 'outer')

Unnamed: 0,Letter,Special,Number
0,b,!,12.0
1,b,#,12.0
2,c,?,6.0
3,d,$,3.0
4,e,@,
5,e,&,
6,a,,42.0
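If it's hard to tell which table a row of an outer join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. Here is a minimal sketch reusing `t1` and `t2` from above (the variable name `outer` is just for illustration):

```python
import pandas as pd

t1 = pd.DataFrame({
    'Letter': ['a', 'b', 'c', 'd'],
    'Number': [42, 12, 6, 3]
})

t2 = pd.DataFrame({
    'Letter': ['b', 'b', 'c', 'd', 'e', 'e'],
    'Special': ['!', '#', '?', '$', '@', '&']
})

# indicator=True adds a '_merge' column whose values are
# 'left_only', 'right_only', or 'both'
outer = pd.merge(left=t1, right=t2, on='Letter', how='outer', indicator=True)
print(outer)

# Count how many rows fall into each category
print(outer['_merge'].value_counts())
```

The 'a' row should be tagged `left_only`, the two 'e' rows `right_only`, and everything else `both`.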


### (LEFT) OUTER
"If you're merging ON a column 'c', every row will be kept from the left table, and any rows in the right table that match values in 'c' will be included, too."

In [24]:
pd.merge(left = t1, right = t2, on = 'Letter', how = 'left')

Unnamed: 0,Letter,Number,Special
0,a,42,
1,b,12,!
2,b,12,#
3,c,6,?
4,d,3,$


In [33]:
pd.merge(left = t2, right = t1, on = 'Letter', how = 'left')

Unnamed: 0,Letter,Special,Number
0,b,!,12.0
1,b,#,12.0
2,c,?,6.0
3,d,$,3.0
4,e,@,
5,e,&,
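Notice that `Number` prints as integers (42, 12, ...) in the first left join but as floats (12.0, 6.0, ...) in the second. That happens because the second join leaves gaps in `Number`, and NaN is a float. A small sketch to check this yourself (the names `no_gaps` and `with_gaps` are made up for this example):

```python
import pandas as pd

t1 = pd.DataFrame({
    'Letter': ['a', 'b', 'c', 'd'],
    'Number': [42, 12, 6, 3]
})

t2 = pd.DataFrame({
    'Letter': ['b', 'b', 'c', 'd', 'e', 'e'],
    'Special': ['!', '#', '?', '$', '@', '&']
})

# t1 on the left: every row already has a Number, so no gaps appear in that column
no_gaps = pd.merge(left=t1, right=t2, on='Letter', how='left')

# t2 on the left: the 'e' rows have no partner in t1, so Number gets NaN there
with_gaps = pd.merge(left=t2, right=t1, on='Letter', how='left')

print(no_gaps['Number'].dtype)           # an integer dtype
print(with_gaps['Number'].dtype)         # a float dtype, because of the NaNs
print(with_gaps['Number'].isna().sum())  # how many rows had no match
```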


### (RIGHT) OUTER 
"If you're merging ON a column 'c', every row will be kept from the right table, and any rows in the left table that match values in 'c' will be included, too."

In [31]:
pd.merge(left = t1, right = t2, on = 'Letter', how = 'right')

Unnamed: 0,Letter,Number,Special
0,b,12.0,!
1,b,12.0,#
2,c,6.0,?
3,d,3.0,$
4,e,,@
5,e,,&


In [32]:
pd.merge(left = t2, right = t1, on = 'Letter', how = 'right')

Unnamed: 0,Letter,Special,Number
0,b,!,12
1,b,#,12
2,c,?,6
3,d,$,3
4,a,,42
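If you compare the outputs, a right join is just a left join with the tables swapped; only the column order changes. The sketch below checks that claim. The `normalize` helper is something made up for this example (it puts the columns and rows in a common order before comparing), not part of pandas:

```python
import pandas as pd

t1 = pd.DataFrame({
    'Letter': ['a', 'b', 'c', 'd'],
    'Number': [42, 12, 6, 3]
})

t2 = pd.DataFrame({
    'Letter': ['b', 'b', 'c', 'd', 'e', 'e'],
    'Special': ['!', '#', '?', '$', '@', '&']
})

right_join = pd.merge(left=t1, right=t2, on='Letter', how='right')
left_join_swapped = pd.merge(left=t2, right=t1, on='Letter', how='left')


def normalize(df):
    """Hypothetical helper: same column order, same row order, fresh index."""
    cols = ['Letter', 'Number', 'Special']
    return df[cols].sort_values(['Letter', 'Special']).reset_index(drop=True)


# DataFrame.equals treats NaNs in the same position as equal
same = normalize(right_join).equals(normalize(left_join_swapped))
print(same)
```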


### INNER
"If you're merging ON a column 'c', only rows whose value in 'c' appears in both tables will be kept."

In [26]:
pd.merge(left = t1, right = t2, on = 'Letter', how = 'inner')

Unnamed: 0,Letter,Number,Special
0,b,12,!
1,b,12,#
2,c,6,?
3,d,3,$


In [38]:
pd.merge(left = t2, right = t1, on = 'Letter', how = 'inner')

Unnamed: 0,Letter,Special,Number
0,b,!,12
1,b,#,12
2,c,?,6
3,d,$,3


#### Small detail: "Left inner" and "right inner" return the same rows; only the column order differs. Therefore, the (LEFT) or (RIGHT) is omitted in front of INNER joins.

#### Small detail: If a value in 'c' appears more than once in either table, the merge produces every pairing of the matching rows.

Example: merging ON 'Letter', t1 has only one row with a 'b', (b, 12), but t2 has two rows with a 'b': (b, !) and (b, #).

So the merged table contains both pairings: (b, 12, !) and (b, 12, #).
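The example above has duplicates on only one side. When a key repeats in both tables, the row count multiplies, which can blow up a merge without any warning. Below is a rough sketch using two made-up tables (`left_dupes` and `right_dupes` are not from the lab). It also shows the `validate` argument of `pd.merge`, which raises an error when the keys aren't as unique as you expected:

```python
import pandas as pd

# Hypothetical tables where 'b' is duplicated on BOTH sides
left_dupes = pd.DataFrame({
    'Letter': ['b', 'b'],
    'Number': [1, 2]
})

right_dupes = pd.DataFrame({
    'Letter': ['b', 'b', 'b'],
    'Special': ['!', '#', '?']
})

# Every left 'b' row is paired with every right 'b' row
pairs = pd.merge(left=left_dupes, right=right_dupes, on='Letter', how='inner')
print(pairs)
print(len(pairs))  # 2 left rows x 3 right rows = 6

# validate= makes pandas raise instead of silently multiplying rows
try:
    pd.merge(left=left_dupes, right=right_dupes, on='Letter',
             validate='one_to_one')
    validation_failed = False
except pd.errors.MergeError:
    validation_failed = True

print(validation_failed)
```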

## A Concrete Example 
As the joins above show, you are probably going to lose some rows when doing joins (unless you're doing a full outer join, in which case you're probably going to have a very messy dataset full of missing values).

In a sense, this is a good thing, as it is a sort of "automatic filtering." Sometimes, it even makes further data manipulation (like grouping or indexing) easier!


Let's say we have a school of 5 students: Abigail, Ashwin, Kevin, Kevin, and Zach. Each has their own unique student ID.

At the end of the year, they all take a final exam. You're allowed as many repeats as you want, and you're allowed to skip it, too (you'd probably flunk out, but you can skip it if you really want).

In [40]:
### Make some data


students = pd.DataFrame({
    'Name': ['Abigail', 'Ashwin', 'Kevin', 'Kevin', 'Zach'],
    'SID': ['10000', '10001', '10002', '10003', '10004'],
    'Year': [1, 1, 2, 4, 3]   
})


final_grades = pd.DataFrame({
    'SID': ['10000', '10000', '10000', '10001', '10003', '10003', '10004'],
    'Grades': ['B', 'C-', 'A-', 'B+', 'F', 'D+', 'A+'] 
    # Yes, some students took it more than once, and one (SID 10002) didn't take it at all
})

display(students, final_grades)

Unnamed: 0,Name,SID,Year
0,Abigail,10000,1
1,Ashwin,10001,1
2,Kevin,10002,2
3,Kevin,10003,4
4,Zach,10004,3


Unnamed: 0,SID,Grades
0,10000,B
1,10000,C-
2,10000,A-
3,10001,B+
4,10003,F
5,10003,D+
6,10004,A+


### 1. Return a table of students (with their name and year) with their corresponding grades on the exam

In [45]:
merged_table = pd.merge(left = students, right = final_grades, on = 'SID', how = 'inner')
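Since the cell above doesn't print anything, here is one way to inspect the result and see who the inner join left out. This is only a sketch; the variable `dropped` is a name invented for this example:

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Abigail', 'Ashwin', 'Kevin', 'Kevin', 'Zach'],
    'SID': ['10000', '10001', '10002', '10003', '10004'],
    'Year': [1, 1, 2, 4, 3]
})

final_grades = pd.DataFrame({
    'SID': ['10000', '10000', '10000', '10001', '10003', '10003', '10004'],
    'Grades': ['B', 'C-', 'A-', 'B+', 'F', 'D+', 'A+']
})

merged_table = pd.merge(left=students, right=final_grades, on='SID', how='inner')
print(merged_table)  # one row per exam attempt

# Students whose SID never appears in final_grades are dropped by the inner join
dropped = students[~students['SID'].isin(final_grades['SID'])]
print(dropped)
```

Merging on 'SID' rather than 'Name' matters here: the two Kevins share a name but not an ID.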

### 2. Take your merged table and look at the number of exams taken for each Year.

In [51]:
merged_table.groupby('Year')['Name'].count()

Year
1    4
3    1
4    2
Name: Name, dtype: int64

Conclusion: First-years are tryhards and second-years are degenerates.
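The grouped output has no row for Year 2 at all: the only second-year (SID 10002) never took the exam, so the inner join dropped him before the groupby ever ran. If you'd rather see an explicit 0, one option is a left join followed by counting the non-missing grades. This is a sketch of one possible approach; `all_students` and `exams_per_year` are names invented here:

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Abigail', 'Ashwin', 'Kevin', 'Kevin', 'Zach'],
    'SID': ['10000', '10001', '10002', '10003', '10004'],
    'Year': [1, 1, 2, 4, 3]
})

final_grades = pd.DataFrame({
    'SID': ['10000', '10000', '10000', '10001', '10003', '10003', '10004'],
    'Grades': ['B', 'C-', 'A-', 'B+', 'F', 'D+', 'A+']
})

# A left join keeps every student, even those with no exam attempts
all_students = pd.merge(left=students, right=final_grades, on='SID', how='left')

# count() skips missing values, so a student with no grade contributes 0
exams_per_year = all_students.groupby('Year')['Grades'].count()
print(exams_per_year)
```

Counting the `Grades` column (instead of `Name`) is the important design choice: `Name` is never missing after a left join, so counting it would report 1 exam for Year 2.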

# Congratulations! You reached the end :)