# Simpsons: Merging and Concatenation

## Imports

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

## Relationships between DataFrames

When multiple `DataFrame`s share common keys, the entities they describe can have **relationships** with one another. Three types of entity relationship are possible:

* 1-to-1
* 1-to-many
* many-to-many

Here is a small data set from the TV show [The Simpsons](https://en.wikipedia.org/wiki/The_Simpsons) to illustrate these relationships.

First, here is a `DataFrame` with students' first and last names, indexed by a unique student id:

In [9]:
students = DataFrame({'fname': ['Bart','Lisa','Milhouse'],
                      'lname': ['Simpson','Simpson','Van Houten']},
                     index=list('abc'))
students

Unnamed: 0,fname,lname
a,Bart,Simpson
b,Lisa,Simpson
c,Milhouse,Van Houten


Here is a `DataFrame` with the student social security numbers, indexed by their unique student id:

In [10]:
ssns = DataFrame({'ssn':[1234,5678,9101]}, index=list('abc'))
ssns

Unnamed: 0,ssn
a,1234
b,5678
c,9101


Each student can have aliases or nicknames:

In [11]:
aliases = DataFrame({'alias':['Bartman','Bartron','Cosmos','Truth Teller','Lady Penelope Ariel',
                              'Jake Boyman','Lou La Trec','Eagle Eye','Maestro'],
                     'student': list('aaabbbccc')})
aliases

Unnamed: 0,alias,student
0,Bartman,a
1,Bartron,a
2,Cosmos,a
3,Truth Teller,b
4,Lady Penelope Ariel,b
5,Jake Boyman,b
6,Lou La Trec,c
7,Eagle Eye,c
8,Maestro,c


Here are the student home addresses:

In [12]:
addresses = DataFrame({'address':['742 Evergreen Terrace','742 Evergreen Terrace','316 Pikeland Ave.']},
                      index=list('abc'))
addresses

Unnamed: 0,address
a,742 Evergreen Terrace
b,742 Evergreen Terrace
c,316 Pikeland Ave.


A table of courses the students can be enrolled in:

In [13]:
courses = DataFrame({'name':['Biology','Math','PE','Underwater electronics']}, index=range(4))
courses

Unnamed: 0,name
0,Biology
1,Math
2,PE
3,Underwater electronics


This table contains the enrollment for each course. Each row pairs a student with a course: the `student` column holds the student id, and the index holds the course id from the `courses` table.

In [14]:
enroll = DataFrame({'student':['a','b','b','c','c','c']},index=(2,0,1,0,1,2))
enroll

Unnamed: 0,student
2,a
0,b
1,b
0,c
1,c
2,c


## 1-to-1 relationships

* Each student has exactly one SSN.
* Each SSN belongs to exactly one student.

Create a `DataFrame` with the students' first name, last name and social security number:

In [19]:
# YOUR CODE HERE
merge1 = pd.merge(students, ssns, left_index=True, right_index=True)

In [20]:
merge1

Unnamed: 0,fname,lname,ssn
a,Bart,Simpson,1234
b,Lisa,Simpson,5678
c,Milhouse,Van Houten,9101


In [21]:
assert list(merge1.columns)==['fname', 'lname', 'ssn']
assert list(merge1.index)==list('abc')
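The 1-to-1 assumption can also be checked by pandas itself. The sketch below rebuilds the two tables so that it runs on its own and passes `validate="one_to_one"` to `pd.merge`, which should raise an error if either side has duplicate keys. The names `checked` and `joined` are illustrative and not part of the exercise.

```python
import pandas as pd

# Rebuild the two tables from above so the sketch is self-contained.
students = pd.DataFrame(
    {"fname": ["Bart", "Lisa", "Milhouse"],
     "lname": ["Simpson", "Simpson", "Van Houten"]},
    index=list("abc"),
)
ssns = pd.DataFrame({"ssn": [1234, 5678, 9101]}, index=list("abc"))

# validate="one_to_one" asks pandas to confirm that the merge keys are
# unique on both sides; a pandas.errors.MergeError is raised otherwise.
checked = pd.merge(students, ssns, left_index=True, right_index=True,
                   validate="one_to_one")

# DataFrame.join aligns on the index by default, so it is a shorter way
# to express the same index-on-index merge.
joined = students.join(ssns)

print(checked)
print(joined)
```

Using `validate` costs little here and documents the intended relationship directly in the code.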

## 1-to-many relationships

### Students and addresses

* Each student has exactly one address.
* Each address can have many students.

Create a `DataFrame` with the students' first name, last name and address:

In [23]:
# YOUR CODE HERE
merge2 = pd.merge(students, addresses, left_index=True, right_index=True)

In [24]:
merge2

Unnamed: 0,fname,lname,address
a,Bart,Simpson,742 Evergreen Terrace
b,Lisa,Simpson,742 Evergreen Terrace
c,Milhouse,Van Houten,316 Pikeland Ave.


In [25]:
assert list(merge2.columns)==['fname', 'lname', 'address']
assert list(merge2.index)==list('abc')
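To see the "many" side of this relationship, one option is to group the merged table by address and list the students at each one. This is a minimal sketch that rebuilds the tables so it can run alone; `merged` and `by_address` are hypothetical names introduced only for illustration.

```python
import pandas as pd

students = pd.DataFrame(
    {"fname": ["Bart", "Lisa", "Milhouse"],
     "lname": ["Simpson", "Simpson", "Van Houten"]},
    index=list("abc"),
)
addresses = pd.DataFrame(
    {"address": ["742 Evergreen Terrace", "742 Evergreen Terrace",
                 "316 Pikeland Ave."]},
    index=list("abc"),
)

# Same index-on-index merge as in the exercise.
merged = pd.merge(students, addresses, left_index=True, right_index=True)

# Collect the first names of all students living at each address.
by_address = merged.groupby("address")["fname"].apply(list)
print(by_address)
```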

### Students and aliases

* Each student can have many aliases.
* Each alias belongs to exactly one student.

Create a `DataFrame` with the students' first name, last name and alias. The index of the data frame should be the student column (a, b, c).

In [33]:
# YOUR CODE HERE
aliases_test = aliases.set_index('student')
merge3 = pd.merge(students, aliases_test, left_index=True, right_index=True)

In [34]:
merge3

Unnamed: 0,fname,lname,alias
a,Bart,Simpson,Bartman
a,Bart,Simpson,Bartron
a,Bart,Simpson,Cosmos
b,Lisa,Simpson,Truth Teller
b,Lisa,Simpson,Lady Penelope Ariel
b,Lisa,Simpson,Jake Boyman
c,Milhouse,Van Houten,Lou La Trec
c,Milhouse,Van Houten,Eagle Eye
c,Milhouse,Van Houten,Maestro


In [35]:
assert list(merge3.columns)==['fname', 'lname', 'alias']
assert list(merge3.index)==list('aaabbbccc')
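An alternative to calling `set_index` before the merge is to merge the students' index against the `student` column with `right_on`, then set the index afterwards. The sketch below assumes the same data as above and adds `validate="one_to_many"` as a sanity check; `alias_table` and `alias_counts` are illustrative names.

```python
import pandas as pd

students = pd.DataFrame(
    {"fname": ["Bart", "Lisa", "Milhouse"],
     "lname": ["Simpson", "Simpson", "Van Houten"]},
    index=list("abc"),
)
aliases = pd.DataFrame(
    {"alias": ["Bartman", "Bartron", "Cosmos", "Truth Teller",
               "Lady Penelope Ariel", "Jake Boyman", "Lou La Trec",
               "Eagle Eye", "Maestro"],
     "student": list("aaabbbccc")}
)

# Match the students' index to the 'student' column of aliases.
# validate="one_to_many" checks that each student id is unique on the left.
merged = pd.merge(students, aliases, left_index=True, right_on="student",
                  validate="one_to_many")

# Move the student id into the index and keep the three requested columns.
alias_table = merged.set_index("student")[["fname", "lname", "alias"]]

# Count how many aliases each student has.
alias_counts = alias_table.groupby(level=0)["alias"].count()
print(alias_table)
print(alias_counts)
```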

## Many-to-many relationships

* A student can take multiple classes.
* A single class can have multiple students.

Create a `DataFrame` with the students' first name, last name, student id (a, b, c) and course name. Multiple merges may be required.

In [85]:
# YOUR CODE HERE
classes = pd.merge(enroll, courses, left_index=True, right_index=True, how='outer')
merge4 = pd.merge(students, classes, left_index=True, right_on='student', how='outer')

In [86]:
merge4

Unnamed: 0,fname,lname,student,name
2,Bart,Simpson,a,PE
0,Lisa,Simpson,b,Biology
1,Lisa,Simpson,b,Math
0,Milhouse,Van Houten,c,Biology
1,Milhouse,Van Houten,c,Math
2,Milhouse,Van Houten,c,PE
3,,,,Underwater electronics


In [87]:
assert list(merge4.columns)==['fname', 'lname', 'student', 'name']
assert len(merge4)==7
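The seventh row appears because the outer merge keeps a course that nobody is enrolled in. As a rough sketch of how to inspect this, the `indicator=True` option of `pd.merge` adds a `_merge` column that records which side each row came from, while an inner merge drops the unmatched course. The names `flagged`, `unenrolled` and `inner` are illustrative only.

```python
import pandas as pd

courses = pd.DataFrame(
    {"name": ["Biology", "Math", "PE", "Underwater electronics"]},
    index=range(4),
)
# The index of enroll is the course id; the column is the student id.
enroll = pd.DataFrame({"student": ["a", "b", "b", "c", "c", "c"]},
                      index=[2, 0, 1, 0, 1, 2])

# Outer merge on the course id, with a '_merge' column that is
# 'both', 'left_only' or 'right_only' for each row.
flagged = pd.merge(enroll, courses, left_index=True, right_index=True,
                   how="outer", indicator=True)

# Courses that appear only in the courses table have no students.
unenrolled = flagged.loc[flagged["_merge"] == "right_only", "name"].tolist()

# The default inner merge keeps only courses with at least one student.
inner = pd.merge(enroll, courses, left_index=True, right_index=True)

print(unenrolled)
print(len(inner))
```

Which `how` to use depends on whether courses without students should appear in the result; the exercise's assertion on seven rows implies the outer form.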

## Concatenation

Use pandas' `concat` function to combine the `students` and `ssns` `DataFrame`s by columns:

In [92]:
# YOUR CODE HERE
concat1 = pd.concat([students, ssns], axis=1)

In [93]:
concat1

Unnamed: 0,fname,lname,ssn
a,Bart,Simpson,1234
b,Lisa,Simpson,5678
c,Milhouse,Van Houten,9101


In [94]:
assert list(concat1.columns)==['fname', 'lname', 'ssn']
assert list(concat1.index)==list('abc')

Do the same thing for the `students`, `ssns` and `addresses` `DataFrame`s:

In [95]:
# YOUR CODE HERE
concat2 = pd.concat([concat1, addresses], axis=1)

In [96]:
concat2

Unnamed: 0,fname,lname,ssn,address
a,Bart,Simpson,1234,742 Evergreen Terrace
b,Lisa,Simpson,5678,742 Evergreen Terrace
c,Milhouse,Van Houten,9101,316 Pikeland Ave.


In [97]:
assert list(concat2.columns)==['fname', 'lname', 'ssn', 'address']
assert list(concat2.index)==list('abc')
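`pd.concat` accepts any number of frames, so the three tables can also be combined in a single call. The sketch below shows that, and also how the `join` argument behaves when the indexes do not fully match. The frame `partial_ssns` is a made-up example with one student missing and is not part of the data set above.

```python
import pandas as pd

students = pd.DataFrame(
    {"fname": ["Bart", "Lisa", "Milhouse"],
     "lname": ["Simpson", "Simpson", "Van Houten"]},
    index=list("abc"),
)
ssns = pd.DataFrame({"ssn": [1234, 5678, 9101]}, index=list("abc"))
addresses = pd.DataFrame(
    {"address": ["742 Evergreen Terrace", "742 Evergreen Terrace",
                 "316 Pikeland Ave."]},
    index=list("abc"),
)

# Column-wise concatenation of all three frames, aligned on the index.
concat_all = pd.concat([students, ssns, addresses], axis=1)

# Hypothetical frame that lacks student 'c', to show index alignment.
partial_ssns = ssns.loc[["a", "b"]]

# join="outer" (the default) keeps every index label and fills gaps with NaN;
# join="inner" keeps only labels present in every frame.
outer = pd.concat([students, partial_ssns], axis=1)
inner = pd.concat([students, partial_ssns], axis=1, join="inner")

print(concat_all)
print(outer)
print(inner)
```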