# SI 330: Data Manipulation 
## 04 - Joining, Combining, and Reshaping

### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

## Learning Objectives
* load CSV files
* load JSON files
* use pd.read_html to extract tables from web pages
* load data from simple APIs 
* load data from a SQL database
* handle missing data (dropna and fillna)
* use vectorized string functions
* Pandas refresher (or introduction)
* explain how pandas operations differ from "traditional" python
* be able to load a CSV file into a Pandas DataFrame
* explain how to extract columns from a DataFrame
* sort a DataFrame
* assign a column as the index of a DataFrame
* filter a DataFrame according to some criteria
* explain how boolean masks work in filtering DataFrames

This lab was inspired by https://pythonhealthcare.org/2018/04/08/32-reshaping-pandas-data-with-stack-unstack-pivot-and-melt/

### IMPORTANT: Replace ```?``` in the following code with your uniqname.

In [None]:
MY_UNIQNAME = 'joyip'

## Before we start...
### <font color="magenta">Q1: (1 point) Please let us know what you found confusing in the last class. </font>
We'll try to take time in the next class to review these concepts.


Replace this with your response.

## Review from last class

Recall from last class the ```read_html``` function, which made extracting tables from HTML pages a lot easier than using
BeautifulSoup directly (under the hood it relies on a parser such as lxml or bs4, but hides the ugly details).  Let's warm up for today's class by extracting some information from
a number of Wikipedia pages.

Our top-level goal is to extract information about the _aliases_ of some Lord Of The Rings characters.  Take a look at the Wikipedia page
for [Frodo Baggins](https://en.wikipedia.org/wiki/Frodo_Baggins) to get an idea of the sort of pages we're looking at.

In [1]:
import pandas as pd

In [2]:
frodo_url = 'https://en.wikipedia.org/wiki/Frodo_Baggins'

In [3]:
frodo_tables = pd.read_html(frodo_url)

In [4]:
frodo_tables

[                   0                                                  1
 0      Frodo Baggins                                                NaN
 1  Tolkien character                                                NaN
 2        Information                                                NaN
 3            Aliases                        Mr. Underhill,Maura Labingi
 4               Race                                             Hobbit
 5            Book(s)  The Fellowship of the RingThe Two TowersThe Re...,
                                                    0   \
 0         vteJ. R. R. Tolkien's The Lord of the Rings   
 1   Film series The Fellowship of the Ring The Two...   
 2                            Production and reception   
 3                                       Related works   
 4                                          Characters   
 5   Adaptations and other derivative worksBooks Bo...   
 6              Adaptations and other derivative works   
 7                      

In [7]:
frodo_tables[0]

Unnamed: 0,0,1
0,Frodo Baggins,
1,Tolkien character,
2,Information,
3,Aliases,"Mr. Underhill,Maura Labingi"
4,Race,Hobbit
5,Book(s),The Fellowship of the RingThe Two TowersThe Re...


Now let's load the page for [Legolas](https://en.wikipedia.org/wiki/Legolas):

In [11]:
legolas_url = 'https://en.wikipedia.org/wiki/Legolas'
legolas_tables = pd.read_html(legolas_url)

In [13]:
legolas_tables[0]

Unnamed: 0,0,1
0,,This article relies too much on references to ...


Hmmmm.  That doesn't look quite right.

Let's take a look at some URLs and figure out what's going on:

### <font color="magenta">Q2: (1 point) Inspect the Frodo and Legolas pages and see if you can figure out some _attributes_ of the table we're interested in.  </font>


Describe what you found.

You'll notice that the "Information" box shares some characteristics across pages.  We can leverage that 
information by using the ```attrs``` argument of ```read_html```.  For example, if we wanted to extract the element(s) that had
an ```id``` of ```info```, we could use

```pd.read_html(url, attrs={'id':'info'})```



### <font color="magenta">Q3: (1 point) Fill in the following code block to extract only the "Information" table for the Legolas page:</font>

In [14]:
a = {'class':'infobox'} # create an appropriate dictionary
pd.read_html(legolas_url, attrs=a)

[                   0                                                  1
 0            Legolas                                                NaN
 1  Tolkien character                                                NaN
 2        Information                                                NaN
 3            Aliases        Greenleaf, (Legolas translatedinto English)
 4               Race                                         Sindar Elf
 5             Gender                                               Male
 6            Book(s)  The Fellowship of the Ring The Two Towers The ...]

In [15]:
len(pd.read_html(legolas_url,attrs=a))

1

Now let's define a function that, given a Wikipedia URL, will extract the contents of the Aliases component of the infobox table:

In [16]:
def get_aliases(url):
    tables = pd.read_html(url, attrs={'class':'infobox'}) # extract only tables with class=infobox
    print(url,len(tables))   # sanity check: we should have just 1 table
    infotable = tables[0]    # pull the first table into a DataFrame
    ret = ''                 # initialize an empty string for our return value
    try:                     # in case the next line throws an exception
        x = infotable.set_index(0).loc['Aliases'] # setting the index on column 0 will allow us to use .loc to look up the value of 'Aliases'
        ret = x.values[0]
    except KeyError:         # raised when 'Aliases' is not a row label in this infobox
        ret = 'None'
    return ret
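
The lookup at the heart of ```get_aliases``` can be tried without touching the network.  The sketch below builds a small two-column DataFrame by hand that mimics the shape of an infobox table.  The helper name ```lookup_field``` and the sample rows are illustrative assumptions, not part of the lab:

```python
import pandas as pd

# Hand-built stand-in for an infobox table (an assumption for illustration):
# column 0 holds the labels, column 1 holds the values.
infobox = pd.DataFrame([
    ['Frodo Baggins', None],
    ['Aliases', 'Mr. Underhill,Maura Labingi'],
    ['Race', 'Hobbit'],
])

def lookup_field(table, label):
    """Return the value stored next to `label`, or the string 'None' if the label is absent."""
    try:
        # Indexing on column 0 lets .loc find the row by its label.
        row = table.set_index(0).loc[label]
    except KeyError:
        return 'None'
    return row.values[0]

print(lookup_field(infobox, 'Aliases'))
print(lookup_field(infobox, 'Weapon'))   # no such row, so we get the fallback
```

Catching only ```KeyError``` keeps unrelated problems (a typo in the code, say) from being silently turned into ```'None'```.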

And let's try it out:

In [20]:
tables = pd.read_html(legolas_url,attrs=a)
tables

[                   0                                                  1
 0            Legolas                                                NaN
 1  Tolkien character                                                NaN
 2        Information                                                NaN
 3            Aliases        Greenleaf, (Legolas translatedinto English)
 4               Race                                         Sindar Elf
 5             Gender                                               Male
 6            Book(s)  The Fellowship of the Ring The Two Towers The ...]

In [18]:
infotable = tables[0]

In [19]:
infotable

Unnamed: 0,0,1
0,Legolas,
1,Tolkien character,
2,Information,
3,Aliases,"Greenleaf, (Legolas translatedinto English)"
4,Race,Sindar Elf
5,Gender,Male
6,Book(s),The Fellowship of the Ring The Two Towers The ...


In [42]:
x = infotable.set_index(0).loc['Aliases']
x

1    Greenleaf, (Legolas translatedinto English)
Name: Aliases, dtype: object

In [16]:
get_aliases(legolas_url)

https://en.wikipedia.org/wiki/Legolas 1


'Greenleaf, (Legolas translatedinto English)'

In [17]:
x.values[0]

'Greenleaf, (Legolas translatedinto English)'

So far, so good.  It seems to work.  Now let's set up a DataFrame with a bunch of URLs:

In [23]:
urls = ['https://en.wikipedia.org/wiki/Gimli_(Middle-earth)',
        'https://en.wikipedia.org/wiki/Frodo_Baggins',
        'https://en.wikipedia.org/wiki/Legolas',
        'https://en.wikipedia.org/wiki/Bilbo_Baggins',
        'https://en.wikipedia.org/wiki/Samwise_Gamgee',
        'https://en.wikipedia.org/wiki/Peregrin_Took',
        'https://en.wikipedia.org/wiki/Boromir',
        'https://en.wikipedia.org/wiki/Galadriel',
        'https://en.wikipedia.org/wiki/Meriadoc_Brandybuck']
names = ['Gimli',
         'Frodo',
         'Legolas',
         'Bilbo',
         'Sam',
         'Pippin',
         'Boromir',
         'Galadriel',
         'Meriadoc']

In [24]:
udf = pd.DataFrame()
udf['name'] = names
udf['url'] = urls

In [25]:
udf

Unnamed: 0,name,url
0,Gimli,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...
1,Frodo,https://en.wikipedia.org/wiki/Frodo_Baggins
2,Legolas,https://en.wikipedia.org/wiki/Legolas
3,Bilbo,https://en.wikipedia.org/wiki/Bilbo_Baggins
4,Sam,https://en.wikipedia.org/wiki/Samwise_Gamgee
5,Pippin,https://en.wikipedia.org/wiki/Peregrin_Took
6,Boromir,https://en.wikipedia.org/wiki/Boromir
7,Galadriel,https://en.wikipedia.org/wiki/Galadriel
8,Meriadoc,https://en.wikipedia.org/wiki/Meriadoc_Brandybuck


The pythonic way of iterating through each of those rows would involve the use of some sort of ```for``` loop.  In pandas,
however, we can use the ```apply``` function to process an entire column!

In [26]:
udf['url'].apply(get_aliases)

https://en.wikipedia.org/wiki/Gimli_(Middle-earth) 1
https://en.wikipedia.org/wiki/Frodo_Baggins 1
https://en.wikipedia.org/wiki/Legolas 1
https://en.wikipedia.org/wiki/Bilbo_Baggins 1
https://en.wikipedia.org/wiki/Samwise_Gamgee 1
https://en.wikipedia.org/wiki/Peregrin_Took 1
https://en.wikipedia.org/wiki/Boromir 1
https://en.wikipedia.org/wiki/Galadriel 1
https://en.wikipedia.org/wiki/Meriadoc_Brandybuck 1


0    Elf-friend Lockbearer Lord of the Glittering C...
1                          Mr. Underhill,Maura Labingi
2          Greenleaf, (Legolas translatedinto English)
3                                        Bilba Labingi
4    Samwise Gardner, Sam, Samwise the Brave,Banazî...
5    Pippin, Pip,"Ernil i Pheriannath"Thain Peregri...
6    Captain of the White Tower,High Warden of the ...
7                       AlatárielAltárielArtanisNerwen
8    Merry,Kalimac Brandagamba,Meriadoc the Magnifi...
Name: url, dtype: object
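
To make the comparison with a ```for``` loop concrete, here is a minimal sketch that computes the same thing both ways.  The three-name Series is made up for illustration, and ```len``` stands in for a slower function such as ```get_aliases```:

```python
import pandas as pd

names = pd.Series(['Gimli', 'Frodo', 'Legolas'])

# Explicit loop: build the result one element at a time.
lengths_loop = []
for name in names:
    lengths_loop.append(len(name))

# apply: hand pandas the function and let it visit every element.
lengths_apply = names.apply(len)

print(lengths_loop)
print(lengths_apply.tolist())
```

Note that ```apply``` still calls the function once per element, so it is mostly a convenience: the result comes back as a Series that lines up with the original index.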

We can take the resulting Series and assign it to a new column in our DataFrame:

In [27]:
udf['name'].apply(str.lower)   # square brackets select a column; udf('name') raises "TypeError: 'DataFrame' object is not callable"

In [25]:
udf['aliases'] = udf['url'].apply(get_aliases)

https://en.wikipedia.org/wiki/Gimli_(Middle-earth) 1
https://en.wikipedia.org/wiki/Frodo_Baggins 1
https://en.wikipedia.org/wiki/Legolas 1
https://en.wikipedia.org/wiki/Bilbo_Baggins 1
https://en.wikipedia.org/wiki/Samwise_Gamgee 1
https://en.wikipedia.org/wiki/Peregrin_Took 1
https://en.wikipedia.org/wiki/Boromir 1
https://en.wikipedia.org/wiki/Galadriel 1
https://en.wikipedia.org/wiki/Meriadoc_Brandybuck 1


In [26]:
udf

Unnamed: 0,name,url,aliases
0,Gimli,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...,Elf-friend Lockbearer Lord of the Glittering C...
1,Frodo,https://en.wikipedia.org/wiki/Frodo_Baggins,"Mr. Underhill,Maura Labingi"
2,Legolas,https://en.wikipedia.org/wiki/Legolas,"Greenleaf, (Legolas translatedinto English)"
3,Bilbo,https://en.wikipedia.org/wiki/Bilbo_Baggins,Bilba Labingi
4,Sam,https://en.wikipedia.org/wiki/Samwise_Gamgee,"Samwise Gardner, Sam, Samwise the Brave,Banazî..."
5,Pippin,https://en.wikipedia.org/wiki/Peregrin_Took,"Pippin, Pip,""Ernil i Pheriannath""Thain Peregri..."
6,Boromir,https://en.wikipedia.org/wiki/Boromir,"Captain of the White Tower,High Warden of the ..."
7,Galadriel,https://en.wikipedia.org/wiki/Galadriel,AlatárielAltárielArtanisNerwen
8,Meriadoc,https://en.wikipedia.org/wiki/Meriadoc_Brandybuck,"Merry,Kalimac Brandagamba,Meriadoc the Magnifi..."


Let's just put the ```udf``` DataFrame aside for now.  We'll return to it later.

## Creating DataFrames and Exploring Indexes

Let's load the usual libraries...

In [31]:
import pandas as pd
import numpy as np

Let's create some lists of data that we can use to construct a DataFrame:

In [32]:
names = ['Gandalf',
         'Gimli',
         'Frodo',
         'Legolas',
         'Bilbo',
         'Sam',
         'Pippin',
         'Boromir',
         'Aragorn',
         'Galadriel',
         'Meriadoc',
        'Lily']
races = ['Maia',
         'Dwarf',
         'Hobbit',
         'Elf',
         'Hobbit',
         'Hobbit',
         'Hobbit',
         'Man',
         'Man',
         'Elf',
         'Hobbit',
        'Hobbit']
magic = [10, 1, 4, 6, 4, 2, 0, 0, 2, 9, 0, np.nan]
aggression = [7, 10, 2, 5, 1, 6, 3, 8, 7, 2, 4, np.nan]
stealth = [8, 2, 5, 10, 5, 4, 5, 3, 9, 10, 6, np.nan]

There are a few different ways to construct a DataFrame.  One is to start with an empty constructor and assign the lists as columns:

### <font color="magenta"> Q4: (2 points) Construct a dataframe with 5 columns (names, races, magic, aggression, and stealth) using the lists above.</font>

In [33]:
df = pd.DataFrame()
df['names'] = names
df['races'] = races
df['magic'] = magic
df['aggression'] = aggression
df['stealth'] = stealth

In [34]:
df

Unnamed: 0,names,races,magic,aggression,stealth
0,Gandalf,Maia,10.0,7.0,8.0
1,Gimli,Dwarf,1.0,10.0,2.0
2,Frodo,Hobbit,4.0,2.0,5.0
3,Legolas,Elf,6.0,5.0,10.0
4,Bilbo,Hobbit,4.0,1.0,5.0
5,Sam,Hobbit,2.0,6.0,4.0
6,Pippin,Hobbit,0.0,3.0,5.0
7,Boromir,Man,0.0,8.0,3.0
8,Aragorn,Man,2.0,7.0,9.0
9,Galadriel,Elf,9.0,2.0,10.0


Alternatively, we could have set things up with a dict:

In [35]:
df = pd.DataFrame({'name': names,'race':races,'magic':magic,'aggression': aggression,'stealth':stealth})

In [36]:
df

Unnamed: 0,name,race,magic,aggression,stealth
0,Gandalf,Maia,10.0,7.0,8.0
1,Gimli,Dwarf,1.0,10.0,2.0
2,Frodo,Hobbit,4.0,2.0,5.0
3,Legolas,Elf,6.0,5.0,10.0
4,Bilbo,Hobbit,4.0,1.0,5.0
5,Sam,Hobbit,2.0,6.0,4.0
6,Pippin,Hobbit,0.0,3.0,5.0
7,Boromir,Man,0.0,8.0,3.0
8,Aragorn,Man,2.0,7.0,9.0
9,Galadriel,Elf,9.0,2.0,10.0
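
As a quick check that the two construction styles end up in the same place, here is a small sketch on a made-up three-row subset (the shortened lists are an assumption for illustration).  It also shows why the numeric columns above print as ```10.0``` rather than ```10```: a single ```NaN``` forces the whole column to a floating-point dtype.

```python
import pandas as pd
import numpy as np

names = ['Frodo', 'Sam', 'Lily']
magic = [4, 2, np.nan]

# Option 1: start empty and assign one column at a time.
by_assignment = pd.DataFrame()
by_assignment['name'] = names
by_assignment['magic'] = magic

# Option 2: pass a dict that maps column names to lists.
by_dict = pd.DataFrame({'name': names, 'magic': magic})

print(list(by_assignment.columns) == list(by_dict.columns))
print(by_dict['magic'].dtype)   # the NaN makes this a float column
```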


Let's take a look at the index on the resulting DataFrame:

In [37]:
df.index

RangeIndex(start=0, stop=12, step=1)

We can set the index to something more useful than the default RangeIndex:

In [38]:
df_nameindexed = df.set_index('name')

And if we take a look at the results, we see that we have a pandas Index instead of a RangeIndex:

In [39]:
df_nameindexed.index

Index(['Gandalf', 'Gimli', 'Frodo', 'Legolas', 'Bilbo', 'Sam', 'Pippin',
       'Boromir', 'Aragorn', 'Galadriel', 'Meriadoc', 'Lily'],
      dtype='object', name='name')

In [38]:
df_nameindexed

Unnamed: 0_level_0,race,magic,aggression,stealth
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Gandalf,Maia,10.0,7.0,8.0
Gimli,Dwarf,1.0,10.0,2.0
Frodo,Hobbit,4.0,2.0,5.0
Legolas,Elf,6.0,5.0,10.0
Bilbo,Hobbit,4.0,1.0,5.0
Sam,Hobbit,2.0,6.0,4.0
Pippin,Hobbit,0.0,3.0,5.0
Boromir,Man,0.0,8.0,3.0
Aragorn,Man,2.0,7.0,9.0
Galadriel,Elf,9.0,2.0,10.0


Setting the name Series as the index allows us to do things like:

In [39]:
df_nameindexed.loc['Aragorn']

race          Man
magic           2
aggression      7
stealth         9
Name: Aragorn, dtype: object

Now recall hierarchical indexing from the readings.  We can pass a list of column names to ```set_index``` to create a hierarchical index (a ```MultiIndex```):

In [40]:
df_racename_indexed = df.set_index(['race','name'])

In [41]:
df_racename_indexed

Unnamed: 0_level_0,Unnamed: 1_level_0,magic,aggression,stealth
race,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Maia,Gandalf,10.0,7.0,8.0
Dwarf,Gimli,1.0,10.0,2.0
Hobbit,Frodo,4.0,2.0,5.0
Elf,Legolas,6.0,5.0,10.0
Hobbit,Bilbo,4.0,1.0,5.0
Hobbit,Sam,2.0,6.0,4.0
Hobbit,Pippin,0.0,3.0,5.0
Man,Boromir,0.0,8.0,3.0
Man,Aragorn,2.0,7.0,9.0
Elf,Galadriel,9.0,2.0,10.0


In [41]:
df_racename_indexed.index

MultiIndex(levels=[['Dwarf', 'Elf', 'Hobbit', 'Maia', 'Man'], ['Aragorn', 'Bilbo', 'Boromir', 'Frodo', 'Galadriel', 'Gandalf', 'Gimli', 'Legolas', 'Lily', 'Meriadoc', 'Pippin', 'Sam']],
           labels=[[3, 0, 2, 1, 2, 2, 2, 4, 4, 1, 2, 2], [5, 6, 3, 7, 1, 11, 10, 2, 0, 4, 9, 8]],
           names=['race', 'name'])

This will allow us to get a DataFrame that matches a value on the outer index:

In [42]:
df_racename_indexed.loc['Hobbit']

Unnamed: 0_level_0,magic,aggression,stealth
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Frodo,4.0,2.0,5.0
Bilbo,4.0,1.0,5.0
Sam,2.0,6.0,4.0
Pippin,0.0,3.0,5.0
Meriadoc,0.0,4.0,6.0
Lily,,,


We can also use the index on a Series to match the outer index:

In [43]:
df_racename_indexed['magic'].loc['Hobbit']

name
Frodo       4.0
Bilbo       4.0
Sam         2.0
Pippin      0.0
Meriadoc    0.0
Lily        NaN
Name: magic, dtype: float64

Or both indexes:

In [44]:
df_racename_indexed['magic'].loc['Hobbit','Frodo']

4.0

Or just the inner index:

In [None]:
df_racename_indexed['magic'].loc[:,'Frodo']
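
The three lookups above (outer level, both levels, inner level) can be compared side by side on a tiny example.  This is a sketch on a made-up three-row frame, not the lab's data:

```python
import pandas as pd

scores = pd.DataFrame({
    'race': ['Elf', 'Hobbit', 'Hobbit'],
    'name': ['Legolas', 'Frodo', 'Sam'],
    'magic': [6, 4, 2],
}).set_index(['race', 'name'])

magic = scores['magic']

outer = magic.loc['Hobbit']            # outer level only: a Series indexed by name
both = magic.loc['Hobbit', 'Frodo']    # both levels: a single value
inner = magic.loc[:, 'Sam']            # inner level only: a Series indexed by race

print(outer)
print(both)
print(inner)
```

The ```:``` in the last lookup means "every value of the outer level", which is why the result is still indexed by race.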

### <font color="magenta"> Q5: (1 point) Using .loc find how much aggression Legolas, an Elf, has.</font>

In [47]:
agg = df_racename_indexed['aggression'].loc['Elf', 'Legolas']

In [48]:
agg

5.0

## Stacking and Unstacking

Stacking takes "wide" data and makes it "taller" by moving the column labels into an inner level of the row index.

In [49]:
df.set_index(['race']).stack()

race              
Maia    name            Gandalf
        magic                10
        aggression            7
        stealth               8
Dwarf   name              Gimli
        magic                 1
        aggression           10
        stealth               2
Hobbit  name              Frodo
        magic                 4
        aggression            2
        stealth               5
Elf     name            Legolas
        magic                 6
        aggression            5
        stealth              10
Hobbit  name              Bilbo
        magic                 4
        aggression            1
        stealth               5
        name                Sam
        magic                 2
        aggression            6
        stealth               4
        name             Pippin
        magic                 0
        aggression            3
        stealth               5
Man     name            Boromir
        magic                 0
        aggression   

If we call reset_index on the resulting Series, we get the following DataFrame:

In [50]:
df.set_index(['race']).stack().reset_index()

Unnamed: 0,race,level_1,0
0,Maia,name,Gandalf
1,Maia,magic,10
2,Maia,aggression,7
3,Maia,stealth,8
4,Dwarf,name,Gimli
5,Dwarf,magic,1
6,Dwarf,aggression,10
7,Dwarf,stealth,2
8,Hobbit,name,Frodo
9,Hobbit,magic,4


The column names in the above DataFrame aren't particularly helpful, so we can rename them:

In [51]:
df.set_index(['race']).stack().reset_index().rename(columns = {'level_1':'variable', 0:'value'})

Unnamed: 0,race,variable,value
0,Maia,name,Gandalf
1,Maia,magic,10
2,Maia,aggression,7
3,Maia,stealth,8
4,Dwarf,name,Gimli
5,Dwarf,magic,1
6,Dwarf,aggression,10
7,Dwarf,stealth,2
8,Hobbit,name,Frodo
9,Hobbit,magic,4
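
Here is the same stack, reset, and rename chain on a two-row frame, small enough to check by eye.  The data is made up for illustration; the column names ```variable``` and ```value``` follow the cell above:

```python
import pandas as pd

wide = pd.DataFrame({
    'race': ['Hobbit', 'Elf'],
    'magic': [4, 6],
    'stealth': [5, 10],
})

long = (
    wide.set_index('race')   # keep race as the identifier
        .stack()             # one row per (race, column label) pair
        .reset_index()       # index levels become ordinary columns
        .rename(columns={'level_1': 'variable', 0: 'value'})
)

print(long)
```

Two rows with two measurement columns become four rows, one per measurement.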


You can do the opposite of stacking by using the ```unstack``` function:

In [52]:
df_stacked = df.stack()

In [53]:
df_stacked

0   name            Gandalf
    race               Maia
    magic                10
    aggression            7
    stealth               8
1   name              Gimli
    race              Dwarf
    magic                 1
    aggression           10
    stealth               2
2   name              Frodo
    race             Hobbit
    magic                 4
    aggression            2
    stealth               5
3   name            Legolas
    race                Elf
    magic                 6
    aggression            5
    stealth              10
4   name              Bilbo
    race             Hobbit
    magic                 4
    aggression            1
    stealth               5
5   name                Sam
    race             Hobbit
    magic                 2
    aggression            6
    stealth               4
6   name             Pippin
    race             Hobbit
    magic                 0
    aggression            3
    stealth               5
7   name            

In [54]:
df_stacked.unstack()

Unnamed: 0,name,race,magic,aggression,stealth
0,Gandalf,Maia,10.0,7.0,8.0
1,Gimli,Dwarf,1.0,10.0,2.0
2,Frodo,Hobbit,4.0,2.0,5.0
3,Legolas,Elf,6.0,5.0,10.0
4,Bilbo,Hobbit,4.0,1.0,5.0
5,Sam,Hobbit,2.0,6.0,4.0
6,Pippin,Hobbit,0.0,3.0,5.0
7,Boromir,Man,0.0,8.0,3.0
8,Aragorn,Man,2.0,7.0,9.0
9,Galadriel,Elf,9.0,2.0,10.0


Why would we want to stack or unstack?  It depends on what sorts of analyses we want to do "downstream".  It's also the basis for pivoting, melting, and pivot tables, which we'll cover in the next class.
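
To see that ```stack``` and ```unstack``` really are inverses, here is a minimal round trip on a made-up two-row frame (a sketch, with no missing values to complicate things):

```python
import pandas as pd

wide = pd.DataFrame(
    {'magic': [4, 2], 'stealth': [5, 4]},
    index=['Frodo', 'Sam'],
)

# stack: column labels become the inner level of the row index (wide -> tall).
tall = wide.stack()

# unstack: the inner index level goes back to being column labels (tall -> wide).
wide_again = tall.unstack()

print(tall)
print(wide_again)
```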

## Joining Data



Let's say we have another CSV file that contains URLs to Wikipedia pages for some of the LOTR characters:

In [55]:
urls = pd.read_csv('data/lotr_wikipedia.csv')

In [56]:
urls

Unnamed: 0,name,url
0,Gandalf,https://en.wikipedia.org/wiki/Gandalf
1,Gimli,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...
2,Frodo,https://en.wikipedia.org/wiki/Frodo_Baggins
3,Legolas,https://en.wikipedia.org/wiki/Legolas
4,Bilbo,https://en.wikipedia.org/wiki/Bilbo_Baggins
5,Sam,https://en.wikipedia.org/wiki/Samwise_Gamgee
6,Pippin,https://en.wikipedia.org/wiki/Peregrin_Took
7,Boromir,https://en.wikipedia.org/wiki/Boromir
8,Aragorn,https://en.wikipedia.org/wiki/Aragorn
9,Galadriel,https://en.wikipedia.org/wiki/Galadriel


Let's take a look at the original DataFrame:

In [57]:
df

Unnamed: 0,name,race,magic,aggression,stealth
0,Gandalf,Maia,10.0,7.0,8.0
1,Gimli,Dwarf,1.0,10.0,2.0
2,Frodo,Hobbit,4.0,2.0,5.0
3,Legolas,Elf,6.0,5.0,10.0
4,Bilbo,Hobbit,4.0,1.0,5.0
5,Sam,Hobbit,2.0,6.0,4.0
6,Pippin,Hobbit,0.0,3.0,5.0
7,Boromir,Man,0.0,8.0,3.0
8,Aragorn,Man,2.0,7.0,9.0
9,Galadriel,Elf,9.0,2.0,10.0


It looks like the rows are "aligned", so we can use the ```concat``` function to concatenate the two DataFrames.
Note that we specify the axis to be the columns.  The default is to concatenate by rows, which isn't what we want.

In [58]:
pd.concat([df,urls],axis="columns")

Unnamed: 0,name,race,magic,aggression,stealth,name.1,url
0,Gandalf,Maia,10.0,7.0,8.0,Gandalf,https://en.wikipedia.org/wiki/Gandalf
1,Gimli,Dwarf,1.0,10.0,2.0,Gimli,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...
2,Frodo,Hobbit,4.0,2.0,5.0,Frodo,https://en.wikipedia.org/wiki/Frodo_Baggins
3,Legolas,Elf,6.0,5.0,10.0,Legolas,https://en.wikipedia.org/wiki/Legolas
4,Bilbo,Hobbit,4.0,1.0,5.0,Bilbo,https://en.wikipedia.org/wiki/Bilbo_Baggins
5,Sam,Hobbit,2.0,6.0,4.0,Sam,https://en.wikipedia.org/wiki/Samwise_Gamgee
6,Pippin,Hobbit,0.0,3.0,5.0,Pippin,https://en.wikipedia.org/wiki/Peregrin_Took
7,Boromir,Man,0.0,8.0,3.0,Boromir,https://en.wikipedia.org/wiki/Boromir
8,Aragorn,Man,2.0,7.0,9.0,Aragorn,https://en.wikipedia.org/wiki/Aragorn
9,Galadriel,Elf,9.0,2.0,10.0,Galadriel,https://en.wikipedia.org/wiki/Galadriel


That's great, and it's consistent with what we've used in previous classes.  But what happens if the 
rows in the two DataFrames don't match up?  Let's load another file that has a slightly different
sequence of rows:

### <font color="magenta"> Q6: (1 point) Construct a DataFrame from lotr_wikipedia_wrong_order.csv, which is in the data folder.</font>

In [60]:
urls_wrong_order = pd.read_csv('data/lotr_wikipedia_wrong_order.csv')

In [61]:
urls_wrong_order

Unnamed: 0,name,url
0,Gimli,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...
1,Frodo,https://en.wikipedia.org/wiki/Frodo_Baggins
2,Legolas,https://en.wikipedia.org/wiki/Legolas
3,Bilbo,https://en.wikipedia.org/wiki/Bilbo_Baggins
4,Sam,https://en.wikipedia.org/wiki/Samwise_Gamgee
5,Pippin,https://en.wikipedia.org/wiki/Peregrin_Took
6,Boromir,https://en.wikipedia.org/wiki/Boromir
7,Aragorn,https://en.wikipedia.org/wiki/Aragorn
8,Galadriel,https://en.wikipedia.org/wiki/Galadriel
9,Meriadoc,https://en.wikipedia.org/wiki/Meriadoc_Brandybuck


In [62]:
pd.concat([df,urls_wrong_order],axis="columns")

Unnamed: 0,name,race,magic,aggression,stealth,name.1,url
0,Gandalf,Maia,10.0,7.0,8.0,Gimli,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...
1,Gimli,Dwarf,1.0,10.0,2.0,Frodo,https://en.wikipedia.org/wiki/Frodo_Baggins
2,Frodo,Hobbit,4.0,2.0,5.0,Legolas,https://en.wikipedia.org/wiki/Legolas
3,Legolas,Elf,6.0,5.0,10.0,Bilbo,https://en.wikipedia.org/wiki/Bilbo_Baggins
4,Bilbo,Hobbit,4.0,1.0,5.0,Sam,https://en.wikipedia.org/wiki/Samwise_Gamgee
5,Sam,Hobbit,2.0,6.0,4.0,Pippin,https://en.wikipedia.org/wiki/Peregrin_Took
6,Pippin,Hobbit,0.0,3.0,5.0,Boromir,https://en.wikipedia.org/wiki/Boromir
7,Boromir,Man,0.0,8.0,3.0,Aragorn,https://en.wikipedia.org/wiki/Aragorn
8,Aragorn,Man,2.0,7.0,9.0,Galadriel,https://en.wikipedia.org/wiki/Galadriel
9,Galadriel,Elf,9.0,2.0,10.0,Meriadoc,https://en.wikipedia.org/wiki/Meriadoc_Brandybuck


Take a closer look at the name and url columns.  Something's not quite right.
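
The reason is that ```concat``` lines rows up by their index labels, not by the contents of any column.  A two-row sketch makes the mismatch easy to see (the frames and the placeholder url strings are made up for illustration):

```python
import pandas as pd

stats = pd.DataFrame({'name': ['Gandalf', 'Gimli'], 'magic': [10, 1]})
links = pd.DataFrame({'name': ['Gimli', 'Gandalf'], 'url': ['gimli-page', 'gandalf-page']})

# Both frames carry the default index 0, 1, so concat pairs row 0 with row 0
# and row 1 with row 1, even though the names in those rows differ.
side_by_side = pd.concat([stats, links], axis='columns')

print(side_by_side)
```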

We can work around that by setting an appropriate index and using ```join```, or by using the SQL-like ```merge``` function on a shared column.

In [63]:
df_names = df.set_index('name')

In [64]:
df_names

Unnamed: 0_level_0,race,magic,aggression,stealth
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Gandalf,Maia,10.0,7.0,8.0
Gimli,Dwarf,1.0,10.0,2.0
Frodo,Hobbit,4.0,2.0,5.0
Legolas,Elf,6.0,5.0,10.0
Bilbo,Hobbit,4.0,1.0,5.0
Sam,Hobbit,2.0,6.0,4.0
Pippin,Hobbit,0.0,3.0,5.0
Boromir,Man,0.0,8.0,3.0
Aragorn,Man,2.0,7.0,9.0
Galadriel,Elf,9.0,2.0,10.0


In [65]:
urls_wrong_order_names = urls_wrong_order.set_index('name')

In [66]:
df_names.join(urls_wrong_order_names)

Unnamed: 0_level_0,race,magic,aggression,stealth,url
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Gandalf,Maia,10.0,7.0,8.0,https://en.wikipedia.org/wiki/Gandalf
Gimli,Dwarf,1.0,10.0,2.0,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...
Frodo,Hobbit,4.0,2.0,5.0,https://en.wikipedia.org/wiki/Frodo_Baggins
Legolas,Elf,6.0,5.0,10.0,https://en.wikipedia.org/wiki/Legolas
Bilbo,Hobbit,4.0,1.0,5.0,https://en.wikipedia.org/wiki/Bilbo_Baggins
Sam,Hobbit,2.0,6.0,4.0,https://en.wikipedia.org/wiki/Samwise_Gamgee
Pippin,Hobbit,0.0,3.0,5.0,https://en.wikipedia.org/wiki/Peregrin_Took
Boromir,Man,0.0,8.0,3.0,https://en.wikipedia.org/wiki/Boromir
Aragorn,Man,2.0,7.0,9.0,https://en.wikipedia.org/wiki/Aragorn
Galadriel,Elf,9.0,2.0,10.0,https://en.wikipedia.org/wiki/Galadriel


In [67]:
df.head()

Unnamed: 0,name,race,magic,aggression,stealth
0,Gandalf,Maia,10.0,7.0,8.0
1,Gimli,Dwarf,1.0,10.0,2.0
2,Frodo,Hobbit,4.0,2.0,5.0
3,Legolas,Elf,6.0,5.0,10.0
4,Bilbo,Hobbit,4.0,1.0,5.0


In [68]:
urls_wrong_order.head()

Unnamed: 0,name,url
0,Gimli,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...
1,Frodo,https://en.wikipedia.org/wiki/Frodo_Baggins
2,Legolas,https://en.wikipedia.org/wiki/Legolas
3,Bilbo,https://en.wikipedia.org/wiki/Bilbo_Baggins
4,Sam,https://en.wikipedia.org/wiki/Samwise_Gamgee


In [69]:
urls_wrong_order['name']

0         Gimli
1         Frodo
2       Legolas
3         Bilbo
4           Sam
5        Pippin
6       Boromir
7       Aragorn
8     Galadriel
9      Meriadoc
10      Gandalf
Name: name, dtype: object

In [70]:
df['name']

0       Gandalf
1         Gimli
2         Frodo
3       Legolas
4         Bilbo
5           Sam
6        Pippin
7       Boromir
8       Aragorn
9     Galadriel
10     Meriadoc
11         Lily
Name: name, dtype: object

In [71]:
df.merge(urls_wrong_order,on='name')

Unnamed: 0,name,race,magic,aggression,stealth,url
0,Gandalf,Maia,10.0,7.0,8.0,https://en.wikipedia.org/wiki/Gandalf
1,Gimli,Dwarf,1.0,10.0,2.0,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...
2,Frodo,Hobbit,4.0,2.0,5.0,https://en.wikipedia.org/wiki/Frodo_Baggins
3,Legolas,Elf,6.0,5.0,10.0,https://en.wikipedia.org/wiki/Legolas
4,Bilbo,Hobbit,4.0,1.0,5.0,https://en.wikipedia.org/wiki/Bilbo_Baggins
5,Sam,Hobbit,2.0,6.0,4.0,https://en.wikipedia.org/wiki/Samwise_Gamgee
6,Pippin,Hobbit,0.0,3.0,5.0,https://en.wikipedia.org/wiki/Peregrin_Took
7,Boromir,Man,0.0,8.0,3.0,https://en.wikipedia.org/wiki/Boromir
8,Aragorn,Man,2.0,7.0,9.0,https://en.wikipedia.org/wiki/Aragorn
9,Galadriel,Elf,9.0,2.0,10.0,https://en.wikipedia.org/wiki/Galadriel
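
The two approaches can be compared directly on a small example.  In this sketch (made-up frames with placeholder url strings), the rows are deliberately in different orders, and both ```join``` and ```merge``` still pair each name with the right url:

```python
import pandas as pd

stats = pd.DataFrame({'name': ['Gandalf', 'Gimli'], 'magic': [10, 1]})
links = pd.DataFrame({'name': ['Gimli', 'Gandalf'], 'url': ['gimli-page', 'gandalf-page']})

# join matches on the index, so move the key column into the index first.
joined = stats.set_index('name').join(links.set_index('name'))

# merge matches on a named column and leaves the default index in place.
merged = stats.merge(links, on='name')

print(joined)
print(merged)
```

Which one to use is largely a matter of taste: ```join``` is handy when the key is already the index, while ```merge``` saves the ```set_index``` step when the key is an ordinary column.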


Now let's add a few additional URLs:

In [72]:
urls_extras = pd.read_csv("data/lotr_wikipedia_extras.csv")

In [73]:
urls_extras

Unnamed: 0,name,url
0,Treebeard,https://en.wikipedia.org/wiki/Treebeard
1,Elrond,https://en.wikipedia.org/wiki/Elrond


And now let's use concat to add the new entries to the DataFrame.

In [74]:
urls_complete = pd.concat([urls,urls_extras])

In [75]:
urls_complete

Unnamed: 0,name,url
0,Gandalf,https://en.wikipedia.org/wiki/Gandalf
1,Gimli,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...
2,Frodo,https://en.wikipedia.org/wiki/Frodo_Baggins
3,Legolas,https://en.wikipedia.org/wiki/Legolas
4,Bilbo,https://en.wikipedia.org/wiki/Bilbo_Baggins
5,Sam,https://en.wikipedia.org/wiki/Samwise_Gamgee
6,Pippin,https://en.wikipedia.org/wiki/Peregrin_Took
7,Boromir,https://en.wikipedia.org/wiki/Boromir
8,Aragorn,https://en.wikipedia.org/wiki/Aragorn
9,Galadriel,https://en.wikipedia.org/wiki/Galadriel
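
One thing to watch when concatenating by rows: each DataFrame keeps its own row labels, so the combined index can contain repeats.  A minimal sketch with made-up frames (the url strings are placeholders):

```python
import pandas as pd

first = pd.DataFrame({'name': ['Gandalf', 'Gimli'], 'url': ['gandalf-page', 'gimli-page']})
extra = pd.DataFrame({'name': ['Treebeard'], 'url': ['treebeard-page']})

# Default: each frame keeps its own row labels, so label 0 appears twice.
kept = pd.concat([first, extra])

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([first, extra], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels don't matter for the merges below, which match on the ```name``` column, but they can surprise you if you later look rows up with ```.loc```.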


Now that we've got a complete (for our purposes) list of URLs, let's use that DataFrame and our original
one to demonstrate the different types of ```join```s.

By default, ```join``` uses a left join, which means that all the values from the "left"
side are used, whether or not there's a corresponding entry from the "right" side.  (```merge``` defaults to an inner join, so we pass ```how='left'``` explicitly.)  In the example 
below, note that the url value for "Lily" is "NaN":

In [76]:
df.merge(urls_wrong_order,on='name',how='left')

Unnamed: 0,name,race,magic,aggression,stealth,url
0,Gandalf,Maia,10.0,7.0,8.0,https://en.wikipedia.org/wiki/Gandalf
1,Gimli,Dwarf,1.0,10.0,2.0,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...
2,Frodo,Hobbit,4.0,2.0,5.0,https://en.wikipedia.org/wiki/Frodo_Baggins
3,Legolas,Elf,6.0,5.0,10.0,https://en.wikipedia.org/wiki/Legolas
4,Bilbo,Hobbit,4.0,1.0,5.0,https://en.wikipedia.org/wiki/Bilbo_Baggins
5,Sam,Hobbit,2.0,6.0,4.0,https://en.wikipedia.org/wiki/Samwise_Gamgee
6,Pippin,Hobbit,0.0,3.0,5.0,https://en.wikipedia.org/wiki/Peregrin_Took
7,Boromir,Man,0.0,8.0,3.0,https://en.wikipedia.org/wiki/Boromir
8,Aragorn,Man,2.0,7.0,9.0,https://en.wikipedia.org/wiki/Aragorn
9,Galadriel,Elf,9.0,2.0,10.0,https://en.wikipedia.org/wiki/Galadriel


The "opposite" of a left join is, perhaps unsurprisingly, a "right" join, in which
all the values from the "right" side are used, whether or not a corresponding
value from the "left" side exists. Note in the following example that "Lily" has
disappeared, and Treebeard and Elrond lack information about "race", "magic", "aggression", and "stealth".

In [79]:
df.merge(urls_complete,on='name',how='right')

Unnamed: 0,name,race,magic,aggression,stealth,url
0,Gandalf,Maia,10.0,7.0,8.0,https://en.wikipedia.org/wiki/Gandalf
1,Gimli,Dwarf,1.0,10.0,2.0,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...
2,Frodo,Hobbit,4.0,2.0,5.0,https://en.wikipedia.org/wiki/Frodo_Baggins
3,Legolas,Elf,6.0,5.0,10.0,https://en.wikipedia.org/wiki/Legolas
4,Bilbo,Hobbit,4.0,1.0,5.0,https://en.wikipedia.org/wiki/Bilbo_Baggins
5,Sam,Hobbit,2.0,6.0,4.0,https://en.wikipedia.org/wiki/Samwise_Gamgee
6,Pippin,Hobbit,0.0,3.0,5.0,https://en.wikipedia.org/wiki/Peregrin_Took
7,Boromir,Man,0.0,8.0,3.0,https://en.wikipedia.org/wiki/Boromir
8,Aragorn,Man,2.0,7.0,9.0,https://en.wikipedia.org/wiki/Aragorn
9,Galadriel,Elf,9.0,2.0,10.0,https://en.wikipedia.org/wiki/Galadriel


In addition to "left" and "right" joins, we have "outer" joins, which include
values from both the "left" and "right" DataFrames, regardless of whether
there are corresponding values in the other DataFrame.  Note that all of 
"Lily", "Treebeard" and "Elrond" are present in the following DataFrame:

In [81]:
df.merge(urls_complete,on='name',how='outer')

Unnamed: 0,name,race,magic,aggression,stealth,url
0,Gandalf,Maia,10.0,7.0,8.0,https://en.wikipedia.org/wiki/Gandalf
1,Gimli,Dwarf,1.0,10.0,2.0,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...
2,Frodo,Hobbit,4.0,2.0,5.0,https://en.wikipedia.org/wiki/Frodo_Baggins
3,Legolas,Elf,6.0,5.0,10.0,https://en.wikipedia.org/wiki/Legolas
4,Bilbo,Hobbit,4.0,1.0,5.0,https://en.wikipedia.org/wiki/Bilbo_Baggins
5,Sam,Hobbit,2.0,6.0,4.0,https://en.wikipedia.org/wiki/Samwise_Gamgee
6,Pippin,Hobbit,0.0,3.0,5.0,https://en.wikipedia.org/wiki/Peregrin_Took
7,Boromir,Man,0.0,8.0,3.0,https://en.wikipedia.org/wiki/Boromir
8,Aragorn,Man,2.0,7.0,9.0,https://en.wikipedia.org/wiki/Aragorn
9,Galadriel,Elf,9.0,2.0,10.0,https://en.wikipedia.org/wiki/Galadriel
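
Here is a small sketch of an outer join, again on hypothetical toy data (```stats``` and ```links``` are assumed names, not the lab's DataFrames). It also shows a side effect worth knowing about: an integer column that picks up missing values during a merge is converted to floating point, which is one way whole numbers can end up displayed as ```10.0```.

```python
import pandas as pd

# Hypothetical toy data (assumed for illustration; not the lab's DataFrames)
stats = pd.DataFrame({
    "name": ["Gandalf", "Frodo", "Lily"],
    "magic": [10, 4, 1],          # integer column
})
links = pd.DataFrame({
    "name": ["Gandalf", "Frodo", "Treebeard"],
    "url": ["url_gandalf", "url_frodo", "url_treebeard"],
})

# how='outer' keeps the union of the keys from both DataFrames
outer_joined = stats.merge(links, on="name", how="outer")
print(outer_joined)

# "magic" was integer before the merge; the NaN for Treebeard makes it float
print(stats["magic"].dtype, "->", outer_joined["magic"].dtype)
```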


Finally, there are "inner" joins, which include only those values that exist in both the "left" and "right" DataFrames:

In [82]:
df.merge(urls_complete,on='name',how='inner')

Unnamed: 0,name,race,magic,aggression,stealth,url
0,Gandalf,Maia,10.0,7.0,8.0,https://en.wikipedia.org/wiki/Gandalf
1,Gimli,Dwarf,1.0,10.0,2.0,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...
2,Frodo,Hobbit,4.0,2.0,5.0,https://en.wikipedia.org/wiki/Frodo_Baggins
3,Legolas,Elf,6.0,5.0,10.0,https://en.wikipedia.org/wiki/Legolas
4,Bilbo,Hobbit,4.0,1.0,5.0,https://en.wikipedia.org/wiki/Bilbo_Baggins
5,Sam,Hobbit,2.0,6.0,4.0,https://en.wikipedia.org/wiki/Samwise_Gamgee
6,Pippin,Hobbit,0.0,3.0,5.0,https://en.wikipedia.org/wiki/Peregrin_Took
7,Boromir,Man,0.0,8.0,3.0,https://en.wikipedia.org/wiki/Boromir
8,Aragorn,Man,2.0,7.0,9.0,https://en.wikipedia.org/wiki/Aragorn
9,Galadriel,Elf,9.0,2.0,10.0,https://en.wikipedia.org/wiki/Galadriel
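
As a quick check of the inner join behaviour described above, the following sketch uses hypothetical toy data (```stats``` and ```links``` are assumed names, not the lab's DataFrames). Only names found in both DataFrames survive, so the result has no missing values introduced by the merge.

```python
import pandas as pd

# Hypothetical toy data (assumed for illustration; not the lab's DataFrames)
stats = pd.DataFrame({
    "name": ["Gandalf", "Frodo", "Lily"],
    "magic": [10, 4, 1],
})
links = pd.DataFrame({
    "name": ["Gandalf", "Frodo", "Treebeard"],
    "url": ["url_gandalf", "url_frodo", "url_treebeard"],
})

# how='inner' keeps only the keys present in BOTH DataFrames
inner_joined = stats.merge(links, on="name", how="inner")
print(inner_joined)

# Neither "Lily" (left only) nor "Treebeard" (right only) appears
print(inner_joined.isna().any().any())
```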


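Sometimes it's nice to know how a particular row got added to the resulting DataFrame.  Using ```indicator=True```
adds a ```_merge``` column that records whether each row came from the left DataFrame only (```left_only```), the right DataFrame only (```right_only```), or ```both```: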

In [83]:
df.merge(urls_complete,how='outer',indicator=True)

Unnamed: 0,name,race,magic,aggression,stealth,url,_merge
0,Gandalf,Maia,10.0,7.0,8.0,https://en.wikipedia.org/wiki/Gandalf,both
1,Gimli,Dwarf,1.0,10.0,2.0,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...,both
2,Frodo,Hobbit,4.0,2.0,5.0,https://en.wikipedia.org/wiki/Frodo_Baggins,both
3,Legolas,Elf,6.0,5.0,10.0,https://en.wikipedia.org/wiki/Legolas,both
4,Bilbo,Hobbit,4.0,1.0,5.0,https://en.wikipedia.org/wiki/Bilbo_Baggins,both
5,Sam,Hobbit,2.0,6.0,4.0,https://en.wikipedia.org/wiki/Samwise_Gamgee,both
6,Pippin,Hobbit,0.0,3.0,5.0,https://en.wikipedia.org/wiki/Peregrin_Took,both
7,Boromir,Man,0.0,8.0,3.0,https://en.wikipedia.org/wiki/Boromir,both
8,Aragorn,Man,2.0,7.0,9.0,https://en.wikipedia.org/wiki/Aragorn,both
9,Galadriel,Elf,9.0,2.0,10.0,https://en.wikipedia.org/wiki/Galadriel,both
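
To see values other than ```both``` in the ```_merge``` column, the two DataFrames need some names that do not match. The sketch below uses hypothetical toy data (```stats``` and ```links``` are assumed names, not the lab's DataFrames) and then filters on ```_merge``` to find the unmatched rows.

```python
import pandas as pd

# Hypothetical toy data (assumed for illustration; not the lab's DataFrames)
stats = pd.DataFrame({
    "name": ["Gandalf", "Frodo", "Lily"],
    "magic": [10, 4, 1],
})
links = pd.DataFrame({
    "name": ["Gandalf", "Frodo", "Treebeard"],
    "url": ["url_gandalf", "url_frodo", "url_treebeard"],
})

# indicator=True adds a "_merge" column describing where each row came from
merged = stats.merge(links, on="name", how="outer", indicator=True)
print(merged)

# Map each name to its origin: 'both', 'left_only', or 'right_only'
origin = dict(zip(merged["name"], merged["_merge"].astype(str)))

# Rows that found no partner in the other DataFrame
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```

Filtering on ```_merge``` like this can be a handy way to spot names that are spelled differently in the two sources.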


You'll note that we called ```merge``` as a method of one DataFrame and passed in the other DataFrame as an argument.
You can also call the ```merge``` function from pandas directly and pass it the two DataFrames you are merging; the first argument is the "left" DataFrame and the second is the "right":

In [84]:
pd.merge(df,urls_complete,how='outer',indicator=True)

Unnamed: 0,name,race,magic,aggression,stealth,url,_merge
0,Gandalf,Maia,10.0,7.0,8.0,https://en.wikipedia.org/wiki/Gandalf,both
1,Gimli,Dwarf,1.0,10.0,2.0,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...,both
2,Frodo,Hobbit,4.0,2.0,5.0,https://en.wikipedia.org/wiki/Frodo_Baggins,both
3,Legolas,Elf,6.0,5.0,10.0,https://en.wikipedia.org/wiki/Legolas,both
4,Bilbo,Hobbit,4.0,1.0,5.0,https://en.wikipedia.org/wiki/Bilbo_Baggins,both
5,Sam,Hobbit,2.0,6.0,4.0,https://en.wikipedia.org/wiki/Samwise_Gamgee,both
6,Pippin,Hobbit,0.0,3.0,5.0,https://en.wikipedia.org/wiki/Peregrin_Took,both
7,Boromir,Man,0.0,8.0,3.0,https://en.wikipedia.org/wiki/Boromir,both
8,Aragorn,Man,2.0,7.0,9.0,https://en.wikipedia.org/wiki/Aragorn,both
9,Galadriel,Elf,9.0,2.0,10.0,https://en.wikipedia.org/wiki/Galadriel,both
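
The two calling styles should give the same result. A minimal check, using hypothetical toy data (```stats``` and ```links``` are assumed names, not the lab's DataFrames):

```python
import pandas as pd

# Hypothetical toy data (assumed for illustration; not the lab's DataFrames)
stats = pd.DataFrame({
    "name": ["Gandalf", "Frodo", "Lily"],
    "magic": [10, 4, 1],
})
links = pd.DataFrame({
    "name": ["Gandalf", "Frodo", "Treebeard"],
    "url": ["url_gandalf", "url_frodo", "url_treebeard"],
})

# Method style: the DataFrame you call merge on is the "left" side
via_method = stats.merge(links, how="outer", indicator=True)

# Function style: the first argument is "left", the second is "right"
via_function = pd.merge(stats, links, how="outer", indicator=True)

# With no on= argument, pandas joins on the columns the two share ("name" here)
same = via_method.equals(via_function)
print(same)
```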


### <font color="magenta">Q7: (3 points) Join the ```udf``` DataFrame (that contains aliases) to the ```df``` DataFrame using an appropriate merge.</font>

In [86]:
udf.merge(df, on='name', how='right')

Unnamed: 0,name,url,aliases,race,magic,aggression,stealth
0,Gimli,https://en.wikipedia.org/wiki/Gimli_(Middle-ea...,Elf-friend Lockbearer Lord of the Glittering C...,Dwarf,1.0,10.0,2.0
1,Frodo,https://en.wikipedia.org/wiki/Frodo_Baggins,"Mr. Underhill,Maura Labingi",Hobbit,4.0,2.0,5.0
2,Legolas,https://en.wikipedia.org/wiki/Legolas,"Greenleaf, (Legolas translatedinto English)",Elf,6.0,5.0,10.0
3,Bilbo,https://en.wikipedia.org/wiki/Bilbo_Baggins,Bilba Labingi,Hobbit,4.0,1.0,5.0
4,Sam,https://en.wikipedia.org/wiki/Samwise_Gamgee,"Samwise Gardner, Sam, Samwise the Brave,Banazî...",Hobbit,2.0,6.0,4.0
5,Pippin,https://en.wikipedia.org/wiki/Peregrin_Took,"Pippin, Pip,""Ernil i Pheriannath""Thain Peregri...",Hobbit,0.0,3.0,5.0
6,Boromir,https://en.wikipedia.org/wiki/Boromir,"Captain of the White Tower,High Warden of the ...",Man,0.0,8.0,3.0
7,Galadriel,https://en.wikipedia.org/wiki/Galadriel,AlatárielAltárielArtanisNerwen,Elf,9.0,2.0,10.0
8,Meriadoc,https://en.wikipedia.org/wiki/Meriadoc_Brandybuck,"Merry,Kalimac Brandagamba,Meriadoc the Magnifi...",Hobbit,0.0,4.0,6.0
9,Gandalf,,,Maia,10.0,7.0,8.0


# END OF NOTEBOOK
Please remember to submit your notebook in .ipynb and .html formats.