**PySDS Week 2. Lecture 4. V.2**
Author: Bernie Hogan

# Week 2 Day 4: Merging and grouping data 

In this lecture we are going to focus primarily on exercises where you must integrate different data sources together in a single table for analysis. 

Learning goals: 
- Understand merging / sorting
- Be able to read and write a table from iPython
- Understand one-to-many and many-to-many relationships. 
- Understand grouping relationships

# Section 1. A review of adding data to a DataFrame

First, let's revisit the merging of data through append and concatenate and then move on to key-based merging. 

First we will create a few small DataFrames from lists of lists, then we will combine them. We will do this in two ways: 
1. The same columns (adding rows) 
2. The same rows (adding columns) 

## Adding rows 
When adding data where we have the same columns, it is typically because we have new rows. This happens when we are processing data and want to add rows one at a time as the data comes in. Imagine you have a stream of tweets and you add a new tweet to the existing DataFrame. 

*Things to remember:* 
- DataFrames have rows, and each row has an index. 
- The index can have a user-defined value, but it is assigned in numerical sequence by default. 
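
As a minimal sketch of these two points (the frame and label names here are made up for illustration), the following shows the default numeric index next to a user-defined one:

```python
import pandas as pd

# Default: rows are labelled 0, 1, 2, ... in sequence.
defaultFrame = pd.DataFrame({"tweet": ["hello", "world", "again"]})
print(list(defaultFrame.index))

# User-defined: any list of labels of the right length can be the index.
labelledFrame = pd.DataFrame({"tweet": ["hello", "world", "again"]},
                             index=["t1", "t2", "t3"])
print(labelledFrame.loc["t2", "tweet"])
```

With a user-defined index, `.loc` looks rows up by label rather than by position.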

In [None]:
from pandas import Series, DataFrame
import pandas as pd 
import numpy as np
from IPython.display import display
%pylab inline 

Below we will create three small data frames with different values. We will use these so that you can watch where each of the values go when you are doing your merging. 

In [None]:

testList1 = [["a","b","c","d"],["g","h","j","k"]]
testFrame1 = pd.DataFrame(testList1)
print(testFrame1)

print()

testList2 = [["m","n","o","p"],["s","t","u","v"]]
testFrame2 = pd.DataFrame(testList2)
print(testFrame2)

print()

testList3 = [["x","y","z","aa","bb","cc"],["e","f","q","w","ww","www"]]
testFrame3 = pd.DataFrame(testList3)
print(testFrame3)




### Attempt 1: Adding the frames together ###

In the first case, see what happens when we add the frames together with `+`. Because they have the same dimensions, pandas adds them cell by cell, which for strings means concatenating the text within each cell. If the frames do not have the same shape, the cells without a partner are returned as missing data. See the results below. 

In [None]:
exData01 = testList1 + testList2
display(exData01)

print()

# Notice the difference between adding the lists and adding the frames. 
# The DataFrame performed an element-wise (Hadamard-style) operation, i.e. it matched cell-to-cell.
exFrame01 = testFrame1 + testFrame2
display(exFrame01)

print()

# Notice that these frames have different shapes: columns found in only one frame come back as NaN. 
exFrame02 = testFrame1 + testFrame3
display(exFrame02)
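
It may help to see that this matching is done on labels rather than on position. The sketch below uses two hypothetical frames (not the test frames above) whose indices only partly overlap:

```python
import pandas as pd

leftFrame = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
rightFrame = pd.DataFrame({"a": [10, 20]}, index=["y", "z"])

# Only the label "y" exists in both frames, so only that row gets a value.
# Rows "x" and "z" have no partner and come back as NaN.
summed = leftFrame + rightFrame
print(summed)
```

The result has the union of the row labels, which is the same idea we will meet again later as an outer join.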

### Attempt 2: Concatenating frames ###
In the second case, we are going to concatenate the data. The first way we will be doing this is by row. Recall what happens to the indices by default.

In [None]:
testFrame4 = pd.concat( [testFrame1, testFrame2] )
testFrame4

In [None]:
# To really understand the method, it's useful to read the help file. 
help(pd.concat)

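Notice that the row labels 0 and 1 each appear twice above, because each frame kept its own index. Below we fix this using the *ignore_index = True* argument.  

Notice also that pd.concat and DataFrame.append accomplish the same thing but are not implemented the same way. Generally concat is faster. DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent installation only pd.concat is available.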

In [None]:
testFrame4 = pd.concat([testFrame1, testFrame2],ignore_index=True)
print(testFrame4)

print()

# DataFrame.append only exists in pandas versions before 2.0.
if hasattr(testFrame1, "append"):
    testFrame4 = testFrame1.append(testFrame2,ignore_index=True)
print(testFrame4)

Now if we want to add these as **columns rather than rows**, we can use the *axis=1* (as opposed to the default axis=0 argument)

Also notice that append has no equivalent option: it only adds rows.

In [None]:
testFrame4 = pd.concat([testFrame1, testFrame2],axis=1)
testFrame4.index = ['top', 'bottom']
display(testFrame4)

print()

The issue with not using unique indices is that you can unintentionally edit the wrong cell. In the frame above we see that there are two columns named 0, so that when you want to change data for one but not the other, you run into trouble. See below: 

In [None]:
testFrame4.loc["top",0] = "test"
display(testFrame4)

If we want to preserve that index for some reason, we can actually use a multi-index. This is where there are subindices for the dataframe. This is also relevant when you are grouping data, as the grouped data can have a multi-index (on either the rows or columns or both).

In [None]:
testFrame4 = pd.concat([testFrame1, testFrame2],axis=1,keys=["left","right"])
testFrame4.index = ['top', 'bottom']
display(testFrame4)

print()

In [None]:
testFrame4.T

In [None]:
print(testFrame4["left",2])

print(testFrame4["left"][0]["top"])
try: 
    print(testFrame4["left",0,'top'])
except KeyError:
    print("The first bracket is for the column index only.")
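
If you want the row and the column in a single lookup, `.loc` accepts the row label first and the column as a tuple. The following is a small sketch on two-column stand-ins for the frames above (`frameA`, `frameB` and `wide` are illustrative names):

```python
import pandas as pd

frameA = pd.DataFrame([["a", "b"], ["g", "h"]], index=["top", "bottom"])
frameB = pd.DataFrame([["m", "n"], ["s", "t"]], index=["top", "bottom"])
wide = pd.concat([frameA, frameB], axis=1, keys=["left", "right"])

# Row label first, then the column as an (outer, inner) tuple.
cell = wide.loc["top", ("right", 1)]
print(cell)

# xs() pulls one inner column label out of every outer group.
firstColumns = wide.xs(0, axis=1, level=1)
print(firstColumns)
```

Because `.loc` names both axes explicitly, it avoids the chained brackets used above.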

If we want to **add a single series**, then we have to be careful about 
how it is structured. Notice in the following cells how this can go right and how it can go wrong. 

In [None]:
testSeries1 = pd.Series(["alpha","bravo","charlie","delta"],name="example")

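# pd.concat stands in for DataFrame.append, which was removed in pandas 2.0.
# to_frame().T turns the Series into a one-row frame labelled "example".
testFrame5 = pd.concat([testFrame1, testSeries1.to_frame().T])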
testFrame5

In [None]:
testSeries1 = pd.Series({2:"bravo",3:"charlie",4:"delta",1:"alpha"},name="example")

# Same row-wise concat as before (DataFrame.append was removed in pandas 2.0).
testFrame5 = pd.concat([testFrame1, testSeries1.to_frame().T])
testFrame5

In [None]:
# Ooops! It's "Zero" indexed

testSeries1 = pd.Series({0:"alpha",1:"bravo",2:"charlie",3:"delta"},name="example")

# Same row-wise concat as before (DataFrame.append was removed in pandas 2.0).
testFrame5 = pd.concat([testFrame1, testSeries1.to_frame().T])
testFrame5

## Adding Columns 

Each DataFrame has an index and a series of columns. To name the rows, you can assign a list to DataFrame.index. To name the columns, you can assign a list to DataFrame.columns. The list cannot be shorter or longer than the corresponding axis of the data frame, otherwise you will receive a ValueError. 

In [None]:
testFrame5.columns = ["first","second","third","fourth"]
display(testFrame5)

print(len(testFrame5.columns))

try:
    testFrame5.columns = ["1first","2second","3third"]
    display(testFrame5)
except ValueError:
    print("ValueError: Length mismatch on columns")
    

testFrame5.index = ["first_row","second_row","third_row"]
display(testFrame5)

try:
    testFrame5.index = ["first_row","second_row","third_row","fourth_row"]
    display(testFrame5)
except ValueError:
    print("ValueError: Length mismatch on rows")
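
When only some labels need changing, one alternative (shown here as a sketch on a small made-up frame) is `rename`, which takes a mapping from old labels to new ones and so has no length to get wrong:

```python
import pandas as pd

frame = pd.DataFrame([["a", "b"], ["g", "h"]])

# Only the labels mentioned in the mappings are changed.
# rename() returns a new frame and leaves the original untouched.
renamed = frame.rename(columns={0: "first"}, index={1: "second_row"})
print(renamed)
```

Labels that are not in the mapping, such as column 1 and row 0 here, keep their original names.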
    

    

# Section 2. One-to-many relationships

One to many relationships are really common in data wrangling. For example, you have people who are in states, and you have state level data on unemployment. How do you create a new table that includes these state-level indicators? This might be useful for a regression (particularly a popular class of regression models called 'hierarchical linear models'). 

In the examples below we will use the countries of the United Kingdom as one level in our data and then people as the other level. So sometimes we might want to see the 'average' age of people in a given country. Other times we might want to merge two data sets where some countries appear in one set but not in the other. Below, we can notice that while both our sets contain the four countries of the UK, one will contain Jersey, the other will contain Isle of Wight. Do we want to get rid of Jersey or Isle of Wight? Do we want to keep both? Thinking through how you merge data will answer these questions.

In [None]:
d = {"Wales":3,"England":53,"Scotland":5,"Northern Ireland":2,"Jersey":.1}
l = list(zip(d.keys(),d.values()))

countryFrame = pd.DataFrame(pd.Series(d),columns=["Population"])
display(countryFrame)

In [None]:
people = [["Alice",32,"Wales"],
          ["Bob",35,"Northern Ireland"],
          ["Charlie",21,"England"],
          ["Diane",45,"Northern Ireland"],
          ["Ellen",21,"Scotland"],
          ["Fong",50,"England"],
          ["Grant",28,"Scotland"],
          ["Harry",36,"England"],
          ["Idris",40,"Isle of Wight"]]

peopleFrame = pd.DataFrame(people,columns=["Name","Age","Country"])
# help(peopleFrame.merge)
display(peopleFrame)

Notice now that in the first data set, the countries were the indices. In the second data set, country was a column in the data. 

So when we merge, we will want to let the program know that for one DataFrame, which we will call the 'left', we will want to merge on the column that is the country. For another DataFrame, which we will call the 'right', we will want to merge on the index. How do we know which is left and which is right? It's literally about reading the statement from left to right: 

~~~ python
new_dataframe = <LEFT_DATAFRAME>.merge(<RIGHT_DATAFRAME>, left_on=<column> {or left_index=True}, 
                                                          right_on=<column> {or right_index=True}) 
~~~

In [None]:
mergeFrame = peopleFrame.merge(countryFrame,left_on="Country",right_index=True)
display(mergeFrame)

print("\n\nNow let's do the same, but switch left and right\n\n")

mergeFrame = countryFrame.merge(peopleFrame,left_index=True,right_on="Country")
display(mergeFrame)

Merging / Joining is hard to get your head around because there are many choices to make and lots of potential missteps. 

Here are some steps: 
1. Identify the tables to merge. 
2. Select which is left: what is the key? index or column.
3. Select which is right: what is the key? index or column.
4. What should be preserved? All the data? All from the left side? All from the right? Or all in common?
5. Should the columns in the merged data be given different names after the merge? 

We have covered the first three steps above. Now let's cover step 4. This is called the 'join'. There are four basic joins here. The left and right joins are mirror images of each other; which one you get depends only on the order of the frames. 

- Left: Keep every key from the left frame; rows from the right are included only where they match. 
- Right: Keep every key from the right frame; rows from the left are included only where they match.
- Inner: The _intersection_ of both frames.
- Outer: The _union_ of both frames. 
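
To make the four options concrete, here is a minimal sketch on two tiny hypothetical frames (`leftFrame`, `rightFrame` and the column names are made up for illustration). It records which keys survive each kind of join:

```python
import pandas as pd

leftFrame = pd.DataFrame({"key": ["a", "b", "b"], "left_val": [1, 2, 3]})
rightFrame = pd.DataFrame({"key": ["b", "c"], "right_val": [10, 20]})

keysKept = {}
rowCounts = {}
for how in ["left", "right", "inner", "outer"]:
    merged = leftFrame.merge(rightFrame, on="key", how=how)
    keysKept[how] = sorted(merged["key"].unique())
    rowCounts[how] = len(merged)
    print(how, keysKept[how], "rows:", rowCounts[how])
```

Notice that "b" appears twice on the left and once on the right, so every join that keeps "b" keeps two rows for it. That is the one-to-many relationship at work.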

Below is a very small crash course in "Union" and "Intersection". We first will create a set. A set is a data structure where there is only one element with any value. See below how a set collapses the repeated 5s into a single 5. 

In [None]:
x = [1,2,3,4,5,5,5,5,5,6]
print(x)
print(set(x))

Because sets only contain unique discrete elements, we can then talk about set membership. That is, we can ask if things are in one or both sets. If they are in one set but not the other, we can discover this with set subtraction. But most importantly for merging we can ask for the union of items (i.e. all of the items from either set) or the intersection of items (i.e. only the items _in common_).  

In [None]:
setOdd = set([1,3,5,7,9]) # the first five odd numbers 
setCount = set([1,2,3,4,5]) # the first five numbers

print("setCount:\n%s" % setCount)
print("setOdd:\n%s" % setOdd)

print("Union: all of the elements from both")
print(setOdd.union(setCount))

print("Intersection: all of the elements in common")
print(setOdd.intersection(setCount))

print("Set subtraction. SetCount minus setOdd:")
print(setCount - setOdd)

print("Set subtraction. SetOdd minus setCount:")
print(setOdd - setCount)
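
Python sets also support operator forms of these methods. The sketch below repeats the two sets so that it runs on its own. Note that `+` is not defined for sets; combining two sets is done with `|`.

```python
setOdd = {1, 3, 5, 7, 9}
setCount = {1, 2, 3, 4, 5}

unionSet = setOdd | setCount       # same as setOdd.union(setCount)
commonSet = setOdd & setCount      # same as setOdd.intersection(setCount)
eitherNotBoth = setOdd ^ setCount  # symmetric difference: in exactly one set

print(unionSet)
print(commonSet)
print(eitherNotBoth)
```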

Now let's return to the data we analyzed above and explore what happens when we join in different ways. 

### Outer Join

The outer join is the union of two sets. In one of our sets we have Jersey with some population data and Idris from Isle of Wight is in the other data. Here, the outer join on country means that both are included and the columns that aren't available are simply given missing values. 

Also notice below that the index is now pretty messed up. This is because the merge does not reset the index for us. If you uncomment the line below to reset the index, it fixes this and numbers all of the rows in sequence. 

In [None]:
mergeFrame = peopleFrame.merge(countryFrame,left_on="Country",right_index=True, how='outer')
# mergeFrame.reset_index(inplace=True) #without the inplace it returns a new frame
display(mergeFrame)
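
One way to check where each row of an outer join came from is the `indicator=True` argument of `merge`, which adds a `_merge` column. The following is a sketch on cut-down, hypothetical versions of the two frames (`peopleSmall` and `countrySmall` are illustrative names):

```python
import pandas as pd

peopleSmall = pd.DataFrame({"Name": ["Alice", "Idris"],
                            "Country": ["Wales", "Isle of Wight"]})
countrySmall = pd.DataFrame({"Population": [3, 0.1]},
                            index=["Wales", "Jersey"])

# _merge is "both", "left_only" or "right_only" for each row.
checked = peopleSmall.merge(countrySmall, left_on="Country", right_index=True,
                            how="outer", indicator=True)
print(checked[["Name", "Country", "_merge"]])
```

Rows marked `left_only` or `right_only` are exactly the ones that an inner join would discard.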

### Inner Join

Inner join is the equivalent to a set intersection. We get rid of the keys where there is no match in the other table. So we lose both Isle of Wight and Jersey. 

In [None]:
mergeFrame = peopleFrame.merge(countryFrame,left_on="Country",right_index=True, how='inner')
mergeFrame.reset_index(inplace=True)
display(mergeFrame)
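
A detail worth knowing about `reset_index`: by default it keeps the old labels in a new column called `index`. If you do not need them, `drop=True` discards them. A small sketch on a made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"Name": ["Alice", "Bob"]}, index=[7, 3])

kept = frame.reset_index()              # old labels saved in a column named "index"
dropped = frame.reset_index(drop=True)  # old labels discarded

print(list(kept.columns))
print(list(dropped.columns))
```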

### Left Join

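Below we will merge with ```how='left'```. The left frame is peopleFrame, since it appears to the left of ```.merge``` in the statement. Every person is kept, so Idris remains with a missing Population, while Jersey, which has no matching person, is dropped. 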

In [None]:
mergeFrame = peopleFrame.merge(countryFrame,left_on="Country",right_index=True, how='left')
mergeFrame

### Right Join

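This is the mirror image of the left join: every row of the right frame (countryFrame) is kept. Jersey now appears with a missing Name and Age, while Idris, whose Country has no match in countryFrame, is dropped. 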

In [None]:
mergeFrame = peopleFrame.merge(countryFrame,left_on="Country",right_index=True, how='right')
mergeFrame

## Multiple columns with the same name 

You'll notice that in the above example, countryFrame is really just one column of data. Let's add more columns to it, and see what happens to our merging. You will notice that if we have two columns named Age, one on either side, then when we merge they are given default suffixes of ```_x``` and ```_y```. 

In [None]:
countryFrame = pd.DataFrame([["Wales",3,28,961],
                            ["England",53,51,1500],
                            ["Scotland",5,46,1175],
                            ["Northern Ireland",2,27,97],
                            ["Jersey",.1,58,808]],
                            columns=["Country","Population","Age","Income"])
display(countryFrame)

In [None]:
mergeFrame = peopleFrame.merge(countryFrame, on="Country")
mergeFrame

To give the new data more descriptive names than x and y you can use the argument ```suffixes``` with two elements in a list: the suffix for the left and the suffix for the right. 

In [None]:
mergeFrame = peopleFrame.merge(countryFrame,on="Country",how="outer",suffixes=["","_cntry"])
display(mergeFrame)

# Section 3. Grouping Data 

## Broadcasting Aggregations back to the original data

Sometimes we want to aggregate the data. Imagine we have an average age for each country and we want to see how it compares to the average age in our sample. Since we have many countries, we don't really want to do this for each one individually. Instead, we can group our sample data. 

In the example below, we have two people from Northern Ireland, Bob who is 35 and Diane, who is 45. The average age of them would be 40. So if we want to see this (and the average age across all countries) we can group the data. 

To note, there are many grouping operations. The common ones are: 
- mean 
- std (standard deviation)
- min
- max
- count
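
Several of these can be computed in one pass with `agg`. As a sketch, on a three-person hypothetical sample rather than the merged frame:

```python
import pandas as pd

sample = pd.DataFrame({"Country": ["Northern Ireland", "Northern Ireland", "Wales"],
                       "Age": [35, 45, 32]})

# One row per country, one column per aggregation.
summary = sample.groupby("Country")["Age"].agg(["mean", "std", "min", "max", "count"])
print(summary)
```

Wales has a single person in this sample, so its standard deviation comes back as missing.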


In [None]:
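# Name is text, so drop it before taking means and standard deviations;
# recent pandas versions raise a TypeError rather than skipping it.
numericFrame = mergeFrame.drop(columns="Name")

print("Average per group")
groupFrame = numericFrame.groupby('Country').mean()
display(groupFrame)

print("Standard deviation per group")
groupFrame = numericFrame.groupby('Country').std()
display(groupFrame)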

print("Maximum per group")
groupFrame = mergeFrame.groupby('Country').max()
display(groupFrame)

print("Minimum per group:")
groupFrame = mergeFrame.groupby('Country').min()
display(groupFrame)

print("Count per group:")
groupFrame = mergeFrame.groupby('Country').count()
display(groupFrame)

In [None]:
help(mergeFrame.groupby)

You'll notice that the aggregation is applied to every column it can handle. This is alright, but if we want to merge these values back into the original data set, this will be a nuisance: the mean Population within a country is the same as Population, since every row for that country carries the same value from the country table to begin with. So, we can group on a slice of the dataframe. To slice the dataframe we have to query it in the following way. 

    DATAFRAME[ ['VAR1','VAR2'] ]
    
Yes, that's a list within a list. See below:    

In [None]:
mergeFrame[["Country","Age","Income"]]

In [None]:
groupFrame = mergeFrame[["Country","Age"]].groupby('Country').mean()

groupFrame

We have just one issue now: if we merge the average age back in, there is already a variable called Age. We can rename it before we merge it back in, but it is easier to add a prefix when we do the original grouping:

    DATAFRAME.groupby(KEY).mean().add_prefix("mean_")
    

In [None]:
groupFrame = mergeFrame[["Country","Age"]].groupby('Country').mean().add_prefix("mean_")
groupFrame

In [None]:
newFrame = pd.merge(mergeFrame, groupFrame,left_on="Country",right_index=True)
display(newFrame)
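
An alternative to grouping and then merging is `transform`, which computes the group value and broadcasts it straight back onto the original rows. The sketch below uses a small hypothetical sample rather than the merged frame:

```python
import pandas as pd

sample = pd.DataFrame({"Name": ["Bob", "Diane", "Alice"],
                       "Country": ["Northern Ireland", "Northern Ireland", "Wales"],
                       "Age": [35, 45, 32]})

# transform() returns one value per original row, aligned on the original index.
sample["mean_Age"] = sample.groupby("Country")["Age"].transform("mean")
print(sample)
```

Bob and Diane both receive 40, the mean for Northern Ireland, without a separate merge step. The group-then-merge route above remains useful when you also want the grouped table on its own.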