**PySDS Week 2 Lecture 4. V.1**
Last author: B. Hogan

Week 2 Day 4: Merging and grouping data
=================

This week we are going to focus primarily on exercises where you must integrate different data sources together in a single table for analysis. 

Learning goals: 
- Understand merging / sorting
- Be able to read and write a table from iPython
- Understand one-to-many and many-to-many relationships. 
- Understand grouping relationships

# Section 1. A review of adding data to a DataFrame

First, let's revisit the merging of data through append and concatenate and then move on to key-based merging. 

We will create two DataFrames from lists of lists and then combine them. We will do this in two ways: 
1. The same columns (adding rows) 
2. The same rows (adding columns) 

## Adding rows 
When adding data where we have the same columns, it is typically because we have new rows. This happens when we are processing data and want to add rows one at a time as the data comes in. You have seen this already.

*Things to remember:* 
- DataFrames have rows, and each row has an index. 
- The index can have a user-defined value, but it is assigned in numerical sequence by default. 

In [None]:
from pandas import Series, DataFrame
import pandas as pd 
import numpy as np
from IPython.display import display
%pylab inline 

In [None]:

testData1 = [["a","b","c","d"],["g","h","j","k"]]
testFrame1 = pd.DataFrame(testData1)
print(testFrame1)

print()

testData2 = [["m","n","o","p"],["s","t","u","v"]]
testFrame2 = pd.DataFrame(testData2)
print(testFrame2)

print()

testData3 = [["x","y","z","aa","bb","cc"],["e","f","q","w","ww","www"]]
testFrame3 = pd.DataFrame(testData3)
print(testFrame3)




### Attempt 1: Adding the frames together ###

In the first case, see what happens when we add the frames together. Because they have the same dimensions, pandas adds them cell by cell, which for strings means concatenating the contents of each pair of cells. If the frames do not have the same shape, cells without a partner in the other frame become NaN. See the results below. 

In [None]:
exData01 = testData1 + testData2
display(exData01)

print()

# Notice the difference between adding the lists and adding the frames. 
exFrame01 = testFrame1 + testFrame2
display(exFrame01)

print()

# Notice that columns present in only one frame come out as NaN when the shapes differ. 
exFrame02 = testFrame1 + testFrame3
display(exFrame02)
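The alignment rule behind these results can be sketched on its own. The example below is a minimal illustration, not part of the lecture data: the frame names `small` and `wide` are made up, and it uses numbers rather than letters so the arithmetic is easy to check.

```python
import pandas as pd

# Two hypothetical frames; `wide` has two columns that `small` lacks.
small = pd.DataFrame([[1, 2], [3, 4]])
wide = pd.DataFrame([[10, 20, 30, 40], [50, 60, 70, 80]])

# Addition first aligns the frames on row and column labels,
# then combines the cells that share a label.
total = small + wide
print(total)

# Columns 0 and 1 exist in both frames, so their cells are summed.
# Columns 2 and 3 exist only in `wide`, so every cell there is NaN.
print(total[2].isna().all())
```

The point to take away is that `+` matches cells by label, not by position, which is why mismatched shapes give missing values rather than an error.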

### Attempt 2: Concatenating frames ###
In the second case, we are going to concatenate the data. The first way we will be doing this is by row. Recall what happens to the indices by default.

In [None]:
testFrame4 = pd.concat([testFrame1, testFrame2])
testFrame4

In [None]:
# To really understand the method, it's useful to read the help file. 
help(pd.concat)

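Below we fix the repeated index values using the *ignore_index=True* argument.  

Notice also that pd.concat and DataFrame.append accomplish the same thing but are not implemented the same way. Generally concat is faster. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions only pd.concat will run.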

In [None]:
testFrame4 = pd.concat([testFrame1, testFrame2],ignore_index=True)
print(testFrame4)

print()

# Requires pandas < 2.0, where DataFrame.append still exists.
testFrame4 = testFrame1.append(testFrame2,ignore_index=True)
print(testFrame4)
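To see why the repeated index matters, here is a small sketch with two made-up frames (`first` and `second` are illustrative names, not from the lecture). It assumes nothing beyond `pd.concat` itself.

```python
import pandas as pd

first = pd.DataFrame([["a", "b"], ["c", "d"]])
second = pd.DataFrame([["e", "f"], ["g", "h"]])

# Default: each frame keeps its own row labels, so 0 and 1 appear twice.
stacked = pd.concat([first, second])
print(list(stacked.index))        # [0, 1, 0, 1]

# A repeated label means .loc[0] returns two rows instead of one.
rows_for_zero = stacked.loc[0]
print(rows_for_zero.shape)        # (2, 2)

# ignore_index=True discards the old labels and renumbers from zero.
renumbered = pd.concat([first, second], ignore_index=True)
print(list(renumbered.index))     # [0, 1, 2, 3]
```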

Now if we want to add these as **columns rather than rows**, we can use the *axis=1* argument (as opposed to the default, axis=0).

Also notice that DataFrame.append has no axis argument, so it cannot add columns this way.

In [None]:
testFrame4 = pd.concat([testFrame1, testFrame2],axis=1)
testFrame4.index = ['top', 'bottom']
display(testFrame4)

print()

In [None]:
testFrame4.loc["top",0] = "test"
display(testFrame4)

If we want to preserve that index for some reason (for instance, to keep track of which frame each column came from), we can actually use a multi-index. This is where there are subindices for the dataframe. This is also relevant when you are grouping data, as the grouped data can have a multi-index. 

In [None]:
testFrame4 = pd.concat([testFrame1, testFrame2],axis=1,keys=["left","right"])
testFrame4.index = ['top', 'bottom']
display(testFrame4)

print()

In [None]:
print(testFrame4["left",0])
print(testFrame4["left",0]['top'])
try: 
    print(testFrame4["left",0,'top'])
except KeyError:
    print("The first bracket selects the column only; the row label goes in a second bracket.")
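As an alternative to chaining brackets, `.loc` accepts the row label and the full column tuple together. The sketch below uses two small made-up frames (`left` and `right` are illustrative names) and is one possible way to address a single cell, not the only one.

```python
import pandas as pd

left = pd.DataFrame([["a", "b"], ["c", "d"]], index=["top", "bottom"])
right = pd.DataFrame([["e", "f"], ["g", "h"]], index=["top", "bottom"])

# keys= adds an outer column level naming the source frame.
both = pd.concat([left, right], axis=1, keys=["left", "right"])

# One cell: row label first, then the full column tuple.
cell = both.loc["top", ("right", 0)]
print(cell)                              # e

# One whole source frame: select on the outer level only.
print(both["left"])

# An assignment now touches only one of the two columns numbered 0.
both.loc["top", ("left", 0)] = "test"
print(both.loc["top", ("right", 0)])     # e
```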

If we want to **add a single series**, then we have to be careful about 
how it is structured. In the following cells we can see this being done wrong and then right. 

In [None]:
testSeries1 = pd.Series(["alpha","bravo","charlie","delta"],name="example")

testFrame5 = testFrame1.append(testSeries1)#,ignore_index=True)
testFrame5

In [None]:
testSeries1 = pd.Series({2:"bravo",3:"charlie",4:"delta",1:"alpha"},name="example")

testFrame5 = testFrame1.append(testSeries1)
testFrame5

In [None]:
# Ooops! It's "Zero" indexed

testSeries1 = pd.Series({0:"alpha",1:"bravo",2:"charlie",3:"delta"},name="example")

testFrame5 = testFrame1.append(testSeries1)
testFrame5
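On pandas 2.0 and later, where DataFrame.append is no longer available, the same row can be added with `pd.concat`. The sketch below is one way to do it, under the assumption that the Series index matches the frame's column labels; `frame`, `row` and `extended` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame([["a", "b", "c", "d"], ["g", "h", "j", "k"]])

# The Series index must match the frame's column labels (0 to 3 here).
row = pd.Series(["alpha", "bravo", "charlie", "delta"], name="example")

# to_frame() makes a one-column frame; .T flips it into a one-row frame
# whose row label is the Series name.
extended = pd.concat([frame, row.to_frame().T])
print(extended)
print(extended.shape)    # (3, 4)
```

If the Series index does not match the columns, the unmatched labels become new columns filled with NaN, just as in the first two attempts above.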

## Adding Columns 

Each DataFrame has an index and a set of columns. To name the rows, assign a list to DataFrame.index. To name the columns, assign a list to DataFrame.columns. The list must have exactly as many entries as there are rows (or columns); otherwise you will receive a ValueError. 

In [None]:
testFrame5.columns = ["first","second","third","fourth"]
display(testFrame5)

print(len(testFrame5.columns))

try:
    testFrame5.columns = ["1first","2second","3third"]
    display(testFrame5)
except ValueError:
    print("ValueError: Length mismatch")
    

testFrame5.index = ["first_row","second_row","third_row"]
display(testFrame5)

try:
    testFrame5.index = ["first_row","second_row","third_row","fourth_row"]
    display(testFrame5)
except ValueError:
    print("ValueError: Length mismatch")
    

    

# Section 2. One-to-many relationships

One-to-many relationships are really common in data wrangling. For example, you have people who are in states, and you have state-level data on unemployment. How do you create a new table that includes these state-level indicators? This might be useful for a regression (particularly a popular class of regression models called 'hierarchical linear models'). 

In [None]:
d = {"Wales":3,"England":53,"Scotland":5,"Northern Ireland":2,"Jersey":.1}
l = list(zip(d.keys(),d.values()))
print(l)

countryFrame = pd.DataFrame(l,columns=["Country","Population"])
display(countryFrame)

countryFrame = pd.DataFrame(pd.Series(d),columns=["Population"])
countryFrame

In [None]:
people = [["Alice",32,"Wales"],
          ["Bob",35,"Northern Ireland"],
          ["Charlie",21,"England"],
          ["Diane",45,"Northern Ireland"],
          ["Ellen",21,"Scotland"],
          ["Fong",50,"England"],
          ["Grant",28,"Scotland"],
          ["Harry",36,"England"],
          ["Idris",40,"Isle of Wight"]]

peopleFrame = pd.DataFrame(people,columns=["Name","Age","Country"])
# help(peopleFrame.merge)
display(peopleFrame)

In [None]:
mergeFrame = peopleFrame.merge(countryFrame,left_on="Country",right_index=True)
display(mergeFrame)

mergeFrame = countryFrame.merge(peopleFrame,left_index=True,right_on="Country")
display(mergeFrame)
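When you expect a one-to-many relationship, `merge` can check that expectation for you through its `validate` argument. The sketch below uses a cut-down, made-up version of the tables above (`people`, `countries` and `duplicated` are illustrative names).

```python
import pandas as pd

people = pd.DataFrame(
    [["Alice", "Wales"], ["Bob", "England"], ["Charlie", "England"]],
    columns=["Name", "Country"],
)
countries = pd.DataFrame({"Population": [3, 53]}, index=["Wales", "England"])

# Many people map to one country: each country key appears once on the right.
merged = people.merge(
    countries, left_on="Country", right_index=True, validate="many_to_one"
)
print(merged)

# A duplicated key on the right breaks the many-to-one assumption, and
# validate= makes pandas raise instead of silently duplicating rows.
duplicated = pd.concat([countries, countries.loc[["Wales"]]])
try:
    people.merge(
        duplicated, left_on="Country", right_index=True, validate="many_to_one"
    )
    check_failed = False
except pd.errors.MergeError:
    check_failed = True
print(check_failed)    # True
```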

So. Merging / Joining is hard to get your head around. And it won't necessarily work the first couple times (I went through several iterations in getting the examples to work). But let's give a little overview of ways to join:

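- Left: Keep every row of the left frame, adding matching data from the right where the key exists.
- Right: Keep every row of the right frame, adding matching data from the left where the key exists.
- Inner: The intersection of both frames' keys.
- Outer: The union of both frames' keys. 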

Below is a very small crash course in "Union" and "Intersection". 

In [None]:
setleft = set([1,3,5,7,9])
setright = set([1,2,3,4,5])
print("Union: PRINT ALL THE THINGS!")
print(setleft.union(setright))

print("\nIntersection: Here's what we have in common")
print(setleft.intersection(setright))

Now let's return to the data we analyzed above and explore what happens when we join in different ways. 

### Outer Join

Notice that the result below includes both Jersey and Isle of Wight, and so has some missing data. Population is missing for Isle of Wight, and the individual-level columns (Name and Age) are missing for Jersey. It's the union of the keys. 

In [None]:
mergeFrame = peopleFrame.merge(countryFrame,left_on="Country",right_index=True, how='outer')
mergeFrame.reset_index(drop=True)
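To see which side each row of an outer join came from, `merge` offers an `indicator` argument. The sketch below uses a cut-down, made-up version of the two tables and, as a simplifying assumption, keeps Country as an ordinary column in both so that it can merge with `on="Country"`.

```python
import pandas as pd

people = pd.DataFrame(
    [["Alice", "Wales"], ["Bob", "England"], ["Idris", "Isle of Wight"]],
    columns=["Name", "Country"],
)
countries = pd.DataFrame(
    [["Wales", 3], ["England", 53], ["Jersey", 0.1]],
    columns=["Country", "Population"],
)

# indicator=True adds a _merge column recording where each key was found:
# "both", "left_only" or "right_only".
audit = people.merge(countries, on="Country", how="outer", indicator=True)
print(audit[["Name", "Country", "Population", "_merge"]])

# Count how many rows came from each side.
counts = audit["_merge"].value_counts().to_dict()
print(counts)
```

Rows marked `left_only` or `right_only` are exactly the ones that an inner join would drop.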

### Inner Join

Notice in this case, just like with an intersection, we get rid of the keys where there is no match in the other table. So, goodbye Isle of Wight and goodbye Jersey! 

In [None]:
mergeFrame = peopleFrame.merge(countryFrame,left_on="Country",right_index=True, how='inner')
display(mergeFrame)

### Left Join

Notice below that we join on the left frame (which is "peopleFrame"). We could have also done either of the following, which would have been roughly equivalent (see for yourself!)

    pd.merge(peopleFrame,countryFrame,left_on="Country",right_index=True, how='left')

    peopleFrame.join(countryFrame,on="Country",how="inner",rsuffix="_x")
    
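I say roughly equivalent because the join command is a little tidier than merge. When both frames contain a column with the same name, merge renames both (Country_x and Country_y), while join keeps the left name and adds the suffix only to the right column. Note also that how="inner" drops unmatched keys, whereas how='left' keeps every row of peopleFrame. In the end, no way is particularly "correct".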

In [None]:
mergeFrame = peopleFrame.merge(countryFrame,left_on="Country",right_index=True, how='left')
mergeFrame

In [None]:
countryFrame.columns = ["Country"]
display(countryFrame)

In [None]:
# countryFrame.columns = ["Country"]
peopleFrame.join(countryFrame,on="Country",how="inner",rsuffix="_rightTable")

### Right Join

Pretty much the same as the left join, except it keeps every key from the right frame instead of the left. 

In [None]:
mergeFrame = peopleFrame.merge(countryFrame,left_on="Country",right_index=True, how='right')
mergeFrame
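A quick way to compare the four join types is to count the rows each one returns. The sketch below uses made-up miniature tables in which one person has no matching country and one country has no matching person; Country is kept as a plain column in both frames for simplicity.

```python
import pandas as pd

people = pd.DataFrame(
    [["Alice", "Wales"], ["Bob", "England"],
     ["Charlie", "England"], ["Idris", "Isle of Wight"]],
    columns=["Name", "Country"],
)
countries = pd.DataFrame(
    [["Wales", 3], ["England", 53], ["Jersey", 0.1]],
    columns=["Country", "Population"],
)

# Run the same merge under each join type and record the row count.
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = people.merge(countries, on="Country", how=how)
    row_counts[how] = len(merged)

print(row_counts)
# {'inner': 3, 'left': 4, 'right': 4, 'outer': 5}
```

The inner join keeps only the three matched people, the left join adds Idris, the right join adds Jersey, and the outer join adds both.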

**Multiple columns**

You'll notice that in the above example, countryFrame is really just one column of data. Let's add another column to it, and see what happens to our merging.

In [None]:
countryFrame = pd.DataFrame([["Wales",3,28],
                            ["England",53,51],
                            ["Scotland",5,46],
                            ["Northern Ireland",2,27],
                            ["Jersey",.1,58]],
                            columns=["Country","Population","Income"])
display(countryFrame)

Now in this case, you'll notice that we no longer have the names of the countries as indices, so we will have to change the way we merge ever so slightly. 

In [None]:
mergeFrame = peopleFrame.merge(countryFrame, on="Country")
mergeFrame

Section 3. Grouping Data 
===================

Broadcasting Aggregations back to the original data
---------------------------

Now with these columns, imagine that we want to create some variable that is a group-level aggregation of individual-level variables. Population and Income are already group-level variables. 

We can group data together using the 

    groupby(KEY) 
    
command. First we will group all the data, and then just a subset of it. 

In [None]:
groupFrame = mergeFrame.groupby('Country').sum()
display(groupFrame)

In [None]:
help(mergeFrame.groupby)

You'll notice that it aggregates every column in the frame. This is alright, but if we want to merge these values back into the original data set, this will be a nuisance, since Population and Income came from the Country table to begin with and aggregating them tells us nothing new. So, we can group on a slice of the dataframe. To slice the dataframe we have to query it in the following way. 

    DATAFRAME[ ['VAR1','VAR2'] ]
    
Yes, that's a list within a list. See below:    

In [None]:
mergeFrame[["Country","Age","Income"]]

In [None]:
groupFrame = mergeFrame[["Country","Age"]].groupby('Country').mean()

groupFrame

We have just one issue now: if we merge the average age back in, there is already a variable called Age. We can rename it before we merge it back in, but it is easier to add a prefix when we do the original grouping:

    DATAFRAME.groupby(KEY).mean().add_prefix("mean_")
    

In [None]:
groupFrame = mergeFrame[["Country","Age"]].groupby('Country').mean().add_prefix("mean_")
groupFrame
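Another option, shown here as a sketch rather than as the lecture's method, is named aggregation, where the new column name is given at the moment of aggregating. The frame and the column name `n_people` are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Country": ["England", "England", "Wales"], "Age": [21, 51, 32]}
)

# Named aggregation: new_column=(source_column, function).
summary = frame.groupby("Country").agg(
    mean_Age=("Age", "mean"),
    n_people=("Age", "count"),
)
print(summary)
```

This avoids the separate `add_prefix` step and lets several aggregations of the same column sit side by side under distinct names.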

In [None]:
newFrame = pd.merge(mergeFrame, groupFrame,left_on="Country",right_index=True)
display(newFrame)
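The group-then-merge pattern above can also be written in one step with `transform`, which is a different technique from the merge used in this lecture. The sketch below assumes a made-up miniature frame.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "Name": ["Charlie", "Fong", "Alice"],
        "Country": ["England", "England", "Wales"],
        "Age": [21, 51, 32],
    }
)

# transform returns one value per original row, aligned on the row index,
# so the group mean can be assigned straight back as a new column.
frame["mean_Age"] = frame.groupby("Country")["Age"].transform("mean")
print(frame)
```

Merging is still the more general tool, since it also works when the group-level values come from a separate table.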

Copying versus addressing
-------------------------

So, is mean_Age part of the mergeFrame table now? No! pd.merge returns a new DataFrame (here, newFrame) and leaves mergeFrame unchanged unless we assign the result back to it. See below:


In [None]:
# Mean-centering subtracts the group mean from each individual's value.
newFrame["age_meancentered"] = newFrame["Age"] - newFrame["mean_Age"]
newFrame
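The same behaviour can be checked on a tiny example. The sketch below uses made-up frames (`people` and `countries`) and simply inspects the column lists before and after.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Alice", "Bob"], "Country": ["Wales", "England"]})
countries = pd.DataFrame({"Country": ["Wales", "England"], "Population": [3, 53]})

# merge builds and returns a new frame; neither input is modified.
combined = pd.merge(people, countries, on="Country")
columns_before = list(people.columns)
print(columns_before)             # ['Name', 'Country']
print(list(combined.columns))     # ['Name', 'Country', 'Population']

# To keep the result under the old name, assign it back explicitly.
people = pd.merge(people, countries, on="Country")
print(list(people.columns))       # ['Name', 'Country', 'Population']
```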

In [None]:
mergeFrame = pd.merge(mergeFrame, groupFrame,left_on="Country",right_index=True)
mergeFrame

# Ta-dah! 