<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 10px;"> 
# Long, Wide, Pivoting, and Melting Tables in Pandas

---
Week 4 | Lesson 1.1

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Describe a wide and long table
- Describe and use the pivot_table method
- Describe data imputing
- Describe and use merging


### STUDENT PRE-WORK
*Before this lesson, you should already be able to:*
- Understand how to load data into a dataframe
- Understand how numpy arrays work


![](http://dataconomy.com/wp-content/uploads/2015/03/Python-Pandas-Features-Tutorial-Data-Mining-e1427131108858.jpg)


# Long format, wide format, pivot tables, and melting

This lesson is all about data transformation in pandas. Data transformation is, in essence, reorganizing the rows and columns of your dataset into a different shape and format. 

The main benefit of transforming your data is easier access and manipulation, whether through simpler masking/conditional statements or because you would prefer to operate across columns or down rows. 

Over time you will get a feel for which data formats are better for different tasks. This lesson, however, is focused in large part on the _functional application_ of data transformation (i.e. how do you do **this** to a dataset?)


### Need Help with Pandas?

The [Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/api.html) tells you what methods do and what arguments they accept, and provides examples. 


---





In [2]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline

## Warm up with Series

A **Series** is a single vector of data (like a NumPy array) with an index that labels each element in the vector.

In [2]:
series = pd.Series([100,200,300,400])

In [3]:
type(series)

pandas.core.series.Series

In [4]:
# like a numpy array but with added capabilities 
series.head()

0    100
1    200
2    300
3    400
dtype: int64

In [5]:
# Convert the series to its NumPy array representation
# (.values is used here because newer pandas versions no longer provide .as_matrix())
arr = series.values
type(arr)

numpy.ndarray

In [6]:
# Convert the series to a list
arr2 = series.tolist()
type(arr2)

list
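
The warm-up cells above all rely on the default integer index. As a small sketch of what "an index that labels each element" buys you, here is the same data with explicit labels. The labels `a` to `d` are made up for illustration.

```python
import pandas as pd

# Hypothetical labels; any hashable values can serve as an index.
labeled = pd.Series([100, 200, 300, 400], index=["a", "b", "c", "d"])

by_label = labeled["c"]          # label-based lookup
by_position = labeled.iloc[2]    # position-based lookup
subset = labeled[labeled > 150]  # a boolean mask keeps the matching labels

print(by_label, by_position)
print(subset)
```

Both lookups reach the same element; the mask result keeps its labels, which a plain NumPy array would not.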

---

## 1. "Wide" format data

**Wide** format data is the more common format of data for .csv type files. You are already familiar with wide format data: I believe all of the datasets we have been using thus far have been in wide format.


<img src=https://i.stack.imgur.com/RRfjY.png>

Wide format data is formatted with these criteria:

- There are multiple ID _and_ value columns. In other words, there is a column for every "variable" with its own unique values.
- The format has both the conceptual simplicity of a single column of values per variable and a more compact matrix.
- It is less useful for SQL-style operations: it can make it much harder or even impossible to join tables together on a value.
- It can be more useful in pandas when you need to perform operations on variables **across columns**, for example multiplying columns together.
- It is most commonly the format that you will put the data in when you are ready to perform modeling (with some exceptions). When we get into modeling next week I will explain why.
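
To make the "operations across columns" point concrete, here is a minimal sketch on a toy frame with one row per subject and one column per variable. The column names are borrowed from the survey, but the values are invented.

```python
import pandas as pd

# Toy data: one row per subject, one column per variable.
toy = pd.DataFrame({
    "subject_id": [0, 1, 2],
    "anxious": [1.0, 4.0, 7.0],
    "calm": [7.0, 6.0, 2.0],
})

# Multiplying two variables together is a one-liner in this layout.
toy["anxious_x_calm"] = toy["anxious"] * toy["calm"]

# So is a row-wise mean across a chosen set of variable columns.
toy["mean_score"] = toy[["anxious", "calm"]].mean(axis=1)

print(toy)
```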

---

## 2. Load  "Nerdy Personality Attributes" dataset

This is a parsed and modified version of the full "Nerdy Personality Attributes" survey that asked subjects to self-rate on questions related to "nerdiness" as well as more general personality traits such as openness and extraversion. Demographic information on the subjects was also collected.

In this modified version, for the sake of example, some of the subjects have only data for the survey and not the demographic variables. Because there are missing values and the data in general is "messy", this is also in part a data cleaning problem.

We will load the data in wide format first. (The variable below is named `nerdy_long`, but the file it reads is the wide format version.)


In [7]:
# load data into dataframe
nerdy_long_f = '~nvr/desktop/dsi-sf-7-materials-nvr/datasets/nerdy_personality_attributes/NPAS_parsed_trunc_wide_missing.csv'
nerdy_long = pd.read_csv(nerdy_long_f)

In [8]:
nerdy_long

Unnamed: 0,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
3,3,5.0,,4.0,4.0,5.0,7.0,5.0,1.0,2.0,...,,2.0,5.0,5.0,6.0,,,5.0,5.0,4.0
4,4,4.0,,3.0,5.0,5.0,6.0,4.0,2.0,5.0,...,,6.0,0.0,5.0,5.0,,,5.0,4.0,1.0
5,5,4.0,18.0,5.0,3.0,4.0,4.0,4.0,4.0,3.0,...,1.0,5.0,4.0,5.0,4.0,3.0,2.0,4.0,5.0,3.0
6,6,4.0,18.0,1.0,4.0,5.0,6.0,5.0,1.0,1.0,...,1.0,5.0,5.0,5.0,5.0,2.0,2.0,1.0,4.0,1.0
7,7,3.0,21.0,7.0,3.0,5.0,1.0,5.0,4.0,6.0,...,12.0,5.0,5.0,5.0,6.0,2.0,1.0,3.0,3.0,3.0
8,8,3.0,25.0,5.0,5.0,4.0,3.0,2.0,1.0,1.0,...,7.0,7.0,3.0,4.0,7.0,2.0,2.0,3.0,0.0,5.0
9,9,3.0,17.0,6.0,4.0,5.0,5.0,4.0,2.0,6.0,...,1.0,7.0,5.0,5.0,4.0,2.0,2.0,5.0,5.0,5.0


In [9]:
# use the shape attribute to find out the dimensions 
nerdy_long.shape

(1391, 57)

In [11]:
## Let's get a sense of the data
nerdy_long.info(memory_usage='deep') 

# What's with the plus sign?
#nerdy_long.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1391 entries, 0 to 1390
Data columns (total 57 columns):
subject_id                      1391 non-null int64
academic_over_social            1391 non-null float64
age                             700 non-null float64
anxious                         1391 non-null float64
bookish                         1391 non-null float64
books_over_parties              1391 non-null float64
calm                            1391 non-null float64
collect_books                   1391 non-null float64
conventional                    1391 non-null float64
critical                        1391 non-null float64
dependable                      1391 non-null float64
diagnosed_autistic              1391 non-null float64
disorganized                    1391 non-null float64
education                       700 non-null float64
engnat                          700 non-null float64
enjoy_learning                  1391 non-null float64
excited_about_research          13

The dataset is in the familiar (rows, columns) format: each column is a variable, and each row contains the observations of those variables for (in this case) one distinct subject.

In [16]:
nerdy_long.head(3)

Unnamed: 0,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0


In [13]:
nerdy_long.isnull()

Unnamed: 0,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,False,False,True,False,False,False,False,False,False,False,...,True,False,False,False,False,True,True,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,True,False,False,False,False,False,False,False,...,True,False,False,False,False,True,True,False,False,False
4,False,False,True,False,False,False,False,False,False,False,...,True,False,False,False,False,True,True,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


We can check to see how many null values there are per column with the convenient chained function pattern below:

In [18]:
# explore api for isnull method in class
nerdy_long.isnull().sum()

subject_id                        0
academic_over_social              0
age                             691
anxious                           0
bookish                           0
books_over_parties                0
calm                              0
collect_books                     0
conventional                      0
critical                          0
dependable                        0
diagnosed_autistic                0
disorganized                      0
education                       691
engnat                          691
enjoy_learning                    0
excited_about_research            0
extraverted                       0
familysize                      691
gender                          691
hand                            691
hobbies_over_people               0
in_advanced_classes               0
intelligence_over_appearance      0
interested_science                0
introspective                     0
libraries_over_publicspace        0
like_dry_topics             

### Null Values and Imputing Data


If we were to just drop all the rows that have any null values at this point, we would lose 970 rows due to the commonly missing variable `major`.
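
Before dropping anything, it can help to count what a drop would cost. The sketch below does that on a small invented frame (not the survey data) and also shows `subset`, which restricts the null check to the columns you actually care about.

```python
import numpy as np
import pandas as pd

# Toy frame: "major" is often missing, "anxious" never is.
toy = pd.DataFrame({
    "subject_id": [0, 1, 2, 3],
    "anxious": [1.0, 4.0, 7.0, 3.0],
    "major": [np.nan, "biology", np.nan, "math"],
})

# How many rows contain at least one null?
rows_with_any_null = toy.isnull().any(axis=1).sum()

# Rows left after dropping on any null vs. only on nulls in "anxious".
rows_left = len(toy.dropna())
rows_left_subset = len(toy.dropna(subset=["anxious"]))

print(rows_with_any_null, rows_left, rows_left_subset)
```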

In [22]:

nerdy_long.dropna()        # drops every row that contains at least one NA value

nerdy_long.dropna(axis=1)  # drops every column that contains at least one NA value

Unnamed: 0,subject_id,academic_over_social,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,dependable,...,race_nerdy,race_white,read_tech_reports,reserved,socially_awkward,strange_person,sympathetic,was_odd_child,watch_science_shows,writing_novel
0,0,5.0,1.0,5.0,5.0,7.0,5.0,1.0,1.0,7.0,...,0.0,1.0,5.0,7.0,5.0,5.0,7.0,5.0,5.0,3.0
1,1,2.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,5.0,...,0.0,1.0,4.0,5.0,5.0,4.0,5.0,3.0,5.0,1.0
2,2,5.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,3.0,...,0.0,1.0,5.0,7.0,5.0,5.0,2.0,5.0,5.0,4.0
3,3,5.0,4.0,4.0,5.0,7.0,5.0,1.0,2.0,7.0,...,0.0,1.0,4.0,2.0,5.0,5.0,6.0,5.0,5.0,4.0
4,4,4.0,3.0,5.0,5.0,6.0,4.0,2.0,5.0,4.0,...,0.0,1.0,5.0,6.0,0.0,5.0,5.0,5.0,4.0,1.0
5,5,4.0,5.0,3.0,4.0,4.0,4.0,4.0,3.0,5.0,...,0.0,1.0,5.0,5.0,4.0,5.0,4.0,4.0,5.0,3.0
6,6,4.0,1.0,4.0,5.0,6.0,5.0,1.0,1.0,2.0,...,0.0,0.0,4.0,5.0,5.0,5.0,5.0,1.0,4.0,1.0
7,7,3.0,7.0,3.0,5.0,1.0,5.0,4.0,6.0,5.0,...,0.0,1.0,5.0,5.0,5.0,5.0,6.0,3.0,3.0,3.0
8,8,3.0,5.0,5.0,4.0,3.0,2.0,1.0,1.0,5.0,...,0.0,0.0,3.0,7.0,3.0,4.0,7.0,3.0,0.0,5.0
9,9,3.0,6.0,4.0,5.0,5.0,4.0,2.0,6.0,3.0,...,0.0,0.0,5.0,7.0,5.0,5.0,4.0,5.0,5.0,5.0


In [29]:
nerdy_long.major.replace(to_replace =np.nan, value="unknown").head()

0       unknown
1    biophysics
2       biology
3       unknown
4       unknown
Name: major, dtype: object

In [30]:
nerdy_long.major.fillna(value='unknown')

0                       unknown
1                    biophysics
2                       biology
3                       unknown
4                       unknown
5                       Geology
6                       unknown
7                       unknown
8                    psychology
9                       unknown
10                      unknown
11                      unknown
12                      unknown
13                      unknown
14       information technology
15                Digital Media
16                      unknown
17                      unknown
18                      unknown
19                      unknown
20                      unknown
21                      unknown
22         chemical engineering
23                      unknown
24                      unknown
25                      unknown
26                      unknown
27                         Math
28                      unknown
29                      unknown
                 ...           
1361    

### Imputing 

**Imputation** is the process of replacing missing data with substituted values.

Sometimes it is not feasible to simply delete rows with missing data. For instance, if we were to delete all 970 rows with missing data, we would be throwing away more than half of our dataset! So instead we try to impute data whenever possible. 


#### Imputing Techniques 

Imputing techniques range from simple to more sophisticated. 

- Replacing missing numerical values with the mean or median of the column 
- Replacing a missing categorical value with "unknown"
- Using statistical inference to estimate what the missing values should be
- Using machine learning models to predict what the values should be 
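
The two simplest techniques in the list can be sketched in a few lines. The frame below is invented for illustration; the `age_was_missing` flag column is my own addition, kept so that imputed values can still be told apart from observed ones later.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [50.0, np.nan, 22.0, np.nan, 18.0],
    "major": ["biophysics", np.nan, "biology", np.nan, "Geology"],
})

imputed = toy.copy()

# Remember which rows were missing before filling them in.
imputed["age_was_missing"] = imputed["age"].isnull()

# Numerical column: fill with the median of the observed values.
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

# Categorical column: fill with an explicit "unknown" category.
imputed["major"] = imputed["major"].fillna("unknown")

print(imputed)
```

The median is used rather than the mean because it is less sensitive to outliers, but either is a judgment call that depends on the column.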




In [31]:
## filling in missing values 
nerdy_long['major'] = nerdy_long.major.\
                    fillna(value='unknown')

In [32]:
# What should we expect the following output to be?
nerdy_long.major.isnull().sum()

0

In [33]:
nerdy_long.major.head(10)

0       unknown
1    biophysics
2       biology
3       unknown
4       unknown
5       Geology
6       unknown
7       unknown
8    psychology
9       unknown
Name: major, dtype: object

## 3. "Long" format

Now we can look at the same data in what's commonly referred to as "long format". 

Long data is formatted with these criteria:

- Potentially multiple "id" (identification) columns.
- Variable:value column pairs that match a variable key to a value (in the simple case, a single variable column and a single value column).
- The "variable" column corresponds to the multiple variable columns in your wide format data. Now, instead of a column for each variable, you have a row for each variable:value pair, per id. 
- This is a standard format in SQL databases because it is appropriate for joining different tables together by keys.
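
As a rough illustration of the id / variable / value layout in the list above, here is a tiny hand-built frame. The column names `variable` and `value` and all the numbers are assumptions made for the example.

```python
import pandas as pd

# One row per (subject, variable) pair.
long_toy = pd.DataFrame({
    "subject_id": [0, 0, 1, 1],
    "variable": ["anxious", "calm", "anxious", "calm"],
    "value": [1.0, 7.0, 4.0, 6.0],
})

# A single filter selects one variable, whatever it is called.
anxious_only = long_toy[long_toy["variable"] == "anxious"]

# Summaries per variable run over the one value column.
mean_by_variable = long_toy.groupby("variable")["value"].mean()

print(anxious_only)
print(mean_by_variable)
```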

<img src=https://i.stack.imgur.com/agIMh.png>


In [None]:
# nerdy_long.groupby(['major','anxious']).size().index

## Pandas `pivot_table()`: long to wide format

The `pd.pivot_table()` function is a powerful tool both for transforming data from long to wide format and for conveniently summarizing data into a matrix with arbitrary functions.

First we'll look at how we transform long format data back into wide format data.

**Parameters to note in the function:**

- **`data`**: the dataframe to pivot (for example `nerdy_long`), which `pd.pivot_table()` takes as its first argument
- **`columns`**: the list of columns in the long format data to spread out into columns in wide format, with each unique value in the long format column becoming a header in the wide format
- **`values`**: the column (or list of columns) holding the values to use when pivoting and filling in the new wide format columns
- **`index`**: columns in the long format data that are index variables. These are left as single columns, not spread out across columns by unique value as with the `columns` parameter
- **`aggfunc`**: often `pivot_table()` is used to perform a summary of the data. `aggfunc` stands for "aggregation function". It is optional and defaults to the mean. You can also pass in your own function.
- **`fill_value`**: the value to fill in if a cell is missing in the wide format data

For the survey data I pass my own function, `select_item_or_nan()`, to the `aggfunc` keyword argument. Because each `subject_id` has a single value per variable, I just want the single element in the long format value cell. My data is messy, so the function has to check for some places where it can break.

Note: the `x` passed into my function is a Series holding the values of one group, not a scalar. I pull out the first element of it with the `.iloc` indexer.
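
The body of `select_item_or_nan()` is not shown in this lesson, so the following is only a guess at its shape: a sketch that returns the first element of each group and falls back to NaN for an empty group. The toy long format frame and its column names are also assumptions.

```python
import numpy as np
import pandas as pd

def select_item_or_nan(x):
    """Hypothetical helper: first element of the group Series, or NaN if empty."""
    if len(x) == 0:
        return np.nan
    return x.iloc[0]

# Toy long format data; subject 2 has no "calm" answer.
long_df = pd.DataFrame({
    "subject_id": [0, 0, 1, 1, 2],
    "variable": ["anxious", "calm", "anxious", "calm", "anxious"],
    "value": [1.0, 7.0, 4.0, 6.0, 5.0],
})

wide_df = pd.pivot_table(long_df,
                         index="subject_id",
                         columns="variable",
                         values="value",
                         aggfunc=select_item_or_nan)
print(wide_df)
```

The missing (subject 2, calm) combination comes out as NaN in the wide table unless you also pass `fill_value`.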

> Pivot tables are generally important for data analysis **because they allow us to aggregate variables of importance by conditioning on categories.** 

In [39]:
sales_funnel = pd.read_excel('sales-funnel.xlsx')
sales_funnel.head()

Unnamed: 0,Account,Name,Rep,Manager,Product,Quantity,Price,Status
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won


In [40]:
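## create your own category type, with an explicit category order
sales_funnel["Status"] = sales_funnel["Status"].astype("category")
# assign the result back: the inplace argument is not available in recent pandas versions
sales_funnel["Status"] = sales_funnel["Status"].cat.set_categories(["won","pending","presented","declined"])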


In [41]:
sales_funnel.Status.dtype

category

In [42]:
# Simplest Pivot table
# default aggfunc is mean
pd.pivot_table(sales_funnel, index=["Name"])

Unnamed: 0_level_0,Account,Price,Quantity
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Barton LLC,740150.0,35000.0,1.0
"Fritsch, Russel and Anderson",737550.0,35000.0,1.0
Herman LLC,141962.0,65000.0,2.0
Jerde-Hilpert,412290.0,5000.0,2.0
"Kassulke, Ondricka and Metz",307599.0,7000.0,3.0
Keeling LLC,688981.0,100000.0,5.0
Kiehn-Spinka,146832.0,65000.0,2.0
Koepp Ltd,729833.0,35000.0,2.0
Kulas Inc,218895.0,25000.0,1.5
Purdy-Kunde,163416.0,30000.0,1.0


In [45]:
sales_funnel[sales_funnel.Name == 'Trantow-Barrows']

Unnamed: 0,Account,Name,Rep,Manager,Product,Quantity,Price,Status
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending


In [46]:
# we can do this but this is not really interesting
pd.pivot_table(sales_funnel, index=["Name", "Rep", "Manager"])

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Account,Price,Quantity
Name,Rep,Manager,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Barton LLC,John Smith,Debra Henley,740150.0,35000.0,1.0
"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,737550.0,35000.0,1.0
Herman LLC,Cedric Moss,Fred Anderson,141962.0,65000.0,2.0
Jerde-Hilpert,John Smith,Debra Henley,412290.0,5000.0,2.0
"Kassulke, Ondricka and Metz",Wendy Yule,Fred Anderson,307599.0,7000.0,3.0
Keeling LLC,Wendy Yule,Fred Anderson,688981.0,100000.0,5.0
Kiehn-Spinka,Daniel Hilton,Debra Henley,146832.0,65000.0,2.0
Koepp Ltd,Wendy Yule,Fred Anderson,729833.0,35000.0,2.0
Kulas Inc,Daniel Hilton,Debra Henley,218895.0,25000.0,1.5
Purdy-Kunde,Cedric Moss,Fred Anderson,163416.0,30000.0,1.0


In [103]:
pd.pivot_table(sales_funnel, index=["Manager", "Rep"])

Unnamed: 0_level_0,Unnamed: 1_level_0,Account,Price,Quantity
Manager,Rep,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Debra Henley,Craig Booker,720237.0,20000.0,1.25
Debra Henley,Daniel Hilton,194874.0,38333.333333,1.666667
Debra Henley,John Smith,576220.0,20000.0,1.5
Fred Anderson,Cedric Moss,196016.5,27500.0,1.25
Fred Anderson,Wendy Yule,614061.5,44250.0,3.0


In [105]:
# explicitly only include prices 
pd.pivot_table(sales_funnel, 
               index = ["Manager", "Rep"],
               values = ["Price"],
               aggfunc = np.sum)

Unnamed: 0_level_0,Unnamed: 1_level_0,Price
Manager,Rep,Unnamed: 2_level_1
Debra Henley,Craig Booker,80000
Debra Henley,Daniel Hilton,115000
Debra Henley,John Smith,40000
Fred Anderson,Cedric Moss,110000
Fred Anderson,Wendy Yule,177000
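
The output table that follows has `sum` and `len` as an extra column level, which is what you get when `aggfunc` is given a list of functions. The cell that produced it is not shown, so here is a sketch of the pattern on a tiny invented frame rather than on `sales_funnel`.

```python
import pandas as pd

# Toy stand-in for the sales data; names and prices are made up.
toy = pd.DataFrame({
    "Manager": ["A", "A", "B"],
    "Rep": ["x", "x", "y"],
    "Price": [10, 20, 5],
})

# A list of aggregation functions adds one column level per function.
summary = pd.pivot_table(toy,
                         index=["Manager", "Rep"],
                         values=["Price"],
                         aggfunc=["sum", len])
print(summary)
```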


Unnamed: 0_level_0,Unnamed: 1_level_0,sum,len
Unnamed: 0_level_1,Unnamed: 1_level_1,Price,Price
Manager,Rep,Unnamed: 2_level_2,Unnamed: 3_level_2
Debra Henley,Craig Booker,80000,4
Debra Henley,Daniel Hilton,115000,3
Debra Henley,John Smith,40000,2
Fred Anderson,Cedric Moss,110000,4
Fred Anderson,Wendy Yule,177000,4


In [108]:
pd.pivot_table(sales_funnel, 
               index = ["Manager", "Rep"],
               values = ["Price"],
               columns = ["Product"],
               aggfunc = np.sum)

Unnamed: 0_level_0,Unnamed: 1_level_0,Price,Price,Price,Price
Unnamed: 0_level_1,Product,CPU,Maintenance,Monitor,Software
Manager,Rep,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Debra Henley,Craig Booker,65000.0,5000.0,,10000.0
Debra Henley,Daniel Hilton,105000.0,,,10000.0
Debra Henley,John Smith,35000.0,5000.0,,
Fred Anderson,Cedric Moss,95000.0,5000.0,,10000.0
Fred Anderson,Wendy Yule,165000.0,7000.0,5000.0,


In [109]:
# fill missing value 
pd.pivot_table(sales_funnel, 
               index = ["Manager", "Rep"],
               values = ["Price"],
               columns = ["Product"],
               aggfunc = np.sum, 
               fill_value = 0)

Unnamed: 0_level_0,Unnamed: 1_level_0,Price,Price,Price,Price
Unnamed: 0_level_1,Product,CPU,Maintenance,Monitor,Software
Manager,Rep,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Debra Henley,Craig Booker,65000,5000,0,10000
Debra Henley,Daniel Hilton,105000,0,0,10000
Debra Henley,John Smith,35000,5000,0,0
Fred Anderson,Cedric Moss,95000,5000,0,10000
Fred Anderson,Wendy Yule,165000,7000,5000,0


In [110]:
# add more than one values
pd.pivot_table(sales_funnel, 
               index = ["Manager", "Rep"],
               values = ["Price", "Quantity"],
               columns = ["Product"],
               aggfunc = np.sum, 
               fill_value = 0)

Unnamed: 0_level_0,Unnamed: 1_level_0,Price,Price,Price,Price,Quantity,Quantity,Quantity,Quantity
Unnamed: 0_level_1,Product,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software
Manager,Rep,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
Debra Henley,Craig Booker,65000,5000,0,10000,2,2,0,1
Debra Henley,Daniel Hilton,105000,0,0,10000,4,0,0,1
Debra Henley,John Smith,35000,5000,0,0,1,2,0,0
Fred Anderson,Cedric Moss,95000,5000,0,10000,3,1,0,1
Fred Anderson,Wendy Yule,165000,7000,5000,0,7,3,2,0


In [112]:
# add more than one values with different visual rep
pd.pivot_table(sales_funnel, 
               index = ["Manager", "Rep", "Product"],
               values = ["Price", "Quantity"],
               aggfunc = np.sum, 
               fill_value = 0,
               margins=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Price,Quantity
Manager,Rep,Product,Unnamed: 3_level_1,Unnamed: 4_level_1
Debra Henley,Craig Booker,CPU,65000.0,2.0
Debra Henley,Craig Booker,Maintenance,5000.0,2.0
Debra Henley,Craig Booker,Software,10000.0,1.0
Debra Henley,Daniel Hilton,CPU,105000.0,4.0
Debra Henley,Daniel Hilton,Software,10000.0,1.0
Debra Henley,John Smith,CPU,35000.0,1.0
Debra Henley,John Smith,Maintenance,5000.0,2.0
Fred Anderson,Cedric Moss,CPU,95000.0,3.0
Fred Anderson,Cedric Moss,Maintenance,5000.0,1.0
Fred Anderson,Cedric Moss,Software,10000.0,1.0


In [113]:
pd.pivot_table(sales_funnel,
               index=["Manager", "Status"],
               values=["Price"],
               aggfunc=[np.sum],
               fill_value=0,
               margins=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,sum
Unnamed: 0_level_1,Unnamed: 1_level_1,Price
Manager,Status,Unnamed: 2_level_2
Debra Henley,won,65000.0
Debra Henley,pending,50000.0
Debra Henley,presented,50000.0
Debra Henley,declined,70000.0
Fred Anderson,won,172000.0
Fred Anderson,pending,5000.0
Fred Anderson,presented,45000.0
Fred Anderson,declined,65000.0
All,,522000.0


In [60]:
## One handy feature for aggfunc is the ability 
## to pass a dictionary. This has a side-effect of making the 
## labels a little cleaner.
pd.pivot_table(sales_funnel,
               index=["Manager","Status", "Product"],
               values=["Quantity","Price"],
               aggfunc={"Quantity":len,
                        "Price":[np.mean, np.sum]},
               fill_value=0)


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Quantity,Price,Price
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,len,mean,sum
Manager,Status,Product,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Debra Henley,won,CPU,1,65000,65000
Debra Henley,pending,CPU,1,40000,40000
Debra Henley,pending,Maintenance,2,5000,10000
Debra Henley,presented,CPU,1,30000,30000
Debra Henley,presented,Software,2,10000,20000
Debra Henley,declined,CPU,2,35000,70000
Fred Anderson,won,CPU,2,82500,165000
Fred Anderson,won,Maintenance,1,7000,7000
Fred Anderson,pending,Maintenance,1,5000,5000
Fred Anderson,presented,CPU,1,30000,30000


In [61]:
table = pd.pivot_table(sales_funnel,
                       index=["Manager","Status", "Product"],
                       values=["Quantity","Price"],
                       aggfunc={"Quantity":len,
                                "Price":[np.sum, len]},
                       fill_value=0)

In [62]:
table

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Quantity,Price,Price
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,len,len,sum
Manager,Status,Product,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Debra Henley,won,CPU,1,1,65000
Debra Henley,pending,CPU,1,1,40000
Debra Henley,pending,Maintenance,2,2,10000
Debra Henley,presented,CPU,1,1,30000
Debra Henley,presented,Software,2,2,20000
Debra Henley,declined,CPU,2,2,70000


In [66]:
table.query('Product == ["CPU"]')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Quantity,Price,Price
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,len,len,sum
Manager,Status,Product,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Debra Henley,won,CPU,1,1,65000
Debra Henley,pending,CPU,1,1,40000
Debra Henley,presented,CPU,1,1,30000
Debra Henley,declined,CPU,2,2,70000
Fred Anderson,won,CPU,2,2,165000
Fred Anderson,presented,CPU,1,1,30000
Fred Anderson,declined,CPU,1,1,65000


In [67]:
table.query('Product == ["CPU"] \
            and Manager == ["Debra Henley"]') 

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Quantity,Price,Price
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,len,len,sum
Manager,Status,Product,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Debra Henley,won,CPU,1,1,65000
Debra Henley,pending,CPU,1,1,40000
Debra Henley,presented,CPU,1,1,30000
Debra Henley,declined,CPU,2,2,70000


### Multiindex/Hierarchical indexing

In the header above you can see that the pivoted output does not look like the wide format data we loaded originally. Pandas implements something called **Multiindexing** or **Hierarchical indexing** which allows for "tiered" row and column labels.

<img src=http://pbpython.com/images/pivot-table-datasheet.png>

We can use the dataframe method `.reset_index()` to move `Manager`, `Status`, and `Product` out of the index and into columns, creating a new default index. The dataframe is then in the flat row format we got when we loaded the original wide data. The only exception is that we still have the extra level of aggregation function labels in the column header.
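
If the tiered column header gets in the way, one common approach (not the only one) is to join the levels into single strings before resetting the index. This is a sketch on an invented frame; the `_` separator and the resulting names are my own choice.

```python
import pandas as pd

toy = pd.DataFrame({
    "Manager": ["A", "A", "B"],
    "Product": ["CPU", "CPU", "Software"],
    "Price": [10, 20, 5],
})

# A list of aggfuncs produces two-level column labels such as ("sum", "Price").
tiered = pd.pivot_table(toy,
                        index=["Manager", "Product"],
                        values=["Price"],
                        aggfunc=["sum", len])

flat = tiered.copy()
# Collapse each (function, column) tuple into one string label.
flat.columns = ["_".join(levels) for levels in flat.columns]
# Move the row index levels back into ordinary columns.
flat = flat.reset_index()

print(flat)
```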

## `pivot_table` for summarization

For those of you who are experienced with Excel, the pandas pivot table does the same thing as the pivot table in Excel. It's more powerful, but obviously harder to use than the user-friendly spreadsheet version.

In [69]:
table.reset_index()

Unnamed: 0_level_0,Manager,Status,Product,Quantity,Price,Price
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,len,len,sum
0,Debra Henley,won,CPU,1,1,65000
1,Debra Henley,pending,CPU,1,1,40000
2,Debra Henley,pending,Maintenance,2,2,10000
3,Debra Henley,presented,CPU,1,1,30000
4,Debra Henley,presented,Software,2,2,20000
5,Debra Henley,declined,CPU,2,2,70000
6,Fred Anderson,won,CPU,2,2,165000
7,Fred Anderson,won,Maintenance,1,1,7000
8,Fred Anderson,pending,Maintenance,1,1,5000
9,Fred Anderson,presented,CPU,1,1,30000


### Going from wide to long with `.melt()`

**`.melt()`** is a function that essentially performs the inverse operation of `pivot_table` on dataframes.

Melt takes a dataframe as its first argument. Additional arguments typically used in the melt function are:

- **`id_vars`**: the column or columns that will be id variables. These columns are kept as they are and identify which observation each variable:value pair belongs to
- **`value_vars`**: a list that specifies which columns should be converted into a single value column and variable column.
- **`var_name`**: the header name of the variable column (default='variable')
- **`value_name`**: the header name of the value column (default='value')

Below I specify `id_vars` as Manager, Status, and Product, and `value_vars` as Quantity and Price. The names of the variable and value columns are left at their defaults.
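
To see all four arguments at once, here is a small sketch on invented survey-style data. The names `question` and `rating` are placeholders I picked for `var_name` and `value_name`. The last line pivots the result back to check that melting loses nothing.

```python
import pandas as pd

wide_toy = pd.DataFrame({
    "subject_id": [0, 1],
    "anxious": [1.0, 4.0],
    "calm": [7.0, 6.0],
})

# Wide to long: one row per (subject, question) pair.
long_toy = pd.melt(wide_toy,
                   id_vars=["subject_id"],
                   value_vars=["anxious", "calm"],
                   var_name="question",
                   value_name="rating")
print(long_toy)

# Long back to wide, as a round-trip check.
back = long_toy.pivot(index="subject_id", columns="question", values="rating")
print(back)
```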

In [70]:
table_melt = pd.pivot_table(sales_funnel,
                            index=["Manager","Status", "Product"],
                            values=["Quantity","Price"],
                            aggfunc={"Quantity":len,
                                     "Price":np.sum},
                            fill_value=0)

In [71]:
pd.melt(table_melt.reset_index(),
        id_vars= ['Manager', 'Status', 'Product'],
        value_vars = ['Quantity', 'Price'])

Unnamed: 0,Manager,Status,Product,variable,value
0,Debra Henley,won,CPU,Quantity,1
1,Debra Henley,pending,CPU,Quantity,1
2,Debra Henley,pending,Maintenance,Quantity,2
3,Debra Henley,presented,CPU,Quantity,1
4,Debra Henley,presented,Software,Quantity,2
5,Debra Henley,declined,CPU,Quantity,2
6,Fred Anderson,won,CPU,Quantity,2
7,Fred Anderson,won,Maintenance,Quantity,1
8,Fred Anderson,pending,Maintenance,Quantity,1
9,Fred Anderson,presented,CPU,Quantity,1


---

## Preface to merging/joining: long and wide data

Joining tables is a concept that has its roots in SQL, so we won't dive too deeply into it here. But it is good 

Load in the data we've been using above, but now split up with just the demographic variables in one dataset and the survey question answers in another. These datasets are in wide format, and they both contain `subject_id` to identify who the questions are for. 

As you may recall, the demographic responses have fewer observations.

In [3]:
n_demos_file = '~nvr/desktop/dsi-sf-7-materials-nvr/datasets/nerdy_personality_attributes/NPAS_parsed_trunc_demo_sample.csv'
n_survey_file = '~nvr/desktop/dsi-sf-7-materials-nvr/datasets/nerdy_personality_attributes/NPAS_parsed_trunc_survey.csv'

demos_subset = pd.read_csv(n_demos_file)
survey = pd.read_csv(n_survey_file)

In [4]:
print( demos_subset.shape, survey.shape)

((700, 12), (1391, 46))


In [5]:
demos_subset.head(2)

Unnamed: 0,education,urban,gender,engnat,age,hand,religion,voted,married,familysize,major,subject_id
0,4.0,2.0,2.0,1.0,50.0,1.0,1.0,1.0,1.0,3.0,biophysics,1
1,3.0,1.0,2.0,2.0,22.0,1.0,1.0,1.0,1.0,2.0,biology,2


In [6]:
survey.head(2)

Unnamed: 0,race_white,race_nerdy,race_native_american,writing_novel,read_tech_reports,online_over_inperson,introspective,hobbies_over_people,books_over_parties,bookish,...,reserved,conventional,was_odd_child,prefer_fictional_people,enjoy_learning,excited_about_research,strange_person,like_superheroes,socially_awkward,subject_id
0,1.0,0.0,0.0,3.0,5.0,4.0,5.0,4.0,5.0,5.0,...,7.0,1.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,0
1,1.0,0.0,0.0,1.0,4.0,3.0,3.0,1.0,4.0,4.0,...,5.0,1.0,3.0,3.0,3.0,4.0,4.0,4.0,5.0,1


In [7]:
print(demos_subset.columns)
print(survey.columns)

Index([u'education', u'urban', u'gender', u'engnat', u'age', u'hand',
       u'religion', u'voted', u'married', u'familysize', u'major',
       u'subject_id'],
      dtype='object')
Index([u'race_white', u'race_nerdy', u'race_native_american', u'writing_novel',
       u'read_tech_reports', u'online_over_inperson', u'introspective',
       u'hobbies_over_people', u'books_over_parties', u'bookish',
       u'libraries_over_publicspace', u'race_native_austrailian',
       u'like_hard_material', u'race_hispanic', u'diagnosed_autistic',
       u'play_many_videogames', u'race_arab', u'race_asian',
       u'interested_science', u'playes_rpgs', u'in_advanced_classes',
       u'collect_books', u'intelligence_over_appearance',
       u'watch_science_shows', u'academic_over_social',
       u'like_science_fiction', u'like_dry_topics', u'race_black', u'calm',
       u'disorganized', u'extraverted', u'dependable', u'critical',
       u'opennness', u'anxious', u'sympathetic', u'reserved', u'convention

### Pandas `.merge()` function

`.merge()` is a DataFrame method. Its first argument is the other DataFrame that you want to merge with, and the `on` keyword argument is the key or keys that you want the DataFrames to be "matched" on.

We are specifying `how='inner'` here, which means that a subject_id has to be present in both DataFrames for its row to appear in the result. Because the demographics dataset has fewer subject_ids, only the survey rows whose subject_id also appears in the demographics dataset are kept.
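
To get a feel for what `how` changes, here is a sketch comparing an inner and a left merge on two tiny invented frames. `indicator=True` adds a `_merge` column that reports where each row came from, which is handy for checking how many keys failed to match.

```python
import pandas as pd

# Toy stand-ins: demographics exist for fewer subjects than survey answers.
demo_toy = pd.DataFrame({"subject_id": [1, 2], "age": [50.0, 22.0]})
survey_toy = pd.DataFrame({"subject_id": [0, 1, 2], "anxious": [1.0, 4.0, 7.0]})

# Inner: only subject_ids present in both frames survive.
inner = demo_toy.merge(survey_toy, on="subject_id", how="inner")

# Left: every survey row is kept; missing demographics become NaN.
left = survey_toy.merge(demo_toy, on="subject_id", how="left", indicator=True)

print(inner)
print(left)
```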

In [166]:
demos_survey = demos_subset.merge(survey, on=['subject_id'], how='inner')

In [167]:
print(demos_survey.shape)
demos_survey.head(2)

(700, 57)


Unnamed: 0,education,urban,gender,engnat,age,hand,religion,voted,married,familysize,...,sympathetic,reserved,conventional,was_odd_child,prefer_fictional_people,enjoy_learning,excited_about_research,strange_person,like_superheroes,socially_awkward
0,4.0,2.0,2.0,1.0,50.0,1.0,1.0,1.0,1.0,3.0,...,5.0,5.0,1.0,3.0,3.0,3.0,4.0,4.0,4.0,5.0
1,3.0,1.0,2.0,2.0,22.0,1.0,1.0,1.0,1.0,2.0,...,2.0,7.0,1.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0


## Conclusion

In this lesson we learned: 

- Wide tables have one column per variable
- Long tables store variable:value pairs in rows, one row per id and variable
- How to use the pivot_table method
- About data imputing
- How to merge tables 

## Resources 

Check out these resources for some extra help. 

[Pandas API](http://pandas.pydata.org/pandas-docs/stable/api.html) Official documentation for the Pandas package. An online "textbook" that explains how every method works and what parameters it accepts, and provides examples. 

[Jupyter Notebook Tutorial](http://nbviewer.jupyter.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_1-Introduction-to-Pandas.ipynb) A tutorial for beginners. 

[Data Wrangling with Pandas](http://nbviewer.jupyter.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_2-Data-Wrangling-with-Pandas.ipynb) A jupyter notebook tutorial on how to clean and structure data using Pandas.  