
## Pandas Long Format, Wide Format, Pivot Tables, and Melting

---



This lesson is all about transforming data using `pandas`. Data transformation is the reorganization of your data set's rows and columns into a different, potentially more useful shape and format. 

The benefits of transforming your data include better access to relevant information and streamlined data manipulation. As you become more familiar with data sets and their associated operations, you will develop an intuition and appreciation for when it's better to work row-wise or column-wise.

Different data formats are better for different tasks. It takes time and experience to learn the distinctions. But, for now, we'll introduce the common structures, transformations, and how to apply these transformations.




### Learning Objectives
- Understand the differences between long and wide format data.
- Understand pivot tables.
- Practice transforming data between long and wide formats.
- Practice creating pivot tables.
- Learn how to avoid common pitfalls and obstacles in data transformation with `pandas`.

In [21]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

sns.set_style('darkgrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

<a id='wide_format'></a>

### Wide Format Data

---

Between "wide" and "long," wide format data is the more intuitive. It's also a common format for `.csv` files. You've already viewed multiple data sets in wide format throughout this course.



Wide format data is structured so that:

- Unique IDs, subjects, observations, etc. are represented as rows.
- Distinct information categories (variables) are represented as columns. In other words, there is a column for every "variable" with its own unique values.
- This format can often be a more compact matrix, particularly if little or no information is missing.
- It can be useful in `pandas` when you need to perform operations on variables **across columns**; for example, multiplying columns together to create a new column.
- It is the data format required for statistical modeling (with few exceptions).
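
As a rough illustration of the points above, here is a minimal sketch of a wide format table and a column-wise operation. The three subjects and their values are made up for this example; only the column names echo the survey data.

```python
import pandas as pd

# Hypothetical toy data: one row per subject, one column per variable.
toy_wide = pd.DataFrame({
    'subject_id': [1, 2, 3],
    'anxious': [4.0, 7.0, 5.0],
    'calm': [6.0, 2.0, 4.0],
})

# Operating across columns is a one-liner in wide format.
toy_wide['anxious_x_calm'] = toy_wide['anxious'] * toy_wide['calm']
print(toy_wide)
```

Each subject stays on a single row, and the new column is computed row by row from the existing ones.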

<a id='load_nerdy'></a>

### Load and Examine the "Nerdy Personality Attributes" Data Set

---

This is a pre-cleaned and modified version of the full "Nerdy Personality Attributes" survey, which asked subjects to rate themselves based on questions related to "nerdiness" as well as more general personality traits such as openness and extraversion. Researchers also collected demographic information from the subjects.

You can find the raw data [here](http://personality-testing.info/_rawdata/), along with many other sociological surveys.



In this modified version, for the sake of our example, some of the subjects provided data for the survey but not the demographic variables. Because there are missing values and the data are "messy," we have a data cleaning problem.

**Load the data (which is in wide format).** 

In [22]:
nerdy_wide_f = 'datasets/NPAS_parsed_trunc_wide_missing.csv'

nerdy_wide = pd.read_csv(nerdy_wide_f)
print(nerdy_wide.shape)

(1391, 57)


This data set is in a familiar format in which each column is a variable and each row contains an observation for that variable, corresponding to a distinct subject.

*Wide format implies that all of the information for one distinct subject will be represented in the columns corresponding to that row. A single subject should not be represented in multiple rows of data.*

In [23]:
nerdy_wide.head(3)

Unnamed: 0,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0


**Check to see how many null values there are per column.**

In [24]:
nerdy_wide.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1391 entries, 0 to 1390
Data columns (total 57 columns):
subject_id                      1391 non-null int64
academic_over_social            1391 non-null float64
age                             700 non-null float64
anxious                         1391 non-null float64
bookish                         1391 non-null float64
books_over_parties              1391 non-null float64
calm                            1391 non-null float64
collect_books                   1391 non-null float64
conventional                    1391 non-null float64
critical                        1391 non-null float64
dependable                      1391 non-null float64
diagnosed_autistic              1391 non-null float64
disorganized                    1391 non-null float64
education                       700 non-null float64
engnat                          700 non-null float64
enjoy_learning                  1391 non-null float64
excited_about_research          13

In [25]:
nerdy_wide.isnull().sum()

subject_id                        0
academic_over_social              0
age                             691
anxious                           0
bookish                           0
books_over_parties                0
calm                              0
collect_books                     0
conventional                      0
critical                          0
dependable                        0
diagnosed_autistic                0
disorganized                      0
education                       691
engnat                          691
enjoy_learning                    0
excited_about_research            0
extraverted                       0
familysize                      691
gender                          691
hand                            691
hobbies_over_people               0
in_advanced_classes               0
intelligence_over_appearance      0
interested_science                0
introspective                     0
libraries_over_publicspace        0
like_dry_topics             

Each demographic variable (for example, `age`, `education`, and `gender`) has 691 missing values.

The `major` variable has 970 missing values. 

At this point, if we were to just drop all the rows that have any null values, we would lose at least 970 rows because of the missing `major` variable.

> What can we do about that?

With a numeric column, this would be hard to avoid without "imputing" some number to fill in those values. In the simplest case, imputing the mean or median for missing numeric values is a common fix (but not ideal).
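
A minimal sketch of median imputation, assuming a toy numeric column that stands in for something like `age`. The values and the `age_was_missing` flag column are hypothetical and not part of the lesson's data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for a numeric variable such as `age`.
toy = pd.DataFrame({'age': [50.0, np.nan, 22.0, np.nan, 18.0]})

# Median of the observed values (NaN is skipped by default).
age_median = toy['age'].median()

# Keep a flag so the imputed rows stay traceable, then fill the gaps.
toy['age_was_missing'] = toy['age'].isnull()
toy['age_imputed'] = toy['age'].fillna(age_median)
print(toy)
```

Keeping the flag column is one way to remember which values were observed and which were filled in, since imputed values can distort later summaries.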

With a **categorical variable** like `major`, we have the luxury of replacing the missing values with a new category label that stands for "missing." 

**Replace the missing `major` column values with `unknown`.**

In [26]:
nerdy_wide.loc[nerdy_wide.major.isnull(), 'major'] = 'unknown'
print(nerdy_wide.major.isnull().sum())

0


In [27]:
nerdy_wide[['major']].isnull().sum()

major    0
dtype: int64

<a id='long_format'></a>

### Long Format Data

---

Now, we can load the same data — this time in the format commonly called "long."

Long format data is structured so that:

- There are potentially multiple `ID` (identification) columns.
- There are pairs of columns such as `variable:value` that match a variable key to a value (In the simplest case, there would be a single `variable` column and a single `value` column).
- The `variable` column corresponds to the multiple variable columns in a wide format data set. Instead of a column for each variable, you have a row for each `variable:value` pair *per ID*. 
- This is a standard format for SQL databases because it makes it easier to join different tables together with keys.
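
To make the structure concrete, here is a small sketch of a long format table built by hand. The two subjects and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: two subjects, three variables each, stored long.
toy_long = pd.DataFrame({
    'subject_id': [1, 1, 1, 2, 2, 2],
    'variable': ['anxious', 'calm', 'bookish'] * 2,
    'value': [4.0, 6.0, 4.0, 7.0, 2.0, 5.0],
})

# To get one variable, filter rows rather than select a column.
calm_rows = toy_long[toy_long['variable'] == 'calm']
print(calm_rows)

# Each subject contributes one row per variable.
rows_per_subject = toy_long.groupby('subject_id').size()
print(rows_per_subject)
```

Note that pulling out a single variable is a row filter here, whereas in wide format it would be a column selection.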

**Load the long format of the same data below.**

In [28]:
nerdy_long_f = './datasets/NPAS_parsed_trunc_long_missing.csv'

nerdy_long = pd.read_csv(nerdy_long_f)
print(nerdy_long.shape)

(70295, 3)


In [29]:
nerdy_long.head()

Unnamed: 0,subject_id,variable,value
0,1,education,4.0
1,2,education,3.0
2,5,education,2.0
3,6,education,2.0
4,7,education,2.0


You can see that the long format data has far more rows than the wide data set but only three columns.

Below you can view the three columns: `subject_id`, `variable`, and `value`.

**`subject_id:`**
- This is the primary "key" or `ID` column. Each `subject_id` appears in many rows, one for each entry in the `variable` column.

**`variable:`**
- This column indicates the variable with which the item in the `value` column corresponds.

**`value:`**

- This contains all values for all variables for all IDs. Essentially, every cell in the wide data set except the `subject_id` is listed in this column.

**Print out the unique values in the `variable` column.**

You can see that the unique values in the `variable` column correspond to the column headers in the wide format data.

In [30]:
nerdy_long.variable.unique()

array(['education', 'urban', 'gender', 'engnat', 'age', 'hand', 'religion',
       'voted', 'married', 'familysize', 'major', 'race_white',
       'race_nerdy', 'race_native_american', 'writing_novel',
       'read_tech_reports', 'online_over_inperson', 'introspective',
       'hobbies_over_people', 'books_over_parties', 'bookish',
       'libraries_over_publicspace', 'race_native_austrailian',
       'like_hard_material', 'race_hispanic', 'diagnosed_autistic',
       'play_many_videogames', 'race_arab', 'race_asian',
       'interested_science', 'playes_rpgs', 'in_advanced_classes',
       'collect_books', 'intelligence_over_appearance',
       'watch_science_shows', 'academic_over_social',
       'like_science_fiction', 'like_dry_topics', 'race_black', 'calm',
       'disorganized', 'extraverted', 'dependable', 'critical',
       'opennness', 'anxious', 'sympathetic', 'reserved', 'conventional',
       'was_odd_child', 'prefer_fictional_people', 'enjoy_learning',
       'excited_abou

**Replace the missing values in `major` with `unknown` in the long format data set.**

The process for replacing data will be different because of the format. Using logical selection masks with `pandas`' `.loc` syntax is the preferable way to do this.

In [31]:
nerdy_long[nerdy_long['variable']=='major'].head()

Unnamed: 0,subject_id,variable,value
7000,1,major,biophysics
7001,2,major,biology
7002,5,major,Geology
7003,6,major,
7004,7,major,


In [32]:
# Rather than calling the `major` column like we would in a wide DF,
# we have to isolate all rows with a variable value of `major` in a long DF.
major_mask = (nerdy_long.variable == 'major') & (nerdy_long.value.isnull())
print(major_mask[:5])
nerdy_long.loc[major_mask, 'value'] = 'unknown'
print(nerdy_long[nerdy_long.variable == 'major'].isnull().sum())

0    False
1    False
2    False
3    False
4    False
dtype: bool
subject_id    0
variable      0
value         0
dtype: int64


### So we know what wide data look like, and what long (a.k.a. tidy) data look like...

But why the different types?

- SQL performs better with long data
- Models tend only to work with wide data
- Charts are often easier to make with long data
- [example use case](http://data.library.virginia.edu/reshaping-data-from-wide-to-long/)

### Let's convert... first, long to wide

<a id='pivot_tables'></a>

### `Pandas`' `.pivot_table()` Function: Long to Wide Format

---

The `pd.pivot_table()` function is a powerful tool both for transforming data from long to wide format and for summarizing data with user-supplied functions.

First, we'll look at transforming the long format data back into the wide format using the `.pivot_table()` function.



**Important parameters for the `.pivot_table()` function include:**

- The `pivot_table()` function takes a DataFrame to pivot as its first argument. 
    
- **`columns`**: This is the list of columns in the long format data to be transformed back into columns in the wide format.
- **`values`**: A single column indicating the values to use when filling the new wide format columns.
- **`index`**: Columns in the long format data that we want to be the index variables. 
- **`aggfunc`**: Often `.pivot_table()` is used to perform a summary of the data. `aggfunc` stands for "aggregation function." It's optional and defaults to the mean. You can also insert your own function, which we'll demonstrate below.
- **`fill_value`**: If a cell is missing for the wide format data, this value will fill it in.
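
The parameters above can be sketched on a tiny, made-up long table. The data and the `-1` placeholder are assumptions chosen so that the effect of `fill_value` is easy to spot; they are not taken from the survey.

```python
import pandas as pd

# Hypothetical toy long data; subject 2 has no `calm` row.
toy_long = pd.DataFrame({
    'subject_id': [1, 1, 2],
    'variable': ['anxious', 'calm', 'anxious'],
    'value': [4.0, 6.0, 7.0],
})

toy_wide = pd.pivot_table(
    toy_long,
    index='subject_id',    # one output row per subject
    columns='variable',    # one output column per distinct variable
    values='value',        # what goes into the cells
    aggfunc='mean',        # each cell holds one value here, so the mean is that value
    fill_value=-1,         # placeholder for missing subject/variable combinations
)
print(toy_wide)
```

Subject 2 never reported `calm`, so that cell is filled with the placeholder instead of being left as `NaN`.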
    


Next, we'll pass our own function, `select_item_or_nan()`, as the `aggfunc` keyword argument.

Because each `subject_id` has a single value per variable, I just want the single element in the long format value cell.

But my data are messy, and I have some missing values, so I have to write a function that checks whether a value is available and, if it's not, fills in `np.nan`.


In [33]:
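def select_item_or_nan(x):
    # Return the group's single value, or NaN when the group is empty.
    # Check the length first: .iloc[0] on an empty group raises an IndexError.
    if len(x) == 0:
        return np.nan
    return x.iloc[0]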
# This will take a few seconds to run.
nerdy_wide_pv = pd.pivot_table(nerdy_long, columns=['variable'], values='value',
                            index=['subject_id'], aggfunc=select_item_or_nan,
                            fill_value=np.nan)
# 'pv' for 'pivot version.'
nerdy_wide_pv.head()

variable,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,dependable,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
subject_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,7.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,5.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,3.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
3,5.0,,4.0,4.0,5.0,7.0,5.0,1.0,2.0,7.0,...,,2.0,5.0,5.0,6.0,,,5.0,5.0,4.0
4,4.0,,3.0,5.0,5.0,6.0,4.0,2.0,5.0,4.0,...,,6.0,0.0,5.0,5.0,,,5.0,4.0,1.0
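
As a possible alternative to a hand-written function like `select_item_or_nan()`, `pandas` also accepts the name of a built-in aggregation such as `'first'`, which takes the first non-null value in each group. The sketch below uses a made-up four-row table, so treat it as an illustration rather than a drop-in replacement for the cell above.

```python
import pandas as pd

# Hypothetical toy long data with a missing `major` for subject 2.
toy_long = pd.DataFrame({
    'subject_id': [1, 1, 2, 2],
    'variable': ['major', 'calm', 'major', 'calm'],
    'value': ['biology', '6.0', None, '2.0'],
})

# 'first' picks the first non-null value per subject/variable group.
toy_wide = pd.pivot_table(toy_long, index='subject_id', columns='variable',
                          values='value', aggfunc='first')
print(toy_wide)
```

The missing `major` for subject 2 simply comes through as a null cell in the wide table.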


<a id='multiindex'></a>

### MultiIndex/Hierarchical Indices in `pandas`

---

In the header, you can see that the format of the new wide data is *not* the same as our originally loaded wide format. `pandas` implements something called **MultiIndexing** or **hierarchical indexing**, which allows for "tiered" row and column labels.

The main difference is that we now have a `variable` name in the top left corner, which is "labeling" our columns (and corresponds to the name of our original column in the long format data). 

In [34]:
nerdy_wide_pv.head()

variable,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,dependable,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
subject_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,7.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,5.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,3.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
3,5.0,,4.0,4.0,5.0,7.0,5.0,1.0,2.0,7.0,...,,2.0,5.0,5.0,6.0,,,5.0,5.0,4.0
4,4.0,,3.0,5.0,5.0,6.0,4.0,2.0,5.0,4.0,...,,6.0,0.0,5.0,5.0,,,5.0,4.0,1.0


In [37]:
# Let's compare the column names with our original wide data set. Do you see a difference?
print(nerdy_wide_pv.columns)
print("\n")
print(nerdy_wide.columns)
print(nerdy_wide_pv.shape)

Index([u'academic_over_social', u'age', u'anxious', u'bookish',
       u'books_over_parties', u'calm', u'collect_books', u'conventional',
       u'critical', u'dependable', u'diagnosed_autistic', u'disorganized',
       u'education', u'engnat', u'enjoy_learning', u'excited_about_research',
       u'extraverted', u'familysize', u'gender', u'hand',
       u'hobbies_over_people', u'in_advanced_classes',
       u'intelligence_over_appearance', u'interested_science',
       u'introspective', u'libraries_over_publicspace', u'like_dry_topics',
       u'like_hard_material', u'like_science_fiction', u'like_superheroes',
       u'major', u'married', u'online_over_inperson', u'opennness',
       u'play_many_videogames', u'playes_rpgs', u'prefer_fictional_people',
       u'race_arab', u'race_asian', u'race_black', u'race_hispanic',
       u'race_native_american', u'race_native_austrailian', u'race_nerdy',
       u'race_white', u'read_tech_reports', u'religion', u'reserved',
       u'socially_awkwa

**Let's drop the null values from our recreated wide format data. How many unique subjects do we have?**

Remember our `subject_id` is now the **index**, so we can access it using the `.index` attribute.

In [36]:
print(len(nerdy_wide_pv.index.unique()))
nerdy_wide_pv.dropna(inplace=True)
print(nerdy_wide_pv.shape)
print(len(nerdy_wide_pv.index.unique()))

nerdy_wide_pv.head()

1391
(700, 56)
700


variable,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,dependable,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
subject_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,5.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,3.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
5,4.0,18.0,5.0,3.0,4.0,4.0,4.0,4.0,3.0,5.0,...,1.0,5.0,4.0,5.0,4.0,3.0,2.0,4.0,5.0,3.0
6,4.0,18.0,1.0,4.0,5.0,6.0,5.0,1.0,1.0,2.0,...,1.0,5.0,5.0,5.0,5.0,2.0,2.0,1.0,4.0,1.0
7,3.0,21.0,7.0,3.0,5.0,1.0,5.0,4.0,6.0,5.0,...,12.0,5.0,5.0,5.0,6.0,2.0,1.0,3.0,3.0,3.0


**Now, let's convert the `subject_id` index back into a column, and remove it from the index.**

We can use the DataFrame method `.reset_index()` to move `subject_id` into a column and create a new index. We now have a DataFrame in the same format as the original wide data we loaded earlier. The only exception is that we still have the `variable` column label.

In [40]:
nerdy_wide_flat = nerdy_wide_pv.reset_index()

In [43]:
nerdy_wide_flat.head()

variable,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
1,2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
2,5,4.0,18.0,5.0,3.0,4.0,4.0,4.0,4.0,3.0,...,1.0,5.0,4.0,5.0,4.0,3.0,2.0,4.0,5.0,3.0
3,6,4.0,18.0,1.0,4.0,5.0,6.0,5.0,1.0,1.0,...,1.0,5.0,5.0,5.0,5.0,2.0,2.0,1.0,4.0,1.0
4,7,3.0,21.0,7.0,3.0,5.0,1.0,5.0,4.0,6.0,...,12.0,5.0,5.0,5.0,6.0,2.0,1.0,3.0,3.0,3.0


In [45]:
nerdy_wide_flat.columns

Index([u'subject_id', u'academic_over_social', u'age', u'anxious', u'bookish',
       u'books_over_parties', u'calm', u'collect_books', u'conventional',
       u'critical', u'dependable', u'diagnosed_autistic', u'disorganized',
       u'education', u'engnat', u'enjoy_learning', u'excited_about_research',
       u'extraverted', u'familysize', u'gender', u'hand',
       u'hobbies_over_people', u'in_advanced_classes',
       u'intelligence_over_appearance', u'interested_science',
       u'introspective', u'libraries_over_publicspace', u'like_dry_topics',
       u'like_hard_material', u'like_science_fiction', u'like_superheroes',
       u'major', u'married', u'online_over_inperson', u'opennness',
       u'play_many_videogames', u'playes_rpgs', u'prefer_fictional_people',
       u'race_arab', u'race_asian', u'race_black', u'race_hispanic',
       u'race_native_american', u'race_native_austrailian', u'race_nerdy',
       u'race_white', u'read_tech_reports', u'religion', u'reserved',
       u

**Remove the column label.**

You can remove the column label (which can be confusing during print statements) by setting the `.columns.name` attribute to `None`.

In [19]:
nerdy_wide_flat.columns.name = None
nerdy_wide_flat.head(2)

Unnamed: 0,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
1,2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0


Now we match the original wide format.
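
The same cleanup can be sketched as a single chain. This is one possible style, shown on a made-up table; `.rename_axis(None, axis=1)` is used here as an alternative to assigning `.columns.name` directly.

```python
import pandas as pd

# Hypothetical toy long data.
toy_long = pd.DataFrame({
    'subject_id': [1, 1, 2, 2],
    'variable': ['anxious', 'calm', 'anxious', 'calm'],
    'value': [4.0, 6.0, 7.0, 2.0],
})

toy_flat = (
    pd.pivot_table(toy_long, index='subject_id', columns='variable',
                   values='value', aggfunc='first')
    .reset_index()               # move subject_id from the index to a column
    .rename_axis(None, axis=1)   # drop the 'variable' label on the columns
)
print(toy_flat.columns.name)
print(toy_flat)
```

Chaining avoids intermediate variables, at the cost of being a little harder to debug step by step.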

### Now let's do wide to long... 

<a id='melt'></a>

### Using pandas' `.melt()` Function: Wide to Long Format

---

**`.melt()`** is a function that essentially performs the inverse of `.pivot_table()` on DataFrames.

`.melt()` takes a DataFrame as its first argument. Additional arguments typically used with this function are:

- **`id_vars`**: The column or columns that will be ID variables. ID variables are kept as columns and identify which subject each `variable` and `value` pair belongs to.
- **`value_vars`**: A list that specifies which columns should be converted into single `value` and `variable` columns.
- **`var_name`**: The header name of the `variable` column (default='variable').
- **`value_name`**: The header name of the `value` column (default='value').
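
Here is a minimal sketch that uses all four arguments on a made-up two-row table. The names `trait` and `rating` are hypothetical header names picked for this example.

```python
import pandas as pd

# Hypothetical toy wide data.
toy_wide = pd.DataFrame({
    'subject_id': [1, 2],
    'major': ['biology', 'unknown'],
    'anxious': [4.0, 7.0],
    'calm': [6.0, 2.0],
})

toy_long = pd.melt(
    toy_wide,
    id_vars=['subject_id', 'major'],   # kept as columns, repeated on every row
    value_vars=['anxious', 'calm'],    # columns to stack into rows
    var_name='trait',                  # header for the former column names
    value_name='rating',               # header for the former cell values
)
print(toy_long)
```

Two subjects times two melted columns gives four rows, and each row carries its ID columns along with it.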

**To keep it simple, let's first subset the wide format data to just these columns: `['subject_id','major','anxious','bookish','calm']`.**


In [46]:
nerdy_subset = nerdy_wide_flat[['subject_id','major','anxious','bookish','calm']]
nerdy_subset.head(2)

variable,subject_id,major,anxious,bookish,calm
0,1,biophysics,4.0,4.0,6.0
1,2,biology,7.0,5.0,2.0


**Use `.melt()` on the subset with `id_vars=['subject_id','major']`.**

Print out the shape of the data and the header. 

**Note**: Columns that are not `id_vars` become part of the `variable` and `value` columns.

In [21]:
nerdy_sub_long = pd.melt(nerdy_subset, id_vars=['subject_id','major'])
print(nerdy_subset.shape, nerdy_sub_long.shape)
nerdy_sub_long.head(4)

(700, 5) (2100, 4)


Unnamed: 0,subject_id,major,variable,value
0,1,biophysics,anxious,4.0
1,2,biology,anxious,7.0
2,5,Geology,anxious,5.0
3,6,unknown,anxious,1.0


So, if we don't specify `major` as an `id_var`, it will end up in the `variable` column.

In [22]:
nerdy_sub_long = pd.melt(nerdy_subset, id_vars='subject_id')
print(nerdy_subset.shape, nerdy_sub_long.shape)
nerdy_sub_long.head(4)

(700, 5) (2800, 3)


Unnamed: 0,subject_id,variable,value
0,1,major,biophysics
1,2,major,biology
2,5,major,Geology
3,6,major,unknown


You can achieve the same result without having to subset the DataFrame first by simply specifying the `value_vars` keyword argument. The output DataFrame will then only contain the data specified in the `id_vars` and `value_vars` arguments.

**Create the same DataFrame with `.melt()` on the full wide data set, but select the columns to use with the `value_vars` argument.**

In [23]:
# With two `value_vars`:
nerdy_sub_long = pd.melt(nerdy_wide_flat, id_vars=['subject_id','major'], 
                         value_vars=['bookish','calm'])
print(nerdy_wide_flat.shape, nerdy_sub_long.shape)
print(nerdy_sub_long['variable'].unique())
nerdy_sub_long.head(4)

(700, 57) (1400, 4)
['bookish' 'calm']


Unnamed: 0,subject_id,major,variable,value
0,1,biophysics,bookish,4.0
1,2,biology,bookish,5.0
2,5,Geology,bookish,3.0
3,6,unknown,bookish,4.0


If you don't specify any `value_vars`, all columns that are not `id_vars` are used.

In [24]:
# With all `value_vars`:
nerdy_sub_long_big = pd.melt(nerdy_wide_flat, id_vars=['subject_id','major'])
print(nerdy_wide_flat.shape, nerdy_sub_long_big.shape)
nerdy_sub_long_big.head(4)

(700, 57) (38500, 4)


Unnamed: 0,subject_id,major,variable,value
0,1,biophysics,academic_over_social,2.0
1,2,biology,academic_over_social,5.0
2,5,Geology,academic_over_social,4.0
3,6,unknown,academic_over_social,4.0


In [25]:
nerdy_sub_long_big['variable'].unique()

array(['academic_over_social', 'age', 'anxious', 'bookish',
       'books_over_parties', 'calm', 'collect_books', 'conventional',
       'critical', 'dependable', 'diagnosed_autistic', 'disorganized',
       'education', 'engnat', 'enjoy_learning', 'excited_about_research',
       'extraverted', 'familysize', 'gender', 'hand',
       'hobbies_over_people', 'in_advanced_classes',
       'intelligence_over_appearance', 'interested_science',
       'introspective', 'libraries_over_publicspace', 'like_dry_topics',
       'like_hard_material', 'like_science_fiction', 'like_superheroes',
       'married', 'online_over_inperson', 'opennness',
       'play_many_videogames', 'playes_rpgs', 'prefer_fictional_people',
       'race_arab', 'race_asian', 'race_black', 'race_hispanic',
       'race_native_american', 'race_native_austrailian', 'race_nerdy',
       'race_white', 'read_tech_reports', 'religion', 'reserved',
       'socially_awkward', 'strange_person', 'sympathetic', 'urban',
       'vot

## More uses for pivot_tables

Pivot tables are not just used for converting long to wide.

They are also used, much like `.groupby()`, to summarize data by groups.
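
To show how close the two approaches are, the sketch below computes the same group means once with `pd.pivot_table()` and once with `.groupby()` followed by `.unstack()`. The six rows are made up for illustration.

```python
import pandas as pd

# Hypothetical toy long data with a grouping column.
toy_long = pd.DataFrame({
    'major': ['biology', 'biology', 'art', 'art', 'art', 'biology'],
    'variable': ['calm', 'calm', 'calm', 'bookish', 'bookish', 'bookish'],
    'value': [6.0, 2.0, 5.0, 4.0, 2.0, 3.0],
})

# Summary via pivot_table: rows are majors, columns are variables.
via_pivot = pd.pivot_table(toy_long, index='major', columns='variable',
                           values='value', aggfunc='mean')

# The same summary via groupby, then moving `variable` into the columns.
via_groupby = toy_long.groupby(['major', 'variable'])['value'].mean().unstack()

print(via_pivot)
print(via_groupby)
```

Which one to use is largely a matter of taste; `pivot_table()` states the output layout up front, while `.groupby()` leaves the reshaping as a separate step.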

<a id='pivot_table_summarizing'></a>

### Summarizing Your Data With  `.pivot_table()` and Aggregate Functions

---

For those of you who have experience with Excel, `pandas`' `.pivot_table()` accomplishes the same thing as a spreadsheet pivot table. It's more powerful but harder to use than the spreadsheet version.

`.pivot_table()` can take in a variable, value, and index to group by and apply aggregate functions to summarize the data. 

**Note**: Be careful that your index variable is not pulling out unique rows (For example, `subject_id` by variable would only have one value to send into the aggregate functions).

Below, I am calling the `.pivot_table()` function with:

- The long format data as the first argument.
- `variable` specified as the columns that indicate the variable names (groups).
- `value` specified as the column that contains the data per variable.
- `major` as the index; the rows will be grouped by `major`.
- `np.mean`, `np.median`, `np.std`, and `len` as aggregate functions. These will be calculated for each `major-by-variable` group.
- A `fill_value` of `np.nan` for cells in the output table that have no data.



In [26]:
nerdy_major_summary = pd.pivot_table(nerdy_sub_long, columns=['variable'], values='value',
                                     index=['major'], aggfunc=[np.mean, np.median, np.std, len],
                                     fill_value=np.nan)

DataError: No numeric types to aggregate

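What's that error? Let's try to figure it out by checking the data types. If the `value` column has dtype `object` (strings rather than numbers), the numeric aggregate functions have nothing to work with.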

In [27]:
nerdy_sub_long.dtypes

subject_id     int64
major         object
variable      object
value         object
dtype: object

In [28]:
nerdy_sub_long['value'] = nerdy_sub_long['value'].astype(float)

In [29]:
nerdy_sub_long.dtypes

subject_id      int64
major          object
variable       object
value         float64
dtype: object
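
`.astype(float)` works when every entry can be parsed as a number. If the column could still contain text, a more forgiving option is `pd.to_numeric()` with `errors='coerce'`, which turns unparseable entries into `NaN`. A small sketch on made-up values:

```python
import pandas as pd

# Hypothetical toy column mixing numeric strings, text, and a missing value.
toy = pd.DataFrame({'value': ['4.0', '7.0', 'biology', None]})

# Entries that cannot be parsed as numbers become NaN instead of raising an error.
toy['value_num'] = pd.to_numeric(toy['value'], errors='coerce')
print(toy.dtypes)
print(toy)
```

The trade-off is that coercion hides bad values silently, so it's worth counting the resulting nulls afterwards.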

In [30]:
nerdy_major_summary = pd.pivot_table(nerdy_sub_long, columns=['variable'], values='value',
                                     index=['major'], aggfunc=[np.mean, np.median, np.std, len],
                                     fill_value=np.nan)

The output DataFrame gives you a "hierarchical" column index: the two variables (`bookish` and `calm`) under each of the four aggregate functions. The row index is the `major` groups.

If you apply more index variables, the row indices will also become hierarchical! However, this can quickly make for a bloated DataFrame.

In [31]:
nerdy_major_summary.head(10)

Unnamed: 0_level_0,mean,mean,median,median,std,std,len,len
variable,bookish,calm,bookish,calm,bookish,calm,bookish,calm
major,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
None yet,3.0,3.0,3.0,3.0,,,1,1
+ACI-+ACIAIg-hotel and restaurant management+ACIAIg-+ACI-,2.0,7.0,2.0,7.0,,,1,1
Aerospace Engineer,2.0,7.0,2.0,7.0,,,1,1
Aerospace Engineering,4.0,3.0,4.0,3.0,,,1,1
Agricultural Economics,2.0,6.0,2.0,6.0,,,1,1
Anthropology,3.666667,4.333333,4.0,4.0,0.57735,2.516611,3,3
Anthropology,4.0,3.0,4.0,3.0,,,1,1
Architecture,4.0,5.666667,4.0,6.0,1.0,1.527525,3,3
Architecture,1.0,5.0,1.0,5.0,,,1,1
Art,4.333333,5.333333,4.5,5.5,0.816497,1.21106,6,6


<a id='examining_multiindex'></a>

### The Inner Workings of the MultiIndex

--- 



If you print out the columns, you can see that the column index has become a `pandas` `MultiIndex` object that has levels, labels, and names.

In [32]:
print(nerdy_major_summary.columns)

MultiIndex(levels=[[u'mean', u'median', u'std', u'len'], [u'bookish', u'calm']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=[None, u'variable'])


Indexing along the hierarchical column headers can be done with chained bracket keys:

In [33]:
nerdy_major_summary['mean'].head(2)

variable,bookish,calm
major,Unnamed: 1_level_1,Unnamed: 2_level_1
None yet,3.0,3.0
+ACI-+ACIAIg-hotel and restaurant management+ACIAIg-+ACI-,2.0,7.0


In [34]:
nerdy_major_summary['mean']['bookish'].head(2)

major
 None yet                                                    3.0
+ACI-+ACIAIg-hotel and restaurant management+ACIAIg-+ACI-    2.0
Name: bookish, dtype: float64

In [35]:
nerdy_major_summary['mean'][['calm','bookish']].head(2)

variable,calm,bookish
major,Unnamed: 1_level_1,Unnamed: 2_level_1
None yet,3.0,3.0
+ACI-+ACIAIg-hotel and restaurant management+ACIAIg-+ACI-,7.0,2.0
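
Besides chained brackets, a hierarchical column can be addressed with a tuple, and `.xs()` can slice on an inner level. The sketch below builds a small stand-in summary table by hand; its values are invented, and only its layout mimics `nerdy_major_summary`.

```python
import pandas as pd

# Hypothetical stand-in for a summary table with two column levels.
columns = pd.MultiIndex.from_product(
    [['mean', 'median'], ['bookish', 'calm']], names=[None, 'variable'])
toy_summary = pd.DataFrame(
    [[3.0, 3.0, 3.0, 3.0], [4.0, 5.5, 4.0, 6.0]],
    index=pd.Index(['Art', 'Biology'], name='major'),
    columns=columns,
)

# A tuple addresses both levels in a single lookup.
mean_bookish = toy_summary[('mean', 'bookish')]
print(mean_bookish)

# .xs() slices on the inner level: every statistic for `calm`.
calm_stats = toy_summary.xs('calm', axis=1, level='variable')
print(calm_stats)
```

The tuple form does the selection in one step, which also makes it the safer choice if you later want to assign values with `.loc`.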


<a id='multiindex_to_flat'></a>

### Getting Rid of the MultiIndex: "Flattening" Data

---

MultiIndex DataFrames hold great potential and are a cool concept. That being said, the overhead and confusion on how to subset/mask them is most often not worth it, especially when your data needs to be formatted for insertion into a model.

The most reliable way to "flatten" a MultiIndexed DataFrame is with the `.to_records()` function. To make this a new DataFrame, it needs to be wrapped in a `pd.DataFrame()` like so:

In [36]:
nerdy_major_flat = pd.DataFrame(nerdy_major_summary.to_records())
nerdy_major_flat.head(2)

Unnamed: 0,major,"('mean', 'bookish')","('mean', 'calm')","('median', 'bookish')","('median', 'calm')","('std', 'bookish')","('std', 'calm')","('len', 'bookish')","('len', 'calm')"
0,None yet,3.0,3.0,3.0,3.0,,,1,1
1,+ACI-+ACIAIg-hotel and restaurant management+A...,2.0,7.0,2.0,7.0,,,1,1


You can see that the new column names are tuples of the hierarchy of MultiIndexed columns. You could convert these to new, more easily indexed column names with something like a list comprehension.
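
One way the list comprehension idea could look is sketched below. It works on the MultiIndex columns directly rather than on the `.to_records()` output, and the `stat_variable` naming scheme is just an assumption made for this example.

```python
import pandas as pd

# Hypothetical stand-in for a summary table with two column levels.
columns = pd.MultiIndex.from_product([['mean', 'len'], ['bookish', 'calm']])
toy_summary = pd.DataFrame([[3.0, 3.0, 1, 1]], index=['Art'], columns=columns)

# Iterating over a MultiIndex yields one tuple per column, e.g. ('mean', 'bookish').
toy_flat = toy_summary.copy()
toy_flat.columns = ['_'.join(pair) for pair in toy_flat.columns]
print(list(toy_flat.columns))
```

After this, each column has a plain string name and can be selected like any ordinary column.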


### Let's see another application for `pivot_table`

We'll use the wide data set and just create some summary statistics.

In [38]:
nerdy_wide.head()

Unnamed: 0,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
3,3,5.0,,4.0,4.0,5.0,7.0,5.0,1.0,2.0,...,,2.0,5.0,5.0,6.0,,,5.0,5.0,4.0
4,4,4.0,,3.0,5.0,5.0,6.0,4.0,2.0,5.0,...,,6.0,0.0,5.0,5.0,,,5.0,4.0,1.0


> Check: can anyone describe what the below pivot_table is showing?

In [39]:
pd.pivot_table(nerdy_wide, index='anxious', columns = 'calm', values='age', aggfunc=np.mean)

calm,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0
anxious,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0.0,27.666667,,,,,,,
1.0,,,21.0,28.4,18.75,30.166667,25.157895,26.615385
2.0,,33.5,15.5,32.111111,24.0,33.833333,33.692308,32.368421
3.0,,42.0,26.6,21.833333,27.428571,29.933333,30.45,26.4
4.0,22.0,16.0,21.5,31.7,26.5,26.416667,34.166667,25.125
5.0,25.0,18.285714,24.636364,27.653846,25.735294,25.645161,26.636364,23.538462
6.0,17.0,18.0,26.866667,23.413793,18.923077,28.212121,29.6,35.833333
7.0,,20.53125,21.857143,20.925926,20.769231,19.181818,29.6,16.5


In [40]:
nerdy_wide[(nerdy_wide['anxious']==4) & (nerdy_wide['calm']==0)]['age'].mean()

22.0

### Conclusion:
- we saw the difference between long and wide data
- we used pivot_table for long -> wide
- we used melt for wide -> long
- we saw other applications of pivot_table, including summarizing data
