# Exploratory Data Analysis with Pandas


## Introduction

In [None]:
import numpy as np
import pandas as pd

Video tutorials: http://www.dataschool.io/easier-data-analysis-with-pandas/

A lot of the content of this notebook is based on Jake VanderPlas' excellent Python Data Science Handbook: https://jakevdp.github.io/PythonDataScienceHandbook/

Many exercises are adapted from https://github.com/ajcr/100-pandas-puzzles

## NumPy - the Foundation of Data Science in Python

Data science is largely about the efficient manipulation of collections of numbers, so a language needs a way to do this well in order to support effective data science. In Python this is done through the NumPy library, whose arrays are much more memory- and computation-efficient than Python's built-in lists.

Python lists are suboptimal for several reasons:

* they are heterogeneous
* even when storing numbers, these numbers are objects (as Python is a pure OOPL), and so are more than just (say) int32 or int64's like in C/C++; they have reference counts, type info, size info, and the actual data

Python offers an `array` type which is homogeneous and so improves on lists as far as storage goes, but it offers limited operations on that data.

NumPy bridges the gap, offering both efficient storage of homogeneous data in single or multi-dimensional arrays, and a rich set of operations on that data.
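The storage difference can be seen directly. The following is a minimal sketch, assuming CPython and an explicit `int64` dtype; exact object sizes vary by platform and Python version, so no specific numbers are claimed for the Python objects:

```python
import sys
import numpy as np

values = list(range(1000))
arr = np.array(values, dtype=np.int64)

# Each Python int is a full object carrying a reference count and type info.
print(sys.getsizeof(values[1]))   # bytes used by one small int object

# Each array element is a raw 8-byte integer; metadata is stored once per array.
print(arr.itemsize)               # bytes per element
print(arr.nbytes)                 # itemsize * number of elements

# Arithmetic applies to the whole array without an explicit Python loop.
doubled = arr * 2
```

The list itself additionally stores one pointer per element, so the real gap is larger than the per-object comparison suggests.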

In this section we will cover some of the basics of NumPy, but our focus will be mostly on Pandas, a library built on top of NumPy that is particularly well-suited to manipulating tabular data.


In [None]:
# Create one dimensional NumPy array from a list
a = np.array([1, 2, 3])
a

In [None]:
# Append a value
b = a
a = np.append(a, 4)  # Note that this makes a copy; the original array is not affected
print(b)
print(a)

In [None]:
# Get the shape and # of elements
print(np.shape(a))
print(np.size(a))

In [None]:
# Index and slice
print(f'Second element of a is {a[1]}')
print(f'Last element of a is {a[-1]}')
print(f'Middle two elements of a are {a[1:3]}')

In [None]:
# Create a 2-D array from a list of lists
b = np.array([[1,2,3],
              [4,5,6],
              [7,8,9]])
b

In [None]:
# Get the shape, # of elements, and # of dimensions
print(np.shape(b))
print(np.size(b))
print(np.ndim(b))

In [None]:
# Get the first row of b; these are equivalent
print(b[0]) 
print(b[0,:])

In [None]:
# Get the first column of b
print(b[:,0])

In [None]:
# Get a subsection of b, from 1,1 through 2,2 (i.e. before 3,3)
print(b[1:3,1:3])

In [None]:
# Create an array of zeros of length n
np.zeros(5)

In [None]:
# Create an array of 1s
np.ones(5)

In [None]:
# Create an array of 10 random integers from 1 to 99 (the upper bound, 100, is exclusive)
np.random.randint(1,100, 10)

In [None]:
# Create linearly spaced array of 5 values from 0 to 100
np.linspace(0, 100, 5)

### UFuncs

NumPy supports highly efficient operations on arrays called UFuncs (Universal Functions).

In [None]:
np.mean(b)  # Get the mean of all the elements (strictly an aggregation rather than a ufunc)

In [None]:
np.power(b, 2)  # Raise every element to second power

You can get the details on UFuncs here: https://docs.scipy.org/doc/numpy-1.13.0/reference/ufuncs.html
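As a small illustration of what makes a function a ufunc, the sketch below rebuilds the 3x3 array used earlier and shows element-wise application plus the `reduce` method that ufuncs carry. It is only an example of the idea, not a full tour of the ufunc API:

```python
import numpy as np

b = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# np.power and np.add are ufunc objects; they operate element by element.
print(isinstance(np.power, np.ufunc))

# The arithmetic operators on arrays dispatch to the matching ufuncs.
squared = b ** 2          # same result as np.power(b, 2)
shifted = np.add(b, 10)   # same result as b + 10

# Ufuncs also provide methods such as reduce, which folds along an axis.
column_totals = np.add.reduce(b, axis=0)
print(column_totals)
```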

## Pandas Series

A Pandas Series is a one-dimensional array of indexed data. It wraps a sequence of values and a sequence of indices. The values are a NumPy array, while the indices are an instance of a pd.Index object.


In [None]:
data = pd.Series([1, 4, 9, 16, 25])
data

In [None]:
data.values

In [None]:
data.index

You can show the first few lines with `.head()`. The argument, if omitted, defaults to 5.

In [None]:
data.head(2)

Normal indexing and slicing operations are available:

In [None]:
data[2]

In [None]:
data[2:4]

Where NumPy arrays have implicit integer sequence indices, Pandas indices are explicit and need not be integers:

In [None]:
data = pd.Series([1, 4, 9, 16, 25], index=['square of 1', 'square of 2', 'square of 3', 'square of 4', 'square of 5'])
data

In [None]:
data['square of 3']

As you can see, a Series is a lot like a Python dict (with additional slicing), and we can construct one from a Python dict:

In [None]:
pd.Series({'square of 1':1, 'square of 2':4, 'square of 3':9, 'square of 4':16, 'square of 5':25})

### Exercise 1

Given the list below, create a Series that has the list as both the index and the values, and then display the first 3 rows:

In [None]:
ex1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm']

In [None]:
# Exercise 1: Put your code here.
# Uncomment and run the %load magic for a sample solution.
# %load ex04-01.py

A number of dict-style operations work on a Series:

In [None]:
# Reconstruct the Series
data = pd.Series([1, 4, 9, 16, 25], index=['square of 1', 'square of 2', 'square of 3', 'square of 4', 'square of 5'])

In [None]:
'square of 5' in data

In [None]:
data.keys()

In [None]:
data.items()  # Iterable

In [None]:
list(data.items())

In [None]:
data.values  # Unlike a Python dict, this is an attribute holding an array, not a method returning an iterable

In [None]:
data['square of 6'] = 36  # We can add new entries
data

In [None]:
data['square of 6'] = -1  # And change existing values
data

In [None]:
del data['square of 6']  # And delete a value
data

There are a number of functions available on a series, like `.sum()`, `.median()`, `.mode()`, and `.mean()`:

In [None]:
data.mean()

Series also behaves a lot like a list. We saw some indexing and slicing earlier. This can be done on non-numeric indexes too:

In [None]:
data['square of 2': 'square of 4']  # This can be confusing as it INCLUDES the final value

In [None]:
data['square of 2': 'cube of 4']  # Be aware: with a sorted index like this one, a missing label raises no error; here the result is empty

### Exercise 2

Delete the row 'k' from the earlier series, then display the rows from 'f' through 'l'.

In [None]:
# Exercise 2: put your code here
# %load ex04-02.py


## Pandas DataFrames

A DataFrame is like a dictionary that maps column names to Series objects that share the same index.

Read the sentence above again and make sure it makes sense to you.

In [None]:
names = pd.Series(['Alice', 'Bob', 'Carol'])
phones = pd.Series(['555-123-4567', '555-987-6543', '555-245-6789'])

df = pd.DataFrame({'Name': names, 'Phone': phones})  # 'Name' and 'Phone' are the column names
df

In [None]:
df.index  # Like Series, DataFrame has an index for rows

In [None]:
df.columns  # DataFrame also has an index for columns

In [None]:
df.values

In [None]:
df['Name']  # Acts similar to dictionary; returns Series

In [None]:
df.Name  # You can also access columns like this, with dot-notation.
# Occasionally this breaks if there is a name conflict with a DataFrame attribute or method, like 'count'.

In [None]:
# You can add new columns. Later we'll see how to do this as a function of existing columns
df['Closed'] = True
df.head()

In [None]:
# Use .describe() to get summary statistics
df.describe()

There are many ways to construct a DataFrame. For example, from a Series or dictionary of Series, from a list of Python dicts, or from a 2-D NumPy array. There are also utility functions to read data from disk into a DataFrame, e.g. from a .csv file or an Excel spreadsheet. We'll cover some of these later.
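Here is a brief sketch of three of these construction routes. The column names and values are made up for illustration:

```python
import numpy as np
import pandas as pd

# From a list of dicts: each dict is a row, and the keys become column names.
rows = [{'name': 'Alice', 'age': 30}, {'name': 'Bob', 'age': 25}]
df_from_dicts = pd.DataFrame(rows)

# From a 2-D NumPy array: column names are supplied separately.
grid = np.arange(6).reshape(3, 2)
df_from_array = pd.DataFrame(grid, columns=['x', 'y'])

# From a single Series: the Series name becomes the column name.
s = pd.Series([1.5, 2.5], name='score')
df_from_series = s.to_frame()

print(df_from_dicts)
print(df_from_array)
print(df_from_series)
```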

Many DataFrame operations take an `axis` argument, which defaults to zero. It specifies the axis along which the operation runs: `axis=0` works down the rows, producing one result per column, while `axis=1` works across the columns, producing one result per row.
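The `axis` argument is easy to get backwards, so a tiny example with `sum` may help. The data here is invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (the default) collapses the rows: one result per column.
col_sums = df.sum(axis=0)
print(col_sums)

# axis=1 collapses the columns: one result per row.
row_sums = df.sum(axis=1)
print(row_sums)
```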

### Exercise 3

Create a DataFrame from the dictionary below:

In [None]:
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

In [None]:
# Uncomment and run next line for solutions
# %load ex04-03.py

In [None]:
# Put your code to create the DataFrame here


In [None]:
# Generate a summary of the data


In [None]:
# Calculate the sum of all visits (the total number of visits).


## Indexes

The Pandas Index can be thought of as an immutable ordered multiset (multiset as indices need not be unique). The immutability makes it safe to share an index between multiple columns of a DataFrame. The set-like properties are useful for things like joins (a join is like an intersection between Indexes). Let's look at some example operations to get more familiar with how they work:
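A quick sketch of the two properties just described, using a throwaway Index: duplicates are permitted, and item assignment is refused.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'b', 'c'])

# Duplicates are allowed, which is why "multiset" is the better analogy.
print(idx.is_unique)

# Item assignment is rejected because an Index is immutable.
try:
    idx[0] = 'z'
except TypeError as err:
    print(f'TypeError: {err}')
```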

In [None]:
# Let's create two Indexes for experimentation

i1 = pd.Index([1, 3, 5, 7, 9])
i2 = pd.Index([2, 3, 5, 7, 11])

In [None]:
i1[2]  # We can index like an array with []

In [None]:
i1[2:5]  # And slice

In [None]:
i1.intersection(i2)  # Intersection (in recent pandas versions `i1 & i2` no longer performs a set operation)

In [None]:
i1.union(i2)  # Union

In [None]:
i1.symmetric_difference(i2)  # Symmetric difference: values that appear in exactly one of the two

Series and DataFrames have an explicit Index but they also have an implicit index like a list. If the Index uses integer values things can get confusing. In such cases it is good to be explicit; there are attributes for this:

- `.loc` references the explicit Index
- `.iloc` references the implicit Index

The Python way is "explicit is better than implicit" so when indexing/slicing it is better to use these. The example below illustrates the difference:

In [None]:
# Note: explicit index starts at 1; implicit index starts at 0
s = pd.Series(['first', 'second', 'third', 'fourth'], index=[1, 2, 3, 4]) 

print(f'Item at explicit index 1 is {s.loc[1]}')
print(f'Item at implicit index 1 is {s.iloc[1]}')
print(s.loc[1:3])
print(s.iloc[1:3])

When using `.iloc`, the expression in `[]` can be:

* an integer, a list of integers, or a slice object (e.g. `1:7`)
* a Boolean array (see Filtering section below for why this is very useful)
* a function with one argument (the calling object) that returns one of the above

Selecting outside of the bounds of the object will raise an IndexError except when using slicing.

When using `.loc`, the expression in `[]` can be:

* a label, a list of labels, or a slice object with labels (e.g. `'a':'f'`; unlike normal slices the stop label is included in the slice)
* a Boolean array
* a function with one argument (the calling object) that returns one of the above

You can use one or two dimensions in `[]` after `.loc` or `.iloc` depending on whether you want to select a subset of rows, columns, or both.
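The selector forms listed above can be tried on a small frame. This is a sketch with invented data, showing one-dimensional and two-dimensional selection along with Boolean and callable selectors:

```python
import pandas as pd

df = pd.DataFrame(
    {'animal': ['cat', 'dog', 'snake'], 'age': [2.5, 5.0, 0.5]},
    index=['a', 'b', 'c'],
)

# Rows only: label slice (stop label included) vs. position slice (stop excluded).
print(df.loc['a':'b'])
print(df.iloc[0:2])

# Rows and columns together.
cell = df.loc['b', 'age']       # single cell by labels
first_col = df.iloc[:, 0]       # first column, all rows

# A Boolean array and a callable both work as the row selector.
older = df.loc[df['age'] > 1]
older_names = df.loc[lambda d: d['age'] > 1, ['animal']]
print(older_names)
```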

You can use the `set_index` method to change the index of a DataFrame.

If you want to change entries in a DataFrame selectively to some other value, you can use assignment with indexing, such as:

    df.loc[row_indexer, column_indexer] = value
    
See the details at https://pandas.pydata.org/pandas-docs/stable/indexing.html
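As a minimal sketch of `set_index` followed by selective assignment (the `id` and `score` columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'id': ['x', 'y', 'z'], 'score': [10, 20, 30]})

# set_index returns a new DataFrame that uses the 'id' column as its Index.
df = df.set_index('id')

# Assign to one cell, then to every row matching a condition.
df.loc['y', 'score'] = 25
df.loc[df['score'] > 20, 'score'] = 0
print(df)
```

Assigning through `.loc` in a single step like this avoids chained indexing such as `df[mask]['score'] = 0`, which may modify a temporary copy instead of `df`.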
    
### Exercise 4

Use the same DataFrame from Exercise 3.

In [None]:
# Select just the 'animal' and 'age' columns from the DataFrame.


In [None]:
# Select the data in rows [3, 5, 7] and in columns ['animal', 'age'].


## Loading a CSV into a Dataframe

Use `pandas.read_csv` to read a CSV file into a DataFrame. There are many optional arguments that you can provide, for example to set or override column headers, skip initial rows, treat the first row as containing column headers, specify the types of columns (Pandas will try to infer these otherwise), skip columns, and so on. The `parse_dates` argument is especially useful for specifying which columns hold dates, as Pandas doesn't infer these.

Full docs are at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
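To make a few of those options concrete without depending on a download, here is a sketch that reads from an in-memory string. The CSV text and column names are hypothetical:

```python
import io
import pandas as pd

# Hypothetical CSV text: one junk line, then data rows without a header row.
raw = """exported by some tool
2006-01-01,burglary,3
2006-01-02,theft,5
"""

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,                         # skip the junk line
    header=None,                        # the data has no header row
    names=['when', 'crime', 'count'],   # supply column names
    usecols=['when', 'count'],          # keep only some columns
    parse_dates=['when'],               # parse this column as datetimes
)
print(df.dtypes)
```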

In [None]:
df = pd.read_csv('http://samplecsvs.s3.amazonaws.com/SacramentocrimeJanuary2006.csv',
                 parse_dates=['cdatetime'])
df.head()

If you need to do some preprocessing of a field during loading you can use the `converters` argument which takes a dictionary mapping the field names to lambda functions that munge the field. E.g. if you had a field `zip` and you wanted to take just the first 3 digits, you could use:

    ..., converters={'zip': lambda x: x[:3]}, ...
    
You can pass a dictionary in with the `dtype` argument that maps field names to NumPy types, to override the type inference.
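The sketch below combines a converter with an explicit column type, using the `dtype` argument of `read_csv`. The data and the `units` column are invented for illustration:

```python
import io
import pandas as pd

# Hypothetical data; the column names are made up for illustration.
raw = """zip,units
02139,4
98101,7
"""

df = pd.read_csv(
    io.StringIO(raw),
    converters={'zip': lambda x: x[:3]},  # converters receive the raw string
    dtype={'units': 'float64'},           # override the inferred integer type
)
print(df)
print(df.dtypes)
```

Because the converter sees the raw text, the leading zero in `02139` survives; with plain type inference the column would be read as an integer and the zero would be lost.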

## Loading a Spreadsheet into a DataFrame

Use `pandas.read_excel` to load spreadsheet data. Full details here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

In [None]:
df = pd.read_excel('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls')
df.head()

## Saving a Dataframe to a CSV or Excel spreadsheet

You can use DataFrame.to_csv to write a DataFrame to a csv, and DataFrame.to_excel to save as a spreadsheet.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html
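A minimal round-trip sketch for `to_csv`, writing to an in-memory buffer so that no file is created (a file path can be passed instead of the buffer). The data is invented:

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 25]})

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False omits the row index column
print(buffer.getvalue())

# Reading the text back gives an equivalent DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)
```

`to_excel` is called the same way with a file name, but it needs an Excel writer package such as openpyxl to be installed, so it is not run here.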

## Sorting

You can sort a DataFrame using the `sort_values` method:

    DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, na_position='last')
    
The `by` argument should be a column name or list of column names in priority order (if axis=0, i.e. we are sorting the rows, which is typically the case).

See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html for the details.
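For instance, a multi-column sort with mixed directions might look like the following sketch (the data is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'dog', 'cat', 'snake'],
                   'age': [3.0, np.nan, 1.0, 2.0]})

# Sort by 'animal' ascending, then by 'age' descending within each animal.
ordered = df.sort_values(by=['animal', 'age'], ascending=[True, False])
print(ordered)

# Missing values go last by default; na_position='first' moves them to the top.
nan_first = df.sort_values(by='age', na_position='first')
print(nan_first)
```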
    

## Filtering

In [None]:
import seaborn as sns

# Get some sample data
titanic = sns.load_dataset('titanic')
titanic.head()

In [None]:
# A Boolean expression on a Series will return a Series of Booleans
titanic.survived == 1

In [None]:
# If you index a DataFrame with a Boolean Series, you select the rows where the Boolean value is True.
# So:
titanic[titanic.survived == 1]

In [None]:
# You can combine conditions with & (and) and | (or).
# Pandas uses these normally bitwise operators because Python allows them to be overloaded,
# while 'and' and 'or' cannot be.
# Because & and | have higher operator precedence than relational operators, each
# comparison we combine with them needs to be enclosed in parentheses.

titanic[(titanic.survived == 1) & (titanic.sex == 'female') & (titanic.age > 50)]

### Exercise 5

Using the previous DataFrame from exercise 3, do the following:

In [None]:
# Select only the rows where the number of visits is greater than 3

In [None]:
# Select the rows where the age is missing, i.e. is NaN.

In [None]:
# Select the rows where the animal is a cat and the age is less than 3.

In [None]:
# Select the rows where the age is between 2 and 4 (inclusive).

In [None]:
# Change the index to use this list
idx = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

In [None]:
# Change the age in row 'f' to 1.5.

In [None]:
# Append a new row 'k' to df with your choice of values for each column. 
# Then delete that row to return the original DataFrame.

In [None]:
# Calculate the mean age for each different type of animal.

In [None]:
# Count the number of each type of animal.

In [None]:
# Sort the data first by the values in the 'age' column in descending order,
# then by the values in the 'visits' column in ascending order.

In [None]:
# In the 'animal' column, change the 'snake' entries to 'python'.

In [None]:
# The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 
#'yes' should be True and 'no' should be False.

## Concatenation

`pandas.concat` can be used to concatenate Series and DataFrames:

In [None]:
s1 = pd.Series(['A', 'B', 'C'])
s2 = pd.Series(['D', 'E', 'F'])
df = pd.concat([s1, s2])
df

Note that the Indexes are concatenated too, so if you are using a simple row number index you can end up with duplicate values.

In [None]:
df[2]

If you don't want this behavior use the `ignore_index` argument:

In [None]:
pd.concat([s1, s2], ignore_index=True)

Alternatively you can use `verify_integrity=True` to cause an exception to be raised if the result would have duplicate indices.

In [None]:
pd.concat([s1, s2], verify_integrity=True)

In [None]:
d1 = pd.DataFrame([['A1', 'B1'],['A2', 'B2']], columns=['A', 'B'])
d2 = pd.DataFrame([['C3', 'D3'],['C4', 'D4']], columns=['A', 'B'])
d3 = pd.DataFrame([['B1', 'C1'],['B2', 'C2']], columns=['B', 'C'])
pd.concat([d1, d2])

In [None]:
# We can join on other axis too.
pd.concat([d1, d2], axis=1)

In [None]:
pd.concat([d1, d3], axis=1)

In [None]:
# If the columns are not completely shared, additional NaN entries will be made.
pd.concat([d1, d3])

In [None]:
# We can force concat to only include the columns that are shared with an inner join.
pd.concat([d1, d3], join='inner')

See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html for more options.

## Merging and Joining

Pandas has a `merge` function that supports one-to-one, many-to-one and many-to-many joins. merge will look for matching column names between the inputs and use this as the key:

In [None]:
d1 = pd.DataFrame({'city': ['Seattle', 'Boston', 'New York'], 'population': [704352, 673184, 8537673]})
d2 = pd.DataFrame({'city': ['Boston', 'New York', 'Seattle'], 'area': [48.42, 468.48, 142.5]})
pd.merge(d1, d2)

In [None]:
# You can explicitly specify the column to join on; this is equivalent to the above example:
pd.merge(d1, d2, on='city')

In [None]:
# If the column names don't match you can specify the names to use:
d3 = pd.DataFrame({'place': ['Boston', 'New York', 'Seattle'], 'area': [48.42, 468.48, 142.5]})
pd.merge(d1, d3, left_on='city', right_on='place')

In [None]:
# If you want to drop the redundant column:
pd.merge(d1, d3, left_on='city', right_on='place').drop('place', axis=1)

`merge` joins on arbitrary columns; if you want to join on the index you can use `left_index` and `right_index`:

In [None]:
df1 = pd.DataFrame(list('ABC'), columns=['c1'])
df2 = pd.DataFrame(list('DEF'), columns=['c2'])
pd.merge(df1, df2, left_index=True, right_index=True)

Pandas provides a utility method on DataFrame, `join`, to do the above:

In [None]:
df1.join(df2)

`merge` can take a `how` argument that can be `inner`, `outer`, `left` or `right` to control the type of join. `inner` joins are the default.
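The effect of `how` is clearest when the two inputs only partly overlap. The sketch below uses trimmed-down versions of the city tables from earlier in this section:

```python
import pandas as pd

left = pd.DataFrame({'city': ['Seattle', 'Boston'], 'population': [704352, 673184]})
right = pd.DataFrame({'city': ['Boston', 'New York'], 'area': [48.42, 468.48]})

# inner (the default): only keys present in both inputs.
inner = pd.merge(left, right, on='city')

# outer: keys from either input, with NaN where a side has no match.
outer = pd.merge(left, right, on='city', how='outer')

# left: every key from the left input, matched where possible.
left_join = pd.merge(left, right, on='city', how='left')

print(inner)
print(outer)
print(left_join)
```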

For more info on merging see https://pandas.pydata.org/pandas-docs/stable/merging.html

## Aggregating and Pivot Tables

In [None]:
!pip install seaborn

In [None]:
import seaborn as sns

titanic = sns.load_dataset('titanic')
titanic.head()

In [None]:
# Use unique() to see the full set of distinct values in a series
titanic.deck.unique()

In [None]:
# describe() will give summary statistics on a DataFrame. We first drop rows with NAs.
titanic.dropna().describe()

In [None]:
titanic.groupby('sex')['survived'].mean()

In [None]:
titanic.groupby(['sex', 'class'])['survived'].mean()

The result is an example of a multi-indexed Series (indexed by both 'sex' and 'class'). We're mostly going to ignore multi-indexes in this notebook, but it is worth noting that Pandas has an `unstack` method that can turn a multiply-indexed Series into a conventionally-indexed DataFrame:

In [None]:
titanic.groupby(['sex', 'class'])['survived'].mean().unstack()

In [None]:
# All of the above can be achieved with a convenience pivot table method
titanic.pivot_table('survived', index='sex', columns='class')

In [None]:
# Let's break things down further by age group
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', index=['sex', age], columns='class')

In [None]:
# Index and columns are also the second and third positional arguments, so we could just use:
titanic.pivot_table('survived', ['sex', age], 'class')

## Applying Functions

We saw earlier that we can add new columns to a DataFrame easily. The new column can be a function of an existing column. For example, we could add an 'is_adult' field to the Titanic data:

In [None]:
titanic['is_adult'] = titanic.age >= 18
titanic.head()

That's a simple case; we can do more complex row-by-row applications of arbitrary functions; here's the same change done differently (this would be much less efficient but may be the only option if the function is complex):

In [None]:
titanic['is_adult'] = titanic.apply(lambda row: row['age'] >= 18, axis=1)
titanic.head()

## String Operations

Pandas has vectorized string operations that will skip over missing values. You can read about them here; we will show a few examples: https://pandas.pydata.org/pandas-docs/stable/text.html

In [None]:
# Let's get the more detailed Titanic data set
df = pd.read_excel('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls')
df.head()

In [None]:
# Upper-case the home.dest field
df['home.dest'].str.upper().head()

In [None]:
# Let's split the field up into two
place_df = df['home.dest'].str.split('/', expand=True)  # Expands the split list into DF columns
place_df.columns = ['home', 'dest', '']  # For some reason there is a third column
df['home'] = place_df['home']
df['dest'] = place_df['dest']
df = df.drop(['home.dest'], axis=1)
df

## Handling Missing Data

To see if there are missing values, we can use `isnull()` to get a Boolean DataFrame of the same shape, with True wherever a value is missing:

In [None]:
df.isnull().head()

The above shows the Boolean mask for the first few rows. If we want to know which columns have any nulls, we can use:

In [None]:
df.isnull().any()

To drop rows that have missing values, use dropna(); add `inplace=True` to do it in place.

In [None]:
df.dropna().head()

In this case there are none - no-one could both be on a boat and be a recovered body, so at least one of these fields is always NaN.

## Pandas Plots

Pandas includes the ability to do simple plots. For a Series, this typically means plotting the values in the series as the Y values and the index as the X values; for a DataFrame this would be a multiplot. You can use `x` and `y` named arguments to select specific columns to plot, and you can use a `kind` argument to specify the type of plot.

See https://pandas.pydata.org/pandas-docs/stable/visualization.html for details.

In [None]:
s = pd.Series([2, 3, 1, 5, 3], index=['a', 'b', 'c', 'd', 'e'])
s.plot()

In [None]:
s.plot(kind='bar')

In [None]:
df = pd.DataFrame(
    [
        [2, 1],
        [4, 4],
        [1, 2],
        [3, 6]
    ],
    index=['a', 'b', 'c', 'd'],
    columns=['s1', 's2']
)
df.plot()

In [None]:
df.plot(x='s1', y='s2', kind='scatter')

## Charting with Seaborn

See the Python Graph Gallery at https://python-graph-gallery.com/ for many examples of different types of charts including the code used to create them.

There are many plotting libraries for Python; the most well known are matplotlib, seaborn (which extends matplotlib), Bokeh, and Plotly. Some offer more interactivity than others. Seaborn is a popular library so we will examine it with some examples. We first need to use the following magic to get the plots to show up in Jupyter:

In [None]:
%matplotlib inline

In [None]:
# Let's get the more detailed Titanic data set
df = pd.read_excel('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls')
df.head()

In [None]:
# We can use a catplot (called factorplot in older seaborn releases) to count categorical data
import seaborn as sns
sns.catplot(x='sex', data=df, kind='count')

In [None]:
# Let's bring class in too:
sns.catplot(x='pclass', data=df, hue='sex', kind='count')

In [None]:
# Of course we can aggregate the other way too
sns.catplot(x='sex', data=df, hue='pclass', kind='count')

In [None]:
# Let's see how many people were on each deck
deck = pd.DataFrame(df['cabin'].dropna().str[0])
deck.columns = ['deck']
sns.catplot(x='deck', data=deck, kind='count')

In [None]:
# What class passenger was on each deck?
df2 = df[['cabin', 'pclass']]
df2 = df2.dropna()
df2['deck'] = df2.apply(lambda row: ord(row.cabin[0]) -64, axis=1)

sns.regplot(x=df2["pclass"], y=df2["deck"])

## Adding Interactivity with ipywidgets

`ipywidgets` is an extension package for Jupyter that allows output cells to include interactive HTML elements. To install, you will need to run a command to enable the extension from a terminal and then restart Jupyter. First, install the package:

In [None]:
!echo y | conda install -c conda-forge ipywidgets

Now you need to run this command from a terminal and restart Jupyter, then return here.

    jupyter nbextension enable --py --sys-prefix widgetsnbextension

We will look at a simple example using the `interact` function from `ipywidgets`. You call this giving it a function as the first argument, followed by zero or more additional arguments that can be tuples, lists or dictionaries. These arguments will each become interactive controls like sliders and drop-downs, and any change in their values will cause the function to be called again with the new values as arguments.

In [None]:
%matplotlib inline

In [None]:
from ipywidgets import interact

df = pd.DataFrame(
    [
        [2, 1],
        [4, 4],
        [1, 2],
        [3, 6]
    ],
    index=['a', 'b', 'c', 'd'],
    columns=['s1', 's2']
)


def plot_graph(kind, col):
    if col == 'all':
        df.plot(kind=kind)
    else:
        df[col].plot(kind=kind)
    

interact(plot_graph, kind=['line', 'bar'], col=['all', 's1', 's2'])

See http://ipywidgets.readthedocs.io/en/stable/examples/Using%20Interact.html for more info on creating other types of controls when using `interact`.

## Summarizing Data with pandas_profiling and facets

`pandas_profiling` is a Python package that can produce much more detailed summaries of data than the `.describe()` method:

In [None]:
!pip install pandas-profiling

In [None]:
import pandas_profiling
import seaborn as sns

titanic = sns.load_dataset('titanic')

pandas_profiling.ProfileReport(titanic)  # You may need to run cell twice

Facets is a new library from Google that looks very good. It has similar functionality to pandas_profiling as well as some powerful visualization. Installation is more complex so we won't use it now but it is worth considering.

https://github.com/pair-code/facets


## Example: Loading JSON into a DataFrame and Expanding Complex Fields

In this example we'll see how we can load some structured data and process it into a flat table form better suited to machine learning.

In [None]:
# Let's get some data; top stories from lobste.rs; populate a DataFrame with the JSON
stories = pd.read_json('https://lobste.rs/hottest.json')
stories.head()

In [None]:
# Use the 'short_id' field as the index
stories = stories.set_index('short_id')

# Show the first few rows
stories.head()

In [None]:
# Take a look at the submitter_user field; it is a dictionary itself.
stories.submitter_user.iloc[0]  # Use .iloc for positional access; the index now holds string labels

In [None]:
# We want to expand these fields into our dataframe. First expand into its own dataframe.
user_df = stories.submitter_user.apply(pd.Series)
user_df.head()

In [None]:
# We should make sure there are no collisions in column names.
set(user_df.columns).intersection(stories.columns)

In [None]:
# We can rename the column to avoid the clash
user_df = user_df.rename(columns={'created_at': 'user_created_at'})

In [None]:
# Now combine them, dropping the original compound column that we are expanding.
stories = pd.concat([stories.drop(['submitter_user'], axis=1), user_df], axis=1)
stories.head()

In [None]:
# The tags field is another compound field.
stories.tags.head()

In [None]:
# Make a new dataframe with the tag lists expanded into columns of Series.
tag_df = stories.tags.apply(pd.Series)
tag_df.head()

In [None]:
# Pivot the DataFrame
tag_df = tag_df.stack()
tag_df

In [None]:
# Expand into a 1-hot encoding
tag_df = pd.get_dummies(tag_df)
tag_df

In [None]:
# Merge multiple rows
tag_df = tag_df.groupby(level=0).sum()  # sum(level=0) was removed in recent pandas versions
tag_df

In [None]:
# And add back to the original dataframe
stories = pd.concat([stories.drop('tags', axis=1), tag_df], axis=1)
stories.head()

## Exercise: Baby Names

The data comes from the US census and gives the counts of names of children born in the years 1880 to 2014.

In [None]:
df = pd.read_csv('NationalNames.csv.zip', compression='zip')  # Pandas can unzip the data for you
df.head()

In [None]:
# Exercise: show the baby names from 1918

In [None]:
# Exercise: get the counts per year for the name 'John'

In [None]:
# Exercise: do the same but restrict to boys now!

In [None]:
# Exercise: plot popularity of John as a boy's name per year
# (hint: look at help for Seaborn barplots)

In [None]:
# Exercise - use ipywidgets.interact to create a drop-down
# with several names and chart the selected name's popularity.