# Describing data in Python

## 1. Basic data description in pandas

### 1.1. The .describe() method
The `.describe()` method in pandas is the quickest way to generate descriptive statistics for the columns of your data. By default it summarizes every numeric column.

In [1]:
import pandas as pd
from pandas import DataFrame, Series

df1 = DataFrame([[1, 2], [3, 4], [5, 6]], index=['a', 'b', 'c'], columns=['Ohio', 'Nevada'])
df1

| | Ohio | Nevada |
|---|---|---|
| a | 1 | 2 |
| b | 3 | 4 |
| c | 5 | 6 |


In [2]:
df1.describe()

| | Ohio | Nevada |
|---|---|---|
| count | 3.0 | 3.0 |
| mean | 3.0 | 4.0 |
| std | 2.0 | 2.0 |
| min | 1.0 | 2.0 |
| 25% | 2.0 | 3.0 |
| 50% | 3.0 | 4.0 |
| 75% | 4.0 | 5.0 |
| max | 5.0 | 6.0 |


The output of the `.describe()` method is itself a DataFrame, so its contents can be accessed through standard DataFrame indexing.

In [3]:
df1_desc = df1.describe()
# Look up the 'mean' row of the 'Ohio' column in a single step
df1_desc.loc['mean', 'Ohio']

3.0

Another way to get the mean of the `Ohio` column is to call the `.mean()` method directly on that column.

In [7]:
df1['Ohio'].mean()

3.0
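
As a quick cross-check, each row of the `.describe()` output can be reproduced with the corresponding Series method. The sketch below rebuilds `df1` from above; the names `desc` and `ohio` are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                   index=['a', 'b', 'c'], columns=['Ohio', 'Nevada'])
desc = df1.describe()
ohio = df1['Ohio']

# Each summary row corresponds to a method on the column itself
print(desc.loc['mean', 'Ohio'], ohio.mean())         # arithmetic mean
print(desc.loc['std', 'Ohio'], ohio.std())           # sample standard deviation (ddof=1)
print(desc.loc['25%', 'Ohio'], ohio.quantile(0.25))  # first quartile
```

Note that `std` is the sample standard deviation, which divides by `n - 1` rather than `n`.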

### 1.2. Discretization and binning
Continuous data is often discretized or otherwise separated into "bins" for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets.

In [8]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let's divide these into bins of 18 to 25, 26 to 35, 36 to 60, and 61 and older. We can use pandas' `pd.cut()` function.

In [9]:
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, object): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special `Categorical` object. You can treat it like an array of strings indicating the bin name associated with each age in `ages`. Internally, it contains a `categories` array holding the distinct category names along with a `codes` array that assigns each element of `ages` to one of those categories.

In [10]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [11]:
cats.categories

Index(['(18, 25]', '(25, 35]', '(35, 60]', '(60, 100]'], dtype='object')
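
The relationship between the two arrays can be sketched as follows: each entry of `codes` is a position in `categories`. Note that in recent pandas versions the categories are `Interval` objects rather than the strings shown above, so the printed form may differ.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)

# codes[i] is the position in `categories` of the bin that holds ages[i]
for age, code in zip(ages[:4], cats.codes[:4]):
    print(age, code, cats.categories[code])
```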

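The `pd.value_counts()` function counts how many elements fall into each category. These are the same counts that a histogram would plot as bar heights. In recent pandas releases the top-level `pd.value_counts()` is deprecated; `pd.Series(cats).value_counts()` gives the same counts.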

In [12]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
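
The counts above are ordered by frequency, which is why `(35, 60]` appears before `(25, 35]`. A minimal sketch of listing them in bin order instead, using the Series form of `value_counts` followed by `sort_index` (the variable names are illustrative):

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)

counts = pd.Series(cats).value_counts()  # ordered by count, largest first
counts_by_bin = counts.sort_index()      # ordered by bin, lowest bin first
print(counts_by_bin)
```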

Consistent with mathematical notation for intervals, a parenthesis means that the side is open (exclusive of the end point) while a square bracket means that the side is closed (inclusive of the end point). You can switch which side is closed by passing `right=False`.

In [13]:
pd.cut(ages, bins, right=False)

[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, object): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]
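
The difference only matters for values that sit exactly on a bin edge. The sketch below bins two such values, 18 and 25, both ways; `edge_ages` is a made-up input for illustration.

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]
edge_ages = [18, 25]

# Default (right=True): 25 falls in (18, 25], while 18 falls in no bin
right_closed = pd.cut(edge_ages, bins)
# right=False: 18 falls in [18, 25) and 25 falls in [25, 35)
left_closed = pd.cut(edge_ages, bins, right=False)

# A code of -1 marks a value that fell outside every bin (shown as NaN)
print(right_closed.codes)
print(left_closed.codes)
```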

You can also supply your own bin names by passing a list or array to the `labels` option. The number of names must match the number of bins.

In [14]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
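
If only the bin number is needed, `pd.cut` also accepts `labels=False`, which returns plain integer bin indices instead of a `Categorical`. A small sketch comparing the two forms on the same data:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

named = pd.cut(ages, bins, labels=group_names)  # Categorical with custom names
bin_index = pd.cut(ages, bins, labels=False)    # integer position of each bin

print(list(named[:4]))
print(bin_index[:4])
```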

If you pass `cut` an integer number of bins instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data. Consider the case of some uniformly distributed data chopped into fourths.

In [15]:
import numpy as np

data = np.random.rand(20)
pd.cut(data, 4, precision=2)

[(0.49, 0.74], (0.0049, 0.25], (0.0049, 0.25], (0.49, 0.74], (0.74, 0.98], ..., (0.74, 0.98], (0.0049, 0.25], (0.74, 0.98], (0.0049, 0.25], (0.0049, 0.25]]
Length: 20
Categories (4, object): [(0.0049, 0.25] < (0.25, 0.49] < (0.49, 0.74] < (0.74, 0.98]]
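
To see the computed edges directly, `pd.cut` can return them alongside the bins via `retbins=True`. The sketch below uses a seeded generator (an assumption made only for reproducibility) and checks the bin widths. pandas widens the lowest edge slightly so that the minimum value is included, so the first bin is marginally wider than the rest.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
data = rng.random(20)

# retbins=True also returns the array of bin edges
cats, edges = pd.cut(data, 4, retbins=True)
widths = np.diff(edges)

print(edges)
print(widths)
```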

A closely related function is `qcut`, which bins the data based on sample quantiles. Depending on the distribution of the data, `cut` will usually not put the same number of data points in each bin. Because `qcut` uses sample quantiles instead, the resulting bins contain roughly equal numbers of points.

In [16]:
data = np.random.randn(1000)
cats = pd.qcut(data, 4)  # cut into quartiles
cats

[(0.643, 3.0621], [-2.903, -0.667], [-2.903, -0.667], (0.643, 3.0621], (-0.667, -0.00315], ..., (-0.00315, 0.643], (-0.00315, 0.643], (-0.667, -0.00315], (0.643, 3.0621], (-0.00315, 0.643]]
Length: 1000
Categories (4, object): [[-2.903, -0.667] < (-0.667, -0.00315] < (-0.00315, 0.643] < (0.643, 3.0621]]

In [None]:
pd.value_counts(cats)
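
The contrast between the two functions can be sketched by counting the points per bin on the same data. The seeded generator and the variable names are assumptions for illustration; the original cells use unseeded `np.random.randn`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
data = rng.standard_normal(1000)

# cut: four equal-width bins, so the counts follow the shape of the distribution
equal_width = pd.Series(pd.cut(data, 4)).value_counts().sort_index()
# qcut: four quantile-based bins, so the counts are equal
equal_count = pd.Series(pd.qcut(data, 4)).value_counts().sort_index()

print(equal_width)
print(equal_count)
```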

Just as `cut` accepts explicit bin edges, `qcut` accepts explicit quantiles (numbers between 0 and 1, inclusive).

In [None]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1])
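
With these quantiles, the bins should hold about 10%, 40%, 40%, and 10% of the data. A minimal check under the assumption of seeded, tie-free data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
data = rng.standard_normal(1000)

# Bin edges are placed at the 0th, 10th, 50th, 90th, and 100th percentiles
cats = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1])
counts = pd.Series(cats).value_counts().sort_index()
print(counts)
```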

## 2. Groupby in pandas

In [None]:
df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
df

In [None]:
# Group the data1 values by the labels in key1; nothing is computed until an aggregation is called
grouped = df['data1'].groupby(df['key1'])
grouped

In [None]:
grouped.mean()

In [None]:
# Group by two keys; the result is a Series with a hierarchical (key1, key2) index
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

## References

* McKinney, Wes, Python for Data Analysis, O'Reilly Media, Inc. (2013).
* [Python labs](http://www.acme.byu.edu/?page_id=2067), Applied and Computational Mathematics Emphasis (ACME), Brigham Young University.