In [4]:
import pandas as pd

### Discretization and Binning

Continuous data is often discretized or otherwise separated into “bins” for analysis.
Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:

In [5]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To
do so, pass the ages and the bin edges to cut, a pandas function:

In [6]:
bins = [18, 25, 35, 60, 100]

In [7]:
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special Categorical object.
- The output you see describes the bins computed by pandas.cut.
- You can treat it like an array of strings indicating the bin names. Internally it holds a categories array with the distinct bin intervals, and a codes attribute that labels each value in ages with the position of its bin:

In [8]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [9]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')
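
To make the link between codes and categories concrete, here is a minimal sketch that rebuilds the interval for each age by using the codes as positions into the categories. The name `rebuilt` is only illustrative.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)

# Each code is the position of that age's bin within cats.categories,
# so taking the categories at those positions recovers the intervals.
rebuilt = cats.categories.take(cats.codes)

# Show the first few ages next to their code and interval.
for age, code, interval in zip(ages[:4], cats.codes[:4], rebuilt[:4]):
    print(age, code, interval)
```

Every age should fall inside the interval that its code points to.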

Note that pd.value_counts(cats) returns the bin counts for the result of pandas.cut:

In [10]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
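
Recent pandas releases are understood to deprecate the top-level pd.value_counts function. Assuming that applies to your version, the sketch below gets the same counts by wrapping the Categorical in a Series. Passing sort=False is a choice made here so the bins appear in their natural order instead of by frequency.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
cats = pd.cut(ages, [18, 25, 35, 60, 100])

# Wrap the Categorical in a Series and count the members of each bin.
# sort=False keeps the bins in category order rather than by count.
counts = pd.Series(cats).value_counts(sort=False)
print(counts)
```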

Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while the square bracket means it is closed (inclusive). 
- You can change which side is closed by passing right=False:

In [11]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
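
The effect of right=False is easiest to see with a value that sits exactly on a bin edge. The following sketch uses a shortened, illustrative list of edges and bins the single age 25 both ways.

```python
import pandas as pd

edges = [18, 25, 35]  # illustrative edges, shorter than the bins above

# Default (right=True): the right edge is inclusive, so 25 lands in (18, 25].
right_closed = pd.cut([25], edges)

# right=False: the left edge is inclusive, so 25 lands in [25, 35).
left_closed = pd.cut([25], edges, right=False)

print(right_closed[0])
print(left_closed[0])
```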

### Computing Indicator/Dummy Variables

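Converting a categorical variable into a “dummy” or “indicator” matrix is a common transformation for statistical modeling and machine learning.
- If a column in a DataFrame has k distinct values, you derive a matrix or DataFrame with k columns containing only 1s and 0s, one column per distinct value.
- pandas has a get_dummies function for doing this, though devising one yourself is not difficult.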

In [12]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

In [13]:
df

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5


In [14]:
pd.get_dummies(df['key'])

   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
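
As noted earlier, devising an indicator matrix yourself is not difficult. One possible hand-rolled version is sketched below; the names `categories`, `manual`, and `builtin` are illustrative. The dtype=int argument is passed because newer pandas versions may otherwise return boolean columns rather than 0s and 1s.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# Build one 0/1 column per distinct value, in sorted order.
categories = sorted(df['key'].unique())
manual = pd.DataFrame({cat: (df['key'] == cat).astype(int)
                       for cat in categories})

# Built-in equivalent, requesting integer output for comparison.
builtin = pd.get_dummies(df['key'], dtype=int)

print(manual)
```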


In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data.

In [15]:
dummies = pd.get_dummies(df['key'], prefix='key')

In [16]:
df_with_dummy = df[['data1']].join(dummies)

In [17]:
df_with_dummy

   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0


### Reshaping with Hierarchical Indexing

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
- stack: “rotates” or pivots from the columns in the data to the rows
- unstack: pivots from the rows into the columns

In [19]:
import numpy as np
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'],
                                     name='number'))

In [20]:
data

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5


Using the stack method on this data pivots the columns into the rows, producing a Series:

In [21]:
result = data.stack()

In [22]:
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with unstack:

In [24]:
result.unstack()

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5
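
As a quick check that stack and unstack undo each other for this data, the sketch below rebuilds the same frame and compares the round trip with the original. The name `roundtrip` is illustrative, and the check relies on the frame having no missing values.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'],
                                     name='number'))

# Columns -> inner row level, then inner row level -> columns again.
roundtrip = data.stack().unstack()

# Compare the round trip with the original frame.
print(roundtrip.equals(data))
```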


- By default the innermost level is unstacked (same with stack). 
- You can unstack a different level by passing a level number or name:

In [25]:
result.unstack(0)

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5


In [26]:
result.unstack('state')

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5


- When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:


In [28]:
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))

In [29]:
df

side             left  right
state    number
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10


In [30]:
df.unstack('state')

side   left          right
state  Ohio Colorado  Ohio Colorado
number
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10
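
To see where the unstacked level ends up, you can inspect the names of the column levels. The sketch below rebuilds the same frame, then stacks 'state' back to show that it returns as the innermost row level. The names `wide` and `restacked` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'],
                                     name='number'))
result = data.stack()
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))

# 'state' moves from the rows to below the existing 'side' column level.
wide = df.unstack('state')
print(wide.columns.names)

# Stacking 'state' again makes it the innermost row level.
restacked = wide.stack('state')
print(restacked.index.names)
```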


# Pivoting

In [80]:
data = pd.read_csv('macrodata.csv')

In [81]:
data.head()

     year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi     m1  tbilrate  unemp      pop  infl  realint
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98  139.7      2.82    5.8  177.146  0.00     0.00
1  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15  141.7      3.08    5.1  177.830  2.34     0.74
2  1959.0      3.0  2775.488    1751.8  289.226   491.260   1916.4  29.35  140.5      3.82    5.3  178.657  2.74     1.09
3  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37  140.0      4.33    5.6  179.386  0.27     4.06
4  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54  139.6      3.50    5.2  180.007  2.31     1.19


In [82]:
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='date')


In [83]:
periods

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', name='date', length=203, freq='Q-DEC')

In [84]:
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')

In [85]:
data = data.reindex(columns=columns)

In [86]:
data.head()

item   realgdp  infl  unemp
0     2710.349  0.00    5.8
1     2778.801  2.34    5.1
2     2775.488  2.74    5.3
3     2785.204  0.27    5.6
4     2847.699  2.31    5.2


In [87]:
data.index = periods.to_timestamp('D', 'end')

In [88]:
data.head()

item         realgdp  infl  unemp
date
1959-03-31  2710.349  0.00    5.8
1959-06-30  2778.801  2.34    5.1
1959-09-30  2775.488  2.74    5.3
1959-12-31  2785.204  0.27    5.6
1960-03-31  2847.699  2.31    5.2
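
The cells above combine the year and quarter columns into quarterly periods and then convert them to timestamps at the end of each quarter. Here is a small sketch of the conversion step alone, using four hand-written periods as a stand-in for the macrodata.csv columns. Depending on the pandas version, the resulting timestamps may also carry an end-of-day time component, so the sketch prints only the dates.

```python
import pandas as pd

# Hypothetical stand-in for the periods built from year and quarter.
periods = pd.PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4'],
                         freq='Q-DEC', name='date')

# 'D' requests daily resolution; 'end' picks the last day of each quarter.
stamps = periods.to_timestamp('D', 'end')

# Print just the calendar dates of the resulting timestamps.
print(stamps.date)
```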


In [89]:
ldata = data.stack()
ldata.head()

date        item   
1959-03-31  realgdp    2710.349
            infl          0.000
            unemp         5.800
1959-06-30  realgdp    2778.801
            infl          2.340
dtype: float64

In [90]:
ldata = ldata.reset_index()
ldata.head()

        date     item         0
0 1959-03-31  realgdp  2710.349
1 1959-03-31     infl     0.000
2 1959-03-31    unemp     5.800
3 1959-06-30  realgdp  2778.801
4 1959-06-30     infl     2.340


In [91]:
ldata.rename(columns={0: 'value'}, inplace=True)
ldata.head()

        date     item     value
0 1959-03-31  realgdp  2710.349
1 1959-03-31     infl     0.000
2 1959-03-31    unemp     5.800
3 1959-06-30  realgdp  2778.801
4 1959-06-30     infl     2.340


- This is the so-called long format for multiple time series, or other observational data with two or more keys (here, our keys are date and item). 
- Each row in the table represents a single observation.

In [92]:
ldata.head()

        date     item     value
0 1959-03-31  realgdp  2710.349
1 1959-03-31     infl     0.000
2 1959-03-31    unemp     5.800
3 1959-06-30  realgdp  2778.801
4 1959-06-30     infl     2.340


In [95]:
pivoted = ldata.pivot(index='date', columns='item', values='value')

In [96]:
pivoted.head()

item        infl   realgdp  unemp
date
1959-03-31  0.00  2710.349    5.8
1959-06-30  2.34  2778.801    5.1
1959-09-30  2.74  2775.488    5.3
1959-12-31  0.27  2785.204    5.6
1960-03-31  2.31  2847.699    5.2
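
One way to think about pivot is as a shortcut for building a hierarchical index with set_index and then calling unstack. The sketch below checks this on a tiny hand-built long-format frame, since macrodata.csv is not assumed to be available. Its values are copied from the first two dates above, and the name `unstacked` is illustrative.

```python
import pandas as pd

# Small long-format frame: one row per (date, item) observation.
ldata = pd.DataFrame({
    'date': pd.to_datetime(['1959-03-31'] * 3 + ['1959-06-30'] * 3),
    'item': ['realgdp', 'infl', 'unemp'] * 2,
    'value': [2710.349, 0.0, 5.8, 2778.801, 2.34, 5.1],
})

pivoted = ldata.pivot(index='date', columns='item', values='value')

# Same reshaping spelled out: index on both keys, then unstack 'item'.
unstacked = ldata.set_index(['date', 'item'])['value'].unstack('item')

print(pivoted.equals(unstacked))
```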


- Suppose you had two value columns that you wanted to reshape simultaneously:

In [97]:
ldata['value2'] = np.random.randn(len(ldata))
ldata.head()

        date     item     value    value2
0 1959-03-31  realgdp  2710.349 -1.270952
1 1959-03-31     infl     0.000 -1.629557
2 1959-03-31    unemp     5.800  0.023642
3 1959-06-30  realgdp  2778.801 -0.696158
4 1959-06-30     infl     2.340 -0.842901


In [98]:
pivoted = ldata.pivot(index='date', columns='item')

In [99]:
pivoted.head()

           value                     value2
item        infl   realgdp  unemp      infl   realgdp     unemp
date
1959-03-31  0.00  2710.349    5.8 -1.629557 -1.270952  0.023642
1959-06-30  2.34  2778.801    5.1 -0.842901 -0.696158  0.035587
1959-09-30  2.74  2775.488    5.3 -1.469789 -1.242874 -0.383202
1959-12-31  0.27  2785.204    5.6  0.486577  0.583547 -1.534876
1960-03-31  2.31  2847.699    5.2 -0.185743 -0.216681 -1.142762
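
When the values argument is omitted, pivot reshapes every remaining column and the result has hierarchical columns. The sketch below shows how to pull one ordinary wide table back out. It uses a small hand-built frame whose value2 numbers are made up (fixed rather than random so the result is repeatable).

```python
import pandas as pd

ldata = pd.DataFrame({
    'date': pd.to_datetime(['1959-03-31'] * 3 + ['1959-06-30'] * 3),
    'item': ['realgdp', 'infl', 'unemp'] * 2,
    'value': [2710.349, 0.0, 5.8, 2778.801, 2.34, 5.1],
    'value2': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # made-up second measurement
})

# With values omitted, both 'value' and 'value2' are pivoted.
pivoted = ldata.pivot(index='date', columns='item')
print(pivoted.columns.nlevels)

# Selecting the outer level gives back a plain date-by-item table.
print(pivoted['value'])
```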


# Melting

An inverse operation to pivot for DataFrames is pandas.melt. 
- Rather than transforming one column into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input. 

In [100]:
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})


In [101]:
df

   key  A  B  C
0  foo  1  4  7
1  bar  2  5  8
2  baz  3  6  9


- The 'key' column may be a group indicator, and the other columns are data values. 
- When using pandas.melt, we must indicate which columns (if any) are group indicators. 
- Let’s use 'key' as the only group indicator here:

In [102]:
melted = pd.melt(df, ['key'])

In [103]:
melted

   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
6  foo        C      7
7  bar        C      8
8  baz        C      9
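
The phrase “if any” above hints that group indicators are optional. The following sketch shows two variations on the same frame: melting only a subset of the value columns, and melting with no group indicator at all. The names `subset` and `no_ids` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

# Keep 'key' as the group indicator but melt only columns A and B.
subset = pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])

# No group indicator: only the variable/value pairs remain.
no_ids = pd.melt(df, value_vars=['A', 'B', 'C'])

print(subset)
print(no_ids)
```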


Using pivot, we can reshape back to the original layout:

In [104]:
reshaped = melted.pivot(index='key', columns='variable', values='value')
reshaped.head()

variable  A  B  C
key
bar       2  5  8
baz       3  6  9
foo       1  4  7


Since the result of pivot creates an index from the column used as the row labels, we may want to use reset_index to move the data back into a column:

In [105]:
reshaped.reset_index()

variable  key  A  B  C
0         bar  2  5  8
1         baz  3  6  9
2         foo  1  4  7
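
Two small differences from the original frame remain after reset_index: the columns axis is still named 'variable', and the rows are sorted by key. The sketch below removes the axis name and compares against the original frame sorted the same way. The names `restored` and `expected` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

melted = pd.melt(df, id_vars=['key'])
reshaped = melted.pivot(index='key', columns='variable', values='value')

# Move 'key' back into a column and drop the leftover axis name.
restored = reshaped.reset_index()
restored.columns.name = None

# pivot sorts the row labels, so compare against df sorted by 'key'.
expected = df.sort_values('key').reset_index(drop=True)
print(restored.equals(expected))
```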
