# Operations on Data
* Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer

When working with NumPy, an essential part of our toolkit was the ability to perform element-wise operations such as adding and multiplying, or operations like exponentiating or taking the log. We also want this functionality for DataFrames, and pandas provides it by borrowing functionality from NumPy.

In this notebook we will look at how pandas allows us to perform operations like those mentioned above using universal functions (ufuncs).



In [None]:
import pandas as pd
import numpy as np

## UFuncs: Preserving Indices

pandas is designed to work with NumPy. This means any NumPy ufunc applied to a pandas object returns another pandas object of the same type, with its index (and, for a DataFrame, its column labels) preserved.

To see this, let's set up some example pandas objects:

In [None]:
rng = np.random.RandomState(1234567890)
series = pd.Series(rng.randint(0, 10, 10))
series

In [None]:
type(series)

In [None]:
df = pd.DataFrame(rng.randint(0, 10, (10, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

In [None]:
type(df)

Now let's see what happens when we apply a NumPy ufunc to these objects:

In [None]:
series2 = np.log(series)

In [None]:
type(series2)

In [None]:
df2 = np.exp(df)

In [None]:
type(df2)

In [None]:
type(df['B'])

In [None]:
type(np.log10(df['B']))
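The type checks above confirm that we get a pandas object back, but they do not show the index being carried through. A minimal sketch of that, using a small made-up Series (`prices` and its string labels are illustrative, not part of the notebook's data):

```python
import numpy as np
import pandas as pd

# A Series with a non-default (string) index, invented for illustration.
prices = pd.Series([1.0, 10.0, 100.0], index=['a', 'b', 'c'])

# Applying a NumPy ufunc returns a Series that carries the same index.
log_prices = np.log10(prices)

print(log_prices)
# Compare the index of the result with the index of the input.
print(log_prices.index.equals(prices.index))
```

The labels `'a'`, `'b'`, `'c'` survive the transformation, so the result can still be looked up or aligned by label.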

We can also apply a NumPy ufunc to an individual column of a DataFrame:

In [None]:
df3 = df2.copy()
df3['B'] = np.log(df3['B'])
df3

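In addition to NumPy functions, pandas provides its own operators using operator overloading. Each Python operator has one or more equivalent pandas methods, summarized here:

| Python operator | pandas method(s) |
|---|---|
| `+` | `add()` |
| `-` | `sub()`, `subtract()` |
| `*` | `mul()`, `multiply()` |
| `/` | `truediv()`, `div()`, `divide()` |
| `//` | `floordiv()` |
| `%` | `mod()` |
| `**` | `pow()` |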

We can use these operators to combine a pandas object with a scalar:


In [None]:
df3['A'] * 10

We can also create new columns using ufuncs (either via NumPy or the pandas operators):

In [None]:
df3['E'] = df3['D'] / df3['C']
df3
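To make the "either via NumPy or the pandas operators" point concrete, here is a small sketch showing that the operator, the pandas method, and the NumPy ufunc give the same result. The DataFrame `frame` and its values are made up for the example:

```python
import numpy as np
import pandas as pd

# Small illustrative frame; values chosen so no division by zero occurs.
frame = pd.DataFrame({'C': [1.0, 2.0, 4.0], 'D': [3.0, 5.0, 6.0]})

ratio_operator = frame['D'] / frame['C']          # overloaded Python operator
ratio_method = frame['D'].div(frame['C'])         # equivalent pandas method
ratio_ufunc = np.divide(frame['D'], frame['C'])   # NumPy ufunc

# All three approaches should agree element by element.
print(ratio_operator.equals(ratio_method))
print(ratio_operator.equals(ratio_ufunc))
```

The method form is mainly useful when you need its extra arguments, such as `fill_value` or `axis`, which the bare operator cannot take.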

### Challenge

Load in our labor market statistics data and perform the following operations:

1. Verify that the `unemployment_rate` variable is correctly computed. (Hint: round your computation to 1 decimal place using `.round(decimals=1)`, and test equality using `df[''].equals(df[''])`.)
2. Does the decomposition `labour_force = qty_employed + qty_unemployed` hold true?
3. Assume that each worker was employed for 38 hours per week. Create a new column that estimates the number of hours worked in each state-year-month.
4. Assume that 60 percent of workers work 45 hours per week, whilst 40 percent work 20 hours per week. Estimate the number of labour hours in each state-year-month.
5. Calculate the difference in your estimates from 3 and 4.

#### Solutions

In [None]:
data = pd.read_csv('out_data/state_labour_statistics.csv')
data.head()

In [None]:
data['ue_rate2'] = (data['qty_unemployed'] / data['labour_force'] * 100).round(1)
data.head()

In [None]:
data['ue_rate2'].equals(data['unemployment_rate'])

In [None]:
data['labour_force'].equals(data['qty_employed'] + data['qty_unemployed'])
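Parts 3 to 5 of the challenge could be approached along the following lines. This is only a sketch on a tiny made-up DataFrame rather than the real file: the numbers are invented, the column names `hours_uniform`, `hours_mixed` and `hours_diff` are hypothetical, and the estimates are expressed as weekly hours.

```python
import pandas as pd

# Toy stand-in for the labour statistics data; the numbers are made up.
toy = pd.DataFrame({'qty_employed': [1000, 2500, 400]})

# Part 3: every worker is employed for 38 hours per week.
toy['hours_uniform'] = toy['qty_employed'] * 38

# Part 4: 60% of workers at 45 hours and 40% at 20 hours,
# i.e. a weighted average of hours per worker.
toy['hours_mixed'] = toy['qty_employed'] * (0.6 * 45 + 0.4 * 20)

# Part 5: difference between the two estimates.
toy['hours_diff'] = toy['hours_uniform'] - toy['hours_mixed']

print(toy)
```

On the real data the same three assignments would be applied to `data` instead of `toy`.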

## Index Alignment

When pandas performs a binary operation between two Series, it aligns their indices before carrying out the operation. This is useful if we have incomplete data in one or both objects when we are trying to combine them:

In [None]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

In [None]:
A + B

By default, pandas returns NaN for any label that is missing from one of the inputs, but this may not always be the behaviour we want. The `fill_value` option of methods such as `add` lets us specify a value to use in place of the missing entry:


In [None]:
A.add(B, fill_value=0)

Note, however, that treating missing data as zero may not be desirable.
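To compare the options side by side, the sketch below repeats the two Series from above and adds a third choice, dropping the unmatched labels with `dropna()`. The variable names `plain`, `filled` and `matched` are introduced here only for illustration:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# The result index is the union of both indices; labels present in
# only one Series (0 and 3) produce NaN.
plain = A + B

# fill_value=0 substitutes 0 for the missing side before adding.
filled = A.add(B, fill_value=0)

# Alternatively, keep only the labels present in both Series.
matched = plain.dropna()

print(plain)
print(filled)
print(matched)
```

Which of the three is appropriate depends on what a missing observation means in your data.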

A similar idea holds for DataFrames:


In [None]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
A+B
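Because the example above uses random numbers, it can be hard to see exactly which cells end up missing. Here is a sketch with fixed, made-up values (`left` and `right` are hypothetical names) showing that DataFrames are aligned on both the row index and the column labels, and that `fill_value` works here too:

```python
import pandas as pd

# Two frames with different shapes and differently ordered columns.
left = pd.DataFrame([[1, 11], [5, 1]], columns=list('AB'))
right = pd.DataFrame([[4, 0, 9], [5, 8, 0], [9, 2, 6]], columns=list('BAC'))

# Alignment happens on rows AND columns: row 2 and column 'C' do not
# exist in `left`, so those cells are NaN in the result.
total = left + right

# With fill_value=0, cells missing from one frame are treated as 0.
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```

Note that the column order of the inputs does not matter; values are matched by label, not by position.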

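Finally, you can use these operations to combine a DataFrame and a Series. By default the Series index is aligned with the DataFrame's columns, so the operation is applied row by row; passing `axis=0` aligns on the row index instead: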

In [None]:
B

In [None]:
B.subtract(B['B'], axis=0)
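The role of the `axis` argument is easier to see with fixed numbers. The following sketch uses a made-up DataFrame (`frame`) to contrast the default row-wise behaviour with the column-wise `axis=0` form used above:

```python
import pandas as pd

frame = pd.DataFrame([[3, 8, 2], [4, 2, 6], [7, 7, 1]], columns=list('BAC'))

# Default: the Series labels are matched against the column names,
# so the first row is subtracted from every row.
by_row = frame - frame.iloc[0]

# axis=0: the Series labels are matched against the row index,
# so column 'B' is subtracted from every column.
by_col = frame.subtract(frame['B'], axis=0)

print(by_row)
print(by_col)
```

In `by_row` the first row becomes all zeros, whereas in `by_col` it is column `B` that becomes all zeros.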