In [2]:
import os
import re
import sys
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

Important notes for speeding up pandas:

1. Try to use vectorized operations where possible rather than approaching problems with the for x in df... mentality. If your code is home to a lot of for-loops, it might be better suited to working with native Python data structures, because Pandas otherwise comes with a lot of overhead.

2. If you have more complex operations where vectorization is simply impossible or too difficult to work out efficiently, use the .apply() method.

3. If you do have to loop over your array (which does happen), use .iterrows() or .itertuples() to improve speed and syntax.

4. Pandas has a lot of optionality, and there are almost always several ways to get from A to B. Be mindful of this, compare how different routes perform, and choose the one that works best in the context of your project.

5. Integrating NumPy into Pandas operations can often improve speed and simplify syntax.
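
As a rough illustration of tips 1 and 3, the sketch below times a row-by-row loop against the equivalent vectorized expression. The `demo` frame is hypothetical random data, not the dataset used later, and the timings will vary by machine.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical toy data: one year of hourly readings (365 * 24 rows).
demo = pd.DataFrame({"energy_kwh": np.random.default_rng(0).random(8760)})

# Row-by-row loop using .itertuples()
start = time.perf_counter()
loop_result = [row.energy_kwh * 28 for row in demo.itertuples(index=False)]
loop_seconds = time.perf_counter() - start

# The same calculation as a single vectorized expression
start = time.perf_counter()
vector_result = demo["energy_kwh"] * 28
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vector_seconds:.4f}s")
```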

## 1. Read data

pandas can read many file formats, but read speed varies significantly between them. Generally speaking, the formats rank as follows, from fastest to slowest:
 1. pkl
 2. csv
 3. hdf
 4. xlsx (VERY slow)
 
Most data is collected as 'csv' files. It is therefore useful to read the 'csv' once and save it as 'pkl' or 'hdf'. Reading the saved file later is much faster than parsing the 'csv' again.

```
# read csv
df = pd.read_csv('xxx.csv')

# save as pkl
df.to_pickle('xxx.pkl')
df = pd.read_pickle('xxx.pkl')
 
# save as hdf (requires the optional PyTables package)
df.to_hdf('xxx.hdf', key='df')
df = pd.read_hdf('xxx.hdf', key='df')
```
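
To check the speed difference on your own machine, a sketch like the following may help. It writes hypothetical random data to a temporary directory (the file names are illustrative) and times reading it back from CSV and from pickle. HDF is left out because it needs the optional PyTables package.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Hypothetical data; replace with your own DataFrame.
demo = pd.DataFrame({
    "a": np.arange(100_000),
    "b": np.random.default_rng(0).random(100_000),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")
    pkl_path = os.path.join(tmp, "demo.pkl")
    demo.to_csv(csv_path, index=False)
    demo.to_pickle(pkl_path)

    # Time reading the CSV copy
    start = time.perf_counter()
    from_csv = pd.read_csv(csv_path)
    csv_seconds = time.perf_counter() - start

    # Time reading the pickle copy
    start = time.perf_counter()
    from_pkl = pd.read_pickle(pkl_path)
    pkl_seconds = time.perf_counter() - start

print(f"csv: {csv_seconds:.4f}s, pkl: {pkl_seconds:.4f}s")
```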

## 2. Aggregate data

When aggregating data with `agg` and `transform`, we can either use the built-in methods in pandas (`sum`, `mean`, etc.) or pass user-defined functions (UDFs). Use UDFs only when you have to, because the built-in methods are MUCH faster. See the [example](https://zhuanlan.zhihu.com/p/97012199).
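
A minimal sketch of the difference, using hypothetical random data and column names: both calls compute a per-group mean, but the first uses the built-in `mean` while the second routes every group through a Python lambda.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: 10,000 values spread over up to 100 groups.
demo = pd.DataFrame({
    "group": rng.integers(0, 100, size=10_000),
    "value": rng.random(10_000),
})

# Built-in aggregation, referenced by name
builtin_mean = demo.groupby("group")["value"].agg("mean")

# Equivalent UDF, called once per group in Python
udf_mean = demo.groupby("group")["value"].agg(lambda s: s.sum() / len(s))
```

Wrapping each call in `%timeit` in a notebook should show the gap on your own data.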

## 3. Use `numba`

If your data processing involves a lot of numerical computation, `numba` can greatly speed up your code. First install the numba module with `pip install numba`. Here is an example:

```
import numba

@numba.vectorize
def f_with_numba(x): 
    return x * 2

df["double_energy"] = f_with_numba(df.energy_kwh.to_numpy())
```

__NOTE__: To accelerate code with numba, pass the data in as NumPy arrays.

__NOTE__: There are several ways to convert a pandas DataFrame or Series to NumPy: `values`, ~~`as_matrix`~~ (deprecated and since removed), `to_numpy()` and `array`. `to_numpy()` and `array` are highly [recommended](https://stackoverflow.com/questions/13187778/convert-pandas-dataframe-to-numpy-array).

- `to_numpy()`, which is defined on Index, Series, and DataFrame objects, and
- `array`, which is defined on Index and Series objects only.
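
The short sketch below shows the two accessors side by side on a hypothetical Series; the values are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.586, 0.58, 0.572], name="energy_kwh")

# to_numpy() always returns a plain NumPy ndarray
as_ndarray = s.to_numpy()

# .array returns a pandas ExtensionArray that wraps the underlying data
as_extension = s.array

print(type(as_ndarray))
print(type(as_extension))

# DataFrame defines to_numpy() but not .array
frame_values = pd.DataFrame({"energy_kwh": s}).to_numpy()
```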

## 4. Loop vs. Vector

The goal of this example is to apply time-of-use energy tariffs to find the total cost of energy consumption for one year. The price of electricity varies with the hour of the day, so the task is to multiply the electricity consumed in each hour by the price for that hour. Each row contains the electricity used in one hour, so there are 365 x 24 = 8760 rows for the whole year. Each row indicates the usage for the “hour starting” at the given time, so 1/1/13 0:00 indicates the usage for the first hour of January 1st.

In [3]:
df = pd.read_csv('../data/demand_profile.csv')
df.head()

Unnamed: 0,date_time,energy_kwh
0,1/1/13 0:00,0.586
1,1/1/13 1:00,0.58
2,1/1/13 2:00,0.572
3,1/1/13 3:00,0.596
4,1/1/13 4:00,0.592


In [4]:
df.dtypes

date_time      object
energy_kwh    float64
dtype: object

In [5]:
# Convert the date_time to datetime dtype
df['date_time'] = pd.to_datetime(df['date_time'], format = '%d/%m/%y %H:%M')
# Always provide the format so that pandas doesn't need to infer it, which saves a lot of time!

In [6]:
df.dtypes

date_time     datetime64[ns]
energy_kwh           float64
dtype: object
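
Besides speed, an explicit `format` also removes ambiguity about whether the day or the month comes first. The sketch below uses a few hypothetical strings in the same shape as the `date_time` column.

```python
import pandas as pd

# Hypothetical strings shaped like the date_time column
raw = pd.Series(["1/1/13 0:00", "1/1/13 1:00", "1/1/13 2:00"])
parsed = pd.to_datetime(raw, format="%d/%m/%y %H:%M")

# The same string means different dates under different formats
day_first = pd.to_datetime("2/1/13 0:00", format="%d/%m/%y %H:%M")
month_first = pd.to_datetime("2/1/13 0:00", format="%m/%d/%y %H:%M")

print(parsed.dt.hour.tolist())
print(day_first, month_first)
```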

Now that your dates and times are in a convenient format, you are ready to get down to the business of calculating your electricity costs. Remember that cost varies by hour, so you will need to conditionally apply a cost factor to each hour of the day. In this example, the time-of-use costs will be defined as follows:

|Tariff Type|Cents per kWh|Time Range|
|---|---|---|
|Peak|28|17:00 to 24:00|
|Shoulder|20|7:00 to 17:00|
|Off-Peak|12|0:00 to 7:00|

### Directly loop through the dataframe

In [11]:
def apply_tariff(kwh, hour):
    """Calculates cost of electricity for given hour."""    
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh

def apply_tariff_loop(df):
    """Calculate costs in loop.  Modifies `df` inplace."""
    energy_cost_list = []
    for i in range(len(df)):
        # Get electricity used and hour of day
        energy_used = df.iloc[i]['energy_kwh']
        hour = df.iloc[i]['date_time'].hour
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)
    df['cost_cents'] = energy_cost_list
    
apply_tariff_loop(df)

### Looping with .itertuples() and .iterrows()

`.itertuples()` yields a namedtuple for each row, with the row’s index value as the first element of the tuple. A namedtuple is a data structure from Python’s collections module that behaves like a Python tuple but has fields accessible by attribute lookup.

`.iterrows()` yields pairs (tuples) of (index, Series) for each row in the DataFrame.

While `.itertuples()` tends to be a bit faster, let’s stay in Pandas and use `.iterrows()` in this example, because some readers might not have run across namedtuple.

In [None]:
def apply_tariff_iterrows(df):
    energy_cost_list = []
    for index, row in df.iterrows():
        energy_used = row['energy_kwh']
        hour = row['date_time'].hour
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)
    df['cost_cents'] = energy_cost_list

apply_tariff_iterrows(df)
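
For comparison, an `.itertuples()` version might look like the sketch below. The function name `apply_tariff_itertuples` and the small `demo` frame are hypothetical; `apply_tariff` is repeated so the block runs on its own.

```python
import pandas as pd

def apply_tariff(kwh, hour):
    """Calculates cost of electricity for given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh

def apply_tariff_itertuples(df):
    """Calculate costs with .itertuples(). Modifies `df` in place."""
    energy_cost_list = []
    # index=False drops the index, so each row has only the column fields
    for row in df.itertuples(index=False):
        energy_cost_list.append(apply_tariff(row.energy_kwh, row.date_time.hour))
    df['cost_cents'] = energy_cost_list

# Hypothetical one-day frame with 1 kWh used in every hour
demo = pd.DataFrame({
    'date_time': pd.date_range('2013-01-01', periods=24, freq=pd.Timedelta(hours=1)),
    'energy_kwh': 1.0,
})
apply_tariff_itertuples(demo)
```

Fields are read by attribute (`row.energy_kwh`) rather than by key, which is the main syntactic difference from `.iterrows()`.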

### Use `.apply`

The syntactic advantages of `.apply()` are clear, with a significant reduction in the number of lines and very readable, explicit code. In this case, the time taken was roughly half that of the `.iterrows()` method.

In [12]:
def apply_tariff_withapply(df):
    df['cost_cents'] = df.apply(lambda row: apply_tariff(row['energy_kwh'], row['date_time'].hour), axis=1)
    
apply_tariff_withapply(df)

### Selecting Data With `.isin()`

If there were a single electricity price, you could apply that price across all the electricity consumption data in one line of code (`df['energy_kwh'] * 28`). This particular operation was an example of a vectorized operation, and it is the fastest way to do things in Pandas. We really want to speed up the calculation by vectorization.

In this next example, you will see how to select rows with Pandas’ `.isin()` method and then apply the appropriate tariff in a vectorized operation. Before you do this, it will make things a little more convenient if you set the date_time column as the DataFrame’s index.

In [8]:
df.set_index('date_time', inplace=True)

In [9]:
df.head(2)

date_time,energy_kwh
2013-01-01 00:00:00,0.586
2013-01-01 01:00:00,0.58


In [16]:
def apply_tariff_isin(df):
    # Define hour range Boolean arrays
    peak_hours = df.index.hour.isin(range(17, 24))
    shoulder_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))

    # Apply tariffs to hour ranges
    df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
    df.loc[shoulder_hours,'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
    df.loc[off_peak_hours,'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12

apply_tariff_isin(df)

The `.isin()` method returns an array of Boolean values that looks like this:
```
[False, False, False, ..., True, True, True]
```

When you pass these Boolean arrays to the DataFrame’s .loc indexer, you get a slice of the DataFrame that only includes rows that match those hours. After that, it is simply a matter of multiplying the slice by the appropriate tariff, which is a speedy vectorized operation.
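
The masking step can be seen in isolation on a tiny frame. This is a sketch with three hypothetical timestamps, one per tariff band, and only the peak tariff applied.

```python
import pandas as pd

# Hypothetical readings: one off-peak, one shoulder, one peak hour
demo = pd.DataFrame(
    {'energy_kwh': [1.0, 1.0, 1.0]},
    index=pd.to_datetime(['2013-01-01 03:00', '2013-01-01 12:00', '2013-01-01 20:00']),
)

# Boolean array with one entry per row, True only for peak hours
peak_hours = demo.index.hour.isin(range(17, 24))
print(peak_hours)

# Only rows where the mask is True are assigned; the others stay NaN
demo.loc[peak_hours, 'cost_cents'] = demo.loc[peak_hours, 'energy_kwh'] * 28
print(demo)
```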

### Create an array of prices using `pd.cut()`

`pd.cut()` discretizes the column and assigns new values based on the bins.
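
Bin edges deserve a quick check when using `pd.cut()`, because intervals are right-closed by default, so a boundary hour such as 7 or 17 falls into the lower bin. The sketch below compares the default with `right=False` on a handful of hypothetical boundary hours.

```python
import numpy as np
import pandas as pd

# Hours on and around the tariff boundaries
hours = np.array([0, 6, 7, 16, 17, 23])

# Default: bins are [0, 7], (7, 17], (17, 24]
right_closed = pd.cut(hours, bins=[0, 7, 17, 24],
                      include_lowest=True,
                      labels=[12, 20, 28]).astype(int)

# right=False: bins are [0, 7), [7, 17), [17, 24)
left_closed = pd.cut(hours, bins=[0, 7, 17, 24],
                     right=False,
                     labels=[12, 20, 28]).astype(int)

print(right_closed)
print(left_closed)
```

With `right=False`, hours 7 and 17 receive the shoulder and peak rates respectively, which is what the tariff table specifies.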

In [27]:
def apply_tariff_cut(df):
    cents_per_kwh = pd.cut(x=df.index.hour,
                           bins=[0, 7, 17, 24],
                           right=False,  # left-closed bins [0, 7), [7, 17), [17, 24) match the tariff table
                           labels=[12, 20, 28]).astype(int)
    print(type(cents_per_kwh))
    df['cost_cents'] = cents_per_kwh * df['energy_kwh'].values
    
apply_tariff_cut(df)

<class 'numpy.ndarray'>


### Don't forget NumPy!

In [10]:
def apply_tariff_digitize(df):
    prices = np.array([12, 20, 28])
    bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
    df['cost_cents'] = prices[bins] * df['energy_kwh'].to_numpy()
    
apply_tariff_digitize(df)

In [25]:
prices = np.array([12, 20, 28])
prices

array([12, 20, 28])

In [26]:
bins = np.digitize(df.index.hour.to_numpy(), bins=[7, 17, 24])
bins

array([0, 0, 0, ..., 2, 2, 2], dtype=int64)
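
As a sanity check, the `np.digitize` lookup can be compared against the scalar `apply_tariff` function for every hour of the day. This is a sketch that assumes 1 kWh of usage per hour; `apply_tariff` is repeated so the block runs on its own.

```python
import numpy as np

def apply_tariff(kwh, hour):
    """Calculates cost of electricity for given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh

hours = np.arange(24)
prices = np.array([12, 20, 28])

# digitize returns, for each hour, the index of the bin it falls into:
# 0 for hour < 7, 1 for 7 <= hour < 17, 2 for 17 <= hour < 24
bins = np.digitize(hours, bins=[7, 17, 24])
vectorized = prices[bins] * 1.0

# Reference result from the scalar function
scalar = np.array([apply_tariff(1.0, int(h)) for h in hours])

print(np.array_equal(vectorized, scalar))
```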