# Restoring Rows for Missing Time Periods

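* The previous section covered filling in missing values when every time step (day or month) is present in the index.
* In practice, the rows themselves may be absent: the data can be missing an entire day or several months.
* Here we restore those rows with a left join.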

In [3]:
# Load libraries
import pandas as pd
import numpy as np

## Create a time series with a missing row

In [4]:
# Create a month-end date index ('M' is deprecated in pandas 2.2+; use 'ME' there)
time_index = pd.date_range('2010-01-01', periods=5, freq='M')

# Create data frame, set index
df = pd.DataFrame(index=time_index)

# Create a Sales column with a value for every month (no gaps yet)
df['Sales'] = [1.0, 2.0, 3.0, 4.0, 5.0]
df.head()

Unnamed: 0,Sales
2010-01-31,1.0
2010-02-28,2.0
2010-03-31,3.0
2010-04-30,4.0
2010-05-31,5.0


In [5]:
# Drop the March row
df.drop(index=df.index[2], inplace=True)
df.head()

Unnamed: 0,Sales
2010-01-31,1.0
2010-02-28,2.0
2010-04-30,4.0
2010-05-31,5.0


## Restore the missing month (key step)

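* First, build time_df, a DataFrame that contains only the complete index.
* Then left-join time_df with df.
* The row for the missing month comes back, with NaN in Sales.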

In [12]:
# Build a DataFrame that has only the complete index and no columns
time_df = pd.DataFrame(index=time_index)
time_df

2010-01-31
2010-02-28
2010-03-31
2010-04-30
2010-05-31


In [13]:
# Left join on the index: every row of time_df is kept; months absent from df get NaN
new_df = time_df.merge(df, left_index=True, right_index=True, how='left')
new_df

Unnamed: 0,Sales
2010-01-31,1.0
2010-02-28,2.0
2010-03-31,
2010-04-30,4.0
2010-05-31,5.0
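
As a cross-check on the left join above, the same row can also be restored with `DataFrame.reindex`. This is an alternative technique, not the one the notebook uses, and the names `full_index`, `sales`, `via_merge`, and `via_reindex` are illustrative. A minimal sketch, assuming the complete index is known in advance:

```python
import pandas as pd

# Complete month-end index (offset object avoids version-specific string aliases)
full_index = pd.date_range('2010-01-01', periods=5, freq=pd.offsets.MonthEnd())

# Data with the March row missing: the index is full_index without position 2
sales = pd.DataFrame({'Sales': [1.0, 2.0, 4.0, 5.0]}, index=full_index.delete(2))

# Approach used in this notebook: left join from an index-only DataFrame
via_merge = pd.DataFrame(index=full_index).merge(
    sales, left_index=True, right_index=True, how='left'
)

# Alternative: reindex conforms the data to the full index, inserting NaN rows
via_reindex = sales.reindex(full_index)

print(via_merge)
print(via_reindex)
```

Both results have five rows, with NaN in the March row. The merge form generalizes more naturally when the left side carries its own columns, while `reindex` is shorter when only the index needs completing.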


## Interpolate missing values

In [14]:
# Linear interpolation (the default method) between neighboring values
new_df.interpolate()


Unnamed: 0,Sales
2010-01-31,1.0
2010-02-28,2.0
2010-03-31,3.0
2010-04-30,4.0
2010-05-31,5.0
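
The default `interpolate()` is linear in row position, so it treats the rows as evenly spaced. Because month lengths differ, pandas also offers `method='time'`, which weights by the actual timestamps in a DatetimeIndex. The sketch below compares the two on the same data; it is an illustration rather than part of the original notebook, and the names `s`, `linear`, and `time_weighted` are assumptions.

```python
import numpy as np
import pandas as pd

full_index = pd.date_range('2010-01-01', periods=5, freq=pd.offsets.MonthEnd())

# Series with March missing, mirroring new_df['Sales']
s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0], index=full_index)

# Default: linear in row position, ignores the spacing between dates
linear = s.interpolate()

# Time-weighted: accounts for the number of days between timestamps
time_weighted = s.interpolate(method='time')

print(pd.DataFrame({'linear': linear, 'time': time_weighted}))
```

Here the March value differs only slightly between the two methods, because the gaps on either side of 2010-03-31 are 31 and 30 days. The difference grows when observations are irregularly spaced.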


## Back-fill missing values

In [15]:
# Back-fill: replace each NaN with the next valid value
new_df.bfill()

Unnamed: 0,Sales
2010-01-31,1.0
2010-02-28,2.0
2010-03-31,4.0
2010-04-30,4.0
2010-05-31,5.0
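
Back-filling copies the next valid value backward, which is why March takes April's 4.0 above. Its counterpart, `ffill()`, copies the previous valid value forward instead. The sketch below puts the two side by side on the same data; the names `s` and `filled` are illustrative and not from the original notebook.

```python
import numpy as np
import pandas as pd

full_index = pd.date_range('2010-01-01', periods=5, freq=pd.offsets.MonthEnd())

# Series with March missing, mirroring new_df['Sales']
s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0], index=full_index)

# Compare the original with forward-fill and back-fill
filled = pd.DataFrame({
    'original': s,
    'ffill': s.ffill(),  # March takes February's value
    'bfill': s.bfill(),  # March takes April's value
})
print(filled)
```

Which direction is appropriate depends on the data: forward-fill uses only past information, whereas back-fill carries a later observation into an earlier period.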


## Interpolate at most one consecutive missing value


In [16]:
# Interpolate, filling at most one consecutive NaN in the forward direction
new_df.interpolate(limit=1, limit_direction='forward')

Unnamed: 0,Sales
2010-01-31,1.0
2010-02-28,2.0
2010-03-31,3.0
2010-04-30,4.0
2010-05-31,5.0
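
With only one missing month, `limit=1` produces the same output as a plain `interpolate()`, so the effect of the parameter is not visible above. The sketch below uses a hypothetical series with a two-month gap (not data from the original notebook) to show that only the first NaN in the run is filled; the names `gap` and `limited` are illustrative.

```python
import numpy as np
import pandas as pd

full_index = pd.date_range('2010-01-01', periods=5, freq=pd.offsets.MonthEnd())

# Hypothetical series with two consecutive missing months (February and March)
gap = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0], index=full_index)

# limit=1 fills at most one consecutive NaN, working forward from the last valid value
limited = gap.interpolate(limit=1, limit_direction='forward')
print(limited)
```

February is filled by linear interpolation while March stays NaN, which is useful when long gaps should remain flagged as missing rather than be estimated.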


## References
* [Handling Missing Values In Time Series](https://chrisalbon.com/machine_learning/preprocessing_dates_and_times/handling_missing_values_in_time_series/)