# Introduction to the Research Environment

The research environment is powered by Jupyter notebooks (formerly IPython notebooks), which support a wide range of data analysis and statistical validation. We'll demonstrate a few simple techniques here.

## Code Cells vs. Text Cells

As you can see, each cell can be either code or text. To switch between them, use the cell type dropdown in the toolbar at the top left, or choose `Cell Type` from the `Cell` menu: `Code` makes the cell a code cell and `Markdown` makes it a text cell, as shown in the image below.

![image](cell_mode_change.jpg)

When a cell is selected but not being edited, keyboard shortcuts also work: `Y` switches it to code and `M` switches it to text. Press `Enter` to start editing a cell and `Esc` to stop editing.

## Executing a Command

A code cell is evaluated when you press the play button or use the shortcut `Shift`+`Enter`. Evaluating a cell runs each line of code in sequence and prints the result of the last line below the cell.

In [None]:
2 + 2

Sometimes no result is printed, for example when the last line is an assignment.

In [None]:
X = 2

Remember that only the result of the last line is printed.

In [None]:
2 + 2
3 + 3

You can use `print` to display any other result you want to see.

In [None]:
print(2 + 2)
3 + 3

## Knowing When a Cell is Running

While a cell is running, `[*]` is displayed on the left. When a cell has yet to be executed, `[ ]` is displayed. Once it has been run, a number such as `[5]` is displayed, indicating the order in which it was run during the execution of the notebook. Try running the cell below and watch the marker change.

- Note: after you reopen a notebook, or restart the kernel without clearing the output, cells still show their old numbers even though they have not been run in the current session.

In [None]:
# A cell whose code takes a noticeable amount of time to run
c = 0
for i in range(10000000):
    c = c + i
c

## Importing Libraries

The vast majority of the time, you'll want to use functions from pre-built libraries. You can't import every library on Quantopian due to security issues, but you can import most of the common scientific ones. Here I import `numpy` and `pandas`, the two most common and useful libraries in quant finance. I recommend copying this import statement to every new notebook.

Notice that you can rename libraries to whatever you want when importing them. The `as` keyword allows this. Here we use `np` and `pd` as aliases for `numpy` and `pandas`. These aliases are a widely followed convention and appear in most code snippets around the web. The point is to let you type fewer characters when you access these libraries frequently.

In [None]:
import numpy as np
import pandas as pd
from scipy.io import loadmat
# matplotlib is an excellent plotting library
import matplotlib.pyplot as plt
import datetime

%matplotlib inline

## Tab Autocomplete

Pressing `Tab` gives you a list of IPython's best guesses for what you might want to type next. This is incredibly valuable and will save you a lot of time. If there is only one possible option for what you could type next, IPython fills it in for you. Press `Tab` often: it seldom fills in anything you don't want, because a list is shown whenever there is ambiguity. This is a great way to see which functions are available in a library.

Try placing your cursor after the last `.` in `np.random.` and pressing `Tab`.

In [None]:
np.random.

## Getting Documentation Help

Placing a question mark after a function and executing that line of code shows the documentation IPython has for that function. It's often best to do this in a new cell, so that you avoid re-executing other code and running into bugs.

In [None]:
np.random.normal?
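
The `?` syntax only works inside IPython. As a minimal sketch of what to do in a plain Python script, the same docstring can be read from the function's `__doc__` attribute or browsed with the built-in `help` function. The variable names here are illustrative only.

```python
import numpy as np

# `?` is IPython-specific; the docstring itself is available in any interpreter.
doc = np.random.normal.__doc__

# Show just the first few lines of the docstring.
first_lines = doc.strip().splitlines()[:3]
print("\n".join(first_lines))

# help(np.random.normal) would display the full documentation interactively.
```

Printing only the first few lines keeps the output short; the full docstring also describes the `loc`, `scale` and `size` parameters.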

## Sampling

We'll sample some random data using a function from `numpy`.

In [None]:
# Sample 100 points from a normal distribution with a mean of 0 and a standard deviation of 1,
# i.e. the standard normal distribution.
np.random.seed(1) # fix the seed so that everyone generates the same sample

X = np.random.normal(0, 1, 100)
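
To illustrate why the seed is set before sampling, here is a small sketch showing that resetting the seed reproduces the same draws. The last lines use NumPy's newer `default_rng` generator, which is an alternative not used elsewhere in this notebook; it needs NumPy 1.17 or later.

```python
import numpy as np

np.random.seed(1)
a = np.random.normal(0, 1, 5)

np.random.seed(1)             # resetting the seed restarts the random sequence
b = np.random.normal(0, 1, 5)

print(np.array_equal(a, b))   # compares the two draws element by element

# Newer NumPy code often uses an explicit generator instead of the global seed.
rng = np.random.default_rng(1)
c = rng.normal(0, 1, 5)       # a separate stream from the legacy functions above
```

An explicit generator keeps the random state local to your code, which can make larger notebooks easier to reason about.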

## Plotting

We can use the plotting library we imported earlier (`import matplotlib.pyplot as plt`) as follows.

In [None]:
plt.plot(X)

### Squelching Line Output

You might have noticed the annoying line of the form `[<matplotlib.lines.Line2D at 0x7f72fdbc1710>]` before the plot. This is because the `.plot` function returns a value, and the notebook displays the value of the last line in a cell. When we don't want that output shown, we can suppress it by ending the line with a semicolon, as follows.

In [None]:
plt.plot(X);

### Adding Axis Labels

No self-respecting quant leaves a graph without labeled axes. Here are some commands to help with that.

In [None]:
np.random.seed(2)

X = np.random.normal(0, 1, 100)
X2 = np.random.normal(0, 1, 100)

plt.plot(X);
plt.plot(X2);
plt.xlabel('Time') # The data we generated is unitless, but don't forget units in general.
plt.ylabel('Returns')
plt.legend(['X', 'X2']);

## Generating Statistics

Let's use `numpy` to compute some simple statistics.

In [None]:
np.mean(X)

In [None]:
np.std(X)
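
One detail worth knowing when comparing results across libraries: `np.std` divides by `n` by default (`ddof=0`), while the pandas `std` method divides by `n - 1` (`ddof=1`). The sketch below uses made-up numbers to show the difference.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, 2.0, 3.0, 4.0])  # made-up data for illustration

pop_std = np.std(values)                 # population estimate, divides by n
sample_std = np.std(values, ddof=1)      # sample estimate, divides by n - 1
pandas_std = pd.Series(values).std()     # pandas defaults to ddof=1

print(pop_std, sample_std, pandas_std)
```

For a sample of 100 points the two estimates differ by well under one percent, but the gap is larger for small samples.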

## Getting Real Pricing Data

Randomly sampled data can be great for testing ideas, but let's get some real data. We can use `get_pricing` to do that. You can use the `?` syntax discussed above to get more information on `get_pricing`'s arguments.

- Translator's note: `get_pricing` is a Quantopian function and is not available when running the notebook locally.

In [None]:
# data = get_pricing('MSFT', start_date='2012-1-1', end_date='2015-6-1')

- Translator's note: a local data file is provided so that you can complete the rest of the exercises.

In [None]:
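file_name = 'data/IF888-2011.mat'  # forward slashes work on Windows, macOS and Linux
origin_data = loadmat(file_name)
data = pd.DataFrame(origin_data['IF888'])
data.columns = ['date', 'price']
# The date column appears to hold MATLAB-style serial day numbers counted from year 0000.
origin = np.datetime64('0000-01-01', 'D') - np.timedelta64(1, 'D')
data['date'] = data['date'].map(lambda x: x * np.timedelta64(1, 'D') + origin)
# Use the dates as the index and keep only the price column.
data.index = data['date'].tolist()
data = data.iloc[:, 1:]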
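
The date arithmetic above is consistent with MATLAB serial date numbers, in which 1970-01-01 corresponds to day 719529. Assuming that is the format of the file, a vectorized version of the conversion could look like the sketch below. The helper name `datenum_to_datetime64` is hypothetical, and the sketch ignores any fractional (time-of-day) part.

```python
import numpy as np

MATLAB_EPOCH_OFFSET = 719529  # MATLAB serial date number of 1970-01-01

def datenum_to_datetime64(datenums):
    """Convert whole-day MATLAB serial date numbers to numpy datetime64[D]."""
    # Shift so that day 0 is 1970-01-01, the numpy datetime64 epoch.
    days = np.asarray(datenums).astype(np.int64) - MATLAB_EPOCH_OFFSET
    return np.datetime64('1970-01-01', 'D') + days.astype('timedelta64[D]')

dates = datenum_to_datetime64([719529.0, 719530.0])
print(dates)
```

Working on the whole column at once avoids calling a Python function for every row, which matters for long price histories.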

Our data is now a dataframe. You can see the datetime index and the columns holding the pricing data.

Because this is a pandas dataframe, we can index into it to get just the price, like this. For more info on pandas, please [click here](http://pandas.pydata.org/pandas-docs/stable/10min.html).

In [None]:
X = data['price']

Because our data now also contains date information, we pass two series to `.plot`. `X.index` gives us the datetime index, and `X.values` gives us the pricing values. These are used as the X and Y coordinates of the graph.

In [None]:
plt.plot(X.index, X.values)
plt.ylabel('Price')
plt.legend(['IF888'])

Let's compute statistics on the real data.

In [None]:
np.mean(X)

In [None]:
np.std(X)

## Getting Returns from Prices

We can use the `pct_change` method to get returns. Notice that we drop the first element after doing this, as it is `NaN`: the first price has no previous price to compare against, so its percent change is undefined.

In [None]:
R = X.pct_change()[1:]
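
To make the calculation concrete, here is a small sketch with made-up prices showing that `pct_change` matches the return formula written out by hand, where each price is divided by the previous one and 1 is subtracted.

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0])     # made-up prices for illustration

returns = prices.pct_change()[1:]            # drop the leading NaN
manual = (prices / prices.shift(1) - 1)[1:]  # the same calculation written out

print(returns.tolist())
print(manual.tolist())
```

A move from 100 to 110 is a return of about 10%, and the move from 110 down to 99 is a return of about -10%.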

We can plot the distribution of returns as a histogram.

In [None]:
plt.hist(R, bins=20)
plt.xlabel('Returns')
plt.ylabel('Frequency')
plt.grid(True)
plt.legend(['IF888 Returns']);

Let's compute the statistics again, this time on the returns.

In [None]:
np.mean(R) # the same as R.mean()

In [None]:
R.std() # sample standard deviation (ddof=1); np.std(R) uses ddof=0 and is slightly smaller

Now let's go backwards and generate data from a normal distribution using the mean and standard deviation we estimated from the IF888 returns. We'll see that we have good reason to suspect the IF888 returns may not be normal, as the resulting normal distribution looks quite different.

In [None]:
np.random.seed(3)

plt.hist(np.random.normal(np.mean(R), np.std(R), 10000), bins=20)
plt.xlabel('Returns')
plt.ylabel('Frequency')
plt.grid(True)
plt.legend(['Normal Distribution Returns'], loc='best');
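
Comparing histograms by eye is informal. One rough numerical check, offered here only as a sketch on synthetic data, is excess kurtosis: it is close to 0 for normally distributed data and clearly positive for heavy-tailed data. The Student's t sample below is a stand-in for heavy-tailed returns and is not the IF888 data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(0, 1, 100_000))
heavy_sample = pd.Series(rng.standard_t(5, size=100_000))  # heavier tails than normal

# Series.kurt() reports excess kurtosis (normal distribution is the zero baseline).
normal_kurt = normal_sample.kurt()
heavy_kurt = heavy_sample.kurt()
print(normal_kurt, heavy_kurt)
```

The same `kurt()` call could be applied to `R` to see where the real returns fall between these two cases.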

## Generating a Moving Average

`pandas` has some nice tools for generating rolling statistics. Here's an example. Notice that there is no moving average for the first 59 days, because a 60-day window needs 60 observations before it can produce a value.

In [None]:
# Take the average of the last 60 days at each timepoint.
MAVG = X.rolling(window=60).mean()
plt.plot(X.index, X.values)
plt.plot(MAVG.index, MAVG.values)
plt.ylabel('Price')
plt.legend(['IF888', '60-day MAVG']);
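
The leading gap in the moving average is easier to see on a tiny example. The sketch below uses made-up values and a window of 3, so the first two entries are `NaN` and the third is the mean of the first three values.

```python
import pandas as pd

prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up values for illustration
mavg = prices.rolling(window=3).mean()

n_missing = mavg.isna().sum()  # entries that lack a full window of data
print(mavg.tolist())
print(n_missing)
```

In general, a window of size `w` leaves the first `w - 1` entries empty unless `min_periods` is set to a smaller value.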