------------------------------------------
<a id='top'></a>
## Session contents
### [4. Setting data](#setting_data)
### [5. Numerical operations and aggregations](#numerical)
### [6. Cleaning and filtering data](#cleaning_and_filtering)
### [Exercise set 2](#exercises2)


----------------------------
<a id='setting_data'></a>
## 4. Setting data

When working with a data set it is useful to be able to append data, modify existing data, or create new data that is derived in some way from existing data. We can do this by using the indexing operations from the previous session, as well as some new pandas functions.

**Key methods covered:**

    pd.concat() - concatenates a list of Series or DataFrame objects together
    df.append() - like pd.concat() but as a method on a DataFrame (deprecated in pandas 1.4 and removed in 2.0; use pd.concat() in current versions)

<div class="alert alert-block alert-info">

<span style="color:green">Additional resources</span>

http://pandas.pydata.org/pandas-docs/stable/10min.html#setting

http://pandas.pydata.org/pandas-docs/stable/merging.html

We'll try some basic operations on a DataFrame that contains a few different data types.

In [None]:
import pandas as pd
import numpy as np
import datetime as dt

In [None]:
df = pd.DataFrame(
    index=['First', 'Second', 'Third', 'Fourth', 'Fifth'],
    data={'A': range(1, 6),
          'B': np.random.random(5),
          'C': list('UVWXY')})
df

Let's check the datatypes of each column before proceeding.

In [None]:
df.dtypes

As a simple example of derived values, try multiplying the entire DataFrame by two - is the result what you expected?

In [None]:
df * 2

Now try dividing the DataFrame by two and take note of the error message. Which lines of the error message are most valuable in debugging your code?

In [None]:
df / 2

Try making a new column 'D' that contains the natural logarithm of column 'A'. You will need to use the numpy library.

In [None]:
#solutions
%load solutions/setting_sol1.py

Data can be overwritten by using loc - try overwriting the value in the first row (label 'First'), column A, with $\pi$ and view the DataFrame again.

In [None]:
#solutions
%load solutions/setting_sol2.py

Now check the data types.

In [None]:
df.dtypes

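You should notice that column A is now a float64 rather than an int64 type. Numpy and pandas store each column with a single data type, and automatically convert the column to the most general type that can hold all of its values - in this case, one floating point value resulted in __all__ values being stored as floating point numbers. Note that pandas 2.1 and later emit a warning about this kind of implicit conversion when assigning into an existing column.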

In [None]:
df
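The same "most general type" rule can be seen when building a Series directly. The short sketch below uses made-up values (not the DataFrame above) and only illustrates how the dtype is chosen.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])            # all integers -> integer dtype
mixed = pd.Series([1, 2, 3.5])         # a single float -> whole Series is float64
with_text = pd.Series([1, 2.5, 'x'])   # numbers and text -> generic object dtype

print(ints.dtype, mixed.dtype, with_text.dtype)
```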

Using __.loc__ with a new index value appends new rows and/or columns as necessary (called "setting with enlargement").

In [None]:
df.loc['Sixth'] = [6, np.random.random(), 'Z', np.log(6)]
df

In [None]:
df.loc['Seventh', 'E'] = dt.datetime.now()
df
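As a minimal sketch of setting with enlargement on a small, hypothetical DataFrame (the names `small`, `r1` ... `r4`, and `z` are illustrative only): a new row label adds a row, and a new row/column pair adds both, with the remaining cells filled with NaN.

```python
import pandas as pd

small = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']}, index=['r1', 'r2'])

small.loc['r3'] = [3, 'c']    # 'r3' does not exist yet, so a new row is added
small.loc['r4', 'z'] = 1.5    # neither 'r4' nor 'z' exists: both are created

print(small)
print(small.shape)            # 4 rows, 3 columns
```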

As an alternative to loc, older versions of pandas provide an append method to add new rows to a DataFrame (where the 'other' argument is a dict/Series or another DataFrame)

    df = df.append(other)
    
Note that, unlike a list's append method which is an in-place operation, a DataFrame's append method returns a new DataFrame object. The method was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is used instead.

Try append now (or pd.concat, if your pandas version no longer has append) by creating a new Series object and appending it to df.

If we have multiple Series or DataFrame objects to concatenate, we can avoid multiple append statements by using a single concat statement

    df = pd.concat([df1, df2, df3])
    
This function has many more options than append, but we'll skip over these for now.
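Because df.append is no longer available from pandas 2.0 onwards, the sketch below shows one way to add a single row with pd.concat. The names `base`, `new_row`, and `combined` are illustrative and the data is made up.

```python
import pandas as pd

base = pd.DataFrame({'A': [1, 2], 'C': ['U', 'V']}, index=['First', 'Second'])
new_row = pd.Series({'A': 3, 'C': 'W'}, name='Third')

# to_frame().T turns the Series into a one-row DataFrame;
# the Series name becomes the row label.
combined = pd.concat([base, new_row.to_frame().T])
print(combined)
```

Like append, pd.concat returns a new object and leaves `base` unchanged.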

Try using concat in the cells below.

[Back to top](#top)

-------------------
<a id='numerical'></a>
## 5. Numerical operations and aggregations

Since pandas objects are constructed from numpy arrays, we can use numerical functions from the numpy package on our Series and DataFrame objects. It is conventional to import numpy in the following way

    import numpy as np
    
Pandas objects also have a variety of useful numerical and statistical methods.

**Key methods covered:**
    
    np.where()
    df.where()
    df.mean(), df.min(), df.max(), etc.
    df.diff(), df.sum()
    

<div class="alert alert-block alert-info">

<span style="color:green">Additional resources</span>

http://pandas.pydata.org/pandas-docs/stable/10min.html#operations

Let's replace our earlier DataFrame with a new one containing daily turnover and volatility data.

In [None]:
df = pd.DataFrame(index=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
    data={'optiver_turnover': [ 46386,  43775,  75742,  17474, np.nan],
          'total_turnover':   [278837, 439771, 583722, 358834, np.nan],
          'volatility':       [  12.5,   14.0,   21.5,   16.0,   17.0]})
df

Try adding Optiver's market share (optiver_turnover divided by total_turnover as a percentage) as a column in df called 'market_share':

In [None]:
#solutions
%load solutions/numerical_sol1.py

Almost all numpy functions can be applied directly to a Series or DataFrame. For example, let's round the market share percentage to two decimal places using __np.round__.

In [None]:
df['market_share'] = np.round(df['market_share'], 2)
df

DataFrames have a number of basic statistical methods, such as __mean()__, __min()__, __max()__, __std()__, and __quantile(q)__ (where q=0.5 is the median). By default they are applied column-by-column and ignore missing data. Try a few of them in the cells below.

In [None]:
df.mean()

They can also be applied row-by-row, by adding the argument axis=1. Again, try a few of these below.
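A small sketch of the difference between the two axes, using a hypothetical DataFrame `scores` with one missing value:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'a': [1.0, 2.0, np.nan],
                       'b': [4.0, 6.0, 8.0]})

col_means = scores.mean()         # one value per column: a=1.5, b=6.0 (NaN skipped)
row_means = scores.mean(axis=1)   # one value per row: 2.5, 4.0, 8.0

print(col_means)
print(row_means)
```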

DataFrames also have a __describe__ method that calculates a number of summary statistics simultaneously.

In [None]:
df.describe()

As well as statistical methods, basic aggregation methods are also available. Try calculating the weekly Optiver turnover and total turnover using __sum__.

In [None]:
#solutions
%load solutions/numerical_sol2.py

Other useful methods include __diff__, __cumsum__, and __rank__. Try these below and interpret the meaning of the output.

Comparison operators (<, <=, >, >=, ==, !=) will turn out to be very useful when filtering DataFrames. As an example, let's create a column called 'good_market_share' which is True when market share exceeds 10% and False otherwise.

In [None]:
#solutions
%load solutions/numerical_sol3.py

The resulting data type is Boolean (True or False). Comparisons with missing data return False (with the exception of !=, which returns True). One interesting way to use Boolean columns is to apply the sum method - True and False are converted to 1 and 0 respectively, so the result is the number of times the condition was True.

In [None]:
print('Number of days with good market share:', df['good_market_share'].sum())
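The behaviour of comparisons with missing data can be checked on a small, hypothetical Series (`share` is an illustrative name, not a column of df):

```python
import numpy as np
import pandas as pd

share = pd.Series([5.0, 15.0, np.nan])

above_ten = share > 10     # False, True, False (the NaN compares as False)
not_ten = share != 10      # True, True, True   (!= is the exception)

print(above_ten.sum())     # counts the True values
```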

Try making a column called 'high_volatility' which is True when volatility is greater than 20.

In [None]:
#solutions
%load solutions/numerical_sol4.py

We can combine Boolean arrays together with all the usual logical operators.

    low_volatility = ~df['high_volatility']  # NOT operator
    positive_and_even =             (df.A > 0) & (np.mod(df.A, 2) == 0)  # AND operator (note: the parentheses around each condition are necessary)
    positive_or_even =              (df.A > 0) | (np.mod(df.A, 2) == 0)  # OR operator (not exclusive-or/XOR)
    positive_or_even_but_not_both = (df.A > 0) ^ (np.mod(df.A, 2) == 0)  # XOR operator
    
Try creating a column which is True when we have high volatility and good market share.

In [None]:
#solutions
%load solutions/numerical_sol5.py

Finally, __np.where__ provides a useful way of mapping True/False values to other values. See if you can make a 'commentary' column which has the value 'LowVol' if volatility is low and 'HighVol' if volatility is high.

In [None]:
#solutions
%load solutions/numerical_sol6.py

The DataFrame method __df.where__ is similar to np.where, but it keeps the original values wherever the condition is True and can only replace values where the condition is False.
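To compare the two, here is a minimal sketch on a hypothetical Series `values`. np.where picks between two alternatives and returns a numpy array, while the pandas where method keeps the original value when the condition holds.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, -2, 3])

# np.where: choose 'pos' where the condition is True, 'neg' otherwise
labels = np.where(values > 0, 'pos', 'neg')

# Series.where: keep values where the condition is True, replace the rest with 0
clipped = values.where(values > 0, 0)

print(labels)
print(clipped)
```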

[Back to top](#top)

--------------
<a id='cleaning_and_filtering'></a>
## 6. Cleaning and filtering data

Cleaning data prior to analysis is one of the most essential steps in ensuring the outputs of our analysis are accurate and can be relied upon.

For example, during the auction period the bid and ask prices can often take unrealistic values. It might even be a good idea for some projects to remove the volatile auction period altogether. We can use some techniques to remove certain blocks of data from our DataFrame.

The easiest way to do this is to apply a Boolean mask. For example, the line of code below would only return the subset of the original DataFrame where column1 was positive.

    df[df['column1'] > 0]

It works by applying your Boolean criterion to column1, which yields a Series of True and False values. This is then used to determine which rows of the DataFrame are retained (keep True, discard False).
Alternatively, you may wish to remove segments of your data (for example, to avoid the opening and closing auctions). The code below shows how Boolean masks can be applied to the index of a DataFrame too (here the index is assumed to hold timestamps).

    df[df.index.hour > 9] 
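Both kinds of mask can be sketched on a tiny, made-up DataFrame with a timestamp index. The name `prices`, the column `price`, and the values are assumptions for illustration only.

```python
import pandas as pd

times = pd.to_datetime(['2024-01-02 08:30', '2024-01-02 09:30',
                        '2024-01-02 10:30', '2024-01-02 11:30'])
prices = pd.DataFrame({'price': [0.0, 10.1, 10.2, -1.0]}, index=times)

positive = prices[prices['price'] > 0]        # mask on a column
after_nine = prices[prices.index.hour > 9]    # mask on the index

print(positive)
print(after_nine)
```

Note that `hour > 9` keeps only rows from 10:00 onwards, so the 09:30 row is dropped.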

The key methods are __df.isnull(), df.dropna(), df.fillna()__. 

<div class="alert alert-block alert-info">

<span style="color:green">YouTube video</span>

Watch the video below until the 2 hour and 5 minute mark.


https://www.youtube.com/watch?v=6ohWS7J1hVA&start=6661

**Key methods covered:**

    df.drop - drops certain rows/columns
    masking - returning a subsection of the DataFrame according to certain criteria


Refer to the Python Data Science Handbook below for more information.


<div class="alert alert-block alert-info">

<span style="color:green">Additional resources</span>

http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing

http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.04-Missing-Values.ipynb


<div class="alert alert-success">

If you have just started here or would like to refresh your df_qte object, run the line below.

In [None]:
%load solutions/cleaning_start.py

<br>

So far the df_qte DataFrame appears to be behaving as it should - but are there any hidden gotchas?

A good way to check this is to visualise the data. Pandas has some basic built-in plotting functionality which allows us to plot DataFrame or Series objects. First run 
>%matplotlib inline 

to ensure figures display inline rather than as pop-up windows, and then try to __.plot()__ the bid and ask prices. (see: http://pandas.pydata.org/pandas-docs/stable/visualization.html)

What do you notice? 

In [None]:
%matplotlib inline

In [None]:
df_qte[['BidPrice', 'AskPrice']].plot()

It's quite clear from the above charts that we've got some 0 values in our data. Check for yourself by using a mask.

In [None]:
#solutions
%load solutions/clean_sol1.py

It would probably be a good idea (in most cases) to filter out these entries with a value 0 or less. We can do this by replacing the DataFrame with a subset of itself where bid and ask prices are positive.

In [None]:
#solutions
%load solutions/clean_sol2.py

If we revisit the charts of bid and ask prices, we should find the numbers much more reasonable now.

In [None]:
df_qte.BidPrice.plot()

In [None]:
df_qte.AskPrice.plot()

Since masks are simply Series of Boolean values, we should be able to apply **multiple masks** to a DataFrame with ease. For example, if we have 2 criteria, A and B, and A returns a Series of 

    [True, False] 
    
and B returns a Series of 

    [False, False]
    
applying both masks A and B will simply return 

    [False, False]

Remember, when we combine two masks with the AND operator (&), the result is True only where both masks are True.

Let's try and apply 2 masks to our df_qte DataFrame - **BidSize >= 30 and BidSize <= 40**. Recall the format of masks!

    mask_name = df['column'] *criterion*
    
    df[mask_1 & mask_2]
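The mask format above can be sketched on a small, hypothetical DataFrame (`sizes` and its `size` column are illustrative, not part of df_qte):

```python
import pandas as pd

sizes = pd.DataFrame({'size': [10, 30, 35, 40, 55]})

at_least_30 = sizes['size'] >= 30
at_most_40 = sizes['size'] <= 40

# Rows are kept only where both masks are True
in_range = sizes[at_least_30 & at_most_40]

# The same filter written inline; each condition needs its own parentheses
same = sizes[(sizes['size'] >= 30) & (sizes['size'] <= 40)]

print(in_range)
```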

Missing data may be represented in pandas using either Python's None or numpy's np.nan (rendered as NaN when displayed). We may encounter missing data if there are data quality issues, or if we choose to set some (wrong or uninteresting) data as missing.

There are a few options for dealing with missing data:

Replace with another value:

    - A fixed value (.fillna)
    - Fill forwards from the previous non-NaN value (.ffill)
    - Fill backwards from the next non-NaN value (.bfill)
    
Remove rows or columns:

    df.dropna(how='any', axis=0)  # drop row if any of its values are NaN
    df.dropna(how='any', axis=1)  # drop column if any of its values are NaN
    df.dropna(how='all')  # drop row if all values are NaN
    df.dropna(thresh=2)  # keep row if 2 or more entries are not NaN
    
Forward-fill is usually preferable to back-fill for trading data, since the best estimate of the current (missing) value is the most recent non-NaN value, whereas back-fill uses a value that was not yet known at that point in time.
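The options listed above can be compared side by side on a short, made-up Series (`gaps` is an illustrative name):

```python
import numpy as np
import pandas as pd

gaps = pd.Series([1.0, np.nan, 3.0, np.nan])

fixed = gaps.fillna(0)     # 1.0, 0.0, 3.0, 0.0
forward = gaps.ffill()     # 1.0, 1.0, 3.0, 3.0
backward = gaps.bfill()    # 1.0, 3.0, 3.0, NaN (nothing after the last gap)
dropped = gaps.dropna()    # 1.0, 3.0

print(forward)
```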

If you need to do more complicated operations with NaNs, there are methods that return True or False if the data is or is not NaN.

    df.isnull()  # or df.isna() - returns True if value is NaN and False otherwise
    df.notnull()  # or df.notna() - returns False if value is NaN and True otherwise
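Because these methods return Boolean DataFrames, they combine naturally with sum and with masks. A minimal sketch on a hypothetical DataFrame `table`:

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({'p': [1.0, np.nan, 3.0],
                      'q': [np.nan, np.nan, 6.0]})

# Count the missing values in each column
missing_per_column = table.isnull().sum()

# Keep only the rows in which every value is present
complete_rows = table[table.notnull().all(axis=1)]

print(missing_per_column)
print(complete_rows)
```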

Let's try dealing with zero-priced bids/offers with missing-data operations instead. We'll reload the data again.

In [None]:
%load solutions/cleaning_start.py

This time, set invalid bids and offers to np.nan.

Now, forward fill the data.

Then drop any remaining NaNs.

Finally, check the plots again to make sure our bids and offers are correct.
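If you get stuck, the three steps above can be sketched on a toy DataFrame. This is only one possible approach, under the assumption that any non-positive price is invalid; `quotes` and its values are made up and are not the df_qte data.

```python
import pandas as pd

quotes = pd.DataFrame({'BidPrice': [0.0, 9.9, 0.0, 10.1],
                       'AskPrice': [0.0, 10.1, 10.2, 0.0]})

# 1. where() keeps positive prices and sets everything else to NaN
# 2. ffill() carries the last valid price forward
# 3. dropna() removes rows that still have no valid price (here, the first row)
cleaned = quotes.where(quotes > 0).ffill().dropna()

print(cleaned)
```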

[Back to top](#top)

-------------------
<a id='exercises2'></a>
## Exercise set 2

1. Using the cleaned df_qte DataFrame, add the bid-ask midpoint as a column.
2. Calculate the following quantities.
    - The average bid-ask spread in the product.
    - The open/low/high/close of the midpoint.