# Understanding the [05b notebook](https://github.com/lcdm-uiuc/ABIS/blob/master/05a%20Beating%20Random%20Walk-Parallel-Implementation.ipynb)

**Note:** this notebook is designed to be run locally so that you can experiment with these ideas and understand them better. The concepts presented here are identical to those in 05b, except:
- The data is much cleaner and smaller.
- I sacrifice speed for explainability. Some things in the 05b notebook are done differently because that is faster, but they still perform the same operations.
- I don't have enough time today to replicate some of the issues with the data and explain how I dealt with them. We have already gone over these issues, so some exceptions handled in 05b are not in this notebook.
- I do not save the data, models, or results to a file to view later.

##### Please read the cell below to understand what each library will do in this notebook.

In [1]:
import numpy as np   # data creation/ manipulation
import pandas as pd  # data creation/ manipulation

from IPython.display import display  # Used to display items.

from copy import deepcopy  # Used to create copies of models.
from tqdm import tqdm  # Used to keep track of how long things will take.

# Used to search over specified parameter values.
from sklearn.model_selection import GridSearchCV

# Provides train/validation indices for the model to use.
from sklearn.model_selection import TimeSeriesSplit

# Model to make predictions
from sklearn.ensemble import RandomForestClassifier

# To replace missing values and scale values
from sklearn.preprocessing import Imputer, StandardScaler

##### The code cell below creates a relatively clean dataset that we will use for the remainder of the notebook.

The following items are created:
- features (independent variables),
- industry identifiers (i.e., fake SIC codes),
- company identifiers (i.e., fake permno codes),
- time (i.e., fiscal year).

----
**Note**:
- The features in the notebook are ROE_t and the industry mean, median, and standard deviation of ROE_t.
- You can think of the labels as whether the returns increase, decrease, or stay the same.
- The real dataset has some missing values, so I simulate some missing values here.
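
As a minimal sketch of how missing values can be simulated at an exact rate, the snippet below blanks out 15% of a random feature. The seed, the array size, and the names `rng`, `roe`, and `nan_positions` are assumptions made for illustration and are not taken from 05b. It samples positions without replacement, which is one possible choice when exactly 15% of the values should be missing.

```python
import numpy as np

# Hypothetical seed and size, chosen only for illustration.
rng = np.random.RandomState(0)
roe = rng.rand(100)

# Number of values to blank out: 15% of the observations.
n_missing = roe.size * 15 // 100

# replace=False guarantees distinct positions, so exactly n_missing values become NaN.
nan_positions = rng.choice(roe.size, size=n_missing, replace=False)
roe[nan_positions] = np.nan

print("missing values:", int(np.isnan(roe).sum()))
```

Sampling with replacement instead can pick the same position more than once, in which case fewer values end up missing.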

In [2]:
# To understand what the np.random.rand function does please see:
# https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.rand.html

# To understand what the np.random.RandomState function does please see:
#https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.RandomState.html

# The feature generated below has 100 observations.
# Let this represent ROE.
x1 = np.random.RandomState(23).rand(100,1).flatten()

# I randomly replace up to 15% of Feature 1 with missing values.
# np.random.choice samples with replacement by default, so repeated
# positions can leave slightly fewer than 15 values missing.
x1_nan_inds = np.random.choice(x1.size, size=int(x1.size *.15))
x1[x1_nan_inds] = np.nan

# Industry identifiers similar to SIC codes.
# I create 2 industry codes, 0 and 1, which occur 50 times each.
industry_identifier = [0] * 50 + [1] * 50

# Company identifiers similar to permno.
# Each company has 10 observations,
# so there are 5 companies in each industry.
company_identifier = np.hstack([[ii] * 10 for ii in range(10)])

# Fiscal years 0-9, repeated once for each of the 10 companies.
time = np.hstack([[jj] for ii in range(10) for jj in range(10)])

# I create a dictionary to store all of the data
# I choose header names that reflect the header names in 05b.

data = {'SIC':industry_identifier, 'permno': company_identifier,
        'FYEAR': time,  'ROE': x1 }

# Finally I combine all of the data into a pandas dataframe
df = pd.DataFrame(data)

# Display a random sample of 10
display(df.sample(10))

Unnamed: 0,SIC,permno,FYEAR,ROE
61,1,6,1,
16,0,1,6,0.845094
97,1,9,7,0.418742
43,0,4,3,0.557707
38,0,3,8,0.162012
63,1,6,3,
17,0,1,7,0.065075
56,1,5,6,0.465763
25,0,2,5,0.141501
77,1,7,7,0.053267


##### The code cell below will create the Industry Features based on the simulated ROE

The logic is:
- For each industry:
    - Select all of the companies in the industry and group by the fiscal year.
    - Then generate the mean, median, and standard deviation for each fiscal year and store them in a dataframe.
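
The same logic can be sketched more compactly with a single grouped aggregation. This is only an illustration on a toy dataframe with made-up values, not the approach used in this notebook or in 05b. The output column names are chosen to match the ones used here.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one industry, two fiscal years, three companies.
toy = pd.DataFrame({
    'SIC':   [0, 0, 0, 0, 0, 0],
    'FYEAR': [0, 0, 0, 1, 1, 1],
    'ROE':   [0.2, 0.4, 0.9, 0.1, np.nan, 0.5],
})

# Group by industry and fiscal year, then compute all three statistics at once.
# Missing ROE values are skipped by each statistic.
industry_features = (
    toy.groupby(['SIC', 'FYEAR'])['ROE']
       .agg(['mean', 'median', 'std'])
       .rename(columns={'mean': 'IndustryMeanROE',
                        'median': 'IndustryMedianROE',
                        'std': 'IndustryStdROE'})
       .reset_index()
)

print(industry_features)
```

Grouping on both `SIC` and `FYEAR` handles every industry in one pass, so the per-industry code does not need to be repeated.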

In [3]:
# A list containing the simulated industry codes
industry_codes = [0, 1]

# select all of the companies in the first industry and groupby the fiscal year
df_industry = df[(df.SIC == industry_codes[0])].groupby('FYEAR')

tmp_df = pd.DataFrame()  # will store data for Industry Features

# Generate the unique Fiscal Years for the first industry
unique_FYEAR = np.sort(df[df.SIC == industry_codes[0]]['FYEAR'].unique())

# Add unique fiscal years to dataset
tmp_df['FYEAR'] =   unique_FYEAR

# Add SIC code to the industry dataset
tmp_df['SIC'] = [industry_codes[0]] * len(unique_FYEAR)

# Generate Mean, Median, STD by FYEAR for all companies in the given Industry
tmp_df['IndustryMeanROE'] = df_industry.mean()['ROE'].reset_index()['ROE']
tmp_df['IndustryMedianROE'] = df_industry.median()['ROE'].reset_index()['ROE']
tmp_df['IndustryStdROE'] = df_industry.std()['ROE'].reset_index()['ROE']


# ------
# Repeat the above using the second industry code
# ------

# select all of the companies in the second industry and groupby the fiscal year
df_industry = df[(df.SIC == industry_codes[1])].groupby('FYEAR')

tmp_df2 = pd.DataFrame()  # will store data for Industry Features

# Generate the unique Fiscal Years for the second industry
unique_FYEAR = np.sort(df[df.SIC == industry_codes[1]]['FYEAR'].unique())

# Add unique fiscal years to dataset
tmp_df2['FYEAR'] =   unique_FYEAR

# Add SIC code to the industry dataset
tmp_df2['SIC'] = [industry_codes[1]] * len(unique_FYEAR)

# Generate Mean, Median, STD by FYEAR for all companies in the given Industry
tmp_df2['IndustryMeanROE'] = df_industry.mean()['ROE'].reset_index()['ROE']
tmp_df2['IndustryMedianROE'] = df_industry.median()['ROE'].reset_index()['ROE']
tmp_df2['IndustryStdROE'] = df_industry.std()['ROE'].reset_index()['ROE']

# combine both dataframes together
industry_df = pd.concat([tmp_df, tmp_df2], ignore_index=True)

# display the dataframe containing Industry Features
display(industry_df)

Unnamed: 0,FYEAR,SIC,IndustryMeanROE,IndustryMedianROE,IndustryStdROE
0,0,0,0.542745,0.669882,0.388413
1,1,0,0.789729,0.773613,0.128247
2,2,0,0.60637,0.710653,0.33797
3,3,0,0.347485,0.30041,0.241338
4,4,0,0.470637,0.405314,0.371924
5,5,0,0.359511,0.191042,0.418027
6,6,0,0.554218,0.506055,0.287693
7,7,0,0.333028,0.392442,0.155583
8,8,0,0.48553,0.483055,0.276714
9,9,0,0.526436,0.428602,0.257066


##### The code below will create the labels that we want to predict, and merge industry data to each company

The logic is:
- Find all unique companies within an industry.
- For each company in that industry:
    - Select all of the company's data and store it in a pandas dataframe called `_c_df`.
    - Compute the difference between each value of the target column (ROE) and the next value for the company (`_c_df`). Next,
        - if the difference is less than 0, the ROE increases
        - if the difference is greater than 0, the ROE decreases
        - if the difference is 0, the ROE remained the same
        - otherwise (the difference is undefined because a value is missing) assign a NaN
    - Assign these changes to a column labeled `change` in `_c_df`. **(This is the label we want to classify)**
    - Grab the company's start and end year from `_c_df` and use them to select the industry features that fall within the same start and end year.
    - Merge the company data with the industry features.
    - Finally, append the company data, with the industry features and labels, to a dataframe that stores data about the industry: `industry1_data` if the SIC code is 0, or `industry2_data` if the SIC code is 1.
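
The labeling rule in the list above can be sketched on a tiny, made-up ROE series. Note that this sketch swaps the nested `np.where` calls for a sign flip with `np.sign`, which should give the same labels. The values and the names `roe`, `diffs`, and `change` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical ROE history for one company, ordered by fiscal year.
roe = pd.Series([0.2, 0.5, 0.5, 0.1])

# diff(periods=-1) computes ROE_t - ROE_{t+1}.
# The last row has no next value, so its difference is NaN.
diffs = roe.diff(periods=-1)

# A negative difference means next year's ROE is higher, so flip the sign:
# 1 = increase, -1 = decrease, 0 = unchanged, NaN = undefined.
change = -np.sign(diffs)

print(pd.DataFrame({'ROE': roe, 'diff': diffs, 'change': change}))
```

The final observation of every company has no following year, which is why its label is always missing.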

----
**Note:** while we know how the label `change` is constructed from `ROE` (from which `IndustryMeanROE`, `IndustryMedianROE`, and `IndustryStdROE` are also derived), machine learning models do not know this. The whole objective is to learn that function as well as possible so that the model can generalize beyond the examples in the training set. I strongly recommend reading this paper to get a better picture of this idea: https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
    

In [4]:
# this is the data that the RandomForestClassifier will use for Industry 1
industry1_data = pd.DataFrame()

# this is the data that the RandomForestClassifier will use for Industry 2
industry2_data = pd.DataFrame()

industry_codes = [0, 1] # list containing industry codes

# find all unique companies in the first industry
uniq_companies = df[(df.SIC == industry_codes[0])].permno.unique()

# for each unique company in the first industry
for company in uniq_companies:
    
    # Select all of the company data
    _c_df = df[df.permno == company]
    
    # Compute difference of next value for the target column
    c_diffs = _c_df['ROE'].diff(periods=-1)  
    
    # The code below does the following:
    # If the difference between the current period and the next period is less than 0,
    # ROE increases.
    # If the difference between the current period and the next period is greater than 0,
    # ROE decreases.
    # If the difference between the current period and the next period is 0,
    # ROE stays the same.
    # If the difference is none of these (i.e., it is NaN), the label receives a NaN.
    _c_df = _c_df.assign(
        change=np.where(
            c_diffs < 0, 1, np.where(
                c_diffs > 0, -1, np.where (
                    c_diffs == 0, 0, np.nan
                )
            )
        )
    )
    
    # Note while we know that the features are functions of the label,
    # machine learning models do not know this.
    # The whole objective is to learn the function as best as you can s.t
    # you can generalize beyond examples in the training set

    # I strongly recommend viewing this paper to get a better picture of this idea: 
    # https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
    
    
    # Grab the first & last year of the company.
    start_year = float(_c_df.FYEAR.iloc[0])
    end_year = float(_c_df.FYEAR.iloc[-1])

    # The idea behind this: the company's start/end year may differ from the industry's start/end year.
    # Next, grab the first industry's data between the company's start and end year.
    df_ind_same_year = industry_df[(industry_df.FYEAR >= start_year) & (industry_df.FYEAR <= end_year) & (industry_df.SIC == industry_codes[0])]


    # Next, merge both dataframes on fiscal year,
    # and also on SIC code to avoid duplicates.
    c_df = _c_df.merge(df_ind_same_year, on=['FYEAR', 'SIC'])

    # c_df is the final dataset for the company.

    # Now we append c_df to the final dataset for Industry 1
    industry1_data = pd.concat([industry1_data, c_df], ignore_index=True)

# ------
# Repeat the above for the second industry...
# ------

# find all unique companies in the second industry
uniq_companies = df[(df.SIC == industry_codes[1])].permno.unique()

# for each unique company in the second industry
for company in uniq_companies:
    
    # Select all of the company data
    _c_df = df[df.permno == company]
    
    # Compute difference of next value for the target column
    c_diffs = _c_df['ROE'].diff(periods=-1)  
    
    # The code below does the following:
    # If the difference between the current period and the next period is less than 0,
    # ROE increases.
    # If the difference between the current period and the next period is greater than 0,
    # ROE decreases.
    # If the difference between the current period and the next period is 0,
    # ROE stays the same.
    # If the difference is none of these (i.e., it is NaN), the label receives a NaN.
    _c_df = _c_df.assign(
        change=np.where(
            c_diffs < 0, 1, np.where(
                c_diffs > 0, -1, np.where (
                    c_diffs == 0, 0, np.nan
                )
            )
        )
    )
    
    # Note while we know that the features are functions of the label,
    # machine learning models do not know this.
    # The whole objective is to learn the function as best as you can s.t
    # you can generalize beyond examples in the training set

    # I strongly recommend viewing this paper to get a better picture of this idea: 
    # https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
    
    
    # Grab the first & last year of the company.
    start_year = float(_c_df.FYEAR.iloc[0])
    end_year = float(_c_df.FYEAR.iloc[-1])

    # The idea behind this: the company's start/end year may differ from the industry's start/end year.
    # Next, grab the second industry's data between the company's start and end year.
    df_ind_same_year = industry_df[(industry_df.FYEAR >= start_year) & (industry_df.FYEAR <= end_year) & (industry_df.SIC == industry_codes[1])]


    # Next, merge both dataframes on fiscal year,
    # and also on SIC code to avoid duplicates.
    c_df = _c_df.merge(df_ind_same_year, on=['FYEAR', 'SIC'])

    # c_df is the final dataset for the company.

    # Now we append c_df to the final dataset for Industry 2
    industry2_data = pd.concat([industry2_data, c_df], ignore_index=True)


##### In the code cell below we view the final industry datasets...

In [5]:
display("Displaying Industry %s data:"%industry_codes[0])
display(industry1_data)

display("Display Industry %s data:"%industry_codes[1])
display(industry2_data)

'Displaying Industry 0 data:'

Unnamed: 0,SIC,permno,FYEAR,ROE,change,IndustryMeanROE,IndustryMedianROE,IndustryStdROE
0,0,0,0,0.517298,1.0,0.542745,0.669882,0.388413
1,0,0,1,0.946963,-1.0,0.789729,0.773613,0.128247
2,0,0,2,0.76546,-1.0,0.60637,0.710653,0.33797
3,0,0,3,0.282396,-1.0,0.347485,0.30041,0.241338
4,0,0,4,0.221045,,0.470637,0.405314,0.371924
5,0,0,5,,,0.359511,0.191042,0.418027
6,0,0,6,0.167139,1.0,0.554218,0.506055,0.287693
7,0,0,7,0.392442,1.0,0.333028,0.392442,0.155583
8,0,0,8,0.618052,-1.0,0.48553,0.483055,0.276714
9,0,0,9,0.41193,,0.526436,0.428602,0.257066


'Display Industry 1 data:'

Unnamed: 0,SIC,permno,FYEAR,ROE,change,IndustryMeanROE,IndustryMedianROE,IndustryStdROE
0,1,5,0,0.901602,-1.0,0.71492,0.7986,0.296399
1,1,5,1,0.505759,,0.509324,0.482562,0.129182
2,1,5,2,,,0.618048,0.845009,0.45634
3,1,5,3,0.827716,-1.0,0.502561,0.527902,0.294046
4,1,5,4,0.231833,-1.0,0.346197,0.333716,0.22642
5,1,5,5,0.079055,1.0,0.540199,0.565076,0.360466
6,1,5,6,0.465763,1.0,0.270907,0.255131,0.21081
7,1,5,7,0.878976,-1.0,0.385569,0.418742,0.326428
8,1,5,8,0.147503,1.0,0.428091,0.345478,0.329749
9,1,5,9,0.231251,,0.493037,0.417711,0.323314
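
The tables above are the result of merging each company's rows with the industry features for the same fiscal year and SIC code. A minimal sketch of that merge on made-up data follows. The values and the `permno` of 7 are hypothetical, and only one industry feature is included to keep it short.

```python
import pandas as pd

# Hypothetical company rows: this company only reports in fiscal years 1 and 2.
company = pd.DataFrame({'SIC': [0, 0], 'permno': [7, 7],
                        'FYEAR': [1, 2], 'ROE': [0.3, 0.6]})

# Hypothetical industry features covering fiscal years 0 through 3.
industry = pd.DataFrame({'SIC': [0, 0, 0, 0], 'FYEAR': [0, 1, 2, 3],
                         'IndustryMeanROE': [0.5, 0.4, 0.7, 0.2]})

# An inner merge (the default) keeps only the (FYEAR, SIC) pairs present in both frames,
# so industry rows outside the company's reporting years are dropped.
merged = company.merge(industry, on=['FYEAR', 'SIC'])

print(merged)
```

Each company row picks up the industry statistics for its own fiscal year, which is why the industry columns repeat across companies in the same industry.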


##### The code cell below:

- Partitions the data into training and testing sets
- Scales features and replaces missing values
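
To make the second step concrete, here is a rough sketch of what median imputation and standard scaling compute, written with plain NumPy rather than scikit-learn. The matrix `X` and its values are made up for illustration. The scaling divides by the population standard deviation (`ddof=0`, NumPy's default), which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature matrix: rows are observations, columns are features.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 60.0]])

# Replace each NaN with the median of its column (the median ignores NaNs).
col_median = np.nanmedian(X, axis=0)
X_imputed = np.where(np.isnan(X), col_median, X)

# Standardize: subtract the column mean and divide by the column standard deviation.
X_scaled = (X_imputed - X_imputed.mean(axis=0)) / X_imputed.std(axis=0)

print(X_imputed)
print(X_scaled.round(3))
```

After scaling, every column has mean 0 and unit variance, so features measured on different scales contribute comparably.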


In [6]:
industry_codes = [0, 1] # list containing industry codes

# The 2 lines below store  the training/testing features for industry 1
industry1_train_features = None
industry1_test_features = None

# The 2 lines below store the training/testing labels for industry 1
industry1_train_labels = pd.Series()
industry1_test_labels = pd.Series()

# The 2 lines below store the training/testing position of the companies.
companies_train_position = pd.DataFrame()
companies_test_position = pd.DataFrame()

# find all unique companies in the first industry
uniq_companies = industry1_data.permno.unique()

# for each unique company
for company in uniq_companies:
    
    # select all of the company data
    tc_df = industry1_data[industry1_data.permno == company]
    
    # It does not make sense to predict a missing value.
    # We drop all rows where the change column has a NaN
    tc_df.dropna(subset=['change'], inplace=True)

    # Next we reset the index of the company dataframe
    # I do this to keep indices going from 0 to N
    tc_df.reset_index(drop=True, inplace=True)

    # Next I need to separate features and labels

    # I drop the change column because this is a label.
    # In addition to this I drop SIC because this is not a feature in 05b
    features = tc_df.drop(['change', 'SIC'], axis=1)

    # I make a copy of the features to store company position and values before scaling.
    _features = deepcopy(features)

    # I select the change column because this is the label we want to predict.
    labels = tc_df['change']

    # Next we replace missing values with the median of the column
    imp = Imputer(strategy='median', verbose=1)
    features = imp.fit_transform(features)
    
    # Note: When I replaced the missing values, this operation
    # converted our features from a pandas
    # dataframe to a numpy array

    # scale features by removing the mean and scaling to unit variance
    scaler = StandardScaler()
    features = scaler.fit_transform(features)
    
    #####
    # NOTE: there has been confusion about whether a company's data is in both
    # the training and testing sets.
    # The code below addresses this.
    #####

    # Reset the label index before splitting into training and testing labels
    labels.reset_index(drop=True, inplace=True)

    # Find the total number of labels
    n = len(labels)

    # Generate training indices from 0 up to the position that covers 80% of the labels
    train_ind = [i for i in range(int(n * .8))]

    # I store copies of the training companies' position and FYEAR in a dataframe
    # This is solely for the explainability of what the training/testing data
    # looks like
    companies_train_position = pd.concat(
        [companies_train_position, _features.iloc[train_ind]])

    # Generate testing indices from the end of the training set to the end of the dataframe
    test_ind = [i for i in range(int(n * .8), n)]

    # I store copies of the testing companies' position and FYEAR in a dataframe
    # This is solely for explainability
    companies_test_position = pd.concat(
        [companies_test_position, _features.iloc[test_ind]])

    # select the training labels using the training indices
    train_labels = labels[train_ind].reset_index(drop=True)

    # select the testing labels by dropping the training indices
    test_labels = labels.drop(tc_df.index[train_ind]).reset_index(drop=True)
    
    # Note: the training features are numpy multi-dimensional arrays
    # select the training features using the training indices
    train_features = features[train_ind]

    # select the testing features by deleting the training indices from the features array
    test_features = np.delete(features, train_ind, axis=0)

    # Now save the company training labels to a
    # pandas series that will contain all company training labels in an industry
    industry1_train_labels = pd.concat([industry1_train_labels, train_labels], ignore_index=True)

    # Now save the company testing labels to a
    # pandas series that will contain all company testing labels in an industry
    industry1_test_labels = pd.concat([industry1_test_labels, test_labels], ignore_index=True)
    
    # Now save the company training features to a
    # numpy array that will contain all company training features in an industry
    if industry1_train_features is None:
        industry1_train_features =  train_features
    else:
        industry1_train_features =  np.concatenate((industry1_train_features,
                                                  train_features), axis=0)

    # Now save the company testing features to a
    # numpy array that will contain all company testing features in an industry
    if industry1_test_features is None:
        industry1_test_features =  test_features
    else:
        industry1_test_features =  np.concatenate((industry1_test_features,
                                                  test_features), axis=0)


#------
# Now you can repeat the code in this cell for another industry.
# To avoid making the rest of this notebook too lengthy,
# I will only perform predictions on Industry 0. To make predictions for
# Industry 1, rerun the code in this cell on the Industry 1 data and then
# rerun the prediction cell on the resulting training and testing sets.
#------


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


##### In the cell below I display what the training features and testing features look like:

In [7]:
display("Displaying Training Features for companies within an industry:")
display(companies_train_position)

display("Displaying Testing Features for companies within an industry:")
display(companies_test_position)


'Displaying Training Features for companies within an industry:'

Unnamed: 0,permno,FYEAR,ROE,IndustryMeanROE,IndustryMedianROE,IndustryStdROE
0,0,0,0.517298,0.542745,0.669882,0.388413
1,0,1,0.946963,0.789729,0.773613,0.128247
2,0,2,0.76546,0.60637,0.710653,0.33797
3,0,3,0.282396,0.347485,0.30041,0.241338
4,0,6,0.167139,0.554218,0.506055,0.287693
0,1,0,0.002465,0.542745,0.669882,0.388413
1,1,1,0.884032,0.789729,0.773613,0.128247
2,1,2,0.884948,0.60637,0.710653,0.33797
3,1,3,0.30041,0.347485,0.30041,0.241338
4,1,4,0.589582,0.470637,0.405314,0.371924


'Displaying Testing Features for companies within an industry:'

Unnamed: 0,permno,FYEAR,ROE,IndustryMeanROE,IndustryMedianROE,IndustryStdROE
5,0,7,0.392442,0.333028,0.392442,0.155583
6,0,8,0.618052,0.48553,0.483055,0.276714
7,1,7,0.065075,0.333028,0.392442,0.155583
8,1,8,0.294744,0.48553,0.483055,0.276714
5,2,7,0.346489,0.333028,0.392442,0.155583
6,2,8,0.869785,0.48553,0.483055,0.276714
7,3,7,0.464386,0.333028,0.392442,0.155583
8,3,8,0.162012,0.48553,0.483055,0.276714
4,4,7,0.396746,0.333028,0.392442,0.155583
5,4,8,0.483055,0.48553,0.483055,0.276714
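
The tables above show that every company contributes its earliest observations to the training set and its latest ones to the testing set. A small sketch of that per-company chronological 80/20 split is given below. The helper name `chronological_split` and the toy company are hypothetical and only meant to illustrate the idea.

```python
import pandas as pd

def chronological_split(frame, train_fraction=0.8):
    """Split one company's rows (already sorted by year) into an early and a late part."""
    cutoff = int(len(frame) * train_fraction)
    return frame.iloc[:cutoff], frame.iloc[cutoff:]

# Hypothetical company with seven labeled observations, sorted by fiscal year.
company = pd.DataFrame({'permno': [0] * 7,
                        'FYEAR': [0, 1, 2, 3, 6, 7, 8]})

train, test = chronological_split(company)

print("train years:", train['FYEAR'].tolist())
print("test years: ", test['FYEAR'].tolist())
```

Because the split is made separately for each company, the same company appears in both sets, but its testing rows always come later in time than its training rows.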


##### Now that the data is formatted we make predictions on the testing data in the code cell below by:

- Creating a Random Forest Tree to classify the data using `RandomForestClassifier`
- Creating a dictionary of parameters to optimize
- Creating a `GridSearchCV` object to optimize the parameters of the Random Forest Tree using the `TimeSeriesSplit` cv iterator
- Training models with `GridSearchCV` on the industry 1 training data
- Using the best Random Forest Tree found by `GridSearchCV` to make predictions on industry 1's testing data
- Displaying the testing score
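
To see what the `TimeSeriesSplit` cv iterator hands to `GridSearchCV`, the sketch below prints the train/validation indices it generates for eight made-up observations. The data and the name `folds` are assumptions for illustration. `n_splits=3` is passed explicitly because the default number of splits differs between scikit-learn versions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight hypothetical observations in time order.
X = np.arange(8).reshape(-1, 1)

# Each fold trains on an expanding window of earlier rows
# and validates on the block of rows that follows it.
splitter = TimeSeriesSplit(n_splits=3)
folds = [(train_idx.tolist(), val_idx.tolist())
         for train_idx, val_idx in splitter.split(X)]

for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Unlike shuffled k-fold cross-validation, the validation rows in every fold come after all of that fold's training rows.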

In [8]:
# I create a Random Forest Tree to classify data for Industry 0.
# Like in the 05b notebook, I set the random seed to 23.
# I also set the number of trees to a higher value than the default value.
tree_for_industry0 = RandomForestClassifier(random_state=23, n_estimators=100)

# I create a dictionary that contains parameters I want to search over.
# I can find a list of parameters for a RandomForestClassifier here:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

# Since this is a demo I will search over 1 parameter.
prms = {'criterion':['gini', 'entropy']}

# I now create an object to optimize the parameters of our tree for industry 0,
# using TimeSeriesSplit as a cross-validation iterator
search = GridSearchCV(tree_for_industry0, param_grid=prms,
                      cv=TimeSeriesSplit(), verbose=3)

# I now optimize our tree using training data.
search.fit(industry1_train_features, industry1_train_labels)

# I now make predictions on the testing features and compute the testing accuracy score
testing_accuracy_score = search.score(industry1_test_features, industry1_test_labels)


Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] criterion=gini ..................................................
[CV] ......... criterion=gini, score=0.5714285714285714, total=   0.1s
[CV] criterion=gini ..................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s remaining:    0.0s


[CV] ........ criterion=gini, score=0.42857142857142855, total=   0.1s
[CV] criterion=gini ..................................................
[CV] ......... criterion=gini, score=0.7142857142857143, total=   0.1s
[CV] criterion=entropy ...............................................
[CV] ...... criterion=entropy, score=0.5714285714285714, total=   0.1s
[CV] criterion=entropy ...............................................
[CV] ..... criterion=entropy, score=0.42857142857142855, total=   0.1s
[CV] criterion=entropy ...............................................
[CV] ...... criterion=entropy, score=0.7142857142857143, total=   0.1s


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.8s finished


In [9]:
print('Testing Accuracy Score for Industry 1 ROE:', testing_accuracy_score)

Testing Accuracy Score for Industry 1 ROE: 0.5
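
For a classifier, the score reported above is the mean accuracy: the fraction of testing labels that were predicted correctly. A minimal sketch with made-up labels and predictions (not the notebook's actual predictions):

```python
import numpy as np

# Hypothetical true labels and predictions (1 = ROE increases, -1 = ROE decreases).
y_true = np.array([1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1])

# Accuracy is the share of positions where prediction and label agree.
accuracy = np.mean(y_true == y_pred)

print("accuracy:", accuracy)
```

With two of the four predictions correct, this toy example also scores 0.5.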
