# Foreword

For this notebook, I wanted to try something different from most of what I have seen done with this dataset so far. For background, I am a full-time banker and, in my leisure time, a self-proclaimed data scientist (or, I guess, just a banker). So what I am trying to do here is bring some financial knowledge to bear and make something out of this dataset.

Most people who have played with these data tried to predict future price movements from past prices. While this seems natural, empirical evidence shows that it is a difficult, if not impossible, task (this is what the weak form of the efficient market hypothesis states).

This notebook is a WIP that I copy-pasted from an existing ipynb file stored on my computer (for some reason, I could not import it on Kaggle). I'll try to keep both versions up to date.

Enjoy!!

# Introduction
## Objectives

The aim of this notebook is to demonstrate how machine learning can be used to enhance the stock-picking process of a value-oriented equity fund using the backward-looking P/E ratio.

The first section of this notebook is used to import libraries, retrieve data and define functions that might be useful later. The second section picks a bunch of securities using the traditional/naive P/E approach. The last part picks securities using various machine learning techniques.

## Reminder on Value vs Growth Funds

Investment funds can be split into a wide range of categories, depending on their assets, strategies and so on. Traditional (i.e. non-hedge-fund) equity funds are often described as "value", "blend" or "growth", depending on the type of equity they target. This taxonomy is used, for instance, by <a href=http://www.morningstar.com/>Morningstar</a> for its fund rankings.

In the value style, the aim of the fund manager is to outperform a benchmark by selecting stocks that she believes to be undervalued. In contrast, growth-oriented managers try to pick stocks with a high potential for growth, even though they may not be cheap relative to the market (the high price paid would then be compensated by growth in the price).

Both styles have their merits and may perform well in different market conditions. The goal here is not to decide which style is better but, assuming a value style has been chosen, to determine which stocks should be picked.

## Financial Multiples and the P/E

The use of multiples is widespread in the financial sector. Among other things, multiples make it possible to screen a broad range of stocks and select those that might be good investment opportunities. One such multiple is the P/E, or "price-to-earnings", ratio. As its name indicates, it is the price of a share divided by the company's earnings per share. Multiplying both the numerator and the denominator by the number of shares, this formula can be rewritten using the market capitalization and the total earnings of the company.

$$P/E = {\mbox{Price per share}\over{\mbox{Earnings per Share}}} = {\mbox{Market Capitalization}\over{\mbox{Earnings}}}$$

From now on, I will use "Price" and "Earnings" in the formula, but the reader should understand them as "per share" or as company-wide totals, depending on the context. I may also use the abbreviation "EPS" for earnings per share.
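
As a quick numerical check of the identity above, the sketch below computes the ratio both per share and for the whole company. All figures are made up for illustration and do not come from the dataset.

```python
# Hypothetical figures, chosen only to illustrate the identity
price_per_share = 50.0           # USD
earnings_per_share = 4.0         # USD
shares_outstanding = 10_000_000

# Scale both quantities up to the level of the whole company
market_cap = price_per_share * shares_outstanding
total_earnings = earnings_per_share * shares_outstanding

pe_per_share = price_per_share / earnings_per_share
pe_company = market_cap / total_earnings
print(pe_per_share, pe_company)
```

The number of shares cancels out, so both ratios are the same.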

This simple formula overlooks the practical difficulties associated with the use of P/E. Indeed, this formula says nothing about the timeframe of the numerator and denominator. Traditionally, two different measures are used:

$$P/E_\mbox{forward looking} = {{Price_{t_0}}\over{Earnings_{t_{1f}}}}$$
$$P/E_\mbox{backward looking} = {{Price_{t_0}}\over{Earnings_{t_0}}}$$

Where $t_0$ refers to the "current" time and $t_{1f}$ the forecasted value one year forward.
As one could imagine, the forward-looking P/E is more cumbersome to use than the backward-looking P/E, as it requires more inputs in order to forecast the company's earnings one year ahead. The backward-looking P/E is therefore easier to use in a screening process. It is less "precise" in the sense that it does not incorporate estimates of the company's future earnings, but it is more stable and does not depend on the model used to forecast them. The reader should note, however, that given a methodology to estimate the year-ahead earnings, the backward-looking P/E used in this notebook could be replaced by the forward-looking P/E without much difficulty.
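
To make the difference between the two measures concrete, here is a minimal sketch. The function names and the figures are hypothetical; in practice the forecast EPS would have to come from an analyst estimate or a model, which is exactly the extra input discussed above.

```python
def backward_pe(price_t0, earnings_t0):
    """P/E based on the latest reported earnings."""
    return price_t0 / earnings_t0

def forward_pe(price_t0, forecast_earnings_t1):
    """P/E based on earnings forecast one year ahead."""
    return price_t0 / forecast_earnings_t1

# Made-up inputs: the forecast EPS is an assumption, not observed data
price = 60.0
trailing_eps = 4.0
forecast_eps = 5.0

print(backward_pe(price, trailing_eps))
print(forward_pe(price, forecast_eps))
```

When earnings are expected to grow, the forward-looking ratio is lower than the backward-looking one for the same price.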

In the remainder of this notebook, unless otherwise specified, the term "P/E" will refer to the backward looking P/E.

# Environment Setup

## Libraries
This notebook uses the traditional Python libraries for data processing and statistical learning (numpy, pandas, scikit-learn). To visualize graphs, I mostly use pandas' built-in plotting for convenience, and seaborn for purely aesthetic reasons (i.e. I like the final look of seaborn graphs). I import them into the namespace under their conventional names.


In [None]:
%matplotlib inline

#Basic data manipulation
import numpy as np
import pandas as pd

#Tools
import datetime as dt

#IPython manager
from IPython.display import display

#Graphs
import seaborn as sns
import matplotlib.pyplot as plt

#Machine learning
from sklearn.neighbors import NearestNeighbors

## Data

In [None]:
fundamentals = pd.read_csv('../input/fundamentals.csv',index_col = 0)
prices = pd.read_csv('../input/prices.csv')
securities = pd.read_csv('../input/securities.csv',index_col=0)
sectors = securities['GICS Sector'].unique()
sub_industry = securities['GICS Sub Industry'].unique()
accounts = fundamentals.columns.values[2:]

## Helper Functions

In [None]:
def set_year(period):
    """
    Take a date as a string formatted such as in the fundamentals dataframe and returns its year as an integer
    
    Parameters
    ----------
    period: Date as '%Y-%m-%d'
    
    Returns
    -------
    year: Year in the 'Period Ending' column
    """
    x = dt.datetime.strptime(period,'%Y-%m-%d')
    if x.month >= 6:        #If the reporting period ends in June or later, return the reported year
        return x.year
    else:                   #Else, return the year before
        return x.year - 1

def get_publication_date(period):
    """
    Returns the estimated publication date as a datetime object, assuming a 90 days delay from the reporting day
    
    Parameters
    ----------
    period: Reporting date string as '%Y-%m-%d'
    
    Returns
    --------
    published: Publication date as datetime object
    
    """
    x = dt.datetime.strptime(period,'%Y-%m-%d') + dt.timedelta(days=90)
    
    return x
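
To see how these two helpers behave, the sketch below repeats compact versions of them so that it runs on its own. The sample dates are made up for illustration.

```python
import datetime as dt

def set_year(period):
    # Same rule as above: periods ending before June count towards the previous year
    x = dt.datetime.strptime(period, '%Y-%m-%d')
    return x.year if x.month >= 6 else x.year - 1

def get_publication_date(period):
    # Same rule as above: reporting date plus 90 days
    return dt.datetime.strptime(period, '%Y-%m-%d') + dt.timedelta(days=90)

# Illustrative period-end dates (not taken from the dataset)
for period in ['2015-12-31', '2016-03-31', '2016-06-30']:
    print(period, set_year(period), get_publication_date(period).date())
```

A fiscal year ending in March 2016 is thus grouped with calendar-year 2015 reports, which keeps companies with different fiscal year-ends roughly comparable.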

# Traditional P/E Stock Picking
## Introduction

Before optimizing the screening process, this section demonstrates how the non-optimized process works.

First, let us recall that the P/E can be obtained this way:

$$P/E_\mbox{backward looking} = {{Price_{t_{0}}}\over{Earnings_{t_{0}}}}$$

Naturally, if $Earnings_{t_{0}} < 0$, this measure is of no use in trying to pick the most undervalued stocks. These datapoints will therefore be removed ex post.

The reader should also understand that the amount of earnings per share is meaningless in itself. Indeed, a higher value is not necessarily better for the investor. For instance, if we have an EPS of USD 10.- with a share price of USD 100.-, this is equivalent to an EPS of USD 100.- with a share price of USD 1000.- (excluding effects on liquidity and other matters which are not discussed here).

One difficulty which has not been discussed yet is the ambiguity surrounding the $t_{0}$ concept. A delay of several months exists between the end of the period covered by the financial statements and their official publication, mostly due to the auditing process. Companies may or may not discuss these figures before publication (with the necessary caution to avoid disclosing material non-public information or making misrepresentations) or issue a profit warning.

While this dataset offers a broad range of data, the 'fundamentals' table is drawn from audited SEC filings, so I will assume that $t_0$ refers here to the day when the financial statements are publicly released, and I will update the formula to:

$$P/E_\mbox{backward looking} = {{Price_{\mbox{release date}}}\over{Earnings_{\mbox{release date - 3 months}}}}$$

## Earnings per Share Data Exploration

My first goal when looking at EPS is to remove datapoints where $EPS_t < 0$ and to get a global idea of the figures I have. The latter is only to make sure they make sense. Once again, the level of EPS in itself is not representative of how well a company is doing, as it is tied to the share price. The main reason to do this is that an extremely large value would be suspicious.

In [None]:
df_eps = fundamentals[['Ticker Symbol','Period Ending','Earnings Per Share']].copy() #Copy to avoid modifying a view of fundamentals
df_eps['Year'] = df_eps['Period Ending'].map(set_year)   #Map each 'Period Ending' date to its reporting year
df_eps = df_eps[df_eps['Year'] > 2011] #Just keep years from 2012 onwards

df_eps[[col for col in df_eps.columns if col not in ['Period Ending','Year']]].describe() #I do not need stats about 'Period Ending'

In [None]:
_ = [col for col in df_eps.columns if col != 'Period Ending']
sns.violinplot(data=df_eps[_],x='Year',y='Earnings Per Share')

Both the description of EPS and the graph indicate that no extremely large values are present. While there seem to be a few unusually high values, at this stage they might still be explained by a high share price.

I can now remove negative EPS values, along with positive values too close to zero to give a meaningful ratio.

In [None]:
df_eps = df_eps[df_eps['Earnings Per Share'] > 0.10] #Remove any data where EPS is USD 0.10 (10 cts) or less

## Computing the P/E Ratio

Before going any further, I will remove companies that are not present in both the `prices` and `fundamentals` dataframes.

In [None]:
#Keep only tickers that appear in both dataframes
fundamental_tickers = set(fundamentals['Ticker Symbol'].unique())
companies = [symb for symb in prices['symbol'].unique() if symb in fundamental_tickers]

df_prices = prices[prices['symbol'].isin(companies)].copy() #Copy, as the 'date' column is modified later
df_eps = df_eps[df_eps['Ticker Symbol'].isin(companies)]

#This is a quick summary of the 447 remaining stocks
securities.loc[companies].head()

The objective now is to retrieve the prices of these securities at the publication date. Since the actual publication date is not available in these data, I will assume that 90 days elapse between the end of the reporting period and the publication. This is a rule of thumb based on past experience and on knowledge from friends working in the Big 4.

A plain `pd.merge` joins on exact key values only, so it cannot match "approximate" dates. Therefore, I will use a trick that I found on <a href="https://stackoverflow.com/questions/33421551/how-to-merge-two-data-frames-based-on-nearest-date">Stack Overflow</a>. The basic idea is to use a KNN algorithm with $k=1$.

I will then refine these results by setting the retrieved price to NA should the nearest date be more than one month away from the estimated publication date (I would then consider that this data is not available).

As the following steps are rather time-consuming, I split them across several cells.

In [None]:
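def clean_prices_date(d):
    #Dates in the prices file come with or without a time component
    try:
        return dt.datetime.strptime(d,'%Y-%m-%d 00:00:00')
    except ValueError:
        return dt.datetime.strptime(d,'%Y-%m-%d')

def find_nearest(group, match, groupname):
    #For each publication date of one ticker, find the closest trading date of the same ticker
    match = match[match[groupname] == group.name]
    #Dates are cast to integers (nanoseconds since epoch) so that distances are numeric
    nbrs = NearestNeighbors(n_neighbors=1).fit(match['date'].values.astype('int64')[:, None])
    dist, ind = nbrs.kneighbors(group['Published'].values.astype('int64')[:, None])

    group['KeyDate'] = group['Published']
    group['ActualDate'] = match['date'].values[ind.ravel()]
    return group

df_eps['Published'] = df_eps['Period Ending'].map(get_publication_date)
df_prices['date'] = df_prices['date'].map(clean_prices_date)

#group_keys=False keeps 'Ticker Symbol' as a regular column only, so the later merge stays unambiguous
df_eps_mod = df_eps.groupby('Ticker Symbol', group_keys=False).apply(find_nearest, df_prices, 'symbol')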


In [None]:
#Rename the price columns so that they match the keys of df_eps_mod
df_prices_mod = df_prices.rename(columns={'symbol':'Ticker Symbol','date':'ActualDate'})
df_merged = pd.merge(df_eps_mod,df_prices_mod,on=['ActualDate','Ticker Symbol'])
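
As a possible alternative to the KNN trick, `pd.merge_asof` can match each row to the nearest date within a group and discard matches that are too far away. The sketch below is not what this notebook runs. It uses made-up toy data, and the 30-day tolerance is an assumption standing in for the "one month" rule mentioned above.

```python
import pandas as pd

# Toy data, made up for illustration (column names mirror the notebook's)
eps_toy = pd.DataFrame({
    'Ticker Symbol': ['AAA', 'AAA', 'BBB'],
    'Published': pd.to_datetime(['2015-03-31', '2016-03-30', '2015-03-31']),
    'Earnings Per Share': [2.0, 2.5, 1.0],
})
prices_toy = pd.DataFrame({
    'symbol': ['AAA', 'AAA', 'BBB'],
    'date': pd.to_datetime(['2015-04-01', '2016-03-29', '2015-06-15']),
    'close': [40.0, 55.0, 12.0],
})

# merge_asof requires both frames to be sorted on their date key
merged_toy = pd.merge_asof(
    eps_toy.sort_values('Published'),
    prices_toy.sort_values('date'),
    left_on='Published', right_on='date',
    left_by='Ticker Symbol', right_by='symbol',
    direction='nearest',              # closest date, before or after
    tolerance=pd.Timedelta(days=30),  # assumed cut-off: farther matches become NaN
)
print(merged_toy[['Ticker Symbol', 'Published', 'date', 'close']])
```

In this toy example the only price available for `BBB` is more than 30 days away from its publication date, so its `close` is left missing rather than matched to a stale quote.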


We can now compute the P/E ratio for our companies by simply dividing the columns. Concerning the price to use, we have the choice among four values:
<ul>
<li>The opening price</li>
<li>The closing price</li>
<li>The max price of the day</li>
<li>The min price of the day</li>
</ul>

Whichever is chosen, the result would be arbitrary. I opted for the closing price here, as it is often used by investment funds to compute their daily NAV.

In [None]:
df_pe = df_merged[['Ticker Symbol','Year','Earnings Per Share','close']].copy() #Copy, as a 'PE' column is added below
df_pe['PE'] = df_pe['close'] / df_pe['Earnings Per Share']

The computation of the P/E ratio is now complete. To get a better view of the results, I will split these P/E ratios by sector. As will be discussed later, companies in the same sector/industry are expected to have similar P/E ratios.

In [None]:
_ = securities.loc[df_pe['Ticker Symbol'],'GICS Sector']
_.index = df_pe.index
df_pe_sector = pd.concat([df_pe,_],axis=1)

In [None]:
yearsplt = [2014,2015]
g = sns.violinplot(data=df_pe_sector[df_pe_sector['Year'].isin(yearsplt)][['PE','GICS Sector','Year']],
                   x='GICS Sector',y='PE',hue='Year',split=True)
plt.xticks(rotation=90)

The table below summarizes the average P/E ratio per sub-industry and year.

In [None]:
_ = securities.loc[df_pe['Ticker Symbol'],'GICS Sub Industry']
_.index = df_pe.index
df_pe_subindustry = pd.concat([df_pe,_],axis=1)

In [None]:
pd.pivot_table(data=df_pe_subindustry[['GICS Sub Industry','PE','Year']],
               index='GICS Sub Industry',columns='Year',values='PE').sort_values(2015,ascending=False)

# Predicting P/E Ratios

Now that I have obtained the P/E through traditional means, it is time to add some machine learning on top in order to see which companies are off track.

In order to predict what the correct P/E should be, let us go back to some theory. The P/E ratio of a given company can be modeled using the Gordon growth model. This model assumes that:
<ol>
<li>The growth rate of earnings is constant over time</li>
<li>The discount rate applied to a company's cash flows is constant</li>
<li>The relevant cash flow for valuation is dividends</li>
</ol>

While this model is well suited for large, mature companies, it is less so for startups and tech companies with a high/multi-staged growth rate. Nevertheless, it gives us some indications on features that might be relevant to predict the P/E ratio.

Assuming that dividends are discounted at a rate $K_e$ (for cost of equity) and grow at a rate $g$, the price of a share is:

$$P_{t_0}={{Dividends_{t_1}}\over{K_e - g}}$$

Using the fact that the growth rate of dividends is constant (thus $Dividends_{t_1} = Dividends_{t_0} * (1 + g)$), this formula can be rewritten as follows:

$$P_{t_0}={{Dividends_{t_0} * (1 + g)}\over{K_e - g}}$$

Finally, we can compute the price of a share as:

$$P_{t_0}={{Earnings_{t_0} * b * (1 + g)}\over{K_e - g}}$$
where $b = {{Dividends_{t_0}}\over{Earnings_{t_0}}} = \mbox{Dividends distribution rate}$

This leads to an ex-ante backward-looking P/E ratio of:

$${P/E}_{t_0} = {{b * (1 + g)}\over{K_e - g}}$$
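
To get a feel for how sensitive this ratio is to its inputs, here is a minimal sketch of the formula. The function name `gordon_pe` and the parameter values are hypothetical and chosen only for illustration.

```python
def gordon_pe(b, g, k_e):
    """Backward-looking P/E implied by the Gordon growth model.

    b   : dividend distribution rate
    g   : constant growth rate
    k_e : cost of equity (must exceed g for the model to be defined)
    """
    if k_e <= g:
        raise ValueError("the model requires k_e > g")
    return b * (1 + g) / (k_e - g)

# Made-up inputs: same payout and cost of equity, two growth rates
print(gordon_pe(b=0.4, g=0.03, k_e=0.08))
print(gordon_pe(b=0.4, g=0.05, k_e=0.08))
```

Because $K_e - g$ sits in the denominator, a small change in the growth rate moves the implied P/E considerably, which is one reason the model struggles with high-growth companies.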

We can now estimate the dividend distribution rate using the data in the "fundamentals" table. A simple accounting relationship gives us:
$$\mbox{Retained Earnings}_{t_1} = \mbox{Retained Earnings}_{t_0} + \mbox{Net Income}_{t_1} - Dividends_{t_1}$$

Thus:

$$Dividends_{t_1} = (\mbox{Retained Earnings}_{t_0} - \mbox{Retained Earnings}_{t_1}) + \mbox{Net Income}_{t_1} $$

Here, dividends account for both actual dividends and share repurchases. Please note that retained earnings are part of the balance sheet and should therefore be treated as a "stock", while net income and dividends are both included in the P&L and are therefore considered flows. The $t_0$ and $t_1$ indices have been chosen here so that they represent the actual year of the financial statements in which the figures appear.

We can then compute the dividend distribution rate:

$$b = 1 + {{(\mbox{Retained Earnings}_{t_0} - \mbox{Retained Earnings}_{t_1})}\over{\mbox{Net Income}_{t_1}}}$$
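
The two relationships above can be checked on a small worked example. The function names and the figures (in USD millions) are made up for illustration.

```python
def implied_dividends(re_t0, re_t1, net_income_t1):
    """Dividends (incl. buybacks) implied by the change in retained earnings."""
    return (re_t0 - re_t1) + net_income_t1

def distribution_rate(re_t0, re_t1, net_income_t1):
    """Dividend distribution rate b = 1 + (RE_t0 - RE_t1) / NI_t1."""
    return 1 + (re_t0 - re_t1) / net_income_t1

# Hypothetical company: retained earnings grow from 100 to 130 on a net income of 50
re_t0, re_t1, ni_t1 = 100.0, 130.0, 50.0

dividends = implied_dividends(re_t0, re_t1, ni_t1)
b = distribution_rate(re_t0, re_t1, ni_t1)
print(dividends, b, dividends / ni_t1)
```

Of the 50 earned, 30 were retained, so 20 were distributed. Both routes to $b$ (the formula and dividends divided by net income) give the same rate.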

In [None]:
#Retained earnings and net income per company, keyed by reporting year
df_re_ni = fundamentals[['Ticker Symbol','Period Ending','Net Income','Retained Earnings']].copy()
df_re_ni['Year'] = df_re_ni['Period Ending'].map(set_year)
df_re_ni = df_re_ni[df_re_ni['Year']>2011]

#Retained earnings in USD millions: re0 is the opening balance, re1 the closing balance
_ = df_re_ni[['Ticker Symbol','Year','Retained Earnings']]
pt_re = pd.pivot_table(data=_,values='Retained Earnings',columns='Year',index='Ticker Symbol')
re0 = pt_re[list(range(2012,2016))]/1000000
re1 = pt_re[list(range(2013,2017))]/1000000
re0.columns = re1.columns   #Label each opening balance with its closing year
delta_re = re0 - re1

#Net income in USD millions, then b = 1 + delta_re / net income
_ = df_re_ni[['Ticker Symbol','Year','Net Income']]
ni = pd.pivot_table(data=_,values='Net Income',columns='Year',index='Ticker Symbol')/1000000
df_b = 1+delta_re.divide(ni[delta_re.columns])
df_b.columns = ['b_'+str(year) for year in df_b.columns]

#Put P/E and b side by side for 2014 and attach the sector
pt_pe = pd.pivot_table(data=df_pe,columns='Year',values='PE',index='Ticker Symbol')
pt_pe.columns = ['pe_'+str(year) for year in pt_pe.columns]
_ = pd.concat([pt_pe,df_b],axis=1)[['pe_2014','b_2014']]
_ = pd.concat([_,securities.loc[_.index][['GICS Sector']]],axis=1)
#Keep only points with a P/E below 60 and a distribution rate between 0 and 1
_ = _[(_['pe_2014'] < 60) & (_['b_2014'] < 1) & (_['b_2014'] > 0)]

sns.lmplot(data=_,x='b_2014',y='pe_2014',
           hue='GICS Sector',col='GICS Sector',col_wrap=3, ci=None, truncate=True,
           sharex = False)

Without digging further, some of these graphs would suggest that the theory does not pass the smell test. However, the following comments provide some perspective:
<ul>
<li>First of all, recall that we are looking only at one of several components of a very simplified model. The fact that some of these graphs display an upward sloping trend is encouraging in itself.</li>
<li>Long-standing sectors such as industrials, real estate and financials, which can now be considered mature, fit the model more or less neatly.</li>
<li>IT and health care are growing sectors (the first for obvious reasons, the second due to the aging of the population). The fact that they exhibit a downward-sloping curve is not alarming.</li>
<li>This is a simple hypothesis that is worth investigating further, but the flat curve for the energy sector could be explained by discrepancies between companies specialized in shale and traditional energy producers.</li>
<li>Materials and consumer discretionary are harder to explain.</li>
</ul>