# Equity Research Report - Classification Ratings

Every equity researcher examines a company's financial statements. Each statement may be adjusted to reflect certain items (depending on the industry), such as adding back non-recurring items or adjusting for depreciation and amortization. Some of these adjustments are necessary, and they require close attention to the footnotes in SEC filings.

With that in mind, is it possible to predict an earnings classification rating based on financial statements and the researchers' output?

This notebook analyzes over 10 years of CFRA reports together with the corresponding quarterly filings.

### Process

- Find all equity research reports and their corresponding financial statement filings
- Using domain knowledge, select the necessary features from the three financial statements
- (future) The SEC has started providing footnotes. We will need to map each footnote ID to its company

The ratings are also broken down into another feature called `delt`, which captures when an analyst indicates a change in position. For example: "...maintain buy", "...upgrade from buy to strong buy", "...downgrade from strong buy to buy", etc.
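
To make the idea of a change-in-position feature concrete, here is a minimal sketch of how such phrases could be turned into a signed number. The five-level `RATING_SCALE`, the regular expressions, and the `rating_delta` helper are all assumptions made for illustration; they are not the encoding used to build the dataset in this notebook.

```python
import re

# Hypothetical ordinal scale; the real CFRA scale and its encoding may differ.
RATING_SCALE = {"strong sell": 1, "sell": 2, "hold": 3, "buy": 4, "strong buy": 5}

# Longest names first so "strong buy" is tried before "buy".
_RATING_PATTERN = "|".join(sorted(RATING_SCALE, key=len, reverse=True))
_CHANGE_RE = re.compile(
    rf"(?:upgrade|downgrade)\s+from\s+({_RATING_PATTERN})\s+to\s+({_RATING_PATTERN})"
)
_MAINTAIN_RE = re.compile(rf"(?:maintain|reiterate)\s+({_RATING_PATTERN})")


def rating_delta(text):
    """Return the signed rating change implied by an analyst phrase, or None."""
    lowered = text.lower()
    change = _CHANGE_RE.search(lowered)
    if change:
        old, new = change.groups()
        return RATING_SCALE[new] - RATING_SCALE[old]
    if _MAINTAIN_RE.search(lowered):
        return 0  # position unchanged
    return None  # no recognizable rating action in the text


for phrase in ["...maintain buy",
               "...upgrade from buy to strong buy",
               "... downgrade from strong buy to buy"]:
    print(phrase, "->", rating_delta(phrase))
```

Returning `None` for unrecognized text keeps "no rating action found" distinct from "rating maintained", which would otherwise both look like zero.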


### A Word on Models
This model is simplified so that certain features can be tested first.

- There are class imbalance problems (of course!)
- Imbalance occurs when one class has significantly more or fewer samples than the others.
- Test four classification models (logistic regression, gradient boosting, random forest, decision tree)
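
As a rough way to quantify the imbalance mentioned above, the sketch below counts samples per class and reports the ratio between the largest and smallest class. The `imbalance_summary` helper and the toy labels are assumptions for illustration only; they are not taken from the CFRA data.

```python
import pandas as pd


def imbalance_summary(labels):
    """Return per-class counts and shares, plus the majority/minority ratio."""
    counts = pd.Series(labels).value_counts()
    summary = pd.DataFrame({"count": counts, "share": counts / counts.sum()})
    ratio = counts.max() / counts.min()
    return summary, ratio


# Toy labels (not the real ratings): one dominant class and two rare ones.
toy_labels = [13] * 8 + [0] * 3 + [-1] * 1
summary, ratio = imbalance_summary(toy_labels)
print(summary)
print(f"majority/minority ratio: {ratio:.1f}")
```

A ratio close to 1 means the classes are roughly balanced; a large ratio suggests that resampling or class weighting is worth considering.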

In [16]:
import pandas as pd
import numpy as np

# visualization imports
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

%matplotlib inline

# modelling imports

from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

from imblearn.over_sampling import SMOTE
import xgboost as xgb

from sklearn.model_selection import learning_curve

In [25]:
df = pd.read_csv("./../all_simplified_ver_data.csv")
features = ['Revenue', 'Cost of Revenue',
       'Net Income', 'Gross Profit', 'Operating Expenses',
       'Selling, General & Administrative', 'Research & Development',
       'Depreciation & Amortization_x', 'Operating Income (Loss)',
       'Non-Operating Income (Loss)', 'Interest Expense, Net',
       'Net Extraordinary Gains (Losses)',
       'Cash, Cash Equivalents & Short Term Investments',
       'Accounts & Notes Receivable', 'Inventories', 'Total Current Assets',
       'Property, Plant & Equipment, Net',
       'Long Term Investments & Receivables', 'Other Long Term Assets',
       'Total Noncurrent Assets', 'Total Assets', 'Payables & Accruals',
       'Short Term Debt', 'Total Current Liabilities', 'Long Term Debt',
       'Total Noncurrent Liabilities',
       'Share Capital & Additional Paid-In Capital', 'Treasury Stock',
       'Retained Earnings', 'Total Equity', 'Net Income/Starting Line',
       'Depreciation & Amortization_y', 'Non-Cash Items',
       'Change in Working Capital', 'Change in Accounts Receivable',
       'Change in Inventories', 'Change in Accounts Payable',
       'Change in Other', 'Net Cash from Operating Activities',
       'Change in Fixed Assets & Intangibles',
       'Net Change in Long Term Investment',
       'Net Cash from Acquisitions & Divestitures',
       'Net Cash from Investing Activities', 'Dividends Paid',
       'Cash from (Repayment of) Debt', 'Cash from (Repurchase of) Equity',
       'Net Cash from Financing Activities', 'Net Change in Cash',
    'Rating_Change']
target = 'Rating'

In [26]:
df

Unnamed: 0.1,Unnamed: 0,Report Date,Ticker,Revenue,Cost of Revenue,Net Income,Gross Profit,Operating Expenses,"Selling, General & Administrative",Research & Development,...,Net Cash from Acquisitions & Divestitures,Net Cash from Investing Activities,Dividends Paid,Cash from (Repayment of) Debt,Cash from (Repurchase of) Equity,Net Cash from Financing Activities,Net Change in Cash,Next_Report_Date,Rating_Change,Rating
0,0,2014-01-31,A,1.008000e+09,-498000000.0,195000000,5.100000e+08,-386000000.0,-298000000.0,-88000000.0,...,-2.000000e+06,-4.700000e+07,-44000000.0,,-27000000.0,-6.800000e+07,67000000,2014-04-30,,
1,1,2014-04-30,A,9.880000e+08,-503000000.0,139000000,4.850000e+08,-391000000.0,-304000000.0,-87000000.0,...,0.000000e+00,-5.500000e+07,-44000000.0,,-28000000.0,-7.200000e+07,208000000,2014-07-31,0.0,0.0
2,2,2014-07-31,A,1.009000e+09,-507000000.0,147000000,5.020000e+08,-371000000.0,-285000000.0,-86000000.0,...,-2.400000e+07,-7.000000e+07,-44000000.0,,-9000000.0,-5.180000e+08,-559000000,2014-10-31,0.0,13.0
3,3,2014-10-31,A,1.043000e+09,-564000000.0,68000000,4.790000e+08,-409000000.0,-312000000.0,-97000000.0,...,-1.000000e+07,-5.800000e+07,-44000000.0,5.640000e+08,52000000.0,5.410000e+08,637000000,2015-01-31,0.0,6.0
4,4,2015-01-31,A,1.026000e+09,-513000000.0,63000000,5.130000e+08,-398000000.0,-310000000.0,-88000000.0,...,0.000000e+00,-3.100000e+07,-34000000.0,0.000000e+00,2000000.0,-8.280000e+08,-910000000,2015-04-30,0.0,13.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19665,19665,2018-06-30,ZTS,1.415000e+09,-447000000.0,384000000,9.680000e+08,-484000000.0,-359000000.0,-102000000.0,...,0.000000e+00,-7.600000e+07,-60000000.0,,-215000000.0,-2.710000e+08,-96000000,2018-09-30,0.0,13.0
19666,19666,2018-09-30,ZTS,1.480000e+09,-473000000.0,347000000,1.007000e+09,-507000000.0,-367000000.0,-108000000.0,...,-1.992000e+09,-2.064000e+09,-61000000.0,1.485000e+09,-143000000.0,1.263000e+09,-272000000,2018-12-31,0.0,13.0
19667,19667,2018-12-31,ZTS,1.564000e+09,-544000000.0,345000000,1.020000e+09,-584000000.0,-420000000.0,-125000000.0,...,-6.000000e+06,-7.400000e+07,-60000000.0,8.000000e+06,-150000000.0,-1.970000e+08,316000000,2019-03-31,0.0,13.0
19668,19668,2019-03-31,ZTS,1.455000e+09,-518000000.0,312000000,9.370000e+08,-509000000.0,-369000000.0,-102000000.0,...,,-2.300000e+07,-79000000.0,-9.000000e+06,-150000000.0,-2.460000e+08,123000000,2019-06-30,0.0,13.0


In [27]:
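# Fill missing feature values with the per-ticker mean of that column
df[features] = df[features].fillna(df.groupby('Ticker')[features].transform('mean'))
# Use an absurd sentinel value (1000) for any rating fields that are still NaN
df['Rating_Change'] = df['Rating_Change'].fillna(1000)
df['Rating'] = df['Rating'].fillna(1000)
# Any feature still missing (the ticker has no observed value at all) falls back to 0
df[features] = df[features].fillna(0)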
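
The two-stage fill used for the features can be easier to follow on a toy frame. The sketch below uses made-up tickers and revenue values (assumptions, not rows from the dataset) to show why the final `fillna(0)` is still needed after the per-ticker mean.

```python
import numpy as np
import pandas as pd

# Toy data: ticker A has one gap, ticker B has no observed revenue at all.
toy = pd.DataFrame({
    "Ticker": ["A", "A", "A", "B", "B"],
    "Revenue": [10.0, np.nan, 30.0, np.nan, np.nan],
})
cols = ["Revenue"]

# Step 1: replace NaN with the mean of the same ticker's observed values.
ticker_mean = toy.groupby("Ticker")[cols].transform("mean")
toy[cols] = toy[cols].fillna(ticker_mean)

# Step 2: ticker B's mean is itself NaN, so fall back to 0.
toy[cols] = toy[cols].fillna(0)

print(toy)
```

Note that the per-ticker mean is computed over all of a ticker's quarters, so an early gap is filled using values from later quarters as well.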

In [28]:
df_hold = df[df['Report Date'] >= '2019-01-01']
df_train = df[df['Report Date'] < '2019-01-01']
X = df_train[features]
y = df_train[target]

### Imbalance

In [29]:
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
y.value_counts()

 1000.0    6338
 6.0       6338
 5.0       6338
-3.0       6338
-1.0       6338
 13.0      6338
 3.0       6338
-2.0       6338
 1.0       6338
 2.0       6338
 0.0       6338
Name: Rating, dtype: int64

### CV All Models
Let's see which model performs best. (My guess is XGBoost.)

Since financial data is time dependent, we split the dataset using `TimeSeriesSplit`. Unlike ordinary k-fold splitting, `TimeSeriesSplit` preserves sequential order: each fold trains on earlier rows and tests on the rows that follow.

In [30]:
tscv = TimeSeriesSplit(n_splits=10)
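
As a small illustration of the sequential splitting that `TimeSeriesSplit` performs, the sketch below applies it to eight toy observations with three splits. The toy array and the choice of `n_splits=3` are assumptions chosen only to keep the output short.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

toy_X = np.arange(8).reshape(-1, 1)  # eight observations in time order
folds = list(TimeSeriesSplit(n_splits=3).split(toy_X))

for train_idx, test_idx in folds:
    print("TRAIN:", train_idx, "TEST:", test_idx)

# TRAIN: [0 1] TEST: [2 3]
# TRAIN: [0 1 2 3] TEST: [4 5]
# TRAIN: [0 1 2 3 4 5] TEST: [6 7]
```

Every training window ends before its test window begins, and the training window grows from fold to fold. This only reflects calendar time if the rows are sorted chronologically before splitting.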

#### Each model is placed into a dictionary

In [31]:
def cross_validation_folds(X, y, n_splits=5):
    """Fit each candidate model on every time-series fold and collect its scores."""
    tscv = TimeSeriesSplit(max_train_size=None, n_splits=n_splits)
    result = []
    # Candidate models, keyed by the name reported in the results table
    models = {
        "logistic": LogisticRegression(),
        "xgb": xgb.XGBClassifier(n_estimators=550, seed=0),
        "random_forest": RandomForestClassifier(n_estimators=25),
        "Decision_Tree": DecisionTreeClassifier(),
    }
    iteration = 1

    for train_index, test_index in tscv.split(X):
        print("TRAIN: ", train_index, "TEST:", test_index)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        for name, model in models.items():
            model.fit(X_train, y_train)
            y_predict = model.predict(X_test)
            acc_score = accuracy_score(y_test, y_predict)
            # Macro averaging weights every rating class equally
            precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_predict, average='macro')
            result.append({'iter': iteration, 'model': name, 'acc_score': acc_score, 'precision': precision, 'recall': recall, 'f1': f1})
        iteration += 1
    return pd.DataFrame(result)

In [None]:
model_result = cross_validation_folds(X,y)
model_result.sort_values(by="f1", ascending=False)

TRAIN:  [    0     1     2 ... 11620 11621 11622] TEST: [11623 11624 11625 ... 23239 23240 23241]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


TRAIN:  [    0     1     2 ... 23239 23240 23241] TEST: [23242 23243 23244 ... 34858 34859 34860]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
# Run XGBoost on its own, using the 10-split `tscv` defined earlier
xg_result = []

for train_index, test_index in tscv.split(X):
    print("TRAIN: ", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # The target has more than two classes, so use a multiclass objective
    model = xgb.XGBClassifier(objective='multi:softprob', booster='gbtree', max_leaves=7)
    # Note: early stopping is monitored on the test fold itself
    eval_set = [(X_test, y_test)]
    model.fit(X_train, y_train, early_stopping_rounds=10, eval_set=eval_set)
    y_predict = model.predict(X_test)
    acc_score = accuracy_score(y_test, y_predict)
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_predict, average='macro')
    print(confusion_matrix(y_test, y_predict))
    xg_result.append({'acc_score': acc_score, 'precision': precision, 'recall': recall, 'f1': f1})

In [None]:
pd.DataFrame(xg_result)

### Overall Score

The F1 score ranges from 0.13 to 0.29. Not bad, considering nobody had to read a financial statement.
I believe the model can be improved.
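
For a rough point of reference, the sketch below estimates the macro F1 of uniform random guessing. It assumes 11 roughly balanced classes, matching the class counts after oversampling. The individual time-ordered test folds are not necessarily balanced, so treat the number as an approximate baseline rather than an exact comparison.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_classes, n_samples = 11, 110_000

y_true = rng.integers(0, n_classes, size=n_samples)   # roughly balanced toy labels
y_guess = rng.integers(0, n_classes, size=n_samples)  # uniform random guesses

baseline_f1 = f1_score(y_true, y_guess, average="macro")
print(f"random-guess macro F1: {baseline_f1:.3f} (1/{n_classes} = {1 / n_classes:.3f})")
```

Under these assumptions the baseline sits near 1/11, or about 0.09, so the observed range is above chance but leaves plenty of room for improvement.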