# Random Forest Model For Predicting First Day IPO Performance

This notebook trains and tests a random forest model to predict whether an IPO (Initial Public Offering) will be underpriced. Please refer to the [paper](../B351_Main_Project_Final_Paper.pdf) for full documentation.

To begin, we will import the necessary modules and libraries.

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
import pickle
# visualizations
from sklearn.tree import export_graphviz
from subprocess import call
from IPython.display import Image
# utilities 
import itertools
import os
import numpy as np
import matplotlib.pyplot as plt


We first read the IPO data from a CSV file into a pandas data frame.

After collecting the labels for the data points, we select all numeric features so that they can be processed by the random forest model. We store the names of the features we choose to use in `ipo_features`.
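The label-then-filter step described above can be sketched on a toy frame. This is only an illustration: the rows, the `Issuer` column, and the variable names `toy`, `toy_labels`, and `toy_features` are made up and are not taken from the real CSV.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Issuer": ["A", "B", "C"],
    "Offer Price": [12.0, 18.5, 9.0],
    "Shares Outstanding (M)": [40, 25, 110],
    "Underpriced": [1, 0, 1],
})

# Pull the labels out before filtering columns.
toy_labels = toy["Underpriced"].tolist()

# Keep numeric columns only, then exclude the label column from the features.
numeric = toy.select_dtypes(include="number")
toy_features = [col for col in numeric.columns if col != "Underpriced"]
print(toy_features)
```

Excluding the label by name, rather than by position, keeps the feature list correct even if the column order in the file changes.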



In [15]:
# load dataset
ipos = pd.read_csv("../data/clean_bloomberg_with_sectors_macro.csv")
# get labels
ipo_labels = ipos["Underpriced"].tolist()
# get features
ipos = ipos.select_dtypes(['float64', 'float32', int])
# every remaining column is numeric; skip the first and last columns
ipo_features = ipos.columns.tolist()[1:-1]
# remove the feature which defines the label
ipo_features.remove('Offer To 1st Close')
# # TODO remove these features from the csv
# ipo_features.remove('Shares Outstanding (M).1')
# ipo_features.remove('Offer Size (M).1')
print("Possible Features:", ipo_features)

Possible Features: ['Profit Margin', 'Return on Assets', 'Offer Size (M)', 'Shares Outstanding (M)', 'Offer Price', 'Market Cap at Offer (M)', 'Cash Flow per Share', 'Instit Owner (% Shares Out)', 'Instit Owner (Shares Held)', 'Real GDP Per Capita', 'OECD Leading Indicator', 'Interest Rate', 'Seasonally Adjusted Unemployment Rate', 'CPI Growth Rate', 'Industry Sector', 'Industry Group', 'Industry Subgroup']


The features we have chosen to use are stored in `ipo_features_data` as a pandas data frame. Using this data frame along with the labels, we make a training and test split.

We then use scikit-learn's `RandomForestClassifier` to initialize a classification model. The model is trained on the training data, and we then make predictions by feeding it the test set.

Random forest is an ensemble machine learning algorithm that combines the predictions of many decision trees. Each tree is typically trained on a bootstrap sample of the training rows, a process called bagging, and each split within a tree considers only a random subset of the features. For classification, the final prediction is made by majority vote across the trees (scikit-learn averages the trees' predicted class probabilities). Bagging together with random feature selection makes the trees less correlated, which helps reduce overfitting and improves accuracy. Note that the classifier below sets `bootstrap=False`, so every tree is fit on the full training set and only the per-split feature sampling differs between trees. Random forests also scale to large, high-dimensional datasets, making them a powerful tool for predictive analytics.
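To make the mechanics concrete, here is a minimal hand-rolled sketch of bagging plus majority voting. It is not how this notebook's model is built: it uses synthetic data from `make_classification` in place of the IPO table, and the names `trees`, `votes`, and `forest_pred` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data stands in for the IPO table (an assumption).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # Bagging: draw a bootstrap sample of the training rows (with replacement).
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt": each split considers a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[rows], y_train[rows])
    trees.append(tree)

# Majority vote over the trees (labels are 0/1; 25 trees, so no ties).
votes = np.stack([t.predict(X_test) for t in trees])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("Hand-rolled forest accuracy:", (forest_pred == y_test).mean())
```

`RandomForestClassifier` does the same kind of work internally, except that it averages per-tree class probabilities instead of counting hard votes.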

In [16]:
# drop the last feature in the list ('Industry Subgroup')
ipo_features = ipo_features[:-1]
# get columns for specified features
ipo_features_data = ipos[ipo_features]
# split dataset into training set and test set
ipo_features_data_train, ipo_features_data_test, ipo_labels_train, ipo_labels_test = train_test_split(ipo_features_data, ipo_labels, test_size=0.3)
# create classifier
# max_features="sqrt" is what the deprecated "auto" setting meant for classifiers
clf = RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                             criterion="entropy", max_depth=13, max_features="sqrt",
                             max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0,
                             min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0,
                             n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0,
                             warm_start=False)
"""
{'bootstrap': False, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'entropy', 
'max_depth': 13, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 
'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 
'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
"""
# train the model
clf.fit(ipo_features_data_train, ipo_labels_train)
# predict
ipo_labels_pred = clf.predict(ipo_features_data_test)



In [17]:
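# rows of the confusion matrix are true classes, so diagonal / row sums
# gives the fraction of each class predicted correctly (per-class recall)
matrix = metrics.confusion_matrix(ipo_labels_test, ipo_labels_pred)
print("Class-wise Accuracy:", matrix.diagonal()/matrix.sum(axis=1))
print("Overall Accuracy:", metrics.accuracy_score(ipo_labels_test, ipo_labels_pred))

Class-wise Accuracy: [0.18012422 0.89971347]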
Overall Accuracy: 0.6725490196078432
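The gap between the two class-wise figures and the overall figure is easier to read once the class-wise ratio is recognised as per-class recall. The sketch below checks that equivalence on made-up labels (`y_true` and `y_pred` are invented, not results from this model) and also computes balanced accuracy, the unweighted mean of the per-class recalls.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]

cm = metrics.confusion_matrix(y_true, y_pred)
class_wise = cm.diagonal() / cm.sum(axis=1)

# recall_score with average=None returns one recall value per class.
recall = metrics.recall_score(y_true, y_pred, average=None)

# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced = metrics.balanced_accuracy_score(y_true, y_pred)

print("Class-wise:", class_wise)
print("Recall:    ", recall)
print("Balanced accuracy:", balanced)
print("Overall accuracy: ", metrics.accuracy_score(y_true, y_pred))
```

When one class dominates the test set, overall accuracy is pulled toward that class's recall, so reporting balanced accuracy alongside it may give a fairer picture.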


In [18]:
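# mean class-wise accuracy; the values appear to come from three separate runs
# (the last value in each sum matches the run above)
# first class:
#(0.14 + 0.178 + 0.180) / 3
# second class:
(0.946 + 0.939 + 0.899) / 3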

0.9279999999999999
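Rather than averaging by hand, the runs can be scripted. The following is a sketch under stated assumptions, not the procedure used in this notebook: it uses synthetic data from `make_classification`, fixed seeds, a stratified split, and a default-configured forest, and the names `per_run` and `mean_class_wise` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data stands in for the IPO table (an assumption).
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.35, 0.65], random_state=0)

per_run = []
for seed in range(3):
    # A different seed per run changes both the split and the forest.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    cm = confusion_matrix(y_te, model.predict(X_te), labels=[0, 1])
    per_run.append(cm.diagonal() / cm.sum(axis=1))

# Average each class's accuracy across the runs.
mean_class_wise = np.mean(per_run, axis=0)
print("Mean class-wise accuracy over 3 runs:", mean_class_wise)
```

Fixing `random_state` makes each run repeatable, which the cells above do not do since they leave it as `None`.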