# Random Forest Model for Predicting First-Day IPO Performance

[Text describing the overview of this notebook]

To begin, we import the necessary modules and libraries.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
import pickle

from sklearn.tree import export_graphviz
from subprocess import call
from IPython.display import Image

We read the IPO data from a CSV file into a pandas data frame.

Next, we take the labels from the `Underpriced` column and keep only the numeric columns so that they can be processed by our random forest model. We store the names of the candidate feature columns in `ipo_features`, removing `Offer To 1st Close` because it defines the label. The step that converts each feature to the float32 data type is currently commented out in the cell below.
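
The column selection described above can be sketched on a toy table. This is only a minimal sketch: the data frame, its column names (`Issuer`, `Shares Offered (M)`, and so on) and its values are hypothetical and are not taken from the Bloomberg data.

```python
import pandas as pd

# Hypothetical toy data standing in for the IPO table; all values are invented.
toy_ipos = pd.DataFrame({
    "Issuer": ["A", "B", "C"],
    "Offer Price": [10.0, 22.5, 15.0],
    "Shares Offered (M)": [5, 12, 8],
    "Underpriced": [1, 0, 1],
})

# Keep only numeric columns, then leave the label column out of the feature list.
numeric_cols = toy_ipos.select_dtypes(include="number").columns.tolist()
toy_features = [c for c in numeric_cols if c != "Underpriced"]

# Cast the selected feature columns to float32 in one step.
toy_features_data = toy_ipos[toy_features].astype("float32")
print(toy_features_data.dtypes)
```

Note that casting to float32 only changes the storage type. It does not rescale the values, and tree-based models such as random forests generally do not need rescaled inputs.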



In [None]:
ipos = pd.read_csv("../data/clean_bloomberg_with_sectors_macro.csv")
# get labels
ipo_labels = ipos["Underpriced"].tolist()
# get features
ipos = ipos.select_dtypes(['float64', 'float32', int])
ipo_features = ipos._get_numeric_data().columns.values.tolist()[1:-1]
# remove the feature which defines the label
ipo_features.remove('Offer To 1st Close')
# TODO remove these features from the csv
ipo_features.remove('Shares Outstanding (M).1')
ipo_features.remove('Offer Size (M).1')
print("Possible Features:", ipo_features)
# convert data types of all possible feature columns
# for ipo_feature in ipo_features :
#     ipos[ipo_feature] = ipos[ipo_feature].astype('float32').notnull()

The features we have chosen to use are stored in `ipo_features_data` as a pandas data frame. Using this data frame along with the labels, we make a training and test split (70% training, 30% test).

We then use `scikit-learn`'s random forest classifier to initialize a classification model. The model is trained on the training data, and we then make predictions by feeding it the test set.
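
Because `train_test_split` shuffles the rows, the reported accuracy can change from run to run. As a hedged sketch on synthetic data (not the IPO data), fixing `random_state` makes the split repeatable and `stratify` keeps the class ratio similar in both parts. The value 42 is an arbitrary choice.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the IPO features and labels (not the real data).
X, y = make_classification(n_samples=200, n_features=6, weights=[0.7, 0.3], random_state=0)

# random_state fixes the shuffle; stratify keeps the class ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("train:", Counter(y_train), "test:", Counter(y_test))
```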

Random forest is an ensemble machine learning algorithm that uses multiple decision trees to make predictions. Each tree is trained on a bootstrap sample of the training rows (a random sample drawn with replacement), a process called bagging, and each split within a tree considers only a random subset of the features. Each tree then makes a prediction, and the final prediction combines them: a majority vote for classification (scikit-learn averages the trees' predicted class probabilities) or an average for regression. Bagging together with the random feature selection helps reduce overfitting and usually improves accuracy compared with a single decision tree. Random forests can also handle large, high-dimensional datasets, which makes them a useful tool for predictive analytics.
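
To make the idea concrete, here is a minimal sketch of bagging with a majority vote, built by hand from individual decision trees on synthetic data. It illustrates the principle only and is not how `RandomForestClassifier` is implemented internally. The tree count of 25 and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the IPO features (not the real data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(25):
    # Bagging: draw a bootstrap sample of the rows, with replacement.
    rows = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[rows], y[rows]))

# Majority vote: each tree predicts a class and the most common class wins.
votes = np.stack([t.predict(X) for t in trees])       # shape (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)  # valid for 0/1 labels and an odd tree count

print("ensemble training accuracy:", (forest_pred == y).mean())
```

The accuracy printed here is measured on the training rows, so it is optimistic. A held-out test set is still needed to judge the model.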

In [None]:
# get columns for specified features
ipo_features = ipo_features[:-1]
ipo_features_data = ipos[ipo_features]
# split dataset into training set and test set
ipo_features_data_train, ipo_features_data_test, ipo_labels_train, ipo_labels_test = train_test_split(ipo_features_data, ipo_labels, test_size=0.3)
# create classifier 
clf = RandomForestClassifier(n_estimators=200)
# train the model
clf.fit(ipo_features_data_train, ipo_labels_train)
# predict
ipo_labels_pred = clf.predict(ipo_features_data_test)
# save the trained model
with open('saved_models/random_forest_bloomberg_total.pkl', 'wb') as file:
    pickle.dump(clf, file)
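
A saved model is only useful if it can be loaded again. The following sketch shows a save and load round trip on a small synthetic model. The temporary directory and the file name `demo_forest.pkl` are hypothetical and stand in for the `saved_models/` folder used above.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for the trained IPO classifier.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Hypothetical location: a temporary directory instead of saved_models/.
model_dir = os.path.join(tempfile.mkdtemp(), "saved_models")
os.makedirs(model_dir, exist_ok=True)  # open() does not create missing folders
model_path = os.path.join(model_dir, "demo_forest.pkl")

with open(model_path, "wb") as file:
    pickle.dump(model, file)

# Load the model back and confirm that it predicts the same labels.
with open(model_path, "rb") as file:
    restored = pickle.load(file)

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("restored model matches:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and a pickled model is best loaded with the same scikit-learn version that saved it.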

In [None]:
# check accuracy
print("Accuracy:", metrics.accuracy_score(ipo_labels_test, ipo_labels_pred))
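
Accuracy alone can be misleading when one class is much more common than the other. As a sketch on synthetic, imbalanced data (not the IPO data), the forest can be compared with a baseline that always predicts the most frequent class, and a confusion matrix shows where the errors fall. The class weights and model sizes are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the IPO data (not the real data).
X, y = make_classification(n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

forest_acc = accuracy_score(y_test, forest.predict(X_test))
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, forest.predict(X_test))

print("forest:", forest_acc, "baseline:", baseline_acc)
print(cm)
```

If the forest's accuracy is close to the baseline's, the model has learned little beyond the class frequencies.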

We now analyze the importance of each feature.

In [None]:
# find feature importance
feature_imp = pd.Series(clf.feature_importances_, index=ipo_features).sort_values(ascending=False)
feature_imp
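
The `feature_importances_` attribute is impurity-based: it is computed from the training data and can favor features with many distinct values. As a hedged comparison on synthetic data, `permutation_importance` from `sklearn.inspection` measures how much the held-out score drops when one column is shuffled. The feature names below are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the IPO features; the column names are hypothetical.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Impurity-based importances: computed from the training data, sum to 1.
impurity_imp = pd.Series(forest.feature_importances_, index=names)

# Permutation importances: drop in held-out score when one column is shuffled.
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
permutation_imp = pd.Series(result.importances_mean, index=names)

print(pd.DataFrame({"impurity": impurity_imp, "permutation": permutation_imp}))
```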

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

In [None]:
estimator = clf.estimators_[10]

In [None]:
export_graphviz(estimator, out_file='small_tree.dot', 
                feature_names = ipo_features,  # one name per column the model was trained on
                #class_names = [str(c) for c in clf.classes_],
                rounded = True, proportion = False, 
                precision = 1, filled = True)

In [None]:
call(['dot', '-Tpng', 'small_tree.dot', '-o', 'small_tree.png', '-Gdpi=600'])

In [None]:
Image(filename = 'small_tree.png')
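
If Graphviz's `dot` program is not installed, the export step can still be checked without writing any files. This sketch assumes a small synthetic forest and hypothetical feature names. Passing `out_file=None` makes `export_graphviz` return the DOT source as a string.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Synthetic stand-in for the IPO data; the feature names are hypothetical.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
names = ["feat_a", "feat_b", "feat_c", "feat_d"]
forest = RandomForestClassifier(n_estimators=5, max_depth=2, random_state=0).fit(X, y)

# Pick one tree; feature_names needs one entry per training column.
single_tree = forest.estimators_[0]
dot_source = export_graphviz(
    single_tree,
    out_file=None,  # return the DOT text instead of writing a file
    feature_names=names,
    class_names=[str(c) for c in forest.classes_],
    rounded=True,
    filled=True,
)
print(dot_source[:200])
```

Limiting `max_depth` as done here also keeps the rendered tree small enough to read.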