# Exploring IPO Dataset

This dataset was built from the SEC's database of public records. The filing URLs were retrieved from Financial Market Prep (FMP) via an API endpoint. FMP lists roughly 14k IPOs; after post-processing, however, only about 2,000 had usable data that could be extracted from their HTML files. This notebook explores the final raw dataset built from those HTML files.

## Goal

The goal is to predict the outcome of an IPO's first day of trading. The data contains a **diff** column whose value is either positive or negative; it represents the change in price on the IPO's opening day. This notebook explores which features are important for predicting that target column.
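One way to frame the target, sketched here on a made-up toy frame rather than the real data, is to turn the sign of **diff** into a binary label. The column name `first_day_up` and the sample values are assumptions for illustration only; the notebook itself does not define them.

```python
import pandas as pd

# Toy stand-in for the real dataset; these values are invented for illustration.
toy_df = pd.DataFrame(
    {"symbol": ["AAA", "BBB", "CCC"], "diff": [1.25, -0.40, 3.10]}
)

# Hypothetical binary label: True when the first-day price change is positive.
# Bracket access is required because DataFrame.diff is also a method name.
toy_df["first_day_up"] = toy_df["diff"] > 0

# Share of IPOs that closed up; a quick check for class imbalance.
up_share = toy_df["first_day_up"].mean()
print(toy_df)
print(f"share of positive first days: {up_share:.2f}")
```

Whether to model the sign (classification) or the raw value of `diff` (regression) is a design choice this sketch leaves open.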

In [2]:
import pandas as pd

starting_df = pd.read_csv("./datasets/all_financial_with_keywords.csv")

In [3]:
starting_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2025 entries, 0 to 2024
Data columns (total 83 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   additional_paid_in_capital_trend   1706 non-null   float64
 1   additional_paid_in_capital_recent  1706 non-null   float64
 2   total_assets_trend                 1890 non-null   float64
 3   total_assets_recent                1890 non-null   float64
 4   total_current_liabilities_trend    1506 non-null   float64
 5   total_current_liabilities_recent   1506 non-null   float64
 6   total_liabilities_trend            1638 non-null   float64
 7   total_liabilities_recent           1638 non-null   float64
 8   symbol                             2025 non-null   object 
 9   cash_trend                         1498 non-null   float64
 10  cash_recent                        1498 non-null   float64
 11  total_capitalization_trend         1356 non-null   float
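
The non-null counts above differ from column to column, so a per-column missing-value share can be a handy companion to `info()`. The following is a minimal sketch on a made-up toy frame; the values are invented, and only the column names are borrowed from the listing.

```python
import pandas as pd

# Toy stand-in with gaps; values are invented for illustration.
toy_df = pd.DataFrame(
    {
        "symbol": ["AAA", "BBB", "CCC", "DDD"],
        "cash_recent": [100.0, None, 250.0, None],
        "total_assets_recent": [1000.0, 2000.0, None, 4000.0],
    }
)

# Fraction of missing values per column, most incomplete columns first.
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to `starting_df`, the same expression would rank all 83 columns by how incomplete they are.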

In [4]:
starting_df.describe()

| | additional_paid_in_capital_trend | additional_paid_in_capital_recent | total_assets_trend | total_assets_recent | total_current_liabilities_trend | total_current_liabilities_recent | total_liabilities_trend | total_liabilities_recent | cash_trend | cash_recent | ... | jefferies | hill road | robert w. baird | william blair | goldman sachs | deutsche bank | davis polk | william blair.1 | goldman sachs.1 | public_price_per_share_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1706.0 | 1706.0 | 1890.0 | 1890.0 | 1506.0 | 1506.0 | 1638.0 | 1638.0 | 1498.0 | 1498.0 | ... | 224.0 | 148.0 | 107.0 | 188.0 | 317.0 | 165.0 | 267.0 | 112.0 | 172.0 | 2025.0 |
| mean | 8205914.0 | 19723990.0 | 24159540.0 | 55215060.0 | 5323661.0 | 14073300.0 | 16372700.0 | 33493520.0 | 7585227.0 | 12776750.0 | ... | 1.0 | 1.472973 | 1.514019 | 2.281915 | 5.785489 | 1.0 | 1.187266 | 1.0 | 1.0 | 380861.1 |
| std | 119592200.0 | 134121900.0 | 356815600.0 | 603902100.0 | 131411700.0 | 253213100.0 | 375146500.0 | 531451000.0 | 164001200.0 | 162387800.0 | ... | 0.0 | 0.786328 | 1.556268 | 1.758187 | 6.945405 | 0.0 | 1.048875 | 0.0 | 0.0 | 12422940.0 |
| min | -683978800.0 | -29851.0 | -567739600.0 | 2.0 | -903654200.0 | 2.0 | -993456700.0 | -1033545.0 | -2075756000.0 | 0.01 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.01 |
| 25% | 0.0 | 30515.75 | 0.0 | 135436.0 | 0.0 | 38041.75 | 0.0 | 61360.0 | 0.0 | 23859.5 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 9.0 |
| 50% | 0.0 | 424973.5 | 46317.0 | 443385.0 | 1207.5 | 152875.0 | 56.5 | 329919.0 | 0.0 | 112435.5 | ... | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | 10.0 |
| 75% | 477152.0 | 5000664.0 | 387692.0 | 6710219.0 | 107787.0 | 1293842.0 | 166476.5 | 4764252.0 | 120218.0 | 873769.8 | ... | 1.0 | 2.0 | 1.0 | 3.0 | 8.0 | 1.0 | 1.0 | 1.0 | 1.0 | 18.0 |
| max | 4205601000.0 | 4210213000.0 | 12683440000.0 | 20221690000.0 | 3659522000.0 | 7840389000.0 | 13736720000.0 | 17932040000.0 | 4236084000.0 | 4343529000.0 | ... | 1.0 | 5.0 | 11.0 | 11.0 | 54.0 | 1.0 | 17.0 | 1.0 | 1.0 | 504180000.0 |
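
In the summary above, several means sit far above their medians (for `total_assets_recent`, 55215060.0 versus 443385.0; for `public_price_per_share_y`, 380861.1 versus 10.0), which suggests heavy right skew or extreme values. A rough way to flag such columns, sketched on a made-up toy frame with invented values, is to compare the mean and the median taken from `describe()`. The mean-to-median ratio is an illustrative heuristic assumed here, not a step the notebook performs.

```python
import pandas as pd

# Toy stand-in: one heavily skewed column and one well-behaved column.
toy_df = pd.DataFrame(
    {
        "total_assets_recent": [2.0, 100.0, 400.0, 7000.0, 2_000_000.0],
        "public_price_per_share_y": [9.0, 10.0, 10.0, 18.0, 20.0],
    }
)

summary = toy_df.describe()

# Ratio of mean to median per column; values far above 1 hint at right skew.
# Columns with a zero median would need separate handling.
mean_to_median = summary.loc["mean"] / summary.loc["50%"]
print(mean_to_median.sort_values(ascending=False))
```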
