### IPO scraper

Can we predict whether an IPO will be successful, and which features raise that probability? That is the question behind this project. It scrapes information about recent IPOs, combines it with stock data, cleans and munges the data, and finally feeds engineered features into a machine learning model that predicts offering success as a binary label (0 or 1).

**Disclosure:** The idea for this project came from my bad experience of investing in Lyft on its first day of trading. Lyft lost much of its value after the IPO, and its price continues to fall as I'm writing these lines. As a newbie in investing, I decided to explore the IPO market to better understand its dynamics and avoid similar mistakes in the future.

**Disclaimer:** This is a learning project to apply data science skills in Python. Thus, the insights from this project should not be taken as investment advice. I do not guarantee the accuracy of the information, although it is taken from reliable sources. <br>

In [6]:
from bs4 import BeautifulSoup
import requests
import re

import pandas as pd
import numpy as np

from time import sleep

import dill

In [9]:
dill.load_session('ipo_scraper.db')

## 1. Scraping

 For more background, see https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/ <br>
 
 1. Respect the rules of robots.txt: <br>
 e.g. www.finance.yahoo.com/robots.txt<br><br>
 
 2. Identify yourself with a user-agent: <br>
 e.g. define headers = {'User-Agent': "my user-agent 1"} and pass it to requests.get(url, headers=headers)<br><br>
 
 3. Use a reasonable crawl rate: <br>
 e.g. from time import sleep and call sleep(15) between requests <br>
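
The first rule can be checked programmatically. The following is a minimal sketch, assuming Python's standard `urllib.robotparser` module; the robots.txt content and the `example.com` URLs are hypothetical and only illustrate the mechanics, they are not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# Same user-agent string as the one used for scraping in this notebook
USER_AGENT = "non-profit learning project"

# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the parsed rules permit this user-agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


rules = build_robot_rules(SAMPLE_ROBOTS)
print(allowed(rules, "https://example.com/markets/ipos/"))  # path is not disallowed
print(allowed(rules, "https://example.com/private/data"))   # path is disallowed
```

In a real run, the robots.txt text would first be downloaded from the target site, and any URL for which `allowed` returns `False` would be skipped.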

In [3]:
dates = pd.period_range('2010-01-01', '2018-12-31', freq='M')
dates

PeriodIndex(['2010-01', '2010-02', '2010-03', '2010-04', '2010-05', '2010-06',
             '2010-07', '2010-08', '2010-09', '2010-10',
             ...
             '2018-03', '2018-04', '2018-05', '2018-06', '2018-07', '2018-08',
             '2018-09', '2018-10', '2018-11', '2018-12'],
            dtype='period[M]', length=108, freq='M')

In [372]:
tables = []
links = []
url = 'https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=%s'
headers = {'User-Agent' : "non-profit learning project"}

for idx in dates:
    print(f'fill scrap - {url % idx}')
    result = requests.get(url % idx, headers=headers)
    sleep(30)  # keep a polite crawl rate
    content = result.content
    
    # months without IPO pricings show this message instead of a table
    if "There is no data for this month" not in str(content):
        table = pd.read_html(content)[0]
        tables.append(table)
    
        soup = BeautifulSoup(content)
    
        # company links have ids starting with this prefix followed by a digit
        m = soup.find_all('a', id=re.compile(r'two_column_main_content_rptPricing_company_\d'))
        print(f"length of table vs length of links - {table.shape[0]-len(m)}")
        
        for link in m:
            links.append(link['href'])

# DataFrame.append was removed in pandas 2.0, so concatenate the monthly tables once
df = pd.concat(tables, ignore_index=True)

fill scrap - https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=2010-01
length of table vs length of links - 0
fill scrap - https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=2010-02
length of table vs length of links - 0
fill scrap - https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=2010-03
length of table vs length of links - 0
fill scrap - https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=2010-04
length of table vs length of links - 0
fill scrap - https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=2010-05
length of table vs length of links - 0
fill scrap - https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=2010-06
length of table vs length of links - 0
fill scrap - https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=2010-07
length of table vs length of links - 0
fill scrap - https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=2010-08
length 

length of table vs length of links - 0
fill scrap - https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=2015-05
length of table vs length of links - 0
fill scrap - https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=2015-06
length of table vs length of links - 0
fill scrap - https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=2015-07
length of table vs length of links - 0
fill scrap - https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=2015-08
length of table vs length of links - 0
fill scrap - https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=2015-09
length of table vs length of links - 0
fill scrap - https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=2015-10
length of table vs length of links - 0
fill scrap - https://www.nasdaq.com/markets/ipos/activity.aspx?tab=pricings&month=2015-11
length of table vs length of links - 0
fill scrap - https://www.nasdaq.com/markets/ipos/activity.

In [375]:
descriptions = []
address = []
employees = []

for lnk in links:
    print("scraping -", lnk)
    # identify the scraper here too, reusing the headers defined above
    res = requests.get(lnk, headers=headers).content
    soup = BeautifulSoup(res)
    desc = [x.get_text() for x in soup.find_all("div", {"class": "ipo-comp-description"})]
    desc2 = [x.get_text() for x in soup.find_all("div", {"id": "read_more_div_toggle1"})]
    # keep a description only when both parts were found on the page
    if not desc or not desc2:
        descriptions.append('')
    else:
        desc.extend(desc2)
        descriptions.append(desc)
    
    # column 1 of the first table: row 1 holds the address, row 5 the employee count
    table = pd.read_html(res)
    address.append(table[0][1][1])
    employees.append(table[0][1][5])

scraping - https://www.nasdaq.com/markets/ipos/company/china-electric-motor-inc-767729-62318
scraping - https://www.nasdaq.com/markets/ipos/company/ifm-investments-ltd-818638-63053
scraping - https://www.nasdaq.com/markets/ipos/company/andatee-china-marine-fuel-services-corp-810870-61911
scraping - https://www.nasdaq.com/markets/ipos/company/china-hydroelectric-corp-817076-62817
scraping - https://www.nasdaq.com/markets/ipos/company/chesapeake-lodging-trust-812716-62173
scraping - https://www.nasdaq.com/markets/ipos/company/cellu-tissue-holdings-inc-648615-62353
scraping - https://www.nasdaq.com/markets/ipos/company/symetra-financial-corp-749215-62247
scraping - https://www.nasdaq.com/markets/ipos/company/sprott-physical-gold-trust-817209-62831
scraping - https://www.nasdaq.com/markets/ipos/company/ensign-services-inc-765029-58228
scraping - https://www.nasdaq.com/markets/ipos/company/generac-holdings-inc-814154-62379
scraping - https://www.nasdaq.com/markets/ipos/company/graham-packag

scraping - https://www.nasdaq.com/markets/ipos/company/makemytrip-ltd-833644-64779
scraping - https://www.nasdaq.com/markets/ipos/company/realpage-inc-630471-64014
scraping - https://www.nasdaq.com/markets/ipos/company/china-kanghui-holdings-833584-64773
scraping - https://www.nasdaq.com/markets/ipos/company/mediamind-technologies-inc-614466-63520
scraping - https://www.nasdaq.com/markets/ipos/company/united-states-commodity-index-funds-trust-818184-62995
scraping - https://www.nasdaq.com/markets/ipos/company/intralinks-holdings-inc-825687-63852
scraping - https://www.nasdaq.com/markets/ipos/company/nxp-semiconductors-nv-757890-63906
scraping - https://www.nasdaq.com/markets/ipos/company/nupathe-inc-720465-64175
scraping - https://www.nasdaq.com/markets/ipos/company/gestate-liquidation-stores-inc-827035-64026
scraping - https://www.nasdaq.com/markets/ipos/company/ambow-education-holding-ltd-832935-64691
scraping - https://www.nasdaq.com/markets/ipos/company/ophectra-real-estate-invts-l

scraping - https://www.nasdaq.com/markets/ipos/company/global-brokerage-inc-836552-65098
scraping - https://www.nasdaq.com/markets/ipos/company/adecoagro-sa-845500-66160
scraping - https://www.nasdaq.com/markets/ipos/company/bcd-semiconductor-manufacturing-ltd-634789-66100
scraping - https://www.nasdaq.com/markets/ipos/company/bankunited-inc-840262-65512
scraping - https://www.nasdaq.com/markets/ipos/company/interxion-holding-nv-845354-66151
scraping - https://www.nasdaq.com/markets/ipos/company/velti-plc-828512-64158
scraping - https://www.nasdaq.com/markets/ipos/company/nielsen-holdings-plc-829744-64361
scraping - https://www.nasdaq.com/markets/ipos/company/leaf-group-ltd-710930-64883
scraping - https://www.nasdaq.com/markets/ipos/company/tibet-pharmaceuticals-inc-828647-64188
scraping - https://www.nasdaq.com/markets/ipos/company/etfs-asian-gold-trust-833504-64762
scraping - https://www.nasdaq.com/markets/ipos/company/american-assets-trust-inc-837016-65140
scraping - https://www.nas

scraping - https://www.nasdaq.com/markets/ipos/company/first-connecticut-bancorp-inc-846582-66243
scraping - https://www.nasdaq.com/markets/ipos/company/ag-mortgage-investment-trust-inc-849551-66523
scraping - https://www.nasdaq.com/markets/ipos/company/homeaway-inc-712814-66558
scraping - https://www.nasdaq.com/markets/ipos/company/kior-inc-762815-66831
scraping - https://www.nasdaq.com/markets/ipos/company/saexploration-holdings-inc-850267-66577
scraping - https://www.nasdaq.com/markets/ipos/company/vanguard-health-systems-inc-91842-66876
scraping - https://www.nasdaq.com/markets/ipos/company/bankrate-inc-853102-66878
scraping - https://www.nasdaq.com/markets/ipos/company/integrated-drilling-equipment-holdings-corp-849473-66514
scraping - https://www.nasdaq.com/markets/ipos/company/csi-compressco-lp-791919-59657
scraping - https://www.nasdaq.com/markets/ipos/company/pandora-media-llc-405127-66361
scraping - https://www.nasdaq.com/markets/ipos/company/fusionio-inc-728969-66532
scrapin

scraping - https://www.nasdaq.com/markets/ipos/company/epam-systems-inc-698144-67380
scraping - https://www.nasdaq.com/markets/ipos/company/melinta-therapeutics-inc-new-805275-68346
scraping - https://www.nasdaq.com/markets/ipos/company/avg-technologies-nv-871866-69026
scraping - https://www.nasdaq.com/markets/ipos/company/greenway-medical-technologies-inc-99155-67687
scraping - https://www.nasdaq.com/markets/ipos/company/matador-resources-co-861962-67914
scraping - https://www.nasdaq.com/markets/ipos/company/us-silica-holdings-inc-860113-67694
scraping - https://www.nasdaq.com/markets/ipos/company/enphase-energy-inc-804337-67429
scraping - https://www.nasdaq.com/markets/ipos/company/gaslog-ltd-872099-69045
scraping - https://www.nasdaq.com/markets/ipos/company/merrimack-pharmaceuticals-inc-615489-67626
scraping - https://www.nasdaq.com/markets/ipos/company/millennial-media-inc-718200-68982
scraping - https://www.nasdaq.com/markets/ipos/company/cafepress-inc-147807-67381
scraping - htt

scraping - https://www.nasdaq.com/markets/ipos/company/hyde-park-acquisition-corp-ii-854787-67055
scraping - https://www.nasdaq.com/markets/ipos/company/qualys-inc-375214-70088
scraping - https://www.nasdaq.com/markets/ipos/company/summit-midstream-partners-lp-888110-70595
scraping - https://www.nasdaq.com/markets/ipos/company/santander-mexico-financial-group-sab-de-cv-887985-70579
scraping - https://www.nasdaq.com/markets/ipos/company/capital-bank-financial-corp-818543-67492
scraping - https://www.nasdaq.com/markets/ipos/company/national-bank-holdings-corp-815033-68629
scraping - https://www.nasdaq.com/markets/ipos/company/spirit-realty-capital-inc-617470-68594
scraping - https://www.nasdaq.com/markets/ipos/company/sunoco-lp-884445-70181
scraping - https://www.nasdaq.com/markets/ipos/company/trulia-inc-695941-70564
scraping - https://www.nasdaq.com/markets/ipos/company/endeavor-ip-inc-848986-66453
scraping - https://www.nasdaq.com/markets/ipos/company/kbs-fashion-group-ltd-878965-6958

scraping - https://www.nasdaq.com/markets/ipos/company/intelsat-sa-882312-69954
scraping - https://www.nasdaq.com/markets/ipos/company/hannon-armstrong-sustainable-infrastructure-capital-inc-900116-71843
scraping - https://www.nasdaq.com/markets/ipos/company/fairway-group-holdings-corp-890051-70840
scraping - https://www.nasdaq.com/markets/ipos/company/evertec-inc-899009-71762
scraping - https://www.nasdaq.com/markets/ipos/company/privileged-world-travel-club-inc-883421-70707
scraping - https://www.nasdaq.com/markets/ipos/company/rally-software-development-corp-660534-71967
scraping - https://www.nasdaq.com/markets/ipos/company/omthera-pharmaceuticals-inc-816563-71960
scraping - https://www.nasdaq.com/markets/ipos/company/chimerix-inc-376315-71959
scraping - https://www.nasdaq.com/markets/ipos/company/knot-offshore-partners-lp-900852-71901
scraping - https://www.nasdaq.com/markets/ipos/company/taylor-morrison-home-corp-894212-71312
scraping - https://www.nasdaq.com/markets/ipos/company

scraping - https://www.nasdaq.com/markets/ipos/company/ishares-commodity-optimized-trust-869351-68774
scraping - https://www.nasdaq.com/markets/ipos/company/independence-realty-trust-inc-807025-72506
scraping - https://www.nasdaq.com/markets/ipos/company/mix-telematics-ltd-910131-72957
scraping - https://www.nasdaq.com/markets/ipos/company/andeavor-midstream-partners-lp-906174-72469
scraping - https://www.nasdaq.com/markets/ipos/company/jason-industries-inc-909057-72839
scraping - https://www.nasdaq.com/markets/ipos/company/bmc-stock-holdings-inc-908837-72803
scraping - https://www.nasdaq.com/markets/ipos/company/franks-international-nv-906328-72486
scraping - https://www.nasdaq.com/markets/ipos/company/cvent-inc-164595-72974
scraping - https://www.nasdaq.com/markets/ipos/company/world-point-terminals-lp-908931-72820
scraping - https://www.nasdaq.com/markets/ipos/company/intrexon-corp-702326-72988
scraping - https://www.nasdaq.com/markets/ipos/company/fox-factory-holding-corp-768346-72

scraping - https://www.nasdaq.com/markets/ipos/company/norcraft-companies-inc-917286-73678
scraping - https://www.nasdaq.com/markets/ipos/company/twitter-inc-763922-73652
scraping - https://www.nasdaq.com/markets/ipos/company/wixcom-ltd-916877-73633
scraping - https://www.nasdaq.com/markets/ipos/company/karyopharm-therapeutics-inc-840732-73668
scraping - https://www.nasdaq.com/markets/ipos/company/barracuda-networks-inc-694838-73636
scraping - https://www.nasdaq.com/markets/ipos/company/zenith-energy-logistics-partners-lp-917000-73643
scraping - https://www.nasdaq.com/markets/ipos/company/avianca-holdings-sa-915952-73527
scraping - https://www.nasdaq.com/markets/ipos/company/blue-capital-reinsurance-holdings-ltd-917215-73671
scraping - https://www.nasdaq.com/markets/ipos/company/dariohealth-corp-866973-71601
scraping - https://www.nasdaq.com/markets/ipos/company/container-store-group-inc-756081-73610
scraping - https://www.nasdaq.com/markets/ipos/company/qunar-cayman-islands-ltd-916707

scraping - https://www.nasdaq.com/markets/ipos/company/galmed-pharmaceuticals-ltd-926632-74581
scraping - https://www.nasdaq.com/markets/ipos/company/achaogen-inc-647671-74489
scraping - https://www.nasdaq.com/markets/ipos/company/aquinox-pharmaceuticals-inc-749074-74508
scraping - https://www.nasdaq.com/markets/ipos/company/quotient-technology-inc-118476-74533
scraping - https://www.nasdaq.com/markets/ipos/company/recro-pharma-inc-918509-73797
scraping - https://www.nasdaq.com/markets/ipos/company/bg-staffing-inc-814205-74294
scraping - https://www.nasdaq.com/markets/ipos/company/lombard-medical-inc-929366-74844
scraping - https://www.nasdaq.com/markets/ipos/company/functionx-inc-6393-74380
scraping - https://www.nasdaq.com/markets/ipos/company/quotient-ltd-929179-74810
scraping - https://www.nasdaq.com/markets/ipos/company/immunic-inc-623342-73730
scraping - https://www.nasdaq.com/markets/ipos/company/sabre-corp-925028-74446
scraping - https://www.nasdaq.com/markets/ipos/company/spor

scraping - https://www.nasdaq.com/markets/ipos/company/radius-health-inc-775825-74728
scraping - https://www.nasdaq.com/markets/ipos/company/arista-networks-inc-931003-75019
scraping - https://www.nasdaq.com/markets/ipos/company/adverum-biotechnologies-inc-921110-75880
scraping - https://www.nasdaq.com/markets/ipos/company/catalent-inc-925423-74481
scraping - https://www.nasdaq.com/markets/ipos/company/healthequity-inc-771871-75682
scraping - https://www.nasdaq.com/markets/ipos/company/leap-therapeutics-inc-937750-75804
scraping - https://www.nasdaq.com/markets/ipos/company/marinus-pharmaceuticals-inc-601193-75413
scraping - https://www.nasdaq.com/markets/ipos/company/enlivex-therapeutics-ltd-926749-74590
scraping - https://www.nasdaq.com/markets/ipos/company/transocean-partners-llc-937707-75801
scraping - https://www.nasdaq.com/markets/ipos/company/synchrony-financial-929636-74880
scraping - https://www.nasdaq.com/markets/ipos/company/westlake-chemical-partners-lp-933537-75273
scrapin

scraping - https://www.nasdaq.com/markets/ipos/company/textmunication-holdings-inc-2056-75664
scraping - https://www.nasdaq.com/markets/ipos/company/bison-merger-sub-i-llc-296470-76273
scraping - https://www.nasdaq.com/markets/ipos/company/dermira-inc-889368-76307
scraping - https://www.nasdaq.com/markets/ipos/company/yodlee-inc-104440-75874
scraping - https://www.nasdaq.com/markets/ipos/company/jp-energy-partners-lp-859035-75368
scraping - https://www.nasdaq.com/markets/ipos/company/atento-sa-933741-75299
scraping - https://www.nasdaq.com/markets/ipos/company/axar-acquisition-corp-941734-76155
scraping - https://www.nasdaq.com/markets/ipos/company/aac-holdings-inc-933439-75963
scraping - https://www.nasdaq.com/markets/ipos/company/calithera-biosciences-inc-833221-76284
scraping - https://www.nasdaq.com/markets/ipos/company/vwr-corp-937839-75821
scraping - https://www.nasdaq.com/markets/ipos/company/wayfair-inc-942421-76214
scraping - https://www.nasdaq.com/markets/ipos/company/gulf-we

scraping - https://www.nasdaq.com/markets/ipos/company/solaredge-technologies-inc-763462-77672
scraping - https://www.nasdaq.com/markets/ipos/company/cellectis-sa-958510-77689
scraping - https://www.nasdaq.com/markets/ipos/company/nextdecade-corp-939000-75943
scraping - https://www.nasdaq.com/markets/ipos/company/tantech-holdings-ltd-944893-76487
scraping - https://www.nasdaq.com/markets/ipos/company/steadymed-ltd-956991-77602
scraping - https://www.nasdaq.com/markets/ipos/company/national-commerce-corp-936642-77378
scraping - https://www.nasdaq.com/markets/ipos/company/summit-materials-inc-952736-77262
scraping - https://www.nasdaq.com/markets/ipos/company/maxpoint-interactive-inc-956526-77558
scraping - https://www.nasdaq.com/markets/ipos/company/summit-therapeutics-plc-928879-77540
scraping - https://www.nasdaq.com/markets/ipos/company/blueprint-medicines-corp-924997-77940
scraping - https://www.nasdaq.com/markets/ipos/company/atlantic-alliance-partnership-corp-958643-77698
scraping

scraping - https://www.nasdaq.com/markets/ipos/company/mastercraft-boat-holdings-inc-964343-78298
scraping - https://www.nasdaq.com/markets/ipos/company/ooma-inc-674544-78694
scraping - https://www.nasdaq.com/markets/ipos/company/rapid7-inc-891765-78660
scraping - https://www.nasdaq.com/markets/ipos/company/ollies-bargain-outlet-holdings-inc-967922-78686
scraping - https://www.nasdaq.com/markets/ipos/company/sierra-oncology-inc-635290-78678
scraping - https://www.nasdaq.com/markets/ipos/company/jupai-holdings-ltd-967929-78689
scraping - https://www.nasdaq.com/markets/ipos/company/chiasma-inc-686107-78688
scraping - https://www.nasdaq.com/markets/ipos/company/hailiang-education-group-inc-953462-77324
scraping - https://www.nasdaq.com/markets/ipos/company/natera-inc-966846-78575
scraping - https://www.nasdaq.com/markets/ipos/company/consol-coal-resources-lp-961806-78013
scraping - https://www.nasdaq.com/markets/ipos/company/conformis-inc-652145-78505
scraping - https://www.nasdaq.com/mar

scraping - https://www.nasdaq.com/markets/ipos/company/aeglea-biotherapeutics-inc-961076-78705
scraping - https://www.nasdaq.com/markets/ipos/company/datasea-inc-957712-77645
scraping - https://www.nasdaq.com/markets/ipos/company/cotiviti-holdings-inc-993419-80796
scraping - https://www.nasdaq.com/markets/ipos/company/gms-inc-928385-79049
scraping - https://www.nasdaq.com/markets/ipos/company/waitr-holdings-inc-993296-80783
scraping - https://www.nasdaq.com/markets/ipos/company/reata-pharmaceuticals-inc-704859-80104
scraping - https://www.nasdaq.com/markets/ipos/company/us-foods-holding-corp-986420-80322
scraping - https://www.nasdaq.com/markets/ipos/company/midland-states-bancorp-inc-806625-80685
scraping - https://www.nasdaq.com/markets/ipos/company/fgl-holdings-992788-80752
scraping - https://www.nasdaq.com/markets/ipos/company/merus-nv-977595-79619
scraping - https://www.nasdaq.com/markets/ipos/company/reign-sapphire-corp-966453-78536
scraping - https://www.nasdaq.com/markets/ipos/

scraping - https://www.nasdaq.com/markets/ipos/company/advanced-disposal-services-inc-915347-79238
scraping - https://www.nasdaq.com/markets/ipos/company/obalon-therapeutics-inc-771392-81663
scraping - https://www.nasdaq.com/markets/ipos/company/coupa-software-inc-731131-81660
scraping - https://www.nasdaq.com/markets/ipos/company/scworx-corp-1001654-81513
scraping - https://www.nasdaq.com/markets/ipos/company/aquaventure-holdings-ltd-766355-79466
scraping - https://www.nasdaq.com/markets/ipos/company/hunter-maritime-acquisition-corp-1005763-81877
scraping - https://www.nasdaq.com/markets/ipos/company/motif-bio-plc-998919-81297
scraping - https://www.nasdaq.com/markets/ipos/company/smart-sand-inc-864559-81736
scraping - https://www.nasdaq.com/markets/ipos/company/gds-holdings-ltd-1004950-81839
scraping - https://www.nasdaq.com/markets/ipos/company/hebron-technology-co-ltd-981769-80004
scraping - https://www.nasdaq.com/markets/ipos/company/trivago-nv-1008287-82089
scraping - https://www

scraping - https://www.nasdaq.com/markets/ipos/company/tintri-inc-886447-83868
scraping - https://www.nasdaq.com/markets/ipos/company/aileron-therapeutics-inc-764139-83876
scraping - https://www.nasdaq.com/markets/ipos/company/blue-apron-holdings-inc-1024794-83865
scraping - https://www.nasdaq.com/markets/ipos/company/dova-pharmaceuticals-inc-1004119-83878
scraping - https://www.nasdaq.com/markets/ipos/company/mersana-therapeutics-inc-785645-83859
scraping - https://www.nasdaq.com/markets/ipos/company/tpg-pace-holdings-corp-1025168-83907
scraping - https://www.nasdaq.com/markets/ipos/company/esquire-financial-holdings-inc-873795-83849
scraping - https://www.nasdaq.com/markets/ipos/company/avenue-therapeutics-inc-982727-83546
scraping - https://www.nasdaq.com/markets/ipos/company/granite-point-mortgage-trust-inc-1024122-83784
scraping - https://www.nasdaq.com/markets/ipos/company/nrc-group-holdings-corp-1024531-83841
scraping - https://www.nasdaq.com/markets/ipos/company/safehold-inc-10

scraping - https://www.nasdaq.com/markets/ipos/company/reto-ecosolutions-inc-1029887-84374
scraping - https://www.nasdaq.com/markets/ipos/company/big-rock-partners-acquisition-corp-1034975-84990
scraping - https://www.nasdaq.com/markets/ipos/company/bluegreen-vacations-corp-7500-85070
scraping - https://www.nasdaq.com/markets/ipos/company/ameri-holdings-inc-607-84773
scraping - https://www.nasdaq.com/markets/ipos/company/sailpoint-technologies-holdings-inc-952413-85050
scraping - https://www.nasdaq.com/markets/ipos/company/sterling-bancorp-inc-1001137-85034
scraping - https://www.nasdaq.com/markets/ipos/company/scpharmaceuticals-inc-931703-85074
scraping - https://www.nasdaq.com/markets/ipos/company/cbdmd-inc-968652-84781
scraping - https://www.nasdaq.com/markets/ipos/company/legacy-acquisition-corp-1035806-85093
scraping - https://www.nasdaq.com/markets/ipos/company/stitch-fix-inc-1035402-85032
scraping - https://www.nasdaq.com/markets/ipos/company/jianpu-technology-inc-1035519-85063


scraping - https://www.nasdaq.com/markets/ipos/company/op-bancorp-1047400-86214
scraping - https://www.nasdaq.com/markets/ipos/company/bilibili-inc-1047286-86197
scraping - https://www.nasdaq.com/markets/ipos/company/homology-medicines-inc-983342-86199
scraping - https://www.nasdaq.com/markets/ipos/company/greentree-hospitality-group-ltd-1046879-86158
scraping - https://www.nasdaq.com/markets/ipos/company/dropbox-inc-808000-86123
scraping - https://www.nasdaq.com/markets/ipos/company/sunlands-technology-group-1046645-86129
scraping - https://www.nasdaq.com/markets/ipos/company/etf-managers-group-commodity-trust-i-946454-83870
scraping - https://www.nasdaq.com/markets/ipos/company/golden-bull-ltd-1040761-85679
scraping - https://www.nasdaq.com/markets/ipos/company/senmiao-technology-ltd-1036260-85132
scraping - https://www.nasdaq.com/markets/ipos/company/tiberius-acquisition-corp-1046235-86083
scraping - https://www.nasdaq.com/markets/ipos/company/zscaler-inc-1046093-86065
scraping - ht

scraping - https://www.nasdaq.com/markets/ipos/company/endava-plc-977973-87274
scraping - https://www.nasdaq.com/markets/ipos/company/opera-ltd-1057892-87280
scraping - https://www.nasdaq.com/markets/ipos/company/summit-wireless-technologies-inc-1001707-86566
scraping - https://www.nasdaq.com/markets/ipos/company/aurora-mobile-ltd-1057849-87266
scraping - https://www.nasdaq.com/markets/ipos/company/liquidia-technologies-inc-677210-87259
scraping - https://www.nasdaq.com/markets/ipos/company/tenable-holdings-inc-997936-87272
scraping - https://www.nasdaq.com/markets/ipos/company/pinduoduo-inc-1057876-87277
scraping - https://www.nasdaq.com/markets/ipos/company/focus-financial-partners-inc-1054565-86962
scraping - https://www.nasdaq.com/markets/ipos/company/berry-petroleum-corp-1057872-87275
scraping - https://www.nasdaq.com/markets/ipos/company/cango-inc-1057034-87209
scraping - https://www.nasdaq.com/markets/ipos/company/aquestive-therapeutics-inc-743843-87251
scraping - https://www.na

scraping - https://www.nasdaq.com/markets/ipos/company/guardant-health-inc-927439-87823
scraping - https://www.nasdaq.com/markets/ipos/company/kodiak-sciences-inc-809185-87842
scraping - https://www.nasdaq.com/markets/ipos/company/upwork-inc-1063545-87822
scraping - https://www.nasdaq.com/markets/ipos/company/medalist-diversified-reit-inc-976438-87763
scraping - https://www.nasdaq.com/markets/ipos/company/taiwan-liposome-company-ltd-1046140-86077
scraping - https://www.nasdaq.com/markets/ipos/company/tiziana-life-sciences-plc-1060156-87447
scraping - https://www.nasdaq.com/markets/ipos/company/tuanche-ltd-1067572-88187
scraping - https://www.nasdaq.com/markets/ipos/company/amci-acquisition-corp-1067890-88211
scraping - https://www.nasdaq.com/markets/ipos/company/boxwood-merger-corp-1068003-88220
scraping - https://www.nasdaq.com/markets/ipos/company/fintech-acquisition-corp-iii-1067609-88195
scraping - https://www.nasdaq.com/markets/ipos/company/weidai-ltd-1061629-87609
scraping - http

In [383]:
print(f"checking length of rows in dataframe {df.shape[0]}, descriptions {len(descriptions)}, employees {len(employees)}, address {len(address)}")

checking length of rows in dataframe 1866, descriptions 1866, employees 1866, address 1866


In [377]:
descr_pretty = [str(x).replace("\\n', '\\n", "").replace("['\\nCompany Description\\n", "").replace("\\n", " ").replace("']","")
                for x in descriptions]

In [381]:
#adding state column
#a US address ends with ", <two-letter state> <5-digit ZIP>"; capture the state
state_pattern = re.compile(r", ([A-Z]{2}) \d{5}")
states = []
for addr in address:
    match = state_pattern.search(str(addr))
    states.append(match.group(1) if match else '')
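
To make the state-extraction rule concrete, here is a small self-contained sketch that applies the same pattern to two addresses shown in the `df.head()` output further down and to one made-up non-US address. The helper name `extract_state` is hypothetical and is not part of the notebook.

```python
import re

# ", " + two capital letters + space + five digits, e.g. ", NY 10170"
STATE_ZIP = re.compile(r", ([A-Z]{2}) \d{5}")


def extract_state(addr) -> str:
    """Return the two-letter state code, or '' when no US-style ending is found."""
    match = STATE_ZIP.search(str(addr))
    return match.group(1) if match else ""


samples = [
    "420 LEXINGTON AVENUESUITE 860NEW YORK, NY 10170",
    "4300 WILSON BOULEVARDSUITE 625ARLINGTON, VA 22203",
    "1 EXAMPLE ROAD, BEIJING 100020",  # made-up non-US address
    float("nan"),                      # missing address
]
for addr in samples:
    print(repr(extract_state(addr)))
```

Addresses without a US-style ", XX 12345" ending, including missing values, yield an empty string, which is why the foreign issuers in the table have a blank `US_state`.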

In [382]:
employees_clean = []
for emp in employees:
    try:
        employees_clean.append(int(emp))
    except ValueError: 
        employees_clean.append(np.nan)

### Adding new columns:

In [384]:
df['employees'] = employees_clean

In [385]:
df['address'] = address
df['address'] = df['address'].fillna('')

In [386]:
df['US_state'] = states

In [387]:
df['descriptions'] = descr_pretty

In [388]:
df['link_nasdaq'] = links

In [389]:
df.head()

| | Company Name | Symbol | Market | Price | Shares | Offer Amount | Date Priced | employees | address | US_state | descriptions | link_nasdaq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CHINA ELECTRIC MOTOR, INC. | CELM | NASDAQ | $4.50 | 5000000 | $22,500,000 | 1/29/2010 | 920.0 | SUNNA MOTOR INDUSTRY PARK, JIAN'ANFUYONG HI-TE... | | Through Shenzhen YPC, we engage in the design,... | https://www.nasdaq.com/markets/ipos/company/ch... |
| 1 | IFM INVESTMENTS LTD | CTC | New York Stock Exchange | $7 | 12487500 | $87,412,500 | 1/28/2010 | 4654.0 | 9/A5, EAST WING, HANWEI PLAZANO.7 GUANGHUA ROA... | | We are a leading comprehensive real estate ser... | https://www.nasdaq.com/markets/ipos/company/if... |
| 2 | ANDATEE CHINA MARINE FUEL SERVICES CORP | AMCF | NASDAQ | $6.30 | 3134921 | $19,750,002 | 1/26/2010 | 128.0 | NO. 68 BINHAI RD DALIAN XIGANG DISTRICTDALIAN ... | | We, through our VIE entities are engaged in th... | https://www.nasdaq.com/markets/ipos/company/an... |
| 3 | CHINA HYDROELECTRIC CORP | CHC | New York Stock Exchange | $16 | 6000000 | $96,000,000 | 1/25/2010 | 336.0 | 420 LEXINGTON AVENUESUITE 860NEW YORK, NY 10170 | NY | We are a fast-growing consolidator, operator a... | https://www.nasdaq.com/markets/ipos/company/ch... |
| 4 | CHESAPEAKE LODGING TRUST | CHSP | New York Stock Exchange | $20 | 7500000 | $150,000,000 | 1/22/2010 | 3.0 | 4300 WILSON BOULEVARDSUITE 625ARLINGTON, VA 22203 | VA | We are a self-advised hotel investment company... | https://www.nasdaq.com/markets/ipos/company/ch... |


In [588]:
df.shape

(1866, 12)

## 2. Converting Column Types

In [11]:
#transforming column from str type to datetime type
df['Date Priced'] = pd.to_datetime(df['Date Priced'])


In [590]:
df['year'] = df['Date Priced'].map(lambda x: x.year)

In [591]:
#converting column from str type to float
df['Price'] = df['Price'].map(lambda x: x.replace("$", ""))
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')

In [592]:
df['Offer Amount'] = df['Offer Amount'].map(lambda x: x.replace("$", "").replace(",", ""))
df['Offer Amount'] = pd.to_numeric(df['Offer Amount'], errors='coerce')

In [602]:
df['Shares'] = pd.to_numeric(df['Shares'], errors='coerce')

In [610]:
df.dtypes

Company Name            object
Symbol                  object
Market                  object
Price                  float64
Shares                 float64
Offer Amount           float64
Date Priced     datetime64[ns]
employees              float64
address                 object
US_state                object
descriptions            object
link_nasdaq             object
year                     int64
dtype: object
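
One detail of the conversions above is worth illustrating: the `Price` cell strips only the dollar sign, so a price written with a thousands separator would be coerced to `NaN`. The sketch below shows a plain-Python parser that handles both symbols; `parse_money` is a hypothetical helper, not something used in this notebook.

```python
import math


def parse_money(text) -> float:
    """Convert strings such as '$22,500,000' or '$4.50' to a float, NaN on failure."""
    cleaned = str(text).replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return math.nan


for raw in ["$4.50", "$22,500,000", "$1,250.75", "N/A"]:
    print(raw, "->", parse_money(raw))
```

The same idea carries over to pandas by removing both `"$"` and `","` before calling `pd.to_numeric(..., errors='coerce')`, as the `Offer Amount` cell already does.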

## 3. Cleaning:

For this project, I am interested in genuinely new operating companies, that is, companies that add value of their own. However, so-called "special purpose acquisition companies" (SPACs) also go public, and I am not interested in them.
<br><br>
So we want to exclude companies:
1. that have "merger" or "acquisition" in their names
2. whose description resembles the following:
"a blank check company formed for the purpose of entering into a merger, share exchange, asset acquisition, stock purchase, recapitalization, reorganization or other similar business combination with one or more businesses or entities."

For more on "blank check" companies, also known as special purpose acquisition companies (SPACs), see https://www.cnbc.com/2017/09/13/this-tech-ipo-has-everyone-talking.html

In [595]:
def clean_from_mergers(dataframe):
    acquisition_cond = dataframe['Company Name'].str.contains('ACQUISITION|MERGER', case=False, regex=True)
    blankcheck_cond = dataframe['descriptions'].str.contains('blank check', case=False)
    mergerdesc_cond = dataframe['descriptions'].str.contains('entering into a merger, share exchange, asset acquisition', case=False)
    return dataframe[~acquisition_cond & ~blankcheck_cond & ~mergerdesc_cond].copy()

In [596]:
def clean_from_etfs(dataframe):
    etf_cond1 = dataframe['descriptions'].str.contains('exchange-traded|Fund Equity|investment strategy', case=False, regex=True)
    etf_cond2 = dataframe['Company Name'].str.contains(' ETF')
    trust = dataframe['Company Name'].str.contains(' TRUST')
    realestate = dataframe['descriptions'].str.contains('properties|leasing|real estate|leased|REIT', case=False, regex=True)
    return dataframe[~etf_cond1 & ~etf_cond2 & ~(trust & ~realestate)].copy()              

In [605]:
df_nomergers = clean_from_mergers(df)
print(f"{df.shape[0]-df_nomergers.shape[0]} companies went to IPO with the purpose of merger/acquisition. These companies will be excluded")

166 companies went to IPO with the purpose of merger/acquisition. These companies will be excluded


In [606]:
df_clean = clean_from_etfs(df_nomergers)
print(f"{df_nomergers.shape[0]-df_clean.shape[0]} companies on IPO were ETFs or trusts (but not REITS). These companies will be excluded")

53 companies on IPO were ETFs or trusts (but not REITS). These companies will be excluded
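
The boolean logic in `clean_from_etfs` is easy to misread, so here is a plain-Python sketch of the same rule applied to single records: a company is dropped if it looks like an ETF, or if it is a trust whose description does not mention real estate. The function name `is_excluded` and all four sample records are made up for illustration.

```python
import re

# Same keyword lists as in clean_from_etfs, matched case-insensitively
ETF_DESC = re.compile(r"exchange-traded|Fund Equity|investment strategy", re.IGNORECASE)
REAL_ESTATE = re.compile(r"properties|leasing|real estate|leased|REIT", re.IGNORECASE)


def is_excluded(name: str, description: str) -> bool:
    """Mirror of the filter: ETF by description or name, or a non-real-estate trust."""
    looks_like_etf = bool(ETF_DESC.search(description)) or " ETF" in name
    non_reit_trust = " TRUST" in name and not REAL_ESTATE.search(description)
    return looks_like_etf or non_reit_trust


# Hypothetical records: (company name, description)
records = [
    ("EXAMPLE GOLD TRUST", "The trust holds physical gold bullion."),
    ("EXAMPLE LODGING TRUST", "We invest in hotel properties."),
    ("EXAMPLE SECTOR ETF", "Tracks an index of technology stocks."),
    ("EXAMPLE SOFTWARE INC", "We sell cloud software."),
]
for name, description in records:
    print(name, "-> excluded" if is_excluded(name, description) else "-> kept")
```

Only the real-estate trust and the ordinary operating company survive, which matches the intent of removing "ETFs or trusts (but not REITs)".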


I will also exclude those companies that were not listed on a formal exchange but traded over the counter.

In [607]:
otc = df_clean[df_clean['Market'] == 'OTCBB'].shape[0]
print(f'there are {otc} companies that were traded over the counter. Will be removed')
df_clean = df_clean[~(df_clean['Market'] == 'OTCBB')]

there are 47 companies that were traded over the counter. Will be removed


In [615]:
print(f"the final IPO list includes - {df_clean.shape[0]} companies")

the final IPO list includes - 1600 companies


In [616]:
df_clean.to_csv('ipo_clean_2010_2018.csv', index=False)

In [8]:
dill.dump_session('ipo_scraper.db')

## Conclusion:

We scraped the companies that went public between 2010 and 2018, together with their descriptions, from the NASDAQ site. We then cleaned the list by removing companies created specifically for mergers and acquisitions, ETFs and trusts (other than REITs), and companies traded over the counter rather than on a stock exchange. This reduced the IPO list from 1866 companies to 1600. <br>
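
As a quick sanity check, the counts printed in the cleaning cells add up to the final figure. The sketch below only replays that arithmetic with the numbers reported in this notebook; the dictionary labels are my own wording.

```python
initial = 1866  # rows scraped from the monthly pricing tables

# Number of rows removed at each cleaning step, as printed above
removed = {
    "mergers/acquisitions (blank check companies)": 166,
    "ETFs and non-REIT trusts": 53,
    "over-the-counter (OTCBB) listings": 47,
}

remaining = initial
for reason, count in removed.items():
    remaining -= count
    print(f"after removing {reason}: {remaining}")

total_removed = sum(removed.values())
print(f"removed {total_removed} of {initial} companies, {remaining} remain")
```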

WHAT'S NEXT: Now it's time to scrape the stock information for the companies in our IPO list, which will be done in a separate notebook using the datareader package and Yahoo Finance.