# Creating an IPO Spider

## Table of Contents

1. Introduction
2. Install & Import Packages
3. Access HTML and Create Selector Object
4. Create & Run IPO Spider
5. Create & Style IPO Dataframe

## 1. Introduction

More than 350 companies have IPO'd on U.S. exchanges so far in 2020, the highest number since 2000, including big names like Palantir, Asana, Snowflake, and McAfee. Others, such as Airbnb, DoorDash, Instacart, and Robinhood, are in the bullpen. It would be great to see what these companies are all about, learn their key products, and have that information in one place.

To source the information, we'll use the public website https://stockanalysis.com/ipos/2020-list/. We'll build a simple web spider that scrapes the link to each company on that page, follows each link, scrapes the company's description, and stores the results in a dictionary whose keys are company names & tickers and whose values are company descriptions. We'll then convert that dictionary into a styled dataframe.

We'll create the spider using scrapy and use both xpath and css methods to access html elements. The heart of the spider is its two parsing methods, both of which use css: the first extracts the links that the spider will follow, and the second scrapes the company name & ticker and the company description from each company's page. Constructing the parsing methods requires time upfront to inspect the site's html and find the right css selector for each target element.

The purpose is to demonstrate how to build a web spider and to create a single table of high-level company descriptions for prospective investors. Potential future areas of scraping, for anyone interested, include financial details such as revenue, EBITDA, and free cash flow.

## 2. Install & Import Packages

In [1]:
import pandas as pd

!pip install scrapy
import scrapy
from scrapy import Selector
from scrapy.crawler import CrawlerProcess
import requests



## 3. Access HTML and Create Selector Object

In [2]:
# Url containing html
url = "https://stockanalysis.com/ipos/2020-list/"

# Get the html source with requests.get; .content returns the raw response body as bytes
html = requests.get(url).content

# Create the Selector object sel from html. Remember that sel.xpath() and sel.css() return a SelectorList
sel = Selector(text = html)

# Check the number of html elements. We use xpath here: // selects descendants at any depth and * is a wildcard that matches any element
print("Number of elements in html document: ", len(sel.xpath('//*')))

Number of elements in html document:  3059


## 4. Create & Run IPO Spider

In [8]:
# Create Spider class
class IPO_Spider(scrapy.Spider):
  name = "ipo_spider"
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url,
                         callback = self.parse_front)
  # 1st parsing method using css - link to follow
  def parse_front(self, response):
    links = response.css('table.maintable td a::attr(href)').extract() # extract the href attribute of every a inside a td inside the table with class maintable
    for link in links: # loop variable named link so it does not shadow the global url
      yield response.follow(url = link,
                            callback = self.parse_pages)
  # 2nd parsing method using css - company name & ticker, description
  def parse_pages(self, response):
    name_ticker = response.css('h1::text').extract_first().strip() # extract and strip text from h1
    description = response.css('p.desc::text').extract_first().strip() # extract and strip text from child p with class desc
    ipo_dict[name_ticker] = description # create dictionary with key name_ticker and value description

# Initialize dictionary outside of Spider class
ipo_dict = dict()

# Run Spider
process = CrawlerProcess()
process.crawl(IPO_Spider)
process.start()

2020-10-29 12:39:43 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrapybot)
2020-10-29 12:39:43 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.9 (default, Aug 31 2020, 12:42:55) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 2.9.2, Platform Linux-4.15.0-118-generic-x86_64-with-redhat-8.2-Ootpa
2020-10-29 12:39:43 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-29 12:39:44 [scrapy.crawler] INFO: Overridden settings:
{}
2020-10-29 12:39:44 [scrapy.extensions.telnet] INFO: Telnet Password: ee3d05d6906d1b89
2020-10-29 12:39:44 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-10-29 12:39:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.
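
A note on the links that the first parsing method follows: hrefs scraped from a page are often relative paths, and response.follow resolves them against the URL of the page they came from before requesting them. The sketch below imitates that resolution step with the standard library's urljoin rather than with scrapy itself; the relative path is a hypothetical example, not a value taken from the site.

```python
from urllib.parse import urljoin

page_url = "https://stockanalysis.com/ipos/2020-list/"
relative_href = "/stocks/abst/"  # hypothetical href, for illustration only

# A path starting with / replaces the whole path of the page url
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)

# An href that is already absolute is returned unchanged
already_absolute = urljoin(page_url, "https://example.com/company/")
print(already_absolute)
```

This is why the spider can pass the scraped href straight to response.follow without building the full URL by hand.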

In [9]:
# Check the spider's dictionary output - as we wanted, the keys are company & ticker and the values are company descriptions
ipo_dict

{'Absolute Software Corporation (ABST)': 'Absolute Software delivers a cloud-based service that supports the management and security of computing devices, applications, and data for a variety of organizations globally. Our differentiated technology is rooted in our patented Persistence® technology, which is embedded in the firmware of laptop, desktop, and tablet devices by almost every major global computer manufacturer. Enabling a permanent digital tether between the endpoint and the organization that distributed it, we provide IT and security personnel with connectivity, visibility, and control, whether a device is on or off the corporate network, and empower them with Self-Healing Endpoint® security to ensure mission critical applications remain healthy and deliver intended value. Our technology is embedded in over a half-billion devices and we currently serve more than 13,000 commercial customers with over 10.8 million activated licenses globally. Our solutions are delivered in a s

## 5. Create & Style IPO Dataframe 

In [13]:
# Create dataframe ipo from dictionary using pd.DataFrame.from_dict
ipo = pd.DataFrame.from_dict(ipo_dict, orient='index', columns=['Description'])

# Dataframe has a lot of companies and text, so let's set row and column display options
pd.set_option('display.max_colwidth', None)
pd.set_option("display.max_rows", None, "display.max_columns", None)

# Left justify the cell text using ipo.style.set_properties, and left justify the column name using set_table_styles with a 'th' selector
ipo.style.set_properties(**{'text-align': 'left'}).set_table_styles([dict(selector='th', props=[('text-align', 'left')])])

# Since there are many rows, to display the whole dataframe without vertical scroll bar, click on Cell --> Cell Outputs --> Toggle Scrolling

,Description
Absolute Software Corporation (ABST),"Absolute Software delivers a cloud-based service that supports the management and security of computing devices, applications, and data for a variety of organizations globally. Our differentiated technology is rooted in our patented Persistence® technology, which is embedded in the firmware of laptop, desktop, and tablet devices by almost every major global computer manufacturer. Enabling a permanent digital tether between the endpoint and the organization that distributed it, we provide IT and security personnel with connectivity, visibility, and control, whether a device is on or off the corporate network, and empower them with Self-Healing Endpoint® security to ensure mission critical applications remain healthy and deliver intended value. Our technology is embedded in over a half-billion devices and we currently serve more than 13,000 commercial customers with over 10.8 million activated licenses globally. Our solutions are delivered in a software-as-a-service (“SaaS”) model, where customers access our service through the cloud-based Absolute service. Our solutions are offered in specific versions for the (i) enterprise and government, and (ii) education verticals. All versions are available in three editions: Visibility, Control, and Resilience, each of which provides a different subset of product features and functionality. We also offer a Home and Office edition of our service, which is targeted to consumers and home office professionals."
"Praxis Precision Medicines, Inc. (PRAX)","Praxis Precision Medicines, a clinical-stage biopharmaceutical company, develops therapies for central nervous system disorders characterized by neuronal imbalance. Its lead product candidates include PRAX-114, an extrasynaptic-preferring GABAA receptor positive allosteric modulator that is in Phase IIa clinical trial for the treatment of major depressive disorder and perimenopausal depression; and PRAX-944, a selective small molecule inhibitor of T-type calcium channels, which is in Phase IIa clinical trial for the treatment of essential tremor. The company is also developing PRAX-562, a persistent sodium current blocker that is in Phase I clinical trial to treat severe pediatric epilepsy and adult cephalgia; PRAX-222, an antisense oligonucleotide for patients with gain-of-function (GOF) SCN2A epilepsy; and KCNT1 program for the treatment of KCNT1 GOF epilepsy. It has a cooperation and license agreement with RogCon Inc.; a license agreement Purdue Neuroscience Company; and a research collaboration, option, and license agreement with Ionis Pharmaceuticals, Inc. The company was incorporated in 2015 and is based in Cambridge, Massachusetts."
"Eastern Bankshares, Inc. (EBC)","Eastern Bankshares provides commercial banking products and services primarily to retail, commercial, and small business customers. The company offers interest-bearing and non interest-bearing checking deposits, money market deposits, savings deposits, and certificates of deposits. It also offers commercial and industrial loans, commercial real estate and construction loans, business banking loans, residential real estate loans, and home equity and other consumer loans. Its personal banking products and services also include debit and credit cards; mortgage and personal loans; personal and cash reserve lines of credit; auto and student loans; retirement planning products and services; and online learning services in the areas of finance. The company's business banking products and services also include preferred term loans, small business administration loans, lines of credit, cash reserves, cash management, merchant services, escrow express service, correspondent and government banking, international banking, interest on lawyers trust accounts services, products and services for not-for-profit and healthcare, and business telephone banking. In addition, it offers trust and investment products and services; community development and asset-based lending services; financial planning, portfolio management, wealth management, private banking, and fiduciary and retirement products and services; and treasury management, electronic banking, interest rate protection, and foreign exchange products and services. Further, the company acts as an independent insurance agent and offers commercial, personal, and employee benefits insurance products to individual and commercial clients. It operates through 89 banking offices located in eastern Massachusetts and southern and coastal New Hampshire. Eastern Bankshares, Inc. was formerly known as Eastern Bank Corporation. The company was founded in 1818 and is headquartered in Boston, Massachusetts."
"Eargo, Inc. (EAR)","Eargo is a medical device company dedicated to improving the quality of life of people with hearing loss. We developed the Eargo solution to create a hearing aid that consumers actually want to use. Our innovative product and go-to-market approach address the major challenges of traditional hearing aid adoption, including social stigma, accessibility and cost. We believe our Eargo hearing aids are the first and only virtually invisible, rechargeable, completely-in-canal, FDA regulated, exempt Class I device for the treatment of hearing loss. Our rapid pace of innovation is enabled by our deep industry and technical expertise across mechanical engineering, product design, audio processing, clinical and hearing science, consumer electronics and embedded software design, and is supported by our strategic intellectual property portfolio. Our differentiated, consumer-first approach empowers consumers to take control of their hearing by improving accessibility, with personalized, high-quality hearing support from licensed hearing professionals. We believe that our differentiated hearing aids, consumer-oriented approach and strong brand have fueled the rapid adoption of our products and high customer satisfaction, as evidenced by over 42,000 Eargo hearing aid systems sold, net of returns, as of June 30, 2020."
Spartacus Acquisition Corporation (TMTS),"Spartacus Acquisition Corporation is a newly organized blank check company formed for the purpose of effecting a merger, capital stock exchange, asset acquisition, stock purchase, reorganization or similar business combination with one or more businesses. While we may pursue an initial business combination target in any stage of its corporate evolution or in any industry or sector, we intend to focus our search on telecommunications, media and technology (“TMT”) companies. Our management team and board of directors have had significant success sourcing, acquiring, expanding and monetizing these types of companies. We believe this experience makes us exceptionally well suited to identify, negotiate and successfully execute an initial business combination with the ultimate goal of generating attractive returns for our shareholders. We believe our management team is well positioned to identify and evaluate target businesses within the TMT industry that would benefit from being a public company and from access to our expertise. We believe we can achieve this mission by utilizing our team’s extensive experience in growing and operating TMT companies, as well as our team’s broad network of contacts in the TMT sector."
"Kiromic BioPharma, Inc. (KRBP)","Kiromic BioPharma, a target discovery and gene editing company, focuses on developing immuno-oncology therapeutics for the treatment of blood cancers and solid tumors. Its product portfolio include ALEXIS AIDT-1, an allogenic CAR cell product candidate targeting AIDT-1; ALEXIS AIDT-2 EOC, an allogenic CAR cell product candidate targeting AIDT-2; ALEXIS AIDT-2 MPM (malignant pleural mesothelioma), an allogenic CAR/NKT-Like cell product candidate targeting AIDT-2; and PD-1-AR, a check point inhibitor for solid tumors, as well as oral healthcare products, such as mouthwash. The company was formerly known as Kiromic, Inc. and changed its name to Kiromic BioPharma, Inc. in December 2019. Kiromic BioPharma, Inc. was founded in 2006 and is headquartered in Houston, Texas."
Opthea Limited (OPT),"Opthea Limited, a biotechnology company, develops and commercializes therapies primarily for eye disease in Australia. The company's development activities are based on the intellectual property portfolio covering Vascular Endothelial Growth Factors (VEGF) VEGF-C, VEGF-D, and VEGF Receptor-3 for the treatment of diseases associated with blood and lymphatic vessel growth, as well as vascular leakage. Its lead molecule is OPT-302, a soluble form of VEGFR-3 for the treatment of wet age-related macular degeneration and diabetic macular edema. The company was formerly known as Circadian Technologies Limited and changed its name to Opthea Limited in December 2015. Opthea Limited was incorporated in 1984 and is based in South Yarra, Australia."
"Tarsus Pharmaceuticals, Inc. (TARS)","Tarsus Pharmaceuticals, a clinical-stage biopharmaceutical company, focuses on the development and commercialization of novel therapeutic candidates for ophthalmic conditions. Its lead product candidate is TP-03, a novel therapeutic that is in Phase IIb/III for the treatment of blepharitis caused by the infestation of Demodex mites, as well as to treat meibomian gland disease. The company is also developing TP-04 for the treatment of rosacea; and TP-05 for Lyme prophylaxis and community malaria reduction. Tarsus Pharmaceuticals, Inc. was founded in 2016 and is headquartered in Irvine, California."
Bridgetown Holdings Limited (BTWN),"Bridgetown Holdings is a blank check company incorporated as a Cayman Islands exempted company and formed for the purpose of effecting a merger, share exchange, asset acquisition, share purchase, reorganization or similar business combination with one or more businesses. While we may pursue an acquisition or a business combination target in any business or industry, we intend to focus our search on a target with operations or prospective operations in the technology, financial services, or media sectors, which we refer to as the “new economy sectors”, in Southeast Asia. We believe that Southeast Asia is entering a new era of economic growth, particularly in the new economy sectors, which we expect will result in attractive initial business combination opportunities for attractive risk-adjusted returns. The Association of Southeast Asian Nations, or ASEAN, is made up of countries in Southeast Asia including Indonesia, Thailand, Singapore, Vietnam, the Philippines, Malaysia, Brunei Darussalam, Myanmar, Cambodia and Laos. With a population of 649 million and a nominal GDP of approximately $3 trillion in 2018, ASEAN is fast becoming a major regional economic force and a driver of global growth. ASEAN remains one of the fastest growing regions in the world with economic growth continuing to average 5.4% per annum, and is estimated by ASEAN to become the fourth-largest economy in the world by 2030 after the United States, China, and the European Union."
"Aligos Therapeutics, Inc. (ALGS)","Aligos Therapeutics, a biopharmaceutical company, focuses to develop novel therapeutics to address unmet medical needs in viral and liver diseases. Its lead drug candidate is ALG-010133, a synthetic oligonucleotide that is in Phase I clinical trial for the treatment of chronic hepatitis B (CHB). The company is also developing ALG-000184, a capsid assembly modulator to treat CHB; ALG-020572, a oligonucleotide for the treatment of CHB; ALG-125097, an siRNA drug candidate to treat CHB; and ALG-055009, a small molecule THR-ß agonist for the treatment of non-alcoholic steatohepatitis. Aligos Therapeutics, Inc. was founded in 2018 and is headquartered in South San Francisco, California."


In [12]:
# Checking the shape, we have 353 IPOs in 2020 
ipo.shape

(353, 1)
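
To see what orient='index' does in pd.DataFrame.from_dict, here is a small sketch on a toy dictionary with the same shape as the spider's output (strings as keys, strings as values). The two companies are invented, and naming the index with rename_axis is an optional extra that the notebook above does not do.

```python
import pandas as pd

# Toy stand-in for ipo_dict; both entries are invented
toy_dict = {
  "Example Corp (EXMP)": "Makes examples.",
  "Sample Inc. (SMPL)": "Sells samples.",
}

# orient='index' turns each key into a row label and each value into a row
toy = pd.DataFrame.from_dict(toy_dict, orient='index', columns=['Description'])

# Optional: give the index a name so it shows a header when displayed or exported
toy = toy.rename_axis("Company (Ticker)")

print(toy.shape)
print(toy.loc["Sample Inc. (SMPL)", "Description"])
```

With one row per key and a single Description column, the shape is (number of companies, 1), which matches the (353, 1) reported for the full dataframe.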