# Introduction
Israel is well known as the startup nation of the world, with a great entrepreneurial community and great minds working together on new technological breakthroughs.  

What makes Israel the startup nation? Well, that answer was not by Avinoam 🤓, so instead we'll try to see whether the startup you thought of with your friends, or the one your dad thought about while scrolling Facebook on the toilet, is actually worth pitching on Shark Tank.

Our goal is to try and see if your dream might come true ✨.
To be honest, success is definitely relative, and we try not to dream too big and end up disappointed. So we defined success as meeting one or more of the following criteria:
- **Raised over 💲4M**
- **Got acquired (WooHoo with EXIT 🥳🤑)**
- **Is an active, public company**
- **Is active with a released product**
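
To make these criteria concrete, here is a minimal sketch of how a single scraped record could be labeled. The field names `raised`, `status` and `product stage` follow the scraped output; the `acquired` and `public` flags, the helper names and the parsing of strings like `$96M` are assumptions made for illustration, not the labeling code actually used.

```python
import re

RAISED_THRESHOLD = 4_000_000  # "raised over $4M"
MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_raised(text):
    """Turn a string such as '$350K' or '$96M' into dollars; 0.0 if missing."""
    if not isinstance(text, str):
        return 0.0  # NaN / None: no funding reported
    match = re.fullmatch(r"\$?\s*(\d+(?:\.\d+)?)\s*([KMB])?", text.strip().upper())
    if match is None:
        return 0.0
    return float(match.group(1)) * MULTIPLIERS.get(match.group(2), 1)


def is_success(company):
    """Apply the four criteria to one record (a plain dict)."""
    active = company.get("status") == "active"
    released = str(company.get("product stage", "")).lower() == "released"
    return (
        parse_raised(company.get("raised")) > RAISED_THRESHOLD
        or bool(company.get("acquired"))             # hypothetical flag
        or (active and bool(company.get("public")))  # hypothetical flag
        or (active and released)
    )
```

For example, a record with `raised` equal to `"$96M"` counts as a success on the first criterion alone, while an active company still in Beta with `"$350K"` raised does not.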

There is a great website with information about the Israeli startup ecosystem, so we decided to see if our 🤖 can scrape it and give us some interesting insights 💡.

The [Start-Up Nation Finder](https://finder.startupnationcentral.org/) website lists over 10,000 companies, each with its own story.
The story begins when the company is founded.
Then the founders need to develop a product, raise money and start selling.

On the journey ahead, we'll take you to the core of the Start-Up Nation's data.

So let us introduce you to 🤖 **Barurly**, our USC (Unique Selenium Companion).

## 🤖 Barurly In Action

So we decided to scrape data from [Start-Up Nation Finder](https://finder.startupnationcentral.org/).

The main structure of the website is as follows:

### Start-Up Nation Finder -> Companies Page -> Company Page
Each **Company Page** shows when the company was founded, how much money it raised, its status, the markets it targets and the technologies it uses.

First, let's initialize 🤖 **Barurly** with the following blocks:

In [2]:
from selenium import webdriver

from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager

from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
from selenium.webdriver import ActionChains
from selenium.webdriver.support.relative_locator import locate_with

import pandas as pd
import numpy as np
import time
import random

In [3]:
def get_driver():
    options = webdriver.FirefoxOptions()
    # options.headless = True
    driver = webdriver.Firefox(options = options, service=Service(GeckoDriverManager().install()))
    return driver

Here we define a rest function that waits for a random amount of time, so that the website's defence systems do not suspect 🤖 Barurly's activity.  

In [4]:
def rest(a,b) -> None:
    time.sleep(random.uniform(a,b))

The following section is where 🤖 Barurly scrapes the data from each company page.<br/><br/>
We asked him to fetch the<br/>
- profile data  
- products and geomarkets  
- funding information  
- listing data  
- classification data  
- tags and markets  
<br/><br/>Each function tells him where to look on the page in order to collect the relevant data.

In [1]:
def get_data_from_page(page) -> pd.DataFrame:
    """Scrape every section of a company page.

    `page` is a WebDriver object already pointed at the company page.
    Returns a one-row DataFrame.
    """

    d = {}
    d.update(get_profile_data(page))
    d.update(get_products_and_geomarkets(page))
    d.update(get_fund_data(page))
    d.update(get_listing_data(page))
    d.update(get_clasiffication_data(page))
    d.update(get_tags_and_markets(page))

    return pd.DataFrame([d])

In [6]:
def get_products_and_geomarkets(driver) -> dict:
    """Map each section title in the profile card to the text of the div below it."""
    d = {}
    company_profile = driver.find_element(By.CLASS_NAME,"zyno-card-4")
    titles = company_profile.find_elements(By.CLASS_NAME,"section-title")
    for title in titles:
        # Relative locator: the nearest <div> positioned below the title element.
        value = driver.find_element(locate_with(By.TAG_NAME,"div").below(title))
        d.update({title.text.lower() : value.text.lower()})
    return d

In [7]:
def get_profile_data(page) -> dict:

    name = page.find_element(By.CLASS_NAME,"top-profile-section").find_element(By.CLASS_NAME,"title").text
    about = page.find_element(By.CLASS_NAME,"about").text
    d = {'company_name' : name, 'company_about': about}

    company_profile = page.find_element(By.CLASS_NAME,"zyno-card-4")
    for info in company_profile.find_elements(By.CLASS_NAME,"metadata-item"):
        var = info.find_element(By.CLASS_NAME,"item-bottom").text
        value = info.find_element(By.CLASS_NAME,"metadata-description").text
        d.update({var.lower() : value})

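    # Possible statuses: PRIVATE / PUBLIC / ACQUIRED / NOT ACTIVE.
    # Only NOT ACTIVE is detected here; everything else is recorded as active.
    status = 'active'

    try:
        topbar = page.find_element(By.CLASS_NAME,"top-bar-wrapper")
        if "Not Active" in topbar.text:
            status = 'not_active'
    except Exception:
        # No top bar on this page: keep the default status.
        pass

    d.update({'status' : status})

    return d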

In [8]:
def get_clasiffication_data(page) -> dict:
    d = {} 
    classifications = page.find_element(By.CLASS_NAME, "js-startup-classification-section").find_elements(By.CLASS_NAME,"classification-item")
    classifications_list = []

    for cls in classifications:

        elements = cls.find_elements(By.CLASS_NAME,"js-lead-item")
        title = "_".join(cls.find_element(By.CLASS_NAME,"classification-title").text.lower().split(" "))
        for elm in elements:
            elm_title = elm.find_element(By.CLASS_NAME,"row-container").text
            classifications_list.append(f"{title}_{elm_title}")

            for subject in elm.find_elements(By.CLASS_NAME,"js-child-item"):
                classifications_list.append(f"{title}_{elm_title}_{subject.text}")

    for elm in classifications_list:
        d.update({elm : 1})
    
    return d
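
The classification scraper flattens a nested tree (section title, lead item, child items) into flat one-hot column names. Here is a minimal sketch of that naming scheme on plain strings, with no browser involved. The function name `classification_keys` and the sample values are made up for illustration.

```python
def classification_keys(title, leads):
    """Build one-hot keys following the naming used in get_clasiffication_data.

    title: section title, e.g. "Core Technology" (hypothetical value)
    leads: mapping of lead item -> list of child items
    """
    prefix = "_".join(title.lower().split(" "))
    keys = {}
    for lead, children in leads.items():
        keys[f"{prefix}_{lead}"] = 1
        for child in children:
            keys[f"{prefix}_{lead}_{child}"] = 1
    return keys


example = classification_keys(
    "Core Technology", {"AI": ["Machine Learning", "NLP"]}
)
print(sorted(example))
```

Only the section title is lower-cased and underscored. Lead and child names keep their original spelling, so every distinct label becomes its own column.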

In [9]:
def get_tags_and_markets(page) -> dict:
    """Scrape TAGS and TARGET MARKETS as one-hot entries."""

    d = {}
    tags_and_markets_list = page.find_elements(By.CLASS_NAME, "tags-wrapper")

    # scrape TAGS (first wrapper, if the page has one)
    try:
        tags = [tag.text for tag in tags_and_markets_list[0].find_elements(By.CLASS_NAME,"label")]
        for tag in tags:
            d.update({f"tag_{tag}": 1})
    except Exception:
        pass

    # scrape TARGET MARKETS (second wrapper, if the page has one)
    try:
        markets = [market.text for market in tags_and_markets_list[1].find_elements(By.CLASS_NAME,"label")]
        for market in markets:
            d.update({f"targetmarket_{market}": 1})
    except Exception:
        pass

    return d

In [10]:
def get_fund_data(page) -> dict:
    d = {}
    try:
        fund_data = [x.text for x in page.find_element(By.CLASS_NAME, "funding-metadata").find_elements(By.CLASS_NAME,"title")]
    except Exception:
        fund_data = []

    # Pad with NaN so a page with fewer than four funding fields does not raise IndexError.
    fund_data = (fund_data + [np.nan] * 4)[:4]

    d.update({'fund_stage':fund_data[0], 'total_raised':fund_data[1], 'total_rounds':fund_data[2], 'investors': fund_data[3]})
    return d

In [11]:
def get_listing_data(page) -> dict:
    d = {}
    try:
        topbar = page.find_element(By.CLASS_NAME,"top-bar-wrapper")
        if "Public" in topbar.text:
            ipo_price = topbar.find_element(By.CLASS_NAME,"right").find_element(By.CLASS_NAME,'bold').text
            d.update({'ipo_price':ipo_price})
    except Exception:
        # Top bar or price element missing: record the price as unknown.
        d.update({'ipo_price':np.nan})

    return d
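
An exception raised inside any one section scraper propagates to the caller, so a page that lacks a single section can be lost entirely. One possible alternative, sketched below under the assumption that losing one section is preferable to losing the whole page, is a small wrapper that returns an empty dict when a section fails. `safe_section` and the two stand-in scrapers are hypothetical names, not part of the notebook.

```python
def safe_section(scraper, page):
    """Run one section scraper; return {} instead of raising on failure."""
    try:
        return scraper(page)
    except Exception as exc:
        print(f"{scraper.__name__} failed: {exc}")
        return {}


# Stand-in scrapers so the sketch runs without a browser.
def fake_profile(page):
    return {"company_name": page["name"]}


def fake_classification(page):
    raise LookupError("Unable to locate element")


page = {"name": "Example Ltd"}  # stand-in for a WebDriver
row = {}
for scraper in (fake_profile, fake_classification):
    row.update(safe_section(scraper, page))
```

After the loop, `row` still holds the profile data even though the classification scraper failed.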

This is where we feed 🤖 Barurly the links to all of the company pages.

In [12]:
links = []

with open("data/full_links_list.txt", "r") as f:
    for line in f:
        links.append(line.strip('\n'))

len(links)

13102
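
With this many links, the scrape is easier to run in slices, which is why the loop below starts from `links[5000:]`. As a rough sketch, slicing and de-duplicating a link list could look like this; `split_batches`, `dedupe` and the sample values are illustrative assumptions, not code from the original run.

```python
def split_batches(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


def dedupe(items):
    """Drop repeated entries while keeping first-seen order."""
    return list(dict.fromkeys(items))


sample_links = ["a", "b", "a", "c", "d", "e", "b"]  # placeholder values
batches = split_batches(dedupe(sample_links), 2)
```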

<h4>Here 🤖 Barurly collects the companies' information: we feed him the remaining links and he processes each company page.</h4>

In [13]:
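frames = []  # one-row DataFrames, concatenated once at the end
driver = get_driver()
LONG_WAIT = 10 # Minutes

try:
    for i, link in enumerate(links[5000:]):
        try:
            driver.get(link)
            frames.append(get_data_from_page(driver))
            if i % 100 == 0:
                print(i+1)
            if (i+1) % 500 == 0:
                print(f"Sleeping for {LONG_WAIT} minutes")
                time.sleep(60 * LONG_WAIT)
            else:
                rest(2,5)
        except Exception as e:
            print(f"Error on page {i} -> {str(e)}")
finally:
    # Close the browser even if the loop is interrupted.
    driver.quit()

# Concatenating once avoids re-copying the growing DataFrame on every page.
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
df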



Current firefox version is 100.0
Get LATEST geckodriver version for 100.0 firefox
Driver [C:\Users\matan\.wdm\drivers\geckodriver\win64\v0.31.0\geckodriver.exe] found in cache


1
101
Error on page 102 -> Message: Unable to locate element: .js-startup-classification-section
Stacktrace:
WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:183:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.jsm:395:5
element.find/</<@chrome://remote/content/marionette/element.js:300:16

201
301
401
Sleeping for 10 minutes
501
601
701
801
901
Error on page 993 -> Message: Unable to locate element: .js-startup-classification-section
Stacktrace:
WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:183:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.jsm:395:5
element.find/</<@chrome://remote/content/marionette/element.js:300:16

Error on page 996 -> Message: Unable to locate element: .js-startup-classification-section
Stacktrace:
WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:183:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.jsm:395:5
element.find/</<@chrome://remote/cont

| | company_name | company_about | founded | business model | employees | funding stage | raised | product stage | status | geographical markets | ... | tag_lead-acid-batteries | tag_car-audio | tag_fuel-management | tag_trip | tag_derms | tag_flexible-heating-fabric | tag_outwear | tag_cars-heating | tag_medical-heat-treatment | tag_augmented-sound |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CargoZone Workspace | CargoZone specializes in helping organizations... | 1/2020 | B2B | 1-10 | Pre-Seed | $350K | Beta | active | americas, north america, europe, asia, israel | ... | | | | | | | | | | |
| 1 | Hyperspace | Hyperspace provides a purpose-built, high-perf... | 2/2021 | B2B | 1-10 | Pre-Seed | | Alpha | active | | ... | | | | | | | | | | |
| 2 | DataWiz | DataWiz is developing a platform using busines... | 7/2021 | B2B | 1-10 | Bootstrapped | | R&D | active | | ... | | | | | | | | | | |
| 3 | TUATARIX | Tuatarix provides a complete, end-to-end digit... | 7/2021 | B2B | 1-10 | Bootstrapped | | Customer development | active | | ... | | | | | | | | | | |
| 4 | Rupert | Rupert is a platform that integrates with anal... | 5/2019 | B2B | 1-10 | Seed | | Beta | active | | ... | | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8055 | TriEye | TriEye is a fabless semiconductor company that... | 11/2017 | B2B | 51-200 | ROUND A | $96M | R&D | active | global | ... | | | | | | | | | | |
| 8056 | LYNX Smartcars | LYNX is developing software for connected and ... | 1/2016 | B2B | 1-10 | Bootstrapped | | R&D | not_active | | ... | | | | | | | | | | |
| 8057 | Deeyook Location Technologies | Deeyook seeks to redefine location technology ... | 3/2017 | B2B | 11-50 | Seed | | Released | active | global | ... | | | | | | | | | | |
| 8058 | SafeCue | SafeCue combines the power of deep learning wi... | 1/2016 | B2B | 1-10 | Seed | $500K | Beta | not_active | asia, germany, india, united states | ... | | | | | | | | | | |
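
The pacing of the scraping loop above (a short random pause after each page and a long pause after every 500th) can be isolated into a helper that only decides how long to wait. This is a sketch: `pause_seconds` and its parameter names are our own, with defaults copied from the loop.

```python
import random


def pause_seconds(i, long_every=500, long_wait_min=10, short_range=(2, 5)):
    """Return how many seconds to wait after scraping page number i (0-based)."""
    if (i + 1) % long_every == 0:
        return 60 * long_wait_min          # long cool-down
    return random.uniform(*short_range)    # short randomized pause


waits = [pause_seconds(i) for i in range(1000)]
```

Keeping the decision separate from `time.sleep` makes the schedule easy to inspect or adjust without opening a browser.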


The new DataFrame is complete and we can save it.

In [14]:
df.to_csv('df5000_13102.csv')

Let's join them together:

In [None]:
df_complete = pd.concat([pd.read_csv('df0_5000.csv'), df])
df_complete.shape

<h4>🤖 Barurly has successfully collected and created a DataFrame with 13,048 rows and 2,870 columns, which is 🤖🤔💭 ... 37,447,760 data points!
<br /><br/>
And the work for 🤖 Barurly is done!</h4>
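
The data-point figure is simply rows times columns, which is quick to verify:

```python
rows, columns = 13_048, 2_870
data_points = rows * columns
print(f"{data_points:,}")  # 37,447,760
```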

In [16]:
df_complete.to_csv('df_complete.csv')