# Web scraping
*Done by Océane Salmeron, December 2020*

This notebook is for explanation only, so it does not perform the Excel export.

## 1. Import libraries

In [25]:
import pandas as pd
import numpy as np
import time
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

## 2. Load page

The **Ycombinator.com/companies** page uses what is called infinite scrolling: it waits for the user to reach the bottom of the page before loading more content.

We therefore need a function that keeps scrolling until it reaches the end. When the page height measured after a scroll equals the height measured before it, the program knows that all the content has been loaded.

To execute this we need a driver. To improve performance, we set the driver capability **"pageLoadStrategy"** to **eager**. With this setting the driver returns as soon as the DOM is ready, so we can retrieve information without waiting for the full page to load and therefore avoid a TimeoutException.

In [26]:
# Set url
url = 'https://www.ycombinator.com/companies/'

# Set capabilities
caps = DesiredCapabilities().CHROME
caps["pageLoadStrategy"] = "eager"

# Instantiate driver
driver = webdriver.Chrome(desired_capabilities=caps, executable_path='../chromedriver')

driver.get(url)
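
The `desired_capabilities` and `executable_path` arguments used above belong to the Selenium 3 API and were deprecated in Selenium 4. As a rough sketch, assuming Selenium 4 is installed, the same eager page-load strategy could be set through an `Options` object instead. The helper name `make_eager_driver` is hypothetical, and the default driver path simply mirrors the one used in this notebook.

```python
def make_eager_driver(driver_path="../chromedriver"):
    """Build a Chrome driver that returns control once the DOM is ready."""
    # Imported inside the function so that defining it does not require
    # Selenium (assumption: Selenium 4.x is available when it is called).
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    # Same setting as caps["pageLoadStrategy"] = "eager"
    options.page_load_strategy = "eager"
    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)
```

It would be used as `driver = make_eager_driver()` followed by `driver.get(url)`.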

In [27]:
# Scroll to the bottom repeatedly until the page height stops growing
def scroll_to_end(driver):
    prev_len = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to load the next batch of companies
        time.sleep(0.5)
        new_len = driver.execute_script("return document.body.scrollHeight")
        # Same height as before: nothing new was loaded, so we reached the end
        if new_len == prev_len:
            break
        prev_len = new_len
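
A fixed 0.5 s pause can end the loop too early if a batch of companies takes longer than that to load. The sketch below is one possible variant, not the method used in this notebook: it only stops once the height has stayed unchanged for several consecutive checks. The names `scroll_until_stable`, `pause` and `patience` are assumptions for illustration, and `FakeDriver` is a stand-in so the logic can be tried without a browser.

```python
import time


def scroll_until_stable(driver, pause=0.5, patience=3):
    """Scroll down until the page height is unchanged for `patience` checks."""
    prev_height = driver.execute_script("return document.body.scrollHeight")
    unchanged = 0
    while unchanged < patience:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == prev_height:
            unchanged += 1  # nothing new yet, give the page another chance
        else:
            unchanged = 0   # new content arrived, reset the counter
            prev_height = new_height
    return prev_height


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver whose page grows on scroll."""

    def __init__(self, heights):
        self.heights = list(heights)      # heights reported after each scroll
        self.current = self.heights.pop(0)

    def execute_script(self, script):
        if script.startswith("return"):
            return self.current
        if self.heights:                  # a scroll triggers the next "load"
            self.current = self.heights.pop(0)


# The second load is slow: the height stays at 2000 for one check.
final_height = scroll_until_stable(FakeDriver([1000, 2000, 2000, 3000]), pause=0)
```

With this simulated page, a loop that stops at the first unchanged height would end at 2000, while the variant above continues until the height settles.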

In [28]:
scroll_to_end(driver)

Let's retrieve all the elements in the box that contains the startups.

In [29]:
results = driver.find_elements_by_xpath("//a[@class='SharedDirectory-module__company___AVmr6 no-hovercard']")

The information displayed on this page is not enough. We need to visit each startup's profile page. To do that, we loop over the elements to retrieve their profile IDs.

In [30]:
def get_links(x):
    links = []
    for result in x:
        s = result.get_attribute('href')
        # Only keep the ID, i.e. the last segment of the URL
        links.append(s.split('/')[-1])

    return links

In [31]:
links=get_links(results)
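
Note that `s.split('/')[-1]` returns an empty string if an `href` ends with a slash, and keeps any query string attached to the ID. A minimal sketch of a more tolerant extraction, assuming profile links end with the ID as their last path segment (the helper name `profile_id` and the sample URLs are made up for illustration):

```python
from urllib.parse import urlparse


def profile_id(href):
    """Return the last non-empty path segment of a URL."""
    path = urlparse(href).path            # drops the query string and fragment
    return path.rstrip("/").split("/")[-1]


ids = [profile_id(h) for h in (
    "https://www.ycombinator.com/companies/123",
    "https://www.ycombinator.com/companies/123/",
    "https://www.ycombinator.com/companies/123?ref=home",
)]
```

All three sample URLs yield the same ID, `"123"`.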

Once we have all the IDs, we visit each profile page and retrieve the following information:
- Company name
- Tagline
- Description
- Website
- Launch year
- Team size
- Location
- Socials

In [32]:
def retrieve_info(driver):
    # Facts are read by position: launch year, team size, location
    facts = driver.find_elements_by_css_selector(".facts div span")
    socials = driver.find_elements_by_css_selector(".social")

    item = {'name': driver.find_element_by_class_name("heavy").text,
            'info': driver.find_element_by_css_selector(".main-box h3").text,
            'description': driver.find_element_by_class_name("pre-line").text,
            'website': driver.find_element_by_css_selector(".main-box .links a").get_attribute('href'),
            'launch_year': facts[0].text,
            'team_size': facts[1].text,
            'location': facts[2].text
            }

    # Use the last CSS class of each social link as the column name
    for social in socials:
        item[social.get_attribute('class').split()[-1]] = social.get_attribute('href')

    return item
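
`retrieve_info` assumes that every profile exposes at least three facts; a page with fewer would raise an `IndexError` on `facts[2]`. As a hedged sketch, a small helper could pad missing facts with empty strings, which the cleaning step later in this notebook turns into NaN. The helper name `facts_to_fields` is hypothetical, it works on plain strings (the `.text` of each element), and it assumes that any missing facts are the trailing ones.

```python
FACT_KEYS = ("launch_year", "team_size", "location")


def facts_to_fields(fact_texts, keys=FACT_KEYS):
    """Map fact strings to named fields, padding missing ones with ''."""
    padded = list(fact_texts)[:len(keys)]
    padded += [""] * (len(keys) - len(padded))
    return dict(zip(keys, padded))


# Made-up example: a profile that only lists two facts
fields = facts_to_fields(["2008", "4000"])
```

Inside `retrieve_info`, it could be applied as `item.update(facts_to_fields([f.text for f in facts]))`.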

In [33]:
dic = []  # one dict per company
for link in links:
    # Open the profile page and extract its fields
    driver.get(url + link)
    item = retrieve_info(driver)
    dic.append(item)
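
A single profile that fails to load, or that lacks an expected element, would stop the loop above. One possible way to keep going is sketched here with a hypothetical `collect` helper. It takes the page-fetching step as a function so the idea can be tried without a browser; `fake_fetch` is a stand-in for `driver.get(url + link)` followed by `retrieve_info(driver)`.

```python
def collect(links, fetch):
    """Call fetch(link) for each link; return (items, failed_links)."""
    items, failed = [], []
    for link in links:
        try:
            items.append(fetch(link))
        except Exception as exc:  # broad on purpose: record and continue
            failed.append((link, repr(exc)))
    return items, failed


def fake_fetch(link):
    """Stand-in for loading a profile page and extracting its fields."""
    if link == "bad":
        raise ValueError("missing element")
    return {"name": link}


items, failed = collect(["a", "bad", "b"], fake_fetch)
```

The failed links are kept alongside the error message so they can be retried or inspected afterwards.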

Now let's create our dataframe to prepare for the export.

In [34]:
data = pd.DataFrame.from_dict(dic)
# Replace empty or whitespace-only strings with NaN
data.replace(r'^\s*$', np.nan, regex=True, inplace=True)
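
To see which cell values the pattern `^\s*$` treats as missing, it can be checked with the standard `re` module. This is only an illustration on made-up sample strings, not part of the scraping pipeline.

```python
import re

BLANK = re.compile(r"^\s*$")

samples = ["", "   ", "\t", "San Francisco", " 2008 "]
# True where the whole string is empty or whitespace only
blank_flags = [bool(BLANK.match(s)) for s in samples]
```

Only the first three samples are flagged; a value with surrounding spaces such as `" 2008 "` is kept as it is.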

In [35]:
data.head()

| | name | info | description | website | launch_year | team_size | location | linkedin | twitter | facebook | crunchbase |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DoorDash | Restaurant delivery. | Founded in 2013, DoorDash is a San Francisco-b... | http://doordash.com/ | | 1600 | San Francisco | https://www.linkedin.com/company/doordash/ | http://twitter.com/doordash | https://www.facebook.com/DoorDash/ | https://www.crunchbase.com/organization/doordash |
| 1 | Dropbox | Backup and share files in the cloud. | Dropbox is building the world’s first smart wo... | http://dropbox.com/ | 2008.0 | 4000 | San Francisco | https://www.linkedin.com/in/drewhouston/ | https://twitter.com/drewhouston | https://www.facebook.com/Dropbox/ | https://www.crunchbase.com/organization/dropbox |
| 2 | Airbnb | Book accommodations around the world | Founded in August of 2008 and based in San Fra... | http://airbnb.com/ | 2008.0 | 5000 | San Francisco | https://www.linkedin.com/in/blecharczyk/ | https://twitter.com/jgebbia | https://www.facebook.com/airbnb/ | https://www.crunchbase.com/organization/airbnb |
| 3 | PagerDuty | Notify you about server troubles. | PagerDuty is an operations performance platfor... | http://pagerduty.com/ | | 775 | San Francisco | https://www.linkedin.com/in/baskarfx/ | | | https://www.crunchbase.com/organization/pagerduty |
| 4 | Embark Trucks | We build self-driving semi trucks. | We are a San Francisco based team building sel... | http://embarktrucks.com/ | 2016.0 | 100 | San Francisco | https://ca.linkedin.com/in/rodriguesalex | | | https://www.crunchbase.com/organization/varden... |


In [36]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         1000 non-null   object
 1   info         985 non-null    object
 2   description  932 non-null    object
 3   website      999 non-null    object
 4   launch_year  621 non-null    object
 5   team_size    984 non-null    object
 6   location     985 non-null    object
 7   linkedin     882 non-null    object
 8   twitter      666 non-null    object
 9   facebook     389 non-null    object
 10  crunchbase   777 non-null    object
dtypes: object(11)
memory usage: 86.1+ KB
