# The Crawler

The crawler is built from three functions, extractData(), day1_crawler() and compare_records(), plus the main code that calls them.
We use Selenium for crawling, so a ChromeDriver binary was downloaded to May's personal laptop and the entire process ran on that laptop.
extractData() comes first: as the name suggests, it extracts all the data we need.
For each item we look for its type, price, condition, number of uploaded pictures, upload date, whether the seller is private or a business, whether a description exists, and a link to the item.

### extractData() rundown:
Every field we scrape from Yad2 is wrapped in try & except: we learned from experience that errors sometimes occur and the page may just need a refresh and a second attempt.
We then use Selenium's 'find_element' to locate the relevant HTML elements, which we identified by inspecting Yad2's different pages.
Some elements are located by class name, some by id and some by XPath.
Once an element is found, its value is appended to a dedicated list for that field; these lists are later used to build a dataframe.
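
The refresh-and-retry pattern described above repeats for every field, so it can be sketched as one small helper. This is only an illustration, not the code the notebook runs: the name `with_refresh_retry` and its parameters are our own assumptions, and the page read and the refresh are passed in as plain callables so the sketch does not depend on a live browser.

```python
import time


def with_refresh_retry(read, refresh, delay=1.0, sleep=time.sleep):
    """Run read(); if it fails, refresh the page, wait, and try once more.

    read    -- hypothetical callable that returns the scraped value
    refresh -- hypothetical callable that reloads the page
    """
    try:
        return read()
    except Exception:
        refresh()
        sleep(delay)
        # A second failure is not caught here and propagates to the caller.
        return read()


# Possible use with the notebook's driver (assumes itemPage and By exist):
# price = with_refresh_retry(
#     lambda: itemPage.find_element(by=By.CLASS_NAME, value='classified_price').text,
#     itemPage.refresh,
# )
```

Passing `sleep` as a parameter is a design choice that keeps the helper easy to try out without actually waiting.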

### day1_crawler() rundown:
This function is only used on day 1 of checking whether items were sold. On that day we take the main dataframe from the previous day, go through the links we saved, open each one and determine whether the item has been sold.
That is determined by the page we get when opening each link: if the page has an element with the class name 'sorry', the item has been sold.
First, we add an empty column for that day's checkup.
We use try & except here as well: for each link we try to find the 'sorry' element, and if it exists we set that row's 'Is Sold Day #1' value to 1, meaning the item is sold.
If 'sorry' does not exist, the lookup raises an exception and the 'except' branch runs, where we set the value to 0, meaning the item is still available for purchase.
We then return the modified dataframe.

### compare_records() rundown:
This function is nearly identical to day1_crawler(), except that it compares the previously checked day with the current one.
If an item was already sold on the previous day, there is no need to open its link again, so we simply write 1 in today's column.
That is the only difference between the two functions.
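
To make the carry-forward rule concrete, here is a minimal sketch that uses a plain list of dicts in place of the dataframe. The function name `update_sold_flags` and the `is_sold` callable are hypothetical stand-ins for the Selenium lookup of the 'sorry' element; the sketch also counts how many links actually had to be opened.

```python
def update_sold_flags(rows, prev_col, curr_col, is_sold):
    """Fill curr_col for every row and return how many links were visited.

    rows    -- list of dicts, each with a 'Link' key and the prev_col flag
    is_sold -- hypothetical callable: link -> True if the ad page is gone
    """
    visited = 0
    for row in rows:
        if row[prev_col] == 1:
            # Already sold on the previous day: carry the flag forward.
            row[curr_col] = 1
        else:
            # Still listed yesterday: this is the only case that costs a page load.
            visited += 1
            row[curr_col] = 1 if is_sold(row["Link"]) else 0
    return visited


rows = [
    {"Link": "a", "Is Sold Day #1": 1},
    {"Link": "b", "Is Sold Day #1": 0},
    {"Link": "c", "Is Sold Day #1": 0},
]
visited = update_sold_flags(rows, "Is Sold Day #1", "Is Sold Day #2",
                            is_sold=lambda link: link == "b")
```

In this toy run only two of the three links are checked, which is the saving that compare_records() aims for as more items sell.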

We have three additional functions: isRobot(), isBusiness(), and load_csv().
isRobot() helps us handle a captcha test.
In this function we look for an element that appears only on Yad2's captcha page.
If the element is found we return True, otherwise we return False.
The isBusiness() function tells us whether the seller is a business or a private seller.
Inspecting the item pages, we found that an element containing 'שם המפרסם' ("advertiser name") appears only when the seller is a business, so if the element is found we return True, otherwise False.
load_csv() simply loads the requested CSV file and returns it.
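
Both isRobot() and isBusiness() boil down to the same question: does any element on the page contain a given text? A minimal sketch of that shared check follows. The helper name `has_text` is an assumption of ours, and the locator strategy is written as the plain string "xpath" (the value of Selenium's `By.XPATH`) so the sketch can be tried with any object that offers a `find_elements(by=..., value=...)` method.

```python
XPATH = "xpath"  # same string value as selenium.webdriver.common.by.By.XPATH


def has_text(driver, text):
    """Return True if at least one element on the page contains `text`.

    Assumes `text` holds no double quotes, since it is pasted into the XPath.
    """
    query = '//*[text()[contains(., "{}")]]'.format(text)
    # find_elements returns an empty list when nothing matches,
    # so no try/except is needed for the "not found" case.
    return len(driver.find_elements(by=XPATH, value=query)) > 0


# Possible use with the notebook's drivers (assumes itemPage exists):
# is_business = has_text(itemPage, "שם המפרסם")
```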

### Imports & Chrome Driver PATH set:

In [11]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from datetime import datetime
import pandas as pd
import numpy as np


PATH = '/Users/mayvakrat/Downloads/chromedriver'

### Data Extraction Crawler & Assisting Functions:

In [None]:
mainPage = webdriver.Chrome(PATH)
mainPage.get('https://www.yad2.co.il/products/furniture?category=2')
itemPage = webdriver.Chrome(PATH)

In [12]:
typeDict = []
conditionDict = []
priceDict = []
descriptionDict = []
pictureCountDict = []
uploadDateDict = []
itemLinkDict = []

def isRobot(driver):
    try:
        driver.find_element(by=By.XPATH, value='//*[text()[contains(., "?Are you for real")]]')
        return True
    except:
        return False
    
def isBusiness():
    seller = itemPage.find_elements(by=By.XPATH, value='//*[text()[contains(., "שם המפרסם")]]')
    if len(seller) == 0:
        return False
    else:
        return True

def extractData():
    
    try:
        fur_type = itemPage.find_element(by=By.CLASS_NAME, value='sub_category_title').text.split('-')[1].strip()
        typeDict.append(fur_type)
    except:
        itemPage.refresh()
        time.sleep(1)
        fur_type = itemPage.find_element(by=By.CLASS_NAME, value='sub_category_title').text.split('-')[1].strip()
        typeDict.append(fur_type)

    try:
        try:
            pic_count = itemPage.find_element(by=By.CLASS_NAME, value='swiper-pagination-total').text
            pictureCountDict.append(pic_count)
        except:
            pictureCountDict.append(0)
    except:
        itemPage.refresh()
        time.sleep(1)
        try:
            pic_count = itemPage.find_element(by=By.CLASS_NAME, value='swiper-pagination-total').text
            pictureCountDict.append(pic_count)
        except:
            pictureCountDict.append(0)

    try:
        price = itemPage.find_element(by=By.CLASS_NAME, value='classified_price').text
        priceDict.append(price)
    except:
        itemPage.refresh()
        time.sleep(1)
        price = itemPage.find_element(by=By.CLASS_NAME, value='classified_price').text
        priceDict.append(price)

    try:
        condition = itemPage.find_elements(by=By.XPATH, value='//*[text()[contains(., "מצב המוצר")]]')
        for item in condition:
            if item.tag_name == 'dt':
                currentElement = item
        sibling = currentElement.find_element(by=By.XPATH, value='following-sibling::*').text
        conditionDict.append(sibling)
    except:
        itemPage.refresh()
        time.sleep(1)
        condition = itemPage.find_elements(by=By.XPATH, value='//*[text()[contains(., "מצב המוצר")]]')
        for item in condition:
            if item.tag_name == 'dt':
                currentElement = item
        sibling = currentElement.find_element(by=By.XPATH, value='following-sibling::*').text
        conditionDict.append(sibling)

    try:
        upload = itemPage.find_elements(by=By.XPATH, value='//*[text()[contains(., "תאריך עדכון")]]')
        for item in upload:
            if item.tag_name == 'dt':
                currentElement = item
        sibling = currentElement.find_element(by=By.XPATH, value='following-sibling::*').text
        if sibling == 'עודכן היום':
            uploadDateDict.append(datetime.date(datetime.now()))
        else:
            uploadDateDict.append(sibling)
    except:
        itemPage.refresh()
        time.sleep(1)
        upload = itemPage.find_elements(by=By.XPATH, value='//*[text()[contains(., "תאריך עדכון")]]')
        for item in upload:
            if item.tag_name == 'dt':
                currentElement = item
        sibling = currentElement.find_element(by=By.XPATH, value='following-sibling::*').text
        if sibling == 'עודכן היום':
            uploadDateDict.append(datetime.date(datetime.now()))
        else:
            uploadDateDict.append(sibling)

    try:
        description = itemPage.find_element(by=By.CLASS_NAME, value='details_text').text
        if not description.strip():  # .text returns a string, never None
            descriptionDict.append('no')
        else:
            descriptionDict.append('yes')
    except:
        itemPage.refresh()
        time.sleep(1)
        try:
            description = itemPage.find_element(by=By.CLASS_NAME, value='details_text').text
            if not description.strip():  # .text returns a string, never None
                descriptionDict.append('no')
            else:
                descriptionDict.append('yes')
        except:
            descriptionDict.append('no')
    
    try:
        itemLink = itemPage.current_url
        itemLinkDict.append(itemLink)
    except:
        itemPage.refresh()
        time.sleep(1)
        itemLink = itemPage.current_url
        itemLinkDict.append(itemLink)
        

### Remaining Crawlers & Assisting Functions:

In [None]:
itemPage = webdriver.Chrome(PATH)

In [13]:
def load_csv(csv):
    df = pd.read_csv(csv, index_col=[0])
    return df

def day1_crawler(main_df):
    df_copy = main_df.copy()
    df_copy['Is Sold Day #1'] = np.nan
    for index, row in df_copy.iterrows():
        try:
            link = row['Link']
            itemPage.get(link)

            if isRobot(itemPage) == True:
                try:
                    element = itemPage.find_element(by=By.XPATH, value='//*[text()[contains(., "?Are you for real")]]')
                    time.sleep(5)
                    while element == itemPage.find_element(by=By.XPATH, value='//*[text()[contains(., "?Are you for real")]]'):
                        WebDriverWait(itemPage, 99999).until(EC.staleness_of(element))
                        time.sleep(5)
                except:
                    time.sleep(5)

            sorry = itemPage.find_element(by=By.CLASS_NAME, value='sorry')
            df_copy.at[index, 'Is Sold Day #1'] = 1
        except:   
            df_copy.at[index, 'Is Sold Day #1'] = 0   
    df_copy = df_copy.astype({'Is Sold Day #1':'int'})
    return df_copy
    

def compare_records(main_df, prev_day_col, curr_day_col):
    df_copy = main_df.copy()
    new_prev_col = 'Is Sold Day #' + str(prev_day_col)
    new_curr_col = 'Is Sold Day #' + str(curr_day_col)
    df_copy[new_curr_col] = np.nan
    for index, row in df_copy.iterrows():
        if df_copy.at[index, new_prev_col] == 1:
            df_copy.at[index, new_curr_col] = 1
        else:
            try:
                link = row['Link']
                itemPage.get(link)
                
                if isRobot(itemPage) == True:
                    try:
                        element = itemPage.find_element(by=By.XPATH, value='//*[text()[contains(., "?Are you for real")]]')
                        time.sleep(5)
                        while element == itemPage.find_element(by=By.XPATH, value='//*[text()[contains(., "?Are you for real")]]'):
                            WebDriverWait(itemPage, 99999).until(EC.staleness_of(element))
                            time.sleep(5)
                    except:
                        time.sleep(5)
                
                sorry = itemPage.find_element(by=By.CLASS_NAME, value='sorry')
                df_copy.at[index, new_curr_col] = 1
            except:  
                df_copy.at[index, new_curr_col] = 0  
    df_copy = df_copy.astype({new_curr_col:'int'})
    return df_copy


### Data Extraction Main Code & Main Dataframe Creation

For data extraction we use a while loop that runs 800 times, once for each of 800 different items.
We also handle page changes when we reach the last item on a page.
In addition, we skip the first three items on each page: they are always posted by a business seller, which we are not interested in, and they are structured differently from other items.
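
The index bookkeeping behind this can be sketched without a browser. The generator below is an illustration only: `feed_positions` and `page_sizes` are hypothetical names, and it assumes, as the `+ 3` in the notebook's item count suggests, that the three leading business ads occupy feed indices 0 to 2 on every page.

```python
def feed_positions(page_sizes, skip=3):
    """Yield (page_number, feed_index) for every regular item.

    page_sizes -- number of regular items on each page (hypothetical input;
                  the notebook reads this from the live page instead)
    skip       -- leading promoted items to ignore on every page
    """
    for page_number, size in enumerate(page_sizes, start=1):
        # Indices below `skip` belong to the promoted ads, so start at `skip`
        # and stop after the last regular item (index size + skip - 1).
        for feed_index in range(skip, size + skip):
            yield page_number, feed_index


# Two regular items on page 1 and one on page 2:
positions = list(feed_positions([2, 1]))
```

Each `feed_index` corresponds to the number appended to 'feed_item_' when the main loop looks up an item on mainPage.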

We use two driver instances: mainPage, which is the second-hand furniture page on Yad2, and itemPage, which opens the link of each item.
On mainPage we find the item elements, extract the item-id of each (used to build its link), and move through the pages.
On itemPage we open the link built on mainPage and extract all the information we need with the extractData() function.

We call isRobot() on both mainPage and itemPage in each iteration to determine whether we need to handle a captcha test.
If the function returns True, we enter a while loop: as long as the captcha-only element exists, we do not move forward (a loop was needed because the test sometimes repeats itself several times).
Once that is handled, we check whether the seller is a business; if so, we skip the rest of the current iteration and continue to the next one, since we are not interested in business sellers.

When we are done extracting the data into designated lists, we make those lists into a dataframe and save it locally, and terminate the crawling process.
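
The captcha handling described above is essentially a polling loop. Here is a simplified sketch of that idea; it is not the notebook's exact logic, which also waits for the captcha element to go stale. The name `wait_while_captcha`, the `max_polls` limit and the injected `sleep` are assumptions added for illustration.

```python
import time


def wait_while_captcha(is_robot, poll_seconds=5, max_polls=None, sleep=time.sleep):
    """Block while is_robot() returns True; return the number of polls made.

    is_robot  -- callable with no arguments, e.g. lambda: isRobot(itemPage)
    max_polls -- optional safety limit (hypothetical; the notebook waits
                 without a practical limit)
    """
    polls = 0
    while is_robot():
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("captcha still showing after %d polls" % polls)
        sleep(poll_seconds)  # give the person at the laptop time to solve it
        polls += 1
    return polls


# Simulated page: the captcha shows twice, then disappears.
answers = iter([True, True, False])
polls = wait_while_captcha(lambda: next(answers), sleep=lambda s: None)
```

Re-checking the condition on every pass is what covers the case where the test repeats itself several times in a row.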

In [None]:
itemList = mainPage.find_element(by=By.CSS_SELECTOR, value="div[class='feed_list gallery_item new_gallery_view_design']")
itemCount = len(itemList.find_elements(by=By.CSS_SELECTOR, value="div[class='feeditem table']")) + 3
i = 0
itemIndex = 3

while i < 800:
        
    if itemIndex == itemCount:
        itemIndex = 3
        try:
            nextPage = mainPage.find_element(by=By.CSS_SELECTOR, value="span[class='navigation-button-text next-text']")
            nextPage.click()
            itemList = mainPage.find_element(by=By.CSS_SELECTOR, value="div[class='feed_list gallery_item new_gallery_view_design']")
            itemCount = len(itemList.find_elements(by=By.CSS_SELECTOR, value="div[class='feeditem table']")) + 3
        except:
            mainPage.quit()

    try:
        item = mainPage.find_element(by=By.ID, value='feed_item_'+str(itemIndex))
        itemID = item.get_attribute("item-id")
        itemPage.get('https://www.yad2.co.il/item/'+str(itemID))
    except:
        mainPage.refresh()
        itemIndex = itemIndex + 1
        item = mainPage.find_element(by=By.ID, value='feed_item_'+str(itemIndex))
        itemID = item.get_attribute("item-id")
        itemPage.get('https://www.yad2.co.il/item/'+itemID)
    
    if isRobot(mainPage) == True:
        try:
            element = mainPage.find_element(by=By.XPATH, value='//*[text()[contains(., "?Are you for real")]]')
            time.sleep(5)
            while element == mainPage.find_element(by=By.XPATH, value='//*[text()[contains(., "?Are you for real")]]'):
                WebDriverWait(mainPage, 99999).until(EC.staleness_of(element))
                time.sleep(5)
        except:
            time.sleep(5)
        
    if isRobot(itemPage) == True:
        try:
            element = itemPage.find_element(by=By.XPATH, value='//*[text()[contains(., "?Are you for real")]]')
            time.sleep(5)
            while element == itemPage.find_element(by=By.XPATH, value='//*[text()[contains(., "?Are you for real")]]'):
                WebDriverWait(itemPage, 99999).until(EC.staleness_of(element))
                time.sleep(5)
        except:
            time.sleep(5)
    
    if isBusiness() == True:
        itemIndex += 1
        continue
    
    extractData()
    
    itemIndex += 1
    i += 1

    time.sleep(3)

mainPage.quit()
itemPage.quit()

columns = ['Link', 'Type', 'Price', 'Condition', 'Is description', 'Picture count', 'Upload Date', 'Is Sold Day #1','Is Sold Day #2','Is Sold Day #3', 'Is Sold Day #4','Is Sold Day #5', 'Catagory mean', 'Days until sold', 'Price to Catagory mean', 'Is change in price']
df = pd.DataFrame({'Link': itemLinkDict, 'Type': typeDict, 'Price': priceDict, 'Condition': conditionDict,'Is description': descriptionDict, 'Picture count': pictureCountDict, 'Upload Date': uploadDateDict})

df.to_csv("/Users/mayvakrat/Desktop/School Shit/Year 2/Semester B/Introduction To Data Science/Final Project/DF's/df-Day1.csv")


### Day #1 Crawler Execution:

We load our main dataframe, the one containing all the data we scraped from Yad2.
An important piece of data we collected is the link of each ad we extracted information from, so that we can open each saved ad and check whether it has been sold.
For that we first use 'day1_crawler' to go through the link of each item and check whether it still exists, since we have no previous day to compare against.
We then terminate the crawling process and save our updated dataframe.

In [None]:
itemPage = webdriver.Chrome(PATH)

main_df = load_csv("DF's/df-initial data.csv")
new_df = day1_crawler(main_df)
itemPage.quit()
new_df.to_csv("/Users/mayvakrat/Desktop/School Shit/Year 2/Semester B/Introduction To Data Science/Final Project/DF's/df-day1.csv")

### Remaining Days Crawling:
Since we now have a column from day #1's checkup, we can use our 'compare_records' function for the remaining four days.
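
The four remaining runs follow a fixed pattern: load the previous day's CSV, compare, and save under the current day's name. The sketch below only lists that schedule. `day_plan` is a hypothetical helper, and the relative "DF's/df-dayN.csv" naming is an assumption based on the file names in this notebook (which saves to an absolute path instead).

```python
def day_plan(first_day=2, last_day=5, folder="DF's"):
    """Return (input_csv, prev_day, curr_day, output_csv) for each daily run."""
    plan = []
    for curr_day in range(first_day, last_day + 1):
        prev_day = curr_day - 1
        plan.append((
            "{}/df-day{}.csv".format(folder, prev_day),  # file written the day before
            prev_day,
            curr_day,
            "{}/df-day{}.csv".format(folder, curr_day),  # file to write today
        ))
    return plan


plan = day_plan()
# Each entry maps onto one manual run, roughly:
# new_df = compare_records(load_csv(input_csv), prev_day, curr_day)
# new_df.to_csv(output_csv)
```

Each run still has to happen on its own day, so the plan is a checklist rather than a loop to execute in one go.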

In [14]:
itemPage = webdriver.Chrome(PATH)
main_df = load_csv("DF's/df-day4.csv")
prev_day = 4
curr_day = 5
new_df = compare_records(main_df, prev_day, curr_day)
itemPage.quit()
new_df.to_csv("/Users/mayvakrat/Desktop/School Shit/Year 2/Semester B/Introduction To Data Science/Final Project/DF's/df-day5.csv")

