# Vinyl Grade


## Scraping eBay
Now that we have a set of eBay URLs, let's scrape each listing for its audio sample and its vinyl condition label.

In [1]:
import pandas as pd
from time import sleep
import os
import requests
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC



Let's import the dataset we created by scraping watchcount.

In [2]:
wills = pd.read_csv("./output/wills.csv")

In [3]:
wills.head()

Unnamed: 0,Link,Title
0,http://www.watchcount.com/go/?item=11623035299...,Ahmed al-Jaberi - Rare SUDAN Arabic Afro 45 / ...
1,http://www.watchcount.com/go/?item=11623035299...,Ahmed al-Jaberi - Rare SUDAN Arabic Afro 45 / ...
2,http://www.watchcount.com/go/?item=11623035039...,Mohamed Mirghani - Rare SUDAN Arabic Afro 45 /...
3,http://www.watchcount.com/go/?item=11623034962...,Tayeb Abdullah ? - Rare SUDAN Arabic Afro 45 /...
4,http://www.watchcount.com/go/?item=11623034725...,Ibrahim Awad - Ya Zaman - Rare SUDAN Arabic Af...


Let's create a new, empty column *condition* to hold the raw string in which the seller records the condition label. We'll have to clean it manually afterwards.

In [4]:
wills["condition"] = ""
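Part of that manual cleanup could be automated. The sketch below is a minimal example, assuming the scraped strings carry leftover HTML fragments, entities, and punctuation around the grade; `clean_condition` is a hypothetical helper, not part of the original notebook, and the sample strings are made up for illustration.

```python
import html
import re


def clean_condition(raw):
    """Strip HTML remnants and surrounding punctuation from a raw condition string."""
    text = html.unescape(raw)              # turn entities such as &nbsp; into characters
    text = re.sub(r"<[^>]*>?", " ", text)  # drop tags, including a trailing unclosed "<"
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip(" :-\u00a0")         # trim separators left at either end


# Hypothetical raw values, only to show the intended use
print(clean_condition(": VG+ <"))
print(clean_condition("</b>&nbsp;: EX <"))
```

Once the loop has filled the column, it could be applied with `wills["condition"].map(clean_condition)`; the result still deserves a manual check, since sellers word their grades freely.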

In [5]:
len(wills)

378

Let's run Selenium in headless mode, so that Firefox does not open a visible window.

In [6]:
options = Options()

options.add_argument("-headless") 

In [9]:
i = 0  # row counter, kept in step with the dataframe index
save_directory = "./output/audio/raw"  # folder for the downloaded MP3 files
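The download step writes into `save_directory` and names each file after the basename of its URL. That assumes the folder already exists and that the URL has no query string. A minimal sketch of a helper that covers both cases follows; `build_save_path` is a hypothetical name, and the example URL is invented.

```python
import os
from urllib.parse import unquote, urlsplit


def build_save_path(save_directory, url):
    """Create the target folder if needed and derive a local file path from a URL."""
    os.makedirs(save_directory, exist_ok=True)  # no error if the folder already exists
    # Use only the path component, so "?token=..." never ends up in the file name
    filename = os.path.basename(unquote(urlsplit(url).path))
    if not filename:
        raise ValueError(f"Cannot derive a file name from {url!r}")
    return os.path.join(save_directory, filename)


# Invented URL, only to show the behaviour
print(build_save_path("./output/audio/raw", "https://example.com/clips/side%20a.mp3?token=abc"))
```

By contrast, calling `os.path.basename` directly on a URL keeps everything after the last slash, including any query string.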

It's time to iterate over the dataframe. We located the audio and condition-label elements by inspecting the listing pages in Firefox; the XPath patterns used here come from that inspection.

In [10]:
# Initialize Firefox WebDriver
driver = webdriver.Firefox(options=options)

# Iterate over the dataframe
for link in wills["Link"].tolist():
    try:
        driver.get(link)
        if i==0:
            # Wait for the GDPR banner accept button to be clickable
            accept_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.ID, "gdpr-banner-accept"))
            )
            
            # Scroll to the accept button
            driver.execute_script("arguments[0].scrollIntoView();", accept_button)
            
            # Click the accept button
            accept_button.click()
            
        # Switch to the iframe with the description
        driver.switch_to.frame("desc_ifr")
        
        # Wait for the element containing the vinyl condition text to be present
        condition_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//*[contains(text(), 'Vinyl condition')]"))
        )
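        # Take the second match and keep what lies between "condition" and the closing span tag
        matches = driver.find_elements(By.XPATH, "//*[contains(text(), 'Vinyl condition')]")
        condition_text = matches[1].get_attribute("text").split("condition")[1].split("/span>")[0]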

        
        # Assign the condition to the DataFrame
        wills.loc[i, "condition"] = condition_text
        
        # Find all <source> elements on the page
        source_elements = driver.find_elements(By.TAG_NAME, "source")

        # Filter out the <source> elements with the "audio/mpeg" type
        mp3_sources = [source.get_attribute("src") for source in source_elements if source.get_attribute("type") == "audio/mpeg"]

        # If there are MP3 sources, select the first one
        if mp3_sources:
            mp3_source_url = mp3_sources[0]
            
            
            # Download the MP3 file
            mp3_filename = os.path.basename(mp3_source_url)
            mp3_save_path = os.path.join(save_directory, mp3_filename)
            response = requests.get(mp3_source_url, timeout=30)
            response.raise_for_status()  # do not save an error page as audio
            with open(mp3_save_path, "wb") as f:
                f.write(response.content)

            # Save the MP3 file path in the DataFrame
            wills.loc[i, "mp3_path"] = mp3_save_path
        else:
            print("No MP3 sources found on the page.")
        
        i += 1
        # Update the dataframe
        wills.to_csv("./output/wills_audio.csv")
        # Wait
        sleep(1)

    except Exception as e:
        print(i)
        print(link)
        print("Error:", e)
        i+=1



0
http://www.watchcount.com/go/?item=116230352998&cid=1721069927-CL-0-1---wills-rare-records-
Error: Message: Element <button id="gdpr-banner-accept"> could not be scrolled into view
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:193:5
ElementNotInteractableError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:353:5
webdriverClickElement@chrome://remote/content/marionette/interaction.sys.mjs:167:11
interaction.clickElement@chrome://remote/content/marionette/interaction.sys.mjs:136:11
clickElement@chrome://remote/content/marionette/actors/MarionetteCommandsChild.sys.mjs:205:29
receiveMessage@chrome://remote/content/marionette/actors/MarionetteCommandsChild.sys.mjs:85:31

No MP3 sources found on the page.


Completed! Only one audio file is missing. Let's close the driver.

In [11]:
driver.quit()