# Vinyl Grading Project
This project aims to build an open-source package that grades vinyl records according to the [Goldmine Grading](https://www.goldminemag.com/collector-resources/record-grading-101) standard.

First we need to create a dataset. We will scrape some pages in search of labeled audio.

## Scraping WatchCount

WatchCount is an archive of sold eBay items. I've selected one seller who uses a template for the item description and always includes an audio clip. This will help us build the dataset semi-automatically.

In [1]:
# install required packages if needed
!pip install pandas selenium requests

Collecting pandas
  Downloading pandas-2.2.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (19 kB)
Collecting selenium
  Downloading selenium-4.22.0-py3-none-any.whl.metadata (7.0 kB)
Collecting requests
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting numpy>=1.26.0 (from pandas)
  Downloading numpy-2.0.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (114 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2024.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting urllib3<3,>=1.26 (from urllib3[socks]<3,>=1.26->selenium)
  Downloading urllib3-2.2.2-py3-none-any.whl.metadata (6.4 kB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.26.0-py3-none-any.whl.metadata (8.8 kB)
Collecting trio-websocket~=0.9 (from s

In [2]:
# Imports
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

In [4]:
# Let's open a session
driver = webdriver.Firefox()
url = "http://www.watchcount.com/completed.php?bfw=1&bslr=wills-rare-records"
driver.get(url)



Let's create a DataFrame to hold the URL of each item's eBay listing and the title of the item sold.

In [4]:
df_links = pd.DataFrame(columns=["Link","Title"])

In [5]:
df_links

Unnamed: 0,Link,Title


Now it's time to scrape the full table of sold items from this seller.

In [13]:
# Absolute XPath of the results table (brittle: it breaks if the page layout changes)
table = driver.find_element(By.XPATH, "/html/body/div/table/tbody/tr[3]/td/table/tbody/tr/td[2]/div/table[2]/tbody/tr[2]/td/div/table[1]")

# Find all rows within the table
rows = table.find_elements(By.TAG_NAME, "tr")

# Loop through each row
for row in rows:
    # Find all link elements within the row
    links = row.find_elements(By.TAG_NAME, "a")
    # Only rows with exactly 11 links are treated as item listings
    if len(links) == 11:
        # The second link holds the item URL (href) and its title (text)
        url = links[1].get_attribute("href")
        text = links[1].get_attribute("text")

        # Append the pair as a new row of the DataFrame
        df_links.loc[len(df_links)] = [url, text]
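
The row filter in the cell above can also be written as a small helper that is testable without a browser. This is only a sketch: the names `extract_listing`, `collect_listings`, `LINKS_PER_ITEM_ROW` and `TITLE_LINK_INDEX` are hypothetical, and the values 11 and 1 are simply the ones hard-coded above, which may change if WatchCount changes its layout.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical constants mirroring the values hard-coded in the cell above.
LINKS_PER_ITEM_ROW = 11  # rows with this many <a> tags are treated as listings
TITLE_LINK_INDEX = 1     # position of the link carrying the item URL and title


def extract_listing(links: Sequence) -> Optional[Tuple[str, str]]:
    """Return (url, title) for an item row, or None for any other row.

    `links` is expected to be what row.find_elements(By.TAG_NAME, "a") returns.
    """
    if len(links) != LINKS_PER_ITEM_ROW:
        return None
    anchor = links[TITLE_LINK_INDEX]
    return anchor.get_attribute("href"), anchor.get_attribute("text")


def collect_listings(rows_of_links):
    """Collect (url, title) pairs from an iterable of per-row link lists."""
    records = []
    for links in rows_of_links:
        listing = extract_listing(links)
        if listing is not None:
            records.append(listing)
    return records
```

With the notebook's objects, `collect_listings(row.find_elements(By.TAG_NAME, "a") for row in rows)` would return a list that `pd.DataFrame(records, columns=["Link", "Title"])` can turn into a frame in one step, instead of appending row by row.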


Let's have a look at the DataFrame after scraping the first page.

In [14]:
df_links

Unnamed: 0,Link,Title
0,http://www.watchcount.com/go/?item=11623035299...,Ahmed al-Jaberi - Rare SUDAN Arabic Afro 45 / ...
1,http://www.watchcount.com/go/?item=11623035299...,Ahmed al-Jaberi - Rare SUDAN Arabic Afro 45 / ...
2,http://www.watchcount.com/go/?item=11623035039...,Mohamed Mirghani - Rare SUDAN Arabic Afro 45 /...
3,http://www.watchcount.com/go/?item=11623034962...,Tayeb Abdullah ? - Rare SUDAN Arabic Afro 45 /...
4,http://www.watchcount.com/go/?item=11623034725...,Ibrahim Awad - Ya Zaman - Rare SUDAN Arabic Af...
...,...,...
96,http://www.watchcount.com/go/?item=11620727474...,Asnaketch Worku – Krar Songs - ETHIOPIA Rare F...
97,http://www.watchcount.com/go/?item=11620668755...,Tekle Tesfazghi ‎– Abadit - ERITREA (Ethiopia)...
98,http://www.watchcount.com/go/?item=11620727128...,Ashenafi Kebede & The Hungarian State String O...
99,http://www.watchcount.com/go/?item=11620668101...,Black Magic Band - ZAMBIA Invisible Super Rare...
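
Rows 0 and 1 above show the same title and the same visible link prefix. The display is truncated, so they may or may not be true duplicates. If they are, one possible cleanup is to keep a single row per item number. The sketch below assumes the item number is carried by the `item` query parameter seen in the links; the helper names and the sample item numbers are made up for illustration.

```python
from urllib.parse import parse_qs, urlparse


def item_id(link):
    """Return the value of the `item` query parameter, or None if absent."""
    values = parse_qs(urlparse(link).query).get("item")
    return values[0] if values else None


def drop_repeated_items(records):
    """Keep the first (link, title) pair seen for each item id, preserving order."""
    seen = set()
    unique = []
    for link, title in records:
        key = item_id(link) or link  # fall back to the full link if no id is found
        if key not in seen:
            seen.add(key)
            unique.append((link, title))
    return unique


# Made-up item numbers, for illustration only.
sample = [
    ("http://www.watchcount.com/go/?item=111", "Record A"),
    ("http://www.watchcount.com/go/?item=111", "Record A"),
    ("http://www.watchcount.com/go/?item=222", "Record B"),
]
print(drop_repeated_items(sample))
```

When the repeated links are character-for-character identical, `df_links.drop_duplicates(subset="Link")` on the DataFrame should have a similar effect.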


Looks good. Let's scrape the remaining 8 pages.

In [17]:
# Reuse the browser session opened earlier instead of launching a new Firefox per page
for i in range(1, 9):
    url = "http://www.watchcount.com/completed.php?bslr=wills-rare-records&csbin=all&cssrt=ts&bfw=1&bpg={}#serp".format(i)
    driver.get(url)
    table = driver.find_element(By.XPATH, "/html/body/div/table/tbody/tr[3]/td/table/tbody/tr/td[2]/div/table[2]/tbody/tr[2]/td/div/table[1]")

    # Find all rows within the table
    rows = table.find_elements(By.TAG_NAME, "tr")

    # Print the page number to track progress
    print(i)
    for row in rows:
        try:
            # Find all link elements within the row
            links = row.find_elements(By.TAG_NAME, "a")

            # Only rows with exactly 11 links are treated as item listings
            if len(links) == 11:
                # The second link holds the item URL (href) and its title (text)
                url = links[1].get_attribute("href")
                text = links[1].get_attribute("text")
                df_links.loc[len(df_links)] = [url, text]
        except Exception:
            # Show the row that could not be parsed and keep going
            print(row)

# Close the browser once all pages are scraped
driver.quit()

4
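
The page URL in the loop above is assembled with string formatting. A minimal sketch of the same thing with `urllib.parse.urlencode` is shown here, which makes the query parameters easier to read and change. The parameter names are copied from the URL used above; their meaning is not documented in this notebook, apart from `bslr` appearing to carry the seller name and `bpg` being the value the loop varies per page. `page_url` and `BASE_URL` are hypothetical names.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.watchcount.com/completed.php"


def page_url(seller, page):
    """Build the completed-listings URL for one results page of a seller."""
    query = urlencode({
        "bslr": seller,   # seller name, as in the URL used above
        "csbin": "all",   # copied as-is from the URL used above
        "cssrt": "ts",    # copied as-is from the URL used above
        "bfw": 1,         # copied as-is from the URL used above
        "bpg": page,      # page number
    })
    return f"{BASE_URL}?{query}#serp"


print(page_url("wills-rare-records", 2))
```

Inside the loop, `driver.get(page_url("wills-rare-records", i))` would then request the same address as the formatted string.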


Let's store the output.

In [19]:
df_links.to_csv("./output/wills.csv", index = False)
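
Writing to `./output/wills.csv` typically fails if the `output` directory does not exist yet. A small helper like the one below (the name `prepare_output_path` is hypothetical) could create it first.

```python
from pathlib import Path


def prepare_output_path(path):
    """Create the parent directory of `path` if needed and return it as a Path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target
```

It could be used as `df_links.to_csv(prepare_output_path("./output/wills.csv"), index=False)`.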