# Scraping Specs from Jawa.gg

Jawa.gg is a popular online marketplace for PC parts and complete prebuilt PCs. I would like to scrape the specs of the prebuilt PCs that have sold. The work breaks down into five steps:

1. Access the webpage for sold PCs
2. Gather the links to each PC sold
3. Scrape the specs for each PC
4. Repeat for all pages (up to 161 as of 1/1/2024)
5. Store in a SQL database
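
Step 5 could look roughly like the following. This is a minimal sketch, assuming a SQLite database and a DataFrame with the `Title`, `Price`, and `Link` columns used in this notebook. The table name `sold_pcs` and the sample rows are made up for illustration, and an in-memory database stands in for a real file.

```python
import sqlite3

import pandas as pd

# Stand-in for the scraped results; the real frame would hold every listing.
demo_df = pd.DataFrame({
    'Title': ['Example PC A', 'Example PC B'],
    'Price': [590.0, 550.0],
    'Link': ['https://jawa.gg/product/1/example-a',
             'https://jawa.gg/product/2/example-b'],
})

# ':memory:' keeps the sketch self-contained; a file path would persist the data.
conn = sqlite3.connect(':memory:')

# 'sold_pcs' is a hypothetical table name. if_exists='replace' overwrites
# an existing table of the same name, which suits a full re-scrape.
demo_df.to_sql('sold_pcs', conn, if_exists='replace', index=False)

# Read a couple of values back to confirm the rows landed.
row_count = conn.execute('SELECT COUNT(*) FROM sold_pcs').fetchone()[0]
max_price = conn.execute('SELECT MAX(Price) FROM sold_pcs').fetchone()[0]
print(row_count, max_price)
conn.close()
```

Using `DataFrame.to_sql` keeps the column names and avoids writing the `CREATE TABLE` statement by hand.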



In [3]:
# Setup
import json  # used later to parse the JSON embedded in the product page
import ssl
import time
import urllib.request, urllib.parse

import pandas as pd
from bs4 import BeautifulSoup

## Search Results


In [79]:
# Read URL
sample = 'https://www.jawa.gg/shop/full-systems/gaming-pcs-show-sold~5c456b-7fa58?page=100'
html=urllib.request.urlopen(sample).read()
soup=BeautifulSoup(html,'html.parser')


At this point, I've inspected the HTML and found that each of the 20 PCs on a page sits inside its own element of the form

div class="tw-group tw-relative"

In [98]:
# Find all instances
s = soup.find_all('div', class_="tw-group tw-relative")
# Check the length (should be 20)
len(s)

20
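
One caveat about the lookup above: passing a multi-class string to `class_` matches the `class` attribute as an exact string, so a card whose classes appear in a different order, or with an extra class, would be skipped. A CSS selector matches on the individual classes instead. The sketch below uses a made-up two-card HTML snippet (not real Jawa markup) to show the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second card has the same two classes,
# but in a different order and with an extra class.
demo_html = '''
<div class="tw-group tw-relative"><a href="/product/1/a">A</a></div>
<div class="tw-relative tw-group extra"><a href="/product/2/b">B</a></div>
'''
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# Exact-string match on the class attribute: only the first card.
exact_matches = demo_soup.find_all('div', class_="tw-group tw-relative")

# CSS selector: any div that carries both classes, in any order.
css_matches = demo_soup.select('div.tw-group.tw-relative')

print(len(exact_matches), len(css_matches))
```

On the real page both approaches returned the same 20 cards at the time of writing, so this only matters if the site's markup shifts.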

In [101]:
# First item
pc = s[0]

In [113]:

# Each card contains two <a> tags.
# The first one carries the product link; the second belongs to the button.
pc.find('a').get('href') # successfully gets the link

'/product/12028/lime-mid-range-gaming-pc-ryzen-5-3600x-and-1660ti'
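
The `href` above is relative to the site root, so it needs a host before it can be opened. One option, sketched here with the standard library's `urljoin`, is to resolve it against the URL of the page it was scraped from. The variable names are illustrative only.

```python
from urllib.parse import urljoin

# The search page the card was found on, and the relative link from the card.
page_url = 'https://www.jawa.gg/shop/full-systems/gaming-pcs-show-sold~5c456b-7fa58?page=100'
href = '/product/12028/lime-mid-range-gaming-pc-ryzen-5-3600x-and-1660ti'

# urljoin keeps the scheme and host of page_url and swaps in the new path.
full_link = urljoin(page_url, href)
print(full_link)
```

Compared with plain string concatenation, this also copes with an `href` that is already absolute.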

In [118]:
# how about the title
pc.find('div',class_="tw-mt-1 tw-line-clamp-1 tw-overflow-ellipsis tw-text-xs md:tw-text-sm").get_text()

'LIME - Mid-Range Gaming PC - Ryzen 5 3600x & 1660ti'

In [115]:
# The price the PC sold for is in a div with these classes
pc.find('div',class_="tw-order-2 tw-text-sm tw-font-semibold tw-text-white").get_text()

'$590.00'
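
Each of the lookups above calls `.get_text()` directly on the result of `find`, which raises `AttributeError` if a card is missing that element (`find` returns `None` in that case). A small guard, sketched below, returns `None` instead. The helper name `text_or_none` and the sample card are hypothetical.

```python
from bs4 import BeautifulSoup

PRICE_CLASS = "tw-order-2 tw-text-sm tw-font-semibold tw-text-white"

def text_or_none(card, tag_name, class_string):
    """Return the text of the first matching descendant, or None if absent."""
    found = card.find(tag_name, class_=class_string)
    if found is None:
        return None
    return found.get_text(strip=True)

# Hypothetical card that has a price but no title element.
card_html = (
    '<div class="tw-group tw-relative">'
    '<div class="tw-order-2 tw-text-sm tw-font-semibold tw-text-white">$590.00</div>'
    '</div>'
)
card = BeautifulSoup(card_html, 'html.parser').find('div', class_="tw-group tw-relative")

price_text = text_or_none(card, 'div', PRICE_CLASS)
missing_text = text_or_none(card, 'div', 'no-such-class')
print(price_text, missing_text)
```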

I would like to run two tests: first, make sure that I can get all 20 PCs from this page; then, run the same code on a different page.

In [128]:
# Create an empty df to store data (one row per PC found on the page)
small_df = pd.DataFrame(columns=['Title','Price','Link'],index=range(len(s)))
# loop through the PCs (there might not be exactly 20, so we use len)
for i in range(len(s)):
    current_pc = s[i] # select just the one of interest
    title = current_pc.find('div',class_="tw-mt-1 tw-line-clamp-1 tw-overflow-ellipsis tw-text-xs md:tw-text-sm").get_text()
    sale_price = current_pc.find('div',class_="tw-order-2 tw-text-sm tw-font-semibold tw-text-white").get_text()
    # Make sale price into an actual number
    sale_price = sale_price.replace('$','') # Remove the $
    sale_price = float(sale_price) # convert to a float since it's a number with decimals
    link = current_pc.find('a').get('href')
    link = 'https://jawa.gg' + link # Add jawa.gg as prefix so the link works
    #print(title,'|-|',sale_price,'|-|',link)

    # Since the dataframe was created up front, we can use iloc to fill it in.
    # The first argument is the row position and the second is the column position.
    small_df.iloc[i,0] = title
    small_df.iloc[i,1] = sale_price
    small_df.iloc[i,2] = link
small_df

index,Title,Price,Link
0,LIME - Mid-Range Gaming PC - Ryzen 5 3600x & 1...,590.0,https://jawa.gg/product/12028/lime-mid-range-g...
1,Budget Ryzen Gaming PC - Ryzen 5 1600 - GTX 10...,550.0,https://jawa.gg/product/12022/budget-ryzen-gam...
2,"Pi-Nano.v2 | RTX 3060, Intel 12400F, 1tb nvme ...",949.99,https://jawa.gg/product/11960/pi-nanov2-or-rtx...
3,Budget Gaming PC - Ryzen 5 3600 | 16gb DDR4 | ...,625.0,https://jawa.gg/product/12000/budget-gaming-pc...
4,Prism Mk III | RTX 3060Ti - Ryzen 5 3600 Gamin...,999.99,https://jawa.gg/product/11999/prism-mk-iii-or-...
5,Ryzen 7 3700x RTX 2070super Gaming & Streaming PC,750.0,https://jawa.gg/product/9558/ryzen-7-3700x-rtx...
6,"110+FPS Esports | Intel i5, Radeon RX 580, 16...",380.0,https://jawa.gg/product/11995/110fps-esports-o...
7,櫓 Yagura | Core i5-12400 + RTX 3060 Ti SFF Gam...,1300.0,https://jawa.gg/product/11994/yagura-or-core-i...
8,RTX 3070 | Ryzen 5 5600X | 32GB Ram | 1TB NVMe...,1250.0,https://jawa.gg/product/11970/rtx-3070-or-ryze...
9,🤍💚🤍 AMD Ryzen 5 3500X // Nvidia GeForce RTX 20...,890.0,https://jawa.gg/product/11955/amd-ryzen-5-3500...


Test 1 complete.
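
The prices on this page converted cleanly, but `float()` would raise `ValueError` on a string with a thousands separator such as `'$1,300.00'`. Whether Jawa ever renders prices that way is an assumption on my part; the sketch below just makes the conversion tolerant of it. The function name `parse_price` is hypothetical.

```python
def parse_price(text):
    """Convert a price string like '$590.00' or '$1,300.00' to a float."""
    # Strip surrounding whitespace, the currency symbol, and any commas.
    cleaned = text.strip().replace('$', '').replace(',', '')
    return float(cleaned)

print(parse_price('$590.00'))
print(parse_price('$1,300.00'))
```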

In [130]:
def basic_data_pull(page_url):
    """Return a DataFrame with title, sale price, and link for each PC on one results page."""
    html = urllib.request.urlopen(page_url).read()
    soup = BeautifulSoup(html,'html.parser')
    s = soup.find_all('div', class_="tw-group tw-relative")

    # Create an empty df to store data, one row per PC found (normally 20)
    small_df = pd.DataFrame(columns=['Title','Price','Link'],index=range(len(s)))
    # loop through the PCs on the page
    for i in range(len(s)):
        current_pc = s[i] # select just the one of interest
        title = current_pc.find('div',class_="tw-mt-1 tw-line-clamp-1 tw-overflow-ellipsis tw-text-xs md:tw-text-sm").get_text()
        sale_price = current_pc.find('div',class_="tw-order-2 tw-text-sm tw-font-semibold tw-text-white").get_text()
        # Make sale price into an actual number
        sale_price = sale_price.replace('$','') # Remove the $
        sale_price = float(sale_price) # convert to a float since it's a number with decimals
        # Use a separate name for the product link so it does not shadow the page URL
        link = current_pc.find('a').get('href')
        link = 'https://jawa.gg' + link # Add jawa.gg as prefix so the link works

        # Fill the pre-built dataframe by position: row first, then column
        small_df.iloc[i,0] = title
        small_df.iloc[i,1] = sale_price
        small_df.iloc[i,2] = link
    return small_df

In [131]:
basic_data_pull('https://www.jawa.gg/shop/full-systems/gaming-pcs-show-sold~5c456b-7fa58?page=102')
# effective

index,Title,Price,Link
0,R5 2600/ 1660Ti Gaming PC Computer,600.0,https://jawa.gg/product/11635/r5-2600-1660ti-g...
1,🌍 Streaming & Gaming PC - GTX 1060 6GB / Xeon ...,409.0,https://jawa.gg/product/11634/streaming-and-ga...
2,JT-1 | Ryzen 5 1600 | GTX 1070 Ti | 16GB RAM |...,525.0,https://jawa.gg/product/11612/jt-1-or-ryzen-5-...
3,"""Frosty"" 1440p Gaming/Streaming PC",739.99,https://jawa.gg/product/11610/frosty-1440p-gam...
4,"Budget 4K PC: GeForce RTX 2080 Super, Core i9 ...",999.99,https://jawa.gg/product/10530/budget-4k-pc-gef...
5,Stealthy RTX 3090 | Ryzen 9 5900X Gaming/Strea...,2479.88,https://jawa.gg/product/8783/stealthy-rtx-3090...
6,🦇🔴HellBat 🦇🔴||Gaming PC-Ryzen 5 5600 6-Core -A...,1075.0,https://jawa.gg/product/11582/hellbat-ororgami...
7,SALE: Watercooled PC 6600 XT + 3600 + 16GB DDR...,914.91,https://jawa.gg/product/11578/sale-watercooled...
8,"PC on Sale! Ryzen 7 5700G, RTX 3070, 32GB DDR4...",945.0,https://jawa.gg/product/11547/pc-on-sale-ryzen...
9,Gaming PC Intel i5 12400 RTX 3060 Ti,999.99,https://jawa.gg/product/11526/gaming-pc-intel-...


Great, that worked. Now I will use this function to get ALL the PCs.

Note that there are 161 pages as of 1/1/2024.

In [138]:
base = 'https://www.jawa.gg/shop/full-systems/gaming-pcs-show-sold~5c456b-7fa58?page='
page_frames = []
for page in range(1,162):
    # combine the base with the page number
    page_link = base+str(page)
    # Search on the current webpage and keep its dataframe
    page_frames.append(basic_data_pull(page_link))
# Merge all pages in one go, then drop any rows that were never filled
big_df = pd.concat(page_frames, ignore_index=True)
big_df = big_df.dropna(how='all')
big_df.head()

index,Title,Price,Link
0,AMD RX 7600 | Ryzen 5 5500 | 1TB SSD | The Min...,879.0,https://jawa.gg/product/26741/amd-rx-7600-or-r...
1,Artic Beast Ryzen 7 5700x / 32GB / 2TB / RTX 4060,1149.0,https://jawa.gg/product/26886/artic-beast-ryze...
2,DRK1 Gaming PC Custom 1060 Windows 10 Pro Inte...,419.99,https://jawa.gg/product/26887/drk1-gaming-pc-c...
3,Dark Side Liquid cooled RIG Intel inte 14 Core...,875.0,https://jawa.gg/product/26876/dark-side-liquid...
4,🩷🧁Buu🍧AMD RX 5700 XT 8GB 🍧AMD Ryzen 5 3500X 6 ...,695.0,https://jawa.gg/product/25309/buuamd-rx-5700-x...
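
The loop above sends 161 requests back to back, and a single failed request stops the whole run. Below is a hedged sketch of a fetch helper that retries with a growing pause between attempts. The name `fetch_with_retry`, its parameters, and the default values are all hypothetical; the `opener` argument exists only so the helper can be exercised without touching the network.

```python
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, retries=3, delay=1.0, opener=urllib.request.urlopen):
    """Return the response body for url, retrying on URLError."""
    last_error = None
    for attempt in range(retries):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.URLError as err:
            # HTTPError is a subclass of URLError, so both are caught here.
            last_error = err
            # Wait a little longer after each failed attempt.
            time.sleep(delay * (attempt + 1))
    raise last_error
```

Inside `basic_data_pull`, the call `urllib.request.urlopen(...).read()` could be swapped for `fetch_with_retry(...)`, and a short `time.sleep` between pages would keep the crawl gentle on the site.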


In [139]:
big_df.describe()

statistic,Title,Price,Link
count,3213,3213.0,3213
unique,3150,755.0,3213
top,Ryzen 5 3600 RX 6600 16GB RAM 1TB M.2 Wifi 5 B...,500.0,https://jawa.gg/product/26741/amd-rx-7600-or-r...
freq,4,84.0,1
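
The summary above lists `unique`, `top`, and `freq` for `Price`, which is what `describe()` reports for non-numeric columns. That suggests the column is stored with `object` dtype, likely because the frame was pre-built empty and filled cell by cell. A small sketch with made-up rows shows how `pd.to_numeric` restores a numeric column so that `describe()` returns mean, min, max, and so on.

```python
import pandas as pd

# Made-up rows; Price is deliberately stored as generic Python objects.
demo = pd.DataFrame({
    'Title': ['PC A', 'PC B', 'PC C'],
    'Price': pd.Series([590.0, 550.0, 949.99], dtype=object),
})

# Convert the column to a real numeric dtype.
demo['Price'] = pd.to_numeric(demo['Price'])

# describe() on a numeric column now gives count, mean, std, min, quartiles, max.
price_summary = demo['Price'].describe()
print(price_summary)
```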


Now to get the important details for a specific product.

I repeat the loading process from before with the new link.

In [146]:
product = 'https://www.jawa.gg/product/26702/5600-x-6700xt-digital-dash-32gb-ram-wifi'
html=urllib.request.urlopen(product).read()
soup=BeautifulSoup(html,'html.parser')


One way to get the details is from the page's og:description meta tag.

In [155]:
soup.find_all('meta', property="og:description")

[<meta content="CPU Brand: AMD, CPU Series: Ryzen 5, CPU Model: Ryzen 5 5600, CPU Socket: AM4, CPU Core count: 6 cores, GPU Chipset: AMD, GPU Brand: PowerColor, GPU Series: Radeon RX 6700 XT, GPU Memory: 12GB, Memory Capacity: 32GB, Memory Type: DDR4, Memory Form Factor: DIMM (Desktop), Internal Storage Capacity: 1TB, Internal Storage Interface: NVME SSD, Case Brand: Lian Li, Case Color: Black, Power Supply Wattage: 800W, Motherboard Brand: AMD, Motherboard Socket: AM4, CPU Cooler Type: Air, CPU Cooler Socket: AM4" property="og:description"/>]
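
The `content` attribute above is a comma-separated list of `Name: Value` pairs, so it can be split into a dictionary. This is a minimal sketch that assumes no value contains the sequence `', '`; the function name `parse_spec_string` is hypothetical, and the sample string is a shortened copy of the output above.

```python
def parse_spec_string(content):
    """Turn 'Name: Value, Name: Value' into a {name: value} dictionary."""
    specs = {}
    for pair in content.split(', '):
        # partition splits on the first ': ' only, so values may contain colons.
        key, separator, value = pair.partition(': ')
        if separator:  # skip fragments that are not Name: Value pairs
            specs[key.strip()] = value.strip()
    return specs

# In the notebook the string would come from something like
# soup.find('meta', property="og:description").get('content')
sample = ("CPU Brand: AMD, CPU Series: Ryzen 5, CPU Model: Ryzen 5 5600, "
          "GPU Memory: 12GB, Memory Form Factor: DIMM (Desktop)")
specs = parse_spec_string(sample)
print(specs['CPU Model'], specs['GPU Memory'])
```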

Another is to load the JSON stored in the page's `__NEXT_DATA__` script tag into a dictionary.

In [164]:
json.loads(soup.find('script',id="__NEXT_DATA__").get_text())

{'props': {'pageProps': {'dehydratedState': {'mutations': [],
    'queries': [{'state': {'data': {'buyer_protection_policy': None,
        'category': {'id': 28, 'name': 'Gaming PCs'},
        'condition': 'used',
        'created_at': '2023-12-24T13:05:14.118Z',
        'description': 'AMD 5600\n\nAMD 6700XT\n\nMSI B550 Tomahawk\n\n32GB RAM 3200MHZ\n\nLian li 216\n\nDeepCool AK400 Digital\n\n1TB NVME\n\n850Watt EVGA Fully modular PSU\n\n\nGrown up build all about business \n\n\n6700XT>>>>3060TI',
        'height': None,
        'id': 26702,
        'images': {'ids': ['production/listings/gtowbjpydqzyorbdlfi2',
          'production/listings/vadrki0lcpgwyq6jhsft',
          'production/listings/ewaloqeptuvk2zzkkiit',
          'production/listings/wlczdbpkmkkzdsnww5gt',
          'production/listings/dx14obbzmpjav4fbawvk'],
         'source': 'cloudinary'},
        'is_insured': False,
        'is_on_sale': False,
        'is_private_listing': False,
        'is_published': True,
     

I will likely go the dictionary route. The next steps are to pull the key pieces out of the dictionary and build a new dataframe with those details, which will be added to big_df once completed.

Variables of interest: all specs, date listed, date sold (to create a delta variable), original price, and price.

Note that I only want the first state, i.e. the state of the first entry in queries.
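
A hedged sketch of that extraction follows. It uses a trimmed stand-in for the dictionary, containing only keys that are visible in the output above; the keys for the sale date and the prices are cut off in that output, so they are left out here rather than guessed. The field names in `record` are my own.

```python
import json
from datetime import datetime

# Trimmed stand-in for the text of the __NEXT_DATA__ script tag.
raw_json = json.dumps({
    'props': {'pageProps': {'dehydratedState': {
        'mutations': [],
        'queries': [{'state': {'data': {
            'category': {'id': 28, 'name': 'Gaming PCs'},
            'condition': 'used',
            'created_at': '2023-12-24T13:05:14.118Z',
            'id': 26702,
        }}}],
    }}},
})

next_data = json.loads(raw_json)

# Walk down to the first query's state and take its data payload.
queries = next_data['props']['pageProps']['dehydratedState']['queries']
listing = queries[0]['state']['data']

record = {
    'id': listing['id'],
    'category': listing['category']['name'],
    'condition': listing['condition'],
    # Parse the ISO-style timestamp so date arithmetic is possible later.
    'listed_at': datetime.strptime(listing['created_at'], '%Y-%m-%dT%H:%M:%S.%fZ'),
}
print(record)
```

Once the sold-date key is known, subtracting `listed_at` from the parsed sold date would give the delta variable as a `timedelta`.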