# Scraping Specs from Jawa.gg

Jawa.gg is a popular online marketplace for selling PC parts and complete prebuilt PCs. I would like to scrape the specs of the prebuilt PCs that have sold. The work is broken down into the following steps:

1. Access the webpage for sold PCs
2. Gather the links to each PC sold
3. Scrape the specs for each PC
4. Repeat for all pages (up to 161 as of 1/1/2024)
5. Store in a SQL database
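
Step 5 is not implemented yet in this notebook. As a rough sketch of what it could look like, the block below uses Python's built-in `sqlite3` module. The table name `sold_pcs`, its columns, and the helper `save_listings` are assumptions made for illustration, not a decided schema.

```python
import sqlite3


def save_listings(conn, rows):
    """Insert (title, price, link) tuples into a hypothetical sold_pcs table."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS sold_pcs (
            link  TEXT PRIMARY KEY,
            title TEXT,
            price REAL
        )
        """
    )
    # INSERT OR REPLACE keeps a re-run of the scraper from duplicating listings.
    conn.executemany(
        "INSERT OR REPLACE INTO sold_pcs (link, title, price) VALUES (?, ?, ?)",
        [(link, title, price) for title, price, link in rows],
    )
    conn.commit()


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
row = ("Budget Banger v3.0", 400.0, "https://jawa.gg/product/11655/budget-banger-v30")
save_listings(conn, [row])
save_listings(conn, [row])  # saving the same listing twice still leaves one row
count = conn.execute("SELECT COUNT(*) FROM sold_pcs").fetchone()[0]
```

Using the link as the primary key relies on each listing having its own product URL.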



In [21]:
# Setup
from bs4 import BeautifulSoup
import ssl
import urllib.request, urllib.parse
import pandas as pd
import time
import json

In [6]:
# Note: after switching from my work computer to my laptop, I got an
# "SSL certificate verification failed" error.
# The line below bypasses verification for now; it should not be used in production.

import ssl


ssl._create_default_https_context = ssl._create_unverified_context
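
The line above turns off certificate checks for every HTTPS request in the process. A narrower option, sketched here under the assumption that only this scraper's requests need the workaround, is to build an `ssl.SSLContext` and pass it to `urlopen` for individual calls. The helper names `make_ssl_context` and `fetch` are made up for this example.

```python
import ssl
import urllib.request


def make_ssl_context(verify=True):
    """Return an SSL context; verification is on unless explicitly disabled."""
    ctx = ssl.create_default_context()
    if not verify:
        # check_hostname has to be switched off before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch(url, verify=True, timeout=30):
    """Download a page, applying the chosen SSL settings to this request only."""
    context = make_ssl_context(verify)
    with urllib.request.urlopen(url, context=context, timeout=timeout) as resp:
        return resp.read()
```

Calling `fetch(sample, verify=False)` would reproduce the bypass for a single request while leaving the rest of the process on the default, verified settings.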


## Search Results


In [7]:
# Read URL
sample = 'https://www.jawa.gg/shop/full-systems/gaming-pcs-show-sold~5c456b-7fa58?page=100'
html=urllib.request.urlopen(sample).read()
soup=BeautifulSoup(html,'html.parser')


At this point, I've inspected the HTML and found that each of the 20 PCs on a page sits inside its own

div class="tw-group tw-relative"

In [8]:
# Find all instances
s = soup.find_all('div', class_="tw-group tw-relative")
# Check the length (should be 20)
len(s)

20

In [9]:
# First item
pc = s[0]

In [10]:

# Each PC card appears to contain two <a> tags.
# The first one holds the real product link; the second belongs to the button.
pc.find('a').get('href') # successfully gets the link

'/product/12134/raphael-ororgaming-and-streaming-pc-nvidia-gtx-980-4gb-intel-i3-10100f-16gb-ddr4-ram-512gb-ssd'

In [11]:
# How about the title?
pc.find('div',class_="tw-mt-1 tw-line-clamp-1 tw-overflow-ellipsis tw-text-xs md:tw-text-sm").get_text()

'🐢🍕Raphael🍕🐢 ||Gaming and Streaming PC-Nvidia GTX 980 4GB-Intel i3 10100F-16GB DDR4 RAM-512GB SSD'

In [12]:
# The price the PC sold for is in this div
pc.find('div',class_="tw-order-2 tw-text-sm tw-font-semibold tw-text-white").get_text()

'$499.00'
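
The price arrives as a string, so it has to be converted before it can be treated as a number. The small helper below is a sketch of that conversion. Handling of a thousands separator such as `$1,299.00` is an assumption added for safety; the pages scraped in this notebook did not appear to need it. The name `parse_price` is hypothetical.

```python
def parse_price(text):
    """Convert a price string like '$499.00' to a float."""
    # Strip whitespace, the dollar sign, and any thousands separator (assumed format).
    cleaned = text.strip().replace('$', '').replace(',', '')
    return float(cleaned)


price = parse_price('$499.00')
```

Keeping the conversion in one function means a format change on the site only needs to be handled in one place.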

I would like to run two tests: first, make sure that I can get all 20 PCs on this page; then, try the same code on a different page.

In [13]:
# Create an empty df to store data, with one row per PC found
small_df = pd.DataFrame(columns=['Title','Price','Link'],index=range(len(s)))
# Loop through the PCs (there might not be exactly 20, so we use len)
for i in range(len(s)):
    current_pc = s[i] # select just the one of interest
    title = current_pc.find('div',class_="tw-mt-1 tw-line-clamp-1 tw-overflow-ellipsis tw-text-xs md:tw-text-sm").get_text()
    sale_price = current_pc.find('div',class_="tw-order-2 tw-text-sm tw-font-semibold tw-text-white").get_text()
    # Make sale price into an actual number
    sale_price = sale_price.replace('$','') # remove the $
    sale_price = float(sale_price) # convert to a float since it's a number with decimals
    link = current_pc.find('a').get('href')
    link = 'https://jawa.gg' + link # add jawa.gg as prefix so the link works
    #print(title,'|-|',sale_price,'|-|',link)

    # Since I pre-made the dataframe, we can use iloc to fill the table.
    # iloc works by position: the first argument is the row
    # and the second is the column.
    small_df.iloc[i,0] = title
    small_df.iloc[i,1] = sale_price
    small_df.iloc[i,2] = link
small_df

Unnamed: 0,Title,Price,Link
0,🐢🍕Raphael🍕🐢 ||Gaming and Streaming PC-Nvidia G...,499.0,https://jawa.gg/product/12134/raphael-ororgami...
1,Snow Edition - Ryzen 7 5800X | ROG Strix 3080 ...,2399.95,https://jawa.gg/product/12120/snow-edition-ryz...
2,Entry Level White Gaming PC (Ryzen 5 2600 + RX...,550.0,https://jawa.gg/product/12129/entry-level-whit...
3,Starter DIY Set Up Just add case PSU and GPU,100.0,https://jawa.gg/product/10974/starter-diy-set-...
4,RTX 3080 // i9 12900KF // 32GB 5600Mhz DDR5 //...,2099.0,https://jawa.gg/product/12103/rtx-3080-i9-1290...
5,Gaming PC Intel i5 12400 RTX 3060 Ti,989.99,https://jawa.gg/product/12099/gaming-pc-intel-...
6,Intel I5 HTPC/Itx PC 16gb/500/Rx 6600,399.0,https://jawa.gg/product/11977/intel-i5-htpcitx...
7,Budget Gaming PC! Core i5 || 16 GB DDR4 || GTX...,347.0,https://jawa.gg/product/12091/budget-gaming-pc...
8,MAGENTA - Mid-Range Gaming PC - Ryzen 5 3700x ...,700.0,https://jawa.gg/product/12089/magenta-mid-rang...
9,Gaming/streaming computer Ryzen5 5600g/3060,850.0,https://jawa.gg/product/12055/gamingstreaming-...


Test 1 complete.
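
The loop builds each product link by joining the site address and the relative `href` as strings. `urllib.parse.urljoin` from the standard library is an alternative that also copes with an `href` that is already absolute. This is only a sketch; the constant `SITE` and the helper `absolute_link` are names invented for the example.

```python
from urllib.parse import urljoin

SITE = 'https://jawa.gg'  # same prefix the loop uses


def absolute_link(href, site=SITE):
    """Turn a relative href such as '/product/123/name' into a full URL."""
    return urljoin(site, href)


full = absolute_link('/product/11655/budget-banger-v30')
```

If the site ever starts returning full URLs in `href`, `urljoin` returns them unchanged instead of producing a doubled prefix.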

In [14]:
def basic_data_pull(link):
    html=urllib.request.urlopen(link).read()
    soup=BeautifulSoup(html,'html.parser')
    s = soup.find_all('div', class_="tw-group tw-relative")

    # Create an empty df to store data, with one row per PC found
    small_df = pd.DataFrame(columns=['Title','Price','Link'],index=range(len(s)))
    # Loop through the PCs on the page (usually 20)
    for i in range(len(s)):
        current_pc = s[i] # select just the one of interest
        title = current_pc.find('div',class_="tw-mt-1 tw-line-clamp-1 tw-overflow-ellipsis tw-text-xs md:tw-text-sm").get_text()
        sale_price = current_pc.find('div',class_="tw-order-2 tw-text-sm tw-font-semibold tw-text-white").get_text()
        # Make sale price into an actual number
        sale_price = sale_price.replace('$','') # remove the $
        sale_price = float(sale_price) # convert to a float since it's a number with decimals
        # Use a separate name so the page link passed in is not overwritten
        product_link = current_pc.find('a').get('href')
        product_link = 'https://jawa.gg' + product_link # add jawa.gg as prefix so the link works
        #print(title,'|-|',sale_price,'|-|',product_link)

        # Since I pre-made the dataframe, we can use iloc to fill the table.
        # iloc works by position: the first argument is the row
        # and the second is the column.
        small_df.iloc[i,0] = title
        small_df.iloc[i,1] = sale_price
        small_df.iloc[i,2] = product_link
    return small_df

In [15]:
basic_data_pull('https://www.jawa.gg/shop/full-systems/gaming-pcs-show-sold~5c456b-7fa58?page=102')
# works on a different page too

Unnamed: 0,Title,Price,Link
0,Budget 1440p Gaming and Streaming PC,700.0,https://jawa.gg/product/11710/budget-1440p-gam...
1,"Mid-Range Used Custom PC, *RTX2060 I7-9700F*",650.0,https://jawa.gg/product/11709/mid-range-used-c...
2,SALE! Gaming/Streaming PC - Ryzen 7 5800x | RT...,1599.0,https://jawa.gg/product/11515/sale-gamingstrea...
3,SALE! Gaming/Streaming PC - Ryzen 5 5600 | 32g...,1199.0,https://jawa.gg/product/11517/sale-gamingstrea...
4,Budget Banger v3.0,400.0,https://jawa.gg/product/11655/budget-banger-v30
5,Custom Bundle for ToastyGaming,320.0,https://jawa.gg/product/11654/custom-bundle-fo...
6,"""Viper"" GTX 1070 Mid Range Gaming/Streaming PC",575.0,https://jawa.gg/product/11659/viper-gtx-1070-m...
7,☢️ 🟠Taskmaster☣️🔶- High Gaming PC-Intel i5 124...,995.0,https://jawa.gg/product/11651/taskmaster-high-...
8,HOLIDAY SALE: GTX 1660 Super + Ryzen 3600 + 16...,679.91,https://jawa.gg/product/11648/holiday-sale-gtx...
9,FLAME - Mid-Range Gaming PC - Ryzen 5 1500x + ...,500.0,https://jawa.gg/product/11645/flame-mid-range-...


Great! I'm glad that worked. Now I will use this function to get ALL the PCs.

Note that there are 161 pages as of 1/1/2024.

In [16]:
base = 'https://www.jawa.gg/shop/full-systems/gaming-pcs-show-sold~5c456b-7fa58?page='
frames = []
for page in range(1,162):
    # combine the base with the page number
    page_link = base+str(page)
    # Search on the current webpage and keep its dataframe
    frames.append(basic_data_pull(page_link))
# Merge all the pages in one step instead of growing big_df inside the loop
big_df = pd.concat(frames, ignore_index=True)
big_df = big_df.dropna(how='all') # Drop any extra rows
big_df.head()

Unnamed: 0,Title,Price,Link
0,"AMD ""Anime"" Pink and White Theme CYBER MONDAY ...",725.0,https://jawa.gg/product/24994/amd-anime-pink-a...
1,ALL WHITE GAMING PC | RX 580 | I7 5930K EQUIVA...,499.99,https://jawa.gg/product/26931/all-white-gaming...
2,AMD RX 7600 | Ryzen 5 5500 | 1TB SSD | The Min...,879.0,https://jawa.gg/product/26741/amd-rx-7600-or-r...
3,Artic Beast Ryzen 7 5700x / 32GB / 2TB / RTX 4060,1149.0,https://jawa.gg/product/26886/artic-beast-ryze...
4,DRK1 Gaming PC Custom 1060 Windows 10 Pro Inte...,419.99,https://jawa.gg/product/26887/drk1-gaming-pc-c...
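
Requesting 161 pages back to back can fail partway through and puts unnecessary load on the site. The sketch below shows one way to add a pause between pages and a couple of retries. It is not what produced the table above. The function `fetch_pages` and its parameters `delay`, `retries`, and `sleep` are assumptions for illustration; the page-fetching function is passed in so the loop can be tried without network access.

```python
import time


def fetch_pages(base, last_page, fetch, delay=1.0, retries=2, sleep=time.sleep):
    """Call fetch(url) for pages 1..last_page, pausing between requests."""
    results = []
    for page in range(1, last_page + 1):
        url = f"{base}{page}"
        for attempt in range(retries + 1):
            try:
                results.append(fetch(url))
                break
            except Exception as exc:  # broad on purpose: any failure triggers a retry
                if attempt == retries:
                    print(f"giving up on {url}: {exc}")
                else:
                    sleep(delay * (attempt + 1))  # wait a little longer each time
        sleep(delay)  # pause before moving on to the next page
    return results
```

With the function from this notebook, `frames = fetch_pages(base, 161, basic_data_pull)` followed by `pd.concat(frames, ignore_index=True)` would build the same combined table, skipping any page that still fails after the retries.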


In [17]:
big_df.describe()

Unnamed: 0,Title,Price,Link
count,3219,3219.0,3219
unique,3155,757.0,3219
top,Ryzen 5 3600 RX 6600 16GB RAM 1TB M.2 Wifi 5 B...,500.0,https://jawa.gg/product/24994/amd-anime-pink-a...
freq,4,85.0,1


Now to get the important details for a specific product.

Repeat the earlier loading process with a product link.

In [18]:
product = 'https://www.jawa.gg/product/26702/5600-x-6700xt-digital-dash-32gb-ram-wifi'
html=urllib.request.urlopen(product).read()
soup=BeautifulSoup(html,'html.parser')

One way to get the details is from the og:description meta tag.

In [19]:
soup.find_all('meta', property="og:description")

[<meta content="CPU Brand: AMD, CPU Series: Ryzen 5, CPU Model: Ryzen 5 5600, CPU Socket: AM4, CPU Core count: 6 cores, GPU Chipset: AMD, GPU Brand: PowerColor, GPU Series: Radeon RX 6700 XT, GPU Memory: 12GB, Memory Capacity: 32GB, Memory Type: DDR4, Memory Form Factor: DIMM (Desktop), Internal Storage Capacity: 1TB, Internal Storage Interface: NVME SSD, Case Brand: Lian Li, Case Color: Black, Power Supply Wattage: 800W, Motherboard Brand: AMD, Motherboard Socket: AM4, CPU Cooler Type: Air, CPU Cooler Socket: AM4" property="og:description"/>]
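
The `content` attribute is a single string of `Name: value` pairs separated by commas. A minimal sketch of turning it into a dictionary is below. It assumes no spec value contains a comma followed by a space, which holds for this listing but has not been checked across the site. The helper name `parse_spec_string` is hypothetical.

```python
def parse_spec_string(content):
    """Split 'Key: value, Key: value' text into a dictionary."""
    specs = {}
    for part in content.split(', '):
        key, sep, value = part.partition(': ')
        if sep:  # ignore fragments that are not in 'Key: value' form
            specs[key.strip()] = value.strip()
    return specs


# A shortened copy of the content string shown above.
content = ("CPU Brand: AMD, CPU Series: Ryzen 5, CPU Model: Ryzen 5 5600, "
           "Memory Form Factor: DIMM (Desktop), CPU Cooler Socket: AM4")
parsed = parse_spec_string(content)
```

In the notebook, the string itself could be read with `soup.find('meta', property="og:description")['content']`.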

Another is to parse the JSON in the __NEXT_DATA__ script tag into a dictionary.

In [22]:
data_dict = json.loads(soup.find('script',id="__NEXT_DATA__").get_text())
data_dict

{'props': {'pageProps': {'dehydratedState': {'mutations': [],
    'queries': [{'state': {'data': {'buyer_protection_policy': None,
        'category': {'id': 28, 'name': 'Gaming PCs'},
        'condition': 'used',
        'created_at': '2023-12-24T13:05:14.118Z',
        'description': 'AMD 5600\n\nAMD 6700XT\n\nMSI B550 Tomahawk\n\n32GB RAM 3200MHZ\n\nLian li 216\n\nDeepCool AK400 Digital\n\n1TB NVME\n\n850Watt EVGA Fully modular PSU\n\n\nGrown up build all about business \n\n\n6700XT>>>>3060TI',
        'height': None,
        'id': 26702,
        'images': {'ids': ['production/listings/gtowbjpydqzyorbdlfi2',
          'production/listings/vadrki0lcpgwyq6jhsft',
          'production/listings/ewaloqeptuvk2zzkkiit',
          'production/listings/wlczdbpkmkkzdsnww5gt',
          'production/listings/dx14obbzmpjav4fbawvk'],
         'source': 'cloudinary'},
        'is_insured': False,
        'is_on_sale': False,
        'is_private_listing': False,
        'is_published': True,
     

I will likely go the dictionary route. The next steps are to pull the key pieces out of the dictionary and build a new dataframe with the details. This will be added to big_df once completed.

Variables of interest: all specs, date listed, date sold (to create a delta variable), original price, and price.

Note that I only want the state from the first query.
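
The delta variable could be the number of days between listing and sale. The sketch below computes it from the `created_at` and `last_sold_at` values that appear in the data for this product. It assumes every timestamp uses this exact format, with fractional seconds and a trailing `Z` for UTC; the helper names are made up for the example.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'  # assumed format, e.g. 2023-12-24T13:05:14.118Z


def parse_timestamp(text):
    """Parse a timestamp string and mark it as UTC."""
    return datetime.strptime(text, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def days_to_sell(created_at, last_sold_at):
    """Return the time between listing and sale in (fractional) days."""
    delta = parse_timestamp(last_sold_at) - parse_timestamp(created_at)
    return delta.total_seconds() / 86400


days = days_to_sell('2023-12-24T13:05:14.118Z', '2023-12-31T22:48:45.137Z')
```

For this listing the result is a little over seven days. Whether `created_at` or `last_published_at` is the better starting point is still an open choice.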

In [27]:
# what does the data look like?
data_dict.items()

dict_items([('props', {'pageProps': {'dehydratedState': {'mutations': [], 'queries': [{'state': {'data': {'buyer_protection_policy': None, 'category': {'id': 28, 'name': 'Gaming PCs'}, 'condition': 'used', 'created_at': '2023-12-24T13:05:14.118Z', 'description': 'AMD 5600\n\nAMD 6700XT\n\nMSI B550 Tomahawk\n\n32GB RAM 3200MHZ\n\nLian li 216\n\nDeepCool AK400 Digital\n\n1TB NVME\n\n850Watt EVGA Fully modular PSU\n\n\nGrown up build all about business \n\n\n6700XT>>>>3060TI', 'height': None, 'id': 26702, 'images': {'ids': ['production/listings/gtowbjpydqzyorbdlfi2', 'production/listings/vadrki0lcpgwyq6jhsft', 'production/listings/ewaloqeptuvk2zzkkiit', 'production/listings/wlczdbpkmkkzdsnww5gt', 'production/listings/dx14obbzmpjav4fbawvk'], 'source': 'cloudinary'}, 'is_insured': False, 'is_on_sale': False, 'is_private_listing': False, 'is_published': True, 'is_sold_out': True, 'labels': [], 'last_published_at': '2023-12-24T13:05:14.112Z', 'last_sold_at': '2023-12-31T22:48:45.137Z', 'lengt

In [24]:
data_dict.keys()

dict_keys(['props', 'page', 'query', 'buildId', 'isFallback', 'isExperimentalCompile', 'gssp', 'scriptLoader'])

The data of interest is in props. The next level down is pageProps, then dehydratedState, then queries. There seem to be two queries for this page specifically, but the data I want is in the first, so I will take only the first. I will need to check later on a different link whether this pattern holds. Within that query, the data is in state, and finally each variable is under data.
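
Since the same nesting may not hold on every product page, a lookup that returns a default instead of raising can make the check easier. The sketch below walks a list of keys and indexes one level at a time. The helper `dig`, the constant `LISTING_PATH`, and the trimmed-down `sample` dictionary are all made up for illustration.

```python
def dig(obj, path, default=None):
    """Follow a list of keys/indexes into nested dicts and lists."""
    current = obj
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default  # some level of the path is missing
    return current


# The path described above: props, pageProps, dehydratedState, first query, state, data.
LISTING_PATH = ['props', 'pageProps', 'dehydratedState', 'queries', 0, 'state', 'data']

# A tiny stand-in for the real page data.
sample = {'props': {'pageProps': {'dehydratedState': {
    'queries': [{'state': {'data': {'id': 26702}}}]}}}}

found = dig(sample, LISTING_PATH)
missing = dig({'props': {}}, LISTING_PATH)
```

A result of `None` for a real page would flag that its structure differs and needs a closer look.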


I would like to use flow.io to make a chart illustrating this relationship.


In [49]:
raw = data_dict['props']['pageProps']['dehydratedState']['queries'][0]['state']['data']


Supporting info:

In [50]:
raw['name']
raw['description']
raw['created_at']
raw['original_price']
raw['price']


85000

The actual specs require going down one more level.

In [51]:
specs=raw['specs']

In [52]:
specs['CPU Brand']
specs['CPU Series']
specs['CPU Model']
specs['CPU Socket']
specs['CPU Core count']
specs['GPU Chipset']
specs['GPU Brand']
specs['GPU Series']
specs['GPU Memory']
specs['Memory Capacity']
specs['Memory Type']
specs['Memory Form Factor']
specs['Internal Storage Capacity']
specs['Internal Storage Interface']
specs['Case Brand']
specs['Case Color']
specs['Power Supply Wattage']
specs['Motherboard Brand']
specs['Motherboard Socket']
specs['CPU Cooler Type']
specs['CPU Cooler Socket']

'AM4'
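
To get from these individual lookups to one row per product, the top-level fields and the specs can be combined into a single flat dictionary. This is a sketch under assumptions: the list `TOP_LEVEL_FIELDS`, the `spec_` column prefix, and the function `flatten_listing` are invented here, and `.get` is used because it is not yet known whether every listing fills in every field.

```python
# Fields taken from the top level of the listing data (a hypothetical selection).
TOP_LEVEL_FIELDS = ['id', 'name', 'created_at', 'last_sold_at', 'original_price', 'price']


def flatten_listing(raw, spec_prefix='spec_'):
    """Build one flat record from a listing dict and its nested specs."""
    record = {field: raw.get(field) for field in TOP_LEVEL_FIELDS}
    for key, value in (raw.get('specs') or {}).items():
        # 'CPU Socket' becomes 'spec_cpu_socket'
        column = spec_prefix + key.lower().replace(' ', '_')
        record[column] = value
    return record


# A trimmed-down stand-in for the raw dictionary above.
example_raw = {'id': 26702, 'price': 85000,
               'specs': {'CPU Socket': 'AM4', 'GPU Memory': '12GB'}}
record = flatten_listing(example_raw)
```

A list of such records can be passed to `pd.DataFrame` to build the details table. The price is left as returned; a value of 85000 may be in cents, but that has not been confirmed here.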