# Computing lab: Scraping project
## Oscar Martínez Franco
In this project we will scrape the tech site laptopsdirect, navigating through the store's website to find a new laptop to buy. The website has more than 1500 laptops in its database and we will scrape them all. For each laptop we will follow the url to its individual description page and from there obtain the specs that matter when buying a new laptop: the full name, the price, the processor name and manufacturer, the storage, the RAM, the screen size, whether the screen is a touchscreen, and some others. This data will then be stored in a pandas dataframe and saved on our computer as a csv file. The advantage of loading it into a pandas dataframe first is that this library offers a wide range of possibilities for filtering and plotting data. It therefore becomes easy to come up with a selection of laptops based on their price, their computational power or their battery runtime.

In [29]:
from bs4 import BeautifulSoup
import requests as rq
import time
import re
from time import perf_counter
import csv
import pandas as pd
from threading import Thread
from queue import Queue

In [2]:
def get_soup(url): #we build this function to get the soup object out of an url and return it 
    c=rq.get(url)
    result=c.content
    soup=BeautifulSoup(result,"lxml")
    return soup

In [3]:
def parse_laptops(url):
    
    soup=get_soup(url)
    class_box=soup.find_all("div",{'class':'OfferBox'}) #for each page of this site we want to obtain the url of the next page
    class_page=soup.find_all("div",{'class':'sr_pagination'})#and also the url of each laptop (also the name and the price)
    row_list=[]
    
    for i in class_box:
        laptop=i.find('a',{'class':'offerboxtitle'})
        laptop_url='https://www.laptopsdirect.co.uk'+laptop.get("href")
        laptop_specs=get_specs(laptop_url)#we get the specs for each laptop
        laptop_specs["Name"]=laptop.text #we add to the specs the name and price of the laptop
        laptop_specs["Price"]=i.find('span',{'class':'offerprice'}).text
        row_list.append(laptop_specs)
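    try: #we get the next page url
        next_page='https://www.laptopsdirect.co.uk'+class_page[0].find(title="Next Page").get("href")
    except (IndexError, AttributeError, TypeError): #no pagination block or no "Next Page" link: this is the last page
        next_page=None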
   
    return row_list,next_page
    
    

In [4]:
def get_specs(url): #here we obtain the first 12 specs of the laptop page
    specs_dict={} #generally the first 12 specs are the most general ones, the ones that we are interested in
    soup_lap=get_soup(url)
    class_spec=soup_lap.find(id='specData')
    
    if class_spec is None: #if the page has no spec block we return an empty dictionary
        return specs_dict
    
    spec_name=class_spec.find_all('span',{'class':'Header'})[0:12] #we get the name of the spec and its value
    spec_val=class_spec.find_all('span',{'class':'BodyText'})[0:12]
    
    for name,val in zip(spec_name,spec_val): #zip stops at the shorter list, so a missing value cannot raise an IndexError
        specs_dict[name.text]=val.text #we build a dictionary to store the specs names and values
    
    return specs_dict

In [70]:
url='https://www.laptopsdirect.co.uk/ct/laptops-and-netbooks/laptops'
final_laptop_list=[]
toc=time.time()
page_num=1
while url is not None: #we iterate over all the pages and store the dictionaries of all the laptops in a list of dictionaries
    page_num+=1
    dict_list, url=parse_laptops(url)
    final_laptop_list.extend(dict_list)
        
tic=time.time()
print("Elapsed time: "+str(round(tic-toc,1))+"s")

Elapsed time: 837.2s
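
The loop above keeps following the "Next Page" link until there is none left. As a minimal sketch of that pattern, independent of the real site, the pagination can be written as a generator. The name `iter_pages` and the three-page `fake_site` dictionary are assumptions made for illustration; in the notebook the role of `get_next_url` would be played by the code that reads the `sr_pagination` block.

```python
def iter_pages(first_url, get_next_url):
    """Yield every page url, starting at first_url, until there is no next page."""
    url = first_url
    seen = set()  # guards against a site whose pagination loops back on itself
    while url is not None and url not in seen:
        seen.add(url)
        yield url
        url = get_next_url(url)


# Hypothetical three-page site, used here instead of real HTTP requests
fake_site = {
    "/laptops?page=1": "/laptops?page=2",
    "/laptops?page=2": "/laptops?page=3",
    "/laptops?page=3": None,
}

pages = list(iter_pages("/laptops?page=1", fake_site.get))
print(pages)
```

Written this way, the list of pages includes the first url and never contains a trailing `None`.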


In [162]:
laptops_data=pd.DataFrame(final_laptop_list) #we transform our list of dictionaries into a  dataframe

Now that we have all our specs and all our laptops we could just save them into a csv file. However, some categories that are numerical (screen size, price, RAM...) are stored as strings, which is not useful for filtering laptops by their characteristics. In addition, the values in these categories are not formatted consistently. For example, the screen size appears as 13in for some laptops, 13Inches for others and just 13 for the rest. We want to unify these values, so we build a function for each category.

In [119]:
def ram_split(ram_str):
    if type(ram_str) != str:
        ram_int=None
    else:
        ram_int=int(ram_str.split('G')[0]) #we turn the part before the G of GB (the value) into an integer
    return ram_int

In [138]:
def storage_split(storage_str):
    if type(storage_str) == float:
        storage_int=None
    elif storage_str=='':
        storage_int=None
    else:
        if storage_str.find('G')==-1: #find returns -1 if there is no G, which means the storage is in TB
            storage_int=int(storage_str.split('T')[0])*1000
            
        else:
            if storage_str[0]=="G": #some laptops have their storage as GB256
                storage_int=int(storage_str.split('B')[1])
            else:
                storage_int=int(storage_str.split('G')[0])
    return storage_int

In [148]:
def screen_split(screen_str):
    try:
        screen_int=float(screen_str)
    except:        
        if type(screen_str) != str:
            screen_int=None
        else:
            try:
                screen_int=float(screen_str.split('I')[0]) #some laptops have 13in or 13Inches
            except:
                screen_int=float(screen_str.split('i')[0])
    return screen_int

In [153]:
def battery_split(bat_str):
    try:
        bat_int=float(bat_str)
    except:
        if type(bat_str) != str:
            bat_int=None
        else:
            bat_int=float(bat_str.split('h')[0])
    return bat_int

In [163]:
laptops_data["Price"]=laptops_data["Price"].apply(lambda x: float(x.split('£')[1]))
laptops_data["RAM Memory"]=laptops_data["RAM Memory"].apply(lambda x: ram_split(x))
laptops_data["Storage"]=laptops_data["Storage"].apply(lambda x: storage_split(x))
laptops_data["Screen size"]=laptops_data["Screen size"].apply(lambda x: screen_split(x))
laptops_data["Battery Run Time"]=laptops_data["Battery Run Time"].apply(lambda x: battery_split(x))
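
The split functions above each handle one column's quirks by hand. As an alternative sketch (not the approach used in this notebook), a single regular expression can pull the first number out of strings such as `13in`, `13Inches`, `14hours` or `GB256`. The helper names `leading_number` and `storage_in_gb` are hypothetical, and the conversion of 1 TB to 1000 GB is taken from `storage_split`.

```python
import re

# Matches an integer or a decimal number anywhere in the string
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def leading_number(value):
    """Return the first number found in value as a float, or None."""
    if not isinstance(value, str):  # missing specs arrive as NaN (a float)
        return None
    match = _NUMBER.search(value)
    return float(match.group()) if match else None


def storage_in_gb(value):
    """Return the storage in GB, counting 1 TB as 1000 GB."""
    number = leading_number(value)
    if number is None:
        return None
    return number * 1000 if "TB" in value.upper() else number


for raw in ["13in", "13Inches", "15.6", "14hours", "GB256", "1TB", ""]:
    print(repr(raw), leading_number(raw), storage_in_gb(raw))
```

With helpers like these, each column could be cleaned with `apply(leading_number)` or `apply(storage_in_gb)`, at the cost of being less strict about unexpected formats than the explicit splits.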



Now that our data is clean we can take a look at the numeric categories that we have transformed and get an idea of how they look.

In [165]:
laptops_data.describe()

| | Battery Run Time | Price | RAM Memory | Screen size | Storage |
|---|---|---|---|---|---|
| count | 705.0 | 1539.0 | 1485.0 | 1433.0 | 1420.0 |
| mean | 9.785106 | 715.477797 | 6.791919 | 14.409002 | 421.080282 |
| std | 3.447391 | 691.383663 | 4.80234 | 1.673811 | 349.904794 |
| min | 2.0 | 119.97 | 1.0 | 10.1 | 16.0 |
| 25% | 8.0 | 249.97 | 4.0 | 13.3 | 150.0 |
| 50% | 10.0 | 398.98 | 4.0 | 14.0 | 256.0 |
| 75% | 12.0 | 989.98 | 8.0 | 15.6 | 512.0 |
| max | 26.0 | 4442.98 | 32.0 | 31.5 | 2000.0 |


Now we can save our dataframe as a csv file, keeping only the columns that are relevant to us.

In [164]:
laptops_data_filter.to_csv('laptop_results.csv',columns=['Name','Price','Storage','RAM Memory','Screen size','Screen Resolution',
                                                         'Processor Model','Processor Number','Processor manufacturer','Operating System',
                                                         'Battery Run Time','Optical Drive','Touchscreen','Warranty']) #labels must match the dataframe columns exactly

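
Once the columns are numeric, selecting laptops comes down to combining boolean conditions. The following is only a sketch: the four rows in `sample` are invented to mimic the cleaned columns and are not scraped data, and `select_laptops` is a hypothetical helper.

```python
import pandas as pd

# Invented rows that only mimic the cleaned columns; this is not scraped data
sample = pd.DataFrame([
    {"Name": "Laptop A", "Price": 249.97, "RAM Memory": 4, "Battery Run Time": 8.0},
    {"Name": "Laptop B", "Price": 398.98, "RAM Memory": 8, "Battery Run Time": 10.0},
    {"Name": "Laptop C", "Price": 989.98, "RAM Memory": 16, "Battery Run Time": 12.0},
    {"Name": "Laptop D", "Price": 449.00, "RAM Memory": 8, "Battery Run Time": None},
])


def select_laptops(df, max_price, min_ram, min_battery):
    """Keep the rows that satisfy all three conditions, cheapest first."""
    mask = (
        (df["Price"] <= max_price)
        & (df["RAM Memory"] >= min_ram)
        & (df["Battery Run Time"] >= min_battery)  # NaN compares as False
    )
    return df[mask].sort_values("Price")


selection = select_laptops(sample, max_price=500, min_ram=8, min_battery=9)
print(selection[["Name", "Price"]])
```

Rows with a missing battery runtime are dropped by the comparison itself, which matters here because only 705 of the 1539 laptops report that spec.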


## Parallel scraping with threads
We have seen that scraping the 1504 laptops and their specs is quite a slow process (about 850s). However, there is a more efficient way to do it. We can use Python's threading module to create several workers that scrape in parallel, so that more than one page is being fetched at once. To start, we have to get the links of all the pages in the laptop site. This is also a slow process (75s) and I am sure that it could also be optimized via parallel computing. Nevertheless, for now we will only focus on giving our workers the urls and letting them scrape each laptop.

In [37]:
def parse_laptops_thread(url): #we modify our previous parse laptops function so it only scrapes the specs of the laptop
    
    soup=get_soup(url)
    class_box=soup.find_all("div",{'class':'OfferBox'})
    row_list=[]
    
    for i in class_box:
        laptop=i.find('a',{'class':'offerboxtitle'})
        laptop_url='https://www.laptopsdirect.co.uk'+laptop.get("href")
        laptop_specs=get_specs(laptop_url)
        laptop_specs["Name"]=laptop.text
        laptop_specs["Price"]=i.find('span',{'class':'offerprice'}).text
        row_list.append(laptop_specs)
   
   
    return row_list
    
    

In [14]:
def get_page_url(first_url): #given the url of one results page, we return the url of the next page (None on the last page)
    soup=get_soup(first_url)
    class_page=soup.find_all("div",{'class':'sr_pagination'})
    try:
        next_page='https://www.laptopsdirect.co.uk'+class_page[0].find(title="Next Page").get("href")
    except (IndexError, AttributeError, TypeError): #no pagination block or no "Next Page" link
        next_page=None
    
    return next_page


In [33]:
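url='https://www.laptopsdirect.co.uk/ct/laptops-and-netbooks/laptops'
pages_url=[url] #we start from the first page, which get_page_url never returns
toc=time.time()
while url is not None:
    url=get_page_url(url)
    pages_url.append(url) #the last value appended is None, so the list ends with None
        
tic=time.time()
print("Elapsed time: "+str(round(tic-toc,1))+"s")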

Elapsed time: 75.9s


In [51]:
def main_scraper(urls): #we define our scraper and the queue of urls
    data = []
    q = Queue()
    for url in urls:
        q.put(url)
    for i in range(32): #we use 32 workers because it's about half of the length of the pages_url list
        t = Thread(target = scraper_worker, args = (q, data))
        t.daemon = True #daemon threads do not keep the program alive after the main thread ends
        t.start()
    q.join() #we wait until every url in the queue has been marked as done
    return data

def scraper_worker(q, data): #we define our workers, which take one url and call the function to get its specs
    while not q.empty():
        print("inside scraper worker, queue not empty")
        f= q.get()
        try:
            dict_list=parse_laptops_thread(f)
            for j in dict_list:
                data.append(j)
        finally:
            q.task_done() #we always mark the url as done, even if parsing failed, so q.join() cannot hang
    return data
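
One caveat of checking `q.empty()` before calling `q.get()` is that another worker can take the last url in between, which leaves the first worker blocked on `get()`. A possible variant, sketched here with a stand-in parser instead of real requests, uses `get_nowait()` and treats `queue.Empty` as the signal to stop. The names `worker`, `scrape_all` and `fake_parse` are hypothetical, and `page-3` is made to fail on purpose to show that one broken page does not stall the rest.

```python
from queue import Queue, Empty
from threading import Thread


def worker(q, results, parse):
    """Take urls from the queue until it is empty and collect the parsed rows."""
    while True:
        try:
            url = q.get_nowait()  # never blocks, unlike q.get()
        except Empty:
            return
        try:
            results.extend(parse(url))
        except Exception as exc:  # a failing page should not kill the worker
            print(f"skipping {url}: {exc}")
        finally:
            q.task_done()


def scrape_all(urls, parse, n_workers=4):
    q = Queue()
    for url in urls:
        q.put(url)
    results = []
    threads = [Thread(target=worker, args=(q, results, parse)) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # every worker returns once the queue is drained
    return results


def fake_parse(url):
    """Stand-in for parse_laptops_thread: no network access."""
    if url == "page-3":
        raise ValueError("simulated broken page")
    return [{"Name": f"laptop from {url}"}]


rows = scrape_all([f"page-{i}" for i in range(1, 6)], fake_parse)
print(len(rows))
```

Because the workers are joined explicitly, they do not need to be daemon threads in this variant.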

In [46]:
len(laptops_data_thread["Price"])

1492

In [52]:
toc=time.time()
results=main_scraper(pages_url[0:len(pages_url)-1])
tic=time.time()
print("Elapsed time: "+str(round(tic-toc,1))+"s")

inside scraper worker, queue not empty

In [45]:
laptops_data_thread=pd.DataFrame(results)

In [47]:
laptops_data_thread.head(5)

Unnamed: 0,3G,Back lit keyboard,Battery Run Time,Capacity,Clock Speed,Colour,Convertible Device,Convertible Device (Y/N),Depth,Form factor,...,SSD Capacity,Screen Resolution,Screen size,Storage,Total Slots,Touchscreen,Warranty,Weight,Wi-Fi,Width
0,,,14hours,,,,,,,,...,,1366 x 768,15.6in,500GB,,Non Touch,3 Month RTB Warranty,,,
1,,,8hours,,,,,,,,...,,2880 x 1800,15.4in,256GB,,Non Touch,1 year warranty,,,
2,,,,,,,,,,,...,,1366 x 768,14in,500GB,,Non Touch,1 year warranty,,,
3,,,9hours,,,,,,,,...,,1920 x 1080,15.6in,1TB,,Non Touch,3 Month RTB Warranty,,,
4,,,,,,,,,,,...,,,14in,160GB,,Non Touch,30 Day RTB Warranty,,,


Now we could proceed the same way we did before (when we didn't use threads). However, this section was just meant to illustrate the reduction in computation time that we get by using threads. We have reduced the computation time by approximately 5 minutes.
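
The gain comes from the fact that scraping spends most of its time waiting for the network, so several requests can wait at the same time. The sketch below illustrates this with a simulated download; it uses `concurrent.futures.ThreadPoolExecutor` from the standard library rather than the `Thread` and `Queue` objects used above, and `fake_download` with its 0.05s delay is an assumption standing in for a real request.

```python
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter, sleep


def fake_download(url):
    """Stand-in for a request: waits a little, then returns a fake result."""
    sleep(0.05)  # simulated network latency
    return url.upper()


urls = [f"page-{i}" for i in range(16)]

# One page after another
start = perf_counter()
sequential = [fake_download(u) for u in urls]
sequential_time = perf_counter() - start

# Eight pages at a time; pool.map returns the results in input order
start = perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded = list(pool.map(fake_download, urls))
threaded_time = perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

The exact timings depend on the machine, but the threaded run should take roughly the time of two simulated downloads instead of sixteen.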