# Data Scraping 

We use this Jupyter Notebook to download data on various parameters of mobile phones from the Heureka.cz website. The data is then processed into a data frame and saved as a CSV file. This file is used in the next part of the project, in which a subset of the data is selected and the relationship between a mobile phone's rating on the website and its parameters is analysed.

## Packages

Six packages are used to scrape the data from the Heureka.cz website. The requests package downloads the HTML code of the individual webpages, and the sleep function from the time package adds a delay between individual requests. The Selector class from the scrapy library creates a Selector object, from which the desired data can be extracted with XPath. The literal_eval function from ast and the loads function from json are used to create a list and a dictionary from a string. Finally, we create a data frame from a list of dictionaries with the DataFrame function from the pandas package.
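
The delay between requests can be sketched as follows. This is a minimal illustration rather than the notebook's own code: fetch_politely, its fetch argument, and the delay parameter are hypothetical names, and a stand-in function replaces requests.get so the example runs offline.

```python
from time import sleep, monotonic

def fetch_politely(links, fetch, delay=1.0):
    """Call fetch(link) for each link, pausing `delay` seconds between calls."""
    results = []
    for i, link in enumerate(links):
        if i > 0:
            sleep(delay)  # wait before every request except the first
        results.append(fetch(link))
    return results

# Stand-in for requests.get(link).content (assumption: keeps the sketch offline)
def fake_fetch(link):
    return "<html>" + link + "</html>"

start = monotonic()
pages = fetch_politely(["a", "b", "c"], fake_fetch, delay=0.1)
elapsed = monotonic() - start  # two pauses happen for three links
```

Passing the download function as an argument keeps the pause logic separate from the network call, so the delay can be tested without contacting the website.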

In [1]:
import requests
import pandas as pd
from time import sleep
from scrapy import Selector
from json import loads
from ast import literal_eval
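
The difference between literal_eval and loads can be illustrated with two short strings. The sketch below uses made-up sample data (the example.com links and the name and price keys are assumptions, not the actual content of the Heureka pages).

```python
from ast import literal_eval
from json import loads

# A Python list written out with str(), as is done later when the links are saved
text_list = str(["https://example.com/a/", "https://example.com/b/"])
links = literal_eval(text_list)  # parses Python literal syntax (single quotes)

# A JSON array holding one object; the keys and values here are made up
text_json = '[{"name": "Phone X", "price": 4999}]'
summary = loads(text_json)[0]  # loads returns a list, [0] picks the dictionary
```

The first string would be rejected by loads, because JSON requires double quotes around strings, which is why the two functions are used for different inputs.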

## Functions and Classes

To download the desired data from the Heureka website, we first created a function that returns the HTML code of a page as a Selector object. The remaining functions are divided into two classes: UrlHeureka, whose objects hold URL links, and GetDic, whose objects combine a selector with a dictionary. A detailed explanation of the functions is provided within the code.

In [53]:
def make_request(link):  #makes a request after a 10 s delay
    sleep(10) #pause so that the requests are spread out
    req1=requests.get(link).content #format that can be used in Selector
    return Selector(text=req1) #returns a Selector object

class UrlHeureka:
    #each object has three attributes
    def __init__(self,url): 
        self.url=url #url of the default webpage (in our case https://mobilni-telefony.heureka.cz/)
        self.links=[] #collects links to the individual mobile phone pages
        self.control=0  #set to 1 after the last available page, which ends the loop
    def get_links(self,page_count):
        url1=self.url+"/?f="+str(page_count) #address of subpage number page_count of the default webpage
        req1=make_request(url1) #Selector object for the subpage
        urls=req1.xpath('//li[@class="c-product-list__item"]/section[@class="c-product"]/a[1]/@href').getall() #all url links to mobile phone pages
        if len(urls)==0: #a subpage with no links means we are past the last page
            self.control=1 #control changes to 1, which ends the loop
        for x in urls:
            self.links.append(x) #links from this subpage are added to the list that stores the links from all subpages
    def clean_links(self): #filters out links that lead outside heureka.cz
        helplist=[] 
        for x in self.links: #links containing the string .cz/exit/ are excluded, because they lead to pages outside Heureka
            if ".cz/exit/" not in x:
                helplist.append(x) #links that fulfil the condition are appended to the help list
        self.links=list(helplist) #the links attribute now contains only the filtered urls

class GetDic:
    def __init__(self,sel): #class GetDic has 2 attributes, a Selector object and a dictionary
        self.sel=sel
        self.dic_par={}
    def dic_creat(self):
        #the first path returns parameters names, the second one their respective values
        list1=self.sel.xpath('//div[@class="o-layout__item c-parameters"]//div[@class="c-parameters-box"]//div[@class="o-layout__item c-parameters-box__parameter-name"]//text()').getall()
        list2=self.sel.xpath('//div[@class="o-layout__item c-parameters"]//div[@class="c-parameters-box"]//div[@class="o-layout__item c-parameters-box__parameter-value u-bold"]//text()[1]').getall()
        if len(list1)==len(list2): #sometimes the Výrobce (manufacturer) parameter has no value, or both the name and the value are missing
            if "Výrobce" in list1: #same number of names and values: Výrobce has a value, so both the name and its value (the first one) are dropped
                list1.remove("Výrobce") 
                list2.pop(0)
        else:
            if "Výrobce" in list1: #fewer values than names: only the Výrobce name is present, so only the name is dropped
                list1.remove("Výrobce")   
        for key,value in zip(list1,list2): #pairs each name with the value at the same position
            self.dic_par[key]=value #the name from list1 is the key, the entry from list2 is the value
    def get_sum(self): #gets summary information from the same page as the parameter names and values
        path1=self.sel.xpath('//meta[@name="gtm:ecommerce:detail:products"]/@content').get() #path to the summary information
        if path1 is None: #in rare cases the information is not available, which would cause an error, because the loads function cannot process missing or empty input
            return None
        elif len(path1)==0: #so if the result is empty, None is returned
            return None 
        self.dic_sumary=loads(path1)[0] #parses the text into a list and takes its first element, a dictionary
        self.dic_par.update(self.dic_sumary) #adds the values from that dictionary to the parameter dictionary
    def get_rat(self): #adds the number of ratings and the rating from the Heureka website
        dic1={}
        path1=self.sel.xpath('//div[@class="c-pipe-list u-standard-top-margin"]//p[@class="c-pipe-list__item"]//text()').get()
        path2=self.sel.xpath('//p[@class="c-rating-widget c-pipe-list__item u-color-highlight"]//text()[1]').get()
        if (path1 is None) or (path2 is None):
            return None
        elif (len(path1)==0) or (len(path2)==0):
            return None
        path1=path1.split(" ") #the path returns a single string, which we split into individual words
        dic1["numberofratings"]=path1[2] #the word at index 2 is stored as the number of ratings
        dic1["ratingH"]=path2 #the rating shown on Heureka
        self.dic_par.update(dic1) #we add the values to the parameter dictionary
    def complete_dictionary(self): #function that creates the dictionary for one mobile phone
        self.dic_creat()
        self.get_rat()
        self.get_sum()
        return self.dic_par
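
The pairing of parameter names and values in dic_creat can be tried out in isolation. The sketch below is a simplified stand-in, not the class method itself: pair_parameters and its skip argument are hypothetical names, the sample lists are made up, and, as in dic_creat, the value belonging to Výrobce is assumed to come first.

```python
def pair_parameters(names, values, skip="Výrobce"):
    """Pair parameter names with values by position, dropping the `skip` entry."""
    names = list(names)    # copies, so the caller's lists are not modified
    values = list(values)
    if skip in names:
        if len(names) == len(values):
            # the skipped name has a value; it is assumed to be the first one
            values.pop(0)
        names.remove(skip)
    return dict(zip(names, values))  # zip stops at the shorter list

# Case 1: "Výrobce" is present together with its value
full = pair_parameters(["Výrobce", "Konstrukce", "Hmotnost"],
                       ["Xiaomi", "dotykové", "215"])
# Case 2: "Výrobce" is present, but its value is missing
partial = pair_parameters(["Výrobce", "Konstrukce", "Hmotnost"],
                          ["dotykové", "215"])
```

Both cases give the same dictionary, which is the point of the length comparison: it decides whether a value has to be dropped along with the name.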

    

## Downloading URLs

We access the code of all subpages with a while loop; in each iteration, one subpage is scraped for links to the individual mobile phones. Before the loop begins, we create the variable page_count, which is used to move between the individual subpages, and an object of the class UrlHeureka called source_page, which downloads and stores the links to the individual mobile phone pages. With each iteration we add 1 to page_count, so the get_links function downloads the links from the next subpage. When we reach a subpage with no links, the control attribute of source_page is set to 1, which ends the loop. Finally, we filter out the links that lead to other websites, check that the links are in the right format, and save them as a text file so we do not have to scrape them again.

In [38]:
page_count=0 
source_page=UrlHeureka('https://mobilni-telefony.heureka.cz/') 
while source_page.control==0: 
    page_count+=1 
    source_page.get_links(page_count)
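
The stopping rule of this loop can be tested without any network access. The following sketch is an illustration under assumptions: collect_links, fetch_page, and max_pages are hypothetical names, and a small dictionary of made-up links stands in for the website.

```python
def collect_links(fetch_page, max_pages=1000):
    """Request pages 1, 2, 3, ... until a page returns no links."""
    links = []
    for page in range(1, max_pages + 1):  # max_pages guards against an endless loop
        found = fetch_page(page)
        if not found:
            break  # an empty page marks the end of the listing
        links.extend(found)
    return links

# Stand-in for UrlHeureka.get_links: three pages of made-up links, then nothing
fake_site = {1: ["/a/", "/b/"], 2: ["/c/"], 3: ["/d/", "/e/"]}
all_links = collect_links(lambda page: fake_site.get(page, []))
```

The upper bound on the page number is an addition of this sketch; it prevents the loop from running forever if the site never returns an empty page.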


In [39]:
print(len(source_page.links))
source_page.clean_links() 
print(len(source_page.links)) 
print(source_page.links[1:10]) 

1334
1038
['https://mobilni-telefony.heureka.cz/poco-x3-pro-8gb-256gb/', 'https://mobilni-telefony.heureka.cz/apple-iphone-13-128gb/', 'https://mobilni-telefony.heureka.cz/xiaomi-redmi-note-9-pro-6gb-128gb/', 'https://mobilni-telefony.heureka.cz/poco-f3-8gb-256gb/', 'https://mobilni-telefony.heureka.cz/xiaomi-redmi-9a-2gb-32gb/', 'https://mobilni-telefony.heureka.cz/apple-iphone-se-2020-64gb/', 'https://mobilni-telefony.heureka.cz/apple-iphone-12-mini-64gb/', 'https://mobilni-telefony.heureka.cz/apple-iphone-13-pro-128gb/', 'https://mobilni-telefony.heureka.cz/apple-iphone-11-64gb/']
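
The drop from 1334 to 1038 links comes from the filter in clean_links. The same filter can be written as a list comprehension, shown here as a sketch on a small sample; keep_internal and marker are hypothetical names, and the redirect link in the sample is made up.

```python
def keep_internal(links, marker=".cz/exit/"):
    """Return the links that do not contain the redirect marker."""
    return [link for link in links if marker not in link]

sample = [
    "https://mobilni-telefony.heureka.cz/poco-x3-pro-8gb-256gb/",
    "https://www.heureka.cz/exit/some-shop/",  # made-up redirect link
    "https://mobilni-telefony.heureka.cz/apple-iphone-13-128gb/",
]
internal = keep_internal(sample)
```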


In [40]:
with open("urllist.txt", "w") as output: 
    output.write(str(source_page.links))

In [41]:
with open("urllist.txt", "r") as source: #the file is closed automatically
    list_url = literal_eval(source.read()) #turns the saved text back into a list

In [42]:
print(list_url[1:10]) 
len(list_url)

['https://mobilni-telefony.heureka.cz/poco-x3-pro-8gb-256gb/', 'https://mobilni-telefony.heureka.cz/apple-iphone-13-128gb/', 'https://mobilni-telefony.heureka.cz/xiaomi-redmi-note-9-pro-6gb-128gb/', 'https://mobilni-telefony.heureka.cz/poco-f3-8gb-256gb/', 'https://mobilni-telefony.heureka.cz/xiaomi-redmi-9a-2gb-32gb/', 'https://mobilni-telefony.heureka.cz/apple-iphone-se-2020-64gb/', 'https://mobilni-telefony.heureka.cz/apple-iphone-12-mini-64gb/', 'https://mobilni-telefony.heureka.cz/apple-iphone-13-pro-128gb/', 'https://mobilni-telefony.heureka.cz/apple-iphone-11-64gb/']


1038
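
The notebook stores the list with str() and reads it back with literal_eval. A possible alternative, which is not what the notebook uses, is to store the list as JSON with json.dump and json.load. The sketch below assumes a hypothetical file name and writes to a temporary directory so that it leaves no files behind.

```python
import json
import os
import tempfile

links = [
    "https://mobilni-telefony.heureka.cz/poco-x3-pro-8gb-256gb/",
    "https://mobilni-telefony.heureka.cz/apple-iphone-13-128gb/",
]

# A temporary directory keeps the sketch from touching the project files
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "urllist.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as output:
        json.dump(links, output)
    with open(path, "r", encoding="utf-8") as source:
        restored = json.load(source)
```

A JSON file can also be read by tools other than Python, which may be useful if the link list is shared.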

## Creating Data Frame

To create a data frame with the values of the mobile phone parameters, we first iterate over all the links downloaded in the previous step. In each iteration, we make a request to one mobile phone webpage and create a selector with the make_request function. Then we create a GetDic object, which builds a dictionary of all the parameters of an individual mobile phone and their values; the method that does this is called complete_dictionary. All the dictionaries are appended to a list called listofdic, which is used after the loop completes to create a data frame with the DataFrame function from the pandas package. Finally, we check that the data was downloaded correctly and save the data frame as a CSV file.

In [57]:
listofdic=[] 
for x in list_url: 
    req1=make_request(x) 
    mp1=GetDic(req1)
    all_parameters=mp1.complete_dictionary()
    listofdic.append(all_parameters) 
df1=pd.DataFrame(listofdic) 
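
Because not every phone lists the same parameters, the dictionaries in listofdic can have different keys. The sketch below, built on two made-up records, shows how DataFrame handles this: the columns are the union of all keys, and a missing entry becomes NaN.

```python
import pandas as pd

# Two made-up phones; the second has no "Možnost paměťové karty" entry
records = [
    {"Konstrukce": "dotykové", "Hmotnost": "215", "Možnost paměťové karty": "ano"},
    {"Konstrukce": "dotykové", "Hmotnost": "196"},
]
frame = pd.DataFrame(records)  # columns are the union of all keys
missing = frame["Možnost paměťové karty"].isna().tolist()
```

All scraped values are still strings at this stage, so numeric columns such as Hmotnost would need to be converted before any analysis.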

In [58]:
print(listofdic[1:3]) 

[{'Konstrukce': 'dotykové', 'Operační systém': 'Android', 'Verze operačního systému': 'Android 11', 'Hmotnost': '215', 'Možnost paměťové karty': 'ano', 'Paměť RAM': '8192', 'Produktová řada': 'Poco X3 Pro', 'Modelový rok': '2021', 'Velikost displeje': '6.67', 'Rozlišení displeje': '2400 x 1080', 'Poměr stran displeje': '20:9', 'Počet displejů': '1', 'Typ displeje': 'IPS LCD', 'Jemnost displeje (PPI)': '395', 'Výška': '165.3', 'Šířka': '76.8', 'Hloubka': '9.4', 'Fotoaparát': 'ano', 'Rozlišení fotoaparátu': '48', 'Blesk': 'ano', 'HD video': 'ano', 'Přední kamera': 'ano', 'Počet objektivů zadního fotoaparátu': '4 objektivy', 'Počet objektivů předního fotoaparátu': '1 objektiv', 'Světelnost objektivu hlavního fotoaparátu': 'f/1.8', 'Světelnost objektivu předního fotoaparátu': 'f/2.2', 'Maximální rozlišení videa': '2160p (4K)', 'Google Pay': 'ano', 'Odemykání obličejem': 'ano', 'Jack 3,5': 'ano', 'Snímač otisků prstů': 'ano', 'Typ nabíječky (konektor)': 'USB-C', 'USB On-The-Go': 'ano', 'Ryc

In [59]:
print(df1[1:10]) 
df1.to_csv('raw_data.csv', index=False, header=True, encoding='utf-8')

  Konstrukce Operační systém Verze operačního systému Hmotnost  \
1   dotykové         Android               Android 11      215   
2   dotykové             iOS                   iOS 15      174   
3   dotykové         Android               Android 10      209   
4   dotykové         Android               Android 11      196   
5   dotykové         Android               Android 10      196   
6   dotykové             iOS                   iOS 13      148   
7   dotykové             iOS                   iOS 14      133   
8   dotykové             iOS                   iOS 15      203   
9   dotykové             iOS                   iOS 13      194   

  Možnost paměťové karty Paměť RAM          Produktová řada Modelový rok  \
1                    ano      8192              Poco X3 Pro         2021   
2                     ne      4096          Apple iPhone 13         2021   
3                    ano      6144  Xiaomi Redmi Note 9 Pro         2020   
4                    NaN      8192 

In [None]:
#backup
path=req1.xpath('//div[@class="o-layout__item c-parameters"]//div[@class="c-parameters-box"]//div[@class="o-layout__item c-parameters-box__parameter-name"]//text()').getall()
print(path)

path2=req1.xpath('//div[@class="o-layout__item c-parameters"]//div[@class="c-parameters-box"]//div[@class="o-layout__item c-parameters-box__parameter-value u-bold"]//text()[1]').getall()
print(path2)
GetDic(req1).dic_creat() #dic_creat is a method of GetDic, so it is called on an object