<h1>Parsing OMA website</h1>

<p>This notebook is a prototype for testing pieces of Python code that parse the catalog of a specific website, oma.by.</p>

In [1]:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup as BSoup
import pandas as pd
import re

In [2]:
url = 'https://www.oma.by/catalog/'
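# Open a connection and download the catalog page
oma_main_html = requests.get(url, timeout=30)
# Fail early on an HTTP error instead of parsing an error page
oma_main_html.raise_for_status()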

<p>The soup is a BeautifulSoup object, which represents the document as a nested data structure:</p>

In [3]:
oma_catalog_soup = BSoup(oma_main_html.text, 'html.parser')
# Show the title tag as a quick check that the page was parsed
oma_catalog_soup.title
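
<p>To make the "nested data structure" idea concrete, here is a minimal sketch on a made-up page (the HTML string below is hypothetical, not the real OMA markup). It shows that tags can be reached by walking down the tree, and back up again.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page standing in for the real catalog HTML.
html_doc = """
<html>
  <head><title>Demo catalog</title></head>
  <body>
    <section class="bordered-section">
      <h2>Tools</h2>
      <a href="/anchors">Anchors</a>
    </section>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Attribute access walks down to the first matching descendant tag.
title_text = soup.title.string          # text inside the title tag
first_heading = soup.section.h2.string  # first h2 inside the first section

# Every Tag knows its parent, so the tree can be walked upwards as well.
parent_name = soup.h2.parent.name

print(title_text, first_heading, parent_name)
```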

<h3>Getting categories </h3>
<p>The soup object was created successfully. Next, we try to find the category headers on the catalog page (there are 7 main categories).</p>
<img src="screenshots/category_schema.png" align="left" width="600">

In [4]:
categories = oma_catalog_soup.select('section.bordered-section h2')
print(f'There are {len(categories)} h2 tags inside bordered-section elements')

print('\nThe whole list of the categories:')
for category in categories:
    print(category)

There are 7 h2 tags inside bordered-section elements

The whole list of the categories:
<h2>Инструменты, крепёж</h2>
<h2>Отделка</h2>
<h2>Садовый центр / отдых</h2>
<h2>Сантехника</h2>
<h2>Строительство</h2>
<h2>Товары для дома</h2>
<h2>Электротехника</h2>
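
<p>The selector used above relies on the CSS descendant combinator. The sketch below, on a hypothetical fragment, illustrates why only headings nested inside a section with the class <i>bordered-section</i> are returned, while other h2 tags on the page are ignored.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one h2 inside the target section, two outside it.
html_doc = """
<h2>Site header</h2>
<section class="bordered-section js-accordion-group">
  <header><h2>Tools</h2></header>
</section>
<section class="other"><h2>Footer block</h2></section>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# "section.bordered-section h2" matches any h2 that is a descendant of a
# section element carrying the class "bordered-section".
headings = soup.select("section.bordered-section h2")
names = [h.get_text(strip=True) for h in headings]

print(names)
```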


<h3>Getting all categories</h3>
<p>The list of categories is correct, so we can use these h2 tags as anchors for looping.</p>
<img src="screenshots/subcategories_schema.png" align="left" width="720">

<p>Subcategories lvl1 and lvl2 can be collected by processing the categories one by one. For that, it is useful to extract the 7 category sections first.</p>

In [5]:
categories = oma_catalog_soup.find_all('section',
                                       {'class': 'bordered-section js-accordion-group'})

print(f'\nThe type of categories is {type(categories)}.')

print(f'\nThere are {len(categories)} main category objects in {type(categories)}.')

print(f'\nEach category object has {type(categories[0])} type.')

#strip the name of the category
category_name_0 = categories[0].select('section.bordered-section h2')

print('\nFirst category raw:')
print(category_name_0)
print(type(category_name_0))

print('\nFirst category extracted from list:')
print(category_name_0[0])
print(type(category_name_0[0]))

print('\nFirst category extracted from list and converted to string:')
print(str(category_name_0[0]))
print(type(str(category_name_0[0])))
print('')


The type of categories is <class 'bs4.element.ResultSet'>.

There are 7 main category objects in <class 'bs4.element.ResultSet'>.

Each category object has <class 'bs4.element.Tag'> type.

First category raw:
[<h2>Инструменты, крепёж</h2>]
<class 'list'>

First category extracted from list:
<h2>Инструменты, крепёж</h2>
<class 'bs4.element.Tag'>

First category extracted from list and converted to string:
<h2>Инструменты, крепёж</h2>
<class 'str'>
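
<p>The output above shows the chain ResultSet, Tag, str. As a possible shortcut, Beautiful Soup also offers <i>select_one</i> and <i>get_text</i>. The following is only a sketch on a hypothetical one-line fragment, not a change to the cells in this notebook.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a single category heading.
html_doc = '<section class="bordered-section"><h2> Tools </h2></section>'
section = BeautifulSoup(html_doc, "html.parser").section

# select() always returns a list-like ResultSet, even for a single match.
matches = section.select("h2")

# select_one() returns the first matching Tag directly, or None if nothing matches.
heading = section.select_one("h2")
missing = section.select_one("h3")

# get_text() returns the text content without the surrounding markup.
name = heading.get_text(strip=True)

print(len(matches), name, missing)
```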



<p>Extract category names:</p>

In [6]:
def remove_tags(text):
    """Remove HTML tags from a string."""
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

def extract_category_name(tags):
    """Return the text of the first tag in a list of tags."""
    name_with_tags = str(tags[0])
    return remove_tags(name_with_tags)

for category in categories:
    category_name_raw = category.select('section.bordered-section h2')
    print(extract_category_name(category_name_raw))

Инструменты, крепёж
Отделка
Садовый центр / отдых
Сантехника
Строительство
Товары для дома
Электротехника
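
<p>Stripping tags with a regular expression works for the simple headings above, but a tag converted to a string may still contain HTML entities such as <i>&amp;amp;</i>. A minimal standard-library sketch that also decodes entities could look like this (<i>strip_tags</i> and the second sample string are hypothetical):</p>

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(markup: str) -> str:
    """Drop anything that looks like a tag, then decode HTML entities."""
    without_tags = TAG_RE.sub("", markup)
    return html.unescape(without_tags).strip()

heading_name = strip_tags("<h2>Инструменты, крепёж</h2>")
link_name = strip_tags('<a href="/x">Болты &amp; гайки</a>')

print(heading_name)
print(link_name)
```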


<h3>Getting subcategories lvl_1</h3>
<p>Getting subcategories lvl_1 from a single category object.</p>

In [7]:
subcats_lvl_1 = categories[0].find_all('div',
                                       {'class': 'catalog-all-item'})

print(f'\nThe type of subcats_lvl_1 is {type(subcats_lvl_1)}.')

print(f'\nThere are {len(subcats_lvl_1)} subcats_lvl_1 objects')

print(f'\nEach subcats_lvl_1 object has {type(subcats_lvl_1[0])} type.')

print('\n\n first subcategory_lvl_1:')
print(subcats_lvl_1[0])

print(subcats_lvl_1[0].select('div.accordion-item_title a'))


The type of subcats_lvl_1 is <class 'bs4.element.ResultSet'>.

There are 40 subcats_lvl_1 objects

Each subcats_lvl_1 object has <class 'bs4.element.Tag'> type.


 first subcategory_lvl_1:
<div class="catalog-all-item col col-lg-3-of-12 col-md-4-of-12 col-sm-6-of-12 col-xs-12-of-12">
<div class="catalog-all-item_img-box">
<img alt="" src="/upload/Sh/imageCache/ad0/5e0/14b9b1eb45799ec953c4d4a00d601b13.jpg"/>
</div>
<div class="accordion-item accordion-item__right-icon js-accordion">
<div class="accordion-item_top">
<div class="accordion-item_icon css-plus-icon js-accordion-bar"></div>
<div class="accordion-item_title">
<a href="/ankery-13127-c">Анкеры</a>
</div>
</div>
<div class="accordion-item_body js-accordion-body js-show-more-box" data-items-visible="11">
<a class="section-submenu-sublink" href="/ankery-ramnye-13251-c">Анкеры рамные</a>
<a class="section-submenu-sublink" href="/ankery-spetsialnye-13252-c">Анкеры специальные</a>
<a class="section-submenu-sublink" href="/bolty-anker

<p>Extract subcats_lvl_1 names from first <i>category</i> object:</p>

In [8]:
subcats_lvl_1_divs = categories[0].find_all('div',
                                            {'class': 'accordion-item_title'})

print(f'\nThere are {len(subcats_lvl_1_divs)} subcategory lvl_1 divs')

subcats_lvl_1_a_tags = categories[0].select('div.accordion-item_title a')
print(f'\nThere are {len(subcats_lvl_1_a_tags)} subcategory lvl_1 a tags')

print(f'\nThe first subcategory lvl_1 tag: {subcats_lvl_1_a_tags[0]}')

print('\n\nFirst 3 subcategory lvl_1 tags:')
for subcat_lvl_1_a_tag in subcats_lvl_1_a_tags[:3]:
    print(f'\n{subcat_lvl_1_a_tag}')

print('\n\nThe last 4 subcategory lvl_1 tag names:')
for subcat_lvl_1_a_tag in subcats_lvl_1_a_tags[-4:]:
    print(f'\n{remove_tags(str(subcat_lvl_1_a_tag))}')


There are 40 subcategory lvl_1 divs

There are 40 subcategory lvl_1 a tags

The first subcategory lvl_1 tag: <a href="/ankery-13127-c">Анкеры</a>


First 3 subcategory lvl_1 tags:

<a href="/ankery-13127-c">Анкеры</a>

<a href="/benzokosy-travokosilki-elektrotrimmery-14309-c">Бензокосы(травокосилки)/электротриммеры</a>

<a href="/bolty-vinty-gayki-shayby-shpilki-konfirmaty-13128-c">Болты/ винты/ гайки/ шайбы/шпильки/конфирматы</a>


The last 4 subcategory lvl_1 tag names:

Шлифмашины, полирователи, бороздоделы

Шурупы/ заглушки

Электросварочное оборудование и материалы

Ящики для инструмента , органайзеры
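
<p>Since each a tag carries both a name and a link, the two can be collected together. The sketch below uses a hypothetical fragment shaped like the title divs above. Note that <i>.get('href')</i> returns None rather than raising when the attribute is missing.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second a tag deliberately has no href.
html_doc = """
<div class="accordion-item_title"><a href="/ankery-13127-c">Анкеры</a></div>
<div class="accordion-item_title"><a>No link here</a></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Build (name, href) pairs in one pass over the matching a tags.
pairs = [
    (a.get_text(strip=True), a.get("href"))
    for a in soup.select("div.accordion-item_title a")
]

print(pairs)
```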


<p>The last 4 subcategories lvl_1:</p>
<img src="screenshots/subcats_lvl_1.png" align="left" width="720">

<h3>Getting subcategories lvl_2</h3>
<p>Getting subcategories lvl_2 from a single subcats_lvl_1 object (here the second one, index 1).</p>

In [9]:
subcats_lvl_2_tags = subcats_lvl_1[1].find_all('a',
                                               {'class': 'section-submenu-sublink'})

print(f'\nThe type of subcats_lvl_2 is {type(subcats_lvl_2_tags)}.')

print(f'\n\nThere are {len(subcats_lvl_2_tags)} subcats_lvl_2 objects')

print(f'\n\nEach subcats_lvl_2 object has {type(subcats_lvl_2_tags[0])} type.')

print('\n\nSubcategory lvl_2 objects')
for subcat_lvl_2_tag in subcats_lvl_2_tags[:3]:
    print(f'{subcat_lvl_2_tag}')

print('\n\nSubcategory lvl_2 tag names:')
for subcat_lvl_2_a_tag in subcats_lvl_2_tags[-3:]:
    print(f'\n{remove_tags(str(subcat_lvl_2_a_tag))}')

print('\n\nSubcategory lvl_2 tag links:')
for subcat_lvl_2_a_tag in subcats_lvl_2_tags[-3:]:
    link = subcat_lvl_2_a_tag.get('href')
    print(f'\n {link}')


#soup.findAll('a', attrs={'href': re.compile("^http://")})


The type of subcats_lvl_2 is <class 'bs4.element.ResultSet'>.


There are 5 subcats_lvl_2 objects


Each subcats_lvl_2 object has <class 'bs4.element.Tag'> type.


Subcategory lvl_2 objects
<a class="section-submenu-sublink" href="/benzokosy-travokosilki-bytovye-14324-c">Бензокосы (травокосилки) бытовые</a>
<a class="section-submenu-sublink" href="/benzokosy-travokosilki-professionalnye-14325-c">Бензокосы (травокосилки) профессиональные</a>
<a class="section-submenu-sublink" href="/trimmery-akkumulyatornye-14326-c">Триммеры аккумуляторные</a>


Subcategory lvl_2 tag names:

Триммеры аккумуляторные

Триммеры электрические с верхним двигателем

Триммеры электрические с нижним двигателем


Subcategory lvl_2 tag links:

 /trimmery-akkumulyatornye-14326-c

 /trimmery-elektricheskie-s-verkhnim-dvigatelem-14327-c

 /trimmery-elektricheskie-s-nizhnim-dvigatelem-14328-c
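
<p>The hrefs printed above are site-relative. One way to turn them into absolute URLs is <i>urljoin</i> from the standard library, which also leaves already-absolute links untouched. This is a sketch: <i>BASE_URL</i> is taken from the catalog address used in this notebook, and the second list entry is a hypothetical case.</p>

```python
from urllib.parse import urljoin

BASE_URL = "https://www.oma.by"

relative_links = [
    "/trimmery-akkumulyatornye-14326-c",
    "https://www.oma.by/ankery-ramnye-13251-c",  # already absolute (hypothetical case)
]

# urljoin resolves each link against the base, like a browser would.
absolute_links = [urljoin(BASE_URL, link) for link in relative_links]

print(absolute_links)
```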


<h3>Create a dataframe for links storage</h3>
<p>It seems useful to have a dataframe containing all links to subcategories lvl 2. The dataframe is expected to be lightweight and useful for future parsing.</p>

In [10]:
main_page_links_df = pd.DataFrame(columns=['Category',
                                           'Subcategory lvl 1', 'Subcategory lvl 2',
                                           'Link'])
main_page_links_df

,Category,Subcategory lvl 1,Subcategory lvl 2,Link


<h3>Create a loop for iterating over categories and subcategories and fill the dataframe</h3>
<p>The loop walks through every category, then through each of its subcategories lvl 1, then through each subcategory lvl 2 link, and writes one row per subcategory lvl 2 into the dataframe.</p>

In [11]:
%%time
i = 0  # row counter for the dataframe
for category in categories:
    category_name_raw = category.select('section.bordered-section h2')
    category_name = extract_category_name(category_name_raw)
    subcats_lvl_1 = category.find_all('div', {'class': 'catalog-all-item'})
    for subcat_lvl_1 in subcats_lvl_1:
        subcat_lvl_1_name_raw = subcat_lvl_1.select('div.accordion-item_title a')
        subcat_lvl_1_name = extract_category_name(subcat_lvl_1_name_raw)
        subcats_lvl_2_tags = subcat_lvl_1.find_all('a', {'class': 'section-submenu-sublink'})

        for subcat_lvl_2_tag in subcats_lvl_2_tags:
            subcat_lvl_2_name = remove_tags(str(subcat_lvl_2_tag))
            # hrefs are site-relative, so prepend the domain
            link = 'https://www.oma.by' + subcat_lvl_2_tag.get('href')
            main_page_links_df.loc[i] = [category_name, subcat_lvl_1_name, subcat_lvl_2_name, link]
            i += 1

CPU times: user 3.36 s, sys: 26.2 ms, total: 3.39 s
Wall time: 3.45 s
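
<p>Assigning rows one by one with <i>.loc</i> grows the dataframe a row at a time, which is usually the slow part of such a loop. A common alternative is to collect plain dicts in a list and build the dataframe once at the end. The sketch below uses a hypothetical nested dict in place of the parsed catalog; whether it is noticeably faster on the real page has not been measured here.</p>

```python
import pandas as pd

# Hypothetical stand-in for the parsed catalog:
# category -> subcategory lvl 1 -> list of (subcategory lvl 2 name, href).
catalog = {
    "Tools": {
        "Anchors": [("Frame anchors", "/a-1-c"), ("Special anchors", "/a-2-c")],
        "Trimmers": [("Battery trimmers", "/t-1-c")],
    },
}

rows = []
for category_name, subcats_lvl_1 in catalog.items():
    for subcat_lvl_1_name, subcats_lvl_2 in subcats_lvl_1.items():
        for subcat_lvl_2_name, href in subcats_lvl_2:
            rows.append({
                "Category": category_name,
                "Subcategory lvl 1": subcat_lvl_1_name,
                "Subcategory lvl 2": subcat_lvl_2_name,
                "Link": "https://www.oma.by" + href,
            })

# One DataFrame construction at the end instead of one .loc assignment per row.
links_df = pd.DataFrame(
    rows,
    columns=["Category", "Subcategory lvl 1", "Subcategory lvl 2", "Link"],
)

print(links_df.shape)
```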


In [12]:
main_page_links_df.head()

,Category,Subcategory lvl 1,Subcategory lvl 2,Link
0,"Инструменты, крепёж",Анкеры,Анкеры рамные,https://www.oma.by/ankery-ramnye-13251-c
1,"Инструменты, крепёж",Анкеры,Анкеры специальные,https://www.oma.by/ankery-spetsialnye-13252-c
2,"Инструменты, крепёж",Анкеры,Болты анкерные,https://www.oma.by/bolty-ankernye-13253-c
3,"Инструменты, крепёж",Бензокосы(травокосилки)/электротриммеры,Бензокосы (травокосилки) бытовые,https://www.oma.by/benzokosy-travokosilki-byto...
4,"Инструменты, крепёж",Бензокосы(травокосилки)/электротриммеры,Бензокосы (травокосилки) профессиональные,https://www.oma.by/benzokosy-travokosilki-prof...


In [13]:
main_page_links_df.tail()

,Category,Subcategory lvl 1,Subcategory lvl 2,Link
1218,Электротехника,Электрощитовое оборудование,Предохранители пар и плавкие вставки,https://www.oma.by/predokhraniteli-par-i-plavk...
1219,Электротехника,Электрощитовое оборудование,Узо и дифференциальные автоматы,https://www.oma.by/uzo-i-differentsialnye-avto...
1220,Электротехника,Электрощитовое оборудование,Шины соединительные,https://www.oma.by/shiny-soedinitelnye-14194-c
1221,Электротехника,Электрощитовое оборудование,Ящики и корпуса для электрооборудования металл...,https://www.oma.by/yashchiki-i-korpusa-dlya-el...
1222,Электротехника,Электрощитовое оборудование,Ящики и корпуса для электрооборудования пласти...,https://www.oma.by/yashchiki-i-korpusa-dlya-el...


<h3>Preliminary Conclusion</h3>
<p>We now have a pandas dataframe containing the category, subcategory lvl 1, subcategory lvl 2 and the link to each subcategory lvl 2 section. This table will be used to feed the parser.</p>
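
<p>As a rough idea of how the table could feed a parser later on, the sketch below iterates over the Link column and passes each URL to a placeholder function. Both <i>parse_subcategory</i> and the two sample rows are hypothetical; the real parser does not exist yet and no request is made here.</p>

```python
import pandas as pd

def parse_subcategory(link):
    """Hypothetical placeholder for the future parser; returns a dummy record."""
    return {"link": link, "status": "not implemented"}

# Tiny hypothetical stand-in for main_page_links_df.
links_df = pd.DataFrame({
    "Category": ["Tools", "Tools"],
    "Subcategory lvl 1": ["Anchors", "Anchors"],
    "Subcategory lvl 2": ["Frame anchors", "Special anchors"],
    "Link": ["https://www.oma.by/a-1-c", "https://www.oma.by/a-2-c"],
})

# Iterating over a single column yields its values one by one.
results = [parse_subcategory(link) for link in links_df["Link"]]

print(len(results))
```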