<b><h2>Web scraping data about IKEA wardrobes with BeautifulSoup</h2></b>

### Libraries

In [1]:
import pandas as pd
import numpy as np
import requests
import time
from bs4 import BeautifulSoup

### Connection

In [2]:
url = 'https://www.ikea.com/pl/pl/cat/szafy-19053/?page=1'
response = requests.get(url)
response.status_code

200
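
The cell above calls `requests.get` with its defaults and checks `status_code` by eye. A minimal sketch of a slightly more defensive variant follows. The helpers `page_url` and `fetch_page`, the 10-second timeout, and the `User-Agent` string are all assumptions made for illustration and are not part of the original notebook.

```python
import requests

BASE_URL = 'https://www.ikea.com/pl/pl/cat/szafy-19053/'

def page_url(page):
    """Build the category URL for a given page number (hypothetical helper)."""
    return f'{BASE_URL}?page={page}'

def fetch_page(page, timeout=10):
    """Download one category page and raise requests.HTTPError on a 4xx/5xx response."""
    # Illustrative header value (an assumption, not taken from the notebook)
    headers = {'User-Agent': 'ikea-wardrobes-notebook/0.1'}
    response = requests.get(page_url(page), headers=headers, timeout=timeout)
    response.raise_for_status()
    return response
```

With this in place, `fetch_page(1).content` could be passed to `BeautifulSoup` in the same way as `response.content`.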

### Scraping code - getting data about IKEA wardrobe products across 4 pages

In [3]:
soup = BeautifulSoup(response.content,'html.parser')
ikea = soup.find_all('div',class_='pip-compact-price-package')

In [47]:
g = []
for page in range(1,5):
    url = f'https://www.ikea.com/pl/pl/cat/szafy-19053/?page={page}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content,'html.parser')
    ikea = soup.find_all('div',class_='pip-compact-price-package')
    for i in ikea:
        name1 = i.find('span',class_='pip-header-section__title--small notranslate').text.strip()
        name2 = i.find('span',class_='pip-header-section__description-text').text.strip()
        # Look the measurement span up once; some products do not have one
        measurement = i.find('span',class_='pip-header-section__description-measurement')
        # Kept as a one-element list; the data cleaning step joins it back into a string
        dimension = [measurement.text.strip() if measurement is not None else 'Not given']
        price = i.find('span',class_='pip-price__integer').text.strip().replace(' ','')
        g.append([name1,name2,dimension,price])
    # Report progress first, then pause 3-9 seconds before requesting the next page
    print(f'Getting page {page}. Waiting...')
    time.sleep(np.random.randint(3,10))

Getting page 1. Waiting...
Getting page 2. Waiting...
Getting page 3. Waiting...
Getting page 4. Waiting...
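
The loop above calls `.text` directly on the result of `find`, which raises an `AttributeError` whenever an element is missing from a product card. A minimal sketch of a guard follows, assuming a hypothetical helper `text_or_default` and a made-up HTML snippet that only mimics the class names used in the loop.

```python
from bs4 import BeautifulSoup

def text_or_default(parent, tag, class_, default='Not given'):
    """Return the stripped text of the first matching child, or `default` if absent."""
    node = parent.find(tag, class_=class_)
    return node.get_text(strip=True) if node is not None else default

# Made-up HTML that mimics one product card without a measurement span
html = '''
<div class="pip-compact-price-package">
  <span class="pip-header-section__description-text">Szafa 2 drzwi i 2 szuflady</span>
  <span class="pip-price__integer">899</span>
</div>
'''
card = BeautifulSoup(html, 'html.parser').find('div', class_='pip-compact-price-package')

name2 = text_or_default(card, 'span', 'pip-header-section__description-text')
dimension = text_or_default(card, 'span', 'pip-header-section__description-measurement')
price = text_or_default(card, 'span', 'pip-price__integer')
```

Here `dimension` falls back to `'Not given'` because the snippet has no measurement span, while the other two fields are read normally.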


In [48]:
ikea_df = pd.DataFrame(g,columns=('name1','name2','dimension','price'))

In [49]:
ikea_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   name1      80 non-null     object
 1   name2      80 non-null     object
 2   dimension  80 non-null     object
 3   price      80 non-null     object
dtypes: object(4)
memory usage: 2.6+ KB


In [50]:
ikea_df.head(10)

|   | name1 | name2 | dimension | price |
|---|---|---|---|---|
| 0 | KLEPPSTAD | Szafa/3 drzwi, | [117x176 cm] | 599 |
| 1 | PAX / BERGSBO | Szafa/2 drzwi, | [100x60x236 cm] | 800 |
| 2 | VILHATTEN | Szafa 2 drzwi i 2 szuflady | [Not given] | 899 |
| 3 | BRIMNES | Szafa/3 drzwi, | [117x190 cm] | 899 |
| 4 | PAX / GRIMO | Kombinacja szafy, | [150x60x201 cm] | 1410 |
| 5 | KLEPPSTAD | Szafa/2 drzwi, | [79x176 cm] | 399 |
| 6 | PAX / BERGSBO | Szafa, | [150x60x236 cm] | 2095 |
| 7 | RAKKESTAD | Szafa/3 drzwi, | [117x176 cm] | 699 |
| 8 | KLEPPSTAD | Szafa z drzwiami przesuwanymi, | [117x176 cm] | 649 |
| 9 | SMÅSTAD / PLATSA | Szafa, | [60x42x181 cm] | 880 |


### Data cleaning

In [52]:
ikea_df['dimension'] = ikea_df['dimension'].apply(lambda x: ''.join(x))

In [53]:
ikea_df.head(5)

|   | name1 | name2 | dimension | price |
|---|---|---|---|---|
| 0 | KLEPPSTAD | Szafa/3 drzwi, | 117x176 cm | 599 |
| 1 | PAX / BERGSBO | Szafa/2 drzwi, | 100x60x236 cm | 800 |
| 2 | VILHATTEN | Szafa 2 drzwi i 2 szuflady | Not given | 899 |
| 3 | BRIMNES | Szafa/3 drzwi, | 117x190 cm | 899 |
| 4 | PAX / GRIMO | Kombinacja szafy, | 150x60x201 cm | 1410 |


In [55]:
ikea_df['dimension2'] = ikea_df['dimension'].replace('Not given',np.nan)
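
The two cleaning steps above (unwrapping the one-element list and replacing the `'Not given'` placeholder) can also be chained. A minimal sketch on a small made-up sample follows; the `sample` frame is an assumption that only mirrors the shape of the scraped `dimension` column before cleaning.

```python
import numpy as np
import pandas as pd

# Small made-up sample in the same shape as the scraped `dimension` column
sample = pd.DataFrame({'dimension': [['117x176 cm'], ['Not given'], ['100x60x236 cm']]})

# .str[0] takes the single element out of each list;
# replace() then turns the placeholder into a proper missing value
sample['dimension2'] = sample['dimension'].str[0].replace('Not given', np.nan)
```

Using `np.nan` rather than a placeholder string lets pandas methods such as `isna()` and `dropna()` recognise the missing dimensions.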

In [60]:
ikea_df

|   | name1 | name2 | dimension | price | dimension2 |
|---|---|---|---|---|---|
| 0 | KLEPPSTAD | Szafa/3 drzwi, | 117x176 cm | 599 | 117x176 cm |
| 1 | PAX / BERGSBO | Szafa/2 drzwi, | 100x60x236 cm | 800 | 100x60x236 cm |
| 2 | VILHATTEN | Szafa 2 drzwi i 2 szuflady | Not given | 899 | NaN |
| 3 | BRIMNES | Szafa/3 drzwi, | 117x190 cm | 899 | 117x190 cm |
| 4 | PAX / GRIMO | Kombinacja szafy, | 150x60x201 cm | 1410 | 150x60x201 cm |
| ... | ... | ... | ... | ... | ... |
| 75 | STUK | Organizator na ubrania 7 półek, | 30x30x90 cm | 49 | 30x30x90 cm |
| 76 | SKUBB | Wisząca półka, 6 przegród, | 35x45x125 cm | 49 | 35x45x125 cm |
| 77 | STUK | Pojemnik na ubrania/pościel, | 55x51x18 cm | 29 | 55x51x18 cm |
| 78 | SKUBB | Pojemnik na ubrania/pościel, | 93x55x19 cm | 39 | 93x55x19 cm |


In [61]:
ikea_df = ikea_df.iloc[:,[0,1,3,4]]

In [62]:
ikea_df.head(10)

|   | name1 | name2 | price | dimension2 |
|---|---|---|---|---|
| 0 | KLEPPSTAD | Szafa/3 drzwi, | 599 | 117x176 cm |
| 1 | PAX / BERGSBO | Szafa/2 drzwi, | 800 | 100x60x236 cm |
| 2 | VILHATTEN | Szafa 2 drzwi i 2 szuflady | 899 | NaN |
| 3 | BRIMNES | Szafa/3 drzwi, | 899 | 117x190 cm |
| 4 | PAX / GRIMO | Kombinacja szafy, | 1410 | 150x60x201 cm |
| 5 | KLEPPSTAD | Szafa/2 drzwi, | 399 | 79x176 cm |
| 6 | PAX / BERGSBO | Szafa, | 2095 | 150x60x236 cm |
| 7 | RAKKESTAD | Szafa/3 drzwi, | 699 | 117x176 cm |
| 8 | KLEPPSTAD | Szafa z drzwiami przesuwanymi, | 649 | 117x176 cm |
| 9 | SMÅSTAD / PLATSA | Szafa, | 880 | 60x42x181 cm |


In [66]:
ikea_df['price'] = ikea_df['price'].astype('int')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ikea_df['price'] = ikea_df['price'].astype('int')
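
The warning above is pandas' `SettingWithCopyWarning`. It appears because `ikea_df` was produced by an `iloc` column selection, so pandas cannot tell whether the later assignment targets a view or a copy. A minimal sketch of one common way to avoid it follows: take an explicit copy at selection time. The `raw` frame is a made-up two-row sample, not the scraped data.

```python
import pandas as pd

# Made-up sample with the same columns as the notebook's frame before the selection
raw = pd.DataFrame({
    'name1': ['KLEPPSTAD', 'VILHATTEN'],
    'name2': ['Szafa/3 drzwi,', 'Szafa 2 drzwi i 2 szuflady'],
    'dimension': ['117x176 cm', 'Not given'],
    'price': ['599', '899'],
    'dimension2': ['117x176 cm', None],
})

# .copy() makes the selection an independent DataFrame,
# so the assignment below is unambiguous and raises no warning
subset = raw.iloc[:, [0, 1, 3, 4]].copy()
subset['price'] = subset['price'].astype('int')
```

The original `raw` frame is left untouched, and `subset['price']` now holds integers.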


In [71]:
ikea_df['name1'] = ikea_df['name1'].apply(lambda x: x.capitalize())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ikea_df['name1'] = ikea_df['name1'].apply(lambda x: x.capitalize())
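
Note that `str.capitalize` upper-cases only the first character of the whole string, so the second part of a compound name such as `PAX / BERGSBO` ends up in lower case. A minimal sketch comparing it with pandas' vectorised `str.title` on three names taken from the table follows. Whether title case is preferable here is a judgment call, not something the notebook states.

```python
import pandas as pd

names = pd.Series(['KLEPPSTAD', 'PAX / BERGSBO', 'SMÅSTAD / PLATSA'])

# Vectorised equivalent of apply(lambda x: x.capitalize())
capitalized = names.str.capitalize()

# Capitalizes every word instead of only the first one
titled = names.str.title()
```

`capitalized` yields `Pax / bergsbo`, whereas `titled` yields `Pax / Bergsbo`.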


### Final data frame

In [72]:
ikea_df.head(10)

|   | name1 | name2 | price | dimension2 |
|---|---|---|---|---|
| 0 | Kleppstad | Szafa/3 drzwi, | 599 | 117x176 cm |
| 1 | Pax / bergsbo | Szafa/2 drzwi, | 800 | 100x60x236 cm |
| 2 | Vilhatten | Szafa 2 drzwi i 2 szuflady | 899 | NaN |
| 3 | Brimnes | Szafa/3 drzwi, | 899 | 117x190 cm |
| 4 | Pax / grimo | Kombinacja szafy, | 1410 | 150x60x201 cm |
| 5 | Kleppstad | Szafa/2 drzwi, | 399 | 79x176 cm |
| 6 | Pax / bergsbo | Szafa, | 2095 | 150x60x236 cm |
| 7 | Rakkestad | Szafa/3 drzwi, | 699 | 117x176 cm |
| 8 | Kleppstad | Szafa z drzwiami przesuwanymi, | 649 | 117x176 cm |
| 9 | Småstad / platsa | Szafa, | 880 | 60x42x181 cm |
