# Webscraping: Selectors

## Environment setup

In [None]:
from google.colab import drive, files
import json
drive.mount('/mntDrive') 
path = "/mntDrive/My Drive/Colab Notebooks/"

Mounted at /mntDrive


## [SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)

__SelectorGadget makes it easier to select exactly the content you need from a website, and nothing else.__

1. Click on the SelectorGadget icon to activate it. It is located in the upper right corner.
2. Right after you click it, a bar appears in the bottom right corner of your Chrome window. As you move the cursor, the elements under it get framed. Do not panic, this is normal!
![frame](https://drive.google.com/uc?id=1mPHO_cGkhMzwCQuHYGWIGBuuo7ruht9Q)
3. You will probably want to get multiple instances of the same type of content (e.g. pictures from the main page of telex.hu). This program will help you select what they have in common.
4. Rules for selection:
 - First, click an instance of the type of content you want. It turns green, and everything the program considers similar turns yellow.
  ![example](https://drive.google.com/uc?id=1ZrrsFp8wMmquzzLkZZEDsAhBoAx_YhLf) <br></br>
 - Similar content is framed as well. If the selection includes something you want to exclude (e.g. the telex logo at the top or the tiny weather icon), click on it. From the second click onward, you can exclude anything. The program infers that if you did not want the telex logo, you probably do not want the weather icon either, so it removes that automatically.<br></br>
   ![example](https://drive.google.com/uc?id=1hRr8lMI1rnYoow8g1_BEWai-EY4pPEZ2)
 - In the bottom right corner, you will see the selector (`.article_title img`) that matches all the content you want. Pass it to `soup.select()` to get a list of the matching elements.
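
To see what such a selector matches before pointing it at a live page, here is a minimal sketch on a made-up HTML snippet. The markup and the image paths are hypothetical and only mimic the situation described above (a logo to exclude, article images to keep); the real page looks different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for illustration; the real page differs.
html = """
<div class="logo"><img src="/static/logo.png"></div>
<div class="article_title"><img src="/uploads/a.jpg"></div>
<div class="article_title"><img src="/uploads/b.jpg"></div>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# ".article_title img" matches every <img> nested inside an element
# whose class is "article_title", so the logo image is left out.
matches = demo_soup.select(".article_title img")
demo_sources = [img.get("src") for img in matches]
print(demo_sources)  # ['/uploads/a.jpg', '/uploads/b.jpg']
```

The space in the selector means "descendant of", so the image does not have to be a direct child of the `.article_title` element.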

Let's try it!

In [32]:
import requests
from bs4 import BeautifulSoup

In [33]:
url = "https://telex.hu"
response = requests.get(url)
response.status_code

200

In [34]:
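soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly so the result does not depend on which parsers are installed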

In [38]:
teljes_szoveg = soup.get_text()  # "teljes szöveg" is Hungarian for "full text"
"pulyka" in teljes_szoveg  # does the word "pulyka" ("turkey") appear on the page?

True

So far, this is business as usual. Let's get the pictures!

In [None]:
soup.select(".article_title img")[0].get("src")

'/uploads/img-cache/1/6/0/6/5/1606552184-temp-kpadie-20201128-600-400-90-zc.jpg'

In [None]:
image_list = []
for image in soup.select(".article_title img"): # select will always return a list
    image_list.append(url + image.get("src")) # prefix is needed
image_list

['https://telex.hu/uploads/img-cache/1/6/0/6/5/1606552184-temp-kpadie-20201128-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/6/3/1606305795-temp-mkempb-20201125-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/6/2/1606231842-temp-pglafo-20201124-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/5/7/1605704578-temp-meggfk-20201118-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/6/5/1606560050-temp-jianai-20201128-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/6/5/1606560406-temp-mcpjpo-20201128-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/6/5/1606555189-temp-nogacg-20201128-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/6/4/1606467235-temp-akonmc-20201127-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/6/5/1606515409-temp-icijbm-20201127-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/6/3/1606396520-temp-jkfnki-20201126-1120-629-90-zc.jpg',
 'https:/
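
Prefixing the `src` with the page URL works here because every `src` is a root-relative path. As a slightly more defensive sketch, the standard library's `urljoin` handles both relative and already absolute addresses. The helper name `absolute_url` and the `example.com` address are made up for illustration.

```python
from urllib.parse import urljoin

base_url = "https://telex.hu"


def absolute_url(src, base=base_url):
    """Resolve a src attribute against the page URL (hypothetical helper)."""
    return urljoin(base, src)


# A root-relative path, as returned by image.get("src") above.
relative = "/uploads/img-cache/1/6/0/6/5/1606552184-temp-kpadie-20201128-600-400-90-zc.jpg"
print(absolute_url(relative))

# An address that is already absolute is returned unchanged.
print(absolute_url("https://example.com/pic.jpg"))
```

With plain string concatenation, the second case would produce a broken address, which is why `urljoin` is often the safer choice.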

# Lab: Scrape DunaHouse properties
- Scrape real estate data from the **first 5** pages of the provided link
- Details should include:
  - Price
  - Hyperlink
  - Number of rooms
  - Area (m2)
  - Description

In [None]:
url = "https://dh.hu/elado-ingatlan/lakas-haz/budapest/-/oldal-2"

response = requests.get(url)
print(response.status_code)

soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly

200
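
Since the lab covers several result pages, one possible way to build the list of page addresses is sketched below. It assumes that the pages differ only in the trailing `oldal-N` part ("oldal" is Hungarian for "page"), as the `oldal-2` address above suggests; this has not been checked against the site, so verify it in the browser first.

```python
# Assumption: result pages differ only in the trailing "oldal-N" part,
# as suggested by the "oldal-2" address used in this lab.
base = "https://dh.hu/elado-ingatlan/lakas-haz/budapest/-/oldal-{}"

# Pages 1 to 5 (range excludes its upper bound).
page_urls = [base.format(n) for n in range(1, 6)]

for page_url in page_urls:
    print(page_url)
```

Each address can then be fetched with `requests.get` and parsed in a loop, exactly as the single page above.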


In [None]:
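# General pattern: loop over the selected elements and keep one value from each.
prices = []  # avoid naming a variable `list`: it would shadow the built-in type
for item in soup.select(".price"):
    prices.append(item.get_text())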


In [None]:
[x.get_text() for x in soup.select(".price")]

['26 500 000 Ft',
 '64 500 000 Ft',
 '29 900 000 Ft',
 '49 000 000 Ft',
 '13 500 000 Ft',
 '79 500 000 Ft',
 '23 000 000 Ft',
 '119 800 000 Ft',
 '49 900 000 Ft',
 '65 900 000 Ft',
 '21 000 000 Ft',
 '17 500 000 Ft',
 '30 900 000 Ft',
 '89 900 000 Ft',
 '147 900 000 Ft']
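
The prices come back as text, which cannot be sorted or averaged directly. A minimal sketch of converting them to numbers follows; the helper name `parse_price` is made up, and the sample values are copied from the output above.

```python
import re


def parse_price(text):
    """Turn a string such as '26 500 000 Ft' into an integer (hypothetical helper)."""
    digits = re.sub(r"\D", "", text)  # drop every non-digit: spaces and "Ft"
    return int(digits)


sample_prices = ['26 500 000 Ft', '64 500 000 Ft', '119 800 000 Ft']
print([parse_price(p) for p in sample_prices])  # [26500000, 64500000, 119800000]
```

Removing every non-digit character also copes with non-breaking spaces, which websites often use as thousands separators.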

In [None]:
[x.get_text() for x in soup.select("h2")]

['Eladó Lakás, Budapest, 20 kerület, Pesterzsébet, Tátra tér',
 'Eladó Lakás, Budapest, 4 kerület, Újépítésű, csendes',
 'Eladó Lakás, Budapest, 14 kerület, Zöldre néző 43 nm-es lakás Fűrész utcában',
 'Eladó Lakás, Budapest, 23 kerület, Soroksár-Újtelep',
 'Eladó Lakás, Budapest, 14 kerület, Balázs utca',
 'Eladó Lakás, Budapest, 13 kerület, Újlipótváros',
 'Eladó Lakás, Budapest, 20 kerület, Baross utca',
 'Eladó Lakás, Budapest, 9 kerület, Ráday utcában nagypolgári erkélyes lakás, garázs',
 'Eladó Lakás, Budapest, 14 kerület, Deés utca',
 'Eladó Lakás, Budapest, 13 kerület, PRESTIGE TOWERS - DUNAI PANORÁMA A TERASZRÓL',
 'Eladó Lakás, Budapest, 15 kerület, Újpalotai lakótelepen 32 nm-es, beépített erkélyes',
 'Eladó Lakás, Budapest, 3 kerület, Kis rezsivel felújítandó lakás Békásmegyeren eladó',
 'Eladó Lakás, Budapest, 14 kerület, Rákospatak közelében, parkos környezetben',
 'Eladó Ház, Budapest, 3 kerület, Ürömi út',
 'Eladó Lakás, Budapest, 3 kerület, Újlak']

In [None]:
[x.get_text() for x in soup.select(".room .value")]

['2 szobás',
 '4 szobás',
 '1 szobás',
 '5 szobás',
 '1 szobás',
 '2 szobás',
 '1 szobás',
 '3 szobás',
 '2 szobás',
 '2 szobás',
 '1 szobás',
 '1 szobás',
 '2 szobás',
 '6 szobás',
 '7 szobás']
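
The three lists above line up item by item, so one way to combine them into one record per property is `zip`. The sketch below uses the first three values of each output and assumes the lists always have equal length; if a listing lacked a price or a room count, the lists would drift out of step, and selecting inside each listing's own container would be more robust. The dictionary keys are made up for illustration.

```python
import re

# First three values of each output above.
prices = ['26 500 000 Ft', '64 500 000 Ft', '29 900 000 Ft']
titles = [
    'Eladó Lakás, Budapest, 20 kerület, Pesterzsébet, Tátra tér',
    'Eladó Lakás, Budapest, 4 kerület, Újépítésű, csendes',
    'Eladó Lakás, Budapest, 14 kerület, Zöldre néző 43 nm-es lakás Fűrész utcában',
]
rooms = ['2 szobás', '4 szobás', '1 szobás']

# zip silently truncates to the shortest list, so check the lengths first.
if not (len(prices) == len(titles) == len(rooms)):
    raise ValueError("the lists are not aligned")

records = [
    {
        "title": title,
        "price": price,
        # "2 szobás" means "2-room"; keep only the leading number.
        "rooms": int(re.search(r"\d+", room_text).group()),
    }
    for title, price, room_text in zip(titles, prices, rooms)
]

for record in records:
    print(record)
```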