# Webscraping: Selectors

## Environment setup

In [None]:
from google.colab import drive, files
import json
drive.mount('/mntDrive') 
path = "/mntDrive/My Drive/Colab Notebooks/"

## [SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)

__SelectorGadget makes it easier to select exactly the content you need from a website, and nothing else.__

1. Click on the SelectorGadget icon to activate it. It is located in the upper right corner.
2. Right after clicking it, a bar will appear in the bottom right corner of your Chrome window. You will also notice that, as you move the cursor, elements on the page get framed. Do not panic, this is normal!
![frame](https://drive.google.com/uc?id=1mPHO_cGkhMzwCQuHYGWIGBuuo7ruht9Q)
3. You will probably want to get multiple instances of the same type of content (e.g. pictures from the main page of telex.hu). This program will help you select what they have in common.
4. Rules for selection:
 - First, click an instance of the type of content you want. It turns green, and other elements the program considers similar turn yellow.
  ![example](https://drive.google.com/uc?id=1ZrrsFp8wMmquzzLkZZEDsAhBoAx_YhLf) <br></br>
 - As before, content of the same type is framed. If the selection includes something you want to exclude (e.g. the telex logo at the top or the tiny weather icon), click on it. From the second click onward, you can exclude anything. The program is smart enough to infer that if you do not want the telex logo, you probably do not want the weather icon either, so it removes that automatically.<br></br>
   ![example](https://drive.google.com/uc?id=1hRr8lMI1rnYoow8g1_BEWai-EY4pPEZ2)
 - In the bottom right corner, you will see the CSS selector (here `.article_title img`) that matches all the content you want. Pass it to `soup.select()` to get a list of the matching elements.
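
To see what passing such a selector string to `soup.select()` does, here is a minimal sketch on a small inline page. The HTML, the `soup_demo` variable, and the image paths are made up for illustration; they are not the real telex.hu markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example (not the real telex.hu page).
html = """
<header><img class="logo" src="/logo.png"></header>
<div class="article_title"><img src="/img/a.jpg"></div>
<div class="article_title"><img src="/img/b.jpg"></div>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# ".article_title img" means: every <img> nested inside an element
# whose class is "article_title". The logo is outside, so it is skipped.
matches = soup_demo.select(".article_title img")
sources = [img.get("src") for img in matches]
print(sources)
```

The logo image is left out because it does not sit inside an `.article_title` element, which is the same exclusion SelectorGadget works out from your clicks.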

Let's try it!

In [1]:
import requests
from bs4 import BeautifulSoup

In [None]:
url = "https://telex.hu"
response = requests.get(url)
response.status_code

200

In [None]:
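# Name the parser explicitly to avoid bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(response.content, "html.parser")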

So far, this is business as usual. Let's get the pictures!

In [None]:
image_list = []
for image in soup.select(".article_title img"): # select will always return a list
    image_list.append(url + image.get("src")) # prefix is needed
image_list

['https://telex.hu/uploads/img-cache/1/6/0/5/9/1605947418-temp-oeidna-20201121-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/5/8/1605887835-temp-dkfagb-20201120-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/5/7/1605711067-temp-bnlgij-20201118-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/5/9/1605955568-temp-iedjon-20201121-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/5/9/1605951455-temp-bafmic-20201121-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/5/5/1605560953-temp-mcnjmp-20201116-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/5/8/1605884305-temp-mfdbad-20201120-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/5/9/1605945955-temp-opnfgi-20201121-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/5/8/1605874235-temp-jippnl-20201120-600-400-90-zc.jpg',
 'https://telex.hu/uploads/img-cache/1/6/0/5/8/1605892687-temp-dehnol-20201120-1120-629-90-zc.jpg',
 'https:/
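
Prefixing the site URL by string concatenation works when every `src` is a root-relative path. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with paths that lack a leading slash and with `src` values that are already absolute. The `src` values and the `cdn.example.com` host below are hypothetical.

```python
from urllib.parse import urljoin

base_url = "https://telex.hu"

# Hypothetical src values covering the three common cases.
srcs = [
    "/uploads/a.jpg",                 # root-relative path
    "uploads/b.jpg",                  # relative path without leading slash
    "https://cdn.example.com/c.jpg",  # already absolute, left unchanged
]

absolute = [urljoin(base_url, src) for src in srcs]
for link in absolute:
    print(link)
```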

# Lab: Scrape DunaHouse properties
- Scrape real estate data from the **first 5** pages of the provided link
- Details should include:
  - Price
  - Hyperlink
  - Number of rooms
  - Area (m2)
  - Description

In [2]:
import pandas as pd

prices = []
descriptions = []
areas = []
rooms = []
links = []

for i in range(2, 10):  # result pages 2 through 9
  url = f"https://dh.hu/elado-ingatlan/lakas-haz/budapest/-/oldal-{i}"
  response = requests.get(url)
  print(response.status_code)  # 200 means the page loaded successfully
  soup = BeautifulSoup(response.content, "html.parser")

  # Collect each field page by page; pd.DataFrame below needs all five lists to have equal length
  prices.extend([x.get_text() for x in soup.select(".price")])
  links.extend(["https://dh.hu" + listing.find("a").get("href") for listing in soup.find_all("div", {"class": "moreDetailsBox"})])
  rooms.extend([x.get_text() for x in soup.select(".room .value")])
  areas.extend([x.get_text() for x in soup.select(".value b")])
  descriptions.extend([x.get_text() for x in soup.select("h2")])
  
data = {
    "Price": prices,
    "Description": descriptions,
    "Area": areas,
    "Rooms": rooms,
    "Link": links
}

df = pd.DataFrame(data)
df

200
200
200
200
200
200
200
200


| | Price | Description | Area | Rooms | Link |
|---|---|---|---|---|---|
| 0 | 26 500 000 Ft | Eladó Lakás, Budapest, 20 kerület, Pesterzsébe... | 53 | 2 szobás | https://dh.hu/ingatlan/LK527242/elado-lakas-bu... |
| 1 | 64 500 000 Ft | Eladó Lakás, Budapest, 4 kerület, Újépítésű, c... | 116 | 4 szobás | https://dh.hu/ingatlan/LK526603/elado-lakas-bu... |
| 2 | 29 900 000 Ft | Eladó Lakás, Budapest, 14 kerület, Zöldre néző... | 43 | 1 szobás | https://dh.hu/ingatlan/LK526595/elado-lakas-bu... |
| 3 | 49 000 000 Ft | Eladó Lakás, Budapest, 23 kerület, Soroksár-Új... | 83 | 5 szobás | https://dh.hu/ingatlan/H420378/elado-lakas-bud... |
| 4 | 13 500 000 Ft | Eladó Lakás, Budapest, 14 kerület, Balázs utca | 12 | 1 szobás | https://dh.hu/ingatlan/LK526772/elado-lakas-bu... |
| ... | ... | ... | ... | ... | ... |
| 115 | 46 900 000 Ft | Eladó Lakás, Budapest, 13 kerület, Angyalföld | 52 | 2 szobás | https://dh.hu/ingatlan/H420033/elado-lakas-bud... |
| 116 | 35 900 000 Ft | Eladó Lakás, Budapest, 13 kerület, Angyalföld | 42 | 2 szobás | https://dh.hu/ingatlan/H420122/elado-lakas-bud... |
| 117 | 53 900 000 Ft | Eladó Lakás, Budapest, 11 kerület, Belbuda | 60 | 2 szobás | https://dh.hu/ingatlan/H420110/elado-lakas-bud... |
| 118 | 27 500 000 Ft | Eladó Lakás, Budapest, 10 kerület, Téglagyárdűlő | 48 | 2 szobás | https://dh.hu/ingatlan/H420126/elado-lakas-bud... |
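
The `Price` and `Rooms` values above are still strings such as `26 500 000 Ft` and `2 szobás`. If you want to sort or compute with them, one possible cleanup step is sketched below. The helper names `parse_price` and `parse_rooms` are hypothetical, and the sketch assumes every value follows the format seen in the output above.

```python
import re

def parse_price(text):
    """Turn a string like '26 500 000 Ft' into the integer 26500000."""
    digits = re.sub(r"\D", "", text)  # drop spaces, 'Ft' and any other non-digits
    return int(digits) if digits else None

def parse_rooms(text):
    """Turn a string like '2 szobás' into the integer 2."""
    match = re.search(r"\d+", text)  # first run of digits in the string
    return int(match.group()) if match else None

print(parse_price("26 500 000 Ft"))
print(parse_rooms("2 szobás"))
```

With the DataFrame from the lab, these could be applied column by column, for example `df["Price"].map(parse_price)`.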
