# Exercises for Session 7: Web Scraping 2

In session 6 you learned how to download the HTML of a webpage. In this session you will learn how to locate the information you want within that HTML. This requires an understanding of how HTML is structured, as well as methods for navigating that structure. In the exercises below you will mainly use the package `BeautifulSoup` to navigate the HTML (read more about BeautifulSoup in the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)).

# Part 1: Parsing a table from HTML using BeautifulSoup.

In this exercise you will learn how to extract the information you want from a webpage's HTML. `BeautifulSoup` is a useful package in Python that makes it easy to navigate the HTML and find the information you are looking for. 

The purpose of the exercise is to extract the data that are available on this webpage: https://www.basketball-reference.com/leagues/NBA_2018.html

Before working on the exercise, you should watch the two videos (7.1 and 7.2) below. The type of data you will scrape in the exercise is quite different from what you see in the videos: in the videos we scrape text from articles, whereas in the exercise you will scrape tables and turn them into pandas DataFrames.
Keep in mind that the principles are identical: you need to locate the information in the HTML and then convert it into meaningful data (for example a text file or a DataFrame).
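
To make the "locate, then convert" principle concrete, here is a minimal sketch that finds a table in a piece of HTML and turns it into a pandas DataFrame. The HTML snippet, the `id` value `standings`, and the team names are made up for illustration; a real page will have its own structure, which you need to inspect first. The built-in `html.parser` is used here so the example runs without extra parser packages.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page (not taken from a real site)
html = """
<table id="standings">
  <thead><tr><th>Team</th><th>W</th><th>L</th></tr></thead>
  <tbody>
    <tr><td>Team A</td><td>10</td><td>2</td></tr>
    <tr><td>Team B</td><td>7</td><td>5</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Step 1: locate the table in the HTML
table = soup.find('table', id='standings')

# Step 2: read the column names from the header cells
columns = [th.text for th in table.find('thead').find_all('th')]

# Step 3: read each body row as a list of cell texts
rows = [[td.text for td in tr.find_all('td')]
        for tr in table.find('tbody').find_all('tr')]

# Step 4: convert the rows to a DataFrame
standings = pd.DataFrame(rows, columns=columns)
```

All values come out as text at this stage; converting them to numbers is a separate step.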

(I might talk a bit slowly in some of the videos. Remember that you can turn up the playback speed on YouTube.)

In [60]:
# Import the packages needed to download and parse the pages
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

# Small sample of listing URLs; it is overwritten below by the full list read from file
url_list = ['https://danbolig.dk/bolig/aarhus/8210/villa/2820000588-282/',
'https://danbolig.dk/bolig/odense/5260/villa/2590001318-259/',
'https://danbolig.dk/bolig/kerteminde/5300/villa/2770000472-277/',
'https://danbolig.dk/bolig/gentofte/2930/villa/0360000314-036/',
'https://danbolig.dk/bolig/rudersdal/3460/lejlighed/0350000728-035/',
'https://danbolig.dk/bolig/fredensborg/3050/villa/0970000912-097/',
'https://danbolig.dk/bolig/esbjerg/6700/villa/2810001073-281/',
'https://danbolig.dk/bolig/aarhus/8250/raekkehus/0370000700-037/',
'https://danbolig.dk/bolig/langeland/5953/villa/2610000349-261/',
'https://danbolig.dk/bolig/rudersdal/3460/villa/0350000708-035/',
'https://danbolig.dk/bolig/roedovre/2610/villa/0590000955-059/',
'https://danbolig.dk/bolig/gentofte/2820/villa/2420000238-242/',
'https://danbolig.dk/bolig/vordingborg/4750/villa/2040000623-204/',
'https://danbolig.dk/bolig/aarhus/8355/villa/0310000392-031/',
'https://danbolig.dk/bolig/rudersdal/3460/villa/0350000686-035/',
'https://danbolig.dk/bolig/gentofte/2820/villa/2420000258-242/',
'https://danbolig.dk/bolig/roskilde/4000/villa/0760000576-076/',
'https://danbolig.dk/bolig/gribskov/3200/villa/1600000365-160/',
'https://danbolig.dk/bolig/kalundborg/4400/villa/2470000587-247/',
'https://danbolig.dk/bolig/aarhus/8250/raekkehus/0370000690-037/',
'https://danbolig.dk/bolig/skive/7860/fritidsbolig/2800000290-280/',
'https://danbolig.dk/bolig/hjoerring/9800/villa/1910000512-191/',
'https://danbolig.dk/bolig/koege/4600/villa/0580000513-058/',
'https://danbolig.dk/bolig/bornholm/3700/villa/0410000575-041/',
            ]


# Read the full list of links from file, one URL per line
with open('Fuller-list-of-links.txt') as f:
    lines = f.readlines()

# Keep the first 200 URLs and strip the trailing newline from each
url_list = [url.strip() for url in lines[:200]]

In [61]:
table_large = []
table_left = []
table_address = []
top_info_table = []

# Identify ourselves to the server in the request headers
headers = {'name' : 'Martin Skafte Andersen', 'email': 'tgx333@alumni.ku.dk', 'institution': 'University of Copenhagen'}

for url in url_list:
    # Send a request to the URL and parse the content of the webpage
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'lxml')

    # Find all tables once: the first is the large main table,
    # the second is the small table to the left
    tables = soup.find_all('table')
    table_large.append(tables[0].find('tbody'))
    table_left.append(tables[1].find('tbody'))

    # Access the info block at the top of the page (address, price and other stats)
    upper_info_table = soup.find_all('div', class_='o-propertyHero__info o-propertyHero__info--desktop')[0]

    # Find the address (the <h1> heading) and add it to the list of addresses
    table_address.append(upper_info_table.find('h1'))

    # Find house price and other stats and add them to the list of top info tables
    top_info_table.append(upper_info_table.find_all('ul', class_='o-propertyHero__facts')[0])
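
The loop above indexes the result of `soup.find_all('table')` with `[0]` and `[1]` directly, which raises an `IndexError` if a listing page has fewer than two tables. The following is a minimal sketch of one way to guard against that, not a required part of the solution. The helper name `find_tbodies`, its `expected` parameter, and the two HTML strings are hypothetical, and `html.parser` is used so the example runs without `lxml`.

```python
from bs4 import BeautifulSoup

def find_tbodies(html, expected=2):
    """Return the <tbody> of the first `expected` tables, or None if the page has fewer tables."""
    soup = BeautifulSoup(html, 'html.parser')
    tables = soup.find_all('table')
    if len(tables) < expected:
        return None
    return [table.find('tbody') for table in tables[:expected]]

# Hypothetical pages: one with two tables, one with only a single table
complete_page = (
    "<table><tbody><tr><td>Type</td><td>Villa</td></tr></tbody></table>"
    "<table><tbody><tr><td>Rum</td><td>4</td></tr></tbody></table>"
)
incomplete_page = "<table><tbody><tr><td>Type</td><td>Villa</td></tr></tbody></table>"

complete_result = find_tbodies(complete_page)      # list with two <tbody> elements
incomplete_result = find_tbodies(incomplete_page)  # None, so the page can be skipped
```

In a scraping loop, a `None` result could be used to skip the URL (and perhaps log it) instead of stopping the whole run.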


In [62]:

house_data = []

for large, top, address in zip(table_large, top_info_table, table_address):

    # Initialize an empty dictionary to store the name-value pairs for this house
    main_table_dict = {}

    # Extract data from the main table; the cells alternate between name and value
    main_table = large.find_all('td')
    # Stop at len - 1 so a trailing cell without a partner does not raise an IndexError
    for i in range(0, len(main_table) - 1, 2):
        name = main_table[i].text
        value = main_table[i + 1].text
        # Remove the units 'm²' and 'kr.' from the value
        value = re.sub(r'\s*(m²|kr\.)', '', value)
        # Save the name-value pair in the dictionary
        main_table_dict[name] = value

    # Extract data from the facts list at the top; the spans alternate between name and value
    top_table = top.find_all('span')
    for i in range(0, len(top_table) - 1, 2):
        name = top_table[i].text
        value = top_table[i + 1].text
        value = re.sub(r'\s*(m²|kr\.)', '', value)
        # Exclude 'Energimærke' from being added to the dictionary; save all other pairs
        if name != 'Energimærke':
            main_table_dict[name] = value.strip()

    # Add the address, collapsing runs of whitespace into single spaces
    main_table_dict['address'] = ' '.join(address.text.split())

    house_data.append(main_table_dict)

df = pd.DataFrame(house_data)
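
Every value collected above is still a string, so columns such as `Boligareal` or `Købspris` cannot be used in calculations yet. Below is a minimal sketch of one way to convert such columns to numbers. The sample values are invented, and the sketch assumes that prices use `.` as a thousands separator and contain no decimals; check the actual scraped values before relying on that. The helper name `to_number` is hypothetical.

```python
import pandas as pd

# Hypothetical scraped values: all text, prices assumed to use '.' as thousands separator
sample = pd.DataFrame({
    'Boligareal': ['170', '135'],
    'Købspris': ['2.995.000', '1.450.000'],
})

def to_number(column):
    """Remove thousands separators and convert a text column to numbers."""
    cleaned = column.str.replace('.', '', regex=False).str.strip()
    # errors='coerce' turns anything that still is not a number into NaN
    return pd.to_numeric(cleaned, errors='coerce')

for name in ['Boligareal', 'Købspris']:
    sample[name] = to_number(sample[name])
```

Using `errors='coerce'` keeps the conversion from failing on unexpected text, at the cost of silently producing missing values, so it is worth counting the NaNs afterwards.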



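| Unnamed: 0 | Type | Udbudsform | Energimærke | Varmekilde | Byggeår | Rum | Bad | Toilet | Plan | Boligareal | ... | Købspris | Teknisk Pris | Ydelse | Drivhus | Stald til kvæg, får mv. | Maskinhus, garage mv. | Lade til foder, afgrøder mv. | pulterrum | Fritliggende udestue | Væksthus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Villa | Salg | D | Fjernvarme | 1972 | 4 | 1 | 2 | 3 | 170 | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | Villa | Salg | C | Fjernvarme/blokvarme | 1974 | 4 | 1 | 1 | 1 | 135 | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | Villa | Salg | C | Fjernvarme | 2006 | 7 | 2 | 2 | 1 | 178 | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | Villa | Salg | E | Centralvarme med én fyringsenhed | 1930 | 7 | 2 | 2 | 1 | 234 | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | Lejlighed | Leje | E | Centralvarme med én fyringsenhed | 1923 | 5 | 1 | 1 | 1 | 114 | ... |  |  |  |  |  |  |  |  |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | Villa | Salg | A2010 | Varmepumpe | 2008 | 4 | 2 | 2 | 1 | 122 | ... |  |  |  |  |  |  |  |  |  |  |
| 196 | Villa / Fritidsbolig | Salg | G | Oliefyr/Brændeovn | 1920 | 3 | 1 | 1 | 2 | 124 | ... |  |  |  |  |  |  |  |  |  |  |
| 197 | Villa | Salg | C | Fjernvarme | 1967 | 5 | 1 | 2 | 1 | 177 | ... |  |  |  |  |  |  |  |  |  |  |
| 198 | Villa | Salg | E | Fjernvarme | 1972 | 8 | 2 | 2 | 2 | 219 | ... |  |  |  |  |  |  |  |  |  |  |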
