# Predicting house selling prices in Denmark



## Initial overview of steps:
* Guiding research question(s)
* Scrape real estate agency websites (gathering)
* Load data and organize in tidy format (wrangling)
* Deal with data issues (wrangling)
* Exploratory analysis
* Focussed questions
* Explanatory analysis
* Prediction models

## Questions
* How can we predict home prices?


* Is it possible to predict listing prices based on characteristics of the home?
* If so, what features are most important?
* Which ones don't matter at all?

# Notes 
The CRISP-DM process (Cross-Industry Standard Process for Data Mining)
The lessons leading up to the first project are about helping you go through CRISP-DM in practice from start to finish. Even when we get into the weeds of coding, try to take a step back and realize what part of the process you are in, and make sure you remember the question you are trying to answer and what a solution to that question looks like.

1. Business Understanding

2. Data Understanding

3. Prepare Data

4. Data Modeling

5. Evaluate the Results

6. Deploy

In [108]:
# Importing libraries
import pandas as pd
import requests
import bs4
import re
import time

Browsing Home, the largest real estate company in Denmark, and playing around with the developer tools, I managed to find an HTTP call that seems to return the listing data.

In [10]:
# Using Home, the biggest real estate company in Denmark
#url = 'https://home.dk/umbraco/backoffice/home-api/SEARCH?CurrentPageNumber=0&SearchResultsPerPage=10&q=2200%20K%C3%B8benhavn%20N&Energimaerker=null&SearchType=0&_=1571481546474'
url = 'https://home.dk/umbraco/backoffice/home-api/SEARCH?CurrentPageNumber=0&SearchResultsPerPage=10&q=2200&Energimaerker=null&SearchType=0&_=1571481546474'
response = requests.get(url)
# Saving response to a dictionary
featuresDict = response.json()
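
The query string above can also be assembled from a dictionary instead of being typed by hand. The sketch below uses only the standard library. `build_search_url` is a hypothetical helper, the parameter names are simply the ones visible in the URL above, and the trailing `_` value is left out on the assumption that the server does not require it.

```python
from urllib.parse import urlencode

# Endpoint copied from the request URL used in this notebook.
SEARCH_ENDPOINT = "https://home.dk/umbraco/backoffice/home-api/SEARCH"


def build_search_url(query, page=0, per_page=10):
    """Assemble a search URL from named parameters (hypothetical helper)."""
    params = {
        "CurrentPageNumber": page,
        "SearchResultsPerPage": per_page,
        "q": query,
        "Energimaerker": "null",
        "SearchType": 0,
    }
    # urlencode escapes spaces and non-ASCII characters in the values.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = build_search_url("2200")
print(url)
```

`requests.get` also accepts a `params` dictionary and performs the same kind of encoding, which may be the simpler option when `requests` is already in use.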

In [12]:
# Checking out the data
featuresDict

{'redirectUrl': None,
 'inputModel': {'SortType': None,
  'SortOrder': None,
  'CurrentPageNumber': 0,
  'SearchResultsPerPage': 10,
  'q': '2200',
  'EjendomstypeV1': None,
  'EjendomstypeRH': None,
  'EjendomstypeEL': None,
  'EjendomstypeVL': None,
  'EjendomstypeAA': None,
  'EjendomstypePL': None,
  'EjendomstypeFH': None,
  'EjendomstypeLO': None,
  'EjendomstypeHG': None,
  'EjendomstypeFG': None,
  'EjendomstypeNL': None,
  'Forretningnr': None,
  'ProjectNodeId': None,
  'OnlyBrokerHome': None,
  'PriceMin': None,
  'PriceMax': None,
  'EjerudgiftPrMdrMin': None,
  'EjerudgiftPrMdrMax': None,
  'BoligydelsePrMdrMin': None,
  'BoligydelsePrMdrMax': None,
  'BoligstoerrelseMin': None,
  'BoligstoerrelseMax': None,
  'GrundstoerrelseMin': None,
  'GrundstoerrelseMax': None,
  'VaerelserMin': None,
  'VaerelserMax': None,
  'Energimaerker': ['null'],
  'ByggaarMin': None,
  'ByggaarMax': None,
  'EtageMin': None,
  'EtageMax': None,
  'PlanMin': None,
  'PlanMax': None,
  'Aabenth

What we want to extract seems to be within the searchResults key:

In [13]:
featuresDict['searchResults']

[{'sagsnummer': '1050000139',
  'lng': 12.5457172243703,
  'lat': 55.6924852361034,
  'fokusbolig': False,
  'showNewPrice': False,
  'isNew': True,
  'adresse': 'Bjelkes Allé 6B, st..',
  'postal': 2200,
  'city': 'København N',
  'price': '2.095.000 ',
  'ejendomstypePrimaerNicename': 'Ejerlejlighed',
  'pictures': [{'PicId': 2993530,
    'CaseId': 10397003,
    'CaseNumber': '1050000139',
    'MediaType': 'b',
    'MaxWidth': 3000,
    'MaxHeight': 2000,
    'URL': 'https://home.mindworking.eu/resources/shops/105/cases/1050000139/casemedia/images/7687715b8b7896b4ff855797e16a8061/customsize.jpg?deviceId=jd83hsdf3',
    'Position': 0,
    'Description': 'Stue',
    'GUID': '7687715b-8b78-96b4-ff85-5797e16a8061',
    'refGUID': '00000000-0000-0000-0000-000000000000',
    'IsVertical': False,
    'IsHorizontal': True},
   {'PicId': 2993537,
    'CaseId': 10397003,
    'CaseNumber': '1050000139',
    'MediaType': 'b',
    'MaxWidth': 3000,
    'MaxHeight': 2000,
    'URL': 'https://home.

Great! This is the data we're interested in. However, the pictures key contains a list of information we don't need, and it would ruin the granularity if we converted the data to a pandas DataFrame, so let's drop it.

In [14]:
# dropping the pictures key from the list of dictionaries
features = featuresDict['searchResults']
for f in features:
    del f['pictures']
features

[{'sagsnummer': '1050000139',
  'lng': 12.5457172243703,
  'lat': 55.6924852361034,
  'fokusbolig': False,
  'showNewPrice': False,
  'isNew': True,
  'adresse': 'Bjelkes Allé 6B, st..',
  'postal': 2200,
  'city': 'København N',
  'price': '2.095.000 ',
  'ejendomstypePrimaerNicename': 'Ejerlejlighed',
  'floorPlan': {'PicId': 2993542,
   'CaseId': 10397003,
   'CaseNumber': '1050000139',
   'MediaType': 'p',
   'MaxWidth': 3000,
   'MaxHeight': 2000,
   'URL': 'https://home.mindworking.eu/resources/shops/105/cases/1050000139/casemedia/images/2f0b1e7e3e1981c99f5d514ebf3f9869/customsize.jpg?deviceId=jd83hsdf3',
   'Position': 0,
   'Description': 'Plantegning',
   'GUID': '2f0b1e7e-3e19-81c9-9f5d-514ebf3f9869',
   'refGUID': '00000000-0000-0000-0000-000000000000',
   'IsVertical': False,
   'IsHorizontal': True},
  'boligOrGrundAreal': 54,
  'andenmaegler': False,
  'boligurl': 'https://home.dk/boligkatalog/koebenhavn/2200/ejerlejligheder/bjelkes_alle_6b_st_1050000139.aspx',
  'billede
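
Note that `del f['pictures']` changes the dictionaries inside `featuresDict` in place, so running the cell a second time raises a `KeyError`. A minimal sketch of a re-runnable variant follows. `drop_keys` is a hypothetical helper and the sample records are made up for illustration.

```python
def drop_keys(records, unwanted):
    """Return copies of the records without the unwanted keys."""
    unwanted = set(unwanted)
    return [
        {key: value for key, value in record.items() if key not in unwanted}
        for record in records
    ]


# Made-up records shaped like the entries in 'searchResults'.
sample = [
    {"sagsnummer": "A1", "postal": 2200, "pictures": [{"PicId": 1}]},
    {"sagsnummer": "A2", "postal": 2200},  # no 'pictures' key at all
]

features = drop_keys(sample, ["pictures"])
print(features)
```

Because the originals are left untouched, the same call can be repeated safely, and records that never had the key are handled without an error.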

The data seems ready to be loaded into a pandas DataFrame.

In [15]:
df = pd.DataFrame(features)
df.head()

Unnamed: 0,aabenthusNicename,aabenthusShowRegistration,adresse,andenmaegler,billedeUrl,boligKanLejes,boligOrGrundAreal,boligurl,city,ejendomstypePrimaerNicename,...,lejePerMaaned,lng,openHouseEndDate,openHouseStartDate,overskrift2,postal,price,sagsnummer,showNewPrice,solgtBolig
0,27.10 kl. 12.00-12.30,False,"Bjelkes Allé 6B, st..",False,https://home.mindworking.eu/resources/shops/10...,0,54,https://home.dk/boligkatalog/koebenhavn/2200/e...,København N,Ejerlejlighed,...,,12.545717,2019-10-27T12:30,2019-10-27T12:00,,2200,2.095.000,1050000139,False,False
1,27.10 kl. 14.30-14.50,False,"Poppelgade 4, 1. th.",False,https://home.mindworking.eu/resources/shops/10...,0,105,https://home.dk/boligkatalog/koebenhavn/2200/a...,København N,Andelsbolig,...,,12.559357,2019-10-27T14:50,2019-10-27T14:30,Beliggende i baghuset,2200,1.799.000,1050000162,False,False
2,27.10 kl. 13.30-13.50,False,"Husumgade 20, 2. th.",False,https://home.mindworking.eu/resources/shops/10...,0,53,https://home.dk/boligkatalog/koebenhavn/2200/e...,København N,Ejerlejlighed,...,,12.5454,2019-10-27T13:50,2019-10-27T13:30,Et super godt køb!,2200,2.399.000,1050000164,False,False
3,27.10 kl. 13.30-13.50,False,"Egegade 2, 1. th.",False,https://home.mindworking.eu/resources/shops/10...,0,78,https://home.dk/boligkatalog/koebenhavn/2200/e...,København N,Ejerlejlighed,...,,12.559457,2019-10-27T13:50,2019-10-27T13:30,Med altan og stort badeværelse,2200,3.999.000,1050000167,False,False
4,27.10 kl. 11.00-11.20,False,"Fredensborggade 2, 1. th.",False,https://home.mindworking.eu/resources/shops/10...,0,56,https://home.dk/boligkatalog/koebenhavn/2200/e...,København N,Ejerlejlighed,...,,12.53988,2019-10-27T11:20,2019-10-27T11:00,Super beliggenhed på Nørrebro,2200,2.199.000,1050000137,False,False


Let's remove columns that are not of interest.

In [16]:
df.drop(inplace = True, columns=[
    'billedeUrl','lejePerMaaned','showNewPrice',
    'aabenthusNicename','floorPlan','erSolgtOgLejebolig',
    'boligKanLejes','aabenthusShowRegistration', 
    'solgtBolig','isLejebolig','fokusbolig'
])

In [17]:
df.head()

Unnamed: 0,adresse,andenmaegler,boligOrGrundAreal,boligurl,city,ejendomstypePrimaerNicename,isNew,lat,lng,openHouseEndDate,openHouseStartDate,overskrift2,postal,price,sagsnummer
0,"Bjelkes Allé 6B, st..",False,54,https://home.dk/boligkatalog/koebenhavn/2200/e...,København N,Ejerlejlighed,True,55.692485,12.545717,2019-10-27T12:30,2019-10-27T12:00,,2200,2.095.000,1050000139
1,"Poppelgade 4, 1. th.",False,105,https://home.dk/boligkatalog/koebenhavn/2200/a...,København N,Andelsbolig,True,55.692049,12.559357,2019-10-27T14:50,2019-10-27T14:30,Beliggende i baghuset,2200,1.799.000,1050000162
2,"Husumgade 20, 2. th.",False,53,https://home.dk/boligkatalog/koebenhavn/2200/e...,København N,Ejerlejlighed,True,55.693495,12.5454,2019-10-27T13:50,2019-10-27T13:30,Et super godt køb!,2200,2.399.000,1050000164
3,"Egegade 2, 1. th.",False,78,https://home.dk/boligkatalog/koebenhavn/2200/e...,København N,Ejerlejlighed,True,55.690345,12.559457,2019-10-27T13:50,2019-10-27T13:30,Med altan og stort badeværelse,2200,3.999.000,1050000167
4,"Fredensborggade 2, 1. th.",False,56,https://home.dk/boligkatalog/koebenhavn/2200/e...,København N,Ejerlejlighed,True,55.698624,12.53988,2019-10-27T11:20,2019-10-27T11:00,Super beliggenhed på Nørrebro,2200,2.199.000,1050000137
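
The `price` column still holds strings such as `'2.095.000 '`, where the dot is a thousands separator. Here is a small sketch of one way to turn such values into integers, assuming amounts are always whole numbers. `parse_dkk` is a hypothetical helper, not part of the original notebook.

```python
import re

# Leading digits, optionally followed by '.ddd' thousands groups.
_AMOUNT = re.compile(r"\s*(\d+(?:\.\d{3})*)")


def parse_dkk(text):
    """Convert a Danish-formatted amount like '2.095.000 ' to an int.

    Only the first number in the text is used; returns None when the
    text does not start with a number.
    """
    match = _AMOUNT.match(text)
    if match is None:
        return None
    return int(match.group(1).replace(".", ""))


for raw in ["2.095.000 ", "1.799.000", "185.000  kr.", "Mursten"]:
    print(repr(raw), "->", parse_dkk(raw))
```

With a DataFrame, the helper could be applied as `df['price'].map(parse_dkk)`.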


The 'boligurl' is the URL to the site of each piece of real estate for sale, so let's use that to get more features!

In [18]:
response = requests.get(df['boligurl'][0])
html = response.text

In [22]:
html

'\r\n<!DOCTYPE html>\r\n<html lang="da" class="no-js" ng-app="home" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">\r\n<head>\r\n    <script id="CookieConsent" src="https://policy.cookieinformation.com/uc.js" data-culture="DA" async></script>\r\n    \r\n<script>(function(H){H.className=H.className.replace(/\\bno-js\\b/,\'js\')})(document.documentElement)</script>\r\n<meta charset="utf-8">\r\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\r\n<meta id="viewport" name="viewport" content="width=device-width, initial-scale=1, maximum-scale=2">\r\n<meta name="format-detection" content="telephone=no">\r\n<title>Ejerlejlighed - 2200 København N - Bjelkes Allé 6B, st..</title>\r\n<meta name="title" content="Ejerlejlighed - 2200 København N - Bjelkes Allé 6B, st..">\r\n<meta name="keywords" content="" />\r\n<meta name="description" content="Ejerlejlighed til salg, København N - Førstehåndsindtrykket er rigtig godt, når I træder indenfor i entréen, for allere

The information we want is in the spans with the info-property and info-value classes.

In [39]:
soup = bs4.BeautifulSoup(html, "html.parser")
additionalFeatures = soup.find_all('span', {"class": ["info-property","info-value"]})


[<span class="info-property">Kontantpris</span>,
 <span class="info-value"><b>3.650.000  kr.</b></span>,
 <span class="info-property">Ejerudgift pr. md.</span>,
 <span class="info-value"><b>2.356  kr.</b></span>,
 <span class="info-property">Kvm. pris <i class="tipso" title="Kvm-prisen er baseret på et vægtet areal,  som er mere præcist, fordi der også tages højde for kælderarealer, loftsarealer, udhuse etc. - og ikke kun boligareal. ">?</i></span>,
 <span class="info-value"><b>40.109  kr.</b></span>,
 <span class="info-property">Udbetaling</span>,
 <span class="info-value"><b>185.000  kr.</b></span>,
 <span class="info-property">
                         Brutto/Netto
                         <i class="tipso" title="I brutto- og nettoydelsen indgår standardfinansiering. Da der er tale om en standardfinansiering, vil den i visse tilfælde ikke kunne opnås, hvorfor brutto- og nettoydelsen i så fald kan afvige.">?</i>
 <br>
                         ekskl. ejerudgift
                     </

They come in pairs, and we need them divided into key-value pairs.

In [64]:
# Loop through each span in the list
#import json
count = 0
keys = []
values = []
for feat in additionalFeatures:
    if count % 2: # Odd number is a value
        values.append(feat.text.strip())
        #values.append(re.findall('<b>.+</b>',str(feat))[0][3:-4])
    else: # Even number is a key
        keys.append(feat.text.strip())
        #keys.append(re.findall('>.+<',str(feat))[0][1:-1])
    count +=1 
dictionary = dict(zip(keys, values))
dictionary

{'Kontantpris': '3.650.000  kr.',
 'Ejerudgift pr. md.': '2.356  kr.',
 'Kvm. pris ?': '40.109  kr.',
 'Udbetaling': '185.000  kr.',
 'Brutto/Netto\r\n                        ?\n\r\n                        ekskl. ejerudgift': '14.114  / 12.357  kr.',
 'Prisudvikling': '0%',
 'Boligareal': '91  m2',
 'Grundareal': '570  m2',
 'Antal toiletter': '1',
 'Antal rum': '3',
 'Byggeår': '1906',
 'Energimærke': 'D',
 'Sagsnr.': '1050000133',
 'Afstand til off. transport': '200  m',
 'Afstand til skole': '500  m',
 'Afstand til indkøb': '300  m',
 'Ydermur': 'Mursten',
 'Gulve': 'Plankegulve',
 'Vinduer': 'Termo',
 'El': 'HPFI-relæ',
 'Forurening': 'Jf. udskrift fra RegionH',
 'Overtagelse': 'Efter aftale',
 'Antenne': 'Kabel-tv',
 'Vaskeri': 'Ja',
 'Udlejning tilladt': 'Ja, jf. vedtægterne',
 'Tilbehør': 'Indesit opvaskemaskineGram køleskabVoss ovn',
 'Ejendomsværdi i kr.': '1.600.000',
 'Heraf grundværdi i kr.': '112.200',
 'Vurderingsår': '2018'}
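
Some labels carry tooltip residue, for example the `?` in `'Kvm. pris ?'` and the line breaks in the `Brutto/Netto` key. Below is a minimal sketch of pairing the texts and tidying the labels. It operates on plain strings so it runs without a network call. `clean_label` and `pair_features` are hypothetical helpers, and the assumption is that the spans strictly alternate between label and value.

```python
def clean_label(label):
    """Collapse whitespace and drop the tooltip marker '?' from a label."""
    words = [word for word in label.split() if word != "?"]
    return " ".join(words)


def pair_features(texts):
    """Pair alternating label/value strings into a dictionary."""
    labels = texts[0::2]  # even positions hold the labels
    values = texts[1::2]  # odd positions hold the values
    return {clean_label(k): v.strip() for k, v in zip(labels, values)}


# Strings taken from the output above.
texts = [
    "Kontantpris", "3.650.000  kr.",
    "Kvm. pris ?", "40.109  kr.",
    "Brutto/Netto\r\n      ?\n\r\n      ekskl. ejerudgift", "14.114  / 12.357  kr.",
]
print(pair_features(texts))
```

Cleaner labels also make for more manageable column names once the dictionaries are turned into a DataFrame.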

This should be repeated for each row in the DataFrame, with the results appended as columns. Let's create a function for this.

In [125]:
def GetAdditionalFeatures(df):
    additionalFeaturesList = []
    counter = 0
    loops = df.shape[0]
    # Loop through all rows
    for i in df['boligurl']:
        response = requests.get(i)
        html = response.text
        soup = bs4.BeautifulSoup(html, "html.parser")
        additionalFeatures = soup.find_all('span', {"class": ["info-property","info-value"]})

        # Loop through each span in the list
        count = 0
        keys = []
        values = []
        for feat in additionalFeatures:
            if count % 2: # Odd number is a value
                values.append(feat.text.strip())
            else: # Even number is a key
                keys.append(feat.text.strip())
            count +=1 
        
        additionalFeaturesList.append(dict(zip(keys, values)))
        time.sleep(1)
        counter += 1
        print((float(counter)/float(loops))*100.)
    df2 = df.join(pd.DataFrame(additionalFeaturesList))
    return df2

In [86]:
df2 = GetAdditionalFeatures(df)
df2.head()

Unnamed: 0,adresse,andenmaegler,boligOrGrundAreal,boligurl,city,ejendomstypePrimaerNicename,isNew,lat,lng,openHouseEndDate,...,Sagsnr.,Teknisk pris ?,Tilbehør,Udbetaling,Udlejning,Udlejning tilladt,Vaskeri,Vinduer,Vurderingsår,Ydermur
0,"Bjelkes Allé 6B, st..",False,54,https://home.dk/boligkatalog/koebenhavn/2200/e...,København N,Ejerlejlighed,True,55.692485,12.545717,2019-10-27T12:30,...,1050000139,,Gorenje komfurAEG køle/fryseskabBosch emhætteh...,105.000 kr.,,Tilladt,Ja,,2018,Mursten
1,"Poppelgade 4, 1. th.",False,105,https://home.dk/boligkatalog/koebenhavn/2200/a...,København N,Andelsbolig,True,55.692049,12.559357,2019-10-27T14:50,...,1050000162,3.879.803 kr.,Bosch køle/fryseskabAEG vaskemaskine,,,"Tilladt i kortere periode, jf. vedtægternes § ...",Ja,Termo,2018,Mursten
2,"Husumgade 20, 2. th.",False,53,https://home.dk/boligkatalog/koebenhavn/2200/e...,København N,Ejerlejlighed,True,55.693495,12.5454,2019-10-27T13:50,...,1050000164,,Afventer oplysninger fra sælger,120.000 kr.,Tilladt,,Fællesvaskeri,Termo,2018,Pudset mursten
3,"Egegade 2, 1. th.",False,78,https://home.dk/boligkatalog/koebenhavn/2200/e...,København N,Ejerlejlighed,True,55.690345,12.559457,2019-10-27T13:50,...,1050000167,,Gram køle/fryseskabSiemens komfurElectrolux va...,200.000 kr.,,Med tilladelse fra ejerforeningens bestyrelse,Nej,Termo,2018,Mursten
4,"Fredensborggade 2, 1. th.",False,56,https://home.dk/boligkatalog/koebenhavn/2200/e...,København N,Ejerlejlighed,True,55.698624,12.53988,2019-10-27T11:20,...,1050000137,,Køleskab fra Blomberg (A+)Komfur fra SMEG,110.000 kr.,Tilladt,,Fællesvaskeri,Termo,2018,Mursten


In [88]:
df2.columns

Index(['adresse', 'andenmaegler', 'boligOrGrundAreal', 'boligurl', 'city',
       'ejendomstypePrimaerNicename', 'isNew', 'lat', 'lng',
       'openHouseEndDate', 'openHouseStartDate', 'overskrift2', 'postal',
       'price', 'sagsnummer', 'Afstand til indkøb',
       'Afstand til off. transport', 'Afstand til skole', 'Altan',
       'Antal plan', 'Antal rum', 'Antal toiletter', 'Antenne', 'Boligareal',
       'Boligydelse pr. måned',
       'Brutto/Netto\r\n                        ?\n\r\n                        ekskl. ejerudgift',
       'Byggeår', 'Ejendomsværdi i kr.', 'Ejerudgift pr. md.', 'El',
       'Energimærke', 'Etage', 'Fibernet', 'Forurening', 'Grundareal', 'Gulve',
       'Heraf grundværdi i kr.', 'Husdyr', 'Husdyr tilladt', 'Kontantpris',
       'Kvm. pris ?', 'Købspris', 'Overtagelse', 'Prisudvikling', 'Pulterrum',
       'Sagsnr.', 'Teknisk pris ?', 'Tilbehør', 'Udbetaling', 'Udlejning',
       'Udlejning tilladt', 'Vaskeri', 'Vinduer', 'Vurderingsår', 'Ydermur'],
     

Alright, we can now run this entire process for multiple zip codes and more than 10 results.

Note: Through trial and error I found the maximum number of results per request to be 200. To get all the data, we can add search criteria to the URL that split the results into smaller bins.
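
The binning idea can be sketched as a small generator of non-overlapping size intervals, each of which becomes one request through the `BoligstoerrelseMin` and `BoligstoerrelseMax` parameters. `size_bins` is a hypothetical helper. Its defaults mirror the values used in the scraping loop in this notebook, and whether a given bin stays under the 200-result cap still has to be checked against the response.

```python
def size_bins(start=21, width=10, count=30):
    """Yield (min_size, max_size) pairs covering consecutive intervals."""
    for i in range(count):
        low = start + i * width
        yield low, low + width - 1


bins = list(size_bins())
print(bins[:3], "...", bins[-1])
```

With these defaults, homes smaller than 21 m2 or larger than 320 m2 fall outside every bin, so the range may need widening for a complete data set.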

In [89]:
# Zip codes in Denmark
zipCode = [2200, 9000, 8210, 8000]
"""
zipCode = [1301,2000,2100,2200,2300,2400,2450,2500,2600,2605,2610,2625,2630,
           2635,2640,2650,2660,2665,2670,2670,2680,2690,2700,2720,2730,2740,
           2750,2760,2765,2770,2791,2800,2820,2830,2840,2850,2860,2880,2900,
           2920,2930,2942,2950,2960,2970,2980,2990,3000,3050,3060,3070,3080,
           3100,3120,3140,3150,3200,3210,3220,3230,3250,3300,3310,3320,3330,
           3360,3370,3390,3400,3460,3480,3490,3500,3520,3540,3550,3600,3630,
           3650,3660,3670,3700,3720,3730,3740,3751,3760,3770,3782,3790,4000,
           4040,4050,4060,4070,4100,4130,4140,4160,4171,4173,4174,4180,4190,
           4200,4220,4230,4241,4242,4243,4250,4261,4262,4270,4281,4291,4293,
           4295,4296,4300,4320,4330,4340,4350,4360,4370,4390,4400,4420,4440,
           4450,4460,4470,4480,4490,4500,4520,4532,4534,4540,4550,4560,4571,
           4572,4573,4581,4583,4591,4592,4593,4600,4621,4622,4623,4632,4640,
           4652,4653,4654,4660,4671,4672,4673,4681,4682,4683,4684,4690,4700,
           4720,4733,4735,4736,4750,4760,4771,4772,4773,4780,4791,4792,4793,
           4800,4840,4850,4862,4863,4871,4872,4873,4874,4880,4891,4892,4894,
           4895,4900,4912,4913,4920,4930,4941,4943,4944,4951,4952,4953,4960,
           4970,4983,4990,5000,5200,5210,5220,5230,5240,5250,5260,5270,5290,
           5300,5330,5350,5370,5380,5390,5400,5450,5462,5463,5464,5466,5471,
           5474,5485,5491,5492,5500,5540,5550,5560,5580,5591,5592,5600,5610,
           5620,5631,5642,5672,5683,5690,5700,5750,5762,5771,5772,5792,5800,
           5853,5854,5856,5863,5871,5874,5881,5882,5883,5884,5892,5900,5932,
           5935,5953,5960,5970,5985,6000,6040,6051,6052,6064,6070,6091,6092,
           6093,6094,6100,6200,6230,6240,6261,6270,6280,6300,6310,6320,6330,
           6340,6360,6372,6392,6400,6430,6440,6470,6500,6510,6520,6535,6541,
           6560,6580,6600,6621,6622,6623,6630,6640,6650,6660,6670,6682,6683,
           6690,6700,6701,6705,6710,6715,6720,6731,6740,6752,6760,6771,6780,
           6792,6800,6818,6823,6830,6840,6851,6852,6853,6854,6855,6857,6862,
           6870,6880,6893,6900,6920,6933,6940,6950,6960,6971,6973,6980,6990,
           7000,7080,7100,7120,7130,7140,7150,7160,7171,7173,7182,7183,7184,
           7190,7200,7250,7260,7270,7280,7300,7321,7323,7330,7361,7362,7400,
           7430,7441,7442,7451,7470,7480,7490,7500,7540,7550,7560,7570,7600,
           7620,7650,7660,7673,7680,7700,7730,7741,7742,7752,7755,7760,7770,
           7790,7800,7830,7840,7850,7860,7870,7884,7900,7950,7960,7970,7980,
           7990,8000,8200,8210,8220,8230,8240,8250,8260,8270,8300,8305,8310,
           8320,8330,8340,8350,8355,8361,8362,8370,8380,8381,8382,8400,8410,
           8420,8444,8450,8462,8464,8471,8472,8500,8520,8530,8541,8543,8544,
           8550,8560,8570,8581,8585,8586,8592,8600,8620,8632,8641,8643,8653,
           8654,8660,8670,8680,8700,8721,8722,8723,8732,8740,8751,8752,8762,
           8763,8765,8766,8781,8783,8800,8830,8831,8832,8840,8850,8860,8870,
           8881,8882,8883,8900,8950,8961,8963,8970,8981,8983,8990,9000,9200,
           9210,9220,9230,9240,9260,9270,9280,9293,9300,9310,9320,9330,9340,
           9352,9362,9370,9380,9381,9382,9400,9430,9440,9460,9480,9490,9492,
           9493,9500,9510,9520,9530,9541,9550,9560,9574,9575,9600,9610,9620,
           9631,9632,9640,9670,9681,9690,9700,9740,9750,9760,9800,9830,9850,
           9870,9881,9900,9940,9970,9981,9982,9990
          ]
"""

In [116]:

featureList = []
# Loop through zip codes
for code in zipCode:
    # Setting size interval to bin responses into smaller chunks
    minSize = 21
    maxSize = 30
    # Loop through sizes
    for i in range(30):
        url = 'https://home.dk/umbraco/backoffice/home-api/SEARCH?CurrentPageNumber=0&SearchResultsPerPage=200&BoligstoerrelseMin=' + str(minSize) + '&BoligstoerrelseMax=' + str(maxSize) + '&q=' + str(code) + '&Energimaerker=null&SortOrder=asc&SearchType=0&_=1571481546474'

        response = requests.get(url)
        # Saving response to a dictionary
        featuresDict = response.json()
        # dropping the pictures key from the list of dictionaries
        features = featuresDict['searchResults']
        for f in features:
            del f['pictures']
        featureList.extend(features)
        # Pausing to be polite to the server
        #time.sleep(2)
        
        # Count up sizes
        minSize += 10
        maxSize += 10

    
len(featureList)

887

In [117]:
minSize

321

In [119]:
df_new = pd.DataFrame(featureList)
df_new.drop(inplace = True, columns=[
    'billedeUrl','lejePerMaaned','showNewPrice',
    'aabenthusNicename','floorPlan','erSolgtOgLejebolig',
    'boligKanLejes','aabenthusShowRegistration', 
    'solgtBolig','isLejebolig','fokusbolig'
])
df_new.head()

Unnamed: 0,adresse,andenmaegler,boligOrGrundAreal,boligurl,city,ejendomstypePrimaerNicename,isNew,lat,lng,openHouseEndDate,openHouseStartDate,overskrift2,postal,price,sagsnummer
0,"Dagmarsgade 36, 1 Lejl. 4",True,32,https://www.boligsiden.dk/viderestillingekster...,København N,Ejerlejlighed,True,55.6986,12.546053,,,,2200,1.450.000,22004392_10007
1,"Åboulevard 34D, 5 th",True,37,https://www.boligsiden.dk/viderestillingekster...,København N,Ejerlejlighed,False,55.684981,12.554957,,,,2200,1.995.000,11-X00002323503001_10016
2,"Dagmarsgade 36, 4",True,32,https://www.boligsiden.dk/viderestillingekster...,København N,Ejerlejlighed,False,55.6986,12.546053,,,,2200,1.295.000,22004319_10007
3,"Søllerødgade 46, 5 tv",True,37,https://www.boligsiden.dk/viderestillingekster...,København N,Ejerlejlighed,False,55.696019,12.543893,,,,2200,1.998.000,11-X00002283503001_10016
4,"Slejpnersgade 6, 1. 3.",False,44,https://home.dk/boligkatalog/koebenhavn/2200/e...,København N,Ejerlejlighed,False,55.701175,12.543234,2019-10-27T16:20,2019-10-27T16:00,,2200,1.899.000,1050000126


In [126]:
df_new2 = GetAdditionalFeatures(df_new.head())
df_new2.head()

20.0
40.0
60.0
80.0
100.0


Unnamed: 0,adresse,andenmaegler,boligOrGrundAreal,boligurl,city,ejendomstypePrimaerNicename,isNew,lat,lng,openHouseEndDate,...,Overtagelse,Prisudvikling,Sagsnr.,Tilbehør,Udbetaling,Udlejning tilladt,Vaskeri,Vinduer,Vurderingsår,Ydermur
0,"Dagmarsgade 36, 1 Lejl. 4",True,32,https://www.boligsiden.dk/viderestillingekster...,København N,Ejerlejlighed,True,55.6986,12.546053,,...,,,,,,,,,,
1,"Åboulevard 34D, 5 th",True,37,https://www.boligsiden.dk/viderestillingekster...,København N,Ejerlejlighed,False,55.684981,12.554957,,...,,,,,,,,,,
2,"Dagmarsgade 36, 4",True,32,https://www.boligsiden.dk/viderestillingekster...,København N,Ejerlejlighed,False,55.6986,12.546053,,...,,,,,,,,,,
3,"Søllerødgade 46, 5 tv",True,37,https://www.boligsiden.dk/viderestillingekster...,København N,Ejerlejlighed,False,55.696019,12.543893,,...,,,,,,,,,,
4,"Slejpnersgade 6, 1. 3.",False,44,https://home.dk/boligkatalog/koebenhavn/2200/e...,København N,Ejerlejlighed,False,55.701175,12.543234,2019-10-27T16:20,...,Efter aftale,-5%,1050000126.0,Whirlpool vaskemaskineGorenje gaskomfur - Bauk...,95.000 kr.,"Tilladt, jf. vedtægternes § 13",Vaskemaskine i lejligheden,Termo,2018.0,Mursten
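
In the output above, the four rows whose `boligurl` points to boligsiden.dk (listings from other brokers, where `andenmaegler` is `True`) come back with empty feature columns, presumably because those pages do not use the `info-property`/`info-value` markup. One option is to skip such rows before scraping. The sketch below checks the host of each URL. `is_home_listing` is a hypothetical helper and the example URLs are shortened stand-ins.

```python
from urllib.parse import urlsplit


def is_home_listing(url):
    """True when the URL is hosted on home.dk or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return host == "home.dk" or host.endswith(".home.dk")


urls = [
    "https://home.dk/boligkatalog/example.aspx",         # stand-in path
    "https://www.boligsiden.dk/viderestilling/example",  # stand-in path
]
print([u for u in urls if is_home_listing(u)])
```

Applied to the DataFrame, this could look like `df_new[df_new['boligurl'].map(is_home_listing)]`, which keeps only the rows the scraper can parse.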


In [None]:
df_new2.head()

In [75]:
df2.columns

Index(['adresse', 'andenmaegler', 'boligOrGrundAreal', 'boligurl', 'city',
       'ejendomstypePrimaerNicename', 'isNew', 'lat', 'lng',
       'openHouseEndDate', 'openHouseStartDate', 'overskrift2', 'postal',
       'price', 'sagsnummer', 'Afstand til indkøb',
       'Afstand til off. transport', 'Afstand til skole', 'Altan',
       'Antal plan', 'Antal rum', 'Antal toiletter', 'Antenne', 'Boligareal',
       'Boligydelse pr. måned',
       'Brutto/Netto\r\n                        ?\n\r\n                        ekskl. ejerudgift',
       'Byggeår', 'Ejendomsværdi i kr.', 'Ejerudgift pr. md.', 'El',
       'Energimærke', 'Etage', 'Fibernet', 'Forurening', 'Grundareal', 'Gulve',
       'Heraf grundværdi i kr.', 'Husdyr', 'Husdyr tilladt', 'Kontantpris',
       'Kvm. pris ?', 'Købspris', 'Overtagelse', 'Prisudvikling', 'Pulterrum',
       'Sagsnr.', 'Teknisk pris ?', 'Tilbehør', 'Udbetaling', 'Udlejning',
       'Udlejning tilladt', 'Vaskeri', 'Vinduer', 'Vurderingsår', 'Ydermur'],
     

In [76]:
df2.shape

(10, 55)

In [213]:
feats = dict['searchResults']
url = 'https://home.dk/umbraco/backoffice/home-api/SEARCH?CurrentPageNumber=0&SearchResultsPerPage=200&BoligstoerrelseMin=60&BoligstoerrelseMax=70&q=2200&Energimaerker=null&SortOrder=asc&SearchType=0&_=1571481546474'
response = requests.get(url)    
dict2 = response.json()
feats2 = dict2['searchResults']

In [214]:
len(feats2)

21

In [188]:
for f in feats:
    del f['pictures']

TypeError: list indices must be integers or slices, not str

In [144]:
len(feats)

200

In [165]:
feats[0]['floorPlan']

{'PicId': 2985754,
 'CaseId': 10395211,
 'CaseNumber': '1050000137',
 'MediaType': 'p',
 'MaxWidth': 3000,
 'MaxHeight': 2000,
 'URL': 'https://home.mindworking.eu/resources/shops/105/cases/1050000137/casemedia/images/d648ebc328cc9e08efd3e8f608061497/customsize.jpg?deviceId=jd83hsdf3',
 'Position': 0,
 'Description': 'Plantegning',
 'GUID': 'd648ebc3-28cc-9e08-efd3-e8f608061497',
 'refGUID': '00000000-0000-0000-0000-000000000000',
 'IsVertical': False,
 'IsHorizontal': True}

In [145]:
df = pd.DataFrame(feats)

In [179]:
df.drop(inplace = True, columns=[
    'billedeUrl','lejePerMaaned','showNewPrice',
    'aabenthusNicename','floorPlan','erSolgtOgLejebolig',
    'boligKanLejes','aabenthusShowRegistration', 
    'solgtBolig','isLejebolig','fokusbolig'
])

In [149]:
df.postal.value_counts()

2200    200
Name: postal, dtype: int64

In [180]:
df.nunique()

adresse                        200
andenmaegler                     2
boligOrGrundAreal               80
city                             1
ejendomstypePrimaerNicename      3
isNew                            2
lat                            154
lng                            154
openHouseEndDate                 5
openHouseStartDate               5
overskrift2                     30
postal                           1
price                          114
sagsnummer                     200
dtype: int64

In [70]:
df.nunique()

adresse                        10
andenmaegler                    1
boligOrGrundAreal              10
boligurl                       10
city                            1
ejendomstypePrimaerNicename     2
isNew                           2
lat                            10
lng                            10
openHouseEndDate                8
openHouseStartDate              8
overskrift2                    10
postal                          1
price                          10
sagsnummer                     10
dtype: int64

In [82]:
df.shape

(22, 27)

In [65]:
url = 'https://home.dk/resultatliste/?CurrentPageNumber=0&SearchResultsPerPage=15&q=2200%20K%C3%B8benhavn%20N&Energimaerker=null&SearchType=0'

In [11]:
content_div = soup.find_all('home-tile-info')
content_div

[]

In [21]:
urlreq = 'https://home.dk/umbraco/backoffice/home-api/BoligOrAddress/Boligdata?max=100&searchstring=2200'

In [24]:
# import json library
import json
import urllib.request
# request url
#urlreq = 'https://groceries.asda.com/api/items/search?keyword=yogurt'
# get response
response = urllib.request.urlopen(urlreq)
# load as json (this consumes the response body)
jresponse = json.load(response)
# write to file as pretty print
with open('asdaresp.json', 'w') as outfile:
    json.dump(jresponse, outfile, sort_keys=True, indent=4)
# the body was already consumed by json.load, so this returns b''
response.read()

b''

In [26]:
req = urllib.request.Request(urlreq)
with urllib.request.urlopen(req) as response:
   the_page = response.read()
print(the_page)

b'{"Successed":true,"Status":"OK","InputModel":{"SearchString":"2200","Max":100},"SuggestItems":[{"suggest":"2200 K\xc3\xb8benhavn N","count":"200","sortorder":40,"IsHeadLine":false}]}'


In [66]:
import requests

url = 'https://home.dk/umbraco/backoffice/home-api/SEARCH?CurrentPageNumber=0&SearchResultsPerPage=15&q=2200%20K%C3%B8benhavn%20N&Energimaerker=null&SearchType=0&_=1571481546474'
response = requests.get(url)    
dict = response.json()  # note: this name shadows Python's built-in dict
dict

{'redirectUrl': None,
 'inputModel': {'SortType': None,
  'SortOrder': None,
  'CurrentPageNumber': 0,
  'SearchResultsPerPage': 15,
  'q': '2200 København N',
  'EjendomstypeV1': None,
  'EjendomstypeRH': None,
  'EjendomstypeEL': None,
  'EjendomstypeVL': None,
  'EjendomstypeAA': None,
  'EjendomstypePL': None,
  'EjendomstypeFH': None,
  'EjendomstypeLO': None,
  'EjendomstypeHG': None,
  'EjendomstypeFG': None,
  'EjendomstypeNL': None,
  'Forretningnr': None,
  'ProjectNodeId': None,
  'OnlyBrokerHome': None,
  'PriceMin': None,
  'PriceMax': None,
  'EjerudgiftPrMdrMin': None,
  'EjerudgiftPrMdrMax': None,
  'BoligydelsePrMdrMin': None,
  'BoligydelsePrMdrMax': None,
  'BoligstoerrelseMin': None,
  'BoligstoerrelseMax': None,
  'GrundstoerrelseMin': None,
  'GrundstoerrelseMax': None,
  'VaerelserMin': None,
  'VaerelserMax': None,
  'Energimaerker': ['null'],
  'ByggaarMin': None,
  'ByggaarMax': None,
  'EtageMin': None,
  'EtageMax': None,
  'PlanMin': None,
  'PlanMax': None

In [59]:
#urlreq = 'https://home.dk/resultatliste/?q=2200+K%C3%B8benhavn+N:33'
urlreq = 'https://home.dk/umbraco/backoffice/home-api/SEARCH?CurrentPageNumber=0&SearchResultsPerPage=15&q=2200%20K%C3%B8benhavn%20N&Energimaerker=null&SearchType=0&_=1571481546474'
response = urllib.request.urlopen(urlreq)
#req = urllib.request.Request(urlreq)
#with urllib.request.urlopen(req) as response:
#   the_page = response.read()
#print(the_page)
# load as json
response.read()#.decode('utf-8')
#jresponse = json.load(response, encoding='utf-8')
#json.loads(line.decode("utf-8","ignore"))
# write to file as pretty print
#with open('asdaresp.json', 'w') as outfile:
#    json.dump(jresponse, outfile, sort_keys=True, indent=4)

b'\xd5\xbd\xddr\x1c\xc7\x96\xa5\xf9*\x18\\\xf4\x95\x1c\x8cp\x0f\xf7\x88\xa0YY\x1bO\x89\xfa)Q?FR*\x1b+k\xa3yDx\x80\x10A\x80\x93\x99\x90\x8eN\xd9\xb9\x9c~\x86\xba\x9c\x9bj\xb3y\x83\xba\x1e\xbd\xd8|;\tGF\xa6\xa7\x8e\xe8$\x11\xec\xd0\xe9f\x91 \xc0\x05\xe4\xca\xf0\x9f\xbd\xd7^\xeb\xdfOWa\xb8X\x85~\xf3\xe3\xea\xf2\xf4\xe1\xd5\xcd\xe5\xe5g\xa7\x17Won6\xdf^\x0f\x81\x8f\xfc\xfb\xe9\xb3\xeb\xd5\xe6\xf9ooB\xfc[\xf9\xf3\xf7\xab!\xac\xe2\x07\xfe\xf9f\xb5\nW\x9b\x1f\xfcy\xf8\xee\xe6u\'\x7fQ|v\xfa,\xf8U\xff\xf2iX\xdf\\n\xd6?\x84\x95\xfc\xf5\xe9\xc3\xd2~v\xfa\x7f\x9d><\xd5\xba(N\xbe\xf9\xfd\xbf\xbap\xf5\xd2\xffru\xf2\xdd\xe9g\xa7\x8f\x7f\x0eW\xc3\xf5\xeb\xf5\x06\xb0\x9f\xca\xf8\xafO?\xfa\xf4\xabc\x1f}\xfc\xe4\xd8G\x7f:\xfa\xd1G\x8f\x8e}\xee\x0fG?\xf7\x8b\xa3hO\xbe?\xf6/|\xf5\xe5\xb1\x8f~q\xf4\xa3\xdf\xdd\xa1}q\xcd+\xb7\xb9\xba\xb8:\xbf\xba{5\x7fX]\xff\x0c\x1d\xdf\xf1\xf2\x7f=\xc4\x7f\xf4\xfb\xab\xcb\xdf\xfe\xb2\xba~\x15V_]\xbf\xbec\xe2\x87\xd5E\x1f\xbe\xbd\xb8\x8a\x9f\xf5\xf6\xcf\xfe\xaf\xf1\xcf\xbct\
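
The bytes above are not JSON text. They look like a compressed body that `urllib` hands back untouched, whereas `requests` decompresses gzip and deflate bodies transparently when the server sets a `Content-Encoding` header. A hedged sketch of decoding such a body by hand with the standard library follows. `decode_body` is a hypothetical helper. Which encoding this server actually used is not shown in the notebook, and Brotli (`br`) would need a third-party package that is not covered here.

```python
import gzip
import json
import zlib


def decode_body(raw, content_encoding=""):
    """Decompress an HTTP body according to its Content-Encoding value."""
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # Standard form: deflate data inside a zlib wrapper.
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send raw deflate data without the wrapper.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    return raw  # identity or unknown encoding: return unchanged


# Round-trip demonstration with a made-up payload.
payload = json.dumps({"searchResults": []}).encode("utf-8")
compressed = gzip.compress(payload)
print(json.loads(decode_body(compressed, "gzip")))
```

With `urllib`, the header value is available as `response.headers.get("Content-Encoding", "")` and can be passed straight to the helper.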

In [None]:
GET /resultatliste/?q=2200+K%C3%B8benhavn+N HTTP/1.1
Host: home.dk
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36
Sec-Fetch-Mode: navigate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Sec-Fetch-Site: same-origin
Referer: https://home.dk/
Accept-Encoding: gzip, deflate, br
Accept-Language: da-DK,da;q=0.9,en-US;q=0.8,en;q=0.7
Cookie: _ga=GA1.2.55867073.1571476775; _gid=GA1.2.440813985.1571476775; _gcl_au=1.1.826145108.1571476775; adv_guid=bb3bdd31-b9117e-fea1b-b23eb8-44b028e|ADV; CookieInformationConsent=%7B%22website_uuid%22%3A%22bfb17c80-64c9-4e36-bca8-739bd5bf03ee%22%2C%22timestamp%22%3A%222019-10-19T09%3A19%3A37.172Z%22%2C%22consent_url%22%3A%22https%3A%2F%2Fhome.dk%2F%22%2C%22consent_website%22%3A%22home.dk%22%2C%22consent_domain%22%3A%22home.dk%22%2C%22user_uid%22%3A%220e7584f9-4838-44f8-996a-90f14c9fc36c%22%2C%22consents_approved%22%3A%5B%22cookie_cat_necessary%22%2C%22cookie_cat_functional%22%2C%22cookie_cat_statistic%22%2C%22cookie_cat_marketing%22%2C%22cookie_cat_unclassified%22%5D%2C%22consents_denied%22%3A%5B%5D%2C%22user_agent%22%3A%22Mozilla%2F5.0%20%28Macintosh%3B%20Intel%20Mac%20OS%20X%2010_14_6%29%20AppleWebKit%2F537.36%20%28KHTML%2C%20like%20Gecko%29%20Chrome%2F77.0.3865.120%20Safari%2F537.36%22%7D; ASP.NET_SessionId=eedeczunm3i0d4vnrcsipwvy


In [28]:

urlreq = 'https://home.dk/umbraco/backoffice/home-api/SEARCH?CurrentPageNumber=0&SearchResultsPerPage=15&q=2200%20K%C3%B8benhavn%20N&Energimaerker=null&SearchType=0&_=1571481546474'
#response = urllib.request.urlopen(urlreq)
req = urllib.request.Request(urlreq)
with urllib.request.urlopen(req) as response:
   the_page = response.read()
print(the_page)

b'\xd5\xbd\xddr\x1c\xc7\x96\xa5\xf9*\x18\\\xf4\x95\x1c\x8cp\x0f\xf7\x88\xa0YY\x1bO\x89\xfa)Q?FR*\x1b+k\xa3yDx\x80\x10A\x80\x93\x99\x90\x8eN\xd9\xb9\x9c~\x86\xba\x9c\x9bj\xb3y\x83\xba\x1e\xbd\xd8|;\tGF\xa6\xa7\x8e\xe8$\x11\xec\xd0\xe9f\x91 \xc0\x05\xe4\xca\xf0\x9f\xbd\xd7^\xeb\xdfOWa\xb8X\x85~\xf3\xe3\xea\xf2\xf4\xe1\xd5\xcd\xe5\xe5g\xa7\x17Won6\xdf^\x0f\x81\x8f\xfc\xfb\xe9\xb3\xeb\xd5\xe6\xf9ooB\xfc[\xf9\xf3\xf7\xab!\xac\xe2\x07\xfe\xf9f\xb5\nW\x9b\x1f\xfcy\xf8\xee\xe6u\'\x7fQ|v\xfa,\xf8U\xff\xf2iX\xdf\\n\xd6?\x84\x95\xfc\xf5\xe9\xc3\xd2~v\xfa\x7f\x9d><\xd5\xba(N\xbe\xf9\xfd\xbf\xbap\xf5\xd2\xffru\xf2\xdd\xe9g\xa7\x8f\x7f\x0eW\xc3\xf5\xeb\xf5\x06\xb0\x9f\xca\xf8\xafO?\xfa\xf4\xabc\x1f}\xfc\xe4\xd8G\x7f:\xfa\xd1G\x8f\x8e}\xee\x0fG?\xf7\x8b\xa3hO\xbe?\xf6/|\xf5\xe5\xb1\x8f~q\xf4\xa3\xdf\xdd\xa1}q\xcd+\xb7\xb9\xba\xb8:\xbf\xba{5\x7fX]\xff\x0c\x1d\xdf\xf1\xf2\x7f=\xc4\x7f\xf4\xfb\xab\xcb\xdf\xfe\xb2\xba~\x15V_]\xbf\xbec\xe2\x87\xd5E\x1f\xbe\xbd\xb8\x8a\x9f\xf5\xf6\xcf\xfe\xaf\xf1\xcf\xbct\