# Motivation

Before walking you through the project, let me explain the motivation behind it. Aside from data science and machine learning, my other passion is spending time in the mountains. A trip to any mountain takes careful planning to minimize the risks, which means paying close attention to the weather as the summit day approaches. My favorite website for this is Mountain-Forecast.com, which provides weather forecasts at different elevations for almost any mountain in the world. The only problem is that it doesn’t offer historical data (as far as I can tell), which can be useful when deciding whether to make the trip or wait for better conditions.
This problem had been in the back of my mind for a while, and I finally decided to do something about it. Below, I describe how I wrote a web scraper for Mountain-Forecast.com using Python and Beautiful Soup and deployed it on a Raspberry Pi to collect the data daily.

# I. Scraping with Beautiful Soup and Selenium

In [17]:
import os
from bs4 import BeautifulSoup
from requests import get
import pandas as pd
import numpy as np
import itertools as it
import random
import time
from fake_useragent import UserAgent
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import sys
import regex as re

In [19]:
pip freeze > requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [2]:
chromedriver = "C:/Users/jerem/Anaconda3/envs/environment-deep-learning-cookbook/Lib/site-packages/notebook/tests/selenium/chromedriver_win32/chromedriver" # path to the chromedriver executable
chromedriver = os.path.expanduser(chromedriver)
print('chromedriver path: {}'.format(chromedriver))
sys.path.append(chromedriver)

driver = webdriver.Chrome(chromedriver)

chromedriver path: C:/Users/jerem/Anaconda3/envs/environment-deep-learning-cookbook/Lib/site-packages/notebook/tests/selenium/chromedriver_win32/chromedriver


In [3]:
seloger_toulouse_url = 'https://www.seloger.com/immobilier/locations/immo-toulouse-31/bien-appartement/?LISTING-LISTpg='

def get_page_links(url, number_of_pages):
    page_links=[] # Create a list of pages links
    for i in range(1,number_of_pages+1):
        j = url + str(i)
        page_links.append(j)
    return page_links

#page_links = get_page_links(seloger_toulouse_url,3) 

In [4]:
def get_appartment_links(pages, driver):
    
    # List of listing links, one sublist per results page
    appartment_links=[] 
    
    # Keep looping until every page has been scraped successfully
    while len(pages) > 0: 
        # Note: removing from `pages` while iterating over it skips the next element;
        # the outer while loop picks up the skipped pages on a later pass
        for i in pages:
            
            #print('Extracting links from page',pages.index(i)+1,'out of',len(pages),'left')
            # Try to load the page; on failure, wait and retry on a later pass
            try:
                driver.get(i)
                soup = BeautifulSoup(driver.page_source, 'html.parser')

                # Extract links information via the find_all function of the soup object 
                listings = soup.find_all("a", attrs={"name": "classified-link"})
                page_data = [row['href'] for row in listings]
                appartment_links.append(page_data) # one sublist of links per page
                
                pages.remove(i)

                print('There are',len(pages),'pages left to examine')
                time.sleep(random.randrange(11,21)) # pause 11 to 20 seconds between pages
    
            except Exception:
                print("Skipping. Connection error")
                time.sleep(random.randrange(300,600)) # back off for roughly 5 to 10 minutes
                
    return appartment_links
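
The loop above removes items from `pages` while iterating over it and relies on the outer `while` to revisit whatever was skipped. As a minimal sketch of an alternative (not what the notebook runs), the same retry idea can be written with an explicit queue. The names `process_with_retries`, `flaky_handler`, and `max_attempts` are hypothetical and exist only for this illustration, and the handler simulates a failure instead of calling Selenium.

```python
from collections import deque


def process_with_retries(items, handler, max_attempts=3):
    """Run handler on every item, re-queuing failures up to max_attempts times."""
    queue = deque((item, 1) for item in items)
    results, failed = [], []
    while queue:
        item, attempt = queue.popleft()
        try:
            results.append(handler(item))
        except Exception:
            if attempt < max_attempts:
                queue.append((item, attempt + 1))  # retry later, after the other items
            else:
                failed.append(item)  # give up on this item instead of looping forever
    return results, failed


# Pitfall: removing from a list while iterating over it skips the next element.
pages = ['p1', 'p2', 'p3', 'p4']
visited = []
for page in pages:
    visited.append(page)
    pages.remove(page)
# After this single pass, only 'p1' and 'p3' were visited; 'p2' and 'p4' remain.

# Hypothetical handler that fails the first time it sees 'p2'.
calls = {}


def flaky_handler(page):
    calls[page] = calls.get(page, 0) + 1
    if page == 'p2' and calls[page] == 1:
        raise ConnectionError('simulated failure')
    return page.upper()


results, failed = process_with_retries(['p1', 'p2', 'p3'], flaky_handler)
```

Capping the number of attempts is a design choice of this sketch: a page that never loads ends up in `failed` rather than keeping the loop alive indefinitely.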

In [5]:
# Create a flatten function:
def flatten_list(appartment_links):
    appartment_links_flat=[]
    for sublist in appartment_links:
        for item in sublist:
            appartment_links_flat.append(item)
    return appartment_links_flat
        
#or appartment_links_flat = list(it.chain.from_iterable(appartment_links))

#appartment_links_flat = flatten_list(appartment_links)
# Check the number of links
#len(appartment_links_flat)
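
As a quick check of the commented-out alternative above, the sketch below compares the explicit loop with `itertools.chain.from_iterable` on made-up data. The sample links are placeholders, and the order-preserving de-duplication at the end is an optional extra that the notebook does not perform.

```python
import itertools as it


def flatten_list(nested):
    """Flatten a list of lists with an explicit double loop."""
    flat = []
    for sublist in nested:
        for item in sublist:
            flat.append(item)
    return flat


# Placeholder data: one sublist of links per results page
nested_links = [['/a', '/b'], [], ['/c', '/a']]

flat_loop = flatten_list(nested_links)
flat_chain = list(it.chain.from_iterable(nested_links))

# Optional: drop repeated links while keeping the first-seen order
unique_links = list(dict.fromkeys(flat_chain))
```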

In [6]:
def get_title(soup):
    try:
        title = soup.title.text
        return title
    except:
        return np.nan

    
def get_agency(soup):
    try:
        agency = soup.find_all("a", class_="agence-link")
        agency2 = agency[0].text
        agency3 = agency2.replace('\n', '').lower()
        return agency3
    except:
        return np.nan
    
    
def get_housing_type(soup):
    try:
        ht = soup.find_all("h2", class_="c-h2")
        ht2 = ht[0].text
        return ht2
    except:
        return np.nan
    
    
def get_city(soup):
    try:
        city = soup.find_all("p", class_="localite")
        city2 = city[0].text
        return city2
    except:
        return np.nan
    
    
def get_details(soup):
    try:
        details = soup.find_all("h1", class_="detail-title title1")
        details2 = details[0].text
        details3 = details2.replace('\n', '').lower()
        return details3
    except:
        return np.nan
    
    
def get_rent(soup):
    try:
        rent = soup.find_all("a", class_="js-smooth-scroll-link price")
        rent2 = rent[0].text
        rent3 = int(''.join(filter(str.isdigit, rent2)))
        return rent3
    except:
        return np.nan
    
def get_charges(soup):
    try:
        cha = soup.find_all("sup", class_="u-thin u-300 u-black-snow")
        cha2 = cha[0].text
        return cha2
    except:
        return np.nan
    
    
def get_rent_info(soup):
    try:
        rent_info = soup.find_all("section", class_="categorie with-padding-bottom")
        rent_info2 = rent_info[0].find_all("p", class_="sh-text-light")
        rent_info3 = rent_info2[0].text
        return rent_info3
    except:
        return 'None'
    
    
def get_criteria(soup):
    try:
        crit = soup.find_all("section", class_="categorie")
        crit2 = [div.text for ul in crit for div in ul.find_all("div", class_="u-left")]
        crit3=(" ; ".join(crit2)) # concatenate string items in a list into a single string
        return crit3
    except:
        return 'None'
    
    
def get_energy_rating(soup):
    try:
        ener = soup.find_all("div", class_="info-detail")
        ener2 = ener[0].text
        ener3 = int(''.join(filter(str.isdigit, ener2)))
        return ener3
    except:
        return np.nan
    
    
def get_gas_rating(soup):
    try:
        gas = soup.find_all("div", class_="info-detail")
        gas2 = gas[1].text
        gas3 = int(''.join(filter(str.isdigit, gas2)))
        return gas3
    except:
        return np.nan
    
    
def get_description(soup):
    try:
        descr = soup.find_all(class_="sh-text-light")
        descr2 = descr[0].text
        return descr2
    except:
        return 'None'
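
`get_rent`, `get_energy_rating`, and `get_gas_rating` all rely on `int(''.join(filter(str.isdigit, text)))`. The sketch below shows what that idiom does on a few made-up strings, including a case where it silently merges the decimals into the integer. `digits_to_int` and `parse_amount` are hypothetical helpers written for this illustration, and the sample strings are assumptions, not text taken from the site.

```python
import re


def digits_to_int(text):
    """Same idiom as the getters above: keep every digit, then convert."""
    return int(''.join(filter(str.isdigit, text)))


def parse_amount(text):
    """Return the first number in text as a float, or None if there is none.

    Assumes spaces (including non-breaking ones) separate thousands
    and a comma or dot marks the decimals.
    """
    cleaned = text.replace('\u00a0', ' ').replace('\u202f', ' ')
    match = re.search(r'\d[\d ]*(?:[.,]\d+)?', cleaned)
    if match is None:
        return None
    return float(match.group(0).replace(' ', '').replace(',', '.'))


rent_simple = digits_to_int('1 250 €')      # thousands separator is dropped
rent_merged = digits_to_int('650,50 €')     # decimals are merged into the integer
rent_parsed = parse_amount('650,50 €')      # decimals are kept
rent_missing = parse_amount('Prix sur demande')
```

If the listing pages only ever show whole euro amounts, the original idiom is enough; the regex variant only matters when decimals can appear.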

In [7]:
def get_html_data(url, driver):
    driver.get(url)
    # Pause 5 to 14 seconds after loading the page, before reading its source
    # (alternative: time.sleep(random.lognormvariate(0, 1)))
    time.sleep(random.randrange(5,15))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    return soup
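
The commented-out line in `get_html_data` hints at drawing the pause from a log-normal distribution. The standard `random` module has no `lognormal` function; the equivalent call there is `random.lognormvariate(mu, sigma)`. A minimal sketch of both delay styles follows; `polite_delay`, `lognormal_delay`, and the `cap` parameter are hypothetical names and values chosen for this example.

```python
import random


def polite_delay(low=5.0, high=15.0, rng=random):
    """Uniform delay in seconds between low and high."""
    return rng.uniform(low, high)


def lognormal_delay(mu=0.0, sigma=1.0, cap=30.0, rng=random):
    """Log-normal delay in seconds, capped so a rare large draw cannot stall the run."""
    return min(rng.lognormvariate(mu, sigma), cap)


# Seeded generator so the sketch is reproducible
rng = random.Random(42)
uniform_samples = [polite_delay(rng=rng) for _ in range(100)]
lognormal_samples = [lognormal_delay(rng=rng) for _ in range(100)]
```

Either value can be passed to `time.sleep`. The log-normal variant produces mostly short pauses with an occasional long one, which is why the sketch caps it.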

In [8]:
def get_appartment_data(driver,links):
    appartment_data = []

    # Keep looping until every listing has been scraped or dropped
    while len(links) > 0: 
        # Note: removing from `links` while iterating over it skips the next element;
        # the outer while loop picks up the skipped listings on a later pass
        for i in links:

            # links.index(i) returns the first match, so a duplicated link reports its earliest position
            print('Now extracting data from listing {} out of {}.'.format(links.index(i)+1,len(links)))

            # Try to load and parse the listing; on failure, wait and retry on a later pass
            try:
                soup = get_html_data(i,driver)
                title = get_title(soup)
                agency = get_agency(soup)
                housing_type = get_housing_type(soup)
                city = get_city(soup)
                details = get_details(soup)
                rent = get_rent(soup)
                charges = get_charges(soup)
                rent_info = get_rent_info(soup)
                criteria = get_criteria(soup)
                energy_rating = get_energy_rating(soup)
                gas_rating = get_gas_rating(soup)
                description = get_description(soup)

                # if the page title is the generic search title, the listing is no longer available: remove it from the list
                if title == 'Location appartements Toulouse (31) | Louer appartements à Toulouse 31000': 
                    print('This appartment is no longer available.')   
                    links.remove(i)

                # if the listing is not accessible (bot detection), keep it in the list and try again later
                elif pd.isna(housing_type) and pd.isna(city) and pd.isna(rent): 
                    print('You Shall Not Pass!')                    
                    time.sleep(random.randrange(300,600))

                # if the listing is accessible, extract its data and remove it from the list
                else:
                    appartment_data.append([i,title,agency,housing_type,city,details,rent,charges,rent_info,
                                            criteria,energy_rating,gas_rating,description]) 
                    links.remove(i)
                    print('Good! There are {} listings left to examine.'.format(len(links)))

            except Exception:
                print("Skipping. Connection error")
                time.sleep(random.randrange(60,120))
     
    
    df = pd.DataFrame(appartment_data,columns = ['link','title','agency','housing_type','city','details','rent','charges','rent_info',
                                            'criteria','energy_rating','gas_rating','description'])
    
    return df

In [9]:
page_links = get_page_links(seloger_toulouse_url,96) 
appartment_links = get_appartment_links(page_links,driver)
appartment_links_flat = flatten_list(appartment_links)
df_appartment = get_appartment_data(driver,appartment_links_flat)

There are 95 pages left to examine
There are 94 pages left to examine
There are 93 pages left to examine
There are 92 pages left to examine
There are 91 pages left to examine
There are 90 pages left to examine
There are 89 pages left to examine
There are 88 pages left to examine
There are 87 pages left to examine
There are 86 pages left to examine
There are 85 pages left to examine
There are 84 pages left to examine
There are 83 pages left to examine
There are 82 pages left to examine
There are 81 pages left to examine
There are 80 pages left to examine
There are 79 pages left to examine
There are 78 pages left to examine
There are 77 pages left to examine
There are 76 pages left to examine
There are 75 pages left to examine
There are 74 pages left to examine
There are 73 pages left to examine
There are 72 pages left to examine
There are 71 pages left to examine
There are 70 pages left to examine
There are 69 pages left to examine
There are 68 pages left to examine
There are 67 pages l

Good! There are 2323 listings left to examine.
Now extracting data from listing 53 out of 2323.
Good! There are 2322 listings left to examine.
Now extracting data from listing 54 out of 2322.
Good! There are 2321 listings left to examine.
Now extracting data from listing 55 out of 2321.
Good! There are 2320 listings left to examine.
Now extracting data from listing 56 out of 2320.
Good! There are 2319 listings left to examine.
Now extracting data from listing 57 out of 2319.
Good! There are 2318 listings left to examine.
Now extracting data from listing 58 out of 2318.
Good! There are 2317 listings left to examine.
Now extracting data from listing 59 out of 2317.
Good! There are 2316 listings left to examine.
Now extracting data from listing 60 out of 2316.
Good! There are 2315 listings left to examine.
Now extracting data from listing 61 out of 2315.
Good! There are 2314 listings left to examine.
Now extracting data from listing 62 out of 2314.
Good! There are 2313 listings left to ex

Now extracting data from listing 140 out of 2245.
You Shall Not Pass!
Now extracting data from listing 141 out of 2245.
You Shall Not Pass!
Now extracting data from listing 142 out of 2245.
You Shall Not Pass!
Now extracting data from listing 143 out of 2245.
You Shall Not Pass!
Now extracting data from listing 144 out of 2245.
You Shall Not Pass!
Now extracting data from listing 145 out of 2245.
You Shall Not Pass!
Now extracting data from listing 146 out of 2245.
You Shall Not Pass!
Now extracting data from listing 147 out of 2245.
Good! There are 2244 listings left to examine.
Now extracting data from listing 148 out of 2244.
Good! There are 2243 listings left to examine.
Now extracting data from listing 149 out of 2243.
Good! There are 2242 listings left to examine.
Now extracting data from listing 150 out of 2242.
Good! There are 2241 listings left to examine.
Now extracting data from listing 151 out of 2241.
Good! There are 2240 listings left to examine.
Now extracting data from 

Good! There are 2165 listings left to examine.
Now extracting data from listing 227 out of 2165.
Good! There are 2164 listings left to examine.
Now extracting data from listing 228 out of 2164.
Good! There are 2163 listings left to examine.
Now extracting data from listing 229 out of 2163.
Good! There are 2162 listings left to examine.
Now extracting data from listing 230 out of 2162.
Good! There are 2161 listings left to examine.
Now extracting data from listing 231 out of 2161.
Good! There are 2160 listings left to examine.
Now extracting data from listing 232 out of 2160.
Good! There are 2159 listings left to examine.
Now extracting data from listing 233 out of 2159.
Good! There are 2158 listings left to examine.
Now extracting data from listing 234 out of 2158.
Good! There are 2157 listings left to examine.
Now extracting data from listing 235 out of 2157.
Good! There are 2156 listings left to examine.
Now extracting data from listing 236 out of 2156.
Good! There are 2155 listings 

Good! There are 2082 listings left to examine.
Now extracting data from listing 313 out of 2082.
Good! There are 2081 listings left to examine.
Now extracting data from listing 314 out of 2081.
Good! There are 2080 listings left to examine.
Now extracting data from listing 315 out of 2080.
Good! There are 2079 listings left to examine.
Now extracting data from listing 316 out of 2079.
Good! There are 2078 listings left to examine.
Now extracting data from listing 317 out of 2078.
Good! There are 2077 listings left to examine.
Now extracting data from listing 318 out of 2077.
Good! There are 2076 listings left to examine.
Now extracting data from listing 319 out of 2076.
Good! There are 2075 listings left to examine.
Now extracting data from listing 320 out of 2075.
Good! There are 2074 listings left to examine.
Now extracting data from listing 321 out of 2074.
Good! There are 2073 listings left to examine.
Now extracting data from listing 322 out of 2073.
Good! There are 2072 listings 

Good! There are 1997 listings left to examine.
Now extracting data from listing 398 out of 1997.
Good! There are 1996 listings left to examine.
Now extracting data from listing 399 out of 1996.
Good! There are 1995 listings left to examine.
Now extracting data from listing 400 out of 1995.
Good! There are 1994 listings left to examine.
Now extracting data from listing 401 out of 1994.
Good! There are 1993 listings left to examine.
Now extracting data from listing 402 out of 1993.
Good! There are 1992 listings left to examine.
Now extracting data from listing 403 out of 1992.
Good! There are 1991 listings left to examine.
Now extracting data from listing 404 out of 1991.
Good! There are 1990 listings left to examine.
Now extracting data from listing 405 out of 1990.
Good! There are 1989 listings left to examine.
Now extracting data from listing 406 out of 1989.
Good! There are 1988 listings left to examine.
Now extracting data from listing 407 out of 1988.
Good! There are 1987 listings 

Good! There are 1912 listings left to examine.
Now extracting data from listing 483 out of 1912.
Good! There are 1911 listings left to examine.
Now extracting data from listing 484 out of 1911.
Good! There are 1910 listings left to examine.
Now extracting data from listing 485 out of 1910.
Good! There are 1909 listings left to examine.
Now extracting data from listing 486 out of 1909.
Good! There are 1908 listings left to examine.
Now extracting data from listing 487 out of 1908.
Good! There are 1907 listings left to examine.
Now extracting data from listing 488 out of 1907.
Good! There are 1906 listings left to examine.
Now extracting data from listing 489 out of 1906.
Good! There are 1905 listings left to examine.
Now extracting data from listing 490 out of 1905.
Good! There are 1904 listings left to examine.
Now extracting data from listing 491 out of 1904.
Good! There are 1903 listings left to examine.
Now extracting data from listing 492 out of 1903.
Good! There are 1902 listings 

Good! There are 1828 listings left to examine.
Now extracting data from listing 569 out of 1828.
Good! There are 1827 listings left to examine.
Now extracting data from listing 570 out of 1827.
Good! There are 1826 listings left to examine.
Now extracting data from listing 571 out of 1826.
Good! There are 1825 listings left to examine.
Now extracting data from listing 572 out of 1825.
Good! There are 1824 listings left to examine.
Now extracting data from listing 573 out of 1824.
Good! There are 1823 listings left to examine.
Now extracting data from listing 574 out of 1823.
Good! There are 1822 listings left to examine.
Now extracting data from listing 575 out of 1822.
Good! There are 1821 listings left to examine.
Now extracting data from listing 576 out of 1821.
Good! There are 1820 listings left to examine.
Now extracting data from listing 577 out of 1820.
Good! There are 1819 listings left to examine.
Now extracting data from listing 578 out of 1819.
Good! There are 1818 listings 

Good! There are 1743 listings left to examine.
Now extracting data from listing 654 out of 1743.
Good! There are 1742 listings left to examine.
Now extracting data from listing 655 out of 1742.
Good! There are 1741 listings left to examine.
Now extracting data from listing 656 out of 1741.
Good! There are 1740 listings left to examine.
Now extracting data from listing 657 out of 1740.
Good! There are 1739 listings left to examine.
Now extracting data from listing 658 out of 1739.
Good! There are 1738 listings left to examine.
Now extracting data from listing 659 out of 1738.
Good! There are 1737 listings left to examine.
Now extracting data from listing 37 out of 1737.
Good! There are 1736 listings left to examine.
Now extracting data from listing 661 out of 1736.
Good! There are 1735 listings left to examine.
Now extracting data from listing 662 out of 1735.
Good! There are 1734 listings left to examine.
Now extracting data from listing 663 out of 1734.
Good! There are 1733 listings l

Good! There are 1658 listings left to examine.
Now extracting data from listing 739 out of 1658.
Good! There are 1657 listings left to examine.
Now extracting data from listing 740 out of 1657.
Good! There are 1656 listings left to examine.
Now extracting data from listing 741 out of 1656.
Good! There are 1655 listings left to examine.
Now extracting data from listing 742 out of 1655.
Good! There are 1654 listings left to examine.
Now extracting data from listing 743 out of 1654.
Good! There are 1653 listings left to examine.
Now extracting data from listing 744 out of 1653.
Good! There are 1652 listings left to examine.
Now extracting data from listing 745 out of 1652.
Good! There are 1651 listings left to examine.
Now extracting data from listing 746 out of 1651.
Good! There are 1650 listings left to examine.
Now extracting data from listing 296 out of 1650.
Good! There are 1649 listings left to examine.
Now extracting data from listing 748 out of 1649.
Good! There are 1648 listings 

Good! There are 1573 listings left to examine.
Now extracting data from listing 824 out of 1573.
Good! There are 1572 listings left to examine.
Now extracting data from listing 825 out of 1572.
Good! There are 1571 listings left to examine.
Now extracting data from listing 826 out of 1571.
Good! There are 1570 listings left to examine.
Now extracting data from listing 827 out of 1570.
Good! There are 1569 listings left to examine.
Now extracting data from listing 422 out of 1569.
Good! There are 1568 listings left to examine.
Now extracting data from listing 423 out of 1568.
Good! There are 1567 listings left to examine.
Now extracting data from listing 830 out of 1567.
Good! There are 1566 listings left to examine.
Now extracting data from listing 831 out of 1566.
Good! There are 1565 listings left to examine.
Now extracting data from listing 832 out of 1565.
Good! There are 1564 listings left to examine.
Now extracting data from listing 833 out of 1564.
Good! There are 1563 listings 

Good! There are 1518 listings left to examine.
Now extracting data from listing 921 out of 1518.
Good! There are 1517 listings left to examine.
Now extracting data from listing 922 out of 1517.
Good! There are 1516 listings left to examine.
Now extracting data from listing 923 out of 1516.
Good! There are 1515 listings left to examine.
Now extracting data from listing 924 out of 1515.
Good! There are 1514 listings left to examine.
Now extracting data from listing 925 out of 1514.
Good! There are 1513 listings left to examine.
Now extracting data from listing 926 out of 1513.
Good! There are 1512 listings left to examine.
Now extracting data from listing 927 out of 1512.
Good! There are 1511 listings left to examine.
Now extracting data from listing 928 out of 1511.
Good! There are 1510 listings left to examine.
Now extracting data from listing 929 out of 1510.
Good! There are 1509 listings left to examine.
Now extracting data from listing 28 out of 1509.
Good! There are 1508 listings l

Good! There are 1433 listings left to examine.
Now extracting data from listing 1006 out of 1433.
Good! There are 1432 listings left to examine.
Now extracting data from listing 1007 out of 1432.
Good! There are 1431 listings left to examine.
Now extracting data from listing 335 out of 1431.
Good! There are 1430 listings left to examine.
Now extracting data from listing 328 out of 1430.
Good! There are 1429 listings left to examine.
Now extracting data from listing 341 out of 1429.
Good! There are 1428 listings left to examine.
Now extracting data from listing 1011 out of 1428.
Good! There are 1427 listings left to examine.
Now extracting data from listing 1012 out of 1427.
Good! There are 1426 listings left to examine.
Now extracting data from listing 335 out of 1426.
Good! There are 1425 listings left to examine.
Now extracting data from listing 335 out of 1425.
Good! There are 1424 listings left to examine.
Now extracting data from listing 1015 out of 1424.
Good! There are 1423 list

Good! There are 1349 listings left to examine.
Now extracting data from listing 946 out of 1349.
Good! There are 1348 listings left to examine.
Now extracting data from listing 1091 out of 1348.
Good! There are 1347 listings left to examine.
Now extracting data from listing 1092 out of 1347.
Good! There are 1346 listings left to examine.
Now extracting data from listing 1093 out of 1346.
Good! There are 1345 listings left to examine.
Now extracting data from listing 1094 out of 1345.
This appartment is no longer available.
Now extracting data from listing 274 out of 1344.
Good! There are 1343 listings left to examine.
Now extracting data from listing 1096 out of 1343.
Good! There are 1342 listings left to examine.
Now extracting data from listing 276 out of 1342.
Good! There are 1341 listings left to examine.
Now extracting data from listing 254 out of 1341.
Good! There are 1340 listings left to examine.
Now extracting data from listing 1099 out of 1340.
Good! There are 1339 listings l

Good! There are 1264 listings left to examine.
Now extracting data from listing 1175 out of 1264.
Good! There are 1263 listings left to examine.
Now extracting data from listing 798 out of 1263.
Good! There are 1262 listings left to examine.
Now extracting data from listing 1177 out of 1262.
Good! There are 1261 listings left to examine.
Now extracting data from listing 1178 out of 1261.
Good! There are 1260 listings left to examine.
Now extracting data from listing 798 out of 1260.
Good! There are 1259 listings left to examine.
Now extracting data from listing 1180 out of 1259.
Good! There are 1258 listings left to examine.
Now extracting data from listing 809 out of 1258.
Good! There are 1257 listings left to examine.
Now extracting data from listing 810 out of 1257.
Good! There are 1256 listings left to examine.
Now extracting data from listing 120 out of 1256.
This appartment is no longer available.
Now extracting data from listing 1184 out of 1255.
Good! There are 1254 listings le

You Shall Not Pass!
Now extracting data from listing 54 out of 1214.
You Shall Not Pass!
Now extracting data from listing 55 out of 1214.
You Shall Not Pass!
Now extracting data from listing 56 out of 1214.
You Shall Not Pass!
Now extracting data from listing 57 out of 1214.
You Shall Not Pass!
Now extracting data from listing 58 out of 1214.
You Shall Not Pass!
Now extracting data from listing 39 out of 1214.
You Shall Not Pass!
Now extracting data from listing 40 out of 1214.
You Shall Not Pass!
Now extracting data from listing 61 out of 1214.
You Shall Not Pass!
Now extracting data from listing 62 out of 1214.
You Shall Not Pass!
Now extracting data from listing 63 out of 1214.
You Shall Not Pass!
Now extracting data from listing 64 out of 1214.
You Shall Not Pass!
Now extracting data from listing 65 out of 1214.
You Shall Not Pass!
Now extracting data from listing 66 out of 1214.
You Shall Not Pass!
Now extracting data from listing 67 out of 1214.
You Shall Not Pass!
Now extracting

You Shall Not Pass!
Now extracting data from listing 172 out of 1214.
You Shall Not Pass!
Now extracting data from listing 173 out of 1214.
You Shall Not Pass!
Now extracting data from listing 174 out of 1214.
You Shall Not Pass!
Now extracting data from listing 175 out of 1214.
You Shall Not Pass!
Now extracting data from listing 176 out of 1214.
You Shall Not Pass!
Now extracting data from listing 177 out of 1214.
You Shall Not Pass!
Now extracting data from listing 178 out of 1214.
You Shall Not Pass!
Now extracting data from listing 179 out of 1214.
You Shall Not Pass!
Now extracting data from listing 180 out of 1214.
You Shall Not Pass!
Now extracting data from listing 181 out of 1214.
You Shall Not Pass!
Now extracting data from listing 182 out of 1214.
You Shall Not Pass!
Now extracting data from listing 183 out of 1214.
You Shall Not Pass!
Now extracting data from listing 184 out of 1214.
You Shall Not Pass!
Now extracting data from listing 185 out of 1214.
You Shall Not Pass!


Good! There are 1166 listings left to examine.
Now extracting data from listing 272 out of 1166.
Good! There are 1165 listings left to examine.
Now extracting data from listing 273 out of 1165.
Good! There are 1164 listings left to examine.
Now extracting data from listing 274 out of 1164.
Good! There are 1163 listings left to examine.
Now extracting data from listing 275 out of 1163.
Good! There are 1162 listings left to examine.
Now extracting data from listing 276 out of 1162.
Good! There are 1161 listings left to examine.
Now extracting data from listing 277 out of 1161.
Good! There are 1160 listings left to examine.
Now extracting data from listing 278 out of 1160.
Good! There are 1159 listings left to examine.
Now extracting data from listing 279 out of 1159.
Good! There are 1158 listings left to examine.
Now extracting data from listing 280 out of 1158.
Good! There are 1157 listings left to examine.
Now extracting data from listing 281 out of 1157.
Good! There are 1156 listings 

Good! There are 1081 listings left to examine.
Now extracting data from listing 357 out of 1081.
Good! There are 1080 listings left to examine.
Now extracting data from listing 358 out of 1080.
Good! There are 1079 listings left to examine.
Now extracting data from listing 359 out of 1079.
Good! There are 1078 listings left to examine.
Now extracting data from listing 360 out of 1078.
Good! There are 1077 listings left to examine.
Now extracting data from listing 361 out of 1077.
Good! There are 1076 listings left to examine.
Now extracting data from listing 362 out of 1076.
Good! There are 1075 listings left to examine.
Now extracting data from listing 363 out of 1075.
Good! There are 1074 listings left to examine.
Now extracting data from listing 364 out of 1074.
Good! There are 1073 listings left to examine.
Now extracting data from listing 22 out of 1073.
Good! There are 1072 listings left to examine.
Now extracting data from listing 366 out of 1072.
Good! There are 1071 listings l

Good! There are 996 listings left to examine.
Now extracting data from listing 442 out of 996.
Good! There are 995 listings left to examine.
Now extracting data from listing 443 out of 995.
Good! There are 994 listings left to examine.
Now extracting data from listing 444 out of 994.
Good! There are 993 listings left to examine.
Now extracting data from listing 445 out of 993.
Good! There are 992 listings left to examine.
Now extracting data from listing 446 out of 992.
Good! There are 991 listings left to examine.
Now extracting data from listing 447 out of 991.
Good! There are 990 listings left to examine.
Now extracting data from listing 443 out of 990.
Good! There are 989 listings left to examine.
Now extracting data from listing 197 out of 989.
Good! There are 988 listings left to examine.
Now extracting data from listing 450 out of 988.
Good! There are 987 listings left to examine.
Now extracting data from listing 451 out of 987.
Good! There are 986 listings left to examine.
Now 

Good! There are 911 listings left to examine.
Now extracting data from listing 529 out of 911.
Good! There are 910 listings left to examine.
Now extracting data from listing 530 out of 910.
Good! There are 909 listings left to examine.
Now extracting data from listing 531 out of 909.
Good! There are 908 listings left to examine.
Now extracting data from listing 532 out of 908.
Good! There are 907 listings left to examine.
Now extracting data from listing 533 out of 907.
Good! There are 906 listings left to examine.
Now extracting data from listing 534 out of 906.
Good! There are 905 listings left to examine.
Now extracting data from listing 535 out of 905.
Good! There are 904 listings left to examine.
Now extracting data from listing 536 out of 904.
Good! There are 903 listings left to examine.
Now extracting data from listing 537 out of 903.
Good! There are 902 listings left to examine.
Now extracting data from listing 538 out of 902.
Good! There are 901 listings left to examine.
Now 

Good! There are 824 listings left to examine.
Now extracting data from listing 616 out of 824.
Good! There are 823 listings left to examine.
Now extracting data from listing 617 out of 823.
Good! There are 822 listings left to examine.
Now extracting data from listing 618 out of 822.
Good! There are 821 listings left to examine.
Now extracting data from listing 619 out of 821.
Good! There are 820 listings left to examine.
Now extracting data from listing 620 out of 820.
Good! There are 819 listings left to examine.
Now extracting data from listing 621 out of 819.
Good! There are 818 listings left to examine.
Now extracting data from listing 622 out of 818.
Good! There are 817 listings left to examine.
Now extracting data from listing 623 out of 817.
Good! There are 816 listings left to examine.
Now extracting data from listing 624 out of 816.
Good! There are 815 listings left to examine.
Now extracting data from listing 625 out of 815.
Good! There are 814 listings left to examine.
Now 

Good! There are 737 listings left to examine.
Now extracting data from listing 703 out of 737.
Good! There are 736 listings left to examine.
Now extracting data from listing 704 out of 736.
Good! There are 735 listings left to examine.
Now extracting data from listing 705 out of 735.
Good! There are 734 listings left to examine.
Now extracting data from listing 706 out of 734.
Good! There are 733 listings left to examine.
Now extracting data from listing 707 out of 733.
Good! There are 732 listings left to examine.
Now extracting data from listing 708 out of 732.
Good! There are 731 listings left to examine.
Now extracting data from listing 709 out of 731.
Good! There are 730 listings left to examine.
Now extracting data from listing 710 out of 730.
Good! There are 729 listings left to examine.
Now extracting data from listing 711 out of 729.
Good! There are 728 listings left to examine.
Now extracting data from listing 712 out of 728.
Good! There are 727 listings left to examine.
Now e

Good! There are 651 listings left to examine.
Now extracting data from listing 71 out of 651.
Good! There are 650 listings left to examine.
Now extracting data from listing 72 out of 650.
Good! There are 649 listings left to examine.
Now extracting data from listing 73 out of 649.
Good! There are 648 listings left to examine.
Now extracting data from listing 74 out of 648.
Good! There are 647 listings left to examine.
Now extracting data from listing 75 out of 647.
Good! There are 646 listings left to examine.
Now extracting data from listing 76 out of 646.
This appartment is no longer available.
Now extracting data from listing 77 out of 645.
Good! There are 644 listings left to examine.
Now extracting data from listing 78 out of 644.
Good! There are 643 listings left to examine.
Now extracting data from listing 79 out of 643.
Good! There are 642 listings left to examine.
Now extracting data from listing 80 out of 642.
Good! There are 641 listings left to examine.
Now extracting data 

Good! There are 564 listings left to examine.
Now extracting data from listing 158 out of 564.
Good! There are 563 listings left to examine.
Now extracting data from listing 159 out of 563.
Good! There are 562 listings left to examine.
Now extracting data from listing 160 out of 562.
Good! There are 561 listings left to examine.
Now extracting data from listing 161 out of 561.
Good! There are 560 listings left to examine.
Now extracting data from listing 162 out of 560.
Good! There are 559 listings left to examine.
Now extracting data from listing 163 out of 559.
Good! There are 558 listings left to examine.
Now extracting data from listing 164 out of 558.
Good! There are 557 listings left to examine.
Now extracting data from listing 165 out of 557.
Good! There are 556 listings left to examine.
Now extracting data from listing 166 out of 556.
Good! There are 555 listings left to examine.
Now extracting data from listing 167 out of 555.
Good! There are 554 listings left to examine.
Now 

Good! There are 477 listings left to examine.
Now extracting data from listing 245 out of 477.
You Shall Not Pass!
Now extracting data from listing 246 out of 477.
You Shall Not Pass!
Now extracting data from listing 247 out of 477.
You Shall Not Pass!
Now extracting data from listing 248 out of 477.
You Shall Not Pass!
Now extracting data from listing 249 out of 477.
You Shall Not Pass!
Now extracting data from listing 250 out of 477.
You Shall Not Pass!
Now extracting data from listing 251 out of 477.
You Shall Not Pass!
Now extracting data from listing 252 out of 477.
You Shall Not Pass!
Now extracting data from listing 253 out of 477.
You Shall Not Pass!
Now extracting data from listing 254 out of 477.
You Shall Not Pass!
Now extracting data from listing 255 out of 477.
You Shall Not Pass!
Now extracting data from listing 256 out of 477.
You Shall Not Pass!
Now extracting data from listing 257 out of 477.
You Shall Not Pass!
Now extracting data from listing 258 out of 477.
You Shal

Good! There are 422 listings left to examine.
Now extracting data from listing 344 out of 422.
Good! There are 421 listings left to examine.
Now extracting data from listing 345 out of 421.
Good! There are 420 listings left to examine.
Now extracting data from listing 346 out of 420.
Good! There are 419 listings left to examine.
Now extracting data from listing 347 out of 419.
Good! There are 418 listings left to examine.
Now extracting data from listing 348 out of 418.
Good! There are 417 listings left to examine.
Now extracting data from listing 349 out of 417.
Good! There are 416 listings left to examine.
Now extracting data from listing 350 out of 416.
Good! There are 415 listings left to examine.
Now extracting data from listing 351 out of 415.
Good! There are 414 listings left to examine.
Now extracting data from listing 352 out of 414.
Good! There are 413 listings left to examine.
Now extracting data from listing 353 out of 413.
Good! There are 412 listings left to examine.
Now 

Good! There are 334 listings left to examine.
Now extracting data from listing 49 out of 334.
Good! There are 333 listings left to examine.
Now extracting data from listing 50 out of 333.
Good! There are 332 listings left to examine.
Now extracting data from listing 51 out of 332.
Good! There are 331 listings left to examine.
Now extracting data from listing 52 out of 331.
Good! There are 330 listings left to examine.
Now extracting data from listing 53 out of 330.
Good! There are 329 listings left to examine.
Now extracting data from listing 54 out of 329.
Good! There are 328 listings left to examine.
Now extracting data from listing 55 out of 328.
Good! There are 327 listings left to examine.
Now extracting data from listing 56 out of 327.
Good! There are 326 listings left to examine.
Now extracting data from listing 57 out of 326.
Good! There are 325 listings left to examine.
Now extracting data from listing 58 out of 325.
Good! There are 324 listings left to examine.
Now extracting

You Shall Not Pass!
Now extracting data from listing 136 out of 248.
You Shall Not Pass!
Now extracting data from listing 137 out of 248.
You Shall Not Pass!
Now extracting data from listing 138 out of 248.
You Shall Not Pass!
Now extracting data from listing 139 out of 248.
You Shall Not Pass!
Now extracting data from listing 140 out of 248.
You Shall Not Pass!
Now extracting data from listing 141 out of 248.
You Shall Not Pass!
Now extracting data from listing 142 out of 248.
Good! There are 247 listings left to examine.
Now extracting data from listing 143 out of 247.
Good! There are 246 listings left to examine.
Now extracting data from listing 144 out of 246.
Good! There are 245 listings left to examine.
Now extracting data from listing 145 out of 245.
Good! There are 244 listings left to examine.
Now extracting data from listing 146 out of 244.
Good! There are 243 listings left to examine.
Now extracting data from listing 147 out of 243.
This appartment is no longer available.
Now

Good! There are 165 listings left to examine.
Now extracting data from listing 30 out of 165.
Good! There are 164 listings left to examine.
Now extracting data from listing 31 out of 164.
Good! There are 163 listings left to examine.
Now extracting data from listing 32 out of 163.
Good! There are 162 listings left to examine.
Now extracting data from listing 33 out of 162.
Good! There are 161 listings left to examine.
Now extracting data from listing 34 out of 161.
Good! There are 160 listings left to examine.
Now extracting data from listing 35 out of 160.
Good! There are 159 listings left to examine.
Now extracting data from listing 36 out of 159.
Good! There are 158 listings left to examine.
Now extracting data from listing 37 out of 158.
Good! There are 157 listings left to examine.
Now extracting data from listing 38 out of 157.
Good! There are 156 listings left to examine.
Now extracting data from listing 39 out of 156.
Good! There are 155 listings left to examine.
Now extracting

Good! There are 77 listings left to examine.
Now extracting data from listing 21 out of 77.
Good! There are 76 listings left to examine.
Now extracting data from listing 22 out of 76.
Good! There are 75 listings left to examine.
Now extracting data from listing 23 out of 75.
Good! There are 74 listings left to examine.
Now extracting data from listing 24 out of 74.
Good! There are 73 listings left to examine.
Now extracting data from listing 25 out of 73.
Good! There are 72 listings left to examine.
Now extracting data from listing 26 out of 72.
Good! There are 71 listings left to examine.
Now extracting data from listing 27 out of 71.
Good! There are 70 listings left to examine.
Now extracting data from listing 28 out of 70.
Good! There are 69 listings left to examine.
Now extracting data from listing 29 out of 69.
Good! There are 68 listings left to examine.
Now extracting data from listing 30 out of 68.
Good! There are 67 listings left to examine.
Now extracting data from listing 31

In [10]:
df_appartment.shape

(2338, 13)

In [15]:
df_appartment.head()

| | link | title | agency | housing_type | city | details | rent | charges | rent_info | criteria | energy_rating | gas_rating | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.seloger.com/annonces/locations/app... | Location appartement 2 pièces Toulouse - appar... | zaf immobilier | Appartement | Toulouse | location appartement 42,03m² t... | 590 | CC | | Surface de 42,03 m² ; Au 3ème étage ; 2 Pièces... | 227.0 | 10.0 | Votre agence ZAF IMMOBILIER vous propose: Char... |
| 1 | https://www.seloger.com/annonces/locations/app... | Location appartement 3 pièces Toulouse - appar... | eurim | Appartement | Toulouse | location appartement 85m² toul... | 1650 | CC | \n Loyer : 1,650 € / mo... | Ascenseur ; Orientation Ouest ; Vue ; Refait À... | | | Wilson - Vue exceptionnelle WILSON - Superbe v... |
| 2 | https://www.seloger.com/annonces/locations/app... | Location appartement 4 pièces Toulouse - appar... | eurim | Appartement | Toulouse | location appartement 90,7m² to... | 1351 | CC | \n Loyer : 1,351 € / mo... | Ascenseur ; Orientation Est, Sud ; Vue ; Surfa... | 148.0 | 7.0 | COMPANS - PLACE DE L'EUROPE Compans-Caffarelli... |
| 3 | https://www.seloger.com/annonces/locations/app... | Location appartement 2 pièces Toulouse - appar... | sporting immobilier | Appartement | Toulouse | location appartement 42m² toul... | 545 | CC | | Ascenseur ; 1 Balcon ; Surface de 42 m² ; Bâti... | 61.0 | 14.0 | APT 109\nSPORTING IMMOBILIER VOUS OFFRE L INTÉ... |
| 4 | https://www.seloger.com/annonces/locations/app... | Location appartement 3 pièces Toulouse - appar... | cdc habitat | Appartement | Toulouse | location appartement 66m² toul... | 685 | CC | | Ascenseur ; 1 Balcon ; Orientation Ouest ; Sur... | 66.0 | 15.0 | SANS HONORAIRES. Dans immeuble sécurisée de 20... |


In [16]:
df_appartment.to_csv('data_seloger_scraping_part1.csv',index=False)

Next steps for improvement:
- Rotate the user agent
- Rotate proxies (proxy pool)
- Only extract the new listings to consolidate our data: sort the results by date so that the most recent listings come first. The goal is to avoid scraping the same listing twice. If 200 new listings have been published since yesterday, today's scrape should stop once it has scraped those 200 listings and then reaches a listing that was already scraped the day before.
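
Two of the ideas above, rotating the user agent and stopping at the first already-scraped listing, can be sketched in a few lines. This is only a minimal illustration, not the code used in this notebook: `USER_AGENTS`, `rotating_headers` and `collect_new_listings` are hypothetical names, the user-agent strings and links are placeholders, and the stop rule assumes the results really are sorted newest first.

```python
from itertools import cycle

# Hypothetical pool of user-agent strings (placeholders, not real browser values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]


def rotating_headers(user_agents):
    """Yield request headers forever, cycling through the given user agents."""
    for agent in cycle(user_agents):
        yield {"User-Agent": agent}


def collect_new_listings(links_newest_first, seen_links):
    """Return the links published since the previous run.

    Assumes links_newest_first is sorted by publication date, newest first,
    so the first already-seen link marks the end of the new listings.
    """
    new_links = []
    for link in links_newest_first:
        if link in seen_links:
            break  # everything after this point was scraped on a previous run
        new_links.append(link)
    return new_links


# Usage with made-up links: two listings are new, the rest were seen yesterday.
seen_yesterday = {"https://example.com/listing/2", "https://example.com/listing/1"}
todays_results = [
    "https://example.com/listing/4",
    "https://example.com/listing/3",
    "https://example.com/listing/2",
    "https://example.com/listing/1",
]
headers = rotating_headers(USER_AGENTS)
for link in collect_new_listings(todays_results, seen_yesterday):
    # A real scraper would pass next(headers) to its HTTP request here.
    print(link, next(headers)["User-Agent"])
```

In this notebook, the set of already-seen links could be built from the `link` column of the CSV saved earlier. Note that the stop rule would end the scrape too early if the site pins older or sponsored listings at the top of the date-sorted results, so that ordering is worth checking first.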

#### References & code:
- https://medium.com/france-school-of-ai/web-scraping-avec-python-apprenez-%C3%A0-utiliser-beautifulsoup-proxies-et-un-faux-user-agent-d7bfb66b6556
- https://towardsdatascience.com/looking-for-a-house-build-a-web-scraper-to-help-you-5ab25badc83e
- https://medium.com/@ben.sturm/scraping-house-listing-data-using-selenium-and-beautiful-soup-1cbb94ba9492