# Web Scraping

## Introduction 

The web holds almost any information one might want, but getting it into a usable format takes some work. To collect information about houses for sale, I wrote the `links` module, which scrapes a website that lists properties for sale. It has two main functions: `find_all_links`, which takes a page number and returns all the listing hyperlinks on that page, and `get_houses_info`, which takes a listing's URL and returns the information about that house. The module uses the `requests` and `BeautifulSoup` libraries behind the scenes. Fetching the links and their associated information one at a time would take many hours. To speed this up, I use multi-threading and multi-processing, through the `concurrent.futures` module, to perform some of the tasks in parallel.
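
The `links` module itself is not shown in this notebook. As a rough illustration of the link-collection step, the sketch below pulls listing links out of one page of HTML. It is an assumption-laden stand-in, not the module's code: it uses the standard library's `html.parser` in place of `BeautifulSoup` so that it runs without extra dependencies, the names `AnnounceLinkParser` and `extract_listing_links` are hypothetical, the sample HTML is made up, and the `announce-` filter is only inferred from the listing URLs that appear in the output further down.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnnounceLinkParser(HTMLParser):
    """Collects href values of <a> tags that look like listing pages."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: listing pages have "announce-" in their URL.
        if href and "announce-" in href:
            # Turn relative links into absolute ones.
            self.links.append(urljoin(self.base_url, href))


def extract_listing_links(html, base_url="https://www.imali.biz/"):
    """Return the unique listing links found in one page of HTML, in order."""
    parser = AnnounceLinkParser(base_url)
    parser.feed(html)
    return list(dict.fromkeys(parser.links))  # de-duplicate, keep order


# Made-up page fragment, used only to exercise the parser.
sample_page = """
<ul>
  <li><a href="announce-24-134296.html">House in Ndera</a></li>
  <li><a href="/announce-24-134294.html">Plot in Masoro</a></li>
  <li><a href="contact.html">Contact</a></li>
  <li><a href="announce-24-134296.html">House in Ndera (photo)</a></li>
</ul>
"""
print(extract_listing_links(sample_page))
```

In a real scraper the HTML string would come from an HTTP request for each results page; keeping the parsing separate from the fetching makes the parsing easy to check on a saved page.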

In [1]:
#importing important modules & libraries
from links import get_houses_info, find_all_links
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
from itertools import chain
import time
import pandas as pd
pd.options.display.max_columns = None
pd.options.display.max_colwidth = None
import requests
from bs4 import BeautifulSoup
import re
from datetime import datetime
import numpy as np

## Links

Websites that list properties embed a link to a details page for each listed property. The following lines of code collect those links for a given number of pages.
Because this requires visiting each page and collecting the links to all properties listed on it, the work is I/O-bound: most of the time is spent waiting for the server to respond. Threads are therefore well suited to this kind of task.

In [2]:
#Threads
with ThreadPoolExecutor(max_workers=20) as threads:
    links = threads.map(find_all_links, range(100))
    links = list(chain.from_iterable(links))
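
To see why threads help with I/O-bound work, here is a small self-contained sketch that does not touch the network. `fake_find_all_links` is a hypothetical stand-in for `find_all_links`: it sleeps for a moment to imitate waiting on a server and then returns a few placeholder links. The timings depend on the machine, so no exact numbers are claimed.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def fake_find_all_links(page):
    """Hypothetical stand-in for find_all_links: waits, then returns 3 links."""
    time.sleep(0.1)  # simulates waiting on the network
    return [f"page-{page}-link-{i}" for i in range(3)]


pages = range(10)

# One page after another: the waits add up.
start = time.perf_counter()
sequential = [fake_find_all_links(p) for p in pages]
sequential_time = time.perf_counter() - start

# Ten threads: the waits overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as threads:
    # map returns results in the order of `pages`, even though calls overlap
    threaded = list(threads.map(fake_find_all_links, pages))
threaded_time = time.perf_counter() - start

all_links = list(chain.from_iterable(threaded))  # flatten the list of lists
print(len(all_links), f"links; {sequential_time:.2f}s vs {threaded_time:.2f}s")
```

Because each call spends nearly all of its time sleeping (or, in the real case, waiting for a response), the threads overlap their waits and the threaded run should finish much sooner than the sequential one.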

## Data

After retrieving the links to all property detail pages, the next step is to visit each link and extract the information about the property, such as the date on which it was listed, its location, its price and other relevant details. This task involves not only fetching the page but also processing the returned data to extract the necessary information. That processing relies heavily on the CPU, which makes the task CPU-bound, so processes are used.

In [3]:
#Processes
with ProcessPoolExecutor(max_workers=25) as processes:
    data1 = processes.map(get_houses_info, links)
    data1 = list(data1)
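
One thing to keep in mind with `map`: if a worker raises an exception, it is re-raised when that result is reached, so a single bad page can end the whole run. A possible alternative, sketched below, is to `submit` each link and collect results with `as_completed`, recording failures instead of stopping. `fake_get_houses_info` and `scrape_all` are hypothetical names used only for this illustration, and a thread pool is used here just to keep the sketch easy to run anywhere; `ProcessPoolExecutor` offers the same `submit` interface, provided the worker function can be pickled.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_get_houses_info(link):
    """Hypothetical stand-in for get_houses_info; fails on one link."""
    if link.endswith("broken.html"):
        raise ValueError(f"could not parse {link}")
    return (link, "some details")


def scrape_all(worker, links, max_workers=4):
    """Run worker on every link; return (results, failures)."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which link each future belongs to.
        future_to_link = {pool.submit(worker, link): link for link in links}
        for future in as_completed(future_to_link):
            link = future_to_link[future]
            try:
                results.append(future.result())
            except Exception as exc:
                # Keep going; note the link so it can be retried later.
                failures.append((link, repr(exc)))
    return results, failures


sample_links = ["a.html", "broken.html", "b.html"]
results, failures = scrape_all(fake_get_houses_info, sample_links)
print(len(results), "ok;", len(failures), "failed")
```

Note that `as_completed` yields results in completion order, not input order, which is fine here because each result carries its own link.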

In [4]:
#Converting the returned info into a dataframe
imali = pd.DataFrame(data1, columns = ['Date','Link','Price','Location','Details'])
imali.head()

Unnamed: 0,Date,Link,Price,Location,Details
0,16-Mar-2021,https://www.imali.biz/announce-24-134296.html,100 000 000 Rwf,"Kigali -ndera hafi na 15, hafi na free zone","DUFITE INZU ZiGURISHWA INDERA HAFI NA FREE ZONE HAFI NOKURI 15 KUVA KURI KABURIMBO UYIGERAHO NI MINOTA 3, NAHANTU HEZA CYANE UBA UREBA KANOMBE YOSE ✅ Izi nzu ziri MUKIBANZA kinini cyane gikora kumihanda 2 Gifite Ubuso 1558sqm,✅Izi nzu zifite imiryango 25 Yinjiza Amafaranga .✅ Izi nzu zifite imiryango 4 Yubucuruzi.✅ Inzu Nini ifite ibyumba 6,salo , toilet, Parking nibindi.✅ INZU ZOSE ZIKODESHWA IBIHUMBI 600K UKUYEMO INZU NINI . Ikindi nuko izinzu ziri ahantu heza mubaturanyi beza ahantu Hari kubakwa Amazu meza .✅Igiciro zigurishwa Ni Millions 100, ☎️KUBINDI BISOBANURO WADUHAMAGARA KURI+250788236675. DUFITE NIZINDI NZU NYINSHI TUGURISHA TUKAZIKODESHA."
1,16-Mar-2021,https://www.imali.biz/announce-24-134294.html,35 000 000 Rwf,Kigali masoro million 35,Ndagurisha ahantu heza harimo imiryango 23 imasoro nahantu harimo imiryango 23 harimo ni miryango yubucuruzi ku muhanda hagurishwa millions 35 hinjiza ibihumbi magana ane 400 iyi nimari ntigucike uhashaka hamagara tel :0784969631 dufite nandi mazu agurishwa menshi atandukanye kd meza cyane adahenze
2,16-Mar-2021,https://www.imali.biz/announce-24-134293.html,5 500 000 Rwf,Kigali kimironko musave gasabo,"Ndagurisha inzu nziza ku mafaranga make cyane udahenzwe igurishwa million 5,500,000 iri kumuhanda musave mumurenge wa bumbogo akarere ka gasabo umugi wa kigali iyi nimari ntigucike uyishaka hamagara tel: 0784969631 dufite nandi mazu agurishwa menshi atandukanye kd meza cyane adahenze"
3,16-Mar-2021,https://www.imali.biz/announce-24-134289.html,45 000 000 Rwf,Kigali kimironko kinyaga,Ndagurisha inzu nziza cyane ikomeye iri hafi nuruganda rwa Azam mukinyaga hafi Ni midugudu ya sekimondo iri hafi na kaburimbo igurishwa million 45 ifite chambre 4 tuwarete 3 sallon salle Ã manger nigikoni munzu yubakishije amatafari ahiye uyishaka hamagara tel :0784969631 dufite nandi mazu agurishwa menshi atandukanye kd meza cyane adahenze
4,16-Mar-2021,https://www.imali.biz/announce-24-134288.html,17 000 000 Rwf,Kigali kimironko musave gasabo,Ndagurisha ahantu heza harimo inzu eshatu 3 hafi nuruganda rwa Azam hafi na kagari ka musave nahantu harimo inzu eshatu 3 hafite pariseri m45 m30 ni kumuhanda hagurishwa million 17 hemewe kubakwa nimuri R1 iyi nimari ntigucike uzishaka hamagara tel :0784969631 dufite nandi mazu agurishwa menshi atandukanye kd meza cyane adahenze


## Cleaning

This part relies heavily on the regular expressions module (`re`) to extract information such as the number of bedrooms, the number of bathrooms and the plot size, and whether the house has built-in wardrobes, a modern kitchen, parking and boys' quarters.
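
Before the full functions, here is a small sketch of the idea on phrases taken from the listings above: try a "number first" pattern, then fall back to a "keyword first" pattern. It is a slight variation on the cell that follows, not a copy of it: the helper name `first_bedroom_count` is hypothetical, it returns an `int` or `None` rather than a string or `np.nan`, and the fallback pattern accepts more than one digit.

```python
import re

# Pattern 1: number first, e.g. "5bedrooms", "3 bed rooms", "4-rooms"
NUMBER_FIRST = re.compile(r'(\d+)-?\s*-?(?:bed\s?rooms?|rooms?)', re.I)
# Pattern 2: keyword first, e.g. "ibyumba 6", "chambre 4"
KEYWORD_FIRST = re.compile(r'(?:ibyumba|chambres?)\s*(\d+)', re.I)


def first_bedroom_count(text):
    """Return the first bedroom count found as an int, or None."""
    for pattern in (NUMBER_FIRST, KEYWORD_FIRST):
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    return None


samples = [
    "It has sitting and dinningroom, 5bedrooms, 4bathrooms, kitchen",
    "ifite chambre 4 tuwarete 3 sallon",
    "Inzu Nini ifite ibyumba 6,salo , toilet, Parking",
    "hagurishwa millions 35 hinjiza ibihumbi magana ane 400",
]
for s in samples:
    print(first_bedroom_count(s), "<-", s)
```

The last sample contains numbers but no bedroom keyword, so nothing is extracted; this is why many rows in the final table have empty cells.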

In [5]:
##################################Extracting the number of bedrooms#########################
def extract_bedrooms(text):
    """Extracts the number of bedrooms written in different variations"""
    
    bedrooms1 = r'(\d+)-?\s*-?(?:bed\s?rooms?|rooms?)'
    bedrooms2 = r'(?:ibyumba|chambres?)\s*(\d)'
    
    b1 = re.findall(bedrooms1, text, re.I) 
    
    if len(b1) > 0:
        rooms = b1[0]
    else:
        b2 = re.findall(bedrooms2, text, re.I)
        if len(b2) > 0:
            rooms = b2[0]
        else:
            rooms = np.nan
    
    return rooms
###########################Extracting the number of bathrooms###############################################################
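def extract_bathrooms(text):
    """Extracts the number of bathrooms; falls back to '1' when a bathroom is mentioned without a count"""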
    bathrooms1 = r'(\d+)-?\s*-?(?:bathrooms?|restrooms?|bath|bothrooms?)'
    bathrooms2 = r'(?:toilets?|douches?|toilettes?|tuwareti|tuwarete|ubwiherero)\s*(\d)'
    b1 = re.findall(bathrooms1, text, re.I) 
    
    if len(b1) > 0:
        bathrooms = b1[0]
    else:
        b2 = re.findall(bathrooms2, text, re.I)
        if len(b2) > 0:
            bathrooms = b2[0]
        else:
            bathrooms3 = r'(toilets?|douches?|toilettes?|tuwareti|tuwarete|ubwiherero|bathrooms?|bath)'
            b3 = re.findall(bathrooms3, text, re.I)
            if len(b3) > 0:
                bathrooms = '1'
            else:
                bathrooms = np.nan
    
    return bathrooms
##############################################Extracting the plotsize#####################################################
def extract_plotsize(text):
    """Extracts the plot size, trying four patterns in order of priority"""
    plotsize1 = r'(?:spaceplotsize|plotsize|plot size|plotize).*?(\d+.{0,2}\d*.)'
    plotsize2 = r'(\b\d{1,2}/\d{1,2}\b)'  # at least one digit on each side, so a bare "/" cannot match
    plotsize3 = r'(\d+\s*sqm)'
    plotsize4 = r'(sqm\s*\d+)'
    
    p1 = re.findall(plotsize1, text, re.I)
    p2 = re.findall(plotsize2, text, re.I)
    p3 = re.findall(plotsize3, text, re.I)
    p4 = re.findall(plotsize4, text, re.I)
    
    if len(p1) > 0:
        size = p1[0]
    elif len(p2) > 0:
        size = p2[0]    
    elif len(p3) > 0:
        size = p3[0]
    elif len(p4) > 0:
        size = p4[0]
    else:
        size = np.nan
            
    return size
##################################Checking if has boys quarters or annexes#########################
def extract_quarters(text):
    """Returns 'Yes' if the text mentions an annex, boys' quarters or staff quarters"""
    boys = r'(boys?)'
    annexes = r'(annex(?:es)?|anegisi)'  # [es]? was a character class (one optional "e" or "s"), not an optional "es"
    maids = r'(maids?|servants?|keeper)'
    a = re.findall(annexes, text, re.I)
    b = re.findall(boys, text, re.I)
    m = re.findall(maids, text, re.I)
    
    if len(a) > 0:
        quarters = 'Yes'
    elif len(b) > 0:
        quarters = 'Yes'
    elif len(m) > 0:
        quarters = 'Yes'
    else:
        quarters = np.nan
    return quarters

In [7]:
imali['Size'] = imali['Details'].apply(extract_plotsize)
imali['Bedrooms'] = imali['Details'].apply(extract_bedrooms)
imali['Bathrooms'] = imali['Details'].apply(extract_bathrooms)
imali['Quarters'] = imali['Details'].apply(extract_quarters)
imali['Wardrobes'] = imali['Details'].str.extract(r'(Wardrobes?)',re.I)
imali['Cabinets'] = imali['Details'].str.extract(r'(cabinets?)', re.I)
imali['Balcony'] = imali['Details'].str.extract(r'(Balcony)',re.I)
imali['Parking'] = imali['Details'].str.extract(r'(parking|pariking|parikingi)', re.I)

In [8]:
imali['Parking'] = imali['Parking'].apply(lambda x: 'Yes' if str(x) != 'nan' else x)
imali['Balcony'] = imali['Balcony'].apply(lambda x: 'Yes' if str(x)!='nan' else x)
imali['Cabinets'] = imali['Cabinets'].apply(lambda x: 'Yes' if str(x)!='nan' else x)
imali['Wardrobes'] = imali['Wardrobes'].apply(lambda x: 'Yes' if str(x)!='nan' else x)
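
The extract-then-relabel steps above can also be expressed as a single presence check per feature. The sketch below is one possible way to do that with plain `re`; `FEATURE_PATTERNS` and `flag_features` are hypothetical names, the patterns are the ones used in the cells above, and the example sentence is made up. It returns `None` rather than `NaN` for a missing feature.

```python
import re

# Hypothetical mapping: column name -> pattern (same patterns as the cells above)
FEATURE_PATTERNS = {
    "Wardrobes": r"wardrobes?",
    "Cabinets": r"cabinets?",
    "Balcony": r"balcony",
    "Parking": r"parking|pariking|parikingi",
}


def flag_features(text):
    """Return {'feature': 'Yes' or None} for each pattern, by presence in text."""
    return {
        name: "Yes" if re.search(pattern, text, re.I) else None
        for name, pattern in FEATURE_PATTERNS.items()
    }


# Made-up description, used only to exercise the function.
example = "Big garden and parking within a big plot, built-in wardrobes"
print(flag_features(example))
```

If this route were taken, the dictionaries could be turned into columns with something like `pd.DataFrame(imali['Details'].map(flag_features).tolist(), index=imali.index)`; adding a new feature would then only mean adding one entry to the mapping.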

In [9]:
imali

Unnamed: 0,Date,Link,Price,Location,Details,Size,Bedrooms,Bathrooms,Quarters,Wardrobes,Cabinets,Balcony,Parking
0,16-Mar-2021,https://www.imali.biz/announce-24-134296.html,100 000 000 Rwf,"Kigali -ndera hafi na 15, hafi na free zone","DUFITE INZU ZiGURISHWA INDERA HAFI NA FREE ZONE HAFI NOKURI 15 KUVA KURI KABURIMBO UYIGERAHO NI MINOTA 3, NAHANTU HEZA CYANE UBA UREBA KANOMBE YOSE ✅ Izi nzu ziri MUKIBANZA kinini cyane gikora kumihanda 2 Gifite Ubuso 1558sqm,✅Izi nzu zifite imiryango 25 Yinjiza Amafaranga .✅ Izi nzu zifite imiryango 4 Yubucuruzi.✅ Inzu Nini ifite ibyumba 6,salo , toilet, Parking nibindi.✅ INZU ZOSE ZIKODESHWA IBIHUMBI 600K UKUYEMO INZU NINI . Ikindi nuko izinzu ziri ahantu heza mubaturanyi beza ahantu Hari kubakwa Amazu meza .✅Igiciro zigurishwa Ni Millions 100, ☎️KUBINDI BISOBANURO WADUHAMAGARA KURI+250788236675. DUFITE NIZINDI NZU NYINSHI TUGURISHA TUKAZIKODESHA.",1558sqm,6,1,,,,,Yes
1,16-Mar-2021,https://www.imali.biz/announce-24-134294.html,35 000 000 Rwf,Kigali masoro million 35,Ndagurisha ahantu heza harimo imiryango 23 imasoro nahantu harimo imiryango 23 harimo ni miryango yubucuruzi ku muhanda hagurishwa millions 35 hinjiza ibihumbi magana ane 400 iyi nimari ntigucike uhashaka hamagara tel :0784969631 dufite nandi mazu agurishwa menshi atandukanye kd meza cyane adahenze,,,,,,,,
2,16-Mar-2021,https://www.imali.biz/announce-24-134293.html,5 500 000 Rwf,Kigali kimironko musave gasabo,"Ndagurisha inzu nziza ku mafaranga make cyane udahenzwe igurishwa million 5,500,000 iri kumuhanda musave mumurenge wa bumbogo akarere ka gasabo umugi wa kigali iyi nimari ntigucike uyishaka hamagara tel: 0784969631 dufite nandi mazu agurishwa menshi atandukanye kd meza cyane adahenze",,,,,,,,
3,16-Mar-2021,https://www.imali.biz/announce-24-134289.html,45 000 000 Rwf,Kigali kimironko kinyaga,Ndagurisha inzu nziza cyane ikomeye iri hafi nuruganda rwa Azam mukinyaga hafi Ni midugudu ya sekimondo iri hafi na kaburimbo igurishwa million 45 ifite chambre 4 tuwarete 3 sallon salle Ã manger nigikoni munzu yubakishije amatafari ahiye uyishaka hamagara tel :0784969631 dufite nandi mazu agurishwa menshi atandukanye kd meza cyane adahenze,,4,3,,,,,
4,16-Mar-2021,https://www.imali.biz/announce-24-134288.html,17 000 000 Rwf,Kigali kimironko musave gasabo,Ndagurisha ahantu heza harimo inzu eshatu 3 hafi nuruganda rwa Azam hafi na kagari ka musave nahantu harimo inzu eshatu 3 hafite pariseri m45 m30 ni kumuhanda hagurishwa million 17 hemewe kubakwa nimuri R1 iyi nimari ntigucike uzishaka hamagara tel :0784969631 dufite nandi mazu agurishwa menshi atandukanye kd meza cyane adahenze,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2595,20-Nov-2020,https://www.imali.biz/announce-24-127569.html,200 000 000 Rwf,Kacyiru down American Embassy,"Beautiful house is available for sale at kacyiru just down American Embassy and Good neighborhood. In very good secured and nice view. It has sitting and dinningroom, 5bedrooms, 4bathrooms, kitchen, store, out side annex with boy quarters, Big garden and parking within a big plot of 1500sqm Price:200millionsFor more details contact us on: +250788225193/+250788225193/+250738225193 or send an e-mail to: jerremyd2002@yahoo.fr7 NOVA CONSTRUCTION LTDMob: 0788225193 Whatsapp: 0738225193Location: Our office is located in Downtown at KN87 Street, Nyarugenge, 2nd floor Beatitude house just near Hotel Okapi.",1500sqm,5,4,Yes,,,,Yes
2596,20-Nov-2020,https://www.imali.biz/announce-24-127568.html,32 000 000 Rwf,kigali,Ifite ibyumba 3nasaro nasara manjenadushe natuwarete 2nannexeskubindi biso banuro +250785477479,,3,2,Yes,,,,
2597,20-Nov-2020,https://www.imali.biz/announce-24-127566.html,58 000 000 Rwf,kanombe kigali VIP,"Negotiable house for Sale at Kanombe on the main road:5bedrooms, 3bedrooms, dinning and sitting room , modern kitchen , annex, boy's quarters, garden, on the main road.Call Omar:0784842444Price:58,000,000RwfN.B: price is negotiable downwww.kigalidealer.com",,5,,Yes,,,,
2598,20-Nov-2020,https://www.imali.biz/announce-24-127565.html,200 000 000 Rwf,kigali,Ifite ibyumba 5nasaro nasara manje nadushe natuwarete 4nannexes kubindi bisobanuro +250784455496 /0785477479,,5,4,Yes,,,,


In [10]:
#Saving the dataset to csv
#imali.to_csv('kigali houses.csv', index = False)