<a href="https://colab.research.google.com/github/michalis0/Business_Intelligence_Analytics_private2021/blob/main/week3%20-%20Pandas%20-%20Data%20Cleaning/Exercises/solutions/Week_3_Exploratory_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [301]:
#@title Walkthrough - Lab 3

%%html

<div>
<td> 
<img src="https://www.unil.ch/logo/files/live/sites/logo/files/web/eps/lo_unil06_bleu.eps" style="padding-right:10px;width:240px;float:left"/></td>
<h2 style="white-space: nowrap">Business Intelligence and Analytics</h2></td>
<hr style="clear:both">
<p style="font-size:0.85em; margin:2px; text-align:justify">

</div>


>The goal of this walkthrough is to introduce exploratory data analysis through a fun, interactive technique known as web scraping. During this lab, you will get a broad overview of the technique and of the main tools it relies on. You will also see how data analysis can be conducted on live web data and, hopefully, how this combination can be applied in other contexts. 


>In essence, web scraping consists of harvesting the content of a web page in order to process its information for further use. In our example, web scraping is a fun way to extract the data that we will analyse afterwards. In most cases, this technique goes hand in hand with data cleaning and data analysis. For further information on web scraping, see this [link](https://en.wikipedia.org/wiki/Web_scraping).


## Web Scraping Libraries

To get data from the web with Python, we will use the following two essential libraries throughout this lab:

*  Requests (HTTP): retrieves the web pages (HTML) to parse.

*  Beautiful Soup (HTML parsing): parses the HTML.

Thanks to Google Colab, no specific environment setup is needed for this lab. We can directly import the needed libraries. 
Moreover, last week you were introduced to [Pandas](https://pandas.pydata.org/pandas-docs/version/0.15/tutorials.html), a very important library for EDA. DataFrames and the manipulations they allow come in very handy for the analysis we will conduct on the newly fetched data, so we will keep using this library throughout this lab. If you are still not at ease with the basic Pandas concepts, please refer to the previous documentation or ask one of the TAs. 


In [262]:
# Import libraries
import re  # regular expressions, used later to filter links and clean cells
import requests
from bs4 import BeautifulSoup
import pandas as pd


## Retrieving the Data


>To get started with web scraping, we must first make a [request](https://requests.readthedocs.io/en/master/user/quickstart/). In simple words, we ask the server hosting the web page we are interested in for its content.

> In this lab, we will use a boat listing from [Boat24.ch](https://www.boot24.ch/chde/motorboote/furtif-28-modele-unique/detail/463101/), passing its URL as a parameter to the ``requests.get`` function. 

> We can check the status of our request using the ``status_code`` attribute of the response. You can find more on HTTP status codes at this [link](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). A code of **200** means the HTTP request completed successfully. The response headers, on the other hand, hold metadata about the response itself. You can inspect the headers by running the next cell. 



In [290]:
# Make the request
r = requests.get('https://www.boot24.ch/chde/motorboote/furtif-28-modele-unique/detail/463101/')
# Response content
print('Response status code: {0}\n'.format(r.status_code))
print('Response headers: {0}\n'.format(r.headers))


Response status code: 200

Response headers: {'Date': 'Mon, 22 Feb 2021 18:03:17 GMT', 'Server': 'Apache', 'Expires': '0', 'Cache-Control': 'private, post-check=0, pre-check=0, max-age=0', 'Pragma': 'no-cache', 'X-Frame-Options': 'deny', 'Set-Cookie': 'domaincheck=1; expires=Tue, 22-Feb-2022 18:03:17 GMT; Max-Age=31536000; path=/; domain=.boot24.ch, lan=chde; expires=Tue, 22-Feb-2022 18:03:17 GMT; Max-Age=31536000; path=/; domain=.boot24.ch, data=a%3A1%3A%7Bs%3A3%3A%22cat%22%3Ba%3A1%3A%7Bi%3A2%3Bi%3A1%3B%7D%7D; expires=Tue, 22-Feb-2022 18:03:17 GMT; Max-Age=31536000; path=/; domain=.boot24.ch', 'Upgrade': 'h2', 'Connection': 'Upgrade, Keep-Alive', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Length': '12535', 'Keep-Alive': 'timeout=4, max=256', 'Content-Type': 'text/html; charset=utf-8'}
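
To make the status code discussion concrete, here is a minimal sketch that turns a numeric code into a readable summary. It relies only on the standard library's `http.HTTPStatus`; the helper name `describe_status` is hypothetical and is not part of Requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable summary of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not registered in the standard library enum
        phrase = "Unknown"

    if 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirection"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "informational or non-standard"
    return f"{code} {phrase} ({family})"


print(describe_status(200))
print(describe_status(404))
```

With the response from the cell above, you could call `describe_status(r.status_code)`. Requests also offers `r.raise_for_status()`, which raises an exception for 4xx and 5xx codes, so a failed request does not go unnoticed.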



Now, let's look at the raw content of the response. The body is HTML here, since we are asking for a web page. Web services may instead return other formats, such as JSON or XML.

In [291]:
print('Response body: {0}'.format(r.text))


Response body: <!doctype html>
<html class="no-js" lang="de" data-lan="chde">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>Furtif 28, modèle unique., 2020, 2h, CHF 295'000.- | boot24.ch</title>
<meta name="keywords" content="furtif 28, mod&egrave;le unique.,konsolenboot,motorboot,neuboot ab lager,kaufen" />
<meta name="description" content="Furtif 28, mod&egrave;le unique. kaufen - Baujahr: 2020, L&auml;nge: 9.00 m, Breite: 2.80 m - Informationen, Fotos &amp; Kontaktangaben zum Occasionsboot. (ID: 463101)" />
<meta name="revisit-after" content="1 days" />
<meta name="page-topic" content="Occasionsboote, Boote" />
<meta name="page-type" content="Anzeigen, Kleinanzeigen" />
<meta name="audience" content="all" />
<meta name="publisher" content="boot24.ch" />
<meta name="copyright" content="boot24.ch" />
<meta name="distribution" content="global" />
<meta name="author" content="boot24.ch" />
<meta name="language" content="de" />
<meta name="cou

## Parsing the Data

As you can see, the raw body of the HTTP response is hardly usable as it is. Therefore, we rely on BeautifulSoup to parse the content for further processing, specifying that we want the ``html.parser`` parser. For more information, you can click [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser).

Once the content is parsed, BeautifulSoup lets us run a series of operations and commands that you will discover in the rest of this lab. Note that this library is very powerful and complete when it comes to parsing and manipulation; this overview is not meant to cover every feature BeautifulSoup offers.


In [264]:
page_body = r.text
soup = BeautifulSoup(page_body, 'html.parser')


For instance, you can very easily get the title of the page using ``soup.title``.

In [266]:
soup.title


<title>Furtif 28, modèle unique., 2020, 2h, CHF 295'000.- | boot24.ch</title>

Yet this is still an HTML element; using ``.string`` returns just the text it contains.

In [267]:
soup.title.string


"Furtif 28, modèle unique., 2020, 2h, CHF 295'000.- | boot24.ch"

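To experiment with the difference between a tag and its text without making a request, here is a small sketch on a hand-written HTML string. The HTML and the variable names (`demo_soup`, `title_tag`, `title_text`) are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, hand-written page (not the real listing)
html = "<html><head><title>Demo boat, 2020</title></head><body><p>Hello</p></body></html>"

demo_soup = BeautifulSoup(html, "html.parser")

title_tag = demo_soup.title           # the whole <title> element
title_text = demo_soup.title.string   # only the text inside the element

print(title_tag)
print(title_text)
```

The first print shows the element with its tags, the second only the text, which is usually what we want to store or analyse.
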
To go further with this lab and with data retrieval after parsing, some HTML notions are required. In essence, you should get acquainted with concepts like **HTML tags**. Several BeautifulSoup functions rely on tags (headings, divisions, paragraphs, etc.) and on their attributes (classes, ids, etc.) to retrieve the data they contain. You can find more on HTML tags [here](https://www.w3schools.com/html/html_elements.asp).

**Important**: All the manipulations performed below rely on a study of the HTML body of the response. Since this structure is specific to the website, it is fundamental to understand how to retrieve the information and where to get it from.  

In the next cell, we use the "`a`" tag, which is generally used to embed links (combined with the ``href`` attribute). 

``soup.find`` and ``soup.find_all`` will be used extensively in this lab to navigate the data structure. Please do not hesitate to refer to the corresponding [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for more information.


In [292]:
links = soup.find_all('a')
print('The webpage contains {0} links...'.format(len(links)))

The webpage contains 158 links...
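
The difference between ``find`` and ``find_all`` is easier to see on a small example. The sketch below uses a made-up HTML fragment; only `find`, `find_all`, `get` and `get_text` are BeautifulSoup calls, the rest is illustrative.

```python
from bs4 import BeautifulSoup

# Hand-written fragment with three anchors, one of them without href
html = """
<div>
  <a href="https://example.com/boats/1">First boat</a>
  <a href="/relative/path">Relative link</a>
  <a>No href here</a>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

first_link = demo_soup.find("a")        # first match only, or None
all_links = demo_soup.find_all("a")     # every match, as a list-like object
hrefs = [a.get("href") for a in all_links]  # .get returns None if the attribute is missing
missing = demo_soup.find("table")       # no <table> in the fragment

print(first_link.get_text())
print(hrefs)
print(missing)
```

Because ``find`` returns `None` when nothing matches, it is worth checking the result before accessing attributes on it.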


Analysing the body also lets us easily retrieve the purchase price. 

In [270]:
price = soup.find_all('span',class_="list__value list__value--large")
price[0].text

"CHF 295'000.-"

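The price comes back as text, which is not convenient for analysis. As a minimal sketch, the hypothetical helper `parse_chf_price` below converts such a string to an integer. It assumes the Swiss formatting seen above (apostrophe as thousands separator) and ignores any decimal part.

```python
import re


def parse_chf_price(text):
    """Extract the whole-franc amount from a string like "CHF 295'000.-"."""
    # First run of digits, possibly grouped with apostrophes
    match = re.search(r"\d[\d']*", text)
    if match is None:
        return None  # no number found, e.g. price on request
    return int(match.group(0).replace("'", ""))


print(parse_chf_price("CHF 295'000.-"))
print(parse_chf_price("Preis auf Anfrage"))
```

With the cell above, this would be called as `parse_chf_price(price[0].text)`.
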
Another useful HTML concept on which BeautifulSoup relies is the notion of children. In HTML, tags are often assembled as a series of containers, each containing other tags. This is typical of the Document Object Model (DOM) structure; you can find more by clicking on this link: [DOM](https://www.w3schools.com/whatis/whatis_htmldom.asp).

Inspecting the page with this notion in mind led to the following commands, which retrieve the advertisement metadata. 

In [271]:
t=soup.find("ul",class_="list list--space-8")

for child in t.children:
    print(child.text)

Inserat-ID: 463101
Aufgabedatum: 08.02.2021
Aufrufe in den letzten 7 Tagen: 643
In Favoriten: bei 1 Personen
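
One detail worth knowing: ``.children`` yields every child node, including the whitespace between tags. The sketch below, on a hand-written list that mimics two of the lines above, keeps only the tag children and turns them into a dictionary. The names `demo_soup`, `tag_children` and `meta` are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hand-written fragment imitating the metadata list
html = """<ul class="list">
<li>Inserat-ID: 463101</li>
<li>Aufgabedatum: 08.02.2021</li>
</ul>"""
demo_soup = BeautifulSoup(html, "html.parser")
ul = demo_soup.find("ul")

all_children = list(ul.children)   # tags and the newline strings between them
tag_children = [c for c in ul.children if isinstance(c, Tag)]  # tags only

texts = [li.get_text(strip=True) for li in tag_children]
# Split each "key: value" line once, on the first ": "
meta = dict(t.split(": ", 1) for t in texts)

print(len(all_children), len(tag_children))
print(meta)
```

A dictionary like `meta` is often easier to reuse later than printed lines, for instance to build a DataFrame row.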


Similarly, you can retrieve and display the listings the page suggests, with their links; they are worth consulting since they are similar to the boat we seem to be interested in. 

In [286]:
t = soup.find_all("div", class_="blurb__link-area js-link")
for link in t:
    # Keep only anchors whose href is an absolute https:// URL
    link_ = link.find_all('a', attrs={'href': re.compile("^https://")})
    print("Item : {0}, Link : {1}".format(link_[0].text, link_[0].get('href')))

Item : Chris Craft Catalina 30, Link : https://www.boot24.ch/chde/motorboote/chris-craft/chris-craft-catalina-30/detail/463495/
Item : De Antonio Yachts D34 Cruiser, Link : https://www.boot24.ch/chde/motorboote/de-antonio-yachts/de-antonio-yachts-d34-cruiser/detail/435166/
Item : Nuova Jolly Prince 38 CC, Link : https://www.boot24.ch/chde/motorboote/nuova-jolly/nuova-jolly-prince-38-cc/detail/463424/
Item : De Antonio Yachts D28 Open, Link : https://www.boot24.ch/chde/motorboote/de-antonio-yachts/de-antonio-yachts-d28-open/detail/330366/
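
The cell above filters anchors with a regular expression on the ``href`` attribute. Here is a small sketch of that filter in isolation, on a made-up fragment with example.com URLs, so you can see which links are kept and which are dropped.

```python
import re
from bs4 import BeautifulSoup

# Hand-written fragment: one https link, one relative link, one http link
html = """
<div class="blurb__link-area js-link">
  <a href="https://example.com/boats/1">Demo boat A</a>
  <a href="/chde/relative/">Relative link</a>
  <a href="http://example.com/insecure">Plain http</a>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Only anchors whose href starts with "https://" are returned
secure_links = demo_soup.find_all("a", attrs={"href": re.compile(r"^https://")})
pairs = [(a.get_text(strip=True), a["href"]) for a in secure_links]

print(pairs)
```

Note that indexing the result with `[0]`, as in the cell above, raises an `IndexError` when no anchor matches; looping over the result, as done here, avoids that.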


Now, we are interested in getting the description of our boat. To that end, we first retrieve the corresponding section of the page. 

In [273]:
informations = soup.find_all("ul",class_="list l-mt-16")
informations


[<ul class="list l-mt-16">
 <li><span class="list__value">9.00 m x 2.80 m</span><span class="list__key">Länge x Breite</span></li>
 <li><span class="list__value">1.00 m</span><span class="list__key">Tiefgang</span></li>
 <li><span class="list__value">2'200 kg</span><span class="list__key">Verdrängung</span></li>
 <li><span class="list__value">C - Küstennahe Gewässer</span><span class="list__key">CE-Kennzeichnung</span></li>
 <li><span class="list__value">Holz</span><span class="list__key">Material</span></li>
 </ul>, <ul class="list l-mt-16">
 <li><span class="list__value">8 Personen</span><span class="list__key">zugel. Personenzahl</span></li>
 <li><span class="list__value">1 Kabine</span><span class="list__key">Anz. Kabinen</span></li>
 <li><span class="list__value">1 Koje</span><span class="list__key">Anz. Kojen</span></li>
 <li><span class="list__value">50 l Wasser</span><span class="list__key">Frischwassertank</span></li>
 </ul>, <ul class="list l-mt-16">
 <li><span class="list__v

Each part can be processed separately. Here we look at the first list, which holds 5 characteristics. 

In [274]:
description = informations[0]
description

<ul class="list l-mt-16">
<li><span class="list__value">9.00 m x 2.80 m</span><span class="list__key">Länge x Breite</span></li>
<li><span class="list__value">1.00 m</span><span class="list__key">Tiefgang</span></li>
<li><span class="list__value">2'200 kg</span><span class="list__key">Verdrängung</span></li>
<li><span class="list__value">C - Küstennahe Gewässer</span><span class="list__key">CE-Kennzeichnung</span></li>
<li><span class="list__value">Holz</span><span class="list__key">Material</span></li>
</ul>

In [294]:
list__value = description.find_all("span", class_="list__value")
list__key = description.find_all("span", class_="list__key")


Here come Pandas and its DataFrames. We put the information in a two-column DataFrame that can be used further. 

In [295]:
specs = pd.DataFrame(data={'list__value': list__value,
     'list__key': list__key})


In [296]:
specs

|   | list__value | list__key |
|---|---|---|
| 0 | [9.00 m x 2.80 m] | [Länge x Breite] |
| 1 | [1.00 m] | [Tiefgang] |
| 2 | [2'200 kg] | [Verdrängung] |
| 3 | [C - Küstennahe Gewässer] | [CE-Kennzeichnung] |
| 4 | [Holz] | [Material] |


For processing purposes, we get rid of the brackets. This is trickier than it looks, because the dtype of the columns is not ``string``: each cell holds a BeautifulSoup tag object. You can investigate the inner workings of this command to feel more comfortable with the output. 

In [297]:
# Convert each tag to its HTML string, then capture the text between '>' and '<'
specs.list__value = specs.list__value.apply(lambda x: re.search('>(.*)<', str(x)).group(1))
specs.list__key = specs.list__key.apply(lambda x: re.search('>(.*)<', str(x)).group(1))
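
To look inside this command, the sketch below applies the same regular expression to a single string. It shows what the pattern ``>(.*)<`` captures, and also where it stops being reliable: with a nested tag, the greedy ``.*`` keeps the inner markup. The second input is invented for illustration.

```python
import re

# What str(x) looks like for one cell of the DataFrame
raw = '<span class="list__value">9.00 m x 2.80 m</span>'

# The match starts at the first '>' and, being greedy, runs to the last '<'
inner = re.search(">(.*)<", raw).group(1)
print(inner)

# Hypothetical cell with a nested tag: the inner markup is captured too
nested = '<span class="list__value"><b>Holz</b></span>'
inner_nested = re.search(">(.*)<", nested).group(1)
print(inner_nested)
```

On this page the spans contain plain text only, so the pattern works; on other pages the nested case may need different handling.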

In [298]:
specs

|   | list__value | list__key |
|---|---|---|
| 0 | 9.00 m x 2.80 m | Länge x Breite |
| 1 | 1.00 m | Tiefgang |
| 2 | 2'200 kg | Verdrängung |
| 3 | C - Küstennahe Gewässer | CE-Kennzeichnung |
| 4 | Holz | Material |
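
As an alternative to the regular expression, a possible approach is to extract the text of each tag with ``get_text`` before building the DataFrame, so the cells are plain strings from the start. This is a sketch on a hand-written fragment, not the notebook's original method; `demo_soup` and `demo_specs` are illustrative names.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written fragment imitating two rows of the specification list
html = """
<ul class="list l-mt-16">
<li><span class="list__value">1.00 m</span><span class="list__key">Tiefgang</span></li>
<li><span class="list__value">Holz</span><span class="list__key">Material</span></li>
</ul>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Extract plain text from each tag before creating the DataFrame
values = [s.get_text(strip=True) for s in demo_soup.find_all("span", class_="list__value")]
keys = [s.get_text(strip=True) for s in demo_soup.find_all("span", class_="list__key")]

demo_specs = pd.DataFrame({"list__value": values, "list__key": keys})
print(demo_specs)
```

This avoids the brackets altogether and also copes with nested tags, since ``get_text`` returns only the text content.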


Now, we use a different approach to get all the characteristics at once. Yet there is one odd piece of data in our DataFrame. Can you spot it? 

In [299]:

list__value = soup.find_all("span", class_="list__value")
list__key = soup.find_all("span", class_="list__key")
furn = pd.DataFrame(data={'list__value': list__value,
                          'list__key': list__key})
furn.list__value = furn.list__value.apply(lambda x: re.search('>(.*)<', str(x)).group(1))
furn.list__key = furn.list__key.apply(lambda x: re.search('>(.*)<', str(x)).group(1))

In [300]:
furn.head()

|   | list__value | list__key |
|---|---|---|
| 0 | CHF 295'000.- |  |
| 1 | 2020 | Baujahr |
| 2 | neu | Zustand |
| 3 | 9.00 m x 2.80 m | Länge x Breite |
| 4 | 1.00 m | Tiefgang |


In [285]:
furn[1:]

|   | list__value | list__key |
|---|---|---|
| 1 | 2020 | Baujahr |
| 2 | neu | Zustand |
| 3 | 9.00 m x 2.80 m | Länge x Breite |
| 4 | 1.00 m | Tiefgang |
| 5 | 2'200 kg | Verdrängung |
| 6 | C - Küstennahe Gewässer | CE-Kennzeichnung |
| 7 | Holz | Material |
| 8 | 8 Personen | zugel. Personenzahl |
| 9 | 1 Kabine | Anz. Kabinen |
| 10 | 1 Koje | Anz. Kojen |