# Working with HTML
A lot of data is available on the internet. There are two main techniques for collecting it:


*   Web scraping - Extracting the underlying data from HTML code and storing it in a new file format
*   Web crawling - Using bots to follow URL links from page to page and collect the data from all the pages visited, e.g. the crawlers behind search engines such as Google and Bing



## Web Scraping
In this session, we will look at web scraping. We will examine news websites and see how to extract their articles.

We will use a Python package called Beautiful Soup.

`pip install beautifulsoup4`

To import the package:

`from bs4 import BeautifulSoup`

In [None]:
from bs4 import BeautifulSoup

In [None]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
# Read the html doc
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


In [None]:
soup.head

<head><title>The Dormouse's story</title></head>

In [None]:
soup.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

In [None]:
mainbody = soup.body

In [None]:
# find a particular tag
soup.find('p')

<p class="title"><b>The Dormouse's story</b></p>

In [None]:
# find all p
soup.find_all('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [None]:
# get the text
soup.find('p').get_text()

"The Dormouse's story"

In [None]:
# loop through tag to get the text
sisters = soup.find_all('a', class_='sister')

[a.get_text() for a in sisters]

['Elsie', 'Lacie', 'Tillie']
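
The same kind of loop can pull attributes instead of text, which is useful when the goal is to collect links. The block below is a minimal sketch, assuming the `a` tags carry an `href` attribute as in the sample document; `snippet` and `soup_links` are names made up for this example, and `snippet` is a shortened stand-in for `html_doc` so the block runs on its own.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for html_doc so the block runs on its own
snippet = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

soup_links = BeautifulSoup(snippet, "html.parser")

# Tag.get returns None when the attribute is missing instead of raising KeyError
links = [a.get("href") for a in soup_links.find_all("a", class_="sister")]
print(links)

# A tag can also be looked up by its id attribute
print(soup_links.find(id="link2").get_text())
```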

### Practical example
Website - English Premier League ResultDB

**URL** - http://www.resultdb.com/english-premier-league-tables/

**Goal**: *Get the aggregated details of each team for a particular season* 


In [None]:
import requests
import pandas as pd

year = '2000'
url = f"http://www.resultdb.com/english-premier-league-tables/{year}/"
page = requests.get(url, timeout=30)
page.raise_for_status()  # stop early if the request failed
maindetails = BeautifulSoup(page.text, 'html.parser')

# soup = BeautifulSoup(page.text,'lxml')
# table = soup.find('table')

# data = []
# rows = table.find_all('tr')
# for row in rows:
#     cols = row.find_all('td')
#     cols = [ele.text.strip() for ele in cols]
#     data.append([ele for ele in cols if ele]) # Get rid of empty values

# columns= ['position','team name','games','won','draw','lost','goal scored','goals conceded','goal difference','points']
# season = pd.DataFrame(data[1:],columns=columns)

In [None]:
print(maindetails.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   English Premier League 2000/2001 table - Result DB
  </title>
  <meta content="Premier League tables for the 2000/2001 season. Full table for the English Premier League 2000/2001 as well as home and away league tables. " name="description">
   <meta content="Premier League 2000/2001 table,Premier League table,2000/2001 table" name="keywords"/>
   <link href="/style.css" rel="stylesheet" type="text/css"/>
   <link href="/images/favicon.ico" rel="Shortcut Icon"/>
   <script type="text/javascript">
    var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-23500708-1']);
  _gaq.push(['_trackPageview']);

  (function() {
    var ga = document.createEleme

In [None]:
# The details are in the table tag. Find the table
table = maindetails.find('table')

# Table has rows. Get all the table rows. the result will be a list
rows = table.find_all('tr')



In [None]:
# Get the details in each row
data = []
for row in rows:
    # Collect the text of every <td> cell in this row
    details = row.find_all('td')
    cols = [ele.text.strip() for ele in details]
    data.append([ele for ele in cols if ele])  # Get rid of empty values

# Inspect one parsed row
data[2]

['2', 'Arsenal', '38', '20', '10', '8', '63', '38', '+25', '70 pts']
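
Rows that contain no `td` cells, such as a header row built from `th` cells, come out of this loop as empty lists. That is presumably why the first entry is dropped with `data[1:]` when the dataframe is built. The sketch below uses a made-up table (the HTML and the names `sample`, `sample_rows` and `parsed` are hypothetical, not taken from ResultDB) to show the effect and one way to skip such rows.

```python
from bs4 import BeautifulSoup

# Hypothetical table for illustration only
sample = """
<table>
  <tr><th>Pos</th><th>Team</th><th>Pts</th></tr>
  <tr><td>1</td><td>Team A</td><td>80 pts</td></tr>
  <tr><td>2</td><td>Team B</td><td>70 pts</td></tr>
</table>
"""

sample_rows = BeautifulSoup(sample, "html.parser").find("table").find_all("tr")

parsed = []
for row in sample_rows:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:  # skip rows without td cells, e.g. the header row
        parsed.append(cells)

print(parsed)
```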

In [None]:
# Create a dataframe where the data will be placed and processed
columns= ['position','team name','games','won','draw','lost','goal scored','goals conceded','goal difference','points']
season = pd.DataFrame(data[1:],columns=columns)

In [None]:
season.head()

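|   | position | team name | games | won | draw | lost | goal scored | goals conceded | goal difference | points |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Manchester United | 38 | 24 | 8 | 6 | 79 | 31 | 48 | 80 pts |
| 1 | 2 | Arsenal | 38 | 20 | 10 | 8 | 63 | 38 | 25 | 70 pts |
| 2 | 3 | Liverpool | 38 | 20 | 9 | 9 | 71 | 39 | 32 | 69 pts |
| 3 | 4 | Leeds | 38 | 20 | 8 | 10 | 64 | 43 | 21 | 68 pts |
| 4 | 5 | Ipswich Town | 38 | 20 | 6 | 12 | 57 | 42 | 15 | 66 pts |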
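
The scraped values are all strings, and the points column carries a " pts" suffix, so the columns usually need converting before any arithmetic. One possible cleanup is sketched here on two rows copied from the output above; the name `sample_season` and the reduced set of columns are assumptions made for this example.

```python
import pandas as pd

# Two rows copied from the output above; every value is still a string
sample_season = pd.DataFrame(
    [
        ["1", "Manchester United", "38", "80 pts"],
        ["2", "Arsenal", "38", "70 pts"],
    ],
    columns=["position", "team name", "games", "points"],
)

# Strip the " pts" suffix, then convert the numeric columns to integers
sample_season["points"] = (
    sample_season["points"].str.replace(" pts", "", regex=False).astype(int)
)
for col in ["position", "games"]:
    sample_season[col] = sample_season[col].astype(int)

print(sample_season.dtypes)
print(sample_season["points"].tolist())
```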


In [None]:
# TODO: Convert the steps above into a function. Then get the details for 2000-2015, combine them in one dataframe, and add a column called season
# ENTER CODE HERE

## Assignment
Based on the above, get the main articles from igihe from February 2022 to the present.

Steps to do this:


1.   Get the links to the main pages covering that period. Create a list
2.   In each link, get all the links to the main articles
3.   For each article, get the main tag that holds the text
4.   Get the text and store it in a txt file. The data will be used in week 2
5.   Each article goes in its own txt file, named date_article_1, date_article_2, and so on
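
The file-writing part of the steps above could look roughly like the sketch below. It only covers saving the text. The function name `save_articles`, the date format, and the placeholder texts are assumptions made for illustration, and the scraping of igihe itself is left to the assignment.

```python
import tempfile
from pathlib import Path


def save_articles(texts, date, out_dir):
    """Write each article to its own file named <date>_article_<n>.txt."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, text in enumerate(texts, start=1):
        path = out_dir / f"{date}_article_{n}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


# Placeholder texts; in the assignment these come from the scraped pages
demo_dir = tempfile.mkdtemp()
saved = save_articles(
    ["First article text", "Second article text"], "2022-02-01", demo_dir
)
print([p.name for p in saved])
```

Writing with an explicit `encoding="utf-8"` avoids platform-dependent defaults when the article text contains non-ASCII characters.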

