# Web browser in Python (socket)

We are going to build our first web browser on top of the HTTP protocol: we connect to a web server, follow the rules of the protocol to request a document, and display whatever the server sends back.

We start by importing the socket library:

In [11]:
import socket

# Connect to the server

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.w3.org', 80))

# Send the request

cmd = 'GET https://www.w3.org/MarkUp/draft-ietf-iiir-html-01.txt HTTP/1.0\r\n\r\n'.encode()
mysock.sendall(cmd)

# Receive the data in 512-byte chunks until the server closes the connection

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(),end='')

# Close the connection

mysock.close()

HTTP/1.1 200 OK
date: Wed, 02 Nov 2022 18:37:32 GMT
last-modified: Mon, 12 Feb 1996 18:20:25 GMT
etag: "13c91-2ed8f31cc4c40"
accept-ranges: bytes
content-length: 81041
cache-control: max-age=21600
expires: Thu, 03 Nov 2022 00:37:32 GMT
vary: Accept-Encoding,upgrade-insecure-requests
keep-alive: timeout=5, max=2000
content-type: text/plain
x-backend: www-mirrors
x-request-id: 6362b8ec96d2c7f3
connection: close

Hypertext Markup Language (HTML)          Tim Berners-Lee, CERN
Internet Draft                          Daniel Connolly, Atrium
IIIR Working Group                                    June 1993


                  Hypertext Markup Language (HTML)
                
   A Representation of Textual Information and MetaInformation
                   for Retrieval and Interchange


Status of this Document

   This document is an Internet Draft. Internet Drafts are working
   documents of the Internet Engineering Task Force (IETF), its Areas,
   and its Working Groups.  Note

                <LI>Walk for a mile or so until you reach the
                "Asquith Arms" then
                <LI>Wait and see...
                </OL>

                < MENU >
                <LI>The oranges should be pressed fresh
                <LI>The nuts may come from a packet
                <LI>The gin must be good quality
                </MENU>

                < DIR >
                <LI>A-H<LI>I-M
                <LI>M-R<LI>S-Z
                </DIR>



Next ID




Berners-Lee and Connolly                                             21

   This tag takes a  single attribute which is the number of the next
   document-wide numeric identifier to be allocated of the form z123.
   
   When modifying a document, old anchor ids should not be reused, as
   there may be references stored elsewhere which point to them.  This
   is read and generated by hypertext editors. Human writers of HTML
   usually use mnemonic alphabetical identifiers. Browser software may
   ignore thi

First, the program connects to port 80 of the server www.w3.org. Because our program is playing the role of the "web browser", the HTTP protocol requires us to send the GET command followed by a blank line. \r\n marks the end of a line, so \r\n\r\n is an end of line followed by an empty line, which is the blank line. We then receive data until none is left and close the connection.
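The request and response formats described above can be sketched without touching the network. The helper names `build_get_request` and `split_response` are hypothetical, and the sample response is a shortened, hand-written imitation of the output shown earlier, not a real capture. This sketch also assumes the more common form of the request line (a path plus a `Host` header) rather than the absolute URL used in the cell above.

```python
def build_get_request(host, path):
    """Build a minimal HTTP/1.0 GET request as bytes (hypothetical helper)."""
    lines = [
        f'GET {path} HTTP/1.0',
        f'Host: {host}',
        '',  # the empty string after the last header produces the blank line
        '',
    ]
    # Every line ends with \r\n, so the request ends with \r\n\r\n.
    return '\r\n'.join(lines).encode('ascii')


def split_response(raw):
    """Split raw response bytes into (status_line, headers, body)."""
    # The first blank line (\r\n\r\n) separates the headers from the body.
    head, _, body = raw.partition(b'\r\n\r\n')
    header_lines = head.decode('iso-8859-1').split('\r\n')
    status_line = header_lines[0]
    headers = {}
    for line in header_lines[1:]:
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Hand-written sample modeled on the output above (an assumption, not real data).
sample_response = (
    b'HTTP/1.1 200 OK\r\n'
    b'content-type: text/plain\r\n'
    b'connection: close\r\n'
    b'\r\n'
    b'Hypertext Markup Language (HTML)\n'
)

status_line, headers, body = split_response(sample_response)
print(status_line)
print(headers['content-type'])
print(body.decode())
```

Collecting all the bytes first and decoding only the body afterwards also avoids decoding a chunk that happens to end in the middle of a multi-byte character.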

# Retrieving web pages with urllib

The browser above can be implemented more simply with the urllib library.

In [12]:
import urllib.request

fhand = urllib.request.urlopen('https://www.w3.org/MarkUp/draft-ietf-iiir-html-01.txt')
for line in fhand:
    print(line.decode().strip())

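# This approach returns only the body of the document, whereas the socket version also printed the HTTP response headers.
# Note that strip() also removes the leading spaces, so the indentation of the original text is lost.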

Hypertext Markup Language (HTML)          Tim Berners-Lee, CERN
Internet Draft                          Daniel Connolly, Atrium
IIIR Working Group                                    June 1993


Hypertext Markup Language (HTML)

A Representation of Textual Information and MetaInformation
for Retrieval and Interchange


Status of this Document

This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its Areas,
and its Working Groups.  Note that other groups may also distribute
working documents as Internet Drafts.

Internet Drafts are working documents valid for a maximum of six
months. Internet Drafts may be updated, replaced, or  obsoleted by
other documents at any time.  It is not appropriate to use Internet
Drafts as reference material or to cite them other than as a
"working draft" or "work in progress".

Distribution of this document is unlimited.   The document is a
draft form of a standard for interchange of informa

Berners-Lee and Connolly                                             19

Typical rendering

The definition list DT, DD pairs are arranged vertically.   For
each pair, the DT element is on the left, in a column of about a
third of the display area, and the DD element is in the right hand
two thirds of the display area.  The DT term is normally small
enough to fit on one line within the left-hand column. If it is
longer, it will either extend across the page, in which case the DD
section is moved down to separate them, or it is wrapped onto
successive lines of the left hand column.

White space is typically left between successive DT,DD pairs unless
the COMPACT attribute is given.  The COMPACT attribute is
appropriate for lists which are long and/or have DT,DD pairs which
each take only a line or two.  It is of course possible for the
rendering software to discover these cases itself and make its own
decisions, and this is to be encouraged.

The COMPACT attribute may also reduce the widt
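urllib does not discard the response headers; it keeps them separate from the body. The following is a minimal sketch of how both could be read. The function name `fetch` is hypothetical, and falling back to UTF-8 when the server declares no charset is an assumption of this sketch.

```python
import urllib.request


def fetch(url):
    """Return (headers, body_text) for a URL (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        # response.headers is an email.message.Message-style object.
        headers = response.headers
        # Assumption: fall back to UTF-8 when no charset is declared.
        charset = headers.get_content_charset() or 'utf-8'
        body = response.read().decode(charset, errors='replace')
    return headers, body


# Example (requires network access, so it is left commented out):
# headers, body = fetch('https://www.w3.org/MarkUp/draft-ietf-iiir-html-01.txt')
# print(headers.get('Content-Type'))
# print(body[:200])
```

Header lookups on the returned object are case-insensitive, so `headers.get('Content-Type')` works even when the server sends the name in lowercase, as in the socket output above.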

In [13]:
import urllib.request

fhand = urllib.request.urlopen('https://openwebinars.net/academia/')
for line in fhand:
    print(line.decode().strip())


<!DOCTYPE html>
<html lang="es">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="google-site-verification" content="3xIl6rwvbZFMPw2Os_ecSGlPPwrQuE11U5J9Ko2bbas"/>
<meta property="fb:pages" content="411445362304338"/>
<title>Iniciar Sesión | OpenWebinars</title>
<link rel="shortcut icon" href="/static/public/images/favicons/favicon.ico" sizes="64x64"/>
<link rel="apple-touch-icon" sizes="57x57" href="/static/public/images/favicons/xapple-icon-57x57.png.pagespeed.ic.7OyWpP5A5m.png">
<link rel="apple-touch-icon" sizes="60x60" href="/static/public/images/favicons/xapple-icon-60x60.png.pagespeed.ic.ZMuOsdKt-v.png">
<link rel="apple-touch-icon" sizes="72x72" href="/static/public/images/favicons/xapple-icon-72x72.png.pagespeed.ic.0cXT2U6yaP.png">
<link rel="apple-touch-icon" sizes="76x76" href="/static/public/images/favicons/xapple-icon-76x76.png.pagespeed.ic.QHd8G2Ux1F.png"

# Parsing HTML with BeautifulSoup

In [15]:
# Install the library

import sys
!{sys.executable} -m pip install beautifulsoup4

# This command installs the library for the Python interpreter that runs this notebook


Defaulting to user installation because normal site-packages is not writeable


In [17]:
# Import the libraries

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('https://openwebinars.net')
soup = BeautifulSoup(html, 'html.parser')

# Now that the HTML has been read and turned into a BeautifulSoup object, we can extract data from it

tags = soup('a') # find all "a" tags

In [18]:
tags

[<a href="/">
 <figure class="brand">
 <img alt="OpenWebinars" class="logo" src="/static/public/images/logo.svg"/>
 </figure>
 </a>,
 <a data-item="cursos" href="/cursos/">Cursos<span class="icon-chevron-right"></span></a>,
 <a href="/cursos/cloud-computing/">Cloud Computing</a>,
 <a href="/cursos/backend/">Backend</a>,
 <a href="/cursos/blockchain/">Blockchain</a>,
 <a href="/cursos/certificaciones-oficiales/">Certificaciones oficiales</a>,
 <a href="/cursos/metodologias/">Metodologías</a>,
 <a href="/cursos/drupal/">Drupal</a>,
 <a href="/cursos/devops/">DevOps</a>,
 <a href="/cursos/wordpress/">WordPress</a>,
 <a href="/cursos/videojuegos/">Videojuegos</a>,
 <a href="/cursos/bases-de-datos/">Bases de datos</a>,
 <a href="/cursos/robotica/">Robótica y Hardware</a>,
 <a href="/cursos/management/">Management</a>,
 <a href="/cursos/ciberseguridad-ethical-hacking/">Ciberseguridad</a>,
 <a href="/cursos/sistemas/">Sistemas y Redes</a>,
 <a href="/cursos/herramientas/">Herramientas</a>,
 <

Now we can iterate over the list of tags and extract the value of each "href" attribute.

In [21]:
for tag in tags:
    print('TAG', tag)
    print('URL', tag.get('href'))
    print('CONTENIDO', tag.contents)
    print('ATRIBUTO', tag.attrs)
    print()

TAG <a href="/">
<figure class="brand">
<img alt="OpenWebinars" class="logo" src="/static/public/images/logo.svg"/>
</figure>
</a>
URL /
CONTENIDO ['\n', <figure class="brand">
<img alt="OpenWebinars" class="logo" src="/static/public/images/logo.svg"/>
</figure>, '\n']
ATRIBUTO {'href': '/'}

TAG <a data-item="cursos" href="/cursos/">Cursos<span class="icon-chevron-right"></span></a>
URL /cursos/
CONTENIDO ['Cursos', <span class="icon-chevron-right"></span>]
ATRIBUTO {'href': '/cursos/', 'data-item': 'cursos'}

TAG <a href="/cursos/cloud-computing/">Cloud Computing</a>
URL /cursos/cloud-computing/
CONTENIDO ['Cloud Computing']
ATRIBUTO {'href': '/cursos/cloud-computing/'}

TAG <a href="/cursos/backend/">Backend</a>
URL /cursos/backend/
CONTENIDO ['Backend']
ATRIBUTO {'href': '/cursos/backend/'}

TAG <a href="/cursos/blockchain/">Blockchain</a>
URL /cursos/blockchain/
CONTENIDO ['Blockchain']
ATRIBUTO {'href': '/cursos/blockchain/'}

TAG <a href="/cursos/certificaciones-oficiales/">Cert

We have obtained the link of every "a" tag on the page.
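The href values printed above are relative to the site root (for example `/cursos/`). As a minimal sketch, they could be turned into absolute URLs with `urllib.parse.urljoin`. The function name `absolute_links` is hypothetical, and the list of hrefs is copied by hand from the output above rather than taken from a live request.

```python
from urllib.parse import urljoin

# Base URL of the page that was parsed in the cells above.
base_url = 'https://openwebinars.net'

# A few href values copied from the output above.
hrefs = ['/', '/cursos/', '/cursos/cloud-computing/']


def absolute_links(base, hrefs):
    """Resolve each href against the base URL, skipping missing values."""
    # tag.get('href') returns None for anchors without an href attribute,
    # so empty values are filtered out before joining.
    return [urljoin(base, href) for href in hrefs if href]


for link in absolute_links(base_url, hrefs):
    print(link)
```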

The same approach can be used to extract other tags, other attributes, and other fields.
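With BeautifulSoup, reading a different attribute amounts to changing the tag name and the attribute name, for example `soup('img')` followed by `tag.get('src')`. To keep the next example free of third-party dependencies, it swaps in the standard library's `html.parser` instead of BeautifulSoup; this is not what the notebook uses. The class `AttributeCollector`, the function `collect`, and the HTML snippet (adapted by hand from the output above) are all hypothetical.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect the values of one attribute from every occurrence of one tag."""

    def __init__(self, tag, attribute):
        super().__init__()
        self.tag = tag
        self.attribute = attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == self.tag:
            for name, value in attrs:
                if name == self.attribute and value is not None:
                    self.values.append(value)


def collect(html, tag, attribute):
    """Return every value of `attribute` found on `tag` elements in `html`."""
    parser = AttributeCollector(tag, attribute)
    parser.feed(html)
    parser.close()
    return parser.values


# Small snippet adapted from the output above (an assumption, not a live page).
sample_html = '''
<a href="/"><figure class="brand">
<img alt="OpenWebinars" class="logo" src="/static/public/images/logo.svg"/>
</figure></a>
<a data-item="cursos" href="/cursos/">Cursos</a>
'''

print(collect(sample_html, 'img', 'src'))
print(collect(sample_html, 'a', 'href'))
```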