<a href="https://colab.research.google.com/github/jaiderog/Learning-Computational-Social-Sciences/blob/master/Generalidades_del_Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p><img alt="Colaboratory logo" height="250px" src="https://upload.wikimedia.org/wikipedia/commons/archive/f/fb/20161010213812%21Escudo-UdeA.svg" align="left" hspace="10px" vspace="0px"></p>

<h1><b>Introduction to web scraping</b></h1>

<h2>Material prepared by: Jaider Ochoa Gutiérrez and Juan Fernando Pérez Pérez</h2>

This notebook gives a general overview of how to use computational tools to scrape web pages.

**Let's get to work!**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Step 1: Set up the working environment and libraries

In [None]:
!pip install BeautifulSoup4
!pip install requests



In [None]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

# Step 2: Select and inspect the website

**Structure of an HTML page**

HyperText Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser. HTML describes the structure of a web page and can be combined with Cascading Style Sheets (CSS) and a scripting language such as JavaScript to build interactive websites. An HTML document consists of a series of elements that tell the browser how to display the content, and each element is written using tags.

Some common tags are listed below:



* The `<!DOCTYPE html>` declaration defines the document as HTML5.
* The `<html>` element is the root element of an HTML page.
* The `<div>` tag defines a division or section in an HTML document. It usually acts as a container for other elements.
* The `<head>` element contains meta-information about the document.
* The `<title>` element specifies a title for the document.
* The `<body>` element contains the visible content of the page.
* The `<h1>` element defines a top-level heading.
* The `<p>` element defines a paragraph.
* The `<a>` element defines a hyperlink.

HTML tags normally come in pairs, such as `<p>` and `</p>`. The first tag in a pair is the opening tag and the second is the closing tag. The closing tag is written like the opening tag, but with a forward slash before the tag name.
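To make the idea of paired tags concrete, here is a minimal sketch that lists every opening and closing tag in a small page. It uses `html.parser` from the Python standard library so that it runs without installing anything. The sample page and the `TagLogger` class are made up for this illustration and are not part of the notebook's workflow.

```python
from html.parser import HTMLParser

# Hypothetical page, written only for this illustration.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>Sample page</title></head>
<body>
<h1>Main heading</h1>
<p>A paragraph with a <a href="https://example.com">link</a>.</p>
</body>
</html>"""


class TagLogger(HTMLParser):
    """Record every opening and closing tag in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.events.append(f"</{tag}>")


logger = TagLogger()
logger.feed(SAMPLE_HTML)
logger.close()
print(logger.events)
```

Each opening tag in the printed list has a matching closing tag later on, and nested elements such as `<a>` inside `<p>` close before their parent does.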

In [None]:
url = 'https://techcrunch.com/'

# Step 3: Send an HTTP request

In [None]:
page = requests.get(url)  # Send the request to the website
soup = BeautifulSoup(page.content, 'html.parser')  # Parse the HTML that came back
page  # A 200 response means the request succeeded and the server returned the page content, so there is HTML to parse

<Response [200]>
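The status code is worth checking before parsing anything, because a failed request still returns a response object. The sketch below uses `http.HTTPStatus` from the standard library to turn a status code into a short description. The helper name `describe_status` and the list of codes are assumptions chosen for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short summary of a standard HTTP status code.

    HTTPStatus(code) raises ValueError for codes it does not know.
    """
    phrase = HTTPStatus(code).phrase  # e.g. 200 -> "OK"
    if code < 200:
        outcome = "informational"
    elif code < 300:
        outcome = "success"
    elif code < 400:
        outcome = "redirection"
    elif code < 500:
        outcome = "client error"
    else:
        outcome = "server error"
    return f"{code} {phrase}: {outcome}"


for code in (200, 403, 404, 503):
    print(describe_status(code))
```

With `requests`, the integer is available as `page.status_code`, and `page.raise_for_status()` raises an exception for 4xx and 5xx responses, which is a convenient way to stop early when the request failed.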

In [None]:
titulo = soup.find_all('div', {'class': 'very-small'})
titulo

[]

# Step 4: Extract specific sections

In [None]:
# Extract the news headlines
titulo = soup.find_all('a', {'class': 'post-block__title__link'})  # Inspect the page to identify the element or attribute to extract
titulo_list = []  # Empty list where the data will be stored
for x in titulo[1:]:  # Loop over the matches, skipping the first one, to extract each headline
    titulo_list.append(x.get_text())  # append is a Python list method that adds an item to the end of the list
                                      # .get_text() is a BeautifulSoup method that returns the text inside the element
titulos = pd.DataFrame(titulo_list, columns=['Títulos'])  # Store the result in a DataFrame
titulos  # Display the data stored in the DataFrame

Unnamed: 0,Títulos
0,\n\n\n\n
1,\n\t\t\t\tQuick thinking and a stroke of luck ...
2,"\n\t\t\t\tAs Techstars retools, some former st..."
3,"\n\t\t\t\tTechstars CEO defends changes, says ..."
4,\n\t\t\t\tGoogle court filing reveals new busi...
5,\n\t\t\t\t‘Embarrassing and wrong’: Google adm...
6,\n\t\t\t\tVirtual Staging AI helps Realtors di...
7,\n\t\t\t\tA bumpy road for EV manufacturers\t\t\t
8,\n\t\t\t\tTreating a chatbot nicely might boos...
9,\n\t\t\t\tHumane pushes Ai Pin ship date to mi...
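The headlines above still carry the newlines and tabs from the page's source, and row 0 contains nothing but whitespace. A minimal sketch of one way to tidy this with plain string methods follows. The first two strings are copied from rows 0 and 7 of the output, the third is a made-up example, and the helper name `normalize` is an assumption.

```python
raw_titles = [
    "\n\n\n\n",                                               # row 0 above
    "\n\t\t\t\tA bumpy road for EV manufacturers\t\t\t",  # row 7 above
    "  Hypothetical   headline\nwith a line break ",         # invented example
]


def normalize(text):
    """Collapse any run of whitespace into one space and trim both ends."""
    return " ".join(text.split())


clean_titles = [normalize(t) for t in raw_titles]
non_empty = [t for t in clean_titles if t]  # drop entries that were only whitespace
print(non_empty)
```

Dropping blank entries changes the length of the list, so it is safer to filter only after the headlines, summaries, and links have been combined into one table. BeautifulSoup's `get_text(strip=True)` offers similar trimming at extraction time.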


In [None]:
# Extract the content of each news item (its summary)
content = soup.find_all('div', {'class': 'post-block__content'})  # Inspect the page to identify the element or attribute to extract
content_list = []
for i in content[1:]:  # Loop over the matches, skipping the first one, to extract each summary
    content_list.append(i.get_text())

contenido = pd.DataFrame(content_list, columns=['Contenido'])  # Store the result in a DataFrame
contenido

Unnamed: 0,Contenido
0,\n\t\tWell-known accelerator group Techstars a...
1,"\n\t\tEarlier this week, accelerator group Tec..."
2,\n\t\tA court filing in the U.S. Department of...
3,\n\t\tGoogle has apologized (or come very clos...
4,\n\t\tHouse staging is a significant part of t...
5,\n\t\tWelcome to Startups Weekly — your weekly...
6,\n\t\tPeople are more likely to do something i...
7,"\n\t\tHardware is difficult, to paraphrase a f..."
8,"\n\t\tThe Browser company’s Arc, a browser foc..."
9,"\n\t\tReddit’s long-awaited IPO is nearing, pr..."


In [None]:
# Extract the link to each news item
link = soup.find_all('a', {'class': 'post-block__title__link'})  # Inspect the page to identify the element or attribute to extract
link_list = []
for i in link[1:]:  # Loop over the matches, skipping the first one, to extract each link
    link_list.append(i.get('href'))

enlaces = pd.DataFrame(link_list, columns=['Enlaces'])  # Store the result in a DataFrame
enlaces

Unnamed: 0,Enlaces
0,https://techcrunch.com/2024/02/23/techstars-ce...
1,https://techcrunch.com/2024/02/23/quick-thinki...
2,https://techcrunch.com/2024/02/23/as-techstars...
3,https://techcrunch.com/2024/02/23/techstars-ce...
4,https://techcrunch.com/2024/02/23/search-start...
5,https://techcrunch.com/2024/02/23/embarrassing...
6,https://techcrunch.com/2024/02/23/virtual-stag...
7,https://techcrunch.com/2024/02/23/a-bumpy-road...
8,https://techcrunch.com/2024/02/23/treating-a-c...
9,https://techcrunch.com/2024/02/23/humane-pushe...
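Once the links are extracted, the standard library's `urllib.parse` can take them apart. The sketch below reads the date from the `/YYYY/MM/DD/` prefix that the links in this output happen to share. That URL layout is inferred from the output rather than documented, so treat it as an assumption. The helper name `publication_date` and the relative path passed to `urljoin` are also hypothetical.

```python
from datetime import date
from urllib.parse import urljoin, urlparse

BASE_URL = "https://techcrunch.com/"

# One complete link as it appears later in this notebook.
full_link = ("https://techcrunch.com/2024/02/23/search-startups-duckduckgo-and-neeva-"
             "had-a-tough-time-competing-with-google-court-filing-shows/")


def publication_date(link):
    """Read a /YYYY/MM/DD/ prefix from the URL path, or return None."""
    parts = urlparse(link).path.strip("/").split("/")
    try:
        year, month, day = (int(p) for p in parts[:3])
        return date(year, month, day)
    except ValueError:  # too few segments, non-numeric segment, or invalid date
        return None


print(publication_date(full_link))
print(publication_date(BASE_URL))

# urljoin resolves a relative href against the page URL and leaves absolute URLs unchanged.
print(urljoin(BASE_URL, "/2024/02/23/some-post/"))
```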


In [None]:
# Concatenate all the DataFrames (the tables created above) side by side
df = pd.concat([titulos, contenido, enlaces], axis=1)
df

Unnamed: 0,Títulos,Contenido,Enlaces
0,\n\n\n\n,\n\t\tWell-known accelerator group Techstars a...,https://techcrunch.com/2024/02/23/techstars-ce...
1,\n\t\t\t\tQuick thinking and a stroke of luck ...,"\n\t\tEarlier this week, accelerator group Tec...",https://techcrunch.com/2024/02/23/quick-thinki...
2,"\n\t\t\t\tAs Techstars retools, some former st...",\n\t\tA court filing in the U.S. Department of...,https://techcrunch.com/2024/02/23/as-techstars...
3,"\n\t\t\t\tTechstars CEO defends changes, says ...",\n\t\tGoogle has apologized (or come very clos...,https://techcrunch.com/2024/02/23/techstars-ce...
4,\n\t\t\t\tGoogle court filing reveals new busi...,\n\t\tHouse staging is a significant part of t...,https://techcrunch.com/2024/02/23/search-start...
5,\n\t\t\t\t‘Embarrassing and wrong’: Google adm...,\n\t\tWelcome to Startups Weekly — your weekly...,https://techcrunch.com/2024/02/23/embarrassing...
6,\n\t\t\t\tVirtual Staging AI helps Realtors di...,\n\t\tPeople are more likely to do something i...,https://techcrunch.com/2024/02/23/virtual-stag...
7,\n\t\t\t\tA bumpy road for EV manufacturers\t\t\t,"\n\t\tHardware is difficult, to paraphrase a f...",https://techcrunch.com/2024/02/23/a-bumpy-road...
8,\n\t\t\t\tTreating a chatbot nicely might boos...,"\n\t\tThe Browser company’s Arc, a browser foc...",https://techcrunch.com/2024/02/23/treating-a-c...
9,\n\t\t\t\tHumane pushes Ai Pin ship date to mi...,"\n\t\tReddit’s long-awaited IPO is nearing, pr...",https://techcrunch.com/2024/02/23/humane-pushe...
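Concatenating column-wise pairs rows by index, which here is simply each item's position in its list, so it assumes that item `i` of every list describes the same article. The table above suggests that this does not always hold: row 4 pairs a headline about a Google court filing with a summary about house staging. The sketch below reproduces that kind of shift with made-up data using `itertools.zip_longest`. All the values and the helper name `lengths_match` are hypothetical.

```python
from itertools import zip_longest

# Hypothetical extraction results: the summary for article B was not found.
titles = ["Headline A", "Headline B", "Headline C"]
summaries = ["Summary A", "Summary C"]
links = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]


def lengths_match(*columns):
    """Return True when every column holds the same number of items."""
    return len({len(column) for column in columns}) == 1


print(lengths_match(titles, summaries, links))

# Pair items by position, padding the shorter list with None.
rows = list(zip_longest(titles, summaries, links, fillvalue=None))
for row in rows:
    print(row)
```

The second row ends up pairing "Headline B" with "Summary C". A length check catches this particular case, but lists of equal length can still be shifted, so one option is to loop over each article's container element and read the headline, summary, and link from inside it.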


In [None]:
# Save the data regularly in JSON or CSV format so it is not lost and the code does not have to be rerun
df.to_csv("/content/drive/MyDrive/2021/3. Estrategia digital TC/6. Semana de la innovación/ejemplo_1.csv", sep=";",encoding="utf-8-sig")
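The `encoding="utf-8-sig"` argument writes a byte order mark (BOM) at the start of the file, which some spreadsheet programs, Excel among them, use to recognise the file as UTF-8 so that accented characters such as the "í" in `Títulos` display correctly. The sketch below shows what the codec does at the byte level. The header string is only an illustration.

```python
import codecs

header = "Títulos;Contenido;Enlaces"

plain = header.encode("utf-8")
with_bom = header.encode("utf-8-sig")

# utf-8-sig is plain UTF-8 with a three-byte BOM in front.
print(with_bom[:3] == codecs.BOM_UTF8)
print(with_bom[3:] == plain)

# Reading with utf-8-sig removes the BOM again.
print(with_bom.decode("utf-8-sig") == header)

# Reading with plain utf-8 leaves the BOM in the text as the character U+FEFF.
print(with_bom.decode("utf-8")[0] == "\ufeff")
```

This is why the file should also be read back with `encoding="utf-8-sig"`; otherwise the first column name may start with an invisible extra character.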

In [None]:
# Text cleaning
import re  # Regular expressions library

In [None]:
def clean_text(df, text_field):
    patternURLEMAIL = r'(\w+[.]?\w+@(\w+\.)+\w+)|((http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?\w+([\-\.]{1}\w+)*\.[a-z]{2,5}(\/)?(([^\s@])*(\/)?)*)'
    patternHashtagMention = r'(@\w+)|(#\w+)'
    # Use the regular expressions above to remove URLs, email addresses, hashtags and mentions
    # Cast every element to a string before cleaning it
    df[text_field] = df[text_field].astype(str).apply(lambda elem: re.sub(patternURLEMAIL, '', elem))
    df[text_field] = df[text_field].apply(lambda elem: re.sub(patternHashtagMention, '', elem))
    # Collapse runs of whitespace into a single space
    df[text_field] = df[text_field].apply(lambda elem: re.sub(r'\s+', ' ', elem))

    return df

In [None]:
dfclean=clean_text(df,"Contenido")
dfclean=clean_text(df,"Títulos")
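To see what the cleaning rules do to a single string, here is a minimal sketch that reuses the hashtag/mention pattern and the whitespace rule from `clean_text`. The URL and email pattern is left out to keep the example short. The function name `clean_string` and the sample sentence are hypothetical. Note that replacing `\s+` with a space turns leading and trailing whitespace into a single space instead of removing it, so this sketch adds a final `strip()`.

```python
import re

MENTION_OR_HASHTAG = re.compile(r'(@\w+)|(#\w+)')  # same pattern as in clean_text
EXTRA_WHITESPACE = re.compile(r'\s+')


def clean_string(text):
    """Remove mentions and hashtags, collapse whitespace, and trim both ends."""
    text = MENTION_OR_HASHTAG.sub('', text)
    text = EXTRA_WHITESPACE.sub(' ', text)
    return text.strip()


sample = "\n\t\tGreat demo by @some_user today #startups\t"
print(repr(clean_string(sample)))
```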

In [None]:
dfclean

Unnamed: 0,Títulos,Contenido,Enlaces
0,,Well-known accelerator group Techstars announ...,https://techcrunch.com/2024/02/23/techstars-ce...
1,Quick thinking and a stroke of luck averted a...,"Earlier this week, accelerator group Techstar...",https://techcrunch.com/2024/02/23/quick-thinki...
2,"As Techstars retools, some former staffers sa...",A court filing in the U.S. Department of Just...,https://techcrunch.com/2024/02/23/as-techstars...
3,"Techstars CEO defends changes, says physical ...",Google has apologized (or come very close to ...,https://techcrunch.com/2024/02/23/techstars-ce...
4,Google court filing reveals new business deta...,House staging is a significant part of the re...,https://techcrunch.com/2024/02/23/search-start...
5,‘Embarrassing and wrong’: Google admits it lo...,Welcome to Startups Weekly — your weekly reca...,https://techcrunch.com/2024/02/23/embarrassing...
6,Virtual Staging AI helps Realtors digitally f...,People are more likely to do something if you...,https://techcrunch.com/2024/02/23/virtual-stag...
7,A bumpy road for EV manufacturers,"Hardware is difficult, to paraphrase a famous...",https://techcrunch.com/2024/02/23/a-bumpy-road...
8,Treating a chatbot nicely might boost its per...,"The Browser company’s Arc, a browser focused ...",https://techcrunch.com/2024/02/23/treating-a-c...
9,Humane pushes Ai Pin ship date to mid-April,"Reddit’s long-awaited IPO is nearing, promisi...",https://techcrunch.com/2024/02/23/humane-pushe...


In [None]:
dfclean.Contenido[4]

' House staging is a significant part of the real estate industry, and while Realtors have traditionally staged houses physically before posting a listing, it’s an expensive and time-consuming pro '

In [None]:
dfclean.Títulos[4]

' Google court filing reveals new business details of DuckDuckGo and Neeva '

In [None]:
dfclean.Enlaces[4]

'https://techcrunch.com/2024/02/23/search-startups-duckduckgo-and-neeva-had-a-tough-time-competing-with-google-court-filing-shows/'

# Step 5: Store the data

In [None]:
# Save in CSV format
dfclean.to_csv("/content/drive/MyDrive/2021/3. Estrategia digital TC/6. Semana de la innovación/ejemplo_1_clean.csv", sep=";",encoding="utf-8-sig")

In [None]:
dfclean=pd.read_csv("/content/drive/MyDrive/2021/3. Estrategia digital TC/6. Semana de la innovación/ejemplo_1_clean.csv", sep=";",encoding="utf-8-sig")

In [None]:
# Save in JSON format
dfclean.to_json("/content/drive/MyDrive/2021/3. Estrategia digital TC/6. Semana de la innovación/ejemplo_1_clean.json")  # Export to JSON

In [None]:
dfclean=pd.read_json("/content/drive/MyDrive/2021/3. Estrategia digital TC/6. Semana de la innovación/ejemplo_1_clean.json")
dfclean

Unnamed: 0.1,Unnamed: 0,Títulos,Contenido,Enlaces
0,0,Potential winners and losers line up as Plaid...,So now Plaid says it’s a payments company. It...,https://techcrunch.com/2021/10/21/potential-wi...
1,1,Facebook agrees terms to pay French publisher...,Facebook has reached a multi-year agreement t...,https://techcrunch.com/2021/10/21/facebook-agr...
2,2,ESG and shareholder activism: A tsunami is co...,With the increase in attention on ESG issues ...,https://techcrunch.com/2021/10/21/esg-and-shar...
3,3,Charting a course through the internet’s ever...,Google Chrome's Manifest V3 update is just on...,https://techcrunch.com/2021/10/21/charting-a-c...
4,4,Network your way to opportunity at TC Session...,TC Sessions: SaaS 2021 kicks off in just five...,https://techcrunch.com/2021/10/21/network-your...
5,5,Twitter rolls out the ability for anyone to h...,After steadily expanding access over the cour...,https://techcrunch.com/2021/10/21/twitter-roll...
6,6,Lessons from founders raising their first rou...,"In a bull market, it's especially hard to und...",https://techcrunch.com/2021/10/21/lessons-from...
7,7,Amazon rolls out in-store pickup for products...,"Amazon is rolling out “Local Selling,” a set ...",https://techcrunch.com/2021/10/21/amazon-rolls...
8,8,Surface Duo 2 review: Getting better,"If challenging the status quo was easy, we’d ...",https://techcrunch.com/2021/10/21/surface-duo-...
9,9,The climate policies tucked into Congress’ bu...,The climate measures in the budget reconcilia...,https://techcrunch.com/2021/10/21/the-climate-...
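A quick way to gain confidence in a storage format is a round trip: write the data, read it back, and compare. The sketch below does this in memory with the standard library's `csv` and `json` modules, using `;` as the CSV delimiter as in the cells above. The records are made up for illustration and only reuse the column names from the notebook.

```python
import csv
import io
import json

# Hypothetical records standing in for rows of the cleaned table.
records = [
    {"Títulos": "Headline with, a comma", "Enlaces": "https://example.com/a"},
    {"Títulos": "Headline with; a semicolon", "Enlaces": "https://example.com/b"},
]

# CSV round trip: fields containing the delimiter are quoted automatically.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["Títulos", "Enlaces"], delimiter=";")
writer.writeheader()
writer.writerows(records)
buffer.seek(0)
from_csv = [dict(row) for row in csv.DictReader(buffer, delimiter=";")]

# JSON round trip: ensure_ascii=False keeps accented characters readable.
text = json.dumps(records, ensure_ascii=False)
from_json = json.loads(text)

print(from_csv == records)
print(from_json == records)
```

CSV stores every value as text, so numeric columns come back as strings unless they are converted, while JSON preserves numbers, strings, and nesting.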


# References


[BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

[re documentation](https://docs.python.org/3/library/re.html)

[Requests documentation](https://docs.python-requests.org/en/latest/)

[pandas documentation](https://pandas.pydata.org/)

[HTML specification](https://html.spec.whatwg.org/multipage/)

[JSON documentation](https://www.json.org/json-en.html)