# WEBSCRAPING ROGUES
![Webscraping Rogues](../img/wr.jpg)

## ```Welcome to your challenge```

You and your teammates have been assigned the role of `webscraping rogues`, the silent shadows of the data wars. Invisible spies of the _data world_, you journey through the world wide web disguised as regular citizens, gathering intel. It will not always be easy: guards get suspicious and hunt you down constantly. You, however, have a very special set of skills.

Fighting the battle is not the end, though; you must live to tell the tale.

### The song of a hero

Your theme is: `html and webscraping`
The task in this challenge is twofold.
Your team must:
- Answer the most questions you can.
- List **at least** `5` topics of interest (important points) on the theme to guide your 20-minute presentation to your fellow students.
- Work together and help each other out.

_TIP_: Remember to revisit your topics as you work through the exercise and adjust them accordingly.

The team captain will be responsible for putting all the answers together in this notebook and making the pull request before the deadline set with the instructors.

# IMPORTANT POINTS

- What is web scraping?
- How the data is organized
- How does HTML work?
- How do we extract the data?
- Working with the data (for example, with DataFrames)

## Questions

- What is HTML?
- What is web scraping?
- What is the difference between using an API and webscraping?
- How is data organized in HTML?
- What are some of the roadblocks you may find when trying to webscrape?
- What are some of the tools web developers may use to prevent webscraping?
- What are some of the libraries we use for webscraping in Python and what is the purpose of each one?

![URL](../img/url.png)

##### What are the different parts of a URL?
- 1. Protocol
- 2. Domain name
- 3. Path / server directory
- 4. Parameters
- `A.` Host name
- `B.` Subdomain
- `C.` Domain extension
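
As a rough illustration of the parts listed above, the sketch below splits a URL with Python's standard `urllib.parse` module. The URL is a made-up example, not one taken from the exercise, and the three-way split of the host name is a simplification that only holds for hosts with exactly three labels.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, used only to illustrate the parts of a URL.
url = "https://blog.example.com/articles/python?page=2&sort=date"

parts = urlsplit(url)
print(parts.scheme)            # protocol
print(parts.hostname)          # host name (subdomain + domain + extension)
print(parts.path)              # path / server directory
print(parse_qs(parts.query))   # parameters, as a dict of lists

# Naive split of the host name; it assumes exactly three labels,
# so a host such as "example.co.uk" would need different handling.
subdomain, domain, extension = parts.hostname.split(".")
print(subdomain, domain, extension)
```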

Import the libraries.

In [1]:
# Your answer
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.request import urlopen
import re  # the standard-library module covers the patterns used in this notebook

First of all, let's start with a simple example.

Try acquiring the content of [Wikipedia's robots.txt](https://en.wikipedia.org/robots.txt). The text file `robots.txt` is sometimes included by web developers to tell search engine crawlers how to behave. It states which user agents may access the site, which are forbidden, which paths are off limits, and sometimes where to find a sitemap.

In [2]:
# Your answer
url = "https://en.wikipedia.org/robots.txt"
pagina = requests.get(url)
# robots.txt is plain text, so no HTML parser is needed
print(pagina.text)

# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Z
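
Rules like the ones printed above can also be checked programmatically. The following is a minimal sketch using the standard-library `urllib.robotparser`, fed with two rule groups copied from the output; the article URL is just an example path chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Two rule groups copied from the robots.txt output.
rules = """
User-agent: MJ12bot
Disallow: /

User-agent: IsraBot
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Example page, chosen only for illustration.
page = "https://en.wikipedia.org/wiki/Web_scraping"
print(parser.can_fetch("MJ12bot", page))   # False: "Disallow: /" blocks everything
print(parser.can_fetch("IsraBot", page))   # True: an empty Disallow allows everything
```

An empty `Disallow:` line means nothing is forbidden for that agent, which is why the two bots get opposite answers.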

Good. However, that is not web scraping yet. In order to do that, instead of a simple text file, we must get data from an actual HTML page.

For our first webscraping challenge, get the title of all the articles on [this page](https://www.dataversity.net/category/education/articles/).

In [3]:
# Your answer
url = "https://www.dataversity.net/category/education/articles/"
pagina = requests.get(url)
soup = BeautifulSoup(pagina.content, 'html.parser')
# Each article title sits in an <h2 class="entry-title"> element
title = soup.find_all('h2', class_="entry-title")
# Keep only the text, dropping newlines and surrounding whitespace
lista_titulos = [t.text.replace("\n", "").strip() for t in title]
print(*lista_titulos, sep="\n")

Serverless Computing Use Cases
Data Management vs. Data Science
Case Study: Cornell University Automates Data Warehouse Infrastructure
Building Machine Learning Program Success
Case Study: Polaris Puts Data Analysis in the Service of Defeating Human Trafficking
Seven Tools for Effective CDO Leadership
Streaming Analytics: The Value is in the Action
Emerging Cloud Computing Technologies
Maturing Data Governance and Data Literacy to Serve Organizations Better
So You Want to Be a Data Protection Officer?


Now, try getting the titles, authors and date of the publications. You may store the results on a DataFrame.

In [4]:
# Your answer
# Author names are links whose class attribute is "url fn n"
author = soup.find_all('a', class_="url fn n")
lista_autores = [a.text.replace("\n", "").strip() for a in author]
print(*lista_autores, sep="\n")

Paramita (Guha) Ghosh
Paramita (Guha) Ghosh
Amber Lee Dennis
Amber Lee Dennis
Jennifer Zaino
Amber Lee Dennis
Amber Lee Dennis
Paramita (Guha) Ghosh
Michelle Knight
Keith D. Foote


In [5]:
# Publication dates are <time class="entry-date published"> elements
date = soup.find_all('time', class_="entry-date published")
lista_fechas = [f.text.replace("\n", "").strip() for f in date]
print(*lista_fechas, sep="\n")

November 12, 2020
November 10, 2020
November 5, 2020
November 3, 2020
October 29, 2020
October 27, 2020
October 22, 2020
October 21, 2020
October 20, 2020
October 15, 2020


In [6]:
df = pd.DataFrame(columns = ["Títulos", "Autores", "Fechas"])

In [7]:
df["Títulos"] = lista_titulos 
df["Autores"] = lista_autores 
df["Fechas"] = lista_fechas 

In [8]:
df

|   | Títulos | Autores | Fechas |
|---|---|---|---|
| 0 | Serverless Computing Use Cases | Paramita (Guha) Ghosh | November 12, 2020 |
| 1 | Data Management vs. Data Science | Paramita (Guha) Ghosh | November 10, 2020 |
| 2 | Case Study: Cornell University Automates Data ... | Amber Lee Dennis | November 5, 2020 |
| 3 | Building Machine Learning Program Success | Amber Lee Dennis | November 3, 2020 |
| 4 | Case Study: Polaris Puts Data Analysis in the ... | Jennifer Zaino | October 29, 2020 |
| 5 | Seven Tools for Effective CDO Leadership | Amber Lee Dennis | October 27, 2020 |
| 6 | Streaming Analytics: The Value is in the Action | Amber Lee Dennis | October 22, 2020 |
| 7 | Emerging Cloud Computing Technologies | Paramita (Guha) Ghosh | October 21, 2020 |
| 8 | Maturing Data Governance and Data Literacy to ... | Michelle Knight | October 20, 2020 |
| 9 | So You Want to Be a Data Protection Officer? | Keith D. Foote | October 15, 2020 |
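
The dates above are scraped as plain strings, which makes them awkward to sort or compare. A minimal sketch of converting them with the standard-library `datetime` module follows; it uses three of the scraped values and assumes English month names, since `%B` depends on the locale. With pandas, `pd.to_datetime` on the `Fechas` column should give an equivalent result.

```python
from datetime import datetime

# Date strings copied from the scraped output.
fechas = ["November 12, 2020", "October 29, 2020", "October 15, 2020"]

# %B = full month name, %d = day of month, %Y = four-digit year.
# %B is locale dependent; this assumes English month names.
parsed = [datetime.strptime(f, "%B %d, %Y").date() for f in fechas]

print(parsed[0].isoformat())           # 2020-11-12
print((parsed[0] - parsed[-1]).days)   # days between newest and oldest: 28
```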


How would you describe the step-by-step process of scraping a web page?

- Inspect the page to find the tags and classes of the elements we want to scrape
- Extract the text of each matching element
- Clean the data with string methods (here, `replace` and `strip`) or with regex
- Store the data in a list
- OPTIONAL: Visualize the data
- ...
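
The steps above can be sketched end to end on a small inline document. To keep the example runnable without a network connection or third-party installs, it swaps BeautifulSoup for the standard-library `html.parser`; the HTML fragment and the `TitleParser` class are hypothetical and only imitate the structure of the page scraped earlier.

```python
from html.parser import HTMLParser

# Hypothetical HTML standing in for a downloaded page. Step 1 (inspection)
# tells us the titles live in <h2 class="entry-title"> elements.
HTML = """
<article><h2 class="entry-title">
  <a href="/a">First article</a></h2></article>
<article><h2 class="entry-title">
  <a href="/b">Second article</a></h2></article>
<h2 class="widget-title">Not an article</h2>
"""

class TitleParser(HTMLParser):
    """Collect the text inside every <h2 class="entry-title">."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "entry-title" in classes:
            self.in_title = True
            self.titles.append("")      # start collecting a new title

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data     # step 2: extract the text

parser = TitleParser()
parser.feed(HTML)

# Step 3: clean the text. Step 4: store it in a list.
titles = [t.replace("\n", "").strip() for t in parser.titles]
print(titles)   # ['First article', 'Second article']
```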

Can you write a function that returns the number of points and the `Member Since` date of a given codewars user?

For example: https://www.codewars.com/users/WHYTEWYLL

In [9]:
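def codewars(user):
    """Return the `Member Since` date and the honor points of a Codewars user."""
    url = 'https://www.codewars.com/users/{}'.format(user)
    # Decode the bytes; str() on bytes would keep the b'...' wrapper and escape sequences
    html = urlopen(url).read().decode('utf-8', errors='replace')
    fecha = re.search('Member Since:</b>(.*?)</div>', html)
    puntos = re.search('Honor:</b>(.*?)</div>', html)
    return fecha.group(1), puntos.group(1)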


In [10]:
miembrodesde = codewars("WHYTEWYLL")
miembrodesde

('Dec 2019', '691')
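
One weakness of this regex approach is that `re.search` returns `None` when the markup changes, and calling `.group(1)` on `None` raises an error. The sketch below shows one way to guard against that; the HTML fragment is a hypothetical imitation of the profile markup and `extract_stat` is an illustrative helper, not part of any library.

```python
import re

# Hypothetical fragment imitating the markup the function searches through.
html = (
    '<div class="stat"><b>Member Since:</b>Dec 2019</div>'
    '<div class="stat"><b>Honor:</b>691</div>'
)

def extract_stat(label, html):
    """Return the text after `<b>label:</b>`, or None when it is absent."""
    match = re.search(re.escape(label) + r":</b>(.*?)</div>", html)
    return match.group(1).strip() if match else None

print(extract_stat("Member Since", html))   # Dec 2019
print(extract_stat("Honor", html))          # 691
print(extract_stat("Rank", html))           # None
```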

Are you able to scrape the photos on [Shutterstock's search for kittens](https://www.shutterstock.com/search/kitten)?

- If not, how come?

How about on [Google Images search for kittens](https://www.google.com/search?q=kitten&tbm=isch)?

What is the difference?

You may try to download one of the photos.

_HINT:_ To display images on markdown, you should use the following syntax:
```markdown
![alt_text](image_url)
```

In [11]:
# Your answer
gatete = requests.get("https://image.shutterstock.com/image-photo/british-shorthair-kitten-silver-color-260nw-1510641710.jpg")
img_bytes = gatete.content

![gatete](https://image.shutterstock.com/image-photo/british-shorthair-kitten-silver-color-260nw-1510641710.jpg)

In [12]:
# "wb" is enough here: the file is only written, never read back
with open("gatete.jpg", "wb") as file:
    file.write(img_bytes)
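
A download can succeed at the HTTP level and still not contain an image, for instance when a site answers a scraper with an HTML error or block page. A small sanity check before writing to disk is sketched below; `looks_like_jpeg` is a hypothetical helper and both payloads are made-up stand-ins for the `content` of a response.

```python
def looks_like_jpeg(data: bytes) -> bool:
    """Return True when the bytes start with the JPEG signature FF D8 FF."""
    return data[:3] == b"\xff\xd8\xff"

# Hypothetical payloads standing in for the content of a response.
fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 16
blocked_page = b"<html><body>Access denied</body></html>"

for payload in (fake_jpeg, blocked_page):
    if looks_like_jpeg(payload):
        print("image bytes, safe to write to disk")
    else:
        print("not a JPEG, probably an error or block page")
```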
    

### Discussion topics

The following are a few questions and points you should debate and think about with your teammates.

- Is webscraping legal?
- Is webscraping immoral? 
- What are some fair and unfair uses of webscraping data?
- If people go to great lengths to protect and to acquire data, how much is data really worth?

#### Extra
[TDS: Step by step guide](https://towardsdatascience.com/a-step-by-step-guide-to-web-scraping-in-python-5c4d9cef76e8)