# Web Scraping Quest

Before you start your robot-junior ad hunter project, your goal for this quest is to scrape Chuck Norris facts from this address: [Chuck Norris facts](https://chucknorrisfacts.net/facts.php?page=1).

* Make a request on the above web page.

In [1]:
# Packages we need:
#   - requests (to get the HTML data)
#   - BeautifulSoup (to parse it)

import requests
from bs4 import BeautifulSoup

In [2]:
# Setting up the initial variables

# URL of the page
url = "https://chucknorrisfacts.net/facts.php?page=1"

page = requests.get(url)

page

<Response [406]>

* What is the `response code`? What is this error? Chuck burned you!

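> The `response code` is the HTTP status code the server returns with every response. This one is `406 Not Acceptable`: the server could not produce a response that matches the headers sent with the request. Here the site is most likely rejecting the default `python-requests` User-Agent.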
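
To look up what a numeric status code means without leaving Python, the standard library's `http.HTTPStatus` enum can help. This is only a minimal sketch: the helper name `describe_status` is made up for illustration, and in the notebook the code would come from `page.status_code`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return '<code> <phrase>' for a known HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    return f"{status.value} {status.phrase}"


print(describe_status(406))
print(describe_status(200))
```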

* Bypass the protection by specifying an existing browser, then retry your request.

In [3]:
# Setting the user-agent from my browser
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0"

# Adding the "User-Agent" into HTTP headers (dictionary)
headers = {
    "User-Agent": user_agent
}

# Retrying the request WITH headers
page = requests.get(url, headers=headers)

page

<Response [200]>
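
When several pages are fetched, the header can be set once on a `requests.Session` instead of being passed to every call. The sketch below sends nothing over the network; it only prepares a request to show which User-Agent would go out. The `python-requests/...` default it prints first is what the initial attempt sent, since no headers were given.

```python
import requests

url = "https://chucknorrisfacts.net/facts.php?page=1"
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0"

session = requests.Session()

# Default User-Agent that requests sends when none is specified
default_ua = session.headers["User-Agent"]
print(default_ua)

# Override it once; every request made through this session inherits it
session.headers.update({"User-Agent": user_agent})

# Build the request without sending it, to inspect the outgoing headers
prepared = session.prepare_request(requests.Request("GET", url))
print(prepared.headers["User-Agent"])
```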

* Find the tag that identifies the jokes. How many jokes are there? (There should be 20.) Display the 8th joke using the `.text` attribute. (The joke should be displayed along with other elements, such as the note.)

In [4]:
# We'll create the BeautifulSoup object
soup = BeautifulSoup(page.content, "html.parser")

# By inspecting the page, I've found that every joke is in the <div>
#   with following "style" attribute value:
#   "border-top: 1px solid #251836; padding: 0 0 5px 7px;"
jokes = soup.find_all("div", attrs={"style": "border-top: 1px solid #251836; padding: 0 0 5px 7px;"})

# How many?
print(f"There are {len(jokes)} jokes on the page.")

# Display 8th joke using the `.text` attribute
print("\n8th joke:")
print(jokes[7].text)

There are 20 jokes on the page.

8th joke:

Jesus can walk on water, but Chuck Norris can swim through dry land.Rated 4.05/5 (2551 Votes)


1
2
3
4
5




* Each element of your iterable is itself a mini-soup. Use the `.find` method with the appropriate tag to isolate only the 8th joke, without the other elements (the note, etc.).

In [5]:
# The joke text sits in a <p> tag
jokes[7].find("p").text

'Jesus can walk on water, but Chuck Norris can swim through dry land.'
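
The same `find_all` / `find` pattern can be tried offline on a small hand-written HTML snippet. The markup below is an assumption modelled on the structure described above (a styled `<div>` holding a `<p>` and a nested `<div>`); it is not copied from the site, and the sample facts are placeholders.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page (assumed structure, placeholder text)
sample_html = """
<div style="border-top: 1px solid #251836; padding: 0 0 5px 7px;">
  <p>First sample fact.</p>
  <div>Rated 4.05/5 (2551 Votes)</div>
</div>
<div style="border-top: 1px solid #251836; padding: 0 0 5px 7px;">
  <p>Second sample fact.</p>
  <div>Rated 3.4/5 (498 Votes)</div>
</div>
<div style="color: red;"><p>Not a joke block.</p></div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Select only the <div> elements whose style attribute matches exactly
sample_blocks = sample_soup.find_all(
    "div", attrs={"style": "border-top: 1px solid #251836; padding: 0 0 5px 7px;"}
)

# Each block is a mini-soup: .find searches only inside that block
for block in sample_blocks:
    print(block.find("p").text, "|", block.find("div").text)
```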

* Do the same to isolate only the note of the 8th joke.

In [6]:
# The "note" appears to be the rating line, which sits inside a nested <div> tag:
jokes[7].find("div").text

'Rated 4.05/5 (2551 Votes)'
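
The note is still a plain string. If numeric values are needed later, one possible approach is a regular expression. The pattern below assumes every note follows the `Rated X/5 (N Votes)` shape seen in the output, and `parse_note` is a hypothetical helper name.

```python
import re

# Captures the score (e.g. 4.05) and the vote count (e.g. 2551)
NOTE_PATTERN = re.compile(r"Rated\s+(\d+(?:\.\d+)?)/5\s+\((\d+)\s+Votes?\)")


def parse_note(note: str) -> tuple:
    """Split a note string into (score, votes); hypothetical helper."""
    match = NOTE_PATTERN.search(note)
    if match is None:
        raise ValueError(f"Unexpected note format: {note!r}")
    return float(match.group(1)), int(match.group(2))


print(parse_note("Rated 4.05/5 (2551 Votes)"))
```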

* Create an empty dictionary, then create a loop that will record in this dictionary each joke as a key and each corresponding note as a value.

In [7]:
d = {}

for joke in jokes:
    # Joke text is the key, its rating (the "note") is the value
    k = joke.find("p").text
    v = joke.find("div").text
    d[k] = v

d



{'Chuck Norris can unscramble an egg.': 'Rated 4.06/5 (3631 Votes)',
 'Chuck Norris destroyed the periodic table, because he only recognizes the element of surprise.': 'Rated 3.94/5 (1694 Votes)',
 "Chuck Norris doesn't read books. He stares them down until he gets the information he wants.": 'Rated 3.83/5 (779 Votes)',
 'Chuck Norris doesnt sleep. He waits.': 'Rated 3.82/5 (865 Votes)',
 'Chuck Norris has a vacation home on the sun.': 'Rated 3.22/5 (473 Votes)',
 'Chuck Norris is suing Myspace for taking the name of what he calls everything around you.': 'Rated 3.4/5 (498 Votes)',
 'Chuck Norris is the only person in the world that can actually email a roundhouse kick.': 'Rated 3.58/5 (672 Votes)',
 'Chuck Norris once shot an enemy plane down with his finger by yelling, "Bang!"': 'Rated 3.95/5 (969 Votes)',
 "Chuck Norris' calendar goes straight from March 31st to April 2nd. No one fools Chuck Norris. ": 'Rated 4.08/5 (5018 Votes)',
 "Chuck Norris' tears cure cancer. Too bad he has ne

* Transform this dictionary into a DataFrame with 2 columns: joke and note. It must have 20 rows: one per joke.

In [8]:
# Importing necessary pandas package
import pandas as pd

# Creating initial DataFrame from the dictionary "d"
#   - orient="index", because keys are rows (index)
#   - .reset_index, because the index was actual text of the joke
df_jokes = pd.DataFrame.from_dict(d, orient='index').reset_index()

# Renaming the columns as we like
df_jokes = df_jokes.rename(columns={"index":"joke", 0: "note"})

df_jokes

|   | joke | note |
|---|------|------|
| 0 | There is no 'ctrl' button on Chuck Norris' com... | Rated 3.74/5 (1076 Votes) |
| 1 | Chuck Norris doesnt sleep. He waits. | Rated 3.82/5 (865 Votes) |
| 2 | Some kids piss their name in the snow. Chuck N... | Rated 4.05/5 (4940 Votes) |
| 3 | Chuck Norris' calendar goes straight from Marc... | Rated 4.08/5 (5018 Votes) |
| 4 | Chuck Norris is the only person in the world t... | Rated 3.58/5 (672 Votes) |
| 5 | Chuck Norris' tears cure cancer. Too bad he ha... | Rated 4.05/5 (2959 Votes) |
| 6 | When Chuck Norris is put in a straight jacket ... | Rated 3.39/5 (659 Votes) |
| 7 | Jesus can walk on water, but Chuck Norris can ... | Rated 4.05/5 (2551 Votes) |
| 8 | Chuck Norris once shot an enemy plane down wit... | Rated 3.95/5 (969 Votes) |
| 9 | Chuck Norris doesn't read books. He stares the... | Rated 3.83/5 (779 Votes) |
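
A shorter route to the same two-column frame is to pass the dictionary's items straight to the `DataFrame` constructor. The sketch uses a two-entry stand-in for `d` so that it runs on its own. Keep in mind that a dictionary holds one value per key, so if the page ever listed the same joke twice, the frame would end up with fewer than 20 rows; collecting `(joke, note)` tuples in a list would avoid that.

```python
import pandas as pd

# Stand-in for the scraped dictionary (two entries taken from the output above)
d = {
    "Chuck Norris can unscramble an egg.": "Rated 4.06/5 (3631 Votes)",
    "Chuck Norris doesnt sleep. He waits.": "Rated 3.82/5 (865 Votes)",
}

# Each (key, value) pair becomes one row; column names are given directly
df_jokes = pd.DataFrame(list(d.items()), columns=["joke", "note"])

print(df_jokes.shape)
```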

