# Week 6 — Part 1 Solutions

We're going to see if we can scrape data about all Love Island contestants from this Wiki: https://loveisland.fandom.com/wiki/Category:Islanders

## Import Libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Get Web Page Data/Text

To start, just try to extract information from a single page: https://loveisland.fandom.com/wiki/Jack_Fincham 

> With the `requests.get()` function, we can request to "get" web page data for a specific URL, which we will store in a variable called `response`.

> To get at the text data in the response, we use `.text`, which we will save in a variable called `html_string`. This text is formatted in the HTML markup language, which we will talk more about in the BeautifulSoup section below.

> To make a BeautifulSoup document, we call `BeautifulSoup()` with two parameters: the `html_string` from our HTTP request and [the kind of parser](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use) that we want to use, which will always be `"html.parser"` for our purposes.

In [2]:
response = requests.get("https://loveisland.fandom.com/wiki/Jack_Fincham")
html_string = response.text

document = BeautifulSoup(html_string, "html.parser")

# Extract Info That You Want

## Name

Write Python code that will extract the contestant's name from this web page

In [3]:
document.find("h2").text

'Jack Fincham'

## Age

Write Python code that will extract the contestant's age from this web page

In [4]:
document.find("h3", string="Age").find_next_sibling().text

'31'
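
The `find()` plus `find_next_sibling()` pattern used here can be tried offline on a small snippet. The sketch below assumes a simplified version of the wiki's infobox markup, in which each label is an `h3` immediately followed by an element holding the value. The `sample_html` string is made up for illustration and is not copied from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified infobox markup (an assumption, not the real page source)
sample_html = """
<aside>
  <h2>Jack Fincham</h2>
  <div><h3>Age</h3><div>31</div></div>
  <div><h3>Hometown</h3><div>Kent, England</div></div>
</aside>
"""

sample_document = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches
age_label = sample_document.find("h3", string="Age")

# find_next_sibling() moves to the next tag at the same level, which holds the value
age_value = age_label.find_next_sibling().text

# A label that is missing from the page gives None rather than raising an error
missing_label = sample_document.find("h3", string="Born")

print(age_value)
print(missing_label)
```

Because `find()` returns `None` for a missing label, calling `.find_next_sibling()` on the result without checking first would raise an `AttributeError`.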

## Birthday

Write Python code that will extract the contestant's birthday from this web page

In [5]:
document.find("h3", string="Born").find_next_sibling().text

'May 9, 1991'

## Occupation

Write Python code that will extract the contestant's occupation from this web page

In [6]:
document.find("h3", string="Occupation").find_next_sibling().text

'Stationary Sales Manager'

## Hometown

Write Python code that will extract the contestant's hometown from this web page

In [7]:
document.find("h3", string="Hometown").find_next_sibling().text

'Kent, England'

## Another Category

Write Python code that will extract another bit of information about the contestant that you think would be important or interesting

In [8]:
document.find("h3", string="Status").find_next_sibling().text

'Winner'

## Make a Dataframe

Now we want to take these bits of code and see if we can put them together to extract contestant information for any URL and return a Pandas dataframe of all the info.

To do so, we will loop through a list of Love Island contestant wiki URLs, extract the relevant info, and then add a Python dictionary with the info to an empty list. **You can easily make a Pandas dataframe from a list of dictionaries.** 
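
As a minimal sketch of that last point, the example below builds a dataframe from two dictionaries. The first row uses values extracted earlier in this notebook. The second row's `None` values are placeholders for fields that have not been scraped yet, and `example_rows` and `example_df` are names chosen for this illustration.

```python
import pandas as pd

# Each dictionary becomes one row; the keys become the column names
example_rows = [
    {"name": "Jack Fincham", "hometown": "Kent, England", "status": "Winner"},
    {"name": "Olivia Attwood", "hometown": None, "status": None},
]

example_df = pd.DataFrame(example_rows)
print(example_df)
```

Missing values stay in the dataframe as empty cells, which is why the loop can safely store `None` when a field is not found.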

In [20]:
wiki_urls = ["https://loveisland.fandom.com/wiki/Jack_Fincham", 
             "https://loveisland.fandom.com/wiki/Olivia_Attwood"]

In [40]:
# empty list that will become a dataframe
dicts_to_df = []


for wiki_url in wiki_urls:
    
    # Request this contestant's page and parse it,
    # so that `document` matches the current URL
    response = requests.get(wiki_url)
    html_string = response.text
    document = BeautifulSoup(html_string, "html.parser")
    
    
    # Age
    age = document.find("h3", string="Age")
    if age is not None:
        age = age.find_next_sibling().text
    
    
    # Birthday
    birthday = document.find("h3", string="Born")
    if birthday is not None:
        birthday = birthday.find_next_sibling().text
    
    
    # Occupation
    job = document.find("h3", string="Occupation")
    if job is not None:
        job = job.find_next_sibling().text
    
    
    # Hometown
    town = document.find("h3", string="Hometown")
    if town is not None:
        town = town.find_next_sibling().text
    
    
    # Another Category
    status = document.find("h3", string="Status")
    if status is not None:
        status = status.find_next_sibling().text
    
    
    # Appending a dictionary to a list
    dicts_to_df.append({"name": "Person", # add the correct variable here
                        "age": 25, # add the correct variable here
                        "birthday": "Birthday", # add the correct variable here
                        "occupation": "Job", # add the correct variable here
                        "hometown": "Town", # add the correct variable here
                        "another_category": "Another Category" # add the correct variable here
        
    })
    

In [32]:
pd.DataFrame(dicts_to_df)

Unnamed: 0,name,age,birthday,occupation,hometown,another_category
0,Person,25,Birthday,Job,Town,Another Category
1,Person,25,Birthday,Job,Town,Another Category
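
Because every field follows the same find-the-label, take-the-next-sibling pattern, the repeated blocks in the loop could be folded into a helper. The sketch below is one possible refactoring, not part of the original solution. `get_field` and `extract_contestant` are hypothetical names, and the inline `sample_html` is a made-up, simplified stand-in for a contestant page so that the example runs without a network request.

```python
from bs4 import BeautifulSoup


def get_field(document, label):
    """Return the text that follows the <h3> with this label, or None if absent."""
    heading = document.find("h3", string=label)
    if heading is None:
        return None
    value = heading.find_next_sibling()
    return value.text.strip() if value is not None else None


def extract_contestant(html_string):
    """Build one row (a dictionary) from a contestant page's HTML."""
    document = BeautifulSoup(html_string, "html.parser")
    name = document.find("h2")
    return {
        "name": name.text.strip() if name is not None else None,
        "age": get_field(document, "Age"),
        "birthday": get_field(document, "Born"),
        "occupation": get_field(document, "Occupation"),
        "hometown": get_field(document, "Hometown"),
        "another_category": get_field(document, "Status"),
    }


# Hypothetical markup for illustration only (the "Born" field is left out on purpose)
sample_html = """
<h2>Jack Fincham</h2>
<div><h3>Age</h3><div>31</div></div>
<div><h3>Occupation</h3><div>Stationary Sales Manager</div></div>
<div><h3>Hometown</h3><div>Kent, England</div></div>
<div><h3>Status</h3><div>Winner</div></div>
"""

row = extract_contestant(sample_html)
print(row)
```

Inside the loop, something like `dicts_to_df.append(extract_contestant(requests.get(wiki_url).text))` would then add one filled-in row per URL, assuming the pages follow the markup pattern above.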


# Bonus!

If you complete any of these bonus challenges, share in #code-sharing on Discord!

## Bonus 1: Inconsistent Formats

Now let's see if you can write Python code that will extract the same info and create a dataframe with the following URLs.

Hint: one of these URLs might have a different web page format, which could cause problems with your code...

In [20]:
tough_wiki_urls = ["https://loveisland.fandom.com/wiki/Jack_Fincham", 
             "https://loveisland.fandom.com/wiki/Olivia_Attwood",
             "https://loveisland.fandom.com/wiki/Justyna_Walczak"]

In [31]:
# empty list that will become a dataframe
dicts_to_df = []


for wiki_url in tough_wiki_urls:
    
    # Request this contestant's page and parse it,
    # so that `document` matches the current URL
    response = requests.get(wiki_url)
    html_string = response.text
    document = BeautifulSoup(html_string, "html.parser")
    
    
    # Age
    age = document.find("h3", string="Age")
    if age is not None:
        age = age.find_next_sibling().text
    
    
    # Birthday
    birthday = document.find("h3", string="Born")
    if birthday is not None:
        birthday = birthday.find_next_sibling().text
    
    
    # Occupation
    job = document.find("h3", string="Occupation")
    if job is not None:
        job = job.find_next_sibling().text
    
    
    # Hometown
    town = document.find("h3", string="Hometown")
    if town is not None:
        town = town.find_next_sibling().text
    
    
    # Another Category
    status = document.find("h3", string="Status")
    if status is not None:
        status = status.find_next_sibling().text
    
    
    # Appending a dictionary to a list
    dicts_to_df.append({"name": "Person", # add the correct variable here
                        "age": 25, # add the correct variable here
                        "birthday": "Birthday", # add the correct variable here
                        "occupation": "Job", # add the correct variable here
                        "hometown": "Town", # add the correct variable here
                        "another_category": "Another Category" # add the correct variable here
        
    })
    

In [32]:
pd.DataFrame(dicts_to_df)

Unnamed: 0,name,age,birthday,occupation,hometown,another_category
0,Person,25,Birthday,Job,Town,Another Category
1,Person,25,Birthday,Job,Town,Another Category
2,Person,25,Birthday,Job,Town,Another Category


## Bonus 2: Scrape It All!

Now, for the final boss challenge, see if you can scrape info for ALL of the Love Island contestants.

You should be able to find URLs for all the contestants here: https://loveisland.fandom.com/wiki/Category:Islanders 

However, not all URLs will be visible from this page, so you might need to find a way to navigate to other pages...

The code below is included to help you, in case you need it. This is one way of looping through letters A-Z.

In [4]:
def range_char(start, stop):
    return (chr(n) for n in range(ord(start), ord(stop) + 1))
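
To see what `range_char` produces, here is a small usage sketch. It repeats the function so that it runs on its own, and it compares the result with `string.ascii_uppercase`, a standard-library constant that could be used as an alternative way of looping through A-Z.

```python
import string


def range_char(start, stop):
    # ord() gives a character's code point and chr() converts it back,
    # so this yields every character from start to stop, inclusive
    return (chr(n) for n in range(ord(start), ord(stop) + 1))


letters = list(range_char("A", "E"))
print(letters)  # ['A', 'B', 'C', 'D', 'E']

# The full A-Z range matches the standard library's uppercase alphabet
print("".join(range_char("A", "Z")) == string.ascii_uppercase)  # True
```

Note that `range_char` returns a generator, so it can be looped over only once. Wrap it in `list()` if the letters are needed again.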

In [5]:
full_urls = []

# Request the category listing starting at each letter A-Z
# and collect the links to the contestant pages
for letter in range_char("A", "Z"):
    response = requests.get("https://loveisland.fandom.com/wiki/Category:Islanders?from=" + letter)
    html_string = response.text
    document = BeautifulSoup(html_string, "html.parser")
    
    names = document.find_all("a", attrs={"class":"category-page__member-link"})
    full_urls += [name['href'] for name in names]
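
Each `?from=` request returns a page of members starting at that letter, so neighbouring pages may overlap and `full_urls` may contain repeats. The sketch below shows one way to drop duplicates while keeping their order, and to build absolute URLs with the standard library's `urljoin`. The `example_hrefs` list is a made-up stand-in for `full_urls`, and `BASE_URL` and `unique_in_order` are hypothetical names.

```python
from urllib.parse import urljoin

BASE_URL = "https://loveisland.fandom.com"

# Made-up stand-in for full_urls, with one repeated link
example_hrefs = [
    "/wiki/Jack_Fincham",
    "/wiki/Olivia_Attwood",
    "/wiki/Jack_Fincham",
]


def unique_in_order(items):
    # Dictionaries keep insertion order, so this drops repeats but keeps order
    return list(dict.fromkeys(items))


absolute_urls = [urljoin(BASE_URL, href) for href in unique_in_order(example_hrefs)]
print(absolute_urls)
```

Whether duplicates actually occur depends on how the category pages are paginated, so it is worth comparing `len(full_urls)` with `len(set(full_urls))` before deciding to deduplicate.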

In [6]:
dicts_to_df = []


for url in full_urls:
    response = requests.get("https://loveisland.fandom.com" + url)
    html_string = response.text

    document = BeautifulSoup(html_string, "html.parser")
    
    # Name
    name = document.find("h2", attrs={"data-source":"Name"}).text
    
    
    # Age
    age = document.find("h3", string="Age")
    if age != None:
        age = age.find_next_sibling().text
    
    
     # Birthday
    birthday = document.find("h3", string="Born")
    if birthday != None:
        birthday = birthday.find_next_sibling().text
    
    
    # Occupation
    job = document.find("h3", string="Occupation")
    if job != None:
        job = job.find_next_sibling().text
    
    
    # Hometown
    town = document.find("h3", string="Hometown")
    if town != None:
        town = town.find_next_sibling().text
    
    
    # Another Category
    status = document.find("h3", string="Status")
    if status != None:
        status = status.find_next_sibling().text
    
    
    # Appending a dictionary to a list
    dicts_to_df.append({"name": name, 
                        "age": age,
                        "birthday": birthday,
                        "occupation": job,
                        "hometown": town, 
                        "another_category": status,
        
    })

In [7]:
pd.DataFrame(dicts_to_df)

Unnamed: 0,name,age,birthday,occupation,hometown,another_category
0,Aaron Deacon Shaw,27,,Former soldier and Model,"Gold Coast, Queensland",3rd Place
1,Aaron Francis,26,"September 29, 1996",Luxury events host,"London, England",Dumped
2,Aaron Owen,26,,Cake Decorator,"Henderson, Nevada",Dumped
3,Aaron Simpson,26,"March 7, 1997",Footballer,"Devon, England",Dumped
4,Aaron Waters,25,"September 12, 1997",Model,"Perth, Western Australia",Runner-Up
...,...,...,...,...,...,...
941,Zara McDermott,26,"December 14, 1996",Government Advisor,"Essex, England",Dumped
942,Zeta Morrison,29,"May 24, 1993",Babysitter and Model,"Surrey, England",Winner
943,Ziggy Martin,28,,Model,Aruba,3rd Place
944,Zoe Basia Brown,32,"November 26, 1990",High fashion model,"South London, England",Dumped
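
The scraped values are all text, so a column such as `age` holds strings like `'27'` rather than numbers. As an optional follow-up, the sketch below converts two rows copied from the output above into numeric and date types. The `scraped` dataframe is a small stand-in for the full result, and the date format string assumes birthdays are written like `September 29, 1996`.

```python
import pandas as pd

# Two rows copied from the output above; scraped values arrive as strings
scraped = pd.DataFrame([
    {"name": "Aaron Deacon Shaw", "age": "27", "birthday": None},
    {"name": "Aaron Francis", "age": "26", "birthday": "September 29, 1996"},
])

# errors="coerce" turns anything unparseable into NaN / NaT instead of raising
scraped["age"] = pd.to_numeric(scraped["age"], errors="coerce")
scraped["birthday"] = pd.to_datetime(
    scraped["birthday"], format="%B %d, %Y", errors="coerce"
)

print(scraped.dtypes)
```

With typed columns, operations such as `scraped["age"].mean()` or sorting by birthday work as expected.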
