# Web Scraping Lab

This notebook contains web scraping exercises to help you practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each URL and take a look at its source through Chrome DevTools. You'll need to identify the HTML tags, special class names etc. used for the HTML content you are expected to extract.
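
As a minimal sketch of the first tip, the helpers below turn a status code into a readable label and flag non-2xx codes. The names `describe_status` and `is_success` are hypothetical, and the snippet only uses the standard library so it runs without a network connection. Note that a 2xx code only tells you the server reported success; you still need to inspect the response text.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short, readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Codes outside the standard registry have no official phrase.
        phrase = "Unknown status"
    return f"{status_code} {phrase}"


def is_success(status_code):
    """True for 2xx codes, i.e. the server reports success."""
    return 200 <= status_code < 300


# Typical use with requests (not executed here):
#   response = requests.get(url)
#   if not is_success(response.status_code):
#       print("Request failed:", describe_status(response.status_code))
print(describe_status(200))
print(describe_status(404))
```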

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup`, `pandas` and `pprint` are imported for you. If you prefer to use additional libraries, feel free to import them.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from pprint import pprint

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
#Declare a variable that stores the url
devs_url = "https://github.com/trending/developers" 

#'devs_response' stores the full response (status code, headers and body)
devs_response = requests.get(devs_url)

#Parse the page HTML into a BeautifulSoup object
devs_soup = BeautifulSoup(devs_response.text, "html.parser")

In [7]:
#Test
print("Stored text in devs_soup:\n", devs_soup)

Stored text in devs_soup:
 
<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-e7318add1f7e055d040edb0f75aaa0ba.css" integrity="sha512-67V2J9Se2CifJlftk9/cExHGvxd7N9b9EdGnQEpszu99Ogeecilu9jIDxoCkx3zNLfB9ArraXW0J03qyVmN0Uw==" media="all" rel="stylesheet">
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-294181adec18ed639e160b96b45d17ac.css" integrity="sha512-MR

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
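
One possible way to approach step 3 is sketched below. The function name `clean_name` and the sample string are hypothetical. Unlike the sample output above, this variant keeps single spaces inside the full name instead of removing them, so adapt it if you need an exact match.

```python
def clean_name(raw_text):
    """Collapse every run of spaces, tabs and newlines into one space."""
    # str.split() with no argument splits on any whitespace and drops
    # empty pieces, so joining the pieces leaves single spaces only.
    return " ".join(raw_text.split())


# Hypothetical raw text, shaped like the text of one developer element.
raw = "\n      rasbt\n        (Sebastian Raschka)\n  "
print(clean_name(raw))
```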

In [154]:
#Find the HTML container with devs info
devs_html_container = devs_soup.find("ol", {"class": "list-style-none"})

In [155]:
#Test
print("HTML container with Github devs:\n\n", devs_html_container)

HTML container with Github devs:

 <ol class="list-style-none">
<li class="d-sm-flex flex-justify-between border-bottom border-gray-light py-3" id="pa-rasbt">
<div class="d-flex">
<a class="text-center text-small text-gray mx-2" href="#pa-rasbt" style="width: 16px;">1</a>
<div class="mx-2">
<a class="d-inline-block" data-hovercard-type="user" data-hovercard-url="/hovercards?user_id=5618407" data-octo-click="hovercard-link-click" data-octo-dimensions="link_type:self" href="/rasbt"><img alt="@rasbt" class="rounded-1" height="48" src="https://avatars3.githubusercontent.com/u/5618407?s=96&amp;v=4" width="48"/></a>
</div>
<div class="mx-2">
<h2 class="f3 text-normal">
<a href="/rasbt">
                          rasbt
                            <span class="text-gray text-bold">
                              (Sebastian Raschka)
                            </span>
</a> </h2>
<a class="repo-snipit css-truncate" data-ga-click="Explore, go to repository, location:trending" href="/rasbt/python-m

In [156]:
#Filter the container
devs_cards = devs_html_container.find_all("h2", {"class":"f3 text-normal"})

In [157]:
#Test
print("'devs_cards' type is", type(devs_cards), "and its content:\n\n", devs_cards)

'devs_cards' type is <class 'bs4.element.ResultSet'> and its content:

 [<h2 class="f3 text-normal">
<a href="/rasbt">
                          rasbt
                            <span class="text-gray text-bold">
                              (Sebastian Raschka)
                            </span>
</a> </h2>, <h2 class="f3 text-normal">
<a href="/microsoft">
                          microsoft
                            <span class="text-gray text-bold">
                              (Microsoft)
                            </span>
</a> </h2>, <h2 class="f3 text-normal">
<a href="/jlevy">
                          jlevy
                            <span class="text-gray text-bold">
                              (Joshua Levy)
                            </span>
</a> </h2>, <h2 class="f3 text-normal">
<a href="/imhuster">
                          imhuster
                            <span class="text-gray text-bold">
                              (Aven Liu)
                            

In [158]:
#ResultSet type is a subclass of a list, so 'devs_cards' is iterable
dev_list = [card.find("a").text for card in devs_cards]

#Remove all spaces
dev_list = [item.replace(" ", "") for item in dev_list]

#Remove newlines from beginning and end 
dev_list = [item.strip("\n ") for item in dev_list]

#Replace the newlines between username and full name with a single space
dev_list = [item.replace("\n\n", " ") for item in dev_list]

#Test
pprint(dev_list)

['rasbt (SebastianRaschka)',
 'microsoft (Microsoft)',
 'jlevy (JoshuaLevy)',
 'imhuster (AvenLiu)',
 'jackfrued (骆昊)',
 'CarGuo (ShuyuGuo)',
 'google (Google)',
 'github (GitHub)',
 'facebook (Facebook)',
 'symfony (Symfony)',
 'sfyc23 (ThunderBouble)',
 'SandboxEscaper',
 'pipe-dream',
 'apache (TheApacheSoftwareFoundation)',
 'tensorflow',
 'NVIDIA (NVIDIACorporation)',
 'MisterBooo (程序员吴师兄)',
 'mathieudutour (MathieuDutour)',
 'torvalds (LinusTorvalds)',
 'sindresorhus (SindreSorhus)',
 'entropic-dev (Entropic)',
 'luruke (LuigiDeRosa)',
 'google-research (GoogleAIResearch)',
 'zeit (ZEIT)',
 'sveltejs (Svelte)']


#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [159]:
#Declare a variable that stores the url
repos_url = "https://github.com/trending/python?since=daily"

In [160]:
#'repos_response' stores the full response (status code, headers and body)
repos_response = requests.get(repos_url)

#Parse the page HTML into a BeautifulSoup object
repos_soup = BeautifulSoup(repos_response.text, "html.parser")

#Find the HTML container with repos info
repos_html_container = repos_soup.find("ol", {"class": "repo-list"})

#Filter the container
repos_cards = repos_html_container.find_all("h3")
repos_list = [card.find("a").text for card in repos_cards]

#Clean the output
repos_list = [item.strip("\n") for item in repos_list]

#Test
pprint(repos_list)

['sfyc23 / EverydayWechat',
 'quantumblacklabs / kedro',
 'TheAlgorithms / Python',
 'google-research / football',
 'donnemartin / system-design-primer',
 'vinta / awesome-python',
 'chrishutchinson / train-departure-screen',
 'tensorflow / models',
 'python / cpython',
 'jhjacobsen / invertible-resnet',
 'apachecn / AiLearning',
 'facebookresearch / fair_self_supervision_benchmark',
 'notadamking / Bitcoin-Trader-RL',
 'keras-team / keras',
 'lukemelas / EfficientNet-PyTorch',
 'xFreed0m / RDPassSpray',
 'public-apis / public-apis',
 'fighting41love / funNLP',
 'OWASP / CheatSheetSeries',
 'ageitgey / face_recognition',
 'ytdl-org / youtube-dl',
 'sundowndev / PhoneInfoga',
 'ConnorJL / GPT2',
 'arcelien / pba',
 'cwerling / psptool']


#### Display all the image links from Walt Disney wikipedia page

In [164]:
#Declare a variable that stores the url
disney_url = "https://en.wikipedia.org/wiki/Walt_Disney"

In [179]:
#Store the full response (status code, headers and body)
disney_response = requests.get(disney_url)

#Parse the page HTML into a BeautifulSoup object
disney_soup = BeautifulSoup(disney_response.text, "html.parser")

#Find the related HTML
disney_html_container = disney_soup.find("div", {"class": "mw-parser-output"})

#Filter the images containers
disney_images = disney_html_container.find_all("img")

#Extract the absolute links with 'get' (links are 'src' attributes)
disney_images_links = ["https:" + image.get("src") for image in disney_images]

#Test
pprint(disney_images_links)

['https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/7/71/Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg/170px-Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg',
 'https://upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-S

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page

In [180]:
"""
Now, I will use generic variable names in common code so it can be reused in other exercises
"""

#1. Declare a variable that stores the url
url = "https://en.wikipedia.org/wiki/Python"
root = "https://en.wikipedia.org"

#2. Store the full response (status code, headers and body)
url_response = requests.get(url)

#3. BeautifulSoup
soup = BeautifulSoup(url_response.text, "html.parser")

In [183]:
"""
Specific code
"""

#Find the related HTML container
python_html_container = soup.find("div", {"class": "mw-content-ltr"})

#Filter the urls containers
python_urls_containers = python_html_container.find_all("a")

#Create a list with the urls (skip <a> tags that have no 'href' attribute)
python_urls = [url_container.get("href") for url_container in python_urls_containers]
python_urls = [url for url in python_urls if url]

#Remove links to sections (start with "#")
python_urls = [url for url in python_urls if url[0] != "#"]

#Convert relative links to absolute links
python_urls = [url if url[0] != "/" else root+url for url in python_urls]

pprint(python_urls)

['https://en.wiktionary.org/wiki/Python',
 'https://en.wiktionary.org/wiki/python',
 'https://en.wikipedia.org/w/index.php?title=Python&action=edit&section=1',
 'https://en.wikipedia.org/wiki/Pythonidae',
 'https://en.wikipedia.org/wiki/Python_(genus)',
 'https://en.wikipedia.org/w/index.php?title=Python&action=edit&section=2',
 'https://en.wikipedia.org/wiki/Python_(mythology)',
 'https://en.wikipedia.org/wiki/Python_of_Aenus',
 'https://en.wikipedia.org/wiki/Python_(painter)',
 'https://en.wikipedia.org/wiki/Python_of_Byzantium',
 'https://en.wikipedia.org/wiki/Python_of_Catana',
 'https://en.wikipedia.org/w/index.php?title=Python&action=edit&section=3',
 'https://en.wikipedia.org/wiki/Python_(film)',
 'https://en.wikipedia.org/wiki/Pythons_2',
 'https://en.wikipedia.org/wiki/Monty_Python',
 'https://en.wikipedia.org/wiki/Python_(Monty)_Pictures',
 'https://en.wikipedia.org/w/index.php?title=Python&action=edit&section=4',
 'https://en.wikipedia.org/wiki/Python_(programming_language)'
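
As an alternative to prefixing the root by hand, the standard library's `urllib.parse.urljoin` can resolve relative and protocol-relative links against the page URL. The sketch below is one way to do it; the function name `absolute_links` and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin

# Assumed base: the page the links were scraped from.
BASE_URL = "https://en.wikipedia.org/wiki/Python"


def absolute_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url, skipping missing values and in-page anchors."""
    links = []
    for href in hrefs:
        # Skip <a> tags without an href (None) and section links ("#...").
        if not href or href.startswith("#"):
            continue
        links.append(urljoin(base_url, href))
    return links


# Hypothetical href values covering the cases found on a typical page.
sample_hrefs = [
    "#Snakes",
    None,
    "/wiki/Monty_Python",
    "//upload.wikimedia.org/logo.png",
    "https://en.wiktionary.org/wiki/python",
]
resolved = absolute_links(sample_hrefs)
print(resolved)
```

Because `urljoin` also handles links that start with `//`, the same helper could be reused for image `src` attributes.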

#### Number of Titles that have changed in the United States Code since its last release point 

In [184]:
#1. Declare a variable that stores the url
url = "http://uscode.house.gov/download/download.shtml"

#2. Store the full response (status code, headers and body)
url_response = requests.get(url)

#3. BeautifulSoup
soup = BeautifulSoup(url_response.text, "html.parser")

In [185]:
#Find the related HTML container
us_law_container = soup.find("div", {"class": "uscitemlist"})

#Filter the titles in bold
us_law_titles = us_law_container.find_all("div", {"class":"usctitlechanged"})

#Create a list with the titles in bold
us_changed_laws = [title.text.strip("\n ") for title in us_law_titles]

pprint(us_changed_laws)

['Title 25 - Indians', 'Title 42 - The Public Health and Welfare']
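
The exercise asks for the number of changed titles, so the list still needs to be counted. The sketch below shows this on a small, hypothetical HTML fragment: only the class names `uscitemlist` and `usctitlechanged` come from the code above, while `usctitle` and the markup itself are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page is larger and may be structured differently.
sample_html = """
<div class="uscitemlist">
  <div class="usctitle">Title 1 - General Provisions</div>
  <div class="usctitlechanged">Title 25 - Indians</div>
  <div class="usctitlechanged">Title 42 - The Public Health and Welfare</div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Class matching works on whole class names, so "usctitle" is not matched here.
changed = sample_soup.find_all("div", {"class": "usctitlechanged"})
changed_titles = [div.get_text(strip=True) for div in changed]
changed_count = len(changed_titles)

print(changed_count, changed_titles)
```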


#### A Python list with the top ten FBI's Most Wanted names 

In [186]:
#1. Declare a variable that stores the url
url = "https://www.fbi.gov/wanted/topten"

#2. Store the full response (status code, headers and body)
url_response = requests.get(url)

#3. BeautifulSoup
soup = BeautifulSoup(url_response.text, "html.parser")

In [190]:
#Find the related HTML container
most_wanted_container = soup.find("ul", {"class": "full-grid wanted-grid-natural infinity castle-grid-block-xs-2 castle-grid-block-sm-2castle-grid-block-md-3 castle-grid-block-lg-5 dt-grid"})

#Filter the names container
most_wanted_info = most_wanted_container.find_all("h3", {"class":"title"})

#Create a list with most wanted names
most_wanted_names = [name.text for name in most_wanted_info]

#Clean the list
most_wanted_names = [name.strip("\n").title() for name in most_wanted_names]

pprint(most_wanted_names)

['Eugene Palmer',
 'Santiago Villalba Mederos',
 'Robert William Fisher',
 'Bhadreshkumar Chetanbhai Patel',
 'Arnoldo Jimenez',
 'Alejandro Rosales Castillo',
 'Yaser Abdel Said',
 'Jason Derek Brown',
 'Rafael Caro-Quintero',
 'Alexis Flores']
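
Passing the complete class string to `find` only matches while the `class` attribute stays exactly the same, so the lookup above is fragile. A CSS selector that relies on one or two class names is usually more tolerant. The sketch below uses a hypothetical, shortened fragment; the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only some class names are taken from the code above.
sample_html = """
<ul class="full-grid wanted-grid-natural dt-grid">
  <li><h3 class="title"><a href="/wanted/topten/a">EUGENE PALMER</a></h3></li>
  <li><h3 class="title"><a href="/wanted/topten/b">ALEXIS FLORES</a></h3></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Match any <h3 class="title"> inside a <ul> that has the class
# "wanted-grid-natural", regardless of the other classes on the <ul>.
title_tags = sample_soup.select("ul.wanted-grid-natural h3.title")
names = [tag.get_text(strip=True).title() for tag in title_tags]

print(names)
```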


####  20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [215]:
#1. Declare a variable that stores the url
url = "https://www.emsc-csem.org/Earthquake/"

#2. Store the full response (status code, headers and body)
url_response = requests.get(url)

#3. BeautifulSoup
soup = BeautifulSoup(url_response.text, "html")

In [314]:
#Find the related HTML container
earthquakes_container = soup.find("tbody", {"id":"tbody"})

#Lists of dates and times
eq_dates_and_times = earthquakes_container.find_all("a")
eq_dates_and_times = [date_and_time.string for date_and_time in eq_dates_and_times] #Clean tags
eq_dates_and_times = [date_and_time.split("\xa0\xa0\xa0") for date_and_time in eq_dates_and_times] #Clean '\xa0'

#List of dates from 'eq_dates_and_times'
eq_dates = [date_and_time[0] for date_and_time in eq_dates_and_times]

#List of times from 'eq_dates_and_times'
eq_times = [date_and_time[1] for date_and_time in eq_dates_and_times]

#List of latitudes and longitudes
eq_lats_and_lons = earthquakes_container.find_all("td", {"class":"tabev1"})
eq_lats_and_lons = [lat_or_lon.string for lat_or_lon in eq_lats_and_lons] #Clean tags
eq_lats_and_lons = [lat_or_lon.replace("\xa0", "") for lat_or_lon in eq_lats_and_lons] #Clean '\xa0'

eq_lats = []
eq_lons = []

for index in range(len(eq_lats_and_lons)): #Walk through the whole list of latitudes and longitudes
    if index%2 == 0: #Append even index items to latitudes list
        eq_lats.append(eq_lats_and_lons[index])
    else: #Append odd index items to longitudes list
        eq_lons.append(eq_lats_and_lons[index])

#List of region names
eq_regions = earthquakes_container.find_all('td', {"class":"tb_region"})
eq_regions = [region.string.title() for region in eq_regions] #Clean tags
eq_regions = [region.replace("\xa0", "") for region in eq_regions] #Clean '\xa0'

eq_data = list(zip(eq_dates, eq_times, eq_lats, eq_lons, eq_regions))
eq_data_df = pd.DataFrame(eq_data, columns=["Date", "Time", "Latitude", "Longitude", "Region"])

#Test
eq_data_df

| | Date | Time | Latitude | Longitude | Region |
|---|---|---|---|---|---|
| 0 | 2019-06-08 | 09:03:19.5 | 38.28 | 41.0 | Eastern Turkey |
| 1 | 2019-06-08 | 08:41:04.5 | 18.81 | 155.24 | Hawaii Region, Hawaii |
| 2 | 2019-06-08 | 08:30:28.9 | 39.16 | 27.63 | Western Turkey |
| 3 | 2019-06-08 | 08:25:45.0 | 15.15 | 94.63 | Off Coast Of Oaxaca, Mexico |
| 4 | 2019-06-08 | 08:23:53.6 | 38.88 | 30.07 | Western Turkey |
| 5 | 2019-06-08 | 08:20:24.0 | 32.64 | 72.03 | Offshore Valparaiso, Chile |
| 6 | 2019-06-08 | 08:10:58.0 | 16.79 | 100.91 | Offshore Guerrero, Mexico |
| 7 | 2019-06-08 | 07:52:00.0 | 6.21 | 80.99 | Near Coast Of Northern Peru |
| 8 | 2019-06-08 | 07:12:31.0 | 16.75 | 95.03 | Oaxaca, Mexico |
| 9 | 2019-06-08 | 07:10:46.9 | 52.63 | 174.43 | Andreanof Islands, Aleutian Is. |
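
The date/time split and the latitude/longitude separation can also be written with `str.split()` and list slicing. This is a minimal sketch on hypothetical cell texts shaped like the scraped values; it relies on the fact that Python treats `\xa0` (non-breaking space) as whitespace in `split()` and `strip()`.

```python
# Hypothetical cell texts, shaped like the values scraped in the cell above.
raw_datetimes = [
    "2019-06-08\xa0\xa0\xa009:03:19.5",
    "2019-06-08\xa0\xa0\xa008:41:04.5",
]
raw_coords = ["38.28\xa0", "41.00\xa0", "18.81\xa0", "155.24\xa0"]  # lat, lon, lat, lon

# split() with no argument splits on any whitespace run, including \xa0.
date_time_pairs = [text.split() for text in raw_datetimes]
dates = [pair[0] for pair in date_time_pairs]
times = [pair[1] for pair in date_time_pairs]

# Even positions hold latitudes, odd positions hold longitudes.
coords = [text.strip() for text in raw_coords]
lats = coords[0::2]
lons = coords[1::2]

rows = [
    {"Date": d, "Time": t, "Latitude": lat, "Longitude": lon}
    for d, t, lat, lon in zip(dates, times, lats, lons)
]
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`.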


#### Display the date, days, title, city, country of next 25 hackathon events as a Pandas dataframe table

In [None]:
# This is the url you will scrape in this exercise
url = 'https://hackevents.co/hackathons'

In [None]:
#your code

#### Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** to handle account names that are not found.
<br>***Hint:*** the program should count the number of tweets for any provided account
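
The try/except pattern could look roughly like the sketch below. It is only an illustration: the helper name `fetch_or_none` is hypothetical, and it uses the standard library's `urllib` (listed in the resources above) instead of `requests` so that it has no third-party dependency. With `requests`, the comparable exception to catch would be `requests.exceptions.RequestException`.

```python
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=10):
    """Return the page body as text, or None when the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        # Raised for HTTP error codes such as 404 (for example, an unknown account).
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"HTTP error {error.code} for {url}")
    except urllib.error.URLError as error:
        # Raised when the resource cannot be reached at all.
        print(f"Could not reach {url}: {error.reason}")
    return None
```

The caller can then test the result for `None` before trying to parse it.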

In [None]:
# This is the url you will scrape in this exercise 
# You will need to append the account name (handle) to this url
url = 'https://twitter.com/'

In [None]:
#your code

#### Number of followers of a given twitter account

You will need to include a ***try/except block*** in case the account name is not found.
<br>***Hint:*** the program should count the followers for any provided account

In [None]:
# This is the url you will scrape in this exercise 
# You will need to append the account name (handle) to this url
url = 'https://twitter.com/'

In [None]:
#your code

#### List all language names and number of related articles in the order they appear in wikipedia.org

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [None]:
#your code

#### A list with the different kinds of datasets available in data.gov.uk

In [None]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [None]:
#your code 

#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [None]:
#your code

### BONUS QUESTIONS

#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to append the account name (handle) to this url
url = 'https://twitter.com/'

In [None]:
# your code

#### IMDB's Top 250 data (movie name, initial release, director name and stars) as a pandas dataframe

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [None]:
# your code

#### Movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [None]:
#your code

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'
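
City names with spaces or accents need to be URL-encoded, which plain string concatenation does not do. The sketch below builds the query with `urllib.parse.urlencode` and picks the requested fields out of a decoded JSON response. The function names, the placeholder key and the sample payload are hypothetical, and the payload layout is an assumption about the API's response that you should confirm against the documentation linked above.

```python
from urllib.parse import urlencode

API_ROOT = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key, units="metric"):
    """Build the request URL with properly encoded query parameters."""
    query = urlencode({"q": city, "APPID": api_key, "units": units})
    return f"{API_ROOT}?{query}"


def summarize_weather(payload):
    """Pick the requested fields out of a decoded JSON response (assumed layout)."""
    return {
        "temperature": payload["main"]["temp"],
        "wind_speed": payload["wind"]["speed"],
        "weather": payload["weather"][0]["main"],
        "description": payload["weather"][0]["description"],
    }


# Hypothetical payload, standing in for response.json().
sample_payload = {
    "weather": [{"main": "Clear", "description": "clear sky"}],
    "main": {"temp": 21.5},
    "wind": {"speed": 3.6},
}

print(build_weather_url("New York", "YOUR_API_KEY"))
print(summarize_weather(sample_payload))
```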

In [None]:
# your code

#### Book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
#your code