# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You'll need to identify the HTML tags, special class names, etc. used in the HTML content you are expected to extract.
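
The first tip (checking the status code before parsing) can be wrapped in a small helper. This is a minimal sketch and not part of the original lab: `fetch_soup` and its `get` parameter are hypothetical names, and `html.parser` is used here instead of `lxml` only to avoid an extra dependency.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, get=requests.get, timeout=10):
    """Download `url` and return the parsed HTML, raising on HTTP errors."""
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Calling `fetch_soup('https://github.com/trending/developers')` would then either return a parsed document or fail loudly, instead of silently parsing an error page.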

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. If you prefer to use additional libraries, feel free to do so.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [663]:
# This is the url you will scrape in this exercise
web_address = 'https://github.com/trending/developers'
source = requests.get(web_address).text
soup = BeautifulSoup(source, 'lxml')

In [672]:
articles = soup.find_all('div', class_='Box')

In [675]:
print(soup.get_text())

Trending  developers on GitHub today · GitHub
Skip to content
Sign up
Why GitHub?
Features →
Code review
Project management
Integrations
Actions
Packages
Security
Team management
Hosting
Customer stories →
Security →
Team
Enterprise
Explore
Explore GitHub →
Learn & contribute
Topics
Collections
Trending
Learning Lab
Open source guides
Connect with others
Events
Community forum
GitHub Education
Marketplace
Pricing
Plans →
Compare plans
Contact Sales
Nonprofit →
Education →
Search
All GitHub
↵
Jump to
↵
No suggested jump to results

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to remove extra whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
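
The cleaning step in the instructions can be sketched on a small inline snippet. The HTML below is a hypothetical fragment that only imitates the `h3 lh-condensed` headings; the real page markup may differ and change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the developer-name headings
sample_html = """
<article><h1 class="h3 lh-condensed">
    <a href="/octocat">
        Mona   Lisa
    </a></h1></article>
<article><h1 class="h3 lh-condensed"><a href="/hubot">Hubot</a></h1></article>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
headings = sample_soup.find_all("h1", class_="h3 lh-condensed")

# split() without arguments splits on any run of whitespace, including "\n",
# so joining the pieces with one space collapses the padding
names = [" ".join(h.get_text().split()) for h in headings]
print(names)
```

Storing the result in a list, rather than printing each name, matches the output format requested above.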

In [671]:
for article in articles:
    users = article.find_all('div', class_="col-md-6")
    for user in users:
        # Some col-md-6 blocks hold no matching <h1>; find() then returns None
        heading = user.find('h1', class_="h3 lh-condensed")
        if heading is None:
            continue
        print(heading.text.strip())

Thibault Duplessis
Jake Archibald
Stephen Roller
David Rodríguez
Shawn Tabrizi
Steven Allen
Taylor Otwell
Sebastián Ramírez
Leo Di Donato
Felix Angelov
Agniva De Sarker
Tyler Neely
Klaus Post
James Agnew
Dustin L. Howett
Anmol Sethi
Yiyi Wang
Philipp Oppermann
Ben Manes
Stefan Prodan
Dries Vints
John Keiser
Carlos Alexandro Becker
Ines Montani
Brian Flad


#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [19]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [20]:
articles = soup.find_all('div', class_='Box')

In [37]:
repos = []
for article in articles:
    rows = article.find_all('article', class_="Box-row")
    for row in rows:
        # find() returns None when a row has no matching <h1>
        heading = row.find('h1', class_="h3 lh-condensed")
        if heading is not None:
            repos.append(heading.text)

In [96]:
for r in repos:
    print(''.join(r.split()))


ranjian0/building_tool
google-research/big_transfer
emadboctorx/yolov3-keras-tf2
TurboWay/spiderman
sherlock-project/sherlock
xillwillx/skiptracer
sebastianruder/NLP-progress
anandpawara/Real_Time_Image_Animation
3b1b/manim
espnet/espnet
Rapptz/discord.py
yunjey/pytorch-tutorial
CorentinJ/Real-Time-Voice-Cloning
home-assistant/core
willmcgugan/rich
corpnewt/gibMacOS
ytdl-org/youtube-dl
minimaxir/big-list-of-naughty-strings
gunthercox/ChatterBot
MrMimic/data-scientist-roadmap
LandGrey/SpringBootVulExploit
benbusby/whoogle-search
corpnewt/ProperTree
ianzhao05/textshot
ankitects/anki


#### Display all the image links from the Walt Disney Wikipedia page.

In [97]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [140]:
links = soup.find_all('img')
for link in links:
    print(link['src'])

//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png
//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG
//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/7/71/Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg/170px-Walt_Disney_and_his_cartoon
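
The `src` values start with `//`, which makes them protocol-relative rather than complete URLs. A minimal sketch of turning them into absolute links follows; the HTML fragment is a hypothetical sample, and the `<img>` without a `src` is included only to show why `img.get("src")` is safer than `img["src"]`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"
# Hypothetical fragment: one protocol-relative src, one <img> with no src at all
sample_html = """
<img src="//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG">
<img alt="placeholder without a source">
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

image_urls = []
for img in sample_soup.find_all("img"):
    src = img.get("src")  # returns None instead of raising KeyError
    if src:
        # urljoin borrows the scheme (https) from the page URL
        image_urls.append(urljoin(page_url, src))
print(image_urls)
```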

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page.

In [141]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Python'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [151]:
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])

#mw-head
#p-search
https://en.wiktionary.org/wiki/Python
https://en.wiktionary.org/wiki/python
#Snakes
#Ancient_Greece
#Media_and_entertainment
#Computing
#Engineering
#Roller_coasters
#Vehicles
#Weaponry
#People
#Other_uses
#See_also
/w/index.php?title=Python&action=edit&section=1
/wiki/Pythonidae
/wiki/Python_(genus)
/w/index.php?title=Python&action=edit&section=2
/wiki/Python_(mythology)
/wiki/Python_of_Aenus
/wiki/Python_(painter)
/wiki/Python_of_Byzantium
/wiki/Python_of_Catana
/w/index.php?title=Python&action=edit&section=3
/wiki/Python_(film)
/wiki/Pythons_2
/wiki/Monty_Python
/wiki/Python_(Monty)_Pictures
/w/index.php?title=Python&action=edit&section=4
/wiki/Python_(programming_language)
/wiki/CPython
/wiki/CMU_Common_Lisp
/wiki/PERQ#PERQ_3
/w/index.php?title=Python&action=edit&section=5
/w/index.php?title=Python&action=edit&section=6
/wiki/Python_(Busch_Gardens_Tampa_Bay)
/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)
/wiki/Python_(Efteling)
/w/index.php?title=Python&action=edi
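
The printed `href` values mix same-page anchors (`#...`), site-relative paths (`/wiki/...`) and full URLs. One possible way to turn them into a uniform list is sketched below; the three sample values are copied from the output above, and the choice to drop same-page anchors is an assumption about what counts as a link for this exercise.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/Python"
hrefs = ["#mw-head", "https://en.wiktionary.org/wiki/Python", "/wiki/Pythonidae"]

# Skip same-page anchors, then resolve everything else against the page URL;
# urljoin leaves links that are already absolute unchanged
absolute_links = [urljoin(base_url, h) for h in hrefs if not h.startswith("#")]
print(absolute_links)
```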

#### Find the number of titles that have changed in the United States Code since its last release point.

In [153]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [158]:
article = soup.find_all('div', class_='usctitlechanged')

In [160]:
len(article)

2

#### Create a Python list with the names of the FBI's top ten Most Wanted.

In [161]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [193]:
articles = soup.find_all('li', class_='portal-type-person castle-grid-block-item')

In [197]:
for article in articles:
    print(article.find('h3', class_='title').text)


RAFAEL CARO-QUINTERO


ROBERT WILLIAM FISHER


BHADRESHKUMAR CHETANBHAI PATEL


ALEJANDRO ROSALES CASTILLO


ARNOLDO JIMENEZ


JASON DEREK BROWN


YASER ABDEL SAID


ALEXIS FLORES


EUGENE PALMER


SANTIAGO VILLALBA MEDEROS
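
The exercise asks for a Python list, and the blank lines in the output suggest that each title is padded with line breaks. A minimal sketch of the clean-up follows; the `raw_titles` values are an assumption about what `.text` returns for each `h3`, with two names taken from the output above.

```python
# Assumed shape of the raw .text values, padded with line breaks
raw_titles = ["\nRAFAEL CARO-QUINTERO\n", "\nROBERT WILLIAM FISHER\n"]

# strip() removes leading and trailing whitespace, including "\n"
wanted_names = [title.strip() for title in raw_titles]
print(wanted_names)
```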



#### Display information on the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe.

In [196]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [199]:
article = soup.find_all('tbody')

In [271]:
table_rows = article[0].find_all('tr')

In [276]:
rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    # Use a distinct loop variable so the outer `tr` is not shadowed
    rows.append([cell.text for cell in cells])
pd.DataFrame(rows, columns=["", "", "", "Datetime", "Lat", "Direction", "Long", "Direction", "Depth", "", "Magnitude", "Region", ""])

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Datetime,Lat,Direction,Long,Direction.1,Depth,Unnamed: 10,Magnitude,Region,Unnamed: 13
0,,,,earthquake2020-05-26 08:52:32.014min ago,51.26,N,16.18,E,1,ML,2.5,POLAND,2020-05-26 08:58
1,,,,earthquake2020-05-26 08:37:50.228min ago,38.24,N,117.81,W,2,ml,2.0,NEVADA,2020-05-26 08:41
2,,,,earthquake2020-05-26 08:23:09.243min ago,46.91,N,9.14,E,2,ML,1.9,SWITZERLAND,2020-05-26 08:32
3,,,,earthquake2020-05-26 08:20:54.245min ago,35.65,N,26.62,E,5,ML,3.0,"CRETE, GREECE",2020-05-26 08:56
4,,,,earthquake2020-05-26 08:14:54.251min ago,17.96,N,66.96,W,6,Md,3.0,PUERTO RICO,2020-05-26 08:44
5,,,,earthquake2020-05-26 08:10:13.856min ago,35.57,N,26.59,E,0,ML,2.9,"CRETE, GREECE",2020-05-26 08:45
6,,,,earthquake2020-05-26 07:57:41.51hr 08min ago,46.78,N,153.21,E,35,mb,4.4,KURIL ISLANDS,2020-05-26 08:57
7,,,,earthquake2020-05-26 07:54:35.51hr 11min ago,46.91,N,9.14,E,1,ML,1.9,SWITZERLAND,2020-05-26 08:00
8,,,,earthquake2020-05-26 07:50:33.81hr 15min ago,22.26,S,67.34,W,160,mb,4.6,"POTOSI, BOLIVIA",2020-05-26 08:31
9,,,,earthquake2020-05-26 07:49:19.81hr 17min ago,38.17,N,117.95,W,1,ml,2.2,NEVADA,2020-05-26 07:53
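
In the output, the `Datetime` cells fuse a label, the timestamp and a relative age (for example `earthquake2020-05-26 08:52:32.014min ago`). A regular expression is one way to pull out the separate date and time that the exercise asks for. This is a sketch on two cells copied from the output; the names `pattern` and `records` are illustrative.

```python
import re

raw_cells = [
    "earthquake2020-05-26 08:52:32.014min ago",
    "earthquake2020-05-26 07:57:41.51hr 08min ago",
]

# Date as YYYY-MM-DD, then whitespace, then time as HH:MM:SS
pattern = re.compile(r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})")

records = []
for cell in raw_cells:
    match = pattern.search(cell)
    if match:
        date, time_of_day = match.groups()
        records.append({"Date": date, "Time": time_of_day})
print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame` to get one column per key.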


#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. Include a ***try/except block*** to handle account names that are not found.
<br>***Hint:*** the program should count the number of tweets for any provided account.

In [360]:
handle = input('Input your account name on Twitter: ')
temp = requests.get('https://twitter.com/' + handle)
bs = BeautifulSoup(temp.text, 'lxml')

try:
    tweet_box = bs.find('li', {'class': 'ProfileNav-item ProfileNav-item--tweets is-active'})
    tweets = tweet_box.find('a').find('span', {'class': 'ProfileNav-value'})
    print(tweets.get('data-count'))
except AttributeError:
    # find() returned None, so the expected profile markup was not on the page
    print('Account name not found...')
  

Input your account name on Twitter: ferenc
2023


'2023'

#### Number of followers of a given Twitter account
Ask the user for the handle (@handle) of a Twitter account. Include a ***try/except block*** to handle account names that are not found.
<br>***Hint:*** the program should count the followers for any provided account.

In [397]:
handle = input('Input your account name on Twitter: ')
temp = requests.get('https://twitter.com/' + handle)
bs = BeautifulSoup(temp.text, 'lxml')

try:
    followers_box = bs.find('li', {'class': 'ProfileNav-item ProfileNav-item--followers'})
    followers = followers_box.find('a').find('span', {'class': 'ProfileNav-value'})
    print(followers.get('data-count'))
except AttributeError:
    # find() returned None, so the expected profile markup was not on the page
    print('Account name not found...')
  

Input your account name on Twitter: ferenc
334


#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [409]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [457]:
languages = soup.find_all('div', class_="central-featured-lang")

In [479]:
for language in languages:
    print(language.strong.text, language.bdi.text)
    
# The layout is circular, so there is no inherent order

English 6 085 000+
Español 1 601 000+
日本語 1 208 000+
Deutsch 2 436 000+
Русский 1 629 000+
Français 2 219 000+
Italiano 1 609 000+
中文 1 121 000+
Português 1 033 000+
Polski 1 411 000+
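
The article counts are printed as text such as `6 085 000+`. If they are needed as numbers, one option is to keep only the digits. `parse_article_count` is a hypothetical helper name, and the resulting value is a lower bound because the `+` is dropped.

```python
def parse_article_count(text):
    """Convert a count such as '6 085 000+' to an integer lower bound."""
    # Keeping only digits also handles non-breaking spaces used as separators
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits)


print(parse_article_count("6 085 000+"))
```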


#### Create a list of the different kinds of datasets available on data.gov.uk.

In [480]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [495]:
articles = soup.find('div', class_='grid-row dgu-topics')

In [515]:
types = articles.find_all('a', href=True)

In [565]:
types[1]

<a href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a>

In [566]:
import re
# Capture everything between the first '>' and the last '<' of each tag's markup
pattern = r'>.*<'
for topic in types:
    print(re.findall(pattern, str(topic)))

['>Business and economy<']
['>Crime and justice<']
['>Defence<']
['>Education<']
['>Environment<']
['>Government<']
['>Government spending<']
['>Health<']
['>Mapping<']
['>Society<']
['>Towns and cities<']
['>Transport<']
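
The regular expression leaves the `>` and `<` delimiters in every result. An alternative sketch uses BeautifulSoup's own text extraction, which returns the bare topic names as a list. The fragment below is a shortened, hypothetical copy of the `dgu-topics` block, with link targets modelled on the `types[1]` output shown earlier.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the topics block
sample_html = """
<div class="grid-row dgu-topics">
  <a href="/search?filters%5Btopic%5D=Business+and+economy">Business and economy</a>
  <a href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
container = sample_soup.find("div", class_="grid-row dgu-topics")

# get_text(strip=True) returns the link text without tags or padding
topics = [a.get_text(strip=True) for a in container.find_all("a", href=True)]
print(topics)
```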


#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [567]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [572]:

article = soup.find_all('tbody')
table_rows = article[0].find_all('tr')
rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    # Use a distinct loop variable so the outer `tr` is not shadowed
    rows.append([cell.text for cell in cells])
pd.DataFrame(rows, columns=['Rank', 'Language', 'Speakers', '% of world pop', 'lang_family']).head(11)

Unnamed: 0,Rank,Language,Speakers,% of world pop,lang_family
0,,,,,
1,1\n,Mandarin Chinese\n,918\n,11.922\n,Sino-TibetanSinitic\n
2,2\n,Spanish\n,480\n,5.994\n,Indo-EuropeanRomance\n
3,3\n,English\n,379\n,4.922\n,Indo-EuropeanGermanic\n
4,4\n,Hindi (Sanskritised Hindustani)[9]\n,341\n,4.429\n,Indo-EuropeanIndo-Aryan\n
5,5\n,Bengali\n,228\n,2.961\n,Indo-EuropeanIndo-Aryan\n
6,6\n,Portuguese\n,221\n,2.870\n,Indo-EuropeanRomance\n
7,7\n,Russian\n,154\n,2.000\n,Indo-EuropeanBalto-Slavic\n
8,8\n,Japanese\n,128\n,1.662\n,JaponicJapanese\n
9,9\n,Western Punjabi[10]\n,92.7\n,1.204\n,Indo-EuropeanIndo-Aryan\n
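
The cells in this table still carry trailing `\n` characters and footnote markers such as `[9]`. A small cleaning function is one way to tidy them before building the dataframe; `clean_cell` is a hypothetical helper, shown here on two values copied from the output above.

```python
import re


def clean_cell(text):
    """Remove footnote markers like [9] and surrounding whitespace."""
    return re.sub(r"\[\d+\]", "", text).strip()


language = clean_cell("Hindi (Sanskritised Hindustani)[9]\n")
speakers = float(clean_cell("92.7\n"))
print(language, speakers)
```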


## Bonus
#### Scrape a certain number of tweets of a given Twitter account.

In [573]:
# This is the url you will scrape in this exercise
# You will need to add the account credentials to this url
handle = input('Input your account name on Twitter: ')
ctr = int(input('Input number of tweets to scrape: '))
res = requests.get('https://twitter.com/' + handle)
bs = BeautifulSoup(res.content, 'lxml')
all_tweets = bs.find_all('div', {'class': 'tweet'})
if all_tweets:
    for tweet in all_tweets[:ctr]:
        context = tweet.find('div', {'class': 'context'}).text.replace("\n", " ").strip()
        content = tweet.find('div', {'class': 'content'})
        header = content.find('div', {'class': 'stream-item-header'})
        user = header.find('a', {'class': 'account-group js-account-group js-action-profile js-user-profile-link js-nav'}).text.replace("\n", " ").strip()
        timestamp = header.find('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'}).find('span').text.replace("\n", " ").strip()
        message = content.find('div', {'class': 'js-tweet-text-container'}).text.replace("\n", " ").strip()
        footer = content.find('div', {'class': 'stream-item-footer'})
        stat = footer.find('div', {'class': 'ProfileTweet-actionCountList u-hiddenVisually'}).text.replace("\n", " ").strip()
        if context:
            print(context)
        print(user, timestamp)
        print(message)
        print(stat)
        print()
else:
    print("List is empty/account name not found.")
  

Input your account name on Twitter: ferenc
Input number of tweets to scrape: 50
Ferenc‏ @ferenc 26 jul. 2018
http://Booking.com  - #UX Case Study https://buff.ly/2L6jovz pic.twitter.com/Q1WsiCKzq4
0 antwoorden     0 retweets     0 vind-ik-leuks

Ferenc‏ @ferenc 26 jul. 2018
Top 10 Elon Musk Productivity Secrets for Insane Success https://buff.ly/2JTsZ3E pic.twitter.com/AVeGqtXY2A
0 antwoorden     1 retweet     0 vind-ik-leuks

Ferenc‏ @ferenc 25 jul. 2018
Simplify Life: What Can You Remove? https://buff.ly/2JED9Fd pic.twitter.com/BC62vemSA7
1 antwoord     0 retweets     0 vind-ik-leuks

Ferenc‏ @ferenc 17 jun. 2018
What It’s Really Like to Be an Entrepreneur (Are You Sure You Want to Be One?) https://buff.ly/2sRjigo pic.twitter.com/SDgtPyaLY9
0 antwoorden     0 retweets     0 vind-ik-leuks

Ferenc‏ @ferenc 13 jun. 2018
Digital manipulation: How platforms like Uber and Deliveroo exploit workers https://buff.ly/2Ml20jM pic.twitter.com/zqWy9NB60c
0 antwoorden     0 retweets     0 vind-ik-

#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [574]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [594]:

article = soup.find_all('tbody')
table_rows = article[0].find_all('tr')
rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    # Use a distinct loop variable so the outer `tr` is not shadowed
    rows.append([cell.text for cell in cells])
df = pd.DataFrame(rows, columns=['empty1', 'Rank&Title', 'IMDB Rating', 'empty2', ''])

In [609]:
df = df['Rank&Title'].str.split('\n', expand=True).set_index(1)
df.columns = ['Rank', 'Title', 'Year', '']
df.head()

Unnamed: 0_level_0,Rank,Title,Year,Unnamed: 4_level_0
1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1.0,,The Shawshank Redemption,(1994),
2.0,,The Godfather,(1972),
3.0,,The Godfather: Part II,(1974),
4.0,,The Dark Knight,(2008),
5.0,,12 Angry Men,(1957),


#### Display the movie name, year and a brief summary of 10 random movies from the IMDB top chart as a pandas dataframe.

In [610]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [None]:
# your code here

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

In [None]:
# your code here
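
City names that contain spaces or accented characters need to be URL-encoded before they are placed in the query string. A minimal sketch using the standard library follows; `build_weather_url` is a hypothetical helper, and `YOUR_API_KEY` is a placeholder for your own key rather than a working value.

```python
from urllib.parse import urlencode


def build_weather_url(city, api_key):
    """Return the current-weather URL with safely encoded query parameters."""
    base = "http://api.openweathermap.org/data/2.5/weather"
    # urlencode escapes spaces and special characters in each value
    query = urlencode({"q": city, "APPID": api_key, "units": "metric"})
    return f"{base}?{query}"


print(build_weather_url("New York", "YOUR_API_KEY"))
```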

#### Find the book name, price and stock availability as a pandas dataframe.

In [611]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [None]:
# your code here
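
One possible starting point for this exercise is sketched below on an inline fragment. The markup is an assumption about how each book is laid out (an `article` with class `product_pod`, the full title in the link's `title` attribute, and `price_color` and `availability` paragraphs), and the book itself is made up; check the real page in DevTools before relying on these selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page structure should be verified in DevTools
sample_html = """
<article class="product_pod">
  <h3><a href="catalogue/example-book/index.html" title="Example Book">Example ...</a></h3>
  <p class="price_color">£10.00</p>
  <p class="instock availability">
    In stock
  </p>
</article>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

books = []
for pod in sample_soup.find_all("article", class_="product_pod"):
    books.append({
        "Title": pod.h3.a["title"],
        "Price": pod.find("p", class_="price_color").get_text(strip=True),
        "Availability": pod.find("p", class_="availability").get_text(strip=True),
    })
print(books)
```

Passing `books` to `pd.DataFrame` would give one row per book with the three requested columns.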