# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each URL and inspect its source with Chrome DevTools. You'll need to identify the HTML tags, class names, etc. that mark the content you are expected to extract.

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)
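
The first two tips can be folded into a small helper. The sketch below is only an illustration: `check_response` is a hypothetical name, and the stand-in objects merely mimic the `status_code` and `text` attributes of a `requests.Response` so that the snippet runs without network access.

```python
from types import SimpleNamespace


def check_response(response, preview_chars=200):
    """Return the body if the request succeeded, otherwise raise an error."""
    if response.status_code != 200:
        raise RuntimeError(f"Unexpected status code: {response.status_code}")
    # Peek at the start of the body to learn its format (HTML, JSON, ...)
    print(response.text[:preview_chars])
    return response.text


# Stand-in for the object returned by requests.get(url)
fake_ok = SimpleNamespace(status_code=200, text="<html><body>Hello</body></html>")
body = check_response(fake_ok)

# A failed request should stop the workflow before any parsing happens
fake_missing = SimpleNamespace(status_code=404, text="Not Found")
try:
    check_response(fake_missing)
    failed = False
except RuntimeError:
    failed = True
```

With a real request, pass the result of `requests.get(url)` instead of the stand-in.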

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries, feel free to uncomment them.

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
# from urllib.request import urlopen
# import random
import re
# import scrapy

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [32]:
url = 'https://github.com/trending/developers'

In [33]:
page = requests.get(url)
soup=BeautifulSoup(page.content, "html.parser")
print(soup)


<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-8c550109d58e0353afdf1a37a05301c2.css" integrity="sha512-jFUBCdWOA1Ov3xo3oFMBwsdP4Up2K1bRnP4QYI5WqvpaIYxWVek89k2M0oyTbNhYMViGtxJB3Vdwcw8ln8hGQw==" media="all" rel="stylesheet">
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-77cb67c5e6c23f78d6d0327713f088c4.css" integrity="sha512-d8tnxebCP3jW0DJ3E/CIxAfZO2DHR

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. No name should contain HTML tags.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
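
Step 3 can be sketched with the `re` module. The raw text below is a hypothetical example of what `get_text()` might return for one element; the two helper names are illustrative as well.

```python
import re

# Hypothetical raw text, as get_text() might return it for one element
raw_text = "\n\n      charlax\n      (Charles-Axel Dein)\n  "


def collapse_whitespace(text):
    # Replace every run of spaces, tabs and line breaks with one space
    return re.sub(r"\s+", " ", text).strip()


def remove_whitespace(text):
    # Drop all whitespace, including the spaces inside the name
    return re.sub(r"\s+", "", text)


collapsed = collapse_whitespace(raw_text)
compact = remove_whitespace(raw_text)
```

Collapsing keeps names readable, while removing all whitespace is closer to the expected output above, where the spaces inside names are gone.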

In [98]:
# Names: each display name is stored in an <h1 class="h3 lh-condensed"> tag
name_tags = soup.find_all('h1', attrs={'class': 'h3 lh-condensed'})
# get_text() drops the HTML tags and strip() removes the surrounding whitespace
names = [tag.get_text().strip() for tag in name_tags]

# Users: each account handle is stored in a <p class="f4 text-normal mb-1"> tag
user_tags = soup.find_all('p', attrs={'class': 'f4 text-normal mb-1'})
users = [tag.get_text().strip() for tag in user_tags]

# Concatenate each pair as "name (user)"
# Caveat: zip pairs the two lists by position, so an entry without a separate
# handle tag shifts every later pair (see 'Chocobozzz (stephencelis)' below)
names_users_list = [name + " (" + user + ")" for name, user in zip(names, users)]
names_users_list

['Josh Holtz (joshdholtz)',
 'Chris Banes (chrisbanes)',
 'Leonid Bugaev (buger)',
 'Kyle Mathews (KyleAMathews)',
 'Violeta Georgieva (violetagg)',
 'Fatih Arslan (fatih)',
 'Robert Mosolgo (rmosolgo)',
 'Chocobozzz (stephencelis)',
 'Stephen Celis (michelleN)',
 'Michelle Noorali (lucidrains)',
 'Phil Wang (orta)',
 'Orta Therox (tiangolo)',
 'Sebastián Ramírez (alex)',
 'Alex Gaynor (daybrush)',
 'Daybrush (Younkue Choi) (mperham)',
 'Mike Perham (tymondesigns)',
 'Sean Tymon (SwampDragons)',
 'Megan Marsh (pranavkm)',
 'Pranav K (MikeMcQuaid)',
 'Mike McQuaid (aknuds1)',
 'Arve Knudsen (mholt)',
 'Matt Holt (wojtekmaj)',
 'Wojciech Maj (inducer)',
 'Andreas Klöckner (claudiosanches)']
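
Pairing two separate lists by position appears to misalign the output above from 'Chocobozzz (stephencelis)' onwards, presumably because one entry has no separate handle. One way around this is to look up both tags inside each developer card. The sketch below runs on hypothetical markup: the `<article>` wrapper is an assumption, while the `h1` and `p` classes are the ones used in the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <article> wrapper is an assumption
sample_html = """
<article>
  <h1 class="h3 lh-condensed"> Josh Holtz </h1>
  <p class="f4 text-normal mb-1"> joshdholtz </p>
</article>
<article>
  <h1 class="h3 lh-condensed"> Chocobozzz </h1>
</article>
<article>
  <h1 class="h3 lh-condensed"> Stephen Celis </h1>
  <p class="f4 text-normal mb-1"> stephencelis </p>
</article>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
paired = []
for card in sample_soup.find_all("article"):
    name = card.find("h1", class_="h3 lh-condensed").get_text().strip()
    user_tag = card.find("p", class_="f4 text-normal mb-1")
    if user_tag is None:
        # No separate handle in this card: keep the name on its own
        paired.append(name)
    else:
        paired.append(f"{name} ({user_tag.get_text().strip()})")
```

Because each name is matched only against the handle in its own card, a missing handle no longer shifts the following entries.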

#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [150]:
url1 = 'https://github.com/trending/python?since=daily'

In [151]:
page1 = requests.get(url1)
soup1 = BeautifulSoup(page1.content, "html.parser")

In [189]:
repositories = []

# Each trending repository heading is stored in an <h1 class="h3 lh-condensed"> tag
for heading in soup1.find_all('h1', attrs={'class': 'h3 lh-condensed'}):
    # The regex captures the whitespace run followed by the repository name
    # The [8:] slice then drops the leading whitespace, keeping only the name
    repositories.append(re.findall(r"\s+\w+\W?\w+\W?\w+", heading.get_text().strip())[0][8:])

repositories

['tuya-convert',
 'prefect',
 'TikTok-Shares-Botter',
 'public-apis',
 'ai-economist',
 'bpytop',
 'Atlas',
 'youtube-dl',
 'system-design-primer',
 'pytorch-lightning',
 'antenny',
 'PySyft',
 'wttr.in',
 'cupp',
 'zulip',
 'h4cker',
 'gibMacOS',
 'learn-python',
 'routersploit',
 'core',
 'howdoi',
 'SinGAN',
 'festin',
 'ArchiveBox',
 'Lazymux']
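
The regex and the fixed `[8:]` slice depend on the exact amount of whitespace in the heading. A less position-dependent alternative is to split on the slash between owner and repository. This is a sketch on a hypothetical heading text; `repo_name` is an illustrative helper, not part of the exercise.

```python
# Hypothetical heading text in the "owner / repository" layout
raw_heading = "\n\n      ct-Open-Source /\n\n      tuya-convert\n"


def repo_name(heading_text):
    # partition splits at the first "/" into (before, separator, after)
    owner, _, repo = heading_text.partition("/")
    # Fall back to the whole text when there is no slash
    return repo.strip() or owner.strip()


name = repo_name(raw_heading)
# Removing all whitespace gives the full "owner/repository" form
full_name = "".join(raw_heading.split())
```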

#### Display all the image links from Walt Disney wikipedia page

In [191]:
url2 = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [215]:
page2 = requests.get(url2)
soup2 = BeautifulSoup(page2.content, "html.parser")


pictures = []

# Images are stored in <img> tags
for img in soup2.find_all('img'):
    # The src values shown below start with "//", so we prefix the scheme
    # to turn each one into a complete URL
    pictures.append("http:" + img['src'])

pictures

['http://upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
 'http://upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png',
 'http://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 'http://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 'http://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 'http://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg',
 'http://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 'http://upload.wikimedia.org/wikipedia/commons/thumb/7/71/Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_N
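
Prefixing `"http:"` only works for `src` values that start with `//`. As a sketch of a more general approach, `urllib.parse.urljoin` from the standard library resolves any kind of relative reference against the page URL. The second `src` value below is a hypothetical site-relative path added for illustration.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"
src_values = [
    # Protocol-relative reference, as in the output above
    "//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG",
    # Hypothetical site-relative path
    "/static/images/footer/wikimedia-button.png",
]

# urljoin fills in the missing scheme and host from the page URL
absolute_urls = [urljoin(page_url, src) for src in src_values]
```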

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page

In [217]:
url3 ='https://en.wikipedia.org/wiki/Python' 

In [226]:
page3 = requests.get(url3)
soup3 = BeautifulSoup(page3.content, "html.parser")

links = []

# Links are stored in <a> tags
for a_tag in soup3.find_all('a'):
    # Append every <a> element found on the page
    links.append(a_tag)

links

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a class="extiw" href="https://en.wiktionary.org/wiki/Python" title="wiktionary:Python">Python</a>,
 <a class="extiw" href="https://en.wiktionary.org/wiki/python" title="wiktionary:python">python</a>,
 <a href="#Snakes"><span class="tocnumber">1</span> <span class="toctext">Snakes</span></a>,
 <a href="#Ancient_Greece"><span class="tocnumber">2</span> <span class="toctext">Ancient Greece</span></a>,
 <a href="#Media_and_entertainment"><span class="tocnumber">3</span> <span class="toctext">Media and entertainment</span></a>,
 <a href="#Computing"><span class="tocnumber">4</span> <span class="toctext">Computing</span></a>,
 <a href="#Engineering"><span class="tocnumber">5</span> <span class="toctext">Engineering</span></a>,
 <a href="#Roller_coasters"><span class="tocnumber">5.1</span> <span class="toctext">Roller coasters</span></a>,
 <a h
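
The list above holds whole `<a>` elements. If only the link targets are needed, the `href` attribute can be read from each element. The sketch below uses a few of the anchors shown above as sample input; passing `href=True` to `find_all` skips anchors without an `href`, such as `<a id="top">`.

```python
from bs4 import BeautifulSoup

# A few anchors copied from the output above, used as sample input
sample_html = """
<a id="top"></a>
<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
<a class="extiw" href="https://en.wiktionary.org/wiki/Python" title="wiktionary:Python">Python</a>
<a href="#Snakes">Snakes</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute
hrefs = [a["href"] for a in sample_soup.find_all("a", href=True)]

# Keep only absolute links, dropping in-page anchors such as "#Snakes"
external = [href for href in hrefs if href.startswith("http")]
```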

#### Number of Titles that have changed in the United States Code since its last release point 

In [227]:
url4 = 'http://uscode.house.gov/download/download.shtml'

In [238]:
page4 = requests.get(url4)
soup4 = BeautifulSoup(page4.content, "html.parser")

titles = []

# Changed titles are stored in <div class="usctitlechanged"> elements
for div in soup4.find_all("div", attrs={'class': "usctitlechanged"}):
    # The regex keeps everything from the hyphen to the end of the text
    # The [2:] slice then drops the leading "- "
    titles.append(re.findall(r"-.*$", div.get_text().strip())[0][2:])

# len(titles) gives the number of titles that have changed
titles

['Domestic Security', 'The Public Health and Welfare']
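
The regex step can be tried in isolation. The sample text below is a hypothetical reconstruction of one element's text, chosen to be consistent with the output above. The second variant is an alternative sketch that uses a capture group instead of slicing.

```python
import re

# Hypothetical element text, consistent with the output above
sample_text = "Title 6 - Domestic Security"

# Approach used in the cell above: match from the hyphen, then drop "- "
sliced = re.findall(r"-.*$", sample_text)[0][2:]

# Alternative: capture only the part after the hyphen and optional spaces
match = re.search(r"-\s*(.+)$", sample_text)
captured = match.group(1).strip() if match else None
```

The capture group avoids relying on exactly one space following the hyphen.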

#### A Python list with the names of the FBI's Top Ten Most Wanted

In [239]:
url5 = 'https://www.fbi.gov/wanted/topten'

In [242]:
page5 = requests.get(url5)
soup5 = BeautifulSoup(page5.content, "html.parser")

fbi_mostwanted = []

# Names are stored in <h3 class="title"> tags
for heading in soup5.find_all("h3", attrs={'class': "title"}):
    # Append each name without its surrounding whitespace
    fbi_mostwanted.append(heading.get_text().strip())

fbi_mostwanted

['EUGENE PALMER',
 'RAFAEL CARO-QUINTERO',
 'ROBERT WILLIAM FISHER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'ARNOLDO JIMENEZ',
 'JASON DEREK BROWN',
 'YASER ABDEL SAID',
 'ALEXIS FLORES',
 'SANTIAGO VILLALBA MEDEROS']

####  20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [67]:
url = 'https://www.emsc-csem.org/Earthquake/'

In [68]:
page = requests.get(url)
earthquake = BeautifulSoup(page.content, "html.parser")

In [69]:
# We start with the date and time observations
# Both are stored in cells with class='tabev6'

earthquake.find_all('td', attrs={'class':'tabev6'})[0].get_text()
# We will need regex to extract the date and the time separately

'earthquake2020-08-10\xa0\xa0\xa007:14:53.018min ago'
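
Both values can also be pulled out of the cell text in one pass. The sketch below applies a single pattern with two capture groups to the string shown above; `\s` also matches the non-breaking spaces (`\xa0`) that separate the date from the time.

```python
import re

# Cell text copied from the output above
cell_text = 'earthquake2020-08-10\xa0\xa0\xa007:14:53.018min ago'

# Group 1 captures the date, group 2 the time (without fractions of a second)
pattern = re.compile(r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})")

match = pattern.search(cell_text)
quake_date, quake_time = match.groups()
```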

In [89]:
# Dates
dates = []

for cell in earthquake.find_all('td', attrs={'class':'tabev6'}):
    # Keep the first YYYY-MM-DD match found in the cell text
    dates.append(re.findall(r"\d{4}-\d{2}-\d{2}", cell.get_text())[0])

dates

['2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10',
 '2020-08-10']

In [88]:
# Times
times = []

for cell in earthquake.find_all('td', attrs={'class':'tabev6'}):
    # Keep the first HH:MM:SS match found in the cell text
    times.append(re.findall(r"\d{2}:\d{2}:\d{2}", cell.get_text())[0])

times

['07:14:53',
 '07:11:54',
 '07:01:21',
 '06:54:16',
 '06:42:28',
 '06:28:27',
 '06:20:50',
 '06:16:59',
 '06:16:14',
 '06:07:57',
 '05:59:19',
 '05:47:44',
 '05:43:41',
 '05:32:08',
 '04:56:46',
 '04:47:42',
 '04:33:05',
 '04:29:21',
 '04:21:05',
 '04:17:36',
 '04:14:02',
 '04:11:43',
 '03:59:58',
 '03:58:41',
 '03:39:36',
 '03:32:15',
 '03:30:24',
 '03:26:55',
 '03:25:31',
 '03:24:00',
 '03:17:54',
 '03:09:45',
 '03:06:14',
 '02:52:00',
 '02:41:43',
 '02:27:15',
 '02:25:00',
 '02:21:07',
 '02:11:27',
 '02:10:14',
 '02:07:54',
 '01:53:35',
 '01:41:35',
 '01:39:13',
 '01:25:52',
 '01:07:31',
 '00:56:57',
 '00:44:29',
 '00:34:07',
 '00:24:40']

In [85]:
# Latitudes & Longitudes

# Both are stored in cells with class "tabev1"
earthquake.find_all('td', attrs={'class':'tabev1'})[1].get_text().strip()

# We store all the values in one list and then split it by index position
lat_lon = []

for cell in earthquake.find_all('td', attrs={'class':'tabev1'}):
    lat_lon.append(cell.get_text().strip())

# Latitudes are at the even indices (0, 2, 4, ...)
latitudes = lat_lon[::2]

# Longitudes are at the odd indices (1, 3, 5, ...)
longitudes = lat_lon[1::2]

# We print the results to check them
print(latitudes)
print(longitudes)

['2.73', '41.98', '8.06', '21.00', '38.82', '59.97', '23.11', '27.60', '42.53', '40.57', '40.57', '19.22', '38.27', '44.98', '10.01', '28.18', '12.85', '37.43', '9.91', '19.92', '9.82', '23.95', '35.42', '40.53', '9.70', '12.68', '34.93', '34.22', '38.45', '19.20', '34.05', '2.58', '8.85', '6.82', '40.56', '38.82', '37.14', '38.18', '28.65', '21.42', '40.56', '35.22', '45.85', '23.48', '12.48', '5.86', '60.72', '19.26', '45.11', '35.37']
['128.18', '20.26', '107.92', '68.71', '122.81', '146.39', '68.50', '17.97', '13.50', '22.55', '22.56', '155.42', '38.85', '146.16', '124.49', '15.01', '45.54', '35.87', '119.13', '72.81', '119.00', '66.93', '9.75', '22.57', '119.01', '123.64', '121.02', '118.58', '44.45', '155.43', '70.79', '127.35', '117.81', '73.20', '22.60', '122.80', '42.75', '117.81', '70.24', '68.80', '22.71', '26.31', '77.09', '115.24', '121.31', '123.06', '142.99', '155.41', '13.66', '140.53']
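
The two slices can be checked on a short list. The sketch below uses the first three latitude and longitude values printed above, interleaved in the order the page delivers them, and also converts the strings to floats.

```python
# First values from the output above, interleaved as latitude, longitude, ...
lat_lon_sample = ['2.73', '128.18', '41.98', '20.26', '8.06', '107.92']

# [::2] starts at index 0 and takes every second item
latitudes_sample = [float(value) for value in lat_lon_sample[::2]]

# [1::2] starts at index 1 and takes every second item
longitudes_sample = [float(value) for value in lat_lon_sample[1::2]]

pairs = list(zip(latitudes_sample, longitudes_sample))
```

The values printed above carry no sign, so it is worth checking whether the table records the hemisphere in another cell before using them as coordinates.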


In [87]:
# Region
# Region names are stored in cells with class "tb_region"
earthquake.find_all('td', attrs={'class':'tb_region'})[1].get_text().strip()

regions = []
for cell in earthquake.find_all('td', attrs={'class':'tb_region'}):
    regions.append(cell.get_text().strip())

regions

['HALMAHERA, INDONESIA',
 'ALBANIA',
 'JAVA, INDONESIA',
 'TARAPACA, CHILE',
 'NORTHERN CALIFORNIA',
 'GULF OF ALASKA',
 'ANTOFAGASTA, CHILE',
 'CANARY ISLANDS, SPAIN REGION',
 'CENTRAL ITALY',
 'GREECE',
 'GREECE',
 'ISLAND OF HAWAII, HAWAII',
 'EASTERN TURKEY',
 'KURIL ISLANDS',
 'TIMOR REGION, INDONESIA',
 'CANARY ISLANDS, SPAIN REGION',
 'MAYOTTE REGION',
 'CENTRAL TURKEY',
 'SUMBA REGION, INDONESIA',
 'NEAR COAST OF MAHARASHTRA, INDIA',
 'SUMBA REGION, INDONESIA',
 'JUJUY, ARGENTINA',
 'WEST OF GIBRALTAR',
 'GREECE',
 'SUMBA REGION, INDONESIA',
 'MASBATE REGION, PHILIPPINES',
 'OFFSHORE CENTRAL CALIFORNIA',
 'WESTERN AUSTRALIA',
 'TURKEY-IRAN BORDER REGION',
 'ISLAND OF HAWAII, HAWAII',
 "LIBERTADOR O'HIGGINS, CHILE",
 'MOLUCCA SEA',
 'SUMBAWA REGION, INDONESIA',
 'NORTHERN COLOMBIA',
 'GREECE',
 'NORTHERN CALIFORNIA',
 'TURKEY-SYRIA-IRAQ BORDER REGION',
 'NEVADA',
 'ATACAMA, CHILE',
 'ANTOFAGASTA, CHILE',
 'GREECE',
 'CRETE, GREECE',
 'SOUTHERN ONTARIO, CANADA',
 'SOUTHERN EAST P

In [90]:
earthquakesdf = pd.DataFrame({"date": dates, "time": times, "latitude": latitudes, "longitude": longitudes, "region": regions})
earthquakesdf

| | date | time | latitude | longitude | region |
|---|---|---|---|---|---|
| 0 | 2020-08-10 | 07:14:53 | 2.73 | 128.18 | HALMAHERA, INDONESIA |
| 1 | 2020-08-10 | 07:11:54 | 41.98 | 20.26 | ALBANIA |
| 2 | 2020-08-10 | 07:01:21 | 8.06 | 107.92 | JAVA, INDONESIA |
| 3 | 2020-08-10 | 06:54:16 | 21.0 | 68.71 | TARAPACA, CHILE |
| 4 | 2020-08-10 | 06:42:28 | 38.82 | 122.81 | NORTHERN CALIFORNIA |
| 5 | 2020-08-10 | 06:28:27 | 59.97 | 146.39 | GULF OF ALASKA |
| 6 | 2020-08-10 | 06:20:50 | 23.11 | 68.5 | ANTOFAGASTA, CHILE |
| 7 | 2020-08-10 | 06:16:59 | 27.6 | 17.97 | CANARY ISLANDS, SPAIN REGION |
| 8 | 2020-08-10 | 06:16:14 | 42.53 | 13.5 | CENTRAL ITALY |
| 9 | 2020-08-10 | 06:07:57 | 40.57 | 22.55 | GREECE |
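
The exercise asks for the 20 latest earthquakes, while the lists above hold 50 entries each. A possible final step is to convert the coordinates to numbers and keep only the first rows with `head`. The sketch below uses three rows from the output above and `head(2)` to stay short; on the full dataframe the call would be `head(20)`.

```python
import pandas as pd

# Three rows taken from the output above
sample_df = pd.DataFrame({
    "date": ["2020-08-10", "2020-08-10", "2020-08-10"],
    "time": ["07:14:53", "07:11:54", "07:01:21"],
    "latitude": ["2.73", "41.98", "8.06"],
    "longitude": ["128.18", "20.26", "107.92"],
    "region": ["HALMAHERA, INDONESIA", "ALBANIA", "JAVA, INDONESIA"],
})

# The scraped coordinates are strings, so convert them to floats
coordinate_columns = ["latitude", "longitude"]
sample_df[coordinate_columns] = sample_df[coordinate_columns].astype(float)

# The page lists the newest events first, so head keeps the latest ones
latest = sample_df.head(2)
```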


#### Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account
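
The try/except structure can be sketched independently of the scraping itself. Everything named below is hypothetical: `fake_fetch_tweet_count` stands in for the real request-and-parse step, and `AccountNotFound` is an illustrative exception class, not something Twitter or `requests` provides.

```python
class AccountNotFound(Exception):
    """Raised by the stand-in fetcher when an account does not exist."""


def fake_fetch_tweet_count(account):
    # Hypothetical stand-in for requesting the page and parsing the count
    known_accounts = {"example_user": 1234}
    if account not in known_accounts:
        raise AccountNotFound(account)
    return known_accounts[account]


def tweet_count(account, fetch=fake_fetch_tweet_count):
    try:
        return fetch(account)
    except AccountNotFound:
        # Report the problem instead of letting the program crash
        print(f"Account '{account}' not found")
        return None


found = tweet_count("example_user")
missing = tweet_count("no_such_account")
```

In a real solution the `try` block would wrap the request and the parsing, and the `except` clause would catch whatever error those steps raise for an unknown account.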

In [None]:
# This is the url you will scrape in this exercise 
# You will need to append the account name to this url
url = 'https://twitter.com/'

In [None]:
#your code

#### Number of followers of a given twitter account

You will need to include a ***try/except block*** in case the account name is not found. 
<br>***Hint:*** the program should count the followers for any provided account

In [None]:
# This is the url you will scrape in this exercise 
# You will need to append the account name to this url
url = 'https://twitter.com/'

In [None]:
#your code

#### List all language names and number of related articles in the order they appear in wikipedia.org

In [1]:
url = 'https://www.wikipedia.org/'

In [4]:
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

In [9]:
# Language names are stored in <strong> tags
strong = soup.find_all('strong')
print(strong)
# We observe that the first and the last results are not languages

len_strong = len(strong)
print(len_strong)

[<strong class="jsl10n localized-slogan" data-jsl10n="slogan">The Free Encyclopedia</strong>, <strong>English</strong>, <strong>日本語</strong>, <strong>Español</strong>, <strong>Deutsch</strong>, <strong>Русский</strong>, <strong>Français</strong>, <strong>Italiano</strong>, <strong>中文</strong>, <strong>Português</strong>, <strong>Polski</strong>, <strong class="jsl10n" data-jsl10n="app-links.title">
<a class="jsl10n" data-jsl10n="app-links.url" href="https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications">
Download Wikipedia for Android or iOS
</a>
</strong>]
12


In [10]:
languages = []

# The length is 12; the first and the last results are not languages,
# so we keep only indices 1 to 10
for tag in soup.find_all("strong")[1:11]:
    languages.append(tag.get_text())

languages

['English',
 '日本語',
 'Español',
 'Deutsch',
 'Русский',
 'Français',
 'Italiano',
 '中文',
 'Português',
 'Polski']

In [20]:
# Article counts are stored in <bdi> tags
bdi = soup.find_all('bdi')[0:10]
print(bdi)
# We select only the first 10 results, which match the 10 languages above

[<bdi dir="ltr">6 134 000+</bdi>, <bdi dir="ltr">1 220 000+</bdi>, <bdi dir="ltr">1 615 000+</bdi>, <bdi dir="ltr">2 464 000+</bdi>, <bdi dir="ltr">1 648 000+</bdi>, <bdi dir="ltr">2 239 000+</bdi>, <bdi dir="ltr">1 626 000+</bdi>, <bdi dir="ltr">1 134 000+</bdi>, <bdi dir="ltr">1 041 000+</bdi>, <bdi dir="ltr">1 422 000+</bdi>]


In [21]:
articles = []

for tag in soup.find_all("bdi")[0:10]:
    # The digits of each article count are separated by spaces
    # We extract the digit groups with regex and join them into one string
    articles.append("".join(re.findall(r"\d+", tag.get_text())))

articles

['6134000',
 '1220000',
 '1615000',
 '2464000',
 '1648000',
 '2239000',
 '1626000',
 '1134000',
 '1041000',
 '1422000']
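
The counts above are still strings. If they are needed as numbers, for sorting or arithmetic, one option is to strip every non-digit character and convert the rest. The sketch uses the first three `<bdi>` texts shown earlier.

```python
import re

# Texts copied from the first three <bdi> tags shown above
bdi_texts = ['6 134 000+', '1 220 000+', '1 615 000+']

# \D matches any non-digit character: spaces and the trailing "+"
article_counts = [int(re.sub(r"\D", "", text)) for text in bdi_texts]
```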

In [23]:
#We create a list comprehension for joining both results obtained previously, languages and articles
results = [i +": "+j for i,j in zip(languages, articles)]
results

['English: 6134000',
 '日本語: 1220000',
 'Español: 1615000',
 'Deutsch: 2464000',
 'Русский: 1648000',
 'Français: 2239000',
 'Italiano: 1626000',
 '中文: 1134000',
 'Português: 1041000',
 'Polski: 1422000']

#### A list with the different kinds of datasets available in data.gov.uk 

In [24]:
url = 'https://data.gov.uk/'

In [31]:
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

# Dataset topics are stored in elements with class="govuk-link"
ds = soup.select(".govuk-link")
ds
# The first result we want is at index 3

[<a class="govuk-link" href="/cookies">cookies to collect information</a>,
 <a class="govuk-link" data-module="track-click" data-track-action="Cookie banner settings clicked from confirmation" data-track-category="cookieBanner" href="/cookies">change your cookie settings</a>,
 <a class="govuk-link" href="http://www.smartsurvey.co.uk/s/3SEXD/">feedback</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Business+and+economy">Business and economy</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Defence">Defence</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Education">Education</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Environment">Environment</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Government">Government</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Government+spending">Government spending</a>,
 <a cla

In [33]:
# We calculate the length to know which range our for loop should iterate through
len_ds = len(soup.select(".govuk-link"))
len_ds

15

In [36]:
ds_available = []

# Indices 0 to 2 are cookie and feedback links, so we keep indices 3 to 14
for link in soup.select(".govuk-link")[3:15]:
    ds_available.append(link.get_text())

ds_available

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport']
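
Hard-coded index ranges stop working as soon as the page adds or removes a link. In the output above, every topic link has an `href` that starts with `/search?filters`, so filtering on that prefix is a possible alternative. The sketch below runs on a few of those links as sample input.

```python
from bs4 import BeautifulSoup

# A few links copied from the output above, used as sample input
sample_html = """
<a class="govuk-link" href="/cookies">cookies to collect information</a>
<a class="govuk-link" href="http://www.smartsurvey.co.uk/s/3SEXD/">feedback</a>
<a class="govuk-link" href="/search?filters%5Btopic%5D=Business+and+economy">Business and economy</a>
<a class="govuk-link" href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the links whose target is a topic search
topics = [
    link.get_text()
    for link in sample_soup.select(".govuk-link")
    if link.get("href", "").startswith("/search?filters")
]
```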

#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [96]:
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [97]:
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

In [92]:
# The first list takes indices 0 to 9 and keeps every third result
# The links in between are not language names

lista1 = [soup.select("tbody tr td a")[i].get_text() for i in range(0,10)][::3]
lista1

['Mandarin Chinese', 'Spanish', 'English', 'Hindi']

In [93]:
# The second list takes indices 14 to 26 and keeps every third result
# It starts at Bengali (14) and ends at Western Punjabi (26)
# We can't simply continue from the first list because the [9] footnote link shifts the positions

lista2 = [soup.select("tbody tr td a")[i].get_text() for i in range(14,27)][::3]
lista2

['Bengali', 'Portuguese', 'Russian', 'Japanese', 'Western Punjabi']

In [94]:
# The third list holds only index 30, as we already have 9 results and need one more
# We can't add it to lista2 because the [10] footnote link shifts the positions
# No iteration is needed

lista3 = [soup.select("tbody tr td a")[30].get_text()]
lista3

['Marathi']

In [101]:
#We just need to join the three lists together

top10languages = lista1 + lista2 + lista3
top10languages = pd.DataFrame(top10languages)
top10languages

| | 0 |
|---|---|
| 0 | Mandarin Chinese |
| 1 | Spanish |
| 2 | English |
| 3 | Hindi |
| 4 | Bengali |
| 5 | Portuguese |
| 6 | Russian |
| 7 | Japanese |
| 8 | Western Punjabi |
| 9 | Marathi |
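
The index arithmetic above has to work around footnote links such as `[9]`. An alternative sketch is to walk the table row by row and read the language from a fixed column, removing footnote markers with a regex. The markup below is hypothetical: the two-column layout and the `<sup>` footnote are assumptions made for illustration, not the real structure of the Wikipedia table.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical table layout: rank in the first cell, language in the second
sample_html = """
<table>
<tr><th>Rank</th><th>Language</th></tr>
<tr><td>1</td><td><a href="/wiki/Mandarin_Chinese">Mandarin Chinese</a></td></tr>
<tr><td>2</td><td><a href="/wiki/Spanish_language">Spanish</a><sup>[9]</sup></td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
rows = []
for row in sample_soup.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) < 2:
        continue  # the header row has <th> cells only
    # Remove footnote markers such as "[9]" from the cell text
    language = re.sub(r"\[\d+\]", "", cells[1].get_text()).strip()
    rows.append({"rank": int(cells[0].get_text().strip()), "language": language})
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame`.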


### BONUS QUESTIONS

#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to append the account name to this url
url = 'https://twitter.com/'

In [None]:
# your code

#### IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [None]:
# your code

#### Movie name, year and a brief summary of 10 random movies from the IMDB Top 250 as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [None]:
#your code

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city:')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'
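
The API answers with JSON rather than HTML, so no HTML parser is needed here. The sketch below parses a sample payload whose keys follow the current weather response format; the values themselves are made up for illustration.

```python
import json

# Sample payload: the keys follow the API's response format,
# the values are invented for this example
sample_payload = """
{
  "weather": [{"main": "Clouds", "description": "overcast clouds"}],
  "main": {"temp": 18.5},
  "wind": {"speed": 4.1},
  "name": "London"
}
"""

data = json.loads(sample_payload)

report = {
    "city": data["name"],
    "temperature": data["main"]["temp"],
    "wind_speed": data["wind"]["speed"],
    "weather": data["weather"][0]["main"],
    "description": data["weather"][0]["description"],
}
```

With a live request, `requests.get(url).json()` returns the same kind of dictionary as `json.loads` does here.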

In [None]:
# your code

#### Book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'
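
As a starting point, the sketch below extracts the three fields from a single book card. The markup is an assumption modelled on how the site lays out a book; verify the tags and class names with DevTools before relying on them.

```python
from bs4 import BeautifulSoup

# Assumed markup for one book card; check it against the live page
sample_html = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
         title="A Light in the Attic">A Light in the ...</a></h3>
  <div class="product_price">
    <p class="price_color">£51.77</p>
    <p class="instock availability">
      In stock
    </p>
  </div>
</article>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
books = []
for card in sample_soup.select("article.product_pod"):
    books.append({
        # The full name is in the title attribute; the link text is shortened
        "name": card.h3.a["title"],
        "price": card.select_one("p.price_color").get_text().strip(),
        "availability": card.select_one("p.availability").get_text().strip(),
    })
```

Passing `books` to `pd.DataFrame` then gives one row per book.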

In [None]:
#your code