# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You'll need to identify the HTML tags, class names and other markers used in the HTML content you are expected to extract.
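
The first two tips can be folded into a small helper. The sketch below is one possible way to do it, assuming a hypothetical function name `check_response` and any object that exposes `status_code` and `text` (as the objects returned by `requests.get` do). A stub object stands in for a real request so the example runs offline.

```python
from http import HTTPStatus
from types import SimpleNamespace


def check_response(response, preview_chars=200):
    """Return (ok, summary) for an object with .status_code and .text.

    `check_response` is a hypothetical helper, not part of requests.
    """
    try:
        phrase = HTTPStatus(response.status_code).phrase
    except ValueError:  # status code not in the standard registry
        phrase = "Unknown"
    ok = 200 <= response.status_code < 300
    preview = response.text[:preview_chars]
    return ok, f"{response.status_code} {phrase}: {preview!r}"


# Stand-in for the object returned by requests.get(url).
fake = SimpleNamespace(status_code=404, text="<html>Not Found</html>")
ok, summary = check_response(fake)
print(ok, summary)
```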

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to use additional libraries if you prefer.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import random
import warnings

In [2]:
warnings.filterwarnings('ignore')

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [3]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'
response = requests.get(url)

response

<Response [200]>

In [4]:
# your code here
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning
soup = BeautifulSoup(response.content, 'html.parser')

# soup

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. No name should contain any HTML tags.

**Instructions:**

1. Find out the HTML tag and class names used for the developer names. You can do this with Chrome DevTools.

1. Use BeautifulSoup to extract all the HTML elements that contain the developer names.

1. Use string manipulation techniques to remove whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like this:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
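
The instructions above can be sketched on a small offline snippet. This is only an illustration: the markup, the class names `dev-name` and `dev-nick`, and the helper `clean` are assumptions made for the example, not GitHub's real structure, which you still need to look up in DevTools.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: tag and class names are placeholders,
# not the ones GitHub actually uses.
html = """
<article>
  <h1 class="dev-name"><a href="/asim">
      Asim Aslam
  </a></h1>
  <p class="dev-nick"><a href="/asim">
      asim
  </a></p>
</article>
<article>
  <h1 class="dev-name"><a href="/nedbat">
      Ned   Batchelder
  </a></h1>
  <p class="dev-nick"><a href="/nedbat">
      nedbat
  </a></p>
</article>
"""


def clean(text):
    """Collapse runs of whitespace (including newlines) to single spaces."""
    return " ".join(text.split())


soup = BeautifulSoup(html, "html.parser")
names = [clean(t.get_text()) for t in soup.find_all("h1", class_="dev-name")]
nicks = [clean(t.get_text()) for t in soup.find_all("p", class_="dev-nick")]
developers = [f"{nick} ({name})" for nick, name in zip(nicks, names)]
print(developers)
```

Splitting on whitespace and re-joining handles spaces, tabs and newlines in one step, which tends to be less fragile than stripping specific characters.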

In [5]:
# your code here
name_tags = soup.find_all('h1', {'class': 'h3 lh-condensed'})
nick_tags = soup.find_all('p', {'class': 'f4 text-normal mb-1'})

In [6]:
names = ['(' + nt.a.text.strip('\n ') + ')' for nt in name_tags]
nicks = [nt.a.text.strip('\n ') for nt in nick_tags]
nick_name_pairs = [' '.join(p) for p in list(zip(nicks, names))]

nick_name_pairs

['asim (Asim Aslam)',
 'natario1 (Mattia Iavarone)',
 'kripken (Alon Zakai)',
 'fzaninotto (Francois Zaninotto)',
 'CookPete (Pete Cook)',
 'gcanti (Giulio Canti)',
 'developit (Jason Miller)',
 'nikolasburk (Nikolas)',
 'nedbat (Ned Batchelder)',
 'jd (Julien Danjou)',
 'neuecc (Yoshifumi Kawai)',
 'hacksparrow (Hage Yaapa)',
 'vektah (Adam Scarr)',
 'agnivade (Agniva De Sarker)',
 'rs (Olivier Poitrey)',
 'schneems (Richard Schneeman)',
 'markbates (Mark Bates)',
 'alvarotrigo (Álvaro)',
 'jashkenas (Jeremy Ashkenas)',
 'ycjcl868 (信鑫-King)',
 'Pessimistress (Xiaoji Chen)',
 'nunomaduro (Nuno Maduro)',
 'mmazzarolo (Matteo Mazzarolo)',
 'arvidn (Arvid Norberg)',
 'skidding (Ovidiu Cherecheș)']

#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [7]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'
response = requests.get(url)

response

<Response [200]>

In [8]:
# your code here
soup = BeautifulSoup(response.content)

# soup

In [9]:
repo_tags = soup.find_all('h1', {'class': 'h3 lh-condensed'})
repos = [rt.text.split(' /\n\n\n\n      ')[1].strip('\n') for rt in repo_tags]

repos

['12306',
 'coding-problems',
 'you-get',
 'transformers',
 'Real-Time-Voice-Cloning',
 'yolact',
 'JAV-Scraper-and-Rename-local-files',
 'ChromeAppHeroes',
 'seeprettyface-generator-wanghong',
 'Autoticket',
 'py12306',
 'detectron2',
 'ansible',
 'YouTube-Report',
 'BMW-TensorFlow-Training-GUI',
 'zhao',
 'andriller',
 'odoo',
 'maskrcnn-benchmark',
 'albert_vi',
 'examples-of-web-crawlers',
 'ihatemoney',
 'trt_pose',
 '12306_code_server',
 'BMW-YOLOv3-Training-Automation']
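
The `split(' /\n\n\n\n      ')` call in the cell above depends on the exact whitespace the page emitted when it was scraped. A less brittle variant, sketched here on a made-up text sample, removes the whitespace first and then splits at the slash. The helper name `repo_name` and the owner `example-owner` are illustrative assumptions.

```python
def repo_name(raw_text):
    """Extract the repository part from text shaped like 'owner / repo'."""
    compact = "".join(raw_text.split())      # drop every whitespace character
    owner, _, repo = compact.partition("/")  # split at the first slash
    return repo or owner                     # fall back if there is no slash


# Hypothetical sample mimicking the text of one repository heading.
sample = "\n\n      example-owner /\n\n\n\n      transformers\n"
print(repo_name(sample))
```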

#### Display all the image links from the Walt Disney Wikipedia page.

In [10]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'
response = requests.get(url)

response

<Response [200]>

In [11]:
# your code here
soup = BeautifulSoup(response.content)

# soup

In [12]:
image_tags = soup.find_all('a', {'class': 'image'})
images = ['https://en.wikipedia.org' + it.get('href') for it in image_tags][:16]

images

['https://en.wikipedia.org/wiki/File:Walt_Disney_1946.JPG',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_1942_signature.svg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_envelope_ca._1921.jpg',
 'https://en.wikipedia.org/wiki/File:Trolley_Troubles_poster.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg',
 'https://en.wikipedia.org/wiki/File:Steamboat-willie.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_1935.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg',
 'https://en.wikipedia.org/wiki/File:Disney_drawing_goofy.jpg',
 'https://en.wikipedia.org/wiki/File:DisneySchiphol1951.jpg',
 'https://en.wikipedia.org/wiki/File:WaltDisneyplansDisneylandDec1954.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_disney_portrait_right.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_Grave.JPG',
 'https://en.wikipedia.org/wiki/File:Roy_O._Dis

#### Retrieve an arbitrary Wikipedia page for "Python" and create a list of the links on that page.

In [13]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python'
response = requests.get(url)

response

<Response [200]>

In [14]:
# your code here
soup = BeautifulSoup(response.content)

# soup

In [15]:
link_tags = [lt for lt in soup.find_all('li') if lt.attrs == {}]
links = ['https://en.wikipedia.org' + lt.a.get('href') for lt in link_tags][:27]

rnd_url = random.choice(links)
print(rnd_url)
response = requests.get(rnd_url)
soup = BeautifulSoup(response.content)

https://en.wikipedia.org/wiki/Python_Anghelo


In [16]:
anchor_tags = [at for at in soup.find_all('a') if at.has_attr('href')]
href_tags = [at.get('href') for at in anchor_tags]

links = []
for ht in href_tags:
    if ht.startswith('#'):
        links.append(rnd_url + ht)
    elif ht.startswith('/wiki/'):
        links.append('https://en.wikipedia.org' + ht)
    elif ht.startswith('/w/') or ht.startswith('//'):
        pass
    elif ht.startswith('http'):
        links.append(ht)
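
The branching in the cell above rebuilds absolute URLs by hand. As an alternative sketch, the standard library's `urllib.parse.urljoin` resolves fragment, root-relative and protocol-relative links against the page URL. The sample `hrefs` below are made up for illustration. Unlike the cell above, this version does not skip `/w/` or `//` links, so filter those separately if you want the same result.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Python_Anghelo"

# Illustrative href values of the kinds handled in the cell above.
hrefs = [
    "#Life",
    "/wiki/Pinball",
    "//upload.wikimedia.org/x.png",
    "https://example.org/",
]

# urljoin leaves absolute URLs untouched and resolves the rest.
absolute = [urljoin(page_url, href) for href in hrefs]
for link in absolute:
    print(link)
```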

In [17]:
links

['https://en.wikipedia.org/wiki/Python_Anghelo#mw-head',
 'https://en.wikipedia.org/wiki/Python_Anghelo#p-search',
 'https://en.wikipedia.org/wiki/Graphic_artist',
 'https://en.wikipedia.org/wiki/Video_game',
 'https://en.wikipedia.org/wiki/Pinball',
 'https://en.wikipedia.org/wiki/Transylvania',
 'https://en.wikipedia.org/wiki/Romania',
 'https://en.wikipedia.org/wiki/United_States',
 'https://en.wikipedia.org/wiki/Python_Anghelo#cite_note-obit-1',
 'https://en.wikipedia.org/wiki/Python_Anghelo#Life',
 'https://en.wikipedia.org/wiki/Python_Anghelo#Pinball_projects',
 'https://en.wikipedia.org/wiki/Python_Anghelo#Video_game_projects_(incomplete)',
 'https://en.wikipedia.org/wiki/Python_Anghelo#References',
 'https://en.wikipedia.org/wiki/Python_Anghelo#External_links',
 'https://en.wikipedia.org/wiki/Disney',
 'https://en.wikipedia.org/wiki/Williams_Electronics',
 'https://en.wikipedia.org/wiki/Joust_(video_game)',
 'https://en.wikipedia.org/wiki/Python_Anghelo#cite_note-2',
 'https://

#### Find the number of titles that have changed in the United States Code since its last release point.

In [18]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'
response = requests.get(url)

response

<Response [200]>

In [19]:
# your code here
soup = BeautifulSoup(response.content)

# soup

In [20]:
title_tags = soup.find_all('div', {'class': 'usctitlechanged'})

len(title_tags)

5

#### Create a Python list with the names on the FBI's Ten Most Wanted list.

In [21]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'
response = requests.get(url)

response

<Response [200]>

In [22]:
# your code here
soup = BeautifulSoup(response.content)

# soup

In [23]:
name_tags = soup.find_all('li', {'class': 'portal-type-person castle-grid-block-item'})
names = [nt.h3.text.strip('\n') for nt in name_tags]
names

['BHADRESHKUMAR CHETANBHAI PATEL',
 'ARNOLDO JIMENEZ',
 'ALEJANDRO ROSALES CASTILLO',
 'YASER ABDEL SAID',
 'JASON DEREK BROWN',
 'ALEXIS FLORES',
 'EUGENE PALMER',
 'SANTIAGO VILLALBA MEDEROS',
 'RAFAEL CARO-QUINTERO',
 'ROBERT WILLIAM FISHER']

#### Display the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe.

In [24]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'
lst = pd.read_html(url + '?view=1')
df = lst[3][['Date & Time UTC', 'Latitude degrees', 'Longitude degrees', 'Last update [-]']]
df = df.T.reset_index(level=1, drop=True).T.dropna(axis=0)[:20]

In [25]:
df[['date', 'time']] = df['Date & Time UTC'].apply(lambda x: pd.Series(str(x).split(' '))).drop([2, 3], axis=1)
df['latitude'] = df.iloc[:, 1].str.cat(df.iloc[:, 2], sep=' ')
df['longitude'] = df.iloc[:, 3].str.cat(df.iloc[:, 4], sep=' ')
df.rename(columns={'Last update [-]': 'region name'}, inplace=True)
df = df[['date', 'time', 'latitude', 'longitude', 'region name']]
df['time'] = df['time'].apply(lambda x: re.match(r'\d{2}:\d{2}:\d{2}\.\d', x).group(0))

df

Unnamed: 0,date,time,latitude,longitude,region name
0,2019-12-19,01:30:25.0,1.06 S,121.41 E,"SULAWESI, INDONESIA"
1,2019-12-19,01:27:32.3,41.62 N,19.45 E,ADRIATIC SEA
2,2019-12-19,01:06:43.7,17.91 N,66.58 W,PUERTO RICO REGION
3,2019-12-19,00:39:46.7,41.56 N,19.63 E,ALBANIA
4,2019-12-19,00:33:04.9,1.59 S,67.78 E,CARLSBERG RIDGE
5,2019-12-19,00:23:30.0,36.30 N,140.70 E,"NEAR EAST COAST OF HONSHU, JAPAN"
9,2019-12-18,23:40:08.0,22.01 S,68.70 W,"ANTOFAGASTA, CHILE"
10,2019-12-18,23:37:37.0,30.96 N,50.17 E,SOUTHERN IRAN
11,2019-12-18,23:05:04.0,20.86 S,69.19 W,"TARAPACA, CHILE"
12,2019-12-18,22:31:34.0,21.83 S,68.59 W,"ANTOFAGASTA, CHILE"
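
The date and time columns above come from one combined cell that is split and then trimmed with a regular expression. A minimal sketch of doing both in a single pattern follows. The raw string is an assumed example of what such a cell might contain, and `split_stamp` is a hypothetical helper name.

```python
import re

# ISO date, some whitespace, then HH:MM:SS with one decimal digit.
STAMP = re.compile(r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}\.\d)")


def split_stamp(raw):
    """Return (date, time) found in raw, or None if the pattern is absent."""
    match = STAMP.search(raw)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Assumed cell content; the trailing text is ignored by the pattern.
print(split_stamp("2019-12-19   01:30:25.0 12min ago"))
```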


#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. Include a ***try/except block*** to handle account names that are not found.
<br>***Hint:*** the program should count the number of tweets for any provided account.

In [26]:
# This is the url you will scrape in this exercise
# You will need to append the account handle to this url
url = 'https://twitter.com/'
user = 'realDonaldTrump'
response = requests.get(url + user)

response

<Response [200]>

In [27]:
# your code here
soup = BeautifulSoup(response.content)

# soup

In [28]:
trump_numbers = soup.find_all('span', {'class': 'ProfileNav-value'})
n_tweets = trump_numbers[0].text.replace(',', '.')[:4] + 'K'

n_tweets

'47.3K'
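
The try/except requirement of this exercise is not covered by the cells above, so here is one possible shape for it. Everything in this sketch is an assumption made for illustration: the exception class `AccountNotFound`, the stand-in `fake_fetch_count` (which replaces the real request and parsing step with made-up data) and the function name `tweet_count`.

```python
class AccountNotFound(Exception):
    """Hypothetical error raised when a handle has no profile page."""


def fake_fetch_count(handle):
    """Stand-in for the request + parse step; returns made-up data."""
    known = {"example_user": "1,234"}
    try:
        return known[handle]
    except KeyError:
        raise AccountNotFound(handle) from None


def tweet_count(handle, fetch=fake_fetch_count):
    """Return the count as an int, or None if the account is missing."""
    handle = handle.lstrip("@")
    try:
        raw = fetch(handle)
    except AccountNotFound:
        print(f"Account @{handle} not found")
        return None
    return int(raw.replace(",", ""))


print(tweet_count("@example_user"))
print(tweet_count("@no_such_account"))
```

With a live request, calling `response.raise_for_status()` and catching `requests.exceptions.HTTPError` could play the role that `AccountNotFound` plays here.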

#### Count the number of followers of a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. Include a ***try/except block*** to handle account names that are not found.
<br>***Hint:*** the program should count the followers for any provided account.

In [29]:
# This is the url you will scrape in this exercise
# You will need to append the account handle to this url
url = 'https://twitter.com/'

In [30]:
# your code here
n_followers = trump_numbers[2].text.replace(',', '.')[:4] + 'M'

n_followers

'67.7M'

#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [31]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'
response = requests.get(url)

response

<Response [200]>

In [32]:
# your code here
soup = BeautifulSoup(response.content)

# soup

In [33]:
div_tags = soup.find_all('div', {'dir': 'ltr'})
langs = [dt.strong.text for dt in div_tags]

n_articles = [','.join(dt.small.bdi.text.split('\xa0')) for dt in div_tags]
lang_n_article_pairs = list(zip(langs, n_articles))

lang_n_article_pairs

[('English', '5,982,000+'),
 ('日本語', '1,181,000+'),
 ('Español', '1,564,000+'),
 ('Deutsch', '2,375,000+'),
 ('Русский', '1,584,000+'),
 ('Français', '2,164,000+'),
 ('Italiano', '1,572,000+'),
 ('中文', '1,086,000+'),
 ('Português', '1,017,000+'),
 ('Polski', '1,373,000+')]
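
The article counts above are kept as strings. If you want to sort or compare them, a small conversion step helps. The sketch below assumes the raw text uses non-breaking spaces (`\xa0`) as thousands separators, as the `split('\xa0')` call in the cell above suggests. `article_count` is a hypothetical helper name.

```python
def article_count(raw):
    """Convert text like '5\xa0982\xa0000+' or '5,982,000+' to an int."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return int(digits)  # raises ValueError if raw holds no digits


print(article_count("5\xa0982\xa0000+"))
print(article_count("1,181,000+"))
```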

#### List the different kinds of datasets available on data.gov.uk.

In [34]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk'
response = requests.get(url)

response

<Response [200]>

In [35]:
# your code here
soup = BeautifulSoup(response.content)

# soup

In [36]:
fields_dict = {}
div_tags1 = soup.find_all('div', {'class': 'column-one-third'})
for div in div_tags1:
    li_tags = div.find_all('li')
    for li in li_tags:
        fields_dict[li.h2.text] = {}
        response = requests.get(url + li.h2.a.get('href'))
        soup = BeautifulSoup(response.content)
        div_tags2 = soup.find_all('div', {'class': 'dgu-results__result'})
        fields_dict[li.h2.text]['dataset'] = []
        fields_dict[li.h2.text]['published_by'] = []
        fields_dict[li.h2.text]['link'] = []
        for div in div_tags2:
            fields_dict[li.h2.text]['dataset'].append(div.h2.a.text)
            fields_dict[li.h2.text]['published_by'].append(div.dl.dd.text.strip('\n'))
            fields_dict[li.h2.text]['link'].append(url + div.h2.a.get('href'))

In [37]:
fields_df = pd.concat([pd.DataFrame(fields_dict[f]) for f in fields_dict], axis=1, keys=fields_dict.keys())

fields_df

Unnamed: 0_level_0,Business and economy,Business and economy,Business and economy,Crime and justice,Crime and justice,Crime and justice,Defence,Defence,Defence,Education,...,Mapping,Society,Society,Society,Towns and cities,Towns and cities,Towns and cities,Transport,Transport,Transport
Unnamed: 0_level_1,dataset,published_by,link,dataset,published_by,link,dataset,published_by,link,dataset,...,link,dataset,published_by,link,dataset,published_by,link,dataset,published_by,link
0,Non consolidated performance pay Charity Commi...,Charity Commission for England and Wales,https://data.gov.uk/dataset/46438cde-fcc0-4671...,Crown Prosecution Service hate crime and crime...,Crown Prosecution Service,https://data.gov.uk/dataset/41071520-9d6c-42de...,Chemical Weapons Convention Inspectors Contact...,Not released,https://data.gov.uk/dataset/f8289e21-4bef-4cdd...,Apprentice Progress,...,https://data.gov.uk/dataset/9ca46149-0988-48f2...,Crown Prosecution Service Conditional Cautioni...,Crown Prosecution Service,https://data.gov.uk/dataset/93ff053a-fd24-4e6c...,Annual Planned Activity on Delivery Feed Law C...,Food Standards Agency,https://data.gov.uk/dataset/d710a81c-6836-4dbc...,Freight moved,Office of Rail and Road,https://data.gov.uk/dataset/d35706a4-8841-477e...
1,Tobacco Duties Statistical Bulletin,Her Majesty's Revenue and Customs,https://data.gov.uk/dataset/c20be4da-a47f-4de1...,Crown Prosecution Service case outcomes by pri...,Crown Prosecution Service,https://data.gov.uk/dataset/c0cb19af-f893-45b9...,Chemical Weapons Convention Inspection reports,Not released,https://data.gov.uk/dataset/0cf707a7-022d-4a4b...,Training Needs Analysis,...,https://data.gov.uk/dataset/d577a2ca-83e8-4077...,Revenue-based Taxes and Benefits: Enterprise i...,Her Majesty's Revenue and Customs,https://data.gov.uk/dataset/acb0f200-32f6-47a0...,Primary Authority Inspection Plans,Food Standards Agency,https://data.gov.uk/dataset/f65083d6-9445-465c...,Passenger rail service complaints,Office of Rail and Road,https://data.gov.uk/dataset/8838af1f-53cb-4a05...
2,UK Overseas Trade Statistics,Her Majesty's Revenue and Customs,https://data.gov.uk/dataset/09656884-4c9b-4463...,"Conditional Cautioning Data, Quarter 1 2013-2014",Crown Prosecution Service,https://data.gov.uk/dataset/c74eaf46-26a8-45c5...,Senior Staff hospitality received April 2010 t...,Ministry of Defence,https://data.gov.uk/dataset/00d77e50-104f-477a...,Lessons learned,...,https://data.gov.uk/dataset/93280408-662c-45db...,National Insurance Contributions (NICs),Not released,https://data.gov.uk/dataset/c2fcbbdd-ff76-4286...,Electronic File Series 2004-2013,Food Standards Agency,https://data.gov.uk/dataset/00b0d9b1-4697-466e...,Government support,Not released,https://data.gov.uk/dataset/1e443830-b233-4c2a...
3,Organisational Unit Areas,Not released,https://data.gov.uk/dataset/cae06487-018a-4daf...,Crown Prosecution Service Conditional Cautioni...,Crown Prosecution Service,https://data.gov.uk/dataset/52f8f3a2-734a-4973...,MOD: senior officials' domestic and internatio...,Ministry of Defence,https://data.gov.uk/dataset/19ce52c0-8d45-4f3e...,Student Loan Repayments,...,https://data.gov.uk/dataset/6e23d706-cff0-4288...,Debt data,Not released,https://data.gov.uk/dataset/7153aa91-23a4-41a3...,Scheduled Dairy Visits,Food Standards Agency,https://data.gov.uk/dataset/03452914-7ba7-460d...,Network Rail monitor key statistics,Office of Rail and Road,https://data.gov.uk/dataset/f901e369-eea1-445b...
4,UK Property Transaction Statistics,Her Majesty's Revenue and Customs,https://data.gov.uk/dataset/f4dee494-f9b5-41d8...,Crown Prosecution Service case outcomes by pri...,Crown Prosecution Service,https://data.gov.uk/dataset/5b249cb4-adf7-4890...,2009 Army Cadet Force Survey- Cadets Responses,Not released,https://data.gov.uk/dataset/0cec3148-190f-4a64...,Higher Qualifications,...,https://data.gov.uk/dataset/70f5dbbe-e8ed-43dd...,Revenue-based Taxes and Benefits: Employee Sha...,Her Majesty's Revenue and Customs,https://data.gov.uk/dataset/69e97232-a7da-48ea...,Food Standards Agency File Plan,Food Standards Agency,https://data.gov.uk/dataset/a4538b56-5f25-4d6d...,Rail fares,Office of Rail and Road,https://data.gov.uk/dataset/9caca2dd-7baf-429c...
5,Cash collected from compliance,Not released,https://data.gov.uk/dataset/faa3a033-6d4a-49b2...,CPS Conditional Cautioning Data Quarter 4 2010...,Crown Prosecution Service,https://data.gov.uk/dataset/82ccf3c4-beb9-44d7...,Cemetery Database Post 1900,Not released,https://data.gov.uk/dataset/eef0e112-408c-403d...,Summer GCSEs and IGCSEs entries for England,...,https://data.gov.uk/dataset/6b09a797-2bbe-432d...,Personal Tax Model,Not released,https://data.gov.uk/dataset/17d711a3-c495-4546...,Management Development Programme,Not released,https://data.gov.uk/dataset/9284d7ef-56d7-4028...,Signals passed at danger (SPADs),Office of Rail and Road,https://data.gov.uk/dataset/0c35ce31-9ad3-4990...
6,Enterprise Data Models,Not released,https://data.gov.uk/dataset/df11f2da-4631-4234...,Conditional Cautions - Outcome of CPS decision...,Crown Prosecution Service,https://data.gov.uk/dataset/3517bbeb-5d49-407a...,Statistical Series 5,Ministry of Defence,https://data.gov.uk/dataset/ef713193-05eb-4d41...,Initial teacher education (ITE) inspections an...,...,https://data.gov.uk/dataset/66faae6f-6e57-466f...,Workforce Planning,Not released,https://data.gov.uk/dataset/1a46f8e4-1cdb-4d34...,Measurement Workbooks,Not released,https://data.gov.uk/dataset/0b2e6f0a-04ed-4e42...,Quarterly statistical summary,Office of Rail and Road,https://data.gov.uk/dataset/13f66d3a-1858-45cc...
7,Betting and Gaming Factsheet,Her Majesty's Revenue and Customs,https://data.gov.uk/dataset/40ae5222-cabf-4e74...,Crown Prosecution Service case outcomes by pri...,Crown Prosecution Service,https://data.gov.uk/dataset/3a6dd5b3-eb86-47aa...,Chickerell firing programme,Ministry of Defence,https://data.gov.uk/dataset/0a701048-7e92-47cc...,Early years inspections and outcomes,...,https://data.gov.uk/dataset/29659d12-45ab-4b6c...,Cross savings & pensions longitudinal data,Not released,https://data.gov.uk/dataset/245aebdb-e6a1-4f9b...,Section 17 Third Party Information (TPI) Mart,Not released,https://data.gov.uk/dataset/a604fd57-9f48-47d1...,ITS Directive Road Safety Information Data - L...,Transport NI,https://data.gov.uk/dataset/75842be9-20af-43f2...
8,Large Business Service Hit Rate,Not released,https://data.gov.uk/dataset/856004d0-1abd-43a3...,"Conditional Cautioning Data, Quarter 3 2013-2014",Crown Prosecution Service,https://data.gov.uk/dataset/8dcec934-7c2a-4eea...,Monthly Iraq and Afghanistan UK Patient Treatm...,Ministry of Defence,https://data.gov.uk/dataset/8a951230-b2ed-447e...,Civil Service People Survey 2010 – results for...,...,https://data.gov.uk/dataset/aad190dc-5bcf-4a3f...,Personal Details,Not released,https://data.gov.uk/dataset/12fd971d-f24c-4425...,Paper File Series,Food Standards Agency,https://data.gov.uk/dataset/9274cd16-e4b1-44ac...,Freight lifted,Office of Rail and Road,https://data.gov.uk/dataset/ea3f0961-7322-4c49...
9,Businesses Admin Burden,Not released,https://data.gov.uk/dataset/11ce59e0-dd0e-4cfc...,CPS Key Measures,Crown Prosecution Service,https://data.gov.uk/dataset/695b75d3-3b01-4c0a...,Dartmoor firing programme,Ministry of Defence,https://data.gov.uk/dataset/24a16af1-660d-47e1...,Childcare providers and inspections as at 31 M...,...,https://data.gov.uk/dataset/b5b42aa8-3d80-4881...,Destination Of Assets On Death (DAD),Not released,https://data.gov.uk/dataset/8b83d4d3-3921-4ad8...,Ceaser Reports,Not released,https://data.gov.uk/dataset/fa20eba3-a3a8-4b57...,Freight market indicators,Office of Rail and Road,https://data.gov.uk/dataset/9e5041a7-de13-42eb...


#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [38]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
lst = pd.read_html(url)
df = lst[0]

In [39]:
df.sort_values(by='Speakers(millions)', ascending=False).iloc[:10]

Unnamed: 0,Rank,Language,Speakers(millions),% of the World population (March 2019)[8],Language familyBranch
0,1,Mandarin Chinese,918.0,11.922,Sino-TibetanSinitic
1,2,Spanish,480.0,5.994,Indo-EuropeanRomance
2,3,English,379.0,4.922,Indo-EuropeanGermanic
3,4,Hindi (Sanskritised Hindustani)[9],341.0,4.429,Indo-EuropeanIndo-Aryan
4,5,Bengali,228.0,2.961,Indo-EuropeanIndo-Aryan
5,6,Portuguese,221.0,2.87,Indo-EuropeanRomance
6,7,Russian,154.0,2.0,Indo-EuropeanBalto-Slavic
7,8,Japanese,128.0,1.662,JaponicJapanese
8,9,Western Punjabi[10],92.7,1.204,Indo-EuropeanIndo-Aryan
9,10,Marathi,83.1,1.079,Indo-EuropeanIndo-Aryan


## Bonus
#### Scrape a certain number of tweets of a given Twitter account.

In [40]:
# This is the url you will scrape in this exercise
# You will need to append the account handle to this url
url = 'https://twitter.com/'
user = 'realDonaldTrump'
response = requests.get(url + user)

response

<Response [200]>

In [41]:
# your code here
soup = BeautifulSoup(response.content)

# soup

In [42]:
div_tags = soup.find_all('div', {'class': 'js-tweet-text-container'})
tweets = [dt.p.text for dt in div_tags]

print(*tweets, sep='\n\n')

THANK YOU! #KAGhttps://twitter.com/mattfinnfnc/status/1207359122372476928 …

Thank you Michigan, I am on my way. See everybody soon! #KAGpic.twitter.com/GP9SbH67CN

SUCH ATROCIOUS LIES BY THE RADICAL LEFT, DO NOTHING DEMOCRATS. THIS IS AN ASSAULT ON AMERICA, AND AN ASSAULT ON THE REPUBLICAN PARTY!!!!

....won’t convict and remove the President - Then the House should not be Impeaching the President in the first place. If this is the new standard, every President from here on out is impeachable.”  Andy McCarthy @FoxNews  So well stated. Thank you!

In the end here, nothing happened. We don’t approach anything like the egregious conduct that should be necessary before a President should be removed from office. I believe that a President can’t be removed from office if there is no reasonable possibility that the Senate..

“The evidence has to be overwhelming, and it is not. It’s not even close.” Ken Starr, Former Independent Counsel

https://www.foxnews.com/opinion/gregg-jarrett-ig-horowi

#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [43]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'
response = requests.get(url)

response

<Response [200]>

In [44]:
soup = BeautifulSoup(response.content)

# soup

In [45]:
# your code here
title_tags = soup.find_all('td', {'class': 'titleColumn'})
rating_tags = soup.find_all('td', {'class': 'ratingColumn imdbRating'})

titles = [tt.a.text for tt in title_tags]
release_year = [tt.span.text.lstrip('(').rstrip(')') for tt in title_tags]
ratings = [rt.text.strip('\n') for rt in rating_tags]
directors = [tt.a.get('title').split('(dir.)')[0].rstrip() for tt in title_tags]
actors = [tt.a.get('title').split('(dir.)')[1].lstrip(', ') for tt in title_tags]

imdb_dict = {'title': titles, 'release_year': release_year, 'imdb_rating': ratings, 'director': directors, 'actors': actors}
imdb_df = pd.DataFrame(imdb_dict)
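
The cell above splits each `title` attribute on `'(dir.)'` and indexes the result, which raises an `IndexError` if the marker is ever missing. A more defensive sketch uses `str.partition`. The helper name `split_credits` is an assumption, and the input format is inferred from the code and output in this notebook.

```python
def split_credits(title_attr):
    """Split 'Director (dir.), Star A, Star B' into (director, [stars])."""
    director, marker, rest = title_attr.partition("(dir.)")
    if not marker:  # marker missing: return the text as-is, no stars
        return title_attr.strip(), []
    stars = [s.strip() for s in rest.split(",") if s.strip()]
    return director.strip(), stars


print(split_credits("Frank Darabont (dir.), Tim Robbins, Morgan Freeman"))
```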

In [46]:
imdb_df

Unnamed: 0,title,release_year,imdb_rating,director,actors
0,Um Sonho de Liberdade,1994,9.2,Frank Darabont,"Tim Robbins, Morgan Freeman"
1,O Poderoso Chefão,1972,9.1,Francis Ford Coppola,"Marlon Brando, Al Pacino"
2,O Poderoso Chefão II,1974,9.0,Francis Ford Coppola,"Al Pacino, Robert De Niro"
3,Batman: O Cavaleiro das Trevas,2008,9.0,Christopher Nolan,"Christian Bale, Heath Ledger"
4,12 Homens e uma Sentença,1957,8.9,Sidney Lumet,"Henry Fonda, Lee J. Cobb"
...,...,...,...,...,...
245,Aladdin,1992,8.0,Ron Clements,"Scott Weinger, Robin Williams"
246,Munna Bhai M.B.B.S.,2003,8.0,Rajkumar Hirani,"Sanjay Dutt, Arshad Warsi"
247,A Batalha de Argel,1966,8.0,Gillo Pontecorvo,"Brahim Hadjadj, Jean Martin"
248,Conflitos Internos,2002,8.0,Andrew Lau,"Andy Lau, Tony Chiu-Wai Leung"


#### Display the movie name, year and a brief summary of 10 random movies from IMDB's top 250 as a pandas dataframe.

In [47]:
# This is the url you will scrape in this exercise
url = 'https://www.imdb.com/chart/top'

In [48]:
# your code here
url_2 = 'https://www.imdb.com'

rand_movies = random.sample(range(250), 10)

href_tags = [tt.a.get('href') for tt in title_tags]
summaries = []
for ht in pd.Series(href_tags).loc[rand_movies]:
    response = requests.get(url_2 + ht)
    soup = BeautifulSoup(response.content)
    summary = soup.find('div', {'class': 'summary_text'}).text.strip('\n ')
    summaries.append(summary)
    
summaries = pd.Series(summaries, index=rand_movies, name='summary')

In [49]:
pd.concat([imdb_df.loc[rand_movies, ['title', 'release_year']], summaries], axis=1)

Unnamed: 0,title,release_year,summary
77,O Barco: Inferno no Mar,1981,The claustrophobic world of a WWII German U-bo...
180,Era uma Vez em Tóquio,1953,An old couple visit their children and grandch...
208,Platoon,1986,A young soldier in Vietnam faces a moral crisi...
59,Django Livre,2012,"With the help of a German bounty hunter, a fre..."
93,A Caça,2012,"A teacher lives a lonely life, all the while s..."
132,Fugindo do Inferno,1963,Allied prisoners of war plan for several hundr...
98,O Garoto,1921,"The Tramp cares for an abandoned child, but ev..."
112,Cafarnaum,2018,While serving a five-year sentence for a viole...
3,Batman: O Cavaleiro das Trevas,2008,When the menace known as the Joker wreaks havo...
229,Fúria Sanguinária,1949,A psychopathic criminal with a mother complex ...


#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [50]:
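# https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather'
# Let requests build and URL-encode the query string (handles special characters in city names)
params = {'q': city, 'APPID': 'b35975e18dc93725acb092f7272cc6b8', 'units': 'metric'}
response = requests.get(url, params=params)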

out = response.json()

Enter the city: chicago


In [51]:
# your code here
print(out)
temp = out['main']['temp']
wind_spd = out['wind']['speed']
descript = out['weather'][0]['description']
weather = out['weather'][0]['main']

print()
print('WEATHER REPORT:')
print(f'Temperature: {temp}')
print(f'Wind Speed: {wind_spd}')
print(f'Description: {descript}')
print(f'Weather: {weather}')

{'coord': {'lon': -87.62, 'lat': 41.88}, 'weather': [{'id': 803, 'main': 'Clouds', 'description': 'broken clouds', 'icon': '04n'}], 'base': 'stations', 'main': {'temp': -8.67, 'feels_like': -13.08, 'temp_min': -10, 'temp_max': -7, 'pressure': 1029, 'humidity': 61}, 'visibility': 16093, 'wind': {'speed': 1.5, 'deg': 210}, 'clouds': {'all': 75}, 'dt': 1576720561, 'sys': {'type': 1, 'id': 4505, 'country': 'US', 'sunrise': 1576674783, 'sunset': 1576707662}, 'timezone': -21600, 'id': 4887398, 'name': 'Chicago', 'cod': 200}

WEATHER REPORT:
Temperature: -8.67
Wind Speed: 1.5
Description: broken clouds
Weather: Clouds
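
The cell above indexes the JSON directly, so an unknown city would fail with a `KeyError`. The following sketch reads the same fields defensively. It assumes that error payloads carry a non-200 `cod` value and a `message` field; check the API documentation before relying on that. `weather_report` is a hypothetical helper name, and `sample` is a trimmed copy of the response printed above.

```python
def weather_report(payload):
    """Build a small report dict from a weather API payload."""
    if str(payload.get("cod")) != "200":
        # Assumption: failures carry a non-200 'cod' and a 'message'.
        return {"error": payload.get("message", "unknown error")}
    first = (payload.get("weather") or [{}])[0]
    return {
        "temperature": payload.get("main", {}).get("temp"),
        "wind_speed": payload.get("wind", {}).get("speed"),
        "description": first.get("description"),
        "weather": first.get("main"),
    }


# Trimmed copy of the response printed above.
sample = {
    "weather": [{"id": 803, "main": "Clouds", "description": "broken clouds"}],
    "main": {"temp": -8.67},
    "wind": {"speed": 1.5, "deg": 210},
    "name": "Chicago",
    "cod": 200,
}
print(weather_report(sample))
print(weather_report({"cod": "404", "message": "city not found"}))
```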


#### Display the book names, prices and stock availability as a pandas dataframe.

In [52]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'
response = requests.get(url)

response

<Response [200]>

In [53]:
# your code here
soup = BeautifulSoup(response.content)

# soup

In [54]:
hrefs = [at.get('href') for at in soup.find('ul', {'class': None}).find_all('a')]
cat_links = [url + href for href in hrefs]

books = {}
books['name'] = []
books['price'] = []
books['availability'] = []
for ct in cat_links:
    response = requests.get(ct)
    soup = BeautifulSoup(response.content)
    books['name'] += [at.text for at in soup.find_all('a') if at.has_attr('title')]
    books['price'] += [pt.text for pt in soup.find_all('p', {'class': 'price_color'})]
    books['availability'] += [pt.text.strip('\n ') for pt in soup.find_all('p', {'class': 'instock availability'})]
    page = 1
    while any([at.text.strip('\n ')  == 'next' for at in soup.find_all('a') if at.has_attr('href')]):
        page += 1
        url = ct.replace('index.html', f'page-{page}.html')
        response = requests.get(url)
        soup = BeautifulSoup(response.content)
        books['name'] += [at.text for at in soup.find_all('a') if at.has_attr('title')]
        books['price'] += [pt.text for pt in soup.find_all('p', {'class': 'price_color'})]
        books['availability'] += [pt.text.strip('\n ') for pt in soup.find_all('p', {'class': 'instock availability'})]
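
The cell above repeats its extraction code inside the `while` loop and builds page URLs with string replacement. The pagination idea on its own can be sketched as a loop that follows each page's "next" link until there is none. The in-memory `SITE`, the stand-in `fetch_page` and the function `scrape_category` are all assumptions made so the example runs offline. In the notebook, the fetch step would be the request plus the BeautifulSoup parsing.

```python
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> (items on page, relative next link).
SITE = {
    "http://example.test/cat/index.html": (["Book A", "Book B"], "page-2.html"),
    "http://example.test/cat/page-2.html": (["Book C"], "page-3.html"),
    "http://example.test/cat/page-3.html": (["Book D"], None),
}


def fetch_page(url):
    """Stand-in for requests.get + parsing; returns (items, next_href)."""
    return SITE[url]


def scrape_category(start_url, fetch=fetch_page, max_pages=100):
    """Collect items from every page by following 'next' links."""
    items, url, seen = [], start_url, set()
    # Stop on a missing link, a repeated URL, or the page cap.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_items, next_href = fetch(url)
        items.extend(page_items)
        url = urljoin(url, next_href) if next_href else None
    return items


print(scrape_category("http://example.test/cat/index.html"))
```

Tracking visited URLs and capping the page count guards against a malformed "next" link that would otherwise loop forever.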

In [55]:
books_df = pd.DataFrame(books)

books_df

Unnamed: 0,name,price,availability
0,It's Only the Himalayas,£45.17,In stock
1,Full Moon over Noah’s ...,£49.43,In stock
2,See America: A Celebration ...,£48.87,In stock
3,Vagabonding: An Uncommon Guide ...,£36.94,In stock
4,Under the Tuscan Sun,£37.33,In stock
...,...,...,...
995,Why the Right Went ...,£52.65,In stock
996,Equal Is Unfair: America's ...,£56.86,In stock
997,Amid the Chaos,£36.58,In stock
998,Dark Notes,£19.19,In stock
