# Web Scraping Lab

This notebook contains a set of web scraping exercises to help you practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each URL and inspect its source with Chrome DevTools. You'll need to identify the HTML tags, class names, etc. used for the content you are expected to extract.
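
The first tip can be tried without any network access. The sketch below uses the standard library's `http.HTTPStatus` to turn a numeric code, such as the `status_code` of a `requests` response, into a readable label. `describe_status` and `is_success` are hypothetical helper names, not part of the lab.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '200 OK' for an HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Codes that the standard library does not know about
        return f"{code} (unknown status)"


def is_success(code):
    """True for 2xx codes, which mean the request succeeded."""
    return 200 <= code < 300


for code in (200, 404, 500):
    print(describe_status(code), "-> success" if is_success(code) else "-> failed")
```

A check like `is_success(response.status_code)` before parsing helps avoid scraping an error page by mistake.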

**Resources:**

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

### Make sure you have all libraries installed before starting the lab!

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries, feel free to uncomment them.

In [123]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
# from urllib.request import urlopen
# import random
# import re
# import scrapy

# requests:
"Requests is one of the most downloaded Python packages of all time, pulling in over 400,000 downloads each day. Join the party!"

[Source](https://2.python-requests.org/en/master/)

##### internal jokes:
"Requests is the only Non-GMO HTTP library for Python, safe for human consumption." #3556
[Source](https://github.com/kennethreitz/requests/issues/3556)

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [3]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [128]:
# start by calling requests.get on the url
get_html = requests.get(url)

In [130]:
# exploring the attributes of the Response object returned by requests.get()
get_html.status_code

200

In [131]:
get_html.encoding

'ISO-8859-1'

In [134]:
get_html.headers['content-type']

'text/html'

In [135]:
# reading the content attribute (the raw response body as bytes)
html = get_html.content

In [136]:
soup = BeautifulSoup(
    # element
    html, 
    # parser type
    "lxml")

In [139]:
soup

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="static

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like this:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
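
The whitespace clean-up described in step 3 can be sketched as follows. This is a minimal example on made-up strings shaped like the text of a scraped element; `clean_name` is a hypothetical helper, not something the lab provides.

```python
def clean_name(raw_text):
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    # str.split() with no argument splits on any whitespace and drops empty pieces
    return " ".join(raw_text.split())


# Hypothetical raw strings, shaped like the text of a scraped element
raw_names = ["\n\n\n\n      Faker\n    \n", "  Jake\n   Wharton "]
clean_names = [clean_name(name) for name in raw_names]
print(clean_names)
```

To drop spaces altogether, as in the expected output above, join with an empty string instead: `"".join(raw_text.split())`.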

In [142]:
# find_all is a method of the soup object; print the text of every <h1> element
for element in soup.find_all('h1'):
    print(element.text)

In [7]:
tags = ['h1']

In [8]:
text = [element.text for element in soup.find_all(tags)][1:]

In [9]:
text

['Francois Zaninotto',
 '\n\n\n\n      Faker\n    \n',
 'Philip Walton',
 '\n\n\n\n      solved-by-flexbox\n    \n',
 'Michał Pierzchała',
 '\n\n\n\n      jest-preset-angular\n    \n',
 'Alan Shreve',
 '\n\n\n\n      ngrok\n    \n',
 'Gaëtan Renaudeau',
 '\n\n\n\n      gl-react-native-v2\n    \n',
 'Álvaro',
 '\n\n\n\n      fullPage.js\n    \n',
 'Jake Wharton',
 '\n\n\n\n      butterknife\n    \n',
 'Mark Baker',
 '\n\n\n\n      PHPComplex\n    \n',
 '子骅',
 '\n\n\n\n      medis\n    \n',
 'Andrew Trask',
 '\n\n\n\n      Grokking-Deep-Learning\n    \n',
 'Pascal Birchler',
 '\n\n\n\n      oEmbed-API\n    \n',
 'Erik Rasmussen',
 '\n\n\n\n      react-redux-universal-hot-example\n    \n',
 'Ryan Davis',
 '\n\n\n\n      enhanced-ruby-mode\n    \n',
 'Carl Lerche',
 '\n\n\n\n      tower-web\n    \n',
 'Vincenzo Chianese',
 '\n\n\n\n      vscode-apielements\n    \n',
 'shirou',
 '\n\n\n\n      gopsutil\n    \n',
 'Jake Archibald',
 '\n\n\n\n      svgomg\n    \n',
 'Jeff Bezanson',
 '\n\n\n\

#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [12]:
# This is the url you will scrape in this exercise
url2 = 'https://github.com/trending/python?since=daily'

In [13]:
html2 = requests.get(url2).content

In [14]:
html2

b'\n\n\n\n\n\n<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars3.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-67V2J9Se2CifJlftk9/cExHGvxd7N9b9EdGnQEpszu99Ogeecilu9jIDxoCkx3zNLfB9ArraXW0J03qyVmN0Uw==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-e7318add1f7e055d040edb0f75aaa0ba.css" />\n  <link crossorigin="anonymous" media="all" integrity="sha512-MRlTIqIyb8caK5+o8llXVntXovciHyAM4qE3kWU2S7SIjAPDxYp4mE0jQ

In [15]:
soup2 = BeautifulSoup(html2, 'lxml')

In [16]:
soup2

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-e7318add1f7e055d040edb0f75aaa0ba.css" integrity="sha512-67V2J9Se2CifJlftk9/cExHGvxd7N9b9EdGnQEpszu99Ogeecilu9jIDxoCkx3zNLfB9ArraXW0J03qyVmN0Uw==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-294181adec18ed639e160b96b45d17ac.css" integrity="sha512-MRlTIqIyb8caK5+o8llXVntXovciHy

In [17]:
devs = [element.text for element in soup2.find_all('h1')][1:]

In [18]:
devs

['\n\n\ngto76 / python-cheatsheet\n ',
 '\n\n\n0voice / interview_internal_reference\n ',
 '\n\n\nhuggingface / pytorch-transformers\n ',
 '\n\n\nvincent-thevenin / Realistic-Neural-Talking-Head-Models\n ',
 '\n\n\nj3ssie / Osmedeus\n ',
 '\n\n\nNVIDIA / DeepLearningExamples\n ',
 '\n\n\njunyanz / pytorch-CycleGAN-and-pix2pix\n ',
 '\n\n\nmlflow / mlflow\n ',
 '\n\n\neragonruan / text-detection-ctpn\n ',
 '\n\n\nnbei / Deep-Flow-Guided-Video-Inpainting\n ',
 '\n\n\nmatterport / Mask_RCNN\n ',
 '\n\n\nyzddmr6 / webshell-venom\n ',
 '\n\n\nthreat9 / routersploit\n ',
 '\n\n\ngoogle-research / bert\n ',
 '\n\n\nuber / ludwig\n ',
 '\n\n\nsherlock-project / sherlock\n ',
 '\n\n\nCSAILVision / semantic-segmentation-pytorch\n ',
 '\n\n\nlukemelas / EfficientNet-PyTorch\n ',
 '\n\n\nosmr / imgclsmob\n ',
 '\n\n\nmsgi / nlp-journey\n ',
 '\n\n\nidealo / image-super-resolution\n ',
 '\n\n\nhuashengdun / webssh\n ',
 '\n\n\nfacebookresearch / LASER\n ',
 '\n\n\ntensorflow / models\n ',
 '\n\n\ne

In [21]:
devs = [x.replace('\n', '').replace(' ', '') for x in devs]

In [22]:
devs

['gto76/python-cheatsheet',
 '0voice/interview_internal_reference',
 'huggingface/pytorch-transformers',
 'vincent-thevenin/Realistic-Neural-Talking-Head-Models',
 'j3ssie/Osmedeus',
 'NVIDIA/DeepLearningExamples',
 'junyanz/pytorch-CycleGAN-and-pix2pix',
 'mlflow/mlflow',
 'eragonruan/text-detection-ctpn',
 'nbei/Deep-Flow-Guided-Video-Inpainting',
 'matterport/Mask_RCNN',
 'yzddmr6/webshell-venom',
 'threat9/routersploit',
 'google-research/bert',
 'uber/ludwig',
 'sherlock-project/sherlock',
 'CSAILVision/semantic-segmentation-pytorch',
 'lukemelas/EfficientNet-PyTorch',
 'osmr/imgclsmob',
 'msgi/nlp-journey',
 'idealo/image-super-resolution',
 'huashengdun/webssh',
 'facebookresearch/LASER',
 'tensorflow/models',
 'encode/django-rest-framework']

#### Display all the image links from the Walt Disney Wikipedia page

In [28]:
# This is the url you will scrape in this exercise
url3 = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [29]:
http3 = requests.get(url3).content

In [30]:
http3

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Walt Disney - Wikipedia</title>\n<script>document.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Walt_Disney","wgTitle":"Walt Disney","wgCurRevisionId":905475165,"wgRevisionId":905475165,"wgArticleId":32917,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages containing links to subscription-only content","Articles with short description","Wikipedia indefinitely semi-protected pages","Wikipedia indefinitely move-protected pages","Featured articles","Use mdy dates from April 2017","Use American English from May 2016","All Wikipedia articles written in American English","Biography with signature","Articles with hCards","Articles containing German-language 

In [31]:
soup3 = BeautifulSoup(http3, 'lxml')

In [32]:
soup3

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Walt Disney - Wikipedia</title>
<script>document.documentElement.className=document.documentElement.className.replace(/(^|\s)client-nojs(\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Walt_Disney","wgTitle":"Walt Disney","wgCurRevisionId":905475165,"wgRevisionId":905475165,"wgArticleId":32917,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages containing links to subscription-only content","Articles with short description","Wikipedia indefinitely semi-protected pages","Wikipedia indefinitely move-protected pages","Featured articles","Use mdy dates from April 2017","Use American English from May 2016","All Wikipedia articles written in American English","Biography with signature","Articles with hCards","Articles containing German-language text","Ar

In [33]:
href_link = [link.get('href') for link in soup3.find_all('a')][1:]

In [34]:
# checking for elements that are not strings (they cannot be searched with re)
for x in href_link:
    if not isinstance(x, str):
        print('yes')

yes
yes
yes
yes
yes
yes
yes
yes
yes


In [37]:
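# removing those elements; build a new list, because calling remove() on a list
# while iterating over it skips the element that follows each removal
href_link = [x for x in href_link if isinstance(x, str)]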

In [38]:
href_link

['/wiki/Wikipedia:Featured_articles',
 '/wiki/Wikipedia:Protection_policy#semi',
 '#mw-head',
 '#p-search',
 '/wiki/The_Walt_Disney_Company',
 '/wiki/Walt_Disney_(disambiguation)',
 '/wiki/File:Walt_Disney_1946.JPG',
 '/wiki/Chicago',
 '/wiki/Burbank,_California',
 '/wiki/The_Walt_Disney_Company',
 '/wiki/Disney_family',
 '/wiki/Academy_Awards',
 '/wiki/Golden_Globe_Award',
 '/wiki/Emmy_Award',
 '/wiki/File:Walt_Disney_1942_signature.svg',
 '/wiki/Help:IPA/English',
 '#cite_note-OD:_pronunciation-1',
 '/wiki/Modern_animation_in_the_United_States',
 '/wiki/Cartoon',
 '/wiki/Academy_Awards',
 '/wiki/Golden_Globe_Awards',
 '/wiki/Emmy_Award',
 '/wiki/National_Film_Registry',
 '/wiki/Library_of_Congress',
 '/wiki/The_Walt_Disney_Company',
 '/wiki/Roy_O._Disney',
 '/wiki/Ub_Iwerks',
 '/wiki/Mickey_Mouse',
 '/wiki/Technicolor',
 '/wiki/Feature-length',
 '/wiki/Snow_White_and_the_Seven_Dwarfs_(1937_film)',
 '/wiki/Pinocchio_(1940_film)',
 '/wiki/Fantasia_(1940_film)',
 '/wiki/Dumbo',
 '/wiki/

In [40]:
import re

In [42]:
images = []
for x in href_link:
    # escape the dots so that only real .jpg / .png extensions match
    clean_link = re.findall(r'.*\.jpg|.*\.png\b', x)
    if len(clean_link) > 0:
        images.append(clean_link)

In [43]:
images

[['/wiki/File:Walt_Disney_envelope_ca._1921.jpg'],
 ['/wiki/File:Walt_Disney_envelope_ca._1921.jpg'],
 ['/wiki/File:Trolley_Troubles_poster.jpg'],
 ['/wiki/File:Trolley_Troubles_poster.jpg'],
 ['/wiki/File:Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg'],
 ['/wiki/File:Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg'],
 ['/wiki/File:Steamboat-willie.jpg'],
 ['/wiki/File:Steamboat-willie.jpg'],
 ['/wiki/File:Walt_Disney_1935.jpg'],
 ['/wiki/File:Walt_Disney_1935.jpg'],
 ['/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg'],
 ['/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg'],
 ['/wiki/File:Disney_drawing_goofy.jpg'],
 ['/wiki/File:Disney_drawing_goofy.jpg'],
 ['/wiki/File:DisneySchiphol1951.jpg'],
 ['/wiki/File:DisneySchiphol1951.jpg'],
 ['/wiki/File:WaltDisneyplansDisneylandDec1954.jpg'],
 ['/wiki/File:WaltDisneyplansDisneylandDec1954.jpg'],
 ['/wiki/Fil
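
A regex-free way to pick out image links is to test the extension of each link's path. The sketch below works on sample values shaped like the hrefs collected above; the extension list and the helper name `is_image_link` are assumptions chosen for illustration.

```python
from urllib.parse import urlparse

# Assumed set of extensions to treat as images
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".svg", ".gif")


def is_image_link(href):
    """True when href is a string whose path ends with a known image extension."""
    if not isinstance(href, str):
        return False  # <a> tags without an href give None
    path = urlparse(href).path
    return path.lower().endswith(IMAGE_EXTENSIONS)


# Sample values shaped like the hrefs collected above
hrefs = [
    "/wiki/File:Walt_Disney_1946.JPG",
    "/wiki/Chicago",
    None,
    "/wiki/File:Steamboat-willie.jpg",
    "/wiki/File:Steamboat-willie.jpg",
    "#mw-head",
]

# dict.fromkeys keeps the first occurrence of each link and preserves order
image_links = list(dict.fromkeys(h for h in hrefs if is_image_link(h)))
print(image_links)
```

Lower-casing the path also catches upper-case extensions such as `.JPG`, and `dict.fromkeys` removes the duplicates that appear when the same file is linked twice.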

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page

In [45]:
# This is the url you will scrape in this exercise
url4 ='https://en.wikipedia.org/wiki/Python' 

In [46]:
http4 = requests.get(url4).content

In [47]:
soup4 = BeautifulSoup(http4, 'lxml')

In [48]:
href = [link.get('href') for link in soup4.find_all('a')][1:]

In [49]:
links = []
for x in href:
    list_clean = re.findall(r'\bhttp.*', x)
    if len(list_clean) > 0:
        links.append(list_clean[0])

In [50]:
links

['https://en.wiktionary.org/wiki/Python',
 'https://en.wiktionary.org/wiki/python',
 'https://en.wikipedia.org/w/index.php?title=Python&oldid=905477736',
 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en',
 'https://www.wikidata.org/wiki/Special:EntityPage/Q747452',
 'https://commons.wikimedia.org/wiki/Category:Python',
 'https://af.wikipedia.org/wiki/Python',
 'https://als.wikipedia.org/wiki/Python',
 'https://az.wikipedia.org/wiki/Python',
 'https://bn.wikipedia.org/wiki/%E0%A6%AA%E0%A6%BE%E0%A6%87%E0%A6%A5%E0%A6%A8_(%E0%A6%A6%E0%A7%8D%E0%A6%AC%E0%A7%8D%E0%A6%AF%E0%A6%B0%E0%A7%8D%E0%A6%A5%E0%A6%A4%E0%A6%BE_%E0%A6%A8%E0%A6%BF%E0%A6%B0%E0%A6%B8%E0%A6%A8)',
 'https://be.wikipedia.org/wiki/Python',
 'https://bg.wikipedia.org/wiki/%D0%9F%D0%B8%D1%82%D0%BE%D0%BD_(%D0%BF%D0%BE%D1%8F%D1%81%D0%BD%D0%B5%D0%BD%D0%B8%D0%B5)',
 'https://cs.wikipedia.org/wiki/Python_(rozcestn%C3%ADk)',
 'https://da.wik
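
The regex above keeps only hrefs that already contain `http`, so relative links such as `/wiki/...` are dropped. One possible alternative, sketched here on sample values, is to resolve every href against the page URL with `urllib.parse.urljoin`. The helper name `absolute_links` and the sample hrefs are illustrative assumptions.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://en.wikipedia.org/wiki/Python"


def absolute_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url and keep only http(s) links."""
    links = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # skip missing hrefs and in-page anchors
        full_url = urljoin(base_url, href)
        if urlparse(full_url).scheme in ("http", "https"):
            links.append(full_url)
    return links


# Illustrative hrefs: missing, in-page anchor, relative, protocol-relative, absolute
sample_hrefs = [
    None,
    "#mw-head",
    "/wiki/Python_(programming_language)",
    "//commons.wikimedia.org/wiki/Category:Python",
    "https://en.wiktionary.org/wiki/Python",
]
print(absolute_links(sample_hrefs))
```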

#### Number of Titles that have changed in the United States Code since its last release point 

In [51]:
# This is the url you will scrape in this exercise
url5 = 'http://uscode.house.gov/download/download.shtml'

In [52]:
http5 = requests.get(url5).content

In [53]:
soup5 = BeautifulSoup(http5, 'lxml')

In [54]:
updatedTitles = soup5.find_all('div', {'class':'usctitlechanged'}) 

In [55]:
len(updatedTitles)

15

#### A Python list with the names of the FBI's Ten Most Wanted

In [57]:
# This is the url you will scrape in this exercise
url6 = 'https://www.fbi.gov/wanted/topten'

In [58]:
http6 = requests.get(url6).content

In [59]:
soup6 = BeautifulSoup(http6, 'lxml')

In [60]:
crim = [x.text for x in soup6.find_all('h3')]

In [61]:
crim = [x.replace('\n', '').title() for x in crim]

In [62]:
crim

['Alejandro Rosales Castillo',
 'Yaser Abdel Said',
 'Jason Derek Brown',
 'Rafael Caro-Quintero',
 'Alexis Flores',
 'Eugene Palmer',
 'Santiago Villalba Mederos',
 'Robert William Fisher',
 'Bhadreshkumar Chetanbhai Patel',
 'Arnoldo Jimenez']

#### Info on the 20 latest earthquakes (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [63]:
# This is the url you will scrape in this exercise
url7 = 'https://www.emsc-csem.org/Earthquake/'

In [107]:
# download and parse the EMSC page stored in url7
html = requests.get(url7).content
soup = BeautifulSoup(html, "lxml")
earthquakes = soup.find('tbody', {'id': 'tbody'}).find_all("tr")

nelem = 20
latest_earthquakes = []

for earthquake in earthquakes[:nelem]:
    # Date and time
    date, time = earthquake.find('td', {'class': 'tabev6'}).find('a').text.split()
    # Latitude and longitude
    lat_deg, lon_deg = earthquake.find_all('td', {'class': 'tabev1'})
    lat_dir, lon_dir, magnitude = earthquake.find_all('td', {'class': 'tabev2'})
    lat_deg = f"{lat_deg.text.strip()} {lat_dir.text.strip()}"
    lon_deg = f"{lon_deg.text.strip()} {lon_dir.text.strip()}"
    # Region
    region = earthquake.find('td', {'class': 'tb_region'}).text.strip()
    # Create list of information and append
    earthquake_summary = [date, time, lat_deg, lon_deg, region]
    latest_earthquakes.append(earthquake_summary)

df = pd.DataFrame(latest_earthquakes, columns=['Date', 'Time', 'Latitude', 'Longitude', 'Region'])

In [64]:
http7 = requests.get(url7).content

In [65]:
soup7 = BeautifulSoup(http7, 'lxml')

In [66]:
earthq = [x for x in soup7.find_all('tr', 
                                    {'class':['ligne1 normal', 'ligne2 normal']})]

In [67]:
earthq

[<tr class="ligne1 normal" id="780265" onclick="go_details(event,780265);"><td class="tabev0"></td><td class="tabev0"></td><td class="tabev0"></td><td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=780265">2019-07-18   03:26:12.4</a></b><i class="ago" id="ago0">04min ago</i></td><td class="tabev1">36.12 </td><td class="tabev2">N  </td><td class="tabev1">117.83 </td><td class="tabev2">W  </td><td class="tabev3">2</td><td class="tabev5" id="magtyp0">ML</td><td class="tabev2">2.2</td><td class="tb_region" id="reg0"> CENTRAL CALIFORNIA</td><td class="comment updatetimeno" id="upd0" style="text-align:right;">2019-07-18 03:29</td></tr>,
 <tr class="ligne2 normal" id="780264" onclick="go_details(event,780264);"><td class="tabev0"></td><td class="tabev0"></td><td class="tabev0"></td><td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=780264">2019-07-18   03:05:39.3</a></b><i class="ago" id="ago1">2

In [69]:
dateTime = []
region = []
latLong = []
longLat = []
for x in earthq:
    region.append(x.find_all('td', {'class':'tb_region'}))
    dateTime.append(x.find_all('a'))
    latLong.append(x.find_all('td', {'class':'tabev1'}))
    longLat.append(x.find_all('td', {'class':'tabev2'}))

In [77]:
date = [(re.findall(r'\d\d\d\d-\d\d-\d\d', 
                    str(x)))[0] for x in dateTime]

In [78]:
# raw string, with the dot escaped so it only matches a literal '.'
time = [(re.findall(r'\d\d:\d\d:\d\d\.\d', 
                    str(x)))[0] for x in dateTime]
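
The date and the time can also be captured in a single pass with one pattern and named groups. This is a minimal sketch on a sample string taken from the first row printed above; `split_timestamp` is a hypothetical helper name.

```python
import re

# One pattern with named groups captures the date and the time in a single pass
TIMESTAMP = re.compile(r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d)")


def split_timestamp(text):
    """Return (date, time) found in text, or None when no timestamp is present."""
    match = TIMESTAMP.search(text)
    if match is None:
        return None
    return match.group("date"), match.group("time")


# Link markup copied from the first row of the output above
sample = '<a href="/Earthquake/earthquake.php?id=780265">2019-07-18   03:26:12.4</a>'
print(split_timestamp(sample))
```

Returning `None` instead of indexing `[0]` on an empty result avoids an `IndexError` on rows that carry no timestamp.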

In [81]:
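# read the text of the region cell directly instead of regex-parsing its HTML;
# entries that are already strings are kept, so the cell can safely be re-run
region = [x if isinstance(x, str) else x[0].get_text(strip=True).title()
          for x in region]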

In [82]:
latitude = []
longitude = []
for x in latLong:
    latitude.append((re.findall(r'\d*[.]\d*', str(x)))[0])
    longitude.append((re.findall(r'\d*[.]\d*', str(x)))[1])

In [83]:
i = 0 
for x in longLat:
    latitude[i] += ' '+(re.findall('[NSEW]', str(x)))[0]
    longitude[i] += ' '+(re.findall('[NSEW]', str(x)))[1]
    i += 1

In [84]:
earthDF = pd.DataFrame({'date': date, 
                        'time': time,
                        'region': region,
                        'latitude': latitude,
                        'longitude': longitude})

In [85]:
earthDF.head(20)

| | date | time | region | latitude | longitude |
|---|---|---|---|---|---|
| 0 | 2019-07-18 | 03:26:12.4 | Central California | 36.12 N | 117.83 W |
| 1 | 2019-07-18 | 03:05:39.3 | Southern California | 35.63 N | 117.43 W |
| 2 | 2019-07-18 | 03:02:37.9 | Southern California | 35.62 N | 117.46 W |
| 3 | 2019-07-18 | 02:46:40.0 | Halmahera, Indonesia | 0.43 S | 127.59 E |
| 4 | 2019-07-18 | 02:38:23.0 | Halmahera, Indonesia | 0.69 S | 128.08 E |
| 5 | 2019-07-18 | 02:36:31.4 | Southern California | 35.63 N | 117.43 W |
| 6 | 2019-07-18 | 02:34:59.1 | Southern California | 35.62 N | 117.43 W |
| 7 | 2019-07-18 | 02:33:43.6 | Southern California | 35.75 N | 117.58 W |
| 8 | 2019-07-18 | 02:26:20.9 | Central California | 36.12 N | 117.83 W |
| 9 | 2019-07-18 | 02:03:19.0 | Libertador O'Higgins, Chile | 34.32 S | 70.55 W |


#### Display the date, days, title, city, country of next 25 hackathon events as a Pandas dataframe table

In [92]:
# This is the url you will scrape in this exercise
url8 ='https://hackevents.co/hackathons'

In [None]:
#your code

#### Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account

#### Number of followers of a given Twitter account

You will need to include a ***try/except block*** in case the account name is not found. 
<br>***Hint:*** the program should count the followers for any provided account
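
The try/except requirement can be sketched without any network access. The example assumes that the count is exposed in a `data-count` attribute and uses two made-up HTML fragments; `extract_count` is a hypothetical helper name.

```python
import re


def extract_count(profile_html):
    """Return the first data-count value in profile_html as an int, or None."""
    try:
        # re.search returns None when nothing matches, so .group raises AttributeError
        return int(re.search(r'data-count="(\d+)"', profile_html).group(1))
    except AttributeError:
        return None


# Hypothetical snippets: one profile page fragment and one "not found" page
found = '<span class="ProfileNav-value" data-count="6114">6,114</span>'
missing = "<h1>Sorry, that page doesn't exist!</h1>"

for html_text in (found, missing):
    count = extract_count(html_text)
    print(count if count is not None else "Account name not found...")
```

Catching `AttributeError` specifically, rather than using a bare `except`, keeps unrelated problems such as a `KeyboardInterrupt` from being silently swallowed.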

In [None]:
# This is the url you will scrape in this exercise 
# You will need to append the account name to this url
url = 'https://twitter.com/'

In [108]:
username = input('Please, input your username: ')
html = requests.get(url + username).content
soup = BeautifulSoup(html, "lxml")

try:
    tweet_box = soup.find('li', {'class': 'ProfileNav-item ProfileNav-item--tweets is-active'})
    tweets = tweet_box.find('a').find('span', {'class': 'ProfileNav-value'})
    print("{} has {} tweets.".format(username, tweets.get('data-count')))
except AttributeError:
    # soup.find returns None when the element is missing, e.g. for an unknown account
    print('Account name not found...')

In [86]:
def tweetsNum(accountName, out):
    '''
    Return the number of tweets or the number of followers of a Twitter account.

    args:
        accountName: str - Twitter handle
        out: int - 0 to get the number of tweets, 1 to get the number of followers
    returns:
        int - number of tweets or followers, or None when the input is invalid
        or the request fails

    nums: the number of tweets is stored in the first instance of ProfileNav-value
        (c = 0) and the number of followers in the third one (c = 2); the element
        is turned into a string so it can be read by re.
    The count is read with re.search, which returns the first quoted run of digits,
        i.e. the raw number without the separator used in the displayed text;
        group() gives the matched text, strip() removes the quotes and int()
        converts it to an integer.
    '''
    if out == 0:
        c = 0
    elif out == 1:
        c = 2
    else:
        print('input 0 to get number of tweets, 1 to get number of followers')
        return

    url8 = 'https://twitter.com/'
    http8 = requests.get(url8 + accountName)

    if http8.status_code == 404:
        print('user not found')
        return
    if http8.status_code != 200:
        # any other status would otherwise leave the result undefined
        print('request failed with status code', http8.status_code)
        return

    soup8 = BeautifulSoup(http8.text, 'lxml')
    nums = str(soup8.find_all('span', {'class': 'ProfileNav-value'})[c])
    return int(re.search(r'"\d+"', nums).group().strip('"'))

In [87]:
tweetsNum('neiltyson', 0)

6114

In [88]:
tweetsNum('neiltyson', 1)

13339663

In [89]:
tweetsNum('neiltyson', 2)

input 0 to get number of tweets, 1 to get number of followers


In [90]:
tweetsNum('neiltyso', 1)

user not found


#### List all language names and number of related articles in the order they appear in wikipedia.org

In [93]:
# This is the url you will scrape in this exercise
url10 = 'https://www.wikipedia.org/'

In [109]:
html = requests.get(url10).content
soup = BeautifulSoup(html, "lxml")

# each link-box holds a language name together with its article count
languages = soup.find_all('a', {'class': 'link-box'})
for language in languages:
    print(language.text.strip())

#### A list with the different kinds of datasets available in data.gov.uk

In [95]:
# This is the url you will scrape in this exercise
url11 = 'https://data.gov.uk/'

In [110]:
html = requests.get(url11).content
soup = BeautifulSoup(html, "lxml")
topics = soup.find_all('h2')
for topic in topics:
    print(topic.text)


#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [97]:
# This is the url you will scrape in this exercise
url12 = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [98]:
http12 = requests.get(url12).content

In [99]:
soup12 = BeautifulSoup(http12, 'lxml')

In [100]:
texts = [x.text for x in soup12.find_all('td')]

In [103]:
lang = [texts[i].strip('\n') for i in range(2,100, 9)]

In [104]:
population = [texts[i].strip('\n') for i in range(5,100, 9)]

In [105]:
languages = pd.DataFrame({'language':lang, 
                          'population': population})

In [106]:
languages.head(10)

| | language | population |
|---|---|---|
| 0 | Chinese (macrolanguage) | 1311 |
| 1 | Mandarin | 918 |
| 2 | Spanish | 460 |
| 3 | English | 379 |
| 4 | Hindi | 341 |
| 5 | Arabic (macrolanguage) | 319 |
| 6 | Bengali | 228 |
| 7 | Portuguese | 221 |
| 8 | Russian | 154 |
| 9 | Japanese | 128 |
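
Picking columns out of a flat list of `td` texts with a fixed stride works, but it is easier to follow when the cells are first grouped into rows. The sketch below shows the idea on a made-up list with three cells per row; the real table has more columns, so the row width and column positions here are illustrative assumptions, and `chunk` is a hypothetical helper.

```python
def chunk(cells, row_width):
    """Split a flat list of cell texts into rows of row_width cells each."""
    return [cells[i:i + row_width] for i in range(0, len(cells), row_width)]


# Made-up flat cell list with 3 cells per row: rank, language, speakers (millions)
cells = ["1", "Mandarin", "918", "2", "Spanish", "460", "3", "English", "379"]

rows = chunk(cells, 3)
languages = [row[1] for row in rows]
speakers = [int(row[2]) for row in rows]
print(list(zip(languages, speakers)))
```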


### BONUS QUESTIONS

#### Scrape a certain number of tweets of a given Twitter account.

In [111]:
# This is the url you will scrape in this exercise 
# You will need to append the account name to this url
url = 'https://twitter.com/'

In [113]:
username = input('Please, input your username: ')
n_tweets = int(input('Input number of tweets to scrape: '))
html = requests.get(url + username).content
soup = BeautifulSoup(html, "lxml")

all_tweets = soup.find_all('div', {'class': 'tweet'})

if all_tweets:
    for tweet in all_tweets[:n_tweets]:
        name = tweet.find('span', {'class': 'FullNameGroup'}).find('strong')
        # named `handle` so the `username` read from input is not overwritten
        handle = tweet.find('span', {'class': 'username'})
        posted = tweet.find('small', {'class': 'time'})
        content = tweet.find('p', {'class': 'TweetTextSize TweetTextSize--normal js-tweet-text tweet-text'})
        statistics = tweet.find('div', {'class': 'ProfileTweet-actionCountList u-hiddenVisually'})

        print(f'\n{name.text} {handle.text} {posted.text.strip()}')
        print(content.text)
        print(statistics.text.strip().replace('\n', ' '))
else:
    print('Account name not found or tweet list is empty...')

Please, input your username: bolsonaro
Input number of tweets to scrape: 10

Renata Tedesque @bolsonaro 15 de jun de 2014
Retornando ao Twitter
6 respostas     34 retweets     29 curtiram

Renata Tedesque @bolsonaro 27 de set de 2009
Hoje a tarde vou assistir o jogo de São Paulo e Corinthians e, torcerei para empatarem. Sou Palmeirense...
2 respostas     34 retweets     17 curtiram

Renata Tedesque @bolsonaro 27 de set de 2009
Acordei hoje as 07:00, brinquei com Angelina (minha gatinha), em seguida pretendo jogar bola e fazer um churrasco em casa após o futebol.
1 resposta     33 retweets     13 curtiram

Renata Tedesque @bolsonaro 25 de jul de 2009
Organizei a casa e assisti lutas da UFC
0 resposta     26 retweets     10 curtiram

Renata Tedesque @bolsonaro 25 de jul de 2009
Hoje a manhã estava chuvosa e aproveitei para dormir bastante
0 resposta     70 retweets     18 curtiram


#### IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe

In [115]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [116]:
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")

# each movie sits in a <td class="titleColumn">
movies = soup.find_all('td', {'class': 'titleColumn'})
titles = [movie.find('a').text for movie in movies]
# [1:-1] drops the first and last character around the year
years = [movie.find('span').text[1:-1] for movie in movies]
# the link's title attribute lists the director first, then the stars;
# [:-7] cuts the 7-character suffix that follows the director's name
directors = [movie.find('a').get('title').split(',')[0][:-7] for movie in movies]
actors = [' & '.join(movie.find('a').get('title').split(',')[1:]) for movie in movies]

movies_dict = {'Title': titles, 'Release': years, 'Director': directors, 'Actors': actors}

movies_df = pd.DataFrame(movies_dict)
movies_df

| | Title | Release | Director | Actors |
|---|---|---|---|---|
| 0 | Um Sonho de Liberdade | 1994 | Frank Darabont | Tim Robbins & Morgan Freeman |
| 1 | O Poderoso Chefão | 1972 | Francis Ford Coppola | Marlon Brando & Al Pacino |
| 2 | O Poderoso Chefão II | 1974 | Francis Ford Coppola | Al Pacino & Robert De Niro |
| 3 | Batman: O Cavaleiro das Trevas | 2008 | Christopher Nolan | Christian Bale & Heath Ledger |
| 4 | 12 Homens e uma Sentença | 1957 | Sidney Lumet | Henry Fonda & Lee J. Cobb |
| 5 | A Lista de Schindler | 1993 | Steven Spielberg | Liam Neeson & Ralph Fiennes |
| 6 | O Senhor dos Anéis: O Retorno do Rei | 2003 | Peter Jackson | Elijah Wood & Viggo Mortensen |
| 7 | Pulp Fiction: Tempo de Violência | 1994 | Quentin Tarantino | John Travolta & Uma Thurman |
| 8 | Três Homens em Conflito | 1966 | Sergio Leone | Clint Eastwood & Eli Wallach |
| 9 | Clube da Luta | 1999 | David Fincher | Brad Pitt & Edward Norton |
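
Slicing a fixed number of characters off the director's name depends on the exact suffix length. A slightly more defensive variant is sketched below. It assumes the credits string has the form `"Director (dir.), Star One, Star Two"`, which is inferred from the slicing in the cell above rather than confirmed; `parse_credits` is a hypothetical helper.

```python
def parse_credits(title_attr):
    """Split a credits string into (director, list_of_actors)."""
    people = [name.strip() for name in title_attr.split(",")]
    director = people[0]
    suffix = " (dir.)"  # assumed marker that follows the director's name
    if director.endswith(suffix):
        director = director[:-len(suffix)]
    return director, people[1:]


# Assumed format of the link's title attribute
sample = "Frank Darabont (dir.), Tim Robbins, Morgan Freeman"
director, actors = parse_credits(sample)
print(director, "|", " & ".join(actors))
```

Stripping each name also removes the leading space that `split(',')` leaves in front of every actor.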


#### Movie name, year and a brief summary of 10 random movies from the IMDB top chart as a pandas dataframe.

In [117]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [118]:
from random import shuffle

n_random = 10

html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")
movies = soup.find_all('td', {'class': 'titleColumn'})

# shuffle in place, then keep the first n_random movies
shuffle(movies)
sample = movies[:n_random]

titles = [movie.find('a').text for movie in sample]
years = [movie.find('span').text[1:-1] for movie in sample]
links_to_movies = [movie.find('a').get('href') for movie in sample]

summary = []
for link in links_to_movies:
    # each movie page is fetched separately to read its summary
    movie_html = requests.get('https://www.imdb.com' + link).content
    movie_soup = BeautifulSoup(movie_html, "lxml")
    summary.append(movie_soup.find('div', {'class': 'summary_text'}).text.strip())

movies_dict = {'Title': titles, 'Release': years, 'Summary': summary}

movies_df = pd.DataFrame(movies_dict)
movies_df

| | Title | Release | Summary |
|---|---|---|---|
| 0 | Cidadão Kane | 1941 | Following the death of publishing tycoon, Char... |
| 1 | Barry Lyndon | 1975 | An Irish rogue wins the heart of a rich widow ... |
| 2 | Intriga Internacional | 1959 | A New York City advertising executive goes on ... |
| 3 | Se Meu Apartamento Falasse | 1960 | A man tries to rise in his company by letting ... |
| 4 | Amores Brutos | 2000 | A horrific car accident connects three stories... |
| 5 | Pulp Fiction: Tempo de Violência | 1994 | The lives of two mob hitmen, a boxer, a gangst... |
| 6 | Gandhi | 1982 | Gandhi's character is fully explained as a man... |
| 7 | A Outra História Americana | 1998 | A former neo-nazi skinhead tries to prevent hi... |
| 8 | Disque M para Matar | 1954 | A tennis player frames his unfaithful wife for... |
| 9 | A Bela e a Fera | 1991 | A prince cursed to spend his days as a hideous... |


#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

In [119]:
city = input('Enter the city: ').lower()
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'
weather_json = requests.get(url).json()

print("\n{}'s temperature: {}°C ".format(city.capitalize(), weather_json['main']['temp']))
print("Wind speed: {} m/s".format(weather_json['wind']['speed']))
print("Description: {}".format(weather_json['weather'][0]['description'].capitalize()))
print("Weather: {}".format(weather_json['weather'][0]['main'].capitalize()))


Enter the city: Belo Horizonte

Belo horizonte's temperature: 12.4°C 
Wind speed: 2.1 m/s
Description: Clear sky
Weather: Clear
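
City names with spaces or accents are safer to send when the query string is built with `urllib.parse.urlencode` rather than by concatenation. The sketch below only builds the URL and makes no request. The environment variable name `OPENWEATHER_API_KEY` and the helper `build_weather_url` are assumptions for illustration.

```python
import os
from urllib.parse import urlencode

API_ENDPOINT = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key):
    """Return the request URL for the current weather of city, in metric units."""
    # urlencode escapes spaces and accented characters in the city name
    query = urlencode({"q": city, "APPID": api_key, "units": "metric"})
    return f"{API_ENDPOINT}?{query}"


# OPENWEATHER_API_KEY is a hypothetical environment variable name
api_key = os.environ.get("OPENWEATHER_API_KEY", "YOUR_API_KEY")
print(build_weather_url("Belo Horizonte", api_key))
```

Reading the key from the environment also keeps it out of the notebook when the notebook is shared.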


#### Book name, price and stock availability as a pandas dataframe.

In [121]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [122]:
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")
# each book on the page is an <article class="product_pod">
books = soup.find_all('article', {'class': 'product_pod'})

titles = [book.find('h3').text for book in books]
prices = [book.find('p', {'class': 'price_color'}).text for book in books]
stock = [book.find('p', {'class': 'instock availability'}).text.strip() for book in books]

books_dict = {'Title': titles, 'Price': prices, 'Stock': stock}

books_df = pd.DataFrame(books_dict)
books_df

| | Title | Price | Stock |
|---|---|---|---|
| 0 | A Light in the ... | £51.77 | In stock |
| 1 | Tipping the Velvet | £53.74 | In stock |
| 2 | Soumission | £50.10 | In stock |
| 3 | Sharp Objects | £47.82 | In stock |
| 4 | Sapiens: A Brief History ... | £54.23 | In stock |
| 5 | The Requiem Red | £22.65 | In stock |
| 6 | The Dirty Little Secrets ... | £33.34 | In stock |
| 7 | The Coming Woman: A ... | £17.93 | In stock |
| 8 | The Boys in the ... | £22.60 | In stock |
| 9 | The Black Maria | £52.15 | In stock |
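
The prices are scraped as text, so they need converting before any arithmetic. A minimal sketch, using the first three prices from the table above; `parse_price` is a hypothetical helper.

```python
from decimal import Decimal


def parse_price(price_text):
    """Convert a price string such as '£51.77' into a Decimal."""
    # Keep digits and the decimal point; drop the currency symbol and stray characters
    digits = "".join(ch for ch in price_text if ch.isdigit() or ch == ".")
    return Decimal(digits)


prices = ["£51.77", "£53.74", "£50.10"]  # first three prices from the table above
parsed = [parse_price(p) for p in prices]
print(parsed, "total:", sum(parsed))
```

`Decimal` is used instead of `float` so that sums of currency amounts stay exact.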
