# Web Scraping Lab

This notebook contains web scraping exercises to help you practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You'll need to identify the HTML tags, special class names, etc. used in the HTML content you are expected to extract.
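
The first two tips can be combined into a small helper. The sketch below is only one possible way to do it: the function names `preview_response` and `fetch_preview` are hypothetical, and the 300-character preview length and 10-second timeout are arbitrary assumptions.

```python
import requests


def preview_response(response, limit=300):
    """Raise on a 4xx/5xx status, otherwise return the start of the body."""
    # raise_for_status() raises requests.HTTPError for error status codes
    response.raise_for_status()
    return response.text[:limit]


def fetch_preview(url, limit=300, timeout=10):
    """Download `url` and return a short preview of the response text."""
    response = requests.get(url, timeout=timeout)
    return preview_response(response, limit)
```

Calling `print(fetch_preview(url))` before parsing gives a quick look at what the server actually returned.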

**Resources**:
- [Requests library](https://requests.readthedocs.io/en/master/user/quickstart/)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to use additional libraries if you prefer.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re  # standard-library regular expressions (the third-party `regex` package is not needed here)

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
# your code here 
html = requests.get(url)
html

<Response [200]>

In [4]:
html = requests.get(url).content

In [5]:
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly for consistent results
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-eca8e21af2622cbcba2c93c67f79baed.css" integrity="sha512-7KjiGvJiLLy6LJPGf3m67ejAdgQsgDdnxZYoaI6+Agd0ZxHKTCjoKZgaf3PgUjURCcVceAwySJJJWgitRskDiA==" media="all" rel="stylesheet"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/behaviors-50b589bc080b2bf76c15ba545f14721f.css" integrity="sha512-ULWJvAgLK/dsFbpUXxRyH2pEmS1rgKA7HlRJmPXRLgZiGsjgw5V4oI2LLdc8wrWX9v2+WuYL4SsOfWH/GE02Tw==" media="all" rel="stylesheet"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-8ab03e53e96fe1c31f30d0cf406

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
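
As a rough illustration of step 3, the sketch below cleans strings shaped like the text of such elements. The sample strings and the helper name `collapse_whitespace` are assumptions made for the example. Unlike the expected output above, it keeps a single space between words.

```python
# Sample strings shaped like the `.text` of the name elements (illustrative data)
raw_texts = [
    "\n\n            Nuno Maduro\n ",
    "\n\n            Matthias\n ",
    "\n\n            Nicolas P. Rougier\n ",
]


def collapse_whitespace(text):
    """Trim both ends and squeeze every internal run of whitespace to one space."""
    # str.split() with no argument splits on any whitespace, including "\n"
    return " ".join(text.split())


cleaned = [collapse_whitespace(text) for text in raw_texts]
print(cleaned)  # ['Nuno Maduro', 'Matthias', 'Nicolas P. Rougier']
```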

In [6]:
# your code here
# The developer name sits fairly deep in the HTML.
# Each trending developer is wrapped in an article: <article class="Box-row d-flex">
# Inside each article, a div holds the developer's name and handle: <div class="col-md-6">
# Inside that div, the developer's name is in: <h1 class="h3 lh-condensed">
# Also inside that div, the developer's handle ('nickname') is in: <p class="f4 text-normal mb-1">

In [7]:
name = soup.find_all('h1',attrs={'class': "h3 lh-condensed"})
name[0:2]

[<h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":5457236,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="9509acee7651792875849fa87f42bb0e5630cbc8c904470600303bd952b70b19" href="/nunomaduro">
             Nuno Maduro
 </a> </h1>,
 <h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":175809,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="6e3d924441c9e83c84efcf741016c495218f35195ce5b7d6ff562c206ccb6f0e" href="/mre">
             Matthias
 </a> </h1>]

In [8]:
print(type(name[0]))
print(type(name))

<class 'bs4.element.Tag'>
<class 'bs4.element.ResultSet'>


In [9]:
# To keep only the name, you could delete everything between a < and a > (including the < and >).
# You would need to do this for every element in the list `name`.
# After that, strip the whitespace before and after each name.
# To be able to do that, you need a list in which each element is a string.

In [10]:
name = list(name)
print('name has type:', type(name))

str_name = []

for i in name:
    i = i.text
    str_name.append(i)

print(str_name) #ha, all the things between < > already disappeared. Hooray!
print(type(str_name[0]))

name has type: <class 'list'>
['\n\n            Nuno Maduro\n ', '\n\n            Matthias\n ', '\n\n            Evan Wallace\n ', '\n\n            Leo Farias\n ', '\n\n            XAMPPRocky\n ', '\n\n            Florian Roth\n ', '\n\n            Jason Quense\n ', '\n\n            Alex Potsides\n ', '\n\n            John Lindquist\n ', '\n\n            Stefan Prodan\n ', '\n\n            thomas chaton\n ', '\n\n            Nicolas P. Rougier\n ', '\n\n            Aaron Stannard\n ', '\n\n            Klaus Post\n ', '\n\n            Ben Frederickson\n ', '\n\n            Nathan Shively-Sanders\n ', '\n\n            berstend̡̲̫̹̠̖͚͓̔̄̓̐̄͛̀͘\n ', '\n\n            Robert Mosolgo\n ', '\n\n            Tristan Edwards\n ', '\n\n            Fernand Galiana\n ', '\n\n            Wojciech Maj\n ', '\n\n            Raine Revere\n ', '\n\n            Daishi Kato\n ', '\n\n            Phil Ewels\n ', '\n\n            Anton Kosyakov\n ']
<class 'str'>


In [11]:
#delete the \n and white spaces per item
clean_names = []
for i in str_name:
    clean_names.append(re.sub("\n", "", i).strip())
clean_names

['Nuno Maduro',
 'Matthias',
 'Evan Wallace',
 'Leo Farias',
 'XAMPPRocky',
 'Florian Roth',
 'Jason Quense',
 'Alex Potsides',
 'John Lindquist',
 'Stefan Prodan',
 'thomas chaton',
 'Nicolas P. Rougier',
 'Aaron Stannard',
 'Klaus Post',
 'Ben Frederickson',
 'Nathan Shively-Sanders',
 'berstend̡̲̫̹̠̖͚͓̔̄̓̐̄͛̀͘',
 'Robert Mosolgo',
 'Tristan Edwards',
 'Fernand Galiana',
 'Wojciech Maj',
 'Raine Revere',
 'Daishi Kato',
 'Phil Ewels',
 'Anton Kosyakov']

In [12]:
#missing the nicknames; will try again.
url = 'https://github.com/trending/developers'
html_2= requests.get(url)
print(html_2)

<Response [200]>


In [13]:
html_2 = requests.get(url).content
soup = BeautifulSoup(html_2, 'html.parser')  # name the parser explicitly for consistent results
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-eca8e21af2622cbcba2c93c67f79baed.css" integrity="sha512-7KjiGvJiLLy6LJPGf3m67ejAdgQsgDdnxZYoaI6+Agd0ZxHKTCjoKZgaf3PgUjURCcVceAwySJJJWgitRskDiA==" media="all" rel="stylesheet"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/behaviors-50b589bc080b2bf76c15ba545f14721f.css" integrity="sha512-ULWJvAgLK/dsFbpUXxRyH2pEmS1rgKA7HlRJmPXRLgZiGsjgw5V4oI2LLdc8wrWX9v2+WuYL4SsOfWH/GE02Tw==" media="all" rel="stylesheet"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-8ab03e53e96fe1c31f30d0cf406

In [14]:
# Select the name headings and the handle paragraphs together; matches come back in document order
name_2 = soup.select('h1.h3.lh-condensed, p.f4.text-normal.mb-1')
print(name_2[0:5])

[<h1 class="h3 lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":5457236,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="9509acee7651792875849fa87f42bb0e5630cbc8c904470600303bd952b70b19" href="/nunomaduro">
            Nuno Maduro
</a> </h1>, <p class="f4 text-normal mb-1">
<a class="Link--secondary" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":5457236,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="9509acee7651792875849fa87f42bb0e5630cbc8c904470600303bd952b70b19" href="/nunomaduro">
              nunomaduro
</a> </p>, <h1 class="h3 lh-condensed">
<a data-hyd

In [15]:
name_2 = list(name_2)
print('name_2 has type:', type(name_2))

str_name_2 = []

for i in name_2:
    i = i.text
    str_name_2.append(i)
print(str_name_2[0:13])

name_2 has type: <class 'list'>
['\n\n            Nuno Maduro\n ', '\n\n              nunomaduro\n ', '\n\n            Matthias\n ', '\n\n              mre\n ', '\n\n            Evan Wallace\n ', '\n\n              evanw\n ', '\n\n            Leo Farias\n ', '\n\n              leoafarias\n ', '\n\n            XAMPPRocky\n ', '\n\n            Florian Roth\n ', '\n\n              Neo23x0\n ', '\n\n            Jason Quense\n ', '\n\n              jquense\n ']


In [16]:
clean_names_2 = []
for i in str_name_2:
    clean_names_2.append(re.sub("\n", "", i).strip())
clean_names_2

['Nuno Maduro',
 'nunomaduro',
 'Matthias',
 'mre',
 'Evan Wallace',
 'evanw',
 'Leo Farias',
 'leoafarias',
 'XAMPPRocky',
 'Florian Roth',
 'Neo23x0',
 'Jason Quense',
 'jquense',
 'Alex Potsides',
 'achingbrain',
 'John Lindquist',
 'johnlindquist',
 'Stefan Prodan',
 'stefanprodan',
 'thomas chaton',
 'tchaton',
 'Nicolas P. Rougier',
 'rougier',
 'Aaron Stannard',
 'Aaronontheweb',
 'Klaus Post',
 'klauspost',
 'Ben Frederickson',
 'benfred',
 'Nathan Shively-Sanders',
 'sandersn',
 'berstend̡̲̫̹̠̖͚͓̔̄̓̐̄͛̀͘',
 'berstend',
 'Robert Mosolgo',
 'rmosolgo',
 'Tristan Edwards',
 't4t5',
 'Fernand Galiana',
 'derailed',
 'Wojciech Maj',
 'wojtekmaj',
 'Raine Revere',
 'raineorshine',
 'Daishi Kato',
 'dai-shi',
 'Phil Ewels',
 'ewels',
 'Anton Kosyakov',
 'akosyakov']
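
The flat list above interleaves names and handles, and an entry without a separate handle (such as `XAMPPRocky`) shifts the pairing. One possible way around this is to work article by article. The sketch below assumes the tag and class structure noted earlier and uses a small hand-written HTML sample instead of the live page. The function name `developer_names` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the structure seen on the page (assumption)
SAMPLE_HTML = """
<article class="Box-row d-flex">
  <div class="col-md-6">
    <h1 class="h3 lh-condensed"><a href="/nunomaduro">
        Nuno Maduro
    </a></h1>
    <p class="f4 text-normal mb-1"><a href="/nunomaduro">
        nunomaduro
    </a></p>
  </div>
</article>
<article class="Box-row d-flex">
  <div class="col-md-6">
    <h1 class="h3 lh-condensed"><a href="/XAMPPRocky">
        XAMPPRocky
    </a></h1>
  </div>
</article>
"""


def developer_names(html):
    """Return 'handle (Name)' per developer, or just the heading text if no handle exists."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for article in soup.select("article.Box-row"):
        heading = article.select_one("h1.h3.lh-condensed")
        if heading is None:
            continue  # not a developer entry
        name = heading.get_text(strip=True)
        handle_tag = article.select_one("p.f4.text-normal.mb-1")
        if handle_tag is not None:
            names.append(f"{handle_tag.get_text(strip=True)} ({name})")
        else:
            names.append(name)
    return names


print(developer_names(SAMPLE_HTML))  # ['nunomaduro (Nuno Maduro)', 'XAMPPRocky']
```

Looking up both elements inside the same `article` keeps each handle attached to its own developer, even when some entries have only one of the two.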

## Bonus Challenges

Below, you'll find a number of bonus challenges. Feel free to do as many as you like. These challenges are useful if you want to bolster your web scraping skills, but are not a requirement. 

#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [17]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [18]:
# your code here


#### Display all the image links from the Walt Disney Wikipedia page.

In [19]:
# This is the url you will scrape in this exercise
url_Disney = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [20]:
# your code here
html_Disney = requests.get(url_Disney)
#tag and class of images: <a href="/wiki/File:Walt_Disney_1946.JPG" class="image">
print(html_Disney)

html_Disney = requests.get(url_Disney).content
html_Disney[0:80]

<Response [200]>


b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta char'

In [21]:
#make the soup
soup_Disney = BeautifulSoup(html_Disney, 'html.parser')  # name the parser explicitly for consistent results
print(soup_Disney.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Walt Disney - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"24272c2b-1c61-41dd-840b-d7f6e149ee7c","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Walt_Disney","wgTitle":"Walt Disney","wgCurRevisionId":1018580368,"wgRevisionId":1018580368,"wgArticleId":32917,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages containing links to subscription-only content","Articles with short description","Short description is different from Wikidata","Wikipedia extended-con

In [22]:
links_Disney = []
for link in soup_Disney.find_all('a', attrs={'class': "image"}):
    links_Disney.append(link.get('href'))
links_Disney

['/wiki/File:Walt_Disney_1946.JPG',
 '/wiki/File:Walt_Disney_1942_signature.svg',
 '/wiki/File:Walt_Disney_envelope_ca._1921.jpg',
 '/wiki/File:Trolley_Troubles_poster.jpg',
 '/wiki/File:Steamboat-willie.jpg',
 '/wiki/File:Walt_Disney_1935.jpg',
 '/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg',
 '/wiki/File:Disney_drawing_goofy.jpg',
 '/wiki/File:DisneySchiphol1951.jpg',
 '/wiki/File:WaltDisneyplansDisneylandDec1954.jpg',
 '/wiki/File:Walt_disney_portrait_right.jpg',
 '/wiki/File:Walt_Disney_Grave.JPG',
 '/wiki/File:Roy_O._Disney_with_Company_at_Press_Conference.jpg',
 '/wiki/File:Disney_Display_Case.JPG',
 '/wiki/File:Disney1968.jpg',
 '/wiki/File:Disneyland_Resort_logo.svg',
 '/wiki/File:Animation_disc.svg',
 '/wiki/File:P_vip.svg',
 '/wiki/File:Magic_Kingdom_castle.jpg',
 '/wiki/File:Video-x-generic.svg',
 '/wiki/File:Flag_of_Los_Angeles_County,_California.svg',
 '/wiki/File:Blank_television_set.svg',
 '/wiki/File:Flag_of_the_United_States.svg']

In [23]:
# drop the SVG (scalable vector) files and keep only the JPG images

pict_Disney = []

for i in links_Disney:
    if '.svg' not in i:
        pict_Disney.append(i)
pict_Disney
        


['/wiki/File:Walt_Disney_1946.JPG',
 '/wiki/File:Walt_Disney_envelope_ca._1921.jpg',
 '/wiki/File:Trolley_Troubles_poster.jpg',
 '/wiki/File:Steamboat-willie.jpg',
 '/wiki/File:Walt_Disney_1935.jpg',
 '/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg',
 '/wiki/File:Disney_drawing_goofy.jpg',
 '/wiki/File:DisneySchiphol1951.jpg',
 '/wiki/File:WaltDisneyplansDisneylandDec1954.jpg',
 '/wiki/File:Walt_disney_portrait_right.jpg',
 '/wiki/File:Walt_Disney_Grave.JPG',
 '/wiki/File:Roy_O._Disney_with_Company_at_Press_Conference.jpg',
 '/wiki/File:Disney_Display_Case.JPG',
 '/wiki/File:Disney1968.jpg',
 '/wiki/File:Magic_Kingdom_castle.jpg']
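
The hrefs collected above are relative to the site root, so they cannot be opened directly. A minimal sketch of turning them into absolute links with `urllib.parse.urljoin` follows. The helper name `absolute_jpeg_links` is hypothetical, and the three sample hrefs are copied from the list above.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/Walt_Disney"

# A few of the relative links found on the page
hrefs = [
    "/wiki/File:Walt_Disney_1946.JPG",
    "/wiki/File:Walt_Disney_1942_signature.svg",
    "/wiki/File:Steamboat-willie.jpg",
]


def absolute_jpeg_links(page_url, hrefs):
    """Keep only .jpg/.jpeg links (case-insensitive) and resolve them against page_url."""
    wanted = (".jpg", ".jpeg")
    return [urljoin(page_url, h) for h in hrefs if h.lower().endswith(wanted)]


links = absolute_jpeg_links(PAGE_URL, hrefs)
print(links)
```

Checking the file extension with `endswith` is a little stricter than testing whether `'.svg'` appears anywhere in the string, and lowercasing first handles both `.JPG` and `.jpg`.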

#### Retrieve the Wikipedia page for "Python" and create a list of the links on that page.

In [24]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python'

In [25]:
# your code here

#### Find the number of titles that have changed in the United States Code since its last release point.

In [26]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [27]:
# your code here

#### Create a Python list with the names on the FBI's Ten Most Wanted list.

In [28]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [29]:
# your code here

#### Display the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe.

In [30]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [31]:
# your code here
html = requests.get(url)
print(html)
html = requests.get(url).content
print(html[0:500])

<Response [200]>
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xml:lang="en" lang="en">\r\n<head><meta name="google-site-verification" content="srFzNKBTd0FbRhtnzP--Tjxl01NfbscjYwkp4yOWuQY" /><meta name="msvalidate.01" content="BCAA3C04C41AE6E6AFAF117B9469C66F" /><meta name="y_key" content="43b36314ccb77957" /><!-- 5-Clk8f50tFFdPTU97Bw7ygWE1A -->\r\n<meta http-equ'


In [32]:
#make the soup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
  <meta content="srFzNKBTd0FbRhtnzP--Tjxl01NfbscjYwkp4yOWuQY" name="google-site-verification"/>
  <meta content="BCAA3C04C41AE6E6AFAF117B9469C66F" name="msvalidate.01"/>
  <meta content="43b36314ccb77957" name="y_key"/>
  <!-- 5-Clk8f50tFFdPTU97Bw7ygWE1A -->
  <meta content="en" http-equiv="Content-Language"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="all" name="robots"/>
  <meta content="earthquake,earthquakes,last earthquake,earthquake today,earthquakes today,earth quake,earth quakes,real time seismicity,seismic,seismicity,seismicity map,seismology,sismologie,EMSC,CSEM,seismicity on google earth,sumatra,tsunami,tsunamis,map,maps,richter,mercalli,moment tensors,epicenter,magnitude,seismology,foreshock,aftersho

In [33]:
table = soup.find_all('table')[3]
table

<table border="0" cellpadding="0" cellspacing="0" width="100%"><thead>
<tr>
<td colspan="13" style="background: white;height: 1px"></td>
</tr>
<tr id="haut_tableau"><th class="th2 th3" colspan="3" style="width:90px;display:table-csell;"><div onmouseout="info_b('notshow','','');" onmouseover="info_b('show','Represents the results of information provided by users (Felt earthquake, pictures, testimonies ,...)&lt;br&gt;See an intensity map for more details on the macroseismic intensity scale.','Citizen response');">Citizen<br/>Response</div><br/><table cellpadding="0" cellspacing="0" style="width:100%;margin:0;text-align:center;"><tr><td><div onclick="change_tri('im_report');" onmouseout="change_image(this,'out','im_report');" onmouseover="change_image(this,'over','im_report');"><span class="spriteorig sp_ico_list" onmouseout="info_b2('notshow','');" onmouseover="info_b2('show','Sorted by number of &lt;b&gt;Comments&lt;/b&gt;');"></span><span class="spriteorig sp_s_asc" id="im_report" styl

In [34]:
rows = table.find_all('tr')
rows[7]

<tr class="ligne1 normal" id="977481" onclick="go_details(event,977481);"><td class="tabev0"></td><td class="tabev0"></td><td class="tabev0"></td><td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=977481">2021-04-29   18:34:25.0</a></b><i class="ago" id="ago2">24min ago</i></td><td class="tabev1">30.04 </td><td class="tabev2">S  </td><td class="tabev1">72.09 </td><td class="tabev2">W  </td><td class="tabev3">35</td><td class="tabev5" id="magtyp2">ML</td><td class="tabev2">3.5</td><td class="tb_region" id="reg2"> OFFSHORE COQUIMBO, CHILE</td><td class="comment updatetimeno" id="upd2" style="text-align:right;">2021-04-29 18:50</td></tr>

In [35]:
rows[7].text.strip()

'earthquake2021-04-29\xa0\xa0\xa018:34:25.024min ago30.04\xa0S\xa0\xa072.09\xa0W\xa0\xa035ML3.5\xa0OFFSHORE COQUIMBO, CHILE2021-04-29 18:50'

In [36]:
rows[7].text.strip().split('\xa0') #is there a better 'character' to split on?!

['earthquake2021-04-29',
 '',
 '',
 '18:34:25.024min ago30.04',
 'S',
 '',
 '72.09',
 'W',
 '',
 '35ML3.5',
 'OFFSHORE COQUIMBO, CHILE2021-04-29 18:50']
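
An alternative to splitting the row text on `'\xa0'` is to read each `<td>` cell separately, so that neighbouring fields never get fused. The sketch below assumes the cell classes visible in the row printed above (`tabev6`, `tabev1`, `tabev2`, `tb_region`). It runs on a hand-copied version of that row, with the non-breaking spaces reconstructed from the text output. The function name `parse_row` is hypothetical.

```python
from bs4 import BeautifulSoup

# One table row, reconstructed from the output shown above (illustrative copy)
ROW_HTML = (
    '<table><tr class="ligne1 normal" id="977481">'
    '<td class="tabev0"></td><td class="tabev0"></td><td class="tabev0"></td>'
    '<td class="tabev6"><b><i style="display:none;">earthquake</i>'
    '<a href="/Earthquake/earthquake.php?id=977481">'
    '2021-04-29\xa0\xa0\xa018:34:25.0</a></b>'
    '<i class="ago" id="ago2">24min ago</i></td>'
    '<td class="tabev1">30.04\xa0</td><td class="tabev2">S\xa0\xa0</td>'
    '<td class="tabev1">72.09\xa0</td><td class="tabev2">W\xa0\xa0</td>'
    '<td class="tabev3">35</td><td class="tabev5" id="magtyp2">ML</td>'
    '<td class="tabev2">3.5</td>'
    '<td class="tb_region" id="reg2">\xa0OFFSHORE COQUIMBO, CHILE</td>'
    '<td class="comment updatetimeno" id="upd2">2021-04-29 18:50</td>'
    '</tr></table>'
)


def parse_row(tr):
    """Extract date, time, latitude, longitude and region from one <tr>."""
    # The link text holds "date<nbsp><nbsp><nbsp>time"; split() handles non-breaking spaces
    date, time = tr.select_one("td.tabev6 a").get_text().split()
    degrees = [td.get_text(strip=True) for td in tr.select("td.tabev1")]
    # The first two tabev2 cells are the hemisphere letters; a third one holds the magnitude
    letters = [td.get_text(strip=True) for td in tr.select("td.tabev2")]
    return {
        "date": date,
        "time": time,
        "latitude": f"{degrees[0]} {letters[0]}",
        "longitude": f"{degrees[1]} {letters[1]}",
        "region": tr.select_one("td.tb_region").get_text(strip=True),
    }


row = BeautifulSoup(ROW_HTML, "html.parser").find("tr")
record = parse_row(row)
print(record)
```

A list of such dictionaries, one per data row, can be passed straight to `pd.DataFrame`.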

In [37]:
rows = [row.text.strip().split('\xa0') for row in rows]
#print(rows)

In [38]:
data = rows[1:]
df = pd.DataFrame(data)
df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,CitizenResponseDate & Time UTCLatitude degrees...,,,,,,,,,,
1,,,,,,,,,,,
2,,,,,,,,,,,
3,12345678910»,,,,,,,,,,
4,3IIIearthquake2021-04-29,,,18:52:14.706min ago36.91,N,,22.15,E,,2ML3.0,SOUTHERN GREECE2021-04-29 18:55
5,earthquake2021-04-29,,,18:47:55.511min ago46.15,N,,7.49,E,,11ML0.9,SWITZERLAND2021-04-29 18:55
6,earthquake2021-04-29,,,18:34:25.024min ago30.04,S,,72.09,W,,35ML3.5,"OFFSHORE COQUIMBO, CHILE2021-04-29 18:50"
7,earthquake2021-04-29,,,18:18:05.040min ago19.25,N,,155.41,W,,32Ml2.1,"ISLAND OF HAWAII, HAWAII2021-04-29 18:23"
8,earthquake2021-04-29,,,18:15:29.043min ago2.11,N,,126.69,E,,10 M3.8,MOLUCCA SEA2021-04-29 18:25
9,earthquake2021-04-29,,,18:12:12.746min ago19.19,N,,155.47,W,,35Md2.2,"ISLAND OF HAWAII, HAWAII2021-04-29 18:15"


In [39]:
df = df.drop([0,1,2,3], axis = 0)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
4,3IIIearthquake2021-04-29,,,18:52:14.706min ago36.91,N,,22.15,E,,2ML3.0,SOUTHERN GREECE2021-04-29 18:55
5,earthquake2021-04-29,,,18:47:55.511min ago46.15,N,,7.49,E,,11ML0.9,SWITZERLAND2021-04-29 18:55
6,earthquake2021-04-29,,,18:34:25.024min ago30.04,S,,72.09,W,,35ML3.5,"OFFSHORE COQUIMBO, CHILE2021-04-29 18:50"
7,earthquake2021-04-29,,,18:18:05.040min ago19.25,N,,155.41,W,,32Ml2.1,"ISLAND OF HAWAII, HAWAII2021-04-29 18:23"
8,earthquake2021-04-29,,,18:15:29.043min ago2.11,N,,126.69,E,,10 M3.8,MOLUCCA SEA2021-04-29 18:25


In [40]:
df = df.drop([1,2,5,8], axis=1)
df.head()

Unnamed: 0,0,3,4,6,7,9,10
4,3IIIearthquake2021-04-29,18:52:14.706min ago36.91,N,22.15,E,2ML3.0,SOUTHERN GREECE2021-04-29 18:55
5,earthquake2021-04-29,18:47:55.511min ago46.15,N,7.49,E,11ML0.9,SWITZERLAND2021-04-29 18:55
6,earthquake2021-04-29,18:34:25.024min ago30.04,S,72.09,W,35ML3.5,"OFFSHORE COQUIMBO, CHILE2021-04-29 18:50"
7,earthquake2021-04-29,18:18:05.040min ago19.25,N,155.41,W,32Ml2.1,"ISLAND OF HAWAII, HAWAII2021-04-29 18:23"
8,earthquake2021-04-29,18:15:29.043min ago2.11,N,126.69,E,10 M3.8,MOLUCCA SEA2021-04-29 18:25


In [41]:
df.columns=["Date","time_ago_degrees", "Latitude", "degrees lon", "Longitude", "Depth_kmMag",  "Region_name_Last_update"]
df.head(10)

Unnamed: 0,Date,time_ago_degrees,Latitude,degrees lon,Longitude,Depth_kmMag,Region_name_Last_update
4,3IIIearthquake2021-04-29,18:52:14.706min ago36.91,N,22.15,E,2ML3.0,SOUTHERN GREECE2021-04-29 18:55
5,earthquake2021-04-29,18:47:55.511min ago46.15,N,7.49,E,11ML0.9,SWITZERLAND2021-04-29 18:55
6,earthquake2021-04-29,18:34:25.024min ago30.04,S,72.09,W,35ML3.5,"OFFSHORE COQUIMBO, CHILE2021-04-29 18:50"
7,earthquake2021-04-29,18:18:05.040min ago19.25,N,155.41,W,32Ml2.1,"ISLAND OF HAWAII, HAWAII2021-04-29 18:23"
8,earthquake2021-04-29,18:15:29.043min ago2.11,N,126.69,E,10 M3.8,MOLUCCA SEA2021-04-29 18:25
9,earthquake2021-04-29,18:12:12.746min ago19.19,N,155.47,W,35Md2.2,"ISLAND OF HAWAII, HAWAII2021-04-29 18:15"
10,earthquake2021-04-29,18:08:49.150min ago19.20,N,155.46,W,35Md2.0,"ISLAND OF HAWAII, HAWAII2021-04-29 18:12"
11,earthquake2021-04-29,17:55:53.91hr 03min ago35.49,N,3.61,W,14ML1.9,STRAIT OF GIBRALTAR2021-04-29 18:05
12,earthquake2021-04-29,17:47:17.11hr 11min ago38.98,N,29.84,E,7ML2.0,WESTERN TURKEY2021-04-29 18:38
13,earthquake2021-04-29,17:40:36.01hr 18min ago21.61,N,121.62,E,15 M3.7,TAIWAN REGION2021-04-29 17:55


In [42]:
#remove the word "earthquake" in column "date"
#split time and ago and degrees. (delete the ago part) 
# method could be: first remove the hour:minute:time part by saying something like "take everything up to ."
#secondly from what remains take everything before 'min ago'. 
#third step: take any numbers and dots (not whitespaces or letters).

#split column depth km mag
#method could be: split on M (and strip)
#remove the L 

#clean region name Last update (delete last update)
#method could be: split on 2021 and delete the numbers-column
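
The plan for the fused time column can also be carried out in one step with a regular expression. The sketch below assumes that the clock time always has exactly one decimal digit, as in the raw row shown earlier (`18:34:25.0`). The pattern name `TIME_AGO_LAT` and the helper `split_time_field` are hypothetical.

```python
import re

TIME_AGO_LAT = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d)"  # clock time with one decimal digit (assumption)
    r"(?P<ago>.*ago)"                    # e.g. "06min ago" or "1hr 03min ago"
    r"(?P<degrees_lat>\d+\.\d+)$"        # latitude in degrees
)


def split_time_field(value):
    """Return a dict with time, ago and degrees_lat, or None if the value does not match."""
    match = TIME_AGO_LAT.match(value)
    return match.groupdict() if match else None


print(split_time_field("18:52:14.706min ago36.91"))
print(split_time_field("17:55:53.91hr 03min ago35.49"))
```

The same pattern could be handed to the pandas `Series.str.extract` method, which turns the named groups into columns.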

In [43]:
df[['Region', 'rubbish']] = df.Region_name_Last_update.str.split('2021', expand=True)
df.head()

Unnamed: 0,Date,time_ago_degrees,Latitude,degrees lon,Longitude,Depth_kmMag,Region_name_Last_update,Region,rubbish
4,3IIIearthquake2021-04-29,18:52:14.706min ago36.91,N,22.15,E,2ML3.0,SOUTHERN GREECE2021-04-29 18:55,SOUTHERN GREECE,-04-29 18:55
5,earthquake2021-04-29,18:47:55.511min ago46.15,N,7.49,E,11ML0.9,SWITZERLAND2021-04-29 18:55,SWITZERLAND,-04-29 18:55
6,earthquake2021-04-29,18:34:25.024min ago30.04,S,72.09,W,35ML3.5,"OFFSHORE COQUIMBO, CHILE2021-04-29 18:50","OFFSHORE COQUIMBO, CHILE",-04-29 18:50
7,earthquake2021-04-29,18:18:05.040min ago19.25,N,155.41,W,32Ml2.1,"ISLAND OF HAWAII, HAWAII2021-04-29 18:23","ISLAND OF HAWAII, HAWAII",-04-29 18:23
8,earthquake2021-04-29,18:15:29.043min ago2.11,N,126.69,E,10 M3.8,MOLUCCA SEA2021-04-29 18:25,MOLUCCA SEA,-04-29 18:25


In [44]:
df = df.drop('rubbish', axis=1)
df.head()

Unnamed: 0,Date,time_ago_degrees,Latitude,degrees lon,Longitude,Depth_kmMag,Region_name_Last_update,Region
4,3IIIearthquake2021-04-29,18:52:14.706min ago36.91,N,22.15,E,2ML3.0,SOUTHERN GREECE2021-04-29 18:55,SOUTHERN GREECE
5,earthquake2021-04-29,18:47:55.511min ago46.15,N,7.49,E,11ML0.9,SWITZERLAND2021-04-29 18:55,SWITZERLAND
6,earthquake2021-04-29,18:34:25.024min ago30.04,S,72.09,W,35ML3.5,"OFFSHORE COQUIMBO, CHILE2021-04-29 18:50","OFFSHORE COQUIMBO, CHILE"
7,earthquake2021-04-29,18:18:05.040min ago19.25,N,155.41,W,32Ml2.1,"ISLAND OF HAWAII, HAWAII2021-04-29 18:23","ISLAND OF HAWAII, HAWAII"
8,earthquake2021-04-29,18:15:29.043min ago2.11,N,126.69,E,10 M3.8,MOLUCCA SEA2021-04-29 18:25,MOLUCCA SEA


In [45]:
df = df.drop('Region_name_Last_update', axis = 1)
df.head()

Unnamed: 0,Date,time_ago_degrees,Latitude,degrees lon,Longitude,Depth_kmMag,Region
4,3IIIearthquake2021-04-29,18:52:14.706min ago36.91,N,22.15,E,2ML3.0,SOUTHERN GREECE
5,earthquake2021-04-29,18:47:55.511min ago46.15,N,7.49,E,11ML0.9,SWITZERLAND
6,earthquake2021-04-29,18:34:25.024min ago30.04,S,72.09,W,35ML3.5,"OFFSHORE COQUIMBO, CHILE"
7,earthquake2021-04-29,18:18:05.040min ago19.25,N,155.41,W,32Ml2.1,"ISLAND OF HAWAII, HAWAII"
8,earthquake2021-04-29,18:15:29.043min ago2.11,N,126.69,E,10 M3.8,MOLUCCA SEA


In [46]:
df[['rubbish', 'Date']] = df.Date.str.split('earthquake', expand=True)
df.head()

Unnamed: 0,Date,time_ago_degrees,Latitude,degrees lon,Longitude,Depth_kmMag,Region,rubbish
4,2021-04-29,18:52:14.706min ago36.91,N,22.15,E,2ML3.0,SOUTHERN GREECE,3III
5,2021-04-29,18:47:55.511min ago46.15,N,7.49,E,11ML0.9,SWITZERLAND,
6,2021-04-29,18:34:25.024min ago30.04,S,72.09,W,35ML3.5,"OFFSHORE COQUIMBO, CHILE",
7,2021-04-29,18:18:05.040min ago19.25,N,155.41,W,32Ml2.1,"ISLAND OF HAWAII, HAWAII",
8,2021-04-29,18:15:29.043min ago2.11,N,126.69,E,10 M3.8,MOLUCCA SEA,


In [47]:
df = df.drop('rubbish', axis=1)

In [48]:
df.head()

Unnamed: 0,Date,time_ago_degrees,Latitude,degrees lon,Longitude,Depth_kmMag,Region
4,2021-04-29,18:52:14.706min ago36.91,N,22.15,E,2ML3.0,SOUTHERN GREECE
5,2021-04-29,18:47:55.511min ago46.15,N,7.49,E,11ML0.9,SWITZERLAND
6,2021-04-29,18:34:25.024min ago30.04,S,72.09,W,35ML3.5,"OFFSHORE COQUIMBO, CHILE"
7,2021-04-29,18:18:05.040min ago19.25,N,155.41,W,32Ml2.1,"ISLAND OF HAWAII, HAWAII"
8,2021-04-29,18:15:29.043min ago2.11,N,126.69,E,10 M3.8,MOLUCCA SEA


In [49]:
df[['time_ago', 'degrees_lat']] = df.time_ago_degrees.str.split('min ago', expand = True)
df.head()

Unnamed: 0,Date,time_ago_degrees,Latitude,degrees lon,Longitude,Depth_kmMag,Region,time_ago,degrees_lat
4,2021-04-29,18:52:14.706min ago36.91,N,22.15,E,2ML3.0,SOUTHERN GREECE,18:52:14.706,36.91
5,2021-04-29,18:47:55.511min ago46.15,N,7.49,E,11ML0.9,SWITZERLAND,18:47:55.511,46.15
6,2021-04-29,18:34:25.024min ago30.04,S,72.09,W,35ML3.5,"OFFSHORE COQUIMBO, CHILE",18:34:25.024,30.04
7,2021-04-29,18:18:05.040min ago19.25,N,155.41,W,32Ml2.1,"ISLAND OF HAWAII, HAWAII",18:18:05.040,19.25
8,2021-04-29,18:15:29.043min ago2.11,N,126.69,E,10 M3.8,MOLUCCA SEA,18:15:29.043,2.11


In [50]:
df[['time_ago', 'rubbish']] = df.time_ago.str.split('.', expand = True)
df.head()

Unnamed: 0,Date,time_ago_degrees,Latitude,degrees lon,Longitude,Depth_kmMag,Region,time_ago,degrees_lat,rubbish
4,2021-04-29,18:52:14.706min ago36.91,N,22.15,E,2ML3.0,SOUTHERN GREECE,18:52:14,36.91,706
5,2021-04-29,18:47:55.511min ago46.15,N,7.49,E,11ML0.9,SWITZERLAND,18:47:55,46.15,511
6,2021-04-29,18:34:25.024min ago30.04,S,72.09,W,35ML3.5,"OFFSHORE COQUIMBO, CHILE",18:34:25,30.04,24
7,2021-04-29,18:18:05.040min ago19.25,N,155.41,W,32Ml2.1,"ISLAND OF HAWAII, HAWAII",18:18:05,19.25,40
8,2021-04-29,18:15:29.043min ago2.11,N,126.69,E,10 M3.8,MOLUCCA SEA,18:15:29,2.11,43


In [51]:
df = df.drop(['rubbish', "time_ago_degrees"], axis = 1)
df.head()

Unnamed: 0,Date,Latitude,degrees lon,Longitude,Depth_kmMag,Region,time_ago,degrees_lat
4,2021-04-29,N,22.15,E,2ML3.0,SOUTHERN GREECE,18:52:14,36.91
5,2021-04-29,N,7.49,E,11ML0.9,SWITZERLAND,18:47:55,46.15
6,2021-04-29,S,72.09,W,35ML3.5,"OFFSHORE COQUIMBO, CHILE",18:34:25,30.04
7,2021-04-29,N,155.41,W,32Ml2.1,"ISLAND OF HAWAII, HAWAII",18:18:05,19.25
8,2021-04-29,N,126.69,E,10 M3.8,MOLUCCA SEA,18:15:29,2.11


In [52]:
df = df.drop('time_ago', axis = 1)
df.head()

Unnamed: 0,Date,Latitude,degrees lon,Longitude,Depth_kmMag,Region,degrees_lat
4,2021-04-29,N,22.15,E,2ML3.0,SOUTHERN GREECE,36.91
5,2021-04-29,N,7.49,E,11ML0.9,SWITZERLAND,46.15
6,2021-04-29,S,72.09,W,35ML3.5,"OFFSHORE COQUIMBO, CHILE",30.04
7,2021-04-29,N,155.41,W,32Ml2.1,"ISLAND OF HAWAII, HAWAII",19.25
8,2021-04-29,N,126.69,E,10 M3.8,MOLUCCA SEA,2.11


In [53]:
df['Depth_kmMag'] = df.Depth_kmMag.str.lower()
df.head()

Unnamed: 0,Date,Latitude,degrees lon,Longitude,Depth_kmMag,Region,degrees_lat
4,2021-04-29,N,22.15,E,2ml3.0,SOUTHERN GREECE,36.91
5,2021-04-29,N,7.49,E,11ml0.9,SWITZERLAND,46.15
6,2021-04-29,S,72.09,W,35ml3.5,"OFFSHORE COQUIMBO, CHILE",30.04
7,2021-04-29,N,155.41,W,32ml2.1,"ISLAND OF HAWAII, HAWAII",19.25
8,2021-04-29,N,126.69,E,10 m3.8,MOLUCCA SEA,2.11


In [54]:
df[['Depth_km','Mag']] = df.Depth_kmMag.str.split('m', expand = True)
df.head()

Unnamed: 0,Date,Latitude,degrees lon,Longitude,Depth_kmMag,Region,degrees_lat,Depth_km,Mag
4,2021-04-29,N,22.15,E,2ml3.0,SOUTHERN GREECE,36.91,2,l3.0
5,2021-04-29,N,7.49,E,11ml0.9,SWITZERLAND,46.15,11,l0.9
6,2021-04-29,S,72.09,W,35ml3.5,"OFFSHORE COQUIMBO, CHILE",30.04,35,l3.5
7,2021-04-29,N,155.41,W,32ml2.1,"ISLAND OF HAWAII, HAWAII",19.25,32,l2.1
8,2021-04-29,N,126.69,E,10 m3.8,MOLUCCA SEA,2.11,10,3.8


In [55]:
df = df[['Date', "degrees_lat", 'Latitude', 'degrees lon', 'Longitude', 'Mag', 'Depth_km', "Region"]]
df.head()

Unnamed: 0,Date,degrees_lat,Latitude,degrees lon,Longitude,Mag,Depth_km,Region
4,2021-04-29,36.91,N,22.15,E,l3.0,2,SOUTHERN GREECE
5,2021-04-29,46.15,N,7.49,E,l0.9,11,SWITZERLAND
6,2021-04-29,30.04,S,72.09,W,l3.5,35,"OFFSHORE COQUIMBO, CHILE"
7,2021-04-29,19.25,N,155.41,W,l2.1,32,"ISLAND OF HAWAII, HAWAII"
8,2021-04-29,2.11,N,126.69,E,3.8,10,MOLUCCA SEA


In [56]:
type(df["Mag"][25])

str

In [57]:
def clean_mag(mag):
    """Strip the leftover magnitude-type letters (e.g. the 'l' of 'ml') from a magnitude string."""
    if mag is not None:
        return re.sub("[a-z]+", "", mag)
    else: 
        return 0

df["Mag"] = df["Mag"].apply(clean_mag)
df.head()

Unnamed: 0,Date,degrees_lat,Latitude,degrees lon,Longitude,Mag,Depth_km,Region
4,2021-04-29,36.91,N,22.15,E,3.0,2,SOUTHERN GREECE
5,2021-04-29,46.15,N,7.49,E,0.9,11,SWITZERLAND
6,2021-04-29,30.04,S,72.09,W,3.5,35,"OFFSHORE COQUIMBO, CHILE"
7,2021-04-29,19.25,N,155.41,W,2.1,32,"ISLAND OF HAWAII, HAWAII"
8,2021-04-29,2.11,N,126.69,E,3.8,10,MOLUCCA SEA
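
The depth and magnitude could also be separated without lowercasing and splitting on `m`. The sketch below is a possible alternative, not the approach used above. It assumes every value has the shape "depth, magnitude type, magnitude" seen in the sample rows (`2ML3.0`, `10 M3.8`, `35Md2.2`). The names `DEPTH_MAG` and `split_depth_mag` are hypothetical.

```python
import re

DEPTH_MAG = re.compile(
    r"^(?P<depth_km>\d+)\s*"           # depth in km
    r"(?P<mag_type>[A-Za-z]+)\s*"      # magnitude type, e.g. ML, Md, M
    r"(?P<magnitude>\d+(?:\.\d+)?)$"   # magnitude value
)


def split_depth_mag(value):
    """Return (depth_km, mag_type, magnitude), or None if the value does not match."""
    match = DEPTH_MAG.match(value)
    if match is None:
        return None
    return (
        int(match.group("depth_km")),
        match.group("mag_type"),
        float(match.group("magnitude")),
    )


for sample in ["2ML3.0", "10 M3.8", "35Md2.2"]:
    print(sample, "->", split_depth_mag(sample))
```

This keeps the magnitude type and returns numeric values, so no second cleaning pass is needed.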


#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names that are not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account.

In [58]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [59]:
# your code here

#### Display the number of followers of a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names that are not found. 
<br>***Hint:*** the program should count the followers for any provided account.

In [60]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [61]:
# your code here

#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [62]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [63]:
# your code here

#### Create a list of the different kinds of datasets available on data.gov.uk.

In [64]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [65]:
# your code here

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [66]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [67]:
# your code here

#### Scrape a certain number of tweets of a given Twitter account.

In [68]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [69]:
# your code here

#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [70]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [71]:
# your code here

#### Display the movie name, year and a brief summary of 10 random movies from IMDB's top chart as a pandas dataframe.

In [72]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [73]:
# your code here

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'
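
City names with spaces or accents need URL encoding, which plain string concatenation does not provide. A hedged sketch of building the same query with `urllib.parse.urlencode` follows. The function name `build_weather_url` is hypothetical, and `"YOUR_API_KEY"` is a placeholder for your own key.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key, units="metric"):
    """Build the current-weather URL with properly encoded query parameters."""
    query = urlencode({"q": city, "APPID": api_key, "units": units})
    return f"{BASE_URL}?{query}"


print(build_weather_url("New York", "YOUR_API_KEY"))
```

The `params=` argument of `requests.get` performs the same encoding if you prefer to pass the dictionary directly.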

In [None]:
# your code here