# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each URL and inspect its source with Chrome DevTools. You'll need to identify the HTML tags, class names, etc. used for the content you are expected to extract.

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup`, `pandas` and `re` are imported for you. If you prefer to use additional libraries, feel free to uncomment them.

In [7]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
# from urllib.request import urlopen
# import random
import re
# import scrapy

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [26]:
#your code
#GET url + check status code and content
page= requests.get(url, headers={"user-agent":"Mozilla/5.0"})
content= page.content
print(page)
print(content)

<Response [200]>
b'\n\n\n\n<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars3.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-jFUBCdWOA1Ov3xo3oFMBwsdP4Up2K1bRnP4QYI5WqvpaIYxWVek89k2M0oyTbNhYMViGtxJB3Vdwcw8ln8hGQw==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-8c550109d58e0353afdf1a37a05301c2.css" />\n  <link crossorigin="anonymous" media="all" integrity="sha512-d8tnxebCP3jW0DJ3E/CIxAfZO2DHR3H4J+hsV7zj/WJ6

In [73]:
#PARSE (using BeautifulSoup)
soup = BeautifulSoup(page.content, 'html.parser')

In [8]:
#Print the response text to doublecheck data
soup.text

"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTrending  developers on GitHub today · GitHub\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to content\n\n\n\n\n\n\n\n\n\n\n\n\n\n                Sign\xa0up\n              \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n                    Why GitHub?\n                    \n\n\n\n\nFeatures →\n\nCode review\nProject management\nIntegrations\nActions\nPackages\nSecurity\nTeam management\nHosting\nMobile\n\n\nCustomer stories →\nSecurity →\n\n\n\n\n\nTeam\n\n\nEnterprise\n\n\n\n\n                    Explore\n                    \n\n\n\n\n\nExplore GitHub →\n\nLearn & contribute\n\nTopics\nCollections\nTrending\nLearning Lab\nOpen source guides\n\nConnect with others\n\nEvents\nCommunity forum\nGitHub Education\n\n\n\n\n\nMarketplace\n\n\n\n\n                    Pricing\n                    \n\n\n\n\nPlans →\n\nCompare plans\nContact Sales\n\n\nNonprofit →\nEducation →\n\n\n\n\n\n\n\n\n\n\n\

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
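The whitespace-cleaning step in the instructions can be sketched with plain string methods. This is a minimal sketch, not the only way to do it: the helper name `clean_text` is hypothetical, and the sample string is modelled on the padded text that GitHub wraps around each name.

```python
def clean_text(raw: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty
    # pieces, so joining the pieces back removes leading/trailing padding too.
    return " ".join(raw.split())


# Sample text shaped like the content of one <a> tag on the trending page
raw_name = "\n            Anurag Hazra\n "
cleaned_name = clean_text(raw_name)
print(cleaned_name)
```

Unlike chaining `.replace()` calls with a fixed number of spaces, this does not depend on how much indentation the page happens to use.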

In [57]:
#your code
#Find the elements whose class holds the developer names
developers_class=soup.find_all('h1', class_='h3 lh-condensed')

In [55]:
#Inspect the matched elements
developers_class

[<h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":35374649,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="cea3809579f6c2ad5025c0f913ee198ef545d18252a1c8ac68767673ba121233" href="/anuraghazra">
             Anurag Hazra
 </a> </h1>,
 <h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":10654537,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="e8bfa1034a3de144f3a998a3bf00c8d2149a0ee65e6b2e0725fbae5cfcd4a7c1" href="/MathewSachin">
             Mathew Sachin
 </a> </h1>,
 <h1 class="h3 lh-condensed">
 <a data-hydro-click='

In [63]:
#Extract a single name to understand the path to the text
developers_name1=developers_class[0].find("a").get_text().replace("\n","").replace("            ","")


In [64]:
#Check the first name
developers_name1

'Anurag Hazra'

In [71]:
#Extract all the developer names and store them in the list "developers_names"
developers_names=[]
for info in developers_class:
    name=info.find("a").get_text().replace("\n","").replace("            ","")
    developers_names.append(name)
    

In [72]:
#Print the names of all the trending developers on GitHub
developers_names

['Anurag Hazra',
 'Mathew Sachin',
 'Chocobozzz',
 'Stefano Gottardo',
 'Chris Banes',
 'Kyle Mathews',
 'Brandon Bayer',
 'Joe Block',
 'Arvid Norberg',
 'Michiel Borkent',
 'Nicolas Gallagher',
 'Clemens Wolff',
 'Felix Yan',
 'Hajime Hoshi',
 'Domenic Denicola',
 'Jacob Quinn',
 'Robert Mosolgo',
 'Sebastian Silbermann',
 'Remi Rousselet',
 'Michael Lynch',
 'Veronika Romashkina',
 'Jeremy Ashkenas',
 'Elad Ben-Israel',
 'Jun Han',
 'Klaus Sinani']

#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [36]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [37]:
#your code
#GET url + check status code and content
page= requests.get(url, headers={"user-agent":"Mozilla/5.0"})
content= page.content
print(page)
print(content)

<Response [200]>
b'\n\n\n\n<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars3.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-jFUBCdWOA1Ov3xo3oFMBwsdP4Up2K1bRnP4QYI5WqvpaIYxWVek89k2M0oyTbNhYMViGtxJB3Vdwcw8ln8hGQw==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-8c550109d58e0353afdf1a37a05301c2.css" />\n  <link crossorigin="anonymous" media="all" integrity="sha512-fqnZtayqgLCmcQfxXp5OH4orKvv16fP0zCU6Ns+NuAUL

In [38]:
#PARSE (using BeautifulSoup)
soup = BeautifulSoup(page.content, 'html.parser')
soup


<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-8c550109d58e0353afdf1a37a05301c2.css" integrity="sha512-jFUBCdWOA1Ov3xo3oFMBwsdP4Up2K1bRnP4QYI5WqvpaIYxWVek89k2M0oyTbNhYMViGtxJB3Vdwcw8ln8hGQw==" media="all" rel="stylesheet">
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-7ea9d9b5acaa80b0a67107f15e9e4e1f.css" integrity="sha512-fqnZtayqgLCmcQfxXp5OH4orKvv16

In [39]:
#Print the response text to doublecheck data
soup.text

"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTrending Python repositories on GitHub today · GitHub\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to content\n\n\n\n\n\n\n\n\n\n\n\n\n\n                Sign\xa0up\n              \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n                    Why GitHub?\n                    \n\n\n\n\nFeatures →\n\nCode review\nProject management\nIntegrations\nActions\nPackages\nSecurity\nTeam management\nHosting\nMobile\n\n\nCustomer stories →\nSecurity →\n\n\n\n\n\nTeam\n\n\nEnterprise\n\n\n\n\n                    Explore\n                    \n\n\n\n\n\nExplore GitHub →\n\nLearn & contribute\n\nTopics\nCollections\nTrending\nLearning Lab\nOpen source guides\n\nConnect with others\n\nEvents\nCommunity forum\nGitHub Education\n\n\n\n\n\nMarketplace\n\n\n\n\n                    Pricing\n                    \n\n\n\n\nPlans →\n\nCompare plans\nContact Sales\n\n\nNonprofit →\nEducation →\n\n\n\n\n\n\n\

In [40]:
#Find the elements whose class holds the repository names
repository_class=soup.find_all('h1', class_='h3 lh-condensed')

In [41]:
repository_class

[<h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_REPOSITORIES_PAGE","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":136914524,"originating_url":"https://github.com/trending/python?since=daily","user_id":null}}' data-hydro-click-hmac="9767833cfca9072c3c3fd81ac2f3944dca7e8b7dbb1a9e03fc9fd93d1904df36" href="/flairNLP/flair">
 <svg aria-hidden="true" class="octicon octicon-repo mr-1 text-gray" color="gray" height="16" mr="1" version="1.1" viewbox="0 0 16 16" width="16"><path d="M2 2.5A2.5 2.5 0 014.5 0h8.75a.75.75 0 01.75.75v12.5a.75.75 0 01-.75.75h-2.5a.75.75 0 110-1.5h1.75v-2h-8a1 1 0 00-.714 1.7.75.75 0 01-1.072 1.05A2.495 2.495 0 012 11.5v-9zm10.5-1V9h-8c-.356 0-.694.074-1 .208V2.5a1 1 0 011-1h8zM5 12.25v3.25a.25.25 0 00.4.2l1.45-1.087a.25.25 0 01.3 0L8.6 15.7a.25.25 0 00.4-.2v-3.25a.25.25 0 00-.25-.25h-3.5a.25.25 0 00-.25.25z" fill-rule="evenodd"></path

In [42]:
#Extract a single repository name to understand the path, then check it
repository_name1=repository_class[0].find("a").get_text().replace("\n","").replace("        ","").replace(" ","")
repository_name1

'flairNLP/flair'

In [43]:
#Extract all the repository names and store them in the list "repository_names"
repository_names=[]
for info in repository_class:
    name=info.find("a").get_text().replace("\n","").replace("        ","").replace(" ","")
    repository_names.append(name)
    

In [44]:
#Print the list
repository_names

['flairNLP/flair',
 'PostHog/posthog',
 'kangvcar/InfoSpider',
 'PyTorchLightning/pytorch-lightning',
 'vaexio/vaex',
 'donnemartin/system-design-primer',
 'PaddlePaddle/PaddleDetection',
 'opencv/cvat',
 'huggingface/transformers',
 'abhimishra91/insight',
 'horovod/horovod',
 'iperov/DeepFaceLab',
 'microsoft/recommenders',
 'd2l-ai/d2l-en',
 'awslabs/aws-lambda-powertools-python',
 'apache/airflow',
 'optuna/optuna',
 'scikit-learn/scikit-learn',
 'catalyst-team/catalyst',
 'Eloston/ungoogled-chromium',
 'pytorch/fairseq',
 'ckan/ckan',
 'lyhue1991/eat_tensorflow2_in_30_days',
 'pennersr/django-allauth',
 'miemie2013/Keras-YOLOv4']
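The inspected output above shows that each repository link carries an `href` such as `/flairNLP/flair`, so the `owner/repo` name can also be read from that attribute instead of cleaning the link text. The sketch below assumes that structure; `sample_html` is a hypothetical, trimmed-down stand-in for the real page, so it runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the <h1 class="h3 lh-condensed"> blocks above
sample_html = """
<h1 class="h3 lh-condensed">
  <a href="/flairNLP/flair">
    <span>flairNLP /</span>
    flair
  </a>
</h1>
<h1 class="h3 lh-condensed">
  <a href="/PostHog/posthog">
    <span>PostHog /</span>
    posthog
  </a>
</h1>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Take the href of the first <a> in each heading and drop the leading slash
repo_names_from_href = [
    heading.find("a")["href"].strip("/")
    for heading in sample_soup.find_all("h1", class_="h3 lh-condensed")
]
print(repo_names_from_href)
```

Reading the attribute avoids any dependence on the whitespace inside the link, at the cost of assuming the `href` always has the form `/owner/repo`.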

#### Display all the image links from the Walt Disney Wikipedia page

In [249]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [250]:
#your code
#GET url + check status code and content
page= requests.get(url, headers={"user-agent":"Mozilla/5.0"})
content= page.content
print(page)
print(content)

<Response [200]>
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Walt Disney - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"478efa2d-4ef6-4286-9ca1-d459446892b2","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Walt_Disney","wgTitle":"Walt Disney","wgCurRevisionId":972127952,"wgRevisionId":972127952,"wgArticleId":32917,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages containing links to subscription-only content","Wikipedia extended-confirmed-protected pages","Wikipedia indefinitely move-protected pages","Articl

In [251]:
#PARSE (using BeautifulSoup)
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Walt Disney - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"478efa2d-4ef6-4286-9ca1-d459446892b2","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Walt_Disney","wgTitle":"Walt Disney","wgCurRevisionId":972127952,"wgRevisionId":972127952,"wgArticleId":32917,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages containing links to subscription-only content","Wikipedia extended-confirmed-protected pages","Wikipedia indefinitely move-protected pages","Articles with short descripti

In [123]:
#Find the image links in the page
img_links_class=soup.find_all(class_='image')

In [121]:
img_links_class

[<a class="image" href="/wiki/File:Walt_Disney_1946.JPG"><img alt="Walt Disney 1946.JPG" data-file-height="675" data-file-width="450" decoding="async" height="330" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/330px-Walt_Disney_1946.JPG 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/440px-Walt_Disney_1946.JPG 2x" width="220"/></a>,
 <a class="image" href="/wiki/File:Walt_Disney_1942_signature.svg"><img alt="Walt Disney 1942 signature.svg" data-file-height="218" data-file-width="585" decoding="async" height="56" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/225px-Walt_Disney_1942_signature.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/t

In [130]:
#Extract a single link to understand the path, then check its href
img_links_class[0].get("href")

'/wiki/File:Walt_Disney_1946.JPG'

In [136]:
#Extract the href of every image link and store them in the list "img_links"
img_links=[]
for info in img_links_class:
    link=info.get("href")
    img_links.append(link)

In [137]:
img_links

['/wiki/File:Walt_Disney_1946.JPG',
 '/wiki/File:Walt_Disney_1942_signature.svg',
 '/wiki/File:Walt_Disney_envelope_ca._1921.jpg',
 '/wiki/File:Trolley_Troubles_poster.jpg',
 '/wiki/File:Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg',
 '/wiki/File:Steamboat-willie.jpg',
 '/wiki/File:Walt_Disney_1935.jpg',
 '/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg',
 '/wiki/File:Disney_drawing_goofy.jpg',
 '/wiki/File:DisneySchiphol1951.jpg',
 '/wiki/File:WaltDisneyplansDisneylandDec1954.jpg',
 '/wiki/File:Walt_disney_portrait_right.jpg',
 '/wiki/File:Walt_Disney_Grave.JPG',
 '/wiki/File:Roy_O._Disney_with_Company_at_Press_Conference.jpg',
 '/wiki/File:Disney_Display_Case.JPG',
 '/wiki/File:Disney1968.jpg',
 '/wiki/File:The_Walt_Disney_Company_Logo.svg',
 '/wiki/File:Animation_disc.svg',
 '/wiki/File:P_vip.svg',
 '/wiki/File:Magic_Kingdom_castle.jpg',
 '/wiki/File:Video-x-generic.svg',
 '/wiki/File:Flag_of_Los_Angeles_County,_C
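The `href` values collected above are relative to the site root, and the `src` values visible in the inspected `<img>` tags start with `//` (protocol-relative). A small sketch of turning both into absolute URLs with the standard library follows; the variable names are illustrative, and the sample values are copied from the output above.

```python
from urllib.parse import urljoin

# The page the links were scraped from acts as the base for resolution
base_url = "https://en.wikipedia.org/wiki/Walt_Disney"

relative_links = [
    "/wiki/File:Walt_Disney_1946.JPG",
    "/wiki/File:Walt_Disney_1942_signature.svg",
]

# Root-relative paths replace the path of the base URL
absolute_links = [urljoin(base_url, href) for href in relative_links]

# Protocol-relative URLs inherit only the scheme (https) from the base URL
thumbnail_src = (
    "//upload.wikimedia.org/wikipedia/commons/thumb/d/df/"
    "Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG"
)
absolute_thumbnail = urljoin(base_url, thumbnail_src)

print(absolute_links)
print(absolute_thumbnail)
```

Note that the `href` of each `a.image` element points to the file description page, while the `src` of the nested `<img>` points to the image file itself; which one counts as "the image link" depends on how the exercise is read.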

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page

In [138]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [139]:
#your code
#GET url + check status code and content
page= requests.get(url, headers={"user-agent":"Mozilla/5.0"})
content= page.content
print(page)
print(content)

<Response [200]>
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Python - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"6b140b43-59bb-4dbc-abde-d705e0a874ae","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Python","wgTitle":"Python","wgCurRevisionId":963092579,"wgRevisionId":963092579,"wgArticleId":46332325,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Disambiguation pages with short descriptions","Short description is different from Wikidata","All article disambiguation pages","All disambiguation pages","Animal co

In [140]:
#PARSE (using BeautifulSoup)
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Python - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"6b140b43-59bb-4dbc-abde-d705e0a874ae","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Python","wgTitle":"Python","wgCurRevisionId":963092579,"wgRevisionId":963092579,"wgArticleId":46332325,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Disambiguation pages with short descriptions","Short description is different from Wikidata","All article disambiguation pages","All disambiguation pages","Animal common name disambiguatio

In [150]:
#Find the links in the page to understand the path
links_class=soup.find_all("a")

In [149]:
soup.find_all("a")

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a class="extiw" href="https://en.wiktionary.org/wiki/Python" title="wiktionary:Python">Python</a>,
 <a class="extiw" href="https://en.wiktionary.org/wiki/python" title="wiktionary:python">python</a>,
 <a href="#Snakes"><span class="tocnumber">1</span> <span class="toctext">Snakes</span></a>,
 <a href="#Ancient_Greece"><span class="tocnumber">2</span> <span class="toctext">Ancient Greece</span></a>,
 <a href="#Media_and_entertainment"><span class="tocnumber">3</span> <span class="toctext">Media and entertainment</span></a>,
 <a href="#Computing"><span class="tocnumber">4</span> <span class="toctext">Computing</span></a>,
 <a href="#Engineering"><span class="tocnumber">5</span> <span class="toctext">Engineering</span></a>,
 <a href="#Roller_coasters"><span class="tocnumber">5.1</span> <span class="toctext">Roller coasters</span></a>,
 <a h

In [153]:
#Collect the href of every link in the page
links=[]
for info in links_class:
    link=info.get("href")
    links.append(link)

In [154]:
links

[None,
 '#mw-head',
 '#searchInput',
 'https://en.wiktionary.org/wiki/Python',
 'https://en.wiktionary.org/wiki/python',
 '#Snakes',
 '#Ancient_Greece',
 '#Media_and_entertainment',
 '#Computing',
 '#Engineering',
 '#Roller_coasters',
 '#Vehicles',
 '#Weaponry',
 '#People',
 '#Other_uses',
 '#See_also',
 '/w/index.php?title=Python&action=edit&section=1',
 '/wiki/Pythonidae',
 '/wiki/Python_(genus)',
 '/w/index.php?title=Python&action=edit&section=2',
 '/wiki/Python_(mythology)',
 '/wiki/Python_of_Aenus',
 '/wiki/Python_(painter)',
 '/wiki/Python_of_Byzantium',
 '/wiki/Python_of_Catana',
 '/w/index.php?title=Python&action=edit&section=3',
 '/wiki/Python_(film)',
 '/wiki/Pythons_2',
 '/wiki/Monty_Python',
 '/wiki/Python_(Monty)_Pictures',
 '/w/index.php?title=Python&action=edit&section=4',
 '/wiki/Python_(programming_language)',
 '/wiki/CPython',
 '/wiki/CMU_Common_Lisp',
 '/wiki/PERQ#PERQ_3',
 '/w/index.php?title=Python&action=edit&section=5',
 '/w/index.php?title=Python&action=edit&sec

In [156]:
#There are 159 links on the page in total. Some work as an index that jumps to another section of the same page,
#others redirect to a different page
len(links)

159
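The list above mixes `None` (an `<a>` tag without an `href`), in-page anchors that start with `#`, relative paths and full URLs. One possible way to reduce it to unique, absolute links is sketched below. The helper `page_links` is hypothetical, and the sample values are a short subset of the output above with one duplicate added to show de-duplication.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/Python"

sample_links = [
    None,
    "#mw-head",
    "https://en.wiktionary.org/wiki/Python",
    "/wiki/Pythonidae",
    "/w/index.php?title=Python&action=edit&section=1",
    "/wiki/Pythonidae",  # duplicate, added for the example
]


def page_links(hrefs, base):
    """Return unique absolute URLs, skipping missing hrefs and in-page anchors."""
    seen = set()
    result = []
    for href in hrefs:
        # Skip <a> tags without href and links that only jump within the page
        if not href or href.startswith("#"):
            continue
        absolute = urljoin(base, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


clean_links = page_links(sample_links, base_url)
print(clean_links)
```

Whether anchors and edit links should be dropped is a judgment call; the exercise only asks for "a list of links", so the unfiltered list is also a valid answer.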

#### Number of Titles that have changed in the United States Code since its last release point 

In [161]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [162]:
#your code
#GET url + check status code and content
page= requests.get(url, headers={"user-agent":"Mozilla/5.0"})
content= page.content
print(page)
print(content)

<Response [200]>
b'<?xml version=\'1.0\' encoding=\'UTF-8\' ?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml"><head>\n        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n        <meta http-equiv="X-UA-Compatible" content="IE=8" />\n        <meta http-equiv="pragma" content="no-cache" /><!-- HTTP 1.0 -->\n        <meta http-equiv="cache-control" content="no-cache,must-revalidate" /><!-- HTTP 1.1 -->\n        <meta http-equiv="expires" content="0" />\n        <link rel="shortcut icon" href="/javax.faces.resource/favicon.ico.xhtml?ln=images" /><link type="text/css" rel="stylesheet" href="/javax.faces.resource/cssLayout.css.xhtml?ln=css" /><script type="text/javascript" src="/javax.faces.resource/jsf.js.xhtml?ln=javax.faces"></script><link type="text/css" rel="stylesheet" href="/javax.faces.resource/static.css.xhtml?ln=css" /></head><body><scrip

In [164]:
#PARSE (using BeautifulSoup)
soup = BeautifulSoup(page.content, 'html.parser')
soup

<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=8" http-equiv="X-UA-Compatible"/>
<meta content="no-cache" http-equiv="pragma"/><!-- HTTP 1.0 -->
<meta content="no-cache,must-revalidate" http-equiv="cache-control"/><!-- HTTP 1.1 -->
<meta content="0" http-equiv="expires"/>
<link href="/javax.faces.resource/favicon.ico.xhtml?ln=images" rel="shortcut icon"/><link href="/javax.faces.resource/cssLayout.css.xhtml?ln=css" rel="stylesheet" type="text/css"/><script src="/javax.faces.resource/jsf.js.xhtml?ln=javax.faces" type="text/javascript"></script><link href="/javax.faces.resource/static.css.xhtml?ln=css" rel="stylesheet" type="text/css"/></head><body><script src="/javax.faces.resource/browserPreferences.js.xhtml?ln=scripts" type="text/javasc

In [237]:
#Look for the titles shown in bold on the page (style='padding-bottom: 10px;'),
#since those are the titles that have changed since the last release point

#or search directly for class=usctitlechanged

items_bold=soup.find_all(class_="usctitlechanged")
items_bold



[<div class="usctitlechanged" id="us/usc/t54" style="padding-bottom: 10px;">
 
           Title 54 - National Park Service and Related Programs <span class="footnote"><a class="fn" href="#fn">٭</a></span>
 </div>]

In [242]:
#Keep only the title text
title_changed=items_bold[0].get_text().strip().replace(" ٭","")

In [243]:
title_changed

'Title 54 - National Park Service and Related Programs'

In [245]:
#Answer: only 1 title has changed since the last release point
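Since the exercise asks for a number, the count can be taken directly from the list of matched elements. The sketch below assumes the `usctitlechanged` class and the `footnote` span seen in the output above; `sample_html` is a hypothetical reduced page, and its first `<div>` uses a made-up class to stand in for an unchanged title.

```python
from bs4 import BeautifulSoup

# Hypothetical reduced page: one unchanged title (made-up class) and one changed title
sample_html = """
<div class="unchanged-example" id="us/usc/t53">Title 53 - [Reserved]</div>
<div class="usctitlechanged" id="us/usc/t54" style="padding-bottom: 10px;">
  Title 54 - National Park Service and Related Programs <span class="footnote"><a class="fn" href="#fn">٭</a></span>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

changed_titles = []
for div in sample_soup.find_all("div", class_="usctitlechanged"):
    # Remove the footnote marker from the tree before reading the text
    footnote = div.find("span", class_="footnote")
    if footnote is not None:
        footnote.decompose()
    changed_titles.append(div.get_text(strip=True))

print(len(changed_titles))
print(changed_titles)
```

Removing the footnote element avoids hard-coding the `٭` character in a `.replace()` call.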

#### A Python list with the top ten FBI's Most Wanted names 

In [267]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [268]:
#your code
#GET url + check status code and content
page= requests.get(url, headers={"user-agent":"Mozilla/5.0"})
content= page.content
print(page)
print(content)

<Response [503]>


In [248]:
#PARSE (using BeautifulSoup)
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<!-- saved from url=(0023)http://kidmondo.com/404 -->
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="no" http-equiv="imagetoolbar"/>
<meta content="noindex,nofollow" name="robots"/>
<title>Security Prompt</title>
<style>body{background:#fff;margin:0;padding:20px;text-align:center;font-family:Arial,Helvetica,sans-serif;font-size:14px;color:#666}.error_page{width:600px;padding:50px;margin:auto}.error_page h1{margin:20px 0 0}.error_page p{margin:10px 0;padding:0}a{color:#9caa6d;text-decoration:none}a:hover{color:#9caa6d;text-decoration:underline}</style>
<script type="text/javascript">
  //<![CDATA[
  (function(){
    
    var a = function() {try{return !!window.addEventListener} catch(e) {return !1} },
    b = function(b, c) {a() ? document.addEventListener("DOMContentLo

In [60]:
#Find the elements whose class holds the names of the 10 most wanted people
soup.find_all(class_="full-grid wanted-grid-natural infinity castle-grid-block-xs-2 castle-grid-block-sm-2castle-grid-block-md-3 castle-grid-block-lg-5 dt-grid")

[]

In [None]:
#Not solved: the request returned 503 (a "Security Prompt" page), so no FBI names were extracted
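The 503 status above explains the empty result: the parsed page is a "Security Prompt" document rather than the wanted list, so the class search has nothing to match. A minimal sketch of guarding against this before parsing follows. It does not get around the block; it only makes the failure explicit. The helper `response_is_usable` is hypothetical, and the `Response` objects are built by hand so the example runs offline.

```python
import requests


def response_is_usable(response: requests.Response) -> bool:
    """Return True only when the server answered with 200 OK."""
    return response.status_code == 200


# Hand-built responses that simulate a blocked and a successful request
blocked = requests.Response()
blocked.status_code = 503

successful = requests.Response()
successful.status_code = 200

for response in (blocked, successful):
    try:
        # raise_for_status() raises HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
        print(response.status_code, "-> safe to parse")
    except requests.exceptions.HTTPError as error:
        print(response.status_code, "-> skip parsing:", error)
```

In the notebook, the same check would go right after `requests.get(...)`, before the content is handed to `BeautifulSoup`.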

#### 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas DataFrame

In [113]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [114]:
#GET url + check status code and content
page= requests.get(url, headers={"user-agent":"Mozilla/5.0"})
content= page.content
print(page)
print(content)

<Response [200]>


In [289]:
#PARSE (using BeautifulSoup)
soup = BeautifulSoup(page.content, 'html.parser')

In [116]:
#Get Main Info for date and time
earthquake_class_date_time=soup.find_all(class_="tabev6")

In [118]:
#The text of the <a> tag inside each cell holds both the date and the time
earthquake_class_date_time[0].find("a").text

'2020-08-22\xa0\xa0\xa018:09:35.0'

In [119]:
#Get all dates from all earthquakes
dates=[]
for info in earthquake_class_date_time:
    date=info.find("a").text
    dates.append(re.findall(r"\d{4}-\d{2}-\d{2}",date))

In [120]:
dates

[['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22'],
 ['2020-08-22']]

In [121]:
#Get all times from all earthquakes
times=[]
for info in earthquake_class_date_time:
    time=info.find("a").text
    times.append(re.findall(r"\d{2}:\d{2}",time))

In [122]:
times

[['18:09'],
 ['17:49'],
 ['17:44'],
 ['17:43'],
 ['17:22'],
 ['17:13'],
 ['16:36'],
 ['16:12'],
 ['16:10'],
 ['15:54'],
 ['15:43'],
 ['15:41'],
 ['15:35'],
 ['15:29'],
 ['14:45'],
 ['14:44'],
 ['14:38'],
 ['14:30'],
 ['14:18'],
 ['14:18'],
 ['14:14'],
 ['14:06'],
 ['13:55'],
 ['13:51'],
 ['13:21'],
 ['13:03'],
 ['12:41'],
 ['12:40'],
 ['12:39'],
 ['12:38'],
 ['12:11'],
 ['11:27'],
 ['11:25'],
 ['11:17'],
 ['11:04'],
 ['11:04'],
 ['10:52'],
 ['10:50'],
 ['10:38'],
 ['10:35'],
 ['10:33'],
 ['10:31'],
 ['10:25'],
 ['10:22'],
 ['10:21'],
 ['10:15'],
 ['10:01'],
 ['09:38'],
 ['09:35'],
 ['09:24']]
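Because `re.findall` returns a list, the two loops above produce lists of one-element lists. If flat strings are preferred (for example as DataFrame columns), one alternative is a single pattern with two groups, sketched below. The helper `split_date_time` is hypothetical. Unlike the loop above, it keeps the seconds, and `\s+` is used because the sample text separates date and time with non-breaking spaces (`\xa0`).

```python
import re

# One group for the date, one for the time (HH:MM:SS); \s also matches \xa0
DATE_TIME_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})")


def split_date_time(text):
    """Return (date, time) as plain strings, or None if the text does not match."""
    match = DATE_TIME_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Sample value copied from the cell output above
sample_text = "2020-08-22\xa0\xa0\xa018:09:35.0"
parsed = split_date_time(sample_text)
print(parsed)
```

Returning `None` for non-matching text makes it easy to spot cells that do not contain a timestamp instead of silently storing an empty list.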

In [288]:
#Get Main Info for latitudes and longitudes
earthquake_class_latitudes_longitudes=soup.find_all(class_="tabev1")

In [124]:
earthquake_class_latitudes_longitudes[0].get_text().strip()

'6.00'

In [125]:
#Collect the latitude and longitude values (they alternate in the "tabev1" cells)
latitudes_longitudes=[]
for info in earthquake_class_latitudes_longitudes:
    latitude=info.get_text().strip()
    latitudes_longitudes.append(latitude)

In [126]:
latitudes=latitudes_longitudes[::2]

In [127]:
latitudes

['6.00',
 '54.54',
 '10.59',
 '2.30',
 '58.34',
 '28.13',
 '22.59',
 '18.49',
 '33.10',
 '58.34',
 '64.97',
 '4.36',
 '9.94',
 '10.60',
 '20.19',
 '14.74',
 '30.32',
 '35.87',
 '29.03',
 '38.15',
 '38.26',
 '12.00',
 '64.96',
 '11.90',
 '18.05',
 '24.54',
 '38.17',
 '40.13',
 '44.35',
 '9.55',
 '43.01',
 '39.11',
 '19.81',
 '9.24',
 '58.35',
 '19.41',
 '42.81',
 '42.81',
 '5.24',
 '39.36',
 '20.11',
 '31.01',
 '41.51',
 '20.07',
 '7.19',
 '34.03',
 '7.05',
 '17.12',
 '15.61',
 '45.89']

In [128]:
longitudes=latitudes_longitudes[1::2]

In [129]:
longitudes

['126.60',
 '160.86',
 '85.27',
 '122.34',
 '133.52',
 '15.26',
 '68.76',
 '145.89',
 '178.83',
 '133.46',
 '149.12',
 '102.46',
 '119.16',
 '85.27',
 '69.15',
 '92.35',
 '57.50',
 '117.70',
 '98.05',
 '15.13',
 '38.74',
 '124.03',
 '149.20',
 '124.20',
 '66.83',
 '179.76',
 '117.83',
 '27.30',
 '115.16',
 '84.58',
 '18.31',
 '27.71',
 '70.35',
 '84.06',
 '133.43',
 '155.29',
 '13.40',
 '13.41',
 '95.19',
 '123.24',
 '70.22',
 '141.81',
 '19.56',
 '70.20',
 '12.88',
 '117.55',
 '126.98',
 '100.09',
 '93.24',
 '7.02']
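
The scraped coordinates are strings, and the two slices rely on the cells alternating latitude, longitude. A small sketch of the same split with a conversion to `float`, using the first three pairs from the output above as stand-in data (`cell_texts` is a hypothetical name for `latitudes_longitudes`):

```python
# Stand-in for latitudes_longitudes: values alternate latitude, longitude.
cell_texts = ["6.00", "126.60", "54.54", "160.86", "10.59", "85.27"]

# Even positions are latitudes, odd positions are longitudes.
lat_values = [float(text) for text in cell_texts[::2]]
lon_values = [float(text) for text in cell_texts[1::2]]

# Pair each latitude with its longitude.
coordinates = list(zip(lat_values, lon_values))
print(coordinates)
```

Numeric values make later steps such as sorting or plotting simpler than working with text.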

In [130]:
#Get the region of every earthquake
earthquake_class_region=soup.find_all(class_="tb_region")
regions=[]
for info in earthquake_class_region:
    region=info.get_text().strip()
    regions.append(region)

In [131]:
regions

['MINDANAO, PHILIPPINES',
 'ALASKA PENINSULA',
 'COSTA RICA',
 'SULAWESI, INDONESIA',
 'SOUTHEASTERN ALASKA',
 'CANARY ISLANDS, SPAIN REGION',
 'ANTOFAGASTA, CHILE',
 'PAGAN REG., N. MARIANA ISLANDS',
 'SOUTH OF KERMADEC ISLANDS',
 'SOUTHEASTERN ALASKA',
 'CENTRAL ALASKA',
 'SOUTHERN SUMATRA, INDONESIA',
 'SUMBA REGION, INDONESIA',
 'COSTA RICA',
 'TARAPACA, CHILE',
 'CHIAPAS, MEXICO',
 'EASTERN IRAN',
 'CENTRAL CALIFORNIA',
 'SOUTHERN TEXAS',
 'SICILY, ITALY',
 'EASTERN TURKEY',
 'SAMAR, PHILIPPINES',
 'CENTRAL ALASKA',
 'LEYTE, PHILIPPINES',
 'PUERTO RICO',
 'SOUTH OF FIJI ISLANDS',
 'NEVADA',
 'WESTERN TURKEY',
 'SOUTHERN IDAHO',
 'COSTA RICA',
 'BOSNIA AND HERZEGOVINA',
 'WESTERN TURKEY',
 'DOMINICAN REPUBLIC REGION',
 'COSTA RICA',
 'SOUTHEASTERN ALASKA',
 'ISLAND OF HAWAII, HAWAII',
 'CENTRAL ITALY',
 'CENTRAL ITALY',
 'NORTHERN SUMATRA, INDONESIA',
 'NORTHERN CALIFORNIA',
 'DOMINICAN REPUBLIC REGION',
 'IZU ISLANDS, JAPAN REGION',
 'ALBANIA',
 'DOMINICAN REPUBLIC REGION',
 'ASCE

In [137]:
#Create the DataFrame
dict1={'Region':regions,'Date':dates,'Time':times,'Latitude':latitudes,'Longitude':longitudes}
df = pd.DataFrame(dict1) 

In [138]:
df

Unnamed: 0,Region,Date,Time,Latitude,Longitude
0,"MINDANAO, PHILIPPINES",[2020-08-22],[18:09],6.0,126.6
1,ALASKA PENINSULA,[2020-08-22],[17:49],54.54,160.86
2,COSTA RICA,[2020-08-22],[17:44],10.59,85.27
3,"SULAWESI, INDONESIA",[2020-08-22],[17:43],2.3,122.34
4,SOUTHEASTERN ALASKA,[2020-08-22],[17:22],58.34,133.52
5,"CANARY ISLANDS, SPAIN REGION",[2020-08-22],[17:13],28.13,15.26
6,"ANTOFAGASTA, CHILE",[2020-08-22],[16:36],22.59,68.76
7,"PAGAN REG., N. MARIANA ISLANDS",[2020-08-22],[16:12],18.49,145.89
8,SOUTH OF KERMADEC ISLANDS,[2020-08-22],[16:10],33.1,178.83
9,SOUTHEASTERN ALASKA,[2020-08-22],[15:54],58.34,133.46


#### Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** to handle account names that are not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account.
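
One possible shape for the try/except block, sketched with the standard-library `urllib` from the tips list. The function name `fetch_profile` and its `opener` parameter are hypothetical, and whether a missing account actually answers with an HTTP error is an assumption to check against the real responses:

```python
import urllib.error
import urllib.request

def fetch_profile(account, opener=urllib.request.urlopen):
    """Return the profile HTML for `account`, or None if it cannot be fetched."""
    url = "https://twitter.com/" + account
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with opener(request) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # HTTPError must be caught first: it is a subclass of URLError.
        print(f"Account {account!r} not found or not accessible (HTTP {err.code})")
    except urllib.error.URLError as err:
        print(f"Could not reach {url}: {err.reason}")
    return None
```

The `opener` argument only exists so the function can be tried without network access; calling `fetch_profile("lavecinarubia")` uses the real `urlopen`.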

In [152]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/lavecinarubia'

In [153]:
#your code
page= requests.get(url, headers={"user-agent":"Mozilla/5.0"})
content= page.content
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!DOCTYPE html>

<html dir="ltr" lang="en">
<meta charset="utf-8"/>
<meta content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=0,viewport-fit=cover" name="viewport"/>
<link href="//abs.twimg.com" rel="preconnect"/>
<link href="//api.twitter.com" rel="preconnect"/>
<link href="//pbs.twimg.com" rel="preconnect"/>
<link href="//t.co" rel="preconnect"/>
<link href="//video.twimg.com" rel="preconnect"/>
<link href="//abs.twimg.com" rel="dns-prefetch"/>
<link href="//api.twitter.com" rel="dns-prefetch"/>
<link href="//pbs.twimg.com" rel="dns-prefetch"/>
<link href="//t.co" rel="dns-prefetch"/>
<link href="//video.twimg.com" rel="dns-prefetch"/>
<link as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/client-web-legacy/polyfills.87388485.js" nonce="MjNlMTU1NmItY2IzNy00MjYwLWEzYzMtNWI1MDc5YzM0YmY4" rel="preload"/>
<link as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/client-web-legacy/vendors~main.55c72055.js" nonc

In [160]:
#I tried several different classes but could not get the tweets to print
soup.find_all(class_="css-1dbjc4n r-1habvwh")

[]

#### Number of followers of a given Twitter account

You will need to include a ***try/except block*** in case the account name is not found. 
<br>***Hint:*** the program should count the followers for any provided account.

In [155]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [157]:
#your code


#### List all language names and number of related articles in the order they appear in wikipedia.org

In [158]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [290]:
#your code
#Get info
page= requests.get(url, headers={"user-agent":"Mozilla/5.0"})
content= page.content
soup = BeautifulSoup(page.content, 'html.parser')

In [249]:
#Find articles num
class_results=soup.find_all(class_="central-featured")[0]

In [250]:
class_bdi=class_results.find_all("bdi",dir="ltr")
class_bdi[0].text

'6\xa0137\xa0000+'

In [251]:
articles=[]
for info in class_bdi:
    article=info.text
    articles.append(article)

In [252]:
articles

['6\xa0137\xa0000+',
 '1\xa0222\xa0000+',
 '1\xa0617\xa0000+',
 '2\xa0467\xa0000+',
 '1\xa0651\xa0000+',
 '2\xa0241\xa0000+',
 '1\xa0627\xa0000+',
 '1\xa0136\xa0000+',
 '1\xa0041\xa0000+',
 '1\xa0423\xa0000+']

In [198]:
#Find languages
class_results

<div class="central-featured" data-el-section="primary links">
<!-- Rankings from http://stats.wikimedia.org/EN/Sitemap.htm -->
<!-- Article counts from http://meta.wikimedia.org/wiki/List_of_Wikipedias/Table -->
<!-- #1. en.wikipedia.org - 1 626 855 000 views/day -->
<div class="central-featured-lang lang1" dir="ltr" lang="en">
<a class="link-box" data-slogan="The Free Encyclopedia" href="//en.wikipedia.org/" id="js-link-box-en" title="English — Wikipedia — The Free Encyclopedia">
<strong>English</strong>
<small><bdi dir="ltr">6 137 000+</bdi> <span>articles</span></small>
</a>
</div>
<!-- #2. ja.wikipedia.org - 262 652 000 views/day -->
<div class="central-featured-lang lang2" dir="ltr" lang="ja">
<a class="link-box" data-slogan="フリー百科事典" href="//ja.wikipedia.org/" id="js-link-box-ja" title="Nihongo — ウィキペディア — フリー百科事典">
<strong>日本語</strong>
<small><bdi dir="ltr">1 222 000+</bdi> <span>記事</span></small>
</a>
</div>
<!-- #3. es.wikipedia.org - 222 454 000 views/day -->
<div class="cen

In [None]:
soup.find_all(class_="central-featured")[0].find("strong").text

In [196]:
languages=[]
for info in class_results:
    language=info.find("strong")
    languages.append(language)

In [197]:
#I do not understand why I get -1 values here, and if I add .text as in the single example above, the result comes out empty
languages

[-1,
 -1,
 -1,
 -1,
 -1,
 -1,
 -1,
 <strong>English</strong>,
 -1,
 -1,
 -1,
 <strong>日本語</strong>,
 -1,
 -1,
 -1,
 <strong>Español</strong>,
 -1,
 -1,
 -1,
 <strong>Deutsch</strong>,
 -1,
 -1,
 -1,
 <strong>Русский</strong>,
 -1,
 -1,
 -1,
 <strong>Français</strong>,
 -1,
 -1,
 -1,
 <strong>Italiano</strong>,
 -1,
 -1,
 -1,
 <strong>中文</strong>,
 -1,
 -1,
 -1,
 <strong>Português</strong>,
 -1,
 -1,
 -1,
 <strong>Polski</strong>,
 -1]
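
The `-1` entries most likely come from iterating over the tag itself: that loop visits every direct child, including newline strings and HTML comments. Those children are string objects, so `.find("strong")` on them is the ordinary `str.find`, which returns `-1` when the substring is absent. A minimal sketch of an alternative that asks for the `<strong>` and `<bdi>` tags directly, run on a shortened, hand-written copy of the markup shown above:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the "central-featured" block printed above.
html = """
<div class="central-featured">
<!-- #1. en.wikipedia.org -->
<div class="central-featured-lang lang1"><a><strong>English</strong>
<small><bdi dir="ltr">6\xa0137\xa0000+</bdi> <span>articles</span></small></a></div>
<!-- #2. ja.wikipedia.org -->
<div class="central-featured-lang lang2"><a><strong>日本語</strong>
<small><bdi dir="ltr">1\xa0222\xa0000+</bdi> <span>記事</span></small></a></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
container = soup.find(class_="central-featured")

# find_all searches all descendants and returns only real tags.
languages = [tag.get_text(strip=True) for tag in container.find_all("strong")]
# Replace the non-breaking spaces (\xa0) with ordinary spaces.
counts = [tag.get_text().replace("\xa0", " ") for tag in container.find_all("bdi")]

pairs = list(zip(languages, counts))
print(pairs)
```

Because both lists are built in document order, zipping them keeps each language next to its own article count.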

#### A list with the different kinds of datasets available in data.gov.uk 

In [258]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [287]:
#your code 
page= requests.get(url, headers={"user-agent":"Mozilla/5.0"})
content= page.content
soup = BeautifulSoup(page.content, 'html.parser')

In [264]:
#Find the available datasets
class_data=soup.find_all("a",class_="govuk-link")

In [266]:
#Get the name of one available dataset, to apply the same step to all of them later
class_data[0].text

'cookies to collect information'

In [267]:
#find all
databases=[]
for info in class_data:
    database=info.text
    databases.append(database)

In [268]:
databases

['cookies to collect information',
 'change your cookie settings',
 'feedback',
 'Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport']

#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [273]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [282]:
#your code
page= requests.get(url, headers={"user-agent":"Mozilla/5.0"})
content= page.content
soup = BeautifulSoup(page.content, 'html.parser')


In [278]:
#Find 1 language 
class_languages=soup.select("tbody tr td a")

In [281]:
class_languages[0].text

'Mandarin Chinese'

In [283]:
#find all languages
languages=[]
for info in class_languages:
    language=info.text
    languages.append(language)

In [286]:
languages[:10]

['Mandarin Chinese',
 'Sino-Tibetan',
 'Sinitic',
 'Spanish',
 'Indo-European',
 'Romance',
 'English',
 'Indo-European',
 'Germanic',
 'Hindi']
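
The first ten entries mix language names with family and branch names, because the selector `tbody tr td a` matches every link in every cell. One way around this is to go row by row and read a single column. The sketch below runs on a simplified, hand-written table; its column order (rank, language, family, branch) is an assumption and should be checked against the real page in DevTools:

```python
from bs4 import BeautifulSoup

# Simplified stand-in table; the real page has more columns and rows.
sample_html = """
<table class="wikitable">
<tbody>
<tr><th>Rank</th><th>Language</th><th>Family</th><th>Branch</th></tr>
<tr><td>1</td><td><a>Mandarin Chinese</a></td><td><a>Sino-Tibetan</a></td><td><a>Sinitic</a></td></tr>
<tr><td>2</td><td><a>Spanish</a></td><td><a>Indo-European</a></td><td><a>Romance</a></td></tr>
<tr><td>3</td><td><a>English</a></td><td><a>Indo-European</a></td><td><a>Germanic</a></td></tr>
</tbody>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

language_names = []
for row in soup.select("table.wikitable tbody tr"):
    cells = row.find_all("td")
    if len(cells) < 2:
        continue  # header rows contain <th> cells only
    # Column index 1 is assumed to hold the language name.
    language_names.append(cells[1].get_text(strip=True))

print(language_names[:10])
```

The resulting list can then be passed to `pd.DataFrame` together with any other columns collected the same way.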

### BONUS QUESTIONS

#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [200]:
# your code

#### IMDB's Top 250 data (movie name, initial release, director name and stars) as a pandas dataframe

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [None]:
# your code

#### Movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [None]:
#your code

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city:')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

In [None]:
# your code
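
A possible starting point for this exercise, as a sketch only. It builds the query string with `urlencode`, so city names containing spaces are escaped, and pulls the four requested values out of a decoded JSON response. The function names are hypothetical, the sample payload is made up, and the field names (`main.temp`, `wind.speed`, `weather[0].main`, `weather[0].description`) should be confirmed against the openweathermap.org/current page referenced above:

```python
import json
from urllib.parse import urlencode

def build_weather_url(city, api_key):
    """Build the current-weather URL with a properly escaped query string."""
    query = urlencode({"q": city, "APPID": api_key, "units": "metric"})
    return "http://api.openweathermap.org/data/2.5/weather?" + query

def summarize_weather(payload):
    """Pick the requested values out of a decoded JSON response."""
    return {
        "temperature": payload["main"]["temp"],
        "wind_speed": payload["wind"]["speed"],
        "weather": payload["weather"][0]["main"],
        "description": payload["weather"][0]["description"],
    }

# Made-up payload in the assumed response layout, not real weather data.
sample = json.loads(
    '{"main": {"temp": 21.5}, "wind": {"speed": 3.6}, '
    '"weather": [{"main": "Clouds", "description": "scattered clouds"}]}'
)
report = summarize_weather(sample)
print(build_weather_url("New York", "YOUR_API_KEY"))
print(report)
```

In the notebook, the payload would come from `requests.get(url).json()` rather than from a hard-coded string.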

#### Book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
#your code