# Web Scraping Lab

You will find in this notebook some scrapy exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended contennt.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each url and take a look at its source through Chrome DevTools. You'll need to identify the html tags, special class names etc. used for the html content you are expected to extract.

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`,  `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries feel free to uncomment them.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
# from urllib.request import urlopen
# import random
import re
# import scrapy
import pandas as pd


def get_BF_from_URL(url):
    rs = requests.get(url)
    data = rs.text
    return BeautifulSoup(data,"html.parser")

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
#your code
res = requests.get(url)
res = res.text
data = BeautifulSoup(res,"html.parser")
print(data.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-6b8b7859c4b8fbe3ab45f8ab0905a9f8.css" integrity="sha512-aVn2DoCuXdXX9G3sp/Luupl/Ui00/iXrUh7Ke3geLlkigQY8GHBky7kKRSuyeKxGApWDdCQUy+6gTF1ZmYHWkw==" media="all" rel="stylesheet">
   <link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-b046b27487428b94fc20941868838997.css" integrity="sha512-Myp6HIV6Q

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```

In [4]:
#your code
#We obtain names of developers
data_names = data.find_all("div", {"class": "col-md-6"})
a = [ele.find("h1",{"class" : "h3 lh-condensed"}) for ele in data_names if ele.find("h1",{"class" : "h3 lh-condensed"})!= None]
name = [ele.find("a", href=True).text for ele in a]

In [5]:
#We obtain subname of developers
data_subnames = data.find_all("p", {"class": "f4 text-normal mb-1"})
sub_name = [ele.find("a",{"class" : "link-gray"}).text for ele in data_subnames]

In [6]:
#We create list with both name
ls_deve = []
for i in range(len(name)): 
    ls_deve.append("{} ({})".format(name[i],sub_name[i]))

In [7]:
ls_deve

['Francois Zaninotto (fzaninotto)',
 'Sheng Chen (jdneo)',
 'Daniel Agar (dagar)',
 'Micah Lee (micahflee)',
 'Sean McArthur (seanmonstar)',
 'Caleb Porzio (calebporzio)',
 'Raphaël Benitte (plouc)',
 'Eliza Weisman (hawkw)',
 'James Montemagno (jamesmontemagno)',
 'Franck Nijhof (frenck)',
 'Ilya Dmitrichenko (errordeveloper)',
 'Jon Shier (jshier)',
 'Kyle Fuller (kylef)',
 'Paulus Schoutsen (balloob)',
 'Ahmet Alp Balkan (ahmetb)',
 'Alex Crichton (alexcrichton)',
 'Nikita Prokopov (tonsky)',
 '二货机器人 (zombieJ)',
 'Alex Ellis (alexellis)',
 '迷渡 (justjavac)',
 'Travis Tidwell (travist)',
 'Mitchell Hashimoto (mitchellh)',
 'Alex Gaynor (alex)',
 'Steven Loria (sloria)',
 'Sébastien Eustace (sdispater)']

#### Display the trending Python repositories in GitHub

The steps to solve this problem is similar to the previous one except that you need to find out the repository names instead of developer names.

In [8]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [9]:
#your code
#your code
res = requests.get(url)
res = res.text
data = BeautifulSoup(res,"html.parser")
data


<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-6b8b7859c4b8fbe3ab45f8ab0905a9f8.css" integrity="sha512-aVn2DoCuXdXX9G3sp/Luupl/Ui00/iXrUh7Ke3geLlkigQY8GHBky7kKRSuyeKxGApWDdCQUy+6gTF1ZmYHWkw==" media="all" rel="stylesheet">
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-b046b27487428b94fc20941868838997.css" integrity="sha512-Myp6HIV6QpUnPxl7XbmQTeyGZboMM

In [10]:
#your code
#We obtain names of developers
projects = data.find_all("h1", {"class": "h3 lh-condensed"})
a = [ele.find("a", href=True)["href"].strip("/").split("/") for ele in projects]
repos = ["{} ({})".format(ele[0],ele[1]) for ele in a]
repos

['testerSunshine (12306)',
 'Yorko (mlcourse.ai)',
 'iperov (DeepFaceLab)',
 'ckiplab (ckiptagger)',
 'tensorflow (neural-structured-learning)',
 'insilicomedicine (GENTRL)',
 'hudson-and-thames (mlfinlab)',
 'deepfakes (faceswap)',
 'CorentinJ (Real-Time-Voice-Cloning)',
 'deepinsight (insightface)',
 'zhaipro (easy12306)',
 'keras-team (keras)',
 'mne-tools (mne-python)',
 'pjialin (py12306)',
 'xadrianzetx (fullstack.ai)',
 'ansible (awx)',
 's3nh (pytorch-text-recognition)',
 'alexmojaki (heartrate)',
 'RaRe-Technologies (gensim)',
 'open-mmlab (mmdetection)',
 'tensorflow (models)',
 'Azure (azure-cli)',
 'nprapps (heat-income)',
 'zalandoresearch (flair)',
 'iGhibli (iOS-DeviceSupport)']

#### Display all the image links from Walt Disney wikipedia page

In [11]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [12]:
res = requests.get(url)
data = res.text
wd = BeautifulSoup(data,"html.parser")

In [13]:
images = wd.find_all("img",src = True)
[ele["src"] for ele in images]

['//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
 '//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 '//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/7/71/Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg/170px-Walt_

#### Retrieve an arbitary Wikipedia page of "Python" and create a list of links on that page

In [14]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python'

In [15]:
#your code
rs = requests.get(url)
data = rs.text
python = BeautifulSoup(data,"html.parser")
python

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Python - Wikipedia</title>
<script>document.documentElement.className=document.documentElement.className.replace(/(^|\s)client-nojs(\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Python","wgTitle":"Python","wgCurRevisionId":912426772,"wgRevisionId":912426772,"wgArticleId":46332325,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Disambiguation pages with short description","All article disambiguation pages","All disambiguation pages","Animal common name disambiguation pages","Disambiguation pages"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","Au

In [16]:
links = python.find_all("a", href=True)
[a.get("href") for a in links]

['#mw-head',
 '#p-search',
 'https://en.wiktionary.org/wiki/Python',
 'https://en.wiktionary.org/wiki/python',
 '#Snakes',
 '#Ancient_Greece',
 '#Media_and_entertainment',
 '#Computing',
 '#Engineering',
 '#Roller_coasters',
 '#Vehicles',
 '#Weaponry',
 '#People',
 '#Other_uses',
 '#See_also',
 '/w/index.php?title=Python&action=edit&section=1',
 '/wiki/Pythonidae',
 '/wiki/Python_(genus)',
 '/w/index.php?title=Python&action=edit&section=2',
 '/wiki/Python_(mythology)',
 '/wiki/Python_of_Aenus',
 '/wiki/Python_(painter)',
 '/wiki/Python_of_Byzantium',
 '/wiki/Python_of_Catana',
 '/w/index.php?title=Python&action=edit&section=3',
 '/wiki/Python_(film)',
 '/wiki/Pythons_2',
 '/wiki/Monty_Python',
 '/wiki/Python_(Monty)_Pictures',
 '/w/index.php?title=Python&action=edit&section=4',
 '/wiki/Python_(programming_language)',
 '/wiki/CPython',
 '/wiki/CMU_Common_Lisp',
 '/wiki/PERQ#PERQ_3',
 '/w/index.php?title=Python&action=edit&section=5',
 '/w/index.php?title=Python&action=edit&section=6',
 

#### Number of Titles that have changed in the United States Code since its last release point 

In [17]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [18]:
#your code


titles = get_BF_from_URL(url)

In [19]:
a = titles.find_all("div",{"class":"usctitlechanged"})
title_name = [ele.text.strip("\n").strip() for ele in a]
[re.findall("[0-9]+",ele)[0] for ele in title_name]

['5', '11', '20', '28', '38', '42']

#### A Python list with the top ten FBI's Most Wanted names 

In [20]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'
fbi = get_BF_from_URL(url)

In [21]:
a = fbi.find_all("h3",{"class":"title"})
[ele.text.strip("\n") for ele in a]

['YASER ABDEL SAID',
 'JASON DEREK BROWN',
 'ALEXIS FLORES',
 'EUGENE PALMER',
 'SANTIAGO VILLALBA MEDEROS',
 'RAFAEL CARO-QUINTERO',
 'ROBERT WILLIAM FISHER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ARNOLDO JIMENEZ',
 'ALEJANDRO ROSALES CASTILLO']

####  20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [22]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'
quake = get_BF_from_URL(url)

In [23]:
t_quake = quake.select("tr[class^=ligne]")

a = [ele.select("[class^=tabev]") for ele in t_quake]
quake_info= []
for earth_quake in a:
    info = []
    for ele in earth_quake:
        info.append(ele.text)
    quake_info.append(info)
quake_info_2 = []
for ele in quake_info:
    aux = []
    aux.append(re.findall("\d+\-\d+\-\d+",ele[3])[0])
    aux.append(re.findall("\d+\:\d+\:\d+.\d",ele[3])[0])
    aux.append(re.findall("\d+\.\d+",ele[4])[0] + " " + re.findall("[A-Z]",ele[5])[0])
    aux.append(re.findall("\d+\.\d+",ele[6])[0] + " " + re.findall("[A-Z]",ele[7])[0])
    quake_info_2.append(aux)
    
q_names = [ele.select("[class=tb_region]")[0].text.strip("\xa0") for ele in t_quake]

for i,qua in enumerate(quake_info_2):
    qua.append(q_names[i])
    
df = pd.DataFrame(quake_info_2,columns=["Date","Time","Lat","Long","Location"])
df.head(20)

Unnamed: 0,Date,Time,Lat,Long,Location
0,2019-09-05,16:44:09.0,1.09 N,125.46 E,MOLUCCA SEA
1,2019-09-05,16:41:54.0,16.85 N,95.17 W,"OAXACA, MEXICO"
2,2019-09-05,16:34:47.0,36.18 S,73.76 W,"OFFSHORE BIO-BIO, CHILE"
3,2019-09-05,16:06:03.1,57.59 S,66.35 W,DRAKE PASSAGE
4,2019-09-05,15:55:42.5,37.67 N,20.73 E,IONIAN SEA
5,2019-09-05,15:37:28.1,44.56 S,80.67 W,"OFF COAST OF AISEN, CHILE"
6,2019-09-05,15:31:05.1,36.22 N,118.17 W,CENTRAL CALIFORNIA
7,2019-09-05,15:27:28.4,43.90 N,127.45 W,OFF COAST OF OREGON
8,2019-09-05,15:20:07.4,19.57 N,156.03 W,"HAWAII REGION, HAWAII"
9,2019-09-05,15:09:04.4,37.82 N,57.39 E,NORTHEASTERN IRAN


#### Display the date, days, title, city, country of next 25 hackathon events as a Pandas dataframe table

In [17]:
# This is the url you will scrape in this exercise
url ='https://hackevents.co/hackathons'
hack = get_BF_from_URL(url)

In [20]:
name = []
for ele in hack.select("h5[class:card-title]"):
    name.append(ele.text)

dates = []
for ele in hack.select("div > p[class:card-text]"):
    print(ele)
    if re.search("\d+\/\d+\/\d+",ele.text):
        dates.append(re.findall("\d+\/\d+\/\d+",ele.text)[0])
df = pd.DataFrame(list(zip(name,dates)),columns=["Name", "Date"])
df

<p class="card-text"><i class="fas fa-calendar-alt"></i> 9/6/2019
</p>
<p class="card-text"><i class="fas fa-calendar-alt"></i> 1/31/2019
</p>
<p class="footerTitle"><img alt="Hackevents Logo" height="15" src="/img/logos/HACKEVENTS_white.svg" style="margin-top:8px;"/></p>
<p class="footerEntry">a product of Hackerbay</p>
<p class="footerEntry"><a href="/legal/imprint">Imprint</a></p>
<p class="footerEntry p9" style="padding-top:10px">© <script>document.write(new Date().getFullYear())</script> Singularity Technologies GmbH.<br/>All rights reserved.</p>
<p class="footerTitle">FOR ORGANIZERS</p>
<p class="footerEntry"><a href="/submit">Submit event</a></p>
<p class="footerTitle">FOR EU HACKERS</p>
<p class="footerEntry"><a href="/search/anything/Germany/anytime">Hackathons in Germany</a></p>
<p class="footerEntry"><a href="/search/anything/Berlin/anytime">Hackathons in Berlin</a></p>
<p class="footerEntry"><a href="/search/anything/France/anytime">Hackathons in France</a></p>
<p class="fo

Unnamed: 0,Name,Date
0,TECHFEST MUNICH,9/6/2019
1,Galileo App Competition,1/31/2019


#### Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account

In [4]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url

url = "https://twitter.com/cristiano"

found_tweets = True
while(found_tweets):
    try:
        url = input("Give me a twitter account: ")
        if url=="exit":
            break
        data = get_BF_from_URL(url)
        tweets_n = data.select("span.ProfileNav-value")[0].text.strip()
        print("The account has a total of {} tweets.".format(tweets_n))
        if tweets_n != None:
            found_tweets = False
    except Exception as e:
        print("Not a valid account",Exception, e)

Give me a twitter account: "https://twitter.com/depemore
Not a valid account <class 'Exception'> No connection adapters were found for '"https://twitter.com/depemore'
Give me a twitter account: https://twitter.com/depemore
The account has a total of 14,1 mil tweets.


#### Number of followers of a given twitter account

You will need to include a ***try/except block*** in case account/s name not found. 
<br>***Hint:*** the program should count the followers for any provided account

In [12]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = "https://twitter.com/cristiano"

found_follow = True
while(found_follow):
    try:
        url = input("Give me a twitter account: ")
        if url=="exit":
            break
        data = get_BF_from_URL(url)
        follower_n = data.select("span.ProfileNav-value")[2].text.strip()
        print("The account has a total of {} followers.".format(follower_n))
        if follower_n != None:
            found_follow = False
    except Exception as e:
        print("Not a valid account",Exception, e)

Give me a twitter account: https://twitter.com/imu59
The account has a total of 926 followers.


In [9]:
#your code

#### List all language names and number of related articles in the order they appear in wikipedia.org

In [46]:
# This is the url you will scrape in this exercise
url = "https://en.wikipedia.org/wiki/Main_Page"
data = get_BF_from_URL(url)

In [60]:
a = data.select("div[class] .interlanguage-link")
[e.text for e in a]

['العربية',
 'Български',
 'Bosanski',
 'Català',
 'Čeština',
 'Dansk',
 'Deutsch',
 'Eesti',
 'Ελληνικά',
 'Español',
 'Esperanto',
 'Euskara',
 'فارسی',
 'Français',
 'Galego',
 '한국어',
 'Hrvatski',
 'Bahasa Indonesia',
 'Italiano',
 'עברית',
 'ქართული',
 'Latviešu',
 'Lietuvių',
 'Magyar',
 'Bahasa Melayu',
 'Nederlands',
 '日本語',
 'Norsk',
 'Norsk nynorsk',
 'Polski',
 'Português',
 'Română',
 'Русский',
 'Simple English',
 'Slovenčina',
 'Slovenščina',
 'Српски / srpski',
 'Srpskohrvatski / српскохрватски',
 'Suomi',
 'Svenska',
 'ไทย',
 'Türkçe',
 'Українська',
 'Tiếng Việt',
 '中文']

#### A list with the different kind of datasets available in data.gov.uk 

In [61]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'
data = get_BF_from_URL(url)

In [74]:
#your code 
a = data.select("a[href^=/se]")
[ele.text for ele in a]

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport']

#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [75]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
data = get_BF_from_URL(url)

In [167]:
#your code
bs = data.select("table[class^=wikitable]")[0]

"""
table= []
for e in bs.findAll("tr")[1:]:
    row = []
    for ele in e.findAll("td"):
        row.append(ele.text.strip("\n"))
    table.append(row)
df=pd.DataFrame(table,columns=["Index","Language", "Location", "Countries", "Speakers", "% world Popu", "Lang Familly"])
df.head(10)
"""

df = HTMLtable_to_df(bs)

df.head(10)
    

Unnamed: 0,Rank,Language,Primary Country,TotalCountries[a],Speakers(millions),% of the World population (March 2019)[7],Language familyBranch
0,1,Mandarin (language family)[8],China,13,918.0,11.922,Sino-TibetanSinitic
1,2,Spanish,Spain,31,460.0,5.974,Indo-EuropeanRomance
2,3,English,United Kingdom,137,379.0,4.922,Indo-EuropeanGermanic
3,4,Hindi [9],India,4,341.0,4.429,Indo-EuropeanIndo-Aryan
4,5,Bengali,Bangladesh,4,228.0,2.961,Indo-EuropeanIndo-Aryan
5,6,Portuguese,Portugal,15,221.0,2.87,Indo-EuropeanRomance
6,7,Russian,Russian Federation,19,154.0,2.0,Indo-EuropeanBalto-Slavic
7,8,Japanese,Japan,2,128.0,1.662,JaponicJapanese
8,9,Western Punjabi[10],Pakistan,2,92.7,1.204,Indo-EuropeanIndo-Aryan
9,10,Marathi,India,1,83.1,1.079,Indo-EuropeanIndo-Aryan


### BONUS QUESTIONS

#### Scrape a certain number of tweets of a given Twitter account.

In [193]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/cristiano'


In [8]:
bad_input =True
while(bad_input):
    try:
        url = input("Which account you want the tweets from?:")
        if url == "exit":
            break
        data = get_BF_from_URL(url)   
        tweets = data.select("div[class^=js-tweet-text-container]")
        res =[tweet.select("p")[0].text.split("pic.twitter")[0].strip() for tweet in tweets]
        
        print("These are the last {} tweets:\n\n {}".format(len(res),"\n".join(res)))
        bad_input = False
    except:
        print("Wrong input. Again. Type exit to abort." , e)


Which account you want the tweets from?:https://twitter.com/depemore
These are the last 19 tweets:

 Lo mejor del fútbol es que siempre encuentras a alguien más idiota que túhttps://twitter.com/VictorMolina7/status/1165583247574786048 …
Vale señor ¿Va a querer bolsa o no?https://twitter.com/PeioHR/status/1161948506367692800 …
Una mirada a las rutinas diarias de grandes personajes creativos de la historia. Vía @VisualCap
Gente con perrazos de 40 Kg en pisos de 60 metros cuadrados, que sacan 2 veces de 20 minutos al día a que hagan pipi y popo, diciendo que es una barbaridad que hayan animales en los zoos, el musical.
#easyjet beats @Ryanair to have backless seats. @IATA @EASA this is flight 2021 Luton to Geneva. How can this be allowed. @GeneveAeroport @easyJet_press @easyJet
Diputados presumiendo de vino regalado.

Hostia, es que vaya panorama.https://twitter.com/MarcosdeQuinto/status/1158666632430063618 …
Los habitantes de Europa advierten que el día 1 de enero será otro año nuevo.htt

#### IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe

In [9]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'
data = get_BF_from_URL(url)

In [25]:
def HTMLtable_to_df(soup_table):
    table= []
    for row in soup_table.findAll("tr")[1:]:
        rows = []
        for ele in row.findAll("td"):
            rows.append(ele.text.strip("\n"))
        table.append(rows)
    headers = []
    for header in soup_table.findAll("th"):
        headers.append(header.text)
    df = pd.DataFrame(table,columns=headers)
    return df

def HTMLtable_to_df_IDMB(soup_table):
    table= []
    for row in soup_table.findAll("tr")[1:]:
        rows = []
        for ele in row.findAll("td"):
            rows.append(re.sub("\n"," ",ele.text.strip("\n")))
        table.append(rows)
    headers = []
    for header in soup_table.findAll("th"):
        headers.append(header.text)
    df = pd.DataFrame(table,columns=headers)
    df.drop(columns=["Your Rating"],inplace=True)
    return df

In [109]:
# your code
num_movies = 10
table = data.select("table[data-caller-name=chart-top250movie]")[0]
df = HTMLtable_to_df_IDMB(table)
df = df.head(num_movies)

In [113]:
basic_url = 'https://www.imdb.com'
webs = [basic_url + e["href"] for e in table.select("a[href^=/title]")[::2]]
directors = []
date = []
for web in webs[:num_movies]:
    data1 = get_BF_from_URL(web)
    directors.append(data1.select("h4[class=inline] + a")[0].text)
    data11=data1.select("#titleDetails")[0]
    date.append("".join(re.findall("\d+\s\w+\s\d+",data11.select("div:nth-of-type(4)")[0].text.strip("\n, "))))

df["Director"] = directors
df["Release Date"] = date
df

Unnamed: 0,Unnamed: 1,Rank & Title,IMDb Rating,Unnamed: 4,Director,Release Date
0,,1. Cadena perpetua (1994),9.2,,Frank Darabont,24 February 1995
1,,2. El padrino (1972),9.1,,Francis Ford Coppola,20 October 1972
2,,3. El padrino: Parte II (1974),9.0,,Francis Ford Coppola,13 October 1975
3,,4. El caballero oscuro (2008),9.0,,Christopher Nolan,13 August 2008
4,,5. 12 hombres sin piedad (1957),8.9,,Sidney Lumet,10 April 1957
5,,6. La lista de Schindler (1993),8.9,,Steven Spielberg,4 March 1994
6,,7. El señor de los anillos: El ret...,8.9,,Peter Jackson,17 December 2003
7,,8. Pulp Fiction (1994),8.9,,Quentin Tarantino,13 January 1995
8,,"9. El bueno, el feo y el malo (1966)",8.8,,Sergio Leone,
9,,10. El club de la lucha (1999),8.8,,David Fincher,5 November 1999


In [124]:
df.groupby("Director").count()["Rank & Title"].sort_values(ascending=False)

Director
Francis Ford Coppola    2
Steven Spielberg        1
Sidney Lumet            1
Sergio Leone            1
Quentin Tarantino       1
Peter Jackson           1
Frank Darabont          1
David Fincher           1
Christopher Nolan       1
Name: Rank & Title, dtype: int64

#### Movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [None]:
#your code

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = city=input('Enter the city:')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

In [None]:
# your code

#### Book name,price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
#your code