# Web Scraping Lab

This notebook contains web scraping exercises to help you practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You will need to identify the HTML tags, class names, and other markers used in the content you are expected to extract.
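
The first tip can be sketched with the standard library alone. The helper name `describe_status` is hypothetical; in the exercises the code would typically come from `response.status_code` after a `requests.get` call.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a readable label for an HTTP status code and whether it signals success."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not part of the standard registry
        phrase = "Unknown"
    ok = 200 <= code < 300
    return f"{code} {phrase}", ok

for code in (200, 404, 503):
    label, ok = describe_status(code)
    print(label, "-> parse the page" if ok else "-> inspect the request before parsing")
```

Only a 2xx code suggests the body holds the page you asked for; anything else is worth checking before you start looking for tags.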

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to use additional libraries if you prefer.

In [4]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
web_address = 'https://github.com/trending/developers'
source = requests.get(web_address).text
soup = BeautifulSoup(source, 'lxml')

In [3]:
articles = soup.find_all('div', class_='Box')

In [4]:
# Drop empty lines and surrounding whitespace so the page text is readable
print('\n'.join(line.strip() for line in soup.get_text().splitlines() if line.strip()))

Trending  developers on GitHub today · GitHub
Skip to content
Sign up
Why GitHub?
Features →
Code review
Project management
Integrations
Actions
Packages
Security
Team management
Hosting
Customer stories →
Security →
Team
Enterprise
Explore
Explore GitHub →
Learn & contribute
Topics
Collections
Trending
Learning Lab
Open source guides
Connect with others
Events
Community forum
GitHub Education
Marketplace
Pricing
Plans →
Compare plans
Contact Sales
Nonprofit →
Education →
Search
All GitHub
↵
Jump to
↵
No suggested jump to results
Search
All GitHub
↵
Jump to

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. No name should contain HTML tags.

**Instructions:**

1. Find the HTML tag and class names used for the developer names. You can do this with Chrome DevTools.

1. Use BeautifulSoup to extract all the HTML elements that contain the developer names.

1. Use string manipulation techniques to remove whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Store the clean names in a list.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
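
The whitespace clean-up described in the instructions can be sketched as follows. The function names and the sample string are hypothetical; the real input would be the `.text` of each element found with BeautifulSoup.

```python
def clean_name(raw_text):
    """Remove every whitespace character (spaces, tabs, line breaks) from a string."""
    # str.split() with no arguments splits on any run of whitespace
    return "".join(raw_text.split())

def tidy_name(raw_text):
    """Collapse runs of whitespace to a single space instead of removing them."""
    return " ".join(raw_text.split())

raw = "\n      Charles-Axel Dein\n    "   # hypothetical element text
print(clean_name(raw))
print(tidy_name(raw))
```

`clean_name` reproduces the style of the expected output, where spaces inside a name disappear as well. `tidy_name` is an alternative if you would rather keep names readable.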

In [5]:
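for article in articles:
    # Each developer card sits in a 'col-md-6' div inside the trending box
    users = article.find_all('div', class_="col-md-6")
    for user in users:
        try:
            name = user.find('h1', class_="h3 lh-condensed").text
            username = user.find('p', class_="f4 text-normal mb-1").text
            print(name.strip(), ',', username.strip())
        except AttributeError:
            # find() returned None: this div is not a developer card, so skip it
            continue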

Thibault Duplessis , ornicar
Jake Archibald , jakearchibald
Stephen Roller , stephenroller
David Rodríguez , deivid-rodriguez
Shawn Tabrizi , shawntabrizi
Steven Allen , Stebalien
Taylor Otwell , taylorotwell
Sebastián Ramírez , tiangolo
Leo Di Donato , leodido
Felix Angelov , felangel
Agniva De Sarker , agnivade
Tyler Neely , spacejam
Klaus Post , klauspost
James Agnew , jamesagnew
Dustin L. Howett , DHowett
Anmol Sethi , nhooyr
Yiyi Wang , shd101wyy
Philipp Oppermann , phil-opp
Ben Manes , ben-manes
Stefan Prodan , stefanprodan
Dries Vints , driesvints
John Keiser , jkeiser
Carlos Alexandro Becker , caarlos0
Ines Montani , ines
Brian Flad , bflad


#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [6]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [7]:
articles = soup.find_all('div', class_='Box')

In [8]:
#articles

In [9]:
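repos = []
for article in articles:
    # Each trending repository is an <article class="Box-row"> element
    rows = article.find_all('article', class_="Box-row")
    for row in rows:
        try:
            repo = row.find('h1', class_="h3 lh-condensed").text
            repos.append(repo)
        except AttributeError:
            # No repository heading in this row, so skip it
            continue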

In [10]:
for r in repos:
    print(''.join(r.split()))


minimaxir/big-list-of-naughty-strings
rusty1s/pytorch_geometric
espnet/espnet
public-apis/public-apis
donnemartin/system-design-primer
ranjian0/building_tool
sherlock-project/sherlock
OpenMined/PySyft
open-mmlab/mmdetection
xillwillx/skiptracer
allenai/allennlp
explosion/spaCy
zylo117/Yet-Another-EfficientDet-Pytorch
renatoviolin/next_word_prediction
anandpawara/Real_Time_Image_Animation
pytorch/fairseq
lyhue1991/eat_tensorflow2_in_30_days
zhanghang1989/ResNeSt
Rapptz/discord.py
google-research/big_transfer
TachibanaYoshino/AnimeGAN
shengqiangzhang/examples-of-web-crawlers
hunglc007/tensorflow-yolov4-tflite
ianzhao05/textshot
bitcoinbook/bitcoinbook


#### Display all the image links from the Walt Disney Wikipedia page.

In [11]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [12]:
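# src=True skips any <img> tag without a src attribute, which would otherwise raise a KeyError
images = soup.find_all('img', src=True)
for image in images:
    print(image['src'])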

//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png
//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG
//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/7/71/Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg/170px-Walt_Disney_and_his_cartoon
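
The `src` values above are protocol-relative: they start with `//` and carry no scheme. A minimal sketch of turning them into complete URLs with the standard library follows. The helper name `absolute_url` is hypothetical.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"

def absolute_url(src, base=page_url):
    """Resolve a relative or protocol-relative link against the page URL."""
    return urljoin(base, src)

sample_src = "//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG"
print(absolute_url(sample_src))
```

The result borrows the `https` scheme from the page URL, so each link can be opened or downloaded directly.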

#### Retrieve an arbitrary Wikipedia page for "Python" and create a list of the links on that page.

In [13]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Python'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [14]:
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])

#mw-head
#p-search
https://en.wiktionary.org/wiki/Python
https://en.wiktionary.org/wiki/python
#Snakes
#Ancient_Greece
#Media_and_entertainment
#Computing
#Engineering
#Roller_coasters
#Vehicles
#Weaponry
#People
#Other_uses
#See_also
/w/index.php?title=Python&action=edit&section=1
/wiki/Pythonidae
/wiki/Python_(genus)
/w/index.php?title=Python&action=edit&section=2
/wiki/Python_(mythology)
/wiki/Python_of_Aenus
/wiki/Python_(painter)
/wiki/Python_of_Byzantium
/wiki/Python_of_Catana
/w/index.php?title=Python&action=edit&section=3
/wiki/Python_(film)
/wiki/Pythons_2
/wiki/Monty_Python
/wiki/Python_(Monty)_Pictures
/w/index.php?title=Python&action=edit&section=4
/wiki/Python_(programming_language)
/wiki/CPython
/wiki/CMU_Common_Lisp
/wiki/PERQ#PERQ_3
/w/index.php?title=Python&action=edit&section=5
/w/index.php?title=Python&action=edit&section=6
/wiki/Python_(Busch_Gardens_Tampa_Bay)
/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)
/wiki/Python_(Efteling)
/w/index.php?title=Python&action=edi
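
The raw list mixes in-page anchors, edit links, and article links. If only the links to other articles are of interest, one possible filter is sketched below. The function name `article_links` is hypothetical, and treating every `/wiki/` path as an article is a simplification.

```python
from urllib.parse import urljoin

def article_links(hrefs, base="https://en.wikipedia.org/wiki/Python"):
    """Keep only links to other articles, as absolute URLs without duplicates."""
    seen = []
    for href in hrefs:
        if not href.startswith("/wiki/"):
            continue  # skips '#anchors', edit links and external sites
        full = urljoin(base, href.split("#")[0])  # drop any in-page fragment
        if full not in seen:
            seen.append(full)
    return seen

sample = ["#mw-head", "/wiki/Pythonidae", "/w/index.php?title=Python&action=edit&section=1",
          "/wiki/PERQ#PERQ_3", "/wiki/Pythonidae", "https://en.wiktionary.org/wiki/Python"]
print(article_links(sample))
```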

#### Find the number of titles that have changed in the United States Code since its last release point.

In [15]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [16]:
article = soup.find_all('div', class_='usctitlechanged')

In [17]:
len(article)

2

#### Create a Python list with the names of the FBI's Ten Most Wanted.

In [18]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [19]:
articles = soup.find_all('li', class_='portal-type-person castle-grid-block-item')

In [20]:
# Strip the surrounding line breaks and collect the names in a list
names = [article.find('h3', class_='title').text.strip() for article in articles]
names

['RAFAEL CARO-QUINTERO',
 'ROBERT WILLIAM FISHER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'ARNOLDO JIMENEZ',
 'JASON DEREK BROWN',
 'YASER ABDEL SAID',
 'ALEXIS FLORES',
 'EUGENE PALMER',
 'SANTIAGO VILLALBA MEDEROS']



#### Display the info (date, time, latitude, longitude and region name) for the 20 latest earthquakes reported by the EMSC as a pandas dataframe.

In [5]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

article = soup.find_all('tbody')
table_rows = article[0].find_all('tr')

In [6]:
rowlist = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [cell.text for cell in cells]
    rowlist.append(row)
# Columns named 'dropN' hold icons or filler and are discarded later
columns = ["drop1", "drop2", "drop3", "Datetime", "Lat", "Lat Direction", "Long",
           "Long Direction", "Depth", "drop4", "Magnitude", "Region", "Last Comment"]
df = pd.DataFrame(rowlist, columns=columns)

In [7]:
df

,drop1,drop2,drop3,Datetime,Lat,Lat Direction,Long,Long Direction,Depth,drop4,Magnitude,Region,Last Comment
0,2.0,,F,earthquake2020-05-27 13:34:15.015min ago,2.15,S,79.83,W,52,M,4.1,NEAR COAST OF ECUADOR,2020-05-27 13:40
1,,,,earthquake2020-05-27 13:30:07.919min ago,38.17,N,117.82,W,1,ml,2.1,NEVADA,2020-05-27 13:34
2,,,,earthquake2020-05-27 13:24:28.724min ago,19.22,N,155.4,W,31,Ml,2.1,"ISLAND OF HAWAII, HAWAII",2020-05-27 13:30
3,,,,earthquake2020-05-27 12:48:26.01hr 00min ago,19.22,N,155.41,W,32,Md,2.0,"ISLAND OF HAWAII, HAWAII",2020-05-27 12:51
4,,,,earthquake2020-05-27 12:44:50.21hr 04min ago,38.31,N,117.88,W,12,ml,2.2,NEVADA,2020-05-27 12:52
5,,,,earthquake2020-05-27 12:42:06.11hr 07min ago,38.2,N,117.91,W,8,ML,2.7,NEVADA,2020-05-27 12:49
6,,,,earthquake2020-05-27 12:34:25.71hr 14min ago,37.79,N,20.79,E,2,ML,3.0,IONIAN SEA,2020-05-27 13:09
7,,,,earthquake2020-05-27 12:32:40.61hr 16min ago,19.24,N,155.41,W,35,ML,3.8,"ISLAND OF HAWAII, HAWAII",2020-05-27 12:43
8,,,,earthquake2020-05-27 12:31:37.71hr 17min ago,19.21,N,155.4,W,33,Md,2.2,"ISLAND OF HAWAII, HAWAII",2020-05-27 12:34
9,,,,earthquake2020-05-27 12:26:46.31hr 22min ago,46.22,N,7.55,E,6,ML,1.2,SWITZERLAND,2020-05-27 12:32


In [654]:
# Split the raw 'Datetime' text on whitespace: part 0 holds the date, part 1 the time
dfdate = df['Datetime'].str.split(expand=True)
df.drop('Datetime', axis=1, inplace=True)
dfdate.drop([2, 3], axis=1, inplace=True)
dfdate[1] = dfdate[1].apply(lambda x: x[:8])    # keep HH:MM:SS, drop the fraction and the "ago" text
dfdate[0] = dfdate[0].apply(lambda x: x[10:])   # drop the leading "earthquake" label
df = pd.concat([df, dfdate], axis=1)
df.drop(['drop1', 'drop2', 'drop3', 'drop4'], axis=1, inplace=True)
df.columns = ['Lat', 'Lat Direction', 'Long', 'Long Direction', 'Depth', 'Magnitude', 'Region', 'Last Comment', 'Date', 'Time']
df.head()

,Lat,Lat Direction,Long,Long Direction,Depth,Magnitude,Region,Last Comment,Date,Time
0,19.24,N,155.41,W,35,3.8,"ISLAND OF HAWAII, HAWAII",2020-05-27 12:43,2020-05-27,12:32:40
1,19.21,N,155.4,W,33,2.2,"ISLAND OF HAWAII, HAWAII",2020-05-27 12:34,2020-05-27,12:31:37
2,46.22,N,7.55,E,6,1.2,SWITZERLAND,2020-05-27 12:32,2020-05-27,12:26:46
3,17.93,N,66.84,W,11,2.6,PUERTO RICO REGION,2020-05-27 12:38,2020-05-27,12:24:06
4,2.84,S,79.16,W,74,3.6,NEAR COAST OF ECUADOR,2020-05-27 12:35,2020-05-27,12:21:56
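
Slicing by fixed positions works as long as the cell text keeps exactly the same layout. A regular expression is one alternative that does not depend on character offsets. The sketch below uses a cell value copied from the raw table above; the names `STAMP` and `split_datetime` are hypothetical.

```python
import re

# Matches 'YYYY-MM-DD', some whitespace, then 'HH:MM:SS'
STAMP = re.compile(r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})")

def split_datetime(cell_text):
    """Return (date, time) from a scraped 'Datetime' cell, or (None, None) if absent."""
    match = STAMP.search(cell_text)
    return match.groups() if match else (None, None)

print(split_datetime("earthquake2020-05-27 13:34:15.015min ago"))
```

With pandas, the same pattern could be passed to `Series.str.extract`, which returns one column per capture group.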


#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. Include a ***try/except block*** to handle account names that are not found.
<br>***Hint:*** the program should count the number of tweets for any provided account.

In [25]:
# Accept the handle with or without the leading '@'
handle = input('Input your account name on Twitter: ').strip().lstrip('@')
temp = requests.get('https://twitter.com/' + handle)
bs = BeautifulSoup(temp.text, 'lxml')

try:
    tweet_box = bs.find('li', {'class': 'ProfileNav-item ProfileNav-item--tweets is-active'})
    tweets = tweet_box.find('a').find('span', {'class': 'ProfileNav-value'})
    print(tweets.get('data-count'))
except AttributeError:
    # find() returned None, so the profile markup (and therefore the account) was not found
    print('Account name not found...')
  

Input your account name on Twitter: ferenc
2023


#### Count the number of followers of a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. Include a ***try/except block*** to handle account names that are not found.
<br>***Hint:*** the program should count the followers for any provided account.

In [26]:
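# Accept the handle with or without the leading '@'
handle = input('Input your account name on Twitter: ').strip().lstrip('@')
temp = requests.get('https://twitter.com/' + handle)
bs = BeautifulSoup(temp.text, 'lxml')

try:
    follower_box = bs.find('li', {'class': 'ProfileNav-item ProfileNav-item--followers'})
    followers = follower_box.find('a').find('span', {'class': 'ProfileNav-value'})
    print(followers.get('data-count'))
except AttributeError:
    # find() returned None, so the profile markup (and therefore the account) was not found
    print('Account name not found...')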
  

Input your account name on Twitter: ferenc
334


#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [27]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [28]:
languages = soup.find_all('div', class_="central-featured-lang")

In [29]:
for language in languages:
    print(language.strong.text, language.bdi.text)
    
# The languages are arranged in a circle on the page, so there is no strict order

English 6 085 000+
Español 1 601 000+
日本語 1 208 000+
Deutsch 2 436 000+
Русский 1 629 000+
Français 2 219 000+
Italiano 1 609 000+
中文 1 121 000+
Português 1 033 000+
Polski 1 411 000+
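
Accented and non-Latin names can come out garbled (for example `EspaÃ±ol` for `Español`) when UTF-8 bytes are decoded with the wrong encoding. The sketch below shows the mechanism and a possible repair for already-garbled strings. The function name `repair_mojibake` is hypothetical, and it assumes the wrong encoding was Latin-1.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Latin-1 (ISO-8859-1)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by that particular mix-up; leave it alone
        return text

print(repair_mojibake("EspaÃ±ol"))
print(repair_mojibake("FranÃ§ais"))
print(repair_mojibake("Deutsch"))
```

When fetching with `requests`, it is usually simpler to avoid the problem at the source, either by setting `response.encoding = 'utf-8'` before reading `response.text` or by passing `response.content` (the raw bytes) to BeautifulSoup.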


#### Create a list of the different kinds of datasets available on data.gov.uk.

In [30]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [31]:
articles = soup.find('div', class_='grid-row dgu-topics')

In [32]:
types = articles.find_all('a', href=True)

In [33]:
types[1]

<a href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a>

In [34]:
import re

# Capture only the text between '>' and '<', leaving the tag markup out
pattern = re.compile(r'>(.*)<')
topics = [pattern.search(str(link)).group(1) for link in types]
topics

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport']


#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [35]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [36]:
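article = soup.find_all('tbody')
table_rows = article[0].find_all('tr')
rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [cell.text for cell in cells]   # the header row has no <td> cells, so it yields an empty row
    rows.append(row)
columns = ['Rank', 'Language', 'Speakers', '% of world pop', 'lang_family']
# The empty header row ends up at index 0, so head(11) is used instead of head(10)
pd.DataFrame(rows, columns=columns).head(11)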

,Rank,Language,Speakers,% of world pop,lang_family
0,,,,,
1,1\n,Mandarin Chinese\n,918\n,11.922\n,Sino-TibetanSinitic\n
2,2\n,Spanish\n,480\n,5.994\n,Indo-EuropeanRomance\n
3,3\n,English\n,379\n,4.922\n,Indo-EuropeanGermanic\n
4,4\n,Hindi (Sanskritised Hindustani)[9]\n,341\n,4.429\n,Indo-EuropeanIndo-Aryan\n
5,5\n,Bengali\n,228\n,2.961\n,Indo-EuropeanIndo-Aryan\n
6,6\n,Portuguese\n,221\n,2.870\n,Indo-EuropeanRomance\n
7,7\n,Russian\n,154\n,2.000\n,Indo-EuropeanBalto-Slavic\n
8,8\n,Japanese\n,128\n,1.662\n,JaponicJapanese\n
9,9\n,Western Punjabi[10]\n,92.7\n,1.204\n,Indo-EuropeanIndo-Aryan\n
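
The cells above still carry trailing `\n` characters and footnote markers such as `[9]`. A small clean-up step is sketched here on one row copied from the table. The names `FOOTNOTE` and `clean_cell` are hypothetical.

```python
import re

FOOTNOTE = re.compile(r"\[\d+\]")   # reference markers such as [9] or [10]

def clean_cell(text):
    """Strip footnote markers and surrounding whitespace from a table cell."""
    return FOOTNOTE.sub("", text).strip()

row = ["4\n", "Hindi (Sanskritised Hindustani)[9]\n", "341\n", "4.429\n"]
cleaned = [clean_cell(cell) for cell in row]
print(cleaned)

# Once the text is clean, numeric columns can be converted
speakers = float(cleaned[2])
```

Applied to every cell before building the dataframe, this would leave values that convert cleanly to numbers.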


## Bonus
#### Scrape a certain number of tweets of a given Twitter account.

In [37]:
# This is the url you will scrape in this exercise
# You will need to add the account credentials to this url
handle = input('Input your account name on Twitter: ')
ctr = int(input('Input number of tweets to scrape: '))
res = requests.get('https://twitter.com/' + handle)
bs = BeautifulSoup(res.content, 'lxml')
all_tweets = bs.find_all('div', {'class': 'tweet'})
if all_tweets:
    for tweet in all_tweets[:ctr]:
        context = tweet.find('div', {'class': 'context'}).text.replace("\n", " ").strip()
        content = tweet.find('div', {'class': 'content'})
        header = content.find('div', {'class': 'stream-item-header'})
        user = header.find('a', {'class': 'account-group js-account-group js-action-profile js-user-profile-link js-nav'}).text.replace("\n", " ").strip()
        timestamp = header.find('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'}).find('span').text.replace("\n", " ").strip()
        message = content.find('div', {'class': 'js-tweet-text-container'}).text.replace("\n", " ").strip()
        footer = content.find('div', {'class': 'stream-item-footer'})
        stat = footer.find('div', {'class': 'ProfileTweet-actionCountList u-hiddenVisually'}).text.replace("\n", " ").strip()
        if context:
            print(context)
        print(user, timestamp)
        print(message)
        print(stat)
        print()
else:
    print("List is empty/account name not found.")
  

Input your account name on Twitter: ferenc
Input number of tweets to scrape: 50
Ferenc‏ @ferenc 26 jul. 2018
http://Booking.com  - #UX Case Study https://buff.ly/2L6jovz pic.twitter.com/Q1WsiCKzq4
0 antwoorden     0 retweets     0 vind-ik-leuks

Ferenc‏ @ferenc 26 jul. 2018
Top 10 Elon Musk Productivity Secrets for Insane Success https://buff.ly/2JTsZ3E pic.twitter.com/AVeGqtXY2A
0 antwoorden     1 retweet     0 vind-ik-leuks

Ferenc‏ @ferenc 25 jul. 2018
Simplify Life: What Can You Remove? https://buff.ly/2JED9Fd pic.twitter.com/BC62vemSA7
1 antwoord     0 retweets     0 vind-ik-leuks

Ferenc‏ @ferenc 17 jun. 2018
What It’s Really Like to Be an Entrepreneur (Are You Sure You Want to Be One?) https://buff.ly/2sRjigo pic.twitter.com/SDgtPyaLY9
0 antwoorden     0 retweets     0 vind-ik-leuks

Ferenc‏ @ferenc 13 jun. 2018
Digital manipulation: How platforms like Uber and Deliveroo exploit workers https://buff.ly/2Ml20jM pic.twitter.com/zqWy9NB60c
0 antwoorden     0 retweets     0 vind-ik-

#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [488]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
article = soup.find_all('tbody', class_='lister-list')

In [489]:
trow = article[0].find_all('tr')
list_df = []
for row in trow:
    tcol = row.find_all('td')
    raw_text = [tr.text for tr in tcol]
    list_df.append(raw_text)
    
df= pd.DataFrame(list_df, columns=['Poster', 'Title', 'IMDB Rating', 'Your rating', 'Watchlist'] )
df.drop(['Poster', 'Watchlist', 'Your rating'], axis=1, inplace=True)
df['IMDB Rating'] = df['IMDB Rating'].str.strip('\n')
df = pd.concat([df.drop('Title', axis=1), df['Title'].str.split('\n', expand=True)], axis=1)
df.columns = ['IMDB Rating', 'dr1', 'Ranking', 'Title', 'Year', 'dr2']
df.drop(['dr1','dr2'], axis=1, inplace=True)
df.set_index('Ranking', inplace=True)
df.head()

Ranking,IMDB Rating,Title,Year
1.0,9.2,The Shawshank Redemption,(1994)
2.0,9.1,The Godfather,(1972)
3.0,9.0,The Godfather: Part II,(1974)
4.0,9.0,The Dark Knight,(2008)
5.0,8.9,12 Angry Men,(1957)
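
The title cell packs the rank, the title, and the year into one block of text, which is why the code above splits it on `\n`. The same idea is sketched below as a small function that also converts the rank and year to integers. The function name and the sample string are hypothetical; the exact whitespace in the real cell may differ.

```python
def parse_title_cell(cell_text):
    """Split a title cell into (rank, title, year), ignoring blank lines."""
    parts = [p.strip() for p in cell_text.split("\n") if p.strip()]
    rank = int(parts[0].rstrip("."))   # '1.' -> 1
    title = parts[1]
    year = int(parts[2].strip("()"))   # '(1994)' -> 1994
    return rank, title, year

sample = "\n      1.\n      The Shawshank Redemption\n(1994)\n"   # hypothetical raw cell text
print(parse_title_cell(sample))
```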


In [490]:
# Collect the link to each movie's detail page from the title column
trows = article[0].find_all('td', class_='titleColumn')
links = [row.a['href'] for row in trows]
full_links = ['https://www.imdb.com' + link for link in links]

In [494]:
import re

all_stars = []
dirs = []
for link in full_links:
    # Request each movie page and read its credit summary block
    source = requests.get(link).text
    soup = BeautifulSoup(source, 'lxml')
    article = soup.find_all('div', class_='plot_summary')
    summary = article[0].find_all('div', class_='credit_summary_item')
    director = ' '.join(summary[0].text.split()[1:])   # drop the leading label word
    stars = ' '.join(summary[2].text.split()[1:])      # drop the leading label word
    stars = re.split(r'\|', stars)[0]                  # keep only the text before the '|' separator
    all_stars.append(stars)
    dirs.append(director)

# Add the columns once, after the loop, when both lists hold one entry per movie
df['Stars'] = all_stars
df['Director'] = dirs
df[['Star1', 'Star2', 'Star3']] = df['Stars'].str.split(',', expand=True)

In [512]:
df.head()

Ranking,IMDB Rating,Title,Year,Stars,Director,Star1,Star2,Star3
1.0,9.2,The Shawshank Redemption,(1994),"Tim Robbins, Morgan Freeman, Bob Gunton",Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton
2.0,9.1,The Godfather,(1972),"Marlon Brando, Al Pacino, James Caan",Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan
3.0,9.0,The Godfather: Part II,(1974),"Al Pacino, Robert De Niro, Robert Duvall",Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall
4.0,9.0,The Dark Knight,(2008),"Christian Bale, Heath Ledger, Aaron Eckhart",Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart
5.0,8.9,12 Angry Men,(1957),"Henry Fonda, Lee J. Cobb, Martin Balsam",Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam


#### Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [513]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')


In [None]:
# your code here
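
One possible starting point for the random selection is sketched below. It assumes a list of detail-page links similar to the one built in the previous exercise; the `links` values here are hypothetical stand-ins, and `pick_random` is not part of any library.

```python
import random

def pick_random(items, k=10, seed=None):
    """Return k distinct items chosen at random; a seed makes the choice repeatable."""
    rng = random.Random(seed)
    return rng.sample(items, k)

# Hypothetical stand-in for the 250 detail-page links collected from the chart
links = [f"https://www.imdb.com/title/example-{i}/" for i in range(1, 251)]
chosen = pick_random(links, k=10, seed=42)
print(len(chosen), len(set(chosen)))
```

Each chosen link could then be requested in turn to read the title, year, and summary from the movie page.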

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [514]:
#https://openweathermap.org/current
city = input('Enter the city: ')
api_key = 'YOUR_API_KEY'  # placeholder: supply your own OpenWeatherMap key instead of hard-coding a real one
url = 'http://api.openweathermap.org/data/2.5/weather?' + 'q=' + city + '&APPID=' + api_key + '&units=metric'

Enter the city: The Hague


In [516]:
# SITE NOT WORKING
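
A city name such as "The Hague" contains a space, so the query string is safer when it is URL-encoded rather than concatenated. The sketch below builds the URL with the standard library and shows how the requested fields might be read from a decoded JSON response. The function names, the `YOUR_API_KEY` placeholder, and the sample payload values are all hypothetical; the payload only mimics the field layout described in the OpenWeatherMap documentation linked in the cell above.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.openweathermap.org/data/2.5/weather"

def build_weather_url(city, api_key):
    """Build the query URL; urlencode escapes spaces, e.g. 'The Hague' -> 'The+Hague'."""
    query = urlencode({"q": city, "APPID": api_key, "units": "metric"})
    return f"{BASE_URL}?{query}"

def summarise(payload):
    """Pick the requested fields out of a decoded JSON response."""
    return {
        "temperature": payload["main"]["temp"],
        "wind_speed": payload["wind"]["speed"],
        "weather": payload["weather"][0]["main"],
        "description": payload["weather"][0]["description"],
    }

# Hypothetical payload with made-up values, used only to exercise summarise()
sample = {"main": {"temp": 14.2}, "wind": {"speed": 5.1},
          "weather": [{"main": "Clouds", "description": "broken clouds"}]}
print(build_weather_url("The Hague", "YOUR_API_KEY"))
print(summarise(sample))
```

In the notebook, the payload would come from `requests.get(url).json()` after checking the status code.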

#### Find the book name, price and stock availability as a pandas dataframe.

In [788]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'
# Pass the raw bytes so BeautifulSoup can detect the encoding itself;
# the '£' sign showed up as 'Â£' when the text was decoded with the wrong charset
source = requests.get(url).content
soup = BeautifulSoup(source, 'lxml')
articles = soup.find_all('ol', class_='row')
books = articles[0].find_all('a')

# Only the title links carry a 'title' attribute; the image links do not
booklist = []
for book in books:
    b = book.get('title')
    if b is not None:
        booklist.append(b)

pricelist = []
bookprices = articles[0].find_all('p', class_='price_color')
for price in bookprices:
    pricelist.append(price.text)

stockl = []
stocklist = articles[0].find_all('p', class_='instock availability')
for stock in stocklist:
    # Append the stock text itself (not the last price) and trim the surrounding whitespace
    stockl.append(stock.text.strip())

In [789]:
pd.DataFrame(list(zip(booklist, pricelist, stockl)), columns=['Title', 'Price', 'Availability'])

,Title,Price,Availability
0,A Light in the Attic,£51.77,In stock
1,Tipping the Velvet,£53.74,In stock
2,Soumission,£50.10,In stock
3,Sharp Objects,£47.82,In stock
4,Sapiens: A Brief History of Humankind,£54.23,In stock
5,The Requiem Red,£22.65,In stock
6,The Dirty Little Secrets of Getting Your Dream...,£33.34,In stock
7,The Coming Woman: A Novel Based on the Life of...,£17.93,In stock
8,The Boys in the Boat: Nine Americans and Their...,£22.60,In stock
9,The Black Maria,£52.15,In stock
