# Web Scraping Lab

This notebook contains web scraping exercises to help you practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each URL and inspect its source with Chrome DevTools. You'll need to identify the HTML tags, class names, and other markers used for the content you are expected to extract.
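The first two tips can be sketched as a small helper. This is a minimal sketch and not part of the original lab: `preview` is a hypothetical name, and the `SimpleNamespace` object is a stand-in for a real response so the snippet runs without network access. In the notebook you would pass the result of `requests.get(url)` instead.

```python
from types import SimpleNamespace


def preview(response, n_chars=200):
    """Return the first n_chars of the body, or raise if the status is not 200."""
    if response.status_code != 200:
        raise RuntimeError(f"Unexpected status code: {response.status_code}")
    return response.text[:n_chars]


# Stand-in object (an assumption) so the sketch runs offline.
fake_ok = SimpleNamespace(status_code=200, text="<html><body>Hello</body></html>")
print(preview(fake_ok, n_chars=12))
```

Failing early on an unexpected status code avoids parsing an error page as if it were the intended content.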

**Resources:**

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### The libraries and modules you may need are listed below. `requests`, `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries, feel free to uncomment them.

In [61]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import numpy as np
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
# from urllib.request import urlopen
import random
import re
# import scrapy
import json
# import GetOldTweets3 as got  # only needed for the commented-out Twitter cells

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [None]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [None]:
#your code

# read the web page and check the status code
r = requests.get(url)
r.status_code

In [None]:
# print the response text
print(r.text[:500])

In [None]:
# parsing the HTML using beautifulsoup
soup = bs(r.text, 'html.parser')
# soup

In [None]:
type(soup)

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to remove extra whitespace and line breaks (i.e. `\n`) from the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like this:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
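The instructions above can be sketched on a small inline document. This is a minimal sketch under stated assumptions: the markup below is hypothetical and only mimics the tag and class names used in this notebook, and `clean` is an illustrative helper. It collapses whitespace instead of deleting every space, and it walks each `article` separately so that a developer without a second line does not shift the pairing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure used in this notebook;
# the real page may differ.
html = """
<article>
  <h1 class="h3 lh-condensed"><a href="/alice">
      Alice   Example
  </a></h1>
  <p class="f4 text-normal mb-1"><a href="/alice">
      alice
  </a></p>
</article>
<article>
  <h1 class="h3 lh-condensed"><a href="/bob">
      bob
  </a></h1>
</article>
"""


def clean(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return " ".join(text.split())


soup = BeautifulSoup(html, "html.parser")
developers = []
for card in soup.find_all("article"):
    name = clean(card.find("h1", class_="h3").get_text())
    nickname_tag = card.find("p", class_="f4")
    if nickname_tag is not None:
        developers.append(f"{name} ({clean(nickname_tag.get_text())})")
    else:
        developers.append(name)

print(developers)
```

Pairing inside each card is a design choice: zipping two independently collected lists silently misaligns as soon as one entry is missing.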

In [None]:
# name = soup.find('div', class_='col-md-6').text
# name

In [None]:
#your code

# find a single name
name = soup.find('h1', class_='h3').text.replace('\n','').strip()
name

In [None]:
# find a single nickname
nickname = soup.find('p', class_='f4 text-normal mb-1').text.replace('\n','').replace(' ','')
nickname

In [None]:
# find all the names & nicknames

# for name in soup.find_all('h1', class_='h3'):
#     print(name.text.replace('\n','').strip())

# list comprehension NAMES
names = [name.text.replace('\n','').strip() for name in soup.find_all('h1', class_='h3')]
names

In [None]:
# list comprehension NICKNAMES
nicknames = [nickname.text.replace('\n','').replace(' ','') for nickname in soup.find_all('p', class_='f4 text-normal mb-1')]
nicknames

In [None]:
# final list of names 
list_names= []
for name, nickname in zip(names, nicknames):
    i = name + ' (' + nickname +')'
    # print(f'{name} ({nickname})')
    list_names.append(i)
print(list_names)

#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [None]:
# This is the url you will scrape in this exercise
url1 = 'https://github.com/trending/python?since=daily'

In [None]:
r = requests.get(url1)
r.status_code

In [None]:
print(r.text[:500])

In [None]:
soup = bs(r.text, 'html.parser')
# soup

In [None]:
type(soup)

In [None]:
# find a single repo (re.sub is needed here: str.replace does not interpret regex patterns)
repo = re.sub(r"^\W+", "", soup.find('h1', class_='h3 lh-condensed').find('a')['href'])
repo

In [None]:
# list comprehension REPOS
repos = [repo.find('a')['href'] for repo in soup.find_all('h1', class_='h3 lh-condensed')]
repos

In [None]:
# final list of repos
list_repos= []
for i in repos:
    i = re.sub(r"^\W+", "", i)
    list_repos.append(i)
print(list_repos)
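The difference between `str.replace` and `re.sub` matters when cleaning the `href` values. A minimal sketch follows, where the `href` value is a made-up sample:

```python
import re

href = "/octocat/hello-world"  # hypothetical href value, for illustration only

# str.replace searches for the literal characters ^\W+ and finds none,
# so the string comes back unchanged.
unchanged = href.replace(r"^\W+", "")

# re.sub interprets the pattern and strips the leading non-word characters.
stripped = re.sub(r"^\W+", "", href)

# When only leading slashes need to go, str.lstrip is enough.
also_stripped = href.lstrip("/")

print(unchanged, stripped, also_stripped)
```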

#### Display all the image links from Walt Disney wikipedia page

In [None]:
# This is the url you will scrape in this exercise
url2 = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [None]:
#your code
r = requests.get(url2)
r.status_code    

In [None]:
print(r.text[:500])

In [None]:
soup = bs(r.text, 'html.parser')
# soup

In [None]:
# find a table image
img_table = soup.find('table', class_='infobox biography vcard').find('a')['href'].replace(r'/wiki/File:','')
img_table

In [None]:
# imm = soup.find('a', attrs={'class':'image'}).find('img')['alt']
# imm

In [None]:
im = soup.find('a', attrs={'class':'image'}).find('img')['src']
im

In [None]:
# list comprehension IMAGES
# img + src (str.replace does not interpret regex patterns, so the raw src is kept
# here and the file name is extracted with re further down)
imgs = [img.find('img')['src'] for img in soup.find_all('div', class_='thumbinner')]
imgs

In [None]:
enlace = '//upload.wikimedia.org/wikipedia/commons/thumb/7/71/Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg/170px-Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg'

In [None]:
# regex to eliminate the link before the image
search_links = re.compile('//upload.*?px-.*?-?_?(.*)')

In [None]:
search_links.findall(enlace)

In [None]:
# to review: findall returns a list for every link, which is not ideal
for src in imgs:
    print(search_links.findall(src))

In [None]:
# for i in imgs:
#     print(type(i))

In [None]:
# list comprehension IMAGES
# imgs = [img.find('a')['href'].replace(r'/wiki/File:','') for img in soup.find_all('div', class_='thumbinner')]
# imgs

In [None]:
# sum table image + page images
tot_imgs = 1 + len(imgs)
tot_imgs

In [None]:
print(f"In Walt Disney's wikipedia page there are {tot_imgs} images")
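An alternative way to list every image link is to collect the `src` of all `<img>` tags and resolve them against the page URL. This is a sketch under stated assumptions: the HTML fragment is hypothetical and stands in for the downloaded page, and it is not the approach used in the cells above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"

# Hypothetical fragment standing in for the downloaded page.
html = """
<div class="thumbinner"><a class="image"><img src="//upload.wikimedia.org/a/170px-First.jpg"></a></div>
<img src="/static/images/logo.png">
<img alt="no source attribute">
"""

soup = BeautifulSoup(html, "html.parser")

# src=True skips <img> tags that have no src attribute.
# urljoin turns protocol-relative and site-relative paths into absolute URLs.
image_links = [urljoin(page_url, img["src"]) for img in soup.find_all("img", src=True)]
print(image_links)
```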

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page

In [None]:
# This is the url you will scrape in this exercise
url3 ='https://en.wikipedia.org/wiki/Python' 

In [None]:
#your code
r = requests.get(url3)
r.status_code  

In [None]:
print(r.text[:500])

In [None]:
soup = bs(r.text, 'html.parser')
# soup

In [None]:
link = soup.find('div', class_='mw-parser-output').find('li').find('a')['href']
link

In [None]:
# list comprehension LINKS
# links = [l.find('a')['href'] for l in soup.find_all('div', class_='mw-parser-output')]
# links
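One way to build the full list of links is to search for every `<a>` tag that carries an `href`. A minimal sketch on a hypothetical fragment (the `href` values are sample data, not taken from the live page):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that reuses the container class seen above.
html = """
<div class="mw-parser-output">
  <ul>
    <li><a href="/wiki/Pythonidae">Pythonidae</a></li>
    <li><a href="/wiki/Python_(programming_language)">Python</a></li>
    <li><a href="/wiki/Pythonidae">Pythonidae again</a></li>
    <li><a name="anchor-without-href">no link</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", class_="mw-parser-output")

# href=True keeps only the <a> tags that actually have an href attribute.
hrefs = [a["href"] for a in content.find_all("a", href=True)]

# dict.fromkeys removes duplicates while preserving the original order.
links = list(dict.fromkeys(hrefs))
print(links)
```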

#### Number of Titles that have changed in the United States Code since its last release point 

In [None]:
# This is the url you will scrape in this exercise
url4 = 'http://uscode.house.gov/download/download.shtml'

In [None]:
#your code
r = requests.get(url4)
r.status_code  

In [None]:
print(r.text[:500])

In [None]:
soup = bs(r.text, 'html.parser')
# soup

In [None]:
txt = soup.find('div', class_='usctitlechanged').text.replace('\n','').strip()
txt

In [None]:
# list comprehension: TITLES THAT HAVE CHANGED since the last release point
txts = [t.text.replace('\n','').strip() for t in soup.find_all('div', class_='usctitlechanged')]
txts
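The exercise asks for a number, so the last step is to count the matching elements. A minimal sketch on a hypothetical fragment (the title names are sample data), reusing the `usctitlechanged` class from the cells above:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one unchanged title and two changed ones.
html = """
<div class="usctitle">Title 1 - General Provisions</div>
<div class="usctitlechanged">
    Title 7 - Agriculture
</div>
<div class="usctitlechanged">
    Title 26 - Internal Revenue Code
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) trims the surrounding whitespace and line breaks.
changed = [div.get_text(strip=True) for div in soup.find_all("div", class_="usctitlechanged")]
print(len(changed), changed)
```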

#### A Python list with the names of the FBI's top ten Most Wanted

In [None]:
# This is the url you will scrape in this exercise
url5 = 'https://www.fbi.gov/wanted/topten'

In [None]:
#your code 
r = requests.get(url5)
r.status_code  

In [None]:
print(r.text[:500])

In [None]:
soup = bs(r.text, 'html.parser')
# soup

In [None]:
mwf = soup.find('h3', class_='title').text.replace('\n','')
mwf

In [None]:
# list comprehension FUGITIVES
mw_fugitives = [mwf.text.replace('\n','') for mwf in soup.find_all('h3', class_='title')]
mw_fugitives

In [None]:
len(mw_fugitives)

#### Info on the 20 latest earthquakes (date, time, latitude, longitude and region name) from the EMSC as a pandas dataframe

In [None]:
# This is the url you will scrape in this exercise
url6 = 'https://www.emsc-csem.org/Earthquake/'

In [None]:
#your code
r = requests.get(url6)
r.status_code  

In [None]:
print(r.text[:500])

In [None]:
soup = bs(r.text, 'html.parser')
# soup

In [None]:
tables = soup.find_all("table")
#  tables

In [None]:
table = tables[3]
tab_data = [[cell.text for cell in row.find_all(["th","td"])]
                        for row in table.find_all("tr")]
# tab_data

In [None]:
df = pd.DataFrame(tab_data)
df.head(20)

In [None]:
# use the row at index 1 as the headers, then drop it
df.columns = df.iloc[1,:]
df.drop(index=1,inplace=True)

In [None]:
# keep only rows with at least 10 non-null values
# (recent pandas versions reject passing both how= and thresh=)
df = df.dropna(thresh=10, axis=0)

In [None]:
df = df.reset_index(drop=True)
df.head(20)

#### Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** to handle account names that are not found.
<br>***Hint:*** the program should count the number of tweets for any provided account
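The try/except requirement can be sketched without touching Twitter at all. Everything here is an assumption made for illustration: `AccountNotFound`, `count_tweets`, `fake_fetch` and the sample timeline are hypothetical names and data, and a real solution would replace `fake_fetch` with whatever client it uses to download tweets.

```python
class AccountNotFound(Exception):
    """Raised by a fetcher when the requested account does not exist."""


def count_tweets(username, fetch_tweets):
    """Return the number of tweets for username, or None if the account is missing."""
    try:
        tweets = fetch_tweets(username)
    except AccountNotFound:
        print(f"Account '{username}' was not found.")
        return None
    return len(tweets)


# Stand-in data and fetcher (assumptions) so the sketch runs offline.
FAKE_TIMELINES = {"example_user": ["first tweet", "second tweet"]}


def fake_fetch(username):
    try:
        return FAKE_TIMELINES[username]
    except KeyError:
        raise AccountNotFound(username) from None


print(count_tweets("example_user", fake_fetch))
print(count_tweets("missing_user", fake_fetch))
```

Passing the fetcher in as an argument keeps the error handling independent of the scraping method.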

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url7 = 'https://twitter.com/NBA'

In [None]:
#your code
# r = requests.get(url7)
# r.status_code  

In [None]:
# print(r.text[:500])

In [None]:
# soup = bs(r.text, 'html.parser')
# soup

In [None]:
# install package to analyze twitter page
# pip install GetOldTweets3

In [None]:
# account = soup.find('div', class_ = 'css-1dbjc4n r-1habvwh')
# account

In [None]:
# username = 'NBA'
# count = 2000

# # Creation of query object
# tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
#                                         .setMaxTweets(count)

# # Creation of list that contains all tweets
# tweets = got.manager.TweetManager.getTweets(tweetCriteria)

# # Creating list of chosen tweet data
# user_tweets = [[tweet.date, tweet.text] for tweet in tweets]
# user_tweets

#### Number of followers of a given twitter account

You will need to include a ***try/except block*** in case the account name is not found.
<br>***Hint:*** the program should count the followers for any provided account

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url8 = 'https://twitter.com/'

In [None]:
#your code
# r = requests.get(url8)
# r.status_code  

In [None]:
# print(r.text[:500])

In [None]:
# soup = bs(r.text, 'html.parser')
# soup

#### List all language names and number of related articles in the order they appear in wikipedia.org

In [None]:
# This is the url you will scrape in this exercise
url9 = 'https://www.wikipedia.org/'

In [None]:
#your code
r = requests.get(url9)
r.status_code  

In [None]:
print(r.text[:500])

In [None]:
soup = bs(r.text, 'html.parser')
# soup

In [None]:
lang = soup.find('a', class_ = 'link-box').find('strong').text
lang

In [None]:
art = soup.find('a', class_ = 'link-box').find('small').text.replace('\xa0','.')
art

In [None]:
from langdetect import detect

In [None]:
wikipedia = [(wiki.find('strong').text, wiki.find('small').text.replace('\xa0','.')) for wiki in soup.find_all('a', class_ = 'link-box')]
wikipedia
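If the article counts are needed as numbers rather than text, the non-digit characters can be stripped with `re`. This is a sketch with made-up sample strings: the exact text inside each `<small>` tag is an assumption, and `article_count` is a hypothetical helper.

```python
import re

# Hypothetical (language, small-text) pairs; "\xa0" is the non-breaking space
# that the cells above replace with a dot.
samples = [
    ("English", "6\xa0183\xa0000+ articles"),
    ("Español", "1\xa0637\xa0000+ artículos"),
]


def article_count(text):
    """Keep only the digits of the text and convert them to an integer."""
    return int(re.sub(r"\D", "", text))


counts = [(language, article_count(text)) for language, text in samples]
print(counts)
```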

#### A list with the different kinds of datasets available in data.gov.uk

In [None]:
# This is the url you will scrape in this exercise
url10 = 'https://data.gov.uk/'

In [None]:
#your code 
r = requests.get(url10)
r.status_code  

In [None]:
print(r.text[:500])

In [None]:
soup = bs(r.text, 'html.parser')
# soup

In [None]:
data = soup.find('div', class_ = 'grid-row dgu-topics').find('a').text
data

In [None]:
dataset = [data.find('a', class_ = 'govuk-link').text for data in soup.find_all('div', class_ = 'column-one-third')]
dataset

#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [None]:
# This is the url you will scrape in this exercise
url11 = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [None]:
#your code
r = requests.get(url11)
r.status_code  

In [None]:
print(r.text[:500])

In [None]:
soup = bs(r.text, 'html.parser')
# soup

In [None]:
# find all the tables in the page
tables = soup.find_all("table")
# tables

In [None]:
# find the table I need to analyze
table = tables[0]
tab_data = [[cell.text.replace('\n','') for cell in row.find_all(["th","td"])]
                        for row in table.find_all("tr")]
# tab_data

In [None]:
df = pd.DataFrame(tab_data)
df.head(20)

In [None]:
df.columns = df.iloc[0,:]
df.drop(index=0,inplace=True)
df.head(10)
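The header promotion can also be done when the DataFrame is created, by passing the first scraped row as `columns`. A minimal sketch with placeholder rows (the values are invented and do not come from the Wikipedia table):

```python
import pandas as pd

# Placeholder rows in the same shape as tab_data: header first, then data.
rows = [
    ["Rank", "Language", "Speakers (millions)"],
    ["1", "Language A", "900"],
    ["2", "Language B", "500"],
    ["3", "Language C", "400"],
]

# The first row becomes the column names, the rest become the data.
df_sketch = pd.DataFrame(rows[1:], columns=rows[0])

# head(n) keeps the first n rows; here n=2 because the sample is tiny.
top = df_sketch.head(2)
print(top)
```

This avoids the separate "copy the row, then drop it" step and leaves the index starting at 0.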

### BONUS QUESTIONS

#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url12 = 'https://twitter.com/'

In [None]:
# your code
r = requests.get(url12)
r.status_code  

In [None]:
print(r.text[:500])

In [None]:
soup = bs(r.text, 'html.parser')
# soup

#### IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe

In [2]:
# This is the url you will scrape in this exercise 
url13 = 'https://www.imdb.com/chart/top'

In [3]:
# your code
r = requests.get(url13)
r.status_code  

200

In [4]:
print(r.text[:500])




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    
    
    

    
    
    

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">
            <style>
                body#styleguide-v2 {
                    background: no-repeat fixed center top #000;
                }
           


In [5]:
soup = bs(r.text, 'html.parser')
# soup

In [6]:
# find all the tables in the page
tables = soup.find_all("table")
# tables

In [77]:
# name of the movie
name = soup.find('td', class_ = 'titleColumn').find('a').text
name

'Le ali della libertà'

In [8]:
# director and stars
stars = soup.find('td', class_ = 'titleColumn').find('a')['title']
stars

'Frank Darabont (dir.), Tim Robbins, Morgan Freeman'

In [11]:
# release date
date = soup.find('td', class_ = 'titleColumn').find('span', class_ = 'secondaryInfo').text.replace('(','').replace(')','')
date

'1994'

In [31]:
# find the table I need to analyze
table = tables[0]
tab_data = [[(cell.find('a').text, cell.find('a')['title'],cell.find('span', class_ = 'secondaryInfo').text.replace('(','').replace(')','')) 
              for cell in row.find_all("td", class_ = 'titleColumn')]
                        for row in table.find_all("tr")]
# tab_data

In [42]:
# verify cell type
# for cell in tab_data:
#     print(type(cell))

In [50]:
df = pd.DataFrame(tab_data)
# splitting a list in a Pandas cell into multiple columns
df = df[0].apply(pd.Series)

In [51]:
# move the first row to the headers
df.columns = df.iloc[0,:]
df.drop(index=0,inplace=True)

In [52]:
# rename columns
df.columns = ['Name', 'Director and cast', 'Date release']
df

|     | Name | Director and cast | Date release |
|-----|------|-------------------|--------------|
| 1 | Le ali della libertà | Frank Darabont (dir.), Tim Robbins, Morgan Fre... | 1994 |
| 2 | Il padrino | Francis Ford Coppola (dir.), Marlon Brando, Al... | 1972 |
| 3 | Il padrino - Parte II | Francis Ford Coppola (dir.), Al Pacino, Robert... | 1974 |
| 4 | Il cavaliere oscuro | Christopher Nolan (dir.), Christian Bale, Heat... | 2008 |
| 5 | La parola ai giurati | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb | 1957 |
| ... | ... | ... | ... |
| 246 | La battaglia di Algeri | Gillo Pontecorvo (dir.), Brahim Hadjadj, Jean ... | 1966 |
| 247 | Il trono di sangue | Akira Kurosawa (dir.), Toshirô Mifune, Minoru ... | 1957 |
| 248 | Una luce dal passato | Ashutosh Gowariker (dir.), Shah Rukh Khan, Gay... | 2004 |
| 249 | Lagaan - C'era una volta in India | Ashutosh Gowariker (dir.), Aamir Khan, Raghuvi... | 2001 |
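The exercise asks for the director and the stars separately, while the `title` attribute holds them in one string. A minimal sketch of splitting it, assuming the first entry is always the director marked with `(dir.)` and that no name contains a comma; `split_credits` is a hypothetical helper:

```python
def split_credits(title_attr):
    """Split 'Director (dir.), Star 1, Star 2' into (director, [stars])."""
    people = [person.strip() for person in title_attr.split(",")]
    director = people[0].replace("(dir.)", "").strip()
    return director, people[1:]


# Sample value taken from the output shown earlier in this section.
director, stars = split_credits("Frank Darabont (dir.), Tim Robbins, Morgan Freeman")
print(director)
print(stars)
```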


#### Movie name, year and a brief summary of 10 random movies from the IMDB top chart as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url14 = 'http://www.imdb.com/chart/top'

In [None]:
#your code
r = requests.get(url14)
r.status_code  

In [None]:
print(r.text[:500])

In [None]:
soup = bs(r.text, 'html.parser')
# soup

In [53]:
# I couldn't find the movie summaries on this page!

In [72]:
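# table with name and year of the movies (10 random ones are sampled below)
# look the table up again, because soup was re-parsed in this section
table1 = soup.find_all("table")[0]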
tab_data1 = [[(cell.find('a').text, cell.find('span', class_ = 'secondaryInfo').text.replace('(','').replace(')',''))
              for cell in row.find_all("td", class_ = 'titleColumn')]
                        for row in table1.find_all("tr")]
# tab_data1

In [80]:
df = pd.DataFrame(tab_data1)
# randomly select rows from DF (10 items)
df = df.sample(10) 
df

|     | 0 |
|-----|---|
| 29 | (Il miglio verde, 1999) |
| 142 | (Il labirinto del fauno, 2006) |
| 156 | (V per Vendetta, 2005) |
| 95 | (Dangal, 2016) |
| 98 | (Quarto potere, 1941) |
| 110 | (Toy Story 3 - La grande fuga, 2010) |
| 26 | (Salvate il soldato Ryan, 1998) |
| 107 | (Taxi Driver, 1976) |
| 24 | (La vita è meravigliosa, 1946) |
| 186 | (Nel nome del padre, 1993) |


In [82]:
# splitting a list in a Pandas cell into multiple columns
df = df[0].apply(pd.Series)

In [83]:
# rename columns
df.columns = ['Name', 'Date release']
df

|     | Name | Date release |
|-----|------|--------------|
| 29 | Il miglio verde | 1999 |
| 142 | Il labirinto del fauno | 2006 |
| 156 | V per Vendetta | 2005 |
| 95 | Dangal | 2016 |
| 98 | Quarto potere | 1941 |
| 110 | Toy Story 3 - La grande fuga | 2010 |
| 26 | Salvate il soldato Ryan | 1998 |
| 107 | Taxi Driver | 1976 |
| 24 | La vita è meravigliosa | 1946 |
| 186 | Nel nome del padre | 1993 |


#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [85]:
# API documentation: https://openweathermap.org/current
city = input('Enter the city:')
url15 = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

Enter the city: Barcelona


In [86]:
# your code
r = requests.get(url15)
r.status_code 

200

In [89]:
print(r.text)

{"coord":{"lon":2.16,"lat":41.39},"weather":[{"id":803,"main":"Clouds","description":"broken clouds","icon":"04d"}],"base":"stations","main":{"temp":31.49,"feels_like":36.12,"temp_min":28,"temp_max":34,"pressure":1014,"humidity":78},"visibility":10000,"wind":{"speed":4.6,"deg":190},"clouds":{"all":70},"dt":1597057845,"sys":{"type":1,"id":6398,"country":"ES","sunrise":1597035340,"sunset":1597085868},"timezone":7200,"id":3128760,"name":"Barcelona","cod":200}


In [91]:
soup = bs(r.text, 'html.parser')
type(soup)

bs4.BeautifulSoup
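The response above is JSON rather than HTML, so it can be read with the `json` module (or with `r.json()` on the response object) instead of BeautifulSoup. A minimal sketch that pulls out the four requested fields, using the response text printed above as a literal so it runs offline; the `report` dictionary and its key names are illustrative choices:

```python
import json

# Response text copied from the output above.
payload = '{"coord":{"lon":2.16,"lat":41.39},"weather":[{"id":803,"main":"Clouds","description":"broken clouds","icon":"04d"}],"base":"stations","main":{"temp":31.49,"feels_like":36.12,"temp_min":28,"temp_max":34,"pressure":1014,"humidity":78},"visibility":10000,"wind":{"speed":4.6,"deg":190},"clouds":{"all":70},"dt":1597057845,"sys":{"type":1,"id":6398,"country":"ES","sunrise":1597035340,"sunset":1597085868},"timezone":7200,"id":3128760,"name":"Barcelona","cod":200}'

data = json.loads(payload)

report = {
    "city": data["name"],
    "temperature": data["main"]["temp"],  # Celsius, because units=metric was requested
    "wind_speed": data["wind"]["speed"],
    "weather": data["weather"][0]["main"],
    "description": data["weather"][0]["description"],
}
print(report)
```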

#### Book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url16 = 'http://books.toscrape.com/'

In [None]:
#your code
r = requests.get(url16)
r.status_code  

In [None]:
r.encoding

In [None]:
# change the encoding to eliminate the special character in price output
r.encoding = 'utf-8'
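The special character appears when UTF-8 bytes are decoded with a single-byte encoding such as ISO-8859-1, which `requests` may fall back to when the server does not declare a charset. A minimal sketch of the effect, using a made-up sample price:

```python
# Hypothetical price string, encoded the way a UTF-8 page would send it.
raw = "£51.77".encode("utf-8")

# Decoding with the wrong encoding turns the two bytes of "£" into two characters.
wrong = raw.decode("ISO-8859-1")

# Decoding with UTF-8 restores the original text.
right = raw.decode("utf-8")

print(repr(wrong), repr(right))
```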

In [None]:
print(r.text[:500])

In [None]:
soup = bs(r.text, 'html.parser')
# soup

In [None]:
# find all the tables in the page
tables = soup.find_all("table")
# tables

In [None]:
# name
name = soup.find('article', class_ = 'product_pod').find('img')['alt']
name

In [None]:
# price
price = soup.find('article', class_ = 'product_pod').find('div', class_ = 'product_price').find('p', class_ = 'price_color').text
price

In [None]:
# stock availability
stock = soup.find('article', class_ = 'product_pod').find('div', class_ = 'product_price').find('p', class_ = 'instock availability').text.replace('\n','').strip()
stock

In [None]:
books = [
    (
        book.find('img')['alt'],
        book.find('div', class_='product_price').find('p', class_='price_color').text,
        book.find('div', class_='product_price').find('p', class_='instock availability').text.replace('\n', '').strip(),
    )
    for book in soup.find_all('article', class_='product_pod')
]
books

In [None]:
df = pd.DataFrame(books)
df.columns = ['Name', 'Price', 'Stock availability']
df