# Web Scraping Lab

This notebook contains web scraping exercises to help you practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You'll need to identify the HTML tags, class names and other markers used in the content you are expected to extract.
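
The first tip can be made concrete with a small helper. This is a minimal sketch that uses only the standard library's `http.HTTPStatus` to turn a status code into a readable label; `describe_status` and `is_success` are hypothetical helper names, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable label such as '404 Not Found' for a status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Non-standard codes are not listed in HTTPStatus
        return f"{code} (non-standard status)"


def is_success(code):
    """True for 2xx (success) status codes."""
    return 200 <= code < 300


# Typical use with requests (needs network access, so left commented out):
# response = requests.get(url)
# print(describe_status(response.status_code), is_success(response.status_code))
print(describe_status(200))
print(describe_status(404))
```

A 2xx code only says the request succeeded; printing the response text is still the way to confirm the page holds the content you expect.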

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to use additional libraries if you prefer.

In [26]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [27]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [28]:
# your code here
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
#print(soup.prettify())

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. No name should contain HTML tags.

**Instructions:**

1. Find out the HTML tag and class names used for the developer names. You can do this with Chrome DevTools.

1. Use BeautifulSoup to extract all the HTML elements that contain the developer names.

1. Use string manipulation to remove whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Store the clean names in a list.

1. Print the list of names.

Your output should look like this:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
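
Step 3 of the instructions can be sketched with plain string methods. The example below is a minimal illustration on an invented sample string (real element text will differ); `clean_text` is a hypothetical helper. Unlike the expected output above, it keeps a single space between words instead of removing every space.

```python
def clean_text(raw):
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    # str.split() with no argument splits on any whitespace and drops empty pieces
    return " ".join(raw.split())


# Invented sample that mimics the text of a scraped HTML element
raw_name = "\n\n      trimstray\n        (@trimstray)\n  "
print(clean_text(raw_name))
```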

In [38]:
# your code here
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")

# The <h1> tags hold the developer name and the main repo; the <p> tags hold the username.
# Inspect https://github.com/trending/developers with Chrome DevTools and hover over a name to check.
tags = ['h1', 'p']
# The first two matches are the page header and introduction, so skip them
text = [element.text for element in soup.find_all(tags)][2:]

# Strip surrounding whitespace and line breaks from every string
new = [item.strip() for item in text]

# Split a list into consecutive chunks of n items
def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i+n]

# Group the strings in threes: name, username and repo
newest_text = list(chunks(new, 3))

# Format each entry as 'name (username)'
result = [f"{entry[0]} ({entry[1]})" for entry in newest_text]

result

['Thibault Duplessis (ornicar)',
 'Jake Archibald (jakearchibald)',
 'Stephen Roller (stephenroller)',
 'David Rodríguez (deivid-rodriguez)',
 'Shawn Tabrizi (shawntabrizi)',
 'Steven Allen (Stebalien)',
 'Taylor Otwell (taylorotwell)',
 'Sebastián Ramírez (tiangolo)',
 'Leo Di Donato (leodido)',
 'Felix Angelov (felangel)',
 'Agniva De Sarker (agnivade)',
 'Tyler Neely (spacejam)',
 'Klaus Post (klauspost)',
 'James Agnew (jamesagnew)',
 'Dustin L. Howett (DHowett)',
 'Anmol Sethi (nhooyr)',
 'Yiyi Wang (shd101wyy)',
 'Philipp Oppermann (phil-opp)',
 'Ben Manes (ben-manes)',
 'Stefan Prodan (stefanprodan)',
 'Dries Vints (driesvints)',
 'John Keiser (jkeiser)',
 'Carlos Alexandro Becker (caarlos0)',
 'Ines Montani (ines)',
 'Brian Flad (bflad)']

In [39]:
table = soup.find_all('h2', {'class': 'f3 text-normal'})
trending_devs = [dev.text.strip().replace(' ', '').replace('\n\n', ' ') for dev in table]
trending_devs

[]

#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.
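
Repository names are usually rendered as `owner / repo` with line breaks and padding in between. A minimal sketch of the cleaning step, using an invented sample string and a hypothetical helper `normalize_repo_name`:

```python
def normalize_repo_name(raw):
    """Remove all whitespace so 'owner / repo' becomes 'owner/repo'."""
    # Owner and repository names cannot contain spaces, so dropping
    # every whitespace character does not lose information
    return "".join(raw.split())


# Invented sample that mimics the text of a repository heading
sample = "\n  minimaxir /\n\n      big-list-of-naughty-strings\n"
print(normalize_repo_name(sample))
```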

In [40]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")

In [41]:
# your code here
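articles = soup.find_all('article')
repo = []
for a in articles:
    # Split the text of each <article> into whitespace-separated tokens
    clean = a.text.strip().replace('\n\n','').split()
    # Skip articles whose text starts with 'Popular'; for the rest,
    # tokens 1 to 3 are joined into 'owner/repo'
    if clean[0] != 'Popular':
        repo.append(clean[1] + clean[2] + clean[3])
print(repo)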

['minimaxir/big-list-of-naughty-strings', 'rusty1s/pytorch_geometric', 'espnet/espnet', 'public-apis/public-apis', 'donnemartin/system-design-primer', 'ranjian0/building_tool', 'sherlock-project/sherlock🔎', 'OpenMined/PySyft', 'open-mmlab/mmdetection', 'xillwillx/skiptracer', 'allenai/allennlp', 'explosion/spaCy💫', 'zylo117/Yet-Another-EfficientDet-Pytorch', 'renatoviolin/next_word_prediction', 'anandpawara/Real_Time_Image_Animation', 'pytorch/fairseq', 'lyhue1991/eat_tensorflow2_in_30_days', 'zhanghang1989/ResNeSt', 'Rapptz/discord.py', 'google-research/big_transfer', 'TachibanaYoshino/AnimeGAN', 'shengqiangzhang/examples-of-web-crawlers', 'hunglc007/tensorflow-yolov4-tflite', 'ianzhao05/textshot', 'bitcoinbook/bitcoinbook']


In [42]:
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")


#On the h1 tag I find the name of the repo
tags = ['h1']
text = [element.text for element in soup.find_all(tags)][1:]
#First element does not interest us. Check the result
#text

#Cleaning the results (step 1)
new = []

for item in text: 
    x = item.strip()     
    new.append(x)

#Check the results
new

#Cleaning (step 2)
newest_list = []

for element in new:
    item = element.replace('\n', '')
    item2 = item.replace('     ', '')
    newest_list.append(item2)

newest_list

['minimaxir / big-list-of-naughty-strings',
 'rusty1s / pytorch_geometric',
 'espnet / espnet',
 'public-apis / public-apis',
 'donnemartin / system-design-primer',
 'ranjian0 / building_tool',
 'sherlock-project / sherlock',
 'OpenMined / PySyft',
 'open-mmlab / mmdetection',
 'xillwillx / skiptracer',
 'allenai / allennlp',
 'explosion / spaCy',
 'zylo117 / Yet-Another-EfficientDet-Pytorch',
 'renatoviolin / next_word_prediction',
 'anandpawara / Real_Time_Image_Animation',
 'pytorch / fairseq',
 'lyhue1991 / eat_tensorflow2_in_30_days',
 'zhanghang1989 / ResNeSt',
 'Rapptz / discord.py',
 'google-research / big_transfer',
 'TachibanaYoshino / AnimeGAN',
 'shengqiangzhang / examples-of-web-crawlers',
 'hunglc007 / tensorflow-yolov4-tflite',
 'ianzhao05 / textshot',
 'bitcoinbook / bitcoinbook']

#### Display all the image links from the Walt Disney Wikipedia page.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")

In [None]:
# your code here
images = soup.find_all("img")
result = []
for i in images:
    result.append('https:' + i['src'])
print(result)
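
Prefixing `'https:'` works for protocol-relative sources (`//host/path`) but not for site-relative or already absolute ones. A hedged alternative is `urllib.parse.urljoin`, which resolves all three forms against the page URL. The sample `src` values below are invented for illustration, and `absolute_image_url` is a hypothetical helper.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/Walt_Disney"


def absolute_image_url(src, page_url=PAGE_URL):
    """Resolve an <img> src attribute against the URL of the page it came from."""
    return urljoin(page_url, src)


# Invented sample values covering the three common forms of src
samples = [
    "//upload.wikimedia.org/example.jpg",  # protocol-relative
    "/static/images/example.png",          # site-relative
    "https://example.org/picture.png",     # already absolute
]
for src in samples:
    print(absolute_image_url(src))
```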

#### Retrieve an arbitrary Wikipedia page for "Python" and create a list of the links on that page.

In [22]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Python'

In [23]:
# your code here
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")
table = soup.find_all('a')
for link in table:
    if 'href' in link.attrs:
        print(link['href'])

#mw-head
#p-search
https://en.wiktionary.org/wiki/Python
https://en.wiktionary.org/wiki/python
#Snakes
#Ancient_Greece
#Media_and_entertainment
#Computing
#Engineering
#Roller_coasters
#Vehicles
#Weaponry
#People
#Other_uses
#See_also
/w/index.php?title=Python&action=edit&section=1
/wiki/Pythonidae
/wiki/Python_(genus)
/w/index.php?title=Python&action=edit&section=2
/wiki/Python_(mythology)
/wiki/Python_of_Aenus
/wiki/Python_(painter)
/wiki/Python_of_Byzantium
/wiki/Python_of_Catana
/w/index.php?title=Python&action=edit&section=3
/wiki/Python_(film)
/wiki/Pythons_2
/wiki/Monty_Python
/wiki/Python_(Monty)_Pictures
/w/index.php?title=Python&action=edit&section=4
/wiki/Python_(programming_language)
/wiki/CPython
/wiki/CMU_Common_Lisp
/wiki/PERQ#PERQ_3
/w/index.php?title=Python&action=edit&section=5
/w/index.php?title=Python&action=edit&section=6
/wiki/Python_(Busch_Gardens_Tampa_Bay)
/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)
/wiki/Python_(Efteling)
/w/index.php?title=Python&action=edi
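
The output mixes in-page anchors (`#...`), site-relative paths (`/wiki/...`) and absolute URLs. A minimal sketch that sorts the `href` values into those three groups is shown here; the sample list reuses a few values from the output above, and `group_hrefs` is a hypothetical helper.

```python
from collections import defaultdict


def group_hrefs(hrefs):
    """Group href values into 'in-page', 'absolute' and 'relative' lists."""
    groups = defaultdict(list)
    for href in hrefs:
        if href.startswith("#"):
            kind = "in-page"
        elif href.startswith(("http://", "https://", "//")):
            kind = "absolute"
        else:
            kind = "relative"
        groups[kind].append(href)
    return dict(groups)


hrefs = [
    "#mw-head",
    "https://en.wiktionary.org/wiki/Python",
    "#Snakes",
    "/wiki/Pythonidae",
    "/wiki/PERQ#PERQ_3",
]
for kind, values in group_hrefs(hrefs).items():
    print(kind, values)
```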

#### Find the number of titles that have changed in the United States Code since its last release point.

In [None]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [None]:
# your code here
# The page marks a changed title with the class "usctitlechanged", so count its occurrences
txt = requests.get(url).text
count = txt.count('class="usctitlechanged"')
print(f'Number of titles changed: {count}')
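
Counting an exact substring breaks if an element ever carries more than one class. A slightly more tolerant sketch uses a regular expression; the HTML below is invented for illustration and is not the markup of the real page, and `count_changed_titles` is a hypothetical helper.

```python
import re

# Match a class attribute that contains the whole word 'usctitlechanged',
# possibly alongside other class names
CHANGED_CLASS = re.compile(r'class="[^"]*\busctitlechanged\b[^"]*"')


def count_changed_titles(html):
    """Count elements whose class attribute includes 'usctitlechanged'."""
    return len(CHANGED_CLASS.findall(html))


# Invented sample markup
sample_html = """
<div class="usctitlechanged" id="t1">Title 1</div>
<div class="usctitle" id="t2">Title 2</div>
<div class="usctitlechanged highlighted" id="t3">Title 3</div>
"""
print(count_changed_titles(sample_html))
```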

#### Create a Python list with the names of the FBI's top ten Most Wanted.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")

In [None]:
# your code here
names = soup.find_all("h3", attrs={"class":"title"})
result = [n.text.strip() for n in names]
print(result)

#### Display the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")

In [None]:
# your code here
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")
earthquakes = soup.find('tbody', {'id': 'tbody'}).find_all("tr")

nelem = 20
latest_earthquakes = []

for earthquake in earthquakes[:nelem]:
    # Date and time
    date, time = earthquake.find('td', {'class': 'tabev6'}).find('a').text.split()
    # Latitude and longitude
    lat_deg, lon_deg = earthquake.find_all('td', {'class': 'tabev1'})
    lat_dir, lon_dir, magnitude = earthquake.find_all('td', {'class': 'tabev2'})
    lat_deg = f"{lat_deg.text.strip()} {lat_dir.text.strip()}"
    lon_deg = f"{lon_deg.text.strip()} {lon_dir.text.strip()}"
    # Region
    region = earthquake.find('td', {'class': 'tb_region'}).text.strip()
    # Create list of information and append
    earthquake_summary = [date, time, lat_deg, lon_deg, region]
    latest_earthquakes.append(earthquake_summary)

df = pd.DataFrame(latest_earthquakes, columns=['Date', 'Time', 'Latitude', 'Longitude', 'Region'])
df
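
The cell above stores latitude and longitude as text such as a number followed by a direction. If numeric columns are preferred, one option is to convert each pair to a signed float. This is a sketch under the assumption that the direction cells contain the letters N, S, E or W; the sample values are invented, and `to_signed_degrees` is a hypothetical helper.

```python
def to_signed_degrees(value, direction):
    """Convert e.g. ('35.12', 'S') to -35.12; south and west are negative."""
    degrees = float(value)
    if direction.strip().upper() in ("S", "W"):
        return -degrees
    return degrees


# Invented sample values
print(to_signed_degrees("35.12", "S"))
print(to_signed_degrees("139.50", "E"))
```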

#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. Include a ***try/except block*** to handle account names that are not found.
<br>***Hint:*** the program should count the number of tweets for any provided account.
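
Users may type the handle with or without the leading `@`. A minimal sketch for turning the input into a profile URL, with a try/except around an empty handle, is shown here; `profile_url` is a hypothetical helper and the URL pattern is simply the base URL of this exercise plus the handle.

```python
from urllib.parse import quote

BASE_URL = "https://twitter.com/"


def profile_url(handle):
    """Build the profile URL for a handle given with or without '@'."""
    cleaned = handle.strip().lstrip("@")
    if not cleaned:
        raise ValueError("empty handle")
    # quote() escapes characters that are not safe inside a URL path
    return BASE_URL + quote(cleaned)


for raw in ["@ironhackams", "  ironhackams ", "@"]:
    try:
        print(profile_url(raw))
    except ValueError as error:
        print(f"Invalid handle {raw!r}: {error}")
```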

In [1]:
# This is the url you will scrape in this exercise 
# You will need to append the account handle to this URL
url = 'https://twitter.com/'

In [4]:
# your code here

username = input('Please, input your username: ')
html = requests.get(url + username).content
soup = BeautifulSoup(html, "lxml")

try:
    tweet_box = soup.find('li', {'class':'ProfileNav-item ProfileNav-item--tweets is-active'})
    tweets = tweet_box.find('a').find('span', {'class':'ProfileNav-value'})
    print("{} has {} number of tweets.".format(username, tweets.get('data-count')))
except AttributeError:
    # find() returns None when the expected element is missing
    print('Account name not found...')

Please, input your username: ironhackams
ironhackams has 114 number of tweets.


#### Number of followers of a given Twitter account
Ask the user for the handle (@handle) of a Twitter account. Include a ***try/except block*** to handle account names that are not found.
<br>***Hint:*** the program should count the followers for any provided account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to append the account handle to this URL
url = 'https://twitter.com/'

In [5]:
# your code here

username = input('Please, input your username: ')
html = requests.get(url + username).content
soup = BeautifulSoup(html, "lxml")

try:
    followers_box = soup.find('li', {'class':'ProfileNav-item ProfileNav-item--followers'})
    followers = followers_box.find('a').find('span', {'class':'ProfileNav-value'})
    print("{} has {} followers.".format(username, followers.get('data-count')))
except AttributeError:
    # find() returns None when the expected element is missing
    print('Account name not found...')

Please, input your username: ironhackams
ironhackams has 203 followers.


#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [6]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [7]:
# your code here


html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")

languages = soup.find_all('a', {'class': 'link-box'})
for language in languages:
    print(language.text.strip())

English
6 085 000+ articles
Español
1 601 000+ artículos
日本語
1 208 000+ 記事
Deutsch
2 436 000+ Artikel
Русский
1 629 000+ статей
Français
2 219 000+ articles
Italiano
1 609 000+ voci
中文
1 121 000+ 條目
Português
1 033 000+ artigos
Polski
1 411 000+ haseł
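
Each printed entry has the language name on one line and the article count on the next. A minimal sketch for splitting such an entry into a name and an integer is shown here, assuming that two-line layout holds for every entry; `parse_language_box` is a hypothetical helper and the sample strings are typed from the output above.

```python
import re


def parse_language_box(text):
    """Split 'Name\\n6 085 000+ articles' into ('Name', 6085000)."""
    name, count_text = text.split("\n", 1)
    # Drop everything that is not a digit: spaces, '+', and the word for 'articles'
    digits = re.sub(r"\D", "", count_text)
    return name.strip(), int(digits)


entries = ["English\n6 085 000+ articles", "Polski\n1 411 000+ haseł"]
for entry in entries:
    print(parse_language_box(entry))
```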


#### Create a list of the different kinds of datasets available on data.gov.uk.

In [8]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'


In [9]:
# your code here


html = requests.get(url).content
soup = BeautifulSoup(html,"lxml")
topics = soup.find_all('h2')
for topic in topics:
    print(topic.text)

Business and economy
Crime and justice
Defence
Education
Environment
Government
Government spending
Health
Mapping
Society
Towns and cities
Transport


#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [10]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [11]:
# your code here


html = requests.get(url).content
soup = BeautifulSoup(html,"lxml")
languages = soup.find('table', {'class': 'wikitable sortable'}).find_all('a', attrs={'title': True})

for i in range(10):
    print(languages[i].text)

Mandarin Chinese
Sino-Tibetan
Sinitic
Spanish
Indo-European
Romance
English
Indo-European
Germanic
Hindi
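
The output above interleaves each language with what appear to be its family and branch, so ten printed links cover fewer than ten languages. A minimal sketch, assuming every table row contributes exactly three titled links (which the printed output suggests but the live page may not guarantee), keeps every third entry. The `labels` list is copied from the output above, and `every_nth` is a hypothetical helper.

```python
def every_nth(items, step=3):
    """Keep items 0, step, 2*step, ... of the list."""
    return items[::step]


labels = [
    "Mandarin Chinese", "Sino-Tibetan", "Sinitic",
    "Spanish", "Indo-European", "Romance",
    "English", "Indo-European", "Germanic",
    "Hindi",
]
languages = every_nth(labels)

# One record per language, in page order, ready for pd.DataFrame(records)
records = [
    {"Position": position, "Language": name}
    for position, name in enumerate(languages, start=1)
]
print(languages)
```

Reading the first cell of each table row would be more robust than relying on a fixed step, since a row with an extra or missing link shifts every later entry.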


## Bonus
#### Scrape a certain number of tweets of a given Twitter account.

In [12]:
# This is the url you will scrape in this exercise 
# You will need to append the account handle to this URL
url = 'https://twitter.com/'

In [13]:
# your code here

username = input('Please, input your username: ')
n_tweets = int(input('Input number of tweets to scrape: '))
html = requests.get(url + username).content
soup = BeautifulSoup(html, "lxml")

all_tweets = soup.find_all('div', {'class':'tweet'})

if all_tweets:
    for tweet in all_tweets[0:n_tweets]:
        name = tweet.find('span', {'class': 'FullNameGroup'}).find('strong')
        username = tweet.find('span', {'class': 'username'})
        time = tweet.find('small', {'class': 'time'})
        content = tweet.find('p', {'class': 'TweetTextSize TweetTextSize--normal js-tweet-text tweet-text'})
        statistics = tweet.find('div', {'class': 'ProfileTweet-actionCountList u-hiddenVisually'})
        
        print(f'\n{name.text} {username.text} {time.text.strip()}')
        print(content.text)
        print(statistics.text.strip().replace('\n', ' '))
else:
    print('Account name not found or tweet list is empty...')

Please, input your username: ironhackams
Input number of tweets to scrape: 5

Ironhack Amsterdam @ironhackAMS 14 mei
Since the beginning of #covid19 , at Ironhack we have switched the format of the bootcamps to remote to keep the #ironhackers safe and sound Wondering what the remote bootcamps are all about? Here are 5 differences between #remote and #online courses
https://soo.nr/mGaj pic.twitter.com/oC33uQjN40
0 antwoorden     0 retweets     0 vind-ik-leuks

Ironhack Amsterdam @ironhackAMS 4 dec. 2019
We have amazing news for you! Over $800,000 in scholarships to attend ANY one of our 9 global campuses  Apply Today! http://x.ea.com/61724 https://twitter.com/TheSims/status/1201940090517426177 …
0 antwoorden     0 retweets     2 vind-ik-leuks

Ironhack Amsterdam @ironhackAMS 2 apr. 2019
Why I went off the beaten track and followed a web development bootcamp? - An alumni story by Matt Hamers

#webdevelopment #ironhack #Fullstack #coding #bootcamp #Alumni 

Read the article in the link be

#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [14]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")

In [15]:
# your code here
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")

movies = soup.find_all('td', {'class':'titleColumn'})
titles = [movie.find('a').text for movie in movies]
years = [movie.find('span').text[1:-1] for movie in movies]
directors = [movie.find('a').get('title').split(',')[0][:-7] for movie in movies]
actors = [' & '.join(movie.find('a').get('title').split(',')[1:]) for movie in movies]

movies_dict = {'Title': titles, 'Release': years, 'Director': directors, 'Actors': actors}

movies_df = pd.DataFrame(movies_dict)
movies_df

| | Title | Release | Director | Actors |
|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Frank Darabont | Tim Robbins & Morgan Freeman |
| 1 | The Godfather | 1972 | Francis Ford Coppola | Marlon Brando & Al Pacino |
| 2 | The Godfather: Part II | 1974 | Francis Ford Coppola | Al Pacino & Robert De Niro |
| 3 | The Dark Knight | 2008 | Christopher Nolan | Christian Bale & Heath Ledger |
| 4 | 12 Angry Men | 1957 | Sidney Lumet | Henry Fonda & Lee J. Cobb |
| ... | ... | ... | ... | ... |
| 245 | Mandariinid | 2013 | Zaza Urushadze | Lembit Ulfsak & Elmo Nüganen |
| 246 | Aladdin | 1992 | Ron Clements | Scott Weinger & Robin Williams |
| 247 | Lagaan: Once Upon a Time in India | 2001 | Ashutosh Gowariker | Aamir Khan & Raghuvir Yadav |
| 248 | PK | 2014 | Rajkumar Hirani | Aamir Khan & Anushka Sharma |
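
The cell above relies on slicing (`[:-7]`) to drop a director marker from the link's `title` attribute. A more explicit sketch is shown here, assuming the attribute has the form `Director (dir.), Actor, Actor`, which is inferred from that slicing and not confirmed against the live page; `parse_credits` is a hypothetical helper.

```python
DIRECTOR_MARKER = "(dir.)"


def parse_credits(title_attr):
    """Split 'Director (dir.), Actor A, Actor B' into (director, 'Actor A & Actor B')."""
    people = [part.strip() for part in title_attr.split(",")]
    director = people[0]
    if director.endswith(DIRECTOR_MARKER):
        # Remove the marker and the space in front of it
        director = director[: -len(DIRECTOR_MARKER)].strip()
    actors = " & ".join(people[1:])
    return director, actors


print(parse_credits("Frank Darabont (dir.), Tim Robbins, Morgan Freeman"))
```

Stripping each part before joining also avoids the stray leading spaces that a plain `split(',')` leaves on the actor names.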


#### Display the movie name, year and a brief summary of 10 random movies from the IMDB top chart as a pandas dataframe.

In [16]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'


In [17]:
# your code here


from random import shuffle

n_random = 10

html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")
movies = soup.find_all('td', {'class':'titleColumn'})

shuffle(movies)

titles = [movie.find('a').text for movie in movies[0:n_random]]
years = [movie.find('span').text[1:-1] for movie in movies[0:n_random]]
links_to_movies = [movie.find('a').get('href') for movie in movies[0:n_random]]

summary = []
for link in links_to_movies:
    html = requests.get('https://www.imdb.com' + link).content
    soup = BeautifulSoup(html, "lxml")
    summary.append(soup.find('div', {'class':'summary_text'}).text.strip())

movies_dict = {'Title': titles, 'Release': years, 'Summary': summary}

movies_df = pd.DataFrame(movies_dict)
movies_df



| | Title | Release | Summary |
|---|---|---|---|
| 0 | Portrait de la jeune fille en feu | 2019 | On an isolated island in Brittany at the end o... |
| 1 | Barry Lyndon | 1975 | An Irish rogue wins the heart of a rich widow ... |
| 2 | Butch Cassidy and the Sundance Kid | 1969 | Wyoming, early 1900s. Butch Cassidy and The Su... |
| 3 | Andhadhun | 2018 | A series of mysterious events change the life ... |
| 4 | Rocky | 1976 | A small-time boxer gets a supremely rare chanc... |
| 5 | Lock, Stock and Two Smoking Barrels | 1998 | A botched card game in London triggers four fr... |
| 6 | Idi i smotri | 1985 | After finding an old rifle, a young boy joins ... |
| 7 | Dial M for Murder | 1954 | A tennis player tries to arrange his wife's mu... |
| 8 | Full Metal Jacket | 1987 | A pragmatic U.S. Marine observes the dehumaniz... |
| 9 | Faa yeung nin wa | 2000 | Two neighbors, a woman and a man, form a stron... |
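
`shuffle` reorders the whole list in place before the first ten items are taken. An alternative sketch uses `random.sample`, which picks distinct items without modifying the original list. The movie list here is a stand-in of invented strings, and `pick_random` with its `seed` parameter is a hypothetical helper added so the selection can be reproduced.

```python
import random


def pick_random(items, k, seed=None):
    """Return k distinct items chosen at random; the input list is left untouched."""
    rng = random.Random(seed)
    return rng.sample(items, k)


# Stand-in for the list of scraped movie elements
movies = [f"Movie {i}" for i in range(1, 251)]
picked = pick_random(movies, 10, seed=42)
print(len(picked), len(set(picked)))
```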


#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [18]:
# your code here


city = input('Enter the city: ').lower()
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'
weather_json = requests.get(url).json()

print("\n{}'s temperature: {}°C ".format(city.capitalize(), weather_json['main']['temp']))
print("Wind speed: {} m/s".format(weather_json['wind']['speed']))
print("Description: {}".format(weather_json['weather'][0]['description'].capitalize()))
print("Weather: {}".format(weather_json['weather'][0]['main'].capitalize()))

Enter the city: Barcelona

Barcelona's temperature: 25.27°C 
Wind speed: 3.6 m/s
Description: Clear sky
Weather: Clear
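
Concatenating the city into the URL fails for names with spaces or accents, and a key written into the notebook is easy to leak. A minimal sketch that URL-encodes the query and reads the key from an environment variable is shown here; the variable name `OPENWEATHER_API_KEY` and the helper `build_weather_url` are assumptions for illustration, while the endpoint and query parameters are the ones used in the cell above.

```python
import os
from urllib.parse import urlencode

API_URL = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key):
    """Build the request URL with properly encoded query parameters."""
    query = urlencode({"q": city, "APPID": api_key, "units": "metric"})
    return f"{API_URL}?{query}"


# Hypothetical environment variable; falls back to a placeholder if unset
api_key = os.environ.get("OPENWEATHER_API_KEY", "YOUR_API_KEY")
print(build_weather_url("New York", api_key))
```

With `requests`, passing the same dictionary through the `params` argument of `requests.get` gives the same encoding.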


#### Find the book name, price and stock availability as a pandas dataframe.

In [19]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [20]:
# your code here

html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")
books = soup.find_all('article', {'class': 'product_pod'})

titles = [book.find('h3').text for book in books]
prices = [book.find('p', {'class': 'price_color'}).text for book in books]
stock = [book.find('p', {'class': 'instock availability'}).text.strip() for book in books]

books_dict = {'Title': titles, 'Price': prices, 'Stock': stock}

books_df = pd.DataFrame(books_dict)
books_df

| | Title | Price | Stock |
|---|---|---|---|
| 0 | A Light in the ... | £51.77 | In stock |
| 1 | Tipping the Velvet | £53.74 | In stock |
| 2 | Soumission | £50.10 | In stock |
| 3 | Sharp Objects | £47.82 | In stock |
| 4 | Sapiens: A Brief History ... | £54.23 | In stock |
| 5 | The Requiem Red | £22.65 | In stock |
| 6 | The Dirty Little Secrets ... | £33.34 | In stock |
| 7 | The Coming Woman: A ... | £17.93 | In stock |
| 8 | The Boys in the ... | £22.60 | In stock |
| 9 | The Black Maria | £52.15 | In stock |
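
The prices above are strings, so they cannot be sorted or averaged as numbers. A minimal sketch for converting them to floats is shown here; `parse_price` is a hypothetical helper, and the regular expression simply takes the first number in the text so that a mis-decoded currency symbol does not matter.

```python
import re


def parse_price(text):
    """Extract the first number from a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


prices = ["£51.77", "£53.74", "£50.10"]
print([parse_price(price) for price in prices])
```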
