# Scraping Nobel Laureates Info
**Authors**: Manas Bedmutha, Kishen Gowda and N. V. Karthikeya

## Introduction

This notebook walks through extracting data from a web page where the data is _not_ laid out in a table. We will extract the list of Nobel laureates from NobelPrize.org ("https://www.nobelprize.org/nobel_prizes/lists/all/").


### Importing Libraries

We need the following libraries to scrape this page:

1. requests: Sends HTTP requests such as GET and POST. Here, it fetches the page from nobelprize.org.
2. bs4 (BeautifulSoup): Extracts content based on HTML tags and their attributes.
3. csv: Writes the extracted data to a comma-separated values (CSV) file.


In [3]:
#coding: utf-8
import requests
from bs4 import BeautifulSoup as bs
import csv

### Getting html content of the webpage

The requests.get() method returns the response object that the server sends back for the given URL. We pass the content of that response to bs4 to build what is generally called a soup. The soup.prettify() method formats the HTML in the soup with indentation so that it is easier to read.


In [4]:
r = requests.get('https://www.nobelprize.org/nobel_prizes/lists/all/')
soup = bs(r.content,'html.parser')
print(soup.prettify())

<!DOCTYPE doctype html>
<!--[if lt IE 7]><html class="lt-ie9 lt-ie8 lt-ie7" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><![endif]-->
<!--[if IE 7]><html class="lt-ie9 lt-ie8" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><![endif]-->
<!--[if IE 8]><html class="lt-ie9" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><![endif]-->
<html>
 <head>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0" name="viewport"/>
  <link href="https://fonts.googleapis.com/css?family=Open+Sans:400,600" rel="stylesheet" type="text/css"/>
  <link href="https://fonts.googleapis.com/css?family=Libre+Baskerville:400,700,400italic" rel="stylesheet" type="text/css"/>
  <title>
   All Nobel Prizes
  </title>
  <meta content="Nobelprize.org, The Official Web Site of the Nobel Prize" name="description"/>
  <meta content="Nobel, Nobelprize, Nobelpriset, Foundation, Prize, Alfred, Museum, 
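
The request above assumes that the server answers successfully. As a minimal sketch, the fetch can be wrapped in a helper that fails loudly on HTTP errors and does not wait forever. The name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw body of `url`, raising if the request fails.

    `fetch_html` is a hypothetical helper; the timeout value is an assumption.
    """
    # timeout stops the call from hanging if the server never responds
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.exceptions.HTTPError on 4xx/5xx replies
    response.raise_for_status()
    return response.content
```

The returned bytes can be passed to BeautifulSoup exactly as `r.content` is in the cell above. Any failure surfaces as a subclass of `requests.exceptions.RequestException`, which can be caught in one place.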

Now that we have the HTML content, observe that the data relevant to us, namely the **names of the laureates**, the **year** and the **field**, sits in **h6** tags. Looking closely, the year and field appear in the href attribute of the **a** tags inside each **h6** tag. Small observations like these go a long way in fetching the data quickly.

With this in mind, we can scrape all the data we require.

The find_all() method of the soup object returns all such occurrences. We store the result in the variable data.
We will print the types of data and data[0] to check what kind of elements it holds. To see what was extracted, we will also print data[0].
Note that each element is still a tag, so we need to extract just the text inside it.

In [5]:
data = soup.find_all('h6')
print(type(data))
print(type(data[0]))
print(data[0])

<class 'bs4.element.ResultSet'>
<class 'bs4.element.Tag'>
<h6 style="margin: 0 0 5px 0; padding: 0;"><a href="/nobel_prizes/physics/laureates/2017/weiss-facts.html">Rainer Weiss</a>, <a href="/nobel_prizes/physics/laureates/2017/barish-facts.html">Barry C. Barish</a> <span style="font-weight: normal;">and</span> <a href="/nobel_prizes/physics/laureates/2017/thorne-facts.html">Kip S. Thorne</a></h6>
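
To explore a tag like this without fetching the page again, the printed h6 element can be parsed on its own. The sketch below reuses the markup shown above as an inline string; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The h6 element printed above, pasted in as a string
html = (
    '<h6 style="margin: 0 0 5px 0; padding: 0;">'
    '<a href="/nobel_prizes/physics/laureates/2017/weiss-facts.html">Rainer Weiss</a>, '
    '<a href="/nobel_prizes/physics/laureates/2017/barish-facts.html">Barry C. Barish</a> '
    '<span style="font-weight: normal;">and</span> '
    '<a href="/nobel_prizes/physics/laureates/2017/thorne-facts.html">Kip S. Thorne</a>'
    '</h6>'
)

sample = BeautifulSoup(html, "html.parser")
tag = sample.find_all("h6")[0]

# Each laureate has its own <a> tag; the href carries the field and year
links = [a["href"] for a in tag.find_all("a")]
# get_text() joins the text of all nested tags, dropping the markup
names = tag.get_text()

print(links[0])
print(names)
```

Reading the href attribute directly avoids converting the whole tag to a string before searching it.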


Now we need to scrape the required data. First we extract the year and field of the prize, which are inside the href of the **a** tag. Basic string operations are enough for this. Once this is done, we just need the text of each tag, which we get with the .text attribute.

In [7]:
headings = ['Year', 'Field', 'Laureates']  # Column headings
final_data = [headings]  # List that stores the final rows
for tag in data:
    s = str(tag)  # Convert to a string, since each element of data is a bs4 Tag, not a str
    w1 = s.find('nobel_prizes/')  # Index of the first occurrence of each marker
    w2 = s.find('/laureates/')
    field = s[w1 + 13:w2]  # The field; 13 is the length of 'nobel_prizes/'
    year = s[w2 + 11:w2 + 15]  # The four-digit year; 11 is the length of '/laureates/'
    name = tag.text  # The names of the laureates
    final_data.append([year, field, name])  # Add the row to the final data list
print(final_data[:50])  # Check that the extracted rows look right

[['Year', 'Field', 'Laureates'], ['2017', 'physics', 'Rainer Weiss, Barry C. Barish and Kip S. Thorne'], ['2017', 'chemistry', 'Jacques Dubochet, Joachim Frank and Richard Henderson'], ['2017', 'medicine', 'Jeffrey C. Hall, Michael Rosbash and Michael W. Young'], ['2017', 'literature', 'Kazuo Ishiguro'], ['2017', 'peace', 'International Campaign to Abolish Nuclear Weapons (ICAN) '], ['2017', 'economic-sciences', 'Richard H. Thaler'], ['2016', 'physics', 'David J. Thouless, F. Duncan M. Haldane and J. Michael Kosterlitz'], ['2016', 'chemistry', 'Jean-Pierre Sauvage, Sir J. Fraser Stoddart and Bernard L. Feringa'], ['2016', 'medicine', 'Yoshinori Ohsumi'], ['2016', 'literature', 'Bob Dylan'], ['2016', 'peace', 'Juan Manuel Santos'], ['2016', 'economic-sciences', 'Oliver Hart and Bengt Holmström'], ['2015', 'physics', 'Takaaki Kajita and Arthur B. McDonald'], ['2015', 'chemistry', 'Tomas Lindahl, Paul Modrich and Aziz Sancar'], ['2015', 'medicine', 'William C. Campbell and Satoshi Ōmura']
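
The fixed offsets 13 and 11 depend on the exact layout of the link. As an alternative sketch, the same two values can be read by splitting the href on "/". The helper name `parse_prize_href` is hypothetical, and the layout it expects is taken from the links printed earlier.

```python
def parse_prize_href(href):
    """Return (year, field) from an href such as
    '/nobel_prizes/physics/laureates/2017/weiss-facts.html',
    or None if the href does not follow that layout.
    """
    parts = href.strip("/").split("/")
    # Expected layout: nobel_prizes / <field> / laureates / <year> / <page>
    if len(parts) >= 4 and parts[0] == "nobel_prizes" and parts[2] == "laureates":
        return parts[3], parts[1]
    return None


result = parse_prize_href("/nobel_prizes/physics/laureates/2017/weiss-facts.html")
print(result)  # ('2017', 'physics')
```

Returning None for an unexpected href makes it easy to skip h6 tags that do not describe a prize, instead of slicing out meaningless text.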

We are done with the scraping part and have all the info we required. Now we will write this data to a CSV (comma-separated values) file, so that it can be opened as a spreadsheet, for example in Excel.


In [8]:
# Open a new file (or overwrite an existing one with the same name) in write mode
with open('nobel_laureates_v1.csv', 'w', encoding='utf-8', newline='') as my_file:
    writer = csv.writer(my_file, lineterminator='\n')  # Writer that ends each row with '\n'
    writer.writerows(final_data)  # Write all rows
# The with block closes the file automatically, so no explicit close() is needed

print("Writing complete")

Writing complete
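
One way to check the result is to read the file back with csv.reader and compare it with what was written. The sketch below does this on a small sample taken from the output above and writes to a temporary directory, so the sample file name is an assumption and the real output file is not touched.

```python
import csv
import os
import tempfile

# A few rows in the same shape as final_data
rows = [
    ["Year", "Field", "Laureates"],
    ["2017", "physics", "Rainer Weiss, Barry C. Barish and Kip S. Thorne"],
    ["2016", "economic-sciences", "Oliver Hart and Bengt Holmström"],
]

# Hypothetical sample file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "nobel_sample.csv")

with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

with open(path, encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))

# Names containing commas are quoted on write and restored on read
print(read_back == rows)
```

Because the writer quotes fields that contain commas, a laureate list such as "Rainer Weiss, Barry C. Barish and Kip S. Thorne" stays in a single column.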


## Thank you!