---
title: "Explore Nobel Laureates data"
date: 2019-01-25T19:14:46+05:30
draft: False
author: "Nitin Patil"

---

Here we will create a SQL table from an external [Kaggle dataset of Nobel Laureates](https://www.kaggle.com/nobelfoundation/nobel-laureates/), which is available in CSV format.

### Connect with SQL database

In [76]:
import sqlalchemy as db
import pandas as pd

In [77]:
# Create an engine to the `test` database
# The typical form of a database URL is `dialect+driver://username:password@host:port/database`
engine = db.create_engine('mysql+pymysql://root:root@localhost:3306/test')
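
To make the URL format concrete, here is a small sketch that splits the same URL into its parts with Python's standard `urllib.parse`. It is only an illustration: SQLAlchemy parses the URL on its own, and the variable names below are mine.

```python
from urllib.parse import urlsplit

# Same URL as above; urlsplit is used here only to illustrate the parts
url = "mysql+pymysql://root:root@localhost:3306/test"
parts = urlsplit(url)

dialect, driver = parts.scheme.split("+")  # dialect+driver
username = parts.username
password = parts.password
host, port = parts.hostname, parts.port
database = parts.path.lstrip("/")

print(dialect, driver, username, host, port, database)
```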

In [78]:
# Print the table names
engine.table_names()

['student_data', 'world']

A handy helper function for later use.

In [None]:
def ex_df(query):
    """Execute query once and return the result as a pandas.DataFrame"""
    result = engine.execute(query)
    # Take the column names from the same result instead of running the query twice
    cols = list(result.keys())
    data = result.fetchall()
    return pd.DataFrame(data=data, columns=cols)
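
The helper boils down to three steps: run the query, read the column names, fetch the rows. Here is a minimal sketch of that pattern with the standard-library `sqlite3` module instead of MySQL, running the query only once. The `demo` table and the `fetch_with_columns` name are made up for illustration.

```python
import sqlite3

def fetch_with_columns(conn, query):
    """Run the query once and return (column names, rows)."""
    cur = conn.execute(query)
    cols = [d[0] for d in cur.description]  # names come from the same cursor
    rows = cur.fetchall()
    return cols, rows

# Throwaway in-memory table, purely for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (year INTEGER, full_name TEXT)")
conn.execute("INSERT INTO demo VALUES (1913, 'Rabindranath Tagore')")

cols, rows = fetch_with_columns(conn, "SELECT * FROM demo")
print(cols, rows)
```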

### Create table from external csv

Read the CSV file.

In [79]:
df_nobel = pd.read_csv("./nobel_laureates/archive.csv")

In [105]:
df_nobel.head(2)

| | Year | Category | Prize | Motivation | Prize Share | Laureate ID | Laureate Type | Full Name | Birth Date | Birth City | Birth Country | Sex | Organization Name | Organization City | Organization Country | Death Date | Death City | Death Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1901 | Chemistry | The Nobel Prize in Chemistry 1901 | "in recognition of the extraordinary services ... | 1/1 | 160 | Individual | Jacobus Henricus van 't Hoff | 1852-08-30 | Rotterdam | Netherlands | Male | Berlin University | Berlin | Germany | 1911-03-01 | Berlin | Germany |
| 1 | 1901 | Literature | The Nobel Prize in Literature 1901 | "in special recognition of his poetic composit... | 1/1 | 569 | Individual | Sully Prudhomme | 1839-03-16 | Paris | France | Male | | | | 1907-09-07 | Châtenay | France |


Some column names in the dataset contain spaces, which makes them awkward to reference in SQL. So replace the spaces with underscores and convert all column names to lowercase.

In [111]:
df_nobel.columns = [s.replace(' ', '_').lower() for s in df_nobel.columns.tolist()]
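
As a quick check of what this renaming produces, here is the same transformation applied to a plain list of names, without pandas. The `normalize` function is a hypothetical helper, and the list holds only a few of the dataset's columns.

```python
def normalize(columns):
    """Replace spaces with underscores and lowercase each name."""
    return [c.replace(" ", "_").lower() for c in columns]

# A few column names from the dataset
raw = ["Year", "Prize Share", "Laureate ID", "Birth Date"]
print(normalize(raw))
```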

Create a SQL table named `nobel`.

In [113]:
df_nobel.to_sql('nobel', engine, index=False)

In [114]:
engine.table_names()

['nobel', 'student_data', 'world']

View the first two records.

In [130]:
q = """SELECT * FROM nobel
LIMIT 2"""

In [132]:
ex_df(q)

| | year | category | prize | motivation | prize_share | laureate_id | laureate_type | full_name | birth_date | birth_city | birth_country | sex | organization_name | organization_city | organization_country | death_date | death_city | death_country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1901 | Chemistry | The Nobel Prize in Chemistry 1901 | "in recognition of the extraordinary services ... | 1/1 | 160 | Individual | Jacobus Henricus van 't Hoff | 1852-08-30 | Rotterdam | Netherlands | Male | Berlin University | Berlin | Germany | 1911-03-01 | Berlin | Germany |
| 1 | 1901 | Literature | The Nobel Prize in Literature 1901 | "in special recognition of his poetic composit... | 1/1 | 569 | Individual | Sully Prudhomme | 1839-03-16 | Paris | France | Male | | | | 1907-09-07 | Châtenay | France |


Change the data type of the date columns to `DATE`.

In [158]:
q ="""ALTER TABLE nobel
MODIFY birth_date DATE
"""
engine.execute(q)

q ="""ALTER TABLE nobel
MODIFY death_date DATE
"""
engine.execute(q)

<sqlalchemy.engine.result.ResultProxy at 0x223f0571d68>

Check the columns of the `nobel` table and their details.

In [159]:
ex_df("DESCRIBE nobel")

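| | Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|---|
| 0 | year | year(4) | YES | | | |
| 1 | category | text | YES | | | |
| 2 | prize | text | YES | | | |
| 3 | motivation | text | YES | | | |
| 4 | prize_share | text | YES | | | |
| 5 | laureate_id | bigint(20) | YES | | | |
| 6 | laureate_type | text | YES | | | |
| 7 | full_name | text | YES | | | |
| 8 | birth_date | date | YES | | | |
| 9 | birth_city | text | YES | | | |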


View the last two records of the table.

In [188]:
q ="""SELECT * FROM nobel
ORDER BY year DESC
LIMIT 2"""
ex_df(q)

| | year | category | prize | motivation | prize_share | laureate_id | laureate_type | full_name | birth_date | birth_city | birth_country | sex | organization_name | organization_city | organization_country | death_date | death_city | death_country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | Chemistry | The Nobel Prize in Chemistry 2016 | "for the design and synthesis of molecular mac... | 1/3 | 931 | Individual | Jean-Pierre Sauvage | 1944-10-21 | Paris | France | Male | University of Strasbourg | Strasbourg | France | | | |
| 1 | 2016 | Chemistry | The Nobel Prize in Chemistry 2016 | "for the design and synthesis of molecular mac... | 1/3 | 932 | Individual | Sir J. Fraser Stoddart | 1942-05-24 | Edinburgh | United Kingdom | Male | Northwestern University | Evanston, IL | United States of America | | | |


### Who are the living Nobel laureates?
Some records from the early 1900s have no `death_date`. This looks like incomplete information, since those records do not have a `birth_date` either. So I will consider a laureate to be living if they have a birth date but no death date.

In [161]:
q ="""SELECT year, full_name, birth_date, 
            ROUND(DATEDIFF(NOW(), birth_date)/365.25, 0) AS Age 
        FROM nobel
      WHERE birth_date IS NOT NULL 
              AND 
            death_date IS NULL
      ORDER BY Age
"""
ex_df(q)

| | year | full_name | birth_date | Age |
|---|---|---|---|---|
| 0 | 2001 | A. Michael Spence | 1943-00-00 | |
| 1 | 2014 | Malala Yousafzai | 1997-07-12 | 22 |
| 2 | 2011 | Tawakkol Karman | 1979-02-07 | 40 |
| 3 | 2010 | Konstantin Novoselov | 1974-08-23 | 45 |
| 4 | 2011 | Leymah Gbowee | 1972-02-01 | 47 |
| 5 | 2011 | Adam G. Riess | 1969-12-16 | 49 |
| 6 | 2011 | Adam G. Riess | 1969-12-16 | 49 |
| 7 | 2011 | Brian P. Schmidt | 1967-02-24 | 52 |
| 8 | 2014 | Stefan W. Hell | 1962-12-23 | 56 |
| 9 | 2014 | Stefan W. Hell | 1962-12-23 | 56 |
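
Note that the `Age` expression rounds to the nearest year, so it can come out one higher than the number of completed years. A minimal sketch in plain Python compares the two. The reference date is an arbitrary assumption of mine, and both function names are hypothetical.

```python
from datetime import date

def rounded_age(birth, today):
    """Mirror ROUND(DATEDIFF(today, birth) / 365.25, 0)."""
    return round((today - birth).days / 365.25)

def exact_age(birth, today):
    """Completed years: subtract one if the birthday has not come yet."""
    before_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - before_birthday

birth = date(1997, 7, 12)  # Malala Yousafzai's birth date from the table
today = date(2019, 3, 1)   # arbitrary reference date (assumption)
print(rounded_age(birth, today), exact_age(birth, today))
```

If completed years are what you want, the exact variant is the safer choice.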


### Data correction

A. Michael Spence has an invalid birth date (`1943-00-00`). Let's correct it.

In [162]:
q= """UPDATE nobel
SET birth_date = '1943-11-07'
WHERE full_name = 'A. Michael Spence'
"""
engine.execute(q)

<sqlalchemy.engine.result.ResultProxy at 0x223f055d278>

In [165]:
q= """SELECT year, full_name, birth_date FROM nobel
WHERE full_name = 'A. Michael Spence'"""
ex_df(q)

| | year | full_name | birth_date |
|---|---|---|---|
| 0 | 2001 | A. Michael Spence | 1943-11-07 |
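
A value like `1943-00-00` is not a real calendar date, which is presumably why `Age` was empty for that row. As a minimal sketch, such values can be caught in plain Python before they reach the database. `is_valid_iso_date` is a hypothetical helper, not part of the notebook.

```python
from datetime import date

def is_valid_iso_date(text):
    """Return True if text is a real calendar date in YYYY-MM-DD form."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False

print(is_valid_iso_date("1943-00-00"))  # month 0 and day 0 do not exist
print(is_valid_iso_date("1943-11-07"))
```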


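Some entries are also repeated: Adam G. Riess and Stefan W. Hell each appear twice above for the same year, most likely because the dataset holds one row per organization a laureate is affiliated with. `DISTINCT` removes these duplicates.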

In [170]:
q ="""SELECT DISTINCT year, full_name, birth_date, 
ROUND(DATEDIFF(NOW(), birth_date)/365.25, 0) AS Age 
FROM nobel
WHERE birth_date IS NOT NULL AND death_date IS NULL
ORDER BY Age DESC
"""
ex_df(q)

| | year | full_name | birth_date | Age |
|---|---|---|---|---|
| 0 | 1997 | Paul D. Boyer | 1918-07-31 | 101 |
| 1 | 1997 | Jens C. Skou | 1918-10-08 | 100 |
| 2 | 1981 | Nicolaas Bloembergen | 1920-03-11 | 99 |
| 3 | 1992 | Edmond H. Fischer | 1920-04-06 | 99 |
| 4 | 1972 | Kenneth J. Arrow | 1921-08-23 | 98 |
| 5 | 1988 | Jack Steinberger | 1921-05-25 | 98 |
| 6 | 1957 | Chen Ning Yang | 1922-09-22 | 97 |
| 7 | 1988 | Leon M. Lederman | 1922-07-15 | 97 |
| 8 | 1989 | Hans G. Dehmelt | 1922-09-09 | 97 |
| 9 | 1973 | Henry A. Kissinger | 1923-05-27 | 96 |

Several laureates have since died, but the dataset has not been updated. For example, Paul D. Boyer died on 2 June 2018 ([Wikipedia](https://en.wikipedia.org/wiki/Paul_D._Boyer)). Let's update the records for the top few.

In [175]:
q="""UPDATE nobel
    SET death_date = 
                    CASE full_name
                        WHEN 'Paul D. Boyer' THEN '2018-06-02'
                        WHEN 'Jens C. Skou' THEN '2018-05-28'
                        WHEN 'Nicolaas Bloembergen' THEN '2017-09-05'
                        WHEN 'Kenneth J. Arrow' THEN '2017-02-21'
                        WHEN 'Leon M. Lederman' THEN '2018-10-03'
                        ELSE death_date
                    END

WHERE full_name IN ('Paul D. Boyer', 'Jens C. Skou', 'Nicolaas Bloembergen', 'Kenneth J. Arrow', 'Leon M. Lederman')
"""
engine.execute(q);

In [177]:
q = """SELECT year, full_name, birth_date, death_date FROM nobel
WHERE full_name IN ('Paul D. Boyer', 'Jens C. Skou', 'Nicolaas Bloembergen', 'Kenneth J. Arrow', 'Leon M. Lederman')"""
ex_df(q)

| | year | full_name | birth_date | death_date |
|---|---|---|---|---|
| 0 | 1972 | Kenneth J. Arrow | 1921-08-23 | 2017-02-21 |
| 1 | 1981 | Nicolaas Bloembergen | 1920-03-11 | 2017-09-05 |
| 2 | 1988 | Leon M. Lederman | 1922-07-15 | 2018-10-03 |
| 3 | 1997 | Paul D. Boyer | 1918-07-31 | 2018-06-02 |
| 4 | 1997 | Jens C. Skou | 1918-10-08 | 2018-05-28 |
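
An alternative to the single `CASE` expression is to keep the corrections in a Python dict and issue one parameterized `UPDATE` per laureate. This is a minimal sketch of that variant using the standard-library `sqlite3` module rather than MySQL. The two-column table is a stand-in, and the `deaths` dict is my own naming.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (full_name TEXT, death_date TEXT)")
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?)",
    [("Paul D. Boyer", None), ("Jens C. Skou", None), ("Edmond H. Fischer", None)],
)

# Dates taken from the UPDATE statement above
deaths = {"Paul D. Boyer": "2018-06-02", "Jens C. Skou": "2018-05-28"}

# One parameterized UPDATE per laureate instead of a single CASE expression
conn.executemany(
    "UPDATE nobel SET death_date = ? WHERE full_name = ?",
    [(d, name) for name, d in deaths.items()],
)

rows = conn.execute(
    "SELECT full_name, death_date FROM nobel ORDER BY full_name"
).fetchall()
print(rows)
```

Rows that are not in the dict are left untouched, which is what the `ELSE death_date` branch guarantees in the `CASE` version.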


Ideally, I should write a utility that scrapes each laureate's Wikipedia page, checks whether they are still alive, and updates the records accordingly.

### Indian Nobel laureates

In [181]:
q= """SELECT full_name, year, category, motivation FROM nobel
WHERE birth_country = 'India' OR organization_country = 'India' OR death_country = 'India'
"""
ex_df(q)

| | full_name | year | category | motivation |
|---|---|---|---|---|
| 0 | Ronald Ross | 1902 | Medicine | "for his work on malaria, by which he has show... |
| 1 | Rabindranath Tagore | 1913 | Literature | "because of his profoundly sensitive, fresh an... |
| 2 | Sir Chandrasekhara Venkata Raman | 1930 | Physics | "for his work on the scattering of light and f... |
| 3 | Har Gobind Khorana | 1968 | Medicine | "for their interpretation of the genetic code ... |
| 4 | Mother Teresa | 1979 | Peace | |
| 5 | Amartya Sen | 1998 | Economics | "for his contributions to welfare economics" |
| 6 | Venkatraman Ramakrishnan | 2009 | Chemistry | "for studies of the structure and function of ... |
| 7 | Kailash Satyarthi | 2014 | Peace | "for their struggle against the suppression of... |


Knights of the realm: list the winners, year and category where the winner's name starts with `Sir`.

In [None]:
q = """SELECT full_name, year, category FROM nobel
WHERE full_name LIKE 'Sir%'"""
ex_df(q)
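
Here is a small sketch of the same prefix match with the standard-library `sqlite3` module rather than MySQL, passing the pattern as a bound parameter. The three rows are taken from the outputs earlier in this post. The trailing space in `'Sir %'` is my own tweak, so that a name that merely begins with the letters "Sir" would not match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (full_name TEXT, year INTEGER, category TEXT)")
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?)",
    [
        ("Sir Chandrasekhara Venkata Raman", 1930, "Physics"),
        ("Sir J. Fraser Stoddart", 2016, "Chemistry"),
        ("Amartya Sen", 1998, "Economics"),
    ],
)

# % matches any run of characters, so the pattern means "starts with 'Sir '"
knights = conn.execute(
    "SELECT full_name, year, category FROM nobel "
    "WHERE full_name LIKE ? ORDER BY year",
    ("Sir %",),
).fetchall()
print(knights)
```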

### References
- [Database operations with python](https://nitinai.github.io/sql/sqlalchemy_pandas/)
- [kaggle dataset of Nobel Laureates](https://www.kaggle.com/nobelfoundation/nobel-laureates/)