# SELECT from Nobel

## `nobel` Nobel Laureates

We continue practicing simple SQL queries on a single table.

This tutorial is concerned with a table of Nobel prize winners:

```
nobel(yr, subject, winner)
```

Each exercise is the equivalent of a SQL `SELECT` statement, written here with the PySpark DataFrame API.

In [1]:
import os
import pandas as pd
import findspark

# Point findspark at the local Spark installation before importing pyspark.
os.environ['SPARK_HOME'] = '/opt/spark'
findspark.init()

from pyspark.sql import SparkSession

# Note: `sc` is a SparkSession with Hive support, not a SparkContext.
sc = (SparkSession.builder.appName('app03')
      .config('spark.sql.warehouse.dir', 'hdfs://quickstart.cloudera:8020/user/hive/warehouse')
      .config('hive.metastore.uris', 'thrift://quickstart.cloudera:9083')
      .enableHiveSupport().getOrCreate())

## 1. Winners from 1950

Change the query shown so that it displays Nobel prizes for 1950.

In [2]:
nobel = sc.read.table('sqlzoo.nobel')

In [3]:
nobel.filter(nobel['yr']==1950).toPandas()

Unnamed: 0,yr,subject,winner
0,1950,Chemistry,Kurt Alder
1,1950,Chemistry,Otto Diels
2,1950,Literature,Bertrand Russell
3,1950,Medicine,Philip S. Hench
4,1950,Medicine,Edward C. Kendall
5,1950,Medicine,Tadeus Reichstein
6,1950,Peace,Ralph Bunche
7,1950,Physics,Cecil Powell
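
The filter above plays the role of a SQL `WHERE` clause. As a minimal sketch of that correspondence, assuming Python's built-in `sqlite3` module as a local stand-in for the Hive table (the connection name `conn` and the four sample rows, copied from outputs in this tutorial, are illustrative only):

```python
import sqlite3

# In-memory stand-in for sqlzoo.nobel (assumption: not the real Hive table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (yr INTEGER, subject TEXT, winner TEXT)")

# A few rows copied from the outputs shown in this tutorial.
sample_rows = [
    (1950, "Literature", "Bertrand Russell"),
    (1950, "Peace", "Ralph Bunche"),
    (1962, "Literature", "John Steinbeck"),
    (1921, "Physics", "Albert Einstein"),
]
conn.executemany("INSERT INTO nobel VALUES (?, ?, ?)", sample_rows)

# SQL counterpart of nobel.filter(nobel['yr'] == 1950)
winners_1950 = conn.execute(
    "SELECT yr, subject, winner FROM nobel WHERE yr = 1950 ORDER BY subject"
).fetchall()
print(winners_1950)
```

Only the two 1950 rows from the sample come back; the 1921 and 1962 rows are filtered out.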


## 2. 1962 Literature

Show who won the 1962 prize for Literature.

In [4]:
(nobel.filter((nobel['yr']==1962) & (nobel['subject']=='Literature'))
     .select('winner').toPandas())

Unnamed: 0,winner
0,John Steinbeck


## 3. Albert Einstein

Show the year and subject that won 'Albert Einstein' his prize.

In [5]:
nobel.filter(nobel['winner']=='Albert Einstein').select('yr', 'subject').toPandas()

Unnamed: 0,yr,subject
0,1921,Physics


## 4. Recent Peace Prizes

Give the names of the 'Peace' winners from the year 2000 onward, including 2000.

In [6]:
nobel.filter((nobel['yr']>=2000) & (nobel['subject']=='Peace')).select('winner').toPandas()

Unnamed: 0,winner
0,Tunisian National Dialogue Quartet
1,Kailash Satyarthi
2,Malala Yousafzai
3,European Union
4,Ellen Johnson Sirleaf
5,Leymah Gbowee
6,Tawakel Karman
7,Liu Xiaobo
8,Barack Obama
9,Martti Ahtisaari


## 5. Literature in the 1980s

Show all details **(yr, subject, winner)** of the Literature prize winners for 1980 to 1989 inclusive.

In [7]:
nobel.filter((nobel['yr'].between(1980, 1989)) & 
             (nobel['subject']=='Literature')).toPandas()

Unnamed: 0,yr,subject,winner
0,1989,Literature,Camilo José Cela
1,1988,Literature,Naguib Mahfouz
2,1987,Literature,Joseph Brodsky
3,1986,Literature,Wole Soyinka
4,1985,Literature,Claude Simon
5,1984,Literature,Jaroslav Seifert
6,1983,Literature,William Golding
7,1982,Literature,Gabriel García Márquez
8,1981,Literature,Elias Canetti
9,1980,Literature,Czeslaw Milosz
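
The output above includes both 1980 and 1989, so `between` treats both endpoints as inclusive. SQL's `BETWEEN` does the same. A small sketch of this, assuming `sqlite3` as a stand-in for Spark and a hand-picked subset of rows from this tutorial:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (yr INTEGER, subject TEXT, winner TEXT)")
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?)",
    [
        (1962, "Literature", "John Steinbeck"),    # before the range
        (1980, "Literature", "Czeslaw Milosz"),    # lower endpoint
        (1983, "Literature", "William Golding"),   # inside the range
        (1989, "Literature", "Camilo José Cela"),  # upper endpoint
        (2007, "Literature", "Doris Lessing"),     # after the range
    ],
)

# BETWEEN keeps both endpoints, like Column.between(1980, 1989).
eighties = conn.execute(
    "SELECT yr, winner FROM nobel "
    "WHERE subject = 'Literature' AND yr BETWEEN 1980 AND 1989 "
    "ORDER BY yr"
).fetchall()
years = [yr for yr, _ in eighties]
print(years)
```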


## 6. Only Presidents

Show all details of the presidential winners:

- Theodore Roosevelt
- Woodrow Wilson
- Jimmy Carter
- Barack Obama

In [8]:
nobel.filter(nobel['winner'].isin(
    ['Theodore Roosevelt', 'Woodrow Wilson', 
     'Jimmy Carter', 'Barack Obama'])).toPandas()

Unnamed: 0,yr,subject,winner
0,2009,Peace,Barack Obama
1,2002,Peace,Jimmy Carter
2,1919,Peace,Woodrow Wilson
3,1906,Peace,Theodore Roosevelt


## 7. John

Show the winners whose first name is John.

In [9]:
nobel.filter(nobel['winner'].startswith('John')).select('winner').toPandas()

Unnamed: 0,winner
0,John O'Keefe
1,John B. Gurdon
2,John C. Mather
3,John L. Hall
4,John B. Fenn
5,John E. Sulston
6,John Pople
7,John Hume
8,John E. Walker
9,John C. Harsanyi


## 8. Chemistry and Physics from different years

**Show the year, subject, and name of Physics winners for 1980 together with the Chemistry winners for 1984.**

In [10]:
(nobel.filter(((nobel['subject']=='Physics') & (nobel['yr']==1980)) |
              ((nobel['subject']=='Chemistry') & (nobel['yr']==1984)))
     .select('yr', 'subject', 'winner').toPandas())

Unnamed: 0,yr,subject,winner
0,1984,Chemistry,Bruce Merrifield
1,1980,Physics,James Cronin
2,1980,Physics,Val Fitch
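
In Python, `&` and `|` bind more tightly than `==`, which is why each comparison in the PySpark filter above is wrapped in its own parentheses. In SQL, `AND` binds more tightly than `OR`, so the parentheses in the equivalent query are only for readability. The following sketch shows that query, assuming `sqlite3` as a stand-in for Spark and a few rows copied from this tutorial:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (yr INTEGER, subject TEXT, winner TEXT)")
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?)",
    [
        (1980, "Physics", "James Cronin"),
        (1980, "Physics", "Val Fitch"),
        (1980, "Literature", "Czeslaw Milosz"),
        (1984, "Chemistry", "Bruce Merrifield"),
        (1984, "Physics", "Carlo Rubbia"),
    ],
)

# Two (subject, year) pairs combined with OR.
rows = conn.execute(
    "SELECT yr, subject, winner FROM nobel "
    "WHERE (subject = 'Physics' AND yr = 1980) "
    "   OR (subject = 'Chemistry' AND yr = 1984) "
    "ORDER BY winner"
).fetchall()
print(rows)
```

The 1984 Physics row and the 1980 Literature row match neither pair, so they are excluded.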


## 9. Exclude Chemists and Medics

**Show the year, subject, and name of winners for 1980, excluding Chemistry and Medicine.**

In [11]:
(nobel.filter((nobel['yr']==1980) & ~nobel['subject'].isin(['Chemistry', 'Medicine']))
     .select('yr', 'subject', 'winner').toPandas())

Unnamed: 0,yr,subject,winner
0,1980,Economics,Lawrence R. Klein
1,1980,Literature,Czeslaw Milosz
2,1980,Peace,Adolfo Pérez Esquivel
3,1980,Physics,James Cronin
4,1980,Physics,Val Fitch


## 10. Early Medicine, Late Literature

Show the year, subject, and name of people who won a 'Medicine' prize in an early year (before 1910, not including 1910) together with winners of a 'Literature' prize in a later year (2004 or later).

In [12]:
(nobel.filter(((nobel['yr']<1910) & (nobel['subject']=='Medicine')) |
              ((nobel['yr']>=2004) & (nobel['subject']=='Literature')))
     .select('yr', 'subject', 'winner').toPandas())

Unnamed: 0,yr,subject,winner
0,2015,Literature,Svetlana Alexievich
1,2014,Literature,Patrick Modiano
2,2013,Literature,Alice Munro
3,2012,Literature,Mo Yan
4,2011,Literature,Tomas Tranströmer
5,2010,Literature,Mario Vargas Llosa
6,2009,Literature,Herta Müller
7,2008,Literature,Jean-Marie Gustave Le Clézio
8,2007,Literature,Doris Lessing
9,2006,Literature,Orhan Pamuk


## 11. Umlaut

Find all details of the prize won by PETER GRÜNBERG.

> _Non-ASCII characters_   
> The u in his name has an umlaut. You may find this link useful <https://en.wikipedia.org/wiki/%C3%9C#Keyboarding>

In [13]:
from pyspark.sql.functions import upper
nobel.filter(upper(nobel['winner'])=='PETER GRÜNBERG').toPandas()

Unnamed: 0,yr,subject,winner
0,2007,Physics,Peter Grünberg
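
If the Ü key is not available, Python string literals also accept Unicode escapes. The sketch below is plain Python rather than Spark, and the variable names are illustrative. It shows the escape for Ü (U+00DC) and that Python's own `str.upper()` maps ü to Ü, which is consistent with the match returned by the Spark query above:

```python
# "\u00dc" is the Unicode escape for Ü; "\u00fc" is the escape for ü.
target = "PETER GR\u00dcNBERG"
stored = "Peter Gr\u00fcnberg"

# str.upper() is Unicode-aware, so the lowercase ü becomes Ü.
matches = stored.upper() == target
print(target, matches)
```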


## 12. Apostrophe

Find all details of the prize won by EUGENE O'NEILL.

> _Escaping single quotes_   
> In SQL, you can't put a single quote directly inside a single-quoted string; write two single quotes instead. In Python, escape the quote with a backslash or delimit the string with double quotes.

In [14]:
nobel.filter(upper(nobel['winner'])=='EUGENE O\'NEILL').toPandas()

Unnamed: 0,yr,subject,winner
0,1936,Literature,Eugene O'Neill
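
To see the SQL rule from the note in action, here is a minimal sketch assuming `sqlite3` as a stand-in for Spark. It compares a literal with a doubled single quote against a bound parameter, which avoids manual escaping altogether; the names `escaped` and `bound` are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (yr INTEGER, subject TEXT, winner TEXT)")

# Inside a SQL string literal, '' stands for one single quote.
conn.execute("INSERT INTO nobel VALUES (1936, 'Literature', 'Eugene O''Neill')")

# Option 1: double the quote inside the SQL literal.
escaped = conn.execute(
    "SELECT yr, subject, winner FROM nobel WHERE winner = 'Eugene O''Neill'"
).fetchall()

# Option 2: bind the value as a parameter; no escaping is needed.
bound = conn.execute(
    "SELECT yr, subject, winner FROM nobel WHERE winner = ?",
    ("Eugene O'Neill",),
).fetchall()

print(escaped == bound, escaped)
```

Parameter binding is usually the safer choice when the value comes from user input.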


## 13. Knights of the realm

Knights in order

**List the winners, year, and subject where the winner starts with Sir. Show the most recent first, then by name order.**

In [15]:
from pyspark.sql.functions import desc
(nobel.filter(nobel['winner'].startswith('Sir'))
     .select('winner', 'yr', 'subject')
     .orderBy(desc('yr'), 'winner').toPandas())

Unnamed: 0,winner,yr,subject
0,Sir Martin J. Evans,2007,Medicine
1,Sir Peter Mansfield,2003,Medicine
2,Sir Paul Nurse,2001,Medicine
3,Sir Harold Kroto,1996,Chemistry
4,Sir James W. Black,1988,Medicine
5,Sir Arthur Lewis,1979,Economics
6,Sir Nevill F. Mott,1977,Physics
7,Sir Bernard Katz,1970,Medicine
8,Sir John Eccles,1963,Medicine
9,Sir Frank Macfarlane Burnet,1960,Medicine


## 14. Chemistry and Physics last

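In SQL, the expression **subject IN ('Chemistry','Physics')** can be used as a value: it evaluates to 0 or 1. The PySpark counterpart, `isin`, yields a boolean column, and `false` sorts before `true`.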

**Show the 1984 winners and subject ordered by subject and winner name; but list Chemistry and Physics last.**

In [16]:
(nobel.withColumn('flg', nobel['subject'].isin(['Chemistry', 'Physics']))
      .filter(nobel['yr']==1984)
      .orderBy('flg', 'subject', 'winner')
      .select('winner', 'subject').toPandas())

Unnamed: 0,winner,subject
0,Richard Stone,Economics
1,Jaroslav Seifert,Literature
2,César Milstein,Medicine
3,Georges J.F. Köhler,Medicine
4,Niels K. Jerne,Medicine
5,Desmond Tutu,Peace
6,Bruce Merrifield,Chemistry
7,Carlo Rubbia,Physics
8,Simon van der Meer,Physics
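
The 0-or-1 behaviour of the `IN` expression can be checked directly in SQL. The following sketch assumes `sqlite3` as a stand-in for Spark and uses a subset of the 1984 rows shown above, inserted in scrambled order; the alias `flg` mirrors the column name used in the PySpark cell:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (yr INTEGER, subject TEXT, winner TEXT)")
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?)",
    [
        (1984, "Physics", "Carlo Rubbia"),
        (1984, "Peace", "Desmond Tutu"),
        (1984, "Chemistry", "Bruce Merrifield"),
        (1984, "Economics", "Richard Stone"),
        (1984, "Literature", "Jaroslav Seifert"),
    ],
)

# The IN expression evaluates to 0 or 1, so sorting on it first
# pushes Chemistry and Physics (flg = 1) to the end.
ordered = conn.execute(
    "SELECT winner, subject, subject IN ('Chemistry', 'Physics') AS flg "
    "FROM nobel WHERE yr = 1984 "
    "ORDER BY flg, subject, winner"
).fetchall()
for row in ordered:
    print(row)
```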


In [17]:
sc.stop()