# A Visual History of Nobel Prize Winners



The Nobel Prize is perhaps the world's best-known scientific award. Every year it is given to scientists and scholars in chemistry, literature, physics, medicine, economics, and peace. In this project, we explore patterns and trends in over 100 years' worth of Nobel Prize winners.

Specifically, we try to answer the following questions:

1. Who gets the Nobel Prize?
2. What is the gender of a typical Nobel Prize winner?
   (Male count vs. female count)
3. Which country dominates each field?
4. How dominant is the USA? (visualized)
5. Do capitals dominate over non-capital cities?
6. Who was the first woman to win the Nobel Prize?
7. How do organizations compare with individuals?
8. Who are the oldest and youngest winners?
9. How does age differ between prize categories?

The dataset we use can be found on Kaggle: https://www.kaggle.com/nobelfoundation/nobel-laureates

## 1. Data Cleaning & Transformations:

We start by inspecting and cleaning the data. We look at missing and noisy values and clean the data as far as possible. We also apply some transformations and pre-processing so that the data can answer our target questions.

In [1]:
import pandas as pd
import numpy as np

In [2]:
#Inspecting the data
nobel_data = pd.read_csv("Data - Nobel Prize Laureates.csv")
nobel_data.sample(10)

Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
399,1967,Medicine,The Nobel Prize in Physiology or Medicine 1967,"""for their discoveries concerning the primary ...",1/3,387,Individual,George Wald,1906-11-18,"New York, NY",United States of America,Male,Harvard University,"Cambridge, MA",United States of America,1997-04-12,"Cambridge, MA",United States of America
923,2013,Medicine,The Nobel Prize in Physiology or Medicine 2013,"""for their discoveries of machinery regulating...",1/3,885,Individual,Randy W. Schekman,1948-12-30,"St. Paul, MN",United States of America,Male,University of California,"Berkeley, CA",United States of America,,,
730,1999,Physics,The Nobel Prize in Physics 1999,"""for elucidating the quantum structure of elec...",1/2,159,Individual,Martinus J.G. Veltman,1931-06-27,Waalwijk,Netherlands,Male,,Bilthoven,Netherlands,,,
829,2007,Medicine,The Nobel Prize in Physiology or Medicine 2007,"""for their discoveries of principles for intro...",1/3,811,Individual,Mario R. Capecchi,1937-10-06,Verona,Italy,Male,University of Utah,"Salt Lake City, UT",United States of America,,,
654,1993,Literature,The Nobel Prize in Literature 1993,"""who in novels characterized by visionary forc...",1/1,670,Individual,Toni Morrison,1931-02-18,"Lorain, OH",United States of America,Female,,,,,,
365,1963,Peace,The Nobel Peace Prize 1963,,1/2,482,Organization,Comité international de la Croix Rouge (Intern...,,,,,,,,,,
12,1902,Physics,The Nobel Prize in Physics 1902,"""in recognition of the extraordinary service t...",1/2,3,Individual,Pieter Zeeman,1865-05-25,Zonnemaire,Netherlands,Male,Amsterdam University,Amsterdam,Netherlands,1943-10-09,Amsterdam,Netherlands
920,2013,Economics,The Sveriges Riksbank Prize in Economic Scienc...,"""for their empirical analysis of asset prices""",1/3,896,Individual,Robert J. Shiller,1946-03-29,"Detroit, MI",United States of America,Male,Yale University,"New Haven, CT",United States of America,,,
348,1961,Physics,The Nobel Prize in Physics 1961,"""for his pioneering studies of electron scatte...",1/2,75,Individual,Robert Hofstadter,1915-02-05,"New York, NY",United States of America,Male,Stanford University,"Stanford, CA",United States of America,1990-11-17,"Stanford, CA",United States of America
213,1939,Physics,The Nobel Prize in Physics 1939,"""for the invention and development of the cycl...",1/1,47,Individual,Ernest Orlando Lawrence,1901-08-08,"Canton, SD",United States of America,Male,University of California,"Berkeley, CA",United States of America,1958-08-27,"Palo Alto, CA",United States of America


**Unique Values and Duplicate Records:**    
We look at the number of unique values in each column and check the shape of the dataset. If every row described a distinct laureate, the number of rows would equal the number of unique values in the Laureate ID column.

In [3]:
#print number of rows by number of columns
nobel_data.shape

(969, 18)

In [4]:
#print the count of unique values in each column
nobel_data.nunique()

Year                    113
Category                  6
Prize                   579
Motivation              565
Prize Share               4
Laureate ID             904
Laureate Type             2
Full Name               904
Birth Date              868
Birth City              601
Birth Country           121
Sex                       2
Organization Name       315
Organization City       186
Organization Country     29
Death Date              582
Death City              291
Death Country            50
dtype: int64

The data contains 969 records but only 904 unique laureate IDs, so some laureates appear in more than one row. We treat these repeats as duplicates and remove them with pandas' drop_duplicates method, then check that the number of records matches the number of unique IDs.

In [5]:
#based on the ID column, drop duplicates keeping only the first occurrence
nobel_no_dup = nobel_data.drop_duplicates(subset='Laureate ID', keep='first')

#check that the number of records is now the same as the number of unique IDs
nobel_no_dup.shape

(904, 18)
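
Before dropping rows, it can help to see which IDs repeat and why. The sketch below uses a small invented frame, not the real dataset; only the `Laureate ID` column name is taken from the notebook, and every value is an assumption made for illustration.

```python
import pandas as pd

# Toy stand-in for nobel_data; all rows are invented for illustration.
toy = pd.DataFrame({
    "Year": [1903, 1911, 1917, 1963, 1921],
    "Laureate ID": [6, 6, 482, 482, 26],
    "Full Name": ["Laureate A", "Laureate A", "Org B", "Org B", "Laureate C"],
})

# keep=False marks every row of a repeated ID, not only the later ones.
repeated = toy[toy.duplicated(subset="Laureate ID", keep=False)]

# Same call as in the notebook: keep the first row for each ID.
deduped = toy.drop_duplicates(subset="Laureate ID", keep="first")

print(repeated)
print(deduped.shape)
```

In this toy frame the repeated IDs are different prize years for the same laureate, so `keep='first'` retains only the earliest prize. Whether that is acceptable depends on the question being asked.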

**Missing Values Analysis and Imputation:**    
We will now analyze the missing values and try to derive conclusions about what are plausible values and techniques to impute with, and what values can be manually filled using the internet.

In [6]:
#count the missing values in each column
missing_values = nobel_no_dup.isnull().sum()

#to express this as a percentage we need the number of missing cells
#and the total number of cells (rows multiplied by columns)
total_cells = np.prod(nobel_no_dup.shape)
total_missing_cells = missing_values.sum()

#percentage of cells that are missing
percentage_missing_values = (total_missing_cells / total_cells) * 100
print("Missing Data Percentage: %.2f%%" % percentage_missing_values)

Missing Data Percentage: 11.54%
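
The overall figure hides which columns carry the gaps. As a minimal sketch on an invented frame (the column names mirror the dataset, the values do not), the per-column share of missing cells can be read from the mean of `isna()`:

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Full Name": ["A", "B", "C", "D"],
    "Birth Date": ["1901-01-01", np.nan, "1930-05-05", "1950-09-09"],
    "Death Date": [np.nan, np.nan, "1999-01-01", np.nan],
})

# isna() gives a boolean frame; the mean of each column is its missing fraction.
missing_pct_by_column = toy.isna().mean() * 100

# Overall share of missing cells, equivalent to missing cells / total cells.
overall_pct = toy.isna().to_numpy().mean() * 100

print(missing_pct_by_column.round(2))
print(round(overall_pct, 2))
```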


Since 11.54% of the cells are missing, well above the 2-3% we would consider safe to discard, we cannot simply drop the records with missing values. Further analysis of the missing values is needed. We start by looking at the "Laureate Type" column.  


In [7]:
#check the different values in the column along with their counts
nobel_no_dup['Laureate Type'].value_counts()

Individual      877
Organization     27
Name: Laureate Type, dtype: int64

We note that there are two unique values for the "Laureate Type" column: Individual and Organization. Most of the records are individuals, and only a very small percentage are organizations. It is particularly interesting to look at the demographic data of the organization records.

In [8]:
#get only records of type organization
org = nobel_no_dup.loc[nobel_no_dup['Laureate Type'] == "Organization"]
org

Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
24,1904,Peace,The Nobel Peace Prize 1904,,1/1,467,Organization,Institut de droit international (Institute of ...,,,,,,,,,,
61,1910,Peace,The Nobel Peace Prize 1910,,1/1,477,Organization,Bureau international permanent de la Paix (Per...,,,,,,,,,,
90,1917,Peace,The Nobel Peace Prize 1917,,1/1,482,Organization,Comité international de la Croix Rouge (Intern...,,,,,,,,,,
206,1938,Peace,The Nobel Peace Prize 1938,,1/1,503,Organization,Office international Nansen pour les Réfugiés ...,,,,,,,,,,
244,1947,Peace,The Nobel Peace Prize 1947,,1/2,508,Organization,Friends Service Council (The Quakers),,,,,,,,,,
245,1947,Peace,The Nobel Peace Prize 1947,,1/2,509,Organization,American Friends Service Committee (The Quakers),,,,,,,,,,
295,1954,Peace,The Nobel Peace Prize 1954,,1/1,515,Organization,Office of the United Nations High Commissioner...,,,,,,,,,,
366,1963,Peace,The Nobel Peace Prize 1963,,1/2,523,Organization,Ligue des Sociétés de la Croix-Rouge (League o...,,,,,,,,,,
383,1965,Peace,The Nobel Peace Prize 1965,,1/1,525,Organization,United Nations Children's Fund (UNICEF),,,,,,,,,,
416,1969,Peace,The Nobel Peace Prize 1969,,1/1,527,Organization,International Labour Organization (I.L.O.),,,,,,,,,,


We note that the organization records have NaN values in the demographic columns, except for a few records, such as Mother Teresa, that have demographic data and were mistakenly recorded as organizations. We will identify these records and correct them. We also decide not to impute the missing demographic values for the organization records: it makes sense for this data to be NaN, and, more importantly, we do not need it to answer any of the target questions listed at the top of the notebook.

In [9]:
#identify records that are labeled as organization but have a non-null birth date
isOrg = nobel_no_dup['Laureate Type'] == "Organization"
isBD = pd.notna(nobel_no_dup['Birth Date'])
mask = isOrg & isBD

#relabel these records as individuals
#pandas may emit a SettingWithCopyWarning here because nobel_no_dup was derived from nobel_data;
#the assignment still updates nobel_no_dup, as the next cells confirm
nobel_no_dup.loc[mask, 'Laureate Type'] = "Individual"

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
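
The message above is pandas' SettingWithCopyWarning. Assuming it arises because the de-duplicated frame was derived from another frame, one common way to avoid it is to take an explicit copy before assigning. The sketch below shows this on an invented frame; `raw` and `no_dup` are hypothetical names standing in for `nobel_data` and `nobel_no_dup`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for nobel_data; values are invented for illustration.
raw = pd.DataFrame({
    "Laureate ID": [1, 1, 2, 3],
    "Laureate Type": ["Individual", "Individual", "Organization", "Organization"],
    "Birth Date": ["1900-01-01", "1900-01-01", "1910-08-26", np.nan],
})

# .copy() makes the de-duplicated frame independent of `raw`.
no_dup = raw.drop_duplicates(subset="Laureate ID", keep="first").copy()

# Organizations that nevertheless have a birth date are relabeled.
mask = (no_dup["Laureate Type"] == "Organization") & no_dup["Birth Date"].notna()
no_dup.loc[mask, "Laureate Type"] = "Individual"

print(no_dup)
```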


Now we check records which are labeled organization, and expect all of them to be actual organizations.

In [10]:
#get only records of type organization
org = nobel_no_dup.loc[nobel_no_dup['Laureate Type'] == "Organization"]
org

Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
24,1904,Peace,The Nobel Peace Prize 1904,,1/1,467,Organization,Institut de droit international (Institute of ...,,,,,,,,,,
61,1910,Peace,The Nobel Peace Prize 1910,,1/1,477,Organization,Bureau international permanent de la Paix (Per...,,,,,,,,,,
90,1917,Peace,The Nobel Peace Prize 1917,,1/1,482,Organization,Comité international de la Croix Rouge (Intern...,,,,,,,,,,
206,1938,Peace,The Nobel Peace Prize 1938,,1/1,503,Organization,Office international Nansen pour les Réfugiés ...,,,,,,,,,,
244,1947,Peace,The Nobel Peace Prize 1947,,1/2,508,Organization,Friends Service Council (The Quakers),,,,,,,,,,
245,1947,Peace,The Nobel Peace Prize 1947,,1/2,509,Organization,American Friends Service Committee (The Quakers),,,,,,,,,,
295,1954,Peace,The Nobel Peace Prize 1954,,1/1,515,Organization,Office of the United Nations High Commissioner...,,,,,,,,,,
366,1963,Peace,The Nobel Peace Prize 1963,,1/2,523,Organization,Ligue des Sociétés de la Croix-Rouge (League o...,,,,,,,,,,
383,1965,Peace,The Nobel Peace Prize 1965,,1/1,525,Organization,United Nations Children's Fund (UNICEF),,,,,,,,,,
416,1969,Peace,The Nobel Peace Prize 1969,,1/1,527,Organization,International Labour Organization (I.L.O.),,,,,,,,,,


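We also re-display the rows selected by the masks from In [9]. The masks were computed before the correction, so they still select the same four records; what we expect to see is that these records are now labeled Individual.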

In [11]:
nobel_no_dup.loc[isOrg & isBD]

Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
452,1973,Peace,The Nobel Peace Prize 1973,,1/2,531,Individual,Le Duc Tho,1911-10-14,Nam Ha province,Vietnam,Male,,,,1990-10-13,Hanoi,Vietnam
520,1979,Peace,The Nobel Peace Prize 1979,,1/1,540,Individual,Mother Teresa,1910-08-26,Uskup (Skopje),Ottoman Empire (Republic of Macedonia),Female,,,,1997-09-05,Calcutta,India
620,1989,Peace,The Nobel Peace Prize 1989,,1/1,551,Individual,The 14th Dalai Lama (Tenzin Gyatso),1935-07-06,Taktser,Tibet (People's Republic of China),Male,,,,,,
640,1991,Peace,The Nobel Peace Prize 1991,"""for her non-violent struggle for democracy an...",1/1,553,Individual,Aung San Suu Kyi,1945-06-19,Rangoon (Yangon),Burma (Myanmar),Female,,,,,,
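
Because a boolean mask is a snapshot of the frame at the time it was computed, checking that no mislabeled rows remain requires recomputing it. A minimal sketch on invented data, where `mislabeled` is a hypothetical helper and not part of the notebook:

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Full Name": ["Person X", "Org Y"],
    "Laureate Type": ["Organization", "Organization"],
    "Birth Date": ["1910-08-26", np.nan],
})

def mislabeled(frame):
    """Rows labeled Organization that nevertheless have a birth date."""
    return (frame["Laureate Type"] == "Organization") & frame["Birth Date"].notna()

stale_mask = mislabeled(toy)  # computed before the fix
toy.loc[stale_mask, "Laureate Type"] = "Individual"

print(toy.loc[stale_mask])       # the old mask still selects the corrected row
print(toy.loc[mislabeled(toy)])  # the recomputed mask selects nothing
```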


The organization records are now in order. Next we analyze the individuals and their missing values. We first select only the individual records, then display the count of missing values in each column.

In [12]:
#get only records of type individual
ind = nobel_no_dup.loc[nobel_no_dup['Laureate Type'] == "Individual"]
ind.sample(10)

Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
785,2004,Chemistry,The Nobel Prize in Chemistry 2004,"""for the discovery of ubiquitin-mediated prote...",1/3,779,Individual,Aaron Ciechanover,1947-10-01,Haifa,British Protectorate of Palestine (Israel),Male,Technion - Israel Institute of Technology,Haifa,Israel,,,
938,2014,Medicine,The Nobel Prize in Physiology or Medicine 2014,"""for their discoveries of cells that constitut...",1/4,905,Individual,Edvard I. Moser,1962-04-27,Ålesund,Norway,Male,Norwegian University of Science and Technology...,Trondheim,Norway,,,
358,1962,Physics,The Nobel Prize in Physics 1962,"""for his pioneering theories for condensed mat...",1/1,77,Individual,Lev Davidovich Landau,1908-01-22,Baku,Russian Empire (Azerbaijan),Male,Academy of Sciences,Moscow,Union of Soviet Socialist Republics,1968-04-01,Moscow,Union of Soviet Socialist Republics (Russia)
481,1976,Chemistry,The Nobel Prize in Chemistry 1976,"""for his studies on the structure of boranes i...",1/1,249,Individual,William N. Lipscomb,1919-12-09,"Cleveland, OH",United States of America,Male,Harvard University,"Cambridge, MA",United States of America,2011-04-14,"Cambridge, MA",United States of America
52,1909,Literature,The Nobel Prize in Literature 1909,"""in appreciation of the lofty idealism, vivid ...",1/1,579,Individual,Selma Ottilia Lovisa Lagerlöf,1858-11-20,Mårbacka,Sweden,Female,,,,1940-03-16,Mårbacka,Sweden
536,1981,Chemistry,The Nobel Prize in Chemistry 1981,"""for their theories, developed independently, ...",1/2,257,Individual,Kenichi Fukui,1918-10-04,Nara,Japan,Male,Kyoto University,Kyoto,Japan,1998-01-09,Kyoto,Japan
161,1931,Literature,The Nobel Prize in Literature 1931,"""The poetry of Erik Axel Karlfeldt""",1/1,604,Individual,Erik Axel Karlfeldt,1864-07-20,Karlbo,Sweden,Male,,,,1931-04-08,Stockholm,Sweden
527,1980,Chemistry,The Nobel Prize in Chemistry 1980,"""for their contributions concerning the determ...",1/4,255,Individual,Walter Gilbert,1932-03-21,"Boston, MA",United States of America,Male,"Harvard University, Biological Laboratories","Cambridge, MA",United States of America,,,
838,2008,Chemistry,The Nobel Prize in Chemistry 2008,"""for the discovery and development of the gree...",1/3,829,Individual,Osamu Shimomura,1928-08-27,Kyoto,Japan,Male,Marine Biological Laboratory (MBL),"Woods Hole, MA",United States of America,,,
534,1980,Physics,The Nobel Prize in Physics 1980,"""for the discovery of violations of fundamenta...",1/2,116,Individual,James Watson Cronin,1931-09-29,"Chicago, IL",United States of America,Male,University of Chicago,"Chicago, IL",United States of America,2016-08-25,"St. Paul, MN",United States of America


In [13]:
#get missing values count in individual records
ind_missing_values = ind.isnull().sum()
ind_missing_values

Year                      0
Category                  0
Prize                     0
Motivation               71
Prize Share               0
Laureate ID               0
Laureate Type             0
Full Name                 0
Birth Date                2
Birth City                2
Birth Country             0
Sex                       0
Organization Name       220
Organization City       218
Organization Country    218
Death Date              292
Death City              309
Death Country           303
dtype: int64

We now observe some interesting patterns in the data. The birth date and the birth city each have only two missing values, which we can identify and fill manually.  
There is also a large number of missing values for the organization name, organization city and organization country. This is most probably because some individuals, such as Naguib Mahfouz, were not affiliated with any organization. The death date, death city and death country also have many missing values, probably because the laureates concerned are still alive.  
Since the death dates and the organizations matter little to our target questions, we keep them as NaN. This is expected: if a laureate is still alive, the absent death date is not a missing value and the record can still be considered complete.  
We focus instead on the 2 missing values each for "Birth Date" and "Birth City", since they are important to our analysis.

In [14]:
#get records who are individuals and their birth city is NaN
isInd = nobel_no_dup['Laureate Type'] == "Individual"
isCityNan = pd.isna(nobel_no_dup['Birth City'])

nobel_no_dup.loc[isInd & isCityNan]

Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
750,2001,Literature,The Nobel Prize in Literature 2001,"""for having united perceptive narrative and in...",1/1,747,Individual,Sir Vidiadhar Surajprasad Naipaul,1932-08-17,,Trinidad,Male,,,,,,
880,2010,Peace,The Nobel Peace Prize 2010,"""for his long and non-violent struggle for fun...",1/1,855,Individual,Liu Xiaobo,1955-12-28,,China,Male,,,,,,


Now that we have identified the 2 records whose birth city is NaN, we fill them in manually. Looking them up online, we find that Sir Vidiadhar Surajprasad Naipaul was born in Chaguanas and Liu Xiaobo in Changchun.

In [15]:
nobel_no_dup.loc[nobel_no_dup['Full Name'] == "Sir Vidiadhar Surajprasad Naipaul", 'Birth City'] = "Chaguanas"
nobel_no_dup.loc[nobel_no_dup['Full Name'] == "Liu Xiaobo", 'Birth City'] = "Changchun"


In [16]:
#get records who are individuals and their birth date is NaN
isInd = nobel_no_dup['Laureate Type'] == "Individual"
isBDNan = pd.isna(nobel_no_dup['Birth Date'])

nobel_no_dup.loc[isInd & isBDNan]

Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
853,2009,Chemistry,The Nobel Prize in Chemistry 2009,"""for studies of the structure and function of ...",1/3,841,Individual,Venkatraman Ramakrishnan,,"Chidambaram, Tamil Nadu",India,Male,MRC Laboratory of Molecular Biology,Cambridge,United Kingdom,,,
894,2011,Physics,The Nobel Prize in Physics 2011,"""for the discovery of the accelerating expansi...",1/2,864,Individual,Saul Perlmutter,,"Champaign-Urbana, IL",United States of America,Male,Lawrence Berkeley National Laboratory,"Berkeley, CA",United States of America,,,


Likewise, looking them up online, we find that Venkatraman Ramakrishnan was born on 1 April 1952 and Saul Perlmutter on 22 September 1959. We fill these in manually as well.

In [17]:
nobel_no_dup.loc[nobel_no_dup['Full Name'] == "Venkatraman Ramakrishnan", 'Birth Date'] = '1952-1-4'
nobel_no_dup.loc[nobel_no_dup['Full Name'] == "Saul Perlmutter", 'Birth Date'] = '1959-22-9'


Now we check the missing values in individuals. We expect not to have any missing values in birth date and birth city.

In [18]:
ind = nobel_no_dup.loc[nobel_no_dup['Laureate Type'] == "Individual"]
#get missing values count in individual records
ind_missing_values = ind.isnull().sum()
ind_missing_values

Year                      0
Category                  0
Prize                     0
Motivation               71
Prize Share               0
Laureate ID               0
Laureate Type             0
Full Name                 0
Birth Date                0
Birth City                0
Birth Country             0
Sex                       0
Organization Name       220
Organization City       218
Organization Country    218
Death Date              292
Death City              309
Death Country           303
dtype: int64

As for the Motivation column, we need it to gain insight into who gets the Nobel Prize (see our analysis questions at the top of the notebook), for example by finding the most frequent words. For that purpose we do not need every value to be present. Filling the gaps manually is infeasible, and imputing free text with the most frequent value would not be meaningful, so we keep the column as it is and return to it later.

Since our target questions ask about the age of individuals when they received the prize, we add an age column derived from the birth date and the year of the prize. We first check the type of the date column and then try to extract the year from it.

In [19]:
#check the data type of each column
nobel_no_dup.dtypes

Year                     int64
Category                object
Prize                   object
Motivation              object
Prize Share             object
Laureate ID              int64
Laureate Type           object
Full Name               object
Birth Date              object
Birth City              object
Birth Country           object
Sex                     object
Organization Name       object
Organization City       object
Organization Country    object
Death Date              object
Death City              object
Death Country           object
dtype: object

We notice that the "Birth Date" column is of type string rather than datetime, so we try to convert it using pandas' to_datetime.

In [20]:
#nobel_no_dup['Birth Date'] = pd.to_datetime(nobel_no_dup['Birth Date'])

#when running the commented line above we get this error:
#ValueError: month must be in 1..12


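The error above means that at least one value cannot be read as a year-month-day date. The birth dates we entered by hand in In [17] are written year-day-month, and '1959-22-9' has no valid month 22, which is enough to trigger this error. Since we are only interested in the year, which comes first in every value, we use string manipulation to extract it and parse it into an integer. We then subtract the birth year from the prize year to obtain the age column that we need.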
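
One way to find which values fail to parse, rather than only learning that some do, is to convert with `errors="coerce"` and look for the entries that turn into NaT. This is a minimal sketch on an invented series; the index labels are hypothetical, and the explicit `format` is an assumption that the column is meant to be year-month-day.

```python
import pandas as pd

# Toy birth dates; the middle value is written year-day-month.
birth_dates = pd.Series(
    ["1906-11-18", "1959-22-9", "1931-02-18"],
    index=["Laureate A", "Laureate B", "Laureate C"],
)

# With an explicit format and errors="coerce", unparseable strings become NaT
# instead of raising a ValueError.
parsed = pd.to_datetime(birth_dates, format="%Y-%m-%d", errors="coerce")

# Non-null inputs that failed to parse are the malformed ones.
malformed = birth_dates[parsed.isna() & birth_dates.notna()]
print(malformed)
```

Rewriting such a value in year-month-day order (for example '1959-09-22') would let the whole column parse.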

In [21]:
#get organization records alone
org = nobel_no_dup.loc[nobel_no_dup['Laureate Type'] == "Organization"]
#get individual records alone
ind = nobel_no_dup.loc[nobel_no_dup['Laureate Type'] == "Individual"]

#split the date using "-" and store the result in 3 columns in a new dataframe df
df = pd.DataFrame()
df[ ['1','2','3'] ] = ind['Birth Date'].str.split("-", expand=True)
#the year is stored in the first column
df['Birth Year'] = df['1']
#drop 1,2,3 and keep year only
df = df.drop(['1', '2', '3'], axis=1)

#check that every extracted year has exactly 4 characters:
#the filter below should return an empty frame
df.loc[df['Birth Year'].str.len() != 4]

Unnamed: 0,Birth Year


Now that we extracted the year correctly, we proceed to modifying the ind dataframe and the org accordingly, and concatenate them.

In [22]:
#attach the birth year to the ind dataframe and convert it to int
ind = pd.concat([ind, df], axis=1)
ind['Birth Year'] = ind['Birth Year'].astype(int)

#calculate the age as the difference between the prize year and the birth year
ind['Age'] = ind['Year'].astype(int) - ind['Birth Year']

#organizations have no age; use 0 as a placeholder so both frames share the same columns
#(organization rows should be filtered out before computing any age statistics)
org.insert(0, 'Age', 0)

#rebuild nobel_no_dup as the concatenation of the modified ind and org
nobel_no_dup = pd.concat([ind, org], axis=0, sort=False)
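
A placeholder age of 0 keeps the columns aligned, but it would pull down any average computed over the whole frame. As an alternative sketch, not the approach used in this notebook, pandas' nullable `Int64` type can leave the age of organizations missing. The frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Year": [1901, 1963, 2014],
    "Laureate Type": ["Individual", "Organization", "Individual"],
    "Birth Date": ["1852-08-30", np.nan, "1997-07-12"],
})

# The first four characters are the year; missing dates stay missing.
birth_year = pd.to_numeric(toy["Birth Date"].str[:4]).astype("Int64")

# Nullable integers keep <NA> for organizations instead of a 0 placeholder.
toy["Age"] = toy["Year"] - birth_year

print(toy[["Year", "Laureate Type", "Age"]])
print(toy["Age"].mean())  # missing ages are skipped
```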


The "Laureate Type" column has only two values: Individual and Organization. We therefore encode it as binary integers (0/1) to make it easier to handle. The category codes follow alphabetical order, so Individual becomes 0 and Organization becomes 1.

In [23]:
nobel_no_dup["Laureate Type"] = nobel_no_dup["Laureate Type"].astype('category')
nobel_no_dup["Laureate Type"] = nobel_no_dup["Laureate Type"].cat.codes
nobel_no_dup

Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country,Birth Year,Age
0,1901,Chemistry,The Nobel Prize in Chemistry 1901,"""in recognition of the extraordinary services ...",1/1,160,0,Jacobus Henricus van 't Hoff,1852-08-30,Rotterdam,Netherlands,Male,Berlin University,Berlin,Germany,1911-03-01,Berlin,Germany,1852.0,49
1,1901,Literature,The Nobel Prize in Literature 1901,"""in special recognition of his poetic composit...",1/1,569,0,Sully Prudhomme,1839-03-16,Paris,France,Male,,,,1907-09-07,Châtenay,France,1839.0,62
2,1901,Medicine,The Nobel Prize in Physiology or Medicine 1901,"""for his work on serum therapy, especially its...",1/1,293,0,Emil Adolf von Behring,1854-03-15,Hansdorf (Lawice),Prussia (Poland),Male,Marburg University,Marburg,Germany,1917-03-31,Marburg,Germany,1854.0,47
3,1901,Peace,The Nobel Peace Prize 1901,,1/2,462,0,Jean Henry Dunant,1828-05-08,Geneva,Switzerland,Male,,,,1910-10-30,Heiden,Switzerland,1828.0,73
4,1901,Peace,The Nobel Peace Prize 1901,,1/2,463,0,Frédéric Passy,1822-05-20,Paris,France,Male,,,,1912-06-12,Paris,France,1822.0,79
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
821,2006,Peace,The Nobel Peace Prize 2006,"""for their efforts to create economic and soci...",1/2,810,1,Grameen Bank,,,,,,,,,,,,0
833,2007,Peace,The Nobel Peace Prize 2007,"""for their efforts to build up and disseminate...",1/2,818,1,Intergovernmental Panel on Climate Change (IPCC),,,,,,,,,,,,0
909,2012,Peace,The Nobel Peace Prize 2012,"""for over six decades contributed to the advan...",1/1,881,1,European Union (EU),,,,,,,,,,,,0
927,2013,Peace,The Nobel Peace Prize 2013,"""for its extensive efforts to eliminate chemic...",1/1,893,1,Organisation for the Prohibition of Chemical W...,,,,,,,,,,,,0
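
Once the labels are replaced by codes, the mapping between them is easy to lose. A small sketch, on an invented series, of how the code-to-label mapping can be read from the categorical before the column is overwritten:

```python
import pandas as pd

# Toy column; values are invented for illustration.
laureate_type = pd.Series(["Individual", "Organization", "Individual"]).astype("category")

# Code i corresponds to categories[i]; categories are sorted alphabetically.
code_to_label = dict(enumerate(laureate_type.cat.categories))
codes = laureate_type.cat.codes

print(code_to_label)
print(codes.tolist())
```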


We encode the "Sex" column as well. It has only 2 values: Male and Female. The column also contains NaN values for the organization records, so we build a dictionary and map the values instead of using .cat.codes.

In [24]:
#mapping from each label to its integer code
sex_codes = {'Male' : 0, 'Female' : 1}

#convert the column to categorical
nobel_no_dup["Sex"] = nobel_no_dup["Sex"].astype('category')

#remap the values of the column; NaN entries stay NaN
nobel_no_dup["Sex"] = nobel_no_dup["Sex"].map(sex_codes)

nobel_no_dup.sample(10)

Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country,Birth Year,Age
547,1982,Chemistry,The Nobel Prize in Chemistry 1982,"""for his development of crystallographic elect...",1/1,259,0,Aaron Klug,1926-08-11,Zelvas,Lithuania,0,MRC Laboratory of Molecular Biology,Cambridge,United Kingdom,,,,1926.0,56
685,1996,Chemistry,The Nobel Prize in Chemistry 1996,"""for their discovery of fullerenes""",1/3,284,0,Robert F. Curl Jr.,1933-08-23,"Alice, TX",United States of America,0,Rice University,"Houston, TX",United States of America,,,,1933.0,63
266,1950,Physics,The Nobel Prize in Physics 1950,"""for his development of the photographic metho...",1/1,55,0,Cecil Frank Powell,1903-12-05,Tonbridge,United Kingdom,0,Bristol University,Bristol,United Kingdom,1969-08-09,,Italy,1903.0,47
367,1963,Physics,The Nobel Prize in Physics 1963,"""for his contributions to the theory of the at...",1/2,78,0,Eugene Paul Wigner,1902-11-17,Budapest,Austria-Hungary (Hungary),0,Princeton University,"Princeton, NJ",United States of America,1995-01-01,"Princeton, NJ",United States of America,1902.0,61
272,1951,Physics,The Nobel Prize in Physics 1951,"""for their pioneer work on the transmutation o...",1/2,56,0,Sir John Douglas Cockcroft,1897-05-27,Todmorden,United Kingdom,0,Atomic Energy Research Establishment,"Harwell, Berkshire",United Kingdom,1967-09-18,Cambridge,United Kingdom,1897.0,54
50,1908,Physics,The Nobel Prize in Physics 1908,"""for his method of reproducing colours photogr...",1/1,12,0,Gabriel Lippmann,1845-08-16,Hollerich,Luxembourg,0,Sorbonne University,Paris,France,1921-07-13,,,1845.0,63
205,1938,Medicine,The Nobel Prize in Physiology or Medicine 1938,"""for the discovery of the role played by the s...",1/1,333,0,Corneille Jean François Heymans,1892-03-28,Ghent,Belgium,0,Ghent University,Ghent,Belgium,1968-07-18,Knokke,Belgium,1892.0,46
548,1982,Economics,The Sveriges Riksbank Prize in Economic Scienc...,"""for his seminal studies of industrial structu...",1/1,696,0,George J. Stigler,1911-01-17,"Renton, WA",United States of America,0,University of Chicago,"Chicago, IL",United States of America,1991-12-01,"Chicago, IL",United States of America,1911.0,71
132,1926,Peace,The Nobel Peace Prize 1926,,1/2,491,0,Gustav Stresemann,1878-05-10,Berlin,Germany,0,,,,1929-10-03,Berlin,Germany,1878.0,48
749,2001,Economics,The Sveriges Riksbank Prize in Economic Scienc...,"""for their analyses of markets with asymmetric...",1/3,746,0,Joseph E. Stiglitz,1943-02-09,"Gary, IN",United States of America,0,Columbia University,"New York, NY",United States of America,,,,1943.0,58
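
The difference between the two encoding routes shows up on the missing values. In this sketch on an invented series, `.cat.codes` turns NaN into -1, while mapping through a dictionary leaves it as NaN; `sex_codes` here is a hypothetical name for the mapping.

```python
import numpy as np
import pandas as pd

# Toy column; the last entry stands for an organization with no sex recorded.
sex = pd.Series(["Male", "Female", np.nan])
sex_codes = {"Male": 0, "Female": 1}

# Dictionary mapping: missing values remain missing.
mapped = sex.map(sex_codes)

# Category codes: alphabetical order (Female=0, Male=1) and -1 for missing.
cat_codes = sex.astype("category").cat.codes

print(mapped.tolist())
print(cat_codes.tolist())
```

With the dictionary, the choice of which label gets which code is explicit, and organization rows cannot be mistaken for a third category.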


For the "Category" column, we will examine its unique values and see what we can do with them:

In [25]:
nobel_no_dup['Category'].value_counts()

Medicine      211
Physics       203
Chemistry     173
Peace         126
Literature    113
Economics      78
Name: Category, dtype: int64

We notice that there are only 6 distinct values, so we encode them to get rid of the strings. Since the categories have no natural ordering, plain integer codes on their own would suggest an order that does not exist. We therefore label-encode them first and then convert the labels to a one-hot encoding.

In [26]:
#first we label them using numerical encoding

from sklearn.preprocessing import LabelEncoder

labelEncoder = LabelEncoder()

nobel_no_dup['Category'] = labelEncoder.fit_transform(nobel_no_dup['Category'])

nobel_no_dup.sample(10)


Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country,Birth Year,Age
398,1967,3,The Nobel Prize in Physiology or Medicine 1967,"""for their discoveries concerning the primary ...",1/3,386,0,Haldan Keffer Hartline,1903-12-22,"Bloomsburg, PA",United States of America,0,Rockefeller University,"New York, NY",United States of America,1983-03-17,"Fallston, MD",United States of America,1903.0,64
94,1919,2,The Nobel Prize in Literature 1919,"""in special appreciation of his epic, <I>Olymp...",1/1,588,0,Carl Friedrich Georg Spitteler,1845-04-24,Liestal,Switzerland,0,,,,1924-12-29,Lucerne,Switzerland,1845.0,74
940,2014,4,The Nobel Peace Prize 2014,"""for their struggle against the suppression of...",1/2,914,0,Malala Yousafzai,1997-07-12,Mingora,Pakistan,1,,,,,,,1997.0,17
641,1991,5,The Nobel Prize in Physics 1991,"""for discovering that methods developed for st...",1/1,141,0,Pierre-Gilles de Gennes,1932-10-24,Paris,France,0,Collège de France,Paris,France,2007-05-18,Orsay,France,1932.0,59
367,1963,5,The Nobel Prize in Physics 1963,"""for his contributions to the theory of the at...",1/2,78,0,Eugene Paul Wigner,1902-11-17,Budapest,Austria-Hungary (Hungary),0,Princeton University,"Princeton, NJ",United States of America,1995-01-01,"Princeton, NJ",United States of America,1902.0,61
387,1966,0,The Nobel Prize in Chemistry 1966,"""for his fundamental work concerning chemical ...",1/1,232,0,Robert S. Mulliken,1896-06-07,"Newburyport, MA",United States of America,0,University of Chicago,"Chicago, IL",United States of America,1986-10-31,"Arlington, VA",United States of America,1896.0,70
621,1989,5,The Nobel Prize in Physics 1989,"""for the invention of the separated oscillator...",1/2,135,0,Norman F. Ramsey,1915-08-27,"Washington, DC",United States of America,0,Harvard University,"Cambridge, MA",United States of America,2011-11-04,"Wayland, MA",United States of America,1915.0,74
139,1927,5,The Nobel Prize in Physics 1927,"""for his discovery of the effect named after him""",1/2,33,0,Arthur Holly Compton,1892-09-10,"Wooster, OH",United States of America,0,University of Chicago,"Chicago, IL",United States of America,1962-03-15,"Berkeley, CA",United States of America,1892.0,35
576,1985,3,The Nobel Prize in Physiology or Medicine 1985,"""for their discoveries concerning the regulati...",1/2,432,0,Michael S. Brown,1941-04-13,"New York, NY",United States of America,0,University of Texas Southwestern Medical Cente...,"Dallas, TX",United States of America,,,,1941.0,44
407,1968,5,The Nobel Prize in Physics 1968,"""for his decisive contributions to elementary ...",1/1,89,0,Luis Walter Alvarez,1911-06-13,"San Francisco, CA",United States of America,0,University of California,"Berkeley, CA",United States of America,1988-09-01,"Berkeley, CA",United States of America,1911.0,57
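
The integer codes in the "Category" column can be traced back to their names through the fitted encoder. As a minimal sketch on a toy list (not the notebook's data, and with illustrative variable names), `LabelEncoder` sorts the distinct labels and stores them in `classes_`, so the label at position i is the one encoded as i:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only; the six category names come from the dataset
toy_categories = ["Physics", "Peace", "Chemistry", "Medicine",
                  "Economics", "Literature", "Physics"]

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_categories)

# classes_ holds the sorted distinct labels; the index of each label is its integer code
code_to_label = dict(enumerate(toy_encoder.classes_))
print(code_to_label)

# inverse_transform maps integer codes back to the original strings
print(list(toy_encoder.inverse_transform(toy_codes)))
```

In the notebook itself, the same information should be available as `labelEncoder.classes_` once the cell above has run.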


In [27]:
#We notice that the encoding is as follows:
# 0=Chemistry, 1=Economics, 2=Literature, 3=Medicine, 4=Peace, 5=Physics

#Now we proceed to transforming the encoding into one hot

from sklearn.preprocessing import OneHotEncoder

oneHotEncoder = OneHotEncoder()
#The encoder expects a 2D input, hence the reshape to a single column
oh = oneHotEncoder.fit_transform(nobel_no_dup['Category'].values.reshape(-1,1)).toarray()

#Name the one hot columns after the categories, in the same order as the integer codes
dfOneHot = pd.DataFrame(oh, columns = ["Chemistry", "Economics", "Literature", "Medicine", "Peace", "Physics"])

#Reset the index so that concat aligns the rows by position
nobel_no_dup.reset_index(drop=True, inplace=True)

nobel_no_dup = pd.concat([nobel_no_dup, dfOneHot], axis=1)

nobel_no_dup.sample(10)




Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,...,Death City,Death Country,Birth Year,Age,Chemistry,Economics,Literature,Medicine,Peace,Physics
457,1977,0,The Nobel Prize in Chemistry 1977,"""for his contributions to non-equilibrium ther...",1/1,250,0,Ilya Prigogine,1917-01-25,Moscow,...,Brussels,Belgium,1917.0,60,1.0,0.0,0.0,0.0,0.0,0.0
355,1965,5,The Nobel Prize in Physics 1965,"""for their fundamental work in quantum electro...",1/3,85,0,Julian Schwinger,1918-02-12,"New York, NY",...,"Los Angeles, CA",United States of America,1918.0,47,0.0,0.0,0.0,0.0,0.0,1.0
391,1970,3,The Nobel Prize in Physiology or Medicine 1970,"""for their discoveries concerning the humoral ...",1/3,395,0,Ulf von Euler,1905-02-07,Stockholm,...,Stockholm,Sweden,1905.0,65,0.0,0.0,0.0,1.0,0.0,0.0
367,1967,3,The Nobel Prize in Physiology or Medicine 1967,"""for their discoveries concerning the primary ...",1/3,385,0,Ragnar Granit,1900-10-30,Helsinki,...,Stockholm,Sweden,1900.0,67,0.0,0.0,0.0,1.0,0.0,0.0
677,1999,0,The Nobel Prize in Chemistry 1999,"""for his studies of the transition states of c...",1/1,292,0,Ahmed H. Zewail,1946-02-26,Damanhur,...,,,1946.0,53,1.0,0.0,0.0,0.0,0.0,0.0
482,1979,2,The Nobel Prize in Literature 1979,"""for his poetry, which, against the background...",1/1,655,0,Odysseus Elytis,1911-11-02,Iráklion,...,Athens,Greece,1911.0,68,0.0,0.0,1.0,0.0,0.0,0.0
610,1993,1,The Sveriges Riksbank Prize in Economic Scienc...,"""for having renewed research in economic histo...",1/2,710,0,Douglass C. North,1920-11-05,"Cambridge, MA",...,"Benzonia, MI",United States of America,1920.0,73,0.0,1.0,0.0,0.0,0.0,0.0
662,1997,5,The Nobel Prize in Physics 1997,"""for development of methods to cool and trap a...",1/3,152,0,Steven Chu,1948-02-28,"St. Louis, MO",...,,,1948.0,49,0.0,0.0,0.0,0.0,0.0,1.0
577,1989,3,The Nobel Prize in Physiology or Medicine 1989,"""for their discovery of the cellular origin of...",1/2,441,0,Harold E. Varmus,1939-12-18,"Oceanside, NY",...,,,1939.0,50,0.0,0.0,0.0,1.0,0.0,0.0
162,1932,5,The Nobel Prize in Physics 1932,"""for the creation of quantum mechanics, the ap...",1/1,38,0,Werner Karl Heisenberg,1901-12-05,Würzburg,...,Munich,West Germany (Germany),1901.0,31,0.0,0.0,0.0,0.0,0.0,1.0
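
As an alternative to the two-step `LabelEncoder` plus `OneHotEncoder` route, pandas can produce the same kind of indicator columns in one call. The sketch below uses a small hypothetical frame (`toy`) rather than `nobel_no_dup`, and `dtype=float` is chosen only to mirror the 1.0/0.0 values shown above:

```python
import pandas as pd

# Hypothetical toy frame; only the "Category" column mirrors the real dataset
toy = pd.DataFrame({
    "Year": [1901, 1903, 1969, 2014],
    "Category": ["Physics", "Chemistry", "Economics", "Peace"],
})

# One indicator column per distinct category, in sorted order
dummies = pd.get_dummies(toy["Category"], dtype=float)

# Both frames share the same index, so concat aligns the rows correctly
toy_encoded = pd.concat([toy, dummies], axis=1)
print(toy_encoded)
```

Unlike the integer route, this keeps the original strings in "Category", and columns appear only for the categories actually present in the data.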


## 2. Data Visualization:

Now that the data is cleaned, we proceed to creating the visualizations and answering our target questions: