## Data analysis with Pandas
Pandas is a data analysis library for Python that makes it easy to read, analyze and manipulate many kinds of data.

Simplifying quite a lot, the core of Pandas is the DataFrame: roughly the equivalent of a *spreadsheet*, a table of observations (rows) for which we have different kinds of property values or categories (columns).

Pandas includes a large set of tools for working with the information in these DataFrames, ranging from very simple tools (select column subsets, arithmetic operations, aggregates such as means or standard deviations) to more complex functions for advanced data wrangling. Here we will focus on the simple ones, with brief mentions of the more advanced features. In general, both the documentation and the community around Pandas are excellent, and plenty of resources on how to carry out a specific operation on a DataFrame can be found on the web.

In [1]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Data reading
The most straightforward way to read data into a DataFrame is the `pd.read_csv()` function, which accepts the name of a comma-separated values (CSV) file.

For well-formed CSV files, where the comma is the separator and the first line contains the column headers, the function can be used directly. If the data is laid out differently, additional arguments allow us to accept other separators, override header detection, and so on.
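
As a minimal sketch of those extra arguments, the example below reads a small in-memory table (a made-up stand-in for a file, built with `io.StringIO`) that uses semicolons as separators and has no header row. The column names passed through `names` are chosen here only for illustration.

```python
import io
import pandas as pd

# Made-up data standing in for a file: semicolon-separated, no header row
raw = io.StringIO("white;7.0;3.00\nred;7.4;3.51\n")

# sep sets the separator, header=None says there is no header row,
# and names supplies the column names ourselves
toy = pd.read_csv(raw, sep=";", header=None, names=["type", "fixed acidity", "pH"])
print(toy)
```

The same arguments work when a filename is passed instead of the in-memory buffer.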

In [2]:
# By default, the first row will be assigned as the headers of the columns
data = pd.read_csv("datasets/wine-quality-white-and-red.csv")
display(data)

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6
1,white,6.3,0.300,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6
2,white,8.1,0.280,0.40,6.9,0.050,30.0,97.0,0.99510,3.26,0.44,10.1,6
3,white,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
4,white,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,red,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
6493,red,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
6494,red,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
6495,red,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


### Working with DataFrames
Some key aspects:
- Columns can have different types: numeric, strings...
- Rows have unique *indices*, which can be either simple sequential numbers or any kind of **id**.
- Columns have *names*: usually strings that are headers that describe that column.
- To access rows or columns by their index/name, we have the `.loc[row_index,column_name]` indexer.
- If instead we want them by position, there is `.iloc[i,j]`.
    - To get multiple positions at once, we can also use ranges `a:b`.
- If we want all rows/columns, use a colon (`:`) in place of the index -> e.g. `.loc[:,column_name]` for all rows for a given column, or `.loc[row_index,:]` for all columns for a given row.
    - Here, *row_index* or *column_name* can also be **lists** with several values to get multiple rows or columns.
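
The following sketch, on a small made-up DataFrame with string row labels, illustrates the difference between label-based and position-based access. Note that slices behave differently: `.loc` includes the end label, while `.iloc` excludes the end position, as usual in Python.

```python
import pandas as pd

# Made-up table with string labels as row indices
toy = pd.DataFrame(
    {"type": ["white", "red", "red"], "pH": [3.0, 3.5, 3.4]},
    index=["w1", "r1", "r2"],
)

by_label = toy.loc["r1", "pH"]        # row labelled "r1", column named "pH"
by_position = toy.iloc[1, 1]          # second row, second column
label_slice = toy.loc["w1":"r1", :]   # label slice: the end label is included
pos_slice = toy.iloc[0:1, :]          # positional slice: the end position is excluded

print(by_label, by_position)
print(len(label_slice), "rows by label,", len(pos_slice), "row by position")
```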


In [3]:
### Select columns by name: type, pH, alcohol and quality
sel1 = data.loc[:,["type","pH","alcohol","quality"]]
display(sel1)

Unnamed: 0,type,pH,alcohol,quality
0,white,3.00,8.8,6
1,white,3.30,9.5,6
2,white,3.26,10.1,6
3,white,3.19,9.9,6
4,white,3.19,9.9,6
...,...,...,...,...
6492,red,3.45,10.5,5
6493,red,3.52,11.2,6
6494,red,3.42,11.0,6
6495,red,3.57,10.2,5


In [4]:
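### Select columns 0, 9, 10 and 11, and the first 20 rows (positions 0 to 19)
# .copy() makes sel2 an independent DataFrame, which avoids a possible
# SettingWithCopyWarning when new columns are added to it later
sel2 = data.iloc[0:20,[0,9,10,11]].copy()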
display(sel2)

Unnamed: 0,type,pH,sulphates,alcohol
0,white,3.0,0.45,8.8
1,white,3.3,0.49,9.5
2,white,3.26,0.44,10.1
3,white,3.19,0.4,9.9
4,white,3.19,0.4,9.9
5,white,3.26,0.44,10.1
6,white,3.18,0.47,9.6
7,white,3.0,0.45,8.8
8,white,3.3,0.49,9.5
9,white,3.22,0.45,11.0


In [37]:
# For numeric columns, we can do simple and complex arithmetic
op1 = sel2.loc[:,"alcohol"] / 100
display(op1)

0     0.088
1     0.095
2     0.101
3     0.099
4     0.099
5     0.101
6     0.096
7     0.088
8     0.095
9     0.110
10    0.120
11    0.097
12    0.108
13    0.124
14    0.097
15    0.114
16    0.096
17    0.128
18    0.113
19    0.095
Name: alcohol, dtype: float64

In [38]:
op2 = 10**(-sel2.loc[:,"pH"])
display(op2)

0     0.001000
1     0.000501
2     0.000550
3     0.000646
4     0.000646
5     0.000550
6     0.000661
7     0.001000
8     0.000501
9     0.000603
10    0.001023
11    0.000724
12    0.000661
13    0.000288
14    0.001047
15    0.000562
16    0.000575
17    0.000468
18    0.000759
19    0.000603
Name: pH, dtype: float64

In [6]:
# And we can also easily assign these operated columns to the DataFrame, by naming a new column
sel2.loc[:,"proton_concentration"] = 10**(-sel2.loc[:,"pH"])
display(sel2)

Unnamed: 0,type,pH,sulphates,alcohol,proton_concentration
0,white,3.0,0.45,8.8,0.001
1,white,3.3,0.49,9.5,0.000501
2,white,3.26,0.44,10.1,0.00055
3,white,3.19,0.4,9.9,0.000646
4,white,3.19,0.4,9.9,0.000646
5,white,3.26,0.44,10.1,0.00055
6,white,3.18,0.47,9.6,0.000661
7,white,3.0,0.45,8.8,0.001
8,white,3.3,0.49,9.5,0.000501
9,white,3.22,0.45,11.0,0.000603


In [7]:
# We can access the names of the columns in a DF through the .columns attribute (an Index, which behaves much like a list)
print(sel2.columns)

Index(['type', 'pH', 'sulphates', 'alcohol', 'proton_concentration'], dtype='object')


### Data writing
DataFrames can be very directly stored as CSV files through the `.to_csv()` method.
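
Note that, by default, `.to_csv()` also writes the row index as an extra first column. As a small sketch on a made-up DataFrame (written to an in-memory buffer rather than to a real file), `index=False` can be passed when the index carries no useful information:

```python
import io
import pandas as pd

# Made-up table, only for illustration
toy = pd.DataFrame({"type": ["white", "red"], "pH": [3.0, 3.5]})

# Write to an in-memory buffer instead of a file, skipping the index column
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read it back to check that only the two original columns were stored
buffer.seek(0)
back = pd.read_csv(buffer)
print(back)
```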

In [8]:
sel2.to_csv("Selected_DF_wines.csv")

In [47]:
# We can also easily do statistical operations on the data: mean, standard deviation... of one or several columns
# remembering to remove non-numeric columns first
sel3 = sel2.loc[:,["pH","sulphates","alcohol","proton_concentration"]]
op3 = sel3.mean()
display(op3)

pH                       3.194500
sulphates                0.486000
alcohol                 10.320000
proton_concentration     0.000668
dtype: float64

In [48]:
# Or do a very quick statistical analysis with the .describe() method
op4 = sel3.describe()
display(op4)

Unnamed: 0,pH,sulphates,alcohol,proton_concentration
count,20.0,20.0,20.0,20.0
mean,3.1945,0.486,10.32,0.000668
std,0.135238,0.0787,1.147354,0.000205
min,2.98,0.36,8.8,0.000288
25%,3.135,0.44,9.575,0.00055
50%,3.205,0.48,9.9,0.000624
75%,3.26,0.53,11.075,0.000733
max,3.54,0.67,12.8,0.001047


### Cleaning data
Scientific data is often not completely clean, and the tables we work with tend to have some "blanks": some observations (rows) lack a value for one or more properties (columns). Handling missing data is a key part of the data analysis pipeline, and must be addressed early to avoid problems!

By default, missing data is represented as a **NaN** (not a number) value in the DataFrame.
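
A quick way to see how much data is missing is to count the NaN values. The sketch below uses a small made-up table, not the wine dataset:

```python
import numpy as np
import pandas as pd

# Made-up table with a few missing values
toy = pd.DataFrame({
    "type": ["white", "white", "red"],
    "pH": [3.0, np.nan, 3.4],
    "alcohol": [8.8, 9.5, np.nan],
})

# .isna() marks every missing cell as True; summing counts them per column
missing_per_column = toy.isna().sum()
# Number of rows that have at least one missing value
rows_with_missing = toy.isna().any(axis=1).sum()

print(missing_per_column)
print(rows_with_missing, "rows have at least one missing value")
```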

In [11]:
data2 = pd.read_csv("datasets/wine-faulty.csv")
display(data2)

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.27,0.36,20.7,,45.0,170.0,1.0010,3.00,0.45,8.8,6.0
1,white,6.3,0.30,0.34,1.6,0.049,14.0,132.0,0.9940,3.30,0.49,9.5,6.0
2,white,8.1,0.28,,,0.050,30.0,97.0,0.9951,3.26,0.44,10.1,6.0
3,white,7.2,,0.32,,0.058,47.0,186.0,0.9956,3.19,0.40,9.9,6.0
4,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.40,9.9,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
100,white,6.9,0.24,0.33,1.7,0.035,47.0,136.0,0.9900,3.26,0.40,12.6,7.0
101,white,7.1,0.44,0.62,11.8,0.044,52.0,,0.9975,3.12,0.46,8.7,6.0
102,white,7.2,0.39,0.63,11.0,0.044,55.0,156.0,0.9974,3.09,0.44,8.7,6.0
103,white,6.8,0.25,0.31,13.3,0.050,69.0,202.0,0.9972,3.22,0.48,9.7,6.0


In [12]:
# Many built-in functions will handle NaNs automatically
op5 = data2.loc[:,"total sulfur dioxide"].mean()
display(op5)

140.91414141414143

Nevertheless, further uses of this data can still run into issues, and we should not just rely on Python/Pandas automatically taking care of them. Moreover, observations lacking certain fields might simply be faulty, prompting us to exclude them from the dataset directly.

This can be easily achieved with the `.dropna()` method. If used as-is, it will drop every row that has one or more NA values.
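
`.dropna()` also accepts arguments to control which rows are dropped. A brief sketch on a made-up table, showing the `how` and `subset` arguments:

```python
import numpy as np
import pandas as pd

# Made-up table: row 0 is complete, rows 1 and 2 miss one value, row 3 misses both
toy = pd.DataFrame({
    "pH": [3.0, np.nan, 3.2, np.nan],
    "alcohol": [8.8, 9.5, np.nan, np.nan],
})

any_missing = toy.dropna()                       # default: drop rows with ANY missing value
all_missing = toy.dropna(how="all")              # drop only rows where ALL values are missing
alcohol_only = toy.dropna(subset=["alcohol"])    # only look at the "alcohol" column

print(len(any_missing), len(all_missing), len(alcohol_only))
```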

In [49]:

print("Original",data2.shape)
data2_clean = data2.dropna()
print("Cleaned",data2_clean.shape)

# So 46 rows were dropped due to having some problematic value(s)

Original (105, 13)
Cleaned (59, 13)


In [14]:
# We can also work column-wise, in cases where it is clear that only some columns have missing info and 
# we are not interested in them. However, here every column except "type" has some NA value
data2_tooclean = data2.dropna(axis="columns")
display(data2_tooclean)

Unnamed: 0,type
0,white
1,white
2,white
3,white
4,white
...,...
100,white
101,white
102,white
103,white


Another option, depending on the case, is not to remove NA values but to convert them to some other value. This can be a default numeric value, such as assuming that every missing value is zero, or any other kind of number or string to flag a *problematic situation*. This is particularly useful for compatibility with other tools or programs.

For example, if we need to pass this data to a program that marks problematic values as **-1**, or as **ERR**...
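
`.fillna()` also accepts a dictionary, so each column can get its own replacement value. A minimal sketch on a made-up table, where the flags `-1` and `ERR` are just the illustrative values mentioned above:

```python
import numpy as np
import pandas as pd

# Made-up table with one missing string and one missing number
toy = pd.DataFrame({
    "type": ["white", None, "red"],
    "pH": [3.0, np.nan, 3.4],
})

# One replacement value per column: a string flag for text, -1 for numbers
filled = toy.fillna({"type": "ERR", "pH": -1})
print(filled)
```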

In [15]:
data2_zeroed = data2.fillna(-1)
display(data2_zeroed)

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.27,0.36,20.7,-1.000,45.0,170.0,1.0010,3.00,0.45,8.8,6.0
1,white,6.3,0.30,0.34,1.6,0.049,14.0,132.0,0.9940,3.30,0.49,9.5,6.0
2,white,8.1,0.28,-1.00,-1.0,0.050,30.0,97.0,0.9951,3.26,0.44,10.1,6.0
3,white,7.2,-1.00,0.32,-1.0,0.058,47.0,186.0,0.9956,3.19,0.40,9.9,6.0
4,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.40,9.9,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
100,white,6.9,0.24,0.33,1.7,0.035,47.0,136.0,0.9900,3.26,0.40,12.6,7.0
101,white,7.1,0.44,0.62,11.8,0.044,52.0,-1.0,0.9975,3.12,0.46,8.7,6.0
102,white,7.2,0.39,0.63,11.0,0.044,55.0,156.0,0.9974,3.09,0.44,8.7,6.0
103,white,6.8,0.25,0.31,13.3,0.050,69.0,202.0,0.9972,3.22,0.48,9.7,6.0


In [16]:
# It is also possible to have duplicated rows, which should also be removed for cleanliness
data2_unique = data2.drop_duplicates()
display(data2_unique)

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.270,0.36,20.7,,45.0,170.0,1.0010,3.00,0.45,8.8,6.0
1,white,6.3,0.300,0.34,1.6,0.049,14.0,132.0,0.9940,3.30,0.49,9.5,6.0
2,white,8.1,0.280,,,0.050,30.0,97.0,0.9951,3.26,0.44,10.1,6.0
3,white,7.2,,0.32,,0.058,47.0,186.0,0.9956,3.19,0.40,9.9,6.0
4,white,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.40,9.9,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,white,7.1,0.260,0.29,12.4,0.044,62.0,,0.9969,3.04,0.42,9.2,
96,white,,0.340,0.66,15.9,0.046,26.0,164.0,0.9979,3.14,0.50,8.8,6.0
97,white,8.6,0.265,0.36,1.2,0.034,15.0,,0.9913,2.95,0.36,11.4,7.0
98,white,9.8,0.360,0.46,10.5,0.038,4.0,83.0,0.9956,2.89,0.30,10.1,4.0


### Data exploration
Until now, we have mostly focused on knowing how data is handled in Pandas and on it being "correct". However, the core purpose of data analysis approaches is to explore the data and use it to answer actual questions. 

Thus, the rest of the lesson will be focused on using some basic tools to explore two datasets of interest, considering their particularities and targets.
1. For the wines dataset, learn more about the differences between white and red wines.
2. For a dataset of Nobel awardees across history, explore how they are distributed.

Further insights on both datasets will be seen a bit later, in the **data visualization** lesson.

#### Grouping
For a *category* in the DataFrame, we can *aggregate* through it to get different measurements, such as the *average* across the category, the *count* of items, the total *sum*...
<img src='df_grouping.png' width=500>
<br>
Here, we consider the *color* as the categorical variable (blue or yellow-orange), and aggregate. The result in the table on the right would be the **average** of all the blue rows and of all the yellow rows in the dataset.
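
The same idea can be sketched in code with a tiny made-up table, where the `color` and `value` column names are only illustrative:

```python
import pandas as pd

# Made-up table: a categorical column and a numeric one
toy = pd.DataFrame({
    "color": ["blue", "yellow", "blue", "yellow"],
    "value": [1.0, 2.0, 3.0, 6.0],
})

# Aggregate the numeric column through the category
means = toy.groupby("color")["value"].mean()
counts = toy.groupby("color")["value"].count()

print(means)
print(counts)
```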

In [50]:
#### Compare the parameters of both kinds of wine
# Simple approach: use a boolean condition (mask) to select the entries for white wine and for red wine
white = data[data.loc[:,"type"] == "white"]
red = data[data.loc[:,"type"] == "red"]

# How many of each?
print(len(white),"entries for white wine")
print(len(red),"entries for red wine")

4898 entries for white wine
1599 entries for red wine


In [51]:
# Ideally, we should take categories from the data: more robust and automated
wine_types = data.loc[:,"type"].unique()
print("Wine types:",wine_types)
wine_dfs = []
for typ in wine_types:
    df = data[data.loc[:,"type"] == typ]
    wine_dfs.append(df)
    print("%d entries for %s wine" % (len(df),typ))

Wine types: ['white' 'red']
4898 entries for white wine
1599 entries for red wine


In [19]:
# Do the mean alcohol & sugar contents vary much between both types? And the acidity?
print("White")
print("Sugar",white.loc[:,"residual sugar"].mean())
print("Alcohol",white.loc[:,"alcohol"].mean())
print("pH",white.loc[:,"pH"].mean())
print("Red")
print("Sugar",red.loc[:,"residual sugar"].mean())
print("Alcohol",red.loc[:,"alcohol"].mean())
print("pH",red.loc[:,"pH"].mean())

White
Sugar 6.391414863209474
Alcohol 10.51426704777011
pH 3.1882666394446715
Red
Sugar 2.53880550343965
Alcohol 10.422983114446529
pH 3.3111131957473416


In [None]:
# Do the same in loop with categories

This more "manual" approach has the advantage of being a bit clearer to the inexperienced user, and also more flexible: we may show properties only for white wines, excluding the red ones, or use different functions for different cases.

However, it is not the cleanest one, as Pandas offers many tools to inspect data: here, the `.groupby()` method will allow us to get prettier and easier results just out of the box.

In [53]:
grp = data.groupby("type")
grp[["alcohol","residual sugar","pH"]].mean()

Unnamed: 0_level_0,alcohol,residual sugar,pH
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
red,10.422983,2.538806,3.311113
white,10.514267,6.391415,3.188267


In [21]:
# We can also get a more thorough insight on a single column: for example, the alcohol content
grp[["alcohol"]].describe()

Unnamed: 0_level_0,alcohol,alcohol,alcohol,alcohol,alcohol,alcohol,alcohol,alcohol
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
type,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
red,1599.0,10.422983,1.065668,8.4,9.5,10.2,11.1,14.9
white,4898.0,10.514267,1.230621,8.0,9.5,10.4,11.4,14.2


But with `.groupby()` we are also more restricted if, for example, we want to exclude a category, or compute different things for different cases. Whether to prefer the cleaner, Pandas-like way (groupby) or the simpler approach strongly depends on the question to answer, the level of comfort with DataFrames, etc. There is no right or wrong way to do things!

**Exercise**: instead of by type, group the wines by their quality, and explore their mean content in alcohol and citric acid, as well as their total acidity.

*Hint*. Look at the DataFrame to recall the names of the columns.

In [None]:
### Exercise

#### Multi-grouping
More intricate relationships can be determined through this kind of data manipulation. 
For example, we may be interested in the *mean alcohol content* for each *type* of wine depending on its *quality*. We then have two categories, organizing data like:
- Quality 3, white wine
- Quality 3, red wine
- Quality 4, white wine

and so on.

A reasonable way to inspect this kind of information is a matrix-like table where the columns are the different quality ratings, and the rows are the wine types. Every cell in the table will be the mean alcohol content across the corresponding group.

A way to do this is to create a pivot table (`pd.pivot_table()`), where we reorganize the DataFrame to aggregate through the "buckets" determined by category pairs. Like before, we may use different aggregation functions, such as *mean*, *count*, *sum*...

<img src='df_pivot.png' width=450>
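
As a toy sketch of this reorganization (the `color`, `shape` and `value` columns are made up for illustration), each cell of the pivot table holds the mean of the values falling in that color/shape bucket:

```python
import pandas as pd

# Made-up table with two categorical columns and one value column
toy = pd.DataFrame({
    "color": ["blue", "blue", "blue", "yellow", "yellow"],
    "shape": ["circle", "circle", "square", "circle", "square"],
    "value": [1.0, 3.0, 5.0, 2.0, 4.0],
})

# Rows are colors, columns are shapes, cells are the mean value per pair
pivot = pd.pivot_table(data=toy, index="color", columns="shape",
                       values="value", aggfunc="mean")
print(pivot)
```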


In [22]:
# Here, the "colors" will be the wine types, and the "shapes", the quality ratings, with the alcohol content 
# being the value column.

wine_pivot = pd.pivot_table(data=data,index="type",columns="quality",values="alcohol",aggfunc="mean")
display(wine_pivot)


quality,3,4,5,6,7,8,9
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
red,9.955,10.265094,9.899706,10.629519,11.465913,12.094444,
white,10.345,10.152454,9.80884,10.575372,11.367936,11.636,12.18


**Exercise**: get a similar table grouping by type and quality, but now targeting the residual sugar content as the value, and getting the count of values instead of the average.

In [None]:
### Exercise

In [23]:
### Other dataset of interest: Nobel awardees in history
nobel = pd.read_csv("datasets/nobel-complete.csv")
# Select some properties of interest to see how these awards are distributed and clean possible NA values
sel_nobel = nobel.loc[:,["awardYear","gender","birth_continent","category"]].dropna()
display(sel_nobel)

Unnamed: 0,awardYear,gender,birth_continent,category
0,2001,male,North America,Economic Sciences
1,1975,male,Europe,Physics
2,2004,male,Asia,Chemistry
3,1982,male,Europe,Chemistry
4,1979,male,Asia,Physics
...,...,...,...,...
942,2000,male,Europe,Physics
943,1980,male,Europe,Chemistry
945,1972,male,North America,Physics
946,1954,male,North America,Chemistry


While the wines dataset was more quantitative, including experimental values for chemical properties of the wines, here we have more *qualitative* information: who won the awards, where they were from, which kind of award they won...

Possible questions to answer with this data could be:
- How many winners from each continent are there for each category?
- Has the continental distribution of winners changed through time?
- How has the gender distribution of winners changed through time?

In general, to tackle these aspects we need to group data. We may use pivot tables as before (with the count aggregation function), but in this situation the `pd.crosstab()` function is simpler to use out of the box, as it just considers the "buckets" of the cross-categories and handles the counts.

*Hint*
- `pd.crosstab` is useful to directly get counts across two categories: simpler.
- `pd.pivot_table` is more oriented to more complex aggregations of a certain **value** through categories.

In [24]:
# Question 1: continental distribution by category

# Build a cross-table
table = pd.crosstab(sel_nobel.loc[:,"category"],sel_nobel.loc[:,"birth_continent"])
display(table)

# Later, we will visualize this for clearer insights

birth_continent,Africa,Asia,Europe,North America,Oceania,South America
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Chemistry,3,16,103,59,3,0
Economic Sciences,0,3,30,51,0,0
Literature,6,10,80,16,0,4
Peace,13,13,53,23,2,3
Physics,2,20,114,75,2,0
Physiology or Medicine,3,10,114,80,8,4


In [35]:
table_wp = pd.pivot_table(data=sel_nobel,index="category",columns="birth_continent",values="gender",
                       aggfunc="count",fill_value=0)
display(table_wp)

birth_continent,Africa,Asia,Europe,North America,Oceania,South America
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Chemistry,3,16,103,59,3,0
Economic Sciences,0,3,30,51,0,0
Literature,6,10,80,16,0,4
Peace,13,13,53,23,2,3
Physics,2,20,114,75,2,0
Physiology or Medicine,3,10,114,80,8,4


To find out how patterns have changed through time, we have a problem: the award year is too specific to be a category (we would have more than a hundred!).

A suitable solution would be to determine the award *decade* instead, which can be immediately derived from the year and gives us better categories. To do this, we could write a very simple function that given a year produces its corresponding decade, and then apply it to all years in the DataFrame, through the `.apply()` method.
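
As a small sketch on a few made-up years, the example below applies such a function with `.apply()` and compares it with the equivalent arithmetic done directly on the whole column, which Pandas also supports. The name `decade_of` is just illustrative.

```python
import pandas as pd

# A few made-up award years
years = pd.Series([1901, 1975, 2000, 2019])

def decade_of(year):
    # Integer division by 10 drops the last digit; multiplying restores the scale
    return (year // 10) * 10

via_apply = years.apply(decade_of)     # call the function on every element
vectorized = (years // 10) * 10        # same arithmetic on the whole Series at once

print(via_apply.tolist())
print(vectorized.tolist())
```

For simple arithmetic like this both routes give the same result; `.apply()` becomes more useful when the per-element logic cannot be written as a plain column operation.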

In [25]:
# Question 2: continental distribution by decade
def get_decade(year):
    # Takes a given year and obtains the decade it belongs to
    return (year // 10) * 10
sel_nobel.loc[:,"awardDecade"] = sel_nobel.loc[:,"awardYear"].apply(get_decade)
display(sel_nobel)

Unnamed: 0,awardYear,gender,birth_continent,category,awardDecade
0,2001,male,North America,Economic Sciences,2000
1,1975,male,Europe,Physics,1970
2,2004,male,Asia,Chemistry,2000
3,1982,male,Europe,Chemistry,1980
4,1979,male,Asia,Physics,1970
...,...,...,...,...,...
942,2000,male,Europe,Physics,2000
943,1980,male,Europe,Chemistry,1980
945,1972,male,North America,Physics,1970
946,1954,male,North America,Chemistry,1950


In [26]:
table2 = pd.crosstab(sel_nobel.loc[:,"awardDecade"],sel_nobel.loc[:,"birth_continent"])
print("Continents and decades")
display(table2)


Continents and decades


birth_continent,Africa,Asia,Europe,North America,Oceania,South America
awardDecade,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1900,0,2,52,1,1,0
1910,0,1,33,3,1,0
1920,0,1,48,5,0,0
1930,0,1,39,14,0,1
1940,0,1,22,14,1,2
1950,2,3,44,22,0,0
1960,2,5,39,24,4,1
1970,2,4,60,35,1,1
1980,4,6,44,36,0,4
1990,6,6,34,52,3,0


We may not be interested in *all* Nobel categories, but only in the Chemistry ones. To do this, we can repeat the same strategy, but filtering the dataset as we did before.

In [27]:
# Question 2b: Continental trends for decades, but only in Chemistry
sel_chem = sel_nobel[sel_nobel.loc[:,"category"] == "Chemistry"]
table3 = pd.crosstab(sel_chem.loc[:,"awardDecade"],sel_chem.loc[:,"birth_continent"])
print("Continents and decades in Chemistry")
display(table3)

# Visualization later

Continents and decades in Chemistry


birth_continent,Africa,Asia,Europe,North America,Oceania
awardDecade,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1900,0,0,8,0,1
1910,0,0,7,1,0
1920,0,0,10,0,0
1930,0,0,11,2,0
1940,0,0,5,4,0
1950,0,0,10,4,0
1960,1,0,10,4,0
1970,0,0,9,5,1
1980,0,3,8,10,0
1990,1,0,9,8,0


**Exercise**. Get a similar table, considering the gender breakdown instead of the geographical breakdown, for Nobel prizes in Chemistry (*question 3*).

Although we have considered qualitative categories, the dataset also contains some numbers. For instance, we could evaluate how much money has been awarded per decade: has the spending on Nobel prizes risen a lot over the decades?

This is **quantitative**: remember to use the pivot table!

In [28]:
### How much money was awarded per decade?
# .copy() gives an independent DataFrame, avoiding a possible SettingWithCopyWarning below
sel_nobel_money = nobel.loc[:,["awardYear","category","prizeAmount"]].copy()
sel_nobel_money.loc[:,"awardDecade"] = sel_nobel_money.loc[:,"awardYear"].apply(get_decade)

table4 = pd.pivot_table(sel_nobel_money,index="awardDecade",columns="category",values="prizeAmount",
              aggfunc="sum")
display(table4)

category,Chemistry,Economic Sciences,Literature,Peace,Physics,Physiology or Medicine
awardDecade,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1900,1269867.0,,1410726.0,1980892.0,1834230.0,1548203.0
1910,1139681.0,,1246673.0,972529.0,1415378.0,844911.0
1920,1357176.0,,1301135.0,1395741.0,1545801.0,1471575.0
1930,2103051.0,,1473058.0,1485606.0,1785806.0,2289794.0
1940,1193615.0,,826876.0,934742.0,950567.0,1887285.0
1950,2597368.0,,1894194.0,1503857.0,3863473.0,3811431.0
1960,4435660.0,750000.0,3198440.0,2543440.0,4788673.0,7395867.0
1970,8826000.0,9086000.0,6476000.0,7912000.0,15247000.0,14917000.0
1980,38565000.0,17655000.0,17655000.0,18805000.0,40860000.0,38815000.0
1990,126300000.0,111400000.0,67800000.0,118200000.0,149600000.0,135000000.0


**Exercise**. The previous calculations are biased: the table has one entry per winner, but some prizes are shared, so the prize money is shared as well.

A fair estimate of the money that was actually awarded requires scaling the prize amount in every row by the *portion* of the prize that each winner received.

Calculate this and assign it to a new column in the DataFrame, then create the pivot table.