<p style='text-align: right;'> Plotting and Programming in Python </p>

# Pandas DataFrames

## Q: Selection of Individual Values

Assume Pandas has been imported into your notebook and the Gapminder GDP data for Europe has been loaded:

In [3]:
import pandas as pd

df = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

Write an expression to find the Per Capita GDP of Serbia in 2007.

In [5]:
df.loc['Serbia', 'gdpPercap_2007']

9786.534714
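
As a minimal sketch of the same kind of lookup, the toy table below stands in for the Gapminder file (the country names and numbers are made up for illustration). It shows that a single value can be reached by label with `loc`, by position with `iloc`, or with the scalar-only accessor `at`.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder table; all values are invented.
toy = pd.DataFrame(
    {"gdpPercap_2002": [10.0, 20.0], "gdpPercap_2007": [11.0, 22.0]},
    index=pd.Index(["Aland", "Borduria"], name="country"),
)

by_label = toy.loc["Borduria", "gdpPercap_2007"]    # row label, column label
by_position = toy.iloc[1, 1]                         # row position, column position
fast_scalar = toy.at["Borduria", "gdpPercap_2007"]   # label-based, single value only

print(by_label, by_position, fast_scalar)
```

Label-based access is usually the safer choice here, since it keeps working if rows or columns are reordered.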

## Q: Extent of Slicing

1. Do the two statements below produce the same output?
2. Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?

In [None]:
print(df.iloc[0:2, 0:2])
print(df.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])

In [6]:
# No, the results differ: iloc excludes the end of the range, while loc includes it,
# i.e. positions 0 up to (but not including) 2 versus labels 'Albania' through 'Belgium' inclusive.

print(df.iloc[0:2, 0:2])
print(df.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])

         gdpPercap_1952  gdpPercap_1957
country                                
Albania     1601.056136     1942.284244
Austria     6137.076492     8842.598030
         gdpPercap_1952  gdpPercap_1957  gdpPercap_1962
country                                                
Albania     1601.056136     1942.284244     2312.888958
Austria     6137.076492     8842.598030    10750.721110
Belgium     8343.105127     9714.960623    10991.206760
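
The rule can be checked on a small table. This is a sketch on a made-up 3x3 DataFrame (the labels `a`-`c` and `x`-`z` are placeholders, not Gapminder data): the positional slice stops before its end point, while the label slice includes it.

```python
import pandas as pd

# Invented 3x3 table with simple row and column labels.
toy = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["a", "b", "c"],
    columns=["x", "y", "z"],
)

by_position = toy.iloc[0:2, 0:2]       # positions 0 and 1 only; position 2 is excluded
by_label = toy.loc["a":"c", "x":"z"]   # labels 'c' and 'z' are included

print(by_position.shape)
print(by_label.shape)
```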


## Q: Reconstructing Data

Explain what each line in the following short program does: what is in first, second, etc.?

In [None]:
first = pd.read_csv('data/gapminder_all.csv', index_col='country')
second = first[first['continent'] == 'Americas']
third = second.drop('Puerto Rico')
fourth = third.drop('continent', axis = 1)
fourth.to_csv('result.csv')

In [22]:
import pandas as pd

In [20]:
first = pd.read_csv('data/gapminder_all.csv', index_col='country')

This line loads the dataset containing the GDP data from all countries into a dataframe called first. 
The index_col='country' parameter selects which column to use as the row labels in the dataframe.

In [21]:
print(first)

                         continent  gdpPercap_1952  gdpPercap_1957  \
country                                                              
Algeria                     Africa     2449.008185     3013.976023   
Angola                      Africa     3520.610273     3827.940465   
Benin                       Africa     1062.752200      959.601080   
Botswana                    Africa      851.241141      918.232535   
Burkina Faso                Africa      543.255241      617.183465   
Burundi                     Africa      339.296459      379.564628   
Cameroon                    Africa     1172.667655     1313.048099   
Central African Republic    Africa     1071.310713     1190.844328   
Chad                        Africa     1178.665927     1308.495577   
Comoros                     Africa     1102.990936     1211.148548   
Congo Dem. Rep.             Africa      780.542326      905.860230   
Congo Rep.                  Africa     2125.621418     2315.056572   
Cote d'Ivoire       

In [19]:
second = first[first['continent'] == 'Americas']

This line makes a selection: only those rows of first for which the ‘continent’ column matches ‘Americas’ are extracted. 

Notice how the Boolean expression inside the brackets, first['continent'] == 'Americas', is used to select only those rows where the expression is true. 

Try printing this expression! Can you also print its individual True/False elements? (Hint: first assign the expression to a variable.)

In [24]:
print(second)

                    continent  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
country                                                                         
Argentina            Americas     5911.315053     6856.856212     7133.166023   
Bolivia              Americas     2677.326347     2127.686326     2180.972546   
Brazil               Americas     2108.944355     2487.365989     3336.585802   
Canada               Americas    11367.161120    12489.950060    13462.485550   
Chile                Americas     3939.978789     4315.622723     4519.094331   
Colombia             Americas     2144.115096     2323.805581     2492.351109   
Costa Rica           Americas     2627.009471     2990.010802     3460.937025   
Cuba                 Americas     5586.538780     6092.174359     5180.755910   
Dominican Republic   Americas     1397.717137     1544.402995     1662.137359   
Ecuador              Americas     3522.110717     3780.546651     4086.114078   
El Salvador          America

In [25]:
first['continent']=='Americas'

country
Algeria                     False
Angola                      False
Benin                       False
Botswana                    False
Burkina Faso                False
Burundi                     False
Cameroon                    False
Central African Republic    False
Chad                        False
Comoros                     False
Congo Dem. Rep.             False
Congo Rep.                  False
Cote d'Ivoire               False
Djibouti                    False
Egypt                       False
Equatorial Guinea           False
Eritrea                     False
Ethiopia                    False
Gabon                       False
Gambia                      False
Ghana                       False
Guinea                      False
Guinea-Bissau               False
Kenya                       False
Lesotho                     False
Liberia                     False
Libya                       False
Madagascar                  False
Malawi                      False
Mali  
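
One way to follow the hint is sketched below on a three-row toy table (the GDP numbers are invented). The name `mask` is just an illustrative variable; once the Boolean Series is stored in it, single elements can be read by label or by position, and the Series can be used to filter rows.

```python
import pandas as pd

# Small stand-in for gapminder_all.csv; the GDP values are made up.
toy = pd.DataFrame(
    {
        "continent": ["Africa", "Americas", "Americas"],
        "gdpPercap_1952": [1.0, 2.0, 3.0],
    },
    index=pd.Index(["Algeria", "Argentina", "Bolivia"], name="country"),
)

mask = toy["continent"] == "Americas"   # Boolean Series indexed by country

print(mask["Algeria"])   # one element, looked up by label
print(mask.iloc[1])      # one element, looked up by position
print(mask.sum())        # True counts as 1, so this counts the matching rows

selected = toy[mask]     # keeps only the rows where mask is True
print(selected)
```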

In [26]:
third = second.drop('Puerto Rico')

As the syntax suggests, this line drops the row of second whose label is ‘Puerto Rico’. The resulting dataframe third has one row fewer than second, which is itself left unchanged.

In [27]:
print(third)

                    continent  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
country                                                                         
Argentina            Americas     5911.315053     6856.856212     7133.166023   
Bolivia              Americas     2677.326347     2127.686326     2180.972546   
Brazil               Americas     2108.944355     2487.365989     3336.585802   
Canada               Americas    11367.161120    12489.950060    13462.485550   
Chile                Americas     3939.978789     4315.622723     4519.094331   
Colombia             Americas     2144.115096     2323.805581     2492.351109   
Costa Rica           Americas     2627.009471     2990.010802     3460.937025   
Cuba                 Americas     5586.538780     6092.174359     5180.755910   
Dominican Republic   Americas     1397.717137     1544.402995     1662.137359   
Ecuador              Americas     3522.110717     3780.546651     4086.114078   
El Salvador          America

In [28]:
fourth = third.drop('continent', axis = 1)

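Again we apply the drop method, but this time we drop a whole column rather than a row.

To accomplish this, we also need to specify the axis parameter: axis=1 tells drop to look for the label ‘continent’ among the column labels instead of the row labels (axis=0, the default). The value 1 names the column axis; it is not the position of a column.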
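
The two uses of `drop` can be compared side by side. The sketch below uses a two-row toy table with invented values; it also shows the `columns=` keyword, which is an equivalent way to spell the column case, and confirms that the original table is not modified.

```python
import pandas as pd

# Toy table; the 'gdp' numbers are invented.
toy = pd.DataFrame(
    {"continent": ["Americas", "Americas"], "gdp": [1.0, 2.0]},
    index=pd.Index(["Canada", "Puerto Rico"], name="country"),
)

no_row = toy.drop("Puerto Rico")          # default axis=0: the label is matched against rows
no_col = toy.drop("continent", axis=1)    # axis=1: the label is matched against columns
same_col = toy.drop(columns="continent")  # equivalent keyword form

print(list(no_row.index))
print(list(no_col.columns))
print(toy.shape)  # drop returns a new DataFrame; toy still has both rows and columns
```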

In [29]:
print(fourth)

                     gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
country                                                               
Argentina               5911.315053     6856.856212     7133.166023   
Bolivia                 2677.326347     2127.686326     2180.972546   
Brazil                  2108.944355     2487.365989     3336.585802   
Canada                 11367.161120    12489.950060    13462.485550   
Chile                   3939.978789     4315.622723     4519.094331   
Colombia                2144.115096     2323.805581     2492.351109   
Costa Rica              2627.009471     2990.010802     3460.937025   
Cuba                    5586.538780     6092.174359     5180.755910   
Dominican Republic      1397.717137     1544.402995     1662.137359   
Ecuador                 3522.110717     3780.546651     4086.114078   
El Salvador             3048.302900     3421.523218     3776.803627   
Guatemala               2428.237769     2617.155967     2750.364446   
Haiti 

In [None]:
fourth.to_csv('result.csv')

The final step is to write the data that we have been working on to a CSV file. Pandas makes this easy with the to_csv() method. The only argument needed here is the filename. Because 'result.csv' is a relative path, the file is written to the current working directory, which is normally the directory from which you started the Jupyter or Python session.
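
To see what ends up in the file, the following sketch writes a toy table (invented values) and reads it back. It writes into a temporary directory rather than the working directory purely so the example cleans up after itself; that choice is an assumption of this sketch, not part of the exercise.

```python
import os
import tempfile

import pandas as pd

# Toy table with invented values.
toy = pd.DataFrame(
    {"gdpPercap_1952": [1.5, 2.5]},
    index=pd.Index(["Canada", "Chile"], name="country"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "result.csv")
    toy.to_csv(path)  # the index is written as the first column, headed 'country'
    restored = pd.read_csv(path, index_col="country")  # turn that column back into the index

print(restored)
```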

## Q: Selecting Indices

Explain in simple terms what idxmin and idxmax do in the short program below. When would you use these methods?

In [30]:
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.idxmin())
print(data.idxmax())

gdpPercap_1952    Bosnia and Herzegovina
gdpPercap_1957    Bosnia and Herzegovina
gdpPercap_1962    Bosnia and Herzegovina
gdpPercap_1967    Bosnia and Herzegovina
gdpPercap_1972    Bosnia and Herzegovina
gdpPercap_1977    Bosnia and Herzegovina
gdpPercap_1982                   Albania
gdpPercap_1987                   Albania
gdpPercap_1992                   Albania
gdpPercap_1997                   Albania
gdpPercap_2002                   Albania
gdpPercap_2007                   Albania
dtype: object
gdpPercap_1952    Switzerland
gdpPercap_1957    Switzerland
gdpPercap_1962    Switzerland
gdpPercap_1967    Switzerland
gdpPercap_1972    Switzerland
gdpPercap_1977    Switzerland
gdpPercap_1982    Switzerland
gdpPercap_1987         Norway
gdpPercap_1992         Norway
gdpPercap_1997         Norway
gdpPercap_2002         Norway
gdpPercap_2007         Norway
dtype: object


For each column in data, idxmin returns the index label (here, the country) of the row that holds that column’s minimum value; idxmax does the same for each column’s maximum value.

Use these methods when you want to know where the minimum or maximum occurs (its row label) rather than the minimum or maximum value itself.
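
The difference between a value and its label is easy to see on a tiny table. In this sketch the countries `A`, `B`, `C` and the column names `y1952`, `y2007` are placeholders with invented values.

```python
import pandas as pd

# Placeholder table: three 'countries', two 'years', invented values.
toy = pd.DataFrame(
    {"y1952": [3.0, 1.0, 2.0], "y2007": [5.0, 9.0, 4.0]},
    index=pd.Index(["A", "B", "C"], name="country"),
)

print(toy.min())      # the smallest value in each column
print(toy.idxmin())   # the row label where each column's smallest value sits

lowest_2007 = toy["y2007"].idxmin()    # also works on a single column (a Series)
highest_2007 = toy["y2007"].idxmax()
print(lowest_2007, highest_2007)
```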

## Q: Practice with Selection

Assume Pandas has been imported and the Gapminder GDP data for Europe has been loaded. Write an expression to select each of the following:

* GDP per capita for all countries in 1982.
* GDP per capita for Denmark for all years.
* GDP per capita for all countries for years after 1985.
* GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952.

In [31]:
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

In [32]:
# GDP per capita for all countries in 1982.
data['gdpPercap_1982']

country
Albania                    3630.880722
Austria                   21597.083620
Belgium                   20979.845890
Bosnia and Herzegovina     4126.613157
Bulgaria                   8224.191647
Croatia                   13221.821840
Czech Republic            15377.228550
Denmark                   21688.040480
Finland                   18533.157610
France                    20293.897460
Germany                   22031.532740
Greece                    15268.420890
Hungary                   12545.990660
Iceland                   23269.607500
Ireland                   12618.321410
Italy                     16537.483500
Montenegro                11222.587620
Netherlands               21399.460460
Norway                    26298.635310
Poland                     8451.531004
Portugal                  11753.842910
Romania                    9605.314053
Serbia                    15181.092700
Slovak Republic           11348.545850
Slovenia                  17866.721750
Spain            

In [33]:
# GDP per capita for Denmark for all years.
data.loc['Denmark']  # or data.loc['Denmark', :]

gdpPercap_1952     9692.385245
gdpPercap_1957    11099.659350
gdpPercap_1962    13583.313510
gdpPercap_1967    15937.211230
gdpPercap_1972    18866.207210
gdpPercap_1977    20422.901500
gdpPercap_1982    21688.040480
gdpPercap_1987    25116.175810
gdpPercap_1992    26406.739850
gdpPercap_1997    29804.345670
gdpPercap_2002    32166.500060
gdpPercap_2007    35278.418740
Name: Denmark, dtype: float64

In [35]:
# GDP per capita for all countries for years after 1985.
data.loc[:,'gdpPercap_1985':]

                        gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  gdpPercap_2007
country
Albania                    3738.932735     2497.437901     3193.054604     4604.211737     5937.029526
Austria                    23687.82607     27042.01868     29095.92066     32417.60769      36126.4927
Belgium                    22525.56308     25575.57069     27561.19663     30485.88375     33692.60508
Bosnia and Herzegovina     4314.114757     2546.781445     4766.355904     6018.975239     7446.298803
Bulgaria                   8239.854824     6302.623438      5970.38876     7696.777725     10680.79282
Croatia                    13822.58394     8447.794873     9875.604515     11628.38895     14619.22272
Czech Republic              16310.4434     14297.02122     16048.51424     17596.21022     22833.30851
Denmark                    25116.17581     26406.73985     29804.34567     32166.50006     35278.41874
Finland                    21141.01223     20647.16499      23723.9502     28204.59057      33207.0844
France                     22066.44214     24703.79615     25889.78487     28926.03234      30470.0167


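Pandas does not interpret the number at the end of the column label. The slice works, even though no column named gdpPercap_1985 exists, because the column labels are in sorted order: Pandas compares the labels as strings and starts the slice at the first label that sorts at or after 'gdpPercap_1985', here gdpPercap_1987. If the labels were not sorted, the same expression would raise a KeyError. The expression keeps working if later columns are added to the CSV file, as long as the columns stay in sorted order.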
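
The behaviour of a label slice with a missing bound can be probed on toy tables. The sketch below uses one-row DataFrames with invented values; the `y_10`/`y_9` labels are hypothetical and chosen only to show that labels are compared as text, not as numbers.

```python
import pandas as pd

# Labels in sorted (lexicographic) order; values are invented.
sorted_cols = pd.DataFrame(
    [[1, 2, 3]],
    columns=["gdpPercap_1982", "gdpPercap_1987", "gdpPercap_1992"],
)
after_1985 = sorted_cols.loc[:, "gdpPercap_1985":]
print(list(after_1985.columns))

# String comparison: "y_10" sorts before "y_5", so it falls outside the slice.
text_order = pd.DataFrame([[1, 2]], columns=["y_10", "y_9"])
from_y5 = text_order.loc[:, "y_5":]
print(list(from_y5.columns))

# With unsorted labels, the missing bound cannot be located.
unsorted_cols = sorted_cols[["gdpPercap_1992", "gdpPercap_1982", "gdpPercap_1987"]]
try:
    unsorted_cols.loc[:, "gdpPercap_1985":]
    raised = False
except KeyError:
    raised = True
print(raised)
```

If the labels might not be sorted, selecting columns explicitly (for example by building a list of the wanted labels) is a more robust alternative.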

In [38]:
# GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952.
data['gdpPercap_2007']/data['gdpPercap_1952']

country
Albania                   3.708196
Austria                   5.886596
Belgium                   4.038377
Bosnia and Herzegovina    7.648736
Bulgaria                  4.369697
Croatia                   4.686795
Czech Republic            3.320658
Denmark                   3.639808
Finland                   5.168805
France                    4.334402
Germany                   4.503060
Greece                    7.799725
Hungary                   3.421364
Iceland                   4.978308
Ireland                   7.806873
Italy                     5.793425
Montenegro                3.495221
Netherlands               4.115376
Norway                    4.889067
Poland                    3.819475
Portugal                  6.684325
Romania                   3.437140
Serbia                    2.732555
Slovak Republic           3.680703
Slovenia                  6.113405
Spain                     7.517163
Sweden                    3.970493
Switzerland               2.545529
Turkey      

## Q: Using the dir function to see available methods

Python includes a dir function that can be used to display all of the available methods (functions) that are built into a data object. As an example, the functions available for a list data type are:

In [39]:
potatoes = ["Russet", "Norkota", "Yukon Gold", "Pontiac"]
dir(potatoes)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

The double-underscore names can be ignored for now; names that are not surrounded by double underscores are the public interface of the list type. So, if you want to sort the list of potatoes, dir suggests trying:

In [40]:
potatoes.sort()
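
Since the underscore-prefixed names dominate the listing, it can help to filter them out. This is a small sketch of one way to do that (`public_names` is just an illustrative variable name); it also shows that `sort()` changes the list in place and returns `None`.

```python
potatoes = ["Russet", "Norkota", "Yukon Gold", "Pontiac"]

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(potatoes) if not name.startswith("_")]
print(public_names)

# sort() reorders the list in place; its return value is None.
result = potatoes.sort()
print(result)
print(potatoes)
```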

Assume Pandas has been imported and the Gapminder GDP data for Europe has been loaded as data. Then, use dir to find the function that prints out the median per-capita GDP across all European countries for each year that information is available.

In [43]:
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
dir(data)

['T',
 '_AXIS_ALIASES',
 '_AXIS_IALIASES',
 '_AXIS_LEN',
 '_AXIS_NAMES',
 '_AXIS_NUMBERS',
 '_AXIS_ORDERS',
 '_AXIS_REVERSED',
 '_AXIS_SLICEMAP',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_priority__',
 '__array_wrap__',
 '__bool__',
 '__bytes__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__div__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdiv__',
 '__reduce__',

In [47]:
print(data)

                        gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
country                                                                  
Albania                    1601.056136     1942.284244     2312.888958   
Austria                    6137.076492     8842.598030    10750.721110   
Belgium                    8343.105127     9714.960623    10991.206760   
Bosnia and Herzegovina      973.533195     1353.989176     1709.683679   
Bulgaria                   2444.286648     3008.670727     4254.337839   
Croatia                    3119.236520     4338.231617     5477.890018   
Czech Republic             6876.140250     8256.343918    10136.867130   
Denmark                    9692.385245    11099.659350    13583.313510   
Finland                    6424.519071     7545.415386     9371.842561   
France                     7029.809327     8662.834898    10560.485530   
Germany                    7144.114393    10187.826650    12902.462910   
Greece                     3530.690067

In [48]:
data.median()

gdpPercap_1952     5142.469716
gdpPercap_1957     6066.721495
gdpPercap_1962     7515.733738
gdpPercap_1967     9366.067033
gdpPercap_1972    12326.379990
gdpPercap_1977    14225.754515
gdpPercap_1982    15322.824720
gdpPercap_1987    16215.485895
gdpPercap_1992    17550.155945
gdpPercap_1997    19596.498550
gdpPercap_2002    23674.863230
gdpPercap_2007    28054.065790
dtype: float64
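
Because `dir(data)` returns a very long list, searching it for a keyword is often quicker than reading it. The sketch below does this on a placeholder table (labels and values are invented) and then calls `median` along both axes; the search string `"med"` is simply a guess at how such a method might be named.

```python
import pandas as pd

# Placeholder table with invented values.
toy = pd.DataFrame(
    {"y1952": [1.0, 2.0, 9.0], "y2007": [4.0, 6.0, 5.0]},
    index=pd.Index(["A", "B", "C"], name="country"),
)

# Search the attribute names for anything that looks like "median".
candidates = [name for name in dir(toy) if "med" in name.lower()]
print(candidates)

per_year = toy.median()            # default: one median per column
per_country = toy.median(axis=1)   # axis=1: one median per row
print(per_year)
print(per_country)
```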