# Software Carpentry with Python: Part 2

## Data wrangling with the pandas library

For November 21, 2019

Data needs to be downloaded at:
https://go.gwu.edu/pythondata

https://swcarpentry.github.io/python-novice-gapminder/files/python-novice-gapminder-data.zip


### Starting in the same spot:
Remember, we're in the python-lesson directory. 

1. Let's create a New > Folder. Click the checkbox and Rename it: data. 
2. Then click to go into it. 
3. We need to put our gapminder data here. You should have already downloaded that file as part of the set-up. If not, go to https://go.gwu.edu/gapminder and download it now. Unzip it!
4. Click Upload and upload the unzipped data file. 

We're setting things up this way so that we all have the same file structure and can follow along. It is also generally good practice to create a data folder and put your original data files in there. 

A quick aside: Python libraries such as the built-in `os` module can work with our directory structure, but that is not our focus today.

Lessons used: Software Carpentry: https://swcarpentry.github.io/python-novice-gapminder/08-data-frames/index.html


## Why use Python for data analysis? 
* We can automate the process of performing data manipulations in Python. 
* It’s efficient to spend time building the code to perform these tasks because once it’s built, we can use it over and over on different datasets that use a similar format. 
* This makes our methods easily reproducible. We can also easily share our code with colleagues and they can replicate the same analysis.

## Working With Pandas DataFrames
One of the best options for working with tabular data in Python is the pandas library. It provides data structures, produces high-quality plots with matplotlib, and integrates nicely with other libraries that use NumPy arrays (NumPy is another Python library).

**Python doesn’t load all of the libraries available to it by default.** We have to add an import statement to our code in order to use library functions. To import a library, we use the syntax import libraryName. 

If we want to give the library a nickname to shorten the command, we can add **as nickNameHere**. An example of importing the pandas library using the common nickname pd is below.

In [216]:
import pandas as pd

### Reading CSV Data Using Pandas
We will begin by locating and reading our data, which are in CSV format. CSV stands for Comma-Separated Values and is a common way to store formatted data. 

We can use Pandas’ read_csv function to pull the file directly into a DataFrame.

In [217]:
pd.read_csv('data/gapminder_all.csv')

,continent,country,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,...,pop_1962,pop_1967,pop_1972,pop_1977,pop_1982,pop_1987,pop_1992,pop_1997,pop_2002,pop_2007
0,Africa,Algeria,2449.008185,3013.976023,2550.816880,3246.991771,4182.663766,4910.416756,5745.160213,5681.358539,...,11000948.0,12760499.0,14760787.0,17152804.0,20033753.0,23254956.0,26298373.0,29072015.0,31287142,33333216
1,Africa,Angola,3520.610273,3827.940465,4269.276742,5522.776375,5473.288005,3008.647355,2756.953672,2430.208311,...,4826015.0,5247469.0,5894858.0,6162675.0,7016384.0,7874230.0,8735988.0,9875024.0,10866106,12420476
2,Africa,Benin,1062.752200,959.601080,949.499064,1035.831411,1085.796879,1029.161251,1277.897616,1225.856010,...,2151895.0,2427334.0,2761407.0,3168267.0,3641603.0,4243788.0,4981671.0,6066080.0,7026113,8078314
3,Africa,Botswana,851.241141,918.232535,983.653976,1214.709294,2263.611114,3214.857818,4551.142150,6205.883850,...,512764.0,553541.0,619351.0,781472.0,970347.0,1151184.0,1342614.0,1536536.0,1630347,1639131
4,Africa,Burkina Faso,543.255241,617.183465,722.512021,794.826560,854.735976,743.387037,807.198586,912.063142,...,4919632.0,5127935.0,5433886.0,5889574.0,6634596.0,7586551.0,8878303.0,10352843.0,12251209,14326203
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137,Europe,Switzerland,14734.232750,17909.489730,20431.092700,22966.144320,27195.113040,26982.290520,28397.715120,30281.704590,...,5666000.0,6063000.0,6401400.0,6316424.0,6468126.0,6649942.0,6995447.0,7193761.0,7361757,7554661
138,Europe,Turkey,1969.100980,2218.754257,2322.869908,2826.356387,3450.696380,4269.122326,4241.356344,5089.043686,...,29788695.0,33411317.0,37492953.0,42404033.0,47328791.0,52881328.0,58179144.0,63047647.0,67308928,71158647
139,Europe,United Kingdom,9979.508487,11283.177950,12477.177070,14142.850890,15895.116410,17428.748460,18232.424520,21664.787670,...,53292000.0,54959000.0,56079000.0,56179000.0,56339704.0,56981620.0,57866349.0,58808266.0,59912431,60776238
140,Oceania,Australia,10039.595640,10949.649590,12217.226860,14526.124650,16788.629480,18334.197510,19477.009280,21888.889030,...,10794968.0,11872264.0,13177000.0,14074100.0,15184200.0,16257249.0,17481977.0,18565243.0,19546792,20434176


This output is the rows in our CSV file, now as a pandas DataFrame object. 

The first column is the index of the DataFrame. The index is used to identify the position of the data, but it is not an actual column of the DataFrame. 

It looks like the read_csv function in pandas read our file properly. However, we haven’t stored the result anywhere, so we can't work with it yet. We need to assign the DataFrame to a variable. Remember that a variable is a name for a value, such as x or data. We create a variable by assigning a value to it using =. Below we also pass index_col="country", which tells pandas to use the country column as the row labels (the index) instead of the default 0, 1, 2, and so on.

In [218]:
data = pd.read_csv("data/gapminder_all.csv", index_col="country")
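To see what index_col changes without needing the full file, here is a minimal sketch that reads a tiny inline CSV. The three rows are copied from the output above; the `csv_text` string and the variable names are stand-ins for illustration, not part of the lesson's data files.

```python
import io

import pandas as pd

# A tiny stand-in for data/gapminder_all.csv (assumption: only four columns kept).
csv_text = """continent,country,gdpPercap_1952,pop_2007
Africa,Algeria,2449.008185,33333216
Africa,Angola,3520.610273,12420476
Africa,Benin,1062.752200,8078314
"""

# Without index_col, pandas labels the rows 0, 1, 2, ...
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col="country", the country names become the row labels
# and "country" is no longer an ordinary column.
by_country = pd.read_csv(io.StringIO(csv_text), index_col="country")

print(default_index.index.tolist())
print(by_country.index.tolist())
print(by_country.columns.tolist())
```

Having the country as the index is what later lets us write things like `data.loc["Algeria", ...]`.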

There are many ways to summarize and access the data stored in DataFrames, using attributes and methods provided by the DataFrame object.

Methods are called on a DataFrame object using the syntax df_object.method(). As an example, `data.head()` gets the first few rows of the DataFrame `data` using the head() method. With a method, we can supply extra information in the parentheses to control its behaviour.

In [219]:
# nothing in parens defaults to first 5 rows
data.head()

country,continent,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,...,pop_1962,pop_1967,pop_1972,pop_1977,pop_1982,pop_1987,pop_1992,pop_1997,pop_2002,pop_2007
Algeria,Africa,2449.008185,3013.976023,2550.81688,3246.991771,4182.663766,4910.416756,5745.160213,5681.358539,5023.216647,...,11000948.0,12760499.0,14760787.0,17152804.0,20033753.0,23254956.0,26298373.0,29072015.0,31287142,33333216
Angola,Africa,3520.610273,3827.940465,4269.276742,5522.776375,5473.288005,3008.647355,2756.953672,2430.208311,2627.845685,...,4826015.0,5247469.0,5894858.0,6162675.0,7016384.0,7874230.0,8735988.0,9875024.0,10866106,12420476
Benin,Africa,1062.7522,959.60108,949.499064,1035.831411,1085.796879,1029.161251,1277.897616,1225.85601,1191.207681,...,2151895.0,2427334.0,2761407.0,3168267.0,3641603.0,4243788.0,4981671.0,6066080.0,7026113,8078314
Botswana,Africa,851.241141,918.232535,983.653976,1214.709294,2263.611114,3214.857818,4551.14215,6205.88385,7954.111645,...,512764.0,553541.0,619351.0,781472.0,970347.0,1151184.0,1342614.0,1536536.0,1630347,1639131
Burkina Faso,Africa,543.255241,617.183465,722.512021,794.82656,854.735976,743.387037,807.198586,912.063142,931.752773,...,4919632.0,5127935.0,5433886.0,5889574.0,6634596.0,7586551.0,8878303.0,10352843.0,12251209,14326203
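As a small illustration of supplying extra information in the parentheses, the sketch below builds a five-row stand-in DataFrame (values copied from the output above; the name `sample` is ours) and asks head() and tail() for a specific number of rows.

```python
import pandas as pd

# Five-row stand-in for the gapminder DataFrame (assumption: two columns only).
sample = pd.DataFrame(
    {
        "continent": ["Africa"] * 5,
        "gdpPercap_1952": [2449.008185, 3520.610273, 1062.752200, 851.241141, 543.255241],
    },
    index=pd.Index(
        ["Algeria", "Angola", "Benin", "Botswana", "Burkina Faso"], name="country"
    ),
)

first_two = sample.head(2)  # only the first 2 rows
last_one = sample.tail(1)   # only the last row

print(first_two)
print(last_one)
```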


In [220]:
type(data)

pandas.core.frame.DataFrame

In [221]:
data.dtypes

continent          object
gdpPercap_1952    float64
gdpPercap_1957    float64
gdpPercap_1962    float64
gdpPercap_1967    float64
gdpPercap_1972    float64
gdpPercap_1977    float64
gdpPercap_1982    float64
gdpPercap_1987    float64
gdpPercap_1992    float64
gdpPercap_1997    float64
gdpPercap_2002    float64
gdpPercap_2007    float64
lifeExp_1952      float64
lifeExp_1957      float64
lifeExp_1962      float64
lifeExp_1967      float64
lifeExp_1972      float64
lifeExp_1977      float64
lifeExp_1982      float64
lifeExp_1987      float64
lifeExp_1992      float64
lifeExp_1997      float64
lifeExp_2002      float64
lifeExp_2007      float64
pop_1952          float64
pop_1957          float64
pop_1962          float64
pop_1967          float64
pop_1972          float64
pop_1977          float64
pop_1982          float64
pop_1987          float64
pop_1992          float64
pop_1997          float64
pop_2002            int64
pop_2007            int64
dtype: object

The DataFrame.columns variable stores information about the DataFrame’s columns. 

Note that this is data, not a method (like math.pi), so do not use () to try to call it.

In [222]:
data.columns

Index(['continent', 'gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962',
       'gdpPercap_1967', 'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982',
       'gdpPercap_1987', 'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002',
       'gdpPercap_2007', 'lifeExp_1952', 'lifeExp_1957', 'lifeExp_1962',
       'lifeExp_1967', 'lifeExp_1972', 'lifeExp_1977', 'lifeExp_1982',
       'lifeExp_1987', 'lifeExp_1992', 'lifeExp_1997', 'lifeExp_2002',
       'lifeExp_2007', 'pop_1952', 'pop_1957', 'pop_1962', 'pop_1967',
       'pop_1972', 'pop_1977', 'pop_1982', 'pop_1987', 'pop_1992', 'pop_1997',
       'pop_2002', 'pop_2007'],
      dtype='object')
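The difference between data and a method can be seen in a short sketch. The two-row `sample` DataFrame is a stand-in built from values shown earlier; the point is only that `.columns` and `.shape` are read without parentheses, and that adding parentheses raises an error.

```python
import pandas as pd

# Two-row stand-in DataFrame (values taken from the gapminder output above).
sample = pd.DataFrame(
    {"continent": ["Africa", "Africa"], "pop_2007": [33333216, 12420476]},
    index=pd.Index(["Algeria", "Angola"], name="country"),
)

# Attributes are data: no parentheses.
print(sample.columns)
print(sample.shape)

# Trying to call an attribute as if it were a method fails.
error_message = None
try:
    sample.columns()
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```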

In [223]:
data.describe()

,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,...,pop_1962,pop_1967,pop_1972,pop_1977,pop_1982,pop_1987,pop_1992,pop_1997,pop_2002,pop_2007
count,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,...,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0
mean,3725.276046,4299.408345,4725.812342,5483.653047,6770.082815,7313.166421,7518.901673,7900.920218,8158.608521,9090.175363,...,20421010.0,22658300.0,25189980.0,27676380.0,30207300.0,33038570.0,35990920.0,38839470.0,41457590.0,44021220.0
std,9321.064786,9869.662202,8667.362525,8095.315431,10614.383403,8362.48915,7733.845006,8288.281304,9031.84608,10171.493263,...,69788650.0,78375480.0,88646820.0,97481090.0,105098600.0,114756200.0,124502600.0,133417400.0,140848300.0,147621400.0
min,298.846212,335.997115,355.203227,349.0,357.0,371.0,424.0,385.0,347.0,312.188423,...,65345.0,70787.0,76595.0,86796.0,98593.0,110812.0,125911.0,145608.0,170372.0,199579.0
25%,864.752389,930.540819,1059.149171,1151.245103,1257.193853,1357.257252,1363.338985,1327.469823,1270.660958,1366.837958,...,1784362.0,2034768.0,2351192.0,2759717.0,3006286.0,3194990.0,3605992.0,3770150.0,4173506.0,4508034.0
50%,1968.528344,2173.220291,2335.439533,2678.334741,3339.129407,3798.609244,4216.228428,4280.300366,4386.085502,4781.825478,...,4686040.0,5170176.0,5877996.0,6404036.0,7007320.0,7774862.0,8688686.0,9735064.0,10372920.0,10517530.0
75%,3913.492777,4876.356362,5709.381428,7075.932943,9508.839303,11204.102423,12347.953723,11994.052795,10684.35187,12022.867188,...,10980080.0,12614580.0,14679200.0,16670230.0,18407320.0,20947540.0,22705380.0,24311370.0,26545560.0,31210040.0
max,108382.3529,113523.1329,95458.11176,80894.88326,109347.867,59265.47714,33693.17525,31540.9748,34932.91959,41283.16433,...,665770000.0,754550000.0,862030000.0,943455000.0,1000281000.0,1084035000.0,1164970000.0,1230075000.0,1280400000.0,1318683000.0


## Indexing and Slicing in Python
We often want to work with subsets of a DataFrame object. There are different ways to accomplish this, including using:
* labels (column headings)
* numeric ranges
* specific x,y index locations.

### Selecting data using Labels (Column Headings)
We use square brackets [] to select a subset of a Python object. For example, we can select all data from a column named lifeExp_2007 from the data DataFrame by name. There are two ways to do this:

In [30]:
# Method 1: select a 'subset' of the data using the column name
data['lifeExp_2007']

country
Algeria           72.301
Angola            42.731
Benin             56.728
Botswana          50.728
Burkina Faso      52.295
                   ...  
Switzerland       81.701
Turkey            71.777
United Kingdom    79.425
Australia         81.235
New Zealand       80.204
Name: lifeExp_2007, Length: 142, dtype: float64

In [32]:
# Method 2: use the column name as an 'attribute'; gives the same output
data.lifeExp_2007

country
Algeria           72.301
Angola            42.731
Benin             56.728
Botswana          50.728
Burkina Faso      52.295
                   ...  
Switzerland       81.701
Turkey            71.777
United Kingdom    79.425
Australia         81.235
New Zealand       80.204
Name: lifeExp_2007, Length: 142, dtype: float64

A DataFrame is a collection of Series. The DataFrame is the way pandas represents a table, and Series is the data structure pandas uses to represent a column.

What we did above by taking a column was creating a Series.
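A quick sketch to check this on a small stand-in DataFrame (values copied from the output above; `sample` and `clash` are names made up for the illustration). It also shows one reason to prefer the bracket form: attribute access cannot reach a column whose name is already taken by a DataFrame method, such as a hypothetical column called `pop`.

```python
import pandas as pd

sample = pd.DataFrame(
    {"lifeExp_2007": [72.301, 42.731, 56.728]},
    index=pd.Index(["Algeria", "Angola", "Benin"], name="country"),
)

by_brackets = sample["lifeExp_2007"]
by_attribute = sample.lifeExp_2007

# Both forms give back the same pandas Series.
print(type(by_brackets))
print(by_brackets.equals(by_attribute))

# Hypothetical column named "pop": the attribute form finds the
# DataFrame.pop method instead of the column, the bracket form still works.
clash = pd.DataFrame({"pop": [1, 2]})
print(callable(clash.pop))
print(type(clash["pop"]))
```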

pandas is built on top of the NumPy library, which in practice means that most of the methods defined for NumPy arrays apply to pandas Series and DataFrames.

What makes Pandas so attractive is the powerful interface to access individual records of the table, proper handling of missing values, and relational-databases operations between DataFrames.
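To make the NumPy connection concrete, here is a minimal sketch applying a NumPy function to a Series. The population figures are the 2007 values shown earlier; the variable names are ours.

```python
import numpy as np
import pandas as pd

pop_2007 = pd.Series(
    [33333216, 12420476, 8078314],
    index=["Algeria", "Angola", "Benin"],
    name="pop_2007",
)

# A NumPy function applied to a Series returns a Series with the same labels.
log_pop = np.log10(pop_2007)

# Array-style reductions are available as methods.
total = pop_2007.sum()

# The underlying values can be pulled out as a plain NumPy array.
as_array = pop_2007.to_numpy()

print(log_pop)
print(total)
print(type(as_array))
```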


### Selecting using slices of rows 

Slicing using the `[]` operator selects a set of rows from a DataFrame by position; the labels are not counted. To slice out a set of rows, use the syntax data[start:stop]. When slicing in pandas, the start bound is included in the output and the stop bound is one step BEYOND the last row you want to select. So if you want to select rows 0, 1 and 2, your code would look like this:

In [62]:
data[0:3]

country,continent,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,...,pop_1962,pop_1967,pop_1972,pop_1977,pop_1982,pop_1987,pop_1992,pop_1997,pop_2002,pop_2007
Algeria,Africa,2449.008185,3013.976023,2550.81688,3246.991771,4182.663766,4910.416756,5745.160213,5681.358539,5023.216647,...,11000948.0,12760499.0,14760787.0,17152804.0,20033753.0,23254956.0,26298373.0,29072015.0,31287142,33333216
Angola,Africa,3520.610273,3827.940465,4269.276742,5522.776375,5473.288005,3008.647355,2756.953672,2430.208311,2627.845685,...,4826015.0,5247469.0,5894858.0,6162675.0,7016384.0,7874230.0,8735988.0,9875024.0,10866106,12420476
Benin,Africa,1062.7522,959.60108,949.499064,1035.831411,1085.796879,1029.161251,1277.897616,1225.85601,1191.207681,...,2151895.0,2427334.0,2761407.0,3168267.0,3641603.0,4243788.0,4981671.0,6066080.0,7026113,8078314


In [64]:
# Can also leave out the 0
data[:3]

country,continent,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,...,pop_1962,pop_1967,pop_1972,pop_1977,pop_1982,pop_1987,pop_1992,pop_1997,pop_2002,pop_2007
Algeria,Africa,2449.008185,3013.976023,2550.81688,3246.991771,4182.663766,4910.416756,5745.160213,5681.358539,5023.216647,...,11000948.0,12760499.0,14760787.0,17152804.0,20033753.0,23254956.0,26298373.0,29072015.0,31287142,33333216
Angola,Africa,3520.610273,3827.940465,4269.276742,5522.776375,5473.288005,3008.647355,2756.953672,2430.208311,2627.845685,...,4826015.0,5247469.0,5894858.0,6162675.0,7016384.0,7874230.0,8735988.0,9875024.0,10866106,12420476
Benin,Africa,1062.7522,959.60108,949.499064,1035.831411,1085.796879,1029.161251,1277.897616,1225.85601,1191.207681,...,2151895.0,2427334.0,2761407.0,3168267.0,3641603.0,4243788.0,4981671.0,6066080.0,7026113,8078314
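The same start/stop rules can be tried on a five-row stand-in (values from the output above; `sample` is our own name). The sketch also shows a negative start, which counts from the end just as it does for lists.

```python
import pandas as pd

sample = pd.DataFrame(
    {"gdpPercap_1952": [2449.008185, 3520.610273, 1062.752200, 851.241141, 543.255241]},
    index=pd.Index(
        ["Algeria", "Angola", "Benin", "Botswana", "Burkina Faso"], name="country"
    ),
)

first_three = sample[0:3]   # rows at positions 0, 1, 2 (stop is excluded)
same_thing = sample[:3]     # a missing start means "from the beginning"
middle = sample[1:4]        # rows at positions 1, 2, 3
last_two = sample[-2:]      # the final two rows

print(first_three.index.tolist())
print(middle.index.tolist())
print(last_two.index.tolist())
```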


## Selecting values

To access the value at row i, column j of a DataFrame, written [i,j], we have two options, depending on what i and j mean. Remember that a DataFrame provides an index as a way to identify the rows of the table; a row, then, has a position inside the table as well as a label, which uniquely identifies its entry in the DataFrame.

Use DataFrame.loc[..., ...] to select values by their label.

Use `: ` on its own to mean all columns or all rows

In [68]:
data.loc["Algeria", "gdpPercap_1952"]

2449.008185

Use `DataFrame.iloc[..., ...]` to select values by their (entry) position

We can specify a location by numerical index, analogous to a 2D version of selecting characters in a string (or items in a list). 

The labels are not included in the counting; positions apply only to the data.

In [48]:
data.iloc[0,0]

'Africa'

Let's do the same thing to get at the GDP per capita in 1952 for Algeria. It is in column position 1, because continent is at position 0:

In [69]:
data.iloc[0, 1]

2449.008185
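Here is a side-by-side sketch of the two accessors on a small stand-in DataFrame (values copied from the output above; `sample` is our own name). It also uses `columns.get_loc` to look up a column's position, which saves counting columns by hand.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "continent": ["Africa", "Africa", "Africa"],
        "gdpPercap_1952": [2449.008185, 3520.610273, 1062.752200],
        "gdpPercap_1957": [3013.976023, 3827.940465, 959.601080],
    },
    index=pd.Index(["Algeria", "Angola", "Benin"], name="country"),
)

# Same cell, reached by label and by position.
by_label = sample.loc["Algeria", "gdpPercap_1952"]
by_position = sample.iloc[0, 1]

# Ask pandas for the position of a column instead of counting.
col_pos = sample.columns.get_loc("gdpPercap_1952")

# ":" on its own means "everything" along that axis.
angola_row = sample.loc["Angola", :]
continent_col = sample.loc[:, "continent"]

print(by_label, by_position, col_pos)
print(angola_row)
```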

**Exercise:**

Practice Series and slicing.

1) Get the lifeExp for all countries in 1992 and assign it to a variable called lifeExp1992. 

2) Get the GDP in 1962 for New Zealand using multiple methods of slicing. Remember we can use .tail() to see the end of the data.


**Answer #1**

In [52]:
lifeExp1992 = data["lifeExp_1992"]
print(lifeExp1992)

country
Algeria           67.744
Angola            40.647
Benin             53.919
Botswana          62.745
Burkina Faso      50.260
                   ...  
Switzerland       78.030
Turkey            66.146
United Kingdom    76.420
Australia         77.560
New Zealand       76.330
Name: lifeExp_1992, Length: 142, dtype: float64


**Answer #2**

In [84]:
data.loc["New Zealand", "gdpPercap_1962"]

13175.678

In [257]:
# There are 142 rows in our DataFrame; since counting starts at zero, the last row is 141. gdpPercap_1962 is at column position 3.
data.iloc[141,3]

# answer is 13175.678000

13175.678

In [258]:
#help(pd.DataFrame.loc)

**Using labels on multiple rows and columns:**

To slice the life expectancy from 1992, 2002, 2007, for all of the countries:

Specify the rows we want and then the columns we want. 

In [209]:
data.loc[:,["lifeExp_1992","lifeExp_2002","lifeExp_2007"]]

country,lifeExp_1992,lifeExp_2002,lifeExp_2007
Algeria,67.744,70.994,72.301
Angola,40.647,41.003,42.731
Benin,53.919,54.406,56.728
Botswana,62.745,46.634,50.728
Burkina Faso,50.260,50.650,52.295
...,...,...,...
Switzerland,78.030,80.620,81.701
Turkey,66.146,70.845,71.777
United Kingdom,76.420,78.471,79.425
Australia,77.560,80.370,81.235


In [87]:
# A list of the row labels we want, then a list of the column labels we want.

data.loc[["Benin", "Turkey", "Afghanistan"],["lifeExp_1992","lifeExp_2002","lifeExp_2007"]]

country,lifeExp_1992,lifeExp_2002,lifeExp_2007
Benin,53.919,54.406,56.728
Turkey,66.146,70.845,71.777
Afghanistan,41.674,42.129,43.828
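Besides lists of labels, `.loc` also accepts label slices. One difference worth seeing in a sketch: a label slice with `.loc` includes its stop label, whereas a position slice with `.iloc` excludes its stop position. The four rows below are copied from the output above; `sample` is a stand-in name.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "lifeExp_1992": [67.744, 40.647, 53.919, 62.745],
        "lifeExp_2002": [70.994, 41.003, 54.406, 46.634],
        "lifeExp_2007": [72.301, 42.731, 56.728, 50.728],
    },
    index=pd.Index(["Algeria", "Angola", "Benin", "Botswana"], name="country"),
)

# Label slices: both ends are included.
label_slice = sample.loc["Algeria":"Benin", "lifeExp_1992":"lifeExp_2002"]

# Position slices: the stop is excluded, as with lists.
position_slice = sample.iloc[0:2, 0:1]

print(label_slice)
print(position_slice)
```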


Creating slices lets us then use methods on those subsets. For example, we earlier grabbed just the column that had the lifeExp for 1992. We can find the max and min for that slice as follows:


In [86]:
lifeExp1992 = data["lifeExp_1992"]
print(lifeExp1992.min())
print(lifeExp1992.max())

23.599
79.36


The minimum, 23.599, is Rwanda; the maximum, 79.36, is Japan.
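Rather than looking up which country a value belongs to by eye, we can ask pandas for the label. The sketch below uses `idxmin()` and `idxmax()` on a five-country stand-in Series (values copied from the lifeExp_1992 output above), so its answers differ from the full-data result.

```python
import pandas as pd

# Only five countries, so the extremes here are not Rwanda and Japan.
lifeExp1992_sample = pd.Series(
    [67.744, 40.647, 53.919, 62.745, 50.260],
    index=pd.Index(
        ["Algeria", "Angola", "Benin", "Botswana", "Burkina Faso"], name="country"
    ),
    name="lifeExp_1992",
)

lowest_country = lifeExp1992_sample.idxmin()   # label of the smallest value
highest_country = lifeExp1992_sample.idxmax()  # label of the largest value

print(lowest_country, lifeExp1992_sample.min())
print(highest_country, lifeExp1992_sample.max())
```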

When pandas selects a single column from a DataFrame, it may return a view rather than a copy. A view means no new data has been created, just a new reference to the data already in existence. Whether a change made through that selection also changes the original DataFrame depends on the pandas version and its copy-on-write settings, so do not rely on it either way. If you want an independent object, call `.copy()` on the selection.
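If we want to change a selected column without any risk of touching the original DataFrame, an explicit copy is the safe route. A minimal sketch, using two values from the output above and made-up variable names:

```python
import pandas as pd

sample = pd.DataFrame(
    {"lifeExp_1992": [67.744, 40.647]},
    index=pd.Index(["Algeria", "Angola"], name="country"),
)

# .copy() gives us a Series with its own data.
independent = sample["lifeExp_1992"].copy()
independent.loc["Algeria"] = 0.0

# The original DataFrame still holds the original value.
print(independent.loc["Algeria"])
print(sample.loc["Algeria", "lifeExp_1992"])
```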

### Subsetting Data Using Criteria

We can also select a subset of our data using criteria. 

In [119]:
data[data.lifeExp_2007 > 80]

country,continent,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,...,pop_1962,pop_1967,pop_1972,pop_1977,pop_1982,pop_1987,pop_1992,pop_1997,pop_2002,pop_2007
Canada,Americas,11367.16112,12489.95006,13462.48555,16076.58803,18970.57086,22090.88306,22898.79214,26626.51503,26342.88426,...,18985849.0,20819767.0,22284500.0,23796400.0,25201900.0,26549700.0,28523502.0,30305843.0,31902268,33390141
Hong Kong China,Asia,3054.421209,3629.076457,4692.648272,6197.962814,8315.928145,11186.14125,14560.53051,20038.47269,24757.60301,...,3305200.0,3722800.0,4115700.0,4583700.0,5264500.0,5584510.0,5829696.0,6495918.0,6762476,6980412
Israel,Asia,4086.522128,5385.278451,7105.630706,8393.741404,12786.93223,13306.61921,15367.0292,17122.47986,18051.52254,...,2310904.0,2693585.0,3095893.0,3495918.0,3858421.0,4203148.0,4936550.0,5531387.0,6029529,6426679
Japan,Asia,3216.956347,4317.694365,6576.649461,9847.788607,14778.78636,16610.37701,19384.10571,22375.94189,26824.89511,...,95831757.0,100825279.0,107188273.0,113872473.0,118454974.0,122091325.0,124329269.0,125956499.0,127065841,127467972
France,Europe,7029.809327,8662.834898,10560.48553,12999.91766,16107.19171,18292.63514,20293.89746,22066.44214,24703.79615,...,47124000.0,49569000.0,51732000.0,53165019.0,54433565.0,55630100.0,57374179.0,58623428.0,59925035,61083916
Iceland,Europe,7267.688428,9244.001412,10350.15906,13319.89568,15798.06362,19654.96247,23269.6075,26923.20628,25144.39201,...,182053.0,198676.0,209275.0,221823.0,233997.0,244676.0,259012.0,271192.0,288030,301931
Italy,Europe,4931.404155,6248.656232,8243.58234,10022.40131,12269.27378,14255.98475,16537.4835,19207.23482,22013.64486,...,50843200.0,52667100.0,54365564.0,56059245.0,56535636.0,56729703.0,56840847.0,57479469.0,57926999,58147733
Norway,Europe,10095.42172,11653.97304,13450.40151,16361.87647,18965.05551,23311.34939,26298.63531,31540.9748,33965.66115,...,3638919.0,3786019.0,3933004.0,4043205.0,4114787.0,4186147.0,4286357.0,4405672.0,4535591,4627926
Spain,Europe,3834.034742,4564.80241,5693.843879,7993.512294,10638.75131,13236.92117,13926.16997,15764.98313,18603.06452,...,31158061.0,32850275.0,34513161.0,36439000.0,37983310.0,38880702.0,39549438.0,39855442.0,40152517,40448191
Sweden,Europe,8527.844662,9911.878226,12329.44192,15258.29697,17832.02464,18855.72521,20667.38125,23586.92927,23880.01683,...,7561588.0,7867931.0,8122293.0,8251648.0,8325260.0,8421403.0,8718867.0,8897619.0,8954175,9031088


Quick review of conditions: `==  != > < >= <= `
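It can help to look at what the condition produces on its own before using it as a filter. The sketch below uses a five-row stand-in (lifeExp_2007 values from the output above; `sample` is our name) and also combines two conditions with `&`, which goes slightly beyond what we used so far; each condition needs its own parentheses.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "continent": ["Africa", "Africa", "Europe", "Europe", "Oceania"],
        "lifeExp_2007": [72.301, 42.731, 81.701, 71.777, 81.235],
    },
    index=pd.Index(
        ["Algeria", "Angola", "Switzerland", "Turkey", "Australia"], name="country"
    ),
)

# The condition alone is a Series of True/False, one per row.
mask = sample.lifeExp_2007 > 80
print(mask)

# Using it inside [] keeps only the rows where it is True.
over_80 = sample[mask]

# Two conditions combined with & (and); use | for or.
european_over_80 = sample[(sample.lifeExp_2007 > 80) & (sample.continent == "Europe")]

print(over_80.index.tolist())
print(european_over_80.index.tolist())
```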

**Exercise 5**

Create a subset of the data that contains rows for countries where the GDP per capita in 2007 was less than $1000

In [254]:
data[data.gdpPercap_2007 < 1000]


country,continent,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,...,pop_1962,pop_1967,pop_1972,pop_1977,pop_1982,pop_1987,pop_1992,pop_1997,pop_2002,pop_2007
Burundi,Africa,339.296459,379.564628,355.203227,412.977514,464.099504,556.103265,559.603231,621.818819,631.699878,...,2961915.0,3330989.0,3529983.0,3834415.0,4580410.0,5126023.0,5809236.0,6121610.0,7021078,8390505
Central African Republic,Africa,1071.310713,1190.844328,1193.068753,1136.056615,1070.013275,1109.374338,956.752991,844.87635,747.905525,...,1523478.0,1733638.0,1927260.0,2167533.0,2476971.0,2840009.0,3265124.0,3696513.0,4048013,4369038
Comoros,Africa,1102.990936,1211.148548,1406.648278,1876.029643,1937.577675,1172.603047,1267.100083,1315.980812,1246.90737,...,191689.0,217378.0,250027.0,304739.0,348643.0,395114.0,454429.0,527982.0,614382,710960
Congo Dem. Rep.,Africa,780.542326,905.86023,896.314634,861.593242,904.896069,795.757282,673.747818,672.774812,457.719181,...,17486434.0,19941073.0,23007669.0,26480870.0,30646495.0,35481645.0,41672143.0,47798986.0,55379852,64606759
Eritrea,Africa,328.940557,344.161886,380.995843,468.79497,514.324208,505.753808,524.875849,521.134133,582.85851,...,1666618.0,1820319.0,2260187.0,2512642.0,2637297.0,2915959.0,3668440.0,4058319.0,4414865,4906585
Ethiopia,Africa,362.14628,378.904163,419.456416,516.118644,566.243944,556.808383,577.860747,573.741314,421.353465,...,25145372.0,27860297.0,30770372.0,34617799.0,38111756.0,42999530.0,52088559.0,59861301.0,67946797,76511887
Gambia,Africa,485.230659,520.926711,599.650276,734.782912,756.086836,884.755251,835.809611,611.658861,665.624413,...,374020.0,439593.0,517101.0,608274.0,715523.0,848406.0,1025384.0,1235767.0,1457766,1688359
Guinea,Africa,510.196492,576.267025,686.373674,708.759541,741.666231,874.685864,857.250358,805.572472,794.348438,...,3140003.0,3451418.0,3811387.0,4227026.0,4710497.0,5650262.0,6990574.0,8048834.0,8807818,9947814
Guinea-Bissau,Africa,299.850319,431.790457,522.034373,715.58064,820.224588,764.725963,838.123967,736.415392,745.539871,...,627820.0,601287.0,625361.0,745228.0,825987.0,927524.0,1050938.0,1193708.0,1332459,1472041
Liberia,Africa,575.572996,620.96999,634.195163,713.603648,803.005454,640.322438,572.199569,506.113857,636.622919,...,1112796.0,1279406.0,1482628.0,1703617.0,1956875.0,2269414.0,1912974.0,2200725.0,2814651,3193942


### Using a mask to identify a specific condition.

A mask can be useful to locate where a particular subset of values exists or doesn't exist, for example NaN ("not a number") values. 

A comparison or function is applied element by element and returns a DataFrame of the same shape containing True and False.

Boolean is a Python data type with two values, True and False. False is Python's way of saying "No."

In [124]:
x = 1
x > 5

False

pandas provides a function, `pd.isnull`, to check for null values (missing data, or NaN). 

In [120]:
pd.isnull(data)

country,continent,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,...,pop_1962,pop_1967,pop_1972,pop_1977,pop_1982,pop_1987,pop_1992,pop_1997,pop_2002,pop_2007
Algeria,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Angola,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Benin,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Botswana,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Burkina Faso,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Switzerland,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Turkey,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
United Kingdom,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Australia,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


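This looks like very clean data: none of the cells shown has a value of True, so none of them is null.

Let's confirm by taking a closer look, applying a filter to the data. 

We use the any() method with axis=1, which returns True for every row that contains at least one True value. Filtering on that result returns no rows, so no data is missing.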

In [145]:
data[pd.isnull(data).any(axis=1)]

country,continent,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,...,pop_1962,pop_1967,pop_1972,pop_1977,pop_1982,pop_1987,pop_1992,pop_1997,pop_2002,pop_2007
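Because our data has no missing values, the filter above comes back empty. To see what the same steps look like when something is missing, here is a sketch on a stand-in DataFrame in which we have deliberately replaced one value with NaN (the gap is made up for the illustration; the other values come from the output above).

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {
        # Angola's value is deliberately set to NaN for this demonstration.
        "gdpPercap_1952": [2449.008185, np.nan, 1062.752200],
        "pop_2007": [33333216, 12420476, 8078314],
    },
    index=pd.Index(["Algeria", "Angola", "Benin"], name="country"),
)

# True wherever a cell is missing.
null_mask = pd.isnull(sample)

# Rows that have at least one missing cell.
rows_with_nulls = sample[null_mask.any(axis=1)]

# True counts as 1, so summing the mask counts missing cells per column.
nulls_per_column = null_mask.sum()

print(rows_with_nulls)
print(nulls_per_column)
```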


## Group By: split-apply-combine
pandas' vectorized methods and grouping operations give users a lot of flexibility in analysing their data.

For instance, let’s say we want a clearer view of how the European countries split according to their GDP.

We can split the countries into two groups over the years surveyed: those with a GDP higher than the European average and those with a lower GDP.

Remember we can use methods like .mean() on a dataframe. .mean() is calculated per column. 

In [188]:
# first create a new DataFrame that is a subset, just those countries in the continent of Europe.
europe_df = data[data.continent == "Europe"].copy()
europe_df.head()

country,continent,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,...,pop_1962,pop_1967,pop_1972,pop_1977,pop_1982,pop_1987,pop_1992,pop_1997,pop_2002,pop_2007
Albania,Europe,1601.056136,1942.284244,2312.888958,2760.196931,3313.422188,3533.00391,3630.880722,3738.932735,2497.437901,...,1728137.0,1984060.0,2263554.0,2509048.0,2780097.0,3075321.0,3326498.0,3428038.0,3508512,3600523
Austria,Europe,6137.076492,8842.59803,10750.72111,12834.6024,16661.6256,19749.4223,21597.08362,23687.82607,27042.01868,...,7129864.0,7376998.0,7544201.0,7568430.0,7574613.0,7578903.0,7914969.0,8069876.0,8148312,8199783
Belgium,Europe,8343.105127,9714.960623,10991.20676,13149.04119,16672.14356,19117.97448,20979.84589,22525.56308,25575.57069,...,9218400.0,9556500.0,9709100.0,9821800.0,9856303.0,9870200.0,10045622.0,10199787.0,10311970,10392226
Bosnia and Herzegovina,Europe,973.533195,1353.989176,1709.683679,2172.352423,2860.16975,3528.481305,4126.613157,4314.114757,2546.781445,...,3349000.0,3585000.0,3819000.0,4086000.0,4172693.0,4338977.0,4256013.0,3607000.0,4165416,4552198
Bulgaria,Europe,2444.286648,3008.670727,4254.337839,5577.0028,6597.494398,7612.240438,8224.191647,8239.854824,6302.623438,...,8012946.0,8310226.0,8576200.0,8797022.0,8892098.0,8971958.0,8658506.0,8066057.0,7661799,7322858


Looks like we still have ALL of the columns, not just GDP, so let's slice the DataFrame further.

We'll use `.iloc` because we want a dozen of the 37 columns, and listing them all by name would be tedious.

In [256]:
# overwrite our existing dataframe, and use iloc to get all rows, just the columns with the GDP per capita variables.
europe_df = europe_df.iloc[:,1:13]
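Selecting by position works, but it depends on the column order staying the same. As an alternative sketch, `DataFrame.filter(like=...)` selects columns by a substring of their names. The small `toy` frame below is made up for illustration (its values are not taken from the lesson data) and only mimics the column naming pattern.

```python
import pandas as pd

# Hypothetical miniature frame that mimics the gapminder column naming pattern.
toy = pd.DataFrame(
    {
        "continent": ["Europe", "Europe"],
        "gdpPercap_1952": [1601.1, 6137.1],
        "gdpPercap_1957": [1942.3, 8842.6],
        "lifeExp_1952": [55.2, 66.8],
    },
    index=pd.Index(["Albania", "Austria"], name="country"),
)

# Positional selection: all rows, columns at positions 1 and 2.
by_position = toy.iloc[:, 1:3]

# Name-based selection: every column whose name contains "gdpPercap".
by_name = toy.filter(like="gdpPercap")

print(by_name.columns.tolist())
print(by_position.equals(by_name))
```

Both approaches return the same columns here; the name-based one keeps working if columns are later reordered.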

In [190]:
europe_df.head()

country,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
Albania,1601.056136,1942.284244,2312.888958,2760.196931,3313.422188,3533.00391,3630.880722,3738.932735,2497.437901,3193.054604,4604.211737,5937.029526
Austria,6137.076492,8842.59803,10750.72111,12834.6024,16661.6256,19749.4223,21597.08362,23687.82607,27042.01868,29095.92066,32417.60769,36126.4927
Belgium,8343.105127,9714.960623,10991.20676,13149.04119,16672.14356,19117.97448,20979.84589,22525.56308,25575.57069,27561.19663,30485.88375,33692.60508
Bosnia and Herzegovina,973.533195,1353.989176,1709.683679,2172.352423,2860.16975,3528.481305,4126.613157,4314.114757,2546.781445,4766.355904,6018.975239,7446.298803
Bulgaria,2444.286648,3008.670727,4254.337839,5577.0028,6597.494398,7612.240438,8224.191647,8239.854824,6302.623438,5970.38876,7696.777725,10680.79282


We can use mean() to get the mean of each column. 

In [211]:
europe_df.mean()

gdpPercap_1952     5661.057435
gdpPercap_1957     6963.012816
gdpPercap_1962     8365.486814
gdpPercap_1967    10143.823757
gdpPercap_1972    12479.575246
gdpPercap_1977    14283.979110
gdpPercap_1982    15617.896551
gdpPercap_1987    17214.310727
gdpPercap_1992    17061.568084
gdpPercap_1997    19076.781802
gdpPercap_2002    21711.732422
gdpPercap_2007    25054.481636
dtype: float64

What type of object do we get back when we call mean() on a whole DataFrame?

In [212]:
type(europe_df.mean())

pandas.core.series.Series

Now we know the average GDP for each of the years. The next step is to figure out whether each country's GDP is above the mean for that year, by creating a boolean mask like we did earlier. Remember, the mask is itself a DataFrame of True/False values.

In [262]:
mask_higher = europe_df > europe_df.mean()
mask_higher.head()

country,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007,wealth_score
Albania,False,False,False,False,False,False,False,False,False,False,False
Austria,True,True,True,True,True,True,True,True,True,True,True
Belgium,True,True,True,True,True,True,True,True,True,True,True
Bosnia and Herzegovina,False,False,False,False,False,False,False,False,False,False,False
Bulgaria,False,False,False,False,False,False,False,False,False,False,False
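To see how a DataFrame is compared against a Series of column means, here is a small sketch with invented values (`toy`, `y1`, and `y2` are hypothetical names). Because the Series labels are the same as the DataFrame's column names, each column is compared against its own mean.

```python
import pandas as pd

# Hypothetical toy data: three "countries", two "years".
toy = pd.DataFrame(
    {"y1": [1.0, 5.0, 9.0], "y2": [10.0, 20.0, 60.0]},
    index=["A", "B", "C"],
)

# Series indexed by column name: y1 -> 5.0, y2 -> 30.0
col_means = toy.mean()

# Each column is compared with the mean that carries the same label.
mask = toy > col_means

print(mask)
```

Note that the comparison is strict: a value exactly equal to the mean (row B, column y1) comes out as False.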


We then estimate a wealth score based on the historical (from 1962 to 2007) values, by counting how many times each country fell into the higher-GDP group. So, we need to count how many True values there are in each row.

We can use the **aggregate()** method to sum the values (each True counts as 1), with axis=1 because we're applying the sum horizontally, across all of the columns in a row. axis=0 is the default for most pandas methods and applies an operation down each column. Dividing by the number of columns turns the count into a fraction between 0 and 1.

In [263]:
wealth_score = mask_higher.aggregate('sum', axis=1) / len(europe_df.columns)
wealth_score.head()

country
Albania                   0.0
Austria                   1.0
Belgium                   1.0
Bosnia and Herzegovina    0.0
Bulgaria                  0.0
dtype: float64
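The axis argument is easy to mix up, so here is a minimal sketch on a made-up boolean mask (the names and values are invented for illustration). It shows that summing booleans counts the True values, and how axis=1 differs from axis=0.

```python
import pandas as pd

# Hypothetical boolean mask: rows are "countries", columns are "years".
mask = pd.DataFrame(
    {"y1": [True, False, True], "y2": [True, False, False]},
    index=["A", "B", "C"],
)

# axis=1: sum across the columns of each row (True counts as 1).
per_row = mask.sum(axis=1)

# axis=0: sum down each column instead.
per_col = mask.sum(axis=0)

# Fraction of columns in which each row is True.
share = per_row / len(mask.columns)

# aggregate("sum", axis=1) gives the same per-row counts as .sum(axis=1).
same = mask.aggregate("sum", axis=1)

print(per_row)
print(per_col)
print(share)
```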

In [264]:
type(wealth_score)

pandas.core.series.Series

We can now add this back to our DataFrame, since it's a pandas Series with the same index. (pandas matches the values up by index label.)

In [265]:
europe_df["wealth_score"] = wealth_score
europe_df.head()

country,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007,wealth_score
Albania,2312.888958,2760.196931,3313.422188,3533.00391,3630.880722,3738.932735,2497.437901,3193.054604,4604.211737,5937.029526,0.0
Austria,10750.72111,12834.6024,16661.6256,19749.4223,21597.08362,23687.82607,27042.01868,29095.92066,32417.60769,36126.4927,1.0
Belgium,10991.20676,13149.04119,16672.14356,19117.97448,20979.84589,22525.56308,25575.57069,27561.19663,30485.88375,33692.60508,1.0
Bosnia and Herzegovina,1709.683679,2172.352423,2860.16975,3528.481305,4126.613157,4314.114757,2546.781445,4766.355904,6018.975239,7446.298803,0.0
Bulgaria,4254.337839,5577.0028,6597.494398,7612.240438,8224.191647,8239.854824,6302.623438,5970.38876,7696.777725,10680.79282,0.0
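Adding a Series as a new column relies on index alignment rather than row order. The sketch below uses invented data (`df` and `score` are hypothetical) to show what happens when the Series is in a different order and is missing a label.

```python
import pandas as pd

# Hypothetical frame indexed by "country" labels.
df = pd.DataFrame({"gdp": [10.0, 20.0, 30.0]}, index=["A", "B", "C"])

# A Series in a different order, with no entry for "B".
score = pd.Series({"C": 1.0, "A": 0.0})

# Values are matched by index label, not by position;
# labels missing from the Series become NaN.
df["score"] = score

print(df)
```

In our case the two indexes are identical, so every country gets its own score.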


## Groupby()


We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average life expectancy in a particular year. Remember we've done this on a column using the describe() method. 

We can calculate basic statistics for all records in a single column using the syntax below:

In [266]:
data["lifeExp_2007"].describe()

count    142.000000
mean      67.007423
std       12.073021
min       39.613000
25%       57.160250
50%       71.935500
75%       76.413250
max       82.603000
Name: lifeExp_2007, dtype: float64

But what if we want to summarize by one or more variables, for example continent, and then apply statistics to each group? For that we can use pandas' .groupby() method.

Once we've created a grouped object, we can quickly calculate summary statistics for each group.

In [267]:
grouped_data = data.copy()
grouped_data = grouped_data.groupby("continent")

In [268]:
type(grouped_data)

pandas.core.groupby.generic.DataFrameGroupBy
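A grouped object does not hold results yet; it records how the rows are split. As a rough sketch with made-up data (the `toy` frame and its values are hypothetical), we can loop over the groups to see the split, then apply a statistic to combine each group into one value.

```python
import pandas as pd

# Hypothetical toy data, not the gapminder file.
toy = pd.DataFrame(
    {
        "continent": ["Europe", "Europe", "Asia"],
        "lifeExp": [70.0, 80.0, 60.0],
    },
    index=["A", "B", "C"],
)

grouped = toy.groupby("continent")

# Split: each group is a smaller DataFrame holding the rows for one continent.
for name, group in grouped:
    print(name, list(group.index))

# Apply and combine: one mean per continent, indexed by continent.
means = grouped["lifeExp"].mean()
print(means)
```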

We can now look at descriptive statistics for each of the columns in the original DataFrame, grouped by continent.

In [269]:
grouped_data.describe()

,gdpPercap_1952,gdpPercap_1952,gdpPercap_1952,gdpPercap_1952,gdpPercap_1952,gdpPercap_1952,gdpPercap_1952,gdpPercap_1952,gdpPercap_1957,gdpPercap_1957,...,pop_2002,pop_2002,pop_2007,pop_2007,pop_2007,pop_2007,pop_2007,pop_2007,pop_2007,pop_2007
continent,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
Africa,52.0,1252.572466,982.952116,298.846212,534.990554,987.025569,1454.886645,4725.295531,52.0,1385.236062,...,16973552.75,119901300.0,52.0,17875760.0,24917730.0,199579.0,2909226.5,10093310.5,19363654.5,135031200.0
Americas,25.0,4079.062552,3001.727522,1397.717137,2428.237769,3048.3029,3939.978789,13990.48208,25.0,4616.043733,...,26769436.0,287675500.0,25.0,35954850.0,68833780.0,1056608.0,5675356.0,9319622.0,28674757.0,301139900.0
Asia,33.0,5195.484004,18634.890865,331.0,749.681655,1206.947913,3035.326002,108382.3529,33.0,5787.73294,...,66907826.0,1280400000.0,33.0,115513800.0,289673400.0,708573.0,6426679.0,24821286.0,69453570.0,1318683000.0
Europe,30.0,5661.057435,3114.060493,973.533195,3241.132406,5142.469716,7236.794919,14734.23275,30.0,6963.012816,...,20833960.25,82350670.0,30.0,19536620.0,23624740.0,301931.0,4780559.5,9493598.0,20849695.25,82401000.0
Oceania,2.0,10298.08565,365.560078,10039.59564,10168.840645,10298.08565,10427.330655,10556.57566,2.0,11598.522455,...,15637103.25,19546790.0,2.0,12274970.0,11538850.0,4115771.0,8195372.25,12274973.5,16354574.75,20434180.0


We can also look at a single statistic, such as the mean, calculated for each group:

In [240]:
grouped_data.mean()

continent,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,...,pop_1962,pop_1967,pop_1972,pop_1977,pop_1982,pop_1987,pop_1992,pop_1997,pop_2002,pop_2007
Africa,1252.572466,1385.236062,1598.078825,2050.363801,2339.615674,2585.938508,2481.59296,2282.668991,2281.810333,2378.759555,...,5702247.0,6447875.0,7305376.0,8328097.0,9602857.0,11054500.0,12674640.0,14304480.0,16033150.0,17875760.0
Americas,4079.062552,4616.043733,4901.54187,5668.253496,6491.334139,7352.007126,7506.737088,7793.400261,8044.934406,8889.300863,...,17330810.0,19229860.0,21175370.0,23122710.0,25211640.0,27310160.0,29570960.0,31876020.0,33990910.0,35954850.0
Asia,5195.484004,5787.73294,5729.369625,5971.173374,8187.468699,7791.31402,7434.135157,7608.226508,8639.690248,9834.093295,...,51404760.0,57747360.0,65180980.0,72257990.0,79095020.0,87006690.0,94948250.0,102523800.0,109145500.0,115513800.0
Europe,5661.057435,6963.012816,8365.486814,10143.823757,12479.575246,14283.97911,15617.896551,17214.310727,17061.568084,19076.781802,...,15345170.0,16039300.0,16687840.0,17238820.0,17708900.0,18103140.0,18604760.0,18964800.0,19274130.0,19536620.0
Oceania,10298.08565,11598.522455,12696.45243,14495.02179,16417.33338,17283.957605,18554.70984,20448.04016,20894.045885,24024.17517,...,6641759.0,7300207.0,8053050.0,8619500.0,9197425.0,9787208.0,10459830.0,11120720.0,11727410.0,12274970.0


Now let's look at a specific column. 

In [241]:
grouped_data["lifeExp_2007"].describe()

continent,count,mean,std,min,25%,50%,75%,max
Africa,52.0,54.806038,9.630781,39.613,47.834,52.9265,59.44425,76.442
Americas,25.0,73.60812,4.440948,60.916,71.752,72.899,76.384,80.653
Asia,33.0,70.728485,7.963724,43.828,65.483,72.396,75.635,82.603
Europe,30.0,77.6486,2.979813,71.777,75.02975,78.6085,79.81225,81.757
Oceania,2.0,80.7195,0.729027,80.204,80.46175,80.7195,80.97725,81.235
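If describe() returns more than we need, one option is to ask for just a few statistics with `.agg()`. This is a sketch on invented data (the `toy` frame is hypothetical), not the lesson's dataset.

```python
import pandas as pd

# Hypothetical toy data, not the gapminder file.
toy = pd.DataFrame(
    {
        "continent": ["Europe", "Europe", "Asia", "Asia"],
        "lifeExp": [70.0, 80.0, 60.0, 64.0],
    }
)

# Pick one column from the grouped object, then request several statistics.
summary = toy.groupby("continent")["lifeExp"].agg(["mean", "min", "max"])

print(summary)
```

The result is a DataFrame with one row per group and one column per requested statistic.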


### Backup content on transposing

In [13]:
data2 = data.T

In [14]:
data2.head()

,0,1,2,3,4,5,6,7,8,9,...,23,24,25,26,27,28,29,30,31,32
country,Afghanistan,Bahrain,Bangladesh,Cambodia,China,Hong Kong China,India,Indonesia,Iran,Iraq,...,Philippines,Saudi Arabia,Singapore,Sri Lanka,Syria,Taiwan,Thailand,Vietnam,West Bank and Gaza,Yemen Rep.
gdpPercap_1952,779.445,9867.08,684.244,368.469,400.449,3054.42,546.566,749.682,3035.33,4129.77,...,1272.88,6459.55,2315.14,1083.53,1643.49,1206.95,757.797,605.066,1515.59,781.718
gdpPercap_1957,820.853,11635.8,661.637,434.038,575.987,3629.08,590.062,858.9,3290.26,6229.33,...,1547.94,8157.59,2843.1,1072.55,2117.23,1507.86,793.577,676.285,1827.07,804.83
gdpPercap_1962,853.101,12753.3,686.342,496.914,487.674,4692.65,658.347,849.29,4187.33,8341.74,...,1649.55,11626.4,3674.74,1074.47,2193.04,1822.88,1002.2,772.049,2198.96,825.623
gdpPercap_1967,836.197,14804.7,721.186,523.432,612.706,6197.96,700.771,762.432,5906.73,8931.46,...,1814.13,16903,4977.42,1135.51,1881.92,2643.86,1295.46,637.123,2649.72,862.442
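The `.T` attribute swaps rows and columns. A minimal sketch with made-up values (`toy` and its columns are hypothetical) shows the effect, and why it can help to set the country column as the index first: when text and numbers end up in the same column after transposing, pandas typically falls back to a generic object dtype.

```python
import pandas as pd

# Hypothetical toy data with a default integer index, like the output above.
toy = pd.DataFrame(
    {"country": ["A", "B"], "gdp_1952": [1.0, 2.0], "gdp_1957": [3.0, 4.0]}
)

# Plain transpose: the old column names become the row labels,
# and the "country" row sits alongside the numeric rows.
flipped = toy.T

# Setting the index first keeps only numbers in the body of the table.
wide = toy.set_index("country").T

print(flipped)
print(wide.dtypes)
```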
