# Data Analysis in Python - IV: Summarizing Data

## Introduction

In this lesson, we will learn how to summarize the data available in a DataFrame.

Note: 
1. Use the TOC to navigate between sections.


## Summary Statistics for the DataFrame

We can use the `describe()` function for a quick summary of the data.

In [19]:
# Load the dataset into a DataFrame

import pandas as pd

# na_values="*" tells pandas to treat "*" entries as missing (NaN)
povData = pd.read_csv('../scratch/PovertyData.csv', sep=',', na_values="*")
povData.head()

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600.0,1,Albania
1,12.5,11.9,14.4,68.3,74.7,2250.0,1,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980.0,1,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,,1,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780.0,1,Hungary


In [18]:
povData.describe()

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region
count,97.0,97.0,97.0,97.0,97.0,91.0,97.0
mean,29.229897,10.836082,54.901031,61.485567,66.151134,5741.252747,3.948454
std,13.546695,4.647495,45.992584,9.61597,11.005391,8093.679853,1.740277
min,9.7,2.2,4.5,38.1,41.2,80.0,1.0
25%,14.5,7.8,13.1,55.8,57.5,475.0,3.0
50%,29.0,9.5,43.0,63.7,67.8,1690.0,4.0
75%,42.2,12.5,83.0,68.6,75.4,7325.0,6.0
max,52.2,25.0,181.6,75.9,81.8,34064.0,6.0
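
By default, `describe()` summarizes only the numeric columns, which is why Country does not appear in the output above. The sketch below shows one way to include the other columns as well, using `include="all"`. The small DataFrame `toy` is made up for illustration and is not taken from PovertyData.csv.

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame(
    {"Country": ["A", "B", "C"], "GNI": [600.0, None, 2780.0]}
)

# include="all" adds non-numeric columns to the summary;
# statistics that do not apply to a column are shown as NaN
full = toy.describe(include="all")
print(full)
```

For a text column, the summary reports the number of entries, the number of distinct values, the most frequent value, and its frequency.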


Why is the count of the GNI column lower?
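
One way to investigate this is to count the missing values in each column, since `describe()` and `count()` skip them. The sketch below uses a made-up three-row CSV (not PovertyData.csv) in which `*` marks a missing GNI value, mirroring the `na_values="*"` argument used when loading the data.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; "*" marks a missing GNI value
csv_text = "Country,DeathRate,GNI\nA,5.7,600\nB,12.4,*\nC,13.4,2780\n"
toy = pd.read_csv(io.StringIO(csv_text), na_values="*")

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# count() reports only the non-missing values
gni_count = toy["GNI"].count()
print(gni_count)
```

Running `povData.isna().sum()` on the real data should show which columns contain missing values.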

## Summary Statistics for Subsets

You can also calculate statistics for individual columns or specific subsets of the data.

In [4]:
# min live birth rate
povData['LiveBirthRate'].min()

9.7

In [6]:
# max death rate among the first 20 countries
# (DeathRate is the column at position 1)
povData.iloc[0:20, 1].max()

18.0

In [7]:
# same result, selecting the column by name instead of position
povData.iloc[0:20]['DeathRate'].max()

18.0

In [8]:
# min of live birth rate and death rate
povData[['LiveBirthRate','DeathRate']].min()

LiveBirthRate    9.7
DeathRate        2.2
dtype: float64

Sometimes you may want to retrieve the index of the row with a specific value. For example, the index of the row with the lowest live birth rate.

In [9]:
# index of the row that has the min live birth rate
povData['LiveBirthRate'].idxmin()

30
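
The returned label can be passed to `.loc` to retrieve the whole row. A minimal sketch, using a made-up DataFrame `toy` rather than the poverty data:

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame(
    {"Country": ["A", "B", "C"], "LiveBirthRate": [24.7, 9.7, 13.4]}
)

# idxmin() returns the index label of the smallest value
min_label = toy["LiveBirthRate"].idxmin()

# .loc looks the row up by that label
min_row = toy.loc[min_label]
print(min_row["Country"])
```

With the lesson's data, `povData.loc[30]` would show the country with the lowest live birth rate.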

By default (`axis=0` or `axis='index'`), a statistic is computed down the rows, giving one value per column that summarizes all rows in the subset. To compute it across the columns instead, giving one value per row, specify `axis=1` or `axis='columns'`.

In [11]:
# min of death rate and infant deaths for each country
povData[['DeathRate','InfantDeaths']].min(axis='columns')

0      5.7
1     11.9
2     11.3
3      7.6
4     13.4
      ... 
92    15.6
93    14.0
94    14.2
95    13.7
96    10.3
Length: 97, dtype: float64
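
To see the two directions side by side, here is a small sketch on a made-up three-row DataFrame (the values are illustrative only):

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame(
    {"DeathRate": [5.7, 11.9, 11.7], "InfantDeaths": [30.8, 14.4, 11.3]}
)

# Default axis=0: one minimum per column
col_min = toy.min()

# axis=1: one minimum per row
row_min = toy.min(axis=1)

print(col_min)
print(row_min)
```

The first result has one entry per column name, while the second has one entry per row label.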

Familiarize yourself with the other available summary functions, such as `mean()`, `median()`, `std()`, `sum()`, and `quantile()`.
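
As a starting point, the sketch below applies a few common summary functions to a made-up column. The `agg()` call is one way to request several statistics at once.

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({"LiveBirthRate": [10.0, 20.0, 30.0, 40.0]})
col = toy["LiveBirthRate"]

mean_value = col.mean()          # arithmetic mean
median_value = col.median()      # middle value
std_value = col.std()            # sample standard deviation
q1_value = col.quantile(0.25)    # first quartile

# agg() computes several statistics in one call
summary = col.agg(["min", "max", "mean", "median"])
print(summary)
```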


## Bivariate Statistics

Some summary statistics are calculated on more than one variable. These are supported with specific functions. Let us try some examples. 

In [10]:
# calculate correlation between male and female life expectancy
povData['MaleLifeExpectancy'].corr(povData['FemaleLifeExpectancy'])


0.9825578248134278

In [11]:
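# calculate pairwise correlations for all numeric columns in the data
# numeric_only=True skips the non-numeric Country column (required in pandas 2.0+)
povData.corr(numeric_only=True)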

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region
LiveBirthRate,1.0,0.486197,0.858353,-0.866519,-0.894414,-0.629059,0.716883
DeathRate,0.486197,1.0,0.654623,-0.733467,-0.693033,-0.302754,0.339988
InfantDeaths,0.858353,0.654623,1.0,-0.936838,-0.955352,-0.601647,0.632524
MaleLifeExpectancy,-0.866519,-0.733467,-0.936838,1.0,0.982558,0.642963,-0.639382
FemaleLifeExpectancy,-0.894414,-0.693033,-0.955352,0.982558,1.0,0.65004,-0.693409
GNI,-0.629059,-0.302754,-0.601647,0.642963,0.65004,1.0,-0.283399
Region,0.716883,0.339988,0.632524,-0.639382,-0.693409,-0.283399,1.0
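
Covariance is another bivariate statistic, available through `cov()`. The sketch below, on made-up life expectancy values, also checks that the correlation returned by `corr()` equals the covariance divided by the product of the two standard deviations.

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame(
    {"male": [60.0, 65.0, 70.0, 72.0], "female": [66.0, 70.0, 76.0, 77.0]}
)

# Sample covariance between the two columns
cov_value = toy["male"].cov(toy["female"])

# Pearson correlation (the default method of corr())
r = toy["male"].corr(toy["female"])

# The same correlation computed from the covariance
r_manual = cov_value / (toy["male"].std() * toy["female"].std())

print(cov_value, r, r_manual)
```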


## Unique Values, Counts, and Membership

pandas includes functions that count values, list unique values, and test membership.

Write code below and add notes explaining what the function does.

In [12]:
# use count() with Region column
povData['Region'].count()

97

In [16]:
# use count() with GNI column
povData['GNI'].count()

91

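`count()` - returns the number of non-missing values. Region has 97 and GNI has 91, so six GNI values are missing.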

In [14]:
# use nunique() with Region column 
povData['Region'].nunique()

6

`nunique()` - returns the number of distinct values, ignoring missing values by default.

In [15]:
# use value_counts() with Region column
povData['Region'].value_counts()

6    27
3    19
5    17
2    12
1    11
4    11
Name: Region, dtype: int64

`value_counts()` - returns how many times each distinct value occurs, sorted from most to least frequent.
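
If proportions are more useful than raw counts, `value_counts()` accepts `normalize=True`. A minimal sketch on a made-up Region column:

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({"Region": [1, 1, 2, 6]})

# Raw counts per region
counts = toy["Region"].value_counts()

# Share of rows per region (sums to 1)
shares = toy["Region"].value_counts(normalize=True)

print(counts)
print(shares)
```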

In [17]:
# use unique() with Region column
povData['Region'].unique()

array([1, 2, 3, 5, 4, 6])

`unique()` - returns an array of the distinct values in the order they first appear.

In [23]:
# use isin with Region column as shown
povData[povData['Region'].isin([1,2])]

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600.0,1,Albania
1,12.5,11.9,14.4,68.3,74.7,2250.0,1,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980.0,1,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,,1,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780.0,1,Hungary
5,14.3,10.2,16.0,67.2,75.7,1690.0,1,Poland
6,13.6,10.7,26.9,66.5,72.4,1640.0,1,Romania
7,14.0,9.0,20.2,68.6,74.5,,1,Yugoslavia
8,17.7,10.0,23.0,64.6,74.0,2242.0,1,USSR
9,15.2,9.5,13.1,66.4,75.9,1880.0,1,Byelorussian_SSR


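`isin()` - returns a boolean Series that is True where the value is in the given list. Used as a mask, it keeps only the rows whose Region is 1 or 2.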
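
The mask can also be inverted with `~` to select the rows that are not in the list. A minimal sketch, using a made-up DataFrame `toy`:

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame(
    {"Country": ["A", "B", "C", "D"], "Region": [1, 2, 3, 6]}
)

# True where Region is 1 or 2
mask = toy["Region"].isin([1, 2])

in_regions = toy[mask]        # rows in regions 1 or 2
not_in_regions = toy[~mask]   # all other rows

print(in_regions)
print(not_in_regions)
```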