# Data Analysis in Python - II: Reading and Examining Data

## Introduction

In this lesson, we will learn how to read data from a file and examine the data in a DataFrame.  

**Note:** Use the table of contents (TOC) to navigate between sections.


<font color = "red"><b>Important:</b> Please make sure that this file is saved in the root folder. If not, please close the file, upload it to the correct location, restart your server from the hub control panel, and open the file again. </font>

## The dataset

Dataset: "The Statistics of Poverty and Inequality," submitted by Mary Rouncefield, Mathematics Department, Chester College, U.K.
Dataset obtained from the Journal of Statistics Education (http://jse.amstat.org/publications/jse). Accessed [3/24/2020]. Used by permission of the author.

For this lesson, we will work with a dataset that captures the statistics of poverty and inequality. 
The dataset provides the following information for the countries included in the sample.

- Live birth rate per 1,000 of population
- Death rate per 1,000 of population
- Infant deaths per 1,000 of population under 1 year old
- Life expectancy at birth for males
- Life expectancy at birth for females
- Gross National Product per capita in U.S. dollars 
- Country Group
  - 1 = Eastern Europe
  - 2 = South America and Mexico
  - 3 = Western Europe, North America, Japan, Australia, New Zealand
  - 4 = Middle East
  - 5 = Asia
  - 6 = Africa
- Country
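
The country group is stored as an integer code, so it can help to translate the codes into readable names. The sketch below is one possible way to do that with a dictionary and `Series.map`. The names `REGION_NAMES`, `sample`, and `RegionName` are made up for this illustration, and it assumes the group is stored in a column called `Region`.

```python
import pandas as pd

# Country group codes, copied from the list in the data description.
# REGION_NAMES is a hypothetical helper name, not part of the dataset.
REGION_NAMES = {
    1: "Eastern Europe",
    2: "South America and Mexico",
    3: "Western Europe, North America, Japan, Australia, New Zealand",
    4: "Middle East",
    5: "Asia",
    6: "Africa",
}

# A tiny stand-in frame that uses the same integer codes.
sample = pd.DataFrame({"Country": ["Albania", "Zimbabwe"], "Region": [1, 6]})

# Series.map looks up each code in the dictionary.
sample["RegionName"] = sample["Region"].map(REGION_NAMES)
print(sample)
```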

Let's check where you can find the data file and the data dictionary. 

Please check the data description for additional details about the dataset and the sources of the information. 

## Loading the data set into a DataFrame

We will learn various ways to read data into DataFrames in a subsequent lesson. For now, please modify the code below as needed and execute it to read the data into the `povData` DataFrame.

In [16]:
# read poverty data

import pandas as pd

povData = pd.read_csv('scratch/PovertyData.csv', sep=',', na_values="*")
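
To see what `na_values="*"` does without touching the real file, here is a small sketch that reads a hand-typed, two-row stand-in from memory. The text in `csv_text` is an assumption made for illustration and is not the actual layout of `PovertyData.csv`.

```python
import io

import pandas as pd

# A two-row stand-in typed by hand; "*" marks a missing GNI value.
csv_text = """Country,GNI
Albania,600
Former_E._Germany,*
"""

# na_values="*" tells pandas to treat "*" as a missing value (NaN)
# instead of reading the whole column as text.
demo = pd.read_csv(io.StringIO(csv_text), sep=",", na_values="*")

print(demo)
print(demo["GNI"].isna().sum())  # number of missing GNI values
```

Without `na_values`, the `*` would likely force the `GNI` column to be read as text rather than numbers.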

**Note:** I will use pathnames that assume all lesson and homework notebooks are stored in the root folder. You should be familiar with hierarchical file storage and relative pathnames, especially if you choose to organize your files differently than suggested.
If you need to learn or review these concepts, you can start with the following videos.

[Video 1](https://youtu.be/BMT3JUWmqYY)
<BR/>
[Video 2](https://youtu.be/ephId3mYu9o)
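
If reading the file fails, it can help to check how the relative pathname is being resolved. The following is a minimal sketch using the standard `pathlib` module; the variable name `data_path` is made up for illustration, and the printed locations depend on where your notebook is running.

```python
from pathlib import Path

# Relative pathnames are resolved against the current working directory,
# which for a notebook is typically the folder the notebook is saved in.
data_path = Path("scratch") / "PovertyData.csv"

print(Path.cwd())           # where relative pathnames start from
print(data_path.resolve())  # the absolute pathname that would be opened
print(data_path.exists())   # False means the file is not where the code expects
```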

## Examining the data frame

We can check how many countries (rows) are included in the sample and how many attributes (columns) are in the dataset using the `shape` property, which returns a `(rows, columns)` tuple.

In [3]:
povData.shape

(97, 8)
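
Because `shape` is an ordinary tuple, it can be unpacked into two variables. A minimal sketch on a small stand-in frame (the names `small`, `n_rows`, and `n_cols` are made up for illustration):

```python
import pandas as pd

# A three-row stand-in for the lesson's data.
small = pd.DataFrame(
    {"Country": ["Albania", "Bulgaria", "Hungary"], "Region": [1, 1, 1]}
)

# Unpack the (rows, columns) tuple.
n_rows, n_cols = small.shape
print(f"{n_rows} rows, {n_cols} columns")
```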

We can retrieve the names of all the columns using the `columns` property.

In [4]:
povData.columns

Index(['LiveBirthRate', 'DeathRate', 'InfantDeaths', 'MaleLifeExpectancy',
       'FemaleLifeExpectancy', 'GNI', 'Region', 'Country'],
      dtype='object')

Notice that the result is an `Index` object, not a plain Python list.
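
An `Index` behaves much like a sequence: it can be converted to a list, indexed by position, and tested for membership. A short sketch on a stand-in frame (`small` and `cols` are illustrative names):

```python
import pandas as pd

small = pd.DataFrame(
    {"GNI": [600.0, 2250.0], "Region": [1, 1], "Country": ["Albania", "Bulgaria"]}
)

cols = small.columns
print(type(cols))     # a pandas Index
print(list(cols))     # convert to a plain Python list
print(cols[0])        # positional access works like a list
print("GNI" in cols)  # membership test on the column names
```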

We can examine the data types of the values in each column using the `dtypes` property.

In [5]:
povData.dtypes

LiveBirthRate           float64
DeathRate               float64
InfantDeaths            float64
MaleLifeExpectancy      float64
FemaleLifeExpectancy    float64
GNI                     float64
Region                    int64
Country                  object
dtype: object
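
The result of `dtypes` is a Series indexed by column name. If you only want the numeric columns, `select_dtypes` is one way to pick them out. A sketch on a stand-in frame (`small` and `numeric` are illustrative names):

```python
import pandas as pd

small = pd.DataFrame(
    {"GNI": [600.0, 2250.0], "Region": [1, 1], "Country": ["Albania", "Bulgaria"]}
)

# dtypes is a Series: one entry per column.
print(small.dtypes)

# Keep only the columns that hold numbers (floats and integers).
numeric = small.select_dtypes(include="number")
print(list(numeric.columns))
```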

We can retrieve a summary of this information about the data frame using the `info()` method.

In [6]:
povData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97 entries, 0 to 96
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   LiveBirthRate         97 non-null     float64
 1   DeathRate             97 non-null     float64
 2   InfantDeaths          97 non-null     float64
 3   MaleLifeExpectancy    97 non-null     float64
 4   FemaleLifeExpectancy  97 non-null     float64
 5   GNI                   91 non-null     float64
 6   Region                97 non-null     int64  
 7   Country               97 non-null     object 
dtypes: float64(6), int64(1), object(1)
memory usage: 6.2+ KB
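The summary shows 91 non-null values for `GNI` out of 97 entries, so six GNI values are missing. Missing values can also be counted directly; here is a minimal sketch on a three-row stand-in (`small` and `missing` are illustrative names):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(
    {
        "Country": ["Albania", "Former_E._Germany", "Hungary"],
        "GNI": [600.0, np.nan, 2780.0],
    }
)

# isna() marks each missing cell; sum() counts them per column.
missing = small.isna().sum()
print(missing)

# The same count for one column: total rows minus non-null values.
print(len(small) - small["GNI"].count())
```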


We may want to examine the data itself, but it rarely makes sense to print a large dataset. We can instead check the first few or the last few rows of the data frame using the `head()` and `tail()` methods. By default, each returns five rows; pass a number to get a different count.

In [7]:
povData.head()

| | LiveBirthRate | DeathRate | InfantDeaths | MaleLifeExpectancy | FemaleLifeExpectancy | GNI | Region | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 24.7 | 5.7 | 30.8 | 69.6 | 75.5 | 600.0 | 1 | Albania |
| 1 | 12.5 | 11.9 | 14.4 | 68.3 | 74.7 | 2250.0 | 1 | Bulgaria |
| 2 | 13.4 | 11.7 | 11.3 | 71.8 | 77.7 | 2980.0 | 1 | Czechoslovakia |
| 3 | 12.0 | 12.4 | 7.6 | 69.8 | 75.9 | NaN | 1 | Former_E._Germany |
| 4 | 11.6 | 13.4 | 14.8 | 65.4 | 73.8 | 2780.0 | 1 | Hungary |


In [8]:
# use the print function
print(povData.head())

   LiveBirthRate  DeathRate  InfantDeaths  MaleLifeExpectancy  \
0           24.7        5.7          30.8                69.6   
1           12.5       11.9          14.4                68.3   
2           13.4       11.7          11.3                71.8   
3           12.0       12.4           7.6                69.8   
4           11.6       13.4          14.8                65.4   

   FemaleLifeExpectancy     GNI  Region            Country  
0                  75.5   600.0       1            Albania  
1                  74.7  2250.0       1           Bulgaria  
2                  77.7  2980.0       1     Czechoslovakia  
3                  75.9     NaN       1  Former_E._Germany  
4                  73.8  2780.0       1            Hungary  


In [9]:
povData.head(10)


| | LiveBirthRate | DeathRate | InfantDeaths | MaleLifeExpectancy | FemaleLifeExpectancy | GNI | Region | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 24.7 | 5.7 | 30.8 | 69.6 | 75.5 | 600.0 | 1 | Albania |
| 1 | 12.5 | 11.9 | 14.4 | 68.3 | 74.7 | 2250.0 | 1 | Bulgaria |
| 2 | 13.4 | 11.7 | 11.3 | 71.8 | 77.7 | 2980.0 | 1 | Czechoslovakia |
| 3 | 12.0 | 12.4 | 7.6 | 69.8 | 75.9 | NaN | 1 | Former_E._Germany |
| 4 | 11.6 | 13.4 | 14.8 | 65.4 | 73.8 | 2780.0 | 1 | Hungary |
| 5 | 14.3 | 10.2 | 16.0 | 67.2 | 75.7 | 1690.0 | 1 | Poland |
| 6 | 13.6 | 10.7 | 26.9 | 66.5 | 72.4 | 1640.0 | 1 | Romania |
| 7 | 14.0 | 9.0 | 20.2 | 68.6 | 74.5 | NaN | 1 | Yugoslavia |
| 8 | 17.7 | 10.0 | 23.0 | 64.6 | 74.0 | 2242.0 | 1 | USSR |
| 9 | 15.2 | 9.5 | 13.1 | 66.4 | 75.9 | 1880.0 | 1 | Byelorussian_SSR |


In [10]:
povData.tail()

| | LiveBirthRate | DeathRate | InfantDeaths | MaleLifeExpectancy | FemaleLifeExpectancy | GNI | Region | Country |
|---|---|---|---|---|---|---|---|---|
| 92 | 52.2 | 15.6 | 103.0 | 49.9 | 52.7 | 220.0 | 6 | Uganda |
| 93 | 50.5 | 14.0 | 106.0 | 51.3 | 54.7 | 110.0 | 6 | Tanzania |
| 94 | 45.6 | 14.2 | 83.0 | 50.3 | 53.7 | 220.0 | 6 | Zaire |
| 95 | 51.1 | 13.7 | 80.0 | 50.4 | 52.5 | 420.0 | 6 | Zambia |
| 96 | 41.7 | 10.3 | 66.0 | 56.5 | 60.1 | 640.0 | 6 | Zimbabwe |


In [11]:
povData.tail(2)

| | LiveBirthRate | DeathRate | InfantDeaths | MaleLifeExpectancy | FemaleLifeExpectancy | GNI | Region | Country |
|---|---|---|---|---|---|---|---|---|
| 95 | 51.1 | 13.7 | 80.0 | 50.4 | 52.5 | 420.0 | 6 | Zambia |
| 96 | 41.7 | 10.3 | 66.0 | 56.5 | 60.1 | 640.0 | 6 | Zimbabwe |
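
For a positive row count, `head(n)` and `tail(n)` give the same rows as positional slicing with `iloc`. The sketch below checks this on a ten-row stand-in frame (`small`, `first_three`, and `last_two` are illustrative names). Note that `tail()` keeps the original row labels, just as the output above shows labels 95 and 96.

```python
import pandas as pd

# Ten rows labelled 0 through 9.
small = pd.DataFrame({"value": range(10)})

first_three = small.head(3)
last_two = small.tail(2)

# head(3) matches the first three rows by position.
print(first_three.equals(small.iloc[:3]))

# tail(2) matches the last two rows by position.
print(last_two.equals(small.iloc[-2:]))

# The original row labels are preserved.
print(list(last_two.index))
```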


What is the data type of the result of `head()` or `tail()`? The result is itself a DataFrame, so it has a `shape` like any other data frame.

In [13]:
povData.head().shape

(5, 8)
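
The type can also be checked directly with `type()` or `isinstance()`. A minimal sketch on a stand-in frame (`small`, `top`, and `top_values` are illustrative names); it also shows that calling `head()` on a single column returns a Series instead:

```python
import pandas as pd

small = pd.DataFrame({"value": range(10)})

# head() on a DataFrame returns a smaller DataFrame.
top = small.head()
print(type(top).__name__)
print(top.shape)

# head() on a single column returns a Series.
top_values = small["value"].head()
print(type(top_values).__name__)
```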