# First look at the data

**Example used: COVID-19 data**

Our World in Data provides COVID-19 public use data at https://ourworldindata.org/covid-cases. The dataset includes total cases and deaths, tests administered, hospital beds, and demographic data such as median age, gross domestic product, and a human development index, which is a composite measure of standard of living, educational levels, and life expectancy. The dataset used in this recipe was downloaded on March 3, 2024.

## Read and inspect shape and data type
### Read data


In [34]:
import pandas as pd
import numpy as np

# We first take a look at the data file and see that it is a CSV file of about 28 KB on disk.

# Read the file.
# Note: pandas can parse standard date formats, such as ISO 8601 (YYYY-MM-DD),
# for the columns you list in parse_dates.
covid_data = pd.read_csv("data/covidtotals.csv", parse_dates=["lastdate"])
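
To see what `parse_dates` changes, here is a minimal sketch that reads the same text twice. The three-row CSV is a hypothetical stand-in for `data/covidtotals.csv`, built in memory so the example runs without the file.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for data/covidtotals.csv
csv_text = """iso_code,lastdate,total_cases
AFG,2024-02-04,231539
ALB,2024-01-28,334863
DZA,2023-12-03,272010
"""

# Without parse_dates, the date column stays as text
as_text = pd.read_csv(io.StringIO(csv_text))
# With parse_dates, pandas converts the listed column to datetime
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["lastdate"])

print(as_text["lastdate"].dtype)   # a string/object dtype
print(as_dates["lastdate"].dtype)  # a datetime64 dtype
```

Only the parsed version supports date operations such as `.max()` on dates or the `.dt` accessor.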

### Inspect shape, columns

In [35]:
# Inspect the shape (number of rows and columns) of the DataFrame.
# DataFrame.shape is an attribute (no parentheses) that returns a tuple (rows, columns).
# Here it has 231 rows and 17 columns.
covid_data.shape

(231, 17)

In [36]:
# Inspect the columns

print(f"This file includes the following columns: {covid_data.columns}")

This file includes the following columns: Index(['iso_code', 'lastdate', 'location', 'total_cases', 'total_deaths',
       'total_cases_pm', 'total_deaths_pm', 'population', 'pop_density',
       'median_age', 'gdp_per_capita', 'hosp_beds', 'vac_per_hund',
       'aged_65_older', 'life_expectancy', 'hum_dev_ind', 'region'],
      dtype='object')


### Inspect head, tail, and sample

In [37]:
# Inspect some rows of the data, using head(), tail(), or sample()
print(f"The head of the dataset:\n")
covid_data.head(n=3)

The head of the dataset:



Unnamed: 0,iso_code,lastdate,location,total_cases,total_deaths,total_cases_pm,total_deaths_pm,population,pop_density,median_age,gdp_per_capita,hosp_beds,vac_per_hund,aged_65_older,life_expectancy,hum_dev_ind,region
0,AFG,2024-02-04,Afghanistan,231539.0,7982.0,5629.611,194.073,41128772,54.422,18.6,1803.987,0.5,,2.581,64.83,0.511,South Asia
1,ALB,2024-01-28,Albania,334863.0,3605.0,117813.348,1268.331,2842318,104.871,38.0,11803.431,2.89,,13.188,78.57,0.795,Eastern Europe
2,DZA,2023-12-03,Algeria,272010.0,6881.0,6057.694,153.241,44903228,17.348,29.1,13913.839,1.9,,6.211,76.88,0.748,North Africa


In [38]:
print(f"The tail of the dataset:\n")
covid_data.tail(n=3) # You can specify how many rows you want to inspect

The tail of the dataset:



Unnamed: 0,iso_code,lastdate,location,total_cases,total_deaths,total_cases_pm,total_deaths_pm,population,pop_density,median_age,gdp_per_capita,hosp_beds,vac_per_hund,aged_65_older,life_expectancy,hum_dev_ind,region
228,YEM,2022-11-06,Yemen,11945.0,2159.0,354.487,64.072,33696612,53.508,20.3,1479.147,0.7,,2.922,66.12,0.47,West Asia
229,ZMB,2023-12-03,Zambia,349304.0,4069.0,17449.783,203.27,20017670,22.995,17.7,3689.251,2.0,,2.48,63.89,0.584,Southern Africa
230,ZWE,2024-01-28,Zimbabwe,266265.0,5737.0,16314.719,351.52,16320539,42.729,19.6,1899.775,1.7,,2.822,61.49,0.571,Southern Africa


In [39]:
print(f"A random sample of the dataset:\n")
covid_data.sample()

A random sample of the dataset:



Unnamed: 0,iso_code,lastdate,location,total_cases,total_deaths,total_cases_pm,total_deaths_pm,population,pop_density,median_age,gdp_per_capita,hosp_beds,vac_per_hund,aged_65_older,life_expectancy,hum_dev_ind,region
220,VIR,2023-07-30,United States Virgin Islands,25389.0,132.0,255219.695,1326.913,99479,306.48,42.2,,,,18.601,80.58,,Caribbean


In [40]:
print("A random sample of the dataset with 3 rows, without a fixed random seed")
covid_data.sample(n=3, random_state=None) # pass an integer (e.g., random_state=42) to make the sample reproducible

A random sample of the dataset with 3 rows, without a fixed random seed


Unnamed: 0,iso_code,lastdate,location,total_cases,total_deaths,total_cases_pm,total_deaths_pm,population,pop_density,median_age,gdp_per_capita,hosp_beds,vac_per_hund,aged_65_older,life_expectancy,hum_dev_ind,region
123,MDV,2023-08-06,Maldives,186694.0,316.0,356423.66,603.286,523798,1454.433,30.6,15183.616,,,4.12,78.92,0.74,South Asia
74,GAB,2023-12-10,Gabon,49051.0,307.0,20532.048,128.506,2388997,7.859,23.1,16562.413,6.3,,4.45,66.47,0.703,Central Africa
217,ARE,2023-05-28,United Arab Emirates,1067030.0,2349.0,113019.214,248.805,9441138,112.442,34.0,67293.483,1.2,,1.144,77.97,0.89,West Asia
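
The effect of `random_state` can be checked on a small frame. This is a minimal sketch; the ten-row DataFrame and the seed value 42 are illustrative assumptions, not part of the COVID data.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for covid_data
df = pd.DataFrame({"location": list("ABCDEFGHIJ"), "total_cases": range(10)})

# Two draws with the same integer seed pick the same rows
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

print(first.equals(second))  # same seed, same rows
```

With `random_state=None` (the default), each call may return different rows, which is fine for a quick look but makes results harder to reproduce.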


### Inspect data info and data types


In [41]:
# .info() also shows the non-null count of each column, which reveals missing data
covid_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231 entries, 0 to 230
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   iso_code         231 non-null    object        
 1   lastdate         231 non-null    datetime64[ns]
 2   location         231 non-null    object        
 3   total_cases      231 non-null    float64       
 4   total_deaths     231 non-null    float64       
 5   total_cases_pm   231 non-null    float64       
 6   total_deaths_pm  231 non-null    float64       
 7   population       231 non-null    int64         
 8   pop_density      209 non-null    float64       
 9   median_age       194 non-null    float64       
 10  gdp_per_capita   191 non-null    float64       
 11  hosp_beds        170 non-null    float64       
 12  vac_per_hund     13 non-null     float64       
 13  aged_65_older    188 non-null    float64       
 14  life_expectancy  227 non-null    float64  

In [42]:
covid_data.dtypes

iso_code                   object
lastdate           datetime64[ns]
location                   object
total_cases               float64
total_deaths              float64
total_cases_pm            float64
total_deaths_pm           float64
population                  int64
pop_density               float64
median_age                float64
gdp_per_capita            float64
hosp_beds                 float64
vac_per_hund              float64
aged_65_older             float64
life_expectancy           float64
hum_dev_ind               float64
region                     object
dtype: object

### Checking missing data
Use `.isna()`.

In [43]:
# Checking missing data for one column
covid_data['life_expectancy'].isna()

0      False
1      False
2      False
3      False
4      False
       ...  
226    False
227    False
228    False
229    False
230    False
Name: life_expectancy, Length: 231, dtype: bool

In [44]:
# Because bool values (True and False) are essentially integers (1 and 0),
# we can add them together to count the True values (i.e., the missing, or NaN, values).
print(f"The number of missing values in the column 'life_expectancy' is: {covid_data['life_expectancy'].isna().sum()}")

The number of missing values in the column 'life_expectancy' is: 4


In [45]:
# Checking missing data for a DataFrame
covid_data.isna()

Unnamed: 0,iso_code,lastdate,location,total_cases,total_deaths,total_cases_pm,total_deaths_pm,population,pop_density,median_age,gdp_per_capita,hosp_beds,vac_per_hund,aged_65_older,life_expectancy,hum_dev_ind,region
0,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False
3,False,False,False,False,False,False,False,False,False,True,True,True,True,True,False,True,False
4,False,False,False,False,False,False,False,False,False,True,True,True,True,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
226,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False
227,False,False,False,False,False,False,False,False,True,True,True,True,True,True,False,True,False
228,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False
229,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False


In [46]:
# Compute the number of missing data in each column in a DataFrame
covid_data.isna().sum()

iso_code             0
lastdate             0
location             0
total_cases          0
total_deaths         0
total_cases_pm       0
total_deaths_pm      0
population           0
pop_density         22
median_age          37
gdp_per_capita      40
hosp_beds           61
vac_per_hund       218
aged_65_older       43
life_expectancy      4
hum_dev_ind         44
region               0
dtype: int64
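
Because `True` counts as 1, the same trick gives the share of missing values if you take the mean instead of the sum. A minimal sketch on a hypothetical four-row frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values
df = pd.DataFrame({
    "life_expectancy": [64.83, np.nan, 76.88, 78.57],
    "hosp_beds": [0.5, np.nan, np.nan, 2.89],
})

missing_count = df.isna().sum()   # number of NaN values per column
missing_share = df.isna().mean()  # fraction of NaN values per column

# Put both in one table, most incomplete column first
summary = pd.DataFrame(
    {"n_missing": missing_count, "share_missing": missing_share}
).sort_values("n_missing", ascending=False)
print(summary)
```

Sorting by the count puts the columns that need the most attention at the top.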

## Summary statistics (numerical)

Use `.describe()` to generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, *excluding* `NaN` values.


In [None]:
covid_data.describe() # generate descriptive statistics for the numerical columns

In [None]:
covid_data.describe(percentiles=[0.05, 0.95])

In [None]:
# We can also use .describe() on a single Series
covid_data['total_cases'].describe()

In [None]:
# You can also use .min(), .max(), .mean(), etc. to check individual summary statistics of each column
covid_data.min()
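
The note that `.describe()` excludes `NaN` values is easy to verify on a small Series. This is a minimal sketch with made-up numbers, not values from the COVID data.

```python
import numpy as np
import pandas as pd

# Four entries, one of them missing
s = pd.Series([10.0, 20.0, np.nan, 30.0], name="total_cases")

# Ask for the 5th and 95th percentiles, as in the cell above
stats = s.describe(percentiles=[0.05, 0.95])
print(stats)

# count reflects only the non-missing entries
print(len(s), stats["count"])
```

The Series has four entries, but `count` reports three, and the mean is computed from those three values only.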

## Summary statistics (categorical)

Example used: nls97 data

The National Longitudinal Survey (NLS) of Youth was conducted by the United States Bureau of Labor Statistics. The survey started in 1997 with a cohort of individuals born between 1980 and 1985, with annual follow-ups through 2023. For this dataset, 89 variables on grades, employment, income, and attitudes toward government were pulled from the hundreds of data items in the survey. The NLS data can be downloaded from https://www.nlsinfo.org. You must create an investigator account to download the data, but there is no charge.

In [None]:
import pandas as pd

# read the data
nls97 = pd.read_csv("data/nls97.csv")


In [None]:
# inspect the size
nls97.shape

In [None]:
# inspect the information
nls97.info()

### without converting to Pandas `category` dtype

When dealing with categorical data, we can convert it to the `category` dtype, but we don't have to. The main advantage of converting is memory efficiency.

If you don't convert it to `category` dtype, you can directly use `.value_counts( )` and `.describe( )` to inspect the frequency of categories and summary statistics.

In [None]:
nls97['gender'].value_counts(dropna=False) # dropna=False means count NaN as a separate category

In [None]:
# Use .describe( ) to print summary statistics of this categorical variable
nls97['gender'].describe()

In [None]:
# If you want to inspect the frequency in terms of percentage, you can pass normalize=True
nls97['gender'].value_counts(normalize=True)
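
The two options interact in a way worth seeing once: `dropna=False` adds a row for `NaN`, while `normalize=True` on its own computes shares over the non-missing values only. A minimal sketch on a hypothetical four-value Series:

```python
import numpy as np
import pandas as pd

# Hypothetical gender column with one missing value
gender = pd.Series(["Female", "Male", "Female", np.nan], name="gender")

# Counts, with NaN reported as its own category
counts_with_nan = gender.value_counts(dropna=False)

# Shares among non-missing values only (NaN is dropped by default)
shares = gender.value_counts(normalize=True)

print(counts_with_nan)
print(shares)
```

Combine both arguments (`normalize=True, dropna=False`) if the shares should be relative to all rows, including missing ones.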

### Converting to Pandas `category` dtype
- Use `.astype("category")` to convert a column (Series) to a categorical dtype.
- Then use `.value_counts()` and `.describe()` to inspect the frequency of categories and the summary statistics of this column.
- As you can see, the results are the same as when you don't convert it to the `category` dtype.
- The main difference is that a `category` dtype usually takes less memory than an `object` (`str`) dtype in pandas when the same values repeat many times.

In [None]:
# first convert gender to a categorical dtype
# If you would like to keep the original column, you can create a new column
nls97['gender_category'] = nls97['gender'].astype("category")

In [None]:
nls97['gender_category'].value_counts()

In [None]:
nls97['gender_category'].describe()
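
The memory claim can be checked with `Series.memory_usage(deep=True)`. This is a minimal sketch on a hypothetical Series of 10,000 repeated labels; exact byte counts depend on your pandas version, so none are shown here.

```python
import pandas as pd

# Hypothetical column: two labels repeated many times
gender = pd.Series(["Female", "Male"] * 5000, name="gender")
as_category = gender.astype("category")

# deep=True includes the memory used by the strings themselves
object_bytes = gender.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
print(category_bytes < object_bytes)
```

The saving comes from storing each distinct label once and keeping a small integer code per row, so it is largest when a column has few distinct values relative to its length.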

### [Optional] Other methods I found helpful for categorical data
#### `.unique()` and `.nunique()`

In [None]:
# Use .unique() to check the unique values of a categorical variable
nls97['gender'].unique()

In [None]:
nls97['gender'].nunique()

#### `pd.crosstab`
Cross-Tabulation (Compare Two Categorical Variables)

In [None]:
# If you want to inspect both the gender and marital status
pd.crosstab(nls97["gender"], nls97["maritalstatus"],
            dropna=False, normalize=False)
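
`pd.crosstab` can also report totals and row shares. The sketch below uses a hypothetical six-row frame; the category labels are illustrative assumptions and may not match the labels in `nls97`.

```python
import pandas as pd

# Hypothetical data; labels are for illustration only
df = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male", "Male", "Female"],
    "maritalstatus": ["Married", "Never-married", "Married",
                      "Married", "Never-married", "Married"],
})

# Raw counts, with row and column totals in an "All" margin
counts = pd.crosstab(df["gender"], df["maritalstatus"], margins=True)

# Shares within each row: each gender's row sums to 1
row_shares = pd.crosstab(df["gender"], df["maritalstatus"], normalize="index")

print(counts)
print(row_shares)
```

`normalize="index"` answers "what share of each gender falls in each marital status", while `normalize="columns"` would answer the reverse question.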


#### `df.groupby`
We will talk about `groupby` in more detail in our later weeks, but it can also be helpful in inspecting categorical data

In [None]:
# For example, if you want to inspect the mean wage income of male and female
nls97.groupby(by='gender')['wageincome'].mean()
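
A group mean is easier to interpret next to the group size. As a minimal sketch, `.agg()` can return several statistics at once; the four-row frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing income
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "wageincome": [40000.0, 50000.0, np.nan, 30000.0],
})

# count, mean, and median per group; NaN values are skipped
by_gender = df.groupby("gender")["wageincome"].agg(["count", "mean", "median"])
print(by_gender)
```

Here `count` shows how many non-missing incomes each mean is based on, which matters when some groups have many missing values.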

#### One-hot encoding categorical data

We can also convert a categorical variable to a collection of binary variables using a technique called "one-hot encoding".

One-hot encoding is a data pre-processing technique that converts categorical data into a binary matrix (0s and 1s), representing each category as a unique vector with a single "hot" (1) value and the rest "cold" (0). It enables machine learning algorithms to process nominal data by creating new binary columns for each distinct category.

In [None]:
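nls97['is_female'] = (nls97['gender'] == 'Female').astype(int) # manual one-hot encoding; assumes the label in 'gender' is 'Female'
nls97['is_female'].value_counts()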

In [None]:
# A more generic way to do one-hot encoding is pd.get_dummies(). Note that get_dummies() replaces the original column with the new dummy columns. For example:
nls97_copy = nls97.copy()
nls97_copy = pd.get_dummies(nls97_copy, columns=['maritalstatus'], drop_first=True, dummy_na=False, dtype=int) # drop_first=True drops one category to avoid redundant (perfectly collinear) columns
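
To see what `drop_first` does, compare the output with and without it. This is a minimal sketch on a hypothetical four-row frame; the labels are illustrative and may differ from those in `nls97`.

```python
import pandas as pd

# Hypothetical column with three categories
df = pd.DataFrame(
    {"maritalstatus": ["Married", "Divorced", "Never-married", "Married"]}
)

# One dummy column per category; the original column is replaced
full = pd.get_dummies(df, columns=["maritalstatus"], dtype=int)

# drop_first=True omits the first category (in sorted order)
reduced = pd.get_dummies(df, columns=["maritalstatus"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

In `full`, every row has exactly one 1. In `reduced`, a row of all zeros stands for the dropped category, which is why no information is lost. Dropping a category matters mainly for models such as linear regression, where the full set of dummies is perfectly collinear with the intercept.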


## [Optional] Summary statistics of datetime data

Datetime data is not “categorical” or “numerical”, it’s temporal, and we inspect it differently.

Before we inspect, it is better to convert it to `pandas`'s datetime type.

In [None]:
# First check the dtype of the datetime column. If it is object, convert it to datetime (e.g., df["date"] = pd.to_datetime(df["date"])).
# As you can see, the dtype of "lastdate" is M8[ns], the NumPy type code for a datetime64 value with nanosecond precision.

covid_data['lastdate'].dtype
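
If a date column was read as text, the conversion mentioned in the comment above might look like the following. This is a minimal sketch: the frame and the invalid entry are hypothetical, and `errors="coerce"` is one possible choice that turns unparseable values into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical text column with one value that is not a date
df = pd.DataFrame({"date": ["2024-02-04", "2024-01-28", "not a date"]})

# Convert to datetime; unparseable entries become NaT
df["date_parsed"] = pd.to_datetime(df["date"], errors="coerce")

print(df["date_parsed"].dtype)
print(df["date_parsed"].isna().sum())  # number of entries that failed to parse
```

Counting the `NaT` values right after conversion is a quick way to find out how many entries did not follow the expected format.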


In [None]:
covid_data['lastdate'].describe()

In [None]:
# range of time, useful to understand the duration of time
covid_data['lastdate'].max() - covid_data['lastdate'].min()

In [None]:
# Extract Temporal Components
covid_data['last_year'] = covid_data['lastdate'].dt.year
covid_data['last_month'] = covid_data['lastdate'].dt.month
covid_data['last_day'] = covid_data['lastdate'].dt.day
covid_data['last_day_of_week'] = covid_data['lastdate'].dt.dayofweek
covid_data['last_day_name'] = covid_data['lastdate'].dt.day_name()
covid_data['last_hour'] = covid_data['lastdate'].dt.hour

covid_data

In [None]:
covid_data['last_day_name'].value_counts()

In [None]:
covid_data.groupby(covid_data["lastdate"].dt.to_period("M")).size().plot(kind='bar', figsize=(8, 4))
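
The same steps can be traced without a plot on a handful of dates. This is a minimal sketch; the four dates are taken from rows shown earlier in the notebook, with one repeated for illustration.

```python
import pandas as pd

dates = pd.Series(
    pd.to_datetime(["2024-02-04", "2024-01-28", "2023-12-03", "2024-01-28"]),
    name="lastdate",
)

# Range of time covered, as a Timedelta
span = dates.max() - dates.min()

# Number of rows per calendar month
per_month = dates.groupby(dates.dt.to_period("M")).size()

# Weekday name for each date
day_names = dates.dt.day_name()

print(span)
print(per_month)
print(day_names.value_counts())
```

`to_period("M")` collapses each date to its month, so grouping on it counts rows per month regardless of the day.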