# First look at the data

**Example used: COVID-19 data**

Our World in Data provides COVID-19 public use data at https://ourworldindata.org/covid-cases. The dataset includes total cases and deaths, tests administered, hospital beds, and demographic data such as median age, gross domestic product, and a human development index, which is a composite measure of standard of living, educational levels, and life expectancy. The dataset used in this recipe was downloaded on March 3, 2024.

## Read and inspect shape and data type
### Read data


In [None]:
import pandas as pd
import numpy as np

# Before reading, take a look at the data file itself: it is a CSV file of about 28 KB on disk.

# Read the file.
# Note: if you specify which columns contain dates, pandas can automatically
# infer standard date formats such as ISO 8601 (YYYY-MM-DD).
covid_data = pd.read_csv("data/covidtotals.csv", parse_dates=["lastdate"])
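
To see what `parse_dates` changes, here is a minimal sketch on a made-up two-row CSV held in memory (the values are hypothetical, not taken from `covidtotals.csv`). It reads the same text twice, once without and once with `parse_dates`, and compares the resulting dtypes.

```python
import io

import pandas as pd

# Hypothetical CSV text, used only for illustration
csv_text = "location,lastdate,total_cases\nA,2024-03-03,10\nB,2024-02-25,20\n"

# Without parse_dates the date column stays a text column
without_dates = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column is converted to a datetime dtype
with_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["lastdate"])

print(without_dates["lastdate"].dtype)
print(with_dates["lastdate"].dtype)
```

Only a datetime column supports the `.dt` accessor and date arithmetic, which is why it helps to parse dates at read time.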

### Inspect shape, columns

In [None]:
# Inspect the shape (number of rows and columns) of the DataFrame.
# DataFrame.shape is an attribute that returns a tuple (rows, columns).
# Here it has 231 rows and 17 columns.
covid_data.shape

In [None]:
# Inspect the columns

print(f"This file includes the following columns: {covid_data.columns}")

### Inspect head, tail, and sample

In [None]:
# Inspect some rows of the data, using head(), tail(), or sample()
print(f"The head of the dataset:\n")
covid_data.head(n=3)

In [None]:
print(f"The tail of the dataset:\n")
covid_data.tail(n=3) # You can specify how many rows you want to inspect

In [None]:
print(f"A random sample of the dataset:\n")
covid_data.sample()

In [None]:
print("A random sample of the dataset with 7 rows, with random seed 42")
covid_data.sample(n=7, random_state=42)  # a fixed random_state makes the sample reproducible
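
Why set a random seed at all? A minimal sketch on a hypothetical toy frame: sampling twice with the same `random_state` returns the same rows, while leaving it as `None` generally gives a different draw each time.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"value": range(10)})

# Same seed, so both draws select the same rows in the same order
first = toy.sample(n=3, random_state=42)
second = toy.sample(n=3, random_state=42)

print(first.equals(second))
```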

### Inspect data info and data types


In [None]:
# info() also shows the non-null count of each column, which reveals missing data
covid_data.info()

In [None]:
covid_data.dtypes

### Checking missing data
Use `.isna()`.

In [None]:
# Checking missing data for one column
covid_data['life_expectancy'].isna()

In [None]:
# Because booleans (True and False) behave like integers (1 and 0),
# we can sum them to count how many True values (i.e., missing, or NaN, values) there are.
print(f"The number of missing values in the columns 'life_expectancy' is: {covid_data['life_expectancy'].isna().sum()}")

In [None]:
# Checking missing data for a DataFrame
covid_data.isna()

In [None]:
# Compute the number of missing data in each column in a DataFrame
covid_data.isna().sum()
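
The same boolean trick also gives the share of missing values: the mean of a True/False column is the fraction of True values. A minimal sketch on a hypothetical toy frame (the column names `a` and `b` are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "a" has 2 missing values, column "b" has 1
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
})

missing_counts = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```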

## Summary statistics (numerical)

Use `.describe()` to generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, *excluding* `NaN` values.
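
To make the "excluding `NaN`" part concrete, here is a minimal sketch on a hypothetical four-element Series with one missing value. The count and the mean are computed from the three non-missing values only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three valid values and one NaN
s = pd.Series([1.0, 2.0, np.nan, 3.0])

summary = s.describe()
print(summary["count"])  # 3.0, the NaN is not counted
print(summary["mean"])   # 2.0, the mean of 1, 2 and 3
```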


In [None]:
import pandas as pd
import numpy as np

# read the file
covid_data = pd.read_csv("data/covidtotals.csv", parse_dates=["lastdate"])

In [None]:
covid_data.describe() # generate descriptive statistics for the numerical columns

In [None]:
covid_data.describe(percentiles=[0.05, 0.95])

In [None]:
# We can also call describe() on a single Series
covid_data['total_cases'].describe()

In [None]:
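# You can also use .min(), .max(), .mean(), etc. to compute a single summary statistic for each column.
# numeric_only=True restricts the result to numeric columns; in some pandas versions,
# text columns that contain NaN can otherwise raise a TypeError.
covid_data.min(numeric_only=True)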

## Summary statistics (categorical)

**Example used: nls97 data**

The National Longitudinal Survey (NLS) of Youth was conducted by the United States Bureau of Labor Statistics. This survey started with a cohort of individuals in 1997 who were born between 1980 and 1985, with annual follow-ups through to 2023. For this dataset, 89 variables on grades, employment, income, and attitudes toward government were pulled from the hundreds of data items in the survey. The NLS data can be downloaded from https://www.nlsinfo.org. You must create an investigator account to download the data, but there is no charge.

In [None]:
import pandas as pd

# read the data
nls97 = pd.read_csv("data/nls97.csv")


In [None]:
# inspect the size
nls97.shape

In [None]:
# inspect the information
nls97.info()

### Without converting to Pandas `category` dtype

When dealing with categorical data, we can convert a column to the `category` dtype, but we don't have to. The main advantage of converting is memory efficiency: each distinct label is stored once, and the rows hold small integer codes.

If you don't convert it to `category` dtype, you can directly use `.value_counts( )` and `.describe( )` to inspect the frequency of categories and summary statistics.

In [None]:
nls97['gender'].value_counts(dropna=False) # dropna=False means count NaN as a separate category

In [None]:
# Use .describe( ) to print summary statistics of this categorical variable
nls97['gender'].describe()

In [None]:
# If you want to inspect the frequency in terms of percentage, you can pass normalize=True
nls97['gender'].value_counts(normalize=True)

### Converting to Pandas `category` dtype
- Use `.astype("category")` to convert a column (Series) to a categorical dtype.
- Then you can use `.value_counts()` and `.describe()` to inspect the frequency of categories and the summary statistics of this column.
- As you can see, the results are the same as when you don't convert the column to the `category` dtype.
- The difference is that a `category` dtype takes less memory than an `object` (`str`) dtype in Pandas.
- The `category` dtype also provides many built-in attributes and methods, such as `.cat.codes`, `.cat.rename_categories`, etc.
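
A minimal sketch of the memory difference, using a hypothetical Series of 10,000 repeated labels (not the nls97 data). The exact byte counts depend on your pandas version, so the code prints them rather than assuming specific numbers.

```python
import pandas as pd

# Hypothetical data: two labels repeated 5,000 times each
labels = pd.Series(["Female", "Male"] * 5000)
as_category = labels.astype("category")

# deep=True counts the memory used by the strings themselves
text_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(text_bytes, category_bytes)

# Each row stores a small integer code that points to a category label
first_codes = as_category.cat.codes.head(4).tolist()
print(first_codes)  # [0, 1, 0, 1]
```

The saving is largest when a column has few distinct values repeated over many rows; for a column of mostly unique strings, converting brings little benefit.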

In [None]:
# first convert gender to a categorical dtype
# If you would like to keep the original column, you can create a new column
nls97['gender_category'] = nls97['gender'].astype("category")

In [None]:
nls97['gender_category'].value_counts()

In [None]:
nls97['gender_category'].describe()

### [Optional] Other methods I found helpful for categorical data
#### `.unique()` and `.nunique()`

In [None]:
# use .unique() to check the unique values of a categorical variable
nls97['gender'].unique()

In [None]:
nls97['gender'].nunique()

#### `pd.crosstab`
Cross-Tabulation (Compare Two Categorical Variables)

In [None]:
# If you want to inspect both the gender and marital status
pd.crosstab(nls97["gender"], nls97["maritalstatus"],
            dropna=False, normalize=False)


#### `df.groupby`
We will talk about `groupby` in more detail in later weeks, but it is also helpful for inspecting categorical data.

In [None]:
# For example, to inspect the mean wage income of each gender
nls97.groupby(by='gender')['wageincome'].mean()

#### One-hot encoding categorical data

We can also convert a categorical variable to a collection of binary variables using a technique called "one-hot encoding".

One-hot encoding is a data pre-processing technique that converts categorical data into a binary matrix (0s and 1s), representing each category as a unique vector with a single "hot" (1) value and the rest "cold" (0). It enables machine learning algorithms to process nominal data by creating new binary columns for each distinct category.
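
As a minimal sketch of the idea, the toy frame below (hypothetical values, not the nls97 data) is encoded with `pd.get_dummies()`. Each distinct category becomes its own 0/1 column, and every row has exactly one 1.

```python
import pandas as pd

# Hypothetical toy data with three distinct categories
toy = pd.DataFrame(
    {"maritalstatus": ["Married", "Never-married", "Divorced", "Married"]}
)

# One new binary column per category; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(toy, columns=["maritalstatus"], dtype=int)

print(dummies)
print(dummies.sum(axis=1).tolist())  # [1, 1, 1, 1]
```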

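In [None]:
# Create a binary indicator by hand (this assumes the gender column uses the label 'Female')
nls97['is_female'] = (nls97['gender'] == 'Female').astype(int)
nls97['is_female'].value_counts()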

In [None]:
# A more general way to do one-hot encoding is pd.get_dummies().
# Note that get_dummies() drops the original column and replaces it with the dummy columns.
# drop_first=True drops one category as the baseline, which avoids perfectly
# collinear columns in regression-type models; keep all categories if you need them.
nls97_copy = nls97.copy()
nls97_copy = pd.get_dummies(nls97_copy, columns=['maritalstatus'], drop_first=True, dummy_na=False, dtype=int)


## [Optional] Summary statistics of datetime data

Datetime data is neither “categorical” nor “numerical”. It is temporal, and we inspect it differently.

Before we inspect, it is better to convert it to `pandas`'s datetime type.
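
A minimal sketch of that conversion on a hypothetical toy frame (the column name `date` and the values are made up). Once the column is a datetime type, operations such as subtracting two dates work directly.

```python
import pandas as pd

# Hypothetical toy data: dates stored as plain text
toy = pd.DataFrame({"date": ["2024-03-03", "2024-02-25"]})

# Convert the text column to a pandas datetime column
toy["date"] = pd.to_datetime(toy["date"])
print(toy["date"].dtype)

# Date arithmetic now works: the span between the latest and earliest date
span = toy["date"].max() - toy["date"].min()
print(span)  # 7 days 00:00:00
```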

In [None]:
# First check the type of the datetime column. If it is object, convert it to datetime (e.g., df["date"] = pd.to_datetime(df["date"])).
# As you can see, the dtype of "lastdate" is M8[ns], the type string that NumPy and pandas use for a datetime64 object with nanosecond precision.

covid_data['lastdate'].dtype


In [None]:
covid_data['lastdate'].describe()

In [None]:
# range of time, useful to understand the duration of time
covid_data['lastdate'].max() - covid_data['lastdate'].min()

In [None]:
# Extract Temporal Components
covid_data['last_year'] = covid_data['lastdate'].dt.year
covid_data['last_month'] = covid_data['lastdate'].dt.month
covid_data['last_day'] = covid_data['lastdate'].dt.day
covid_data['last_day_of_week'] = covid_data['lastdate'].dt.dayofweek
covid_data['last_day_name'] = covid_data['lastdate'].dt.day_name()
covid_data['last_hour'] = covid_data['lastdate'].dt.hour

covid_data

In [None]:
covid_data['last_day_name'].value_counts()

In [None]:
covid_data.groupby(covid_data["lastdate"].dt.to_period("M")).size().plot(kind='bar', figsize=(8, 4))