# Checking and Filling Missing Data

In [1]:
import pandas as pd
import numpy as np

exam_scores = pd.read_csv('data/exam_scores.csv')

## 1. Missing Values
- Missing data (or missing values) are data values that were not stored for some row and column, so the corresponding cell is empty. pandas marks such cells with NaN (or None).
- pandas provides the isnull() and isna() functions to detect missing values. They are aliases and return the same result.
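
As a quick check of the two points above, here is a minimal sketch on a small made-up DataFrame (the `toy` data and its column names are hypothetical, not taken from `exam_scores.csv`). It shows that `isna()` and `isnull()` give the same Boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing name and one missing score
toy = pd.DataFrame({
    "name": ["Ann", "Ben", None],
    "score": [90.0, np.nan, 75.0],
})

# Both calls return a DataFrame of the same shape, True where a value is missing
mask_isna = toy.isna()
mask_isnull = toy.isnull()

# equals() compares the two masks element by element
same_result = mask_isna.equals(mask_isnull)

print(mask_isna)
print("isna() and isnull() agree:", same_result)
```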

In [2]:
# df.isna() and df.isnull() return a DataFrame of Boolean values: True where a value is missing, False otherwise.
exam_scores.isnull()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...
995,False,False,False,False,False,False,False,False
996,False,False,False,False,False,False,False,False
997,False,False,False,False,False,False,False,False
998,False,False,False,False,False,False,False,False


In [3]:
# We can get a column-wise count of missing values by chaining the aggregation function sum():
exam_scores.isnull().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64
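
Since `exam_scores` turns out to have no missing values, the count above is all zeros. The sketch below uses a made-up DataFrame (hypothetical `toy` data and column names) to show what the same `isna().sum()` pattern reports when values are missing, and how `isna().mean()` gives the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two gaps in "math", one in "reading", none in "writing"
toy = pd.DataFrame({
    "math": [70.0, np.nan, 88.0, np.nan],
    "reading": [65.0, 80.0, np.nan, 91.0],
    "writing": [72.0, 85.0, 79.0, 90.0],
})

# True counts as 1 and False as 0, so sum() counts missing values per column
missing_count = toy.isna().sum()

# The mean of a Boolean column is the fraction of True values
missing_share = toy.isna().mean()

summary = pd.DataFrame({"count": missing_count, "share": missing_share})
print(summary)
```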

In [None]:
# pandas also provides the fillna() method to fill missing values. It accepts either a single fill value or a dict that maps each column to its own fill value.

# For example, to fill missing values in the 'gender' column with 'Unknown':
# exam_scores.fillna({'gender': 'Unknown'}, inplace=True)
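
To make the `fillna()` options concrete, here is a minimal sketch on made-up data (the `toy` DataFrame is hypothetical). Forward fill with `ffill()` is not used in this notebook; it is included only as a point of comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "gender": ["female", None, "male"],
    "math": [70.0, np.nan, 88.0],
})

# Fill only the 'gender' column; 'math' keeps its missing value
filled_gender = toy.fillna({"gender": "Unknown"})

# Fill each column with its own value in one call
filled_both = toy.fillna({"gender": "Unknown", "math": 0.0})

# Forward fill: copy the previous row's value into each gap
filled_ffill = toy.ffill()

# Without inplace=True, fillna() returns a new DataFrame and leaves toy as it was
print(filled_gender)
print(filled_both)
print(filled_ffill)
```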

In [None]:
# The same approach works for numeric columns. If the 'math score' column had missing values, we could fill them with the mean of the scores that are present.
# exam_scores.fillna({'math score': exam_scores['math score'].mean()}, inplace=True)

## 2. Exercise

In [4]:
titanic_data = pd.read_csv('data/titanic_data.csv')

In [6]:
# Count missing values per column (isnull() and isna() are interchangeable)
titanic_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [7]:
# mean() of 'Age' column (missing values are skipped by default)
age_mean_before = titanic_data['Age'].mean()
age_mean_before

np.float64(29.69911764705882)

In [10]:
# fillna() to fill missing values in 'Age' column with mean
titanic_data.fillna({'Age':age_mean_before}, inplace=True)
titanic_data.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [12]:
# mean() of 'Age' column after filling missing values; it is unchanged because every filled value equals the original mean
age_mean_after = titanic_data['Age'].mean()
age_mean_after

np.float64(29.69911764705882)
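
The two means match because every gap was filled with the mean itself. A side effect worth knowing is that the spread of the column shrinks. The sketch below illustrates this on a made-up Series (the `ages` values are hypothetical, not from the Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages with two missing values
ages = pd.Series([20.0, np.nan, 30.0, np.nan, 40.0])

# Statistics before filling; both skip NaN by default
mean_before = ages.mean()  # 30.0
std_before = ages.std()    # sample standard deviation of [20, 30, 40]

# Fill the gaps with the mean, as done for the 'Age' column
filled = ages.fillna(mean_before)

mean_after = filled.mean()  # still 30.0
std_after = filled.std()    # smaller, since the added values sit exactly at the mean

print(mean_before, mean_after)
print(std_before, std_after)
```

So mean imputation keeps the average intact but makes the data look less variable than it really is.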

In [20]:
titanic_data['Embarked'].value_counts()

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

In [21]:
# fillna() to fill missing values in 'Embarked' column with 'S', the most frequent value
titanic_data.fillna({'Embarked': 'S'}, inplace=True)
titanic_data['Embarked'].value_counts()

Embarked
S    646
C    168
Q     77
Name: count, dtype: int64
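
Above, 'S' was read off the `value_counts()` output by hand. As a possible alternative, the most frequent value can be looked up with `mode()`. This is a minimal sketch on made-up data (the `ports` DataFrame is hypothetical), not the method used in the notebook.

```python
import pandas as pd

# Hypothetical toy data with two missing embarkation ports
ports = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None]})

# mode() ignores missing values by default and returns a Series,
# because several values can tie; take the first one
most_common = ports["Embarked"].mode().iloc[0]

# Fill the gaps with the most frequent category
ports = ports.fillna({"Embarked": most_common})

print(most_common)
print(ports["Embarked"].value_counts())
```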