<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Dealing-with-missing-data" data-toc-modified-id="Dealing-with-missing-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Dealing with missing data</a></span><ul class="toc-item"><li><span><a href="#B.-Imputing-missing-values" data-toc-modified-id="B.-Imputing-missing-values-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>B. Imputing missing values</a></span></li><li><span><a href="#Step-1:-impute-missing-values-via-the-column-mean" data-toc-modified-id="Step-1:-impute-missing-values-via-the-column-mean-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Step 1: impute missing values via the column mean</a></span></li></ul></li></ul></div>

Title: 2022-11-30 Machine Learning 2

To:&nbsp;&nbsp;&nbsp;&nbsp; Magnimind

From: Matt Curcio, matt.curcio.ri@gmail.com

Date: 2022-10-30

Re:&nbsp;&nbsp;&nbsp; Dealing with missing data

# Dealing with missing data

**Identifying missing values in tabular data**

In [32]:
a = np.arange(5)
a + 20

array([20, 21, 22, 23, 24])

In [31]:
import pandas as pd
import numpy as np
from io import StringIO
import sys

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

# If you are using Python 2.7, you need
# to convert the string to unicode:

if (sys.version_info < (3, 0)):
    csv_data = unicode(csv_data)

**Step 1: Read the csv file as a pandas dataframe**

In [20]:
df = pd.read_csv(StringIO(csv_data))
df

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN


**Step 2: Check the number of missing values for the columns**

In [21]:
# Find number of missing data points per column
missing_values_count = df.isnull().sum()

# Missing points in columns
missing_values_count[0:5]

A    0
B    0
C    1
D    1
dtype: int64
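
Raw counts are harder to compare across columns once tables grow, so it can help to look at the fraction of missing entries as well. The following is a minimal sketch that rebuilds the same example frame; the names `missing_fraction` and `total_missing` are illustrative and not part of the original notebook.

```python
import pandas as pd
from io import StringIO

csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""
df = pd.read_csv(StringIO(csv_data))

# Fraction of missing entries per column (mean of the boolean mask)
missing_fraction = df.isnull().mean()

# Total number of missing cells in the whole table
total_missing = int(df.isnull().sum().sum())

print(missing_fraction)
print("total missing:", total_missing)
```

Here columns `C` and `D` each have one of three entries missing, so their fraction is one third.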

**Step 3: Show which entries are missing with `isnull()`**

In [22]:
df.isnull()

       A      B      C      D
0  False  False  False  False
1  False  False   True  False
2  False  False  False   True


**Step 4: Remove rows from df that contain missing values**

NOTE 3: See documentation on [Pandas .dropna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)

In [23]:
# Remove rows containing missing values

df.dropna()

     A    B    C    D
0  1.0  2.0  3.0  4.0


**Step 5: Remove columns from df that contain missing values**

In [24]:
# Remove columns with at least one missing value

df_na_by_column = df.dropna(axis='columns')  # axis='columns' is equivalent to axis=1
df_na_by_column

      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0


**Step 6: Only drop rows where all columns are NaN**

In [25]:
df.dropna(how='all')  # `all` : If all values are NA, drop that row or column.

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN
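
Because no row in the example is entirely empty, `how='all'` leaves the frame unchanged. The sketch below uses a small hypothetical frame (`df_demo`, not part of the original data) whose last row is all NaN, to contrast `how='all'` with the default `how='any'`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is partly missing, row 2 is entirely NaN
df_demo = pd.DataFrame(
    {"A": [1.0, 5.0, np.nan], "B": [2.0, np.nan, np.nan]}
)

kept_all = df_demo.dropna(how="all")  # drops only the fully empty row 2
kept_any = df_demo.dropna(how="any")  # default: drops rows 1 and 2

print(kept_all)
print(kept_any)
```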


**Step 7: Drop rows that have fewer than 4 non-NaN values**

In [26]:
df.dropna(thresh=4) # Keep only the rows with at least 4 non-NaN values.

     A    B    C    D
0  1.0  2.0  3.0  4.0
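
The `thresh` argument counts non-NaN values per row, not NaN values, which is easy to get backwards. As a rough illustration on the same example data, the sketch below computes that count explicitly and compares two thresholds; the variable names are illustrative only.

```python
import pandas as pd
from io import StringIO

csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""
df = pd.read_csv(StringIO(csv_data))

# Number of non-NaN values in each row: 4, 3, 3
non_null_per_row = df.notnull().sum(axis=1)

at_least_3 = df.dropna(thresh=3)  # every row has >= 3 values, so all are kept
at_least_4 = df.dropna(thresh=4)  # only row 0 has all 4 values

print(non_null_per_row)
print(at_least_3)
print(at_least_4)
```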


**Step 8: Only drop rows where NaN appear in specific columns (here: 'C')**

In [27]:
df.dropna(subset=['C']) # Define in which columns to look for missing values.

      A     B     C    D
0   1.0   2.0   3.0  4.0
2  10.0  11.0  12.0  NaN


## B. Imputing missing values

In [28]:
# again: our original array
df.values

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6., nan,  8.],
       [10., 11., 12., nan]])

## Step 1: impute missing values via the column mean

In [29]:
import numpy as np
from sklearn.impute import SimpleImputer

# missing_values sets the placeholder to treat as missing (here np.nan)
# strategy='mean' replaces it with the column mean ('median' is also available)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df)  # fit_transform(df) would fit and transform in one call
df_imp = imputer.transform(df)

type(df_imp)  # the result is a NumPy array and must be converted back to a pandas DataFrame

numpy.ndarray

In [30]:
df_imp

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
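
Since the imputer returns a plain NumPy array, the column labels are lost. One way to get a DataFrame back is sketched below, assuming the same example data; `df_imputed` and `df_filled` are illustrative names, and the `fillna` line is a pandas-only alternative added for comparison rather than part of the original notebook.

```python
import numpy as np
import pandas as pd
from io import StringIO
from sklearn.impute import SimpleImputer

csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""
df = pd.read_csv(StringIO(csv_data))

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(df)  # fit and transform in one call

# Wrap the array in a DataFrame, reusing the original row and column labels
df_imputed = pd.DataFrame(imputed, columns=df.columns, index=df.index)

# Pandas-only alternative for mean imputation
df_filled = df.fillna(df.mean())

print(df_imputed)
```

For simple mean imputation both routes give the same values. The scikit-learn imputer has the advantage that the fitted means can be reused on new data through `transform`.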