1. Import the necessary libraries: We start by importing the libraries used in this exercise. pandas provides high-performance, easy-to-use data structures and data analysis tools. NumPy provides array support and the np.nan marker used here for missing values. SimpleImputer is a class from the sklearn.impute module that provides basic strategies for imputing missing values.

    # Import the necessary libraries
    import pandas as pd
    import numpy as np
    from sklearn.impute import SimpleImputer

2. Load the dataset: We load the dataset with the pandas read_csv function, which reads a comma-separated values (CSV) file into a DataFrame.

    # Load the dataset
    df = pd.read_csv('pima-indians-diabetes.csv')

3. Identify missing data: Calling isnull on the DataFrame marks every entry that is NaN (or None), and chaining sum counts those marks per column. This gives the number of missing entries in each column. Values that encode missingness in some other way, such as 0, are not counted.

    # Identify missing data (assumes that missing data is represented as NaN)
    missing_data = df.isnull().sum()

4. Print the number of missing entries in each column: We print the per-column counts. A count of 0 means the column contains no NaN entries.

    # Print the number of missing entries in each column
    print("Missing data: \n", missing_data)

5. Configure an instance of the SimpleImputer class: We create a SimpleImputer instance and configure it to replace missing values (represented as np.nan) with the mean of their column.

    # Configure an instance of the SimpleImputer class
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

6. Fit the imputer on the DataFrame: We fit the imputer on the DataFrame using the fit method. This method calculates the imputation values (in this case, the mean of each column) that will be used to replace the missing data. Because the whole DataFrame is passed in, a mean is computed for every column, including Outcome.

    # Fit the imputer on the DataFrame
    imputer.fit(df)

7. Apply the transform to the DataFrame: We call the transform method, which replaces missing data with the imputation values calculated by fit. By default it returns a NumPy array rather than a DataFrame, so df_imputed carries no column labels.

    # Apply the transform to the DataFrame
    df_imputed = imputer.transform(df)

8. Print the updated matrix of features: Finally, we print the imputed array to verify that the missing data has been replaced.

    # Print the updated matrix of features
    print("Updated matrix of features: \n", df_imputed)
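
As a minimal sketch of steps 5 to 7, the example below runs the same fit and transform calls on a small made-up DataFrame, so it does not need the CSV file. The toy values and the names toy, learned_means and toy_imputed are assumptions for illustration only. It also shows that the fitted values can be read from the imputer's statistics_ attribute, and that the array returned by transform can be wrapped back into a DataFrame to keep the column labels.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the CSV; the values are illustrative only.
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "BMI": [33.6, 26.6, np.nan, 28.1],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(toy)

# After fit, statistics_ holds one imputation value per column (here, the column means).
learned_means = imputer.statistics_

# transform returns a NumPy array by default, so rebuild a DataFrame to keep the labels.
toy_imputed = pd.DataFrame(
    imputer.transform(toy), columns=toy.columns, index=toy.index
)

print(learned_means)
print(toy_imputed)
```

Here the missing Glucose entry is filled with 140.0, the mean of the three observed Glucose values.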




In [1]:
# Importing the necessary libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Load the dataset
dataset = pd.read_csv('pima-indians-diabetes.csv')

# Identify missing data (assumes that missing data is represented as NaN)
missing_data = dataset.isnull().sum()

# Print the number of missing entries in each column
print(missing_data)

# Configure an instance of the SimpleImputer class
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer on the DataFrame
imputer.fit(dataset)

# Apply the transform to the DataFrame
df_imputed = imputer.transform(dataset)

# Print the updated matrix of features
print("Updated matrix of features: \n", df_imputed)

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
Updated matrix of features: 
 [[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]
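
Every count in the output above is 0, so the imputer had nothing to replace and the printed array holds the same values as the loaded data. One common reading of this dataset is that zeros in measurement columns such as Glucose, BloodPressure, SkinThickness, Insulin and BMI stand for values that were not recorded. Under that assumption (the file itself does not state it), a sketch of the extra step is to convert those zeros to NaN before imputing. The rows below are made up for illustration, and zero_means_missing is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical rows shaped like the dataset; the values are made up for illustration.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 0, 3],
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 0.0, 23.3, 28.1],
})

# Assumption: a zero in these columns means "not recorded".
# Pregnancies is left alone because zero is a valid count there.
zero_means_missing = ["Glucose", "Insulin", "BMI"]
sample[zero_means_missing] = sample[zero_means_missing].replace(0, np.nan)

# The zeros now show up as missing entries.
missing_counts = sample.isnull().sum()
print(missing_counts)

# Impute the NaN entries with the mean of the observed values in each column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
sample_imputed = pd.DataFrame(
    imputer.fit_transform(sample), columns=sample.columns
)
print(sample_imputed)
```

Whether the mean is a sensible fill value depends on the column; SimpleImputer also accepts strategy='median', which is less sensitive to outliers.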
