# Data Cleaning

Author: Yael Gonzalez

We import the required packages.

In [17]:
import numpy as np
from given_data import year_2013, year_2014, year_2015, year_2016, year_2017, year_2018, year_2019, year_2020, year_2021, year_2022

We collect the 10 yearly arrays in a `list`. Each array holds one year of enrollment data for 20 schools across grades 10 through 12. We then build a *NumPy* `array` from the list and `reshape` it into a 3D array of shape (10, 20, 3), where the axes are year, school, and grade.

In [43]:
enrollment_data = [year_2013, year_2014, year_2015, year_2016, year_2017, year_2018, year_2019, year_2020, year_2021, year_2022]
enrollment_data_array = np.array(enrollment_data).reshape((10, 20, 3))
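
To make the axis layout concrete, here is a minimal sketch that uses small synthetic arrays in place of the `given_data` module. The names `fake_year_a` and `fake_year_b` and the 4 x 3 size are assumptions for illustration only. It suggests that stacking equal-shaped 2D arrays already yields a (year, school, grade) array, and that `reshape` matters mainly when the yearly arrays are flat.

```python
import numpy as np

# Hypothetical stand-ins for two yearly arrays: 4 schools x 3 grades each.
fake_year_a = np.arange(12, dtype=float).reshape(4, 3)
fake_year_b = fake_year_a + 100

# Stacking equal-shaped 2D arrays gives a 3D array: (year, school, grade).
stacked = np.array([fake_year_a, fake_year_b])
print(stacked.shape)  # (2, 4, 3)

# If the yearly arrays were flat (12 values each), reshape restores the layout.
flat = np.array([fake_year_a.ravel(), fake_year_b.ravel()])
reshaped = flat.reshape((2, 4, 3))
print(np.array_equal(stacked, reshaped))  # True

# Index order: year 1, school 2, grade column 0.
print(stacked[1, 2, 0])  # 106.0
```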

We build a Boolean mask of the array with `isnan`, then call `any` on the mask to check whether the array contains any *NaN* values. The result is `True`, so some data is missing.

In [42]:
mask = np.isnan(enrollment_data_array)
np.any(mask)

True

We locate the NaN values by passing the mask to `argwhere`, and we print the resulting indices. Each row is one (year, school, grade) index.

In [19]:
indices_of_nans = np.argwhere(mask)

print(indices_of_nans)

[[ 8 11  0]
 [ 8 11  1]
 [ 8 11  2]
 [ 9 11  0]
 [ 9 11  1]
 [ 9 11  2]]
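
The same detection steps can be tried on a toy array. The sketch below uses made-up data (3 years x 2 schools x 2 grades) and also shows one possible way to collapse the grade axis so that each affected (year, school) pair is listed once. The names `toy`, `toy_mask`, and `affected` are hypothetical.

```python
import numpy as np

# Hypothetical toy data: 3 years x 2 schools x 2 grades,
# with one school missing in the last year.
toy = np.array([
    [[10., 11.], [20., 21.]],
    [[12., 13.], [22., 23.]],
    [[14., 15.], [np.nan, np.nan]],
])

toy_mask = np.isnan(toy)        # Boolean array with the same shape as toy
print(np.any(toy_mask))         # True
print(np.argwhere(toy_mask))    # one (year, school, grade) row per NaN

# Collapse the grade axis to list each affected (year, school) pair once.
affected = np.argwhere(toy_mask.any(axis=2))
print(affected)                 # [[2 1]]
```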


The missing values all belong to the 12th school (index 11) in 2021 and 2022 (indices 8 and 9), across all three grades. We replace them with an estimate: that school's mean enrollment for each grade from 2013 to 2020. To compute it, we slice `enrollment_data_array` to isolate the relevant records, take the mean, and round down, since enrollment counts must be whole numbers.

In [45]:
grade_10_data = enrollment_data_array[0:8, 11, 0]
grade_11_data = enrollment_data_array[0:8, 11, 1]
grade_12_data = enrollment_data_array[0:8, 11, 2]

grade_10_replacement = np.floor(np.mean(grade_10_data)).astype(int)
grade_11_replacement = np.floor(np.mean(grade_11_data)).astype(int)
grade_12_replacement = np.floor(np.mean(grade_12_data)).astype(int)

We print each calculated value:

In [48]:
print(f"Estimated value for grade 10 enrollments in 2021 and 2022: {grade_10_replacement}")
print(f"Estimated value for grade 11 enrollments in 2021 and 2022: {grade_11_replacement}")
print(f"Estimated value for grade 12 enrollments in 2021 and 2022: {grade_12_replacement}")

Estimated value for grade 10 enrollments in 2021 and 2022: 41
Estimated value for grade 11 enrollments in 2021 and 2022: 45
Estimated value for grade 12 enrollments in 2021 and 2022: 53
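
As an alternative to one slice per grade, the three means could be computed in a single call by averaging over the year axis. This is only a sketch on made-up numbers, not the method used above. The `history` array and its values are hypothetical.

```python
import numpy as np

# Hypothetical history for one school: 4 years x 3 grades.
history = np.array([
    [30., 40., 50.],
    [33., 41., 52.],
    [35., 45., 55.],
    [37., 44., 54.],
])

# The mean over the year axis gives one value per grade;
# floor it, then cast to int.
replacements = np.floor(history.mean(axis=0)).astype(int)
print(replacements)  # [33 42 52]
```

With the real data, the equivalent slice would be `enrollment_data_array[0:8, 11, :]`, which selects all three grade columns at once.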


Finally, rather than modifying the original `enrollment_data_array`, we create a copy named `enrollment_clean_data_array` and make the replacements there.

In [25]:
enrollment_clean_data_array = np.copy(enrollment_data_array)

enrollment_clean_data_array[8:10, 11, 0] = grade_10_replacement
enrollment_clean_data_array[8:10, 11, 1] = grade_11_replacement
enrollment_clean_data_array[8:10, 11, 2] = grade_12_replacement
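
The following sketch illustrates, on a toy array, why copying first is useful: the original keeps its NaN values while the copy is filled. It also shows that a length-3 array of replacements can be assigned to both years at once through broadcasting. The names `raw`, `clean`, and `fill_values`, and the values themselves, are assumptions for illustration.

```python
import numpy as np

# Hypothetical toy array: 4 years x 2 schools x 3 grades;
# school 1 is missing in the last two years.
raw = np.arange(24, dtype=float).reshape(4, 2, 3)
raw[2:4, 1, :] = np.nan

clean = np.copy(raw)                # independent copy; raw keeps its NaNs
fill_values = np.array([7, 8, 9])   # hypothetical per-grade replacements

# The (3,) array broadcasts across both selected years (target shape (2, 3)).
clean[2:4, 1, :] = fill_values

print(np.isnan(raw).sum())    # 6
print(np.isnan(clean).sum())  # 0
print(clean[3, 1])            # [7. 8. 9.]
```

Because the array has a float dtype, the integer replacements are stored as floats, which matches the `591.`-style values in the notebook's output.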

To verify the changes, we print the clean array's shape and number of dimensions, and we check whether any NaN values remain.

In [39]:
print(f"Shape of clean array: {enrollment_clean_data_array.shape}")
print(f"Dimension of clean array: {enrollment_clean_data_array.ndim}")
print(f"Is any value NaN in clean array?: {np.any(np.isnan(enrollment_clean_data_array))}")

Shape of clean array: (10, 20, 3)
Dimension of clean array: 3
Is any value NaN in clean array?: False
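
A stricter check could also confirm that cleaning touched only the entries that were missing. The helper below, `check_cleaning`, is a hypothetical sketch and not part of the original notebook. It is shown here on a tiny made-up array.

```python
import numpy as np

def check_cleaning(raw, clean):
    """Hypothetical helper: verify that cleaning only touched NaN entries."""
    was_missing = np.isnan(raw)
    assert clean.shape == raw.shape, "shape changed"
    assert not np.isnan(clean).any(), "NaN values remain"
    # Every entry that was present in raw must be unchanged in clean.
    assert np.array_equal(clean[~was_missing], raw[~was_missing]), "valid data altered"
    return int(was_missing.sum())  # number of entries that were filled

raw = np.array([[1., np.nan], [3., 4.]])
clean = raw.copy()
clean[0, 1] = 2.0
print(check_cleaning(raw, clean))  # 1
```

Applied to the notebook's arrays, the call would be `check_cleaning(enrollment_data_array, enrollment_clean_data_array)`, which should report six filled entries given the indices printed earlier.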


We display the clean array for reference:

In [40]:
enrollment_clean_data_array

array([[[591., 572., 558.],
        [472., 346.,   0.],
        [ 45.,  57.,  52.],
        [160., 176., 189.],
        [426., 483., 567.],
        [620., 584., 585.],
        [658., 631., 632.],
        [289., 280., 311.],
        [496., 465., 528.],
        [523., 467., 517.],
        [487., 413., 457.],
        [ 29.,  29.,  45.],
        [399., 361., 380.],
        [210., 225., 359.],
        [657., 566., 501.],
        [163., 146., 228.],
        [587., 611., 648.],
        [514., 577., 522.],
        [435., 364., 509.],
        [504., 530., 512.]],

       [[599., 592., 598.],
        [444., 452., 341.],
        [ 40.,  49.,  55.],
        [151., 137., 173.],
        [430., 404., 572.],
        [662., 611., 602.],
        [618., 639., 605.],
        [323., 370., 395.],
        [422., 437., 524.],
        [522., 549., 529.],
        [537., 502., 416.],
        [ 36.,  44.,  44.],
        [362., 371., 354.],
        [219., 200., 222.],
        [569., 619., 562.],
        [131., 164