# 8 Data Manipulation with NumPy
- Examine how to clean and preprocess data using NumPy.
- How to discover missing values (and fill them in).
- Ways to remove irrelevant data.
- sort(), shuffle(), reshape(), stack(), strip()
## 8_11 Stacking NDarrays
- stack(), vstack() - vert, hstack() - horizon, dstack() - depth
- We can just stack arrays of matching shapes to create a larger array: a "stack" (of arrays)
- stack() needs arrays of identical shape. For the other functions, the arrays MUST match in every axis except the one they are joined along: vstack -> same num of cols, hstack -> same num of rows (for 2-D arrays)
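
A minimal sketch of these shape rules, using small made-up arrays (`a`, `b`, `c`, `d` are illustrative names, not part of the lending datasets):

```python
import numpy as np

a = np.zeros((3, 4))
b = np.ones((3, 4))
c = np.ones((2, 4))   # different number of rows, same number of columns
d = np.ones((3, 5))   # same number of rows, different number of columns

print(np.stack((a, b)).shape)    # (2, 3, 4): new leading axis, shapes must be identical
print(np.vstack((a, c)).shape)   # (5, 4): only the number of columns must match
print(np.hstack((a, d)).shape)   # (3, 9): only the number of rows must match
print(np.dstack((a, b)).shape)   # (3, 4, 2): joined along a third axis

# np.stack refuses inputs whose shapes differ in any axis
try:
    np.stack((a, c))
    stack_failed = False
except ValueError as err:
    stack_failed = True
    print('np.stack failed:', err)
```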

#### numpy.stack(arrays, axis=0, out=None, *, dtype=None, casting='same_kind')
- Join a sequence of arrays along a new axis.
- The axis parameter specifies the index of the new axis in the dimensions of the result. For example, if axis=0 it will be the first dimension and if axis=-1 it will be the last dimension.
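
To make the role of `axis` concrete, here is a small sketch with three made-up arrays of shape (2, 4); the variable names are only for illustration:

```python
import numpy as np

x = np.arange(8).reshape(2, 4)
arrays = (x, x + 10, x + 20)      # three arrays of shape (2, 4)

first = np.stack(arrays, axis=0)   # new axis is the first dimension
middle = np.stack(arrays, axis=1)  # new axis sits between rows and columns
last = np.stack(arrays, axis=-1)   # new axis is the last dimension

print(first.shape)    # (3, 2, 4)
print(middle.shape)   # (2, 3, 4)
print(last.shape)     # (2, 4, 3)
```

In every case the result has one more dimension than the inputs, and the length of the new axis equals the number of stacked arrays.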

#### numpy.vstack(tup, *, dtype=None, casting='same_kind')
- Stack arrays in sequence vertically (row wise).
- This is equivalent to concatenation along the first axis after 1-D arrays of shape (N,) have been reshaped to (1,N). Rebuilds arrays divided by vsplit.
- This function makes most sense for arrays with up to 3 dimensions. For instance, for pixel-data with a height (first axis), width (second axis), and r/g/b channels (third axis). The functions concatenate, stack and block provide more general stacking and concatenation operations.
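
The equivalence with concatenation, and the round trip through vsplit, can be checked on two small made-up 1-D arrays (a sketch, not part of the original notebook):

```python
import numpy as np

r1 = np.array([1, 2, 3])
r2 = np.array([4, 5, 6])

v = np.vstack((r1, r2))                       # shape (2, 3)

# Same result as reshaping each (N,) array to (1, N) and concatenating on axis 0
same = np.concatenate((r1.reshape(1, 3), r2.reshape(1, 3)), axis=0)
print(np.array_equal(v, same))                # True

# vsplit divides the rows again; vstack rebuilds the original
top, bottom = np.vsplit(v, 2)                 # each piece has shape (1, 3)
rebuilt = np.vstack((top, bottom))
print(np.array_equal(v, rebuilt))             # True
```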

#### numpy.hstack(tup, *, dtype=None, casting='same_kind')
- Stack arrays in sequence horizontally (column wise).
- This is equivalent to concatenation along the second axis, except for 1-D arrays where it concatenates along the first axis. Rebuilds arrays divided by hsplit.
- This function makes most sense for arrays with up to 3 dimensions. For instance, for pixel-data with a height (first axis), width (second axis), and r/g/b channels (third axis). The functions concatenate, stack and block provide more general stacking and concatenation operations.
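
A short sketch of both hstack cases (1-D and 2-D) on made-up arrays; the names are illustrative only:

```python
import numpy as np

# 1-D inputs are simply joined end to end (concatenation along the first axis)
r1 = np.array([1, 2, 3])
r2 = np.array([4, 5, 6])
h1 = np.hstack((r1, r2))
print(h1)                                     # [1 2 3 4 5 6]

# 2-D inputs are joined along the second axis, so the row counts must match
m1 = np.array([[1, 2], [3, 4]])
m2 = np.array([[5], [6]])
h2 = np.hstack((m1, m2))
print(h2.shape)                               # (2, 3)
print(np.array_equal(h2, np.concatenate((m1, m2), axis=1)))   # True

# hsplit at column index 2 recovers the two original blocks
left, right = np.hsplit(h2, [2])
```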

#### numpy.dstack(tup)
- Stack arrays in sequence depth wise (along third axis).
- This is equivalent to concatenation along the third axis after 2-D arrays of shape (M,N) have been reshaped to (M,N,1) and 1-D arrays of shape (N,) have been reshaped to (1,N,1). Rebuilds arrays divided by dsplit.
- This function makes most sense for arrays with up to 3 dimensions. For instance, for pixel-data with a height (first axis), width (second axis), and r/g/b channels (third axis). The functions concatenate, stack and block provide more general stacking and concatenation operations.
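
The reshaping that dstack applies to 1-D and 2-D inputs can be seen in this small sketch (made-up arrays, illustrative names):

```python
import numpy as np

# 1-D inputs of shape (N,) are treated as (1, N, 1) before joining
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
print(np.dstack((v1, v2)).shape)              # (1, 3, 2)

# 2-D inputs of shape (M, N) are treated as (M, N, 1)
m1 = np.zeros((2, 3))
m2 = np.ones((2, 3))
d = np.dstack((m1, m2))
print(d.shape)                                # (2, 3, 2)

# Equivalent to adding a trailing axis by hand and concatenating on axis 2
same = np.concatenate((m1[:, :, np.newaxis], m2[:, :, np.newaxis]), axis=2)
print(np.array_equal(d, same))                # True

# dsplit divides along the third axis again
parts = np.dsplit(d, 2)                       # two arrays of shape (2, 3, 1)
```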

In [22]:
import numpy as np
np.__version__
np.set_printoptions(suppress=True)  # To avoid scientific notation

In [23]:
# Function show_attr

def show_attr(arrnm: str) -> str:
    strout = f' {arrnm}: '

    for attr in ('shape', 'ndim', 'size', 'dtype'):     #, 'itemsize'):
        arrnm_attr = arrnm + '.' + attr
        strout += f'| {attr}: {eval(arrnm_attr)} '

    return strout

#### Two Datasets (arrays) we are going to use 
1. lend_num: Lending-Company-Numeric-Data.csv, without NaNs; it doesn't need preprocessing.
2. lend_pre: Lending-Company-Numeric-Data-NAN.csv, with NaNs that need preprocessing: each missing value is replaced by the mean of its column. I know five ways to do this:
    1. The way we learned in 8_2 Substituting Missing Values in NDarrays, which requires reading the file twice. The steps are:
        1. Read the .csv with np.genfromtxt using only the delimiter= option (the data will contain NaNs).
        2. Calculate np.nanmax and store it in orig_max (np.nanmin works similarly, subtracting 1 in the next step).
        3. Store orig_max + 1 in orig_max_plus1, a value that does not appear in the data.
        4. Calculate np.nanmean(data, axis=0) for each column and store the result in the 1-D array original_means.
        5. Reread the same .csv with np.genfromtxt, using delimiter= and filling_values=orig_max_plus1.
        6. Using np.where, replace in each column the value orig_max_plus1 with the original mean of that column.
    2. The way we learned in 8_7 Argument Where in NumPy, which doesn't need to read the .csv twice:
        1. Read the .csv with np.genfromtxt using only the delimiter= option (the data will contain NaNs).
        2. Calculate np.nanmean(data, axis=0) for each column and store the result in the 1-D array original_means.
        3. np.argwhere(np.isnan(data)) gives us the (row, col) coordinates of every NaN.
        4. In a for loop, assign the corresponding column mean to each NaN position.
    3. Same as 2. but using np.nonzero, which saves us from using the for loop and makes everything more direct, since the output tuple of np.nonzero can index the ndarray directly.
    4. np.where(np.isnan( ), mean, )
    5. np.where(np.nan_to_num(), mean, )
- Here I will use method 2, np.argwhere().

> At the end of this notebook I will display these methods under the title __'5 methods to replace missing values.'__
> FUTURE: make a timing comparison of the 5 methods.
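
Methods 3 and 4 from the list above can be sketched without any Python loop. This is a minimal illustration on a small made-up array (`data` is not the lending dataset), and the fully vectorized `np.where` call relies on broadcasting the 1-D array of column means across the rows, which is an assumption about how one might write method 4 rather than the course's own code:

```python
import numpy as np

# Small made-up array with one NaN in each column
data = np.array([[1.0, np.nan, 3.0],
                 [4.0, 5.0, np.nan],
                 [np.nan, 8.0, 9.0],
                 [10.0, 11.0, 12.0]])

col_means = np.nanmean(data, axis=0)          # mean of each column, ignoring NaNs

# Method 3: np.nonzero returns a (rows, cols) tuple that indexes the array directly
filled_nonzero = data.copy()
rows, cols = np.nonzero(np.isnan(filled_nonzero))
filled_nonzero[rows, cols] = col_means[cols]  # each NaN gets the mean of its own column

# Method 4: np.where over the whole array; col_means broadcasts across the rows
filled_where = np.where(np.isnan(data), col_means, data)

print(col_means)
print(filled_where)
print(np.array_equal(filled_nonzero, filled_where))
```

Because every NaN is replaced by the mean of its own column, the column means of the filled array stay the same as the original NaN-ignoring means.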

In [24]:
# 1st dataset - lend_num, original without NANs
lend_num = np.loadtxt('Lending-Company-Numeric-Data.csv', delimiter=',')
display(lend_num)
show_attr('lend_num')

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [ 2000.,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

' lend_num: | shape: (1043, 6) | ndim: 2 | size: 6258 | dtype: float64 '

In [25]:
# 2nd dataset - lend_pre, original with NaNs
lend_pre = np.genfromtxt('Lending-Company-Numeric-Data-NAN.csv', delimiter=';')
display(lend_pre)
show_attr('lend_pre')
print('Number of NANs:', np.isnan(lend_pre).sum())

# Process to replace NANs w/mean of each column
orig_means = np.nanmean(lend_pre, axis=0).round(2)  # Means of all columns
nan_ixs = np.argwhere(np.isnan(lend_pre))   # Indices of NANs 
for nan_ix in nan_ixs:                      # for each NaN (a [row, col] pair)
    # Change the NaN to the mean of its column
    lend_pre[nan_ix[0], nan_ix[1]] = orig_means[nan_ix[1]]

# DONE, check the results
display(lend_pre)
show_attr('lend_pre')
print('Number of NANs:', np.isnan(lend_pre).sum())
means = np.mean(lend_pre, axis=0).round(2)          # Actual means
print('Original Means == Actual Means:', np.array_equal(orig_means, means))

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [   nan,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

Number of NANs: 260


array([[ 2000.  ,    40.  ,   365.  ,  3121.  ,  4241.  , 13621.  ],
       [ 2000.  ,    40.  ,   365.  ,  3061.  ,  4171.  , 15041.  ],
       [ 1000.  ,    40.  ,   365.  ,  2160.  ,  3280.  , 15340.  ],
       ...,
       [ 2250.25,    40.  ,   365.  ,  4201.  ,  5001.  , 16600.  ],
       [ 1000.  ,    40.  ,   365.  ,  2080.  ,  3320.  , 15600.  ],
       [ 2000.  ,    40.  ,   365.  ,  4601.  ,  4601.  , 16600.  ]])

Number of NANs: 0
Original Means == Actual Means: True


In [26]:
# Create an array with the first two cols of lend_num
stck_1 = np.stack((lend_num[:,0], lend_num[:,1]))
display(stck_1)
show_attr('stck_1')     # Two rows 1043 columns

array([[2000., 2000., 1000., ..., 2000., 1000., 2000.],
       [  40.,   40.,   40., ...,   40.,   40.,   40.]])

' stck_1: | shape: (2, 1043) | ndim: 2 | size: 2086 | dtype: float64 '

In [27]:
# Same as transposing the first two cols of lend_num
np.transpose(lend_num[:,:2])

array([[2000., 2000., 1000., ..., 2000., 1000., 2000.],
       [  40.,   40.,   40., ...,   40.,   40.,   40.]])

In [28]:
# More than two columns - Stacking them on top of one another
display(np.stack((lend_num[:,0], lend_num[:,1], lend_num[:,2])))
np.transpose(lend_num[:,:3])

array([[2000., 2000., 1000., ..., 2000., 1000., 2000.],
       [  40.,   40.,   40., ...,   40.,   40.,   40.],
       [ 365.,  365.,  365., ...,  365.,  365.,  365.]])

array([[2000., 2000., 1000., ..., 2000., 1000., 2000.],
       [  40.,   40.,   40., ...,   40.,   40.,   40.],
       [ 365.,  365.,  365., ...,  365.,  365.,  365.]])

In [29]:
# Stack in a different order and mix with other columns (a plain transpose of a slice can't do this)
np.stack((lend_num[:,1], lend_num[:,0], lend_num[:,-1]))

array([[   40.,    40.,    40., ...,    40.,    40.,    40.],
       [ 2000.,  2000.,  1000., ...,  2000.,  1000.,  2000.],
       [13621., 15041., 15340., ..., 16600., 15600., 16600.]])
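
As a side note, the same reordered result can also be obtained by selecting the columns with a list of indices and then transposing. A small sketch on a made-up array (`table` is illustrative, not the lending data):

```python
import numpy as np

table = np.arange(12.0).reshape(3, 4)         # 3 rows, 4 columns

# Stack columns 1, 0 and the last one as rows
stacked = np.stack((table[:, 1], table[:, 0], table[:, -1]))

# Fancy indexing picks the columns in the same order; .T turns them into rows
indexed = table[:, [1, 0, -1]].T

print(stacked)
print(np.array_equal(stacked, indexed))       # True
```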

In [30]:
# To stack the columns Side by Side, use axis=1
stck_2 = np.stack((lend_num[:,1], lend_num[:,0]), axis=1)
display(stck_2)
show_attr('stck_2')     # 1043 rows, two columns

array([[  40., 2000.],
       [  40., 2000.],
       [  40., 1000.],
       ...,
       [  40., 2000.],
       [  40., 1000.],
       [  40., 2000.]])

' stck_2: | shape: (1043, 2) | ndim: 2 | size: 2086 | dtype: float64 '

In [31]:
# np.vstack() = vertical stack of arrays with the same number of columns
# Stacks 2-D arrays (and 1-D arrays) vertically
# Places the first array on top of the second one, resulting in a longer array
stck_3 = np.vstack((lend_num, lend_pre))
display(stck_3)
show_attr('stck_3')     # 2086 (1043 x 2) rows, 6 cols

array([[ 2000.  ,    40.  ,   365.  ,  3121.  ,  4241.  , 13621.  ],
       [ 2000.  ,    40.  ,   365.  ,  3061.  ,  4171.  , 15041.  ],
       [ 1000.  ,    40.  ,   365.  ,  2160.  ,  3280.  , 15340.  ],
       ...,
       [ 2250.25,    40.  ,   365.  ,  4201.  ,  5001.  , 16600.  ],
       [ 1000.  ,    40.  ,   365.  ,  2080.  ,  3320.  , 15600.  ],
       [ 2000.  ,    40.  ,   365.  ,  4601.  ,  4601.  , 16600.  ]])

' stck_3: | shape: (2086, 6) | ndim: 2 | size: 12516 | dtype: float64 '

In [32]:
# vstacking 1-D arrays
display(lend_pre[0].ndim)
stck_3 = np.vstack((lend_num[0], lend_pre[0]))
display(stck_3)
show_attr('stck_3')     # Two rows, six columns

1

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3121.,  4241., 13621.]])

' stck_3: | shape: (2, 6) | ndim: 2 | size: 12 | dtype: float64 '

In [33]:
## vstack arrays of different shapes (same number of cols, different number of rows)
lend_num_smll = lend_num[::130,:]
display(lend_num_smll)
print(show_attr('lend_num_smll'))

lend_pre_smll = lend_num[:7,:]
display(lend_pre_smll)
print(show_attr('lend_pre_smll'))

stck_4 = np.vstack((lend_num, lend_num_smll, lend_pre, lend_pre_smll))
display(stck_4)
print(show_attr('stck_4'))

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 1000.,    40.,   365.,  2160.,  3260., 13740.],
       [ 2000.,    40.,   365.,  4081.,  4681., 13841.],
       [ 1000.,    40.,   365.,  1960.,  2880., 11540.],
       [ 1000.,    40.,   365.,  2200.,  4600., 15600.],
       [ 2000.,    40.,   365.,  3201.,  4321., 16600.],
       [ 2500.,    50.,   365.,  3250.,  4750., 20750.],
       [ 2000.,    50.,   365.,  3400.,  5000., 20250.],
       [ 2000.,    40.,   365.,  4201.,  5001., 16600.]])

 lend_num_smll: | shape: (9, 6) | ndim: 2 | size: 54 | dtype: float64 


array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       [ 2000.,    40.,   365.,  3041.,  4241., 15321.],
       [ 2000.,    50.,   365.,  3470.,  4820., 13720.],
       [ 2000.,    40.,   365.,  3201.,  4141., 14141.],
       [ 2000.,    50.,   365.,  1851.,  3251., 17701.]])

 lend_pre_smll: | shape: (7, 6) | ndim: 2 | size: 42 | dtype: float64 


array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [ 2000.,    50.,   365.,  3470.,  4820., 13720.],
       [ 2000.,    40.,   365.,  3201.,  4141., 14141.],
       [ 2000.,    50.,   365.,  1851.,  3251., 17701.]])

 stck_4: | shape: (2102, 6) | ndim: 2 | size: 12612 | dtype: float64 


In [40]:
# np.hstack() = horizontal stack, a 'wider' array
stck_5 = np.hstack((lend_num, lend_pre))
display(stck_5)
show_attr('stck_5') 

array([[ 2000.,    40.,   365., ...,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365., ...,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365., ...,  2160.,  3280., 15340.],
       ...,
       [ 2000.,    40.,   365., ...,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365., ...,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365., ...,  4601.,  4601., 16600.]])

' stck_5: | shape: (1043, 12) | ndim: 2 | size: 12516 | dtype: float64 '

In [41]:
# np.dstack() = depth stack. Stacks arrays in the third dimension.
# Returns an array of a higher dimension.
stck_6 = np.dstack((lend_num, lend_pre))
display(stck_6)
show_attr('stck_6')     # 1043 matrices of 6 rows and 2 cols each.
# 1043 -> original rows, 6 -> original cols, 2 -> number of stacked arrays

array([[[ 2000.  ,  2000.  ],
        [   40.  ,    40.  ],
        [  365.  ,   365.  ],
        [ 3121.  ,  3121.  ],
        [ 4241.  ,  4241.  ],
        [13621.  , 13621.  ]],

       [[ 2000.  ,  2000.  ],
        [   40.  ,    40.  ],
        [  365.  ,   365.  ],
        [ 3061.  ,  3061.  ],
        [ 4171.  ,  4171.  ],
        [15041.  , 15041.  ]],

       [[ 1000.  ,  1000.  ],
        [   40.  ,    40.  ],
        [  365.  ,   365.  ],
        [ 2160.  ,  2160.  ],
        [ 3280.  ,  3280.  ],
        [15340.  , 15340.  ]],

       ...,

       [[ 2000.  ,  2250.25],
        [   40.  ,    40.  ],
        [  365.  ,   365.  ],
        [ 4201.  ,  4201.  ],
        [ 5001.  ,  5001.  ],
        [16600.  , 16600.  ]],

       [[ 1000.  ,  1000.  ],
        [   40.  ,    40.  ],
        [  365.  ,   365.  ],
        [ 2080.  ,  2080.  ],
        [ 3320.  ,  3320.  ],
        [15600.  , 15600.  ]],

       [[ 2000.  ,  2000.  ],
        [   40.  ,    40.  ],
        [  365.  

' stck_6: | shape: (1043, 6, 2) | ndim: 3 | size: 12516 | dtype: float64 '

In [42]:
# The first array - first index
np.dstack((lend_num, lend_pre))[0]
# 2 cols of identical vals. Each col contains the 1st row of either lend_num
# or lend_pre, so the first index selects the row of the original arrays

array([[ 2000.,  2000.],
       [   40.,    40.],
       [  365.,   365.],
       [ 3121.,  3121.],
       [ 4241.,  4241.],
       [13621., 13621.]])

In [43]:
# 2nd index -> two identical values of 2000
np.dstack((lend_num, lend_pre))[0,0]
# The 2nd index refers to the columns of either input array

array([2000., 2000.])

In [44]:
# 3rd index: [0,:,0] -> row 0, all cols, depth 0
np.dstack((lend_num, lend_pre))[0,:,0]
# Output is the 1st row of the original dataset 'lend_num'
# The third index selects which input array the values were pulled from

array([ 2000.,    40.,   365.,  3121.,  4241., 13621.])
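
The meaning of the three indices is easier to see when the two inputs differ. A minimal sketch with made-up arrays (`first`, `second` and `cube` are illustrative names):

```python
import numpy as np

first = np.array([[1, 2, 3],
                  [4, 5, 6]])
second = first * 10                 # same shape, clearly different values

cube = np.dstack((first, second))   # shape (2, 3, 2)

print(cube[0])         # row 0 of both inputs, one input per column
print(cube[0, 0])      # row 0, col 0 of each input
print(cube[0, :, 0])   # row 0 of the first input
print(cube[:, :, 1])   # the whole second input
```

So the first index is the original row, the second the original column, and the third selects the input array.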

In [46]:
# stack w/axis=-1 is the same as dstack for 2-D inputs: np.stack() always returns
# an output with exactly 1 dim more than its inputs. Since np.dstack() works along
# the third axis, the 2 functions give identical results for 2-D arrays. (For 1-D
# inputs dstack returns shape (1, N, 2), while stack with axis=-1 returns (N, 2).)
stck_7 = np.stack((lend_num, lend_pre), axis=-1)
display(stck_7)
show_attr('stck_7')

array([[[ 2000.  ,  2000.  ],
        [   40.  ,    40.  ],
        [  365.  ,   365.  ],
        [ 3121.  ,  3121.  ],
        [ 4241.  ,  4241.  ],
        [13621.  , 13621.  ]],

       [[ 2000.  ,  2000.  ],
        [   40.  ,    40.  ],
        [  365.  ,   365.  ],
        [ 3061.  ,  3061.  ],
        [ 4171.  ,  4171.  ],
        [15041.  , 15041.  ]],

       [[ 1000.  ,  1000.  ],
        [   40.  ,    40.  ],
        [  365.  ,   365.  ],
        [ 2160.  ,  2160.  ],
        [ 3280.  ,  3280.  ],
        [15340.  , 15340.  ]],

       ...,

       [[ 2000.  ,  2250.25],
        [   40.  ,    40.  ],
        [  365.  ,   365.  ],
        [ 4201.  ,  4201.  ],
        [ 5001.  ,  5001.  ],
        [16600.  , 16600.  ]],

       [[ 1000.  ,  1000.  ],
        [   40.  ,    40.  ],
        [  365.  ,   365.  ],
        [ 2080.  ,  2080.  ],
        [ 3320.  ,  3320.  ],
        [15600.  , 15600.  ]],

       [[ 2000.  ,  2000.  ],
        [   40.  ,    40.  ],
        [  365.  

' stck_7: | shape: (1043, 6, 2) | ndim: 3 | size: 12516 | dtype: float64 '
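
A quick shape check on made-up arrays shows where `np.dstack` and `np.stack(..., axis=-1)` agree and where they differ; this is a sketch for illustration, with arbitrary shapes:

```python
import numpy as np

shapes = {}
for shape in ((4,), (3, 4), (2, 3, 4)):
    a = np.zeros(shape)
    b = np.ones(shape)
    d_shape = np.dstack((a, b)).shape
    s_shape = np.stack((a, b), axis=-1).shape
    shapes[shape] = (d_shape, s_shape)
    print(shape, '-> dstack:', d_shape, '| stack(axis=-1):', s_shape)

# (4,)      -> dstack: (1, 4, 2) | stack(axis=-1): (4, 2)
# (3, 4)    -> dstack: (3, 4, 2) | stack(axis=-1): (3, 4, 2)
# (2, 3, 4) -> dstack: (2, 3, 8) | stack(axis=-1): (2, 3, 4, 2)
```

Only for 2-D inputs do the two calls return the same array: dstack pads 1-D inputs to three dimensions and concatenates 3-D inputs along their existing third axis, while stack always adds exactly one new axis.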

In [None]:
## FUTURE: more on dstack with rank-3 tensors and examples from the manual,
# and the comparison of the 5 methods.

## 5 methods to replace missing values.

In [34]:
# 1. The way we learned in 8_2 Substituting Missing Values in NDarrays
### Probably the third method will be faster and simpler
np.set_printoptions(suppress=True)      # to avoid scientific notation when show

# 1st read of data (.csv) and calc the original values
lend_NAN = np.genfromtxt('Lending-Company-Numeric-Data-NAN.csv', delimiter=';')
display(lend_NAN)
print(show_attr('lend_NAN'))
print('Number of NANs:', np.isnan(lend_NAN).sum())

orig_max = np.nanmax(lend_NAN)          
orig_max_plus1 = orig_max + 1
print(orig_max, ' - ', orig_max_plus1)

cols_means = np.nanmean(lend_NAN, axis=0).round(2)
display(cols_means)
del lend_NAN                            # free memory; the data is reread under another name

# 2nd read of the data; the NaNs are finally replaced by the mean of each column
lend_pre1 = np.genfromtxt('Lending-Company-Numeric-Data-NAN.csv',
                         delimiter=';',
                         filling_values=orig_max_plus1)
display(lend_pre1)

for i in range(lend_pre1.shape[1]):
    lend_pre1[:,i] = np.where(lend_pre1[:,i] == orig_max_plus1,
                             cols_means[i],
                             lend_pre1[:,i])

ln = '-' * 12                           # To show the final resulting array
print(f'\n{ln} The Final Dataset (array) Preprocessed {ln}')
display(lend_pre1)
print(show_attr('lend_pre'))
print('Number of NANs:', np.isnan(lend_pre1).sum())

actual_means = np.mean(lend_pre1, axis=0).round(2)
display(cols_means, actual_means)
np.array_equal(cols_means, actual_means)
# NO NANs and the mean of each col doesn't change from original

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [   nan,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

 lend_NAN: | shape: (1043, 6) | ndim: 2 | size: 6258 | dtype: float64 
Number of NANs: 260
64001.0  -  64002.0


array([ 2250.25,    46.11,   365.  ,  3895.99,  5160.75, 16571.44])

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [64002.,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])


------------ The Final Dataset (array) Preprocessed ------------


array([[ 2000.  ,    40.  ,   365.  ,  3121.  ,  4241.  , 13621.  ],
       [ 2000.  ,    40.  ,   365.  ,  3061.  ,  4171.  , 15041.  ],
       [ 1000.  ,    40.  ,   365.  ,  2160.  ,  3280.  , 15340.  ],
       ...,
       [ 2250.25,    40.  ,   365.  ,  4201.  ,  5001.  , 16600.  ],
       [ 1000.  ,    40.  ,   365.  ,  2080.  ,  3320.  , 15600.  ],
       [ 2000.  ,    40.  ,   365.  ,  4601.  ,  4601.  , 16600.  ]])

 lend_pre: | shape: (1043, 6) | ndim: 2 | size: 6258 | dtype: float64 
Number of NANs: 0


array([ 2250.25,    46.11,   365.  ,  3895.99,  5160.75, 16571.44])

array([ 2250.25,    46.11,   365.  ,  3895.99,  5160.75, 16571.44])

True

In [35]:
# 2. The way we learned in 8_7 Argument Where in NumPy (using np.argwhere)
lend_pre2 = np.genfromtxt('Lending-Company-Numeric-Data-NAN.csv',
                          delimiter=';')
display(lend_pre2)
print(show_attr('lend_pre2'))
print('Number of NANs:', np.isnan(lend_pre2).sum())

cols_means = np.nanmean(lend_pre2, axis=0).round(2)
display(cols_means)

# Indices of NaNs: 2-D array where each row is the (row, col) index of one NaN
NANs_ixs = np.argwhere(np.isnan(lend_pre2))
# display(NANs_ixs)     # must have 260 rows, two cols
print(show_attr('NANs_ixs'))

# Replace the NaNs in each col with the nanmean of that col
for Nix in NANs_ixs:         # for each row of NANs_ixs
    # Nix[1]: col of the NaN, the same col whose mean is looked up
    lend_pre2[Nix[0], Nix[1]] = cols_means[Nix[1]]

ln = '-' * 12                           # To show the final resulting array
print(f'\n{ln} The Final Dataset (array) Preprocessed {ln}')
display(lend_pre2)
print(show_attr('lend_pre'))
print('Number of NANs:', np.isnan(lend_pre2).sum())

actual_means = np.mean(lend_pre2, axis=0).round(2)
display(cols_means, actual_means)
np.array_equal(cols_means, actual_means)
# NO NANs and the mean of each col doesn't change from original


array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [   nan,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

 lend_pre2: | shape: (1043, 6) | ndim: 2 | size: 6258 | dtype: float64 
Number of NANs: 260


array([ 2250.25,    46.11,   365.  ,  3895.99,  5160.75, 16571.44])

 NANs_ixs: | shape: (260, 2) | ndim: 2 | size: 520 | dtype: int64 

------------ The Final Dataset (array) Preprocessed ------------


array([[ 2000.  ,    40.  ,   365.  ,  3121.  ,  4241.  , 13621.  ],
       [ 2000.  ,    40.  ,   365.  ,  3061.  ,  4171.  , 15041.  ],
       [ 1000.  ,    40.  ,   365.  ,  2160.  ,  3280.  , 15340.  ],
       ...,
       [ 2250.25,    40.  ,   365.  ,  4201.  ,  5001.  , 16600.  ],
       [ 1000.  ,    40.  ,   365.  ,  2080.  ,  3320.  , 15600.  ],
       [ 2000.  ,    40.  ,   365.  ,  4601.  ,  4601.  , 16600.  ]])

 lend_pre: | shape: (1043, 6) | ndim: 2 | size: 6258 | dtype: float64 
Number of NANs: 0


array([ 2250.25,    46.11,   365.  ,  3895.99,  5160.75, 16571.44])

array([ 2250.25,    46.11,   365.  ,  3895.99,  5160.75, 16571.44])

True

In [36]:
# 3. Trying np.nonzero... FUTURE

In [37]:
# 4. np.where(isnan)
lend_pre4 = np.genfromtxt('Lending-Company-Numeric-Data-NAN.csv',
                          delimiter=';')
display(lend_pre4)
print(show_attr('lend_pre4'))
print('Number of NANs:', np.isnan(lend_pre4).sum())

cols_means = np.nanmean(lend_pre4, axis=0).round(2)
display(cols_means)

for i in range(lend_pre4.shape[1]):
    lend_pre4[:,i] = np.where(np.isnan(lend_pre4[:,i]),
                              cols_means[i],
                              lend_pre4[:,i])


ln = '-' * 12                           # To show the final resulting array
print(f'\n{ln} The Final Dataset (array) Preprocessed {ln}')
display(lend_pre4)
print(show_attr('lend_pre'))
print('Number of NANs:', np.isnan(lend_pre4).sum())

actual_means = np.mean(lend_pre4, axis=0).round(2)
display(cols_means, actual_means)
np.array_equal(cols_means, actual_means)
# NO NANs and the mean of each col doesn't change from original

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [   nan,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

 lend_pre4: | shape: (1043, 6) | ndim: 2 | size: 6258 | dtype: float64 
Number of NANs: 260


array([ 2250.25,    46.11,   365.  ,  3895.99,  5160.75, 16571.44])


------------ The Final Dataset (array) Preprocessed ------------


array([[ 2000.  ,    40.  ,   365.  ,  3121.  ,  4241.  , 13621.  ],
       [ 2000.  ,    40.  ,   365.  ,  3061.  ,  4171.  , 15041.  ],
       [ 1000.  ,    40.  ,   365.  ,  2160.  ,  3280.  , 15340.  ],
       ...,
       [ 2250.25,    40.  ,   365.  ,  4201.  ,  5001.  , 16600.  ],
       [ 1000.  ,    40.  ,   365.  ,  2080.  ,  3320.  , 15600.  ],
       [ 2000.  ,    40.  ,   365.  ,  4601.  ,  4601.  , 16600.  ]])

 lend_pre: | shape: (1043, 6) | ndim: 2 | size: 6258 | dtype: float64 
Number of NANs: 0


array([ 2250.25,    46.11,   365.  ,  3895.99,  5160.75, 16571.44])

array([ 2250.25,    46.11,   365.  ,  3895.99,  5160.75, 16571.44])

True

In [38]:
# 5. np.where - np.nan_to_num

## New from Q&A
Comparing a missing value with '==' always returns False, because NaN is not equal to anything, not even to itself.
Also, I think there is no function called 'ColumnMeans'.

Alternatively, you can do any of the following:

1- Pass a placeholder value to genfromtxt through the filling_values parameter, and then use the where function to replace that placeholder in specific columns of the dataset.

2- Simply use the 'np.nan_to_num' function, passing the replacement through the nan= keyword (the second positional parameter is copy, not the replacement value),
so the code should be


DataMissing2[:,0] = np.nan_to_num(DataMissing2[:,0], nan=np.nanmean(DataMissing2, axis=0)[0])


3- Instead of using 'DataMissing2[:,0] == np.nan', use 'isnan' as follows:



DataMissing2[:,0] = np.where(np.isnan(DataMissing2[:,0]), np.nanmean(DataMissing2, axis=0)[0], DataMissing2[:,0])
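
Option 2 can be extended to every column with a short loop. The sketch below uses a small made-up array instead of `DataMissing2`, and passes a scalar to the `nan=` keyword column by column; note that `np.nan_to_num` also replaces infinities with large finite numbers, which does not matter here because the made-up data has none:

```python
import numpy as np

# Small made-up array with one NaN in each column
data = np.array([[1.0, np.nan],
                 [np.nan, 4.0],
                 [5.0, 6.0]])

col_means = np.nanmean(data, axis=0)          # column means, ignoring NaNs

filled = data.copy()
for j in range(filled.shape[1]):
    # nan= must be given by keyword; the second positional argument is copy
    filled[:, j] = np.nan_to_num(filled[:, j], nan=col_means[j])

print(col_means)
print(filled)
```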


In [39]:
# To see 
l_NAN = np.genfromtxt('Lending-Company-Numeric-Data-NAN.csv', delimiter=';')
# l_NAN
Ncoords = np.argwhere(np.isnan(l_NAN))
print(l_NAN[Ncoords[0,0],Ncoords[0,1] ])
print(l_NAN[Ncoords[9,0],Ncoords[9,1] ])

type(np.nan)    # <class 'float'>
print(l_NAN[Ncoords[0,0],Ncoords[0,1]] == np.nan)   # False
print(np.isnan(l_NAN[Ncoords[0,0],Ncoords[0,1]]))   # True
np.nan_to_num(l_NAN[Ncoords[0,0],Ncoords[0,1]])     # 0.0

print(np.nan_to_num(l_NAN[Ncoords[0,0],Ncoords[0,1]]) == 0)     # True
print(np.nan_to_num(l_NAN[Ncoords[0,0],Ncoords[0,1]]) == 0.0)   # True
print(np.nan_to_num(l_NAN[0,0]) == 0)   # False
np.nan_to_num(l_NAN[0,0])

nan
nan
False
True
True
True
False


2000.0