# 8 Data Manipulation with NumPy
- Examine how to clean and preprocess data using NumPy.
- How to discover missing values (and fill them in).
- Ways to remove irrelevant data.
- sort(), shuffle(), reshape(), stack(), strip()
## 8_3 Reshaping Ndarrays

#### numpy.reshape(a, newshape, order='C')
- Gives a new shape to an array without changing its data.
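
One practical consequence of leaving the data untouched: for a contiguous array like the one below, `np.reshape` typically returns a view onto the same memory rather than a copy. This is a minimal sketch with a made-up array (`base` and `grid` are illustrative names, not part of the dataset used later):

```python
import numpy as np

base = np.arange(6)                 # values 0..5, shape (6,)
grid = np.reshape(base, (2, 3))     # same six values, arranged as 2 rows x 3 columns

# The element count is unchanged by the reshape
print(base.size, grid.size)

# For this contiguous array the result shares memory with the original,
# so a write through one name is visible through the other
grid[0, 0] = 99
print(base[0])
print(np.shares_memory(base, grid))
```

If an independent array is needed, calling `.copy()` on the result is one way to get it.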

#### Reshaping - Why is it useful?
- In data science we often rely on readily available functions and methods that have specific requirements for their inputs and outputs.
- We can't always plug in whatever arrays we want and hope for the best, because certain conditions about shapes and sizes need to be met.
- Similarly, it isn't always possible to store the outputs of a function as part of an existing array (or series).
- __In such cases__, reshaping the array can resolve the problem.
- Reshaping is the act of changing the shape of an object in a certain way.
- In NumPy, we'll be altering the shapes of arrays.
- However, there are restrictions on the shapes we can give an array, since the number of elements is fixed.
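
As an illustration of the "store the outputs as part of an existing array" case, here is a small sketch with made-up values. The names `table`, `row_sums`, `column` and `extended` are assumptions for this example only. A 1-D result cannot be appended as a new column until it is reshaped into a 2-D column:

```python
import numpy as np

table = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])            # shape (3, 2)
row_sums = table.sum(axis=1)              # shape (3,), a 1-D output

# Stacking the 1-D result directly onto the 2-D table fails
try:
    np.hstack((table, row_sums))
    stacked_directly = True
except ValueError:
    stacked_directly = False

# Reshaping the result into a (3, 1) column makes the shapes compatible
column = np.reshape(row_sums, (3, 1))
extended = np.hstack((table, column))     # shape (3, 3)
print(extended)
```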

In [9]:
import numpy as np
np.__version__

'1.26.4'

In [10]:
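# Helper show_attr: summarize the main attributes of an array, given its variable name

def show_attr(arrnm: str) -> str:
    strout = f' {arrnm}: '

    for attr in ('shape', 'ndim', 'size', 'dtype'):     #, 'itemsize'):
        arrnm_attr = arrnm + '.' + attr
        # eval looks the variable name up in the notebook's global namespace
        strout += f'| {attr}: {eval(arrnm_attr)} '

    return strout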
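
A possible variant of the helper above that avoids `eval` by taking the array itself and reading each attribute with `getattr`. The name `describe_array` and its two-argument signature are assumptions for this sketch, not part of the original notebook:

```python
import numpy as np

def describe_array(name: str, arr: np.ndarray) -> str:
    """Build the same summary string from the array object rather than its name."""
    parts = [f' {name}: ']
    for attr in ('shape', 'ndim', 'size', 'dtype'):
        parts.append(f'| {attr}: {getattr(arr, attr)} ')
    return ''.join(parts)

demo = np.zeros((2, 3))
print(describe_array('demo', demo))
```

The trade-off is that the label has to be passed explicitly, but the function then works on any array, including ones that are not global variables.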

In [11]:
# Let's work with a dataset that contains NANs.

lend_co_data_num = np.loadtxt('Lending-Company-Numeric-Data.csv',
                              delimiter=',')

display(show_attr('lend_co_data_num'))
lend_co_data_num

' lend_co_data_num: | shape: (1043, 6) | ndim: 2 | size: 6258 | dtype: float64 '

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [ 2000.,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

In [12]:
# Let's reshape the dataset. Current shape:
display(lend_co_data_num.shape)     # 1043 rows x 6 columns
# Reshape to 6 x 1043
display(reshaped_data := np.reshape(lend_co_data_num, (6,1043)))
reshaped_data.shape                 # 6 rows x 1043 columns

## Note that this is not a transpose. Instead:
# 1st row: the first 1043 values of the flattened array
# 2nd row: the next 1043 values of the flattened array
# ...
# Last row: the last 1043 values of the flattened array

(1043, 6)

array([[ 2000.,    40.,   365., ...,   365.,  1581.,  3041.],
       [12277.,  2000.,    40., ...,    50.,   365.,  5350.],
       [ 6850., 15150.,  1000., ...,  2000.,    40.,   365.],
       [ 3101.,  4351., 16600., ..., 16600.,  2000.,    40.],
       [  365.,  3441.,  4661., ...,  8450., 22250.,  2000.],
       [   40.,   365.,  3701., ...,  4601.,  4601., 16600.]])

(6, 1043)

In [13]:
# To turn rows into columns and vice versa, transpose the array instead
display(t_data := lend_co_data_num.T)
t_data.shape


array([[ 2000.,  2000.,  1000., ...,  2000.,  1000.,  2000.],
       [   40.,    40.,    40., ...,    40.,    40.,    40.],
       [  365.,   365.,   365., ...,   365.,   365.,   365.],
       [ 3121.,  3061.,  2160., ...,  4201.,  2080.,  4601.],
       [ 4241.,  4171.,  3280., ...,  5001.,  3320.,  4601.],
       [13621., 15041., 15340., ..., 16600., 15600., 16600.]])

(6, 1043)
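
The difference between the two results is easier to see on a tiny array. This sketch uses a made-up 2 x 3 array (`small` is an illustrative name): both outputs have shape (3, 2), but the elements land in different positions.

```python
import numpy as np

small = np.array([[0, 1, 2],
                  [3, 4, 5]])              # shape (2, 3)

reshaped = np.reshape(small, (3, 2))       # refills row by row from the flattened data
transposed = small.T                       # swaps the axes: rows become columns

print(reshaped)
# [[0 1]
#  [2 3]
#  [4 5]]
print(transposed)
# [[0 3]
#  [1 4]
#  [2 5]]
```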

### Valid New Shapes
- The point is that we have 1043 x 6 = 6258 elements (lend_co_data_num.size).
- We have to allocate all of those elements.
- So the product of the rows and columns of the new shape must be 6258.
- Ex. 1: if I want x columns and y rows, then x * y must equal 6258, and both must be integers.
- Ex. 2: if I want a square matrix (number of rows = number of columns), np.sqrt(data.size) gives the number of rows and columns, which works only if the result is an integer.
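
The rule above can be checked with plain integer arithmetic. The following is a sketch, not part of the original notebook: `valid_2d_shapes` is a hypothetical helper that lists every (rows, columns) pair whose product equals the element count, and `math.isqrt` tests whether a square shape exists.

```python
import math

def valid_2d_shapes(n_elements: int) -> list[tuple[int, int]]:
    """Return all (rows, cols) pairs of positive integers with rows * cols == n_elements."""
    shapes = []
    for rows in range(1, n_elements + 1):
        if n_elements % rows == 0:
            shapes.append((rows, n_elements // rows))
    return shapes

total = 1043 * 6                       # 6258, the size of the dataset
shapes = valid_2d_shapes(total)
print(len(shapes), shapes)

# A square shape exists only if the integer square root squares back to the total
root = math.isqrt(total)
is_square = root * root == total
print(root, is_square)
```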
  

In [14]:
# If I try to reshape to a new shape where rows x cols != 6258
# (the total number of elements), I get a ValueError

# ->  np.reshape(lend_co_data_num, (3,500))
# ValueError: cannot reshape array of size 6258 into shape (3,500)
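
To see the error without stopping the notebook, the call can be wrapped in `try`/`except`. This sketch uses an array of zeros with the same shape as the dataset as a stand-in (an assumption, so that it runs without the CSV file):

```python
import numpy as np

stand_in = np.zeros((1043, 6))            # same shape and size as the dataset, placeholder values

try:
    np.reshape(stand_in, (3, 500))        # 3 * 500 = 1500, which is not 6258
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```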

In [15]:
# Ex. #1: 42 columns (6258 divides evenly by 14, 21, or 42)
print(tot_elements := lend_co_data_num.size)
print(new_num_cols := 42)
print(new_num_rows := tot_elements / new_num_cols)
display(r_shaped_42_cols := np.reshape(lend_co_data_num, (149, 42)))
print(r_shaped_42_cols.shape)
print(r_shaped_42_cols.size)
show_attr('r_shaped_42_cols')

6258
42
149.0


array([[ 2000.,    40.,   365., ...,  1851.,  3251., 17701.],
       [ 2000.,    40.,   365., ...,  1680.,  1680.,  5010.],
       [ 4000.,    50.,   365., ...,  3560.,  4760., 16200.],
       ...,
       [ 1000.,    40.,   365., ...,  2520.,  3740., 15600.],
       [ 4000.,    50.,   365., ...,  3701.,  5201., 20250.],
       [ 2000.,    40.,   365., ...,  4601.,  4601., 16600.]])

(149, 42)
6258


' r_shaped_42_cols: | shape: (149, 42) | ndim: 2 | size: 6258 | dtype: float64 '

In [16]:
# Ex. #2: a square matrix (number of rows = number of columns)
np.sqrt(lend_co_data_num.size)  # Not possible: the square root is not an integer

79.10752176626443

In [17]:
# Ex. #3: if one dimension of the (6, 1043) shape is halved, the other must be doubled
print('6 / 2:', 6/2, '-', '1043 * 2:', 1043 * 2)
np.reshape(lend_co_data_num, (3,2086))

6 / 2: 3.0 - 1043 * 2: 2086


array([[ 2000.,    40.,   365., ...,    50.,   365.,  5350.],
       [ 6850., 15150.,  1000., ..., 16600.,  2000.,    40.],
       [  365.,  3441.,  4661., ...,  4601.,  4601., 16600.]])

In [18]:
# Ex. #4: adding a third dimension gives a 3-D array that holds two 3 x 1043 tables
np.reshape(lend_co_data_num, (2,3,1043))

array([[[ 2000.,    40.,   365., ...,   365.,  1581.,  3041.],
        [12277.,  2000.,    40., ...,    50.,   365.,  5350.],
        [ 6850., 15150.,  1000., ...,  2000.,    40.,   365.]],

       [[ 3101.,  4351., 16600., ..., 16600.,  2000.,    40.],
        [  365.,  3441.,  4661., ...,  8450., 22250.,  2000.],
        [   40.,   365.,  3701., ...,  4601.,  4601., 16600.]]])

In [19]:
# We can add dimensions artificially by using 1 as the size of the new axes
# Ex. #5: a 5-D array
display(np.reshape(lend_co_data_num, (1,1,2,3,1043)))
np.reshape(lend_co_data_num, (1,1,2,3,1043)).shape

# ndim can be read off by counting the square brackets at the very top
# and bottom of the printed array

## This trick is useful when a method or function only takes inputs
# with a higher number of dimensions than the array we want to plug
# in


array([[[[[ 2000.,    40.,   365., ...,   365.,  1581.,  3041.],
          [12277.,  2000.,    40., ...,    50.,   365.,  5350.],
          [ 6850., 15150.,  1000., ...,  2000.,    40.,   365.]],

         [[ 3101.,  4351., 16600., ..., 16600.,  2000.,    40.],
          [  365.,  3441.,  4661., ...,  8450., 22250.,  2000.],
          [   40.,   365.,  3701., ...,  4601.,  4601., 16600.]]]]])

(1, 1, 2, 3, 1043)
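
A small sketch of that situation, under stated assumptions: `column_means` is a hypothetical function that only accepts 2-D input, and `single_row` is a made-up 1-D array. Adding an axis of size 1 makes the array acceptable without changing its values, and `np.squeeze` removes such axes again.

```python
import numpy as np

def column_means(batch: np.ndarray) -> np.ndarray:
    """Hypothetical helper that insists on 2-D input (rows x columns)."""
    if batch.ndim != 2:
        raise ValueError(f'expected a 2-D array, got {batch.ndim}-D')
    return batch.mean(axis=0)

single_row = np.array([2000., 40., 365.])       # shape (3,), 1-D
as_batch = np.reshape(single_row, (1, 3))       # shape (1, 3): one row, same three values
means = column_means(as_batch)

# np.squeeze drops the axes of size 1 and returns to the original shape
restored = np.squeeze(as_batch)
print(as_batch.shape, restored.shape)
```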

### Notes and examples from the Manual - numpy.reshape()

In [22]:
display(a := np.array([[1,2,3], [4,5,6]]))
display(np.reshape(a, 6))
np.reshape(a, 6, order='F')

array([[1, 2, 3],
       [4, 5, 6]])

array([1, 2, 3, 4, 5, 6])

array([1, 4, 2, 5, 3, 6])
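
The `order` argument controls both how the elements are read and how they are placed. With `order='C'` the last index changes fastest (row by row), while with `order='F'` the first index changes fastest (column by column). A short sketch using the same 2 x 3 array, reshaped to (3, 2) in both orders:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

c_order = np.reshape(a, (3, 2), order='C')   # read and fill row by row
f_order = np.reshape(a, (3, 2), order='F')   # read and fill column by column

print(c_order)
# [[1 2]
#  [3 4]
#  [5 6]]
print(f_order)
# [[1 5]
#  [4 3]
#  [2 6]]
```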

In [23]:
np.reshape(a, (3,-1))       # the unspecified value is inferred to be 2

array([[1, 2],
       [3, 4],
       [5, 6]])
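
A sketch of how the `-1` placeholder behaves, using a made-up array of 12 values (`values` is an illustrative name). Only one dimension can be left for NumPy to infer, and the known dimensions must divide the total size evenly:

```python
import numpy as np

values = np.arange(12)

inferred_rows = np.reshape(values, (-1, 4))   # rows inferred as 12 / 4 = 3
inferred_cols = np.reshape(values, (2, -1))   # columns inferred as 12 / 2 = 6
flat = np.reshape(inferred_rows, -1)          # a single -1 flattens to 1-D

print(inferred_rows.shape, inferred_cols.shape, flat.shape)

# Both of these fail: 12 is not divisible by 5, and two unknowns cannot be inferred
errors = []
for bad_shape in [(-1, 5), (-1, -1)]:
    try:
        np.reshape(values, bad_shape)
    except ValueError as exc:
        errors.append(str(exc))
print(errors)
```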

In [30]:
# Using -1 as the number of rows in our examples
display(lend_co_data_num)
display(data_3_cols := np.reshape(lend_co_data_num, (-1, 3)))
show_attr('data_3_cols')

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [ 2000.,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

array([[ 2000.,    40.,   365.],
       [ 3121.,  4241., 13621.],
       [ 2000.,    40.,   365.],
       ...,
       [ 2080.,  3320., 15600.],
       [ 2000.,    40.,   365.],
       [ 4601.,  4601., 16600.]])

' data_3_cols: | shape: (2086, 3) | ndim: 2 | size: 6258 | dtype: float64 '

In [33]:
display(data_14_cols := np.reshape(lend_co_data_num, (-1, 14)))
show_attr('data_14_cols')

array([[ 2000.,    40.,   365., ..., 15041.,  1000.,    40.],
       [  365.,  2160.,  3280., ...,    50.,   365.,  3470.],
       [ 4820., 13720.,  2000., ...,  1851.,  3251., 17701.],
       ...,
       [ 2000.,    40.,   365., ..., 16600.,  2000.,    40.],
       [  365.,  3401.,  4601., ...,    40.,   365.,  4201.],
       [ 5001., 16600.,  1000., ...,  4601.,  4601., 16600.]])

' data_14_cols: | shape: (447, 14) | ndim: 2 | size: 6258 | dtype: float64 '