# Exercises 04 - NumPy for array manipulation

This week we will work with the `numpy` library for numerical manipulations.
The reference guide for NumPy can be found here: https://numpy.org/doc/stable/reference/. It is
quite large, but the sections of main interest for us are in the "routines" (a synonym for "functions") section,
in particular:
    - Array creation routines
    - Array manipulation routines
    - Input and output
    - Statistics

Some exercises have been modified from https://www.machinelearningplus.com/python/101-numpy-exercises-python/. You can try more exercises there, if you like!

In [1]:
# Import the NumPy library, using `np` as an alias
import numpy as np

### 1. Create a 1D array of numbers from 99 to 110 

Desired output: ```#> array([99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])```

In [16]:
# Write your solution here
np.arange(99,111)

array([ 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])

### 2. Create a 3x3 numpy array of all False values 

In [19]:
# Write your solution here
np.zeros((3,3),dtype=bool)

array([[False, False, False],
       [False, False, False],
       [False, False, False]])

### 3. Replace all odd numbers in ```arr``` with -1

Sample input: ```np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])```

Desired output: ```#>  array([ 0, -1,  2, -1,  4, -1,  6, -1,  8, -1])```

In [33]:
# Sample input
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

#getting the remainder of each num when divided by 2
#arr % 2 #Output: array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

#converting the array into a boolean
#arr % 2 == 1 #Output: array([False,  True, False,  True, False,  True, False,  True, False, True])

# Write your solution here:
arr[arr % 2 == 1] = -1
arr

array([ 0, -1,  2, -1,  4, -1,  6, -1,  8, -1])
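
The boolean-mask assignment above modifies `arr` in place. If the original array should be kept, one possible alternative is `np.where`, which returns a new array. This is a minimal sketch; the name `replaced` is only illustrative.

```python
import numpy as np

arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# np.where(condition, x, y) takes x where the condition is True and y elsewhere.
# It builds a new array, so `arr` itself is left unchanged.
replaced = np.where(arr % 2 == 1, -1, arr)

print(replaced)  # [ 0 -1  2 -1  4 -1  6 -1  8 -1]
print(arr)       # [0 1 2 3 4 5 6 7 8 9]
```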

### 4. Stack two arrays vertically



Consider these two numpy arrays as input:

```
a = np.arange(10).reshape(2,-1)
b = np.repeat(1, 10).reshape(2,-1)
```

```
a = [[0 1 2 3 4]
 [5 6 7 8 9]]
 
b = [[1 1 1 1 1]
 [1 1 1 1 1]]
```

Desired output:
```
c = [[0, 1, 2, 3, 4],
     [5, 6, 7, 8, 9],
     [1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1]]
```

In [41]:
a = np.arange(10).reshape(2,-1)
b = np.repeat(1, 10).reshape(2,-1)

# Write your solution here
np.vstack((a,b))

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])
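
As a side note, vertical stacking of 2D arrays can also be written with `np.concatenate` and an explicit `axis`, which makes the direction of the stacking visible. The sketch below reuses the same `a` and `b`; the names `stacked_rows` and `stacked_cols` are illustrative.

```python
import numpy as np

a = np.arange(10).reshape(2, -1)
b = np.repeat(1, 10).reshape(2, -1)

# axis=0 appends rows (same result as np.vstack for 2D inputs)
stacked_rows = np.concatenate((a, b), axis=0)
# axis=1 appends columns (same result as np.hstack for 2D inputs)
stacked_cols = np.concatenate((a, b), axis=1)

print(stacked_rows.shape)  # (4, 5)
print(stacked_cols.shape)  # (2, 10)
```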

## Exercises of data manipulation on the "500_Person_Gender_Height_Weight_Index" dataset.

We will use the "500_Person_Gender_Height_Weight_Index" dataset, which contains the height (in cm) and weight (in kg) of 500 subjects, classified by gender. You can have a look at the dataset on Kaggle: https://www.kaggle.com/yersever/500-person-gender-height-weight-bodymassindex
You do not have to download it, though. The CSV (comma-separated values) file 
can be found in the "datasets" directory.

We are not going to use the Body Mass Index (BMI) information provided in the dataset. 
We will compute the BMI later on using Numpy.

Let's import the "500_Person_Gender_Height_Weight_Index" dataset. You can import the dataset keeping the text column intact by passing `dtype='object'` as an argument. In this case, though, all the numeric values will be stored as bytes. We have to set the `names` argument to `True` because the first line of the file (the "header") contains the column names.

Read more details about the textual data importing functions of Numpy here: 
- `np.loadtxt()`: https://docs.scipy.org/doc/numpy-1.16.1/reference/generated/numpy.loadtxt.html#numpy.loadtxt
- `np.genfromtxt()`: https://docs.scipy.org/doc/numpy-1.16.1/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt

You can see all the arguments you can supply to the functions. All the arguments that have a default value are
optional.
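
To get a feeling for these arguments without touching the real file, here is a small sketch that reads a few rows in the same column layout from an in-memory string. It uses `dtype=None` (let NumPy guess a type per column) and `encoding="utf-8"` instead of `dtype='object'`; this is only an alternative to compare with, not the approach required by the exercise. The names `csv_text` and `sample` are illustrative.

```python
import io
import numpy as np

# A few rows in the same column layout as the dataset, held in memory.
csv_text = (
    "Gender,Height,Weight,Index\n"
    "Male,174,96,4\n"
    "Male,189,87,2\n"
    "Female,185,110,4\n"
)

# names=True reads the column names from the header line;
# dtype=None asks genfromtxt to infer a type for each column.
sample = np.genfromtxt(
    io.StringIO(csv_text), delimiter=",", names=True, dtype=None, encoding="utf-8"
)

print(sample.dtype.names)  # ('Gender', 'Height', 'Weight', 'Index')
print(sample["Height"])    # [174 189 185]
```

With inferred types, the numeric columns come back as integers and the text column as strings, so no conversion from bytes is needed afterwards.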

In [44]:
#pip install kagglehub

In [6]:
import kagglehub, os

Downloading the 500-person height and weight data using the Kaggle API. [link to dataset](https://www.kaggle.com/datasets/yersever/500-person-gender-height-weight-bodymassindex)

In [3]:
# Download latest version
path = kagglehub.dataset_download("yersever/500-person-gender-height-weight-bodymassindex")



In [4]:
path

'C:\\Users\\hussainsarfraz\\.cache\\kagglehub\\datasets\\yersever\\500-person-gender-height-weight-bodymassindex\\versions\\2'

In [7]:
# Import `height-weight` keeping the text column intact.
localfilepath = os.path.join(path, "500_Person_Gender_Height_Weight_Index.csv" )
hw_dataset = np.genfromtxt(localfilepath, delimiter=',', names=True, dtype='object')
hw_dataset[:20]

array([(b'Male', b'174', b'96', b'4'), (b'Male', b'189', b'87', b'2'),
       (b'Female', b'185', b'110', b'4'),
       (b'Female', b'195', b'104', b'3'), (b'Male', b'149', b'61', b'3'),
       (b'Male', b'189', b'104', b'3'), (b'Male', b'147', b'92', b'5'),
       (b'Male', b'154', b'111', b'5'), (b'Male', b'174', b'90', b'3'),
       (b'Female', b'169', b'103', b'4'), (b'Male', b'195', b'81', b'2'),
       (b'Female', b'159', b'80', b'4'),
       (b'Female', b'192', b'101', b'3'), (b'Male', b'155', b'51', b'2'),
       (b'Male', b'191', b'79', b'2'), (b'Female', b'153', b'107', b'5'),
       (b'Female', b'157', b'110', b'5'), (b'Male', b'140', b'129', b'5'),
       (b'Male', b'144', b'145', b'5'), (b'Male', b'172', b'139', b'5')],
      dtype=[('Gender', 'O'), ('Height', 'O'), ('Weight', 'O'), ('Index', 'O')])

### 5. How to convert a 1d array of tuples to a 2d numpy array

Convert the `hw_dataset` to a numeric-only 2D array `hw_data` by omitting the "Gender" text column and the "Index" numeric field. Create a `hw_label` 1D array containing only the "Gender" text field. Keep the same indexing/order as in the original array.

In [57]:
# Write your solution here

#each item, in the dataset, is a void type, not tuple
type(hw_dataset[0]) #Output: numpy.void
type(tuple(hw_dataset[0])) #Output: tuple

#converting each item to a list & specifying datatype
hw_dataset = [list(row) for row in hw_dataset]

hw_label = np.array([row[0] for row in hw_dataset], dtype=str)
hw_data = np.array([row[1:3] for row in hw_dataset], dtype=float)

print(hw_label[0:5], '\n', hw_data[0:5])

['Male' 'Male' 'Female' 'Female' 'Male'] 
 [[174.  96.]
 [189.  87.]
 [185. 110.]
 [195. 104.]
 [149.  61.]]
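
The list-comprehension solution above works row by row. A structured array also allows selecting whole columns by field name, which avoids the Python loop. The sketch below assumes a small structured array with typed fields (as you would get, for example, when the types are inferred at import time) rather than the all-object `hw_dataset`; the names `records`, `labels` and `data` are illustrative.

```python
import numpy as np

# A tiny structured array with the same field names as the dataset
records = np.array(
    [("Male", 174, 96, 4), ("Male", 189, 87, 2), ("Female", 185, 110, 4)],
    dtype=[("Gender", "U6"), ("Height", "i8"), ("Weight", "i8"), ("Index", "i8")],
)

# Selecting a field by name returns a 1D array with one value per record
labels = records["Gender"]

# Put the two numeric fields side by side as the columns of a 2D float array
data = np.column_stack((records["Height"], records["Weight"])).astype(float)

print(labels)      # ['Male' 'Male' 'Female']
print(data.shape)  # (3, 2)
```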


### 6. Split the dataset into two groups

Split the dataset in `hw_data` according to the labels in `hw_label`. Hint: You can create a dictionary with the two different labels ("Male" and "Female") as keys, and the two split datasets as values.

In [47]:
# Write your solution here

hw_by_label = {
    label: hw_data[hw_label == label] for label in set(hw_label)
}

print(hw_by_label)

{np.str_('Female'): array([[185., 110.],
       [195., 104.],
       [169., 103.],
       [159.,  80.],
       [192., 101.],
       [153., 107.],
       [157., 110.],
       [153., 149.],
       [169.,  97.],
       [172.,  67.],
       [151.,  64.],
       [163., 159.],
       [195.,  65.],
       [160., 131.],
       [157., 153.],
       [189., 132.],
       [197., 114.],
       [171., 152.],
       [185.,  81.],
       [175., 120.],
       [149., 108.],
       [182., 126.],
       [188., 122.],
       [168., 135.],
       [176.,  54.],
       [187.,  89.],
       [164.,  70.],
       [191.,  54.],
       [141., 143.],
       [179., 158.],
       [172.,  96.],
       [168.,  59.],
       [164.,  82.],
       [194., 136.],
       [153.,  51.],
       [185., 100.],
       [197., 154.],
       [168.,  90.],
       [176., 122.],
       [166., 140.],
       [190., 105.],
       [175., 141.],
       [159., 104.],
       [185., 140.],
       [178., 154.],
       [183.,  96.],
       [194., 

### 7. Compute the monovariate statistics for each group

For each label compute the key statistics for both weight and height. 
- mean
- median
- standard deviation
- variance
- interquartile range

Read the NumPy documentation for the statistics routines carefully.
Answer the following questions:

1) Which is the gender with the highest mean value for `height`?

2) Which is the gender with the smallest median value for `weight`?

3) Which is the gender with the shortest interquartile range for `weight`?

In [45]:
# Write your solution here
summary_statistics ={}

for gender, dataset in hw_by_label.items():
    
    summary_statistics[gender] = {
        'mean': np.mean(dataset, axis=0),
        'median': np.median(dataset, axis=0),
        'quantiles': np.quantile(dataset, [0.25, 0.5, 0.75], axis=0),
        'std': np.std(dataset, axis=0),
        'var': np.var(dataset, axis=0)
   }
    
print('---Summary Statistics---','\n', summary_statistics,'\n')

# male interquartile range for weight
print('---Male Interquartile Range---','\n',summary_statistics['Male']['quantiles'],'\n','------', '\n',
      summary_statistics['Male']['quantiles'][2],'\n','---The highest value of the interquartile range---', '\n',
      summary_statistics['Male']['quantiles'][2][1],'\n')

#calculating the interquartile range
male_iqr_w = summary_statistics['Male']['quantiles'][2][1] - summary_statistics['Male']['quantiles'][0][1]
female_iqr_w = summary_statistics['Female']['quantiles'][2][1] - summary_statistics['Female']['quantiles'][0][1]
print("Male IQR for weight: {}".format(male_iqr_w))
print("Female IQR for weight: {}".format(female_iqr_w))

---Summary Statistics--- 
 {np.str_('Female'): {'mean': array([170.22745098, 105.69803922]), 'median': array([170., 106.]), 'quantiles': array([[157.,  79.],
       [170., 106.],
       [184., 135.]]), 'std': array([15.68098734, 32.89732983]), 'var': array([ 245.89336409, 1082.23430988])}, np.str_('Male'): {'mean': array([169.64897959, 106.31428571]), 'median': array([171., 105.]), 'quantiles': array([[154.,  80.],
       [171., 105.],
       [183., 137.]]), 'std': array([17.0308019 , 31.76770276]), 'var': array([ 290.04821324, 1009.18693878])}} 

---Male Interquartile Range--- 
 [[154.  80.]
 [171. 105.]
 [183. 137.]] 
 ------ 
 [183. 137.] 
 ---The highest value of the interquartile range--- 
 137.0 

Male IQR for weight: 57.0
Female IQR for weight: 56.0
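
The interquartile range is the difference between the 75th and the 25th percentile. As a compact sketch of the same computation, `np.percentile` with `axis=0` gives the IQR of every column at once. The `sample` values below are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up [height, weight] rows, for illustration only
sample = np.array([
    [150.0, 50.0],
    [160.0, 60.0],
    [170.0, 70.0],
    [180.0, 80.0],
    [190.0, 90.0],
])

# One row of results per requested percentile, one column per variable
q1, q3 = np.percentile(sample, [25, 75], axis=0)
iqr = q3 - q1

print(iqr)  # [20. 20.]
```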


### 8. Compute the covariance and correlation matrix for each group

1) Compute the `height`-`weight` covariance matrix for each gender. Do the values on the diagonal match the values computed with the variance functions in the previous step? If not, can you understand why, and how you can obtain coherent values?

2) Compute the `height`-`weight` correlation matrix for each gender.

In [51]:
# Compute the covariance matrix here:
# np.var uses ddof=0 (Delta Degrees of Freedom) by default, while np.cov normalizes by N-1 by default (as with ddof=1)
cov_mat = {
    label: dict(cov=np.cov(dataset.T, ddof=0)) for label, dataset in hw_by_label.items()
}
print(cov_mat)

{np.str_('Female'): {'cov': array([[ 245.89336409,   14.22162245],
       [  14.22162245, 1082.23430988]])}, np.str_('Male'): {'cov': array([[ 290.04821324,  -14.13865889],
       [ -14.13865889, 1009.18693878]])}}
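
To see why the diagonal only matches after setting `ddof=0`, the following sketch compares both normalizations on a few made-up rows (not taken from the dataset). The variable names are illustrative.

```python
import numpy as np

# Made-up [height, weight] rows, for illustration only
sample = np.array([
    [150.0, 50.0],
    [160.0, 65.0],
    [170.0, 60.0],
    [180.0, 85.0],
])

# Default np.cov: normalizes by N - 1 (sample covariance)
cov_default = np.cov(sample.T)
# ddof=0: normalizes by N (population covariance)
cov_population = np.cov(sample.T, ddof=0)

# The diagonal of a covariance matrix holds the variances of the columns
print(np.allclose(np.diag(cov_default), np.var(sample, axis=0, ddof=1)))  # True
print(np.allclose(np.diag(cov_population), np.var(sample, axis=0)))       # True
print(np.allclose(np.diag(cov_default), np.var(sample, axis=0)))          # False
```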


In [52]:
# Compute the correlation matrix here
cor_mat = {
    label: dict(cov=np.corrcoef(dataset.T)) for label, dataset in hw_by_label.items()
}
print(cor_mat)

{np.str_('Female'): {'cov': array([[1.        , 0.02756862],
       [0.02756862, 1.        ]])}, np.str_('Male'): {'cov': array([[ 1.        , -0.02613288],
       [-0.02613288,  1.        ]])}}
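
The correlation matrix is the covariance matrix rescaled by the standard deviations of the two variables, which is why its diagonal is always 1. A short sketch on made-up rows (not taken from the dataset) checks this against `np.corrcoef`; the names are illustrative.

```python
import numpy as np

# Made-up [height, weight] rows, for illustration only
sample = np.array([
    [150.0, 50.0],
    [160.0, 65.0],
    [170.0, 60.0],
    [180.0, 85.0],
])

# rowvar=False tells NumPy that each column is a variable
cov = np.cov(sample, rowvar=False)
std = np.sqrt(np.diag(cov))

# Divide each covariance by the product of the two standard deviations
corr_from_cov = cov / np.outer(std, std)

print(np.allclose(corr_from_cov, np.corrcoef(sample, rowvar=False)))  # True
```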


### 9. How to create a new column from existing columns of a numpy array

Create a new column for the body mass index (called "MBI" in this exercise) in `hw_data`, where MBI is:

$$MBI = \frac{\text{weight [kg]}}{(\text{height [m]})^2}$$



In [58]:
# Compute MBI here:

# (remember to convert the heights from cm to m!)
mbi = hw_data[:, 1] / (hw_data[:, 0]/100)**2

# Introduce new dimension to match hw_data 2-dimensions
mbi = mbi[:,np.newaxis]
hw_data = np.hstack((hw_data, mbi))
hw_data

array([[174.        ,  96.        ,  31.70828379],
       [189.        ,  87.        ,  24.35542118],
       [185.        , 110.        ,  32.14024836],
       ...,
       [141.        , 136.        ,  68.40702178],
       [150.        ,  95.        ,  42.22222222],
       [173.        , 131.        ,  43.77025627]])
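
As a possible shortcut, `np.column_stack` accepts a 1D array and treats it as a column, so the explicit `np.newaxis` step is not needed. The sketch below uses made-up rows chosen so the result is easy to check by hand; `data` and `with_mbi` are illustrative names.

```python
import numpy as np

# Made-up [height in cm, weight in kg] rows, for illustration only
data = np.array([
    [200.0, 80.0],   # 80 / 2.0**2 = 20
    [160.0, 64.0],   # 64 / 1.6**2 = 25
])

# Heights are converted from cm to m before squaring
mbi = data[:, 1] / (data[:, 0] / 100) ** 2

# column_stack turns the 1D `mbi` into a column and appends it
with_mbi = np.column_stack((data, mbi))

print(with_mbi.shape)  # (2, 3)
```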

### 10. Convert a quantitative variable to a categorical one

Now create an array `hw_mbi_labels` where you assign each record in `hw_data` to one of these categories: 
    - 'UNDERWEIGHT': MBI < 18.5
    - 'NORMAL': 18.5 <= MBI < 25 
    - 'OVERWEIGHT': 25 <= MBI < 30
    - 'OBESE': MBI >= 30
    
Then, count the number of occurrences of each category per gender.

In [77]:
# Label the record here
# Hint: write a function to assign labels to the record then use the numpy appropriate function 
# to apply the function to each row

def assign_label(row):
    bmi = row[-1]
    if bmi < 18.5:
        return 'UNDERWEIGHT'
    if bmi >= 18.5 and bmi < 25.0:
        return 'NORMAL'
    if bmi >= 25.0 and bmi < 30.0:
        return 'OVERWEIGHT'
    if bmi >= 30.0:
        return 'OBESE'

#creating an array of weight categories based on bmi/mbi
hw_bmi_labels = np.array([assign_label(row) for row in hw_data])
#print(hw_bmi_labels[0:5])

#creating a dictionary which categorizes the bmi by gender
mbi_labels_by_gender = {
    gender: hw_bmi_labels[hw_label == gender] for gender in set(hw_label)
}
#print(mbi_labels_by_gender)

#creating the count dictionary that keeps count of the occurrences for each category
cts = {}
for gender in set(hw_label):
    unique, counts = np.unique(mbi_labels_by_gender[gender], return_counts=True)
    print(unique, counts)
    cts[gender] = dict(zip(unique, counts))
cts

['NORMAL' 'OBESE' 'OVERWEIGHT' 'UNDERWEIGHT'] [ 38 167  37  13]
['NORMAL' 'OBESE' 'OVERWEIGHT' 'UNDERWEIGHT'] [ 28 165  31  21]


{np.str_('Female'): {np.str_('NORMAL'): np.int64(38),
  np.str_('OBESE'): np.int64(167),
  np.str_('OVERWEIGHT'): np.int64(37),
  np.str_('UNDERWEIGHT'): np.int64(13)},
 np.str_('Male'): {np.str_('NORMAL'): np.int64(28),
  np.str_('OBESE'): np.int64(165),
  np.str_('OVERWEIGHT'): np.int64(31),
  np.str_('UNDERWEIGHT'): np.int64(21)}}
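
The row-by-row function above can also be replaced by a vectorized lookup. One option, sketched here under the same thresholds, is `np.digitize`, which returns for each value the index of the bin it falls into. The `mbi` values are made up to hit the boundaries; `categories`, `bin_index` and `labels` are illustrative names.

```python
import numpy as np

# Made-up MBI values, including the boundary cases
mbi = np.array([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 42.2])

categories = np.array(["UNDERWEIGHT", "NORMAL", "OVERWEIGHT", "OBESE"])

# With the default right=False, index i means bins[i-1] <= x < bins[i]:
# 0 -> below 18.5, 1 -> [18.5, 25), 2 -> [25, 30), 3 -> 30 and above
bin_index = np.digitize(mbi, [18.5, 25.0, 30.0])

# Use the bin indices to look up the category names
labels = categories[bin_index]

print(bin_index)  # [0 1 1 2 2 3 3]
```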

### 11. Compute the frequencies of each category within genders

Compute the percentages of the four categories ('UNDERWEIGHT', 'NORMAL', 'OVERWEIGHT', 'OBESE'), and answer these questions:

- Which gender has the highest percentage of 'OBESE' subjects?
- Which gender has the highest percentage of subjects that are neither 'OVERWEIGHT' nor 'OBESE'?

In [76]:
# Compute the percentages here:

prc_cts = {}
for gender, counts in cts.items():
    total_count = sum(cts[gender].values())
    print('Total count for Gender {} is {}'.format(gender, total_count))
    prc_cts[gender] = {
        label: (count/total_count)*100 for label, count in cts[gender].items() 
    }
prc_cts

Total count for Gender Female is 255
Total count for Gender Male is 245


{np.str_('Female'): {np.str_('NORMAL'): np.float64(14.901960784313726),
  np.str_('OBESE'): np.float64(65.49019607843137),
  np.str_('OVERWEIGHT'): np.float64(14.50980392156863),
  np.str_('UNDERWEIGHT'): np.float64(5.098039215686274)},
 np.str_('Male'): {np.str_('NORMAL'): np.float64(11.428571428571429),
  np.str_('OBESE'): np.float64(67.3469387755102),
  np.str_('OVERWEIGHT'): np.float64(12.653061224489795),
  np.str_('UNDERWEIGHT'): np.float64(8.571428571428571)}}
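
Since `np.unique(..., return_counts=True)` already returns the counts as an array, the percentages can also be computed with array arithmetic instead of a loop over a dictionary. A minimal sketch on a made-up label array follows; `labels`, `percentages` and `result` are illustrative names.

```python
import numpy as np

# Made-up category labels for eight subjects, for illustration only
labels = np.array([
    "OBESE", "NORMAL", "OBESE", "OVERWEIGHT",
    "OBESE", "NORMAL", "OBESE", "UNDERWEIGHT",
])

unique, counts = np.unique(labels, return_counts=True)

# Dividing the counts array by its sum gives all the shares in one step
percentages = counts / counts.sum() * 100

# .tolist() converts NumPy scalars to plain Python strings and floats
result = dict(zip(unique.tolist(), percentages.tolist()))

print(result)
# {'NORMAL': 25.0, 'OBESE': 50.0, 'OVERWEIGHT': 12.5, 'UNDERWEIGHT': 12.5}
```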

### 12. Normalize an array so the values range exactly between 0 and 1

Normalization is an important pre-processing step before feeding a dataset to a data science (e.g. a machine learning) algorithm.
Create a normalized form of `hw_data`'s "height" and "weight" columns whose values range exactly between 0 and 1, so that the minimum has value 0 and the maximum has value 1.

In [82]:
# Write your solution here 

h_min, h_max = hw_data[:, 0].min(), hw_data[:, 0].max()
w_min, w_max = hw_data[:, 1].min(), hw_data[:, 1].max()

print('Max weight: {} - Min weight: {} - Max height: {} - Min height: {}'.format(
   w_max, w_min, h_max, h_min
))

normalized_h = (hw_data[:, 0] - h_min)/(h_max - h_min)
normalized_w = (hw_data[:, 1] - w_min)/(w_max - w_min)
print(normalized_h[0:5])

Max weight: 160.0 - Min weight: 50.0 - Max height: 199.0 - Min height: 140.0
[0.57627119 0.83050847 0.76271186 0.93220339 0.15254237]
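
The same min-max formula, (x - min) / (max - min), can be applied to both columns at once by taking the minimum and maximum along `axis=0` and letting broadcasting do the rest. The sketch below uses made-up rows; `sample` and `normalized` are illustrative names.

```python
import numpy as np

# Made-up [height, weight] rows, for illustration only
sample = np.array([
    [140.0, 50.0],
    [170.0, 105.0],
    [199.0, 160.0],
])

# Per-column minimum and maximum (one value per column)
col_min = sample.min(axis=0)
col_max = sample.max(axis=0)

# Broadcasting applies each column's min and max to every row
normalized = (sample - col_min) / (col_max - col_min)

print(normalized.min(axis=0))  # [0. 0.]
print(normalized.max(axis=0))  # [1. 1.]
```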


You will be able to find more exercises on numpy here: https://www.machinelearningplus.com/python/101-numpy-exercises-python/