# ITEC301: Math in Data Preprocessing-Seatwork #1

### Notebook by: Christian Vincent D. Cabral

In [1]:
import numpy as np
import pandas as pd

# Seatwork Problem 1

A group of researchers conducted a survey on the ages of people in the Town of Naga. They did this by simply talking to passersby on Naga Plaza Rizal. From their simple survey they gathered:

**Age: [13, 15, 16, 16, 19, 20, 21, 22, 25, 25, 25, 25, 30, 33, 35,
35, 35, 35, 36, 45, 46, 52, 35, 38, 30]**

```
(a) Use the binning method to smooth the above data. Assume that the bin
size is 5
```

In [859]:
age_group = np.array([13, 15, 16, 16, 19, 20, 21, 22, 25, 25, 25, 25, 30, 33, 35,
35, 35, 35, 36, 45, 46, 52, 35, 38, 30])
age_group.sort()
# split the sorted array into equal-frequency bins of 5 values each
# (integer division keeps the number of sections an int)
bins = np.split(age_group, len(age_group) // 5)

### Bin Frequency

In [854]:
# equal frequency binning
bins

[array([13, 15, 16, 16, 19]),
 array([20, 21, 22, 25, 25]),
 array([25, 25, 30, 30, 33]),
 array([35, 35, 35, 35, 35]),
 array([36, 38, 45, 46, 52])]

### To smooth by bin means, replace each value with the mean of its bin.

In [856]:
# compute the mean
def mean(array):
    # use `total` rather than `sum` to avoid shadowing the built-in sum()
    total = 0
    # summation of values
    for value in array:
        total += value
    # divide the sum by the length to get the average/mean
    return total / len(array)

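In [857]:
# build a separate list of mean-smoothed bins so the integer bins stay intact
# (filling an integer array in place truncates the mean and overwrites age_group)
mean_bins = []
for i in range(len(bins)):
    # get the mean for each bin
    result = mean(bins[i])
    # replace each value with the bin mean, stored as a float
    mean_bins.append(np.full(len(bins[i]), result))

In [858]:
mean_bins

[array([15.8, 15.8, 15.8, 15.8, 15.8]),
 array([22.6, 22.6, 22.6, 22.6, 22.6]),
 array([28.6, 28.6, 28.6, 28.6, 28.6]),
 array([35., 35., 35., 35., 35.]),
 array([43.4, 43.4, 43.4, 43.4, 43.4])]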

### To smooth by bin boundaries, replace each value with the bin's minimum or maximum, whichever is closer (ties go to the minimum).

In [860]:
def bin_boundary(array):
    # get minimum
    minimum_value = min(array)
    # get maximum
    maximum_value = max(array)
    
    replaced_array = []
    for i in range(len(array)):
        # compute for distance
        distance_to_min = abs(array[i] - minimum_value)
        distance_to_max = abs(array[i] - maximum_value)
        # replace value by whichever it is closest to
        replaced_array.append(minimum_value if distance_to_min <= distance_to_max else maximum_value)

    return replaced_array

In [861]:
# loop through the bins
for i in range(len(bins)):
    replaced_array = bin_boundary(bins[i])
    bins[i] = replaced_array

In [862]:
bins

[[13, 13, 13, 13, 19],
 [20, 20, 20, 25, 25],
 [25, 25, 33, 33, 33],
 [35, 35, 35, 35, 35],
 [36, 36, 52, 52, 52]]
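The two smoothing steps above can also be written without explicit loops. What follows is a minimal sketch, assuming the number of values divides evenly by the bin size; the names `equal_frequency_bins`, `smooth_by_means`, and `smooth_by_boundaries` are illustrative and not part of the seatwork. Bin means are kept as floats, and ties in the boundary step go to the minimum, as in `bin_boundary()`.

```python
import numpy as np

ages = np.array([13, 15, 16, 16, 19, 20, 21, 22, 25, 25, 25, 25, 30, 33, 35,
                 35, 35, 35, 36, 45, 46, 52, 35, 38, 30])

def equal_frequency_bins(values, bin_size):
    """Sort a copy of the values and reshape it into rows of bin_size (assumes an exact fit)."""
    return np.sort(values).reshape(-1, bin_size)

def smooth_by_means(bins):
    """Replace every value with the float mean of its bin."""
    means = bins.mean(axis=1, keepdims=True)
    return np.broadcast_to(means, bins.shape).copy()

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min or max (ties go to the min)."""
    lo = bins.min(axis=1, keepdims=True)
    hi = bins.max(axis=1, keepdims=True)
    return np.where(bins - lo <= hi - bins, lo, hi)

age_bins = equal_frequency_bins(ages, 5)
mean_smoothed = smooth_by_means(age_bins)
boundary_smoothed = smooth_by_boundaries(age_bins)
print(mean_smoothed)
print(boundary_smoothed)
```

Because neither function modifies its input, both smoothings can be computed from the same `age_bins` in any order.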

# Seatwork Problem 2

Suppose a group of 12 sales price records has been sorted as follows:

**Sales = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]**

```
Partition them into three bins and smooth the data into the following:
    - Smoothing by Equal Frequency Bins
    - Smoothing by bin means
    - Smoothing by bin boundary
```

First, sort the records and partition them into three bins.

In [863]:
Sales = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])
Sales.sort()
bins = np.split(Sales, 3)

### Equal Frequency Smoothing

In [864]:
bins

[array([ 5, 10, 11, 13]), array([15, 35, 50, 55]), array([ 72,  92, 204, 215])]

Now, to smooth the data by bin means, I can reuse the ***mean()*** function from the previous problem.

In [865]:
# build a separate list so the original bins remain available for boundary smoothing
mean_bins = []
for i in range(len(bins)):
    # get the mean for each bin
    result = mean(bins[i])
    # replace each value with the bin mean, stored as a float
    mean_bins.append(np.full(len(bins[i]), result))

### Bin Means Smoothing

In [866]:
mean_bins

[array([9.75, 9.75, 9.75, 9.75]), array([38.75, 38.75, 38.75, 38.75]), array([145.75, 145.75, 145.75, 145.75])]

Finally, for smoothing by bin boundaries, I call the same ***bin_boundary()*** function.

In [646]:
# loop through the bins
for i in range(len(bins)):
    replaced_array = bin_boundary(bins[i])
    bins[i] = replaced_array

### Bin Boundary Smoothing

In [647]:
bins

[[5, 13, 13, 13], [15, 15, 55, 55], [72, 72, 215, 215]]
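As a cross-check on the partitioning and the bin means, the same result can be sketched with pandas. This assumes `pd.qcut`, which bins by quantile edges; that coincides with an equal-count split here because all twelve values are distinct, but repeated values near a cut point could give uneven bins. The names `sales`, `bin_codes`, and `sales_by_mean` are illustrative.

```python
import pandas as pd

sales = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# Assign each record to one of three quantile-based bins (codes 0, 1, 2).
bin_codes = pd.qcut(sales, q=3, labels=False)

# Smooth by bin means: every record takes the mean of its bin.
sales_by_mean = sales.groupby(bin_codes).transform("mean")

print(bin_codes.tolist())
print(sales_by_mean.tolist())
```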

# Seatwork Problem 3

Use the methods below to normalize the following group of data: **[50, 100, 150, 200, 250]**

```
(a) min-max normalization 
(b) z-score normalization
(c) z-score normalization using the mean absolute deviation instead of
standard deviation
```

In [735]:
data = np.array([50, 100, 150, 200, 250])

### Min-Max Normalization

For the ***min-max normalization***, the formula goes:

```
(value - min) / (max - min)
```

I implemented it in *Python* as follows:

In [736]:
def min_max_normalization(array):
    # get minimum
    minimum_value = min(array)
    # get maximum
    maximum_value = max(array)
    # get the  (max - min)
    array_range = maximum_value - minimum_value
    
    replaced_array = []
    for i in range(len(array)):
        # (value - min)
        current = array[i] - minimum_value
        # (value - min) / (max - min)
        normalized_value = current / array_range
        replaced_array.append(normalized_value)
    
    return replaced_array

In [737]:
normalized_data = np.column_stack([np.array(data), np.array(min_max_normalization(data))])

In [738]:
pd.DataFrame(normalized_data, columns=['Data', 'Normalized Data'])

| | Data | Normalized Data |
|---|---|---|
| 0 | 50.0 | 0.0 |
| 1 | 100.0 | 0.25 |
| 2 | 150.0 | 0.5 |
| 3 | 200.0 | 0.75 |
| 4 | 250.0 | 1.0 |


That is the normalized data after applying ***min-max normalization***.
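The same formula extends to any target range by rescaling the [0, 1] result. The following is a minimal vectorized sketch under that assumption; `min_max_scale`, `new_min`, and `new_max` are hypothetical names, and the guard against a zero range is an addition the loop version above does not have.

```python
import numpy as np

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    if old_max == old_min:
        # (max - min) would be zero, so the formula is undefined
        raise ValueError("min-max normalization is undefined when all values are equal")
    return (values - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

scaled = min_max_scale([50, 100, 150, 200, 250])
rescaled = min_max_scale([50, 100, 150, 200, 250], new_min=-1.0, new_max=1.0)
print(scaled)
print(rescaled)
```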

### Z-score Scaling

For the ***Z-score Scaling***, the formula goes:

```
(x - mean) / standard_deviation
```

To calculate this, I first need to compute the standard deviation of the data.

In [739]:
# get mean first
mean_value = mean(data)

In [740]:
# take in an array of values
def standard_deviation(array):
    array = np.asarray(array, dtype=float)
    # subtract the array's own mean from each value and square the result
    squared_deviations = (array - mean(array)) ** 2
    # population standard deviation: divide the sum by the number of values, then take the square root
    return np.sqrt(squared_deviations.sum() / len(array))

In [741]:
standard_deviation(data).round(2)

70.71

Now that I have the standard deviation, I can proceed with the scaling.

In [742]:
def scale_by_zscore(array):
    scaled_array = []
    # use the argument's own mean and standard deviation, not the global `data`
    array_mean = mean(array)
    std = standard_deviation(array)
    for i in range(len(array)):
        # (x - mean)
        current_value = array[i] - array_mean
        # (x - mean) / standard_deviation
        scaled_value = current_value / std
        scaled_array.append(scaled_value)

    return scaled_array

In [745]:
normalized_data = np.column_stack([np.array(data), np.array(scale_by_zscore(data)).round(2)])

In [746]:
pd.DataFrame(np.array(normalized_data).round(3), columns=['Data', 'Normalized Data']) 

| | Data | Normalized Data |
|---|---|---|
| 0 | 50.0 | -1.41 |
| 1 | 100.0 | -0.71 |
| 2 | 150.0 | 0.0 |
| 3 | 200.0 | 0.71 |
| 4 | 250.0 | 1.41 |


That is the normalized data after applying ***Z-score scaling***.
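The result depends on which standard deviation is used. The sketch below assumes NumPy's built-in `std`, whose default `ddof=0` divides by N (the population formula used in this notebook), and shows the sample version (`ddof=1`, dividing by N - 1) only for comparison. The variable names are illustrative.

```python
import numpy as np

values = np.array([50, 100, 150, 200, 250], dtype=float)

# Population standard deviation (divide by N); this is NumPy's default, ddof=0.
z_population = (values - values.mean()) / values.std(ddof=0)

# Sample standard deviation (divide by N - 1), shown only for comparison.
z_sample = (values - values.mean()) / values.std(ddof=1)

print(z_population.round(2))
print(z_sample.round(2))
```

With only five values the two versions differ noticeably, so it is worth stating which one a reported z-score uses.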

### Z-score normalization with Mean Absolute Deviation

For the ***Z-score normalization using Mean Absolute Deviation***, the formula goes:

```
(x - mean) / Mean Absolute Deviation
```

First, I'll have to compute the mean absolute deviation.

In [747]:
def mean_absolute_deviation(array):
    np_array = np.array(array)
    # average absolute distance of each value from the array's own mean
    return (abs(np_array - mean(np_array))).sum() / len(np_array)

In [748]:
mean_absolute_deviation(data).round(2)

60.0
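As a hand check on this value, the mean absolute deviation can be worked out directly for this data, whose mean is 150:

$$
\text{MAD} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert
= \frac{\lvert 50-150 \rvert + \lvert 100-150 \rvert + \lvert 150-150 \rvert + \lvert 200-150 \rvert + \lvert 250-150 \rvert}{5}
= \frac{300}{5} = 60
$$

For example, the value 50 then scales to (50 - 150) / 60, which is approximately -1.67.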

Now, I am able to perform the calculations for ***Z-score normalization*** with **MAD**.

In [749]:
def zscore_with_mad(array):
    scaled_array = []
    # use the argument's own mean and MAD, not the global `data`
    array_mean = mean(array)
    MAD = mean_absolute_deviation(array)
    for i in range(len(array)):
        # (x - mean)
        current_value = array[i] - array_mean
        # (x - mean) / MAD
        scaled_value = current_value / MAD
        scaled_array.append(scaled_value)

    return scaled_array

In [750]:
normalized_data = np.column_stack([np.array(data), np.array(zscore_with_mad(data)).round(2)])

In [751]:
pd.DataFrame(normalized_data, columns=['Data', 'Normalized Data'])

| | Data | Normalized Data |
|---|---|---|
| 0 | 50.0 | -1.67 |
| 1 | 100.0 | -0.83 |
| 2 | 150.0 | 0.0 |
| 3 | 200.0 | 0.83 |
| 4 | 250.0 | 1.67 |


That is the normalized data after applying ***Z-score normalization using mean absolute deviation***.

# Seatwork Problem 4

Use the methods below to normalize the following group of data: **[500, 600, 700, 900, 1000]**

```
(a) min-max normalization 
(b) z-score normalization
(c) z-score normalization using the mean absolute deviation instead of
standard deviation
```

I can just use the functions I created for *Problem 3*.

In [758]:
data = np.array([500, 600, 700, 900, 1000])

In [768]:
# overwrite previous mean_value with current data
mean_value = mean(data)

### Min-Max normalization

In [769]:
normalized_data = np.column_stack([np.array(data), np.array(min_max_normalization(data))])

In [770]:
pd.DataFrame(normalized_data, columns=['Data', 'Normalized Data'])

| | Data | Normalized Data |
|---|---|---|
| 0 | 500.0 | 0.0 |
| 1 | 600.0 | 0.2 |
| 2 | 700.0 | 0.4 |
| 3 | 900.0 | 0.8 |
| 4 | 1000.0 | 1.0 |


### Z-score Scaling

In [771]:
normalized_data = np.column_stack([np.array(data), np.array(scale_by_zscore(data)).round(2)])

In [772]:
pd.DataFrame(normalized_data, columns=['Data', 'Normalized Data'])

| | Data | Normalized Data |
|---|---|---|
| 0 | 500.0 | -1.29 |
| 1 | 600.0 | -0.75 |
| 2 | 700.0 | -0.22 |
| 3 | 900.0 | 0.86 |
| 4 | 1000.0 | 1.4 |


### Z-score normalization using MAD

In [774]:
normalized_data = np.column_stack([np.array(data), np.array(zscore_with_mad(data)).round(2)])

In [775]:
pd.DataFrame(normalized_data, columns=['Data', 'Normalized Data'])

| | Data | Normalized Data |
|---|---|---|
| 0 | 500.0 | -1.43 |
| 1 | 600.0 | -0.83 |
| 2 | 700.0 | -0.24 |
| 3 | 900.0 | 0.95 |
| 4 | 1000.0 | 1.55 |


# Seatwork Problem 5

Using the data for age given in Problem 1, answer the following:

```
(a) Use min-max normalization to transform the value 35 for age onto
the range [0.0, 1.0]. 
(b) Use z-score normalization to transform the value 35 for age, where
the standard deviation of age is 12.94 years.
```


To do problem (a), simply apply the ***min_max_normalization()*** function to the *age_group* array.

In [778]:
normalized_data = np.column_stack([np.array(age_group), np.array(min_max_normalization(age_group)).round(2)])

In [779]:
df = pd.DataFrame(normalized_data, columns=['Age_Group', 'Normalized Data'])
df[df['Age_Group'] == 35].head(1) # show the value 35 transformed to the range [0.0, 1.0]

| | Age_Group | Normalized Data |
|---|---|---|
| 15 | 35.0 | 0.56 |


To do problem (b), first get the mean of the *age_group* array, then apply the formula for z-score normalization.

In [780]:
mean_value = mean(age_group)

In [781]:
Age = 35
std = 12.94

In [782]:
current_value = Age - mean_value
normalized_data = current_value / std

### The normalized value of 35 with a standard deviation of 12.94 is:

In [783]:
normalized_data.round(2)

0.46
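Both parts of this problem can also be computed for the single value directly, without normalizing the whole array. The sketch below is a hedged restatement using illustrative names (`ages`, `given_std`, `z_from_data`). It also computes the z-score from the population standard deviation of these 25 values, purely for comparison: that figure works out to roughly 10.1, which differs from the 12.94 stated in the problem, so the two z-scores do not agree.

```python
import numpy as np

ages = np.array([13, 15, 16, 16, 19, 20, 21, 22, 25, 25, 25, 25, 30, 33, 35,
                 35, 35, 35, 36, 45, 46, 52, 35, 38, 30], dtype=float)
value = 35.0

# (a) min-max normalization of a single value onto [0.0, 1.0]
min_max_35 = (value - ages.min()) / (ages.max() - ages.min())

# (b) z-score using the standard deviation stated in the problem
given_std = 12.94
z_given = (value - ages.mean()) / given_std

# For comparison only: z-score using the population standard deviation of this data
z_from_data = (value - ages.mean()) / ages.std()

print(round(min_max_35, 2), round(z_given, 2), round(z_from_data, 2))
```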

# Seatwork Problem 6

Using Linear Regression by Least Squares, obtain the following from the given sample data:

```
X_age = [25, 30, 40, 35, 22, 28, 45, 33, 27, 38]
Y_weight = [55, 70, 75, 80, 60, 50, 85, 68, 72, 77]
```

1. Use the method of Least Squares to find the equation for predicting weight given the age.
2. Predict the weight of a person of age 36.

In **Linear Regression by Least Squares**, the equation looks like:

```
y = a + bx
```

To find **b**,

```
b = Σ[(x - x_mean) * (y - y_mean)] / Σ[(x - x_mean)^2]
```

In [784]:
X_age = np.array([25, 30, 40, 35, 22, 28, 45, 33, 27, 38])
Y_weight = np.array([55, 70, 75, 80, 60, 50, 85, 68, 72, 77])

In [785]:
# get mean for the 2 variables (X, y)
mean_x_age = mean(X_age)
mean_Y_weight = mean(Y_weight)

In [786]:
mean_x_age, mean_Y_weight

(32.3, 69.2)

Now, subtract the mean of X from each X value and the mean of y from each y value.

In [787]:
X = X_age - mean_x_age
y = Y_weight - mean_Y_weight

In [788]:
pd.DataFrame(np.column_stack([X_age, X]), columns=['X_age', 'x - x_mean'])

| | X_age | x - x_mean |
|---|---|---|
| 0 | 25.0 | -7.3 |
| 1 | 30.0 | -2.3 |
| 2 | 40.0 | 7.7 |
| 3 | 35.0 | 2.7 |
| 4 | 22.0 | -10.3 |
| 5 | 28.0 | -4.3 |
| 6 | 45.0 | 12.7 |
| 7 | 33.0 | 0.7 |
| 8 | 27.0 | -5.3 |
| 9 | 38.0 | 5.7 |


In [789]:
pd.DataFrame(np.column_stack([Y_weight, y]), columns=['Y_weight', 'y - y_mean'])

| | Y_weight | y - y_mean |
|---|---|---|
| 0 | 55.0 | -14.2 |
| 1 | 70.0 | 0.8 |
| 2 | 75.0 | 5.8 |
| 3 | 80.0 | 10.8 |
| 4 | 60.0 | -9.2 |
| 5 | 50.0 | -19.2 |
| 6 | 85.0 | 15.8 |
| 7 | 68.0 | -1.2 |
| 8 | 72.0 | 2.8 |
| 9 | 77.0 | 7.8 |


I can now multiply *(x - x_mean)* and *(y - y_mean)*.

In [790]:
# multiply Xi and Yi
Xy = X * y
Xy

array([103.66,  -1.84,  44.66,  29.16,  94.76,  82.56, 200.66,  -0.84,
       -14.84,  44.46])

Next, get the sum of the result.

In [791]:
Xy_sum = Xy.sum()
Xy.sum()

582.4

Then, get the denominator, **Σ(x - x_mean)^2**.

In [792]:
X_squared = np.square(X)
squared_sum = X_squared.sum()
np.square(X).sum()

472.09999999999997

Finally, I can perform the division part now and get **b**.

In [793]:
b = Xy_sum / squared_sum

In [794]:
b.round(2)

1.23

The implementation as a function would be:

In [795]:
def find_b(XX, yy):
    # get mean
    mean_xx_age = mean(XX)
    mean_yy_weight = mean(yy)

    # subtract mean
    XX_sub = XX - mean_xx_age
    yy_sub = yy - mean_yy_weight

    # multiply the result from subtraction
    XXyy = XX_sub * yy_sub

    # get sum
    XXyy_sum = XXyy.sum()

    # get bottom part of formula
    XX_squared = np.square(XX_sub)
    squared_sum = XX_squared.sum()
    
    # return b
    return XXyy_sum / squared_sum

In [796]:
find_b(X_age, Y_weight).round(2)

1.23

Since **b** is taken care of, it's time to find **a**.

The formula for a is:

```
a = y_mean - b * x_mean
```
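Substituting the sums and means computed above gives a quick hand check of both coefficients (intermediate values rounded to four decimal places):

$$
b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{582.4}{472.1} \approx 1.2336,
\qquad
a = \bar{y} - b\,\bar{x} \approx 69.2 - 1.2336 \times 32.3 \approx 29.35
$$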

Using the declared variables above, it would be:

In [797]:
a = mean_Y_weight - b * mean_x_age

In [798]:
a.round(2)

29.35

In [799]:
def find_a(y_mean, x_mean, b):
    return y_mean - b * x_mean

In [800]:
find_a(mean_Y_weight, mean_x_age, b).round(2)

29.35

Having both **a** and **b**, I can now perform predictions for values ***x***.

```
y = a + bx
```

### The predicted weight for a person of age 36 according to the linear regression model

In [801]:
weight_prediction = a + b * 36

In [868]:
print(weight_prediction.round(2), 'kg')

73.76 kg
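As a cross-check on the hand-rolled functions, the same line can be fitted with `np.polyfit`, since a degree-1 polynomial fit is a least-squares line. This is a sketch for verification only, not the method the seatwork asks for; the variable names are illustrative.

```python
import numpy as np

x_age = np.array([25, 30, 40, 35, 22, 28, 45, 33, 27, 38])
y_weight = np.array([55, 70, 75, 80, 60, 50, 85, 68, 72, 77])

# Coefficients come back highest power first: slope (b), then intercept (a).
slope, intercept = np.polyfit(x_age, y_weight, deg=1)

predicted_weight = intercept + slope * 36
print(round(slope, 2), round(intercept, 2), round(predicted_weight, 2))
```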


# Seatwork Problem 7

Using Linear Regression by Least Squares, obtain the following from the given sample data:

```
X_midterm_score = [70, 80, 85, 90, 60, 75, 85, 50]
y_final_grade = [82, 88, 92, 96, 68, 78, 90, 58]
```

1. Use the method of Least Squares to find the equation for predicting the final grade given the midterm score.
2. Predict the final grade of a student who had a midterm score of 95.

Knowing the formula

```
y = a + bx
```

I just need the ***find_b()*** and ***find_a()*** functions.

In [803]:
X_midterm_score = np.array([70, 80, 85, 90, 60, 75, 85, 50])
y_final_grade = np.array([82, 88, 92, 96, 68, 78, 90, 58])

In [804]:
b = find_b(X_midterm_score, y_final_grade)
b.round(2)

0.93

In [805]:
a = find_a(mean(y_final_grade), mean(X_midterm_score), b)
a.round(2)

12.43

### Now, I am able to predict the final grade of a student with a 95 midterm score

In [806]:
grade_prediction = a + b * (95)

In [807]:
# use min(grade_prediction, 100) instead if the maximum possible grade is 100
grade_prediction.round(2)

100.65
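The raw prediction exceeds 100 because the line is being extrapolated beyond the highest observed midterm score of 90. A minimal sketch of capping the result follows, assuming the maximum grade is 100 (the problem does not state this). It computes the slope as covariance over variance, an equivalent form of the least-squares formula; `predict_final` and `cap` are hypothetical names.

```python
import numpy as np

midterm = np.array([70, 80, 85, 90, 60, 75, 85, 50], dtype=float)
final = np.array([82, 88, 92, 96, 68, 78, 90, 58], dtype=float)

# Slope as covariance over variance; the same ddof on both sides cancels out.
slope = np.cov(midterm, final, ddof=1)[0, 1] / np.var(midterm, ddof=1)
intercept = final.mean() - slope * midterm.mean()

def predict_final(score, cap=100.0):
    """Predict a final grade and clamp it to an assumed maximum of `cap`."""
    return min(intercept + slope * score, cap)

raw_prediction = intercept + slope * 95
print(round(raw_prediction, 2), predict_final(95))
```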