

# Week 01 Assignment  
## Data Quality, Evaluation, Scaling, and Encoding

**Student name:** *Kahon Binte Zaman*   

This is a small assignment that connects topics from Modules 1, 2, and 3.  
You must complete it in this Colab notebook.

You will need to use concepts that appeared in the videos:
- Module 1 and 2: basic descriptive statistics, proportions, confusion matrix, accuracy, precision, recall
- Module 3: standardization, min max scaling, nominal vs ordinal, one hot encoding, ordinal encoding, Euclidean and Manhattan distance

Please do not use any extra libraries beyond `pandas` and `numpy`.



---
## 0. Setup and Dataset

We will use a dataset with the columns listed below:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [1]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [2]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1OmDDCh4MD1TtvAemnwVDyz5zwCIXJ220")

# Show first few rows
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type
0,1,43,3734.19,109,48,0,0,Medium,Suburban
1,2,49,2594.19,194,7,0,0,Low,Urban
2,3,19,3550.47,146,36,1,0,High,Rural
3,4,19,3821.18,287,14,1,0,High,Suburban
4,5,63,1750.84,66,46,0,0,Medium,Suburban



### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  
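
One way to run this check is sketched below. It uses a tiny stand-in frame built from the five rows shown by `df.head()` rather than the real file, and the helper name `missing_columns` is hypothetical (it is not part of the assignment template).

```python
import pandas as pd

# Stand-in for the real df: the five rows printed by df.head() above.
sample = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "age": [43, 49, 19, 19, 63],
    "monthly_income": [3734.19, 2594.19, 3550.47, 3821.18, 1750.84],
    "daily_screen_time_min": [109, 194, 146, 287, 66],
    "daily_app_opens": [48, 7, 36, 14, 46],
    "true_label": [0, 0, 1, 1, 0],
    "pred_label": [0, 0, 0, 0, 0],
    "satisfaction_level": ["Medium", "Low", "High", "High", "Medium"],
    "city_type": ["Suburban", "Urban", "Rural", "Suburban", "Suburban"],
})

NUMERIC = ["age", "monthly_income", "daily_screen_time_min", "daily_app_opens"]
LABELS = ["true_label", "pred_label"]
CATEGORICAL = ["satisfaction_level", "city_type"]


def missing_columns(frame, required):
    """Return the required column names that are absent from frame (hypothetical helper)."""
    return [c for c in required if c not in frame.columns]


missing = missing_columns(sample, NUMERIC + LABELS + CATEGORICAL)
numeric_ok = all(pd.api.types.is_numeric_dtype(sample[c]) for c in NUMERIC)

print("Missing columns:", missing)
print("Numeric columns have a numeric dtype:", numeric_ok)
```

In the notebook itself, the same helper could be called on `df` instead of `sample`.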



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [3]:
# Q1.1: Choose your numeric column here [Already done for you]
num_col = "daily_screen_time_min"

df[num_col].describe()

Unnamed: 0,daily_screen_time_min
count,100.0
mean,181.89
std,68.886951
min,60.0
25%,122.0
50%,178.0
75%,243.75
max,299.0



> **Q1.2 Short answer: [5 marks]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

**Answer:**
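*The column has 100 values with a mean of 181.89 minutes and a standard deviation of about 68.9. The minimum (60) and maximum (299) sit at similar distances from the mean, roughly 122 below and 117 above, so neither extreme looks like an outlier. The mean is only slightly above the median (178), which suggests the distribution is close to symmetric with at most a very mild right skew.*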
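
To judge how far the extremes sit from the mean, it can help to express the gaps in units of the standard deviation. A small sketch using the numbers copied from the `describe()` output above (plain Python, no dataset needed):

```python
# Summary values copied from the describe() output above.
mean, std = 181.89, 68.886951
minimum, median, maximum = 60.0, 178.0, 299.0

below = (mean - minimum) / std  # how far the min sits below the mean, in std units
above = (maximum - mean) / std  # how far the max sits above the mean, in std units

print(f"min is {below:.2f} std below the mean")
print(f"max is {above:.2f} std above the mean")
print(f"mean - median = {mean - median:.2f}")
```

Gaps of under two standard deviations on both sides, together with a mean that is only a few minutes away from the median, would point to no extreme outliers and at most a mild skew.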

>  
>  
>  



### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [4]:
# Q2.1: Compute proportion of positive class [Already done for you]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)

Positive count: 52
Total samples: 100
Proportion of positive class: 0.52



> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Answer: About 52% of the samples are positive (52 of 100), so the dataset is close to balanced. Neither class dominates, so there is no major class imbalance.
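
As a cross-check on the manual count, the share of each class can also be read off with `value_counts(normalize=True)`, and for 0/1 labels the mean equals the positive proportion. The sketch below uses only the five `true_label` values shown by `df.head()`, so its result differs from the full-dataset value of 0.52.

```python
import pandas as pd

# Mini-sample: the true_label values of the five rows shown by df.head().
labels = pd.Series([0, 0, 1, 1, 0], name="true_label")

shares = labels.value_counts(normalize=True).sort_index()  # share of each class
positive_share = labels.mean()  # mean of 0/1 labels equals the positive proportion

print(shares)
print("Positive share:", positive_share)
```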

>  
>  
>  



### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [5]:
# Q3.1: Manually compute TP, TN, FP, FN [Already done for you]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 28
TN: 27
FP: 21
FN: 24


In [6]:
# Q3.2: Compute accuracy, precision, recall [Already done for you]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.55
Precision: 0.5714285714285714
Recall: 0.5384615384615384



> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

Answer: The model produces 28 true positives and 27 true negatives, but also 21 false positives and 24 false negatives, so its accuracy is only 0.55, barely above the 0.52 that always predicting the positive class would reach. Precision (0.57) is slightly higher than recall (0.54), so the model is a little more careful when it predicts positive than it is thorough at finding positives. Even so, it misses 24 of the 52 actual positives, and 21 of its 49 positive predictions are wrong. Overall, the high FP and FN counts show that the model struggles both to catch real positives and to avoid false alarms.
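
The three metrics can be re-derived from the four counts alone. The sketch below copies the counts from the Q3.1 output and adds one comparison that is not part of the assignment: the accuracy of a hypothetical baseline that always predicts the majority class.

```python
# Counts copied from the Q3.1 output above.
tp, tn, fp, fn = 28, 27, 21, 24
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of positive predictions that were correct
recall = tp / (tp + fn)     # share of actual positives that were found

# Assumption added for comparison: a baseline that always predicts the majority class.
actual_pos = tp + fn
baseline_accuracy = max(actual_pos, total - actual_pos) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"majority-class baseline accuracy={baseline_accuracy:.2f}")
```

Comparing accuracy against such a baseline gives a quick sense of how much the model adds beyond simply guessing the more common class.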

>  
>  
>  



---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling

Use one numeric column, `monthly_income`.


In [7]:
# Q4.1: Choose the numeric column [2 marks]
numeric_col="monthly_income"
df[numeric_col].head()

Unnamed: 0,monthly_income
0,3734.19
1,2594.19
2,3550.47
3,3821.18
4,1750.84


In [8]:
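# Q4.2: Standardization with z-score [10 marks]

m = df["monthly_income"].mean()   # column mean
s = df["monthly_income"].std()    # sample standard deviation (pandas default, ddof=1)
print(round(m, 2))
print(round(s, 2))
c = df["monthly_income"] - m      # center each value on the mean
z = c / s                         # scale to unit standard deviation
df["income_z"] = z
df[["monthly_income", "income_z"]].head()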

2885.75
898.12


Unnamed: 0,monthly_income,income_z
0,3734.19,0.944685
1,2594.19,-0.324626
2,3550.47,0.740126
3,3821.18,1.041542
4,1750.84,-1.263639


In [9]:
# Q4.3: Min max scaling implementation [10 marks]

mn = df["monthly_income"].min()
mx = df["monthly_income"].max()
rg = mx - mn                      # range of the column
ss = df["monthly_income"] - mn    # shift so the minimum becomes 0
mm = ss / rg                      # divide by the range so the maximum becomes 1
df["income_mm"] = mm
df[["monthly_income", "income_mm"]].head()

Unnamed: 0,monthly_income,income_mm
0,3734.19,0.675209
1,2594.19,0.393685
2,3550.47,0.629839
3,3821.18,0.696691
4,1750.84,0.18542



> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

Answer: Standardization converts `monthly_income` into z-scores with mean 0 and standard deviation 1, so values below the average become negative and values above it become positive, with no fixed bounds. Min-max scaling maps the incomes onto the range 0 to 1, so the smallest income becomes 0 and the largest becomes 1. Both are linear rescalings that keep the order and shape of the data; they differ only in which reference points (mean and standard deviation versus minimum and maximum) define the new scale.
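
The properties of the two scalings can be checked on a small example. The sketch below uses only the five `monthly_income` values shown by `df.head()`, so the scaled numbers differ from the notebook's `income_z` and `income_mm` columns, which are computed from all 100 rows.

```python
import numpy as np

# The five monthly_income values shown by df.head(); statistics here are
# computed from these five rows only, not from the full dataset.
income = np.array([3734.19, 2594.19, 3550.47, 3821.18, 1750.84])

# ddof=1 matches the default of pandas' Series.std()
z = (income - income.mean()) / income.std(ddof=1)
mm = (income - income.min()) / (income.max() - income.min())

print("z-scores:", np.round(z, 3))
print("min-max :", np.round(mm, 3))
print("z mean close to 0:", np.isclose(z.mean(), 0.0))
print("z std close to 1 :", np.isclose(z.std(ddof=1), 1.0))
print("min-max range    :", mm.min(), "to", mm.max())
```

Both results keep the original ordering of the incomes, which is one way to see that each scaling is a linear transformation of the same data.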

>  
>  
>  



### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [10]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]

city_ohe = pd.get_dummies(df["city_type"], prefix="city", dtype=int)
city_ohe


Unnamed: 0,city_Rural,city_Suburban,city_Urban
0,0,1,0
1,0,0,1
2,1,0,0
3,0,1,0
4,0,1,0
...,...,...,...
95,0,0,1
96,0,0,1
97,1,0,0
98,0,0,1


In [11]:
# Q5.2: Attach one hot encoded columns to df [5 marks]
df = pd.concat([df, city_ohe], axis=1)
df


Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type,income_z,income_mm,city_Rural,city_Suburban,city_Urban
0,1,43,3734.19,109,48,0,0,Medium,Suburban,0.944685,0.675209,0,1,0
1,2,49,2594.19,194,7,0,0,Low,Urban,-0.324626,0.393685,0,0,1
2,3,19,3550.47,146,36,1,0,High,Rural,0.740126,0.629839,1,0,0
3,4,19,3821.18,287,14,1,0,High,Suburban,1.041542,0.696691,0,1,0
4,5,63,1750.84,66,46,0,0,Medium,Suburban,-1.263639,0.185420,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,96,20,3975.90,259,15,0,1,Medium,Urban,1.213813,0.734899,0,0,1
96,97,52,3764.77,295,16,1,0,Low,Urban,0.978734,0.682760,0,0,1
97,98,35,2336.44,108,46,0,1,High,Rural,-0.611613,0.330034,1,0,0
98,99,18,3007.53,202,42,1,1,High,Urban,0.135599,0.495760,0,0,1


In [12]:
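# Q5.3: Ordinal encoding for satisfaction_level [10 marks]
# Note: this overwrites the text labels in place, so running the cell a second
# time maps the integers through `order` again and yields NaN.
order = {"Low": 0, "Medium": 1, "High": 2}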
df['satisfaction_level'] = df['satisfaction_level'].map(order)
df['satisfaction_level']

Unnamed: 0,satisfaction_level
0,1
1,0
2,2
3,2
4,1
...,...
95,1
96,0
97,2
98,2



> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Answer: `city_type` is nominal: Urban, Suburban, and Rural have no natural order, so one-hot encoding gives each category its own 0/1 column and prevents the model from assuming any numerical relationship between them. For `satisfaction_level`, ordinal encoding is appropriate because the categories have a meaningful order (Low < Medium < High), which the integer codes 0, 1, 2 preserve.
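
The contrast between the two encodings can be sketched on a mini-sample taken from the five rows shown by `df.head()`. As an alternative to the dictionary `map` used in Q5.3, this version uses an ordered `pd.Categorical`; the column name `satisfaction_ord` is an assumption chosen here so that the original text labels are kept.

```python
import pandas as pd

# Mini-sample taken from the five rows shown by df.head().
sample = pd.DataFrame({
    "city_type": ["Suburban", "Urban", "Rural", "Suburban", "Suburban"],
    "satisfaction_level": ["Medium", "Low", "High", "High", "Medium"],
})

# Nominal feature: one indicator column per category, no implied order.
city_ohe = pd.get_dummies(sample["city_type"], prefix="city", dtype=int)

# Ordinal feature: integer codes that follow the declared order Low < Medium < High.
levels = pd.Categorical(
    sample["satisfaction_level"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)
sample["satisfaction_ord"] = levels.codes  # a label outside the list would get code -1

print(city_ohe)
print(sample[["satisfaction_level", "satisfaction_ord"]])
```

Writing the codes to a separate column means the cell can be re-run without losing the original labels.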

>  
>  
>  



---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
- `income_z`, the standardized income from Q4.2 (the provided cell below uses the raw `monthly_income` instead)  
- `daily_app_opens` (or its min max scaled version if you prefer)


In [13]:
# Q6.1: Build 2D vectors for first two users [Already done for you]
vec_cols = ["monthly_income", "daily_app_opens"]

v1 = df.loc[0, vec_cols].values
v2 = df.loc[1, vec_cols].values

print("v1:", v1)
print("v2:", v2)

v1: [np.float64(3734.19) np.int64(48)]
v2: [np.float64(2594.19) np.int64(7)]


In [14]:
# Q6.2: Euclidean distance computation [10 marks]
eu = np.linalg.norm(v1 - v2)
print(np.round(eu, 2))

1140.74


In [15]:
# Q6.3: Manhattan distance computation [10 marks]

ma = np.linalg.norm(v1 - v2, ord=1)
print(np.round(ma, 3))

1181.0



> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

Answer: In my result the Manhattan distance (1181.0) is larger than the Euclidean distance (1140.74). Euclidean distance is the straight-line distance: it squares the coordinate differences, sums them, and takes the square root. Manhattan distance sums the absolute coordinate differences directly. Because (|a| + |b|)² = a² + b² + 2|a||b| ≥ a² + b², the Manhattan distance is always at least as large as the Euclidean distance, and the two are equal only when the points differ in at most one coordinate.
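
Both distances can be recomputed by hand from the two vectors printed in Q6.1. The sketch below uses plain Python and also reports, as an extra illustration not asked for in the assignment, how much of the squared Euclidean distance comes from the income coordinate when the raw, unscaled values are used.

```python
import math

# Coordinates copied from the Q6.1 output above (raw, unscaled values).
v1 = (3734.19, 48)
v2 = (2594.19, 7)

diffs = [abs(a - b) for a, b in zip(v1, v2)]      # per-coordinate absolute differences
manhattan = sum(diffs)                            # sum of absolute differences
euclidean = math.sqrt(sum(d * d for d in diffs))  # root of the summed squares

# Share of the squared Euclidean distance contributed by the income coordinate.
income_share = diffs[0] ** 2 / sum(d * d for d in diffs)

print(f"Manhattan={manhattan:.2f} Euclidean={euclidean:.2f}")
print(f"income share of squared distance={income_share:.4f}")
```

With unscaled inputs the income difference accounts for nearly all of the distance, which is one reason the question suggests building the vectors from scaled features.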

>  
>  
>  



---
## Final Reflection [5 marks]

> In 4 to 6 sentences, describe how the three modules connect in this assignment.  
> Mention:
> - One idea from Module 1 or 2 that you used  
> - One idea from Module 3 that you used  
> - How these ideas together help you understand a dataset more deeply

Answer: In Module 1, I learned how data can be distributed, such as bell-shaped or skewed, and how measures like the mean, median, percentiles, IQR, and z-scores describe center and spread, which I applied in Q1.

From Module 2, I worked with sensitivity, specificity, false positives, false negatives, precision, and recall in Q2 and Q3. I realized that I used to focus only on accuracy, but a model's performance depends on deeper metrics such as recall, precision, PPV, and NPV.

Module 3 taught me why different scaling methods matter, such as when to use standardization versus min-max scaling, and how to properly encode nominal and ordinal features, which I applied in Q4 and Q5. I also used Module 3 concepts in Q6 to compare users using Euclidean and Manhattan distances. Now, when I get a dataset, I will first look at how the data is distributed, then decide how to scale the numeric features, which encoding suits each nominal or ordinal feature, and which extra or irrelevant data to remove.

>  
>  
>  



## End of Assignment

Before submitting:
- Run all cells from top to bottom.  
- Check that all answer sections are filled.  
- Download this notebook as `.ipynb` and upload it according to the given instructions.
- ***You must read the Assignment Module text instructions in full; they explain how to submit this assignment.***
