

# Week 01 Assignment  
## Data Quality, Evaluation, Scaling, and Encoding

**Student name: Md. Mobarok Sarker Siam**   

This is a small assignment that connects topics from Modules 1, 2, and 3.  
You must complete it in this Colab notebook.

You will need to use concepts that appeared in the videos:
- Module 1 and 2: basic descriptive statistics, proportions, confusion matrix, accuracy, precision, recall
- Module 3: standardization, min max scaling, nominal vs ordinal, one hot encoding, ordinal encoding, Euclidean and Manhattan distance

Please do not use any libraries other than `pandas` and `numpy`.



---
## 0. Setup and Dataset

We will use a dataset that should have the columns listed below:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [None]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [None]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1OmDDCh4MD1TtvAemnwVDyz5zwCIXJ220")

# Show first few rows
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type
0,1,43,3734.19,109,48,0,0,Medium,Suburban
1,2,49,2594.19,194,7,0,0,Low,Urban
2,3,19,3550.47,146,36,1,0,High,Rural
3,4,19,3821.18,287,14,1,0,High,Suburban
4,5,63,1750.84,66,46,0,0,Medium,Suburban



### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  
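
One way to run this check is sketched below on a small stand-in frame built from the five rows that `df.head()` prints above. The names `sample` and `required_cols` are illustrative and not part of the assignment; in the notebook the same checks would run on `df` itself.

```python
import pandas as pd

# Stand-in for the notebook's `df`: only the five rows shown by df.head() above.
sample = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "age": [43, 49, 19, 19, 63],
    "monthly_income": [3734.19, 2594.19, 3550.47, 3821.18, 1750.84],
    "daily_screen_time_min": [109, 194, 146, 287, 66],
    "daily_app_opens": [48, 7, 36, 14, 46],
    "true_label": [0, 0, 1, 1, 0],
    "pred_label": [0, 0, 0, 0, 0],
    "satisfaction_level": ["Medium", "Low", "High", "High", "Medium"],
    "city_type": ["Suburban", "Urban", "Rural", "Suburban", "Suburban"],
})

# Columns the assignment says must be present.
required_cols = [
    "age", "monthly_income", "daily_screen_time_min", "daily_app_opens",
    "true_label", "pred_label", "satisfaction_level", "city_type",
]

missing = [c for c in required_cols if c not in sample.columns]
print("Missing columns:", missing)

# Data types and null counts give a quick sanity check on the load.
print(sample[required_cols].dtypes)
print("Null values per column:")
print(sample[required_cols].isna().sum())
```

An empty `missing` list means every required column is present.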



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [None]:
# Q1.1: Choose your numeric column here [We have already written this answer]
num_col = "daily_app_opens"

df[num_col].describe()

Unnamed: 0,daily_app_opens
count,100.0
mean,26.99
std,13.669619
min,5.0
25%,15.0
50%,27.0
75%,39.0
max,49.0



> **Q1.2 Short answer: [5 marks]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

Write your answer here:

> Looking at the `daily_app_opens` column, users open apps about 27 times a day on average (mean 26.99, median 27).
>  The values range from 5 to 49 opens per day, with a standard deviation of about 13.7.
>  The maximum of 49 is almost double the mean and sits about 1.6 standard deviations above it, which shows that some users are much more active than others.
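
A quick way to judge how far the extremes sit from the mean is to express the gap in standard deviations. The sketch below uses only the summary numbers copied from the `describe()` output above; the variable names are illustrative.

```python
# Summary statistics copied from the describe() output for daily_app_opens.
mean, std = 26.99, 13.669619
min_val, max_val = 5.0, 49.0

# Distance of each extreme from the mean, in units of standard deviation.
z_max = (max_val - mean) / std
z_min = (min_val - mean) / std

print(f"max is {z_max:.2f} standard deviations above the mean")
print(f"min is {abs(z_min):.2f} standard deviations below the mean")
```

Both extremes are about the same number of standard deviations from the mean, which is consistent with the mean and median being nearly equal.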



### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [None]:
# Q2.1: Compute proportion of positive class [We have already written this answer]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)

Positive count: 52
Total samples: 100
Proportion of positive class: 0.52



> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Write your answer here:

>   The proportion of the positive class is 0.52, which means 52% of the samples are positive and 48% are negative. The dataset is therefore close to balanced, with nearly equal representation of the two classes.
>  
>  



### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [None]:
# Q3.1: Manually compute TP, TN, FP, FN [We have already written this answer]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 28
TN: 27
FP: 21
FN: 24


In [None]:
# Q3.2: Compute accuracy, precision, recall [We have already written this answer]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.55
Precision: 0.5714285714285714
Recall: 0.5384615384615384



> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

Write your answer here:

>  The model has an accuracy of 55%, which means it predicts just over half of the cases correctly. The precision is 0.57, so when the model predicts positive it is correct only about 57% of the time. The recall (sensitivity) is 0.54, so the model catches only about 54% of the actual positives and misses nearly half of them. In my opinion, the model has neither high recall nor high precision; its performance is mediocre on all three metrics.
>  
>  
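
To put the 55% accuracy in context, it can help to compare it with a trivial baseline that always predicts the majority class. The sketch below uses only the counts printed in Q3.1; the baseline comparison is an addition for illustration and is not required by the question.

```python
# Counts copied from the Q3.1 output above.
tp, tn, fp, fn = 28, 27, 21, 24
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Baseline: always predict the majority class.
# Actual positives = TP + FN, actual negatives = TN + FP.
positives = tp + fn
negatives = tn + fp
baseline_accuracy = max(positives, negatives) / total

print("Accuracy:", accuracy)
print("Precision:", round(precision, 4))
print("Recall:", round(recall, 4))
print("Majority-class baseline accuracy:", baseline_accuracy)
print("Missed positives (FN):", fn, "of", positives)
```

If the model's accuracy is only a few points above the baseline, the predictions carry little information beyond the class proportions.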



---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling

Use one numeric column, `monthly_income`.


In [None]:
# Q4.1: Choose the numeric column [2 marks]
numeric_col = "monthly_income"
df[numeric_col]

Unnamed: 0,monthly_income
0,3734.19
1,2594.19
2,3550.47
3,3821.18
4,1750.84
...,...
95,3975.90
96,3764.77
97,2336.44
98,3007.53


In [None]:
# Q4.2: Standardization with z-score [10 marks]
mn = df[numeric_col].mean()
std_dv = df[numeric_col].std()  # pandas default: sample standard deviation (ddof=1)

# z = (x - mean) / std
z_stdrztn = (df[numeric_col] - mn) / std_dv
z_stdrztn.round(2)


Unnamed: 0,monthly_income
0,0.94
1,-0.32
2,0.74
3,1.04
4,-1.26
...,...
95,1.21
96,0.98
97,-0.61
98,0.14


In [None]:
# Q4.3: Min max scaling implementation [10 marks]
min_val = df[numeric_col].min()
max_val = df[numeric_col].max()
rng = max_val - min_val  # range of the column; assumed to be non-zero

# x_scaled = (x - min) / (max - min)
min_max_scaled = (df[numeric_col] - min_val) / rng
min_max_scaled.round(2)

Unnamed: 0,monthly_income
0,0.68
1,0.39
2,0.63
3,0.70
4,0.19
...,...
95,0.73
96,0.68
97,0.33
98,0.50


In [None]:
print(f"Min of z-scaled value: {z_stdrztn.min()}")
print(f"Max of z-scaled value: {z_stdrztn.max()}")
print(f"Min scaled value: {min_max_scaled.min()}")
print(f"Max scaled value: {min_max_scaled.max()}")

Min of z-scaled value: -2.0996472034707563
Max of z-scaled value: 2.4090808513481488
Min scaled value: 0.0
Max scaled value: 1.0



> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

Write your answer here:

>  The z-score standardized column is centered at 0, so it contains both negative and positive values. Here it ranges from about -2.10 to 2.41, and each value says how many standard deviations the income is from the mean. The min-max scaled column maps every value into the range 0 to 1, with the minimum becoming 0.0 and the maximum becoming 1.0. So standardization has no fixed bounds and measures distance from the mean in standard deviations, while min-max scaling compresses all values into a fixed 0-1 range.
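
The properties described in this answer can be checked directly. The sketch below uses only the first five `monthly_income` values shown in the output above, so its numbers will differ from those of the full column; the variable names are illustrative.

```python
import pandas as pd

# First five monthly_income values from the notebook output (not the full column).
income = pd.Series([3734.19, 2594.19, 3550.47, 3821.18, 1750.84])

# Standardization: subtract the mean, divide by the sample std (ddof=1).
z = (income - income.mean()) / income.std()

# Min-max scaling: shift by the minimum, divide by the range.
mm = (income - income.min()) / (income.max() - income.min())

# After standardization the mean is ~0 and the std is ~1 (up to rounding error).
print("z-score mean:", round(z.mean(), 6), "std:", round(z.std(), 6))

# After min-max scaling the smallest value is 0 and the largest is 1.
print("min-max range:", mm.min(), "to", mm.max())
```

Standardized values are not bounded, so their minimum and maximum depend on the data, while min-max scaled values always end at exactly 0 and 1.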
  



### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [None]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]
d_city = pd.get_dummies(df['city_type'], prefix='city_type', dtype=int)
print(d_city)


    city_type_Rural  city_type_Suburban  city_type_Urban
0                 0                   1                0
1                 0                   0                1
2                 1                   0                0
3                 0                   1                0
4                 0                   1                0
..              ...                 ...              ...
95                0                   0                1
96                0                   0                1
97                1                   0                0
98                0                   0                1
99                0                   1                0

[100 rows x 3 columns]


In [None]:
# Q5.2: Attach one hot encoded columns to df [5 marks]
df_encoded = pd.concat([df, d_city], axis=1)
print(df_encoded)

    user_id  age  monthly_income  daily_screen_time_min  daily_app_opens  \
0         1   43         3734.19                    109               48   
1         2   49         2594.19                    194                7   
2         3   19         3550.47                    146               36   
3         4   19         3821.18                    287               14   
4         5   63         1750.84                     66               46   
..      ...  ...             ...                    ...              ...   
95       96   20         3975.90                    259               15   
96       97   52         3764.77                    295               16   
97       98   35         2336.44                    108               46   
98       99   18         3007.53                    202               42   
99      100   57         2195.91                     75               18   

    true_label  pred_label satisfaction_level city_type  city_type_Rural  \
0          

In [None]:
# Q5.3: Ordinal encoding for satisfaction_level [10 marks]
orders = {'Low': 1, 'Medium': 2, 'High': 3}
df['satisfaction_level_encoded'] = df['satisfaction_level'].map(orders).astype(int)
print(df)


    user_id  age  monthly_income  daily_screen_time_min  daily_app_opens  \
0         1   43         3734.19                    109               48   
1         2   49         2594.19                    194                7   
2         3   19         3550.47                    146               36   
3         4   19         3821.18                    287               14   
4         5   63         1750.84                     66               46   
..      ...  ...             ...                    ...              ...   
95       96   20         3975.90                    259               15   
96       97   52         3764.77                    295               16   
97       98   35         2336.44                    108               46   
98       99   18         3007.53                    202               42   
99      100   57         2195.91                     75               18   

    true_label  pred_label satisfaction_level city_type  \
0            0           0  


> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Write your answer here:

> One-hot encoding is suitable for `city_type` because Urban, Suburban, and Rural have no order or ranking; they are just different categories, and one-hot columns avoid implying an order. Ordinal encoding is suitable for `satisfaction_level` because Low, Medium, and High have a clear order, Low < Medium < High, which the integer codes 1, 2, 3 preserve. So we use one-hot encoding for city types and ordinal encoding for satisfaction levels.
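
The contrast between the two encodings can be seen side by side on a few values. This is a minimal sketch using the five `satisfaction_level` and `city_type` values from `df.head()`. It uses an ordered `pd.Categorical` as an alternative to the dictionary mapping in Q5.3, which is an illustrative choice rather than the method the assignment asks for.

```python
import pandas as pd

# Values taken from the five rows shown by df.head().
levels = pd.Series(["Medium", "Low", "High", "High", "Medium"])
city = pd.Series(["Suburban", "Urban", "Rural", "Suburban", "Suburban"])

# Ordinal: declare the order explicitly, take the integer codes (0, 1, 2),
# then shift by 1 to match the 1..3 mapping used in Q5.3.
ordered = pd.Categorical(levels, categories=["Low", "Medium", "High"], ordered=True)
level_codes = pd.Series(ordered.codes, index=levels.index) + 1

# Nominal: one indicator column per category, so no order is implied.
city_dummies = pd.get_dummies(city, prefix="city_type", dtype=int)

print(level_codes.tolist())
print(city_dummies)
```

With `pd.Categorical`, a label that is not in `categories` gets the code -1, so unexpected values are easy to spot before they reach a model.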




---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
- `income_std`  
- `daily_app_opens` (or its min max scaled version if you prefer)


In [None]:
# Q6.1: Build 2D vectors for first two users [We have already written this answer]
vec_cols = ["monthly_income", "daily_app_opens"]

v1 = df.loc[0, vec_cols].values
v2 = df.loc[1, vec_cols].values

print("v1:", v1)
print("v2:", v2)

v1: [np.float64(3734.19) np.int64(48)]
v2: [np.float64(2594.19) np.int64(7)]


In [None]:
# Q6.2: Euclidean distance computation [5 marks]
v1_v2 = np.linalg.norm(v1 - v2)
print(v1_v2)

1140.7370424422975


In [None]:
# Q6.3: Manhattan distance computation [5 marks]
v1_v2_manhattan = np.linalg.norm(v1 - v2, ord=1)

print(v1_v2_manhattan)

1181.0



> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

Write your answer here:

> Manhattan distance is larger in my result (1181.0 versus about 1140.74). Manhattan distance is the sum of the absolute differences in each coordinate, while Euclidean distance is the square root of the sum of squared differences, which is the straight-line distance. Manhattan is like walking along city blocks, which covers more ground, whereas Euclidean is like travelling straight from one point to the other through the buildings. So Manhattan distance is never smaller than Euclidean distance, and the two are equal only when the points differ in at most one coordinate.
>  
>  
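
The two formulas can also be written out by hand, which makes the comparison explicit. The sketch below rebuilds `v1` and `v2` from the values printed in Q6.1 (raw `monthly_income` and `daily_app_opens`); `diff`, `manhattan`, and `euclidean` are illustrative names.

```python
import numpy as np

# Values printed for v1 and v2 in Q6.1.
v1 = np.array([3734.19, 48.0])
v2 = np.array([2594.19, 7.0])
diff = v1 - v2

# Manhattan: sum of absolute coordinate differences, |d1| + |d2|.
manhattan = np.abs(diff).sum()

# Euclidean: square root of the sum of squared differences, sqrt(d1^2 + d2^2).
euclidean = np.sqrt((diff ** 2).sum())

print("Manhattan:", manhattan)
print("Euclidean:", euclidean)
print("Manhattan >= Euclidean:", manhattan >= euclidean)
```

The two values are close here because almost all of the difference comes from a single coordinate, the income.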



---
## Final Reflection [10 marks]

> In 4 to 6 sentences, describe how the three modules connect in this assignment.  
> Mention:
> - One idea from Module 1 or 2 that you used  
> - One idea from Module 3 that you used  
> - How these ideas together help you understand a dataset more deeply

Write your reflection here:

>  In this assignment, I used the confusion matrix and basic metrics (accuracy, precision, and recall) from Modules 1 and 2 to understand how well the model predicts the true labels. From Module 3, I applied scaling techniques, namely z-score standardization and min-max scaling, to transform numeric features into comparable ranges. I also used one-hot encoding for `city_type` and ordinal encoding for `satisfaction_level` to convert categorical data into numbers. Together, these ideas help me understand the dataset more deeply: the metrics tell me whether the predictions are good or bad, while scaling and encoding prepare the data properly for use in an ML model. Without scaling, features with large values like `monthly_income` would dominate features with small values like `daily_app_opens` in distance calculations, making comparisons unfair. So the modules work together: one evaluates the results, and the other prepares the data correctly before it goes into an ML model.
>  
>  
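
The point about large-valued features dominating can be illustrated by measuring how much of the squared Euclidean distance between user 0 and user 1 comes from `monthly_income`, before and after standardization. This is a sketch on the five rows from `df.head()` only, so the standardized numbers are for this small sample and will differ from the full dataset; `income_share` is a hypothetical helper written for this example.

```python
import pandas as pd

# First five rows from df.head(); means and stds below are for this sample only.
sample = pd.DataFrame({
    "monthly_income": [3734.19, 2594.19, 3550.47, 3821.18, 1750.84],
    "daily_app_opens": [48, 7, 36, 14, 46],
})


def income_share(frame):
    """Fraction of the squared distance between rows 0 and 1 due to monthly_income."""
    diff = (frame.loc[0] - frame.loc[1]).to_numpy(dtype=float)
    squared = diff ** 2
    return squared[0] / squared.sum()  # column 0 is monthly_income


# Column-wise z-score standardization (sample std, ddof=1).
standardized = (sample - sample.mean()) / sample.std()

raw_share = income_share(sample)
scaled_share = income_share(standardized)

print(f"Income share of squared distance, raw: {raw_share:.4f}")
print(f"Income share of squared distance, standardized: {scaled_share:.4f}")
```

On the raw values, nearly all of the distance comes from income. After standardization both features are on comparable scales, so `daily_app_opens` can contribute meaningfully.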



## End of Assignment

Before submitting:
- Run all cells from top to bottom.  
- Check that all answer sections are filled.  
- Following the instruction video, first use "Save a copy in Drive" on the Colab file we provided. Then write your code in Google Colab, share that file with "Anyone with the link" and "View" access, and submit the file's shareable link.
