
# Week 01 Assignment  
## Data Quality, Evaluation, Scaling, and Encoding

**Student name: Md. Merajul Islam Khan**   

This is a small assignment that connects topics from Modules 1, 2, and 3.  
You must complete it in this Colab notebook.

You will need to use concepts that appeared in the videos:
- Module 1 and 2: basic descriptive statistics, proportions, confusion matrix, accuracy, precision, recall
- Module 3: standardization, min max scaling, nominal vs ordinal, one hot encoding, ordinal encoding, Euclidean and Manhattan distance

Please do not use any extra libraries beyond `pandas` and `numpy`.



---
## 0. Setup and Dataset

We will use a dataset that should have the columns listed below:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [1]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [2]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1OmDDCh4MD1TtvAemnwVDyz5zwCIXJ220")

# Show first few rows
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type
0,1,43,3734.19,109,48,0,0,Medium,Suburban
1,2,49,2594.19,194,7,0,0,Low,Urban
2,3,19,3550.47,146,36,1,0,High,Rural
3,4,19,3821.18,287,14,1,0,High,Suburban
4,5,63,1750.84,66,46,0,0,Medium,Suburban



### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  
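
One way to run the check in step 2 is to compare the required names against `df.columns`. The sketch below is only an illustration: the helper name `missing_columns` and the two-row stand-in frame (copied from the first rows of `df.head()`) are assumptions, not part of the assignment.

```python
import pandas as pd

# Required columns, copied from the list in section 0.1.
REQUIRED_COLUMNS = [
    "age", "monthly_income", "daily_screen_time_min", "daily_app_opens",
    "true_label", "pred_label", "satisfaction_level", "city_type",
]


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]


# Hypothetical stand-in for df: the first two rows shown by df.head().
sample = pd.DataFrame({
    "user_id": [1, 2],
    "age": [43, 49],
    "monthly_income": [3734.19, 2594.19],
    "daily_screen_time_min": [109, 194],
    "daily_app_opens": [48, 7],
    "true_label": [0, 0],
    "pred_label": [0, 0],
    "satisfaction_level": ["Medium", "Low"],
    "city_type": ["Suburban", "Urban"],
})

print(missing_columns(sample))                               # []
print(missing_columns(sample.drop(columns=["city_type"])))   # ['city_type']
```

In the notebook itself, `missing_columns(df)` should return an empty list if the dataset loaded correctly.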



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [3]:
# Q1.1: Choose your numeric column here [We have already written this answer]
num_col = "daily_screen_time_min"

df[num_col].describe()

Unnamed: 0,daily_screen_time_min
count,100.0
mean,181.89
std,68.886951
min,60.0
25%,122.0
50%,178.0
75%,243.75
max,299.0



> **Q1.2 Short answer: [5 marks]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

Write your answer here:

> The max value (299) is well above the mean (about 182), which shows that some people use their screens much more than the average user.
> The standard deviation (about 69 minutes) is also large, meaning the values vary quite a bit.
> Overall, screen time in this group is not very consistent.
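
To put "far from the mean" on a common scale, one option is to express the min and max as a number of standard deviations from the mean. This is a minimal sketch that uses only the summary values printed by `describe()` above; the helper name `z_distance` is an assumption made for illustration.

```python
# Summary statistics copied from the describe() output for daily_screen_time_min.
mean = 181.89
std = 68.886951
minimum = 60.0
maximum = 299.0


def z_distance(value, mean, std):
    """Signed distance of `value` from the mean, in standard deviations."""
    return (value - mean) / std


max_z = z_distance(maximum, mean, std)
min_z = z_distance(minimum, mean, std)

print(f"max is {max_z:.2f} standard deviations from the mean")  # 1.70
print(f"min is {min_z:.2f} standard deviations from the mean")  # -1.77
```

Both extremes sit a little under two standard deviations from the mean, so the spread is wide but neither end looks like an isolated outlier.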



### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [4]:
# Q2.1: Compute proportion of positive class [We have already written this answer]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)

Positive count: 52
Total samples: 100
Proportion of positive class: 0.52



> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Write your answer here:

> The dataset is almost evenly balanced between the two classes (52% positive, 48% negative).
> This means neither class is much more common than the other.
>  



### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [5]:
# Q3.1: Manually compute TP, TN, FP, FN [We have already written this answer]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 28
TN: 27
FP: 21
FN: 24


In [35]:
# Q3.2: Compute accuracy, precision, recall [We have already written this answer]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.55
Precision: 0.5714285714285714
Recall: 0.5384615384615384



> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

Write your answer here:

> The model's accuracy of 0.55 shows it is only slightly better than random guessing.
> Its precision (about 0.57) is a bit higher than its recall (about 0.54), which means the model is somewhat careful when it predicts a positive but still makes many mistakes (21 false positives against 28 true positives).
> The model also misses a fair number of actual positive cases (24 of 52).
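
The three metrics can be re-derived from the four confusion-matrix counts alone, which is a useful sanity check. The sketch below plugs in the counts printed in Q3.1 and also compares accuracy against a simple majority-class baseline; the helper `safe_ratio` and the baseline comparison are illustrative additions, not part of the assignment.

```python
# Counts copied from the Q3.1 output.
tp, tn, fp, fn = 28, 27, 21, 24


def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator > 0 else 0.0


total = tp + tn + fp + fn
accuracy = safe_ratio(tp + tn, total)    # correct predictions / all predictions
precision = safe_ratio(tp, tp + fp)      # correct positives / predicted positives
recall = safe_ratio(tp, tp + fn)         # correct positives / actual positives

# Baseline: always predict whichever true class is more common.
actual_positives = tp + fn               # 52
actual_negatives = tn + fp               # 48
majority_baseline = safe_ratio(max(actual_positives, actual_negatives), total)

print("Accuracy:", round(accuracy, 4))                    # 0.55
print("Precision:", round(precision, 4))                  # 0.5714
print("Recall:", round(recall, 4))                        # 0.5385
print("Majority baseline:", round(majority_baseline, 4))  # 0.52
```

The accuracy of 0.55 is only three points above the 0.52 that a constant "always predict 1" rule would reach on this dataset.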



---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling

Use one numeric column, `monthly_income`.


In [9]:
# Q4.1: Choose the numeric column [2 marks]
df['monthly_income']


Unnamed: 0,monthly_income
0,3734.19
1,2594.19
2,3550.47
3,3821.18
4,1750.84
...,...
95,3975.90
96,3764.77
97,2336.44
98,3007.53


In [17]:
# Q4.2: Standardization with z-score [10 marks]
mean = df['monthly_income'].mean()
std = df['monthly_income'].std()
df['monthly_income_z'] = (df['monthly_income'] - mean) / std

df[['monthly_income','monthly_income_z']]

Unnamed: 0,monthly_income,monthly_income_z
0,3734.19,0.944685
1,2594.19,-0.324626
2,3550.47,0.740126
3,3821.18,1.041542
4,1750.84,-1.263639
...,...,...
95,3975.90,1.213813
96,3764.77,0.978734
97,2336.44,-0.611613
98,3007.53,0.135599


In [19]:
# Q4.3: Min max scaling implementation [10 marks]
mn = df['monthly_income'].min()
mx = df['monthly_income'].max()
df['monthly_income_minmax'] = (df['monthly_income'] - mn) / (mx - mn)

df[["monthly_income","monthly_income_minmax"]]


Unnamed: 0,monthly_income,monthly_income_minmax
0,3734.19,0.675209
1,2594.19,0.393685
2,3550.47,0.629839
3,3821.18,0.696691
4,1750.84,0.185420
...,...,...
95,3975.90,0.734899
96,3764.77,0.682760
97,2336.44,0.330034
98,3007.53,0.495760



> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

Write your answer here:

> The z-score column centers the data around a mean of 0 with a standard deviation of 1, so its values include both negative and positive numbers depending on how far each value is from the mean.
> The min-max scaled column transforms all values into the range from 0 to 1, so no value is negative: the smallest income maps to 0 and the largest to 1.
>  
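The range properties described above can be checked directly. This is a minimal sketch on the five incomes shown by `df.head()`, so the scaled numbers will differ from the notebook's, which uses the mean, standard deviation, min, and max of all 100 rows. The function names `z_score` and `min_max` are assumptions made for illustration.

```python
import pandas as pd

# First five monthly_income values from df.head(); a stand-in for the full column.
income = pd.Series([3734.19, 2594.19, 3550.47, 3821.18, 1750.84],
                   name="monthly_income")


def z_score(series):
    """Standardize: subtract the mean, divide by the sample std (pandas uses ddof=1)."""
    return (series - series.mean()) / series.std()


def min_max(series):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    return (series - series.min()) / (series.max() - series.min())


z = z_score(income)
mm = min_max(income)

print("z-score mean and std:", round(z.mean(), 6), round(z.std(), 6))
print("min-max range:", mm.min(), mm.max())
print("z-score has negative values:", bool((z < 0).any()))
```

Note that `min_max` would divide by zero on a constant column, so a real pipeline may want to guard against that case.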



### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [23]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]
one_hot = pd.get_dummies(df['city_type'], prefix='city_type').astype(int)
one_hot

Unnamed: 0,city_type_Rural,city_type_Suburban,city_type_Urban
0,0,1,0
1,0,0,1
2,1,0,0
3,0,1,0
4,0,1,0
...,...,...,...
95,0,0,1
96,0,0,1
97,1,0,0
98,0,0,1


In [24]:
# Q5.2: Attach one hot encoded columns to df [5 marks]
df = pd.concat([df, one_hot], axis=1)
df

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type,monthly_income_z,monthly_income_minmax,city_type_Rural,city_type_Suburban,city_type_Urban
0,1,43,3734.19,109,48,0,0,Medium,Suburban,0.944685,0.675209,0,1,0
1,2,49,2594.19,194,7,0,0,Low,Urban,-0.324626,0.393685,0,0,1
2,3,19,3550.47,146,36,1,0,High,Rural,0.740126,0.629839,1,0,0
3,4,19,3821.18,287,14,1,0,High,Suburban,1.041542,0.696691,0,1,0
4,5,63,1750.84,66,46,0,0,Medium,Suburban,-1.263639,0.185420,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,96,20,3975.90,259,15,0,1,Medium,Urban,1.213813,0.734899,0,0,1
96,97,52,3764.77,295,16,1,0,Low,Urban,0.978734,0.682760,0,0,1
97,98,35,2336.44,108,46,0,1,High,Rural,-0.611613,0.330034,1,0,0
98,99,18,3007.53,202,42,1,1,High,Urban,0.135599,0.495760,0,0,1


In [27]:
# Q5.3: Ordinal encoding for satisfaction_level [10 marks]
ordinal = {'Low': 1, 'Medium': 2, 'High': 3}
df['satisfaction_level_encoded'] = df['satisfaction_level'].map(ordinal)
df[['satisfaction_level','satisfaction_level_encoded']]

Unnamed: 0,satisfaction_level,satisfaction_level_encoded
0,Medium,2
1,Low,1
2,High,3
3,High,3
4,Medium,2
...,...,...
95,Medium,2
96,Low,1
97,High,3
98,High,3



> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Write your answer here:

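> One-hot encoding is suitable for `city_type` because its categories have no natural order, so integer codes would suggest a ranking that does not exist.
> Ordinal encoding is appropriate for `satisfaction_level` because the levels follow a meaningful progression (`Low` < `Medium` < `High`) that the integers 1, 2, 3 preserve.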
>  
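Two quick checks can confirm that both encodings behaved as intended: every row should activate exactly one `city_type` column, and no satisfaction label should be left unmapped (`Series.map` with a dictionary yields `NaN` for labels that are missing from it). The sketch below uses the five rows shown by `df.head()` as a stand-in for the full dataset; the variable names `rows_ok` and `unmapped` are illustrative.

```python
import pandas as pd

# Categorical values from the first five rows of df.head().
sample = pd.DataFrame({
    "city_type": ["Suburban", "Urban", "Rural", "Suburban", "Suburban"],
    "satisfaction_level": ["Medium", "Low", "High", "High", "Medium"],
})

# Nominal feature: one 0/1 column per category, no order implied.
one_hot = pd.get_dummies(sample["city_type"], prefix="city_type").astype(int)

# Ordinal feature: integers that preserve Low < Medium < High.
order = {"Low": 1, "Medium": 2, "High": 3}
encoded = sample["satisfaction_level"].map(order)

# Check 1: exactly one city_type column is active in every row.
rows_ok = bool((one_hot.sum(axis=1) == 1).all())

# Check 2: count labels that were not found in the mapping (they become NaN).
unmapped = int(encoded.isna().sum())

print(list(one_hot.columns))
print("one active column per row:", rows_ok)
print("unmapped satisfaction labels:", unmapped)
```

A nonzero `unmapped` count usually points to a spelling or capitalization variant in the raw labels, such as `low` instead of `Low`.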



---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
- `monthly_income` (or its standardized version, `monthly_income_z`, if you prefer)  
- `daily_app_opens` (or its min max scaled version if you prefer)


In [28]:
# Q6.1: Build 2D vectors for first two users [We have already written this answer]
vec_cols = ["monthly_income", "daily_app_opens"]

v1 = df.loc[0, vec_cols].values
v2 = df.loc[1, vec_cols].values

print("v1:", v1)
print("v2:", v2)

v1: [np.float64(3734.19) np.int64(48)]
v2: [np.float64(2594.19) np.int64(7)]


In [32]:
# Q6.2: Euclidean distance computation [10 marks]
dist = np.linalg.norm(v1 - v2).round(2)
print("Euclidean distance:", dist)

Euclidean distance: 1140.74


In [34]:
# Q6.3: Manhattan distance computation [10 marks]
manhattan = np.sum(np.abs(v1 - v2))
print("Manhattan distance:", manhattan)

Manhattan distance: 1181.0



> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

Write your answer here:

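> The Manhattan distance (1181.0) is larger than the Euclidean distance (1140.74) in this case. This usually happens because Manhattan distance adds the absolute differences along each dimension, while Euclidean distance takes the square root of the sum of squared differences, which can never exceed that sum. The two are equal only when the vectors differ in at most one dimension.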
>  
>  
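Both distances can be reproduced from the two vectors printed in Q6.1. The sketch below also computes how much of the Manhattan distance comes from the income dimension alone; that share (`income_share`) is an illustrative addition to show why distances on unscaled features tend to be dominated by the feature with the largest numeric range.

```python
import numpy as np

# v1 and v2 copied from the Q6.1 output: [monthly_income, daily_app_opens].
v1 = np.array([3734.19, 48.0])
v2 = np.array([2594.19, 7.0])

diff = np.abs(v1 - v2)                    # per-dimension absolute differences

euclidean = np.sqrt(np.sum(diff ** 2))    # square root of the sum of squares
manhattan = np.sum(diff)                  # sum of absolute differences

# Fraction of the Manhattan distance contributed by the income difference.
income_share = diff[0] / manhattan

print("Euclidean distance:", round(float(euclidean), 2))   # 1140.74
print("Manhattan distance:", round(float(manhattan), 2))   # 1181.0
print("Income share of Manhattan distance:", round(float(income_share), 3))
```

Because the income difference (1140) is so much larger than the app-opens difference (41), both distances are almost entirely determined by income; using scaled columns for both features would let each one contribute on a comparable scale.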



---
## Final Reflection [5 marks]

> In 4 to 6 sentences, describe how the three modules connect in this assignment.  
> Mention:
> - One idea from Module 1 or 2 that you used  
> - One idea from Module 3 that you used  
> - How these ideas together help you understand a dataset more deeply

Write your reflection here:

> In this assignment, Module 1 and 2 concepts such as descriptive statistics and classification evaluation were used to understand the basic properties of the dataset and of the model's predictions, including measures like mean, max, and standard deviation, and metrics like accuracy, precision, and recall computed from the confusion matrix.
> From Module 3, we applied preprocessing ideas such as z-score standardization, min-max scaling, one-hot and ordinal encoding, and Euclidean and Manhattan distance to prepare and compare features from the same dataset. These modules connect because the descriptive statistics and evaluation metrics (Modules 1 and 2) show what the raw data and predictions look like, while scaling and encoding (Module 3) put the features into a form where comparisons between users are meaningful.
>  



## End of Assignment

Before submitting:
- Run all cells from top to bottom.  
- Check that all answer sections are filled.  
- Download this notebook as `.ipynb` and upload it according to the given instructions.
- ***You must read the Assignment Module text instructions in full; they explain how to submit this assignment.***
