
# Week 01 Assignment  
# Data Quality, Evaluation, Scaling, and Encoding

Student name: Maria Akter Mukti

This is a small assignment that connects topics from Modules 1, 2, and 3.  
You must complete it in this Colab notebook.

You will need to use concepts that appeared in the videos:
- Module 1 and 2: basic descriptive statistics, proportions, confusion matrix, accuracy, precision, recall
- Module 3: standardization, min max scaling, nominal vs ordinal, one hot encoding, ordinal encoding, Euclidean and Manhattan distance

Please do not use any extra libraries beyond `pandas` and `numpy`.



---
## 0. Setup and Dataset

We will use a dataset that should have the columns listed below:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [21]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [22]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1OmDDCh4MD1TtvAemnwVDyz5zwCIXJ220")

# Show first few rows
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type
0,1,43,3734.19,109,48,0,0,Medium,Suburban
1,2,49,2594.19,194,7,0,0,Low,Urban
2,3,19,3550.47,146,36,1,0,High,Rural
3,4,19,3821.18,287,14,1,0,High,Suburban
4,5,63,1750.84,66,46,0,0,Medium,Suburban



### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [23]:
# Q1.1: Choose your numeric column here [Already done for you]
num_col = "daily_screen_time_min"

df[num_col].describe()

Unnamed: 0,daily_screen_time_min
count,100.0
mean,181.89
std,68.886951
min,60.0
25%,122.0
50%,178.0
75%,243.75
max,299.0



> **Q1.2 Short answer: [5 marks]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

Write your answer here:

> Across the 100 users, the mean is 181.89, the min is 60, and the max is 299. The max is about 117 minutes above the mean and the min is about 122 minutes below it, so both extremes sit far from the mean. The standard deviation of 68.89 is large relative to the mean, so overall the data are widely spread and show a lot of variation.
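
One way to make "far from the mean" concrete is to express each gap in standard deviations. The sketch below is only an illustration: it copies the summary values printed by `describe()` above, and the variable names are not part of the assignment.

```python
# Summary statistics copied from the describe() output above
mean_val = 181.89
std_val = 68.886951
min_val = 60.0
max_val = 299.0

# Distance of each extreme from the mean, measured in standard deviations
max_gap_in_std = (max_val - mean_val) / std_val
min_gap_in_std = (mean_val - min_val) / std_val

print("Max is", round(max_gap_in_std, 2), "standard deviations above the mean")
print("Min is", round(min_gap_in_std, 2), "standard deviations below the mean")
```

Both extremes lie less than two standard deviations from the mean, which points to a wide spread rather than a few isolated outliers.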



### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [24]:
# Q2.1: Compute proportion of positive class [Already done for you]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)

Positive count: 52
Total samples: 100
Proportion of positive class: 0.52



> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Write your answer here:

> Here the positive count is 52 and the total count is 100, so the negative count is 100 - 52 = 48.
> With 52% positives and 48% negatives, the dataset is almost, but not exactly, balanced.
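
Both class shares can also be read off in one step with `value_counts(normalize=True)`. The sketch below uses a small made-up label series, not the assignment data, and the names `labels`, `class_share`, and `majority_share` are illustrative only.

```python
import pandas as pd

# Hypothetical labels, not the assignment data: 3 positives and 2 negatives
labels = pd.Series([1, 0, 1, 1, 0], name="true_label")

# normalize=True returns proportions instead of raw counts
class_share = labels.value_counts(normalize=True)
print(class_share)

# A simple imbalance measure: the share of the most common class
majority_share = class_share.max()
print("Majority class share:", majority_share)
```

A majority share close to 0.5 indicates a balanced binary dataset, while a share close to 1 indicates strong imbalance.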


### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [25]:
# Q3.1: Manually compute TP, TN, FP, FN [Already done for you]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 28
TN: 27
FP: 21
FN: 24


In [26]:
# Q3.2: Compute accuracy, precision, recall [Already done for you]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.55
Precision: 0.5714285714285714
Recall: 0.5384615384615384



> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

Write your answer here:

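An accuracy of 0.55 means the model predicts 55% of all samples correctly and 45% incorrectly.

A precision of 0.57 means that 57% of the samples predicted as positive are actually positive, and 43% are false positives.

A recall of 0.54 means the model catches 54% of the actual positives and misses 46% of them.

Overall, all three metrics are only slightly above 0.5, so the model is weak: it neither catches most positives nor is it careful when it predicts positive.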
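
To judge whether an accuracy of 0.55 is good, it can help to compare it with a baseline that always predicts the more common class. This is a rough sketch using the counts printed in Q3.1; the baseline idea and the variable names are additions for illustration, not part of the assignment.

```python
# Confusion matrix counts from the Q3.1 output
tp, tn, fp, fn = 28, 27, 21, 24
total = tp + tn + fp + fn

# Model accuracy, computed as in Q3.2
accuracy = (tp + tn) / total

# Baseline: always predict the more common true class
actual_pos = tp + fn
actual_neg = tn + fp
baseline_accuracy = max(actual_pos, actual_neg) / total

print("Model accuracy:", accuracy)
print("Majority-class baseline:", baseline_accuracy)
```

With these counts the model beats the majority-class baseline by only about 3 percentage points.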



---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling

Use one numeric column, `monthly_income`.


In [27]:
# Q4.1: Choose the numeric column [2 marks]
numeric_col = "monthly_income"
df[numeric_col]

Unnamed: 0,monthly_income
0,3734.19
1,2594.19
2,3550.47
3,3821.18
4,1750.84
...,...
95,3975.90
96,3764.77
97,2336.44
98,3007.53


In [28]:
# Q4.2: Standardization with z-score [10 marks]


mean_val= df[numeric_col].mean()
standard_deviation= df[numeric_col].std()

df['Z_Score_of_income'] = (df[numeric_col] - mean_val) / standard_deviation
df['Z_Score_of_income']

Unnamed: 0,Z_Score_of_income
0,0.944685
1,-0.324626
2,0.740126
3,1.041542
4,-1.263639
...,...
95,1.213813
96,0.978734
97,-0.611613
98,0.135599


In [29]:
# Q4.3: Min max scaling implementation [10 marks]

min_val= df[numeric_col].min()
max_val= df[numeric_col].max()

df['min_max_scaled_income'] =(df[numeric_col]-min_val)/(max_val-min_val)
df['min_max_scaled_income']


Unnamed: 0,min_max_scaled_income
0,0.675209
1,0.393685
2,0.629839
3,0.696691
4,0.185420
...,...
95,0.734899
96,0.682760
97,0.330034
98,0.495760



> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

Write your answer here:

> Z-scores take both positive and negative values, so their range is not fixed. Each value shows how many standard deviations an income lies above or below the mean.


> Min max scaled values always lie in the range [0, 1]. The scaling compresses all incomes into this fixed interval, with the minimum mapped to 0 and the maximum mapped to 1.
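
These properties can be checked directly. The sketch below applies both formulas to only the first five `monthly_income` values shown in the Q4.1 output, so the scaled numbers differ from the full-column results above. The names `income`, `z`, and `mm` are illustrative.

```python
import pandas as pd

# First five monthly_income values shown in the Q4.1 output
income = pd.Series([3734.19, 2594.19, 3550.47, 3821.18, 1750.84])

# Standardization: result has mean 0 and standard deviation 1
# (pandas .std() uses the sample standard deviation, ddof=1)
z = (income - income.mean()) / income.std()

# Min max scaling: smallest value maps to 0, largest maps to 1
mm = (income - income.min()) / (income.max() - income.min())

print("z-score mean:", round(z.mean(), 6), "std:", round(z.std(), 6))
print("min max range:", mm.min(), "to", mm.max())
```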



### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [30]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]
one_hot_encoded = pd.get_dummies(df, columns=['city_type'])
one_hot_encoded

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,Z_Score_of_income,min_max_scaled_income,city_type_Rural,city_type_Suburban,city_type_Urban
0,1,43,3734.19,109,48,0,0,Medium,0.944685,0.675209,False,True,False
1,2,49,2594.19,194,7,0,0,Low,-0.324626,0.393685,False,False,True
2,3,19,3550.47,146,36,1,0,High,0.740126,0.629839,True,False,False
3,4,19,3821.18,287,14,1,0,High,1.041542,0.696691,False,True,False
4,5,63,1750.84,66,46,0,0,Medium,-1.263639,0.185420,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,96,20,3975.90,259,15,0,1,Medium,1.213813,0.734899,False,False,True
96,97,52,3764.77,295,16,1,0,Low,0.978734,0.682760,False,False,True
97,98,35,2336.44,108,46,0,1,High,-0.611613,0.330034,True,False,False
98,99,18,3007.53,202,42,1,1,High,0.135599,0.495760,False,False,True


In [31]:
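# Q5.2: Attach one hot encoded columns to df [5 marks]

# Note: one_hot_encoded already contains every original column except city_type,
# so this concat repeats those columns (shown with a ".1" suffix in the output below).
df = pd.concat([df, one_hot_encoded], axis=1)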
df.head()


Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type,Z_Score_of_income,...,daily_screen_time_min.1,daily_app_opens.1,true_label.1,pred_label.1,satisfaction_level.1,Z_Score_of_income.1,min_max_scaled_income,city_type_Rural,city_type_Suburban,city_type_Urban
0,1,43,3734.19,109,48,0,0,Medium,Suburban,0.944685,...,109,48,0,0,Medium,0.944685,0.675209,False,True,False
1,2,49,2594.19,194,7,0,0,Low,Urban,-0.324626,...,194,7,0,0,Low,-0.324626,0.393685,False,False,True
2,3,19,3550.47,146,36,1,0,High,Rural,0.740126,...,146,36,1,0,High,0.740126,0.629839,True,False,False
3,4,19,3821.18,287,14,1,0,High,Suburban,1.041542,...,287,14,1,0,High,1.041542,0.696691,False,True,False
4,5,63,1750.84,66,46,0,0,Medium,Suburban,-1.263639,...,66,46,0,0,Medium,-1.263639,0.18542,False,True,False
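
The output above repeats columns such as `daily_screen_time_min` and `true_label`, because the encoded frame already held a copy of them. One possible way to avoid this is to encode only the `city_type` column and attach just the resulting dummy columns. This is a sketch on a toy frame with made-up values; `toy`, `city_dummies`, and `toy_encoded` are hypothetical names.

```python
import pandas as pd

# Toy frame standing in for df (hypothetical values)
toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "city_type": ["Suburban", "Urban", "Rural"],
})

# Encode only the city_type column, so the result holds just the dummy columns
city_dummies = pd.get_dummies(toy["city_type"], prefix="city_type")

# Attaching these to the original frame adds new columns without repeating old ones
toy_encoded = pd.concat([toy, city_dummies], axis=1)
print(toy_encoded.columns.tolist())
```

Because every column name stays unique, selections such as `toy_encoded["user_id"]` return a single column.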


In [32]:
# Q5.3: Ordinal encoding for satisfaction_level [10 marks]

# Map levels to integers that keep Low < Medium < High (column 7 is satisfaction_level)
df['satisfaction_ordinal'] = df.iloc[:, 7].map({'Low': 0, 'Medium': 1, 'High': 2})

print(df['satisfaction_ordinal'])


0     1
1     0
2     2
3     2
4     1
     ..
95    1
96    0
97    2
98    2
99    1
Name: satisfaction_ordinal, Length: 100, dtype: int64


In [33]:
df

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type,Z_Score_of_income,...,daily_app_opens.1,true_label.1,pred_label.1,satisfaction_level.1,Z_Score_of_income.1,min_max_scaled_income,city_type_Rural,city_type_Suburban,city_type_Urban,satisfaction_ordinal
0,1,43,3734.19,109,48,0,0,Medium,Suburban,0.944685,...,48,0,0,Medium,0.944685,0.675209,False,True,False,1
1,2,49,2594.19,194,7,0,0,Low,Urban,-0.324626,...,7,0,0,Low,-0.324626,0.393685,False,False,True,0
2,3,19,3550.47,146,36,1,0,High,Rural,0.740126,...,36,1,0,High,0.740126,0.629839,True,False,False,2
3,4,19,3821.18,287,14,1,0,High,Suburban,1.041542,...,14,1,0,High,1.041542,0.696691,False,True,False,2
4,5,63,1750.84,66,46,0,0,Medium,Suburban,-1.263639,...,46,0,0,Medium,-1.263639,0.185420,False,True,False,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,96,20,3975.90,259,15,0,1,Medium,Urban,1.213813,...,15,0,1,Medium,1.213813,0.734899,False,False,True,1
96,97,52,3764.77,295,16,1,0,Low,Urban,0.978734,...,16,1,0,Low,0.978734,0.682760,False,False,True,0
97,98,35,2336.44,108,46,0,1,High,Rural,-0.611613,...,46,0,1,High,-0.611613,0.330034,True,False,False,2
98,99,18,3007.53,202,42,1,1,High,Urban,0.135599,...,42,1,1,High,0.135599,0.495760,False,False,True,2



> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Write your answer here:

One hot encoding is best for nominal features. The categories of `city_type` have no order, so no city type is greater or smaller than another and they cannot be ranked.
That is why we use one hot encoding for `city_type`.


Ordinal encoding is best for features with a natural order. For `satisfaction_level`, Low is worse than Medium and Medium is worse than High.
The encoding has to preserve that order from lowest to highest, so we use ordinal encoding.
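
The difference also shows up in the distances each encoding implies. In the sketch below, the integer codes 0, 1, 2 are an assumed mapping (any increasing codes would preserve the order), and the dictionary names are illustrative.

```python
import numpy as np

# Assumed ordinal codes that respect Low < Medium < High
satisfaction_codes = {"Low": 0, "Medium": 1, "High": 2}

# One hot vectors for city_type, in the column order Rural, Suburban, Urban
city_vectors = {
    "Rural": np.array([1, 0, 0]),
    "Suburban": np.array([0, 1, 0]),
    "Urban": np.array([0, 0, 1]),
}

# Ordinal codes: Low is further from High than from Medium
low_to_medium = abs(satisfaction_codes["Low"] - satisfaction_codes["Medium"])
low_to_high = abs(satisfaction_codes["Low"] - satisfaction_codes["High"])

# One hot vectors: every pair of different cities is equally far apart
rural_to_urban = np.abs(city_vectors["Rural"] - city_vectors["Urban"]).sum()
rural_to_suburban = np.abs(city_vectors["Rural"] - city_vectors["Suburban"]).sum()

print("Low to Medium:", low_to_medium, "| Low to High:", low_to_high)
print("Rural to Urban:", rural_to_urban, "| Rural to Suburban:", rural_to_suburban)
```

Integer codes on `city_type` would wrongly suggest that some cities are closer together than others, while one hot vectors on `satisfaction_level` would discard its ranking.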




---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
- `income_std`  
- `daily_app_opens` (or its min max scaled version if you prefer)


In [34]:
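# Q6.1: Build 2D vectors for first two users [Already done for you]
vec_cols = ["monthly_income", "daily_app_opens"]

# Note: df holds two copies of each of these columns after the concat in Q5.2,
# so each selection below returns 4 values rather than 2.
v1 = df.loc[0, vec_cols].values
v2 = df.loc[1, vec_cols].values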

print("v1:", v1)
print("v2:", v2)

v1: [np.float64(3734.19) np.float64(3734.19) np.int64(48) np.int64(48)]
v2: [np.float64(2594.19) np.float64(2594.19) np.int64(7) np.int64(7)]


In [35]:
# Q6.2: Euclidean distance computation [10 marks]

euclidean_distance = np.sqrt(np.sum((v1 - v2)**2))
print(euclidean_distance)

1613.24579652327


In [36]:
# Q6.3: Manhattan distance computation [10 marks]
manhattan_distance = np.sum(np.abs(v1 - v2))

print(manhattan_distance)

2362.0



> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

Write your answer here:

> The Manhattan distance (2362.0) is larger than the Euclidean distance (1613.24579652327).
> The Euclidean distance measures the shortest straight line between two points, while the Manhattan distance sums the absolute differences along each coordinate. A straight line is never longer than a path that moves along each axis separately, so the Manhattan distance is always at least as large as the Euclidean distance.
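
The inequality follows from expanding the square of the Manhattan sum: for two coordinates, (|a| + |b|)^2 = a^2 + b^2 + 2|a||b|, which is the Euclidean sum of squares plus a non-negative cross term. The sketch below checks this on the coordinate differences between the `v1` and `v2` printed in Q6.1; the names `diff` and `cross_terms` are illustrative.

```python
import numpy as np

# Coordinate differences between the v1 and v2 printed in Q6.1
diff = np.array([3734.19 - 2594.19, 3734.19 - 2594.19, 48 - 7, 48 - 7])

manhattan = np.sum(np.abs(diff))
euclidean = np.sqrt(np.sum(diff ** 2))

# Squaring the Manhattan sum gives the Euclidean sum of squares plus
# non-negative cross terms, so Manhattan can never be the smaller one
cross_terms = manhattan ** 2 - euclidean ** 2

print("Manhattan:", manhattan)
print("Euclidean:", euclidean)
print("Cross terms (>= 0):", cross_terms)
```

The two distances are equal only when at most one coordinate differs. Here the raw income difference (1140) is much larger than the app-opens difference (41), so income dominates both distances.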



---
## Final Reflection [5 marks]

> In 4 to 6 sentences, describe how the three modules connect in this assignment.  
> Mention:
> - One idea from Module 1 or 2 that you used  
> - One idea from Module 3 that you used  
> - How these ideas together help you understand a dataset more deeply

Write your reflection here:

>  
>  
>  



## End of Assignment

Before submitting:
- Run all cells from top to bottom.  
- Check that all answer sections are filled.  
- Download this notebook as `.ipynb` and upload it according to the given instructions.
- ***You must read the Assignment Module text instructions in full. They explain how to submit this assignment.***
