
# Week 01 Assignment  
## Data Quality, Evaluation, Scaling, and Encoding

**Student name:** Himadree Chaudhury  

This is a small assignment that connects topics from Modules 1, 2, and 3.  
You must complete it in this Colab notebook.

You will need to use concepts that appeared in the videos:
- Module 1 and 2: basic descriptive statistics, proportions, confusion matrix, accuracy, precision, recall
- Module 3: standardization, min max scaling, nominal vs ordinal, one hot encoding, ordinal encoding, Euclidean and Manhattan distance

Please do not use any libraries other than `pandas` and `numpy`.



---
## 0. Setup and Dataset

We will use a dataset that should have the columns listed below:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [1]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [2]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("user_data.csv")

# Show first few rows
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type
0,1,43,3734.19,109,48,0,0,Medium,Suburban
1,2,49,2594.19,194,7,0,0,Low,Urban
2,3,19,3550.47,146,36,1,0,High,Rural
3,4,19,3821.18,287,14,1,0,High,Suburban
4,5,63,1750.84,66,46,0,0,Medium,Suburban



### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  
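
The two checks above can be sketched as follows. This is a minimal example, not part of the assignment template: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and `sample` is a two-row stand-in built from the preview shown earlier. In the notebook you would pass the real `df` instead.

```python
import pandas as pd

# Hypothetical helper names, introduced only for this sketch.
REQUIRED_COLUMNS = [
    "age", "monthly_income", "daily_screen_time_min", "daily_app_opens",
    "true_label", "pred_label", "satisfaction_level", "city_type",
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Stand-in frame built from the first two rows of the preview.
sample = pd.DataFrame({
    "user_id": [1, 2],
    "age": [43, 49],
    "monthly_income": [3734.19, 2594.19],
    "daily_screen_time_min": [109, 194],
    "daily_app_opens": [48, 7],
    "true_label": [0, 0],
    "pred_label": [0, 0],
    "satisfaction_level": ["Medium", "Low"],
    "city_type": ["Suburban", "Urban"],
})

print("Shape:", sample.shape)
print("Missing columns:", missing_columns(sample))
print("Missing values per column:")
print(sample.isna().sum())
```

An empty list from `missing_columns` means every expected column is present; the `isna().sum()` line is an extra sanity check for missing values.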



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [6]:
# Q1.1: Choose your numeric column here [This answer is already written for you]
num_col = "daily_screen_time_min"

df[num_col].describe()

Unnamed: 0,daily_screen_time_min
count,100.0
mean,181.89
std,68.886951
min,60.0
25%,122.0
50%,178.0
75%,243.75
max,299.0



> **Q1.2 Short answer: [Marks: 05]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

Write your answer here: The column has 100 values ranging from 60 to 299 minutes. The mean (181.89) is about 4 minutes above the median (178), so the distribution looks roughly symmetric. With a standard deviation of about 68.9, the maximum is only about 1.7 standard deviations above the mean, so it does not look like an extreme outlier.
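
One way to judge whether the max is "far" from the mean is to express the gap in standard deviation units. The sketch below does that with the numbers copied from the `describe()` output above; the variable names are only illustrative.

```python
# Summary statistics copied from the describe() output above.
mean, std = 181.89, 68.886951
min_val, max_val, median = 60.0, 299.0, 178.0

# Distance of each extreme from the mean, in standard deviations.
max_z = (max_val - mean) / std
min_z = (min_val - mean) / std

print(f"max is {max_z:.2f} standard deviations above the mean")
print(f"min is {abs(min_z):.2f} standard deviations below the mean")
print(f"mean - median = {mean - median:.2f}")
```

Both extremes sit within two standard deviations of the mean, which is one informal sign that the column has no extreme outliers.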

>  
>  
>  



### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [7]:
# Q2.1: Compute proportion of positive class [This answer is already written for you]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)

Positive count: 52
Total samples: 100
Proportion of positive class: 0.52



> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Write your answer here: About 52% of the samples are positive and 48% are negative, so the two classes are almost balanced. A model trained on these labels is unlikely to be biased toward one class by class imbalance alone.

>  
>  
>  



### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [8]:
# Q3.1: Manually compute TP, TN, FP, FN [This answer is already written for you]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 28
TN: 27
FP: 21
FN: 24


In [9]:
# Q3.2: Compute accuracy, precision, recall [This answer is already written for you]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.55
Precision: 0.5714285714285714
Recall: 0.5384615384615384



> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

Write your answer here: The accuracy of 55% means the model is wrong on almost half of its predictions, which is barely better than always guessing the majority class (52%) on this nearly balanced dataset. Recall (sensitivity) is about 54%, so the model misses 24 of the 52 actual positives and is not catching most positives. Precision is about 57%, so 21 of its 49 positive predictions are false positives, and it is not especially careful when it predicts positive either. Overall, the model has neither high recall nor high precision, and its performance is weak.
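
To put the three metrics in context, it can help to compare accuracy with a trivial baseline and to look at the error side of precision and recall. The sketch below uses the counts printed in Q3.1; the majority-class baseline is an assumption added here for comparison and is not part of the assignment.

```python
# Confusion-matrix counts copied from the Q3.1 output above.
tp, tn, fp, fn = 28, 27, 21, 24
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Baseline (assumption for comparison): always predict the more common true class.
positives = tp + fn
negatives = tn + fp
baseline_accuracy = max(positives, negatives) / total

# Error-side views of precision and recall.
false_discovery_rate = fp / (tp + fp)  # share of positive predictions that are wrong
miss_rate = fn / (tp + fn)             # share of actual positives that are missed

print(f"accuracy {accuracy:.2f} vs baseline {baseline_accuracy:.2f}")
print(f"wrong positive predictions: {false_discovery_rate:.2%}")
print(f"missed positives: {miss_rate:.2%}")
```

Note that `false_discovery_rate` is simply `1 - precision` and `miss_rate` is `1 - recall`, so they carry the same information from the error side.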

>  
>  
>  



---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling

Use one numeric column, `monthly_income`.


In [13]:
# Q4.1: Choose the numeric column [2 marks]
stand_col = df['monthly_income']
stand_col.head()

Unnamed: 0,monthly_income
0,3734.19
1,2594.19
2,3550.47
3,3821.18
4,1750.84


In [15]:
# Q4.2: Standardization with z-score [10 marks]

mean = stand_col.mean()
std = stand_col.std()

z_scaled_df = ((stand_col - mean) / std).round(2)
z_scaled_df.head()

Unnamed: 0,monthly_income
0,0.94
1,-0.32
2,0.74
3,1.04
4,-1.26


In [16]:
# Q4.3: Min max scaling implementation [10 marks]

min_val = stand_col.min()
max_val = stand_col.max()

min_max_scaled_df = ((stand_col - min_val) / (max_val - min_val)).round(2)
min_max_scaled_df.head()

Unnamed: 0,monthly_income
0,0.68
1,0.39
2,0.63
3,0.7
4,0.19



> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

Write your answer here: The z-scaled values are centered on 0 and range from about -2.1 to 2.4, measured in standard deviations, while the min-max scaled values fall between 0 and 1. Both are linear rescalings of `monthly_income`, so they change the range of the numbers but not the shape of the distribution or the ordering of the users.
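
The comparison can be checked numerically. The sketch below applies both scalings to the five `monthly_income` values from the preview only, so its numbers will differ from the notebook outputs, which use all 100 rows. It is meant to illustrate the ranges and the fact that both scalings keep the same ordering.

```python
import numpy as np
import pandas as pd

# First five monthly_income values from the preview (a stand-in for the full column).
income = pd.Series([3734.19, 2594.19, 3550.47, 3821.18, 1750.84], name="monthly_income")

# Standardization: pandas std() uses the sample formula (ddof=1) by default.
z = (income - income.mean()) / income.std()

# Min max scaling: maps the smallest value to 0 and the largest to 1.
mm = (income - income.min()) / (income.max() - income.min())

print("z-score mean/std:", round(z.mean(), 6), round(z.std(), 6))
print("min-max min/max :", mm.min(), mm.max())

# Both are linear rescalings of the same values, so they are perfectly
# correlated and rank the rows identically.
print("correlation:", round(np.corrcoef(z, mm)[0, 1], 6))
```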

>  
>  
>  



### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [17]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]

one_hot_encoded_df = pd.get_dummies(df['city_type'], prefix="CITY", dtype=int)
one_hot_encoded_df.head()

Unnamed: 0,CITY_Rural,CITY_Suburban,CITY_Urban
0,0,1,0
1,0,0,1
2,1,0,0
3,0,1,0
4,0,1,0


In [18]:
# Q5.2: Attach one hot encoded columns to df [5 marks]
# Attach all three indicator columns so Rural and Suburban remain distinguishable.
df = pd.concat([df, one_hot_encoded_df], axis=1)
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type,CITY_Rural,CITY_Suburban,CITY_Urban
0,1,43,3734.19,109,48,0,0,Medium,Suburban,0,1,0
1,2,49,2594.19,194,7,0,0,Low,Urban,0,0,1
2,3,19,3550.47,146,36,1,0,High,Rural,1,0,0
3,4,19,3821.18,287,14,1,0,High,Suburban,0,1,0
4,5,63,1750.84,66,46,0,0,Medium,Suburban,0,1,0


In [20]:
# Q5.3: Ordinal encoding for satisfaction_level [10 marks]

ordinal_map = {'Low': 1, 'Medium': 2, 'High': 3}

ordinal_encoded_df = df['satisfaction_level'].map(ordinal_map)
ordinal_encoded_df.head()

Unnamed: 0,satisfaction_level
0,2
1,1
2,3
3,3
4,2



> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Write your answer here: Since the column `city_type` represents different areas, assigning hierarchical numeric values would create an order that does not actually exist, which can lead to incorrect interpretation. That is why one-hot encoding is the right choice for this column. On the other hand, `satisfaction_level` has categories with a meaningful order, so ordinal encoding fits here because it assigns a proper numerical ranking to those categories.
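
The difference between the two encodings can be seen on a few rows. The sketch below uses the five rows from the preview; the ordered `CategoricalDtype` is an alternative way to express the same `Low` < `Medium` < `High` order and is not the method used in Q5.3.

```python
import pandas as pd

# Values from the first five rows of the preview.
sample = pd.DataFrame({
    "city_type": ["Suburban", "Urban", "Rural", "Suburban", "Suburban"],
    "satisfaction_level": ["Medium", "Low", "High", "High", "Medium"],
})

# Nominal feature: integer codes would invent an order (alphabetical here:
# Rural=0, Suburban=1, Urban=2), which is why they are a poor fit.
naive_codes = sample["city_type"].astype("category").cat.codes

# One hot encoding keeps the categories unordered: one indicator column each.
one_hot = pd.get_dummies(sample["city_type"], prefix="CITY", dtype=int)

# Ordinal feature: an ordered categorical makes Low < Medium < High explicit.
levels = pd.CategoricalDtype(["Low", "Medium", "High"], ordered=True)
satisfaction = sample["satisfaction_level"].astype(levels)
ordinal = satisfaction.cat.codes + 1  # shift to 1, 2, 3 to match the Q5.3 mapping

print(naive_codes.tolist())
print(one_hot)
print(ordinal.tolist())
```

With the ordered dtype, comparisons such as `satisfaction > "Low"` are meaningful, whereas no such comparison makes sense for `city_type`.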

>  
>  
>  



---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
- `income_std`  
- `daily_app_opens` (or its min max scaled version if you prefer)


In [21]:
# Q6.1: Build 2D vectors for first two users [This answer is already written for you]
vec_cols = ["monthly_income", "daily_app_opens"]

v1 = df.loc[0, vec_cols].values
v2 = df.loc[1, vec_cols].values

print("v1:", v1)
print("v2:", v2)

v1: [np.float64(3734.19) np.int64(48)]
v2: [np.float64(2594.19) np.int64(7)]


In [22]:
# Q6.2: Euclidean distance computation [10 marks]
# The distance is the norm of the difference vector, not the difference of the two norms.
diff = v1.astype(float) - v2.astype(float)

l2_distance = np.linalg.norm(diff)
print("The Euclidean distance between v1 & v2 is:", round(l2_distance, 2))

The Euclidean distance between v1 & v2 is: 1140.74


In [23]:
# Q6.3: Manhattan distance computation [10 marks]

# Sum of absolute component-wise differences (L1 norm of the difference vector).
diff = v1.astype(float) - v2.astype(float)

l1_distance = np.linalg.norm(diff, ord=1)
print("The Manhattan distance between v1 & v2 is:", round(l1_distance, 2))

The Manhattan distance between v1 & v2 is: 1181.0



> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

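Write your answer here: The Manhattan distance is larger than the Euclidean distance. Manhattan sums the absolute differences along each feature, while Euclidean takes the square root of the summed squared differences, which is the straight-line distance. Since (|a| + |b|)² = a² + b² + 2|a||b| ≥ a² + b², Manhattan is always at least as large as Euclidean, and the two are equal only when the vectors differ in at most one feature.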
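
The relationship between the two formulas can be checked directly. The sketch below computes both distances from the component-wise difference of the two raw vectors printed in Q6.1, written out with plain NumPy operations so each step of the formula is visible.

```python
import numpy as np

# Raw feature vectors for the first two users, copied from the Q6.1 output.
v1 = np.array([3734.19, 48.0])
v2 = np.array([2594.19, 7.0])

diff = v1 - v2                          # component-wise differences
euclidean = np.sqrt(np.sum(diff ** 2))  # L2: square root of summed squares
manhattan = np.sum(np.abs(diff))        # L1: sum of absolute differences

print("Euclidean:", round(euclidean, 2))
print("Manhattan:", round(manhattan, 2))

# Squaring the L1 distance adds the non-negative cross term 2*|a|*|b|,
# so Manhattan >= Euclidean for any pair of vectors.
assert manhattan >= euclidean
```

Because `monthly_income` is on a much larger scale than `daily_app_opens`, it dominates both distances here; using scaled features would give each feature a comparable influence.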

>  
>  
>  



---
## Final Reflection [5 marks]

> In 4 to 6 sentences, describe how the three modules connect in this assignment.  
> Mention:
> - One idea from Module 1 or 2 that you used  
> - One idea from Module 3 that you used  
> - How these ideas together help you understand a dataset more deeply

Write your reflection here: From Module 1, I used the standard deviation. From Module 2, I used the confusion matrix. From Module 3, I used encoding. Standard deviation helped me understand how spread out the data is around the mean. The confusion matrix showed me how to evaluate a model's performance through accuracy, precision, and recall. Encoding showed me how to convert categorical data into numeric form, and together these ideas let me describe the data, judge a model, and prepare features for it.

>  
>  
>  



## End of Assignment

Before submitting:
- Run all cells from top to bottom.  
- Check that all answer sections are filled.  
- Download this notebook as `.ipynb` and upload it according to the given instructions.
- ***You must read the Assignment Module text instructions in full; they explain how to submit this assignment.***
