
# Week 01 Assignment  
## Data Quality, Evaluation, Scaling, and Encoding

**Student name:**  Md Jahirul Islam

This is a small assignment that connects topics from Modules 1, 2, and 3.  
You must complete it in this Colab notebook.

You will need to use concepts that appeared in the videos:
- Module 1 and 2: basic descriptive statistics, proportions, confusion matrix, accuracy, precision, recall
- Module 3: standardization, min max scaling, nominal vs ordinal, one hot encoding, ordinal encoding, Euclidean and Manhattan distance

Please do not use any extra libraries beyond `pandas` and `numpy`.



---
## 0. Setup and Dataset

We will use a dataset that should have the columns listed below:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [21]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [33]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1OmDDCh4MD1TtvAemnwVDyz5zwCIXJ220")

# Show first few rows
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type
0,1,43,3734.19,109,48,0,0,Medium,Suburban
1,2,49,2594.19,194,7,0,0,Low,Urban
2,3,19,3550.47,146,36,1,0,High,Rural
3,4,19,3821.18,287,14,1,0,High,Suburban
4,5,63,1750.84,66,46,0,0,Medium,Suburban



### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  
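
One way to carry out check 2 is to compare the required names against the loaded column names. The sketch below is a minimal illustration in plain Python; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here and are not part of the assignment.

```python
# Hypothetical helper (not part of the assignment): report required columns
# that are absent from a collection of column names.
REQUIRED_COLUMNS = [
    "age", "monthly_income", "daily_screen_time_min", "daily_app_opens",  # numeric
    "true_label", "pred_label",                                            # labels
    "satisfaction_level", "city_type",                                     # categorical
]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that do not appear in `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Column names as they appear in the header of the loaded file.
loaded = [
    "user_id", "age", "monthly_income", "daily_screen_time_min",
    "daily_app_opens", "true_label", "pred_label",
    "satisfaction_level", "city_type",
]
print(missing_columns(loaded))  # an empty list means nothing is missing
```

In the notebook, calling `missing_columns(df.columns)` should run the same check on the loaded DataFrame.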



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [34]:
# Q1.1: Choose your numeric column here [This answer is already written for you]
num_col = "daily_screen_time_min"

df[num_col].describe()

Unnamed: 0,daily_screen_time_min
count,100.0
mean,181.89
std,68.886951
min,60.0
25%,122.0
50%,178.0
75%,243.75
max,299.0



> **Q1.2 Short answer: [5 marks]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

Write your answer here:

>  The column has 100 values, with a minimum of 60 and a maximum of 299. The mean (181.89) is close to the 50th percentile (178), so the mean is a reasonable measure of central tendency here. The standard deviation is about 68.89, so a typical value lies within roughly one standard deviation of the mean, between about 113 and 250.78. The maximum of 299 lies above that band, about 1.7 standard deviations from the mean, so it is fairly far from the mean.
>  
>  
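
The comparison between the extremes and the mean can be made concrete with the summary values printed by `describe()` above. This is a small sketch with illustrative variable names; the one standard deviation band is only a rough rule of thumb, not a property guaranteed for this column.

```python
# Summary values copied from the describe() output for daily_screen_time_min.
mean = 181.89
std = 68.886951
minimum = 60.0
maximum = 299.0

# Band of one standard deviation around the mean.
lower = mean - std
upper = mean + std

# Distance of the extremes from the mean, measured in standard deviations.
z_min = (minimum - mean) / std
z_max = (maximum - mean) / std

print(f"one-std band: {lower:.2f} to {upper:.2f}")
print(f"min is {z_min:.2f} std from the mean, max is {z_max:.2f} std from the mean")
```

Both extremes sit roughly 1.7 to 1.8 standard deviations from the mean, which suggests that neither is an extreme outlier.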



### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [35]:
# Q2.1: Compute proportion of positive class [This answer is already written for you]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)

Positive count: 52
Total samples: 100
Proportion of positive class: 0.52



> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Write your answer here:

>  The proportion of the positive class is 0.52, which is close to half. This means a randomly chosen row has a 52 percent chance of being positive in the `true_label` column, so we can consider the dataset balanced.
>  
>  



### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [36]:
# Q3.1: Manually compute TP, TN, FP, FN [This answer is already written for you]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 28
TN: 27
FP: 21
FN: 24


In [37]:
# Q3.2: Compute accuracy, precision, recall [This answer is already written for you]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.55
Precision: 0.5714285714285714
Recall: 0.5384615384615384



> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

Write your answer here:

>  The precision of the model (0.57) is slightly higher than the recall (0.54). This means the model is a little more careful when it predicts positive than it is thorough at finding all positives. A recall of 0.54 shows that the model misses almost half of the actual positives (24 of 52). An accuracy of 0.55 on a roughly balanced dataset is only slightly better than random guessing.
>  
>  
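
The three metrics can be rechecked directly from the confusion matrix counts printed above. The sketch below also adds two comparison figures that are not computed in the notebook and are included only as illustration: the share of actual positives the model misses, and the accuracy of a hypothetical baseline that always predicts the more common class.

```python
# Confusion matrix counts copied from the Q3.1 output.
tp, tn, fp, fn = 28, 27, 21, 24
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Share of actual positives that the model fails to catch (1 - recall).
miss_rate = fn / (tp + fn)

# Hypothetical baseline: always predict the more common true class.
actual_positive = tp + fn
actual_negative = tn + fp
baseline_accuracy = max(actual_positive, actual_negative) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"miss rate={miss_rate:.3f} majority baseline accuracy={baseline_accuracy:.3f}")
```

With 52 actual positives out of 100, the baseline reaches 0.52, so an accuracy of 0.55 is only a small improvement over it.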



---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling

Use one numeric column, `monthly_income`.


In [38]:
# Q4.1: Choose the numeric column [2 marks]
col = df['monthly_income']
col.head()

Unnamed: 0,monthly_income
0,3734.19
1,2594.19
2,3550.47
3,3821.18
4,1750.84


In [39]:
# Q4.2: Standardization with z-score [10 marks]
mean = col.mean()
std = col.std()
z_score = (col-mean)/std
z_score.head()

Unnamed: 0,monthly_income
0,0.944685
1,-0.324626
2,0.740126
3,1.041542
4,-1.263639


In [40]:
# Q4.3: Min max scaling implementation [10 marks]
mn = col.min()
mx = col.max()

min_max = (col-mn)/(mx-mn)
min_max.head()

Unnamed: 0,monthly_income
0,0.675209
1,0.393685
2,0.629839
3,0.696691
4,0.18542



> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

Write your answer here:

> Standardization rescales the data to have a mean of 0 and a standard deviation of 1, so the values have no fixed range and can be negative or positive (for example, -1.26 and 1.04 in the first five rows).
> Min max scaling maps the smallest value to 0 and the largest to 1, so every scaled value lies between 0 and 1.
  
>  
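
The two properties described above can be checked on a small sample. The sketch below uses only the first five `monthly_income` values shown in the Q4.1 output, so its numbers will differ from the notebook output, which uses the mean, standard deviation, minimum, and maximum of all 100 rows. It relies on the standard library `statistics` module rather than `pandas`, purely to keep the example self-contained.

```python
from statistics import mean, stdev

# First five monthly_income values shown in the Q4.1 output.
incomes = [3734.19, 2594.19, 3550.47, 3821.18, 1750.84]

# Standardization: subtract the mean, divide by the sample standard deviation.
# statistics.stdev uses the n - 1 denominator, like the pandas default for .std().
mu = mean(incomes)
sigma = stdev(incomes)
z_scores = [(x - mu) / sigma for x in incomes]

# Min max scaling: map the smallest value to 0 and the largest to 1.
lo, hi = min(incomes), max(incomes)
min_max = [(x - lo) / (hi - lo) for x in incomes]

print("z-score mean:", round(mean(z_scores), 6), "std:", round(stdev(z_scores), 6))
print("min max range:", min(min_max), "to", max(min_max))
```

The standardized values are centered on 0 with both signs present, while the min max values stay inside the interval from 0 to 1.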



### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [41]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]
city_col = df['city_type']
city_encoded = pd.get_dummies(city_col, prefix="City", dtype=int)
city_encoded.head()

Unnamed: 0,City_Rural,City_Suburban,City_Urban
0,0,1,0
1,0,0,1
2,1,0,0
3,0,1,0
4,0,1,0


In [42]:
# Q5.2: Attach one hot encoded columns to df [5 marks]
df = pd.concat([df, city_encoded], axis = 1)
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type,City_Rural,City_Suburban,City_Urban
0,1,43,3734.19,109,48,0,0,Medium,Suburban,0,1,0
1,2,49,2594.19,194,7,0,0,Low,Urban,0,0,1
2,3,19,3550.47,146,36,1,0,High,Rural,1,0,0
3,4,19,3821.18,287,14,1,0,High,Suburban,0,1,0
4,5,63,1750.84,66,46,0,0,Medium,Suburban,0,1,0


In [43]:
# Q5.3: Ordinal encoding for satisfaction_level [10 marks]
order = {'Low':1, 'Medium':2, "High":3 }

df["satisfaction_level"]=df["satisfaction_level"].map(order).astype(int)
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type,City_Rural,City_Suburban,City_Urban
0,1,43,3734.19,109,48,0,0,2,Suburban,0,1,0
1,2,49,2594.19,194,7,0,0,1,Urban,0,0,1
2,3,19,3550.47,146,36,1,0,3,Rural,1,0,0
3,4,19,3821.18,287,14,1,0,3,Suburban,0,1,0
4,5,63,1750.84,66,46,0,0,2,Suburban,0,1,0



> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Write your answer here:

>  City types have no natural order, so ranking them with numbers would be illogical. One hot encoding gives each city type its own column, with 1 in the column that matches the row's city type and 0 in the others.
>  Satisfaction levels, in contrast, can be ranked: low is less than medium, and medium is less than high. That is why we can assign increasing numbers to these levels.
>  
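
One way to see the difference is to compare distances under each encoding. The sketch below is an illustration only: the `integer_code` mapping for `city_type` is a hypothetical alternative that the notebook does not use, included to show what would go wrong if a nominal feature were given ordered numbers.

```python
import math

# One hot vectors in the same column order as the Q5.1 output:
# City_Rural, City_Suburban, City_Urban.
one_hot = {
    "Rural": (1, 0, 0),
    "Suburban": (0, 1, 0),
    "Urban": (0, 0, 1),
}

# Hypothetical integer codes for city_type, used only to show the problem.
integer_code = {"Rural": 1, "Suburban": 2, "Urban": 3}

pairs = [("Rural", "Suburban"), ("Suburban", "Urban"), ("Rural", "Urban")]
for a, b in pairs:
    d_one_hot = math.dist(one_hot[a], one_hot[b])
    d_integer = abs(integer_code[a] - integer_code[b])
    print(f"{a} vs {b}: one hot distance={d_one_hot:.3f}, integer distance={d_integer}")
```

Under one hot encoding every pair of city types is equally far apart, while the integer codes would make `Rural` and `Urban` look twice as far apart as `Rural` and `Suburban`. That kind of spacing is meaningful for `satisfaction_level`, where `Low` < `Medium` < `High`, but not for `city_type`.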



---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
- `income_std`  
- `daily_app_opens` (or its min max scaled version if you prefer)


In [44]:
# Q6.1: Build 2D vectors for first two users [This answer is already written for you]
vec_cols = ["monthly_income", "daily_app_opens"]

v1 = df.loc[0, vec_cols].values
v2 = df.loc[1, vec_cols].values

print("v1:", v1)
print("v2:", v2)

v1: [np.float64(3734.19) np.int64(48)]
v2: [np.float64(2594.19) np.int64(7)]


In [48]:
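# Q6.2: Euclidean distance computation [10 marks]
# Note: np.linalg.norm(v) is the Euclidean length of one vector measured from the origin.
# The distance between the two users is the norm of the difference v1 - v2.
Euc_dist_v1 = np.linalg.norm(v1)
print(Euc_dist_v1)
Euc_dist_v2 = np.linalg.norm(v2)
print(Euc_dist_v2)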

3734.4984878963332
2594.1994441638444


In [50]:
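# Q6.3: Manhattan distance computation [10 marks]
# Note: with ord=1, np.linalg.norm(v) sums the absolute values of one vector's components.
# The Manhattan distance between the two users is the ord=1 norm of the difference v1 - v2.
Manhat_dist_v1 = np.linalg.norm(v1, ord=1)
print(Manhat_dist_v1)
Manhat_dist_v2 = np.linalg.norm(v2, ord=1)
print(Manhat_dist_v2)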

3782.19
2601.19



> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

Write your answer here:

>  Manhattan distance is larger than Euclidean distance.
>  The Manhattan formula adds the absolute differences in each dimension, which corresponds to moving between the points one axis at a time (horizontally, then vertically). The Euclidean formula squares the differences in each dimension, adds them, and takes the square root, which corresponds to the straight-line path along the hypotenuse. Since the straight line is the shortest path between two points, the Euclidean distance is never larger than the Manhattan distance.
>  
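
Both formulas can be applied to the difference between user 0 and user 1. The sketch below is a plain Python illustration using the unscaled values printed in Q6.1; the variable names are illustrative, and it uses the `math` module instead of `numpy` only to stay self-contained.

```python
import math

# Unscaled (monthly_income, daily_app_opens) values printed in Q6.1.
v1 = (3734.19, 48)
v2 = (2594.19, 7)

# Per-dimension absolute differences between the two users.
diffs = [abs(a - b) for a, b in zip(v1, v2)]

# Euclidean: square root of the sum of squared differences.
euclidean = math.sqrt(sum(d ** 2 for d in diffs))

# Manhattan: sum of the absolute differences.
manhattan = sum(diffs)

print("differences:", diffs)
print("Euclidean:", round(euclidean, 3))
print("Manhattan:", round(manhattan, 3))
```

The income difference (about 1140) dominates the app opens difference (41), which is one reason the question suggests scaled features. The Manhattan result is the larger of the two, because the sum of two non-negative differences is never smaller than the square root of the sum of their squares.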



---
## Final Reflection [5 marks]

> In 4 to 6 sentences, describe how the three modules connect in this assignment.  
> Mention:
> - One idea from Module 1 or 2 that you used  
> - One idea from Module 3 that you used  
> - How these ideas together help you understand a dataset more deeply

Write your reflection here:

> From Module 2, I used precision, recall, and accuracy.
> From Module 3, I used min max scaling, one hot encoding, and ordinal encoding.
> Precision, recall, and accuracy show how well the model handles positive and negative cases. Comparing them with the proportion of the positive class helps account for the base rate and avoid a biased reading of the model, which matters most when judging performance on a skewed dataset.
> One hot and ordinal encoding convert categorical features to numbers, since most models cannot train directly on categorical values, and min max scaling brings numeric features to a common range. Together, these techniques let us preprocess the data before training a model and then judge the model's results sensibly.



## End of Assignment

Before submitting:
- Run all cells from top to bottom.  
- Check that all answer sections are filled.  
- Download this notebook as `.ipynb` and upload it according to the given instructions.
- ***You must read the Assignment Module text instructions fully; they explain how to submit this assignment.***
