
---
## 0. Setup and Dataset

We will use a dataset that should contain the following columns:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [1]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [2]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1OmDDCh4MD1TtvAemnwVDyz5zwCIXJ220")

# Show first few rows
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type
0,1,43,3734.19,109,48,0,0,Medium,Suburban
1,2,49,2594.19,194,7,0,0,Low,Urban
2,3,19,3550.47,146,36,1,0,High,Rural
3,4,19,3821.18,287,14,1,0,High,Suburban
4,5,63,1750.84,66,46,0,0,Medium,Suburban



### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [3]:
# Q1.1: Choose your numeric column here [Answer already provided]
num_col = "daily_screen_time_min"

df[num_col].describe()

Unnamed: 0,daily_screen_time_min
count,100.0
mean,181.89
std,68.886951
min,60.0
25%,122.0
50%,178.0
75%,243.75
max,299.0



> **Q1.2 Short answer: [Marks: 05]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

Write your answer here: According to the output, the column has 100 values with a mean daily screen time of 181.89 minutes, a minimum of 60, and a maximum of 299. The maximum is about 1.7 standard deviations above the mean, so some users spend considerably more time than average, but it is not an extreme outlier. The standard deviation of 68.89 shows that screen time varies noticeably among users.

>  
>  
>  
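
One way to judge whether the max is "far" from the mean is to express the gap in units of standard deviation. The sketch below is only an illustration: it hard-codes the summary statistics printed by `describe()` above instead of reloading the dataset, and the variable names `max_z` and `min_z` are made up for this example.

```python
# Summary statistics copied from the describe() output above
mean = 181.89
std = 68.886951
min_val = 60.0
max_val = 299.0

# Distance of each extreme from the mean, in units of standard deviation
max_z = (max_val - mean) / std
min_z = (min_val - mean) / std

print("max is", round(max_z, 2), "standard deviations above the mean")
print("min is", round(abs(min_z), 2), "standard deviations below the mean")
```

Both extremes sit within two standard deviations of the mean, which suggests the column has no extreme values on either side.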



### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [4]:
# Q2.1: Compute proportion of positive class [Answer already provided]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)

Positive count: 52
Total samples: 100
Proportion of positive class: 0.52



> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Write your answer here: 52% of the samples (52 out of 100) belong to the positive class, so the dataset is roughly balanced between 0 and 1. Neither class is much more common than the other.

>  
>  
>  
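
The same proportion can be read off with `value_counts(normalize=True)` or with the mean of the 0/1 column. This is a minimal sketch that assumes a stand-in label column rebuilt from the counts printed above (52 positives, 48 negatives); it does not load the real dataset, and the row order is arbitrary.

```python
import pandas as pd

# Stand-in label column with the same class counts as the output above
labels = pd.Series([1] * 52 + [0] * 48, name="true_label")

# normalize=True returns proportions instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions)

# The mean of a 0/1 column is also the proportion of positives
positive_proportion = labels.mean()
print("Proportion of positive class:", positive_proportion)
```

On the real data, `df["true_label"].value_counts(normalize=True)` would be the equivalent call.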



### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [5]:
# Q3.1: Manually compute TP, TN, FP, FN [Answer already provided]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 28
TN: 27
FP: 21
FN: 24


In [6]:
# Q3.2: Compute accuracy, precision, recall [Answer already provided]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.55
Precision: 0.5714285714285714
Recall: 0.5384615384615384



> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

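Write your answer here: The accuracy is 0.55, so the model is only slightly better than random guessing on this roughly balanced dataset. The precision of 0.57 means that about 43% of its positive predictions are wrong (21 false positives out of 49 predicted positives). The recall of 0.54 means it still misses nearly half of the actual positives (24 out of 52). Overall, the model is neither catching most positives nor being careful when it predicts positive.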

>  
>  
>  
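
The four counts can also be laid out as a 2x2 confusion matrix with `pd.crosstab`. The sketch below rebuilds label arrays from the TP, TN, FP, and FN counts printed above, so the row order is arbitrary and the arrays are stand-ins for the real `true_label` and `pred_label` columns.

```python
import numpy as np
import pandas as pd

# Counts copied from the Q3.1 output above
tp, tn, fp, fn = 28, 27, 21, 24

# Rebuild label arrays that reproduce those counts (row order is arbitrary)
y_true = np.array([1] * tp + [0] * tn + [0] * fp + [1] * fn)
y_pred = np.array([1] * tp + [0] * tn + [1] * fp + [0] * fn)

# Rows are actual labels, columns are predicted labels
confusion = pd.crosstab(
    pd.Series(y_true, name="actual"),
    pd.Series(y_pred, name="predicted"),
)
print(confusion)

# Accuracy is the diagonal (correct predictions) divided by the total
accuracy = np.trace(confusion.values) / confusion.values.sum()
print("Accuracy:", accuracy)
```

On the real data, `pd.crosstab(df["true_label"], df["pred_label"])` should give the same table.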



---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling

Use one numeric column, `monthly_income`.


In [7]:
# Q4.1: Choose the numeric column [2 marks]
df['monthly_income']


Unnamed: 0,monthly_income
0,3734.19
1,2594.19
2,3550.47
3,3821.18
4,1750.84
...,...
95,3975.90
96,3764.77
97,2336.44
98,3007.53


In [8]:
# Q4.2: Standardization with z-score [10 marks]
z_score = ((df['monthly_income'] - df['monthly_income'].mean()) / df['monthly_income'].std())
z_score = z_score.round(2)
print(f"z-score: {z_score}")

z-score: 0     0.94
1    -0.32
2     0.74
3     1.04
4    -1.26
      ... 
95    1.21
96    0.98
97   -0.61
98    0.14
99   -0.77
Name: monthly_income, Length: 100, dtype: float64


In [9]:
# Q4.3: Min max scaling implementation [10 marks]
mn_value = df['monthly_income'].min()
mx_value = df['monthly_income'].max()

min_max_scaling = ((df['monthly_income'] - mn_value) / (mx_value - mn_value))
print("min_max_scaling:", min_max_scaling.round(2))

min_max_scaling: 0     0.68
1     0.39
2     0.63
3     0.70
4     0.19
      ... 
95    0.73
96    0.68
97    0.33
98    0.50
99    0.30
Name: monthly_income, Length: 100, dtype: float64



> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

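Write your answer here: Min max scaling maps every value into the range 0 to 1, so the smallest income becomes 0 and the largest becomes 1. Standardized values are centered at 0 with a standard deviation of 1 and have no fixed range, so they can be negative or positive (for example, -1.26 and 1.21 in the output above). A z-score also shows how many standard deviations a value lies below or above the mean.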

>  
>  
>  
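
The difference in ranges is easy to see side by side. The sketch below applies both formulas to the five `monthly_income` values shown by `df.head()`; because the mean, standard deviation, min, and max are computed on this small sample only, the scaled numbers will not match the full-column outputs above.

```python
import pandas as pd

# The five monthly_income values shown by df.head(); statistics computed on
# this small sample differ from those of the full 100-row column.
income = pd.Series([3734.19, 2594.19, 3550.47, 3821.18, 1750.84])

# Standardization: subtract the mean, divide by the standard deviation
z_score = (income - income.mean()) / income.std()

# Min max scaling: map the minimum to 0 and the maximum to 1
min_max = (income - income.min()) / (income.max() - income.min())

summary = pd.DataFrame({"z_score": z_score, "min_max": min_max})
print(summary.round(2))
print(summary.agg(["min", "max", "mean", "std"]).round(2))
```

The second table shows the key contrast: the `z_score` column has mean 0 and standard deviation 1, while the `min_max` column runs exactly from 0 to 1.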



### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [10]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]
df = pd.get_dummies(df, columns=['city_type'], prefix="city", dtype=int)

In [11]:
# Q5.2: Attach one hot encoded columns to df [5 marks]
print(df.head())

   user_id  age  monthly_income  daily_screen_time_min  daily_app_opens  \
0        1   43         3734.19                    109               48   
1        2   49         2594.19                    194                7   
2        3   19         3550.47                    146               36   
3        4   19         3821.18                    287               14   
4        5   63         1750.84                     66               46   

   true_label  pred_label satisfaction_level  city_Rural  city_Suburban  \
0           0           0             Medium           0              1   
1           0           0                Low           0              0   
2           1           0               High           1              0   
3           1           0               High           0              1   
4           0           0             Medium           0              1   

   city_Urban  
0           0  
1           1  
2           0  
3           0  
4           0  


In [12]:
# Q5.3: Ordinal encoding for satisfaction_level [10 marks]
satisfaction_level = {'Low': 1, 'Medium': 2, 'High': 3}
df['satisfaction_level'] = df['satisfaction_level'].map(satisfaction_level)
print(df.head())

   user_id  age  monthly_income  daily_screen_time_min  daily_app_opens  \
0        1   43         3734.19                    109               48   
1        2   49         2594.19                    194                7   
2        3   19         3550.47                    146               36   
3        4   19         3821.18                    287               14   
4        5   63         1750.84                     66               46   

   true_label  pred_label  satisfaction_level  city_Rural  city_Suburban  \
0           0           0                   2           0              1   
1           0           0                   1           0              0   
2           1           0                   3           1              0   
3           1           0                   3           0              1   
4           0           0                   2           0              1   

   city_Urban  
0           0  
1           1  
2           0  
3           0  
4           0  


> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Write your answer here: One hot encoding creates a separate 0/1 dummy column for each category, which suits `city_type` because `Urban`, `Suburban`, and `Rural` have no natural order. Ordinal encoding assigns integers that preserve a meaningful ranking, which suits `satisfaction_level` because `Low` < `Medium` < `High`.

>  
>  
>  
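
As an alternative to a hand-written mapping dictionary, an ordered `pd.Categorical` makes the ranking explicit. This is a sketch on a toy frame built from the five rows shown by `df.head()`; the names `toy` and `satisfaction_code` are hypothetical and not part of the assignment.

```python
import pandas as pd

# Toy values for illustration, taken from the df.head() output above
toy = pd.DataFrame({
    "satisfaction_level": ["Medium", "Low", "High", "High", "Medium"],
    "city_type": ["Suburban", "Urban", "Rural", "Suburban", "Suburban"],
})

# Ordinal: declare the order explicitly, then use the category codes
ordered = pd.Categorical(
    toy["satisfaction_level"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)
toy["satisfaction_code"] = ordered.codes + 1  # codes start at 0, so shift to 1..3

# Nominal: one 0/1 column per city, with no order implied
toy = pd.get_dummies(toy, columns=["city_type"], prefix="city", dtype=int)
print(toy)
```

Each row has exactly one `city_*` column set to 1, while `satisfaction_code` keeps the `Low` < `Medium` < `High` ranking in a single column.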



---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
- `income_std`  
- `daily_app_opens` (or its min max scaled version if you prefer)


In [13]:
# Q6.1: Build 2D vectors for first two users [Answer already provided]
vec_cols = ["monthly_income", "daily_app_opens"]

v1 = df.loc[0, vec_cols].values
v2 = df.loc[1, vec_cols].values

print("v1:", v1)
print("v2:", v2)

v1: [3734.19   48.  ]
v2: [2594.19    7.  ]


In [14]:
# Q6.2: Euclidean distance computation [5 marks]
# Euclidean distance is the L2 norm of the difference vector
euclidean_distance = np.linalg.norm(v1 - v2)
print("Euclidean distance:", euclidean_distance.round(2))

Euclidean distance: 1140.74


In [15]:
# Q6.3: Manhattan distance computation [5 marks]
manhattan_distance = np.sum(np.abs(v1 - v2))
print("Manhattan distance:", manhattan_distance.round(2))

Manhattan distance: 1181.0



> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

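Write your answer here: The Manhattan distance (1181.0) is larger than the Euclidean distance (1140.74). Euclidean distance measures the straight-line distance between two points, while Manhattan distance adds up the absolute differences along each axis, as if moving only horizontally and vertically. Since a straight line is the shortest path between two points, the Manhattan distance is always greater than or equal to the Euclidean distance.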

>  
>  
>  
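
Both distances can be written in terms of the per-feature absolute differences, which makes the comparison concrete. The sketch below hard-codes the two vectors printed in Q6.1, which hold raw `monthly_income` and `daily_app_opens` values; the `diff` variable and the income share at the end are additions for illustration.

```python
import numpy as np

# Vectors copied from the Q6.1 output: [monthly_income, daily_app_opens]
v1 = np.array([3734.19, 48.0])
v2 = np.array([2594.19, 7.0])

diff = np.abs(v1 - v2)  # per-feature absolute differences

euclidean = np.sqrt(np.sum(diff ** 2))  # same value as np.linalg.norm(v1 - v2)
manhattan = np.sum(diff)

print("Per-feature differences:", diff)
print("Euclidean:", round(euclidean, 2))
print("Manhattan:", round(manhattan, 2))

# Share of the Manhattan distance that comes from income alone
print("Income share:", round(diff[0] / manhattan, 3))
```

Because the income values are unscaled, almost all of the distance comes from the income difference and the two distances end up close to each other. Using scaled features would let `daily_app_opens` contribute on a comparable footing.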
