<div class='heading'>
    <div style='float:left;'><h1>CPSC 4300/6300: Applied Data Science</h1></div>
    <img style="float: right; padding-right: 10px; width: 65px" src="https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/images/clemson_paw.png"> </div>

# Week 9 | Lab: Supervised Classification

**Clemson University** <br>
**Instructor(s):** Tim Ransom <br>

------------------------------------------------------------------------
## Learning objectives

- Differentiate between supervised and unsupervised classification methods.
- Implement a logistic regression model for classification.
- Evaluate the performance of a classification model using metrics like accuracy.
- Compare the performance of different classification algorithms.
- Preprocess data for classification tasks, including normalization.

In [None]:
""" RUN THIS CELL TO GET THE RIGHT FORMATTING """
import requests
from IPython.display import HTML
css_file = 'https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/css/cpsc6300.css'
styles = requests.get(css_file).text
HTML(styles)

In [None]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

import statsmodels.api as sm
from statsmodels.api import OLS

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from matplotcheck.base import PlotTester
from matplotlib.patches import PathPatch

## Part 1: The AirBnB NYC 2019 Dataset + EDA

The dataset contains information about AirBnB listings in NYC from 2019.
There are roughly 49k listings, each with 16 features:

-   **id:** listing ID
-   **name:** name of the listing
-   **host_id:** host ID
-   **host_name:** name of the host
-   **neighbourhood_group:** NYC borough
-   **neighbourhood:** neighborhood
-   **latitude:** latitude coordinates
-   **longitude:** longitude coordinates
-   **room_type:** listing space type (e.g., private room,
    entire home)
-   **price:** price in dollars per night
-   **minimum_nights:** number of min. nights required for
    booking
-   **number_of_reviews:** number of reviews
-   **last_review:** date of the last review
-   **reviews_per_month:** number of reviews per month
-   **calculated_host_listings_count:** number of listings the
    host has
-   **availability_365:** number of days the listing is
    available for booking

Our goal is to predict whether unseen housing units are
'affordable' or 'unaffordable' by using their features. We will assume
that this task is for a particular client who has a specific budget and
would like to simplify the problem by classifying any unit that costs
less than \$150 per night as 'affordable' and any unit that costs \$150
or more as 'unaffordable'.

For this task, we will exercise our normal data science pipeline – from
EDA to modelling and visualization. In particular, we will show the
performance of two different classifiers:

-   `Linear Regression`
-   `Logistic Regression`

### Read-in and checking

We do the usual read-in and verification of the data:

In [None]:
df = pd.read_csv("data/nyc_airbnb.csv")
df.head()

## Building the training/dev/testing data

As usual, we split the data before we begin our analysis. It would be
unfair to cheat by looking at the testing data. Let's divide the data
into 60% training, 20% development (aka validation), 20% testing.
However, before we split the data, let's make a simple transformation
and convert the prices into two categories: *affordable* or
not.

In [None]:
df['affordable'] = np.where(df['price'] < 150, 1, 0)
df.head()

**NOTE:** The `affordable` column now has a value of 1 whenever the
price is \< 150, and 0 otherwise.
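
To see how the rule treats prices on and around the cutoff, here is a minimal sketch on a hypothetical four-row frame (the `toy` frame and its prices are made up for illustration; they are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical prices chosen to straddle the $150 cutoff.
toy = pd.DataFrame({"price": [60, 149, 150, 400]})

# Same rule as above: strictly below 150 is affordable (1), otherwise 0.
toy["affordable"] = np.where(toy["price"] < 150, 1, 0)
print(toy)

# Share of each class; a lopsided split is one reason to stratify later.
print(toy["affordable"].value_counts(normalize=True))
```

Note that a unit priced at exactly 150 falls on the 'unaffordable' side, because the comparison is strict.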

Also, the feature named `neighbourhood_group` can be easily confused
with `neighbourhood`, so let's go ahead and rename it to `borough`, as
that is more distinct:

In [None]:
df.rename(columns={"neighbourhood_group": "borough"}, inplace=True)
df.head()

Without looking at the full data yet, let's just ensure our prices are
within valid ranges:

In [None]:
df['price'].describe()

Uh-oh. We see that `price` has a minimum value of \$0. I highly doubt
any unit in NYC is free. These data instances are garbage, so let's go
ahead and remove any instance that has a price of \$0.

In [None]:
print("original size:", df.shape)
df = df.loc[df['price'] != 0]
print("new size:", df.shape)

Now, let's split the data while ensuring that our test set has a fair
distribution of affordable units, then further split our training set so
as to create the development set:

In [None]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42, stratify=df['affordable'])
df_train, df_dev = train_test_split(df_train, test_size=0.25, random_state=99)

# ensure our dataset splits are of the % sizes we want
total_size = len(df_train) + len(df_dev) + len(df_test)
print("train:", len(df_train), "=>", len(df_train) / total_size)
print("dev:", len(df_dev), " =>", len(df_dev) / total_size)
print("test:", len(df_test), "=>", len(df_test) / total_size)
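
The second split uses `test_size=0.25` because 25% of the remaining 80% is 20% of the whole. The sketch below checks that arithmetic on synthetic labels (the `toy` frame is an assumption for illustration). Unlike the cell above, it also stratifies the second split, which is an optional variation rather than what the lab does:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the listings table: 1,000 rows, ~70% labelled 1.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"affordable": (rng.random(1000) < 0.7).astype(int)})

# Stage 1: hold out 20% for testing, stratified on the label.
rest, test = train_test_split(toy, test_size=0.2, random_state=42,
                              stratify=toy["affordable"])
# Stage 2: 25% of the remaining 80% is 20% of the whole.
train, dev = train_test_split(rest, test_size=0.25, random_state=99,
                              stratify=rest["affordable"])

# Fraction of the data and share of positives in each part.
for name, part in [("train", train), ("dev", dev), ("test", test)]:
    print(name, len(part) / len(toy), round(part["affordable"].mean(), 3))
```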


Let's remove the target value (i.e., **affordable**) from our current
dataframes and create it as separate prediction dataframes.

In [None]:
# training
x_train = df_train.drop(['price', 'affordable'], axis=1)
y_train = pd.DataFrame(data=df_train['affordable'], columns=["affordable"])

# dev
x_dev = df_dev.drop(['price', 'affordable'], axis=1)
y_dev = pd.DataFrame(data=df_dev['affordable'], columns=["affordable"])

# test
x_test = df_test.drop(['price', 'affordable'], axis=1)
y_test = pd.DataFrame(data=df_test['affordable'], columns=["affordable"])

From now onwards, we will do EDA and cleaning based on the training set,
`x_train`.

In [None]:
for col in x_train.columns:
    print(col, ":", x_train[col].isnull().sum())

It appears ~6k of the rows have missing values concerning the reviews.
It seems impossible to impute the `last_review` feature with reasonable
values, as this is very specific to each unit. At best, we could guess
the date based on `reviews_per_month`, but that feature is missing
for the same rows. Further, it might be difficult to replace
`reviews_per_month` with reasonable values. Sure, we could fill in
the median value, but generalizing that heavily seems wrong,
especially for over 20% of our data. Consequently, let's just
drop these two columns.

In [None]:
x_train = x_train.drop(['last_review', 'reviews_per_month'], axis=1)
x_dev = x_dev.drop(['last_review', 'reviews_per_month'], axis=1)
x_test = x_test.drop(['last_review', 'reviews_per_month'], axis=1)
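
One way to back up the "missing for the same rows" observation is to compare the null masks directly. This is a minimal sketch on a hypothetical five-row frame (the values are invented; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which the two review columns are missing together.
toy = pd.DataFrame({
    "number_of_reviews": [0, 3, 0, 12, 5],
    "last_review": [None, "2019-05-01", None, "2019-06-20", "2019-01-15"],
    "reviews_per_month": [np.nan, 0.4, np.nan, 1.2, 0.1],
})

# Fraction of missing values per column.
missing_share = toy.isnull().mean()
print(missing_share)

# Do the two columns go missing on exactly the same rows?
same_rows = (toy["last_review"].isnull() == toy["reviews_per_month"].isnull()).all()
print("missing on the same rows:", same_rows)
```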

Let's look at the summary statistics of the data:

In [None]:
x_train.describe()

Next, we see that the `minimum_nights` feature has a maximum value of
1,250. That's almost 3.5 years, which is probably longer than the
duration that most people rent an apartment. This seems anomalous and
wrong. Let's discard it and other units that are outrageous. Well, what
constitutes 'outrageous'? We see that the standard deviation for
`minimum_nights` is 21.24. If we assume our values are normally
distributed, then keeping only values within 2 standard deviations of
the mean would retain ~95% of the original data. However, we have no
reason to believe our data is actually normally distributed, especially
since our mean is only 7 while the values cannot be negative, which
points to a strongly right-skewed distribution. To get a better idea of
our actual values, let's plot them as a histogram.

In [None]:
fig, ax = plt.subplots(1,1)
ax.hist(x_train['minimum_nights'], 25, log=True)
plt.xlabel('minimum_nights')
plt.ylabel('count')

Yea, that instance was a strong outlier, and the host was being
ridiculously greedy. That's a clever way to get out a multi-year lease.
Notice that we are using a log scale. Clearly, a lot of our mass is from
units with fewer than 365 days. To get a better sense of that subset, let's
re-plot only units with `minimum_nights` \< 365 days.

In [None]:
subset = x_train['minimum_nights']<365
fig, ax = plt.subplots(1,1)
ax.hist(x_train['minimum_nights'][subset], 30, log=True)
plt.xlabel('minimum_nights')
plt.ylabel('count')

Ok, that doesn't look too bad, as most units require \< 30 nights. It's
surprising that some hosts list an unreasonable requirement for the
minimum number of nights. There is a risk that any host that lists such
an unreasonable value might also have other incorrect information.
Personally, I think anything beyond 30 days could be suspicious. If we
were to exclude any unit that requires more than 30 days, how many
instances would we be ignoring?

In [None]:
len(x_train.loc[x_train['minimum_nights']>30])

Alright, we'd be throwing away 436 out of our ~30k entries. That's
roughly 1.5% of our data. While we generally want to keep and use as
much data as we can, I think this is an okay amount to discard,
especially considering (1) we have a decently large amount of data
remaining, and (2) the entries beyond a 30-day-min could be unreliable.

In [None]:
good_subset = x_train['minimum_nights'] <= 30
x_train = x_train.loc[good_subset]
y_train = y_train.loc[good_subset]

Notice that we only trimmed our training data, not our development or
testing data. I am making this choice because in real scenarios, we
would not know the nature of the testing data values. We pre-processed
our data to ignore all rows that have a price of \$0, and to ignore
certain columns (even in the testing set), but that was fair
because those were obvious, bogus elements of the
dataset. However, it would be unfair to inspect the values of the
training set and then further trim the development and testing sets
accordingly, conditioned on certain data values.
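
When trimming rows, the features and labels have to stay aligned. The sketch below shows, on hypothetical toy frames (`x_toy` and `y_toy` are illustrative names), that a boolean mask built from the features can be applied to both frames because `.loc` aligns on index labels:

```python
import pandas as pd

# Toy features and labels sharing a non-contiguous index, as after a shuffled split.
x_toy = pd.DataFrame({"minimum_nights": [2, 45, 7, 365, 30]}, index=[10, 4, 8, 1, 6])
y_toy = pd.DataFrame({"affordable": [1, 0, 1, 0, 1]}, index=[10, 4, 8, 1, 6])

# Boolean mask computed from the features only.
keep = x_toy["minimum_nights"] <= 30

# .loc with a boolean Series aligns on index labels, so both frames drop the same rows.
x_toy = x_toy.loc[keep]
y_toy = y_toy.loc[keep]

print(x_toy.index.tolist(), y_toy.index.tolist())
```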

The remaining columns of our training data all have reasonable summary
statistics. None of the min's or max's are cause for concern, and we
have no reason to assert a certain distribution of values. Since all the
feature values are within reasonable ranges, and there are no missing
values (NaNs) remaining, we can confidently move forward. To recap, our
remaining columns are now:

In [None]:
[col for col in x_train.columns] # easier to read vertically than horizontally

We don't have a terribly large number of features. This allows us to
inspect every pairwise interaction. A scatter-plot matrix is great for this, as
it provides us with a high-level picture of how every pair of features
correlates. If any subplot of features depicts a linear relationship
(i.e., a clear, concise path with mass concentrated together), then we
can assume there exists some collinearity – that the two features
overlap in what they are capturing and that they are not independent
from each other.

In [None]:
scatter_matrix(x_train, figsize=(30,20));
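
A correlation matrix can complement the visual check with numbers. The following is a sketch on synthetic data, not on `x_train`. The column names and the 0.8 cutoff are arbitrary assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic features: "b" is nearly a linear function of "a"; "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
    "label": ["x"] * 500,  # non-numeric column, excluded below
})

# Pearson correlation between numeric columns only.
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))

# List each pair once (upper triangle) whose absolute correlation exceeds 0.8.
cols = corr.columns
high = [(cols[i], cols[j])
        for i in range(len(cols)) for j in range(i + 1, len(cols))
        if abs(corr.iloc[i, j]) > 0.8]
print(high)
```

Keep in mind that Pearson correlation only captures linear association, so it should be read alongside the plots rather than instead of them.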

------------------------------------------------------------------------

## Part 2: Predicting with Linear Regression

Now, let's actually use our features to make more informed predictions.
Since our model needs to use numeric values, not textual ones, let's use
**ONLY** the following features for our linear model:

-   `borough`, using 1-hot encodings. There are 5 distinct boroughs, so
    represent them via 4 unique columns.
-   `latitude`
-   `longitude`
-   `room_type`, using 1-hot encodings. There are 3 distinct
    room types, so represent them via 2 unique columns.
-   `minimum_nights`
-   `number_of_reviews`
-   `calculated_host_listings_count`
-   `availability_365`

<div class="exercise"><b>Exercise 1:</b> </div>  

Prepare the dataset for **classification** by selecting relevant features and applying one-hot encoding.

### **Instructions:**  
1. **Modify `x_train`, `x_dev`, and `x_test`** to contain only the following features:  
   - **Categorical Features** (to be one-hot encoded):  
     - `borough` (NYC borough)  
     - `room_type` (Type of listing: entire home, private room, shared room)  
   - **Numeric Features**:  
     - `latitude`  
     - `longitude`  
     - `minimum_nights`  
     - `number_of_reviews`  
     - `calculated_host_listings_count`  
     - `availability_365`  

2. **Apply one-hot encoding**:  
   - Convert `borough` and `room_type` into one-hot encoded variables.  
   - Drop the **first category** for each (`drop_first=True`) to avoid dummy variable traps.  

3. **Ensure final dataset structure**:  
   - Remove unnecessary columns: `id`, `name`, `host_id`, `host_name`, and `neighbourhood`.  
   - Ensure the final shape of `x_train` is **(28,894, 12)**.  

**Hint:**  
- Use `pd.get_dummies()` for **one-hot encoding**.  
- Use `.drop()` to **remove unnecessary columns**.  
- Refer to [`pandas.get_dummies()` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).  
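
As a rough illustration of what `drop_first=True` does, and of one pitfall when encoding several splits, here is a sketch on tiny hypothetical frames (`train_toy` and `dev_toy` are made up, with only four boroughs). It assumes the dev split may lack a category that the training split has. In that case, encoding the dev split without `drop_first` and then reindexing to the training columns keeps the same reference category in both:

```python
import pandas as pd

# Toy training and dev frames; the dev frame happens to lack two boroughs.
train_toy = pd.DataFrame({
    "borough": ["Bronx", "Brooklyn", "Manhattan", "Queens"],
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "minimum_nights": [1, 3, 2, 30],
})
dev_toy = pd.DataFrame({
    "borough": ["Brooklyn", "Manhattan"],
    "room_type": ["Private room", "Entire home/apt"],
    "minimum_nights": [2, 5],
})

# drop_first=True removes the first (alphabetical) level of each categorical column.
train_enc = pd.get_dummies(train_toy, columns=["borough", "room_type"], drop_first=True)

# Encode dev with every level kept, then adopt the training columns:
# extra dev columns are discarded and categories unseen in dev become all-zero columns.
dev_enc = pd.get_dummies(dev_toy, columns=["borough", "room_type"])
dev_enc = dev_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(dev_enc)
```

If every split contains every category, as is likely with only five boroughs and three room types, calling `pd.get_dummies(..., drop_first=True)` on each split separately gives the same columns and the reindex step is unnecessary.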


In [None]:
"""Write your code for exercise-1 here:"""

# your code here
raise NotImplementedError

<div class="exercise"><b>Exercise 2:</b> </div>

Perform **multiple linear regression** and evaluate its performance on the **development set**.

1. **Create and train a Linear Regression model**:  
   - Instantiate a `LinearRegression` model named **`model`**.  
   - Train the model using the **training set** (`x_train`, `y_train`).  

2. **Compute the R² scores**:  
   - Store the **R² score** for the **training set** (`x_train`, `y_train`) in a variable named **`train_r2`**.  
   - Store the **R² score** for the **development set** (`x_dev`, `y_dev`) in a variable named **`dev_r2`**.  

3. **Make predictions and evaluate accuracy**:  
   - Use `.predict()` on `x_dev` and store the predictions in `y_pred_dev`.  
   - Convert predictions into binary classifications using **0.5 as the threshold**:
     - **Predictions ≥ 0.5 → Affordable (1)**
     - **Predictions < 0.5 → Unaffordable (0)**
   - Compute the **accuracy score** and store it in **`accuracy`**.  

4. **Print the results**:
   - Print **R² scores** (`train_r2` and `dev_r2`).  
   - Print the **accuracy percentage**.  

**Hint:**  
- Use `.score()` to calculate the R² value.  
- Use `accuracy_score()` from `sklearn.metrics` to compute accuracy.  
- Refer to [`LinearRegression()` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) and [`accuracy_score()` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).  
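
The thresholding step can be sketched on synthetic data. This example uses `make_classification` rather than the lab's data, and its variable names are illustrative, so it is not the exercise solution:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the lab data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

reg = LinearRegression().fit(X_tr, y_tr)

# The regression output is a real number, not a class label.
raw = reg.predict(X_va)
print("raw prediction range:", raw.min().round(2), raw.max().round(2))

# Thresholding at 0.5 converts it to 0/1 so accuracy can be computed.
labels = (raw >= 0.5).astype(int)
acc = accuracy_score(y_va, labels)
print("accuracy:", acc)
```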


In [None]:
"""Write your code for exercise-2 here:"""

# your code here
raise NotImplementedError

<div class="theme"> <b>Question 1:</b> </div>  

After evaluating the **Linear Regression model**, the **accuracy on the development set** was approximately **77%**. What does this accuracy indicate about the model's performance? (Select the most appropriate answer.)

- 1. The accuracy is **very high**, meaning the model is excellent and needs no improvement.  
- 2. The accuracy is **reasonable**, but **Linear Regression may not be ideal** for classification problems.  
- 3. The accuracy is **too low**, suggesting there is **no relationship** between features and affordability.  
- 4. The accuracy is **irrelevant**, as Linear Regression is always the best model for classification tasks.  

**Store your answer in an integer variable named `answer` in the below code cell.**

In [None]:
# your code here
raise NotImplementedError

<div class="exercise"><b>Exercise 3:</b> </div> 

Analyze and visualize the **residuals** for both the **training** and **development** sets.

1. **Compute residuals**:  
   - Residuals are calculated as the difference between actual and predicted values:  
     $$
     \text{Residual} = \text{Actual Value} - \text{Predicted Value}
     $$
   - Use the trained **Linear Regression model** (`model`) to generate predictions:  
     - Compute **training residuals** and store them in `training_residuals`.  
     - Compute **development residuals** and store them in `dev_residuals`.  

2. **Create a figure with two histograms**:  
   - Use `plt.subplots(1, 2, figsize=(15, 5))` to create a **single figure** with **two subplots**:
     - **Left plot** → Histogram of **training residuals**.  
     - **Right plot** → Histogram of **development residuals**.  

3. **Customize the plots**:  
   - Set **titles** for each subplot (`"Histogram of Training Residuals"` and `"Histogram of Development Residuals"`).  
   - Label the **x-axis** as `"Residuals"` and the **y-axis** as `"Frequency"`.  
   - Use **bins=20** for better visualization.  
   - Add a **horizontal reference line at zero** (`axhline(0)`) for comparison.  

4. **Display the figure** using `plt.show()`.  

5. **Print the minimum residual value** in the development set.  

**Hint:**  
- Use `ax.hist()` to create **histograms** in each subplot.  
- Use `ax.axhline(0, color='black', linewidth=2)` to **add a reference line at zero**.  
- Refer to [`matplotlib.pyplot.subplots()` documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html).  


In [None]:
"""Write your code for exercise-3 here:"""

# your code here
raise NotImplementedError

<div class="theme"> <b>Question 2:</b> </div> 

Based on the residual plots from **Exercise 3**, do the residuals adhere to the assumptions of a **linear model**? (Select the most appropriate answer.)

1. **Yes, the residuals are randomly distributed, supporting the linear model assumption.**  
2. **No, the residuals show patterns, suggesting the linear model may not be the best choice.**  

**Store your answer in an integer variable named `answer` in the below code cell.**


In [None]:
# your code here
raise NotImplementedError

## **Part 3: Binary Logistic Regression**

Linear regression is a useful **baseline model**, but since our target variable is **binary** (0 or 1), we need to use **Binary Logistic Regression** instead.  

### **Why Logistic Regression?**  
- **Handles binary classification problems effectively.**  
- **Outputs probabilities** that can be converted into class labels (0 or 1).  
- **Applies the logistic (sigmoid) function** to model the relationship between input features and the probability of a class.  
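
The logistic (sigmoid) function mentioned above can be written in a few lines. In this sketch, `z` stands for the linear score that the model computes from the features, and the sample values are arbitrary:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# z stands for the linear score (weights times features, plus intercept).
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
p = sigmoid(z)
print(p.round(3))

# A probability threshold of 0.5 corresponds to a score threshold of z = 0.
print((p >= 0.5).astype(int))
```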

### **Implementing Logistic Regression with `sklearn`**  
In this section, we will:  
1. **Import the necessary classes** from `sklearn`.  
2. **Instantiate and fit a Logistic Regression model** using our training data.  
3. **Evaluate the model** using accuracy metrics.  
4. **Analyze feature importance** using model coefficients.  

---

### **Step 1: Import Required Libraries**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### Step 2: Instantiate and Fit the Model

In [None]:
# Create an instance of Logistic Regression
lr = LogisticRegression(solver="lbfgs", max_iter=1000)

In [None]:
# Train the model using the training set
lr.fit(x_train, y_train['affordable'])

<div class="exercise"><b>Exercise 4:</b> </div>  

Train a **binary logistic regression model**, make predictions, and evaluate its accuracy.
 
1. **Use the above trained `LogisticRegression` model (`lr`)** to make predictions on **x_dev**.  
   - Store the predictions in a variable named **`y_dev_pred`**.  

2. **Compute the accuracy of the model**:  
   - Use `accuracy_score()` from `sklearn.metrics` to calculate accuracy.  
   - Store the accuracy in a variable named **`lr_accuracy`**.  

3. **Print the model’s performance details**:  
   - Print the **accuracy score**.  
   - Print the **number of coefficients** in the model.  
   - Print all **coefficient values** along with their corresponding feature names.  

4. **Experiment with different model parameters**:  
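   - Try different values for the **`C` parameter** (e.g., from `0.0001` to `100 million`; `C` must be strictly positive, so `0` itself is not allowed).  
   - Test different **max iterations** (`5 to 5000`).  
   - Compare the effects of **L1 (Lasso) vs. L2 (Ridge) regularization**. Note that the `lbfgs` solver used above supports only L2, so L1 needs a solver such as `liblinear` or `saga`.  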

5. **Analyze and interpret the results**:  
   - How does the **accuracy compare** to the linear regression model?  
   - What happens when you change the **inverse regularization strength (`C`)**? Smaller values of `C` mean stronger regularization.  
   - Does increasing **max iterations** always improve performance?  

**Hint:**  
- Use `.predict()` to generate predictions.  
- Use `.coef_` to examine feature importance.  
- Refer to [`LogisticRegression()` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) and [`accuracy_score()` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).  
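
To get a feel for what `C` does before experimenting on the lab data, here is a sketch on synthetic data (the three `C` values are arbitrary picks). It keeps the default L2 penalty and only varies `C`, tracking the size of the coefficient vector:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the lab data.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

# Smaller C means stronger regularization, which shrinks the coefficients.
norms = {}
for C in [0.001, 1.0, 1000.0]:
    clf = LogisticRegression(C=C, solver="lbfgs", max_iter=1000).fit(X_tr, y_tr)
    norms[C] = np.linalg.norm(clf.coef_)
    print(f"C={C:g}  dev accuracy={clf.score(X_va, y_va):.3f}  ||coef||={norms[C]:.3f}")
```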


In [None]:
"""Write your code for exercise-4 here:"""

# your code here
raise NotImplementedError

The results here should show that, for this dataset, logistic regression
performs effectively the same as linear regression. There
are two main takeaways from this:

-   Logistic regression should not be viewed as being *superior* to
    linear regression; it should be viewed as a solution to a different
    type of problem – **classification** (predicting categorical
    outputs), not **regression** (predicting continuous-valued outputs).
-   In our situation, our two categories/classes (affordable or not) had
    an ordinal nature. That is, the continuum of prices directly aligned
    with the structure of our two classes. Alternatively, you could
    imagine other scenarios where our two categories are nominal and
    thus un-rankable (e.g., predicting cancer or not, or predicting
    which NYC borough an AirBnB is in based on its property features).
    
### **Key Takeaways**  

The **Logistic Regression model** achieved an accuracy of **77.39%**, using **12 features**.  

#### **Key Observations:**  
1. **Logistic Regression vs. Linear Regression**  
   - The results indicate that **Logistic Regression** performed **similarly** to **Linear Regression** in this dataset.  
   - This suggests that **linear regression** can still be an effective baseline, even for binary classification problems.  

2. **When to Use Logistic Regression?**  
   - **Logistic Regression is not necessarily superior to Linear Regression**. Instead, it is suited for **classification problems** where the output is categorical.  
   - In contrast, **Linear Regression** is best used when predicting **continuous numerical values**.  

3. **Nature of Our Classification Task**  
   - In this case, the **two classes (affordable vs. unaffordable)** were **ordinal** (ordered by price).  
   - If the classification task involved **nominal categories** (e.g., **predicting boroughs** or **disease presence**), then **Logistic Regression would be the more appropriate choice** over Linear Regression.  


## Part 4 (The Real Challenge): Multiclass Classification

Before we move on, let's consider a more common use case of logistic
regression: predicting not just a binary variable, but what level a
categorical variable will take. Instead of breaking the price variable
into two classes (affordable being true or false), we may care for more
fine-level granularity.

<div class="exercise"><b>Exercise 5:</b> </div> 

Prepare the dataset for **multi-class classification** by creating **categorical price levels**.

1. **Create copies** of the original dataset (`x_train`, `x_dev`) for multi-class classification:  
   - `x_train_multiclass`, `y_train_multiclass` (for training set)  
   - `x_dev_multiclass`, `y_dev_multiclass` (for development set)  

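2. **Create a new categorical column `price_level`** based on the `price` column (it was dropped from `x_train` and `x_dev`, so take it from `df_train` and `df_dev`, matching rows by index) using the following 5 categories: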
   - **0 → Budget** (`price < 80`)
   - **1 → Affordable** (`80 ≤ price < 120`)
   - **2 → Average** (`120 ≤ price < 180`)
   - **3 → Expensive** (`180 ≤ price < 240`)
   - **4 → Very Expensive** (`price ≥ 240`)

3. **Use `pd.cut()`** to bin the `price` values into these **5 categories**.

4. **Ensure that**:
   - `x_train_multiclass` and `x_dev_multiclass` contain **all original features** (excluding `price_level`).  
   - `y_train_multiclass` and `y_dev_multiclass` contain **only the `price_level` column**.

### **Why Multi-Class Classification?**  
- Instead of predicting whether a property is simply **affordable (0 or 1)**, this new approach **categorizes prices into multiple levels**.  
- This allows for **a more granular analysis** of property pricing and could improve predictive power.

**Hint:**  
- Use `pd.cut()` to **efficiently bin continuous variables** into categories.  
- Refer to [`pandas.cut()` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html).  
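
The bin edges are the part of `pd.cut()` that is easiest to get wrong. The sketch below applies the five categories above to a handful of hypothetical prices placed on and around each edge. In the lab, the prices would presumably come from `df_train` and `df_dev`, since `price` was dropped from `x_train` earlier:

```python
import numpy as np
import pandas as pd

# Hypothetical prices placed on and around each bin edge.
prices = pd.Series([79, 80, 119, 120, 180, 240, 1000])

# right=False makes each bin closed on the left: [0, 80), [80, 120), ..., [240, inf).
edges = [0, 80, 120, 180, 240, np.inf]
levels = pd.cut(prices, bins=edges, right=False, labels=False)

print(pd.DataFrame({"price": prices, "price_level": levels}))
```

With the default `right=True`, a price of exactly 80 would land in the lower bin instead, which would not match the category definitions above.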



In [None]:
"""Write your code for exercise-5 here:"""

# your code here
raise NotImplementedError

<div class="exercise"><b>Exercise 6:</b> </div>

**Multi-Class Logistic Regression**  

In this exercise, you will perform **logistic regression** on the newly created **multi-class price categories** and evaluate the model’s performance. Unlike binary classification, this task involves multiple class labels, making it a **multi-class classification problem**.

---

### **Steps to Complete the Exercise**
#### **1. Train a Multi-Class Logistic Regression Model**
- Instantiate a **Logistic Regression** model named **`lr_8`**.
- Use **`x_train_multiclass`** (features) and **`y_train_multiclass`** (multi-class labels) to train the model.

#### **2. Make Predictions**
- Use the trained model to predict the multi-class labels for **`x_dev_multiclass`**.
- Store the predictions in **`y_dev_pred_multiclass`**.

#### **3. Compute Accuracy**
- Calculate the **accuracy score** between **predicted labels** and **true labels** (`y_dev_multiclass`).
- Store the result in **`lr_accuracy_multiclass`**.
- Print the accuracy score.

---

### **References**
**LogisticRegression() documentation** - [Scikit-learn Docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)  
**accuracy_score() documentation** - [Scikit-learn Docs](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)  

---


In [None]:
"""Write your code for exercise-6 here:"""

# your code here
raise NotImplementedError

In [None]:
print("Accuracy score: {:.3f}".format(lr_accuracy_multiclass))
print("Coefficient Count: {}".format(len(lr_8.coef_[0])))
print("Feature Count: {}".format(x_dev.shape[1]))

In [None]:
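# Coefficients of the multi-class model for the first class (price_level 0);
# lr_8.coef_ holds one row of coefficients per class.
for feature, coef in zip(x_dev_multiclass.columns, lr_8.coef_[0]):
    print("Feature: {}".format(feature))
    print("Coef: {:.5f}\n".format(coef))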
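
In a multi-class model, `coef_` has one row per class rather than a single row. The following sketch uses a synthetic 3-class problem (not the lab's five price levels) to show the shapes involved:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem with 6 features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# One row of coefficients per class, one column per feature.
print("coef_ shape:", clf.coef_.shape)
print("classes:", clf.classes_)

# predict_proba returns one probability per class; each row sums to 1.
proba = clf.predict_proba(X[:5])
print(proba.round(3))

# predict picks the class with the largest probability.
pred = clf.predict(X[:5])
print(pred, clf.classes_[proba.argmax(axis=1)])
```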

### **Conclusion and Next Steps**  

Despite having **five distinct price categories**, our **multi-class classification model** performs reasonably well! However, there are several ways we could **further improve performance**:  

1. **Cross-validation**:  
   - Instead of relying on a single train-test split, we could apply **k-fold cross-validation** to evaluate model performance more robustly.  

2. **Feature Engineering**:  
   - **One-hot encoding** for the `neighbourhood` feature could help capture **fine-level location granularity**, as neighborhoods likely correlate strongly with price.  
   - However, we should ensure we have **enough representative data** for each neighborhood before introducing it as a feature.  

3. **Feature Scaling**:  
   - **Longitude and latitude** contain valuable **spatial information**, but their range is quite small.  
   - Scaling these features **between 0 and 1** might improve the model's ability to distinguish fine-grained location-based pricing trends.  

4. **Handling Imbalanced Importance of Classes**:  
   - In **some classification problems**, **certain classes may be more critical** than others.  
   - Example: In medical diagnoses, **false negatives** (missing a disease) are more dangerous than **false positives** (false alarms).  
   - We could **adjust class weights** during training to **prioritize more important categories** and **balance misclassification penalties**.  

5. **Threshold Tuning for Decision Making**:  
   - Instead of always predicting the category with the highest probability, we can **analyze model performance across different thresholds**.  
   - By **plotting precision-recall curves** and adjusting the decision threshold, we could better control **false positives vs. false negatives** depending on the application.  
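
Several of these ideas combine naturally. The sketch below runs on synthetic data and is not tuned for the AirBnB problem. It puts min-max scaling and a class-weighted logistic regression into one pipeline and scores it with 5-fold cross-validation. The choice of `class_weight="balanced"` and `cv=5` is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic 3-class data standing in for the price-level problem.
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

# Scaling lives inside the pipeline, so each fold fits the scaler
# on its own training portion only.
pipe = make_pipeline(
    MinMaxScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

# 5-fold cross-validated accuracy instead of a single train/dev split.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.round(3), "mean:", scores.mean().round(3))
```

Keeping the scaler inside the pipeline avoids leaking statistics from the held-out fold into the training fold.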

---

### **Final Takeaway**  
In this lab, we explored:
- **Binary classification** using **Logistic Regression**.  
- **Multi-class classification** and performance evaluation.  
- **Feature engineering, scaling, and class weight adjustments** for improvement.  
- **Real-world considerations** in classification models, including cost-sensitive learning.  

# END