

# 🏠 Linear Regression with Boston Housing Data

## Overview  
In this project, we’ll use the **Boston Housing Dataset** (a classic regression dataset) to predict the **median house price** of Boston-area neighborhoods (in $1000s).  

The dataset contains features such as:  
- Crime rate in the area  
- Property tax rate  
- Average number of rooms per house  
- And other neighborhood details  

## Goal  
👉 Train a **Linear Regression model** to predict housing prices based on these features.  

## Dataset  
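The **Boston Housing Dataset** used to ship with **scikit-learn**, but the `load_boston` loader was removed in version 1.2. This notebook therefore reads the data from the original source at `http://lib.stat.cmu.edu/datasets/boston`, and later from a local `boston_house_prices.csv` file.  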

## Tools and Libraries  
You’ll need:  
- Python  
- Libraries: `numpy`, `pandas`, `matplotlib`, `scikit-learn`  

Install them with:  
```bash
pip install numpy pandas matplotlib scikit-learn
```

## Steps to Follow

1. Import libraries and load the dataset
2. (Optional) Explore the dataset (check columns, stats, etc.)
3. Split the dataset into **training** and **testing** sets
4. Train a **Linear Regression model**
5. Evaluate the model’s performance
6. Make predictions on new data
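
As a minimal sketch, steps 3 to 6 can be run end to end on synthetic data. The array and the coefficients 2 and 1 below are made up for illustration and have nothing to do with the housing data; they are chosen so the result is easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: y = 2*x + 1 exactly (no noise)
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# Step 3: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: train the model
model = LinearRegression().fit(X_train, y_train)

# Step 5: evaluate on the held-out data
mse = mean_squared_error(y_test, model.predict(X_test))

# Step 6: predict for a new input
new_prediction = model.predict(np.array([[25.0]]))[0]
print(model.coef_[0], model.intercept_, mse, new_prediction)
```

Because the data is perfectly linear, the fitted slope and intercept should come out very close to 2 and 1, and the test error should be essentially zero.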





In [1]:
# Import libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [3]:
url = "http://lib.stat.cmu.edu/datasets/boston"
# Skip the 22 lines of descriptive text at the top of the file
boston = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)
# Each record spans two rows: even rows hold the first 11 features,
# odd rows hold the last 2 features followed by the target
data = np.hstack([boston.values[::2, :], boston.values[1::2, :2]])
target = boston.values[1::2, 2]
# Data exploration (optional): show the first 5 rows of the raw, still-interleaved table
print(boston.head())

          0      1      2    3      4      5     6       7    8      9     10
0    0.00632  18.00   2.31  0.0  0.538  6.575  65.2  4.0900  1.0  296.0  15.3
1  396.90000   4.98  24.00  NaN    NaN    NaN   NaN     NaN  NaN    NaN   NaN
2    0.02731   0.00   7.07  0.0  0.469  6.421  78.9  4.9671  2.0  242.0  17.8
3  396.90000   9.14  21.60  NaN    NaN    NaN   NaN     NaN  NaN    NaN   NaN
4    0.02729   0.00   7.07  0.0  0.469  7.185  61.1  4.9671  2.0  242.0  17.8
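
The output above shows why the loading code slices with `[::2]` and `[1::2]`: every record is spread over two rows, and the second row is padded with `NaN`. A small sketch of the same stitching on a made-up table (the numbers are hypothetical and much narrower than the real file) may make the indexing easier to follow:

```python
import numpy as np

# Hypothetical raw table: each record spans two rows.
# Even rows hold 4 feature values; odd rows hold 2 more features,
# then the target, then NaN padding.
raw = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 99.0, np.nan],
    [10.0, 20.0, 30.0, 40.0],
    [50.0, 60.0, 77.0, np.nan],
])

# Even rows plus the first 2 values of the odd rows form the feature matrix
features = np.hstack([raw[::2, :], raw[1::2, :2]])
# The third value of each odd row is the target
target = raw[1::2, 2]

print(features.shape)  # (2, 6)
print(target)          # [99. 77.]
```

In the real file the same pattern gives 11 + 2 = 13 feature columns per house.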


In [4]:
# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)

In [5]:
# Train a linear regression model
# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

LinearRegression()


In [6]:
# Model evaluation: evaluate the model using the mean squared error (MSE)
# Predictions
y_pred = model.predict(X_test)
# Calculate MSE
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

Mean Squared Error: 24.291119474973247
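
To make the metric concrete, here is a small sketch that computes the MSE by hand and compares it with `mean_squared_error`. The two arrays are hypothetical prices (in $1000s), not values from the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted prices, in $1000s
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_hat = np.array([25.0, 20.6, 32.7, 35.4])

# MSE is the average of the squared errors
mse_manual = np.mean((y_true - y_hat) ** 2)
# Taking the square root (RMSE) brings the error back to the target's units
rmse = np.sqrt(mse_manual)

print(mse_manual, mean_squared_error(y_true, y_hat), rmse)
```

Applied to the result above, an MSE of about 24.29 corresponds to an RMSE of about 4.93, meaning the predictions are typically off by roughly $4,900.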


In [7]:
# New house details
new_house = np.array([[0.03, 0.0, 2.31, 0, 0.538, 6.575, 65.2, 4.09, 1, 296.0, 15.3, 396.9, 4.98]])

# Predict the price
predicted_price = model.predict(new_house)
print(f"Predicted Price: ${predicted_price[0]*1000:.2f}")

Predicted Price: $29408.73
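
Under the hood, a linear regression prediction is just the intercept plus a weighted sum of the features. The sketch below checks this on hypothetical two-feature data (the weights 3 and -2 and the intercept 5 are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: y = 3*x0 - 2*x1 + 5
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

model = LinearRegression().fit(X, y)

# Reproduce model.predict by hand: intercept + dot(features, coefficients)
x_new = np.array([[1.0, 2.0]])
by_hand = model.intercept_ + x_new @ model.coef_
print(by_hand, model.predict(x_new))
```

The same relationship holds for the 13-feature housing model: `model.coef_` holds one weight per feature and `model.intercept_` holds the constant term.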


In [8]:
import pandas as pd

# Load the dataset from a local CSV file (with named columns) into a pandas DataFrame
boston_house_prices = pd.read_csv('boston_house_prices.csv')

# Assume 'MEDV' is the target column (adjust if necessary)
X = boston_house_prices.drop(columns=['MEDV'])  # All the Features
y = boston_house_prices['MEDV']  # Target

# Display the first few rows of the DataFrame
print(boston_house_prices.head())

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO  \
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3   
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8   
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8   
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7   
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7   

        B  LSTAT  MEDV  
0  396.90   4.98  24.0  
1  396.90   9.14  21.6  
2  392.83   4.03  34.7  
3  394.63   2.94  33.4  
4  396.90   5.33  36.2  


In [9]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the dataset from a local CSV file into a pandas DataFrame
boston_house_prices = pd.read_csv('boston_house_prices.csv')
# Split the dataset into features and target
X = boston_house_prices.drop(columns=['MEDV'])  # All the Features
y = boston_house_prices['MEDV']  # Target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()


In [10]:
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R^2 Score:", r2)

Mean Squared Error: 24.29111947497345
R^2 Score: 0.6687594935356329
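
The R² score compares the model's squared error with that of a baseline that always predicts the mean. A short sketch on hypothetical values (not the housing predictions) shows the calculation next to `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted prices, in $1000s
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_hat = np.array([25.0, 20.6, 32.7, 35.4])

ss_res = np.sum((y_true - y_hat) ** 2)          # error left over after the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

Read this way, the score of about 0.67 above means the model accounts for roughly two thirds of the variance in the test-set prices.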


# ✂️ train_test_split Function (scikit-learn)

## What it Does  
The `train_test_split` function (from **scikit-learn**) is used to **split a dataset into two parts**:  
- A **training set** (used to train the model)  
- A **testing set** (used to check how well the model works)  

---

## Arguments  

- **X** → The feature matrix (the input data).  
  - Each **row** = one sample (like an email).  
  - Each **column** = one feature (like word count, presence of "free", etc.).  

- **y** → The target labels (the output you want to predict).  
  - Example: spam or not spam.  
  - It’s usually a 1D array where each value matches the label for a sample in **X**.  

---

## Keyword Arguments  

- **test_size=0.2** → Sets how much data goes into testing.  
  - `0.2` means **20% test data** and **80% training data**.  
  - You can adjust this (e.g., `0.3` for 30% testing).  

- **random_state=42** → Fixes the random seed so the split is the same every time you run the code.  
  - This is useful for **reproducibility** (getting consistent results).  
  - You can use any integer; `42` is just a common convention.  

---

## Output (What You Get Back)  

The function returns **four arrays**:  

- **X_train** → Features for the training set (part of `X`).  
- **X_test** → Features for the testing set (the rest of `X`).  
- **y_train** → Labels for the training set (part of `y`).  
- **y_test** → Labels for the testing set (the rest of `y`).  

So basically:  
👉 `train_test_split` breaks your dataset into smaller sets you can train and test your ML model with.  
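
A small sketch of the behaviour described above, using a hypothetical 10-sample array so the sizes are easy to verify:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features; row i is [2*i, 2*i + 1] and its label is i
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)

# Repeating the call with the same random_state reproduces the same split
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(X_test, X_test_again)
print(same_split)  # True
```

Rows of `X` and entries of `y` are shuffled together, so each sample keeps its own label after the split.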

---

# 🌳 Exercise 2: Decision Tree & Random Forest

## What it’s About  
We’ll use **Decision Trees** and **Random Forests** to predict class probabilities for a dataset.  

- A **Decision Tree** is a model that splits data into branches to make predictions.  
- A **Random Forest** is an **ensemble method** that combines many decision trees.  
  - Averaging over many trees usually gives **better accuracy** than a single tree and produces class-probability estimates.  

## Example: Weather Prediction  
We’ll create a synthetic **weather dataset** and train a Random Forest Classifier to predict the probability of conditions like:  
- Sunny 🌞  
- Rainy 🌧️  
- Cloudy ☁️  

based on different features (e.g., temperature, humidity, wind).  
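
To illustrate how the ensemble produces probabilities, the sketch below checks that a scikit-learn random forest's `predict_proba` equals the mean of its individual trees' probabilities. The dataset comes from `make_classification` and is purely synthetic; it is an assumption for illustration, not the weather data used in this exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class dataset (hypothetical, for illustration only)
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the per-tree class probabilities for the first 5 samples
tree_probs = np.mean([t.predict_proba(X[:5]) for t in forest.estimators_], axis=0)
forest_probs = forest.predict_proba(X[:5])

print(np.allclose(tree_probs, forest_probs))
```

Each row of `forest_probs` sums to 1, with one column per class in the order given by `forest.classes_`.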


In [11]:
#Import Libraries 
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Sample weather dataset
data = {
    'Temperature': [30, 25, 20, 15, 10, 5, 0, -5, 20, 25, 30, 35],
    'Humidity': [70, 80, 85, 90, 95, 60, 55, 50, 80, 85, 90, 75],
    'Windy': [0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0],
    'Outlook': ['sunny', 'rainy', 'rainy', 'sunny', 'cloudy', 'cloudy', 'sunny', 'sunny',
                'cloudy', 'rainy', 'sunny', 'cloudy'],
}
# Target labels (0: Sunny, 1: Rainy, 2: Cloudy)
target = [0, 1, 1, 0, 2, 2, 0, 0, 2, 1, 0, 2]

# Convert to DataFrame and encode the categorical feature
df = pd.DataFrame(data)
df['Outlook'] = df['Outlook'].map({'sunny': 0, 'rainy': 1, 'cloudy': 2})

# Features and target
# Note: the encoded 'Outlook' column is identical to `target`, so using it as a
# feature leaks the label into the inputs; drop it for a realistic evaluation.
X = df[['Temperature', 'Humidity', 'Windy', 'Outlook']]
y = target
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize and train Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 decision trees
rf_model.fit(X_train, y_train)

RandomForestClassifier(random_state=42)


In [12]:
# Predict probabilities for the test set
y_pred_probs = rf_model.predict_proba(X_test)
# Display the probability predictions for each test instance
for i, probs in enumerate(y_pred_probs):
    print(f"Test Sample {i + 1} - Probabilities [Sunny, Rainy, Cloudy]: {probs}")

Test Sample 1 - Probabilities [Sunny, Rainy, Cloudy]: [0.56 0.37 0.07]
Test Sample 2 - Probabilities [Sunny, Rainy, Cloudy]: [0.04 0.93 0.03]
Test Sample 3 - Probabilities [Sunny, Rainy, Cloudy]: [0.49 0.28 0.23]
Test Sample 4 - Probabilities [Sunny, Rainy, Cloudy]: [0.   0.49 0.51]
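
Each row above is a probability distribution over the classes, in the order of `rf_model.classes_` (here 0, 1, 2). The predicted class is the column with the highest probability, which can be sketched directly from the printed values:

```python
import numpy as np

class_names = ["Sunny", "Rainy", "Cloudy"]

# Probabilities copied from the output above
probs = np.array([
    [0.56, 0.37, 0.07],
    [0.04, 0.93, 0.03],
    [0.49, 0.28, 0.23],
    [0.00, 0.49, 0.51],
])

# Pick the most probable class for each test sample
predicted = [class_names[i] for i in probs.argmax(axis=1)]
print(predicted)  # ['Sunny', 'Rainy', 'Sunny', 'Cloudy']
```

Note that the fourth sample is a near tie (0.49 vs. 0.51), so the probabilities carry information that the hard label alone hides.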


In [13]:
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred,
                                                          target_names=["Sunny", "Rainy", "Cloudy"]))


Accuracy: 1.0

Classification Report:
               precision    recall  f1-score   support

       Sunny       1.00      1.00      1.00         2
       Rainy       1.00      1.00      1.00         1
      Cloudy       1.00      1.00      1.00         1

    accuracy                           1.00         4
   macro avg       1.00      1.00      1.00         4
weighted avg       1.00      1.00      1.00         4



# Machine Learning - Decision Tree  

## What is a Decision Tree?  
A **Decision Tree** works like a **flow chart**. It helps make decisions based on past data.  

In this example, we want to predict if a person will go to a **comedy show** or not.  

- The person has recorded data about comedy shows in the past.  
- Information includes the **comedian’s details** and whether they actually went to the show or not.  

---

## Important Step: Numerical Data  
scikit-learn’s decision trees require **numerical** input, so text columns cannot be passed in directly.  
That means we have to convert all non-numerical columns (like text) into numbers.  

For this dataset, we need to convert:  
- **`Nationality`** → (UK, USA, N)  
- **`Go`** → (YES or NO)  

---

## Using Pandas map()  
Pandas has a handy function called `map()` that can convert values based on a dictionary.  

Example:  

```python
# Convert 'Nationality' column
data['Nationality'] = data['Nationality'].map({'UK': 0, 'USA': 1, 'N': 2})

# Convert 'Go' column
data['Go'] = data['Go'].map({'NO': 0, 'YES': 1})
```

This means:

- UK → 0, USA → 1, N → 2
- NO → 0, YES → 1
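
One thing worth checking after a `map()` call: any value that is missing from the dictionary becomes `NaN`. The sketch below uses a made-up column in which `"FR"` is a hypothetical value with no mapping:

```python
import pandas as pd

# Hypothetical column; "FR" is deliberately absent from the mapping
df = pd.DataFrame({"Nationality": ["UK", "USA", "N", "FR"]})
codes = df["Nationality"].map({"UK": 0, "USA": 1, "N": 2})

# Count values that could not be mapped
unmapped = int(codes.isna().sum())
print(codes.tolist(), unmapped)
```

Checking `isna().sum()` before training is a cheap way to catch typos or unexpected categories in the data.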

In [14]:
# Create and display a Decision Tree
import pandas
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

# data2.csv must be in the notebook's working directory
df = pandas.read_csv("data2.csv")
# Encode the text columns as numbers
nationality_codes = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(nationality_codes)
go_codes = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(go_codes)

features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']

dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)

tree.plot_tree(dtree, feature_names=features)
plt.show()

FileNotFoundError: [Errno 2] No such file or directory: 'data2.csv'
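
The cell above fails because `data2.csv` is not in the working directory. As a sketch of the same workflow that runs without the file, the example below builds a small DataFrame inline. The rows are invented for illustration and are not the contents of `data2.csv`; `export_text` is used instead of `plot_tree` so the tree prints as plain text.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical rows, invented for illustration (not the real data2.csv)
df = pd.DataFrame({
    "Age": [25, 31, 47, 52, 38, 60, 29, 44],
    "Experience": [2, 8, 20, 5, 15, 30, 3, 12],
    "Rank": [5, 7, 9, 4, 8, 6, 3, 9],
    "Nationality": ["UK", "USA", "N", "USA", "UK", "N", "UK", "USA"],
    "Go": ["NO", "YES", "YES", "NO", "YES", "NO", "NO", "YES"],
})

# Encode the text columns as numbers
df["Nationality"] = df["Nationality"].map({"UK": 0, "USA": 1, "N": 2})
df["Go"] = df["Go"].map({"NO": 0, "YES": 1})

features = ["Age", "Experience", "Rank", "Nationality"]
X, y = df[features], df["Go"]

# Fit the tree and print its decision rules as text
dtree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(dtree, feature_names=features))
```

With no depth limit and no conflicting rows, the tree fits this tiny training set perfectly, which says nothing about how it would perform on unseen shows.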