<a href="https://colab.research.google.com/github/calmrocks/master-machine-learning-engineer/blob/main/Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction to Using Notebooks in Google Colab

Jupyter Notebooks are interactive documents that combine live code, visualizations, and explanatory text. They are an essential tool for data scientists, machine learning engineers, and anyone working with data analysis and coding in Python or R. Google Colab provides a cloud-based environment for running Jupyter Notebooks, making them accessible and collaborative.

### Getting started

1. Open a new or existing notebook in Google Colab.
2. Write your code in code cells.
3. Execute code cells by pressing **Shift+Enter** or clicking the **Play** button.
4. Add and edit text using Markdown cells.
5. Use the menu and toolbar options to navigate, format, and share your notebook.


In [None]:
# A short example: print a greeting and some basic arithmetic

print("Hello from Google Colab!")
a = 10
b = 5
print(f"The sum of {a} and {b} is {a + b}")
print(f"The difference of {a} and {b} is {a - b}")
print(f"The product of {a} and {b} is {a * b}")
print(f"The quotient of {a} and {b} is {a / b}")


Hello from Google Colab!
The sum of 10 and 5 is 15
The difference of 10 and 5 is 5
The product of 10 and 5 is 50
The quotient of 10 and 5 is 2.0


This is a text cell that you can edit. To edit it, double-click inside the cell. Text cells support Markdown formatting.


## Working with Cells in Google Colab

### Adding Cells

- To add a code cell:
    - Hover your mouse between two existing cells or below the last cell.
    - Click the **+ Code** button that appears.
- To add a text cell:
    - Hover your mouse between two existing cells or below the last cell.
    - Click the **+ Text** button that appears.

### Editing Cells

- To edit a code cell:
    - Click inside the code cell and start typing.
- To edit a text cell:
    - Double-click inside the text cell to enter edit mode.
    - Make your changes using Markdown formatting.
    - Click outside the cell or press **Shift+Enter** to save your changes.

### Running Cells

- To run a code cell:
    - Select the cell you want to run.
    - Press **Shift+Enter** or click the **Play** button (triangle icon) next to the cell.
    - The output, if any, appears below the code cell.

### Deleting Cells

- To delete a cell:
    - Select the cell you want to delete.
    - Click the trash can icon that appears next to the cell, or right-click and select "Delete cell" from the context menu.


# Building a Simple Machine Learning Model End-to-End: Predicting House Prices

Machine learning models often seem abstract, but their practical applications can solve real-life problems. In this section, we walk through building a simple machine learning model to predict house prices using the **California Housing dataset**. This dataset, available in `scikit-learn` through `fetch_california_housing`, contains census-derived features for housing block groups in California, making it a meaningful real-world use case.


### Import the libraries

In [2]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

### Load the California Housing dataset

The **California Housing dataset** contains information about housing blocks in California. Each data point represents a single block group, the smallest geographical unit for which the US Census publishes sample data. The dataset includes:
- **Features**:
    - `MedInc`: Median income in the block group.
    - `HouseAge`: Median age of houses in the block group.
    - `AveRooms`: Average number of rooms per household in the block group.
    - `AveBedrms`: Average number of bedrooms per household in the block group.
    - `Population`: Total population of the block group.
    - `AveOccup`: Average number of occupants per household in the block group.
    - `Latitude` and `Longitude`: Geographical coordinates of the block group.
- **Target**:
    - Median house value (`target`), measured in hundreds of thousands of dollars.
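
Because the target is expressed in units of $100,000, a raw target value has to be multiplied by 100,000 to read it in dollars. A minimal sketch of that conversion, using a made-up value rather than one taken from the dataset:

```python
# Hypothetical target value in units of $100,000 (not taken from the dataset)
median_value = 2.5

# Convert to dollars for readability
median_value_dollars = median_value * 100_000
print(f"${median_value_dollars:,.0f}")  # prints $250,000
```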


In [4]:
# Load the California Housing dataset
housing = fetch_california_housing(as_frame=True)
X = housing.data
y = housing.target

In [5]:
### Display the dataset structure

In [6]:
print("Feature names:", housing.feature_names)
print("First few rows of the dataset:")
print(X.head())

Feature names: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
First few rows of the dataset:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  
0    -122.23  
1    -122.22  
2    -122.24  
3    -122.25  
4    -122.25  


#### Data Splitting
Using `train_test_split`, we divide the dataset into training (80%) and testing (20%) subsets. This separation ensures that we evaluate the model on unseen data.
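
To make the idea concrete, here is a simplified sketch of what a shuffled 80/20 split does, written with the standard library only. The helper `simple_train_test_split` is hypothetical and is not how `scikit-learn` implements `train_test_split`; it only illustrates shuffling with a fixed seed and holding out a fraction of the rows.

```python
import random


def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then hold out a `test_size` fraction.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # seeded, so the split is repeatable
    n_test = int(round(len(rows) * test_size))  # number of held-out rows
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))                          # ten toy "samples"
train, test = simple_train_test_split(rows)
print(len(train), len(test))                    # prints 8 2
```

Fixing the seed plays the same role as `random_state=42` in the notebook: rerunning the cell produces the same split, which keeps the evaluation numbers comparable between runs.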


In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


#### Initialize and Train the Model
We initialize a `LinearRegression` model from `scikit-learn` and fit it to the training data.


In [8]:
model = LinearRegression()
model.fit(X_train, y_train)

In [9]:
# Predict median house values for the held-out test set
y_pred = model.predict(X_test)
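
`LinearRegression` fits its coefficients by ordinary least squares. With eight features that means solving for one coefficient per feature plus an intercept, but the one-feature case has a simple closed form that shows the idea. The sketch below uses toy data and a hypothetical helper, `fit_simple_linear_regression`; it is an illustration of least squares, not the library's code.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing squared error for a single feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]                 # toy data lying exactly on y = 2x + 1
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)                   # prints 2.0 1.0

# Predicting for a new input is just the fitted line evaluated at that point
prediction = slope * 5.0 + intercept      # 11.0
```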

#### Evaluate the Model
- **Mean Squared Error (MSE)**:
    - Measures the average squared difference between actual and predicted values.
    - Lower MSE indicates better model performance.
    - Example: If `MSE = 0.5`, the average squared error is 0.5 in squared target units. Its square root, the root mean squared error (RMSE) of about 0.71, is in the same units as the target, so a typical error is roughly $71,000.
- **R-squared (R²) Score**:
    - Represents the proportion of variance in the target variable explained by the model.
    - Is at most 1, where higher values indicate better performance. A score of 0 matches always predicting the mean, and the score can be negative on test data when the model does worse than that.
    - Example: If `R² = 0.8`, the model explains 80% of the variance in house prices.
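
Both metrics are short formulas, and computing them by hand on toy numbers can make the definitions less abstract. The sketch below uses made-up values and a hypothetical helper, `mse_and_r2`. It also takes the square root of the MSE (the RMSE) to get an error in the same units as the target, and shows that R² drops below zero when predictions are worse than always predicting the mean.

```python
import math


def mse_and_r2(y_true, y_pred):
    """Compute MSE and R² from their definitions (hypothetical helper)."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    y_mean = sum(y_true) / n
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)             # total sum of squares
    return ss_res / n, 1 - ss_res / ss_tot


toy_true = [1.0, 2.0, 3.0, 4.0]
good_pred = [1.5, 2.0, 2.5, 4.0]   # close to the true values
bad_pred = [4.0, 3.0, 2.0, 1.0]    # reversed, worse than predicting the mean

toy_mse, toy_r2 = mse_and_r2(toy_true, good_pred)
toy_rmse = math.sqrt(toy_mse)      # back in the units of the target
print(f"MSE={toy_mse:.3f}  RMSE={toy_rmse:.3f}  R2={toy_r2:.2f}")
# prints MSE=0.125  RMSE=0.354  R2=0.90

_, bad_r2 = mse_and_r2(toy_true, bad_pred)
print(f"R2 for reversed predictions: {bad_r2:.2f}")
# prints R2 for reversed predictions: -3.00
```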


In [10]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R2) Score: {r2:.2f}")

Mean Squared Error (MSE): 0.56
R-squared (R2) Score: 0.58


#### Predict a New House Price

We provide an example input for a hypothetical house, with feature values such as median income and house age. The model predicts its price in hundreds of thousands of dollars.


In [13]:
# Example feature values, in the order of housing.feature_names:
# MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
# (AveBedrms = 1.0 is an assumed value; California longitudes are negative)
new_house = [[8.0, 25.0, 6.0, 1.0, 1000.0, 2.0, 37.0, -120.0]]
new_house_df = pd.DataFrame(new_house, columns=housing.feature_names)  # Add feature names
predicted_price = model.predict(new_house_df)
print(f"Predicted price for the new house: ${predicted_price[0] * 100000:,.2f}")
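
A linear model has no notion of what each column means, so the values must be supplied in exactly the order of `housing.feature_names`. A misplaced value, such as a longitude in the latitude slot, produces a wildly implausible prediction instead of an error. One way to make such mistakes less likely is to key the values by feature name and build the row from the name list. The sketch below is only an illustration: the values and the `new_house_by_name` dictionary are hypothetical, and the feature names are copied from the output shown earlier.

```python
# Feature order as printed by housing.feature_names earlier in the notebook
feature_names = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                 'Population', 'AveOccup', 'Latitude', 'Longitude']

# Hypothetical input keyed by name, so the order in which it is written is irrelevant
new_house_by_name = {
    'Latitude': 37.0, 'Longitude': -120.0,
    'MedInc': 8.0, 'HouseAge': 25.0,
    'AveRooms': 6.0, 'AveBedrms': 1.0,
    'Population': 1000.0, 'AveOccup': 2.0,
}

# Fail early if a feature was forgotten
missing = [name for name in feature_names if name not in new_house_by_name]
if missing:
    raise ValueError(f"Missing features: {missing}")

# Build the row in the order the model expects
row = [new_house_by_name[name] for name in feature_names]
print(row)  # prints [8.0, 25.0, 6.0, 1.0, 1000.0, 2.0, 37.0, -120.0]
```

The resulting `row` can be wrapped as `[row]` and passed to `pd.DataFrame(..., columns=housing.feature_names)` before calling `model.predict`.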
