# Introduction to Machine Learning with California Housing Dataset

This notebook is an introduction to Machine Learning (ML) using the California Housing dataset. We will cover fundamental concepts, the typical ML workflow, and a practical example of building a regression model.


## What is Machine Learning?

Machine Learning is a subset of Artificial Intelligence (AI) that enables systems
to learn from data, identify patterns, and make decisions or predictions with
minimal human intervention. 

Instead of being explicitly programmed, ML models
"learn" from data.

## Types of Machine Learning

1.  **Supervised Learning:**
    * **Concept:** The model learns from labeled data, meaning the input data
        (features) is paired with the correct output (labels or targets).
    * **Goal:** To predict an output based on new, unseen input data.
    * **Examples:**
        * **Regression:** Predicting a continuous value (e.g., house prices, temperature).
        * **Classification:** Predicting a categorical label (e.g., spam/not spam, dog/cat).

2.  **Unsupervised Learning:**
    * **Concept:** The model learns from unlabeled data, finding hidden patterns
        or structures within the data without explicit guidance.
    * **Goal:** To discover underlying structures or representations in the data.
    * **Examples:**
        * **Clustering:** Grouping similar data points together (e.g., customer segmentation).
        * **Dimensionality Reduction:** Reducing the number of features while retaining
            important information.

3.  **Reinforcement Learning:**
    * **Concept:** An agent learns to make decisions by performing actions in an
        environment and receiving rewards or penalties.
    * **Goal:** To maximize cumulative reward over time.
    * **Examples:** Game playing (AlphaGo), robotics, autonomous driving.
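
To make the difference between labeled and unlabeled data concrete, here is a minimal sketch using scikit-learn. The toy array and the choice of `KMeans` with two clusters are assumptions made for illustration only; they are not part of the housing example that follows.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Toy data (made up for illustration): two well-separated groups of points.
X = np.array([[1.0], [1.2], [0.8], [8.0], [8.3], [7.9]])

# Supervised: every input row comes with a known target value (label).
y = 2.0 * X[:, 0] + 1.0
regressor = LinearRegression().fit(X, y)  # fit(features, labels)
print("Predicted target for x=5:", regressor.predict([[5.0]])[0])

# Unsupervised: only the inputs are given; the model groups them on its own.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # fit(features)
print("Cluster assigned to each row:", clusterer.labels_)
```

Notice that the supervised model's `fit` receives both `X` and `y`, while the clustering model's `fit` receives only `X`.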


## The Machine Learning Workflow

The typical ML workflow involves several stages:

1.  **Problem Definition:** Clearly define what you want to achieve with ML.
2.  **Data Collection:** Gather relevant data.
3.  **Data Preprocessing/Cleaning:** Handle missing values and outliers, and transform the data
    into a form the model can use. This is often the most time-consuming step.
4.  **Feature Selection/Engineering:** Choosing the most relevant features or creating new ones.
5.  **Model Selection:** Choosing an appropriate ML algorithm based on the problem type
    (e.g., Linear Regression for continuous prediction, Logistic Regression for classification).
6.  **Training:** Feeding the preprocessed data to the chosen model so it can learn patterns.
7.  **Evaluation:** Assessing the model's performance on unseen data using appropriate metrics.
8.  **Hyperparameter Tuning:** Optimizing hyperparameters, the settings that are chosen before training rather than learned from the data.
9.  **Deployment:** Integrating the trained model into an application or system.
10. **Monitoring & Maintenance:** Continuously monitoring the model's performance and retraining if necessary.


## Practical Example: Predicting California Housing Prices

In this example, we will use the California Housing dataset, a classic dataset for regression tasks. Our goal is to predict the median house value for districts in California based on various features.

### 1. Import Necessary Libraries

We will import:

* `numpy` for numerical operations
* `pandas` for data manipulation
* `matplotlib.pyplot` and `seaborn` for visualization
* `sklearn` (scikit-learn) for dataset loading, model training, and evaluation

In [1]:
## Begin Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Set plot style for better aesthetics
sns.set_style("whitegrid")
## End Imports

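If this cell raises `ModuleNotFoundError: No module named 'sklearn'`, install scikit-learn (for example with `pip install scikit-learn`) and run the cell again.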

### 2. Load the California Housing Dataset

Many common datasets are already built into machine learning libraries, making them easy to access. The California Housing dataset is one of them, available through `scikit-learn`.

**In the cell below:**
1.  Find the function within `sklearn.datasets` that lets you `fetch` the California Housing dataset.
2.  Load the dataset. It's often helpful to pass `as_frame=True` so the data comes back as pandas objects (a DataFrame for the features and a Series for the target).
3.  The dataset object you get back will have two main parts you care about: the `data` (your features, often called `X`) and the `target` (what you want to predict, often called `y`). Assign these to separate variables.
4.  Print the `shape` of your `X` and `y` to see how many rows and columns you have.
5.  Display the first few rows of your `X` and `y` to get a peek at the data (`.head()`).

In [None]:
## Begin Example










## End Example
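
One possible way to complete the steps above is sketched here. The helper names `load_housing` and `preview` are hypothetical (they are not part of scikit-learn); the download is wrapped in a function so that nothing is fetched until you call it.

```python
from sklearn.datasets import fetch_california_housing


def load_housing():
    """Return (X, y) for the California Housing dataset as pandas objects."""
    # as_frame=True makes .data a DataFrame and .target a Series.
    # The first call downloads the data and caches it locally.
    housing = fetch_california_housing(as_frame=True)
    X = housing.data    # features
    y = housing.target  # median house value, the quantity we want to predict
    return X, y


def preview(X, y, n=5):
    """Print the shapes and the first n rows of the features and the target."""
    print("X shape:", X.shape)
    print("y shape:", y.shape)
    print(X.head(n))
    print(y.head(n))
```

In the notebook, `X, y = load_housing()` followed by `preview(X, y)` covers steps 2 to 5. For the full dataset the shapes should be `(20640, 8)` and `(20640,)`.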

### 3. Data Exploration: Get to Know Your Data!

It's crucial to understand your data before trying to build a model. Let's start by looking at some basic numbers and visualizing distributions.

**In the cell below:**
1.  Get descriptive statistics for your features (`X`).
2.  Create histograms for all the features in `X`. (Pandas DataFrames have a built-in `.hist()` method that works well with `matplotlib`). You might want to adjust `figsize` and `bins` to make them readable.
3.  Create a histogram specifically for your target variable `y` (Median House Value) to see its distribution.

**After running the code, what do you notice about the histograms?**

* Look at `MedInc` (Median Income) and `MedHouseVal` (Median House Value). Do you see a long "tail" on one side, or do they stop abruptly? (Note that `MedHouseVal` is expressed in units of \$100,000 and was capped at about \$500,000 in the original dataset, so every district above the cap appears at roughly 5.0.)
* Which features seem to have a lot of data grouped together, and which are more spread out?
* Do any of them look like a "bell curve" (symmetrical in the middle)?

In [None]:
## Begin Example







## End Example
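
A minimal sketch of the exploration steps follows. So that the block runs on its own, it builds a small made-up stand-in for `X` and `y` (an assumption, not the real data); in the notebook, skip those lines and use the `X` and `y` you loaded in step 2.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data (made-up values) so this block is self-contained.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "MedInc": rng.lognormal(mean=1.2, sigma=0.5, size=500),  # right-skewed
    "HouseAge": rng.integers(1, 53, size=500).astype(float),
})
y = (
    (0.4 * X["MedInc"] + rng.normal(0, 0.5, 500))
    .clip(lower=0.15, upper=5.0)  # imitate a capped target
    .rename("MedHouseVal")
)

# 1. Descriptive statistics (count, mean, std, min, quartiles, max) per feature.
stats = X.describe()
print(stats)

# 2. One histogram per feature; pandas returns an array of matplotlib Axes.
feature_axes = X.hist(bins=50, figsize=(12, 8))
plt.suptitle("Feature distributions")

# 3. Histogram of the target on its own figure.
fig, ax = plt.subplots(figsize=(8, 5))
y.hist(bins=50, ax=ax)
ax.set_xlabel("MedHouseVal (units of $100,000)")
ax.set_ylabel("Number of districts")
ax.set_title("Distribution of median house value")
# In a notebook the figures render when the cell finishes;
# in a plain script, call plt.show() here.
```

With the real data, a spike in the last bin of the target histogram is the visual sign of the cap mentioned above.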

### Exploration: Geographical Distribution of House Values

Let's use a scatter plot to visualize house values on a map of California. This will help us see if there are any geographical patterns.

* The **horizontal position (x-axis)** of each dot tells us its **Longitude** (how far east or west it is).
* The **vertical position (y-axis)** of each dot tells us its **Latitude** (how far north or south it is).
* We can also use the **size** and **color** of each dot to show the **Median House Value (`MedHouseVal`)** of that neighborhood. Bigger and brighter dots will represent higher values.

**In the cell below:**
1.  Use `seaborn.scatterplot` to create a scatter plot.
2.  Set `Longitude` for the x-axis and `Latitude` for the y-axis.
3.  Use `MedHouseVal` to control both the `size` and `hue` (color) of the points.
4.  Choose a `palette` (like "viridis") and set `alpha` (transparency, e.g., 0.5) to help with visualization.
5.  Add a title to your plot and potentially a legend.
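
The steps above could be sketched roughly as follows. The DataFrame `df` here is a made-up stand-in with random coordinates and values (an assumption so the block runs on its own), so it will not show California's real geography; with the real data you could build it as `df = X.assign(MedHouseVal=y)`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Stand-in data (random, made-up values) so this block is self-contained.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Longitude": rng.uniform(-124.3, -114.3, n),
    "Latitude": rng.uniform(32.5, 42.0, n),
    "MedHouseVal": rng.uniform(0.15, 5.0, n),
})

fig, ax = plt.subplots(figsize=(10, 8))
sns.scatterplot(
    data=df,
    x="Longitude",
    y="Latitude",
    hue="MedHouseVal",   # color encodes house value
    size="MedHouseVal",  # dot size encodes house value as well
    palette="viridis",
    alpha=0.5,           # transparency helps where dots overlap
    ax=ax,
)
ax.set_title("Median house value by location")
# seaborn adds a legend automatically for the hue and size variables.
```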

**What do you notice about the plot?**
* Are there any bright, large clusters of dots? Where are they located (e.g., along the coast, specific cities)?
* Do smaller, darker dots appear in different areas?
* What does this tell us about housing prices and location in California?

### 4. Splitting the Data into Training and Testing Sets

This is a critical step in Machine Learning! You need to split your data into two parts:
* The **Training Set:** This is the data your model will "learn" from.
* The **Testing Set:** This is data the model has **never seen before**. You'll use it to evaluate how well your model truly performs and if it can make good predictions on new, real-world data.

We typically use about 70-80% of the data for training and 20-30% for testing. Using `random_state` ensures that if you run your code multiple times, you'll get the same split, which helps with reproducibility.

**In the cell below:**
1.  Find the function within `sklearn.model_selection` that helps you `split` your data.
2.  Use it to split your `X` and `y` into `X_train`, `X_test`, `y_train`, and `y_test`.
3.  Set the `test_size` to 0.2 (meaning 20% for testing).
4.  Set `random_state` to 42 (or any integer you like) for consistent results.
5.  Print the `shape` of each of your new training and testing sets to verify the split.


In [None]:
## Begin Example








## End Example
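
Here is a minimal sketch of the split. The 100-row `X` and `y` are made-up stand-ins (an assumption so the block runs on its own); in the notebook, use the `X` and `y` from step 2 instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data (made-up values) so this block is self-contained.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "MedInc": rng.uniform(0.5, 15.0, 100),
    "HouseAge": rng.integers(1, 53, 100),
})
y = pd.Series(0.4 * X["MedInc"] + rng.normal(0, 0.5, 100), name="MedHouseVal")

# Hold out 20% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train:", X_train.shape, "X_test:", X_test.shape)
print("y_train:", y_train.shape, "y_test:", y_test.shape)
```

With 100 rows and `test_size=0.2`, 80 rows go to training and 20 to testing. The rows of `X_train` and `y_train` stay aligned because both are split with the same shuffle.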

### 5. Model Selection and Training

For this regression problem (predicting a continuous number), we'll start with a simple but powerful model: **Linear Regression**. This model tries to find the best straight line (or flat surface in multiple dimensions) that fits your data.

**In the cell below:**
1.  Find `LinearRegression` within `sklearn.linear_model`.
2.  Create an "instance" of your Linear Regression model (like making a new, empty model object).
3.  "Train" your model using the `fit()` method. This method takes your training features (`X_train`) and your training target (`y_train`) as input. The model will learn the relationships from this data.


In [None]:
## Begin Example








## End Example
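
A minimal sketch of creating and training the model is shown below. The training data is a made-up stand-in whose target is an exact linear function of two features (an assumption chosen so you can check what the model learns); in the notebook, pass your real `X_train` and `y_train`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in training data (made up): y = 0.4 * MedInc + 0.01 * HouseAge + 0.5
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "MedInc": rng.uniform(0.5, 15.0, 80),
    "HouseAge": rng.uniform(1.0, 52.0, 80),
})
y_train = 0.4 * X_train["MedInc"] + 0.01 * X_train["HouseAge"] + 0.5

model = LinearRegression()   # create an untrained model instance
model.fit(X_train, y_train)  # learn one coefficient per feature plus an intercept

print("Coefficients:", dict(zip(X_train.columns, model.coef_)))
print("Intercept:", model.intercept_)
```

Because the stand-in target contains no noise, the learned coefficients should match the ones used to generate it. On the real dataset the fit will be approximate.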

### 6. Making Predictions

Once your model is trained, it's ready to make guesses! You'll ask it to predict the house values for the `X_test` data (the data it has never seen before).

**In the cell below:**
1.  Use your trained model's `predict()` method.
2.  Pass in your `X_test` data to get predictions.
3.  Store these predictions in a new variable (e.g., `y_pred`).
4.  Print the first few actual values from `y_test` and compare them side-by-side with your `y_pred` to get an initial feel for the predictions.

In [None]:
## Begin Example








## End Example
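
The prediction step might look like the sketch below. The first few lines rebuild a small made-up dataset and a trained model (assumptions so the block runs on its own); in the notebook you would start at the `predict` call with your existing `model`, `X_test`, and `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in data and model (made-up values) so this block is self-contained.
rng = np.random.default_rng(0)
X = pd.DataFrame({"MedInc": rng.uniform(0.5, 15.0, 100)})
y = pd.Series(0.4 * X["MedInc"] + rng.normal(0, 0.3, 100), name="MedHouseVal")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Predict on data the model did not see during training.
y_pred = model.predict(X_test)  # NumPy array, one prediction per test row

# Put actual and predicted values side by side for a quick look.
comparison = pd.DataFrame({"actual": y_test.to_numpy(), "predicted": y_pred})
comparison["error"] = comparison["predicted"] - comparison["actual"]
print(comparison.head())
```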

### 7. Model Evaluation

How good are your model's predictions? We use **evaluation metrics** to measure this. For regression, common metrics include:

* **Mean Squared Error (MSE):** The average of the squared differences between the predicted values and the actual values. A lower MSE means your predictions are closer to the real values. Because the errors are squared, large misses are penalized more heavily than small ones.
* **Root Mean Squared Error (RMSE):** The square root of the MSE. It's often preferred because its value is in the same units as your target variable (in this case, hundreds of thousands of dollars), making it easier to interpret.
* **R-squared ($R^2$):** The **proportion of the variance in the actual house values that your model explains** using the features. An $R^2$ of 1 means your model explains all the variation (a perfect fit), and 0 means it does no better than always predicting the mean. On test data, $R^2$ can even be negative if the model does worse than that baseline. If your $R^2$ is 0.5, your model explains 50% of the variation in house values across districts. The other 50% is due to factors not in your model or to randomness.

**In the cell below:**
1.  Find the functions for `mean_squared_error` and `r2_score` within `sklearn.metrics`.
2.  Calculate the MSE and R-squared using `y_test` (actual values) and `y_pred` (your model's predictions).
3.  Calculate the RMSE from the MSE you found.
4.  Print out these metrics clearly, formatted nicely.

In [2]:
## Begin Example



## End Example
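
A minimal sketch of the metric calculations follows. The four `y_test` and `y_pred` values are made up for illustration (they are not results from the housing model); in the notebook, use your real arrays. The last lines recompute the metrics by hand to show what the library functions do.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values only (made up), in units of $100,000.
y_test = np.array([3.0, 2.5, 4.0, 1.5])
y_pred = np.array([2.5, 2.5, 4.5, 2.0])

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # back in the same units as the target
r2 = r2_score(y_test, y_pred)

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R^2:  {r2:.4f}")

# The same quantities by hand:
ss_res = np.sum((y_test - y_pred) ** 2)         # squared prediction errors
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # squared deviations from the mean
assert np.isclose(mse, ss_res / len(y_test))
assert np.isclose(r2, 1 - ss_res / ss_tot)
```

For these made-up values the MSE is 0.1875, the RMSE is about 0.4330, and $R^2$ is about 0.7692.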

### 8. Visualization of Predictions vs. Actual Values

A great way to visually assess your model is to plot its predictions against the actual values. If your model were perfect, all the dots would fall exactly on a 45-degree diagonal line. The closer your dots are to that line, the better your model is!

**In the cell below:**
1.  Use `matplotlib.pyplot.scatter()` to create a scatter plot.
2.  Put your `y_test` (actual values) on the x-axis.
3.  Put your `y_pred` (predicted values) on the y-axis.
4.  Add a diagonal line (from the minimum to the maximum values of `y`) to represent the "perfect prediction" line.
5.  Add labels for your axes and a title for your plot.
6.  Consider setting `alpha` for the scatter plot to make dense areas clearer.
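
One way to draw this plot is sketched below. The `y_test` and `y_pred` arrays are made-up stand-ins (an assumption so the block runs on its own); in the notebook, use the arrays from steps 4 and 6.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in values (made up): predictions are the actual values plus noise.
rng = np.random.default_rng(0)
y_test = rng.uniform(0.15, 5.0, 200)
y_pred = y_test + rng.normal(0, 0.5, 200)

plt.figure(figsize=(7, 7))
plt.scatter(y_test, y_pred, alpha=0.3)  # low alpha reveals dense regions

# Diagonal "perfect prediction" line spanning the full range of values.
low = min(y_test.min(), y_pred.min())
high = max(y_test.max(), y_pred.max())
plt.plot([low, high], [low, high], color="red", linestyle="--",
         label="Perfect prediction")

plt.xlabel("Actual MedHouseVal")
plt.ylabel("Predicted MedHouseVal")
plt.title("Predicted vs. actual values")
plt.legend()
# In a notebook the figure renders when the cell finishes;
# in a plain script, call plt.show() here.
```

Dots above the dashed line are over-predictions and dots below it are under-predictions.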


## Conclusion

Congratulations! You've just completed a hands-on exploration of a fundamental
Machine Learning workflow. You've learned how to:

* Understand the main types of ML and the typical steps of an ML workflow.
* Load and explore a real-world dataset.
* Prepare data by splitting it into training and testing sets.
* Train a simple regression model.
* Make predictions with your model.
* Evaluate your model's performance using common metrics.
* Visualize your results to gain insights.

This is truly just the beginning of your journey into Machine Learning.

### How Can We Improve This Model?

Now that you've seen a basic model in action, the next exciting step is to think about **how we could make it even better**. Consider these questions and discuss them with your classmates:

1.  **Could we get more data?** Often, having more diverse and relevant data can help a model learn better patterns and make more accurate predictions, especially if your current dataset is small or doesn't cover all scenarios.
2.  **Could we use a different type of model?** Linear Regression assumes a straight-line relationship. What if the relationship between features and house prices isn't a straight line? (Hint: In `sklearn`, explore Decision Trees in `tree`, Random Forests in `ensemble`, or even `neural_network` models).
3.  **Are all the features equally important?** Should we try to use all the features, or could some be combined or transformed? (Hint: Sometimes creating new features from existing ones – this is called "feature engineering" – or selecting only the most impactful ones can improve results).
4.  **Does the "scale" of our data matter?** Some features like `Population` are very large numbers, while others like `AveRooms` are small. Could this difference in scale affect how the model learns? (Hint: Look into 'feature scaling' or 'normalization' techniques in `sklearn.preprocessing`).
5.  **How could we be more sure our model isn't just "memorizing" the training data?** Our model works okay on the test set, but what if it just got "lucky" on this particular split? How can we make our evaluation more robust? (Hint: Research 'cross-validation' techniques).

These are just a few ideas to get you thinking about the exciting next steps in improving machine learning models.
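
As a starting point for questions 2, 4, and 5, here is a hedged sketch that compares two candidate models with 5-fold cross-validation. The data is a made-up stand-in with a deliberately non-linear target, and the specific choices (`StandardScaler`, `RandomForestRegressor` with 50 trees, `cv=5`) are assumptions for illustration rather than recommended settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (made-up values) so this block is self-contained.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "MedInc": rng.uniform(0.5, 15.0, 300),
    "Population": rng.uniform(100.0, 5000.0, 300),
})
y = np.sqrt(X["MedInc"]) + rng.normal(0, 0.1, 300)  # non-linear in MedInc

candidates = {
    # Scaling lives inside the pipeline, so it is re-fit on each training fold.
    "scaled linear regression": make_pipeline(StandardScaler(), LinearRegression()),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

scores = {}
for name, estimator in candidates.items():
    # Each model is trained and scored five times on different splits.
    fold_scores = cross_val_score(estimator, X, y, cv=5, scoring="r2")
    scores[name] = fold_scores
    print(f"{name}: mean R^2 = {fold_scores.mean():.3f} "
          f"(+/- {fold_scores.std():.3f})")
```

Averaging over several splits gives a more stable estimate than a single train/test split, and the spread across folds shows how much the score depends on which rows end up in the test set.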