# Prediction Models

In machine learning, a prediction model (also called a predictive model) is a mathematical representation or algorithm that is trained on data to estimate unknown or future outcomes. It is a core component of supervised learning: the model learns patterns and relationships from labeled training data and then applies them to new, unseen data.

The prediction model takes input features or variables and uses them to generate output predictions. The specific type of prediction model used depends on the problem at hand. For example, linear regression models are commonly used for predicting continuous numerical values, while classification models such as logistic regression, decision trees, or support vector machines are used for predicting discrete categorical values or class labels.
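To make the distinction concrete, here is a minimal sketch that fits one regression model and one classification model to the same inputs. The toy data and variable names are assumptions made up for illustration; they do not come from the housing dataset used later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (hypothetical): one input feature, six samples.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Regression target: a continuous value, here exactly 2 * x + 1.
y_continuous = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])

# Classification target: a discrete label (0 for small x, 1 for large x).
y_label = np.array([0, 0, 0, 1, 1, 1])

regressor = LinearRegression().fit(X, y_continuous)
classifier = LogisticRegression().fit(X, y_label)

# Predict for an unseen input.
X_new = np.array([[7.0]])
predicted_value = regressor.predict(X_new)[0]   # a real number
predicted_label = classifier.predict(X_new)[0]  # one of the class labels
print(predicted_value, predicted_label)
```

The regressor returns a number on a continuous scale, while the classifier returns one of the labels it saw during training.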




# Part 1 - Regression

## 1.1 Linear Regression

Linear regression is a basic supervised learning algorithm that is widely used for making predictions. It is often taught in introductory statistics courses and is considered a fundamental technique in data analysis. Although it is simple compared to other machine learning algorithms, **linear regression remains valuable for predicting quantitative values such as home prices or ages**, and it and its variations are still effective in practical applications.
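As a rough illustration of what "fitting a line" means, the sketch below solves an ordinary least-squares problem directly with NumPy. The five data points are hypothetical values chosen to lie near the line y = 2x + 1; scikit-learn's `LinearRegression` solves the same kind of least-squares problem for you.

```python
import numpy as np

# Hypothetical toy data lying near the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Design matrix with a column of ones so the intercept is learned too.
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution of A @ [slope, intercept] ~= y.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

The recovered slope and intercept are close to, but not exactly, 2 and 1, because the points do not lie perfectly on a line.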


### Problem

### 1.1.1 Fitting a line
Suppose you want to train a model that represents a linear relationship between the features and the target vector. You can use a linear regression (in scikit-learn, `LinearRegression`):

In [None]:
# Load libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


# Load the California housing dataset from scikit-learn
from sklearn.datasets import fetch_california_housing


In [None]:
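# Load the dataset into `california_housing`
# (use as_frame=True so that `california_housing.frame` is available later)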
#TODO
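One possible way to fill in this cell is sketched below. It is a suggestion rather than the intended solution, and the helper name `load_housing` is hypothetical. The loader is wrapped in a function so that nothing is downloaded until you call it.

```python
from sklearn.datasets import fetch_california_housing


def load_housing():
    """Download (on first call) and return the California housing dataset.

    as_frame=True makes the returned Bunch expose pandas objects,
    including the combined `frame` attribute used by the later plots.
    """
    return fetch_california_housing(as_frame=True)
```

In the notebook you would call `california_housing = load_housing()`. The first call needs network access, after which scikit-learn caches the data locally. From there, `print(california_housing.DESCR)` shows the dataset description and `california_housing.frame.info()` shows the column summary.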

### First, visualize the data that we have:
Before fitting a model, let's inspect the dataset description and summary, then plot histograms of the features.

In [None]:
#let's take a look at the data description
#TODO

In [None]:
#show info
#TODO

We can see that:

- the dataset contains 20,640 samples and 8 features;
- all features are numerical, encoded as floating-point numbers;
- there are no missing values.
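These observations can also be checked programmatically. The sketch below uses a small stand-in frame with made-up values so that it runs without downloading anything; the helper `summarize` is hypothetical, and in the notebook you would pass `california_housing.frame` instead.

```python
import pandas as pd

# Hypothetical stand-in frame with made-up values; in the notebook you
# would use california_housing.frame instead.
frame = pd.DataFrame(
    {
        "MedInc": [8.0, 7.0, 5.0],
        "HouseAge": [40.0, 20.0, 50.0],
        "MedHouseVal": [4.5, 3.5, 3.0],
    }
)


def summarize(df, target_column):
    """Return sample count, feature count, dtype check, and missing count."""
    feature_columns = df.columns.drop(target_column)
    return {
        "n_samples": df.shape[0],
        "n_features": len(feature_columns),
        "all_float": all(pd.api.types.is_float_dtype(t) for t in df.dtypes),
        "n_missing": int(df.isna().sum().sum()),
    }


summary = summarize(frame, "MedHouseVal")
print(summary)
```

On the real frame, the same call should report the sample and feature counts listed above.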

In [None]:
import matplotlib.pyplot as plt

california_housing.frame.hist(figsize=(10, 10), bins=30, edgecolor="blue")
plt.subplots_adjust(hspace=0.7, wspace=0.4)

As the dataset info shows, we are given the longitude and latitude of each district, which carry geographical information. Let's take a look at this information.

In [None]:
import seaborn as sns

sns.scatterplot(
    data=california_housing.frame,
    x="Longitude",
    y="Latitude",
    size="MedHouseVal",
    hue="MedHouseVal",
    palette="viridis",
    alpha=0.5,
)
plt.legend(title="MedHouseVal", bbox_to_anchor=(1.05, 0.95), loc="upper left")
_ = plt.title("Median house value depending on\n their spatial location")

Please note that California's big cities (San Diego, Los Angeles, San Jose, and San Francisco) are located on the Pacific coast, which is where the highest median house values appear in the plot.

We can perform random subsampling to reduce the number of data points for plotting, while still capturing the relevant characteristics.

In [None]:
import numpy as np

rng = np.random.RandomState(0)
indices = rng.choice(
    np.arange(california_housing.frame.shape[0]), size=100, replace=False
)

In [None]:
sns.scatterplot(
    data=california_housing.frame.iloc[indices],
    x="Longitude",
    y="Latitude",
    size="MedHouseVal",
    hue="MedHouseVal",
    palette="viridis",
    alpha=0.5,
)
plt.legend(title="MedHouseVal", bbox_to_anchor=(1.05, 1), loc="upper left")
_ = plt.title("Median house value depending on\n their spatial location")

### Create input feature set for fitting the regression:

In [None]:
# Create features (assign them to `features`)
#TODO

In [None]:
# The target contains the median house value for each district,
# in units of $100,000 (assign it to `target`)
#TODO
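One way to separate inputs from the target is sketched below on a small stand-in frame with made-up values. This is an assumption about how you might fill in the two cells above, not the only option: when the dataset is loaded with `as_frame=True`, the returned object also exposes the same split directly as `california_housing.data` and `california_housing.target`.

```python
import pandas as pd

# Hypothetical stand-in for california_housing.frame (made-up values).
frame = pd.DataFrame(
    {
        "MedInc": [8.0, 7.0, 5.0],
        "HouseAge": [40.0, 20.0, 50.0],
        "MedHouseVal": [4.5, 3.5, 3.0],
    }
)

# Every column except the target becomes an input feature.
features = frame.drop(columns=["MedHouseVal"])

# The target is the single column we want to predict.
target = frame["MedHouseVal"]

print(features.shape, target.shape)
```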

In [None]:
# Create linear regression
#TODO

# Fit the linear regression
#TODO
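The create-and-fit step can be sketched as follows. To keep the example runnable offline, it uses synthetic data generated from known coefficients (an assumption for illustration) rather than the housing features; with the notebook's data you would pass `features` and `target` to `fit` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: two features and a target built from known
# coefficients (3.0 and -1.5), an intercept of 4.0, and small noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 4.0 + rng.normal(scale=0.1, size=200)

# Create the estimator, then fit it to the data.
model = LinearRegression()
model.fit(X, y)

# The learned parameters should be close to the ones used above.
print(model.coef_, model.intercept_)
```

Inspecting `coef_` and `intercept_` after fitting is a quick sanity check that the model learned something sensible.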

In [None]:
# Evaluate model:
# Cross-validate the linear regression using R-squared
# The closer to 1.0, the better the model.

#TODO
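A possible shape for the cross-validation step is sketched below, again on synthetic data so that it runs without a download. The choice of five folds is an assumption; `scoring="r2"` asks `cross_val_score` for the R-squared value on each held-out fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with a linear signal plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 4.0 + rng.normal(scale=1.0, size=200)

# One R-squared score per fold; each fold is scored on data the
# model did not see while fitting.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores)
print(scores.mean())
```

Averaging the per-fold scores gives a single summary number, while the spread across folds hints at how stable the model is.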

In [None]:
# Look at the first value in the target vector, converted to dollars
# (MedHouseVal is expressed in units of $100,000)
target[0]*100000

Using the `predict` method, we can predict a value for that district:

In [None]:
# Predict the target value of the first observation, converted to dollars
model.predict(features)[0]*100000

Not bad for such a simple model: it was off by about $39,435 on a true median value of $452,600, an error of roughly 9%.
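Because the dataset description gives `MedHouseVal` in units of $100,000, any error has to be scaled by 100,000 to read it in dollars. The helper below is a hypothetical sketch, and the two input values are illustrative assumptions rather than outputs reproduced from this notebook.

```python
# MedHouseVal is expressed in units of $100,000.
UNIT_IN_DOLLARS = 100_000


def absolute_error_in_dollars(true_value, predicted_value):
    """Convert an error measured in dataset units into dollars."""
    return abs(true_value - predicted_value) * UNIT_IN_DOLLARS


# Illustrative values (assumed): a true target and a model prediction.
error = absolute_error_in_dollars(4.526, 4.13165)
print(f"${error:,.2f}")
```

A single prediction on a training sample says little about overall quality, so the cross-validated score remains the better guide.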