# Linear Regression Analysis on California Housing Dataset

This notebook fits linear regression models to predict median house values in California from features such as location, housing age, room and bedroom counts, population, households, and median income. It covers data cleaning, model training with both ordinary least squares and stochastic gradient descent, and evaluation with Mean Squared Error (MSE) and R-squared.

## Contents

1. **Data Loading and Exploration**
   - Load the California housing dataset from a public URL.
   - Display the first few rows of the dataset to understand its structure.

2. **Data Cleaning**
   - Remove any rows with missing values to ensure data quality.

3. **Feature and Target Separation**
   - Define the features (`x`) and the target variable (`y`) for the regression model.
   - The target variable is `median_house_value`, and all other columns are used as features.

4. **Data Splitting**
   - Split the dataset into training and testing sets using an 80-20 split to evaluate model performance.

5. **Model Training with Linear Regression**
   - Create and fit a linear regression model to the training data.
   - Make predictions on the test set.

6. **Model Evaluation**
   - Calculate and display the Mean Squared Error (MSE) to assess prediction accuracy.
   - Calculate and display the R-squared score to evaluate the model's goodness of fit.

7. **Model Training with Stochastic Gradient Descent**
   - Create and fit a Stochastic Gradient Descent (SGD) regression model to the training data.
   - Make predictions on the test set and evaluate the model using MSE.

## Dependencies

- `pandas`
- `numpy`
- `scikit-learn`

## Usage

- Run each section sequentially to perform linear regression analysis on the California housing dataset.
- Modify the input data as needed or experiment with different features or regression models.

## Notes

- Ensure that you have internet access to download the dataset from the specified URL.
- The notebook reports MSE and R-squared for the ordinary least squares model and MSE for the SGD model, all on the same test set, so the two fits can be compared directly.

In [None]:
# Core libraries; the dataset itself is read from a CSV URL in the next cell,
# so scikit-learn's built-in dataset loader is not needed.
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/a-nagar/datasets/main/california_housing.csv")

In [None]:
df.head()

| Unnamed: 0 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41 | 880 | 129.0 | 322 | 126 | 8.3252 | 452600 |
| 1 | -122.22 | 37.86 | 21 | 7099 | 1106.0 | 2401 | 1138 | 8.3014 | 358500 |
| 2 | -122.24 | 37.85 | 52 | 1467 | 190.0 | 496 | 177 | 7.2574 | 352100 |
| 3 | -122.25 | 37.85 | 52 | 1274 | 235.0 | 558 | 219 | 5.6431 | 341300 |
| 4 | -122.25 | 37.85 | 52 | 1627 | 280.0 | 565 | 259 | 3.8462 | 342200 |


In [None]:
df.dropna(inplace=True)
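
Before dropping rows, it can help to see how much data is actually missing. In the `df.head()` output, `total_bedrooms` prints as a float (`129.0`) while the other count columns print as integers, which hints, but does not prove, that this column holds missing values. The sketch below uses a small hypothetical frame (`toy`, not the real dataset) to show the check.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real CSV.
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "median_house_value": [452600, 358500, 352100, 341300],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() without inplace=True returns a new frame with NaN rows removed.
cleaned = toy.dropna()
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Running `df.isna().sum()` on the real frame before the `dropna` call would show which columns are affected and how many rows will be lost.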

In [None]:
x = df.drop("median_house_value", axis=1)
y = df["median_house_value"]
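
The `df.head()` output lists an `Unnamed: 0` column. If that is a real column in the CSV (a saved row index) rather than an artifact of how the output was rendered, the split above keeps it in `x` as a feature. A minimal sketch of one way to exclude it, using a hypothetical toy frame (`toy`) whose values mirror the first rows shown earlier:

```python
import pandas as pd

# Hypothetical toy frame; values mirror the first rows of the df.head() output.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "median_income": [8.3252, 8.3014, 7.2574],
    "median_house_value": [452600, 358500, 352100],
})

# Keep only columns whose name does not start with "Unnamed".
toy = toy.loc[:, ~toy.columns.str.startswith("Unnamed")]

# Separate features and target as in the notebook.
x_toy = toy.drop("median_house_value", axis=1)
y_toy = toy["median_house_value"]
print(list(x_toy.columns))
```

Whether this step is needed depends on the actual file, so it is worth checking `df.columns` first.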

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2)


In [None]:
# Ordinary least squares: scikit-learn solves for the coefficients directly,
# with no iterative updates.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)

In [None]:
y_pred_calculus = model.predict(x_test)

In [None]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred_calculus)

4792579709.946032
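
MSE is expressed in squared units of the target, which makes a value like the one above hard to interpret. Taking the square root gives the root mean squared error (RMSE) in the same units as `median_house_value`. A short sketch using the reported number:

```python
import math

mse = 4792579709.946032  # test-set MSE reported above
rmse = math.sqrt(mse)    # same units as median_house_value
print(f"RMSE: {rmse:,.0f}")
```

The result is roughly 69,000, so a typical prediction error is on that order, compared with target values in the hundreds of thousands in the rows shown by `df.head()`.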

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred_calculus)

0.6262817408549621
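
An R-squared of about 0.63 means the model accounts for roughly 63% of the variance of the test-set targets. Both metrics can be reproduced by hand from the residuals, with R-squared defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses hypothetical toy values (`y_true`, `y_hat`) chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical toy values, not taken from the housing data.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean

mse = ss_res / len(y_true)    # mean squared error
r2 = 1.0 - ss_res / ss_tot    # coefficient of determination
print(mse, r2)
```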

In [None]:
from sklearn.linear_model import SGDRegressor
# Note: SGD is sensitive to feature scale, and the features are passed in unscaled here.
model = SGDRegressor()
model.fit(x_train, y_train)

In [None]:
y_pred_sgd = model.predict(x_test)

In [None]:
mean_squared_error(y_test, y_pred_sgd)

3.0562617615816703e+31
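
A test MSE on the order of 1e31 suggests that the SGD fit diverged rather than converged. A likely cause is feature scale: columns such as `total_rooms` and `population` run into the thousands while `median_income` stays below 10, and SGD updates are sensitive to such differences. The sketch below shows one common remedy, which is not part of the original notebook: standardizing the features with `StandardScaler` inside a pipeline before `SGDRegressor`. It runs on synthetic data generated with NumPy so that it is self-contained; the coefficients, sample size, and `random_state=0` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical, not the housing data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([1.0, 100.0, 10000.0])
y = X @ np.array([3.0, 0.5, 0.01]) + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize each feature to zero mean and unit variance before SGD sees it.
# The scaler is fitted on the training split only, then reused at predict time.
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
scaled_sgd.fit(X_tr, y_tr)

r2_scaled = r2_score(y_te, scaled_sgd.predict(X_te))
print(f"R^2 with scaling: {r2_scaled:.3f}")
```

In the notebook, the same pipeline could be fitted on `x_train` and `y_train` and scored with `mean_squared_error` for a like-for-like comparison with the ordinary least squares result.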