# 📓 House Prices Regression Demo using Kaggle Dataset


## Step 1: Setup Kaggle API and Download Dataset

We'll use the Kaggle API to download the House Prices dataset. To do this, you need to upload your `kaggle.json` file, which contains your API credentials.

- Go to [https://www.kaggle.com/account](https://www.kaggle.com/account)
- Scroll down to the "API" section
- Click “Create New API Token”
- Save the downloaded `kaggle.json` file
- Upload it when prompted below

In [None]:
# Install the Kaggle API client
!pip install -q kaggle


# Upload the kaggle.json file (from your local computer)
from google.colab import files
files.upload()  # Choose kaggle.json when prompted

# Move the file to the right location
!mkdir -p /root/.kaggle
!cp kaggle.json /root/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json

# Download the House Prices dataset from Kaggle
!kaggle competitions download -c house-prices-advanced-regression-techniques
!unzip -q house-prices-advanced-regression-techniques.zip -d house_prices
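
If the download fails with an authentication error, it can help to check the credentials file before retrying. The helper below is a hypothetical sketch: `check_kaggle_credentials` is not part of the `kaggle` package, and it assumes the token file is JSON with `username` and `key` fields and should be readable only by its owner (the `chmod 600` step above).

```python
import json
import stat
from pathlib import Path


def check_kaggle_credentials(path="~/.kaggle/kaggle.json"):
    """Return a list of problems with the credentials file (empty if none).

    Hypothetical helper for illustration; not part of the kaggle package.
    """
    cred_path = Path(path).expanduser()
    if not cred_path.is_file():
        return [f"{cred_path} does not exist"]

    try:
        creds = json.loads(cred_path.read_text())
    except json.JSONDecodeError:
        return [f"{cred_path} is not valid JSON"]
    if not isinstance(creds, dict):
        return [f"{cred_path} does not contain a JSON object"]

    problems = []
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing field: {field}")

    # Flag files that group or other users can access (mirrors chmod 600)
    mode = stat.S_IMODE(cred_path.stat().st_mode)
    if mode & 0o077:
        problems.append(f"permissions are {oct(mode)}, expected 0o600")
    return problems


# An empty list means the file looks usable
print(check_kaggle_credentials())
```

In Colab, `~` expands to `/root`, so the default path matches the location used in the cell above.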



## Step 2: Load and Inspect the Data

Now that we have the dataset, let's load it into a pandas DataFrame and take a quick look at the structure.

We'll use the `train.csv` file, which includes both the input features and the target variable (`SalePrice`).

In [None]:
import pandas as pd

df = pd.read_csv("house_prices/train.csv")
print("Shape of dataset:", df.shape)
df.head()
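
Beyond `shape` and `head()`, a per-column summary of data types and missing values is often useful at this stage. The sketch below runs on a tiny invented table so it works without the download; the values are made up and the column names are only illustrative.

```python
import io

import pandas as pd

# Invented rows for illustration only (not real dataset values)
csv_text = """LotArea,LotFrontage,Neighborhood,SalePrice
8000,60,North,200000
9500,,South,180000
11000,70,North,225000
"""
toy = pd.read_csv(io.StringIO(csv_text))

# One row per column: its dtype and how many values are missing
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
})
print(summary)
```

Running the same two expressions on `df` shows which columns are non-numeric and which contain gaps.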

## Step 3: Preprocess the Data

To keep this demo simple, we'll do the following:

1. Keep only numeric features (to avoid complex encoding for now).
2. Drop columns with missing values.
3. Separate our input features (`X`) and the target variable (`y`).

In [None]:
# Keep only numeric columns
df_numeric = df.select_dtypes(include=["number"])

# Drop columns with missing values
df_clean = df_numeric.dropna(axis=1)

# Separate features (X) and target (y)
X = df_clean.drop("SalePrice", axis=1)
y = df_clean["SalePrice"]
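
To see what each of these three steps removes, here is the same pipeline on a small invented table. This is only a sketch: the values are made up and the `_toy` names are placeholders chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotArea": [8000, 9500, 11000],
    "LotFrontage": [60.0, np.nan, 70.0],           # numeric, one missing value
    "Neighborhood": ["North", "South", "North"],   # non-numeric
    "SalePrice": [200000, 180000, 225000],
})

numeric_toy = toy.select_dtypes(include=["number"])  # drops Neighborhood
clean_toy = numeric_toy.dropna(axis=1)               # drops LotFrontage

X_toy = clean_toy.drop("SalePrice", axis=1)
y_toy = clean_toy["SalePrice"]

print("Feature columns:", list(X_toy.columns))
print("Shape before/after:", toy.shape, "->", clean_toy.shape)
```

Note that `dropna(axis=1)` discards an entire column even if only one value is missing, while keeping every row.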

## Step 4: Train-Test Split

To evaluate our model fairly, we'll split the data into training and testing sets.  
This means the model will learn from one part and be tested on another, unseen part.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
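
The effect of `test_size` and `random_state` is easiest to see on a small array. A minimal sketch, using made-up data with 10 rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)   # 10 rows, 2 features
y_toy = np.arange(10)                  # target equals the row number

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
# Repeat the split with the same seed to check reproducibility
_, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print("Train/test shapes:", X_tr.shape, X_te.shape)
print("Same test rows on rerun:", np.array_equal(X_te, X_te_again))
print("Rows in both sets:", set(y_tr) & set(y_te))
```

With `test_size=0.2`, 2 of the 10 rows go to the test set and 8 to the training set. Fixing `random_state` makes the split repeatable, and no row appears in both sets.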

## Step 5: Train a Linear Regression Model

We'll use **Linear Regression**, one of the simplest and most interpretable machine learning models for regression tasks.

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
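
Part of what makes linear regression interpretable is that each fitted coefficient is the change in the prediction for a one-unit increase in that feature, with the other features held fixed. The sketch below fits synthetic, noise-free data generated from known weights (3, -2, and an intercept of 5, all chosen for illustration) and checks that the model recovers them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(50, 2))

# Synthetic target with known weights and no noise
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 5.0

toy_model = LinearRegression().fit(X_toy, y_toy)

print("Coefficients:", toy_model.coef_)
print("Intercept:", toy_model.intercept_)
```

For the model trained above, `pd.Series(model.coef_, index=X_train.columns)` pairs each coefficient with its feature name. Because the features are on very different scales, the raw coefficient sizes are not directly comparable.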

## Step 6: Evaluate the Model

After training the model, we want to check how well it's performing.

We'll use:
- **Root Mean Squared Error (RMSE)**: the typical size of a prediction error, in the same units as the sale price (large errors count more heavily)
- **R² Score**: how much of the variance in house prices is explained by our features

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)

# Take the square root of the MSE; newer scikit-learn releases no longer accept squared=False
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")
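
Both metrics can also be computed directly from their definitions, which makes it clearer what the numbers mean. A minimal NumPy sketch, where `rmse_and_r2` is a hypothetical helper name and the three prices are invented:

```python
import numpy as np


def rmse_and_r2(y_true, y_hat):
    """Compute RMSE and R² from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    residuals = y_true - y_hat

    # RMSE: square root of the mean squared residual
    rmse_value = np.sqrt(np.mean(residuals ** 2))

    # R²: 1 minus (residual sum of squares / total sum of squares)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2_value = 1.0 - ss_res / ss_tot
    return rmse_value, r2_value


# Every prediction is off by exactly 10
rmse_toy, r2_toy = rmse_and_r2([100, 200, 300], [110, 190, 310])
print(f"RMSE: {rmse_toy:.2f}")
print(f"R² Score: {r2_toy:.3f}")
```

Here every prediction misses by 10, so the RMSE is 10. The residual sum of squares is 300 against a total of 20,000, so R² is 1 - 300/20000 = 0.985.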

## Step 7: Visualize Predictions

A scatter plot of predicted prices vs. actual prices helps us visually assess model performance.  
If predictions are perfect, points will lie along the diagonal.

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel("Actual Sale Price")
plt.ylabel("Predicted Sale Price")
plt.title("Predicted vs. Actual House Prices")
plt.grid(True)
plt.show()

## ✅ Summary

In this notebook, we:
- Downloaded real-world housing data from Kaggle
- Cleaned and prepared the data
- Trained a basic regression model using Linear Regression
- Evaluated and visualized the results

This is just a starting point. You can improve the model by:
- Handling categorical variables (e.g., one-hot encoding)
- Filling in missing values instead of dropping them
- Trying other models like Decision Trees or XGBoost
- Performing feature selection and engineering
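
As a starting point for the first two improvements, the sketch below fills a missing numeric value with the column median and one-hot encodes a categorical column. It uses a tiny invented table; median imputation is just one common choice, not the only reasonable one.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 70.0],
    "Neighborhood": ["North", "South", "North"],
})

# Fill the gap with the column median instead of dropping the column
filled = toy.copy()
filled["LotFrontage"] = filled["LotFrontage"].fillna(
    filled["LotFrontage"].median()
)

# Replace the text column with one indicator column per category
encoded = pd.get_dummies(filled, columns=["Neighborhood"])
print(encoded)
```

When applying this to the real data, compute the median on the training split only and reuse it for the test split, so that no information from the test set leaks into training.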