# Project Title: House Price Prediction 🏡  
## 1. Introduction  
This project predicts house prices with a **Linear Regression model** trained on four features: overall quality (`OverallQual`), above-ground living area (`GrLivArea`), garage capacity in cars (`GarageCars`), and total basement area (`TotalBsmtSF`).

## 2. Load Dataset  
First, we import necessary libraries and load the dataset.


In [10]:
import pandas as pd

# Load the dataset (update the filename if needed)
file_path = "/content/test/train.csv"
df = pd.read_csv(file_path)

# Display first few rows
df.head()


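| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |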


## 3. Data Preprocessing 🛠️
We check for missing values, then fill numeric columns with the column mean and categorical columns with the most frequent value (mode).


In [11]:
# Check for missing values
df.isnull().sum()


| Column | Missing values |
|---|---|
| Id | 0 |
| MSSubClass | 0 |
| MSZoning | 0 |
| LotFrontage | 259 |
| LotArea | 0 |
| ... | ... |
| MoSold | 0 |
| YrSold | 0 |
| SaleType | 0 |
| SaleCondition | 0 |
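
The full `isnull().sum()` listing is long, so it can help to keep only the columns that actually have gaps and express them as a share of all rows. The following is a minimal sketch on a toy frame; the `toy` data and the `summary` layout are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (illustrative values, not read from train.csv)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
})

# Count missing values per column and keep only columns that have any
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# Express the counts as a percentage of all rows
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(1),
})
print(summary)
```

On the real data, replacing `toy` with `df` would give the same kind of summary for every column with missing entries.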


In [12]:
# Check data types
df.dtypes


| Column | dtype |
|---|---|
| Id | int64 |
| MSSubClass | int64 |
| MSZoning | object |
| LotFrontage | float64 |
| LotArea | int64 |
| ... | ... |
| MoSold | int64 |
| YrSold | int64 |
| SaleType | object |
| SaleCondition | object |


In [13]:
# Fill numeric columns with mean
df.fillna(df.select_dtypes(include=['number']).mean(), inplace=True)

# Fill categorical columns with mode (most frequent value).
# Assign the result back instead of calling fillna(..., inplace=True) on df[col]:
# the chained in-place call raises a FutureWarning and stops working in pandas 3.0.
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].fillna(df[col].mode()[0])
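
The mean/mode imputation in this step can be checked on a small frame before running it on the full dataset. This is a sketch under assumptions: the `toy` frame and its values are made up for illustration, and results are assigned back to the columns rather than filled in place.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical gap (illustrative values)
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "MSZoning": ["RL", None, "RL"],
})

# Numeric columns: replace NaN with the column mean
num_cols = toy.select_dtypes(include=["number"]).columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# Non-numeric columns: replace missing entries with the most frequent value
for col in toy.select_dtypes(exclude=["number"]).columns:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Assigning with `toy[col] = toy[col].fillna(...)` modifies the original frame directly, so it does not depend on chained in-place behavior.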


In [14]:
print(df.columns)


Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [16]:
# Define features (modify based on dataset)
features = ["OverallQual", "GrLivArea", "GarageCars", "TotalBsmtSF"]  # Adjust as needed

# Define target variable (house price)
target = "SalePrice"

# Split into input (X) and output (y)
X = df[features]
y = df[target]
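
One way to sanity-check a hand-picked feature list is to look at how strongly each candidate correlates with the target. The sketch below uses synthetic data: the pricing rule, the noise level, and the `NoiseFeature` column are hypothetical and only serve to show the pattern.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the housing data (not the real train.csv)
toy = pd.DataFrame({
    "GrLivArea": rng.uniform(500, 3000, n),
    "GarageCars": rng.integers(0, 4, n),
    "NoiseFeature": rng.normal(size=n),  # hypothetical column unrelated to price
})

# Hypothetical pricing rule with random noise added
toy["SalePrice"] = (
    100 * toy["GrLivArea"]
    + 10_000 * toy["GarageCars"]
    + rng.normal(0, 20_000, n)
)

# Pearson correlation of every feature with the target, strongest first
corr = toy.corr()["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print(corr)
```

On the real data, `df[features + [target]].corr()[target]` would give the same kind of ranking for the four chosen features. Correlation only captures linear, pairwise relationships, so it is a rough guide rather than a selection rule.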


## 4. Model Training
We split the dataset into training (80%) and test (20%) sets and train a Linear Regression model.

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
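
To see what `test_size=0.2` and `random_state=42` do, the split can be tried on a tiny array first. This is a minimal sketch with made-up data; only the `train_test_split` arguments mirror the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each (illustrative values)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 20% of the rows go to the test set; the seed makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)

# Repeating the call with the same seed yields the same test rows
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test_again))
```

Fixing `random_state` keeps the reported metrics comparable between runs, since the model is always evaluated on the same held-out rows.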


In [18]:
from sklearn.linear_model import LinearRegression

# Initialize model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)
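
After fitting, a linear model exposes one coefficient per feature plus an intercept, which shows how each feature moves the prediction. The sketch below fits on synthetic, noise-free data so the recovered values can be compared with a known rule; the rule and the data are hypothetical, not results from the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic features (illustrative, not the real train.csv)
X_toy = pd.DataFrame({
    "GrLivArea": rng.uniform(500, 3000, 50),
    "GarageCars": rng.integers(0, 4, 50).astype(float),
})

# Hypothetical noise-free pricing rule
y_toy = 100.0 * X_toy["GrLivArea"] + 10_000.0 * X_toy["GarageCars"] + 20_000.0

toy_model = LinearRegression().fit(X_toy, y_toy)

# One coefficient per feature, labeled with the column names
coefs = pd.Series(toy_model.coef_, index=X_toy.columns)
print(coefs)
print(toy_model.intercept_)
```

For the notebook's model, `pd.Series(model.coef_, index=features)` would list the fitted coefficient for each of the four features in the same way.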


In [19]:
# Predict house prices
y_pred = model.predict(X_test)

# Compare actual vs. predicted prices
pd.DataFrame({"Actual Price": y_test.values, "Predicted Price": y_pred})


| | Actual Price | Predicted Price |
|---|---|---|
| 0 | 154500 | 143522.478557 |
| 1 | 325000 | 288596.453556 |
| 2 | 115000 | 136156.554463 |
| 3 | 159000 | 187027.687043 |
| 4 | 315500 | 293501.054731 |
| ... | ... | ... |
| 287 | 89471 | 113929.499753 |
| 288 | 260000 | 219999.640812 |
| 289 | 189000 | 194052.924181 |
| 290 | 108000 | 106271.116816 |


## 5. Model Evaluation
We evaluate the model on the test set using Mean Absolute Error (MAE), Mean Squared Error (MSE), and R² score.

In [20]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Calculate error metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print results
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R² Score: {r2}")


Mean Absolute Error (MAE): 25446.0547392125
Mean Squared Error (MSE): 1602914819.443908
R² Score: 0.7910239048318479
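
The three metrics can also be written out with NumPy to make their definitions explicit. The sketch below uses only the first five actual/predicted pairs from the comparison table in this notebook, so its values differ from the full test-set results; it also converts the reported MSE into a root mean squared error (RMSE), which is in the same units as `SalePrice`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# First five rows of the actual vs. predicted comparison table
y_true = np.array([154500, 325000, 115000, 159000, 315500], dtype=float)
y_hat = np.array([
    143522.478557, 288596.453556, 136156.554463, 187027.687043, 293501.054731,
])

errors = y_true - y_hat

# MAE: average absolute error
mae_manual = np.mean(np.abs(errors))
# MSE: average squared error (units are SalePrice squared)
mse_manual = np.mean(errors ** 2)
# R²: 1 minus residual sum of squares over total sum of squares
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))

# RMSE for the full test set, derived from the MSE printed by the notebook
rmse_reported = np.sqrt(1602914819.443908)
print(rmse_reported)
```

The reported MSE corresponds to an RMSE of roughly 40,000. That is noticeably larger than the MAE of about 25,446, which suggests that a few large errors weigh heavily on the squared metric.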


## 📌 6. Conclusion

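✅ We built a Linear Regression model that predicts house prices from four features: `OverallQual`, `GrLivArea`, `GarageCars`, and `TotalBsmtSF`.

✅ On the held-out 20% test set, the model reached an R² of about 0.79 and an MAE of about 25,446 (in the units of `SalePrice`).

✅ Future improvements could include trying models such as Random Forest or XGBoost, which may capture non-linear relationships that a linear model misses.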
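
As a starting point for that comparison, scikit-learn's `RandomForestRegressor` can be dropped into the same split-fit-score pattern. The sketch below runs on synthetic data from `make_regression`, so its scores say nothing about the housing dataset; the sample size, noise level, and `n_estimators=100` are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem with four features (not the housing data)
X_syn, y_syn = make_regression(
    n_samples=300, n_features=4, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each candidate on the same training rows and score it on the same test rows
scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, estimator.predict(X_te))

print(scores)
```

Because this synthetic target is linear by construction, the linear model should score well here. Whether a tree ensemble helps on the real data depends on how non-linear the price relationships are, which only a run on `X_train` and `X_test` can show.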

## 🔗 7. References

Kaggle Dataset: Iowa House Prices

Scikit-learn Documentation: Linear Regression