In [4]:
import pandas as pd

df = pd.read_csv("C:/Users/DELL/OneDrive/Desktop/Academics/3rd Year/5th Sem/Academic Related Docs/Junaid's Intership Tasks/Task 04/DataBase- for Task04/house_prices.csv")

df.head()


   Size  Location  Number of Rooms   Price
0  1500     Urban                3  250000
1  1600  Suburban                4  270000
2  1700     Urban                3  290000
3  1200     Rural                2  200000
4  1800  Suburban                4  300000


In [2]:
print("Shape:", df.shape)

print("\nMissing values:\n", df.isnull().sum())

print("\nData types:\n", df.dtypes)

print("\nDuplicate rows:", df.duplicated().sum())

print("\nSummary Statistics:\n", df.describe())


Shape: (20, 4)

Missing values:
 Size               0
Location           0
Number of Rooms    0
Price              0
dtype: int64

Data types:
 Size                int64
Location           object
Number of Rooms     int64
Price               int64
dtype: object

Duplicate rows: 0

Summary Statistics:
               Size  Number of Rooms          Price
count    20.000000        20.000000      20.000000
mean   1745.000000         3.750000  291750.000000
std     365.952255         1.118034   62244.741904
min    1100.000000         2.000000  190000.000000
25%    1487.500000         3.000000  246250.000000
50%    1725.000000         4.000000  292500.000000
75%    1962.500000         4.250000  332500.000000
max    2500.000000         6.000000  400000.000000


### Step 01: Load and Explore Dataset

I loaded the `house_prices.csv` dataset and explored its structure.  
I checked for missing values, duplicates, data types, and summary statistics.

The dataset is small (20 rows, 4 columns) and clean: it has no missing values and no duplicate rows. `Size`, `Location`, and `Number of Rooms` are the features, and `Price` is the target to predict.


In [9]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

X = df.drop('Price', axis=1)
y = df['Price']

numeric_features = ['Size', 'Number of Rooms']
categorical_features = ['Location']

numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop='first')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)


### Step 02: Preprocess the Data

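In this step, I defined how the dataset is prepared for modeling:
- Separated the features (`X`) from the target (`Price`)
- Set up standard scaling for the numeric features `Size` and `Number of Rooms`
- Set up one-hot encoding for the categorical `Location` column, dropping the first category so the encoded columns are not redundant

The transformers are only defined here. They are fitted later, as part of the pipeline.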


In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


### Step 03: Train-Test Split

I split the dataset into:
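- 80% for training (16 rows)
- 20% for testing (4 rows)

This lets me evaluate the model on data it did not see during training, and `random_state=42` makes the split reproducible. With only 4 test rows, the resulting metrics are a rough estimate.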
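As a quick check of the split sizes, the sketch below runs the same `train_test_split` call on a stand-in list of 20 row labels. The `row_ids` list is an assumption that the DataFrame uses the default 0 to 19 index.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 20 row labels of the dataset
# (assumption: default integer index 0..19).
row_ids = list(range(20))

train_ids, test_ids = train_test_split(
    row_ids, test_size=0.2, random_state=42
)

print("Training rows:", len(train_ids))
print("Test rows:", len(test_ids))
print("Test row labels:", test_ids)
```

With the same seed and row count, this should select the same test rows that appear in the evaluation output of this notebook (rows 0, 17, 15, and 1).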


In [11]:
from sklearn.linear_model import LinearRegression

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

model.fit(X_train, y_train)


### Step 04: Train the Model

I built a pipeline with:
- Data preprocessing (scaling + encoding)
- Linear regression model

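Because the preprocessing lives inside the pipeline, the scaler and encoder are fitted on the training data only, and the same transformations are applied automatically at prediction time.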


In [14]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared Score (R²):", r2)

comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print("\nActual vs Predicted:")
print(comparison)


Mean Squared Error (MSE): 84997417.40189402
R-squared Score (R²): 0.9302585293112664

Actual vs Predicted:
    Actual      Predicted
0   250000  263239.227045
17  280000  284889.592096
15  190000  178139.433378
1   270000  270362.471232


### Step 05: Predict and Evaluate

I used the trained model to predict house prices on test data.

Metrics:
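- Mean Squared Error (MSE): the average of the squared prediction errors, in squared price units. Here it is about 85.0 million, so the typical error (its square root, the RMSE) is about 9,200.
- R-squared (R²): the share of the variance in price that the model explains. Here it is about 0.93.

I also compared actual and predicted values side by side. The largest miss among the 4 test rows is about 13,200.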
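The two metrics can be reproduced by hand from the four rows of the "Actual vs Predicted" output. This is a minimal sketch using only the standard library. The predicted values are copied as printed, so the results match the notebook only up to rounding.

```python
import math

# Values copied from the "Actual vs Predicted" output of the notebook.
actual = [250000, 280000, 190000, 270000]
predicted = [263239.227045, 284889.592096, 178139.433378, 270362.471232]

n = len(actual)
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

# MSE: mean of the squared errors. RMSE: its square root, in price units.
mse = sum(squared_errors) / n
rmse = math.sqrt(mse)

# R²: 1 minus (residual sum of squares / total sum of squares).
mean_actual = sum(actual) / n
ss_res = sum(squared_errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R²:   {r2:.4f}")
```

The RMSE is easier to interpret than the MSE because it is in the same units as `Price`.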


### Step 06: Final Summary

In this task, I worked on predicting house prices using linear regression.

Here’s what I did:
1. Loaded and explored a housing dataset
2. Preprocessed the data using standard scaling and one-hot encoding
3. Split the data into training and test sets
4. Built a pipeline combining preprocessing and regression
5. Evaluated the model using Mean Squared Error and R² score

This task gave me hands-on experience with preprocessing real-world data and using machine learning models to make predictions.
