In [None]:
Step 1: Understand the Data
Look at your dataset and identify columns.
Identify target variable (what you want to predict) → price.
Identify features (variables that may influence price) → e.g., bedrooms, bathrooms, sqft_living, floors, yr_built, etc.

Step 2: Clean the Data
Check for missing values or errors and decide how to handle them (remove rows, fill with average, etc.).
Ensure all selected features are in numerical format.
Convert dates or categorical variables to numbers if needed.
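The cleaning steps above can be sketched roughly as follows, assuming pandas. The `toy` frame and the derived `sale_month` column are hypothetical illustrations, not part of the dataset, and filling with the mean is only one of several reasonable choices.

```python
import pandas as pd

# Hypothetical toy frame; the real dataset has many more rows and columns.
toy = pd.DataFrame({
    "date": ["2014-05-02 00:00:00", "2014-05-03 00:00:00", "2014-07-10 00:00:00"],
    "sqft_living": [1340, None, 2000],
    "bedrooms": [3.0, 5.0, 3.0],
})

# Count missing values per column.
missing = toy.isna().sum()

# One option: fill a missing numeric value with the column mean.
toy["sqft_living"] = toy["sqft_living"].fillna(toy["sqft_living"].mean())

# Convert the date string to a number, here the month of sale (an assumption).
toy["sale_month"] = pd.to_datetime(toy["date"]).dt.month
toy = toy.drop(columns=["date"])

print(missing)
print(toy)
```

Dropping the affected rows with `dropna()` is the simpler alternative when only a few values are missing.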

Step 3: Select Features
Choose features that are likely to impact the house price.
Exclude text columns such as street, statezip, or country unless you plan to encode them as numbers.
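As a small illustration of separating the target from the features, the sketch below keeps only numeric columns. The `toy` frame is hypothetical, and selecting by dtype is an assumption about how one might filter columns, not the only way to do it.

```python
import pandas as pd

# Hypothetical two-row frame with a mix of numeric and text columns.
toy = pd.DataFrame({
    "price": [313000.0, 2384000.0],
    "bedrooms": [3.0, 5.0],
    "sqft_living": [1340, 3650],
    "street": ["18810 Densmore Ave N", "709 W Blaine St"],
    "country": ["USA", "USA"],
})

target = "price"

# Keep numeric columns only, then remove the target from the feature list.
numeric_cols = toy.select_dtypes(include="number").columns
feature_cols = [c for c in numeric_cols if c != target]

X = toy[feature_cols]  # features
y = toy[target]        # target

print(feature_cols)
```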

Step 4: Split Data
Divide the dataset into training data (to train the model) and testing data (to evaluate performance).
Usually, 70–80% for training and 20–30% for testing.
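One common way to make this split is scikit-learn's `train_test_split`. The sketch below uses placeholder arrays and an 80/20 split; the `random_state` value is arbitrary and only makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```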

Step 5: Train Linear Regression Model
Fit a linear regression model using the training data.
The model finds the relationship between features and the price.

Step 6: Make Predictions
Use the trained model to predict prices on the testing data.
This helps you see how well the model generalizes to new data.
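Steps 5 and 6 can be sketched together with scikit-learn's `LinearRegression`. The data here is hypothetical and follows an exact linear rule so the behavior is easy to check; real prices are noisy, so predictions on the actual dataset will not match this closely.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data that follows an exact linear rule:
# price = 100 * sqft_living + 5000 * bedrooms + 20000
X_train = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "bedrooms": [2, 3, 3, 4, 5],
})
y_train = 100 * X_train["sqft_living"] + 5000 * X_train["bedrooms"] + 20000

# Fit the model on the training data.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the price of a house the model has not seen.
X_new = pd.DataFrame({"sqft_living": [1800], "bedrooms": [3]})
predicted = model.predict(X_new)

print(predicted)
```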

Step 7: Evaluate the Model
Check how accurate the predictions are:
Mean Squared Error (MSE) → average of the squared differences between predicted and actual prices
R² score → proportion of the variance in price explained by the model
On the test data, a higher R² and a lower MSE indicate a better fit.
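To make the two metrics concrete, here is a plain-Python sketch of their definitions on made-up numbers. In practice, scikit-learn's `mean_squared_error` and `r2_score` in `sklearn.metrics` compute the same quantities.

```python
# Made-up actual and predicted values for illustration.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

n = len(y_true)

# MSE: mean of the squared prediction errors.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
mse = ss_res / n

# R^2: 1 minus (residual sum of squares / total sum of squares).
mean_true = sum(y_true) / n
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("R2:", r2)
```

Because MSE is in squared units of the target, its square root (RMSE) is often easier to read as an error in price units.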

Step 8: Interpret Results
Look at the coefficients for each feature.
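Positive → price rises as the feature increases, with the other features held fixed
Negative → price falls as the feature increases, with the other features held fixed
The size of a coefficient depends on the feature's units, so compare magnitudes only after putting the features on a common scale.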
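A rough sketch of reading coefficients, assuming scikit-learn and pandas. The data and the `house_age` feature are hypothetical, and multiplying each coefficient by the feature's standard deviation is just one simple heuristic for comparing features measured in different units.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data following: price = 150 * sqft_living - 1000 * house_age + 50000
X = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "house_age": [10, 40, 5, 60, 20],
})
y = 150 * X["sqft_living"] - 1000 * X["house_age"] + 50000

model = LinearRegression().fit(X, y)

# Pair each coefficient with its feature name; the sign gives the direction.
coefs = pd.Series(model.coef_, index=X.columns)

# Heuristic: price change for a one-standard-deviation change in each feature.
effect_per_std = coefs * X.std()

print(coefs)
print(effect_per_std)
```

Here `house_age` has the larger raw coefficient in absolute value, yet `sqft_living` has the larger effect per standard deviation, which is why raw magnitudes alone can mislead.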

In [20]:
import kagglehub
import os 
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("shree1992/housedata")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/shree1992/housedata?dataset_version_number=2...


100%|████████████████████████████████████████████████████████████████████████████████| 432k/432k [00:02<00:00, 212kB/s]

Extracting files...
Path to dataset files: C:\Users\USER PC\.cache\kagglehub\datasets\shree1992\housedata\versions\2





In [24]:
file_path = os.path.join(path, 'data.csv')
df = pd.read_csv(file_path)
df

Unnamed: 0,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,street,city,statezip,country
0,2014-05-02 00:00:00,3.130000e+05,3.0,1.50,1340,7912,1.5,0,0,3,1340,0,1955,2005,18810 Densmore Ave N,Shoreline,WA 98133,USA
1,2014-05-02 00:00:00,2.384000e+06,5.0,2.50,3650,9050,2.0,0,4,5,3370,280,1921,0,709 W Blaine St,Seattle,WA 98119,USA
2,2014-05-02 00:00:00,3.420000e+05,3.0,2.00,1930,11947,1.0,0,0,4,1930,0,1966,0,26206-26214 143rd Ave SE,Kent,WA 98042,USA
3,2014-05-02 00:00:00,4.200000e+05,3.0,2.25,2000,8030,1.0,0,0,4,1000,1000,1963,0,857 170th Pl NE,Bellevue,WA 98008,USA
4,2014-05-02 00:00:00,5.500000e+05,4.0,2.50,1940,10500,1.0,0,0,4,1140,800,1976,1992,9105 170th Ave NE,Redmond,WA 98052,USA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,2014-07-09 00:00:00,3.081667e+05,3.0,1.75,1510,6360,1.0,0,0,4,1510,0,1954,1979,501 N 143rd St,Seattle,WA 98133,USA
4596,2014-07-09 00:00:00,5.343333e+05,3.0,2.50,1460,7573,2.0,0,0,3,1460,0,1983,2009,14855 SE 10th Pl,Bellevue,WA 98007,USA
4597,2014-07-09 00:00:00,4.169042e+05,3.0,2.50,3010,7014,2.0,0,0,3,3010,0,2009,0,759 Ilwaco Pl NE,Renton,WA 98059,USA
4598,2014-07-10 00:00:00,2.034000e+05,4.0,2.00,2090,6630,1.0,0,0,3,1070,1020,1974,0,5148 S Creston St,Seattle,WA 98178,USA


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           4600 non-null   object 
 1   price          4600 non-null   float64
 2   bedrooms       4600 non-null   float64
 3   bathrooms      4600 non-null   float64
 4   sqft_living    4600 non-null   int64  
 5   sqft_lot       4600 non-null   int64  
 6   floors         4600 non-null   float64
 7   waterfront     4600 non-null   int64  
 8   view           4600 non-null   int64  
 9   condition      4600 non-null   int64  
 10  sqft_above     4600 non-null   int64  
 11  sqft_basement  4600 non-null   int64  
 12  yr_built       4600 non-null   int64  
 13  yr_renovated   4600 non-null   int64  
 14  street         4600 non-null   object 
 15  city           4600 non-null   object 
 16  statezip       4600 non-null   object 
 17  country        4600 non-null   object 
dtypes: float64(4), int64(9), object(5)

In [26]:
# Show all columns
print(df.columns)

# Check for duplicates
print("Duplicate rows:", df.duplicated().sum())

Index(['date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'view', 'condition', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'street', 'city',
       'statezip', 'country'],
      dtype='object')
Duplicate rows: 0


In [27]:
columns_to_drop = ["date", "street", "country", "city"]  # column names are case-sensitive
df = df.drop(columns=columns_to_drop, errors='ignore')  # skip any column that is not present

# Check result
df.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,statezip
0,313000.0,3.0,1.5,1340,7912,1.5,0,0,3,1340,0,1955,2005,WA 98133
1,2384000.0,5.0,2.5,3650,9050,2.0,0,4,5,3370,280,1921,0,WA 98119
2,342000.0,3.0,2.0,1930,11947,1.0,0,0,4,1930,0,1966,0,WA 98042
3,420000.0,3.0,2.25,2000,8030,1.0,0,0,4,1000,1000,1963,0,WA 98008
4,550000.0,4.0,2.5,1940,10500,1.0,0,0,4,1140,800,1976,1992,WA 98052


In [28]:
columns_to_drop = ["statezip", "country"]  # non-numeric location columns
df = df.drop(columns=columns_to_drop, errors='ignore')  # skip any column that is not present

# Check result
df.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated
0,313000.0,3.0,1.5,1340,7912,1.5,0,0,3,1340,0,1955,2005
1,2384000.0,5.0,2.5,3650,9050,2.0,0,4,5,3370,280,1921,0
2,342000.0,3.0,2.0,1930,11947,1.0,0,0,4,1930,0,1966,0
3,420000.0,3.0,2.25,2000,8030,1.0,0,0,4,1000,1000,1963,0
4,550000.0,4.0,2.5,1940,10500,1.0,0,0,4,1140,800,1976,1992
