In [None]:
Step 1: Understand the Data
Look at your dataset and identify columns.
Identify target variable (what you want to predict) → price.
Identify features (variables that may influence price) → e.g., bedrooms, bathrooms, sqft_living, floors, yr_built, etc.

Step 2: Clean the Data
Check for missing values or errors and decide how to handle them (remove rows, fill with average, etc.).
Ensure all selected features are in numerical format.
Convert dates or categorical variables to numbers if needed.
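
The cleaning steps above can be sketched on a tiny made-up frame. The values, the mean-fill choice, and the derived column names `sale_month` and `sale_day_of_year` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame that mimics a few columns of the housing data.
toy = pd.DataFrame({
    "date": ["2014-05-02 00:00:00", "2014-05-02 00:00:00", "2014-07-09 00:00:00"],
    "price": [313000.0, 420000.0, 308166.0],
    "bedrooms": [3.0, np.nan, 3.0],
})

# 1. Count missing values per column.
missing_counts = toy.isna().sum()

# 2. Fill the missing bedroom count with the column mean
#    (one option among several; dropping the row is another).
toy["bedrooms"] = toy["bedrooms"].fillna(toy["bedrooms"].mean())

# 3. Turn the date string into numeric features (hypothetical column names).
toy["date"] = pd.to_datetime(toy["date"])
toy["sale_month"] = toy["date"].dt.month
toy["sale_day_of_year"] = toy["date"].dt.dayofyear
toy = toy.drop(columns="date")

print(missing_counts)
print(toy.dtypes)
```

After these steps every remaining column is numeric, which is what a linear regression needs.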

Step 3: Select Features
Choose features that are likely to impact the house price.
Exclude irrelevant information like street, statezip, or country (unless you plan to encode them).
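
If you prefer to encode a location column rather than drop it, one-hot encoding is one possible approach. The sketch below uses the first four rows shown in the data preview; the choice of `pd.get_dummies` with `drop_first=True` is an assumption, not something the notebook does.

```python
import pandas as pd

# First four rows of the preview, restricted to a few columns.
toy = pd.DataFrame({
    "sqft_living": [1340, 3650, 1930, 2000],
    "city": ["Shoreline", "Seattle", "Kent", "Bellevue"],
    "country": ["USA", "USA", "USA", "USA"],
    "price": [313000.0, 2384000.0, 342000.0, 420000.0],
})

# A column with a single distinct value carries no information for the model.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
toy = toy.drop(columns=constant_cols)

# One-hot encode city; drop_first avoids a redundant (collinear) column.
encoded = pd.get_dummies(toy, columns=["city"], drop_first=True, dtype=int)

print(constant_cols)
print(encoded.columns.tolist())
```

With many distinct cities or zip codes this adds many columns, so it is worth checking how many categories there are first.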

Step 4: Split Data
Divide the dataset into training data (to train the model) and testing data (to evaluate performance).
Usually, 70–80% for training and 20–30% for testing.
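
As a quick check of what an 80/20 split produces, the sketch below splits placeholder arrays with the same number of rows as the dataset (4600). The arrays are made up; only the row count comes from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 4600
X = np.arange(n_rows * 2).reshape(n_rows, 2)  # placeholder features
y = np.arange(n_rows)                         # placeholder target

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

No row appears in both parts, so the test rows stay unseen during training.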

Step 5: Train Linear Regression Model
Fit a linear regression model using the training data.
The model finds the relationship between features and the price.

Step 6: Make Predictions
Use the trained model to predict prices on the testing data.
This helps you see how well the model generalizes to new data.
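
Steps 5 and 6 can be illustrated on synthetic data where the true relationship is known. The rule `y = 3*x1 - 2*x2 + 5` and the 150/50 split are made up for this sketch; with no noise, the fitted model should recover the rule almost exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))

# Noise-free synthetic target: y = 3*x1 - 2*x2 + 5
y = 3 * X[:, 0] - 2 * X[:, 1] + 5

# Fit on the first 150 rows, predict on the remaining 50.
model = LinearRegression()
model.fit(X[:150], y[:150])
y_pred = model.predict(X[150:])

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

Real house prices are noisy and not exactly linear in the features, so the fit on the actual data will be far less clean than this.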

Step 7: Evaluate the Model
Check how accurate the predictions are:
Mean Squared Error (MSE) → average of the squared prediction errors, in squared price units
R² score → how much variance in price is explained by your model
Higher R² and lower MSE indicate a better model.
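
To make the two metrics concrete, the sketch below computes them by hand on four made-up prices and compares the result with scikit-learn. MSE is the mean of (actual - predicted)², and R² is 1 - SS_res / SS_tot, where SS_tot measures the spread of the actual prices around their mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted prices.
y_true = np.array([300000.0, 400000.0, 500000.0, 600000.0])
y_hat = np.array([320000.0, 380000.0, 530000.0, 590000.0])

# MSE: mean of the squared errors.
mse_manual = np.mean((y_true - y_hat) ** 2)

# RMSE: square root of MSE, back in price units.
rmse = np.sqrt(mse_manual)

# R²: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
print(rmse)
```

Because MSE is in squared units, its square root (RMSE) is usually easier to read as a typical error size.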

Step 8: Interpret Results
Look at the coefficients for each feature.
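Positive → predicted price rises as the feature increases, holding the other features fixed
Negative → predicted price falls as the feature increases, holding the other features fixed
The size of a coefficient depends on the feature's units (per square foot vs. per floor), so compare magnitudes only between features measured on similar scales.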
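
One way to compare features measured in different units is to standardize them before fitting. The sketch below is an illustration on synthetic data: the pricing rule, the noise level, and the use of `StandardScaler` are all assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
sqft = rng.uniform(500, 4000, n)      # feature with a large numeric range
waterfront = rng.integers(0, 2, n)    # 0/1 feature

# Made-up pricing rule: 200 per square foot, 50,000 for waterfront, plus noise.
price = 200 * sqft + 50000 * waterfront + rng.normal(0, 1000, n)

X = np.column_stack([sqft, waterfront])

# Raw coefficients: price change per one unit of each feature.
raw = LinearRegression().fit(X, price)

# Standardized coefficients: price change per one standard deviation of each feature.
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), price)

print("raw:   ", raw.coef_)
print("scaled:", scaled.coef_)
```

On the raw scale the waterfront coefficient is much larger, yet after standardizing, square footage accounts for more of the price variation in this synthetic data. The ranking of raw coefficients alone can therefore be misleading.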

Path to dataset files: C:\Users\USER PC\.cache\kagglehub\datasets\shree1992\housedata\versions\2


In [67]:
file_path = os.path.join(path, 'data.csv')
df = pd.read_csv(file_path)
df

Unnamed: 0,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,street,city,statezip,country
0,2014-05-02 00:00:00,3.130000e+05,3.0,1.50,1340,7912,1.5,0,0,3,1340,0,1955,2005,18810 Densmore Ave N,Shoreline,WA 98133,USA
1,2014-05-02 00:00:00,2.384000e+06,5.0,2.50,3650,9050,2.0,0,4,5,3370,280,1921,0,709 W Blaine St,Seattle,WA 98119,USA
2,2014-05-02 00:00:00,3.420000e+05,3.0,2.00,1930,11947,1.0,0,0,4,1930,0,1966,0,26206-26214 143rd Ave SE,Kent,WA 98042,USA
3,2014-05-02 00:00:00,4.200000e+05,3.0,2.25,2000,8030,1.0,0,0,4,1000,1000,1963,0,857 170th Pl NE,Bellevue,WA 98008,USA
4,2014-05-02 00:00:00,5.500000e+05,4.0,2.50,1940,10500,1.0,0,0,4,1140,800,1976,1992,9105 170th Ave NE,Redmond,WA 98052,USA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,2014-07-09 00:00:00,3.081667e+05,3.0,1.75,1510,6360,1.0,0,0,4,1510,0,1954,1979,501 N 143rd St,Seattle,WA 98133,USA
4596,2014-07-09 00:00:00,5.343333e+05,3.0,2.50,1460,7573,2.0,0,0,3,1460,0,1983,2009,14855 SE 10th Pl,Bellevue,WA 98007,USA
4597,2014-07-09 00:00:00,4.169042e+05,3.0,2.50,3010,7014,2.0,0,0,3,3010,0,2009,0,759 Ilwaco Pl NE,Renton,WA 98059,USA
4598,2014-07-10 00:00:00,2.034000e+05,4.0,2.00,2090,6630,1.0,0,0,3,1070,1020,1974,0,5148 S Creston St,Seattle,WA 98178,USA


In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           4600 non-null   object 
 1   price          4600 non-null   float64
 2   bedrooms       4600 non-null   float64
 3   bathrooms      4600 non-null   float64
 4   sqft_living    4600 non-null   int64  
 5   sqft_lot       4600 non-null   int64  
 6   floors         4600 non-null   float64
 7   waterfront     4600 non-null   int64  
 8   view           4600 non-null   int64  
 9   condition      4600 non-null   int64  
 10  sqft_above     4600 non-null   int64  
 11  sqft_basement  4600 non-null   int64  
 12  yr_built       4600 non-null   int64  
 13  yr_renovated   4600 non-null   int64  
 14  street         4600 non-null   object 
 15  city           4600 non-null   object 
 16  statezip       4600 non-null   object 
 17  country        4600 non-null   object 
dtypes: float

In [69]:
# Imports used below (repeated here so the cell runs on its own)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


# ==========================================================
# 3. REMOVE UNNECESSARY COLUMNS
# ==========================================================
columns_to_drop = ["id", "date", "street", "city", "statezip", "country"]

df = df.drop(columns=columns_to_drop, errors="ignore")

print("After dropping unnecessary columns:", df.shape)

# ==========================================================
# 4. CLEAN MISSING VALUES
# ==========================================================
df = df.dropna()   # Drop missing rows
print("After dropping missing values:", df.shape)

# ==========================================================
# 5. SELECT FEATURES + TARGET
# ==========================================================
target = "price"
features = df.columns.drop(target)

X = df[features]
y = df[target]

print("Features used for training:", list(features))

# ==========================================================
# 6. TRAIN/TEST SPLIT
# ==========================================================
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)

# ==========================================================
# 7. TRAIN LINEAR REGRESSION MODEL
# ==========================================================
model = LinearRegression()
model.fit(X_train, y_train)

# ==========================================================
# 8. MAKE PREDICTIONS
# ==========================================================
y_pred = model.predict(X_test)

# ==========================================================
# 9. EVALUATE MODEL
# ==========================================================
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\nModel Evaluation:")
print("Mean Squared Error (MSE):", mse)
print("R² Score:", r2)

# ==========================================================
# 10. INTERPRET COEFFICIENTS
# ==========================================================
coef_df = pd.DataFrame({
    "Feature": features,
    "Coefficient": model.coef_
}).sort_values(by="Coefficient", ascending=False)

print("\nFeature Importance (Coefficients):")
print(coef_df)



After dropping unnecessary columns: (4600, 13)
After dropping missing values: (4600, 13)
Features used for training: ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated']
Training data shape: (3680, 12)
Testing data shape: (920, 12)

Model Evaluation:
Mean Squared Error (MSE): 986921767056.1176
R² Score: 0.032283856632784325

Feature Importance (Coefficients):
          Feature    Coefficient
5      waterfront  382459.666353
4          floors   69824.740108
6            view   44755.841775
1       bathrooms   36520.440676
7       condition   29335.539392
2     sqft_living     186.049845
8      sqft_above      96.860817
9   sqft_basement      89.189028
11   yr_renovated       8.259917
3        sqft_lot      -0.514414
10       yr_built   -2569.163533
0        bedrooms  -64497.461587


In [70]:
# Keep only rows priced below the 99th percentile to trim the most extreme prices
df = df[df['price'] < df['price'].quantile(0.99)]
df

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated
0,313000.000000,3.0,1.50,1340,7912,1.5,0,0,3,1340,0,1955,2005
2,342000.000000,3.0,2.00,1930,11947,1.0,0,0,4,1930,0,1966,0
3,420000.000000,3.0,2.25,2000,8030,1.0,0,0,4,1000,1000,1963,0
4,550000.000000,4.0,2.50,1940,10500,1.0,0,0,4,1140,800,1976,1992
5,490000.000000,2.0,1.00,880,6380,1.0,0,0,3,880,0,1938,1994
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,308166.666667,3.0,1.75,1510,6360,1.0,0,0,4,1510,0,1954,1979
4596,534333.333333,3.0,2.50,1460,7573,2.0,0,0,3,1460,0,1983,2009
4597,416904.166667,3.0,2.50,3010,7014,2.0,0,0,3,3010,0,2009,0
4598,203400.000000,4.0,2.00,2090,6630,1.0,0,0,3,1070,1020,1974,0


In [71]:
# ==========================================================
# 3. REMOVE UNNECESSARY COLUMNS
# ==========================================================
columns_to_drop = ["id", "date", "street", "city", "statezip", "country"]

df = df.drop(columns=columns_to_drop, errors="ignore")

print("After dropping unnecessary columns:", df.shape)

# ==========================================================
# 4. CLEAN MISSING VALUES
# ==========================================================
df = df.dropna()   # Drop missing rows
print("After dropping missing values:", df.shape)

# ==========================================================
# 5. SELECT FEATURES + TARGET
# ==========================================================
target = "price"
features = df.columns.drop(target)

X = df[features]
y = df[target]

print("Features used for training:", list(features))

# ==========================================================
# 6. TRAIN/TEST SPLIT
# ==========================================================
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)

# ==========================================================
# 7. TRAIN LINEAR REGRESSION MODEL
# ==========================================================
model = LinearRegression()
model.fit(X_train, y_train)

# ==========================================================
# 8. MAKE PREDICTIONS
# ==========================================================
y_pred = model.predict(X_test)

# ==========================================================
# 9. EVALUATE MODEL
# ==========================================================
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\nModel Evaluation:")
print("Mean Squared Error (MSE):", mse)
print("R² Score:", r2)

# ==========================================================
# 10. INTERPRET COEFFICIENTS
# ==========================================================
coef_df = pd.DataFrame({
    "Feature": features,
    "Coefficient": model.coef_
}).sort_values(by="Coefficient", ascending=False)

print("\nFeature Importance (Coefficients):")
print(coef_df)



After dropping unnecessary columns: (4554, 13)
After dropping missing values: (4554, 13)
Features used for training: ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated']
Training data shape: (3643, 12)
Testing data shape: (911, 12)

Model Evaluation:
Mean Squared Error (MSE): 42394379353.571365
R² Score: 0.49831986668016137

Feature Importance (Coefficients):
          Feature   Coefficient
4          floors  74922.578918
6            view  53224.159821
1       bathrooms  42989.384844
7       condition  29866.279193
2     sqft_living    132.369576
8      sqft_above     76.434093
9   sqft_basement     55.935484
11   yr_renovated      4.273932
3        sqft_lot     -0.366040
10       yr_built  -2256.175344
0        bedrooms -34195.611399
5      waterfront -70936.619255
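
The MSE values printed by the two runs are in squared price units, which makes them hard to read. Taking the square root (RMSE) puts them back on the price scale. The sketch below only reuses the two MSE numbers printed above; the variable names are made up.

```python
import math

mse_before = 986921767056.1176   # first run, all 4600 rows
mse_after = 42394379353.571365   # second run, prices below the 99th percentile

# RMSE is the square root of MSE, expressed in the same units as price.
rmse_before = math.sqrt(mse_before)
rmse_after = math.sqrt(mse_after)

print(f"RMSE before filtering: {rmse_before:,.0f}")
print(f"RMSE after filtering:  {rmse_after:,.0f}")
```

This works out to roughly 993,000 before and 206,000 after. The two runs are scored on different test sets (the second one excludes the highest prices), so the drop reflects both a better fit and an easier test set.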


In [72]:
from scipy import stats
import numpy as np

# Flag prices more than 3 standard deviations from the mean
z_scores = np.abs(stats.zscore(df['price']))
outliers = df[z_scores > 3]
print("Number of outliers:", outliers.shape[0])


Number of outliers: 82


In [73]:
# Flag prices outside 1.5 * IQR from the first and third quartiles
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1

outliers = df[(df['price'] < Q1 - 1.5*IQR) | (df['price'] > Q3 + 1.5*IQR)]
print("Number of outliers in price:", outliers.shape[0])


Number of outliers in price: 204
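
The two rules above flag different numbers of rows (82 vs. 204). A likely reason is that very large prices inflate the standard deviation, which makes the z-score rule more lenient on right-skewed data, while quartiles are barely affected by extreme values. The sketch below shows the effect on made-up prices; the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed prices: 19 evenly spaced values plus two large ones.
prices = pd.Series(list(range(100, 1001, 50)) + [2000, 10000], dtype=float)

# Rule 1: |z| > 3, using the population standard deviation (ddof=0),
# which is also the default of scipy.stats.zscore.
z = (prices - prices.mean()) / prices.std(ddof=0)
z_outliers = prices[z.abs() > 3]

# Rule 2: outside 1.5 * IQR from the first and third quartiles.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
iqr_outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]

print("z-score rule:", z_outliers.tolist())
print("IQR rule:    ", iqr_outliers.tolist())
```

Here the single value 10000 pulls the standard deviation up so far that 2000 no longer looks unusual to the z-score rule, while the IQR rule still flags both.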
