
# Task
Build a machine learning model to predict house prices, including data loading, preprocessing, model training, evaluation, and a summary of its performance with potential improvements.

## Load Data

### Subtask:
Load the dataset for house price prediction into a pandas DataFrame. We'll need to identify the features (independent variables) and the target variable (house prices).


**Reasoning**:
To load the dataset, first, I need to import the pandas library, then read the 'house_prices.csv' file into a DataFrame, and finally display its first few rows and information to understand its structure.



In [1]:
import pandas as pd

df = pd.read_csv('house_prices.csv')

print('First 5 rows of the DataFrame:')
print(df.head())

print('\nDataFrame Info:')
df.info()

FileNotFoundError: [Errno 2] No such file or directory: 'house_prices.csv'
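A missing file like the one above can be caught before `pd.read_csv` is called. The sketch below checks a list of candidate paths and reports which one exists. The helper name `first_existing` is hypothetical and not part of pandas, and the candidate list simply reuses the two paths that appear in this notebook.

```python
from pathlib import Path


def first_existing(candidates):
    """Return the first candidate path that is an existing file, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


# Candidate locations (assumption: these are the only two places worth checking).
candidates = [
    "house_prices.csv",
    "/content/sample_data/california_housing_train.csv",
]

csv_path = first_existing(candidates)
if csv_path is None:
    print("No candidate CSV found; check the working directory.")
else:
    print(f"Found dataset at: {csv_path}")
```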

**Reasoning**:
The previous attempt failed because 'house_prices.csv' does not exist in the working directory. I will instead load 'california_housing_train.csv' from Colab's sample data directory, which is a suitable dataset for house price prediction.



In [2]:
import pandas as pd

df = pd.read_csv('/content/sample_data/california_housing_train.csv')

print('First 5 rows of the DataFrame:')
print(df.head())

print('\nDataFrame Info:')
df.info()

First 5 rows of the DataFrame:
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -114.31     34.19                15.0       5612.0          1283.0   
1    -114.47     34.40                19.0       7650.0          1901.0   
2    -114.56     33.69                17.0        720.0           174.0   
3    -114.57     33.64                14.0       1501.0           337.0   
4    -114.57     33.57                20.0       1454.0           326.0   

   population  households  median_income  median_house_value  
0      1015.0       472.0         1.4936             66900.0  
1      1129.0       463.0         1.8200             80100.0  
2       333.0       117.0         1.6509             85700.0  
3       515.0       226.0         3.1917             73400.0  
4       624.0       262.0         1.9250             65500.0  

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column    

**Reasoning**:
The dataset has been successfully loaded and its structure inspected. Now, I need to explicitly identify the target variable ('median_house_value') and the features (all other columns) as required by the subtask, and then conclude the subtask.



In [3]:
target_variable = 'median_house_value'
features = [col for col in df.columns if col != target_variable]

print(f"Target Variable: {target_variable}")
print(f"Features: {features}")

Target Variable: median_house_value
Features: ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']


## Preprocess Data

### Subtask:
Clean and preprocess the loaded data. This may include handling missing values, encoding categorical features, and scaling numerical features to prepare it for model training.


**Reasoning**:
The first instruction is to check for missing values in the DataFrame `df`. I will use `df.isnull().sum()` to achieve this.



In [4]:
print('Missing values in each column:')
print(df.isnull().sum())

Missing values in each column:
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
dtype: int64
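This dataset has no missing values, so no imputation is needed here. For reference, the sketch below shows one common way to handle gaps if they did appear, using scikit-learn's `SimpleImputer` with a median strategy. The toy frame is made up for illustration (it borrows two column names from the dataset, and the `NaN` is inserted on purpose); the median strategy is an assumption, not a choice made in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one deliberately missing value (not from the real dataset).
toy = pd.DataFrame({
    "total_bedrooms": [1283.0, np.nan, 174.0],
    "households": [472.0, 463.0, 117.0],
})
print(toy.isnull().sum())

# Replace each missing entry with the median of its column.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```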


**Reasoning**:
The previous step showed no missing values. Now, I need to separate the features (`X`, a DataFrame) from the target (`y`, a Series), import and instantiate `StandardScaler`, apply it to the features, and then display the scaled features to verify the transformation, as per the instructions.



In [5]:
X = df[features]
y = df[target_variable]

print("Created X (features) and y (target variable) DataFrames.")

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print('\nFirst 5 rows of scaled features (X_scaled):')
print(pd.DataFrame(X_scaled, columns=features).head())

Created X (features) and y (target variable) DataFrames.

First 5 rows of scaled features (X_scaled):
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0   2.619365 -0.671520           -1.079671     1.361695        1.764204   
1   2.539569 -0.573264           -0.761872     2.296608        3.230441   
2   2.494683 -0.905463           -0.920772    -0.882462       -0.866956   
3   2.489696 -0.928857           -1.159121    -0.524186       -0.480230   
4   2.489696 -0.961609           -0.682422    -0.545747       -0.506328   

   population  households  median_income  
0   -0.361184   -0.075998      -1.252543  
1   -0.261865   -0.099404      -1.081483  
2   -0.955354   -0.999252      -1.170105  
3   -0.796793   -0.715774      -0.362600  
4   -0.701830   -0.622148      -1.026454  
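To make the transformation concrete: `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The sketch below checks this on a small made-up array (the values are illustrative, not taken from the housing data) by comparing the scaler's output with the manual z-score.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with two columns on very different scales.
X_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 600.0]])

scaler_toy = StandardScaler()
X_toy_scaled = scaler_toy.fit_transform(X_toy)

# Manual z-score; np.std defaults to ddof=0, matching StandardScaler.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print("Matches manual z-score:", np.allclose(X_toy_scaled, manual))
print("Column means after scaling:", X_toy_scaled.mean(axis=0).round(6))
print("Column stds after scaling:", X_toy_scaled.std(axis=0).round(6))
```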


## Train Model

### Subtask:
Select a suitable machine learning model (e.g., Linear Regression, Random Forest Regressor) and train it using the preprocessed data. We will split the data into training and testing sets.


**Reasoning**:
To prepare for model training, I will first split the preprocessed data into training and testing sets using `train_test_split`. Then, I will import and instantiate a `LinearRegression` model, and finally train it on the training data, as per the instructions.



In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

print("Data split into training and testing sets successfully.")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

# Instantiate and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

print("\nLinear Regression model instantiated and trained successfully.")

Data split into training and testing sets successfully.
X_train shape: (13600, 8)
X_test shape: (3400, 8)
y_train shape: (13600,)
y_test shape: (3400,)

Linear Regression model instantiated and trained successfully.
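In the cells above, the scaler was fit on all 17,000 rows before the split, so the test rows contributed to the scaling statistics. One common way to avoid this is to split first and wrap the scaler and model in a scikit-learn pipeline, so the scaler only sees training rows. The sketch below illustrates the pattern on synthetic data from `make_regression`; the variable names and the synthetic dataset are assumptions for illustration and do not reproduce the notebook's results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 housing features (not the real data).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=10.0, random_state=42
)

# Split first, so test rows never influence the scaler statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_tr only, then trains the regressor.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# At prediction time the stored training mean/std are reused on X_te.
print(f"Test R2 on synthetic data: {pipe.score(X_te, y_te):.3f}")
```

With the housing data, the same pattern would mean passing the unscaled `X` to `train_test_split` and fitting the pipeline on the training portion.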


## Evaluate Model

### Subtask:
Evaluate the performance of the trained model using appropriate metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), or R-squared to assess its prediction accuracy.


**Reasoning**:
To evaluate the trained model, I will first make predictions on the test set, then import the required evaluation metrics from `sklearn.metrics`, calculate MAE, MSE, and R-squared, and finally print these metrics.



In [7]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model Evaluation Metrics:")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R2) Score: {r2:.2f}")

Model Evaluation Metrics:
Mean Absolute Error (MAE): 49983.47
Mean Squared Error (MSE): 4634658406.22
R-squared (R2) Score: 0.66
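The MSE above is expressed in squared dollars, which makes it hard to compare with the MAE. Taking its square root gives the root mean squared error (RMSE), which is back in dollars. The short sketch below derives it from the MSE printed above; `mse_reported` is just that printed value copied by hand.

```python
import math

# MSE printed by the evaluation cell above (units: dollars squared).
mse_reported = 4634658406.22

# RMSE is the square root of MSE, so its units are dollars again.
rmse = math.sqrt(mse_reported)
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
```

The RMSE comes out at roughly \$68,000, noticeably larger than the MAE, which suggests that some predictions miss by considerably more than the typical error.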


## Final Task

### Subtask:
Provide a summary of the model's performance and discuss potential improvements or next steps for further development.


## Summary:

### Q&A

**1. What is the model's performance?**
The Linear Regression model achieved a Mean Absolute Error (MAE) of 49,983.47, a Mean Squared Error (MSE) of 4,634,658,406.22, and an R-squared (R2) score of 0.66 on the held-out test set. On average, its predictions are off by approximately \$49,983 from the actual median house values, and it explains about 66% of the variance in the test-set house values.

**2. What are potential improvements or next steps for further development?**
*   **Feature Engineering:** Explore creating new features from existing ones (e.g., bedrooms per room, population per household) to capture more complex relationships.
*   **Model Selection & Hyperparameter Tuning:** Experiment with more advanced regression models (e.g., RandomForestRegressor, GradientBoostingRegressor, XGBoost) and tune their hyperparameters to potentially improve prediction accuracy.
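The feature engineering idea above can be sketched with plain pandas arithmetic. The example uses the first two rows of the relevant columns as shown in the `df.head()` output earlier; the new column names `bedrooms_per_room` and `population_per_household` are illustrative choices, not columns that exist in the dataset.

```python
import pandas as pd

# First two rows of the relevant columns, copied from the head() output.
sample = pd.DataFrame({
    "total_rooms": [5612.0, 7650.0],
    "total_bedrooms": [1283.0, 1901.0],
    "population": [1015.0, 1129.0],
    "households": [472.0, 463.0],
})

# Ratio features: these describe the composition of a district,
# not just its size.
sample["bedrooms_per_room"] = sample["total_bedrooms"] / sample["total_rooms"]
sample["population_per_household"] = sample["population"] / sample["households"]

print(sample[["bedrooms_per_room", "population_per_household"]].round(3))
```

Whether such ratios actually improve the model would need to be checked by retraining and comparing test-set metrics.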

### Data Analysis Key Findings

*   The dataset `california_housing_train.csv` containing 17,000 entries and 9 columns was successfully loaded.
*   The target variable was identified as `median_house_value`, and the remaining 8 columns were used as features.
*   No missing values were found in any of the columns, simplifying the preprocessing step.
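*   Features were standardized with `StandardScaler` (zero mean, unit variance per column). For ordinary least squares this does not change the predictions, but it puts the coefficients on a comparable scale. The scaler was fit on the full dataset before the train/test split, so test-set statistics influenced the scaling.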
*   The data was split into training and testing sets with an 80/20 ratio, resulting in 13,600 samples for training and 3,400 samples for testing.
*   A Linear Regression model was chosen and successfully trained on the scaled training data.
*   The model achieved a Mean Absolute Error (MAE) of \$49,983.47, a Mean Squared Error (MSE) of 4,634,658,406.22 (in squared dollars), and an R-squared (R2) score of 0.66 on the test set.

### Insights or Next Steps

*   The current Linear Regression model provides a reasonable baseline for predicting house prices, explaining 66% of the variance.
*   To improve model performance, consider exploring non-linear models or ensemble methods, as they might capture more complex relationships within the data that linear regression cannot.
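To illustrate why a non-linear model can help, the sketch below compares `LinearRegression` with `RandomForestRegressor` on a synthetic target that depends non-linearly on two features. The data, the target formula, and the hyperparameters are all made up for illustration; the scores say nothing about how either model would perform on the housing data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a deliberately non-linear relationship.
rng = np.random.default_rng(0)
X_nl = rng.uniform(-3, 3, size=(1000, 2))
y_nl = np.sin(X_nl[:, 0]) * X_nl[:, 1] ** 2 + rng.normal(0, 0.1, size=1000)

X_nl_tr, X_nl_te, y_nl_tr, y_nl_te = train_test_split(
    X_nl, y_nl, test_size=0.2, random_state=42
)

# Fit both models on the same split and compare test-set R2.
candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(
        n_estimators=100, random_state=0
    ),
}
scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_nl_tr, y_nl_tr)
    scores[name] = estimator.score(X_nl_te, y_nl_te)
    print(f"{name}: test R2 = {scores[name]:.3f}")
```

On the housing data, the same loop could be run over the notebook's `X_train` and `X_test` to see whether the gain carries over.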
