![Prodigy_ML/image.png](image.png)

---
# House Price Prediction using Linear Regression

This notebook implements a linear regression model to predict house prices based on their square footage, number of bedrooms, and number of bathrooms.

---

---
# Data Fields

Here's a brief version of what you'll find in the data description file.

- **SalePrice**: The property's sale price in dollars. This is the target variable that you're trying to predict.
- **MSSubClass**: The building class.
- **MSZoning**: The general zoning classification.
- **LotFrontage**: Linear feet of street connected to property.
- **LotArea**: Lot size in square feet.
- **Street**: Type of road access.
- **Alley**: Type of alley access.
- **LotShape**: General shape of property.
- **LandContour**: Flatness of the property.
- **Utilities**: Type of utilities available.
- **LotConfig**: Lot configuration.
- **LandSlope**: Slope of property.
- **Neighborhood**: Physical locations within Ames city limits.
- **Condition1**: Proximity to main road or railroad.
- **Condition2**: Proximity to main road or railroad (if a second is present).
- **BldgType**: Type of dwelling.
- **HouseStyle**: Style of dwelling.
- **OverallQual**: Overall material and finish quality.
- **OverallCond**: Overall condition rating.
- **YearBuilt**: Original construction date.
- **YearRemodAdd**: Remodel date.
- **RoofStyle**: Type of roof.
- **RoofMatl**: Roof material.
- **Exterior1st**: Exterior covering on house.
- **Exterior2nd**: Exterior covering on house (if more than one material).
- **MasVnrType**: Masonry veneer type.
- **MasVnrArea**: Masonry veneer area in square feet.
- **ExterQual**: Exterior material quality.
- **ExterCond**: Present condition of the material on the exterior.
- **Foundation**: Type of foundation.
- **BsmtQual**: Height of the basement.
- **BsmtCond**: General condition of the basement.
- **BsmtExposure**: Walkout or garden level basement walls.
- **BsmtFinType1**: Quality of basement finished area.
- **BsmtFinSF1**: Type 1 finished square feet.
- **BsmtFinType2**: Quality of second finished area (if present).
- **BsmtFinSF2**: Type 2 finished square feet.
- **BsmtUnfSF**: Unfinished square feet of basement area.
- **TotalBsmtSF**: Total square feet of basement area.
- **Heating**: Type of heating.
- **HeatingQC**: Heating quality and condition.
- **CentralAir**: Central air conditioning.
- **Electrical**: Electrical system.
- **1stFlrSF**: First floor square feet.
- **2ndFlrSF**: Second floor square feet.
- **LowQualFinSF**: Low quality finished square feet (all floors).
- **GrLivArea**: Above grade (ground) living area square feet.
- **BsmtFullBath**: Basement full bathrooms.
- **BsmtHalfBath**: Basement half bathrooms.
- **FullBath**: Full bathrooms above grade.
- **HalfBath**: Half baths above grade.
- **Bedroom**: Number of bedrooms above basement level.
- **Kitchen**: Number of kitchens.
- **KitchenQual**: Kitchen quality.
- **TotRmsAbvGrd**: Total rooms above grade (does not include bathrooms).
- **Functional**: Home functionality rating.
- **Fireplaces**: Number of fireplaces.
- **FireplaceQu**: Fireplace quality.
- **GarageType**: Garage location.
- **GarageYrBlt**: Year garage was built.
- **GarageFinish**: Interior finish of the garage.
- **GarageCars**: Size of garage in car capacity.
- **GarageArea**: Size of garage in square feet.
- **GarageQual**: Garage quality.
- **GarageCond**: Garage condition.
- **PavedDrive**: Paved driveway.
- **WoodDeckSF**: Wood deck area in square feet.
- **OpenPorchSF**: Open porch area in square feet.
- **EnclosedPorch**: Enclosed porch area in square feet.
- **3SsnPorch**: Three season porch area in square feet.
- **ScreenPorch**: Screen porch area in square feet.
- **PoolArea**: Pool area in square feet.
- **PoolQC**: Pool quality.
- **Fence**: Fence quality.
- **MiscFeature**: Miscellaneous feature not covered in other categories.
- **MiscVal**: Dollar value of miscellaneous feature.
- **MoSold**: Month sold.
- **YrSold**: Year sold.
- **SaleType**: Type of sale.
- **SaleCondition**: Condition of sale.

---

---
# Data Processing and Model Training
---

---

## Overview

The `data_description.txt` file is not a tabular dataset: it documents each field and mixes descriptive header lines with tab-separated code/description pairs. This README outlines the steps taken to filter and process that file and describes how the model was trained and tested.

## Data Filtering and Processing

To filter out the non-data lines and process the data, we followed these steps:

1. **Initialize Lists to Hold Data**: We started by creating lists to store the extracted data.
2. **Process the Lines to Extract Data**: We read through each line of the `data_description.txt` file to identify and extract the relevant data.
3. **Skip the Header Line for `MSSubClass`**: We skipped the header line to focus on the actual data.
4. **Split the Line by Tab and Store the Data**: Each line was split by tabs, and the resulting data was stored in the initialized lists.
5. **Convert the Extracted Data to a DataFrame**: Finally, the extracted pairs were converted into a DataFrame (named `ms_subclass_df` in the code) for further processing.

The result is a table of 320 code/description pairs. Because every tab-separated line in the file is kept, the table covers the codes of all fields, not only `MSSubClass`.

## Additional Data Requirements

Despite the filtering and processing, the dataset still did not include the features needed for our analysis: square footage and the number of bedrooms and bathrooms. To address this, we generated a synthetic dataset of 10,000 samples with NumPy's random number functions, calling `np.random.seed(42)` first so that the samples are reproducible.

## Model Training and Testing

We trained and tested the model on the synthetic data, using an 80/20 train/test split. To use more or fewer samples, change the `n_samples` variable.

## Usage

To reproduce the training and testing:

1. Run the parsing cell to build `ms_subclass_df` from `data_description.txt` (optional; the model does not use it).
2. Generate the synthetic dataset, which provides the square footage and the number of bedrooms and bathrooms.
3. Keep the `np.random.seed(42)` call if you want the same samples on every run.
4. Train and test the model on the generated data.

To change the dataset size, set the `n_samples` variable to the desired number of samples.
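
As a sketch of how the sample count could be parameterized, the generation step can be wrapped in a function. `make_synthetic_houses` is a hypothetical helper, not part of the notebook. It uses the same feature ranges and price formula as the notebook's generation cell, but it draws from `np.random.default_rng`, so its values will not match the notebook's output row for row.

```python
import numpy as np
import pandas as pd

def make_synthetic_houses(n_samples=10000, seed=42):
    """Generate a synthetic housing table (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Upper bounds are exclusive, as with np.random.randint.
    square_footage = rng.integers(1000, 3500, size=n_samples)
    bedrooms = rng.integers(1, 6, size=n_samples)
    bathrooms = rng.integers(1, 4, size=n_samples)
    noise = rng.normal(0, 50000, size=n_samples)
    price = square_footage * 200 + bedrooms * 15000 + bathrooms * 10000 + noise
    return pd.DataFrame({
        'SquareFootage': square_footage,
        'Bedrooms': bedrooms,
        'Bathrooms': bathrooms,
        'Price': price,
    })

small_df = make_synthetic_houses(n_samples=500)
print(small_df.shape)
```

Passing a different `seed` gives a different but still reproducible sample.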

---

---
# Task-01 
## Implement a linear regression model to predict the prices of houses based on their square footage and the number of bedrooms and bathrooms.
---

## Import necessary Libraries

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

## Read the file and filter out non-data lines

In [6]:
# Read the file and filter out non-data lines
with open('data_description.txt', 'r') as file:
    lines = file.readlines()

# Initialize lists to hold the data
ms_subclass_data = []

# Process the lines to extract data
for line in lines:
    if 'MSSubClass' in line:
        # Skip the header line for MSSubClass
        continue
    if '\t' in line:
        # Split the line by tab and store the data
        fields = line.strip().split('\t')
        if len(fields) == 2:  # Ensure correct format
            ms_subclass_data.append(fields)

# Convert MSSubClass data to DataFrame
ms_subclass_df = pd.DataFrame(ms_subclass_data, columns=['Code', 'Description'])

print("MSSubClass Data:")
print(ms_subclass_df)


MSSubClass Data:
        Code                                        Description
0         20                    1-STORY 1946 & NEWER ALL STYLES
1         30                               1-STORY 1945 & OLDER
2         40                  1-STORY W/FINISHED ATTIC ALL AGES
3         45                  1-1/2 STORY - UNFINISHED ALL AGES
4         50                      1-1/2 STORY FINISHED ALL AGES
..       ...                                                ...
315  Abnorml    Abnormal Sale -  trade, foreclosure, short sale
316  AdjLand                            Adjoining Land Purchase
317   Alloca  Allocation - two linked properties with separa...
318   Family                        Sale between family members
319  Partial  Home was not completed when last assessed (ass...

[320 rows x 2 columns]
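
The frame above holds every tab-separated code/description pair in the file (note the sale-condition codes in the last rows), not only the `MSSubClass` codes. One way to keep the fields apart is to remember the most recent field header while scanning. The sketch below assumes that header lines start at column 0 and look like `Name: description`, and that value lines are indented and tab-separated. `parse_description` and the inline `sample` text are hypothetical and only illustrate the idea.

```python
import re

# Header lines are assumed to start at column 0 and look like "Name: text".
HEADER_RE = re.compile(r'^(\w+):')

def parse_description(lines):
    """Return (field, code, description) tuples from description-style lines."""
    rows = []
    current_field = None
    for line in lines:
        header = HEADER_RE.match(line)
        if header:
            # Remember which field the following value lines belong to.
            current_field = header.group(1)
            continue
        if '\t' in line and current_field is not None:
            parts = [part.strip() for part in line.strip().split('\t')]
            if len(parts) == 2:  # keep only well-formed code/description pairs
                rows.append((current_field, parts[0], parts[1]))
    return rows

# Hypothetical excerpt in the assumed layout, not the real file contents.
sample = [
    "MSSubClass: Identifies the type of dwelling involved in the sale.\n",
    "\n",
    "        20\t1-STORY 1946 & NEWER ALL STYLES\n",
    "        30\t1-STORY 1945 & OLDER\n",
    "\n",
    "Street: Type of road access to property\n",
    "\n",
    "       Grvl\tGravel\n",
    "       Pave\tPaved\n",
]

rows = parse_description(sample)
for row in rows:
    print(row)
```

The tuples can be passed to `pd.DataFrame(rows, columns=['Field', 'Code', 'Description'])` and then filtered by `Field`.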


## Save the DataFrame to a File
##### To CSV:

In [9]:
ms_subclass_df.to_csv('ms_subclass_data.csv', index=False)

## Load the Dataset

In [12]:
ms_subclass_df = pd.read_csv('ms_subclass_data.csv')

## Perform Data Analysis

### Basic Information:

In [16]:
print(ms_subclass_df.info())
print(ms_subclass_df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320 entries, 0 to 319
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Code         306 non-null    object
 1   Description  318 non-null    object
dtypes: object(2)
memory usage: 5.1+ KB
None
       Code Description
count   306         318
unique  206         223
top      Gd   Excellent
freq     11          10


### Check for Missing Values:

In [19]:
print(ms_subclass_df.isnull().sum())

Code           14
Description     2
dtype: int64
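
The description file uses the literal code `NA` for several fields, and `pd.read_csv` treats the string `NA` as missing by default. That is a likely source of the null `Code` entries above, although it has not been verified against the file. The sketch below uses a tiny in-memory CSV with illustrative values, not the real data, to show how `keep_default_na=False` preserves such codes.

```python
import io
import pandas as pd

# Tiny stand-in for ms_subclass_data.csv (illustrative values only).
csv_text = "Code,Description\nGrvl,Gravel\nNA,No alley access\n"

# Default behaviour: the string "NA" is converted to NaN.
default_df = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False the string "NA" is kept as-is.
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df['Code'].isnull().sum())
print(literal_df['Code'].isnull().sum())
print(literal_df['Code'].tolist())
```

Note that `keep_default_na=False` also keeps empty cells as empty strings, so genuinely missing values would then need separate handling.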


### View Unique Values in a Column:

In [22]:
print(ms_subclass_df['Code'].unique())
print(ms_subclass_df['Description'].unique())

['20' '30' '40' '45' '50' '60' '70' '75' '80' '85' '90' '120' '150' '160'
 '180' '190' 'A' 'C' 'FV' 'I' 'RH' 'RL' 'RP' 'RM' 'Grvl' 'Pave' 'NA '
 'Reg' 'IR1' 'IR2' 'IR3' 'Lvl' 'Bnk' 'HLS' 'Low' 'AllPub' 'NoSewr'
 'NoSeWa' 'ELO' 'Inside' 'Corner' 'CulDSac' 'FR2' 'FR3' 'Gtl' 'Mod' 'Sev'
 'Blmngtn' 'Blueste' 'BrDale' 'BrkSide' 'ClearCr' 'CollgCr' 'Crawfor'
 'Edwards' 'Gilbert' 'IDOTRR' 'MeadowV' 'Mitchel' 'Names' 'NoRidge'
 'NPkVill' 'NridgHt' 'NWAmes' 'OldTown' 'SWISU' 'Sawyer' 'SawyerW'
 'Somerst' 'StoneBr' 'Timber' 'Veenker' 'Artery' 'Feedr' 'Norm' 'RRNn'
 'RRAn' 'PosN' 'PosA' 'RRNe' 'RRAe' '1Fam' '2FmCon' 'Duplx' 'TwnhsE'
 'TwnhsI' '1Story' '1.5Fin' '1.5Unf' '2Story' '2.5Fin' '2.5Unf' 'SFoyer'
 'SLvl' '10' '9' '8' '7' '6' '5' '4' '3' '2' '1' 'Flat' 'Gable' 'Gambrel'
 'Hip' 'Mansard' 'Shed' 'ClyTile' 'CompShg' 'Membran' 'Metal' 'Roll'
 'Tar&Grv' 'WdShake' 'WdShngl' 'AsbShng' 'AsphShn' 'BrkComm' 'BrkFace'
 'CBlock' 'CemntBd' 'HdBoard' 'ImStucc' 'MetalSd' 'Other' 'Plywood'
 'PreCast' 'Sto

## Create a sample dataset

In [25]:
# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
n_samples = 10000

# Randomly generate features
square_footage = np.random.randint(1000, 3500, size=n_samples)
bedrooms = np.random.randint(1, 6, size=n_samples)
bathrooms = np.random.randint(1, 4, size=n_samples)

# Generate target variable (Price) with some added noise
price = (square_footage * 200) + (bedrooms * 15000) + (bathrooms * 10000) + np.random.normal(0, 50000, size=n_samples)

# Create DataFrame
df = pd.DataFrame({
    'SquareFootage': square_footage,
    'Bedrooms': bedrooms,
    'Bathrooms': bathrooms,
    'Price': price
})

# Display the first few rows
print(df.head())

   SquareFootage  Bedrooms  Bathrooms          Price
0           1860         5          3  514566.832646
1           2294         3          1  556496.939908
2           2130         1          2  448934.590278
3           2095         2          1  432408.207895
4           2638         3          3  541907.790291


## Preprocess the Data

In [28]:
# Check for missing values
print(df.isnull().sum())

# The synthetic dataset should have no missing values, but it is good practice to check.
# If needed, handle missing values:
# df = df.dropna()  # or df = df.ffill()

# Display data types
print(df.dtypes)

SquareFootage    0
Bedrooms         0
Bathrooms        0
Price            0
dtype: int64
SquareFootage      int32
Bedrooms           int32
Bathrooms          int32
Price            float64
dtype: object


## Split the Data

In [31]:
# Define features and target
X = df[['SquareFootage', 'Bedrooms', 'Bathrooms']]
y = df['Price']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f'Training set size: {X_train.shape[0]} rows')
print(f'Testing set size: {X_test.shape[0]} rows')

Training set size: 8000 rows
Testing set size: 2000 rows


## Train the Model

In [34]:
# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)
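
Because the prices were generated from a known formula, the fitted coefficients can be checked against it. The self-contained sketch below regenerates data with the same formula and prints `coef_` and `intercept_`. It uses its own random generator, which is an assumption of this example, so the numbers differ slightly from the notebook's. The estimates should land near 200, 15000, and 10000, with the bedroom and bathroom coefficients noisier than the square-footage one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regenerate data with the notebook's formula; the generator and seed here
# are assumptions, so values differ from the notebook's np.random.seed(42) run.
rng = np.random.default_rng(0)
n_samples = 10000
square_footage = rng.integers(1000, 3500, size=n_samples)
bedrooms = rng.integers(1, 6, size=n_samples)
bathrooms = rng.integers(1, 4, size=n_samples)
price = (square_footage * 200 + bedrooms * 15000 + bathrooms * 10000
         + rng.normal(0, 50000, size=n_samples))

# Fit on all samples and compare the estimates with the generating formula.
X_check = np.column_stack([square_footage, bedrooms, bathrooms])
check_model = LinearRegression().fit(X_check, price)

for name, coef in zip(['SquareFootage', 'Bedrooms', 'Bathrooms'], check_model.coef_):
    print(f'{name}: {coef:.1f}')
print(f'Intercept: {check_model.intercept_:.1f}')
```

In the notebook itself, `model.coef_` and `model.intercept_` give the same information for the trained model.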

## Evaluate the Model

In [37]:
# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

Mean Squared Error: 2391436586.4481106
R^2 Score: 0.898803770698618


## Make Predictions

In [40]:
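# Example new data for prediction
new_data = pd.DataFrame({
    'SquareFootage': [2000, 1500],
    'Bedrooms': [3, 2],
    'Bathrooms': [2, 1]
})

# Predict prices for the new houses
new_predictions = model.predict(new_data)
print(new_predictions)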


#### Predict prices for all houses in the dataset

In [43]:
# Predict prices for all houses in the dataset
df['PredictedPrice'] = model.predict(X)

#### Display the results

In [46]:
# Display the results
print("\nPredicted Prices for All Houses:")
print(df[['SquareFootage', 'Bedrooms', 'Bathrooms', 'PredictedPrice']])


Predicted Prices for All Houses:
      SquareFootage  Bedrooms  Bathrooms  PredictedPrice
0              1860         5          3   477467.478379
1              2294         3          1   515222.035205
2              2130         1          2   460986.447853
3              2095         2          1   460241.655142
4              2638         3          3   601951.963712
...             ...       ...        ...             ...
9995           3345         3          1   724747.505422
9996           2340         4          1   539700.610373
9997           1569         4          2   395070.790217
9998           2973         2          2   644353.509951
9999           1545         1          1   335286.546206

[10000 rows x 4 columns]


## Generate New Dataset 

In [49]:
# index=False avoids writing the row index as an unnamed first column
df.to_csv('New_Dataset.csv', index=False)

In [51]:
df.sample(5)

      SquareFootage  Bedrooms  Bathrooms          Price  PredictedPrice
9983           3313         2          3  810170.856703   721210.652128
4235           1905         2          3  487732.973411   440514.303825
2680           2298         5          1  615182.899167   546635.663773
8837           1460         3          1   367901.25678   348957.294804
6637           1441         1          1  316616.134179   314553.293207


![Prodigy_ML/khush.png](khush.png)