In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [4]:
# Load dataset
df = pd.read_csv("/content/Housing.csv")

# Display first few rows
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7229300521,20141013T000000,231300.0,2,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [10]:
# Step 3: Check missing values
df.isnull().sum()


Unnamed: 0,0
id,0
date,0
price,0
bedrooms,0
bathrooms,0
sqft_living,0
sqft_lot,0
floors,0
waterfront,0
view,0


In [12]:
# Step 4: Draw missing values bar chart
missing = df.isnull().sum()
missing = missing[missing > 0]

if not missing.empty:
    missing.plot(kind="bar")
    plt.title("Missing Values Before Cleaning")
    plt.show()
else:
    print("No missing values found")


No missing values found


In [13]:
# Step 5: Separate column types
num_cols = df.select_dtypes(include="number").columns  # all integer and float dtypes
cat_cols = df.select_dtypes(include=["object"]).columns  # text columns


In [14]:
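# Step 6: Fill missing numbers using median
for col in num_cols:
    if df[col].isnull().any():
        # Assign the result back: a chained fillna(inplace=True) can operate
        # on a temporary copy and leave df unchanged in recent pandas versions
        df[col] = df[col].fillna(df[col].median())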


In [15]:
# Step 7: Fill missing text values using mode
for col in cat_cols:
    if df[col].isnull().any():
        # mode() returns a Series; [0] takes the most frequent value
        df[col] = df[col].fillna(df[col].mode()[0])


In [16]:
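# Step 8: Remove columns with more than 40% missing values
# (runs after imputation, so it only affects columns that Steps 6 and 7 left with gaps)
max_missing_fraction = 0.4
df = df.loc[:, df.isnull().mean() <= max_missing_fraction]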


In [17]:
# Step 9: Final check
df.isnull().sum()


Unnamed: 0,0
id,0
date,0
price,0
bedrooms,0
bathrooms,0
sqft_living,0
sqft_lot,0
floors,0
waterfront,0
view,0


In [18]:
# Step 10: Compare size
print("Rows & Columns:", df.shape)


Rows & Columns: (21613, 21)


DATASET ANALYSIS REPORT
HOUSING DATASET

1. INTRODUCTION
The objective of this analysis is to examine the Housing dataset,
identify missing values, apply appropriate data cleaning techniques,
and validate the dataset after preprocessing.

2. DATASET OVERVIEW
Dataset Name: Housing.csv
Total Rows: 21,613
Total Columns: 21
The dataset contains both numerical and categorical features related
to house prices and property characteristics.

3. MISSING VALUE ANALYSIS
Missing values were identified using the isnull().sum() method.
After analysis, it was observed that there are no missing values
present in any column of the dataset.
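
The count used in this section can be extended to report percentages as well.
The following is a minimal sketch on a small made-up DataFrame; the helper
name missing_summary and the toy data are illustrative assumptions and are
not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = 100 * counts / len(df)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Show the columns with the most missing values first
    return summary.sort_values("missing", ascending=False)


# Toy data (hypothetical), not the Housing dataset
toy = pd.DataFrame({
    "price": [231300.0, np.nan, 180000.0, 604000.0],
    "bedrooms": [2, 3, 2, 4],
})
summary = missing_summary(toy)
print(summary)
```

On the Housing dataset, every row of such a summary would show 0 missing.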

4. MISSING VALUE VISUALIZATION
A bar chart of missing values per column was implemented using matplotlib.
Since the dataset does not contain missing values, the chart was skipped
and the message "No missing values found" was printed instead. The
visualization code is retained for use on datasets with missing values.
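
To illustrate what the chart would show when gaps exist, here is a small
sketch on made-up data. The toy DataFrame is hypothetical, and the Agg
backend line is only an assumption so the sketch runs without a display;
in a notebook it can be removed and plt.show() called instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data (hypothetical), not the Housing dataset
toy = pd.DataFrame({
    "price": [231300.0, np.nan, 180000.0, np.nan],
    "view": [0, np.nan, 0, 0],
    "bedrooms": [2, 3, 2, 4],
})

missing = toy.isnull().sum()
missing = missing[missing > 0]  # keep only columns that have gaps

# One bar per column that contains missing values
fig, ax = plt.subplots()
ax.bar(missing.index, missing.values)
ax.set_title("Missing Values Before Cleaning")
ax.set_ylabel("Missing count")
```

Here only price and view get a bar, because bedrooms has no missing values.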

5. DATA CLEANING METHODS
Although no missing values were found, standard data cleaning
steps were implemented:
- Numerical columns: missing values are filled with the column median.
- Categorical columns: missing values are filled with the column mode.
- Columns with more than 40% missing values are removed.
None of these steps changed the data, because no column had missing values.
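
Because the Housing data never exercises these steps, the sketch below runs
them on a small made-up DataFrame. The toy columns and values are
hypothetical. Note one deliberate difference from the notebook: the sketch
applies the 40% column filter before imputing, so that sparse columns are
dropped rather than filled.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), not the Housing dataset
toy = pd.DataFrame({
    "price": [100.0, np.nan, 300.0, 400.0, 500.0],             # 20% missing
    "city": ["Seattle", None, "Seattle", "Kent", "Seattle"],  # 20% missing
    "notes": [None, None, None, "ok", None],                  # 80% missing
})

# 1. Keep only columns whose missing fraction is at most 40%
cleaned = toy.loc[:, toy.isnull().mean() <= 0.4]

# 2. Median for numeric columns, mode for the remaining (text) columns
num_cols = cleaned.select_dtypes(include="number").columns
cat_cols = cleaned.select_dtypes(exclude="number").columns
fill_values = {col: cleaned[col].median() for col in num_cols}
fill_values.update({col: cleaned[col].mode()[0] for col in cat_cols})
cleaned = cleaned.fillna(fill_values)

print(cleaned)
```

The notes column is dropped, the missing price becomes the median of the
remaining prices, and the missing city becomes the most frequent city.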

6. DATA VALIDATION
After preprocessing, the dataset was rechecked with isnull().sum()
and its shape was printed. Both checks confirmed that no values are
missing and that no rows or columns were removed.

7. BEFORE AND AFTER COMPARISON
Rows before cleaning: 21,613
Rows after cleaning: 21,613
Columns before cleaning: 21
Columns after cleaning: 21
Missing values before cleaning: 0
Missing values after cleaning: 0
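
A comparison like the one above requires recording the dataset's state
before cleaning as well as after. The following is a minimal sketch of one
way to do that; the helper name snapshot and the toy data are illustrative
assumptions, not code from the notebook.

```python
import numpy as np
import pandas as pd


def snapshot(df: pd.DataFrame) -> dict:
    """Record the size and total missing-value count of a DataFrame."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing": int(df.isnull().sum().sum()),
    }


# Toy data (hypothetical), not the Housing dataset
toy = pd.DataFrame({"price": [100.0, np.nan, 300.0], "bedrooms": [2, 3, 2]})

before = snapshot(toy)  # state before cleaning
toy["price"] = toy["price"].fillna(toy["price"].median())
after = snapshot(toy)   # state after cleaning

# Side-by-side table with one column per snapshot
comparison = pd.DataFrame({"before": before, "after": after})
print(comparison)
```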

8. CONCLUSION
The Housing dataset was found to contain no missing values, and the
preprocessing workflow left it unchanged. With respect to missing data,
the dataset is ready for further analysis and machine learning tasks.
