<h1 id="Challenge:-Advanced-Data-Cleaning-with-the-Housing-Prices-Dataset"><strong>Mock Test 2: Advanced Data Cleaning with the Housing Prices Dataset</strong></h1>
<h2 id="Introduction"><strong>Introduction</strong></h2>
<p>To further solidify your understanding and enhance your data cleaning skills, we present you with a hands-on challenge. In this challenge, you will apply the techniques you've learned to a new dataset, encountering different scenarios and complexities.</p>
<h2 id="Objective"><strong>Objective</strong></h2>
<p>Apply your data cleaning and formatting skills to the <strong>House Prices</strong> dataset. This exercise will help you gain experience in handling real-world data imperfections and prepare you for more advanced data analysis tasks.</p>

---

<h2><strong>Part 1: Importing and Inspecting the Dataset</strong></h2>
<h3><strong>Step 1: Import Required Libraries</strong></h3>
<p>Begin by importing the necessary Python libraries for data manipulation and visualization.</p>

In [49]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer, KNNImputer


<h3><strong>Step 2: Load the Dataset</strong></h3>
<p>Load the House Prices dataset into a pandas DataFrame.</p>

In [50]:
# Load the House Prices dataset from a URL
url = "https://raw.githubusercontent.com/khiew-tzong-yong/hf3de_m7/refs/heads/main/Dataset_with_Missing_Data_and_Duplicates.csv"
df = pd.read_csv(url)

# Display the first few rows of the dataset
df.head(10)


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
5,-122.25,37.85,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
6,-122.25,37.84,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
7,-122.25,37.84,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0,NEAR BAY
8,-122.26,37.84,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0,NEAR BAY
9,-122.25,37.84,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0,NEAR BAY


<h3><strong>Step 3: Feature Selection</strong></h3>
<p>Select all features from the dataset except longitude and latitude.</p>

In [52]:
# Define a variable on selected features
selected_features = df[['housing_median_age', 'total_rooms', 'total_bedrooms',
                        'population', 'households', 'median_income',
                        'median_house_value', 'ocean_proximity']]
selected_features.head(10)

Unnamed: 0,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,52.0,1274.0,,558.0,219.0,5.6431,341300.0,NEAR BAY
4,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
5,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
6,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
7,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0,NEAR BAY
8,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0,NEAR BAY
9,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0,NEAR BAY
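
<p>An equivalent way to express "everything except longitude and latitude" is to drop those two columns instead of listing the ones to keep. The sketch below runs on a small hypothetical DataFrame (<code>toy</code>, an assumed stand-in with only four columns), not on the real dataset:</p>

```python
import pandas as pd

# Hypothetical stand-in for the housing data; only four columns for brevity
toy = pd.DataFrame({
    "longitude": [-122.23, -122.22],
    "latitude": [37.88, 37.86],
    "total_rooms": [880.0, 7099.0],
    "median_income": [8.3252, 8.3014],
})

# Drop the coordinate columns; the remaining columns keep their original order
toy_selected = toy.drop(columns=["longitude", "latitude"])
print(toy_selected.columns.tolist())
```

<p>Dropping is shorter when only a few columns are excluded, while an explicit keep-list makes the selected features easier to audit.</p>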


#### Step 3.2: Create the data frame by extracting the relevant features (including the target variable)

In [53]:
# Write your code here
# Create a new DataFrame with selected features
relevant_features = [
    'housing_median_age', 'total_rooms', 'total_bedrooms',
    'population', 'households', 'median_income',
    'median_house_value',  
    'ocean_proximity'      
]

df_selected = df[relevant_features]
df_selected.head()

Unnamed: 0,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,52.0,1274.0,,558.0,219.0,5.6431,341300.0,NEAR BAY
4,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


<h3><strong>Step 4: Inspect the Dataset</strong></h3>
<p>Examine the dataset to understand its structure and identify potential issues.</p>

In [54]:
# Get basic information about the dataset
df_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21672 entries, 0 to 21671
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   housing_median_age  21672 non-null  float64
 1   total_rooms         21672 non-null  float64
 2   total_bedrooms      18576 non-null  float64
 3   population          21672 non-null  float64
 4   households          21672 non-null  float64
 5   median_income       21672 non-null  float64
 6   median_house_value  21672 non-null  float64
 7   ocean_proximity     21672 non-null  object 
dtypes: float64(7), object(1)
memory usage: 1.3+ MB


In [55]:
# Display summary statistics
print("Numerical Columns Summary:")
print(df_selected.describe())
print("\nCategorical Columns Summary:")
print(df_selected['ocean_proximity'].value_counts())

Numerical Columns Summary:
       housing_median_age   total_rooms  total_bedrooms    population  \
count        21672.000000  21672.000000    18576.000000  21672.000000   
mean            28.655362   2635.010797      537.863211   1424.642580   
std             12.570122   2177.198036      419.157368   1127.615132   
min              1.000000      2.000000        1.000000      3.000000   
25%             18.000000   1448.000000      296.000000    788.000000   
50%             29.000000   2127.000000      436.000000   1166.000000   
75%             37.000000   3149.000000      647.000000   1724.000000   
max             52.000000  39320.000000     6445.000000  35682.000000   

         households  median_income  median_house_value  
count  21672.000000   21672.000000        21672.000000  
mean     499.294943       3.874639       207025.157392  
std      381.265414       1.906372       115451.740239  
min        1.000000       0.499900        14999.000000  
25%      280.000000       2.56

In [56]:
# Check for missing values
missing_values = df_selected.isnull().sum()
print("Missing values per column:\n", missing_values)

Missing values per column:
 housing_median_age       0
total_rooms              0
total_bedrooms        3096
population               0
households               0
median_income            0
median_house_value       0
ocean_proximity          0
dtype: int64


In [57]:
# Check for duplicated records
duplicates = df_selected.duplicated()
df_duplicates = duplicates.sum()
print(f"Number of duplicated rows: {df_duplicates}")

Number of duplicated rows: 1032


---

<h2><strong>Part 2: Handling Missing Data</strong></h2>
<h3><strong>Step 1: Identify Missing Values</strong></h3>
<p>Determine the percentage of missing values in each column to decide on appropriate handling methods.</p>

In [58]:
# Calculate the percentage of missing values per column
missing_percent = (df_selected.isnull().sum() / len(df_selected)) * 100
print("Percentage of missing values per column:\n")
print(missing_percent)

Percentage of missing values per column:

housing_median_age     0.000000
total_rooms            0.000000
total_bedrooms        14.285714
population             0.000000
households             0.000000
median_income          0.000000
median_house_value     0.000000
ocean_proximity        0.000000
dtype: float64
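
<p>Because the mean of a boolean mask equals the fraction of <code>True</code> values, the same percentage can be computed with <code>isnull().mean()</code>. A minimal sketch on a hypothetical seven-row frame (<code>toy</code>) with one missing entry, which gives the same 1-in-7 ratio as 3096 out of 21672:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: seven rows, one missing value in 'total_bedrooms'
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0, 1627.0, 919.0, 2535.0],
    "total_bedrooms": [129.0, 1106.0, 190.0, np.nan, 280.0, 213.0, 489.0],
})

# isnull() yields True/False per cell; mean() is the share of True per column
missing_percent_toy = toy.isnull().mean() * 100
print(missing_percent_toy)
```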


<h3><strong>Step 2: Filling Missing Values</strong></h3>
<p>Decide on strategies to handle missing data based on the nature of each column.</p>
<ul>
<li><strong>Numerical Columns:</strong> You might fill missing values with the median or mean.</li>
</ul>
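
<p>The choice between mean and median matters most when a column is skewed, since a few very large values pull the mean upward. The following sketch compares both strategies on a hypothetical column (<code>toy</code>) containing one outlier; the numbers are made up for illustration:</p>

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with one large outlier (2000.0) and two missing entries
toy = pd.DataFrame({"total_bedrooms": [100.0, 120.0, np.nan, 110.0, 2000.0, np.nan]})

filled = {}
for strategy in ("mean", "median"):
    imputer = SimpleImputer(strategy=strategy)
    # fit_transform returns a 2-D array; ravel() flattens it to one dimension
    filled[strategy] = imputer.fit_transform(toy[["total_bedrooms"]]).ravel()

# Position 2 was missing, so it now holds the imputed value
mean_fill = filled["mean"][2]
median_fill = filled["median"][2]
print(f"mean fill:   {mean_fill}")
print(f"median fill: {median_fill}")
```

<p>Here the mean fill lands far above most observed values, while the median fill stays close to the typical ones.</p>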

In [59]:
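# Fill missing values in 'total_bedrooms' with the column mean
imputer_mean = SimpleImputer(strategy='mean')
# fit_transform returns a 2-D array, so flatten it before assigning to the column
df['total_bedrooms'] = imputer_mean.fit_transform(df[['total_bedrooms']]).ravel()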

<h3><strong>Step 3: Verify Missing Values Have Been Handled</strong></h3>
<p>Check to confirm that missing values have been appropriately addressed.</p>

In [60]:
# Recheck missing values
print("Missing values in each column:")
print(df.isnull().sum())

Missing values in each column:
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64


---

<h2><strong>Part 3: Removing Duplicate Records</strong></h2>
<p>Duplicate entries can skew your analysis. Ensure the dataset is free from duplicates.</p>

In [61]:
# Check for duplicate rows
duplicate_rows = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_rows}")

Number of duplicate rows: 1032


In [62]:
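# Remove duplicate rows, keeping the first occurrence of each
df_cleaned = df.drop_duplicates()
# Later cells operate on df, so point it at the de-duplicated data
df = df_cleaned.copy()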

In [63]:
# Verify that duplicate records are removed
df_cleaned.shape

(20640, 10)
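
<p>To see what <code>duplicated()</code> and <code>drop_duplicates()</code> do on a small scale, consider this sketch. The frame <code>toy</code> is hypothetical; its third row repeats the first:</p>

```python
import pandas as pd

# Hypothetical toy frame: row 2 is an exact copy of row 0
toy = pd.DataFrame({
    "households": [126, 1138, 126, 177],
    "median_income": [8.3252, 8.3014, 8.3252, 7.2574],
})

# By default only the repeats are flagged, not the first occurrence
n_repeats = int(toy.duplicated().sum())
# keep=False flags every member of a duplicated group
n_in_groups = int(toy.duplicated(keep=False).sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_repeats, n_in_groups, deduped.shape)
```

<p>Note that <code>drop_duplicates()</code> returns a new DataFrame, so the result has to be assigned to a variable for the removal to take effect.</p>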

---

<h2><strong>Part 4: Basic Data Formatting</strong></h2>

<h3><strong>Step 1: Converting Data Types</strong></h3>
<p>Ensure that each column has the appropriate data type for analysis.</p>

In [64]:
# Convert 'median_income' to float if not already
df['median_income'] = df['median_income'].astype(float)

# Convert categorical columns (ocean_proximity) to 'category' data type
df['ocean_proximity'] = df['ocean_proximity'].astype('category')

# Verify all column data types
print(df.dtypes)

longitude              float64
latitude               float64
housing_median_age     float64
total_rooms            float64
total_bedrooms         float64
population             float64
households             float64
median_income          float64
median_house_value     float64
ocean_proximity       category
dtype: object
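
<p>One reason to store a low-cardinality text column as <code>category</code> is memory: each distinct label is stored once and the rows hold small integer codes. A rough sketch on a hypothetical Series (<code>toy</code>, with two made-up labels repeated many times):</p>

```python
import pandas as pd

# Hypothetical column: two distinct labels repeated across 5000 rows
toy = pd.Series(["NEAR BAY", "INLAND", "NEAR BAY", "INLAND", "NEAR BAY"] * 1000)

as_category = toy.astype("category")

# deep=True counts the actual string storage, not just the references
text_bytes = toy.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(list(as_category.cat.categories))
print(f"text: {text_bytes} bytes, category: {category_bytes} bytes")
```

<p>The exact byte counts depend on the pandas version, so treat them as indicative only.</p>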


<h3><strong>Step 2: Renaming Columns</strong></h3>
<p>Rename columns to improve readability and consistency. All column names must be renamed using the camel case convention.</p>

In [65]:
# Display the current column names
print(df.columns)

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')


In [66]:
# Rename the columns using the camel case convention
df.rename(columns={
    'longitude': 'longitude',
    'latitude': 'latitude',
    'housing_median_age': 'housingMedianAge',
    'total_rooms': 'totalRooms',
    'total_bedrooms': 'totalBedrooms',
    'population': 'population',
    'households': 'households',
    'median_income': 'medianIncome',
    'median_house_value': 'medianHouseValue',
    'ocean_proximity': 'oceanProximity'
}, inplace=True)

# Check updated column names
print(df.columns)

Index(['longitude', 'latitude', 'housingMedianAge', 'totalRooms',
       'totalBedrooms', 'population', 'households', 'medianIncome',
       'medianHouseValue', 'oceanProximity'],
      dtype='object')


In [67]:
# Display new column names
print(df.columns)


Index(['longitude', 'latitude', 'housingMedianAge', 'totalRooms',
       'totalBedrooms', 'population', 'households', 'medianIncome',
       'medianHouseValue', 'oceanProximity'],
      dtype='object')
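
<p>When there are many columns, the mapping can also be generated instead of typed by hand. The helper <code>snake_to_camel</code> below is a hypothetical name introduced for this sketch; it relies on the fact that <code>DataFrame.rename</code> accepts a function for <code>columns</code>:</p>

```python
import pandas as pd


def snake_to_camel(name: str) -> str:
    """Convert a snake_case name to camelCase (hypothetical helper)."""
    first, *rest = name.split("_")
    # The first word stays as-is; each later word gets a capital initial
    return first + "".join(part.capitalize() for part in rest)


# Hypothetical empty frame carrying a few of the dataset's column names
toy = pd.DataFrame(columns=["housing_median_age", "total_rooms", "population"])

# rename applies the function to every column label and returns a new frame
renamed = toy.rename(columns=snake_to_camel)
print(renamed.columns.tolist())
```

<p>Single-word names such as <code>population</code> pass through unchanged, which matches the explicit mapping used above.</p>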


---

<h2 id="Part-5:-Removing-Unnecessary-Columns"><strong>Part 5: Removing Unnecessary Columns</strong></h2>
<p>Drop the 'ocean_proximity' column (named 'oceanProximity' after the renaming in Part 4).</p>

In [68]:
# Drop the column
updated_df = df.drop(columns=['oceanProximity'])
# Verify that the column is removed
updated_df.head(10)

Unnamed: 0,longitude,latitude,housingMedianAge,totalRooms,totalBedrooms,population,households,medianIncome,medianHouseValue
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0
3,-122.25,37.85,52.0,1274.0,537.863211,558.0,219.0,5.6431,341300.0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0
5,-122.25,37.85,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0
6,-122.25,37.84,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0
7,-122.25,37.84,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0
8,-122.26,37.84,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0
9,-122.25,37.84,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0


---

<h2 id="Part-7:-Saving-the-Cleaned-Dataset"><strong>Part 6: Saving the Cleaned Dataset</strong></h2>
<p>Save your cleaned and formatted dataset as 'housing_cleaned.csv'</p>
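
<p>Passing <code>index=False</code> when saving keeps the row index out of the file. Without it, the index is written as an unnamed first column, which pandas labels <code>Unnamed: 0</code> on reload. The sketch below illustrates this with a hypothetical two-row frame (<code>toy</code>) and in-memory buffers in place of a real file:</p>

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the cleaned dataset
toy = pd.DataFrame({"totalRooms": [880.0, 7099.0], "medianIncome": [8.3252, 8.3014]})

# Save without the index, then read the text back
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

# Save with the default index=True for comparison
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

print(reloaded.columns.tolist())
print(reloaded_with_index.columns.tolist())
```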

In [69]:
# Save the cleaned dataset to a new CSV file
updated_df.to_csv("housing_cleaned.csv", index=False)
print("Cleaned dataset saved as 'housing_cleaned.csv'")

Cleaned dataset saved as 'housing_cleaned.csv'


---

# Submission:
Submit all files to myConnexion. Rename the file as IndexNo_Name_MockTest.