# Project One: House Prices

Oliver James
DS 5033 Data Mining & Machine Learning

During the course of this project, you will learn to load, process, visualize and predict with housing data.
We will work through a series of steps:
1. Understand the Overarching Task
2. Acquire and Load the Data
3. Prepare the Data (for ML)
4. Select and Train an ML Model
5. Present an Analysis of the Solution

We will consolidate these tasks into the following three stages:

* Broadly speaking, tasks 1-3 fall under Exploratory Data Analysis (EDA).
* Then, task four encompasses several steps for price prediction (the modeling side of things).
* Finally, we have our analysis in task five.


Due to some difficulties in acquiring a dataset on Texas and San Antonio housing, we will be relying on an
older dataset. For the purposes of learning, the data is still relevant.

For this project, the data is on Californian houses (when they were still affordable).  
The csv is also attached to the assignment.

* The dataset is available from: https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
* You can also get the csv from: https://github.com/ageron/data/blob/main/housing/housing.csv
* Read more here: https://www.kaggle.com/datasets/camnugent/california-housing-prices/data

In [None]:
#import any libraries you may need
import pandas as pd # for dataframe creation
from pandas.plotting import scatter_matrix #to create scatter matrix for initial EDA
import matplotlib.pyplot as plt #for plotting data
import seaborn as sns #for the correlation heatmap
from sklearn.preprocessing import StandardScaler,MinMaxScaler #This will be used to scale data
from sklearn.compose import ColumnTransformer #To allow the application of different transformations (e.g., scaling, encoding) to different columns of your dataset in one unified step.
from sklearn.pipeline import Pipeline #To combine preprocessing and modeling steps into one object, making your code cleaner and reducing the chances of data leakage by ensuring preprocessing is applied consistently.


### Exploratory Data Analysis (40%)

The task here is to construct a big picture view of the data set.

1. Download and load the dataset using pandas
2. Print statistics - use the head, info, describe functions
3. Visualize the data using the scatter_matrix function 
4. Visualize the data according to geospatial data (HINT: plot using coordinate data)
5. Improve your geospatial plot: change parameters alpha, s, c, cmap. (your end result should have a colorbar with price and datapoints showing housing density with the hues indicating the price)

In [None]:
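#EDA
# Step 1: Load the dataset
df = pd.read_csv('housing.csv')  # read CSV file

# Step 2: Print statistics
print("First 5 rows of the data:")
print(df.head())  # get initial look at data
print("\nColumn types and non-null counts:")
df.info()  # info() prints directly and returns None
print("\nSummary statistics:")
print(df.describe())  # count, mean, std, min, quartiles, max for numeric columns

#Please see next code block for step 3; it is a rather large graph, so I've given it its own code block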


In [None]:
#EDA Continued
#Step 3
# Only include relevant numeric columns to avoid clutter
scatter_matrix(df[['housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 
                   'households', 'median_income', 'median_house_value']], 
               figsize=(20, 10), alpha=0.3, diagonal='kde')
plt.suptitle("Scatter Matrix of Housing Data", size=16)
plt.show()


 

In [None]:
#EDA Continued
# Correlation heatmap first; Steps 4 and 5 of the EDA follow it
# Select only the numeric columns from your DataFrame
numeric_df = df.select_dtypes(include=['float64', 'int64'])
plt.figure(figsize=(10, 8))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Correlation Matrix")
# x=0.5 in axes coordinates centers the caption under the plot (x=0 would center it on the left edge)
plt.text(0.5, -0.5, "For above Matrix: Correlation values range from -1 (negative correlation) to +1 (positive correlation).\nValues close to 0 indicate little or no linear relationship.",
         ha='center', va='top', fontsize=12, transform=plt.gca().transAxes)
plt.show()

# Step 4 & 5 combined: Improved Geospatial Plot with adjustments to alpha, s, c, and cmap
plt.figure(figsize=(12, 10))
plt.scatter(df['longitude'], df['latitude'], 
            c=df['median_house_value'],  # Color represents median house value
            cmap='coolwarm',             # Color map for better contrast
            s=df['population'] / 50,     # Adjust size to represent population
            alpha=0.4)                   # Transparency for overlapping points
plt.colorbar(label='Median House Value')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Improved Geospatial Plot of Housing Prices and Density')
plt.show()

### Data Prep and Cleaning (20%)

Here, you will work to further investigate your features.

1. Check your features, are they in the correct units (median_income)
2. Find and fix missing values
3. Handle categorical data (HINT: OneHotEncoder)
4. Check the scaling for data. Does it need adjustment? (HINT: StandardScaler vs MinMaxScaler)
5. EXTRA CREDIT: how are you fixing (2)? Use SimpleImputer
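
For the extra-credit item, a minimal sketch of median imputation with `SimpleImputer` is shown here. The `toy` DataFrame and its values are made up for illustration and only mimic the shape of two housing columns; on the real data you would fit the imputer on the numeric columns of your own DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the housing data (values are invented for illustration).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0, 1627.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0, np.nan],
})

# strategy="median" replaces each NaN with the column median learned in fit().
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(imputer.statistics_)     # per-column medians learned from the data
print(imputed.isnull().sum())  # no missing values should remain
```

One reason to prefer the imputer over a one-off `fillna` is that the fitted medians are stored in `statistics_`, so the same values can later be applied to a test set with `transform`.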

In [None]:
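#Data Prep and Cleaning

#use multiple code blocks to organize your code
#Step 1: median_income is recorded in tens of thousands of dollars, so multiply by 10,000 to convert it to dollars

df_1 = pd.read_csv('housing.csv')
df_1['median_income'] = df_1['median_income'] * 10000
df_1 = df_1.drop_duplicates()  # drop_duplicates returns a new DataFrame, so the result must be assigned
# Rows with missing values are not dropped here; they are filled in step 2 below
df_1.to_csv('housing_updated1.csv', index=False)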


#step 2, find and fix missing values

# Check for missing values in each column
missing_values = df_1.isnull().sum()
print("Missing values in each column:")
print(missing_values)
print()


#we see that there are 207 missing values in the total_bedrooms column
print('207 missing values for total bedrooms field, to deal with this issue I will fill it with the median total bedrooms\n')
df_1['total_bedrooms'] = df_1['total_bedrooms'].fillna(df_1['total_bedrooms'].median())

# Check if the missing values are filled
missing_values = df_1.isnull().sum()
print("Missing values after filling:")
print(missing_values) #solved the problem

In [None]:
#Data Prep and cleaning
#Step 3 handle categorical data, OneHotEncoder 

print('Now I am going to take a look at the field ocean_proximity, since that is the only field that is categorical\n')

# Get unique values
unique_values = df_1['ocean_proximity'].unique()

# Print the unique values
print("Unique values in the column: \n")
print(unique_values)

#now I need to encode these values and update my CSV file
# Perform one-hot encoding
one_hot_encoded = pd.get_dummies(df_1['ocean_proximity'], prefix='ocean_proximity')

# Convert encoded columns to integers to ensure they appear as 1 and 0
one_hot_encoded = one_hot_encoded.astype(int)

# Concatenate the original dataframe with the encoded columns
df_updated = pd.concat([df_1, one_hot_encoded], axis=1)

# Drop the original 'ocean_proximity' column
df_updated.drop(columns=['ocean_proximity'], inplace=True)

# Save the updated dataframe back to a CSV file
df_updated.to_csv('housing_updated1.csv', index=False)

print("\nThe one-hot encoding is complete, and the updated CSV file has been saved as 'housing_updated1.csv'.")

In [None]:
#Data Prep and Cleaning continued
#Step 4, check scaling of data. 
#I used ChatGPT to look at my updated CSV file and suggest which fields need adjusting; see its recommendations below
#It created scatter plots, and I looked for a normal or near-normal distribution for StandardScaler and a skewed one for MinMaxScaler
"""
Key Considerations for Scaling:
StandardScaler: Works well for normally distributed data, scales to zero mean and unit variance. Suitable for variables like Median Income and Median House Value.
MinMaxScaler: Rescales data to a fixed range, typically [0, 1]. Better for preserving data structure in features with skewed distributions like Total Rooms and Population.
Recommendation:
Use StandardScaler for features like Median Income and Median House Value.
Use MinMaxScaler for skewed data like Total Rooms, Total Bedrooms, Population, and Households.
"""

# Define transformations
preprocessing = ColumnTransformer(
    transformers=[
        ('std', StandardScaler(), ['housing_median_age', 'median_income']),
        ('minmax', MinMaxScaler(), ['total_rooms', 'total_bedrooms', 'population', 'households'])
    ]
)

# Fit and transform the dataset
scaled_data = preprocessing.fit_transform(df_updated)

# Convert to DataFrame
scaled_df = pd.DataFrame(scaled_data, columns=['housing_median_age', 'median_income', 'total_rooms', 'total_bedrooms', 'population', 'households'])
print(scaled_df.head())

### Price Prediction (20%)

Now you will create and test many ML models. The idea is to play with hyperparameters and model types.
You may find that some of your data prep needs further tweaking.

Now that you understand the big picture and have an intuition on how homes may be priced, you can move on to creating a price model.
Generally, you will want to follow the order of steps below (with 1&2 already completed):

1. Create a copy of your data.
2. On the copy, perform the data prep transformations.
3. Create a train & test split. Generally, 80-20 splits work pretty well.
4. Use a model you understand well. We recently learned about LinearRegression. How well does this model work?
5. In our introductory class, we also learned about the DecisionTreeRegressor. How does this perform?
6. EXTRA CREDIT: Model prices with the RandomForestRegressor model

The main idea here is to get used to programming and modeling on real data. Report your results and aim for
a ballpark Root Mean Squared Error less than 75K.
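
As a starting point for steps 3 to 6, here is a minimal sketch of the split, fit, and evaluate loop. It runs on synthetic data generated inside the block (the feature names mirror the housing columns, but the values and the price formula are invented), so the printed errors say nothing about the real dataset. Swap in your prepared features and `median_house_value` as the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: a linear price signal plus noise (purely illustrative).
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "median_income": rng.uniform(1, 10, n),
    "housing_median_age": rng.uniform(1, 52, n),
})
y = 40_000 * X["median_income"] + 500 * X["housing_median_age"] + rng.normal(0, 20_000, n)

# Step 3: 80-20 train/test split; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Steps 4-6: fit each model on the training set and score it on the held-out set.
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # RMSE is the square root of the mean squared error, in the target's units.
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, preds)))
    print(f"{name}: RMSE = {rmse[name]:,.0f}")
```

Scoring on the held-out test set matters most for the tree-based models: an unconstrained decision tree can reach a training RMSE near zero while doing much worse on unseen rows.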

In [None]:
#use multiple code blocks to organize your code

### Analysis (20%):

A critical component in science is communicating your results and explaining the reasoning behind the results.
A good presentation here should include the following:  

1. An introduction to the dataset and anything we should know (e.g. how it was collected, common errors). 
2. What did you discover in your EDA? What do you do with missing values, outliers, etc.  
3. What kind of distribution is the data? Is there a skew or high concentration of houses in a certain range?  
4. What correlations were revealed in the analysis? e.g. footage and price, do they correlate positively?  
5. Feature selection: which features worked for price predictions and what was noise? How did you determine this?
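
Points 3 and 4 can be backed with numbers as well as plots. The sketch below is one possible way to do that, using pandas `skew()` and `corr()` on synthetic data generated inside the block (the column names mirror the housing data, the values are invented). On the real data you would call the same methods on the numeric columns of your DataFrame.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: a right-skewed income column driving the house value.
rng = np.random.default_rng(0)
n = 1000
income = rng.lognormal(mean=1.2, sigma=0.5, size=n)
toy = pd.DataFrame({
    "median_income": income,
    "housing_median_age": rng.uniform(1, 52, n),
    "median_house_value": 40_000 * income + rng.normal(0, 30_000, n),
})

# Positive skew means a long right tail; values near 0 mean roughly symmetric.
skewness = toy.skew()
print(skewness)

# Pearson correlation of every column with the target, strongest first.
corr_with_target = toy.corr()["median_house_value"].sort_values(ascending=False)
print(corr_with_target)
```

Keep in mind that Pearson correlation only measures linear relationships, so a feature with a low value here is not necessarily noise for a tree-based model.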

The points above are for guidance; you can choose your template and structure.  
The idea is to present a short report (no word counts) that is structured, clear, and concise.  
You can refer back to your figures and use external links to explain your insights.

### use this markdown cell to write a report

[here]

### Submission:

You need to prepare your ipynb/jupyter notebook for grading.
The two main tasks are ensuring all your cell outputs are present and that you convert the notebook to PDF.

The instructions will vary slightly based on the platform (Colab, Kaggle, Anaconda, etc.).
Generally, inside the notebook, you will want to:
1. Restart & clear all cell outputs (optional, may detect buggy program control flow)
2. Run all (must do; I need to see your code cell outputs!)

Next, you need to download the notebook as a PDF. Unfortunately, exporting as PDF is a bit tricky.
An easy workaround:
1. Download the notebook. (all platforms allow the default .ipynb export)
2. Convert it to PDF at https://onlineconvertfree.com/convert-format/ipynb-to-pdf/

If you are unable to upload as a PDF, submit the .ipynb. Do NOT upload a .py file.

### Rubric:

Please see the associated percentage allocations.  
In general, ensure your code runs correctly.  Make sure the PDF upload includes your code outputs.  
You will be given significant credit for documentation and pseudo-code.

For more details, please read the rubric PDF in the assignment files.