# Regression Analysis in Real Estate
This is part of the Ironhack mid-term bootcamp project.

## Objective

- This project identifies the factors that influence the selling price of properties.
- It then predicts house prices using machine learning algorithms (Linear Regression, K-Nearest Neighbors Regressor, and Random Forest Regressor).


## Dataset

- The dataset consists of information on roughly 22,000 properties.
- It contains historic data on houses sold between May 2014 and May 2015 in King County, WA.
- These are the definitions of data points provided:

| Column Name   | Description                                                                 |
| ------------- | --------------------------------------------------------------------------- |
| id            | ID of the house                                                             |
| date          | Date the house was sold                                                     |
| bedrooms      | Number of bedrooms                                                          |
| bathrooms     | Number of bathrooms                                                         |
| sqft_living   | Square footage of the home                                                   |
| sqft_lot      | Square footage of the lot                                                   |
| floors        | Total floors in the house                                                   |
| waterfront    | House which has a view to a waterfront                                      |
| view          | Has been viewed                                                             |
| condition     | How good the condition is overall                                           |
| grade         | Overall grade given to the housing unit, based on King County grading system|
| sqft_above    | Square footage of house apart from basement                                 |
| sqft_basement | Square footage of the basement                                              |
| yr_built      | Built Year                                                                  |
| yr_renovated  | Year when the house was renovated                                           |
| zipcode       | Zip code                                                                    |
| lat           | Latitude coordinate                                                         |
| long          | Longitude coordinate                                                        |
| sqft_living15 | Living room area in 2015 (implies-- some renovations)                       |
| sqft_lot15    | Lot size area in 2015 (implies-- some renovations)                          |
| price         | Price of the house                                                          |
| district      | District where the house is located                                         |



# Project Structure
1. Import libraries & Load Dataset
2. Overview of Dataset
3. Data Cleaning
4. Exploratory Data Analysis (EDA)
5. Data Modelling: Linear Regressor, KNN Regressor, Random Forest Regressor
6. Cross-examination of models
7. Feature importance
8. Conclusion


# 1. Import libraries & Load Dataset

In [1]:
# Import necessary libraries & load dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler


In [2]:
df = pd.read_excel('/Users/dooinnkim/ironhack_da_may_2023/data_mid_bootcamp_project_regression/regression_data.xls')


# 2. Overview of Data

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe().T

### Summary of the Dataset
- There are 21,597 entries in this dataset (i.e., houses sold).
- The dataset includes various features such as the number of bedrooms and bathrooms, living area (in square feet), lot size (in square feet), number of floors, whether the house is a waterfront property, view quality, condition and grade of the house, above ground living area (in square feet), basement area (in square feet), the year the house was built and renovated, zip code, latitude, longitude, the living area in 2015 (in square feet), the lot size in 2015 (in square feet), and the house price.

### Observations:

- **Bedrooms**: The average house has around 3.37 bedrooms, but the number can vary greatly, with a standard deviation of about 0.93. The minimum number of bedrooms is 1, and the maximum is 33.

- **Bathrooms**: Houses have on average 2.12 bathrooms with a standard deviation of 0.77. The minimum number of bathrooms is 0.5 (likely a toilet and sink but no shower or tub), while the maximum is 8.

- **Sqft_living**: The average living space is about 2,080 square feet, with a standard deviation of 918. The smallest house has 370 square feet, while the largest has 13,540 square feet.

- **Sqft_lot**: The average lot size is 15,099 square feet (roughly a third of an acre), but the size varies greatly (standard deviation is 41,412) with the largest lot being 1,651,359 square feet (almost 38 acres).

- **Floors**: The average number of floors in a house is approximately 1.49, and houses in the dataset have between 1 and 3.5 floors.

- **Waterfront**: The mean of this feature is close to 0, suggesting that most houses are not on the waterfront.

- **Price**: House prices average around $540,296, with a standard deviation of $367,368. The cheapest house costs $78,000, while the most expensive one is $7,700,000.

In [None]:
df.isnull().sum()

In [None]:
df.duplicated().sum()

Fortunately, there are no missing values or duplicated rows, so the dataset is already clean.

# 3. Data Cleaning

Although the dataset already seems clean, we will create two additional columns, district and house_age, for the following reasons:
- **district**: We assume that zipcode, latitude, and longitude are not factors that buyers readily recognize when purchasing a property. The name of the district is the more common convention and is more likely to influence the purchase decision. We add a 'district' column using the uszipcode Python library.
- **house_age**: We assume that the age of a house is one of the factors that influence its selling price.

In [None]:
# using uszipcode, a python geo library
from uszipcode import SearchEngine


search = SearchEngine()


def get_state_by_zip(zipcode):
    zipcode_info = search.by_zipcode(zipcode)
    if zipcode_info:
        return zipcode_info.state


def get_district_by_zip(zipcode):
    zipcode_info = search.by_zipcode(zipcode)
    if zipcode_info:
        return zipcode_info.major_city


# Create a new 'District' column
df['district'] = df['zipcode'].apply(get_district_by_zip)


In [None]:
# Create a new 'age of house' column
df['date'] = pd.to_datetime(df['date'])

df['year'] = df['date'].dt.year

df['house_age'] = df['year'] - df['yr_built']

df.head()

# 4. EDA (Exploratory Data Analysis)

We've briefly seen the summary of the dataset in the Data Overview part. This part investigates the individual features in depth to look for useful business insights for reporting.

In [None]:
# Remove columns that are unnecessary for the analysis
df = df.drop(columns=['id', 'date', 'year'])

df.hist(bins=50, figsize=(20,15))
plt.tight_layout()
plt.show()

In [None]:
# Initialize the scaler
scaler = StandardScaler()

# Select only numeric features from the DataFrame
numeric_df = df.select_dtypes(include=['int64', 'float64'])

# Fit and transform the numeric features with StandardScaler
scaled_df = pd.DataFrame(scaler.fit_transform(numeric_df), columns=numeric_df.columns)

# Create boxplots
scaled_df.boxplot(figsize=(20, 6))
plt.title("Boxplots of Scaled Numerical Columns")
plt.ylabel("Values")
plt.xticks(rotation=45)

plt.show()

### Observations on Distribution

- **bedrooms**: Most houses have between 2 and 5 bedrooms, with 3 bedrooms being the most common.

- **bathrooms**: Most houses have between 1 and 3 bathrooms. Houses with 2.5 bathrooms (a common configuration with 2 full bathrooms and a half bathroom) are particularly common.

- **sqft_living**: The square footage of living spaces in the houses is right-skewed, meaning most houses have smaller living spaces, but there are a few houses with very large living spaces.

- **sqft_lot**: Similarly, the square footage of lots is also right-skewed. Most houses have smaller lots, but a few houses have very large lots.

- **floors**: Many houses have 1 or 2 floors. There are some houses with 1.5, 2.5, or 3 floors, indicating that some houses have a loft or a half floor.

- **waterfront**: Almost all houses do not have a waterfront view, as indicated by the bar at 0.

- **view**: Most houses have not been viewed, but some houses have been viewed multiple times.

- **condition**: Most houses are in condition 3 or 4. Very few houses are in poor condition (1) or excellent condition (5).

- **grade**: The grading of the houses seems to follow a normal distribution, with most houses having a grade around 7.

- **sqft_above**: The square footage of house apart from basement is right-skewed, with most houses having smaller areas, but a few houses having very large areas.

- **sqft_basement**: Many houses do not have a basement (indicated by the bar at 0). Among the houses that do have a basement, the square footage of the basement is right-skewed.

- **yr_built**: The year the houses were built is roughly uniformly distributed across the years, with a slight increase in more recent years.

- **yr_renovated**: Most houses have not been renovated (indicated by the bar at 0). Among the houses that have been renovated, the year of renovation is right-skewed.

- **zipcode**: The zip codes appear to be uniformly distributed, indicating that the houses are spread out across the zip codes.

- **lat and long**: The latitude and longitude indicate the geographical location of the houses. There seem to be clusters of houses at certain latitudes and longitudes.

- **sqft_living15 and sqft_lot15**: The square footage of the living room area in 2015 and the lot size area in 2015 are both right-skewed, similar to sqft_living and sqft_lot.

- **price**: The price of the houses is right-skewed. Most houses are priced lower, but there are a few houses with very high prices.


### Outliers

Let's also check the outliers of each feature using the IQR method. Treating outliers is important for several reasons:

1. **Accuracy**: Outliers can significantly affect the mean and standard deviation of your data, which are used in a variety of statistical tests and machine learning models. This can lead to inaccurate results. For example, in a linear regression model, a single outlier can dramatically change the line of best fit, leading to inaccurate predictions.

2. **Model Performance**: Many machine learning algorithms are sensitive to the range and distribution of attribute values in the input data. Outliers in the data can cause problems with these algorithms, leading to poor performance.

3. **Data Quality**: Outliers can sometimes be the result of errors in data collection or entry. Identifying and addressing these outliers can help improve the quality of your data.

4. **Interpretability**: In some cases, outliers can make it more difficult to understand the underlying patterns and trends in your data. Treating these outliers can make your data analysis and visualizations more interpretable.

5. **Assumptions**: Many statistical procedures assume a normal distribution and outliers can violate this assumption, leading to incorrect results.
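
To illustrate the first point, here is a minimal sketch of how a single extreme point can change a least-squares line. The numbers are synthetic and chosen only for illustration; they are not taken from the King County dataset, and `np.polyfit` stands in for a full regression model.

```python
import numpy as np

# Synthetic data (not from the King County dataset): price grows linearly with size.
sqft = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
price = 200.0 * sqft  # points lie exactly on a line with slope 200

# Fit a straight line with ordinary least squares.
slope_clean, intercept_clean = np.polyfit(sqft, price, deg=1)

# Add a single extreme point: a small house with a very high price.
sqft_out = np.append(sqft, 1000.0)
price_out = np.append(price, 3_000_000.0)
slope_out, intercept_out = np.polyfit(sqft_out, price_out, deg=1)

print(f"slope without outlier: {slope_clean:.1f}")
print(f"slope with outlier:    {slope_out:.1f}")
```

In this toy example one added point is enough to change the sign of the fitted slope, which is why the outlier counts below deserve a closer look before modelling.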

In [None]:
# Count IQR-rule outliers (beyond 1.5 * IQR from the quartiles) for each numeric column
outliers_dict = {}

for col in numeric_df.columns:
    Q1 = numeric_df[col].quantile(0.25)
    Q3 = numeric_df[col].quantile(0.75)
    IQR = Q3 - Q1

    is_outlier = (numeric_df[col] < Q1 - 1.5 * IQR) | (numeric_df[col] > Q3 + 1.5 * IQR)
    outlier_count = int(is_outlier.sum())
    outlier_percentage = round(outlier_count / len(numeric_df) * 100, 2)

    outliers_dict[col] = {'outliers_count': outlier_count, 'outliers_percentage': outlier_percentage}

# Build the summary table once, after the loop
outliers_df = pd.DataFrame.from_dict(outliers_dict, orient='index')
outliers_df

Here are possible suggestions for treating the outliers before training our machine learning models.

- **'bedrooms'**: 530 outliers. If these represent homes with a very high number of bedrooms, they might be valid entries representing large houses or mansions. Investigate the actual values and consider keeping them or using winsorizing.

- **'bathrooms'**: 561 outliers. Similar to 'bedrooms', these could be valid entries representing large houses. Review the actual values and consider keeping them or using winsorizing.

- **'sqft_living'**: 571 outliers. These might represent particularly large or small houses. Consider a log transformation to handle the skewness of the data.

- **'sqft_lot'**: 2419 outliers. These might represent properties with particularly large lot sizes. A log transformation could be useful here too.

- **'waterfront'**: 163 outliers. This is likely a binary feature indicating whether the property is waterfront or not. Outliers might simply be the less common class (e.g., waterfront properties). If this is the case, no outlier handling is needed.

- **'view'**: 2122 outliers. If this feature represents the number of views a property has had, outliers could be properties that are particularly popular or unpopular. Depending on the distribution, a transformation or winsorizing might be appropriate.

- **'condition'**: 29 outliers. This is likely a categorical variable, and "outliers" are probably just less common conditions. No outlier handling is likely needed.

- **'grade'**: 1905 outliers. If this is a grading system for the quality of a house, outliers may be houses that are extremely high or low quality. These could be important to keep as they may have a significant impact on house prices.

- **'sqft_above', 'sqft_basement'**: These could be handled similarly to 'sqft_living'. Consider a log transformation.

- **'yr_renovated'**: 914 outliers. These might be houses that were recently renovated. If many houses have a value of 0 (indicating no renovation), this could lead to a skewed distribution. One approach could be to turn this into a binary 'was_renovated' feature.

- **'lat', 'long'**: 2 and 255 outliers, respectively. These are coordinates and outliers may represent properties that are geographically distant from others. It might be worth keeping these as they could represent different real estate markets.

- **'sqft_living15', 'sqft_lot15'**: Outliers in these features might be handled in the same way as 'sqft_living' and 'sqft_lot', possibly with a log transformation.

- **'price'**: 1158 outliers. These are likely high-value houses. Because the goal is likely to predict this variable, it's important to handle outliers carefully. A log transformation is a common choice for skewed target variables.
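
The three treatments suggested above (winsorizing, log transformation, and a binary flag) can be sketched as follows. This is an illustration on a small made-up frame, not the project's actual preprocessing; the 5th/95th percentile caps are an arbitrary assumption, and the `demo` frame and its `*_wins` / `*_log` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Small synthetic frame; column names mirror the dataset, values are made up.
demo = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 3, 33],
    "sqft_lot": [4000, 5000, 7500, 6000, 8000, 1_000_000],
    "yr_renovated": [0, 0, 1995, 0, 2010, 0],
})

# Winsorizing: cap values at chosen percentiles (5th and 95th here, an arbitrary choice).
low, high = demo["bedrooms"].quantile([0.05, 0.95])
demo["bedrooms_wins"] = demo["bedrooms"].clip(lower=low, upper=high)

# Log transform: log1p compresses the long right tail and is defined at 0.
demo["sqft_lot_log"] = np.log1p(demo["sqft_lot"])

# Binary flag: a value of 0 is taken to mean "never renovated".
demo["was_renovated"] = (demo["yr_renovated"] != 0).astype(int)

print(demo)
```

Winsorizing keeps every row but limits the influence of extreme values, whereas the log transform changes the scale of the whole column, so the two are not interchangeable.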

In [None]:
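# numeric_only=True (pandas >= 1.5) skips non-numeric columns such as 'district'
corr_matrix = df.corr(method='pearson', numeric_only=True)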
fig, ax = plt.subplots(figsize=(20, 20))
ax = sns.heatmap(corr_matrix, annot=True)
plt.show()

### Observations on Correlation Matrix

- **sqft_living** has a high positive correlation with **price (0.701917)**, meaning that houses with more living space tend to be more expensive. This variable also has strong correlations with **bathrooms (0.755758)**, **grade (0.762779)**, and **sqft_above (0.876448)**, suggesting that larger houses tend to have more bathrooms, higher grades, and more above ground space.

- **grade** and **sqft_above** also have significant positive correlations with **price (0.667951 and 0.605368 respectively)**, suggesting that the quality of the house and the above ground space are important factors in determining the price of a house.


- **floors** and **condition** have a weak negative correlation (-0.264075), suggesting that houses with more floors tend to be in slightly worse condition, or vice versa.

In [None]:
corr_matrix["price"].sort_values(ascending=False)

The following features show particularly strong correlations with price. This suggests that the size and quality of a house (as measured by square footage and grade) are important factors that influence its price:

- **sqft_living**: 0.70
- **grade**: 0.67
- **sqft_above**: 0.61
- **sqft_living15**: 0.59
- **bathrooms**: 0.53



In [None]:
# Plot histogram for 'price'
plt.figure(figsize=(10, 5))
sns.histplot(df['price'], bins=30, kde=True)
plt.title('Distribution of Price')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

# Plot scatterplots for 'price' vs highly correlated features
highly_corr_features = ['sqft_living', 'grade', 'sqft_above', 'sqft_living15', 'bathrooms']
for feature in highly_corr_features:
    plt.figure(figsize=(10, 5))
    sns.scatterplot(data=df, x=feature, y='price')
    plt.title(f'Price vs {feature}')
    plt.show()

Here are some observations from the visualizations of **Price vs strongly correlated features**:

- **Price Distribution**: The price distribution is right-skewed, indicating that while most houses are priced on the lower end, there are a few houses with very high prices.

- **Price vs Square Footage of Living Area**: There is a positive correlation between the square footage of the living area (sqft_living) and the price, and the relationship appears fairly linear, which makes sense as larger houses tend to be more expensive.

- **Price vs Grade**: The grade of a house also seems to be positively correlated with its price. Higher-graded houses appear to be more expensive. This again is logical as better quality houses are expected to fetch higher prices.

- **Price vs Square Footage Above Ground**: The square footage of house apart from basement (sqft_above) also shows a positive correlation with price. Larger above-ground living spaces tend to increase the price of a house.

- **Price vs Square Footage of Living Area in 2015**: The size of the living area in 2015 (sqft_living15) also seems to affect the price, although there's a bit of spread in the data. Larger living areas in 2015 tend to be associated with higher prices.

- **Price vs Number of Bathrooms**: The number of bathrooms in a house (bathrooms) is also positively correlated with its price. Houses with more bathrooms tend to be more expensive.

These visualizations and observations provide some initial insights into the factors that may affect house prices. However, these are just bivariate relationships. A multivariate analysis or a machine learning model would be needed to better understand and predict house prices.

### Analysis of Categorical Features relative to Price

We will analyse the categorical features in relation to price to identify which features could affect the selling price. First, we need to create two additional columns, 'was_renovated' and 'has_basement', which are binary features indicating whether the house was renovated and whether it has a basement.

In [None]:
df['was_renovated'] = df['yr_renovated'].apply(lambda x: 0 if x == 0 else 1)
df['has_basement'] = df['sqft_basement'].apply(lambda x: 0 if x == 0 else 1)

In [None]:
df_cat = df.copy()
cols_to_convert = ['waterfront', 'condition', 'grade','was_renovated','has_basement']

for col in cols_to_convert:
    df_cat[col] = df_cat[col].astype(str)
    

In [None]:
for column in df_cat.select_dtypes(include=['object']).columns:
    plt.figure(figsize=(10, 4))
    order = df_cat[column].value_counts().index
    sns.countplot(x=column, data=df_cat, order=order)
    plt.title(f'Countplot of {column}')
    plt.xticks(rotation=45)
    plt.show()
    
    count_table = df_cat[column].value_counts()
    print(count_table)


In [None]:
for column in df_cat.select_dtypes(include=['object']).columns:
    plt.figure(figsize=(10, 4)) 
    sns.barplot(x=column, y='price', data=df_cat)
    plt.title(f'Average price per {column}')
    plt.xticks(rotation=45)
    plt.show()
    
    avg_price = pd.concat([df_cat[column], df['price']], axis=1).groupby(column).mean()
    print(avg_price.sort_values(by='price',ascending=False))

### Observations on Categorical features relative to Price

- **Waterfront**: Houses with a waterfront are on average significantly more expensive than those without. The average price of houses with a waterfront is about 1.66 million, while those without are about 531,762.

- **Condition**: The condition of a house also affects its average price. Houses in condition '5' (assuming '5' represents the best condition) have an average price of about 612,578, while those in condition '1' have an average price of around 341,067. The table suggests that, generally, houses in better condition fetch higher prices.

- **Grade**: The house prices also seem to increase with the grade of the house. Houses with a grade of '13' have the highest average price at approximately 3.71 million. On the other end of the spectrum, houses with a grade of '4' have an average price of around 212,002.

- **District**: The district where the house is located also significantly affects the average price. Houses in Medina have the highest average price of around 2.16 million, while those in Federal Way have the lowest average price of approximately 289,391.

- **Was_renovated**: Houses that have been renovated are generally more expensive than those that haven't. The average price of houses that were renovated is about $760,629, while those that were not renovated have an average price of around 530,560.

- **Has_basement**: Having a basement seems to add value to a house as well. Houses with a basement have an average price of approximately 622,518, while those without a basement have an average price of about 487,069.

# 5. Data Modelling

### Baseline models

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
df_model = df.copy()


In [None]:

#encoding
df_model = pd.get_dummies(df_model)


X = df_model.drop(columns=['price'])
Y = df_model['price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)



models = [LinearRegression(), KNeighborsRegressor(), RandomForestRegressor()]


# Collect one row of metrics per model (DataFrame.append was removed in pandas 2.0)
metrics_rows = []

for model in models:
    model.fit(X_train, Y_train)
    predictions = model.predict(X_test)

    mse = mean_squared_error(Y_test, predictions)
    mae = mean_absolute_error(Y_test, predictions)
    r2 = r2_score(Y_test, predictions)

    model_name = type(model).__name__
    metrics_rows.append({"Model": model_name, "r2": r2, "mse": mse, "mae": mae})

    # Actual vs predicted values, with the fitted regression line in red
    plt.figure(figsize=(10, 7))
    sns.regplot(x=Y_test, y=predictions, line_kws={"color": "red"})
    plt.title(f'Regression Plot: Actual vs Predicted values for {model_name}')
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.show()

metrics_df = pd.DataFrame(metrics_rows, columns=["Model", "r2", "mse", "mae"])
print(metrics_df)




In [None]:
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, Y_train)

importances = rf.feature_importances_
feature_names = X.columns


importances = 100.0 * (importances / np.sum(importances))


importances, feature_names = zip(*((importance, feature) for importance, feature in zip(importances, feature_names) if importance > 0.1))

importances = np.array(importances)
feature_names = np.array(feature_names)

indices = np.argsort(importances)[::-1]

plt.figure(figsize=(20, 13))
barplot = plt.bar(range(len(importances)), importances[indices], align="center")

plt.xticks(range(len(importances)), feature_names[indices], rotation=90)
plt.xlabel("Feature")
plt.ylabel("Importance (%)")
plt.title("Feature Importance")


for rect in barplot:
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width()/2.0, height, f'{height:.1f}%', ha='center', va='bottom')

plt.show()




In [None]:
# Create a DataFrame
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance (%)': importances
})

# Sort by importance
importance_df = importance_df.sort_values(by='Importance (%)', ascending=False)

# Display the DataFrame
print(importance_df)


### Standard Scale

In [None]:


df_model_std = df_model.copy()


X = df_model_std.drop(columns=['price'])
Y = df_model_std['price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test) 



models = [LinearRegression(), KNeighborsRegressor(), RandomForestRegressor()]


# Collect one row of metrics per model (DataFrame.append was removed in pandas 2.0)
metrics_rows = []

for model in models:
    model.fit(X_train, Y_train)
    predictions = model.predict(X_test)

    mse = mean_squared_error(Y_test, predictions)
    mae = mean_absolute_error(Y_test, predictions)
    r2 = r2_score(Y_test, predictions)

    model_name = type(model).__name__
    metrics_rows.append({"Model": model_name, "r2": r2, "mse": mse, "mae": mae})

    # Actual vs predicted values, with the fitted regression line in red
    plt.figure(figsize=(10, 7))
    sns.regplot(x=Y_test, y=predictions, line_kws={"color": "red"})
    plt.title(f'Regression Plot: Actual vs Predicted values for {model_name}')
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.show()

metrics_df = pd.DataFrame(metrics_rows, columns=["Model", "r2", "mse", "mae"])
metrics_df





### Log Transform
We will first treat the outliers of each feature according to the suggestions discussed in the previous part.

We will log transform the following features : 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15', 'price'
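
As a quick illustration of why a log transform helps with right-skewed columns, the sketch below compares skewness before and after `np.log1p`. The prices are synthetic log-normal draws (an assumption made purely for illustration), not the King County data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal); not the King County data.
rng = np.random.default_rng(42)
price = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000))

log_price = np.log1p(price)

skew_before = price.skew()
skew_after = log_price.skew()
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")

# expm1 inverts log1p, so values on the log scale can be mapped back.
restored = np.expm1(log_price)
```

Because `np.expm1` is the exact inverse of `np.log1p`, predictions made on the log scale can be converted back to the original units for reporting.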


In [None]:
df_model_log = df_model.copy()

columns_to_transform = ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15', 'price']

for col in columns_to_transform:
    df_model_log[col] = np.log1p(df_model_log[col])
    
    

In [None]:

X = df_model_log.drop(columns=['price'])
Y = df_model_log['price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)


models = [LinearRegression(), KNeighborsRegressor(), RandomForestRegressor()]


# Collect one row of metrics per model (DataFrame.append was removed in pandas 2.0)
metrics_rows = []

for model in models:
    model.fit(X_train, Y_train)
    predictions = model.predict(X_test)

    mse = mean_squared_error(Y_test, predictions)
    mae = mean_absolute_error(Y_test, predictions)
    r2 = r2_score(Y_test, predictions)

    model_name = type(model).__name__
    metrics_rows.append({"Model": model_name, "r2": r2, "mse": mse, "mae": mae})

    # Actual vs predicted values (both on the log scale), with the fitted line in red
    plt.figure(figsize=(10, 7))
    sns.regplot(x=Y_test, y=predictions, line_kws={"color": "red"})
    plt.title(f'Regression Plot: Actual vs Predicted values for {model_name}')
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.show()

metrics_df = pd.DataFrame(metrics_rows, columns=["Model", "r2", "mse", "mae"])
metrics_df

    

The log transform of the features gives a slight improvement in accuracy. Let's check whether feature selection can improve the results further by reducing the multicollinearity detected in the correlation matrix.
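
Pairwise correlations are one way to spot multicollinearity. A complementary check, not used in this notebook, is the variance inflation factor (VIF), which equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix. The sketch below uses synthetic predictors whose names only mirror the dataset; the commonly cited VIF thresholds of 5 or 10 are rules of thumb, not hard limits.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: sqft_above is almost a rescaled copy of sqft_living, lat is independent.
rng = np.random.default_rng(0)
sqft_living = rng.normal(2000, 900, size=1000)
X_demo = pd.DataFrame({
    "sqft_living": sqft_living,
    "sqft_above": sqft_living * 0.8 + rng.normal(0, 100, size=1000),
    "lat": rng.normal(47.5, 0.1, size=1000),
})

# VIF_j is the j-th diagonal entry of the inverse of the predictors' correlation matrix.
corr = X_demo.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X_demo.columns)
print(vif.round(2))
```

Unlike a pairwise correlation threshold, VIF also flags a feature that is well explained by a combination of several other features.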

### Feature Selection

In [None]:
# Feature selection using correlation coefficients with price
df_model.corr()['price'].sort_values(ascending=False)

In [None]:
# Checking multicollinearity between features:
correlations_matrix = df_model.corr()
correlations_matrix = correlations_matrix[((correlations_matrix > .8) | (correlations_matrix < -.8))]
correlations_matrix.fillna(0)[:20]

In [None]:
correlations_matrix = correlations_matrix.fillna(0)
mask = np.zeros_like(correlations_matrix)
mask[np.triu_indices_from(mask)] = True
fig, ax = plt.subplots(figsize=(40, 30))
ax = sns.heatmap(correlations_matrix, mask=mask, annot=True)
plt.show()

In [None]:
# Remove multicollinearity. Long and lat can replace zipcode and district.
df_model = df_model.drop(['sqft_above','zipcode','house_age','yr_renovated','sqft_basement'],axis=1)

### Retraining the model

In [None]:
df_model_log = df_model.copy()

columns_to_transform = ['sqft_living', 'sqft_lot', 'sqft_living15', 'sqft_lot15', 'price']

for col in columns_to_transform:
    df_model_log[col] = np.log1p(df_model_log[col])

    
    
# X, Y split
X = df_model_log.drop(columns=['price'])
Y = df_model_log['price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)



#Model fitting

models = [LinearRegression(), KNeighborsRegressor(), RandomForestRegressor()]

# Collect one row of metrics per model (DataFrame.append was removed in pandas 2.0)
metrics_rows = []

for model in models:
    model.fit(X_train, Y_train)
    predictions = model.predict(X_test)

    mse = mean_squared_error(Y_test, predictions)
    mae = mean_absolute_error(Y_test, predictions)
    r2 = r2_score(Y_test, predictions)

    model_name = type(model).__name__
    metrics_rows.append({"Model": model_name, "r2": r2, "mse": mse, "mae": mae})

    # Actual vs predicted values (both on the log scale), with the fitted line in red
    plt.figure(figsize=(10, 7))
    sns.regplot(x=Y_test, y=predictions, line_kws={"color": "red"})
    plt.title(f'Regression Plot: Actual vs Predicted values for {model_name}')
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.show()

metrics_df = pd.DataFrame(metrics_rows, columns=["Model", "r2", "mse", "mae"])
metrics_df

In [None]:
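# Note: expm1 of a log-scale MAE is a typical relative error (a ratio minus 1), not an MAE in dollars
metrics_df['mae'] = np.expm1(metrics_df['mae'])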
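
To report an error in dollars for a model trained on log1p(price), one option is to invert the transform on both the true values and the predictions before computing the metric. The sketch below uses made-up prices and hypothetical variable names (`y_true_log`, `y_pred_log`) to contrast the two quantities; it is an illustration, not the notebook's actual results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted prices on the log1p scale (made-up values).
y_true_log = np.log1p(np.array([300_000.0, 450_000.0, 1_200_000.0]))
y_pred_log = np.log1p(np.array([330_000.0, 400_000.0, 1_000_000.0]))

# MAE computed on the log scale, then pushed through expm1: a ratio-like quantity.
mae_log = mean_absolute_error(y_true_log, y_pred_log)
relative_error = np.expm1(mae_log)

# MAE in dollars: invert the transform first, then compute the metric.
mae_dollars = mean_absolute_error(np.expm1(y_true_log), np.expm1(y_pred_log))

print(f"expm1(log-scale MAE): {relative_error:.3f}")
print(f"MAE in dollars:       {mae_dollars:,.0f}")
```

The first number reads roughly as "predictions are off by about this fraction", while the second is directly comparable with the MAE of models trained on untransformed prices.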


# 6. Cross-examination of models

In this data project, the effect of varying preprocessing techniques on the performance of three machine learning models - Linear Regression, KNeighborsRegressor, and RandomForestRegressor - was examined. Here's an analysis of the results obtained from each preprocessing stage:

1. **Initial Model (Post Basic Data Cleaning)**: Upon executing the models with basic cleaned data, RandomForestRegressor emerged as the superior model according to all three metrics (r2, mse, mae). LinearRegression produced satisfactory results, while the KNeighborsRegressor underperformed, likely due to the inherent sensitivity of this algorithm to high-dimensionality and non-normalized data.


2. **Model with Standard Scaling**: The incorporation of standard scaling enhanced the performance of all models. The most notable improvement was observed in the KNeighborsRegressor model, with substantial growth in all metrics. The reason for this improvement lies in the core functionality of the KNN algorithm, which computes the distance between instances and is consequently sensitive to feature scaling. There was also a marginal improvement in RandomForestRegressor, confirming its position as the top-performing model. Note that tree-based models and ordinary least squares are, in principle, insensitive to feature scaling, so small changes in those two models more likely reflect the unseeded random forest or numerical effects than the scaling itself.


3. **Model with Log Transformation**: The models showed significant improvements after implementing a log transformation on the house price and size. This transformation notably impacted the Linear Regression model, indicating the possibility of a multiplicative rather than additive relationship between price and features. Log transformation aids in mitigating the effects of outliers and allows the model to better understand exponential growth. However, we must bear in mind that the mse and mae metrics are now operating on a log scale, rendering them not directly comparable with prior values.


4. **Model with Log Transformation & Reduced Multicollinearity**: Multicollinearity reduction, which combats high correlation between predictors, generally augments model stability and interpretability. When implemented alongside log transformation, it resulted in a minor reduction in Linear Regression model performance, possibly due to loss of some information when highly correlated features were removed. However, the KNeighborsRegressor model displayed an uptick in performance, while RandomForestRegressor's performance was nearly unaltered. In spite of these modifications, RandomForestRegressor maintains its position as the best performing model.
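
The sensitivity of KNN to feature scaling described in stage 2 can be reproduced on synthetic data. The sketch below is not the notebook's experiment: it assumes one informative small-scale feature and one large-scale noise feature, and it uses a scikit-learn pipeline with 5-fold cross-validation rather than the single train/test split used above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative feature on a small scale, one pure-noise feature on a large scale.
rng = np.random.default_rng(0)
n = 400
informative = rng.normal(size=n)
noise = rng.normal(scale=1000.0, size=n)
X_demo = np.column_stack([informative, noise])
y_demo = 3.0 * informative + rng.normal(scale=0.1, size=n)

knn_raw = KNeighborsRegressor(n_neighbors=5)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))

# 5-fold cross-validated R^2; the scaler inside the pipeline is refit on each training fold.
r2_raw = cross_val_score(knn_raw, X_demo, y_demo, cv=5, scoring="r2").mean()
r2_scaled = cross_val_score(knn_scaled, X_demo, y_demo, cv=5, scoring="r2").mean()

print(f"KNN without scaling: R^2 = {r2_raw:.3f}")
print(f"KNN with scaling:    R^2 = {r2_scaled:.3f}")
```

Wrapping the scaler and the model in one pipeline also keeps the scaling statistics from leaking out of the training folds, and averaging over folds gives a steadier comparison than a single split.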

# 7. Feature Importance

In [None]:
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, Y_train)

importances = rf.feature_importances_
feature_names = X.columns

# Normalize the importance values and convert to percentage
importances = 100.0 * (importances / np.sum(importances))

# Filter importances and features based on the condition of importances > 0.1%
importances, feature_names = zip(*((importance, feature) for importance, feature in zip(importances, feature_names) if importance > 0.1))

importances = np.array(importances)
feature_names = np.array(feature_names)

indices = np.argsort(importances)[::-1]

plt.figure(figsize=(20, 13))
barplot = plt.bar(range(len(importances)), importances[indices], align="center")

plt.xticks(range(len(importances)), feature_names[indices], rotation=90)
plt.xlabel("Feature")
plt.ylabel("Importance (%)")
plt.title("Feature Importance")

# Adding the percentage labels on top of each bar
for rect in barplot:
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width()/2.0, height, f'{height:.1f}%', ha='center', va='bottom')

plt.show()


The features grade, lat, and sqft_living have the highest importance values according to the Random Forest model, suggesting they play a significant role in predicting the house prices. These variables likely have strong relationships with the house price:

- grade (33.94%): This probably refers to the grading given based on the King County grading system. Higher grade houses are usually of higher quality and have better finishes, hence more expensive.
- lat (29.45%): This refers to the latitude coordinate of the house. Location (north or south) can have a big impact on the price of a house. Certain latitudes may correspond to more desirable locations, such as being closer to city centers or better schools, thus driving up house prices.
- sqft_living (19.20%): This is the square footage of the home's interior living space. Larger houses typically cost more, so it's expected this feature is important.

- The rest of the features contribute less to the model, with importance below 6%, indicating that while they do have some effect on the price of the house, they are not as significant as the top three. However, it's important to remember that even features with lower importance can have significant effects in specific contexts or when interacting with other features. For example, waterfront may have a huge effect on price for houses in certain locations, even though its overall importance is low.
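
The percentages above are the random forest's impurity-based importances. As a cross-check, one could also compute permutation importance on held-out data, which measures how much the score drops when a single column is shuffled. The sketch below uses synthetic data with hypothetical feature names and coefficients; it is an alternative technique, not what the notebook ran.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: "grade" drives the target strongly, "sqft" weakly, "noise" not at all.
rng = np.random.default_rng(0)
n = 500
X_demo = np.column_stack([
    rng.integers(3, 14, size=n),      # grade
    rng.normal(2000, 900, size=n),    # sqft
    rng.normal(size=n),               # noise
])
feature_names = ["grade", "sqft", "noise"]
y_demo = 50_000 * X_demo[:, 0] + 20 * X_demo[:, 1] + rng.normal(0, 10_000, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
rf_demo = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the drop in R^2.
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=5, random_state=42)
for name, mean_drop in zip(feature_names, result.importances_mean):
    print(f"{name}: {mean_drop:.3f}")
```

If both methods rank the same features at the top, that adds some confidence to the ranking. A large disagreement would be a reason to look more closely at correlated features.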

# 8. Conclusion

- According to the Random Forest Regressor results, grade, location, and size are the main deciding factors for the price of a property.
- We still do not have specific information about the grading system, so we were not able to determine the specific standards that lead to a higher price. Beyond larger size, a better location (e.g. good schools, commercial centers, transportation) likely reflects external factors that are not present in the dataset.
- Thus, we would need more information, such as transportation, education, crime rates, and other social factors, to estimate the price properly.
