# Assignment - 6

Write a 500-word Medium post describing how you applied supervised learning to extract insights from a dataset. Using methods from module 6, you should:
1. construct a dataset of interest, ✅
2. identify the type of supervision problem (i.e., classification or regression), ✅
3. describe (and justify) your selection of the supervised learning model, and ✅
4. explain the supervised labels/scores you will use (i.e., where they come from and what they mean). ✅

After training and applying your model to your dataset, 
1. explain how you determined whether the model was performing well or not, ✅
2. select 5 samples that your learning model got wrong, and ✅
3. discuss why you think these samples were incorrectly labeled/scored. ✅


## Data: 
[Dataset](https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset?select=realtor-data.csv)

* status (Housing status - a. ready for sale or b. ready to build)
* bed (# of beds)
* bath (# of bathrooms)
* acre_lot (Property / Land size in acres)
* city (city name)
* state (state name)
* zip_code (postal code of the area)
* house_size (house area/size/living space in square feet)
* prev_sold_date (Previously sold date)
* price (Housing price: either the current listing price or, if the house was sold recently, the recently sold price)

In [285]:
import pandas as pd
import numpy as np 

# Load and inspect the data

In [286]:
# Load the dataset into a pandas dataframe

realtor_df = pd.read_csv("data/realtor-data.csv")
print(f"Rows x Columns: {realtor_df.shape}", '\n')
print(f"Variables: {realtor_df.columns}")

Rows x Columns: (100000, 10) 

Variables: Index(['status', 'bed', 'bath', 'acre_lot', 'city', 'state', 'zip_code',
       'house_size', 'prev_sold_date', 'price'],
      dtype='object')


In [287]:
realtor_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   status          100000 non-null  object 
 1   bed             75050 non-null   float64
 2   bath            75112 non-null   float64
 3   acre_lot        85987 non-null   float64
 4   city            99948 non-null   object 
 5   state           100000 non-null  object 
 6   zip_code        99805 non-null   float64
 7   house_size      75082 non-null   float64
 8   prev_sold_date  28745 non-null   object 
 9   price           100000 non-null  float64
dtypes: float64(6), object(4)
memory usage: 7.6+ MB
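
The `info()` output shows that several columns are incomplete (for example, `prev_sold_date` has only 28,745 non-null values out of 100,000). As a rough sketch, the share of missing values per column could be summarized as follows. The frame `toy_df` and its values are hypothetical stand-ins for `realtor_df`, used only to keep the example self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for realtor_df (values are illustrative only)
toy_df = pd.DataFrame(
    {
        "bed": [3.0, np.nan, 2.0, 4.0],
        "bath": [2.0, 1.0, np.nan, np.nan],
        "price": [105000.0, 80000.0, 67000.0, 145000.0],
    }
)

# Fraction of missing values in each column, largest first
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real data, the same expression would show at a glance which columns lose the most rows when `dropna` is called later.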


In [288]:
realtor_df.describe()

| | bed | bath | acre_lot | zip_code | house_size | price |
|---|---|---|---|---|---|---|
| count | 75050.0 | 75112.0 | 85987.0 | 99805.0 | 75082.0 | 100000.0 |
| mean | 3.701013 | 2.494595 | 13.613473 | 2132.003467 | 2180.082 | 438365.6 |
| std | 2.091372 | 1.573324 | 840.143878 | 2455.654774 | 5625.349 | 1015773.0 |
| min | 1.0 | 1.0 | 0.0 | 601.0 | 100.0 | 445.0 |
| 25% | 3.0 | 2.0 | 0.19 | 971.0 | 1200.0 | 125000.0 |
| 50% | 3.0 | 2.0 | 0.51 | 1225.0 | 1728.0 | 265000.0 |
| 75% | 4.0 | 3.0 | 2.0 | 1611.0 | 2582.0 | 474900.0 |
| max | 86.0 | 56.0 | 100000.0 | 99999.0 | 1450112.0 | 60000000.0 |


In [289]:
print(f"Number of states: {len(realtor_df['state'].unique())}")
print(f"Number of cities: {len(realtor_df['city'].unique())}")

Number of states: 12
Number of cities: 526


In [290]:
# drop the 'prev_sold_date' column
realtor_df.drop(columns=["prev_sold_date"], inplace=True)
realtor_df.head(3)

| | status | bed | bath | acre_lot | city | state | zip_code | house_size | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | for_sale | 3.0 | 2.0 | 0.12 | Adjuntas | Puerto Rico | 601.0 | 920.0 | 105000.0 |
| 1 | for_sale | 4.0 | 2.0 | 0.08 | Adjuntas | Puerto Rico | 601.0 | 1527.0 | 80000.0 |
| 2 | for_sale | 2.0 | 1.0 | 0.15 | Juana Diaz | Puerto Rico | 795.0 | 748.0 | 67000.0 |


In [291]:
# convert the 'status' column to dummy variables
realtor_df = pd.get_dummies(realtor_df, columns=["status"])
realtor_df.head(3)

| | bed | bath | acre_lot | city | state | zip_code | house_size | price | status_for_sale | status_ready_to_build |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 2.0 | 0.12 | Adjuntas | Puerto Rico | 601.0 | 920.0 | 105000.0 | 1 | 0 |
| 1 | 4.0 | 2.0 | 0.08 | Adjuntas | Puerto Rico | 601.0 | 1527.0 | 80000.0 | 1 | 0 |
| 2 | 2.0 | 1.0 | 0.15 | Juana Diaz | Puerto Rico | 795.0 | 748.0 | 67000.0 | 1 | 0 |


In [292]:
realtor_df.dropna(inplace=True)
realtor_df.shape

(60670, 10)

# Training the model

### Group the data by the `state` column, treating each state as a cluster

In [293]:
realtor_df.state.value_counts()

Massachusetts     31355
Puerto Rico       15390
Connecticut        8736
Rhode Island       2018
New Hampshire      1214
New York            923
Vermont             690
Virgin Islands      342
New Jersey            2
Name: state, dtype: int64
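
New Jersey has only two rows after cleaning, too few to fit and evaluate a separate model in any meaningful way. One possible way to handle this is to drop groups below a minimum size before training. The sketch below assumes a hypothetical threshold `min_rows` and a toy frame `toy_df`; neither comes from the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for realtor_df (values are illustrative only)
toy_df = pd.DataFrame(
    {
        "state": ["Vermont"] * 4 + ["New Jersey"] * 2,
        "price": [299000.0, 735000.0, 310000.0, 250000.0, 333490.0, 400000.0],
    }
)

min_rows = 3  # assumed threshold, not from the original notebook

# Keep only the states whose group has at least `min_rows` rows
filtered_df = toy_df.groupby("state").filter(lambda group: len(group) >= min_rows)
print(filtered_df["state"].value_counts())
```

The appropriate threshold is a judgment call; it should at least leave enough test samples per state for the evaluation metric to be defined.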

In [294]:
realtor_by_state = realtor_df.groupby(by="state")
realtor_by_state.get_group('Connecticut').head(3)

| | bed | bath | acre_lot | city | state | zip_code | house_size | price | status_for_sale | status_ready_to_build |
|---|---|---|---|---|---|---|---|---|---|---|
| 27820 | 3.0 | 1.0 | 3.93 | Willington | Connecticut | 6279.0 | 1572.0 | 225000.0 | 1 | 0 |
| 27821 | 4.0 | 3.0 | 2.34 | Coventry | Connecticut | 6238.0 | 3320.0 | 579900.0 | 1 | 0 |
| 27826 | 2.0 | 1.0 | 0.91 | East Windsor | Connecticut | 6016.0 | 960.0 | 215000.0 | 1 | 0 |


### Split each cluster into training and test sets, and train a linear regression model for each cluster

In [295]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split as tts 

In [661]:
states = realtor_by_state.groups.keys() # get the states in the dataset
cluster_model_data = dict() # create a dictionary to store the cluster data
for state in states: 
    this_state = cluster_model_data[state] = dict()
    X = realtor_by_state.get_group(state).drop(columns=["price", "state", "city"])
    y = realtor_by_state.get_group(state)["price"]
    # No random_state is passed, so the split and the resulting scores vary between runs
    X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2)

    # Store the training data
    train = this_state["train"] = dict()
    train["data"] = X_train
    train["target"] = y_train

    # Store the test data
    test = this_state["test"] = dict()
    test["data"] = X_test
    test["target"] = y_test

    # Train this cluster's model; positive=True constrains all coefficients to be non-negative
    estimator = this_state["estimator"] = LinearRegression(positive=True)
    estimator.fit(train["data"], train["target"])

    # Evaluate the model's performance
    this_state["y_pred"] = pd.Series(estimator.predict(test["data"]), index=test["target"].index)
    this_state["score"] = estimator.score(test["data"], test["target"])
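
For a regressor, `estimator.score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on hypothetical numbers (not taken from the dataset) to illustrate why the score can be negative: this happens whenever the predictions fit worse than always predicting the mean of the test targets.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions (illustrative values only)
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([300.0, 200.0, 100.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

With a single test sample, SS_tot is zero and R^2 is undefined, which is consistent with the missing score for New Jersey in the results table.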
    



### Visualize the performance of the cluster regressors in tabular format

In [662]:
# Store and print the performance results for each cluster's regressor
results_df = pd.DataFrame(
    columns=[
        "State", 
        "Training Data", 
        "Testing Data", 
        "Model Performance (R^2)", 
        "y_true", 
        "y_pred"
    ]
)
for i, state in enumerate(states): 
    c = cluster_model_data[state]
    data = [
        state, 
        c["train"]["data"].shape, 
        c["test"]["data"].shape, 
        round(c["score"], 2), 
        c["test"]["target"], 
        c["y_pred"]
    ]
    results_df.loc[i] = data
results_df

| | State | Training Data | Testing Data | Model Performance (R^2) | y_true | y_pred |
|---|---|---|---|---|---|---|
| 0 | Connecticut | (6988, 7) | (1748, 7) | -0.91 | 50530 899000.0 68180 219900.0 29069 2... | 50530 1.276544e+06 68180 2.833432e+04 29... |
| 1 | Massachusetts | (25084, 7) | (6271, 7) | 0.28 | 29061 314900.0 31326 219900.0 90679 ... | 29061 6.423592e+05 31326 1.725054e+05 90... |
| 2 | New Hampshire | (971, 7) | (243, 7) | 0.49 | 80284 425000.0 67154 259000.0 81304 4... | 80284 488421.280503 67154 430093.707168 ... |
| 3 | New Jersey | (1, 7) | (1, 7) | | 30126 333490.0 Name: price, dtype: float64 | 30126 333490.0 dtype: float64 |
| 4 | New York | (738, 7) | (185, 7) | -0.02 | 64678 450000.0 54556 299900.0 66075 ... | 64678 1.964774e+06 54556 3.461160e+05 66... |
| 5 | Puerto Rico | (12312, 7) | (3078, 7) | 0.25 | 3683 74000.0 5236 84900.0 26809 1... | 3683 236501.158881 5236 238335.158509 ... |
| 6 | Rhode Island | (1614, 7) | (404, 7) | 0.3 | 96503 395000.0 94936 560000.0 94537 7... | 96503 189422.119662 94936 576175.906700 ... |
| 7 | Vermont | (552, 7) | (138, 7) | 0.75 | 70839 299000.0 71527 735000.0 74182 3... | 70839 4.074395e+05 71527 4.931712e+05 74... |
| 8 | Virgin Islands | (273, 7) | (69, 7) | 0.39 | 14038 320000.0 11340 2000000.0 17698 ... | 14038 2.604795e+04 11340 2.197976e+06 17... |
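
R^2 alone can be hard to interpret for prices, because a few very expensive listings can dominate the squared errors. As a complementary check, one could also report absolute-error metrics in dollars. This is a minimal sketch, assuming hypothetical `y_true` and `y_pred` series shaped like those stored in `results_df`; the numbers are illustrative and not taken from the dataset.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, median_absolute_error

# Hypothetical values (illustrative only), shaped like the stored test targets
y_true = pd.Series([200000.0, 300000.0, 400000.0, 5000000.0])
y_pred = pd.Series([220000.0, 270000.0, 440000.0, 3000000.0])

mae = mean_absolute_error(y_true, y_pred)      # pulled up by the one large miss
medae = median_absolute_error(y_true, y_pred)  # robust to that outlier
print(f"MAE: {mae:,.0f}  MedAE: {medae:,.0f}")
```

A large gap between the mean and the median absolute error would suggest that a small number of listings account for most of the error.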


### Examining samples the model scored incorrectly

In [669]:
# Collect the true and predicted prices for Connecticut (row 0 of results_df)
# so that poorly predicted samples can be inspected
conneticut = pd.DataFrame(
    [
        results_df.loc[0, ["y_true"]].to_numpy()[0], 
        results_df.loc[0, ["y_pred"]].to_numpy()[0]
    ], 
).transpose()
conneticut.columns=['y_true', 'y_pred']

| | y_true | y_pred |
|---|---|---|
| 50530 | 899000.0 | 1.276544e+06 |
| 68180 | 219900.0 | 2.833432e+04 |
| 29069 | 280000.0 | 2.630619e+05 |
| 47242 | 239900.0 | 1.151828e+05 |
| 57456 | 329900.0 | 4.474563e+05 |
| ... | ... | ... |
| 32963 | 234000.0 | 1.412619e+05 |
| 58739 | 425000.0 | 3.657001e+05 |
| 40454 | 210000.0 | 4.998066e+04 |
| 59728 | 459000.0 | 7.255513e+05 |


In [671]:
# Show five samples whose prediction is off by more than 10% of the true price
conneticut.loc[abs((conneticut.y_true - conneticut.y_pred) / conneticut.y_true) > .1].head(5)

| | y_true | y_pred |
|---|---|---|
| 50530 | 899000.0 | 1276544.0 |
| 68180 | 219900.0 | 28334.32 |
| 47242 | 239900.0 | 115182.8 |
| 57456 | 329900.0 | 447456.3 |
| 84304 | 549900.0 | 606093.2 |
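
The filter above keeps the first five rows, in index order, whose relative error exceeds 10%. If the goal is instead to inspect the worst predictions, the rows can be ranked by relative error. The sketch below uses a small frame, here called `errors_df` (a hypothetical name), built from rows shown in the output above.

```python
import pandas as pd

# Small frame shaped like `conneticut`, using rows from the output above
errors_df = pd.DataFrame(
    {
        "y_true": [899000.0, 219900.0, 280000.0, 239900.0, 329900.0],
        "y_pred": [1276544.0, 28334.32, 263061.9, 115182.8, 447456.3],
    },
    index=[50530, 68180, 29069, 47242, 57456],
)

# Relative error of each prediction
errors_df["rel_error"] = ((errors_df.y_true - errors_df.y_pred) / errors_df.y_true).abs()

# The three worst predictions, largest relative error first
worst = errors_df.nlargest(3, "rel_error")
print(worst)
```

Joining the selected indices back to the feature columns would then show which property attributes accompany the largest misses.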
