# Assignment - 6

Write a 500-word Medium post describing how you applied supervised learning to extract insights from a dataset. Using methods from module 6, you should:
1. construct a dataset of interest,
2. identify the type of supervised learning problem (i.e., classification or regression),
3. describe (and justify) your selection of the supervised learning model, and
4. explain the supervised labels/scores you will use (i.e., where they come from and what they mean).

After training and applying your model to your dataset, 
1. explain how you determined whether the model was performing well or not, 
2. select 5 samples that your learning model got wrong, and 
3. discuss why you think these samples were incorrectly labeled/scored. 


## Data: 
[Dataset](https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset?select=realtor-data.csv)

* status (Housing status - a. ready for sale or b. ready to build)
* bed (# of beds)
* bath (# of bathrooms)
* acre_lot (Property / Land size in acres)
* city (city name)
* state (state name)
* zip_code (postal code of the area)
* house_size (house area/size/living space in square feet)
* prev_sold_date (Previously sold date)
* price (housing price: the current listing price, or the most recent sale price if the house was sold recently)

In [45]:
import pandas as pd
import numpy as np 

# load and inspect the data

In [65]:
# Load the dataset into a pandas dataframe

realtor_df = pd.read_csv("data/realtor-data.csv")
print(f"Rows x Columns: {realtor_df.shape}", '\n')
print(f"Variables: {realtor_df.columns}")

Rows x Columns: (100000, 10) 

Variables: Index(['status', 'bed', 'bath', 'acre_lot', 'city', 'state', 'zip_code',
       'house_size', 'prev_sold_date', 'price'],
      dtype='object')


In [47]:
realtor_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   status          100000 non-null  object 
 1   bed             75050 non-null   float64
 2   bath            75112 non-null   float64
 3   acre_lot        85987 non-null   float64
 4   city            99948 non-null   object 
 5   state           100000 non-null  object 
 6   zip_code        99805 non-null   float64
 7   house_size      75082 non-null   float64
 8   prev_sold_date  28745 non-null   object 
 9   price           100000 non-null  float64
dtypes: float64(6), object(4)
memory usage: 7.6+ MB
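
The non-null counts above show that several columns are only partly populated; `prev_sold_date`, for example, is present in 28,745 of the 100,000 rows. A minimal sketch of how the missing share per column could be tabulated follows. The frame `toy_df` and its values are made up for illustration and are not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for realtor_df; the values are made up.
toy_df = pd.DataFrame({
    "bed": [3.0, np.nan, 2.0, 4.0],
    "bath": [2.0, 1.0, np.nan, np.nan],
    "price": [105000.0, 80000.0, 67000.0, 145000.0],
})

# isna() flags missing cells; the column-wise mean of those flags
# is the fraction of missing values in each column.
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two lines on `realtor_df` would show which columns lose the most rows when `dropna()` is applied later.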


In [48]:
realtor_df.describe()

| | bed | bath | acre_lot | zip_code | house_size | price |
|---|---|---|---|---|---|---|
| count | 75050.0 | 75112.0 | 85987.0 | 99805.0 | 75082.0 | 100000.0 |
| mean | 3.701013 | 2.494595 | 13.613473 | 2132.003467 | 2180.082 | 438365.6 |
| std | 2.091372 | 1.573324 | 840.143878 | 2455.654774 | 5625.349 | 1015773.0 |
| min | 1.0 | 1.0 | 0.0 | 601.0 | 100.0 | 445.0 |
| 25% | 3.0 | 2.0 | 0.19 | 971.0 | 1200.0 | 125000.0 |
| 50% | 3.0 | 2.0 | 0.51 | 1225.0 | 1728.0 | 265000.0 |
| 75% | 4.0 | 3.0 | 2.0 | 1611.0 | 2582.0 | 474900.0 |
| max | 86.0 | 56.0 | 100000.0 | 99999.0 | 1450112.0 | 60000000.0 |
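
The summary shows maxima far above the 75th percentile (86 beds, 100,000 acres, a price of 60,000,000), which suggests a few extreme rows that could dominate a linear fit. One possible way to handle them, sketched under assumptions, is a quantile-based filter. The helper `trim_outliers`, the cutoff quantile, and the toy values are all hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd

def trim_outliers(df, columns, upper_q=0.99):
    """Keep rows whose values in `columns` are at or below the upper_q quantile.

    Rows with a missing value in a column are kept for that column's check.
    """
    mask = pd.Series(True, index=df.index)
    for col in columns:
        cutoff = df[col].quantile(upper_q)  # quantile() ignores NaN
        mask &= df[col].isna() | (df[col] <= cutoff)
    return df[mask]

# Made-up prices with one extreme value.
toy_df = pd.DataFrame({"price": [100.0, 200.0, 300.0, 400.0, 10_000_000.0]})
trimmed = trim_outliers(toy_df, ["price"], upper_q=0.8)
print(trimmed)
```

Whether trimming is appropriate depends on the question being asked; luxury listings may be valid data rather than errors.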


In [62]:
print(f"Number of states: {len(realtor_df['state'].unique())}")
print(f"Number of cities: {len(realtor_df['city'].unique())}")

Number of states: 12
Number of cities: 526


In [50]:
# drop the 'prev_sold_date' column
realtor_df.drop(columns=["prev_sold_date"], inplace=True)
realtor_df.head(3)

| | status | bed | bath | acre_lot | city | state | zip_code | house_size | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | for_sale | 3.0 | 2.0 | 0.12 | Adjuntas | Puerto Rico | 601.0 | 920.0 | 105000.0 |
| 1 | for_sale | 4.0 | 2.0 | 0.08 | Adjuntas | Puerto Rico | 601.0 | 1527.0 | 80000.0 |
| 2 | for_sale | 2.0 | 1.0 | 0.15 | Juana Diaz | Puerto Rico | 795.0 | 748.0 | 67000.0 |


In [51]:
# convert the 'status' column to dummy variables
realtor_df = pd.get_dummies(realtor_df, columns=["status"])
realtor_df.head(3)

| | bed | bath | acre_lot | city | state | zip_code | house_size | price | status_for_sale | status_ready_to_build |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 2.0 | 0.12 | Adjuntas | Puerto Rico | 601.0 | 920.0 | 105000.0 | 1 | 0 |
| 1 | 4.0 | 2.0 | 0.08 | Adjuntas | Puerto Rico | 601.0 | 1527.0 | 80000.0 | 1 | 0 |
| 2 | 2.0 | 1.0 | 0.15 | Juana Diaz | Puerto Rico | 795.0 | 748.0 | 67000.0 | 1 | 0 |


# Training the model

In [52]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split as tts

In [57]:
# drop rows with missing values once, then split into features and target
clean_df = realtor_df.dropna()
X = clean_df.drop(columns=["price"])
y = clean_df["price"]
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2, random_state=42)

In [60]:
model = LinearRegression().fit(X_train, y_train)
model.get_params()

ValueError: could not convert string to float: 'Barrington'
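
The `ValueError` arises because the feature matrix still contains the string-valued `city` and `state` columns ('Barrington' is a city name), and `LinearRegression` accepts only numeric input. A minimal sketch of one way to proceed is shown below, assuming that `city` is dropped because of its many distinct values and that `state` is one-hot encoded; both are illustrative choices, not the author's stated method. It also shows how held-out performance and the five worst predictions asked for in the assignment could be obtained. The frame `toy_df` is synthetic, and its values and the "Other State" label are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned realtor_df; all values are made up.
rng = np.random.default_rng(42)
n = 200
toy_df = pd.DataFrame({
    "bed": rng.integers(1, 6, size=n).astype(float),
    "bath": rng.integers(1, 4, size=n).astype(float),
    "house_size": rng.uniform(600, 4000, size=n),
    "city": rng.choice(["Barrington", "Adjuntas", "Juana Diaz"], size=n),
    "state": rng.choice(["Puerto Rico", "Other State"], size=n),
})
toy_df["price"] = (
    150 * toy_df["house_size"]
    + 20_000 * toy_df["bed"]
    + rng.normal(0, 25_000, size=n)
)

# Assumption: drop 'city' and one-hot encode 'state' so that
# every feature passed to the model is numeric.
features = pd.get_dummies(
    toy_df.drop(columns=["price", "city"]), columns=["state"], dtype=float
)
target = toy_df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Evaluate on the held-out split.
y_pred = model.predict(X_test)
print(f"R^2: {r2_score(y_test, y_pred):.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):,.0f}")

# Rank test samples by absolute error and keep the five worst.
errors = pd.DataFrame({"actual": y_test, "predicted": y_pred}, index=y_test.index)
errors["abs_error"] = (errors["actual"] - errors["predicted"]).abs()
worst_five = errors.nlargest(5, "abs_error")
print(worst_five)
```

The index of `worst_five` points back to the original rows, so each mispredicted listing can be inspected alongside its features when discussing why the model got it wrong.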