<a href="https://colab.research.google.com/github/kreatorkat2004/kreatorkat2004/blob/main/comp341_hw4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## COMP 341: Practical Machine Learning
## Homework Assignment 4: Predicting List Prices for Houston Homes
### Due: Thursday, October 23 at 11:59pm on Gradescope

Using the regression-based methods discussed in class, we will predict the list price of homes in the Houston area given attributes in existing listings on [Redfin](https://www.redfin.com) (an alternative to Zillow). Successful machine learning algorithms here can be used to assist real estate agents and prospective sellers in pricing their properties.



As always, fill in missing code following `# TODO:` comments or `####### YOUR CODE HERE ########` blocks and be sure to answer the short answer questions marked with `[WRITE YOUR ANSWER HERE]` in the text.

All code in this notebook will be run sequentially, so make sure things work in order! Be sure to also use good coding practices (e.g., logical variable names, comments as needed, etc.), and make plots that are clear and legible.

For this assignment, there will be **15 points** allocated for general coding and formatting points:
* **5 points** for coding style
* **5 points** for code flow (accurate results when everything is run sequentially)
* **5 points** for additional style guidelines listed below

Additional style guidelines:
* Make sure to rename your .ipynb file to include your netid in the file name: `netid-hw4.ipynb`
* For any TODO cell, make sure to include that cell's output in the .ipynb file that you submit. Many text editors have an option to clear cell outputs which is useful for getting a blank slate and running everything beginning-to-end, but always be sure to run the notebook before submitting and ensure that every cell has an output.
* When displaying DataFrames, please do not include `.head()` or `.tail()` calls unless asked to. Displaying the DataFrame without these calls lets us see both its beginning and end, which helps us ensure data is processed properly. Notebooks show only the beginning and end by default, so you don't have to worry about long outputs here.
* If column names are specified in the question, please use the specified name, and please avoid any sorting not specified in the instructions.
* For plots, please ensure you have included axis labels, legends, and titles.
* To format your short answer responses nicely, we recommend either **bolding** or *italicizing* your answer, or formatting it ```as a code block```.
* Generally, please keep your notebook cells to one solution per cell, and preserve the order of the questions asked.
* Finally, this can be harder to check/control and depends on which plotting libraries you prefer, but it would be helpful to limit the size/resolution of plot images in the notebook. Our grading platform has an upper limit on submission sizes it can display, and high-res plots are the usual culprit when submissions are hidden or truncated.

### Part 0: Setup
First, we need to import some libraries that are necessary to complete the assignment.

In [None]:
import pandas as pd
import numpy as np

Add additional modules/libraries to import here (rather than wherever you first use them below):

In [None]:
# additional modules/libraries to import


We provide some code to get the data files for this assignment into your workspace below. You only need to do the following steps of placing the homework files in your Google Drive once:
1. Go to 'My Drive' in your own Google Drive
2. Make a new folder named `comp341`
3. For the [training data](https://drive.google.com/file/d/19dNolYwr6fQ_okUwWNZNCoWYkUNkCL5m/view?usp=share_link) as well as the [test data](https://drive.google.com/file/d/1dTqCj7B6t2HcwXLVW21BCzX4SE2CzpJ-/view?usp=share_link), you can expand the menu options and select `Organize -> Add shortcut`, then find and select your `comp341` folder. This is a convenient alternative to having to download and re-upload the files to your own drive. You should now have shortcuts to the `houston_homes.csv` as well as a `houston_homes_test.csv` file in your folder.

If you run into trouble with accessing the files from the shortcut, then:

4. Download the [training data](https://drive.google.com/file/d/19dNolYwr6fQ_okUwWNZNCoWYkUNkCL5m/view?usp=share_link) as well as the [test data](https://drive.google.com/file/d/1dTqCj7B6t2HcwXLVW21BCzX4SE2CzpJ-/view?usp=share_link). You should now have a file entitled `houston_homes.csv` as well as a `houston_homes_test.csv` on your computer.
5. In the `comp341` folder you created in step 2, click `New -> File Upload` and select the two csv files from your computer.

Now, we will mount your local Google Drive in colab so that you can read the file in (you will need to do this each time your runtime restarts).

In [None]:
# note that this command will trigger a request from google to allow colab
# to access your files: you will need to accept the terms in order to access
# the files this way
from google.colab import drive
drive.mount('/content/drive')

# if you followed the instructions above exactly, houston_homes.csv and
# houston_homes_test.csv should be in comp341/; if your files are in a different
# directory on your Google Drive, you will need to change the path below accordingly
DATADIR = '/content/drive/My Drive/comp341/'

Now that your Google Drive is mounted, you can read the data in `houston_homes.csv` into a pandas DataFrame:

In [None]:
df = pd.read_csv(DATADIR + "houston_homes.csv")

We have already held out a portion of the full dataset to be your test set (`houston_homes_test.csv`). In fact, the `list_price` in this test set is hidden so you will not know the true list prices of these homes. You will not need to use this data until Part 4 of the homework assignment, when you use your favorite models to make predictions for this set of homes and submit them to [our Kaggle competition](https://www.kaggle.com/t/8ba216798d9f42c9bb4aa270dab061fd).

**Important note:** while the performance of your predictions will not be graded in this assignment, a portion of your grade will be based on whether you submitted predictions that can be evaluated by Kaggle (i.e., in the right format and passing basic checks)! As such, make sure to set up an account and try submitting predictions earlier rather than later.

For Parts 1-3, you will only use the data in `houston_homes.csv`. In Parts 1 and 2, you will partition this data into a training and validation set so that we can evaluate how well our models perform for model selection. In Part 3, you will use cross-validation to dive deeper for one of the models.
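As a rough illustration of what an 80/20 partition looks like, here is a minimal sketch on a made-up DataFrame. The `toy` data and its column names are hypothetical, and `DataFrame.sample` is only one of several ways to split (scikit-learn's `train_test_split` is another).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real listings; the columns and values are made up.
toy = pd.DataFrame({
    "sqft": np.arange(1000, 2000, 100),
    "list_price": np.arange(200_000, 300_000, 10_000),
})

# Randomly sample 80% of the rows for training.
# A fixed random_state makes the split repeatable across runs.
toy_train = toy.sample(frac=0.8, random_state=0)

# Every row that was not sampled goes to the validation set.
toy_val = toy.drop(toy_train.index)

print(len(toy_train), "training rows,", len(toy_val), "validation rows")
```

Fixing the seed matters here because the notebook is run sequentially from the top, and a different split on each run would change every downstream result.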

In [None]:
# TODO: divide your data into a training and validation set using an 80/20 split [2 pts]


### Part 1: Data Exploration [29 pts]
As always, it is important to dive into the data to see what is going on. Let's start with the typical check for missing values.

In [None]:
# TODO: check for missing values in the training data, make a table of NaN counts per feature
# include only features with non-zero missing value counts in your table [2 pts]


**Short Answer Question:** Do you think that the missing values are going to be problematic for predicting list price? Why or why not? [2 pts]

`[WRITE YOUR ANSWER HERE]`

We know that housing markets can be vastly different by region. This data is isolated to the Houston housing market, but even within Houston, there can be fluctuation between different neighborhoods. Let's take a deeper look at the region diversity in the data.

In [None]:
# TODO: calculate how many houses are on the market per zipcode in your training data [1 pt]


In [None]:
# TODO: for the 5 zipcodes with the most houses and
# the 5 zipcodes with the least houses for sale (10 zipcodes total)
# plot the number of houses per zipcode in the training data
# sort the plot by the total number of houses per zipcode so that it is
# easy to see the zipcodes with the most/least houses for sale [3 pts]


In [None]:
# TODO: using violin plots, plot the distribution of list prices
# for the same 10 zipcodes you plotted earlier, in the same order
# as your previous plot sorted by the total number of houses for sale [3 pts]


**Short Answer Question:** Based on the plots you generated above, do you think differences in zipcode distribution will affect models that predict list prices? Explain. [2 pts]

`[WRITE YOUR ANSWER HERE]`

**Short Answer Question:** Where are the most expensive homes located? [1 pt]

`[WRITE YOUR ANSWER HERE]`

Latitude and longitude can give an even more detailed view of what places are for sale across Houston. Let's map out where these properties are for sale and their list price.

In [None]:
# TODO: Plot each property in the training data by the latitude and longitude,
# coloring points by their list price (you may have to change the color scale to see differences) [2 pts]


**Short Answer Question:** Are latitude and longitude more, less, or similarly informative compared to zipcode for determining list price? Explain. [2 pts]

`[WRITE YOUR ANSWER HERE]`

Before we make our models, let's check to see if our features are related to each other and how closely they relate to the variable that we want to predict, `list_price`. One quick way to check for these relationships is to use correlation.

In [None]:
# TODO: calculate the correlation between numeric features in the training data
# and display the results [2 pts]


**Short Answer Question:** Based on the correlations, are there features you might consider removing before linear regression? Any that you think might be helpful? Explain. [2 pts]

`[WRITE YOUR ANSWER HERE]`

Now that we have done some exploration of our data (we can always do more - feel free to do more if you wish!), we should think about how we will define "success" for this problem. One helpful way is to come up with a baseline heuristic that we should exceed with regression models.
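Two error metrics come up repeatedly in this assignment: RMSE and RMSLE. The sketch below shows one common way to compute them with NumPy. The function names are our own, and the `log(1 + x)` form of RMSLE is an assumption (it matches the usual convention, but check the definition used in class).

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes absolute dollar differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmsle(y_true, y_pred):
    """Root mean squared log error: penalizes relative (ratio) differences.

    Assumes all values are non-negative, since log1p(x) = log(1 + x).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)))

# Both toy predictions are off by $50,000, so they contribute equally to RMSE.
print(rmse([100_000, 1_000_000], [150_000, 1_050_000]))

# Under RMSLE, the same $50,000 miss counts for more on the cheaper home.
print(rmsle([100_000], [150_000]), rmsle([1_000_000], [1_050_000]))
```

The choice between the two matters for housing data: RMSE is dominated by the most expensive homes, while RMSLE treats a 10% miss on a cheap home and on an expensive home similarly.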

**Short Answer Question:** If our goal is to reduce the error in our price predictions, design a baseline heuristic for this problem. Explain your rationale. [3 pts]

`[WRITE YOUR ANSWER HERE]`

In [None]:
# TODO: make "predictions" on your validation set using your baseline heuristic [2 pts]


In [None]:
# TODO: what is the RMSE and RMSLE of your baseline heuristic model? [2 pts]


### Part 2: Regression Models + Evaluation [27 pts]
Now that we have done some of the initial data exploration and chosen at least one baseline heuristic, we are ready to build some different regression models to predict list price. In this section, we will try three different models.

This time, we will not dictate which particular features you should use for downstream analysis. Looking at the data and the features you calculated the correlations for, you should decide which features might be useful for your model, realizing that you might not want to include all of them. You may also need to transform / preprocess some of them before using them in the various models. Whatever you do to your training data, be sure to also do the same to the validation set (while minimizing potential data leakage!).
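To make the data leakage point concrete, here is a minimal sketch with made-up numbers: any statistic used for imputation or scaling is computed from the training set only, then reused unchanged on the validation set. The arrays and the choice of median imputation are illustrative assumptions, not a recommended recipe.

```python
import numpy as np

# Made-up square footage values with some missing entries.
train_sqft = np.array([1200.0, 1500.0, np.nan, 2100.0])
val_sqft = np.array([np.nan, 1800.0])

# Learn the fill value from the training data only.
fill_value = np.nanmedian(train_sqft)

# Apply the same fill value to both sets.
train_filled = np.where(np.isnan(train_sqft), fill_value, train_sqft)
val_filled = np.where(np.isnan(val_sqft), fill_value, val_sqft)

# Learn the scaling parameters from the training data only.
train_mean = train_filled.mean()
train_std = train_filled.std()

# Standardize both sets with the training mean and standard deviation.
train_scaled = (train_filled - train_mean) / train_std
val_scaled = (val_filled - train_mean) / train_std

print(fill_value, train_scaled, val_scaled)
```

Computing the median or mean over the combined data would let information from the validation rows influence the model, which tends to make validation error look better than it should.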

In [None]:
# TODO: use explanatory features of your choice to build a linear regression model on the training data
# evaluate your model on the validation set by calculating RMSE
# (hint: think carefully about your choice of features, if you need to scale, impute, etc) [7 pts]


**Short Answer Question:** Which columns did you choose to omit from your feature set? Why did you exclude these columns? [3 pts]

`[WRITE YOUR ANSWER HERE]`

**Short Answer Question:** Does the linear regression model outperform the baseline heuristic you chose earlier? What does this comparison tell you? [2 pts]

`[WRITE YOUR ANSWER HERE]`

In [None]:
# TODO: use explanatory features of your choice to build a lasso regression model on the training data
# this time, evaluate your model on the validation set data by calculating RMSLE [2 pts]


Recall that Lasso regression has a hyperparameter alpha. Tuning alpha might increase the performance of our model versus using the default values. Below we examine the effect of this hyperparameter on the training and validation error.
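As a hedged illustration of what alpha does, the sketch below fits scikit-learn's `Lasso` on synthetic data (not the housing data) and counts how many coefficients remain non-zero as alpha grows. The data, the alpha values, and the variable names are all made up for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of five features affect the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

toy_alphas = [0.01, 0.1, 1.0, 100.0]
nonzero_counts = []
for alpha in toy_alphas:
    # Larger alpha means a stronger L1 penalty on the coefficients.
    model = Lasso(alpha=alpha).fit(X_toy, y_toy)
    nonzero_counts.append(int(np.count_nonzero(model.coef_)))
    print(f"alpha={alpha:>6}: {nonzero_counts[-1]} non-zero coefficients")
```

A very large alpha shrinks every coefficient to zero, leaving only the intercept. Note also that the scikit-learn documentation advises against using `alpha=0` with `Lasso` for numerical reasons, so expect a warning if your grid includes it.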

In [None]:
# TODO: we provide an initial set of alphas to explore (with variable increments from 0 to 19,000)
# plot both the training RMSLE and the validation RMSLE for a lasso regression model as alpha changes
# note that depending on what this initial set of alphas show you, you may want to focus on a smaller range
# or expand to look at even larger alphas [5 pts]
alphas = np.concatenate([np.arange(0,100,5), np.arange(100,1000,100), np.arange(1000,20000,1000)])


**Short Answer Question:** Using the plot above as a guide, determine if there is an optimal alpha (or small range of alphas) for this task. Explain. [2 pts]

`[WRITE YOUR ANSWER HERE]`

In [None]:
# TODO: using SVR, explore the RMSLE of these 3 kernels: 'linear', 'poly', and 'rbf' with C=100
# output a table with the train and validation RMSLE for these 3 kernels
# note: though there are only 3 variants, you should still avoid repeating code
# (i.e., use a for loop or other iterable) [4 pts]


**Short Answer Question:** Which SVR kernel has the best performance? Give some intuition as to why you think it outperformed the other kernel choices. [2 pts]

`[WRITE YOUR ANSWER HERE]`

### Part 3: Cross-validation [22 pts]
So far, we have run all of our analysis on a single 80/20 train-validation split and chosen our parameters based on performance on that split, but results tuned to one particular partition might not generalize well. Instead of tuning our hyperparameters on this specific partition, let's try using cross-validation.
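To show what k-fold cross-validation does to the data, the sketch below uses scikit-learn's `KFold` on ten made-up rows. It only prints the index partitions and fits no model. The array, seed, and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up rows; only their positions matter for this illustration.
X_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

held_out = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy), start=1):
    # In each fold, 8 rows are used for fitting and 2 are held out.
    held_out.extend(val_idx.tolist())
    print(f"fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")

# Across the 5 folds, every row is held out exactly once.
print(sorted(held_out))
```

Because each row serves as validation data exactly once, the averaged error depends less on which rows happened to land in a single validation split.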

The next few questions will use decision tree regressor models and explore the effect of max_depth on validation error.


In [None]:
# TODO: Calculate the mean RMSLE using 5-fold cross-validation (CV) for a decision
# tree regressor model with max_depth from 1-50 (in increments of 1).
# Save these values for plotting later. [8 pts]


In [None]:
# TODO: Build a decision tree regressor model with max_depth from 1-50 (in increments of 1)
# on the training data without using CV.
# Calculate the RMSLE for the training set and the validation set that you have been using
# in the earlier parts of the assignment. Save these values for plotting later. [5 pts]


In [None]:
# TODO: In the same figure, plot the mean RMSLE from CV, as well as training RMSLE and validation RMSLE from above
# as max tree depth varies (1 to 50). [5 pts]


**Short Answer Question:** Based on these RMSLEs, which max_depth parameter is optimal? Explain. [2 pts]

`[WRITE YOUR ANSWER HERE]`

**Short Answer Question:** Was cross-validation helpful in choosing the optimal max depth parameter? Why or why not? [2 pts]

`[WRITE YOUR ANSWER HERE]`

### Part 4: Compete! (5 pts, with up to 10 extra credit pts)
Now, we can put our favorite models (or try additional variations!) to the test using the data that we haven't looked at in any part of the assignment (`houston_homes_test.csv`).

Here, we will be using [Kaggle](https://www.kaggle.com/t/8ba216798d9f42c9bb4aa270dab061fd). If you do not already have a Kaggle account, you will need to make one.

There are additional details on the Kaggle site, but some particularly important notes:
* You are free to choose any team name (the name that will show up on the Kaggle leaderboard) as long as it is not inappropriate or offensive; however, in order to receive credit, you **must** specify your `team name` in your notebook here. If you do not, there is no way for us to assign you credit!
* Kaggle lists the close date as several days after the homework's due date. This is because Kaggle does not support late submissions. The homework and your submission on Kaggle are due by the due date listed here, but you may use late days and turn it in late (i.e., if you submit Kaggle predictions after the due date, it will automatically count towards your late days even if you have turned in your notebook already).
* This portion of the assignment **must** be completed independently. You cannot share prediction code or predictions with each other. In fact, you must put the exact code you use for your final predictions below. Violations will result in point deductions.
* Relatedly, you cannot modify your prediction files manually. Violations will result in point deductions.
* You can only use regression models that we have discussed in class (though you can feel free to preprocess your data / tune any of the parameters in the models however you like)!

For this homework, you will simply be graded for completion of a successful submission to Kaggle, and not on performance! But for fun, we have included several benchmarks on the Kaggle leaderboard, based on either simple heuristics or models.

You can receive extra credit points for exceeding these benchmarks on the private leaderboard:
* 1 pt for passing the `base-benchmark`
* 1 pt for passing the `easy-benchmark`
* 1 pt for passing the `medium-benchmark`
* 1 pt for passing the `hard-benchmark`

And additional points for doing well on the private leaderboard that will be revealed after the late due date (if there are ties, everyone tied will receive the same number of points):
* 6 pts for 1st place
* 4 pts for 2nd place
* 2 pts for 3rd place


**Kaggle team name:** `[fill in here]`

Now, we will finally read in the test dataset.

In [None]:
df_test = pd.read_csv(DATADIR + "houston_homes_test.csv")

In [None]:
# TODO: put all code needed (including preprocessing steps) to make your
# final kaggle submission; note that this code must match the predictions
# that you provide on kaggle


You can see details about the file format for submission on Kaggle (`sample_submission.csv`, essentially a 2-column file with `houseid`, the unique identifier in your test set, and `list_price`, your predictions). To make things easier, we provide some sample code here that you can modify to make your own submission file, assuming your predictions are in a variable called `y_pred_kagg`.

In [None]:
results = pd.Series(y_pred_kagg.flatten(), name="list_price")
results = pd.concat([df_test['houseid'], results], axis=1)
results.to_csv('my_submission.csv', index=False)

Once you output your csv file, you need to download the file from colab to your local computer (you can click the file folder icon on the left panel to see the files in your workspace) and upload that file to the Kaggle site as your submission. Note that you can submit multiple times (up to 5 times a day)!
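Before uploading, it can help to read the file back and run a few sanity checks. The sketch below does this on a tiny made-up submission written to a temporary directory. The names `toy_test` and `toy_pred` are hypothetical, and the specific checks (including the positive-price check) are our own assumptions rather than Kaggle's actual validation rules.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up stand-ins for the test set ids and model predictions.
toy_test = pd.DataFrame({"houseid": [101, 102, 103]})
toy_pred = np.array([[250000.0], [410000.0], [325000.0]])

# Build the two-column submission table.
submission = pd.DataFrame({
    "houseid": toy_test["houseid"],
    "list_price": toy_pred.flatten(),
})

# Write the file, then read it back the way a grader would see it.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_submission.csv")
    submission.to_csv(path, index=False)
    check = pd.read_csv(path)

# Basic checks: expected columns, one row per house, no missing values.
assert list(check.columns) == ["houseid", "list_price"]
assert len(check) == len(toy_test)
assert check["list_price"].notna().all()
# Assumption: a list price should always be positive.
assert (check["list_price"] > 0).all()
print(check)
```

Writing with `index=False` keeps the DataFrame index from appearing as an extra unnamed column in the file.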

## To Submit
Download the notebook from Colab as a `.ipynb` notebook (`File > Download > Download .ipynb`) and upload it to the corresponding Gradescope assignment. Your assignment should be named `netid-hw4.ipynb`.  

Also, double check that your Kaggle submission shows up on the public leaderboard.