# Exercise 4: Import data, clean data, and make predictions using Python/SciKit-Learn in a Jupyter notebook

In this exercise, you will import data from the listings.csv file, clean the data, and then build a model to predict the price of a rental property. You should perform this entire exercise in the Jupyter notebook you opened at the end of Exercise 3. Remember that to run the code in a code cell, you must select the cell, and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.
![Run cell screenshot](iPY_run_annotated.png)

## Step 1: Read data into your notebook
Your first step in this lab will be to import your raw data. However, in order to do that, you will need to set up the appropriate tools.

### Import libraries
First, you will import the various Python libraries that you will need to complete this lab. Foremost among these is pandas, which will provide the **dataframe** structure that you will use to import and manipulate the listings data for analysis. Other important libraries include numpy for scientific computation and scikit-learn (sklearn), which provides the actual ML tools you will use.

**Note:** To run this code snippet, click on the notebook cell holding the code and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter. <i>This step might take several seconds to run.</i> **You can safely disregard any deprecation warnings in this lab.**

In [None]:
import pandas as pd
import numpy as np
import math
from sklearn import ensemble
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
import sklearn.metrics as metrics
from sklearn.metrics import mean_squared_error
from scipy.spatial import distance

### Import data from CSV
The listings you will analyze in this lab are stored in a CSV file, listings.csv. This file should be available in the local /BnB directory. If it is not available, complete Exercise 3 before proceeding.

Once the file listings.csv is in the appropriate working directory (/BnB), you will create a pandas dataframe to hold the data and specify the columns of data you want to import.
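
As a minimal sketch of how `usecols` narrows an import, the snippet below reads a tiny in-memory CSV rather than listings.csv. The sample rows, the `id` and `name` columns, and the variable names are invented for illustration; only the `pd.read_csv` call and its `usecols` parameter mirror what the lab uses.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for listings.csv (all values are made up).
sample_csv = io.StringIO(
    "id,name,price,accommodates\n"
    "1,Cozy loft,$85.00,2\n"
    "2,Garden suite,$120.00,4\n"
    "3,Studio,$60.00,1\n"
)

# usecols keeps only the named columns; the others are skipped during parsing.
sample_df = pd.read_csv(sample_csv, usecols=["price", "accommodates"])

print(sample_df.shape)          # (rows, columns)
print(list(sample_df.columns))
```

Limiting the import to the columns you plan to analyze keeps the dataframe small and makes later steps, such as dropping rows with missing values, affect only the data you care about.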

#### <font color=blue>Code Exercise 1.1</font>
Follow the instructions in the code comments below to create a dataframe to analyze the listing data and import the correct data into it.

In [None]:
# Change the elements in the cols list from 'col1',...'col6' to 
# 'price', 'accommodates', 'bedrooms', 'bathrooms', 'beds','number_of_reviews'

cols = ['col1',
        'col2',
        'col3',
        'col4',
        'col5',
        'col6'
        ]

# Change the name of the file to import from 'filename.csv' to 'listings.csv'

sea_listings = pd.read_csv('filename.csv', usecols=cols)

**Note:** To run this code snippet after you have made the changes, ensure that you have selected the notebook cell and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

In order to see if you successfully imported the data, go ahead and examine the data at a high level. The <code>pandas.DataFrame.head()</code> method enables you to look at just the first five rows in your dataframe:

In [None]:
sea_listings.head()

If you see the first 5 rows of the dataframe populated with data for the <code>price, accommodates, bedrooms, bathrooms, beds,</code> and <code>number_of_reviews</code> columns, you have successfully imported your data.

If your dataframe is empty or has the incorrect columns, run the code snippet below:
#### <font color=green>Code Exercise 1.1 Answers</font>

In [None]:
cols = ['price',
        'accommodates',
        'bedrooms',
        'bathrooms',
        'beds',
        'number_of_reviews'
        ]

sea_listings = pd.read_csv('listings.csv', usecols=cols)

### View dataframe shape

Another method data scientists use to understand large-scale data is to view the shape of the dataframe to see how many rows and columns it has:

**Note:** To run the following code snippet, ensure that you have selected it and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

In [None]:
print(sea_listings.shape)

Remember to run the code snippet using either a menu command or the keyboard shortcut. Python should return a shape of <code>(3818, 6)</code> for the dataframe: 3818 rows and 6 columns.

### Prepare your data

A final, important preparatory step is to clean your data: remove the rows containing **Not a Number** (**NaN**) values, which would otherwise break your code later on. To do this, run the <code>pandas.DataFrame.dropna()</code> method on the **sea_listings** dataframe.

**Note:** To run the following code snippet, ensure that you have selected it and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

In [None]:
sea_listings = sea_listings.dropna(axis=0, how='any')

The <code>axis=0</code> parameter tells <code>dropna</code> to eliminate rows (rather than columns) that contain NaN values; the <code>how='any'</code> parameter tells it to eliminate any row with one or more NaN values.
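
To make the effect of these two parameters concrete, here is a small sketch on a toy dataframe. The column values and variable names are made up for illustration and are not taken from listings.csv.

```python
import numpy as np
import pandas as pd

# Toy dataframe (made-up values) with missing entries in two different rows.
toy = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 2.0, 3.0],
    "bathrooms": [1.0, 1.0, np.nan, 2.0],
})

# axis=0 drops rows (not columns); how='any' drops a row if at least one value is NaN.
cleaned_any = toy.dropna(axis=0, how="any")

# For contrast, how='all' drops a row only when every value in it is NaN.
cleaned_all = toy.dropna(axis=0, how="all")

print(cleaned_any.shape)
print(cleaned_all.shape)
```

With `how='any'`, the two rows that each contain a single NaN are removed; with `how='all'`, nothing is removed because no row is entirely empty.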

Before moving on, quickly re-check the shape of the dataframe to see how many rows were dropped.

**Note:** To run the following code snippet, ensure that you have selected it and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

In [None]:
print(sea_listings.shape)

We had 3818 rows, but after removing those with Not a Number (NaN) values, we now have only 3796.

## Step 2: Calculate how well listings will meet your needs
For purposes of this lab, assume that you are looking for a place in Seattle that can accommodate three people. For the sake of analysis, you will calculate the "distance" between your needs (accommodating three people) and how many people each listing can actually accommodate. To see what a simple example of this calculation looks like, complete and run the code snippet below to measure the "distance" between the number of people we need to accommodate (3) and the number of people the first listing in the dataframe can accommodate:

#### <font color=blue>Code Exercise 2.1</font>
Follow the instructions in the code comments below to calculate the difference between how many people the first listing in the dataframe can accommodate and how many people you need a listing to accommodate.

In [None]:
# Enter the number of people who will be staying at the Bed and Breakfast as 
# a value for the variable our_acc_value. You can try any number from 1 to 4
# but, to avoid issues later in the lab, set it to 3 before continuing.

our_acc_value = 

# We want to fetch the accommodates value from the first row of the dataframe.
# Set rowindex to 0 to request the first row and 'columnname' to 'accommodates'.

first_living_space_value = sea_listings.loc[rowindex,'columnname']

# The code now checks the difference between the number of people we want to 
# accommodate and the number of people the first listing can accommodate.

first_distance = np.abs(first_living_space_value - our_acc_value)
print(first_distance)

**Note:** To run this code snippet, ensure that you have selected it and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

In the cell above, the first listing has a distance of 1 from our desired accommodation level of 3. If the code snippet returned a value other than 1, run the code snippet below.

#### <font color=green>Code Exercise 2.1 Answer</font>

In [None]:
our_acc_value = 3

first_living_space_value = sea_listings.loc[0,'accommodates']
first_distance = np.abs(first_living_space_value - our_acc_value)
print(first_distance)

Note that you calculated the absolute difference between 3 (the number of people you need to accommodate) and the accommodation of a listing. This is because, for the purposes of this lab, we treat a listing that accommodates 4 as equally far from perfectly meeting our needs as one that accommodates 2.

### Measuring accommodation value "distance" from 3 for all listings 

Now, you will perform this calculation for every listing in the dataframe. You will also create a new column in your dataframe to store this data ("distance") and then print out how many listings are each integer "distance" from your accommodation number of 3.

**Note:** To run the following code snippet, ensure that you have selected it and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

In [None]:
sea_listings['distance'] = np.abs(sea_listings.accommodates - our_acc_value)
print(sea_listings.distance.value_counts().sort_index())
sea_listings = sea_listings.sort_values('distance')
sea_listings.distance.head()

Note that the distances for the first few listings in this new dataframe column are 0. This is because you sorted the dataframe on the values in that column, ascending from lowest to highest.
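
The same three operations (compute the distance, count listings at each distance, sort by distance) can be sketched on a handful of rows. The accommodation counts and the names `toy_listings` and `target` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up accommodation counts for six hypothetical listings.
toy_listings = pd.DataFrame({"accommodates": [2, 3, 4, 6, 3, 1]})
target = 3  # number of people we need to accommodate

# Absolute difference: 2 and 4 are both a distance of 1 from 3.
toy_listings["distance"] = np.abs(toy_listings["accommodates"] - target)

# Count how many listings sit at each distance, ordered by distance.
counts = toy_listings["distance"].value_counts().sort_index()
print(counts)

# Sorting ascending puts the exact matches (distance 0) first.
toy_sorted = toy_listings.sort_values("distance")
print(toy_sorted.head())
```

Here two listings match exactly, two are off by one person, and one each is off by two and three.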

### Preparing price data for analysis

Ultimately, however, we want to use features of listings in order to predict their prices. This means that we will need to work with listings' prices in our test data.

In order to analyze prices, remove dollar signs and commas, and then change the data type of the price column from string to float. You will also calculate the mean price for the first five listings.

(Doing this for the first five listings is important because this sorting and analysis will form the basis of your first predictive algorithm, later in the lab.)

**Note:** To run the following code snippet, ensure that you have selected it and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

In [None]:
sea_listings['price'] = sea_listings.price.str.replace(r"[\$,]", '', regex=True).astype(float)
mean_price = sea_listings.price.iloc[:5].mean()
mean_price

The price column is now populated with floats, and the first five listings have a mean price of $80.40.
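
As a small sketch of the cleaning step, the snippet below strips dollar signs and commas from a few made-up price strings and converts them to floats. The sample values and variable names are illustrative only.

```python
import pandas as pd

# Made-up price strings in the same "$1,234.00" style as the lab's price column.
raw_prices = pd.Series(["$85.00", "$1,250.00", "$60.00"])

# The character class [$,] matches a dollar sign or a comma; regex=True makes
# pandas treat the pattern as a regular expression rather than literal text.
clean_prices = raw_prices.str.replace(r"[$,]", "", regex=True).astype(float)

print(clean_prices.tolist())
print(clean_prices.mean())
```

Passing `regex=True` explicitly matters because recent pandas versions treat the pattern as a literal string by default, in which case a pattern such as `\$|,` would match nothing and the conversion to float would fail.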

## Step 3: Create your training and test data
Whenever you attempt to use known data to make predictions about new data, it is essential to know how accurate your predictions are. A standard way to determine the accuracy of your predictions is to train models against a first set of data and then perform your testing against the second set. It is important to never test against your training data. So, you will split the listing dataset into a training dataset (the first 2863 listings) and a test dataset here (the remaining listings).

Using the first 2863 listings as our training set means that we're saving the last 25% of our data as our test set. Splitting your data into 75% training and 25% test is common.
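
A positional split like this can be sketched on a toy dataframe. The eight rows and the names `split_at`, `train_part`, and `test_part` are made up for illustration; in the lab itself, 2863 of the 3796 cleaned rows works out to roughly 75%.

```python
import pandas as pd

# Eight made-up rows standing in for the cleaned listings.
toy = pd.DataFrame({"price": [50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0]})

# Row position where the training set ends: 75% of 8 rows is 6.
split_at = int(len(toy) * 0.75)

train_part = toy.iloc[:split_at].copy()   # positions 0 through 5
test_part = toy.iloc[split_at:].copy()    # positions 6 and 7

print(train_part.shape, test_part.shape)
```

Because `iloc` slices exclude the end position, `iloc[:split_at]` and `iloc[split_at:]` never share a row, which is what keeps the test data out of the training set.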

#### <font color=blue>Code Exercise 3.1</font>
Follow the instructions in the code comments below to create your training and test dataframes.

In [None]:
# We want to copy the first 2863 listings to a training dataset, so specify positions 
# 0 up to (but not including) 2863 for iloc. (Hint: putting 0:2863 or :2863 in the brackets will do this.)

train_df = sea_listings.copy().iloc[]
print(train_df.shape)

# We want to copy all listings from 2863 onward to a test dataset, so specify a range of 
# 2863 and above for iloc. (Hint: putting 2863: in the brackets will do this.)

test_df = sea_listings.copy().iloc[]
print(test_df.shape)

**Note:** To run this previous code snippet once you have made the proper changes, ensure that you have selected it and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

At this point you should have created a training dataframe with 2863 rows and a test dataframe with 933 rows. If the shapes of the training and test dataframes are not <code>(2863, 7)</code> and <code>(933, 7)</code>, run the code below.

#### <font color=green>Code Exercise 3.1 Answer</font>

In [None]:
train_df = sea_listings.copy().iloc[:2863]
print(train_df.shape)
test_df = sea_listings.copy().iloc[2863:]
print(test_df.shape)

Your training data set now consists of the first 2,863 listings from the original data set; your test data is everything else.

## Step 4: Run your predictions
Now that you have training and test datasets, you are ready to run a simple model predicting the price of a listing. You will make this prediction based on the mean price of the first five training listings with the same level of accommodation (that is, the number of people accommodated by the listing): the five "nearest neighbors" of a given listing, based on the number of people the listing can accommodate.
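
The idea of averaging the prices of the five nearest neighbors can be sketched on toy data. The listings, prices, and the function name `nearest_mean_price` below are hypothetical and are not the lab's own code; the sketch only illustrates the technique under those assumptions.

```python
import numpy as np
import pandas as pd

# Made-up training listings (not taken from listings.csv).
toy_train = pd.DataFrame({
    "accommodates": [1, 2, 2, 3, 3, 3, 4, 4, 6, 8],
    "price": [40.0, 60.0, 65.0, 80.0, 85.0, 90.0, 100.0, 110.0, 150.0, 200.0],
})

def nearest_mean_price(new_value, train, feature="accommodates", k=5):
    """Average the price of the k training rows closest to new_value."""
    ranked = train.copy()  # copy so the caller's dataframe is not modified
    ranked["distance"] = np.abs(ranked[feature] - new_value)
    # mergesort is a stable sort, so tied rows keep their original order.
    ranked = ranked.sort_values("distance", kind="mergesort")
    return ranked["price"].iloc[:k].mean()

print(nearest_mean_price(3, toy_train))
```

For a listing that accommodates 3, the three exact matches come first, followed by the first two listings at distance 1. When many listings tie at the same distance, which ones land in the top five depends on row order, so a model built on a single coarse feature can be sensitive to how the data happens to be sorted.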

#### <font color=blue>Code Exercise 4.1</font>
Follow the instructions in the code comments below to train your predictive model for listing prices.

In [None]:
# Train your dataset
# You will need to set the data frame temp_df to the name of the training data frame 
# you just created.

def predict_price(new_listing_value,feature_column):
    temp_df =   # Supply the name of the training dataframe you just created.
    temp_df['distance'] = np.abs(temp_df[feature_column] - new_listing_value)
    temp_df = temp_df.sort_values('distance')
    knn_5 = temp_df.price.iloc[:5]
    predicted_price = knn_5.mean()
    return(predicted_price)

**Note:** To run this code snippet once you have made the proper changes, ensure that you have selected the notebook cell and then select either **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

Check the code you completed in the cell above against the one below to ensure you defined the function correctly. If you did not, run the code in the cell below before continuing.

#### <font color=green>Code Exercise 4.1 Answer</font>

In [None]:
def predict_price(new_listing_value,feature_column):
    temp_df = train_df
    temp_df['distance'] = np.abs(temp_df[feature_column] - new_listing_value)
    temp_df = temp_df.sort_values('distance')
    knn_5 = temp_df.price.iloc[:5]
    predicted_price = knn_5.mean()
    return(predicted_price)

#### <font color=blue>Code Exercise 4.2</font>
Now we need to test our model. Let’s use the predict_price model we just trained to predict the prices for the listings we stored in the test_df dataframe.

Add a new column **<code>predicted_price</code>** to the test_df data frame and populate it with the value returned by the predict_price model we just trained.

In [None]:
# Change the newcolumnname placeholder to the name of the new column to create in the dataframe (predicted_price)

test_df['newcolumnname'] = test_df.accommodates.apply(predict_price,feature_column='accommodates')
test_df.head()

**Note:** To run this code snippet once you have made the proper changes, ensure that you have selected the notebook cell and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

You should see the first five rows of the dataframe displayed with the new column <code>predicted_price</code> added. The value in the <code>predicted_price</code> column is the predicted price based on our trained model.

Check the code you completed in the cell above against the one below to ensure you defined the new column for the test dataframe correctly. If you did not, run the code in the cell below before continuing.

#### <font color=green>Code Exercise 4.2 Answer</font>

In [None]:
test_df['predicted_price'] = test_df.accommodates.apply(predict_price,feature_column='accommodates')
test_df.head()

### Assess predictive accuracy
Now that we have predicted prices, let’s see how well they compare to the actual prices. This will give us an idea of how accurately our model is making predictions. We can compare the predicted price stored in <code>predicted_price</code> column and the actual price stored in the <code>price</code> column and calculate the [root-mean-square error](https://en.wikipedia.org/wiki/Root-mean-square_deviation) (RMSE). RMSE is a standard calculation to compare the error between predicted and actual values.

**Note:** To run the following code snippet, ensure that you have selected it and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

In [None]:
test_df['squared_error'] = (test_df['predicted_price'] - test_df['price'])**(2)
mse = test_df['squared_error'].mean()
rmse = math.sqrt( mse )
print('Root-mean-square error =',rmse)

An RMSE of 113.75 means that the model's price predictions, based on a listing's accommodation level alone, were off by roughly $113.75 on a typical listing (RMSE weights large errors more heavily than a plain average would). That’s not great. Let’s see if we can improve it by using different columns (what data scientists call “features”) to train our model and get better results.
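
To see how the RMSE calculation works, here is a worked sketch on four made-up listings. The prices are invented for illustration; the mean absolute error (`mae`) is included only as a point of comparison and is not used elsewhere in this lab.

```python
import math
import numpy as np

# Made-up actual and predicted prices for four hypothetical listings.
actual = np.array([100.0, 150.0, 80.0, 200.0])
predicted = np.array([110.0, 140.0, 80.0, 160.0])

errors = predicted - actual             # [10, -10, 0, -40]
mse = np.mean(errors ** 2)              # (100 + 100 + 0 + 1600) / 4 = 450
rmse = math.sqrt(mse)                   # about 21.21
mae = np.mean(np.abs(errors))           # (10 + 10 + 0 + 40) / 4 = 15

print(rmse, mae)
```

The single $40 miss pulls the RMSE (about 21.21) well above the mean absolute error (15), which shows how squaring the errors penalizes large misses more heavily.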

The code below will try training the model using four different columns: <code>accommodates</code>, <code>bedrooms</code>, <code>bathrooms</code>, and <code>number_of_reviews</code>. Then, for each, the model calculates the RMSE between the actual and predicted prices and prints the results. In this way we can find out which feature (column) gives the most accurate predictions.

**Note:** To run the following code snippet, ensure that you have selected it and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter. <i>This snippet could take several seconds to run.</i>

In [None]:
for feature in ['accommodates','bedrooms','bathrooms','number_of_reviews']:
    test_df['predicted_price'] = test_df[feature].apply(predict_price,feature_column=feature)
    test_df['squared_error'] = (test_df['predicted_price'] - test_df['price'])**(2)
    mse = test_df['squared_error'].mean()
    rmse = math.sqrt( mse )
    print("RMSE for the {} column: {}".format(feature,rmse))

We can see that the lowest RMSE (113.75) is returned by the model that was trained using the <code>accommodates</code> column. Changing to another column is not going to get us better accuracy. Let’s try another approach to improve our results.

## Step 5: Take two, this time with normalized data
In this section, you will revisit the results you got the first time through, but this time with normalized data. The data in the <code>listings.csv</code> file comes in different units (for example, number of rooms versus dollars) and at different scales (two bathrooms versus 200 reviews). In order to account for these differences, it is a best practice to normalize the data: subtract the mean of a column from every entry in it and divide the difference by the standard deviation of the column. This leaves us with unitless, apples-to-apples numbers to use in our ML algorithms.
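
The normalization described above can be sketched by hand with NumPy. The bathroom counts are made up; the sketch uses the population standard deviation (`ddof=0`), which, as far as we are aware, is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Made-up bathroom counts for five hypothetical listings.
bathrooms = np.array([1.0, 1.0, 2.0, 3.0, 3.0])

# z-score: subtract the column mean, divide by the column standard deviation.
# np.std defaults to the population standard deviation (ddof=0).
z_scores = (bathrooms - bathrooms.mean()) / bathrooms.std()

print(z_scores)
print(z_scores.mean(), z_scores.std())
```

After normalization the column has a mean of 0 and a standard deviation of 1. Note that individual values are not confined to a fixed range: the listings with three bathrooms end up slightly above 1.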

**Note:** To run the following code snippet, ensure that you have selected it and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

In [None]:
cols = ['accommodates',
        'bedrooms',
        'bathrooms',
        'beds',
        'price',
        'number_of_reviews'
        ]

sea_normalized = pd.DataFrame(columns=cols)

for col in cols:
    x = sea_listings[[col]].values.astype(float)
    scaler = preprocessing.StandardScaler()
    x_scaled = scaler.fit_transform(x)
    sea_normalized[[col]] = pd.DataFrame(x_scaled)

normalized_listings = sea_normalized.sample(frac=1,random_state=0)

# Split the data into training and test data sets
norm_train_df = sea_normalized.copy().iloc[0:2863]
norm_test_df = sea_normalized.copy().iloc[2863:]

sea_normalized.head()

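The <code>sea_normalized</code> dataframe is now normalized: each column has a mean of 0 and a standard deviation of 1, so most values fall within a few units of 0 (they are not limited to the range -1 to 1).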

You will now take your predictive model up a level in terms of sophistication. The model will now be based on the distance between listings based on multiple factors. So, rather than just looking at the absolute difference between the number of people different listings can accommodate, your model will now look at the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) between listings based on number of people accommodated and the number of bathrooms, and it will use the mean price of the five nearest neighbors of a listing to try and predict its price. Using Euclidean distance is useful because it lets us use multiple data features to get more accurate predictions.

Before running this algorithm against the entire data set, run it on two entries to see how measuring Euclidean distance in two dimensions looks with entries from our dataframe. This is also a step toward two features in our predictive model.

**Note:** To run the following code snippet, ensure that you have selected it and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

In [None]:
first_listing = sea_normalized.iloc[0][['accommodates', 'bathrooms']]
fifth_listing = sea_normalized.iloc[4][['accommodates', 'bathrooms']]
first_fifth_distance = distance.euclidean(first_listing, fifth_listing)
first_fifth_distance

Rather than a whole number, such as 1, like we had before, we get a decimal. In this example, we ran this subset of our broader algorithm against only two features to make mental visualization easier. Imagine a plane with one axis marked "Accommodates" and the other marked "Bathrooms." The two listings in the code snippet above would mark two points on that plane, and the output of the code would deliver the distance between those two points.
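
The distance between two points on that plane can also be computed by hand. The two points below are made up and chosen so the arithmetic is easy to check; the names `listing_a` and `listing_b` are illustrative.

```python
import numpy as np

# Two made-up listings as (accommodates, bathrooms) points in normalized units.
listing_a = np.array([0.0, 0.0])
listing_b = np.array([3.0, 4.0])

# Euclidean distance: square root of the summed squared differences per feature.
dist = np.sqrt(np.sum((listing_a - listing_b) ** 2))

print(dist)  # 5.0, the hypotenuse of a 3-4-5 right triangle
```

The same formula extends to any number of features: each additional feature simply adds one more squared difference under the square root.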

Now, try this algorithm against the entire test dataset to see if looking at both the number of people accommodated and the number of bathrooms provides more accurate predictions than are generated by looking at each number alone. You will do this using the [*k*-nearest neighbors (*k*-NN) algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm), a method that in this case averages the prices of the 5 training listings nearest a test listing in order to predict its price.

#### <font color=blue>Code Exercise 5.1</font>
Follow the instructions in the code comments below to run your *k*-NN model on two data features. You can then compare the predicted values to actual values in your dataset using the RMSE to see if it gets better results.

In [None]:
# See comments below for instructions for this code snippet

def predict_price_multivariate(new_listing_value,feature_columns):
    temp_df = norm_train_df
    temp_df['distance'] = distance.cdist(temp_df[feature_columns],[new_listing_value[feature_columns]])
    temp_df = temp_df.sort_values('distance')
    knn_5 = temp_df.price.iloc[:5]
    predicted_price = knn_5.mean()
    return(predicted_price)

# Replace col1 and col2 placeholders with the names of the two columns we are using to train our 
# model (accommodates and bathrooms) to find out if the number of people a listing can accommodate 
# and the number of bathrooms is a better predictor of price.

cols = ['col1', 'col2']
norm_test_df['predicted_price'] = norm_test_df[cols].apply(predict_price_multivariate,feature_columns=cols,axis=1)    
norm_test_df['squared_error'] = (norm_test_df['predicted_price'] - norm_test_df['price'])**(2)
mse = norm_test_df['squared_error'].mean()
rmse = math.sqrt( mse )
print(rmse)

**Note:** To run this code snippet once you have made the proper changes, ensure that you have selected the notebook cell and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

Check the code you completed in the cell above against the one below to ensure you defined the function correctly. If you did not, run the code in the cell below before continuing.

#### <font color=green>Code Exercise 5.1 Answer</font>

In [None]:
def predict_price_multivariate(new_listing_value,feature_columns):
    temp_df = norm_train_df
    temp_df['distance'] = distance.cdist(temp_df[feature_columns],[new_listing_value[feature_columns]])
    temp_df = temp_df.sort_values('distance')
    knn_5 = temp_df.price.iloc[:5]
    predicted_price = knn_5.mean()
    return(predicted_price)

cols = ['accommodates', 'bathrooms']
norm_test_df['predicted_price'] = norm_test_df[cols].apply(predict_price_multivariate,feature_columns=cols,axis=1)    
norm_test_df['squared_error'] = (norm_test_df['predicted_price'] - norm_test_df['price'])**(2)
mse = norm_test_df['squared_error'].mean()
rmse = math.sqrt( mse )
print(rmse)

There are two key differences between the answer you got here and the answer you got in the previous code snippet. First, rather than finding the distance between just two points, the code snippet you just ran measures the distance from each test listing to every training listing in order to find its five nearest neighbors. Second, rather than reporting a distance between points on the Accommodates-Bathrooms plane, the output shows how far, on average, our predicted price is from the actual price of each listing: the average error of the predictive algorithm.

One challenge with this result lies in how to interpret it. The output is still normalized: your RMSE isn't $1.24, it's 1.24 standard deviations of all the listing prices. In order to translate it back, you need to multiply by the price's standard deviation.
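
A short sketch with made-up prices shows why multiplying by the standard deviation converts a normalized RMSE back into dollars. All values and variable names here are illustrative.

```python
import numpy as np

# Made-up actual and predicted prices in dollars.
actual = np.array([100.0, 150.0, 80.0, 200.0])
predicted = np.array([110.0, 140.0, 80.0, 160.0])

mean, std = actual.mean(), actual.std()   # population std (ddof=0)

# Normalize both series with the same mean and standard deviation.
actual_z = (actual - mean) / std
predicted_z = (predicted - mean) / std

rmse_dollars = np.sqrt(np.mean((predicted - actual) ** 2))
rmse_z = np.sqrt(np.mean((predicted_z - actual_z) ** 2))

# Multiplying the normalized RMSE by the standard deviation recovers dollars.
print(rmse_z * std, rmse_dollars)
```

The mean cancels out when you subtract one normalized value from another, so only the standard deviation is needed for the conversion. One caveat: pandas' `Series.std()` uses the sample standard deviation (`ddof=1`) by default, which differs slightly from the population value used during normalization; with thousands of rows the difference is negligible.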

**Note:** To run the following code snippet, ensure that you have selected it and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

In [None]:
print(rmse * sea_listings.price.std())

112.54 versus 113.75: a small improvement.

So far in this lab we have written our own algorithms from scratch. That can be a great way to learn how and why these algorithms work, but the pre-packaged algorithms that come in software libraries can provide more sophistication and better accuracy. Let's run the *k*-nearest neighbors algorithm against the same two features (<code>accommodates</code> and <code>bathrooms</code>) again but this time using the *k*-nearest neighbors regression algorithm in the scikit-learn Python library.

**Note:** To run the following code snippet, ensure that you have selected it and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

In [None]:
knn = KNeighborsRegressor(algorithm='brute')
knn.fit(norm_train_df[cols], norm_train_df['price'])
two_features_predictions = knn.predict(norm_test_df[cols])
two_features_mse = mean_squared_error(norm_test_df['price'], two_features_predictions)
two_features_rmse = two_features_mse ** (1/2)
print(two_features_rmse * sea_listings.price.std())

Note that we included a step to multiply the resultant RMSE by the price standard deviation directly in this code snippet. Running the pre-packaged *k*-NN regression algorithm against <code>accommodates</code> and <code>bathrooms</code> did produce slightly more accuracy than our previous, home-grown algorithm.
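
To see what `KNeighborsRegressor` does under the hood, the sketch below compares it with a hand-rolled average of the five nearest neighbors on made-up, tie-free data. The data and variable names are hypothetical; `n_neighbors=5` is the scikit-learn default and is spelled out only for clarity.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up single-feature training data; prices are illustrative only.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0], [11.0]])
y_train = np.array([50.0, 60.0, 70.0, 80.0, 90.0, 300.0, 320.0])
X_new = np.array([[2.0]])

# n_neighbors defaults to 5; it is spelled out here for clarity.
knn = KNeighborsRegressor(n_neighbors=5, algorithm="brute")
knn.fit(X_train, y_train)
library_prediction = knn.predict(X_new)[0]

# Hand-rolled equivalent: average the targets of the five closest rows.
distances = np.abs(X_train[:, 0] - X_new[0, 0])
nearest_five = np.argsort(distances)[:5]
manual_prediction = y_train[nearest_five].mean()

print(library_prediction, manual_prediction)
```

When the five nearest neighbors are unambiguous, the two approaches agree. Small differences between a home-grown version and the library version on real data presumably come from how each one breaks ties among equally distant listings.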

But are we using the right features for this analysis? Would more features provide a better prediction? Let's run the pre-packaged *k*-NN regression algorithm again, but this time against four features: <code>accommodates</code>, <code>bathrooms</code>, <code>beds</code>, and <code>bedrooms</code>.

#### <font color=blue>Code Exercise 5.2</font>
Follow the instructions in the code comments below to run your *k*-nearest neighbors model on four data features.

In [None]:
# Replace the col1...col4 placeholders below to analyze the listings against
# accommodates, bathrooms, beds, and bedrooms

knn = KNeighborsRegressor(algorithm='brute')

cols = ['col1','col2','col3','col4']

knn.fit(norm_train_df[cols], norm_train_df['price'])
four_features_predictions = knn.predict(norm_test_df[cols])
four_features_mse = mean_squared_error(norm_test_df['price'], four_features_predictions)
four_features_rmse = four_features_mse ** (1/2)
four_features_rmse * sea_listings.price.std()

**Note:** To run this code snippet once you have made the proper changes, ensure that you have selected the notebook cell and then either select **Cell > Run Cells** from the menu at the top of the notebook, or click the **run cell** button at the top of the notebook, or press Ctrl+Enter.

Check the code you completed in the cell above against the one below to ensure you specified the four columns correctly. If you did not, run the code in the cell below before continuing.

#### <font color=green>Code Exercise 5.2 Answer</font>

In [None]:
knn = KNeighborsRegressor(algorithm='brute')

cols = ['accommodates','bedrooms','bathrooms','beds']

knn.fit(norm_train_df[cols], norm_train_df['price'])
four_features_predictions = knn.predict(norm_test_df[cols])
four_features_mse = mean_squared_error(norm_test_df['price'], four_features_predictions)
four_features_rmse = four_features_mse ** (1/2)
four_features_rmse * sea_listings.price.std()

So four features rather than two wasn't an improvement; our accuracy actually went down slightly. This example highlights two points for ML. First, predictive accuracy can come down to finding the right features to include in your analysis. Second, more features do not necessarily generate superior accuracy; accuracy can actually go down by including extraneous features. For this reason, feature selection plays a large role in good ML.

If you have time, feel free to continue to play around with the features used to see if you can get a better accuracy score!

## Step 6: Stopping the DSVM
Before completing the lab, make sure you shut down the virtual machine you created in Microsoft Azure.

1. Return to the Azure Web Portal (http://portal.azure.com). Locate and open the settings for the new DSVM that you created as part of this HOL.
2. In the controls, click **Stop** to stop the DSVM.

**Important: Remember to shut down the virtual machine in the Azure portal after you have completed this HOL.**

You have now completed Exercise 4 and the Machine Learning HOL.