We explored how to use a simple k-nearest neighbors machine learning model that used just one feature, or attribute, of the listing to predict the rent price. We first relied on the accommodates column, which describes the number of people a living space can comfortably accommodate. Then, we switched to the bathrooms column and observed an improvement in accuracy. While these were good features for becoming familiar with the basics of machine learning, it's clear that using just a single feature to compare listings doesn't reflect the reality of the market. An apartment that can accommodate 4 guests in a popular part of Washington, D.C. will rent for much more than one that can accommodate 4 guests in a crime-ridden area.

There are 2 ways we can tweak the model to try to improve the accuracy (decrease the RMSE during validation):
1. increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors
2. increase k, the number of nearby neighbors the model uses when computing the prediction

In this mission, we'll focus on increasing the number of attributes the model uses. When selecting more attributes to use in the model, we need to watch out for columns that don't work well with the distance equation. This includes columns containing:
1. non-numerical values (e.g. city or state)
    - the Euclidean distance equation expects numerical values
2. missing values
    - the distance equation expects a value for each observation and attribute
3. non-ordinal values (e.g. latitude or longitude)
    - ranking by Euclidean distance doesn't make sense unless all attributes are ordinal
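
A quick way to screen for the first two problems is to look at each column's dtype and null count. The sketch below is only an illustration: the `toy` DataFrame and its values are made up here and are not rows from dc_airbnb.csv.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not rows from dc_airbnb.csv.
toy = pd.DataFrame({
    "accommodates": [2, 4, 1],
    "bedrooms": [1.0, np.nan, 1.0],
    "city": ["Washington", "Washington", "Washington"],
})

# Columns whose dtype is not numeric can't be fed to the distance equation as-is.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Number of missing values per column; any non-zero entry needs handling.
missing_counts = toy.isnull().sum()

print(non_numeric_cols)
print(missing_counts[missing_counts > 0])
```

Whether a numeric column is also ordinal (the third problem) can't be detected from the dtype alone and still requires judgment about what the values mean.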

### Instructions

Use the DataFrame.info() method to return the number of non-null values in each column.

In [2]:
import pandas as pd
import numpy as np
np.random.seed(1)
dc_listings = pd.read_csv('dc_airbnb.csv')
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
stripped_commas = dc_listings['price'].str.replace(',', '')
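# regex=False treats '$' as a literal character rather than a regex end-of-string anchor.
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)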
dc_listings['price'] = stripped_dollars.astype('float')

In [3]:
dc_listings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 574 to 1061
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   host_response_rate    3289 non-null   object 
 1   host_acceptance_rate  3109 non-null   object 
 2   host_listings_count   3723 non-null   int64  
 3   accommodates          3723 non-null   int64  
 4   room_type             3723 non-null   object 
 5   bedrooms              3702 non-null   float64
 6   bathrooms             3696 non-null   float64
 7   beds                  3712 non-null   float64
 8   price                 3723 non-null   float64
 9   cleaning_fee          2335 non-null   object 
 10  security_deposit      1426 non-null   object 
 11  minimum_nights        3723 non-null   int64  
 12  maximum_nights        3723 non-null   int64  
 13  number_of_reviews     3723 non-null   int64  
 14  latitude              3723 non-null   float64
 15  longitude          

### Removing features

The following columns contain non-numerical values:

1. room_type: e.g. Private room
2. city: e.g. Washington
3. state: e.g. DC
    
while these columns contain numerical but non-ordinal values:

1. latitude: e.g. 38.913458
2. longitude: e.g. -77.031
3. zipcode: e.g. 20009

Geographic values like these aren't ordinal, because a smaller numerical value doesn't directly correspond to a smaller value in a meaningful way. For example, the zip code 20009 isn't smaller or larger than the zip code 75023 and instead both are unique, identifier values. Latitude and longitude value pairs describe a point on a geographic coordinate system and different equations are used in those cases (e.g. haversine).
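
For reference, here is a minimal sketch of the haversine formula mentioned above. The function name `haversine_km`, the mean Earth radius of 6371 km, and the second coordinate pair are assumptions made for this illustration; nothing in this mission uses it.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# The example coordinates quoted above, compared with a hypothetical point
# 0.01 degrees of latitude further north.
print(haversine_km(38.913458, -77.031, 38.923458, -77.031))
```

Unlike Euclidean distance on raw latitude and longitude values, this accounts for the curvature of the Earth, which is why geographic columns need separate treatment.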

While we could convert the host_response_rate and host_acceptance_rate columns to be numerical (right now they're object data types and contain the % sign), these columns describe the host and not the living space itself. Since a host could have many living spaces and we don't have enough information to uniquely group living spaces to the hosts themselves, let's avoid using any columns that don't directly describe the living space or the listing itself:

1. host_response_rate
2. host_acceptance_rate
3. host_listings_count

Let's remove these 9 columns from the DataFrame.

In [4]:
dc_listings.columns

Index(['host_response_rate', 'host_acceptance_rate', 'host_listings_count',
       'accommodates', 'room_type', 'bedrooms', 'bathrooms', 'beds', 'price',
       'cleaning_fee', 'security_deposit', 'minimum_nights', 'maximum_nights',
       'number_of_reviews', 'latitude', 'longitude', 'city', 'zipcode',
       'state'],
      dtype='object')

In [5]:
dc_listings = dc_listings.drop(columns = ['host_response_rate', 'host_acceptance_rate', 'host_listings_count','latitude', 'longitude', 'city', 'zipcode','state', 'room_type']) 

In [6]:
dc_listings.head() 

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews
574,2,1.0,1.0,1.0,125.0,,$300.00,1,4,149
1593,2,1.0,1.5,1.0,85.0,$15.00,,1,30,49
3091,1,1.0,0.5,1.0,50.0,,,1,1125,1
420,2,1.0,1.0,1.0,209.0,$150.00,,4,730,2
808,12,5.0,2.0,5.0,215.0,$135.00,$100.00,2,1825,34


### Handling missing values

Of the remaining columns, 3 columns have a few missing values (less than 1% of the total number of rows):

1. bedrooms
2. bathrooms
3. beds

Since the number of rows containing missing values for one of these 3 columns is low, we can select and remove those rows without losing much information. There are also 2 columns that have a large number of missing values:

1. cleaning_fee - 37.3% of the rows
2. security_deposit - 61.7% of the rows

and we can't handle these easily. We can't just remove the rows containing missing values for these 2 columns because we'd miss out on the majority of the observations in the dataset. Instead, let's remove these 2 columns entirely from consideration.
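
The percentages above can be computed directly with pandas. The following is a sketch on made-up data: the `toy` DataFrame, its values, and the 0.3 cutoff are assumptions chosen for illustration (the text above decides by judgment, not by a fixed rule).

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not rows from dc_airbnb.csv.
toy = pd.DataFrame({
    "beds": [1.0, 2.0, np.nan, 1.0, 3.0, 1.0, 2.0, 1.0, 1.0, 2.0],
    "cleaning_fee": ["$15.00", None, None, "$150.00", None,
                     None, "$135.00", None, "$20.00", None],
    "price": [125.0, 85.0, 50.0, 209.0, 215.0, 99.0, 60.0, 75.0, 110.0, 95.0],
})

# Fraction of missing values in each column.
missing_fraction = toy.isnull().mean()

# Hypothetical cutoff: drop whole columns that are missing too many values...
threshold = 0.3
sparse_cols = missing_fraction[missing_fraction > threshold].index.tolist()

# ...and drop individual rows for the columns that are only missing a few.
cleaned = toy.drop(columns=sparse_cols).dropna(axis=0)

print(missing_fraction)
print(sparse_cols, cleaned.shape)
```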

### Instructions
1. Drop the cleaning_fee and security_deposit columns from dc_listings.
2. Then, remove all rows that contain a missing value for the bedrooms, bathrooms, or beds column from dc_listings.

You can accomplish this by using the DataFrame method dropna() and setting the axis parameter to 0.
Since only the bedrooms, bathrooms, and beds columns contain any missing values, rows containing missing values in these columns will be removed.

3. Display the null value counts for the updated dc_listings DataFrame to confirm that there are no missing values left.

In [7]:
dc_listings = dc_listings.drop(columns = ['cleaning_fee', 'security_deposit']) 

In [8]:
dc_listings.head() 

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,maximum_nights,number_of_reviews
574,2,1.0,1.0,1.0,125.0,1,4,149
1593,2,1.0,1.5,1.0,85.0,1,30,49
3091,1,1.0,0.5,1.0,50.0,1,1125,1
420,2,1.0,1.0,1.0,209.0,4,730,2
808,12,5.0,2.0,5.0,215.0,2,1825,34


In [9]:
dc_listings = dc_listings.dropna(subset = ['bedrooms', 'bathrooms', 'beds'], axis = 0)

In [10]:
dc_listings.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3671 entries, 574 to 1061
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   accommodates       3671 non-null   int64  
 1   bedrooms           3671 non-null   float64
 2   bathrooms          3671 non-null   float64
 3   beds               3671 non-null   float64
 4   price              3671 non-null   float64
 5   minimum_nights     3671 non-null   int64  
 6   maximum_nights     3671 non-null   int64  
 7   number_of_reviews  3671 non-null   int64  
dtypes: float64(4), int64(4)
memory usage: 258.1 KB


In [11]:
dc_listings.head() 

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,maximum_nights,number_of_reviews
574,2,1.0,1.0,1.0,125.0,1,4,149
1593,2,1.0,1.5,1.0,85.0,1,30,49
3091,1,1.0,0.5,1.0,50.0,1,1125,1
420,2,1.0,1.0,1.0,209.0,4,730,2
808,12,5.0,2.0,5.0,215.0,2,1825,34


### Normalize columns

You may have noticed that while the `accommodates`, `bedrooms`, `bathrooms`, `beds`, and `minimum_nights` columns hover between 0 and 12 (at least in the first few rows), the values in the `maximum_nights` and `number_of_reviews` columns span much larger ranges. For example, within the first few rows alone, the `maximum_nights` column has values as low as 4 and as high as 1825. If we use these 2 columns as part of a k-nearest neighbors model, these attributes could end up having an outsized effect on the distance calculations simply because their values are so large.

In [12]:
print(dc_listings['accommodates'].max())
print(dc_listings['accommodates'].min())
print(dc_listings['bedrooms'].max())
print(dc_listings['bedrooms'].min())
print(dc_listings['bathrooms'].max())
print(dc_listings['bathrooms'].min())
print(dc_listings['beds'].max())
print(dc_listings['beds'].min())
print(dc_listings['maximum_nights'].max())
print(dc_listings['maximum_nights'].min())
print(dc_listings['number_of_reviews'].max())
print(dc_listings['number_of_reviews'].min())

16
1
10.0
0.0
8.0
0.0
16.0
1.0
2147483647
1
362
0


For example, 2 living spaces could be identical across every attribute but differ vastly on the maximum_nights column alone. If one listing had a maximum_nights value of 1825 and the other a maximum_nights value of 4, these listings would be considered very far apart, because the large values in that one column dominate the overall Euclidean distance. To prevent any single column from having too much of an impact on the distance, we can normalize all of the columns to have a mean of 0 and a standard deviation of 1.
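
The effect is easy to see with a few made-up listings. In this sketch the three tuples are hypothetical (accommodates, bathrooms, maximum_nights) values, and `math.dist` from the standard library (Python 3.8+) stands in for the Euclidean distance calculation.

```python
import math

# Hypothetical listings as (accommodates, bathrooms, maximum_nights).
listing_a = (2, 1.0, 4)
listing_b = (2, 1.0, 1825)   # identical to listing_a except for maximum_nights
listing_c = (12, 2.0, 4)     # a much larger space with the same maximum_nights

# Distance driven entirely by the unscaled maximum_nights column.
print(math.dist(listing_a, listing_b))

# Distance between two genuinely different living spaces.
print(math.dist(listing_a, listing_c))
```

Without normalization, the first pair ends up far further apart than the second, even though the second pair describes much more dissimilar living spaces.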

Normalizing the values in each column to the standard normal distribution (mean of 0, standard deviation of 1) preserves the distribution of the values in each column while aligning the scales. To normalize the values in a column to the standard normal distribution, you need to:

1. from each value, subtract the mean of the column
2. divide each resulting value by the standard deviation of the column

Note that in the second step you can divide by the standard deviation of either the original column or the mean-shifted column (first_transform, the result of the first step) and get the same answer.

This is because first_transform merely shifts the mean of the distribution and has no effect on its shape or scale. In other words, the variance of the original dc_listings column is the same as the variance of first_transform.

To apply this transformation across all of the columns in a DataFrame, you can use the corresponding DataFrame methods mean() and std(), as in (dc_listings - dc_listings.mean()) / dc_listings.std().
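
As a small check of the two single-column variants, the sketch below normalizes the five maximum_nights values shown in the head() output earlier. The names `col`, `normalized_a`, and `normalized_b` are introduced here for illustration.

```python
import pandas as pd

# The five maximum_nights values displayed in the head() output above.
col = pd.Series([4, 30, 1125, 730, 1825], name="maximum_nights", dtype="float64")

# Step 1: subtract the column mean from each value.
first_transform = col - col.mean()

# Step 2, variant A: divide by the standard deviation of the original column.
normalized_a = first_transform / col.std()

# Step 2, variant B: divide by the standard deviation of the shifted column.
normalized_b = first_transform / first_transform.std()

# Largest absolute difference between the two variants.
print((normalized_a - normalized_b).abs().max())

# Mean and standard deviation of the normalized column.
print(normalized_a.mean(), normalized_a.std())
```

Both variants agree up to floating-point rounding, and the normalized column has a mean of (approximately) 0 and a standard deviation of 1.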

### Instructions 

1. Normalize all of the feature columns in dc_listings and assign the new Dataframe containing just the normalized feature columns to normalized_listings.
2. Add the price column from dc_listings to normalized_listings.
3. Display the first 3 rows in normalized_listings.

In [13]:
normalized_listings = (dc_listings - dc_listings.mean()) / (dc_listings.std())

In [14]:
normalized_listings.head()

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,maximum_nights,number_of_reviews
574,-0.596544,-0.249467,-0.439151,-0.546858,-0.173345,-0.341375,-0.016604,4.57965
1593,-0.596544,-0.249467,0.412923,-0.546858,-0.464148,-0.341375,-0.016603,1.159275
3091,-1.095499,-0.249467,-1.291226,-0.546858,-0.718601,-0.341375,-0.016573,-0.482505
420,-0.596544,-0.249467,-0.439151,-0.546858,0.437342,0.487635,-0.016584,-0.448301
808,4.393004,4.507903,1.264998,2.829956,0.480962,-0.065038,-0.016553,0.646219


In [15]:
normalized_listings['price'] = dc_listings['price'] 

In [16]:
normalized_listings.head() 

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,maximum_nights,number_of_reviews
574,-0.596544,-0.249467,-0.439151,-0.546858,125.0,-0.341375,-0.016604,4.57965
1593,-0.596544,-0.249467,0.412923,-0.546858,85.0,-0.341375,-0.016603,1.159275
3091,-1.095499,-0.249467,-1.291226,-0.546858,50.0,-0.341375,-0.016573,-0.482505
420,-0.596544,-0.249467,-0.439151,-0.546858,209.0,0.487635,-0.016584,-0.448301
808,4.393004,4.507903,1.264998,2.829956,215.0,-0.065038,-0.016553,0.646219


### Euclidean distance for multivariate case

In the last mission, we trained 2 univariate k-nearest neighbors models. The first one used the accommodates attribute while the second one used the bathrooms attribute. Let's now train a model that uses both attributes when determining how similar 2 living spaces are. Recall that the Euclidean distance between 2 observations is the square root of the sum of the squared differences across every attribute.

Since we're using 2 attributes, to find the distance between 2 living spaces we need to calculate the squared difference between both accommodates values and the squared difference between both bathrooms values, add them together, and then take the square root of the resulting sum. For the first 2 rows in normalized_listings, the accommodates values are identical (-0.596544), so the distance reduces to the absolute difference between the bathrooms values (-0.439151 and 0.412923), which is 0.852074.
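
Written out under the standard definition of Euclidean distance, the calculation described above looks as follows. The symbols $q$, $p$, and $n$ are notation introduced here: $q_i$ and $p_i$ are the values of the $i$-th attribute for the 2 observations, and $n$ is the number of attributes.

\begin{equation}
d = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}
\end{equation}

With only the accommodates and bathrooms attributes ($n = 2$), this becomes:

\begin{equation}
d = \sqrt{(\text{accommodates}_1 - \text{accommodates}_2)^2 + (\text{bathrooms}_1 - \text{bathrooms}_2)^2}
\end{equation}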


So far, we've been calculating Euclidean distance by writing the logic for the equation ourselves. We can instead use the distance.euclidean() function from scipy.spatial, which takes in 2 vectors as the parameters and calculates the Euclidean distance between them. The euclidean() function expects:

1. both of the vectors to be represented using a list-like object (Python list, NumPy array, or pandas Series)
2. both of the vectors to be 1-dimensional and have the same number of elements

Here's a simple example:

In [17]:
from scipy.spatial import distance
first_listing = [-0.596544, -0.439151]
second_listing = [-0.596544, 0.412923]
dist = distance.euclidean(first_listing, second_listing)

In [18]:
dist

0.852074

### Instructions
1. Calculate the Euclidean distance using only the accommodates and bathrooms features between the first row and fifth row in normalized_listings using the distance.euclidean() function.

2. Assign the distance value to first_fifth_distance and display using the print function.

In [19]:
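# Select the two features of interest from the first and fifth rows.
first = normalized_listings.iloc[0][['accommodates', 'bathrooms']]
fifth = normalized_listings.iloc[4][['accommodates', 'bathrooms']]

first_fifth_distance = distance.euclidean(first, fifth)

In [20]:
print(first_fifth_distance)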

5.272543124668403

### Introduction to scikit-learn

So far, we've been writing functions from scratch to train the k-nearest neighbor models. While this is helpful deliberate practice for understanding how the mechanics work, you can be more productive and iterate more quickly by using a library that handles most of the implementation. In this screen, we'll learn about the scikit-learn library, which is the most popular machine learning library in Python. Scikit-learn contains functions for all of the major machine learning algorithms and a simple, unified workflow. Both of these properties allow data scientists to be incredibly productive when training and testing different models on a new dataset.

The scikit-learn workflow consists of 4 main steps:

1. instantiate the specific machine learning model you want to use
2. fit the model to the training data
3. use the model to make predictions
4. evaluate the accuracy of the predictions

We'll focus on the first 3 steps in this screen and the next screen.

Each model in scikit-learn is implemented as a separate class and the first step is to identify the class we want to create an instance of. In our case, we want to use the KNeighborsRegressor class.

Any model that helps us predict numerical values, like listing price in our case, is known as a regression model. 

The other main class of machine learning models is called classification, where we're trying to predict a label from a fixed set of labels (e.g. blood type or gender). 

The word regressor from the class name KNeighborsRegressor refers to the regression model class that we just discussed.

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()

If you refer to the documentation, you'll notice that by default:

1. n_neighbors: the number of neighbors, is set to 5
2. algorithm: for computing nearest neighbors, is set to auto
3. p: set to 2, corresponding to Euclidean distance

Let's set the algorithm parameter to brute and leave the n_neighbors value as 5, which matches the implementation we wrote in the last mission.

If we leave the algorithm parameter set to the default value of auto, scikit-learn will try to use tree-based optimizations to improve performance (which are outside of the scope of this mission).
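
A minimal sketch of that instantiation is shown below. Inspecting the settings with `get_params()` is an addition made here for illustration; it is not a required part of the workflow.

```python
from sklearn.neighbors import KNeighborsRegressor

# Brute-force neighbor search; n_neighbors is left at its default of 5.
knn = KNeighborsRegressor(algorithm='brute')

# get_params() reports the settings the instance will use.
params = knn.get_params()
print(params['n_neighbors'], params['algorithm'], params['p'])
```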

### Fitting a model and making predictions

Now, we can fit the model to the data using the fit method. For all models, the fit method takes in 2 required parameters:

1. matrix-like object, containing the feature columns we want to use from the training set.

2. list-like object, containing correct target values.

Matrix-like object means that the method is flexible in the input and either a Dataframe or a NumPy 2D array of values is accepted.

This means you can select the columns you want to use from the Dataframe and use that as the first parameter to the fit method.
If you recall from earlier in the mission, all of the following are acceptable list-like objects:

1. NumPy array
2. Python list
3. pandas Series object (e.g. when selecting a column)

You can select the target column from the DataFrame (e.g. train_df['price']) and use that as the second parameter to the fit method.

When the fit() method is called, scikit-learn stores the training data we specified within the KNeighborsRegressor instance (knn). If you try passing in data containing missing values or non-numerical values into the fit method, scikit-learn will raise an error. Scikit-learn contains many such features that help prevent us from making common mistakes.

Now that we specified the training data we want used to make predictions, we can use the predict method to make predictions on the test set.

The predict method has only one required parameter:

1. matrix-like object, containing the feature columns from the dataset we want to make predictions on

The number of feature columns you use during training and testing needs to match, or scikit-learn will raise an error:


knn = KNeighborsRegressor(algorithm='brute')
# The model must be fit before predict is called, on the same feature columns.
knn.fit(train_df[['accommodates', 'bathrooms']], train_df['price'])
predictions = knn.predict(test_df[['accommodates', 'bathrooms']])
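
To see fit and predict end to end without the listings data, here is a self-contained sketch on made-up numbers. `X_train`, `y_train`, and the choice of 2 neighbors are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up training data: one feature column and a target, for illustration only.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([0.0, 10.0, 20.0, 30.0, 40.0, 50.0])

knn = KNeighborsRegressor(n_neighbors=2, algorithm='brute')
knn.fit(X_train, y_train)

# The 2 nearest training points to 1.4 are 1.0 and 2.0, so the prediction
# is the mean of their target values.
prediction = knn.predict(np.array([[1.4]]))
print(prediction)

# Predicting with a different number of feature columns raises a ValueError.
try:
    knn.predict(np.array([[1.4, 0.0]]))
except ValueError as err:
    print("ValueError:", err)
```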

### Instructions
1. Create an instance of the KNeighborsRegressor class with the following parameters:

##### n_neighbors: 5
##### algorithm: brute

2. Use the fit method to specify the data we want the k-nearest neighbor model to use. Use the following parameters:

##### training data, feature columns: just the accommodates and bathrooms columns, in that order, from train_df.
##### training data, target column: the price column from train_df.

3. Call the predict method to make predictions on the accommodates and bathrooms columns from test_df, and assign the resulting NumPy array of predicted price values to predictions.

In [21]:
from sklearn.neighbors import KNeighborsRegressor

train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]

In [22]:
knn = KNeighborsRegressor(algorithm='brute', n_neighbors = 5)

In [23]:
# Matrix-like object, containing just the 2 columns of interest from training set.
train_features = train_df[['accommodates', 'bathrooms']]
# List-like object, containing just the target column, `price`.
train_target = train_df['price']
# Pass everything into the fit method.
knn.fit(train_features, train_target)

KNeighborsRegressor(algorithm='brute')

In [24]:
predictions = knn.predict(test_df[['accommodates', 'bathrooms']])

In [25]:
predictions 

array([  80.8,  251.2,   89.4,   80.8,   80.8,   80.8,  189.8,  167.8,
        167.8,  199. ,  251.2,  166.6,   81. ,  276.8,   80.8,   80.8,
         80.8,  166.6,   76.2,  982.2,   80.8,  245.8,  167.8,  216.2,
         80.8,   80.8,  167.8,  189.8,  225.8,   81. ,   81. ,   80.8,
         80.8,   80.8,   80.8,   80.8,   80.8,  166.6,  225.8,  245.4,
        225.8,   80.8,   81. ,  167.8,  135.2,  167.8,  167.8,   80.8,
         80.8,   80.8,   81. ,   80.8,   80.8,   80.8,  188. ,  135.2,
         92.4,  145.8,   80.8,  251.2,   80.8,  135.2,  167.8,   90.4,
         80.8,  135.2,   80.8,   80.8,   80.8,  135.2,  166.6,  223.6,
         80.8,  135.2,   80.8,  135.2,   80.8,  106.8,   80.8,   80.8,
         80.8,  135.2,  251.2,  189.8,   80.8,   80.8,   80.8,  135.2,
         89.4,  276.8,  199. ,   81. ,   81. ,   80.8,   80.8,  304.6,
        135.2,  135.2,  135.2,  167.8,   80.8,  135.2,   80.8,  216.2,
        167.8,   80.8,   81. ,   80.8,   80.8,   89.4,  225.8,   80.8,
      

### Calculating MSE using Scikit-Learn

In [26]:
from sklearn.metrics import mean_squared_error

train_columns = ['accommodates', 'bathrooms']
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute', metric='euclidean')
knn.fit(train_df[train_columns], train_df['price'])
predictions = knn.predict(test_df[train_columns])

In [27]:
mse = mean_squared_error(test_df['price'] , predictions) 

In [28]:
rmse = np.sqrt(mse)  # RMSE is the square root of the MSE (squared=False is deprecated in recent scikit-learn releases)

In [29]:
rmse

124.90201702396679
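
To make the relationship between the two metrics concrete, here is a sketch that computes both by hand with NumPy. The `actual` and `predicted` arrays are made-up numbers, not values from the model above.

```python
import numpy as np

# Made-up actual and predicted prices, for illustration only.
actual = np.array([100.0, 150.0, 200.0])
predicted = np.array([110.0, 140.0, 230.0])

# MSE: the mean of the squared differences between actual and predicted values.
manual_mse = np.mean((actual - predicted) ** 2)

# RMSE: the square root of the MSE, expressed in the same units as the target.
manual_rmse = np.sqrt(manual_mse)

print(manual_mse, manual_rmse)
```

Because the RMSE is in the same units as the price column, it is usually the easier of the two to interpret.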

### Using more features

As you can tell, the model we trained using both features ended up performing better (lower error score) than either of the univariate models from the last mission. Let's now train a model using the following 4 features:

1. accommodates
2. bedrooms
3. bathrooms
4. number_of_reviews

Scikit-learn makes it incredibly easy to swap the columns used during training and testing. We're going to leave this for you as a challenge to train and test a k-nearest neighbors model using these columns instead. 

### Instructions
Create a new instance of the KNeighborsRegressor class with the following parameters:

1. n_neighbors: 5
2. algorithm: brute

Fit a model that uses the following columns from our training set (train_df):
1. accommodates
2. bedrooms
3. bathrooms
4. number_of_reviews

Use the model to make predictions on the test set (test_df) using the same columns. Assign the NumPy array of predictions to four_predictions.

Use the mean_squared_error() function to calculate the MSE value for these predictions by comparing four_predictions with the price column from test_df.

Assign the computed MSE value to four_mse.

Calculate the RMSE value and assign to four_rmse.

Display four_mse and four_rmse using the print function.

In [30]:
features = ['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews']
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

In [31]:
#Split full dataset into train and test sets.
train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]
#Matrix-like object, containing just the 4 feature columns of interest from training set.
train_features = train_df[features] 
#List-like object, containing just the target column, `price`.
train_target = train_df['price']
# Pass everything into the fit method.
knn.fit(train_features, train_target)

KNeighborsRegressor(algorithm='brute')

In [32]:
predictions = knn.predict(test_df[features])

In [33]:
predictions 

array([ 102. ,  308. ,   82.2,   78. ,   78. ,   78. ,  109.2,  106.4,
        149.8,  128. ,  429. ,   85.6,  123. ,  161.8,   89.6,  104. ,
        140.8,   95.6,   86.4, 1002.2,  114.6,  193. ,  147.6,  111.4,
         78. ,   86.8,  141.6,  196.8,  290.6,  123. ,   91.6,   99.6,
        129.6,   99.6,  107.6,   93. ,  115.4,   82. ,  290.6,  171.6,
        232.8,  111.4,   91.6,  127.2,  146. ,  149.8,   90.8,   99.6,
         78. ,  111.4,   63. ,  126.2,  115.4,  138.8,  465. ,  146. ,
        112.6,  145.4,   99.6,  389.6,   93.4,  111. ,  185.8,  113. ,
         78. ,  116.8,   99.6,   91.2,  129.6,  113.6,   70.8,  137.2,
        110.2,  131. ,   99.6,   97.8,  104. ,  109.4,  107.6,   95.4,
        121. ,   97.8,  277.2,  135.2,  113.8,   99.6,  104. ,  129.6,
         84.6,  282.8,  185.4,  139.8,   69.8,   99.6,   72.6,  393.2,
         90.2,  114.4,  107.2,  124.4,  102. ,  146. ,  104. ,  219.4,
        131.6,   96.6,   76.6,  104. ,   99.6,  108. ,  290.6,   91.2,
      

In [34]:
from sklearn.metrics import mean_squared_error
train_columns = features 
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute', metric='euclidean')
knn.fit(train_df[train_columns], train_df['price'])
predictions = knn.predict(test_df[train_columns])

In [35]:
mse = mean_squared_error(test_df['price'] , predictions) 

In [36]:
mse

13322.432400455064

In [37]:
rmse = np.sqrt(mse)  # RMSE is the square root of the MSE (squared=False is deprecated in recent scikit-learn releases)

In [38]:
rmse

115.42284176217056

In [40]:
features = ['accommodates', 'bedrooms', 'bathrooms', 'beds',
       'minimum_nights', 'maximum_nights', 'number_of_reviews']

In [41]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')


In [42]:
#Split full dataset into train and test sets.
train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]
#Matrix-like object, containing all 7 feature columns from training set.
train_features = train_df[features] 
#List-like object, containing just the target column, `price`.
train_target = train_df['price']
# Pass everything into the fit method.
knn.fit(train_features, train_target)

KNeighborsRegressor(algorithm='brute')

In [43]:
predictions = knn.predict(test_df[features])

In [44]:
from sklearn.metrics import mean_squared_error
train_columns = features 
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute', metric='euclidean')
knn.fit(train_df[train_columns], train_df['price'])
predictions = knn.predict(test_df[train_columns])

In [46]:
mse = mean_squared_error(test_df['price'] , predictions) 

In [47]:
mse

15455.275631399316

In [45]:
rmse = np.sqrt(mse)  # RMSE is the square root of the MSE (squared=False is deprecated in recent scikit-learn releases)

In [48]:
rmse

124.31924883701363

Interestingly enough, the RMSE value actually increased to 124.3 (up from 115.4 with 4 features) when we used all of the features available to us. This means that selecting the right features is important and that using more features doesn't automatically improve prediction accuracy. We should re-phrase the lever we mentioned earlier from:

1. increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors

to:

1. select the relevant attributes the model uses to calculate similarity when ranking the closest neighbors

The process of selecting features to use in a model is known as feature selection.
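
One way to compare candidate feature sets is to wrap the fit, predict, and evaluate steps in a helper and call it once per set. The sketch below is an assumption-laden illustration: `rmse_for_features` is a hypothetical helper, and `demo` is synthetic data generated here as a stand-in for normalized_listings, so the printed numbers say nothing about the real listings.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

def rmse_for_features(train_df, test_df, features, target='price', k=5):
    """Fit a brute-force k-NN model on `features` and return the test RMSE."""
    knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
    knn.fit(train_df[features], train_df[target])
    predictions = knn.predict(test_df[features])
    return float(np.sqrt(mean_squared_error(test_df[target], predictions)))

# Synthetic stand-in data: price depends on `signal` only, `noise` is irrelevant.
rng = np.random.default_rng(1)
demo = pd.DataFrame({'signal': rng.normal(size=200),
                     'noise': rng.normal(size=200)})
demo['price'] = 100 + 50 * demo['signal']
train_demo, test_demo = demo.iloc[:150], demo.iloc[150:]

# Evaluate each candidate feature set with the same train/test split.
for candidate in (['signal'], ['signal', 'noise']):
    print(candidate, rmse_for_features(train_demo, test_demo, candidate))
```

The same loop could be pointed at train_df and test_df with different subsets of the listing columns to see which combination gives the lowest RMSE.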

In this mission, we prepared the data to be able to use more features, trained a few models using multiple features, and evaluated the different performance tradeoffs. We explored how using more features doesn't always improve the accuracy of a k-nearest neighbors model. In the next mission, we'll explore another knob for tuning k-nearest neighbor models: the k value.

## Syntax

#### Displaying the number of non-null values in the columns of a DataFrame:

dc_listings.info()

#### Removing rows from a DataFrame that contain a missing value:

dc_listings.dropna(axis=0, inplace=True)

#### Normalizing a column using pandas:

first_transform = dc_listings['maximum_nights'] - dc_listings['maximum_nights'].mean()
normalized_col = first_transform / first_transform.std()

#### Normalizing a DataFrame using pandas:

normalized_listings = (dc_listings - dc_listings.mean()) / (dc_listings.std())

#### Calculating Euclidean distance using SciPy:

from scipy.spatial import distance
first_listing = [-0.596544, -0.439151]
second_listing = [-0.596544, 0.412923]
dist = distance.euclidean(first_listing, second_listing)

#### Using the KNeighborsRegressor to instantiate an empty model for K-Nearest Neighbors:

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()

#### Using the fit method to fit the K-Nearest Neighbors model to the data:

train_df = normalized_listings.iloc[0:2792]

test_df = normalized_listings.iloc[2792:]

train_features = train_df[['accommodates', 'bathrooms']]

train_target = train_df['price']
knn.fit(train_features, train_target)

#### Using the predict method to make predictions on the test set:

predictions = knn.predict(test_df[['accommodates', 'bathrooms']])

#### Calculating MSE using scikit-learn:

from sklearn.metrics import mean_squared_error
two_features_mse = mean_squared_error(test_df['price'], predictions)

## Concepts

To reduce the RMSE value during validation and improve accuracy, you can:

Select the relevant attributes a model uses. When selecting attributes, you want to make sure you're working only with columns that have continuous values. The process of selecting features to use in a model is known as `feature selection`.

Increase the value of k in our algorithm.

We can normalize the columns to prevent any single column from having too much of an impact on distance.

Normalizing the values to a standard normal distribution preserves the distribution while aligning the scales. 

Let $x$ be a value in a specific column, $\mu$ be the mean of all values within that column, and $\sigma$ be the standard deviation of the values within that column. Then the mathematical formula to normalize the values is as follows:

\begin{equation}
x = \frac{x - \mu}{\sigma}
\end{equation}

The distance.euclidean() function from scipy.spatial expects:

1. Both of the vectors to be represented using a list-like object (Python list, NumPy array, or pandas Series).

2. Both of the vectors must be 1-dimensional and have the same number of elements.



The scikit-learn library is the most popular machine learning library in Python. Scikit-learn contains functions for all of the major machine learning algorithms, with each model implemented as a separate class. The workflow consists of four main steps:

1. Instantiate the specific machine learning model you want to use.
2. Fit the model to the training data.
3. Use the model to make predictions.
4. Evaluate the accuracy of the predictions.

One main class of machine learning models is known as a `regression model`, which predicts a numerical value.

The other main class of machine learning models is called `classification`, which is used when we're trying to predict a label from a fixed set of labels.

The fit method accepts a matrix-like object (the feature columns) and a list-like object (the target values), while the predict method accepts a matrix-like object.

The mean_squared_error() function takes in two inputs:

1. A list-like object representing the actual values.
2. A list-like object representing the predicted values using the model.