# Feature Engineering and k-Nearest Neighbors with the California Housing Prices Data Set
* [Overview](#overview)
* [Using Seaborn](#using-seaborn)
* [Reviewing the Data Set](#reviewing-the-data-set)
* [Examining the Categorical Data](#examining-the-categorical-data)
* [One-Hot Encoding](#one-hot-encoding)
* [k Nearest Neighbors](#k-nearest-neighbors)
* [Rescaling the Data](#rescaling-the-data)
* [Putting It Together](#putting-it-together)

## Overview

## Using Seaborn 

*seaborn* is a Python plotting library built on top of matplotlib. It provides a higher-level interface for statistical graphics, such as scatterplots and boxplots that group or color the data by a variable. You should be able to install seaborn with whatever method you've used for other packages (conda or pip). We can then import it.

In [1]:
import seaborn as sns #import the seaborn library

Seaborn has a bunch of nice plotting features. One thing that I like is the ability to create scatterplots whose points are color-coded by a chosen variable, using the `hue` argument of the [sns.scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) command.

## Reviewing the Data Set 

We will be working with the California Housing Prices Data Set from two weeks ago. 

In [2]:
import pandas as pd

housing_df = pd.read_csv("california-housing.csv")

We can use seaborn's scatterplot command to visualize how location affects price in this data set.
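As a minimal sketch of that idea, we can plot latitude against longitude and color each point by `median_house_value`. This assumes the data has a `longitude` column, which does not appear in the column listing in this section, so the example builds a small made-up DataFrame (`toy_df`, a hypothetical name) instead of using `housing_df` directly.

```python
import pandas as pd
import seaborn as sns

# Toy stand-in for housing_df. The longitude values and the last two rows
# are made up for illustration; they are not taken from the real file.
toy_df = pd.DataFrame({
    "longitude": [-122.23, -122.22, -118.30, -117.10],
    "latitude": [37.88, 37.86, 34.05, 32.75],
    "median_house_value": [452600.0, 358500.0, 210000.0, 180000.0],
})

# hue= colors each point by the value in the named column.
ax = sns.scatterplot(
    data=toy_df,
    x="longitude",
    y="latitude",
    hue="median_house_value",
    palette="viridis",
)
ax.set_title("District location, colored by median house value")
```

With the full data set, the same call with `data=housing_df` should work as long as the column names match.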

## Examining the Categorical Data

A quick review: here is what the data columns look like. 

In [3]:
housing_df.head()

| Unnamed: 0.1 | Unnamed: 0 | latitude | housing_median_age | total_rooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 37.88 | 41.0 | 880.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | 1 | 37.86 | 21.0 | 7099.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | 2 | 37.85 | 52.0 | 1467.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | 3 | 37.85 | 52.0 | 1274.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | 4 | 37.85 | 52.0 | 1627.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [4]:
housing_df.dtypes #access the data types of the columns. 

Unnamed: 0              int64
latitude              float64
housing_median_age    float64
total_rooms           float64
population            float64
households            float64
median_income         float64
median_house_value    float64
ocean_proximity        object
dtype: object

There is one column that is not numeric. (We could automate the check for categorical variables by using the [select_dtypes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html) command.)
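Here is a short sketch of that automated check. It uses a two-row toy DataFrame (`toy_df`, a hypothetical name) so that it runs on its own; `exclude="number"` asks `select_dtypes` for every column that is not numeric.

```python
import pandas as pd

# Toy stand-in for housing_df with two numeric columns and one text column.
toy_df = pd.DataFrame({
    "latitude": [37.88, 37.86],
    "median_house_value": [452600.0, 358500.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY"],
})

# Keep only the non-numeric columns and list their names.
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

The same line with `housing_df` in place of `toy_df` should pick out `ocean_proximity`.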

In the past, we dealt with this column by dropping it. Now, we want to see if it actually makes a difference to our data set. We'll analyze this both quantitatively using pandas and visually in seaborn. 

Let's see how many rows fall into each category of ocean_proximity, as well as how many unique categories there are.
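One way to get both numbers in pandas is with `value_counts` and `nunique`. The sketch below uses a made-up Series (the labels other than NEAR BAY are assumptions about what the column contains) so that it is self-contained.

```python
import pandas as pd

# Toy stand-in for housing_df["ocean_proximity"]; the values are illustrative.
proximity = pd.Series(
    ["NEAR BAY", "INLAND", "INLAND", "NEAR OCEAN", "INLAND"],
    name="ocean_proximity",
)

counts = proximity.value_counts()   # number of rows in each category
n_categories = proximity.nunique()  # number of distinct categories

print(counts)
print(n_categories)
```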

Let's create a latitude-longitude scatterplot that shows what these categorical features represent.

Let's also use seaborn to create some boxplots to visualize how the housing prices vary with the different categories, using the [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html) command. 
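Both plots can be sketched side by side as follows. As before, this is an illustration on made-up data: `toy_df` is a hypothetical name, the `longitude` column is an assumption (it does not appear in the column listing above), and the category labels other than NEAR BAY are invented for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for housing_df; values are made up for illustration.
toy_df = pd.DataFrame({
    "longitude": [-122.23, -122.22, -122.24, -119.50, -119.80, -120.10],
    "latitude": [37.88, 37.86, 37.85, 36.50, 36.40, 36.30],
    "median_house_value": [452600.0, 358500.0, 352100.0,
                           120000.0, 110000.0, 115000.0],
    "ocean_proximity": ["NEAR BAY"] * 3 + ["INLAND"] * 3,
})

fig, (ax_map, ax_box) = plt.subplots(1, 2, figsize=(12, 5))

# Left: where each category sits geographically.
sns.scatterplot(data=toy_df, x="longitude", y="latitude",
                hue="ocean_proximity", ax=ax_map)

# Right: how house values are distributed within each category.
sns.boxplot(data=toy_df, x="ocean_proximity", y="median_house_value",
            ax=ax_box)
ax_box.tick_params(axis="x", labelrotation=45)
```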

## One-Hot Encoding

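Our machine learning algorithms are mathematical processes based on numbers. To use these categorical variables, one approach is to use one-hot encoding: each category becomes its own column, containing 1 where a row belongs to that category and 0 otherwise.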
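One-hot encoding replaces a categorical column with one 0/1 indicator column per category. A minimal sketch using pandas' `get_dummies` is below; choosing `get_dummies` is an assumption (sklearn's `OneHotEncoder` is another option), and `toy_df` is a made-up example rather than the real data.

```python
import pandas as pd

# Toy stand-in for housing_df; the INLAND row is made up for illustration.
toy_df = pd.DataFrame({
    "median_income": [8.3252, 3.8462, 2.5],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],
})

# Replace ocean_proximity with one integer indicator column per category.
encoded_df = pd.get_dummies(toy_df, columns=["ocean_proximity"], dtype=int)
print(encoded_df)
```

Each new column is named after the original column and the category, for example `ocean_proximity_INLAND`.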

## k Nearest Neighbors

We plan to use the *k-nearest-neighbors* approach to regression. sklearn implements this with the [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) class.

The class has many options. Some of them are:

1. *n_neighbors* tells how many neighbors to use in the prediction. Default is 5.
2. *weights* tells how to weight the responses from the neighbors (`uniform`, or `distance` to weight closer neighbors more heavily).

We will discuss how some of these options work, as well as some of the other options, in tomorrow's videos.
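To make the two options above concrete, here is a small sketch on made-up one-feature data (the arrays and variable names are hypothetical). It fits the same training set with `weights="uniform"` and `weights="distance"` and predicts at one new point.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy training data: one feature, with responses that grow with it.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Same number of neighbors, two different weighting schemes.
uniform_knn = KNeighborsRegressor(n_neighbors=3, weights="uniform")
distance_knn = KNeighborsRegressor(n_neighbors=3, weights="distance")
uniform_knn.fit(X_train, y_train)
distance_knn.fit(X_train, y_train)

# The three nearest training points to 2.2 are x = 2, 3, and 1.
X_new = np.array([[2.2]])
uniform_pred = uniform_knn.predict(X_new)[0]    # plain average of 20, 30, 10
distance_pred = distance_knn.predict(X_new)[0]  # closer neighbors count more

print(uniform_pred, distance_pred)
```

With uniform weights the prediction is the plain average of the three neighbors' responses. With distance weights the nearest neighbor (x = 2) dominates, although here the other two neighbors still nudge the result slightly above 20.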

![Example:nn-regression-data-set](nn-regression-data-set.png)

It's important to remember here that distance doesn't just mean physical distance. When we use this with the housing data set, all of the variables will be used in calculating distance. A better term might be "similarity".

## Rescaling the Data

When we look at distance, it's important that features be on the same scale. For instance, is a housing district which is 10000 dollars away "closer" than one that is 2 degrees of longitude away? 

To address this issue, we use 
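As a sketch of one common option, sklearn's `StandardScaler` rescales each feature to have mean 0 and standard deviation 1. Using `StandardScaler` here is an assumption, since other scalers such as `MinMaxScaler` would also put features on a comparable scale. The data below is made up to mirror the dollars-versus-degrees question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features: column 0 is a dollar amount, column 1 is longitude in degrees.
X = np.array([
    [300000.0, -122.0],  # district 0
    [310000.0, -122.0],  # 10,000 dollars away from district 0
    [300000.0, -120.0],  # 2 degrees of longitude away from district 0
    [150000.0, -118.0],
])

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# On the raw features, the dollar column swamps the longitude column.
raw_price_gap = euclidean(X[0], X[1])
raw_lon_gap = euclidean(X[0], X[2])

# After standardizing, both columns are measured in standard deviations.
X_scaled = StandardScaler().fit_transform(X)
scaled_price_gap = euclidean(X_scaled[0], X_scaled[1])
scaled_lon_gap = euclidean(X_scaled[0], X_scaled[2])

print(raw_price_gap, raw_lon_gap)
print(scaled_price_gap, scaled_lon_gap)
```

On the raw numbers the district 2 degrees away looks far closer, simply because degrees are small numbers and dollars are large ones. After rescaling, the ordering flips for this toy data: a 10,000 dollar gap is small relative to the spread of the dollar column.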

## Putting It Together
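
One way the pieces above might be combined is sketched below: one-hot encode `ocean_proximity`, rescale the numeric columns, and feed the result to `KNeighborsRegressor`. The use of `ColumnTransformer`, `make_pipeline`, and `StandardScaler` is an assumption about how to wire things up, and `toy_df` is a small made-up DataFrame (the INLAND rows are invented) so that the example runs on its own.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for housing_df; the INLAND rows are made up for illustration.
toy_df = pd.DataFrame({
    "latitude": [37.88, 37.86, 37.85, 36.50, 36.40, 36.30],
    "median_income": [8.3252, 8.3014, 7.2574, 3.0, 2.5, 2.8],
    "ocean_proximity": ["NEAR BAY"] * 3 + ["INLAND"] * 3,
    "median_house_value": [452600.0, 358500.0, 352100.0,
                           120000.0, 110000.0, 115000.0],
})
X = toy_df.drop(columns="median_house_value")
y = toy_df["median_house_value"]

# Rescale the numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["latitude", "median_income"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

# Chain preprocessing and the regressor so both are fit together.
model = make_pipeline(preprocess, KNeighborsRegressor(n_neighbors=3))
model.fit(X, y)

# Predict for a new, made-up inland district.
new_district = pd.DataFrame({
    "latitude": [36.45],
    "median_income": [2.7],
    "ocean_proximity": ["INLAND"],
})
prediction = model.predict(new_district)[0]
print(prediction)
```

Wrapping the steps in a pipeline means the scaler and encoder are fit only on the data passed to `fit`, and the same transformations are then applied automatically at prediction time.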