# Introduction

Welcome to **M148 - Data Science Fundamentals!** This course is designed to equip you with the tools and experience you need to start a lifelong exploration of data science. We do not assume any prior knowledge or experience.

For this first project, we will introduce you to the end-to-end process of doing a data science project. Our goals for this project are to:

1. Familiarize you with the development environment for doing data science
2. Get you comfortable with the Python coding required to do data science
3. Provide you with a sample end-to-end project to help you visualize the steps needed to complete a project on your own
4. Ask you to recreate a similar project on a separate dataset

In this project you will work through an example project end to end. Many of the concepts you encounter will be unclear to you. That is OK! The course is designed to teach these concepts in further detail. For now, our focus is simply on having you replicate the code successfully and see a project through from start to finish.

Here are the main steps:

1. Get the data
2. Visualize the data for insights
3. Preprocess the data for your machine learning algorithm
4. Select a model and train
5. Does it meet the requirements? Fine tune the model

![steps](images/MLProcess.jpg)




## Working with Real Data

It is best to experiment with real data as opposed to artificial datasets.

There are many different open datasets depending on the type of problems you might be interested in!

Here are a few data repositories you could check out:
- [UCI Datasets](https://archive.ics.uci.edu/ml/)
- [Kaggle Datasets](https://www.kaggle.com/)
- [AWS Datasets](https://registry.opendata.aws)


## Submission Instructions
**The project is due April 26th at 12:00 pm (noon). To submit the project, please save the notebook as a PDF file and submit it via Gradescope. In addition, make sure that all figures are legible and sufficiently large.**


# Example Data Science Exercise
Below we will run through a California housing example based on data collected in the 1990s.

## Setup

Before getting started, it is always good to check the versions of important packages. Knowing the version number makes it easier to look up the correct documentation.

To run this project, you will need the following packages installed with at least the minimum version number provided:
- Python >= 3.9
- Scikit-learn >= 1.0.2
- NumPy >= 1.18.5
- SciPy >= 1.1.0
- Pandas >= 1.4.0
- Matplotlib >= 3.3.2

The following code imports these packages and checks their version numbers. If an assertion error occurs, you may not have the correct version installed.

**Important: If installed using a package manager like Anaconda or pip, these dependencies should be resolved. Please follow the Python setup guide provided during the week 1 discussion.**

In [None]:
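#Import and Version Test
import re
import sys

def version_tuple(version):
    #Turn a version string such as "1.10.0" into (1, 10, 0) so that versions compare numerically.
    #Comparing the raw strings would wrongly rank "1.10.0" below "1.4.0".
    parts = [int(p) for p in re.findall(r"\d+", version)[:3]]
    return tuple(parts + [0] * (3 - len(parts)))

#Python version test
assert sys.version_info >= (3, 9) # python>=3.9

#Machine learning library
import sklearn
assert version_tuple(sklearn.__version__) >= (1, 0, 2) # sklearn >= 1.0.2

#numerical packages in python
import numpy as np
assert version_tuple(np.__version__) >= (1, 18, 5) # numpy >= 1.18.5

#Another numerical package, unused directly but implicitly used in sklearn
#Check the version just in case
import scipy as scp
assert version_tuple(scp.__version__) >= (1, 1, 0) # scipy >= 1.1.0

#Package for data manipulation and analysis
import pandas as pd
assert version_tuple(pd.__version__) >= (1, 4, 0) # pandas >= 1.4.0

#plotting library
import matplotlib
assert version_tuple(matplotlib.__version__) >= (3, 3, 2) # matplotlib >= 3.3.2
#matplotlib magic for inline figures
%matplotlib inline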


In [None]:
import os
import tarfile
import urllib
DATASET_PATH = os.path.join("datasets", "housing")

In [None]:
#Other setup with necessary plotting

#Instead of using matplotlib directly, we will use its pyplot interface, imported as plt
import matplotlib.pyplot as plt

# Set random seed to make this notebook's output identical at every run
np.random.seed(42)

# Plotting Utilities

# Where to save the figures
ROOT_DIR = "."
IMAGES_PATH = os.path.join(ROOT_DIR, "images")
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_name, tight_layout=True, fig_extension="png", resolution=300):
    '''
        plt.savefig wrapper. refer to 
        https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.savefig.html
    '''
    path = os.path.join(IMAGES_PATH, fig_name + "." + fig_extension)
    print("Saving figure", fig_name)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)


## Step 1. Getting the data

### Intro to Data Exploration Using Pandas

In this section we will load the dataset, do some cleaning, and visualize different
features using different types of plots.

Packages we will use:
- **[Pandas](https://pandas.pydata.org):** a fast, flexible, and expressive library for working with tabular and multidimensional datasets.
- **[Matplotlib](https://matplotlib.org):** a 2D Python plotting library that you can use to create quality figures (you can plot almost anything if you're willing to code it out!)
    - other plotting libraries: [seaborn](https://seaborn.pydata.org), [ggplot2](https://ggplot2.tidyverse.org) (for R)

In [None]:
import pandas as pd

def load_housing_data(housing_path):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

First, we load the dataset into a pandas DataFrame, which you can think of as an array/table. The DataFrame has a lot of useful functionality that we will use throughout the class.

In [None]:
housing = load_housing_data(DATASET_PATH) # we load the pandas dataframe
housing.head() # show the first few rows of the dataframe
               # typically this is the first thing you do
               # to see what the dataframe looks like

A dataset may have different types of features:
- real valued
- discrete (integers)
- categorical (strings)

The last two types are essentially the same, as you can always map a categorical string/character to an integer.

In this dataset, all our features are real-valued floats, except ocean proximity, which is categorical.
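
As a small illustration of these feature types, the sketch below builds a toy table (the values are made up and are not the housing dataset) and maps the categorical strings to integer codes. The variable names `toy` and `codes` are only for this example.

```python
import pandas as pd

# Toy data invented for illustration; it is not part of the housing dataset.
toy = pd.DataFrame({
    "median_income": [8.3, 3.1, 5.6],                        # real valued
    "households": [126, 1138, 177],                          # discrete (integers)
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],   # categorical (strings)
})
print(toy.dtypes)

# Map each category string to an integer code.
# pandas sorts the categories, so "INLAND" becomes 0 and "NEAR BAY" becomes 1.
codes = toy["ocean_proximity"].astype("category").cat.codes
print(codes.tolist())
```

Integer codes impose an ordering on the categories, which is one reason the pipeline later in this notebook uses one-hot encoding instead.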

In [None]:
# to see a concise summary of data types, null values, and counts
# use the info() method on the dataframe
housing.info()

In [None]:
# you can access individual columns similarly
# to accessing elements in a python dict
print(housing["ocean_proximity"].head()) # added head() to avoid printing many rows.

#Additionally, columns can be accessed as attributes of the dataframe object
#This is convenient but should be used with care: a column whose name matches a
#built-in DataFrame method such as housing.min() cannot be accessed this way
print(housing.ocean_proximity.head())


In [None]:
# to access a particular row we can use iloc
housing.iloc[1] 
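
To make the difference between position-based and label-based row access concrete, here is a minimal sketch on a toy table whose index labels are deliberately not 0, 1, 2. The data and the names `toy`, `second_row`, and `same_row` are made up for this example.

```python
import pandas as pd

# Toy data with custom index labels (10, 20, 30) instead of the default 0, 1, 2
toy = pd.DataFrame({"population": [322, 2401, 496]}, index=[10, 20, 30])

second_row = toy.iloc[1]   # iloc selects by position: the second row
same_row = toy.loc[20]     # loc selects by index label: the row labeled 20

print(second_row)
print(same_row)
```

With the default integer index, as in `housing`, positions and labels coincide until rows are filtered or reordered.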

In [None]:
# one other function that might be useful is
# value_counts(), which counts the number of occurrences
# of each value of a categorical feature
housing["ocean_proximity"].value_counts()

In [None]:
# The describe function computes common summary statistics for each non-categorical column
housing.describe()

We can also perform groupings based on categorical values and analyze each group.

In [None]:
housing_group = housing.groupby('ocean_proximity')
#Has the mean for every column grouped by ocean proximity
housing_mean = housing_group.mean()
housing_mean

In [None]:
#We can also get the subset of data associated with that group

housing_inland = housing_group.get_group("INLAND")
housing_inland

In [None]:
#We can thus perform operations on each group separately
housing_inland.describe()

**Grouping is a powerful technique within pandas, and we recommend reading the user guide to understand it better [here](https://pandas.pydata.org/docs/user_guide/groupby.html)**
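
As a sketch of what else grouping can do, the example below computes a different statistic per column for each group in a single call. The toy data and the output column names `mean_income` and `total_population` are made up for this illustration.

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
    "median_income": [2.0, 6.0, 4.0, 8.0],
    "population": [1000, 300, 500, 700],
})

# Named aggregation: each output column is (input column, statistic)
summary = toy.groupby("ocean_proximity").agg(
    mean_income=("median_income", "mean"),
    total_population=("population", "sum"),
)
print(summary)
```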

In addition to grouping, we can also filter out the data based on our desired criteria.

In [None]:
housing_expensive = housing[housing["median_house_value"] > 50000]
housing_expensive.head()

In [None]:
#We can combine multiple criteria
housing_expensive_small = housing[(housing["median_house_value"] > 50000) & (housing["population"] < 1000)]
housing_expensive_small.head()

**If you want to learn about different ways of accessing elements or about other functions, check out the getting started section of pandas [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html). For a full look at all the functionality that pandas offers, check out the user guide [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html)**
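
The filters used earlier in this step rely on boolean masks. The sketch below, on made-up toy data, shows the mask explicitly and an equivalent `query` call. Each condition is wrapped in parentheses because `&` binds more tightly than `>` and `<` in Python.

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    "median_house_value": [40000, 120000, 300000, 90000],
    "population": [800, 1500, 400, 950],
})

# A boolean mask has one True/False entry per row
mask = (toy["median_house_value"] > 50000) & (toy["population"] < 1000)
print(mask.tolist())

selected = toy[mask]

# The same filter written as a query string
same = toy.query("median_house_value > 50000 and population < 1000")
print(selected)
```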

## Step 2. Visualizing the data 

### Let's start visualizing the dataset

In [None]:
# We can draw a histogram for each of the dataframe's features
# using the built-in hist function of DataFrame
housing.hist(bins=50, figsize=(20,15))
#save_fig("attribute_histogram_plots")
plt.show() # pandas internally uses matplotlib, and to display all the figures
           # the show() function must be called

In [None]:
# if you want to have a histogram on an individual feature:
housing["median_income"].hist()
plt.show()

In [None]:
#You can even plot one histogram per group by passing the grouping column to the by argument
housing["median_income"].hist(by=housing["ocean_proximity"], figsize=(20,15))
plt.show()

In [None]:
#We can also plot statistics of each group
housing_group_mean = housing.groupby("ocean_proximity").mean()

housing_group_mean.plot.bar(y ="median_income")

We can convert a floating point feature to a categorical feature
by binning or by defining a set of intervals. 

For example, to bin the households based on median_income we can use the pd.cut function. Note that we use np.inf to represent infinity, which pandas handles internally. Thus, the last bin is $(6,\infty)$.

In [None]:
# assign each bin a categorical value [1, 2, 3, 4, 5] in this case.
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

housing["income_cat"].value_counts()

In [None]:
housing["income_cat"].hist()
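
To see how `pd.cut` assigns values to bins, here is a minimal sketch on a handful of made-up incomes, using the same bin edges as the cell that creates `income_cat`. By default each bin includes its right edge, so 1.5 falls in bin 1 and 6.0 falls in bin 4.

```python
import numpy as np
import pandas as pd

# Made-up income values chosen to land on and around the bin edges
incomes = pd.Series([0.5, 1.5, 1.6, 4.5, 6.0, 15.0])

cats = pd.cut(incomes,
              bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
              labels=[1, 2, 3, 4, 5])

# Bins are (0, 1.5], (1.5, 3], (3, 4.5], (4.5, 6], (6, inf)
print(cats.tolist())
```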

**Next, let's visualize the housing data based on latitude and longitude coordinates**

In [None]:
## here's a not so interesting way of plotting it
housing.plot(kind="scatter", x="longitude", y="latitude")
#save_fig("bad_visualization_plot")

In [None]:
# we can make it look a bit nicer by using the alpha parameter, 
# it simply plots less dense areas lighter.
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
#save_fig("better_visualization_plot")

In [None]:
# A more interesting plot is to color code (heatmap) the dots
# based on median house value. The code below achieves this

# load an image of california
images_path = os.path.join('./', "images")
os.makedirs(images_path, exist_ok=True)
filename = "california.png"

import matplotlib.image as mpimg
california_img=mpimg.imread(os.path.join(images_path, filename))
ax = housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
                       s=housing['population']/100, label="Population",
                       c="median_house_value", cmap=plt.get_cmap("jet"),
                       colorbar=False, alpha=0.4,
                      )
# overlay the California map on the plotted scatter plot
# note: plt.imshow still refers to the most recent figure
# that hasn't been shown yet.
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,
           cmap=plt.get_cmap("jet"))
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

# setting up heatmap colors based on median_house_value feature
prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
# place one tick per label so that the 11 labels line up with the colorbar
cb = plt.colorbar(ticks=tick_values/prices.max())
cb.ax.set_yticklabels(["$%dk"%(round(v/1000)) for v in tick_values], fontsize=14)
cb.set_label('Median House Value', fontsize=16)

plt.legend(fontsize=16)
#save_fig("california_housing_prices_plot")
plt.show()

Not surprisingly, we can see that the most expensive houses are concentrated around the San Francisco and Los Angeles areas.

Up until now we have only visualized feature histograms and basic statistics.

When developing machine learning models, what matters is how predictive a feature is for the target of interest.

It may be that only a few features are useful for the target at hand, or features may need to be augmented by applying certain transformations.

We can explore this using correlation matrices. Each row and column of the correlation matrix represents a non-categorical feature in our dataset, and each element specifies the correlation between the row and column features. [Correlation](https://en.wikipedia.org/wiki/Correlation) measures how strongly two features vary together linearly; on its own it does not show that one feature causes the other to change. For example, a positive correlation means that as one feature gets larger, the other feature also generally gets larger. Note that a feature is always fully correlated with itself, which is why the diagonal of the correlation matrix is all 1s.
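
To make the correlation coefficient concrete, the sketch below computes the Pearson correlation by hand on made-up numbers and compares it with NumPy's built-in result. The helper name `pearson` is only for this illustration.

```python
import numpy as np

# Made-up data: y grows roughly linearly with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

def pearson(a, b):
    # Center both features, then divide their co-variation
    # by the product of their individual variations
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r = pearson(x, y)
print(r)                         # close to +1: strong positive correlation
print(np.corrcoef(x, y)[0, 1])   # NumPy's built-in computation
print(pearson(x, x))             # a feature with itself
print(pearson(x, -x))            # a perfectly negative relationship
```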

In [None]:
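# keep only the numerical columns; recent pandas versions raise an error on string columns
corr_matrix = housing.select_dtypes(include="number").corr()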
corr_matrix

In [None]:
# for example, if the target is "median_house_value", we can sort the features by their correlation with it.
# The most correlated feature happens to be "median_income", which also makes intuitive sense.
corr_matrix["median_house_value"].sort_values(ascending=False)

In [None]:
# We can plot a scatter matrix for different attributes/features 
# to see how some features may show a positive correlation/negative correlation or
# it may turn out to be completely random!
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
#save_fig("scatter_matrix_plot")

In [None]:
# median income vs. median house value (the second plot in the first row of the scatter matrix above)
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)
plt.axis([0, 16, 0, 550000])
#save_fig("income_vs_house_value_scatterplot")

### Augmenting Features: Simple Example
New features can be created by combining different columns from our data set.

- rooms_per_household = total_rooms / households
- bedrooms_per_room = total_bedrooms / total_rooms
- etc.

In [None]:
#A new column in the dataframe can be made the same way you add a new element to a dict
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

In [None]:
# obtain new correlations
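# keep only the numerical columns; recent pandas versions raise an error on string columns
corr_matrix = housing.select_dtypes(include="number").corr()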
corr_matrix["median_house_value"].sort_values(ascending=False)

In [None]:
housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value",
             alpha=0.2)
plt.axis([0, 5, 0, 520000])
plt.show()

In [None]:
housing.describe()

### Augmenting Features: Advanced Example
In addition to augmenting the data using these simple operations, we can also do some advanced augmentation by bringing in information from another dataset.

In this case, we are going to find the distance between the houses and the 10 biggest cities in California in 1990. Intuitively, the location of major cities can strongly impact the value of a home. Thus, our new feature will be the distance from the home to the closest of these 10 cities.

To perform this feature extraction, we will use the provided dataset "city_data.csv". We will also employ some helper functions and use the DataFrame.apply function to do the augmentation.

In [None]:
#Loads the city data
def load_city_data(housing_path):
    csv_path = os.path.join(housing_path, "city_data.csv")
    return pd.read_csv(csv_path)

city_data = load_city_data(DATASET_PATH)
city_data

In [None]:
#For ease of use, we will convert city_data into a python dict 
#where the key is the city name and the value is the coordinates
city_dict = {}
for dat in city_data.iterrows(): #iterates through the rows of the dataframe
    row = dat[1]    
    city_dict[row["City"]] = (row["Latitude"],row["Longitude"])
    
print(city_dict)

In [None]:
#Helper functions

#This function is used to calculate the distance between two points on a latitude and longitude grid.
#You don't need to understand the math, but know that it takes into account the curvature of the earth
#to make an accurate distance measurement.
#While we could have used the geopy package to do this for us, this way we don't have to install it.
def distance_func(loc_a,loc_b):
    """
    Calculates the haversine distance between coordinates
    on the latitude and longitude grid. 
    Distance is in km.
    """
    lat1,lon1 = loc_a
    lat2,lon2 = loc_b
    r = 6371
    phi1 = np.radians(lat1)
    phi2 = np.radians(lat2)
    delta_phi = np.radians(lat2 - lat1)
    delta_lambda = np.radians(lon2 - lon1)
    a = np.sin(delta_phi / 2)**2 + np.cos(phi1) * np.cos(phi2) *   np.sin(delta_lambda / 2)**2
    res = r * (2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a)))
    return np.round(res, 2)


#Finds the location in location_dict closest to the given location; the distance is in kilometers
def closest_point(location, location_dict):
    """ take a tuple of latitude and longitude and 
        compare to a dictionary of locations where
        key = location name and value = (lat, long)
        returns tuple of (closest_location , distance) 
        distance is in kilometers"""
    closest_location = None
    for city in location_dict.keys():
        distance = distance_func(location, location_dict[city])
        if closest_location is None:
            closest_location = (city, distance)
        elif distance < closest_location[1]:
            closest_location = (city, distance)
    return closest_location

#Example
closest_point((37.774931,-120.419417), city_dict)
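
As a sanity check of the haversine formula, the sketch below re-implements it with only the standard library so that it runs on its own, then tests it on cases with known answers. The function name `haversine_km` is made up for this example, and the San Francisco and Los Angeles coordinates are approximate values assumed for illustration.

```python
import math

def haversine_km(loc_a, loc_b, radius_km=6371.0):
    # Same formula as distance_func, written with the math module
    lat1, lon1 = loc_a
    lat2, lon2 = loc_b
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return radius_km * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# One degree of latitude should equal radius * pi / 180
one_degree = haversine_km((0.0, 0.0), (1.0, 0.0))
print(one_degree, 6371.0 * math.pi / 180)

# A quarter of the way around the equator should equal radius * pi / 2
quarter = haversine_km((0.0, 0.0), (0.0, 90.0))
print(quarter, 6371.0 * math.pi / 2)

# Approximate city coordinates (assumed for illustration)
san_francisco = (37.7749, -122.4194)
los_angeles = (34.0522, -118.2437)
sf_to_la = haversine_km(san_francisco, los_angeles)
print(sf_to_la)
```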

In [None]:
#Now we apply the closest_point function to every data point in housing
#axis=1 specifies that apply will send each row, one by one, into the designated function
#We use a lambda function to receive the row and pass its latitude and longitude to closest_point
housing['close_city'] = housing.apply(lambda x: closest_point((x['latitude'],x['longitude']),city_dict), axis = 1)

#Since closest_point returns a tuple of (name, distance), we have to split it up.
housing['close_city_name'] = [x[0] for x in housing['close_city'].values]
housing['close_city_dist'] = [x[1] for x in housing['close_city'].values]

#Drop the redundant column
housing = housing.drop('close_city', axis=1)

In [None]:
#Now, let us look at our new features
housing.head()

In [None]:
#We can also look at the new statistics
housing.describe()

Now, let us see if the new feature provides some information about housing prices by looking at the correlation.

In [None]:
# obtain new correlations
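# keep only the numerical columns; recent pandas versions raise an error on string columns
corr_matrix = housing.select_dtypes(include="number").corr()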
corr_matrix["median_house_value"].sort_values(ascending=False)

In [None]:
housing.plot(kind="scatter", x="close_city_dist", y="median_house_value",
             alpha=0.1)
plt.axis([0, 450, 0, 520000])
plt.show()

**Observation**: The correlation is negative, implying that the farther a house is from a big city, the less it costs. The plot confirms this negative correlation. We can also note that most houses are within 250 km of the big cities, which suggests that anything farther than 250 km is an outlier or should be treated differently, for example as farmland.

## Step 3. Preprocess the data for your machine learning algorithm

Once we've visualized the data and have a certain understanding of what it looks like, it's time to clean!

Most of your time will be spent on this step. The datasets used in this project are relatively nice and clean, but in the real world data can get really dirty.

After cleaning your dataset, you're aiming for:
- train set
- test set

In some cases you might also have a validation set for tuning hyperparameters (don't worry if you're not familiar with this term yet).

In a supervised learning setting, your train set and test set should contain (**feature**, **target**) tuples.
 - **feature**: is the input to your model
 - **target**: is the ground truth label
     - when target is categorical the task is a classification task
     - when target is floating point the task is a regression task
     
We will make use of the **[scikit-learn](https://scikit-learn.org/stable/)** Python package for preprocessing.

Scikit-learn is well documented, and if you get confused at any point, simply look up the function/object [here](https://scikit-learn.org/stable/user_guide.html)!

### Dealing With Incomplete Data

In [None]:
# have you noticed when looking at the dataframe summary that certain columns
# contained null values? we can't just leave them as nulls and expect our
# model to handle them for us so we'll have to devise a method for dealing with them...
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
sample_incomplete_rows

In [None]:
sample_incomplete_rows.dropna(subset=["total_bedrooms"])    # option 1: simply drop rows that have null values

In [None]:
sample_incomplete_rows.drop("total_bedrooms", axis=1)       # option 2: drop the complete feature

In [None]:
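median = housing["total_bedrooms"].median()
# option 3: replace na values with the median; fillna returns a new dataframe,
# so we assign the result back instead of modifying a slice in place
sample_incomplete_rows = sample_incomplete_rows.fillna({"total_bedrooms": median})
sample_incomplete_rows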

The option where we replace the null values with a new number is known as [imputation](https://en.wikipedia.org/wiki/Imputation_(statistics)).

Could you think of another plausible imputation for this dataset instead of using the median? (Not graded)
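
Median imputation (option 3) is also available as a scikit-learn transformer, `SimpleImputer`, which the pipeline later in this notebook relies on. The sketch below applies it to a small made-up array so the filled-in values are easy to check by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with one missing value in each column
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 40.0]])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

# The medians are computed per column, ignoring the missing entries:
# column 0 -> median of [1, 3, 5], column 1 -> median of [10, 20, 40]
print(imputer.statistics_)
print(X_filled)
```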

### Using Scikit-learn transformers to preprocess data


We have shown some operations that we want to perform on the dataset. While it is possible to perform them all manually, it is much easier to offload some of the work to one of the many fantastic machine learning packages. One such package is scikit-learn, and we will demonstrate the use of one of its transformers to handle some of the work.

Consider a situation where we want to normalize the data for each feature. This involves calculating the mean $\mu$ and standard deviation $\sigma$ for that feature and applying $\frac{z-\mu}{\sigma}$ where $z$ is the feature value. We will show how to perform this using StandardScaler.
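
As a worked check of the formula, the sketch below standardizes a tiny made-up feature by hand and compares the result with `StandardScaler`. Note that `StandardScaler` uses the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single made-up feature with four values
z = np.array([[1.0], [2.0], [3.0], [4.0]])

mu = z.mean(axis=0)       # mean of the feature
sigma = z.std(axis=0)     # population standard deviation (ddof=0)
manual = (z - mu) / sigma

scaled = StandardScaler().fit_transform(z)

print(manual.ravel())
print(scaled.ravel())
```

After standardization the feature has mean 0 and standard deviation 1.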

In [None]:
from sklearn.preprocessing import StandardScaler

#Extract two real valued columns
housing_sub = housing[["housing_median_age","total_rooms"]]

scaler = StandardScaler() #instantiate the class
#Calling .fit lets the scaler calculate the mean and standard deviation, i.e. trains the standardizer
scaler.fit(housing_sub)
print("Mean: ",scaler.mean_)
print("Std: ",scaler.scale_)

#To perform the standardization, use the .transform function
housing_std= scaler.transform(housing_sub)
print("Transform output")
print(housing_std)

#As a shorthand, the function .fit_transform performs both operations
housing_std_2= scaler.fit_transform(housing_sub)
print("Fit Transform output")
print(housing_std_2)

### Prepare Data using a pipeline

Now, we will show how we can use scikit-learn to create a pipeline that performs all the data preparation in one clean function call. For simplicity, we will not perform the closest city feature extraction in this pipeline.

It is very useful to combine several steps into one to make the process much simpler to understand and easy to alter.

In [None]:
housing = load_housing_data(DATASET_PATH) # Load the dataset

housing_features = housing.drop("median_house_value", axis=1) # drop labels for training set features
                                                       # the input to the model should not contain the true label
housing_target = housing["median_house_value"].copy()

In [None]:
housing_features.head()

In [None]:
# This cell implements the complete pipeline for preparing the data
# using sklearn's TransformerMixin
# Earlier we mentioned different types of features: categorical and floats.
# In the case of floats we might want to convert them to categories.
# On the other hand, categories that are not already represented as integers must be mapped to integers before
# feeding them to the model.

# Additionally, categorical values could either be represented as one-hot vectors or simply as normalized/unnormalized integers.
# Here we encode them using one-hot vectors.

# DO NOT WORRY IF YOU DO NOT UNDERSTAND ALL THE STEPS OF THIS PIPELINE. CONCEPTS LIKE NORMALIZATION, 
# ONE-HOT ENCODING ETC. WILL ALL BE COVERED IN DISCUSSION

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.base import BaseEstimator, TransformerMixin

######Processing Real Valued Features
# column indices
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class AugmentFeatures(BaseEstimator, TransformerMixin):
    '''
    implements the previous features we had defined
    housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
    housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
    housing["population_per_household"]=housing["population"]/housing["households"]
    '''
    def __init__(self, add_bedrooms_per_room = True): 
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        #Note that we do not use the pandas indexing anymore
        #This is due to sklearn transforming the dataframe into a numpy array during the processing
        #Thus, depending on where AugmentFeatures is in the pipeline, a different input type can be expected
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

#Example of using AugmentFeatures
housing_features_num = housing_features.drop("ocean_proximity", axis=1) # remove the categorical features
attr_adder = AugmentFeatures(add_bedrooms_per_room=False) #Create transformer object
housing_extra_attribs = attr_adder.transform(housing_features_num.values) #housing_features_num.values extracts the numpy array of the dataframe

print("Example of Augment Features Transformer")
print(housing_extra_attribs[0])


#Pipeline for real valued features
num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")), #Imputes using median
        ('attribs_adder', AugmentFeatures(add_bedrooms_per_room=True)), #Adds the augmented features
        ('std_scaler', StandardScaler()), #Standardizes each feature
    ])



#Example
#Output is a numpy array
housing_features_num_tr = num_pipeline.fit_transform(housing_features_num)
print("Example Output of Pipeline for numerical output")
print(housing_features_num_tr[0])
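
`AugmentFeatures` appends its new columns with `np.c_`. As a minimal sketch of that one step, the example below computes a ratio from two columns of a made-up array and attaches it as a third column. The names `X`, `ratio`, and `X_aug` are only for this illustration.

```python
import numpy as np

# Made-up array: column 0 plays the role of total_rooms, column 1 of households
X = np.array([[6.0, 2.0],
              [9.0, 3.0]])

# Element-wise ratio of the two columns, one value per row
ratio = X[:, 0] / X[:, 1]

# np.c_ stacks its arguments side by side as columns
X_aug = np.c_[X, ratio]
print(X_aug)
```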

In [None]:
#Full Pipeline

#Splits names into numerical and categorical features
numerical_features = list(housing_features_num)
categorical_features = ["ocean_proximity"]

#Applies different transformations on numerical columns vs categorical columns
full_pipeline = ColumnTransformer([
        ("num", num_pipeline, numerical_features),
        ("cat", OneHotEncoder(), categorical_features),
    ])


#Example of full pipeline
#Output is a numpy array
housing_prepared = full_pipeline.fit_transform(housing_features)
print("Example Output of full Pipeline")
print(housing_prepared[0])



Now, we have a pipeline that easily processes the input data into our desired form. 
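
To make the one-hot encoding step of the pipeline concrete, here is a minimal sketch on a few made-up category values. Each category becomes its own column, and each row has a single 1 in the column of its category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column with three distinct values
categories = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]])

encoder = OneHotEncoder()
# The encoder returns a sparse matrix; toarray() converts it to a regular array
one_hot = encoder.fit_transform(categories).toarray()

print(encoder.categories_[0])   # the column order, sorted alphabetically
print(one_hot)
```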

### Splitting our dataset

First we need to carve out our dataset into a training and testing cohort. To do this we'll use train_test_split, a simple tool that randomly splits the data into training and testing cohorts.
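
As a small sketch of what `train_test_split` does, the example below splits ten made-up samples with `test_size=0.3` and checks that a fixed `random_state` reproduces the same split. The arrays `X` and `y` are invented for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each; row i is [2*i, 2*i + 1]
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Repeating the call with the same random_state gives the same split
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(len(X_train), len(X_test))
print(y_test)
```

Features and targets are shuffled together, so each row of `X_train` still lines up with its entry in `y_train`.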

Note that we perform the train test split on the raw data, before it is processed by the pipeline. We then fit the pipeline on the training data only and use the fitted pipeline to transform the test data. This avoids leaking information from the test data into training: statistics such as the medians used to fill in missing values and the means used for scaling are computed without ever seeing the test data.

In [None]:
from sklearn.model_selection import train_test_split
data_target = housing['median_house_value']
train, test, target, target_test = train_test_split(housing_features, data_target, test_size=0.3, random_state=0)

train = full_pipeline.fit_transform(train) # learn the preprocessing statistics from the training data
test = full_pipeline.transform(test)       # reuse those statistics on the test data

### Select a model and train

Once we have prepared the dataset it's time to choose a model.

As our task is to predict the median_house_value (a floating-point value), regression is well suited for this.

In [None]:
from sklearn.linear_model import LinearRegression

#Instantiate a linear regression object
lin_reg = LinearRegression()
#Train the model using the .fit function
lin_reg.fit(train, target)

# let's try the trained model on a few test instances
data = test
labels = target_test

#Uses predict to get the predicted target values
print("Predictions:", lin_reg.predict(data)[:5])
print("Actual labels:", list(labels)[:5])

In [None]:
from sklearn.metrics import mean_squared_error

preds = lin_reg.predict(test)
mse = mean_squared_error(target_test, preds)
rmse = np.sqrt(mse)
rmse
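
To show what the RMSE measures, the sketch below computes it by hand on four made-up predictions and compares the intermediate mean squared error with scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Squared errors are 0.25, 0.25, 0.0 and 1.0, so their mean is 0.375
mse_manual = np.mean((y_true - y_pred) ** 2)
rmse_manual = np.sqrt(mse_manual)

mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
print(rmse_manual)
```

Taking the square root puts the error back in the units of the target, which for the housing example means dollars.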

# TODO: Applying the end-to-end ML steps to a different dataset

We will apply what we've learned to another dataset ([NYC Airbnb dataset from 2019](https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data)). We will predict Airbnb prices based on other features.

Note: You do not have to use only one cell when programming your code and can do it over multiple cells.

## [50 pts] Visualizing Data 


### [10 pts] Load the data + statistics


#### - Load the dataset: airbnb/AB_NYC_2019.csv and display the first 5 rows of the data

In [None]:
#Your code

#### - Pull up info on the data type for each of the data fields. Will any of these be problematic to feed into your model (you may need to do a little research on this)? Discuss:

In [None]:
#Your code

[Response here]

#### - Drop the following columns: name, id, host_id, host_name, last_review, and reviews_per_month, and display the first 5 rows

In [None]:
#Your code

#### - Display a summary of the statistics of the loaded data using .describe

In [None]:
#Your code

### [10 pts] Plot [boxplots](https://en.wikipedia.org/wiki/Box_plot) for the following 3 features: availability_365, number_of_reviews, price

You may use either pandas or matplotlib to plot the boxplot

In [None]:
#Your code

#### - What do you observe from the boxplot about the features? Anything surprising?

[Response here]

### [10 pts] Plot median price of a listing per neighbourhood_group using a bar plot

In [None]:
#Your code

#### - Describe what you expected to see with these features and what you actually observed 

[Response here]

#### - So we can see that different neighborhoods have dramatically different price points, but how do prices break down by range? To see, let's plot a histogram of price by neighborhood to get a better sense of the distribution.

To prevent outliers from affecting the histogram, use the input *range = [0,300]* in the histogram function, which caps the maximum price at 300 and ignores the outliers.

In [None]:
#Your code

### [5 pts] Plot a map of airbnbs throughout New York. You do not need to overlay a map.

In [None]:
#Your code

### [10 pts] Plot the median price of room types for listings that have availability greater than 180 days and whose neighbourhood_group is Manhattan

In [None]:
#Your code

### [5 pts] Find features that correlate with price
Using the correlation matrix:
- which features have positive correlation with the price?
- which features have negative correlation with the price?


In [None]:
#Your code

[Response here]

#### - Plot the full Scatter Matrix to see the correlation between prices and the other features

In [None]:
#Your code

## [30 pts] Prepare the Data

### [5 pts] Partition the data into the features and the target data. The target data is price. Then partition the feature data into categorical and numerical features.

In [None]:
#Your code

### [10 pts] Create a scikit learn Transformer that augments the numerical data with the following two features 

- Max_yearly_bookings = availability_365 / minimum_nights

- Distance from airbnb to the NYC JFK Airport 
    - Latitude: 40.641766 , Longitude: -73.780968

Make sure to append these new features in this order.

You may use the previously defined distance_func for the distance calculation.

Note that this Transformer will be applied after imputation so we do not have to worry about Nulls in the data.

In [None]:
#Your code

#### - Test your new augmentation class by applying it to the numerical data you created. Print out the first 3 rows of the resultant data.

Do not worry about missing data since none of the features used here contain nulls.

In [None]:
#Your code

### [10 pts] Create a sklearn pipeline that performs the following operations on the feature data

Now, we will create a full pipeline that processes the data before creating the model.

For the numerical data, perform the following operations in order:
- Use a SimpleImputer that imputes using the median value
- Use the custom feature augmentation made in the previous part
- Use StandardScaler to standardize the mean and standard deviation

For categorical features, perform the following:
- Perform one hot encoding on all the remaining categorical features: {neighbourhood_group, room_type} 

**After making the pipeline, perform the transform operation on the feature data and print out the first 3 rows.**

In [None]:
#Your code

### [5 pts] Set aside 20% of the data as the test set (80% train, 20% test). Apply the previously created pipeline to the train and test data separately as shown in the introduction example.

In [None]:
#Your code

## [20 pts] Fit a Linear Regression Model

The task is to predict the price. You can refer to the housing example for how to train and evaluate your model using the mean squared error (MSE).
Provide both test and train set MSE values.

In [None]:
#Your code