**Handling Missing Data. [15 points]**

***Background***

Suppose you are given a dataset and you want to train a machine learning model to make useful predictions. However, you notice that there are missing values in the data. You know that (1) many machine learning algorithms fail to work with missing values, (2) learning might end up with a biased model if missing values are not handled properly, and (3) missing data may reduce the precision of the model's predictions. Hence, you decide to handle missing values before proceeding to train a machine learning model.

You explore the following methods to handle missing values in your dataset: ***(1) Deleting the data examples with missing values***, i.e. you delete all the rows where you see one or more missing values. This method is suitable when you have a few rows with many missing values. ***(2) Deleting Features with Missing Values***, i.e. you delete all the columns where you see one or more missing values. This method is suitable when you have a few columns (i.e., features) with many missing values. ***(3) Replacing Missing Values with Estimated Values***, i.e. you fill each missing value with an estimate. This method is a good choice in many scenarios. For example, you replace a missing value in a column with the mean, median, or mode of that column, or with the previous or next value in the column. You generally estimate column-wise (i.e. feature-wise) because a column contains values for the same property or feature of the data.
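The three methods above can be sketched with pandas as follows. This is a minimal illustration on a made-up toy table, not the assignment's solution: the `toy` DataFrame and its values are assumptions, and the choice of median for the numeric column and mode for the categorical column is only one of the estimates listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
    "FireplaceQu": [np.nan, "TA", "TA", "Gd"],
})

# (1) Delete rows that contain at least one missing value.
rows_dropped = toy.dropna(axis=0, how="any")

# (2) Delete columns that contain at least one missing value.
cols_dropped = toy.dropna(axis=1, how="any")

# (3) Replace missing values with a column-wise estimate:
#     median for the numeric column, mode for the categorical one.
imputed = toy.copy()
imputed["LotFrontage"] = imputed["LotFrontage"].fillna(imputed["LotFrontage"].median())
imputed["FireplaceQu"] = imputed["FireplaceQu"].fillna(imputed["FireplaceQu"].mode()[0])

print(rows_dropped.shape, cols_dropped.shape, imputed.isna().sum().sum())
```

Methods (1) and (2) shrink the table, while method (3) keeps its shape but introduces estimated values.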

***Problem Description***

You are provided with the ***HousingPrices*** dataset. You want to predict the sale price of your house and decide to use the ***HousingPrices*** dataset to train a machine learning prediction model. In this exercise, you will prepare your dataset for machine learning. Do the following on the ***HousingPrices*** dataset provided. A description of each column is given in the ***dataDescription.txt*** file. Each row contains information about a house in your neighborhood, while different properties of the corresponding house are organized in columns.

1. Write the code to read the ***HousingPrices*** dataset from the csv file provided.

In [2]:
#Write the code to read the dataset here.

from google.colab import files
uploaded = files.upload()

import pandas as pd
data = pd.read_csv("HousingPrices.csv")
data




Saving HousingPrices.csv to HousingPrices.csv


Unnamed: 0,Id,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,BldgType,...,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,YrSold,SalePrice
0,1,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,1Fam,...,1,3,1,Gd,8,Typ,0,,2008,208500
1,2,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,1Fam,...,0,3,1,TA,6,Typ,1,TA,2007,181500
2,3,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,1Fam,...,1,3,1,Gd,6,Typ,1,TA,2008,223500
3,4,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,1Fam,...,0,3,1,Gd,7,Typ,1,Gd,2006,140000
4,5,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,1Fam,...,1,4,1,Gd,9,Typ,1,TA,2008,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,62.0,7917,Pave,Reg,Lvl,AllPub,Inside,Gtl,1Fam,...,1,3,1,TA,7,Typ,1,TA,2007,175000
1456,1457,85.0,13175,Pave,Reg,Lvl,AllPub,Inside,Gtl,1Fam,...,0,3,1,TA,7,Min1,2,TA,2010,210000
1457,1458,66.0,9042,Pave,Reg,Lvl,AllPub,Inside,Gtl,1Fam,...,0,4,1,Gd,9,Typ,2,Gd,2010,266500
1458,1459,68.0,9717,Pave,Reg,Lvl,AllPub,Inside,Gtl,1Fam,...,0,2,1,Gd,5,Typ,0,,2010,142125
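The `Unnamed: 0` column in the output above is what pandas produces when a CSV file has a leading column with an empty header, which typically happens when a DataFrame was saved together with its index. A small sketch of how `index_col=0` changes this, using a made-up in-memory CSV (the `csv_text` string is an assumption, not the real file):

```python
import io
import pandas as pd

# Hypothetical CSV text whose first column has an empty header,
# mimicking a file that was saved with its row index.
csv_text = ",Id,LotArea\n0,1,8450\n1,2,9600\n"

# Default read: the unnamed first column becomes a regular column.
default_read = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the row index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```

Whether to treat that column as an index depends on the dataset; it may be worth checking `dataDescription.txt` before deciding.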


2. Perform initial analysis and find out if you would go ahead with supervised learning or unsupervised learning. What did you decide? Provide reasons for your decision.

The data is best suited to supervised learning. The key observation is that the data is labeled: every row pairs the input features with a known output (SalePrice), so a model can learn from these examples and check its predictions against the correct answers.

3. Perform initial analysis and find out if you will train a classification model or a regression model. What did you decide? Provide reasons for your decision.

Since the dataset is intended to predict housing prices and price is a continuous value, a regression model will be trained in this homework.
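One rough way to back up this decision in code is to inspect the target column. The sketch below uses a made-up stand-in for `SalePrice`, and the 0.5 cutoff on the share of distinct values is an arbitrary assumption chosen for illustration, not an established rule.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the SalePrice column.
toy = pd.DataFrame({"SalePrice": [208500, 181500, 223500, 140000, 250000]})
target = toy["SalePrice"]

# A numeric target with many distinct values suggests regression;
# a small set of repeated labels would suggest classification.
is_numeric = is_numeric_dtype(target)
distinct_ratio = target.nunique() / len(target)

# The 0.5 threshold is an illustrative assumption.
task = "regression" if is_numeric and distinct_ratio > 0.5 else "classification"
print(task)
```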

4. Analyze if you have missing values in your dataset. Did you find missing values?

After inspecting the dataset, I can confirm that missing values are present. The two columns with the most missing values appear to be LotFrontage and FireplaceQu. Other columns also have missing values, but less frequently than these two.
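A per-column count makes this kind of inspection concrete. The following is a minimal sketch on a made-up table (the `toy` DataFrame and its values are assumptions); on the real data the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "FireplaceQu": [np.nan, "TA", np.nan, np.nan],
})

# Count missing values per column and keep only the affected columns,
# ordered from most to least missing.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)

# Share of rows that are missing in each affected column.
missing_share = (missing_counts / len(toy)).round(2)

print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```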

5. How will you handle missing values in this dataset? Provide reasons for your decision.

I tried a few approaches and made some key observations. My first attempt was to remove every row where NA was present. Doing so discarded about 50% of the data, which is too much to lose. A better option would have been to remove only the rows, and then the columns, in which the majority of the values are missing. I also considered replacing the missing data with averages or estimating it with a regression analysis. However, because the dataset is rather small, I found that filling in values with those methods might not give a reliable prediction of housing prices. Ultimately, I decided to remove the columns containing NA. Only 9 of the 52 columns are removed, and the remaining columns still hold the data that is most important for determining housing prices, so this seemed the best way to handle the missing data.
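The trade-off described above can be measured before committing to a strategy. The sketch below is an illustration on a made-up table, not the analysis actually performed: the `toy` DataFrame is an assumption, and the "at least half present" threshold is one possible reading of "majority of the data".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
    "FireplaceQu": [np.nan, np.nan, np.nan, "Gd"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Fraction of rows that survive dropping every row with a missing value.
rows_kept = len(toy.dropna(axis=0, how="any")) / len(toy)

# Fraction of columns that survive dropping every column with a missing value.
cols_kept = toy.dropna(axis=1, how="any").shape[1] / toy.shape[1]

# Threshold variant: keep a column only if at least half of its values are present.
min_present = int(np.ceil(0.5 * len(toy)))
mostly_complete = toy.dropna(axis=1, thresh=min_present)

print(rows_kept, cols_kept, list(mostly_complete.columns))
```

Note that the threshold variant can leave some missing values behind, so it would still need to be combined with row deletion or imputation.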


6. Write the code to handle missing values and get the dataset ready for machine learning.

In [3]:
# Handle missing values: take the given dataset as input and return a new
# dataset without any missing values.

def drop_columns_with_missing(df):
    # Drop every column that contains at least one missing value.
    return df.dropna(axis=1, how="any")

newdata = drop_columns_with_missing(data)

newdata



Unnamed: 0,Id,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,BldgType,HouseStyle,...,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,YrSold,SalePrice
0,1,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,1Fam,2Story,...,2,1,3,1,Gd,8,Typ,0,2008,208500
1,2,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,1Fam,1Story,...,2,0,3,1,TA,6,Typ,1,2007,181500
2,3,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,1Fam,2Story,...,2,1,3,1,Gd,6,Typ,1,2008,223500
3,4,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,1Fam,2Story,...,1,0,3,1,Gd,7,Typ,1,2006,140000
4,5,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,1Fam,2Story,...,2,1,4,1,Gd,9,Typ,1,2008,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,7917,Pave,Reg,Lvl,AllPub,Inside,Gtl,1Fam,2Story,...,2,1,3,1,TA,7,Typ,1,2007,175000
1456,1457,13175,Pave,Reg,Lvl,AllPub,Inside,Gtl,1Fam,1Story,...,2,0,3,1,TA,7,Min1,2,2010,210000
1457,1458,9042,Pave,Reg,Lvl,AllPub,Inside,Gtl,1Fam,2Story,...,2,0,4,1,Gd,9,Typ,2,2010,266500
1458,1459,9717,Pave,Reg,Lvl,AllPub,Inside,Gtl,1Fam,1Story,...,1,0,2,1,Gd,5,Typ,0,2010,142125


7. Write your prepared dataset in a new csv file named 'Prepared-HousingPrices.csv'.

In [8]:
# Write the prepared dataset to a new CSV file.

newdata.to_csv('Prepared-HousingPrices.csv', index=False)
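A quick round-trip check can confirm that the written file loads back without missing values. This is a sketch under assumptions: the `prepared` DataFrame is a made-up stand-in for the prepared dataset, and the file is written to a temporary directory so the example does not touch real files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the prepared dataset.
prepared = pd.DataFrame({
    "LotArea": [8450, 9600],
    "KitchenQual": ["Gd", "TA"],
    "SalePrice": [208500, 181500],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Prepared-HousingPrices.csv")
    # index=False keeps the row index out of the file.
    prepared.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded table should have the same shape and no missing values.
no_missing = not reloaded.isna().any().any()
same_shape = reloaded.shape == prepared.shape
print(no_missing, same_shape)
```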


