# First Data Science Project
## Melbourne Housing Prices Prediction
Here we work through a data challenge: predicting housing prices in Melbourne, Australia.

The data is from Kaggle and can be found [here](https://www.kaggle.com/anthonypino/melbourne-housing-market).

In [22]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

## 1. Collection

In [23]:
full_data = pd.read_csv("./data/Melbourne_housing_FULL.csv")
full_data

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,3/09/2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra City Council,-37.80140,144.99580,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra City Council,-37.79960,144.99840,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra City Council,-37.80790,144.99340,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,4/02/2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra City Council,-37.81140,145.01160,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra City Council,-37.80930,144.99440,Northern Metropolitan,4019.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34852,Yarraville,13 Burns St,4,h,1480000.0,PI,Jas,24/02/2018,6.3,3013.0,...,1.0,3.0,593.0,,,Maribyrnong City Council,-37.81053,144.88467,Western Metropolitan,6543.0
34853,Yarraville,29A Murray St,2,h,888000.0,SP,Sweeney,24/02/2018,6.3,3013.0,...,2.0,1.0,98.0,104.0,2018.0,Maribyrnong City Council,-37.81551,144.88826,Western Metropolitan,6543.0
34854,Yarraville,147A Severn St,2,t,705000.0,S,Jas,24/02/2018,6.3,3013.0,...,1.0,2.0,220.0,120.0,2000.0,Maribyrnong City Council,-37.82286,144.87856,Western Metropolitan,6543.0
34855,Yarraville,12/37 Stephen St,3,h,1140000.0,SP,hockingstuart,24/02/2018,6.3,3013.0,...,,,,,,Maribyrnong City Council,,,Western Metropolitan,6543.0


In [24]:
# import sys
# !{sys.executable} -m pip install -U pandas-profiling[notebook]
# !jupyter nbextension enable --py widgetsnbextension

In [25]:
# from pandas_profiling import ProfileReport
# profile = ProfileReport(full_data)
# profile.to_file(output_file="your_report.html")

In [26]:
X = full_data.copy()
X = X.drop(columns=['Price'])

In [27]:
y = full_data.loc[:, "Price"]

In [28]:
train_size = 0.8
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=train_size, random_state=42)

### Summary of Assessment:
- drop columns: Address, Postcode, Longtitude, Lattitude (spelled as in the data), and one of Bedroom2 or Rooms
- group into smaller categories, then turn into dummies:
  - Type, Method, SellerG
  - Car: no parking, 1-2 parking spaces, 3-4 parking spaces, 5 and more
  - Rooms: studio, 1 bedroom, 2 bedrooms, 3-4 bedrooms, 5 and more
  - Bathroom: no bathroom, 1 bathroom, 2 bathrooms, 3-4 bathrooms, 5 and more
- check and deal with outliers and missing values: Rooms, Bedroom2, Bathroom, Car, YearBuilt, Landsize, BuildingArea, CouncilArea
- collapse rare values into an "other" category: Regionname, Suburb, Method, SellerG, CouncilArea
- Date: extract day, month, and year separately
- Distance: log transform
- Price: drop rows with missing values
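
The grouping step for the count columns could be sketched as follows. This is only an illustration on made-up values: the bin edges follow the Car groups listed above, while the names `toy`, `CarGroup`, and the label strings are assumptions and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy data standing in for the 'Car' column (values are made up)
toy = pd.DataFrame({"Car": [0, 1, 2, 3, 4, 5, 8]})

# Right-closed bins: (-1, 0], (0, 2], (2, 4], (4, inf]
bins = [-1, 0, 2, 4, float("inf")]
labels = ["no parking", "1-2 spaces", "3-4 spaces", "5 or more"]
toy["CarGroup"] = pd.cut(toy["Car"], bins=bins, labels=labels)

# One indicator column per group
car_dummies = pd.get_dummies(toy["CarGroup"], prefix="Car")

print(toy)
print(car_dummies.columns.tolist())
```

The same pattern would apply to Rooms and Bathroom with their own bin edges.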

### General Cleaning

In [None]:
# show the duplicated rows
full_data.loc[full_data.duplicated(), :]
# drop duplicates, keeping the first occurrence
# (drop_duplicates returns a new DataFrame, so assign it back)
full_data = full_data.drop_duplicates(keep='first')
full_data
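
A small check of how `duplicated` and `drop_duplicates` behave, on a made-up three-row frame (the name `toy` and its values are assumptions for illustration). It shows that `drop_duplicates` returns a new frame and leaves the original untouched unless the result is assigned back.

```python
import pandas as pd

toy = pd.DataFrame({"Suburb": ["A", "A", "B"], "Rooms": [2, 2, 3]})

# duplicated() flags the second and later copies of a row
dup_mask = toy.duplicated()

# drop_duplicates() returns a new frame; toy itself is unchanged
deduped = toy.drop_duplicates(keep="first")

print(dup_mask.tolist())
print(len(toy), len(deduped))
```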

In [None]:
# drop columns I don't need
columns = ['Address', 'Postcode', 'Bedroom2', 'Longtitude', 'Lattitude', 'Date', 'Suburb', 'Type', 'Method', 'SellerG', 'CouncilArea', 'Regionname']
full_data = full_data.drop(columns = columns)
full_data.columns

## Check and deal with outliers and missing values
- Rooms, Bathroom, Car, YearBuilt, Landsize, BuildingArea


### Rooms

In [None]:
import seaborn as sns
sns.countplot(x='Rooms', data=full_data);

In [None]:
full_data['Rooms'].value_counts()

No outlier that I can prove to be a typo, and no missing data; the distribution seems normal.

### Bathroom

In [None]:
import seaborn as sns
sns.countplot(x='Bathroom', data=full_data);

In [None]:
full_data['Bathroom'].value_counts()

In [None]:
full_data.loc[full_data['Bathroom'] == 9]

No outlier that I can prove to be a typo; the distribution seems normal. Missing values are filled with the mode later on.

### Car

In [None]:
import seaborn as sns
sns.countplot(x='Car', data=full_data);

In [None]:
full_data['Car'].value_counts()

In [None]:
full_data.loc[full_data['Car'] == 26]

In [None]:
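# fill missing values with the median (the original mixed up the names `mean` and `median`)
median = full_data['Car'].median()
full_data['Car'] = full_data['Car'].fillna(median)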

In [None]:
full_data

No outlier that I can prove to be a typo, and no missing data anymore; the distribution seems normal.

### Yearbuilt

In [None]:
import seaborn as sns
sns.boxplot(x=full_data['YearBuilt'])

In [None]:
full_data.loc[(full_data['YearBuilt'] < 1800)]

In [None]:
# exclude that value from the column
full_data = full_data[full_data.YearBuilt != 1196.0]
full_data

The one value I can prove to be a typo (1196) is removed; the rest seems normal. Missing values are filled with the mode later on.

### Landsize

In [None]:
import seaborn as sns
sns.boxplot(x=full_data['Landsize'])

In [None]:
full_data.loc[(full_data['Landsize'] > 150000)]

In [None]:
# exclude that value from the column
full_data = full_data[full_data.Landsize != 433014.0]

In [None]:
full_data

The single extreme value (433014) is removed; the rest seems normal. Missing values are filled with the mode later on.
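
Instead of reading the threshold off the boxplot by eye, the whisker rule itself can be applied in code. The sketch below uses the common 1.5 × IQR rule on made-up values; the helper `iqr_outliers` is hypothetical and not part of this notebook, and the rule only flags candidates, it does not prove a value is a typo.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up land sizes with one extreme value
landsize = pd.Series([120.0, 200.0, 350.0, 500.0, 650.0, 433014.0])
mask = iqr_outliers(landsize)

print(landsize[mask])
```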

### Building Area

In [None]:
import seaborn as sns
sns.boxplot(x=full_data['BuildingArea'])

In [None]:
full_data.loc[:, 'BuildingArea'].value_counts().sort_values(ascending=False)

In [None]:
mean = full_data['BuildingArea'].mean()
# assign back: fillna(inplace=True) on a .loc selection may act on a copy
full_data['BuildingArea'] = full_data['BuildingArea'].fillna(mean)

In [None]:
full_data  = full_data[full_data.loc[:, 'BuildingArea']!=0]

In [None]:
full_data

### Type

In [None]:
# labels = ['house', 'townhouse', 'unit']
# g = sns.countplot(data=X_train, x='Type')
# g.set_xticklabels(labels)
# g;

### SellerG 

In [None]:
# X_train['SellerG'].value_counts().head(10)
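
Collapsing the less frequent sellers into an "other" category, as planned in the assessment, might look like this. The seller names are taken from the table above but the counts are made up, and the cutoff of two names is an arbitrary assumption.

```python
import pandas as pd

sellers = pd.Series(
    ["Jellis", "Biggin", "Jellis", "Nelson", "Biggin", "Jellis", "Rounds"]
)

# Keep the most frequent names; the cutoff of 2 is arbitrary
top = sellers.value_counts().nlargest(2).index

# Every value not in the top list is replaced by "Other"
grouped = sellers.where(sellers.isin(top), "Other")

print(grouped.value_counts())
```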


### Regionname

In [None]:
# sns.countplot(data=X_train, y='Regionname');

### Datetime into columns

In [None]:
# X_train.loc[:,'Date'] = pd.to_datetime(X_train.loc[:,'Date'])
# X_train.loc[:,'Year'] = X_train.loc[:,'Date'].apply(lambda x: x.year)
# X_train.loc[:,'Month'] = X_train.loc[:,'Date'].apply(lambda x: x.month_name())
# X_train.loc[:,'Day'] = X_train.loc[:,'Date'].apply(lambda x: x.day)
# X_train = X_train.drop('Date', axis=1)

In [None]:
# X_train
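
The dates in the table above look like day/month/year (for example 24/02/2018), so parsing them without a format could swap day and month. A minimal sketch, assuming every date in the file follows that pattern, using the `.dt` accessor instead of `apply`:

```python
import pandas as pd

dates = pd.Series(["3/09/2016", "4/02/2016", "24/02/2018"])

# An explicit format reads 3/09/2016 as 3 September, not 9 March
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

parts = pd.DataFrame({
    "Year": parsed.dt.year,
    "Month": parsed.dt.month_name(),
    "Day": parsed.dt.day,
})

print(parts)
```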

In [None]:
### Correlations
corr_matrix = full_data.corr()
corr_matrix
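
The full matrix is hard to scan, so one option is to pull out only the Price column and rank the features by absolute correlation. This is a sketch on made-up numbers; `toy` and `ranked` are illustrative names only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Price": [500, 700, 900, 1100],
    "Rooms": [1, 2, 3, 4],
    "Distance": [20, 15, 12, 5],
})

# Correlation of every feature with Price, without Price itself
price_corr = toy.corr()["Price"].drop("Price")

# Strongest relationships first, regardless of sign
ranked = price_corr.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked)
```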

In [None]:
### Drop missing values of Price
price_nonmissing = y.dropna()
len(price_nonmissing)

### Turning cat into dummies

In [None]:
full_data = pd.get_dummies(full_data)

In [None]:
# columns_dummies = ['Address', 'Postcode', 'Bedroom2', 'Longtitude', 'Lattitude']
# X_train = X_train.fillna(columns = columns_dummies)
# X_train.columns

In [None]:
# fill remaining missing values in each column with that column's mode
for col in ['Car', 'Bathroom', 'Rooms', 'Landsize', 'BuildingArea', 'YearBuilt']:
    full_data[col] = full_data[col].fillna(full_data[col].mode()[0])
full_data
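
A note on what `mode()[0]` picks: `mode()` ignores missing values and returns all tied values in ascending order, so `[0]` is the smallest of the most frequent values. A small made-up example:

```python
import numpy as np
import pandas as pd

col = pd.Series([2.0, 1.0, np.nan, 2.0, 1.0, 3.0])

# 1.0 and 2.0 both appear twice; ties come back in ascending order
modes = col.mode()

# The missing entry is filled with the first (smallest) mode
filled = col.fillna(modes[0])

print(modes.tolist())
print(filled.tolist())
```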

In [None]:
# # change date to datetime
# full_data['Date'] = pd.to_datetime(full_data['Date'])
# full_data.dtypes

In [None]:
# dummies_full_data = dummies_full_data.replace([np.inf, -np.inf], np.nan)

In [None]:
# np.all(np.isfinite(dummies_full_data))

In [None]:
# y

In [None]:
# full_data = full_data.dropna(subset=['Distance', 'Propertycount'])
# full_data.isna().sum()

In [None]:
# y = full_data.loc[:, "Price"]

In [None]:
# y = y.dropna(axis=0, how='any')

In [None]:
# y.isna().sum()

In [None]:
# X

In [None]:
# X = full_data.copy()
# X = X.dropna(subset=['Price'])

In [None]:
# X.shape

In [None]:
# X = X.drop(columns=['Price'])

In [None]:
# X.shape

In [None]:
# y.shape

In [None]:
# X

In [None]:
# train_size = 0.8
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, train_size=train_size, random_state=42)

In [None]:
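# Import model
from sklearn.linear_model import LinearRegression

# Rebuild X and y from the cleaned data; rows that still have
# missing values (e.g. in Price) are dropped
model_data = full_data.dropna()
X = model_data.drop(columns=['Price'])
y = model_data['Price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42)

# Create linear regression object
regressor = LinearRegression()

# Fit model to the training data only, so the test set stays unseen
regressor.fit(X_train, y_train)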

In [None]:
# Predict
# Predicting test set results
y_pred = regressor.predict(X_test)

In [None]:
from sklearn import metrics
print('MAE:',metrics.mean_absolute_error(y_test,y_pred))
print('MSE:',metrics.mean_squared_error(y_test,y_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
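
For reference, the three metrics can be computed by hand with NumPy. The numbers below are made up; the point is only to show what each metric measures.

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # average size of the error
mse = np.mean(errors ** 2)      # average squared error, punishes large misses
rmse = np.sqrt(mse)             # back in the units of the target

print(mae, mse, rmse)
```

RMSE is in the same units as Price, which makes it easier to read than MSE.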

In [None]:
# r2_score is the actual R^2; explained_variance_score is a different metric
print('R^2 =', metrics.r2_score(y_test, y_pred))
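
Explained variance and R² only agree when the residuals average to zero. A made-up example where every prediction is off by the same amount shows the gap: explained variance ignores the constant bias, R² does not.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = y_true + 1.0  # every prediction is off by the same amount

residuals = y_true - y_hat
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# R^2 compares squared residuals against the spread of the target
r2 = 1 - np.sum(residuals ** 2) / ss_tot

# Explained variance only looks at the variance of the residuals
explained_variance = 1 - np.var(residuals) / np.var(y_true)

print(r2, explained_variance)
```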

In [None]:
# Actual vs. predicted prices
plt.scatter(y_test, y_pred)
plt.xlabel('Actual price')
plt.ylabel('Predicted price')

In [None]:
cdf = pd.DataFrame(data = regressor.coef_, index = X.columns, columns = ['Coefficients'])
cdf

In [None]:
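# Price is continuous, so use the regressor (SVR) rather than the classifier (SVC)
svr = svm.SVR(kernel='linear', C=1).fit(X_train, y_train)
# score() returns R^2 on the test set
svr.score(X_test, y_test)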
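
Support vector models are sensitive to the scale of the features, and the columns here range from single digits (Rooms) to thousands (Propertycount). One possible way to handle this is to put a scaler in front of the model. The sketch below runs on synthetic data, not the housing data, and the pipeline is a suggestion rather than something this notebook has tested.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features on very different scales (made-up data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3)) * [1.0, 100.0, 1000.0]
y_toy = X_toy @ np.array([2.0, 0.05, 0.001]) + rng.normal(scale=0.1, size=200)

# The scaler is fitted on the training rows only, inside the pipeline
model = make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0))
model.fit(X_toy[:150], y_toy[:150])

# R^2 on the held-out rows
score = model.score(X_toy[150:], y_toy[150:])
print(score)
```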

## 4. Model Building

## 5. Iterating