## Workflow of an ML Project

In this practical class, we simulate the core steps in the workflow of a machine learning project: problem framing, real data collection, exploratory analysis, data pre-processing, and machine learning model training and tuning.

### Problem framing

Buying and selling property is an important form of investment for many of us. Property market prices are hard to predict because of the variable nature of auctions and many other factors. In this project, we use recent historical transaction information to build a predictive model based on property features such as area, number of bedrooms, etc.

### Data Collection

In this project, we will collect the historical property sold prices from a website called Domain: https://www.domain.com.au/?mode=sold.

Task 1: Manually collect the price information for at least 10 houses for subsequent data analysis and predictive model training/testing. Specific instructions:
* Open the Domain URL above. In the search box, enter the suburb you currently live in. To simplify the task, we assume that you are only interested in property prices in that suburb.
* In the Filters, choose 'House' for 'Property Types'. This limits our discussion to house prices only, further simplifying the problem.
* Leave the other filter options as 'Any', including # bedrooms, # bathrooms, and # parking.
* After clicking 'Search', you will get a list of sold property items.
* Manually record the following information for each property:
  * Date
  * Address
  * Number of bedrooms
  * Number of bathrooms
  * Number of car parking spaces
  * Land area
  * Sold price
* Collect this information for at least 10 listed properties. For example, you can use the top 10 listed sold properties. If the value of a feature is missing, simply use '?' as a placeholder.
* Store the data in a CSV (Comma-Separated Values) file.

* (Optional for interested students) You can actually collect much more data automatically by calling the APIs provided by Domain: https://developer.domain.com.au/docs/latest/introduction. A video showing how to achieve this: https://www.youtube.com/watch?v=_OJBOy00IJ0. 

In [1]:
# Display your collected data using Pandas.
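
A minimal sketch of loading and displaying the collected data. The column names (`Bedrooms`, `LandArea`, `SoldPrice`, and so on) are hypothetical, and the three rows are made-up placeholders rather than real sales; in practice you would pass the path of your own CSV file to `pd.read_csv`. The `na_values="?"` argument converts the `?` placeholders into `NaN` so that numeric columns stay numeric.

```python
import io

import pandas as pd

# Made-up rows standing in for a hand-collected CSV file (not real sales).
csv_text = """Date,Address,Bedrooms,Bathrooms,Parking,LandArea,SoldPrice
2023-05-01,1 Example St,3,1,1,450,820000
2023-05-03,2 Example St,4,2,2,600,1050000
2023-05-07,3 Example St,2,1,?,300,640000
"""

# Replace io.StringIO(csv_text) with your own file path, e.g. "houses.csv".
df = pd.read_csv(io.StringIO(csv_text), na_values="?", parse_dates=["Date"])
print(df)
```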



### Exploratory Analysis and Visualization

Task 2: Show the statistical summary for each numerical attribute, including COUNT, MEAN, STD, MIN, MAX, and MEDIAN.

In [None]:
# Use the describe() method of DataFrame in Pandas.
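
A short sketch of the summary step, using a made-up DataFrame with hypothetical column names. Note that `describe()` reports the median under the `50%` row; the extra `median` row below is added only to make that explicit.

```python
import pandas as pd

# Made-up numeric columns; substitute the DataFrame you loaded in Task 1.
df = pd.DataFrame({
    "Bedrooms": [3, 4, 2, 3],
    "LandArea": [450, 600, 300, 520],
    "SoldPrice": [820000, 1050000, 640000, 900000],
})

summary = df.describe()  # count, mean, std, min, quartiles, max
summary.loc["median"] = df.median(numeric_only=True)  # same values as the "50%" row
print(summary)
```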


Task 3 (optional): Plot the histogram for each numerical attribute.

In [None]:
# Use the hist() method of DataFrame in Pandas.


Task 4: Visualize the relationship between each feature and the property sold price to study how the features correlate with the price.

In [None]:
# Scatter plot with a feature as the X axis and the property price as the Y axis.
# Matplotlib will be involved.
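
One possible way to draw these scatter plots, sketched with made-up values and hypothetical column names. Each feature gets its own panel, with the sold price on the Y axis.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values; substitute the DataFrame you loaded in Task 1.
df = pd.DataFrame({
    "Bedrooms": [3, 4, 2, 3],
    "LandArea": [450, 600, 300, 520],
    "SoldPrice": [820000, 1050000, 640000, 900000],
})

features = ["Bedrooms", "LandArea"]
fig, axes = plt.subplots(1, len(features), figsize=(10, 4))
for ax, feature in zip(axes, features):
    ax.scatter(df[feature], df["SoldPrice"])  # feature on X, price on Y
    ax.set_xlabel(feature)
    ax.set_ylabel("SoldPrice")
fig.tight_layout()
# In a notebook the figure renders when the cell finishes running.
```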


### Data Pre-processing

Task 5 (optional): Missing value imputation. If your collected dataset has missing values, you need to handle them by imputation. Refer to https://scikit-learn.org/stable/modules/impute.html#impute for details about how to achieve this in scikit-learn.

In [None]:
# Missing value handling
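
A minimal imputation sketch using scikit-learn's `SimpleImputer`. The data and column names are made up, and the median strategy is just one reasonable choice for a small dataset with possible outliers; `mean` or `most_frequent` would work the same way.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up features with two missing entries (NaN); substitute your own data.
df = pd.DataFrame({
    "Bedrooms": [3, 4, 2, 3],
    "Parking": [1, 2, np.nan, 1],
    "LandArea": [450, 600, 300, np.nan],
})

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
```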


Task 6: Feature scaling. The property features have different scales. Because many machine learning models are sensitive to feature scale, we need to rescale the features so that each one is treated equally. Refer to https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling for more details about how to achieve feature scaling.

In [None]:
# Use the Min-Max feature scaling to transform the original data
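
A small sketch of Min-Max scaling with scikit-learn's `MinMaxScaler`, which maps each column to the range [0, 1] via $x' = (x - x_{min}) / (x_{max} - x_{min})$. The values and column names are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up features on very different scales; substitute your own columns.
X = pd.DataFrame({
    "Bedrooms": [3, 4, 2, 3],
    "LandArea": [450, 600, 300, 520],
})

scaler = MinMaxScaler()  # default feature_range is (0, 1)
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
print(X_scaled)
```

When a separate test set is used, the scaler would normally be fitted on the training data only and then applied to the test data with `scaler.transform`.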


### Model Training
After pre-processing the data, we can choose and build machine learning models from it. In our project, we will use linear regression models to help predict a property price.

Task 7 (checkpoint): To begin with, we use just a single feature (a.k.a. predictor or independent variable) together with the property price to build a very simple linear regression model. Specifically, we use the Land Area feature for this purpose. Then, the linear regression model is $y(w_0, w_1, x)=w_0+w_1x$. We need to learn $w_0$ and $w_1$ from the data. Refer to https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py for an example.

You need to complete the following steps:
* Select the feature and the target attribute to create the dataset.
* Split the data into a training dataset and a testing dataset. Given the small size of your collected dataset, you can keep just one data instance in the testing dataset.
* Build the linear regression model on the training dataset.
* Test the model (you need to choose a performance measure, e.g., RMSE).
* Visualize the learned model (essentially a line) together with the data instances.


In [None]:
# Build, test, and visualize a simple linear regression model here
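
The steps for Task 7 could be sketched as follows. The land areas and prices are made-up numbers, not real sales, and holding out exactly one instance mirrors the suggestion for very small datasets. RMSE is computed as the square root of scikit-learn's `mean_squared_error`.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up land areas and sold prices; substitute your own columns.
X = np.array([[300.0], [450.0], [520.0], [600.0], [680.0], [750.0]])
y = np.array([640e3, 820e3, 900e3, 1050e3, 1120e3, 1250e3])

# Hold out a single instance for testing (test_size=1 means one sample).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1, random_state=0
)

# Learn w0 (intercept) and w1 (slope) from the training data.
model = LinearRegression().fit(X_train, y_train)
w0, w1 = model.intercept_, model.coef_[0]

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"w0={w0:.1f}, w1={w1:.1f}, test RMSE={rmse:.1f}")

# Plot the data instances and the fitted line y = w0 + w1 * x.
fig, ax = plt.subplots()
ax.scatter(X_train[:, 0], y_train, label="train")
ax.scatter(X_test[:, 0], y_test, label="test")
x_line = np.linspace(X.min(), X.max(), 50)
ax.plot(x_line, w0 + w1 * x_line, label="fitted line")
ax.set_xlabel("Land area")
ax.set_ylabel("Sold price")
ax.legend()
```

With a single test instance the RMSE is simply the absolute error on that one house, so it is a very noisy estimate of model quality.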


Task 8: Build the linear regression model with multiple features. The single feature above may offer only limited information for the prediction. To improve the prediction performance, we can include more features/predictors. So, in this task, you are required to include other features such as the number of bedrooms, the number of bathrooms, and the number of car parking spaces. Follow the same procedure described above, except for the visualization step.

In [None]:
# Build and test the linear regression model with multiple features
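
A sketch of the multi-feature version. The data here are synthetic, generated from an arbitrary made-up linear rule plus noise purely so that the example runs; the feature order (bedrooms, bathrooms, parking, land area) is an assumption. The model has one coefficient per feature plus an intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not real sales); substitute your own DataFrame.
rng = np.random.default_rng(0)
n = 10
X = np.column_stack([
    rng.integers(2, 6, size=n),     # bedrooms
    rng.integers(1, 4, size=n),     # bathrooms
    rng.integers(0, 3, size=n),     # car parking spaces
    rng.uniform(300, 800, size=n),  # land area
])
y = 200_000 + X @ np.array([80_000, 50_000, 20_000, 900]) \
    + rng.normal(0, 30_000, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("test RMSE:", rmse)
```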


### Model Fine-tuning

Task 9: Use 5-fold cross-validation to train the model and report the average testing error.

In [None]:
# Apply 5-fold cross-validation to report a more robust testing error
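
A minimal sketch of 5-fold cross-validation with `cross_val_score`. The data are synthetic placeholders, and RMSE is an assumed choice of error measure. scikit-learn scorers follow a "higher is better" convention, so the RMSE is returned negated and has to be flipped back.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not real sales); substitute your own features/target.
rng = np.random.default_rng(0)
n = 10
X = np.column_stack([
    rng.integers(2, 6, size=n),     # bedrooms
    rng.integers(1, 4, size=n),     # bathrooms
    rng.integers(0, 3, size=n),     # car parking spaces
    rng.uniform(300, 800, size=n),  # land area
])
y = 200_000 + X @ np.array([80_000, 50_000, 20_000, 900]) \
    + rng.normal(0, 30_000, size=n)

# Shuffle before splitting so folds do not depend on the collection order.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores  # undo the sign flip
print("RMSE per fold:", rmse_per_fold)
print("average RMSE:", rmse_per_fold.mean())
```

With 10 instances, each fold tests on only 2 houses, so the per-fold errors vary a lot; the average is what Task 9 asks you to report.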


Task 10 (Optional): Build a ridge regression model from the data. Refer to https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression-and-classification for more details about the ridge regression model. Tune the complexity parameter to find its optimal value via grid search with cross-validation (refer to https://scikit-learn.org/stable/modules/grid_search.html#exhaustive-grid-search).

In [None]:
# Build a Ridge regression model from the data and tune the parameter with Grid Search
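
One way this could look, sketched on synthetic placeholder data. The candidate `alpha` values are arbitrary assumptions, and wrapping the scaler and the model in a pipeline is a design choice so that scaling is re-fitted inside each cross-validation fold. `make_pipeline` names the ridge step `ridge`, hence the `ridge__alpha` parameter key.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not real sales); substitute your own features/target.
rng = np.random.default_rng(0)
n = 10
X = np.column_stack([
    rng.integers(2, 6, size=n),     # bedrooms
    rng.integers(1, 4, size=n),     # bathrooms
    rng.integers(0, 3, size=n),     # car parking spaces
    rng.uniform(300, 800, size=n),  # land area
])
y = 200_000 + X @ np.array([80_000, 50_000, 20_000, 900]) \
    + rng.normal(0, 30_000, size=n)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # arbitrary candidate values
pipeline = make_pipeline(MinMaxScaler(), Ridge())
search = GridSearchCV(
    pipeline,
    param_grid={"ridge__alpha": alphas},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

print("best alpha:", search.best_params_["ridge__alpha"])
print("cross-validated RMSE:", -search.best_score_)
```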


### Model Deployment and Maintenance
You can integrate the trained model with a web service. You can also update the model by retraining it with new data.