# Installing libraries

In [None]:
%pip install pandas

In [None]:
%pip install -U scikit-learn

# Importing data

In [None]:
import pandas as pd

data = pd.read_csv("melb_data.csv")

# Cleaning data

Looking inside the dataset, you can see that it contains some missing values. The easiest way to handle them (although not always the best) is to drop the rows that contain them.

In [None]:
data.head()

In [None]:
data = data.dropna(axis=0)

In [None]:
data.head()
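
Before dropping rows, it can help to count how many values are missing in each column and how many rows *dropna* would remove. The sketch below uses a small made-up DataFrame (the `toy` values are hypothetical, not taken from `melb_data.csv`) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Rooms": [2, 3, np.nan, 4],
    "Price": [500000.0, np.nan, 720000.0, 910000.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna(axis=0) removes every row that contains at least one missing value
cleaned = toy.dropna(axis=0)
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

If a large share of the rows disappears, dropping may not be the best choice for that dataset.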

# Choosing prediction target and features

To choose a prediction target and the features for our prediction model, we first need to see the available data. To do that in a more readable way, we can list the column names.

In [None]:
data.columns

Then, using dot notation, we will select the column that will represent our prediction target - **y**.

In [None]:
y = data.Price

Using brackets we will select multiple columns as our features - **x**.

In [None]:
features = ['Rooms', 'Landsize', 'Bathroom', 'Lattitude', 'Longtitude']
x = data[features]
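
The two selection styles return different types, which the following minimal sketch illustrates. The `toy` DataFrame and its values are hypothetical and only mimic the column names used above.

```python
import pandas as pd

# Hypothetical toy data with the same style of column names
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Landsize": [150.0, 300.0, 420.0],
    "Price": [500000.0, 720000.0, 910000.0],
})

target = toy.Price                            # dot notation -> Series
same_target = toy["Price"]                    # single name in brackets -> same Series
feature_frame = toy[["Rooms", "Landsize"]]    # list of names -> DataFrame

print(type(target).__name__, type(feature_frame).__name__)
print(target.equals(same_target))
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute; bracket notation works for any column name.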

For a quick data review we can use the *describe()* method - it generates descriptive statistics.

In [None]:
x.describe()

# Creating ML Model

One of the most basic ML models is the Decision Tree - it has a hierarchical tree structure, which consists of a root node, branches, internal nodes and leaf nodes.
As seen above, the data that we use both as a prediction target and as features consists of continuous numerical values. Because the prediction target is continuous, this is a regression problem, so we will use *DecisionTreeRegressor* from the scikit-learn library.

In [None]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state = 0)
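
To make the root, internal and leaf nodes more concrete, a very small tree can be fitted and printed as text. This is only a sketch on hypothetical toy data (`rooms`, `price`); `max_depth=2` is an assumption made here to keep the output short and is not used in the model above.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical toy data: one feature (number of rooms) and a price
rooms = [[1], [2], [3], [4]]
price = [100.0, 200.0, 300.0, 400.0]

# max_depth=2 keeps the printed tree short
small_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
small_tree.fit(rooms, price)

# Text view: each line is either a split condition (root or internal node)
# or a predicted value (leaf node); indentation shows the depth
print(export_text(small_tree, feature_names=["Rooms"]))
print(small_tree.get_depth(), small_tree.get_n_leaves())
```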

Before we train our model, it is a good time to divide our data into two subsets - one for training and one for testing. Scoring the model on the same data it was trained on would hide overfitting and give an overly optimistic result. One of the easiest methods for random data splitting is the *train_test_split* helper function.

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
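
The effect of *test_size* is easiest to see on a tiny example. The sketch below uses hypothetical toy arrays; the `random_state=0` argument is an addition made here (it is not used in the cell above) that fixes the shuffle so repeated runs give the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
features_toy = np.arange(20).reshape(10, 2)
target_toy = np.arange(10)

# test_size=0.3 keeps 30% of the rows for testing
f_train, f_test, t_train, t_test = train_test_split(
    features_toy, target_toy, test_size=0.3, random_state=0
)
print(f_train.shape, f_test.shape)

# The same random_state reproduces the same split
_, f_test_again, _, _ = train_test_split(
    features_toy, target_toy, test_size=0.3, random_state=0
)
print(np.array_equal(f_test, f_test_again))
```

Without a fixed *random_state* the rows are shuffled differently on each run, so the scores computed later can vary between runs.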

Now we can train our model using the training subset and predict an output for our testing data.

In [None]:
model.fit(x_train, y_train)
prediction = model.predict(x_test)
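
The reason for keeping a separate testing subset can be illustrated by comparing the error on both subsets. This is a minimal sketch on hypothetical synthetic data (a noisy linear relationship generated with NumPy), not on the housing data: an unrestricted decision tree can memorize its training rows almost perfectly, while the error on unseen rows stays clearly larger.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: a noisy linear relationship
rng = np.random.default_rng(0)
features_toy = rng.uniform(0, 10, size=(200, 1))
target_toy = 3 * features_toy[:, 0] + rng.normal(0, 2, size=200)

f_train, f_test, t_train, t_test = train_test_split(
    features_toy, target_toy, test_size=0.3, random_state=0
)

# No depth limit, so the tree can grow until every training row is fitted
tree = DecisionTreeRegressor(random_state=0)
tree.fit(f_train, t_train)

# Error on rows the tree has seen vs. rows it has not seen
train_mae = mean_absolute_error(t_train, tree.predict(f_train))
test_mae = mean_absolute_error(t_test, tree.predict(f_test))
print(f"{train_mae=:.3f} {test_mae=:.3f}")
```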

Finally, we can compare our *y_test* values with the predictions for *x_test* to score our model using *mean_absolute_error* and the R² score (*r2_score*).

In [None]:
from sklearn.metrics import mean_absolute_error, r2_score

mae = mean_absolute_error(y_test, prediction)
r2score = r2_score(y_test, prediction)
print(f'{mae=} {r2score=}')
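
Both metrics are simple enough to compute by hand, which helps when interpreting them. The sketch below uses hypothetical `actual` and `predicted` arrays and checks the manual results against scikit-learn: the mean absolute error is the average of the absolute differences, and the R² score is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical values, chosen only to keep the arithmetic easy to follow
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 11.0])

# MAE: average absolute difference between actual and predicted values
manual_mae = np.mean(np.abs(actual - predicted))

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(manual_mae, mean_absolute_error(actual, predicted))
print(manual_r2, r2_score(actual, predicted))
```

A lower MAE is better and is expressed in the same units as the target. An R² of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean.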

In [None]:
from sklearn.ensemble import RandomForestRegressor

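# A random forest averages the predictions of many decision trees;
# n_estimators sets how many trees are built
model2 = RandomForestRegressor(n_estimators = 200)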

model2.fit(x_train, y_train)
prediction2 = model2.predict(x_test)
mae2 = mean_absolute_error(y_test, prediction2)
r2score2 = r2_score(y_test, prediction2)
print(f'{mae2=} {r2score2=}')