# Basics of Machine Learning

### Basic Data Exploration

In [1]:
import pandas as pd

In [5]:
df_path = 'https://raw.githubusercontent.com/ywchiu/riii/master/data/house-prices.csv'
df = pd.read_csv(df_path)
df.head()

,Home,Price,SqFt,Bedrooms,Bathrooms,Offers,Brick,Neighborhood
0,1,114300,1790,2,2,2,No,East
1,2,114200,2030,4,2,3,No,East
2,3,114800,1740,3,2,1,No,East
3,4,94700,1980,3,2,3,No,East
4,5,119800,2130,3,3,3,No,East


In [6]:
df.shape

(128, 8)

df['Neighborhood'] is not numeric, so I need to know which values it takes.

In [8]:
df.groupby('Neighborhood').count()

Neighborhood,Home,Price,SqFt,Bedrooms,Bathrooms,Offers,Brick
East,45,45,45,45,45,45,45
North,44,44,44,44,44,44,44
West,39,39,39,39,39,39,39
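
When only the distinct values of one column and their frequencies are needed, `value_counts()` gives a more compact answer than `groupby(...).count()`. A minimal sketch, using a hypothetical toy frame rather than the house-prices data:

```python
import pandas as pd

# Hypothetical toy data (not the house-prices dataset).
toy = pd.DataFrame({"Neighborhood": ["East", "North", "East", "West", "East"]})

# One row per distinct value, sorted by frequency (most common first).
neighborhood_counts = toy["Neighborhood"].value_counts()
print(neighborhood_counts)
```

The same call on the real data would be `df['Neighborhood'].value_counts()`.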


Count the null data

In [12]:
# isnull() marks missing entries as True; sum() counts them per column.
# Every column sums to 0, so there are no nulls.
df.isnull().sum()

Home            0
Price           0
SqFt            0
Bedrooms        0
Bathrooms       0
Offers          0
Brick           0
Neighborhood    0
dtype: int64
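
The difference between `count()` and `sum()` on a null mask is easy to miss, because `count()` returns the number of rows whether or not any data is missing. A minimal sketch on a hypothetical toy frame (not the house-prices data) that does contain missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column.
toy = pd.DataFrame({
    "Price": [100000.0, np.nan, 120000.0],
    "Brick": ["No", "Yes", None],
})

# count() tallies the non-null entries of the boolean mask. The mask itself
# never contains nulls, so this is always the number of rows.
rows_per_column = toy.isnull().count()

# sum() adds up the True values, i.e. the actual number of missing entries.
missing_per_column = toy.isnull().sum()

print(rows_per_column)
print(missing_per_column)
```

Here `rows_per_column` is 3 for both columns, while `missing_per_column` is 1 for both.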

In [13]:
df.describe()

,Home,Price,SqFt,Bedrooms,Bathrooms,Offers
count,128.0,128.0,128.0,128.0,128.0,128.0
mean,64.5,130427.34375,2000.9375,3.023438,2.445312,2.578125
std,37.094474,26868.770371,211.572431,0.725951,0.514492,1.069324
min,1.0,69100.0,1450.0,2.0,2.0,1.0
25%,32.75,111325.0,1880.0,3.0,2.0,2.0
50%,64.5,125950.0,2000.0,3.0,2.0,3.0
75%,96.25,148250.0,2140.0,3.0,3.0,3.0
max,128.0,211200.0,2590.0,5.0,4.0,6.0


### First Machine Learning Model

1. Missing values

In [14]:
# Drop any rows that contain missing values (a no-op here, since there are none)
df = df.dropna(axis=0)

2. Selecting the prediction target

In [15]:
# In this case I want to predict the price
y = df.Price

3. Choosing "features"

In [16]:
# In this case I'm going to work with the relevant numeric features of the dataframe
df.columns

Index(['Home', 'Price', 'SqFt', 'Bedrooms', 'Bathrooms', 'Offers', 'Brick',
       'Neighborhood'],
      dtype='object')

In [17]:
df_features = ['SqFt', 'Bedrooms', 'Bathrooms', 'Offers']

In [19]:
X = df[df_features]
X.describe()

,SqFt,Bedrooms,Bathrooms,Offers
count,128.0,128.0,128.0,128.0
mean,2000.9375,3.023438,2.445312,2.578125
std,211.572431,0.725951,0.514492,1.069324
min,1450.0,2.0,2.0,1.0
25%,1880.0,3.0,2.0,2.0
50%,2000.0,3.0,2.0,3.0
75%,2140.0,3.0,3.0,3.0
max,2590.0,5.0,4.0,6.0


In [20]:
X.head()

,SqFt,Bedrooms,Bathrooms,Offers
0,1790,2,2,2
1,2030,4,2,3
2,1740,3,2,1
3,1980,3,2,3
4,2130,3,3,3


<b>Building your model</b>
1. Define the type of model --> decision tree
2. Fit --> capture patterns from the provided data
3. Predict --> generate predictions with the fitted model
4. Evaluate --> determine how accurate the model's predictions are, using the R^2 score or MAE (Mean Absolute Error)

In [22]:
#1. Define the type of model
#https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
from sklearn.tree import DecisionTreeRegressor

df_model = DecisionTreeRegressor(random_state=1)

In [23]:
#2. Fit
df_model.fit(X, y)

In [25]:
#3. Predict
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(df_model.predict(X.head()))

Making predictions for the following 5 houses:
   SqFt  Bedrooms  Bathrooms  Offers
0  1790         2          2       2
1  2030         4          2       3
2  1740         3          2       1
3  1980         3          2       3
4  2130         3          3       3
The predictions are
[114300. 114200. 114800.  94700. 119800.]


In [28]:
#4. Evaluate R^2
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
#R^2: Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). 
from sklearn.metrics import r2_score

#r2_score(true_values, predict_values)
r2_score(y, df_model.predict(X))

0.9890507972556026

An R^2 of about 0.99: the model fits the data it was trained on very closely.
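
To make the score less of a black box, R^2 can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch on made-up numbers (hypothetical values, not the house prices), checked against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: how far predictions are from the true values.
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: how far the true values are from their own mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

A model that always predicts the mean of `y_true` gets `ss_res == ss_tot` and therefore an R^2 of 0; a model worse than that goes negative.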

In [29]:
#4. Evaluate MAE
#So, if a house cost $150,000 and you predicted it would cost $100,000 the error is $50,000.
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html
from sklearn.metrics import mean_absolute_error

#mean_absolute_error(true_values, predict_values)
mean_absolute_error(y, df_model.predict(X))

809.375

The MAE is about $810.
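
The MAE is simply the average of the absolute differences between actual and predicted prices. A minimal sketch on three hypothetical houses (the first one reuses the \$150,000 vs. \$100,000 example from the comment above), checked against `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted prices for three houses.
actual = np.array([150000.0, 200000.0, 120000.0])
predicted = np.array([100000.0, 210000.0, 120000.0])

# Per-house absolute errors: 50000, 10000 and 0.
abs_errors = np.abs(actual - predicted)

# MAE is their mean.
mae_manual = abs_errors.mean()
print(mae_manual, mean_absolute_error(actual, predicted))
```

Unlike R^2, the MAE is expressed in the same unit as the target, so it reads directly as "dollars off, on average".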

### Train test split

The Problem with "In-Sample" Scores
The measure we just computed can be called an "in-sample" score. We used a single "sample" of houses for both building the model and evaluating it. Here's why this is bad.

Imagine that, in the large real estate market, door color is unrelated to home price.

However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.

Since this pattern was derived from the training data, the model will appear accurate in the training data.

But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.


Explanation from: https://www.kaggle.com/code/dansbecker/model-validation
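
The green-door effect can be reproduced with synthetic data. In the sketch below, `door_code` is a hypothetical feature that is unrelated to price by construction, and the prices are random numbers rather than real data. An unrestricted decision tree still memorizes the training rows, so its in-sample error is essentially zero while its out-of-sample error is large:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 200

# Hypothetical feature: a shuffled, unique code per house, unrelated to price.
door_code = rng.permutation(n).reshape(-1, 1).astype(float)
# Synthetic prices drawn independently of the feature.
price = rng.normal(130000, 25000, size=n)

train_X, val_X, train_y, val_y = train_test_split(door_code, price, random_state=0)

# A fully grown tree can give every training row its own leaf.
tree = DecisionTreeRegressor(random_state=0).fit(train_X, train_y)

in_sample_mae = mean_absolute_error(train_y, tree.predict(train_X))
out_of_sample_mae = mean_absolute_error(val_y, tree.predict(val_X))
print(in_sample_mae, out_of_sample_mae)
```

The in-sample score rewards memorization, which is why the evaluation has to use rows the model never saw during fitting.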

In [31]:
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
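
With no `test_size` or `train_size` given, `train_test_split` holds out 25% of the rows for validation. A minimal sketch on placeholder arrays with the same shape as the data here (128 rows, 4 features); the `X_demo` and `y_demo` names are illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like X and y in this notebook.
X_demo = np.arange(128 * 4).reshape(128, 4)
y_demo = np.arange(128)

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    X_demo, y_demo, random_state=0
)

# Default split: 75% of rows for training, 25% for validation.
print(demo_train_X.shape, demo_val_X.shape)
```

So the model in the next cell is fit on 96 houses and evaluated on the remaining 32.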

In [33]:
# Define model
df_model = DecisionTreeRegressor()
# Fit model on the training split only
df_model.fit(train_X, train_y)
# Get predicted prices on the validation data
val_predictions = df_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))


20118.75


Your mean absolute error for the in-sample data was about \$810. Out-of-sample it is more than \$20,000.
