# Projekt - podstawy Machine Learning

Nadszedł czas na podsumownaie pracy w tym module.

Do rozwiązania mamy problem regresyjny. Mamy zaprognozować kwotę sprzedaży mieszkania. Taki model pozwoli na lepsze oszacowanie potencjalnej kwoty sprzedaży, szybszą finalizację transakcji i może być wsparciem dla agentów nieruchomości oraz dla sprzedawców i nabywców.

Link do zbioru danych: https://www.kaggle.com/datasets/mohammedaltet/egypt-houses-price

Kolumny w zbiorze:
- Type: the type of property
- Price: the price of property
- Bedrooms: number of bedrooms
- Bathrooms: number of bathrooms
- Area: the Area of the property by m^2
- Furnished: is the property Furnished or not
- Level: In what floor the property is ?
- Compound: ** In what Compound the property is ?**
- Payment_Option
- Delivery_Date
- City

Zadanie polega na eksploracji danych
oraz napisaniu funkcji, która wytrenuje i porówna przynajmniej dwa różne algorytmy oraz wybierze najlepsze rozwiązanie. 

In [1]:
import pandas as pd
import os
import numpy as np

In [2]:
# Wczytanie danych
os.chdir('../')
df = pd.read_csv("data/Egypt_Houses_Price.csv")


In [3]:
df.head()

Unnamed: 0,Type,Price,Bedrooms,Bathrooms,Area,Furnished,Level,Compound,Payment_Option,Delivery_Date,Delivery_Term,City
0,Duplex,4000000,3.0,3.0,400.0,No,7,Unknown,Cash,Ready to move,Finished,Nasr City
1,Apartment,4000000,3.0,3.0,160.0,No,10+,Unknown,Cash,Ready to move,Finished,Camp Caesar
2,Apartment,2250000,3.0,2.0,165.0,No,1,Unknown,Cash,Ready to move,Finished,Smoha
3,Apartment,1900000,3.0,2.0,230.0,No,10,Unknown,Cash,Ready to move,Finished,Nasr City
4,Apartment,5800000,2.0,3.0,160.0,No,Ground,Eastown,Cash,Ready to move,Semi Finished,New Cairo - El Tagamoa


In [4]:
df.describe()

Unnamed: 0,Type,Price,Bedrooms,Bathrooms,Area,Furnished,Level,Compound,Payment_Option,Delivery_Date,Delivery_Term,City
count,27361,27359,27158,27190,26890.0,27361,27361,27361,27361,27361,27361,27361
unique,11,4182,22,22,1073.0,3,14,560,4,10,5,183
top,Apartment,3000000,3,2,120.0,No,Unknown,Unknown,Cash or Installment,Ready to move,Finished,New Cairo - El Tagamoa
freq,8506,311,9784,7753,663.0,16500,10439,11068,10842,12142,14375,6789


In [5]:
df['Level'].value_counts()

Level
Unknown    10439
Ground      4821
2           3727
1           3592
3           2097
4            898
5            577
10+          257
6            223
7            216
Highest      178
8            129
10           104
9            103
Name: count, dtype: int64

In [5]:
cols_to_numeric = ['Price','Bedrooms','Bathrooms','Area','Level']

In [6]:
# zmiany wartosci unknown na puste
df[cols_to_numeric] = df[cols_to_numeric].replace('Unknown',np.nan)

In [7]:
# Posprzątanie zmiennej level
df.loc[df['Level']=='Ground','Level'] = 0

In [9]:
df.loc[df['Level'].isin(['10+','Highest']),'Level'] = 11

In [10]:
# wartości zmiennej level
df['Level'].value_counts()

Level
0     4821
2     3727
1     3592
3     2097
4      898
5      577
11     435
6      223
7      216
8      129
10     104
9      103
Name: count, dtype: int64

In [12]:
# wartości zmiennej bedrooms
df['Bedrooms'].value_counts()

Bedrooms
3       9784
2       4763
4       4219
3.0     2019
5       1883
4.0      978
1        889
2.0      777
6        630
5.0      457
7        193
1.0      163
6.0      137
8         71
7.0       47
10        45
9         38
8.0       10
9.0       10
10.0       8
10+        1
Name: count, dtype: int64

In [13]:
# wartości zmiennej bathrooms
df['Bathrooms'].value_counts()

Bathrooms
2       7753
3       6119
4       3219
1       3153
5       1444
2.0     1433
3.0     1383
4.0      706
1.0      533
6        485
5.0      365
7        222
6.0      114
8         78
7.0       46
10        38
9         33
8.0       16
10.0       7
9.0        6
10+        1
Name: count, dtype: int64

In [14]:
# usunięcie błędnych danych
df = df[df['Bathrooms']!='10+'].reset_index(drop=True)

In [15]:
# konwersja na kolumny
df[cols_to_numeric] = df[cols_to_numeric].astype('float')

In [16]:
# sprawdzenie braków danych
df.isna().sum()

Type                  0
Price                39
Bedrooms            239
Bathrooms           207
Area                507
Furnished             0
Level             10439
Compound              0
Payment_Option        0
Delivery_Date         0
Delivery_Term         0
City                  0
dtype: int64

In [17]:
# uzupelnienie braków danych 
df['Bedrooms'] = df['Bedrooms'].fillna(df['Bedrooms'].median())
df['Bathrooms'] = df['Bathrooms'].fillna(df['Bathrooms'].median())
df['Area'] = df['Area'].fillna(df['Area'].median())

In [39]:
#usuniecie rekordów bez ceny mieszkania
df = df[~(df['Price'].isna())].reset_index(drop=True)

In [19]:
# znalezienie nazw kolumn kategorycznych
cat_features = list(df.select_dtypes(include= 'object').columns)

In [20]:
cat_features

['Type',
 'Furnished',
 'Compound',
 'Payment_Option',
 'Delivery_Date',
 'Delivery_Term',
 'City']

In [21]:
del cat_features[cat_features.index('Compound')]
del cat_features[cat_features.index('City')]

In [22]:
# One-hot encoding
df = pd.get_dummies(data=df,columns=cat_features, drop_first=True)

In [23]:
df.head()

Unnamed: 0,Price,Bedrooms,Bathrooms,Area,Level,Compound,City,Type_Chalet,Type_Duplex,Type_Penthouse,...,Delivery_Date_2026,Delivery_Date_2027,Delivery_Date_Ready to move,Delivery_Date_Unknown,Delivery_Date_soon,Delivery_Date_within 6 months,Delivery_Term_Finished,Delivery_Term_Not Finished,Delivery_Term_Semi Finished,Delivery_Term_Unknown
0,4000000.0,3.0,3.0,400.0,7.0,Unknown,Nasr City,False,True,False,...,False,False,True,False,False,False,True,False,False,False
1,4000000.0,3.0,3.0,160.0,11.0,Unknown,Camp Caesar,False,False,False,...,False,False,True,False,False,False,True,False,False,False
2,2250000.0,3.0,2.0,165.0,1.0,Unknown,Smoha,False,False,False,...,False,False,True,False,False,False,True,False,False,False
3,1900000.0,3.0,2.0,230.0,10.0,Unknown,Nasr City,False,False,False,...,False,False,True,False,False,False,True,False,False,False
4,5800000.0,2.0,3.0,160.0,0.0,Eastown,New Cairo - El Tagamoa,False,False,False,...,False,False,True,False,False,False,False,False,True,False


In [24]:
from sklearn.model_selection import train_test_split

In [27]:
# podzielenie zbioru na train/test
train_x, test_x, train_y, test_y = train_test_split(df.drop(['Price', 'Compound','City'], axis=1), df['Price'], test_size = 0.2, random_state=123)

In [28]:
#wydzielenie zbioru walidacyjnego
train_x, valid_x, train_y, valid_y = train_test_split(train_x, train_y, test_size = 0.2, random_state=123)

In [29]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [31]:
from sklearn.metrics import mean_absolute_error

In [30]:
# lista modeli
models = [DecisionTreeRegressor(min_weight_fraction_leaf=0.002,max_depth=20),
          RandomForestRegressor(max_depth=10,min_weight_fraction_leaf=0.002,oob_score=True, max_leaf_nodes=20)]

In [32]:
# funkcja do wyboru modeli
def opt_fun(train_x: pd.DataFrame,
            test_x: pd.DataFrame, 
            train_y: pd.DataFrame| pd.Series,
            test_y: pd.DataFrame| pd.Series, 
            models: list) -> tuple:
    """
    This function estimates models predefined in models list and chooses the best one
    train_x: pd.DataFrame - Data frame containing train X
    test_x: pd.DataFrame  - Data frame containing test X, 
    train_y: pd.DataFrame| pd.Series  - Data frame or Series containing train y,
    test_y: pd.DataFrame| pd.Series - Data frame or Series containing test y, 
    models: list - list of models It should defined model object compatible with scikit-learn
    """
    best_model = None
    best_metric = None 

    for m in models:
        model = m.fit(train_x,train_y)
        test_pred = model.predict(test_x)
        metric = mean_absolute_error(test_y,test_pred)
        if best_metric is None or best_metric<metric:
            best_model = model 
            best_metric = metric
    return best_model, best_metric

In [33]:
model , score = opt_fun(train_x, test_x, train_y, test_y, models)

In [34]:
score

np.float64(2410721.4444414335)

In [35]:
pred_valid = model.predict(valid_x)
mean_absolute_error(valid_y, pred_valid)

np.float64(2560673.02611078)

In [36]:
df['Price'].describe()

count    2.732100e+04
mean     4.761923e+06
std      6.766756e+06
min      3.000000e+04
25%      1.150000e+06
50%      2.731000e+06
75%      5.990000e+06
max      2.400000e+08
Name: Price, dtype: float64