# Task info
1. **Dataset preprocessing.** Analyse every column in the dataset and explain why each one is modified, dropped, or kept as-is.
2. **Regression.** Train an sklearn linear regression, search for the best hyperparameters, and keep the best model. The target column is "price". 
Describe the model's accuracy, performance, score, etc.
3. **Classification.** Train an sklearn logistic regression, search for the best hyperparameters, and keep the best model. The target column is "premium". 

In [0]:
from os import cpu_count

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, GridSearchCV, train_test_split, KFold, cross_val_predict
)

# ML models
from sklearn.linear_model import LogisticRegression, Ridge
import pickle
from sklearn.base import ClassifierMixin, BaseEstimator, RegressorMixin

# metrics
from sklearn.metrics import classification_report, f1_score

# visualization
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
import matplotlib.style as style

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

random_state = 42
n_jobs = max(cpu_count()-1, 1)

**Dropped because they have no usable connection with the targets:**


1.   *id* - a row identifier, so it carries no information for model training
2.   *date* - does not affect the target columns (price and premium)
3. *lat and long* - raw coordinates only become useful after deriving something like a district or a "close to metro" flag from them, and that is outside the scope of this lab


In [0]:
house = pd.read_csv("house_prices_data.csv")
house = house.drop(["id", "date", "lat", "long"], axis=1)

In [3]:
house.describe()

statistic,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,premium,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,sqft_living15,sqft_lot15
count,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0
mean,540088.1,3.370842,2.114757,2079.899736,15106.97,1.494309,0.007542,0.234303,3.40943,0.075695,1788.390691,291.509045,1971.005136,84.402258,98077.939805,1986.552492,12768.455652
std,367127.2,0.930062,0.770163,918.440897,41420.51,0.539989,0.086517,0.766318,0.650743,0.264516,828.090978,442.575043,29.373411,401.67924,53.505026,685.391304,27304.179631
min,75000.0,0.0,0.0,290.0,520.0,1.0,0.0,0.0,1.0,0.0,290.0,0.0,1900.0,0.0,98001.0,399.0,651.0
25%,321950.0,3.0,1.75,1427.0,5040.0,1.0,0.0,0.0,3.0,0.0,1190.0,0.0,1951.0,0.0,98033.0,1490.0,5100.0
50%,450000.0,3.0,2.25,1910.0,7618.0,1.5,0.0,0.0,3.0,0.0,1560.0,0.0,1975.0,0.0,98065.0,1840.0,7620.0
75%,645000.0,4.0,2.5,2550.0,10688.0,2.0,0.0,0.0,4.0,0.0,2210.0,560.0,1997.0,0.0,98118.0,2360.0,10083.0
max,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,1.0,4.0,5.0,1.0,9410.0,4820.0,2015.0,2015.0,98199.0,6210.0,871200.0


In [0]:
import pandas_profiling

def get_profiling(df, output_name):
    profile = pandas_profiling.ProfileReport(df=df)
    profile.to_file(output_name)

In [0]:
get_profiling(house, "house_prices_statistics.html")

In the **statistics HTML file**, *view*, *sqft_basement* and *yr_renovated* appear in the warnings section because a large percentage of their values are zeros. *sqft_basement* is only 60% zeros, which is not too many, so I keep it. *waterfront* is 0 in 99.2% of rows, so it is not reasonable to use it.

> **Drop because of zeros:**

4.   *view*
5.  *yr_renovated*
6. *waterfront*

*zipcode* is also dropped, because of its small correlation with the target columns.
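
The zero percentages quoted from the profiling report can also be checked directly in pandas. A minimal sketch, assuming a small made-up frame `toy` in place of the real `house` data (the values below are invented for illustration):

```python
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "view": [0, 0, 0, 2, 0],
    "waterfront": [0, 0, 0, 0, 1],
    "sqft_basement": [0, 400, 0, 250, 0],
})

# (toy == 0) is a boolean frame; the mean of each column is its share of zeros.
zero_share = (toy == 0).mean().sort_values(ascending=False)
print(zero_share)
```

Running the same expression on `house` should give the per-column shares that the report flags in its warnings section.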

In [6]:
house = house.drop(["view", "yr_renovated", "waterfront", "zipcode"], axis=1)
house.describe()

statistic,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,condition,premium,sqft_above,sqft_basement,yr_built,sqft_living15,sqft_lot15
count,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0
mean,540088.1,3.370842,2.114757,2079.899736,15106.97,1.494309,3.40943,0.075695,1788.390691,291.509045,1971.005136,1986.552492,12768.455652
std,367127.2,0.930062,0.770163,918.440897,41420.51,0.539989,0.650743,0.264516,828.090978,442.575043,29.373411,685.391304,27304.179631
min,75000.0,0.0,0.0,290.0,520.0,1.0,1.0,0.0,290.0,0.0,1900.0,399.0,651.0
25%,321950.0,3.0,1.75,1427.0,5040.0,1.0,3.0,0.0,1190.0,0.0,1951.0,1490.0,5100.0
50%,450000.0,3.0,2.25,1910.0,7618.0,1.5,3.0,0.0,1560.0,0.0,1975.0,1840.0,7620.0
75%,645000.0,4.0,2.5,2550.0,10688.0,2.0,4.0,0.0,2210.0,560.0,1997.0,2360.0,10083.0
max,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,5.0,1.0,9410.0,4820.0,2015.0,6210.0,871200.0


**Sklearn logistic regression:**

In [0]:
def train_log_regression(x, y,
                         random_state,
                         metric='accuracy',
                         n_splits=5, 
                         ):

    # provides train/test indices to split data in train/test sets.
    kf = StratifiedKFold(n_splits=n_splits, random_state=random_state, shuffle=True)

    # create Logistic Regression with default params and fix random_state
    lr = LogisticRegression(random_state=random_state)

    # estimate its accuracy with cross-validation
    scores_lr = cross_val_score(
                                estimator=lr,
                                X=x, 
                                y=y, 
                                scoring=metric,
                                cv=kf,
                                n_jobs=n_jobs
                            ).mean()
    lr.fit(x,y)
    return lr, scores_lr



**Split the frame into features and target (get x, y):**

In [0]:
def get_xy(df, target="target"):
    # return y as a 1-D array, which is the shape sklearn estimators expect
    return df.drop(target, axis=1), df[target].values

In [0]:
X, y = get_xy(house, target="premium")

In [0]:
X_train, X_test, y_train, y_test = train_test_split( X, y ,
                                                    test_size=0.2, 
                                                    random_state=random_state, 
                                                    stratify = y)

**Results of training sklearn model:**

In [12]:
base_lr, scores_baseline = train_log_regression(X_train, y_train, random_state=random_state)
print(f"Base logistic regression score on train: {scores_baseline}")
print(f"Base logistic regression score on test: {base_lr.score(X_test, y_test)}")

Base logistic regression score on train: 0.9542510121457489
Base logistic regression score on test: 0.9516539440203562
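
These accuracies are easier to judge against the class balance. The `mean` of *premium* in the `describe()` output is 0.075695, so a model that always predicts "not premium" would already be right for roughly 92% of rows. A small sketch of that arithmetic, assuming the share of premium rows in the stratified training split is the same as in the full dataset:

```python
# Share of premium houses, taken from the `mean` row of describe() above.
positive_rate = 0.075695

# A model that always predicts 0 ("not premium") is right for every negative row.
majority_accuracy = 1 - positive_rate

# Cross-validated accuracy reported for the baseline logistic regression.
baseline_cv_accuracy = 0.9542510121457489

print(f"always-0 accuracy:  {majority_accuracy:.4f}")
print(f"gain over always-0: {baseline_cv_accuracy - majority_accuracy:.4f}")
```

The baseline therefore improves on the trivial predictor by about three percentage points, which is the margin the later steps are trying to widen.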


In [0]:
# from sklearn.model_selection import cross_val_score
# from sklearn.metrics import recall_score
# from sklearn.tree import DecisionTreeClassifier
# clf = DecisionTreeClassifier(random_state=0)
# cross_val_score(clf, X_train, y_train, cv=10)
# # clf.predict(X_train)
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_train)
# recall_score(y_test, y_train, average="binary")

**Normalize:**

In [16]:
X,y= get_xy(house, target="premium")
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split( X, y ,
                                                    test_size=0.2, 
                                                    random_state=random_state, 
                                                    stratify = y)
log_reg, score = train_log_regression(X_train, y_train, random_state)

print(f"Train logistic regression score with standardization: {score}")
print(f"Test logistic regression score with standardization: {log_reg.score(X_test, y_test)}")

Train logistic regression score with standardization: 0.9556969346443032
Test logistic regression score with standardization: 0.9530418690724034


Accuracy increases slightly: from 0.9543 to 0.9557 in cross-validation and from 0.9517 to 0.9530 on the test set.
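
One caveat about the cell above: `scaler.fit_transform(X)` runs before `train_test_split`, so the held-out rows contribute to the mean and standard deviation used for scaling. A minimal sketch of one way to avoid that, assuming synthetic data from `make_classification` in place of the housing frame (the names `X_demo`, `y_demo` and `pipe` are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the housing features (not the real data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.92, 0.08], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The scaler is a pipeline step, so it is re-fitted on the training part of
# every fold and never sees the rows it is evaluated on.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_accuracy = cross_val_score(pipe, X_tr, y_tr, scoring="accuracy", cv=cv).mean()

# Final fit on the training split only; the test split is used once for scoring.
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"CV accuracy: {cv_accuracy:.4f}, test accuracy: {test_accuracy:.4f}")
```

With a dataset of this size the effect on the scores is probably small, but the pipeline form keeps the evaluation honest by construction.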

**Grid search (which params are the best?):**

In [22]:
%%time

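params = {'C': [0.1, 0.001, 0.0001, 0.5, 0.9],
          # 'none' disables regularization, so C is ignored for those candidates
          # (scikit-learn >= 1.2 spells this option None)
          'penalty': ['none', 'l2'],
          'class_weight': ['balanced', None]
}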

estimator = LogisticRegression(random_state=random_state)
kf = StratifiedKFold(n_splits=5, random_state=random_state, shuffle=True)
    
gs = GridSearchCV(
    estimator=estimator,  
    param_grid=params,  
    cv=kf,  
    error_score=np.nan,  # a failed fit scores NaN, so it can never be picked as the best candidate
    scoring='accuracy',  
    n_jobs=n_jobs,
    verbose=1,  
)

gs.fit(
    X=X_train,
    y=y_train
)

best_params = gs.best_params_
best_score = gs.best_score_
best_lr = gs.best_estimator_

print('accuracy best: {:.4f}, +{:.4f} better than baseline'.format(
    best_score, (best_score - scores_baseline))
)
print(f'best params: {best_params}')

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


accuracy best: 0.9558, +0.0015 better than baseline
best params: {'C': 0.1, 'class_weight': None, 'penalty': 'l2'}
CPU times: user 6.76 s, sys: 4.25 s, total: 11 s
Wall time: 5.58 s


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    5.5s finished
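
The search above optimizes accuracy, which is dominated by the majority class when only about 7.6% of rows are premium. A hedged sketch of the same kind of search scored with F1 and summarized with `classification_report` (both are imported at the top of the notebook). It assumes synthetic data from `make_classification`, and the grid values are illustrative, not the ones tuned above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in for the standardized housing features.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.92, 0.08], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

search = GridSearchCV(
    estimator=LogisticRegression(random_state=42, max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0], "class_weight": ["balanced", None]},
    scoring="f1",  # F1 of the positive (minority) class
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_tr, y_tr)

# Per-class precision and recall show what a single accuracy number hides.
y_pred = search.predict(X_te)
test_f1 = f1_score(y_te, y_pred)
print(search.best_params_)
print(classification_report(y_te, y_pred, zero_division=0))
```

Under F1 scoring, `class_weight='balanced'` may well win where it lost under accuracy, since it trades some overall accuracy for recall on the premium class.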


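Reload the dataset for the regression task. In addition to the columns dropped earlier, *sqft_above* and *sqft_basement* are dropped here: in the `describe()` output above their means add up to the mean of *sqft_living* (1788.39 + 291.51 = 2079.90), which suggests they only repeat information already in that column.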

In [0]:
house = pd.read_csv("house_prices_data.csv")

house = house.drop(["id", "date", "lat", "long", "yr_renovated", "waterfront", "view", "sqft_above", "sqft_basement", "zipcode"], axis=1)

**Sklearn linear model:**

In [0]:
def train_lin_regression(x, y,
                         random_state,
                         metric='neg_mean_squared_error',
                         n_splits=5, 
                         ):

    # provides train/test indices to split data in train/test sets
    kf = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)
    # linear least squares with l2 regularization
    lr = Ridge(random_state=random_state)

    # estimate the mean cross-validated score (negative MSE by default)
    scores_lr = cross_val_score(
                                estimator=lr,
                                X=x, 
                                y=y, 
                                scoring=metric,
                                cv=kf,
                                n_jobs=n_jobs
                            ).mean()
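    lr.fit(x,y)
    # out-of-fold predictions vs. true prices (points near the diagonal are good)
    predictions = cross_val_predict(lr, x, y, cv=kf)
    plt.scatter(y, predictions)
    plt.xlabel("true price")
    plt.ylabel("cross-validated prediction")
    return lr, scores_lr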

In [25]:
X, y = get_xy(house, target="price")
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, 
                                                    random_state=random_state, 
                                                    )
lr, score = train_lin_regression(X_train, y_train, random_state=random_state, metric='neg_mean_squared_error')
print(f"Linear regression MSE score on train: {score}")
print(f"Linear regression R2 score on test: {lr.score(X_test, y_test)}")

Linear regression MSE score on train: -52609749545.47893
Linear regression R2 score on test: 0.5951105932627119
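
The "MSE score" printed above is negative because scikit-learn's `neg_mean_squared_error` scorer negates the error so that larger is always better. Flipping the sign and taking the square root gives an error in the same units as *price*. A small sketch using the number printed above:

```python
import math

# Mean cross-validated score printed above for scoring='neg_mean_squared_error'.
neg_mse = -52609749545.47893

# Flip the sign back to get the usual (positive) mean squared error.
mse = -neg_mse

# RMSE is in the same units as the target, so it is easier to interpret.
rmse = math.sqrt(mse)

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
```

The RMSE comes out at roughly 229 thousand, compared with a standard deviation of about 367 thousand for *price* in the `describe()` output.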


Trying to improve my model:

**1. One hot encoding:**
The *condition* column is categorical (it takes only 5 values).

**2. Normalize**

In [0]:
categ_col = ["condition"]
binary_col = ["premium"]
target_col = ["price"]
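# to make 5 conditions:
house = pd.get_dummies(house, columns=categ_col)
# columns to normalize; "condition" no longer exists after get_dummies, so the
# new condition_* dummy columns stay in this list and are standardized as well:
numerical_cols = list(set(house.columns) - set(categ_col) - set(binary_col) - set(target_col))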

standard_scaler = StandardScaler()
house[numerical_cols] = standard_scaler.fit_transform(house[numerical_cols])
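
In the cell above the one-hot columns end up standardized together with the numeric ones. If the intent is to scale only the numeric features and keep the dummies as 0/1, a `ColumnTransformer` can route each group to its own transformer. A minimal sketch, assuming a tiny made-up frame `toy` and illustrative column lists `numeric` and `categorical`:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame with the same kinds of columns as the housing data.
toy = pd.DataFrame({
    "sqft_living": [1000, 2500, 800, 2000, 1700, 5000],
    "bedrooms": [3, 3, 2, 4, 3, 4],
    "condition": [3, 3, 3, 5, 3, 4],
    "price": [200000, 500000, 150000, 600000, 450000, 1200000],
})
numeric = ["sqft_living", "bedrooms"]
categorical = ["condition"]
features = toy[numeric + categorical]

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric),  # standardized
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),  # kept as 0/1
    ],
    sparse_threshold=0,  # always return a dense array
)
model = Pipeline([("prep", preprocess), ("ridge", Ridge(random_state=42))])
model.fit(features, toy["price"])

# Two scaled numeric columns followed by one 0/1 column per condition value.
encoded = model.named_steps["prep"].transform(features)
print(np.round(encoded, 2))
```

Because the preprocessing lives inside the pipeline, it is also fitted only on the rows passed to `fit`, which avoids scaling with statistics from the test split.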

In [27]:
house.describe()

statistic,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,premium,yr_built,sqft_living15,sqft_lot15,condition_1,condition_2,condition_3,condition_4,condition_5
count,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0
mean,540088.1,-1.254167e-15,-1.051951e-15,3.174253e-16,3.281921e-17,-1.753125e-14,0.075695,3.592925e-15,-1.506632e-16,1.235382e-16,-1.104958e-16,1.723629e-15,1.124148e-14,5.170733e-15,-3.325794e-15
std,367127.2,1.000023,1.000023,1.000023,1.000023,1.000023,0.264516,1.000023,1.000023,1.000023,1.000023,1.000023,1.000023,1.000023,1.000023
min,75000.0,-3.624404,-2.74592,-1.948891,-0.3521759,-0.915427,0.0,-2.417383,-2.316325,-0.4438052,-0.03728247,-0.0895657,-1.360356,-0.5969989,-0.292277
25%,321950.0,-0.3987371,-0.4736214,-0.7108948,-0.2430487,-0.915427,0.0,-0.6810785,-0.7244971,-0.2808593,-0.03728247,-0.0895657,-1.360356,-0.5969989,-0.292277
50%,450000.0,-0.3987371,0.1756067,-0.1849914,-0.1808075,0.01053939,0.0,0.1360059,-0.213828,-0.1885636,-0.03728247,-0.0895657,0.735102,-0.5969989,-0.292277
75%,645000.0,0.6764851,0.5002207,0.5118578,-0.106688,0.9365058,0.0,0.8849999,0.5448802,-0.09835556,-0.03728247,-0.0895657,0.735102,1.675045,-0.292277
max,7700000.0,31.85793,7.64173,12.47807,39.50434,3.714405,1.0,1.497813,6.162239,31.44029,26.82225,11.16499,0.735102,1.675045,3.421411


In [28]:
X, y = get_xy(house, target="price")

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, 
                                                    random_state=random_state, 
                                                    )
base_lr, scores_baseline = train_lin_regression(X_train, y_train, random_state=random_state, metric='neg_mean_squared_error')
print(f"Standardized and OHE linear regression MSE score on train: {scores_baseline}")
print(f"Standardized and OHE linear regression R2 score on test: {base_lr.score(X_test, y_test)}")

Standardized and OHE linear regression MSE score on train: -52620743032.23311
Standardized and OHE linear regression R2 score on test: 0.595305063188655


Almost no change: the test R2 moves from 0.5951 to 0.5953, and the cross-validated MSE stays at about 5.26e10.

In the statistics HTML file, *zipcode* shows no correlation with the target, price. That correlation treats the zip code as an ordinary number, though, while it is really a categorical label, so one-hot encoding it may still help.

In [0]:
house = pd.read_csv("house_prices_data.csv")

house = house.drop(["id", "date", "lat", "long", "yr_renovated", "waterfront", "view", "sqft_above", "sqft_basement"], axis=1)

We should cast *zipcode* to an object (string) column so that each zip code is treated as a label rather than a number, as it appears in the statistics file.

In [30]:
house["zipcode"] = house["zipcode"].apply(lambda x: str(int(x))[:5]).astype('object')
house["zipcode"].describe()

count     21613
unique       70
top       98103
freq        602
Name: zipcode, dtype: object

In [0]:
categ_col = ["condition","zipcode"]
binary_col = ["premium"]
target_col = ["price"]
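# to make categories:
house = pd.get_dummies(house, columns=categ_col)
# columns to normalize; the original names in categ_col are gone after
# get_dummies, so the dummy columns are standardized too:
numerical_cols = list(set(house.columns) - set(categ_col) - set(binary_col) - set(target_col))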

standard_scaler = StandardScaler()
house[numerical_cols] = standard_scaler.fit_transform(house[numerical_cols])

In [32]:
X, y = get_xy(house, target="price")

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, 
                                                    random_state=random_state, 
                                                    )
base_lr, scores_baseline = train_lin_regression(X_train, y_train, random_state=random_state, metric='neg_mean_squared_error')
print(f"Standardized and OHE linear regression MSE score on train: {scores_baseline}")
print(f"Standardized and OHE linear regression R2 score on test: {base_lr.score(X_test, y_test)}")

Standardized and OHE linear regression MSE score on train: -31128095412.885353
Standardized and OHE linear regression R2 score on test: 0.7591921361119804


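We want to minimize the squared error. With one-hot encoded *zipcode* it becomes much smaller: the cross-validated MSE drops from about 5.26e10 to about 3.11e10, and the test R2 rises from 0.595 to 0.759.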
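
To put the improvement in relative terms, the two cross-validated scores and the two test R2 values printed above can be compared directly. A small sketch of that arithmetic (the variable names are illustrative):

```python
# Mean cross-validated negative MSE printed by the two runs above.
neg_mse_without_zipcode = -52620743032.23311
neg_mse_with_zipcode = -31128095412.885353

# Test R2 printed by the same two runs.
r2_without_zipcode = 0.595305063188655
r2_with_zipcode = 0.7591921361119804

# Flip the sign to get positive mean squared errors.
mse_before = -neg_mse_without_zipcode
mse_after = -neg_mse_with_zipcode

# Fraction of the squared error removed by adding the zipcode dummies.
relative_drop = (mse_before - mse_after) / mse_before
r2_gain = r2_with_zipcode - r2_without_zipcode

print(f"relative MSE reduction: {relative_drop:.1%}")
print(f"R2 gain on test:        {r2_gain:.3f}")
```

The squared error falls by roughly 41%, which suggests that location carries much of the price signal the other columns miss.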