# HW1 - sklearn ml - development

# Data prep for housing problem

In order to make this dataset usable for both regression problems and classification problems, we need to:

* construct a binary target variable
* do some column dropping and reordering
* write out the new file to a csv file

In [66]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport

In [67]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

In [68]:
%matplotlib inline

Read in the original dataset

Note: the classification version of the file must be read here, not the original file.

In [71]:
housing_df = pd.read_csv("./data/kc_house_data_classification.csv")

In [72]:
# housing_df = pd.read_csv("./data/kc_house_data_original.csv")

In [73]:
housing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   bedrooms       21613 non-null  int64  
 1   bathrooms      21613 non-null  float64
 2   sqft_living    21613 non-null  int64  
 3   sqft_lot       21613 non-null  int64  
 4   floors         21613 non-null  float64
 5   waterfront     21613 non-null  int64  
 6   view           21613 non-null  int64  
 7   condition      21613 non-null  int64  
 8   grade          21613 non-null  int64  
 9   sqft_above     21613 non-null  int64  
 10  sqft_basement  21613 non-null  int64  
 11  yr_built       21613 non-null  int64  
 12  yr_renovated   21613 non-null  int64  
 13  zipcode        21613 non-null  int64  
 14  lat            21613 non-null  float64
 15  long           21613 non-null  float64
 16  sqft_living15  21613 non-null  int64  
 17  sqft_lot15     21613 non-null  int64  
 18  price_

In [74]:
housing_df.head()

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price_gt_1M
0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650,0
1,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639,0
2,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062,0
3,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000,0
4,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503,0


## Data prep steps

* create a `price_gt_1M` column for classification
* drop the `id` and `date` columns
* reorder the cols so that `price` and `price_gt_1M` are at the end


Let's just double check that True=1 and False=0.

In [77]:
print(int(True), int(False))

1 0


Create new `price_gt_1M` field based on whether or not `price` is greater than or equal to $1M.

In [79]:
# housing_df['price_gt_1M'] = housing_df['price'].map(lambda x: int(x >= 1000000)) 
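
The commented-out `map` call above could also be written as a vectorized comparison. Here is a minimal sketch on a made-up three-row frame; the name `toy_df` and its values are illustrative assumptions, not rows from the King County file.

```python
import pandas as pd

# Hypothetical toy data standing in for the original housing file
toy_df = pd.DataFrame({
    "price": [221900.0, 1000000.0, 1225000.0],
    "bedrooms": [3, 4, 4],
})

# Vectorized equivalent of .map(lambda x: int(x >= 1000000)):
# the comparison yields booleans, and astype(int) maps True/False to 1/0
toy_df["price_gt_1M"] = (toy_df["price"] >= 1_000_000).astype(int)

print(toy_df)
```

Note that a price of exactly $1M is labeled 1 here, because the comparison is `>=`, as in the original line.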

Drop the `id` and `date` columns since we won't use them for this problem.

In [81]:
#housing_df = housing_df.iloc[:, 2:]
#housing_df.info()

Now let's reorder the columns so that `price` is moved to the end. The basic strategy is to create a list of the column positions in the order we want them and pass that list to the `.iloc` selector. The regression list below takes the first column (position 0) and moves it to the end.

In [83]:
#newcols_class = [_ for _ in range(1, 20)]
#newcols_class

In [84]:
#newcols_regression = [_ for _ in range(1, 18)]
#newcols_regression.extend([0])
#newcols_regression

In [85]:
#housing_class_df = housing_df.iloc[:, newcols_class]
#housing_class_df.info()

In [86]:
#housing_regression_df = housing_df.iloc[:, newcols_regression]
#housing_regression_df.info()
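
The positional reordering described above can be sketched on a small frame. The toy columns and the names `toy_df`, `by_position`, and `by_name` are assumptions for illustration; the name-based variant is an alternative that avoids counting columns by hand.

```python
import pandas as pd

# Hypothetical toy frame with `price` in the first position
toy_df = pd.DataFrame({
    "price": [221900.0, 1225000.0],
    "bedrooms": [3, 4],
    "bathrooms": [1.0, 4.5],
    "price_gt_1M": [0, 1],
})

# Positional approach: positions 1..n-1 first, then position 0 at the end
order = list(range(1, toy_df.shape[1])) + [0]
by_position = toy_df.iloc[:, order]

# Name-based approach: same result, without hard-coding column counts
by_name = toy_df[[c for c in toy_df.columns if c != "price"] + ["price"]]

print(by_position.columns.tolist())
```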

Finally, write out the new dataframe to a new csv file.

In [88]:
#housing_class_df.to_csv("./data/kc_house_data_classification.csv", index=False)

In [89]:
#housing_df.head(10)

## Task 3 - EDA

### Using Sweetviz

In [92]:
import sweetviz

In [93]:
report = sweetviz.analyze(housing_df)

In [94]:
# report.show_html("output/sweetviz_hw1report.html")

### Using Pandas Profiling

In [96]:
profile = ProfileReport(housing_df, title="Pandas Profiling Report")

In [97]:
profile.to_file("output/pandas_profiling_report.html")

In [98]:
# profile

The two reports differ slightly. Sweetviz treated `floors` as a categorical variable, while Pandas Profiling treated it as numeric. For this homework I will not treat `floors` as categorical in my models.

## Task 4 - Categorical vs. Numeric

Here are the preprocessing steps I took before developing the models later in this notebook.

First, let's see what the data types of our variables are.

In [103]:
housing_df.dtypes

bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
price_gt_1M        int64
dtype: object

In [121]:
# housing_class_df.dtypes  # housing_class_df is not defined here: the cell that creates it is commented out above

Since every variable is stored with a numeric dtype, we have to decide explicitly which variables go into our numeric and categorical lists.

Next, I will convert `view`, `waterfront`, and `condition` to categorical data using the following code. `price_gt_1M` is the target, so it is left as an integer (its conversion is commented out).

In [140]:
housing_df["view"] = housing_df["view"].astype("category")
housing_df["waterfront"] = housing_df["waterfront"].astype("category")
housing_df["condition"] = housing_df["condition"].astype("category")
# housing_df["price_gt_1M"] = housing_df["price_gt_1M"].astype("category")

Here is the resulting output for the numeric and categorical variables:

In [143]:
categorical_cols = housing_df.select_dtypes(include=['category']).columns.tolist()
numeric_cols = housing_df.select_dtypes(include=['number']).columns.tolist()

In [145]:
categorical_cols

['waterfront', 'view', 'condition']

In [147]:
numeric_cols

['bedrooms',
 'bathrooms',
 'sqft_living',
 'sqft_lot',
 'floors',
 'grade',
 'sqft_above',
 'sqft_basement',
 'yr_built',
 'yr_renovated',
 'zipcode',
 'lat',
 'long',
 'sqft_living15',
 'sqft_lot15',
 'price_gt_1M']

In [None]:
# Drop the target `price_gt_1M` (the last entry) so only features remain
numeric_cols = numeric_cols[:-1]
numeric_cols

In [None]:
X = housing_df.iloc[:, 0:18]
y = housing_df.iloc[:, 18]
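
The positional slices above depend on the target being the last column. A name-based split is a possible alternative; the sketch below uses a made-up frame, and `toy_df`, `X_toy`, `y_toy`, and the two list names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with one numeric feature, one categorical feature,
# and the binary target
toy_df = pd.DataFrame({
    "sqft_living": [1180, 2570, 770],
    "waterfront": pd.Series([0, 0, 1], dtype="category"),
    "price_gt_1M": [0, 0, 1],
})

target = "price_gt_1M"

# Split by name so the result does not depend on column order
X_toy = toy_df.drop(columns=[target])
y_toy = toy_df[target]

# Build the column lists from X only, so the target never leaks into them
numeric_toy = X_toy.select_dtypes(include=["number"]).columns.tolist()
categorical_toy = X_toy.select_dtypes(include=["category"]).columns.tolist()

print(numeric_toy, categorical_toy)
```

Because the lists are derived from the features only, no separate step is needed to remove the target from the numeric list.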

In [None]:
housing_df.info()

In [None]:
X.info()

In [None]:
y.info()

In [None]:
# Encode for string labels
#label_encoder = LabelEncoder().fit(y)
#y = label_encoder.transform(y)

## Task 4 - Logistic Regression models

### Pipeline for preprocessing

In [None]:
# Create a StandardScaler object to use on our numeric variables
numeric_transformer = StandardScaler()

In [None]:
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)])

In [None]:
# Classifier model
clf_model = LogisticRegression(penalty='l2', C=1, solver='saga', max_iter=500)

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_model)])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=73)

# Fit model on new training data - notice that clf is actually the Pipeline
clf.fit(X_train, y_train)

print(f"Training score: {clf.score(X_train, y_train):.3f}")
print(f"Test score: {clf.score(X_test, y_test):.3f}")

In [None]:
y_train.info()

#### Model 0

In [None]:
dummy_clf = DummyClassifier(strategy="most_frequent")

In [None]:
dummy_clf.fit(X, y)

In [None]:
dummy_clf.predict(X)

In [None]:
dummy_clf.score(X, y)
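
The `most_frequent` baseline always predicts the majority class, so its accuracy equals the share of that class. A small sketch with made-up labels (9 negatives and 1 positive, an assumption chosen for illustration) shows the relationship.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced labels: 9 negatives, 1 positive
toy_y = np.array([0] * 9 + [1])
toy_X = np.zeros((10, 1))  # the dummy model ignores the feature values

baseline = DummyClassifier(strategy="most_frequent").fit(toy_X, toy_y)
baseline_acc = baseline.score(toy_X, toy_y)

# Share of the most common class in the labels
majority_share = np.bincount(toy_y).max() / len(toy_y)

print(baseline_acc, majority_share)
```

Any real model should beat this number before its accuracy is considered meaningful on an imbalanced target.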

#### Model 1

In [None]:
# Ridge (L2-penalized) logistic regression, C = 1.0
clf_model_ridge = LogisticRegression(penalty='l2', C=1.0, solver='saga', max_iter=2000)

In [None]:
# Appending classifier to preprocessing pipeline.
clf_model1 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_model_ridge)])

In [None]:
# Fitting model on training data 
clf_model1.fit(X_train, y_train)

# Output statement
print(f"Training score: {clf_model1.score(X_train, y_train):.3f}")
print(f"Test score: {clf_model1.score(X_test, y_test):.3f}")

#### Model 2

In [None]:
# Lasso (L1-penalized) logistic regression, C = 1.0
clf_model2_lasso = LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=2000)

# Appending classifier to preprocessing pipeline.
clf_model2 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_model2_lasso)])

# Fitting model on training data 
clf_model2.fit(X_train, y_train)

# Output statement
print(f"Training score: {clf_model2.score(X_train, y_train):.3f}")
print(f"Test score: {clf_model2.score(X_test, y_test):.3f}")

Explanation and comparison of Models 1 and 2:

#### Model 3

In [None]:
# Lasso (L1-penalized) logistic regression, C = 0.01
clf_model3_lasso = LogisticRegression(penalty='l1', C=0.01, solver='saga', max_iter=2000)

# Appending classifier to preprocessing pipeline.
clf_model3 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_model3_lasso)])

# Fitting model on training data 
clf_model3.fit(X_train, y_train)

# Output statement
print(f"Training score: {clf_model3.score(X_train, y_train):.3f}")
print(f"Test score: {clf_model3.score(X_test, y_test):.3f}")

ANSWER AND EXPLANATION: Does this enforce more or less regularization? Create the same outputs and compare the performance to the first two models. Discuss why the plot looks so different from the previous plots.
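
As background for the question above: in scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` penalizes coefficients more heavily. The sketch below uses synthetic data, not the housing data, and all names in it are illustrative assumptions. It counts how many L1-penalized coefficients stay non-zero at the two `C` values used in Models 2 and 3.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, only 5 of them informative
X_syn, y_syn = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_syn = StandardScaler().fit_transform(X_syn)

# Fit the same L1 model at a weak and a strong regularization setting
nonzero = {}
for C in (1.0, 0.01):
    model = LogisticRegression(
        penalty="l1", C=C, solver="saga", max_iter=5000, random_state=0
    )
    model.fit(X_syn, y_syn)
    nonzero[C] = int(np.count_nonzero(model.coef_))

print(nonzero)  # number of non-zero coefficients per C value
```

A coefficient plot for the strongly regularized model would show many bars at exactly zero, which is one reason such a plot can look very different from the others.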

#### Model 4

In [None]:
# LR, Optimal C value
clf_model4_cv = LogisticRegressionCV(penalty='l1', solver='saga', max_iter=2000)

# Appending classifier to preprocessing pipeline.
clf_model4 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_model4_cv)])

# Fitting model on training data 
clf_model4.fit(X_train, y_train)

# Output statement
print(f"Training score: {clf_model4.score(X_train, y_train):.3f}")
print(f"Test score: {clf_model4.score(X_test, y_test):.3f}")

EXPLANATION AND ANSWER: Does regularization help for this problem? Need confusion matrices for this problem.
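
A confusion matrix can be computed with `confusion_matrix`, which is already imported at the top of this notebook. The sketch below runs on synthetic, imbalanced data rather than the housing data; the variable names and the plain `LogisticRegression` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 10% positives
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=73, stratify=y_syn
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_te, model.predict(X_te), labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
```

For a fitted pipeline such as `clf_model1`, `ConfusionMatrixDisplay.from_estimator(clf_model1, X_test, y_test)` should draw the same matrix as a plot.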

### Task 5 - Decision Tree

In [None]:
# Create a DecisionTreeClassifier model. 
tree_task5 = DecisionTreeClassifier(min_samples_split=20)

# Fit the model using our features and target variables
tree_task5.fit(X_train, y_train)

# Get % accuracy on the training data
tree_task5.score(X_train, y_train)

In [None]:
# Making prediction
tree_testclasses = tree_task5.predict(X_test)
print(tree_testclasses[:10])

In [None]:
# Fitting Random Forest
clf_RF_model_final = RandomForestClassifier(oob_score=True, random_state=0)

# Append classifier to preprocessing pipeline.
clf_RF_final = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_RF_model_final)])

# Fit model on training data 
clf_RF_final.fit(X_train, y_train)
print("Training score: %.3f" % clf_RF_final.score(X_train, y_train))

# Make predictions on the test data
clf_RF_final_predictions = clf_RF_final.predict(X_test)
print(clf_RF_final_predictions[:10])  # Print out a few predictions just to see what they look like
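
The forest above is created with `oob_score=True`, but the out-of-bag estimate is never read. A minimal sketch of how it could be retrieved from inside a pipeline, using synthetic data and hypothetical names (`rf_pipe`, `oob`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

rf_pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("classifier", RandomForestClassifier(oob_score=True, random_state=0)),
])
rf_pipe.fit(X_syn, y_syn)

# The fitted forest lives inside the pipeline; oob_score_ is its accuracy
# on the samples each tree did not see during bootstrapping
oob = rf_pipe.named_steps["classifier"].oob_score_

print(f"OOB score: {oob:.3f}")
```

For the pipeline in this notebook, the corresponding expression would be `clf_RF_final.named_steps['classifier'].oob_score_`. It is a less optimistic number than the training score.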


ANSWER AND EXPLANATION: Discuss the performance relative to your logistic regression models.

In [None]:
#if 'price_gt_1M' in y:
  # print("Column 'Name' is present in the DataFrame")
#else:
 #  print("Column 'Name' is not present in the DataFrame") 

### Task 6 - Error Exploration

In [None]:
# Classifier model
clf_task6_lasso = LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=2000)

# Append classifier to preprocessing pipeline.
clf_task6 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf_task6_lasso)])

# Fit model on training data 
clf_task6.fit(X_train, y_train)

print(f"Training score: {clf_task6.score(X_train, y_train):.3f}")
print(f"Test score: {clf_task6.score(X_test, y_test):.3f}")


# Make predictions on the test data
clf_task6_predictions = clf_task6.predict(X_test)
print(clf_task6_predictions[:10])  # Print out a few predictions just to see what they look like
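
One possible next step for error exploration is to pull out the misclassified test rows and split them into false positives and false negatives. The sketch below uses synthetic data with hypothetical column names (`f0` to `f4`); it is an assumption about how the exploration might proceed, not the method required by the assignment.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some label noise so that errors occur
X_arr, y_arr = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)
X_syn = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y_syn = pd.Series(y_arr, name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=73
)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
preds = model.predict(X_te)

# Attach actual and predicted labels to the test features, keep the mismatches
errors = X_te.assign(actual=y_te, predicted=preds)
errors = errors[errors["actual"] != errors["predicted"]]

false_pos = errors[errors["predicted"] == 1]  # predicted 1, actually 0
false_neg = errors[errors["predicted"] == 0]  # predicted 0, actually 1

print(len(false_pos), len(false_neg))
```

Comparing summary statistics of `false_pos` and `false_neg` against the correctly classified rows can suggest which features the model struggles with.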