In [8]:
!where python

/Users/petr/.pyenv/shims/python
/Users/petr/.pyenv/shims/python
/Users/petr/.pyenv/shims/python
/Users/petr/.pyenv/shims/python
/Users/petr/.pyenv/shims/python
/opt/anaconda3/envs/unit_2/bin/python
/usr/bin/python


Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone is an acceptable metric. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
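
The target-distribution questions in the checklist above can be answered with a few lines of pandas. The following is a minimal sketch on toy data; the helper names `majority_class_frequency` and `accuracy_is_reasonable` are hypothetical and not part of the assignment.

```python
import pandas as pd


def majority_class_frequency(y: pd.Series) -> float:
    """Share of rows that belong to the most common class."""
    # value_counts sorts by frequency, most common first
    return float(y.value_counts(normalize=True).iloc[0])


def accuracy_is_reasonable(y: pd.Series) -> bool:
    """Apply the 50% to 70% rule of thumb from the checklist."""
    return 0.5 <= majority_class_frequency(y) < 0.7


# Toy classification target: 6 of 10 rows are class "a"
y_class = pd.Series(list("aaaaaabbbc"))
print(majority_class_frequency(y_class))
print(accuracy_is_reasonable(y_class))

# Toy regression target with one very large value, so the skew is positive
y_reg = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0])
print(y_reg.skew() > 0)
```

A positive value from `Series.skew()` indicates a right-skewed target, which is the case where a log transform may help.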

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [6]:
# Imports

import pandas as pd
import pandas_profiling
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path

In [7]:
# I chose this House Prices dataset so I can submit my entry in this Kaggle competition:
# https://www.kaggle.com/c/house-prices-advanced-regression-techniques/

data_path = Path('../data/project')
df_train = pd.read_csv(data_path/'train.csv', index_col = 'Id')
df_test = pd.read_csv(data_path/'test.csv', index_col = 'Id')

In [3]:
df_train.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


In [26]:
# Target variable: SalePrice
# Problem type: regression
# Exploring the target:

#sns.distplot(df_train['SalePrice']);
df_train['SalePrice'].max()

# The distribution is unimodal and positively skewed, with a range of $34,900 to $755,000
# I do not see outliers that need to be excluded
# To make the target distribution closer to normal I will apply a log transform to it
#df_train['log_saleprice'] = np.log(df_train['SalePrice'])

755000
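
The effect of the log transform described above can be checked numerically. This is a minimal sketch on synthetic lognormal prices (not the competition data); `population_skewness` is a hypothetical helper written here so the example needs only NumPy.

```python
import numpy as np


def population_skewness(values: np.ndarray) -> float:
    """Third central moment divided by the cubed standard deviation."""
    centered = values - values.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)


# Synthetic right-skewed prices, seeded so the run is repeatable
rng = np.random.default_rng(0)
prices = np.exp(rng.normal(loc=12.0, scale=0.4, size=5_000))

log_prices = np.log(prices)    # transform the target before fitting
restored = np.exp(log_prices)  # invert predictions back to dollars

print(population_skewness(prices), population_skewness(log_prices))
```

If a model is trained on the log target, its predictions have to be passed through `np.exp` before they are reported in dollars.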

In [27]:
# For the evaluation metric I will use root mean squared error, as that is how the Kaggle competition scores submissions
# (the competition computes it on the logarithm of the sale price)
# The competition already provides separate training and test files (a random split, not a time-based one)
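
The competition page describes its score as RMSE between the logarithm of the predicted price and the logarithm of the observed price. The sketch below contrasts that with RMSE in dollars on made-up numbers; the function names are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error on the original scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmse_of_logs(y_true, y_pred) -> float:
    """RMSE computed after taking the natural log of both inputs."""
    return rmse(np.log(y_true), np.log(y_pred))


# Made-up prices where every prediction is 10% too high
y_true = [100_000, 200_000, 400_000]
y_pred = [110_000, 220_000, 440_000]

print(rmse(y_true, y_pred))          # dominated by the most expensive house
print(rmse_of_logs(y_true, y_pred))  # equals log(1.1) for every row
```

On the log scale a 10% miss costs the same for a cheap house as for an expensive one, whereas the dollar-scale RMSE weights expensive houses more heavily.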

In [28]:
# Based on the pandas profiling report, many of the features have a significant number of NaNs.
# The data_description.txt file makes it clear that a NaN indicates the feature is not present at the property.
# I'll be sure not to impute these with a statistic, and will most likely fill them with zeros instead

df_train.profile_report()

Summarize dataset:   0%|          | 0/94 [00:00<?, ?it/s]

KeyboardInterrupt: 
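
One way to act on the plan above is to fill "feature not present" NaNs with an explicit placeholder rather than a statistic. This is a minimal sketch on a made-up three-row frame; the `"None"` label for categorical columns and the `fill_absent_features` helper are assumptions, not something the notebook has settled on.

```python
import numpy as np
import pandas as pd

# Made-up mini-frame that borrows column names from the House Prices data
df = pd.DataFrame({
    "PoolQC": pd.Series(["Ex", np.nan, np.nan], dtype="object"),
    "Fence": pd.Series([np.nan, "MnPrv", np.nan], dtype="object"),
    "MasVnrArea": [196.0, np.nan, 162.0],
})


def fill_absent_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs that mean 'not present': a label for categoricals, 0 for numerics."""
    out = frame.copy()
    cat_cols = out.select_dtypes(exclude="number").columns
    num_cols = out.select_dtypes(include="number").columns
    out[cat_cols] = out[cat_cols].fillna("None")
    out[num_cols] = out[num_cols].fillna(0)
    return out


filled = fill_absent_features(df)
print(filled)
```

This only makes sense for columns where the data description says NaN means absence; a column whose NaNs are genuinely unknown values would need a different treatment.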

In [29]:
# Splitting the data
from sklearn.model_selection import train_test_split

# Naming the target first, since it is needed to build X and y
target = 'SalePrice'

# Creating my X and y
df_y = df_train[target]
df_X = df_train.drop(columns=[target])

# Getting the names of the different kinds of features
numeric_features = df_X.select_dtypes(include='number').columns
cat_features = df_X.select_dtypes(exclude='number').columns

# Splitting categorical features by cardinality (15 or more unique values counts as high)
cat_cardinality = df_X[cat_features].nunique()
low_cat = cat_cardinality[cat_cardinality < 15].index.tolist()
high_cat = cat_cardinality[cat_cardinality >= 15].index.tolist()

# Splitting my data into the training and validation sets
X_train, X_val, y_train, y_val = train_test_split(df_X, df_y, test_size = 0.2)

X_train.shape, y_train.shape, X_val.shape, y_val.shape

((1168, 79), (1168,), (292, 79), (292,))

In [30]:
# Calculating my baseline using a basic Linear Regression with ordinal encoding:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import category_encoders as ce
from sklearn.impute import SimpleImputer

lr_pipeline = Pipeline(steps = [('imputer', SimpleImputer(strategy = 'constant', fill_value = 'NA')),
                                ('ordinal', ce.OrdinalEncoder()),
                                ('linearregression', LinearRegression(n_jobs=-1))])
lr_pipeline.fit(X_train, y_train);
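
The pipeline above sends every column through the same constant imputer. A possible alternative, sketched here with scikit-learn only and toy data, is to preprocess numeric and categorical columns separately with a `ColumnTransformer`. The median strategy, the `"None"` fill value, and the step names are assumptions for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy data: one numeric and one categorical feature, both with gaps
X = pd.DataFrame({
    "area": [800.0, 950.0, np.nan, 1200.0, 1500.0, 1100.0],
    "fence": pd.Series(["yes", np.nan, "no", np.nan, "yes", np.nan], dtype="object"),
})
y = pd.Series([100.0, 120.0, 130.0, 150.0, 190.0, 140.0])

# Categorical columns: label missing values, then map categories to integers
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="None")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])

# Numeric columns: fill gaps with the column median
numeric = SimpleImputer(strategy="median")

preprocess = ColumnTransformer([
    ("num", numeric, make_column_selector(dtype_include="number")),
    ("cat", categorical, make_column_selector(dtype_exclude="number")),
])

model = Pipeline([("preprocess", preprocess), ("linreg", LinearRegression())])
model.fit(X, y)
preds = model.predict(X)
print(preds.shape)
```

Keeping the two branches separate means numeric columns stay numeric for the regression instead of being treated as categories.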

In [31]:
# Getting baseline RMSE

from sklearn.metrics import mean_squared_error as mse

y_pred = lr_pipeline.predict(X_val)
# Taking the square root explicitly, since newer scikit-learn releases no longer accept squared=False
rmse_baseline = np.sqrt(mse(y_val, y_pred))

print(f'The RMSE of my baseline is: {rmse_baseline}')

The RMSE of my baseline is: 53798.6144454535
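
A fitted linear regression is already a model, so it can help to compare it against an even simpler reference: always predicting the mean of the training target. The sketch below shows that calculation on made-up numbers; it does not reproduce the RMSE printed above.

```python
import numpy as np


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up training and validation targets
y_train = np.array([100_000.0, 150_000.0, 200_000.0, 350_000.0])
y_val = np.array([120_000.0, 180_000.0, 300_000.0])

# Predict the training mean for every validation row
mean_pred = np.full_like(y_val, y_train.mean())
rmse_mean_baseline = rmse(y_val, mean_pred)

print(f"RMSE of the mean baseline: {rmse_mean_baseline:.1f}")
```

If a model's validation RMSE is not clearly below this number, the features are adding little beyond the average price.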
