
This assignment focuses on solving a specific regression problem using basic cross-validation with a train/test/validation split. In addition to the methods already explored, it aims to familiarize you with further utilities for data transformation, including the OneHotEncoder and OrdinalEncoder and their use in a make_column_transformer.

Encoding categorical features will be introduced using sklearn, which will allow you to streamline your model-building pipelines. A string-valued feature should be encoded differently depending on whether it is ordinal (its categories have an underlying order) or an unordered categorical feature. The OrdinalEncoder will be used for ordinal features, which do not need to be binarized, and the OneHotEncoder for unordered categorical features (similar to the pd.get_dummies() function in pandas). By the end of the assignment, you will see how to chain multiple feature encoding methods together, including the earlier PolynomialFeatures for numeric features.
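
As a rough illustration of the two encodings, the sketch below applies each one to a tiny made-up DataFrame. The column names `size` and `color` and their values are assumptions chosen for the example; they are not part of the Ames data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: 'size' has a natural order, 'color' does not.
toy = pd.DataFrame({'size': ['S', 'L', 'M', 'S'],
                    'color': ['red', 'blue', 'green', 'red']})

# OrdinalEncoder maps each category to its position in the supplied list.
size_encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])
size_codes = size_encoder.fit_transform(toy[['size']])
print(size_codes.ravel())

# OneHotEncoder creates one indicator column per category.
# The default output is a sparse matrix, so convert it to view it.
color_encoder = OneHotEncoder()
color_dummies = color_encoder.fit_transform(toy[['color']]).toarray()
print(color_encoder.categories_[0])
print(color_dummies)
```

The ordinal codes keep the ordering S < M < L in a single column, while the one-hot result uses three columns that carry no ordering at all.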

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder, OrdinalEncoder
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer, make_column_selector

from sklearn import set_config

set_config(display="diagram") #setting this will display your pipelines as diagrams

# **The Data:** Ames Housing
This dataset is a popular beginning dataset used in teaching regression. The task is to use specific features of houses to predict the price of the house. In addition to this, as discussed in video 8.10 -- this dataset is available for use in an ongoing competition where you can use the test.csv to submit your model's predictions. Accordingly, the two data files are identical with the exception of the test.csv file not containing the target feature.

The data contains 81 columns of different information on the individual houses and their sale price. A full description of the data is attached here. In this assignment, you will begin modeling with a small subset of the features that includes ordinal, categorical, and numeric features. As an optional exercise, you are encouraged to continue engineering additional features and attempt to improve the performance of your model, including submitting the predictions on Kaggle.

In [None]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
train.info()

In [None]:
#note the difference in one column from train to test
[i for i in train.columns if i not in test.columns]

In [None]:
X = train.drop('SalePrice', axis = 1)
y = train['SalePrice']
print(type(X))
print(type(y))

# Problem 1
# Train/Test split

Despite having a test dataset, you want to create a holdout set to assess your model's performance. To do so, use sklearn's train_test_split to split X and y with arguments:

*   test_size = 0.3
*   random_state = 22

Assign your results to X_train, X_test, y_train, y_test.

In [None]:
### GRADED

X_train, X_test, y_train, y_test =  train_test_split(X, y, test_size=0.3, random_state=22)

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(X_train.shape)
print(X_test.shape)
print(type(X_train), type(y_train))#should be DataFrame and Series

# Problem 2
# Baseline Predictions


Before building a regression model, you should set a baseline to compare your later models to. One way to do this is to guess the mean of the SalePrice column. For the variables baseline_train and baseline_test, create arrays of the same shape as y_train and y_test respectively. Every entry of baseline_train should equal y_train.mean(), and every entry of baseline_test should equal y_test.mean().

Use the mean_squared_error function to calculate the error between baseline_train and y_train. Assign the result to mse_baseline_train.

Use the mean_squared_error function to calculate the error between baseline_test and y_test. Assign the result to mse_baseline_test.
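
The idea can be sketched on a few made-up numbers. The array `y_toy` below is hypothetical and only stands in for a target column. The sketch shows that the MSE of a mean-only prediction equals the variance of the target, and that building the baseline with `np.full_like` on an integer target silently truncates the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical integer target values, used only for illustration.
y_toy = np.array([100, 150, 200, 251])

# A float array that repeats the mean once per observation.
baseline_toy = np.full(y_toy.shape, y_toy.mean())

# Predicting the mean everywhere gives an MSE equal to the variance.
mse_toy = mean_squared_error(y_toy, baseline_toy)
print(baseline_toy)
print(mse_toy, y_toy.var())

# Pitfall: full_like copies the integer dtype of y_toy, so 175.25 becomes 175.
truncated = np.full_like(y_toy, y_toy.mean())
print(truncated)
```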

In [None]:
### GRADED

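# np.full builds float arrays; np.full_like would copy the dtype of the target
# and truncate the mean if SalePrice is stored as integers
baseline_train = np.full(y_train.shape, y_train.mean())
baseline_test = np.full(y_test.shape, y_test.mean())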

mse_baseline_train = mean_squared_error(y_train, baseline_train)
mse_baseline_test  = mean_squared_error(y_test, baseline_test)

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(baseline_train.shape, baseline_test.shape)
print(f'Baseline for training data: {mse_baseline_train}')
print(f'Baseline for testing data: {mse_baseline_test}')

# Examining the Correlations

What feature has the highest positive correlation with SalePrice? Assign your answer as a string matching the column name exactly to highest_corr below.

In [None]:
### GRADED

#highest_corr =

def find_highest_correlated_feature(df, target_column):

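    # Calculate the correlation matrix over the numeric columns only
    # (string columns cannot be correlated and raise an error in recent pandas)
    correlation_matrix = df.select_dtypes(include='number').corr()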

    # Get correlations with the target column
    correlations = correlation_matrix[target_column].drop(target_column)

    # Find the feature with the highest positive correlation
    positive_correlations = correlations[correlations > 0]
    if not positive_correlations.empty:
        highest_correlated_feature = positive_correlations.idxmax()
        return highest_correlated_feature
    else:
        return None


# Usage: apply the function to the train DataFrame with 'SalePrice' as the target column

# Call the function
highest_correlated = find_highest_correlated_feature(train, "SalePrice")

# Print the result
if highest_correlated:
    print(f"The feature with the highest positive correlation to SalePrice is: {highest_correlated}")
else:
    print("No positive correlations found with SalePrice.")

highest_corr = highest_correlated

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(highest_corr)

# Simple Model

Complete the code below according to the instructions below:

1. Define a variable X1 and assign to it the values in the OverallQual column of X_train.
2. Instantiate a LinearRegression model and use the fit function to train it using X1 and y_train. Assign your result to lr.
3. Use the mean_squared_error function to calculate the error between y_train and lr.predict(X1). Assign the result to model_1_train_mse.
4. Use the mean_squared_error function to calculate the error between y_test and lr.predict(X_test[['OverallQual']]). Assign the result to model_1_test_mse.

In [None]:
### GRADED

X1 = X_train[['OverallQual']]

lr = LinearRegression().fit(X1, y_train)

model_1_train_mse = mean_squared_error(y_train, lr.predict(X1))
model_1_test_mse =  mean_squared_error(y_test, lr.predict(X_test[['OverallQual']]))

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(f'Train MSE: {model_1_train_mse: .2f}')
print(f'Test MSE: {model_1_test_mse: .2f}')

# Problem 5
# Using OneHotEncoder

Similar to the pd.get_dummies() function encountered earlier, scikit-learn has a utility for encoding categorical features in the same way. Below, the OneHotEncoder is demonstrated on the CentralAir column. You are to use these results to build a model where the only feature is the CentralAir column. Note that two arguments are used in the OneHotEncoder:

1. sparse = False: returns a dense array that we can inspect directly, whereas sparse = True returns a sparse matrix, a memory-saving representation. (scikit-learn 1.2 renamed this argument to sparse_output.)
2. drop = 'if_binary': returns a single column for any binary category. This avoids redundant features in our regression model.

In the code cell below, instantiate a LinearRegression model and use the fit function to train it using model_2_train and y_train. Assign your result to model_2.

In [None]:
#extract the features
central_air_train = X_train[['CentralAir']]
central_air_test  = X_test[['CentralAir']]
#a categorical feature
central_air_train.head()

In [None]:
#Instantiate a OHE object
#sparse_output = False returns a dense array so we can view it
#(scikit-learn versions before 1.2 call this argument sparse)
ohe = OneHotEncoder(sparse_output = False, drop='if_binary')
print(ohe.fit_transform(central_air_train)[:7])
#ohe
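
To make the effect of drop='if_binary' concrete, here is a small sketch on a made-up column. The DataFrame `toy_air` is hypothetical and only mimics the Y/N values of CentralAir. The sketch compares the default encoding, the if_binary encoding, and a roughly equivalent pd.get_dummies call.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical binary feature standing in for CentralAir.
toy_air = pd.DataFrame({'CentralAir': ['Y', 'N', 'Y', 'Y']})

# Default settings: one column per category, returned as a sparse matrix.
full = OneHotEncoder().fit_transform(toy_air).toarray()

# drop='if_binary': a binary feature collapses to a single 0/1 column.
single = OneHotEncoder(drop='if_binary').fit_transform(toy_air).toarray()

# pandas counterpart of the single-column encoding.
dummies = pd.get_dummies(toy_air, drop_first=True)

print(full.shape, single.shape)
print(single.ravel())
print(dummies.columns.tolist())
```

With two categories, the second indicator column is fully determined by the first, which is why dropping it removes a redundant feature from a linear model.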

In [None]:
model_2_train = ohe.fit_transform(central_air_train)
model_2_test = ohe.transform(central_air_test)

In [None]:
### GRADED

#In the code cell below, instantiate a LinearRegression model
#and use the fit function to train it using model_2_train and y_train. Assign your result to model_2.
model_2 = LinearRegression().fit(model_2_train, y_train )

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(model_2.coef_)

To build a model using both the 'OverallQual' column and the 'CentralAir' column, you could use the OneHotEncoder to transform CentralAir, and then concatenate the results back into a DataFrame or numpy array. To streamline this process, the make_column_transformer can be used to separate specific columns for certain transformations. Below, a make_column_transformer has been created for you to do just this.

The arguments are tuples of the form (transformer, columns) that specify a transformation to perform on the given columns. Further, the remainder = 'passthrough' argument says to pass the other columns through unchanged. The result is a numpy array with the binarized CentralAir column first, followed by the passed-through OverallQual feature.

For an example using the make_column_transformer see here.

In [None]:
col_transformer = make_column_transformer((OneHotEncoder(drop = 'if_binary'), ['CentralAir']),
                                          remainder='passthrough')
#ohe = OneHotEncoder(sparse = False, drop='if_binary')
col_transformer

In [None]:
col_transformer.fit_transform(X_train[['OverallQual', 'CentralAir']][:5])
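
The column ordering of the transformed output can be checked on a few made-up rows. The DataFrame `toy` and the name `toy_transformer` below are assumptions for illustration; only the column names are borrowed from the assignment.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows with the same two column names as the assignment.
toy = pd.DataFrame({'OverallQual': [7, 5, 8],
                    'CentralAir': ['Y', 'N', 'Y']})

toy_transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), ['CentralAir']),
    remainder='passthrough')

result = toy_transformer.fit_transform(toy)
# The result may be sparse or dense depending on its density; make it dense.
result = result.toarray() if hasattr(result, 'toarray') else result
print(result)
```

Even though OverallQual comes first in the input, the encoded CentralAir column appears first in the output because passed-through columns are appended after the listed transformers.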

# Problem 6
# Using make_column_transformer
10 Points

Complete the code below according to the instructions below:

1. Use Pipeline to create a pipeline object. Inside the pipeline object, define a tuple where the first element is a string identifier col_transformer and the second element is an instance of col_transformer. Inside the pipeline define another tuple where the first element is a string identifier linreg, and the second element is an instance of LinearRegression. Assign the pipeline object to the variable pipe_1.

2. Use the fit function on pipe_1 to train your model on X_train[['OverallQual','CentralAir']] and y_train.

In [None]:
### GRADED

pipe_1 = Pipeline([
    ('col_transformer', col_transformer),
    ('linreg',LinearRegression())
])

pipe_1.fit(X_train[['OverallQual','CentralAir']], y_train)
# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(pipe_1.named_steps)#col_transformer and linreg should be keys
pipe_1

Not all columns warrant binarization as done on the CentralAir column. For example, consider the HeatingQC feature -- representing the quality of the heating in the house. From the data description, the unique values are described as:

HeatingQC: Heating quality and condition

       Ex    Excellent
       Gd    Good
       TA    Average/Typical
       Fa    Fair
       Po    Poor

These are ordered values, and rather than binarizing them, a numeric value representing the scale can be used. For example, using a scale of 0 - 4 you may associate the categories with an order in a list from least to greatest as:

['Po',          'Fa',        'TA',         'Gd',       'Ex']

Creating an OrdinalEncoder with these categories will transform the HeatingQC feature mapping each category as


*   Po:    0
*   Fa:    1
*   TA:    2
*   Gd:    3
*   Ex:    4

This is demonstrated below, and in a similar manner, the use of the make_column_transformer is shown using the three columns ['OverallQual', 'CentralAir', 'HeatingQC'], applying the appropriate transformations to each column and passing the remaining numeric feature through.

In [None]:
oe = OrdinalEncoder(categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']])
#encoded values for the first five rows
print(oe.fit_transform(X_train[['HeatingQC']])[:5])
#original labels for the same rows
X_train['HeatingQC'].head()

In [None]:
ordinal_ohe_transformer = make_column_transformer((OneHotEncoder(drop = 'if_binary'), ['CentralAir']),
                                          (OrdinalEncoder(categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]), ['HeatingQC']),
                                          remainder='passthrough')

In [None]:
ordinal_ohe_transformer.fit_transform(X_train[['OverallQual', 'CentralAir', 'HeatingQC']])[:5]

In [None]:
X_train[['OverallQual', 'CentralAir', 'HeatingQC']].head()

# Using OrdinalEncoder

10 Points

Complete the code below according to the instructions below:

1. Use Pipeline to create a pipeline object. Inside the pipeline object define a tuple where the first element is a string identifier transformer and the second element is an instance of ordinal_ohe_transformer. Inside the pipeline define another tuple where the first element is a string identifier linreg, and the second element is an instance of LinearRegression. Assign the pipeline object to the variable pipe_2.
2. Use the fit function on pipe_2 to train your model on X_train[['OverallQual', 'CentralAir', 'HeatingQC']] and y_train.
3. Use the predict function on pipe_2 to make your predictions of X_train[['OverallQual', 'CentralAir', 'HeatingQC']]. Assign the result to pred_train.
4. Use the predict function on pipe_2 to make your predictions of X_test[['OverallQual', 'CentralAir', 'HeatingQC']]. Assign the result to pred_test.
5. Use the mean_squared_error function to calculate the MSE between y_train and pred_train. Assign the result to pipe_2_train_mse.
6. Use the mean_squared_error function to calculate the MSE between y_test and pred_test. Assign the result to pipe_2_test_mse.

In [None]:
### GRADED

pipe_2 = Pipeline([
            ('transformer',ordinal_ohe_transformer),
            ('linreg',LinearRegression())
])
pipe_2.fit(X_train[['OverallQual', 'CentralAir', 'HeatingQC']], y_train)

pred_train = pipe_2.predict(X_train[['OverallQual', 'CentralAir', 'HeatingQC']])
pred_test  = pipe_2.predict(X_test[['OverallQual', 'CentralAir', 'HeatingQC']])

pipe_2_train_mse = mean_squared_error(y_train, pred_train)
pipe_2_test_mse  =  mean_squared_error(y_test, pred_test)

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(pipe_2.named_steps)
print(f'Train MSE: {pipe_2_train_mse: .2f}')
print(f'Test MSE: {pipe_2_test_mse: .2f}')
pipe_2

# Including PolynomialFeatures

Finally, the earlier transformation of continuous columns using the PolynomialFeatures with degree = 2 can be implemented alongside the OneHotEncoder and OrdinalEncoder.

The make_column_transformer is again used, and you are to create a Pipeline with steps transformer and linreg.

The Pipeline is fit on the training data using features ['OverallQual', 'CentralAir', 'HeatingQC'].

1. Use the predict function on pipe_3 to predict the values of X_train[['OverallQual', 'CentralAir', 'HeatingQC']]. Assign your result to quad_train_preds.
2. Use the predict function on pipe_3 to predict the values of X_test[['OverallQual', 'CentralAir', 'HeatingQC']]. Assign your result to quad_test_preds.
3. Use the mean_squared_error function to calculate the MSE between y_train and quad_train_preds. Assign the result to quad_train_mse.
4. Use the mean_squared_error function to calculate the MSE between y_test and quad_test_preds. Assign the result to quad_test_mse.

In [None]:
poly_ordinal_ohe = make_column_transformer((OrdinalEncoder(categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]), ['HeatingQC']),
                                           (OneHotEncoder(drop = 'if_binary'), ['CentralAir']),
                                           (PolynomialFeatures(include_bias = False, degree = 2), ['OverallQual']))
poly_ordinal_ohe.fit_transform(X_train[['OverallQual','CentralAir', 'HeatingQC']])[:5]
pipe_3 = Pipeline([('transformer', poly_ordinal_ohe),
                  ('linreg', LinearRegression())])

In [None]:
pipe_3.fit(X_train[['OverallQual', 'CentralAir', 'HeatingQC']], y_train)

In [None]:
### GRADED

quad_train_preds = pipe_3.predict(X_train[['OverallQual', 'CentralAir', 'HeatingQC']])
quad_test_preds  = pipe_3.predict(X_test[['OverallQual', 'CentralAir', 'HeatingQC']])

quad_train_mse = mean_squared_error(y_train, quad_train_preds)
quad_test_mse =  mean_squared_error(y_test, quad_test_preds)

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(f'Train MSE: {quad_train_mse: .2f}')
print(f'Test MSE: {quad_test_mse: .2f}')

# Including More Features

Use the following features to build a new make_column_transformer and fit 5 different models of degree 1 - 5 using the degree argument in your PolynomialFeatures transformer. Keep track of the subsequent train mean squared error and test set mean squared error with the lists train_mses and test_mses respectively.

The poly_ordinal_ohe object contains the different transformers needed. Note that rather than passing a list of columns to the PolynomialFeatures transformer, the make_column_selector function is used to select any numeric feature. For more information on the make_column_selector see here.
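
As a brief sketch of what make_column_selector does, the example below calls a selector directly on a small made-up DataFrame. The frame `toy` and its values are hypothetical; the column names are borrowed from the feature list only for readability.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical frame mixing numeric and string columns.
toy = pd.DataFrame({'GrLivArea': [1500, 900],
                    'FullBath': [2, 1],
                    'CentralAir': ['Y', 'N']})

# The selector is a callable: given a DataFrame, it returns the matching column names.
numeric_selector = make_column_selector(dtype_include=np.number)
numeric_columns = numeric_selector(toy)
print(numeric_columns)
```

Inside a column transformer the selector is evaluated when fit is called, so the same transformer definition adapts to whichever numeric columns are present in the data it receives.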

In [None]:
features = ['CentralAir', 'HeatingQC', 'OverallQual', 'GrLivArea', 'KitchenQual', 'FullBath']
X_train[features].head()

In [None]:
#one category list is needed per ordinal column; HeatingQC and KitchenQual share the same scale
quality_order = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
poly_ordinal_ohe = make_column_transformer((PolynomialFeatures(), make_column_selector(dtype_include=np.number)),
                                           (OrdinalEncoder(categories = [quality_order, quality_order]), ['HeatingQC', 'KitchenQual']),
                                           (OneHotEncoder(drop = 'if_binary'), ['CentralAir']))

In [None]:
### GRADED

train_mses = []
test_mses = []
#for degree in 1 - 5
for i in range(1, 6):
    #create pipeline with PolynomialFeatures degree i

    #ADD APPROPRIATE ARGUMENTS IN POLYNOMIALFEATURES
    #note: KitchenQual is not assigned to a transformer here, so the default remainder='drop' discards it
    poly_ordinal_ohe = make_column_transformer((PolynomialFeatures(degree=i), make_column_selector(dtype_include=np.number)),
                                               (OrdinalEncoder(categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]), ['HeatingQC']),
                                               (OneHotEncoder(drop = 'if_binary'), ['CentralAir']))

    pipe_3 = Pipeline([('transformer', poly_ordinal_ohe),
                  ('linreg', LinearRegression())
                      ])

    #fit on train
    pipe_3.fit(X_train[features], y_train)

    #predict on train and test
    train_preds = pipe_3.predict(X_train[features])
    test_preds = pipe_3.predict(X_test[features])

    #compute mean squared errors and append to train_mses and test_mses respectively
    train_mses.append(mean_squared_error(y_train, train_preds))
    test_mses.append(mean_squared_error(y_test, test_preds))

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(train_mses)
print(test_mses)
#pipe

# Optimal Model Complexity

Based on your model's mean squared error on the testing data in Problem 9 above, what was the optimal complexity? Assign your answer as an integer to best_complexity below. Compute the MEAN SQUARED ERROR of this model and assign it to best_mse as a float.
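
One way to read the optimal degree off a list of test errors is sketched below. The values in `example_test_mses` are made up for illustration and are not results from the Ames data; the variable names are likewise hypothetical.

```python
import numpy as np

# Made-up test errors for degrees 1 to 5; these are NOT results from the Ames data.
example_test_mses = [4.0, 3.1, 2.5, 6.0, 9.7]

# np.argmin returns a zero-based position, while the degrees start at 1.
example_best_degree = int(np.argmin(example_test_mses)) + 1
example_best_mse = float(min(example_test_mses))
print(example_best_degree, example_best_mse)
```

The off-by-one adjustment matters because the first entry of the list corresponds to degree 1, not degree 0.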

In [None]:
### GRADED

best_complexity = 2

poly_ordinal_ohe = make_column_transformer((PolynomialFeatures(degree=2), make_column_selector(dtype_include=np.number)),
                                           (OrdinalEncoder(categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]), ['HeatingQC']),
                                               (OneHotEncoder(drop = 'if_binary'), ['CentralAir']))

pipe_3 = Pipeline([('transformer', poly_ordinal_ohe),
                  ('linreg', LinearRegression())
                      ])

#fit on train
pipe_3.fit(X_train[features], y_train)

#compute the mean squared error on the test data
best_mse = mean_squared_error(y_test, pipe_3.predict(X_test[features]))

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(f'The best degree polynomial model is:  {best_complexity}')
print(f'The smallest mean squared error on the test data is : {best_mse: .2f}')

# Further Exploration
This activity was meant to introduce you to a more streamlined modeling process using the sklearn library. While your models should be performing better than the baseline, it is likely that with a bit more feature engineering and cross-validation you would be able to further improve the performance. You are encouraged to explore further feature engineering and encoding, particularly with handling missing values.

Additionally, other transformations of the data may be appropriate. For example, if you look at the distribution of errors in your model, you will note that they are slightly skewed. A common assumption of a Linear Regression model is that these errors are roughly normally distributed. Building the model on the logarithm of the target column, and evaluating it against the logarithm of the testing target, typically brings the errors closer to this assumption. Note that the actual Kaggle exercise is judged on the ROOT MEAN SQUARED ERROR of the logarithm of the target feature.

If interested, scikit-learn also provides a TransformedTargetRegressor class that accomplishes this transformation by wrapping a regressor or an entire pipeline. See here for more information on this transformer.
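
A minimal sketch of this idea follows, using synthetic data rather than the housing data. The names `X_demo`, `y_demo`, and `log_model` and the data-generating formula are assumptions made for the example. The wrapped regressor here is a plain LinearRegression, though a full Pipeline could be passed in the same position.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic, strictly positive target with multiplicative noise (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 10, size=(200, 1))
y_demo = np.exp(0.3 * X_demo[:, 0] + rng.normal(0, 0.1, size=200))

# Fit on log(y); predictions are mapped back to the original scale with exp.
log_model = TransformedTargetRegressor(regressor=LinearRegression(),
                                       func=np.log, inverse_func=np.exp)
log_model.fit(X_demo, y_demo)
preds_demo = log_model.predict(X_demo)

# Root mean squared error measured on the log scale.
rmse_log = np.sqrt(mean_squared_error(np.log(y_demo), np.log(preds_demo)))
print(f'RMSE of log target: {rmse_log:.3f}')
```

Because the inverse function is applied automatically at prediction time, the predictions come back in the original units and can be compared directly with the untransformed target.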