In [8]:
#Imports
# Standard Imports
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
%matplotlib inline
# Plot Settings
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
plt.rcParams['agg.path.chunksize'] = 10000

In [10]:
#Data I will use
#Some random data from a GitHub profile
file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
df = pd.read_csv(file)
X = df['Por'].values.reshape(-1,1)
y = df['Prod'].values
X_train, X_test, y_train, y_test = train_test_split(X, y)

### Create a SVM With Various HyperRegressors

**1. Try a Support Vector Machine Regressor `sklearn.svm.SVR` with various hyper-parameters**
- `kernel="linear"` with various values for the `C` hyperparameter
- `kernel="rbf"` with various values for the `C` & `gamma` hyperparameters
- How does the best SVR predictor perform?

For vector machines, one should remember that the ```C``` parameter defines the regularization.  
- Smaller ```C``` = misclassifications okay, we care more about general trend.  Choose one like 0.001. 
- Bigger ```C``` = missclassifications BAD. Think of predicting tumors.  We can't have false negatives! 

For more information on vector machines, here are some wonderful links.
- [StatQuest Video](https://www.youtube.com/watch?v=efR1C6CvhmE&t=40s)
- [SVM Article](https://queirozf.com/entries/choosing-c-hyperparameter-for-svm-classifiers-examples-with-scikit-learn#the-c-parameter)

In [5]:
#Imports
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV

To test multiple parameters, use a ```param_grid```

```param_grid```'s are a list of dictionaries, with: 
* Key is a string representing the parameter keyword
* Value is a list of the parameters you want to test
* Multiple dictionaries if certain parameters are only used in conjunction with other parameters
    * Confusing?  Well, take for example the ```gamma``` parameter of the ```SVR```.  It's only used if the ```rbf``` kernel is used.  Therefore, we'd need a separate dictionary because it won't work with the ```linear``` kernel.

The parameters for the ```SVR``` can be found in the Sklearn docs [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR)

In [6]:
#Create the param_grid
param_grid = [
    {
        'kernel': ['linear'],
        'C': [0.001, 1, 100, 1000]
    },
    {
        'kernel': ['rbf'],
        'gamma': ['scale', 'auto'],
        'C': [0.001, 1, 100, 1000]
    }
]

Now, let's create our SVR, create the randomized search, fit the randomized search, and estimate!

In [13]:
#Create/Construct SVR
#We don't pass any parameters - we're gonna use a RandomizedSearchCV to find them for us.
svm_regression = SVR()

#Create the randomized search
randomized_search = RandomizedSearchCV(estimator = svm_regression, param_distributions = param_grid, scoring = "r2")

#Fit the randomized search
randomized_search.fit(X = X_train, y = y_train)

#Find the best estimator
randomized_search.best_estimator_

SVR(C=100, kernel='linear')

### Custom Transformers

Steps to Custom Transformer Greatness: 

1. Inherit from ```TransformerMixin``` and ```BaseEstimator```
2. Write your ```fit``` and ```transform``` methods

This is an example I liked from [this](https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65) article I found on Medium, which is great for learning about custom transformers.  Simply pass features you want to remain in your dataset, and the rest will be filtered out.

In [15]:
#First, let's import some data from a datset on Kaggle
data = pd.read_csv("../data/kc_house_data.csv")
data.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


Here, we create a ```FeatureSelector``` class, which simply takes an array of Strings representing the features we want in the constructor and returns the ```DataFrame``` with just those features when it's ```transform``` method is called.  This is useful for **separating categorical and numerical** features, and running them down different pipelines.

In [17]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline 

#Custom Transformer that extracts columns passed as argument to its constructor 
class FeatureSelector( BaseEstimator, TransformerMixin ):
    #Class Constructor 
    def __init__( self, feature_names ):
        self._feature_names = feature_names 
    
    #Return self nothing else to do here    
    def fit( self, X, y = None ):
        return self 
    
    #Method that describes what we need this transformer to do
    def transform( self, X, y = None ):
        return X[ self._feature_names ] 

Now, we need to create unique transformers for both the categorical and numerical features.

One important thing I'd like to note here.  Categorical features are **often turned into binary**.  This makes it much easier to recognize trends. 

In [21]:
# Custom transformer that breaks dates column into year, month and day into separate columns and
# converts certain features to binary
class CategoricalTransformer(BaseEstimator, TransformerMixin):
    # Class constructor method that takes in a list of values as its argument
    def __init__(self, use_dates=['year', 'month', 'day']):
        self._use_dates = use_dates

    # Return self nothing else to do here
    def fit(self, X, y=None):
        return self

    # Helper function to extract year from column 'dates'
    def get_year(self, obj):
        return str(obj)[:4]

    # Helper function to extract month from column 'dates'
    def get_month(self, obj):
        return str(obj)[4:6]

    # Helper function to extract day from column 'dates'
    def get_day(self, obj):
        return str(obj)[6:8]

    # Helper function that converts values to Binary depending on input
    def create_binary(self, obj):
        if obj == 0:
            return 'No'
        else:
            return 'Yes'

    # Transformer method we wrote for this transformer
    def transform(self, X, y=None):
        # Depending on constructor argument break dates column into specified units
        # using the helper functions written above
        for spec in self._use_dates:

            exec("X.loc[:,'{}'] = X['date'].apply(self.get_{})".format(
                spec, spec))
        # Drop unusable column
        X = X.drop('date', axis=1)

        # Convert these columns to binary for one-hot-encoding later
        X.loc[:, 'waterfront'] = X['waterfront'].apply(self.create_binary)

        X.loc[:, 'view'] = X['view'].apply(self.create_binary)

        X.loc[:, 'yr_renovated'] = X['yr_renovated'].apply(self.create_binary)
        # returns numpy array
        return X.values
#Custom transformer we wrote to engineer features ( bathrooms per bedroom and/or how old the house is in 2019  ) 
#passed as boolen arguements to its constructor
class NumericalTransformer(BaseEstimator, TransformerMixin):
    #Class Constructor
    def __init__( self, bath_per_bed = True, years_old = True ):
        self._bath_per_bed = bath_per_bed
        self._years_old = years_old
        
    #Return self, nothing else to do here
    def fit( self, X, y = None ):
        return self 
    
    #Custom transform method we wrote that creates aformentioned features and drops redundant ones 
    def transform(self, X, y = None):
        #Check if needed 
        if self._bath_per_bed:
            #create new column
            X.loc[:,'bath_per_bed'] = X['bathrooms'] / X['bedrooms']
            #drop redundant column
            X.drop('bathrooms', axis = 1 )
        #Check if needed     
        if self._years_old:
            #create new column
            X.loc[:,'years_old'] =  2019 - X['yr_built']
            #drop redundant column 
            X.drop('yr_built', axis = 1)
            
        #Converting any infinity values in the dataset to Nan
        X = X.replace( [ np.inf, -np.inf ], np.nan )
        #returns a numpy array
        return X.values

The next step is using a ```FeatureUnion``` to push our categorical and numerical features down the desired pipelines.  Remember it like this:  You have a drain.  You pour lava and water down the same drain at different times.  You need the pipe to separate these properly.  So, you need some sort of union which allows only the higher density fluid to go down.  This sort of joint in the pipe is your ```FeatureUnion```

The best part about the ```FeatureUnion``` is that in the end, everything comes back together in a nice, prettied ```DataFrame```

In [23]:
#Categrical features to pass down the categorical pipeline 
categorical_features = ['date', 'waterfront', 'view', 'yr_renovated']

#Numerical features to pass down the numerical pipeline 
numerical_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'condition', 'grade', 'sqft_basement', 'yr_built']

#Defining the steps in the categorical pipeline 
categorical_pipeline = Pipeline( steps = [ ( 'cat_selector', FeatureSelector(categorical_features) ),
                                  
                                  ( 'cat_transformer', CategoricalTransformer() ), 
                                  
                                  ( 'one_hot_encoder', OneHotEncoder( sparse = False ) ) ] )
    
#Defining the steps in the numerical pipeline     
numerical_pipeline = Pipeline( steps = [ ( 'num_selector', FeatureSelector(numerical_features) ),
                                  
                                  ( 'num_transformer', NumericalTransformer() ),
                                  
                                  ('imputer', SimpleImputer(strategy = 'median') ),
                                  
                                  ( 'std_scaler', StandardScaler() ) ] )

#Combining numerical and categorical piepline into one full big pipeline horizontally 
#using FeatureUnion
full_pipeline = FeatureUnion( transformer_list = 
                             [ 
( 'categorical_pipeline', categorical_pipeline ), 
( 'numerical_pipeline', numerical_pipeline ) 

                             ] )

One thing to keep in mind is that a `FeatureUnion` can **only take in transformers.** That's why no estimators were included in the above `Pipeline`.  To remedy this issue, create **another pipeline** with the estimator(s) as the final step, and the `FeatureUnion` part of the situation as the first step.

In [24]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

#Leave it as a dataframe becuase our pipeline is called on a 
#pandas dataframe to extract the appropriate columns, remember?
X = data.drop('price', axis = 1)
#You can covert the target variable to numpy 
y = data['price'].values 

X_train, X_test, y_train, y_test = train_test_split( X, y , test_size = 0.2 , random_state = 42 )

#The full pipeline as a step in another pipeline with an estimator as the final step
full_pipeline_m = Pipeline( steps = [ ( 'full_pipeline', full_pipeline),
                                  
                                  ( 'model', LinearRegression() ) ] )

#Can call fit on it just like any other pipeline
full_pipeline_m.fit( X_train, y_train )

#Can predict with it like any other pipeline
y_pred = full_pipeline_m.predict( X_test )

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(ilocs[0], value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_

![](https://miro.medium.com/max/496/1*b0rUb-3fH6bpvwVpHrcFUQ.png)