## Challenge

In this challenge, we will work with the same [dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing) as in previous weeks, **Prediction of Sales**. The main goal is to create a **pipeline** that covers all data preprocessing and modeling steps.


**TASK 1**: Build a pipeline that ends with a regression model predicting `Item_Outlet_Sales` from the dataset. The pipeline should have the following steps:

- split features into numerical and categorical (text)
- null value replacement
    - mean for numerical variables
    - the most frequent value for categorical variables
- create dummy variables from the categorical features
- use PCA to reduce the dummy variables to 3 principal components. PCA runs directly after OneHotEncoder, which outputs a sparse matrix, so we need the **ToDenseTransformer** from the [article about custom pipelines](https://queirozf.com/entries/scikit-learn-pipelines-custom-pipelines-and-pandas-integration).
- select the 3 best original numeric features using `SelectKBest`
- fit Ridge regression (the default alpha is fine for now)

**TASK 2**: Tune the parameters of the models as well as the preprocessing steps and find the best solution
- Try these models: Random Forest, Gradient Boosting Regressor or Ridge Regression. Use the approach from the [earlier article](https://iaml.it/blog/optimizing-sklearn-pipelines), in the section **PIPELINE TUNING (ADVANCED VERSION)**, where we tried different scalers.

In [39]:
import pandas as pd
df = pd.read_csv("regression_exercise.csv")
df

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
1,DRC01,5.920,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.500,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700
3,FDX07,19.200,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,,Tier 3,Grocery Store,732.3800
4,NCD19,8.930,Low Fat,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
...,...,...,...,...,...,...,...,...,...,...,...,...
8518,FDF22,6.865,Low Fat,0.056783,Snack Foods,214.5218,OUT013,1987,High,Tier 3,Supermarket Type1,2778.3834
8519,FDS36,8.380,Regular,0.046982,Baking Goods,108.1570,OUT045,2002,,Tier 2,Supermarket Type1,549.2850
8520,NCJ29,10.600,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,2004,Small,Tier 2,Supermarket Type1,1193.1136
8521,FDN46,7.210,Regular,0.145221,Snack Foods,103.1332,OUT018,2009,Medium,Tier 3,Supermarket Type2,1845.5976


In [2]:
# creating target variable
y = df["Item_Outlet_Sales"]
df = df.drop(["Item_Outlet_Sales","Item_Identifier"],axis = 1)

Split into train and test sets at the beginning. We should always do this before fitting the pipeline, so that no preprocessing step is fitted on test data.

In [3]:
df_train = df.sample(frac=0.8).sort_index()
y_train = y[y.index.isin(df_train.index.tolist())]

In [4]:
df_test = df[~df.index.isin(df_train.index.tolist())].sort_index()
y_test = y[y.index.isin(df_test.index.tolist())]
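
As a hedged alternative to the manual split above, scikit-learn's `train_test_split` keeps features and target aligned in one call. The sketch below uses a made-up toy frame, and `random_state=42` is an arbitrary choice (not from this notebook) that only makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (hypothetical values)
toy_df = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9, 214.5, 108.2, 85.1, 103.1, 120.0],
    "Outlet_Size": ["Medium", "Medium", "Medium", None, "High",
                    "High", None, "Small", "Medium", "Small"],
})
toy_y = pd.Series(range(10), name="Item_Outlet_Sales")

# 80/20 split; rows of X and y stay aligned by index
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df, toy_y, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```

The same call would accept `df` and `y` from the cells above in place of the toy objects.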

# Task I

### Split Features to numerical and categorical

In [5]:
cat_feats = df.dtypes[df.dtypes == 'object'].index.tolist()
num_feats = df.dtypes[~df.dtypes.index.isin(cat_feats)].index.tolist()
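
The same split by dtype can be sketched with `DataFrame.select_dtypes`. This is only an illustration on a made-up toy frame; `num_cols` and `cat_cols` are hypothetical names, and "everything that is not numeric" is assumed to be categorical.

```python
import pandas as pd

# Toy frame with two numeric columns and one text column (hypothetical values)
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, None],
    "Item_Type": ["Dairy", "Soft Drinks", "Meat"],
    "Outlet_Establishment_Year": [1999, 2009, 1999],
})

# Numeric columns, and everything else treated as categorical
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(num_cols)
print(cat_cols)
```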

In [6]:
from sklearn.preprocessing import FunctionTransformer

# Use our own functions as steps in the Pipeline
def numFeat(data):
    return data[num_feats]

def catFeat(data):
    return data[cat_feats]

In [7]:
# we will start two separate pipelines, one for each type of feature
keep_num = FunctionTransformer(numFeat)
keep_cat = FunctionTransformer(catFeat)

### Null value replacement

In [28]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# For numerical variables
impute_step_num = SimpleImputer(strategy='mean')

In [26]:
# For categorical variables
imputer_step_cat = SimpleImputer(strategy='most_frequent')
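
To see what the two strategies do, here is a minimal sketch on made-up toy columns (the values and the names `num`, `cat`, `num_filled`, `cat_filled` are assumptions for illustration only).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy columns with one missing value each (hypothetical data)
num = pd.DataFrame({"Item_Weight": [9.0, np.nan, 15.0]})
cat = pd.DataFrame({"Outlet_Size": ["Medium", np.nan, "Medium", "High"]})

# Mean of the observed values (9.0 and 15.0) fills the numeric gap
num_filled = SimpleImputer(strategy="mean").fit_transform(num)

# The most frequent category ("Medium") fills the categorical gap
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(cat)

print(num_filled.ravel())
print(cat_filled.ravel())
```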

### Creating dummy variables

In [22]:
# use OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

categorical_preprocessing = Pipeline([('ohe', OneHotEncoder())])

In [24]:
ohe = OneHotEncoder()

In [23]:
# apply the one-hot encoding pipeline to the categorical columns only
preprocess_cat = ColumnTransformer([
    ('categorical_preprocessing', categorical_preprocessing, cat_feats)])

### PCA to reduce number of dummy variables to 3 principal components

In [12]:
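# don't forget ToDenseTransformer after the one-hot encoder
# Create the ToDenseTransformer class
from sklearn.base import BaseEstimator, TransformerMixin

class ToDenseTransformer(BaseEstimator, TransformerMixin):

    # convert the sparse one-hot matrix to a dense NumPy array;
    # toarray() returns an ndarray, whereas todense() returns np.matrix,
    # which recent scikit-learn versions reject
    def transform(self, X, y=None, **fit_params):
        return X.toarray()

    # nothing to learn, just return self
    def fit(self, X, y=None, **fit_params):
        return self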

In [27]:
from sklearn.decomposition import PCA

pipeline_cat = Pipeline([
    ('categorical_features', keep_cat),
    ('impute', imputer_step_cat),
    ('ohe', ohe),
    ('to_dense',ToDenseTransformer()),
    ('pca',PCA(n_components=3))])
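
The encode, densify, reduce chain can be tried on its own with a small sketch. Everything here is an assumption for illustration: the toy categories, the helper `to_dense` (a function-based stand-in for the dense-conversion step) and `handle_unknown="ignore"`, which is not used in this notebook but lets the encoder accept categories that were absent at fit time.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder


def to_dense(X):
    """Convert a SciPy sparse matrix to a plain NumPy array."""
    return X.toarray() if hasattr(X, "toarray") else np.asarray(X)


# Two toy categorical columns (hypothetical values)
toy_cat = np.array([
    ["Low Fat", "Tier 1"],
    ["Regular", "Tier 3"],
    ["Low Fat", "Tier 2"],
    ["Regular", "Tier 1"],
    ["Low Fat", "Tier 3"],
], dtype=object)

toy_pipeline = Pipeline([
    ("ohe", OneHotEncoder(handle_unknown="ignore")),  # 2 + 3 = 5 dummy columns
    ("to_dense", FunctionTransformer(to_dense)),      # sparse -> ndarray for PCA
    ("pca", PCA(n_components=3)),                     # 5 dummies -> 3 components
])

reduced = toy_pipeline.fit_transform(toy_cat)
print(reduced.shape)

# A category never seen during fit is encoded as all zeros instead of raising
unseen = np.array([["Low Fat", "Tier 4"]], dtype=object)
unseen_reduced = toy_pipeline.transform(unseen)
print(unseen_reduced.shape)
```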

### Select 3 best numeric features

In [14]:
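# use SelectKBest
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import StandardScaler

In [29]:
# f_regression scores features against a continuous target;
# the SelectKBest default, f_classif, is meant for classification
pipeline_num = Pipeline([
    ('numerical_features', keep_num),
    ('impute', impute_step_num),
    ('scaling', StandardScaler()),
    ('select_best', SelectKBest(score_func=f_regression, k=3))])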
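
A small sketch of how `SelectKBest` picks features for a regression target, assuming the `f_regression` score function and synthetic data in which only two of four columns influence the target (the data and coefficients are made up).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# Only columns 0 and 2 drive the target; columns 1 and 3 are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Keep the 2 features with the highest univariate F-score
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
support = selector.get_support()  # boolean mask over the input columns
print(support)
print(selector.transform(X).shape)
```

`get_support()` is a convenient way to check which of the original numeric columns survive the selection step.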

### Fitting models

In [16]:
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Use base_model in Task I
base_model = Ridge()

### Building Pipeline

In [17]:
from sklearn.pipeline import Pipeline, FeatureUnion

In [30]:
feature_union = FeatureUnion([('num_variables', pipeline_num), 
                              ('cat_variables', pipeline_cat)])

pipeline = Pipeline(steps=[('features', feature_union),
                           ('classifier', base_model)])

In [31]:
pipeline.fit(df_train, y_train)

y_pred = pipeline.predict(df_test)

In [33]:
pipeline.score(df_test,y_test)

0.37168499975594993
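
For a regressor, `score` returns the coefficient of determination R², so the value above is an R² on the test set. The sketch below, on synthetic data (all values are made up), shows that `score` matches `r2_score` and how an RMSE could be computed alongside it.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Fit on the first 80 rows, evaluate on the remaining 20
model = Ridge().fit(X[:80], y[:80])
pred = model.predict(X[80:])

r2 = r2_score(y[80:], pred)                        # same quantity as model.score
rmse = np.sqrt(mean_squared_error(y[80:], pred))   # error in the target's units
print(r2, rmse)
```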

In [None]:
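# Template in the style of the tuning article linked in Task 2 (not executed here).
# X_train, X_test, scalers_to_test, n_features_to_test and alpha_to_test
# are not defined in this notebook and must be supplied before running it.
from sklearn.feature_selection import f_regression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('reduce_dim', PCA()),
        ('regressor', Ridge())
        ])

pipe = pipe.fit(X_train, y_train)

params = [
        {'scaler': scalers_to_test,
         'reduce_dim': [PCA()],
         'reduce_dim__n_components': n_features_to_test,
         'regressor__alpha': alpha_to_test},

        {'scaler': scalers_to_test,
         'reduce_dim': [SelectKBest(f_regression)],
         'reduce_dim__k': n_features_to_test,
         'regressor__alpha': alpha_to_test}
        ]


gridsearch = GridSearchCV(pipe, params, verbose=1).fit(X_train, y_train)
print('Final score is: ', gridsearch.score(X_test, y_test))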

# Task II

In [34]:
from sklearn.model_selection import GridSearchCV

In [38]:
models_to_test = [RandomForestRegressor(), GradientBoostingRegressor()]

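# num_feats holds only 5 numeric columns, so k=6 is out of range for SelectKBest;
# in the scikit-learn version that produced the output below, those candidates
# appear to fail to fit (see the traceback) and are scored as NaN
param_grid = {'features__cat_variables__pca__n_components': [3, 5],
              'features__num_variables__select_best__k': [1, 3, 6],
              'classifier': models_to_test}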

gridsearch = GridSearchCV(pipeline, param_grid, verbose=1).fit(df_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Traceback (most recent call last):
  File "/home/hafsa/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/hafsa/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py", line 330, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/home/hafsa/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py", line 292, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/home/hafsa/anaconda3/lib/python3.8/site-packages/joblib/memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "/home/hafsa/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py", line 740, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/hafsa/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py", line 953, in fit_transform
 

In [41]:
print('Final score is: ', gridsearch.score(df_test, y_test))

Final score is:  0.5805949914240425
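
To tune model-specific hyperparameters as well, the grid can be written as a list of dictionaries, one per model, and the winner read from `best_params_`. This is a minimal sketch on synthetic data; the pipeline `toy_pipe`, the step name `regressor` and all parameter values are assumptions for illustration, not settings from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=120)

toy_pipe = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])

# One dictionary per model, so each model only sees its own hyperparameters
toy_grid = [
    {"regressor": [Ridge()],
     "regressor__alpha": [0.1, 1.0, 10.0]},
    {"regressor": [RandomForestRegressor(n_estimators=20, random_state=0)],
     "regressor__max_depth": [2, 4]},
]

search = GridSearchCV(toy_pipe, toy_grid, cv=3).fit(X, y)
n_candidates = len(search.cv_results_["params"])  # 3 Ridge + 2 forest settings
print(n_candidates)
print(search.best_params_)
```

The same pattern would apply to the `classifier` step of the pipeline above, with nested parameter names such as `classifier__max_depth`.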
