## Challenge

In this challenge, we will work with the same [dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing) as in previous weeks, **Prediction of Sales**. The main goal is to create a **pipeline** that covers all data preprocessing and modeling steps.


**TASK 1**: Build a pipeline that ends with a regression model to predict `Item_Outlet_Sales` from the dataset. The pipeline should have the following steps:

- split features into numerical and categorical (text)
- replace null values
    - mean for numerical variables
    - the most frequent value for categorical variables
- create dummy variables from categorical features
- use PCA to reduce the dummy variables to 3 principal components. PCA is applied directly after OneHotEncoder, which outputs a sparse matrix, so we need the **ToDenseTransformer** from the [article about custom pipelines](https://queirozf.com/entries/scikit-learn-pipelines-custom-pipelines-and-pandas-integration).
- select the 3 best candidates from the original numeric features using SelectKBest
- fit Ridge regression (default alpha is fine for now)
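
The steps listed for Task 1 can be sketched end to end as follows. This is only a minimal sketch on a small synthetic frame, not the sales dataset: the column names (`num_a`, `cat_a`, ...), the variable names, and the choice of `f_regression` as the scoring function for `SelectKBest` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder


class ToDenseTransformer(BaseEstimator, TransformerMixin):
    """Turn the sparse one-hot matrix into a dense array for PCA."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.toarray()


# Synthetic stand-in data (hypothetical columns, not the sales dataset).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(size=n),
    "num_c": rng.normal(size=n),
    "num_d": rng.normal(size=n),
    "cat_a": rng.choice(["x", "y", "z"], size=n),
    "cat_b": rng.choice(["p", "q", "r", "s"], size=n),
})
toy_y = 3 * toy["num_a"] + rng.normal(size=n)
toy.loc[::10, "num_b"] = np.nan  # inject missing numeric values
toy.loc[::15, "cat_a"] = np.nan  # inject missing categorical values

# Split features into numerical and categorical.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Numerical branch: mean imputation, then keep the 3 best features.
num_pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    SelectKBest(score_func=f_regression, k=3),
)
# Categorical branch: most frequent value, dummies, densify, PCA to 3 components.
cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
    ToDenseTransformer(),
    PCA(n_components=3),
)
preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols),
])
sketch_pipe = Pipeline([("preprocess", preprocess), ("model", Ridge())])
sketch_pipe.fit(toy, toy_y)
toy_pred = sketch_pipe.predict(toy)
print(toy_pred.shape)
```

The preprocessing output has six columns here: three selected numeric features plus three principal components.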

**TASK 2**: Tune the parameters of the models as well as the preprocessing steps and find the best solution
- Try models: Random Forest, Gradient Boosting Regressor or Ridge Regression. We need to use the approach from the [earlier article](https://iaml.it/blog/optimizing-sklearn-pipelines), in the section **PIPELINE TUNING (ADVANCED VERSION)**, where we tried different scalers.
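
One way to try several model families in a single search is to treat the final pipeline step itself as a tunable parameter. The sketch below assumes a step named `model` and uses synthetic data; the grids are deliberately tiny and the names (`demo_pipe`, `demo_grid`, `demo_search`) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the real features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=150)

demo_pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])

# One dict per model family; each dict lists only parameters valid for that model.
demo_grid = [
    {"model": [Ridge()], "model__alpha": [0.1, 1.0, 10.0]},
    {"model": [RandomForestRegressor(random_state=0)],
     "model__n_estimators": [20, 50]},
    {"model": [GradientBoostingRegressor(random_state=0)],
     "model__learning_rate": [0.05, 0.1]},
]

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)
print(type(demo_search.best_params_["model"]).__name__, demo_search.best_score_)
```

Passing a list of dicts keeps the parameter combinations separate per model, so Ridge's `alpha` is never paired with a random forest.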

In [1]:
import pandas as pd
filepath = 'C:/Users/Tim/Desktop/lighthouse/w7/d1/'
df = pd.read_csv(filepath+"regression_exercise.csv")
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [2]:
# # creating target variable
# y = df["Item_Outlet_Sales"]
# df = df.drop(["Item_Outlet_Sales","Item_Identifier"],axis = 1)

Split into train and test sets at the beginning. We should always do this before fitting the pipeline, so that the preprocessing steps learn only from the training data.
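
As a small illustration of why the split comes first, the sketch below fits an imputer on the training rows only and reuses its statistics on the test rows. The toy array and the names `X_toy`, `X_fit`, `X_hold` are assumptions for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy single-column data with two missing values.
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0], [np.nan], [3.0], [5.0]])
X_fit, X_hold = train_test_split(X_toy, train_size=0.75, random_state=0)

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_fit)                        # the mean is learned from training rows only
X_hold_filled = imputer.transform(X_hold)  # test rows reuse the training mean
print(imputer.statistics_)
```

Calling `fit_transform` on the test set instead would recompute the mean from test rows, which leaks information the model should not see.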

In [3]:
# df_train = df.sample(frac=0.8).sort_index()
# y_train = y[y.index.isin(df_train.index.tolist())]

In [4]:
# df_test = df[~df.index.isin(df_train.index.tolist())].sort_index()
# y_test = y[y.index.isin(df_test.index.tolist())]

In [5]:
def info(x):
    n_missing = x.isnull().sum().sort_values(ascending=False)
    p_missing = (x.isnull().sum()/x.isnull().count()).sort_values(ascending=False)
    dtype = x.dtypes
    count = x.count()
    missing_ = pd.concat([n_missing, p_missing, dtype, count],axis=1, keys = [
        'number_missing',
        'percent_missing',
        'type',
        'count'
    ])
    return missing_
info(df)

Unnamed: 0,number_missing,percent_missing,type,count
Outlet_Size,2410,0.282764,object,6113
Item_Weight,1463,0.171653,float64,7060
Item_Outlet_Sales,0,0.0,float64,8523
Outlet_Type,0,0.0,object,8523
Outlet_Location_Type,0,0.0,object,8523
Outlet_Establishment_Year,0,0.0,int64,8523
Outlet_Identifier,0,0.0,object,8523
Item_MRP,0,0.0,float64,8523
Item_Type,0,0.0,object,8523
Item_Visibility,0,0.0,float64,8523


In [66]:
from scipy.sparse import csr_matrix
from sklearn import metrics
from sklearn.base import TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [12]:
X = df.drop(columns=['Item_Outlet_Sales','Item_Identifier'])
y = df['Item_Outlet_Sales']

In [13]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,y,train_size=0.8)

In [18]:
numerical = list(X.dtypes[X.dtypes != 'object'].index)
# num_train = x_train[numerical]
# num_test = x_test[numerical]

categorical = list(X.dtypes[X.dtypes == 'object'].index)
# cat_train = x_train[categorical]
# cat_test = x_test[categorical]

In [74]:
numerical

['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']

In [78]:
categorical

['Item_Fat_Content',
 'Item_Type',
 'Outlet_Identifier',
 'Outlet_Size',
 'Outlet_Location_Type',
 'Outlet_Type']

In [21]:
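num_pipe = make_pipeline(
    SimpleImputer(strategy='median'),  # note: the task asks for the mean; the median is used here
    StandardScaler(),
    SelectKBest(k=3)  # default score_func is f_classif, which is meant for classification targets
)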
# x_train[numerical] = num_pipe.fit_transform(x_train[numerical])
# x_test[numerical] = num_pipe.fit_transform(x_test[numerical])
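
`SelectKBest` defaults to `f_classif`, a classification F-test, while `Item_Outlet_Sales` is continuous. A hedged sketch of passing `f_regression` instead, on synthetic data where only the third column drives the target (the names `X_num`, `y_num`, `selector` are hypothetical):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, f_regression

rng = np.random.default_rng(0)
X_num = rng.normal(size=(300, 4))
y_num = 5 * X_num[:, 2] + rng.normal(size=300)  # only column 2 is informative

# The default scorer is the classification F-test.
print(SelectKBest().score_func is f_classif)

# For a continuous target, pass the regression F-test explicitly.
selector = SelectKBest(score_func=f_regression, k=1).fit(X_num, y_num)
print(selector.get_support())
```

Swapping the scoring function in `num_pipe` would change which features are kept, so the scores reported in this notebook would not carry over unchanged.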

In [72]:
num_pipe.steps

[('simpleimputer', SimpleImputer(strategy='median')),
 ('standardscaler', StandardScaler()),
 ('selectkbest', SelectKBest(k=3))]

In [45]:
cat_pipe = make_pipeline(
    # note: the task asks for the most frequent value; a constant 'unknown' category is used here
    SimpleImputer(strategy = 'constant', fill_value = 'unknown'),
    OneHotEncoder(),
    ToDenseTransformer(),
    PCA(n_components=3)
)
# x_train[categorical] = cat_pipe.fit_transform(cat_train,columns=categorical)
# x_test[categorical] = cat_pipe.fit_transform(cat_test,columns=categorical)

In [71]:
cat_pipe.steps

[('simpleimputer', SimpleImputer(fill_value='unknown', strategy='constant')),
 ('onehotencoder', OneHotEncoder()),
 ('todensetransformer', <__main__.ToDenseTransformer at 0x1a1b004ef10>),
 ('pca', PCA(n_components=3))]

In [85]:
preprocessor = ColumnTransformer(
    transformers = [
        ('continuous', num_pipe, numerical),
        ('categorical', cat_pipe, categorical),
    ]
)

In [86]:
model_pipe = Pipeline(steps = [
    ('preprocess', preprocessor),
    ('clf', Ridge())
])

In [87]:
model_pipe.fit(x_train, y_train)

y_pred = model_pipe.predict(x_test)
r2 = metrics.r2_score(y_test, y_pred)  # R^2, not classification accuracy
print(f'Test set R^2: {r2}')

Test set R^2: 0.3406801610232326


In [81]:
# pipeline = Pipeline(steps=[('scaling', StandardScaler()),
#                            ('features', feature_union),
#                            ('classifier', RidgeClassifier())])

# Find the best hyperparameters using GridSearchCV on the train set
param_grid = {'clf__alpha': [0.001, 0.01, 0.1, 1], 
              'preprocess__categorical__pca__n_components': [1, 3, 5, 7, 9, 11, 13, 15, 17, 20],
              'preprocess__continuous__selectkbest__k': [1, 2, 3, 4]}
grid = GridSearchCV(model_pipe, param_grid=param_grid, cv=5,n_jobs=-1,verbose=1)
grid.fit(x_train, y_train)

best_model = grid.best_estimator_
best_hyperparams = grid.best_params_
best_r2 = grid.score(x_test, y_test)  # the default scorer for a regressor is R^2
print(f'Best test set R^2: {best_r2}\nAchieved with hyperparameters: {best_hyperparams}')

Fitting 5 folds for each of 160 candidates, totalling 800 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.2s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   17.7s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   52.7s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 800 out of 800 | elapsed:  1.6min finished


Best test set R^2: 0.5418883744291152
Achieved with hyperparameters: {'clf__alpha': 1, 'preprocess__categorical__pca__n_components': 13, 'preprocess__continuous__selectkbest__k': 1}
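
Beyond the single best score, the fitted search object keeps the cross-validated score of every candidate in `cv_results_`. A minimal sketch of inspecting it, assuming a plain `Ridge` search on synthetic data (the names `X_cv`, `search`, `results` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X_cv = rng.normal(size=(200, 3))
y_cv = X_cv @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=200)
X_a, X_b, y_a, y_b = train_test_split(X_cv, y_cv, train_size=0.8, random_state=0)

search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1, 10]}, cv=5).fit(X_a, y_a)

# Mean cross-validated R^2 per candidate, best first.
results = (
    pd.DataFrame(search.cv_results_)
    [["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results.head())
print("CV R^2:", search.best_score_, "held-out R^2:", search.score(X_b, y_b))
```

Comparing `best_score_` (cross-validated on the training set) with the held-out score gives a rough check on whether the tuned configuration generalizes.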


# Task I

### Split features into numerical and categorical

In [37]:
# cat_feats = df.dtypes[df.dtypes == 'object'].index.tolist()
# num_feats = df.dtypes[~df.dtypes.index.isin(cat_feats)].index.tolist()

In [9]:
# numerical = list(df.dtypes[df.dtypes != 'object'].index)
# num_dat = df[numerical]

# categorical = list(df.dtypes[df.dtypes == 'object'].index)
# cat_dat = df[categorical]

In [38]:
# from sklearn.preprocessing import FunctionTransformer

# # Using own function in Pipeline
# def numFeat(data):
#     return data[num_feats]

# def catFeat(data):
#     return data[cat_feats]

In [39]:
# # we will start two separate pipelines for each type of features
# keep_num = FunctionTransformer(numFeat)
# keep_cat = FunctionTransformer(catFeat)

In [77]:
# num_dat

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
0,9.300,0.016047,249.8092,1999,3735.1380
1,5.920,0.019278,48.2692,2009,443.4228
2,17.500,0.016760,141.6180,1999,2097.2700
3,19.200,0.000000,182.0950,1998,732.3800
4,8.930,0.000000,53.8614,1987,994.7052
...,...,...,...,...,...
8518,6.865,0.056783,214.5218,1987,2778.3834
8519,8.380,0.046982,108.1570,2002,549.2850
8520,10.600,0.035186,85.1224,2004,1193.1136
8521,7.210,0.145221,103.1332,2009,1845.5976


In [58]:
# import copy

In [59]:
# data = copy.deepcopy(df)

### Null value replacement

### Creating dummy variables

In [196]:
# use OneHotEncoder

In [10]:
# # Use SimpleImputer
# from sklearn.compose import ColumnTransformer
# from sklearn.impute import SimpleImputer
# from sklearn.pipeline import Pipeline
# from sklearn.preprocessing import OneHotEncoder

In [64]:
# null_values = ColumnTransformer([
#         ('impute_mean', SimpleImputer(strategy='mean'), ['Item_Weight'])
#     ], remainder='passthrough')

# pipe = Pipeline([
#     ('null', null_values)
# ])

# pipe.fit_transform(X)

array([[9.3, 'Low Fat', 0.016047301, ..., 'Medium', 'Tier 1',
        'Supermarket Type1'],
       [5.92, 'Regular', 0.019278216, ..., 'Medium', 'Tier 3',
        'Supermarket Type2'],
       [17.5, 'Low Fat', 0.016760075, ..., 'Medium', 'Tier 1',
        'Supermarket Type1'],
       ...,
       [10.6, 'Low Fat', 0.035186271, ..., 'Small', 'Tier 2',
        'Supermarket Type1'],
       [7.21, 'Regular', 0.145220646, ..., 'Medium', 'Tier 3',
        'Supermarket Type2'],
       [14.8, 'Low Fat', 0.04487828, ..., 'Small', 'Tier 1',
        'Supermarket Type1']], dtype=object)

In [65]:
# categorical_preprocessing = Pipeline([('ohe', OneHotEncoder())])
# numerical_preprocessing = Pipeline([('imputation', SimpleImputer())])

# #define which transformer applies to which columns
# preprocess = ColumnTransformer([
#     ('categorical_preprocessing', categorical_preprocessing, keep_cat),
#     ('simpleimput2', numerical_preprocessing, ['Item_Weight']),
# ])

In [None]:
# for i in cat_dat:
#     categorical_preprocessing = Pipeline([('ohe', OneHotEncoder())])
    

### PCA to reduce number of dummy variables to 3 principal components

In [66]:
# # don't forget ToDenseTransformer after one hot encoder
# from sklearn.decomposition import PCA
# from sklearn.base import TransformerMixin
# from scipy.sparse import csr_matrix

In [23]:
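from sklearn.base import BaseEstimator, TransformerMixin

class ToDenseTransformer(BaseEstimator, TransformerMixin):
    """Convert a scipy sparse matrix into a dense NumPy array."""

    # here you define the operation it should perform
    def transform(self, X, y=None, **fit_params):
        # toarray() returns an ndarray; todense() returns np.matrix,
        # which newer scikit-learn versions no longer accept
        return X.toarray()

    # just return self
    def fit(self, X, y=None, **fit_params):
        return self

# need to make matrices dense because PCA does not work with sparse matrices.
dense = Pipeline([
    ('to_dense', ToDenseTransformer()),
])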
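
If the densifying step is only there to satisfy PCA, one alternative (not what the task asks for) is `TruncatedSVD`, which accepts sparse input directly. Note that it does not center the data, so its components are not identical to PCA's. The toy array `cats` and the name `svd_pipe` below are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Two toy categorical columns with three levels each.
cats = np.array(
    [["a", "x"], ["b", "y"], ["c", "x"], ["a", "z"], ["b", "z"], ["c", "y"]],
    dtype=object,
)

svd_pipe = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),        # sparse output by default
    TruncatedSVD(n_components=3, random_state=0),  # works on the sparse matrix as-is
)
reduced = svd_pipe.fit_transform(cats)
print(reduced.shape)
```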

In [68]:
# dense_transform = ColumnTransformer([
#     ('dense', dense, cat_dat)
# ])

### Select 3 best numeric features

In [69]:
# use SelectKBest

In [70]:
# from sklearn.pipeline import FeatureUnion
# from sklearn.feature_selection import SelectKBest

# feature_union = FeatureUnion([('pca', PCA(n_components=3)), 
#                               ('select_best', SelectKBest(k=6))])

# # kbest_transform = ColumnTransformer([
# #     ('kbest_transform',SelectKBest(k=3),num_dat)
# # ])

### Fitting models

In [71]:
# from sklearn.linear_model import Ridge
# from sklearn.ensemble import RandomForestRegressor
# from sklearn.ensemble import GradientBoostingRegressor

# # Use base_model in Task I
# base_model = Ridge()

### Building Pipeline

In [72]:
# from sklearn.pipeline import Pipeline, FeatureUnion
# from sklearn.preprocessing import StandardScaler

In [75]:
# pipeline = Pipeline(steps=[#('preprocess', preprocess),
#                            #('dense_transform', dense_transform),
# #                           ('pca_transform', pca_transform),
# #                           ('kbest_transform',kbest_transform),
#                            ('scaling', StandardScaler()),
#                            ('features', feature_union),
#                            ('classifier', Ridge())])


In [None]:
# pipeline = Pipeline(steps = [
    
# ])

In [4]:
# model.score(df_test,y_test)

# Task II

In [208]:
# from sklearn.model_selection import GridSearchCV

In [216]:
# params = [
# # 
# ]

In [219]:
# print('Final score is: ', tuned_model.score(df_test, y_test))

Final score is:  0.6241741712069144
