## Pipelines Challenge

In this challenge, we will work with this [dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing) to predict sales.

**The main goal is to create a `pipeline` that covers all the data preprocessing and modeling steps.**


**TASK 1**: Build a pipeline that ends with a regression model, to predict `Item_Outlet_Sales` from the dataset. 

**The pipeline should have the following steps:**

1. Split the features into numerical and categorical (text)
2. Replace null values
    - the mean for numerical variables
    - the most frequent value for categorical variables
3. Create dummy variables from categorical features
4. Use PCA to reduce the dummy variables to 3 principal components. PCA is applied directly after the OneHotEncoder, which outputs a sparse matrix, so we will need the **ToDenseTransformer** from the [article about custom pipelines](https://queirozf.com/entries/scikit-learn-pipelines-custom-pipelines-and-pandas-integration) to convert the output to a dense array first.
5. Select the 3 best candidates from the original numerical features using `SelectKBest`
6. Fit a Ridge regression (default alpha is fine for now)

**TASK 2**: Tune the parameters of multiple models as well as the preprocessing steps and find the best solution.
- Try these models:
    - Random Forest Regressor
    - Gradient Boosting Regressor
    - Ridge Regression
- For Task 2, use the same approach as in this [earlier article](https://iaml.it/blog/optimizing-sklearn-pipelines), in the section `PIPELINE TUNING (ADVANCED VERSION)`, where we tried different kinds of scalers. (Use the article as a reference.)

_________________________________

In [143]:
import pandas as pd
df = pd.read_csv("/Users/patrickokwir/Desktop/Lighthouse-data-notes/Unit_8/Day_1/pipelines_and_persistence_exercise-master/regression_exercise.csv")
df.head()

| Unnamed: 0 | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [144]:
# creating target variable
y = df["Item_Outlet_Sales"]
X = df.drop(["Item_Outlet_Sales","Item_Identifier"],axis = 1)

Split the dataset into a train and test set.

**Note:** Always split before fitting the pipeline, so that imputation, encoding, and feature selection learn only from the training data.

In [145]:
# import train_test_split
from sklearn.model_selection import train_test_split

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
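As a small illustration of why the split comes first, the sketch below fits an imputer on the training rows only and then reuses that training mean on the test rows. The toy array and the names `X_toy`, `X_tr`, and `X_te` are made up for this example and are not part of the challenge data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy single-column data with one missing value (hypothetical values)
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [np.nan], [100.0]])
y_toy = np.arange(6)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=0
)

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_tr)                      # the mean is computed from training rows only
X_te_filled = imputer.transform(X_te)  # test rows are filled with the training mean

train_mean = np.nanmean(X_tr)
print("mean learned from the training split:", imputer.statistics_[0])
```

A pipeline fitted with `fit(X_train, y_train)` behaves the same way for every step it contains, which is what keeps test information out of the preprocessing.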

---------------------
## Task I

### Split Features into numerical and categorical

In [146]:
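# Build the lists from X, not df: df still contains Item_Outlet_Sales and
# Item_Identifier, which were dropped from X and would raise a KeyError later
cat_feats = X.dtypes[X.dtypes == 'object'].index.tolist()
num_feats = X.dtypes[~X.dtypes.index.isin(cat_feats)].index.tolist()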

In [147]:
from sklearn.preprocessing import FunctionTransformer

# Use our own functions as Pipeline steps
def numFeat(data):
    return data[num_feats]

def catFeat(data):
    return data[cat_feats]

In [148]:
# we will start a separate pipeline for each type of feature
keep_num = FunctionTransformer(numFeat)
keep_cat = FunctionTransformer(catFeat)
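To see what such a selector does on its own, here is a minimal sketch on a two-row frame. The frame `toy`, the list `toy_num_feats`, and the function `keep_toy_num` are hypothetical names used only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Tiny frame with two numeric columns and one text column (made-up values)
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92],
    "Item_MRP": [249.8, 48.3],
    "Outlet_Size": ["Medium", "High"],
})
toy_num_feats = ["Item_Weight", "Item_MRP"]

def keep_toy_num(data):
    # Return only the numeric columns of the incoming DataFrame
    return data[toy_num_feats]

selector = FunctionTransformer(keep_toy_num)
selected = selector.fit_transform(toy)
print(selected.columns.tolist())
```

The selector has nothing to learn during `fit`; it simply applies the function, so every name in the column list has to exist in whatever frame the pipeline receives.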

### Replacing null values

In [149]:
# Use SimpleImputer to fill in missing values
from sklearn.impute import SimpleImputer
impute_num = SimpleImputer(strategy = "mean")
impute_cat = SimpleImputer(strategy = "most_frequent")
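The two strategies can be checked on a few hand-made values. This is only a sketch; the frames `num_toy` and `cat_toy` are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# One numeric and one text column, each with a missing entry (made-up values)
num_toy = pd.DataFrame({"Item_Weight": [10.0, np.nan, 20.0]})
cat_toy = pd.DataFrame({"Outlet_Size": ["Medium", np.nan, "Medium", "High"]})

# "mean" fills the gap with the column average
num_filled = SimpleImputer(strategy="mean").fit_transform(num_toy)

# "most_frequent" fills the gap with the column mode and also works on text
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(cat_toy)

print(num_filled.ravel())
print(cat_filled.ravel())
```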

### Creating dummy variables

In [150]:
# use OneHotEncoder to create dummy variables
from sklearn.preprocessing import OneHotEncoder
encode_cat = OneHotEncoder(handle_unknown = "ignore")

In [151]:
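# Convert the encoder's sparse output to a dense array before PCA.
# toarray() returns a regular ndarray; todense() returns an np.matrix,
# which recent scikit-learn versions do not accept.
from sklearn.preprocessing import FunctionTransformer
todense = FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)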
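The sparse-to-dense step can be seen in isolation in the sketch below. It uses `toarray()`, which returns a plain NumPy array, and a named helper `to_dense` (a hypothetical name) instead of a lambda. The three category values are taken from the `Item_Fat_Content` column shown earlier.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# One categorical column with two distinct levels
cats = np.array([["Low Fat"], ["Regular"], ["Low Fat"]], dtype=object)

# By default the encoder returns a SciPy sparse matrix
encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(cats)

def to_dense(x):
    # Convert sparse input to an ndarray; pass dense input through unchanged
    return x.toarray() if sparse.issparse(x) else np.asarray(x)

dense = FunctionTransformer(to_dense, accept_sparse=True).fit_transform(encoded)
print(type(encoded).__name__, "->", type(dense).__name__, dense.shape)
```

One reason to prefer a named function: the standard `pickle` module cannot serialize a lambda, which matters if the fitted pipeline is going to be saved to disk.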

### Use PCA to reduce the number of dummy variables to 3 principal components.

In [2]:
# don't forget ToDenseTransformer after one hot encoder

In [152]:
# use PCA to reduce number of dummy variables to 3 principal components
from sklearn.decomposition import PCA
pca = PCA(n_components = 3)

### Select the 3 best numeric features

In [153]:
# use SelectKBest to select best 3 numerical features
from sklearn.feature_selection import SelectKBest, f_regression
select_best = SelectKBest(f_regression, k = 3)
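As a quick check of what `SelectKBest` with `f_regression` keeps, the sketch below builds synthetic data in which only three of five columns influence the target. The data, the coefficients, and the names `X_demo` and `y_demo` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))

# The target depends only on columns 0, 2 and 4 (a made-up relationship)
y_demo = (
    3 * X_demo[:, 0]
    - 2 * X_demo[:, 2]
    + 2 * X_demo[:, 4]
    + rng.normal(scale=0.1, size=500)
)

selector = SelectKBest(f_regression, k=3).fit(X_demo, y_demo)
kept = selector.get_support(indices=True)  # indices of the retained columns
X_reduced = selector.transform(X_demo)
print("kept columns:", kept, "reduced shape:", X_reduced.shape)
```

`f_regression` scores each feature separately by its linear relationship with the target, so it can miss features that only matter in combination with others.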

### Fitting models

In [154]:
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Use base_model in Task I
base_model = Ridge()

### Building a Pipeline

In [93]:
from sklearn.pipeline import Pipeline, FeatureUnion

In [156]:
model = Pipeline([
    ("features", FeatureUnion([
        ("numeric_features", Pipeline([
            ("keep_num", keep_num),
            ("impute_num", impute_num),
            ("select_best", select_best)
        ])),
        ("categorical_features", Pipeline([
            ("keep_cat", keep_cat),
            ("impute_cat", impute_cat),
            ("encode_cat", encode_cat),
            ("todense", todense),
            ("pca", pca)
        ]))
    ])),
    ("base_model", base_model)
])




model.fit(X_train, y_train)
# predict() takes only the features; use model.score(X_test, y_test) to evaluate
model.predict(X_test)

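If this cell raises `KeyError: "['Item_Outlet_Sales'] not in index"`, the feature lists were built from `df`, which still contains the target. `num_feats` and `cat_feats` must only name columns that are present in `X`.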
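The same pipeline structure can be exercised end to end on a small synthetic frame, which is a convenient way to confirm that the steps fit together before running on the real data. Everything below is a sketch under assumptions: the column names, the random values, and the helpers `select_num`, `select_cat`, and `to_dense` are invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

rng = np.random.default_rng(42)
n = 60
toy_X = pd.DataFrame({
    "weight": rng.normal(12, 4, n),
    "visibility": rng.uniform(0, 0.2, n),
    "mrp": rng.uniform(30, 250, n),
    "year": rng.integers(1985, 2010, n).astype(float),
    "outlet_size": rng.choice(["Small", "Medium", "High"], n),
    "tier": rng.choice(["Tier 1", "Tier 2", "Tier 3"], n),
})
toy_y = 15 * toy_X["mrp"] + rng.normal(0, 50, n)  # made-up target
toy_X.loc[0, "weight"] = np.nan       # one missing numeric value
toy_X.loc[1, "outlet_size"] = np.nan  # one missing categorical value

# The lists name only columns that exist in toy_X (the target is not among them)
toy_num = ["weight", "visibility", "mrp", "year"]
toy_cat = ["outlet_size", "tier"]

def select_num(data):
    return data[toy_num]

def select_cat(data):
    return data[toy_cat]

def to_dense(x):
    return x.toarray()

toy_model = Pipeline([
    ("features", FeatureUnion([
        ("numeric_features", Pipeline([
            ("keep_num", FunctionTransformer(select_num)),
            ("impute_num", SimpleImputer(strategy="mean")),
            ("select_best", SelectKBest(f_regression, k=3)),
        ])),
        ("categorical_features", Pipeline([
            ("keep_cat", FunctionTransformer(select_cat)),
            ("impute_cat", SimpleImputer(strategy="most_frequent")),
            ("encode_cat", OneHotEncoder(handle_unknown="ignore")),
            ("todense", FunctionTransformer(to_dense, accept_sparse=True)),
            ("pca", PCA(n_components=3)),
        ])),
    ])),
    ("base_model", Ridge()),
])

toy_model.fit(toy_X, toy_y)
preds = toy_model.predict(toy_X)     # predict() takes the features only
r2 = toy_model.score(toy_X, toy_y)   # score() takes features and target
n_features_out = toy_model.named_steps["features"].transform(toy_X).shape[1]
print("combined feature columns:", n_features_out)
```

The feature union concatenates the 3 selected numeric columns with the 3 principal components, so the regressor sees 6 columns.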

----------------------------
## Task II

In [208]:
from sklearn.model_selection import GridSearchCV

In [216]:
params = [
# 
]
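One possible shape for such a parameter list is sketched below: a list of dictionaries, one per model family, where each dictionary swaps out the whole final step and lists only that model's hyperparameters. This is not the grid that produced the score reported in this notebook. The synthetic data, the two-step pipeline, and all grid values are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the real features
X_syn, y_syn = make_regression(
    n_samples=120, n_features=6, noise=5.0, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("base_model", Ridge()),
])

# One dict per model family; values are illustrative, not tuned recommendations
param_grid = [
    {"base_model": [Ridge()],
     "base_model__alpha": [0.1, 1.0, 10.0]},
    {"base_model": [RandomForestRegressor(random_state=0)],
     "base_model__n_estimators": [20, 50],
     "base_model__max_depth": [3, None]},
    {"base_model": [GradientBoostingRegressor(random_state=0)],
     "base_model__n_estimators": [20, 50],
     "base_model__learning_rate": [0.05, 0.1]},
]

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_syn, y_syn)
n_candidates = len(search.cv_results_["params"])
print("candidates tried:", n_candidates)
print("best model:", type(search.best_params_["base_model"]).__name__)
```

With the pipeline from Task I, preprocessing steps are addressed the same way through their nested step names, for example `features__categorical_features__pca__n_components` or `features__numeric_features__select_best__k`.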

In [219]:
# X_test is the held-out split created at the top of the notebook
# print('Final score is: ', tuned_model.score(X_test, y_test))

Final score is:  0.6241741712069144
