# **Pipelines Activity**

_John Andrew Dixon_

---

**Setup**


In [38]:
# Import necessary modules

# For working with the data
import pandas as pd

# For performing a train-test split
from sklearn.model_selection import train_test_split

# For scaling numerical features and encoding nominal features
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# For creating column selectors and column transformers
from sklearn.compose import make_column_selector, make_column_transformer

# For simple imputation on missing data
from sklearn.impute import SimpleImputer

# For creation of preprocessing pipelines
from sklearn.pipeline import make_pipeline

# For displaying scikit-learn estimators and pipelines as diagrams
from sklearn import set_config
set_config(display='diagram')


In [39]:
# Remote URL that has the data
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSdnb9XcAnl91bdZYxoJQgIapMW6SLkfr3DYGwnpBOIw-rkw-5j_3b0JLx01282OBAKVUCUJnq8OAUR/pub?output=xlsx"

# Read in the data and verify
df = pd.read_excel(url)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 16 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   name                                             77 non-null     object 
 1   Manufacturer                                     77 non-null     object 
 2   type                                             68 non-null     object 
 3   calories per serving                             70 non-null     float64
 4   grams of protein                                 77 non-null     int64  
 5   grams of fat                                     69 non-null     float64
 6   milligrams of sodium                             76 non-null     float64
 7   grams of dietary fiber                           77 non-null     float64
 8   grams of complex carbohydrates                   77 non-null     float64
 9   grams of sugars                   

---

## **Tasks**

> **Question**: _How well can the calories be predicted based on the Manufacturer, cereal type, grams of fat, grams of sugars, and weight in ounces per one serving of the cereal?_

### **Define features (X) and target (y).**
- X should only include the Manufacturer, cereal type, grams of fat, grams of sugars, and weight in ounces columns.
- y should be calories per serving

In [40]:
# Create the feature matrix (X)
columns = ["Manufacturer", "type", "grams of fat", "grams of sugars", "Weight in ounces per one serving"]
X = df.loc[:, columns]
X

| | Manufacturer | type | grams of fat | grams of sugars | Weight in ounces per one serving |
|---|---|---|---|---|---|
| 0 | General Mills | Cold | 2.0 | 10.0 | 1.00 |
| 1 | General Mills | Cold | 2.0 | NaN | 1.33 |
| 2 | General Mills | Cold | 2.0 | 1.0 | 1.00 |
| 3 | General Mills | Cold | 3.0 | 9.0 | 1.00 |
| 4 | General Mills | Cold | 2.0 | 7.0 | 1.00 |
| ... | ... | ... | ... | ... | ... |
| 72 | Ralston Purina | Cold | NaN | 2.0 | 1.00 |
| 73 | Ralston Purina | Cold | 1.0 | 3.0 | 1.00 |
| 74 | American Home Food Products | Hot | 1.0 | NaN | 1.00 |
| 75 | Nabisco | Hot | 0.0 | 0.0 | 1.00 |


In [41]:
# Create the target vector (y)
y = df["calories per serving"]
y

0     110.0
1     130.0
2       NaN
3     120.0
4     110.0
      ...  
72    110.0
73    100.0
74    100.0
75    100.0
76    100.0
Name: calories per serving, Length: 77, dtype: float64

### **Identify each feature as numerical, ordinal, or nominal.**
**Numerical**: `grams of fat`, `grams of sugars`, `Weight in ounces per one serving`

**Ordinal**: `None`

**Nominal**: `Manufacturer`, `type`
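
One way to sanity-check this classification is to group the columns by dtype. The sketch below uses a small made-up frame (`X_demo`, with invented values) that only borrows the column names from `X`; it is an illustration, not the notebook's data.

```python
import pandas as pd

# Hypothetical stand-in for X; the values are invented for illustration.
X_demo = pd.DataFrame({
    "Manufacturer": ["General Mills", "Nabisco", "Ralston Purina"],
    "type": ["Cold", "Hot", "Cold"],
    "grams of fat": [2.0, 0.0, 1.0],
    "grams of sugars": [10.0, 0.0, 3.0],
    "Weight in ounces per one serving": [1.0, 1.0, 1.33],
})

# Numeric dtypes are candidates for the numerical group.
numeric_cols = X_demo.select_dtypes(include="number").columns.tolist()

# Everything else is text here and needs a manual ordinal/nominal decision.
non_numeric_cols = X_demo.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

A dtype check only separates numbers from text. Deciding whether a text column is ordinal or nominal still depends on whether its categories have a meaningful order, which `Manufacturer` and `type` do not.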

### **Train test split the data to prepare for machine learning.**

In [42]:
# Perform the Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
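
As a rough illustration of what this call does, the sketch below splits a made-up array with the same number of rows as the data set (77). The arrays `X_demo` and `y_demo` are placeholders, not the cereal data. It relies on the default split of about 75% train and 25% test when no size is given, and on a fixed `random_state` making the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 77 rows, matching the row count reported by df.info().
X_demo = np.arange(77 * 2).reshape(77, 2)
y_demo = np.arange(77)

# With no test_size given, about a quarter of the rows are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Repeating the call with the same random_state reproduces the same split.
X_tr_again, X_te_again, _, _ = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_tr, X_tr_again))
```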

### **Use pipelines and column transformers to complete the following tasks:**
- Impute any missing values. Use the `mean` strategy for numeric columns and the `most_frequent` strategy for categorical columns.
- One-hot encode the nominal features.
    - Be sure to include the arguments `sparse=False` AND `handle_unknown='ignore'` when creating your OneHotEncoder.
- Scale the numeric columns.

In [43]:
# Instantiate all transformers

# Instantiate the Simple imputers for both types of columns
# Most frequent for nominal columns and mean for numeric columns
most_frequent_imputer = SimpleImputer(strategy="most_frequent")
mean_imputer = SimpleImputer(strategy="mean")

# Instantiate the Scaler for scaling numerical features
scaler = StandardScaler()

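# Instantiate the One-Hot Encoder
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`, and 1.4 removed
# the old name; on those versions use sparse_output=False instead
one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)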

In [44]:
# Instantiate the numeric pipeline
numeric_pipeline = make_pipeline(mean_imputer, scaler)
numeric_pipeline
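
To make the two steps of the numeric pipeline concrete, here is a minimal sketch on a toy column. The frame `demo` and its values are invented, and the pipeline is fitted directly only because this is a standalone illustration; on the activity's data, fitting goes through the single preprocessing object.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy column with one missing value (invented numbers).
demo = pd.DataFrame({"grams of fat": [1.0, np.nan, 3.0]})

# Step 1 fills the gap with the column mean; step 2 standardizes the result.
demo_numeric_pipeline = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
scaled = demo_numeric_pipeline.fit_transform(demo)

print(scaled)
```

Because the missing entry is replaced by the mean before scaling, it ends up at zero after standardization, and the column as a whole has mean 0 and unit standard deviation.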

In [45]:
# Instantiate the nominal/categorical pipeline
nominal_pipeline = make_pipeline(most_frequent_imputer, one_hot_encoder)
nominal_pipeline
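
The nominal pipeline can be sketched the same way on a toy column. The frame `demo` is invented. The encoder is left at its default output and converted with `.toarray()` where available, a choice made here only so the sketch does not depend on whether the installed version spells the argument `sparse` or `sparse_output`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy nominal column with one missing value (invented values).
demo = pd.DataFrame({"type": pd.Series(["Cold", np.nan, "Hot", "Cold"], dtype=object)})

# Step 1 fills the gap with the most frequent category ("Cold" here);
# step 2 creates one indicator column per category seen during fit.
demo_nominal_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
encoded = demo_nominal_pipeline.fit_transform(demo)

# Convert to a dense NumPy array if the encoder returned a sparse matrix.
dense = encoded.toarray() if hasattr(encoded, "toarray") else np.asarray(encoded)

# The learned categories determine the order of the indicator columns.
categories = list(demo_nominal_pipeline[-1].categories_[0])

print(categories)
print(dense)
```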


### **All preprocessing steps should be contained within a single preprocessing object.**
- Include the arguments: remainder='drop' OR remainder='passthrough' when creating your ColumnTransformer

In [46]:
# Instantiate column selectors
numeric_selector = make_column_selector(dtype_include="number")
nominal_selector = make_column_selector(dtype_include="object")
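
A column selector is a callable: given a DataFrame, it returns the names of the columns whose dtype matches. The sketch below shows this on a made-up frame (`demo`, invented values) where the text column is stored with the `object` dtype.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Made-up frame: one text column stored as object, two float columns.
demo = pd.DataFrame({
    "Manufacturer": pd.Series(["Nabisco", "Quaker Oats"], dtype=object),
    "grams of fat": [0.0, 2.0],
    "grams of sugars": [0.0, 6.0],
})

# Calling a selector on a DataFrame returns a list of matching column names.
numeric_cols = make_column_selector(dtype_include="number")(demo)
nominal_cols = make_column_selector(dtype_include="object")(demo)

print(numeric_cols)
print(nominal_cols)
```

Inside a ColumnTransformer the selector is called at fit time, so the column lists never need to be written out by hand.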

In [47]:
# Instantiate tuples for ColumnTransformer
numeric_tuple = (numeric_pipeline, numeric_selector)
nominal_tuple = (nominal_pipeline, nominal_selector)

In [48]:
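# Instantiate a ColumnTransformer as a single preprocessing object for
# all column transformations (remainder='drop' is the default; it is
# stated explicitly here because the task asks for it)
preprocessor = make_column_transformer(numeric_tuple, nominal_tuple, remainder='drop')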
preprocessor

### **Use your preprocessing object to transform your data appropriately, avoiding data leakage, to make it ready for modeling. Show the resulting NumPy array output.**
- The .fit() and .transform() methods should ONLY be used with the resulting preprocessing object, NOT with any individual transformer.
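
A minimal sketch of this step, under stated assumptions: the frames `X_train_demo` and `X_test_demo` are invented stand-ins for the real split, and `sparse_threshold=0` is used on the ColumnTransformer to get a dense NumPy array, rather than the `sparse=False` encoder argument the task names, so the example runs across scikit-learn versions. The key point is that the preprocessing object is fitted on the training data only and then used to transform both sets.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented stand-ins for X_train and X_test.
X_train_demo = pd.DataFrame({
    "Manufacturer": pd.Series(["General Mills", "Nabisco", np.nan, "Nabisco"], dtype=object),
    "grams of fat": [2.0, np.nan, 1.0, 3.0],
})
X_test_demo = pd.DataFrame({
    "Manufacturer": pd.Series(["Quaker Oats", "Nabisco"], dtype=object),
    "grams of fat": [np.nan, 0.0],
})

numeric_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
nominal_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Single preprocessing object; sparse_threshold=0 forces a dense array.
demo_preprocessor = make_column_transformer(
    (numeric_pipe, make_column_selector(dtype_include="number")),
    (nominal_pipe, make_column_selector(dtype_include="object")),
    remainder="drop",
    sparse_threshold=0,
)

# Fit on the training data ONLY, so no test-set statistics leak in.
demo_preprocessor.fit(X_train_demo)

# Transform both sets with the statistics learned from training.
X_train_processed = demo_preprocessor.transform(X_train_demo)
X_test_processed = demo_preprocessor.transform(X_test_demo)

print(X_train_processed)
print(X_test_processed)
```

In the toy test set, the missing fat value is filled with the training mean, and the manufacturer that never appeared in training is encoded as all zeros because of `handle_unknown="ignore"`. With the notebook's own objects, the same three calls would apply to `preprocessor`, `X_train`, and `X_test`.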