## Pipelines Activity (Core)
- Paula Pipkin
- 7/20



In [34]:
# imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.compose import make_column_transformer, make_column_selector

from sklearn.pipeline import make_pipeline

from sklearn import set_config
set_config(display='diagram')

# The machine learning question is: 

- How well can calories per serving be predicted from the manufacturer, cereal type, grams of fat, grams of sugars, and weight in ounces per serving of the cereal?

In [35]:
filename = '/content/drive/MyDrive/BootCamp/05 - 08 - MACHINE LEARNING/01 Week 05 - Intro to Machine Learning/Cereal with missing values.xlsx - Sheet 1 - cereal.csv'

In [36]:
df = pd.read_csv(filename)
df.head()

Unnamed: 0,name,Manufacturer,type,calories per serving,grams of protein,grams of fat,milligrams of sodium,grams of dietary fiber,grams of complex carbohydrates,grams of sugars,milligrams of potassium,vitamins and minerals (% of FDA recommendation),Display shelf,Weight in ounces per one serving,Number of cups in one serving,Rating of cereal
0,Apple Cinnamon Cheerios,General Mills,Cold,110.0,2,2.0,180.0,1.5,10.5,10.0,70,25.0,1,1.0,0.75,29.509541
1,Basic 4,General Mills,Cold,130.0,3,2.0,,2.0,18.0,,100,25.0,3,1.33,0.75,37.038562
2,Cheerios,General Mills,Cold,,6,2.0,290.0,2.0,17.0,1.0,105,25.0,1,1.0,1.25,50.764999
3,Cinnamon Toast Crunch,General Mills,Cold,120.0,1,3.0,210.0,0.0,13.0,9.0,45,25.0,2,1.0,0.75,19.823573
4,Clusters,General Mills,Cold,110.0,3,2.0,140.0,2.0,13.0,7.0,105,25.0,3,1.0,0.5,40.400208


In [37]:
df.duplicated().sum()

0

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 16 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   name                                             77 non-null     object 
 1   Manufacturer                                     77 non-null     object 
 2   type                                             68 non-null     object 
 3   calories per serving                             70 non-null     float64
 4   grams of protein                                 77 non-null     int64  
 5   grams of fat                                     69 non-null     float64
 6   milligrams of sodium                             76 non-null     float64
 7   grams of dietary fiber                           77 non-null     float64
 8   grams of complex carbohydrates                   77 non-null     float64
 9   grams of sugars                   

Define features (X) and target (y).
- X should only include the Manufacturer, cereal type, grams of fat, grams of sugars, and weight in ounces columns.
- Train test split the data to prepare for machine learning.

In [39]:
df.columns

Index(['name', 'Manufacturer', 'type', 'calories per serving',
       'grams of protein', 'grams of fat', 'milligrams of sodium',
       'grams of dietary fiber', 'grams of complex carbohydrates',
       'grams of sugars', 'milligrams of potassium',
       'vitamins and minerals (% of FDA recommendation)', 'Display shelf',
       'Weight in ounces per one serving', 'Number of cups in one serving',
       'Rating of cereal'],
      dtype='object')

In [40]:
X = df.filter([ 'Manufacturer', 'type','grams of fat','grams of sugars','Weight in ounces per one serving'], axis=1)
y = df['calories per serving']
X.head()

Unnamed: 0,Manufacturer,type,grams of fat,grams of sugars,Weight in ounces per one serving
0,General Mills,Cold,2.0,10.0,1.0
1,General Mills,Cold,2.0,,1.33
2,General Mills,Cold,2.0,1.0,1.0
3,General Mills,Cold,3.0,9.0,1.0
4,General Mills,Cold,2.0,7.0,1.0


In [41]:
# train_test_split returns X_train, X_test, y_train, y_test (in that order)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
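
As a quick sanity check on the split, the sketch below uses a small made-up frame (the `toy` data and the `*_tr` / `*_te` names are illustrative, not the cereal file) to show the order of the four objects `train_test_split` returns and that features and target stay aligned by index. It also drops rows with a missing target before splitting. That step is an assumption, not something the activity asks for: `df.info()` reports only 70 non-null values for `calories per serving`, and most estimators cannot be fit on a NaN target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration (not the cereal file)
toy = pd.DataFrame({
    'grams of fat': [2.0, 2.0, np.nan, 3.0, 1.0, 0.0, 1.0, 2.0, 0.0],
    'type': pd.Series(['Cold', 'Cold', 'Hot', np.nan, 'Cold',
                       'Cold', 'Hot', 'Cold', 'Cold'], dtype=object),
    'calories per serving': [110.0, 130.0, np.nan, 120.0, 110.0,
                             90.0, 100.0, 120.0, 70.0],
})

# Assumption: rows with a missing target are dropped before splitting
toy = toy.dropna(subset=['calories per serving'])

X_toy = toy[['grams of fat', 'type']]
y_toy = toy['calories per serving']

# Return order is X_train, X_test, y_train, y_test
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

print(X_tr.shape, X_te.shape)
print(X_tr.index.equals(y_tr.index))  # features and target share the same rows
```

Missing values in the features are left in place here on purpose, since the imputers in the preprocessing object are meant to handle them.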


Identify each feature as numerical, ordinal, or nominal
- 'Manufacturer' : Nominal
- 'type': Nominal
- 'grams of fat': Numerical
- 'grams of sugars': Numerical
- 'Weight in ounces per one serving': Numerical

Use pipelines and column transformers to complete the following tasks:
- Impute any missing values. Use the ‘mean’ strategy for numeric columns and the ‘most_frequent’ strategy for categorical columns.
- One-hot encode the nominal features.
- Scale the numeric columns.


All preprocessing steps should be contained within a single preprocessing object.

In [42]:
# Instantiate column selectors
num_selector = make_column_selector(dtype_include='number')
cat_selector = make_column_selector(dtype_include='object')
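
To see what the selectors pick up, a selector can be called directly on a DataFrame and returns the matching column names. The sketch below uses a made-up frame (`toy_X` is hypothetical) with the same kinds of columns as `X`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame, made up for illustration; string columns are forced to object dtype
toy_X = pd.DataFrame({
    'Manufacturer': pd.Series(['General Mills', 'Kelloggs', np.nan], dtype=object),
    'type': pd.Series(['Cold', 'Hot', 'Cold'], dtype=object),
    'grams of fat': [2.0, np.nan, 1.0],
    'grams of sugars': [10.0, 1.0, np.nan],
})

# Calling a selector on a DataFrame returns a list of column names
num_cols = make_column_selector(dtype_include='number')(toy_X)
cat_cols = make_column_selector(dtype_include='object')(toy_X)

print(num_cols)
print(cat_cols)
```

Selecting by dtype only works as intended when every nominal column is stored as `object` and every numeric column as a number, so it is worth checking `X.dtypes` first.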

In [43]:
#Instantiate imputers
freq_imputer = SimpleImputer(strategy='most_frequent') #for nominals
mean_imputer = SimpleImputer(strategy='mean') # for numeric columns (only float columns have missing values)

In [44]:
#Instantiate transformers
ohe_encoder = OneHotEncoder(handle_unknown='ignore')
scaler = StandardScaler()

In [45]:
#pipelines
# Nominal and numeric columns get separate pipelines: each imputes missing values, then one-hot encodes or scales
num_pipe = make_pipeline(mean_imputer,scaler)
cat_pipe = make_pipeline(freq_imputer, ohe_encoder)

In [46]:
# creating the tuples to group each pipe with their data
num_group = (num_pipe, num_selector)
cat_group = (cat_pipe, cat_selector)

In [47]:
# Combine both pipelines into a single preprocessing object with make_column_transformer
processor1 = make_column_transformer(num_group, cat_group)
#check
processor1

In [48]:
#fit on train
processor1.fit(X_train)

In [50]:
#Transform
X_train_processed = processor1.transform(X_train)
X_test_processed = processor1.transform(X_test)
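
The processor is fit on `X_train` only so that nothing learned from the test rows leaks into preprocessing. A minimal sketch with made-up numbers (the `train_vals` and `test_vals` arrays are hypothetical) shows that the scaler's statistics come from the training data and are simply reused on the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration
train_vals = np.array([[1.0], [2.0], [3.0]])
test_vals = np.array([[10.0]])

# Mean and standard deviation are learned from the training rows only
sc = StandardScaler().fit(train_vals)
print(sc.mean_, sc.scale_)

# The test row is scaled with the training statistics, so it is not centered at 0
test_scaled = sc.transform(test_vals)
print(test_scaled)
```

The same applies to the imputers and the encoder inside `processor1`: the fill values and categories are learned in `fit` and only applied in `transform`.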

In [51]:
# Count remaining missing values in the processed training array
np.isnan(X_train_processed).sum()

0

In [52]:
#Show the resulting Numpy array
X_train_processed

array([[-9.74679434e-01,  9.94481647e-01, -1.32764897e-01,
         0.00000000e+00,  1.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         1.00000000e+00,  0.00000000e+00],
       [ 0.00000000e+00,  1.22191915e+00,  2.03880702e+00,
         0.00000000e+00,  1.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         1.00000000e+00,  0.00000000e+00],
       [-9.74679434e-01, -8.25018407e-01, -1.32764897e-01,
         0.00000000e+00,  1.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         1.00000000e+00,  0.00000000e+00],
       [ 0.00000000e+00,  1.67679417e+00,  3.15749558e+00,
         1.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         1.00000000e+00,  0.00000000e+00],
       [ 0.00000000e+00, -1.42705887e-01, -1.32764897e-01,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
  

In [53]:
df_X_train_processed = pd.DataFrame(X_train_processed)
df_X_train_processed.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,-0.974679,0.994482,-0.132765,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,1.221919,2.038807,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,-0.974679,-0.825018,-0.132765,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,1.676794,3.157496,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,-0.142706,-0.132765,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
