
# **Pipelines Activity (Core)**

**Name:** Kellianne Yang

**The machine learning question for this assignment is:**

*How well can the calories be predicted based on the Manufacturer, cereal type, grams of fat, grams of sugars, and weight in ounces per one serving of the cereal?*


# Preliminary steps

In [None]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import set_config
set_config(display='diagram')

In [None]:
# load data
path = '/content/drive/MyDrive/Coding Dojo/05 Week 5: Machine Learning/Cereal with missing values.xlsx'
df = pd.read_excel(path)

In [None]:
# inspect data
# df.info() prints its own report and returns None, so call it outside print()
df.info()
print(df.head(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 16 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   name                                             77 non-null     object 
 1   Manufacturer                                     77 non-null     object 
 2   type                                             68 non-null     object 
 3   calories per serving                             70 non-null     float64
 4   grams of protein                                 77 non-null     int64  
 5   grams of fat                                     69 non-null     float64
 6   milligrams of sodium                             76 non-null     float64
 7   grams of dietary fiber                           77 non-null     float64
 8   grams of complex carbohydrates                   77 non-null     float64
 9   grams of sugars                   

Drop any rows that are missing 'calories per serving'. This column is our target variable, so a row without it cannot be used to train or evaluate the machine learning model.

In [None]:
df.dropna(subset = ['calories per serving'], inplace = True)

# Define features (X) and target (y).

## X should only include the Manufacturer, cereal type, grams of fat, grams of sugars, and weight in ounces columns.


## y should be calories per serving

In [None]:
df.columns

Index(['name', 'Manufacturer', 'type', 'calories per serving',
       'grams of protein', 'grams of fat', 'milligrams of sodium',
       'grams of dietary fiber', 'grams of complex carbohydrates',
       'grams of sugars', 'milligrams of potassium',
       'vitamins and minerals (% of FDA recommendation)', 'Display shelf',
       'Weight in ounces per one serving', 'Number of cups in one serving',
       'Rating of cereal'],
      dtype='object')

In [None]:
# define target (y) as calories per serving
target = 'calories per serving'
y = df[target]

In [None]:
# define features (X) as manufacturer, cereal type, grams of fat, grams of sugar, and weight in ounces
X = df[['Manufacturer', 'type', 'grams of fat', 'grams of sugars', 'Weight in ounces per one serving']]

In [None]:
# inspect y and X
print(y, X)

0     110.0
1     130.0
3     120.0
4     110.0
5     110.0
      ...  
72    110.0
73    100.0
74    100.0
75    100.0
76    100.0
Name: calories per serving, Length: 70, dtype: float64                    Manufacturer  type  grams of fat  grams of sugars  \
0                 General Mills  Cold           2.0             10.0   
1                 General Mills  Cold           2.0              NaN   
3                 General Mills  Cold           3.0              9.0   
4                 General Mills  Cold           2.0              7.0   
5                 General Mills  Cold           1.0             13.0   
..                          ...   ...           ...              ...   
72               Ralston Purina  Cold           NaN              2.0   
73               Ralston Purina  Cold           1.0              3.0   
74  American Home Food Products   Hot           1.0              NaN   
75                      Nabisco   Hot           0.0              0.0   
76                  Q

# Train test split the data to prepare for machine learning.


In [None]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
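
Because no `test_size` is passed above, `train_test_split` falls back to its default 25% hold-out. The sketch below uses hypothetical stand-in arrays (`X_demo`, `y_demo`) with the same number of rows as this dataset after dropping missing targets (70) to show how that default divides the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical stand-in data with 70 rows, matching the length of y above
X_demo = np.arange(140).reshape(70, 2)
y_demo = np.arange(70)

# no test_size given, so the default 25% hold-out applies
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)  # (52, 2) (18, 2)
```

The 52/18 division agrees with the shapes reported for the processed arrays later in this notebook.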

# Identify each feature as numerical, ordinal, or nominal. (Please provide this answer in a text cell in your Colab notebook).


In [None]:
# inspect y
y.dtype

dtype('float64')

In [None]:
y.value_counts(dropna = False)

110.0    27
100.0    16
120.0     9
90.0      5
140.0     3
50.0      3
70.0      2
150.0     2
130.0     1
160.0     1
80.0      1
Name: calories per serving, dtype: int64

Target variable y is numerical.

In [None]:
# inspect value counts for each column in X_train
for column in X_train:
  print(f"{column}: {X_train[column].dtype}\n{X_train[column].value_counts(dropna = False)} \n\n")

Manufacturer: object
Kelloggs          16
General Mills     13
Quaker Oats        6
Post               6
Ralston Purina     6
Nabisco            5
Name: Manufacturer, dtype: int64 


type: object
Cold    45
NaN      5
Hot      2
Name: type, dtype: int64 


grams of fat: float64
1.0    18
0.0    17
2.0     7
NaN     5
3.0     4
5.0     1
Name: grams of fat, dtype: int64 


grams of sugars: float64
 3.0     8
 0.0     6
 NaN     6
 11.0    5
 12.0    4
 6.0     4
 7.0     3
 8.0     3
 9.0     2
 15.0    2
 10.0    2
 13.0    2
 2.0     2
-1.0     1
 14.0    1
 4.0     1
Name: grams of sugars, dtype: int64 


Weight in ounces per one serving: float64
1.00    43
1.33     4
1.50     2
0.50     1
0.83     1
1.25     1
Name: Weight in ounces per one serving, dtype: int64 




The 'Manufacturer' and 'type' features are nominal, and 'Weight in ounces per one serving', 'grams of sugars', and 'grams of fat' are numerical. None of the selected features is ordinal.

# Use pipelines and column transformers to complete the following tasks:


## Impute any missing values. Use the ‘mean’ strategy for numeric columns and the ‘most_frequent’ strategy for categorical columns.


In [None]:
# make column selectors
cat_selector = make_column_selector(dtype_include = 'object')
num_selector = make_column_selector(dtype_include = 'number')
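
A column selector is a callable: given a DataFrame, it returns the list of column names whose dtype matches. As a minimal sketch, the hypothetical frame `toy` below mirrors the dtypes reported for X (object and float64); the column values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# hypothetical toy frame; dtype=object matches what df.info() reports for the text columns
toy = pd.DataFrame({
    'Manufacturer': pd.Series(['Kelloggs', 'Post', 'Nabisco'], dtype=object),
    'type': pd.Series(['Cold', np.nan, 'Hot'], dtype=object),
    'grams of fat': [1.0, np.nan, 0.0],
})

# calling a selector on a DataFrame returns the matching column names
cat_cols = make_column_selector(dtype_include='object')(toy)
num_cols = make_column_selector(dtype_include='number')(toy)

print(cat_cols)  # ['Manufacturer', 'type']
print(num_cols)  # ['grams of fat']
```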

In [None]:
# make mean imputer for numeric columns
mean_imputer = SimpleImputer(strategy = 'mean')

In [None]:
# make most_frequent imputer for categorical columns
freq_imputer = SimpleImputer(strategy = 'most_frequent')
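
To illustrate what the two strategies do, here is a small sketch on hypothetical single-column arrays (`fat_demo` and `type_demo` are not part of the dataset). The mean strategy fills a gap with the column average, and the most_frequent strategy fills it with the mode.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# hypothetical numeric column with one missing value
fat_demo = np.array([[1.0], [np.nan], [3.0]])
fat_filled = SimpleImputer(strategy='mean').fit_transform(fat_demo).ravel()
print(fat_filled)  # [1. 2. 3.]

# hypothetical categorical column with one missing value
type_demo = np.array([['Cold'], ['Cold'], [np.nan], ['Hot']], dtype=object)
type_filled = SimpleImputer(strategy='most_frequent').fit_transform(type_demo).ravel()
print(type_filled)  # ['Cold' 'Cold' 'Cold' 'Hot']
```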

## One-hot encode the nominal features.


### Be sure to include the arguments: sparse=False AND handle_unknown='ignore' when creating your OneHotEncoder.


In [None]:
# make one hot encoder for categorical columns
# (in scikit-learn 1.2 and later, sparse_output replaces the sparse argument named above)
ohe = OneHotEncoder(handle_unknown = 'ignore', sparse_output = False)

## Scale the numeric columns.


In [None]:
# make scaler for numeric columns
scaler = StandardScaler()
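
As a quick sketch of what scaling does, the hypothetical column `weights_demo` below is standardized so that it has mean 0 and standard deviation 1. The values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical serving weights in ounces
weights_demo = np.array([[0.5], [1.0], [1.5]])

# subtract the column mean and divide by the column standard deviation
scaled_demo = StandardScaler().fit_transform(weights_demo)

print(scaled_demo.ravel())
print(scaled_demo.mean(), scaled_demo.std())
```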

# All preprocessing steps should be contained within a single preprocessing object.


In [None]:
# make numeric columns pipeline
numeric_pipe = make_pipeline(mean_imputer, scaler)

In [None]:
# make categorical columns pipeline
categorical_pipe = make_pipeline(freq_imputer, ohe)

## Include the arguments: remainder='drop' OR remainder='passthrough' when creating your ColumnTransformer


In [None]:
# make numeric and categorical tuples for the column transformer
numeric_tuple = (numeric_pipe, num_selector)
categorical_tuple = (categorical_pipe, cat_selector)

In [None]:
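# make column transformer with numeric and categorical pipelines
# remainder = 'drop' (the default) discards any column not matched by a selector
preprocessor = make_column_transformer(numeric_tuple,
                                       categorical_tuple,
                                       remainder = 'drop')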

In [None]:
# inspect column transformer
preprocessor

# Use your preprocessing object to transform your data appropriately, avoiding data leakage, to make it ready for modeling. Show the resulting NumPy array output.


## The .fit() and .transform() methods should ONLY be used with the resulting preprocessing object, NOT with any individual transformer.


In [None]:
# fit the column transformer on the TRAINING SET ONLY
preprocessor.fit(X_train)

In [None]:
# transform training AND testing sets with column transformer
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)
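
Fitting on the training set only means the imputation mean and the scaling statistics come from training rows alone, and those same numbers are then reused on the test rows. A minimal sketch with hypothetical arrays (`train_demo`, `test_demo`) and a numeric-only pipeline:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical training and testing columns, each with a missing value
train_demo = np.array([[1.0], [np.nan], [3.0]])
test_demo = np.array([[np.nan], [10.0]])

pipe_demo = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
pipe_demo.fit(train_demo)  # learns the training mean (2.0) and training spread only

# the missing test value is filled with the TRAINING mean, then scaled to 0
test_demo_processed = pipe_demo.transform(test_demo)
print(test_demo_processed.ravel())
```

Had the pipeline been fitted on the test rows as well, the value 10.0 would have shifted the imputed mean and the scale, which is the leakage this step avoids.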

In [None]:
# check for missing values
print(np.isnan(X_train_processed).sum(), 'missing values in training data')
print(np.isnan(X_test_processed).sum(), 'missing values in testing data')

0 missing values in training data
0 missing values in testing data


In [None]:
# check that all data is numeric
print('All data in X_train_processed are', X_train_processed.dtype)
print('All data in X_test_processed are', X_test_processed.dtype)

All data in X_train_processed are float64
All data in X_test_processed are float64


In [None]:
# check shape of data to infer that categorical columns were one hot encoded
# should have 11 columns (3 numeric, 'Manufacturer' has 6 unique values, 'type' has 2 unique values)
print('Shape of X_train_processed data is', X_train_processed.shape)
print('Shape of X_test_processed data is', X_test_processed.shape)

Shape of X_train_processed data is (52, 11)
Shape of X_test_processed data is (18, 11)


In [None]:
# check arrays to see that numeric data was scaled
print(X_train_processed)
print(X_test_processed)

[[-1.00539366e+00 -1.53835815e+00 -3.52083059e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  1.00000000e+00
   0.00000000e+00  1.00000000e+00  0.00000000e+00]
 [-4.10364759e-02  1.19210391e+00 -2.39769825e-01  0.00000000e+00
   0.00000000e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
   0.00000000e+00  1.00000000e+00  0.00000000e+00]
 [-4.10364759e-02 -8.55742637e-01 -2.39769825e-01  1.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  1.00000000e+00  0.00000000e+00]
 [ 9.23320709e-01  5.44113816e-02  1.92573028e+00  0.00000000e+00
   1.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  1.00000000e+00  0.00000000e+00]
 [-1.00539366e+00 -1.53835815e+00 -1.35533049e+00  0.00000000e+00
   0.00000000e+00  1.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  1.00000000e+00  0.00000000e+00]
 [-1.00539366e+00  5.44113816e-02 -2.39769825e-01  0.00000000e+00
   1.00000000e+00  