
# Pipelines Activity (Core)
Name: Jude Maico Jr.

For this task, you will use the cereals dataset. This dataset shows popular cereals by brand and manufacturer along with nutrition facts. The machine learning question is: 

How well can the calories be predicted based on the Manufacturer, cereal type, grams of fat, grams of sugars, and weight in ounces per one serving of the cereal?

At this point, you are just completing the pre-processing steps for this assignment.

You will need to:

Define features (X) and target (y).
- X should only include the Manufacturer, cereal type, grams of fat, grams of sugars, and weight in ounces columns.
- y should be calories per serving

Train test split the data to prepare for machine learning.

Identify each feature as numerical, ordinal, or nominal. (Please provide this answer in a text cell in your Colab notebook).

Use pipelines and column transformers to complete the following tasks:
- Impute any missing values. Use the ‘mean’ strategy for numeric columns and the ‘most_frequent’ strategy for categorical columns.
- One-hot encode the nominal features.
- Be sure to include the arguments: sparse=False AND handle_unknown='ignore' when creating your OneHotEncoder.
- Scale the numeric columns.

All preprocessing steps should be contained within a single preprocessing object.
- Include the arguments: remainder='drop' OR remainder='passthrough' when creating your ColumnTransformer

Use your preprocessing object to transform your data appropriately, avoiding data leakage, to make it ready for modeling. Show the resulting NumPy array output.
- The .fit() and .transform() methods should ONLY be used with the resulting preprocessing object, NOT with any individual transformer.

# Imports

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split 
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline

# Loading Data

In [None]:
path = '/content/drive/MyDrive/Datas/Copy of Cereal with missing values.xlsx'
df = pd.read_excel(path)

In [None]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 16 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   name                                             77 non-null     object 
 1   Manufacturer                                     77 non-null     object 
 2   type                                             68 non-null     object 
 3   calories per serving                             70 non-null     float64
 4   grams of protein                                 77 non-null     int64  
 5   grams of fat                                     69 non-null     float64
 6   milligrams of sodium                             76 non-null     float64
 7   grams of dietary fiber                           77 non-null     float64
 8   grams of complex carbohydrates                   77 non-null     float64
 9   grams of sugars                   

Unnamed: 0,name,Manufacturer,type,calories per serving,grams of protein,grams of fat,milligrams of sodium,grams of dietary fiber,grams of complex carbohydrates,grams of sugars,milligrams of potassium,vitamins and minerals (% of FDA recommendation),Display shelf,Weight in ounces per one serving,Number of cups in one serving,Rating of cereal
0,Apple Cinnamon Cheerios,General Mills,Cold,110.0,2,2.0,180.0,1.5,10.5,10.0,70,25.0,1,1.0,0.75,29.509541
1,Basic 4,General Mills,Cold,130.0,3,2.0,,2.0,18.0,,100,25.0,3,1.33,0.75,37.038562
2,Cheerios,General Mills,Cold,,6,2.0,290.0,2.0,17.0,1.0,105,25.0,1,1.0,1.25,50.764999
3,Cinnamon Toast Crunch,General Mills,Cold,120.0,1,3.0,210.0,0.0,13.0,9.0,45,25.0,2,1.0,0.75,19.823573
4,Clusters,General Mills,Cold,110.0,3,2.0,140.0,2.0,13.0,7.0,105,25.0,3,1.0,0.5,40.400208


In [None]:
print(f'We have a total of {df.isna().sum().sum()} missing values.')

We have a total of 35 missing values.


# Assignment

## Checking data

In [None]:
# Check for null values in the target
cal_column_null = df['calories per serving'].isna().sum()
print(f'The target has {cal_column_null} null values that need to be dropped.')

The target has 7 null values that need to be dropped.


In [None]:
df.dropna(subset=['calories per serving'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70 entries, 0 to 76
Data columns (total 16 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   name                                             70 non-null     object 
 1   Manufacturer                                     70 non-null     object 
 2   type                                             62 non-null     object 
 3   calories per serving                             70 non-null     float64
 4   grams of protein                                 70 non-null     int64  
 5   grams of fat                                     62 non-null     float64
 6   milligrams of sodium                             69 non-null     float64
 7   grams of dietary fiber                           70 non-null     float64
 8   grams of complex carbohydrates                   70 non-null     float64
 9   grams of sugars                   

In [None]:
cal_column_null_new = df['calories per serving'].isna().sum()
print(f'After dropping those rows, the target has {cal_column_null_new} null values,')
print(f'and the dataset has {df.isna().sum().sum()} missing values in total.')

After dropping those rows, the target has 0 null values,
and the dataset has 26 missing values in total.


In [None]:
#checking for duplicates
duplicates = df.duplicated().sum()
print(f'We have a total number of {duplicates} duplicates in our data.')

We have a total number of 0 duplicates in our data.


In [None]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70 entries, 0 to 76
Data columns (total 16 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   name                                             70 non-null     object 
 1   Manufacturer                                     70 non-null     object 
 2   type                                             62 non-null     object 
 3   calories per serving                             70 non-null     float64
 4   grams of protein                                 70 non-null     int64  
 5   grams of fat                                     62 non-null     float64
 6   milligrams of sodium                             69 non-null     float64
 7   grams of dietary fiber                           70 non-null     float64
 8   grams of complex carbohydrates                   70 non-null     float64
 9   grams of sugars                   

Unnamed: 0,calories per serving,grams of protein,grams of fat,milligrams of sodium,grams of dietary fiber,grams of complex carbohydrates,grams of sugars,milligrams of potassium,vitamins and minerals (% of FDA recommendation),Display shelf,Weight in ounces per one serving,Number of cups in one serving,Rating of cereal
count,70.0,70.0,62.0,69.0,70.0,70.0,62.0,70.0,69.0,70.0,70.0,70.0,70.0
mean,106.857143,2.5,1.0,154.637681,2.102857,14.45,6.951613,94.614286,27.536232,2.2,1.024286,0.816143,42.418849
std,19.966846,1.059806,1.024295,82.177708,2.436139,4.339864,4.635669,71.29207,21.922712,0.827078,0.151117,0.231111,14.453681
min,50.0,1.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,1.0,0.5,0.25,18.042851
25%,100.0,2.0,0.0,125.0,0.0,12.0,3.0,40.0,25.0,1.25,1.0,0.67,32.44921
50%,110.0,2.5,1.0,170.0,1.75,14.0,7.0,90.0,25.0,2.0,1.0,0.75,39.904682
75%,110.0,3.0,1.0,210.0,3.0,17.0,11.0,118.75,25.0,3.0,1.0,1.0,51.401243
max,160.0,6.0,5.0,290.0,14.0,23.0,15.0,330.0,100.0,3.0,1.5,1.5,93.704912


In [None]:
df.describe(include='object')

Unnamed: 0,name,Manufacturer,type
count,70,70,62
unique,70,7,2
top,Apple Cinnamon Cheerios,Kelloggs,Cold
freq,1,21,59


## Define features (X) and target (y)

In [None]:
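# Target: calories per serving
y = df['calories per serving']
# Features selected in this notebook.
# Note: 'grams of fat' is listed in the task statement but is not selected here,
# so the outputs that follow show four feature columns.
X = df[['Manufacturer', 'type', 'grams of sugars', 'Weight in ounces per one serving']]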
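
The task statement names five feature columns. As a minimal sketch of selecting all five, the example below uses a small made-up DataFrame (`toy_df`, `feature_cols`, and `target_col` are illustrative names and the values are invented, not taken from the cereal file):

```python
import pandas as pd

# Toy stand-in for the cereal data; the values are invented for illustration.
toy_df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'Manufacturer': ['Post', 'Kelloggs', 'Nabisco'],
    'type': ['Cold', 'Hot', 'Cold'],
    'calories per serving': [110.0, 100.0, 90.0],
    'grams of fat': [1.0, 0.0, 2.0],
    'grams of sugars': [6.0, 3.0, 0.0],
    'Weight in ounces per one serving': [1.0, 1.0, 0.83],
})

# The five feature columns named in the task statement.
feature_cols = ['Manufacturer', 'type', 'grams of fat', 'grams of sugars',
                'Weight in ounces per one serving']
target_col = 'calories per serving'

X_toy = toy_df[feature_cols]
y_toy = toy_df[target_col]
print(X_toy.shape, y_toy.shape)
```

Keeping the column names in a list makes it easy to compare the selection against the task statement.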

## Train test split the data to prepare for machine learning.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
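
When no `test_size` is given, `train_test_split` holds out 25% of the rows, rounding the test count up. The sketch below uses a placeholder frame with 70 rows (the row count left after dropping the null targets) to show the resulting sizes; `X_demo` and `y_demo` are illustrative names only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with 70 rows; the contents do not matter for the split sizes.
X_demo = pd.DataFrame({'feature': np.arange(70)})
y_demo = pd.Series(np.arange(70), name='target')

# Default test_size is 0.25, so 18 rows go to the test set and 52 to training.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(len(X_tr), len(X_te))  # 52 18
```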

## Identify each feature as numerical, ordinal, or nominal. (Please provide this answer in a text cell in your Colab notebook).

Numerical:
  - grams of sugars
  - Weight in ounces per one serving

Nominal:
  - Manufacturer
  - type

Ordinal: 
  - None. 'type' has only two categories (Cold and Hot) and neither ranks above the other, so it is treated as nominal.

In [None]:
X_train.nunique()

Manufacturer                         6
type                                 2
grams of sugars                     15
Weight in ounces per one serving     6
dtype: int64

In [None]:
# Bracket access avoids confusion between the 'type' column and Python's built-in name
X_train['type'].value_counts()

Cold    45
Hot      2
Name: type, dtype: int64

In [None]:
X_train.info()
X_train.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52 entries, 63 to 56
Data columns (total 4 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Manufacturer                      52 non-null     object 
 1   type                              47 non-null     object 
 2   grams of sugars                   46 non-null     float64
 3   Weight in ounces per one serving  52 non-null     float64
dtypes: float64(2), object(2)
memory usage: 2.0+ KB


Unnamed: 0,Manufacturer,type,grams of sugars,Weight in ounces per one serving
63,Quaker Oats,Cold,0.0,0.5
52,Post,,12.0,1.0
18,General Mills,Cold,3.0,1.0
37,Kelloggs,Cold,7.0,1.33
46,Nabisco,,0.0,0.83


## Use pipelines and column transformers to complete the following tasks:

- Impute any missing values. Use the ‘mean’ strategy for numeric columns and the ‘most_frequent’ strategy for categorical columns.
- One-hot encode the nominal features.
- Be sure to include the arguments: sparse=False AND handle_unknown='ignore' when creating your OneHotEncoder.
- Scale the numeric columns.

In [None]:
# Column selectors: pick columns by dtype
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')
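
A column selector is a callable: given a DataFrame, it returns the names of the columns whose dtype matches. The sketch below shows this on a small made-up frame (`toy` and its values are illustrative; the text columns are explicitly created with object dtype so the example matches the `dtype_include='object'` selector).

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame with two text columns (object dtype) and two float columns.
toy = pd.DataFrame({
    'Manufacturer': pd.Series(['Post', 'Kelloggs'], dtype=object),
    'type': pd.Series(['Cold', 'Hot'], dtype=object),
    'grams of sugars': [6.0, 3.0],
    'Weight in ounces per one serving': [1.0, 1.33],
})

# Calling a selector on a DataFrame returns a list of matching column names.
cat_cols = make_column_selector(dtype_include='object')(toy)
num_cols = make_column_selector(dtype_include='number')(toy)
print(cat_cols)
print(num_cols)
```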

In [None]:
# Imputers: mean for numeric columns, most frequent value for categorical columns
mean_imputer = SimpleImputer(strategy='mean')
freq_imputer = SimpleImputer(strategy='most_frequent')
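
To illustrate what the two strategies do, here is a small sketch on invented values (`num` and `cat` are hypothetical frames, not slices of the cereal data). The imputers are fitted directly here only to show their behavior in isolation; in the notebook they are fitted through the preprocessing object.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# One numeric and one categorical column, each with a single missing value.
num = pd.DataFrame({'grams of sugars': [0.0, 6.0, np.nan, 12.0]})
cat = pd.DataFrame({'type': pd.Series(['Cold', 'Cold', np.nan, 'Hot'], dtype=object)})

# 'mean' fills the gap with the column mean: (0 + 6 + 12) / 3 = 6.
num_filled = SimpleImputer(strategy='mean').fit_transform(num)
# 'most_frequent' fills the gap with the most common category: 'Cold'.
cat_filled = SimpleImputer(strategy='most_frequent').fit_transform(cat)

print(num_filled.ravel())
print(cat_filled.ravel())
```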

In [None]:
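# OneHotEncoder: dense output, and categories unseen during fit are encoded as all zeros
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer releases expect sparse_output=False
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')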
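
The effect of `handle_unknown='ignore'` can be seen in a short sketch with invented category values. The encoder below keeps its default sparse output and converts it with `.toarray()`, which sidesteps the version-dependent name of the dense-output argument; that choice is made for this illustration and is not what the notebook cell does.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categories seen during fit, and new data containing an unseen category.
train_cats = np.array([['Kelloggs'], ['Post'], ['Nabisco']], dtype=object)
new_cats = np.array([['Post'], ['Some Other Brand']], dtype=object)

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_cats)

# Learned categories are sorted: Kelloggs, Nabisco, Post.
encoded = encoder.transform(new_cats).toarray()
print(encoded)
# 'Post' becomes [0, 0, 1]; the unseen brand becomes [0, 0, 0] instead of raising an error.
```

This matters here because the training split contains 6 of the 7 manufacturers, so a manufacturer that appears only in the test split would otherwise cause an error at transform time.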

In [None]:
#Scaler
scaler = StandardScaler()

## All preprocessing steps should be contained within a single preprocessing object.
- Include the arguments: remainder='drop' OR remainder='passthrough' when creating your ColumnTransformer

In [None]:
# Numeric pipeline: impute with the mean, then scale
numeric_pipe = make_pipeline(mean_imputer, scaler)

In [None]:
# Categorical pipeline (kept separate from the numeric one): impute with the most frequent value, then one-hot encode
categorical_pipe = make_pipeline(freq_imputer, ohe)

In [None]:
#tuples
num_tuple = (numeric_pipe, num_selector)
cat_tuple = (categorical_pipe, cat_selector)

In [None]:
# ColumnTransformer: the single preprocessing object that applies each pipeline to its columns
preprocessor = make_column_transformer(cat_tuple, num_tuple,
                                       remainder='passthrough',
                                       verbose_feature_names_out=False)
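
The `remainder` argument only affects columns that no transformer claims. In this notebook every column of X is either object or numeric, so both selectors together cover all columns and the setting does not change the result. The sketch below, on a made-up two-column frame (`toy`, `drop_ct`, and `pass_ct` are illustrative names), shows the difference when a column is left over.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Only 'scaled' is assigned to a transformer; 'extra' is the remainder.
toy = pd.DataFrame({'scaled': [1.0, 2.0, 3.0], 'extra': [10.0, 20.0, 30.0]})

drop_ct = make_column_transformer((StandardScaler(), ['scaled']), remainder='drop')
pass_ct = make_column_transformer((StandardScaler(), ['scaled']), remainder='passthrough')

dropped = drop_ct.fit_transform(toy)  # 'extra' is discarded
passed = pass_ct.fit_transform(toy)   # 'extra' is appended unchanged

print(dropped.shape, passed.shape)  # (3, 1) (3, 2)
```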

## Use your preprocessing object to transform your data appropriately, avoiding data leakage, to make it ready for modeling. Show the resulting NumPy array output.
- The .fit() and .transform() methods should ONLY be used with the resulting preprocessing object, NOT with any individual transformer.

In [None]:
#fit
preprocessor.fit(X_train)



In [None]:
# Transform both splits with the preprocessor fitted on the training data only
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)
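
Fitting on the training split only means the imputation value and the scaling statistics come from training rows alone, and the test split is transformed with those same numbers. The sketch below illustrates this with a small numeric pipeline on invented values (`train`, `test`, and `pipe` are hypothetical names, not the notebook's objects).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({'grams of sugars': [0.0, 4.0, 8.0, np.nan]})
test = pd.DataFrame({'grams of sugars': [100.0, np.nan]})

pipe = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
pipe.fit(train)  # statistics are learned from the training rows only

# The fill value is the training mean (4.0); the test value 100.0 plays no part.
imputer_mean = pipe.named_steps['simpleimputer'].statistics_[0]
scaler_mean = pipe.named_steps['standardscaler'].mean_[0]

# The missing test value is filled with 4.0, which then scales to 0.
test_out = pipe.transform(test)
print(imputer_mean, scaler_mean)
print(test_out.ravel())
```

Had the pipeline been fitted on the combined data, the large test value would have shifted both the fill value and the scaling, which is the leakage the task asks to avoid.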

In [None]:
# Wrap the processed test array in a DataFrame so the columns are labeled
X_test_df = pd.DataFrame(X_test_processed,
                         columns=preprocessor.get_feature_names_out())

In [None]:
# Display the processed test data
display(X_test_df.head())
print(f'\nShape of the data is: {X_test_df.shape}')
print(f'\nThere are {X_test_df.isna().sum().sum()} missing values')
print(f'\nThe data types are \n{X_test_df.dtypes}')

Unnamed: 0,Manufacturer_General Mills,Manufacturer_Kelloggs,Manufacturer_Nabisco,Manufacturer_Post,Manufacturer_Quaker Oats,Manufacturer_Ralston Purina,type_Cold,type_Hot,grams of sugars,Weight in ounces per one serving
0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.647181,-0.23977
1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.737027,-0.23977
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,-0.400666,-0.23977
3,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.419642,-0.23977
4,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.192104,-0.23977



Shape of the data is: (18, 10)

There are 0 missing values

The data types are 
Manufacturer_General Mills          float64
Manufacturer_Kelloggs               float64
Manufacturer_Nabisco                float64
Manufacturer_Post                   float64
Manufacturer_Quaker Oats            float64
Manufacturer_Ralston Purina         float64
type_Cold                           float64
type_Hot                            float64
grams of sugars                     float64
Weight in ounces per one serving    float64
dtype: object
