
# **Dataset: Popular Cereals by Brand and Manufacturer, with Nutritional Information**

**Objective:**

1. Determine how well calories can be predicted from the manufacturer, the cereal type, grams of fat, grams of sugars, and the weight in ounces of one serving.
2. Complete the preprocessing steps.


**1. Import the Libraries**

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import set_config
set_config(display='diagram')

**2. Load the Data**

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
path = '/content/drive/MyDrive/CodingDojo/Cargas/Cereal with missing values.xlsx'
df = pd.read_excel(path)
df.head()

Unnamed: 0,name,Manufacturer,type,calories per serving,grams of protein,grams of fat,milligrams of sodium,grams of dietary fiber,grams of complex carbohydrates,grams of sugars,milligrams of potassium,vitamins and minerals (% of FDA recommendation),Display shelf,Weight in ounces per one serving,Number of cups in one serving,Rating of cereal
0,Apple Cinnamon Cheerios,General Mills,Cold,110.0,2,2.0,180.0,1.5,10.5,10.0,70,25.0,1,1.0,0.75,29.509541
1,Basic 4,General Mills,Cold,130.0,3,2.0,,2.0,18.0,,100,25.0,3,1.33,0.75,37.038562
2,Cheerios,General Mills,Cold,,6,2.0,290.0,2.0,17.0,1.0,105,25.0,1,1.0,1.25,50.764999
3,Cinnamon Toast Crunch,General Mills,Cold,120.0,1,3.0,210.0,0.0,13.0,9.0,45,25.0,2,1.0,0.75,19.823573
4,Clusters,General Mills,Cold,110.0,3,2.0,140.0,2.0,13.0,7.0,105,25.0,3,1.0,0.5,40.400208


**3. Explore the Data**

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 16 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   name                                             77 non-null     object 
 1   Manufacturer                                     77 non-null     object 
 2   type                                             68 non-null     object 
 3   calories per serving                             70 non-null     float64
 4   grams of protein                                 77 non-null     int64  
 5   grams of fat                                     69 non-null     float64
 6   milligrams of sodium                             76 non-null     float64
 7   grams of dietary fiber                           77 non-null     float64
 8   grams of complex carbohydrates                   77 non-null     float64
 9   grams of sugars                   

**3a. Identify each feature as numeric, ordinal, or nominal.**

In [20]:
df_numerical = df.select_dtypes(include=['int64', 'float64'])
column_names = df_numerical.columns.tolist()
# Print the numeric column names in their original order
for column in column_names:
    print(column)

calories per serving
grams of protein
grams of fat
milligrams of sodium
grams of dietary fiber
grams of complex carbohydrates
grams of sugars
milligrams of potassium
vitamins and minerals (% of FDA recommendation)
Display shelf
Weight in ounces per one serving
Number of cups in one serving
Rating of cereal


In [23]:
df_categorical = df.select_dtypes(include='object')
df_categorical.columns

Index(['name', 'Manufacturer', 'type'], dtype='object')

In [34]:
# Nominal category
df['name'].value_counts()

Apple Cinnamon Cheerios      1
Strawberry Fruit Wheats      1
Great Grains Pecan           1
Grape-Nuts                   1
Grape Nuts Flakes            1
                            ..
Corn Flakes                  1
Apple Jacks                  1
All-Bran with Extra Fiber    1
All-Bran                     1
Quaker Oatmeal               1
Name: name, Length: 77, dtype: int64

In [33]:
# Nominal category
df['Manufacturer'].value_counts()

Kelloggs                       23
General Mills                  22
Post                            9
Quaker Oats                     8
Ralston Purina                  8
Nabisco                         6
American Home Food Products     1
Name: Manufacturer, dtype: int64

In [32]:
# Nominal (binary): Cold and Hot have no natural order
df['type'].value_counts()

Cold    65
Hot      3
Name: type, dtype: int64

> **Numeric features**

* calories per serving
* grams of protein
* grams of fat
* milligrams of sodium
* grams of dietary fiber
* grams of complex carbohydrates
* grams of sugars
* milligrams of potassium
* vitamins and minerals (% of FDA recommendation)
* Display shelf
* Weight in ounces per one serving
* Number of cups in one serving
* Rating of cereal

> **Nominal features**

* name
* Manufacturer
* type

> **Ordinal features**

* None. `type` takes two unordered values (Cold, Hot), so it belongs with the nominal features.

**4. Validation Split**

In [35]:
y = df['calories per serving']
X = df.drop('calories per serving', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
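
The split above keeps every column, including `name`, and `df.info()` shows that `calories per serving` has only 70 non-null values out of 77. As a minimal sketch of an alternative, the cell below restricts the features to the five named in the objective and drops the rows whose target is missing before splitting. The `toy` DataFrame and its values are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cereal data (values invented for illustration).
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'Manufacturer': ['Kelloggs', 'Post', 'Kelloggs', 'Nabisco',
                     'Post', 'Kelloggs', 'Nabisco', 'Post'],
    'type': ['Cold', 'Cold', 'Cold', 'Hot', 'Cold', 'Cold', 'Cold', 'Hot'],
    'calories per serving': [110.0, np.nan, 120.0, 100.0,
                             90.0, np.nan, 130.0, 100.0],
    'grams of fat': [2.0, 1.0, np.nan, 0.0, 1.0, 3.0, 2.0, 0.0],
    'grams of sugars': [10.0, 6.0, 9.0, np.nan, 3.0, 12.0, 8.0, 1.0],
    'Weight in ounces per one serving': [1.0, 1.0, 1.33, 1.0,
                                         1.0, 1.0, 1.25, 1.0],
})

target = 'calories per serving'
features = ['Manufacturer', 'type', 'grams of fat', 'grams of sugars',
            'Weight in ounces per one serving']

# Rows without a target value cannot be used to train or score a regressor.
toy_labeled = toy.dropna(subset=[target])

X_toy = toy_labeled[features]
y_toy = toy_labeled[target]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

print(X_tr.shape, X_te.shape)
print(y_tr.isna().sum(), 'missing targets in the training split')
```

Missing values in the features can stay, because the imputers in the pipeline handle them; the target is different, since no transformer in the preprocessing pipeline ever sees `y`.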

**5. Use pipelines and column transformers to complete the following**

* Impute any missing values.
* Use the "mean" strategy for the numeric columns and the "most_frequent" strategy for the categorical columns.
* One-hot encode the nominal features.
* Scale the numeric columns.

**5a. One-hot encode the nominal features**

In [36]:
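# Map the two 'type' categories to 0/1.
# Note: X_train and X_test were already split from df above, so this change
# to df does not reach them; the pipeline below still one-hot encodes 'type'.
replacement_dictionary = {'Cold': 0, 'Hot': 1}
df['type'] = df['type'].replace(replacement_dictionary)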
df['type'].value_counts()

0.0    65
1.0     3
Name: type, dtype: int64

**5b. Instantiate Column Selectors**

In [37]:
# Selectors
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')
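
To make the role of the selectors concrete, here is a small sketch on a made-up DataFrame (the `toy` frame and its values are illustrative assumptions). A selector is a callable: given a DataFrame, it returns the list of column names whose dtype matches.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical three-column frame; the text column is stored as object dtype,
# which is what dtype_include='object' looks for.
toy = pd.DataFrame({
    'Manufacturer': pd.Series(['Kelloggs', 'Post'], dtype=object),
    'grams of fat': [2.0, 1.0],
    'Display shelf': [1, 3],
})

cat_sel = make_column_selector(dtype_include='object')
num_sel = make_column_selector(dtype_include='number')

print(cat_sel(toy))  # names of the object-dtype columns
print(num_sel(toy))  # names of the integer and float columns
```

Because the selection is by dtype, a category stored as a number (for example a 0/1 flag) would be routed to the numeric pipeline rather than to the one-hot encoder.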

**5c. Instantiate Transformers**

> **Note:** We will use three different transformers:

1. SimpleImputer
2. StandardScaler
3. OneHotEncoder

* We will also use two imputation strategies for missing data: "most_frequent" and "mean".
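
The difference between the two imputation strategies can be seen on a tiny example. This is only a sketch on invented values: "mean" fills a numeric gap with the column average, while "most_frequent" fills a categorical gap with the mode.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up columns, each with one missing entry in the second row.
num = pd.DataFrame({'grams of fat': [2.0, np.nan, 1.0, 3.0]})
cat = pd.DataFrame({'type': pd.Series(['Cold', np.nan, 'Cold', 'Hot'],
                                      dtype=object)})

mean_filled = SimpleImputer(strategy='mean').fit_transform(num)
freq_filled = SimpleImputer(strategy='most_frequent').fit_transform(cat)

print(mean_filled.ravel())  # the gap becomes the mean of 2, 1 and 3
print(freq_filled.ravel())  # the gap becomes the most common label
```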

In [38]:
#Imputers
freq_imputer = SimpleImputer(strategy='most_frequent')
mean_imputer = SimpleImputer(strategy='mean')
#Scaler
scaler = StandardScaler()
# One-hot encoder (scikit-learn >= 1.2 names this argument sparse_output;
# older releases call it sparse)
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)


**5d. Instantiate Pipelines**

> **Note:** We will use two different pipelines: one for numeric data and one for categorical data.

In [39]:
#Numeric Pipeline
numeric_pipe = make_pipeline(mean_imputer, scaler)
numeric_pipe


In [40]:
#Categorical pipeline
categorical_pipe = make_pipeline(freq_imputer, ohe)
categorical_pipe

**5e. Instantiate the ColumnTransformer**

In [41]:
# Create the (pipeline, selector) tuples for the ColumnTransformer
number_tuple = (numeric_pipe, num_selector)
category_tuple = (categorical_pipe, cat_selector)
# Create the ColumnTransformer
preprocessor = make_column_transformer(number_tuple, category_tuple)
preprocessor
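
As a self-contained sketch of how the pieces above fit together, the cell below builds the same kind of preprocessor on a made-up DataFrame. The `toy` data is an assumption for illustration, and `sparse_threshold=0` is used here so the result is a dense array regardless of the encoder's sparse setting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with one gap per column.
toy = pd.DataFrame({
    'Manufacturer': pd.Series(['Kelloggs', 'Post', np.nan, 'Kelloggs'],
                              dtype=object),
    'grams of fat': [2.0, np.nan, 1.0, 3.0],
    'grams of sugars': [10.0, 6.0, 9.0, np.nan],
})

# Numeric branch: fill gaps with the mean, then standardize.
num_pipe = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
# Categorical branch: fill gaps with the mode, then one-hot encode.
cat_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'),
                         OneHotEncoder(handle_unknown='ignore'))

pre = make_column_transformer(
    (num_pipe, make_column_selector(dtype_include='number')),
    (cat_pipe, make_column_selector(dtype_include='object')),
    sparse_threshold=0,  # always return a dense array
)

out = pre.fit_transform(toy)
print(out.shape)  # two scaled numeric columns plus one column per manufacturer
```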

**5f. Fit the Transformer**

> **Note:** We fit the transformer ("preprocessor") on the training data.

In [42]:
#fit on train
preprocessor.fit(X_train)



**5g. Use the fitted preprocessor to transform the data**

In [43]:
# Apply the preprocessor fitted on the training data to both X_train and X_test
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)
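
Fitting on the training data only means that the test data is transformed with statistics and categories learned from the training rows. The sketch below illustrates the two visible consequences on invented data: a category that never appeared in training is encoded as an all-zero row (because of `handle_unknown='ignore'`), and test values are scaled with the training mean and standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical split in which one manufacturer appears only in the test rows.
train = pd.DataFrame({'Manufacturer': ['Kelloggs', 'Post', 'Kelloggs']})
test = pd.DataFrame({'Manufacturer': ['Post', 'American Home Food Products']})

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)                          # learns only Kelloggs and Post
encoded_test = enc.transform(test).toarray()
print(encoded_test)                     # the unseen manufacturer gets all zeros

# The scaler keeps the training mean and standard deviation.
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))
scaled_test = scaler.transform(np.array([[2.0], [4.0]]))
print(scaled_test.ravel())              # 2.0 maps to 0 because the training mean is 2
```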

**6. Inspect the Result**

> **Note:** Sometimes the transformed data can easily be converted back into a pandas DataFrame, but recovering the column names is not always straightforward. The OneHotEncoder created extra columns, which makes it harder to recover the correct names for every column.

* We check that the missing values have been filled in, that the categorical data has been one-hot encoded, and that the numeric data has been scaled.

In [44]:
print(np.isnan(X_train_processed).sum().sum(), 'Missing values in training data')
print(np.isnan(X_test_processed).sum().sum(), 'Missing values in testing data')
print('\n')
print('All data in X_train_processed are', X_train_processed.dtype)
print('All data in X_test_processed are', X_test_processed.dtype)
print('\n')
print('shape of X_train_processed data  is', X_train_processed.shape)
print('shape of original data is', df.shape)
print('\n')
X_train_processed

0 Missing values in training data
0 Missing values in testing data


All data in X_train_processed are float64
All data in X_test_processed are float64


shape of X_train_processed data  is (57, 77)
shape of original data is (77, 16)




array([[-1.30301442, -0.97467943,  0.56162348, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.40438378,  0.        ,  0.68120871, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.40438378, -0.97467943,  1.99664622, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [ 1.25808288,  1.94935887, -0.03630266, ...,  1.        ,
         1.        ,  0.        ],
       [ 0.40438378,  0.97467943, -0.15588789, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.40438378,  0.        ,  0.08328257, ...,  0.        ,
         1.        ,  0.        ]])