# Pipeline Project

You will be using the provided data to create a machine learning model pipeline.

You must handle the data appropriately in your pipeline to predict whether an
item is recommended by a customer based on their review.
Note the data includes numerical, categorical, and text data.

You should ensure you properly train and evaluate your model.

## The Data

The dataset has been anonymized and cleaned of missing values.

There are 8 features to use to predict whether a customer recommends or does
not recommend a product.
The `Recommended IND` column gives whether a customer recommends the product,
where `1` is recommended and `0` is not recommended.
This is your model's target.

The features can be summarized as the following:

- **Clothing ID**: Integer Categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive Integer variable of the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Positive Feedback Count**: Positive Integer documenting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product's department.
- **Class Name**: Categorical name of the product's class.

The target:
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

## Load Data

> Let's take a quick look at the data and plan how to preprocess each column so that our model can make better use of it.

In [1]:
import pandas as pd

# Load data
df = pd.read_csv(
    'data/reviews.csv',
)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18442 entries, 0 to 18441
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              18442 non-null  int64 
 1   Age                      18442 non-null  int64 
 2   Title                    18442 non-null  object
 3   Review Text              18442 non-null  object
 4   Positive Feedback Count  18442 non-null  int64 
 5   Division Name            18442 non-null  object
 6   Department Name          18442 non-null  object
 7   Class Name               18442 non-null  object
 8   Recommended IND          18442 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 1.3+ MB


|   | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name | Recommended IND |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 0 | General | Dresses | Dresses | 0 |
| 1 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 0 | General Petite | Bottoms | Pants | 1 |
| 2 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 6 | General | Tops | Blouses | 1 |
| 3 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 4 | General | Dresses | Dresses | 0 |
| 4 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 1 | General Petite | Tops | Knits | 1 |


## Preparing features (`X`) & target (`y`)

In [2]:
data = df

# separate features from labels
X = data.drop('Recommended IND', axis=1)
y = data['Recommended IND'].copy()

print('Labels:', y.unique())
print('Features:')
display(X.head())

Labels: [0 1]
Features:


|   | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|
| 0 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 0 | General | Dresses | Dresses |
| 1 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 0 | General Petite | Bottoms | Pants |
| 2 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 6 | General | Tops | Blouses |
| 3 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 4 | General | Dresses | Dresses |
| 4 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 1 | General Petite | Tops | Knits |


In [3]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    shuffle=True,
    random_state=27,
)
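
The split above shuffles the rows but does not stratify on the target. If the two classes turn out to be imbalanced (the class balance is not shown in this notebook), passing `stratify=y` keeps the same label ratio in both sets. A minimal sketch on made-up toy data, where `X_toy` and `y_toy` are hypothetical stand-ins for the real features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the review data: 100 rows, 80% positive labels (made up).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 80 + [0] * 20)

# Same settings as the notebook's split, plus stratification on the labels.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.1,
    shuffle=True,
    random_state=27,
    stratify=y_toy,
)

# With stratification, both sets keep the share of positive labels.
print('Train positive rate:', y_tr.mean())
print('Test positive rate:', y_te.mean())
```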

## Feature Engineering

### Let's create some Pipelines for our Column Transformer to handle the numerical and categorical data columns

In [4]:
X_train.select_dtypes('number').columns

Index(['Clothing ID', 'Age', 'Positive Feedback Count'], dtype='object')

In [5]:
# I will create an index for numeric features for my preprocessing pipeline
num_columns = X_train.select_dtypes('number').drop(['Age'], axis=1).columns
print('Numerical features:', num_columns)

# I will create an index for categorical features for my preprocessing pipeline
cat_columns = X_train.select_dtypes('object').drop(['Review Text'], axis=1).columns
print('Categorical Features:', cat_columns)

# I will create an index for the review text for my nlp preprocessing pipeline
review_column = X_train[['Review Text']].columns
print('Review Feature:', review_column)

# I will make a separate pipeline for Age because I treat it as an ordinal category where order matters
age_column = X_train[['Age']].columns
print('Age Feature:', age_column)

Numerical features: Index(['Clothing ID', 'Positive Feedback Count'], dtype='object')
Categorical Features: Index(['Title', 'Division Name', 'Department Name', 'Class Name'], dtype='object')
Review Feature: Index(['Review Text'], dtype='object')
Age Feature: Index(['Age'], dtype='object')


In [6]:
# Let's create our numerical preprocessing pipeline

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('minmax', MinMaxScaler())
])

num_pipeline
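
To see what this numerical pipeline does, here is a small sketch on made-up values (the array `toy_num` is hypothetical, not taken from the dataset). The missing value is replaced by the column mean, then every value is rescaled to the range 0 to 1.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# One made-up numeric column with a missing entry.
toy_num = np.array([[0.0], [5.0], [np.nan], [10.0]])

demo_num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # NaN -> mean of 0, 5, 10
    ('minmax', MinMaxScaler()),                   # rescale to [0, 1]
])

scaled = demo_num_pipeline.fit_transform(toy_num)
print(scaled.ravel())
```

Since the dataset is described as already cleaned of missing values, the imputer mainly acts as a safeguard for new data.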

In [7]:
# Let's create our categorical preprocessing pipeline

from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehotencoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])

cat_pipeline

In [8]:
# Let's create a pipeline just for Age because we are not using OneHotEncoder with Age

from sklearn.preprocessing import OrdinalEncoder

age_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinalencoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
])

age_pipeline
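
It helps to see what `OrdinalEncoder` does to an integer column such as Age. In the sketch below (made-up ages, not from the dataset), each distinct age is replaced by its rank among the ages seen during fit, and an age that was not seen is mapped to `-1`. The spacing between ages is therefore lost, which is a trade-off to keep in mind; scaling Age as a number would be one alternative, though that is not what this notebook does.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

age_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# Fit on three made-up ages; they are ranked as 25 -> 0, 30 -> 1, 40 -> 2.
age_encoder.fit(np.array([[25], [30], [40]]))

# 30 was seen during fit, 99 was not.
encoded_ages = age_encoder.transform(np.array([[30], [99]]))
print(encoded_ages)
```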

### Let's tackle our review data and transform it into numerical form for our machine learning model

In [9]:
from sklearn.base import TransformerMixin, BaseEstimator

# Creating a custom transformer to count specified characters
class CountCharacter(BaseEstimator, TransformerMixin):
    def __init__(self, character:str):
        self.character = character
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        counts = []
        for text in X:
            count = text.count(self.character)
            counts.append([count])
        return counts
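
A quick way to check the transformer is to run it on a few made-up strings. The sketch below repeats the class definition so that it runs on its own; `sample_reviews` is a hypothetical input.

```python
from sklearn.base import TransformerMixin, BaseEstimator

# Same transformer as above, repeated so this example is self-contained.
class CountCharacter(BaseEstimator, TransformerMixin):
    def __init__(self, character: str):
        self.character = character

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One single-element row per text, holding the character count.
        return [[text.count(self.character)] for text in X]

sample_reviews = ['Is it true to size?', 'Love it!!', 'Fine.']

question_counts = CountCharacter('?').fit_transform(sample_reviews)
exclamation_counts = CountCharacter('!').fit_transform(sample_reviews)

print(question_counts)
print(exclamation_counts)
```

Each text produces one row with a single count, so the output is a 2D structure with one column, which is the layout scikit-learn expects for features.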

In [10]:
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
import numpy as np

initial_text_preprocess = Pipeline([
    ('dimension_reshaper', FunctionTransformer(np.reshape, kw_args={'newshape':-1}))
])
count_char = FeatureUnion([
    ('count?', CountCharacter('?')),
    ('count!', CountCharacter('!'))
])

count_char

In [11]:
text_pipeline = Pipeline([
    ('reshaper', initial_text_preprocess),
    ('count', count_char)
])

text_pipeline
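
The reshaping step exists because `ColumnTransformer` hands each sub-pipeline a 2D block (one row per review, one column), while the character counter and the text vectorizer iterate over a flat sequence of strings. A minimal sketch of that flattening on a made-up column; the shape is passed positionally here because, to my knowledge, the keyword for it differs between NumPy releases (`newshape` in older ones, `shape` in newer ones).

```python
import numpy as np

# A made-up 2D text column, shaped like what ColumnTransformer passes along.
column_2d = np.array([['Great fit!'], ['Too small?'], ['Runs large']], dtype=object)

# Flatten (n_rows, 1) into (n_rows,) so each element is a single string.
flat = np.reshape(column_2d, -1)

print(column_2d.shape, '->', flat.shape)
print(flat[0])
```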

In [12]:
# Custom transformer using nlp lemmatizer
class SpacyLemmatizer(BaseEstimator, TransformerMixin):
    
    def __init__(self,nlp):
        self.nlp = nlp
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        lemmatized = []
        for doc in self.nlp.pipe(X):
            lemmas = []
            for token in doc:
                if not token.is_stop:
                    lemmas.append(token.lemma_)
            lemmatized.append(' '.join(lemmas))  
        return lemmatized  

In [13]:
# Let's download the spaCy English model before we continue
! python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load('en_core_web_sm')

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_pipeline = Pipeline([
    ('reshaper', initial_text_preprocess),
    ('lemmatizer', SpacyLemmatizer(nlp)),
    ('tfidf', TfidfVectorizer(stop_words = 'english'))
])

tfidf_pipeline

## Building Pipeline

> Now that we have all of our preprocessing pipelines, let's put them together with a ColumnTransformer

In [15]:
from sklearn.compose import ColumnTransformer

feature_engineering = ColumnTransformer([
    ('num', num_pipeline, num_columns),
    ('cat', cat_pipeline, cat_columns),
    ('age', age_pipeline, age_column),
    ('char_counts', text_pipeline, review_column),
    ('tfidf', tfidf_pipeline, review_column)
])

feature_engineering

In [16]:
from sklearn.ensemble import RandomForestClassifier

model_pipeline = Pipeline([
    ('feature_engineering', feature_engineering),
    ('model', RandomForestClassifier(random_state=7))
])

model_pipeline

## Training Pipeline

In [17]:
model_pipeline.fit(X_train, y_train)

In [18]:
from sklearn.metrics import accuracy_score

y_pred = model_pipeline.predict(X_test)

In [19]:
accuracy = accuracy_score(y_test, y_pred)
print('The accuracy score of our model without fine tuning is:', accuracy)

The accuracy score of our model without fine tuning is: 0.8439024390243902
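
Accuracy alone can hide how the model treats the less frequent class. The class balance of this dataset is not shown in the notebook, so the labels below are made up purely for illustration: a model that always predicts "recommended" still scores 80% accuracy here while never identifying a single "not recommended" review.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 8 recommended (1) and 2 not recommended (0).
y_true_toy = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# A trivial "model" that always predicts the majority class.
y_pred_toy = [1] * 10

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
toy_confusion = confusion_matrix(y_true_toy, y_pred_toy)
toy_recall_negative = recall_score(y_true_toy, y_pred_toy, pos_label=0)

print('Accuracy:', toy_accuracy)
print('Confusion matrix (rows = true, columns = predicted):')
print(toy_confusion)
print('Recall for the "not recommended" class:', toy_recall_negative)
```

Applying the same metrics to `y_test` and `y_pred` would show whether the 0.84 accuracy reflects good performance on both classes.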


## Fine-Tuning Pipeline

> We will be using RandomizedSearchCV to save some time, since it only tries a sample of the parameter combinations (10 by default). In another setting we might want to use GridSearchCV, which tries every combination.

In [24]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    'model__n_estimators': [75, 100, 125],  # Number of trees in the forest
    'model__max_depth': [None, 4],  # Maximum depth of the tree
    'model__min_samples_split': [2, 5],  # Minimum number of samples required to split an internal node
    'model__min_samples_leaf': [1, 2],  # Minimum number of samples required to be at a leaf node
}

random_search = RandomizedSearchCV(estimator=model_pipeline, param_distributions=param_grid,
                                   cv=3, scoring='accuracy', verbose=3, random_state=7)

random_search.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3] END model__max_depth=None, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100;, score=0.844 total time= 2.2min
[CV 2/3] END model__max_depth=None, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100;, score=0.843 total time= 2.1min
[CV 3/3] END model__max_depth=None, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100;, score=0.840 total time= 2.1min
[CV 1/3] END model__max_depth=None, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=125;, score=0.843 total time= 2.3min
[CV 2/3] END model__max_depth=None, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=125;, score=0.840 total time= 2.2min
[CV 3/3] END model__max_depth=None, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=125;, score=0.837 total time= 2.2min
[CV 1/3] END model__max_depth=None, model__min_samp
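
The log reports 10 candidates, which matches the default `n_iter=10` of `RandomizedSearchCV`. To see how much of the search space that covers, the sketch below counts every combination in the same parameter dictionary with `ParameterGrid`.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as in the cell above.
param_grid = {
    'model__n_estimators': [75, 100, 125],
    'model__max_depth': [None, 4],
    'model__min_samples_split': [2, 5],
    'model__min_samples_leaf': [1, 2],
}

# 3 * 2 * 2 * 2 combinations in total.
n_combinations = len(ParameterGrid(param_grid))
print('Total combinations:', n_combinations)
print('Share sampled with n_iter=10:', 10 / n_combinations)
```

An exhaustive search with `GridSearchCV` and `cv=3` would therefore need 72 fits instead of 30.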

In [25]:
print('These are our best parameters according to RandomizedSearchCV:', random_search.best_params_)

These are our best parameters according to RandomizedSearchCV: {'model__n_estimators': 100, 'model__min_samples_split': 2, 'model__min_samples_leaf': 1, 'model__max_depth': None}


In [26]:
best_model = random_search.best_estimator_

In [29]:
best_pred = best_model.predict(X_test)
best_model_accuracy = accuracy_score(y_test, best_pred)
print('This is our accuracy based on our best model:', best_model_accuracy)

This is our accuracy based on our best model: 0.8439024390243902
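
The tuned accuracy is identical to the untuned one. A likely explanation is that the best parameters found by the search coincide with the default settings of `RandomForestClassifier` in recent scikit-learn versions, so with the same `random_state` the refit pipeline would be the same model as before. The sketch below checks that assumption against the installed version.

```python
from sklearn.ensemble import RandomForestClassifier

# Best parameters copied from the search output above.
best_params = {
    'model__n_estimators': 100,
    'model__min_samples_split': 2,
    'model__min_samples_leaf': 1,
    'model__max_depth': None,
}

# Default hyperparameters of the installed scikit-learn version.
defaults = RandomForestClassifier().get_params()

# Strip the 'model__' pipeline prefix and compare each value with its default.
matches_defaults = all(
    defaults[name.split('__', 1)[1]] == value
    for name, value in best_params.items()
)
print('Best parameters equal the defaults:', matches_defaults)
```

If this prints `True`, a wider search space or a different model family would be needed to see any change from tuning.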
