# Pipeline Project

You will be using the provided data to create a machine learning model pipeline.

You must handle the data appropriately in your pipeline to predict whether an
item is recommended by a customer based on their review.
Note the data includes numerical, categorical, and text data.

You should ensure you properly train and evaluate your model.

## The Data

The dataset has been anonymized and cleaned of missing values.

There are 8 features to use to predict whether a customer recommends or does
not recommend a product.
The `Recommended IND` column indicates whether a customer recommends the product,
where `1` is recommended and `0` is not recommended.
This is your model's target.

The features can be summarized as the following:

- **Clothing ID**: Integer categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive integer variable for the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Positive Feedback Count**: Non-negative integer counting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product's department.
- **Class Name**: Categorical name of the product's class.

The target:
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

## Load Dataset
In this section, we load the dataset and inspect its structure to understand the features and target variable.

In [1]:
import pandas as pd

# Load data
df = pd.read_csv(
    'data/reviews.csv',
)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18442 entries, 0 to 18441
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              18442 non-null  int64 
 1   Age                      18442 non-null  int64 
 2   Title                    18442 non-null  object
 3   Review Text              18442 non-null  object
 4   Positive Feedback Count  18442 non-null  int64 
 5   Division Name            18442 non-null  object
 6   Department Name          18442 non-null  object
 7   Class Name               18442 non-null  object
 8   Recommended IND          18442 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 1.3+ MB


|   | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name | Recommended IND |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 0 | General | Dresses | Dresses | 0 |
| 1 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 0 | General Petite | Bottoms | Pants | 1 |
| 2 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 6 | General | Tops | Blouses | 1 |
| 3 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 4 | General | Dresses | Dresses | 0 |
| 4 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 1 | General Petite | Tops | Knits | 1 |


## Preparing features (`X`) & target (`y`)

In [2]:
data = df

# separate features from labels
X = data.drop('Recommended IND', axis=1)
y = data['Recommended IND'].copy()

print('Labels:', y.unique())
print('Features:')
display(X.head())

Labels: [0 1]
Features:


|   | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|
| 0 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 0 | General | Dresses | Dresses |
| 1 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 0 | General Petite | Bottoms | Pants |
| 2 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 6 | General | Tops | Blouses |
| 3 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 4 | General | Dresses | Dresses |
| 4 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 1 | General Petite | Tops | Knits |


## Data Splitting
We split the dataset into training and testing sets to evaluate the model's performance on unseen data.

In [3]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    shuffle=True,
    random_state=27,
)
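
The split above shuffles rows without regard to the label. As a minimal sketch, assuming one wants the train and test sets to keep the same label proportions, `train_test_split` accepts a `stratify` argument. The toy arrays below are invented for illustration and are not the review data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 80 positive and 20 negative labels
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 80 + [0] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.1,
    shuffle=True,
    random_state=27,
    stratify=y_toy,  # keep the label ratio the same in both splits
)

print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```

Applying `stratify=y` to the notebook's own split would change which rows land in the test set, so the accuracies reported later would no longer apply as printed.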

## Data Exploration
In this section, we check the class balance of the target, inspect the column types, and group the features into numerical, categorical, and text columns.

In [4]:
y.value_counts()

Recommended IND
1    15053
0     3389
Name: count, dtype: int64
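
The counts above imply a baseline worth keeping in mind: a model that always predicts `1` would already be correct for every recommended review. A quick arithmetic check, using only the two counts printed above:

```python
# Class counts copied from the value_counts() output above
counts = {1: 15053, 0: 3389}

total = sum(counts.values())
majority_share = max(counts.values()) / total

print(total)                    # 18442
print(f"{majority_share:.4f}")  # 0.8162
```

This share is computed on the full dataset; the proportion inside the held-out test split may differ slightly.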

In [5]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18442 entries, 0 to 18441
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              18442 non-null  int64 
 1   Age                      18442 non-null  int64 
 2   Title                    18442 non-null  object
 3   Review Text              18442 non-null  object
 4   Positive Feedback Count  18442 non-null  int64 
 5   Division Name            18442 non-null  object
 6   Department Name          18442 non-null  object
 7   Class Name               18442 non-null  object
dtypes: int64(3), object(5)
memory usage: 1.1+ MB


In [6]:
# Integer columns are treated as numerical (this includes the Clothing ID identifier)
numerical_features = X.select_dtypes(exclude=['object']).columns
print("Numerical features:", numerical_features)

# The remaining string columns, minus the free-text ones, are categorical
categorical_features = X.select_dtypes(exclude=['int64']).columns.drop(['Title', 'Review Text'])
print("Categorical features:", categorical_features)

# Free-text columns
text_features = X[['Title', 'Review Text']].columns
print("Text features:", text_features)

Numerical features: Index(['Clothing ID', 'Age', 'Positive Feedback Count'], dtype='object')
Categorical features: Index(['Division Name', 'Department Name', 'Class Name'], dtype='object')
Text features: Index(['Title', 'Review Text'], dtype='object')


## Building Pipeline

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
import spacy

In [8]:
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ('scaler', MinMaxScaler())
])

numeric_pipeline
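
As a small illustration of what the `scaler` step does: `MinMaxScaler` learns the minimum and maximum of each column during `fit` and maps values linearly so that the training range becomes [0, 1]. The ages below are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training ages: min is 20, max is 60
train_ages = np.array([[20.0], [40.0], [60.0]])
# Hypothetical new ages: one inside the training range, one above it
new_ages = np.array([[30.0], [80.0]])

scaler = MinMaxScaler().fit(train_ages)
scaled = scaler.transform(new_ages)

# (30 - 20) / (60 - 20) = 0.25 and (80 - 20) / (60 - 20) = 1.5
print(scaled.ravel())
```

Values outside the range seen during `fit` land outside [0, 1] unless the scaler is created with `clip=True`.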

In [9]:
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ('cat_encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])

categorical_pipeline

In [10]:
# Define a custom transformer class for lemmatization using spaCy
class SpacyLemmatizer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        """
        Initialize the SpacyLemmatizer with a spaCy language model.
        
        Parameters:
        nlp (spacy.language.Language): The spaCy language model to use for lemmatization.
        """
        self.nlp = nlp

    def fit(self, X, y=None):
        """
        Fit method (no fitting required for this transformer).
        
        Parameters:
        X (iterable): Input data.
        y (iterable, optional): Target data (not used).
        
        Returns:
        self: Returns the instance itself.
        """
        return self

    def transform(self, X):
        """
        Transform the input data by lemmatizing the text.
        
        Parameters:
        X (iterable): Input data containing text to be lemmatized.
        
        Returns:
        list: A list of lemmatized text strings.
        """
        lemmatized = [
            ' '.join(token.lemma_ for token in doc)  # Join lemmatized tokens into a single string
            for doc in self.nlp.pipe(X)  # Process text using spaCy's pipeline
        ]
        return lemmatized
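
The same `fit`/`transform` contract can be seen in a smaller form without loading a spaCy model. The `LowercaseTransformer` below is a hypothetical stand-in, not part of the notebook: it only lowercases text, but it shows that `fit` returns `self`, that `transform` returns one output per input, and that `TransformerMixin` supplies `fit_transform` automatically.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class LowercaseTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stateless transformer that lowercases each string."""

    def fit(self, X, y=None):
        # Nothing to learn, so just return the instance
        return self

    def transform(self, X):
        # One output string per input string
        return [text.lower() for text in X]


# fit_transform is inherited from TransformerMixin
lowered = LowercaseTransformer().fit_transform(["Great FIT", "Poor Quality"])
print(lowered)  # ['great fit', 'poor quality']
```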

In [11]:
nlp = spacy.load('en_core_web_sm')

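text_pipeline = Pipeline([
    # Keeps only the first column passed in; with text_features as defined above
    # that is 'Title', so 'Review Text' is not vectorized by this pipeline
    ('selector', FunctionTransformer(lambda X: X.iloc[:, 0], validate=False)),
    ('lemmatizer', SpacyLemmatizer(nlp=nlp)),
    ('tfidf', TfidfVectorizer(stop_words="english")),
])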

text_pipeline
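
The `selector` step above passes only the first of the two text columns on to the lemmatizer. The sketch below, on an invented two-row frame, shows what `X.iloc[:, 0]` returns and, as one hypothetical alternative if both columns are meant to contribute, gives each text column its own `TfidfVectorizer` inside a `ColumnTransformer`. The names `toy`, `title_tfidf`, and `review_tfidf` are illustrative and not from the notebook; lemmatization is left out to keep the example small.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented rows standing in for the two text columns
toy = pd.DataFrame({
    "Title": ["Great fit", "Poor quality"],
    "Review Text": ["Love this dress", "Fabric tore quickly"],
})

# What the notebook's selector keeps: the first column only
first_col = toy.iloc[:, 0]
print(first_col.name)  # Title

# Hypothetical alternative: one TF-IDF vectorizer per text column.
# A single column name given as a string hands the vectorizer a 1-D Series.
both_text = ColumnTransformer([
    ("title_tfidf", TfidfVectorizer(), "Title"),
    ("review_tfidf", TfidfVectorizer(), "Review Text"),
])

features = both_text.fit_transform(toy)
print(features.shape)  # (2, 10): 4 title terms + 6 review terms
```

Switching the notebook to this layout would change the feature matrix, so the scores reported below would need to be recomputed.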

## Training Pipeline

In [12]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features),
        ('text', text_pipeline, text_features),
    ],
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

pipeline

In [13]:
pipeline.fit(X_train, y_train)

In [14]:
y_pred = pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.8661
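
Because the target is imbalanced, accuracy on its own can look good even for a model that ignores the minority class. As a hedged sketch on invented labels (not the notebook's predictions), the example below scores a predictor that always outputs `1` with accuracy, balanced accuracy, and a confusion matrix:

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
)

# Invented labels: 8 recommended, 2 not recommended
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# A predictor that always answers "recommended"
y_always_one = [1] * 10

acc = accuracy_score(y_true, y_always_one)
bal_acc = balanced_accuracy_score(y_true, y_always_one)
cm = confusion_matrix(y_true, y_always_one, labels=[0, 1])

print(f"accuracy: {acc:.2f}")               # 0.80
print(f"balanced accuracy: {bal_acc:.2f}")  # 0.50
print(cm)  # rows are true labels, columns are predicted labels
```

The same functions can be called with `y_test` and `y_pred` from the cell above to see how the pipeline treats the not-recommended class.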


## Fine-Tuning Pipeline

In [15]:
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'transform_input', 'verbose', 'preprocessor', 'classifier', 'preprocessor__force_int_remainder_cols', 'preprocessor__n_jobs', 'preprocessor__remainder', 'preprocessor__sparse_threshold', 'preprocessor__transformer_weights', 'preprocessor__transformers', 'preprocessor__verbose', 'preprocessor__verbose_feature_names_out', 'preprocessor__num', 'preprocessor__cat', 'preprocessor__text', 'preprocessor__num__memory', 'preprocessor__num__steps', 'preprocessor__num__transform_input', 'preprocessor__num__verbose', 'preprocessor__num__imputer', 'preprocessor__num__scaler', 'preprocessor__num__imputer__add_indicator', 'preprocessor__num__imputer__copy', 'preprocessor__num__imputer__fill_value', 'preprocessor__num__imputer__keep_empty_features', 'preprocessor__num__imputer__missing_values', 'preprocessor__num__imputer__strategy', 'preprocessor__num__scaler__clip', 'preprocessor__num__scaler__copy', 'preprocessor__num__scaler__feature_range', 'preprocessor__cat__memory
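
The keys listed above follow scikit-learn's `<step>__<parameter>` naming, with one double underscore per level of nesting. A minimal sketch on a hypothetical two-step pipeline (not the notebook's) shows how such a key addresses a parameter of an inner estimator:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical pipeline, smaller than the one in the notebook
toy_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# "classifier__n_estimators" means: step "classifier", parameter "n_estimators"
toy_pipeline.set_params(classifier__n_estimators=200)

n_trees = toy_pipeline.get_params()["classifier__n_estimators"]
print(n_trees)  # 200
```

In the full pipeline the same rule yields longer keys such as `preprocessor__text__tfidf__stop_words`, which walks from the `preprocessor` step to its `text` transformer and then to that transformer's `tfidf` step.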

In [16]:
param_grid = {
    "preprocessor__text__tfidf__stop_words": [None, "english"],
    "classifier__n_estimators": [100, 200, 300, 500],
}

param_search = RandomizedSearchCV(
    estimator=pipeline, 
    param_distributions=param_grid, 
    n_iter=5,     # Try 5 different combinations of parameters
    cv=5,         # Use 5-fold cross-validation
    n_jobs=-1,    # Use all available processors (for multiprocessing)
    refit=True,   # Refit the model using the best parameters found
    verbose=3,    # Output of parameters, score, time
    random_state=42,
)

param_search.fit(X_train, y_train)

print("\nBest Hyperparameters:")
print(param_search.best_params_)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV 1/5] END classifier__n_estimators=100, preprocessor__text__tfidf__stop_words=english;, score=0.864 total time=  14.3s
[CV 2/5] END classifier__n_estimators=100, preprocessor__text__tfidf__stop_words=english;, score=0.873 total time=  14.7s
[CV 3/5] END classifier__n_estimators=100, preprocessor__text__tfidf__stop_words=english;, score=0.861 total time=  15.6s
[CV 4/5] END classifier__n_estimators=100, preprocessor__text__tfidf__stop_words=english;, score=0.858 total time=  15.7s
[CV 5/5] END classifier__n_estimators=100, preprocessor__text__tfidf__stop_words=english;, score=0.865 total time=  15.8s
[CV 1/5] END classifier__n_estimators=100, preprocessor__text__tfidf__stop_words=None;, score=0.886 total time=  15.6s
[CV 2/5] END classifier__n_estimators=100, preprocessor__text__tfidf__stop_words=None;, score=0.882 total time=  15.2s
[CV 1/5] END classifier__n_estimators=300, preprocessor__text__tfidf__stop_words=english;, s
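
The "5 candidates" in the log follow from the search settings: the grid holds 2 × 4 = 8 combinations and `n_iter=5` draws 5 of them. The sketch below reproduces that count with `ParameterGrid` and `ParameterSampler`, the scikit-learn utilities for enumerating and sampling parameter settings; it does not fit any model.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as in the cell above
param_grid = {
    "preprocessor__text__tfidf__stop_words": [None, "english"],
    "classifier__n_estimators": [100, 200, 300, 500],
}

# Every combination in the grid
all_candidates = list(ParameterGrid(param_grid))

# A random subset of 5 combinations; when every entry is a list,
# the combinations are drawn without replacement
sampled = list(ParameterSampler(param_grid, n_iter=5, random_state=42))

print(len(all_candidates))  # 8
print(len(sampled))         # 5
for candidate in sampled:
    print(candidate)
```

With 5-fold cross-validation, 5 candidates give the 25 fits reported at the top of the log.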

In [17]:
best_model = param_search.best_estimator_

y_pred_best = best_model.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)

print(f"Accuracy: {accuracy_best:.4f}")

Accuracy: 0.8770
