# Pipeline Project

You will be using the provided data to create a machine learning model pipeline.

You must handle the data appropriately in your pipeline to predict whether an
item is recommended by a customer based on their review.
Note the data includes numerical, categorical, and text data.

You should ensure you properly train and evaluate your model.


## Loads & Imports

### Loads

In [1]:
# ! python -m spacy download en_core_web_sm
# ! python -m textblob.download_corpora

### Imports

In [2]:
# general
import pandas as pd
import numpy as np

# spaCy
import spacy
from spacy.tokens import Doc
nlp = spacy.load('en_core_web_sm')
from spacytextblob.spacytextblob import SpacyTextBlob
nlp.add_pipe('spacytextblob')

# sklearn
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

## The Data

The dataset has been anonymized and cleaned of missing values.

There are 8 features available for predicting whether a customer recommends or
does not recommend a product.
The `Recommended IND` column records whether a customer recommends the product,
where `1` is recommended and `0` is not recommended.
This is your model's target.

The features can be summarized as the following:

- **Clothing ID**: Integer Categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive Integer variable of the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Positive Feedback Count**: Positive Integer documenting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product's department.
- **Class Name**: Categorical name of the product's class.

The target:
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

## Load Data

In [3]:
# Load data
df = pd.read_csv(
    'data/reviews.csv',
)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18442 entries, 0 to 18441
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              18442 non-null  int64 
 1   Age                      18442 non-null  int64 
 2   Title                    18442 non-null  object
 3   Review Text              18442 non-null  object
 4   Positive Feedback Count  18442 non-null  int64 
 5   Division Name            18442 non-null  object
 6   Department Name          18442 non-null  object
 7   Class Name               18442 non-null  object
 8   Recommended IND          18442 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 1.3+ MB


Unnamed: 0,Clothing ID,Age,Title,Review Text,Positive Feedback Count,Division Name,Department Name,Class Name,Recommended IND
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,0,General,Dresses,Dresses,0
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",0,General Petite,Bottoms,Pants,1
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,6,General,Tops,Blouses,1
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",4,General,Dresses,Dresses,0
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,1,General Petite,Tops,Knits,1


## Preparing features (`X`) & target (`y`)

In [4]:
data = df

# separate features from labels
X = data.drop('Recommended IND', axis=1)
y = data['Recommended IND'].copy()

print('Labels:', y.unique())
print('Features:')
display(X.head())

Labels: [0 1]
Features:


Unnamed: 0,Clothing ID,Age,Title,Review Text,Positive Feedback Count,Division Name,Department Name,Class Name
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,0,General,Dresses,Dresses
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",0,General Petite,Bottoms,Pants
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,6,General,Tops,Blouses
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",4,General,Dresses,Dresses
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,1,General Petite,Tops,Knits


In [5]:
# Split data into train and test sets

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    shuffle=True,
    random_state=27,
)
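
The split above shuffles the rows but does not stratify them. As a hedged sketch, the following shows how `stratify` keeps the class ratio identical in both parts, which can matter when one class dominates. The toy arrays `X_toy` and `y_toy` are invented for illustration and are not the review data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical): 80 positives, 20 negatives
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.1,
    shuffle=True,
    stratify=y_toy,   # preserve the 80/20 class ratio in both splits
    random_state=27,
)

print('Train positive share:', y_tr.mean())
print('Test positive share:', y_te.mean())
```

Whether stratification changes anything for this dataset depends on how balanced `Recommended IND` actually is, which `y.value_counts(normalize=True)` would show.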

# Your Work

## Data Exploration

### Explore Unique Values of Categorical Features

In [6]:
# Double-quoted f-strings so the inner single quotes also work before Python 3.12
print(f"Division Names: {X['Division Name'].unique()}")
print(f"Department Names: {X['Department Name'].unique()}")
print(f"Class Names: {X['Class Name'].unique()}")

Division Names: ['General' 'General Petite']
Department Names: ['Dresses' 'Bottoms' 'Tops' 'Jackets' 'Trend' 'Intimate']
Class Names: ['Dresses' 'Pants' 'Blouses' 'Knits' 'Outerwear' 'Sweaters' 'Skirts'
 'Fine gauge' 'Jackets' 'Trend' 'Lounge' 'Jeans' 'Shorts' 'Casual bottoms']


### Explore Combined Unique Values

In [7]:
# Compute the unique combinations once and reuse them
combinations = X[['Class Name', 'Department Name', 'Division Name']].drop_duplicates()
print(f'Combinations:\n{combinations.to_string(index=False)}')

count_combined_unique = len(combinations)

print(f'\nCombined unique values: {count_combined_unique}')

Combinations:
    Class Name Department Name  Division Name
       Dresses         Dresses        General
         Pants         Bottoms General Petite
       Blouses            Tops        General
         Knits            Tops General Petite
       Dresses         Dresses General Petite
         Pants         Bottoms        General
     Outerwear         Jackets        General
      Sweaters            Tops        General
        Skirts         Bottoms        General
    Fine gauge            Tops        General
         Knits            Tops        General
       Blouses            Tops General Petite
       Jackets         Jackets        General
        Skirts         Bottoms General Petite
         Trend           Trend        General
    Fine gauge            Tops General Petite
        Lounge        Intimate General Petite
         Jeans         Bottoms        General
       Jackets         Jackets General Petite
      Sweaters            Tops General Petite
         Jeans      

### Explore Clothing ID

In [8]:
# Double-quoted f-strings so the inner single quotes also work before Python 3.12
print(f"Unique Clothing IDs: {X['Clothing ID'].nunique()}")
print(f"Total Number of Clothing IDs: {len(X['Clothing ID'])}")

Unique Clothing IDs: 531
Total Number of Clothing IDs: 18442


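--> With 531 unique IDs across 18442 reviews (about 35 reviews per ID on average), Clothing ID identifies individual items rather than individual reviews.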

Sorting by Clothing ID and looking at head and tail.

In [9]:
X.sort_values('Clothing ID').head(20)

Unnamed: 0,Clothing ID,Age,Title,Review Text,Positive Feedback Count,Division Name,Department Name,Class Name
670,2,28,"Gorgeous top, straps way too long",I just adore this top! it is so comfy and styl...,0,General,Tops,Knits
23,4,28,Great layering piece,This sweater is so comfy and classic - it bala...,0,General,Tops,Sweaters
4833,5,39,Oldie but goodie,I'm currently on the prowl for the other color...,0,General,Tops,Sweaters
17913,7,39,Four winters in... a winner!,"I love this coat. i bought it in 2012, and it ...",0,General,Jackets,Outerwear
15259,9,34,Okay leggings,The velvet isn't as soft or plush as i thought...,0,General,Bottoms,Jeans
12145,11,46,Wanted to love it,The color says red but it's more like a rust. ...,0,General,Bottoms,Jeans
8702,12,28,Beautiful staple item,Love this striped top! it's perfect to throw o...,0,General,Tops,Knits
5502,13,39,Stylish but strange,"I love the color, the fabric and the style (es...",0,General,Dresses,Dresses
15827,16,39,Gorgeous,This gorgeous dress really does bloom before y...,0,General,Dresses,Dresses
6482,17,31,Love this dress!,I absolutely love this dress! i got it right a...,0,General,Dresses,Dresses


In [10]:
X.sort_values('Clothing ID').tail(20)

Unnamed: 0,Clothing ID,Age,Title,Review Text,Positive Feedback Count,Division Name,Department Name,Class Name
2958,1195,45,Not great,"The fabric is very thin. it isn't see through,...",0,General,Dresses,Dresses
1516,1197,60,Most comfortable fabric i've ever worn,I originally bought this dress in another colo...,0,General,Dresses,Dresses
1508,1197,68,Perfect every day neutral,So glad i ordered this dress. it fits true to ...,1,General,Dresses,Dresses
1520,1197,55,Flattering and versatile,I ordered this winth some trepidation as the d...,0,General,Dresses,Dresses
15385,1198,25,Comfy and casual,I wanted to love this dress so much! but unfor...,0,General,Dresses,Dresses
15369,1198,68,Cute and comfy,This is a great every day dress. the buttons i...,0,General,Dresses,Dresses
5510,1199,31,Flowy and light,This dress in orange is lovely. it's very ligh...,0,General Petite,Dresses,Dresses
11825,1200,24,Its worth the sale price if you know your size,Bought this dress without trying it on but lis...,0,General,Dresses,Dresses
4928,1202,45,Armholes are huge,"The material of the dress is gorgeous, but the...",1,General Petite,Dresses,Dresses
4924,1202,31,Bright and happy,The colors are amazing. so vibrant and plentif...,0,General Petite,Dresses,Dresses


Clothing ID doesn't seem to be a characterizing feature, but the model may still find a relationship.
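
One way to probe this beyond eyeballing the head and tail is to group by Clothing ID and compare review counts and recommendation rates per item. The sketch below uses a tiny invented frame (`toy`), not the real data. On the real data the same grouping would need the target column, so it would be applied to `data` (ideally only its training rows) rather than to `X`.

```python
import pandas as pd

# Hypothetical miniature of the review table (values invented)
toy = pd.DataFrame({
    'Clothing ID': [2, 2, 2, 1197, 1197, 1202],
    'Recommended IND': [1, 1, 0, 1, 1, 0],
})

# Number of reviews and share of recommendations per item
per_item = (
    toy.groupby('Clothing ID')['Recommended IND']
    .agg(n_reviews='count', recommend_rate='mean')
    .sort_values('n_reviews', ascending=False)
)
print(per_item)
```

If the recommendation rate varied strongly between items with many reviews, that would suggest the ID carries signal. Items with only one or two reviews say little either way.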

### Looking at Text Features

In [11]:
X['Review Text'][50]

"I have a short torso and this works well for me. 34c, bought the 0. there's not much stretch to the fabric so it is fitted to my chest, but not in an uncomfortable way. definitely doesn't hang and have extra fabric like on the model. \r\n\r\nzipper goes almost all the way down to the bottom so it's easier to get on and off which makes up for the lack of stretch n the fabric.\r\n\r\nunlike another reviewer, i found it went really well with navy pants and i wore it to a business meeting under a blazer. wi"

In [12]:
X['Review Text'][1001]

"I fell in love with this dress when i saw it in the catalog and ordered it immediately. i was a bit disappointed that the dress is a little lighter than the pink showed in the catalog but it didn't deter my liking it because it's still a lovely shade of pink. the fabric takes a bit getting used to though - i thought it would be silkier but it's really a thicker fabric. i ordered a size small and i usually wear size 4. the dress drowned me. so i returned it for an xs, which fit me beautifully. so"

In [13]:
X['Review Text'][1250]

"Such a beautiful print. i sized up to a 14 because it looked like it has a high waist and it does. the waist comes almost to my bra line. unfortunately, there's a huge amount of fabric and it swallowed me. not slimming at all. i felt like i was wearing a curtain.  i loved the slits in the photo but due to the amount of fabric they weren't visible. i'm short and this literally pooled on the ground a few inches. if you're tall and thin, this is a beauty."

In [14]:
X['Review Text'][910]

'I am 5\'-7" and 135 lbs, i bought a medium petite as i wanted the dress to hit at my knees, instead of midi. this dress is easily 2 sizes bigger than expected. the pattern was not flattering on my although i\'m sure it would be for others. i was happy with the length...'

In [15]:
X['Review Text'][130]

'I like this sweater so much i just bought it in a second color! the pleats make the sweater conform to my shape just enough to be flattering. i wore it over three different dresses this week that might have felt too bare for work or cooler weather. i live in a hot climate so this is the right weight for our cooler months. the metallic threads give it a little bit of flair and the grey color goes with everything. i\'m 5\'7" size 10-12 and the large fit just right.'

## Building Pipeline

### Split Features into Numerical, Categorical, and Text Features

In [16]:
num_features = (
    X[[
        'Positive Feedback Count',
        'Age',
    ]].columns
)
print('Numerical features:', num_features)

cat_features = (
    X[[
        'Division Name',
        'Department Name',
        'Class Name',
        'Clothing ID', # more a categorical feature than a numerical feature
    ]].columns
)
print('Categorical features:', cat_features)


text_feature_reviewtext = (
    X[[
        'Review Text',
    ]].columns
)
print('Review Text feature:', text_feature_reviewtext)

text_feature_title = (
    X[[
        'Title',
    ]].columns
)
print('Title text feature:', text_feature_title)


Numerical features: Index(['Positive Feedback Count', 'Age'], dtype='object')
Categorical features: Index(['Division Name', 'Department Name', 'Class Name', 'Clothing ID'], dtype='object')
Review Text feature: Index(['Review Text'], dtype='object')
Title text feature: Index(['Title'], dtype='object')


### Numerical Feature Pipeline

In [17]:
num_pipeline = Pipeline([
    (
        'scaler',
        MinMaxScaler(),
    ),
])

num_pipeline

### Categorical Feature Pipeline

In [18]:


cat_pipeline = Pipeline([
    (
        'cat_encoder',
        OneHotEncoder(
            sparse_output=False,
            handle_unknown='ignore',
        )
    ),
])

cat_pipeline

### Text Feature Pipeline

#### Tfidf Pipeline

In [19]:


# spaCy lemmatizer
class SpacyLemmatizer(BaseEstimator, TransformerMixin):
    """Lemmatize each document and drop spaCy stop words."""

    def __init__(self):
        # No hyperparameters to store
        pass

    def fit(self, X, y=None):
        # Stateless transformer: nothing is learned from the data
        return self

    def transform(self, X):
        # Load and initialize within the method.
        # Only lemmas are needed here, so the spacytextblob component
        # is not added to this pipeline.
        nlp = spacy.load("en_core_web_sm")

        # One space-joined string of lemmas per input document
        lemmatized = [
            ' '.join(
                token.lemma_ for token in doc
                if not token.is_stop
            )
            for doc in nlp.pipe(X)
        ]
        return lemmatized

In [20]:

tfidf_pipeline = Pipeline([
    (
        'dimension_reshaper',
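        # np.ravel flattens the single-column input to 1D and avoids the
        # `newshape` keyword of np.reshape, which newer NumPy releases deprecate
        FunctionTransformer(np.ravel),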
    ),
    (
        'lemmatizer',
        SpacyLemmatizer(),
    ),
    (
        'tfidf_vectorizer',
        TfidfVectorizer(
            stop_words='english',
        ),
    ),
])
tfidf_pipeline

#### Sentiment Pipeline

In [21]:
class SpacySentimenter(BaseEstimator, TransformerMixin):
    def __init__(self):
        return None

    def fit(self, X, y=None):
        return self

    def transform(self, X):

        if not Doc.has_extension("blob"):
            Doc.set_extension("blob", default=None)

        # Load and initialize within the method
        nlp = spacy.load("en_core_web_sm")
        nlp.add_pipe("spacytextblob")

        # if X is a dataframe with exactly one column
        if isinstance(X, pd.DataFrame):
            if X.shape[1] != 1:
                raise ValueError("Expected DataFrame with a single column")
            X = X.iloc[:, 0]

        # if X is a 2D array
        elif isinstance(X, np.ndarray):
            if X.ndim == 2 and X.shape[1] == 1:
                X = X[:, 0]
            elif X.ndim > 1:
                raise ValueError("Expected 1D array or 2D array with one column")

        # Iterate directly so Series, arrays, and lists are all supported
        # (a NumPy array has no .apply method)
        sentiment_scores = [nlp(text)._.blob.polarity for text in X]
        return np.array(sentiment_scores).reshape(-1, 1)

In [22]:
sentiment_pipeline = Pipeline([
    (
        'sentimenter',
        SpacySentimenter(),
    ),
])
sentiment_pipeline

#### Text Feature Union

In [23]:
feature_engineering = FeatureUnion([
    ('tfidf', tfidf_pipeline),
    ('sentiment', sentiment_pipeline),
])

text_pipeline = Pipeline([
    (
        'feature_engineering',
        feature_engineering,
    ),
])
text_pipeline
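
`FeatureUnion` fits each branch on the same input and concatenates the outputs column-wise. As a rough sketch of the resulting shape, the example below replaces the sentiment branch with a hypothetical `text_length` function so that it runs without spaCy. The texts are invented.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ['love this dress', 'fabric too thin', 'love the fabric']

def text_length(docs):
    """Stand-in for the sentiment score: one numeric column per document."""
    return np.array([len(doc) for doc in docs]).reshape(-1, 1)

union = FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('length', FunctionTransformer(text_length)),
])

# 7 TF-IDF columns plus 1 extra column, stacked side by side
combined = union.fit_transform(texts)
print(combined.shape)
```

Because one branch is sparse, the combined output is sparse as well. The single dense column is simply appended as the last column.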

### Build Whole Pipeline

In [24]:
from sklearn.compose import ColumnTransformer

whole_pipeline = ColumnTransformer([
        ('num', num_pipeline, num_features),
        ('cat', cat_pipeline, cat_features),
        ('tfidf_text_review', text_pipeline, text_feature_reviewtext),
        ('tfidf_text_title', text_pipeline, text_feature_title),
])

whole_pipeline

## Training Pipeline

In [25]:
from sklearn.ensemble import RandomForestClassifier

from sklearn.pipeline import make_pipeline

# import sklearn
# sklearn.set_config(enable_metadata_routing=True)

model_pipeline = make_pipeline(
    whole_pipeline,
    RandomForestClassifier(random_state=27),
)

model_pipeline

In [26]:
model_pipeline.fit(X_train, y_train)

In [27]:
model_pipeline.feature_names_in_

array(['Clothing ID', 'Age', 'Title', 'Review Text',
       'Positive Feedback Count', 'Division Name', 'Department Name',
       'Class Name'], dtype=object)

### Evaluate Model

In [28]:
y_pred_pipeline = model_pipeline.predict(X_test)
accuracy_pipeline = accuracy_score(y_test, y_pred_pipeline)
precision_pipeline = precision_score(y_test, y_pred_pipeline)
recall_pipeline = recall_score(y_test, y_pred_pipeline)
f1_pipeline = f1_score(y_test, y_pred_pipeline)

print('Accuracy:', accuracy_pipeline)
print('Precision:', precision_pipeline)
print('Recall:', recall_pipeline)
print('F1:', f1_pipeline)



Accuracy: 0.8688346883468835
Precision: 0.8717948717948718
Recall: 0.9855072463768116
F1: 0.9251700680272109
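
Recall that is much higher than precision can indicate a model that leans toward predicting the positive class. A confusion matrix, the recall on class `0`, and a majority-class baseline help to check this. The sketch below uses invented labels and predictions (`y_true`, `y_pred`), not the results above. In the notebook the same calls would take `y_test` and `y_pred_pipeline`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score

# Hypothetical labels and predictions (not the notebook's results)
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Recall on the negative class: share of "not recommended" reviews caught
recall_negative = recall_score(y_true, y_pred, pos_label=0)
print('Recall (class 0):', recall_negative)

# Majority-class baseline: always predict "recommended"
baseline_accuracy = accuracy_score(y_true, np.ones_like(y_true))
print('Baseline accuracy:', baseline_accuracy)
```

In this toy case the accuracy of 0.9 looks good next to the 0.8 baseline, yet only half of the negative reviews are detected.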


## Fine-Tuning Pipeline

### Get Tunable Parameters

In [29]:
list(model_pipeline.get_params().keys())

['memory',
 'steps',
 'transform_input',
 'verbose',
 'columntransformer',
 'randomforestclassifier',
 'columntransformer__force_int_remainder_cols',
 'columntransformer__n_jobs',
 'columntransformer__remainder',
 'columntransformer__sparse_threshold',
 'columntransformer__transformer_weights',
 'columntransformer__transformers',
 'columntransformer__verbose',
 'columntransformer__verbose_feature_names_out',
 'columntransformer__num',
 'columntransformer__cat',
 'columntransformer__tfidf_text_review',
 'columntransformer__tfidf_text_title',
 'columntransformer__num__memory',
 'columntransformer__num__steps',
 'columntransformer__num__transform_input',
 'columntransformer__num__verbose',
 'columntransformer__num__scaler',
 'columntransformer__num__scaler__clip',
 'columntransformer__num__scaler__copy',
 'columntransformer__num__scaler__feature_range',
 'columntransformer__cat__memory',
 'columntransformer__cat__steps',
 'columntransformer__cat__transform_input',
 'columntransformer__cat

In [None]:
from sklearn.model_selection import RandomizedSearchCV



my_distributions = dict(
    randomforestclassifier__max_features=[
        100,
        150,
        200,
    ],
    randomforestclassifier__n_estimators=[
        10,
        20,
    ],
    randomforestclassifier__max_depth=[
        5,
        10,
        20,
        # 40,
        # None,
    ],
)

param_search = RandomizedSearchCV(
    estimator=model_pipeline,
    param_distributions=my_distributions,
    n_iter=4,
    cv=4,
    n_jobs=1,
    refit=True,
    verbose=3,
    random_state=27,
    # error_score='raise',
)

param_search.fit(X_train, y_train)

# Retrieve the best parameters
param_search.best_params_

Fitting 4 folds for each of 4 candidates, totalling 16 fits




[CV 1/4] END randomforestclassifier__max_depth=10, randomforestclassifier__max_features=100, randomforestclassifier__n_estimators=20;, score=0.815 total time=12.2min


KeyboardInterrupt: 
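
The log reports 16 fits because `n_iter=4` candidates are each evaluated on `cv=4` folds. The following sketch uses `ParameterSampler` to show how few of the possible combinations such a search covers. It may not reproduce the exact candidates drawn by the search above. If every fit took about as long as the single one logged (12.2 min, an assumption), 16 fits would need a little over three hours before the final refit.

```python
from sklearn.model_selection import ParameterSampler

# Same candidate lists as in the search cell above
my_distributions = dict(
    randomforestclassifier__max_features=[100, 150, 200],
    randomforestclassifier__n_estimators=[10, 20],
    randomforestclassifier__max_depth=[5, 10, 20],
)

n_iter = 4
cv = 4

# Size of the full grid: product of the list lengths
grid_size = 1
for values in my_distributions.values():
    grid_size *= len(values)

# Lists only, so candidates are drawn without replacement
candidates = list(
    ParameterSampler(my_distributions, n_iter=n_iter, random_state=27)
)
total_fits = len(candidates) * cv

print('Full grid size:', grid_size)
print('Sampled candidates:', len(candidates))
print('Total fits:', total_fits)
for params in candidates:
    print(params)
```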

In [None]:
model_best = param_search.best_estimator_
model_best

In [None]:
y_pred_model_best_pipeline = model_best.predict(X_test)
accuracy_best_model_pipeline = accuracy_score(y_test, y_pred_model_best_pipeline)
precision_best_model_pipeline = precision_score(y_test, y_pred_model_best_pipeline)
recall_best_model_pipeline = recall_score(y_test, y_pred_model_best_pipeline)
f1_best_model_pipeline = f1_score(y_test, y_pred_model_best_pipeline)

print('Accuracy:', accuracy_best_model_pipeline)
print('Precision:', precision_best_model_pipeline)
print('Recall:', recall_best_model_pipeline)
print('F1:', f1_best_model_pipeline)