# Pipeline Project

You will be using the provided data to create a machine learning model pipeline.

You must handle the data appropriately in your pipeline to predict whether an
item is recommended by a customer based on their review.
Note the data includes numerical, categorical, and text data.

You should ensure you properly train and evaluate your model.

## The Data

The dataset has been anonymized and cleaned of missing values.

There are 8 features to use to predict whether a customer recommends or does
not recommend a product.
The `Recommended IND` column gives whether a customer recommends the product
where `1` is recommended and a `0` is not recommended.
This is your model's target.

The features can be summarized as the following:

- **Clothing ID**: Integer categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive integer variable of the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Positive Feedback Count**: Positive integer documenting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product's department.
- **Class Name**: Categorical name of the product's class.

The target:
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

## Load Data

In [168]:
import pandas as pd

# Load data
df = pd.read_csv(
    'data/reviews.csv',
)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18442 entries, 0 to 18441
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              18442 non-null  int64 
 1   Age                      18442 non-null  int64 
 2   Title                    18442 non-null  object
 3   Review Text              18442 non-null  object
 4   Positive Feedback Count  18442 non-null  int64 
 5   Division Name            18442 non-null  object
 6   Department Name          18442 non-null  object
 7   Class Name               18442 non-null  object
 8   Recommended IND          18442 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 1.3+ MB


|   | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name | Recommended IND |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 0 | General | Dresses | Dresses | 0 |
| 1 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 0 | General Petite | Bottoms | Pants | 1 |
| 2 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 6 | General | Tops | Blouses | 1 |
| 3 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 4 | General | Dresses | Dresses | 0 |
| 4 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 1 | General Petite | Tops | Knits | 1 |


## Preparing features (`X`) & target (`y`)

In [169]:
data = df

# separate features from labels
X = data.drop('Recommended IND', axis=1)
y = data['Recommended IND'].copy()

print('Labels:', y.unique())
print('Features:')
display(X.head())

Labels: [0 1]
Features:


|   | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|
| 0 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 0 | General | Dresses | Dresses |
| 1 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 0 | General Petite | Bottoms | Pants |
| 2 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 6 | General | Tops | Blouses |
| 3 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 4 | General | Dresses | Dresses |
| 4 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 1 | General Petite | Tops | Knits |


In [170]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    shuffle=True,
    random_state=27,
)
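
If the two classes are not evenly balanced, a plain shuffled split can leave the test set with a slightly different class ratio than the training set. The following is a minimal sketch of a stratified split on made-up labels; the 20/80 ratio and the `X_toy`/`y_toy` names are assumptions for illustration, not properties of this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 20 not recommended (0) and 80 recommended (1)
y_toy = np.array([0] * 20 + [1] * 80)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, shuffle=True, random_state=27, stratify=y_toy
)

print('test size:', len(y_te), '| share of 1s in test:', y_te.mean())
print('train size:', len(y_tr), '| share of 1s in train:', y_tr.mean())
```

Whether stratification matters here depends on how imbalanced `Recommended IND` actually is, which can be checked with `y.value_counts()`.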

# Your Work

In [171]:
X.columns

Index(['Clothing ID', 'Age', 'Title', 'Review Text', 'Positive Feedback Count',
       'Division Name', 'Department Name', 'Class Name'],
      dtype='object')

In [172]:
print(len(X['Clothing ID'].unique()))

531


In [173]:
# Drop Clothing ID from the numeric features: IDs are labels, so it is treated as categorical below
num_features = X.select_dtypes(exclude='object').columns.drop(['Clothing ID'])
cat_features = X[['Clothing ID', 'Division Name', 'Department Name', 'Class Name']].columns
text_features = X[['Title', 'Review Text']].columns
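
The grouping above can be checked on a small frame. This sketch uses a hypothetical `toy` DataFrame with a subset of the columns and selects numeric columns with `include='number'`, which is one possible alternative to `exclude='object'` and does not depend on how string columns are stored.

```python
import pandas as pd

# Hypothetical two-row frame with a subset of the dataset's columns
toy = pd.DataFrame({
    'Clothing ID': [1077, 1049],
    'Age': [60, 50],
    'Title': ['Some major design flaws', 'My favorite buy!'],
    'Positive Feedback Count': [0, 0],
    'Division Name': ['General', 'General Petite'],
})

# Numeric columns, minus the ID column that is treated as a category
num_cols = toy.select_dtypes(include='number').columns.drop('Clothing ID').tolist()
cat_cols = ['Clothing ID', 'Division Name']
text_cols = ['Title']

print(num_cols)
```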

## Data Exploration

In [174]:
X[['Title', 'Review Text']].head()

|   | Title | Review Text |
|---|---|---|
| 0 | Some major design flaws | I had such high hopes for this dress and reall... |
| 1 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... |
| 2 | Flattering shirt | This shirt is very flattering to all due to th... |
| 3 | Not for the very petite | I love tracy reese dresses, but this one is no... |
| 4 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... |


In [175]:
print(X['Review Text'][3])


I love tracy reese dresses, but this one is not for the very petite. i am just under 5 feet tall and usually wear a 0p in this brand. this dress was very pretty out of the package but its a lot of dress. the skirt is long and very full so it overwhelmed my small frame. not a stranger to alterations, shortening and narrowing the skirt would take away from the embellishment of the garment. i love the color and the idea of the style but it just did not work on me. i returned this dress.


## Building Pipeline

In [176]:
from sklearn.pipeline import Pipeline

In [177]:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

num_pipeline = Pipeline(
    [
        ('imputer', SimpleImputer(strategy='mean')), 
        ('scaler', MinMaxScaler())
    ] 
)

num_pipeline

Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', MinMaxScaler())])

In [178]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer

cat_pipeline = Pipeline(
    [
        # Map categories to integer codes; categories unseen during fit become -1
        ('ordinal_encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
        # Fill any NaN codes with the most frequent code
        ('most_freq_imputer', SimpleImputer(strategy='most_frequent')),
        # One-hot encode the codes; codes unseen during fit (such as -1) become all-zero rows
        ('cat_encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
    ]
)
cat_pipeline

Pipeline(steps=[('ordinal_encoder',
                 OrdinalEncoder(handle_unknown='use_encoded_value',
                                unknown_value=-1)),
                ('most_freq_imputer',
                 SimpleImputer(strategy='most_frequent')),
                ('cat_encoder',
                 OneHotEncoder(handle_unknown='ignore', sparse_output=False))])


In [179]:
from sklearn.base import BaseEstimator, TransformerMixin

class CountCharacter(BaseEstimator, TransformerMixin):
    def __init__(self, character: str):
        self.character = character

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return [ [text.count(self.character)] for text in X ]
    
class CountLength(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # text.count('') matches the empty string at every position, so this is len(text) + 1
        return [ [text.count('')] for text in X ]
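
# A quick check of the length feature on a sample title (illustration only):
# str.count('') is always one more than len(), which only shifts the feature by a constant.

```python
text = 'Just ok'

# The empty string matches before each character and once more at the end
empty_matches = text.count('')
length = len(text)

print(empty_matches, length)
```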
    

In [180]:
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer
import numpy as np 

initial_text_processing = Pipeline([
    ('dimension_reshaper',
     FunctionTransformer(np.reshape, kw_args={'newshape':-1})
     )
])

character_count_feature_engineering = FeatureUnion([
    ('count_spaces', CountCharacter(' ')),
    ('count_questions_marks', CountCharacter('?')),
    ('count_exclamation', CountCharacter('!')),
    ('count_length', CountLength()),
])

character_count_pipeline = make_pipeline(initial_text_processing, character_count_feature_engineering)
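
A `FeatureUnion` runs each transformer on the same input and stacks the results side by side, one column per transformer here. The sketch below uses a hypothetical `CharCounter` class as a stand-in for `CountCharacter` (it returns a NumPy array instead of a list) and a few made-up strings.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class CharCounter(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for CountCharacter that returns a NumPy column."""

    def __init__(self, character):
        self.character = character

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One row per text, one column holding the character count
        return np.array([[text.count(self.character)] for text in X])


toy_union = FeatureUnion([
    ('spaces', CharCounter(' ')),
    ('exclamations', CharCounter('!')),
])

texts = ['hi there!', 'wow!!', 'a b c']
counts = toy_union.fit_transform(texts)
print(counts)
```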

In [181]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [182]:
class SpacyLemmatizer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return [
            ' '.join(token.lemma_ for token in doc if not token.is_stop) for doc in self.nlp.pipe(X)
        ]
  

In [183]:
from sklearn.feature_extraction.text import TfidfVectorizer 
import spacy

nlp = spacy.load('en_core_web_sm')

tfidf_pipeline = Pipeline([
    ('dimension_reshaper', FunctionTransformer(np.reshape, kw_args={'newshape':-1})),
    ('lemmatizer', SpacyLemmatizer(nlp=nlp)),
    ('tfidf_vectorizer', TfidfVectorizer(stop_words='english'))
])
tfidf_pipeline

Pipeline(steps=[('dimension_reshaper',
                 FunctionTransformer(func=<function res...t 0x10edfc570>,
                                     kw_args={'newshape': -1})),
                ('lemmatizer',
                 SpacyLemmatizer(nlp=<spacy.lang.e...t 0x14baf1250>)),
                ('tfidf_vectorizer', TfidfVectorizer(stop_words='english'))])
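
The final step of this pipeline can be tried on its own. This is a minimal sketch of `TfidfVectorizer(stop_words='english')` on a made-up three-document corpus, without the lemmatization step; the `corpus` contents are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus of review-like texts
corpus = [
    'great dress love it',
    'the dress runs small',
    'love the color',
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(corpus)

# One row per document, one column per remaining term
print(tfidf.shape)
print(sorted(vectorizer.vocabulary_))

# Rows are L2-normalized by default
row_norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1))
print(np.asarray(row_norms).ravel())
```

Stop words such as "the" are dropped from the vocabulary, and each document becomes a sparse row of term weights.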


In [None]:
from sklearn.compose import ColumnTransformer

feature_engineering = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features),
    ('character_count_0', character_count_pipeline, text_features[0]),
    ('character_count_1', character_count_pipeline, text_features[1]),
    ('tfidf_text_0', tfidf_pipeline, text_features[0]),
    ('tfidf_text_1', tfidf_pipeline, text_features[1]),
])

feature_engineering

ColumnTransformer(transformers=[('num', ...), ('cat', ...), ...])

In [185]:
X_train[text_features]

|   | Title | Review Text |
|---|---|---|
| 893 | Super cute. pockets would be nice | Easy and fun jumper. runs slightly large. i or... |
| 1767 | Great for all seasons | The dress looks great both in winter and summe... |
| 4491 | Just ok | I wanted to love this dress as it seemed perfe... |
| 17626 | Cute but... | I loved this shirt when i purchased it but it ... |
| 11184 | Grandmas draperies dress | I had to review this because i purchased befor... |
| ... | ... | ... |
| 15897 | This is a wow!!! | Went to the local store today for something el... |
| 4848 | Great overall top! | I bought this top in a small and it was true t... |
| 14879 | Runs very small | Purchased this dress to wear to new orleans in... |
| 3912 | Beautiful but fuzzy | This is a lovely cardigan--especially over dre... |


In [186]:
train_set = X_train[cat_features]
cat_pipeline.fit(train_set)
tmp = cat_pipeline.transform(train_set)
tmp


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 1., 0., 0.]], shape=(16597, 542))

In [None]:
train_set =  X_train[num_features]
num_pipeline.fit(train_set)
tmp = num_pipeline.transform(train_set)
tmp

array([[0.2345679 , 0.01639344],
       [0.0617284 , 0.        ],
       [0.28395062, 0.08196721],
       ...,
       [0.39506173, 0.01639344],
       [0.20987654, 0.10655738],
       [0.39506173, 0.        ]], shape=(16597, 2))

In [194]:
train_set =  X_train
feature_engineering.fit(train_set)
tmp = feature_engineering.transform(train_set)
tmp



array([[2.34567901e-01, 1.63934426e-02, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.77000000e+02],
       [6.17283951e-02, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 3.33000000e+02],
       [2.83950617e-01, 8.19672131e-02, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 4.70000000e+02],
       ...,
       [3.95061728e-01, 1.63934426e-02, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 4.87000000e+02],
       [2.09876543e-01, 1.06557377e-01, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 5.01000000e+02],
       [3.95061728e-01, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.28000000e+02]],
      shape=(16597, 552))

In [188]:
# The reshape step flattens both text columns row by row, so the result has
# 2 * 16597 = 33194 rows that alternate between Title and Review Text
tmp = character_count_pipeline.fit_transform(X_train[text_features])
tmp

array([[  5,   0,   0,  34],
       [ 33,   0,   0, 177],
       [  3,   0,   0,  22],
       ...,
       [ 88,   0,   0, 501],
       [  6,   0,   0,  39],
       [ 23,   0,   0, 128]], shape=(33194, 4))
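
The row count of 33194 is twice the number of training rows because flattening a two-column frame interleaves the columns. A minimal sketch on a hypothetical two-row frame, using `np.asarray(...).reshape(-1)` as an equivalent of the reshape step:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two text columns
toy = pd.DataFrame({
    'Title': ['t0', 't1'],
    'Review Text': ['r0', 'r1'],
})

# Row-major flattening: title and review of row 0, then title and review of row 1
flat = np.asarray(toy).reshape(-1)
print(flat)
```

Passing a single column at a time keeps one output row per review.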

In [195]:
# Both text columns are flattened into one corpus, so the matrix has 2 * 16597 = 33194 rows
tmp = tfidf_pipeline.fit_transform(X_train[text_features])
tmp

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 405091 stored elements and shape (33194, 10411)>

## Training Pipeline

### Choose Model

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline


learning_pipeline = make_pipeline(feature_engineering, RandomForestClassifier(random_state=28))
learning_pipeline

Pipeline(steps=[('columntransformer', ...), ('randomforestclassifier', ...)])
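
For reference, this is a minimal end-to-end sketch of the same pattern (a `ColumnTransformer` feeding a `RandomForestClassifier`) on a small made-up dataset. The rows, the `toy_*` names, and the simplified preprocessing are all assumptions for illustration; it is not the pipeline defined above and its score says nothing about the real data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical data: 8 distinct rows repeated 5 times
toy = pd.DataFrame({
    'Age': [25, 40, 33, 58, 47, 29, 61, 36] * 5,
    'Class Name': ['Dresses', 'Knits', 'Pants', 'Dresses',
                   'Blouses', 'Knits', 'Pants', 'Blouses'] * 5,
    'Review Text': [
        'love this dress', 'runs small returned it',
        'great fit and color', 'poor quality fabric',
        'so comfortable and flattering', 'too short not for me',
        'beautiful top love it', 'cheap looking returned',
    ] * 5,
    'Recommended IND': [1, 0, 1, 0, 1, 0, 1, 0] * 5,
})

toy_X = toy.drop('Recommended IND', axis=1)
toy_y = toy['Recommended IND']
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=27, stratify=toy_y
)

toy_preprocessing = ColumnTransformer([
    ('num', MinMaxScaler(), ['Age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Class Name']),
    ('text', TfidfVectorizer(stop_words='english'), 'Review Text'),
])

toy_model = make_pipeline(
    toy_preprocessing,
    RandomForestClassifier(n_estimators=20, random_state=28),
)
toy_model.fit(toy_X_train, toy_y_train)

# Evaluate on the held-out rows only
toy_pred = toy_model.predict(toy_X_test)
toy_acc = accuracy_score(toy_y_test, toy_pred)
print('held-out accuracy on toy data:', toy_acc)
```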


In [None]:
learning_pipeline.fit(X_train, y_train)

ValueError: too many values to unpack (expected 2)

## Fine-Tuning Pipeline
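
One common way to fine-tune a pipeline is a cross-validated grid search over step parameters, addressed as `<step name>__<parameter>`. The following is a minimal sketch on synthetic data from `make_classification`; the `toy_pipeline`, the grid values, and `cv=3` are assumptions chosen for illustration, not tuned recommendations for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=27)

toy_pipeline = make_pipeline(MinMaxScaler(), RandomForestClassifier(random_state=28))

# make_pipeline names each step after its lowercased class name
param_grid = {
    'randomforestclassifier__n_estimators': [10, 25],
    'randomforestclassifier__max_depth': [3, None],
}

search = GridSearchCV(toy_pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the pipeline refit on all the data passed to `fit`, and it should still be scored once on a held-out test set.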