# "Fit", "Dry Clean Only","Category"

This notebook documents how we select the best model for each of three tags: "Fit", "Dry Clean Only", and "Category". The model-building process has four steps: exploratory data analysis (EDA), text preprocessing, model building, and 10-fold cross-validation. In the final section, we apply the best model for each of the three tags to 1,000 rows of full_data to show sample output.

### Import packages

In [1]:
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet
from nltk import punkt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score,recall_score,f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

### Define Functions

In [2]:
lem = WordNetLemmatizer()

def lem_sentences(sentence):
    tokens = nltk.word_tokenize(sentence)
    lemmed_tokens = [lem.lemmatize(token) for token in tokens]
    return ' '.join(lemmed_tokens)

#lem_sentences above is adapted from this Stack Overflow question
# https://stackoverflow.com/questions/43795310/apply-porters-stemmer-to-a-pandas-column-for-each-word

def cleanHtml(sentence):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', str(sentence))
    return cleantext

def keepAlpha(sentence):
    alpha_sent = ""
    for word in word_tokenize(sentence):
        alpha_word = re.sub(r'[^\w]+', '', word)  # raw string avoids an invalid-escape warning for \w
        alpha_sent += alpha_word
        alpha_sent += " "
    alpha_sent = alpha_sent.strip()
    return alpha_sent

def cleanPunc(sentence): #function to clean the word of any punctuation or special characters
    cleaned = re.sub(r'[?|!|\'|"|#]',r'',sentence)
    cleaned = re.sub(r'[,|)|(|\|/]',r' ',cleaned)
    cleaned = cleaned.strip()
    cleaned = cleaned.replace("\n"," ")
    
    return cleaned
#The functions defined above are based on this GitHub notebook
#https://github.com/nkartik94/Multi-Label-Text-Classification/blob/master/Mark_6.ipynb
           

def cleanComma(sentence): #strip leading/trailing commas and collapse runs of commas into one
    cleaned = sentence.strip(',')
    cleaned = re.sub(r',{2,}',r',',cleaned)
    return cleaned

cloth_list=['top','bottom','one piece','sweater','blazers_coats_jackets','sweatshirt_hoodie']
non_cloth_list=['shoe','accessory']

def is_clothing(x):
    #create target value to determine if item is clothing
    if x in cloth_list:
        return 1
    elif x in non_cloth_list:
        return 0

top_list=['top','sweater','blazers_coats_jackets','sweatshirt_hoodie']

def is_top(x):
    #group the top clothing into the same tag
    #create target value for "category" tag
    if x in top_list:
        return 'top'
    else:
        return x    

### Load Data

In [3]:
tag_data = pd.read_csv('tagged_product_attributes.csv')
full_data = pd.read_csv('full_data.csv')

In [4]:
#Normalize attribute_name and attribute_value to lowercase
tag_data['attribute_value']=tag_data['attribute_value'].str.lower()
tag_data['attribute_name']=tag_data['attribute_name'].str.lower()

tag_data.loc[tag_data['attribute_name']=='drycleanonly','attribute_name']='dry_clean_only'

In [5]:
#the 3 additional tags our group chose
fit_tag = tag_data[tag_data['attribute_name']=='fit']
dryclean_tag = tag_data[tag_data['attribute_name']=='dry_clean_only']

#this dataframe is also used to build the model that determines whether an item is clothing
category_tag=tag_data[tag_data['attribute_name']=='category']

# 1. Build Model to determine if the item is clothing

The "Fit" and "Dry Clean Only" tags apply to clothing items only, so we first build a model to determine whether an item is clothing.

We use the rows whose "attribute_name" is "category" to generate the input for this model. If an item has the value "shoe" or "accessory", it belongs to the non-clothing class; the remaining category values (top, bottom, one piece, sweater, blazers_coats_jackets, sweatshirt_hoodie) are treated as clothing.
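
The labeling rule can be sketched in plain Python. This is a minimal illustration rather than one of the notebook's cells: the `label_clothing` name and the `'scarf'` test value are hypothetical, while the two category sets are copied from the function definitions earlier in the notebook.

```python
# Category values treated as clothing / non-clothing (copied from the notebook).
CLOTH = {'top', 'bottom', 'one piece', 'sweater',
         'blazers_coats_jackets', 'sweatshirt_hoodie'}
NON_CLOTH = {'shoe', 'accessory'}

def label_clothing(category):
    """Return 1 for clothing, 0 for non-clothing, None for unknown values."""
    if category in CLOTH:
        return 1
    if category in NON_CLOTH:
        return 0
    return None  # assumption: values outside both sets stay unlabeled

labels = [label_clothing(c) for c in ['top', 'shoe', 'accessory', 'scarf']]
print(labels)
```

Returning `None` for unknown values mirrors what the notebook's `is_clothing` does implicitly when neither branch matches.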

### 1.1 EDA

In [6]:
#align attribute values
#this step helps us drop duplicate values later
category_tag.loc[category_tag['attribute_value']=='blazers, coats & jackets','attribute_value']='blazers_coats_jackets'
category_tag.loc[category_tag['attribute_value']=='blazerscoatsjackets','attribute_value']='blazers_coats_jackets'
category_tag.loc[category_tag['attribute_value']=='onepiece','attribute_value']='one piece'
category_tag.loc[category_tag['attribute_value']=='sweatshirthoodie','attribute_value']='sweatshirt_hoodie'
category_tag.loc[category_tag['attribute_value']=='sweatshirt & hoodie','attribute_value']='sweatshirt_hoodie'

In [7]:
#inspect the distribution of different category
category_tag['attribute_value'].value_counts()

top                      1524
shoe                     1460
bottom                   1365
accessory                 640
one piece                 605
sweater                   545
blazers_coats_jackets     433
sweatshirt_hoodie         208
Name: attribute_value, dtype: int64

In [8]:
#apply the function to create the target for the model
category_tag['clothing']=category_tag['attribute_value'].apply(is_clothing)

In [9]:
#inspect the distribution of target
category_tag['clothing'].value_counts()

1    4680
0    2100
Name: clothing, dtype: int64

In [10]:
#Some products have more than one distinct tag under 'category' (the output below lists them)
#remove those records
tdf=category_tag.drop_duplicates(subset=['product_id','attribute_value'])
tdf2=category_tag.drop_duplicates(subset='product_id')

#list of product that have multiple category tag
tlist=tdf[~(tdf.index.isin(tdf2.index))]['product_id'].values
tdf[tdf['product_id'].isin(tlist)].sort_values('product_id')

Unnamed: 0,product_id,product_color_id,attribute_name,attribute_value,file,clothing
77339,01DPKMXBJEMQWQWAZHGJK1RS8Z,01DPKMXBJJGRSZAPCX97AXGSN8,category,top,additional,1
113233,01DPKMXBJEMQWQWAZHGJK1RS8Z,01DPKMXBJJGRSZAPCX97AXGSN8,category,bottom,additional,1
3873,01DSP2PPDS647HVVN1GAA1F68Q,01DSP2PPE1RBXCBN4D0HFJCH49,category,top,initial_tags,1
82153,01DSP2PPDS647HVVN1GAA1F68Q,01DSP2PPE1RBXCBN4D0HFJCH49,category,accessory,additional,0
71321,01DT0DJH0G3NX1VGEQW3GFH4CG,01DT0DJH28GVQFR1AQAKK7GFKB,category,top,additional,1
82293,01DT0DJH0G3NX1VGEQW3GFH4CG,01DT0DJH28GVQFR1AQAKK7GFKB,category,blazers_coats_jackets,additional,1
19497,01DT0DJW9KH3TSSDJ4V5HB23KM,01DT0DJW9VP9W3W4W8N4TNP9H8,category,blazers_coats_jackets,initial_tags,1
82227,01DT0DJW9KH3TSSDJ4V5HB23KM,01DT0DJW9VP9W3W4W8N4TNP9H8,category,sweater,additional,1
8601,01DT514ZTH8FXVKG9J2GXYT8A9,01DT514ZTQZMW9H7AMH8ATAZM8,category,top,initial_tags,1
75365,01DT514ZTH8FXVKG9J2GXYT8A9,01DT514ZTQZMW9H7AMH8ATAZM8,category,sweatshirt_hoodie,additional,1
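
The filter in the cell above flags every product that carries more than one distinct `category` value, so that those products can be excluded. The same idea in plain Python, as a sketch on made-up data (the product IDs `P1` to `P3` and the function name are hypothetical):

```python
from collections import defaultdict

def products_with_conflicting_tags(rows):
    """rows: iterable of (product_id, attribute_value) pairs."""
    values = defaultdict(set)
    for product_id, value in rows:
        values[product_id].add(value)
    # A product conflicts when it has two or more distinct values.
    return sorted(pid for pid, vals in values.items() if len(vals) > 1)

rows = [
    ('P1', 'top'), ('P1', 'top'),      # exact duplicate: not a conflict
    ('P2', 'top'), ('P2', 'bottom'),   # two distinct values: conflict
    ('P3', 'shoe'),
]
conflicts = products_with_conflicting_tags(rows)
kept = [r for r in rows if r[0] not in conflicts]
print(conflicts)
print(kept)
```

Exact duplicates survive this step on purpose; they are removed afterwards by the `drop_duplicates` call on `product_id` and `attribute_value`.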


In [11]:
#Exclude those records
category_tag2=category_tag[~(category_tag['product_id'].isin(tlist))]

In [12]:
#Keep relevant column and remove duplicate record based on "product_id" and "attribute_value"
category_tagm=category_tag2[['product_id','attribute_name','attribute_value','clothing']].\
drop_duplicates(subset=['product_id','attribute_value'])

#remove duplicate record based on product id
full_datam=full_data[['product_id','brand','brand_category','product_full_name','description','details']]\
.drop_duplicates(subset=['product_id'])
category_all=category_tagm.merge(full_datam,how='left',on='product_id')
#Fill Null value
category_all=category_all.fillna('')

In [13]:
#Concatenate the text columns into a single input string
category_all['input'] =category_all[['brand','brand_category','product_full_name','description','details']]\
.agg(' '.join, axis=1).str.lower()

In [14]:
#Keep the relevant column and inspect the distribution of target value
category_df=category_all[['product_id','clothing','input']]
category_df['clothing'].value_counts()

1    2980
0     982
Name: clothing, dtype: int64

### 1.2 Text Preprocessing

In [15]:
category_df['input']= category_df['input'].apply(lem_sentences)
category_df['input'] = category_df['input'].apply(cleanHtml)
category_df['input'] = category_df['input'].apply(cleanPunc)
category_df['input'] = category_df['input'].apply(keepAlpha)

In [16]:
#save the dataframe for future model training
category_df.to_csv('is_clothing_or_not.csv')

### 1.3 Model Building

We want to predict whether an item is clothing: 1 means the item is clothing and 0 means it is not. We therefore build several binary classification models and compare them to find the best one.

In [17]:
#splitting data for training and testing
X=category_df['input']
y=category_df['clothing']
X_train, X_test,y_train, y_test = train_test_split(X,y, random_state=42, test_size=0.30, shuffle=True,stratify=y)

In [18]:
#load the English stop-word list
stop_list=stopwords.words('english')

#tf-idf vectorizer
#this vectorizer is used in every model in this notebook
vectorizer = TfidfVectorizer(ngram_range=(1,3),
                             token_pattern=r'\b[a-zA-Z0-9]{3,}\b',
                             max_df=0.5,
                             min_df=10, stop_words=stop_list)
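
To see what the `token_pattern` above keeps, the sketch below applies the same regular expression with the standard `re` module. It is only an illustration: the sample sentence and the three-word `STOP` set are made up (the notebook uses NLTK's full English list), and it does not reproduce the n-gram, `max_df`, or `min_df` steps.

```python
import re

# Same pattern as the vectorizer: runs of 3+ letters or digits.
TOKEN_PATTERN = r'\b[a-zA-Z0-9]{3,}\b'

# Tiny stand-in for NLTK's English stop-word list (assumption for illustration).
STOP = {'with', 'the', 'and'}

sample = 'a silk top in xs with 100 cotton lining'
tokens = re.findall(TOKEN_PATTERN, sample)    # drops 'a', 'in', 'xs'
kept = [t for t in tokens if t not in STOP]   # then drops 'with'
print(tokens)
print(kept)
```

Tokens shorter than three characters, such as size codes like `xs`, never reach the model. With `min_df=10` and `max_df=0.5`, the vectorizer additionally discards terms that occur in fewer than 10 documents or in more than half of them.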

In [19]:
# Use pipelines to apply logistic regression, SVC, random forest classifier, kNN, and gradient boosting
# Use both accuracy and F1 score to select the model
LogReg_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', LogisticRegression(solver='sag')),
            ])

SVC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', LinearSVC()),
            ])

RandomFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', RandomForestClassifier()),
            ])

kNN_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', KNeighborsClassifier()),
            ])

GradientFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', GradientBoostingClassifier()),
            ])


LogReg_pipeline.fit(X_train, y_train)
SVC_pipeline.fit(X_train, y_train)
RandomFC_pipeline.fit(X_train, y_train)
kNN_pipeline.fit(X_train, y_train)
GradientFC_pipeline.fit(X_train, y_train)
    
# calculating test accuracy and F1 score
prediction1 = LogReg_pipeline.predict(X_test)
prediction2 = SVC_pipeline.predict(X_test)
prediction3 = RandomFC_pipeline.predict(X_test)
prediction4 = kNN_pipeline.predict(X_test)
prediction5 = GradientFC_pipeline.predict(X_test)
    
accuracy1=accuracy_score(y_test, prediction1)
accuracy2=accuracy_score(y_test, prediction2)
accuracy3=accuracy_score(y_test, prediction3)
accuracy4=accuracy_score(y_test, prediction4)
accuracy5=accuracy_score(y_test, prediction5)
    
f1_score1=f1_score(y_test, prediction1)
f1_score2=f1_score(y_test, prediction2)
f1_score3=f1_score(y_test, prediction3)
f1_score4=f1_score(y_test, prediction4)
f1_score5=f1_score(y_test, prediction5)


print('Test accuracy for Logistic Regression is      {}'.format(accuracy1))
print('Test accuracy for Support Vector Classifier is {}'.format(accuracy2))
print('Test accuracy for Random Forest Classifier is {}'.format(accuracy3))
print('Test accuracy for kNearstNeibor is            {}'.format(accuracy4))
print('Test accuracy for Gradient Boosted is         {}'.format(accuracy5))
print("\n")
print('Test f1_score for Logistic Regression is      {}'.format(f1_score1))
print('Test f1_score for Support Vector Classifier is {}'.format(f1_score2))
print('Test f1_score for Random Forest Classifier is {}'.format(f1_score3))
print('Test f1_score for kNearstNeibor is            {}'.format(f1_score4))
print('Test f1_score for Gradient Boosted is         {}'.format(f1_score5))
print("\n")
print('Test mix score for Logistic Regression is      {}'.format((accuracy1+f1_score1)/2))
print('Test mix score for Support Vector Classifier is {}'.format((accuracy2+f1_score2)/2))
print('Test mix score for Random Forest Classifier is {}'.format((accuracy3+f1_score3)/2))
print('Test mix score for kNearstNeibor is            {}'.format((accuracy4+f1_score4)/2))
print('Test mix score for Gradient Boosted is         {}'.format((accuracy5+f1_score5)/2))
print("\n")

Test accuracy for Logistic Regression is      0.9957947855340622
Test accuracy for Support Vector Classifier is 0.9991589571068125
Test accuracy for Random Forest Classifier is 0.9932716568544996
Test accuracy for kNearstNeibor is            0.9983179142136249
Test accuracy for Gradient Boosted is         0.9831791421362489


Test f1_score for Logistic Regression is      0.9972113775794758
Test f1_score for Support Vector Classifier is 0.9994404029099049
Test f1_score for Random Forest Classifier is 0.9955307262569834
Test f1_score for kNearstNeibor is            0.9988814317673378
Test f1_score for Gradient Boosted is         0.9889012208657049


Test mix score for Logistic Regression is      0.996503081556769
Test mix score for Support Vector Classifier is 0.9992996800083587
Test mix score for Random Forest Classifier is 0.9944011915557415
Test mix score for kNearstNeibor is            0.9985996729904814
Test mix score for Gradient Boosted is         0.9860401815009769




The **Support Vector Classifier** model seems to perform best, but we should cross-validate to check that its performance is consistent.
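
The "mix score" printed above is the plain average of accuracy and F1 score. The sketch below recomputes it from scratch on a small made-up label vector, without scikit-learn; the function names and the sample labels are hypothetical and only illustrate the arithmetic.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mix_score(y_true, y_pred):
    """Average of accuracy and F1, as used in this notebook."""
    return (accuracy(y_true, y_pred) + f1_binary(y_true, y_pred)) / 2

y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
print(accuracy(y_true, y_pred), f1_binary(y_true, y_pred), mix_score(y_true, y_pred))
```

In this toy example there are 4 true positives, 1 false positive, and 1 false negative, so accuracy is 0.75, F1 is 0.8, and the mix score is about 0.775.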

### 1.4 10-Fold Cross Validation

In [20]:
LogReg_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', LogisticRegression(solver='sag')),])

SVC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', LinearSVC()),
            ])

RandomFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', RandomForestClassifier()),
            ])

kNN_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', KNeighborsClassifier()),
            ])

GradientFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', GradientBoostingClassifier()),
            ])

X=category_df['input']
y=category_df['clothing']

cv_score1 =cross_val_score(LogReg_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score2 =cross_val_score(SVC_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score3 =cross_val_score(RandomFC_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score4 =cross_val_score(kNN_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score5 =cross_val_score(GradientFC_pipeline, X, y, scoring = 'accuracy',cv = 10)

cv_score6 =cross_val_score(LogReg_pipeline, X, y, scoring = 'f1',cv = 10)
cv_score7 =cross_val_score(SVC_pipeline, X, y, scoring = 'f1',cv = 10)
cv_score8 =cross_val_score(RandomFC_pipeline, X, y, scoring = 'f1',cv = 10)
cv_score9 =cross_val_score(kNN_pipeline, X, y, scoring = 'f1',cv = 10)
cv_score10 =cross_val_score(GradientFC_pipeline, X, y, scoring = 'f1',cv = 10)

cv_score11 =(cv_score1+cv_score6)/2
cv_score12 =(cv_score2+cv_score7)/2
cv_score13 =(cv_score3+cv_score8)/2
cv_score14 =(cv_score4+cv_score9)/2
cv_score15 =(cv_score5+cv_score10)/2

print('Test mix score based on 10-fold CV for Logistic Regression is      {}'.format(cv_score11.mean()))
print('Test mix score based on 10-fold CV for Support Vector Classifier is {}'.format(cv_score12.mean()))
print('Test mix score based on 10-fold CV for Random Forest Classifier is {}'.format(cv_score13.mean()))
print('Test mix score based on 10-fold CV for kNearstNeibor is            {}'.format(cv_score14.mean()))
print('Test mix score based on 10-fold CV for Gradient Boosted is         {}'.format(cv_score15.mean()))

Test mix score based on 10-fold CV for Logistic Regression is      0.9962275438945213
Test mix score based on 10-fold CV for Support Vector Classifier is 0.9989508805253144
Test mix score based on 10-fold CV for Random Forest Classifier is 0.995133880947075
Test mix score based on 10-fold CV for kNearstNeibor is            0.9955809644473869
Test mix score based on 10-fold CV for Gradient Boosted is         0.9869010309289223


Based on the 10-fold accuracy and F1 score, the best model for determining whether an item is clothing is the **Support Vector Classifier**.
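
For readers unfamiliar with k-fold cross-validation, the sketch below shows the mechanics in plain Python: split the data into k folds that preserve the class balance, hold out each fold once, and average the per-fold scores. The round-robin fold assignment is a simplification and not scikit-learn's exact algorithm; the function names, the toy labels, and the majority-class scorer are all hypothetical.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds, class by class, round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def cross_val_mean(labels, k, score_fold):
    """Hold out each fold once and return the mean of the fold scores."""
    scores = []
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        scores.append(score_fold(train_idx, test_idx))
    return sum(scores) / len(scores)

labels = [1] * 6 + [0] * 4  # toy target: 6 positives, 4 negatives

def majority_baseline(train_idx, test_idx):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)

mean_score = cross_val_mean(labels, k=2, score_fold=majority_baseline)
print(mean_score)
```

In the notebook, `cross_val_score(..., cv=10)` plays the role of `cross_val_mean`, and each pipeline is refit on the nine training folds before it is scored on the held-out fold.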

# 2. Build Model for "Fit" category

### 2.1 EDA

In [21]:
#Inspect the tag name and distribution
fit_tag['attribute_value'].value_counts()

relaxed               1794
semifitted             925
straightregular        690
fittedtailored         568
semi-fitted            280
fitted / tailored      194
oversized              181
straight / regular      96
Name: attribute_value, dtype: int64

In [22]:
# align the tag name
fit_tag.loc[fit_tag['attribute_value']=='semifitted','attribute_value']='semi-fitted'
fit_tag.loc[fit_tag['attribute_value']=='straightregular','attribute_value']='straight/regular'
fit_tag.loc[fit_tag['attribute_value']=='straight / regular','attribute_value']='straight/regular'
fit_tag.loc[fit_tag['attribute_value']=='fittedtailored','attribute_value']='fitted/tailored'
fit_tag.loc[fit_tag['attribute_value']=='fitted / tailored','attribute_value']='fitted/tailored'

In [23]:
#Inspect the distribution of target
fit_tag['attribute_value'].value_counts()

relaxed             1794
semi-fitted         1205
straight/regular     786
fitted/tailored      762
oversized            181
Name: attribute_value, dtype: int64
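
The merged counts above are simply the sums of the spelling variants, for example 925 + 280 = 1205 for semi-fitted. The alignment step can also be written as a single lookup table, sketched below in plain Python; the `FIT_ALIASES` and `normalize_fit` names and the sample list are hypothetical, while the mappings are the ones applied in the cell above.

```python
from collections import Counter

# Spelling variants mapped to one canonical tag (same mappings as the cell above).
FIT_ALIASES = {
    'semifitted': 'semi-fitted',
    'straightregular': 'straight/regular',
    'straight / regular': 'straight/regular',
    'fittedtailored': 'fitted/tailored',
    'fitted / tailored': 'fitted/tailored',
}

def normalize_fit(value):
    """Lowercase the tag and replace known variants with the canonical name."""
    value = value.strip().lower()
    return FIT_ALIASES.get(value, value)

raw = ['Relaxed', 'semifitted', 'semi-fitted',
       'fitted / tailored', 'fittedtailored', 'oversized']
counts = Counter(normalize_fit(v) for v in raw)
print(counts)
```

Keeping the variants in one dictionary makes it easy to add a new spelling later without writing another `.loc` assignment.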

In [24]:
fit_tag.drop_duplicates(subset='product_id').shape

(2949, 5)

In [25]:
fit_tag.drop_duplicates(subset=['product_id','attribute_value']).shape
#some products have multiple tags

(3040, 5)

In [26]:
fit_nodup=fit_tag.drop_duplicates(subset=['product_id','attribute_value'])
fit_nodup2=fit_tag.drop_duplicates(subset='product_id')

#list of product that have multiple "fit" tags
flist=fit_nodup[~(fit_nodup.index.isin(fit_nodup2.index))]['product_id'].values

#Inspect the first 10 rows (5 products) that have multiple tags
fit_nodup[fit_nodup['product_id'].isin(flist)].sort_values('product_id').head(10)

Unnamed: 0,product_id,product_color_id,attribute_name,attribute_value,file
113116,01DPCHNQM0PA0SXZZZX85PF2ZJ,01DPCHNQM7FDGGJ2CKEKQWRKGS,fit,relaxed,additional
82160,01DPCHNQM0PA0SXZZZX85PF2ZJ,01DPCHNQM7FDGGJ2CKEKQWRKGS,fit,straight/regular,additional
168,01DPGSTG4M1RXB26QMMN0MPPB8,01DPGSTMYC20XH5ZYK7S4XQEH5,fit,semi-fitted,initial_tags
110482,01DPGSTG4M1RXB26QMMN0MPPB8,01DPGSV457VMYHWPYHW5V79JV9,fit,relaxed,additional
104080,01DPGXGYVJ5DYV5G9CTAPGPFES,01DPGXHGB36NGP1E4PYFS7AYE2,fit,relaxed,additional
7087,01DPGXGYVJ5DYV5G9CTAPGPFES,01DPGXHB9ZFSN94Z8PD5HKAW80,fit,semi-fitted,initial_tags
77335,01DPKMXBJEMQWQWAZHGJK1RS8Z,01DPKMXBJJGRSZAPCX97AXGSN8,fit,fitted/tailored,additional
113235,01DPKMXBJEMQWQWAZHGJK1RS8Z,01DPKMXBJJGRSZAPCX97AXGSN8,fit,semi-fitted,additional
104022,01DS44MT6XDH3DDGJ80TQ6AJ2G,01DS44MT74BQ8X3QQ047Z8Q5HR,fit,oversized,additional
3715,01DS44MT6XDH3DDGJ80TQ6AJ2G,01DS44MT74BQ8X3QQ047Z8Q5HR,fit,semi-fitted,initial_tags


In [27]:
#drop the products that have multiple tags
fit_tag2=fit_tag[~(fit_tag['product_id'].isin(flist))]

In [28]:
#Keep relevant column and remove duplicate record based on "product_id" and "attribute_value"
fit_tagm=fit_tag2[['product_id','attribute_name','attribute_value']].\
drop_duplicates(subset=['product_id','attribute_value'])

#remove duplicate record based on product id
full_datam=full_data[['product_id','brand','brand_category','product_full_name','description','details']]\
.drop_duplicates(subset=['product_id'])
fit_all=fit_tagm.merge(full_datam,how='left',on='product_id')
#Fill Null value
fit_all=fit_all.fillna('')

#Concatenate the text columns into a single input string
fit_all['input'] =fit_all[['brand','brand_category','product_full_name','description','details']]\
.agg(' '.join, axis=1).str.lower()

fit_df=fit_all[['product_id','attribute_value','input']]

#Inspect the distribution of target
fit_df['attribute_value'].value_counts()

relaxed             1093
semi-fitted          654
straight/regular     553
fitted/tailored      455
oversized            105
Name: attribute_value, dtype: int64

### 2.2 Text Preprocessing

In [29]:
fit_df['input']= fit_df['input'].apply(lem_sentences)
fit_df['input'] = fit_df['input'].apply(cleanHtml)
fit_df['input'] = fit_df['input'].apply(cleanPunc)
fit_df['input'] = fit_df['input'].apply(keepAlpha)

In [30]:
#save the dataframe for future usage
fit_df.to_csv('fit_tags.csv')

### 2.3 Model Building

We used the OneVsRestClassifier() to build the multi-class classification model. The model would choose 1 of the 5 tags in "Fit" category as the predicted value
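
One-vs-rest works by fitting one binary classifier per tag ("this tag" versus "all other tags") and then predicting the tag whose classifier is most confident. The sketch below shows only that final decision step; the function name and the score values are made up for illustration and do not come from the fitted models.

```python
def one_vs_rest_predict(scores_by_class):
    """Pick the class whose binary classifier produced the highest score."""
    return max(scores_by_class, key=scores_by_class.get)

# Hypothetical decision scores for one product, one per "Fit" tag.
scores = {
    'relaxed': 0.2,
    'semi-fitted': 0.7,
    'straight/regular': -0.1,
    'fitted/tailored': 0.4,
    'oversized': -0.9,
}
predicted = one_vs_rest_predict(scores)
print(predicted)
```

Because exactly one tag wins, each product receives a single "Fit" prediction even though five separate binary models are trained.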

In [31]:
X=fit_df['input']
y=fit_df['attribute_value']
X_train, X_test,y_train, y_test = train_test_split(X,y, random_state=42, test_size=0.30, shuffle=True,stratify=y)

In [32]:
# Use pipelines to apply logistic regression, SVC, random forest classifier, kNN, and gradient boosting
# Use accuracy to select the model
LogReg_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)),
            ])

SVC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(LinearSVC())),
            ])

RandomFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(RandomForestClassifier())),
            ])

kNN_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(KNeighborsClassifier())),
            ])

GradientFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(GradientBoostingClassifier())),
            ])


LogReg_pipeline.fit(X_train, y_train)
SVC_pipeline.fit(X_train, y_train)
RandomFC_pipeline.fit(X_train, y_train)
kNN_pipeline.fit(X_train, y_train)
GradientFC_pipeline.fit(X_train, y_train)
    
# calculating test accuracy
prediction1 = LogReg_pipeline.predict(X_test)
prediction2 = SVC_pipeline.predict(X_test)
prediction3 = RandomFC_pipeline.predict(X_test)
prediction4 = kNN_pipeline.predict(X_test)
prediction5 = GradientFC_pipeline.predict(X_test)
    
accuracy1=accuracy_score(y_test, prediction1)
accuracy2=accuracy_score(y_test, prediction2)
accuracy3=accuracy_score(y_test, prediction3)
accuracy4=accuracy_score(y_test, prediction4)
accuracy5=accuracy_score(y_test, prediction5)
    
print('Test accuracy for Logistic Regression is      {}'.format(accuracy1))
print('Test accuracy for Support Vector Classifier is {}'.format(accuracy2))
print('Test accuracy for Random Forest Classifier is {}'.format(accuracy3))
print('Test accuracy for kNearstNeibor is            {}'.format(accuracy4))
print('Test accuracy for Gradient Boosted is         {}'.format(accuracy5))
print("\n")

Test accuracy for Logistic Regression is      0.6002331002331003
Test accuracy for Support Vector Classifier is 0.6177156177156177
Test accuracy for Random Forest Classifier is 0.6048951048951049
Test accuracy for kNearstNeibor is            0.5512820512820513
Test accuracy for Gradient Boosted is         0.6118881118881119




The **Support Vector Classifier** model seems to perform best, but we should cross-validate to check that its performance is consistent.

### 2.4 10-Fold Cross Validation

In [33]:
X=fit_df['input']
y=fit_df['attribute_value']

LogReg_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)),
            ])

SVC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(LinearSVC())),
            ])

RandomFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(RandomForestClassifier())),
            ])

kNN_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(KNeighborsClassifier())),
            ])

GradientFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(GradientBoostingClassifier())),
            ])

cv_score1 =cross_val_score(LogReg_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score2 =cross_val_score(SVC_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score3 =cross_val_score(RandomFC_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score4 =cross_val_score(kNN_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score5 =cross_val_score(GradientFC_pipeline, X, y, scoring = 'accuracy',cv = 10)

print('Test accuracy based on 10-fold CV for Logistic Regression is      {}'.format(cv_score1.mean()))
print('Test accuracy based on 10-fold CV for Support Vector Classifier is {}'.format(cv_score2.mean()))
print('Test accuracy based on 10-fold CV for Random Forest Classifier is {}'.format(cv_score3.mean()))
print('Test accuracy based on 10-fold CV for kNearstNeibor is            {}'.format(cv_score4.mean()))
print('Test accuracy based on 10-fold CV for Gradient Boosted is         {}'.format(cv_score5.mean()))

Test accuracy based on 10-fold CV for Logistic Regression is      0.5674825174825175
Test accuracy based on 10-fold CV for Support Vector Classifier is 0.5811188811188812
Test accuracy based on 10-fold CV for Random Forest Classifier is 0.565034965034965
Test accuracy based on 10-fold CV for kNearstNeibor is            0.5083916083916085
Test accuracy based on 10-fold CV for Gradient Boosted is         0.5818181818181818


Based on the 10-fold accuracy, the best model for predicting which "Fit" tag a clothing item should have is the **Gradient Boosted Classifier**, although its margin over the Support Vector Classifier is very small (0.5818 vs. 0.5811).

# 3. Build Model for "Dry Clean Only" tag

### 3.1 EDA

In [34]:
#Inspect the tag name and distribution
dryclean_tag['attribute_value'].value_counts()

no     2182
yes    2158
Name: attribute_value, dtype: int64

In [35]:
dryclean_tag.drop_duplicates(subset='product_id').shape

(2776, 5)

In [36]:
dryclean_tag.drop_duplicates(subset=['product_id','attribute_value']).shape
#some products have multiple tags

(2779, 5)

In [37]:
dryclean_nodup=dryclean_tag.drop_duplicates(subset=['product_id','attribute_value'])
dryclean_nodup2=dryclean_tag.drop_duplicates(subset='product_id')

#list of product that have multi tag
dclist=dryclean_nodup[~(dryclean_nodup.index.isin(dryclean_nodup2.index))]['product_id'].values

dryclean_nodup[dryclean_nodup['product_id'].isin(dclist)].sort_values('product_id')

Unnamed: 0,product_id,product_color_id,attribute_name,attribute_value,file
16870,01DT50VPJ23TNSDY43EQ54CGSK,01DT50VPJ8TJNDBGQPGDRZZJN5,dry_clean_only,yes,initial_tags
89393,01DT50VPJ23TNSDY43EQ54CGSK,01DT50VPJ8TJNDBGQPGDRZZJN5,dry_clean_only,no,additional
111900,01E1JKV4WQYPMJYVXYNN4NVYK8,01E1JKV4X22KTME0JSZMC7G30G,dry_clean_only,yes,additional
112230,01E1JKV4WQYPMJYVXYNN4NVYK8,01E1JKV4X22KTME0JSZMC7G30G,dry_clean_only,no,additional
55344,01E4EAG0KNT88YFSPB3FGSXAH8,01E4EAG0M0PKPE4TDCAPFK4C2R,dry_clean_only,no,additional
55706,01E4EAG0KNT88YFSPB3FGSXAH8,01E4EAG0M0PKPE4TDCAPFK4C2R,dry_clean_only,yes,additional


In [38]:
#remove the records that have contradictory target values
dryclean_tag2=dryclean_tag[~(dryclean_tag['product_id'].isin(dclist))]

In [39]:
#Keep relevant column and remove duplicate record based on "product_id" and "attribute_value"
dryclean_tagm=dryclean_tag2[['product_id','attribute_name','attribute_value']].\
drop_duplicates(subset=['product_id','attribute_value'])

#remove duplicate record based on product id
full_datam=full_data[['product_id','brand','brand_category','product_full_name','description','details']]\
.drop_duplicates(subset=['product_id'])
dryclean_all=dryclean_tagm.merge(full_datam,how='left',on='product_id')
#Fill Null value
dryclean_all=dryclean_all.fillna('')

#Concatenate the text columns into a single input string
dryclean_all['input'] =dryclean_all[['brand','brand_category','product_full_name','description','details']]\
.agg(' '.join, axis=1).str.lower()

dryclean_df=dryclean_all[['product_id','attribute_value','input']]
dryclean_df['attribute_value'].value_counts()

yes    1421
no     1352
Name: attribute_value, dtype: int64

In [40]:
#convert the target value to 1 and 0
#this step helps us compute the F1 score later
dryclean_df['dry_clean']=np.where(dryclean_df['attribute_value']=='yes',1,0)

### 3.2 Text Preprocessing

In [41]:
dryclean_df['input']= dryclean_df['input'].apply(lem_sentences)
dryclean_df['input'] = dryclean_df['input'].apply(cleanHtml)
dryclean_df['input'] = dryclean_df['input'].apply(cleanPunc)
dryclean_df['input'] = dryclean_df['input'].apply(keepAlpha)

In [42]:
#save the dataframe for future use
dryclean_df.to_csv('dryclean_tags.csv')

### 3.3 Model Building

This is also a binary classification model where 1 represents the item is dry-clean-only

In [43]:
X=dryclean_df['input']
y=dryclean_df['dry_clean']
X_train, X_test,y_train, y_test = train_test_split(X,y, random_state=42, test_size=0.30, shuffle=True,stratify=y)

In [44]:
# Use pipelines to apply logistic regression, SVC, random forest classifier, kNN, and gradient boosting
# Use both accuracy and F1 score to select the model
LogReg_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', LogisticRegression(solver='sag')),
            ])

SVC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', LinearSVC()),
            ])

RandomFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', RandomForestClassifier()),
            ])

kNN_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', KNeighborsClassifier()),
            ])

GradientFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', GradientBoostingClassifier()),
            ])


LogReg_pipeline.fit(X_train, y_train)
SVC_pipeline.fit(X_train, y_train)
RandomFC_pipeline.fit(X_train, y_train)
kNN_pipeline.fit(X_train, y_train)
GradientFC_pipeline.fit(X_train, y_train)
    
# calculating test accuracy and F1 score
prediction1 = LogReg_pipeline.predict(X_test)
prediction2 = SVC_pipeline.predict(X_test)
prediction3 = RandomFC_pipeline.predict(X_test)
prediction4 = kNN_pipeline.predict(X_test)
prediction5 = GradientFC_pipeline.predict(X_test)
    
accuracy1=accuracy_score(y_test, prediction1)
accuracy2=accuracy_score(y_test, prediction2)
accuracy3=accuracy_score(y_test, prediction3)
accuracy4=accuracy_score(y_test, prediction4)
accuracy5=accuracy_score(y_test, prediction5)
    
f1_score1=f1_score(y_test, prediction1)
f1_score2=f1_score(y_test, prediction2)
f1_score3=f1_score(y_test, prediction3)
f1_score4=f1_score(y_test, prediction4)
f1_score5=f1_score(y_test, prediction5)

print('Test accuracy for Logistic Regression is      {}'.format(accuracy1))
print('Test accuracy for Support Vector Classifier is {}'.format(accuracy2))
print('Test accuracy for Random Forest Classifier is {}'.format(accuracy3))
print('Test accuracy for kNearstNeibor is            {}'.format(accuracy4))
print('Test accuracy for Gradient Boosted is         {}'.format(accuracy5))
print("\n")
print('Test f1_score for Logistic Regression is      {}'.format(f1_score1))
print('Test f1_score for Support Vector Classifier is {}'.format(f1_score2))
print('Test f1_score for Random Forest Classifier is {}'.format(f1_score3))
print('Test f1_score for kNearstNeibor is            {}'.format(f1_score4))
print('Test f1_score for Gradient Boosted is         {}'.format(f1_score5))
print("\n")
print('Test mix score for Logistic Regression is      {}'.format((accuracy1+f1_score1)/2))
print('Test mix score for Support Vector Classifier is {}'.format((accuracy2+f1_score2)/2))
print('Test mix score for Random Forest Classifier is {}'.format((accuracy3+f1_score3)/2))
print('Test mix score for kNearstNeibor is            {}'.format((accuracy4+f1_score4)/2))
print('Test mix score for Gradient Boosted is         {}'.format((accuracy5+f1_score5)/2))
print("\n")

Test accuracy for Logistic Regression is      0.8966346153846154
Test accuracy for Support Vector Classifier is 0.8786057692307693
Test accuracy for Random Forest Classifier is 0.8978365384615384
Test accuracy for kNearstNeibor is            0.859375
Test accuracy for Gradient Boosted is         0.8966346153846154


Test f1_score for Logistic Regression is      0.9024943310657597
Test f1_score for Support Vector Classifier is 0.8845714285714286
Test f1_score for Random Forest Classifier is 0.9017341040462429
Test f1_score for kNearstNeibor is            0.8674971687429219
Test f1_score for Gradient Boosted is         0.8990610328638498


Test mix score for Logistic Regression is      0.8995644732251875
Test mix score for Support Vector Classifier is 0.881588598901099
Test mix score for Random Forest Classifier is 0.8997853212538907
Test mix score for kNearstNeibor is            0.863436084371461
Test mix score for Gradient Boosted is         0.8978478241242326




The **Random Forest Classifier** has the highest mix score on this split, but only marginally (0.8998 versus 0.8996 for Logistic Regression), so we should cross-validate to see whether the ranking is consistent.

### 3.4 10-Fold Cross Validation

In [45]:
LogReg_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', LogisticRegression(solver='sag')),])

SVC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', LinearSVC()),
            ])

RandomFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', RandomForestClassifier()),
            ])

kNN_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', KNeighborsClassifier()),
            ])

GradientFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', GradientBoostingClassifier()),
            ])

X=dryclean_df['input']
y=dryclean_df['dry_clean']

cv_score1 =cross_val_score(LogReg_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score2 =cross_val_score(SVC_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score3 =cross_val_score(RandomFC_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score4 =cross_val_score(kNN_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score5 =cross_val_score(GradientFC_pipeline, X, y, scoring = 'accuracy',cv = 10)

cv_score6 =cross_val_score(LogReg_pipeline, X, y, scoring = 'f1',cv = 10)
cv_score7 =cross_val_score(SVC_pipeline, X, y, scoring = 'f1',cv = 10)
cv_score8 =cross_val_score(RandomFC_pipeline, X, y, scoring = 'f1',cv = 10)
cv_score9 =cross_val_score(kNN_pipeline, X, y, scoring = 'f1',cv = 10)
cv_score10 =cross_val_score(GradientFC_pipeline, X, y, scoring = 'f1',cv = 10)

cv_score11 =(cv_score1+cv_score6)/2
cv_score12 =(cv_score2+cv_score7)/2
cv_score13 =(cv_score3+cv_score8)/2
cv_score14 =(cv_score4+cv_score9)/2
cv_score15 =(cv_score5+cv_score10)/2

print('Test mix score based on 10-fold CV for Logistic Regression is      {}'.format(cv_score11.mean()))
print('Test mix score based on 10-fold CV for Support Vector Classifier is {}'.format(cv_score12.mean()))
print('Test mix score based on 10-fold CV for Random Forest Classifier is {}'.format(cv_score13.mean()))
print('Test mix score based on 10-fold CV for kNearstNeibor is            {}'.format(cv_score14.mean()))
print('Test mix score based on 10-fold CV for Gradient Boosted is         {}'.format(cv_score15.mean()))

Test mix score based on 10-fold CV for Logistic Regression is      0.8662928038760057
Test mix score based on 10-fold CV for Support Vector Classifier is 0.8576460160455277
Test mix score based on 10-fold CV for Random Forest Classifier is 0.8557361085966676
Test mix score based on 10-fold CV for kNearstNeibor is            0.809174943570073
Test mix score based on 10-fold CV for Gradient Boosted is         0.8674092764188458


Based on the 10-fold mix score (the mean of accuracy and F1), the best model for determining whether a clothing item is dry-clean only is the **Gradient Boosted Classifier**, narrowly ahead of Logistic Regression (0.8674 versus 0.8663).
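
The cell above calls `cross_val_score` twice per pipeline, once for accuracy and once for F1, so each model is refit for each metric. For a randomized model such as Random Forest, the two metrics then come from different fitted models. A minimal sketch of an alternative, assuming scikit-learn's `cross_validate` and a small toy dataset standing in for `dryclean_df` (the texts and labels below are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for dryclean_df['input'] and dryclean_df['dry_clean'].
texts = ["silk dress dry clean only", "cotton tee machine wash cold",
         "wool coat dry clean", "denim jeans machine wash"] * 5
labels = np.array([1, 0, 1, 0] * 5)

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression())])

# One cross-validation run scores both metrics on the same folds and fits.
scores = cross_validate(pipe, texts, labels, cv=5, scoring=["accuracy", "f1"])

# Per-fold mix score: the mean of accuracy and F1, as in the notebook.
mix = (scores["test_accuracy"] + scores["test_f1"]) / 2
print("mean mix score:", mix.mean(), "std across folds:", mix.std())
```

Reporting the standard deviation alongside the mean helps judge whether a gap as small as the one between Gradient Boosted and Logistic Regression is meaningful.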

# 4. Build Model for "Category" tag

### 4.1 EDA

In [46]:
category_tag['attribute_value'].value_counts()

top                      1524
shoe                     1460
bottom                   1365
accessory                 640
one piece                 605
sweater                   545
blazers_coats_jackets     433
sweatshirt_hoodie         208
Name: attribute_value, dtype: int64

In [47]:
#group top clothing into the same tag
top_list=['top','sweater','blazers_coats_jackets','sweatshirt_hoodie']
category_tag['attribute_category']=category_tag['attribute_value'].apply(is_top)

In [48]:
category_tag['attribute_category'].value_counts()

top          2710
shoe         1460
bottom       1365
accessory     640
one piece     605
Name: attribute_category, dtype: int64

In [49]:
#Inspect data to find items with multiple tags
tdf3=category_tag.drop_duplicates(subset=['product_id','attribute_category'])
tdf4=category_tag.drop_duplicates(subset='product_id')

tlist2=tdf3[~(tdf3.index.isin(tdf4.index))]['product_id'].values
tdf3[tdf3['product_id'].isin(tlist2)].sort_values('product_id')

Unnamed: 0,product_id,product_color_id,attribute_name,attribute_value,file,clothing,attribute_category
77339,01DPKMXBJEMQWQWAZHGJK1RS8Z,01DPKMXBJJGRSZAPCX97AXGSN8,category,top,additional,1,top
113233,01DPKMXBJEMQWQWAZHGJK1RS8Z,01DPKMXBJJGRSZAPCX97AXGSN8,category,bottom,additional,1,bottom
3873,01DSP2PPDS647HVVN1GAA1F68Q,01DSP2PPE1RBXCBN4D0HFJCH49,category,top,initial_tags,1,top
82153,01DSP2PPDS647HVVN1GAA1F68Q,01DSP2PPE1RBXCBN4D0HFJCH49,category,accessory,additional,0,accessory
8323,01DT515KEMTBG2ZTNRHYQ29PCT,01DT515KET8PG0ATT955Q3N0VX,category,accessory,initial_tags,0,accessory
75400,01DT515KEMTBG2ZTNRHYQ29PCT,01DT515KET8PG0ATT955Q3N0VX,category,top,additional,1,top
1683,01DTJAZ7F9DRFPNF8W1H7SFT20,01DTJAZ7FJVFXV2QWZVSZ4RNMY,category,top,initial_tags,1,top
98437,01DTJAZ7F9DRFPNF8W1H7SFT20,01DTJAZ7FJVFXV2QWZVSZ4RNMY,category,accessory,additional,0,accessory


In [50]:
#Exclude these records
category_tag3=category_tag[~(category_tag['product_id'].isin(tlist2))]

In [51]:
#Keep relevant columns and remove duplicate records based on "product_id" and "attribute_category"
category_tagm2=category_tag3[['product_id','attribute_name','attribute_category']].\
drop_duplicates(subset=['product_id','attribute_category'])

#remove duplicate record based on product id
full_datam=full_data[['product_id','brand','brand_category','product_full_name','description','details']]\
.drop_duplicates(subset=['product_id'])
category_all2=category_tagm2.merge(full_datam,how='left',on='product_id')
#Fill Null value
category_all2=category_all2.fillna('')

In [52]:
#Inspect distribution
category_all2['attribute_category'].value_counts()

top          1657
bottom        899
shoe          672
one piece     427
accessory     310
Name: attribute_category, dtype: int64

In [53]:
#Exclude the "accessory" type; it is replaced below by records generated from full_data
#for "handbag", "scarf", and "other" (including sunglasses, belt, case etc)
category_part=category_all2[category_all2['attribute_category']!='accessory']

In [54]:
#generate data for the target value "handbag" from full_data
#using rule-based method on "product_full_name" to find target
handbag_tag=full_data[full_data['product_full_name'].fillna('').str.lower().str.contains(r'bag|clutch|purse')]\
.drop_duplicates(subset='product_id')

handbag_tag['attribute_name']='category'
handbag_tag['attribute_category']='handbag'
handbag_tag=handbag_tag[['product_id','attribute_name','attribute_category',
                         'brand','brand_category','product_full_name','description','details']]
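
The substring pattern `bag|clutch|purse` also matches words that merely contain those letters. A small sketch of the difference word boundaries make, using made-up product names; the stricter pattern is an assumption for illustration, not the rule used in this notebook:

```python
import pandas as pd

# Hypothetical product names, not taken from full_data.
names = pd.Series(["Leather Tote Bag", "Baggy Straight Jeans",
                   "Mini Handbag", "Pursed Lip Tee", "Envelope Clutch"])
lowered = names.str.lower()

# Pattern used above: plain substring match.
loose = lowered.str.contains(r"bag|clutch|purse")

# Stricter variant: whole words only, allowing plurals and "handbag".
strict = lowered.str.contains(
    r"\b(?:hand)?bags?\b|\bclutch(?:es)?\b|\bpurses?\b")

print(pd.DataFrame({"name": names, "loose": loose, "strict": strict}))
```

Here the loose pattern flags every row, while the strict one skips "Baggy Straight Jeans" and "Pursed Lip Tee". The trade-off is that a stricter pattern may miss compound names it does not list explicitly.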

In [55]:
#generate data for the target value "scarf" from full_data
scarf_tag=full_data[full_data['product_full_name'].fillna('').str.lower().str.contains(r'scarf|scarves')]\
.drop_duplicates(subset='product_id')

scarf_tag['attribute_name']='category'
scarf_tag['attribute_category']='scarf'
scarf_tag=scarf_tag[['product_id','attribute_name','attribute_category',
                         'brand','brand_category','product_full_name','description','details']]

In [56]:
#generate data for the target value "other" from full_data
#sample 1000 out of over 3000 records to make the distribution of the target values more balanced
other_tag=full_data[full_data['product_full_name'].fillna('').str.lower().str.contains(r'case|accessory|belt|sunglasses')]\
.drop_duplicates(subset='product_id').sample(1000,random_state=42)
other_tag['attribute_name']='category'
other_tag['attribute_category']='other'
other_tag=other_tag[['product_id','attribute_name','attribute_category',
                         'brand','brand_category','product_full_name','description','details']]

In [57]:
#concat the above 4 dataframes
category_all2=pd.concat([category_part,handbag_tag,scarf_tag,other_tag])

In [58]:
#Investigate whether there are multiple tags under the same product
tlist3=category_all2[category_all2.duplicated(subset='product_id')]['product_id'].values

#the rule-based method may not be very reliable
#if an item has multiple tags because of the data extracted from full_data,
#treat the tag from tagged_data as the true value
add_list=['other','scarf','handbag']
true_category=category_all2[(~(category_all2['attribute_category'].isin(add_list)))&(category_all2['product_id'].isin(tlist3))]

#exclude the product with multiple tag
category_all2=category_all2[~(category_all2['product_id'].isin(tlist3))]
#add back the product with true value
category_all2=pd.concat([category_all2,true_category])
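
To make the effect of this step concrete, here is the same logic on a toy frame (the product ids and tags are invented). Note that a product whose only tags are rule-based ones, such as "C" below, is dropped entirely because none of its rows counts as a true value:

```python
import pandas as pd

# Hypothetical rows: A has a tagged value plus a rule-based one,
# C has two rule-based values, B and D are unambiguous.
df = pd.DataFrame({
    "product_id": ["A", "A", "B", "C", "C", "D"],
    "attribute_category": ["top", "handbag", "shoe", "scarf", "handbag", "other"],
})
add_list = ["other", "scarf", "handbag"]  # tags generated by the rules

# Products that appear more than once.
dup_ids = df.loc[df.duplicated(subset="product_id"), "product_id"].unique()
is_dup = df["product_id"].isin(dup_ids)

# Among duplicated products, keep only rows whose tag is not rule-based.
trusted = df[is_dup & ~df["attribute_category"].isin(add_list)]

resolved = pd.concat([df[~is_dup], trusted]).sort_values("product_id")
print(resolved)
```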

In [59]:
#Inspect the dataframe
category_all2['attribute_category'].value_counts()

top          1657
other         960
bottom        899
scarf         738
handbag       710
shoe          672
one piece     427
Name: attribute_category, dtype: int64

In [60]:
#Concatenate strings as input
category_all2=category_all2.fillna('')
category_all2['input'] =category_all2[['brand','brand_category','product_full_name','description','details']].astype('str')\
.agg(' '.join, axis=1).str.lower()

### 4.2 Text Preprocessing

In [61]:
category_all2['input']= category_all2['input'].apply(lem_sentences)
category_all2['input'] = category_all2['input'].apply(cleanHtml)
category_all2['input'] = category_all2['input'].apply(cleanPunc)
category_all2['input'] = category_all2['input'].apply(keepAlpha)

In [62]:
category_all2.shape

(6063, 9)

In [63]:
#save the dataframe for future usage
category_all2.to_csv('category_tags.csv')

### 4.3 Model Building

As with the "fit" tag, we use OneVsRestClassifier() to build a multi-class classification model that predicts one of the 7 tags.
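
As a rough illustration of what the one-vs-rest wrapper does, the sketch below fits it on a handful of invented product texts and inspects the fitted object: one binary classifier is trained per class, and the class with the highest decision score is predicted. The toy texts and labels are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-in for category_all2['input'] and its labels.
texts = ["leather ankle boot", "silk scarf print", "cotton crew tee",
         "suede chelsea boot", "wool scarf fringe", "linen button tee"]
labels = ["shoe", "scarf", "top", "shoe", "scarf", "top"]

ovr = Pipeline([("tfidf", TfidfVectorizer()),
                ("clf", OneVsRestClassifier(LinearSVC()))])
ovr.fit(texts, labels)

clf = ovr.named_steps["clf"]
print("classes:", list(clf.classes_))
print("binary classifiers fitted:", len(clf.estimators_))

# One decision score per class; predict() returns the class with the largest.
print("score matrix shape:", ovr.decision_function(["knit scarf"]).shape)
print("prediction:", ovr.predict(["knit scarf"])[0])
```

All five estimators used in this notebook also accept multi-class targets directly in scikit-learn, so the wrapper mainly makes the one-vs-rest strategy explicit and uniform across models.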

In [64]:
X=category_all2['input']
y=category_all2['attribute_category']
X_train, X_test,y_train, y_test = train_test_split(X,y, random_state=42, test_size=0.30, shuffle=True,stratify=y)

In [65]:
# Using pipelines to apply logistic regression, SVC, random forest classifier, kNN, and gradient boosted
# Using accuracy to select the model
LogReg_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)),
            ])

SVC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(LinearSVC())),
            ])

RandomFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(RandomForestClassifier())),
            ])

kNN_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(KNeighborsClassifier())),
            ])

GradientFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(GradientBoostingClassifier())),
            ])


LogReg_pipeline.fit(X_train, y_train)
SVC_pipeline.fit(X_train, y_train)
RandomFC_pipeline.fit(X_train, y_train)
kNN_pipeline.fit(X_train, y_train)
GradientFC_pipeline.fit(X_train, y_train)
    
# calculating test accuracy
prediction1 = LogReg_pipeline.predict(X_test)
prediction2 = SVC_pipeline.predict(X_test)
prediction3 = RandomFC_pipeline.predict(X_test)
prediction4 = kNN_pipeline.predict(X_test)
prediction5 = GradientFC_pipeline.predict(X_test)
    
accuracy1=accuracy_score(y_test, prediction1)
accuracy2=accuracy_score(y_test, prediction2)
accuracy3=accuracy_score(y_test, prediction3)
accuracy4=accuracy_score(y_test, prediction4)
accuracy5=accuracy_score(y_test, prediction5)
    
print('Test accuracy for Logistic Regression is      {}'.format(accuracy1))
print('Test accuracy for Support Vector Classifier is {}'.format(accuracy2))
print('Test accuracy for Random Forest Classifier is {}'.format(accuracy3))
print('Test accuracy for kNearstNeibor is            {}'.format(accuracy4))
print('Test accuracy for Gradient Boosted is         {}'.format(accuracy5))
print("\n")

Test accuracy for Logistic Regression is      0.9681143485431556
Test accuracy for Support Vector Classifier is 0.9879054425508521
Test accuracy for Random Forest Classifier is 0.960417811984607
Test accuracy for kNearstNeibor is            0.9241341396371633
Test accuracy for Gradient Boosted is         0.9758108851017042




The **Support Vector Classifier** has the highest test accuracy on this split (0.988), but we should cross-validate to see whether that performance is consistent.

### 4.4 10-Fold Cross Validation

In [66]:
X=category_all2['input']
y=category_all2['attribute_category']

LogReg_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)),
            ])

SVC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(LinearSVC())),
            ])

RandomFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(RandomForestClassifier())),
            ])

kNN_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(KNeighborsClassifier())),
            ])

GradientFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(GradientBoostingClassifier())),
            ])

cv_score1 =cross_val_score(LogReg_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score2 =cross_val_score(SVC_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score3 =cross_val_score(RandomFC_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score4 =cross_val_score(kNN_pipeline, X, y, scoring = 'accuracy',cv = 10)
cv_score5 =cross_val_score(GradientFC_pipeline, X, y, scoring = 'accuracy',cv = 10)

print('Test accuracy based on 10-fold CV for Logistic Regression is      {}'.format(cv_score1.mean()))
print('Test accuracy based on 10-fold CV for Support Vector Classifier is {}'.format(cv_score2.mean()))
print('Test accuracy based on 10-fold CV for Random Forest Classifier is {}'.format(cv_score3.mean()))
print('Test accuracy based on 10-fold CV for kNearstNeibor is            {}'.format(cv_score4.mean()))
print('Test accuracy based on 10-fold CV for Gradient Boosted is         {}'.format(cv_score5.mean()))

Test accuracy based on 10-fold CV for Logistic Regression is      0.9661873848010831
Test accuracy based on 10-fold CV for Support Vector Classifier is 0.981363193980024
Test accuracy based on 10-fold CV for Random Forest Classifier is 0.9572794841263368
Test accuracy based on 10-fold CV for kNearstNeibor is            0.9017001864931139
Test accuracy based on 10-fold CV for Gradient Boosted is         0.9731169904469853


Based on the 10-fold accuracy, the best model for determining which category an item belongs to is the **Support Vector Classifier**.
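
Passing `cv = 10` with a classifier makes scikit-learn use stratified folds without shuffling. Because `category_all2` is built by concatenating several sources in order, it may be worth making the splitter explicit and shuffling with a fixed seed. A minimal sketch on invented data, assuming `StratifiedKFold` with `shuffle=True` (the fold count and seed are arbitrary choices for the example):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-in data: 3 classes with 10 rows each.
texts = np.array(["leather boot", "silk scarf", "cotton tee"] * 10)
labels = np.array(["shoe", "scarf", "top"] * 10)

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LinearSVC())])

# Explicit splitter: stratified, shuffled, reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, texts, labels, scoring="accuracy", cv=cv)

# Each test fold keeps the class proportions of the full data.
fold_counts = [np.unique(labels[test_idx], return_counts=True)[1].tolist()
               for _, test_idx in cv.split(texts, labels)]
print("per-fold class counts:", fold_counts)
print("mean accuracy:", scores.mean())
```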

# 5. Apply the models above on 1000 rows of Full Data

### 5.1 Data Preprocessing

In [67]:
test_df=full_data[~(full_data['product_id'].isin(category_df['product_id'].values))].head(1000)

In [68]:
test_df['input']=test_df[['brand','brand_category','product_full_name','description','details']].fillna('')\
.agg(' '.join, axis=1).str.lower()

In [69]:
test_df['input']= test_df['input'].apply(lem_sentences)
test_df['input'] = test_df['input'].apply(cleanHtml)
test_df['input']= test_df['input'].apply(cleanPunc)
test_df['input'] = test_df['input'].apply(keepAlpha)

In [70]:
test_text=test_df['input']

### 5.2 Model Deployment

In [71]:
X=category_df['input']
y=category_df['clothing']

SVC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', LinearSVC()),
            ])

SVC_pipeline.fit(X, y)
prediction = SVC_pipeline.predict(test_text)
test_df['clothing']=prediction

In [72]:
X=fit_df['input']
y=fit_df['attribute_value']

GradientFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(GradientBoostingClassifier())),
            ])

GradientFC_pipeline.fit(X, y)
prediction = GradientFC_pipeline.predict(test_text)
test_df['fit']=prediction

In [73]:
X=dryclean_df['input']
y=dryclean_df['dry_clean']

GradientFC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', GradientBoostingClassifier()),
            ])

GradientFC_pipeline.fit(X, y)
prediction = GradientFC_pipeline.predict(test_text)
test_df['dry_clean_only']=prediction
test_df['dry_clean_only']=np.where(test_df['dry_clean_only']==1,'yes',"no")

In [74]:
X=category_all2['input']
y=category_all2['attribute_category']

SVC_pipeline = Pipeline([('tfidf', vectorizer),
                ('clf', OneVsRestClassifier(LinearSVC())),
            ])

SVC_pipeline.fit(X, y)
prediction = SVC_pipeline.predict(test_text)
test_df['category']=prediction

In [75]:
test_df['fit']=np.where(test_df['clothing']==1,test_df['fit'],'')
test_df['dry_clean_only']=np.where(test_df['clothing']==1,test_df['dry_clean_only'],'')     
test_df['category']=np.where(test_df['category']!='other',test_df['category'],'') 

### 5.3 Sample Result

In [76]:
test_df['fit'].value_counts()

relaxed             564
                    171
semi-fitted         149
fitted/tailored      72
oversized            25
straight/regular     19
Name: fit, dtype: int64

In [77]:
test_df['dry_clean_only'].value_counts()

no     537
yes    292
       171
Name: dry_clean_only, dtype: int64

In [78]:
test_df['category'].value_counts()

top          354
handbag      174
bottom       160
             111
one piece     79
shoe          75
scarf         47
Name: category, dtype: int64

In [79]:
#the overall methods are largely based on this work
#https://github.com/nkartik94/Multi-Label-Text-Classification/blob/master/Mark_6.ipynb
#https://towardsdatascience.com/journey-to-the-center-of-multi-label-classification-384c40229bff