The aim of this project is to classify dress recommendations based on several dress attributes.
The dataset was downloaded from the UCI Machine Learning Repository at the following link:
https://archive.ics.uci.edu/ml/machine-learning-databases/00289/

In [1]:
import numpy as np
import pandas as pd
df = pd.read_excel("Attribute DataSet.xlsx")

In [2]:
#Display first rows of the dataset
df.head()

Unnamed: 0,Dress_ID,Style,Price,Rating,Size,Season,NeckLine,SleeveLength,waiseline,Material,FabricType,Decoration,Pattern Type,Recommendation
0,1006032852,Sexy,Low,4.6,M,Summer,o-neck,sleevless,empire,,chiffon,ruffles,animal,1
1,1212192089,Casual,Low,0.0,L,Summer,o-neck,Petal,natural,microfiber,,ruffles,animal,0
2,1190380701,vintage,High,0.0,L,Automn,o-neck,full,natural,polyster,,,print,0
3,966005983,Brief,Average,4.6,L,Spring,o-neck,full,natural,silk,chiffon,embroidary,print,1
4,876339541,cute,Low,4.5,M,Summer,o-neck,butterfly,natural,chiffonfabric,chiffon,bow,dot,0


In [3]:
#Display dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Dress_ID        500 non-null    int64  
 1   Style           500 non-null    object 
 2   Price           498 non-null    object 
 3   Rating          500 non-null    float64
 4   Size            500 non-null    object 
 5   Season          498 non-null    object 
 6   NeckLine        497 non-null    object 
 7   SleeveLength    498 non-null    object 
 8   waiseline       413 non-null    object 
 9   Material        372 non-null    object 
 10  FabricType      234 non-null    object 
 11  Decoration      264 non-null    object 
 12  Pattern Type    391 non-null    object 
 13  Recommendation  500 non-null    int64  
dtypes: float64(1), int64(2), object(11)
memory usage: 54.8+ KB


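As we can see from the dataset information, nine of the 14 columns contain null values. Price, Season, NeckLine and SleeveLength are missing only two or three values each, but waiseline, Material, FabricType, Decoration and Pattern Type are missing many more, and FabricType is null for more than half of the rows. Let's explore the data in more detail. First, let's drop the Dress_ID attribute, since it is an identifier and does not affect our classification problem.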
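
To rank columns by how much data they are missing, the share of nulls per column can be computed directly. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dress dataset); the same expression should work on df.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the dress data; values are illustrative only.
toy = pd.DataFrame({
    "Style": ["casual", "sexy", "party", "cute"],
    "Material": ["silk", np.nan, "cotton", np.nan],
    "FabricType": [np.nan, np.nan, np.nan, "chiffon"],
})

# isna() marks missing cells; the column-wise mean is the fraction missing.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the sparsest columns first, which makes it easier to decide which ones are candidates for dropping.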


In [4]:
df = df.drop("Dress_ID", axis=1)

In [5]:
df["Recommendation"].value_counts()

0    290
1    210
Name: Recommendation, dtype: int64

The Recommendation target seems reasonably balanced, so there is no need to modify it any further.
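
To put a number on the balance, the counts printed above can be turned into proportions. This is a minimal sketch that rebuilds the target from those counts rather than reading the file.

```python
import pandas as pd

# Rebuild the target from the counts printed above (290 zeros, 210 ones).
target = pd.Series([0] * 290 + [1] * 210, name="Recommendation")

# normalize=True returns fractions instead of raw counts.
proportions = target.value_counts(normalize=True)
print(proportions)
```

The split is 58% to 42%, so a classifier that always predicts 0 would already score about 0.58 accuracy on the full dataset. That is a useful reference point when reading model scores.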
Let's explore the attributes in more detail.

In [6]:
df["Style"].value_counts()

Casual      232
Sexy         69
party        51
cute         45
vintage      25
bohemian     24
Brief        18
work         17
Novelty       8
sexy          7
Flare         2
OL            1
fashion       1
Name: Style, dtype: int64

The data above will require some cleaning, as Sexy and sexy are the same style. There are also some values that occur rarely; we will see later how to deal with them.

In [7]:
#Select the object (string) columns and convert all their values to lower case
from sklearn.compose import make_column_selector as selector
categorical_selector = selector(dtype_include=object)
categorial_columns = categorical_selector(df)
df[categorial_columns] = df[categorial_columns].apply(lambda x: x.str.lower(), axis=0)
df["Season"].value_counts()

summer    160
winter    145
spring    124
automn     61
autumn      8
Name: Season, dtype: int64

In [8]:
#let's clean the Season attribute
df["Season"] = df["Season"].str.replace("automn", "autumn")
df.head()

Unnamed: 0,Style,Price,Rating,Size,Season,NeckLine,SleeveLength,waiseline,Material,FabricType,Decoration,Pattern Type,Recommendation
0,sexy,low,4.6,m,summer,o-neck,sleevless,empire,,chiffon,ruffles,animal,1
1,casual,low,0.0,l,summer,o-neck,petal,natural,microfiber,,ruffles,animal,0
2,vintage,high,0.0,l,autumn,o-neck,full,natural,polyster,,,print,0
3,brief,average,4.6,l,spring,o-neck,full,natural,silk,chiffon,embroidary,print,1
4,cute,low,4.5,m,summer,o-neck,butterfly,natural,chiffonfabric,chiffon,bow,dot,0


In [9]:
#Let's clean the Size attribute by replacing "small" with "s"
df["Size"] = df["Size"].str.replace("small", "s")
df["FabricType"].value_counts()
#Replace the misspelling "shiffon" with "chiffon"
df["FabricType"] = df["FabricType"].str.replace("shiffon", "chiffon")
#Clean the SleeveLength attribute: unify the spelling variants of sleeveless and threequarter
df["SleeveLength"] = df["SleeveLength"].str.replace("^(sl).+", "sleeveless", regex=True)
df["SleeveLength"] = df["SleeveLength"].str.replace("^(thr).+", "threequarter", regex=True)
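
To see what the two regular expressions do, here is a small sketch on made-up values. The misspellings below are hypothetical and chosen only to exercise the patterns; they are not claimed to be the exact variants in the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical spelling variants, used only to illustrate the patterns.
sleeves = pd.Series(["sleevless", "sleeveless", "sleveless",
                     "threequater", "full", np.nan])

# "^(sl).+" matches a value that starts with "sl"; the greedy ".+" consumes
# the rest of the string, so the whole value is replaced.
cleaned = sleeves.str.replace("^(sl).+", "sleeveless", regex=True)
cleaned = cleaned.str.replace("^(thr).+", "threequarter", regex=True)
print(cleaned.tolist())
```

Missing values stay missing, and values that do not start with the prefix are left alone. Note that the pattern rewrites any value beginning with "sl", so it is only safe on a column where no other legitimate category shares that prefix.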

After cleaning the data, we can see that the following attributes/features still have a lot of null values:
Material, FabricType, Decoration and Pattern Type. One option is to drop them before applying any ML algorithm. The drop in the next cell is commented out, so these columns are kept in what follows.

In [10]:
#df = df.drop(["Material", "FabricType", "Decoration", "Pattern Type"], axis=1)


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Style           500 non-null    object 
 1   Price           498 non-null    object 
 2   Rating          500 non-null    float64
 3   Size            500 non-null    object 
 4   Season          498 non-null    object 
 5   NeckLine        497 non-null    object 
 6   SleeveLength    498 non-null    object 
 7   waiseline       413 non-null    object 
 8   Material        372 non-null    object 
 9   FabricType      234 non-null    object 
 10  Decoration      264 non-null    object 
 11  Pattern Type    391 non-null    object 
 12  Recommendation  500 non-null    int64  
dtypes: float64(1), int64(1), object(11)
memory usage: 50.9+ KB


In [12]:
df["Style"].value_counts()

casual      232
sexy         76
party        51
cute         45
vintage      25
bohemian     24
brief        18
work         17
novelty       8
flare         2
ol            1
fashion       1
Name: Style, dtype: int64

In [13]:
#Let's group all the Styles with fewer than 17 occurrences into a new group "other"
df["Style"] = df["Style"].replace(["novelty","flare", "ol", "fashion"], "other")
df["Style"].value_counts()

casual      232
sexy         76
party        51
cute         45
vintage      25
bohemian     24
brief        18
work         17
other        12
Name: Style, dtype: int64
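
The grouping above lists the rare labels by hand. As an alternative, the labels can be derived from a count threshold. The helper group_rare below is a hypothetical name introduced for this sketch and is not part of the original notebook; it is shown on toy data.

```python
import pandas as pd

def group_rare(values: pd.Series, min_count: int, other: str = "other") -> pd.Series:
    """Replace categories seen fewer than min_count times with one label."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # where() keeps entries for which the condition is True and
    # substitutes `other` everywhere else; missing values are kept as-is.
    return values.where(~values.isin(rare), other)

# Toy data: two frequent styles and two that appear only once.
styles = pd.Series(["casual"] * 5 + ["sexy"] * 3 + ["ol", "fashion"])
grouped = group_rare(styles, min_count=3)
print(grouped.value_counts())
```

With the counts shown above, calling the helper on df["Style"] with min_count=17 should select the same four labels (novelty, flare, ol and fashion). It also avoids retyping labels, which is where spelling slips tend to creep in.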

In [14]:
df["NeckLine"].value_counts()

o-neck             271
v-neck             124
slash-neck          25
boat-neck           19
sweetheart          15
turndowncollor      13
bowneck             10
peterpan-collor      6
sqare-collor         5
open                 3
scoop                2
ruffled              1
mandarin-collor      1
halter               1
backless             1
Name: NeckLine, dtype: int64

In [15]:
#Let's do the same for NeckLine by grouping all values with fewer than 10 occurrences into a new group "other"
df["NeckLine"] = df["NeckLine"].replace(["peterpan-collor", "sqare-collor", 
                                         "open", "scoop", "ruffled", "mandarin-collor", 
                                         "halter", "backless"], "other")

In [16]:
df["SleeveLength"].value_counts()

sleeveless        232
full               97
short              96
halfsleeve         35
threequarter       28
capsleeves          3
cap-sleeves         2
petal               1
butterfly           1
turndowncollor      1
half                1
urndowncollor       1
Name: SleeveLength, dtype: int64

In [17]:
#Let's do the same for the SleeveLength attribute by grouping all values with fewer than 28 occurrences into a new group "other"
df["SleeveLength"] = df["SleeveLength"].replace(["capsleeves", "cap-sleeves", "petal", 
                                                 "butterfly", "turndowncollor",
                                                "half", "urndowncollor"], "other")
df["SleeveLength"].value_counts()
df["waiseline"].value_counts()

natural     304
empire      104
dropped       4
princess      1
Name: waiseline, dtype: int64

Finally, before applying any ML algorithm, note that the waiseline attribute contains 4 values, with the large majority of rows falling into 2 of them (natural and empire). Let's drop the rows with the minority values and look at the final shape of our dataset.

In [18]:
df = df[(df["waiseline"] != 'dropped') & (df["waiseline"] != 'princess')]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 495 entries, 0 to 499
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Style           495 non-null    object 
 1   Price           493 non-null    object 
 2   Rating          495 non-null    float64
 3   Size            495 non-null    object 
 4   Season          493 non-null    object 
 5   NeckLine        492 non-null    object 
 6   SleeveLength    494 non-null    object 
 7   waiseline       408 non-null    object 
 8   Material        369 non-null    object 
 9   FabricType      233 non-null    object 
 10  Decoration      262 non-null    object 
 11  Pattern Type    389 non-null    object 
 12  Recommendation  495 non-null    int64  
dtypes: float64(1), int64(1), object(11)
memory usage: 54.1+ KB
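
Note that the filter removes only the 5 rows with minority values: the row count goes from 500 to 495 and the non-null waiseline count from 413 to 408, so rows with a missing waiseline are still present. The sketch below, on toy data, shows why: a != comparison evaluates to True for a missing value. It also shows an equivalent, shorter filter using isin.

```python
import numpy as np
import pandas as pd

waist = pd.Series(["natural", "empire", "dropped", np.nan, "princess", "natural"])

# Element-wise != is True for NaN, so rows with a missing value pass the filter.
mask_ne = (waist != "dropped") & (waist != "princess")

# isin() is False for NaN, so its negation keeps exactly the same rows.
mask_isin = ~waist.isin(["dropped", "princess"])

print(waist[mask_ne].tolist())
```

Both masks keep the two natural rows, the empire row and the missing row. The missing values are only removed later, if the rows are dropped with dropna.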


In [19]:
#Drop every row that contains at least one null value
df = df.dropna()
#df = df.drop("waiseline", axis=1)
#df = df.drop("SleeveLength", axis=1)
data, target = df.drop("Recommendation", axis=1), df["Recommendation"]
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99 entries, 3 to 499
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Style         99 non-null     object 
 1   Price         99 non-null     object 
 2   Rating        99 non-null     float64
 3   Size          99 non-null     object 
 4   Season        99 non-null     object 
 5   NeckLine      99 non-null     object 
 6   SleeveLength  99 non-null     object 
 7   waiseline     99 non-null     object 
 8   Material      99 non-null     object 
 9   FabricType    99 non-null     object 
 10  Decoration    99 non-null     object 
 11  Pattern Type  99 non-null     object 
dtypes: float64(1), object(11)
memory usage: 10.1+ KB
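
The row count falls from 495 to 99 because dropna removes a row as soon as any one of its columns is null, and several sparse columns are still in the frame. The sketch below uses a made-up frame (not the dress data) to show how much of that loss comes from the sparse columns.

```python
import numpy as np
import pandas as pd

# Toy frame: two complete columns and two sparse ones (illustrative values).
toy = pd.DataFrame({
    "Style": ["casual", "sexy", "party", "cute", "work", "brief"],
    "Rating": [4.6, 0.0, 4.5, 0.0, 4.8, 4.7],
    "FabricType": ["chiffon", np.nan, np.nan, "jersey", np.nan, np.nan],
    "Decoration": [np.nan, "bow", np.nan, "lace", np.nan, "ruffles"],
})

# Rows that survive when every column must be non-null.
rows_all = len(toy.dropna())

# Rows that survive when the sparse columns are dropped first.
sparse = ["FabricType", "Decoration"]
rows_without_sparse = len(toy.drop(columns=sparse).dropna())

print(rows_all, rows_without_sparse)
```

In this toy case only one of six rows survives with all columns, while all six survive without the sparse ones. Whether keeping the sparse columns is worth losing that many rows is a judgment call that is best checked by comparing model scores for both variants.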


The final dataset is now ready (99 rows remain after dropping nulls). We only need to one-hot encode the categorical attributes and standardise the Rating attribute.
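
As a quick illustration of what these two transformations compute, here is a sketch on toy data using plain pandas. It is not the pipeline used in this notebook; pd.get_dummies and the manual z-score are stand-ins chosen to make the arithmetic visible.

```python
import pandas as pd

# Toy data, illustrative values only.
toy = pd.DataFrame({
    "Season": ["summer", "winter", "summer", "spring"],
    "Rating": [4.0, 0.0, 5.0, 3.0],
})

# One-hot encoding: one indicator column per distinct category.
encoded = pd.get_dummies(toy["Season"], prefix="Season")

# Standardisation: subtract the mean and divide by the population
# standard deviation (ddof=0), which is the convention StandardScaler uses.
rating = toy["Rating"]
scaled = (rating - rating.mean()) / rating.std(ddof=0)

print(encoded)
print(scaled)
```

Each row has exactly one active indicator, and the scaled ratings have mean 0 and unit variance. In a cross-validated pipeline the mean and standard deviation are estimated on the training folds only, which is one reason to prefer the scikit-learn transformers over doing this by hand on the whole dataset.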

In [20]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.compose import make_column_selector as selector
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

numerical_cols = selector(dtype_exclude=object)
categorical_cols = selector(dtype_include=object)

numerical_data = numerical_cols(data)
categorical_data = categorical_cols(data)
numerical_preprocessor = StandardScaler()
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer([('standard_scaler', numerical_preprocessor, numerical_data),
                                 ('one_hot_encoder', categorical_preprocessor, categorical_data)])

model = make_pipeline(preprocessor, LogisticRegression())
cv_results = cross_validate(model, data, target)
cv_results

{'fit_time': array([0.01996541, 0.02002215, 0.01907873, 0.02081585, 0.02012086]),
 'score_time': array([0., 0., 0., 0., 0.]),
 'test_score': array([0.75      , 0.6       , 0.65      , 0.6       , 0.63157895])}
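
The five fold scores are easier to read as a mean with a spread. The sketch below simply copies the test scores from the output above; with the default scoring for a classifier these are accuracies.

```python
import numpy as np

# Test scores copied from the cross-validation output above.
test_scores = np.array([0.75, 0.6, 0.65, 0.6, 0.63157895])

mean_score = test_scores.mean()
std_score = test_scores.std()
print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

The mean accuracy is roughly 0.65. With 99 rows split into five folds, each test fold holds only about 20 dresses, so a single fold score moves by 0.05 for every extra correct prediction and the estimate should be read as coarse.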