# 04-1 Product Descriptions

In this notebook, we prepare the data for a text-based feature extraction process. Because the data set contains more than 40,000 distinct products, each with its own textual description, we partition the products by their ```index_name``` category. A query then only has to scan the relevant subset rather than all 40,000+ item descriptions, which makes it faster.
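The partitioning idea can be sketched on a small scale as follows. This is only an illustration under stated assumptions: the `toy` table and its rows are made up, and the `partitions` dictionary is a hypothetical in-memory stand-in for the per-category files written at the end of the notebook.

```python
import pandas as pd

# Toy stand-in for the articles table; these rows are made up for illustration.
toy = pd.DataFrame({
    "product_code": [1, 2, 3, 4],
    "index_name": ["Ladieswear", "Menswear", "Ladieswear", "Sport"],
    "detail_desc": ["Jersey top.", "Cotton socks.", "A-line dress.", "Sports vest top."],
})

# Split the table once into one DataFrame per category.
partitions = {
    name: group.reset_index(drop=True)
    for name, group in toy.groupby("index_name")
}

# A query for one category now scans only that category's rows.
ladies = partitions["Ladieswear"]
matches = ladies[ladies["detail_desc"].str.contains("dress", case=False)]
print(matches["product_code"].tolist())
```

The cost of splitting is paid once up front; every later lookup touches only the rows of a single category.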

In [3]:
import joblib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

In [4]:
articles = pd.read_csv('../data/articles.csv')

In [5]:
articles.head()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


In [13]:
features = ['index_name', 'detail_desc']

In [34]:
articles_desc = articles[['product_code']+features]

In [35]:
articles_desc.value_counts(subset='index_name')

index_name
Ladieswear                        26001
Divided                           15149
Menswear                          12553
Children Sizes 92-140             12007
Children Sizes 134-170             9214
Baby Sizes 50-98                   8875
Ladies Accessories                 6961
Lingeries/Tights                   6775
Children Accessories, Swimwear     4615
Sport                              3392
dtype: int64

In [36]:
div_names = articles_desc.value_counts(subset='index_name').index

In [37]:
articles_desc.isnull().sum()

product_code      0
index_name        0
detail_desc     416
dtype: int64

In [38]:
# Reassign instead of using inplace=True: articles_desc is a slice of
# articles, so modifying it in place triggers a SettingWithCopyWarning.
articles_desc = articles_desc.dropna()


In [39]:
articles_desc.isnull().sum()

product_code    0
index_name      0
detail_desc     0
dtype: int64

In [41]:
articles_desc

Unnamed: 0,product_code,index_name,detail_desc
0,108775,Ladieswear,Jersey top with narrow shoulder straps.
1,108775,Ladieswear,Jersey top with narrow shoulder straps.
2,108775,Ladieswear,Jersey top with narrow shoulder straps.
3,110065,Lingeries/Tights,"Microfibre T-shirt bra with underwired, moulde..."
4,110065,Lingeries/Tights,"Microfibre T-shirt bra with underwired, moulde..."
...,...,...,...
105537,953450,Menswear,Socks in a fine-knit cotton blend with a small...
105538,953763,Ladieswear,Loose-fitting sports vest top in ribbed fast-d...
105539,956217,Ladieswear,"Short, A-line dress in jersey with a round nec..."
105540,957375,Divided,Large plastic hair claw.


In [42]:
# Keep the first row for each product code. Reassigning (rather than
# inplace=True) avoids a SettingWithCopyWarning on the sliced DataFrame.
articles_desc = articles_desc.drop_duplicates(subset='product_code')


In [43]:
articles_desc

Unnamed: 0,product_code,index_name,detail_desc
0,108775,Ladieswear,Jersey top with narrow shoulder straps.
3,110065,Lingeries/Tights,"Microfibre T-shirt bra with underwired, moulde..."
6,111565,Lingeries/Tights,"Semi shiny nylon stockings with a wide, reinfo..."
8,111586,Lingeries/Tights,Tights with built-in support to lift the botto...
9,111593,Lingeries/Tights,"Semi shiny tights that shape the tummy, thighs..."
...,...,...,...
105537,953450,Menswear,Socks in a fine-knit cotton blend with a small...
105538,953763,Ladieswear,Loose-fitting sports vest top in ribbed fast-d...
105539,956217,Ladieswear,"Short, A-line dress in jersey with a round nec..."
105540,957375,Divided,Large plastic hair claw.


In [44]:
articles_desc.value_counts(subset='index_name')

index_name
Ladieswear                        13394
Divided                            6731
Menswear                           4422
Children Sizes 92-140              4216
Baby Sizes 50-98                   3987
Ladies Accessories                 3959
Children Sizes 134-170             3223
Lingeries/Tights                   2896
Children Accessories, Swimwear     2767
Sport                              1451
dtype: int64
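Note that `drop_duplicates(subset='product_code')` keeps the first row for each product code, so a product whose variants carry different `index_name` or `detail_desc` values ends up represented by its first variant only. A minimal sketch of how one might check for such cases, using a made-up toy table rather than the real data:

```python
import pandas as pd

# Toy rows (made up): product 20 appears under two different categories.
toy = pd.DataFrame({
    "product_code": [10, 10, 20, 20],
    "index_name": ["Ladieswear", "Ladieswear", "Divided", "Sport"],
    "detail_desc": ["Jersey top.", "Jersey top.", "Hair claw.", "Hair claw."],
})

# Count the distinct values per product code in each column.
distinct = toy.groupby("product_code")[["index_name", "detail_desc"]].nunique()

# Product codes whose variants disagree in at least one column.
ambiguous = distinct[(distinct > 1).any(axis=1)].index.tolist()
print(ambiguous)
```

Running the same check on the table before de-duplication would show whether the choice of "first row" matters for any product.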

Finally, we save the data to 10 distinct CSV files, one for each ```index_name``` category.

For example, the dataset for the ```Ladieswear``` category looks like this:

In [46]:
articles_desc[ articles_desc['index_name'] == 'Ladieswear' ][['product_code', 'detail_desc']]

Unnamed: 0,product_code,detail_desc
0,108775,Jersey top with narrow shoulder straps.
15,116379,Fitted top in soft stretch jersey with a wide ...
23,120129,Leggings in soft jersey with a wide panel at t...
33,129085,3/4-length leggings in stretch jersey with an ...
115,179123,Jersey leggings with an elasticated waist.
...,...,...
105535,952937,"Fitted, calf-length dress in viscose jersey wi..."
105536,952938,Fitted top in jersey with a round neckline and...
105538,953763,Loose-fitting sports vest top in ribbed fast-d...
105539,956217,"Short, A-line dress in jersey with a round nec..."


In [48]:
# Map each index_name category to the suffix of its output file.
file_suffixes = {
    'Ladieswear': 'ladies',
    'Divided': 'divided',
    'Menswear': 'mens',
    'Children Sizes 92-140': 'child1',
    'Children Sizes 134-170': 'child2',
    'Baby Sizes 50-98': 'babies',
    'Ladies Accessories': 'ladies_acc',
    'Children Accessories, Swimwear': 'child_acc',
    'Lingeries/Tights': 'lingerie_tights',
    'Sport': 'sport',
}

for index_name, suffix in file_suffixes.items():
    subset = articles_desc.loc[articles_desc['index_name'] == index_name, ['product_code', 'detail_desc']]
    subset.to_csv(f'../data/product_desc_{suffix}.csv', index=False)
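A later step would presumably read one of these files back and look up descriptions by product code. The sketch below shows one way that could work. It is an assumption rather than part of this notebook: it writes a two-row toy table to a temporary directory under the hypothetical name `product_desc_example.csv` instead of touching the real files in `../data`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two-row toy table with the same columns as the saved files.
toy = pd.DataFrame({
    "product_code": [108775, 957375],
    "detail_desc": [
        "Jersey top with narrow shoulder straps.",
        "Large plastic hair claw.",
    ],
})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; the real files live in ../data.
    path = Path(tmp) / "product_desc_example.csv"
    toy.to_csv(path, index=False)
    # Read the partition back, pinning the key column to an integer dtype.
    loaded = pd.read_csv(path, dtype={"product_code": "int64"})

# Index by product code so a single description is a direct lookup.
desc_by_code = loaded.set_index("product_code")["detail_desc"]
print(desc_by_code.loc[108775])
```

Indexing by `product_code` relies on each code appearing once per file, which the de-duplication step above ensures.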