For this week's challenge, let’s do a bit of data preparation using shoe inventory data!

The following dataset contains a list of shoes with an ID, colors, and sizes available.


For this challenge, you want to clean the inventory data and reorganize the list to have:
- A new ID number
- The size
- A comment stating how many colors are available, and
- Three columns for color

Keep in mind that the color names are separated by a slash (/), sometimes with spaces around it.

In [2]:
import pandas as pd

In [18]:
df = pd.read_csv('data/shoes.csv')

In [19]:
df.head()

,id,colors,sizes
0,AWpyySsJAGTnQPR7wNt4,Black,8
1,AWpyyyb3AGTnQPR7wN-u,Taupe,6 M US
2,AWpyzlajAGTnQPR7wOX8,Black,5
3,AWpyxomE0U_gzG0hkA1q,Black/Multi,9.5 BM US
4,AWpyxChWJbEilcB6RhWx,White,11
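
The sizes column mixes the numeric size with width and region text ("6 M US", "9.5 BM US"). One possible way to isolate the number is `str.extract` with a regular expression. This is only a sketch: it assumes the shoe size is the first integer or decimal number in the string, and `size_num` is a hypothetical name that does not appear in the notebook.

```python
import pandas as pd

# Sample values copied from the notebook outputs
sizes = pd.Series(["8", "6 M US", "5", "9.5 BM US", "11", "US 8.5", "7.5 Medium BM", "10 N"])

# Assumption: the size is the first integer or decimal number found in the string
size_num = sizes.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)

print(size_num.tolist())
```

Rows with no digits at all would come out as NaN, which makes them easy to review separately.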


In [20]:
df.describe()

,id,colors,sizes
count,2000,2000,2000
unique,1995,630,210
top,AWpd384iM263mwCq9Vgv,Black,8
freq,5,319,156


In [21]:
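# Exploring to see whether the data is already one row per shoe.
# Note: without parentheses, this line only displays the bound method (and the frame's repr);
# nothing is deduplicated. Calling df.drop_duplicates() would return the deduplicated frame.
# describe() above shows 1995 unique ids in 2000 rows, so the data is close to, but not exactly, one row per id.
df.drop_duplicates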

<bound method DataFrame.drop_duplicates of                         id                colors          sizes
0     AWpyySsJAGTnQPR7wNt4                 Black              8
1     AWpyyyb3AGTnQPR7wN-u                 Taupe         6 M US
2     AWpyzlajAGTnQPR7wOX8                 Black              5
3     AWpyxomE0U_gzG0hkA1q           Black/Multi      9.5 BM US
4     AWpyxChWJbEilcB6RhWx                 White             11
...                    ...                   ...            ...
1995  AWpSxBRHM263mwCq8eLh  Black Synthetic/Gore           10 N
1996  AWpSvfvEM263mwCq8eF7                 White         US 8.5
1997  AWpS01sFJbEilcB6O9yN                 Taupe  7.5 Medium BM
1998  AWpSyw4E0U_gzG0hhdDo                 Black             10
1999  AWpSvGJ80U_gzG0hhc1C                   Tan   10 Medium BM

[2000 rows x 3 columns]>
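
The `describe()` output reports 1995 unique `id` values across 2000 rows, so it may be worth counting duplicates explicitly. The sketch below uses a small made-up frame with the notebook's column names (the ids and the repeated row are invented for illustration) and the standard `duplicated()` and `drop_duplicates()` methods.

```python
import pandas as pd

# Toy frame with the notebook's column names; ids and the repeated row are made up
toy = pd.DataFrame({
    "id": ["a1", "a2", "a2", "a3"],
    "colors": ["Black", "Taupe", "Taupe", "Black/Multi"],
    "sizes": ["8", "6 M US", "6 M US", "9.5 BM US"],
})

# Rows that are identical to an earlier row in every column
n_dup_rows = int(toy.duplicated().sum())

# Rows whose id has already appeared, whatever the other columns say
n_dup_ids = int(toy["id"].duplicated().sum())

# drop_duplicates() must be called (with parentheses) to return a new frame
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dup_rows, n_dup_ids, len(deduped))
```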

In [22]:
df['colors'].unique()

array(['Black', 'Taupe', 'Black/Multi', 'White',
       'Metallic Multi Soft Leather', 'Denim Blue', 'Black / Silver',
       'Black and White Combo', 'Navy/Charcoal', 'Black Multi',
       'Light Brown', 'Natural Snk Pu', 'Black Leopard', 'Red',
       'Black Knit Flexible Technical Fabric', 'Brown', 'Black Fabric',
       'Florence Berry', 'Silver', 'Tan Brown', 'Black Patent',
       'Black/White', 'Camel', 'Dark Red',
       'White / Atomic Blue / Safety Yellow', 'Black Paris', 'Wine',
       'Charcoal', 'GOLD/METALLIC/SMOOTH', 'Luggage', 'Tan',
       'Brown Nubuck', 'By', 'Natural Beige', 'Ecru', 'Natural Leather',
       'Dark Wine Red', 'Black Suede/Crackle Lizard Print Leather',
       'Brown Printed Python',
       'Black Rubberized PU/Ballistic Nylon/Faux Fur',
       'Brown Tool Polyurethane/Gore', 'Black Lux Soft Leather',
       'Black/White Patent', 'BLACK-PINK 9013', 'Florence Navy', 'Blue',
       'Urban White', 'Black / White', 'Natural Satin',
       'Clear/Smoke Syn

In [None]:
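# Splitting the colors column on '/' with str.split, then converting to a set to drop repeated names.
# Note: a set does not keep the original order, and spaces around '/' stay in the names.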
df['colorset'] = df['colors'].str.split('/').apply(set)
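
Because a set is unordered and the split leaves surrounding spaces in place, an order-preserving alternative may be easier to work with later. The following is a sketch, not the notebook's method: `split_colors` is a hypothetical helper that strips whitespace and removes repeats with `dict.fromkeys`, which keeps first-seen order.

```python
import pandas as pd

# Sample values copied from the notebook's unique() output
colors = pd.Series([
    "Black",
    "Black/Multi",
    "Black / Silver",
    "White / Atomic Blue / Safety Yellow",
])

def split_colors(value):
    """Split on '/', strip whitespace, drop empty parts and repeats, keep order."""
    parts = [part.strip() for part in value.split("/")]
    return list(dict.fromkeys(part for part in parts if part))

colorlist = colors.apply(split_colors)
print(colorlist.tolist())
```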

In [26]:
df.head()

,id,colors,sizes,colorset
0,AWpyySsJAGTnQPR7wNt4,Black,8,{Black}
1,AWpyyyb3AGTnQPR7wN-u,Taupe,6 M US,{Taupe}
2,AWpyzlajAGTnQPR7wOX8,Black,5,{Black}
3,AWpyxomE0U_gzG0hkA1q,Black/Multi,9.5 BM US,"{Black, Multi}"
4,AWpyxChWJbEilcB6RhWx,White,11,{White}


In [37]:
# Turning each set into a list so the colors can be expanded into separate, positional columns
df['colorlist'] = df['colorset'].apply(list)

In [39]:
df[['color1', 'color2', 'color3', 'color4']] = pd.DataFrame(df['colorlist'].tolist(), index=df.index)
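
The cell above hard-codes four column names, which only works if the longest list has exactly four entries. A sketch that derives the names from the data instead is shown below; the toy frame and the variable `expanded` are illustrative assumptions.

```python
import pandas as pd

# Toy frame; the color names are taken from the notebook outputs
toy = pd.DataFrame({
    "colorlist": [
        ["Black"],
        ["Black", "Multi"],
        ["White", "Atomic Blue", "Safety Yellow"],
    ]
})

# Ragged lists are padded with missing values on the right
expanded = pd.DataFrame(toy["colorlist"].tolist(), index=toy.index)

# Name the columns color1, color2, ... based on how many were actually produced
expanded.columns = [f"color{i + 1}" for i in range(expanded.shape[1])]

toy = toy.join(expanded)
print(toy)
```

If the challenge's three-column limit has to be enforced, `expanded.iloc[:, :3]` would keep only the first three colors.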

In [40]:
df.head()

,id,colors,sizes,colorset,num_colors,colorlist,color1,color2,color3,color4
0,AWpyySsJAGTnQPR7wNt4,Black,8,{Black},1,[Black],Black,,,
1,AWpyyyb3AGTnQPR7wN-u,Taupe,6 M US,{Taupe},1,[Taupe],Taupe,,,
2,AWpyzlajAGTnQPR7wOX8,Black,5,{Black},1,[Black],Black,,,
3,AWpyxomE0U_gzG0hkA1q,Black/Multi,9.5 BM US,"{Black, Multi}",2,"[Black, Multi]",Black,Multi,,
4,AWpyxChWJbEilcB6RhWx,White,11,{White},1,[White],White,,,


In [41]:
# Okay why did I need to do four columns though?
df['color4'].unique()

array([None, ' Lite Grey '], dtype=object)
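
To see which rows actually carry a fourth color, one option is to split with `expand=True` and filter on the fourth piece. This is a sketch on a toy frame: the four-color string below is made up for illustration (only ' Lite Grey ' appears in the output above), and `has_fourth` is a hypothetical name.

```python
import pandas as pd

# The last value is a made-up four-color string, used only to exercise the filter
toy = pd.DataFrame({
    "colors": ["Black", "Black/Multi", "Navy / Red / White / Lite Grey "]
})

# expand=True returns one column per piece, padded with missing values
parts = toy["colors"].str.split("/", expand=True)

# Column 3 holds the fourth piece, when there is one
has_fourth = parts[3].notna()
rows_with_four = toy.loc[has_fourth, "colors"].tolist()

print(rows_with_four)
```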

In [43]:
df['colorset']

0                       {Black}
1                       {Taupe}
2                       {Black}
3                {Black, Multi}
4                       {White}
                 ...           
1995    {Gore, Black Synthetic}
1996                    {White}
1997                    {Taupe}
1998                    {Black}
1999                      {Tan}
Name: colorset, Length: 2000, dtype: object

In [31]:
# Getting number of colors
df['num_colors'] = df['colorset'].apply(len)
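
The challenge asks for a comment stating how many colors are available, which can be built from `num_colors`. A minimal sketch follows; the wording of the comment and the helper name `color_comment` are assumptions, since the challenge does not specify an exact format.

```python
import pandas as pd

toy = pd.DataFrame({"num_colors": [1, 2, 3]})

def color_comment(n):
    """Build a short sentence such as '2 colors available'."""
    noun = "color" if n == 1 else "colors"
    return f"{n} {noun} available"

toy["comment"] = toy["num_colors"].apply(color_comment)
print(toy["comment"].tolist())
```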

In [32]:
df.head()

,id,colors,sizes,colorset,num_colors
0,AWpyySsJAGTnQPR7wNt4,Black,8,{Black},1
1,AWpyyyb3AGTnQPR7wN-u,Taupe,6 M US,{Taupe},1
2,AWpyzlajAGTnQPR7wOX8,Black,5,{Black},1
3,AWpyxomE0U_gzG0hkA1q,Black/Multi,9.5 BM US,"{Black, Multi}",2
4,AWpyxChWJbEilcB6RhWx,White,11,{White},1
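
The remaining requirement is a new ID number plus a tidy column order. One possible way to finish, sketched on two rows from the outputs above, is a simple running number; `new_id` and `tidy` are hypothetical names, and a sequential integer is only one reasonable reading of "a new ID number".

```python
import pandas as pd

# Two rows copied from the notebook outputs
toy = pd.DataFrame({
    "id": ["AWpyySsJAGTnQPR7wNt4", "AWpyxomE0U_gzG0hkA1q"],
    "sizes": ["8", "9.5 BM US"],
    "num_colors": [1, 2],
    "color1": ["Black", "Black"],
    "color2": [None, "Multi"],
    "color3": [None, None],
})

# Assumption: the new ID is a running number starting at 1
toy["new_id"] = range(1, len(toy) + 1)

# Keep only the columns the challenge asks for, in a readable order
tidy = toy[["new_id", "sizes", "num_colors", "color1", "color2", "color3"]]
print(tidy)
```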
