# TRANSFORM
# Cleaning raw women's clothing data from the ASOS API (2)

Data was accessed through an API on 2023-11-04 (t-shirt category) and 2023-11-09 (the remaining categories)<br/>
and stored as pickle files in `../../data/raw/ *** .pk`

The raw datasets are JSON files with a very complex structure.<br/>

Now the data must be cleaned and transformed to comply with the database (DDBB) standards.

--------

Raw `asos_womentshirt_data` dataset has the following issues:


* 1. Extract only relevant data **(brand, description, price, colour)**
    
* 2. The ASOS dataset uses **486 names** for colours
    * Normalize the color names using `mlg.namvector_clean`
    * Simplify colour names according to the colors used by Amazon.es
    * Use `../../data/raw/color_simplification.pkl`, a ChatGPT-generated dictionary, to simplify the colors returned by ASOS
    * Transform rows with multiple color values to "multicolor"

* 3. Clean `price` column
* 4. Normalize brand names

--------

Save cleaned data in `../../data/clean/asos_womentshirt_data.pk`

### 0. Import the modules

In [1]:
import requests
import time
import pandas as pd
import pickle
import numpy as np
import warnings
warnings.filterwarnings('ignore')  # ignore warnings
from src import dataanalysis_fun1 as mlg  #Import my module

Reload my module if necessary

In [186]:
import importlib
from src import dataanalysis_fun1 as mlg # Import the module
#importlib.reload(mlg)  # Reload the module

# Suppress warning when reloading the module
with warnings.catch_warnings():
    warnings.simplefilter("ignore") 
    importlib.reload(mlg)  # Reload the module

## 1. Load the data using pickle

LOAD ADDITIONAL DATA FROM ASOS (DRESSES, TOPS AND SWEATSHIRTS)

In [187]:
# Now, let's load the object from the file
with open('../../data/raw/asos_womentshirt_data.pkl', 'rb') as file:
    asos_womentshirt_data = pickle.load(file)

In [188]:
# Now, let's load the object from the file
with open('../../data/raw/asos_womentop_data.pkl', 'rb') as file:
    asos_womentop_data = pickle.load(file)

In [189]:
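# NOTE: this cell loads the same tops file as the previous cell. The heading above also
# mentions sweatshirts, so a different file may have been intended here.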
with open('../../data/raw/asos_womentop_data.pkl', 'rb') as file:
    asos_womentop_data = pickle.load(file)

In [190]:
# Now, let's load the object from the file
with open('../../data/raw/asos_womendress_data.pkl', 'rb') as file:
    asos_womendress_data = pickle.load(file)


In [191]:
asos_raw1=asos_womentshirt_data
asos_raw2=asos_womentop_data
asos_raw3=asos_womentop_data
asos_raw4=asos_womendress_data
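
The loading cells above repeat the same open-and-load pattern. A minimal sketch of one way to factor it into a helper follows; `load_pickles` and the demo file names are hypothetical, and the temporary files exist only so the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickles(paths):
    """Load every pickle file in `paths` (a name -> path mapping) into a dict."""
    loaded = {}
    for name, path in paths.items():
        with open(path, "rb") as file:
            loaded[name] = pickle.load(file)
    return loaded


# Demo with throwaway files; the names below are hypothetical,
# not the project's real files.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = {}
    for name in ["category_a", "category_b"]:
        path = Path(tmp) / f"{name}.pkl"
        with open(path, "wb") as file:
            pickle.dump([{"name": name}], file)
        demo_paths[name] = path

    raw = load_pickles(demo_paths)

print(list(raw))
```

Keeping the datasets in one dictionary makes it harder to load the same file twice under two names by accident.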

## 2. Transform (clean the data)

#### Clean and extract relevant text

The ultimate goal is, for each element, to keep only the `brand, product type, colour and price €` 

## 2.1 Join the datasets and parse them into a DataFrame

In [362]:
asos_raw_list=[asos_raw1,asos_raw2, asos_raw3, asos_raw4]

In [363]:
color_df_list=[]
for ASOS in asos_raw_list:
    thedf1=[[row["brandName"], row["name"],  row["price"]["current"]["text"], row["colour"]] for row in ASOS]
    color_df=pd.DataFrame(thedf1,columns=["brand", "description", "price", "colour"])
    color_df_list.append(color_df)
    
ASOS_df = pd.concat(color_df_list, ignore_index=True)
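
The list comprehension above indexes the nested keys directly, so a record without a `price` block would raise a `KeyError`. The sketch below shows a more defensive variant under the assumption that some records might be incomplete. The sample records and the helper `extract_fields` are hypothetical; only the key names (`brandName`, `name`, `price`, `current`, `text`, `colour`) come from the cell above.

```python
import pandas as pd

# Hypothetical records that mimic only the fields used in this notebook.
sample_records = [
    {"brandName": "New Look", "name": "Top negro", "colour": "Negro",
     "price": {"current": {"text": "16,99 €"}}},
    {"brandName": "Selected", "name": "Top marrón", "colour": "Tierra oscura",
     "price": {}},  # price block left empty on purpose
]


def extract_fields(record):
    """Return the four fields of interest, using None where a key is absent."""
    price = record.get("price", {}).get("current", {}).get("text")
    return {
        "brand": record.get("brandName"),
        "description": record.get("name"),
        "price": price,
        "colour": record.get("colour"),
    }


sample_df = pd.DataFrame([extract_fields(r) for r in sample_records])
print(sample_df)
```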

In [364]:
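# Drop rows whose description is duplicated (the first occurrence is kept)

ASOS_df1=ASOS_df[~ASOS_df["description"].duplicated()]
# Display only: the result is not assigned, so ASOS_df1 keeps its original index
ASOS_df1.reset_index()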

| | index | brand | description | price | colour |
|---|---|---|---|---|---|
| 0 | 0 | New Look | Top negro escalonado de manga larga con estamp... | 16,99 € | Negro |
| 1 | 1 | Mama.licious | Top color ámbar dorado de manga larga con cuel... | 45,99 € | Dorado ámbar |
| 2 | 2 | Mama.licious | Top amarillo de manga larga con cuello ancho d... | 21,99 € | Dorado ámbar |
| 3 | 3 | Selected | Top marrón de manga larga con cuello alto de S... | 31,99 € | Tierra oscura |
| 4 | 4 | Mama.licious | Top multicolor a rayas con cuello alto de punt... | 40,99 € | MULTICOLOR |
| ... | ... | ... | ... | ... | ... |
| 10555 | 15646 | Liquorish | Vestido midi cruzado con bajo tipo pañuelo y e... | 34,95 € | Multicolor |
| 10556 | 15647 | & Other Stories | Minivestido negro con manga abullonada de & Ot... | 35,05 € | NEGRO |
| 10557 | 15648 | Selected | Vestido de cambray a rayas con detalle de punt... | 23,00 € | Blanco |
| 10558 | 15649 | ASOS Tall | Vestido estilo peto denim en negro desgastado ... | 18,49 € | WASHED BLACK |
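
The deduplication cell above filters with a boolean mask and then displays `reset_index()` without storing it. As a small sketch of an equivalent one-step approach, `drop_duplicates` with `subset` does the same filtering and `reset_index(drop=True)` renumbers the rows without keeping the old index as a column. The `demo` frame is made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "brand": ["A", "A", "B"],
    "description": ["Top negro", "Top negro", "Vestido azul"],
})

# Keep the first row of each description and renumber the index from 0.
deduped = demo.drop_duplicates(subset="description").reset_index(drop=True)
print(deduped)
```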


#### The ASOS dataset has a huge number of ways to classify the products' colours

In [365]:
len(ASOS_df1["colour"].unique())

1630

## 2.2 Normalize the color names using `mlg.namvector_clean`

In [366]:
ASOS_df1=ASOS_df1.copy()
ASOS_df1["colour"]=mlg.namvector_clean(ASOS_df1["colour"])
ASOS_df1["colour"]=[col.replace("_", " ") for col in ASOS_df1["colour"]]
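
The implementation of `mlg.namvector_clean` is not shown in this notebook. The sketch below is only a guess at what this kind of normalization typically involves (lowercasing, removing accents, collapsing punctuation); `normalize_name` is a hypothetical stand-in, not the module's actual function.

```python
import re
import unicodedata


def normalize_name(text):
    """Lowercase, strip accents, and collapse non-alphanumeric runs to one space."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


print(normalize_name("Dorado ámbar"))   # dorado ambar
print(normalize_name("WASHED BLACK"))   # washed black
```

Note that this accent stripping also turns `ñ` into `n`, which may or may not be desirable for Spanish colour names.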

### 2.3 Simplify colour names according to the color categories used by Amazon.es

In [367]:
amz_colors=['negro', 'gris', 'blanco', 'marron', 'beis', 'rojo', 'rosa',
       'naranja', 'amarillo', 'marfil', 'verde', 'turquesa', 'azul',
       'morado', 'dorado']

english_colors = ['black', 'gray', 'white', 'brown', 'beige', 'red', 'pink', 'orange', 'yellow', 'ivory', 'green', 'turquoise', 'blue', 'purple', 'gold']
amz_colors_dict = dict(zip(english_colors, amz_colors))

print(amz_colors_dict)

{'black': 'negro', 'gray': 'gris', 'white': 'blanco', 'brown': 'marron', 'beige': 'beis', 'red': 'rojo', 'pink': 'rosa', 'orange': 'naranja', 'yellow': 'amarillo', 'ivory': 'marfil', 'green': 'verde', 'turquoise': 'turquesa', 'blue': 'azul', 'purple': 'morado', 'gold': 'dorado'}


### 2.3 a - 15 SPANISH NAMES

In [None]:
ASOS_df2=ASOS_df1.copy()
ASOS_df2["colour_simp"]=""

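# For each simple colour, flag the rows whose original colour name matches it.
# Names can contain several words, so a nested comprehension checks every word of every cell.
# NOTE: `word in color` tests whether a word is a substring of the simple colour,
# so short words also match (e.g. "de" is a substring of "verde").
# When several simple colours match a row, the last one in `amz_colors` wins.
for color in amz_colors:
    matches = [any(word in color for word in cell.split())
               for cell in ASOS_df2["colour"].values]
    
    ASOS_df2.loc[matches, "colour_simp"] = color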

In [None]:
# ASOS_df2["colour_simp"].value_counts()
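
The loop in 2.3 a tests `word in color`, that is, whether a word from the ASOS name is contained in the simple colour. The sketch below shows the opposite direction (`colour in word`) as an alternative, on the assumption that the goal is to find the simple colour inside the ASOS name. `match_simple_colour` and the short colour list are illustrative only.

```python
simple_colours = ["negro", "verde", "azul", "rosa"]


def match_simple_colour(name, simple_colours):
    """Return the last simple colour that appears inside any word of `name`."""
    found = ""
    for colour in simple_colours:
        if any(colour in word for word in name.split()):
            found = colour
    return found


print(match_simple_colour("verde oscuro", simple_colours))    # verde
print(match_simple_colour("rosa de", simple_colours))         # rosa
print(match_simple_colour("cielo satinado", simple_colours))  # empty string
```

With this direction the word `de` no longer matches `verde`, while plural or compound forms such as `negros` still match `negro`.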

### 2.3 b - 15 ENGLISH NAMES

In [369]:

# Same check with the English names: rows that match an English colour word
# receive the corresponding Spanish name.
for dict_key, dict_value in amz_colors_dict.items(): 
    # Names can contain several words, so a nested comprehension checks every word of every cell.
    matches = [any(word in dict_key for word in cell.split())
               for cell in ASOS_df2["colour"].values]
    
    ASOS_df2.loc[matches, "colour_simp"] = dict_value


In [None]:
# ASOS_df2["colour_simp"].value_counts()

### 2.3 c - `collapsed_dict2.pkl` - ChatGPT-generated dictionary
Load the ChatGPT-generated dictionary used to simplify the colors returned by ASOS.

In [371]:
with open('../../data/raw/collapsed_dict2.pkl', 'rb') as file:
    collapsed_dict2 = pickle.load(file)

In [372]:
collapsed_dict2.keys()

dict_keys(['marron', 'multicolor', 'naranja', 'blanco', 'beis', 'morado', 'gris', 'negro', 'dorado', 'verde', 'rosa', 'rojo', 'azul', 'amarillo', 'plateado'])

In [373]:
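def color_dict_simplify(x):
    """Return the key of `collapsed_dict2` for the last variant found in `x`, or "" if none matches."""
    x = x.lower()
    theval = ""
    # Keys are simple colours; each value is iterated as a collection of variant names.
    for simple_colour, variants in collapsed_dict2.items():
        for variant in variants:
            if variant in x:  # substring test against the lowercased name
                theval = simple_colour

    return theval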
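
The contents of `collapsed_dict2` are not shown beyond its keys. The sketch below assumes it maps each simple colour to a list of variant strings, and illustrates the same lookup with the dictionary inverted once up front. `toy_dict`, its entries, and `simplify` are hypothetical. Unlike `color_dict_simplify`, this version returns on the first match instead of the last.

```python
# Hypothetical miniature of the dictionary: simple colour -> list of variants.
toy_dict = {
    "azul": ["marino", "navy"],
    "beis": ["crudo", "stone"],
}

# Invert once so each variant points to its simple colour.
variant_to_simple = {
    variant: simple
    for simple, variants in toy_dict.items()
    for variant in variants
}


def simplify(name):
    """Return the simple colour of the first variant found in `name`, else ""."""
    lowered = name.lower()
    for variant, simple in variant_to_simple.items():
        if variant in lowered:
            return simple
    return ""


print(simplify("Navy"))         # azul
print(simplify("Stone wash"))   # beis
print(simplify("adventurine"))  # empty string
```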

In [384]:
ASOS_df3=ASOS_df2.copy()

ASOS_df3['colour_simp2'] = ASOS_df3['colour'].apply(color_dict_simplify)
#ASOS_df3['colour_simp2'].value_counts()

### 2.3 d- Unify colors in `colour_simp` and keep only one column

In [385]:
def update_colour_simp(row):
    if row['colour_simp'] == "":
        return row['colour_simp2']
    else:
        return row['colour_simp']

ASOS_df3['colour_simp'] = ASOS_df3.apply(update_colour_simp, axis=1)
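
The row-wise `apply` above can also be written as a single vectorized call. This is a sketch on a made-up frame, using `Series.where`, which keeps values where the condition holds and takes the replacement elsewhere.

```python
import pandas as pd

demo = pd.DataFrame({
    "colour_simp": ["negro", "", ""],
    "colour_simp2": ["gris", "azul", ""],
})

# Keep colour_simp where it is non-empty, otherwise take colour_simp2.
demo["colour_simp"] = demo["colour_simp"].where(
    demo["colour_simp"] != "", demo["colour_simp2"]
)
print(demo["colour_simp"].tolist())  # ['negro', 'azul', '']
```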


#### There are still some crazy ASOS names

There are still 269 rows without a simplified colour name. <br/>
I could use fuzzywuzzy to clean them but, for the moment, I prefer just to drop them.

In [386]:
crazy_ASOS=ASOS_df3[ASOS_df3["colour_simp"]==""]

print(len(crazy_ASOS))
display(crazy_ASOS.head(5))

269


| | brand | description | price | colour | colour_simp | colour_simp2 |
|---|---|---|---|---|---|---|
| 3058 | Dickies | Camiseta verde con estampado en la espalda San... | 26,50 € | adventurine | | |
| 3061 | Dickies | Camiseta verde oscuro con logo universitario S... | 22,00 € | adventurine | | |
| 3062 | Dickies | Camiseta azul real con estampado del yin yang ... | 26,50 € | cielo satinado | | |
| 3244 | ASOS DESIGN | Camiseta color crudo jaspeado de corte boyfrie... | 19,00 € | marga cruda | | |
| 3300 | Levi's | Camiseta azul marino con logo estampado en for... | 19,00 € | flores kinsley | | |
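
The text above mentions fuzzywuzzy as a possible way to rescue these rows. As a rough illustration of the idea, the sketch below uses the standard library's `difflib.get_close_matches` instead (a swapped-in technique, not fuzzywuzzy itself). `closest_colour` and the cutoff of 0.8 are assumptions made for the example.

```python
import difflib

simple_colours = ["negro", "gris", "blanco", "marron", "beis", "rojo", "rosa",
                  "naranja", "amarillo", "verde", "azul", "morado", "dorado"]


def closest_colour(name, cutoff=0.8):
    """Return the closest simple colour to `name`, or "" if none is close enough."""
    matches = difflib.get_close_matches(name, simple_colours, n=1, cutoff=cutoff)
    return matches[0] if matches else ""


print(closest_colour("amarilo"))      # a misspelling close to "amarillo"
print(closest_colour("adventurine"))  # nothing close: returns ""
```

String similarity helps with spelling variants only. Invented names such as `adventurine` are not close to any simple colour, so they would still need a lookup table or would have to be dropped.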


### 2.3 e- Drop rows with crazy names and keep only simplified color

In [387]:
ASOS_df4=ASOS_df3.copy()
ASOS_df4 = ASOS_df4.loc[:, ['brand', 'description', 'price', "colour_simp"]]
ASOS_df4 = ASOS_df4.rename(columns={'colour_simp': 'colour'})

In [395]:

ASOS_df4.colour =[row.lstrip() for row in ASOS_df4.colour]
ASOS_df4=ASOS_df4[ASOS_df4["colour"]!=""]
ASOS_df4.sample(10)

#### Replace `turquesa` with `azul` and `marfil` with `beis`

In [399]:
# Use .loc instead of chained indexing so the assignment reliably modifies ASOS_df4
ASOS_df4.loc[ASOS_df4["colour"]=="turquesa", "colour"]="azul"
ASOS_df4.loc[ASOS_df4["colour"]=="marfil", "colour"]="beis"

In [400]:
ASOS_df4["colour"].value_counts()

negro         2092
multicolor    1339
verde         1226
blanco         988
azul           905
beis           851
rosa           723
marron         678
gris           377
naranja        309
morado         299
rojo           283
amarillo       194
dorado          17
plateado        10
Name: colour, dtype: int64
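
When several colours need to be merged, a single mapping passed to `Series.replace` is one alternative to a separate assignment per colour. This is a sketch on a made-up Series, using the two merges described above.

```python
import pandas as pd

colours = pd.Series(["turquesa", "marfil", "negro"])

# One mapping holds every merge, so adding another pair is a one-line change.
merged = colours.replace({"turquesa": "azul", "marfil": "beis"})
print(merged.tolist())  # ['azul', 'beis', 'negro']
```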

## 2.4 Clean `price` column

* Remove the € symbol
* Replace `,` with `.`
* Convert to float

In [404]:
ASOS_df4[:5]

| | brand | description | price | colour |
|---|---|---|---|---|
| 0 | New Look | Top negro escalonado de manga larga con estamp... | 16,99 € | negro |
| 1 | Mama.licious | Top color ámbar dorado de manga larga con cuel... | 45,99 € | dorado |
| 2 | Mama.licious | Top amarillo de manga larga con cuello ancho d... | 21,99 € | dorado |
| 3 | Selected | Top marrón de manga larga con cuello alto de S... | 31,99 € | marron |
| 4 | Mama.licious | Top multicolor a rayas con cuello alto de punt... | 40,99 € | multicolor |


In [405]:
ASOS_df4["price"]=[VAL.replace(",", ".") for VAL in ASOS_df4["price"]]
ASOS_df4["price"]=[VAL.replace("€", "") for VAL in ASOS_df4["price"]]
ASOS_df4["price"]=[float(VAL) for VAL in ASOS_df4["price"]]

In [406]:
ASOS_df4[:5]

| | brand | description | price | colour |
|---|---|---|---|---|
| 0 | New Look | Top negro escalonado de manga larga con estamp... | 16.99 | negro |
| 1 | Mama.licious | Top color ámbar dorado de manga larga con cuel... | 45.99 | dorado |
| 2 | Mama.licious | Top amarillo de manga larga con cuello ancho d... | 21.99 | dorado |
| 3 | Selected | Top marrón de manga larga con cuello alto de S... | 31.99 | marron |
| 4 | Mama.licious | Top multicolor a rayas con cuello alto de punt... | 40.99 | multicolor |
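
The three list comprehensions in the price-cleaning cell can also be expressed with pandas string methods. The sketch below assumes prices always look like `16,99 €`. The `n/a` entry is a made-up value that shows how `errors="coerce"` handles anything that cannot be parsed.

```python
import pandas as pd

prices = pd.Series(["16,99 €", "45,99 €", "n/a"])

cleaned = (
    prices.str.replace("€", "", regex=False)   # drop the currency symbol
          .str.replace(",", ".", regex=False)  # decimal comma -> decimal point
          .str.strip()                         # remove leftover whitespace
)
# errors="coerce" turns unparseable strings into NaN instead of raising.
numeric = pd.to_numeric(cleaned, errors="coerce")
print(numeric)
```

Prices with a thousands separator (for example `1.234,50 €`) are not handled here and would need an extra step.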


## 2.5 Normalize `brand` names

In [407]:
ASOS_df5=ASOS_df4.copy()
ASOS_df5.brand = [row.replace(".", "") for row in ASOS_df5.brand]

In [408]:
display(ASOS_df5.groupby('colour')['price'].mean())
display(ASOS_df5.groupby('brand')['price'].mean())

colour
amarillo      28.990464
azul          31.325238
beis          30.004172
blanco        27.163269
dorado        50.278824
gris          24.391061
marron        25.674277
morado        26.229231
multicolor    27.375818
naranja       25.676505
negro         29.392887
plateado      33.148000
rojo          31.225795
rosa          29.330318
verde         26.644070
Name: price, dtype: float64

brand
& Other Stories            47.643684
4th & Reckless             33.007273
4th & Reckless Plus        41.500000
4th & Reckless Tall        36.500000
AAPE BY A BATHING APE®     64.750000
                             ...    
adidas                     13.600000
adidas Originals           27.390625
adidas performance         29.666667
ellesse                    26.522727
sister jane               105.833333
Name: price, Length: 365, dtype: float64
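
Some of the means above rest on very few rows: the colour counts earlier show only 17 `dorado` and 10 `plateado` items. A small sketch of reporting the group size next to the mean, on made-up data:

```python
import pandas as pd

demo = pd.DataFrame({
    "colour": ["negro", "negro", "dorado"],
    "price": [10.0, 20.0, 50.0],
})

# Report the group size next to the mean so small groups are easy to spot.
summary = demo.groupby("colour")["price"].agg(["count", "mean"])
print(summary)
```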

## 3. Save cleaned data

In [411]:
#asos_womenupperclothes_clean=ASOS_df5.copy()
#asos_womenupperclothes_clean.to_csv('../../data/clean/asos_womenupperclothes_clean.csv', index=False)
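
The save step above is commented out. As a minimal sketch of a write-and-verify round trip, the example below saves a one-row frame to CSV and reads it back. The temporary directory and the file name `demo_clean.csv` are placeholders, not the project's real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

clean_df = pd.DataFrame({
    "brand": ["New Look"], "description": ["Top negro"],
    "price": [16.99], "colour": ["negro"],
})

# A temporary directory stands in for the project's data/clean folder.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "demo_clean.csv"  # hypothetical file name
    clean_df.to_csv(out_path, index=False)   # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(out_path)

print(reloaded)
```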