# TRANSFORM
# Cleaning raw women's t-shirt data from the ASOS API

The data was retrieved from the ASOS API on 2023-10-04 and <br/>
stored as a pickle file in `../../data/raw/asos_womentshirt_data.pkl`.

The raw `asos_womentshirt_data` dataset is a list of deeply nested JSON-like records.

Now the data must be cleaned and transformed to comply with the database (DDBB) standards.

--------

The raw `asos_womentshirt_data` dataset has the following issues, addressed in this order:


* 1. Extract only the relevant fields **(brand, description, price, colour)**
    
* 2. The ASOS dataset uses **486 distinct names** for colours
    * Normalize the colour names using `mlg.namvector_clean`
    * Simplify the colour names according to the colours used by Amazon.es
    * Use `../../data/raw/color_simplification.pkl`, a ChatGPT-generated dictionary, to simplify the colours returned by ASOS
    * Set rows with multiple colour values to "multicolor"

* 3. Clean the `price` column
* 4. Normalize brand names

--------

Save the cleaned data in `../../data/clean/asos_womentshirt_clean.csv`.

### 0. Import the modules

In [286]:
import requests
import time
import pandas as pd
import pickle

import warnings
warnings.filterwarnings('ignore')  # ignore warnings

In [6]:
from logs.PASSES import RAPID ## my TOKEN

Import my module

In [37]:
from src import dataanalysis_fun1 as mlg 

Reload my module if necessary

In [None]:
import importlib
from src import dataanalysis_fun1 as mlg  # import the module

# Suppress warnings while reloading the module
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    importlib.reload(mlg)  # reload the module

## 1. Load the data using pickle

In [305]:
# Now, let's load the object from the file
with open('../../data/raw/asos_womentshirt_data.pkl', 'rb') as file:
    asos_womentshirt_data = pickle.load(file)
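
Since `pickle.load` returns whatever object was stored, a quick shape check right after loading can catch a wrong or truncated file early. The following is a minimal sketch, assuming the payload is a list of dicts; `load_records` is a hypothetical helper that is not part of the notebook, and the demo writes a tiny stand-in file to a temporary directory instead of touching the real data.

```python
import pickle
import tempfile
from pathlib import Path


def load_records(path):
    """Load a pickled list of product dicts and fail early on surprises.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No pickle file at {path}")
    with path.open("rb") as file:
        records = pickle.load(file)  # only unpickle files you created yourself
    if not isinstance(records, list) or not all(isinstance(r, dict) for r in records):
        raise TypeError("Expected a list of dicts")
    return records


# Demo with a stand-in file containing one abridged record
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pkl"
    with demo_path.open("wb") as file:
        pickle.dump([{"brandName": "New Look", "colour": "Negro"}], file)
    demo_records = load_records(demo_path)

print(len(demo_records))
```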

In [306]:
asos_raw=asos_womentshirt_data

In [307]:
len(asos_raw)

3001

## 2. Transform (clean the data)

### 2.1 Clean and extract relevant text

The ultimate goal is, for each element, to keep only the brand, product description, price (€) and colour.

In [301]:
asos_raw[0]

{'id': 201533178,
 'name': 'Top negro escalonado de manga larga con estampado animal de New Look',
 'price': {'current': {'value': 16.99, 'text': '16,99 €'},
  'previous': {'value': None, 'text': ''},
  'rrp': {'value': 22.99, 'text': '22,99 €'},
  'isMarkedDown': False,
  'isOutletPrice': True,
  'currency': 'EUR'},
 'colour': 'Negro',
 'colourWayId': 201533179,
 'brandName': 'New Look',
 'hasVariantColours': False,
 'hasMultiplePrices': False,
 'groupId': None,
 'productCode': 113613824,
 'productType': 'Product',
 'url': 'new-look/top-negro-escalonado-de-manga-larga-con-estampado-animal-de-new-look/prd/201533178?clr=negro&colourWayId=201533179',
 'imageUrl': 'images.asos-media.com/products/top-negro-escalonado-de-manga-larga-con-estampado-animal-de-new-look/201533178-1-black',
 'additionalImageUrls': ['images.asos-media.com/products/top-negro-escalonado-de-manga-larga-con-estampado-animal-de-new-look/201533178-2',
  'images.asos-media.com/products/top-negro-escalonado-de-manga-larga

In [248]:
display(asos_raw[0]["brandName"], asos_raw[0]["name"],  asos_raw[0]["price"]["current"]["text"], asos_raw[0]["colour"], )

'New Look'

'Top negro escalonado de manga larga con estampado animal de New Look'

'16,99 €'

'Negro'

In [308]:
thedf=[[row["brandName"], row["name"],  row["price"]["current"]["text"], row["colour"]] for row in asos_raw]

In [309]:
color_df=pd.DataFrame(thedf,columns=["brand", "description", "price", "colour"])
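
As an alternative to indexing each nested key by hand, `pd.json_normalize` can flatten the records first and the relevant columns can then be selected by their dotted names. This is only a sketch on an abridged copy of the first raw record; `sample_raw`, `flat` and `sample_df` are illustrative names, not objects used elsewhere in the notebook.

```python
import pandas as pd

# Abridged copy of the first raw record (most keys omitted)
sample_raw = [{
    "name": "Top negro escalonado de manga larga con estampado animal de New Look",
    "price": {"current": {"value": 16.99, "text": "16,99 €"}, "currency": "EUR"},
    "colour": "Negro",
    "brandName": "New Look",
}]

# Nested keys become dotted column names such as "price.current.text"
flat = pd.json_normalize(sample_raw)

sample_df = flat[["brandName", "name", "price.current.text", "colour"]].rename(
    columns={
        "brandName": "brand",
        "name": "description",
        "price.current.text": "price",
    }
)
print(sample_df.iloc[0].tolist())
```

The flattened frame also exposes `price.current.value`, which already holds the price as a number.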

In [310]:
color_df

| | brand | description | price | colour |
|---|---|---|---|---|
| 0 | New Look | Top negro escalonado de manga larga con estamp... | 16,99 € | Negro |
| 1 | Mama.licious | Top color ámbar dorado de manga larga con cuel... | 45,99 € | Dorado ámbar |
| 2 | Mama.licious | Top amarillo de manga larga con cuello ancho d... | 21,99 € | Dorado ámbar |
| 3 | Selected | Top marrón de manga larga con cuello alto de S... | 31,99 € | Tierra oscura |
| 4 | Mama.licious | Top multicolor a rayas con cuello alto de punt... | 40,99 € | MULTICOLOR |
| ... | ... | ... | ... | ... |
| 2996 | JDY | Camiseta sin mangas negra de punto de JDY | 8,00 € | Negro |
| 2997 | ASOS Maternity | Pack de 3 camisetas de manga larga con cuello ... | 38,50 € | Negro/blanco |
| 2998 | River Island | Top de escote cuadrado con manga larga en negr... | 11,25 € | Negro |
| 2999 | Vans | Camiseta blanca con logo pequeño de Vans | 21,00 € | Blanco |


### 2.2 The ASOS dataset has 486 names for colours

In [311]:
len(color_df["colour"].unique())

486
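
Part of that count comes from spelling variants of the same colour rather than from genuinely different colours. A small sketch with a few labels that appear in the ASOS data (`labels` is an illustrative name) shows how case alone inflates the number of unique values:

```python
import pandas as pd

# A handful of colour labels that occur in the ASOS data
labels = pd.Series(["Blanco", "blanco", "MULTICOLOR", "Multicolor", "Piedra", "piedra"])

raw_count = labels.nunique()                   # counts case variants separately
folded_count = labels.str.lower().nunique()    # counts them once

print(raw_count, folded_count)
```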

In [312]:
color_df["colour"].unique()

array(['Negro', 'Dorado ámbar', 'Tierra oscura', 'MULTICOLOR',
       'Blanco nieve', 'Verde oscuro', 'Negro descolorido', 'Sidra',
       'Blanco desgastado', 'Verde', 'Rosa', 'Key Largo', 'Blanco',
       'Blanco óptico', 'Rojo', 'Blanco puro', 'Leopardo', 'Blanco leche',
       'Crudo', 'Estrella/corazón/símbolo de la paz', 'Marrón',
       'Marga avena', 'Dalia', 'Piedra', 'Lila', 'Gris claro',
       'Naranja quemado', 'Violeta', 'Burdeos', 'Crema', 'Naranja',
       'Verde lima', 'Marrón oscuro', 'Verde claro', 'blanco', 'Azul',
       'Blanco con diseño de Mickey', 'Gris espumoso', 'Multicolor',
       'Gris marga', 'Gris', 'piedra', 'azul marino', 'Corazón de acero',
       'Camiseta sin mangas de canalé Wilderville', 'Gris desgastado',
       'Carbón lavado', 'Blanco estampado', 'Rosa intenso',
       'Marrón Bombay', 'Negro TNF', 'Amarillo', 'Caqui',
       'Gris jaspeado y crema', 'Tostado', 'Rosa claro', 'Azul claro',
       'Pizarra', 'Marrón topo', 'Black', 'Negro/lago in

### 2.2.1 Normalize the colour names using `mlg.namvector_clean`

In [313]:
color_df1=color_df.copy()
color_df1["colour"]=mlg.namvector_clean(color_df1["colour"])
color_df1["colour"]=[col.replace("_", " ") for col in color_df1["colour"]]
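
The internals of `mlg.namvector_clean` are not shown in this notebook. Judging only from the results further down (`Dorado ámbar` becomes `dorado ambar`, `Negro/blanco` becomes `negro blanco`), the combined effect of the two lines above appears to be lowercasing, accent stripping and replacing separators with spaces. The sketch below reproduces that effect with the standard library; `normalize_colour` is a hypothetical stand-in and may differ from the real function in other cases.

```python
import re
import unicodedata


def normalize_colour(name):
    """Lowercase, strip accents and turn any non-alphanumeric run into one space.

    Hypothetical stand-in for mlg.namvector_clean followed by the "_" -> " " step.
    Note that "ñ" is also reduced to "n" by this approach.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


for raw in ["Dorado ámbar", "Negro/blanco", "MULTICOLOR"]:
    print(raw, "->", normalize_colour(raw))
```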

### 2.2.2 Simplify the colour names according to the colours used by Amazon.es

In [314]:
amz_colors=['negro', 'gris', 'blanco', 'marron', 'beis', 'rojo', 'rosa',
       'naranja', 'amarillo', 'marfil', 'verde', 'turquesa', 'azul',
       'morado', 'dorado']

#### Load the ChatGPT-generated dictionary used to simplify the colours returned by ASOS

In [315]:
with open('../../data/raw/color_simplification.pkl', 'rb') as file:
    color_simplification = pickle.load(file)

In [316]:
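color_df2=color_df1.copy()

# For each row, collect every Amazon colour whose name occurs in the ASOS colour
# string; rows without any match get an empty string.
# Building the column in one go avoids chained assignment (df[col][i] = ...).
color_df2["colour_simp"]=[
    " ".join(amz for amz in amz_colors if amz in colour)
    for colour in color_df2["colour"]
]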
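
The matching step relies on Python's substring test (`in`), which is worth seeing on a few values. The sketch below uses a subset of `amz_colors` and two hypothetical helpers, `match_palette` (substring test, as in the cell above) and `match_palette_words` (whole-word test, an alternative not used in this notebook).

```python
amz_palette = ["negro", "gris", "blanco", "marron", "rosa", "dorado"]  # subset of amz_colors


def match_palette(colour, palette):
    """Palette colours found anywhere in `colour`, space-separated ('' if none)."""
    return " ".join(p for p in palette if p in colour)


def match_palette_words(colour, palette):
    """Same idea, but a palette colour must match a whole word of `colour`."""
    words = colour.split()
    return " ".join(p for p in palette if p in words)


examples = ["negro", "dorado ambar", "negro blanco", "tierra oscura", "rosado"]
matches = {c: match_palette(c, amz_palette) for c in examples}
word_matches = {c: match_palette_words(c, amz_palette) for c in examples}

print(matches)
print(word_matches)
```

The substring version reports `rosa` for a value such as `rosado`, while the whole-word version does not; which behaviour is preferable depends on how the colour names are written.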
            


In [317]:
color_df2

| | brand | description | price | colour | colour_simp |
|---|---|---|---|---|---|
| 0 | New Look | Top negro escalonado de manga larga con estamp... | 16,99 € | negro | negro |
| 1 | Mama.licious | Top color ámbar dorado de manga larga con cuel... | 45,99 € | dorado ambar | dorado |
| 2 | Mama.licious | Top amarillo de manga larga con cuello ancho d... | 21,99 € | dorado ambar | dorado |
| 3 | Selected | Top marrón de manga larga con cuello alto de S... | 31,99 € | tierra oscura | |
| 4 | Mama.licious | Top multicolor a rayas con cuello alto de punt... | 40,99 € | multicolor | |
| ... | ... | ... | ... | ... | ... |
| 2996 | JDY | Camiseta sin mangas negra de punto de JDY | 8,00 € | negro | negro |
| 2997 | ASOS Maternity | Pack de 3 camisetas de manga larga con cuello ... | 38,50 € | negro blanco | negro blanco |
| 2998 | River Island | Top de escote cuadrado con manga larga en negr... | 11,25 € | negro | negro |
| 2999 | Vans | Camiseta blanca con logo pequeño de Vans | 21,00 € | blanco | blanco |


In [336]:
color_df3=color_df2.copy()
color_df3["colour_simp2"]=""

# Rows without an Amazon colour match fall back to the ChatGPT-generated dictionary.
# .loc is used instead of chained assignment so the write reaches color_df3 itself.
for i in range(len(color_df3)):
    if color_df3.loc[i, "colour_simp"] == "":
        simplified = color_simplification.get(color_df3.loc[i, "colour"], "Other")
        if simplified != "Other":
            color_df3.loc[i, "colour_simp2"] = simplified
            color_df3.loc[i, "colour_simp"] = simplified
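
The same dictionary fallback can be written without an explicit loop by mapping only the rows that still lack a simplified colour. This is a sketch on toy data: `demo_simplification` contains a single made-up entry (the real mapping lives in `color_simplification.pkl` and is not shown here), and `demo` is an illustrative frame.

```python
import pandas as pd

# Hypothetical dictionary entry, for illustration only
demo_simplification = {"tierra oscura": "marron"}

demo = pd.DataFrame({
    "colour": ["negro", "tierra oscura", "key largo"],
    "colour_simp": ["negro", "", ""],
})

# Look up only the rows that have no simplified colour yet
needs_fallback = demo["colour_simp"] == ""
fallback = demo.loc[needs_fallback, "colour"].map(demo_simplification).fillna("")
demo.loc[needs_fallback, "colour_simp"] = fallback

print(demo["colour_simp"].tolist())
```

Values missing from the dictionary stay empty, which mirrors the `'Other'` branch of the loop above.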

#### Unify colors and keep only one column

In [339]:
# Fill any still-empty simplified colour from the dictionary-based column
empty = color_df3["colour_simp"] == ""
color_df3.loc[empty, "colour_simp"] = color_df3.loc[empty, "colour_simp2"]

# Keep a single colour column
color_df3 = color_df3.loc[:, ['brand', 'description', 'price', "colour_simp"]]
color_df3 = color_df3.rename(columns={'colour_simp': 'colour'})

In [340]:
color_df3.colour =[row.lstrip() for row in color_df3.colour]

#### Set rows with multiple colour values to "multicolor"

In [341]:
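# Any value made of more than one colour word becomes "multicolor"
# (boolean mask with .loc instead of chained assignment)
multi = color_df3["colour"].str.split(" ").str.len() > 1
color_df3.loc[multi, "colour"] = "multicolor"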
    

### 2.3 Clean `price` column

In [342]:
color_df3["price"]=[VAL.replace(",", ".") for VAL in color_df3["price"]]
color_df3["price"]=[VAL.replace("€", "") for VAL in color_df3["price"]]
color_df3["price"]=[float(VAL) for VAL in color_df3["price"]]
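
The three replacements above work for the price strings in this dataset. A slightly more defensive version is sketched below; `parse_euro_price` is a hypothetical helper, and its handling of thousands separators (`1.234,56 €`) and non-breaking spaces is an assumption, since no such values appear in the rows shown here.

```python
def parse_euro_price(text):
    """Convert a Spanish-formatted price such as '16,99 €' to a float.

    Hypothetical helper: assumes "." is a thousands separator and "," the
    decimal separator.
    """
    cleaned = text.replace("€", "").replace("\xa0", " ").strip()
    cleaned = cleaned.replace(".", "").replace(",", ".")  # '1.234,56' -> '1234.56'
    return float(cleaned)


prices = [parse_euro_price(p) for p in ["16,99 €", "8,00 €", "1.234,56 €"]]
print(prices)
```

The raw records also carry the numeric price under `price -> current -> value`, which would avoid string parsing altogether.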

In [343]:
color_df3.head()

| | brand | description | price | colour |
|---|---|---|---|---|
| 0 | New Look | Top negro escalonado de manga larga con estamp... | 16.99 | negro |
| 1 | Mama.licious | Top color ámbar dorado de manga larga con cuel... | 45.99 | dorado |
| 2 | Mama.licious | Top amarillo de manga larga con cuello ancho d... | 21.99 | dorado |
| 3 | Selected | Top marrón de manga larga con cuello alto de S... | 31.99 | marron |
| 4 | Mama.licious | Top multicolor a rayas con cuello alto de punt... | 40.99 | multicolor |


### 2.4 Normalize brand names

In [293]:
color_df4=color_df3.copy()
color_df4.brand = [row.replace(".", "") for row in color_df4.brand]
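
The brand normalization here only removes dots. The same step can be expressed with pandas string methods, as in this small sketch on three brand names taken from the data (`brands` is an illustrative name; the extra `str.strip()` is an assumption in case of stray whitespace).

```python
import pandas as pd

brands = pd.Series(["Mama.licious", "New Look", "JDY"])

# regex=False makes "." a literal dot rather than "any character"
brands_clean = brands.str.replace(".", "", regex=False).str.strip()

print(brands_clean.tolist())
```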

In [345]:
display(color_df4.groupby('colour')['price'].mean())
display(color_df4.groupby('brand')['price'].mean())

colour
amarillo      23.823125
azul          25.566327
beis          20.576084
blanco        24.943798
coral         18.187500
crema         25.037500
dorado        28.580000
estampado     30.248571
floral        18.833333
gris          19.769571
lila          17.660000
marfil        23.513846
marron        19.940291
morado        22.485063
multicolor    19.713105
naranja       21.729737
negro         22.936110
neutro        22.000000
plateado      42.998000
rojo          27.608333
rosa          23.525574
turquesa      21.500000
verde         19.573492
violeta       18.518000
Name: price, dtype: float64

brand
& Other Stories        28.770270
4th & Reckless         28.166667
4th & Reckless Tall    33.000000
AFRM                   27.000000
ASOS 4505              16.495500
                         ...    
adidas                 13.600000
adidas Originals       26.265306
adidas performance     28.375000
ellesse                25.952381
sister jane            36.000000
Name: price, Length: 215, dtype: float64
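
A group mean is easier to interpret next to the number of items behind it, because a colour or brand with very few items can produce an extreme average. The sketch below computes both with `agg` on the five rows shown in the `head()` output above; `demo_prices` and `summary` are illustrative names.

```python
import pandas as pd

# The five rows from the head() output above
demo_prices = pd.DataFrame({
    "colour": ["negro", "dorado", "dorado", "marron", "multicolor"],
    "price": [16.99, 45.99, 21.99, 31.99, 40.99],
})

# Mean price and number of items per colour
summary = demo_prices.groupby("colour")["price"].agg(["mean", "count"])
print(summary)
```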

## 3. Save cleaned data

In [346]:
asos_womentshirt_clean=color_df4.copy()
#asos_womentshirt_clean.to_csv('../../data/clean/asos_womentshirt_clean.csv', index=False)
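
The export line above is commented out. Before enabling it, a round trip on a small frame can confirm that the CSV reads back as expected. This is a sketch only: `demo_clean` is a one-row stand-in for the cleaned data, and it writes to a temporary directory rather than to `../../data/clean/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# One-row stand-in for the cleaned frame
demo_clean = pd.DataFrame({
    "brand": ["New Look"],
    "description": ["Top negro escalonado de manga larga con estampado animal de New Look"],
    "price": [16.99],
    "colour": ["negro"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "asos_womentshirt_clean.csv"
    demo_clean.to_csv(out_path, index=False)  # index=False keeps the row index out of the file
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
```

Writing with `index=False` avoids the extra `Unnamed: 0` column that otherwise appears when the file is read back.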