# ETL Pipeline for music and fashion data

### Extract (E)
- Download Kaggle Datasets
- Store raw datasets in /datasets/raw/x_data

### Transform (T)
- Clean data, if necessary
- Standardize formats: dates, capitalization, classes
- Keep only rows whose dates appear in both datasets; drop rows with non-overlapping dates
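
The date-overlap step could be sketched as follows. This reads "overlap" as calendar dates that appear in both datasets; a range-based reading would be an alternative. The helper name `keep_overlapping_dates` and the toy frames are hypothetical, and the column names `Date` and `snapshot_date` are taken from the raw data used in this notebook.

```python
import pandas as pd

def keep_overlapping_dates(left, right, left_col, right_col):
    """Return copies of both frames restricted to calendar dates present in both."""
    # Normalize to midnight so timestamps with a time component compare by day.
    left_days = pd.to_datetime(left[left_col], errors='coerce').dt.normalize()
    right_days = pd.to_datetime(right[right_col], errors='coerce').dt.normalize()
    # Rows with missing or unparseable dates never match and are dropped.
    left_mask = left_days.isin(right_days.dropna())
    right_mask = right_days.isin(left_days.dropna())
    return left[left_mask].copy(), right[right_mask].copy()

# Toy data standing in for transactions_df and music_df (hypothetical values).
sales = pd.DataFrame({
    'Date': ['2023-01-01 15:42:00', '2023-01-02 09:00:00', '2023-01-03 10:00:00'],
    'Quantity': [1, 2, 3],
})
charts = pd.DataFrame({
    'snapshot_date': ['2023-01-02', '2023-01-03', '2023-01-04'],
    'daily_rank': [1, 2, 3],
})

sales_overlap, charts_overlap = keep_overlapping_dates(sales, charts, 'Date', 'snapshot_date')
```

Only the rows dated 2023-01-02 and 2023-01-03 survive in each toy frame, since those are the only days both sides share.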

### Load (L)
- Store processed datasets in /datasets/process/x_data
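
A minimal sketch of the load step, assuming each processed DataFrame is written to its own CSV file. The helper `save_processed` and the demo frame are hypothetical; the demo writes to a temporary directory rather than the project folders.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_processed(frames, out_dir):
    """Write each DataFrame in `frames` to <out_dir>/<name>.csv and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, frame in frames.items():
        path = out_dir / f'{name}.csv'
        # index=False keeps the row index out of the file, so reloading
        # does not produce an extra 'Unnamed: 0' column.
        frame.to_csv(path, index=False)
        paths.append(path)
    return paths

# Demo with a small hypothetical frame in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = {'customers': pd.DataFrame({'Customer ID': [1, 2], 'City': ['New York', 'Paris']})}
    written = save_processed(demo, Path(tmp) / 'fashion_data')
    reloaded = pd.read_csv(written[0])
```

In the notebook, the second argument would presumably be a folder under `../datasets/process/`, mirroring the raw layout; the exact subfolder names are an assumption.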

In [26]:
import pandas as pd
from datetime import datetime

In [None]:
# Fashion data
customers_df = pd.read_csv('../datasets/raw/fashion_data/customers.csv')
discounts_df = pd.read_csv('../datasets/raw/fashion_data/discounts.csv')
employees_df = pd.read_csv('../datasets/raw/fashion_data/employees.csv')
products_df = pd.read_csv('../datasets/raw/fashion_data/products.csv')
stores_df = pd.read_csv('../datasets/raw/fashion_data/stores.csv')
transactions_df = pd.read_csv('../datasets/raw/fashion_data/transactions.csv')

# Music data
music_df = pd.read_csv('../datasets/raw/music_data/universal_top_spotify_songs.csv')



## Fashion columns to keep

### Customers
- id
- city
- country
- gender
- age (derived from date of birth)
- job

In [None]:
# Drop personally identifying columns that the analysis does not need.
customers_df.drop(columns=['Name', 'Email', 'Telephone'], inplace=True)
# Approximate age as the difference in calendar years; unparseable dates become NaN.
customers_df['Age'] = pd.to_datetime('today').year - pd.to_datetime(customers_df['Date Of Birth'], errors='coerce').dt.year
customers_df.drop(columns=['Date Of Birth'], inplace=True)
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which pandas warns will stop working in 3.0.
customers_df['Job Title'] = customers_df['Job Title'].fillna('Unknown')
customers_df.head()


Unnamed: 0,Customer ID,City,Country,Gender,Job Title,Age
0,1,New York,United States,M,Unknown,22
1,2,New York,United States,M,Records manager,25
2,3,New York,United States,F,Unknown,22
3,4,New York,United States,M,Proofreader,29
4,5,New York,United States,F,Exercise physiologist,27
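
The `Age` column above is a difference of calendar years, so it can overstate age by one for customers whose birthday has not yet occurred in the current year. If exact ages matter, a sketch along these lines could be used instead. The helper `age_in_years` and the sample dates are hypothetical.

```python
import pandas as pd

def age_in_years(dob, today=None):
    """Completed years between each date of birth and `today` (missing stays missing)."""
    dob = pd.to_datetime(dob, errors='coerce')
    today = pd.Timestamp.today().normalize() if today is None else pd.Timestamp(today)
    years = today.year - dob.dt.year
    # Subtract one year when this year's birthday has not happened yet.
    before_birthday = (dob.dt.month > today.month) | (
        (dob.dt.month == today.month) & (dob.dt.day > today.day)
    )
    # Nullable integer dtype keeps unparseable dates as <NA> instead of forcing floats.
    return (years - before_birthday.astype(int)).astype('Int64')

# Hypothetical dates of birth, evaluated against a fixed reference date.
sample_dob = pd.Series(['2000-06-15', '2000-01-10', 'not a date'])
ages = age_in_years(sample_dob, today='2025-03-01')
```

Passing a fixed `today` makes the result reproducible; leaving it out uses the current date, as the notebook cell does.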


In [83]:
customers_df['Country'].value_counts()

Country
United States     354450
中国                340082
España            237575
Deutschland       205560
France            196696
United Kingdom    190574
Portugal          118369
Name: count, dtype: int64
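
The country names above are stored in local languages (for example `中国`, `España`, `Deutschland`), which makes joining against another dataset awkward. One way to standardize them is to map each name to its ISO 3166-1 alpha-2 code, as sketched below. This assumes the music data identifies countries by such two-letter codes, which the sample output in this notebook does not confirm. The `Country Code` column name and the helper `add_country_code` are hypothetical.

```python
import pandas as pd

# Names exactly as they appear in the Country value counts -> ISO 3166-1 alpha-2.
COUNTRY_TO_ISO2 = {
    'United States': 'US',
    '中国': 'CN',
    'España': 'ES',
    'Deutschland': 'DE',
    'France': 'FR',
    'United Kingdom': 'GB',
    'Portugal': 'PT',
}

def add_country_code(df, source_col='Country', target_col='Country Code'):
    """Return a copy of `df` with a two-letter code column; fail on unknown names."""
    out = df.copy()
    out[target_col] = out[source_col].map(COUNTRY_TO_ISO2)
    # Surface names missing from the mapping instead of silently leaving NaN.
    unmapped = out.loc[out[target_col].isna(), source_col].dropna().unique()
    if len(unmapped):
        raise ValueError(f'No ISO code for: {sorted(unmapped)}')
    return out

# Small hypothetical frame using names from the value counts above.
sample = pd.DataFrame({'Country': ['中国', 'España', 'United States']})
coded = add_country_code(sample)
```

Raising on unmapped names is a deliberate choice here: a new country in the raw data would otherwise drop out of later joins without notice.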

### Discounts

- Discount
- Category (contains nulls)
- Sub Category (contains nulls)
- Season (derived from the start date)

In [47]:
season_mapping = {
    1: 'Winter',
    2: 'Winter',
    3: 'Spring',
    4: 'Spring',
    5: 'Spring',
    6: 'Summer',
    7: 'Summer',
    8: 'Summer',
    9: 'Fall',
    10: 'Fall',
    11: 'Fall',
    12: 'Winter'
}

# Fill missing categories in one pass.
discounts_df.fillna({'Category': 'Unknown', 'Sub Category': 'Unknown'}, inplace=True)
# Map the start month to a season; missing or unparseable dates become 'Unknown'.
start_month = pd.to_datetime(discounts_df['Start'], errors='coerce').dt.month
discounts_df['Season'] = start_month.map(season_mapping).fillna('Unknown')
discounts_df.drop(columns=['Start', 'End', 'Description'], inplace=True)

In [78]:
discounts_df.head()

Unnamed: 0,Discont,Category,Sub Category,Season
0,0.4,Feminine,Coats and Blazers,Winter
1,0.4,Feminine,Sweaters and Knitwear,Winter
2,0.4,Masculine,Coats and Blazers,Winter
3,0.4,Masculine,Sweaters and Sweatshirts,Winter
4,0.4,Children,Coats,Winter


### Products

- Id
- Category
- Sub Category
- Description EN
- Color

In [None]:
# Keep only the English description; drop translations, sizes, and production cost.
products_df.drop(columns=['Description PT', 'Description DE', 'Description FR', 'Description ES', 'Description ZH', 'Sizes', 'Production Cost'], inplace=True)
products_df.fillna({'Color': 'Unknown'}, inplace=True)

In [77]:
products_df.head()

Unnamed: 0,Product ID,Category,Sub Category,Description EN,Color
0,1,Feminine,Coats and Blazers,Sports Velvet Sports With Buttons,Unknown
1,2,Feminine,Sweaters and Knitwear,Luxurious Pink Denim With Buttons,PINK
2,3,Feminine,Dresses and Jumpsuits,Black Tricot Printed Tricot,BLACK
3,4,Feminine,Shirts and Blouses,Basic Cotton Blouse,Unknown
4,5,Feminine,T-shirts and Tops,Basic Cotton T-Shirt,Unknown


### Transactions

- Customer ID
- Product ID
- Size
- Color
- Unit Price
- Quantity
- Date
- Discount
- Store ID
- Currency
- Payment Method
- Invoice Total

In [None]:
# Drop invoice-level identifiers and fields not needed for the analysis.
transactions_df.drop(columns=['Invoice ID', 'Line', 'Line Total', 'Employee ID', 'Currency Symbol', 'SKU', 'Transaction Type'], inplace=True)

In [79]:
transactions_df.head()

Unnamed: 0,Invoice ID,Line,Customer ID,Product ID,Size,Color,Unit Price,Quantity,Date,Discount,Line Total,Store ID,Employee ID,Currency,Currency Symbol,SKU,Transaction Type,Payment Method,Invoice Total
0,INV-US-001-03558761,1,47162,485,M,,80.5,1,2023-01-01 15:42:00,0.0,80.5,1,7,USD,$,MASU485-M-,Sale,Cash,126.7
1,INV-US-001-03558761,2,47162,2779,G,,31.5,1,2023-01-01 15:42:00,0.4,18.9,1,7,USD,$,CHCO2779-G-,Sale,Cash,126.7
2,INV-US-001-03558761,3,47162,64,M,NEUTRAL,45.5,1,2023-01-01 15:42:00,0.4,27.3,1,7,USD,$,MACO64-M-NEUTRAL,Sale,Cash,126.7
3,INV-US-001-03558762,1,10142,131,M,BLUE,70.0,1,2023-01-01 20:04:00,0.4,42.0,1,6,USD,$,FECO131-M-BLUE,Sale,Cash,77.0
4,INV-US-001-03558762,2,10142,716,L,WHITE,26.0,1,2023-01-01 20:04:00,0.0,26.0,1,6,USD,$,MAT-716-L-WHITE,Sale,Cash,77.0


## Music columns to keep

### Music

- ID
- name
- artists
- daily_rank
- daily_movement
- weekly_movement
- country
- snapshot_date
- popularity
- (all remaining music columns, e.g. audio features)

In [82]:
music_df.head()

Unnamed: 0,spotify_id,name,artists,daily_rank,daily_movement,weekly_movement,country,snapshot_date,popularity,is_explicit,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,2RkZ5LkEzeHGRsmDqKwmaJ,Ordinary,Alex Warren,1,1,0,,2025-06-11,95,False,...,2,-6.141,1,0.06,0.704,7e-06,0.055,0.391,168.115,3
1,42UBPzRMh5yyz0EDPr6fr1,Manchild,Sabrina Carpenter,2,-1,48,,2025-06-11,89,True,...,7,-5.087,1,0.0572,0.122,0.0,0.317,0.811,123.01,4
2,0FTmksd2dxiE5e3rWyJXs6,back to friends,sombr,3,0,1,,2025-06-11,98,False,...,1,-2.291,1,0.0301,9.4e-05,8.8e-05,0.0929,0.235,92.855,4
3,7so0lgd0zP2Sbgs2d7a1SZ,Die With A Smile,"Lady Gaga, Bruno Mars",4,0,-1,,2025-06-11,91,False,...,6,-7.727,0,0.0317,0.289,0.0,0.126,0.498,157.964,3
4,6dOtVTDdiauQNBQEDOtlAB,BIRDS OF A FEATHER,Billie Eilish,5,1,0,,2025-06-11,100,False,...,2,-10.171,1,0.0358,0.2,0.0608,0.117,0.438,104.978,4
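
The sample rows above have an empty `country` value and store `snapshot_date` as text. A possible standardization pass is sketched below. It assumes that a missing country marks a global chart, which is a guess based on the output and should be checked against the dataset description. The helper `tidy_music`, the `'Global'` label, and the second demo row are hypothetical.

```python
import pandas as pd

def tidy_music(df):
    """Return a copy with parsed snapshot dates and a label for missing countries."""
    out = df.copy()
    # Parse dates so they can be compared with the fashion transaction dates.
    out['snapshot_date'] = pd.to_datetime(out['snapshot_date'], errors='coerce')
    # Assumption: a missing country means the global chart.
    out['country'] = out['country'].fillna('Global')
    return out

# First row mirrors the sample output above; the second row is made up for contrast.
demo_music = pd.DataFrame({
    'name': ['Ordinary', 'Hypothetical Song'],
    'country': [None, 'ES'],
    'snapshot_date': ['2025-06-11', '2025-06-11'],
})
tidy = tidy_music(demo_music)
```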
