# Feature engineering

_________________________________________________________________________________

**Reference files:**
- combined_data.json
- [reviews]?

**Problem:** 
- Predicting the price or price range of the product

__________________________________________________________________________________

## 1.0 Loading file

In [64]:
#Import necessary libraries
import json
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

In [2]:
#Load the combined dataset into a DataFrame
with open('../data/processed_data/combined_data.json', 'r') as file:
    data = json.load(file)
df = pd.DataFrame.from_dict(data)

In [3]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| brand | Glow Recipe | Tatcha | goop | CLINIQUE | Tata Harper |
| product_name | Glow Recipe Watermelon Glow PHA +BHA Pore-Tigh... | Tatcha Pure One Step Camellia Oil Cleanser | goop GOOPGLOW Microderm Instant Glow Exfoliator | CLINIQUE Take The Day Off Makeup Remover For L... | Tata Harper Regenerating Exfoliating Cleanser |
| product_type | toners | face wash and cleansers | exfoliators and peels | face wash and cleansers | face wash and cleansers |
| num_likes | 125100 | 107600 | 12900 | 76700 | 31000 |
| rating | 4.5 | 4.5 | 4.5 | 4.5 | 4.5 |
| num_reviews | 1900 | 1700 | 1200 | 3100 | 567 |
| sensitive_type | 0 | 1 | 0 | 0 | 0 |
| combination_type | 1 | 1 | 1 | 0 | 1 |
| oily_type | 1 | 1 | 1 | 0 | 0 |
| normal_type | 1 | 1 | 1 | 0 | 0 |


## 2.0 Feature engineering

**Encode categorical columns**

In [4]:
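#One-hot encode the categorical columns.
#drop_first is left disabled, so every category keeps its own indicator column.
df = pd.get_dummies(
    df,
    columns=['formulation_type', 'richness', 'product_type', 'brand'],
    #drop_first=True
)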
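As a minimal sketch of what the encoding step does, the toy frame below (made-up values, with a column name borrowed from the notebook) shows the indicator columns that `pd.get_dummies` produces, and how the commented-out `drop_first=True` option would change them.

```python
import pandas as pd

# Toy frame: the column names mirror the notebook, the values are made up.
toy = pd.DataFrame({
    "product_type": ["toners", "face wash and cleansers", "toners"],
    "num_likes": [10, 20, 30],
})

# One indicator column per category, prefixed with the original column name.
full = pd.get_dummies(toy, columns=["product_type"])

# drop_first=True drops the first category (in sorted order) of each encoded column.
reduced = pd.get_dummies(toy, columns=["product_type"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Keeping every indicator column is harmless for tree-based models. Dropping the first level is a common choice for linear models, where a full set of indicators is collinear with the intercept.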

**Create alternative y variables**

In [5]:
#Create category based on quantiles
df['affordability']= pd.qcut(df.pricepervol, q=4, labels=['$', '$$', '$$$', '$$$$'], duplicates='raise')

In [6]:
df['affordability'].value_counts()

$$$     340
$       337
$$      334
$$$$    327
Name: affordability, dtype: int64
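The sketch below illustrates quantile-based binning with `pd.qcut` on a small made-up price series. With eight distinct values and `q=4`, each tier receives exactly two rows. The slightly uneven counts above are probably due to repeated `pricepervol` values falling on a quantile edge, though that is an assumption about the data rather than something the output confirms.

```python
import pandas as pd

# Made-up prices; only the column name comes from the notebook.
prices = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], name="pricepervol")
labels = ["$", "$$", "$$$", "$$$$"]

# q=4 cuts at the 25th, 50th and 75th percentiles, giving four tiers.
tiers = pd.qcut(prices, q=4, labels=labels)

# sort=False keeps the tiers in label order instead of ordering by count.
counts = tiers.value_counts(sort=False)
print(counts)
```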

In [7]:
#Identify outliers
#Note: both bounds below are measured from the first quartile (Q1 -/+ 1.5*IQR);
#the conventional upper fence would be Q3 + 1.5*IQR.
q1, q3 = np.quantile(df.pricepervol, [0.25, 0.75])
IQR = q3 - q1
q1 - 1.5*IQR, q1 + 1.5*IQR

(-83.025, 103.005)

For simplicity, the bins are built on a 0 to 100 range instead: four bins of width 25, plus a fifth bin for values from 100 up to 885.
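Both bounds in the outlier cell are measured from the first quartile. The conventional Tukey fences use Q1 for the lower bound and Q3 for the upper bound, which gives a wider upper limit. The sketch below shows that convention on made-up values. It is an illustration only, not a recomputation of the notebook's result.

```python
import numpy as np

# Made-up prices per volume, including one extreme value.
values = np.array([5.0, 8.0, 10.0, 12.0, 20.0, 35.0, 60.0, 72.0, 90.0, 400.0])

q1, q3 = np.quantile(values, [0.25, 0.75])
iqr = q3 - q1

# Tukey fences: lower from Q1, upper from Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, outliers)
```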

In [8]:
#Create categories based on bins
df['affordability_bins'] = pd.cut(
    df.pricepervol,
    bins=[0, 25, 50, 75, 100, 885],
    labels=['1st', '2nd', '3rd', '4th', '5th'],
    include_lowest=True,
)
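To make the bin edges concrete, the sketch below applies the same `bins` and `labels` to a few made-up prices. Intervals are closed on the right by default, so a price of exactly 25 lands in the first bin, and `include_lowest=True` lets a price of exactly 0 into the first bin as well. A value above the last edge (885) would come back as `NaN`.

```python
import pandas as pd

# Made-up prices chosen to sit on or near the bin edges.
prices = pd.Series([0.0, 25.0, 25.01, 99.0, 100.0, 500.0])
bins = [0, 25, 50, 75, 100, 885]
labels = ["1st", "2nd", "3rd", "4th", "5th"]

binned = pd.cut(prices, bins=bins, labels=labels, include_lowest=True)
print(binned.tolist())

# A value beyond the last edge is not assigned to any bin.
beyond = pd.cut(pd.Series([900.0]), bins=bins, labels=labels, include_lowest=True)
print(beyond.isna().tolist())
```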

## 3.0 Train-test split

In [37]:
targeted_var = 'pricepervol'
excluded_features = ['product_name', 
                    #y values 
                     'pricepervol', 'affordability', 'affordability_bins',
                     'highlighted_ingr', 'ingr_list'
                    ]

In [38]:
X= df.drop(columns=excluded_features)
y= df[targeted_var]

In [39]:
df.columns

Index(['product_name', 'num_likes', 'rating', 'num_reviews', 'sensitive_type',
       'combination_type', 'oily_type', 'normal_type', 'dry_type',
       'clean_sephora',
       ...
       'brand_belif', 'brand_fresh', 'brand_goop', 'brand_innisfree',
       'brand_lilah b.', 'brand_philosophy', 'brand_rms beauty', 'brand_tarte',
       'affordability', 'affordability_bins'],
      dtype='object', length=181)

In [40]:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size= 0.3, random_state=12)

In [41]:
X_tr.shape, X_te.shape

((936, 175), (402, 175))
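The shapes follow from the row and column counts: the four `affordability` tiers sum to 1,338 rows, and the 181 columns minus the 6 excluded ones leave 175 features. With `test_size=0.3`, scikit-learn rounds the test share up, so 402 rows go to the test set and 936 to training. A small sketch on placeholder arrays of the same length reproduces the row split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the notebook's frame.
n_rows = 1338
X = np.arange(n_rows * 2).reshape(n_rows, 2)
y = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=12)

print(X_tr.shape, X_te.shape)
```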

## 4.0 Transforming the features

In [60]:
#Create a copy of datasets
X_tr_scaled = X_tr.copy()
X_te_scaled = X_te.copy()

In [61]:
target_cols = ['num_likes', 'num_reviews']

for col in target_cols:
    # fit on the training column only, so no test-set statistics leak into the scaler
    scaler = StandardScaler().fit(X_tr_scaled[[col]])
    
    # transform the training data column
    X_tr_scaled[col] = scaler.transform(X_tr_scaled[[col]]).ravel()
    
    # transform the testing data column with the training mean and standard deviation
    X_te_scaled[col] = scaler.transform(X_te_scaled[[col]]).ravel()
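Because `StandardScaler` standardises each column independently, the per-column loop can also be written with a single scaler fitted on both columns at once. The sketch below shows this on small made-up frames. The frame contents are assumptions, and only the column names come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up train and test frames; floats avoid dtype changes on assignment.
X_tr = pd.DataFrame({
    "num_likes": [100.0, 200.0, 300.0],
    "num_reviews": [10.0, 20.0, 60.0],
    "rating": [4.0, 4.5, 5.0],
})
X_te = pd.DataFrame({
    "num_likes": [150.0, 250.0],
    "num_reviews": [15.0, 30.0],
    "rating": [3.5, 4.5],
})
target_cols = ["num_likes", "num_reviews"]

# Fit once on the training columns, then reuse the fitted scaler for the test set.
scaler = StandardScaler().fit(X_tr[target_cols])

X_tr_scaled = X_tr.copy()
X_te_scaled = X_te.copy()
X_tr_scaled[target_cols] = scaler.transform(X_tr[target_cols])
X_te_scaled[target_cols] = scaler.transform(X_te[target_cols])

print(X_tr_scaled)
```

Keeping one fitted `scaler` object also makes it easier to apply the same transformation to new data later.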

## 5.0 Saving the dataframes

In [65]:
datapath = '../data/processed_data'
datapath_df = os.path.join(datapath, 'pre_modelling_df.json')
#Write the file only if it does not exist yet; an existing file is never overwritten
if not os.path.exists(datapath_df):
    df.to_json(datapath_df)
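One thing to keep in mind when reloading this file is that JSON does not store pandas dtypes, so a categorical column such as `affordability` comes back as plain strings. The sketch below shows the round trip on a made-up two-row frame written to a temporary directory. The file name is reused from the cell above purely for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "num_likes": [125100, 107600],
    "affordability": pd.Categorical(["$", "$$"]),
})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "pre_modelling_df.json")
    # Same guard as the notebook: write only when the file is missing.
    if not os.path.exists(path):
        toy.to_json(path)
    reloaded = pd.read_json(path)

print(toy["affordability"].dtype)
print(reloaded["affordability"].dtype)
```

If the category order matters downstream, the column can be converted back with `pd.Categorical` after loading.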