## Cleaning the Dataset for Data Visualization
### -> See the [raw data](../data/raw/cosmetics.csv)
### -> See the [cleaned data](../data/processed/cleaned_data.csv)

#### This notebook cleans the raw data: it drops unused columns, checks for duplicate and null values, encodes the categorical columns, and converts the data types.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy as sp

In [3]:
df = pd.read_csv('../data/raw/cosmetics.csv')
df.head(5)

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive
0,Moisturizer,LA MER,Crème de la Mer,175,4.1,"Algae (Seaweed) Extract, Mineral Oil, Petrolat...",1,1,1,1,1
1,Moisturizer,SK-II,Facial Treatment Essence,179,4.1,"Galactomyces Ferment Filtrate (Pitera), Butyle...",1,1,1,1,1
2,Moisturizer,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,"Water, Dicaprylyl Carbonate, Glycerin, Ceteary...",1,1,1,1,0
3,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,"Algae (Seaweed) Extract, Cyclopentasiloxane, P...",1,1,1,1,1
4,Moisturizer,IT COSMETICS,Your Skin But Better™ CC+™ Cream with SPF 50+,38,4.1,"Water, Snail Secretion Filtrate, Phenyl Trimet...",1,1,1,1,1


In [4]:
df.describe()

Unnamed: 0,Price,Rank,Combination,Dry,Normal,Oily,Sensitive
count,1472.0,1472.0,1472.0,1472.0,1472.0,1472.0,1472.0
mean,55.584239,4.153261,0.65625,0.61413,0.652174,0.607337,0.513587
std,45.014429,0.633918,0.47512,0.486965,0.476442,0.488509,0.499985
min,3.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,30.0,4.0,0.0,0.0,0.0,0.0,0.0
50%,42.5,4.3,1.0,1.0,1.0,1.0,1.0
75%,68.0,4.5,1.0,1.0,1.0,1.0,1.0
max,370.0,5.0,1.0,1.0,1.0,1.0,1.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1472 entries, 0 to 1471
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Label        1472 non-null   object 
 1   Brand        1472 non-null   object 
 2   Name         1472 non-null   object 
 3   Price        1472 non-null   int64  
 4   Rank         1472 non-null   float64
 5   Ingredients  1472 non-null   object 
 6   Combination  1472 non-null   int64  
 7   Dry          1472 non-null   int64  
 8   Normal       1472 non-null   int64  
 9   Oily         1472 non-null   int64  
 10  Sensitive    1472 non-null   int64  
dtypes: float64(1), int64(6), object(4)
memory usage: 126.6+ KB


## Cleaning the Dataset

* Omit the skin type columns (`Combination`, `Dry`, `Normal`, `Oily`, `Sensitive`)
* Find any null values, then either drop those rows or fill them with `fillna()`
* This dataset has no null values, so nothing needs to be dropped or filled
* Remove duplicate rows (none found)
* Standardize capitalization
* Encode the `object` columns `Brand` and `Label` as numbers, then convert them and the `int` column `Price` to float so the numeric columns share one dtype
* Remove empty strings
* Remove empty lists
* Remove empty dictionaries
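
The capitalization and empty-string steps in the list above are not shown in a cell of this notebook. A minimal sketch of how they might look follows, assuming a small made-up frame (`sample` and its values are hypothetical, not taken from the cosmetics data):

```python
import pandas as pd

# Hypothetical sample rows; in the notebook this would be the cosmetics DataFrame.
sample = pd.DataFrame({
    "Brand": ["LA MER", "la mer", " Drunk Elephant ", ""],
    "Price": [175, 175, 68, 20],
})

# Standardize capitalization and strip surrounding whitespace.
sample["Brand"] = sample["Brand"].str.strip().str.upper()

# Drop rows whose Brand is an empty string.
non_empty = sample[sample["Brand"] != ""]

# Rows that only differed by capitalization are now exact duplicates.
cleaned = non_empty.drop_duplicates().reset_index(drop=True)
print(cleaned)
```

Standardizing the text before calling `drop_duplicates()` matters, because `"LA MER"` and `"la mer"` would otherwise count as different values.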

### Other
* Check how spread out the data is (z-score) and determine whether the dataset is a representative sample
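
The z-score check is not run in this notebook. One way it could be sketched is shown below, assuming a handful of made-up prices (`prices` is hypothetical; `df["Price"]` would take its place) and a cutoff of 3 standard deviations, which is a common convention rather than a rule:

```python
import pandas as pd

# Hypothetical prices; substitute df["Price"] from the notebook.
prices = pd.Series([30.0, 42.0, 68.0, 175.0, 38.0], name="Price")

# Z-score: distance from the mean in units of standard deviation.
# ddof=0 uses the population standard deviation.
z_scores = (prices - prices.mean()) / prices.std(ddof=0)

# Flag values more than 3 standard deviations from the mean (assumed cutoff).
outliers = prices[z_scores.abs() > 3]

print(z_scores.round(2))
print(outliers)
```

With only five values no point can reach the cutoff, so `outliers` is empty here. On the full 1,472-row column the same code would show which prices sit far from the mean.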

### Omitting the Skin Type Columns

In [6]:
# Drop the five skin type indicator columns
df = df.drop(columns=['Combination', 'Dry', 'Normal', 'Oily', 'Sensitive'])
df.head(5)

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients
0,Moisturizer,LA MER,Crème de la Mer,175,4.1,"Algae (Seaweed) Extract, Mineral Oil, Petrolat..."
1,Moisturizer,SK-II,Facial Treatment Essence,179,4.1,"Galactomyces Ferment Filtrate (Pitera), Butyle..."
2,Moisturizer,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,"Water, Dicaprylyl Carbonate, Glycerin, Ceteary..."
3,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,"Algae (Seaweed) Extract, Cyclopentasiloxane, P..."
4,Moisturizer,IT COSMETICS,Your Skin But Better™ CC+™ Cream with SPF 50+,38,4.1,"Water, Snail Secretion Filtrate, Phenyl Trimet..."


### Checking for duplicates
-> result: none

In [7]:
duplicates = df.duplicated()
print(duplicates)

0       False
1       False
2       False
3       False
4       False
        ...  
1467    False
1468    False
1469    False
1470    False
1471    False
Length: 1472, dtype: bool
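
The printed Series is truncated, so it cannot confirm by eye that every row is `False`. Summing the boolean mask gives a single count instead. A small sketch, assuming a made-up frame with one repeated row (`sample` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with one repeated row.
sample = pd.DataFrame({
    "Brand": ["LA MER", "SK-II", "LA MER"],
    "Name": ["Crème de la Mer", "Facial Treatment Essence", "Crème de la Mer"],
})

# True counts as 1, so the sum is the number of duplicated rows.
duplicate_count = int(sample.duplicated().sum())
print(duplicate_count)

# Keep the first occurrence of each row and drop the repeats.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
```

Applied to `df`, a count of 0 would back up the "none" result stated above.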


### Checking for null values
-> result: none

In [8]:
null_values = df.isnull()
print(null_values)

      Label  Brand   Name  Price   Rank  Ingredients
0     False  False  False  False  False        False
1     False  False  False  False  False        False
2     False  False  False  False  False        False
3     False  False  False  False  False        False
4     False  False  False  False  False        False
...     ...    ...    ...    ...    ...          ...
1467  False  False  False  False  False        False
1468  False  False  False  False  False        False
1469  False  False  False  False  False        False
1470  False  False  False  False  False        False
1471  False  False  False  False  False        False

[1472 rows x 6 columns]
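
As with the duplicate check, the truncated table shows only the first and last rows. Counting nulls per column is a more direct test. A sketch, assuming a made-up frame with two missing values (`sample` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with one missing Brand and one missing Price.
sample = pd.DataFrame({
    "Brand": ["LA MER", None, "SK-II"],
    "Price": [175.0, 179.0, None],
})

# Number of null values in each column.
null_counts = sample.isnull().sum()
print(null_counts)

# Total across the whole frame; 0 means there is nothing to drop or fill.
total_nulls = int(null_counts.sum())
print(total_nulls)
```

For `df`, the `df.info()` output earlier already reports 1472 non-null entries in every column, which agrees with the "none" result.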


### Normalizing the Data

#### Use a label encoder to convert Brand and Label to integer codes
#### Convert Price, Label, and Brand to float64

In [9]:
from sklearn.preprocessing import LabelEncoder

# Fit one encoder per column so each mapping can be inverted later
encoders = {}
cols_to_encode = ['Brand', 'Label']
for col in cols_to_encode:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Cast the numeric columns to float64
cols_to_convert = ['Price', 'Label', 'Brand']
df[cols_to_convert] = df[cols_to_convert].astype(float)

df.head(5)

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients
0,3.0,64.0,Crème de la Mer,175.0,4.1,"Algae (Seaweed) Extract, Mineral Oil, Petrolat..."
1,3.0,95.0,Facial Treatment Essence,179.0,4.1,"Galactomyces Ferment Filtrate (Pitera), Butyle..."
2,3.0,29.0,Protini™ Polypeptide Cream,68.0,4.4,"Water, Dicaprylyl Carbonate, Glycerin, Ceteary..."
3,3.0,64.0,The Moisturizing Soft Cream,175.0,3.8,"Algae (Seaweed) Extract, Cyclopentasiloxane, P..."
4,3.0,49.0,Your Skin But Better™ CC+™ Cream with SPF 50+,38.0,4.1,"Water, Snail Secretion Filtrate, Phenyl Trimet..."
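
After encoding, the brand and label names are no longer readable in the frame, so it can help to keep a lookup from code back to name. The sketch below uses pandas categoricals as a stand-in for `LabelEncoder` (an assumption, not the notebook's method); both assign codes in sorted order of the unique values. The `brands` Series is hypothetical.

```python
import pandas as pd

# Hypothetical brand column.
brands = pd.Series(["LA MER", "SK-II", "DRUNK ELEPHANT", "LA MER"], name="Brand")

# Categories are sorted, so codes follow alphabetical order of the unique values.
categorical = brands.astype("category")
codes = categorical.cat.codes.astype(float)

# Lookup table from integer code back to the original brand name.
code_to_brand = dict(enumerate(categorical.cat.categories))

# Decode to confirm the mapping round-trips.
decoded = codes.astype(int).map(code_to_brand)
print(code_to_brand)
```

The codes are arbitrary identifiers: a larger number does not mean a "bigger" brand, which is worth keeping in mind when plotting them on a numeric axis.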


#### Exporting the cleaned data
##### -> Will be stored in data/processed

In [10]:
df.to_csv('../data/processed/cleaned_data.csv', index=False)
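
Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra index column. A short round-trip sketch, assuming an in-memory buffer in place of the real file path and a made-up two-row frame (`sample` is hypothetical):

```python
import io

import pandas as pd

# Hypothetical cleaned rows.
sample = pd.DataFrame({"Label": [3.0, 3.0], "Price": [175.0, 68.0]})

# Write to an in-memory buffer instead of '../data/processed/cleaned_data.csv'.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Read the CSV text back and check the columns survived unchanged.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
```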