# Machine Learning Fundamentals by Ninh Nguyen
Transforming Data into Features

You are a data scientist at a clothing company, working with a dataset of customer reviews. The dataset comes from Kaggle and lends itself to a range of machine learning tasks. Your task is to transform some of its features so the data is more useful for analysis. Along the way, you will practice the following:

    Transforming categorical data
    Scaling your data
    Working with date-time features

Let’s get started!


## Kaggle Dataset Source

https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews?resource=download

In [121]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

Import your dataset. Save it to a variable called reviews.

Next, we want to look at the column names of our dataset along with their data types. Do the following two steps:

    Print the column names of your dataset.
    Check your features’ data types by calling .info()

In [122]:
reviews = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')
reviews

|  | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23481 | 23481 | 1104 | 34 | Great dress for many occasions | I was very happy to snag this dress at such a ... | 5 | 1 | 0 | General Petite | Dresses | Dresses |
| 23482 | 23482 | 862 | 48 | Wish it was made of cotton | It reminds me of maternity clothes. soft, stre... | 3 | 1 | 0 | General Petite | Tops | Knits |
| 23483 | 23483 | 1104 | 31 | Cute, but see through | This fit well, but the top was very see throug... | 3 | 0 | 1 | General Petite | Dresses | Dresses |
| 23484 | 23484 | 1084 | 28 | Very cute dress, perfect for summer parties an... | I bought this dress for a wedding i have this ... | 3 | 1 | 2 | General | Dresses | Dresses |


In [123]:
reviews.columns

Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
       'Recommended IND', 'Positive Feedback Count', 'Division Name',
       'Department Name', 'Class Name'],
      dtype='object')

In [124]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB
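
The Unnamed: 0 column in the summary above looks like a row index that was written into the CSV when it was saved. The following is a minimal sketch, assuming that is the case, of how passing index_col=0 to pd.read_csv keeps such a column out of the feature set. The inline CSV text and the names csv_text, df_default, and df_indexed are hypothetical and stand in for the real file.

```python
import io

import pandas as pd

# Hypothetical three-row CSV with an unnamed first column, mimicking a saved index.
csv_text = ",Age,Rating\n0,33,4\n1,34,5\n2,60,3\n"

# Default read: the unnamed first column is loaded as a regular column.
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.columns.tolist())  # ['Unnamed: 0', 'Age', 'Rating']

# index_col=0 tells pandas to use the first column as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_indexed.columns.tolist())  # ['Age', 'Rating']
```

The notebook takes a different route and simply leaves the column out when it selects features later on, which works equally well.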


Transform the Recommended IND feature. Start by printing the feature’s .value_counts() to see which values it contains and whether they still need to be converted to 0 and 1.

In [125]:
reviews['Recommended IND'].value_counts()

1    19314
0     4172
Name: Recommended IND, dtype: int64

In [126]:
# The commented-out code below is only needed if the value counts above are not already 1 and 0 (for example, True and False).
# binary_dict = {True:1, False:0}
# binary_dict

In [127]:
# reviews['Recommended IND'] =  reviews['Recommended IND'].map(binary_dict)

In [128]:
# reviews['Recommended IND'].value_counts()
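
Because Recommended IND is already stored as 1 and 0, the mapping cells above stay commented out. As a rough illustration of what they would do, here is a small sketch on a hypothetical Series of True/False values; the toy data and the names recommended and encoded are assumptions and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy column holding booleans instead of 1/0.
recommended = pd.Series([True, False, True, True], name="Recommended IND")

# Same dictionary as in the commented-out cell above.
binary_dict = {True: 1, False: 0}

# Series.map replaces each value with its dictionary entry;
# values missing from the dictionary would become NaN.
encoded = recommended.map(binary_dict)

print(encoded.value_counts())
```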

In [129]:
reviews['Rating'].value_counts()

5    13131
4     5077
3     2871
2     1565
1      842
Name: Rating, dtype: int64

In [130]:
# The commented-out code below is only needed if the value counts above show text labels rather than numerical ratings (e.g., 5).
# rating_dict = {'Loved it': 5, 'Liked it': 4, 'Was okay': 3, 'Not great': 2, 'Hated it': 1}
# rating_dict

In [131]:
# reviews['Rating'] =  reviews['Rating'].map(rating_dict)

In [132]:
# reviews['Rating'].value_counts()
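
Rating is ordinal: the order of the categories carries meaning, so each label maps to a number that preserves that order. The sketch below applies the same dictionary idea to a hypothetical column of text labels. The toy values and the names rating_text and rating_num are assumptions, since the real column is already numeric.

```python
import pandas as pd

# Hypothetical toy column with text labels instead of numbers.
rating_text = pd.Series(["Loved it", "Was okay", "Hated it", "Liked it", "Loved it"])

# Ordered mapping from label to score (keys must be quoted strings).
rating_dict = {"Loved it": 5, "Liked it": 4, "Was okay": 3, "Not great": 2, "Hated it": 1}

rating_num = rating_text.map(rating_dict)

# Labels missing from the dictionary would silently become NaN, so check for them.
unmapped = rating_num.isna().sum()
print(rating_num.tolist(), "unmapped:", unmapped)
```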

In [133]:
reviews['Department Name'].value_counts()

Tops        10468
Dresses      6319
Bottoms      3799
Intimate     1735
Jackets      1032
Trend         119
Name: Department Name, dtype: int64

In [134]:
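# One-hot encode the department; dtype=int keeps the 0/1 output on pandas >= 2.0, where the default dtype is bool
one_hot = pd.get_dummies(reviews['Department Name'], dtype=int)
one_hot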

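|  | Bottoms | Dresses | Intimate | Jackets | Tops | Trend |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 23481 | 0 | 1 | 0 | 0 | 0 | 0 |
| 23482 | 0 | 0 | 0 | 0 | 1 | 0 |
| 23483 | 0 | 1 | 0 | 0 | 0 | 0 |
| 23484 | 0 | 1 | 0 | 0 | 0 | 0 |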


In [135]:
reviews = reviews.join(one_hot)

In [136]:
reviews.columns

Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
       'Recommended IND', 'Positive Feedback Count', 'Division Name',
       'Department Name', 'Class Name', 'Bottoms', 'Dresses', 'Intimate',
       'Jackets', 'Tops', 'Trend'],
      dtype='object')
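
Department Name is nominal, with no natural order, which is why it is one-hot encoded rather than mapped to numbers. The .info() output earlier shows 23472 non-null values for this column, so a few rows have no department. The sketch below, on hypothetical toy data, shows how pd.get_dummies treats such rows by default: they get 0 in every indicator column. The frame toy and its values are assumptions used only for illustration.

```python
import pandas as pd

# Hypothetical toy frame with one missing department.
toy = pd.DataFrame({"Department Name": ["Tops", "Dresses", None, "Tops"]})

# dtype=int gives 0/1 columns regardless of the pandas version.
one_hot = pd.get_dummies(toy["Department Name"], dtype=int)

# join aligns on the row index, so each row keeps its own indicators.
toy = toy.join(one_hot)

print(toy)
# The row with the missing department has 0 in both 'Dresses' and 'Tops'.
```

If missing departments should be tracked explicitly, get_dummies also accepts dummy_na=True, which adds a separate indicator column for them.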

In [137]:
reviews = reviews[['Clothing ID', 'Age', 'Recommended IND', 'Rating', 'Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']].copy()
reviews

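|  | Clothing ID | Age | Recommended IND | Rating | Bottoms | Dresses | Intimate | Jackets | Tops | Trend |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 767 | 33 | 1 | 4 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1080 | 34 | 1 | 5 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1077 | 60 | 0 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1049 | 50 | 1 | 5 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 847 | 47 | 1 | 5 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23481 | 1104 | 34 | 1 | 5 | 0 | 1 | 0 | 0 | 0 | 0 |
| 23482 | 862 | 48 | 1 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 23483 | 1104 | 31 | 0 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 23484 | 1084 | 28 | 1 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |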


In [138]:
reviews = reviews.set_index('Clothing ID')
reviews

| Clothing ID | Age | Recommended IND | Rating | Bottoms | Dresses | Intimate | Jackets | Tops | Trend |
|---|---|---|---|---|---|---|---|---|---|
| 767 | 33 | 1 | 4 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1080 | 34 | 1 | 5 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1077 | 60 | 0 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1049 | 50 | 1 | 5 | 1 | 0 | 0 | 0 | 0 | 0 |
| 847 | 47 | 1 | 5 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1104 | 34 | 1 | 5 | 0 | 1 | 0 | 0 | 0 | 0 |
| 862 | 48 | 1 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1104 | 31 | 0 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1084 | 28 | 1 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |


In [139]:
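# Standardize every column to mean 0 and standard deviation 1.
# fit_transform returns a NumPy array, so the column names and the index are not carried over.
scaler = StandardScaler()
scaler.fit_transform(reviews)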

array([[-0.83054886,  0.4647678 , -0.17660399, ..., -0.21438431,
        -0.89672592, -0.07136282],
       [-0.74911087,  0.4647678 ,  0.72429116, ..., -0.21438431,
        -0.89672592, -0.07136282],
       [ 1.36827674, -2.15161203, -1.07749914, ..., -0.21438431,
        -0.89672592, -0.07136282],
       ...,
       [-0.99342483, -2.15161203, -1.07749914, ..., -0.21438431,
        -0.89672592, -0.07136282],
       [-1.23773878,  0.4647678 , -1.07749914, ..., -0.21438431,
        -0.89672592, -0.07136282],
       [ 0.71677286,  0.4647678 ,  0.72429116, ..., -0.21438431,
        -0.89672592, -0.07136282]])
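
StandardScaler replaces each value x in a column with (x - mean) / std, where std is the population standard deviation of that column. The sketch below checks this by hand on a hypothetical five-row frame and, as one possible variation, scales only Age while leaving the 0/1 indicator untouched. Whether binary columns should be standardized at all depends on the model, so treat this as an illustration and not as the notebook's method. The names toy, manual, and toy_scaled are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame; values are chosen only for illustration.
toy = pd.DataFrame({"Age": [33, 34, 60, 50, 47], "Recommended IND": [1, 1, 0, 1, 1]})

# Fit and transform a single column (double brackets keep it two-dimensional).
scaler = StandardScaler()
scaled = scaler.fit_transform(toy[["Age"]])

# Manual z-score; ddof=0 matches the population std that StandardScaler uses.
manual = (toy["Age"] - toy["Age"].mean()) / toy["Age"].std(ddof=0)
print(np.allclose(scaled.ravel(), manual.to_numpy()))

# Write the scaled values back so column names and the index are preserved.
toy_scaled = toy.copy()
toy_scaled["Age"] = scaled.ravel()
print(toy_scaled)
```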