### Transforming Data into Features

You are a data scientist at a clothing company, working with a dataset of customer reviews. The dataset is originally from Kaggle and has a lot of potential for various machine learning purposes. Your task is to transform some of its features to make the data more useful for analysis. Along the way, you will practice the following:

- Transforming categorical data

- Scaling your data

- Working with date-time features

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

In [2]:
#import data
reviews = pd.read_csv('reviews.csv')

In [3]:
#print .info
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   clothing_id      5000 non-null   int64 
 1   age              5000 non-null   int64 
 2   review_title     4174 non-null   object
 3   review_text      4804 non-null   object
 4   recommended      5000 non-null   bool  
 5   division_name    4996 non-null   object
 6   department_name  4996 non-null   object
 7   review_date      5000 non-null   object
 8   rating           5000 non-null   object
dtypes: bool(1), int64(2), object(6)
memory usage: 317.5+ KB


In [4]:
#look at the counts of recommended
reviews['recommended'].value_counts()

True     4166
False     834
Name: recommended, dtype: int64

In [5]:
#create a binary dictionary
binary_dict = {True:1, False:0}

In [6]:
#overwrite recommended with its binary (1/0) encoding
reviews['recommended'] = reviews['recommended'].map(binary_dict)
 
#print your transformed column
reviews['recommended'].value_counts()

1    4166
0     834
Name: recommended, dtype: int64
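
For a boolean column, the dictionary mapping above can also be written as a type cast. The following is a minimal sketch on a made-up three-row Series (not the real `recommended` column) showing that both approaches give the same 1/0 values:

```python
import pandas as pd

# Toy stand-in for the boolean `recommended` column (values are made up)
recommended = pd.Series([True, False, True])

# Approach used in the notebook: explicit dictionary mapping
binary_dict = {True: 1, False: 0}
via_map = recommended.map(binary_dict)

# Alternative: cast the booleans directly to integers
via_astype = recommended.astype(int)

# Both approaches produce the same 1/0 values
print((via_map == via_astype).all())
```

The dictionary version is more explicit about which value becomes which number, which can help when the raw column holds strings such as "yes"/"no" rather than true booleans.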

In [7]:
#look at the counts of rating
reviews['rating'].value_counts()

Loved it     2798
Liked it     1141
Was okay      564
Not great     304
Hated it      193
Name: rating, dtype: int64

In [8]:
#create dictionary
rating_dict = {'Loved it':5, 'Liked it':4, 'Was okay':3, 'Not great':2, 'Hated it':1}

In [9]:
#overwrite rating with its ordinal (1-5) encoding
reviews['rating'] = reviews['rating'].map(rating_dict)
 
#print your transformed column
reviews['rating'].value_counts()

5    2798
4    1141
3     564
2     304
1     193
Name: rating, dtype: int64
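
One thing to watch with `.map()` and a dictionary: any label that is not a key in the dictionary becomes `NaN` without raising an error. The sketch below uses a made-up Series with one deliberately misspelled label to show how such values could be spotted after mapping:

```python
import pandas as pd

# Toy stand-in for the `rating` column; the last label is deliberately
# misspelled (lowercase "l") to simulate dirty data
ratings = pd.Series(['Loved it', 'Liked it', 'Was okay',
                     'Not great', 'Hated it', 'loved it'])

rating_dict = {'Loved it': 5, 'Liked it': 4, 'Was okay': 3,
               'Not great': 2, 'Hated it': 1}

mapped = ratings.map(rating_dict)

# Labels that were not found in the dictionary are now NaN;
# list the original labels that failed to map
unmapped = ratings[mapped.isna()].unique()
print(unmapped)
```

In the notebook above, the five counts after mapping add up to 5000, so every label was matched.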

In [10]:
#look at the counts of department_name
reviews['department_name'].value_counts()

Tops        2196
Dresses     1322
Bottoms      848
Intimate     378
Jackets      224
Trend         28
Name: department_name, dtype: int64

In [11]:
#one-hot encode department_name with get_dummies
one_hot = pd.get_dummies(reviews['department_name'])
one_hot

      Bottoms  Dresses  Intimate  Jackets  Tops  Trend
0           0        1         0        0     0      0
1           0        1         0        0     0      0
2           0        0         1        0     0      0
3           0        1         0        0     0      0
4           0        1         0        0     0      0
...       ...      ...       ...      ...   ...    ...
4995        0        0         0        0     1      0
4996        0        0         0        0     1      0
4997        0        1         0        0     0      0
4998        1        0         0        0     0      0


In [12]:
#join the new columns back onto the original
reviews = reviews.join(one_hot)
 
#print column names
reviews.columns

Index(['clothing_id', 'age', 'review_title', 'review_text', 'recommended',
       'division_name', 'department_name', 'review_date', 'rating', 'Bottoms',
       'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
      dtype='object')
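
Note that `department_name` has a few missing values (4996 non-null out of 5000). By default, `pd.get_dummies` gives such rows a 0 in every dummy column. The sketch below, on a made-up four-row Series, illustrates this and shows the `dummy_na=True` option, which adds a separate indicator column for missing values. It passes `dtype=int` because newer pandas versions (2.0 and later) return boolean dummies by default, whereas the output above shows 0/1 integers.

```python
import pandas as pd

# Toy stand-in for `department_name`, including one missing value
departments = pd.Series(['Tops', 'Dresses', None, 'Tops'],
                        name='department_name')

# Default behaviour: the missing row gets 0 in every dummy column
one_hot = pd.get_dummies(departments, dtype=int)
print(one_hot.sum(axis=1).tolist())

# dummy_na=True adds an extra indicator column for missing values,
# so every row has exactly one 1
with_nan = pd.get_dummies(departments, dummy_na=True, dtype=int)
print(with_nan.sum(axis=1).tolist())
```

Whether an all-zero row or an explicit missing-value column is preferable depends on the model and on how the missing rows should be treated downstream.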

Transform the review_date feature.

This feature is stored as an object type, but it should be a date-time feature. Convert review_date with pd.to_datetime, then print the feature type to confirm the transformation.

In [13]:
#transform review_date to date-time data
reviews['review_date'] = pd.to_datetime(reviews['review_date'])
 
#print the feature type to confirm the transformation
reviews['review_date'].dtype

dtype('<M8[ns]')
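
Once a column has a date-time dtype, the `.dt` accessor can split it into numeric features such as year, month, or day of week. The following is a minimal sketch on made-up dates. The date strings and the new column names are assumptions for illustration, since the actual format of `review_date` in `reviews.csv` is not shown here.

```python
import pandas as pd

# Made-up date strings standing in for `review_date`
dates = pd.Series(['2019-01-15', '2019-07-04', '2020-12-31'])

# Convert the strings to datetime64 values
parsed = pd.to_datetime(dates)

# Derive numeric features from the parsed dates
features = pd.DataFrame({
    'year': parsed.dt.year,
    'month': parsed.dt.month,
    'day_of_week': parsed.dt.dayofweek,  # Monday=0, Sunday=6
})
print(features)
```

Features like these are numeric, so unlike the raw date-time column they could be kept alongside the other numerical columns for modeling.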

### Scaling the Data



In [14]:
#get numerical columns
reviews = reviews[['clothing_id', 'age', 'recommended', 'rating', 'Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']].copy()

In [15]:
#set clothing_id as the index so the identifier is not scaled
reviews = reviews.set_index('clothing_id')

In [16]:
#fit transform data
scaler = StandardScaler()
scaler.fit_transform(reviews)

array([[-0.34814459,  0.44742824, -0.1896478 , ..., -0.21656679,
        -0.88496718, -0.07504356],
       [-1.24475223,  0.44742824,  0.71602461, ..., -0.21656679,
        -0.88496718, -0.07504356],
       [-0.51116416,  0.44742824,  0.71602461, ..., -0.21656679,
        -0.88496718, -0.07504356],
       ...,
       [-0.59267395,  0.44742824,  0.71602461, ..., -0.21656679,
        -0.88496718, -0.07504356],
       [-1.24475223,  0.44742824,  0.71602461, ..., -0.21656679,
        -0.88496718, -0.07504356],
       [ 1.68960003,  0.44742824,  0.71602461, ..., -0.21656679,
         1.12998541, -0.07504356]])
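
`fit_transform` returns a plain NumPy array, so the column names and the index are lost. The sketch below, on a small made-up DataFrame (the values and `clothing_id` labels are assumptions), wraps the result back into a DataFrame and checks it against the formula `StandardScaler` applies by default: subtract each column's mean and divide by its population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up numeric data indexed by a hypothetical clothing_id
toy = pd.DataFrame(
    {'age': [25, 35, 45, 55], 'rating': [5, 4, 2, 1]},
    index=pd.Index([101, 102, 103, 104], name='clothing_id'),
)

# Scale, then restore the column names and index
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy),
                      columns=toy.columns, index=toy.index)

# The same result computed by hand: (x - mean) / population std
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual))
print(scaled.mean().round(6).tolist())
print(scaled.std(ddof=0).round(6).tolist())
```

After scaling, each column has a mean of 0 and a standard deviation of 1, which is why the binary columns in the output above no longer contain 0s and 1s.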