# 1.1: Pre-Processing
In this notebook, we pre-process the reviews data by reformatting columns and removing missing values.  We assume that the `reviews` dataset is saved as a CSV file in a `kedro` catalog so that we can automatically load the data as a `pandas` dataframe:

In [4]:
import pandas as pd
import numpy as np
import typing
# `io` is the Kedro data catalog made available in the notebook session
reviews = io.load('reviews')
reviews.head()

2019-07-06 10:39:55,484 - kedro.io.data_catalog - INFO - Loading data from `reviews` (CSVLocalDataSet)...


,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


First, we generate some descriptive statistics to understand the distribution of the data:

In [7]:
def _word_count(text):
    return len(text.split()) if isinstance(text, str) else np.nan
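
One caveat when guarding against missing text: `NaN` compares unequal to everything, including itself, so an equality test against `np.nan` cannot detect missing values. The sketch below illustrates this and checks the type explicitly instead. `safe_word_count` is a hypothetical name used only for illustration; it is not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd


def safe_word_count(text):
    """Count whitespace-separated words; return NaN for missing or non-string input."""
    # Hypothetical helper for illustration only.
    if not isinstance(text, str):
        return np.nan
    return len(text.split())


# NaN is never equal to itself, so `text != np.nan` is True for every input.
nan_compares_unequal = (np.nan != np.nan)

# Missing entries pass through as NaN instead of raising an AttributeError.
counts = pd.Series(["Love this dress", np.nan, "Flattering shirt"]).apply(safe_word_count)
```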

In [8]:
def summarise(df):
    """Generate custom summary statistics for a DataFrame."""
    DIGITS = 3  # used for rounding
    dtypes = df.dtypes  # data types
    nulls = df.isnull().sum()  # number of null values
    not_nulls = df.notnull().sum()  # number of not-null values
    # Mean word count per column (non-string cells are ignored).
    # Computed for reference only; it is not included in the returned table.
    avg_wcs = df.apply(
        lambda col: np.mean(
            col.apply(lambda x: _word_count(x) if isinstance(x, str) else np.nan)
        )
    )
    uniques = df.apply(lambda col: len(col.unique()))
    summary1 = pd.DataFrame(
        {"dtype": dtypes, "n_null": nulls, "n_valid": not_nulls, "unique": uniques}
    )
    summary2 = df.describe().T.drop("count", axis=1)
    # Skewness is only defined for numeric columns
    skews = pd.DataFrame({"skew": df.skew(numeric_only=True)})
    return pd.concat([summary1, summary2, skews], axis=1, sort=False).round(DIGITS)

In [9]:
# Summary statistics
summarise(reviews)

,dtype,n_null,n_valid,unique,mean,std,min,25%,50%,75%,max,skew
Unnamed: 0,int64,0,23486,23486,11742.5,6779.969,0.0,5871.25,11742.5,17613.75,23485.0,0.0
Clothing ID,int64,0,23486,1206,918.119,203.299,0.0,861.0,936.0,1078.0,1205.0,-2.088
Age,int64,0,23486,77,43.199,12.28,18.0,34.0,41.0,52.0,99.0,0.526
Title,object,3810,19676,13994,,,,,,,,
Review Text,object,845,22641,22635,,,,,,,,
Rating,int64,0,23486,5,4.196,1.11,1.0,4.0,5.0,5.0,5.0,-1.314
Recommended IND,int64,0,23486,2,0.822,0.382,0.0,1.0,1.0,1.0,1.0,-1.687
Positive Feedback Count,int64,0,23486,82,2.536,5.702,0.0,0.0,1.0,3.0,122.0,6.473
Division Name,object,14,23472,4,,,,,,,,
Department Name,object,14,23472,7,,,,,,,,


Our first inspection of the data reveals some formatting issues that we need to resolve during the initial ETL stage of our pipeline. Note that:

- `Unnamed: 0` is a unique integer identifier for the row.
- `Clothing ID` is a unique integer identifier for each product, ranging from 0 to 1205.
- `Age` is the age of each reviewer, which is integer-valued.
- `Title` and `Review Text` are free-form text entries, corresponding to the title and text of the review. As expected, titles are much shorter than reviews: the average title has around 3 words, whereas the average review has around 60 words. However, review text has a much larger standard deviation of about 28 words, with reviews ranging from a minimum of 2 words to a maximum of 115 words. This suggests that we should incorporate word counts into our feature matrix during feature engineering.
- `Rating` is integer-valued, ranging from 1 star to 5 stars.
- `Recommended IND` is binary, with only 2 unique values (0 or 1).

In [10]:
print("First column is identifier:", all(reviews['Unnamed: 0'] == range(len(reviews))))
clothing_id = pd.Series(reviews["Clothing ID"].unique()).sort_values()
print("Clothing ID ranges from 0 to 1205:", all(clothing_id == range(1206)))

First column is identifier: True
Clothing ID ranges from 0 to 1205: True
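
The word-count figures quoted for `Title` and `Review Text` do not appear in the summary table, which only describes numeric columns. A minimal sketch of how such statistics could be computed is given below. It runs on a tiny made-up frame rather than the real data, and `word_count_stats` is a hypothetical helper name.

```python
import pandas as pd


def word_count_stats(df, text_columns):
    """Describe per-row word counts for each text column, ignoring missing values."""
    counts = pd.DataFrame({
        # .str.split() keeps missing values missing; .str.len() counts the tokens
        col: df[col].str.split().str.len()
        for col in text_columns
    })
    # One row per text column: count, mean, std, min, quartiles, max
    return counts.describe().T


# Made-up example rows, for illustration only
toy = pd.DataFrame({
    "Title": ["My favorite buy!", None, "Flattering shirt"],
    "Review Text": ["I love this jumpsuit", "Love this dress", None],
})
stats = word_count_stats(toy, ["Title", "Review Text"])
```

Applied to the real data, the call would be `word_count_stats(reviews, ["Title", "Review Text"])`.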


Given our observations from the raw data above, we define a new function `preprocess_columns` that cleans the review data for feature extraction and model training. In this function, we:

- Rename the columns of the data frame to make them more Pythonic.
- Use the first column as the index of the data frame.
- Unify the case of the product category variables.
- Fix incorrect spelling in the `Division Name` variable.
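- Convert the product category variables to the `category` data type.

Excluding reviews that are missing *both* a title and body text is not done in this function: the summary further down still reports the missing titles and review texts, so that filter remains for the next step of data cleaning.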

In [11]:
def preprocess_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Preprocess the review data.

    Args:
        df: source data.

    Returns:
        Preprocessed data.
    """
    
    # Rename columns of data frame
    df = df.rename(
        columns = {
            'Unnamed: 0': 'id',
            'Clothing ID': 'product_id',
            'Age': 'author_age',
            'Title': 'review_title',
            'Review Text': 'review_text',
            'Rating': 'star_rating',
            'Recommended IND': 'recommend_flag',
            'Positive Feedback Count': 'upvotes',
            'Division Name': 'product_category_division',
            'Department Name': 'product_category_department',
            'Class Name': 'product_category_class'
        }
    )
    
    # Update review index
    assert df['id'].is_unique, 'Review identifier must be unique.'
    df = df.set_index('id')
    
    # Lower case of category hierarchy
    CATEGORIES = ['product_category_division', 'product_category_department', 'product_category_class']
    df[CATEGORIES] = df[CATEGORIES].apply(lambda x: x.str.lower(), axis = 0)
    
    # Replace incorrect spelling of 'intimates'
    df['product_category_division'] = df['product_category_division'].replace('initmates','intimates')
    
    # Change category variables to category type
    df[CATEGORIES] = df[CATEGORIES].astype("category")
    
    return df
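
To see the category clean-up in isolation, here is a small sketch on a made-up Series whose values mirror those visible in the first rows of the data. Lower-casing and fixing the spelling before the cast keeps the misspelt label from becoming a category of its own.

```python
import pandas as pd

# Made-up sample of division names, including the misspelling and a missing value
division = pd.Series(["Initmates", "General", "General Petite", None, "Initmates"])

cleaned = (
    division
    .str.lower()                        # unify case; missing values stay missing
    .replace("initmates", "intimates")  # fix the misspelt label
    .astype("category")                 # store as a categorical variable
)
categories = list(cleaned.cat.categories)
```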

After applying the preprocessing function, the table is now ready for the next step of data cleaning:

In [12]:
reviews_preproc = preprocess_columns(reviews)
summarise(reviews_preproc)

,dtype,n_null,n_valid,unique,mean,std,min,25%,50%,75%,max,skew
product_id,int64,0,23486,1206,918.119,203.299,0.0,861.0,936.0,1078.0,1205.0,-2.088
author_age,int64,0,23486,77,43.199,12.28,18.0,34.0,41.0,52.0,99.0,0.526
review_title,object,3810,19676,13994,,,,,,,,
review_text,object,845,22641,22635,,,,,,,,
star_rating,int64,0,23486,5,4.196,1.11,1.0,4.0,5.0,5.0,5.0,-1.314
recommend_flag,int64,0,23486,2,0.822,0.382,0.0,1.0,1.0,1.0,1.0,-1.687
upvotes,int64,0,23486,82,2.536,5.702,0.0,0.0,1.0,3.0,122.0,6.473
product_category_division,category,14,23472,4,,,,,,,,
product_category_department,category,14,23472,7,,,,,,,,
product_category_class,category,14,23472,21,,,,,,,,
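
The summary still shows missing titles and review texts, so no rows have been dropped yet. A minimal sketch of the remaining filter, excluding reviews that are missing *both* a title and body text, might look like the following. `drop_empty_reviews` is a hypothetical name, and the toy frame is made up for illustration.

```python
import pandas as pd


def drop_empty_reviews(df):
    """Drop rows where both review_title and review_text are missing."""
    # how="all" removes a row only if every column listed in `subset` is null
    return df.dropna(subset=["review_title", "review_text"], how="all")


# Made-up rows: complete, title missing, both missing
toy = pd.DataFrame({
    "review_title": ["Great", None, None],
    "review_text": ["Love it", "Runs small", None],
})
kept = drop_empty_reviews(toy)
```

Rows with only one of the two fields missing are kept, since the remaining field may still carry useful text.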
