In [36]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
 
# NLP
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import PorterStemmer

# Plotting
import plotly.express as px
import plotly.graph_objects as go
!pip install chart_studio
import chart_studio.plotly as py
from plotly.subplots import make_subplots
import cufflinks as cf
%matplotlib inline

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.go_offline()

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('max_colwidth', None)


# Import dataset
df = pd.read_csv('/kaggle/input/all-products-available-on-sephora-website/sephora_website_dataset.csv')



Manufacturers make all sorts of claims about product benefits in an attempt to convince customers to buy their brand.

This dataset made me curious about how skincare products position themselves: what are the most common product claims? And are specific claims associated with higher product ratings?

These are the main questions I will try to answer using this Sephora dataset, with the help of NLP techniques.

## Part 1: Popular Terms Used in Product Descriptions

The main variable analyzed here is 'details', which contains the product description: benefit claims, ingredient callouts, clinical results, and other relevant product information.

This variable is free text, so some cleaning and processing is necessary before any patterns can be observed.

In [37]:
df['details'][:2]

0    This enchanting set comes in a specially handcrafted blue box- and includes a selection of fragrances from the Blu Mediterraneo collection. A symbol of the Italian Mediterranean and the island of Capri- Arancia di Capri is sunny- relaxing- and carefree. In the air- hints of Italian citrus and the warm aroma of caramel blend together to create a pure moment of bliss- just like being on vacation.Soarkling- authentic bergamot shines at the onset of Bergamotto di Calabria. It is enhanced by the freshness of citron- red ginger- and cedarwood. At the base- an unprecedented combination of vetiver- benzoin- and musk emerges. The Amalfi Coast: it’s one of the most breathtaking places on Earth. Fico di Amalfi is a floral- woody- and citrusy fragrance that calls to mind this breathtaking stretch of Mediterranean coastline with a strong- energizing effect. Mirto di Panarea is characterized by the aromatic notes of myrtle and basil- it opens with lemon and bergamot. At the heart- a sea breeze 

In [38]:
# This analysis focuses on facial skin care products, so only the relevant categories are kept.
skin_care = ['Moisturizers', 'Face Serums', 'Face Wash & Cleansers', 'Face Masks', 'Eye Creams & Treatments',
             'Toners', 'Face Oils', 'Face Sunscreen', 'Sheet Masks', 'Facial Peels', 'Skincare', 'Exfoliators',
             'Face Sets', 'Anti-Aging', 'For Face']

df_skin = df[df['category'].isin(skin_care)].reset_index()
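
Because `isin` silently ignores labels that match nothing, it can be worth checking the category list against the labels actually present in the data. The following is a minimal sketch using plain Python sets. The `available` and `wanted` values are made up for illustration; in the notebook, `available` would come from something like `set(df['category'].unique())`.

```python
# Hypothetical category labels standing in for set(df['category'].unique())
available = {'Moisturizers', 'Face Serums', 'Toners', 'Perfume', 'Lipstick'}

# Categories we intend to keep (shortened, illustrative list)
wanted = ['Moisturizers', 'Face Serums', 'Toners', 'Face Oilz']  # deliberate typo

# Labels that were requested but never occur in the data:
# usually a typo or two strings merged by a missing comma
unmatched = sorted(set(wanted) - available)
print(unmatched)
```

An empty result would suggest that every requested category matches at least one product.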

In [39]:
# Function to clean and tokenize the text data:
def clean_and_tokenize(text):
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and other non-alphanumeric characters (including newlines) with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # Remove section titles that recur in the descriptions
    titles = '|'.join(['what it is', 'skin type', 'skincare concerns', 'formulation', 'highlighted ingredients',
                       'ingredient callouts', 'what else you need to know', 'clinical results'])
    text = re.sub(titles, "", text)
    # Tokenize into words
    tokens = nltk.word_tokenize(text)
    # Remove stopwords (build the set once instead of reloading the list for every token)
    stop_words = set(stopwords.words("english"))
    words = [x for x in tokens if x not in stop_words]
    # Remove stray 'n' tokens (leftovers of '\n' escapes once the backslash is stripped)
    words = [x for x in words if x != 'n']
    # Lemmatize, but do not lemmatize 'sls' and 'sles' terms
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(word) if word not in ['sls', 'sles'] else word for word in words]
    return lemmatized
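
To see what the first cleaning steps do, here is a small walk-through on a single made-up description. It is only a sketch: the sample text is hypothetical, just two of the section titles are used, and `str.split()` stands in for `nltk.word_tokenize` so the snippet runs without any NLTK data.

```python
import re

# Hypothetical description, made up for illustration
sample = "What it is: A lightweight moisturizer.\nSkin Type: Normal and Dry"

text = sample.lower()
# Every character that is not a letter or digit becomes a space
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
# Section titles are removed as plain substrings
titles = "|".join(["what it is", "skin type"])
text = re.sub(titles, "", text)

# str.split() stands in for nltk.word_tokenize in this sketch
tokens = text.split()
print(tokens)
```

In the full function, the stopword filter would then drop tokens such as "a" and "and" before lemmatization. Note that the titles are matched as plain substrings, so a title such as 'formulation' is also removed when it occurs inside a longer word.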

In [40]:
# Create the list of clean words that will be used for analysis
words = clean_and_tokenize(' '.join(df_skin['details'].dropna().astype(str)))

In [41]:
# Dataframe of the most frequent words in the descriptions
df_words = pd.Series(words).value_counts().rename_axis('word').reset_index(name='counts')

# Keep only the top 10
df_words_10 = df_words[:10]

In [42]:
# Plot the top words
fig = px.bar(x = df_words_10.word,
             y = df_words_10.counts,
             labels = {
                 'x' : 'Words',
                 'y' : 'Counts'
             },
             title = 'Top words appearing in product descriptions',
             template = 'simple_white'
)
fig.show()

Unsurprisingly, the most frequent word in the descriptions is "skin". Beyond that, ingredient-related terms appear frequently: "free" and "without" most probably refer to products being free from unwanted ingredients, alongside "ingredient" and "formulation". "Parabens" also appears in the top 10, a well-known unwanted ingredient that skincare products are keen to declare they do not contain.

Looking at n-grams, or sequences of words that appear consecutively, gives more context about these terms.
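
To make the idea concrete, here is a minimal sketch of n-gram counting using only the standard library. The `ngrams` helper and the token list are hypothetical stand-ins for `nltk.ngrams` and the cleaned `words` list used in this notebook.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical token list, made up for illustration
tokens = ['fine', 'line', 'wrinkle', 'fine', 'line',
          'formulated', 'without', 'paraben']

# Count how often each pair of neighbouring tokens occurs
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))
```

The same helper with `n=3` yields trigrams. Since the tokens come from all descriptions joined together, some n-grams will span the boundary between two products.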

In [43]:
# Dataframe for bigrams
df_bigrams = pd.Series(nltk.ngrams(words, 2)).value_counts().rename_axis('bigrams').reset_index(name='counts')
df_bigrams['bigrams'] = df_bigrams['bigrams'].astype(str)

# Keep only the top 10
df_bigrams_10 = df_bigrams[:10]

In [44]:
# Plot the top bigrams
fig = px.bar(x = df_bigrams_10.bigrams,
             y = df_bigrams_10.counts,
             labels = {
                 'x' : 'Bigrams',
                 'y' : 'Counts'
             },
             title = 'Top bigrams appearing in product descriptions',
             template = 'simple_white')
fig.show()

The most frequent pair of words is "fine" + "line". Together with "line" + "wrinkle", this says a lot about how heavily products promote their anti-ageing benefits.

Ingredient claims still dominate, with "formulated" + "without" in second place, and specific unwanted additives such as sls - sles, sulfate - sls, and sles - parabens also appearing in the top 10.

Next, we look at trigrams to widen the context further.

In [45]:
# Dataframe for trigrams
df_trigrams = pd.Series(nltk.ngrams(words, 3)).value_counts().rename_axis('trigrams').reset_index(name='counts')
df_trigrams['trigrams'] = df_trigrams['trigrams'].astype(str)

# Keep only the top 10
df_trigrams_10 = df_trigrams[:10]

In [46]:
# Plot the top trigrams
fig = px.bar(x = df_trigrams_10.trigrams,
             y = df_trigrams_10.counts,
             labels = {
                 'x' : 'Trigrams',
                 'y' : 'Counts'
             },
             title = 'Top trigrams appearing in product descriptions',
             template = 'simple_white')
fig.show()

Again, ingredient claims, anti-ageing benefits, and skin-type formulations appear among the top trigrams.

Next step: we will look at whether, and how, these claims affect product ratings.