First, we import the libraries needed for data handling and natural language processing.

In [1]:
# data handling
import pandas as pd
import numpy as np

# natural language processing
import nltk # https://www.nltk.org/
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
import spacy # https://spacy.io/
import re # regular expressions
import string # string.punctuation, used to remove punctuation

The dataset contains more than 23,000 entries with various pieces of information.

In [15]:
# read data
df_reviews = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
print(len(df_reviews))
df_reviews.head()

23486


Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


We want to analyze a product with as many reviews as possible.

In [3]:
df_counts = pd.DataFrame(df_reviews.groupby('Clothing ID')['Review Text'].nunique())
df_counts.sort_values(by='Review Text', ascending=False)

Clothing ID,Review Text
1078,987
862,778
1094,735
1081,560
872,519
...,...
77,0
64,0
54,0
1164,0
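Note that `nunique()` counts distinct, non-missing review texts per product, which is why IDs whose reviews have no text end up with 0. The following is a minimal sketch of the same counting logic in plain Python; the rows are made-up toy data, not taken from the dataset.

```python
from collections import defaultdict

# Toy (clothing_id, review_text) pairs; None marks a missing review text.
rows = [
    (1078, "Love this dress"),
    (1078, "Runs a bit large"),
    (1078, "Love this dress"),   # duplicate text, counted only once
    (862, "Nice top"),
    (77, None),                  # no text, contributes nothing
]

distinct_texts = defaultdict(set)
for clothing_id, text in rows:
    distinct_texts[clothing_id]  # touch the key so IDs without text still appear
    if text is not None:
        distinct_texts[clothing_id].add(text)

# Number of distinct texts per product, and the product with the most of them
counts = {cid: len(texts) for cid, texts in distinct_texts.items()}
top_id = max(counts, key=counts.get)
print(counts, top_id)
```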


We select the product with "Clothing ID" 1078, which is a dress. From here on we use only the review text and the rating and filter the dataset accordingly. We also remove the entries that contain no review text.

In [16]:
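df_product = df_reviews[df_reviews['Clothing ID']==1078]
df_product = df_product[['Review Text','Rating']].copy()  # explicit copy avoids SettingWithCopyWarning

# treat empty strings as missing, then drop rows without review text
df_product['Review Text'] = df_product['Review Text'].replace('', np.nan)
df_product = df_product.dropna(subset=['Review Text'])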

df_product.head()

Unnamed: 0,Review Text,Rating
69,"I really wanted this to work. alas, it had a s...",3
90,"I love cute summer dresses and this one, espec...",4
117,This is the perfect summer dress. it can be dr...,5
467,"Nice fit and flare style, not clingy at all. i...",5
470,When i first opened this dress and tried it on...,3


We load the English stop words and remove the typical negations "no" and "not" from the list.

In [6]:
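# load the English spaCy model and the NLTK stop words
# note: the 'en' shortcut only works in spaCy 2.x; spaCy 3.x requires the full model name, e.g. 'en_core_web_sm'
nlp = spacy.load('en')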
nltk.download('stopwords')

tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\w31bmaso\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
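Taking the negations out of the stop-word list means they survive any filtering that uses this list, so a phrase like "not flattering" keeps its meaning. A minimal sketch of that effect, assuming a hypothetical six-word stop-word list instead of NLTK's much longer English list:

```python
# Hypothetical miniature stop-word list; NLTK's English list is much longer.
toy_stopwords = ["the", "is", "a", "no", "not", "this"]

# Keep the negations, mirroring the two remove() calls in the cell above.
for negation in ("no", "not"):
    toy_stopwords.remove(negation)

tokens = ["this", "dress", "is", "not", "flattering"]
kept = [t for t in tokens if t not in toy_stopwords]
print(kept)  # ['dress', 'not', 'flattering']
```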


The original text is preprocessed. In particular, irrelevant characters are removed and the words are lemmatized.

In [7]:
# function to remove non-ascii characters
def _removeNonAscii(s):
    return "".join(ch for ch in s if ord(ch) < 128)

# get stop words of all languages
STOPWORDS_DICT = {lang: set(nltk.corpus.stopwords.words(lang)) for lang in nltk.corpus.stopwords.fileids()}

# function to check whether a text is English, based on which language's stop words it shares most
def get_language(text):
    words = set(nltk.wordpunct_tokenize(text.lower()))
    lang = max(STOPWORDS_DICT, key=lambda name: len(words & STOPWORDS_DICT[name]))
    return lang == 'english'

# function to clean and lemmatize comments
def clean_comments(text):
    # remove punctuations
    regex = re.compile('[' + re.escape(string.punctuation) + '\\r\\t\\n]')
    nopunct = regex.sub(" ", str(text))
    
    # use spacy to lemmatize comments
    doc = nlp(nopunct, disable=['parser','ner'])
    lemma = [token.lemma_ for token in doc]
    return lemma

def clean_and_lemmatize(reviews):

    reviews = reviews.astype('str')

    #remove non-ascii characters
    reviews = reviews.map(_removeNonAscii)
    
    #filter for only english comments
    eng_reviews = reviews[reviews.apply(get_language)]
    
    #drop duplicates (reassign instead of inplace=True, since eng_reviews is a filtered selection of reviews)
    eng_reviews = eng_reviews.drop_duplicates()
    
    #apply function to clean and lemmatize comments
    lemmatized = eng_reviews.map(clean_comments)
    
    #make sure to lowercase everything
    lemmatized = lemmatized.map(lambda x: [word.lower() for word in x])
    
    return lemmatized
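The language check in get_language() relies on a simple heuristic: the language whose stop words overlap most with the words of a text wins. Here is a minimal sketch of that idea, assuming two tiny hand-written stop-word sets (`TOY_STOPWORDS` and `guess_language` are hypothetical names; the notebook uses NLTK's full lists for all available languages).

```python
import re

# Hypothetical, tiny stop-word sets; the notebook uses NLTK's full lists instead.
TOY_STOPWORDS = {
    "english": {"the", "is", "and", "this", "it"},
    "german": {"der", "die", "das", "ist", "und"},
}

def guess_language(text):
    """Return the language whose stop-word set overlaps most with the text."""
    words = set(re.findall(r"\w+", text.lower()))
    return max(TOY_STOPWORDS, key=lambda lang: len(words & TOY_STOPWORDS[lang]))

print(guess_language("This dress is lovely and it fits"))
print(guess_language("Das Kleid ist schön und passt"))
```

One caveat of this heuristic: a text that contains no stop words at all produces a tie at zero, and `max()` then simply returns the first language in the dictionary.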

The function clean_and_lemmatize() is called below from within the function most_frequent_bi_and_tri_grams(). To show what the original reviews look like after lemmatization, we first look at this intermediate step.

In [8]:
df_product['Review Text Clean'] = clean_and_lemmatize(df_product['Review Text'])
df_product.head()

Unnamed: 0,Review Text,Rating,Review Text Clean
69,"I really wanted this to work. alas, it had a s...",3,"[-pron-, really, want, this, to, work, , alas..."
90,"I love cute summer dresses and this one, espec...",4,"[-pron-, love, cute, summer, dress, and, this,..."
117,This is the perfect summer dress. it can be dr...,5,"[this, be, the, perfect, summer, dress, , -pr..."
467,"Nice fit and flare style, not clingy at all. i...",5,"[nice, fit, and, flare, style, , not, clingy,..."
470,When i first opened this dress and tried it on...,3,"[when, i, first, open, this, dress, and, try, ..."
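The blank entries in the "Review Text Clean" column most likely come from the punctuation step: every punctuation character is replaced by a space, so ". " or ", " turns into a double space. A small sketch using the same regular expression as clean_comments(), applied to a made-up sentence:

```python
import re
import string

# Same pattern as in clean_comments(): punctuation, \r, \t and \n become spaces.
pattern = re.compile('[' + re.escape(string.punctuation) + '\\r\\t\\n]')

text = "I really wanted this to work. alas, it didn't"
no_punct = pattern.sub(" ", text)

print(repr(no_punct))       # double spaces where punctuation was followed by a space
print(no_punct.split(" "))  # empty strings mark the doubled spaces
```

The same substitution splits "didn't" into "didn" and "t", which leaves stray one-letter tokens in the lemmatized lists.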


In the next step, the bi- and trigrams are extracted from the reviews. The choice of which word combinations to take into account is decisive. For the bigrams, after several attempts, the combination suggested in the original post (adjective + noun and noun + noun) appears to produce the best results.

In [9]:
def most_frequent_bi_and_tri_grams(reviews):
    
    lemmatized = clean_and_lemmatize(reviews)
    
    #turn all comments' tokens into one single list
    unlist_reviews = [item for items in lemmatized for item in items]
    
    bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(unlist_reviews)
    trigramFinder = nltk.collocations.TrigramCollocationFinder.from_words(unlist_reviews)
    
    #bigrams
    bigram_freq = bigramFinder.ngram_fd.items()
    bigramFreqTable = pd.DataFrame(list(bigram_freq), columns=['bigram','freq']).sort_values(by='freq', ascending=False)
    #trigrams
    trigram_freq = trigramFinder.ngram_fd.items()
    trigramFreqTable = pd.DataFrame(list(trigram_freq), columns=['trigram','freq']).sort_values(by='freq', ascending=False)
    
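    #get english stopwords
    #note: this is NLTK's full list, so it still contains 'no' and 'not'; stopword_list from above is not used here
    en_stopwords = set(stopwords.words('english'))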
    
    #function to filter for ADJ/NN bigrams
    def rightTypes(ngram):
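        # drop spaCy's pronoun placeholder and stray 't' tokens, presumably left over from contractions such as "don't"
        if '-pron-' in ngram or 't' in ngram: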
            return False
        for word in ngram:
            if word in en_stopwords or word.isspace():
                return False
        acceptable_types = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS') # ORIGINAL
        # acceptable_types = ('NN', 'NNS', 'NNP', 'NNPS')
        # acceptable_types = ('JJ', 'JJR', 'JJS')
        # acceptable_types = ('JJ', 'JJR', 'JJS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ')
        second_type = ('NN', 'NNS', 'NNP', 'NNPS') # ORIGINAL
        # second_type = ('JJ', 'JJR', 'JJS')
        # second_type = ('JJ', 'JJR', 'JJS','NN', 'NNS', 'NNP', 'NNPS', 'RB', 'RBR', 'RBS')
        # second_type = ('JJ', 'JJR', 'JJS','NN', 'NNS', 'NNP', 'NNPS')
        tags = nltk.pos_tag(ngram)
        return tags[0][1] in acceptable_types and tags[1][1] in second_type
    
    #filter bigrams
    filtered_bi = bigramFreqTable[bigramFreqTable.bigram.map(rightTypes)]
    
    #function to filter for trigrams
    def rightTypesTri(ngram):
        if '-pron-' in ngram or 't' in ngram:
            return False
        for word in ngram:
            if word in en_stopwords or word.isspace():
                return False
        first_type = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS') # ORIGINAL
        # first_type = ('JJ', 'JJR', 'JJS')
        third_type = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS') # ORIGINAL
        # third_type = ('JJ', 'JJR', 'JJS')
        tags = nltk.pos_tag(ngram)
        return tags[0][1] in first_type and tags[2][1] in third_type
    
    #filter trigrams
    filtered_tri = trigramFreqTable[trigramFreqTable.trigram.map(rightTypesTri)]
    
    filtered_bi = filtered_bi.reset_index()
    filtered_tri = filtered_tri.reset_index()
    
    return filtered_bi, filtered_tri
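The core of this function is a sliding window over one flat token list, followed by a filter on part-of-speech patterns. The following plain-Python sketch illustrates both steps for bigrams; the token stream and the `TOY_TAGS` lookup are hypothetical stand-ins for the lemmatized reviews and for nltk.pos_tag(), and the tag sets are shortened.

```python
from collections import Counter

# Toy token stream standing in for the flattened, lemmatized reviews.
tokens = ["perfect", "summer", "dress", "fit", "well",
          "perfect", "summer", "dress", "look", "great"]

# Hypothetical hand-written tags; the notebook gets them from nltk.pos_tag().
TOY_TAGS = {"perfect": "JJ", "summer": "NN", "dress": "NN",
            "fit": "VB", "well": "RB", "look": "VB", "great": "JJ"}

FIRST = ("JJ", "NN")   # adjective or noun in first position
SECOND = ("NN",)       # noun in second position

# Count every pair of neighbouring tokens
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Keep only the bigrams that match the ADJ/NN + NN pattern, most frequent first
filtered = {
    bigram: freq
    for bigram, freq in bigram_counts.most_common()
    if TOY_TAGS[bigram[0]] in FIRST and TOY_TAGS[bigram[1]] in SECOND
}
print(filtered)
```

Because the window runs over a single flattened list, a pair such as ("well", "perfect") is also counted even though the two words could belong to different reviews. The same holds for the flattened list built in most_frequent_bi_and_tri_grams().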

As an example, we look at the most frequent bi- and trigrams of the positive reviews, i.e. those with a rating of 4 or 5.

In [10]:
df_product_positive = df_product[df_product['Rating']>=4]
filtered_bi, filtered_tri = most_frequent_bi_and_tri_grams(df_product_positive['Review Text'])
filtered_bi.head(10)

Unnamed: 0,index,bigram,freq
0,445,"(dress, fit)",22
1,6507,"(beautiful, dress)",21
2,1486,"(many, compliment)",21
3,165,"(fit, perfect)",21
4,53,"(fit, well)",20
5,1271,"(great, dress)",20
6,3,"(summer, dress)",18
7,1344,"(retailer, dress)",14
8,3233,"(dress, look)",14
9,1389,"(regular, size)",12


In [11]:
filtered_tri.head(10)

Unnamed: 0,index,trigram,freq
0,4868,"(dress, look, great)",4
1,93,"(perfect, summer, dress)",4
2,24677,"(dress, fit, true)",3
3,16030,"(look, super, cute)",3
4,10705,"(local, retailer, store)",3
5,13051,"(dress, run, large)",3
6,208,"(compliment, every, time)",3
7,5994,"(small, fit, well)",2
8,6001,"(great, price, point)",2
9,11589,"(hot, summer, day)",2


Looking at the most frequent bi- and trigrams shows that reviewers mention the fit ("fit") above all, but also the look ("look", "beautiful") and the suitability for summer days ("summer") several times.