## Sentiment Analysis of [Clothing Reviews](https://www.kaggle.com/timoboz/womens-ecommerce-clothing-reviews)
[Tutorial reference](https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/)

## SETUP

In [63]:
import spacy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer, TransformerMixin, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

In [9]:
df_en = pd.read_csv('kleidung_en.csv')

In [10]:
df_en.head()

Unnamed: 0.1,Unnamed: 0,Review Text,Recommended IND
0,0,Absolutely wonderful - silky and sexy and comf...,1
1,1,Love this dress! it's sooo pretty. i happene...,1
2,2,I had such high hopes for this dress and reall...,0
3,3,"I love, love, love this jumpsuit. it's fun, fl...",1
4,4,This shirt is very flattering to all due to th...,1


In [11]:
df_en.dtypes

Unnamed: 0          int64
Review Text        object
Recommended IND     int64
dtype: object

In [12]:
df_en['Review Text'] = df_en['Review Text'].astype(str)


In [14]:
df_en = df_en.drop(['Unnamed: 0'], axis=1)
df_en.head()

Unnamed: 0,Review Text,Recommended IND
0,Absolutely wonderful - silky and sexy and comf...,1
1,Love this dress! it's sooo pretty. i happene...,1
2,I had such high hopes for this dress and reall...,0
3,"I love, love, love this jumpsuit. it's fun, fl...",1
4,This shirt is very flattering to all due to th...,1


In [16]:
# check missing values
df_en.isna().sum()

Review Text        0
Recommended IND    0
dtype: int64

In [20]:
#!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz (12.0 MB)
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.3.1-py3-none-any.whl size=12047105 sha256=becbc867e6d600c3a3c9c8ffc510fad52403c20e3ab623f75c89571162895d9d
  Stored in directory: /tmp/pip-ephem-wheel-cache-3mvii00x/wheels/ee/4d/f7/563214122be1540b5f9197b52cb3ddb9c4a8070808b22d5a84
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.3.1
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [21]:
nlp = en_core_web_sm.load()
#nlp = spacy.load('en')

#build a list of stop words for filtering
stopwords = list(STOP_WORDS)
print(stopwords)

['thereafter', 'neither', 'what', 'put', 'while', 'herself', '’re', 'than', 'indeed', 'across', 'each', 'these', 'except', 'full', 'more', 'now', 'were', 'being', 'below', 'take', 'made', 'well', 'anyhow', 'go', 'via', 'latter', 'after', 'otherwise', 'whatever', 'in', 'nowhere', 'out', 'formerly', 'might', 'who', 'became', 'never', 'around', 'amongst', 'hence', 'give', "'re", 'none', '’d', 'therein', 'call', 'once', 'hereby', "'m", 'those', 'former', 'was', 'less', "'ll", 'get', 'on', 'anywhere', 'else', 'why', 'every', '‘m', 'its', 'him', 'to', 'during', 'enough', 'per', 'still', 'side', 'that', 'had', 'cannot', '‘ve', 'n‘t', 'eleven', 'his', 'either', 'thereupon', 'together', 'four', 'hers', 'where', 'are', 'with', 'everything', 'few', 'beside', 'show', 'one', 'about', 'really', 'does', 'due', 'nevertheless', 'against', 'both', 'for', 'becoming', 'say', 'name', 'many', 'ours', 'everyone', 'nothing', 'sometimes', 'has', 'though', 'only', 'but', 'something', 'whoever', 'yourselves', 'v

## PREPROCESSING

In [22]:
import string
punctuations = string.punctuation

#create a Spacy Parser
from spacy.lang.en import English
parser = English()

In [42]:
#create a tokenizer function
def spacy_tokenizer(sentence):
    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = parser(sentence)

    # Lemmatizing each token and converting each token into lowercase
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens]

    # Removing stop words
    mytokens = [word for word in mytokens if word not in stopwords and word not in punctuations]

    # return preprocessed list of tokens
    return mytokens
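
The filtering steps in `spacy_tokenizer` can be tried without a spaCy install. The sketch below is a simplified stand-in, not the notebook's tokenizer: it assumes a regex split instead of the spaCy parser, skips lemmatization, and uses a tiny hypothetical stop-word set (`DEMO_STOP_WORDS`) in place of spaCy's `STOP_WORDS`.

```python
import re
import string

# Tiny illustrative stop-word set (assumption); the notebook uses spaCy's full STOP_WORDS list.
DEMO_STOP_WORDS = {"this", "is", "and", "so", "it", "the", "a"}

def simple_tokenizer(sentence):
    # Stand-in for the spaCy parser: lowercase, then split into word runs
    # and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    # Same filter as in spacy_tokenizer: drop stop words and punctuation.
    return [t for t in tokens if t not in DEMO_STOP_WORDS and t not in string.punctuation]

print(simple_tokenizer("This skirt is so flattering and comfortable!"))
# → ['skirt', 'flattering', 'comfortable']
```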

In [43]:
# Custom transformer using spaCy

class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()

In [44]:
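# Bag-of-words vectorizer: unigram counts over the spaCy tokens
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))

# TF-IDF vectorizer over the same tokens (defined here, but not used in the pipelines below)
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)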
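
To make the difference between the two vectorizers concrete, here is a minimal pure-Python sketch on three made-up reviews. It assumes whitespace tokenization and computes unnormalized weights with the smoothed IDF formula that scikit-learn documents for `TfidfVectorizer` with its default `smooth_idf=True`; the real vectorizer also L2-normalizes each row by default.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, not taken from the dataset.
docs = [
    "love this dress",
    "love love this jumpsuit",
    "dress runs small",
]
tokenized = [doc.split() for doc in docs]

# Bag of words: raw term counts per document.
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in doc_freq.items()}

# Unnormalized TF-IDF weights for the second document.
tfidf_doc2 = {term: tf * idf[term] for term, tf in bow[1].items()}

print(dict(bow[1]))
print({term: round(weight, 3) for term, weight in tfidf_doc2.items()})
```

Terms that occur in fewer documents ("jumpsuit") receive a higher IDF than terms spread across the corpus ("love"), whereas the bag-of-words counts treat every occurrence equally.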

## CLASSIFICATION

In [45]:
from sklearn.model_selection import train_test_split

X = df_en['Review Text'] # the features we want to analyze
y = df_en['Recommended IND'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Logistic Regression

In [49]:
# Logistic Regression Classifier

lr = LogisticRegression()

# Create pipeline using Bag of Words
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', lr)])

# model generation
pipe.fit(X_train, y_train)

Pipeline(steps=[('cleaner', <__main__.predictors object at 0x7fa7aed5a5e0>),
                ('vectorizer',
                 CountVectorizer(tokenizer=<function spacy_tokenizer at 0x7fa7af987af0>)),
                ('classifier', LogisticRegression())])

In [72]:
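# pred_lr is computed in cell In [68]. Caution: pd.DataFrame(pred_lr) has a fresh 0..n-1 index,
# so this assignment pairs test-set predictions with the first rows of df_en rather than the
# rows in X_test; pd.Series(pred_lr, index=X_test.index) would align them correctly.
kleidung_pred_lr = pd.DataFrame(pred_lr)
df_en['pred_lr'] = kleidung_pred_lr
df_en.head()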

Unnamed: 0,Review Text,Recommended IND,pred_lsvc,pred_lr
0,Absolutely wonderful - silky and sexy and comf...,1,1.0,1.0
1,Love this dress! it's sooo pretty. i happene...,1,1.0,1.0
2,I had such high hopes for this dress and reall...,0,1.0,1.0
3,"I love, love, love this jumpsuit. it's fun, fl...",1,0.0,0.0
4,This shirt is very flattering to all due to th...,1,1.0,1.0
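
pandas aligns a column assignment on the index, which matters when predictions for a shuffled test split are written back into the full frame. The sketch below uses a hypothetical five-row frame and made-up predictions to show the difference between a fresh index and the test split's index.

```python
import pandas as pd

# Hypothetical five-row frame standing in for df_en.
df = pd.DataFrame({"text": ["a", "b", "c", "d", "e"]})

# Suppose rows 3 and 1 ended up in the test split, with these predictions.
test_index = [3, 1]
pred = [0, 1]

# A fresh Series is indexed 0, 1, so the values land on rows 0 and 1.
df["pred_misaligned"] = pd.Series(pred)

# Reusing the test index puts each prediction on the row it belongs to.
df["pred_aligned"] = pd.Series(pred, index=test_index)

print(df)
```

Rows without a prediction are filled with NaN, which is also why the prediction columns show up as floats.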


In [68]:
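pred_lr = pipe.predict(X_test)

# Model accuracy
print("Logistic Regression train score:", pipe.score(X_train, y_train))
# Caution: scoring against pred_lr compares the predictions with themselves and is always 1.0;
# pass y_test instead to measure real test accuracy.
print("Logistic Regression test score:", pipe.score(X_test, pred_lr))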

Logistic Regression train score: 0.9565085158150851
Logistic Regression test score: 1.0
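
A test score of exactly 1.0 is what any classifier produces when its predictions are scored against themselves. The following sketch, with a hand-written `accuracy` helper and made-up labels, shows the difference between that and a comparison with the true labels.

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the two label sequences agree.
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Hypothetical true labels and predictions for five test reviews.
y_test = [1, 0, 1, 1, 0]
pred = [1, 1, 1, 0, 0]

print(accuracy(pred, pred))    # predictions vs. themselves → 1.0
print(accuracy(y_test, pred))  # predictions vs. true labels → 0.6
```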


## Linear SVC

In [55]:
# Vectorization
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1)) 
lsvc = LinearSVC()

In [56]:
pipe_countvect = Pipeline([("cleaner", predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', lsvc)])

pipe_countvect.fit(X_train,y_train)

pred_lsvc = pipe_countvect.predict(X_test)

for (sample, pred) in zip(X_test, pred_lsvc):
    print(sample,"Prediction=>", pred)

ty sizing. but i love them super cute and comfortable. i wish they had more colors available. these were a great buy! Prediction=> 1
When i tried these on in the store in my size, but thought they felt tight, but the person helping me pointed out that they hug you, but are not uncomfortable. i uncuff the bottom and wear them full-length. ended up buying one of each color. Prediction=> 1
I was so ready to love this dress but it runs extremely small - like you should maybe purchase a size 2 sizes bigger than what you normally wear. Prediction=> 1
This shirt was very cute. i ordered it as something to be able to wear to class with leggings and it is perfect. the colors are great and the length is very nice. it will be a nice piece to layer with a jacket for the the fall. Prediction=> 1
This skirt is so flattering and comfortable! Prediction=> 1
I love these pants so much and have worn them so many times in a week! they are comfy and causal, but still chic and presentable. these pants run 

In [59]:
#create a dataframe with the predictions

kleidung_pred_lsvc = pd.DataFrame(pred_lsvc)
kleidung_pred_lsvc.head()

Unnamed: 0,0
0,1
1,1
2,1
3,0
4,1


In [73]:
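# Caution: kleidung_pred_lsvc has a fresh 0..n-1 index, so predictions land on the first rows of df_en;
# pd.Series(pred_lsvc, index=X_test.index) would align them with the test rows.
df_en['pred_lsvc'] = kleidung_pred_lsvc
df_en.head()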

Unnamed: 0,Review Text,Recommended IND,pred_lsvc,pred_lr
0,Absolutely wonderful - silky and sexy and comf...,1,1.0,1.0
1,Love this dress! it's sooo pretty. i happene...,1,1.0,1.0
2,I had such high hopes for this dress and reall...,0,1.0,1.0
3,"I love, love, love this jumpsuit. it's fun, fl...",1,0.0,0.0
4,This shirt is very flattering to all due to th...,1,1.0,1.0


In [61]:
# Model evaluation

print("SVC train accuracy:", pipe_countvect.score(X_train, y_train))
print("SVC test accuracy:", pipe_countvect.score(X_test, y_test))
# Sanity check only: predictions scored against themselves always give 1.0
print("SVC prediction accuracy:", pipe_countvect.score(X_test, pred_lsvc))

SVC train accuracy: 0.9856447688564477
SVC test accuracy: 0.8646040306556911
SVC prediction accuracy: 1.0


## Analysis of German reviews

In [74]:
kleidung_de = pd.read_csv('kleidung_de.csv')
kleidung_de.drop('Unnamed: 0', axis=1, inplace=True)
kleidung_de.head()

Unnamed: 0,Review Text DE,Recommended IND
0,Absolut wundervoll - seidig und sexy und bequem,1
1,Liebe dieses Kleid! es ist sooo hübsch. Ich fa...,1
2,Ich hatte so große Hoffnungen auf dieses Kleid...,0
3,"Ich liebe, liebe, liebe diesen Overall. Es mac...",1
4,Dieses Shirt ist aufgrund der verstellbaren Fr...,1


In [75]:
kleidung_de['Review Text DE'] = kleidung_de['Review Text DE'].astype(str)

In [76]:
#!python -m spacy download de_core_news_sm

Collecting de_core_news_sm==2.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.3.0/de_core_news_sm-2.3.0.tar.gz (14.9 MB)
Building wheels for collected packages: de-core-news-sm
  Building wheel for de-core-news-sm (setup.py) ... done
  Created wheel for de-core-news-sm: filename=de_core_news_sm-2.3.0-py3-none-any.whl size=14907580 sha256=9485a0933dba309cc19fc22ca2a2a515dea34dc095b79ddd907031339f362720
  Stored in directory: /tmp/pip-ephem-wheel-cache-0i7wxr6k/wheels/5d/ea/e9/0d432e5114b7cba534bb0742b7de51e03db174dfb1e3dda87c
Successfully built de-core-news-sm
Installing collected packages: de-core-news-sm
Successfully installed de-core-news-sm-2.3.0
✔ Download and installation successful
You can now load the model via spacy.load('de_core_news_sm')


In [77]:
import spacy
import de_core_news_sm
from spacy.lang.de.stop_words import STOP_WORDS  # note: rebinds the English STOP_WORDS name
nlp = de_core_news_sm.load()
#nlp = spacy.load('de_core_news_sm')

# To build a list of stop words for filtering
stopwords_de = list(STOP_WORDS)
print(stopwords_de)

['gedurft', 'würden', 'alles', 'dürfen', 'demgegenüber', 'gekannt', 'dieselben', 'allein', 'damals', 'werden', 'dann', 'nun', 'dir', 'bei', 'darum', 'wie', 'solchen', 'bereits', 'seinem', 'eines', 'möglich', 'schon', 'uhr', 'dahin', 'trotzdem', 'neue', 'einer', 'allgemeinen', 'kleinen', 'fünfte', 'ende', 'dieser', 'davon', 'sie', 'du', 'dein', 'demgemäß', 'drittes', 'ach', 'demzufolge', 'mein', 'habt', 'anderem', 'in', 'einander', 'überhaupt', 'würde', 'dasein', 'möchte', 'tagen', 'während', 'kurz', 'sechsten', 'sehr', 'vielen', 'große', 'hin', 'welcher', 'infolgedessen', 'übrigens', 'wer', 'grosse', 'wenn', 'sondern', 'ab', 'da', 'dazwischen', 'los', 'grosses', 'ein', 'hat', 'zehnte', 'er', 'denselben', 'ohne', 'weniges', 'wirst', 'indem', 'zum', 'siebenten', 'ihr', 'was', 'dies', 'gute', 'einen', 'bin', 'ganzen', 'als', 'gegenüber', 'hatten', 'sein', 'daß', 'zehntes', 'zur', 'zu', 'schlecht', 'etwa', 'zugleich', 'ausser', 'haben', 'den', 'magst', 'konnte', 'solchem', 'jene', 'ausserd

In [78]:
import string
punctuations = string.punctuation

from spacy.lang.de import German
parser_de = German()

In [79]:
# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = parser_de(sentence)

    # Lemmatizing each token and converting each token into lowercase
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens]

    # Removing stop words
    mytokens = [word for word in mytokens if word not in stopwords_de and word not in punctuations]

    # return preprocessed list of tokens
    return mytokens

In [80]:

class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()

In [81]:
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)

In [83]:
from sklearn.model_selection import train_test_split

X_de = kleidung_de['Review Text DE']
y_de = kleidung_de['Recommended IND'] 

X_train_de, X_test_de, y_train_de, y_test_de = train_test_split(X_de, y_de, test_size=0.3, random_state=42)

In [84]:
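# Logistic Regression Classifier
# Caution: `pipe` was built in cell In [49] and still holds the vectorizer bound to the English
# spacy_tokenizer (same function address in the repr below); the bow_vector from cell In [81] is not used here.

pipe.fit(X_train_de, y_train_de)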

Pipeline(steps=[('cleaner', <__main__.predictors object at 0x7fa7aed5a5e0>),
                ('vectorizer',
                 CountVectorizer(tokenizer=<function spacy_tokenizer at 0x7fa7af987af0>)),
                ('classifier', LogisticRegression())])
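
The function address in the repr above matches the one printed for the English pipeline, which suggests the vectorizer inside `pipe` still calls the first `spacy_tokenizer`. Redefining a function under the same name does not change objects that already hold a reference to the old one. The sketch below illustrates this with a hypothetical `DemoVectorizer` class standing in for `CountVectorizer` and two toy tokenizers.

```python
class DemoVectorizer:
    """Hypothetical minimal stand-in for CountVectorizer: it only stores the tokenizer."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer


def spacy_tokenizer(sentence):
    # First definition, standing in for the English tokenizer.
    return ["en:" + word for word in sentence.split()]


vectorizer = DemoVectorizer(tokenizer=spacy_tokenizer)


def spacy_tokenizer(sentence):
    # Second definition under the same name, standing in for the German tokenizer.
    return ["de:" + word for word in sentence.split()]


# The stored reference still points at the first function object.
print(vectorizer.tokenizer("schönes kleid"))     # → ['en:schönes', 'en:kleid']

# A vectorizer created after the redefinition picks up the new function.
vectorizer_de = DemoVectorizer(tokenizer=spacy_tokenizer)
print(vectorizer_de.tokenizer("schönes kleid"))  # → ['de:schönes', 'de:kleid']
```

In the notebook, this would mean building a second pipeline from the `bow_vector` defined in cell In [81] and a new `LogisticRegression()`, then fitting that pipeline on `X_train_de`. The German scores recorded here were produced by the original `pipe`.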

In [85]:
pred_lr_de = pipe.predict(X_test_de)

In [88]:
print("LR train accuracy:", pipe.score(X_train_de, y_train_de))
print("LR test accuracy:", pipe.score(X_test_de, y_test_de))

LR train accuracy: 0.9263990267639902
LR test accuracy: 0.8653136531365314


## SUMMARY
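We used spaCy to preprocess clothing reviews in English and German, then trained two classifiers (Logistic Regression and a linear Support Vector Classifier) to predict whether a review recommends the product, which serves as a proxy for sentiment. The accuracy scores recorded above are:
- Logistic Regression: 95.65% on the English training data; 86.53% on the German test data (92.64% on the German training data). The English test score of 1.0 printed above compares the predictions with themselves, so it is not a genuine test accuracy.
- Linear SVC: 86.46% on the English test data (98.56% on the English training data). It was not run on the German reviews.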