# Introduction

This is a supervised learning problem. The dataset contains 5000 Turkish complaints belonging to 32 categories. To build and train the model, I first tokenize the texts and convert them to integer sequences. Each complaint is limited to 192 words: texts shorter than 192 words are padded with zeros, and longer ones are truncated. After preprocessing the complaints, I trained two models in two different ways:

- Model-1: A neural network with an LSTM layer that classifies complaints into all 32 categories.
- Model-2: The same architecture as Model-1, but trained only on the categories selected by the Pareto principle, that is, the small set of categories that accounts for 80% of the complaints.
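
The Pareto-style selection behind Model-2 can be sketched as follows: count the complaints per category, sort by frequency, and keep the most frequent categories until their cumulative share reaches a threshold. The helper name `pareto_categories`, the 0.8 threshold, and the toy labels are illustrative assumptions, not the procedure actually used to pick the categories in this notebook.

```python
from collections import Counter

def pareto_categories(labels, share=0.8):
    """Return the most frequent categories whose cumulative share of
    `labels` first reaches `share` (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    kept, covered = [], 0
    for category, n in counts.most_common():
        kept.append(category)
        covered += n
        if covered / total >= share:
            break
    return kept

# Toy labels: category 1 dominates, categories 3 and 4 are rare.
toy_labels = [1] * 6 + [2] * 2 + [3] + [4]
kept = pareto_categories(toy_labels, share=0.8)
```

On the toy labels, the two most frequent categories already cover 8 of the 10 samples, so the loop stops after them.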

In [1]:
import numpy as np
import pandas as pd

# Data Processing

In [2]:
df_input = pd.read_excel('input_data.xlsx') # Complaint texts.
df_output = pd.read_excel('output_data.xlsx', index_col='gozlem') # Category labels for each complaint.
df = df_input.join(df_output)
df

Unnamed: 0,Input,1,2,3,4,5,6,7,8,9,...,24,25,26,27,28,29,30,31,32,kategori
0,müşterimiz saat aradığı donama şubesi personel...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,müşterimiz kargosunun alınmadığını belirtiyor ...,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
2,müşteri anlaşmalı müşteri adres alımlardan ürü...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,müşterimiz tazmin talebinin akıbeti ilgili bil...,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
4,adresten alım yapılmaması müşteri alım talebin...,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4895,personel kargoyu dükkanın önüne bırakıp gitmiş...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4896,merhaba müşterimiz şube telefonlarına ulaşılam...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,11
4897,alıcı müşteri dükkanı kapalı telefona cevap ve...,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,8
4898,müşterimize kargo teslim edildi görünüyor faka...,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9


In [3]:
# Drop the one-hot category columns (1 to 32); only the 'kategori' label column is needed.
df.drop(columns=list(range(1, 33)), inplace=True)
df

Unnamed: 0,Input,kategori
0,müşterimiz saat aradığı donama şubesi personel...,1
1,müşterimiz kargosunun alınmadığını belirtiyor ...,2
2,müşteri anlaşmalı müşteri adres alımlardan ürü...,1
3,müşterimiz tazmin talebinin akıbeti ilgili bil...,3
4,adresten alım yapılmaması müşteri alım talebin...,2
...,...,...
4895,personel kargoyu dükkanın önüne bırakıp gitmiş...,1
4896,merhaba müşterimiz şube telefonlarına ulaşılam...,11
4897,alıcı müşteri dükkanı kapalı telefona cevap ve...,8
4898,müşterimize kargo teslim edildi görünüyor faka...,9


In [4]:
import nltk
from nltk.tokenize import RegexpTokenizer # Splits text into word tokens.

# The data is in Turkish, so TurkishStemmer is installed to reduce words to their stems.
# https://github.com/otuncelli/turkish-stemmer-python
# The comment is kept off the pip line; otherwise '#' is passed to pip and rejected as an invalid requirement.
!pip install TurkishStemmer
from TurkishStemmer import TurkishStemmer
stemmer = TurkishStemmer()
tokenizer = RegexpTokenizer(r'\w+') # Keeps runs of word characters, which drops punctuation.
punct_re = lambda x: " ".join(tokenizer.tokenize(x.lower())) # Lowercases the text and removes punctuation.

nltk.download('stopwords') # Stop words add little meaning to a sentence, so they are removed to simplify the data.

stop_word_list = nltk.corpus.stopwords.words('turkish')

def stopword_extraction(values):
    # Keep only the words that are not in the Turkish stop word list.
    wordFilter = [word for word in values.split() if word not in stop_word_list]
    notStopword = " ".join(wordFilter)
    return notStopword

df["Input"] = df["Input"].apply(punct_re)
df["Input"] = df["Input"].apply(lambda x: " ".join([stemmer.stem(w) for w in x.split()]))
df["Input"] = df["Input"].apply(stopword_extraction)
df

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\onur\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Input,kategori
0,müşter saat aradık dona şube persone taraf küs...,1
1,müşter kargo alınmadık belirtiyor konu şikayet...,2
2,müşter anlaşma müşter adres alım ürün alınmıyo...,1
3,müşter tazmi talep akıbe ilgi bilgi almak ister,3
4,adres al yapılmama müşter al talep gerçekleşme...,2
...,...,...
4895,personel kargo dükka önün bırakıp git sigar ma...,1
4896,merhap müşter şube telefon ulaşılamama dola şi...,11
4897,alıç müşter dükkan kapa telefon cevap verm kur...,8
4898,müşter kargo tesl edil görünüyor fakat müşter ...,9


In [5]:
from keras.preprocessing.text import Tokenizer # Builds a vocabulary, capped at 50000 words, from the data set so that the text can be mapped to integers.

tokenizer = Tokenizer(num_words=50000)
tokenizer.fit_on_texts(df.Input)

In [6]:
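# A sequence length is derived from the text data as mean + 2 standard deviations.
# Note: len() on each string counts characters rather than words, so the resulting cutoff is measured in characters.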

num_tokens1 = [len(tokens) for tokens in df['Input']]
num_tokens1 = np.array(num_tokens1)

max_tokens1 = np.mean(num_tokens1) + 2 * np.std(num_tokens1)
max_tokens1 = int(max_tokens1)
max_tokens1

168
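
Because `len()` applied to a string returns its character count, a word-based cutoff would need the texts split into tokens first. The sketch below contrasts the two on toy strings; the helper `length_cutoff` and the sample texts are assumptions for illustration, and the notebook's own value of 168 is left unchanged.

```python
import numpy as np

# Toy, already-cleaned complaints (illustrative only).
texts = ["müşter kargo alınmadık", "adres al yapılmama müşter al talep"]

char_lengths = np.array([len(t) for t in texts])          # characters per text
word_lengths = np.array([len(t.split()) for t in texts])  # whitespace tokens per text

def length_cutoff(lengths):
    """Mean plus two standard deviations, truncated to an int."""
    return int(np.mean(lengths) + 2 * np.std(lengths))

char_cutoff = length_cutoff(char_lengths)
word_cutoff = length_cutoff(word_lengths)
```

A word-based cutoff is typically much smaller than a character-based one, which shortens the padded sequences fed to the LSTM.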

In [7]:
# The neural network expects inputs of equal length, which real-world text rarely has, so pad_sequences is used to bring every sequence to the same length.
from keras.preprocessing.sequence import pad_sequences

X1 = tokenizer.texts_to_sequences(df['Input']) # Each text is converted to a sequence of vocabulary indices.
X_pad1 = pad_sequences(X1, maxlen=max_tokens1) # Every sequence is padded or truncated to max_tokens1, the length computed above.
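
To make the padding step concrete, here is a minimal NumPy sketch of what `pad_sequences` does with its default settings, where both padding and truncation happen at the start of a sequence. The function `pad_pre` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items,
    mirroring the default 'pre' padding and truncation (illustrative)."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]            # drop tokens from the front if too long
        if trimmed:
            out[row, -len(trimmed):] = trimmed
    return out

padded = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
```

The short sequence becomes `[0, 0, 5, 8]` and the long one keeps its last four tokens, `[3, 4, 5, 6]`.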


_____

In [8]:
# Since the data is imbalanced, the minority class is oversampled with RandomOverSampler.
from imblearn.over_sampling import RandomOverSampler 

Y1 = pd.get_dummies(df['kategori']).values 

rus = RandomOverSampler(random_state = 0, sampling_strategy = "minority")
x_rus1, y_rus1 = rus.fit_resample(X_pad1, Y1)
print(x_rus1.shape, y_rus1.shape)

(5770, 168) (5770, 32)
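
With `sampling_strategy="minority"`, only the least frequent class is resampled, up to the size of the largest class; all other classes keep their original counts. The sketch below reproduces that idea in plain NumPy on integer labels. The function `oversample_minority` is a hypothetical illustration, not the imbalanced-learn implementation, and it uses integer labels instead of the one-hot matrix used above.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Duplicate random rows of the rarest class until it matches the
    largest class; other classes are left untouched (illustrative)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    pool = np.flatnonzero(y == minority)          # row indices of the minority class
    extra = rng.choice(pool, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 2])                  # class 2 is the minority
X_res, y_res = oversample_minority(X, y)
```

Note that class 1 stays at two samples: a data set with many rare classes remains imbalanced after this step.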


In [9]:
from sklearn.model_selection import train_test_split
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_rus1, y_rus1, test_size = 0.20, random_state = 42)

print(x_train1.shape, y_train1.shape)
print(x_test1.shape, y_test1.shape)

(4616, 168) (4616, 32)
(1154, 168) (1154, 32)


# Model-1

In [10]:
from tensorflow import keras

model1 = keras.Sequential()
model1.add(keras.layers.Embedding(50000, 100, input_length=max_tokens1))
model1.add(keras.layers.LSTM(200, dropout=0.5, recurrent_dropout=0.5))
model1.add(keras.layers.Dense(32, activation='softmax')) # Softmax turns the outputs into a probability distribution over the 32 classes; it fits here because each complaint has exactly one correct class.

model1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) # categorical_crossentropy is used because this is a multi-class problem with one-hot labels.

epochs = 50
batch_size = 64

# Stop training when val_loss has not improved by at least 0.0001 for 5 consecutive epochs.
callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, min_delta=0.0001)

history = model1.fit(x_train1, y_train1, epochs=epochs, batch_size=batch_size, validation_data=(x_test1, y_test1), callbacks=[callback])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
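
The softmax output layer and the categorical cross-entropy loss used in the model can be written out in a few lines of NumPy. This is a sketch of the standard definitions on toy numbers; the logits and labels are made up for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -sum(y_true * log(y_prob)) over samples, for one-hot y_true."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

logits = np.array([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]])  # toy scores, 3 classes
probs = softmax(logits)
y_true = np.array([[1, 0, 0], [0, 0, 1]])
loss = categorical_crossentropy(y_true, probs)
```

Each row of `probs` sums to 1, and the loss only looks at the probability assigned to the correct class of each sample.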


In [11]:
accr1 = model1.evaluate(x_test1, y_test1)



In [12]:
y_pred1 = model1.predict(x_test1, batch_size=64, verbose=0)

# For the classification report, the predicted probabilities and the one-hot labels are converted to class indices by taking argmax() along each row.
rounded_y_pred1 = np.argmax(y_pred1, axis=1)
rounded_labels1 = np.argmax(y_test1, axis=1)
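
As a small illustration of this conversion, the sketch below applies `argmax` to toy probability and one-hot arrays (made-up values). The resulting indices are 0-based column positions, so the labels shown in the classification report are column indices rather than the original `kategori` values.

```python
import numpy as np

y_prob = np.array([[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1]])   # toy predicted probabilities
y_onehot = np.array([[0, 1, 0],
                     [0, 0, 1]])       # toy one-hot labels

pred_idx = np.argmax(y_prob, axis=1)   # most probable class per row
true_idx = np.argmax(y_onehot, axis=1) # position of the 1 in each row
accuracy = float(np.mean(pred_idx == true_idx))
```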

In [13]:
# classification report
from sklearn import metrics
cr1 = metrics.classification_report(rounded_labels1, rounded_y_pred1, zero_division=0)
print(cr1)

              precision    recall  f1-score   support

           0       0.70      0.79      0.74       170
           1       0.92      0.89      0.91        81
           2       0.50      0.75      0.60         4
           3       0.94      0.97      0.96        34
           5       0.90      0.97      0.93       165
           6       0.89      0.87      0.88        86
           7       0.69      0.28      0.40       103
           8       0.46      0.85      0.60        73
           9       0.94      0.75      0.84        44
          10       0.98      0.97      0.97       100
          11       0.00      0.00      0.00         1
          12       0.62      0.67      0.65        15
          13       0.57      0.44      0.50         9
          14       0.33      0.50      0.40         4
          15       0.00      0.00      0.00        10
          16       0.83      0.73      0.78        26
          17       1.00      1.00      1.00       177
          18       0.31    

# Model-2

In [14]:
# To improve accuracy and training speed, only the 10 categories that account for 80% of the complaints are kept; the rest are discarded.

keep_categories = [1, 2, 4, 6, 7, 8, 9, 10, 11, 17]
df.drop(df[~df.kategori.isin(keep_categories)].index, inplace=True) # Alternative: df = df[df.kategori.isin(keep_categories)]

df.reset_index(drop=True, inplace=True) # Renumber the rows from 0 and discard the old index.

df

Unnamed: 0,Input,kategori
0,müşter saat aradık dona şube persone taraf küs...,1
1,müşter kargo alınmadık belirtiyor konu şikayet...,2
2,müşter anlaşma müşter adres alım ürün alınmıyo...,1
3,adres al yapılmama müşter al talep gerçekleşme...,2
4,merhap müşter kendi ait olmayan kargo sms geld...,4
...,...,...
4372,personel kargo dükka önün bırakıp git sigar ma...,1
4373,merhap müşter şube telefon ulaşılamama dola şi...,11
4374,alıç müşter dükkan kapa telefon cevap verm kur...,8
4375,müşter kargo tesl edil görünüyor fakat müşter ...,9


In [15]:
# The same tokenizer is fitted again, this time on the reduced data set.
tokenizer.fit_on_texts(df.Input)

# As in Model-1, len() counts the characters of each text.
num_tokens2 = [len(tokens) for tokens in df['Input']]
num_tokens2 = np.array(num_tokens2)

max_tokens2 = np.mean(num_tokens2) + 2 * np.std(num_tokens2)
max_tokens2 = int(max_tokens2)

X2 = tokenizer.texts_to_sequences(df['Input'])
X_pad2 = pad_sequences(X2, maxlen=max_tokens2)

Y2 = pd.get_dummies(df['kategori']).values

# The RandomOverSampler defined for Model-1 is reused.
x_rus2, y_rus2 = rus.fit_resample(X_pad2, Y2)
print(x_rus2.shape, y_rus2.shape)

# train_test_split was already imported for Model-1.
x_train2, x_test2, y_train2, y_test2 = train_test_split(x_rus2, y_rus2, test_size = 0.20, random_state = 42)

print(x_train2.shape, y_train2.shape)
print(x_test2.shape, y_test2.shape)

(5135, 165) (5135, 10)
(4108, 165) (4108, 10)
(1027, 165) (1027, 10)


In [16]:
model2 = keras.Sequential()
model2.add(keras.layers.Embedding(50000, 100, input_length=max_tokens2))
model2.add(keras.layers.LSTM(200, dropout=0.5, recurrent_dropout=0.5))
model2.add(keras.layers.Dense(10, activation='softmax')) # Softmax turns the outputs into a probability distribution over the 10 classes; it fits here because each complaint has exactly one correct class.

model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) # categorical_crossentropy is used because this is a multi-class problem with one-hot labels.

epochs = 50
batch_size = 64

# Stop training when val_loss has not improved by at least 0.0001 for 5 consecutive epochs.
callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, min_delta=0.0001)

history = model2.fit(x_train2, y_train2, epochs=epochs, batch_size=batch_size, validation_data=(x_test2, y_test2), callbacks=[callback])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50


In [17]:
accr2 = model2.evaluate(x_test2, y_test2)



In [18]:
y_pred2 = model2.predict(x_test2, batch_size=64, verbose=0)

# For the classification report, the predicted probabilities and the one-hot labels are converted to class indices by taking argmax() along each row.
rounded_y_pred2 = np.argmax(y_pred2, axis=1)
rounded_labels2 = np.argmax(y_test2, axis=1)
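
The indices produced by `argmax` run from 0 to 9, while the kept categories have non-contiguous IDs. Since `pd.get_dummies` orders its columns by sorted category value, the indices can be mapped back as sketched below, assuming every kept category occurs in the data. The variable names and the sample indices are illustrative.

```python
import numpy as np

# Categories kept for Model-2, as listed in the filtering step.
kept_categories = np.array(sorted([1, 6, 2, 7, 8, 9, 10, 11, 4, 17]))

# Hypothetical argmax outputs (column positions 0..9).
pred_idx = np.array([0, 2, 9, 5])
pred_kategori = kept_categories[pred_idx]  # original category IDs
```

For example, row label 9 in the report then refers to category 17.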

In [19]:
# classification report
from sklearn import metrics
cr2 = metrics.classification_report(rounded_labels2, rounded_y_pred2, zero_division=0)
print(cr2)

              precision    recall  f1-score   support

           0       0.89      0.88      0.88       170
           1       0.96      0.93      0.94        80
           2       0.94      0.88      0.91        33
           3       0.89      0.95      0.92       149
           4       0.95      0.93      0.94        80
           5       0.62      0.72      0.67       103
           6       0.62      0.54      0.58        82
           7       0.92      0.85      0.88        40
           8       0.98      0.99      0.99       100
           9       0.99      0.96      0.98       190

    accuracy                           0.88      1027
   macro avg       0.88      0.86      0.87      1027
weighted avg       0.88      0.88      0.88      1027
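
For readers who want to see how the columns of the report are derived, the sketch below computes per-class precision, recall, F1 and support from integer labels, then the macro and weighted averages of F1. The helper `per_class_scores` and the toy labels are assumptions for illustration; the report above was produced by scikit-learn, not by this code.

```python
import numpy as np

def per_class_scores(y_true, y_pred, n_classes):
    """Per-class precision, recall, F1 and support from integer labels;
    undefined ratios are reported as 0, like zero_division=0."""
    rows = []
    for c in range(n_classes):
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append((precision, recall, f1, tp + fn))
    return np.array(rows)

y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])
scores = per_class_scores(y_true, y_pred, n_classes=3)
macro_f1 = scores[:, 2].mean()                                # every class counts equally
weighted_f1 = np.average(scores[:, 2], weights=scores[:, 3])  # weighted by support
```

The macro average treats rare and frequent classes alike, whereas the weighted average is dominated by the classes with the most test samples.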

