Problem description

Find any dataset for sentiment analysis (allowed languages: English, Spanish, Polish, Swedish), other than IMDB and the datasets used in class. The dataset may have 2 or 3 classes.

Then:

Clean the data and present the class distribution
Build a sentiment analysis model:
    using a recurrent network (LSTM/GRU/bidirectional network) other than a basic RNN
    using a CNN
    using pre-trained word embeddings
    by fine-tuning a language model (other than basic BERT)
Create a function that uses the trained model and returns the result for a single sentence (or several sentences) passed to it, as a message telling the user whether the text is negative or positive (or neutral in the case of 3 classes).

Publish the finished solution on GitHub with a README. In the README include: information about the data and its origin, a description of the chosen model, and instructions for using the files.

In the Teams assignment, submit a link to the repo with the solution. For a private repo, make sure it is visible to dwnuk@pjwstk.edu.pl.

DEADLINE: as in Teams

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import spacy
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem.regexp import RegexpStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from keras.preprocessing import text,sequence
from keras_preprocessing.sequence import pad_sequences
import keras
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, Dropout, SpatialDropout1D, GlobalMaxPooling1D, Conv1D, MaxPooling1D
import tensorflow as tf

In [2]:
train_df = pd.read_fwf('train.ft.txt', header = None)
test_df = pd.read_fwf('test.ft.txt', header = None)
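
`read_fwf` is meant for fixed-width columns, so on free-text reviews the inferred column boundaries may split or truncate the text. As a hedged alternative, the sketch below parses lines of the form `__label__<n> <text>` by splitting on the first space. The helper names `parse_fasttext_line` and `read_fasttext` are hypothetical and not part of the original notebook, and the sample lines are made up.

```python
from typing import Iterable, List, Tuple

LABEL_PREFIX = "__label__"  # prefix assumed from the label cleanup done later in the notebook


def parse_fasttext_line(line: str) -> Tuple[int, str]:
    """Split one '__label__<n> <text>' line into (label, text)."""
    label_part, _, text = line.rstrip("\n").partition(" ")
    if not label_part.startswith(LABEL_PREFIX):
        raise ValueError(f"line does not start with {LABEL_PREFIX!r}: {line[:40]!r}")
    return int(label_part[len(LABEL_PREFIX):]), text


def read_fasttext(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Parse an iterable of lines (for example an open file), skipping blank ones."""
    return [parse_fasttext_line(line) for line in lines if line.strip()]


# Made-up lines for illustration only
sample = [
    "__label__2 works well would buy again\n",
    "__label__1 stopped working after a week\n",
]
rows = read_fasttext(sample)
print(rows[0][0], rows[1][1])
```

With a real file, the parsed rows could be passed to `pd.DataFrame(rows, columns=["label", "text"])`, which would yield the two columns directly with integer labels.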

In [3]:
train_df = train_df.sample(frac=0.01, random_state=13)
test_df = test_df.sample(frac=0.01, random_state=13)

In [4]:
train_df = train_df.drop([2], axis = 1)

In [5]:
train_df.columns = ["label", "text"]
test_df.columns = ["label", "text"]

In [6]:
train_df['label'] = train_df['label'].str.replace('__label__', '').astype(int)
test_df['label'] = test_df['label'].str.replace('__label__', '').astype(int)
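
The task asks for the class distribution, which can be read off the parsed labels. The sketch below counts labels with the standard library; `class_distribution` is a hypothetical helper and the label list is made up, standing in for `train_df['label'].tolist()`.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def class_distribution(labels: Iterable[int]) -> Dict[int, Tuple[int, float]]:
    """Return {label: (count, share)}; the shares sum to 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# Made-up labels; in the notebook this would be train_df['label'].tolist()
labels = [1, 2, 2, 1, 2, 2, 1, 2]
for label, (n, share) in class_distribution(labels).items():
    print(f"label {label}: {n} rows ({share:.1%})")
```

In pandas the same numbers are available from `train_df['label'].value_counts(normalize=True)`, and appending `.plot(kind='bar')` would draw them with matplotlib.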

In [7]:
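def clean_text(text):
    """Lowercase a pandas Series of strings and strip digits, mentions, URLs and punctuation."""
    text = text.str.lower()
    text = text.apply(lambda x: re.sub(r'[0-9]+', '', x))         # digits
    text = text.apply(lambda x: re.sub(r'@mention', ' ', x))      # literal "@mention" placeholders
    text = text.apply(lambda x: re.sub(r'https?://\S+', ' ', x))  # http(s) URLs
    # Bare .com domains, with or without a leading "www."
    text = text.apply(lambda x: re.sub(r'www\.[a-z]+\.com|[a-z]+\.com', ' ', x))
    text = text.apply(lambda x: re.sub(r"[_,>(\-:)\\/!.^\];='#]", '', x))  # selected punctuation
    return text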

In [8]:
train_df['text'] = clean_text(train_df['text'])
test_df['text'] = clean_text(test_df['text'])

In [9]:
tokenizer = text.Tokenizer(num_words=10000, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(train_df['text'].values)
word_index = tokenizer.word_index

print('Found %s unique tokens.' % len(word_index))

Found 91868 unique tokens.


In [10]:
train_text = tokenizer.texts_to_sequences(train_df['text'].values)
train_text = pad_sequences(train_text, maxlen=250)

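# One-hot encode the labels; cast to float because recent pandas versions return booleans
y = pd.get_dummies(train_df['label']).values.astype('float32')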
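
With its default settings, `pad_sequences` pads and truncates at the front of each sequence, so the last `maxlen` tokens of a long review are kept. The pure-Python sketch below mimics that behaviour for illustration only; `pad_pre` is a hypothetical helper, not the Keras implementation.

```python
from typing import List


def pad_pre(seq: List[int], maxlen: int, value: int = 0) -> List[int]:
    """Keep the last `maxlen` items and left-pad with `value` up to `maxlen`."""
    kept = seq[-maxlen:] if maxlen > 0 else []
    return [value] * (maxlen - len(kept)) + kept


print(pad_pre([5, 8, 2], maxlen=5))              # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

Index 0 is safe to use as the padding value because the Keras tokenizer numbers words starting from 1.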

In [11]:
x_train, x_test, y_train, y_test = train_test_split(train_text,y, test_size = 0.2, random_state = 42)

In [12]:
#LSTM

model_LSTM = Sequential([
    Embedding(input_dim=10000, output_dim=128),
    LSTM(32, return_sequences=True),
    Dropout(0.2),
    LSTM(16),
    Dense(2, activation='sigmoid')
])

model_LSTM.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [13]:
history = model_LSTM.fit(x_train, y_train , validation_data=(x_test, y_test), epochs=3, batch_size=64)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [14]:
#CNN

model_CNN = Sequential([
    Embedding(10000, 100, input_length=train_text.shape[1]),
    Conv1D(filters=128, kernel_size=5, padding='same', activation='relu'),
    MaxPooling1D(pool_size=4),
    Conv1D(filters=64, kernel_size=5, padding='same', activation='relu'),
    MaxPooling1D(pool_size=4),
    GlobalMaxPooling1D(),
    Dense(16, activation='relu'),
    Dropout(0.2),
    Dense(2, activation='sigmoid')
])

model_CNN.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

model_CNN.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 250, 100)          1000000   
                                                                 
 conv1d (Conv1D)             (None, 250, 128)          64128     
                                                                 
 max_pooling1d (MaxPooling1  (None, 62, 128)           0         
 D)                                                              
                                                                 
 conv1d_1 (Conv1D)           (None, 62, 64)            41024     
                                                                 
 max_pooling1d_1 (MaxPoolin  (None, 15, 64)            0         
 g1D)                                                            
                                                                 
 global_max_pooling1d (Glob  (None, 64)               
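
The shapes and parameter counts in the summary can be checked by hand. The sketch below recomputes them, assuming `padding='same'` convolutions (sequence length preserved) and max pooling with stride equal to the pool size and no padding. The function names are illustrative.

```python
def conv1d_params(kernel_size: int, in_channels: int, filters: int) -> int:
    """Kernel weights (kernel_size * in_channels * filters) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def pooled_length(length: int, pool_size: int) -> int:
    """Output length of pooling with stride == pool_size and no padding."""
    return (length - pool_size) // pool_size + 1


embedding_params = 10000 * 100       # vocabulary size * embedding dimension
conv1 = conv1d_params(5, 100, 128)   # first Conv1D layer
len1 = pooled_length(250, 4)         # after the first MaxPooling1D
conv2 = conv1d_params(5, 128, 64)    # second Conv1D layer
len2 = pooled_length(len1, 4)        # after the second MaxPooling1D
print(embedding_params, conv1, len1, conv2, len2)  # 1000000 64128 62 41024 15
```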

In [15]:
history = model_CNN.fit(x_train, y_train, epochs = 3, batch_size = 64, validation_data = (x_test, y_test))

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [16]:
#pre-trained word embedding

embeddings_index = {}
with open('glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

        
embedding_matrix = np.zeros((len(word_index) + 1, 100))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
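
Words without a GloVe vector keep an all-zero row in `embedding_matrix`, and with `trainable=False` those rows never change, so it can be worth checking how much of the vocabulary is covered. The sketch below shows one way to measure that on toy data; `parse_glove_line` and `embedding_coverage` are hypothetical helpers and the vectors are made up.

```python
from typing import Collection, Dict, List, Tuple


def parse_glove_line(line: str) -> Tuple[str, List[float]]:
    """Split 'word v1 v2 ...' into the word and its vector."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(v) for v in parts[1:]]


def embedding_coverage(word_index: Dict[str, int], known_words: Collection[str]) -> float:
    """Share of vocabulary words that have a pre-trained vector."""
    if not word_index:
        return 0.0
    hits = sum(1 for word in word_index if word in known_words)
    return hits / len(word_index)


# Made-up 3-dimensional vectors and a toy vocabulary
toy_glove = ["good 0.1 0.2 0.3", "bad -0.1 0.0 0.4"]
toy_vectors = dict(parse_glove_line(line) for line in toy_glove)
toy_word_index = {"good": 1, "bad": 2, "gr8": 3}
print(f"coverage: {embedding_coverage(toy_word_index, toy_vectors):.0%}")
```

In the notebook the same check would be `embedding_coverage(word_index, embeddings_index)`.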
        
        

In [17]:
model_embedding = Sequential([
    Embedding(len(word_index) + 1, 100, weights=[embedding_matrix], input_length=train_text.shape[1], trainable=False),
    LSTM(32, return_sequences=True),
    Dropout(0.2),
    LSTM(16),
    Dense(2, activation='sigmoid')
])

model_embedding.compile(loss='binary_crossentropy', 
                        optimizer='adam', 
                        metrics=['accuracy'])

In [18]:
history = model_embedding.fit(x_train, y_train, epochs=3, batch_size=64, validation_data = (x_test, y_test))

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [21]:
def predict_sentiment(model, tokenizer, sentence):
    # Apply the same cleaning, tokenization and padding as for the training data
    cleaned_text = clean_text(pd.Series([sentence]))
    text_sequence = tokenizer.texts_to_sequences(cleaned_text)
    padded_sequence = pad_sequences(text_sequence, maxlen=250)
    prediction = model.predict(padded_sequence)

    # Column 0 is the score for label 1 (negative), column 1 for label 2 (positive)
    if prediction[0][0] > prediction[0][1]:
        return "The review is negative"
    else:
        return "The review is positive"

In [22]:
result_lstm = predict_sentiment(model_LSTM, tokenizer, "terrible product")
print(result_lstm)

The review is negative


In [23]:
result_cnn = predict_sentiment(model_CNN, tokenizer, "great tv, everything works fine")
print(result_cnn)

The review is positive


In [24]:
result_embedding = predict_sentiment(model_embedding, tokenizer, "awful service i returned the product after a week")
print(result_embedding)

The review is negative
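
The decision rule in `predict_sentiment` handles one sentence and two classes. The sketch below generalizes it to several sentences and to an optional neutral class by taking the highest-scoring column per row. The names, the message wording, the 3-class order and the scores are all assumptions made for illustration. Ties go to the first class here, which differs slightly from the comparison used in `predict_sentiment`.

```python
from typing import List, Sequence

MESSAGES_2 = ["negative", "positive"]
MESSAGES_3 = ["negative", "neutral", "positive"]  # assumed class order for a 3-class dataset


def scores_to_message(scores: Sequence[float], class_names: Sequence[str]) -> str:
    """Pick the class with the highest score and format a user-facing message."""
    if len(scores) != len(class_names):
        raise ValueError("scores and class_names must have the same length")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return f"The text is {class_names[best]}"


def batch_messages(score_rows: Sequence[Sequence[float]], class_names: Sequence[str]) -> List[str]:
    """Apply scores_to_message to every row, e.g. every row returned by model.predict."""
    return [scores_to_message(row, class_names) for row in score_rows]


# Made-up model outputs, one row per sentence
fake_scores = [[0.91, 0.12], [0.20, 0.85]]
print(batch_messages(fake_scores, MESSAGES_2))
```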
