We will analyze hotel reviews from train.csv and predict the rating that guests gave to these hotels.

In [329]:
import numpy as np
import pandas as pd

In [330]:
data = pd.read_csv('nlp/train.csv', encoding='cp1252')

In [331]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2351 entries, 0 to 2350
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Id            2351 non-null   int64  
 1   Hotel_name    2351 non-null   object 
 2   Review_Title  2136 non-null   object 
 3   Review_Text   2351 non-null   object 
 4   Rating        2351 non-null   float64
dtypes: float64(1), int64(1), object(3)
memory usage: 92.0+ KB


In [332]:
data.head()

Unnamed: 0,Id,Hotel_name,Review_Title,Review_Text,Rating
0,0,Park Hyatt,Refuge in Chennai,Excellent room and exercise facility. All arou...,80.0
1,1,Hilton Chennai,Hilton Chennai,Very comfortable and felt safe. \r\nStaff were...,100.0
2,2,The Royal Regency,No worth the rating shown in websites. Pricing...,Not worth the rating shown. Service is not goo...,71.0
3,3,Rivera,Good stay,"First of all nice & courteous staff, only one ...",86.0
4,4,Park Hyatt,Needs improvement,Overall ambience of the hotel is very good. In...,86.0


In [333]:
data.isna().value_counts()

Id     Hotel_name  Review_Title  Review_Text  Rating
False  False       False         False        False     2136
                   True          False        False      215
dtype: int64

Replace all NaN values in the 'Review_Title' column with 'not marked':

In [334]:
data['Review_Title'] = data['Review_Title'].fillna('not marked')

In [335]:
data.isna().value_counts()

Id     Hotel_name  Review_Title  Review_Text  Rating
False  False       False         False        False     2351
dtype: int64

Split the data into features and target:

In [336]:
X_data = data.drop(columns=['Rating'])
X_data.head(10)

Unnamed: 0,Id,Hotel_name,Review_Title,Review_Text
0,0,Park Hyatt,Refuge in Chennai,Excellent room and exercise facility. All arou...
1,1,Hilton Chennai,Hilton Chennai,Very comfortable and felt safe. \r\nStaff were...
2,2,The Royal Regency,No worth the rating shown in websites. Pricing...,Not worth the rating shown. Service is not goo...
3,3,Rivera,Good stay,"First of all nice & courteous staff, only one ..."
4,4,Park Hyatt,Needs improvement,Overall ambience of the hotel is very good. In...
5,5,Everest,"Good atmosphere, food and drinks not available","I reached the hotel by car, felt good for co-o..."
6,6,Metro Grand,Lovely hotel,The hotel is pretty clean with excellent beddi...
7,7,Oyo Rooms Anna Arch Arumbakkam,Not worth the money,No hot water.wifi limited to lobby. Average cl...
8,8,The Park Chennai,not marked,Hotel staff are not co-ordinating well for the...
9,9,FabHotel Priyadarshini Park Mount Road,Good hotel with poor services,Location and cleanliness is good. But as far a...


In [337]:
y_data = data['Rating'].to_frame()
y_data.head()

Unnamed: 0,Rating
0,80.0
1,100.0
2,71.0
3,86.0
4,86.0


Split the data into training and test sets:

In [338]:
from sklearn.model_selection import train_test_split

In [339]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2)

In [340]:
X_train = X_train.copy()
X_test = X_test.copy()
y_train = y_train.copy()
y_test = y_test.copy()
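
The split above is random, so every RMSE reported later changes from run to run. A minimal sketch of a reproducible split, assuming a fixed seed is acceptable; the `random_state` value, the helper name `split` and the toy frame are illustrative and not taken from the notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the review data (made-up values, for illustration only)
toy = pd.DataFrame({
    "Review_Text": [f"review {i}" for i in range(10)],
    "Rating": [float(10 * i) for i in range(10)],
})

def split(frame, seed=42):
    """Split features and target with a fixed seed so the result is repeatable."""
    X = frame.drop(columns=["Rating"])
    y = frame["Rating"]
    return train_test_split(X, y, test_size=0.2, random_state=seed)

X_tr_a, X_te_a, y_tr_a, y_te_a = split(toy)
X_tr_b, X_te_b, y_tr_b, y_te_b = split(toy)

# Both calls select exactly the same rows
print(list(X_te_a.index) == list(X_te_b.index))
```

With a fixed seed, differences between models reflect the models rather than a different draw of test reviews.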

Remove punctuation:

In [344]:
import re

In [345]:
expr = r'[^a-zA-Z0-9\s]'

In [347]:
X_train['Review_Text'] = X_train['Review_Text'].apply(lambda x: ' '.join([re.sub(expr, '', word.lower()) for word in x.strip().split()]))

In [348]:
X_test['Review_Text'] = X_test['Review_Text'].apply(lambda x: ' '.join([re.sub(expr, '', word.lower()) for word in x.strip().split()]))
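
The same cleaning is applied twice, once per split. A sketch of factoring it into one helper; the name `clean_text` is ours, and the pattern is equivalent to `expr` once the text has been lowercased:

```python
import re

# Equivalent to `expr` above once the text is lowercased
NON_ALNUM = re.compile(r"[^a-z0-9\s]")

def clean_text(text):
    """Lowercase, drop non-alphanumeric characters, collapse whitespace."""
    stripped = NON_ALNUM.sub("", text.lower())
    return " ".join(stripped.split())

print(clean_text("Very comfortable, and felt SAFE! \r\nStaff were great."))
```

One small difference from the cells above: a token made only of punctuation (such as `&`) becomes an empty string there and leaves a double space, while this helper collapses it. The later `.split()` calls hide that difference.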

Reduce the vocabulary size:

Remove stop words:

In [349]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/irina.buht12/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [350]:
from nltk.corpus import stopwords
sw_eng = set(stopwords.words('english'))
list(sw_eng)[:6]

['with', 'them', 'does', 'while', 'where', "doesn't"]

In [351]:
X_train['clear_Review_Text'] = X_train['Review_Text'].apply(lambda x: ' '.join([word for word in x.strip().split() if not word in sw_eng]))

In [352]:
X_test['clear_Review_Text'] = X_test['Review_Text'].apply(lambda x: ' '.join([word for word in x.strip().split() if not word in sw_eng]))

In [353]:
X_train['Review_Text'].apply(lambda x: len(x)).head()

327      84
430     237
1538     89
446     294
1977    100
Name: Review_Text, dtype: int64

In [354]:
X_train['clear_Review_Text'].apply(lambda x: len(x)).head()

327      69
430     145
1538     65
446     208
1977     75
Name: clear_Review_Text, dtype: int64

The texts have become noticeably shorter.
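
One side effect deserves attention: the stop-word list also removes negations, which is why "does not work" appears as just "work" in the cleaned text printed further down. A sketch of keeping negations, using a small hand-written stop list as a stand-in for `sw_eng`; the set contents and the helper name are assumptions:

```python
# Hand-written stand-in for NLTK's list (assumption; the real set is larger)
STOP_WORDS = {"and", "does", "not", "no", "is", "it", "the", "to", "with"}
NEGATIONS = {"not", "no", "nor"}

# Keep negation words by removing them from the stop list
stop_keep_neg = STOP_WORDS - NEGATIONS

def remove_stop_words(text, stop_words):
    """Drop every token that appears in the stop-word set."""
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "geezer and ac does not work"
print(remove_stop_words(sample, STOP_WORDS))     # negation is lost
print(remove_stop_words(sample, stop_keep_neg))  # negation is kept
```

Whether keeping negations helps a bag-of-words model here is an open question; the notebook does not test it.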

Normalize the text:
1) Stemming

In [355]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language='english')

In [356]:
X_train['stemm_Review_Text'] = X_train['clear_Review_Text'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))

In [357]:
X_test['stemm_Review_Text'] = X_test['clear_Review_Text'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))

In [358]:
X_train['Review_Text'].head()

327     itc grand chola is an awesome hotel with great...
430     be careful with the taxes on your food order w...
1538    it is good to stay hotel staff is good it is i...
446     fantastic service by staff at every touch poin...
1977    geezer and ac does not work wi fi is good staf...
Name: Review_Text, dtype: object

In [359]:
X_train['clear_Review_Text'].head()

327     itc grand chola awesome hotel great facilities...
430     careful taxes food order food prices expected ...
1538    good stay hotel staff good good location good ...
446     fantastic service staff every touch point frie...
1977    geezer ac work wi fi good staff behaviour care...
Name: clear_Review_Text, dtype: object

In [360]:
X_train['stemm_Review_Text'].head()

327     itc grand chola awesom hotel great facil frien...
430     care tax food order food price expect end pay ...
1538    good stay hotel staff good good locat good sho...
446     fantast servic staff everi touch point friend ...
1977    geezer ac work wi fi good staff behaviour care...
Name: stemm_Review_Text, dtype: object

2) Lemmatization

In [361]:
from nltk import pos_tag
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
    my_switch = {
        'J': wordnet.ADJ,
        'V': wordnet.VERB,
        'N': wordnet.NOUN,
        'R': wordnet.ADV,
    }
    for key, item in my_switch.items():
        if treebank_tag.startswith(key):
            return item
    # Fall back to noun for any other tag
    return wordnet.NOUN

In [362]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/irina.buht12/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [363]:
from nltk import WordNetLemmatizer
def my_lemmatizer(sent):
    lemmatizer = WordNetLemmatizer()
    tokenized_sent = sent.split()
    pos_tagged = [(word, get_wordnet_pos(tag))
                 for word, tag in pos_tag(tokenized_sent)]
    return ' '.join([lemmatizer.lemmatize(word, tag)
                    for word, tag in pos_tagged])

In [364]:
X_train['lemma_Review_Text'] = X_train['clear_Review_Text'].apply(lambda x: my_lemmatizer(x))

In [365]:
X_test['lemma_Review_Text'] = X_test['clear_Review_Text'].apply(lambda x: my_lemmatizer(x))

In [366]:
X_train['stemm_Review_Text'].head()

327     itc grand chola awesom hotel great facil frien...
430     care tax food order food price expect end pay ...
1538    good stay hotel staff good good locat good sho...
446     fantast servic staff everi touch point friend ...
1977    geezer ac work wi fi good staff behaviour care...
Name: stemm_Review_Text, dtype: object

In [367]:
X_train['lemma_Review_Text'].head()

327     itc grand chola awesome hotel great facility f...
430     careful tax food order food price expect end p...
1538    good stay hotel staff good good location good ...
446     fantastic service staff every touch point frie...
1977    geezer ac work wi fi good staff behaviour care...
Name: lemma_Review_Text, dtype: object

Lemmatization produces cleaner word forms than stemming, so we will continue with the lemmatized text.

Count the positive and negative words:

In [368]:
# Read the negative-word lexicon, one word per line
with open("nlp/negative-words.txt", "r", encoding='cp1252') as f:
    neg_words = [line.strip() for line in f]

In [369]:
neg_words[0:5]

['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable']

In [370]:
# Read the positive-word lexicon, one word per line
with open("nlp/positive-words.txt", "r", encoding='cp1252') as f:
    pos_words = [line.strip() for line in f]

In [371]:
pos_words[0:5]

['a+', 'abound', 'abounds', 'abundance', 'abundant']

In [372]:
X_train['count_pos_words'] = X_train['lemma_Review_Text'].apply(lambda x: len([word for word in x.split() if word in pos_words]))

In [373]:
X_test['count_pos_words'] = X_test['lemma_Review_Text'].apply(lambda x: len([word for word in x.split() if word in pos_words]))

In [374]:
X_train['count_pos_words'].head()

327     5
430     0
1538    4
446     4
1977    3
Name: count_pos_words, dtype: int64

In [375]:
X_train['count_neg_words'] = X_train['lemma_Review_Text'].apply(lambda x: len([word for word in x.split() if word in neg_words]))

In [376]:
X_test['count_neg_words'] = X_test['lemma_Review_Text'].apply(lambda x: len([word for word in x.split() if word in neg_words]))

In [377]:
X_train['count_neg_words'].head()

327     0
430     0
1538    0
446     1
1977    3
Name: count_neg_words, dtype: int64
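
Membership tests against a Python list scan the whole list for every word. Converting `pos_words` and `neg_words` to sets gives the same counts with constant-time lookups. A sketch with tiny made-up lexicons; the helper name `count_hits` is ours:

```python
def count_hits(text, lexicon):
    """Count tokens of `text` that occur in `lexicon` (a set, for fast lookups)."""
    return sum(1 for word in text.split() if word in lexicon)

# Tiny made-up lexicons; the notebook loads the real ones from text files
pos_set = {"good", "great", "awesome", "friendly"}
neg_set = {"bad", "dirty", "rude"}

review = "good stay hotel staff good rude reception"
print(count_hits(review, pos_set), count_hits(review, neg_set))
```

In the notebook this would amount to calling `set(pos_words)` and `set(neg_words)` once before the `apply` calls.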

In [378]:
y_train.head()

Unnamed: 0,Rating
327,100.0
430,80.0
1538,71.0
446,100.0
1977,65.0


Represent the texts numerically:

1) *CountVectorizer*

In [379]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer

CountVectorizer()

In [380]:
count_vec_lemma = vectorizer.fit_transform(X_train['lemma_Review_Text'].to_list())

In [381]:
count_vec_lemma.shape

(1880, 3667)
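
The shape says 1880 training reviews by 3667 distinct tokens. To see what one row holds, here is a toy example with two made-up reviews; each column is a vocabulary word and each cell is how often that word occurs in the review:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good hotel good staff", "bad hotel"]  # made-up reviews

cv = CountVectorizer()
counts = cv.fit_transform(docs)

# Columns follow the alphabetically sorted vocabulary
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

Note that the default token pattern keeps only tokens of two or more characters, so one-letter tokens never reach the vocabulary.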

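2) *One hot encoding*: the vocabulary and the texts are too large, so it is worth asking whether this encoding makes sense here. Note that the code below label-encodes each whole review string, so every distinct review becomes its own category rather than each word.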

In [433]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(X_train['lemma_Review_Text'].tolist())

In [434]:
onehot_encoder = OneHotEncoder(sparse=False)
encoded = encoded.reshape(len(encoded), 1)
onehot_encoded_lemma = onehot_encoder.fit_transform(encoded)
print(onehot_encoded_lemma.shape)

(1880, 1521)
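
The cell above gives each distinct review string its own column, so a test review almost never shares a column with a training review. If a word-level presence/absence encoding was the intent (an assumption about the goal), `CountVectorizer(binary=True)` is one way to sketch it on made-up reviews:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good hotel good staff", "bad hotel"]  # made-up reviews
test_docs = ["good staff bad food"]

binary_vec = CountVectorizer(binary=True)
train_bin = binary_vec.fit_transform(train_docs)
# Reuse the vocabulary learned on the training texts; unseen words are ignored
test_bin = binary_vec.transform(test_docs)

print(train_bin.toarray())
print(test_bin.toarray())
```

Here "good" appears twice in the first review but is recorded as 1, and "food" is dropped from the test row because it was not seen during fitting.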


3) *TF-IDF*

In [384]:
from sklearn.feature_extraction.text import TfidfVectorizer
idf_vectorizer = TfidfVectorizer()

In [385]:
tf_idf_lemma = idf_vectorizer.fit_transform(X_train['lemma_Review_Text'].to_list())

In [386]:
tf_idf_lemma.shape

(1880, 3667)

In [387]:
tf_idf_lemma[0].todense()

matrix([[0., 0., 0., ..., 0., 0., 0.]])
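
The dense row is almost all zeros, which hides how the weights are formed. Here is a hand computation on a toy corpus. It follows the smoothed formula that, to our understanding, `TfidfVectorizer` uses by default: idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The corpus and the function name are illustrative:

```python
import math
from collections import Counter

docs = [["good", "hotel"], ["bad", "hotel"]]  # made-up tokenized reviews
n_docs = len(docs)

# Document frequency: in how many documents each term occurs
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return an L2-normalized {term: weight} mapping for one document."""
    tf = Counter(doc)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

"hotel" occurs in both documents and therefore gets a lower weight than "good", which occurs in only one.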

Prepare the test data:

In [388]:
count_vec_lemma_test = vectorizer.transform(X_test['lemma_Review_Text'].to_list())

In [389]:
count_vec_lemma_test.shape

(471, 3667)

In [436]:
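# Caution: fit_transform re-fits the encoder on the test texts, so these labels do not correspond to the training labels
encoded_test = label_encoder.fit_transform(X_test['lemma_Review_Text'].tolist())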
encoded_test = encoded_test.reshape(len(encoded_test), 1)
onehot_encoded_lemma_test = onehot_encoder.transform(encoded_test)
print(onehot_encoded_lemma_test.shape)

(471, 1521)


In [391]:
tf_idf_lemma_test = idf_vectorizer.transform(X_test['lemma_Review_Text'].to_list())
tf_idf_lemma_test.shape

(471, 3667)

Build models on the lemmatized data:

1) For TF-IDF

In [400]:
from sklearn.linear_model import LinearRegression
lg = LinearRegression()

In [401]:
lg.fit(tf_idf_lemma, y_train)

LinearRegression()

In [402]:
y_pred = lg.predict(tf_idf_lemma_test)

In [408]:
from sklearn.metrics import mean_squared_error
print('r_mean_squared_error: ', mean_squared_error(y_test, y_pred, squared=False))

r_mean_squared_error:  26.493298641198137
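
RMSE is the square root of the mean squared difference between true and predicted ratings, so it is expressed in rating points. To our knowledge, recent scikit-learn releases deprecate the `squared` argument of `mean_squared_error`, so a version-independent sketch may be useful; the helper name `rmse` is ours:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings and predictions, each off by 10 points
print(rmse([80.0, 100.0, 71.0], [70.0, 90.0, 81.0]))
```

With the notebook's arrays, taking `np.sqrt(mean_squared_error(y_test, y_pred))` computes the same quantity.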


In [420]:
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor()
gbr.fit(tf_idf_lemma, y_train['Rating'])
y_pred = gbr.predict(tf_idf_lemma_test)
print('r_mean_squared_error: ', mean_squared_error(y_test, y_pred, squared=False))

r_mean_squared_error:  17.072429728793807


In [421]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(tf_idf_lemma, y_train['Rating'])
y_pred = rfr.predict(tf_idf_lemma_test)
print('r_mean_squared_error: ', mean_squared_error(y_test, y_pred, squared=False))

r_mean_squared_error:  16.116316136676186


2) For onehot_encoder

In [437]:
lg.fit(onehot_encoded_lemma, y_train)

LinearRegression()

In [439]:
y_pred = lg.predict(onehot_encoded_lemma_test)

In [440]:
print('r_mean_squared_error: ', mean_squared_error(y_test, y_pred, squared=False))

r_mean_squared_error:  31.885427421242728


In [442]:
gbr.fit(onehot_encoded_lemma, y_train['Rating'])
y_pred = gbr.predict(onehot_encoded_lemma_test)
print('r_mean_squared_error: ', mean_squared_error(y_test, y_pred, squared=False))

r_mean_squared_error:  21.881541342864473


In [None]:
rfr.fit(onehot_encoded_lemma, y_train['Rating'])
y_pred = rfr.predict(onehot_encoded_lemma_test)
print('r_mean_squared_error: ', mean_squared_error(y_test, y_pred, squared=False))

3) *For CountVectorizer*

In [443]:
lg.fit(count_vec_lemma, y_train['Rating'])
y_pred = lg.predict(count_vec_lemma_test)
print('r_mean_squared_error: ', mean_squared_error(y_test, y_pred, squared=False))

r_mean_squared_error:  72.58931951979405


In [445]:
rfr.fit(count_vec_lemma, y_train['Rating'])
y_pred = rfr.predict(count_vec_lemma_test)
print('r_mean_squared_error: ', mean_squared_error(y_test, y_pred, squared=False))

r_mean_squared_error:  16.173743936975214


In [444]:
gbr.fit(count_vec_lemma, y_train['Rating'])
y_pred = gbr.predict(count_vec_lemma_test)
print('r_mean_squared_error: ', mean_squared_error(y_test, y_pred, squared=False))

r_mean_squared_error:  17.33403999640288


**Conclusion:**

For the lemmatized data, the lowest RMSE, about 16 rating points, was obtained with the CountVectorizer and TF-IDF representations (in both cases with RandomForestRegressor).
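
To judge whether an RMSE of about 16 is good, it helps to compare it with a trivial baseline that always predicts the mean training rating. The notebook does not report this baseline; the sketch below uses made-up numbers and a helper name of our own:

```python
import math
from statistics import mean

def baseline_rmse(y_train, y_test):
    """RMSE of always predicting the mean of the training targets."""
    constant = mean(y_train)
    return math.sqrt(mean((y - constant) ** 2 for y in y_test))

# Made-up ratings; with the notebook's data one would pass the Rating columns
print(baseline_rmse([80.0, 100.0, 60.0], [70.0, 90.0]))
```

A model is only useful to the extent that its RMSE falls clearly below this number on the same test split.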


Now let's see what happens when the counts of positive and negative words are used as features.


In [460]:
lg.fit(X_train[['count_pos_words', 'count_neg_words']], y_train['Rating'])
y_pred = lg.predict(X_test[['count_pos_words', 'count_neg_words']])
print('r_mean_squared_error: ', mean_squared_error(y_test, y_pred, squared=False))

r_mean_squared_error:  18.63909508050382


In [447]:
rfr.fit(X_train[['count_pos_words', 'count_neg_words']], y_train['Rating'])
y_pred = rfr.predict(X_test[['count_pos_words', 'count_neg_words']])
print('r_mean_squared_error: ', mean_squared_error(y_test, y_pred, squared=False))

r_mean_squared_error:  17.546953779835945


In [448]:
gbr.fit(X_train[['count_pos_words', 'count_neg_words']], y_train['Rating'])
y_pred = gbr.predict(X_test[['count_pos_words', 'count_neg_words']])
print('r_mean_squared_error: ', mean_squared_error(y_test, y_pred, squared=False))

r_mean_squared_error:  17.57948516690001


The RMSE is slightly higher than the best result obtained from the text features.
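
The two feature sets were evaluated separately. A possible next step, not carried out in the notebook, is to append the two count columns to the sparse text matrix and train on both. A sketch with made-up data, assuming SciPy's `hstack` is used for the concatenation:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["good hotel friendly staff", "bad room rude staff"]  # made-up
lexicon_counts = np.array([[2, 0], [0, 2]])  # [count_pos_words, count_neg_words]

tfidf = TfidfVectorizer().fit_transform(texts)
# Append the two count columns to the sparse TF-IDF matrix
combined = hstack([tfidf, csr_matrix(lexicon_counts)]).tocsr()
print(tfidf.shape, combined.shape)
```

The combined matrix can be passed to the same regressors as before; whether it improves the RMSE would have to be measured.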