# Predicting points based on description (NLP) and other features with CatBoost

In [None]:
import numpy as np
import pandas as pd

import os
print(os.listdir("../input"))

First of all, we are going to load our data and clean the dataset.

In [None]:
data=pd.read_csv('../input/winemag-data-130k-v2.csv')

In [None]:
data.info()

We can see that many columns contain null values. Let's print the count and percentage of missing values per column.

In [None]:
total = data.isnull().sum().sort_values(ascending = False)
percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
missing_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

I'm most worried about wines with NaN in the price column. We don't want to predict points for wines whose price is undeclared, so we will drop rows with a NaN value in this column. Another technique is to fill those values with the mean, but my approach is to deal only with price-tagged wines.

In [None]:
data=data.dropna(subset=['price'])
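To make the two options concrete, here is a minimal sketch on a made-up three-row frame. The `toy` DataFrame and its values are assumptions for illustration, not rows from the wine dataset. It compares dropping rows with a missing price against filling them with the column mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the wine dataset.
toy = pd.DataFrame({
    'title': ['wine a', 'wine b', 'wine c'],
    'price': [10.0, np.nan, 30.0],
})

# Option 1 (used in this kernel): keep only wines with a declared price.
dropped = toy.dropna(subset=['price'])

# Option 2: keep every row and replace missing prices with the mean price.
filled = toy.copy()
filled['price'] = filled['price'].fillna(filled['price'].mean())

print(len(dropped))              # number of rows left after dropping
print(filled['price'].tolist())  # prices after mean imputation
```

Dropping loses rows but never invents a price, while mean imputation keeps every row at the cost of giving all unpriced wines the same artificial value.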

There are also a lot of duplicates in the data, which we want to get rid of.

In [None]:
print("Total number of examples: ", data.shape[0])
print("Number of examples with the same title and description: ", data[data.duplicated(['description','title'])].shape[0])

In [None]:
data=data.drop_duplicates(['description','title'])
data=data.reset_index(drop=True)

Fill all missing values with -1. 

In [None]:
data=data.fillna(-1)

# NLP
Our basic features are ready, so now we start creating features from the description using the NLTK library.
NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”


In [None]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import string
from wordcloud import WordCloud, STOPWORDS
import re

from nltk.tokenize import RegexpTokenizer

We have to turn every word into lowercase because there is no difference in meaning between 'This' and 'this'. We also get rid of irrelevant characters, that is, everything that is not a letter.

In [None]:
data['description']= data['description'].str.lower()
data['description']= data['description'].apply(lambda elem: re.sub('[^a-zA-Z]',' ', elem))  
data['description']
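As a quick illustration of what these two steps do to a single string, here is a small sketch on a made-up description (the `sample` text is an assumption for illustration only):

```python
import re

# Hypothetical description, made up for illustration.
sample = "This wine is 100% Merlot, aged 12 months!"

lowered = sample.lower()
# Replace every character that is not a lowercase letter with a space.
letters_only = re.sub('[^a-z]', ' ', lowered)
words = letters_only.split()

print(words)
```

Note that digits are removed together with punctuation, so numbers such as percentages or ageing times do not survive this cleaning step.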

We can't analyze whole sentences, so we will use a regex tokenizer to split each description into a list of words.

In [None]:
tokenizer = RegexpTokenizer(r'\w+')
words_descriptions = data['description'].apply(tokenizer.tokenize)
words_descriptions.head()

Once the descriptions are split into individual words, we can build the vocabulary and additionally add a new feature: description lengths.

In [None]:
all_words = [word for tokens in words_descriptions for word in tokens]
data['description_lengths']= [len(tokens) for tokens in words_descriptions]
VOCAB = sorted(list(set(all_words)))
print("%s words total, with a vocabulary size of %s" % (len(all_words), len(VOCAB)))
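The same three quantities can be traced by hand on a tiny example. This is only a sketch; the `toy_tokens` lists are made up and stand in for the tokenized descriptions:

```python
# Hypothetical tokenized descriptions, made up for illustration.
toy_tokens = [
    ['ripe', 'fruit', 'and', 'soft', 'tannins'],
    ['soft', 'and', 'ripe'],
]

# Flatten all token lists into one list of words.
toy_all_words = [word for tokens in toy_tokens for word in tokens]
# Length feature: number of tokens per description.
toy_lengths = [len(tokens) for tokens in toy_tokens]
# Vocabulary: sorted unique words.
toy_vocab = sorted(set(toy_all_words))

print(len(toy_all_words), len(toy_vocab))
print(toy_lengths)
print(toy_vocab)
```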

Let's check which words are the most common in our vocabulary.

In [None]:
from collections import Counter
count_all_words = Counter(all_words)
count_all_words.most_common(100)

We can see that there are many stop words and other words which can't help us with our goal: predicting points. 
Now we want to
1. Reduce words that share a stem to a single form (for example run, running, runs -> run). We will use PorterStemmer from the NLTK library.
2. Delete all stop words.


In [None]:
# A set makes the membership test below much faster than a list.
stopword_set = set(stopwords.words('english'))
ps = PorterStemmer()
words_descriptions = words_descriptions.apply(lambda elem: [word for word in elem if word not in stopword_set])
words_descriptions = words_descriptions.apply(lambda elem: [ps.stem(word) for word in elem])
data['description_cleaned'] = words_descriptions.apply(lambda elem: ' '.join(elem))
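Here is a small sketch of both steps on a handful of tokens. The stop-word set and the token list are made up for illustration (the kernel itself uses the full English list from `nltk.corpus.stopwords`), so that the example runs without downloading any NLTK corpus:

```python
from nltk.stem.porter import PorterStemmer

# Hypothetical stop-word set and tokens, made up for illustration.
toy_stopwords = {'the', 'is', 'and', 'a'}
toy_tokens = ['the', 'oak', 'is', 'running', 'and', 'runs', 'run']

stemmer = PorterStemmer()
# Drop stop words first, then stem what is left (same order as above).
kept = [word for word in toy_tokens if word not in toy_stopwords]
stemmed = [stemmer.stem(word) for word in kept]

print(kept)
print(stemmed)
```

The three forms of "run" collapse into one token, which is why the vocabulary shrinks after stemming.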

In [None]:
all_words = [word for tokens in words_descriptions for word in tokens]
VOCAB = sorted(list(set(all_words)))
print("%s words total, with a vocabulary size of %s" % (len(all_words), len(VOCAB)))
count_all_words = Counter(all_words)
count_all_words.most_common(100)

As we can see, we removed almost 9k words, and the remaining words from the descriptions are much more meaningful.
Now we can choose from 3 different ways to represent our descriptions:

1. **Bag of Words Counts** - represents each description as a vector of word counts, where 0 means the word does not appear in it.
2. **TF-IDF (Term Frequency, Inverse Document Frequency)** - weighs words by how frequent they are in a description, discounting words that are too frequent across the whole dataset.
3. **Word2Vec** - captures semantic meaning. We won't use it in this kernel.

We will check which representation performs better in our case: Bag of Words Counts or TF-IDF.

First we will test Bag of Words Counts.

Let's define some useful functions and then test the chosen techniques.
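Before moving on, the difference between the two representations can be seen on three made-up, already cleaned descriptions. This is a hand-rolled sketch with the textbook formula `tf * log(N / df)`; it is an assumption for illustration, and scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant by default, so its numbers will differ.

```python
import math

# Hypothetical cleaned descriptions, made up for illustration.
docs = ['cherri fruit wine', 'fruit wine wine', 'oak wine']
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Bag of Words Counts: one count per document and vocabulary word.
counts = [[tokens.count(word) for word in vocab] for tokens in tokenized]

# Document frequency: in how many documents each word appears.
doc_freq = [sum(1 for tokens in tokenized if word in tokens) for word in vocab]
# Textbook inverse document frequency (no smoothing, no normalization).
idf = [math.log(len(docs) / freq) for freq in doc_freq]
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in counts]

print(vocab)
print(counts)
print(tfidf)
```

The word "wine" appears in every document, so its weight drops to zero under TF-IDF even though it has the highest raw counts, while the rarer "cherri" and "oak" get the largest weights.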


In [None]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.model_selection import train_test_split
from catboost import Pool, CatBoostRegressor, cv

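def prepare_dataframe(vect, data, features=True):
    # Vectorize the cleaned descriptions into a dense table of word features.
    vectorized=vect.fit_transform(data['description_cleaned']).toarray()
    vectorized=pd.DataFrame(vectorized)
    if features:
        # Keep the remaining columns as extra features next to the word features.
        X=data.drop(columns=['points','Unnamed: 0','description','description_cleaned'])
        X=X.fillna(-1)
        print(X.columns)
        X=pd.concat([X.reset_index(drop=True),vectorized.reset_index(drop=True)],axis=1)
        # Hard-coded positions of the categorical columns in X;
        # check them against the X.columns printed above.
        categorical_features_indices =[0,1,3,4,5,6,7,8,9,10]
    else:
        # Use only the word features, so there are no categorical columns.
        X=vectorized
        categorical_features_indices =[]
    y=data['points']
    return X,y,categorical_features_indices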

In [None]:
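# Model definition and training.
from sklearn.metrics import mean_squared_error

def perform_model(X_train, y_train,X_valid, y_valid,X_test, y_test,categorical_features_indices,name):
    model = CatBoostRegressor(
        random_seed = 100,
        loss_function = 'RMSE',
        iterations=800,
    )
    
    model.fit(
        X_train, y_train,
        cat_features = categorical_features_indices,
        verbose=False,
        eval_set=(X_valid, y_valid)
    )
    
    # Compute RMSE explicitly: in recent CatBoost versions model.score()
    # returns R^2 for regressors, which would not match the printed label.
    rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(name+" technique RMSE on training data: "+ str(rmse_train))
    print(name+" technique RMSE on test data: "+ str(rmse_test))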
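For reference, the RMSE named in the messages above is the square root of the mean squared difference between true and predicted points. A minimal sketch with made-up numbers (the `y_true` and `y_pred` lists are hypothetical, not model output):

```python
import math

# Hypothetical true and predicted points, made up for illustration.
y_true = [88, 90, 92, 85]
y_pred = [87, 92, 92, 87]

squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse)
```

RMSE is expressed in the same unit as the target, so a lower value means the predicted points are closer to the real ones.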
    

In [None]:
def prepare_variable(vect, data, features_append=True):
    X, y , categorical_features_indices = prepare_dataframe(vect, data,features_append)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, 
                                                        random_state=42)
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, 
                                                        random_state=52)
    return X_train, y_train,X_valid, y_valid,X_test, y_test, categorical_features_indices
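The two consecutive splits above leave 5% of the rows for testing and then take 20% of the remaining 95% for validation. The resulting shares of the whole dataset can be checked with a few lines of arithmetic (a sketch using exact fractions; the variable names are only illustrative):

```python
from fractions import Fraction

test_share = Fraction(5, 100)                # test_size=0.05 of all rows
remaining = 1 - test_share                   # rows left after the first split
valid_share = remaining * Fraction(20, 100)  # test_size=0.2 of the remainder
train_share = remaining - valid_share        # what is left for training

print(float(train_share), float(valid_share), float(test_share))
```

So roughly 76% of the rows are used for training, 19% for validation and 5% for the final test.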

In [None]:
vect= CountVectorizer(analyzer='word', token_pattern=r'\w+',max_features=500)
training_variable=prepare_variable(vect, data)
perform_model(*training_variable, 'Bag of Words Counts')

Now we can try TF-IDF.

In [None]:
vect= TfidfVectorizer(analyzer='word', token_pattern=r'\w+',max_features=500)
training_variable=prepare_variable(vect, data)
perform_model(*training_variable, 'TF-IDF')


So far we have used other meaningful features alongside the description. Let's drop all of them and predict based ONLY on the descriptions.

In [None]:
vect= CountVectorizer(analyzer='word', token_pattern=r'\w+',max_features=500)
training_variable=prepare_variable(vect, data, False)
perform_model(*training_variable, 'Bag of Words Counts')

In [None]:
vect= TfidfVectorizer(analyzer='word', token_pattern=r'\w+',max_features=500)
training_variable=prepare_variable(vect, data, False)
perform_model(*training_variable, 'TF-IDF')

As we can see, the scores of both techniques are similar, and they clearly outperform the approach without any NLP operations (about 2.09 test score).
* Link to EDA + CatBoost without NLP: https://www.kaggle.com/mistrzuniu1/eda-catboost-feature-importance/