## JetBrains Internship

Given a dataset of the Facebook posts with their sentiments , train and evaluate a model on it. You are free to do anything with data and choose any model you want. However, you need to train the model yourself rather than using a pretrained one. Provide a link to the GitHub public repository with your code. The code can be in notebook format, however, check if the notebook is executable from top to bottom.

### Preparing data

In [1]:
!gdown 10CvDP3AFOTYmoXhWXLRDm6n_XSZV6Yev

Downloading...
From: https://drive.google.com/uc?id=10CvDP3AFOTYmoXhWXLRDm6n_XSZV6Yev
To: /content/fb_sentiment.csv
  0% 0.00/123k [00:00<?, ?B/s]100% 123k/123k [00:00<00:00, 43.6MB/s]


This is counterintuitive, but when analyzing sentiments punctuation and regular stop-words do matter, because they express sentiment of user.

Let us compare:

`My notebook is a never buy. I never loved it!`

`notebook buy loved`

So we will include punctuation to our token lists.

In [68]:
import nltk
import string
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

nltk_lemmatizer_en = WordNetLemmatizer()
stop_words = set() # set([c for c in string.punctuation] + ['-', '...']) # + set(stopwords.words('english') +

def is_valid_word(w):
  return True
  # at_least_one_letter = False
  # for c in w:
  #   if c not in string.punctuation:
  #     at_least_one_letter = True
  #     break
  # return at_least_one_letter

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Let us convert texts to lists of lemmatized tokens. Actually, there can be another approach based on stemming, but let us try with lemmas

In [None]:
def nltk_tokenize(text, stopwords=set()):
    if type(text) == str:
      return  [ w for w in word_tokenize(text) if w not in stopwords and is_valid_word(w) ]
    else:
      return  [ nltk_tokenize(t, stopwords) for t in text ]

def nltk_lemmatize_en(tokens, stopwords=set()):
  r = list()
  for x in tokens: 
    if type(x) == list:
      r.append(nltk_lemmatize_en(x, stopwords))
    else:
      v = nltk_lemmatizer_en.lemmatize(x).lower()
      if v not in stopwords:
        r.append(v)
  return r

I was about to test spacy lemmatizer, but it didn't show significant difference to be taken into consideration

In [3]:
!pip install spacy-stanza | tail -n 1

import spacy
import stanza
import spacy_stanza

spacy_nlp_en = spacy.load('en_core_web_sm')

# only array of token-arrays is supported
def spacy_lemmatize(tokens, stopwords=set(), lemmatizer=spacy_nlp_en, include_token=False):
  div = 'br'
  text = [' '.join(text) for text in tokens]
  concat = (' ' + div + ' ').join(text)
  r = list()
  t = list()
  for m in lemmatizer(concat):
    v = m.lemma_.lower()
    if v == div:
      r.append(t)
      t = list()
      continue
    if v not in stopwords:
      if include_token:
        t.append(m)
      else:
        t.append(v)
  if len(t) != 0:
    r.append(t)
  return r

Successfully installed emoji-2.2.0 spacy-stanza-1.0.3 stanza-1.5.0


In [4]:
import numpy as np
import pandas as pd

Reading data:

In [70]:
df = pd.read_csv('fb_sentiment.csv', index_col=0)
df

Unnamed: 0,FBPost,Label
0,Drug Runners and a U.S. Senator have somethin...,O
1,"Heres a single, to add, to Kindle. Just read t...",O
2,If you tire of Non-Fiction.. Check out http://...,O
3,Ghost of Round Island is supposedly nonfiction.,O
4,Why is Barnes and Nobles version of the Kindle...,N
...,...,...
995,I liked it. Its youth oriented and I think th...,P
996,"I think the point of the commercial is that, e...",P
997,Kindle 3 is such a great product. I could not ...,P
998,develop a way to share books! that is a big d...,N


In [6]:
import re

It's time we remove links and tokenize sentences

In [71]:
df['FBPost'] = df['FBPost'].apply(lambda x: re.sub(r'http\S+', '', x) )
df['tokens'] = df['FBPost'].apply(lambda x: nltk_lemmatize_en(nltk_tokenize(x), stop_words) )

In [72]:
df

Unnamed: 0,FBPost,Label,tokens
0,Drug Runners and a U.S. Senator have somethin...,O,"[drug, runners, and, a, u.s, ., senator, have,..."
1,"Heres a single, to add, to Kindle. Just read t...",O,"[heres, a, single, ,, to, add, ,, to, kindle, ..."
2,If you tire of Non-Fiction.. Check out,O,"[if, you, tire, of, non-fiction, .., check, out]"
3,Ghost of Round Island is supposedly nonfiction.,O,"[ghost, of, round, island, is, supposedly, non..."
4,Why is Barnes and Nobles version of the Kindle...,N,"[why, is, barnes, and, nobles, version, of, th..."
...,...,...,...
995,I liked it. Its youth oriented and I think th...,P,"[i, liked, it, ., its, youth, oriented, and, i..."
996,"I think the point of the commercial is that, e...",P,"[i, think, the, point, of, the, commercial, is..."
997,Kindle 3 is such a great product. I could not ...,P,"[kindle, 3, is, such, a, great, product, ., i,..."
998,develop a way to share books! that is a big d...,N,"[develop, a, way, to, share, book, !, that, is..."


In [73]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import preprocessing

In [74]:
le = preprocessing.LabelEncoder()
df['Label'] = le.fit_transform(df['Label'])
df

Unnamed: 0,FBPost,Label,tokens
0,Drug Runners and a U.S. Senator have somethin...,1,"[drug, runners, and, a, u.s, ., senator, have,..."
1,"Heres a single, to add, to Kindle. Just read t...",1,"[heres, a, single, ,, to, add, ,, to, kindle, ..."
2,If you tire of Non-Fiction.. Check out,1,"[if, you, tire, of, non-fiction, .., check, out]"
3,Ghost of Round Island is supposedly nonfiction.,1,"[ghost, of, round, island, is, supposedly, non..."
4,Why is Barnes and Nobles version of the Kindle...,0,"[why, is, barnes, and, nobles, version, of, th..."
...,...,...,...
995,I liked it. Its youth oriented and I think th...,2,"[i, liked, it, ., its, youth, oriented, and, i..."
996,"I think the point of the commercial is that, e...",2,"[i, think, the, point, of, the, commercial, is..."
997,Kindle 3 is such a great product. I could not ...,2,"[kindle, 3, is, such, a, great, product, ., i,..."
998,develop a way to share books! that is a big d...,0,"[develop, a, way, to, share, book, !, that, is..."


We can see some kind of controversial mapping:

In [76]:
df.loc[950]['FBPost']

'I love my Kindle, though the publishing industry seems dead set on killing it off with their idiotic pricing schemes.'

In [11]:
df[df['Label'] == 0]

Unnamed: 0,FBPost,Label,tokens,dislike_words,like_words
4,Why is Barnes and Nobles version of the Kindle...,0,"[why, is, barnes, and, nobles, version, of, th...",0.0,0.0
8,Meh. I think Singles are a bad idea. Big name ...,0,"[meh, ., i, think, singles, are, a, bad, idea,...",2.0,0.0
10,I am not sure if i just got my update but now ...,0,"[i, am, not, sure, if, i, just, got, my, updat...",1.0,0.0
14,Not a fan of Kindle Singles. They clog up the...,0,"[not, a, fan, of, kindle, singles, ., they, cl...",1.0,0.0
23,Its just too bad you arent offering these for ...,0,"[its, just, too, bad, you, arent, offering, th...",1.0,1.0
...,...,...,...,...,...
950,"I love my Kindle, though the publishing indust...",0,"[i, love, my, kindle, ,, though, the, publishi...",0.0,1.0
976,"mmm No esto no es un iPad, es un libro; No tam...",0,"[mmm, no, esto, no, e, un, ipad, ,, e, un, lib...",0.0,0.0
981,Throw it my purse and go. Have my whole librar...,0,"[throw, it, my, purse, and, go, ., have, my, w...",0.0,0.0
992,I was reading with it for around 8 hours yeste...,0,"[i, wa, reading, with, it, for, around, 8, hou...",0.0,0.0


In [12]:
df['Label'] = pd.to_numeric(df['Label'])

Let's try extracting tfidf measure features along with count measures.

My assumption is the following: we can combine two measures to assure rare positive and rare negative words are taken into consideration by model. Same time, if the word is frequent, it does not mean we should avoid it, so we will have two sets of features.

In [100]:
vec = TfidfVectorizer(lowercase=False, tokenizer=lambda x : x, ngram_range=(1, 2), max_features=10000)
vectors = vec.fit_transform(df['tokens'].tolist())

cvec = CountVectorizer(lowercase=False, tokenizer=lambda x : x, max_features=10000)
cvectors = vec.fit_transform(df['tokens'].tolist())



We need to somehow evaluate our result, so we will consider test dataset to check the quality of model

In [64]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import svm
from sklearn.linear_model import SGDClassifier

X_train, X_test, y_train, y_test = train_test_split( np.concatenate((np.asarray(vectors.todense()), np.asarray(cvectors.todense()) ), axis=1) , np.array(df['Label'].to_list()), test_size=0.2, random_state=0)
y_train.shape, y_test.shape

((800,), (200,))

We can also see that data has different number of rows for each class.
We could have used upsampling, but for the sake of simplicity let us compute weights for classes

In [91]:
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))
class_weights

{0: 4.301075268817204, 1: 1.2066365007541477, 2: 0.5157962604771116}

In [92]:
svc = svm.SVC(class_weight=class_weights)
svc.fit(X_train, y_train)

In [93]:
from sklearn.metrics import classification_report
print(classification_report(y_test, svc.predict(X_test), target_names=le.classes_))

              precision    recall  f1-score   support

           N       0.00      0.00      0.00        17
           O       0.62      0.76      0.69        59
           P       0.82      0.85      0.83       124

    accuracy                           0.75       200
   macro avg       0.48      0.54      0.51       200
weighted avg       0.69      0.75      0.72       200



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [89]:
sgd = SGDClassifier(class_weight='balanced')
sgd.fit(X_train, y_train)

In [90]:
from sklearn.metrics import classification_report
print(classification_report(y_test, sgd.predict(X_test), target_names=le.classes_))

              precision    recall  f1-score   support

           N       0.00      0.00      0.00        17
           O       0.59      0.75      0.66        59
           P       0.82      0.83      0.83       124

    accuracy                           0.73       200
   macro avg       0.47      0.53      0.50       200
weighted avg       0.69      0.73      0.71       200



It is quite unfortunate that two models did not capture N class :(

In [28]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [29]:
!pip install catboost
import catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp39-none-manylinux1_x86_64.whl (76.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.6/76.6 MB[0m [31m13.8 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: catboost
Successfully installed catboost-1.1.1


In [67]:
boosting_model = catboost.CatBoostClassifier(
    loss_function='MultiClassOneVsAll',
    class_weights=class_weights,
    learning_rate=0.02,
    depth=6,
    iterations=2000,
    task_type="GPU",
    devices='0:1')
      
boosting_model.fit(X_train, y_train)

0:	learn: 0.6871607	total: 48ms	remaining: 1m 35s
1:	learn: 0.6821111	total: 94.3ms	remaining: 1m 34s
2:	learn: 0.6768874	total: 138ms	remaining: 1m 31s
3:	learn: 0.6710545	total: 169ms	remaining: 1m 24s
4:	learn: 0.6657132	total: 218ms	remaining: 1m 26s
5:	learn: 0.6606006	total: 257ms	remaining: 1m 25s
6:	learn: 0.6563525	total: 291ms	remaining: 1m 22s
7:	learn: 0.6518933	total: 320ms	remaining: 1m 19s
8:	learn: 0.6473756	total: 342ms	remaining: 1m 15s
9:	learn: 0.6429842	total: 365ms	remaining: 1m 12s
10:	learn: 0.6391636	total: 394ms	remaining: 1m 11s
11:	learn: 0.6348048	total: 418ms	remaining: 1m 9s
12:	learn: 0.6306464	total: 442ms	remaining: 1m 7s
13:	learn: 0.6264587	total: 464ms	remaining: 1m 5s
14:	learn: 0.6221384	total: 486ms	remaining: 1m 4s
15:	learn: 0.6181945	total: 501ms	remaining: 1m 2s
16:	learn: 0.6148191	total: 524ms	remaining: 1m 1s
17:	learn: 0.6114140	total: 547ms	remaining: 1m
18:	learn: 0.6076336	total: 566ms	remaining: 59s
19:	learn: 0.6043362	total: 588ms	r

<catboost.core.CatBoostClassifier at 0x7f974570ab80>

We can see that catboot model result is better (i performed grid search to analyze parameters)

After more steps of learning, model becomes overfitted, so it is best result as for now

In [94]:
from sklearn.metrics import classification_report
print(classification_report(y_test, boosting_model.predict(X_test), target_names=le.classes_))

              precision    recall  f1-score   support

           N       0.55      0.35      0.43        17
           O       0.66      0.80      0.72        59
           P       0.88      0.84      0.86       124

    accuracy                           0.79       200
   macro avg       0.70      0.66      0.67       200
weighted avg       0.79      0.79      0.78       200



## One more experiment

Let us try to detect only negative comments to combine result with previous model

In [96]:
y = np.array(df['Label'].to_list())
y == 0

array([False, False, False, False,  True, False, False, False,  True,
       False,  True, False, False, False,  True, False, False, False,
       False, False, False, False, False,  True, False, False, False,
        True, False,  True, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

In [104]:
y = np.array(df['Label'].replace(0, 3).replace(1, 0).replace(2, 0).replace(3, 1).to_list())

X_train, X_test, y_train, y_test = train_test_split( np.concatenate((np.asarray(vectors.todense()), np.asarray(cvectors.todense())), axis=1) , y, test_size=0.2, random_state=0)
y_train.shape, y_test.shape

((800,), (200,))

In [108]:
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=[0, 1], y=y)
class_weights = dict(zip(classes, weights))
class_weights

{0: 0.5428881650380022, 1: 6.329113924050633}

In [109]:
boosting_model = catboost.CatBoostClassifier(
    loss_function='MultiClassOneVsAll',
    class_weights=class_weights,
    learning_rate=0.02,
    depth=6,
    iterations=2000,
    task_type="GPU",
    devices='0:1')
      
boosting_model.fit(X_train, y_train)

0:	learn: 0.6885288	total: 55ms	remaining: 1m 49s
1:	learn: 0.6836768	total: 90.8ms	remaining: 1m 30s
2:	learn: 0.6778247	total: 129ms	remaining: 1m 26s
3:	learn: 0.6718314	total: 164ms	remaining: 1m 22s
4:	learn: 0.6665363	total: 211ms	remaining: 1m 24s
5:	learn: 0.6617837	total: 235ms	remaining: 1m 18s
6:	learn: 0.6586738	total: 260ms	remaining: 1m 13s
7:	learn: 0.6541870	total: 276ms	remaining: 1m 8s
8:	learn: 0.6508951	total: 296ms	remaining: 1m 5s
9:	learn: 0.6457272	total: 314ms	remaining: 1m 2s
10:	learn: 0.6414669	total: 340ms	remaining: 1m 1s
11:	learn: 0.6363608	total: 357ms	remaining: 59.1s
12:	learn: 0.6311111	total: 378ms	remaining: 57.8s
13:	learn: 0.6262848	total: 397ms	remaining: 56.3s
14:	learn: 0.6211017	total: 418ms	remaining: 55.3s
15:	learn: 0.6175634	total: 432ms	remaining: 53.5s
16:	learn: 0.6131482	total: 450ms	remaining: 52.5s
17:	learn: 0.6086782	total: 470ms	remaining: 51.7s
18:	learn: 0.6042653	total: 486ms	remaining: 50.7s
19:	learn: 0.6006468	total: 503ms	

<catboost.core.CatBoostClassifier at 0x7f9704c62eb0>

In [110]:
from sklearn.metrics import classification_report
print(classification_report(y_test, boosting_model.predict(X_test), target_names=['OTHER', 'N']))

              precision    recall  f1-score   support

       OTHER       0.92      0.98      0.95       183
           N       0.40      0.12      0.18        17

    accuracy                           0.91       200
   macro avg       0.66      0.55      0.57       200
weighted avg       0.88      0.91      0.89       200



We can see, that negative data in this small dataset requires extraction of more features, so it seems we can not learn model to detect these. :(