# Homework 5 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments from the [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

Then answer the ***[questions](https://forms.gle/ZTRG1o3NYGyjD9CUA)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

from tqdm.notebook import tqdm


In [2]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train = pd.read_csv('./data/train_toxic.csv').fillna(' ')


Two standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

Both are implemented in `sklearn` as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

More details about them can be found [here](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).

In [3]:
train_text = train['comment_text']


In [4]:
# Try different vectorizers, n-gram sizes, and cutoffs for rare and overly frequent words
word_vectorizer = CountVectorizer()
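The parameters mentioned in the comment above can be explored on a small example first. The sketch below uses a hypothetical four-document corpus and a hypothetical helper `vocab` to show how `ngram_range`, `min_df`, and `max_df` change the vocabulary. In scikit-learn, an integer `min_df` or `max_df` is an absolute document count, while a float is a proportion of documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the cat flew",
]


def vocab(**params):
    """Fit a TfidfVectorizer with the given parameters and return its sorted vocabulary."""
    return sorted(TfidfVectorizer(**params).fit(docs).vocabulary_)


unigrams = vocab()                        # single words only
with_bigrams = vocab(ngram_range=(1, 2))  # single words plus word pairs
no_rare = vocab(min_df=2)                 # drop terms found in fewer than 2 documents
no_frequent = vocab(max_df=0.75)          # drop terms found in more than 75% of documents

print(len(unigrams), len(with_bigrams))  # 6 13
print(no_rare)      # ['cat', 'flew', 'sat', 'the']
print(no_frequent)  # ['bird', 'cat', 'dog', 'flew', 'sat']
```

Adding bigrams roughly doubles the vocabulary even on this tiny corpus, which is why the cutoffs matter on real data.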


In [5]:
word_vectorizer.fit(train_text)
train_word_features = word_vectorizer.transform(train_text)


For classification we will use logistic regression: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [14]:
classifier = LogisticRegression(max_iter=2000, random_state=24)  # Try different parameters and find the best ones with cross-validation


We will train one classifier per class.

To validate the quality of the model, we will use the [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function.
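One possible refinement, which this notebook does not use, is to wrap the vectorizer and the classifier in a single scikit-learn pipeline. The vocabulary and IDF statistics are then re-fitted on the training part of every fold rather than on the full dataset. The sketch below assumes hypothetical toy data (`texts`, `labels`) in place of the real comments.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 20 "clean" and 20 "toxic" comments
texts = [f"thanks for the helpful edit number {i}" for i in range(20)]
texts += [f"you are a stupid idiot number {i}" for i in range(20)]
labels = np.array([0] * 20 + [1] * 20)

# The vectorizer is fitted inside each fold, on that fold's training split only
pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=2000, random_state=24),
)

fold_scores = cross_val_score(pipeline, texts, labels, scoring='roc_auc', cv=5)
print(fold_scores, fold_scores.mean())
```

The trade-off is runtime: the vectorizer is fitted once per fold instead of once overall, which can be noticeable with large n-gram vocabularies.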

In [15]:
scores = []

for class_name in class_names:
    train_target = train[class_name]

    cv_score = np.mean(cross_val_score(classifier, train_word_features, train_target, scoring='roc_auc'))

    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)

print('Total score is {}'.format(np.mean(scores)))


CV score for class toxic is 0.9541510934614141
CV score for class severe_toxic is 0.9447014589868006
CV score for class obscene is 0.963760295170274
CV score for class threat is 0.9521332890518199
CV score for class insult is 0.945851989914601
CV score for class identity_hate is 0.9105017099370482
Total score is 0.9451833060869929
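The scores above are ROC AUC values. One way to read them: ROC AUC equals the fraction of (positive, negative) example pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below checks this on four hypothetical predictions and compares the result with `sklearn.metrics.roc_auc_score`.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores, used only for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

positives = [s for s, y in zip(y_score, y_true) if y == 1]
negatives = [s for s, y in zip(y_score, y_true) if y == 0]

# Fraction of (positive, negative) pairs ranked correctly; a tie counts as half
pairs = list(product(positives, negatives))
manual_auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

sklearn_auc = roc_auc_score(y_true, y_score)
print(manual_auc, sklearn_auc)  # 0.75 0.75
```

Three of the four pairs are ordered correctly (only 0.35 versus 0.4 is not), so the value is 0.75.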


Try to find the best parameters for `word_vectorizer` and `classifier`, optimizing the [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric.
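The cells in this notebook try parameter combinations by hand. As an alternative sketch, `GridSearchCV` can search vectorizer and classifier parameters together while scoring by ROC AUC. The toy data, the step names `tfidf` and `clf`, and the grid values below are assumptions for illustration, not recommended settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 20 "clean" and 20 "toxic" comments
texts = [f"thanks for the helpful edit number {i}" for i in range(20)]
texts += [f"you are a stupid idiot number {i}" for i in range(20)]
labels = np.array([0] * 20 + [1] * 20)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=2000, random_state=24)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1, 10],
}

search = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=5)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

On the real data this would be run once per class, as in the loops in this notebook, and the full grid can take a long time with bigram features.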


In [16]:
word_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))
word_vectorizer.fit(train_text)
train_word_features = word_vectorizer.transform(train_text)

classifier = LogisticRegression(max_iter=2000, C=.01, random_state=24)
scores = []

for class_name in class_names:
    train_target = train[class_name]

    cv_score = np.mean(cross_val_score(classifier, train_word_features, train_target, scoring='roc_auc'))

    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)

print('Total score is {}'.format(np.mean(scores)))


CV score for class toxic is 0.8995902001669682
CV score for class severe_toxic is 0.977166522101111
CV score for class obscene is 0.9341439663219895
CV score for class threat is 0.9700430195212274
CV score for class insult is 0.9328926567240445
CV score for class identity_hate is 0.93079303242254
Total score is 0.9407715662096466


In [6]:
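def train_model(word_vectorizer: CountVectorizer, classifier: LogisticRegression) -> str:
    """Vectorize the comments, cross-validate one classifier per class, and return the mean ROC AUC as a message."""
    print(word_vectorizer, classifier)
    # The vectorizer is fitted on the full training text before cross-validation
    word_vectorizer.fit(train_text)
    train_word_features = word_vectorizer.transform(train_text)
    scores = []

    for class_name in tqdm(class_names):
        train_target = train[class_name]

        # Mean ROC AUC over the default cross-validation folds for this class
        cv_score = np.mean(cross_val_score(classifier, train_word_features, train_target, scoring='roc_auc'))

        print(f'CV score for class {class_name} is {cv_score}')
        scores.append(cv_score)
    return f'Total score is {np.mean(scores)}'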


In [17]:
train_model(CountVectorizer(),
            LogisticRegression(C=0.1, max_iter=2000, random_state=24)
    )


CountVectorizer() LogisticRegression(C=0.1, max_iter=2000, random_state=24)


CV score for class toxic is 0.9572243000600744
CV score for class severe_toxic is 0.9573930895305208
CV score for class obscene is 0.9691104084617749
CV score for class threat is 0.9559393027244145
CV score for class insult is 0.9541174165211588
CV score for class identity_hate is 0.9213498575486183


'Total score is 0.9525223958077603'

In [16]:
train_model(
    word_vectorizer=TfidfVectorizer(analyzer='word', ngram_range=(1, 2)),
    classifier=LogisticRegression(max_iter=2000, C=.1, random_state=24)
  )


TfidfVectorizer(ngram_range=(1, 2)) LogisticRegression(C=0.1, max_iter=2000, random_state=24)


CV score for class toxic is 0.9338253209733729
CV score for class severe_toxic is 0.9796726349418148
CV score for class obscene is 0.9550472389243468
CV score for class threat is 0.9715994372608476
CV score for class insult is 0.9515392902046219
CV score for class identity_hate is 0.9419759098364608


'Total score is 0.9556099720235774'

In [7]:
train_model(
    word_vectorizer=CountVectorizer(token_pattern=r"\b\w[\w’]+\b",
                                    ngram_range=(1, 2),
                                    stop_words='english'
                                    ),
    classifier = LogisticRegression(max_iter=2000, C=.1, random_state=24)
)


CountVectorizer(ngram_range=(1, 2), stop_words='english',
                token_pattern='\\b\\w[\\w’]+\\b') LogisticRegression(C=0.1, max_iter=2000, random_state=24)


CV score for class toxic is 0.9552285088975317
CV score for class severe_toxic is 0.9648107488939741
CV score for class obscene is 0.9728367793102647
CV score for class threat is 0.9680165430354826
CV score for class insult is 0.9624746177165179
CV score for class identity_hate is 0.9526807216285398


'Total score is 0.9626746532470519'

In [8]:
train_model(
    word_vectorizer=TfidfVectorizer(token_pattern=r"\b\w[\w’]+\b",
                                    ngram_range=(1, 2),
                                    stop_words='english'
                                    ),
    classifier = LogisticRegression(max_iter=2000, C=.1, random_state=24)
)


TfidfVectorizer(ngram_range=(1, 2), stop_words='english',
                token_pattern='\\b\\w[\\w’]+\\b') LogisticRegression(C=0.1, max_iter=2000, random_state=24)


CV score for class toxic is 0.9503828033745494
CV score for class severe_toxic is 0.9835104305347008
CV score for class obscene is 0.9776572987178176
CV score for class threat is 0.975818530637375
CV score for class insult is 0.9665207417784097
CV score for class identity_hate is 0.9658771182650853


'Total score is 0.9699611538846563'

In [9]:
train_model(
    word_vectorizer=TfidfVectorizer(token_pattern=r"\b\w[\w’]+\b",
                                    ngram_range=(1, 2),
                                    stop_words='english'
                                    ),
    classifier = LogisticRegression(max_iter=3000, solver='saga',
                                    C=.01, random_state=24)
)


TfidfVectorizer(ngram_range=(1, 2), stop_words='english',
                token_pattern='\\b\\w[\\w’]+\\b') LogisticRegression(C=0.01, max_iter=3000, random_state=24, solver='saga')


CV score for class toxic is 0.9359152529786247
CV score for class severe_toxic is 0.9826298175273183
CV score for class obscene is 0.9720917106416505
CV score for class threat is 0.9754143920168918
CV score for class insult is 0.9600475674061004
CV score for class identity_hate is 0.9641642245450562


'Total score is 0.9650438275192736'

In [10]:
train_model(
    word_vectorizer=TfidfVectorizer(analyzer='word', ngram_range=(1, 2)),
    classifier=LogisticRegression(max_iter=2000, solver='saga', C=10, random_state=24)
  )

TfidfVectorizer(ngram_range=(1, 2)) LogisticRegression(C=10, max_iter=2000, random_state=24, solver='saga')


CV score for class toxic is 0.973045978157655
CV score for class severe_toxic is 0.9838341365952958
CV score for class obscene is 0.9843075187068422
CV score for class threat is 0.9879034050896147
CV score for class insult is 0.9774695332465768
CV score for class identity_hate is 0.9740853474323398


'Total score is 0.9801076532047207'

In [12]:
train_model(
    word_vectorizer=TfidfVectorizer(token_pattern=r"\b\w[\w’]+\b", analyzer='word', ngram_range=(1, 2)),
    classifier=LogisticRegression(max_iter=2000, solver='saga', C=12, random_state=24)
  )


TfidfVectorizer(ngram_range=(1, 2), token_pattern='\\b\\w[\\w’]+\\b') LogisticRegression(C=12, max_iter=2000, random_state=24, solver='saga')


CV score for class toxic is 0.9731426182394287
CV score for class severe_toxic is 0.9837337880911387
CV score for class obscene is 0.9843222968171077
CV score for class threat is 0.9880134663185691
CV score for class insult is 0.9774563903822159
CV score for class identity_hate is 0.9740938301879085


'Total score is 0.9801270650060614'

In [13]:
train_model(
    word_vectorizer=TfidfVectorizer(analyzer='word', ngram_range=(1, 2)),
    classifier=LogisticRegression(max_iter=2000, solver='sag', C=12, random_state=24)
  )


TfidfVectorizer(ngram_range=(1, 2)) LogisticRegression(C=12, max_iter=2000, random_state=24, solver='sag')


CV score for class toxic is 0.9731624897822833
CV score for class severe_toxic is 0.9838233421207049
CV score for class obscene is 0.98435456492081
CV score for class threat is 0.9880634957054987
CV score for class insult is 0.9774814701569333
CV score for class identity_hate is 0.9742272776433645


'Total score is 0.9801854400549325'

In [14]:
train_model(
    word_vectorizer=TfidfVectorizer(analyzer='word', ngram_range=(1, 2)),
    classifier=LogisticRegression(max_iter=2000, solver='saga', C=12, random_state=24)
  )


TfidfVectorizer(ngram_range=(1, 2)) LogisticRegression(C=12, max_iter=2000, random_state=24, solver='saga')


CV score for class toxic is 0.9731462087801404
CV score for class severe_toxic is 0.9837406936629183
CV score for class obscene is 0.9843275757961045
CV score for class threat is 0.988005636701085
CV score for class insult is 0.9774551731084264
CV score for class identity_hate is 0.9740979701412529


'Total score is 0.9801288763649879'

In [15]:
train_model(
    word_vectorizer=TfidfVectorizer(analyzer='word'),
    classifier=LogisticRegression(max_iter=2000, solver='saga', C=12, random_state=24)
  )


TfidfVectorizer() LogisticRegression(C=12, max_iter=2000, random_state=24, solver='saga')


CV score for class toxic is 0.9712253974919598
CV score for class severe_toxic is 0.9795915550565548
CV score for class obscene is 0.9825324089461583
CV score for class threat is 0.9857129053738305
CV score for class insult is 0.9733374873529439
CV score for class identity_hate is 0.9708916904046461


'Total score is 0.9772152407710156'

In [31]:
word_vectorizer = CountVectorizer()
train_word_features = word_vectorizer.fit_transform(train_text)
print('Number of words in vocabulary:', len(word_vectorizer.vocabulary_))


Number of words in vocabulary: 189775


In [34]:
word_list = word_vectorizer.get_feature_names_out()
count_list = np.asarray(train_word_features.sum(axis=0))[0]
word_frequency = dict(zip(word_list, count_list))
sorted_word_frequency = dict(sorted(word_frequency.items(), key=lambda item: item[1], reverse=True))
pd.DataFrame.from_dict(data=sorted_word_frequency, orient='index')


Unnamed: 0,0
the,496796
to,297408
of,224547
and,224092
you,218308
...,...
ｗｗｗ,1
ｳｨｷﾍﾟﾃﾞｨｱ,1
𐌰𐌹,1
𐌰𐌿,1


In [18]:
word_vectorizer = TfidfVectorizer()
train_word_features = word_vectorizer.fit_transform(train_text)
print('Number of words in vocabulary:', len(word_vectorizer.vocabulary_))
word_list = word_vectorizer.get_feature_names_out()
count_list = np.asarray(train_word_features.sum(axis=0))[0]
word_frequency = dict(zip(word_list, count_list))
sorted_word_frequency = dict(sorted(word_frequency.items(), key=lambda item: item[1], reverse=True))
pd.DataFrame.from_dict(data=sorted_word_frequency, orient='index')


Number of words in vocabulary: 189775


Unnamed: 0,0
the,11436.679900
to,8163.883747
you,8016.808285
and,6127.634686
of,6065.086946
...,...
mahy,0.003014
dienew,0.002521
aidsai,0.001984
milleseconds,0.001980


In [19]:
word_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))
train_word_features = word_vectorizer.fit_transform(train_text)
print('Number of words in vocabulary:', len(word_vectorizer.vocabulary_))
word_list = word_vectorizer.get_feature_names_out()
count_list = np.asarray(train_word_features.sum(axis=0))[0]
word_frequency = dict(zip(word_list, count_list))
sorted_word_frequency = dict(sorted(word_frequency.items(), key=lambda item: item[1], reverse=True))
pd.DataFrame.from_dict(data=sorted_word_frequency, orient='index')



Number of words in vocabulary: 2467140


Unnamed: 0,0
the,6162.360493
to,4336.562078
you,4200.043711
and,3280.963306
of,3279.703871
...,...
bitch fat,0.001024
milleseconds,0.001024
milleseconds wasnt,0.001024
really milleseconds,0.001024


In [20]:
word_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), max_features=int(2e6))
train_word_features = word_vectorizer.fit_transform(train_text)
print('Number of words in vocabulary:', len(word_vectorizer.vocabulary_))
word_list = word_vectorizer.get_feature_names_out()
count_list = np.asarray(train_word_features.sum(axis=0))[0]
word_frequency = dict(zip(word_list, count_list))
sorted_word_frequency = dict(sorted(word_frequency.items(), key=lambda item: item[1], reverse=True))
pd.DataFrame.from_dict(data=sorted_word_frequency, orient='index')



Number of words in vocabulary: 2000000


Unnamed: 0,0
the,6322.174035
to,4440.560500
you,4297.666879
and,3367.302844
of,3366.753964
...,...
sex fucksex,0.001134
balls ba,0.001045
bitch fat,0.001024
really milleseconds,0.001024


In [21]:
word_vectorizer = TfidfVectorizer(analyzer='word', min_df=0.005)
train_word_features = word_vectorizer.fit_transform(train_text)
print('Number of words in vocabulary:', len(word_vectorizer.vocabulary_))
word_list = word_vectorizer.get_feature_names_out()
count_list = np.asarray(train_word_features.sum(axis=0))[0]
word_frequency = dict(zip(word_list, count_list))
sorted_word_frequency = dict(sorted(word_frequency.items(), key=lambda item: item[1], reverse=True))
pd.DataFrame.from_dict(data=sorted_word_frequency, orient='index')



Number of words in vocabulary: 1095


Unnamed: 0,0
the,18305.733284
to,12380.277080
you,11969.134340
of,9964.086509
and,9838.177520
...,...
tutorial,119.440601
requesting,115.755209
speedily,99.931871
meets,99.248455


In [22]:
word_vectorizer = TfidfVectorizer(analyzer='word', max_df=0.005)
train_word_features = word_vectorizer.fit_transform(train_text)
print('Number of words in vocabulary:', len(word_vectorizer.vocabulary_))
word_list = word_vectorizer.get_feature_names_out()
count_list = np.asarray(train_word_features.sum(axis=0))[0]
word_frequency = dict(zip(word_list, count_list))
sorted_word_frequency = dict(sorted(word_frequency.items(), key=lambda item: item[1], reverse=True))
pd.DataFrame.from_dict(data=sorted_word_frequency, orient='index')



Number of words in vocabulary: 188680


Unnamed: 0,0
experimenting,380.747349
bitch,269.659527
barnstar,259.691668
suck,253.528164
helpme,239.974536
...,...
starteddom,0.003173
dienew,0.002521
aidsai,0.001984
milleseconds,0.001980


In [23]:
word_vectorizer = TfidfVectorizer(analyzer='word', min_df=10)
train_word_features = word_vectorizer.fit_transform(train_text)
print('Number of words in vocabulary:', len(word_vectorizer.vocabulary_))
word_list = word_vectorizer.get_feature_names_out()
count_list = np.asarray(train_word_features.sum(axis=0))[0]
word_frequency = dict(zip(word_list, count_list))
sorted_word_frequency = dict(sorted(word_frequency.items(), key=lambda item: item[1], reverse=True))
pd.DataFrame.from_dict(data=sorted_word_frequency, orient='index')



Number of words in vocabulary: 22841


Unnamed: 0,0
the,12468.404471
to,8811.852603
you,8640.385307
and,6674.769063
of,6627.541639
...,...
414,0.658124
hackneyed,0.643214
book1a1contents,0.627764
secluded,0.568825


In [27]:
word_vectorizer = TfidfVectorizer(analyzer='word', max_df=0.01, min_df=5)
train_word_features = word_vectorizer.fit_transform(train_text)
print('Number of words in vocabulary:', len(word_vectorizer.vocabulary_))
word_list = word_vectorizer.get_feature_names_out()
count_list = np.asarray(train_word_features.sum(axis=0))[0]
word_frequency = dict(zip(word_list, count_list))
sorted_word_frequency = dict(sorted(word_frequency.items(), key=lambda item: item[1], reverse=True))
pd.DataFrame.from_dict(data=sorted_word_frequency, orient='index')

Number of words in vocabulary: 35649


Unnamed: 0,0
jpg,439.231044
test,401.383177
shit,398.785443
file,391.979701
category,378.525917
...,...
pledge1,0.182112
socialistmedia,0.182112
socialistwar,0.182112
swastikamedia,0.182112
