## A simple news classifier

For this task I decided to use CountVectorizer and Multinomial Naive Bayes.

More information is available in the [scikit-learn text analytics tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
and in [this article by Kelly Epley](https://towardsdatascience.com/naive-bayes-document-classification-in-python-e33ff50f937e).

---------------------------------------------------------------------------

### Imports

In [1]:
import pickle
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

Polish-language customization (stemming) is still in progress.

In [2]:
# Let's skip lemmatization for now
# Replace nltk stemmer with Polish stemmer version from pystempel repository (https://github.com/dzieciou/pystempel)
# from stempel import StempelStemmer

### Import the data and select equal-sized sets of differently labeled news

In [3]:
df = pd.read_csv(r"train_data/corpus.csv",encoding='utf-8')
df.head()

Unnamed: 0,text,label,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ME: ekspresowa wygrana Polek z debiutantkami,__label__1,,,
1,Historyczny triumf Huberta Hurkacza w Winston-...,__label__1,,,
2,Wojciech Fibak: Życzę Hubertowi by pobił moje ...,__label__1,,,
3,Hubert Hurkacz - Benoit Paire (relacja na żywo),__label__1,,,
4,Marcin Lewandowski pobił rekord Polski na 1500...,__label__1,,,
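
The heading mentions selecting equal sets per label, but the cell above only loads the file. As a minimal sketch of one way to balance the classes (an assumption, not necessarily how the corpus was prepared), each label can be downsampled to the size of the smallest one. The `toy_df` frame below is hypothetical data that only mimics the `text` and `label` columns:

```python
import pandas as pd

# Hypothetical data with the same column names as corpus.csv
toy_df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "label": ["__label__1"] * 4 + ["__label__2"] * 2,
})

# Size of the least frequent label
smallest = toy_df["label"].value_counts().min()

# Draw the same number of rows from every label
balanced = toy_df.groupby("label").sample(n=smallest, random_state=1)
print(balanced["label"].value_counts())
```

In the notebook, the same two lines could be applied to `df` before splitting.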


### Split the data into training and testing sets
Below are examples of training headlines and their labels.

In [4]:
X_train, X_test, Y_train, Y_test = train_test_split(df['text'], df['label'], random_state=1)
X_train.head()

4229    Do wglądu nawet prywatna korespondencja. Johns...
159     Ott Tanak wygrywa Rajd Finlandii. Polacy nie s...
3530    John McCain – nasza część Europy zawdzięcza mu...
6104    Władze Gdańska straszą dziennikarza bo wykrył ...
4951    Zaskakujący wyrok SN. Wydawca "Wyborczej" i Cz...
Name: text, dtype: object

In [5]:
Y_train.head()

4229    __label__2
159     __label__1
3530    __label__2
6104    __label__2
4951    __label__2
Name: label, dtype: object
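
By default `train_test_split` holds out 25% of the rows for testing. If the label proportions should be preserved in both parts, the `stratify` argument can be used. The following is a small sketch on hypothetical toy data, not the split used above:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: four headlines per label
toy_texts = [f"headline {i}" for i in range(8)]
toy_labels = ["__label__1"] * 4 + ["__label__2"] * 4

# stratify keeps the label ratio the same in the train and test parts
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=1
)
print(len(toy_X_train), len(toy_X_test))
print(sorted(toy_y_test))
```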

### Create a CountVectorizer instance and use it to transform our pandas Series


In [6]:
# The vocabulary is learned in fit_transform, so no data is passed to the constructor
cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)
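
To make the transformation more concrete, here is a minimal sketch on three made-up headlines (hypothetical data, not taken from the corpus). Each row of the resulting matrix is one headline and each column counts one vocabulary word. Words never seen during `fit_transform` are ignored by `transform`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up headlines used only for illustration
toy_headlines = [
    "Polska wygrywa mecz",
    "Minister otwiera nowe drogi",
    "Polska buduje nowe drogi",
]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_headlines)  # learns the vocabulary and counts words

vocabulary = sorted(toy_cv.vocabulary_)  # learned words in alphabetical order
print(vocabulary)
print(toy_counts.toarray())

# A headline built only from unseen words maps to an all-zero row
unseen = toy_cv.transform(["Nieznane słowo"])
print(unseen.toarray())
```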

### Create Naive Bayes model instance and train it

In [7]:
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_cv, Y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [8]:
predictions = naive_bayes.predict(X_test_cv)

### Let's save the model

In [9]:
# Context managers make sure the files are closed after writing
with open('finalized_model.sav', 'wb') as model_file:
    pickle.dump(naive_bayes, model_file)
with open('count_vectorizer.sav', 'wb') as vectorizer_file:
    pickle.dump(cv, vectorizer_file)
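
Both objects have to be saved because new text must be vectorized with the same vocabulary the model was trained on. A minimal sketch of the round trip follows, using a hypothetical toy model and a temporary directory so that it does not touch the files written above:

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data
toy_texts = ["mecz wygrana gol", "turniej mecz rekord",
             "minister rząd ustawa", "sejm minister wybory"]
toy_labels = ["__label__1", "__label__1", "__label__2", "__label__2"]

toy_cv = CountVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "finalized_model.sav")
    cv_path = os.path.join(tmp_dir, "count_vectorizer.sav")

    # Save the model and the vectorizer
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(cv_path, "wb") as f:
        pickle.dump(toy_cv, f)

    # Load them back, as a separate script would
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(cv_path, "rb") as f:
        loaded_cv = pickle.load(f)

new_headline = ["minister ustawa"]
restored_prediction = loaded_model.predict(loaded_cv.transform(new_headline))
print(restored_prediction)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.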

### Time to test it

In [10]:
my_test = [
    'Dwa tygodnie przerwy Piszczka',
    'Formuła 1: GP Japonii w strugach deszczu i porywach wiatru. O ile w ogóle się odbędzie',
    'Nowe drogi rozwiną regiony',
    'Minister Jerzy Kwieciński z Polskim Kompasem 2019',
]

my_test_series = pd.Series(my_test)
my_test_series.head()

0                        Dwa tygodnie przerwy Piszczka
1    Formuła 1: GP Japonii w strugach deszczu i por...
2                           Nowe drogi rozwiną regiony
3    Minister Jerzy Kwieciński z Polskim Kompasem 2019
dtype: object

In [11]:
my_test_series_cv = cv.transform(my_test_series)

In [12]:
naive_bayes.predict(my_test_series_cv)

array(['__label__1', '__label__1', '__label__2', '__label__1'],
      dtype='<U2284')

### It seems to work, so we can measure the accuracy of the predictions on the test set

In [13]:
print('Accuracy score: ', accuracy_score(Y_test, predictions))

Accuracy score:  0.9505796217205613
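
Accuracy alone can hide weak performance on one of the labels. `precision_score` and `recall_score` are already imported at the top, so they could be reported as well. The sketch below uses hypothetical toy labels in place of `Y_test` and `predictions`, and `average='macro'` is an assumption chosen because it works for any number of labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true and predicted labels standing in for Y_test and predictions
y_true = ["__label__1", "__label__1", "__label__1",
          "__label__2", "__label__2", "__label__2"]
y_pred = ["__label__1", "__label__1", "__label__2",
          "__label__2", "__label__2", "__label__2"]

toy_accuracy = accuracy_score(y_true, y_pred)
# Macro averaging computes the metric per label and takes the unweighted mean
macro_precision = precision_score(y_true, y_pred, average="macro")
macro_recall = recall_score(y_true, y_pred, average="macro")

print("Accuracy score: ", toy_accuracy)
print("Precision score (macro): ", macro_precision)
print("Recall score (macro): ", macro_recall)
```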
