<a href="https://colab.research.google.com/github/lukaszplust/Projects/blob/main/Comment_classification_Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
!pip install scikit-learn



In [3]:
!pip install --upgrade scikit-learn



In [4]:
text = ['Everyday I am learning ml','I like ml','Today I am going to learn ml.','love ml, love ml !']

In [5]:
print(text)

['Everyday I am learning ml', 'I like ml', 'Today I am going to learn ml.', 'love ml, love ml !']


In [6]:
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')  # without this token_pattern, CountVectorizer drops single-character tokens such as 'I'; punctuation is dropped either way
vectorizer.fit_transform(text)  # returns a sparse matrix (most entries are 0) of shape (number of sentences x number of unique words)

<4x11 sparse matrix of type '<class 'numpy.int64'>'
	with 17 stored elements in Compressed Sparse Row format>
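
The effect of `token_pattern` can be checked with plain regular expressions. The sketch below assumes scikit-learn's documented default pattern `(?u)\b\w\w+\b` and compares it with the pattern used above; the variable names are illustrative.

```python
import re

sentence = 'Today I am going to learn ml.'.lower()  # CountVectorizer lowercases by default

# scikit-learn's documented default: tokens of two or more word characters
default_tokens = re.findall(r'(?u)\b\w\w+\b', sentence)
# pattern used in this notebook: tokens of one or more word characters
custom_tokens = re.findall(r'\b\w+\b', sentence)

print(default_tokens)  # 'i' is missing
print(custom_tokens)   # 'i' is kept; the trailing '.' is dropped in both cases
```

With the default pattern the vocabulary would have 10 columns instead of 11, because `i` would never become a feature.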

In [7]:
matrix = vectorizer.fit_transform(text).toarray()

# each row is one sentence; each value is how many times the given word occurs in it (0 = absent)

In [8]:
matrix

array([[1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0],
       [1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0]])

In [9]:
unique = vectorizer.get_feature_names_out()

In [10]:
unique

array(['am', 'everyday', 'going', 'i', 'learn', 'learning', 'like',
       'love', 'ml', 'to', 'today'], dtype=object)

In [11]:
df = pd.DataFrame(data = matrix, columns = unique)

In [12]:
df

   am  everyday  going  i  learn  learning  like  love  ml  to  today
0   1         1      0  1      0         1     0     0   1   0      0
1   0         0      0  1      0         0     1     0   1   0      0
2   1         0      1  1      1         0     0     0   1   1      1
3   0         0      0  0      0         0     0     2   2   0      0


In [13]:
# the vocabulary dictionary shows more clearly which column each word maps to
vectorizer.vocabulary_

{'everyday': 1,
 'i': 3,
 'am': 0,
 'learning': 5,
 'ml': 8,
 'like': 6,
 'today': 10,
 'going': 2,
 'to': 9,
 'learn': 4,
 'love': 7}

In [14]:
unique

array(['am', 'everyday', 'going', 'i', 'learn', 'learning', 'like',
       'love', 'ml', 'to', 'today'], dtype=object)

In [15]:
vectorizer.transform(['ml nowadays']).toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]])
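
The result above shows that `transform` only counts words already in the fitted vocabulary: `nowadays` was never seen during fitting, so it is silently ignored. A minimal pure-Python sketch of that lookup, with the vocabulary copied from `vectorizer.vocabulary_` (the helper name `to_counts` is hypothetical, not part of scikit-learn):

```python
import re

vocabulary = {'am': 0, 'everyday': 1, 'going': 2, 'i': 3, 'learn': 4, 'learning': 5,
              'like': 6, 'love': 7, 'ml': 8, 'to': 9, 'today': 10}

def to_counts(sentence, vocabulary):
    """Count known tokens; tokens outside the vocabulary are skipped."""
    row = [0] * len(vocabulary)
    for token in re.findall(r'\b\w+\b', sentence.lower()):
        if token in vocabulary:
            row[vocabulary[token]] += 1
    return row

row = to_counts('ml nowadays', vocabulary)
print(row)
```

Only column 8 (`ml`) is non-zero, which matches the array returned by `vectorizer.transform`.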

In [16]:
# bigrams -> pairs of consecutive words
bigram = CountVectorizer(ngram_range=(1, 2), min_df=1)  # ngram_range=(1, 2) keeps both single words and bigrams; min_df=1 keeps every term that appears in at least 1 document
matrix_bigram = bigram.fit_transform(text).toarray()

In [17]:
unique_bigram = bigram.get_feature_names_out()

In [18]:
bigram.vocabulary_

{'everyday': 3,
 'am': 0,
 'learning': 9,
 'ml': 15,
 'everyday am': 4,
 'am learning': 2,
 'learning ml': 10,
 'like': 11,
 'like ml': 12,
 'today': 19,
 'going': 5,
 'to': 17,
 'learn': 7,
 'today am': 20,
 'am going': 1,
 'going to': 6,
 'to learn': 18,
 'learn ml': 8,
 'love': 13,
 'love ml': 14,
 'ml love': 16}

Sorted by column index for readability

In [19]:
sorted_vocabulary = sorted(bigram.vocabulary_.items(), key=lambda x: x[1])
sorted_vocabulary

[('am', 0),
 ('am going', 1),
 ('am learning', 2),
 ('everyday', 3),
 ('everyday am', 4),
 ('going', 5),
 ('going to', 6),
 ('learn', 7),
 ('learn ml', 8),
 ('learning', 9),
 ('learning ml', 10),
 ('like', 11),
 ('like ml', 12),
 ('love', 13),
 ('love ml', 14),
 ('ml', 15),
 ('ml love', 16),
 ('to', 17),
 ('to learn', 18),
 ('today', 19),
 ('today am', 20)]
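
To see where entries such as `'ml love'` come from, the n-gram extraction can be sketched in plain Python. This assumes the default tokenizer (two or more word characters, lowercased), which is why `i` never appears in the bigram vocabulary and `'everyday am'` does; `ngrams` is a hypothetical helper, not a scikit-learn function.

```python
import re
from collections import Counter

def ngrams(sentence, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, joined by spaces."""
    tokens = re.findall(r'(?u)\b\w\w+\b', sentence.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

counts = Counter(ngrams('love ml, love ml !'))
print(counts)
```

The counts for the last sentence are 2 for `love`, `ml` and `love ml`, and 1 for `ml love`.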

In [20]:
df = pd.DataFrame(data = matrix_bigram, columns = unique_bigram)

In [21]:
df

   am  am going  am learning  everyday  everyday am  going  going to  learn  learn ml  learning  ...  like  like ml  love  love ml  ml  ml love  to  to learn  today  today am
0   1         0            1         1            1      0         0      0         0         1  ...     0        0     0        0   1        0   0         0      0         0
1   0         0            0         0            0      0         0      0         0         0  ...     1        1     0        0   1        0   0         0      0         0
2   1         1            0         0            0      1         1      1         1         0  ...     0        0     0        0   1        0   1         1      1         1
3   0         0            0         0            0      0         0      0         0         0  ...     0        0     2        2   2        1   0         0      0         0


TF-IDF -> weights each word in a document by how often it occurs there, discounted by how many documents contain it

rarer words get a higher value

In [22]:
text = ['ML future', 'ML learning', 'ML love', 'ML, ML learning']
text

['ML future', 'ML learning', 'ML love', 'ML, ML learning']

In [23]:
matrix = vectorizer.fit_transform(text).toarray()

In [24]:
unique = vectorizer.get_feature_names_out().tolist()

In [25]:
df = pd.DataFrame(data = matrix, columns = unique)

In [26]:
df

   future  learning  love  ml
0       1         0     0   1
1       0         1     0   1
2       0         0     1   1
3       0         1     0   2


In [27]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
tfidf.fit_transform(matrix).toarray()

array([[0.88654763, 0.        , 0.        , 0.46263733],
       [0.        , 0.83388421, 0.        , 0.55193942],
       [0.        , 0.        , 0.88654763, 0.46263733],
       [0.        , 0.60276058, 0.        , 0.7979221 ]])

In [28]:
unique

['future', 'learning', 'love', 'ml']

TfidfVectorizer accepts raw text directly, so there is no need to run CountVectorizer's fit_transform on the text first

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit_transform(text).toarray()

array([[0.88654763, 0.        , 0.        , 0.46263733],
       [0.        , 0.83388421, 0.        , 0.55193942],
       [0.        , 0.        , 0.88654763, 0.46263733],
       [0.        , 0.60276058, 0.        , 0.7979221 ]])

In [30]:
# looking for the most important word (the higher the value, the more important)

# future and love share the highest value because they are the rarest (each occurs in only one document); ml occurs in every document and gets 1, learning gets 1.51
tfidf_vectorizer.idf_

array([1.91629073, 1.51082562, 1.91629073, 1.        ])
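
The `idf_` values can be reproduced by hand. The sketch below assumes scikit-learn's default smoothed formula, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row; it is a check of the numbers above, not a replacement for the library.

```python
import math

n_docs = 4
# number of documents containing each term in
# ['ML future', 'ML learning', 'ML love', 'ML, ML learning']
doc_freq = {'future': 1, 'learning': 2, 'love': 1, 'ml': 4}

idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
print({term: round(value, 4) for term, value in idf.items()})

# TF-IDF row for 'ML future': raw counts times idf, then scaled to unit length
counts = {'future': 1, 'learning': 0, 'love': 0, 'ml': 1}
weights = {term: counts[term] * idf[term] for term in counts}
norm = math.sqrt(sum(w * w for w in weights.values()))
row = {term: round(w / norm, 4) for term, w in weights.items()}
print(row)
```

The row for `'ML future'` comes out as roughly 0.8865 for `future` and 0.4626 for `ml`, in line with the first row of the TF-IDF matrix shown earlier.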

Example on a larger dataset

In [31]:
from sklearn.datasets import fetch_20newsgroups

In [32]:
data = fetch_20newsgroups(subset='train', categories=['sci.med'])  # subset: 'train' loads the training set, 'test' the test set, 'all' both

In [33]:
data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [34]:
working_data = data.copy()

In [35]:
# sample post
print(working_data['data'][0])

From: nyeda@cnsvax.uwec.edu (David Nye)
Subject: Re: Post Polio Syndrome Information Needed Please !!!
Organization: University of Wisconsin Eau Claire
Lines: 21

[reply to keith@actrix.gen.nz (Keith Stewart)]
 
>My wife has become interested through an acquaintance in Post-Polio
>Syndrome This apparently is not recognised in New Zealand and different
>symptons ( eg chest complaints) are treated separately. Does anone have
>any information on it
 
It would help if you (and anyone else asking for medical information on
some subject) could ask specific questions, as no one is likely to type
in a textbook chapter covering all aspects of the subject.  If you are
looking for a comprehensive review, ask your local hospital librarian.
Most are happy to help with a request of this sort.
 
Briefly, this is a condition in which patients who have significant
residual weakness from childhood polio notice progression of the
weakness as they get older.  One theory is that the remaining motor
neurons

In [36]:
working_data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [37]:
working_data['target_names']

['sci.med']

In [38]:
working_data['target'][:5]

array([0, 0, 0, 0, 0])

In [39]:
tfidf = TfidfVectorizer()
tfidf.fit_transform(working_data['data']).toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.07270004, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

Movie Reviews

In [40]:
!wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/movie_reviews.zip

--2023-06-30 20:00:09--  https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/movie_reviews.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4004848 (3.8M) [application/zip]
Saving to: ‘movie_reviews.zip.4’


2023-06-30 20:00:09 (47.3 MB/s) - ‘movie_reviews.zip.4’ saved [4004848/4004848]



In [41]:
!unzip -q movie_reviews.zip # -q suppresses unzip's per-file messages; adding -o would overwrite existing files without the prompt shown below

replace movie_reviews/neg/cv000_29416.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [42]:
from sklearn.datasets import load_files

load_data = load_files('movie_reviews')

data_org = load_data.copy()

In [43]:
data_org.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [44]:
print(data_org['data'][0])

b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actually so absurd , that they would fit right in with dogma . \nyes , the film is 

In [45]:
data_org['target'][:5]

array([0, 1, 1, 0, 1])

In [46]:
data_org['target_names']

['neg', 'pos']

In [47]:
data = data_org['data']

target = data_org['target']

In [48]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target)  # pass random_state=0 to get the same split on every run

In [49]:
print(f'X_train {len(X_train)}')
print(f'X_test {len(X_test)}')

X_train 1500
X_test 500


In [50]:
print(X_train[0])

b'vikings v . bears ? \nno , this isn\'t the lineup for monday night football . \nrather , these are the two opposing forces that will battle to the death in " the 13th warrior , " a film that is as dramatically flat as it is gratuitously gory . \nbased on michael crichton\'s book , eaters of the dead , this viking saga tries to evoke the mysticism of fabled norsemen and the glorious battles that they fought . \ntheir strength and honor would eventually etch their place in history among the greatest warriors that ever picked up a sword . \nluckily for the vikings , however , their warring abilities were not as clumsy as this film . \nantonio bandaras is ahmed , a travelling ambassador . \naccompanied by his friend ( omar shariff in a cameo ) , they eventually come across a small viking village . \nwe see that the vikings are an extremely proud group whose greatest strength is their fortitude . \nthey laugh heartily , revel in their arrogance , and sing songs of battles won . \nbut thei

In [51]:
print(len(X_train[0]))

3625


In [52]:
print(len(X_train[1]))

3490


In [53]:
X_test[0]

b'i must admit that i was a tad skeptical of " good will hunting " , based both on the previews and the first fifteen minutes of the film , in which the main character will hunting ( matt damon ) , an mit janitor in his early twenties , is discovered to be an einstein-level closet genius when he solves two extraordinarily difficult math problems overnight . \nthe only problem is that will is a tough street kid who\'s had his share of run-ins with the law , and before long he\'s being hauled in for assault after a parking lot fight . \nprofessor lambeau ( stellan skarsgard ) , who had brought up the math problems in his lectures , tracks him down and strikes a deal with the police : will is to be released , provided he works with lambeau on his math research regularly and attends therapy sessions . \nthis sounds like the formula for mildly charming fluff , but " good will hunting " rises above its fairly mundane premise to deliver a poignant and clever drama . \na conflict gradually eme

In [54]:
print(type(X_train))
print(type(X_test))

<class 'list'>
<class 'list'>


In [55]:
# X_test is a plain Python list of bytes, so it has no .astype(); decode each review instead
X_test = [doc.decode('utf-8') for doc in X_test]

TFIDF VECTORIZER

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features = 3000)  # max_features=3000 keeps only the 3000 most frequent terms in the corpus

X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)

print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')

X_train shape: (1500, 3000)
X_test shape: (500, 3000)


Training model

In [57]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

In [58]:
classifier.score(X_test, y_test)

0.808
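
The vectorizer and classifier can also be chained so that raw text goes in and labels come out, which avoids overwriting `X_train` with its vectorized form. A minimal sketch using `make_pipeline` on a made-up four-document corpus (the toy data and variable names are assumptions for illustration only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

toy_docs = ['great wonderful film', 'awful boring film',
            'wonderful acting', 'boring plot awful acting']
toy_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# fit() runs TfidfVectorizer.fit_transform and then MultinomialNB.fit
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(toy_docs, toy_labels)

# predict() applies the fitted vocabulary to new text before classifying
toy_pred = model.predict(['wonderful film', 'boring film'])
print(toy_pred)
```

With a pipeline, `model.score(raw_test_docs, y_test)` works directly on raw documents, so the transform step cannot be forgotten or accidentally refitted on the test set.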

In [59]:
from sklearn.metrics import confusion_matrix

y_pred = classifier.predict(X_test)  # predict with the classifier trained above
cm = confusion_matrix(y_test, y_pred)

In [60]:
cm

array([[209,  55],
       [ 41, 195]])

Plotting the confusion matrix

In [61]:
import plotly.figure_factory as ff

def plot_confusion_matrix(cm):
    cm = cm[::-1] # reverse the row order (swap the rows) to match the index labels below
    cm = pd.DataFrame(cm, columns=['negative', 'positive'], index=['positive', 'negative'])

    fig = ff.create_annotated_heatmap(z=cm.values, x=list(cm.columns), y=list(cm.index),
                                      colorscale='ice', showscale=True, reversescale=True)
    fig.update_layout(width=400, height=400, title='Confusion Matrix', font_size=16)
    fig.show()

plot_confusion_matrix(cm)

Classification report

In [62]:
from sklearn.metrics import classification_report
# precision = TP / (TP + FP) -> what fraction of the cases predicted as positive are actually positive

# recall = TP / (TP + FN) -> what fraction of the actually positive cases were found; TP = true positive, FN = false negative

# F1 score -> harmonic mean of precision and recall (sensitivity); a higher F1 score indicates a better classifier

# support -> the actual number of cases that belong to each class

print(classification_report(y_test, y_pred, target_names=['negative', 'positive']))

              precision    recall  f1-score   support

    negative       0.84      0.79      0.81       264
    positive       0.78      0.83      0.80       236

    accuracy                           0.81       500
   macro avg       0.81      0.81      0.81       500
weighted avg       0.81      0.81      0.81       500
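
The figures in the report can be recomputed from the confusion matrix shown earlier. A short sketch for the positive class, assuming scikit-learn's layout (rows = true class, columns = predicted class, in the order negative, positive):

```python
# confusion matrix from above: rows = true class, columns = predicted class
cm = [[209, 55],
      [41, 195]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives

precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2), round(accuracy, 3))
```

These round to 0.78, 0.83 and 0.80 for the positive class, and the accuracy of 0.808 equals the value returned by `classifier.score`.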



In [103]:
text = ['It was great!.', 'What a shit. It was awful.', 'Too boring', 'Well-organized.', 'It was breathtaking and beautiful']

In [104]:
data = tfidf.transform(text)

In [105]:
data

<5x3000 sparse matrix of type '<class 'numpy.float64'>'
	with 15 stored elements in Compressed Sparse Row format>

In [106]:
data.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [107]:
data_pred = classifier.predict(data)

In [108]:
data_pred

array([1, 0, 0, 1, 1])

In [109]:
data_probability = classifier.predict_proba(data)

In [110]:
data_prob_round = np.round(data_probability,2)

In [111]:
# left column: probability of the negative class, right column: probability of the positive class
data_prob_round

array([[0.42, 0.58],
       [0.73, 0.27],
       [0.74, 0.26],
       [0.43, 0.57],
       [0.33, 0.67]])
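
`predict` returns the class with the highest probability in each row, so the labels can be recovered from the rounded probabilities above. A small sketch in plain Python (values copied from the output; several of them are close to 0.5, which suggests the model is not very confident on such short sentences):

```python
probabilities = [[0.42, 0.58], [0.73, 0.27], [0.74, 0.26], [0.43, 0.57], [0.33, 0.67]]
target_names = ['neg', 'pos']

# index of the largest probability in each row
predicted = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
labels = [target_names[i] for i in predicted]

print(predicted)
print(labels)
```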

In [113]:
data_org['target_names']

['neg', 'pos']

In [112]:
for sentence, label in zip(text, data_pred):  # separate loop names so the `text` list and `target` array are not overwritten
  print(f"{sentence} -> {label}, Class verification: {data_org['target_names'][label]}")

It was great!. -> 1, Class verification: pos
What a shit. It was awful. -> 0, Class verification: neg
Too boring -> 0, Class verification: neg
Well-organized. -> 1, Class verification: pos
It was breathtaking and beautiful -> 1, Class verification: pos
