<a href="https://colab.research.google.com/github/krakowiakpawel9/ml_course/blob/master/sl/28_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### scikit-learn
Library website: [https://scikit-learn.org](https://scikit-learn.org)  

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

The core machine learning library for Python.

To install scikit-learn, run the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, run the command below:
```
!pip install --upgrade scikit-learn
```
This course is based on version `0.22.1`.

### Data preprocessing:
1. [Importing libraries](#0)
2. [Generating the data](#1)
3. [Creating a copy of the data](#2)
4. [Changing data types and initial exploration](#3)
5. [LabelEncoder](#4)
6. [OneHotEncoder](#5)
7. [Pandas *get_dummies()*](#6)
8. [Standardization - StandardScaler](#7)
9. [Preparing the data for the model](#8)



### <a name='0'></a> Importing libraries

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import sklearn

np.random.seed(42)
np.set_printoptions(precision=6, suppress=True, edgeitems=10, linewidth=1000, formatter=dict(float=lambda x: f'{x:.2f}'))
sklearn.__version__

'0.22.1'

In [2]:
from sklearn.datasets import fetch_20newsgroups

raw_data = fetch_20newsgroups(subset='train')
all_data = raw_data.copy()
all_data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [3]:
all_data['target_names']

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [4]:
# ['alt.atheism', 'sci.space']
# ['comp.graphics', 'talk.politics.misc']
raw_data_train = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'sci.space'],
                                    remove=('headers', 'footers', 'quotes'),
                                    shuffle=True, random_state=42)
all_data_train = raw_data_train.copy()

raw_data_test = fetch_20newsgroups(subset='test', categories=['alt.atheism', 'sci.space'],
                                   remove=('headers', 'footers', 'quotes'),
                                   shuffle=True, random_state=42)
all_data_test = raw_data_test.copy()

print(f'all_data_train: {all_data_train.keys()}')
print(f'all_data_test: {all_data_test.keys()}\n')

print(f"all_data_train target names: {all_data_train['target_names']}")
print(f"all_data_test target names: {all_data_test['target_names']}")

print(f"all_data_train target: {all_data_train['target'][:10]}")
print(f"all_data_test target: {all_data_test['target'][:10]}")

all_data_train: dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
all_data_test: dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

all_data_train target names: ['alt.atheism', 'sci.space']
all_data_test target names: ['alt.atheism', 'sci.space']
all_data_train target: [0 1 1 1 0 1 1 0 0 0]
all_data_test target: [0 1 0 1 1 0 1 0 0 0]


In [5]:
target = all_data_train['target_names']

data_train = all_data_train['data']
target_train = all_data_train['target']

data_test = all_data_test['data']
target_test = all_data_test['target']

print(f'Number of training samples: {len(data_train)}')
print(f'Number of test samples: {len(data_test)}')

Number of training samples: 1073
Number of test samples: 713


In [6]:
data_train[:10]

[': \n: >> Please enlighten me.  How is omnipotence contradictory?\n: \n: >By definition, all that can occur in the universe is governed by the rules\n: >of nature. Thus god cannot break them. Anything that god does must be allowed\n: >in the rules somewhere. Therefore, omnipotence CANNOT exist! It contradicts\n: >the rules of nature.\n: \n: Obviously, an omnipotent god can change the rules.\n\nWhen you say, "By definition", what exactly is being defined;\ncertainly not omnipotence. You seem to be saying that the "rules of\nnature" are pre-existant somehow, that they not only define nature but\nactually cause it. If that\'s what you mean I\'d like to hear your\nfurther thoughts on the question.',
 "In <19APR199320262420@kelvin.jpl.nasa.gov> baalke@kelvin.jpl.nasa.gov \n\nSorry I think I missed a bit of info on this Transition Experiment. What is it?\n\nWill this mean a loss of data or will the Magellan transmit data later on ??\n\nBTW: When will NASA cut off the connection with Magella

In [7]:
data_test[:10]

['\n  Damn.  And I did so have my hopes up.\n\n\n/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\/\\ \n\nBob Beauchaine bobbe@vice.ICO.TEK.COM \n\nThey said that Queens could stay, they blew the Bronx away,\nand sank Manhattan out at sea.',
 "        I had to turn to one of my problem sets that I did in class for this\nlittle problem.  I don't have a calculator, but I DO have the problem set that\nwe did not too long ago, so I'll use that, and hope it's what you wanted.  \nThis is a highly simplified problem, with a very simple burst.  Bursts are\nusually more complex than this example I will use here.\n        Our burst has a peak flux of 5.43E-6 ergs cm^-2 sec^-1 and a duration\nof 8.95 seconds.  During the frst second of the burst, and the last 4 seconds,\nits flux is half of the peak flux.  It's flux is the peak flux the rest of the\ntime.  Assume that the background flux is 10E-7 erg cm^-2 sec^-1.\n        Then we had to find the 

In [8]:
print(f"Class: {all_data_train['target_names'][target_train[0]]}\n\nMessage body:\n\n{data_train[0]}")

Class: alt.atheism

Message body:

: 
: >> Please enlighten me.  How is omnipotence contradictory?
: 
: >By definition, all that can occur in the universe is governed by the rules
: >of nature. Thus god cannot break them. Anything that god does must be allowed
: >in the rules somewhere. Therefore, omnipotence CANNOT exist! It contradicts
: >the rules of nature.
: 
: Obviously, an omnipotent god can change the rules.

When you say, "By definition", what exactly is being defined;
certainly not omnipotence. You seem to be saying that the "rules of
nature" are pre-existant somehow, that they not only define nature but
actually cause it. If that's what you mean I'd like to hear your
further thoughts on the question.


In [9]:
print(f"Class: {all_data_train['target_names'][target_train[2]]}\n\nMessage body:\n\n{data_train[2]}")

Class: sci.space

Message body:


Henry, I made the assumption that he who gets there firstest with the mostest
wins. 

Ohhh, you want to put in FINE PRINT which says "Thou shall do wonderous R&D
rather than use off-the-shelf hardware"? Sorry, didn't see that in my copy.
Most of the Pournellesque proposals run along the lines of <some dollar
amount> reward for <some simple goal>.  

You go ahead and do your development, I'll buy off the shelf at higher cost (or
even Russian; but I also assume that there'd be some "Buy US" provos in there)
and be camped out in the Moon while you are launching and assembling little
itty-bitty payloads in LEO with your laser or gas gun.  And working out the
bugs of assembly & integration in LEO. 

Oh, hey, could I get a couple of CanadARMs tuned for the lunar environment?  I
wanna do some teleoperated prospecting while I'm up there...





In [10]:
target_train[:10]

array([0, 1, 1, 1, 0, 1, 1, 0, 0, 0])

In [11]:
target_test[:10]

array([0, 1, 0, 1, 1, 0, 1, 0, 0, 0])

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit_transform(data_train).toarray()

array([[0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.01, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 

In [13]:
X_train = vectorizer.fit_transform(data_train)
X_test = vectorizer.transform(data_test)

y_train = target_train.copy()
y_test = target_test.copy()

print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')

X_train shape: (1073, 17919)
X_test shape: (713, 17919)
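
To make the numbers stored in `X_train` less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (smoothed IDF, L2-normalized rows). The `tfidf` helper and the three toy documents are illustrative assumptions. Tokenization is simplified to lowercase whitespace splitting with no stop-word removal, so the values will not match the vectorizer used above exactly.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2 normalization so every non-empty row has unit length
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


docs = ["space shuttle launch", "space station", "religion debate"]
rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, {term: round(weight, 3) for term, weight in row.items()})
```

In the first document, "space" receives a lower weight than "shuttle" because it also occurs in the second document, which is the behavior TF-IDF is designed to produce.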


In [14]:
X_train[0].toarray()

array([[0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00]])

In [15]:
print(X_train)
print(X_test)

  (0, 13224)	0.07911195468769368
  (0, 16250)	0.11999105115041317
  (0, 8068)	0.10148654709819604
  (0, 9970)	0.05393808884927991
  (0, 10516)	0.07883374159978145
  (0, 3680)	0.09369696889434878
  (0, 1702)	0.0762307320752416
  (0, 5207)	0.10310502312077618
  (0, 6602)	0.15912206033740356
  (0, 12689)	0.11662448785210391
  (0, 14367)	0.08448085064829519
  (0, 3752)	0.08755108060703673
  (0, 5208)	0.10776936017292926
  (0, 6531)	0.09112465428990733
  (0, 14365)	0.06557096266822082
  (0, 3794)	0.09017496414229824
  (0, 11664)	0.13911079115607236
  (0, 11561)	0.10675338945227657
  (0, 4610)	0.13512899190432104
  (0, 6599)	0.08971516964259744
  (0, 2007)	0.11824296387468404
  (0, 5773)	0.06248592296746404
  (0, 3293)	0.13512899190432104
  (0, 7677)	0.21723941955938655
  (0, 11181)	0.36847010687845067
  :	:
  (1072, 17110)	0.22544726874030185
  (1072, 17801)	0.4508945374806037
  (1072, 9707)	0.22544726874030185
  (1072, 12131)	0.22544726874030185
  (1072, 14690)	0.20399953616265365
  (1072,
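
The `(row, column)  value` listing above is how a SciPy sparse matrix prints: only the non-zero entries are stored. The small sketch below, which assumes a made-up 3x4 matrix (the `dense` array is purely illustrative), shows one way to measure how sparse such a matrix is and to read the non-zero entries of a single row.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy matrix standing in for a TF-IDF document-term matrix
dense = np.array([[0.0, 0.0, 0.5, 0.0],
                  [0.0, 0.0, 0.0, 0.0],
                  [0.2, 0.0, 0.0, 0.8]])
X = csr_matrix(dense)

n_rows, n_cols = X.shape
density = X.nnz / (n_rows * n_cols)  # fraction of entries that are stored
print(f'stored entries: {X.nnz}, density: {density:.2%}')

# Column indices and values of the non-zero entries in the last row
row = X[2]
print(list(zip(row.indices.tolist(), row.data.tolist())))
```

The same `X.nnz / (n_rows * n_cols)` expression can be applied to `X_train` to see what share of its entries is actually stored.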

In [16]:
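# get_feature_names() matches the course version (0.22.1); it was removed in scikit-learn 1.2 in favor of get_feature_names_out()
len(vectorizer.get_feature_names())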

17919

In [17]:
vectorizer.get_feature_names()[5000:5010]

['cylinders',
 'cynical',
 'cynics',
 'cz',
 'czcs',
 'czechoslakia',
 'czechoslavkia',
 'd0',
 'd012s658',
 'd1']

In [18]:
from sklearn.feature_selection import SelectKBest, chi2

ch2 = SelectKBest(chi2, k=50)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)
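
`SelectKBest(chi2, k=50)` keeps the 50 features whose values depend most strongly on the class label according to the chi-squared statistic. The toy sketch below assumes a hypothetical 4x3 count matrix (`X_toy`, `y_toy`) in which the first two columns each occur in only one class and the third column is identical for every sample.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical counts: column 0 appears only in class 0, column 1 only in class 1,
# column 2 is constant and therefore carries no class information
X_toy = np.array([[3, 0, 1],
                  [2, 0, 1],
                  [0, 3, 1],
                  [0, 2, 1]])
y_toy = np.array([0, 0, 1, 1])

selector = SelectKBest(chi2, k=2)
X_selected = selector.fit_transform(X_toy, y_toy)

print(selector.scores_)        # chi-squared score of each column
print(selector.get_support())  # boolean mask of the columns that were kept
print(X_selected.shape)
```

The constant column gets a score of zero and is dropped, while the two class-specific columns are kept. Note that `chi2` requires non-negative feature values, which TF-IDF weights satisfy.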

In [19]:
results = {}

In [20]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=10)
classifier.fit(X_train, y_train)
train_acc = classifier.score(X_train, y_train)
test_acc = classifier.score(X_test, y_test)
results['KNeighborsClassifier'] = [train_acc, test_acc]
print(f'Training set: {train_acc:.4f}\nTest set: {test_acc:.4f}')

Training set: 0.7987
Test set: 0.7658


In [21]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(classifier, param_grid={'n_neighbors': np.arange(1, 25)})
grid_search.fit(X_train, y_train)
grid_search.score(X_test, y_test)

0.7573632538569425
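
The cell above reports only the test accuracy of the refitted best model. `GridSearchCV` also exposes which parameter value won and its cross-validated accuracy. Here is a minimal sketch on synthetic data; the `make_classification` dataset and the variable names are assumptions made for illustration, not the newsgroups features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption; not the newsgroups features)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': np.arange(1, 11)},
                    cv=5)
grid.fit(X_tr, y_tr)

best_k = grid.best_params_['n_neighbors']  # value chosen by cross-validation
cv_acc = grid.best_score_                  # mean CV accuracy for that value
test_acc_demo = grid.score(X_te, y_te)     # accuracy of the refitted best model

print(f'best n_neighbors: {best_k}')
print(f'CV accuracy: {cv_acc:.4f}, test accuracy: {test_acc_demo:.4f}')
```

Comparing `best_score_` with the test score gives a rough sense of whether the tuned setting generalizes or merely fits the validation folds.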

In [22]:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(random_state=42)
classifier.fit(X_train, y_train)
train_acc = classifier.score(X_train, y_train)
test_acc = classifier.score(X_test, y_test)
results['DecisionTreeClassifier'] = [train_acc, test_acc]
print(f'Training set: {train_acc:.4f}\nTest set: {test_acc:.4f}')

Training set: 0.8844
Test set: 0.7868


In [23]:
grid_search = GridSearchCV(classifier, param_grid={'max_depth': np.arange(1, 20), 'min_samples_split': np.arange(2, 20)})
grid_search.fit(X_train, y_train)
print(grid_search.score(X_train, y_train))
grid_search.score(X_test, y_test)

0.8415657036346692


0.7629733520336606

In [24]:
grid_search.best_params_

{'max_depth': 19, 'min_samples_split': 5}

In [25]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_train, y_train)
train_acc = classifier.score(X_train, y_train)
test_acc = classifier.score(X_test, y_test)
results['RandomForestClassifier'] = [train_acc, test_acc]
print(f'Training set: {train_acc:.4f}\nTest set: {test_acc:.4f}')

Training set: 0.8844
Test set: 0.8093


In [26]:
grid_search = GridSearchCV(classifier, param_grid={'max_depth': np.arange(1, 15), 'min_samples_split': np.arange(2, 10)})
grid_search.fit(X_train, y_train)
print(grid_search.score(X_train, y_train))
grid_search.score(X_test, y_test)

0.8546132339235788


0.8022440392706872

In [27]:
grid_search.best_params_

{'max_depth': 14, 'min_samples_split': 2}

In [28]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state=42)
classifier.fit(X_train, y_train)
train_acc = classifier.score(X_train, y_train)
test_acc = classifier.score(X_test, y_test)
results['LogisticRegression'] = [train_acc, test_acc]
print(f'Training set: {train_acc:.4f}\nTest set: {test_acc:.4f}')

Training set: 0.8304
Test set: 0.7966


In [29]:
from sklearn.svm import SVC

classifier = SVC(random_state=42)
classifier.fit(X_train, y_train)
train_acc = classifier.score(X_train, y_train)
test_acc = classifier.score(X_test, y_test)
results['SVC'] = [train_acc, test_acc]
print(f'Training set: {train_acc:.4f}\nTest set: {test_acc:.4f}')

Training set: 0.8546
Test set: 0.8065


In [30]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train, y_train)
train_acc = classifier.score(X_train, y_train)
test_acc = classifier.score(X_test, y_test)
results['MultinomialNB'] = [train_acc, test_acc]
print(f'Training set: {train_acc:.4f}\nTest set: {test_acc:.4f}')

Training set: 0.8322
Test set: 0.8022


In [31]:
from sklearn.naive_bayes import GaussianNB

# GaussianNB is generally not used with sparse matrices, so convert to dense arrays first

X_train_nb = X_train.toarray()
X_test_nb = X_test.toarray()

classifier = GaussianNB()
classifier.fit(X_train_nb, y_train)
train_acc = classifier.score(X_train_nb, y_train)
test_acc = classifier.score(X_test_nb, y_test)
results['GaussianNB'] = [train_acc, test_acc]
print(f'Training set: {train_acc:.4f}\nTest set: {test_acc:.4f}')

Training set: 0.8574
Test set: 0.8149


In [32]:
results

{'DecisionTreeClassifier': [0.8844361602982292, 0.7868162692847125],
 'GaussianNB': [0.8574091332712023, 0.814866760168303],
 'KNeighborsClassifier': [0.798695246971109, 0.7657784011220197],
 'LogisticRegression': [0.8303821062441752, 0.7966339410939691],
 'MultinomialNB': [0.8322460391425909, 0.8022440392706872],
 'RandomForestClassifier': [0.8844361602982292, 0.8092566619915849],
 'SVC': [0.8546132339235788, 0.8064516129032258]}

In [33]:
acc = np.array(list(results.values()))
df = pd.DataFrame(data={'classifier': list(results.keys()), 'train accuracy': acc[:, 0], 'test accuracy': acc[:, 1]})
df

| | classifier | train accuracy | test accuracy |
|---|---|---|---|
| 0 | KNeighborsClassifier | 0.798695 | 0.765778 |
| 1 | DecisionTreeClassifier | 0.884436 | 0.786816 |
| 2 | RandomForestClassifier | 0.884436 | 0.809257 |
| 3 | LogisticRegression | 0.830382 | 0.796634 |
| 4 | SVC | 0.854613 | 0.806452 |
| 5 | MultinomialNB | 0.832246 | 0.802244 |
| 6 | GaussianNB | 0.857409 | 0.814867 |
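
One quick way to read this table is to look at the gap between training and test accuracy, which is a rough indicator of overfitting. The sketch below reuses the accuracies reported above, rounded to four decimals; the `reported` and `gaps` names are illustrative only.

```python
# [train accuracy, test accuracy] copied from the results above (rounded)
reported = {
    'KNeighborsClassifier': [0.7987, 0.7658],
    'DecisionTreeClassifier': [0.8844, 0.7868],
    'RandomForestClassifier': [0.8844, 0.8093],
    'LogisticRegression': [0.8304, 0.7966],
    'SVC': [0.8546, 0.8065],
    'MultinomialNB': [0.8322, 0.8022],
    'GaussianNB': [0.8574, 0.8149],
}

# Train-minus-test gap per classifier
gaps = {name: train - test for name, (train, test) in reported.items()}

# Largest gap first
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f'{name:<24} gap: {gap:.4f}')
```

With these numbers the decision tree shows the largest gap and MultinomialNB the smallest.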


In [34]:
import plotly.graph_objects as go 

fig = go.Figure(data=go.Bar(x=df['classifier'], y=df['train accuracy'], name='train', marker_color='gray'),
                layout=go.Layout(title='Accuracy on the training and test sets', width=700))
fig.add_trace(go.Bar(x=df['classifier'], y=df['test accuracy'], name='test', marker_color='gold'))
fig.show()

In [35]:
df = df.sort_values(by='test accuracy', ascending=False)

fig = go.Figure(data=go.Bar(x=df['classifier'], y=df['train accuracy'], name='train', marker_color='gray'),
                layout=go.Layout(title='Accuracy on the training and test sets', width=700))
fig.add_trace(go.Bar(x=df['classifier'], y=df['test accuracy'], name='test', marker_color='gold'))
fig.show()
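
The vectorization, feature selection, and classification steps used throughout this notebook can also be chained into a single scikit-learn pipeline, so that the vectorizer and the selector are always fitted on training data only. The sketch below is one possible arrangement, assuming a tiny made-up corpus (`toy_docs`, `toy_labels`) rather than the newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = space, 0 = religion
toy_docs = [
    'the rocket reached orbit',
    'nasa launched a new satellite',
    'the shuttle orbit was stable',
    'god and religion debate',
    'atheism and belief in god',
    'religion and faith discussion',
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Raw text goes in; TF-IDF, chi-squared selection and the classifier run in order
model = make_pipeline(TfidfVectorizer(), SelectKBest(chi2, k=5), MultinomialNB())
model.fit(toy_docs, toy_labels)

predictions = model.predict(['satellite in orbit', 'belief in god'])
print(predictions)
```

Passing such a pipeline to `GridSearchCV` would refit the selector inside every fold. Parameter names then take a step prefix, for example `selectkbest__k`.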