<a href="https://colab.research.google.com/github/krakowiakpawel9/ml_course/blob/master/sl/28_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### scikit-learn
Library website: [https://scikit-learn.org](https://scikit-learn.org)  

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

The core machine learning library for Python.

To install scikit-learn, run the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, run the command below:
```
!pip install --upgrade scikit-learn
```
This course is based on version `0.22.1`.

### Data preprocessing:
1. [Importing libraries](#0)
2. [Generating data](#1)
3. [Creating a copy of the data](#2)
4. [Changing data types and initial exploration](#3)
5. [LabelEncoder](#4)
6. [OneHotEncoder](#5)
7. [Pandas *get_dummies()*](#6)
8. [Standardization - StandardScaler](#7)
9. [Preparing data for the model](#8)



### <a name='0'></a> Importing libraries

In [0]:
import numpy as np
import pandas as pd
import plotly.express as px
import sklearn

np.random.seed(42)
np.set_printoptions(precision=6, suppress=True, edgeitems=10, linewidth=1000, formatter=dict(float=lambda x: f'{x:.2f}'))
sklearn.__version__

'0.22.1'

Feature extraction from text  
unigram

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

In [0]:
vectorizer = CountVectorizer()
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [0]:
corpus = [
    'Today is Friday',
    'I like Friday',
    'Today I am going to learn Python.',
    'Friday, Friday!!!'
]

vectorizer.fit_transform(corpus)

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 12 stored elements in Compressed Sparse Row format>

In [0]:
vectorizer.fit_transform(corpus).toarray()

array([[0, 1, 0, 1, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 1, 0, 1, 0, 1, 1, 1],
       [0, 2, 0, 0, 0, 0, 0, 0, 0]])

In [0]:
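# get_feature_names() works in 0.22.1; it was deprecated in scikit-learn 1.0 and removed in 1.2,
# so on newer versions call get_feature_names_out() instead
vectorizer.get_feature_names()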

['am', 'friday', 'going', 'is', 'learn', 'like', 'python', 'to', 'today']
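Note that the vocabulary above has no entry for `i`, although "I" occurs twice in the corpus. This follows from the defaults printed earlier: `lowercase=True` and `token_pattern='(?u)\\b\\w\\w+\\b'`, which keeps only tokens of two or more word characters. The snippet below is a minimal plain-Python sketch of that tokenization step, not scikit-learn's actual implementation; the names `TOKEN_PATTERN`, `tokenize`, `tokens_per_doc` and `vocabulary` are illustrative.

```python
import re

# Default token_pattern taken from the CountVectorizer repr above
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

corpus = [
    'Today is Friday',
    'I like Friday',
    'Today I am going to learn Python.',
    'Friday, Friday!!!'
]


def tokenize(text):
    """Lowercase the text and keep tokens with at least two word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())


tokens_per_doc = [tokenize(doc) for doc in corpus]

# Sorted set of all tokens, comparable to vectorizer.get_feature_names()
vocabulary = sorted({token for doc in tokens_per_doc for token in doc})

print(tokens_per_doc[1])  # the single-character token "i" is dropped
print(vocabulary)
```

If single-character tokens matter for a task, a different `token_pattern` can be passed to `CountVectorizer`.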

In [0]:
df = pd.DataFrame(data=vectorizer.fit_transform(corpus).toarray(), columns=vectorizer.get_feature_names())
df

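| | am | friday | going | is | learn | like | python | to | today |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 3 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |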


In [0]:
df.describe().T

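| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| am | 4.0 | 0.25 | 0.5 | 0.0 | 0.0 | 0.0 | 0.25 | 1.0 |
| friday | 4.0 | 1.0 | 0.816497 | 0.0 | 0.75 | 1.0 | 1.25 | 2.0 |
| going | 4.0 | 0.25 | 0.5 | 0.0 | 0.0 | 0.0 | 0.25 | 1.0 |
| is | 4.0 | 0.25 | 0.5 | 0.0 | 0.0 | 0.0 | 0.25 | 1.0 |
| learn | 4.0 | 0.25 | 0.5 | 0.0 | 0.0 | 0.0 | 0.25 | 1.0 |
| like | 4.0 | 0.25 | 0.5 | 0.0 | 0.0 | 0.0 | 0.25 | 1.0 |
| python | 4.0 | 0.25 | 0.5 | 0.0 | 0.0 | 0.0 | 0.25 | 1.0 |
| to | 4.0 | 0.25 | 0.5 | 0.0 | 0.0 | 0.0 | 0.25 | 1.0 |
| today | 4.0 | 0.5 | 0.57735 | 0.0 | 0.0 | 0.5 | 1.0 | 1.0 |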


In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 9 columns):
am        4 non-null int64
friday    4 non-null int64
going     4 non-null int64
is        4 non-null int64
learn     4 non-null int64
like      4 non-null int64
python    4 non-null int64
to        4 non-null int64
today     4 non-null int64
dtypes: int64(9)
memory usage: 416.0 bytes


In [0]:
vectorizer.vocabulary_

{'am': 0,
 'friday': 1,
 'going': 2,
 'is': 3,
 'learn': 4,
 'like': 5,
 'python': 6,
 'to': 7,
 'today': 8}

In [0]:
vectorizer.transform(['Friday morning']).toarray()

array([[0, 1, 0, 0, 0, 0, 0, 0, 0]])
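In the result above only the `friday` column is non-zero: `morning` was not in the corpus used for fitting, so `transform` ignores it instead of extending the vocabulary. The sketch below reproduces this with the same four-sentence corpus; the second query, `'Monday morning'`, is a made-up example whose words are all unseen.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Today is Friday',
    'I like Friday',
    'Today I am going to learn Python.',
    'Friday, Friday!!!'
]

# fit() learns the vocabulary; transform() only maps tokens onto known columns
vectorizer = CountVectorizer().fit(corpus)

known = vectorizer.transform(['Friday morning']).toarray()
unknown = vectorizer.transform(['Monday morning']).toarray()  # hypothetical query

print(known)                                # one non-zero entry, in the 'friday' column
print(unknown)                              # all zeros: no token is in the vocabulary
print('morning' in vectorizer.vocabulary_)  # the vocabulary is unchanged by transform()
```

This is why the vectorizer should be fitted on the training texts only and then reused, unchanged, on new texts.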

In [0]:
bigram = CountVectorizer(ngram_range=(1, 2), min_df=1)    # unigrams and bigrams; try min_df=2 to keep only n-grams found in at least 2 documents
bigram.fit_transform(corpus).toarray()

array([[0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0],
       [0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [0]:
bigram.vocabulary_

{'am': 0,
 'am going': 1,
 'friday': 2,
 'friday friday': 3,
 'going': 4,
 'going to': 5,
 'is': 6,
 'is friday': 7,
 'learn': 8,
 'learn python': 9,
 'like': 10,
 'like friday': 11,
 'python': 12,
 'to': 13,
 'to learn': 14,
 'today': 15,
 'today am': 16,
 'today is': 17}

In [0]:
df = pd.DataFrame(data=bigram.fit_transform(corpus).toarray(), columns=bigram.get_feature_names())
df

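| | am | am going | friday | friday friday | going | going to | is | is friday | learn | learn python | like | like friday | python | to | to learn | today | today am | today is |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
| 3 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |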
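The bigram vocabulary contains `today am`, although the sentence reads "Today I am going...". A likely explanation is that n-grams are built from the token sequence after tokenization, so the dropped single-character token "I" leaves `today` and `am` adjacent. The plain-Python sketch below approximates this and also shows the effect of the `min_df=2` setting hinted at in the code comment. It is an illustration, not scikit-learn's implementation; `ngrams`, `doc_freq`, `vocab_all` and `vocab_min_df_2` are illustrative names.

```python
import re
from collections import Counter

corpus = [
    'Today is Friday',
    'I like Friday',
    'Today I am going to learn Python.',
    'Friday, Friday!!!'
]


def ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams (n_min..n_max) built from the filtered tokens."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


docs = [ngrams(doc) for doc in corpus]

# Document frequency: in how many documents does each n-gram occur?
doc_freq = Counter(gram for doc in docs for gram in set(doc))

vocab_all = sorted(doc_freq)                                          # like min_df=1
vocab_min_df_2 = sorted(g for g, df in doc_freq.items() if df >= 2)  # like min_df=2

print(len(vocab_all))
print(vocab_min_df_2)
```

With `min_df=2` only `friday` and `today` survive in this sketch, because every other n-gram occurs in a single document.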


TF-IDF Transformer

In [0]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
tfidf

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [0]:
corpus = [
    'Friday morning',
    'Friday chill',
    'Friday - morning',
    'Friday, Friday morning!!!'
]

counts = vectorizer.fit_transform(corpus).toarray()
counts

array([[0, 1, 1],
       [1, 1, 0],
       [0, 1, 1],
       [0, 2, 1]])

In [0]:
df = pd.DataFrame(data=vectorizer.fit_transform(corpus).toarray(), columns=vectorizer.get_feature_names())
df

| | chill | friday | morning |
|---|---|---|---|
| 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 0 |
| 2 | 0 | 1 | 1 |
| 3 | 0 | 2 | 1 |


In [0]:
tfidf.fit_transform(counts).toarray()

array([[0.        , 0.63295194, 0.77419109],
       [0.88654763, 0.46263733, 0.        ],
       [0.        , 0.63295194, 0.77419109],
       [0.        , 0.85310692, 0.52173612]])

TF-IDF Vectorizer

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit_transform(corpus).toarray()

array([[0.        , 0.63295194, 0.77419109],
       [0.88654763, 0.46263733, 0.        ],
       [0.        , 0.63295194, 0.77419109],
       [0.        , 0.85310692, 0.52173612]])
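The matrix above is identical to the `TfidfTransformer` output from the previous section. The scikit-learn documentation describes `TfidfVectorizer` as equivalent to `CountVectorizer` followed by `TfidfTransformer`. The sketch below checks this on the same four-document corpus; the variable names `two_step`, `X_two_step` and `X_one_step` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

corpus = [
    'Friday morning',
    'Friday chill',
    'Friday - morning',
    'Friday, Friday morning!!!'
]

# Two steps: raw counts first, then TF-IDF weighting of the counts
two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
X_two_step = two_step.fit_transform(corpus).toarray()

# One step: TfidfVectorizer does both internally
X_one_step = TfidfVectorizer().fit_transform(corpus).toarray()

same = np.allclose(X_two_step, X_one_step)
print(same)
```

The one-step form is usually more convenient when working with raw text. The two-step form can be useful when the raw counts are also needed.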

In [0]:
tfidf_vectorizer.idf_

array([1.91629073, 1.        , 1.22314355])
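The `idf_` values above are consistent with the smoothed formula scikit-learn documents for `smooth_idf=True`: `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing term `t`. Each row of `tf * idf` is then scaled to unit length (`norm='l2'`). The plain-Python sketch below recomputes both the weights and the matrix from the count matrix shown in the TF-IDF Transformer section; it is an illustration under these assumptions, and names such as `doc_freq` and `tfidf_row` are illustrative.

```python
import math

# Count matrix from the corpus above; columns: chill, friday, morning
counts = [
    [0, 1, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 2, 1],
]

n_docs = len(counts)
n_terms = len(counts[0])

# df(t): number of documents in which term t occurs at least once
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf, as used with smooth_idf=True
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def tfidf_row(row):
    """Weight raw counts by idf, then L2-normalize (every row here is non-zero)."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]


tfidf_matrix = [tfidf_row(row) for row in counts]

print([round(w, 8) for w in idf])
for row in tfidf_matrix:
    print([round(v, 8) for v in row])
```

`friday` occurs in every document, so it gets the lowest weight (1.0), while `chill`, which occurs in only one document, gets the highest.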

In [0]:
from sklearn.datasets import fetch_20newsgroups

In [0]:
raw_data = fetch_20newsgroups(subset='train', categories=['comp.graphics'], random_state=42)
raw_data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [0]:
all_data = raw_data.copy()
print(all_data['DESCR'])

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

In [0]:
all_data['data'][:5]

["From: bbs.mirage@tsoft.net (Jerry Lee)\nSubject: Cobra 2.0 1-b-1 Video card HELP ME!!!!\nOrganization: The TSoft BBS and Public Access Unix, +1 415 969 8238\nLines: 22\n\nDoes ANYONE out there in Net-land have any information on the Cobra 2.20 \ncard?  The sticker on the end of the card reads\n        Model: Cobra 1-B-1\n        Bios:  Cobra v2.20\n\nI Havn't been able to find anything about it from anyone!  If you have \nany information on how to get a hold of the company which produces the \ncard or know where any drivers are for it, PLEASE let me know!\n\nAs far as I can tell, it's a CGA card that is taking up 2 of my 16-bit \nISA slots but when I enable the test patterns, it displays much more than \nthe usualy 4 CGA colors... At least 16 from what I can count.. Thanks!\n\n              .------------------------------------------.\n              : Internet: jele@eis.calstate.edu          :\n              :           bbs.mirage@gilligan.tsoft.net  :\n              :           bbs.

In [0]:
print(all_data['data'][0])

From: bbs.mirage@tsoft.net (Jerry Lee)
Subject: Cobra 2.0 1-b-1 Video card HELP ME!!!!
Organization: The TSoft BBS and Public Access Unix, +1 415 969 8238
Lines: 22

Does ANYONE out there in Net-land have any information on the Cobra 2.20 
card?  The sticker on the end of the card reads
        Model: Cobra 1-B-1
        Bios:  Cobra v2.20

I Havn't been able to find anything about it from anyone!  If you have 
any information on how to get a hold of the company which produces the 
card or know where any drivers are for it, PLEASE let me know!

As far as I can tell, it's a CGA card that is taking up 2 of my 16-bit 
ISA slots but when I enable the test patterns, it displays much more than 
the usualy 4 CGA colors... At least 16 from what I can count.. Thanks!

              .------------------------------------------.
              : Internet: jele@eis.calstate.edu          :
              :           bbs.mirage@gilligan.tsoft.net  :
              :           bbs.mirage@tsoft.sf-bay.org

In [0]:
all_data['target_names']

['comp.graphics']

In [0]:
all_data['target'][:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [0]:
tfidf = TfidfVectorizer()
tfidf.fit_transform(all_data['data']).toarray()

array([[0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.04, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 