<a href="https://colab.research.google.com/github/krakowiakpawel9/ml_course/blob/master/x/01_basic/03_tfidf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### scikit-learn
Library homepage: [https://scikit-learn.org](https://scikit-learn.org)  

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

A core machine learning library for Python.

To install scikit-learn, use the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, use the command below:
```
!pip install --upgrade scikit-learn
```
This course was developed against version `0.22.1`.

### Table of contents:
1. [Importing libraries](#0)
2. [Generating the data](#1)
3. [Text vectorization](#2)
4. [Text vectorization - bigrams](#3)
5. [TFIDF Transformer](#4)
6. [TFIDF Vectorizer](#5)



### <a name='0'></a> Importing libraries

In [0]:
import numpy as np
import pandas as pd
import plotly.express as px

np.random.seed(42)
np.set_printoptions(precision=6, suppress=True, edgeitems=10, linewidth=1000, formatter=dict(float=lambda x: f'{x:.2f}'))

### <a name='1'></a> Generating the data

In [2]:
documents = [
    'Hello world',
    'Hello',
    'I am going to say: hello.',
    'Hello, beautiful world!!!'
]

print(documents)

['Hello world', 'Hello', 'I am going to say: hello.', 'Hello, beautiful world!!!']


### <a name='2'></a> Text vectorization

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit_transform(documents)

<4x7 sparse matrix of type '<class 'numpy.int64'>'
	with 11 stored elements in Compressed Sparse Row format>
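
The result is a SciPy sparse matrix, which stores only the non-zero counts. The sketch below is an illustration that was not part of the original notebook. It inspects the matrix without converting it to a dense array; the names `X` and `density` are introduced here for the example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    'Hello world',
    'Hello',
    'I am going to say: hello.',
    'Hello, beautiful world!!!'
]

X = CountVectorizer().fit_transform(documents)

# Rows are documents, columns are vocabulary terms.
print(X.shape)

# nnz is the number of explicitly stored (non-zero) entries.
print(X.nnz)

# Fraction of cells that actually hold a count.
density = X.nnz / (X.shape[0] * X.shape[1])
print(f'{density:.2f}')
```

On real corpora the density is usually far lower than in this toy example, which is why the vectorizer returns a sparse matrix by default.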

In [4]:
vectorizer.fit_transform(documents).toarray()

array([[0, 0, 0, 1, 0, 0, 1],
       [0, 0, 0, 1, 0, 0, 0],
       [1, 0, 1, 1, 1, 1, 0],
       [0, 1, 0, 1, 0, 0, 1]])

In [5]:
vectorizer.get_feature_names()    # removed in scikit-learn 1.2; use get_feature_names_out() there

['am', 'beautiful', 'going', 'hello', 'say', 'to', 'world']
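
The course targets scikit-learn `0.22.1`, where `get_feature_names()` is available. In scikit-learn 1.0 the method was deprecated in favour of `get_feature_names_out()`, and it was removed in 1.2. A minimal sketch of a version-tolerant helper follows; `feature_names` is a hypothetical name chosen for this example and is not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    'Hello world',
    'Hello',
    'I am going to say: hello.',
    'Hello, beautiful world!!!'
]


def feature_names(vectorizer):
    """Return the fitted vocabulary terms as a plain list (hypothetical helper)."""
    if hasattr(vectorizer, 'get_feature_names_out'):
        # scikit-learn >= 1.0 returns a NumPy array of strings.
        return list(vectorizer.get_feature_names_out())
    # Older releases, such as 0.22.1, return a list directly.
    return vectorizer.get_feature_names()


vectorizer = CountVectorizer()
vectorizer.fit(documents)
print(feature_names(vectorizer))
```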

In [6]:
df = pd.DataFrame(data=vectorizer.fit_transform(documents).toarray(),
                  columns=vectorizer.get_feature_names())    # scikit-learn >= 1.2: get_feature_names_out()

df

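|   | am | beautiful | going | hello | say | to | world |
|---|----|-----------|-------|-------|-----|----|-------|
| 0 | 0  | 0         | 0     | 1     | 0   | 0  | 1     |
| 1 | 0  | 0         | 0     | 1     | 0   | 0  | 0     |
| 2 | 1  | 0         | 1     | 1     | 1   | 1  | 0     |
| 3 | 0  | 1         | 0     | 1     | 0   | 0  | 1     |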


In [7]:
vectorizer.vocabulary_

{'am': 0,
 'beautiful': 1,
 'going': 2,
 'hello': 3,
 'say': 4,
 'to': 5,
 'world': 6}

In [8]:
vectorizer.transform(['Say: hello!']).toarray()

array([[0, 0, 0, 1, 1, 0, 0]])
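
The vocabulary has no entry for `I`, and `Say:` is matched to `say`. With default settings `CountVectorizer` lowercases the text, drops punctuation, and keeps only tokens of at least two word characters. The sketch below was not part of the original notebook. It uses `build_analyzer()` to show the tokens the vectorizer sees; the word `stranger` is an illustrative out-of-vocabulary term.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    'Hello world',
    'Hello',
    'I am going to say: hello.',
    'Hello, beautiful world!!!'
]

vectorizer = CountVectorizer()
vectorizer.fit(documents)

# The analyzer applies preprocessing and tokenization, but no counting.
analyzer = vectorizer.build_analyzer()
tokens = analyzer('I am going to say: hello.')
print(tokens)

# Words that were not seen during fit are silently ignored by transform().
new_tokens = analyzer('Say: hello, stranger!')
known = [t for t in new_tokens if t in vectorizer.vocabulary_]
unknown = [t for t in new_tokens if t not in vectorizer.vocabulary_]
print(known, unknown)

row = vectorizer.transform(['Say: hello, stranger!']).toarray()
print(row.sum())
```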

### <a name='3'></a> Text vectorization - bigrams

In [9]:
bigram = CountVectorizer(ngram_range=(1, 2), min_df=1)    # try min_df=2 to keep only terms found in at least 2 documents
bigram.fit_transform(documents).toarray()

array([[0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0],
       [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1]])

In [10]:
bigram.vocabulary_

{'am': 0,
 'am going': 1,
 'beautiful': 2,
 'beautiful world': 3,
 'going': 4,
 'going to': 5,
 'hello': 6,
 'hello beautiful': 7,
 'hello world': 8,
 'say': 9,
 'say hello': 10,
 'to': 11,
 'to say': 12,
 'world': 13}

In [11]:
df = pd.DataFrame(data=bigram.fit_transform(documents).toarray(),
                  columns=bigram.get_feature_names())    # scikit-learn >= 1.2: get_feature_names_out()
df

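|   | am | am going | beautiful | beautiful world | going | going to | hello | hello beautiful | hello world | say | say hello | to | to say | world |
|---|----|----------|-----------|-----------------|-------|----------|-------|-----------------|-------------|-----|-----------|----|--------|-------|
| 0 | 0  | 0        | 0         | 0               | 0     | 0        | 1     | 0               | 1           | 0   | 0         | 0  | 0      | 1     |
| 1 | 0  | 0        | 0         | 0               | 0     | 0        | 1     | 0               | 0           | 0   | 0         | 0  | 0      | 0     |
| 2 | 1  | 1        | 0         | 0               | 1     | 1        | 1     | 0               | 0           | 1   | 1         | 1  | 1      | 0     |
| 3 | 0  | 0        | 1         | 1               | 0     | 0        | 1     | 1               | 0           | 0   | 0         | 0  | 0      | 1     |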
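
The `min_df` parameter sets the minimum number of documents a term must appear in to stay in the vocabulary. The sketch below illustrates what `min_df=2` does to this corpus. It was not part of the original notebook, and the names `bigram_min2` and `terms` are introduced for the example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    'Hello world',
    'Hello',
    'I am going to say: hello.',
    'Hello, beautiful world!!!'
]

# Keep unigrams and bigrams that occur in at least two documents.
bigram_min2 = CountVectorizer(ngram_range=(1, 2), min_df=2)
X = bigram_min2.fit_transform(documents)

# Order the surviving terms by their column index.
terms = sorted(bigram_min2.vocabulary_, key=bigram_min2.vocabulary_.get)
print(terms)
print(X.toarray())
```

Every bigram, including `hello world`, occurs in a single document only, so just the unigrams `hello` and `world` survive the cut-off.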


### <a name='4'></a> TFIDF Transformer

In [12]:
documents = [
    'Hello world',
    'Hello',
    'I am going to say: hello.',
    'Hello, beautiful world!!!'
]

print(documents)

['Hello world', 'Hello', 'I am going to say: hello.', 'Hello, beautiful world!!!']


In [13]:
counts = vectorizer.fit_transform(documents).toarray()
counts

array([[0, 0, 0, 1, 0, 0, 1],
       [0, 0, 0, 1, 0, 0, 0],
       [1, 0, 1, 1, 1, 1, 0],
       [0, 1, 0, 1, 0, 0, 1]])

In [14]:
# scikit-learn >= 1.2: replace get_feature_names() with get_feature_names_out()
df = pd.DataFrame(data=vectorizer.fit_transform(documents).toarray(), columns=vectorizer.get_feature_names())
df

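|   | am | beautiful | going | hello | say | to | world |
|---|----|-----------|-------|-------|-----|----|-------|
| 0 | 0  | 0         | 0     | 1     | 0   | 0  | 1     |
| 1 | 0  | 0         | 0     | 1     | 0   | 0  | 0     |
| 2 | 1  | 0         | 1     | 1     | 1   | 1  | 0     |
| 3 | 0  | 1         | 0     | 1     | 0   | 0  | 1     |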


In [15]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
tfidf.fit_transform(counts).toarray()

array([[0.00, 0.00, 0.00, 0.55, 0.00, 0.00, 0.83],
       [0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00],
       [0.48, 0.00, 0.48, 0.25, 0.48, 0.48, 0.00],
       [0.00, 0.73, 0.00, 0.38, 0.00, 0.00, 0.57]])
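
To see where these numbers come from, the weights can be recomputed by hand. The sketch below was not part of the original notebook and assumes the default `TfidfTransformer` settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`). Under those settings idf is `ln((1 + n) / (1 + df)) + 1`, and each row is scaled to unit Euclidean length. The variable names are chosen for this example only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

documents = [
    'Hello world',
    'Hello',
    'I am going to say: hello.',
    'Hello, beautiful world!!!'
]

counts = CountVectorizer().fit_transform(documents).toarray()

n_docs = counts.shape[0]
# Document frequency: in how many documents each term occurs.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumes smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Raw term counts weighted by idf, then L2-normalized row by row.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, reference))
```

Because `hello` occurs in every document, its idf is the minimum value of 1. That is why the rarer `world` outweighs it in the first row.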

### <a name='5'></a> TFIDF Vectorizer

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit_transform(documents).toarray()

array([[0.00, 0.00, 0.00, 0.55, 0.00, 0.00, 0.83],
       [0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00],
       [0.48, 0.00, 0.48, 0.25, 0.48, 0.48, 0.00],
       [0.00, 0.73, 0.00, 0.38, 0.00, 0.00, 0.57]])
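
The output matches the `TfidfTransformer` result because `TfidfVectorizer` combines counting and TF-IDF weighting in one estimator. A minimal sketch that checks this equivalence with default parameters follows; it was not part of the original notebook, and `two_step` and `one_step` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

documents = [
    'Hello world',
    'Hello',
    'I am going to say: hello.',
    'Hello, beautiful world!!!'
]

# Two explicit steps: raw counts first, then TF-IDF weighting.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
a = two_step.fit_transform(documents).toarray()

# One step: TfidfVectorizer works directly on raw text.
one_step = TfidfVectorizer()
b = one_step.fit_transform(documents).toarray()

print(np.allclose(a, b))
```

The two-step form can be handy when the raw counts are also needed elsewhere. The single estimator is more convenient when only the weighted matrix matters.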

In [18]:
tfidf_vectorizer.idf_

array([1.92, 1.92, 1.92, 1.00, 1.92, 1.92, 1.51])

In [19]:
tfidf_vectorizer.get_feature_names()    # removed in scikit-learn 1.2; use get_feature_names_out() there

['am', 'beautiful', 'going', 'hello', 'say', 'to', 'world']