<a href="https://colab.research.google.com/github/kurek0010/machine-learing-bootcamp/blob/main/supervised/05_case_studies/03_text_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### scikit-learn
Library website: [https://scikit-learn.org](https://scikit-learn.org)  

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

The core machine learning library for Python.

To install scikit-learn, run the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, run the command below:
```
!pip install --upgrade scikit-learn
```
This course was built against version `0.22.1`. The cells below call `get_feature_names_out()`, which requires scikit-learn 1.0 or newer.

### Table of contents:
1. [Importing libraries](#0)
2. [Generating the data](#1)
3. [Text vectorization](#2)
4. [Text vectorization - bigrams](#3)
5. [TFIDF Transformer](#4)
6. [TFIDF Vectorizer](#5)
7. [Preparing text data - an example](#6)



### <a name='0'></a> Importing libraries

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import sklearn

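# Fix the random seed and print floats with two decimal places
np.random.seed(42)
np.set_printoptions(precision=6, suppress=True, edgeitems=10, linewidth=1000, formatter=dict(float=lambda x: f'{x:.2f}'))
# Show the installed scikit-learn version
sklearn.__version__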

### <a name='1'></a> Generating the data

In [None]:
documents = [
    'Today is Friday',
    'I like Friday',
    'Today I am going to learn Python.',
    'Friday, Friday!!!'
]

print(documents)

### <a name='2'></a> Text vectorization

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit_transform(documents)

In [None]:
vectorizer.fit_transform(documents).toarray()

In [None]:
vectorizer.get_feature_names_out()

In [None]:
df = pd.DataFrame(data=vectorizer.fit_transform(documents).toarray(),
                  columns=vectorizer.get_feature_names_out())

df

In [None]:
vectorizer.vocabulary_

In [None]:
vectorizer.transform(['Friday morning']).toarray()

### <a name='3'></a> Text vectorization - bigrams

In [None]:
# Count unigrams and bigrams; set min_df=2 to keep only terms found in at least two documents
bigram = CountVectorizer(ngram_range=(1, 2), min_df=1)
bigram.fit_transform(documents).toarray()

In [None]:
bigram.vocabulary_

In [None]:
df = pd.DataFrame(data=bigram.fit_transform(documents).toarray(),
                  columns=bigram.get_feature_names_out())
df

### <a name='4'></a> TFIDF Transformer

In [None]:
documents = [
    'Friday morning',
    'Friday chill',
    'Friday - morning',
    'Friday, Friday morning!!!'
]

print(documents)

In [None]:
counts = vectorizer.fit_transform(documents).toarray()
counts

In [None]:
df = pd.DataFrame(data=vectorizer.fit_transform(documents).toarray(), columns=vectorizer.get_feature_names_out())
df

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
tfidf.fit_transform(counts).toarray()
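
With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfTransformer` weights each count by `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing term `t`, and then scales each row to unit L2 length. The sketch below reproduces this by hand with NumPy. The `counts` array is a hard-coded copy of the count matrix from the cells above (columns assumed to be `chill`, `friday`, `morning`), and `manual` and `reference` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Count matrix for the four "Friday" documents (columns: chill, friday, morning)
counts = np.array([[0, 1, 1],
                   [1, 1, 0],
                   [0, 1, 1],
                   [0, 2, 1]])

n_docs = counts.shape[0]
# Document frequency: number of documents in which each term occurs
doc_freq = (counts > 0).sum(axis=0)
# Smoothed inverse document frequency
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts, then normalize each row to unit L2 length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against scikit-learn's result
reference = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, reference))
```

Because `friday` appears in every document, its IDF is exactly 1, the lowest possible value under this formula, while the rarer `chill` receives the highest weight.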

### <a name='5'></a> TFIDF Vectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit_transform(documents).toarray()

In [None]:
tfidf_vectorizer.idf_

### <a name='6'></a> Preparing text data - an example

In [None]:
from sklearn.datasets import fetch_20newsgroups

In [None]:
raw_data = fetch_20newsgroups(subset='train', categories=['comp.graphics'], random_state=42)
raw_data.keys()

In [None]:
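# Work on a shallow copy so the fetched object itself is not reassigned
all_data = raw_data.copy()
# Preview the first two raw posts
all_data['data'][:2]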

In [None]:
print(all_data['data'][0])

In [None]:
all_data['target_names']

In [None]:
all_data['target'][:10]

In [None]:
tfidf = TfidfVectorizer()
tfidf.fit_transform(all_data['data']).toarray()
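
The final cell calls `.toarray()`, which materializes the full dense matrix. `fit_transform` itself returns a SciPy sparse matrix, which is usually the more practical form for a real corpus because most entries are zero. The rough sketch below shows how to inspect the sparse result directly. The `corpus` list is a small stand-in invented for this example, not taken from 20 Newsgroups, so that the code runs without a download.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in corpus (not part of the 20 Newsgroups data)
corpus = [
    'graphics card driver',
    'render the image',
    'image file format',
    'driver for the card',
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)  # kept sparse, no .toarray()

# Shape is (number of documents, vocabulary size)
print(sparse.issparse(X), X.shape)

# Fraction of entries that are actually stored (non-zero)
density = X.nnz / (X.shape[0] * X.shape[1])
print(f'{density:.2f}')
```

scikit-learn estimators generally accept sparse input, so the dense conversion is mainly useful for inspecting small examples.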