
* @author: krakowiakpawel9@gmail.com  
* @site: e-smartdata.org

### scikit-learn
Library website: [https://scikit-learn.org](https://scikit-learn.org)  

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

The fundamental machine learning library for Python.

To install scikit-learn, use the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, use the command below:
```
!pip install --upgrade scikit-learn
```
This course is based on version `0.22.1`.

### Table of contents:
1. [Importing libraries](#0)
2. [Loading the data](#1)
3. [Text vectorization](#2)
4. [TF-IDF](#3)



### <a name='0'></a> Importing libraries

In [0]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups

np.set_printoptions(precision=4, edgeitems=10, linewidth=200)

### <a name='1'></a> Loading the data

In [0]:
newsgroups_train = fetch_20newsgroups(subset='train', categories=None, shuffle=True, random_state=42)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [0]:
newsgroups_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [0]:
print(newsgroups_train['DESCR'])

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

In [0]:
newsgroups_train['data'][:5]

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

In [0]:
newsgroups_train['target'][:5]

array([ 7,  4,  4,  1, 14])

In [0]:
newsgroups_train['target_names']

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Loading only two classes

In [0]:
newsgroups_train = fetch_20newsgroups(subset='train', categories=['comp.graphics', 'sci.med'], shuffle=True, random_state=42)
data = newsgroups_train['data']
target = newsgroups_train['target']
target_names = newsgroups_train['target_names']

print(f'Number of samples: {len(data)}')

Number of samples: 1178


Displaying a few documents

In [0]:
for text, label in zip(data[:3], target[:3]):
    print(f'target: {target_names[label]}')
    print('-' * 80)
    print(text)
    print('-' * 80)

target: comp.graphics
--------------------------------------------------------------------------------
From: zyeh@caspian.usc.edu (zhenghao yeh)
Subject: Re: Need polygon splitting algo...
Organization: University of Southern California, Los Angeles, CA
Lines: 25
Distribution: world
NNTP-Posting-Host: caspian.usc.edu
Keywords: polygons, splitting, clipping


In article <1qvq4b$r4t@wampyr.cc.uow.edu.au>, g9134255@wampyr.cc.uow.edu.au (Coronado Emmanuel Abad) writes:
|> 
|> The idea is to clip one polygon using another polygon (not
|> necessarily rectangular) as a window.  My problem then is in
|> finding out all the new vertices of the resulting "subpolygons"
|> from the first one.  Is this simply a matter of extending the
|> usual algorithm whereby each of the edges of one polygon is checked
|> against another polygon???  Is there a simpler way??
|> 
|> Comments welcome.
|> 
|> Noel.

	It depends on what kind of the polygons. 
	Convex - simple, concave - trouble, concave with loop(s)
	

### <a name='2'></a> Text vectorization

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(data)
X_train

<1178x24614 sparse matrix of type '<class 'numpy.int64'>'
	with 170800 stored elements in Compressed Sparse Row format>
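
The 1178x24614 matrix above is hard to read directly, so here is a minimal sketch of the same bag-of-words step on a tiny corpus. The three sentences in `toy_docs` are made up for illustration and are not part of the 20 newsgroups data. The column order is recovered from `vocabulary_`, which should work on the course version as well as newer releases.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to make the output small enough to inspect
toy_docs = [
    "the cat sat",
    "the cat sat on the mat",
    "dogs chase cats",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Terms ordered by their column index in the count matrix
toy_terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_terms)
print(toy_counts.toarray())
```

Each row is one document and each column is one term. With the default settings the text is lowercased and only tokens of at least two characters are kept, and the second row holds a 2 in the `the` column because that word occurs twice in the second sentence.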

In [0]:
X_train.toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       ...,
       [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 5, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 

In [0]:
vectorizer.vocabulary_

{'from': 10161,
 'zyeh': 24608,
 'caspian': 5492,
 'usc': 23190,
 'edu': 8637,
 'zhenghao': 24571,
 'yeh': 24470,
 'subject': 21310,
 're': 18530,
 'need': 15586,
 'polygon': 17527,
 'splitting': 20877,
 'algo': 3261,
 'organization': 16414,
 'university': 23043,
 'of': 16197,
 'southern': 20726,
 'california': 5314,
 'los': 13990,
 'angeles': 3505,
 'ca': 5256,
 'lines': 13813,
 '25': 1072,
 'distribution': 8109,
 'world': 24225,
 'nntp': 15837,
 'posting': 17635,
 'host': 11533,
 'keywords': 13129,
 'polygons': 17530,
 'clipping': 6075,
 'in': 11989,
 'article': 3829,
 '1qvq4b': 833,
 'r4t': 18352,
 'wampyr': 23791,
 'cc': 5545,
 'uow': 23130,
 'au': 4005,
 'g9134255': 10258,
 'coronado': 6832,
 'emmanuel': 8824,
 'abad': 2702,
 'writes': 24275,
 'the': 22081,
 'idea': 11788,
 'is': 12576,
 'to': 22308,
 'clip': 6070,
 'one': 16284,
 'using': 23210,
 'another': 3563,
 'not': 15933,
 'necessarily': 15577,
 'rectangular': 18661,
 'as': 3847,
 'window': 24091,
 'my': 15423,
 'problem': 
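
The `vocabulary_` dictionary maps each term to the index of its column in the count matrix. The sketch below shows one way to use that mapping in both directions. The two sentences in `sample_docs` are invented for illustration, and `index_to_term` is a helper name introduced here, not a scikit-learn attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus for illustration
sample_docs = [
    "polygon clipping splits a polygon",
    "the doctor treats the patient",
]

sample_vectorizer = CountVectorizer()
sample_counts = sample_vectorizer.fit_transform(sample_docs)

# Term -> column index: read how often "polygon" occurs in the first document
polygon_col = sample_vectorizer.vocabulary_["polygon"]
polygon_count = sample_counts[0, polygon_col]

# Column index -> term: invert the dictionary to name the most frequent term
index_to_term = {idx: term for term, idx in sample_vectorizer.vocabulary_.items()}
first_row = sample_counts[0].toarray().ravel()
top_term = index_to_term[int(first_row.argmax())]

print(polygon_count, top_term)
```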

### <a name='3'></a> TF-IDF

In [0]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_train_tfidf

<1178x24614 sparse matrix of type '<class 'numpy.float64'>'
	with 170800 stored elements in Compressed Sparse Row format>
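
To make the weights above less of a black box, the sketch below recomputes TF-IDF by hand with NumPy and compares the result with `TfidfTransformer`. It assumes the default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the documented formula is idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by scaling each row to unit Euclidean length. The small `toy_count_matrix` is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical term-count matrix: 3 documents (rows) x 4 terms (columns)
toy_count_matrix = np.array([
    [2, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 3],
])

n_docs = toy_count_matrix.shape[0]
doc_freq = (toy_count_matrix > 0).sum(axis=0)      # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1    # smoothed inverse document frequency
weighted = toy_count_matrix * idf                  # raw term count times idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalise rows

# Reference result from scikit-learn with default parameters
sklearn_tfidf = TfidfTransformer().fit_transform(toy_count_matrix).toarray()

print(np.allclose(manual_tfidf, sklearn_tfidf))
```

The first term occurs in every document, so its idf is the lowest possible value (1), while terms confined to a single document receive a larger weight.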

In [0]:
X_train_tfidf.toarray()

array([[0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    , 0.    , 0.0851, 0.    , 0.    , 0.    , 0.    , 0.    ],
       [0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ],
       [0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ],
       [0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ],
       [0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ],
       [0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    , 0.    , 0.   

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()
X_train_tfidf_vec = tfidf_vec.fit_transform(data)
X_train_tfidf_vec

<1178x24614 sparse matrix of type '<class 'numpy.float64'>'
	with 170800 stored elements in Compressed Sparse Row format>
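
The shape and number of stored elements match the `TfidfTransformer` result, which suggests that `TfidfVectorizer` combines `CountVectorizer` and `TfidfTransformer` in a single step. The sketch below checks this on a small made-up corpus (`check_docs` is not part of the dataset), using default parameters throughout.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical corpus for illustration
check_docs = [
    "polygon clipping with another polygon",
    "the patient was given a new medicine",
    "clipping algorithms for graphics",
]

# Two-step route: raw counts first, then TF-IDF weighting
count_vec = CountVectorizer()
two_step = TfidfTransformer().fit_transform(count_vec.fit_transform(check_docs))

# One-step route
one_step_vec = TfidfVectorizer()
one_step = one_step_vec.fit_transform(check_docs)

same_vocabulary = count_vec.vocabulary_ == one_step_vec.vocabulary_
same_values = np.allclose(two_step.toarray(), one_step.toarray())
print(same_vocabulary, same_values)
```

The two-step route is useful when the raw counts are needed as well. Otherwise the one-step vectorizer keeps the pipeline shorter.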

In [0]:
X_train_tfidf_vec.toarray()

array([[0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    , 0.    , 0.0851, 0.    , 0.    , 0.    , 0.    , 0.    ],
       [0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ],
       [0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ],
       [0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ],
       [0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ],
       [0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    , 0.    , 0.   