<a href="https://colab.research.google.com/github/kurek0010/machine-learing-bootcamp/blob/main/supervised/05_case_studies/04_movie_reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### scikit-learn
Library homepage: [https://scikit-learn.org](https://scikit-learn.org)  

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

A fundamental machine learning library for Python.

To install scikit-learn, run the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, run the command below:
```
!pip install --upgrade scikit-learn
```
This course was developed against version `0.22.1`.

### Table of contents:
1. [Import libraries](#0)
2. [Downloading the data](#1)
3. [Data exploration and preparation](#2)
4. [Model training](#3)
5. [Model evaluation](#4)
6. [Prediction with the model](#5)



### <a name='0'></a> Import libraries

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import sklearn

np.random.seed(42)
np.set_printoptions(precision=6, suppress=True, edgeitems=10, linewidth=1000, formatter=dict(float=lambda x: f'{x:.2f}'))
sklearn.__version__

### <a name='1'></a> Downloading the data

In [None]:
!wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/movie_reviews.zip

In [None]:
!unzip -q movie_reviews.zip

In [None]:
!pwd
!ls

In [None]:
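from sklearn.datasets import load_files

# load_files uses each subfolder name as a class label
raw_movie = load_files('movie_reviews')
# shallow copy, so later changes to the keys do not touch raw_movie
movie = raw_movie.copy()
movie.keys()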
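
The way `load_files` derives labels from folder names can be tried on a throwaway directory. The snippet below is a minimal sketch: the folder names, file names, and review texts are made up for illustration, and `encoding='utf-8'` is an optional argument (not used in the cell above) that makes `load_files` return `str` instead of `bytes`.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical two-class layout: <root>/neg/review.txt and <root>/pos/review.txt
samples = [('neg', 'dull plot'), ('pos', 'great cast')]

with tempfile.TemporaryDirectory() as root:
    for label, text in samples:
        folder = Path(root) / label
        folder.mkdir()
        (folder / 'review.txt').write_text(text, encoding='utf-8')
    # File contents are read eagerly, so the Bunch stays usable after cleanup
    toy = load_files(root, encoding='utf-8')

# Class names are the sorted subfolder names; target holds their indices
print(toy.target_names)
labelled = {text: toy.target_names[idx] for text, idx in zip(toy.data, toy.target)}
print(labelled)
```

The document order is shuffled by default, so the mapping is checked per document rather than by position.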

### <a name='2'></a> Data exploration and preparation

In [None]:
movie['data'][:10]

In [None]:
movie['target'][:10]

In [None]:
movie['target_names']

In [None]:
movie['filenames'][:2]

In [None]:
from sklearn.model_selection import train_test_split

# with the default test_size=0.25, 75% of the reviews go to training and 25% to testing
X_train, X_test, y_train, y_test = train_test_split(movie['data'], movie['target'], random_state=42)

print(f'X_train: {len(X_train)}')
print(f'X_test: {len(X_test)}')
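
To see how the default split sizes come about, here is a small sketch on made-up data. The `stratify` argument is an optional extra that the cell above does not use; it keeps the class proportions the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 documents, 4 per class
docs = [f'doc {i}' for i in range(8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Default test_size=0.25, so 8 samples split into 6 train and 2 test
d_train, d_test, l_train, l_test = train_test_split(
    docs, labels, random_state=42, stratify=labels)

print(len(d_train), len(d_test))  # 6 2
print(sorted(l_test))             # [0, 1]
```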

In [None]:
X_train[0]

In [None]:
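from sklearn.feature_extraction.text import TfidfVectorizer

# keep only the 3000 most frequent terms in the vocabulary
tfidf = TfidfVectorizer(max_features=3000)
# learn the vocabulary and IDF weights on the training set only,
# then reuse them for the test set to avoid leaking test information
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)

print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')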
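
The difference between `fit_transform` and `transform` is easier to see on a tiny corpus. This is a sketch with made-up sentences, not the movie review data: the vocabulary is fixed at fit time, and words that were not seen during fitting are dropped when new text is transformed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
train_docs = ['good movie', 'bad movie', 'good plot good cast']
new_docs = ['good unseen words']

toy_tfidf = TfidfVectorizer()
train_matrix = toy_tfidf.fit_transform(train_docs)  # learns vocabulary and IDF weights
new_matrix = toy_tfidf.transform(new_docs)          # reuses them, learns nothing new

vocab = sorted(toy_tfidf.vocabulary_)
print(vocab)               # ['bad', 'cast', 'good', 'movie', 'plot']
print(train_matrix.shape)  # (3, 5)
print(new_matrix.shape)    # (1, 5)
print(new_matrix.nnz)      # 1, only 'good' is in the vocabulary
```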

### <a name='3'></a> Model training

In [None]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train, y_train)
# score returns the mean accuracy on the given test data
classifier.score(X_test, y_test)
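
As an alternative to calling the vectorizer and the classifier separately, both steps can be chained. The sketch below uses `make_pipeline` on made-up texts; it is a variation for illustration, not what this notebook does, and the names `toy_texts`, `toy_targets`, and `toy_model` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = positive, 0 = negative
toy_texts = ['great film', 'wonderful acting', 'terrible film', 'awful acting']
toy_targets = [1, 1, 0, 0]

# The pipeline vectorizes raw text and passes the result to the classifier
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_targets)

toy_model_pred = toy_model.predict(['great wonderful', 'terrible awful'])
print(toy_model_pred)  # [1 0]
```

With a pipeline, raw strings can be passed straight to `predict`, so the fitted vectorizer cannot be forgotten or mismatched at prediction time.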

### <a name='4'></a> Model evaluation

In [None]:
from sklearn.metrics import confusion_matrix

y_pred = classifier.predict(X_test)
# rows are true labels, columns are predicted labels
cm = confusion_matrix(y_test, y_pred)
cm
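
For a binary problem the four cells of the confusion matrix can be unpacked by name. The labels below are made up for illustration; the sketch shows the cell order returned by `ravel()` and how accuracy follows from the counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = negative, 1 = positive
cm_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
cm_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

toy_cm = confusion_matrix(cm_true, cm_pred)
# Row-major order for labels [0, 1]: TN, FP, FN, TP
tn, fp, fn, tp = toy_cm.ravel()
toy_accuracy = (tn + tp) / toy_cm.sum()

print(tn, fp, fn, tp)  # 3 1 1 3
print(toy_accuracy)    # 0.75
```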

In [None]:
import plotly.figure_factory as ff

def plot_confusion_matrix(cm):
    # flip the rows so the heatmap shows the 'negative' row on top
    # (the y axis of the heatmap is drawn from bottom to top)
    cm = cm[::-1]
    # after the flip, the first row holds the true 'positive' class
    cm = pd.DataFrame(cm, columns=['negative', 'positive'], index=['positive', 'negative'])

    fig = ff.create_annotated_heatmap(z=cm.values, x=list(cm.columns), y=list(cm.index),
                                      colorscale='ice', showscale=True, reversescale=True)
    fig.update_layout(width=400, height=400, title='Confusion Matrix', font_size=16)
    fig.show()

plot_confusion_matrix(cm)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=['negative', 'positive']))
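
The precision, recall, and F1 values in the report can be reproduced by hand. This is a sketch on made-up labels that counts the cells directly and compares the results with the corresponding scikit-learn functions for the positive class.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = positive, 0 = negative
rep_true = [1, 1, 1, 1, 0, 0, 0, 0]
rep_pred = [1, 1, 0, 0, 1, 0, 0, 0]

pairs = list(zip(rep_true, rep_pred))
n_tp = sum(t == 1 and p == 1 for t, p in pairs)  # 2
n_fp = sum(t == 0 and p == 1 for t, p in pairs)  # 1
n_fn = sum(t == 1 and p == 0 for t, p in pairs)  # 2

precision = n_tp / (n_tp + n_fp)                    # share of predicted positives that are correct
recall = n_tp / (n_tp + n_fn)                       # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, precision_score(rep_true, rep_pred))
print(recall, recall_score(rep_true, rep_pred))
print(f1, f1_score(rep_true, rep_pred))
```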

### <a name='5'></a> Prediction with the model

In [None]:
new_reviews = ['It was awesome! Very interesting story.',
               'I cannot recommend this film. Short and awful.',
               'Very long and boring. Don\'t waste your time.',
               'Well-organized and quite interesting.']

new_reviews_tfidf = tfidf.transform(new_reviews)
new_reviews_tfidf

In [None]:
new_reviews_tfidf.toarray()

In [None]:
new_reviews_pred = classifier.predict(new_reviews_tfidf)
new_reviews_pred

In [None]:
new_reviews_prob = classifier.predict_proba(new_reviews_tfidf)
new_reviews_prob

In [None]:
np.argmax(new_reviews_prob, axis=1)
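
The `argmax` over `predict_proba` gives a column index, which matches the label only because the classes here are `0` and `1`. In general the index has to be mapped through `classes_`. The sketch below uses a made-up count matrix to show that this mapping reproduces `predict`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count features for four documents and two classes
nb_X = np.array([[2, 0], [3, 1], [0, 2], [1, 3]])
nb_y = np.array([0, 0, 1, 1])

toy_nb = MultinomialNB().fit(nb_X, nb_y)
nb_prob = toy_nb.predict_proba(nb_X)

# Each row of predict_proba sums to 1; columns follow the order of classes_
nb_from_prob = toy_nb.classes_[np.argmax(nb_prob, axis=1)]

print(nb_prob.sum(axis=1))
print(nb_from_prob)
print(toy_nb.predict(nb_X))
```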

In [None]:
movie['target_names']

In [None]:
for review, target, prob in zip(new_reviews, new_reviews_pred, new_reviews_prob):
    print(f"{review} -> {movie['target_names'][target]} -> {prob[target]:.4f}")