# Lab - Classification 3

## Tasks
1. The `spam.csv` dataset contains e-mail messages labeled as `spam` or `ham`. Use [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to generate word-count vectors for each message. Train and compare several classification models, such as `MultinomialNB`, `KNeighborsClassifier`, `LogisticRegression`, etc.

2. The `IMDB` dataset contains positive and negative movie reviews. Vectorize the review texts, then use the [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) function to compare:

    - accuracy and F1 score,
    - training time (fit time),
    - testing time (score time)

    of selected classification models, including `MultinomialNB`, `KNeighborsClassifier`, and `LogisticRegression`.

## Task 1

In [3]:
import pandas as pd

In [4]:
spam = pd.read_csv('spam.csv', encoding='latin-1')
spam.head(10)

|   | Category | Message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


In [5]:
spam.describe()

|   | Category | Message |
|---|----------|---------|
| count | 5572 | 5572 |
| unique | 2 | 5157 |
| top | ham | Sorry, I'll call later |
| freq | 4825 | 30 |


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(spam.Message)
y = spam.Category.map({'ham': 0, 'spam': 1})

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=24)

In [8]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

models = {
    'Naive Bayes': MultinomialNB(),
    'KNN': KNeighborsClassifier(),
    'Logistic Regression': LogisticRegression(max_iter=1000)
}

In [9]:
from sklearn.metrics import accuracy_score, f1_score

for name, model in models.items():
    model.fit(X_train, y_train)
    # Predict once and reuse the result for both metrics
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print(f'{name} - Accuracy: {accuracy:.4f}, F1 Score: {f1:.4f}')

Naive Bayes - Accuracy: 0.9785, F1 Score: 0.9195
KNN - Accuracy: 0.9193, F1 Score: 0.5631
Logistic Regression - Accuracy: 0.9785, F1 Score: 0.9130


In [10]:
spam[["Category"]].value_counts()

Category
ham         4825
spam         747
Name: count, dtype: int64
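
The class counts above put the accuracy figures in context. As a rough sketch, the snippet below rebuilds only the label distribution (4825 `ham`, 747 `spam`) and scores a majority-class baseline with scikit-learn's `DummyClassifier`. The placeholder feature matrix and all variable names are assumptions made for illustration, and the baseline is scored on the full label set rather than on the test split used earlier.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Labels rebuilt from the class counts above: 0 = ham, 1 = spam
y_counts = np.array([0] * 4825 + [1] * 747)
# Placeholder features; the baseline ignores them entirely
X_placeholder = np.zeros((len(y_counts), 1))

# Always predict the most frequent class (ham)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_counts)
y_baseline = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_counts, y_baseline)
# zero_division=0 avoids a warning: the baseline never predicts spam
baseline_f1 = f1_score(y_counts, y_baseline, zero_division=0)

print(f"Baseline - Accuracy: {baseline_accuracy:.4f}, F1 Score: {baseline_f1:.4f}")
```

Always predicting `ham` is already right for 4825 of the 5572 messages while detecting no spam at all, which is why the F1 score separates the three models more clearly than accuracy does.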

## Task 2

In [11]:
imdb = pd.read_csv('IMDB Dataset.csv')
imdb.head(10)

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| 5 | Probably my all-time favorite movie, a story o... | positive |
| 6 | I sure would like to see a resurrection of a u... | positive |
| 7 | This show was an amazing, fresh & innovative i... | negative |
| 8 | Encouraged by the positive comments about this... | negative |
| 9 | If you like original gut wrenching laughter yo... | positive |


In [12]:
imdb.describe()

|   | review | sentiment |
|---|--------|-----------|
| count | 50000 | 50000 |
| unique | 49582 | 2 |
| top | Loved today's show!!! It was a variety and not... | positive |
| freq | 5 | 25000 |


In [13]:
cv = CountVectorizer()
X = cv.fit_transform(imdb.review)
y = imdb.sentiment.map({'positive': 1, 'negative': 0})

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=24)

In [15]:
models = {
    'Naive Bayes': MultinomialNB(),
    'KNN': KNeighborsClassifier(),
    'Logistic Regression': LogisticRegression(max_iter=1000)
}

In [18]:
from sklearn.model_selection import cross_validate

for name, model in models.items():
    # 5-fold cross-validation; fit_time and score_time are measured per fold
    scores = cross_validate(model, X_train, y_train, cv=5, scoring=['accuracy', 'f1'])
    print(
        f'{name} - Accuracy: {scores["test_accuracy"].mean():.4f}, '
        f'F1 Score: {scores["test_f1"].mean():.4f}, '
        f'Fit Time: {scores["fit_time"].mean():.4f} seconds, '
        f'Score Time: {scores["score_time"].mean():.4f} seconds'
    )

Naive Bayes - Accuracy: 0.8472, F1 Score: 0.8427, Fit Time: 0.0227 seconds, Score Time: 0.0081 seconds
KNN - Accuracy: 0.6293, F1 Score: 0.6507, Fit Time: 0.0210 seconds, Score Time: 17.4799 seconds
Logistic Regression - Accuracy: 0.8886, F1 Score: 0.8894, Fit Time: 19.3545 seconds, Score Time: 0.0090 seconds
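
In the cells above, the vocabulary is learned from all reviews before cross-validation starts. One possible variation, sketched here, wraps `CountVectorizer` and the classifier in a scikit-learn pipeline so that the vectorizer is refit on the training part of every fold. The tiny corpus is invented for illustration and only stands in for the IMDB reviews, so the scores it yields are not meaningful; the names `toy_reviews`, `toy_labels`, and `toy_scores` are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews standing in for the IMDB texts
toy_reviews = [
    "wonderful film great acting", "great story wonderful cast",
    "loved it great fun", "wonderful fun film",
    "great cast loved the story", "loved the acting",
    "terrible film boring plot", "boring story awful cast",
    "hated it awful mess", "terrible boring mess",
    "awful plot hated the story", "hated the acting",
]
toy_labels = np.array([1] * 6 + [0] * 6)  # 1 = positive, 0 = negative

# The vectorizer is part of the estimator, so each fold fits its own vocabulary
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_scores = cross_validate(pipeline, toy_reviews, toy_labels, cv=3,
                            scoring=["accuracy", "f1"])

for key in ["test_accuracy", "test_f1", "fit_time", "score_time"]:
    print(f"{key}: {toy_scores[key].mean():.4f}")
```

With this setup, `fit_time` also includes the vectorization cost, so timings are not directly comparable with the ones reported above.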
