# Lab 6
# Mariah Noelle Cornelio
# **UTA ID:** 1002053287
# ***My solutions for Exercises 6-8***
---

Scikit-learn provides a large variety of algorithms for common machine learning tasks, such as:

* Classification
* Regression
* Clustering
* Feature Selection
* Anomaly Detection

It also provides some datasets that you can use to test these algorithms:

* Classification Datasets:
    * Breast cancer Wisconsin
    * Iris plants (3-classes)
    * Optical recognition of handwritten digits (10-classes)
    * Wine (3-classes)
    * Forest covertypes (7-classes)

* Regression Datasets: 
    * Boston house prices (removed in scikit-learn 1.2)
    * Diabetes
    * Linnerrud (multi-output regression)
    * California Housing

* Image: 
    * The Olivetti faces
    * The Labeled Faces in the Wild face recognition

* NLP:
    * The 20 newsgroups text dataset
    * Reuters Corpus Volume I (RCV1)

* Other:
    * Kddcup 99 (intrusion detection)

## Exercises

1. Use the full [Kddcup](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset to compare classification performance of 3 different classifiers. 
    * Separate the data into train, validation, and test. 
    * Use accuracy as the metric for assessing performance. 
    * For each classifier, identify the hyperparameters. Perform optimization over at least 2 hyperparameters.   
    * Compare the performance of the optimal configuration of the classifiers.

2. Pick the best algorithm in question 1. Create an ensemble of at least 25 models, and use them for the classification task. Identify the top and bottom 10% of the data in terms of uncertainty of the decision.

3. Use 2 different feature selection algorithms to identify the 10 most important features for the task in question 1. Retrain the classifiers from question 1 with just this subset of features and compare performance.

4. Use the same data, removing the labels, and compare performance of 3 different clustering algorithms. Can you find clusters for each of the classes in question 1? 

5. Can you identify any clusters within the top/bottom 10% identified in question 2? What are their characteristics?

6. Use the "SA" dataset to compare the performance of 3 different anomaly detection algorithms.

7. Create a subsample of 250 datapoints, redo question 6, using Leave-one-out as the method of evaluation.

8. Use the feature selection algorithm to identify the 5 most important features for the task in question 6, for each algorithm. Does the anomaly detection improve using fewer features?

## Quick look at the data

In [1]:
from sklearn.datasets import fetch_kddcup99
D=fetch_kddcup99()

In [2]:
dir(D)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [3]:
print(D["DESCR"])

.. _kddcup99_dataset:

Kddcup 99 dataset
-----------------

The KDD Cup '99 dataset was created by processing the tcpdump portions
of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,
created by MIT Lincoln Lab [2]_. The artificial data (described on the `dataset's
homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>`_) was
generated using a closed network and hand-injected attacks to produce a
large number of different types of attack with normal activity in the
background. As the initial goal was to produce a large training set for
supervised learning algorithms, there is a large proportion (80.1%) of
abnormal data which is unrealistic in real world, and inappropriate for
unsupervised anomaly detection which aims at detecting 'abnormal' data, i.e.:

* qualitatively different from normal data
* in large minority among the observations.

We thus transform the KDD Data set into two different data sets: SA and SF.

* SA is obtained by simply selecting all

In [5]:
import numpy as np
np.unique(D["target"])

array([b'back.', b'buffer_overflow.', b'ftp_write.', b'guess_passwd.',
       b'imap.', b'ipsweep.', b'land.', b'loadmodule.', b'multihop.',
       b'neptune.', b'nmap.', b'normal.', b'perl.', b'phf.', b'pod.',
       b'portsweep.', b'rootkit.', b'satan.', b'smurf.', b'spy.',
       b'teardrop.', b'warezclient.', b'warezmaster.'], dtype=object)

In [6]:
len(np.unique(D["target"]))

23

In [7]:
D["feature_names"]

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate']

# Exercise 6 - Using the "SA" dataset

This page gives more details on how to load the "SA" subset of the kddcup99 dataset:
https://scikit-learn.org/0.19/modules/generated/sklearn.datasets.fetch_kddcup99.html

In [56]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import LeaveOneOut
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

In [28]:
X, y=fetch_kddcup99(subset="SA", percent10=True, random_state=11, return_X_y=True, as_frame=True)

# You can choose to do percent10=True which takes 10% of the SA dataset
# I ran it as percent10=False and it took a very long time (~7 minutes) -> 976,158 rows
# percent10=True -> 100,655 rows (still pretty good for training a model)

In [29]:
print(X.shape, y.shape)

(100655, 41) (100655,)


## Preprocessing X and y df's
### X

In [30]:
X.head(5)

| | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_count | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | b'tcp' | b'http' | b'SF' | 181 | 5450 | 0 | 0 | 0 | 0 | ... | 9 | 9 | 1.0 | 0.0 | 0.11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0 | b'tcp' | b'http' | b'SF' | 239 | 486 | 0 | 0 | 0 | 0 | ... | 19 | 19 | 1.0 | 0.0 | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0 | b'tcp' | b'http' | b'SF' | 235 | 1337 | 0 | 0 | 0 | 0 | ... | 29 | 29 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0 | b'tcp' | b'http' | b'SF' | 219 | 1337 | 0 | 0 | 0 | 0 | ... | 39 | 39 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0 | b'tcp' | b'http' | b'SF' | 217 | 2032 | 0 | 0 | 0 | 0 | ... | 49 | 49 | 1.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [31]:
X.dtypes # Like what we did before in the preprocessing of kddcup99, we need to turn 
# these into numeric values of the appropriate type

duration                       object
protocol_type                  object
service                        object
flag                           object
src_bytes                      object
dst_bytes                      object
land                           object
wrong_fragment                 object
urgent                         object
hot                            object
num_failed_logins              object
logged_in                      object
num_compromised                object
root_shell                     object
su_attempted                   object
num_root                       object
num_file_creations             object
num_shells                     object
num_access_files               object
num_outbound_cmds              object
is_host_login                  object
is_guest_login                 object
count                          object
srv_count                      object
serror_rate                    object
srv_serror_rate                object
rerror_rate 

In [32]:
continuous_columns=["duration", "src_bytes", "dst_bytes", "wrong_fragment", "urgent", "hot",
    "num_failed_logins", "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds", 
    "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate", 
    "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", 
    "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate", 
    "dst_host_diff_srv_rate", "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate", 
    "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate", 
    "dst_host_srv_rerror_rate"]

X[continuous_columns]=X[continuous_columns].apply(pd.to_numeric, errors="coerce")
object_cols=X.select_dtypes(include=["object"]).columns
print(object_cols)

# Turn these object columns into numbers

Index(['protocol_type', 'service', 'flag', 'land', 'logged_in',
       'is_host_login', 'is_guest_login'],
      dtype='object')


In [33]:
# Check what values need to be converted

print(X["land"].value_counts())
print(X["logged_in"].value_counts())
print(X["is_host_login"].value_counts())
print(X["is_guest_login"].value_counts())
print(X["protocol_type"].value_counts())
print(X["service"].value_counts())
print(X["flag"].value_counts())

0    100654
1         1
Name: land, dtype: int64
1    69980
0    30675
Name: logged_in, dtype: int64
0    100655
Name: is_host_login, dtype: int64
0    100281
1       374
Name: is_guest_login, dtype: int64
b'tcp'     77766
b'udp'     19186
b'icmp'     3703
Name: protocol_type, dtype: int64
b'http'           61916
b'smtp'            9598
b'private'         8228
b'domain_u'        5862
b'other'           5646
b'ftp_data'        3805
b'ecr_i'           2746
b'urp_i'            537
b'finger'           469
b'eco_i'            403
b'ntp_u'            380
b'ftp'              376
b'telnet'           221
b'auth'             220
b'pop_3'             79
b'time'              53
b'IRC'               42
b'urh_i'             14
b'X11'                9
b'whois'              5
b'domain'             4
b'shell'              3
b'exec'               3
b'netstat'            3
b'nntp'               2
b'netbios_dgm'        2
b'discard'            2
b'printer'            2
b'Z39_50'             2
b'remote_job'

In [34]:
# These four flag columns hold 0/1 values stored as Python objects; cast them to integers
# (replacing 0 with 0 and 1 with 1, as in my previous code, does not change the values)
for col in ["land", "logged_in", "is_host_login", "is_guest_login"]:
    X[col]=X[col].astype(int)

protocol_mapping={b'tcp':0, b'udp':1, b'icmp':2} # This is updated
service_mapping={b'http': 0, b'smtp': 1, b'private': 2, b'domain_u': 3, b'other': 4, b'ftp_data': 5,
    b'ecr_i': 6, b'urp_i': 7, b'finger': 8, b'eco_i': 9, b'ntp_u': 10, b'ftp': 11, b'telnet': 12,
    b'auth': 13, b'pop_3': 14, b'time': 15, b'IRC': 16, b'urh_i': 17, b'X11': 18, b'whois': 19,
    b'domain': 20, b'shell': 21, b'exec': 22, b'netstat': 23, b'nntp': 24, b'netbios_dgm': 25,
    b'discard': 26, b'printer': 27, b'Z39_50': 28, b'remote_job': 29, b'sunrpc': 30, b'rje': 31,
    b'tim_i': 32, b'csnet_ns': 33, b'nnsp': 34, b'ctf': 35, b'courier': 36, b'iso_tsap': 37, b'red_i': 38,
    b'kshell': 39, b'pop_2': 40, b'netbios_ssn': 41, b'bgp': 42, b'hostnames': 43, b'uucp_path': 44,
    b'tftp_u': 45, b'ssh': 46, b'supdup': 47} # Also updated
flag_mapping={b'SF': 0, b'REJ': 1, b'S0': 2, b'RSTO': 3, b'S1': 4, b'RSTR': 5, b'S2': 6, b'S3': 7,
    b'OTH': 8} # Updated

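# Assign the mapped columns back to X: replace(..., inplace=True) on a selected column may not
# modify X under pandas copy-on-write. astype(int) fails loudly if a value has no mapping.
X["protocol_type"]=X["protocol_type"].map(protocol_mapping).astype(int)
X["service"]=X["service"].map(service_mapping).astype(int)
X["flag"]=X["flag"].map(flag_mapping).astype(int)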

print(X["land"].value_counts())
print(X["logged_in"].value_counts())
print(X["is_host_login"].value_counts())
print(X["is_guest_login"].value_counts())
print(X["protocol_type"].value_counts())
print(X["service"].value_counts())
print(X["flag"].value_counts())

0    100654
1         1
Name: land, dtype: int64
1    69980
0    30675
Name: logged_in, dtype: int64
0    100655
Name: is_host_login, dtype: int64
0    100281
1       374
Name: is_guest_login, dtype: int64
0    77766
1    19186
2     3703
Name: protocol_type, dtype: int64
0     61916
1      9598
2      8228
3      5862
4      5646
5      3805
6      2746
7       537
8       469
9       403
10      380
11      376
12      221
13      220
14       79
15       53
16       42
17       14
18        9
19        5
20        4
21        3
22        3
23        3
24        2
25        2
26        2
27        2
28        2
29        2
30        2
31        2
32        2
33        1
34        1
35        1
36        1
37        1
38        1
39        1
40        1
41        1
42        1
43        1
44        1
45        1
46        1
47        1
Name: service, dtype: int64
0    94174
1     5511
2      785
3       70
4       54
5       36
6       17
7        7
8        1
Name: flag, dtype: int64
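
The hand-written mappings above assign integer codes by frequency rank, which gives the categories an arbitrary numeric order. As an alternative, here is a minimal sketch (not what the cells above ran) that decodes the byte strings and one-hot encodes them with pandas. The toy frame and the helper name `encode_categoricals` are assumptions made for illustration only.

```python
import pandas as pd

def encode_categoricals(df, columns):
    """Decode byte-string columns to str and one-hot encode them (illustrative helper)."""
    out = df.copy()
    for col in columns:
        # fetch_kddcup99 returns categorical values as bytes, e.g. b'tcp'
        out[col] = out[col].map(lambda v: v.decode() if isinstance(v, bytes) else v)
    # get_dummies creates one indicator column per category, with no implied ordering
    return pd.get_dummies(out, columns=columns, dtype=int)

# Toy rows in the same shape as the KDD features (values are made up for the example)
toy = pd.DataFrame({
    "protocol_type": [b"tcp", b"udp", b"icmp", b"tcp"],
    "flag": [b"SF", b"REJ", b"SF", b"S0"],
    "src_bytes": [181, 239, 235, 219],
})
encoded = encode_categoricals(toy, ["protocol_type", "flag"])
print(encoded.columns.tolist())
```

One-hot encoding adds one column per category, so `service` alone would add 48 columns here; whether that trade-off is worth it depends on the model.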

In [35]:
X.dtypes # Everything looks good

duration                         int64
protocol_type                    int64
service                          int64
flag                             int64
src_bytes                        int64
dst_bytes                        int64
land                             int64
wrong_fragment                   int64
urgent                           int64
hot                              int64
num_failed_logins                int64
logged_in                        int64
num_compromised                  int64
root_shell                       int64
su_attempted                     int64
num_root                         int64
num_file_creations               int64
num_shells                       int64
num_access_files                 int64
num_outbound_cmds                int64
is_host_login                    int64
is_guest_login                   int64
count                            int64
srv_count                        int64
serror_rate                    float64
srv_serror_rate          

In [36]:
X.isnull().sum() # No null values

duration                       0
protocol_type                  0
service                        0
flag                           0
src_bytes                      0
dst_bytes                      0
land                           0
wrong_fragment                 0
urgent                         0
hot                            0
num_failed_logins              0
logged_in                      0
num_compromised                0
root_shell                     0
su_attempted                   0
num_root                       0
num_file_creations             0
num_shells                     0
num_access_files               0
num_outbound_cmds              0
is_host_login                  0
is_guest_login                 0
count                          0
srv_count                      0
serror_rate                    0
srv_serror_rate                0
rerror_rate                    0
srv_rerror_rate                0
same_srv_rate                  0
diff_srv_rate                  0
srv_diff_h

### y

In [37]:
y.head(5)

0    b'normal.'
1    b'normal.'
2    b'normal.'
3    b'normal.'
4    b'normal.'
Name: labels, dtype: object

In [43]:
print(y.dtypes)
print(y.value_counts())

# We need to turn these into numbers

object
b'normal.'             97278
b'smurf.'               2400
b'neptune.'              893
b'back.'                  30
b'satan.'                 14
b'ipsweep.'               11
b'warezclient.'            9
b'portsweep.'              7
b'teardrop.'               6
b'nmap.'                   3
b'pod.'                    1
b'buffer_overflow.'        1
b'multihop.'               1
b'guess_passwd.'           1
Name: labels, dtype: int64


In [46]:
target_mapping={b'normal.':0, b'smurf.':1, b'neptune.':2, b'back.':3, b'satan.':4, b'ipsweep.':5,
               b'warezclient.':6, b'portsweep.':7, b'teardrop.':8, b'nmap.':9, b'pod.':10,
               b'buffer_overflow.':11, b'multihop.':12, b'guess_passwd.':13}

y.replace(target_mapping, inplace=True)
print(y.dtypes)
print(y.value_counts())

int64
0     97278
1      2400
2       893
3        30
4        14
5        11
6         9
7         7
8         6
9         3
10        1
11        1
12        1
13        1
Name: labels, dtype: int64
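
For anomaly detection the target only needs two values: normal (0) or anomaly (1). The sketch below is an illustration rather than part of the cells above. It collapses the label counts printed earlier into that binary view and computes the share of anomalies, which is one possible starting point for `contamination` or `nu`. The helper name `to_binary` is hypothetical.

```python
# Label counts copied from the y.value_counts() output above
label_counts = {
    b"normal.": 97278, b"smurf.": 2400, b"neptune.": 893, b"back.": 30,
    b"satan.": 14, b"ipsweep.": 11, b"warezclient.": 9, b"portsweep.": 7,
    b"teardrop.": 6, b"nmap.": 3, b"pod.": 1, b"buffer_overflow.": 1,
    b"multihop.": 1, b"guess_passwd.": 1,
}

def to_binary(label):
    """Return 0 for normal traffic and 1 for any attack type."""
    return 0 if label == b"normal." else 1

total = sum(label_counts.values())
n_anomalies = sum(n for label, n in label_counts.items() if to_binary(label) == 1)
anomaly_fraction = n_anomalies / total

print(total, n_anomalies, round(anomaly_fraction, 4))
```

With the integer-mapped Series from the cell above, the equivalent binary target would be `(y != 0).astype(int)`, since `b'normal.'` was mapped to 0. The anomaly share works out to roughly 3.4%, well below the 0.1 used for `contamination` and `nu` later on.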


## Split into training and test datasets

In [49]:
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.2, random_state=11)

print(X.shape, X_train.shape, X_test.shape)
print(y.shape, y_train.shape, y_test.shape)

(100655, 41) (80524, 41) (20131, 41)
(100655,) (80524,) (20131,)
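
The split above is random, so with very rare classes the share of anomalies in the train and test sets can drift apart. A hedged sketch of a stratified split on a binary target follows. It uses synthetic data, not the KDD features, and the names `X_toy` and `y_toy` are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
X_toy = rng.normal(size=(1000, 4))   # synthetic features, not KDD data
y_toy = np.zeros(1000, dtype=int)
y_toy[:50] = 1                       # 5% anomalies

# stratify keeps the anomaly share the same in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=11, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```

Stratifying on the original 14-class labels would fail here because several classes have a single member, which is one more reason to stratify on the binary normal/anomaly label instead.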


## Defining my anomaly detection algorithms
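For my 3 anomaly detection algorithms, I chose Isolation Forest and One-Class SVM, with Logistic Regression as a baseline. Note that Logistic Regression is a supervised classifier trained on the labels, while the other two are unsupervised detectors. Because the features span very different ranges, we first need to standardize the data.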

In [51]:
scaler=StandardScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)

# Defining my 3 models
iso_forest=IsolationForest(contamination=0.1, random_state=11)
oc_svm=OneClassSVM(kernel="rbf", gamma="auto", nu=0.1)
log_reg=LogisticRegression()

# Training
iso_forest.fit(X_train_scaled)
iso_pred=iso_forest.predict(X_test_scaled)
oc_svm.fit(X_train_scaled)
oc_svm_pred=oc_svm.predict(X_test_scaled)
log_reg.fit(X_train_scaled, y_train)
log_reg_pred=log_reg.predict(X_test_scaled)

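# 1 = anomaly, 0 = no anomaly
iso_pred=np.where(iso_pred==-1, 1, 0)
oc_svm_pred=np.where(oc_svm_pred==-1, 1, 0)
# Note: logistic regression predicts the multi-class labels 0-13 from target_mapping
# (0 = normal), not a binary anomaly flag. y_test is also multi-class, so in the reports
# below the detectors' "1" is scored against class 1 (smurf) only.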

# Evaluation
print("Isolation Forest Classification Report:")
print(classification_report(y_test, iso_pred))
print("One-Class SVM Classification Report:")
print(classification_report(y_test, oc_svm_pred))
print("Logistic Regression Classification Report:")
print(classification_report(y_test, log_reg_pred))

Isolation Forest Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.93      0.97     19469
           1       0.25      1.00      0.40       478
           2       0.00      0.00      0.00       170
           3       0.00      0.00      0.00         7
           4       0.00      0.00      0.00         3
           5       0.00      0.00      0.00         1
           7       0.00      0.00      0.00         1
           8       0.00      0.00      0.00         1
           9       0.00      0.00      0.00         1

    accuracy                           0.93     20131
   macro avg       0.14      0.21      0.15     20131
weighted avg       0.97      0.93      0.94     20131

One-Class SVM Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.92      0.95     19469
           1       0.18      0.84      0.30       478
           2       0.00      0.00      0.00       170
 



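Accuracy score results for the three models:
- Isolation Forest (IF): 0.93
- One-Class SVM (OC SVM): 0.91
- Logistic Regression (LR): 1

Logistic Regression scored highest, indicating that it handles multi-class classification well. It is a supervised classifier trained on the labels, though, so this is not a like-for-like comparison with the two unsupervised detectors. Isolation Forest and One-Class SVM often misclassified normal traffic and produced many false positives (precision of 0.25 and 0.18 for class 1). Their binary output was also scored against the multi-class `y_test`, so every attack other than smurf (class 1), such as the 170 neptune records, counts as a miss regardless of what the detector predicted.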
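
To compare the detectors on their own terms, the labels can be reduced to 1 = anomaly and 0 = normal before scoring. The sketch below shows that evaluation on synthetic data. It is an illustration under stated assumptions, not a rerun of the cell above: the data, `contamination=0.05`, `nu=0.05`, and `gamma="scale"` are all choices made for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the SA data: 950 normal points, 50 scattered anomalies
rng = np.random.default_rng(11)
X_toy = np.vstack([rng.normal(0, 1, size=(950, 5)), rng.uniform(-8, 8, size=(50, 5))])
y_toy = np.array([0] * 950 + [1] * 50)   # 1 = anomaly, 0 = normal

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=11, stratify=y_toy
)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

detectors = {
    "Isolation Forest": IsolationForest(contamination=0.05, random_state=11),
    "One-Class SVM": OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
}

auc = {}
binary_preds = {}
for name, det in detectors.items():
    det.fit(X_tr_s)  # unsupervised: the labels are not used for fitting
    # predict returns -1 for outliers and +1 for inliers; remap to 1/0
    binary_preds[name] = np.where(det.predict(X_te_s) == -1, 1, 0)
    # decision_function is higher for inliers, so negate it to get an anomaly score
    auc[name] = roc_auc_score(y_te, -det.decision_function(X_te_s))
    print(name, "ROC AUC:", round(auc[name], 3))
    print(classification_report(y_te, binary_preds[name], zero_division=0))
```

ROC AUC is included because accuracy is dominated by the normal class when anomalies are only a few percent of the data.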

# Exercise 7

In [55]:
# Creating the subsample of 250 
X_subsample=X.sample(n=250, random_state=11)
y_subsample=y.loc[X_subsample.index]
X_subsample_scaled=scaler.fit_transform(X_subsample)
loo=LeaveOneOut() # leave one out cross-validation evaluation algorithm

# My classifiers
models={"Isolation Forest": IsolationForest(random_state=11), "One-Class SVM": OneClassSVM(),
    "Logistic Regression": LogisticRegression(class_weight="balanced")}
results={name: [] for name in models.keys()}

for train_index, test_index in loo.split(X_subsample_scaled):
    X_train, X_test=X_subsample_scaled[train_index], X_subsample_scaled[test_index]
    y_train, y_test=y_subsample.iloc[train_index], y_subsample.iloc[test_index]

    for name, model in models.items():
        model.fit(X_train, y_train)
        prediction=model.predict(X_test)
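        # IsolationForest and OneClassSVM return -1 (outlier) / +1 (inlier) and are not
        # remapped here, which is why the reports below contain a -1 row.
        # For Logistic Regression this keeps only class 1 (smurf) as the positive class;
        # every other prediction, including other attack types, becomes 0.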
        if name=="Logistic Regression":
            prediction=(prediction==1).astype(int)
        results[name].append(prediction[0])

results_df=pd.DataFrame(results, index=y_subsample.index)
for name in models.keys():
    print(f"{name} Classification Report:")
    print(classification_report(y_subsample, results_df[name]))

Isolation Forest Classification Report:
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00       0.0
           0       0.00      0.00      0.00     239.0
           1       0.00      0.00      0.00       5.0
           2       0.00      0.00      0.00       5.0
           3       0.00      0.00      0.00       1.0

    accuracy                           0.00     250.0
   macro avg       0.00      0.00      0.00     250.0
weighted avg       0.00      0.00      0.00     250.0

One-Class SVM Classification Report:
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00       0.0
           0       0.00      0.00      0.00     239.0
           1       0.00      0.00      0.00       5.0
           2       0.00      0.00      0.00       5.0
           3       0.00      0.00      0.00       1.0

    accuracy                           0.00     250.0
   macro avg       0.00      0.00      0.00     250.0



# Exercise 8
Here we will use RFE (recursive feature elimination) again to choose our top 5 features and re-evaluate our models.

In [62]:
# Redefining because there was an inconsistency with the shape of X_train_scaled and y_train
# from previous exercises
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.2, random_state=11)
scaler=StandardScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)

feature_names=X.columns.tolist()
model=RandomForestClassifier(n_estimators=100, random_state=11)
rfe=RFE(estimator=model, n_features_to_select=5)
rfe.fit(X_train_scaled, y_train)
ranking=rfe.ranking_

selected_features=np.where(ranking==1)[0]
print("Selected Feature Names:") # Top 5 features
for index in selected_features:
    print(feature_names[index])
    print("\n")
    
# Retrain only using the top 5 features
X_train_selected=X_train_scaled[:, selected_features]
X_test_selected=X_test_scaled[:, selected_features]

for name, model in models.items(): # This is not using the subset with 250 samples
    model.fit(X_train_selected, y_train)
    y_pred_selected=model.predict(X_test_selected)
    print(f"{name} Classification Report (Selected Features):")
    print(classification_report(y_test, y_pred_selected))

Selected Feature Names:
protocol_type


count


srv_count


same_srv_rate


dst_host_srv_serror_rate


Isolation Forest Classification Report (Selected Features):
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00       0.0
           0       0.00      0.00      0.00   19469.0
           1       0.00      0.00      0.00     478.0
           2       0.00      0.00      0.00     170.0
           3       0.00      0.00      0.00       7.0
           4       0.00      0.00      0.00       3.0
           5       0.00      0.00      0.00       1.0
           7       0.00      0.00      0.00       1.0
           8       0.00      0.00      0.00       1.0
           9       0.00      0.00      0.00       1.0

    accuracy                           0.00   20131.0
   macro avg       0.00      0.00      0.00   20131.0
weighted avg       0.00      0.00      0.00   20131.0





One-Class SVM Classification Report (Selected Features):
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00       0.0
           0       0.00      0.00      0.00   19469.0
           1       0.00      0.00      0.00     478.0
           2       0.00      0.00      0.00     170.0
           3       0.00      0.00      0.00       7.0
           4       0.00      0.00      0.00       3.0
           5       0.00      0.00      0.00       1.0
           7       0.00      0.00      0.00       1.0
           8       0.00      0.00      0.00       1.0
           9       0.00      0.00      0.00       1.0

    accuracy                           0.00   20131.0
   macro avg       0.00      0.00      0.00   20131.0
weighted avg       0.00      0.00      0.00   20131.0





Logistic Regression Classification Report (Selected Features):
              precision    recall  f1-score   support

           0       1.00      0.46      0.63     19469
           1       0.99      1.00      0.99       478
           2       0.92      0.94      0.93       170
           3       0.00      0.57      0.00         7
           4       0.00      0.33      0.01         3
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         0
           7       0.00      0.00      0.00         1
           8       0.00      1.00      0.01         1
           9       0.00      0.00      0.00         1
          11       0.00      0.00      0.00         0
          12       0.00      0.00      0.00         0
          13       0.00      0.00      0.00         0

    accuracy                           0.48     20131
   macro avg       0.22      0.33      0.20     20131
weighted avg       1.00      0.48      0.64     20131



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


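No, it did not improve. If anything it got worse. Logistic Regression dropped from the 98-100% accuracy range seen with all features in Exercise 6 to 48% with the five selected features, and it hit the iteration limit before converging. This Logistic Regression also uses `class_weight="balanced"`, unlike the one in Exercise 6, so the drop is not due to feature selection alone. The 0.00 scores for Isolation Forest and One-Class SVM do not reflect detection quality: their -1/+1 predictions were compared directly against the multi-class labels without being remapped to 0/1. The same five features, chosen by RFE with a random forest, were reused for all three models rather than selected per algorithm.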
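
One way to check whether fewer features help an unsupervised detector is to select features against a binary target and compare ROC AUC with and without the selection. The sketch below does this for Isolation Forest on synthetic data where only two of eight features carry the anomaly signal. The data and the `detector_auc` helper are assumptions for illustration, and the outcome on the real SA data may differ.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data: 8 features, anomalies differ from normal points only in features 0 and 1
rng = np.random.default_rng(11)
X_toy = rng.normal(0, 1, size=(800, 8))
y_toy = np.zeros(800, dtype=int)
y_toy[:40] = 1   # 1 = anomaly
# Push the two informative features far from the normal range, with random signs
X_toy[:40, :2] = rng.uniform(5, 9, size=(40, 2)) * rng.choice([-1, 1], size=(40, 2))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=11, stratify=y_toy
)

# Supervised selector (RFE with a random forest), fit on the binary target
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=11), n_features_to_select=2)
rfe.fit(X_tr, y_tr)
selected = np.flatnonzero(rfe.support_)

def detector_auc(cols):
    """Fit an Isolation Forest on the given columns and return the test ROC AUC."""
    det = IsolationForest(random_state=11).fit(X_tr[:, cols])
    # decision_function is higher for inliers, so negate it to get an anomaly score
    return roc_auc_score(y_te, -det.decision_function(X_te[:, cols]))

auc_all = detector_auc(np.arange(X_toy.shape[1]))
auc_selected = detector_auc(selected)
print("selected columns:", selected.tolist())
print("AUC all features:", round(auc_all, 3), "| AUC selected:", round(auc_selected, 3))
```

The selector here is supervised, so it needs labels even though the detector does not. Repeating the comparison for each detector would answer the "for each algorithm" part of the question.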

# Done!