Selecting the best `k` byte-bigram features.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
from sklearn.feature_selection import SelectKBest

In [4]:
from scipy import sparse

In [5]:
from matplotlib import pyplot as plt
from matplotlib import style
style.use(style='seaborn-deep')

In [6]:
import numpy as np
import os
import pandas as pd

In [7]:
src_path = '/content/drive/MyDrive/Applied-AI/Assignment-16/data'
npz_file = os.path.join(src_path, 'all_bytes_bigram.npz')
print(npz_file)

/content/drive/MyDrive/Applied-AI/Assignment-16/data/all_bytes_bigram.npz


In [8]:
# npz_file already holds the full path, so it is not joined with src_path again.
file_size = round(os.path.getsize(filename=npz_file) / (1024**2), 2)
print("File size: {} MB.".format(file_size))

File size: 1042.47 MB.


Note: When I created the `all_bytes_bigram.npz` file, I read the files from disk in the same order in which they are listed in `trainLabels.csv`. Row `i` of the feature matrix therefore corresponds to row `i` of the label file, so the class labels already line up with the bigram byte features and no re-mapping is needed.

In [9]:
X = sparse.load_npz(file=npz_file)
print(X.shape)

(10868, 66049)


In [10]:
label_df = pd.read_csv(filepath_or_buffer=os.path.join(src_path, 'trainLabels.csv'))

In [11]:
y = label_df['Class'].values
print(y.shape)

(10868,)
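
The note above relies on the feature rows and the label rows being in the same order. A minimal sketch of how that assumption could be checked is given here. The helper `check_alignment`, the `ids_in_matrix_order` list, and the `Id` column name are all hypothetical and are not part of the original notebook; only the `Class` column is confirmed by the code above. The toy data stands in for the real matrix and label file.

```python
import pandas as pd
from scipy import sparse


def check_alignment(X, label_df, ids_in_matrix_order=None, id_col="Id"):
    """Raise ValueError if feature rows and labels cannot be paired row by row.

    `ids_in_matrix_order` and `id_col` are illustrative assumptions: the list
    would hold the file ids in the order used to build the matrix.
    """
    # Cheapest check: one label per feature row.
    if X.shape[0] != len(label_df):
        raise ValueError(
            "Row count mismatch: {} feature rows vs {} labels".format(
                X.shape[0], len(label_df)
            )
        )
    # Stronger check, only possible if the file order was recorded.
    if ids_in_matrix_order is not None:
        if list(ids_in_matrix_order) != label_df[id_col].tolist():
            raise ValueError("File order differs from label order")
    return True


# Toy data standing in for the real feature matrix and label file.
toy_X = sparse.csr_matrix([[1, 0], [0, 2], [3, 0]])
toy_labels = pd.DataFrame({"Id": ["a", "b", "c"], "Class": [1, 2, 1]})

aligned = check_alignment(toy_X, toy_labels, ids_in_matrix_order=["a", "b", "c"])
print(aligned)
```

On the real data, the row-count check alone would be `check_alignment(X, label_df)`; the order check needs the list of file ids saved when the matrix was built.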


In [12]:
k = 2000

In [13]:
# No score_func is passed, so SelectKBest uses its default, f_classif (ANOVA F-value).
X_new = SelectKBest(k=k).fit_transform(X=X, y=y)

In [14]:
print(X_new.shape)

(10868, 2000)
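
The cell above discards the fitted selector, so the indices of the chosen columns are lost. If those indices are needed later (for example, to apply the same selection to new data), one option is to keep the selector object. The sketch below shows this on small synthetic data; the names `X_demo`, `y_demo`, `k_demo`, and `selected_idx` are illustrative and not part of the original notebook, and `f_classif` is passed explicitly only to make the default visible.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the bigram matrix: 60 samples, 20 features, 3 classes.
rng = np.random.default_rng(0)
n_samples, n_features, k_demo = 60, 20, 5
y_demo = rng.integers(0, 3, size=n_samples)
dense = rng.poisson(1.0, size=(n_samples, n_features)).astype(float)

# Make the first two columns depend strongly on the class label.
dense[:, 0] += 5 * y_demo
dense[:, 1] += 3 * y_demo
X_demo = sparse.csr_matrix(dense)

# Keep the fitted selector instead of calling fit_transform on a throwaway object.
selector = SelectKBest(score_func=f_classif, k=k_demo)
X_demo_new = selector.fit_transform(X_demo, y_demo)

# Column indices (in the original matrix) of the selected features.
selected_idx = selector.get_support(indices=True)

print(X_demo_new.shape)
print(selected_idx)
```

The same fitted `selector` can later be applied to unseen rows with `selector.transform(...)`, which keeps the column choice consistent between training and test data.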


In [15]:
print(type(X_new))

<class 'scipy.sparse.csr.csr_matrix'>


In [16]:
final_file_name = 'best_k_bytes_bigram.npz'
sparse.save_npz(file=os.path.join(src_path, final_file_name), matrix=X_new)

In [17]:
file_size = round((os.path.getsize(filename=os.path.join(src_path, final_file_name)) / (1024**2)), 2)
print("File size after selecting best {} features: {} MB.".format(k, file_size))

File size after selecting best 2000 features: 14.9 MB.
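
Before relying on the saved file, it can be worth confirming that it loads back unchanged. The following is a small sketch of such a round-trip check using a temporary directory and a toy matrix; the file name `demo_matrix.npz` and the variable names are illustrative only.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Small deterministic sparse matrix standing in for X_new.
demo = sparse.csr_matrix(np.arange(12).reshape(3, 4) % 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_matrix.npz")
    sparse.save_npz(file=path, matrix=demo)
    restored = sparse.load_npz(file=path)
    size_kb = os.path.getsize(path) / 1024

# The restored matrix should match the original in shape and in every entry.
same_shape = restored.shape == demo.shape
same_values = (restored != demo).nnz == 0

print(same_shape, same_values)
```

On the real data, the equivalent check would load `best_k_bytes_bigram.npz` from `src_path` and compare it with `X_new`.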


End of the file.