In [1]:
from sklearn.feature_extraction.text import CountVectorizer

text = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer(stop_words='english', max_features=1000)
dtm = vectorizer.fit_transform(text)

print(vectorizer.get_feature_names_out())
print(dtm.toarray())


['document' 'second']
[[1 0]
 [2 1]
 [0 0]
 [1 0]]


Question 2: Purpose of CountVectorizer in sklearn

CountVectorizer is a standard first step in natural language processing (NLP) workflows. Its primary purpose is to convert raw text into numerical feature vectors that can be used as input to machine learning models. This transformation is necessary because most machine learning algorithms operate on numeric arrays and cannot consume raw strings directly.

Text to Numerical Representation: CountVectorizer takes a corpus of text documents (such as sentences, paragraphs, or entire documents) as input, tokenizes them, and builds a vocabulary of the terms it finds. Each document is then represented as a vector in which each element is the number of times a particular vocabulary term occurs in that document.

Document-Term Matrix (DTM): The output of CountVectorizer is often referred to as a Document-Term Matrix (DTM). The matrix has one row per document in the corpus and one column per term (word or token) in the vocabulary. Each value is the count of that term in that document.

Feature Extraction: CountVectorizer extracts features from text data, where each feature corresponds to a unique term present in the corpus. These features can then be used as input to machine learning models for tasks such as classification, clustering, or regression.
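
As a rough illustration of features feeding a model, the sketch below chains CountVectorizer with a multinomial Naive Bayes classifier. The four training sentences and the "spam"/"ham" labels are invented for the example and have nothing to do with the malware data used later.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data
train_docs = [
    "free prize money",
    "win free money now",
    "meeting agenda attached",
    "project meeting notes",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The vectorizer turns text into counts; the classifier learns from the counts
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

prediction = model.predict(["free money"])
print(prediction)
```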

Sparse Matrix Representation: Since text data often contains a large number of unique terms (the vocabulary), the resulting DTM can be very high-dimensional. However, most documents contain only a small subset of the vocabulary, so most entries are zero. CountVectorizer therefore returns the DTM as a SciPy sparse matrix, which stores only the non-zero counts and saves memory and computation. Calling toarray() converts it to a dense array when one is needed.
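
The sparsity can be checked directly. In this sketch (toy documents chosen so that no two share a word) only 6 of the 18 cells in the matrix are actually stored.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus in which no word is shared between documents
docs = ["alpha beta", "gamma delta", "epsilon zeta"]
dtm = CountVectorizer().fit_transform(docs)

is_sparse = sparse.issparse(dtm)            # True: a SciPy sparse matrix
total_cells = dtm.shape[0] * dtm.shape[1]   # 3 documents x 6 terms
stored_cells = dtm.nnz                      # number of non-zero entries kept

print(is_sparse, dtm.shape, stored_cells, total_cells)
```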

Question 3: Explain what stop_words is in CountVectorizer. (10 pts)

Removal of Common Words: The stop_words parameter tells CountVectorizer which words to drop from the vocabulary before counting. Stop words are words that appear frequently in text but typically carry little meaning on their own. Examples include articles ("the", "a", "an"), prepositions ("in", "on", "at"), and conjunctions ("and", "but", "or"). Removing them before vectorization lets the features focus on words that are more indicative of the content of the text.

Improvement of Model Performance: Removing stop words can lead to more informative features, which can improve the performance of machine learning models, because stop words add noise without contributing much to the task at hand. It also matters when max_features is set: that parameter keeps only the most frequent terms across the corpus, so without stop-word removal very common words would take up slots in the vocabulary.

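Customization: CountVectorizer accepts either the string 'english', which selects scikit-learn's built-in English list (the only built-in list), or a user-supplied list of words; the default, None, removes nothing. The built-in list is broader than articles and prepositions: in the first example above it also removes "first", "third", and "one", which is why only "document" and "second" remain. Supplying a custom list lets users tailor the vectorization process to their specific needs and domain of application.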

In summary, CountVectorizer plays a crucial role in converting text data into a numerical format suitable for machine learning models. Additionally, the ability to remove stop words further enhances the quality of the resulting feature representation.

In [29]:

# Exercise 2
# Import the required libraries
import pandas as pd

# Load the datasets
goodware_df = pd.read_csv('goodware_r.csv')
malware_df = pd.read_csv('malware_r.csv')

# Preview the first few rows of each dataset to understand their structure
goodware_df_head = goodware_df.head()
malware_df_head = malware_df.head()

goodware_df_head, malware_df_head




(  BaseOfCode BaseOfData Characteristics DllCharacteristics   Entropy  \
 0       4096      40960             783              32768  7.999997   
 1       4096      28672             271              32768  7.870771   
 2       4096     131072             303                  0   7.99977   
 3       4096    1646592             259                  0  5.590701   
 4       8192     786432             258              34112  6.812076   
 
   FileAlignment FormatedTimeDateStamp  \
 0           512       2/18/2016 20:08   
 1           512       12/5/2009 20:50   
 2           512       4/18/2011 15:54   
 3          4096       3/28/2011 11:59   
 4           512       12/22/2015 9:04   
 
                                             Identify ImageBase  \
 0                                                NaN   4194304   
 1                    [['Nullsoft PiMP Stub -> SFX']]   4194304   
 2  [['Microsoft Visual C++ v6.0'], ['Microsoft Vi...   4194304   
 3  [['Microsoft Visual C++ 8'], ['VC8

In [44]:
import os
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer



In [47]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import csv

# Paths of the input files; the label and text of every row are collected below
goodware_path = 'goodware_r.csv'
malware_path = 'malware_r.csv'

# Lists to store labels and text data
labels = []
text = []

# List of filenames to process
filenames = [goodware_path, malware_path]

# Read each file and label every row according to the file it came from
for filename in filenames:
    # Determine if the file is goodware or malware
    label = "1" if "good" in filename else "-1"
    
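    # Open and process the file.
    # The files are comma-separated, so with delimiter="\t" each row normally
    # arrives as a single field; the replace() calls below do the splitting.
    with open(filename, encoding="utf8") as f:
        content = csv.reader(f, delimiter="\t")
        next(content)  # Skip the header
        for line in content:
            # Join the fields and turn commas and quotes into spaces so that
            # every cell value becomes a separate whitespace-delimited token
            line_str = ' '.join(line).replace(',', ' ').replace('"', ' ')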
            # Append the processed line and label to the lists
            text.append(line_str)
            labels.append(label)

# Initialize CountVectorizer with the specified parameters
vectorizer = CountVectorizer(stop_words='english', max_features=1000)

# Apply CountVectorizer to the text data
dtm = vectorizer.fit_transform(text)

# Convert the document-term matrix to a pandas DataFrame
df = pd.DataFrame(dtm.toarray(), index=labels, columns=vectorizer.get_feature_names_out())
df.index.name = "labels"

# Save the dataframe to a CSV file
df.to_csv('MalwareMatrix.csv')


Question 4: Randomly split data into training (70%) and test set (30%), and then apply
multinomial Naive Bayes classifier (use functions from the scikit-learn library). (20 pts)

In [52]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


# Load the MalwareMatrix.csv file into a DataFrame
df = pd.read_csv('MalwareMatrix.csv')

# Extract labels and features
labels = df['labels']
features = df.drop('labels', axis=1)


# Split data into training (70%) and test set (30%)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42
)

# Apply Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)



In [54]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Predictions on the training set
y_train_pred = nb_classifier.predict(X_train)
# Predictions on the test set
y_test_pred = nb_classifier.predict(X_test)

# Evaluate the classifier on the training set
train_accuracy = accuracy_score(y_train, y_train_pred)
train_classification_rep = classification_report(y_train, y_train_pred)
train_conf_matrix = confusion_matrix(y_train, y_train_pred)

# Evaluate the classifier on the test set
test_accuracy = accuracy_score(y_test, y_test_pred)
test_classification_rep = classification_report(y_test, y_test_pred)
test_conf_matrix = confusion_matrix(y_test, y_test_pred)

# Display results for the training set
print("Training Set Results:")
print(f"Classification Accuracy: {train_accuracy:.2f}")
print("Training Classification Report:")
print(train_classification_rep)
print("Training Confusion Matrix:")
print(train_conf_matrix)

# Display results for the test set
print("\nTesting Set Results:")
print(f"Classification Accuracy: {test_accuracy:.2f}")
print("Testing Classification Report:")
print(test_classification_rep)
print("Testing Confusion Matrix:")
print(test_conf_matrix)


Training Set Results:
Classification Accuracy: 0.71
Training Classification Report:
              precision    recall  f1-score   support

          -1       0.73      0.75      0.74      7091
           1       0.69      0.67      0.68      5952

    accuracy                           0.71     13043
   macro avg       0.71      0.71      0.71     13043
weighted avg       0.71      0.71      0.71     13043

Training Confusion Matrix:
[[5331 1760]
 [1963 3989]]

Testing Set Results:
Classification Accuracy: 0.70
Testing Classification Report:
              precision    recall  f1-score   support

          -1       0.72      0.74      0.73      3039
           1       0.68      0.66      0.67      2552

    accuracy                           0.70      5591
   macro avg       0.70      0.70      0.70      5591
weighted avg       0.70      0.70      0.70      5591

Testing Confusion Matrix:
[[2242  797]
 [ 862 1690]]
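
As a sanity check on the output above, the reported test metrics can be recomputed from the test confusion matrix alone. A minimal sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, both in the order -1, 1; the numbers are copied from the printed matrix.

```python
import numpy as np

# Test confusion matrix as printed above (rows: true -1, 1; columns: predicted -1, 1)
cm = np.array([[2242, 797],
               [862, 1690]])

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions over the true count (row sum)
recall = np.diag(cm) / cm.sum(axis=1)

# Precision per class: correct predictions over the predicted count (column sum)
precision = np.diag(cm) / cm.sum(axis=0)

print(f"accuracy:  {accuracy:.2f}")      # 0.70
print("recall:   ", recall.round(2))     # [0.74 0.66]
print("precision:", precision.round(2))  # [0.72 0.68]
```

These values agree with the test classification report: 3932 of 5591 samples are classified correctly.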


