<a href="https://colab.research.google.com/github/mecrym/Iniciacao-Cientifica/blob/main/Passo1IC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory analysis of the phishing dataset:
The imports used in this notebook are listed below. NumPy was added to help check for unique values and so make it easier to identify **Categorical** data stored as int or float.

The data for **Word, Excel, PDF and HTML** files is split into two tables:
* One with all the features
* One that has already gone through feature engineering.

The data for **QR code** files is also split into two tables, but along a different line:
* One for benign data
* One for malicious data.

The QR code tables stand out from the rest because they contain only the QR code's name, the **url** used to generate it, and its **path**. The QR codes also come with a folder holding all the image files generated from the *urls* in the tables.

**Label**, present in every table except the QR code ones, is our target variable. It is **Categorical** and has only two classes (benign = 0 and malicious = 1).

The following Notion page is used to track tasks and notes for this work: [Notion: Iniciação Científica](https://www.notion.so/Inicia-o-Cientifica-2a2579944eb68021882fc9cd25d590cd?source=copy_link)


In [None]:
import pandas as pd
import numpy as np

## Microsoft Word:


#### All features of the Word documents:

In [None]:
path_word = '/content/drive/MyDrive/Colab Notebooks/Dataframes_passo1/Word_All_features.csv'
df_word_all = pd.read_csv(path_word)
df_word_all.head()

Unnamed: 0,ole_object_count,ole_object_type_count,macro_present,dde_present,vba_keywords_count,entropy,struct_ContentType,struct_PartName,file_size,struct_pos,...,path_a-accent1,path_a-sysClr,path_a-lt1,path_/a-accent4,struct_{http://schemas.openxmlformats.org/wordprocessingml/2006/main}sz,path_a-solidFill,struct_{http://schemas.openxmlformats.org/wordprocessingml/2006/main}themeFill,struct_{http://schemas.openxmlformats.org/wordprocessingml/2006/main}csb1,struct_{http://schemas.openxmlformats.org/wordprocessingml/2006/main}styleId,label
0,21,3,1,0,2,5.296863,0,0,223813,0,...,0.0,0.0,0.0,0.0,,0.0,,,,1
1,0,0,0,1,0,7.415387,14,11,40156,10,...,,,,,2200.0,,938.0,8.0,324.0,0
2,0,0,0,1,0,7.595452,15,11,258403,0,...,,,,,2283.0,,938.0,0.0,324.0,0
3,0,0,0,1,0,7.402378,14,11,39924,10,...,,,,,2200.0,,938.0,8.0,324.0,0
4,20,3,1,0,2,4.518811,0,0,199559,0,...,0.0,0.0,0.0,0.0,,0.0,,,,1


In the cell below we check all data types with the **info()** method. The types checked are storage types such as *int, float* and *object*, not Machine Learning types such as *Textual, Categorical* or *Numerical*.

Looking at the object-type attributes (pandas stores strings as object), we can see that there are no **Textual** types among the Word file features, since **Textual** types are descriptive values such as names, URLs, etc.

In [None]:
df_word_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 44 columns):
 #   Column                                                                          Non-Null Count  Dtype  
---  ------                                                                          --------------  -----  
 0   ole_object_count                                                                20000 non-null  int64  
 1   ole_object_type_count                                                           20000 non-null  int64  
 2   macro_present                                                                   20000 non-null  int64  
 3   dde_present                                                                     20000 non-null  int64  
 4   vba_keywords_count                                                              20000 non-null  int64  
 5   entropy                                                                         20000 non-null  float64
 6   struct_Content

Next comes the count of unique values per variable. Since most of our data is *integer* or *floating point*, the categorical data will probably be IDs and booleans stored as int64 or float64. So, from the features listed below, we filter those with only 2 values to identify the Categorical ones:

In [None]:
df_word_num = df_word_all.select_dtypes(include=np.number)

if df_word_num.empty:
    print("Nenhuma coluna numérica (int ou float) foi encontrada na tabela.")
else:
    contagem_unicos_word = df_word_num.nunique()

    contagem_unicos_ordenada_word = contagem_unicos_word.sort_values()

    print("\n--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---")

    with pd.option_context('display.max_rows', 100):
        print(contagem_unicos_ordenada_word)



--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---
dde_present                                                                           2
macro_present                                                                         2
path_a-ln                                                                             2
path_/a-accent6                                                                       2
path_a-lt2                                                                            2
path_a-themeElements                                                                  2
path_a-accent3                                                                        2
path_a-hlink                                                                          2
struct_Extension                                                                      2
path_a-dk1                                                                            2
path_/a-accent4                                          
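
The filtering step described above (keeping only the numeric columns with exactly two distinct values) can be sketched as a small helper. This is a minimal sketch, not the notebook's own code: the function name `two_valued_columns` and the toy DataFrame are assumptions made for illustration, and the toy values only mimic the shape of `df_word_all`.

```python
import numpy as np
import pandas as pd


def two_valued_columns(df: pd.DataFrame) -> list:
    """Return the numeric columns that hold exactly two distinct non-null values."""
    numeric = df.select_dtypes(include=np.number)
    counts = numeric.nunique()  # NaN is not counted by default
    return counts[counts == 2].index.tolist()


# Toy data standing in for df_word_all (illustrative values only).
toy = pd.DataFrame({
    "macro_present": [1, 0, 0, 1],
    "entropy": [5.29, 7.41, 7.59, 7.40],
    "path_a-ln": [0.0, 3.0, np.nan, 0.0],
    "file_name": ["a", "b", "c", "d"],
})

candidates = two_valued_columns(toy)
print(candidates)
```

Having two distinct values only makes a column a candidate: as the analysis that follows shows, some two-valued columns hold values other than 0 and 1.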

Next we check the unique values of the columns most likely to be categorical, together with their proportions, to see whether any feature is imbalanced.

The output of the following cell shows that these features are **Categorical**:
* Balanced (roughly 50/50):
  * macro_present
  * dde_present
* Imbalanced (64.2% of values are 0 and 35.8% are 1):
  * path_a-hlink
  * path_a-accent3
  * path_a-themeElements
  * path_a-dk1 (this particular column is duplicated in the spreadsheet, with the same name and the same values)
  * path_/a-dk1
  * path_/a-accent6
  * path_a-lt2
  * path_a-accent4
  * path_a-accent1
  * path_a-lt1
  * path_/a-accent4

The following features also have only two distinct values but, given what those values are, they are treated as numeric like the remaining unlisted features:

* path_a-ln (values 0 and 3)
* struct_Extension (values 0 and 2)
* path_a-sysClr (values 0 and 2)
* struct_{http://schemas.openxmlformats.org/wordprocessingml/2006/main}themeFill (values 0 and 938)
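
The distinction drawn above, between two-valued columns that are 0/1 flags and two-valued columns holding other value pairs, can be checked mechanically. The sketch below assumes a hypothetical helper `split_two_valued` and a toy DataFrame; neither comes from the notebook.

```python
import numpy as np
import pandas as pd


def split_two_valued(df, columns):
    """Split columns into 0/1 flags and columns holding any other set of values."""
    flags, others = [], {}
    for col in columns:
        values = set(df[col].dropna().unique())
        if values <= {0, 1}:
            flags.append(col)
        else:
            # Keep the observed values so they can be inspected by hand.
            others[col] = sorted(float(v) for v in values)
    return flags, others


# Toy data with the same kinds of value pairs seen in the Word table.
toy = pd.DataFrame({
    "dde_present": [0, 1, 1, 0],
    "path_a-ln": [0.0, 3.0, 0.0, np.nan],
    "themeFill": [938.0, 0.0, 938.0, 938.0],
})

flags, others = split_two_valued(toy, toy.columns)
print(flags)
print(others)
```

Whether a column such as `path_a-ln` should still be encoded as a flag is a modeling decision; the helper only reports which values are present.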





In [None]:
colunas_word = [
    'dde_present',
    'macro_present',
    'path_a-ln',
    'path_/a-accent6',
    'path_a-lt2',
    'path_a-themeElements',
    'path_a-accent3',
    'path_a-hlink',
    'struct_Extension',
    'path_a-dk1',
    'path_/a-accent4',
    'path_a-lt1',
    'path_a-sysClr',
    'path_a-accent1',
    'path_/a-dk1',
    'path_a-accent4',
    'struct_{http://schemas.openxmlformats.org/wordprocessingml/2006/main}themeFill',
    'label'
]

for col in colunas_word:
    print(f"--- Análise da Coluna: {col} ---")

    print("Contagem:")
    class_count_word = df_word_all[col].value_counts()
    display(class_count_word)

    print("\nProporção:")
    percent_count_word = df_word_all[col].value_counts(normalize=True)
    display(percent_count_word)

    print("-" * 40 + "\n")  # Adds a separator

--- Análise da Coluna: dde_present ---
Contagem:


Unnamed: 0_level_0,count
dde_present,Unnamed: 1_level_1
0,10006
1,9994



Proporção:


Unnamed: 0_level_0,proportion
dde_present,Unnamed: 1_level_1
0,0.5003
1,0.4997


----------------------------------------

--- Análise da Coluna: macro_present ---
Contagem:


Unnamed: 0_level_0,count
macro_present,Unnamed: 1_level_1
1,10003
0,9997



Proporção:


Unnamed: 0_level_0,proportion
macro_present,Unnamed: 1_level_1
1,0.50015
0,0.49985


----------------------------------------

--- Análise da Coluna: path_a-ln ---
Contagem:


Unnamed: 0_level_0,count
path_a-ln,Unnamed: 1_level_1
0.0,6420
3.0,3580



Proporção:


Unnamed: 0_level_0,proportion
path_a-ln,Unnamed: 1_level_1
0.0,0.642
3.0,0.358


----------------------------------------

--- Análise da Coluna: path_/a-accent6 ---
Contagem:


Unnamed: 0_level_0,count
path_/a-accent6,Unnamed: 1_level_1
0.0,6420
1.0,3580



Proporção:


Unnamed: 0_level_0,proportion
path_/a-accent6,Unnamed: 1_level_1
0.0,0.642
1.0,0.358


----------------------------------------

--- Análise da Coluna: path_a-lt2 ---
Contagem:


Unnamed: 0_level_0,count
path_a-lt2,Unnamed: 1_level_1
0.0,6420
1.0,3580



Proporção:


Unnamed: 0_level_0,proportion
path_a-lt2,Unnamed: 1_level_1
0.0,0.642
1.0,0.358


----------------------------------------

--- Análise da Coluna: path_a-themeElements ---
Contagem:


Unnamed: 0_level_0,count
path_a-themeElements,Unnamed: 1_level_1
0.0,6420
1.0,3580



Proporção:


Unnamed: 0_level_0,proportion
path_a-themeElements,Unnamed: 1_level_1
0.0,0.642
1.0,0.358


----------------------------------------

--- Análise da Coluna: path_a-accent3 ---
Contagem:


Unnamed: 0_level_0,count
path_a-accent3,Unnamed: 1_level_1
0.0,6420
1.0,3580



Proporção:


Unnamed: 0_level_0,proportion
path_a-accent3,Unnamed: 1_level_1
0.0,0.642
1.0,0.358


----------------------------------------

--- Análise da Coluna: path_a-hlink ---
Contagem:


Unnamed: 0_level_0,count
path_a-hlink,Unnamed: 1_level_1
0.0,6420
1.0,3580



Proporção:


Unnamed: 0_level_0,proportion
path_a-hlink,Unnamed: 1_level_1
0.0,0.642
1.0,0.358


----------------------------------------

--- Análise da Coluna: struct_Extension ---
Contagem:


Unnamed: 0_level_0,count
struct_Extension,Unnamed: 1_level_1
0.0,6419
2.0,3581



Proporção:


Unnamed: 0_level_0,proportion
struct_Extension,Unnamed: 1_level_1
0.0,0.6419
2.0,0.3581


----------------------------------------

--- Análise da Coluna: path_a-dk1 ---
Contagem:


Unnamed: 0_level_0,count
path_a-dk1,Unnamed: 1_level_1
0.0,6420
1.0,3580



Proporção:


Unnamed: 0_level_0,proportion
path_a-dk1,Unnamed: 1_level_1
0.0,0.642
1.0,0.358


----------------------------------------

--- Análise da Coluna: path_/a-accent4 ---
Contagem:


Unnamed: 0_level_0,count
path_/a-accent4,Unnamed: 1_level_1
0.0,6420
1.0,3580



Proporção:


Unnamed: 0_level_0,proportion
path_/a-accent4,Unnamed: 1_level_1
0.0,0.642
1.0,0.358


----------------------------------------

--- Análise da Coluna: path_a-lt1 ---
Contagem:


Unnamed: 0_level_0,count
path_a-lt1,Unnamed: 1_level_1
0.0,6420
1.0,3580



Proporção:


Unnamed: 0_level_0,proportion
path_a-lt1,Unnamed: 1_level_1
0.0,0.642
1.0,0.358


----------------------------------------

--- Análise da Coluna: path_a-sysClr ---
Contagem:


Unnamed: 0_level_0,count
path_a-sysClr,Unnamed: 1_level_1
0.0,6420
2.0,3580



Proporção:


Unnamed: 0_level_0,proportion
path_a-sysClr,Unnamed: 1_level_1
0.0,0.642
2.0,0.358


----------------------------------------

--- Análise da Coluna: path_a-accent1 ---
Contagem:


Unnamed: 0_level_0,count
path_a-accent1,Unnamed: 1_level_1
0.0,6420
1.0,3580



Proporção:


Unnamed: 0_level_0,proportion
path_a-accent1,Unnamed: 1_level_1
0.0,0.642
1.0,0.358


----------------------------------------

--- Análise da Coluna: path_/a-dk1 ---
Contagem:


Unnamed: 0_level_0,count
path_/a-dk1,Unnamed: 1_level_1
0.0,6420
1.0,3580



Proporção:


Unnamed: 0_level_0,proportion
path_/a-dk1,Unnamed: 1_level_1
0.0,0.642
1.0,0.358


----------------------------------------

--- Análise da Coluna: path_a-accent4 ---
Contagem:


Unnamed: 0_level_0,count
path_a-accent4,Unnamed: 1_level_1
0.0,6420
1.0,3580



Proporção:


Unnamed: 0_level_0,proportion
path_a-accent4,Unnamed: 1_level_1
0.0,0.642
1.0,0.358


----------------------------------------

--- Análise da Coluna: struct_{http://schemas.openxmlformats.org/wordprocessingml/2006/main}themeFill ---
Contagem:


Unnamed: 0_level_0,count
struct_{http://schemas.openxmlformats.org/wordprocessingml/2006/main}themeFill,Unnamed: 1_level_1
938.0,9429
0.0,571



Proporção:


Unnamed: 0_level_0,proportion
struct_{http://schemas.openxmlformats.org/wordprocessingml/2006/main}themeFill,Unnamed: 1_level_1
938.0,0.9429
0.0,0.0571


----------------------------------------

--- Análise da Coluna: label ---
Contagem:


Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,10000
0,10000



Proporção:


Unnamed: 0_level_0,proportion
label,Unnamed: 1_level_1
1,0.5
0,0.5


----------------------------------------
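
One detail worth noting in the output above: the counts for the `path_*` columns add up to 10,000, while the table has 20,000 rows. `value_counts()` leaves out missing values by default, so the proportions shown are shares of the non-null rows only. The sketch below uses a toy Series (an assumption for illustration) to show how `dropna=False` makes the missing share visible.

```python
import numpy as np
import pandas as pd

# Toy column: half of the rows are missing, as an illustration only.
toy = pd.Series([0.0, 1.0, np.nan, 0.0, np.nan, np.nan], name="path_a-hlink")

# Default behavior: NaN rows are dropped before counting.
without_nan = toy.value_counts(normalize=True)

# dropna=False keeps NaN as its own category, so shares refer to all rows.
with_nan = toy.value_counts(normalize=True, dropna=False)

# Fraction of rows that are missing.
missing_share = toy.isna().mean()

print(without_nan)
print(with_nan)
print(f"missing share: {missing_share:.2f}")
```

Running the same two calls on `df_word_all` would show whether the 64.2% / 35.8% split describes the whole table or only the rows where the feature is present.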



#### Only the 10 features selected through feature engineering for the Word documents:

In [None]:
path_word_10 = '/content/drive/MyDrive/Colab Notebooks/Dataframes_passo1/Word_Top10_Features.csv'
df_word_10 = pd.read_csv(path_word_10)
df_word_10.head()

Unnamed: 0,file_name,ole_object_count,ole_object_type_count,macro_present,dde_present,vba_keywords_count,entropy,struct_ContentType,struct_PartName,file_size,struct_pos,label
0,benign_000000,0,0,0,1,0,6.826819,12,10,27585,11,0
1,benign_000001,0,0,0,1,0,7.389457,14,11,36658,10,0
2,benign_000002,0,0,0,1,0,6.850923,12,10,28887,11,0
3,benign_000003,0,0,0,1,0,7.408635,14,11,39295,10,0
4,benign_000004,0,0,0,1,0,7.388929,14,11,36659,10,0


The data types are checked for every table in order to identify which data is numerical, textual, and possibly categorical.

This time there is one object column, the sample's name (*file_name*), which indicates **Textual** data.

In [None]:
df_word_10.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   file_name              20000 non-null  object 
 1   ole_object_count       20000 non-null  int64  
 2   ole_object_type_count  20000 non-null  int64  
 3   macro_present          20000 non-null  int64  
 4   dde_present            20000 non-null  int64  
 5   vba_keywords_count     20000 non-null  int64  
 6   entropy                20000 non-null  float64
 7   struct_ContentType     20000 non-null  int64  
 8   struct_PartName        20000 non-null  int64  
 9   file_size              20000 non-null  int64  
 10  struct_pos             20000 non-null  int64  
 11  label                  20000 non-null  int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 1.8+ MB
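
A non-numeric column is not automatically Textual: it could also be a category with few distinct labels. One rough way to tell them apart is to compare the number of distinct values with the number of rows. The helper below is a sketch under that assumption; `describe_non_numeric_columns` and the toy data are hypothetical names introduced for illustration.

```python
import pandas as pd


def describe_non_numeric_columns(df):
    """Report distinct-value counts for every non-numeric column."""
    non_numeric = df.select_dtypes(exclude="number")
    n_unique = non_numeric.nunique()
    return pd.DataFrame({
        "n_unique": n_unique,
        # A ratio near 1.0 means almost every row has its own value (identifier-like text).
        "unique_ratio": n_unique / len(df),
    })


# Toy data shaped like the first rows of df_word_10 (illustrative values).
toy = pd.DataFrame({
    "file_name": ["benign_000000", "benign_000001", "benign_000002", "benign_000003"],
    "macro_present": [0, 0, 1, 1],
})

summary = describe_non_numeric_columns(toy)
print(summary)
```

A column such as *file_name*, where every row is expected to carry a different value, would show a ratio close to 1.0, which is consistent with treating it as Textual rather than Categorical.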


Unique-value check:

In [None]:
df_word_10_num = df_word_10.select_dtypes(include=np.number)

if df_word_10_num.empty:
    print("Nenhuma coluna numérica (int ou float) foi encontrada na tabela.")
else:
    contagem_unicos = df_word_10_num.nunique()

    contagem_unicos_ordenada = contagem_unicos.sort_values()

    print("\n--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---")

    with pd.option_context('display.max_rows', 100):
        print(contagem_unicos_ordenada)


--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---
dde_present                  2
macro_present                2
label                        2
ole_object_type_count        3
struct_pos                   4
vba_keywords_count           6
struct_ContentType          16
struct_PartName             17
ole_object_count            28
file_size                 9683
entropy                  16605
dtype: int64


Looking at the output of the cell below, only *dde_present* and *macro_present* are **Categorical**, and both are balanced, with roughly 50% of values equal to 1 and 50% equal to 0.

All the others are Numerical, and **label** is Categorical in every file.

In [None]:
colunas = [
    'dde_present',
    'macro_present',
    'label'
]

for col in colunas:
    print(f"--- Análise da Coluna: {col} ---")

    print("Contagem:")

    class_count = df_word_10[col].value_counts()
    display(class_count)

    print("\nProporção:")
    percent_count = df_word_10[col].value_counts(normalize=True)
    display(percent_count)

    print("-" * 40 + "\n")

--- Análise da Coluna: dde_present ---
Contagem:


Unnamed: 0_level_0,count
dde_present,Unnamed: 1_level_1
0,10006
1,9994



Proporção:


Unnamed: 0_level_0,proportion
dde_present,Unnamed: 1_level_1
0,0.5003
1,0.4997


----------------------------------------

--- Análise da Coluna: macro_present ---
Contagem:


Unnamed: 0_level_0,count
macro_present,Unnamed: 1_level_1
1,10003
0,9997



Proporção:


Unnamed: 0_level_0,proportion
macro_present,Unnamed: 1_level_1
1,0.50015
0,0.49985


----------------------------------------

--- Análise da Coluna: label ---
Contagem:


Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
0,10000
1,10000



Proporção:


Unnamed: 0_level_0,proportion
label,Unnamed: 1_level_1
0,0.5
1,0.5


----------------------------------------



## PDF Files:


#### All features of the PDF files:

In [None]:
path = '/content/drive/MyDrive/Colab Notebooks/Dataframes_passo1/PDF_All_features.csv'
df_pdf_all = pd.read_csv(path)
df_pdf_all.head()

Unnamed: 0,file_path,file_size,title_chars,encrypted,metadata_size,page_count,valid_pdf_header,image_count,text_length,object_count,...,acroform_count,xfa_count,jbig2decode_count,colors_count,richmedia_count,trailer_count,startxref_count,has_multiple_behavioral_keywords_in_one_object,used_ocr,label
0,/content/drive/MyDrive/UNB/Mastercard project/...,20362,8,0,254,5,1,0,8018,120,...,0,0,0,0,0,0,0,0,0,0
1,/content/drive/MyDrive/UNB/Mastercard project/...,28848,35,0,242,4,1,0,11568,36,...,0,0,0,0,0,0,0,0,0,0
2,/content/drive/MyDrive/UNB/Mastercard project/...,76563,12,0,138,1,1,1,1377,21,...,0,0,0,0,0,0,0,0,1,0
3,/content/drive/MyDrive/UNB/Mastercard project/...,67982,8,0,278,4,1,0,30607,46,...,1,0,0,0,0,0,0,0,0,0
4,/content/drive/MyDrive/UNB/Mastercard project/...,3997,37,0,156,1,1,0,1,10,...,0,0,0,0,0,0,0,1,1,1


This type check already shows one **Textual** feature, ***file_path***, which holds the location where the file analyzed in that row is stored.

In [None]:
df_pdf_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19296 entries, 0 to 19295
Data columns (total 42 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   file_path                                       19296 non-null  object 
 1   file_size                                       19296 non-null  int64  
 2   title_chars                                     19296 non-null  int64  
 3   encrypted                                       19296 non-null  int64  
 4   metadata_size                                   19296 non-null  int64  
 5   page_count                                      19296 non-null  int64  
 6   valid_pdf_header                                19296 non-null  int64  
 7   image_count                                     19296 non-null  int64  
 8   text_length                                     19296 non-null  int64  
 9   object_count                           

Unique-value check:

In [None]:
df_pdf_all_num = df_pdf_all.select_dtypes(include=np.number)

if df_pdf_all_num.empty:
    print("Nenhuma coluna numérica (int ou float) foi encontrada na tabela.")
else:
    contagem_unicos = df_pdf_all_num.nunique()

    contagem_unicos_ordenada = contagem_unicos.sort_values()

    print("\n--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---")

    with pd.option_context('display.max_rows', 100):
        print(contagem_unicos_ordenada)


--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---
embedded_file_count                                   1
average_embedded_file_size                            1
xref_entries                                          1
xref_count                                            1
submitform_count                                      1
startxref_count                                       1
trailer_count                                         1
jbig2decode_count                                     1
used_ocr                                              2
uses_nonstandard_port                                 2
encrypted                                             2
label                                                 2
valid_pdf_header                                      2
xfa_count                                             3
launch_count                                          4
openaction_count                                      5
richmedia_count                          
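
Several columns in the output above have a single unique value, which means they are constant across the whole table. A sketch for listing such columns follows; the helper name `constant_columns` and the toy data are assumptions, and whether to drop these columns is a decision the notebook does not make at this point.

```python
import pandas as pd


def constant_columns(df):
    """Return the columns that hold at most one distinct value (NaN counts as a value)."""
    counts = df.nunique(dropna=False)
    return counts[counts <= 1].index.tolist()


# Toy data: one constant column and two varying ones (illustrative values).
toy = pd.DataFrame({
    "xref_count": [0, 0, 0, 0],
    "encrypted": [0, 0, 1, 0],
    "file_size": [20362, 28848, 76563, 67982],
})

constants = constant_columns(toy)
print(constants)

# Optional follow-up: keep only the columns that vary.
informative = toy.drop(columns=constants)
print(list(informative.columns))
```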

Based on the results below, these are the **Categorical** features. None of them is balanced; only **label** is close to balanced (51.8% of values are 1 and 48.2% are 0):
  
  * about 99.56% of values are 0 and 0.44% are 1:
    * uses_nonstandard_port
    * encrypted
  * 82.3% of values are 0 and 17.7% are 1:
    * used_ocr
  * 61.2% of values are 1 and 38.8% are 0:
    * valid_pdf_header
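
The balance check applied to each binary column can be condensed into one summary table instead of reading every `value_counts()` output by eye. This is a minimal sketch: the function `balance_report`, the `tolerance` of 0.05 and the toy data are all assumptions chosen for illustration, not thresholds used in the notebook.

```python
import pandas as pd


def balance_report(df, columns, tolerance=0.05):
    """Summarize the majority class of each binary column and flag imbalance."""
    rows = []
    for col in columns:
        shares = df[col].value_counts(normalize=True)
        majority = float(shares.iloc[0])  # value_counts sorts by frequency, descending
        rows.append({
            "column": col,
            "majority_value": shares.index[0],
            "majority_share": majority,
            # For a binary column, "balanced" means the majority share is near 0.5.
            "balanced": abs(majority - 0.5) <= tolerance,
        })
    return pd.DataFrame(rows).set_index("column")


# Toy data with one balanced and two imbalanced columns (illustrative values).
toy = pd.DataFrame({
    "label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "encrypted": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "valid_pdf_header": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
})

report = balance_report(toy, toy.columns)
print(report)
```

The tolerance controls how far from 50/50 a column may be before it is flagged; a different cut-off may suit the project better.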

In [None]:
colunas = [
    'embedded_file_count',
    'average_embedded_file_size',
    'xref_entries',
    'xref_count',
    'submitform_count',
    'startxref_count',
    'trailer_count',
    'jbig2decode_count',
    'used_ocr',
    'uses_nonstandard_port',
    'encrypted',
    'label',
    'valid_pdf_header'
]

for col in colunas:
    print(f"--- Análise da Coluna: {col} ---")

    print("Contagem:")
    class_count = df_pdf_all[col].value_counts()
    display(class_count)

    print("\nProporção:")
    percent_count = df_pdf_all[col].value_counts(normalize=True)
    display(percent_count)

    print("-" * 40 + "\n")

--- Análise da Coluna: embedded_file_count ---
Contagem:


Unnamed: 0_level_0,count
embedded_file_count,Unnamed: 1_level_1
0,19296



Proporção:


Unnamed: 0_level_0,proportion
embedded_file_count,Unnamed: 1_level_1
0,1.0


----------------------------------------

--- Análise da Coluna: average_embedded_file_size ---
Contagem:


Unnamed: 0_level_0,count
average_embedded_file_size,Unnamed: 1_level_1
0,19296



Proporção:


Unnamed: 0_level_0,proportion
average_embedded_file_size,Unnamed: 1_level_1
0,1.0


----------------------------------------

--- Análise da Coluna: xref_entries ---
Contagem:


Unnamed: 0_level_0,count
xref_entries,Unnamed: 1_level_1
0,19296



Proporção:


Unnamed: 0_level_0,proportion
xref_entries,Unnamed: 1_level_1
0,1.0


----------------------------------------

--- Análise da Coluna: xref_count ---
Contagem:


Unnamed: 0_level_0,count
xref_count,Unnamed: 1_level_1
0,19296



Proporção:


Unnamed: 0_level_0,proportion
xref_count,Unnamed: 1_level_1
0,1.0


----------------------------------------

--- Análise da Coluna: submitform_count ---
Contagem:


Unnamed: 0_level_0,count
submitform_count,Unnamed: 1_level_1
0,19296



Proporção:


Unnamed: 0_level_0,proportion
submitform_count,Unnamed: 1_level_1
0,1.0


----------------------------------------

--- Análise da Coluna: startxref_count ---
Contagem:


Unnamed: 0_level_0,count
startxref_count,Unnamed: 1_level_1
0,19296



Proporção:


Unnamed: 0_level_0,proportion
startxref_count,Unnamed: 1_level_1
0,1.0


----------------------------------------

--- Análise da Coluna: trailer_count ---
Contagem:


Unnamed: 0_level_0,count
trailer_count,Unnamed: 1_level_1
0,19296



Proporção:


Unnamed: 0_level_0,proportion
trailer_count,Unnamed: 1_level_1
0,1.0


----------------------------------------

--- Análise da Coluna: jbig2decode_count ---
Contagem:


Unnamed: 0_level_0,count
jbig2decode_count,Unnamed: 1_level_1
0,19296



Proporção:


Unnamed: 0_level_0,proportion
jbig2decode_count,Unnamed: 1_level_1
0,1.0


----------------------------------------

--- Análise da Coluna: used_ocr ---
Contagem:


Unnamed: 0_level_0,count
used_ocr,Unnamed: 1_level_1
0,15882
1,3414



Proporção:


Unnamed: 0_level_0,proportion
used_ocr,Unnamed: 1_level_1
0,0.823072
1,0.176928


----------------------------------------

--- Análise da Coluna: uses_nonstandard_port ---
Contagem:


Unnamed: 0_level_0,count
uses_nonstandard_port,Unnamed: 1_level_1
0,19211
1,85



Proporção:


Unnamed: 0_level_0,proportion
uses_nonstandard_port,Unnamed: 1_level_1
0,0.995595
1,0.004405


----------------------------------------

--- Análise da Coluna: encrypted ---
Contagem:


Unnamed: 0_level_0,count
encrypted,Unnamed: 1_level_1
0,19212
1,84



Proporção:


Unnamed: 0_level_0,proportion
encrypted,Unnamed: 1_level_1
0,0.995647
1,0.004353


----------------------------------------

--- Análise da Coluna: label ---
Contagem:


Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,9999
0,9297



Proporção:


Unnamed: 0_level_0,proportion
label,Unnamed: 1_level_1
1,0.51819
0,0.48181


----------------------------------------

--- Análise da Coluna: valid_pdf_header ---
Contagem:


Unnamed: 0_level_0,count
valid_pdf_header,Unnamed: 1_level_1
1,11801
0,7495



Proporção:


Unnamed: 0_level_0,proportion
valid_pdf_header,Unnamed: 1_level_1
1,0.611578
0,0.388422


----------------------------------------



#### Only the 10 selected features of the PDF files:

In [None]:
path = '/content/drive/MyDrive/Colab Notebooks/Dataframes_passo1/PDF_Top10_features.csv'
df_pdf_10 = pd.read_csv(path)
df_pdf_10.head()

Unnamed: 0,text_length,total_filters,title_chars,file_size,object_count,stream_count,endstream_count,metadata_size,valid_pdf_header,entropy_of_streams,label
0,8018,8,8,20362,120,18,9,254,1,6.010192,0
1,11568,7,35,28848,36,16,8,242,1,6.240051,0
2,1377,2,12,76563,21,4,2,138,1,5.830408,0
3,30607,11,8,67982,46,26,13,278,1,6.317049,0
4,1,2,37,3997,10,6,3,156,1,5.442597,1


Type check; none of them is **Textual**:

In [None]:
df_pdf_10.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19296 entries, 0 to 19295
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   text_length         19296 non-null  int64  
 1   total_filters       19296 non-null  int64  
 2   title_chars         19296 non-null  int64  
 3   file_size           19296 non-null  int64  
 4   object_count        19296 non-null  int64  
 5   stream_count        19296 non-null  int64  
 6   endstream_count     19296 non-null  int64  
 7   metadata_size       19296 non-null  int64  
 8   valid_pdf_header    19296 non-null  int64  
 9   entropy_of_streams  19296 non-null  float64
 10  label               19296 non-null  int64  
dtypes: float64(1), int64(10)
memory usage: 1.6 MB


Unique-value check to identify possible categorical features.

In [None]:
df_pdf_10_num = df_pdf_10.select_dtypes(include=np.number)

if df_pdf_10_num.empty:
    print("Nenhuma coluna numérica (int ou float) foi encontrada na tabela.")
else:
    contagem_unicos = df_pdf_10_num.nunique()

    contagem_unicos_ordenada = contagem_unicos.sort_values()

    print("\n--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---")

    with pd.option_context('display.max_rows', 100):
        print(contagem_unicos_ordenada)


--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---
label                     2
valid_pdf_header          2
title_chars             135
total_filters           200
endstream_count         307
stream_count            315
metadata_size           562
object_count           1342
text_length            6610
entropy_of_streams    11387
file_size             13729
dtype: int64


This check identified only one categorical feature besides **label**: *valid_pdf_header*, which is imbalanced, with 61% of values equal to 1 and 39% equal to 0. All the others are **Numerical**.

In [None]:
class_count = df_pdf_10['valid_pdf_header'].value_counts()
display(class_count)

percent_count = df_pdf_10['valid_pdf_header'].value_counts(normalize=True)
display(percent_count)

Unnamed: 0_level_0,count
valid_pdf_header,Unnamed: 1_level_1
1,11801
0,7495


Unnamed: 0_level_0,proportion
valid_pdf_header,Unnamed: 1_level_1
1,0.611578
0,0.388422


## HTML pages:




#### All features of the HTML pages:

In [None]:
path = '/content/drive/MyDrive/Colab Notebooks/Dataframes_passo1/HTML_All_Features.csv'
df_html_all = pd.read_csv(path)
df_html_all.head()

Unnamed: 0,file_size,line_count,entropy,script_entropy,tag_count,unique_tag_count,script_count,form_count,iframe_count,hidden_iframe_count,...,object_tag_count,url_digit_count,url_punct_char_count,url_avg_length,url_avg_subdomain_count,hostname_digit_ratio_avg,min_link_length,max_link_length,file_path,label
0,2014,49,5.252908,4.802471,20,12,1,1,0,0,...,0,0,8,15.75,0.0,0.0,14,18,/content/drive/MyDrive/UNB/Mastercard project/...,1
1,177022,3970,4.720787,4.634752,875,13,5,0,0,0,...,0,1139,1033,13.724706,0.061176,0.0,7,82,/content/drive/MyDrive/UNB/Mastercard project/...,0
2,191188,3088,5.149307,5.37195,1107,33,51,3,0,0,...,0,845,3064,55.930514,1.223565,0.000697,1,132,/content/drive/MyDrive/UNB/Mastercard project/...,0
3,56968,2319,4.726188,4.431225,886,22,11,1,0,0,...,0,1149,1701,26.51831,0.028169,0.0,1,68,/content/drive/MyDrive/UNB/Mastercard project/...,0
4,167,6,5.052407,0.0,4,4,0,0,0,0,...,0,0,0,0.0,0.0,0.0,0,0,/content/drive/MyDrive/UNB/Mastercard project/...,1


This type check shows that the only **Textual** feature is *file_path*.

In [None]:
df_html_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19997 entries, 0 to 19996
Data columns (total 42 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   file_size                 19997 non-null  int64  
 1   line_count                19997 non-null  int64  
 2   entropy                   19997 non-null  float64
 3   script_entropy            19997 non-null  float64
 4   tag_count                 19997 non-null  int64  
 5   unique_tag_count          19997 non-null  int64  
 6   script_count              19997 non-null  int64  
 7   form_count                19997 non-null  int64  
 8   iframe_count              19997 non-null  int64  
 9   hidden_iframe_count       19997 non-null  int64  
 10  external_links_count      19997 non-null  int64  
 11  mailto_link_count         19997 non-null  int64  
 12  base64_string_count       19997 non-null  int64  
 13  html_comment_count        19997 non-null  int64  
 14  max_ta

The cell below checks unique values to identify **Categorical** features; however, the only Categorical column in this table is **label**.


In [None]:
df_html_all_num = df_html_all.select_dtypes(include=np.number)

if df_html_all_num.empty:
    print("Nenhuma coluna numérica (int ou float) foi encontrada na tabela.")
else:
    contagem_unicos = df_html_all_num.nunique()

    contagem_unicos_ordenada = contagem_unicos.sort_values()

    print("\n--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---")

    with pd.option_context('display.max_rows', 100):
        print(contagem_unicos_ordenada)


--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---
label                           2
hidden_iframe_count             8
object_tag_count               15
iframe_count                   17
eval_in_script_blocks          25
mailto_link_count              25
form_count                     42
redirect_mechanism_count       53
max_tag_nesting_depth          60
unique_tag_count               73
noscript_count                 92
external_js_count              99
min_link_length               111
embedded_js_count             122
script_count                  158
suspicious_word_count         162
escaped_char_count            166
img_count                     253
event_attachment_count        337
html_comment_count            345
base64_string_count           451
function_count                504
external_links_count          592
external_link_count           592
internal_link_count           711
hex_encoding_rate             980
max_link_length              1194
tag_count          
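
In the output above, *external_links_count* and *external_link_count* have the same number of unique values (592), which may indicate a duplicated column. An equal count does not prove it, so a row-by-row comparison is a reasonable follow-up. The sketch below uses a small made-up frame in place of `df_html_all`; the helper name `same_values` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_html_all; the values are made up.
toy = pd.DataFrame({
    "external_links_count": [57, 8, 80, 74],
    "external_link_count": [57, 8, 80, 74],
    "internal_link_count": [39, 27, 26, 2],
})

def same_values(df: pd.DataFrame, col_a: str, col_b: str) -> bool:
    """Return True when two columns hold the same values in every row."""
    return bool(np.array_equal(df[col_a].to_numpy(), df[col_b].to_numpy()))

print(same_values(toy, "external_links_count", "external_link_count"))
print(same_values(toy, "external_links_count", "internal_link_count"))
```

If the same call on the real table returned `True`, one of the two columns could be dropped without losing information.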

The cell below confirms that the target variable is balanced: each class accounts for roughly 50% of the rows.

In [None]:
class_count = df_html_all['label'].value_counts()
display(class_count)

percent_count = df_html_all['label'].value_counts(normalize=True)
display(percent_count)

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,9999
0,9998


Unnamed: 0_level_0,proportion
label,Unnamed: 1_level_1
1,0.500025
0,0.499975
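
The balance check can also be expressed as an explicit rule instead of a visual inspection. This is a minimal sketch: the helper `class_balance` and the 5-point tolerance are assumptions, not part of the notebook. The series is rebuilt from the counts shown above (9999 rows of class 1 and 9998 of class 0).

```python
import pandas as pd

def class_balance(labels: pd.Series, tolerance: float = 0.05):
    """Return class proportions and whether they differ by at most `tolerance`."""
    proportions = labels.value_counts(normalize=True)
    # A single-class column is never considered balanced.
    balanced = len(proportions) > 1 and bool(
        proportions.max() - proportions.min() <= tolerance
    )
    return proportions, balanced

# Series rebuilt from the counts printed by the notebook (row order is arbitrary).
labels = pd.Series([1] * 9999 + [0] * 9998, name="label")
proportions, balanced = class_balance(labels)
print(proportions)
print("balanced:", balanced)
```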


#### Only the 13 selected features for HTML pages:

In [None]:
path = '/content/drive/MyDrive/Colab Notebooks/Dataframes_passo1/HTML_Top13_Features.csv'
df_html_13 = pd.read_csv(path)
df_html_13.head()

Unnamed: 0,file_name,url_punct_char_count,tag_count,whitespace_ratio,entropy,form_count,embedded_js_count,html_whitespace_ratio,script_entropy,min_link_length,external_link_count,total_script_characters,internal_link_count,url_digit_count,label
0,sample_09001.html,756,475,0.476933,5.072891,0,11,0.152737,4.906402,1,57,15762,39,109,0
1,sample_09002.html,287,82,0.356557,5.202024,0,3,0.179456,4.625064,21,8,2543,27,130,0
2,sample_09003.html,667,406,0.390225,5.164168,1,10,0.14981,4.220362,9,80,7163,26,333,0
3,sample_09004.html,1204,321,0.534781,5.671159,0,33,0.025707,5.626662,17,74,268250,2,1013,0
4,sample_09005.html,2606,665,0.441595,5.136624,5,8,0.13005,5.462992,8,243,7666,8,924,0


The data type check reveals one **Textual** column, *file_name*.

In [None]:
df_html_13.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19997 entries, 0 to 19996
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   file_name                19997 non-null  object 
 1   url_punct_char_count     19997 non-null  int64  
 2   tag_count                19997 non-null  int64  
 3   whitespace_ratio         19997 non-null  float64
 4   entropy                  19997 non-null  float64
 5   form_count               19997 non-null  int64  
 6   embedded_js_count        19997 non-null  int64  
 7   html_whitespace_ratio    19997 non-null  float64
 8   script_entropy           19997 non-null  float64
 9   min_link_length          19997 non-null  int64  
 10  external_link_count      19997 non-null  int64  
 11  total_script_characters  19997 non-null  int64  
 12  internal_link_count      19997 non-null  int64  
 13  url_digit_count          19997 non-null  int64  
 14  label                 
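
Non-numeric columns can also be listed programmatically rather than read off the `info()` output. The sketch below assumes a small made-up frame in place of `df_html_13`; it also checks whether *file_name* is unique per row, which would suggest it is an identifier rather than a feature.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_html_13; only three columns are kept.
toy = pd.DataFrame({
    "file_name": ["sample_09001.html", "sample_09002.html", "sample_09003.html"],
    "tag_count": [475, 82, 406],
    "label": [0, 0, 0],
})

# Everything that is not numeric is treated as a textual candidate.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(text_cols)

# A textual column with one distinct value per row behaves like an identifier.
for col in text_cols:
    print(col, "is unique per row:", toy[col].is_unique)
```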

Having identified the Textual column, we now count the unique values to check for Categorical columns:


In [None]:
df_html_13_num = df_html_13.select_dtypes(include=np.number)

if df_html_13_num.empty:
    print("Nenhuma coluna numérica (int ou float) foi encontrada na tabela.")
else:
    contagem_unicos = df_html_13_num.nunique()

    contagem_unicos_ordenada = contagem_unicos.sort_values()

    print("\n--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---")

    with pd.option_context('display.max_rows', 100):
        print(contagem_unicos_ordenada)


--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---
label                          2
form_count                    42
min_link_length              111
embedded_js_count            122
external_link_count          592
internal_link_count          711
tag_count                   2386
url_digit_count             2525
url_punct_char_count        3502
total_script_characters     9182
script_entropy             12208
whitespace_ratio           12833
html_whitespace_ratio      14377
entropy                    15415
dtype: int64
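
The same unique-value block is repeated for every table, so it could be wrapped in a small function. This is only a sketch: the name `unique_counts_numeric` is hypothetical and the frame below is made-up data.

```python
import numpy as np
import pandas as pd

def unique_counts_numeric(df: pd.DataFrame) -> pd.Series:
    """Number of distinct values per numeric column, sorted ascending."""
    numeric = df.select_dtypes(include=np.number)
    return numeric.nunique().sort_values()

# Hypothetical toy frame; the text column is ignored by the helper.
toy = pd.DataFrame({
    "file_name": ["a.html", "b.html", "c.html", "d.html"],
    "form_count": [0, 0, 1, 5],
    "entropy": [5.07, 5.20, 5.16, 5.67],
    "label": [0, 0, 1, 1],
})

counts = unique_counts_numeric(toy)
with pd.option_context("display.max_rows", 100):
    print(counts)
```

Calling the helper once per table would also avoid checking the wrong DataFrame by accident when a cell is copied.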


Again, only the **label** column is categorical and, like the label columns of the other tables, it is balanced.

In [None]:
class_count = df_html_13['label'].value_counts()
display(class_count)

percent_count = df_html_13['label'].value_counts(normalize=True)
display(percent_count)

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,9999
0,9998


Unnamed: 0_level_0,proportion
label,Unnamed: 1_level_1
1,0.500025
0,0.499975


## Excel:

#### All features of the Excel files:

In [None]:
path = '/content/drive/MyDrive/Colab Notebooks/Dataframes_passo1/Excel_All_Features.csv'
df_excel_all = pd.read_csv(path)
df_excel_all.head()

Unnamed: 0,file_path,file_size,sheet_count,max_rows,max_cols,total_cells,non_empty_cells,numeric_cell_count,string_cell_count,formula_count,...,macro_count_parentheses,macro_count_assignments,macro_max_line_length,macro_max_string_literals,macro_max_arithmetic_ops,macro_max_concat_ops,macro_vocab_size,preview_image_width,preview_image_height,label
0,/content/drive/MyDrive/MasterCard/Nazgol /Mali...,739072,1,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,/content/drive/MyDrive/UNB/Mastercard project/...,58846,1,450,12,5400,4652,1579,3073,0,...,0,0,35,0,2,0,5211,0,0,0
2,/content/drive/MyDrive/UNB/Mastercard project/...,223714,2,2516,10,25164,19458,6445,13013,1,...,2,1,39,0,2,0,12384,0,0,0
3,/content/drive/MyDrive/UNB/Mastercard project/...,506774,4,2742,18,59955,46267,15349,30918,3,...,6,3,39,0,2,0,17441,0,0,0
4,/content/drive/MyDrive/MasterCard/Nazgol /Mali...,155136,1,117,12,1404,10,1,9,0,...,6,11,3080,14,2,14,78,0,0,1


The type check below reveals one **Textual** variable, *file_path*.

In [None]:
df_excel_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 50 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   file_path                        20000 non-null  object 
 1   file_size                        20000 non-null  int64  
 2   sheet_count                      20000 non-null  int64  
 3   max_rows                         20000 non-null  int64  
 4   max_cols                         20000 non-null  int64  
 5   total_cells                      20000 non-null  int64  
 6   non_empty_cells                  20000 non-null  int64  
 7   numeric_cell_count               20000 non-null  int64  
 8   string_cell_count                20000 non-null  int64  
 9   formula_count                    20000 non-null  int64  
 10  hyperlink_count                  20000 non-null  int64  
 11  avg_cell_length                  20000 non-null  float64
 12  entropy_of_text   

Counting the unique values to identify possible **Categorical** variables:

In [None]:
df_excel_all_num = df_excel_all.select_dtypes(include=np.number)

if df_excel_all_num.empty:
    print("Nenhuma coluna numérica (int ou float) foi encontrada na tabela.")
else:
    contagem_unicos = df_excel_all_num.nunique()

    contagem_unicos_ordenada = contagem_unicos.sort_values()

    print("\n--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---")

    with pd.option_context('display.max_rows', 100):
        print(contagem_unicos_ordenada)


--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---
macro_callbyname_count                 1
deceptive_keywords_count_ocr           1
preview_image_text_entropy             1
ocr_extracted_text_length              1
preview_image_width                    1
preview_image_height                   1
hex_pattern_count                      2
has_macro                              2
uses_network_api                       2
uses_process_api                       2
uses_file_api                          2
macro_count                            2
label                                  2
empty_sheet_count                      9
protected_sheets_count                 9
macro_procedure_count                 17
hidden_sheets_count                   17
macro_comment_lines                   28
hyperlink_count                       31
base64_pattern_count                  32
macro_string_function_count           33
named_ranges_count                    34
sheet_count                        
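
Several columns in the output above have a single unique value (for example *preview_image_width* and *preview_image_height*), so they carry no information for separating the classes. A minimal sketch of how such columns could be listed and dropped, using a made-up frame in place of `df_excel_all`:

```python
import pandas as pd

# Hypothetical toy frame standing in for df_excel_all; the values are made up.
toy = pd.DataFrame({
    "preview_image_width": [0, 0, 0, 0],
    "has_macro": [0, 0, 0, 1],
    "sheet_count": [1, 1, 2, 4],
})

# A column is constant when it has exactly one distinct value (NaN included).
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
reduced = toy.drop(columns=constant_cols)

print("constant columns:", constant_cols)
print("remaining columns:", list(reduced.columns))
```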

Based on the values of the candidates with only two unique values, the following variables can be treated as **Categorical**:
* Unbalanced:
  * has_macro, with 98.9% of values equal to 0 and 1.1% equal to 1.
  * uses_network_api, with 99.4% of values equal to 0 and 0.6% equal to 1.
  * macro_count, with 98.9% of values equal to 0 and 1.1% equal to 1.

* Balanced:
  * uses_process_api, with 48% of values equal to 0 and 52% equal to 1.
  * uses_file_api, with 46% of values equal to 0 and 54% equal to 1.

hex_pattern_count also has only two unique values (0 and 19), but it is a count with a single non-zero row, so it is kept as numeric. The remaining variables are numeric.

In [None]:
colunas = [
    'hex_pattern_count',
    'has_macro',
    'uses_network_api',
    'uses_process_api',
    'uses_file_api',
    'macro_count',
    'label'
]

for col in colunas:
    print(f"--- Análise da Coluna: {col} ---")

    print("Contagem:")
    class_count = df_excel_all[col].value_counts()
    display(class_count)

    print("\nProporção:")
    percent_count = df_excel_all[col].value_counts(normalize=True)
    display(percent_count)

    print("-" * 40 + "\n")

--- Análise da Coluna: hex_pattern_count ---
Contagem:


Unnamed: 0_level_0,count
hex_pattern_count,Unnamed: 1_level_1
0,19999
19,1



Proporção:


Unnamed: 0_level_0,proportion
hex_pattern_count,Unnamed: 1_level_1
0,0.99995
19,5e-05


----------------------------------------

--- Análise da Coluna: has_macro ---
Contagem:


Unnamed: 0_level_0,count
has_macro,Unnamed: 1_level_1
0,19778
1,222



Proporção:


Unnamed: 0_level_0,proportion
has_macro,Unnamed: 1_level_1
0,0.9889
1,0.0111


----------------------------------------

--- Análise da Coluna: uses_network_api ---
Contagem:


Unnamed: 0_level_0,count
uses_network_api,Unnamed: 1_level_1
0,19885
1,115



Proporção:


Unnamed: 0_level_0,proportion
uses_network_api,Unnamed: 1_level_1
0,0.99425
1,0.00575


----------------------------------------

--- Análise da Coluna: uses_process_api ---
Contagem:


Unnamed: 0_level_0,count
uses_process_api,Unnamed: 1_level_1
1,10409
0,9591



Proporção:


Unnamed: 0_level_0,proportion
uses_process_api,Unnamed: 1_level_1
1,0.52045
0,0.47955


----------------------------------------

--- Análise da Coluna: uses_file_api ---
Contagem:


Unnamed: 0_level_0,count
uses_file_api,Unnamed: 1_level_1
1,10847
0,9153



Proporção:


Unnamed: 0_level_0,proportion
uses_file_api,Unnamed: 1_level_1
1,0.54235
0,0.45765


----------------------------------------

--- Análise da Coluna: macro_count ---
Contagem:


Unnamed: 0_level_0,count
macro_count,Unnamed: 1_level_1
0,19778
1,222



Proporção:


Unnamed: 0_level_0,proportion
macro_count,Unnamed: 1_level_1
0,0.9889
1,0.0111


----------------------------------------

--- Análise da Coluna: label ---
Contagem:


Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,10000
0,10000



Proporção:


Unnamed: 0_level_0,proportion
label,Unnamed: 1_level_1
1,0.5
0,0.5


----------------------------------------
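
In the output above, *has_macro* and *macro_count* show identical counts (19778 zeros and 222 ones). Identical totals do not guarantee that the two columns agree row by row; a cross-tabulation would show whether they do. The sketch below uses made-up values, not the real table.

```python
import pandas as pd

# Hypothetical toy frame; in the notebook this would be df_excel_all.
toy = pd.DataFrame({
    "has_macro": [0, 0, 0, 1, 1],
    "macro_count": [0, 0, 0, 1, 1],
})

# Rows are has_macro values, columns are macro_count values.
table = pd.crosstab(toy["has_macro"], toy["macro_count"])
print(table)

# If every off-diagonal cell is zero, the two columns never disagree.
disagreements = int((toy["has_macro"] != toy["macro_count"]).sum())
print("rows where the columns differ:", disagreements)
```

If the real columns never differ, keeping only one of them would avoid feeding a redundant feature to a model.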



#### Only the 10 selected features of the Excel files:

In [None]:
path = '/content/drive/MyDrive/Colab Notebooks/Dataframes_passo1/Excel_Top10_Features.csv'
df_excel_10 = pd.read_csv(path)
df_excel_10.head()

Unnamed: 0,file_name,entropy_of_text,macro_chr_count,macro_vocab_size,macro_arithmetic_operator_count,macro_token_count,macro_max_line_length,remote_template_present,numeric_cell_count,string_cell_count,avg_cell_length,label
0,benign_sample_3253.xlsx,5.746068,39,10079,4759,10079,38,1,4336,8926,11.184588,0
1,benign_sample_5685.xlsx,5.749956,287,19789,28783,19789,42,1,27138,54355,11.167818,0
2,benign_sample_8732.xlsx,5.74547,122,16084,12754,16084,38,1,12033,23997,11.132195,0
3,benign_sample_5743.xlsx,5.743675,152,17382,16244,17382,39,1,15217,30554,11.104673,0
4,benign_sample_5522.xlsx,5.74813,106,15452,11529,15452,38,1,10837,21622,11.151976,0


The type check identifies the variable *file_name* as a **Textual** feature.

In [None]:
df_excel_10.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 12 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   file_name                        20000 non-null  object 
 1   entropy_of_text                  20000 non-null  float64
 2   macro_chr_count                  20000 non-null  int64  
 3   macro_vocab_size                 20000 non-null  int64  
 4   macro_arithmetic_operator_count  20000 non-null  int64  
 5   macro_token_count                20000 non-null  int64  
 6   macro_max_line_length            20000 non-null  int64  
 7   remote_template_present          20000 non-null  int64  
 8   numeric_cell_count               20000 non-null  int64  
 9   string_cell_count                20000 non-null  int64  
 10  avg_cell_length                  20000 non-null  float64
 11  label                            20000 non-null  int64  
dtypes: float64(2), int

The unique value count again shows no categorical variable other than label. Every other variable in the table, except *file_name*, is **Numeric**.

In [None]:
df_excel_10_num = df_excel_10.select_dtypes(include=np.number)

if df_excel_10_num.empty:
    print("Nenhuma coluna numérica (int ou float) foi encontrada na tabela.")
else:
    contagem_unicos = df_excel_10_num.nunique()

    contagem_unicos_ordenada = contagem_unicos.sort_values()

    print("\n--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---")

    with pd.option_context('display.max_rows', 100):
        print(contagem_unicos_ordenada)


--- Contagem de Valores Únicos (Apenas Colunas Numéricas) ---
label                                  2
remote_template_present               56
macro_chr_count                      369
macro_max_line_length                417
macro_token_count                   8098
macro_vocab_size                    8098
numeric_cell_count                  8424
macro_arithmetic_operator_count     8564
string_cell_count                   9524
avg_cell_length                    11522
entropy_of_text                    12005
dtype: int64
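
The name *remote_template_present* suggests a 0/1 flag, yet the output above reports 56 unique values for it, so it behaves as a numeric column here. A quick way to test whether a column is really a binary flag is sketched below; the helper `is_binary_flag` is hypothetical and the data is made up.

```python
import pandas as pd

def is_binary_flag(col: pd.Series) -> bool:
    """Return True when every non-missing value is 0 or 1."""
    return bool(col.dropna().isin([0, 1]).all())

# Hypothetical toy frame standing in for df_excel_10.
toy = pd.DataFrame({
    "remote_template_present": [1, 1, 3, 0, 12],
    "label": [0, 0, 1, 1, 0],
})

for name in toy.columns:
    print(name, "binary flag:", is_binary_flag(toy[name]))
```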


The label variable is also balanced in this table.

In [None]:
class_count = df_excel_10['label'].value_counts()
display(class_count)

percent_count = df_excel_10['label'].value_counts(normalize=True)
display(percent_count)

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
0,10000
1,10000


Unnamed: 0_level_0,proportion
label,Unnamed: 1_level_1
0,0.5
1,0.5


## QR code:

#### Benign QR codes

In [None]:
path_qrcode_files_b = '/content/drive/MyDrive/Colab Notebooks/Dataframes_passo1/qrcode_csv/all_generated_urls_20251015_161937(b).csv'
df_qr_b = pd.read_csv(path_qrcode_files_b)
df_qr_b.head()

Unnamed: 0,index,url,qr_path
0,1,http://freshrpms.net/,Output\QR_All_benign\qrs\benign_000001.png
1,2,http://lists.freshrpms.net/mailman/listinfo/rp...,Output\QR_All_benign\qrs\benign_000002.png
2,3,http://www.boquist.net/stort-sup-brev,Output\QR_All_benign\qrs\benign_000003.png
3,4,http://thinkgeek.com/sf,Output\QR_All_benign\qrs\benign_000004.png
4,5,https://lists.sourceforge.net/lists/listinfo/s...,Output\QR_All_benign\qrs\benign_000005.png


The data type check shows that the variables in the QR code tables, *url* and *qr_path*, are **Textual**. The exception is index, which is **Numeric** and only enumerates the observations.

In [None]:
df_qr_b.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 429976 entries, 0 to 429975
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   index    429976 non-null  int64 
 1   url      429976 non-null  object
 2   qr_path  429976 non-null  object
dtypes: int64(1), object(2)
memory usage: 9.8+ MB
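
The *qr_path* values use Windows-style backslashes, which will not resolve as paths on a Linux environment such as Colab. A minimal sketch of one way to normalise them with the standard library; the helper `to_posix` is hypothetical, and the folder the images actually live in is not known from the table.

```python
from pathlib import PureWindowsPath

def to_posix(qr_path: str) -> str:
    """Convert a Windows-style relative path to forward slashes."""
    return PureWindowsPath(qr_path).as_posix()

# Example value copied from the first row of the benign table.
print(to_posix(r"Output\QR_All_benign\qrs\benign_000001.png"))
# Output/QR_All_benign/qrs/benign_000001.png
```

On a DataFrame, the same function could be applied with `df_qr_b['qr_path'].map(to_posix)`.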


#### Malicious QR codes

In [None]:
path_qrcode_files_m = '/content/drive/MyDrive/Colab Notebooks/Dataframes_passo1/qrcode_csv/all_generated_urls_20251015_184324(m).csv'
df_qr_m = pd.read_csv(path_qrcode_files_m)
df_qr_m.head()

Unnamed: 0,index,url,qr_path
0,1,https://www.amazon.com/ref=pe_175190_21431760_...,Output\QR_All_Malicious\qrs\benign_000001.png
1,2,http://g-ecx.images-amazon.com/images/G/01/x-l...,Output\QR_All_Malicious\qrs\benign_000002.png
2,3,http://irc-sspo.uz/a,Output\QR_All_Malicious\qrs\benign_000003.png
3,4,http://www.amazon.com/gp/css/returns/homepage....,Output\QR_All_Malicious\qrs\benign_000004.png
4,5,http://www.amazon.com/customer-service?ref=pe_...,Output\QR_All_Malicious\qrs\benign_000005.png


As in the benign table, the **Textual** types are *url* and *qr_path*, and the **Numeric** type is index.

In [None]:
df_qr_m.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 575762 entries, 0 to 575761
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   index    575762 non-null  int64 
 1   url      575762 non-null  object
 2   qr_path  575762 non-null  object
dtypes: int64(1), object(2)
memory usage: 13.2+ MB
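
The QR code tables have no *label* column; the class is implied by which table a row comes from (429976 benign rows and 575762 malicious rows, roughly 43% and 57%). A label should come from the source table rather than from the file name, because the malicious paths shown above also start with `benign_`. The sketch below assumes the same encoding as the other tables (benign = 0, malicious = 1) and uses a few URLs from the previews as toy data.

```python
import pandas as pd

# Toy frames with a few rows copied from the previews; the real tables are much larger.
benign = pd.DataFrame({
    "index": [1, 2],
    "url": ["http://freshrpms.net/", "http://thinkgeek.com/sf"],
})
malicious = pd.DataFrame({
    "index": [1],
    "url": ["http://irc-sspo.uz/a"],
})

# Assumed encoding: benign = 0, malicious = 1.
df_qr = pd.concat(
    [benign.assign(label=0), malicious.assign(label=1)],
    ignore_index=True,
)

print(df_qr)
print(df_qr["label"].value_counts(normalize=True))
```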


Sources:
* http://cicresearch.ca/IOTDataset/CIC_Trap4Phish_2025_Dataset/Dataset/
* https://www.unb.ca/cic/datasets/index.html
* https://pandas.pydata.org/docs/getting_started/index.html#getting-started
* https://www.geeksforgeeks.org/machine-learning/introduction-machine-learning/
* https://www.geeksforgeeks.org/machine-learning/supervised-machine-learning/
* https://www.geeksforgeeks.org/machine-learning/getting-started-with-classification/
* https://www.geeksforgeeks.org/machine-learning/machine-learning-with-python/
* https://www.geeksforgeeks.org/machine-learning/what-is-feature-engineering/
* https://www.datacamp.com/tutorial/categorical-data
* https://www.geeksforgeeks.org/machine-learning/ml-introduction-data-machine-learning/
* https://www.tutorialspoint.com/machine_learning/machine_learning_data_types.htm
* https://pandas-pydata-org.translate.goog/docs/user_guide/categorical.html?_x_tr_sl=en&_x_tr_tl=pt&_x_tr_hl=pt&_x_tr_pto=tc
* https://www-editage-com.translate.goog/blog/normality-test-methods-of-assessing-normality/?_x_tr_sl=en&_x_tr_tl=pt&_x_tr_hl=pt&_x_tr_pto=sge#:~:text=da%20distribuição%20normal-,Métodos%20de%20Avaliação%20da%20Normalidade,%2C%20teste%20de%20Kolmogorov–Smirnov%20.
* https://www.geeksforgeeks.org/machine-learning/data-preprocessing-machine-learning-python/
* https://www.geeksforgeeks.org/machine-learning/ml-label-encoding-of-datasets-in-python/
