Enrich the dataset with version mentions that contain words absent from Stucco

In [1]:
import pandas as pd
import numpy as np
import psycopg2 as p2
from psycopg2 import sql
from collections import Counter
from tqdm import tqdm

pd.set_option('display.width', 20000)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 200)

Connecting to the database

In [2]:
dbname = "vulns_scanner"
user = 'postgres'
password = 'postgres'
host = 'localhost'
port = '5432'

In [10]:
conn = p2.connect(dbname=dbname, user=user, password=password, host=host, port=port)
cur = conn.cursor()
cur.execute('''
select cve_id, vendor, product, version, descr, initial_cpe  
from cves c inner join descriptions d on c.cve_id_pk=d.cve_id_fk
inner join cve_cpe_config ccc on c.cve_id_pk=ccc.cve_id_fk inner join cpes cp on ccc.cpe_id_fk=cp.cpe_id_pk 
where descr like '%_._%'
limit 10000000
''')
colnames = [desc[0] for desc in cur.description]
tuples = cur.fetchall()
cur.close()
# df = pd.DataFrame(tuples, columns=['cpe_id_pk', 'cpe_version', 'part', 'vendor', 
#                                     'product', 'version', 'update', 'edition', 
#                                     'sw_edition', 'target_sw', 'target_hw', 
#                                     'language', 'other', 'initial_cpe'])
df = pd.DataFrame(tuples, columns=colnames)
df.head()

Unnamed: 0,cve_id,vendor,product,version,descr,initial_cpe
0,CVE-2004-1643,progress,ws_ftp_server,5.0.2,"WS_FTP 5.0.2 allows remote authenticated users to cause a denial of service (CPU consumption) via a CD command that contains an invalid path with a ""../"" sequence.",cpe:2.3:a:progress:ws_ftp_server:5.0.2:*:*:*:*:*:*:*
1,CVE-2004-0658,linux,linux_kernel,2.5.51,Integer overflow in the hpsb_alloc_packet function (incorrectly reported as alloc_hpsb_packet) in IEEE 1394 (Firewire) driver 2.4 and 2.6 allows local users to cause a denial of service (crash) an...,cpe:2.3:o:linux:linux_kernel:2.5.51:*:*:*:*:*:*:*
2,CVE-2004-0658,linux,linux_kernel,2.5.24,Integer overflow in the hpsb_alloc_packet function (incorrectly reported as alloc_hpsb_packet) in IEEE 1394 (Firewire) driver 2.4 and 2.6 allows local users to cause a denial of service (crash) an...,cpe:2.3:o:linux:linux_kernel:2.5.24:*:*:*:*:*:*:*
3,CVE-2004-1051,todd_miller,sudo,1.6.3_p2,"sudo before 1.6.8p2 allows local users to execute arbitrary commands by using ""()"" style environment variables to create functions that have the same name as any program within the bash script tha...",cpe:2.3:a:todd_miller:sudo:1.6.3_p2:*:*:*:*:*:*:*
4,CVE-2004-1772,gnu,sharutils,4.2,Stack-based buffer overflow in shar in GNU sharutils 4.2.1 allows local users to execute arbitrary code via a long -o command line argument.,cpe:2.3:a:gnu:sharutils:4.2:*:*:*:*:*:*:*


In [11]:
df.shape

(686075, 6)

In [None]:
df['product_in_descr']

Unnamed: 0,cve_id,vendor,product,version,descr,matched_regex_before,matched_regex_after
0,CVE-2004-0013,jabber_software_foundation,jabber_server,1.4.3,"jabber 1.4.2, 1.4.2a, and possibly earlier versions, does not properly handle SSL connections, which allows remote attackers to cause a denial of service (crash).",jabber,
1,CVE-2004-0043,yahoo,messenger,5.6.0.1351,Buffer overflow in Yahoo Instant Messenger 5.6.0.1351 and earlier allows remote attackers to cause a denial of service (crash) and possibly execute arbitrary code via a long filename in the downlo...,Messenger,and
2,CVE-2004-0043,yahoo,messenger,5.6.0.1358,Buffer overflow in Yahoo Instant Messenger 5.6.0.1351 and earlier allows remote attackers to cause a denial of service (crash) and possibly execute arbitrary code via a long filename in the downlo...,Messenger,and
3,CVE-2004-0159,samhain_labs,hsftp,1.4,Format string vulnerability in hsftp 1.11 allows remote authenticated users to cause a denial of service and possibly execute arbitrary code via file names containing format string characters that...,hsftp,
4,CVE-2004-0159,samhain_labs,hsftp,1.11,Format string vulnerability in hsftp 1.11 allows remote authenticated users to cause a denial of service and possibly execute arbitrary code via file names containing format string characters that...,hsftp,
...,...,...,...,...,...,...,...
686070,CVE-2002-0600,kth,kth_kerberos,4_1.1.1,Heap overflow in the KTH Kerberos 4 FTP client 4-1.1.1 allows remote malicious servers to execute arbitrary code on the client via a long response to a passive (PASV) mode request.,,allows
686071,CVE-2002-0910,debian,netstd,3.07,"Buffer overflows in netstd 3.07-17 package allows remote DNS servers to execute arbitrary code via a long FQDN reply, as observed in the utilities (1) linux-ftpd, (2) pcnfsd, (3) tftp, (4) tracero...",netstd,
686072,CVE-2002-1964,wesmo,phpeventcalendar,1.1,Unknown vulnerability in WesMo phpEventCalendar 1.1 allows remote attackers to execute arbitrary commands via unknown attack vectors.,phpEventCalendar,
686073,CVE-2002-2110,rca,digital_cable_modem,dcm225,The RCA Digital Cable Modems DCM225 and DCM225E allow remote attackers to cause a denial of service (modem device reset) by connecting to port 80 on the 10.0.0.0/8 device.,the,


In [None]:
# cpe:2.3:a:hookturn:advanced_forms_for_acf:*:*:*:*:*:wordpress:*:*

In [13]:
df[df.cve_id == "CVE-2021-25441"]

Unnamed: 0,cve_id,vendor,product,version,descr,initial_cpe


In [7]:
df['version'].unique()[:1000]

array(['1.4.3', '5.6.0.1351', '5.6.0.1358', '1.4', '1.11', '-', '6.4.4',
       '2.5.51', '2.5.24', '2.5.46', '8.0_final', '1.0.5', '2.2.4',
       '1.6.3_p2', '1.1a', '1.1f', '2.0', '3.1.17', '4.2', '',
       '3.5_solaris_mp2', '0.3.1_b1', '1.3.1', '1.3.4', '4.1', '9.6.1',
       '9.1.1', '9.3.1', '9.1.3', '9.3.5', '9.5.0', '9.3.2', '9.2.2',
       '9.4.1', '9.7.0', '9.6', '9.2.1', '9.4.2', '9.1', '9.4.3', '9.2.0',
       '9.4.0', '9.3.3', '9.5.1', '9.6.0', '10.0.6', '10.0', '3.5',
       '0.9.8', '2.11.1', '2.10.11', '2.8.0-a1', '2.9.1', '2.11.0a1',
       '2.10.2', '2.8.6', '2.12.1', '5.1.6', '1.11.2', '1.13.3', '4.7.0',
       '5.8.5', '5.9.5', '4.7.7', '6.3.9', '5.5.0', '0.73', 'x2.0.0',
       '0.8.5', '1.0.0', '1.6.0', '0.13.0\\+1', '0.15.0', '2.6.27.56',
       '1.6.2', '1.4.0', '1.7.3', '5.0.0', '5.0.1', '5.1', '5.1.17',
       '0.3.3', '0.3.0', '2.17a', '2.18', '2.34b', '2.63a', '2.73',
       '2.74c', '2.93', '2.96', '3.16', '3.16b', '3.18', '3.28', '3.46',
       '3.52', '

## "Before" versions (the word preceding the version number)

In [6]:
import re

def extract_word_before_version(text):
    # The regex looks for a token of letters or comparison signs (>, <, =),
    # followed by whitespace and a version number with at least one dot
    pattern = r'([a-zA-Z><=]+)\s+\d+(?:\.\d+){1,}'
    match = re.search(pattern, text)
    if match:
        return match.group(1)
    else:
        return ''

In [62]:
print(extract_word_before_version("Google Chrome before 19.0.1084.46"))  # Output: before
print(extract_word_before_version("i have got the name 2.1.0"))          # Output: name
print(extract_word_before_version("this is tuesday 1.2.1.1.1"))          # Output: tuesday
print(extract_word_before_version("mama i love you 6.0"))      
print(extract_word_before_version("all that are >= 6.0"))   

before
name
tuesday
you
>=
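
`re.search` returns only the first match, so a description that lists several products contributes a single word. A minimal sketch of a variant that collects every (word, version) pair, assuming the same pattern as above with the version number captured as a second group; the name `extract_words_before_versions` is hypothetical and not part of the notebook:

```python
import re

# Same pattern as above; the version number is captured as a second group.
WORD_BEFORE_VERSION = re.compile(r'([a-zA-Z><=]+)\s+(\d+(?:\.\d+)+)')

def extract_words_before_versions(text):
    """Return a list of (word, version) tuples for every match in text."""
    return WORD_BEFORE_VERSION.findall(text)

example = "Buffer overflows in (1) Icecast before 1.3.9 and (2) libshout before 1.0.4"
print(extract_words_before_versions(example))
# → [('before', '1.3.9'), ('before', '1.0.4')]
```

Keeping the version in the tuple may also help later when the number itself has to be tagged.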


In [7]:
df['matched_regex_before'] = df['descr'].apply(extract_word_before_version)

In [64]:
matched_df = df['matched_regex_before'].value_counts().reset_index()
matched_df = matched_df.rename(columns={'index': 'word'})
matched_df

Unnamed: 0,word,matched_regex_before
0,before,223119
1,,105968
2,through,54907
3,version,51086
4,to,48571
...,...,...
9354,TxtBlog,1
9355,cmsWorks,1
9356,cpLinks,1
9357,PostfixAdmin,1
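
The `rename(columns={'index': 'word'})` step depends on the column names that `value_counts().reset_index()` produces, and those names differ between pandas 1.x and pandas 2.x (in 2.x the counts column is called `count` and there is no `index` column). A sketch of a form that should give the same column names on either version; the helper name `word_counts` and the column name `n` are assumptions for illustration:

```python
import pandas as pd

def word_counts(series):
    """Frequency table with fixed column names: 'word' and 'n'."""
    # rename_axis names the index, reset_index(name=...) names the counts column.
    return series.value_counts().rename_axis('word').reset_index(name='n')

toy = pd.Series(['before', 'before', 'through', ''], name='matched_regex_before')
counts = word_counts(toy)
print(counts)
```

Downstream cells that use `matched_df['word']` would keep working; only the counts column is named differently from the output shown above.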


In [79]:
matched_df['capital'] = matched_df['word'].str.match(r'^[A-Z]')  # Regex: ^[A-Z] means "starts with uppercase A-Z"
matched_df[matched_df.capital == False].head(5)

Unnamed: 0,word,matched_regex_before,capital
0,before,223119,False
1,,105968,False
2,through,54907,False
3,version,51086,False
4,to,48571,False


In [80]:
matched_df[matched_df.capital == False].head(5).word.tolist()

['before', '', 'through', 'version', 'to']

In [1]:
# prior
# up to
before_words = ['before',
 'to',
 'version',
 'through',
 'versions',
 'and',
 'application',
 'is',
 'from',
 'including',
 'in',
 'below',
 'than',
 'the',
 'possibly',
 'is',
 'after',
 'older',
 'until',
 '<',
 '<='
 ]

before_words2 = [
    'older',
    'until'
]

In [5]:
len(before_words)

21

In [136]:
def check_conditions(row, wrd):
    if ((row['vendor'] in row['descr']) and
        (row['product'] in row['descr']) and
        (row['matched_regex_before'] == wrd)):
        return True
    else:
        return False
    
def check_conditions_before_less_conditions(row, wrd):
    if (((row['vendor'] in row['descr'].lower()) or
        (row['product'] in row['descr'].lower())) and
        (row['matched_regex_before'] == wrd)):
        return True
    else:
        return False
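
The vendor and product substring checks do not depend on `wrd`, so they can be computed once and then combined with a cheap equality mask for each word, instead of re-running `df.apply(..., axis=1)` per word. A sketch under that assumption, using the same column names; `containment_masks` and `rows_for_word` are hypothetical helper names:

```python
import pandas as pd

def containment_masks(df):
    """Compute once: does vendor / product occur in the description (case-sensitive)?"""
    vendor_in = pd.Series(
        [v in d for v, d in zip(df['vendor'], df['descr'])], index=df.index)
    product_in = pd.Series(
        [p in d for p, d in zip(df['product'], df['descr'])], index=df.index)
    return vendor_in, product_in

def rows_for_word(df, vendor_in, product_in, wrd, column='matched_regex_before'):
    """Same rows as df[df.apply(lambda x: check_conditions(x, wrd), axis=1)]."""
    return df[vendor_in & product_in & (df[column] == wrd)]

toy = pd.DataFrame({
    'vendor': ['zsh', 'mozilla', 'yahoo'],
    'product': ['zsh', 'convict', 'messenger'],
    'descr': ['In zsh before 5.3, an off-by-one error',
              'This affects the package convict before 6.2.3.',
              'Buffer overflow in Yahoo Instant Messenger 5.6.0.1351 and earlier'],
    'matched_regex_before': ['before', 'before', 'Messenger'],
})
vendor_in, product_in = containment_masks(toy)
print(rows_for_word(toy, vendor_in, product_in, 'before'))
```

For the relaxed variant, the same idea applies with `d.lower()` in the comprehensions and `|` instead of `&` between the two containment masks.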

In [82]:
tt = df[df.apply(lambda x: check_conditions(x, 'before'), axis=1)]
tt.sample(len(tt)).head()

Unnamed: 0,cve_id,vendor,product,version,descr,matched_regex_before
600628,CVE-2002-2220,chetcpasswd,chetcpasswd,1.12,"Buffer overflow in Pedro Lineu Orso chetcpasswd before 1.12, when configured for access from 0.0.0.0, allows local users to gain privileges via unspecified vectors.",before
96551,CVE-2022-25875,svelte,svelte,3.37.0,The package svelte before 3.49.0 are vulnerable to Cross-site Scripting (XSS) due to improper input sanitization and to improper escape of attributes when using objects during SSR (Server-Side Ren...,before
133498,CVE-2023-50724,resque,resque,1.22.0,"Resque (pronounced like ""rescue"") is a Redis-backed library for creating background jobs, placing those jobs on multiple queues, and processing them later. resque-web in resque versions before 2.1...",before
187545,CVE-2022-21190,mozilla,convict,0.5.0,This affects the package convict before 6.2.3. This is a bypass of [CVE-2022-22143](https://security.snyk.io/vuln/SNYK-JS-CONVICT-2340604). The [fix](https://github.com/mozilla/node-convict/commit...,before
283239,CVE-2016-10714,zsh,zsh,4.0.5,"In zsh before 5.3, an off-by-one error resulted in undersized buffers that were intended to support PATH_MAX characters.",before


In [83]:
df_new_words = pd.DataFrame()
for wrd in tqdm(before_words):
    filtered_df = df[df.apply(lambda x: check_conditions(x, wrd), axis=1)]
    print(wrd, filtered_df.shape)
    if filtered_df.empty:
        continue
    df_samples_per_word = filtered_df.sample(len(filtered_df), random_state=43).groupby('cve_id').sample(1, random_state=43).head()
    if df_new_words.empty:
        df_new_words = df_samples_per_word
    else:
        df_new_words = pd.concat([df_new_words, df_samples_per_word])

  5%|▍         | 1/21 [00:03<01:16,  3.82s/it]

before (13923, 6)


 10%|▉         | 2/21 [00:07<01:12,  3.84s/it]

to (10647, 6)


 14%|█▍        | 3/21 [00:11<01:09,  3.84s/it]

version (6167, 6)


 19%|█▉        | 4/21 [00:15<01:05,  3.84s/it]

through (1804, 6)


 24%|██▍       | 5/21 [00:19<01:01,  3.84s/it]

versions (665, 6)


 29%|██▊       | 6/21 [00:23<00:57,  3.85s/it]

and (24, 6)


 33%|███▎      | 7/21 [00:26<00:53,  3.85s/it]

application (109, 6)


 38%|███▊      | 8/21 [00:30<00:50,  3.87s/it]

is (37, 6)


 43%|████▎     | 9/21 [00:34<00:46,  3.84s/it]

from (229, 6)


 48%|████▊     | 10/21 [00:38<00:42,  3.84s/it]

including (777, 6)


 52%|█████▏    | 11/21 [00:42<00:38,  3.84s/it]

in (121, 6)


 57%|█████▋    | 12/21 [00:46<00:34,  3.83s/it]

below (148, 6)


 62%|██████▏   | 13/21 [00:49<00:30,  3.85s/it]

than (27, 6)


 67%|██████▋   | 14/21 [00:53<00:26,  3.86s/it]

the (7, 6)


 71%|███████▏  | 15/21 [00:57<00:23,  3.86s/it]

possibly (23, 6)


 76%|███████▌  | 16/21 [01:01<00:19,  3.88s/it]

is (37, 6)


 81%|████████  | 17/21 [01:05<00:15,  3.88s/it]

after (14, 6)


 86%|████████▌ | 18/21 [01:09<00:11,  3.90s/it]

older (0, 6)


 90%|█████████ | 19/21 [01:13<00:07,  3.91s/it]

until (0, 6)


 95%|█████████▌| 20/21 [01:17<00:03,  3.91s/it]

< (335, 6)


100%|██████████| 21/21 [01:21<00:00,  3.87s/it]

<= (663, 6)
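
The sampling chain in the loop keeps one random row per CVE and then takes `.head()`. In the outputs shown in this notebook the selected CVE IDs come out in ascending order, which suggests that `.head()` keeps the first five groups by key rather than five random CVEs. If a random choice of CVEs is wanted, a second shuffle after the per-CVE sampling is one option. A minimal sketch; `sample_cves` is a hypothetical helper and the toy data is made up:

```python
import pandas as pd

def sample_cves(filtered_df, n_cves=5, seed=43):
    """Pick one random row per CVE, then up to n_cves random CVEs."""
    one_per_cve = filtered_df.groupby('cve_id').sample(1, random_state=seed)
    return one_per_cve.sample(n=min(n_cves, len(one_per_cve)), random_state=seed)

toy = pd.DataFrame({
    # 10 CVEs with 3 rows each (e.g. one row per affected version)
    'cve_id': [f'CVE-2020-{i // 3:04d}' for i in range(30)],
    'version': [f'1.{i}' for i in range(30)],
})
picked = sample_cves(toy)
print(picked)
```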





In [137]:
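# Relaxed matching (vendor OR product, case-insensitive) for the words
# that produced no rows under the strict conditions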
df_new_words_before = pd.DataFrame()
for wrd in tqdm(before_words2):
    filtered_df = df[df.apply(lambda x: check_conditions_before_less_conditions(x, wrd), axis=1)]
    print(wrd, filtered_df.shape)
    if filtered_df.empty:
        continue
    df_samples_per_word = filtered_df.sample(len(filtered_df), random_state=43).groupby('cve_id').sample(1, random_state=43).head()
    if df_new_words_before.empty:
        df_new_words_before = df_samples_per_word
    else:
        df_new_words_before = pd.concat([df_new_words_before, df_samples_per_word])

 50%|█████     | 1/2 [00:05<00:05,  5.49s/it]

older (0, 7)


100%|██████████| 2/2 [00:11<00:00,  5.52s/it]

until (18, 7)





In [139]:
df_new_words_before

Unnamed: 0,cve_id,vendor,product,version,descr,matched_regex_before,matched_regex_after
540011,CVE-2018-0250,cisco,aironet_access_point_software,8.2\(160.0\),"A vulnerability in Central Web Authentication (CWA) with FlexConnect Access Points (APs) for Cisco Aironet 1560, 1810, 1810w, 1815, 1830, 1850, 2800, and 3800 Series APs could allow an authenticat...",until,
376666,CVE-2023-39196,apache,ozone,1.3.0,Improper Authentication vulnerability in Apache Ozone.\n\nThe vulnerability allows an attacker to download metadata internal to the Storage Container Manager service without proper authentication....,until,and
208388,CVE-2023-4090,acilia,widestand,-,"Cross-site Scripting (XSS) reflected vulnerability on WideStand until 5.3.5 version, which generates one of the meta tags directly using the content of the queried URL, which would allow an attack...",until,version
178482,CVE-2023-45814,littlebigfresh,bunkum,4.0,"Bunkum is an open-source protocol-agnostic request server for custom game servers. First, a little bit of background. So, in the beginning, Bunkum's `AuthenticationService` only supported injectin...",until,
48602,CVE-2023-7078,cloudflare,miniflare,3.20230821.0,Sending specially crafted HTTP requests to Miniflare's server could result in arbitrary HTTP and WebSocket requests being sent from the server. If Miniflare was configured to listen on external ne...,until,


In [12]:
df_new_words.to_excel('test_versions.xlsx', index=False)

In [85]:
df_new_words.head()

Unnamed: 0,cve_id,vendor,product,version,descr,matched_regex_before
595399,CVE-1999-1383,tcsh,tcsh,6.05,"(1) bash before 1.14.7, and (2) tcsh 6.05 allow local users to gain privileges via directory names that contain shell metacharacters (` back-tick), which can cause the commands enclosed in the dir...",before
159804,CVE-2001-0366,sap,saposcol,1.3,"saposcol in SAP R/3 Web Application Server Demo before 1.5 trusts the PATH environmental variable to find and execute the expand program, which allows local users to obtain root access by modifyin...",before
488080,CVE-2001-0439,licq,licq,,licq before 1.0.3 allows remote attackers to execute arbitrary commands via shell metacharacters in a URL.,before
342690,CVE-2001-0825,xinetd,xinetd,2.1.8.9,"Buffer overflow in internal string handling routines of xinetd before 2.1.8.8 allows remote attackers to execute arbitrary commands via a length argument of zero or less, which disables the length...",before
505766,CVE-2001-1229,libshout,libshout,,Buffer overflows in (1) Icecast before 1.3.9 and (2) libshout before 1.0.4 allow remote attackers to cause a denial of service (crash) and execute arbitrary code.,before


In [87]:
from collections import Counter
Counter(df_new_words.cve_id.apply(lambda x: x[4:8]))

Counter({'1999': 4,
         '2001': 5,
         '2010': 4,
         '2012': 2,
         '2014': 7,
         '2008': 2,
         '2015': 1,
         '2005': 1,
         '2006': 4,
         '2009': 3,
         '2017': 6,
         '2018': 5,
         '2002': 2,
         '2004': 2,
         '2022': 12,
         '2019': 6,
         '2020': 5,
         '2021': 7,
         '2023': 2,
         '2016': 3,
         '2007': 2})

In [151]:
all_cves = []
all_tokens = []
all_bio = []
for i, row in df_new_words_before.iterrows():
    cves = []
    bio_ann = []
    spl_tokens = [x for x in re.split(' |\\n', row['descr']) if x]
    tokens = []
    # print(spl_tokens)
    for tok in spl_tokens:
        if tok.endswith('.'):
            tokens.append(tok.rstrip('.'))
            tokens.append('.')
        elif tok.endswith(','):
            tokens.append(tok.rstrip(','))
            tokens.append(',')
        else:
            tokens.append(tok)
    # print(tokens)
    for tok_i in range(len(tokens)):
        if tokens[tok_i] == row['vendor']:
            bio_ann.append('B-vendor')
        elif tokens[tok_i] == row['product']:
            bio_ann.append('B-product')
        elif tokens[tok_i] == row['matched_regex_before']:
            bio_ann.append('B-version')
        else:
            bio_ann.append('O')
        if tok_i == 0:
            cves.append(row['cve_id'])
        else:
            cves.append('0')
    assert len(bio_ann) == len(tokens)
    all_bio.extend(bio_ann)
    all_tokens.extend(tokens)
    all_cves.extend(cves)
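
The same tokenize-and-tag loop is repeated for each sample set, differing only in which column holds the trigger word. A sketch of how it could be factored into helpers with the same tokenization and tagging rules as the loop above; the names `split_tokens` and `annotate_row` are hypothetical:

```python
import re

def split_tokens(descr):
    """Split on spaces/newlines and detach a trailing '.' or ',' as its own token."""
    tokens = []
    for tok in (t for t in re.split(' |\\n', descr) if t):
        if tok.endswith('.'):
            tokens += [tok.rstrip('.'), '.']
        elif tok.endswith(','):
            tokens += [tok.rstrip(','), ',']
        else:
            tokens.append(tok)
    return tokens

def annotate_row(row, trigger_col):
    """Return (cve_column, tokens, labels) for one description."""
    tokens = split_tokens(row['descr'])
    labels = []
    for tok in tokens:
        if tok == row['vendor']:
            labels.append('B-vendor')
        elif tok == row['product']:
            labels.append('B-product')
        elif tok == row[trigger_col]:
            labels.append('B-version')
        else:
            labels.append('O')
    # The CVE id marks the first token of a description; the rest get '0'.
    cve_column = [row['cve_id'] if i == 0 else '0' for i in range(len(tokens))]
    return cve_column, tokens, labels

row = {'cve_id': 'CVE-2016-10714', 'vendor': 'zsh', 'product': 'zsh',
       'descr': 'In zsh before 5.3, an off-by-one error.',
       'matched_regex_before': 'before'}
cves, tokens, labels = annotate_row(row, 'matched_regex_before')
print(list(zip(cves, tokens, labels)))
```

Because the vendor check comes first, a token equal to both vendor and product is tagged `B-vendor`, exactly as in the loop. The version number itself stays `O` in this pass.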

In [152]:
df_before_annotated = pd.DataFrame(data={'cve_id': all_cves,
                                         'words': all_tokens,
                                         'custom_bio': all_bio})

In [153]:
df_before_annotated.head(300)

Unnamed: 0,cve_id,words,custom_bio
0,CVE-2018-0250,A,O
1,0,vulnerability,O
2,0,in,O
3,0,Central,O
4,0,Web,O
...,...,...,...
295,0,1.4.0,O
296,0,",",O
297,0,which,O
298,0,fixes,O


In [149]:
df[df.cve_id == 'CVE-2023-39196'].descr

369266    Improper Authentication vulnerability in Apache Ozone.\n\nThe vulnerability allows an attacker to download metadata internal to the Storage Container Manager service without proper authentication....
376665    Improper Authentication vulnerability in Apache Ozone.\n\nThe vulnerability allows an attacker to download metadata internal to the Storage Container Manager service without proper authentication....
376666    Improper Authentication vulnerability in Apache Ozone.\n\nThe vulnerability allows an attacker to download metadata internal to the Storage Container Manager service without proper authentication....
381626    Improper Authentication vulnerability in Apache Ozone.\n\nThe vulnerability allows an attacker to download metadata internal to the Storage Container Manager service without proper authentication....
381627    Improper Authentication vulnerability in Apache Ozone.\n\nThe vulnerability allows an attacker to download metadata internal to the Storage Container 

In [154]:
df_before_annotated.to_csv(f'./cve_dataset_bio_{len(df_before_annotated)}_before_versions_added_until.tsv', index=False, sep='\t')

In [96]:
df_tt = pd.read_csv('cve_dataset_bio_4477_before_versions.tsv', sep='\t')

In [98]:
df_tt.iloc[146:178]

Unnamed: 0,cve_id,words,custom_bio
146,CVE-2001-1229,Buffer,O
147,0,overflows,O
148,0,in,O
149,0,(1),O
150,0,Icecast,B-vendor
151,0,Icecast,B-product
152,0,before,B-version
153,0,1.3.9,O
154,0,and,O
155,0,(2),O


In [43]:
print(np.array(all_cves).shape)
print(np.array(all_tokens).shape)
print(np.array(all_bio).shape)

(4095,)
(4095,)
(4095,)


In [155]:
pd.read_csv('cve_dataset_bio_4477_before_versions.tsv', sep='\t')

Unnamed: 0,cve_id,words,custom_bio
0,CVE-1999-1383,(1),O
1,0,bash,B-vendor
2,0,before,B-version
3,0,1.14.7,I-version
4,0,",",O
...,...,...,...
5154,0,access,O
5155,0,other,O
5156,0,local,O
5157,0,servers,O


## "After" versions (the word following the version number)

In [102]:
import re

def extract_word_after_version(text):
    # The regex looks for a version pattern with at least two dots followed by a word (letters)
    pattern = r'\d+(?:\.\d+){2,}\s+(\b[a-zA-Z]+\b)'
    match = re.search(pattern, text)
    if match:
        return match.group(1)
    else:
        return ''

# Test cases
print(extract_word_after_version("Google Chrome before 19.0.1084.46 may"))  # Output: may
print(extract_word_after_version("i have got the name 2.1.0 version")) # Output: version
print(extract_word_after_version("this is tuesday 1.2.1.1.1 today"))   # Output: today
print(extract_word_after_version("mama i love you 6.0"))                # Output: '' (empty string, no match)

may
version
today



In [103]:
df['matched_regex_after'] = df['descr'].apply(extract_word_after_version)
matched_df2 = df['matched_regex_after'].value_counts().reset_index()
matched_df2 = matched_df2.rename(columns={'index': 'word'})
matched_df2

Unnamed: 0,word,matched_regex_after
0,,308257
1,and,81376
2,allows,56205
3,does,48507
4,for,25996
...,...,...
614,GPL,1
615,compares,1
616,cPanel,1
617,IF,1


In [106]:
matched_df2['capital'] = matched_df2['word'].str.match(r'^[A-Z]')  # Regex: ^[A-Z] means "starts with uppercase A-Z"
matched_df2[matched_df2.capital == False].head(50)

Unnamed: 0,word,matched_regex_after,capital
0,,308257,False
1,and,81376,False
2,allows,56205,False
3,does,48507,False
4,for,25996,False
5,versions,20638,False
6,through,14630,False
7,are,13671,False
8,is,13230,False
9,has,8485,False


In [111]:
matched_df2.iloc[:50].word.tolist()

['',
 'and',
 'allows',
 'does',
 'for',
 'versions',
 'through',
 'are',
 'is',
 'has',
 'allow',
 'due',
 'contains',
 'to',
 'on',
 'at',
 'or',
 'in',
 'before',
 'was',
 'leaks',
 'via',
 'have',
 'can',
 'could',
 'might',
 'do',
 'uses',
 'contain',
 'mishandles',
 'did',
 'may',
 'using',
 'of',
 'unserialises',
 'that',
 'a',
 'beta',
 'allowed',
 'use',
 'build',
 'when',
 'release',
 'there',
 'includes',
 'unserializes',
 'which',
 'as',
 'suffers',
 'Beta']

In [113]:
after_words = [
    'and',
    'through',
    'to',
    'or',
    'before',
    'beta',
    'build',
    
]

In [None]:
for wrd in matched_df2[matched_df2.capital == False].head(50).word:
    print(wrd)
    r = df[df['matched_regex_after'] == wrd]
    if not r.empty:
        print(r.sample(1).descr)
    print('\n\n\n')

In [115]:
def check_conditions_after(row, wrd):
    if ((row['vendor'] in row['descr']) and
        (row['product'] in row['descr']) and
        (row['matched_regex_after'] == wrd)):
        return True
    else:
        return False

In [156]:
def check_conditions_after_less_conditions(row, wrd):
    if (((row['vendor'] in row['descr'].lower()) or
        (row['product'] in row['descr'].lower())) and
        (row['matched_regex_after'] == wrd)):
        return True
    else:
        return False

In [157]:
df_new_words_after = pd.DataFrame()
for wrd in tqdm(after_words):
    filtered_df = df[df.apply(lambda x: check_conditions_after(x, wrd), axis=1)]
    print(wrd, filtered_df.shape)
    if filtered_df.empty:
        continue
    elif filtered_df.cve_id.nunique() <= 5:
        filtered_df = df[df.apply(lambda x: check_conditions_after_less_conditions(x, wrd), axis=1)]
    df_samples_per_word = filtered_df.sample(len(filtered_df), random_state=43).groupby('cve_id').sample(1, random_state=43).head(5)
    if df_new_words_after.empty:
        df_new_words_after = df_samples_per_word
    else:
        df_new_words_after = pd.concat([df_new_words_after, df_samples_per_word])

 14%|█▍        | 1/7 [00:03<00:22,  3.78s/it]

and (5178, 7)


 29%|██▊       | 2/7 [00:07<00:18,  3.73s/it]

through (1002, 7)


 43%|████▎     | 3/7 [00:11<00:14,  3.72s/it]

to (130, 7)


 57%|█████▋    | 4/7 [00:14<00:11,  3.72s/it]

or (128, 7)


 71%|███████▏  | 5/7 [00:18<00:07,  3.74s/it]

before (171, 7)
beta (3, 7)


 86%|████████▌ | 6/7 [00:27<00:05,  5.64s/it]

build (1, 7)


100%|██████████| 7/7 [00:37<00:00,  5.35s/it]


In [158]:
df_new_words_after

Unnamed: 0,cve_id,vendor,product,version,descr,matched_regex_before,matched_regex_after
119349,CVE-1999-1483,svgalib,svgalib,,Buffer overflow in zgv in svgalib 1.2.10 and earlier allows local users to execute arbitrary code via a long HOME environment variable.,svgalib,and
170729,CVE-2001-0556,nedit,nedit,,The Nirvana Editor (NEdit) 5.1.1 and earlier allows a local attacker to overwrite other users' files via a symlink attack on (1) backup files or (2) temporary files used when nedit prints a file o...,,and
567214,CVE-2001-0570,minicom,minicom,,minicom 1.83.1 and earlier allows a local attacker to gain additional privileges via numerous format string attacks.,minicom,and
259505,CVE-2001-0700,w3m,w3m,0.1.8,Buffer overflow in w3m 0.2.1 and earlier allows a remote attacker to execute arbitrary code via a long base64 encoded MIME header.,m,and
595407,CVE-2001-0834,htdig,htdig,,"htsearch CGI program in htdig (ht://Dig) 3.1.5 and earlier allows remote attackers to use the -c option to specify an alternate configuration file, which could be used to (1) cause a denial of ser...",,and
499787,CVE-2003-1462,mod_survey,mod_survey,3.0.7,"mod_survey 3.0.0 through 3.0.15-pre6 does not check whether a survey exists before creating a subdirectory for it, which allows remote attackers to cause a denial of service (disk consumption and ...",survey,through
359080,CVE-2005-0686,mlterm,mlterm,2.7,"Integer overflow in mlterm 2.5.0 through 2.9.1, with gdk-pixbuf support enabled, allows remote attackers to execute arbitrary code via a large image file that is used as a background.",mlterm,through
236251,CVE-2005-1692,xine,gxine,0.44,"Format string vulnerability in gxine 0.4.1 through 0.4.4, and other versions down to 0.3, allows remote attackers to execute arbitrary code via a ram file with a URL whose hostname contains format...",gxine,through
380589,CVE-2005-3345,rssh,rssh,2.2.3,rssh 2.0.0 through 2.2.3 allows local users to bypass access restrictions and gain root privileges by using the rssh_chroot_helper command to chroot to an external directory.,rssh,through
240356,CVE-2006-4244,sql-ledger,sql-ledger,2.8.14,"SQL-Ledger 2.4.4 through 2.6.17 authenticates users by verifying that the value of the sql-ledger-[username] cookie matches the value of the sessionid parameter, which allows remote attackers to g...",Ledger,through


In [159]:
df_new_words_after.to_csv('df_new_words_after.csv', index=False)

In [160]:
all_cves = []
all_tokens = []
all_bio = []
for i, row in df_new_words_after.iterrows():
    cves = []
    bio_ann = []
    spl_tokens = [x for x in re.split(' |\\n', row['descr']) if x]
    tokens = []
    # print(spl_tokens)
    for tok in spl_tokens:
        if tok.endswith('.'):
            tokens.append(tok.rstrip('.'))
            tokens.append('.')
        elif tok.endswith(','):
            tokens.append(tok.rstrip(','))
            tokens.append(',')
        else:
            tokens.append(tok)
    # print(tokens)
    for tok_i in range(len(tokens)):
        if tokens[tok_i] == row['vendor']:
            bio_ann.append('B-vendor')
        elif tokens[tok_i] == row['product']:
            bio_ann.append('B-product')
        elif tokens[tok_i] == row['matched_regex_after']:
            bio_ann.append('B-version')
        else:
            bio_ann.append('O')
        if tok_i == 0:
            cves.append(row['cve_id'])
        else:
            cves.append('0')
    assert len(bio_ann) == len(tokens)
    all_bio.extend(bio_ann)
    all_tokens.extend(tokens)
    all_cves.extend(cves)

In [None]:
df_after_annotated = pd.DataFrame(data={'cve_id': all_cves,
                                         'words': all_tokens,
                                         'custom_bio': all_bio})
df_after_annotated.to_csv(f'./cve_dataset_bio_{len(df_after_annotated)}_after_versions.tsv', index=False, sep='\t')

In [164]:
df[df.product == df.vendor].descr

Series([], Name: descr, dtype: object)

In [167]:
df

Unnamed: 0,cve_id,vendor,product,version,descr,matched_regex_before,matched_regex_after
0,CVE-2004-0013,jabber_software_foundation,jabber_server,1.4.3,"jabber 1.4.2, 1.4.2a, and possibly earlier versions, does not properly handle SSL connections, which allows remote attackers to cause a denial of service (crash).",jabber,
1,CVE-2004-0043,yahoo,messenger,5.6.0.1351,Buffer overflow in Yahoo Instant Messenger 5.6.0.1351 and earlier allows remote attackers to cause a denial of service (crash) and possibly execute arbitrary code via a long filename in the downlo...,Messenger,and
2,CVE-2004-0043,yahoo,messenger,5.6.0.1358,Buffer overflow in Yahoo Instant Messenger 5.6.0.1351 and earlier allows remote attackers to cause a denial of service (crash) and possibly execute arbitrary code via a long filename in the downlo...,Messenger,and
3,CVE-2004-0159,samhain_labs,hsftp,1.4,Format string vulnerability in hsftp 1.11 allows remote authenticated users to cause a denial of service and possibly execute arbitrary code via file names containing format string characters that...,hsftp,
4,CVE-2004-0159,samhain_labs,hsftp,1.11,Format string vulnerability in hsftp 1.11 allows remote authenticated users to cause a denial of service and possibly execute arbitrary code via file names containing format string characters that...,hsftp,
...,...,...,...,...,...,...,...
686070,CVE-2002-0600,kth,kth_kerberos,4_1.1.1,Heap overflow in the KTH Kerberos 4 FTP client 4-1.1.1 allows remote malicious servers to execute arbitrary code on the client via a long response to a passive (PASV) mode request.,,allows
686071,CVE-2002-0910,debian,netstd,3.07,"Buffer overflows in netstd 3.07-17 package allows remote DNS servers to execute arbitrary code via a long FQDN reply, as observed in the utilities (1) linux-ftpd, (2) pcnfsd, (3) tftp, (4) tracero...",netstd,
686072,CVE-2002-1964,wesmo,phpeventcalendar,1.1,Unknown vulnerability in WesMo phpEventCalendar 1.1 allows remote attackers to execute arbitrary commands via unknown attack vectors.,phpEventCalendar,
686073,CVE-2002-2110,rca,digital_cable_modem,dcm225,The RCA Digital Cable Modems DCM225 and DCM225E allow remote attackers to cause a denial of service (modem device reset) by connecting to port 80 on the 10.0.0.0/8 device.,the,


Additional versions

In [169]:
l = '''CVE-2023-40050, 
CVE-2020-27589, 
CVE-2022-39255, 
CVE-2023-50709, 
CVE-2022-0944, 
CVE-2023-1404, 
CVE-2023-5054, 
CVE-2020-24025, 
CVE-2016-5007, 
CVE-2023-40050, 
CVE-2022-47595'''

In [175]:
# strip() removes the surrounding spaces and newlines regardless of their order
ll = [x.strip() for x in l.split(',')]

In [191]:
df_additional = df[df.cve_id.isin(ll)].groupby('cve_id').sample(1, random_state=11)
df_additional.shape

(10, 7)

In [192]:
all_cves = []
all_tokens = []
all_bio = []
for i, row in df_additional.iterrows():
    cves = []
    bio_ann = []
    spl_tokens = [x for x in re.split(' |\\n', row['descr']) if x]
    tokens = []
    # print(spl_tokens)
    for tok in spl_tokens:
        if tok.endswith('.'):
            tokens.append(tok.rstrip('.'))
            tokens.append('.')
        elif tok.endswith(','):
            tokens.append(tok.rstrip(','))
            tokens.append(',')
        else:
            tokens.append(tok)
    # print(tokens)
    for tok_i in range(len(tokens)):
        bio_ann.append('O')
        if tok_i == 0:
            cves.append(row['cve_id'])
        else:
            cves.append('0')
    assert len(bio_ann) == len(tokens)
    all_bio.extend(bio_ann)
    all_tokens.extend(tokens)
    all_cves.extend(cves)

In [193]:
len(df_additional)

10

In [194]:
df_additional_annotated = pd.DataFrame(data={'cve_id': all_cves,
                                         'words': all_tokens,
                                         'custom_bio': all_bio})
df_additional_annotated.to_csv(f'./cve_dataset_bio_{len(df_additional_annotated)}_additionals.tsv', index=False, sep='\t')

In [None]:
df_additional_annotated

Unnamed: 0,cve_id,words,custom_bio
0,CVE-2016-5007,Both,O
1,0,Spring,O
2,0,Security,O
3,0,3.2.x,O
4,0,",",O
...,...,...,...
682,0,recommendation,O
683,0,is,O
684,0,to,O
685,0,upgrade,O


## Building a single final dataset

In [14]:
import pandas as pd

# File paths
file1 = 'cve_dataset_bio_15192_texts_custom_bio.tsv'
file2 = 'cve_dataset_bio_4477_before_versions.tsv'
file3 = 'cve_dataset_bio_1682_after_versions.tsv'
file4 = 'cve_dataset_bio_687_additionals.tsv'
output_file = 'cve_dataset_bio.tsv'

# Read each TSV file
# df1 is required by the merge into merged_df further down
df1 = pd.read_csv(file1, sep='\t')
df1 = df1[['cve_id', 'words', 'custom_bio']]

df2 = pd.read_csv(file2, sep='\t')
df3 = pd.read_csv(file3, sep='\t')
df4 = pd.read_csv(file4, sep='\t')



In [15]:
add_cve_df = pd.concat([df2, df3, df4], ignore_index=True)

In [17]:
add_cve_df.cve_id.nunique()

129

In [21]:
df4[(df4.custom_bio == 'B-version') | (df4.custom_bio == 'I-version')].words.unique()

array(['3.2.x', '4.0.x', '4.1.0', '4.1.x', '4.2.x', '2.0.0', 'to',
       '4.14.1', '0.0.25', '-', '0.0.52', 'prior', '6.10.1', 'Prior',
       'version', '0.23.19', '<=', '9.0.15', 'up', ',', 'and',
       'including', '1.6', '4.10.29', '6.9.3', '0.34.34', '`v0.34.34`'],
      dtype=object)

In [226]:
# Concatenate the DataFrames
merged_df = pd.concat([df1, df2, df3, df4], ignore_index=True)

In [227]:
merged_df

Unnamed: 0,cve_id,words,custom_bio
0,CVE-2010-0001,Integer,O
1,0,underflow,O
2,0,in,O
3,0,the,O
4,0,unlzw,O
...,...,...,...
692597,0,recommendation,O
692598,0,is,O
692599,0,to,O
692600,0,upgrade,O


In [230]:
merged_df.custom_bio.value_counts()

O            596287
B-version     29260
I-version     25546
B-product     18957
I-product     11933
B-vendor      10557
I-vendor         62
Name: custom_bio, dtype: int64

In [229]:
merged_df['custom_bio'] = merged_df['custom_bio'].replace({'B-versions': 'B-version'})

In [231]:
merged_df.to_csv(output_file, sep='\t', index=False)

## Splitting the dataset into train/test

In [61]:
df = pd.read_csv('cve_dataset_bio.tsv', sep='\t')

In [62]:
from sklearn.model_selection import train_test_split

In [63]:
# Group by CVE entries
groups = []
current_group = []
for _, row in df.iterrows():
    if row["cve_id"] != "0":
        if current_group:
            groups.append(current_group)
        current_group = [row]
    else:
        current_group.append(row)
if current_group:
    groups.append(current_group)

In [64]:
# Split into train/test at the CVE-group level (90/10)
train_groups, test_groups = train_test_split(groups, test_size=0.1, random_state=42)
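
The split above works on whole CVE groups, so tokens of one description never land in both parts. The sketch below illustrates the same idea with the standard library only. `group_rows` and `split_groups` are hypothetical helpers, rows are assumed to be `(cve_id, word, tag)` tuples, and `random.Random` stands in for `train_test_split`, so the resulting partition will not match the notebook's.

```python
import random

def group_rows(rows):
    """Group rows so that a cve_id other than '0' starts a new group."""
    groups, current = [], []
    for row in rows:
        if row[0] != "0" and current:
            groups.append(current)
            current = []
        current.append(row)
    if current:
        groups.append(current)
    return groups

def split_groups(groups, test_size=0.1, seed=42):
    """Shuffle whole groups and cut off a test share; returns (train, test)."""
    shuffled = groups[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

rows = [
    ("CVE-A", "Integer", "O"), ("0", "underflow", "O"),
    ("CVE-B", "The", "O"),
    ("CVE-C", "Webgrind", "B-product"), ("0", "1.0", "B-version"),
]
train, test = split_groups(group_rows(rows), test_size=0.34)
train_ids = {g[0][0] for g in train}
test_ids = {g[0][0] for g in test}
print(train_ids & test_ids)  # empty set: no CVE is shared between the splits
```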

In [65]:
# Reconstruct splits
def flatten(groups):
    return pd.DataFrame([row for group in groups for row in group])

In [66]:
train_df = flatten(train_groups)
test_df = flatten(test_groups)

Check that some records from each enrichment dataset ended up in the test split.

In [67]:
print(len(df2[df2.cve_id != '0'].cve_id.unique()))
test_df[test_df.cve_id.isin(df2[df2.cve_id != '0'].cve_id.unique())]

84


Unnamed: 0,cve_id,words,custom_bio
685990,CVE-2017-2594,hawtio,B-product
689842,CVE-2023-4090,Cross-site,O
689731,CVE-2023-39196,Improper,O
689882,CVE-2023-45814,Bunkum,B-product
687367,CVE-2020-15119,In,O
687402,CVE-2012-2666,golang,B-vendor
685418,CVE-2014-125098,A,O


In [68]:
print(len(df3[df3.cve_id != '0'].cve_id.unique()))
test_df[test_df.cve_id.isin(df3[df3.cve_id != '0'].cve_id.unique())]

35


Unnamed: 0,cve_id,words,custom_bio
691601,CVE-2004-1822,Multiple,O
691397,CVE-2013-3009,The,O


In [69]:
print(len(df4[df4.cve_id != '0'].cve_id.unique()))
test_df[test_df.cve_id.isin(df4[df4.cve_id != '0'].cve_id.unique())]

10


Unnamed: 0,cve_id,words,custom_bio
691912,CVE-2016-5007,Both,O
692422,CVE-2023-5054,The,O


In [73]:
set(train_df.cve_id) & set(test_df.cve_id)

{'0'}

In [74]:
train_df.custom_bio.value_counts(normalize=True)

O            0.860556
B-version    0.042443
I-version    0.036957
B-product    0.027398
I-product    0.017254
B-vendor     0.015297
I-vendor     0.000095
Name: custom_bio, dtype: float64

In [75]:
test_df.custom_bio.value_counts(normalize=True)

O            0.864320
B-version    0.040503
I-version    0.036240
B-product    0.027130
I-product    0.017008
B-vendor     0.014756
I-vendor     0.000043
Name: custom_bio, dtype: float64

In [76]:
train_df.tail(10)

Unnamed: 0,cve_id,words,custom_bio
333189,0,impact,O
333190,0,via,O
333191,0,vectors,O
333192,0,that,O
333193,0,leverage,O
333194,0,"""",O
333195,0,type,O
333196,0,confusion,O
333197,0,.,O
333198,0,"""",O


In [92]:
train_df = train_df.reset_index(drop=True)
train_df['words'] = train_df['words'].fillna(' ')
train_df

Unnamed: 0,cve_id,words,custom_bio
0,CVE-2011-0776,The,O
1,0,sandbox,O
2,0,implementation,O
3,0,in,O
4,0,Google,B-vendor
...,...,...,...
622454,0,"""",O
622455,0,type,O
622456,0,confusion,O
622457,0,.,O


In [94]:
train_df.to_csv('train_df_cve_dataset_bio.tsv', sep='\t')

In [78]:
test_df.head(10)

Unnamed: 0,cve_id,words,custom_bio
473408,CVE-2012-1790,Absolute,O
473409,0,path,O
473410,0,traversal,O
473411,0,vulnerability,O
473412,0,in,O
473413,0,Webgrind,B-product
473414,0,1.0,B-version
473415,0,and,O
473416,0,1.0.2,B-version
473417,0,allows,O


In [95]:
test_df = test_df.reset_index(drop=True)
test_df['words'] = test_df['words'].fillna(' ')
test_df

Unnamed: 0,cve_id,words,custom_bio
0,CVE-2012-1790,Absolute,O
1,0,path,O
2,0,traversal,O
3,0,vulnerability,O
4,0,in,O
...,...,...,...
70138,0,an,O
70139,0,unspecified,O
70140,0,internal,O
70141,0,error,O


In [96]:
test_df.to_csv('test_df_cve_dataset_bio.tsv', sep='\t')

## Evaluate

In [22]:
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                         pipeline)

In [23]:
path_to_model = "/home/mikhail/Documents/pandan_study/vkr/vulns_scanner/mikhail_code/models/nuner2_v1_150325/best_model_tmp"
final_tokenizer = AutoTokenizer.from_pretrained(path_to_model, use_fast=True, add_prefix_space=True, local_files_only=True)
final_model = AutoModelForTokenClassification.from_pretrained(path_to_model, local_files_only=True)

In [82]:
test_df.index

Int64Index([473408, 473409, 473410, 473411, 473412, 473413, 473414, 473415,
            473416, 473417,
            ...
            337475, 337476, 337477, 337478, 337479, 337480, 337481, 337482,
            337483, 337484],
           dtype='int64', length=70143)

In [24]:
s = 'Improper Restriction of XML External Entity Reference in GitHub repository hazelcast/hazelcast in 5.1-BETA-1'

token_classifier = pipeline(
    "token-classification", model=final_model, aggregation_strategy="first", tokenizer=final_tokenizer
)
res = token_classifier(s)
for i, r in enumerate(res):
    # print('Entity: '+ r['entity_group'] + '   Word: ' + r['word'])
    print('Entity: '+ r['entity_group'] + '   Word: ' + r['word'] + '   Prob: ' + str(r['score']))

Device set to use cpu


Entity: version   Word:  5.1-BETA-1   Prob: 0.99713945




In [40]:
tokens = final_tokenizer(s, return_tensors='pt', truncation=True, padding=True)
tokens[:5]

[Encoding(num_tokens=30, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]

In [56]:
tokens['input_ids'][0]

tensor([    0, 27455,  1741, 40950,  1499,     9, 46917, 25468, 46718, 34177,
           11, 39097, 30076, 32468,   523,  5182,    73,   298, 43874,  5182,
           11,   195,     4,   134,    12,   387, 19739,    12,   134,     2])

In [None]:
final_tokenizer.decode(tokens['input_ids'][0])

'<s> Improper Restriction of XML External Entity Reference in GitHub repository hazelcast/hazelcast in 5.1-BETA-1</s>'

In [43]:
output = final_model(**tokens)

In [34]:
output.logits.shape

torch.Size([1, 30, 7])

In [36]:
import numpy as np

In [46]:
lbls_in_dataset = [
    'O',
    'B-product',
    'I-product',
    'B-vendor',
    'I-vendor',
    'B-version',
    'I-version',
]
label2id = {v:i for i, v in enumerate(lbls_in_dataset)}
id2label = {i:v for i, v in enumerate(lbls_in_dataset)}

In [48]:
res = np.argmax(output.logits.detach().numpy(), axis=2)[0]
res

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,
       5, 5, 5, 5, 5, 5, 5, 0])

In [49]:
[id2label[x] for x in res]


['O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-version',
 'B-version',
 'B-version',
 'B-version',
 'B-version',
 'B-version',
 'B-version',
 'B-version',
 'O']
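
The list above has one tag per subtoken, so every piece of `5.1-BETA-1` is tagged `B-version`. A common way to get back to one tag per word is to keep the tag of each word's first subtoken. The sketch below shows that rule in isolation. `first_subtoken_tags` is a hypothetical helper, and the `word_ids` list is made up for illustration rather than taken from the tokenizer.

```python
def first_subtoken_tags(word_ids, token_tags):
    """Keep the tag of the first subtoken of every word.

    word_ids has one entry per subtoken: None for special tokens,
    otherwise the index of the word the subtoken belongs to.
    """
    aligned = []
    previous = None
    for word_id, tag in zip(word_ids, token_tags):
        if word_id is None:
            continue  # special tokens such as <s> and </s>
        if word_id != previous:
            aligned.append(tag)
            previous = word_id
    return aligned

# Illustrative input: <s>, one single-piece word, one three-piece word, </s>.
word_ids = [None, 0, 1, 1, 1, None]
token_tags = ["O", "O", "B-version", "B-version", "B-version", "O"]
print(first_subtoken_tags(word_ids, token_tags))  # ['O', 'B-version']
```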

In [87]:
import torch

In [98]:
groups = []
current_group = []
for _, row in test_df.iterrows():
    if row["cve_id"] != "0":
        if current_group:
            groups.append(current_group)
        current_group = [row]
    else:
        current_group.append(row)
if current_group:
    groups.append(current_group)

In [None]:
from seqeval.metrics import classification_report
y_true = []
y_pred = []

for group in groups:
    # Extract words and ground truth labels
    words = [row["words"] for row in group]
    true_tags = [row["custom_bio"] for row in group]
    
    # Tokenize with word alignment
    tokenized = final_tokenizer(
        words,
        is_split_into_words=True,  # Critical for per-word alignment
        return_tensors="pt",
        truncation=True,
        # return_offsets_mapping=True
    ).to(final_model.device)
    
    # Get word-to-token mapping
    word_ids = tokenized.word_ids(batch_index=0)
    
    # Inference
    with torch.no_grad():
        outputs = final_model(**tokenized)
    
    # Get predictions
    # Index the batch dimension explicitly; squeeze() would also drop a length-1 sequence axis
    pred_indices = torch.argmax(outputs.logits, dim=2)[0].tolist()
    pred_tags = [final_model.config.id2label[idx] for idx in pred_indices]
    
    # Align predictions to original words
    aligned_preds = []
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is None:
            continue  # Skip special tokens
        if word_id != current_word:
            aligned_preds.append(pred_tags[idx])
            current_word = word_id
    
    # Ensure alignment matches original word count
    if len(aligned_preds) != len(words):
        print(f"Alignment error in CVE {group[0]['cve_id']}")
        continue
    
    y_true.append(true_tags)
    y_pred.append(aligned_preds)

# Compute metrics
print(classification_report(y_true, y_pred, mode='strict'))

              precision    recall  f1-score   support

     product       0.97      0.97      0.97      1903
      vendor       0.98      0.99      0.98      1035
     version       0.99      0.99      0.99      2841

   micro avg       0.98      0.98      0.98      5779
   macro avg       0.98      0.98      0.98      5779
weighted avg       0.98      0.98      0.98      5779



In [100]:
test_df.custom_bio.value_counts()

O            60626
B-version     2841
I-version     2542
B-product     1903
I-product     1193
B-vendor      1035
I-vendor         3
Name: custom_bio, dtype: int64
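
The support column in the report above (1903, 1035, 2841) equals the `B-product`, `B-vendor`, and `B-version` counts in this cell. This is consistent with entity-level scoring, where each entity begins with a `B-` tag and must match in both type and boundaries to count as correct. The sketch below is a simplified illustration of that idea, not seqeval's implementation; `bio_spans` and `strict_matches` are hypothetical names.

```python
def bio_spans(tags):
    """Return (type, start, end) spans, end exclusive; an entity starts at B-."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if etype is not None and not continues:
            spans.append((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def strict_matches(true_tags, pred_tags):
    """Count predicted spans that match a true span in type and both boundaries."""
    return len(set(bio_spans(true_tags)) & set(bio_spans(pred_tags)))

true_tags = ["O", "B-product", "I-product", "B-version", "O", "B-version"]
pred_tags = ["O", "B-product", "O", "B-version", "O", "B-version"]
print(bio_spans(true_tags))
print(strict_matches(true_tags, pred_tags))  # 2: the truncated product span does not count
```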

In [102]:
path_to_model = "numind/NuNER-v2.0"
base_tokenizer = AutoTokenizer.from_pretrained(path_to_model, use_fast=True, add_prefix_space=True)
base_model = AutoModelForTokenClassification.from_pretrained(path_to_model)

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at numind/NuNER-v2.0 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [103]:
from seqeval.metrics import classification_report
y_true = []
y_pred = []

for group in groups:
    # Extract words and ground truth labels
    words = [row["words"] for row in group]
    true_tags = [row["custom_bio"] for row in group]
    
    # Tokenize with word alignment
    tokenized = base_tokenizer(
        words,
        is_split_into_words=True,  # Critical for per-word alignment
        return_tensors="pt",
        truncation=True,
        # return_offsets_mapping=True
    ).to(base_model.device)
    
    # Get word-to-token mapping
    word_ids = tokenized.word_ids(batch_index=0)
    
    # Inference
    with torch.no_grad():
        outputs = base_model(**tokenized)
    
    # Get predictions
    # Index the batch dimension explicitly; squeeze() would also drop a length-1 sequence axis
    pred_indices = torch.argmax(outputs.logits, dim=2)[0].tolist()
    pred_tags = [base_model.config.id2label[idx] for idx in pred_indices]
    
    # Align predictions to original words
    aligned_preds = []
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is None:
            continue  # Skip special tokens
        if word_id != current_word:
            aligned_preds.append(pred_tags[idx])
            current_word = word_id
    
    # Ensure alignment matches original word count
    if len(aligned_preds) != len(words):
        print(f"Alignment error in CVE {group[0]['cve_id']}")
        continue
    
    y_true.append(true_tags)
    y_pred.append(aligned_preds)

# Compute metrics
print(classification_report(y_true, y_pred, mode='strict'))

ValueError: Invalid token is found: LABEL_1. Allowed prefixes are: B|O|I.
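
The error above comes from the tag names. The base checkpoint has no fine-tuned classification head, so `base_model.config.id2label` holds generic names such as `LABEL_1`, which seqeval rejects because they lack a `B-`/`I-` prefix. The sketch below builds the label maps from the dataset's tag set and checks a tag list before scoring. `is_bio_tag` and `invalid_tags` are hypothetical helpers.

```python
LABELS = [
    "O",
    "B-product", "I-product",
    "B-vendor", "I-vendor",
    "B-version", "I-version",
]
id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in id2label.items()}

def is_bio_tag(tag):
    """True for 'O' or a tag of the form B-<type> / I-<type>."""
    return tag == "O" or (tag[:2] in ("B-", "I-") and len(tag) > 2)

def invalid_tags(tags):
    """Return the sorted set of tags that are not valid BIO tags."""
    return sorted({t for t in tags if not is_bio_tag(t)})

print(invalid_tags(["O", "LABEL_1", "B-version", "LABEL_0"]))  # ['LABEL_0', 'LABEL_1']
```

Passing `num_labels=len(LABELS)`, `id2label=id2label`, and `label2id=label2id` to `AutoModelForTokenClassification.from_pretrained` is the usual way to give the head these names. As the warning above says, the classifier weights would still be newly initialized, so any scores would describe an untrained head rather than a meaningful baseline.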

## Building a validation dataset that is not used in training

Its CVEs will not appear in train/test.

In [1]:
import pandas as pd
import psycopg2 as p2
from psycopg2 import sql
from collections import Counter

In [5]:
dbname = "vulns_scanner"
user = 'postgres'
password = 'postgres'
host = 'localhost'
port = '5432'

conn = p2.connect(dbname=dbname, user=user, password=password, host=host, port=port)
cur = conn.cursor()
cur.execute('''
select cve_id, cpe_id_pk, vendor, product, version, descr, initial_cpe   
from cves c inner join descriptions d on c.cve_id_pk=d.cve_id_fk
inner join cve_cpe_config ccc on c.cve_id_pk=ccc.cve_id_fk inner join cpes cp on ccc.cpe_id_fk=cp.cpe_id_pk 
--where descr like '%_._%'
order by random()
limit 100000
''')
colnames = [desc[0] for desc in cur.description]
tuples = cur.fetchall()
cur.close()
df = pd.DataFrame(tuples, columns=colnames)
df.head()

Unnamed: 0,cve_id,cpe_id_pk,vendor,product,version,descr,initial_cpe
0,CVE-2024-0056,256947,microsoft,.net,8.0.0,Microsoft.Data.SqlClient and System.Data.SqlCl...,cpe:2.3:a:microsoft:.net:8.0.0:-:*:*:*:*:*:*
1,CVE-2022-0424,68426,supsystic,popup,1.9.17,The Popup by Supsystic WordPress plugin before...,cpe:2.3:a:supsystic:popup:1.9.17:*:*:*:*:wordp...
2,CVE-2023-4386,375813,wpdeveloper,essential_blocks,4.0.8,The Essential Blocks plugin for WordPress is v...,cpe:2.3:a:wpdeveloper:essential_blocks:4.0.8:*...
3,CVE-2020-36699,488896,quick_page\/post_redirect_project,quick_page\/post_redirect,3.1,The Quick Page/Post Redirect Plugin for WordPr...,cpe:2.3:a:quick_page\/post_redirect_project:qu...
4,CVE-2022-39354,185125,evm_project,evm,0.7.0,"SputnikVM, also called evm, is a Rust implemen...",cpe:2.3:a:evm_project:evm:0.7.0:*:*:*:*:rust:*:*


In [49]:
df[df.descr.str.contains('after versi')]

Unnamed: 0,cve_id,cpe_id_pk,vendor,product,version,descr,initial_cpe
22051,CVE-2021-32740,625949,addressable_project,addressable,2.3.3,Addressable is an alternative implementation t...,cpe:2.3:a:addressable_project:addressable:2.3....
31942,CVE-2021-32740,625954,addressable_project,addressable,2.3.8,Addressable is an alternative implementation t...,cpe:2.3:a:addressable_project:addressable:2.3....
54769,CVE-2021-32740,625951,addressable_project,addressable,2.3.5,Addressable is an alternative implementation t...,cpe:2.3:a:addressable_project:addressable:2.3....
59120,CVE-2021-32740,625950,addressable_project,addressable,2.3.4,Addressable is an alternative implementation t...,cpe:2.3:a:addressable_project:addressable:2.3....
80825,CVE-2021-32740,625955,addressable_project,addressable,2.4.0,Addressable is an alternative implementation t...,cpe:2.3:a:addressable_project:addressable:2.4....


In [24]:
df_sample_raw = df.sample(250, random_state=42)

In [25]:
train_df = pd.read_csv('train_df_cve_dataset_bio.tsv', sep='\t', usecols=['cve_id'])
test_df = pd.read_csv('test_df_cve_dataset_bio.tsv', sep='\t', usecols=['cve_id'])
used_cve = train_df.cve_id.unique().tolist() + test_df.cve_id.unique().tolist()

In [27]:
df_sample = df_sample_raw[~df_sample_raw.cve_id.isin(used_cve)]
df_sample.shape

(229, 7)

In [28]:
df_sample = df_sample.drop_duplicates(subset=['cve_id']).iloc[:200]
df_sample.shape

(200, 7)

In [38]:
# Count sampled CVEs per year (characters 4-8 of "CVE-YYYY-NNNN")
year2count = Counter(df_sample.cve_id.astype(str).str[4:8])
total_per_year = pd.DataFrame({'year': [int(y) for y in year2count],
                               'total': list(year2count.values())})
total_per_year.sort_values('year')

Unnamed: 0,year,total
18,2001,1
19,2002,1
22,2003,1
17,2004,2
11,2005,3
8,2006,4
14,2007,7
20,2008,4
16,2009,2
21,2011,1


In [40]:
df_sample['vendor_in_text'] = df_sample.apply(lambda x: 1 if x['vendor'].lower() in x['descr'].lower() else 0, axis=1)
df_sample['product_in_text'] = df_sample.apply(lambda x: 1 if x['product'].lower() in x['descr'].lower() else 0, axis=1)
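
The flags above use a plain lowercase substring test. CPE names in this data separate words with underscores (for example `open_source_security_information_management`), so a name can be missed when the description writes it with spaces. The sketch below shows one possible normalization. It is an assumption, not the notebook's method: `normalize` and `mentioned` are hypothetical helpers and the sample strings are invented.

```python
import re

def normalize(name):
    """Lowercase and turn CPE separators (underscore, escaped slash) into plain text."""
    name = name.lower().replace("\\/", "/")
    return re.sub(r"[_\s]+", " ", name).strip()

def mentioned(cpe_name, descr):
    """True if the normalized CPE name occurs in the normalized description."""
    return normalize(cpe_name) in normalize(descr)

descr = "A flaw in Acme Data Sync 1.0 allows remote attackers to read files."
print("data_sync" in descr.lower())   # False: the plain substring test misses it
print(mentioned("data_sync", descr))  # True after normalization
```

A looser match like this raises the hit rate but can also produce false positives for short or generic product names, so the counts below would need to be recomputed and spot-checked before relying on it.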

In [41]:
df_sample.to_csv('df_100_not_in_stucco_v3_180525.csv', index=False)

In [42]:
df_sample.head(10)

Unnamed: 0,cve_id,cpe_id_pk,vendor,product,version,descr,initial_cpe,vendor_in_text,product_in_text
75721,CVE-2021-34085,628902,glensawyer,mp3gain,1.3.4,Read access violation in the III_dequantize_sa...,cpe:2.3:a:glensawyer:mp3gain:1.3.4:beta:*:*:*:...,0,1
80184,CVE-2014-7221,722762,teamspeak,teamspeak3,3.0.7.1,TeamSpeak Client 3.0.14 and earlier allows rem...,cpe:2.3:a:teamspeak:teamspeak3:3.0.7.1:*:*:*:c...,1,0
92991,CVE-2018-7279,541558,alienvault,open_source_security_information_management,5.3,A remote code execution issue was discovered i...,cpe:2.3:a:alienvault:open_source_security_info...,1,0
76434,CVE-2020-24743,472694,zohocorp,manageengine_applications_manager,14.5,An issue was found in /showReports.do Zoho Man...,cpe:2.3:a:zohocorp:manageengine_applications_m...,0,0
84004,CVE-2020-24786,472744,zohocorp,manageengine_o365_manager_plus,4.3,An issue was discovered in Zoho ManageEngine E...,cpe:2.3:a:zohocorp:manageengine_o365_manager_p...,0,0
80917,CVE-2013-3607,553572,supermicro,x9dax-if,-,Multiple stack-based buffer overflows in the w...,cpe:2.3:h:supermicro:x9dax-if:-:*:*:*:*:*:*:*,1,0
60767,CVE-2019-13183,689916,flarum,flarum,0.1.0,Flarum before 0.1.0-beta.9 allows CSRF against...,cpe:2.3:a:flarum:flarum:0.1.0:beta8.1:*:*:*:*:*:*,1,1
50074,CVE-2018-15121,522169,auth0,aspnet,-,An issue was discovered in Auth0 auth0-aspnet ...,cpe:2.3:a:auth0:aspnet:-:*:*:*:*:*:*:*,1,1
27701,CVE-2013-2175,549900,haproxy,haproxy,1.4.17,HAProxy 1.4 before 1.4.24 and 1.5 before 1.5-d...,cpe:2.3:a:haproxy:haproxy:1.4.17:*:*:*:*:*:*:*,1,1
42141,CVE-2016-10714,422757,zsh,zsh,4.2.2,"In zsh before 5.3, an off-by-one error resulte...",cpe:2.3:a:zsh:zsh:4.2.2:*:*:*:*:*:*:*,1,1


In [43]:
df_sample['vendor_in_text'].sum()

107

In [44]:
df_sample['product_in_text'].sum()

114

In [3]:
68/200

0.34

In [47]:
df_sample[(df_sample['product_in_text'] == 1) & (df_sample['vendor_in_text'] == 1)].shape

(68, 9)

In [175]:
import json
with open('/home/mikhail/Documents/pandan_study/vkr/vulns_scanner/mikhail_code/data/full_corpus.json') as j:
    corpus = json.loads(j.read())

In [177]:
len(corpus)

3

In [180]:
len(corpus['NVD'])

15192

In [182]:
df_tr = pd.read_csv('train_df_cve_dataset_bio.tsv', sep='\t')
df_te = pd.read_csv('test_df_cve_dataset_bio.tsv', sep='\t')

In [189]:
df_tr['cve_id'].nunique() + df_te['cve_id'].nunique() - 1 - 15192

126

In [190]:
df_tr['cve_id'].nunique() + df_te['cve_id'].nunique() - 1 

15318

In [None]:
train: 12411

validation: 1379

test: 1532


In [None]:
12411 + 1379 + 1532

15322

