# NVD Vulnerabilities Dataset Extraction
## Overview
This script fetches cybersecurity vulnerability data from the National Vulnerability Database (NVD) using its REST API. It retrieves CVE details, CVSS scores, attack vectors, and affected operating systems.

## Key Components
1. Fetching Data from NVD API
    * Uses the NVD API endpoint.
    * Supports filtering by publication date.
    * Handles pagination to fetch large datasets.
    * Implements rate limiting to comply with API restrictions.
2. Processing Vulnerabilities
    * Extracts key details:
        * CVE ID (Common Vulnerabilities and Exposures identifier)
        * Description (English version of the vulnerability summary)
        * CVSS Score (Severity score from CVSS v3.1, v3.0, or v2)
        * Attack Vector (the CVSS vector string, which encodes the exploitability details)
        * Affected OS (Extracted from CPE configuration data)
3. Data Filtering & Storage
    * Keeps only vulnerabilities that have both a CVSS Score and an Attack Vector.
    * Converts extracted data into a structured Pandas DataFrame.
    * Saves results as a CSV file (nvd_vulnerabilities_with_os.csv).
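
The processing step above can be sketched as three small helpers. This is a minimal illustration, not the notebook's own code: the function names `pick_cvss`, `attack_vector_code`, and `os_from_cpe` and the sample record are hypothetical, and the CPE parsing assumes a plain colon-separated CPE 2.3 name with no escaped colons.

```python
def pick_cvss(metrics):
    """Return (base_score, vector_string) from the newest CVSS version present."""
    for key in ("cvssMetricV31", "cvssMetricV30", "cvssMetricV2"):
        entries = metrics.get(key) or []
        if entries:
            cvss_data = entries[0].get("cvssData", {})
            return cvss_data.get("baseScore"), cvss_data.get("vectorString")
    return None, None


def attack_vector_code(vector_string):
    """Pull the AV component (for example 'N' for Network) out of a CVSS vector string."""
    for part in (vector_string or "").split("/"):
        if part.startswith("AV:"):
            return part[3:]
    return None


def os_from_cpe(cpe_uri):
    """Return 'Vendor Product [version]' for an OS CPE 2.3 name, else None."""
    parts = cpe_uri.split(":")
    # Layout: cpe:2.3:<part>:<vendor>:<product>:<version>:... where part 'o' means OS
    if len(parts) < 6 or parts[:2] != ["cpe", "2.3"] or parts[2] != "o":
        return None
    vendor = parts[3].replace("_", " ").title()
    product = parts[4].replace("_", " ").title()
    version = parts[5]
    name = f"{vendor} {product}"
    return name if version in ("*", "-") else f"{name} {version}"


# Hypothetical sample data, shaped like the "metrics" object of one CVE record
sample_metrics = {
    "cvssMetricV31": [
        {"cvssData": {"baseScore": 9.8,
                      "vectorString": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"}}
    ]
}
score, vector = pick_cvss(sample_metrics)
print(score, attack_vector_code(vector))  # 9.8 N
print(os_from_cpe("cpe:2.3:o:linux:linux_kernel:6.1:*:*:*:*:*:*:*"))  # Linux Linux Kernel 6.1
```

The script itself stores the whole vector string in the Attack_Vector column; `attack_vector_code` shows one way to reduce it to just the AV component if that is what is needed downstream.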

## Usage
1. Set the start_date and end_date (in YYYY-MM-DD format) to filter vulnerabilities by publication date.
2. Run the script to retrieve and process data.
3. Check the output CSV for structured vulnerability records.
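
For date ranges longer than the two-week example, the range may need to be split before calling the script's fetch function. The sketch below assumes the NVD API accepts at most 120 consecutive days per pubStartDate/pubEndDate pair; that limit and the helper name `date_windows` are assumptions to verify against the NVD documentation, not part of the original notebook.

```python
from datetime import date, timedelta

MAX_WINDOW_DAYS = 120  # assumed NVD limit on one pubStartDate/pubEndDate range


def date_windows(start, end, max_days=MAX_WINDOW_DAYS):
    """Yield (window_start, window_end) ISO date strings covering start..end inclusive."""
    current = date.fromisoformat(start)
    last = date.fromisoformat(end)
    while current <= last:
        # Each window spans at most max_days calendar days, counting both endpoints
        window_end = min(current + timedelta(days=max_days - 1), last)
        yield current.isoformat(), window_end.isoformat()
        current = window_end + timedelta(days=1)


for window_start, window_end in date_windows("2024-01-01", "2024-06-30"):
    print(window_start, window_end)
# 2024-01-01 2024-04-29
# 2024-04-30 2024-06-30
```

Each pair can then be passed as start_date and end_date, with the results of the individual calls concatenated.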

## Dependencies
* Python (3.x)
* Requests (For API communication)
* Pandas (For data processing)

Code provided by ManavKhambhayata & Ananya Verma at:\
https://www.kaggle.com/code/manavkhambhayata/cve-2024-database-extraction-framework/notebook

In [1]:
import requests
import pandas as pd
import time

# NVD API URL
API_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"

# Replace with your valid API key (do not commit real keys to shared notebooks)
API_KEY = "YOUR_NVD_API_KEY"

def fetch_nvd_data(start_date, end_date, results_per_page=200):
    """Fetch all CVEs published between start_date and end_date (YYYY-MM-DD), page by page."""
    headers = {"apiKey": API_KEY}
    params = {
        "resultsPerPage": results_per_page,
        "startIndex": 0,
        "pubStartDate": f"{start_date}T00:00:00.000Z",
        "pubEndDate": f"{end_date}T23:59:59.999Z",
    }

    all_vulnerabilities = []

    while True:
        # A timeout keeps the loop from hanging indefinitely on a stalled connection
        response = requests.get(API_URL, headers=headers, params=params, timeout=30)

        if response.status_code == 200:
            data = response.json()
            vulnerabilities = data.get("vulnerabilities", [])
            all_vulnerabilities.extend(vulnerabilities)

            total_results = data.get("totalResults", 0)
            print(f"Fetched {len(vulnerabilities)} records. Total expected: {total_results}")

            params["startIndex"] += results_per_page
            if params["startIndex"] >= total_results:
                break

            time.sleep(6)  # Respect rate limits

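        elif response.status_code == 403:
            # NVD returns 403 for a rejected API key and, reportedly, when rate limits are exceeded
            print("Error 403: request rejected (check the API key and request rate).")
            break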
        else:
            print(f"Error: {response.status_code} - {response.text}")
            break

    return all_vulnerabilities

# Example usage: CVEs published in the first half of January 2024
start_date = "2024-01-01"
end_date = "2024-01-15"

vulnerabilities = fetch_nvd_data(start_date, end_date)

if vulnerabilities:
    filtered_data = []
    for vuln in vulnerabilities:
        cve_id = vuln["cve"]["id"]

        # Extract description
        descriptions = vuln["cve"].get("descriptions", [])
        description = next((desc["value"] for desc in descriptions if desc.get("lang") == "en"), "N/A")

        # Extract CVSS Score and Attack Vector
        metrics = vuln["cve"].get("metrics", {})
        cvss_score, attack_vector = "N/A", "N/A"

        # Check CVSS versions in priority order (v3.1 > v3.0 > v2)
        for metric_key in ("cvssMetricV31", "cvssMetricV30", "cvssMetricV2"):
            if metric_key in metrics:
                cvss_data = metrics[metric_key][0].get("cvssData", {})
                cvss_score = cvss_data.get("baseScore", "N/A")
                attack_vector = cvss_data.get("vectorString", "N/A")
                break

        # Extract OS Information from CPE data
        os_list = set()
        configurations = vuln["cve"].get("configurations", [])
        for config in configurations:
            for node in config.get("nodes", []):
                for cpe_match in node.get("cpeMatch", []):
                    cpe_uri = cpe_match.get("criteria", "")
                    if cpe_uri.startswith("cpe:2.3:"):
                        parts = cpe_uri.split(":")
                        if len(parts) >= 5 and parts[2] == 'o':  # 'o' indicates OS
                            vendor = parts[3].replace("_", " ").title()
                            product = parts[4].replace("_", " ").title()
                            version = parts[5] if len(parts) > 5 else ""
                            if version not in ["*", "-"]:
                                os_list.add(f"{vendor} {product} {version}".strip())
                            else:
                                os_list.add(f"{vendor} {product}".strip())

        # Sort so the joined string is deterministic (set iteration order is not)
        os_info = ", ".join(sorted(os_list)) if os_list else "N/A"

        # Filter records with valid CVSS data
        if cvss_score != "N/A" and attack_vector != "N/A":
            filtered_data.append({
                "CVE_ID": cve_id,
                "Description": description,
                "CVSS_Score": cvss_score,
                "Attack_Vector": attack_vector,
                "Affected_OS": os_info
            })

    df = pd.DataFrame(filtered_data)

    if not df.empty:
        df.to_csv("nvd_vulnerabilities_with_os.csv", index=False)
        print(f"Saved {len(df)} records with OS information to CSV.")
    else:
        print("No records with both CVSS Score and Attack Vector found.")
else:
    print("No data fetched.")

Fetched 200 records. Total expected: 1327
Fetched 200 records. Total expected: 1327
Fetched 200 records. Total expected: 1327
Fetched 200 records. Total expected: 1327
Fetched 200 records. Total expected: 1327
Fetched 200 records. Total expected: 1327
Fetched 127 records. Total expected: 1327
Saved 1313 records with OS information to CSV.
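
The log above is consistent with simple pagination arithmetic: 1327 results at 200 per page take seven requests, the last returning 127 records. The 14 records missing from the CSV (1327 fetched, 1313 saved) are the ones dropped by the CVSS filter. A small sketch of the page calculation, using a hypothetical helper `page_sizes` that is not part of the script:

```python
import math


def page_sizes(total_results, results_per_page=200):
    """Return how many records each paged request is expected to return."""
    pages = math.ceil(total_results / results_per_page)
    return [
        min(results_per_page, total_results - page * results_per_page)
        for page in range(pages)
    ]


sizes = page_sizes(1327)
print(len(sizes), sizes[-1])  # 7 127
print(1327 - 1313)            # 14 records removed by the CVSS filter
```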
