<a href="https://colab.research.google.com/github/msmekka/ncssm-summer25-cyber/blob/main/NIST_RMF_Activity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Implementing Data Governance Using the NIST Risk Management Framework**

### **Applied Data Science - Governance Example**

---

## **Objective**

Apply the NIST Risk Management Framework (RMF) to assess and mitigate risks in a dataset. Use Python to perform data analysis, identify potential governance issues, and develop risk mitigation strategies.

---

## **Table of Contents**

1. [Introduction](#introduction)
2. [Understanding NIST RMF](#understanding-nist-rmf)
3. [Dataset Exploration](#dataset-exploration)
4. [Data Governance Assessment](#data-governance-assessment)
5. [Applying NIST RMF](#applying-nist-rmf)
6. [Developing Governance Policies](#developing-governance-policies)
7. [References](#references)

---

## **Introduction** <a name="introduction"></a>

Data governance is critical in ensuring that data is managed properly, securely, and ethically within an organization. The NIST Risk Management Framework provides a structured approach to managing risks associated with information systems.

In this activity we will:

- Explore a synthetic dataset containing sensitive information.
- Identify potential risks and compliance requirements.
- Apply the NIST RMF to select and implement appropriate security controls.
- Develop governance policies based on our findings.

---

## **Understanding NIST RMF** <a name="understanding-nist-rmf"></a>

The NIST RMF consists of seven steps:

1. **Prepare**
2. **Categorize Information Systems**
3. **Select Security Controls**
4. **Implement Security Controls**
5. **Assess Security Controls**
6. **Authorize Information Systems**
7. **Monitor Security Controls**

These steps help organizations manage and mitigate risks systematically.

---

## **Dataset Exploration** <a name="dataset-exploration"></a>

We will use a synthetic dataset that simulates sensitive information.

### **Import Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### **Load the Dataset**

For this activity, we generate the synthetic dataset in code rather than loading it from a file.

In [None]:
# Generate synthetic data
np.random.seed(42)

num_records = 1000

data = pd.DataFrame({
    'First_Name': np.random.choice(['John', 'Jane', 'Alice', 'Bob'], num_records),
    'Last_Name': np.random.choice(['Doe', 'Smith', 'Johnson', 'Williams'], num_records),
    'SSN': np.random.randint(100000000, 999999999, num_records).astype(str),
    # 16-digit values overflow 32-bit integers, so request int64 explicitly
    'Credit_Card_Number': np.random.randint(1000000000000000, 9999999999999999, num_records, dtype=np.int64).astype(str),
    'Email': np.random.choice(['example1@mail.com', 'example2@mail.com'], num_records),
    'Phone_Number': np.random.randint(1000000000, 1999999999, num_records).astype(str),
    'Medical_Diagnosis': np.random.choice(['Healthy', 'Diabetes', 'Hypertension'], num_records),
    'Account_Balance': np.random.uniform(-5000, 20000, num_records).round(2)
})

data.head()

| | First_Name | Last_Name | SSN | Credit_Card_Number | Email | Phone_Number | Medical_Diagnosis | Account_Balance |
|---|---|---|---|---|---|---|---|---|
| 0 | Alice | Smith | 895139886 | 1500331559590448 | example1@mail.com | 1054659450 | Healthy | 19890.95 |
| 1 | Bob | Johnson | 472396043 | 6905730646249089 | example2@mail.com | 1959701064 | Healthy | -4737.66 |
| 2 | John | Doe | 279963197 | 4117922061134340 | example2@mail.com | 1809107354 | Hypertension | 17376.63 |
| 3 | Alice | Doe | 994086735 | 2980955884817946 | example2@mail.com | 1038887321 | Diabetes | -1271.48 |
| 4 | Alice | Doe | 628048343 | 1581227358599316 | example2@mail.com | 1384201975 | Diabetes | 14648.13 |


### **Explore the Dataset**

In [None]:
# Check for null values
data.isnull().sum()

| Column | Null count |
|---|---|
| First_Name | 0 |
| Last_Name | 0 |
| SSN | 0 |
| Credit_Card_Number | 0 |
| Email | 0 |
| Phone_Number | 0 |
| Medical_Diagnosis | 0 |
| Account_Balance | 0 |


In [None]:
# Data types
data.dtypes

| Column | dtype |
|---|---|
| First_Name | object |
| Last_Name | object |
| SSN | object |
| Credit_Card_Number | object |
| Email | object |
| Phone_Number | object |
| Medical_Diagnosis | object |
| Account_Balance | float64 |


In [None]:
# Statistical summary
data.describe(include='all')

| | First_Name | Last_Name | SSN | Credit_Card_Number | Email | Phone_Number | Medical_Diagnosis | Account_Balance |
|---|---|---|---|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000.0 |
| unique | 4 | 4 | 1000 | 1000 | 2 | 1000 | 3 | |
| top | Bob | Doe | 398173123 | 2377621545409529 | example1@mail.com | 1096706715 | Hypertension | |
| freq | 280 | 267 | 1 | 1 | 512 | 1 | 344 | |
| mean | | | | | | | | 7234.72285 |
| std | | | | | | | | 7178.291771 |
| min | | | | | | | | -4976.76 |
| 25% | | | | | | | | 1083.075 |
| 50% | | | | | | | | 7094.26 |
| 75% | | | | | | | | 13470.765 |


---

## **Data Governance Assessment** <a name="data-governance-assessment"></a>

### **Identify Sensitive Data**

We need to detect and classify sensitive information.

In [None]:
# List of columns potentially containing sensitive data
sensitive_columns = ['SSN', 'Credit_Card_Number', 'Medical_Diagnosis', 'Account_Balance']

# Display sensitive data sample
data[sensitive_columns].head()

| | SSN | Credit_Card_Number | Medical_Diagnosis | Account_Balance |
|---|---|---|---|---|
| 0 | 895139886 | 1500331559590448 | Healthy | 19890.95 |
| 1 | 472396043 | 6905730646249089 | Healthy | -4737.66 |
| 2 | 279963197 | 4117922061134340 | Hypertension | 17376.63 |
| 3 | 994086735 | 2980955884817946 | Diabetes | -1271.48 |
| 4 | 628048343 | 1581227358599316 | Diabetes | 14648.13 |
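The column list above is chosen by hand. As a complement, the sketch below shows one way to flag columns by the shape of their values instead of their names. The pattern names, the `classify_column` helper, and the 0.9 threshold are illustrative assumptions tuned to this notebook's synthetic formats (9-digit SSNs, 16-digit card numbers); real data would need stricter rules.

```python
import re

# Hypothetical patterns matching only the synthetic formats used in this notebook
PATTERNS = {
    "ssn_like": re.compile(r"\d{9}"),
    "card_like": re.compile(r"\d{16}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}

def classify_column(values, threshold=0.9):
    """Return the first pattern label that fully matches at least
    `threshold` of the values, or None if no pattern qualifies."""
    values = [str(v) for v in values]
    if not values:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= threshold:
            return label
    return None

# Small stand-in sample copied from the first rows shown above
sample = {
    "SSN": ["895139886", "472396043", "279963197"],
    "Credit_Card_Number": ["1500331559590448", "6905730646249089", "4117922061134340"],
    "Email": ["example1@mail.com", "example2@mail.com", "example2@mail.com"],
    "First_Name": ["Alice", "Bob", "John"],
}

detected = {col: classify_column(vals) for col, vals in sample.items()}
print(detected)
```

On the full DataFrame, the same helper could be applied as `{col: classify_column(data[col]) for col in data.columns}`. Columns such as `Medical_Diagnosis` have no distinctive format, so name-based or manual classification is still needed for them.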


### **Assess Compliance Requirements**

Map data types to relevant regulations:

- **PII (Personally Identifiable Information):** SSN, Name, Email, Phone Number
  - **Regulations:** GDPR, CCPA
- **PHI (Protected Health Information):** Medical Diagnosis
  - **Regulations:** HIPAA
- **Financial Information:** Credit Card Number, Account Balance
  - **Regulations:** PCI DSS, GLBA
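
The mapping above can also be kept as data so that later cells can look up which regulations apply to a given column. This is a minimal sketch: the `COMPLIANCE_MAP` structure and the `regulations_for` helper are illustrative names, and splitting "Name" into `First_Name` and `Last_Name` is an assumption based on the dataset's columns.

```python
# Data categories, their columns, and the regulations listed above
COMPLIANCE_MAP = {
    "PII": {
        "columns": ["First_Name", "Last_Name", "SSN", "Email", "Phone_Number"],
        "regulations": ["GDPR", "CCPA"],
    },
    "PHI": {
        "columns": ["Medical_Diagnosis"],
        "regulations": ["HIPAA"],
    },
    "Financial": {
        "columns": ["Credit_Card_Number", "Account_Balance"],
        "regulations": ["PCI DSS", "GLBA"],
    },
}

def regulations_for(column):
    """Return the sorted list of regulations mapped to a column name."""
    regs = set()
    for entry in COMPLIANCE_MAP.values():
        if column in entry["columns"]:
            regs.update(entry["regulations"])
    return sorted(regs)

for col in ["SSN", "Medical_Diagnosis", "Account_Balance"]:
    print(col, "->", regulations_for(col))
```

A column that appears in no category returns an empty list, which makes unmapped columns easy to spot during review.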

### **Risk Identification**

For each category of sensitive data, identify potential risks:

- **Unauthorized Access:** Data breaches exposing PII and financial information.
- **Data Leakage:** Improper handling leading to PHI exposure.
- **Non-compliance Penalties:** Failing to comply with regulations can result in legal and financial penalties.

---

## **Applying NIST RMF** <a name="applying-nist-rmf"></a>

### **1. Categorize the Data**

Determine the impact levels:

- **Confidentiality:** High
- **Integrity:** Moderate
- **Availability:** Low
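
To turn the three per-objective levels into a single level for the dataset, one common approach is the high-water mark described in FIPS 200: the overall impact is the highest of the three. The sketch below assumes that rule; the `Impact` enum and variable names are illustrative.

```python
from enum import IntEnum

class Impact(IntEnum):
    """Impact levels ordered so that max() picks the most severe."""
    LOW = 1
    MODERATE = 2
    HIGH = 3

# Levels taken from the categorization above
categorization = {
    "confidentiality": Impact.HIGH,
    "integrity": Impact.MODERATE,
    "availability": Impact.LOW,
}

# High-water mark: the overall level is the highest individual level
overall = max(categorization.values())
print(f"Overall impact level: {overall.name}")
```

With the levels chosen here, confidentiality drives the overall level, so controls would be selected against a high-impact baseline.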

### **2. Select Security Controls**

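Using the NIST SP 800-53 control catalog, select appropriate controls:

- **Access Control (AC):** AC-2 (Account Management), AC-3 (Access Enforcement)
- **Audit and Accountability (AU):** AU-2 (Event Logging), AU-12 (Audit Record Generation)
- **Identification and Authentication (IA):** IA-2 (Identification and Authentication, Organizational Users)
- **System and Communications Protection (SC):** SC-13 (Cryptographic Protection)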
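
Not every selected control is implemented in this notebook. The sketch below records which controls the later cells address, so the remaining gaps stay visible. The pairing of AC-3 with the role-based column filter and SC-13 with the Fernet step is one reading of this notebook, not an official mapping.

```python
# Selected controls mapped to the notebook step that addresses each one.
# None marks a control that is selected but not implemented in this notebook.
selected_controls = {
    "AC-2": None,
    "AC-3": "role-based column filtering (get_data_by_role)",
    "AU-2": None,
    "AU-12": None,
    "IA-2": None,
    "SC-13": "Fernet encryption of SSN and Credit_Card_Number",
}

implemented = [c for c, step in selected_controls.items() if step is not None]
gaps = [c for c, step in selected_controls.items() if step is None]

print("Implemented:", implemented)
print("Not yet implemented:", gaps)
```

The controls reported as not yet implemented (account management, audit logging, and user authentication) would need their own steps before the system could be assessed as a whole.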

### **3. Implement Security Controls**

#### **Data Encryption**

In [None]:
!pip install cryptography

from cryptography.fernet import Fernet

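# Generate encryption key
# Note: the key exists only in this session. In practice it must be stored
# securely and separately from the data; without it the data cannot be decrypted.
key = Fernet.generate_key()
cipher_suite = Fernet(key)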

# Function to encrypt data
def encrypt_column(column):
    return column.apply(lambda x: cipher_suite.encrypt(x.encode()).decode())

# Encrypt sensitive columns
data_encrypted = data.copy()
for col in ['SSN', 'Credit_Card_Number']:
    data_encrypted[col] = encrypt_column(data_encrypted[col])

data_encrypted.head()



| | First_Name | Last_Name | SSN | Credit_Card_Number | Email | Phone_Number | Medical_Diagnosis | Account_Balance |
|---|---|---|---|---|---|---|---|---|
| 0 | Alice | Smith | `gAAAAABoSiJEEHM19bvvJmhPZzc0vwVM9uQfyvJH7Fs4LI...` | `gAAAAABoSiJEKofwSGE9nxce59inS6gCZXerm5Goz3J_M4...` | example1@mail.com | 1054659450 | Healthy | 19890.95 |
| 1 | Bob | Johnson | `gAAAAABoSiJEQ4-7GBlizBsvWiFP6ilUV79Txm_8haraw1...` | `gAAAAABoSiJE2qk7-O6Pz1P0MQ06hiBmQm0P2ad3p4TAuy...` | example2@mail.com | 1959701064 | Healthy | -4737.66 |
| 2 | John | Doe | `gAAAAABoSiJEtosW7Fs-BNkEkB4rCjaCN9NdhdoAYoz5qS...` | `gAAAAABoSiJEOLrmkmxam7__o7lH-x--tYxDqCxq_mTjCt...` | example2@mail.com | 1809107354 | Hypertension | 17376.63 |
| 3 | Alice | Doe | `gAAAAABoSiJEmr2Iqlu3OkSTC-ZhahHWxR0ga_cJ0Vuqw3...` | `gAAAAABoSiJEIsD2q57Yaeud-VJUVCcsZsTFsJp8oFJNpU...` | example2@mail.com | 1038887321 | Diabetes | -1271.48 |
| 4 | Alice | Doe | `gAAAAABoSiJELhrKYIszLZCwKAodvxqgDR_Z-2hjyAUW-w...` | `gAAAAABoSiJE-CGrBUPlS-T6tJYApI_BDv8gU3KyGofvs0...` | example2@mail.com | 1384201975 | Diabetes | 14648.13 |


#### **Access Controls**

Simulate basic role-based access control by restricting the columns each role can see.

In [None]:
# Define user roles
user_roles = {
    'analyst': ['First_Name', 'Last_Name', 'Email', 'Phone_Number', 'Medical_Diagnosis', 'Account_Balance'],
    'manager': data_encrypted.columns.tolist(),
}

# Function to get data based on role
def get_data_by_role(role):
    allowed_columns = user_roles.get(role, [])
    return data_encrypted[allowed_columns]

# Example: Data accessible by an analyst
get_data_by_role('analyst').head()

| | First_Name | Last_Name | Email | Phone_Number | Medical_Diagnosis | Account_Balance |
|---|---|---|---|---|---|---|
| 0 | Alice | Smith | example1@mail.com | 1054659450 | Healthy | 19890.95 |
| 1 | Bob | Johnson | example2@mail.com | 1959701064 | Healthy | -4737.66 |
| 2 | John | Doe | example2@mail.com | 1809107354 | Hypertension | 17376.63 |
| 3 | Alice | Doe | example2@mail.com | 1038887321 | Diabetes | -1271.48 |
| 4 | Alice | Doe | example2@mail.com | 1384201975 | Diabetes | 14648.13 |


### **4. Assess Security Controls**

Test the implemented controls.

- **Encryption Test:** Ensure encrypted data is not readable without decryption.
- **Access Control Test:** Verify that users cannot access unauthorized data.

In [None]:
# Attempt to read encrypted SSN without decryption
print(data_encrypted['SSN'].head())

0    gAAAAABoSiJEEHM19bvvJmhPZzc0vwVM9uQfyvJH7Fs4LI...
1    gAAAAABoSiJEQ4-7GBlizBsvWiFP6ilUV79Txm_8haraw1...
2    gAAAAABoSiJEtosW7Fs-BNkEkB4rCjaCN9NdhdoAYoz5qS...
3    gAAAAABoSiJEmr2Iqlu3OkSTC-ZhahHWxR0ga_cJ0Vuqw3...
4    gAAAAABoSiJELhrKYIszLZCwKAodvxqgDR_Z-2hjyAUW-w...
Name: SSN, dtype: object
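
Printing the encrypted column shows that the values look scrambled, but it does not verify either control. The sketch below turns both tests into explicit checks. It is self-contained and uses a stand-in SSN value and a copy of the analyst role; in the notebook the same checks could run against `data_encrypted`, `cipher_suite`, and `user_roles`.

```python
from cryptography.fernet import Fernet, InvalidToken

# Stand-in value and a fresh key, so this cell runs on its own
plaintext = "895139886"
key = Fernet.generate_key()
token = Fernet(key).encrypt(plaintext.encode()).decode()

# Encryption test 1: the stored token differs from the plaintext
assert token != plaintext

# Encryption test 2: the correct key recovers the original value
assert Fernet(key).decrypt(token.encode()).decode() == plaintext

# Encryption test 3: a different key must be rejected
wrong_key_rejected = False
try:
    Fernet(Fernet.generate_key()).decrypt(token.encode())
except InvalidToken:
    wrong_key_rejected = True
assert wrong_key_rejected

# Access control test: the analyst role must not include restricted columns
user_roles = {
    "analyst": ["First_Name", "Last_Name", "Email", "Phone_Number",
                "Medical_Diagnosis", "Account_Balance"],
}
restricted_columns = {"SSN", "Credit_Card_Number"}
analyst_leak = restricted_columns & set(user_roles["analyst"])
assert not analyst_leak

print("All control checks passed")
```

Assertions make the assessment repeatable: if a later change adds `SSN` to the analyst role or swaps the key, the cell fails instead of silently printing output.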


### **5. Continuous Monitoring Plan**

Propose a plan for ongoing monitoring.

- **Regular Audits:** Schedule scripts to check data access logs.
- **Anomaly Detection:** Implement monitoring to detect unusual access patterns.
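
As a rough illustration of both ideas, the sketch below audits a small in-memory access log. The log format, the user names, and the threshold of three restricted reads are hypothetical; the notebook itself does not record access events.

```python
from collections import Counter

# Hypothetical access log entries: (user, role, column accessed)
access_log = [
    ("alice", "analyst", "Email"),
    ("alice", "analyst", "Account_Balance"),
    ("bob", "manager", "SSN"),
    ("bob", "manager", "SSN"),
    ("bob", "manager", "SSN"),
    ("bob", "manager", "Credit_Card_Number"),
    ("carol", "analyst", "SSN"),  # an analyst touching a restricted column
]

RESTRICTED = {"SSN", "Credit_Card_Number"}
MAX_RESTRICTED_READS = 3  # illustrative threshold, not a recommended value

def audit(log):
    """Return (role_violations, heavy_users) for a list of log entries."""
    # Regular audit: analysts should never read restricted columns
    violations = [e for e in log if e[1] == "analyst" and e[2] in RESTRICTED]
    # Simple anomaly check: users with unusually many restricted reads
    reads = Counter(user for user, _, col in log if col in RESTRICTED)
    heavy = sorted(u for u, n in reads.items() if n > MAX_RESTRICTED_READS)
    return violations, heavy

violations, heavy_users = audit(access_log)
print("Role violations:", violations)
print("Users over threshold:", heavy_users)
```

A fixed threshold is the simplest possible anomaly rule. A production plan would more likely compare each user against their own historical baseline.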

---

## **Developing Governance Policies** <a name="developing-governance-policies"></a>

### **Policy Drafting**

Based on the risk assessment:

1. **Data Encryption Policy:** All PII and financial data must be encrypted at rest and in transit.
2. **Access Control Policy:** Implement role-based access controls limiting data access based on user roles.
3. **Compliance Policy:** Regularly review and update practices to comply with GDPR, HIPAA, PCI DSS, etc.
4. **Data Retention Policy:** Define how long data is stored and when it should be disposed of securely.
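
A drafted policy is easier to enforce when it can be checked against the data. The sketch below compares the encryption policy with what this notebook actually encrypts. The column groupings follow the compliance mapping earlier in the notebook, and the variable names are illustrative.

```python
# Column categories, following the compliance mapping earlier in this notebook
pii_columns = ["First_Name", "Last_Name", "SSN", "Email", "Phone_Number"]
financial_columns = ["Credit_Card_Number", "Account_Balance"]

# Columns that the encryption step in this notebook encrypts
encrypted_columns = {"SSN", "Credit_Card_Number"}

# Policy 1 requires every PII and financial column to be encrypted
must_encrypt = set(pii_columns) | set(financial_columns)
unencrypted = sorted(must_encrypt - encrypted_columns)

print("Columns violating the encryption policy:", unencrypted)
```

Under this reading, the names, email, phone number, and account balance columns are still stored in plaintext, so either the encryption step or the policy wording would need to change. Note that `Account_Balance` is numeric and would have to be converted to a string before `encrypt_column` could handle it.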

### **Sample Policy Document**

```markdown
#### **Data Encryption Policy**

All sensitive data, including PII and financial information, must be encrypted using industry-standard encryption algorithms both at rest and during transmission.

#### **Access Control Policy**

Access to sensitive data is restricted based on the principle of least privilege. Users are granted access only to the data necessary for their role.

#### **Compliance Policy**

The organization commits to complying with all relevant data protection regulations. Regular audits will be conducted to ensure adherence.

#### **Data Retention Policy**

Sensitive data will be retained only as long as necessary for business purposes and will be disposed of securely thereafter.
```

---

## **References**

- NIST SP 800-37 Rev. 2: [Risk Management Framework](https://csrc.nist.gov/publications/detail/sp/800-37/rev-2/final)
- NIST SP 800-53 Rev. 5: [Security and Privacy Controls](https://csrc.nist.gov/publications/detail/sp/800-53/rev-5/final)
- HIPAA Compliance Guidelines
- GDPR Regulations