# Classifying S3 Data Sensitivity with Machine Learning
This notebook demonstrates how to categorize S3 data objects as sensitive or non-sensitive by analyzing object metadata with Python and scikit-learn.

## 1. Business Problem
With data stored in S3 buckets, we need an automated way to identify sensitive data and enforce security policies. Manually classifying data does not scale.

We will build a proof of concept to show how object metadata like bucket names and access patterns can be used to train an ML model to classify sensitivity.

## 2. Sample Data
We create sample metadata for a few S3 objects, with attributes such as the bucket name and the number of days since the object was last accessed:

In [10]:
import pandas as pd

data = [{'s3bucket': 'financial-data', 'days_since_access': 2291},
        {'s3bucket': 'model-data', 'days_since_access': 119},
        {'s3bucket': 'log-files', 'days_since_access': 2733}]

df = pd.DataFrame(data)
print(df)

         s3bucket  days_since_access
0  financial-data               2291
1      model-data                119
2       log-files               2733


## 3. Feature Engineering
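We transform the S3 bucket names into numeric vectors using scikit-learn's TfidfVectorizer, which encodes each name as a tf-idf weighted vector.

With its default settings, TfidfVectorizer lowercases the text, splits it into tokens of two or more word characters (so a hyphenated name such as financial-data yields the tokens financial and data), counts single-word terms, and weights each count by inverse document frequency. Stop-word removal and longer n-grams are optional and are enabled through the stop_words and ngram_range parameters. Only the bucket name is encoded here; days_since_access is not used as a feature in this step.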

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['s3bucket'])
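
To see what the vectorizer learned, it can help to inspect the fitted vocabulary. The sketch below rebuilds the same three bucket names so it runs on its own, and assumes the default TfidfVectorizer settings; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

bucket_names = ['financial-data', 'model-data', 'log-files']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(bucket_names)

# The default tokenizer splits on the hyphen, so each name becomes two tokens
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)  # ['data', 'files', 'financial', 'log', 'model']

# One row per bucket name, one column per vocabulary term
print(X.shape)  # (3, 5)
```

Because the token data appears in two of the three names, it receives a lower idf weight than tokens that appear only once.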

## 4. Classification Model
With numeric vectors representing the bucket names, we can train a multinomial Naive Bayes classifier to predict sensitivity labels. The labels are assigned by hand for this example:

In [12]:
from sklearn.naive_bayes import MultinomialNB

y = [0, 1, 0]  # Labels: 0 = non-sensitive, 1 = sensitive

nb = MultinomialNB()
nb.fit(X, y) 
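
The model above only sees the bucket name. One possible way to also use days_since_access is to combine both columns with a ColumnTransformer, sketched here. The choice of MinMaxScaler (MultinomialNB requires non-negative inputs) and the values of the new object are assumptions made for illustration, not part of the original workflow.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame([
    {'s3bucket': 'financial-data', 'days_since_access': 2291},
    {'s3bucket': 'model-data', 'days_since_access': 119},
    {'s3bucket': 'log-files', 'days_since_access': 2733},
])
y = [0, 1, 0]  # 0 = non-sensitive, 1 = sensitive

features = ColumnTransformer([
    # A text vectorizer expects a 1-D column, so the name is passed as a string
    ('name', TfidfVectorizer(), 's3bucket'),
    # Scale to [0, 1]; clip=True keeps unseen values inside that range
    ('age', MinMaxScaler(clip=True), ['days_since_access']),
])

model = make_pipeline(features, MultinomialNB())
model.fit(df, y)

# Hypothetical new object, used only to show the call
new_obj = pd.DataFrame([{'s3bucket': 'financial-reports', 'days_since_access': 500}])
pred = model.predict(new_obj)
print(pred)
```

Wrapping both steps in a pipeline means new objects are transformed with the vocabulary and scaling learned during training.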

## 5. Making Predictions
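We can now use the model to classify new S3 objects. The new bucket name is encoded with the already fitted vectorizer (transform, not fit_transform) so that it maps onto the same vocabulary: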

In [13]:
X_test = vectorizer.transform(['financial-reports'])

y_pred = nb.predict(X_test)
print(y_pred)

[0]
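
A hard label hides how confident the model is. MultinomialNB also provides predict_proba, which returns one probability per class. The sketch below refits the same toy model so it runs on its own; with only three training examples the probabilities are illustrative at best.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(['financial-data', 'model-data', 'log-files'])
y = [0, 1, 0]  # 0 = non-sensitive, 1 = sensitive

nb = MultinomialNB()
nb.fit(X, y)

X_test = vectorizer.transform(['financial-reports'])
proba = nb.predict_proba(X_test)

# Columns of proba follow the order of nb.classes_, here [0, 1]
for label, p in zip(nb.classes_, proba[0]):
    print(f"class {label}: {p:.3f}")
```

The token reports is not in the training vocabulary and is ignored, so the prediction rests on the token financial and the class priors alone.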


## 6. Evaluating Performance
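We measure accuracy by comparing predicted labels against the known labels. The predictions below are hard-coded for illustration rather than produced by the model: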

In [14]:
import numpy as np
from sklearn.metrics import accuracy_score

y = np.array([0, 1, 0])
y_pred = np.array([0, 1, 1])

accuracy_score(y, y_pred)

0.6666666666666666

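In real applications, precision and recall matter as well. A false negative leaves a sensitive object unprotected, while a false positive adds restrictions that were not needed, so accuracy alone can be misleading.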
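
As a minimal sketch, precision and recall for the sensitive class can be computed with scikit-learn on the same illustrative label arrays used for the accuracy check:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y = np.array([0, 1, 0])
y_pred = np.array([0, 1, 1])

# Both scores treat label 1 (sensitive) as the positive class by default
precision = precision_score(y, y_pred)
recall = recall_score(y, y_pred)
cm = confusion_matrix(y, y_pred)

print(precision)  # 0.5
print(recall)     # 1.0
print(cm)         # rows = true label, columns = predicted label
```

Here the one sensitive object is caught (recall 1.0), but half of the objects flagged as sensitive are false alarms (precision 0.5).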

This notebook shows a basic workflow for metadata-based S3 classification with Python. Next steps could include a larger labeled dataset, richer features, and model tuning.