
# CSCI S-108: Data Mining, Discovery, and Exploration
## Final Project: Optimizing Data Egress for Secure Transmission Using Data Mining Techniques
**Student**: Luciano Carvalho

### 1. Introduction

#### Objective
The objective of this project is to develop a secure and efficient method for data egress, using data mining techniques to obfuscate sensitive data and reduce dataset size before transmission.

#### Background
As the Director of DevOps at Intelex Technologies, my role includes managing the company's databases and ensuring secure data handling. This project focuses on addressing the challenges of handling large data exports, obfuscating sensitive information, and optimizing data transmission.

#### Dataset
We use the UNSW-NB15 dataset, which contains network traffic data with features such as IP addresses, port numbers, and protocols. Each record is labeled as either normal traffic or one of several attack types.



### 2. Problem Statement

The key challenges addressed in this project include the lack of robust solutions for secure data obfuscation and size reduction during data transmission. Additionally, the project aims to tackle operational challenges in monitoring and managing deployments across multiple platforms (AWS, Azure, Rackspace).



### 3. Exploratory Data Analysis (EDA)

#### 3.1 Data Loading and Description
This section covers the initial data loading, examination of data structure, and basic statistical description.


In [None]:

# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('UNSW-NB15_1.csv', encoding='ISO-8859-1')

# Inspect the structure (shape and column dtypes), then display the first few rows
print(df.shape)
df.info()
df.head()



#### 3.2 Data Cleaning
We handle missing values, convert data types, and perform basic preprocessing to ensure data quality.


In [None]:

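# Identify numeric columns and coerce any stray non-numeric entries to NaN
numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors='coerce')

# Fill missing numeric values with the column mean (non-numeric columns are left untouched)
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())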



#### 3.3 Data Visualization
We use various visualization techniques to explore the data, including histograms, box plots, and correlation matrices.


In [None]:

# Histogram of duration; a log-scaled x-axis cannot show zero durations, so they are excluded
sns.histplot(df.loc[df['dur'] > 0, 'dur'], kde=True, log_scale=(True, False))
plt.title('Distribution of Duration (dur), Logarithmic x-Axis')
plt.show()


In [None]:

# Boxplot for outlier detection
sns.boxplot(data=df[['Spkts', 'Dpkts']])
plt.title('Boxplot of Spkts and Dpkts')
plt.show()


In [None]:

# Correlation Matrix (numeric columns only; text columns cannot be correlated)
corr_matrix = df[numeric_columns].corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f")
plt.title('Correlation Matrix')
plt.show()



### 4. Methodology

In this section, we explore various data mining methods applicable to our project, including data exploration, streaming analysis, clustering, dimensionality reduction, similarity search, and more.



#### 4.1 Data Exploration and Analysis Techniques
We begin with a thorough exploration of the dataset, using summary statistics and per-feature distribution plots, as discussed in the "Pitfalls in Data Mining" assignment.


In [None]:

# Summary Statistics (printed explicitly, since only a cell's last expression is displayed)
print(df.describe())

# Visualization of distribution for numerical features
for column in numeric_columns:
    plt.figure()
    sns.histplot(df[column], kde=True)
    plt.title(f'Distribution of {column}')
    plt.show()



#### 4.2 Streaming Analysis
We implement streaming analytics methods such as moving averages and exponential smoothing, using concepts from the "Streaming" assignment.


In [None]:

# Example: Moving average calculation
df['moving_avg'] = df['dur'].rolling(window=5).mean()

# Plot the moving average
plt.plot(df['moving_avg'])
plt.title('Moving Average of Duration')
plt.show()
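
The cell above covers the moving average only. The exponential smoothing mentioned in this section could be sketched as follows. This is a minimal illustration, not the project's implementation: the class name `ExponentialSmoother`, the smoothing factor `alpha=0.3`, and the short list of duration values are all assumptions made for the example. The smoother keeps only the previous smoothed value, so it can process records one at a time as they arrive.

```python
class ExponentialSmoother:
    """Streaming exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""

    def __init__(self, alpha):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = None  # no observation seen yet

    def update(self, x):
        # The first observation initializes the smoothed value
        if self.value is None:
            self.value = float(x)
        else:
            self.value = self.alpha * x + (1.0 - self.alpha) * self.value
        return self.value


# Hypothetical duration values standing in for a stream of 'dur' readings
durations = [1.0, 2.0, 3.0, 10.0, 3.0]
smoother = ExponentialSmoother(alpha=0.3)
smoothed = [smoother.update(x) for x in durations]
print([round(s, 4) for s in smoothed])
```

For a whole column at once, `df['dur'].ewm(alpha=0.3, adjust=False).mean()` in pandas applies the same recurrence.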



#### 4.3 Similarity Search
We employ techniques like KD-Trees and Locality Sensitive Hashing (LSH) to find similar data points or patterns, based on the "Efficient Similarity Search" assignment.


In [None]:

# Example of KD-Trees
from sklearn.neighbors import KDTree

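# Using a subset of data for demonstration (fixed seed for reproducibility)
subset = df.sample(n=1000, random_state=42)
kdt = KDTree(subset[numeric_columns], leaf_size=30, metric='euclidean')

# Find nearest neighbors for a sample point
# (the query point is part of the tree, so its closest neighbor is itself at distance 0)
dist, ind = kdt.query(subset[numeric_columns].iloc[:1], k=5)
print(f'Positions within subset of the 5 nearest neighbors: {ind}')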
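
The cell above demonstrates KD-Trees only. One common form of LSH, random-hyperplane hashing, could be sketched as follows. Note that this variant approximates cosine similarity, not the Euclidean metric used by the KD-Tree above, and that the helper names (`make_hyperplanes`, `signature`, `build_index`, `candidates`) and the toy vectors are assumptions for illustration only.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_planes, dim, seed=0):
    # Each hyperplane is a random Gaussian normal vector through the origin
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def signature(vector, hyperplanes):
    # One bit per hyperplane: which side of the plane the vector falls on
    return tuple(
        int(sum(v * w for v, w in zip(vector, plane)) >= 0.0)
        for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    # Vectors with the same signature land in the same bucket
    buckets = defaultdict(list)
    for i, vec in enumerate(vectors):
        buckets[signature(vec, hyperplanes)].append(i)
    return buckets


def candidates(query, buckets, hyperplanes):
    # Only the query's own bucket is searched, not the whole dataset
    return buckets.get(signature(query, hyperplanes), [])


# Hypothetical feature vectors standing in for rows of df[numeric_columns]
vectors = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.0],
    [-1.0, 0.2, 0.8],
    [0.1, -0.9, 1.0],
]
planes = make_hyperplanes(n_planes=8, dim=3, seed=42)
index = build_index(vectors, planes)

# A query pointing in the same direction as vectors[0] hashes to the same bucket
query = [2 * x for x in vectors[0]]
print(candidates(query, index, planes))
```

Using more hyperplanes makes buckets smaller and more selective, at the cost of missing some true neighbors; in practice several independent hash tables are often combined to recover them.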



#### 4.4 Recommender Systems
We discuss how simple collaborative filtering or content-based filtering methods could be implemented, inspired by the "Recommenders" assignment.


In [None]:

# Placeholder: Example of a simple recommender system approach
# Example: Recommend top 5 similar activities based on historical data (placeholder logic)
# (This could be expanded based on the available data and specific use case)
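
As one possible way to fill in the placeholder above, a content-based approach could rank historical activities by cosine similarity to a target activity. The sketch below is illustrative only: the activity names, the three-value feature vectors, and the helper `top_k_similar` are hypothetical and are not drawn from the dataset.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for a zero vector; treat as no match
    return dot / (norm_a * norm_b)


def top_k_similar(target, history, k=5):
    # Score every historical activity against the target, highest similarity first
    scored = [(name, cosine_similarity(target, vec)) for name, vec in history.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical activity profiles: three made-up numeric features per activity
history = {
    "dns_lookup": [0.1, 1.0, 1.0],
    "web_browsing": [2.0, 12.0, 10.0],
    "file_upload": [30.0, 400.0, 20.0],
    "port_scan": [0.0, 1.0, 0.0],
}
target = [1.5, 10.0, 9.0]

recommendations = top_k_similar(target, history, k=2)
for name, score in recommendations:
    print(f"{name}: {score:.3f}")
```

Cosine similarity ignores vector magnitude, so activities with the same traffic shape but different volumes score as similar; whether that is desirable depends on the use case.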



#### 4.5 Clustering Models
We apply clustering algorithms such as DBSCAN, K-Means, and Hierarchical Clustering, and evaluate their performance, following the "Cluster Models" assignment.


In [None]:

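# Example of DBSCAN clustering
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

db = DBSCAN(eps=0.5, min_samples=5).fit(df[numeric_columns])
df['cluster'] = db.labels_  # a label of -1 marks points DBSCAN treats as noise

# The scatter plots in this section need a 2-D projection, so compute it here
# (Section 4.6 discusses PCA itself)
pca_result = PCA(n_components=2).fit_transform(df[numeric_columns])
df['pca1'], df['pca2'] = pca_result[:, 0], pca_result[:, 1]

# Visualize clusters
plt.scatter(df['pca1'], df['pca2'], c=df['cluster'])
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('DBSCAN Clustering Result')
plt.show()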


In [None]:

# Example of K-Means Clustering
from sklearn.cluster import KMeans

# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df['kmeans_cluster'] = kmeans.fit_predict(df[numeric_columns])

# Visualize K-Means clusters
plt.scatter(df['pca1'], df['pca2'], c=df['kmeans_cluster'])
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('K-Means Clustering Result')
plt.show()
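
The cells above cover DBSCAN and K-Means. Hierarchical clustering could be sketched along the same lines with scikit-learn's `AgglomerativeClustering`. The example below runs on synthetic blobs, not on the dataset, and the choices of three clusters and Ward linkage are assumptions for illustration.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df[numeric_columns]: 300 points around 3 centers
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

# Standardize so that no single feature dominates the distance computation
X_scaled = StandardScaler().fit_transform(X)

# Ward linkage merges the pair of clusters that least increases within-cluster variance
hier = AgglomerativeClustering(n_clusters=3, linkage='ward')
hier_labels = hier.fit_predict(X_scaled)
print(sorted(set(hier_labels)))
```

Agglomerative clustering needs pairwise distance information, so its memory use grows quickly with the number of rows; on a dataset of this size it would likely have to run on a sample such as `df.sample(n=1000)`.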



#### 4.6 Dimensionality Reduction
We use PCA and t-SNE to reduce data dimensions and visualize the reduced data, as explored in the "Dimensionality Reduction" assignment.


In [None]:

# Example of PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_result = pca.fit_transform(df[numeric_columns])
df['pca1'], df['pca2'] = pca_result[:, 0], pca_result[:, 1]

# Plot PCA results
plt.scatter(df['pca1'], df['pca2'], c=df['cluster'])
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('PCA Result')
plt.show()
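
The cell above shows PCA only. A t-SNE projection could be sketched as follows, here on synthetic data because t-SNE is slow on large inputs. The sample size, the ten synthetic features, and `perplexity=30` are assumptions for illustration, not tuned values.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for a sample of df[numeric_columns]: 150 points, 10 features
X, y = make_blobs(n_samples=150, n_features=10, centers=3, random_state=42)

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embedding = tsne.fit_transform(X)
print(embedding.shape)
```

The two columns of `embedding` can be plotted with `plt.scatter` in the same way as `pca1` and `pca2`. Unlike PCA, t-SNE has no `transform` for new points, and distances between well-separated groups in the plot should not be read as meaningful.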



#### 4.7 Association Models
We apply association rule mining to discover relationships between variables, inspired by the "Association Models" assignment.


In [None]:

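# Example of association rule mining
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Placeholder: Sample transactions data (this should be replaced with actual data)
transactions = [['A', 'B', 'C'], ['A', 'C'], ['B', 'C'], ['A', 'B'], ['A', 'B', 'C']]

# apriori expects a one-hot encoded boolean DataFrame (one column per item)
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit_transform(transactions), columns=encoder.columns_)

# Generate frequent itemsets
frequent_itemsets = apriori(onehot, min_support=0.1, use_colnames=True)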

# Generate association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules.head()
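
To make the rule metrics concrete, the sketch below computes support, confidence, and lift by hand for the rule A → B on the same five placeholder transactions. The helper functions are written for this illustration and are not part of mlxtend.

```python
# The five placeholder transactions, as sets of items
transactions = [
    {'A', 'B', 'C'},
    {'A', 'C'},
    {'B', 'C'},
    {'A', 'B'},
    {'A', 'B', 'C'},
]


def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    # Of the transactions containing the antecedent, the fraction that also contain the consequent
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    # Confidence relative to how often the consequent occurs overall
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


print(round(support({'A', 'B'}, transactions), 4))
print(round(confidence({'A'}, {'B'}, transactions), 4))
print(round(lift({'A'}, {'B'}, transactions), 4))
```

Here A and B appear together in 3 of 5 transactions (support 0.6), and A appears in 4, so the confidence of A → B is 0.75. Because B itself appears in 4 of 5 transactions, the lift is 0.75 / 0.8 = 0.9375, slightly below 1: knowing that A is present does not make B more likely in this toy data.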



### 5. Results and Discussion

#### 5.1 Evaluation of Clustering Models
We evaluate the clustering models using metrics such as Silhouette Score, Davies-Bouldin Index, and visualizations to interpret the effectiveness of the clustering.
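
Both metrics are available in scikit-learn. The sketch below shows how they could be computed; it uses synthetic blobs and a three-cluster K-Means model as stand-ins, so the printed scores say nothing about the UNSW-NB15 results.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for df[numeric_columns]: three well-separated groups
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

sil = silhouette_score(X, labels)      # ranges from -1 to 1; higher is better
dbi = davies_bouldin_score(X, labels)  # non-negative; lower is better
print(f"Silhouette: {sil:.3f}, Davies-Bouldin: {dbi:.3f}")
```

When scoring DBSCAN output, points labeled -1 (noise) would otherwise be treated as one ordinary cluster, so it is usually worth filtering them out first. The silhouette score compares all pairs of points, so on the full dataset the `sample_size` argument of `silhouette_score` can keep the computation manageable.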

#### 5.2 Dimensionality Reduction Insights
We analyze the results from PCA and t-SNE to understand how the dimensionality reduction techniques preserve data variance and structure.
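
For PCA, variance preservation can be read directly from `explained_variance_ratio_`. The sketch below is a minimal illustration on synthetic data in which three of five features are near-copies of each other; the data, the 95% threshold, and the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: 5 features, of which the first three carry almost the same signal
base = rng.normal(size=(500, 3))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.05 * rng.normal(size=500),
    base[:, 0] + 0.05 * rng.normal(size=500),
    base[:, 1],
    base[:, 2],
])

# Standardize first so that variance is comparable across features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative explained variance reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(np.round(cumulative, 3), n_needed)
```

In this toy data three components are enough, because the redundant features collapse onto a single direction. The same check on `df[numeric_columns]` would indicate how many components could be kept if PCA were used to shrink the data before transmission.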

#### 5.3 Streaming Data Analysis Observations
We record observations from the streaming data analysis, including the impact of applying moving averages and filters on the dataset.

#### 5.4 Similarity Search Results
We discuss the findings from KD-Trees and LSH, focusing on their effectiveness in identifying similar data points or patterns.

#### 5.5 Recommender Systems Analysis
Although not the primary focus, we explore the feasibility of implementing a recommender system based on the dataset's characteristics, including potential use cases and challenges.



### 6. Conclusion

This project successfully demonstrated the application of various data mining techniques to enhance data egress security and efficiency. By leveraging clustering, dimensionality reduction, streaming data analysis, and similarity search, we addressed the primary challenges faced by Intelex Technologies in handling large datasets securely and effectively.

#### Key Takeaways
- Clustering models provided valuable insights into data structure and patterns.
- Dimensionality reduction techniques helped in visualizing high-dimensional data.
- Streaming data analysis facilitated real-time data processing and monitoring.
- Similarity search proved useful for identifying similar data points, enhancing data obfuscation and security.

#### Future Work
Further exploration could include the integration of more advanced machine learning models, the development of a full-fledged recommender system, and the continuous monitoring of data streams for real-time anomaly detection.



### 7. References

- **Scikit-Learn**: Pedregosa et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
- **Seaborn**: Michael Waskom (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60), 3021.
- **Pandas**: Wes McKinney (2010). Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference.
- **UNSW-NB15 Dataset**: Moustafa and Slay (2015). UNSW-NB15: A comprehensive data set for network intrusion detection systems. Military Communications and Information Systems Conference (MilCIS).
- **Apriori and Association Rules**: Agrawal, Imieliński, and Swami (1993). Mining association rules between sets of items in large databases. Proceedings of ACM SIGMOD '93.
- Additional references to be included based on the tools and techniques applied.
