# Python Libraries for AI

## Scikit-learn
- Built on NumPy, SciPy, and Matplotlib
- Covers supervised learning, unsupervised learning, model selection and evaluation, and data preprocessing

Data Preprocessing
- Tools to transform raw data into a format suitable for modeling
- Feature scaling: StandardScaler, MinMaxScaler, RobustScaler

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
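
To make the difference between the three scalers concrete, here is a minimal sketch on a toy column with one outlier. The array `X_demo` and its values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Toy feature column with one large outlier (illustrative values only)
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler: zero mean, unit variance
X_std = StandardScaler().fit_transform(X_demo)
# MinMaxScaler: rescales to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X_demo)
# RobustScaler: centers on the median and scales by the interquartile range,
# so the outlier has less influence on how the other values are scaled
X_robust = RobustScaler().fit_transform(X_demo)

print(X_std.ravel())
print(X_minmax.ravel())
print(X_robust.ravel())
```

With MinMaxScaler the four small values are squeezed close to 0 because the outlier defines the maximum; RobustScaler keeps them spread out around 0.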

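OneHotEncoder: Creates a binary column for each category
LabelEncoder: Assigns a unique integer to each category (intended for target labels; OrdinalEncoder is the equivalent for input features)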

In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X)
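
A small sketch of what the two encoders produce. The `protocols` and `labels` values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Illustrative categorical feature (2D, one column) and class labels (1D)
protocols = np.array([["HTTP"], ["DNS"], ["HTTP"], ["FTP"]])
labels = ["normal", "threat", "normal", "normal"]

# OneHotEncoder returns a sparse matrix by default; toarray() makes it dense
onehot = OneHotEncoder()
protocols_encoded = onehot.fit_transform(protocols).toarray()
print(onehot.categories_)    # categories are stored in sorted order
print(protocols_encoded)

# LabelEncoder maps each distinct label to an integer, also in sorted order
label_encoder = LabelEncoder()
labels_encoded = label_encoder.fit_transform(labels)
print(labels_encoded)
```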

SimpleImputer: Replaces missing values using a specified strategy (mean, median, most_frequent, or constant)
KNNImputer: Imputes missing values using the k-nearest neighbors algorithm

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
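
For comparison, a minimal KNNImputer sketch on a made-up array. The missing value is filled with the average of that feature over the nearest rows, where distance is computed on the features that are present.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Illustrative data: the second feature is missing in row 1
X_missing = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, 8.0],
])

# Rows 0 and 2 are the two nearest neighbors of row 1 (by the first feature),
# so the gap is filled with the mean of their second-feature values
knn = KNNImputer(n_neighbors=2)
X_filled = knn.fit_transform(X_missing)
print(X_filled)
```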

Scikit-learn has tools for selecting the best model and evaluating its performance.
Splitting data into training and testing sets is important for evaluating a model's ability to generalize to unseen data.

Cross-validation gives a more robust evaluation by splitting the data into folds and training/testing on different combinations of them.

In [None]:
# Splitting
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Cross-validation
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
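
To show what `cross_val_score` does under the hood, the sketch below runs it next to an explicit loop over the same folds. The synthetic dataset from `make_classification` and the names `X_cv`, `y_cv`, and `kfold` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data stands in for a real dataset
X_cv, y_cv = make_classification(n_samples=200, n_features=5, random_state=0)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# One call: fits a fresh copy of the model on each training fold
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_cv, y_cv, cv=kfold)

# Equivalent explicit loop over the same folds
manual_scores = []
for train_idx, test_idx in kfold.split(X_cv):
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(fold_model.score(X_cv[test_idx], y_cv[test_idx]))

print(cv_scores, np.mean(cv_scores))
```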

Metrics to evaluate model performance

- accuracy_score: Classification tasks
- mean_squared_error: Regression tasks
- precision_score, recall_score, f1_score: Classification tasks with imbalanced classes

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
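
A short sketch of why accuracy alone can mislead on imbalanced classes. The label vectors are made up: 8 negatives and 2 positives, with one false positive and one false negative.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative labels: class 1 is the rare class
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true_demo, y_pred_demo)     # 8 of 10 correct
prec = precision_score(y_true_demo, y_pred_demo)   # 1 true positive of 2 predicted positives
rec = recall_score(y_true_demo, y_pred_demo)       # 1 true positive of 2 actual positives
f1 = f1_score(y_true_demo, y_pred_demo)            # harmonic mean of precision and recall

print(acc, prec, rec, f1)
```

Accuracy looks good at 0.8, while precision, recall, and F1 on the rare class are all 0.5.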

Model Training & Prediction
Consistent API for training and predicting with different models: every estimator exposes fit() and predict()

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0)

# Train model using fit() method with training data
model.fit(X_train, y_train)

# Make predictions on new data using predict()
y_pred = model.predict(X_test)
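
Because every estimator shares the same `fit`/`predict` interface, swapping models is a one-line change. A minimal sketch on synthetic data; the choice of the three estimators and the names `X_api`, `candidates`, and `results` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X_api, y_api = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_api, y_api, test_size=0.2, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# The same two calls work for every model
results = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, estimator.predict(X_te))

print(results)
```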

## PyTorch

Free and open-source ML library originally developed by Facebook's AI Research lab (FAIR). A framework for building and deploying various types of ML models, including deep learning models.

Features
- Deep Learning: Excels at DL; supports networks such as CNNs with many layers and varied architectures
- Dynamic Computational Graphs: Unlike the static graphs of TensorFlow 1.x, PyTorch builds the graph as the code runs, which makes model building and debugging more flexible and intuitive
- GPU Support: GPU acceleration speeds up training of computationally intensive models
- TorchVision Integration: Companion library providing a user-friendly interface for image datasets, pre-trained models, and common image transformations
- Auto Differentiation: Uses autograd to compute gradients automatically, simplifying backpropagation
- Community/Ecosystem: Large community and a rich ecosystem of tools, libraries, and resources

Dynamic Computation Graph: created on the fly during the forward pass, allowing more flexible and dynamic model building. Makes it easier to implement complex/nonlinear models.
Tensors: multi-dimensional arrays that hold the data being processed. PyTorch tensors are similar to NumPy arrays but can run on a GPU for faster computation and can track gradients for autograd.
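
A minimal sketch of the dynamic graph and autograd working together: ordinary Python control flow decides which operations get recorded, and `backward()` differentiates whatever was actually executed. The function used here is an arbitrary example.

```python
import torch

# requires_grad=True tells autograd to track operations on this tensor
x = torch.tensor(3.0, requires_grad=True)

# The graph is built as the code runs, so a plain if-statement is enough
if x > 0:
    y = x ** 2 + 2 * x
else:
    y = -x

y.backward()          # computes dy/dx for the branch that ran
print(x.grad)         # dy/dx = 2*x + 2, evaluated at x = 3
```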

In [None]:
import torch

x = torch.tensor([1.0,2.0,3.0])
if torch.cuda.is_available():
    x = x.to('cuda')

# torch.nn contains various layers/modules for constructing NN
# Sequential API allows building models layer by layer, adding each sequentially

import torch.nn as nn

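model = nn.Sequential(
    nn.Linear(784,128),
    nn.ReLU(),
    # No Softmax layer: the model outputs raw logits, because
    # nn.CrossEntropyLoss (used for training below) applies log-softmax itself
    nn.Linear(128,10)
)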

# Module class provides more flex for building complex models with nonlinear topologies, shared layers, multiple input/output

class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784,128)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(128,10)
    def forward(self,x):
        x = self.layer1(x)
        x = self.relu(x)
        # Return raw logits; nn.CrossEntropyLoss applies log-softmax internally.
        # Use torch.softmax(logits, dim=1) at inference time if probabilities are needed.
        x = self.layer2(x)
        return x
    
model = CustomModel()
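
As an illustration of the extra flexibility of `nn.Module`, here is a sketch of a model with two inputs that are processed by separate branches and then merged. The class `TwoInputModel`, its layer sizes, and the input names are hypothetical.

```python
import torch
import torch.nn as nn

class TwoInputModel(nn.Module):
    """Hypothetical model with two input branches merged before the output layer."""
    def __init__(self):
        super().__init__()
        self.image_branch = nn.Linear(784, 64)
        self.meta_branch = nn.Linear(8, 16)
        self.head = nn.Linear(64 + 16, 10)

    def forward(self, image, meta):
        a = torch.relu(self.image_branch(image))
        b = torch.relu(self.meta_branch(meta))
        # Concatenate the two branches along the feature dimension
        return self.head(torch.cat([a, b], dim=1))

two_input_model = TwoInputModel()
logits = two_input_model(torch.randn(4, 784), torch.randn(4, 8))
print(logits.shape)
```

This kind of topology cannot be expressed with `nn.Sequential`, which only chains layers one after another.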

Training and Eval

Optimizers: algorithms that adjust the model's parameters during training to minimize the loss function. PyTorch offers several, including
- Adam
- SGD (Stochastic Gradient Descent)
- RMSprop

In [None]:
import torch.optim as optim
optimizer = optim.Adam(model.parameters(), lr=0.001)
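
To see what a single optimizer step does, the sketch below minimizes a one-parameter quadratic with plain SGD. The loss function and learning rate are arbitrary choices for illustration.

```python
import torch
import torch.optim as optim

# A single trainable parameter, starting at 0
w = torch.tensor(0.0, requires_grad=True)
sgd = optim.SGD([w], lr=0.1)

loss = (w - 3.0) ** 2    # minimized at w = 3
sgd.zero_grad()
loss.backward()          # gradient is 2 * (w - 3) = -6
sgd.step()               # w <- w - lr * grad = 0 - 0.1 * (-6)

print(w.item())
```

Repeating the last four lines in a loop moves `w` step by step toward 3.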

Loss functions measure the difference between the model's predictions and the actual target values. PyTorch provides several, including
- CrossEntropyLoss: for multi-class classification (expects raw logits)
- BCEWithLogitsLoss: for binary classification (expects raw logits)
- MSELoss: for regression

In [None]:
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
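
`nn.CrossEntropyLoss` takes raw logits and applies log-softmax internally. The sketch below checks this against the manual two-step computation; the logits and targets are made-up values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative logits for 2 samples and 3 classes, with integer class targets
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

ce = nn.CrossEntropyLoss()(logits, targets)

# Same result from log-softmax followed by negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(ce.item(), manual.item())
```

Passing already-softmaxed outputs to this loss would apply softmax twice, so models trained with it usually end in a plain linear layer.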

Metrics evaluate the model's performance during training and testing
- Accuracy
- Precision
- Recall

In [None]:
def accuracy(output, target):
    _, predicted = torch.max(output,1)
    correct = (predicted == target).sum().item()
    return correct / len(target)
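
Precision and recall can be computed in the same style. This is a minimal sketch for the binary case; the helper `precision_recall` and the example tensors are hypothetical.

```python
import torch

def precision_recall(predicted, target, positive=1):
    # Count true positives, false positives, and false negatives
    tp = ((predicted == positive) & (target == positive)).sum().item()
    fp = ((predicted == positive) & (target != positive)).sum().item()
    fn = ((predicted != positive) & (target == positive)).sum().item()
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

predicted = torch.tensor([1, 0, 1, 1, 0])
target = torch.tensor([1, 0, 0, 1, 1])
p, r = precision_recall(predicted, target)
print(p, r)
```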

In [None]:
# Training loop updates models params based on training data
import torch

epochs = 10
num_batches = 100

for epoch in range(epochs):
    for batch in range(num_batches):
        # Get batch of data (get_batch is a placeholder for your own data-loading function)
        x_batch, y_batch = get_batch(batch)

        # Forward Pass
        y_pred = model(x_batch)

        # Calc loss
        loss = loss_fn(y_pred, y_batch)

        # Back pass & optim
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 10 == 0:
            print(f'Epoch [{epoch+1}/{epochs}], Batch [{batch+1}/{num_batches}], Loss: {loss.item():.4f}')

In [None]:
# Data Loading & Preprocessing

from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self,data,labels):
        self.data = data
        self.labels = labels
    def __len__(self):
        return len(self.data)
    def __getitem__(self,idx):
        return self.data[idx],self.labels[idx]
    
dataset = CustomDataset(data,labels)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
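
Putting the pieces together, here is a self-contained sketch that trains on batches from a `DataLoader` and then evaluates without gradient tracking. The synthetic data, the single-layer model `net`, and the hyperparameters are assumptions for illustration; `TensorDataset` is used in place of a custom `Dataset` to keep it short.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Synthetic data: 64 samples, 4 features, binary target
features = torch.randn(64, 4)
targets = (features.sum(dim=1) > 0).long()

loader = DataLoader(TensorDataset(features, targets), batch_size=16, shuffle=True)
net = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
opt = optim.SGD(net.parameters(), lr=0.1)

for epoch in range(5):
    for x_batch, y_batch in loader:      # the loader yields one batch at a time
        opt.zero_grad()
        loss = criterion(net(x_batch), y_batch)
        loss.backward()
        opt.step()

# Evaluation: switch to eval mode and disable gradient tracking
net.eval()
with torch.no_grad():
    acc = (net(features).argmax(dim=1) == targets).float().mean().item()
print(f"accuracy on the training data: {acc:.2f}")
```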

In [None]:
# Model Saving and Loading

torch.save(model.state_dict(), 'model.pth')

model = CustomModel()
model.load_state_dict(torch.load('model.pth'))
model.eval() # Set model to eval mode
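
To resume training later, the optimizer state and the current epoch are commonly saved alongside the model weights. A minimal sketch, assuming a small stand-in model; the dictionary keys and the file name `checkpoint.pth` are arbitrary choices.

```python
import os
import tempfile
import torch
import torch.nn as nn
import torch.optim as optim

net = nn.Linear(4, 2)
opt = optim.SGD(net.parameters(), lr=0.01)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pth")
    # Save everything needed to continue training in one dictionary
    torch.save({
        "epoch": 3,
        "model_state": net.state_dict(),
        "optimizer_state": opt.state_dict(),
    }, path)
    checkpoint = torch.load(path)

# Rebuild the objects, then restore their state
restored_net = nn.Linear(4, 2)
restored_opt = optim.SGD(restored_net.parameters(), lr=0.01)
restored_net.load_state_dict(checkpoint["model_state"])
restored_opt.load_state_dict(checkpoint["optimizer_state"])
start_epoch = checkpoint["epoch"] + 1
print(start_epoch)
```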

# Datasets

Collections of data points used for analysis and model training. Data preprocessing is a crucial step in the ML pipeline: it transforms raw data into a form that algorithms can process effectively.

Forms of datasets
- Tabular Data: Organized into tables w/ rows and columns, common in spreadsheets/dbs
- Image Data: Sets of images, numerically as pixel arrays
- Text Data: Unstructured data, sentences, paragraphs, full documents
- Time Series Data: Sequential data points collected over time, emphasizing temporal patterns

Why dataset quality matters
- Model Accuracy: Quality datasets lead to more accurate models. Noisy, incomplete, or biased data leads to poorer model performance
- Generalization: Carefully curated data allows the model to generalize effectively to unseen data. This minimizes overfitting and ensures consistent performance in real-world use
- Efficiency: Clean, well-prepared data reduces training time and computational demands, streamlining the entire process
- Reliability: A reliable dataset leads to trustworthy insights and decisions. In critical domains like healthcare and finance, data quality affects the dependability of results

## What Makes a Dataset Good

- Relevance
- Completeness
- Consistency (format)
- Quality: accurate, free from errors; errors can arise from data collection, entry, or transmission issues
- Representativeness
- Balance: Especially important for classification. Imbalanced classes bias the model so that it performs poorly on minority classes. Techniques such as oversampling, undersampling, and synthetic data generation can help restore balance
- Size
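
As an example of the balancing techniques mentioned in the list, here is a minimal sketch of random oversampling with pandas: minority-class rows are resampled with replacement until every class matches the majority count. The toy DataFrame `df_demo` is made up for illustration.

```python
import pandas as pd

# Toy data: class 0 has four rows, class 1 only two
df_demo = pd.DataFrame({
    "bytes_transferred": [100, 200, 300, 400, 500, 600],
    "threat_level": [0, 0, 0, 0, 1, 1],
})
majority_size = df_demo["threat_level"].value_counts().max()

parts = []
for _, group in df_demo.groupby("threat_level"):
    n_extra = majority_size - len(group)
    if n_extra > 0:
        # Duplicate randomly chosen minority rows to close the gap
        extra = group.sample(n=n_extra, replace=True, random_state=0)
        group = pd.concat([group, extra])
    parts.append(group)

balanced = pd.concat(parts, ignore_index=True)
print(balanced["threat_level"].value_counts())
```

Oversampling should be applied to the training split only, otherwise duplicated rows can leak into the test set.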

## Dataset

demo_dataset.csv is a CSV file containing network log entries. Analyzing these entries allows one to simulate various network scenarios, which is useful for developing and evaluating an intrusion detection system (IDS).

### Structure

- log_id: Unique ID for each entry
- source_ip
- destination_port
- protocol
- bytes_transferred
- threat_level : Indicator of severity. 0 normal, 1 low-threat, 2 high-threat

### Challenges & Considerations

- Dataset contains a mix of numerical and categorical columns
- Some columns contain missing values and invalid entries, requiring data cleaning
- Some numeric columns may contain non-numeric strings, which must be converted or removed
- The threat_level column has unknown values that must be standardized or otherwise addressed during preprocessing
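
One quick way to surface the non-numeric strings mentioned in the list is to coerce each column with `pd.to_numeric` and look at what fails to parse. The sketch below uses a small hand-built frame that mirrors the first five rows of the dataset rather than reading the CSV; the names `sample` and `bad` are assumptions.

```python
import pandas as pd

# Hand-built stand-in for a few rows of the dataset
sample = pd.DataFrame({
    "destination_port": ["STRING_PORT", "110", "110", "80", "53"],
    "bytes_transferred": ["4096", "NEGATIVE", "NON_NUMERIC", "1024", "9765"],
    "threat_level": ["?", "1", "1", "0", "0"],
})

bad = {}
for column in sample.columns:
    # errors="coerce" turns anything that is not a number into NaN
    as_number = pd.to_numeric(sample[column], errors="coerce")
    bad[column] = sample.loc[as_number.isna(), column].unique().tolist()

print(bad)
```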

In [2]:
import pandas as pd

# pandas Dataframe is flexible, 2D labeled data structure that supports operations for data exploring/preprocessing
# advantages: labeled axes, heterogeneous data handling, integration with other libraries

data = pd.read_csv('./demo_dataset.csv')

In [5]:
# Various ops to understand structure, anomalies, determine cleaning/transformations needed

# First few rows of dataset
print(data.head())

# Summary of column data types and non-null counts
# Shows dataset shape, column names, data types, how many entries for each. Early detection of columns w/ unexpected/missing data
print(data.info())

# Identify col with missing vals
print(data.isnull().sum())

   log_id       source_ip destination_port protocol bytes_transferred  \
0      10      10.0.0.100      STRING_PORT      FTP              4096   
1      12  172.16.254.100              110     POP3          NEGATIVE   
2      27  172.16.254.200              110     POP3       NON_NUMERIC   
3       1   192.168.1.100               80     HTTP              1024   
4       2    192.168.1.81               53      TLS              9765   

  threat_level  
0            ?  
1            1  
2            1  
3            0  
4            0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   log_id             100 non-null    int64 
 1   source_ip          99 non-null     object
 2   destination_port   99 non-null     object
 3   protocol           100 non-null    object
 4   bytes_transferred  100 non-null    object
 5   threat_level       100

# Data Preprocessing

- Data Cleaning: Missing values, Duplicates, Smoothing noisy data
- Data Transformation: Normalizing, encoding, scaling, reducing data
- Data Integration: Merging/Aggregating from multiple sources
- Data Formatting: Converting types and reshaping data structures

Preprocessing addresses inconsistencies, missing values, outliers, noise, and feature scaling, which improves the accuracy, efficiency, and robustness of ML models.

In [10]:
import re

def is_valid_ip(ip):
    pattern = re.compile(r'^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$')
    return bool(pattern.match(ip))

invalid_ips = data[~data['source_ip'].astype(str).apply(is_valid_ip)]
print(invalid_ips)

    log_id       source_ip destination_port protocol bytes_transferred  \
40      41      10.0.0.300               25     SMTP              4096   
51      52    10.10.10.450      STRING_PORT      FTP              4096   
55      56             NaN               53      DNS              1024   
57      58   192.168.1.475              NaN      UDP              2048   
63      64      MISSING_IP               53      DNS              1024   
65      66   192.168.1.600      UNUSED_PORT      UDP              2048   
71      72      MISSING_IP               53      DNS              1024   
74      75    172.16.1.400               80     HTTP              1024   
82      83    172.16.1.450               80     HTTP              1024   
87      88      MISSING_IP               53      DNS              1024   
88      89    10.10.10.700              443      TLS               512   
92      93      INVALID_IP              110     POP3              4096   
93      94  192.168.1.1050            
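
As an alternative to the regular expression, the standard-library `ipaddress` module can do the validation. This is a sketch, not the notebook's method; the helper name `is_valid_ipv4` is hypothetical, and the example values are taken from the kinds of entries shown in the output.

```python
import ipaddress

def is_valid_ipv4(value):
    # IPv4Address raises a ValueError subclass for anything that is not
    # a well-formed dotted-quad address with octets in 0-255
    try:
        ipaddress.IPv4Address(str(value))
        return True
    except ValueError:
        return False

examples = ["192.168.1.100", "10.0.0.300", "MISSING_IP", float("nan")]
results = [is_valid_ipv4(v) for v in examples]
print(results)
```

Converting to `str` first means a missing value (NaN) is checked as the text "nan" and rejected like any other invalid string.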

In [9]:
def is_valid_port(port):
    try:
        port = int(port)
        return 0 <= port <= 65535
    except ValueError:
        return False
    
invalid_ports = data[~data['destination_port'].apply(is_valid_port)]
print(invalid_ports)

    log_id       source_ip destination_port protocol bytes_transferred  \
0       10      10.0.0.100      STRING_PORT      FTP              4096   
34      35   192.168.1.200      STRING_PORT      FTP              4096   
51      52    10.10.10.450      STRING_PORT      FTP              4096   
57      58   192.168.1.475              NaN      UDP              2048   
65      66   192.168.1.600      UNUSED_PORT      UDP              2048   
67      68     10.10.10.77      STRING_PORT      FTP              4096   
78      79   172.16.254.77           999999     HTTP              2048   
97      98  192.168.1.1100      UNUSED_PORT      UDP              2048   

   threat_level  
0             ?  
34            ?  
51            ?  
57            1  
65            1  
67            ?  
78            1  
97            0  


In [11]:
valid_protocols = ['TCP', 'TLS', 'SSH', 'POP3', 'DNS', 'HTTPS', 'SMTP', 'FTP', 'UDP', 'HTTP']

invalid_protocols = data[~data["protocol"].isin(valid_protocols)]
print(invalid_protocols)

    log_id      source_ip destination_port protocol bytes_transferred  \
30      31  192.168.1.119              443  UNKNOWN              9513   
80      81  192.168.1.224               25  UNKNOWN              1161   

   threat_level  
30            2  
80            1  


In [12]:
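def is_valid_bytes(value):
    # Valid if the value parses as a non-negative integer
    # (parameter renamed so it no longer shadows the built-in bytes type)
    try:
        return int(value) >= 0
    except (ValueError, TypeError):
        return False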
    
invalid_bytes = data[~data["bytes_transferred"].apply(is_valid_bytes)]
print(invalid_bytes)

    log_id       source_ip destination_port protocol bytes_transferred  \
1       12  172.16.254.100              110     POP3          NEGATIVE   
2       27  172.16.254.200              110     POP3       NON_NUMERIC   
93      94  192.168.1.1050               53      DNS       NON_NUMERIC   

   threat_level  
1             1  
2             1  
93            0  


In [13]:
def is_valid_threat_level(threat_level):
    try:
        threat_level = int(threat_level)
        return 0 <= threat_level <= 2
    except ValueError:
        return False
    
invalid_threat_levels = data[~data["threat_level"].apply(is_valid_threat_level)]
print(invalid_threat_levels)

    log_id      source_ip destination_port protocol bytes_transferred  \
0       10     10.0.0.100      STRING_PORT      FTP              4096   
34      35  192.168.1.200      STRING_PORT      FTP              4096   
51      52   10.10.10.450      STRING_PORT      FTP              4096   
67      68    10.10.10.77      STRING_PORT      FTP              4096   

   threat_level  
0             ?  
34            ?  
51            ?  
67            ?  


In [14]:
# Dropping invalid entries

# errors='ignore' covers the fact that some indexes were already dropped because they also matched an earlier invalid criterion
data = data.drop(invalid_ips.index, errors='ignore')
data = data.drop(invalid_ports.index, errors='ignore')
data = data.drop(invalid_protocols.index, errors='ignore')
data = data.drop(invalid_bytes.index, errors='ignore')
data = data.drop(invalid_threat_levels.index, errors='ignore')

print(data.describe(include='all'))

            log_id     source_ip destination_port protocol bytes_transferred  \
count    77.000000            77               77       77                77   
unique         NaN            68                6        9                73   
top            NaN  192.168.1.55               80     HTTP              1024   
freq           NaN             3               22       22                 4   
mean     46.519481           NaN              NaN      NaN               NaN   
std      28.591317           NaN              NaN      NaN               NaN   
min       1.000000           NaN              NaN      NaN               NaN   
25%      22.000000           NaN              NaN      NaN               NaN   
50%      45.000000           NaN              NaN      NaN               NaN   
75%      70.000000           NaN              NaN      NaN               NaN   
max     100.000000           NaN              NaN      NaN               NaN   

       threat_level  
count            

Dropping is preferred when accuracy is paramount and the loss of some data points doesn't significantly compromise the analysis.
It is not always feasible, for example when the dataset is small or the invalid entries make up a substantial share of the data.

After dropping, only 77 clean entries are left out of the original 100.
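
Before committing to a drop, it can help to combine the validity checks into one boolean mask and count how many rows would be lost. A vectorized sketch on a small hand-built frame (not the full CSV); `sample`, `keep`, and `clean` are illustrative names.

```python
import pandas as pd

# Hand-built stand-in for a few rows of the dataset
sample = pd.DataFrame({
    "destination_port": ["STRING_PORT", "110", "110", "80", "53"],
    "bytes_transferred": ["4096", "NEGATIVE", "NON_NUMERIC", "1024", "9765"],
    "threat_level": ["?", "1", "1", "0", "0"],
})

# Non-numeric entries become NaN, and comparisons with NaN are False
port_ok = pd.to_numeric(sample["destination_port"], errors="coerce").between(0, 65535)
bytes_ok = pd.to_numeric(sample["bytes_transferred"], errors="coerce").ge(0)
threat_ok = pd.to_numeric(sample["threat_level"], errors="coerce").between(0, 2)

keep = port_ok & bytes_ok & threat_ok
print(f"Dropping {(~keep).sum()} of {len(sample)} rows")

clean = sample[keep]
print(clean)
```

Note that `pd.to_numeric` also accepts decimal strings such as "80.5", which the `int()`-based checks would reject, so the two approaches are not strictly identical.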

### Imputing Missing Values
Imputation replaces missing or invalid entries with estimated values. This maintains the integrity and usability of the data, especially in ML and data analysis tasks where missing values can lead to bias or inaccuracy.
First convert all invalid/corrupted entries (MISSING_IP, INVALID_IP, STRING_PORT, UNUSED_PORT, NON_NUMERIC, NEGATIVE, ?) into NaN. This standardizes the representation of missing values, enabling uniform downstream imputation steps.

In [None]:
import pandas as pd
import numpy as np
import re
from ipaddress import ip_address

df = pd.read_csv('demo_dataset.csv')

invalid_ips = ['INVALID_IP', 'MISSING_IP']
invalid_ports = ['STRING_PORT', 'UNUSED_PORT']
invalid_bytes = ['NON_NUMERIC','NEGATIVE']
invalid_threat = ['?']

df.replace(invalid_ips + invalid_ports + invalid_bytes + invalid_threat, np.nan, inplace=True)

df['destination_port'] = pd.to_numeric(df['destination_port'], errors='coerce')
df['bytes_transferred'] = pd.to_numeric(df['bytes_transferred'], errors='coerce')
df['threat_level'] = pd.to_numeric(df['threat_level'], errors='coerce')

def is_valid_ip(ip):
    pattern = re.compile(r'^((25[0-5]|2[0-4][0-9]|[01]?\d?\d)\.){3}(25[0-5]|2[0-4]\d|[01]?\d?\d)$')
    if pd.isna(ip) or not pattern.match(str(ip)):
        return np.nan
    return ip

df['source_ip'] = df['source_ip'].apply(is_valid_ip)

#NaN now represents all missing/invalid data points

In [16]:
# Basic numeric columns like bytes_transferred. Simple methods such as median/mean. For categorical like protocol, use most frequent

from sklearn.impute import SimpleImputer

numeric_cols = ['destination_port','bytes_transferred','threat_level']
categorical_cols = ['protocol']

num_imputer = SimpleImputer(strategy='median')
df[numeric_cols] = num_imputer.fit_transform(df[numeric_cols])

cat_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])

These imputations ensure the columns have valid, non-missing values, but they do not consider relationships among features.

For more sophisticated scenarios, use advanced techniques like KNNImputer or IterativeImputer. These consider relationships among features to produce contextually meaningful imputations.
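
IterativeImputer is still marked experimental in scikit-learn and has to be enabled with an explicit import before it can be used. A minimal sketch on a made-up array in which the second column is roughly twice the first:

```python
import numpy as np
# This import has no direct use, but it is required to unlock IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X_missing = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [np.nan, 10.0],
])

# Each feature with gaps is modeled as a function of the other features
iter_imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = iter_imputer.fit_transform(X_missing)
print(X_filled)
```

Observed values are left unchanged; only the NaN cells are replaced with model-based estimates.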

In [17]:
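from sklearn.impute import KNNImputer

# Alternative to the median imputation above: KNNImputer only fills values that
# are still NaN, so apply it in place of (not after) SimpleImputer on these columns
knn_imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])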

After cleaning and imputation, apply domain knowledge. For source_ip values that are still missing, assign a default like 0.0.0.0. Validate protocol values against a list of known protocols. For ports, ensure each value is in the valid range, and for protocols that imply certain ports, consider mode-based assignment or domain-specific mappings.

In [18]:
valid_protocols = ['TCP', 'TLS', 'SSH', 'POP3', 'DNS', 'HTTPS', 'SMTP', 'FTP', 'UDP', 'HTTP']
df.loc[~df['protocol'].isin(valid_protocols), 'protocol'] = df['protocol'].mode()[0]

df['source_ip'] = df['source_ip'].fillna('0.0.0.0')
df['destination_port'] = df['destination_port'].clip(lower=0, upper=65535)
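
The domain-specific mapping idea can be sketched as follows: missing ports are filled from the protocol's well-known default port before any statistical imputation. The mapping `default_ports` and the toy frame `logs` are assumptions for illustration and are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Well-known default ports for some application protocols
default_ports = {"HTTP": 80, "HTTPS": 443, "DNS": 53, "SMTP": 25,
                 "POP3": 110, "SSH": 22, "FTP": 21}

logs = pd.DataFrame({
    "protocol": ["HTTP", "DNS", "FTP", "UDP"],
    "destination_port": [np.nan, 53.0, np.nan, np.nan],
})

# Fill only the gaps; existing port values are left untouched
logs["destination_port"] = logs["destination_port"].fillna(
    logs["protocol"].map(default_ports)
)
print(logs)
```

Transport protocols such as UDP have no single default port, so that row stays NaN and still needs another imputation strategy.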

In [20]:
print(df.describe(include='all'))

            log_id source_ip  destination_port protocol  bytes_transferred  \
count   100.000000       100        100.000000      100          100.00000   
unique         NaN        76               NaN        9                NaN   
top            NaN   0.0.0.0               NaN     HTTP                NaN   
freq           NaN        15               NaN       27                NaN   
mean     50.500000       NaN        776.860000      NaN         4138.64000   
std      29.011492       NaN       6542.582099      NaN         2526.40978   
min       1.000000       NaN         22.000000      NaN          498.00000   
25%      25.750000       NaN         53.000000      NaN         1693.25000   
50%      50.500000       NaN         80.000000      NaN         4096.00000   
75%      75.250000       NaN        110.000000      NaN         5971.75000   
max     100.000000       NaN      65535.000000      NaN         9765.00000   

        threat_level  
count     100.000000  
unique           