# Python Libraries for AI

## Scikit-learn
- Built on NumPy, SciPy, and Matplotlib
- Covers supervised learning, unsupervised learning, model selection & evaluation, and data preprocessing

Data Preprocessing
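- Tools to transform raw data into a format suitable for modeling
- Feature scaling: StandardScaler (zero mean, unit variance), MinMaxScaler (rescales to a range, [0, 1] by default), RobustScaler (uses median and IQR, less sensitive to outliers)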

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

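OneHotEncoder: Creates a binary column for each category
LabelEncoder: Assigns a unique integer to each category (intended for target labels `y`; OrdinalEncoder is the equivalent for feature columns)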

In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X)
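To make the difference between the two encoders concrete, here is a minimal sketch on a made-up `colors` array (hypothetical data, not part of these notes). It assumes a single categorical column; `OneHotEncoder` wants 2D input, so the array is reshaped first.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data for illustration only
colors = np.array(["red", "green", "blue", "green"])

# LabelEncoder maps each category to an integer; classes are sorted alphabetically
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(colors)

# OneHotEncoder expects a 2D array and returns a sparse matrix by default
onehot_encoder = OneHotEncoder()
onehot = onehot_encoder.fit_transform(colors.reshape(-1, 1)).toarray()

print(label_encoder.classes_)
print(labels)
print(onehot)
```

Each row of `onehot` contains exactly one 1, in the column of that row's category. The integer codes from `LabelEncoder` imply an ordering that most models would treat as meaningful, which is one reason one-hot encoding is usually preferred for nominal features.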

SimpleImputer: Replaces missing values using a specified strategy (mean, median, most frequent, or a constant)
KNNImputer: Imputes missing values using the k-nearest neighbors algorithm

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
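The two imputers can give different fills for the same gap. The sketch below compares them on a small made-up matrix (`X_missing` is hypothetical), using `n_neighbors=2` as an arbitrary choice.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical matrix with one missing value in the second column
X_missing = np.array([
    [1.0, 2.0],
    [3.0, np.nan],
    [5.0, 6.0],
    [7.0, 8.0],
])

# Mean imputation: fill with the mean of the observed values in that column
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)

# KNN imputation: fill with the average of the 2 nearest rows,
# where distance is computed on the features that are present
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X_missing)

print(mean_imputed[1, 1])
print(knn_imputed[1, 1])
```

The mean strategy ignores the rest of the row, while the KNN fill depends on which rows look most similar to the incomplete one.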

Scikit-learn has tools for selecting the best model & evaluating its performance
Splitting data into training & testing sets is important for evaluating a model's ability to generalize to unseen data

Cross-validation gives a more robust evaluation by splitting the data into folds and training/testing on different combinations of them

In [None]:
# Splitting
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Cross-validation
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)

Metrics to evaluate model performance

- accuracy_score: Classification tasks
- mean_squared_error: Regression tasks
- precision_score, recall_score, f1_score: Classification tasks with imbalanced classes

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
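Accuracy alone can look good on imbalanced data, so it helps to see it next to precision, recall, and F1. A minimal sketch with made-up labels (`y_true` and `y_hat` are hypothetical: 8 negatives, 2 positives):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: class 1 is the rare (positive) class
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]  # one false positive, one false negative

acc = accuracy_score(y_true, y_hat)
prec = precision_score(y_true, y_hat)  # TP / (TP + FP)
rec = recall_score(y_true, y_hat)      # TP / (TP + FN)
f1 = f1_score(y_true, y_hat)           # harmonic mean of precision and recall

print(acc, prec, rec, f1)
```

Here the accuracy stays high even though only half of the positive cases are found, which is the situation precision and recall are meant to expose.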

Model Training & Prediction
Consistent API for training & predicting with different models: every estimator exposes fit() and predict()

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0)

# Train model using fit() method with training data
model.fit(X_train, y_train)

# Make predictions on new data using predict()
y_pred = model.predict(X_test)
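The snippets above assume `X` and `y` already exist. As a rough end-to-end sketch, the pieces can be chained on synthetic data; `make_classification`, the sample sizes, and `random_state=0` are illustrative assumptions, not part of the notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hold out 20% of the rows for the final test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# A pipeline fits the scaler on training data only, avoiding leakage into the test set
clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
clf.fit(X_tr, y_tr)

test_accuracy = accuracy_score(y_te, clf.predict(X_te))

# 5-fold cross-validation on the training portion
cv_scores = cross_val_score(clf, X_tr, y_tr, cv=5)

print(f"test accuracy: {test_accuracy:.3f}")
print(f"cv mean: {cv_scores.mean():.3f}")
```

Wrapping the scaler and the model in one pipeline means `cross_val_score` refits the scaler inside every fold, so no fold sees statistics from its own validation data.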

## PyTorch

Free and open-source ML library originally developed by Facebook's AI Research lab (FAIR). Framework for building & deploying various types of ML models, including deep learning models

Features
- Deep Learning: Excels at DL; can build networks such as CNNs with many layers & varied architectures
- Dynamic Computational Graphs: Unlike the static graphs of TensorFlow 1.x, PyTorch builds the graph at run time, which allows more flexible & intuitive model building/debugging
- GPU Support: GPU acceleration speeds up training for computationally intensive models
- TorchVision Integration: Companion library providing a user-friendly interface for image datasets, pre-trained models, and common image transformations
- Auto Differentiation: Uses autograd to compute gradients automatically, simplifying backpropagation
- Community/Ecosystem: Large community and a rich ecosystem of tools, libraries, and resources

Dynamic Computation Graph: Created on the fly during the forward pass, allowing more flexible & dynamic model building. Makes it easier to implement complex models, e.g. ones with data-dependent control flow
Tensors: Multi-dimensional arrays that hold the data being processed. PyTorch tensors are similar to NumPy arrays but can run on a GPU for faster computation and can track gradients (`requires_grad=True`)
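A small sketch of what "created on the fly" means in practice: ordinary Python control flow decides which operations are recorded, and autograd differentiates whichever branch actually ran. The tensor `t` and the threshold are made up for illustration.

```python
import torch

# requires_grad=True asks autograd to record operations on this tensor
t = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The graph is built as the code runs, so a plain if/else can change its shape
if t.sum() > 5:
    y = (t ** 2).sum()  # branch taken here: y = t1^2 + t2^2 + t3^2
else:
    y = t.sum()

# Backpropagate through the operations that were actually executed
y.backward()
grad = t.grad  # dy/dt = 2 * t for the squared branch

print(grad)
```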

In [None]:
import torch

x = torch.tensor([1.0,2.0,3.0])
if torch.cuda.is_available():
    x = x.to('cuda')

# torch.nn contains various layers/modules for constructing NN
# Sequential API allows building models layer by layer, adding each sequentially

import torch.nn as nn

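model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    # Final layer outputs raw logits: nn.CrossEntropyLoss applies log-softmax itself,
    # so no nn.Softmax layer is added here
    nn.Linear(128, 10),
)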

# Module class provides more flexibility for building complex models with non-sequential topologies, shared layers, multiple inputs/outputs

class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 128)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally.
        # Use torch.softmax(logits, dim=1) at inference time if probabilities are needed.
        return self.layer2(x)
    
model = CustomModel()

Training and Eval

Optimizers: algorithms that adjust the model's parameters during training to minimize the loss function. PyTorch offers several, including
- Adam
- SGD (Stochastic Gradient Descent)
- RMSprop

In [None]:
import torch.optim as optim
optimizer = optim.Adam(model.parameters(), lr=0.001)

Loss functions measure the difference between the model's predictions and the actual target values. PyTorch provides several, including
- CrossEntropyLoss: for multi-class classification (expects raw logits)
- BCEWithLogitsLoss: for binary classification (expects raw logits)
- MSELoss: for regression

In [None]:
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
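Both `CrossEntropyLoss` and `BCEWithLogitsLoss` take raw scores (logits) and apply the softmax or sigmoid internally. The sketch below checks that against the explicit two-step versions; all tensor values are made up for illustration.

```python
import torch
import torch.nn as nn

# Multi-class: 2 samples, 3 classes, raw logits (no softmax applied)
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])  # class indices

ce = nn.CrossEntropyLoss()(logits, targets)
# Equivalent two-step form: log-softmax followed by negative log-likelihood
manual_ce = nn.NLLLoss()(torch.log_softmax(logits, dim=1), targets)

# Binary: one logit per sample, float targets in {0, 1}
binary_logits = torch.tensor([0.8, -1.2])
binary_targets = torch.tensor([1.0, 0.0])

bce = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)
# Equivalent two-step form: sigmoid followed by binary cross-entropy
manual_bce = nn.BCELoss()(torch.sigmoid(binary_logits), binary_targets)

print(ce.item(), manual_ce.item())
print(bce.item(), manual_bce.item())
```

Passing already-softmaxed probabilities into `CrossEntropyLoss` still runs, but it applies softmax twice, which flattens the gradients and tends to slow training.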

Metrics evaluate the model's performance during training/testing
- Accuracy
- Precision
- Recall

In [None]:
def accuracy(output, target):
    _, predicted = torch.max(output,1)
    correct = (predicted == target).sum().item()
    return correct / len(target)
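Precision and recall can be computed the same way from predicted and target class indices. This is a hand-rolled sketch for a single positive class; `precision_recall` and its `positive_class` argument are hypothetical names, not a PyTorch API.

```python
import torch

def precision_recall(predicted, target, positive_class=1):
    """Precision and recall for one class, from tensors of class indices."""
    is_pred_pos = predicted == positive_class
    is_true_pos = target == positive_class

    tp = (is_pred_pos & is_true_pos).sum().item()   # true positives
    fp = (is_pred_pos & ~is_true_pos).sum().item()  # false positives
    fn = (~is_pred_pos & is_true_pos).sum().item()  # false negatives

    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Made-up predictions and targets for illustration
predicted = torch.tensor([1, 0, 1, 1, 0])
target = torch.tensor([1, 0, 0, 1, 1])
p, r = precision_recall(predicted, target)

print(p, r)
```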

In [None]:
# Training loop updates models params based on training data
import torch

epochs = 10
num_batches = 100

for epoch in range(epochs):
    for batch in range(num_batches):
        # Get batch of data (get_batch is a placeholder for your own data-loading helper)
        x_batch, y_batch = get_batch(batch)

        # Forward Pass
        y_pred = model(x_batch)

        # Calc loss
        loss = loss_fn(y_pred, y_batch)

        # Back pass & optim
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 10 == 0:
            print(f'Epoch [{epoch+1}/{epochs}], Batch [{batch+1}/{num_batches}], Loss: {loss.item():.4f}')

In [None]:
# Data Loading & Preprocessing

from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self,data,labels):
        self.data = data
        self.labels = labels
    def __len__(self):
        return len(self.data)
    def __getitem__(self,idx):
        return self.data[idx],self.labels[idx]
    
dataset = CustomDataset(data,labels)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
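In practice the `DataLoader` usually replaces a manual batch index in the training loop. A minimal runnable sketch on random data follows; the shapes, the tiny network, the learning rate, and the use of `TensorDataset` (a built-in alternative to a custom `Dataset`) are all illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical synthetic data: 256 samples, 20 features, 3 classes
features = torch.randn(256, 20)
targets = torch.randint(0, 3, (256,))
loader = DataLoader(TensorDataset(features, targets), batch_size=32, shuffle=True)

net = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 3))
criterion = nn.CrossEntropyLoss()
opt = torch.optim.Adam(net.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(5):
    net.train()  # training mode (matters for layers like dropout/batch norm)
    running = 0.0
    for x_batch, y_batch in loader:  # the loader yields (inputs, labels) batches
        opt.zero_grad()
        loss = criterion(net(x_batch), y_batch)
        loss.backward()
        opt.step()
        running += loss.item() * x_batch.size(0)
    # Average loss per sample over the epoch
    epoch_losses.append(running / len(loader.dataset))

print(epoch_losses)
```

Iterating over the loader directly handles batching and shuffling, so the loop no longer needs a batch counter or a separate fetch helper.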

In [None]:
# Model Saving and Loading

torch.save(model.state_dict(), 'model.pth')

model = CustomModel()
model.load_state_dict(torch.load('model.pth'))
model.eval() # Set model to eval mode
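A quick way to confirm that a save/load round trip worked is to compare outputs before and after. The sketch below does this with a small stand-in `nn.Linear` model and a temporary directory, both assumptions made so the example does not touch real files.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Small stand-in model for illustration
net = nn.Linear(4, 2)

# Save only the parameters (state_dict), not the whole Python object
path = os.path.join(tempfile.mkdtemp(), "model.pth")
torch.save(net.state_dict(), path)

# Rebuild the same architecture, then load the saved parameters into it
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path))
restored.eval()

# Identical parameters should give identical outputs
sample = torch.randn(3, 4)
with torch.no_grad():
    same_output = torch.equal(net(sample), restored(sample))

print(same_output)
```

Saving the `state_dict` rather than the full model means the class definition must be available at load time, but the file stays independent of how the code is organized.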

# Datasets

Collections of data points used for analysis & model training. Data preprocessing is a crucial step in the ML pipeline, transforming raw data into a form that algorithms can process effectively

Forms of datasets
- Tabular Data: Organized into tables w/ rows and columns, common in spreadsheets/dbs
- Image Data: Sets of images, represented numerically as pixel arrays
- Text Data: Unstructured data, sentences, paragraphs, full documents
- Time Series Data: Sequential data points collected over time, emphasizing temporal patterns

Quality of the dataset is important
- Model Accuracy: Quality datasets = more accurate models. Poor-quality data (noisy, incomplete, biased) leads to poorer model performance
- Generalization: A carefully curated dataset allows effective generalization to unseen data. It minimizes overfitting and supports consistent performance in the real world
- Efficiency: Clean, well-prepared data reduces training time & computational demands, streamlining the entire process
- Reliability: A reliable dataset leads to trustworthy insights/decisions. In critical domains like healthcare/finance, data quality affects the dependability of results.

## What Makes a Dataset Good

- Relevance
- Completeness
- Consistency (format)
- Quality: accurate, free from errors; errors can arise from data collection, entry, or transmission issues
- Representativeness
- Balance: Especially important for classification. An imbalanced dataset biases the model so that it performs poorly on minority classes. Techniques such as oversampling, undersampling, and synthetic data generation can help restore balance
- Size

## Dataset

demo_dataset.csv is a CSV file containing network log entries. Analyzing the entries allows one to simulate various network scenarios, useful for developing/evaluating an IDS (intrusion detection system)

### Structure

- log_id: Unique ID for each entry
- source_ip
- destination_port
- protocol
- bytes_transferred
- threat_level : Indicator of severity. 0 normal, 1 low-threat, 2 high-threat

### Challenges & Considerations

- Dataset contains a mix of numerical and categorical columns
- Missing values and invalid entries in some columns, requiring data cleaning
- Some numeric columns may contain non-numeric strings, which must be converted/removed
- threat_level column has unknown values that must be standardized/addressed during preprocessing
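One way these issues could be handled is sketched below on a small in-memory frame that mimics the problems (the rows are made up, not read from demo_dataset.csv). It assumes invalid entries should become NaN and that incomplete rows are dropped; imputing them instead would be an equally valid choice.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the issues listed above
df = pd.DataFrame({
    "destination_port": ["80", "STRING_PORT", "110", None, "53"],
    "bytes_transferred": ["1024", "4096", "NON_NUMERIC", "2048", "9765"],
    "threat_level": ["0", "?", "1", "2", "0"],
})

# Convert to numbers; anything unparseable (e.g. "STRING_PORT", "?") becomes NaN
numeric_cols = ["destination_port", "bytes_transferred", "threat_level"]
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Treat threat levels outside the documented set {0, 1, 2} as missing
valid_levels = [0, 1, 2]
df.loc[~df["threat_level"].isin(valid_levels), "threat_level"] = np.nan

# Assumption: drop incomplete rows, then restore integer dtypes
cleaned = df.dropna().astype(int)

print(cleaned)
```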

In [2]:
import pandas as pd

# pandas DataFrame is a flexible, 2D labeled data structure that supports operations for data exploring/preprocessing
# advantages: labeled axes, heterogeneous data handling, integration with other libraries

data = pd.read_csv('./demo_dataset.csv')

In [5]:
# Various ops to understand structure, anomalies, determine cleaning/transformations needed

# First few rows of dataset
print(data.head())

# Summary of column data types and non-null counts
# Shows dataset shape, column names, data types, how many entries for each. Early detection of columns w/ unexpected/missing data
# info() prints its summary directly and returns None, so it is not wrapped in print()
data.info()

# Identify col with missing vals
print(data.isnull().sum())

   log_id       source_ip destination_port protocol bytes_transferred  \
0      10      10.0.0.100      STRING_PORT      FTP              4096   
1      12  172.16.254.100              110     POP3          NEGATIVE   
2      27  172.16.254.200              110     POP3       NON_NUMERIC   
3       1   192.168.1.100               80     HTTP              1024   
4       2    192.168.1.81               53      TLS              9765   

  threat_level  
0            ?  
1            1  
2            1  
3            0  
4            0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   log_id             100 non-null    int64 
 1   source_ip          99 non-null     object
 2   destination_port   99 non-null     object
 3   protocol           100 non-null    object
 4   bytes_transferred  100 non-null    object
 5   threat_level       100