# Data Science Guide
### Karl Evans 2024/2025

## 1. Getting Started - Systems
## 1.1 GitLab


In [None]:
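# Git
git clone                   # copy a remote repository to a local folder
git pull                    # fetch and merge changes from the remote
git add                     # stage changes for the next commit
git commit -m 'something'   # record staged changes with a message
git push                    # upload local commits to the remote
git branch                  # list, create, or delete branches
git status                  # show the state of the working tree
git checkout                # switch branches or restore files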

## 1.2 Bash/Shell

In [None]:
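htop      # interactive process viewer
pwd       # print the working directory
ls        # list directory contents
cd        # change directory
mkdir     # create a directory
touch     # create an empty file or update its timestamp
cd ..     # move up to the parent directory
rm        # remove files
rm -rf    # remove directories and their contents recursively, without prompting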

## 1.3 Containerisation
### - Environments


#### - Virtual Environment
A virtual environment is a folder that contains a copy of (or symlink to) a specific interpreter. Packages installed into a virtual environment end up in this folder and are therefore isolated from the packages used by other workspaces.

In [None]:
python -m venv example-env
# Windows
example-env\Scripts\activate
# Unix/MacOS
source example-env/bin/activate
deactivate
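
The same kind of environment can also be created from Python with the standard-library `venv` module. This is a minimal sketch: the helper name `create_env` and the use of a temporary directory are assumptions for illustration, and `with_pip=False` is chosen only to keep the example fast.

```python
import tempfile
import venv
from pathlib import Path


def create_env(parent: Path, name: str = "example-env") -> Path:
    """Create a virtual environment folder under `parent` and return its path."""
    env_dir = parent / name
    # with_pip=False skips bootstrapping pip, which keeps creation quick
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_env(Path(tmp))
    # Every virtual environment carries a pyvenv.cfg pointing at its base interpreter
    has_cfg = (env_dir / "pyvenv.cfg").is_file()
    print("pyvenv.cfg present:", has_cfg)
```

The `pyvenv.cfg` file is what marks the folder as an isolated environment; deleting the folder removes the environment.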

#### - Conda
Conda environments define what software is available for a project. Each environment is a self-contained, isolated space where you can install specific versions of software packages, including dependencies, libraries, and Python itself. This isolation avoids conflicts between package versions and ensures that each project has exactly the libraries and tools it needs.

In [None]:
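conda env list                                     # list existing environments
conda create --name <env-name> python=<version>    # create an environment with a given Python version
conda activate <env-name>                          # switch to the environment
conda deactivate                                   # return to the previous environment
conda env remove --name <env-name>                 # delete the environment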

#### - Jupyter Kernels
Kernels are programming language specific processes that run independently and interact with the Jupyter Applications and their user interfaces. ipykernel is the reference Jupyter kernel built on top of IPython, providing a powerful environment for interactive computing in Python.

In [None]:
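python -m pip install package   # install a package into the active environment
python -m pip freeze            # list installed packages in requirements format
python -m pip list              # list installed packages
python -m pip show package      # show metadata for one installed package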

## 2. Data Analysis
## 2.1 SQL

## 2.2 Data Structures

## 2.3 EDA

## 3. Modelling

## 3.1 Feature Engineering
 - Linear models learn sums and differences naturally, but can't learn anything more complex.
 - Ratios seem to be difficult for most models to learn. Ratio combinations often lead to some easy performance gains.
 - Linear models and neural nets generally do better with normalized features. Neural nets especially need features scaled to values not too far from 0. Tree-based models (like random forests and XGBoost) can sometimes benefit from normalization, but usually much less so.
 - Tree models can learn to approximate almost any combination of features, but when a combination is especially important they can still benefit from having it explicitly created, especially when data is limited.
 - Counts are especially helpful for tree models, since these models don't have a natural way of aggregating information across many features at once.
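
A minimal sketch of the ratio and count features described above, using pandas. The column names (`income`, `debt`, and the `has_*` flags) and all values are invented purely for illustration.

```python
import pandas as pd

# Toy data; column names and values are hypothetical
df_demo = pd.DataFrame({
    "income": [50_000, 80_000, 30_000],
    "debt": [10_000, 40_000, 15_000],
    "has_car": [1, 0, 1],
    "has_house": [0, 1, 1],
    "has_loan": [1, 1, 1],
})

# Ratio feature: hard for most models to learn on their own
df_demo["debt_to_income"] = df_demo["debt"] / df_demo["income"]

# Count feature: aggregates several indicator columns into one number,
# which tree models have no natural way of doing themselves
flag_cols = ["has_car", "has_house", "has_loan"]
df_demo["n_flags"] = df_demo[flag_cols].gt(0).sum(axis=1)

print(df_demo[["debt_to_income", "n_flags"]])
```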

### - Categorical

In [None]:
# One-hot encoding
# Label encoding
# Frequency encoding
# Mean encoding (another name for target encoding)
# Target encoding
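
The encodings listed above can be sketched with plain pandas. The `city` and `target` columns are hypothetical, and mean and target encoding are treated here as the same technique.

```python
import pandas as pd

# Toy data; names and values are hypothetical
df_cat = pd.DataFrame({
    "city": ["A", "B", "A", "C", "A", "B"],
    "target": [1, 0, 1, 0, 0, 1],
})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df_cat["city"], prefix="city")

# Label encoding: an arbitrary integer per category (order of first appearance)
df_cat["city_label"], _ = pd.factorize(df_cat["city"])

# Frequency encoding: share of rows in which each category occurs
freq = df_cat["city"].value_counts(normalize=True)
df_cat["city_freq"] = df_cat["city"].map(freq)

# Mean (target) encoding: mean of the target within each category.
# In practice this should be fitted on training folds only to avoid target leakage.
means = df_cat.groupby("city")["target"].mean()
df_cat["city_target"] = df_cat["city"].map(means)

print(one_hot.join(df_cat))
```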

### - Numeric

In [None]:
# Smoothing
# Clipping
# Z-score scaling
# Linear scaling
# Log scaling
# Binning
# - Good for 1) non-linear relationships 2) clustered values

# Count of positive values per row, e.g. df[num_col].gt(0).sum(axis=1)
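
A rough sketch of the numeric transforms named above, applied to a toy series. The values, the clipping bound, the rolling window, and the bin edges are all arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

# Toy, right-skewed values with one outlier (hypothetical)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Smoothing: rolling mean over a window of 3 observations
smoothed = x.rolling(window=3, min_periods=1).mean()

# Clipping: cap values at a chosen bound (here an arbitrary upper bound of 10)
clipped = x.clip(upper=10.0)

# Z-score scaling: zero mean, unit variance (population std, ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

# Linear (min-max) scaling to [0, 1]
linear = (x - x.min()) / (x.max() - x.min())

# Log scaling: compresses long right tails; log1p is also defined at 0
logged = np.log1p(x)

# Binning: turn a numeric feature into ordered categories
bins = pd.cut(x, bins=[0, 2, 5, np.inf], labels=["low", "mid", "high"])

print(pd.DataFrame({"x": x, "clipped": clipped, "z": z, "linear": linear, "bin": bins}))
```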

In [None]:
import numpy as np

# Columns that are never used as model inputs (target and identifiers)
ignore_features = ['reliability', 'intrnt_email_addr', 'clnt_intrnl_id']

cat_col = [col for col in df.select_dtypes(include=['category', 'object']).columns if col not in ignore_features]
num_col = [col for col in df.select_dtypes(include=['float64', 'int64']).columns if col not in ignore_features]
features = num_col + cat_col

# Encode the target: 1 for "Unreliable", 0 otherwise
df['reliability'] = np.where(df['reliability'] == "Unreliable", 1, 0)
y = df['reliability'].values
X = df.drop(ignore_features, axis=1)

In [None]:
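# PCA: use the component loadings to suggest new features
# - product or sum for features with large loadings of the same sign
# - ratio or difference for features with large loadings of opposite sign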
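
One way to read that rule of thumb, sketched with scikit-learn on synthetic data. The columns `a`, `b`, `c` and the derived feature names are hypothetical; `a` and `b` are built to move together and `c` to move against them.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=200)
X_demo = pd.DataFrame({
    "a": base + rng.normal(scale=0.1, size=200),
    "b": base + rng.normal(scale=0.1, size=200),   # moves with a
    "c": -base + rng.normal(scale=0.1, size=200),  # moves against a
})

# PCA is scale-sensitive, so standardise first
X_scaled = StandardScaler().fit_transform(X_demo)
pca = PCA(n_components=2).fit(X_scaled)

# Loadings: one row per original feature, one column per component
loadings = pd.DataFrame(pca.components_.T, columns=["PC1", "PC2"], index=X_demo.columns)
print(loadings)

pc1 = loadings["PC1"]
same_sign = np.sign(pc1["a"]) == np.sign(pc1["b"])
opposite_sign = np.sign(pc1["a"]) != np.sign(pc1["c"])

# Same-sign loadings suggest a sum (or product); opposite-sign a difference (or ratio)
if same_sign:
    X_demo["a_plus_b"] = X_demo["a"] + X_demo["b"]
if opposite_sign:
    X_demo["a_minus_c"] = X_demo["a"] - X_demo["c"]
```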

In [None]:
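# Clustering: use k-means cluster labels as a categorical feature
from sklearn.cluster import KMeans

# X must contain only numeric columns at this point
kmeans = KMeans(n_clusters=6)
X["Cluster"] = kmeans.fit_predict(X)
X["Cluster"] = X["Cluster"].astype("category")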

### Missing Data

In [None]:
# Mean/median imputation
# "-999" sentinel value for tree-based models
# Most frequent imputation
# "Miss" category for missing categorical values

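The imputation options listed above, sketched with pandas on a toy frame. The `age` and `colour` columns are hypothetical, and the extra `age_was_missing` flag is an addition that is often paired with imputation rather than something taken from the notes.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (hypothetical)
df_na = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "colour": ["red", None, "red", "blue"],
})

# Median imputation, keeping a flag that records where values were missing
df_na["age_was_missing"] = df_na["age"].isna().astype(int)
df_na["age_median"] = df_na["age"].fillna(df_na["age"].median())

# Sentinel value far outside the observed range, for tree-based models
df_na["age_sentinel"] = df_na["age"].fillna(-999)

# Most frequent imputation
df_na["colour_mode"] = df_na["colour"].fillna(df_na["colour"].mode()[0])

# Explicit "Miss" category
df_na["colour_miss"] = df_na["colour"].fillna("Miss")

print(df_na)
```
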
## 3.2 Pipelines

In [None]:
# e.g. CatBoost
from typing import List

from catboost import CatBoostClassifier
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_pipeline(num_features: List[str], cat_features: List[str]) -> Pipeline:
    """Full pipeline

    This function constructs the whole pipeline for training

    Params:
        num_features (List[str]): List of numeric features
        cat_features (List[str]): List of categorical features

    Returns:
        Pipeline that contains pre-processing and the model
    """
    numeric_transformer = make_pipeline(SimpleImputer(strategy='mean'),
                                       StandardScaler())

    categorical_transformer = make_pipeline(SimpleImputer(strategy='most_frequent'),
        #SimpleImputer(strategy='constant', fill_value='missing'),
        OneHotEncoder(handle_unknown='ignore', min_frequency=0.05))

    preprocessor = make_column_transformer((numeric_transformer, num_features),
        (categorical_transformer, cat_features),
        remainder="passthrough")
    
    pipe = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', CatBoostClassifier())
    ])
    return pipe

pipeline = build_pipeline(num_features=num_col, cat_features=cat_col)

## 3.3 Cross Validation


### Feature Importance

Mutual Information

In [None]:
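import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def make_mi_scores(X, y, discrete_features):
    # Estimate mutual information between each feature and the target.
    # mutual_info_regression expects a continuous target; for a categorical
    # target, mutual_info_classif is the counterpart.
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

# Assumption: integer-typed columns are the discrete features; X must be fully numeric
discrete_features = [pd.api.types.is_integer_dtype(dtype) for dtype in X.dtypes]
mi_scores = make_mi_scores(X, y, discrete_features)
mi_scores[::3]  # show every third feature with its MI score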

## 3.4 Class Imbalance

#### - Oversampling

#### - Undersampling

## 3.5 Dimensionality Reduction
#### - PCA
#### - t-SNE
#### - LDA
#### - ICA
#### - UMAP

## 3.6 Ensembling
#### - Bagging
#### - Boosting
#### - Stacking/Meta 

### Association
#### - 

## 3.7 Hyperparameter Tuning

In [None]:
import optuna
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

def objective(trial):
    # Suggest hyperparameters.
    # Assumption: the lower bound for iterations and the learning_rate range were
    # missing or garbled in the original notes; the values used here are placeholders.
    params = {
        'model__iterations': trial.suggest_int("iterations", 50, 500, step=1),
        'model__depth': trial.suggest_int("depth", 4, 10, step=1),
        'model__learning_rate': trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        'model__loss_function': 'Logloss',
        'model__verbose': True
    }
    # Categorical columns are one-hot encoded by the pipeline's preprocessor,
    # so cat_features is not passed to CatBoost here.
    # For reference, a fixed configuration would be:
    # CatBoostClassifier(iterations=2, depth=2, learning_rate=1,
    #                    loss_function='Logloss', verbose=True)

    model = pipeline.set_params(**params)
    scores = cross_val_score(model, X_train, y_train, cv=10, scoring='f1', verbose=3, error_score='raise')

    return scores.mean()

#### - GridSearchCV
#### - HyperOpt
#### - Optuna

In [None]:
# Create a study object
study = optuna.create_study(direction="maximize")

# Optimize the objective function
study.optimize(objective, n_trials=10)

## 3.1 Supervised Learning

In [None]:

### Linear Regression
#### - Ridge
#### - Lasso

### Logistic Regression

### Support Vector Machines

### Random Forests

### K Nearest Neighbours

### Naive Bayes

### Gradient Boosting
#### - XGBoost

### Neural Networks



### Outlier Detection
#### - Isolation Forest
#### - One Class Classification


In [None]:
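from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data,
                                                    iris.target,
                                                    test_size=0.4,
                                                    random_state=1)
# Binary problem: class 1 versus the rest
svm_class_1 = SVC()
svm_class_1.fit(X_train, y_train == 1)
print(svm_class_1.score(X_test, y_test == 1))  # held-out accuracy

# plot_classifier is a plotting helper that is not defined in these notes;
# uncomment once it is available
# plot_classifier(X_train, y_train == 1, svm_class_1)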

## 3.2 Unsupervised Learning

In [None]:
### Clustering
#### - k-Means
#### - Hierarchical
#### - DBSCAN
#### - Gaussian Mixture Models

## 3.4 Semi-Supervised Learning


## 3.5 Deep Learning

In [None]:
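import torch
import torch.nn as nn


# RNN variant (every class in this cell is named Net, so running the cell keeps only the last one)
class Net(nn.Module):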
    def __init__(self):
        super().__init__()
        # Define RNN layer
        self.rnn = nn.RNN(
            input_size=1,
            hidden_size=32,
            num_layers=2,
            batch_first=True,
        )
        self.fc = nn.Linear(32, 1)

    def forward(self, x):
        # Initialize first hidden state with zeros
        h0 = torch.zeros(2, x.size(0), 32)
        # Pass x and h0 through recurrent layer
        out, _ = self.rnn(x, h0)  
        # Pass recurrent layer's last output through linear layer
        out = self.fc(out[:, -1, :])
        return out
		
		
# LSTM variant
class Net(nn.Module):
    def __init__(self, input_size=1):
        super().__init__()
        # Define LSTM layer
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=32,
            num_layers=2,
            batch_first=True,
        )
        self.fc = nn.Linear(32, 1)

    def forward(self, x):
        h0 = torch.zeros(2, x.size(0), 32)
        # Initialize long-term memory
        c0 =  torch.zeros(2, x.size(0), 32)
        # Pass all inputs to lstm layer
        out, _ = self.lstm(x,(h0,c0))
        out = self.fc(out[:, -1, :])
        return out
		
# GRU variant
class Net(nn.Module):
    def __init__(self, input_size=1):
        super().__init__()
        # Define GRU layer
        self.gru = nn.GRU(
            input_size=input_size,
            hidden_size=32,
            num_layers=2,
            batch_first=True,
        )
        self.fc = nn.Linear(32, 1)

    def forward(self, x):
        h0 = torch.zeros(2, x.size(0), 32)
        out, _ = self.gru(x, h0)  
        out = self.fc(out[:, -1, :])
        return out

## 5. Reinforcement Learning


## 6. Metrics


## 7. Other
### Causality
#### - Propensity



## 8. Production
#### 1. AP455.md
#### 2. aep_manifest
#### 3. entrypoint.sh

### Model Lifecycling
#### - MLflow