# 🌟 Operation NeuroNexus: Outsmarting SkyNet

Trondheim lies under the iron grip of SkyNet, an AI system that has seized control of the city's entire digital infrastructure. As the last line of defense against total machine domination, you and your team of elite hackers have been tasked with a crucial mission: infiltrate SkyNet's systems, decode its defenses, and liberate the city from its digital oppressor.

## 🎯 Mission Overview

Operation NeuroNexus consists of four independent, yet interconnected missions. Each mission targets a different aspect of SkyNet's infrastructure and requires you to apply various Supervised Learning techniques covered in this course. Your objective: outsmart the AI at its own game.

## 📊 Mission Structure

1. Each mission has a specific task related to combating SkyNet.
2. Following the task description, you'll find a set of formal requirements that your solution must meet.
3. The primary measure of your success is the accuracy of your machine learning model. In this battle of human vs. AI, performance is key.
4. After completing each task, you should answer a series of questions to demonstrate your understanding of the techniques used.

## 🧪 A Note on Test Data

In a departure from real-world scenarios, you will have access to the target variables of the test sets for each mission. This has been arranged to facilitate the evaluation of your models. However, remember that in actual machine learning projects, test targets are not available, as predicting these is the ultimate goal of your supervised models.

## 📝 Submission Guidelines

- For each mission, provide your code solution and model results inside this notebook.
- Answer the follow-up questions in markdown format within this notebook. A few sentences are enough; there is no length requirement.
- Ensure your explanations are clear, concise, and demonstrate a deep understanding of the techniques employed.


Good luck! The resistance is counting on you.

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## 🌞 Mission 1: Predicting SkyNet's Power Consumption

### 🎯 The Mission
Intelligence suggests that SkyNet's central core has a critical weakness: **its power consumption**. We must understand its energy needs to plan a coordinated strike and temporarily disable its defenses.

### 🧠 Your Task
Develop a predictive model to estimate SkyNet's power consumption based on its **Network Activity**.

**Goal**: Implement a **Linear Regression model using Gradient Descent, from scratch**.

Use the `LinearRegression` class from `linear_regression.py`, stored in this folder. Your task is to complete two functions: `fit` (find the optimal parameters of the regression) and `predict` (apply them to the test data).

> Note: The `%autoreload` IPython magic command allows instant updates from `linear_regression.py`.

### 📊 Formal Requirements
1. **Implementation**: 
   - Use standard Python libraries (numpy, math, pandas, etc.)
   - Implement gradient descent

2. **Discussion**:

   a. Visualize the fitted curve. Derive the resulting Energy consumption formula.
   
   b. Analyze prediction error distribution. What is an unbiased estimator?

---


In [None]:
%load_ext autoreload
%autoreload 2

In [25]:
from linear_regression import LinearRegression 
from logistic_regression import LogisticRegression 


In [None]:
# Data
data = pd.read_csv('mission1.csv')

plt.figure(figsize=(8, 6))
plt.scatter(data['Net_Activity'], data['Energy'], c='blue', label='Data points')
plt.grid(True)
plt.xlabel('Network Activity', fontsize=14)
plt.ylabel('Energy', fontsize=14)
plt.title('Energy vs. Traffic', fontsize=16)
plt.legend()
plt.show()

In [None]:
data = pd.read_csv('mission1.csv')

X = data[['Net_Activity']].values  # 2D array
y = data['Energy'].values

linreg = LinearRegression(lr=0.0044, n_iters=10000)

linreg.fit(X, y)

y_pred = linreg.predict(X)

plt.figure(figsize=(6, 4))
plt.grid(True)
plt.scatter(data['Net_Activity'], data['Energy'], color='blue', label='Data points')
plt.plot(data['Net_Activity'], y_pred, color='red', label='Regression line')
plt.xlabel('Net Activity', fontsize=14)
plt.ylabel('Energy', fontsize=14)
plt.title('Energy vs. Net Activity', fontsize=16)
plt.legend()
plt.show()

mse = np.mean((y_pred - y) ** 2)
print("Weights:", linreg.weights)
print("Bias:", linreg.bias)
print("Mean Squared Error:", mse)


## Discussion 
a.
The formula for linear regression is given by: 
Energy = weight * Net Activity + bias. 

So based on our output above, we get: 
### Energy = 3.017 * Net Activity + 4.72. 

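Gradient descent minimizes the mean squared error (MSE), the average squared difference between the predicted and the observed Energy. A lower MSE means the line fits the data more closely.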
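
The fit above comes from the course's `LinearRegression` class, whose code is not shown in this notebook. As a rough illustration of how a gradient-descent fit of `Energy = weight * Net Activity + bias` can work, here is a minimal sketch on synthetic data. The function name `fit_linear_gd`, the synthetic slope and intercept, and the hyperparameters are assumptions made for this example, not the contents of `linear_regression.py`.

```python
import numpy as np

def fit_linear_gd(X, y, lr=0.01, n_iters=5000):
    """Fit y ~ X @ w + b by batch gradient descent on the MSE (illustrative helper)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        residual = X @ w + b - y                       # prediction error per sample
        grad_w = (2.0 / n_samples) * (X.T @ residual)  # d(MSE)/dw
        grad_b = (2.0 / n_samples) * residual.sum()    # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic data from a known line (slope 3, intercept 5) plus Gaussian noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + 5.0 + rng.normal(0, 1, size=200)

w_hat, b_hat = fit_linear_gd(X_demo, y_demo)
print("weight:", w_hat[0], "bias:", b_hat)
```

The recovered weight and bias should land close to the values used to generate the data. If the learning rate is too large for the scale of the feature, the updates diverge, which is one reason a small value such as the `lr=0.0044` used above may be needed.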

b. 

In [None]:
error = y - linreg.predict(X)

plt.figure()
plt.hist(error, bins=30, edgecolor='k')
plt.xlabel('Error')
plt.ylabel('Frequency')
plt.title('Prediction Error Distribution')
plt.grid(True)
plt.show()

b. Analyze prediction error distribution. What is an unbiased estimator? 
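- An estimator is unbiased if its expected value equals the true value of the parameter being estimated, so the average of the estimates over many samples converges to the true value. An unbiased gradient estimator matters because it ensures that, on average, each step moves in the correct direction towards the optimum. For this model, the prediction errors in the histogram should be centered around zero; a histogram shifted to one side would indicate that the model systematically over- or underestimates Energy.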
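
To make the definition concrete, the following sketch compares two estimators of a variance by simulation: dividing by `n` (biased) and dividing by `n - 1` (unbiased). The population parameters, sample size and number of trials are arbitrary choices for illustration and are unrelated to the mission data.

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0              # variance of the simulated population (std = 2)
n, n_trials = 5, 100_000    # small samples, many repetitions

samples = rng.normal(loc=0.0, scale=2.0, size=(n_trials, n))

# Averaging each estimator over many repeated samples approximates its expected value.
mean_biased = samples.var(axis=1, ddof=0).mean()    # divides by n
mean_unbiased = samples.var(axis=1, ddof=1).mean()  # divides by n - 1

print(f"true variance:        {true_var:.3f}")
print(f"divide by n (biased): {mean_biased:.3f}")
print(f"divide by n - 1:      {mean_unbiased:.3f}")
```

The `n - 1` version averages out close to the true variance, while the `n` version is systematically too low. For the regression above, the analogous check is whether `error.mean()` is close to zero.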


## 🧠 Mission 2: Decoding SkyNet's Neural Encryption

### 🌐 The Discovery
SkyNet has evolved, using a "Synapse Cipher" that mimics human neural patterns. We've intercepted two types of neural signals that may determine SkyNet's next moves.

### 🎯 Your Mission
1. Evolve your linear regression into logistic regression
2. Engineer features to unravel hidden neural connections
3. Predict SkyNet's binary decisions (0 or 1) from paired signals

### 📊 Formal Requirements
1. **Implementation**: 
   - Use standard Python libraries
   - Implement gradient descent

2. **Performance**: Achieve at least 0.88 accuracy on the test set

3. **Discussion**:

   a. Explain poor initial performance and your improvements

   b. What is the model's inductive bias? Why is it important?

   c. Try to solve the problem using `sklearn.tree.DecisionTreeClassifier`. Can it solve the problem? Why/Why not?
   
   d. Plot the ROC curve

---

In [None]:
data = pd.read_csv('mission2.csv')

train = data[data['split'] == 'train'].copy()
test = data[data['split'] == 'test'].copy()

X_train = train[['x0', 'x1']].values  # .values to get np array
y_train = train['y'].values
X_test = test[['x0', 'x1']].values
y_test = test['y'].values

logreg = LogisticRegression(lr=0.01, n_iters=1000)
logreg.fit(X_train, y_train)

y_pred_test = logreg.predict(X_test)

accuracy = np.mean(y_pred_test == y_test)
print(f"Accuracy: {accuracy * 100:.2f}%")

In [None]:
train = data[data['split'] == 'train'].copy()
test = data[data['split'] == 'test'].copy()

train['x2'] = train['x0'] * train['x1']
test['x2'] = test['x0'] * test['x1']

X_train = train[['x0', 'x1', 'x2']].values
y_train = train['y'].values
X_test = test[['x0', 'x1', 'x2']].values
y_test = test['y'].values

logreg = LogisticRegression(lr=0.01, n_iters=1000)
logreg.fit(X_train, y_train)

y_pred_test = logreg.predict(X_test)
accuracy = np.mean(y_pred_test == y_test)
print(f"Accuracy: {accuracy * 100:.2f}%")

In [None]:
train['x2'] = np.where(train['x2'] < 0, 1, -1)
test['x2'] = np.where(test['x2'] < 0, 1, -1)

X_train = train[['x2']].values
y_train = train['y'].values
X_test = test[['x2']].values
y_test = test['y'].values

reg = LogisticRegression(lr=0.01, n_iters=1000)

reg.fit(X_train, y_train)
y_pred_test = reg.predict(X_test)

accuracy_test = np.mean(y_pred_test == y_test)

print(f"Test Accuracy: {accuracy_test * 100:.2f}%")

   a. Explain poor initial performance and your improvements
    - The initial performance was poor because the model tried to classify the data using only the two raw features (x0 and x1), and the classes are not linearly separable in that space: no straight line in the (x0, x1) plane separates them accurately. We therefore added a new feature, x2 = x0 * x1. In the 3D space (x0, x1, x2) the problem becomes linearly separable, which significantly improved the model's performance.

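   b. What is the model's inductive bias? Why is it important?
      - The inductive bias is the set of assumptions a learning algorithm uses to generalize from the training data to unseen data. Logistic regression assumes that the classes can be separated by a linear decision boundary in feature space (the log-odds are a linear function of the features). This matters because the model can only predict well on unseen data when that assumption roughly holds, which is why the raw features failed and the engineered feature helped.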

   c. Try to solve the problem using `sklearn.tree.DecisionTreeClassifier`. Can it solve the problem? Why/Why not?
      - Yes, the DecisionTreeClassifier can solve this problem, potentially with even better accuracy than the LR model. This is because decision trees can capture non-linear relationships between features without requiring explicit feature engineering.
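
The answers above can be illustrated on synthetic data. The sketch below assumes an XOR-like pattern (class 1 when the two signals have opposite signs); this is an assumption for the example and not a confirmed property of `mission2.csv`. It uses scikit-learn's logistic regression instead of the from-scratch class so that it runs on its own, and `add_product` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression as SkLogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(2000, 2))
y_demo = (X_demo[:, 0] * X_demo[:, 1] < 0).astype(int)  # class 1 when the signs differ

X_tr, X_te = X_demo[:1500], X_demo[1500:]
y_tr, y_te = y_demo[:1500], y_demo[1500:]

def add_product(X):
    """Append the interaction term x0 * x1 as a third column."""
    return np.column_stack([X, X[:, 0] * X[:, 1]])

# Linear model on the raw features: no straight line separates the classes.
acc_raw = SkLogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
# Linear model with the product feature: the classes split on the sign of x0 * x1.
acc_prod = SkLogisticRegression().fit(add_product(X_tr), y_tr).score(add_product(X_te), y_te)
# Decision tree on the raw features: axis-aligned splits can carve out the quadrants.
acc_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

print(f"logistic regression, raw features: {acc_raw:.3f}")
print(f"logistic regression, with x0*x1:   {acc_prod:.3f}")
print(f"decision tree, raw features:       {acc_tree:.3f}")
```

On data of this kind, the raw linear model stays near chance, while both the engineered feature and the tree recover the pattern.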
   

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

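# Note: if the cells are run in order, X_train and X_test still hold the single
# engineered feature x2 from the previous cell, not the raw x0 and x1.
dt = DecisionTreeClassifier(random_state=69)
dt.fit(X_train, y_train)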

# Make predictions
y_pred_dt = dt.predict(X_test)

# Calculate accuracy
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f"Decision Tree Accuracy: {accuracy_dt:.4f}")

In [None]:
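# Alias the sklearn class so it does not shadow our own LogisticRegression.
from sklearn.linear_model import LogisticRegression as SkLogisticRegression
from sklearn.metrics import roc_curve, auc

# Recompute the sign feature from the raw signals, so that re-running this cell
# does not flip the already-transformed x2 column.
train['x2'] = np.where(train['x0'] * train['x1'] < 0, 1, -1)
test['x2'] = np.where(test['x0'] * test['x1'] < 0, 1, -1)

X_train = train[['x2']].values
y_train = train['y'].values
X_test = test[['x2']].values
y_test = test['y'].values

reg = SkLogisticRegression()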
reg.fit(X_train, y_train)

y_pred_train = reg.predict(X_train)
y_pred_test = reg.predict(X_test)

y_pred_prob = reg.predict_proba(X_test)[:, 1]  

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--', label='No Skill')  
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Logistic Regression')
plt.legend()
plt.grid(True)
plt.show()



## 🌆 Mission 3: CyberGuard

### 🌐 The Discovery
SkyNet's drone communications use quantum entanglement encryption. We need a rapid response system to intercept these messages.

### 🎯 Your Mission
Develop a decision tree classifier to process intercepted communications. Use `sklearn.tree.DecisionTreeClassifier`.

> "Every misclassification risks losing a sector of the city to machine control."

### 🧠 The Challenge
1. **Rarity**: Critical communications are only 20% of the data stream
2. **Quantum Complexity**: Encryption information is hidden in quantum states

### 🚀 Your Tools
- Intercepted AI communications dataset
- Quantum signature analysis skills
- Decision tree algorithm

### 📊 Formal Requirements
1. **Accuracy**: Achieve ROC AUC >= 0.72 on the test set
2. **Discussion**:

   a. Explain your threshold-breaking strategy. Did you change the default hyperparameters?

   b. Justify ROC AUC usage. Plot and interpret ROC.
   
   c. Try to solve the problem using sklearn’s Random Forest Classifier. Compare the results.


---

In [34]:
train = pd.read_csv('mission3_train.csv')
test = pd.read_csv('mission3_test.csv')

In [35]:
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import roc_auc_score, accuracy_score

dtc = DecisionTreeClassifier( # Used GridSearchCV to find the best hyperparameters. Takes a long time to run so removed the code. 
    max_depth=5,
    min_samples_leaf=4,
    min_samples_split=2,
    class_weight=None,
    random_state=69
)

In [36]:
# Transform data_stream_3 

def transform_stream_3(X):
    return (X['data_stream_3'] * 1000).astype(int) % 2

train['data_stream_3_tf'] = transform_stream_3(train)
test['data_stream_3_tf'] = transform_stream_3(test)

# Select features
features = [col for col in train.columns if col.startswith('data_stream') and col != 'data_stream_3_tf']
features.append('data_stream_3_tf')

In [None]:
dtc.fit(train[features], train['target']) # X, y

# Test the model
y_pred_proba = dtc.predict_proba(test[features])[:, 1]
y_pred = dtc.predict(test[features])
auc_score = roc_auc_score(test['target'], y_pred_proba)
accuracy = accuracy_score(test['target'], y_pred)
print(f"Test AUC: {auc_score:.3f}")
print(f"Test Accuracy: {accuracy:.3f}")

# Cross-validation AUC
cv_scores = cross_val_score(dtc, train[features], train['target'], cv=10, scoring='roc_auc')
cv_scores_accuracy = cross_val_score(dtc, train[features], train['target'], cv=10, scoring='accuracy')
print(f"Cross-validation AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
print(f"Cross-validation Accuracy: {cv_scores_accuracy.mean():.3f} (+/- {cv_scores_accuracy.std() * 2:.3f})")



 a. Explain your threshold-breaking strategy. Did you change the default hyperparameters?

   b. Justify ROC AUC usage. Plot and interpret ROC.
   
   c. Try to solve the problem using sklearn’s Random Forest Classifier. Compare the results.

a.
## Tuning of hyperparameters:
To achieve a ROC AUC of 0.72 or higher, I ran a grid search with cross-validation to find the best combination of hyperparameters for the decision tree classifier. I tuned:
- max_depth: limited to 5 (the default is unlimited), deep enough to capture structure but shallow enough to avoid overfitting.
- min_samples_split: kept at 2, the default and smallest allowed value, so this setting adds no extra regularization.
- min_samples_leaf: set to 4. A higher value reduces the chance of overfitting.
- class_weight: can compensate for class imbalance, since only 20% of the data is critical communication. It was left at None.

This resulted in a decision tree that is not too deep but still captures the important structure in the data. Handling the class imbalance did not improve the ROC AUC.
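
The grid search itself was removed from the notebook, so the following is only a sketch of what such a search could look like. The parameter grid and the synthetic data from `make_classification` are assumptions for illustration; the values actually searched are not recorded here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the mission data, with roughly 20% positives.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Hypothetical grid; not the grid used for the results above.
param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 4, 10],
    "min_samples_split": [2, 10],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # optimize the same metric the mission is judged on
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best cross-validated AUC: {search.best_score_:.3f}")
```

Scoring the search with `roc_auc` rather than the default accuracy keeps the model selection aligned with the mission requirement.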
b.
## ROC AUC Usage
ROC AUC was used for evaluation because it is independent of the classification threshold. It measures the classifier's ability to rank positive instances above negative ones, so the threshold can be tuned afterwards to balance the trade-off between true positives and false positives.
It is also robust to class imbalance. The task states that only 20% of the data is critical communication, so accuracy could be misleading. ROC AUC focuses on the classifier's ranking ability regardless of the class distribution.
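
A small sketch of why accuracy can mislead at a 20% positive rate. The labels and scores below are synthetic and chosen only for illustration: one "model" always predicts the majority class, the other produces noisy scores that tend to be higher for positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([1] * 200 + [0] * 800)  # 20% positives

# Model A: constant score, so every sample is predicted negative.
scores_a = np.zeros(1000)
# Model B: noisy scores that are higher on average for positives.
scores_b = y_true * 0.5 + rng.normal(0, 0.5, size=1000)

acc_a = accuracy_score(y_true, (scores_a > 0.5).astype(int))
auc_a = roc_auc_score(y_true, scores_a)
acc_b = accuracy_score(y_true, (scores_b > 0.5).astype(int))
auc_b = roc_auc_score(y_true, scores_b)

print(f"always negative: accuracy={acc_a:.3f}, AUC={auc_a:.3f}")
print(f"noisy ranker:    accuracy={acc_b:.3f}, AUC={auc_b:.3f}")
```

The constant model reaches 80% accuracy without detecting a single critical message, yet its AUC is 0.5, the same as random ranking. The noisy ranker carries real signal, which shows up in the AUC even when its accuracy at a fixed threshold is unremarkable.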

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, roc_curve
from sklearn.model_selection import cross_val_score

def plot_roc(y_true, dt_y_pred_proba, rf_y_pred_proba):
    dt_fpr, dt_tpr, _ = roc_curve(y_true, dt_y_pred_proba) #decision tree FP rate, DT TP rate
    rf_fpr, rf_tpr, _ = roc_curve(y_true, rf_y_pred_proba) #random forest FP rate, RF TP rate
    
    dt_auc_score = roc_auc_score(y_true, dt_y_pred_proba) # decision tree AUC score
    rf_auc_score = roc_auc_score(y_true, rf_y_pred_proba) # random forest AUC score
    
    plt.figure(figsize=(8, 6))
    plt.plot(dt_fpr, dt_tpr, label=f'Decision Tree (AUC = {dt_auc_score:.2f})')
    plt.plot(rf_fpr, rf_tpr, label=f'Random Forest (AUC = {rf_auc_score:.2f})')
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve - Decision Tree vs Random Forest')
    plt.legend(loc='lower right')
    plt.savefig('tree_mission_3.1.pdf')  # save before show(); saving after show() can write an empty figure
    plt.show()


# Create the DT classifier with hyperparameters found with GridSearchCV
dtc = DecisionTreeClassifier(
    max_depth=5,
    min_samples_leaf=4,
    min_samples_split=2,
    class_weight=None,
    random_state=69
)

dtc.fit(train[features], train['target'])

dt_y_pred_proba = dtc.predict_proba(test[features])[:, 1]
dt_y_pred = dtc.predict(test[features])
dt_auc_score = roc_auc_score(test['target'], dt_y_pred_proba)
dt_accuracy = accuracy_score(test['target'], dt_y_pred)
print(f"Decision Tree - Test AUC: {dt_auc_score:.3f}")
print(f"Decision Tree - Test Accuracy: {dt_accuracy:.3f}")

rfc = RandomForestClassifier(random_state=69)

rfc.fit(train[features], train['target'])

rf_y_pred_proba = rfc.predict_proba(test[features])[:, 1]
rf_y_pred = rfc.predict(test[features])
rf_auc_score = roc_auc_score(test['target'], rf_y_pred_proba)
rf_accuracy = accuracy_score(test['target'], rf_y_pred)
print(f"Random Forest - Test AUC: {rf_auc_score:.3f}")
print(f"Random Forest - Test Accuracy: {rf_accuracy:.3f}")

plot_roc(test['target'], dt_y_pred_proba, rf_y_pred_proba)

## Interpretation of the ROC curve:
Both the RF and the DT are reasonable choices here, with an AUC of 0.72 and 0.73 respectively, so both reach the required 0.72. The ROC curve shows the trade-off between the true positive rate and the false positive rate as the decision threshold varies. The closer the curve is to the top-left corner, the better the model; the dashed diagonal corresponds to random guessing.
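
Because each point on the curve corresponds to one threshold, the curve can also be used to pick an operating point. The sketch below uses Youden's J statistic (the threshold that maximizes TPR minus FPR), which is one common heuristic and not something this notebook applied; the scores are synthetic.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_true = np.array([1] * 200 + [0] * 800)  # 20% positives, as in the mission

# Synthetic scores: positives are centered at 0.6, negatives at 0.4.
scores = np.where(
    y_true == 1,
    rng.normal(0.6, 0.2, size=1000),
    rng.normal(0.4, 0.2, size=1000),
)

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Youden's J: the point on the curve farthest above the random-guess diagonal.
best = np.argmax(tpr - fpr)
print(f"threshold={thresholds[best]:.3f}, TPR={tpr[best]:.3f}, FPR={fpr[best]:.3f}")
```

When a missed critical message is more costly than a false alarm, a lower threshold than the one chosen by this heuristic may be preferable.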


## ⚡ Final Mission: Mapping SkyNet's Energy Nexus

### 🌐 The Discovery
SkyNet is harvesting energy from Trondheim's buildings. Some structures provide significantly more power than others.

### 🎯 Your Mission
Predict the "Nexus Rating" of unknown buildings in Trondheim (test set).

### 🧠 The Challenge
1. **Target**: Transform the Nexus Rating to reveal true energy hierarchy
2. **Data Quality**: Handle missing values and categorical features
3. **Ensembling**: Use advanced models and ensemble learning

### 📊 Formal Requirements
1. **Performance**: Achieve RMSLE <= 0.294 on the test set
2. **Discussion**:

   a. Explain your threshold-breaking strategy

   b. Justify RMSLE usage. Why do we use this metric? Which loss function did you use?

   c. Plot and interpret feature importances

   d. Describe your ensembling techniques

   e. In real life, you do not have the test targets. How would you make sure your model will work well on the unseen data? 

---

In [39]:
train = pd.read_csv('final_mission_train.csv')
test = pd.read_csv('final_mission_test.csv')

import pandas as pd
import numpy as np
from ensemble_learning import EnsembleLearning
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
print("Original test data:")
print(test.head())

# Shift data in the columns of the test data
test_shifted = test.copy()
test_shifted.iloc[:, 1:] = test.iloc[:, :-1].values
test_shifted['nexus_rating'] = test.iloc[:, -1]

# shifted test data
print("\nShifted test data:")
print(test_shifted.head())

In [41]:
from sklearn.metrics import mean_squared_log_error

def rmsle(y_test, y_pred):
    """ Root Mean Squared Logarithmic Error """
    return np.sqrt(mean_squared_log_error(y_test, y_pred))
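
The `rmsle` helper above compares `log(1 + y)` of the target and the prediction, so it reacts to relative rather than absolute differences and treats under- and overestimates asymmetrically. The following sketch shows this with made-up numbers; `rmsle_demo` is a copy of the helper under a different name so the example runs on its own.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

def rmsle_demo(y_true, y_pred):
    """Root Mean Squared Logarithmic Error (same formula as the helper above)."""
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

y_true = np.array([100.0])
under = rmsle_demo(y_true, np.array([50.0]))   # prediction is 50 too low
over = rmsle_demo(y_true, np.array([150.0]))   # prediction is 50 too high

print(f"underestimate by 50: {under:.4f}")
print(f"overestimate by 50:  {over:.4f}")
```

Both predictions miss by the same absolute amount, but the underestimate is penalized more heavily because the logarithm grows faster at smaller values.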

In [None]:
EL = EnsembleLearning()
test_shifted = EL.shift_data(test)
X_train, X_test, y_train, y_test = EL.prepare_data(train, test_shifted)
EL.fit(X_train, y_train)
y_pred = EL.predict(X_test)
rmsle_score = EL.evaluate(y_test, y_pred)  # separate name so the rmsle() helper is not overwritten
print(f"Model RMSLE: {rmsle_score:.4f}")
print(f"Required RMSLE: 0.294")

Model RMSLE: 0.2981
Required RMSLE: 0.294

In [None]:
from catboost import CatBoostRegressor

import matplotlib.pyplot as plt

X_train, X_test, y_train, y_test = EL.prepare_data(train, test_shifted)

catboost_model = CatBoostRegressor(iterations=1000, learning_rate=0.1, depth=6, silent=True)
catboost_model.fit(X_train, y_train)
feature_importances = catboost_model.get_feature_importance()
feature_names = train.columns.drop('nexus_rating')

plt.figure()
plt.barh(feature_names, feature_importances)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('CatBoost Feature Importance')
plt.show()

a. Explain your threshold-breaking strategy

   b. Justify RMSLE usage. Why do we use this metric? Which loss function did you use? 
   - RMSLE (Root Mean Squared Logarithmic Error) was chosen as our evaluation metric because it measures relative rather than absolute error and penalizes underestimation more heavily than overestimation. This fits our goal of avoiding underestimates of building Nexus Ratings. Mean Squared Error (MSE) was used as the loss function during training, as it is a common choice for regression problems.

   c. Plot and interpret feature importances
   - The feature importance plot shows how much the model relies on each feature. The higher the value, the more important the feature is. In the plot above we see that energy consumption is the most important feature. This means the model depends on energy consumption more than on any other feature when predicting the Nexus Rating of a building; it does not by itself show a causal effect.

   d. Describe your ensembling techniques
   - We used a Random Forest Regressor as our main model and a Gradient Boosting Regressor as a secondary model, and combined the predictions of these two models to make the final prediction. If the predictions are simply averaged, this is usually called averaging or blending; stacking refers to training a meta-model on the base models' predictions. Combining models in this way often leads to better performance than using a single model.

   e. In real life, you do not have the test targets. How would you make sure your model will work well on the unseen data? 
   - To enhance our model's performance on unseen data, we employ several strategies:
      * Cross-validation: Consider model stability across different data subsets.
      * Hyperparameter tuning: Optimize model parameters for better generalization.
      * Feature engineering: Create features to capture patterns in our data.
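
As a sketch of how ensembling and cross-validation fit together, the example below compares an averaging ensemble with a stacking ensemble using 5-fold cross-validated RMSLE, which requires no test targets. The internals of the `EnsembleLearning` class are not shown in this notebook, so this is not its implementation: the synthetic data, the `rmsle_clipped` helper and all hyperparameters are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
    VotingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic regression data with a strictly positive, skewed target.
X_demo, y_raw = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
y_demo = np.exp(y_raw / y_raw.std())

def rmsle_clipped(y_true, y_pred):
    """RMSLE computed with NumPy; negative predictions are clipped to 0 first."""
    y_pred = np.clip(y_pred, 0, None)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# greater_is_better=False makes the scorer return the negated RMSLE.
scorer = make_scorer(rmsle_clipped, greater_is_better=False)

base = [
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingRegressor(n_estimators=50, random_state=0)),
]
models = {
    "averaging": VotingRegressor(base),                            # mean of base predictions
    "stacking": StackingRegressor(base, final_estimator=Ridge()),  # meta-model on base predictions
}

cv_rmsle = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring=scorer)
    cv_rmsle[name] = -scores.mean()  # undo the sign flip applied by the scorer
    print(f"{name}: cross-validated RMSLE = {cv_rmsle[name]:.4f}")
```

The cross-validated score estimates performance on unseen data using the training set alone, so model choices and hyperparameters can be compared without touching the test targets.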