In [1]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from google.colab import drive
import os
from os import listdir

print('Successful imports')

Successful imports


In [2]:
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [3]:
%cd '/content/gdrive/MyDrive/Machine Learning in healthcare/Final project'

/content/gdrive/MyDrive/Machine Learning in healthcare/Final project


 ## IDEAS

 CLUSTERING - GMM

 DIMENSIONALITY REDUCTION - UMAP + plot to observe data separation in latent space

 Study pairwise mutual information + HSIC + correlations

 Association rules (Apriori) - observe which items usually go together

 Graph analysis on a graph version of the latent space: run algorithms for strongly connected components. Something that accounts for intra-class noise and differences between classes (or clusters)



## Feature selection

In [7]:
data = pd.read_csv('Eureca_Adapted.csv', low_memory=False)

In [5]:
all = data.columns
cols = []

In [6]:
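# Hand-picked clinical and demographic variables of interest.
# Note: 'desire_of_death' and 'Planification_of_suicide' are listed twice below.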
mixture = ['type', 'suicide','EST_CIV', 'children', 'ideation_of_suicide', 'desire_of_death', 'Planification_of_suicide', 'tm', 'tm2',
        'tabacco_act', 'niv_edu','his_fam_suicide_behavior', 'his_fam_suicide', 'his_fam_suicide_attempt',
        'desire_of_death', 'Planification_of_suicide', 'dd_depre', 'dd_bipolar', 'dd_sz', 'dd_psychotic']

In [102]:
# BDHI and BIS scales
BDHIS = [col for col in all if ("BDHI" in col)]
BIS = [col for col in all if ("BIS" in col)]



In [103]:
# polymorphisms and other data are in the last 30 columns
extra = list(data.columns[-30:])
cols.extend(mixture + BDHIS + BIS + extra)


In [107]:
print(f"Original data dimension: {len(all)}")
print(f"Current subset dimension: {len(cols)}")


Original data dimension: 604
Current subset dimension: 170


We have removed 434 dimensions; however, we want to work with roughly 50. To reduce the dimensionality further, we use the mutual information score between pairs of features.

Mutual Information (MI) is a measure of the statistical dependence or information shared between two random variables. It quantifies the amount of information obtained about one variable through the observation of another variable. In the context of feature selection, mutual information is often used to assess the relationship between features.
For discrete random variables X and Y, the mutual information (MI) is calculated using the following formula:

$$ I(X;Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) \log\left(\frac{P(x, y)}{P(x)P(y)}\right) $$

Here:
- $I(X;Y)$ is the mutual information between X and Y.
- $P(x, y)$ is the joint probability mass function of X and Y.
- $P(x)$ and $P(y)$ are the marginal probability mass functions of X and Y, respectively.

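The score is non-negative and equals $0$ exactly when X and Y are independent; the larger it is, the stronger the relationship between the variables. It is bounded above by $\min(H(X), H(Y))$, so it is not confined to $[0, 1]$ unless it is normalized (for example with scikit-learn's `normalized_mutual_info_score`).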
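
As a sanity check on the formula, the sketch below computes a plug-in estimate of $I(X;Y)$ directly from the double sum and compares it with scikit-learn's `mutual_info_score`. The helper name `mutual_information` and the toy arrays are made up for illustration and are not part of the dataset. Note that `mutual_info_score` uses the natural logarithm, so the result is in nats and can exceed 1, whereas `normalized_mutual_info_score` rescales it to $[0, 1]$.

```python
import numpy as np
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats from two discrete samples."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)                      # marginal P(x)
        for yv in np.unique(y):
            py = np.mean(y == yv)                  # marginal P(y)
            pxy = np.mean((x == xv) & (y == yv))   # joint P(x, y)
            if pxy > 0:                            # 0 * log(0) is taken as 0
                mi += pxy * np.log(pxy / (px * py))
    return mi


# Toy samples (illustrative only)
x = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y = x.copy()                                # fully dependent on x
z = np.array([0, 1, 0, 1, 0, 1, 0, 1])      # independent of x

print(mutual_information(x, y))             # ln(4) ≈ 1.386, larger than 1
print(mutual_info_score(x, y))              # same value from scikit-learn
print(mutual_information(x, z))             # 0.0 for independent variables
print(normalized_mutual_info_score(x, y))   # rescaled to 1.0
```

The first two printed values agree, which suggests that thresholds applied to `mutual_info_score` should be read on the nats scale rather than as fractions of 1.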



BIS (Behavioural Inhibition System) scales are psychological instruments designed to measure an individual's sensitivity to punishment and threat. The BIS scales typically include items related to anxiety, caution, and avoidance in the face of punishment or uncertain situations. Higher scores on these scales indicate greater sensitivity to potential negative outcomes. Our dataset contains 34 versions of it, plus three particular cases for the impulsivity and planification scores.

In [71]:
from sklearn.metrics import mutual_info_score
import pandas as pd

def filterMU(data, referenceColumn, threshold=0.1):
    """
    Select columns whose mutual information with a reference column is below a threshold.

    Columns that share a lot of information with the reference are treated as
    redundant and dropped.

    Parameters:
    - data: DataFrame
        The input DataFrame.
    - referenceColumn: str
        The column name that will be used as the reference.
    - threshold: float, optional (default=0.1)
        The threshold for the mutual information score (in nats). Columns with scores at or above this threshold are filtered out.

    Returns:
    - list
        Column names (excluding the reference) whose mutual information score is strictly below the threshold.
    """
    selectedColumns = []
    for col in data.columns:
        if col == referenceColumn:
            continue
        miScore = mutual_info_score(data[referenceColumn], data[col])
        # keep only the columns that are not too informative about the reference
        if miScore < threshold:
            selectedColumns.append(col)
    return selectedColumns
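
To make the selection rule concrete: the function keeps the columns that pass the `miScore < threshold` test, that is, the ones sharing little information with the reference. The sketch below reproduces that rule on a toy DataFrame. The helper `low_mi_columns`, the column names, and the values are all hypothetical and only mirror the logic of `filterMU`.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy data invented for illustration (not taken from the Eureca dataset)
toy = pd.DataFrame({
    "ref":       [0, 0, 1, 1, 2, 2, 3, 3],
    "copy":      [0, 0, 1, 1, 2, 2, 3, 3],   # identical to ref, MI = ln(4)
    "unrelated": [0, 1, 0, 1, 0, 1, 0, 1],   # independent of ref, MI = 0
})


def low_mi_columns(frame, reference, threshold):
    """Hypothetical helper: columns whose MI with `reference` is below `threshold`."""
    return [
        col for col in frame.columns
        if col != reference
        and mutual_info_score(frame[reference], frame[col]) < threshold
    ]


kept = low_mi_columns(toy, "ref", 0.4)
print(kept)  # ['unrelated']: the redundant 'copy' column is dropped
```

Under this rule, raising the threshold keeps more columns and lowering it keeps fewer.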

In [76]:
# reBIS: BIS columns whose MI with BIS34 is below .4, plus the columns after the first 34 (BIS[34:])
reBIS = filterMU(data[BIS], 'BIS34', .4)
reBIS.extend(BIS[34:])

The same reasoning is applied to the BDHI scale, for which we have 83 columns.

In [91]:
# A threshold of .4 filters out every BDHI column and .5 filters out none, so we use .445.
# Only the first 75 BDHI columns are filtered; the BDHI-specific columns are kept as they are.
reBDHIS = filterMU(data[BDHIS[:75]], 'BDHI75', .445)
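
Rather than probing thresholds one at a time, it can help to look at the full list of scores and count how many columns each candidate threshold would keep. The sketch below does this on synthetic data. The helper `mi_against`, the `item*` columns, and the candidate thresholds are assumptions chosen for illustration, not values from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score


def mi_against(frame, reference):
    """Hypothetical helper: MI (in nats) of every other column against `reference`."""
    return pd.Series(
        {col: mutual_info_score(frame[reference], frame[col])
         for col in frame.columns if col != reference}
    ).sort_values()


# Synthetic items: each one copies the reference, with a growing share of
# entries replaced by random noise.
rng = np.random.default_rng(0)
ref = rng.integers(0, 4, size=500)
toy = pd.DataFrame({"ref": ref})
for i, replaced_share in enumerate([0.0, 0.1, 0.3, 0.6, 1.0]):
    noise = rng.integers(0, 4, size=500)
    mask = rng.random(500) < replaced_share
    toy[f"item{i}"] = np.where(mask, noise, ref)

scores = mi_against(toy, "ref")
print(scores)

# Number of columns that would survive the `score < threshold` test
kept_counts = {t: int((scores < t).sum()) for t in (0.1, 0.4, 0.8, 1.5)}
print(kept_counts)
```

Inspecting the sorted scores shows where the gaps between columns lie, which is a more direct way to place a cut-off than trying values between .4 and .5 by hand.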


In [108]:
# note: BDHIS[75] falls in neither BDHIS[:75] (filtered above) nor BDHIS[76:]
reducedFeatures = mixture + reBIS + reBDHIS + BDHIS[76:] + extra


The same reasoning applies to other scales present in the data.

In [110]:
subset = data[reducedFeatures]
subset.to_csv('Eureca2.0.csv', index=False)
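
Because `mixture` lists `desire_of_death` and `Planification_of_suicide` twice, selecting with that list yields duplicated columns in the saved file. A minimal sketch of an order-preserving de-duplication is shown below on a toy DataFrame. The frame and the `features` list are invented for illustration, and whether to de-duplicate before saving is a design choice, not something the notebook does.

```python
import pandas as pd

# Toy frame and feature list (illustrative only)
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
features = ["a", "b", "a", "c", "b"]

# Selecting with repeated names returns repeated columns
print(toy[features].shape)  # (2, 5)

# dict.fromkeys keeps the first occurrence of each name, in order
unique_features = list(dict.fromkeys(features))
subset_unique = toy[unique_features]
print(unique_features)      # ['a', 'b', 'c']
print(subset_unique.shape)  # (2, 3)
```

Applied to the notebook, the same idea would be `list(dict.fromkeys(reducedFeatures))` before indexing `data`.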

In [None]:
data.head()

In [None]:
data[BIS]

## Exploratory data analysis: understanding our data