# Introduction
Attention deficit hyperactivity disorder (ADHD; TDAH in Spanish, the abbreviation used throughout this notebook) is characterized by excessive motor activity, inattention, and impulsivity. HTR1B is a serotonin receptor. Polymorphisms of the HTR1B gene have been linked to several psychiatric disorders, such as ADHD and obsessive-compulsive disorder (OCD).

In [249]:
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import cross_val_score

# Dataset description

Each Excel sheet, except the first one, has a table with data for one molecular marker.

First, I will merge all the markers, keeping only the records that have every marker available. This way we can train the logistic model taking the other markers into account, to better identify strong influences on the TDAH diagnosis.
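
As a minimal sketch of this merging strategy, the toy example below joins three small tables on `CODIGO` with an inner join, so only individuals present in every table survive. The table contents are hypothetical stand-ins, not the study data, and `functools.reduce` is just one way to chain the merges.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for three marker sheets (hypothetical values, not the study data).
sheets = {
    "1.HTR1B": pd.DataFrame({"CODIGO": ["A", "B", "C"], "1.HTR1B": [1, 2, 3]}),
    "2.HTR2A": pd.DataFrame({"CODIGO": ["B", "C", "D"], "2.HTR2A": [2, 2, 1]}),
    "3.DRD2": pd.DataFrame({"CODIGO": ["C", "B", "E"], "3.DRD2": [1, 3, 2]}),
}

# An inner join on CODIGO keeps only the individuals present in every sheet.
merged = reduce(
    lambda left, right: left.merge(right, on="CODIGO", how="inner"),
    sheets.values(),
)
print(merged)
```

Only the codes shared by all three toy tables remain, which is the same reason the real merged table is much smaller than any single sheet.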

In [250]:
file = "~/PiconCossio/data_science_class/DATOS-TDAH-R_original.xlsx"
xls = pd.ExcelFile(file)
sheets_dict = pd.read_excel(file, sheet_name = xls.sheet_names[1:])

In [251]:
sheets_dict

{'1.HTR1B':     CODIGO  1.HTR1B  TDAH  Edad  Genero  NSE  Adversidad
 0    HH006        2     1     8       2    1           0
 1    HH009        2     1     9       2    3           1
 2    HH014        2     1    15       2    3           1
 3    HH015        1     1    12       1    2           1
 4    HH016        1     0    18       2    3           0
 ..     ...      ...   ...   ...     ...  ...         ...
 388  HP089        1     1    46       1    6           0
 389  HP205        1     0    43       1    4           0
 390  HP211        3     1    46       1    5           0
 391  HP216        2     0    51       1    4           0
 392  HP415        2     0    51       1    2           0
 
 [393 rows x 7 columns],
 '2.HTR2A':     CODIGO  2.HTR2A  TDAH  Edad  Genero  NSE  Adversidad
 0    HH015        2     1    12       1    2           1
 1    HH016        1     0    18       2    3           0
 2    HH022        2     1    15       1    2           1
 3    HH023        3   

In [252]:
for idx, val in enumerate(sheets_dict.keys()):
    if idx == 0:
        df = sheets_dict[val]
        df = pd.concat([df.iloc[:, :3], df.iloc[:,4]], axis=1) ## keeping only CODIGO, SNP, TDAH and Genero
        continue
    df_2 = sheets_dict[val].rename(columns={"CODIGO ": "CODIGO"})
    df = df.merge(df_2.iloc[:,:2], on="CODIGO", how="inner")

## Use the classic 0 for female and 1 for male
df.replace({"Genero":2}, 0, inplace=True)

In [253]:
print(df.shape)
df.head()

(92, 14)


Unnamed: 0,CODIGO,1.HTR1B,TDAH,Genero,2.HTR2A,3.DRD2,4.DRD4,5. DAT1,6.SLC6A4(LS),7.SLC6A4(VNTR),8.HTR2C,9.SNAP25(DdeI),10.SNP25( Mn1I),11. TPH2
0,HH016,1,0,0,1,2,7,5,2,3,3,2,3,1
1,HH022,2,1,1,2,2,10,4,1,3,5,2,3,1
2,HH024,1,0,0,1,3,7,5,1,3,2,2,3,1
3,HH027,1,0,1,1,1,13,3,1,1,5,1,2,1
4,HH028,1,1,1,2,1,2,5,2,2,5,1,2,2


# First statistical insights

I do not take *Edad*, *NSE*, and *Adversidad* into account, because these variables would introduce a lot of noise, hiding the SNP influence and producing false positives. For example, the **6 to 20 years** age range has more data, so it would take a leading role in our model, but that is rather trivial, since TDAH is a condition that appears at an early age.

In [254]:
# Most records are for the female gender
df["Genero"].value_counts()

Genero
0    58
1    34
Name: count, dtype: int64

In [255]:
## Are there more TDAH cases among females?
## Interestingly, no: there are more positive cases for the male gender
df.groupby("Genero")["TDAH"].sum()

Genero
0    14
1    24
Name: TDAH, dtype: int64

Similarly, there is an **imbalance** between females and males. **Females** have a **higher** number of records, and among them **TDAH-positive cases are the minority**, whereas among **males** the opposite holds and **TDAH-positive cases are the majority**. *Genero* would therefore also take a leading role, and it is likely to produce overfitting, resulting in a positive diagnosis every time a male is recorded. For this first analysis, *Genero* and *8.HTR2C* are therefore filtered out.
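
To make the imbalance concrete, the sketch below turns the two outputs above into a positive rate per gender. The counts are copied from those outputs; the column names `n_records`, `n_tdah`, and `tdah_rate` are only illustrative.

```python
import pandas as pd

# Counts copied from the two outputs above (records and TDAH-positive cases per Genero).
counts = pd.DataFrame(
    {"n_records": [58, 34], "n_tdah": [14, 24]},
    index=pd.Index([0, 1], name="Genero"),  # 0 = female, 1 = male
)

# Share of TDAH-positive records within each gender.
counts["tdah_rate"] = counts["n_tdah"] / counts["n_records"]
print(counts.round(3))
```

Roughly a quarter of the female records are positive, against about 70% of the male records, so a model given *Genero* could score well by predicting the label from gender alone.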

On the diagonal of the pairplot below, we can see differences in data collection per SNP.

In [256]:
#sns.pairplot(df, kind="kde", diag_kind="kde", corner=True)

In [257]:
df_autosomal = df.drop(["Genero", "8.HTR2C"], axis=1)

# Logistic regression

Every marker has more than two SNP categories, so each one must be converted to binary indicator columns; otherwise the model would interpret the category codes as having an ascending meaning. I will use one-hot encoding.
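
A small illustration of what one-hot encoding does to a single marker column, using hypothetical genotype codes rather than the study data:

```python
import pandas as pd

# Hypothetical genotype codes for one marker (not the study data).
toy = pd.DataFrame({"marker": [1, 2, 3, 2]})

# Each category becomes its own indicator column, so no ordering is implied.
encoded = pd.get_dummies(toy, columns=["marker"])
print(encoded)
```

Each row has exactly one active indicator, so the codes 1, 2, and 3 are treated as unrelated labels rather than as points on a scale.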

In [258]:
df_autosomal.columns

Index(['CODIGO', '1.HTR1B', 'TDAH', '2.HTR2A', '3.DRD2', '4.DRD4', '5. DAT1',
       '6.SLC6A4(LS)', '7.SLC6A4(VNTR)', '9.SNAP25(DdeI)', '10.SNP25( Mn1I)',
       '11. TPH2'],
      dtype='object')

In [259]:
columns=(df_autosomal.columns[1:2]).append(df_autosomal.columns[3:])
columns

Index(['1.HTR1B', '2.HTR2A', '3.DRD2', '4.DRD4', '5. DAT1', '6.SLC6A4(LS)',
       '7.SLC6A4(VNTR)', '9.SNAP25(DdeI)', '10.SNP25( Mn1I)', '11. TPH2'],
      dtype='object')

In [260]:
## Preprocessing
### The molecular marker column is a categorical variable that uses numeric codes (1, 2, 3). We use one-hot encoding so these categorical codes are not misread as quantities.
df_autosomal_transformed = pd.get_dummies(df_autosomal, columns=(df_autosomal.columns[1:2]).append(df_autosomal.columns[3:]))
df_autosomal_transformed.head()

Unnamed: 0,CODIGO,TDAH,1.HTR1B_1,1.HTR1B_2,1.HTR1B_3,2.HTR2A_1,2.HTR2A_2,2.HTR2A_3,3.DRD2_1,3.DRD2_2,...,7.SLC6A4(VNTR)_3,9.SNAP25(DdeI)_1,9.SNAP25(DdeI)_2,9.SNAP25(DdeI)_3,10.SNP25( Mn1I)_1,10.SNP25( Mn1I)_2,10.SNP25( Mn1I)_3,11. TPH2_1,11. TPH2_2,11. TPH2_3
0,HH016,0,True,False,False,True,False,False,False,True,...,True,False,True,False,False,False,True,True,False,False
1,HH022,1,False,True,False,False,True,False,False,True,...,True,False,True,False,False,False,True,True,False,False
2,HH024,0,True,False,False,True,False,False,False,False,...,True,False,True,False,False,False,True,True,False,False
3,HH027,0,True,False,False,True,False,False,True,False,...,False,True,False,False,False,True,False,True,False,False
4,HH028,1,True,False,False,False,True,False,True,False,...,False,True,False,False,False,True,False,False,True,False


**Note** the increase in the number of columns, which could be a problem for the logistic regression. A PCA could help when analysing these columns together with the other meta-variables; however, it would not let us check which SNPs have the strongest influence.

### **Variable Correlation**

Checking the **Pearson** correlation among SNPs, and between SNPs and the TDAH diagnosis, could be useful. Here, the correlation is filtered to values x >= 0.2 or x <= -0.2.
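
Because both the SNP indicators and the diagnosis are binary, the Pearson correlation between two such columns coincides with the phi coefficient of their 2x2 contingency table. The sketch below checks this on hypothetical data; the column names `snp` and `tdah` are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical indicator columns (not the study data).
toy = pd.DataFrame({
    "snp":  [1, 1, 1, 0, 0, 0, 0, 1],
    "tdah": [1, 1, 0, 0, 0, 1, 0, 1],
})
pearson = toy["snp"].corr(toy["tdah"])

# Phi coefficient computed directly from the 2x2 contingency table.
table = pd.crosstab(toy["snp"], toy["tdah"])
n00, n01 = table.loc[0, 0], table.loc[0, 1]
n10, n11 = table.loc[1, 0], table.loc[1, 1]
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
)
print(pearson, phi)
```

Reading the heatmap this way makes clear that each cell summarises a 2x2 table, which can rest on very few carriers for a rare category.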

In [261]:
## Let's plot only those with a correlation higher than 0.2 or lower than -0.2
corr_threshold = 0.2
## Compute the correlation matrix once and mask entries below the threshold in absolute value
corr_matrix = df_autosomal_transformed.iloc[:,1:].corr()
correlation_df = corr_matrix.where(corr_matrix.abs() >= corr_threshold)


In [262]:
import plotly.express as px
fig = px.imshow(correlation_df,width=1000, height=800)
fig.show()

The TDAH column reveals some apparently strong positive and negative correlations. However, they must be read alongside the next histogram: some SNP categories have very few records, so any seemingly significant correlation for them could be unreliable.
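
One way to act on this caveat is to count the carriers of each indicator column and flag the sparse ones before reading their correlations. This is only a sketch on hypothetical data; `min_count` is an arbitrary cut-off chosen for illustration, not a value used in this analysis.

```python
import pandas as pd

# Hypothetical one-hot columns for two markers (not the study data).
toy = pd.DataFrame({
    "A_1": [True, True, False, True, False, True],
    "A_2": [False, False, True, False, True, False],
    "B_1": [True, True, True, True, True, False],
    "B_2": [False, False, False, False, False, True],
})

min_count = 2  # hypothetical cut-off; choose according to the sample size
carriers = toy.sum()  # number of records carrying each category
sparse = carriers[carriers < min_count].index.tolist()
print(sparse)
```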

In [263]:
fig = px.histogram(x=df_autosomal_transformed.iloc[:,2:].sum().index, y=df_autosomal_transformed.iloc[:,2:].sum().values)
fig.show()

# Train logistic model

Here, we will use cross-validation because data are scarce.
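
With so few records, it matters that every fold contains both classes in similar proportion. For classifiers, `cross_val_score` with an integer `cv` uses stratified folds; the sketch below makes the split explicit with `StratifiedKFold`. The labels reproduce only the class counts of the merged table (38 positive out of 92), the feature matrix is a placeholder, and the shuffling with `random_state=0` is an assumption for the demo.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels with the same class counts as the merged table (38 positive, 54 negative);
# the feature matrix is a placeholder because only the split is illustrated.
y = np.array([1] * 38 + [0] * 54)
X = np.zeros((len(y), 1))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Number of TDAH-positive records that land in each held-out fold.
fold_positives = [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]
print(fold_positives)
```

Each held-out fold has fewer than 20 records, which helps explain why the per-fold scores vary noticeably.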

### Only autosomal SNPs

As stated before, due to the difference in the number of records between genders, we use only **autosomal markers**.

In [264]:
df_autosomal_transformed.head()

Unnamed: 0,CODIGO,TDAH,1.HTR1B_1,1.HTR1B_2,1.HTR1B_3,2.HTR2A_1,2.HTR2A_2,2.HTR2A_3,3.DRD2_1,3.DRD2_2,...,7.SLC6A4(VNTR)_3,9.SNAP25(DdeI)_1,9.SNAP25(DdeI)_2,9.SNAP25(DdeI)_3,10.SNP25( Mn1I)_1,10.SNP25( Mn1I)_2,10.SNP25( Mn1I)_3,11. TPH2_1,11. TPH2_2,11. TPH2_3
0,HH016,0,True,False,False,True,False,False,False,True,...,True,False,True,False,False,False,True,True,False,False
1,HH022,1,False,True,False,False,True,False,False,True,...,True,False,True,False,False,False,True,True,False,False
2,HH024,0,True,False,False,True,False,False,False,False,...,True,False,True,False,False,False,True,True,False,False
3,HH027,0,True,False,False,True,False,False,True,False,...,False,True,False,False,False,True,False,True,False,False
4,HH028,1,True,False,False,False,True,False,True,False,...,False,True,False,False,False,True,False,False,True,False


In [371]:
## Model set up
model_autosomal = LogisticRegression(fit_intercept=False, penalty="elasticnet", solver="saga", l1_ratio=0.3)

In [372]:
## Cross validation
score = cross_val_score(model_autosomal, df_autosomal_transformed.iloc[:,2:] ,df_autosomal_transformed["TDAH"].values, cv=5, scoring='f1_weighted')
print("score in each iteration: ",score)
print("Mean and std score by cross validation: ", score.mean(), score.std())

score in each iteration:  [0.45802005 0.6273074  0.61481481 0.55555556 0.43001443]
Mean and std score by cross validation:  0.5371424498884453 0.08029760090564944


In [373]:
model_autosomal.fit(df_autosomal_transformed.iloc[:,2:], df_autosomal_transformed["TDAH"].values)

### Coefficients

Once the model is trained, we can get the coefficients, which tell us which features (SNPs) have the highest influence in the model. Note that these coefficients come from a fit on the full dataset, using the same model configuration that was scored by cross-validation above (mean weighted F1 of about 0.54). Additionally, we are using elastic net, which lets us set a balance between L1 and L2 regularization, so we will see some coefficients at zero. These can be regarded as showing no association so far, although some of them are zero simply because of the small amount of data. L1 can be rather aggressive, so it is important to reach a balance; in this case I use a ratio of 0.3.
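
The effect of `l1_ratio` on sparsity can be sketched on synthetic data. Everything below is an assumption made for the demo: the random binary features, the single informative column, and `C=0.1`, which is stronger regularization than the default used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic binary features: only the first column carries signal (assumption for the demo).
X = rng.integers(0, 2, size=(200, 10)).astype(float)
logits = 2.0 * X[:, 0] - 1.0
y = (rng.random(200) < 1 / (1 + np.exp(-logits))).astype(int)

n_zero = {}
for l1_ratio in (0.0, 0.3, 1.0):
    model = LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=l1_ratio,
        C=0.1, fit_intercept=False, max_iter=5000,
    )
    model.fit(X, y)
    # Count coefficients driven exactly to zero by the L1 part of the penalty.
    n_zero[l1_ratio] = int(np.sum(model.coef_[0] == 0))
print(n_zero)
```

With `l1_ratio=0.0` the penalty is pure L2 and coefficients shrink without becoming exactly zero, while larger ratios tend to zero out more of the uninformative features.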

In [374]:
feature_coefficients = pd.DataFrame({
    'Feature': df_autosomal_transformed.iloc[:,2:].columns,
    'Coefficient': model_autosomal.coef_[0]
})

fig = px.scatter(feature_coefficients, x=feature_coefficients["Feature"], y=feature_coefficients["Coefficient"])
fig.update_traces(marker={'size': 10})

In [375]:
snp_significative = feature_coefficients[(feature_coefficients["Coefficient"]>0)]["Feature"].to_list()
snp_significative

['1.HTR1B_2',
 '3.DRD2_2',
 '4.DRD4_5',
 '4.DRD4_6',
 '5. DAT1_3',
 '5. DAT1_6',
 '7.SLC6A4(VNTR)_2',
 '9.SNAP25(DdeI)_2',
 '10.SNP25( Mn1I)_1',
 '11. TPH2_1']

### Complementary

Additionally, the same logistic model from statsmodels will be used. It has a small advantage: it produces p-values for the features, which are commonly used in biology. However, it has problems with a large number of features, as in this case, so I will only use the SNPs selected above, that is, those with a positive coefficient in the scikit-learn model.

In [376]:
import statsmodels.api as sm

model_autosomal_sm=sm.Logit(df_autosomal_transformed["TDAH"].values[:], df_autosomal_transformed[snp_significative])

In [395]:
result = model_autosomal_sm.fit(method="cg", maxiter=100)

Optimization terminated successfully.
         Current function value: 0.639788
         Iterations: 97
         Function evaluations: 301
         Gradient evaluations: 301


In [396]:
result.summary()

0,1,2,3
Dep. Variable:,y,No. Observations:,92.0
Model:,Logit,Df Residuals:,82.0
Method:,MLE,Df Model:,9.0
Date:,"Thu, 06 Jun 2024",Pseudo R-squ.:,0.05629
Time:,16:03:54,Log-Likelihood:,-58.86
converged:,True,LL-Null:,-62.371
Covariance Type:,nonrobust,LLR p-value:,0.6349

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
1.HTR1B_2,0.5739,0.414,1.387,0.165,-0.237,1.385
3.DRD2_2,-0.4678,0.450,-1.039,0.299,-1.350,0.415
4.DRD4_5,0.7299,0.961,0.759,0.448,-1.154,2.614
4.DRD4_6,9.3908,55.136,0.170,0.865,-98.673,117.454
5. DAT1_3,-0.0277,0.914,-0.030,0.976,-1.819,1.763
5. DAT1_6,8.0529,55.185,0.146,0.884,-100.108,116.214
7.SLC6A4(VNTR)_2,0.1723,0.381,0.452,0.651,-0.574,0.919
9.SNAP25(DdeI)_2,-0.6438,0.457,-1.410,0.158,-1.539,0.251
10.SNP25( Mn1I)_1,-0.2046,0.621,-0.329,0.742,-1.422,1.013


## Using both sex-linked and autosomal HTR
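
`df_transformed` is used throughout the remaining sections but is never defined in the cells shown here, which is why the next cell fails. Judging only from the slices used later (`iloc[:, 3:]` for all HTR markers, `iloc[:, 9:]` for the X-linked one) and from the `8.HTR2C_*` column names, one plausible reconstruction is a one-hot encoding of the three HTR markers that keeps *Genero*. This is an assumption, not the author's confirmed code; the sketch below uses hypothetical rows and the names `toy_df` and `toy_transformed` to show that such an encoding yields the expected column layout.

```python
import pandas as pd

# Toy rows with the same column names as the merged table (hypothetical values).
toy_df = pd.DataFrame({
    "CODIGO": ["X1", "X2", "X3", "X4"],
    "1.HTR1B": [1, 2, 3, 1],
    "TDAH": [0, 1, 0, 1],
    "Genero": [0, 1, 0, 1],
    "2.HTR2A": [1, 2, 3, 2],
    "8.HTR2C": [3, 5, 2, 5],
})

htr_markers = ["1.HTR1B", "2.HTR2A", "8.HTR2C"]
# Assumed reconstruction: one-hot encode only the HTR markers and keep Genero.
toy_transformed = pd.get_dummies(toy_df, columns=htr_markers)

print(list(toy_transformed.columns[:3]))   # identifier, label, gender
print(list(toy_transformed.columns[3:9]))  # autosomal HTR indicators
print(list(toy_transformed.columns[9:]))   # 8.HTR2C indicators
```

On the real data the equivalent call would be `pd.get_dummies(df[["CODIGO", "TDAH", "Genero"] + htr_markers], columns=htr_markers)`, again under the assumption that only these three markers were meant to be included.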

In [397]:
df_transformed.iloc[:,3:].head()

NameError: name 'df_transformed' is not defined

In [None]:
## 1. train-test split, also using the sex-chromosome HTR marker

X_train, X_test, y_train, y_test = train_test_split(df_transformed.iloc[:,3:],df_transformed["TDAH"].values , random_state=42,test_size=0.3, shuffle=True)
model = LogisticRegression(class_weight='balanced', fit_intercept=False, solver="liblinear", penalty="l1")
model.fit(X_train, y_train)

In [None]:
## 2. cross validation
score = cross_val_score(model, df_transformed.iloc[:,3:],df_transformed["TDAH"].values, cv=3, scoring="recall_weighted")
print("score in each iteration: ",score)
print("Mean and std score by cross validation: ", score.mean(), score.std())

In [None]:
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

In [None]:
print(classification_report(y_test, model.predict(X_test)))

In [None]:
feature_coefficients = pd.DataFrame({
    'Feature': X_test.columns,
    'Coefficient': model.coef_[0]
})

print(feature_coefficients)

## Using only the sex-linked HTR

In [None]:
df_transformed.iloc[:,9:].head()

In [None]:
## Using only the sex-chromosome HTR marker
X_train, X_test, y_train, y_test = train_test_split(df_transformed.iloc[:,9:],df_transformed["TDAH"].values , random_state=42,test_size=0.3, shuffle=True)
model = LogisticRegression(class_weight='balanced', fit_intercept=False, solver="liblinear", penalty="l1")
model.fit(X_train, y_train)

In [None]:
score = cross_val_score(model, df_transformed.iloc[:,9:],df_transformed["TDAH"].values, cv=3, scoring="f1_weighted")
print("score in each iteration: ",score)
print("Mean and std score by cross validation: ", score.mean(), score.std())

In [None]:
print(classification_report(y_test, model.predict(X_test)))

In [None]:
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

In [None]:
feature_coefficients = pd.DataFrame({
    'Feature': X_test.columns,
    'Coefficient': model.coef_[0]
})

feature_coefficients

# Only HTR in males

In [None]:
only_males = df_transformed[df_transformed["Genero"] == 1]
only_males.shape

In [None]:
## Reassign instead of dropping in place, since only_males is a filtered slice of df_transformed
only_males = only_males.drop(["8.HTR2C_2","8.HTR2C_3"], axis=1)

In [None]:
## Not a heavily imbalanced dataset (minority class > 20%); however, there are very few records
## 1. train-test split, also using the sex-chromosome HTR marker
X_train, X_test, y_train, y_test = train_test_split(only_males.iloc[:,3:],only_males["TDAH"].values , random_state=42,test_size=0.3, shuffle=True)
model_male = LogisticRegression(class_weight="balanced", fit_intercept=False, solver="liblinear", penalty="l1")
model_male.fit(X_train,y_train)

In [None]:
## 2. cross validation
score = cross_val_score(model_male, only_males.iloc[:,3:],only_males["TDAH"].values, cv=3, scoring="recall_weighted")
print("score in each iteration: ",score)
print("Mean and std score by cross validation: ", score.mean(), score.std())

In [None]:
feature_coefficients = pd.DataFrame({
    'Feature': X_test.columns,
    'Coefficient': model_male.coef_[0]
})

feature_coefficients

In [None]:
ConfusionMatrixDisplay.from_estimator(model_male, X_test, y_test)

In [None]:
print(classification_report(y_test, model_male.predict(X_test)))

In [None]:
only_males.sum()

# Only females

In [None]:
only_females = df_transformed[df_transformed["Genero"] == 0]
only_females.shape

In [None]:
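## Reassign instead of dropping in place, since only_females is a filtered slice of df_transformed
only_females = only_females.drop(["8.HTR2C_5"], axis=1)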

In [None]:
## 1. train-test split, also using the sex-chromosome HTR marker
X_train, X_test, y_train, y_test = train_test_split(only_females.iloc[:,3:],only_females["TDAH"].values , random_state=42,test_size=0.1, shuffle=True)
model_female = LogisticRegression(class_weight="balanced", fit_intercept=False, solver="liblinear", penalty="l1")
model_female.fit(X_train, y_train)

In [None]:
## 2. cross validation
score = cross_val_score(model_female, only_females.iloc[:,3:],only_females["TDAH"].values, cv=3, scoring="recall_weighted")
print("score in each iteration: ", score)
print("Mean and std score by cross validation: ", score.mean(), score.std())

In [None]:
feature_coefficients = pd.DataFrame({
    'Feature': X_test.columns,
    'Coefficient': model_female.coef_[0]
})

feature_coefficients

In [None]:
ConfusionMatrixDisplay.from_estimator(model_female, X_train, y_train)

In [None]:
only_females.sum()