# Predicting depression and anxiety in Spain

## Steps to follow:

- Manually define the variables of interest
- Rename the variables of interest so the raw survey codes are not needed
- Group variables for exploratory analysis
- Create the target
- Finish merging the Depression + Anxiety target

This project uses data from Spain's 2020 European Health Survey (EHS). The survey aims to provide information about the overall health status of Spain's population: chronic diseases or accidents, limitations in completing daily activities, access to and use of health care services, and environmental characteristics and daily habits that may represent a health risk. The **key variables** for this project are **G25a_20** and **G25a_21**, which record whether the participant has suffered depression or anxiety, respectively.
 
Source: https://www.ine.es/dyngs/INEbase/operacion.htm?c=Estadistica_C&cid=1254736176784&menu=resultados&idp=1254735573175#!tabs-1254736195745
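
As a rough illustration of how the two key variables could be merged into a single target, the sketch below works on toy data rather than the survey file. The answer coding (1 = yes, 2 = no), the `YES` constant, and the `target` column name are assumptions made for the example; the real coding should be checked against the EHS codebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the two survey columns.
# ASSUMPTION: 1 = "yes", 2 = "no"; verify against the EHS codebook.
toy = pd.DataFrame({
    "G25a_20": [1, 2, 2, np.nan, np.nan],  # depression
    "G25a_21": [2, 1, 2, 1, np.nan],       # anxiety
})

YES = 1  # hypothetical code for an affirmative answer
cols = ["G25a_20", "G25a_21"]

# True when at least one of the two conditions was reported
any_yes = toy[cols].eq(YES).any(axis=1)
# True when both answers are missing, so nothing can be said about the row
all_missing = toy[cols].isna().all(axis=1)

# 1.0 = depression and/or anxiety, 0.0 = neither, NaN = both answers missing
toy["target"] = any_yes.astype(float).mask(all_missing)
print(toy)
```

Under this rule a row with one "no" and one missing answer is labelled 0; treating such rows as missing instead would be an equally defensible choice.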

## All imports

In [22]:
# Data management and analysis libraries, plus model storage
import pandas as pd
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sqlite3  # for SQL tables
from pickle import dump, load  # for storing models after tuning
import re
import json  # to work with the JSON data format

# Iterative imputer to fill missing numerical data
# (experimental feature: it must be enabled explicitly before it can be imported)
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

# Feature selection: SelectKBest and its scoring functions
from sklearn.feature_selection import chi2, SelectKBest, mutual_info_regression, f_classif

# Train/test splitting and model tuning
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, RandomizedSearchCV

# Model performance assessment
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error, r2_score, make_scorer, classification_report, ConfusionMatrixDisplay, confusion_matrix


In [23]:

def ver_dataframe_completo():

    # Configure pandas to show full DataFrames without truncating rows, columns, or cell contents
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_colwidth', None)

def ver_dataframe_columnas():

    # Show every column and the full cell contents, keeping the default row truncation
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)

def restaurar_ajuste():

    # Restore the pandas defaults so DataFrames are displayed truncated again
    pd.reset_option('all')
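
The helpers above change the display options globally until `restaurar_ajuste()` is called. As an alternative sketch, `pd.option_context` applies the same options only inside a `with` block and restores the previous values on exit. The `wide` DataFrame is toy data invented for the example.

```python
import pandas as pd

# Toy DataFrame with more columns than pandas shows by default
wide = pd.DataFrame([range(50)], columns=[f"col_{i}" for i in range(50)])

before = pd.get_option("display.max_columns")

# Options set here apply only inside the block and are restored on exit
with pd.option_context("display.max_columns", None,
                       "display.max_colwidth", None):
    inside = pd.get_option("display.max_columns")
    print(wide)

after = pd.get_option("display.max_columns")
```

This avoids having to remember the reset call, at the cost of wrapping each display in a `with` block.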
    

In [24]:
RUTA_DATAFRAME = '/workspaces/mental_health_spain/data/raw/EESEadulto_2020.csv'

dtypes={'K33': object, 'K35':object}

df = pd.read_csv(RUTA_DATAFRAME, sep='\t', dtype=dtypes)

In [25]:
df.replace(r'^\s*$', pd.NA, regex=True, inplace=True)
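
To make the effect of the regular expression in the cell above concrete, here is a small sketch on toy data (the values are invented; only the column names are borrowed from the survey). The pattern `^\s*$` matches strings that are empty or contain only whitespace, so a value such as `'a b'` is left alone.

```python
import pandas as pd

# Toy string columns containing empty and whitespace-only cells
toy = pd.DataFrame({
    "K33": ["1", " ", ""],
    "K35": ["  ", "2", "a b"],
})

# ^\s*$ matches empty or whitespace-only strings, which become missing values
cleaned = toy.replace(r'^\s*$', pd.NA, regex=True)

print(cleaned.isna())
```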

In [26]:
df.head()

Unnamed: 0,CCAA,IDENTHOGAR,A7_2a,SEXOa,EDADa,PROXY_0,PROXY_1,PROXY_2,PROXY_2b,PROXY_3b,...,Y134,Y135,FACTORADULTO,CLASE_PR,IMC,CMD1,CMD2,CMD3,SEVERIDAD_DEPRESIVA,CUADROS_DEPRESIVOS
0,16,2500011,1,1,60,1,,,,,...,,,822.756,5,2,0.0,0.0,0.0,1,3
1,16,2500021,1,2,87,1,,,,,...,,,1287.294,1,9,0.0,0.0,0.0,3,3
2,16,2500031,1,1,38,1,,,,,...,,,607.022,4,3,2.86,0.0,6.67,3,2
3,16,2500061,2,2,43,1,,,,,...,1.0,3.0,1303.95,1,2,5.71,5.0,6.67,1,3
4,16,2500071,1,1,41,1,,,,,...,,,1341.778,4,3,0.0,0.0,0.0,1,3


In [27]:
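# Fill missing values with the sentinel -1 and cast every column to int.
# Note: the cast truncates decimals (e.g. FACTORADULTO 822.756 becomes 822).
df = df.fillna(-1).astype(int)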

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22072 entries, 0 to 22071
Columns: 427 entries, CCAA to CUADROS_DEPRESIVOS
dtypes: int64(427)
memory usage: 71.9 MB


In [29]:
df.head()

Unnamed: 0,CCAA,IDENTHOGAR,A7_2a,SEXOa,EDADa,PROXY_0,PROXY_1,PROXY_2,PROXY_2b,PROXY_3b,...,Y134,Y135,FACTORADULTO,CLASE_PR,IMC,CMD1,CMD2,CMD3,SEVERIDAD_DEPRESIVA,CUADROS_DEPRESIVOS
0,16,2500011,1,1,60,1,-1,-1,-1,-1,...,-1,-1,822,5,2,0,0,0,1,3
1,16,2500021,1,2,87,1,-1,-1,-1,-1,...,-1,-1,1287,1,9,0,0,0,3,3
2,16,2500031,1,1,38,1,-1,-1,-1,-1,...,-1,-1,607,4,3,2,0,6,3,2
3,16,2500061,2,2,43,1,-1,-1,-1,-1,...,1,3,1303,1,2,5,5,6,1,3
4,16,2500071,1,1,41,1,-1,-1,-1,-1,...,-1,-1,1341,4,3,0,0,0,1,3


In [30]:
# Turn the -1 sentinel back into NaN (regex matching is not needed for a numeric value)
df.replace(-1, np.nan, inplace=True)

In [31]:
df.head()

Unnamed: 0,CCAA,IDENTHOGAR,A7_2a,SEXOa,EDADa,PROXY_0,PROXY_1,PROXY_2,PROXY_2b,PROXY_3b,...,Y134,Y135,FACTORADULTO,CLASE_PR,IMC,CMD1,CMD2,CMD3,SEVERIDAD_DEPRESIVA,CUADROS_DEPRESIVOS
0,16,2500011,1,1,60,1,,,,,...,,,822,5,2,0,0,0,1,3
1,16,2500021,1,2,87,1,,,,,...,,,1287,1,9,0,0,0,3,3
2,16,2500031,1,1,38,1,,,,,...,,,607,4,3,2,0,6,3,2
3,16,2500061,2,2,43,1,,,,,...,1.0,3.0,1303,1,2,5,5,6,1,3
4,16,2500071,1,1,41,1,,,,,...,,,1341,4,3,0,0,0,1,3


In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22072 entries, 0 to 22071
Columns: 427 entries, CCAA to CUADROS_DEPRESIVOS
dtypes: float64(273), int64(154)
memory usage: 71.9 MB
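
The round trip through `-1` and `int` in the cells above drops the decimal part of columns such as FACTORADULTO and CMD1, as the second `df.head()` output shows. If those decimals matter, one possible alternative is to convert each column with `pd.to_numeric`, sketched here on toy data. The values are copied from the first rows shown above, and whether the decimals are needed later in the project is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking raw string columns read from the survey file
toy = pd.DataFrame({
    "EDADa": ["60", "87", "38"],
    "PROXY_1": ["", " ", "1"],
    "FACTORADULTO": ["822.756", "1287.294", "607.022"],
})

# Blank or whitespace-only cells become missing values
toy = toy.replace(r'^\s*$', np.nan, regex=True)

# Convert every column to a numeric dtype; values that cannot be parsed become NaN
numeric = toy.apply(pd.to_numeric, errors="coerce")

print(numeric)
print(numeric.dtypes)
```

With this approach no sentinel value is needed, so a genuine `-1` in the data (if any exists) would not be confused with a missing answer.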
