<img src="./img/mioti.png">
<br />
<center style="color:#888">Madrid SAMUR-PC Activations Project<br/></center>

# Madrid SAMUR-PC Activations Project: Model

<img src="./img/samur690.jpg" style="width: 800px">

## 1. Problem overview

Build a model that predicts the response time following the activation of an emergency.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['figure.figsize'] = (20, 10)
rcParams.update({'font.size': 14})
from pathlib import Path
import scipy.cluster.hierarchy as sch
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import precision_score, recall_score,accuracy_score,f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor
from sklearn.svm import SVC
from sklearn.svm import SVR

## 2. Obtaining the data

In [2]:
# Load the fully preprocessed dataframe
emergencias_df_aux=pd.read_csv('./datos/emergencias_pp_def.csv',sep=';', encoding='iso8859').reset_index(drop=True)
emergencias_df_aux.head()
emergencias_df_aux['Fecha Activacion']=pd.to_datetime(emergencias_df_aux['Fecha Activacion'])
emergencias_df_aux.sample(10)

Unnamed: 0,Fecha Activacion,Hora Solicitud,Hora Intervención,Código,Distrito,Hospital
120885,2017-11-26,18:15:38,18:19:00,Convulsión y/o epilepsia,Moncloa - Aravaca,Sin hospitalizacion
228591,2018-09-12,22:00:57,22:10:57,Sobredosis,Villa de Vallecas,Sin hospitalizacion
83564,2017-08-15,18:58:47,19:04:27,"Casual: caída, etc",Usera,Sin hospitalizacion
101000,2017-10-06,9:26:27,9:34:56,Patología cardiovascular,Centro,Sin hospitalizacion
237421,2018-10-04,11:32:22,11:36:12,Accidente menos de 3 victimas,Tetuán,La Paz
41585,2017-04-28,10:59:26,11:44:34,Colaboración del FOXTROP en una actuación,San Blas - Canillejas,Sin hospitalizacion
440738,2020-02-23,6:22:11,6:28:08,Apertura de puerta,Chamberí,Clínico San Carlos
280270,2019-01-21,12:29:01,12:33:06,Patología cardiovascular,Chamberí,Sin hospitalizacion
88079,2017-09-01,14:41:57,14:48:00,"Casual: caída, etc",Carabanchel,Doce de Octubre
392664,2019-10-29,1:13:22,1:19:41,Heridas,Hortaleza,Ramón y Cajal


In [3]:
# Inspect the structure of the new dataframe
emergencias_df_aux_tiempo=emergencias_df_aux[['Fecha Activacion','Hora Solicitud','Hora Intervención']].copy()
emergencias_df_aux_tiempo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 556526 entries, 0 to 556525
Data columns (total 3 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   Fecha Activacion   556526 non-null  datetime64[ns]
 1   Hora Solicitud     556526 non-null  object        
 2   Hora Intervención  508648 non-null  object        
dtypes: datetime64[ns](1), object(2)
memory usage: 12.7+ MB


## 3. Analyze and prepare the data for our algorithm

The dataset has few variables, so it needs substantial transformation to adapt the fields to our needs and to generate new synthetic variables that reinforce the information and help avoid underfitting.

We generate a response-time field as the difference between the request time and the intervention time, and we define dummy variables for the day of the week, the district and the activation code.

In [4]:
# Parse the time columns as datetimes and compute the response time in seconds
emergencias_df_aux_tiempo['Tiempo Respuesta']=(pd.to_datetime(emergencias_df_aux['Hora Intervención']) - pd.to_datetime(emergencias_df_aux['Hora Solicitud'])).dt.seconds
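
As a check on the cell above, here is a minimal sketch on made-up times (the three rows are illustrative, not taken from the dataset). It assumes the times are `H:MM:SS` strings and uses `pd.to_timedelta` instead of `pd.to_datetime`. The modulo keeps the result within one day, so an intervention that falls after midnight is still counted correctly, which is the same effect `.dt.seconds` has in the original cell.

```python
import pandas as pd

# Toy data (hypothetical rows): the second request has no intervention,
# the third one crosses midnight.
toy = pd.DataFrame({
    "Hora Solicitud":    ["0:23:19", "23:33:25", "23:58:10"],
    "Hora Intervención": ["0:28:59", None,       "0:02:30"],
})

solicitud = pd.to_timedelta(toy["Hora Solicitud"])
intervencion = pd.to_timedelta(toy["Hora Intervención"])

# Difference in seconds, wrapped to a single day so midnight crossings stay positive.
toy["Tiempo Respuesta"] = (intervencion - solicitud).dt.total_seconds() % 86400
print(toy)
```

The missing intervention stays missing (`NaN`) instead of being silently turned into a number.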

In [5]:
# Extract the request and intervention hours and convert them to numeric
emergencias_df_aux_tiempo['Hora S']=pd.to_numeric(emergencias_df_aux['Hora Solicitud'].str.split(':',expand=True)[0])
emergencias_df_aux_tiempo['Hora I']=pd.to_numeric(emergencias_df_aux['Hora Intervención'].str.split(':',expand=True)[0])

In [6]:
# Compute the day of the week (0 = Monday, 6 = Sunday)
emergencias_df_aux_tiempo['Dia Semana']=emergencias_df_aux_tiempo['Fecha Activacion'].dt.dayofweek

In [7]:
# Display the data
emergencias_df_aux_tiempo

Unnamed: 0,Fecha Activacion,Hora Solicitud,Hora Intervención,Tiempo Respuesta,Hora S,Hora I,Dia Semana
0,2017-01-01,0:23:19,0:28:59,340.0,0,0.0,6
1,2017-01-01,0:27:35,0:35:44,489.0,0,0.0,6
2,2017-01-01,0:47:26,0:55:49,503.0,0,0.0,6
3,2017-01-01,0:55:13,1:02:23,430.0,0,1.0,6
4,2017-01-01,1:07:11,1:19:44,753.0,1,1.0,6
...,...,...,...,...,...,...,...
556521,2021-03-31,23:33:25,,,23,,2
556522,2021-03-31,23:34:17,23:52:36,1099.0,23,23.0,2
556523,2021-03-31,23:47:02,23:53:46,404.0,23,23.0,2
556524,2021-03-31,23:48:18,,,23,,2


In [8]:
emergencias_reg=emergencias_df_aux_tiempo[['Fecha Activacion','Tiempo Respuesta','Hora S','Hora I','Dia Semana']]
emergencias_reg.sample(10)

Unnamed: 0,Fecha Activacion,Tiempo Respuesta,Hora S,Hora I,Dia Semana
353702,2019-07-18,2249.0,9,10.0,3
300368,2019-03-12,810.0,12,12.0,1
73981,2017-07-17,344.0,0,0.0,0
246535,2018-10-26,133.0,20,20.0,4
436180,2020-02-12,1409.0,17,17.0,2
173166,2018-04-17,604.0,12,12.0,1
119520,2017-11-23,,13,,3
174717,2018-04-21,1070.0,6,7.0,5
52225,2017-05-25,1625.0,20,20.0,3
19548,2017-02-27,540.0,8,9.0,0


In [9]:
emergencias_reg['Hora I']=emergencias_reg['Hora I'].fillna(method='bfill')
emergencias_reg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 556526 entries, 0 to 556525
Data columns (total 5 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Fecha Activacion  556526 non-null  datetime64[ns]
 1   Tiempo Respuesta  508648 non-null  float64       
 2   Hora S            556526 non-null  int64         
 3   Hora I            556526 non-null  float64       
 4   Dia Semana        556526 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(2)
memory usage: 21.2 MB


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  emergencias_reg['Hora I']=emergencias_reg['Hora I'].fillna(method='bfill')
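
The warning above appears because `emergencias_reg` was created by selecting columns from another dataframe and was then assigned to. A minimal sketch of one common way to avoid it, on a made-up frame (`origen` and `reg` are hypothetical names): take an explicit `.copy()` when selecting the columns. `Series.bfill()` is used here in place of `fillna(method='bfill')`, which recent pandas versions deprecate.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for emergencias_df_aux_tiempo.
origen = pd.DataFrame({
    "Hora S": [23, 23, 23],
    "Hora I": [np.nan, 23.0, np.nan],
    "Otro":   ["a", "b", "c"],
})

# .copy() makes the selection an independent dataframe,
# so assigning to one of its columns is unambiguous.
reg = origen[["Hora S", "Hora I"]].copy()
reg["Hora I"] = reg["Hora I"].bfill()

print(reg)
```

Note that a backward fill cannot fill a trailing gap: the last row stays `NaN` because there is no later value to copy from.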


In [10]:
emergencias_reg=emergencias_reg.dropna(axis=0)
emergencias_reg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 508648 entries, 0 to 556525
Data columns (total 5 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Fecha Activacion  508648 non-null  datetime64[ns]
 1   Tiempo Respuesta  508648 non-null  float64       
 2   Hora S            508648 non-null  int64         
 3   Hora I            508648 non-null  float64       
 4   Dia Semana        508648 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(2)
memory usage: 23.3 MB


In [11]:
emergencias_reg.describe()

Unnamed: 0,Tiempo Respuesta,Hora S,Hora I,Dia Semana
count,508648.0,508648.0,508648.0,508648.0
mean,549.924765,13.458484,13.502709,3.089168
std,920.682599,6.06897,6.104147,2.000129
min,0.0,0.0,0.0,0.0
25%,316.0,10.0,10.0,1.0
50%,452.0,14.0,14.0,3.0
75%,627.0,18.0,19.0,5.0
max,86348.0,23.0,23.0,6.0


In [12]:
def filtrar_outlier_tukey(x):
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3 - q1 
    print("[q1=%f, q3=%f, iqr=%f]" % (q1, q3, iqr))
    
    floor = q1 - 1.5*iqr
    ceiling = q3 + 1.5*iqr
    print("[floor=%f, ceiling=%f]" % (floor, ceiling))
    
    outlier_indices = list(x.index[(x < floor)|(x > ceiling)])
    outlier_values = list(x[outlier_indices])

    return outlier_indices, outlier_values

In [13]:
outlier_indices, outlier_values = filtrar_outlier_tukey(emergencias_reg['Tiempo Respuesta'])

[q1=316.000000, q3=627.000000, iqr=311.000000]
[floor=-150.500000, ceiling=1093.500000]
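
The fences printed above follow directly from the quartiles: `floor = q1 - 1.5*iqr` and `ceiling = q3 + 1.5*iqr`. The sketch below reproduces that arithmetic and applies the same rule to a small made-up series. `tukey_fences` is a hypothetical helper, not part of the notebook, and it uses `Series.quantile` rather than `np.percentile`.

```python
import pandas as pd

def tukey_fences(x: pd.Series, k: float = 1.5):
    """Return (floor, ceiling) for Tukey's rule with multiplier k."""
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Reproduce the printed values from the reported quartiles.
q1, q3 = 316.0, 627.0
iqr = q3 - q1                              # 311.0
print(q1 - 1.5 * iqr, q3 + 1.5 * iqr)      # -150.5 1093.5

# Toy series (made-up values) with one obvious outlier.
toy = pd.Series([300, 320, 450, 600, 640, 5000])
floor, ceiling = tukey_fences(toy)
mask = (toy < floor) | (toy > ceiling)
print(toy[mask].tolist())
```

Because response times cannot be negative, the lower fence of -150.5 never triggers here; in practice only the upper fence removes values.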


In [14]:
print(outlier_indices)
print(outlier_values)

[105, 173, 176, 177, 187, 235, 237, 238, 305, 365, 401, 420, 462, 480, 507, 588, 589, 596, 622, 657, 663, 664, 667, 669, 757, 781, 794, 821, 833, 918, 922, 927, 931, 939, 979, 1009, 1115, 1129, 1229, 1250, 1256, 1262, 1267, 1283, 1303, 1304, 1305, 1308, 1355, 1360, 1383, 1388, 1449, 1452, 1467, 1474, 1491, 1503, 1539, 1560, 1578, 1600, 1601, 1612, 1613, 1614, 1615, 1616, 1618, 1624, 1625, 1627, 1629, 1630, 1631, 1633, 1635, 1638, 1639, 1640, 1641, 1643, 1644, 1647, 1651, 1654, 1660, 1663, 1664, 1665, 1666, 1669, 1670, 1674, 1706, 1752, 1761, 1788, 1863, 1864, 1874, 1901, 1979, 1987, 1993, 2007, 2128, 2141, 2144, 2160, 2185, 2203, 2244, 2270, 2272, 2276, 2278, 2285, 2286, 2290, 2297, 2322, 2333, 2416, 2424, 2437, 2468, 2482, 2533, 2585, 2746, 2755, 2831, 2890, 2907, 2923, 2947, 2952, 3007, 3013, 3023, 3032, 3047, 3051, 3055, 3116, 3124, 3125, 3132, 3168, 3201, 3210, 3222, 3230, 3331, 3348, 3349, 3364, 3370, 3426, 3531, 3533, 3561, 3564, 3572, 3583, 3692, 3713, 3754, 3756, 3759, 3797, 38

In [15]:
emergencias_reg.loc[outlier_indices,'Tiempo Respuesta']=emergencias_reg['Tiempo Respuesta'].median()

In [16]:
emergencias_reg.describe()

Unnamed: 0,Tiempo Respuesta,Hora S,Hora I,Dia Semana
count,508648.0,508648.0,508648.0,508648.0
mean,457.041329,13.458484,13.502709,3.089168
std,213.662658,6.06897,6.104147,2.000129
min,0.0,0.0,0.0,0.0
25%,316.0,10.0,10.0,1.0
50%,452.0,14.0,14.0,3.0
75%,578.0,18.0,19.0,5.0
max,1093.0,23.0,23.0,6.0


In [17]:
df_y=emergencias_reg['Tiempo Respuesta']
df_y=pd.qcut(df_y,
        q=[0.0,0.2, 0.4, 
           0.6, 0.8, 1.0],
        labels=[1,2,3,4,5])
dataset_y=np.array(df_y)
dataset_y

array([2, 4, 4, ..., 3, 3, 1], dtype=int64)

In [18]:
# Define a function that names the day of the week
def dia(x):
    if x['Dia Semana']==0:
        x['Dia Semana']='lunes'
    elif x['Dia Semana']==1:
        x['Dia Semana']='martes'
    elif x['Dia Semana']==2:
        x['Dia Semana']='miercoles'
    elif x['Dia Semana']==3:
        x['Dia Semana']='jueves'
    elif x['Dia Semana']==4:
        x['Dia Semana']='viernes'
    elif x['Dia Semana']==5:
        x['Dia Semana']='sabado'
    elif x['Dia Semana']==6:
        x['Dia Semana']='domingo'
    return x
# Apply it and check that it works correctly
emergencias_reg=emergencias_reg.apply(dia,axis=1)
emergencias_reg.head()

Unnamed: 0,Fecha Activacion,Tiempo Respuesta,Hora S,Hora I,Dia Semana
0,2017-01-01,340.0,0,0.0,domingo
1,2017-01-01,489.0,0,0.0,domingo
2,2017-01-01,503.0,0,0.0,domingo
3,2017-01-01,430.0,0,1.0,domingo
4,2017-01-01,753.0,1,1.0,domingo
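
Applying a Python function row by row over roughly half a million rows can be slow. A possible alternative, sketched here on three dates, is to map the integer codes through a dictionary. `NOMBRES_DIA` is a name introduced for this illustration only.

```python
import pandas as pd

# Same numbering as Series.dt.dayofweek: 0 = Monday ... 6 = Sunday.
NOMBRES_DIA = {0: "lunes", 1: "martes", 2: "miercoles", 3: "jueves",
               4: "viernes", 5: "sabado", 6: "domingo"}

fechas = pd.Series(pd.to_datetime(["2017-01-01", "2017-01-02", "2021-03-31"]))

# Vectorized lookup: each integer code is replaced by its name.
dias = fechas.dt.dayofweek.map(NOMBRES_DIA)
print(dias.tolist())
```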


In [19]:
# Create dummy variables for the days of the week
dia_semana=pd.DataFrame(emergencias_reg['Dia Semana'],columns=['Dia Semana'])
dia_semana=pd.get_dummies(dia_semana,'Dia Semana')
dia_semana=dia_semana.rename(columns={'Dia Semana_lunes':'lunes','Dia Semana_martes':'martes','Dia Semana_miercoles':'miercoles',
                                      'Dia Semana_jueves':'jueves','Dia Semana_viernes':'viernes','Dia Semana_sabado':'sabado',
                                      'Dia Semana_domingo':'domingo'})
dia_semana.head()

Unnamed: 0,domingo,jueves,lunes,martes,miercoles,sabado,viernes
0,1,0,0,0,0,0,0
1,1,0,0,0,0,0,0
2,1,0,0,0,0,0,0
3,1,0,0,0,0,0,0
4,1,0,0,0,0,0,0
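
If the goal is columns named after the days themselves, one option is to pass the Series directly to `pd.get_dummies`, which avoids the prefix and therefore the rename step. A small sketch on made-up values:

```python
import pandas as pd

dias = pd.Series(["domingo", "lunes", "domingo", "viernes"], name="Dia Semana")

# Passing a Series yields one column per category, named after the category.
# dtype=int gives 0/1 columns regardless of the default dtype of the pandas version.
dummies = pd.get_dummies(dias, dtype=int)
print(dummies)
```

The columns come out in alphabetical order, which matches the order seen in the table above.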


In [20]:
# Join them to the time dataframe
emergencias_reg=pd.concat([emergencias_reg,dia_semana],axis=1)
emergencias_reg=emergencias_reg.drop('Dia Semana',axis=1)
emergencias_reg.sample(10)

Unnamed: 0,Fecha Activacion,Tiempo Respuesta,Hora S,Hora I,domingo,jueves,lunes,martes,miercoles,sabado,viernes
461928,2020-05-23,220.0,23,23.0,0,0,0,0,0,1,0
219902,2018-08-16,585.0,6,6.0,0,1,0,0,0,0,0
367172,2019-08-27,426.0,14,14.0,0,0,0,1,0,0,0
102244,2017-10-09,320.0,0,0.0,0,0,1,0,0,0,0
122905,2017-12-01,666.0,22,22.0,0,0,0,0,0,0,1
392778,2019-10-29,0.0,11,11.0,0,0,0,1,0,0,0
338364,2019-06-13,92.0,9,9.0,0,1,0,0,0,0,0
432935,2020-02-04,655.0,13,14.0,0,0,0,1,0,0,0
443420,2020-02-29,475.0,15,15.0,0,0,0,0,0,1,0
525016,2020-12-13,167.0,13,13.0,1,0,0,0,0,0,0


In [21]:
emergencias_distrito=pd.DataFrame(emergencias_df_aux['Distrito'],columns=['Distrito'])
emergencias_distrito.head()

Unnamed: 0,Distrito
0,Centro
1,Carabanchel
2,Salamanca
3,Centro
4,Villa de Vallecas


In [22]:
emergencias_distrito.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 556526 entries, 0 to 556525
Data columns (total 1 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   Distrito  556526 non-null  object
dtypes: object(1)
memory usage: 4.2+ MB


In [23]:
emergencias_distrito=pd.get_dummies(emergencias_distrito,'Distrito')
emergencias_distrito.head()

Unnamed: 0,Distrito_Arganzuela,Distrito_Barajas,Distrito_Carabanchel,Distrito_Centro,Distrito_Chamartín,Distrito_Chamberí,Distrito_Ciudad Lineal,Distrito_Fuencarral - El Pardo,Distrito_Hortaleza,Distrito_Latina,...,Distrito_Moratalaz,Distrito_Puente de Vallecas,Distrito_Retiro,Distrito_Salamanca,Distrito_San Blas - Canillejas,Distrito_Tetuán,Distrito_Usera,Distrito_Vicálvaro,Distrito_Villa de Vallecas,Distrito_Villaverde
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [24]:
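# Note: emergencias_distrito still has all 556526 original rows, so this index-aligned
# concat brings back the rows dropped earlier, filled with NaN (they are dropped again in In [30])
emergencias_reg=pd.concat([emergencias_reg,emergencias_distrito],axis=1)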
emergencias_reg.sample(10)

Unnamed: 0,Fecha Activacion,Tiempo Respuesta,Hora S,Hora I,domingo,jueves,lunes,martes,miercoles,sabado,...,Distrito_Moratalaz,Distrito_Puente de Vallecas,Distrito_Retiro,Distrito_Salamanca,Distrito_San Blas - Canillejas,Distrito_Tetuán,Distrito_Usera,Distrito_Vicálvaro,Distrito_Villa de Vallecas,Distrito_Villaverde
273328,2019-01-02,544.0,7.0,8.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,0
24585,2017-03-12,112.0,13.0,13.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
107968,2017-10-24,634.0,8.0,9.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0
228239,2018-09-11,589.0,22.0,22.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
111800,2017-11-02,459.0,19.0,19.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0,0,0,0,1,0,0,0,0,0
64749,2017-06-23,640.0,17.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
416666,2019-12-24,404.0,14.0,14.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
318379,2019-04-27,398.0,20.0,21.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,1,0,0,0,0
115233,2017-11-12,529.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
189498,2018-05-28,427.0,15.0,15.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0,0,0,1,0,0,0,0,0,0
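
`pd.concat(..., axis=1)` aligns rows on the index and, by default, keeps the union of both indexes. When one frame has had rows dropped and the other has not, the dropped rows reappear filled with `NaN`, which is also why the integer columns show up as floats in the sample above. A minimal sketch with made-up frames (`completo` and `filtrado` are hypothetical names):

```python
import numpy as np
import pandas as pd

# All rows, index 0..3.
completo = pd.DataFrame({"Distrito_Centro": [1, 0, 1, 0]})

# Row 1 has no response time and is dropped, leaving index 0, 2, 3.
filtrado = pd.DataFrame({"Tiempo Respuesta": [340.0, np.nan, 503.0, 430.0]}).dropna()

# Default outer join on the index: row 1 comes back with NaN in 'Tiempo Respuesta'.
outer = pd.concat([filtrado, completo], axis=1)

# Inner join keeps only the rows present in both frames.
inner = pd.concat([filtrado, completo], axis=1, join="inner")

print(len(outer), len(inner))
```

With `join="inner"` the later `dropna` would no longer be needed to remove the reintroduced rows.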


In [25]:
emergencias_codigo=pd.DataFrame(emergencias_df_aux['Código'])
emergencias_codigo=pd.get_dummies(emergencias_codigo,'Código')
emergencias_codigo.head()

Unnamed: 0,Código_+ de 3 SVB y eq. Sanit.,Código_1 SVB,Código_1 o + SVB y 1 o + SVA,Código_1 o + SVB y 1 o + SVA y PMA,Código_Acc. Trasp. Merc. Peligrosas,Código_Accidente con 3 o más víctimas confirmadas,Código_Accidente con vehículo pesado,Código_Accidente de autobús / autocar,Código_Accidente de avión,Código_Accidente de bicicleta,...,Código_Riesgo químico,Código_SAMUR,Código_Serv. Preventivo desde oper. Ordinario,Código_Servicios de análisis de riesgos,Código_Servicios de seguimiento de riesgos,Código_Servicios especiales,Código_Sobredosis,Código_Todos los medios de SAMUR-PC,Código_Violencia de genero,Código_Vía publica
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
emergencias_reg=pd.concat([emergencias_reg,emergencias_codigo],axis=1)
emergencias_reg.sample(10)

Unnamed: 0,Fecha Activacion,Tiempo Respuesta,Hora S,Hora I,domingo,jueves,lunes,martes,miercoles,sabado,...,Código_Riesgo químico,Código_SAMUR,Código_Serv. Preventivo desde oper. Ordinario,Código_Servicios de análisis de riesgos,Código_Servicios de seguimiento de riesgos,Código_Servicios especiales,Código_Sobredosis,Código_Todos los medios de SAMUR-PC,Código_Violencia de genero,Código_Vía publica
284067,2019-01-30,257.0,23.0,23.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,0
111200,2017-11-01,348.0,6.0,6.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,0
338781,2019-06-14,285.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
454172,2020-04-15,36.0,15.0,15.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,0
415816,2019-12-22,95.0,11.0,11.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
384665,2019-10-10,268.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
58374,2017-06-09,472.0,10.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
2410,NaT,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
125255,2017-12-07,0.0,18.0,18.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
221150,2018-08-20,281.0,17.0,17.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0


In [27]:
emergencias_reg.to_csv('./datos/emergencias_reg.csv', sep=';',index=False,encoding='iso8859')

In [28]:
emergencias_reg=pd.read_csv('./datos/emergencias_reg.csv',sep=';', encoding='iso8859').reset_index(drop=True)
emergencias_reg.head()

Unnamed: 0,Fecha Activacion,Tiempo Respuesta,Hora S,Hora I,domingo,jueves,lunes,martes,miercoles,sabado,...,Código_Riesgo químico,Código_SAMUR,Código_Serv. Preventivo desde oper. Ordinario,Código_Servicios de análisis de riesgos,Código_Servicios de seguimiento de riesgos,Código_Servicios especiales,Código_Sobredosis,Código_Todos los medios de SAMUR-PC,Código_Violencia de genero,Código_Vía publica
0,2017-01-01,340.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
1,2017-01-01,489.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
2,2017-01-01,503.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
3,2017-01-01,430.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
4,2017-01-01,753.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
emergencias_reg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 556526 entries, 0 to 556525
Columns: 126 entries, Fecha Activacion to Código_Vía publica
dtypes: float64(10), int64(115), object(1)
memory usage: 535.0+ MB


In [30]:
emergencias_reg=emergencias_reg.dropna(axis=0)
emergencias_reg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 508648 entries, 0 to 556525
Columns: 126 entries, Fecha Activacion to Código_Vía publica
dtypes: float64(10), int64(115), object(1)
memory usage: 492.8+ MB


In [31]:
em_reg=emergencias_reg[(emergencias_reg['Fecha Activacion']>='2019-06-01')&(emergencias_reg['Fecha Activacion']<='2019-12-31')]
em_reg

Unnamed: 0,Fecha Activacion,Tiempo Respuesta,Hora S,Hora I,domingo,jueves,lunes,martes,miercoles,sabado,...,Código_Riesgo químico,Código_SAMUR,Código_Serv. Preventivo desde oper. Ordinario,Código_Servicios de análisis de riesgos,Código_Servicios de seguimiento de riesgos,Código_Servicios especiales,Código_Sobredosis,Código_Todos los medios de SAMUR-PC,Código_Violencia de genero,Código_Vía publica
332903,2019-06-01,159.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0
332904,2019-06-01,289.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0
332905,2019-06-01,484.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0
332906,2019-06-01,553.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0
332907,2019-06-01,372.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
419454,2019-12-31,439.0,23.0,23.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
419455,2019-12-31,261.0,23.0,23.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
419456,2019-12-31,444.0,23.0,23.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
419457,2019-12-31,320.0,23.0,23.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
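
After reloading from CSV, `Fecha Activacion` has `object` dtype (see the `info()` output above), so the filter compares strings. That works here because ISO-formatted dates sort the same way as text and as dates. A more explicit variant, sketched on made-up rows, parses the column first and filters on real timestamps:

```python
import pandas as pd

toy = pd.DataFrame({
    "Fecha Activacion": ["2019-05-31", "2019-06-01", "2019-12-31", "2020-01-01"],
    "Tiempo Respuesta": [400.0, 159.0, 320.0, 500.0],
})

# Parse once, then filter; between() is inclusive on both ends by default.
toy["Fecha Activacion"] = pd.to_datetime(toy["Fecha Activacion"])
inicio, fin = pd.Timestamp("2019-06-01"), pd.Timestamp("2019-12-31")
sel = toy[toy["Fecha Activacion"].between(inicio, fin)]
print(sel)
```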


In [32]:
# Split into features x and target y
y=em_reg['Tiempo Respuesta']
x=em_reg.drop(['Fecha Activacion','Tiempo Respuesta'],axis=1)

In [33]:
x_norm=StandardScaler().fit_transform(x)
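
The cell above standardizes all rows before the train/test split, so the test rows contribute to the means and standard deviations. A common alternative is to compute those statistics on the training rows only. The sketch below does this in plain NumPy on random stand-in data (the shapes, the seed and the positional 80/20 cut are assumptions made for the example). With scikit-learn the equivalent would be to fit `StandardScaler` on the training split and call `transform` on the test split.

```python
import numpy as np

rng = np.random.default_rng(39)
x = rng.normal(loc=10.0, scale=3.0, size=(100, 4))   # stand-in feature matrix

# Simple 80/20 split by position (the notebook uses train_test_split instead).
x_train, x_test = x[:80], x[80:]

# Statistics come from the training rows only.
media = x_train.mean(axis=0)
desv = x_train.std(axis=0)
desv[desv == 0] = 1.0          # guard against constant columns

x_train_norm = (x_train - media) / desv
x_test_norm = (x_test - media) / desv

print(x_train_norm.mean(axis=0).round(6), x_train_norm.std(axis=0).round(6))
```

For tree-based models such as Random Forest, scaling has little effect on the result; it matters more for the SVC used later.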

## 4. Prepare the data for the Machine Learning algorithm we are going to use

First we build a regression model to try to predict response times; the algorithm used is Random Forest.

In [34]:
x_train,x_test,y_train,y_test=train_test_split(x_norm,y,test_size=0.2,random_state=39)
print(x_train.shape)
print(x_test.shape)

(62929, 124)
(15733, 124)


In [37]:
reg=RandomForestRegressor(n_estimators=1000)
reg.fit(x_train,y_train)
print('Training R^2: ',reg.score(x_train,y_train))
pred=reg.predict(x_test)
print('Test R^2: ',reg.score(x_test,y_test))

Training R^2:  0.5531159204600652
Test R^2:  -0.08845421379745266


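The regression model ends up with a negative test score, which means it generalizes worse than simply predicting the mean response time. The model is a poor fit for this task: the gap between the training score (0.55) and the test score (-0.09) points to overfitting, and the correlation between the target variable and the features is most likely very low.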
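
For scikit-learn regressors, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up values to show why it can go below zero: a constant prediction equal to the mean scores exactly 0, and anything worse than that scores negative. `r2` is a helper written for this illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y = np.array([300.0, 450.0, 600.0, 450.0])           # made-up response times, mean 450

r2_media = r2(y, np.full_like(y, y.mean()))           # always predict the mean
r2_malo = r2(y, np.array([600.0, 450.0, 300.0, 450.0]))  # worse than the mean

print(r2_media, r2_malo)
```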


Next we try classification algorithms: we define 5 response-time intervals and then train the models. We test a Random Forest classifier and a support vector machine.

In [38]:
df_y = em_reg['Tiempo Respuesta']
df_y = pd.qcut(df_y, 5, labels=["0", "1", "2", "3", "4"])
df_y

332903    0
332904    1
332905    3
332906    3
332907    1
         ..
419454    2
419455    0
419456    2
419457    1
419458    0
Name: Tiempo Respuesta, Length: 78662, dtype: category
Categories (5, object): ['0' < '1' < '2' < '3' < '4']
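
`pd.qcut` with 5 bins produces classes of roughly equal size, which sets the reference point for any accuracy reported on them: always predicting a single class would be right about 20% of the time. A sketch on synthetic response times (the uniform distribution and the seed are assumptions for the example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tiempos = pd.Series(rng.uniform(0, 1093, size=10_000))   # synthetic response times

clases = pd.qcut(tiempos, 5, labels=["0", "1", "2", "3", "4"])
proporciones = clases.value_counts(normalize=True).sort_index()
print(proporciones)

# Accuracy of always predicting the most frequent class.
print(proporciones.max())
```

On the real data, repeated values (for example the many outliers replaced by the median) can make the bins slightly uneven, so the baseline may sit a little above 0.20.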

In [39]:
x_train,x_test,y_train,y_test=train_test_split(x_norm,df_y,test_size=0.2,random_state=39)
print(x_train.shape)
print(x_test.shape)

(62929, 124)
(15733, 124)


In [40]:
clf=RandomForestClassifier(n_estimators=1000)
clf.fit(x_train,y_train)
print('Training accuracy: ',clf.score(x_train,y_train))
pred=clf.predict(x_test)
print('Test accuracy: ',clf.score(x_test,y_test))

Training accuracy:  0.6960066106246723
Test accuracy:  0.24966630648954427


In [41]:
clf2=SVC()
clf2.fit(x_train,y_train)
print('Training accuracy: ',clf2.score(x_train,y_train))
pred=clf2.predict(x_test)
print('Test accuracy: ',clf2.score(x_test,y_test))

Training accuracy:  0.3118435061736242
Test accuracy:  0.27814148604843325


## 6. Analysis of results

The results confirm that our dataset carries too little information about the target, a problem that the synthetic variables did not compensate for. The Random Forest fits the training set well (accuracy 0.70) but fails on the test set (0.25), which is a sign of overfitting. The SVC does not change the picture: its test accuracy is only slightly higher (0.28), and its training accuracy is just 0.31. With five classes of roughly equal size, chance level is about 0.20, so both models are barely better than guessing.

In conclusion, the available data are not sufficient to achieve our objective; we would need additional information to reinforce them, for example weather data.