# Time Series Analysis
Supervised and Unsupervised Models.

## Multiclass Classification

In this document we try to predict how long a given delivery will take, within a range of 0 to 20 days.

To do so, we run a multiclass classification with several algorithms and compare them to determine which one predicts the expected result most accurately.

## Libraries
We load the libraries needed for the analysis.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier

## Data
To begin, we load the shipment data for March 2019.

In [2]:
cols = ['service',
        'sender_zipcode',
        'receiver_zipcode',
        'sender_state',
        'receiver_state',
        'shipment_type',
        'quantity',
        'status',
        'date_created',
        'date_sent',
        'date_visit',
        'target']

df = pd.read_csv('./shipments_BR_201903.csv', usecols=cols)

In [3]:
df.head(5)

| | sender_state | sender_zipcode | receiver_state | receiver_zipcode | shipment_type | quantity | service | status | date_created | date_sent | date_visit | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SP | 3005 | SP | 5409 | express | 1 | 0 | done | 2019-03-04 00:00:00 | 2019-03-05 13:24:00 | 2019-03-07 18:01:00 | 2 |
| 1 | SP | 17052 | MG | 37750 | standard | 1 | 1 | done | 2019-03-19 00:00:00 | 2019-03-20 14:44:00 | 2019-03-27 10:21:00 | 5 |
| 2 | SP | 2033 | SP | 11040 | express | 1 | 0 | done | 2019-02-18 00:00:00 | 2019-02-21 15:08:00 | 2019-02-28 18:19:00 | 5 |
| 3 | SP | 13900 | SP | 18500 | express | 1 | 0 | done | 2019-03-09 00:00:00 | 2019-03-11 15:48:00 | 2019-03-12 13:33:00 | 1 |
| 4 | SP | 4361 | RS | 96810 | express | 1 | 0 | done | 2019-03-08 00:00:00 | 2019-03-12 08:19:00 | 2019-03-16 08:24:00 | 4 |


Since we only try to predict shipments that take 0 to 20 days, any target greater than 20 is capped at 20.

In [4]:
df['target'] = np.where(df['target'] > 20, 20, df['target'])

To include the shipment type among the features, we split it into 3 independent indicator features. The cell is commented out here, so the models below do not use them.

In [5]:
#df = pd.get_dummies(df, columns=['shipment_type'])
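As an illustration of what the commented-out cell would do, here is a minimal sketch on a toy DataFrame. The toy rows are made up for the example; the three shipment types are the ones named in the commented-out feature list of this notebook.

```python
import pandas as pd

# Toy data (hypothetical rows) with the three shipment types.
toy = pd.DataFrame({
    'shipment_type': ['express', 'standard', 'super', 'express'],
    'quantity': [1, 1, 2, 1],
})

# get_dummies replaces the categorical column with one indicator column per value.
encoded = pd.get_dummies(toy, columns=['shipment_type'])
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

Each original row ends up with exactly one of the three indicator columns set.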

We define the columns used as features for the model and the column used as target.

In [6]:
#features = ['sender_zipcode', 'receiver_zipcode', 'service', 'shipment_type_express', 'shipment_type_standard', 'shipment_type_super']
features = ['sender_zipcode', 'receiver_zipcode', 'service']
target = 'target'

Before applying any prediction model, we plot the distribution of the data across the different features (the plotting cell is commented out here).

In [7]:
#sns.pairplot(
#    data=df,
#    vars=features,
#    hue=target)

## Logistic Regression

As a baseline model, we use an sklearn pipeline that normalizes the data and then feeds it to a multiclass logistic regression.

In [8]:
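def drop_last_zip_digit(data):
    # Coarsen both zip-code columns by dividing them by 10.
    # Note: round() rounds to the nearest integer instead of truncating
    # (5409 becomes 541, not 540), and the input array is modified in place.
    for row in data:
        row[0] = round(row[0] / 10)
        row[1] = round(row[1] / 10)
    return data

# Positions of sender_zipcode and receiver_zipcode in the feature matrix.
zip_features = [0, 1]
function_transformer = FunctionTransformer(drop_last_zip_digit, validate=False)

model = Pipeline([
    # remainder defaults to 'drop' (see the fitted output below), so the
    # 'service' column is discarded and only the two zip codes reach the model.
    ('zip_cutter', ColumnTransformer(transformers=[('log', function_transformer, zip_features)])),
    ('normalizer', MinMaxScaler()),
    ('reduce_dim', PCA()),
    ('classifier', LogisticRegression(solver='lbfgs',
                                      multi_class='multinomial',
                                      max_iter=500))
])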
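If the intent is to strictly remove the last digit of each zip code, one possible alternative is floor division on the whole array. This is only a sketch: `truncate_last_zip_digit` is a hypothetical name, not part of the notebook, and using it would change the inputs the models see, so the scores reported in this notebook would not carry over.

```python
import numpy as np

def truncate_last_zip_digit(data):
    """Return a new array with the last decimal digit of each value dropped."""
    # Floor division by 10 truncates non-negative values and leaves
    # the caller's array untouched (a new array is returned).
    return np.floor_divide(np.asarray(data, dtype=float), 10)

# Zip codes taken from the first two rows of the sample shown earlier.
zips = np.array([[3005.0, 5409.0], [17052.0, 37750.0]])
truncated = truncate_last_zip_digit(zips)
print(truncated)
```

Being vectorized, this also avoids the Python-level loop over several hundred thousand rows.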

We define a cut-off date to separate training and test data: shipments delivered on or before the cut-off go to training, and shipments created after it go to test. To make sure the original data is not modified, we work on a copy of the original dataset.

In [9]:
copy = df.copy()

cut_off = '2019-03-20'
df_train = copy.query(f'date_visit <= "{cut_off}"')
df_test = copy.query(f'date_created > "{cut_off}"')

X_train = df_train[features].values.astype(float)  # np.float was removed from NumPy
y_train = df_train[target].values

X_test = df_test[features].values.astype(float)
y_test = df_test[target].values

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((673645, 3), (673645,), (76378, 3), (76378,))
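To make the effect of the cut-off concrete, here is a small sketch on toy rows. It assumes the date columns are parsed as datetimes and uses boolean masks instead of `query`; the third row is made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    'date_created': ['2019-03-04 00:00:00', '2019-03-19 00:00:00', '2019-03-21 00:00:00'],
    'date_visit':   ['2019-03-07 18:01:00', '2019-03-27 10:21:00', '2019-03-25 09:00:00'],
})
# Parse the strings so comparisons are done on real timestamps.
toy['date_created'] = pd.to_datetime(toy['date_created'])
toy['date_visit'] = pd.to_datetime(toy['date_visit'])

cut_off = pd.Timestamp('2019-03-20')
train = toy[toy['date_visit'] <= cut_off]    # delivered on or before the cut-off
test = toy[toy['date_created'] > cut_off]    # created after the cut-off

# Rows created before the cut-off but delivered after it land in neither set.
dropped = len(toy) - len(train) - len(test)
print(len(train), len(test), dropped)
```

Dropping shipments that straddle the cut-off keeps information about deliveries that finished after the cut-off out of the training set.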

In [10]:
y_train = df_train[target].values
y_test = df_test[target].values
y_train.shape, y_test.shape

((673645,), (76378,))

We fit the model on the training data defined above.

In [11]:
%%time
result = model.fit(X_train, y_train)
result

Wall time: 1min 4s


Pipeline(memory=None,
         steps=[('zip_cutter',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('log',
                                                  FunctionTransformer(accept_sparse=False,
                                                                      check_inverse=True,
                                                                      func=<function drop_last_zip_digit at 0x0000021DB7182268>,
                                                                      inv_kw_args=None,
                                                                      inverse_func=None,
                                                                      kw_args=None,
                                                                      pass_y='deprecated',
                                            

We compute scores on the test predictions to gauge how accurate the model is.

In [12]:
y_pred = model.predict(X_test)

metrics = {
    'accuracy': accuracy_score(y_test, y_pred),
    'precision': precision_score(y_test, y_pred, average='macro'),
    'recall': recall_score(y_test, y_pred, average='macro'),
    'f1_score': f1_score(y_test, y_pred, average='macro'),
}

metrics



{'accuracy': 0.42414045929456123,
 'precision': 0.09267482806430831,
 'recall': 0.08913194095878259,
 'f1_score': 0.07790023656234467}

The model does not perform well on the available data: accuracy is about 0.42, while macro-averaged precision, recall and F1 all stay below 0.10.
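A gap like this between accuracy and macro-averaged scores typically appears when a model predicts mostly the frequent classes. The sketch below computes both by hand on made-up labels; `macro_recall` is a hypothetical helper that averages over the classes present in `y_true`, written only to make the averaging explicit.

```python
def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

# Made-up labels: one dominant class (2 days) and two rare ones.
y_true = [2] * 8 + [5, 9]
y_pred = [2] * 10  # a model that always predicts the dominant class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = macro_recall(y_true, y_pred)
print(accuracy, recall)
```

Here accuracy is 0.8 because the dominant class is always right, while macro recall is one third because the two rare classes each contribute a recall of 0.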

## Decision Trees

We start from a copy of the original dataset.

In [13]:
copy = df.copy()

cut_off = '2019-03-20'
df_train = copy.query(f'date_visit <= "{cut_off}"')
df_test = copy.query(f'date_created > "{cut_off}"')

X_train = df_train[features].values.astype(float)
y_train = df_train[target].values

X_test = df_test[features].values.astype(float)
y_test = df_test[target].values

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((673645, 3), (673645,), (76378, 3), (76378,))

In [14]:
y_train = df_train[target].values
y_test = df_test[target].values
y_train.shape, y_test.shape

((673645,), (76378,))

We create an sklearn pipeline that uses XGBClassifier (gradient-boosted decision trees) as the classifier.

In [15]:
model = Pipeline([
    ('zip_cutter', ColumnTransformer(transformers=[('log', function_transformer, zip_features)])),
    ('normalizer', MinMaxScaler()),
    ('reduce_dim', PCA()),
    ('classifier', XGBClassifier())
])

In [16]:
%%time
result = model.fit(X_train, y_train)
result

Wall time: 9min 20s


Pipeline(memory=None,
         steps=[('zip_cutter',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('log',
                                                  FunctionTransformer(accept_sparse=False,
                                                                      check_inverse=True,
                                                                      func=<function drop_last_zip_digit at 0x0000021DB7182268>,
                                                                      inv_kw_args=None,
                                                                      inverse_func=None,
                                                                      kw_args=None,
                                                                      pass_y='deprecated',
                                            

In [17]:
y_pred = model.predict(X_test)

metrics = {
    'accuracy': accuracy_score(y_test, y_pred),
    'precision': precision_score(y_test, y_pred, average='macro'),
    'recall': recall_score(y_test, y_pred, average='macro'),
    'f1_score': f1_score(y_test, y_pred, average='macro'),
}

metrics



{'accuracy': 0.44998559794705284,
 'precision': 0.09554537674202396,
 'recall': 0.0936833155833895,
 'f1_score': 0.0867023032978598}

## KNN

We start from a copy of the original dataset.

In [18]:
copy = df.copy()

cut_off = '2019-03-20'
df_train = copy.query(f'date_visit <= "{cut_off}"')
df_test = copy.query(f'date_created > "{cut_off}"')

X_train = df_train[features].values.astype(float)
y_train = df_train[target].values

X_test = df_test[features].values.astype(float)
y_test = df_test[target].values

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((673645, 3), (673645,), (76378, 3), (76378,))

In [19]:
y_train = df_train[target].values
y_test = df_test[target].values
y_train.shape, y_test.shape

((673645,), (76378,))

We create a pipeline to fit a k-nearest-neighbors classifier.

In [20]:
model = Pipeline([
    ('zip_cutter', ColumnTransformer(transformers=[('log', function_transformer, zip_features)])),
    ('normalizer', MinMaxScaler()),
    ('reduce_dim', PCA()),
    ('classifier', KNeighborsClassifier())
])
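For intuition, the idea behind k-nearest neighbors can be sketched in a few lines of NumPy. This is a simplified illustration with Euclidean distance and a plain majority vote, not sklearn's implementation; `knn_predict` and the toy points are hypothetical. KNeighborsClassifier defaults to 5 neighbors; the sketch uses 3 because the toy set is tiny.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Predict the label of x by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]              # indices of the k closest points
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Two small clusters with delivery times of 1 and 4 days (made-up values).
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
y_toy = np.array([1, 1, 1, 4, 4, 4])

pred_a = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([0.95, 0.95]), k=3)
print(pred_a, pred_b)
```

Because there is nothing to optimize at fit time, fitting is fast and most of the cost moves to prediction, when distances to the training points have to be evaluated.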

In [21]:
%%time
result = model.fit(X_train, y_train)
result

Wall time: 4.13 s


Pipeline(memory=None,
         steps=[('zip_cutter',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('log',
                                                  FunctionTransformer(accept_sparse=False,
                                                                      check_inverse=True,
                                                                      func=<function drop_last_zip_digit at 0x0000021DB7182268>,
                                                                      inv_kw_args=None,
                                                                      inverse_func=None,
                                                                      kw_args=None,
                                                                      pass_y='deprecated',
                                            

In [22]:
y_pred = model.predict(X_test)

metrics = {
    'accuracy': accuracy_score(y_test, y_pred),
    'precision': precision_score(y_test, y_pred, average='macro'),
    'recall': recall_score(y_test, y_pred, average='macro'),
    'f1_score': f1_score(y_test, y_pred, average='macro'),
}

metrics



{'accuracy': 0.40133284453638485,
 'precision': 0.08561883396883968,
 'recall': 0.08138383192871715,
 'f1_score': 0.07840445569665833}

## Regression

We start from a copy of the original dataset.

In [23]:
copy = df.copy()

cut_off = '2019-03-20'
df_train = copy.query(f'date_visit <= "{cut_off}"')
df_test = copy.query(f'date_created > "{cut_off}"')

X_train = df_train[features].values.astype(float)
y_train = df_train[target].values

X_test = df_test[features].values.astype(float)
y_test = df_test[target].values

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((673645, 3), (673645,), (76378, 3), (76378,))

We create a pipeline to fit a linear regression.

In [None]:
model = Pipeline([
    ('zip_cutter', ColumnTransformer(transformers=[('log', function_transformer, zip_features)])),
    ('normalizer', MinMaxScaler()),
    ('reduce_dim', PCA()),
    ('classifier', LinearRegression())
])

In [None]:
%%time
result = model.fit(X_train, y_train)
result

In [None]:
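# LinearRegression returns continuous values, so round and clip them to the
# 0-20 classes before computing classification metrics.
y_pred = np.clip(np.rint(model.predict(X_test)), 0, 20).astype(int)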

metrics = {
    'accuracy': accuracy_score(y_test, y_pred),
    'precision': precision_score(y_test, y_pred, average='macro'),
    'recall': recall_score(y_test, y_pred, average='macro'),
    'f1_score': f1_score(y_test, y_pred, average='macro'),
}

metrics
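Classification metrics only accept discrete labels, so continuous regression outputs have to be mapped to classes first. The sketch below shows one way to do that on made-up predictions and, as an additional option that is not part of this notebook, also reports the mean absolute error in days on the raw outputs.

```python
import numpy as np

y_true = np.array([2, 5, 5, 1, 4])
y_cont = np.array([2.4, 6.7, 4.5, -0.3, 23.0])  # hypothetical regression outputs

# Round to the nearest day and clip to the 0-20 range used as classes.
y_class = np.clip(np.rint(y_cont), 0, 20).astype(int)

accuracy = float(np.mean(y_true == y_class))     # share of exact-day hits
mae = float(np.mean(np.abs(y_true - y_cont)))    # average error in days
print(y_class.tolist(), accuracy, mae)
```

The mean absolute error keeps the notion of being off by one day versus off by ten, which accuracy alone discards.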

## Prediction

We start from a copy of the original dataset.

In [24]:
copy = df.copy()

cut_off = '2019-03-20'
df_train = copy.query(f'date_visit <= "{cut_off}"')
df_test = copy.query(f'date_created > "{cut_off}"')

X_train = df_train[features].values.astype(float)
y_train = df_train[target].values

X_test = df_test[features].values.astype(float)
y_test = df_test[target].values

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((673645, 3), (673645,), (76378, 3), (76378,))