- knn with nearest neighbors, considering pickup and dropoff data
- knn with respect to the day of the week (weekend or not)
    -- do this with all the temporal data (Monday to Friday, and Saturday and Sunday)
- distance ==> sklearn.metrics.pairwise
    -- manhattan, euclidean, canberra, minkowski, cosine (Jere will read up on these, just because)
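
The metrics in the note above can be compared side by side with `sklearn.metrics.pairwise_distances`. This is a minimal sketch on two made-up 2-D points, not on the trip data; the names `a`, `b`, and `distances` are illustrative only.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Two made-up 2-D points (not taken from the dataset)
a = np.array([[1.0, 2.0]])
b = np.array([[4.0, 6.0]])

distances = {
    "manhattan": pairwise_distances(a, b, metric="manhattan")[0, 0],
    "euclidean": pairwise_distances(a, b, metric="euclidean")[0, 0],
    "canberra": pairwise_distances(a, b, metric="canberra")[0, 0],
    # Extra keyword arguments such as p are forwarded to the metric
    "minkowski_p3": pairwise_distances(a, b, metric="minkowski", p=3)[0, 0],
    "cosine": pairwise_distances(a, b, metric="cosine")[0, 0],
}

for name, value in distances.items():
    print(f"{name}: {value:.4f}")
```

Cosine distance depends only on the angle between the two vectors, so it comes out close to zero here even though the points are several units apart. That is worth keeping in mind before using it on raw coordinates.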

# Context

This project is about **Uber Inc.**, one of the world's largest ride-hailing companies. Our goal in this work is to **predict the fare of future trips**.

Uber serves millions of customers every day, so managing its data properly is key to developing new business strategies and achieving better results.

### Dataset variables

**Explanatory variables:**
- **key**: unique identifier of each trip.
- **date**: date and time.
- **pickup_datetime**: date and time at which the trip started.
- **passenger_count**: number of passengers in the vehicle (entered by the driver).
- **pickup_longitude**: longitude of the trip's starting point.
- **pickup_latitude**: latitude of the trip's starting point.
- **dropoff_longitude**: longitude of the destination.
- **dropoff_latitude**: latitude of the destination.

**Target variable:**
- **fare_amount**: cost of the trip in dollars.

# Libraries

In [18]:
# Import the libraries used in this notebook
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
import random
import math

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler   # or other scalers
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, LassoCV, RidgeCV, ElasticNetCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline

import holidays
from geopy.distance import geodesic

# Loading the dataset



In [19]:
# Load the dataset into a dataframe
file_path = 'uber_fares.csv'

df = pd.read_csv(file_path)


# Exploratory data analysis

## Descriptive analysis

In [20]:
df.head()

Unnamed: 0,key,date,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,24238194,2015-05-07 19:52:06.0000003,7.5,2015-05-07 19:52:06 UTC,-73.999817,40.738354,-73.999512,40.723217,1
1,27835199,2009-07-17 20:04:56.0000002,7.7,2009-07-17 20:04:56 UTC,-73.994355,40.728225,-73.99471,40.750325,1
2,44984355,2009-08-24 21:45:00.00000061,12.9,2009-08-24 21:45:00 UTC,-74.005043,40.74077,-73.962565,40.772647,1
3,25894730,2009-06-26 08:22:21.0000001,5.3,2009-06-26 08:22:21 UTC,-73.976124,40.790844,-73.965316,40.803349,3
4,17610152,2014-08-28 17:47:00.000000188,16.0,2014-08-28 17:47:00 UTC,-73.925023,40.744085,-73.973082,40.761247,5


In [21]:
# Columns: which are numeric variables and which are categorical?
df.columns

Index(['key', 'date', 'fare_amount', 'pickup_datetime', 'pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'passenger_count'],
      dtype='object')

- Numeric variables: fare_amount, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, passenger_count.

- Categorical variables: key, date, pickup_datetime, passenger_count.
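
As a cross-check on the classification above, pandas can split columns by dtype. The sketch below uses a made-up two-row frame modeled on the `df.head()` output; `toy`, `numeric_cols`, and `other_cols` are hypothetical names. Note that dtype alone labels `key` as numeric even though it is an identifier, so the split still needs a manual review.

```python
import pandas as pd

# Made-up frame with a few of the dataset's columns
toy = pd.DataFrame({
    "key": [24238194, 27835199],
    "pickup_datetime": ["2015-05-07 19:52:06 UTC", "2009-07-17 20:04:56 UTC"],
    "fare_amount": [7.5, 7.7],
    "passenger_count": [1, 1],
})

# Columns stored as numbers versus everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```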

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   key                200000 non-null  int64  
 1   date               200000 non-null  object 
 2   fare_amount        200000 non-null  float64
 3   pickup_datetime    200000 non-null  object 
 4   pickup_longitude   200000 non-null  float64
 5   pickup_latitude    200000 non-null  float64
 6   dropoff_longitude  199999 non-null  float64
 7   dropoff_latitude   199999 non-null  float64
 8   passenger_count    200000 non-null  int64  
dtypes: float64(5), int64(2), object(2)
memory usage: 13.7+ MB


In [23]:
# Null values
df.isna().sum()

Unnamed: 0,0
key,0
date,0
fare_amount,0
pickup_datetime,0
pickup_longitude,0
pickup_latitude,0
dropoff_longitude,1
dropoff_latitude,1
passenger_count,0


In [24]:
# Main descriptive statistics
df.describe()

Unnamed: 0,key,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,200000.0,200000.0,200000.0,200000.0,199999.0,199999.0,200000.0
mean,27712500.0,11.359955,-72.527638,39.935885,-72.525292,39.92389,1.684535
std,16013820.0,9.901776,11.437787,7.720539,13.117408,6.794829,1.385997
min,1.0,-52.0,-1340.64841,-74.015515,-3356.6663,-881.985513,0.0
25%,13825350.0,6.0,-73.992065,40.734796,-73.991407,40.733823,1.0
50%,27745500.0,8.5,-73.981823,40.752592,-73.980093,40.753042,1.0
75%,41555300.0,12.5,-73.967154,40.767158,-73.963658,40.768001,2.0
max,55423570.0,499.0,57.418457,1644.421482,1153.572603,872.697628,208.0


## Outlier analysis

In [25]:
# Drop rows where the fare is missing, zero, or negative

df = df[df["fare_amount"].notna()]               # remove NaN / None
df = df[df["fare_amount"].astype(float) > 0]     # keep strictly positive fares

In [26]:
# Number of passengers: clamp the values to the range 0..6
df.loc[(df['passenger_count'] < 0), 'passenger_count'] = 0
df.loc[(df['passenger_count'] > 6), 'passenger_count'] = 6

# Create categories for the number of passengers
def categorize_passengers(count):
    if count == 0:
        return 'delivery'
    elif 1 <= count <= 4:
        return 'normal'
    elif 5 <= count <= 6:
        return 'xl'

df['passenger_category'] = df['passenger_count'].apply(categorize_passengers)

# Create dummy variables for the categories
passenger_dummies = pd.get_dummies(df['passenger_category'], prefix='passenger', dtype=int)
df = pd.concat([df, passenger_dummies], axis=1)

# Drop the columns that will no longer be used
df.drop(columns=['passenger_category','passenger_count'], inplace=True)

In [79]:
# Split the pickup datetime into date and time
df["dateTime"] = pd.to_datetime(df["pickup_datetime"]) # no errors= argument, since we checked that there are no null values
df["dateTime"] = df["dateTime"].dt.tz_convert("America/New_York") # this also accounts for daylight saving time

df["time"] = df["dateTime"].dt.time
df["date"] = df["dateTime"].dt.date
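
The time-zone conversion matters because the raw timestamps are in UTC, so the local hour, and sometimes the local date, differ from the raw values. Here is a small sketch with two timestamps: the first comes from the `df.head()` output, the second is made up to show a date change. `stamps` and `local` are illustrative names.

```python
import pandas as pd

stamps = pd.to_datetime(pd.Series([
    "2015-05-07 19:52:06 UTC",  # May: daylight saving time in New York (UTC-4)
    "2009-01-01 02:30:00 UTC",  # January: standard time in New York (UTC-5), made-up value
]))

# Convert from UTC to New York local time
local = stamps.dt.tz_convert("America/New_York")

local_hours = local.dt.hour.tolist()
local_dates = [str(d) for d in local.dt.date]

print(local_hours)
print(local_dates)
```

The second timestamp falls on the previous calendar day in local time, which would change both the weekday and the holiday flag derived from it.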

In [80]:
# Encoding of ...
df["date"] = pd.to_datetime(df["date"])
df["weekday_num"] = df["date"].dt.dayofweek + 1  # 1 = Monday ... 7 = Sunday

k = (2*math.pi)/7 # 7 for the days of the week

df["sen_weekday_num"] = np.sin(k*df["weekday_num"])
df["cos_weekday_num"] = np.cos(k*df["weekday_num"])
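
The point of the sine/cosine pair is that it places the weekdays on a circle, so Sunday (7) ends up as close to Monday (1) as Monday is to Tuesday. A short standard-library sketch of that property follows; `encode_weekday` and `circle_distance` are hypothetical helpers, not part of the notebook.

```python
import math

K = 2 * math.pi / 7  # same constant as k in the cell above


def encode_weekday(weekday_num):
    """Map a weekday number (1 = Monday ... 7 = Sunday) to a point on the unit circle."""
    return (math.sin(K * weekday_num), math.cos(K * weekday_num))


def circle_distance(day_a, day_b):
    """Euclidean distance between the encoded points of two weekdays."""
    return math.dist(encode_weekday(day_a), encode_weekday(day_b))


sunday_to_monday = circle_distance(7, 1)
monday_to_tuesday = circle_distance(1, 2)
monday_to_thursday = circle_distance(1, 4)

print(sunday_to_monday, monday_to_tuesday, monday_to_thursday)
```

With the raw numbers 1 to 7, Sunday and Monday would be six units apart. On the circle they are neighbors, which is the behavior a distance-based model such as KNN needs.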

In [81]:
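# Fetch the holidays to flag whether a day is a weekend or a holiday.
# The calendar is filled lazily per year, so pass the years up front; otherwise isin() sees no dates.
# (Newer releases of holidays take subdiv= instead of state=.)
us_holidays = holidays.US(state="NY", years=df["date"].dt.year.unique().tolist())

df["weekend_or_holiday"] = ((df["weekday_num"] >= 6) | (df["date"].dt.date.isin(list(us_holidays.keys())))).astype(int)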

In [82]:
# 1 to 7 ==> early morning ==> V = 1 ==> BINARY ENCODING ==> 0 | 0
# 7 to 11 ==> morning ==> V = 2 ==> BINARY ENCODING ==> 0 | 1
# 11 to 19 ==> afternoon ==> V = 3 ==> BINARY ENCODING ==> 1 | 1
# 19 to 1 ==> night ==> V = 4 ==> BINARY ENCODING ==> 1 | 0

df["time"] = df["dateTime"].dt.hour

df["bin_time_1"] = 0
df["bin_time_2"] = 0

df.loc[(df["time"] >= 7) & (df["time"] < 11), ["bin_time_1", "bin_time_2"]] = [0, 1]
df.loc[(df["time"] >= 11) & (df["time"] < 19), ["bin_time_1", "bin_time_2"]] = [1, 1]
df.loc[(df["time"] >= 19) | (df["time"] < 1), ["bin_time_1", "bin_time_2"]] = [1, 0]
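
The two-bit scheme above is a Gray code: each time band differs from the next by exactly one bit, including the wrap-around from night back to early morning. The sketch below restates the rule as a plain function to check this; `time_of_day_bits` and `differing_bits` are hypothetical helpers.

```python
def time_of_day_bits(hour):
    """Return (bin_time_1, bin_time_2) for an hour in 0..23."""
    if 1 <= hour < 7:
        return (0, 0)  # early morning
    if 7 <= hour < 11:
        return (0, 1)  # morning
    if 11 <= hour < 19:
        return (1, 1)  # afternoon
    return (1, 0)      # night: hours 19..23 and hour 0


def differing_bits(code_a, code_b):
    """Count the positions in which two codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


# One representative hour per band, in chronological order
bands = [time_of_day_bits(h) for h in (3, 9, 15, 22)]
# Compare each band with the next one, wrapping from night back to early morning
steps = [differing_bits(bands[i], bands[(i + 1) % 4]) for i in range(4)]

print(bands)
print(steps)
```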

In [83]:
# ...
# Cyclical (sine/cosine) encoding of the ISO week number, using a period of 52 weeks
k_week = 2*np.pi/52

df["week"] = df["date"].dt.isocalendar().week

df["sen_week_num"] = np.sin(k_week*df["week"])
df["cos_week_num"] = np.cos(k_week*df["week"])
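
One detail worth checking in the week encoding is that some ISO years have 53 weeks. With a period of 52, week 53 lands on the same point of the circle as week 1. The standard-library sketch below illustrates this; `encode_week` is a hypothetical helper, and whether the overlap matters for the model is left open.

```python
import math
from datetime import date

K_WEEK = 2 * math.pi / 52  # same constant as k_week in the cell above


def encode_week(week):
    """Map an ISO week number to a point on the unit circle."""
    return (math.sin(K_WEEK * week), math.cos(K_WEEK * week))


# 31 December 2015 belongs to ISO week 53
week_of_dec_31_2015 = date(2015, 12, 31).isocalendar()[1]

point_week_53 = encode_week(53)
point_week_1 = encode_week(1)

print(week_of_dec_31_2015)
print(point_week_53, point_week_1)
```

Since the last days of December and the first days of January are adjacent in time anyway, the overlap may be acceptable. It is still a modeling choice rather than a given.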

In [85]:
# WINTER: December to February ==> BINARY ENCODING ==> 0 | 0
# SPRING: March to May ==> BINARY ENCODING ==> 0 | 1
# SUMMER: June to August ==> BINARY ENCODING ==> 1 | 1
# AUTUMN: September to November ==> BINARY ENCODING ==> 1 | 0

df['month'] = df['date'].dt.month

df.loc[(df['month'] >= 3) & (df['month'] < 6), ["bin_month_1", "bin_month_2"]] = [0, 1]
df.loc[(df['month'] >= 6) & (df['month'] < 9), ["bin_month_1", "bin_month_2"]] = [1, 1]
df.loc[(df['month'] >= 9) & (df['month'] < 12), ["bin_month_1", "bin_month_2"]] = [1, 0]
df.loc[(df['month'] == 12) | (df['month'] < 3), ["bin_month_1", "bin_month_2"]] = [0, 0]

Official source

The NYC Department of City Planning (DCP) provides the NYC Borough Boundary shapefile ("nybb"), which defines the city boundary clipped to the shoreline (source: s-media.nyc.gov).

The metadata for that dataset gives the bounding box of the official polygon: West −74.257159°, East −73.699215°, North 40.915568°, South 40.496010°.

longitude ==> east-west

latitude ==> north-south
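
The bounding box quoted above translates directly into a validity check. This is a minimal sketch, assuming a point counts as valid when it lies inside the box (the box is larger than the city itself, so this is only a coarse filter); `in_nyc_bbox` is a hypothetical helper.

```python
# Bounding box from the NYC Borough Boundary metadata quoted above
WEST, EAST = -74.257159, -73.699215
SOUTH, NORTH = 40.496010, 40.915568


def in_nyc_bbox(latitude, longitude):
    """Return True when the point lies inside the NYC bounding box."""
    return SOUTH <= latitude <= NORTH and WEST <= longitude <= EAST


# First pickup shown in the df.head() output
inside = in_nyc_bbox(40.738354, -73.999817)
# A point whose latitude is south of the box
outside = in_nyc_bbox(40.221474, -73.733760)
# Swapping latitude and longitude is a common mistake and fails the check
swapped = in_nyc_bbox(-73.999817, 40.738354)

print(inside, outside, swapped)
```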

In [88]:
# Set invalid coordinates to NaN (coarse cut-offs; note that -74 lies east of the official western bound of -74.257159 quoted above)
df.loc[df["pickup_latitude"] > 41, ["pickup_latitude"]] = np.nan
df.loc[df["pickup_latitude"] < 40, ["pickup_latitude"]] = np.nan
df.loc[df["dropoff_latitude"] > 41, ["dropoff_latitude"]] = np.nan
df.loc[df["dropoff_latitude"] < 40, ["dropoff_latitude"]] = np.nan

df.loc[df["pickup_longitude"] > -73.50, ["pickup_longitude"]] = np.nan
df.loc[df["pickup_longitude"] < -74, ["pickup_longitude"]] = np.nan
df.loc[df["dropoff_longitude"] > -73.50, ["dropoff_longitude"]] = np.nan
df.loc[df["dropoff_longitude"] < -74, ["dropoff_longitude"]] = np.nan

The NaN values produced above are then filled in with KNN imputation.
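
To make the idea concrete, here is a minimal sketch of `KNNImputer` on a made-up four-row frame; the values and the choice of `n_neighbors=2` are illustrative only. The missing longitude is replaced by the mean longitude of the two rows whose latitude is closest.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up coordinates; row 2 is missing its longitude
toy = pd.DataFrame({
    "pickup_latitude": [40.70, 40.72, 40.71, 40.90],
    "pickup_longitude": [-74.00, -73.98, np.nan, -73.80],
})

imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

# Rows 0 and 1 are the nearest neighbors of row 2 (by latitude),
# so the imputed value is the mean of their longitudes
imputed_longitude = filled.loc[2, "pickup_longitude"]
print(imputed_longitude)
```

Distances are computed only over the features that are present, so a row with all four coordinates missing has no usable neighbors and is imputed from column means instead.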

In [89]:
cols = ["pickup_latitude", "dropoff_latitude", "pickup_longitude", "dropoff_longitude"]

df[cols] = df[cols].replace(0.0, np.nan)

conditions = {
    "pickup_latitude":  (df["pickup_latitude"] > 40.91) | (df["pickup_latitude"] < 40.49),
    "dropoff_latitude": (df["dropoff_latitude"] > 40.91) | (df["dropoff_latitude"] < 40.49),
    "pickup_longitude": (df["pickup_longitude"] > -73.70) | (df["pickup_longitude"] < -74.25),
    "dropoff_longitude": (df["dropoff_longitude"] > -73.70) | (df["dropoff_longitude"] < -74.25),
} # True where a coordinate falls outside the New York City bounding box

invalid_mask = (
    conditions["pickup_latitude"] |
    conditions["dropoff_latitude"] |
    conditions["pickup_longitude"] |
    conditions["dropoff_longitude"]
)

print(df.loc[invalid_mask, ["pickup_latitude", "dropoff_latitude", "pickup_longitude", "dropoff_longitude"]])

df[df["pickup_latitude"] ==  0.007380]



        pickup_latitude  dropoff_latitude  pickup_longitude  dropoff_longitude
696           40.190564         40.190564               NaN                NaN
917           40.750307         40.911958               NaN         -73.910177
1738          40.966745         40.960380        -73.867800         -73.855478
1905          40.221474         40.223950        -73.733760         -73.737622
2671          40.428030         40.422658        -73.902908         -73.931997
...                 ...               ...               ...                ...
196122        40.769707         40.975602        -73.866782         -73.886882
196616        40.749896         40.973143        -73.984697                NaN
197034        40.646700         40.653412        -73.781310         -73.691822
197604        40.724057         40.723995        -73.587592         -73.591222
197695        40.350780         40.358117        -73.573012         -73.548835

[163 rows x 4 columns]


Unnamed: 0,key,date,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,dateTime,...,cos_weekday_num,weekend_or_holiday,bin_time_1,bin_time_2,week,sen_week_num,cos_week_num,month,bin_month_1,bin_month_2


In [94]:
# Select the columns to impute, which also act as the features for the neighbor search
# Assumption: the columns to impute are the four coordinate columns cleaned above
cols_to_use = ["pickup_latitude", "dropoff_latitude", "pickup_longitude", "dropoff_longitude"]
df_subset = df[cols_to_use]

# Initialize KNNImputer
knn_imputer = KNNImputer(n_neighbors=10)

# Perform imputation
df_imputed_subset = pd.DataFrame(knn_imputer.fit_transform(df_subset),
                                columns=cols_to_use,
                                index=df.index)

# Merge imputed columns back into the original DataFrame
df[cols_to_use] = df_imputed_subset

## Outliers and missing data

Outliers or incorrectly entered values
*   To restrict longitudes and latitudes to the contiguous United States, one can use the approximate geographic ranges that correspond to the country's borders. The approximate values are: Latitudes: from 24.396308° N (the southern limit, in Florida) to 49.384358° N (the northern limit, near the border between the US and Canada). Longitudes: from -125.0° (on the west coast) to -66.93457° (on the east coast, in Maine).
*   A car cannot carry 208 passengers (the maximum is 6, in Uber XL).









In [None]:
# Replace values in 'passenger_count' that are negative or greater than 6 with the column's mode

def impute_passengers(df):
  '''
  Replaces negative values and values greater than 6 in the 'passenger_count'
  column with the column's mode and returns the modified dataframe.
  Note: the dataframe passed in is modified in place.
  '''
  passenger_mode = df['passenger_count'].mode()[0]
  df.loc[(df['passenger_count'] < 0) | (df['passenger_count'] > 6), 'passenger_count'] = passenger_mode
  return df
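
Here is a usage sketch of the same idea on a made-up column. The function is restated in condensed form so the block runs on its own, and it works on a copy, which deviates from the in-place version above; `impute_passengers_copy` is a hypothetical name.

```python
import pandas as pd


def impute_passengers_copy(df):
    """Hypothetical non-mutating variant: returns a new frame with invalid counts replaced."""
    out = df.copy()
    invalid = (out["passenger_count"] < 0) | (out["passenger_count"] > 6)
    # The mode is taken over the whole column, invalid entries included, as in the original
    passenger_mode = out["passenger_count"].mode()[0]
    out.loc[invalid, "passenger_count"] = passenger_mode
    return out


toy = pd.DataFrame({"passenger_count": [1, 1, 2, 208, -3, 1]})
result = impute_passengers_copy(toy)

print(result["passenger_count"].tolist())
print(toy["passenger_count"].tolist())  # the input frame is left untouched
```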

In [None]:
# Impute the invalid coordinates with KNN

def impute_coordinates_train_test(X_train, X_test):
    """
    Imputes missing or invalid coordinates in the train and test sets using KNN.
    The imputer is fitted on the training set only and then applied to both sets.
    """
    # Columns to impute
    cols_to_impute = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']

    # Work on copies to avoid modifying the originals
    X_train_imputed = X_train.copy()
    X_test_imputed = X_test.copy()

    # Mark invalid coordinates as NaN
    for col in cols_to_impute:
        if 'longitude' in col:
            X_train_imputed.loc[(X_train_imputed[col] < -125.0) | (X_train_imputed[col] > -66.93457) | (X_train_imputed[col] == 0) | (X_train_imputed[col].isna()), col] = np.nan
            X_test_imputed.loc[(X_test_imputed[col] <  -125.0) | (X_test_imputed[col] > -66.93457) | (X_test_imputed[col] == 0) | (X_test_imputed[col].isna()), col] = np.nan
        elif 'latitude' in col:
            X_train_imputed.loc[(X_train_imputed[col] < 24.396308) | (X_train_imputed[col] >  49.384358) | (X_train_imputed[col] == 0) | (X_train_imputed[col].isna()), col] = np.nan
            X_test_imputed.loc[(X_test_imputed[col] < 24.396308) | (X_test_imputed[col] >  49.384358) | (X_test_imputed[col] == 0) | (X_test_imputed[col].isna()), col] = np.nan

    # Fit the imputer on the training set
    imputer = KNNImputer(n_neighbors=10)
    imputer.fit(X_train_imputed[cols_to_impute])

    # Apply the imputer to both sets (training and test)
    X_train_imputed[cols_to_impute] = imputer.transform(X_train_imputed[cols_to_impute])
    X_test_imputed[cols_to_impute] = imputer.transform(X_test_imputed[cols_to_impute])

    return X_train_imputed, X_test_imputed

'''# Call the function to impute both sets. SHOULD WE REMOVE THIS?
X_train_imputed, X_test_imputed = impute_coordinates_train_test(X_train, X_test)'''
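
The four near-identical masking lines in `impute_coordinates_train_test` could be condensed with `Series.between` and `Series.where`. This is a sketch under the same approximate contiguous-US bounds; `mask_invalid_coordinates` is a hypothetical helper and the data is made up. A value of 0 already falls outside both ranges, so it needs no separate test.

```python
import numpy as np
import pandas as pd

# Approximate contiguous-US bounds used in the function above
LON_RANGE = (-125.0, -66.93457)
LAT_RANGE = (24.396308, 49.384358)


def mask_invalid_coordinates(frame, columns):
    """Return a copy of frame with out-of-range coordinates replaced by NaN."""
    out = frame.copy()
    for col in columns:
        low, high = LON_RANGE if "longitude" in col else LAT_RANGE
        # between() is inclusive on both ends and False for NaN,
        # so where() leaves NaN for invalid and missing values alike
        out[col] = out[col].where(out[col].between(low, high))
    return out


toy = pd.DataFrame({
    "pickup_longitude": [-73.99, 0.0, -1340.64841, np.nan],
    "pickup_latitude": [40.75, 0.0, 1644.421482, 40.70],
})
masked = mask_invalid_coordinates(toy, ["pickup_longitude", "pickup_latitude"])
print(masked)
```

The same helper can be applied to the train and test sets before fitting the imputer on the training set, which keeps the range logic in one place.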