<br>
<img src="data/uber_logo.png" alt="Uber logo" style="width:300px"/><br>

# **Uber NYC Analysis & Clustering**<br>

### 👨‍💻 Jorge Gómez Galván
* LinkedIn: [linkedin.com/in/jorgeggalvan/](https://www.linkedin.com/in/jorgeggalvan/) 
* E-mail: ggalvanjorge@gmail.com

---
## **Data manipulation and transformation**

In [1]:
# Library imports
import pandas as pd
from datetime import datetime

import geopandas as gpd
from shapely.geometry import Point

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Read the monthly datasets (April to September 2014)
df_uber_apr14 = pd.read_csv('./data/uber-raw-data-apr14.csv')
df_uber_may14 = pd.read_csv('./data/uber-raw-data-may14.csv')
df_uber_jun14 = pd.read_csv('./data/uber-raw-data-jun14.csv')
df_uber_jul14 = pd.read_csv('./data/uber-raw-data-jul14.csv')
df_uber_aug14 = pd.read_csv('./data/uber-raw-data-aug14.csv')
df_uber_sep14 = pd.read_csv('./data/uber-raw-data-sep14.csv')

In [3]:
# Read the GeoDataFrame with the geographic information for the NYC neighborhoods
gdf_nyc_nbhoods = gpd.read_file('./data/nyc_neighborhoods.shp')

gdf_nyc_nbhoods

Unnamed: 0,boro_code,borough,nbhood,population,area_sqft,geometry
0,1.0,Manhattan,Battery Park City-Lower Manhattan,38015,1.901167e+07,"MULTIPOLYGON (((-74.00078 40.69429, -74.00096 ..."
1,1.0,Manhattan,Central Harlem North-Polo Grounds,79254,2.540303e+07,"POLYGON ((-73.93445 40.83598, -73.93447 40.835..."
2,1.0,Manhattan,Central Harlem South,47328,1.444123e+07,"POLYGON ((-73.94177 40.80709, -73.94226 40.806..."
3,1.0,Manhattan,Chinatown,47182,1.450188e+07,"POLYGON ((-73.98382 40.72147, -73.98386 40.721..."
4,1.0,Manhattan,Clinton,43514,1.837385e+07,"POLYGON ((-73.99383 40.77293, -73.99379 40.772..."
...,...,...,...,...,...,...
190,5.0,Staten Island,Stapleton-Rosebank,25807,4.643336e+07,"POLYGON ((-74.07258 40.63794, -74.07257 40.637..."
191,5.0,Staten Island,Todt Hill-Emerson Hill-Heartland Village-Light...,32145,1.848894e+08,"POLYGON ((-74.09777 40.61062, -74.09730 40.610..."
192,5.0,Staten Island,West New Brighton-New Brighton-St. George,31423,5.602857e+07,"POLYGON ((-74.07258 40.63794, -74.07330 40.637..."
193,5.0,Staten Island,Westerleigh,24364,6.325777e+07,"POLYGON ((-74.13047 40.63089, -74.13014 40.629..."


### **1 - Dataset unification**

#### 1.1 - Dataset concatenation

In [4]:
# Concatenate all monthly datasets into a single dataframe
dataframes = [df_uber_apr14, df_uber_may14, df_uber_jun14, df_uber_jul14, df_uber_aug14, df_uber_sep14]
df_uber = pd.concat(dataframes, ignore_index=True)

# Rename the columns
df_uber = df_uber.rename(columns={'Date/Time':'date_time', 'Lat':'lat', 'Lon':'lon', 'Base':'base'})

df_uber.head()

Unnamed: 0,date_time,lat,lon,base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512


In [5]:
# Size of the concatenated dataset
df_uber.shape

(4534327, 4)
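
The six reads and the concatenation above follow one file-name pattern, so they could be folded into a single helper. The sketch below is one possible arrangement, not the notebook's code: `load_uber_months` and `MONTHS` are hypothetical names, and the demo writes two tiny placeholder CSVs to a temporary directory instead of reading the real files.

```python
import tempfile
from pathlib import Path

import pandas as pd

MONTHS = ['apr', 'may', 'jun', 'jul', 'aug', 'sep']  # hypothetical constant

def load_uber_months(data_dir, months=MONTHS):
    """Read one raw CSV per month and stack them into a single DataFrame."""
    paths = [Path(data_dir) / f'uber-raw-data-{month}14.csv' for month in months]
    frames = [pd.read_csv(path) for path in paths]
    df = pd.concat(frames, ignore_index=True)
    return df.rename(columns={'Date/Time': 'date_time', 'Lat': 'lat', 'Lon': 'lon', 'Base': 'base'})

# Demo on placeholder files (one illustrative row each, two months only)
with tempfile.TemporaryDirectory() as tmp:
    for month in ['apr', 'may']:
        Path(tmp, f'uber-raw-data-{month}14.csv').write_text(
            'Date/Time,Lat,Lon,Base\n4/1/2014 0:11:00,40.769,-73.9549,B02512\n'
        )
    demo = load_uber_months(tmp, months=['apr', 'may'])

print(demo.shape)
```

With the real data, the call would presumably be `load_uber_months('./data')`.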

#### 1.2 - Neighborhood assignment

In [6]:
# Empty list to store the neighborhood of each trip
neighborhoods = []

# Loop to assign a neighborhood to each Uber trip
for index, row in df_uber.iterrows():

    # Build a point geometry from the trip's longitude and latitude
    point = Point(row['lon'], row['lat'])

    # Variables to track the nearest neighborhood and its minimum distance
    nearest_neighborhood = None
    min_distance = float('inf')

    # Loop over every row of the neighborhoods GeoDataFrame to find the one nearest to the trip
    for geo_index, geo_row in gdf_nyc_nbhoods.iterrows():

        # Distance between the point and the neighborhood (0 when the point lies inside the polygon)
        distance = geo_row['geometry'].distance(point)

        # Update the neighborhood name and the minimum distance when a smaller distance is found
        if distance < min_distance:
            nearest_neighborhood = geo_row['nbhood']
            min_distance = distance

    # Append the name of the nearest neighborhood to the list
    neighborhoods.append(nearest_neighborhood)
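
The nested loop above compares each of the roughly 4.5 million trips against every neighborhood polygon, which is slow in pure Python. As a hedged alternative, the sketch below uses `geopandas.sjoin_nearest` (available from geopandas 0.10) to do the same nearest-polygon lookup with a spatial index. The two square "neighborhoods" and the three trips are toy data invented for illustration, and no CRS is set on them; with the real data both frames would need to share the same CRS.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon

# Toy neighborhoods: hypothetical unit squares, not real NYC geometry
nbhoods = gpd.GeoDataFrame(
    {'nbhood': ['West', 'East']},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(2, 0), (3, 0), (3, 1), (2, 1)]),
    ],
)

# Toy trips: two inside a square, one in the gap but closer to 'West'
trips = pd.DataFrame({'lat': [0.5, 0.5, 0.5], 'lon': [0.5, 2.5, 1.2]})

# Turn the coordinates into point geometries (x = longitude, y = latitude)
trip_points = gpd.GeoDataFrame(
    trips.copy(),
    geometry=gpd.points_from_xy(trips['lon'], trips['lat']),
)

# Nearest-polygon join; the left index is preserved in the result
joined = gpd.sjoin_nearest(trip_points, nbhoods[['nbhood', 'geometry']], how='left')

# A point equidistant from several polygons yields several rows: keep the first
joined = joined[~joined.index.duplicated(keep='first')]

trips['nbhood'] = joined['nbhood']
print(trips)
```

Like the loop, this measures planar distance in the units of the coordinates (degrees for latitude and longitude), so it should pick the same neighborhoods except where ties are broken differently.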

In [7]:
# Add the list of neighborhoods to the dataset
df_uber['nbhood'] = neighborhoods

df_uber.head(10)

Unnamed: 0,date_time,lat,lon,base,nbhood
0,2014-04-01 00:11:00,40.769,-73.9549,B02512,Lenox Hill-Roosevelt Island
1,2014-04-01 00:17:00,40.7267,-74.0345,B02512,SoHo-TriBeCa-Civic Center-Little Italy
2,2014-04-01 00:21:00,40.7316,-73.9873,B02512,East Village
3,2014-04-01 00:28:00,40.7588,-73.9776,B02512,Midtown-Midtown South
4,2014-04-01 00:33:00,40.7594,-73.9722,B02512,Turtle Bay-East Midtown
5,2014-04-01 00:33:00,40.7383,-74.0403,B02512,West Village
6,2014-04-01 00:39:00,40.7223,-73.9887,B02512,Chinatown
7,2014-04-01 00:45:00,40.762,-73.979,B02512,Midtown-Midtown South
8,2014-04-01 00:55:00,40.7524,-73.996,B02512,Hudson Yards-Chelsea-Flatiron-Union Square
9,2014-04-01 01:01:00,40.7575,-73.9846,B02512,Midtown-Midtown South


In [8]:
# Check for trips with null values in the neighborhood column
df_uber[df_uber['nbhood'].isnull()]

Unnamed: 0,date_time,lat,lon,base,nbhood


#### 1.3 - Join with the GeoDataFrame

In [9]:
# Merge the dataset with the GeoDataFrame to add each neighborhood's borough
df_uber = df_uber.merge(gdf_nyc_nbhoods[['borough','nbhood']], how='left', on='nbhood')

df_uber.head()

Unnamed: 0,date_time,lat,lon,base,nbhood,borough
0,2014-04-01 00:11:00,40.769,-73.9549,B02512,Lenox Hill-Roosevelt Island,Manhattan
1,2014-04-01 00:17:00,40.7267,-74.0345,B02512,SoHo-TriBeCa-Civic Center-Little Italy,Manhattan
2,2014-04-01 00:21:00,40.7316,-73.9873,B02512,East Village,Manhattan
3,2014-04-01 00:28:00,40.7588,-73.9776,B02512,Midtown-Midtown South,Manhattan
4,2014-04-01 00:33:00,40.7594,-73.9722,B02512,Turtle Bay-East Midtown,Manhattan
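
A left merge silently multiplies rows if a neighborhood name appears more than once in the lookup table. As a minimal sketch of a guard, assuming the lookup should hold one row per `nbhood`, the `validate` argument of `DataFrame.merge` raises an error when that assumption fails. The two small frames below are illustrative stand-ins for `df_uber` and `gdf_nyc_nbhoods`.

```python
import pandas as pd

# Stand-in for the trips dataframe
trips = pd.DataFrame({'nbhood': ['Chinatown', 'Clinton', 'Chinatown']})

# Stand-in for the neighborhood lookup (one row per neighborhood)
lookup = pd.DataFrame({
    'borough': ['Manhattan', 'Manhattan'],
    'nbhood': ['Chinatown', 'Clinton'],
})

# validate='many_to_one' raises MergeError if 'nbhood' is duplicated in lookup
merged = trips.merge(lookup, how='left', on='nbhood', validate='many_to_one')

# The row count should be unchanged by a many-to-one left merge
assert len(merged) == len(trips)
print(merged)
```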


### **2 - Null value check**

In [10]:
# Count the null values in each column
df_uber.isnull().sum()

date_time    0
lat          0
lon          0
base         0
nbhood       0
borough      0
dtype: int64

### **3 - Variable transformation**

#### 3.1 - Date and time conversion

In [11]:
# Convert the date and time column to datetime type
df_uber['date_time'] = pd.to_datetime(df_uber['date_time'])
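
The raw strings shown earlier (for example `4/1/2014 0:11:00`) are month-first, since the April file starts on 1 April. Passing an explicit `format` to `pd.to_datetime` removes any ambiguity and usually speeds up parsing on large columns. This is a small sketch on two sample strings; the second value is made up for illustration.

```python
import pandas as pd

raw = pd.Series(['4/1/2014 0:11:00', '9/30/2014 12:00:00'])

# Month/day/year with a 24-hour clock, matching the raw Uber files
parsed = pd.to_datetime(raw, format='%m/%d/%Y %H:%M:%S')

print(parsed)
```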

#### 3.2 - Creating variables from the date and time

In [12]:
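# Extract date, month, ISO week number, day of the week (1 = Monday), day, hour and minute
df_uber['date'] = df_uber['date_time'].dt.date
df_uber['n_month'] = df_uber['date_time'].dt.month
# .dt.week was removed in pandas 2.0; isocalendar().week returns the same ISO week number
df_uber['n_week'] = df_uber['date_time'].dt.isocalendar().week.astype('int64')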
df_uber['n_day_week'] = df_uber['date_time'].dt.dayofweek + 1
df_uber['day'] = df_uber['date_time'].dt.day
df_uber['hour'] = df_uber['date_time'].dt.hour
df_uber['minute'] = df_uber['date_time'].dt.minute
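
As a quick check of the week and weekday conventions used here, the sketch below derives the ISO week through `Series.dt.isocalendar()` and the 1-based weekday for two timestamps. The first comes from the data shown above; the second is an illustrative value.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2014-04-01 00:11:00', '2014-09-30 12:00:00']))

# ISO week number; isocalendar() returns UInt32, so cast to a plain int64
n_week = ts.dt.isocalendar().week.astype('int64')

# dayofweek is 0 for Monday, so adding 1 gives 1 = Monday ... 7 = Sunday
n_day_week = ts.dt.dayofweek + 1

print(n_week.tolist(), n_day_week.tolist())
```

Both dates are Tuesdays, so `n_day_week` is 2 for each, which matches the `n_day_week` value in the notebook's output for 1 April 2014.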

In [13]:
# Spanish abbreviations for month names and days of the week
months_spanish = ['ene','feb','mar','abr','may','jun','jul','ago','sep','oct','nov','dic']
day_week_spanish = ['lun','mar','mie','jue','vie','sab','dom']

# Map the month and weekday numbers to their Spanish abbreviations
df_uber['month'] = df_uber['n_month'].apply(lambda x: months_spanish[x - 1])
df_uber['day_week'] = df_uber['n_day_week'].apply(lambda x: day_week_spanish[x - 1])
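
On several million rows, `apply` with a Python lambda is comparatively slow. A possible alternative, sketched below on a three-value sample, is `Series.map` with a lookup dictionary; `month_lookup` is a hypothetical name introduced for the example.

```python
import pandas as pd

months_spanish = ['ene', 'feb', 'mar', 'abr', 'may', 'jun',
                  'jul', 'ago', 'sep', 'oct', 'nov', 'dic']

# Sample month numbers standing in for df_uber['n_month']
n_month = pd.Series([4, 5, 9])

# Dictionary from month number (1-12) to abbreviation
month_lookup = dict(enumerate(months_spanish, start=1))

month = n_month.map(month_lookup)
print(month.tolist())
```

The same pattern would apply to `day_week_spanish`.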

In [14]:
# Flag weekend days; with 1 = Monday, [5, 6, 7] counts Friday through Sunday as weekend
df_uber['weekend'] = df_uber['n_day_week'].apply(lambda x: 'yes' if x in [5, 6, 7] else 'no')
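
The same flag can be computed without a row-wise lambda. The sketch below uses `Series.isin` with `numpy.where` on a few sample weekday numbers and keeps the notebook's convention of treating days 5 to 7 as weekend.

```python
import numpy as np
import pandas as pd

# Sample weekday numbers (1 = Monday ... 7 = Sunday)
n_day_week = pd.Series([1, 4, 5, 6, 7])

# 'yes' for days 5, 6 and 7, 'no' otherwise
weekend = pd.Series(
    np.where(n_day_week.isin([5, 6, 7]), 'yes', 'no'),
    index=n_day_week.index,
)

print(weekend.tolist())
```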

#### 3.3 - Variable types

In [15]:
# Check the type of each variable
df_uber.dtypes

date_time     datetime64[ns]
lat                  float64
lon                  float64
base                  object
nbhood                object
borough               object
date                  object
n_month                int64
n_week                 int64
n_day_week             int64
day                    int64
hour                   int64
minute                 int64
month                 object
day_week              object
weekend               object
dtype: object
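
The `date` column shows up as `object` because `.dt.date` returns Python `date` objects. If a datetime dtype is preferred for that column, one option is `.dt.normalize()`, which keeps the timestamp type and sets the time to midnight. A small sketch on two illustrative timestamps:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2014-04-01 00:11:00', '2014-04-01 23:59:00']))

# Python date objects, stored with dtype object
as_objects = ts.dt.date

# Timestamps set to midnight, stored with a datetime64 dtype
as_midnight = ts.dt.normalize()

print(as_objects.dtype, as_midnight.dtype)
```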

### **4 - Dataset export**

In [16]:
# Reorder the columns
df_uber = df_uber[['date_time', 'date', 'n_month', 'month', 'n_week', 'n_day_week', 'day_week', 'weekend', 'day', 'hour', 'minute', 'lat', 'lon', 'base', 'nbhood', 'borough']]

df_uber.head()

Unnamed: 0,date_time,date,n_month,month,n_week,n_day_week,day_week,weekend,day,hour,minute,lat,lon,base,nbhood,borough
0,2014-04-01 00:11:00,2014-04-01,4,abr,14,2,mar,no,1,0,11,40.769,-73.9549,B02512,Lenox Hill-Roosevelt Island,Manhattan
1,2014-04-01 00:17:00,2014-04-01,4,abr,14,2,mar,no,1,0,17,40.7267,-74.0345,B02512,SoHo-TriBeCa-Civic Center-Little Italy,Manhattan
2,2014-04-01 00:21:00,2014-04-01,4,abr,14,2,mar,no,1,0,21,40.7316,-73.9873,B02512,East Village,Manhattan
3,2014-04-01 00:28:00,2014-04-01,4,abr,14,2,mar,no,1,0,28,40.7588,-73.9776,B02512,Midtown-Midtown South,Manhattan
4,2014-04-01 00:33:00,2014-04-01,4,abr,14,2,mar,no,1,0,33,40.7594,-73.9722,B02512,Turtle Bay-East Midtown,Manhattan


In [17]:
# Export the dataframe (by default the index is written as an unnamed first column)
df_uber.to_csv('./data/uber_data_transformed.csv')
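
CSV files do not store dtypes, so `date_time` comes back as plain text unless it is parsed again on load. The sketch below round-trips a two-row sample through a temporary file, assuming the default `to_csv` behavior of writing the index as the first column. The file name `sample.csv` is a placeholder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows taken from the output shown earlier
sample = pd.DataFrame({
    'date_time': pd.to_datetime(['2014-04-01 00:11:00', '2014-04-01 00:17:00']),
    'lat': [40.769, 40.7267],
    'lon': [-73.9549, -74.0345],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'sample.csv'
    sample.to_csv(path)  # index written as the first column

    # index_col=0 restores the index; parse_dates restores the datetime dtype
    restored = pd.read_csv(path, index_col=0, parse_dates=['date_time'])

print(restored.dtypes)
```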