## **DATA WRANGLING BASES**
This notebook loads, explores, and transforms the raw bases dataset from the source listed below so that the available data can be analysed and used in later steps. It covers the following processes:

1. Reading the .csv file and transforming variables
2. Data exploration
3. Reshaping data
4. Filtering data

#### **IMPORT LIBRARIES**

In [1]:
import pandas as pd 
import plotly.express as px
import plotly.graph_objects as go

#### **1. READ DATA and VARIABLE TRANSFORMATION**
**Dataset**: bases_bicimad.xls     

**Description**: Dataset of the existing bases of the BiciMAD service. 

**Dataframe size**: 269 base stations including extensions and 13 variables.

In [5]:
bases= pd.read_excel("../Data/Bases/bases_bicimad.xls")
bases.shape

(269, 13)

In [6]:
bases.head()

Unnamed: 0,Número,Gis_X,Gis_Y,Fecha de Alta,Distrito,Barrio,Calle,Nº Finca,Tipo de Reserva,Número de Plazas,Longitud,Latitud,Direccion
0,001 a,440443.61,4474290.65,43803,01 CENTRO,01-06 SOL,"ALCALA, CALLE, DE",2,BiciMAD,30,-3.701998,40.417111,"ALCALA, CALLE, DE, 2"
1,001 b,440480.56,4474301.74,43867,01 CENTRO,01-06 SOL,"ALCALA, CALLE, DE",6,BiciMAD,30,-3.701564,40.417213,"ALCALA, CALLE, DE, 6"
2,2,440134.83,4474678.23,41813,01 CENTRO,01-05 UNIVERSIDAD,"MIGUEL MOYA, CALLE, DE",1,BiciMAD,24,-3.705674,40.42058,"MIGUEL MOYA, CALLE, DE, 1"
3,3,440012.98,4475760.68,41813,07 CHAMBERÍ,07-02 ARAPILES,"CONDE DEL VALLE DE SUCHIL, PLAZA, DEL",2,BiciMAD,18,-3.707212,40.430322,"CONDE DEL VALLE DE SUCHIL, PLAZA, DEL, 2"
4,4,440396.4,4475565.36,41813,01 CENTRO,01-05 UNIVERSIDAD,"MANUELA MALASAÑA, CALLE, DE",3,BiciMAD,24,-3.702674,40.42859,"MANUELA MALASAÑA, CALLE, DE, 3"


**Variables type check**: correct

In [7]:
bases.dtypes

Número               object
Gis_X               float64
Gis_Y               float64
Fecha de Alta         int64
Distrito             object
Barrio               object
Calle                object
Nº Finca             object
Tipo de Reserva      object
Número de Plazas      int64
Longitud            float64
Latitud             float64
Direccion            object
dtype: object
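
A side note on **Fecha de Alta**: it is read as int64, and values such as 43803 and 41813 look like Excel serial day numbers. The column is dropped later, so the notebook never converts it. The sketch below shows how such values could be turned into dates if they were needed; the Excel 1900 date system (day 0 = 1899-12-30) is an assumption about the source file, not something the dataset documents.

```python
import pandas as pd

# Two values copied from the "Fecha de Alta" column shown in bases.head().
fecha_alta = pd.Series([43803, 41813], name="Fecha de Alta")

# Assumption: the integers are Excel serial day numbers (1900 date system),
# for which serial 0 corresponds to 1899-12-30.
fecha_alta_dt = pd.to_datetime(fecha_alta, unit="D", origin="1899-12-30")
print(fecha_alta_dt)
```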

**NaN check**: no missing values in any column.

In [8]:
bases.isna().sum()

Número              0
Gis_X               0
Gis_Y               0
Fecha de Alta       0
Distrito            0
Barrio              0
Calle               0
Nº Finca            0
Tipo de Reserva     0
Número de Plazas    0
Longitud            0
Latitud             0
Direccion           0
dtype: int64

#### **2. DATA EXPLORATION**

Variable **"Tipo de Reserva"** has a single category (BiciMAD), so it carries no information and is irrelevant for the analysis.

In [9]:
bases["Tipo de Reserva"].value_counts()

BiciMAD    269
Name: Tipo de Reserva, dtype: int64

**Número de Plazas distribution**: 81% of bases have 24 docks.

In [10]:
fig = px.histogram(bases, x="Número de Plazas", nbins = 30, histnorm='probability density')
fig.update_traces(marker_color = "darkorange")
fig.show()
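
The share of bases with a given number of docks can also be computed directly rather than read off the histogram. The snippet below is a minimal sketch on a small made-up sample (the `sample` frame is hypothetical, not the real dataset); on the notebook's data the same call would be applied to `bases["Número de Plazas"]`.

```python
import pandas as pd

# Hypothetical sample, NOT the real data: 8 of 10 bases have 24 docks.
sample = pd.DataFrame({"Número de Plazas": [24] * 8 + [18, 30]})

# normalize=True returns relative frequencies instead of absolute counts.
share = sample["Número de Plazas"].value_counts(normalize=True)
share_24 = share.loc[24]
print(f"{share_24:.0%} of the sample bases have 24 docks")
```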

**Distrito distribution**: half of the stations are concentrated in CENTRO, SALAMANCA and CHAMBERÍ.

In [11]:
count_distrito = bases["Distrito"].value_counts() 
labels_distrito = count_distrito.index

fig = px.pie(bases, values=count_distrito, names=labels_distrito, color=labels_distrito,
             color_discrete_sequence=px.colors.sequential.RdBu)

fig.update_layout(title = "Distribución número de bases por distrito")
fig.show()
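
To back the "half of the stations" statement with a number, one option is to look at the cumulative share of stations over districts sorted by size. The sketch below uses a made-up sample (the counts are hypothetical) and only illustrates the idea; with the real data the input would be `bases["Distrito"]`.

```python
import pandas as pd

# Hypothetical sample, NOT the real counts per district.
sample = pd.Series(
    ["01 CENTRO"] * 4 + ["04 SALAMANCA"] * 3 + ["07 CHAMBERÍ"] * 2 + ["02 ARGANZUELA"],
    name="Distrito",
)

# value_counts sorts districts from largest to smallest share;
# cumsum then gives the share covered by the top-k districts.
cum_share = sample.value_counts(normalize=True).cumsum()

# Number of districts needed to reach at least 50% of the stations.
n_needed = int((cum_share < 0.5).sum() + 1)
print(cum_share)
print(n_needed)
```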

**Barrio distribution**: bases are unevenly distributed across neighborhoods. The number of bases per neighborhood ranges from 1 to 14.

In [13]:
count_barrio = bases["Barrio"].value_counts() 
labels_barrio = count_barrio.index

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x = labels_barrio,
        y = count_barrio,
        showlegend = False
    )
)
fig.update_layout(title = "Distribución del número de bases por barrio",
                  xaxis_title = "Barrio", yaxis_title = "Número absoluto de estaciones")
fig.show()

**Location of Bases - Latitude and Longitude**: all bases located in Madrid. Correct data.
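
The location check can be made reproducible with a simple bounding-box test. The sketch below uses three coordinate pairs taken from the rows shown earlier; the box limits are a rough assumption about the extent of the city of Madrid, not official boundaries, and should be adjusted as needed.

```python
import pandas as pd

# Coordinates copied from the first rows shown by bases.head().
coords = pd.DataFrame(
    {
        "Longitud": [-3.701998, -3.705674, -3.707212],
        "Latitud": [40.417111, 40.42058, 40.430322],
    }
)

# Assumption: approximate bounding box around the city of Madrid.
LON_MIN, LON_MAX = -3.9, -3.5
LAT_MIN, LAT_MAX = 40.3, 40.6

inside = coords["Longitud"].between(LON_MIN, LON_MAX) & coords["Latitud"].between(
    LAT_MIN, LAT_MAX
)

# Rows outside the box would need a manual look.
outliers = coords[~inside]
print(f"{len(outliers)} bases fall outside the assumed bounding box")
```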

#### **3. RESHAPING DATA**

1. To save space, variables that are irrelevant for the model are dropped. The result is stored in bases_clean.

**Note**: DIRECCION is dropped because it is just CALLE and Nº FINCA concatenated; those two variables are kept.

In [53]:
bases_clean = bases.drop(columns = ["Gis_X", "Gis_Y", "Fecha de Alta", "Direccion", "Tipo de Reserva"])
bases_clean.head()

Unnamed: 0,Número,Distrito,Barrio,Calle,Nº Finca,Número de Plazas,Longitud,Latitud
0,001 a,01 CENTRO,01-06 SOL,"ALCALA, CALLE, DE",2,30,-3.701998,40.417111
1,001 b,01 CENTRO,01-06 SOL,"ALCALA, CALLE, DE",6,30,-3.701564,40.417213
2,2,01 CENTRO,01-05 UNIVERSIDAD,"MIGUEL MOYA, CALLE, DE",1,24,-3.705674,40.42058
3,3,07 CHAMBERÍ,07-02 ARAPILES,"CONDE DEL VALLE DE SUCHIL, PLAZA, DEL",2,18,-3.707212,40.430322
4,4,01 CENTRO,01-05 UNIVERSIDAD,"MANUELA MALASAÑA, CALLE, DE",3,24,-3.702674,40.42859


2. Merge bases that are split into several records (suffixes a, b or ampliacion) under a single station number, adding up their docks.

In [54]:
bases_repeat = bases_clean[(bases_clean['Número'].str.len() > 3 )]
bases_repeat

Unnamed: 0,Número,Distrito,Barrio,Calle,Nº Finca,Número de Plazas,Longitud,Latitud
0,001 a,01 CENTRO,01-06 SOL,"ALCALA, CALLE, DE",2,30,-3.701998,40.417111
1,001 b,01 CENTRO,01-06 SOL,"ALCALA, CALLE, DE",6,30,-3.701564,40.417213
21,020 ampliacion,01 CENTRO,01-04 JUSTICIA,"ALCALA, CALLE, DE",49,6,-3.69529,40.419186
25,025 a,01 CENTRO,01-06 SOL,"CELENQUE, PLAZA, DEL",1,24,-3.705998,40.417342
26,025 b,01 CENTRO,01-06 SOL,"CELENQUE, PLAZA, DEL",1,24,-3.706024,40.417259
79,080 a,02 ARGANZUELA,02-07 ATOCHA,"CIUDAD DE BARCELONA, AVENIDA, DE LA",S/N,24,-3.690482,40.407412
80,080 b,02 ARGANZUELA,02-07 ATOCHA,"CIUDAD DE BARCELONA, AVENIDA, DE LA",S/N,27,-3.690724,40.407626
91,090 ampliacion,04 SALAMANCA,04-01 RECOLETOS,"GOYA, CALLE, DE",20,6,-3.683561,40.425079
107,106 a,04 SALAMANCA,04-01 RECOLETOS,"SERRANO, CALLE, DE",32,18,-3.687887,40.424864
108,106 b,04 SALAMANCA,04-01 RECOLETOS,"SERRANO, CALLE, DE",34,18,-3.68786,40.425008


In [55]:
bases_clean.loc[0:1, "Número"] = 1
bases_clean.loc[21, "Número"] = 20
bases_clean.loc[25:26, "Número"] = 25
bases_clean.loc[79:80, "Número"] = 80
bases_clean.loc[91, "Número"] = 90
bases_clean.loc[107:108, "Número"] = 106
bases_clean.loc[113:114, "Número"] = 111
bases_clean.loc[119:120, "Número"] = 116
bases_clean.loc[133, "Número"] = 128
bases_clean.loc[146, "Número"] = 140
bases_clean.loc[168, "Número"] = 161

bases_clean.dtypes

Número               object
Distrito             object
Barrio               object
Calle                object
Nº Finca             object
Número de Plazas      int64
Longitud            float64
Latitud             float64
dtype: object
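
The hard-coded row positions used above depend on the current ordering of the file. As a sketch of an alternative, the station number could be derived from the label itself by extracting its leading digits. The sample values below mimic the labels seen in the dataset; the variable names `numero` and `station_id` are illustrative, and this is not the approach the notebook actually uses.

```python
import pandas as pd

# Labels similar to those in the "Número" column (mixed strings and integers).
numero = pd.Series(["001 a", "001 b", 2, 3, "020 ampliacion", "025 a"], name="Número")

# Cast everything to string, keep the leading run of digits, convert to int.
# "001 a" and "001 b" both become 1; "020 ampliacion" becomes 20.
station_id = numero.astype(str).str.extract(r"^(\d+)", expand=False).astype(int)
print(station_id.tolist())
```

Because the result is a proper integer column, this would also avoid leaving Número with the object dtype shown above.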

In [57]:
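# One row per station: docks are summed; the smallest longitude and latitude are kept.
# Note: pd.Series.mode can return more than one value when Nº Finca differs between records.
bases_final = bases_clean.groupby(["Número", "Distrito", "Barrio", "Calle"], as_index=False).agg({
    "Nº Finca": pd.Series.mode,
    "Número de Plazas": "sum",
    "Longitud": "min",
    "Latitud": "min",
})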
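
To illustrate what the aggregation does, the sketch below merges two records of one station on a tiny sample built from the rows shown earlier. It deliberately swaps in different aggregation choices, which are assumptions and not the notebook's method: "first" for Nº Finca (so each cell holds a single value even when the records differ) and "mean" for the coordinates (the midpoint of the merged docks).

```python
import pandas as pd

# Small sample: station 1 has two records (formerly "001 a" and "001 b").
sample = pd.DataFrame(
    {
        "Número": [1, 1, 2],
        "Calle": ["ALCALA, CALLE, DE", "ALCALA, CALLE, DE", "MIGUEL MOYA, CALLE, DE"],
        "Nº Finca": ["2", "6", "1"],
        "Número de Plazas": [30, 30, 24],
        "Longitud": [-3.701998, -3.701564, -3.705674],
        "Latitud": [40.417111, 40.417213, 40.42058],
    }
)

# Alternative aggregation (assumption): first building number, summed docks,
# and the mean position of the merged records.
merged = sample.groupby(["Número", "Calle"], as_index=False).agg(
    {
        "Nº Finca": "first",
        "Número de Plazas": "sum",
        "Longitud": "mean",
        "Latitud": "mean",
    }
)
print(merged)
```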

#### **Save data**

In [59]:
bases_final.to_csv('../Data/Bases/bases.csv')
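
One thing to keep in mind when reading this file back: by default, to_csv also writes the DataFrame index, which reappears as an "Unnamed: 0" column on reload. The sketch below shows the difference on a small in-memory sample (the `sample` frame is hypothetical); whether to pass index=False depends on how downstream notebooks read the file.

```python
import io

import pandas as pd

sample = pd.DataFrame({"Número": [1, 2], "Número de Plazas": [60, 24]})

# Default behaviour: the index is written as an unnamed first column.
buf_default = io.StringIO()
sample.to_csv(buf_default)
buf_default.seek(0)
cols_default = pd.read_csv(buf_default).columns.tolist()

# With index=False only the data columns are written.
buf_noindex = io.StringIO()
sample.to_csv(buf_noindex, index=False)
buf_noindex.seek(0)
cols_noindex = pd.read_csv(buf_noindex).columns.tolist()

print(cols_default)
print(cols_noindex)
```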