## **DATA WRANGLING BASES**
This notebook gathers and transforms the raw bases dataset from the source referenced below, so that the available data can be analysed and used in later steps. It covers the following steps:

1. Reading the .csv file and transforming variables
2. Data exploration
3. Reshaping data
4. Filtering data

#### **IMPORT LIBRARIES**

In [None]:
import pandas as pd 
import plotly.express as px
import plotly.graph_objects as go

#### **1. READ DATA AND VARIABLE TRANSFORMATION**
**Dataset**: bases_bicimad.xls     

**Description**: Dataset of the existing bases of the BiciMAD service. 

**Dataframe size**: 269 base stations including extensions and 13 variables.

In [None]:
bases = pd.read_excel("../Data/Bases/bases_bicimad.xls")
bases.shape

In [None]:
bases.head()

**Variable type check**: all types are correct

In [None]:
bases.dtypes

**NaN check**: there are no missing values

In [None]:
bases.isna().sum()

#### **2. DATA EXPLORATION**

Variable **"Tipo de Reserva"** has only one category, so it carries no information and is irrelevant for the analysis.

In [None]:
bases["Tipo de Reserva"].value_counts()
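
The same check can be run over every column at once by counting distinct values per column. The sketch below is only an illustration: the `sample` dataframe and its values are made up and are not taken from `bases_bicimad.xls`.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only, not taken from the dataset
sample = pd.DataFrame({
    "Número": ["1a", "1b", "2"],
    "Tipo de Reserva": ["BiciMAD", "BiciMAD", "BiciMAD"],
    "Número de Plazas": [24, 24, 27],
})

# A column with a single distinct value carries no information for a model
constant_cols = [
    col for col in sample.columns
    if sample[col].nunique(dropna=False) == 1
]
print(constant_cols)
```

Applied to `bases`, the resulting list would name the columns that are candidates for removal.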

**Número de Plazas distribution**: 81% of bases have 24 docks.

In [None]:
fig = px.histogram(bases, x="Número de Plazas", nbins = 30, histnorm='probability density')
fig.update_traces(marker_color = "darkorange")
fig.show()
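
A share like the one quoted above can also be read directly from relative frequencies rather than from the histogram. This is a minimal sketch on made-up dock counts (the `plazas` series is hypothetical), so the printed share is not the figure for the real dataset.

```python
import pandas as pd

# Hypothetical dock counts, for illustration only
plazas = pd.Series([24, 24, 24, 24, 27, 30, 24, 24, 21, 24], name="Número de Plazas")

# Relative frequency of each dock count, most frequent first
share = plazas.value_counts(normalize=True)
top_value = share.index[0]
top_share = share.iloc[0]
print(f"{top_share:.0%} of the sample bases have {top_value} docks")
```

On the notebook's data the equivalent call would be `bases["Número de Plazas"].value_counts(normalize=True)`.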

**Distrito distribution**: about half of the stations are concentrated in CENTRO, SALAMANCA and CHAMBERÍ.

In [None]:
count_distrito = bases["Distrito"].value_counts() 
labels_distrito = count_distrito.index

fig = px.pie(bases, values=count_distrito, names=labels_distrito, color=labels_distrito,
             color_discrete_sequence=px.colors.sequential.RdBu)

fig.update_layout(title = "Distribution of the number of bases by district")
fig.show()
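
The concentration in the largest districts can be quantified with a cumulative share. The example below is a sketch on invented district labels (the `distritos` series is hypothetical), so its numbers do not describe the real dataset.

```python
import pandas as pd

# Hypothetical district labels, for illustration only
distritos = pd.Series(
    ["CENTRO"] * 4 + ["SALAMANCA"] * 3 + ["CHAMBERÍ"] * 2 + ["RETIRO"] * 1,
    name="Distrito",
)

# value_counts sorts in descending order, so the cumulative sum gives
# the share of bases held by the top-k districts
cum_share = distritos.value_counts(normalize=True).cumsum()
top3_share = cum_share.iloc[2]
print(cum_share)
```

Here `top3_share` is the fraction of bases located in the three largest districts of the sample.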

**Barrio distribution**: bases are unevenly distributed across neighborhoods, ranging from 1 to 14 bases per neighborhood.

In [None]:
count_barrio = bases["Barrio"].value_counts() 
labels_barrio = count_barrio.index

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x = labels_barrio,
        y = count_barrio,
        showlegend = False
    )
)
fig.update_layout(title = "Distribution of the number of bases by neighborhood",
                  xaxis_title = "Neighborhood", yaxis_title = "Number of stations")
fig.show()

**Location of Bases - Latitude and Longitude**: all bases are located in Madrid, so the coordinates are correct.
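
One way to make this check reproducible is to test the coordinates against a bounding box. The sketch below assumes a rough, approximate box around the city of Madrid (the four limits are assumptions, not official boundaries) and uses made-up coordinates in a hypothetical `coords` dataframe.

```python
import pandas as pd

# Rough bounding box around the city of Madrid; these limits are approximate assumptions
LAT_MIN, LAT_MAX = 40.30, 40.65
LON_MIN, LON_MAX = -3.90, -3.50

# Hypothetical coordinates: the first two fall inside the box, the third does not
coords = pd.DataFrame({
    "Latitud": [40.4168, 40.4300, 41.3851],
    "Longitud": [-3.7038, -3.6900, 2.1734],
})

# True for rows whose latitude and longitude both fall inside the box
inside = (
    coords["Latitud"].between(LAT_MIN, LAT_MAX)
    & coords["Longitud"].between(LON_MIN, LON_MAX)
)
outliers = coords[~inside]
print(outliers)
```

An empty `outliers` frame on the real data would support the statement that every base lies in Madrid.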

#### **3. RESHAPING DATA**

1. To save space, variables that are irrelevant for the model are dropped. The result is stored in `bases_clean`.

**Note**: `Direccion` is dropped because it is just the concatenation of `Calle` and `Nº Finca`, which are kept.

In [None]:
bases_clean = bases.drop(columns=["Gis_X", "Gis_Y", "Fecha de Alta", "Direccion", "Tipo de Reserva"])
bases_clean.head()

In [None]:
bases_clean.head(50)

2. Merge bases that are split into two stations (a and b) under the same base number, adding up their docks.

In [None]:
# Rows whose identifier is longer than three characters
bases_repeat = bases_clean[bases_clean["Número"].str.len() > 3]
bases_repeat

In [None]:
# Assign one common base number to the affected rows, selected by index label
bases_clean.loc[0:1, "Número"] = 1
bases_clean.loc[21, "Número"] = 20
bases_clean.loc[25:26, "Número"] = 25
bases_clean.loc[79:80, "Número"] = 80
bases_clean.loc[91, "Número"] = 90

bases_clean.loc[107:108, "Número"] = 106
bases_clean.loc[113:114, "Número"] = 111
bases_clean.loc[119:120, "Número"] = 116

bases_clean.loc[133, "Número"] = 128
bases_clean.loc[146, "Número"] = 140
bases_clean.loc[168, "Número"] = 161

# "Número" now mixes integers and strings, so check the resulting dtypes
bases_clean.dtypes
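
As an alternative to hard-coded index labels, the base number could be derived from the identifier itself. The sketch below assumes identifiers made of leading digits optionally followed by a letter. The real format in `bases_bicimad.xls` may differ, the `sample` dataframe and the `Base` column are hypothetical, and the result would not necessarily match the manual assignments above.

```python
import pandas as pd

# Hypothetical identifiers; the real format in the dataset may differ
sample = pd.DataFrame({
    "Número": ["1a", "1b", "2", "25 a", "25 b"],
    "Número de Plazas": [24, 24, 27, 24, 18],
})

# Keep only the leading digits so that "1a" and "1b" both map to base 1
sample["Base"] = sample["Número"].str.extract(r"^(\d+)", expand=False).astype(int)

# Add up the docks of the stations that share a base number
docks_per_base = sample.groupby("Base", as_index=False)["Número de Plazas"].sum()
print(docks_per_base)
```

This approach keeps working if rows are reordered, whereas index-based assignments depend on the row positions in the source file.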

In [None]:
bases_final = bases_clean.groupby(
    ["Número", "Distrito", "Barrio", "Calle"], as_index=False
).agg({
    "Nº Finca": pd.Series.mode,
    "Número de Plazas": "sum",
    "Longitud": "min",
    "Latitud": "min",
})

In [None]:
# Second variant: keep the "a" and "b" stations as separate rows
bases_clean2 = bases_clean.copy()
bases_clean2.loc[0, "Número"] = '1a'
bases_clean2.loc[1, "Número"] = '1b'
bases_clean2.loc[21, "Número"] = 20
bases_clean2.loc[25, "Número"] = '25a'
bases_clean2.loc[26, "Número"] = '25b'
bases_clean2.loc[79, "Número"] = '80a'
bases_clean2.loc[80, "Número"] = '80b'
bases_clean2.loc[91, "Número"] = 90

bases_clean2.loc[107, "Número"] = '106a'
bases_clean2.loc[108, "Número"] = '106b'
bases_clean2.loc[113, "Número"] = '111a'
bases_clean2.loc[114, "Número"] = '111b'
bases_clean2.loc[119, "Número"] = '116a'
bases_clean2.loc[120, "Número"] = '116b'
bases_clean2.loc[133, "Número"] = 128
bases_clean2.loc[146, "Número"] = 140
bases_clean2.loc[168, "Número"] = 161

bases_final2 = bases_clean2.groupby(
    ["Número", "Distrito", "Barrio", "Calle"], as_index=False
).agg({
    "Nº Finca": pd.Series.mode,
    "Número de Plazas": "sum",
    "Longitud": "min",
    "Latitud": "min",
})
# Cast to str so that each column holds a single type before saving
bases_final2 = bases_final2.astype({"Número": str, "Nº Finca": str})

#### **SAVE DATA**

In [None]:
bases_final2.to_parquet('../Data/Bases/basesSituaciones.parquet')

In [None]:
bases_final.to_csv('../Data/Bases/bases.csv')
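
Note that `to_csv` writes the row index by default, which shows up as an extra unnamed column when the file is read back. The sketch below illustrates this on a small made-up frame (`df` is a hypothetical stand-in for `bases_final`) using an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Hypothetical frame standing in for bases_final
df = pd.DataFrame({"Número": [1, 2], "Número de Plazas": [48, 27]})

# By default, to_csv also writes the row index as an unnamed first column
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False writes only the data columns
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to pass `index=False` when saving depends on whether downstream code expects that extra column.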