# ANALYSIS OF COMANDO SYSTEM RECORDS OF INCIDENTS WITH METEOROLOGICAL ORIGIN

---

### Notebook Sections:
1. Exploratory Data Analysis
    1. Assess the general quality of the dataset
    2. Check the potential of the dataset to serve as a catalog of incidents caused by rain, for predictive modeling.
2. Data Cleaning

### Import modules and functions

In [26]:
import os, pandas as pd, numpy as np, matplotlib.pyplot as plt#, requests, json, folium
import seaborn as sns; sns.set()
# from folium import plugins
from IPython.display import clear_output as co

### Define 'DATA' class with data locations
class DATA:
    path = r'C:\Users\luisr\Desktop\Repositories\Dados\Desafio COR-Rio IV\\'
    AlertaAPI = r'http://websempre.rio.rj.gov.br/json/chuvas'

### Loading data

In [27]:
comando = pd.read_csv(DATA.path + 'comando.csv')



---
# Exploratory Analysis and Data Cleaning

## 0. Extract records of incidents caused by rain

In [28]:
titles = [
    "Bolsão d'água em via", 'Vazamento de água / esgoto',
    'Alagamentos e enchentes', "Lâmina d'água",
    "Lâmina d'água em via", 'Alagamento',
    'Enchente', 'Bueiro'
]

records = comando[comando['POP_TITULO'].isin(titles)]
events = records.groupby('EVENTO_ID').first() # Isolate incidents (first record of each incident)
data = records.reset_index(drop=False).rename(columns={'index': 'REGISTRO_ID'})

print(records.shape) # Total number of records found
print(events.shape) # Number of distinct incidents among those records

(12409, 18)
(4884, 17)


**Key findings**:
1. The record history of incidents caused by rain has some overlapping incident-type categories. The reason for the overlap needs to be confirmed with the institution.
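
Before asking the institution, one way to gauge the overlap is to count how many incidents carry more than one of the rain-related titles. The following is only a minimal sketch on a toy frame, assuming the same `EVENTO_ID` and `POP_TITULO` columns; the sample rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `records` (illustrative values, not taken from comando.csv).
toy = pd.DataFrame({
    'EVENTO_ID': [1, 1, 2, 3, 3],
    'POP_TITULO': ["Lâmina d'água", "Lâmina d'água em via",
                   'Alagamento', 'Enchente', 'Enchente'],
})

# Number of distinct titles attached to each incident.
titles_per_event = toy.groupby('EVENTO_ID')['POP_TITULO'].nunique()

# Incidents tagged with more than one title are candidates for overlapping categories.
overlapping_events = titles_per_event[titles_per_event > 1].index.tolist()
```

Running the same two lines on `records` would list the incidents worth inspecting by hand.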

### 1. Assess data types

##### Check first rows

In [29]:
records.head(3)

Unnamed: 0,EVENTO_ID,EVENTO_TITULO,EVENTO_DESCRICAO,EVENTO_GRAVIDADE,EVENTO_BAIRRO,STATUS,EVENTO_INICIO,EVENTO_INICIO_HORA,EVENTO_FIM,EVENTO_FIM_HORA,EVENTO_PRAZO,EVENTO_LATITUDE,EVENTO_LONGITUDE,POP_TITULO,POP_DESCRICAO,ORGAO_SIGLA,ORGAO_NOME,ACAO
214,60,"Queda de Árvore na Rua Pacheco Leão, 1587",,BAIXO,,FECHADO,2015-04-10,15:59:00,2015-04-11,08:37:00,,-23.0,-4323430000000000.0,Bolsão d'água em via,Bolsão d'água em via,CET-RIO,Companhia de Engenharia de Tráfego,Desfazer o acidente
215,60,"Queda de Árvore na Rua Pacheco Leão, 1587",,BAIXO,,FECHADO,2015-04-10,15:59:00,2015-04-11,08:37:00,,-23.0,-4323430000000000.0,Bolsão d'água em via,Bolsão d'água em via,CET-RIO,Companhia de Engenharia de Tráfego,Organizar o trânsito
216,60,"Queda de Árvore na Rua Pacheco Leão, 1587",,BAIXO,,FECHADO,2015-04-10,15:59:00,2015-04-11,08:37:00,,-23.0,-4323430000000000.0,Bolsão d'água em via,Bolsão d'água em via,COMLURB,Companhia de Limpeza Urbana,Cortar e retirar árvore


##### Identify variables' types

In [30]:
cols = {
    'text': [
        'EVENTO_TITULO', 'EVENTO_DESCRICAO'
    ],
    'categorical': [
        'EVENTO_GRAVIDADE', 'EVENTO_BAIRRO', 'STATUS', 'EVENTO_PRAZO',
        'POP_TITULO', 'POP_DESCRICAO', 'ORGAO_SIGLA', 'ORGAO_NOME', 'ACAO'
    ],
    'datetime': [
        'EVENTO_INICIO', 'EVENTO_INICIO_HORA',
        'EVENTO_FIM', 'EVENTO_FIM_HORA'
    ],
    'location': [
        'EVENTO_LATITUDE', 'EVENTO_LONGITUDE'
    ]
}

#### Check raw data types

In [31]:
records.head().info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 214 to 250
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   EVENTO_ID           5 non-null      int64  
 1   EVENTO_TITULO       5 non-null      object 
 2   EVENTO_DESCRICAO    1 non-null      object 
 3   EVENTO_GRAVIDADE    5 non-null      object 
 4   EVENTO_BAIRRO       0 non-null      object 
 5   STATUS              5 non-null      object 
 6   EVENTO_INICIO       5 non-null      object 
 7   EVENTO_INICIO_HORA  5 non-null      object 
 8   EVENTO_FIM          5 non-null      object 
 9   EVENTO_FIM_HORA     5 non-null      object 
 10  EVENTO_PRAZO        0 non-null      object 
 11  EVENTO_LATITUDE     5 non-null      float64
 12  EVENTO_LONGITUDE    5 non-null      float64
 13  POP_TITULO          5 non-null      object 
 14  POP_DESCRICAO       5 non-null      object 
 15  ORGAO_SIGLA         5 non-null      object 
 16  ORGAO_NO

Conclusions:

Only datetime variables require type conversion.

---
### 2. Assess category values

Categorical columns (unique categories):
* EVENTO_GRAVIDADE (BAIXO, MEDIO, ALTO, SEM_CLASSIFICAÇÃO, CRITICO). Extra values: Litoral do Rio de Janeiro
* EVENTO_BAIRRO (339) - Unstructured
    * Contains 'nan'
    * Category values are not uniform: neighborhoods with the same name are spelled differently.
    * Contains invalid values: street names, numbers, addresses, etc.
    * Needs the strip() method
* STATUS (FECHADO, ABERTO). Extra values: FINALIZADO, Litoral do Rio de Janeiro
* EVENTO_PRAZO (CURTO, MEDIO, LONGO)
* POP_TITULO (40)
* POP_DESCRICAO (42)
* ORGAO_SIGLA (34)
* ORGAO_NOME (35)
* ACAO (70)
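
The spelling issues noted for EVENTO_BAIRRO (stray whitespace, accents, letter case, literal 'nan') could be reduced with a small normalization step. This is a hedged sketch, not the notebook's cleaning procedure; `normalize_bairro` is a hypothetical helper and the sample names are illustrative.

```python
import unicodedata

def normalize_bairro(name):
    """Return an upper-case, accent-free, trimmed name, or None if missing."""
    if name is None:
        return None
    text = str(name).strip()
    if text == '' or text.lower() == 'nan':
        return None
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    no_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Upper-case and collapse repeated internal whitespace.
    return ' '.join(no_accents.upper().split())

samples = [' Jardim Botânico', 'JARDIM BOTANICO ', 'jardim  botânico', 'nan']
normalized = [normalize_bairro(s) for s in samples]
```

With pandas this could be applied as `records['EVENTO_BAIRRO'].map(normalize_bairro)`. It does not detect street names, numbers, or full addresses entered in the neighborhood field; those would still need a separate check against a list of valid neighborhoods.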

In [None]:
for col in cols['categorical']: # uncomment the next line to check the count of unique values
#     print('\n\n', col, ': ', len(records[col].unique()), '\n')
    display(records[col].value_counts().to_frame(col))

---
### 3. Assess text variables

---
### 4. Assess location variables

#### Functions to process coordinates

In [38]:
#### Compute and update order of magnitude of float value or series.

def orderOfMagnitude(number):
    return np.floor(np.log10(abs(number)))

def correctMagnitude(number, mag=1):
    # Scalars are rescaled directly; any other iterable is processed element by element.
    if isinstance(number, (int, float, np.number)):
        return number / 10 ** ( orderOfMagnitude(number) - mag )
    else:
        return [correctMagnitude(n, mag) for n in number]

# Flag values with fewer than 'decimal' decimal places (True where precision is too low).

def isBelowDecimal(series, decimal=1):
    next_decimal = abs(series * 10 ** (decimal-1))
    next_abs_dif = next_decimal - next_decimal.round(0)
    return next_abs_dif == 0

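# Replace values below the provided decimal precision with NaN.

def fillnaBelowDecimal(df, decimal=1, cols=None, subset='all'): # accepts array
    df = df.copy() # work on a copy so the caller's frame (or slice) is not modified
    if cols is None: cols = df.columns
    for col in cols:
        below_msk = isBelowDecimal(df[col], decimal=decimal)
        if subset=='all':
            df.loc[below_msk, :] = np.nan # blank the whole row
        elif subset=='each':
            df.loc[below_msk, col] = np.nan # blank only this column
    return df    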

# Reverse geocode coordinates with google maps API

import googlemaps

def googleReverseGeocode(
    coordinates, result_type=None, location_type=None, language='pt-BR',
    keep_cols =  ['place_id', 'types', 'formatted_address'],
    drop_cols = ['address_components', 'geometry', 'plus_code'], # if included, 'keep_cols' argument is ignored
    keep_geometry_cols = ['location', 'location_type'],
    googleAPIKey=None
):
    gmaps = googlemaps.Client(key=googleAPIKey) # load google api key
    result = []; n_coords = len(coordinates)
    for i, coords in enumerate(coordinates):
        res = gmaps.reverse_geocode(
            coords, language=language,
            # join filters with '|' only when they were provided
            result_type='|'.join(result_type) if result_type else None,
            location_type='|'.join(location_type) if location_type else None
        )
        df = pd.DataFrame(res); coords_df = []
        if drop_cols is not None:
            found_cols = set(drop_cols).intersection(df.columns)
            keep_cols = df.drop(columns=list(found_cols)).columns
        for j, row in df.iterrows():
            keep_info = row[keep_cols].copy() # copy to avoid writing into a view of 'df'
            keep_info['search_index'] = i
            location = pd.Series(row['geometry']['location'])
            location['location_type'] = row['geometry']['location_type']
            address = pd.DataFrame(row['address_components'])
            address = pd.Series(
                address['long_name'].values,
                index=address['types'].map(lambda types: ', '.join(types))
            )
            coords_df.append(pd.concat([keep_info, location, address], axis=0))
        result.append(pd.DataFrame(coords_df))
        print(f'{i+1}/{n_coords} coordinates reverse geocoded.'); co(wait=True)

    print(f'Done! Total of {n_coords} requests.')
    return pd.concat(result, axis=0)
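
To make the magnitude correction concrete, here is a scalar-only restatement applied to the two coordinate values shown in the first rows of `records`. It is a sketch under the assumption that `mag=1` is meant to leave two digits before the decimal point; `rescale_to_magnitude` is a hypothetical name, and unlike the NumPy version it returns NaN for zero without emitting a runtime warning.

```python
import math

def rescale_to_magnitude(value, mag=1):
    """Scale `value` by a power of ten so that floor(log10(|value|)) equals `mag`."""
    if value == 0 or math.isnan(value):
        return math.nan  # log10 is undefined for zero and NaN
    exponent = math.floor(math.log10(abs(value)))
    return value / 10 ** (exponent - mag)

lat = rescale_to_magnitude(-23.0)                  # already at magnitude 1, unchanged
lng = rescale_to_magnitude(-4323430000000000.0)    # rescaled to roughly -43.2343
```

The latitude `-23.0` survives the rescaling but has no decimal digits, which is why the precision filter below is needed as a second step.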

#### Correct magnitude and replace coordinates below three decimal places with NaN

In [39]:
# Correct magnitude of coordinates
data.loc[:, 'lat'] = correctMagnitude(data['EVENTO_LATITUDE'], mag=1)
data.loc[:, 'lng'] = correctMagnitude(data['EVENTO_LONGITUDE'], mag=1)

# Replace coordinates with fewer than 3 decimal places (decimal=3) with NaN.
data[['lat', 'lng']] = fillnaBelowDecimal(data[['lat', 'lng']], decimal=3, subset='all')



#### Percentage of records without valid coordinates

In [40]:
data['lat'].isna().sum() / len(data) * 100 # about 16.9% of records have no valid coordinates.

16.850672898702555

In [41]:
data.groupby('EVENTO_ID').first()['lat'].isna().sum() / len(data) * 100 # incidents without coordinates, as a percentage of all records (about 6.1%).

6.092352324925457
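
The second figure divides a count of incidents by the number of records. If the intent is the share of incidents without coordinates, the denominator would be the number of incidents. The shapes printed in this notebook (4884 incidents, 4128 of them kept in the catalog with coordinates) suggest 756 / 4884, or about 15.5%. A minimal sketch of both ratios on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`: three incidents, five records (illustrative values).
toy = pd.DataFrame({
    'EVENTO_ID': [1, 1, 2, 3, 3],
    'lat': [-22.9, -22.9, np.nan, -22.8, -22.8],
})

first_per_event = toy.groupby('EVENTO_ID').first()

# Share of records without coordinates (denominator: number of records).
pct_records = toy['lat'].isna().mean() * 100

# Share of incidents without coordinates (denominator: number of incidents).
pct_events = first_per_event['lat'].isna().mean() * 100
```

Using `.isna().mean()` on the frame that was actually grouped keeps the numerator and denominator on the same unit.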

#### Extracting catalog of incidents with known coordinates

In [42]:
catalog = data.groupby('EVENTO_ID').first().dropna(subset=['lat', 'lng']) # Incident catalog (no missing coordinates)

catalog.shape, catalog[catalog['POP_TITULO']==titles[0]].shape

((4128, 20), (3140, 20))

#### Reverse geocode extracted incidents' coordinates with Google Maps API

In [43]:
location_type = [
    'ROOFTOP',
    'RANGE_INTERPOLATED',
    'APPROXIMATE'
]

result_type = [
    'street_address', 'route', 'intersection',
    'colloquial_area', 'plus_code', 'postal_code',
    'establishment', 'premise'
]

catalog_coords = catalog[['lat', 'lng']].values.tolist()

# geocode_result = googleReverseGeocode( # Search already performed
#     catalog_coords,
#     result_type=result_type,
#     location_type=location_type,
#     googleAPIKey=open('../GoogleApiKey.txt', 'r').read()
# )

In [44]:
# geocode_result.to_csv('dados/catalog_reverse_geocode_result.csv', index=False)
geocode_result = pd.read_csv('dados/catalog_reverse_geocode_result.csv')

#### Check result

##### Count of unique street numbers found per location search

In [61]:
numbered = geocode_result.dropna(subset=['street_number']).drop_duplicates(subset=['search_index', 'street_number'])
numbered.groupby('search_index').count()['street_number'].value_counts().sort_index()

1     615
2    1688
3    1348
4     444
5       8
Name: street_number, dtype: int64

##### Count of unique routes found per location search

In [62]:
routed = geocode_result.dropna(subset=['route']).drop_duplicates(subset=['search_index', 'route'])
routed.groupby('search_index').count()['route'].value_counts().sort_index()

1    1939
2    1882
3     267
4      27
5       2
Name: route, dtype: int64

##### Locations without a numbered search result

In [46]:
len(catalog) - len(numbered['search_index'].unique()) # location search results without a numbered address

25

#### Include reverse-geocoded results in catalog

Options to filter:
1. Prioritize type 'street_address', then 'route', then 'establishment'/'premise'
2. Only if street_number is not missing
3. Prioritize first street_number of each location search
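
One possible way to combine the three options is to rank each result by type, drop rows without a street number, and keep the best-ranked row per search. This is a sketch only: the toy frame simplifies `types` to a single string per row (an assumption; the saved CSV may store a list-like string), and `priority_rank` is a hypothetical helper column.

```python
import pandas as pd

# Toy stand-in for `geocode_result` (illustrative values).
toy = pd.DataFrame({
    'search_index': [0, 0, 0, 1, 1, 2],
    'types': ['route', 'street_address', 'premise',
              'establishment', 'route', 'route'],
    'street_number': ['10', '12', '14', None, None, None],
})

# Option 1: lower rank means higher priority; unknown types go last.
priority = {'street_address': 0, 'route': 1, 'establishment': 2, 'premise': 2}
ranked = toy.assign(priority_rank=toy['types'].map(priority).fillna(len(priority)))

# Option 2: keep only rows that have a street number.
numbered_only = ranked.dropna(subset=['street_number'])

# Options 1 and 3: best-ranked row per search; the stable sort keeps the
# original result order within ties, so the first street number wins.
best = (numbered_only.sort_values('priority_rank', kind='stable')
        .drop_duplicates(subset='search_index', keep='first')
        .set_index('search_index'))
```

Searches with no numbered result drop out entirely under option 2, so they would need a fallback (for example, the best-ranked 'route') if every catalog entry must receive an address.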

In [70]:
geocode_result['search_index'].unique()

array([   0,    1,    2, ..., 4125, 4126, 4127], dtype=int64)

In [69]:
len(catalog_coords)

4128

In [None]:
filtered_geocode_result = 

##### Original coordinates' difference from found coordinates
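A common way to measure how far the geocoded location lies from the original coordinates is the great-circle (haversine) distance. The sketch below assumes coordinates in decimal degrees and a spherical Earth with a mean radius of 6371 km; `haversine_km` is a hypothetical helper, not part of the notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

same_point = haversine_km(-22.9, -43.2, -22.9, -43.2)
one_degree_lat = haversine_km(-23.0, -43.2, -22.0, -43.2)  # about 111.19 km
```

Applied row by row to the catalog's `lat`/`lng` and the coordinates returned by the geocoder, large distances would flag searches whose best match is far from the reported incident.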

### 5. Assess time variables

#### Convert event start and end dates and times to 'datetime64' type

In [235]:
start = data['EVENTO_INICIO'] + ' ' + data['EVENTO_INICIO_HORA']
end = data['EVENTO_FIM'] + ' ' + data['EVENTO_FIM_HORA']

data['evento_inicio'] = pd.to_datetime(start)
data['evento_fim'] = pd.to_datetime(end)

# data.set_index('evento_inicio', inplace=True)

In [248]:
data['evento_inicio']

0       2015-04-10 15:59:00
1       2015-04-10 15:59:00
2       2015-04-10 15:59:00
3       2015-04-10 15:59:00
4       2015-05-05 07:52:00
                ...        
12404   2022-04-30 12:33:00
12405   2022-04-30 15:33:00
12406   2022-04-30 15:33:00
12407   2022-04-30 16:09:00
12408   2022-04-30 16:09:00
Name: evento_inicio, Length: 12409, dtype: datetime64[ns]

### 7. Check missing values notation uniformity
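
A starting point for this check is to map every textual stand-in for a missing value to a single NaN. The sketch below uses a toy column; the set of tokens treated as missing is an assumption and should be adjusted after inspecting the actual columns.

```python
import numpy as np
import pandas as pd

# Toy column mixing several notations for "missing" (illustrative values).
toy = pd.Series(['Centro', 'nan', '', ' ', None, 'NaN', 'Tijuca'])

# Strings treated as missing after trimming and lower-casing (assumed list).
missing_tokens = {'', 'nan', 'none', 'null'}

trimmed = toy.str.strip()
is_missing = trimmed.isna() | trimmed.str.lower().isin(missing_tokens)

# Replace every flagged entry with NaN so missing values share one notation.
uniform = trimmed.mask(is_missing, np.nan)
```

Comparing `toy.isna().sum()` with `uniform.isna().sum()` per column shows how many missing values were hidden behind string placeholders.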