## Extract Phase
#### Downloads `tar.gz` files containing geographical coordinates from a data source, then saves and extracts them into a destination folder.
`extract_from_source()` is a Python function that receives a URL pointing to a `tar.gz` data source, a destination directory, and a flag indicating whether to extract the content.
Its output is the set of files extracted into the destination directory.

In [2]:
from extract.extract_targz_from_source import extract_from_source

source = "https://s3.amazonaws.com/dev.etl.python/datasets/data_points.tar.gz"
destination_path = "/app/data_from_source"
extract = True

extract_from_source(source, destination_path, extract)


File downloaded from https://s3.amazonaws.com/dev.etl.python/datasets/data_points.tar.gz to /app/data_from_source/data_points.tar.gz.
Files extracted at /app/data_from_source directory.
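
The internals of `extract_from_source()` are not shown in this notebook. As a rough sketch of the download-then-extract idea, assuming only the standard library (`urllib.request` and `tarfile`), the hypothetical function `download_and_extract` below takes the same three arguments. It is an illustration, not the project's actual implementation.

```python
import tarfile
import urllib.request
from pathlib import Path


def download_and_extract(source: str, destination_path: str, extract: bool = True) -> Path:
    """Download a tar.gz archive and optionally unpack it (hypothetical helper)."""
    destination = Path(destination_path)
    destination.mkdir(parents=True, exist_ok=True)

    # Name the local archive after the last segment of the URL.
    archive_path = destination / source.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(source, str(archive_path))

    if extract:
        # Only unpack archives from a trusted source: member paths are not
        # validated in this sketch.
        with tarfile.open(archive_path, "r:gz") as archive:
            archive.extractall(destination)

    return archive_path
```

Calling `download_and_extract(source, destination_path, extract)` with the values from the cell above would leave the archive and its extracted members in `destination_path`.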


## Transform Phase

#### 1 - Reads a directory containing coordinates/points data files.

In [3]:
from transform.transform_raw_to_csv import get_data_files

files_path = "/app/data_from_source"
files = get_data_files(files_path)


Files found data_points_20180101.txt data_points_20180102.txt data_points_20180103.txt
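
How `get_data_files()` selects files is not shown. One plausible approach, assuming the data files are the `.txt` entries directly inside the directory, is sketched below. The name `list_data_files` and the `suffix` parameter are hypothetical.

```python
import os
from typing import List


def list_data_files(directory: str, suffix: str = ".txt") -> List[str]:
    """Return the sorted names of regular files in `directory` ending with `suffix`."""
    names = [
        name
        for name in os.listdir(directory)
        if name.endswith(suffix) and os.path.isfile(os.path.join(directory, name))
    ]
    # Sorting keeps date-stamped names such as data_points_20180101.txt in order.
    return sorted(names)
```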


#### 2 - Prepares the data for cleaning by transforming the file contents into a `List[List[str]]`.

In [4]:
from transform.transform_raw_to_csv import wrangle_points_to_list

data_files = [f"{files_path}/{file}" for file in files]
points_list = wrangle_points_to_list(data_files)
print(points_list)


[['Latitude: 30°02′59″S   -30.04982864', 'Longitude: 51°12′05″W   -51.20150245', 'Distance: 2.2959 km  Bearing: 137.352°'], ['Latitude: 30°04′03″S   -30.06761588', 'Longitude: 51°14′23″W   -51.23976111', 'Distance: 4.2397 km  Bearing: 210.121°'], ['Latitude: 30°03′21″S   -30.05596474', 'Longitude: 51°10′22″W   -51.17286827', 'Distance: 4.9213 km  Bearing: 118.814°'], ['Latitude: 30°02′18″S   -30.03841576', 'Longitude: 51°14′58″W   -51.24943145', 'Distance: 3.088 km  Bearing: 262.19°'], ['Latitude: 30°00′22″S   -30.00613726', 'Longitude: 51°14′19″W   -51.23864809', 'Distance: 3.7605 km  Bearing: 327.479°'], ['Latitude: 30°04′02″S   -30.06713593', 'Longitude: 51°12′13″W   -51.20357392', 'Distance: 3.8596 km  Bearing: 159.435°'], ['Latitude: 30°03′45″S   -30.06247621', 'Longitude: 51°10′45″W   -51.1792341', 'Distance: 4.8235 km  Bearing: 129.929°'], ['Latitude: 30°01′13″S   -30.02030812', 'Longitude: 51°10′51″W   -51.18087659', 'Distance: 3.8845 km  Bearing: 65.769°'], ['Latitude: 30°03′4
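
The layout of the raw files is not shown, only the resulting lists of three strings per point. Purely as a sketch, assuming each point occupies three consecutive non-empty lines (latitude, longitude, then distance and bearing), the grouping could look like the hypothetical `group_lines` helper below.

```python
from typing import Iterable, List


def group_lines(lines: Iterable[str], record_size: int = 3) -> List[List[str]]:
    """Group non-empty lines into records of `record_size` lines each."""
    cleaned = [line.strip() for line in lines if line.strip()]
    # An incomplete trailing record is dropped rather than padded.
    last_start = len(cleaned) - record_size + 1
    return [cleaned[i:i + record_size] for i in range(0, last_start, record_size)]
```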

#### 3 - Applies data detection using regular expressions and stores the result in a `List[dict]`.

In [None]:
from transform.transform_raw_to_csv import convert_data_coordinates

detected_points = convert_data_coordinates(points_list)

print(detected_points)
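
The patterns used by `convert_data_coordinates()` are not shown. Based only on the strings printed in step 2, a regular-expression pass might look like the sketch below. The function name `detect_points` and the dictionary keys are assumptions, not the project's actual output schema.

```python
import re
from typing import Dict, List

# Each pattern captures one numeric field from the record text.
PATTERNS = {
    "latitude": re.compile(r"Latitude:.*?(-?\d+\.\d+)\s*$", re.MULTILINE),
    "longitude": re.compile(r"Longitude:.*?(-?\d+\.\d+)\s*$", re.MULTILINE),
    "distance_km": re.compile(r"Distance:\s*(\d+(?:\.\d+)?)\s*km"),
    "bearing_deg": re.compile(r"Bearing:\s*(\d+(?:\.\d+)?)\s*°"),
}


def detect_points(points_list: List[List[str]]) -> List[Dict[str, float]]:
    """Extract decimal values from each record, skipping records with missing fields."""
    detected = []
    for record in points_list:
        text = "\n".join(record)
        matches = {key: pattern.search(text) for key, pattern in PATTERNS.items()}
        if all(matches.values()):
            detected.append({key: float(m.group(1)) for key, m in matches.items()})
    return detected
```

The signed decimal at the end of the latitude and longitude lines is used rather than the degrees/minutes/seconds form, since it is already in the shape the later steps need.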


#### 4 - Removes duplicated dictionaries from the `detected_points` list.

In [None]:
from transform.transform_raw_to_csv import remove_duplicates

print(f"Before deduplication {len(detected_points)}")

deduplicated_points = remove_duplicates(detected_points)

print(f"After deduplication {len(deduplicated_points)}")
print(deduplicated_points)
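
Dictionaries are not hashable, so they cannot be placed in a `set` directly. A common way to deduplicate them while preserving order is sketched below. It assumes string keys and hashable values, and `dedupe_dicts` is a hypothetical name, not necessarily how `remove_duplicates()` works.

```python
from typing import Any, Dict, List


def dedupe_dicts(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the first occurrence of each distinct dictionary, in original order."""
    seen = set()
    unique = []
    for item in items:
        # A sorted tuple of items is hashable and independent of key order.
        key = tuple(sorted(item.items()))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```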


#### 5 - Converts the `List[dict]` of deduplicated points into a CSV file and saves it to disk.

In [None]:
from transform.transform_raw_to_csv import write_points_to_csv

path_to_csv = "/app/normalized_data/data.csv"

write_points_to_csv(deduplicated_points, path_to_csv)

with open(path_to_csv, "r") as csv_file:
    for line in csv_file:
        # Each line already ends with a newline, so suppress print's own.
        print(line, end="")
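
Writing a list of dictionaries to CSV is a natural fit for `csv.DictWriter` from the standard library. The sketch below assumes every dictionary shares the keys of the first one. `write_dicts_to_csv` is a hypothetical stand-in, and the real `write_points_to_csv()` may differ.

```python
import csv
from typing import Any, Dict, List


def write_dicts_to_csv(rows: List[Dict[str, Any]], path: str) -> None:
    """Write `rows` to `path` with a header taken from the first row's keys."""
    if not rows:
        raise ValueError("no rows to write")
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```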


#### 6 - Reads data from the CSV file, converts it, and returns a DataFrame containing the values.

In [None]:
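from decouple import config
from transform.transform_csv_to_database import Converter

# Assumption: the key is kept out of the notebook and read from a setting
# named API_KEY (hypothetical name), like the database credentials in the Load Phase.
converter = Converter(api_key=config("API_KEY"))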
dataset_from_csv = converter.get_coordinates_from_csv_file(path_to_csv)

print(dataset_from_csv)

#### 7 - Makes the API calls to retrieve address data for each latitude/longitude point and saves the results to the database.

In [None]:
converter.save_dataset_coordinates_to_database(dataset_from_csv)
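
The `Converter` internals and the API it calls are not shown. To illustrate the general shape of this step (look up each point, then insert a row), the sketch below injects the lookup as a plain callable and uses an in-memory SQLite database as a stand-in for PostgreSQL. `save_addresses` and `fake_reverse_geocode` are hypothetical, and the stub returns fixed placeholder values rather than calling any real service.

```python
import sqlite3
from typing import Callable, Dict, Iterable, Tuple

Point = Tuple[float, float]
Geocoder = Callable[[float, float], Dict[str, str]]


def save_addresses(points: Iterable[Point], reverse_geocode: Geocoder,
                   conn: sqlite3.Connection) -> int:
    """Look up each point and store the result; returns the number of rows saved."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS addresses ("
        "street_name TEXT, city TEXT, latitude REAL, longitude REAL)"
    )
    saved = 0
    for latitude, longitude in points:
        address = reverse_geocode(latitude, longitude)
        conn.execute(
            "INSERT INTO addresses VALUES (?, ?, ?, ?)",
            (address["street_name"], address["city"], latitude, longitude),
        )
        saved += 1
    conn.commit()
    return saved


def fake_reverse_geocode(latitude: float, longitude: float) -> Dict[str, str]:
    """Placeholder lookup: a real implementation would call a geocoding API."""
    return {"street_name": "Example Street", "city": "Example City"}
```

Passing the lookup in as an argument keeps the database logic testable without network access or an API key.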


## Load Phase
#### 1 - Reads data from the database and displays it in the current cell.


In [1]:
import dataset
from decouple import config
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display, HTML
import pandas as pd

InteractiveShell.ast_node_interactivity = "all"

db_user = config("POSTGRES_USER")
db_name = config("POSTGRES_DB")
db_password = config("POSTGRES_PASSWORD")
db_host = config("POSTGRES_HOST")
string_connection = (f"postgresql://{db_user}:{db_password}@{db_host}:5432/{db_name}")

db = dataset.connect(string_connection)

with db.engine.connect() as conn, conn.begin():
    data = pd.read_sql("addresses", conn)
    display(HTML(data.to_html()))



Unnamed: 0,street_number,street_name,neighborhood,city,state,country,postal_code,latitude,longitude
0,405,Rua Monsenhor Veras,Santana,Porto Alegre,RS,Brazil,90610-010,-30.049829,-51.201502
1,1979,Avenida Edvaldo Pereira Paiva,Praia de Belas,Porto Alegre,RS,Brazil,90110-060,-30.067616,-51.239761
2,71,Rua Felizardo de Farias,Santo Antônio,Porto Alegre,RS,Brazil,90660-130,-30.067136,-51.203574
3,77,Rua Gen Telmo Oliveira Santana,Paternon,Porto Alegre,RS,Brazil,90610-170,-30.062476,-51.179234
4,508,Avenida Plínio Brasil Milano,Higienópolis,Porto Alegre,RS,Brazil,90520-001,-30.020308,-51.180877
5,55,Avenida Berlim,São Geraldo,Porto Alegre,RS,Brazil,90240-581,-30.012601,-51.20277
6,1931,Avenida Bento Gonçalves,Partenon,Porto Alegre,RS,Brazil,90650-002,-30.058322,-51.195034
7,70,Rua Mexiana Ilha Da Pintada,Arquipélago,Porto Alegre,RS,Brazil,90090-130,-30.013519,-51.262426
8,1678,Avenida Benjamin Constant,São João,Porto Alegre,RS,Brazil,90550-002,-30.016115,-51.196416
9,151,Rua General Vitórino,Centro Histórico de Porto Alegre,Porto Alegre,RS,Brazil,90020-170,-30.030591,-51.225197
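
One detail worth noting about the connection string built in the cell above: if the user name or password contains characters such as `@` or `/`, interpolating them directly produces a URL that does not parse as intended. A possible safeguard, sketched with the standard library's `urllib.parse.quote_plus`, is shown below. `build_postgres_url` is a hypothetical helper, and the default port mirrors the `5432` used in the notebook.

```python
from urllib.parse import quote_plus


def build_postgres_url(user: str, password: str, host: str, database: str,
                       port: int = 5432) -> str:
    """Build a PostgreSQL URL with percent-encoded credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )
```

The result can be passed to `dataset.connect()` in place of the hand-built `string_connection`.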
