In [1]:
library(dplyr)
library(readr)

setwd("C:/Users/s1155063404/Desktop/Projects/brazilian-ecommerce-dataset/DataCleaning")

raw_geolocation = read_csv("../RawDataset/geolocation_olist_public_dataset.csv")


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Parsed with column specification:
cols(
  zip_code_prefix = col_character(),
  city = col_character(),
  state = col_character(),
  lat = col_double(),
  lng = col_double()
)


## Data Cleaning

In [2]:
# Remove simulated points that fall outside Brazil's bounding box (lat/lng limits below)
geolocation = raw_geolocation %>%
  filter(lat <= 5.27438888, lat >= -33.75116944, lng >= -73.98283055, lng <= -34.79314722)
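
As a rough illustration of what this filter does, the sketch below applies the same bounds to a small made-up table and counts the rows that are dropped. The toy coordinates are not taken from the dataset, and the names `brazil_bounds`, `toy`, and `in_brazil` are hypothetical.

```r
library(dplyr)

# Bounding box copied from the filter in the cell above.
brazil_bounds <- list(lat_min = -33.75116944, lat_max = 5.27438888,
                      lng_min = -73.98283055, lng_max = -34.79314722)

# Toy points for illustration only; these are NOT rows from the dataset.
toy <- tibble(
  city = c("inside_a", "inside_b", "too_far_north", "too_far_east"),
  lat  = c(-23.5, -3.1, 40.0, -10.0),
  lng  = c(-46.6, -60.0, -50.0, -20.0)
)

# between() is inclusive on both ends, matching the <= / >= comparisons above.
in_brazil <- toy %>%
  filter(between(lat, brazil_bounds$lat_min, brazil_bounds$lat_max),
         between(lng, brazil_bounds$lng_min, brazil_bounds$lng_max))

# Number of rows removed by the bounding-box filter
nrow(toy) - nrow(in_brazil)
```

A bounding box is only a coarse check: a point in a neighboring country or in the ocean can still fall inside the box and pass the filter.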

In [3]:
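# We want to use the centroid of the simulated points to estimate each city's location,
# but for some cities the points are so spread out that the centroid is unreliable.
# List the cities whose latitude and longitude variances both exceed 0.5.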
tmp = geolocation %>%
  group_by(state, city) %>%
  summarize(var_lat = var(lat),
            var_lng = var(lng),
            lat = mean(lat),
            lng = mean(lng)) %>%
  arrange(desc(var_lng), desc(var_lat)) %>%
  filter(var_lng > 0.5, var_lat > 0.5)

tmp

state,city,var_lat,var_lng,lat,lng
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
sp,primavera,41.118655,25.6486848,-21.443973,-46.04159
sp,ibitiuva,14.073267,18.4355779,-19.330061,-46.35987
pe,bom jardim,2.960433,16.4246224,-7.022487,-37.41175
mg,conceicao da ibitipoca,30.003837,8.3017348,-15.39398,-40.59827
mt,boa esperanca,77.263909,7.3980348,-9.408915,-54.13112
ba,cumuruxatiba,27.359058,6.0773802,-14.0965,-37.76255
pe,belo jardim,2.111529,4.7486821,-8.496136,-36.66233
rj,sao sebastiao do alto,9.040131,1.9562347,-21.204717,-41.78528
rs,estancia velha,3.188229,1.7783566,-29.482507,-51.0418
ba,itamira,3.352161,0.5960393,-13.017369,-38.77515


The cause is faulty data entry: some points are labeled with the wrong city.

In [4]:
# The points labeled ibitiuva span from about (-21.0, -48.3) to (-12.6, -38.7), far too large an area for one city
geolocation %>%
  filter(state == "sp", city == "ibitiuva")

# A region this large contains many different cities
geolocation %>%
  filter(lat >= -21.0, lat <= -12.6, lng >= -48.3, lng <= -38.7)

zip_code_prefix,city,state,lat,lng
<chr>,<chr>,<chr>,<dbl>,<dbl>
147,ibitiuva,sp,-20.99506,-48.32698
147,ibitiuva,sp,-12.61931,-38.67958
147,ibitiuva,sp,-21.00032,-48.32602
147,ibitiuva,sp,-21.01804,-48.23385
147,ibitiuva,sp,-21.01758,-48.23293


zip_code_prefix,city,state,lat,lng
<chr>,<chr>,<chr>,<dbl>,<dbl>
141,candia,sp,-20.89902,-47.98807
143,brodowski,sp,-20.99648,-47.66360
143,brodowski,sp,-20.98509,-47.64483
143,brodowski,sp,-20.98745,-47.65752
143,batatais,sp,-20.88617,-47.56766
143,brodowski,sp,-20.99248,-47.65866
143,batatais,sp,-20.90419,-47.58345
143,brodowski,sp,-20.99551,-47.66198
143,batatais,sp,-20.88814,-47.59349
143,batatais,sp,-20.88373,-47.57431
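
To make the effect of a single mislabeled point concrete, here is a small worked example using only the five ibitiuva rows printed above, copied by hand. It is plain arithmetic on those rows, not the correction procedure used in this notebook. With all five rows, the mean reproduces the (-19.33, -46.36) centroid reported for ibitiuva in the variance table. Dropping the second row moves the centroid to where the other four points cluster and shrinks both variances to well under the 0.5 threshold.

```r
# The five ibitiuva rows from the output above, copied by hand.
ibitiuva <- data.frame(
  lat = c(-20.99506, -12.61931, -21.00032, -21.01804, -21.01758),
  lng = c(-48.32698, -38.67958, -48.32602, -48.23385, -48.23293)
)

# Centroid and variance with the suspicious second row included
colMeans(ibitiuva)
sapply(ibitiuva, var)

# Centroid and variance after dropping that row
colMeans(ibitiuva[-2, ])
sapply(ibitiuva[-2, ], var)
```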


The erroneous entries were removed by cross-checking each point against the city's actual location, taken from online sources.
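
The notebook does not show how the cross-check was carried out. One possible way to automate it, sketched here under stated assumptions, is to join the points against a reference table of city coordinates and flag any point that lies too far from its city. The table `city_reference`, its placeholder coordinates, the tolerance `max_deg`, and the column `suspect` are all hypothetical; only the five point rows come from the output above.

```r
library(dplyr)

# The five ibitiuva rows from the earlier output, copied by hand.
city_points <- tibble(
  state = "sp", city = "ibitiuva",
  lat = c(-20.99506, -12.61931, -21.00032, -21.01804, -21.01758),
  lng = c(-48.32698, -38.67958, -48.32602, -48.23385, -48.23293)
)

# HYPOTHETICAL reference table. The coordinates are a placeholder near the
# four consistent points, not an authoritative location for the city.
city_reference <- tibble(state = "sp", city = "ibitiuva",
                         ref_lat = -21.0, ref_lng = -48.3)

# HYPOTHETICAL tolerance, in degrees, for how far a point may lie from its city.
max_deg <- 0.5

# Flag points whose latitude or longitude is further than max_deg from the reference.
checked <- city_points %>%
  left_join(city_reference, by = c("state", "city")) %>%
  mutate(suspect = abs(lat - ref_lat) > max_deg | abs(lng - ref_lng) > max_deg)

# Keep only the points that agree with the reference location.
kept <- checked %>%
  filter(!suspect) %>%
  select(state, city, lat, lng)
```

Cities missing from the reference table get `NA` in `suspect`, and `filter()` drops rows where the condition is `NA`, so such cities would need separate handling.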

In [5]:
corrected_geolocation = read_csv("../CleanedDataset/corrected_geolocation.csv")

corrected_geolocation %>%
  group_by(state, city) %>%
  summarize(var_lat = var(lat),
            var_lng = var(lng),
            lat = mean(lat),
            lng = mean(lng)) %>%
  arrange(desc(var_lng), desc(var_lat)) %>%
  filter(var_lng > 0.5, var_lat > 0.5)

Parsed with column specification:
cols(
  zip_code_prefix = col_integer(),
  city = col_character(),
  state = col_character(),
  lat = col_double(),
  lng = col_double()
)


state,city,var_lat,var_lng,lat,lng
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
