# Prepare Datasets

#### Dataset 1: `s3://mapbox/gabbar/v1/reviewed-changesets.csv`
- Changesets reviewed by users on [`osmcha.mapbox.com`](https://osmcha.mapbox.com/)
- From 1st October, 2015 to 31st March, 2017.

#### Dataset 2: `s3://mapbox/gabbar/v1/reviewed-real-changesets.json`
- The real changesets version of the changesets reviewed by users on [`osmcha.mapbox.com`](https://osmcha.mapbox.com/)

#### Dataset 3: `s3://mapbox/gabbar/v1/march-2017-changesets.csv`
- All changesets in the month of `March, 2017`.

#### Dataset 4: `s3://mapbox/gabbar/v1/march-2017-real-changesets.json`
- The real changesets version of all changesets in the month of `March, 2017`.


## Notes
#### Why changesets only after `1st October, 2015`?
- Users actively reviewed changesets on osmcha only after `October, 2015`.

In [1]:
import os
import datetime

In [2]:
import pandas as pd

In [3]:
# Directory to store downloaded data, create if it does not exist.
downloads = '../downloads/v1/'
!mkdir -p {downloads}
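
The shell call above relies on IPython's `!` syntax. A minimal pure-Python sketch of the same step, assuming only the standard library, uses `os.makedirs` with `exist_ok=True`. The temporary base directory and the name `example_downloads` are assumptions for illustration, standing in for `../downloads/v1/` so the example is safe to run anywhere.

```python
import os
import tempfile

# A temporary base directory stands in for the notebook's '../downloads/v1/'
# so this sketch does not touch the real project tree (illustrative assumption).
base = tempfile.mkdtemp()
example_downloads = os.path.join(base, 'downloads', 'v1')

# exist_ok=True mirrors `mkdir -p`: intermediate directories are created
# and an already existing directory is not treated as an error.
os.makedirs(example_downloads, exist_ok=True)
os.makedirs(example_downloads, exist_ok=True)  # Second call is a no-op.

print(os.path.isdir(example_downloads))
```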

## Dataset 1: `s3://mapbox/gabbar/v1/reviewed-changesets.csv`

In [4]:
# Download changesets from osmcha.
url = 'http://osmcha.mapbox.com/?is_suspect=False&is_whitelisted=All&harmful=None&checked=True&all_reason=True&sort=-date&render_csv=True'
changesets = pd.read_csv(url)
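
The query string in the URL above is easier to read and adjust when written as a dictionary. The sketch below rebuilds the same URL with `urllib.parse.urlencode`; the names `base_url`, `params`, and `example_url` are introduced here for illustration, and the dictionary only restates the parameters already present in the URL without claiming what each filter does on the server.

```python
from urllib.parse import urlencode

base_url = 'http://osmcha.mapbox.com/'

# Same parameters, in the same order, as the hand-written URL in the notebook.
params = {
    'is_suspect': 'False',
    'is_whitelisted': 'All',
    'harmful': 'None',
    'checked': 'True',
    'all_reason': 'True',
    'sort': '-date',
    'render_csv': 'True',
}

# urlencode joins the key/value pairs with '&' and escapes them where needed.
example_url = base_url + '?' + urlencode(params)
print(example_url)
```

The resulting string can be passed to `pd.read_csv` in the same way as `url`.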

In [5]:
# Convert `date` from a string to a pandas datetime.
changesets['date'] = pd.to_datetime(changesets['date'])

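start_date = datetime.datetime(2015, 10, 1)
# Note: this is midnight at the start of 31st March, so the `<=` filter below
# excludes changesets created later that day.
end_date = datetime.datetime(2017, 3, 31)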

changesets = changesets[(changesets['date'] >= start_date) & (changesets['date'] <= end_date)]
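
Because `end_date` is midnight at the start of 31st March, a `<=` comparison keeps nothing created later that day. Whether that is intended is not stated here. One way to make the boundary explicit, sketched on a small hypothetical frame rather than the real data, is a half-open range that ends at the start of the following day.

```python
import datetime
import pandas as pd

# Hypothetical rows for illustration only; not taken from the osmcha export.
sample = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'date': pd.to_datetime([
        '2015-09-30 23:59:59',  # before the window
        '2015-10-01 00:00:00',  # first instant of the window
        '2017-03-31 15:30:00',  # during the last day
        '2017-04-01 00:00:00',  # just after the window
    ]),
})

start = datetime.datetime(2015, 10, 1)
end = datetime.datetime(2017, 3, 31)

# Closed range as in the notebook: the afternoon of 31st March is dropped.
closed = sample[(sample['date'] >= start) & (sample['date'] <= end)]

# Half-open range ending at the start of the next day keeps all of 31st March.
next_day = end + datetime.timedelta(days=1)
half_open = sample[(sample['date'] >= start) & (sample['date'] < next_day)]

print(closed['ID'].tolist())
print(half_open['ID'].tolist())
```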

In [6]:
print('Changesets shape: {}'.format(changesets.drop_duplicates('ID').shape))
changesets.head(3)

Changesets shape: (55411, 19)


| | ID | user | editor | Powerfull Editor | comment | source | imagery used | date | reasons | reasons__name | create | modify | delete | bbox | is suspect | harmful | checked | check_user__username | check date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12787 | 47309546 | EdSS | iD 2.1.3 | False | fix duplicate and close nodes | Not reported | Bing aerial imagery | 2017-03-30 23:59:37 | 1.0 | suspect_word | 6.0 | 3.0 | 2.0 | SRID=4326;POLYGON ((-73.8574144 40.8634715, -7... | True | False | True | manoharuss | 2017-03-31T10:12:50.260699+00:00 |
| 12788 | 47309437 | Lotsofdotsandlines | iD 2.1.3 | False | #gwu #usaid | Not reported | Bing aerial imagery;Local GPX | 2017-03-30 23:49:55 | 40.0 | New mapper | 150.0 | 0.0 | 24.0 | SRID=4326;POLYGON ((33.4092155 -3.5244155, 33.... | True | False | True | BharataHS | 2017-03-31T05:13:18.188425+00:00 |
| 12789 | 47309432 | Nea123 | iD 2.1.3 | False | #hotosm-project-2301 #PEPFAR #MapGive #YouthMa... | Not reported | Bing aerial imagery;Local GPX | 2017-03-30 23:49:28 | 40.0 | New mapper | 16.0 | 14.0 | 5.0 | SRID=4326;POLYGON ((33.4011782 -3.6697061, 33.... | True | False | True | srividya_c | 2017-03-31T05:17:00.684731+00:00 |
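
Cell 6 prints the shape after `drop_duplicates('ID')` but does not assign the result, so `changesets` itself still contains any repeated IDs. A minimal sketch on a hypothetical frame shows how the duplicates could be counted with `duplicated` and how the de-duplicated frame would be kept if that is wanted.

```python
import pandas as pd

# Hypothetical rows for illustration only; ID 102 appears twice.
sample = pd.DataFrame({
    'ID': [101, 102, 102, 103],
    'user': ['a', 'b', 'b', 'c'],
})

# duplicated('ID') marks every repeat of an ID after its first occurrence.
n_duplicates = int(sample.duplicated('ID').sum())

# drop_duplicates returns a new frame; assign it to keep the result.
unique = sample.drop_duplicates('ID')

print(n_duplicates)
print(unique.shape)
```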


In [7]:
filename = 'reviewed-changesets.csv'
filepath = os.path.join(downloads, filename)
changesets.to_csv(filepath, index=False)
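
A quick way to check a saved file is to read it back and compare it with the frame in memory. The sketch below does this in a temporary directory with a two-row frame whose values are borrowed from the preview above; `parse_dates` is passed because a CSV file stores the `date` column as text. The names `sample`, `path`, and `restored` are introduced here for illustration.

```python
import os
import tempfile
import pandas as pd

# Two rows echoing the preview above; a stand-in for the full `changesets` frame.
sample = pd.DataFrame({
    'ID': [47309546, 47309437],
    'date': pd.to_datetime(['2017-03-30 23:59:37', '2017-03-30 23:49:55']),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'reviewed-changesets.csv')
    # index=False keeps the row index out of the file, as in the notebook.
    sample.to_csv(path, index=False)
    # parse_dates turns the text column back into datetimes on load.
    restored = pd.read_csv(path, parse_dates=['date'])

same_ids = restored['ID'].tolist() == sample['ID'].tolist()
same_dates = bool((restored['date'] == sample['date']).all())
print(same_ids, same_dates)
```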