# Data preprocessing

I need to transform the geographic coordinates into a 2D projection of the points on a plane. I cannot do this directly, because the dataset stores locations as regular latitude/longitude pairs (for example 4.680490, -74.132076, as you usually see in Google Maps). A possible solution would be to choose an origin point whose x and y coordinates are smaller than any x or y in the dataset, and then calculate the distance (in meters) between the [x y] of each point and the origin. I need to choose a good projection to do this.
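The origin-offset idea can be sketched in a few lines. This is only an illustration, not the approach used in the rest of this notebook: it assumes a spherical Earth with a mean radius of 6,371 km and an equirectangular approximation, which is only reasonable over a small area. The function name `to_local_xy` and the constant `EARTH_RADIUS_M` are made up for this example; the three sample points come from the New York rows of the dataset.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumption: mean Earth radius, spherical model


def to_local_xy(points):
    """Map (lat, lon) pairs in degrees to (x, y) offsets in meters.

    The origin is the corner (min lat, min lon), so every offset is >= 0.
    """
    lat0 = min(lat for lat, _ in points)
    lon0 = min(lon for _, lon in points)
    # Meridians converge toward the poles, so east-west distances are
    # scaled by the cosine of a reference latitude (here the mean latitude).
    mean_lat = math.radians(sum(lat for lat, _ in points) / len(points))
    xy = []
    for lat, lon in points:
        x = EARTH_RADIUS_M * math.radians(lon - lon0) * math.cos(mean_lat)
        y = EARTH_RADIUS_M * math.radians(lat - lat0)
        xy.append((x, y))
    return xy


points = [
    (40.588013, -73.729360),
    (40.606348, -73.645982),
    (41.068236, -73.954937),
]
local_xy = to_local_xy(points)
for (lat, lon), (x, y) in zip(points, local_xy):
    print(f"({lat}, {lon}) -> x={x:.1f} m, y={y:.1f} m")
```

The error of this approximation grows with the size of the region, which is one reason to prefer an established projection such as UTM.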

Time is limited, so I decided to transform all the points to UTM coordinates (already in meters). Given that the UTM projection of the globe is quite complex, I will choose only one state (Florida or New York) and will only consider points in a particular ZONE NUMBER and ZONE LETTER, so I don't have to account for the change in the distance when a point belongs to a different ZONE NUMBER and ZONE LETTER.
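To see why restricting to one zone matters, here is a rough sketch of how the zone of a point can be derived from its latitude and longitude. It uses the standard 6-degree longitude zones and 8-degree latitude bands and ignores the special cases around Norway and Svalbard; the helper `utm_zone` is hypothetical and is not part of the `utm` package. The sample points are taken from the New York rows of the dataset.

```python
from collections import Counter

# Latitude band letters from 80S to 84N (I and O are skipped).
ZONE_LETTERS = "CDEFGHJKLMNPQRSTUVWX"


def utm_zone(lat, lon):
    """Return (zone_number, zone_letter) for -80 <= lat <= 84.

    Simplified: the Norway/Svalbard exceptions are not handled.
    """
    # Zones are 6 degrees wide, numbered 1..60 starting at 180W.
    zone_number = min(int((lon + 180) // 6) + 1, 60)
    # Bands are 8 degrees tall starting at 80S; band X is stretched to 84N.
    letter_index = min(int((lat + 80) // 8), len(ZONE_LETTERS) - 1)
    return zone_number, ZONE_LETTERS[letter_index]


points = [
    (43.810218, -76.022978),
    (42.902021, -78.259113),
    (41.848203, -73.555575),
    (40.588013, -73.729360),
]

zones = [utm_zone(lat, lon) for lat, lon in points]
counts = Counter(zones)

# Keep only the points that fall in the most common zone.
main_zone, _ = counts.most_common(1)[0]
same_zone = [p for p, z in zip(points, zones) if z == main_zone]
print(counts, main_zone, same_zone)
```

Even within a single state the points can land in different zone numbers, so easting values from different zones are not directly comparable.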

In [1]:
import utm
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('./kaggle_income.csv', encoding='latin-1')

In [3]:
df

Unnamed: 0,id,State_Code,State_Name,State_ab,County,City,Place,Type,Primary,Zip_Code,Area_Code,ALand,AWater,Lat,Lon,Mean,Median,Stdev,sum_w
0,1011000,1,Alabama,AL,Mobile County,Chickasaw,Chickasaw city,City,place,36611,251,10894952,909156,30.771450,-88.079697,38773,30506,33101,1638.260513
1,1011010,1,Alabama,AL,Barbour County,Louisville,Clio city,City,place,36048,334,26070325,23254,31.708516,-85.611039,37725,19528,43789,258.017685
2,1011020,1,Alabama,AL,Shelby County,Columbiana,Columbiana city,City,place,35051,205,44835274,261034,33.191452,-86.615618,54606,31930,57348,926.031000
3,1011030,1,Alabama,AL,Mobile County,Satsuma,Creola city,City,place,36572,251,36878729,2374530,30.874343,-88.009442,63919,52814,47707,378.114619
4,1011040,1,Alabama,AL,Mobile County,Dauphin Island,Dauphin Island,Town,place,36528,251,16204185,413605152,30.250913,-88.171268,77948,67225,54270,282.320328
5,1011050,1,Alabama,AL,Cullman County,Cullman,Dodge City,Town,place,35057,256,8913021,26837,34.045414,-86.882670,50715,42643,35886,173.325959
6,1011060,1,Alabama,AL,Escambia County,East Brewton,East Brewton city,City,place,36426,251,8826252,91015,31.091440,-87.055345,33737,23610,28256,758.771322
7,1011070,1,Alabama,AL,Elmore County,Coosada,Elmore,Town,place,36020,334,10222339,176500,32.544337,-86.336446,46319,40242,38941,397.052564
8,1011080,1,Alabama,AL,Morgan County,Eva,Eva,Town,place,35621,256,10544874,78981,34.326504,-86.765318,57994,39591,47235,137.496039
9,1011090,1,Alabama,AL,Talladega County,Sylacauga,Fayetteville,CDP,place,35151,256,45178321,6034534,33.168097,-86.442774,54807,41712,51359,380.728238


In [4]:
new_york = df.loc[df['State_ab'] == 'NY']

In [5]:
new_york

Unnamed: 0,id,State_Code,State_Name,State_ab,County,City,Place,Type,Primary,Zip_Code,Area_Code,ALand,AWater,Lat,Lon,Mean,Median,Stdev,sum_w
18841,36011353,36,New York,NY,Jefferson County,Adams,Adams,Village,place,13605,315,3742271,0,43.810218,-76.022978,54499,37697,48943,470.579258
18842,36011363,36,New York,NY,Genesee County,Alexander,Alexander,Village,place,14005,585,1132465,0,42.902021,-78.259113,64770,57765,40606,91.795365
18843,36011373,36,New York,NY,Dutchess County,Amenia,Amenia,CDP,place,12501,845,3152792,70626,41.848203,-73.555575,82188,56048,67388,245.057685
18844,36011383,36,New York,NY,Tioga County,Apalachin,Apalachin,CDP,place,13732,607,3779536,0,42.070565,-76.162636,73219,54363,55251,311.055400
18845,36011393,36,New York,NY,Nassau County,Hempstead,Atlantic Beach,Village,place,11550,516,1136400,1554299,40.588013,-73.729360,131519,114157,78333,183.382857
18846,36011403,36,New York,NY,Chenango County,Bainbridge,Bainbridge,Village,place,13733,607,3403618,102212,42.301697,-75.478883,52841,44096,42015,348.187076
18847,36011413,36,New York,NY,Nassau County,Island Park,Barnum Island,CDP,place,11558,516,2279725,972781,40.606348,-73.645982,110516,91529,71566,289.377702
18848,36011423,36,New York,NY,Orange County,Washingtonville,Beaver Dam Lake,CDP,place,10992,845,4692489,771376,41.447173,-74.118707,100299,102874,56742,258.481909
18849,36011433,36,New York,NY,Allegany County,Belmont,Belmont,Village,place,14813,585,2579199,4170,42.220105,-78.030794,46554,39086,35590,318.020553
18850,36011443,36,New York,NY,Rockland County,Blauvelt,Blauvelt,CDP,place,10913,845,11660297,272602,41.068236,-73.954937,123435,115560,76643,545.788323


In [6]:
new_york_simple = new_york[['State_Name','State_ab','Lat','Lon','Mean','Median','Stdev']]

In [10]:
new_york_simple

Unnamed: 0,State_Name,State_ab,Lat,Lon,Mean,Median,Stdev,easting
18841,New York,NY,43.810218,-76.022978,54499,37697,48943,417722.480944
18842,New York,NY,42.902021,-78.259113,64770,57765,40606,723766.208174
18843,New York,NY,41.848203,-73.555575,82188,56048,67388,619909.143604
18844,New York,NY,42.070565,-76.162636,73219,54363,55251,403819.017874
18845,New York,NY,40.588013,-73.729360,131519,114157,78333,607526.764688
18846,New York,NY,42.301697,-75.478883,52841,44096,42015,460527.894216
18847,New York,NY,40.606348,-73.645982,110516,91529,71566,614551.445436
18848,New York,NY,41.447173,-74.118707,100299,102874,56742,573615.263383
18849,New York,NY,42.220105,-78.030794,46554,39086,35590,745064.152504
18850,New York,NY,41.068236,-73.954937,123435,115560,76643,587801.368729


In [8]:
#pd.concat([new_york_simple,pd.DataFrame(columns=['easting', 'northing', 'zone_number', 'zone_letter'])])

## Transform coordinates to UTM

In [9]:
# utm.from_latlon returns a tuple (easting, northing, zone_number, zone_letter);
# each line below picks one element of that tuple for every row.
new_york_simple['easting'] = new_york_simple.apply(lambda row: utm.from_latlon(row['Lat'], row['Lon'])[0], axis=1)
new_york_simple['northing'] = new_york_simple.apply(lambda row: utm.from_latlon(row['Lat'], row['Lon'])[1], axis=1)
new_york_simple['zone_number'] = new_york_simple.apply(lambda row: utm.from_latlon(row['Lat'], row['Lon'])[2], axis=1)
new_york_simple['zone_letter'] = new_york_simple.apply(lambda row: utm.from_latlon(row['Lat'], row['Lon'])[3], axis=1)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
