# Data Merging

## Importing data and modules

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('darkgrid')
sns.set_palette('viridis')

In [2]:
train = pd.read_csv('../data/train_cleaned.csv')
weather = pd.read_csv('../data/weather_cleaned.csv')
spray = pd.read_csv('../data/spray_cleaned.csv')

In [3]:
weather.columns = weather.columns.str.lower()
spray.columns = spray.columns.str.lower()

In [4]:
train.date[0], weather.date[0], spray.date[0]

('2007-05-29', '2007-05-01', '2011-08-29')

In [5]:
spray.head()

|   | date | latitude | longitude |
|---|------------|-----------|------------|
| 0 | 2011-08-29 | 42.391623 | -88.089163 |
| 1 | 2011-08-29 | 42.391348 | -88.089163 |
| 2 | 2011-08-29 | 42.391022 | -88.089157 |
| 3 | 2011-08-29 | 42.390637 | -88.089158 |
| 4 | 2011-08-29 | 42.39041 | -88.088858 |


The first rows show that the `date` column uses the same `YYYY-MM-DD` string format in all three datasets, so we can merge on that column without converting it first.
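Comparing only the first row of each dataset does not guarantee that every row follows the same format. A minimal sketch of a stricter check is shown here, assuming the dates are meant to be `YYYY-MM-DD` strings; the toy series stand in for the real `date` columns and the names `dates`, `parsed` and `all_iso` are illustrative.

```python
import pandas as pd

# Toy stand-ins for the `date` columns (values copied from the first rows above).
dates = {
    'train': pd.Series(['2007-05-29']),
    'weather': pd.Series(['2007-05-01']),
    'spray': pd.Series(['2011-08-29']),
}

# Parse with an explicit format; errors='coerce' turns any value that
# does not match YYYY-MM-DD into NaT instead of raising.
parsed = {
    name: pd.to_datetime(col, format='%Y-%m-%d', errors='coerce')
    for name, col in dates.items()
}

# True only if every value in every dataset parsed successfully.
all_iso = all(col.notna().all() for col in parsed.values())
print(all_iso)
```

On the real data, the same check could be run by replacing the toy series with `train['date']`, `weather['date']` and `spray['date']`.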

## Establishing `station` column in train dataset

We split Chicago along an east-west line to divide the city into north and south halves. Traps in the north half are tied to weather recorded at O'Hare Airport (`station` = 1), and traps in the south half are tied to weather recorded at Midway Airport (`station` = 2). To do this, we take the midpoint between the latitudes of the two stations and use that line of latitude as the boundary between the north and south sides of Chicago.

### Finding the midpoint

In [6]:
(41.998+41.786)/2

41.891999999999996

### Using the midpoint to divide observations

In [7]:
train['station'] = np.where(train['latitude']>=41.892, 1, 2)

In [8]:
train.station.value_counts()

2    6072
1    4434
Name: station, dtype: int64
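The assignment rule can be illustrated on a handful of made-up latitudes. This is a sketch only: the frame `toy` and its values are hypothetical, while the two station latitudes and the `>=` rule are taken from the cells above.

```python
import numpy as np
import pandas as pd

# Station latitudes used in the midpoint calculation above.
OHARE_LAT, MIDWAY_LAT = 41.998, 41.786

# Round to three decimals to avoid the floating-point tail (41.891999...).
midpoint = round((OHARE_LAT + MIDWAY_LAT) / 2, 3)

# Hypothetical trap latitudes: one north of the line, one on it, one south.
toy = pd.DataFrame({'latitude': [41.95, 41.892, 41.75]})

# On or north of the midpoint -> O'Hare (1); south of it -> Midway (2).
toy['station'] = np.where(toy['latitude'] >= midpoint, 1, 2)
print(toy)
```

A trap exactly on the dividing line is assigned to O'Hare because the comparison is `>=`.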

In [9]:
train.shape, weather.shape

((10506, 13), (2918, 19))

In [10]:
train_weather = pd.merge(train, weather, on=['date', 'station'])

In [11]:
train_weather.shape

(10440, 30)
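The merged frame has fewer rows than `train` (10440 versus 10506). `pd.merge` performs an inner join by default, so this suggests that some `train` rows had no matching `date` and `station` pair in `weather`. One way to find such rows is a left merge with `indicator=True`; the sketch below uses invented toy frames (`train_toy`, `weather_toy`), not the real data.

```python
import pandas as pd

# Toy frames; all values are invented for illustration.
train_toy = pd.DataFrame({
    'date': ['2007-05-29', '2007-05-29', '2007-06-05'],
    'station': [1, 2, 1],
    'nummosquitos': [1, 4, 2],
})
weather_toy = pd.DataFrame({
    'date': ['2007-05-29', '2007-05-29'],
    'station': [1, 2],
    'tavg': [74, 77],
})

# A left merge keeps every train row. indicator=True adds a `_merge` column,
# and validate raises if weather has duplicate (date, station) keys.
checked = pd.merge(
    train_toy, weather_toy,
    on=['date', 'station'],
    how='left',
    indicator=True,
    validate='many_to_one',
)

# Rows flagged 'left_only' are the ones an inner merge would drop.
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['date', 'station']])
```

Inspecting the unmatched keys on the real data would show whether the missing rows come from gaps in the weather records or from a mismatch in the keys.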

In [12]:
train_weather.columns

Index(['date', 'address', 'species', 'block', 'street', 'trap',
       'addressnumberandstreet', 'latitude', 'longitude', 'addressaccuracy',
       'nummosquitos', 'wnvpresent', 'station', 'tmax', 'tmin', 'tavg',
       'depart', 'dewpoint', 'wetbulb', 'heat', 'cool', 'sunrise', 'sunset',
       'codesum', 'preciptotal', 'stnpressure', 'sealevel', 'resultspeed',
       'resultdir', 'avgspeed'],
      dtype='object')

## Exporting merged data

In [13]:
train_weather.to_csv('../data/train_weather_merged.csv', index=False)