In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
trainzq = pd.read_csv('../../data/train_eng.csv', parse_dates=['date'])
trainkev = pd.read_csv('../../data/train_eng_kev.csv', parse_dates=['date'])
weather = pd.read_csv('../../data/weather_eng.csv', parse_dates=['date'])

In [3]:
trainzq.head()

Unnamed: 0,date,species,totalmosquitos,wnvpresent,trap,latitude,longitude
0,2007-05-29,CULEX PIPIENS,1,0,T096,41.731922,-87.677512
1,2007-05-29,CULEX PIPIENS/RESTUANS,1,0,T086,41.688324,-87.676709
2,2007-05-29,CULEX PIPIENS/RESTUANS,1,0,T048,41.867108,-87.654224
3,2007-05-29,CULEX PIPIENS/RESTUANS,1,0,T129,41.891126,-87.61156
4,2007-05-29,CULEX PIPIENS/RESTUANS,1,0,T050,41.919343,-87.694259


In [4]:
trainkev.head()

Unnamed: 0.1,Unnamed: 0,date,species,trap,latitude,longitude,year,month,nummosquitos,wnvpresent,species_ord
0,0,2007-05-29,CULEX PIPIENS,T096,41.731922,-87.677512,2007,5,1,0,2.0
1,1,2007-05-29,CULEX PIPIENS/RESTUANS,T002,41.95469,-87.800991,2007,5,1,0,2.0
2,2,2007-05-29,CULEX PIPIENS/RESTUANS,T015,41.974089,-87.824812,2007,5,1,0,2.0
3,3,2007-05-29,CULEX PIPIENS/RESTUANS,T048,41.867108,-87.654224,2007,5,1,0,2.0
4,4,2007-05-29,CULEX PIPIENS/RESTUANS,T050,41.919343,-87.694259,2007,5,1,0,2.0


In [6]:
trainzq['species_ord'] = trainkev['species_ord']
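
The assignment above aligns the two frames on their row index. The `head()` outputs show the traps listed in a different order in each frame (row 1 is `T086` in `trainzq` but `T002` in `trainkev`), so index alignment only gives the right answer if the species happen to match row for row. A minimal sketch of a key-based alternative follows. It assumes `species_ord` depends on `species` alone, which the source does not confirm, and the frames and the `1.0` code are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for trainzq and trainkev; values are illustrative only.
left = pd.DataFrame({
    "species": ["CULEX PIPIENS", "CULEX RESTUANS", "CULEX PIPIENS/RESTUANS"],
    "trap": ["T096", "T086", "T048"],
})
right = pd.DataFrame({
    "species": ["CULEX PIPIENS/RESTUANS", "CULEX PIPIENS", "CULEX RESTUANS"],
    "species_ord": [2.0, 2.0, 1.0],
})

# Assumption: every species carries exactly one ordinal code.
one_code_each = right.groupby("species")["species_ord"].nunique().eq(1).all()
assert one_code_each, "species_ord is not a function of species alone"

# Build a species -> code lookup and map it onto the other frame by value,
# so the result does not depend on row order.
codes = right.drop_duplicates("species").set_index("species")["species_ord"]
left["species_ord"] = left["species"].map(codes)
```

If the check fails, the code is not a pure function of species and a join on the full observation key would be needed instead.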

In [7]:
trainzq.head()

Unnamed: 0,date,species,totalmosquitos,wnvpresent,trap,latitude,longitude,species_ord
0,2007-05-29,CULEX PIPIENS,1,0,T096,41.731922,-87.677512,2.0
1,2007-05-29,CULEX PIPIENS/RESTUANS,1,0,T086,41.688324,-87.676709,2.0
2,2007-05-29,CULEX PIPIENS/RESTUANS,1,0,T048,41.867108,-87.654224,2.0
3,2007-05-29,CULEX PIPIENS/RESTUANS,1,0,T129,41.891126,-87.61156,2.0
4,2007-05-29,CULEX PIPIENS/RESTUANS,1,0,T050,41.919343,-87.694259,2.0


In [5]:
weather.head()

Unnamed: 0.1,Unnamed: 0,date,tmax,tmin,tavg,depart,dewpoint,wetbulb,heat,cool,...,tavg_lag28,rel_humid_lag5,rel_humid_lag14,rel_humid_lag28,avgspeed_lag5,avgspeed_lag14,avgspeed_lag28,preciptotal_lag5,preciptotal_lag14,preciptotal_lag28
0,0,2007-05-01,83,51,67,14,51,56,0,2,...,,,,,,,,,,
1,1,2007-05-02,59,42,51,-2,42,47,13,0,...,,,,,,,,,,
2,2,2007-05-03,66,47,57,3,40,49,8,0,...,,,,,,,,,,
3,3,2007-05-04,72,50,61,7,41,50,4,0,...,,,,,,,,,,
4,4,2007-05-05,66,53,60,5,38,49,5,0,...,,39.634503,,,11.54,,,0.002,,


In [12]:
comb = pd.merge(trainzq, weather, on='date').drop(columns='Unnamed: 0')
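
A merge like the one above can silently duplicate or drop rows if the weather table has repeated or missing dates. The sketch below shows one way to guard against that with the `validate` and `indicator` arguments of `pd.merge`. The frames and the `tavg` values are made up, and `how="left"` is an assumption chosen so that unmatched rows stay visible (the notebook uses the default inner join).

```python
import pandas as pd

# Toy stand-ins for the trap observations and the daily weather table.
obs = pd.DataFrame({
    "date": pd.to_datetime(["2007-05-29", "2007-05-29", "2007-06-05"]),
    "trap": ["T096", "T086", "T048"],
})
daily = pd.DataFrame({
    "date": pd.to_datetime(["2007-05-29", "2007-06-05"]),
    "tavg": [75, 68],
})

# validate="many_to_one" raises MergeError if a date repeats in `daily`;
# indicator=True adds a `_merge` column flagging rows without a match.
merged = pd.merge(obs, daily, on="date", how="left",
                  validate="many_to_one", indicator=True)
unmatched = int((merged["_merge"] != "both").sum())
merged = merged.drop(columns="_merge")
```

With these checks in place, the row count after the merge should equal the row count of the observation frame, and `unmatched` reports how many observations have no weather record.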

In [14]:
comb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8475 entries, 0 to 8474
Data columns (total 52 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   date               8475 non-null   datetime64[ns]
 1   species            8475 non-null   object        
 2   totalmosquitos     8475 non-null   int64         
 3   wnvpresent         8475 non-null   int64         
 4   trap               8475 non-null   object        
 5   latitude           8475 non-null   float64       
 6   longitude          8475 non-null   float64       
 7   species_ord        8475 non-null   float64       
 8   tmax               8475 non-null   int64         
 9   tmin               8475 non-null   int64         
 10  tavg               8475 non-null   int64         
 11  depart             8475 non-null   int64         
 12  dewpoint           8475 non-null   int64         
 13  wetbulb            8475 non-null   int64         
 14  heat    

## Total Mosquitos

In [23]:
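# Top 10 features by absolute Pearson correlation with wnvpresent
# (row 0 of the sorted result is wnvpresent itself, so the slice starts at 1)
comb.corr()[['wnvpresent']].abs().sort_values('wnvpresent', ascending=False)[1:11]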

Unnamed: 0,wnvpresent
totalmosquitos,0.233532
avgspeed_lag28,0.138465
rel_humid_lag14,0.136085
tavg_lag28,0.131077
rel_humid_lag28,0.126917
rel_humid_lag5,0.111116
species_ord,0.108576
avgspeed_lag14,0.102773
tavg_lag14,0.100259
sunrise,0.097634


As seen above, the features in our dataset are generally only weakly correlated with `wnvpresent`. The top 10 consists mostly of features that we engineered earlier, and the strongest single feature is `totalmosquitos`, with an absolute Pearson correlation of 0.23 with our target. On closer inspection, however, this feature is of limited use in our final model.
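
The ranking can also be computed without building the full correlation matrix and without relying on the target landing in row 0. The sketch below is one possible approach on made-up data, not the notebook's method: it restricts to numeric columns with `select_dtypes` and uses `DataFrame.corrwith` against the target. The `demo` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data: a binary target, one informative and one uninformative feature,
# plus a text column that must be excluded from the correlation.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "wnvpresent": rng.integers(0, 2, size=200),
    "tavg": rng.normal(70, 8, size=200),
    "species": ["CULEX PIPIENS"] * 200,
})
demo["totalmosquitos"] = demo["wnvpresent"] * 5 + rng.poisson(3, size=200)

# Keep numeric columns only, then correlate each feature with the target.
numeric = demo.select_dtypes("number")
ranking = (numeric.drop(columns="wnvpresent")
           .corrwith(numeric["wnvpresent"])
           .abs()
           .sort_values(ascending=False)
           .head(10))
```

Dropping the target explicitly avoids the positional `[1:11]` slice, which would return the wrong rows if another column tied with the target at 1.0.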

The test dataset does not contain the `NumMosquitos` column that we used to create the `totalmosquitos` feature in our training data. Because observations are capped at 50 mosquitos per row, we explored using the structure of the data (the number of rows for each unique date-species-trap combination) to estimate the number of mosquitos in the test set. Although this might have worked for this particular dataset, it depends on how this dataset happens to be structured, so we decided it would not be useful for our model outside of this use case.
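
The row-count idea described above can be sketched as follows. This is a rough illustration on made-up rows, assuming that a date-species-trap combination is split into a new row only when the previous row has reached the cap of 50. Under that assumption, `n` rows imply between `(n - 1) * 50 + 1` and `n * 50` mosquitos. The frame `test_like` and the column names `est_min` and `est_max` are hypothetical.

```python
import pandas as pd

CAP = 50  # per-row cap on mosquito counts, as stated in the text

# Toy test-like frame: three rows for one combination, one row for another.
test_like = pd.DataFrame({
    "date": pd.to_datetime(["2008-06-11"] * 4),
    "species": ["CULEX PIPIENS"] * 3 + ["CULEX RESTUANS"],
    "trap": ["T002"] * 4,
})

# Count rows per unique date-species-trap combination.
keys = ["date", "species", "trap"]
counts = test_like.groupby(keys).size().rename("n_rows").reset_index()

# Bounds implied by the cap: all rows but the last are full, the last holds 1..CAP.
counts["est_min"] = (counts["n_rows"] - 1) * CAP + 1
counts["est_max"] = counts["n_rows"] * CAP
```

The estimate is only a range, and it exists only because of how the rows were recorded, which is the reason it does not carry over to data collected differently.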

In [24]:
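# Top 10 features by absolute Pearson correlation with totalmosquitos
# (row 0 of the sorted result is totalmosquitos itself, so the slice starts at 1)
comb.corr()[['totalmosquitos']].abs().sort_values('totalmosquitos', ascending=False)[1:11]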

Unnamed: 0,totalmosquitos
wnvpresent,0.233532
tmin,0.06804
tavg_lag5,0.066488
species_ord,0.066198
cool,0.065672
tavg,0.065043
wetbulb,0.059867
latitude,0.058984
tavg_lag14,0.057675
tmax,0.057313


Additionally, as seen above, every other feature has a very low correlation with `totalmosquitos` (at most 0.07 in absolute value), so predicting `totalmosquitos` with a secondary regression model would likely add noise rather than improve our model. For these reasons, we decided to drop the feature from our final model.
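
To put these correlations in perspective: for a linear regression on a single feature, the share of variance explained (R²) equals the squared Pearson correlation. The short calculation below applies that identity to the three strongest non-target features from the table above. It bounds single-feature linear fits only, so a multi-feature or non-linear model could in principle do better.

```python
# Absolute correlations with totalmosquitos, copied from the table above.
abs_corr = {"tmin": 0.06804, "tavg_lag5": 0.066488, "species_ord": 0.066198}

# For a one-feature linear regression, R^2 is the squared Pearson correlation.
r_squared = {name: r ** 2 for name, r in abs_corr.items()}

# Report the best single feature and its R^2.
best = max(r_squared, key=r_squared.get)
print(best, round(r_squared[best], 4))  # tmin 0.0046
```

Even the best single weather feature therefore explains less than half a percent of the variance in `totalmosquitos`.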

In [27]:
comb.drop(columns='totalmosquitos', inplace=True)