# Project 4 - Predicting West Nile Virus (Kaggle Challenge)

## Background

Every year from late-May to early-October, public health workers in Chicago setup mosquito traps scattered across the city. Every week from Monday through Wednesday, these traps collect mosquitos, and the mosquitos are tested for the presence of West Nile virus before the end of the week. The test results include the number of mosquitos, the mosquitos species, and whether or not West Nile virus is present in the cohort. 

Main dataset

These test results are organized in such a way that when the number of mosquitos exceed 50, they are split into another record (another row in the dataset), such that the number of mosquitos are capped at 50. 

The location of the traps are described by the block number and street name. For your convenience, we have mapped these attributes into Longitude and Latitude in the dataset. Please note that these are derived locations. For example, Block=79, and Street= "W FOSTER AVE" gives us an approximate address of "7900 W FOSTER AVE, Chicago, IL", which translates to (41.974089,-87.824812) on the map.

Some traps are "satellite traps". These are traps that are set up near (usually within 6 blocks) an established trap to enhance surveillance efforts. Satellite traps are postfixed with letters. For example, T220A is a satellite trap to T220. 

Please note that not all the locations are tested at all times. Also, records exist only when a particular species of mosquitos is found at a certain trap at a certain time. In the test set, we ask you for all combinations/permutations of possible predictions and are only scoring the observed ones.

Spray Data

The City of Chicago also does spraying to kill mosquitos. You are given the GIS data for their spray efforts in 2011 and 2013. Spraying can reduce the number of mosquitos in the area, and therefore might eliminate the appearance of West Nile virus. 

## Problem statement

Analyzing weather data and GIS data and predicting whether or not West Nile virus is present, for a given time, location, and species.

## Executive Summary

## Data Dictionary of downloaded files

Spray data: GIS data of spraying efforts in 2011 and 2013

|Feature|Python data Type|Description|
|---|---|---|
|**Date**|*String*|Date of spray|
|**Time**|*String*|Time of spray|
|**Latitude**|*float*|Latitude of spray location|
|**Longitude**|*float*|Longitude of spray location|



Weather data: Weather data from 2007 to 2014

|Feature|Python data Type|Description|
|---|---|---|
|**Station**|*Integer*|Station ID|
|**Date**|*String*|Date of the weather data|
|**Tmax**|*Integer*|Max temperature in Fahrenheit|
|**Tmin**|*Integer*|Min temperature in Fahrenheit|
|**Tavg**|*Integer*|Average temperature in Fahrenheit|
|**Depart**|*Integer*|Temperature departure from normal|
|**DewPoint**|*Integer*|Average Dew Point|
|**WetBulb**|*Float*|Average Wet Bulb|
|**Heat**|*Integer*|Heating days(season begins with July)|
|**Cool**|*Integer*|Cooling days(season begins with Jan)|
|**Sunrise**|*String*|Calculated Sunset in 24H format|
|**Sunset**|*String*|Calculated Sunrise in 24H format|
|**CodeSum**|*String*|Weather Type|
|**Depth**|*Integer*|Snow Depth in inches|
|**Water1**|*Integer*|Amount of water equivalent from melted Snow|
|**SnowFall**|*Float*|SnowFall in precipitation|
|**PrecipTotal**|*Float*|Water precipitation|
|**StnPressure**|*Float*|Average Station Pressure|
|**SeaLevel**|*Float*|Average Sea Level Pressure|
|**ResultSpeed**|*Float*|Resultant Wind Speed|
|**ResultDir**|*Integer*|Resultant Wind Direction in Degrees|
|**AvgSpeed**|*Float*|Average Wind Speed|
              
Training data: The training set consists of data from 2007, 2009, 2011, and 2013.

|Feature|Python data Type|Description|
|---|---|---|
|**Date**|*String*|Date that the WNV test is performed|
|**Address**|*String*|Approximate address of the location of trap. |
|**Species**|*String*|Species of mosquitos|
|**Block**|*String*|Block number of address for the location of the trap|
|**Street**|*String*|Street name |
|**Trap**|*String*|Trap ID|
|**AddressNumberAndStreet**|*String*|Approximate address returned from GeoCoder|
|**Latitude**|*Float*|Latitude returned from Geocoder|
|**Longitude**|*Float*|Longitude returned from Geocoder|
|**AddressAccuracy**|*Integer*|accuracy returned from GeoCoder|
|**NumMosquitos**|*Integer*|number of mosquitoes caught in this trap|
|**WnvPresent**|*Integer*|Whether West Nile Virus was present in these mosquitos. 1 means WNV is present, and 0 means not present. 
|

Testing data: Dataset to predict the test results for 2008, 2010, 2012, and 2014.

|Feature|Python data Type|Description|
|---|---|---|
|**Id**|*Integer*|ID of the record|
|**Date**|*String*|Date that the WNV test is performed|
|**Address**|*String*|Approximate address of the location of trap. |
|**Species**|*String*|Species of mosquitos|
|**Block**|*String*|Block number of address for the location of the trap|
|**Street**|*String*|Street name |
|**Trap**|*String*|Trap ID|
|**AddressNumberAndStreet**|*String*|Approximate address returned from GeoCoder|
|**Latitude**|*Float*|Latitude returned from Geocoder|
|**Longitude**|*Float*|Longitude returned from Geocoder|
|**AddressAccuracy**|*Integer*|accuracy returned from GeoCoder|

# Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import csv files

In [2]:
spray_df = pd.read_csv("../../data/spray.csv")
weather_df = pd.read_csv("../../data/weather.csv")
train_df = pd.read_csv("../../data/train.csv")
test_df = pd.read_csv("../../data/test.csv")

# Data cleaning

## Spray data

In [3]:
spray_df.head()

Unnamed: 0,Date,Time,Latitude,Longitude
0,2011-08-29,6:56:58 PM,42.391623,-88.089163
1,2011-08-29,6:57:08 PM,42.391348,-88.089163
2,2011-08-29,6:57:18 PM,42.391022,-88.089157
3,2011-08-29,6:57:28 PM,42.390637,-88.089158
4,2011-08-29,6:57:38 PM,42.39041,-88.088858


In [4]:
spray_df.shape

(14835, 4)

In [5]:
spray_df.dtypes

Date          object
Time          object
Latitude     float64
Longitude    float64
dtype: object

In [6]:
#Check which column has null
spray_df.isnull().sum()

Date           0
Time         584
Latitude       0
Longitude      0
dtype: int64

In [7]:
#Drop duplicates
spray_df.drop_duplicates(inplace = True)

In [8]:
spray_df[spray_df['Time'].isnull()]["Date"].value_counts()

2011-09-07    584
Name: Date, dtype: int64

In [9]:
# Drop time since only one date is affected
spray_df.drop(["Time"], axis = 1, inplace = True)

In [10]:
# Convert date to datetime object
spray_df["Date"] = pd.to_datetime(spray_df["Date"], format = "%Y-%m-%d")
spray_df.dtypes

Date         datetime64[ns]
Latitude            float64
Longitude           float64
dtype: object

In [11]:
# Convert column names to lowercase 
spray_df.columns = [col.lower() for col in spray_df.columns]
spray_df.head()

Unnamed: 0,date,latitude,longitude
0,2011-08-29,42.391623,-88.089163
1,2011-08-29,42.391348,-88.089163
2,2011-08-29,42.391022,-88.089157
3,2011-08-29,42.390637,-88.089158
4,2011-08-29,42.39041,-88.088858


In [46]:
spray_df.to_csv("../../data/spray_clean.csv")

## Read weather data

In [12]:
weather_df.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,...,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,83,50,67,14,51,56,0,2,...,,0,M,0.0,0.0,29.1,29.82,1.7,27,9.2
1,2,2007-05-01,84,52,68,M,51,57,0,3,...,,M,M,M,0.0,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,59,42,51,-3,42,47,14,0,...,BR,0,M,0.0,0.0,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,60,43,52,M,42,47,13,0,...,BR HZ,M,M,M,0.0,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,66,46,56,2,40,48,9,0,...,,0,M,0.0,0.0,29.39,30.12,11.7,7,11.9


In [13]:
# Check data types
weather_df.dtypes

Station          int64
Date            object
Tmax             int64
Tmin             int64
Tavg            object
Depart          object
DewPoint         int64
WetBulb         object
Heat            object
Cool            object
Sunrise         object
Sunset          object
CodeSum         object
Depth           object
Water1          object
SnowFall        object
PrecipTotal     object
StnPressure     object
SeaLevel        object
ResultSpeed    float64
ResultDir        int64
AvgSpeed        object
dtype: object

In [14]:
# Check invalid values for Tavg
weather_df["Tavg"].value_counts()

73    138
77    117
70    117
75    110
71    109
74    107
72    104
69    103
78    102
76    100
68     99
79     98
66     93
67     89
61     88
64     86
80     84
65     84
63     81
57     67
62     66
60     61
50     57
81     55
58     49
53     49
82     48
54     48
55     48
56     46
52     46
59     45
51     36
83     34
49     29
45     28
46     24
47     24
84     21
44     19
48     17
86     16
85     16
42     15
43     12
M      11
87      9
41      7
40      5
91      4
89      4
39      4
88      4
36      2
38      2
37      2
90      2
94      1
92      1
93      1
Name: Tavg, dtype: int64

In [15]:
weather_df["Tavg"] = weather_df.apply(lambda row: row["Tmax"] + row["Tmin"] if row["Tavg"] == "M" else row["Tavg"], axis=1)
weather_df["Tavg"].astype("int64")

0       67
1       68
2       51
3       52
4       56
        ..
2939    45
2940    42
2941    45
2942    40
2943    42
Name: Tavg, Length: 2944, dtype: int64

In [16]:
# Check invalid values for Tavg
weather_df["Tavg"].value_counts()

73     138
70     117
77     117
75     110
71     109
      ... 
94       1
132      1
108      1
113      1
124      1
Name: Tavg, Length: 70, dtype: int64

In [17]:
weather_df.shape

(2944, 22)

In [18]:
# Check pythonic null values
weather_df.isnull().sum()

Station        0
Date           0
Tmax           0
Tmin           0
Tavg           0
Depart         0
DewPoint       0
WetBulb        0
Heat           0
Cool           0
Sunrise        0
Sunset         0
CodeSum        0
Depth          0
Water1         0
SnowFall       0
PrecipTotal    0
StnPressure    0
SeaLevel       0
ResultSpeed    0
ResultDir      0
AvgSpeed       0
dtype: int64

In [19]:
# Convert date to datetime object
weather_df["Date"] = pd.to_datetime(weather_df["Date"], format = "%Y-%m-%d")
weather_df.dtypes

Station                 int64
Date           datetime64[ns]
Tmax                    int64
Tmin                    int64
Tavg                   object
Depart                 object
DewPoint                int64
WetBulb                object
Heat                   object
Cool                   object
Sunrise                object
Sunset                 object
CodeSum                object
Depth                  object
Water1                 object
SnowFall               object
PrecipTotal            object
StnPressure            object
SeaLevel               object
ResultSpeed           float64
ResultDir               int64
AvgSpeed               object
dtype: object

In [20]:
# Check columns with values M representing missing data as per documentation: Depth, Water1, SnowFall PrecipTotal 
for col in ["Depth", "Water1", "SnowFall", "PrecipTotal"]:
    print(f"Printing value counts for {col}")
    print(weather_df[col].value_counts())
    print()

Printing value counts for Depth
M    1472
0    1472
Name: Depth, dtype: int64

Printing value counts for Water1
M    2944
Name: Water1, dtype: int64

Printing value counts for SnowFall
M      1472
0.0    1459
  T      12
0.1       1
Name: SnowFall, dtype: int64

Printing value counts for PrecipTotal
0.00    1577
  T      318
0.01     127
0.02      63
0.03      46
        ... 
1.96       1
1.18       1
0.91       1
2.09       1
1.58       1
Name: PrecipTotal, Length: 168, dtype: int64



WE have missing data for Water1 data, so it would not be useful for modelling. Let's have a further look on other columns with small unique values.

It seems like each weather station 2 has no data for depth and 0 value for snow depth across the dataset, it may not be useful for modelling. On snowfall data weather station 2 has no data pertaining to snowfall and only little amount of snowfall was detected by weather station 1 (0, 0.1 and traces of snow), hence this column could be dropped as well.

In [21]:
print(weather_df[weather_df["Depth"] == "0"]["Station"].value_counts())
print(weather_df[weather_df["Depth"] == "M"]["Station"].value_counts())

1    1472
Name: Station, dtype: int64
2    1472
Name: Station, dtype: int64


In [22]:
print(weather_df[weather_df["SnowFall"] == "M"]["Station"].value_counts())
print(weather_df[weather_df["SnowFall"] != "M"]["Station"].value_counts())

2    1472
Name: Station, dtype: int64
1    1472
Name: Station, dtype: int64


In [23]:
# Drop the above columns
weather_df.drop(["SnowFall","Depth","Water1"], inplace = True, axis = 1)

In [24]:
#Check depart value counts and replace the missing values("M") with values from another weather station for the same date
weather_df["Depart"].value_counts()

M      1472
 2       93
-1       84
-2       80
 5       77
 1       76
 7       76
 3       75
 0       74
-3       72
 4       71
 6       67
 8       59
-5       57
-4       56
-6       50
 9       47
10       46
-8       43
-7       30
11       28
12       28
-9       25
13       23
14       22
-10      22
15       15
16       12
-11      10
-12       8
17        7
18        6
-14       6
-13       5
20        4
19        4
-16       3
22        3
-15       3
21        2
-17       2
23        1
Name: Depart, dtype: int64

In [25]:
# Replace Missing values with nan and apply backfill method to use values from previous row (station1) for station 2 rows
weather_df["Depart"].replace("M", np.nan, inplace = True)
weather_df["Depart"].fillna(method='bfill',inplace = True)

In [26]:
print(weather_df[weather_df["Depart"] != "M"]["Station"].value_counts())

1    1472
2    1472
Name: Station, dtype: int64


In [27]:
#Format codesum to a list of weather types instead
weather_df["CodeSum"].value_counts()

                        1609
RA                       296
RA BR                    238
BR                       110
TSRA RA BR                92
                        ... 
RA FG+ MIFG BR             1
BR VCTS                    1
TSRA FG+ BR HZ             1
TSRA RA FG+ FG BR HZ       1
TS HZ                      1
Name: CodeSum, Length: 98, dtype: int64

In [28]:
# Define a set of code 
code_set = ("+FC", 
"FC",
"TS",
"GR",
"RA",
"DZ",
"SN",
"SG",
"GS",
"PL",
"IC",
"FG+",
"FG",
"BR",
"UP",
"HZ",
"FU",
"VA",
"DU",
"DS",
"PO",
"SA",
"SS",
"PY",
"SQ",
"DR",
"SH",
"FZ",
"MI",
"PR",
"BC",
"BL",
"VC",
)

# Defne a function that splits a concatenated code into its component

def split_concat_code(string):
    if string:
        reformated_string = ""
        list_of_code = string.strip().split()
        for code in list_of_code:
            if ((code not in code_set) and (len(code) %2 == 0)):
                split_code = [code[i:i+2] for i in range(0, len(code), 2)]
                new_code_string = " ".join(split_code)
                reformated_string = reformated_string + " " + new_code_string
            else:
                reformated_string = reformated_string + " " + code
        
        # Return a string with distinct code
        return " ".join(set(reformated_string.split()))
    else:
        return string

In [29]:
weather_df["CodeSum_formatted"] = weather_df["CodeSum"].apply(lambda x: split_concat_code(x))
weather_df.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,CodeSum_formatted
0,1,2007-05-01,83,50,67,14,51,56,0,2,0448,1849,,0.0,29.1,29.82,1.7,27,9.2,
1,2,2007-05-01,84,52,68,-3,51,57,0,3,-,-,,0.0,29.18,29.82,2.7,25,9.6,
2,1,2007-05-02,59,42,51,-3,42,47,14,0,0447,1850,BR,0.0,29.38,30.09,13.0,4,13.4,BR
3,2,2007-05-02,60,43,52,2,42,47,13,0,-,-,BR HZ,0.0,29.44,30.08,13.3,2,13.4,HZ BR
4,1,2007-05-03,66,46,56,2,40,48,9,0,0446,1851,,0.0,29.39,30.12,11.7,7,11.9,


In [30]:
s1 = weather_df[["Date","CodeSum_formatted"]].merge(weather_df[["Date"]], how='left', on=['Date'])
s1.drop_duplicates("Date", inplace =  True)
s1.head(10)

Unnamed: 0,Date,CodeSum_formatted
0,2007-05-01,
4,2007-05-02,BR
8,2007-05-03,
12,2007-05-04,RA
16,2007-05-05,
20,2007-05-06,
24,2007-05-07,RA
28,2007-05-08,BR
32,2007-05-09,HZ BR
36,2007-05-10,BR


In [31]:
#Format codesum to a list of weather types instead
weather_df["CodeSum_formatted"].value_counts()

                   1609
RA                  296
RA BR               238
RA TS BR            150
BR                  110
                   ... 
RA VC TS FG           1
FG+                   1
RA VC BR FG           1
BR FG+ RA TS HZ       1
BR BC MI FG           1
Name: CodeSum_formatted, Length: 73, dtype: int64

In [32]:
# Check data types
weather_df.dtypes

Station                       int64
Date                 datetime64[ns]
Tmax                          int64
Tmin                          int64
Tavg                         object
Depart                       object
DewPoint                      int64
WetBulb                      object
Heat                         object
Cool                         object
Sunrise                      object
Sunset                       object
CodeSum                      object
PrecipTotal                  object
StnPressure                  object
SeaLevel                     object
ResultSpeed                 float64
ResultDir                     int64
AvgSpeed                     object
CodeSum_formatted            object
dtype: object

In [33]:
weather_df.groupby("Date").mean()

Unnamed: 0_level_0,Station,Tmax,Tmin,DewPoint,ResultSpeed,ResultDir
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2007-05-01,1.5,83.5,51.0,51.0,2.20,26.0
2007-05-02,1.5,59.5,42.5,42.0,13.15,3.0
2007-05-03,1.5,66.5,47.0,40.0,12.30,6.5
2007-05-04,1.5,72.0,50.0,41.5,10.25,7.5
2007-05-05,1.5,66.0,53.5,38.5,11.45,7.0
...,...,...,...,...,...,...
2014-10-27,1.5,78.0,52.5,51.5,12.35,19.0
2014-10-28,1.5,67.0,46.5,39.0,14.40,26.0
2014-10-29,1.5,49.0,38.0,33.0,9.00,29.0
2014-10-30,1.5,52.0,34.5,34.5,5.50,23.5


In [34]:
# Convert column names to lowercase 
# weather_df.columns = [col.title() for col in weather_df.columns]
# weather_df.head()

## Read training data
1 day 1 trap how many types of mosquitos

In [35]:
train_df = pd.read_csv("../../data/train.csv")

In [36]:
train_df.head()

Unnamed: 0,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent
0,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
1,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
2,2007-05-29,"6200 North Mandell Avenue, Chicago, IL 60646, USA",CULEX RESTUANS,62,N MANDELL AVE,T007,"6200 N MANDELL AVE, Chicago, IL",41.994991,-87.769279,9,1,0
3,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX PIPIENS/RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,1,0
4,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,4,0


In [37]:
# Check size of train data and names of columns
print(train_df.shape)
train_df.columns

(10506, 12)


Index(['Date', 'Address', 'Species', 'Block', 'Street', 'Trap',
       'AddressNumberAndStreet', 'Latitude', 'Longitude', 'AddressAccuracy',
       'NumMosquitos', 'WnvPresent'],
      dtype='object')

In [38]:
train_df.dtypes

Date                       object
Address                    object
Species                    object
Block                       int64
Street                     object
Trap                       object
AddressNumberAndStreet     object
Latitude                  float64
Longitude                 float64
AddressAccuracy             int64
NumMosquitos                int64
WnvPresent                  int64
dtype: object

In [39]:
# Check for nulls
train_df.isnull().sum()

Date                      0
Address                   0
Species                   0
Block                     0
Street                    0
Trap                      0
AddressNumberAndStreet    0
Latitude                  0
Longitude                 0
AddressAccuracy           0
NumMosquitos              0
WnvPresent                0
dtype: int64

In [40]:
# Convert date to datetime object
train_df["Date"] = pd.to_datetime(train_df["Date"], format = "%Y-%m-%d")
train_df.dtypes

Date                      datetime64[ns]
Address                           object
Species                           object
Block                              int64
Street                            object
Trap                              object
AddressNumberAndStreet            object
Latitude                         float64
Longitude                        float64
AddressAccuracy                    int64
NumMosquitos                       int64
WnvPresent                         int64
dtype: object

In [41]:
# Convert column names to lowercase 
train_df.columns = [col.lower() for col in train_df.columns]
train_df.head()

Unnamed: 0,date,address,species,block,street,trap,addressnumberandstreet,latitude,longitude,addressaccuracy,nummosquitos,wnvpresent
0,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
1,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
2,2007-05-29,"6200 North Mandell Avenue, Chicago, IL 60646, USA",CULEX RESTUANS,62,N MANDELL AVE,T007,"6200 N MANDELL AVE, Chicago, IL",41.994991,-87.769279,9,1,0
3,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX PIPIENS/RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,1,0
4,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,4,0


In [None]:
train_df.to_csv("../../data/train_clean.csv")

# EDA and Feature Engineering

## Read Testing data

In [None]:
test_df.head()

In [None]:
test_df.shape

In [None]:
test_df.columns

In [None]:
test_df.isnull().sum()

In [None]:
# Check if there are columns unique to either train or test data
print(set(train_df.columns) - set(test_df.columns))

print(set(test_df.columns) - set(train_df.columns))

## Read in processed csv files 

In [None]:
# Check wnv class representations
fig, ax  = plt.subplots(figsize = (15,10))
train_df["WnvPresent"].value_counts(normalize = True).plot(kind = "bar", ax = ax, color = ["blue", "orange"])
for i in ax.patches:
    height = i.get_height()
    ax.text(i.get_x()+ 0.4*i.get_width(), height*1.005,'{:.2f}{}'.format(height*100,'%'))

In [None]:
# Hypothesis if more mosquitoes caught means higher chances of having the presence of virus.
# Since max number of mosquitos per entry is 50, special processing is needed to 
# sum up the mosquitoes count for the sample place. Need to group by date and lat long

total_mosquitoes_df = train_df.groupby(["date","trap"])[["nummosquitos"]].sum()

#Rename column
total_mosquitoes_df.columns = ["totalmosquitos"]


In [None]:
# Join dataframe back to train data
new_df = pd.merge(train_df, total_mosquitoes_df,  how='inner', left_on=['Date','Trap'], right_on = ['Date','Trap'])

In [None]:
# Consider group by traps and see the proportion of traps has west nilevirus occurring by using mean value and sort
train_df.groupby("Trap")["WnvPresent"].mean().sort_values(ascending = False)

## Read mapdata

In [None]:
mapdata = np.loadtxt("../../data/mapdata_copyright_openstreetmap_contributors.txt")

lats = train_df["Latitude"]
longs = train_df["Longitude"]

origin = [41.6, -88.3]              # lat/long of origin (lower left corner)
upperRight = [42.5, -87.5]          # lat/long of upper right corner
#rescales the image data to the GPS co-ordinates of the bounding box for Chicago defined by extent arguments
fig, ax = plt.subplots(1, 2, figsize = (15,12))
ax[0].imshow(mapdata, cmap=plt.get_cmap('gray'), extent=[origin[1], upperRight[1], origin[0], upperRight[0]])
ax[1].imshow(mapdata, cmap=plt.get_cmap('gray'), extent=[origin[1], upperRight[1], origin[0], upperRight[0]])
sns.scatterplot(x = "Longitude", y = "Latitude", data= spray_df, hue = 'Date', ax = ax[0]);
sns.scatterplot(x = "Longitude", y = "Latitude", data= train_df, hue = 'WnvPresent', ax = ax[1]);

ax[0].set_title("Plot on the location of sprays")
ax[1].set_title("Plot on the existence of West Nile Virus in Chicago")
#plt.scatter(x=intersection[1], y=intersection[0], c='b', s=60, marker='s')

#plt.savefig('map.png')