### Heuristic Models (Cost Function Extension)
Look at the Seattle weather data in the **data** folder. Come up with a heuristic model to predict whether it will rain on a given day. Keep in mind that this is a time series, so you can only use what happened before the date you are predicting. One example of a heuristic model is: it will rain tomorrow if it rained more than 1 inch (PRCP > 1.0) today. Describe your heuristic model in the next cell.

**your model here**  

Examples:  

If it rained yesterday, it will rain today.  
If it rained yesterday or the day before, it will rain today.
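
As a rough sketch, a rule of this kind can be written with `Series.shift`, which only looks backward in time. The `toy` frame and its values are made up for illustration; only the `PRCP` column name comes from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the real file has one row per day with a PRCP column.
toy = pd.DataFrame({"PRCP": [0.0, 0.3, 0.0, 0.0, 1.2]})

# shift(1) moves each value down one row, so row t holds yesterday's PRCP.
rained_yesterday = toy["PRCP"].shift(1) > 0.0

# Rule: "If it rained yesterday, it will rain today."
# The first row has no yesterday (NaN > 0.0 is False), so its guess is False.
toy["guess"] = rained_yesterday
print(toy)
```

Because the guess for each row is built only from earlier rows, the rule never peeks at the day it is trying to predict.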

In [1]:
#here is an example of how to build and populate a heuristic model

import pandas as pd

df = pd.read_csv('../data/seattle_weather_1948-2017.csv')

numrows = 25549 # can be as large as 25549

#create an empty dataframe to hold numrows values
heuristic_df = pd.DataFrame({'yesterday':[0.0]*numrows,
                             'today':[0.0]*numrows,
                             'tomorrow':[0.0]*numrows,
                             'guess':[False]*numrows, #logical guess
                             'rain_tomorrow':[False]*numrows, #historical observation
                             'correct':[False]*numrows, #TRUE if your guess matches the historical observation
                             'true_positive':[False]*numrows, #TRUE if you said it would rain and it did
                             'false_positive':[False]*numrows, #TRUE if you said it would rain and it didn't
                             'true_negative':[False]*numrows, #TRUE if you said it wouldn't rain and it didn't
                             'false_negative':[False]*numrows}) #TRUE if you said it wouldn't rain and it did

#sort columns for convenience
seq = ['yesterday',
       'today',
       'tomorrow',
       'guess',
       'rain_tomorrow',
       'correct',
       'true_positive',
       'false_positive',
       'true_negative',
       'false_negative']
heuristic_df = heuristic_df.reindex(columns=seq)

In [2]:
df.head()

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN
0,1948-01-01,0.47,51,42,True
1,1948-01-02,0.59,45,36,True
2,1948-01-03,0.42,45,35,True
3,1948-01-04,0.31,45,34,True
4,1948-01-05,0.17,45,32,True


In [3]:
heuristic_df.head()

Unnamed: 0,yesterday,today,tomorrow,guess,rain_tomorrow,correct,true_positive,false_positive,true_negative,false_negative
0,0.0,0.0,0.0,False,False,False,False,False,False,False
1,0.0,0.0,0.0,False,False,False,False,False,False,False
2,0.0,0.0,0.0,False,False,False,False,False,False,False
3,0.0,0.0,0.0,False,False,False,False,False,False,False
4,0.0,0.0,0.0,False,False,False,False,False,False,False


Build a loop to add your heuristic model's guesses as a column to this dataframe.

In [4]:
# here is an example loop that populates the dataframe created earlier
# with the precipitation from yesterday, today, and tomorrow
# then the guess is set to True if it rained both yesterday and today
for z in range(numrows):
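    # start at row 2 so that yesterday (i-2) and today (i-1) both exist
    i = z + 2
    # pull precipitation values (PRCP, column 1) from the dataframe
    yesterday = df.iloc[(i-2),1]
    today = df.iloc[(i-1),1]
    tomorrow = df.iloc[i,1]
    # observed rain flag (RAIN, column 4); bool() maps a missing value to True
    rain_tomorrow = bool(df.iloc[i,4])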
    
    heuristic_df.iat[z,0] = yesterday
    heuristic_df.iat[z,1] = today
    heuristic_df.iat[z,2] = tomorrow
    heuristic_df.iat[z,3] = False # set guess default to False
    heuristic_df.iat[z,4] = rain_tomorrow
    
    #example heuristic
    if today > 0.0 and yesterday > 0.0:
        heuristic_df.iat[z,3] = True
        
    if heuristic_df.iat[z,3] == heuristic_df.iat[z,4]:
        heuristic_df.iat[z,5] = True
        if heuristic_df.iat[z,3] == True:
            heuristic_df.iat[z,6] = True #true positive
        else:
            heuristic_df.iat[z,8] = True #true negative
    else:
        heuristic_df.iat[z,5] = False
        if heuristic_df.iat[z,3] == True:
            heuristic_df.iat[z,7] = True #false positive
        else:
            heuristic_df.iat[z,9] = True #false negative
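
The same bookkeeping can be sketched without an explicit loop. The helper name `build_heuristic_frame` and the toy values are hypothetical, and the sketch assumes that "rain" means `PRCP > 0.0`, so a missing value counts as no rain here.

```python
import pandas as pd

def build_heuristic_frame(prcp: pd.Series) -> pd.DataFrame:
    """Guess rain tomorrow if it rained both today and yesterday (vectorized sketch)."""
    out = pd.DataFrame({
        "yesterday": prcp.shift(1),   # value from the previous row
        "today": prcp,
        "tomorrow": prcp.shift(-1),   # value from the next row
    })
    # Keep only days that have both a yesterday and a tomorrow.
    out = out.iloc[1:-1].reset_index(drop=True)
    out["guess"] = (out["today"] > 0.0) & (out["yesterday"] > 0.0)
    out["rain_tomorrow"] = out["tomorrow"] > 0.0
    out["correct"] = out["guess"] == out["rain_tomorrow"]
    out["true_positive"] = out["guess"] & out["rain_tomorrow"]
    out["false_positive"] = out["guess"] & ~out["rain_tomorrow"]
    out["true_negative"] = ~out["guess"] & ~out["rain_tomorrow"]
    out["false_negative"] = ~out["guess"] & out["rain_tomorrow"]
    return out

# Hypothetical precipitation values for seven consecutive days.
toy_prcp = pd.Series([0.2, 0.1, 0.0, 0.0, 0.5, 0.4, 0.3])
result = build_heuristic_frame(toy_prcp)
print(result[["guess", "rain_tomorrow", "correct"]])
```

On the real data this would be called with the `PRCP` column, and it yields two fewer rows than the input because the first and last days lack a neighbor.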

### Evaluate the performance of the Heuristic model

***split data into training and testing***

In [5]:
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
# train_test_split(X, y, train_size=0.*, test_size=0.*, random_state=*)
# split the raw data into training (80%) and testing (20%) subsets
X_train, X_test, y_train, y_test = train_test_split(df, df['RAIN'], test_size=0.2)
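
By default `train_test_split` shuffles rows, which mixes past and future days. For a time series, one possible alternative is a chronological split with `shuffle=False`. The sketch below uses a made-up ten-day frame, not the real file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-day sample in date order.
toy = pd.DataFrame({
    "DATE": pd.date_range("1948-01-01", periods=10, freq="D"),
    "RAIN": [True, True, False, True, False, False, True, True, False, True],
})

# shuffle=False keeps row order, so the last 20% of days form the test set.
train, test = train_test_split(toy, test_size=0.2, shuffle=False)
print(train["DATE"].max(), "<", test["DATE"].min())
```

With this split every test day comes after every training day, which matches how a forecast would actually be used.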

In [6]:
X_train.head()

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN
15370,1990-01-30,0.26,42,33,True
22738,2010-04-03,0.17,47,37,True
10442,1976-08-03,0.0,76,57,False
21882,2007-11-29,0.0,43,37,False
5486,1963-01-08,0.05,46,32,True


In [7]:
X_train.head()
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20440 entries, 15370 to 11241
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   DATE    20440 non-null  object 
 1   PRCP    20438 non-null  float64
 2   TMAX    20440 non-null  int64  
 3   TMIN    20440 non-null  int64  
 4   RAIN    20438 non-null  object 
dtypes: float64(1), int64(2), object(2)
memory usage: 958.1+ KB


In [8]:
X_test.head()
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5111 entries, 19769 to 3045
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   DATE    5111 non-null   object 
 1   PRCP    5110 non-null   float64
 2   TMAX    5111 non-null   int64  
 3   TMIN    5111 non-null   int64  
 4   RAIN    5110 non-null   object 
dtypes: float64(1), int64(2), object(2)
memory usage: 239.6+ KB


***the accuracy of your predictions***

In [9]:
# we used this simple approach in the first part to see what percent of the time we were correct
# calculated as (true positive + true negative) / number of guesses
TP_TN=heuristic_df['correct'].value_counts()
acc=TP_TN/numrows
acc
print("Accuracy:\n\n",acc)

Accuracy:

 True     0.671611
False    0.328389
Name: correct, dtype: float64
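
An accuracy figure is easier to judge next to a trivial baseline. The sketch below, on made-up observations, scores a model that always predicts the most common class; the names `observed` and `baseline_guess` are hypothetical.

```python
import pandas as pd

# Hypothetical observations: it rains on 3 of 8 days.
observed = pd.Series([True, False, False, True, False, False, True, False])

# Always predict the most common class.
majority = observed.mode().iloc[0]
baseline_guess = pd.Series([majority] * len(observed))

# Share of days on which the constant guess matches the observation.
baseline_accuracy = (baseline_guess == observed).mean()
print(majority, baseline_accuracy)
```

A heuristic is only useful if its accuracy beats this kind of constant guess on the same data.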


***the precision of your predictions***

In [10]:
# precision is the percent of your positive predictions that are correct
# more specifically, it is calculated as (num true positive)/(num true positive + num false positive)
NTP=heuristic_df['true_positive'].sum()
NFP=heuristic_df['false_positive'].sum()
prec = NTP/(NTP+NFP)
print("Precision:", prec)

Precision: 0.674109000138677


***the recall of your predictions***

In [11]:
# recall is the percent of the actual rainy days that you correctly predicted as positive
# more specifically, it is calculated as (num true positive)/(num true positive + num false negative)
NFN=heuristic_df['false_negative'].sum()
rec=NTP/(NTP+NFN)
print("Recall:", rec)

Recall: 0.44592239244106047
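
The hand-computed metrics can be cross-checked against `sklearn.metrics`. This is a minimal sketch on made-up labels; `observed` and `guess` stand in for the `rain_tomorrow` and `guess` columns.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical observations and guesses for eight days.
observed = pd.Series([True, True, True, False, False, True, False, False])
guess = pd.Series([True, False, True, True, False, False, False, False])

# Confusion counts, computed the same way as the boolean columns above.
tp = int((guess & observed).sum())
fp = int((guess & ~observed).sum())
fn = int((~guess & observed).sum())
tn = int((~guess & ~observed).sum())

manual_accuracy = (tp + tn) / len(observed)
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)

print(manual_accuracy, accuracy_score(observed, guess))
print(manual_precision, precision_score(observed, guess))
print(manual_recall, recall_score(observed, guess))
```

If the two columns of printed numbers disagree on real data, the confusion counts are the first place to look.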


***The sum of squared error (SSE) of your predictions***

In [12]:
import numpy as np

def sse(y_true, y_pred):
    '''returns sum of squared errors (actual vs model)'''
    squared_errors = (y_true - y_pred) ** 2
    return np.sum(squared_errors)


sse(y_true=heuristic_df['rain_tomorrow'].astype('int'), y_pred=heuristic_df['guess'].astype('int'))

8390

In [13]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_true=heuristic_df['rain_tomorrow'].astype('int'), y_pred=heuristic_df['guess'].astype('int'))

0.3283885866374418
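
For 0/1 labels each squared error is either 0 or 1, so SSE counts the wrong guesses and MSE is the error rate, that is, 1 minus accuracy. This agrees with the outputs above: 8390 / 25549 ≈ 0.3284, the `False` share of `correct`. A small sketch on made-up labels:

```python
import numpy as np

# Hypothetical 0/1 observations and guesses.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0])

sse_value = np.sum((y_true - y_pred) ** 2)  # number of wrong guesses
mse_value = sse_value / len(y_true)         # share of wrong guesses
accuracy = np.mean(y_true == y_pred)        # share of correct guesses

# SSE and row count reported earlier in this notebook.
reported_mse = 8390 / 25549

print(sse_value, mse_value, 1 - accuracy, reported_mse)
```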