# Gradient Descent 
### Assignment ML3
Use the same project from the previous assignment (the heuristic modeling) and build a function that takes a vector of predictions from your heuristic and a vector of realizations (the correct values) from the data set and calculates:

- Precision (Classification "RAIN" column)
- Recall  (Classification "RAIN" column)
- SSE Cost of your prediction (Regression "PRCP" column)
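
As a minimal sketch of the two classification metrics listed above, assuming binary labels where 1 means rain: precision is the share of predicted rainy days that really were rainy, and recall is the share of real rainy days that were predicted. The function name `precision_recall` and the sample vectors are illustrative only and are not part of the assignment.

```python
def precision_recall(predictions, realizations):
    """Return (precision, recall) for binary 0/1 labels, treating 1 as positive."""
    pairs = list(zip(predictions, realizations))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # predicted rain, it rained
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # predicted rain, it stayed dry
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # predicted dry, it rained
    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up vectors, for illustration only.
preds = [1, 0, 1, 1, 0, 0]
actual = [1, 0, 0, 1, 1, 0]
precision, recall = precision_recall(preds, actual)
print(precision, recall)
```

Here there are two true positives, one false positive and one false negative, so both metrics come out to 2/3.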

SSE is the sum of squared errors: square the difference between each prediction and the actual value, then add up the squared differences. You can find more about how to calculate it [here](https://www.wikihow.com/Calculate-the-Sum-of-Squares-for-Error-(SSE)).
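
The calculation described above can be sketched in a few lines of plain Python. The function name `regression_errors` and the sample numbers are assumptions made for illustration; the sketch also returns the MSE (SSE divided by the number of rows) and the MAE, since the Required list accepts any of the three.

```python
def regression_errors(predictions, realizations):
    """Return (sse, mse, mae) for two equal-length sequences of numbers."""
    errors = [p - r for p, r in zip(predictions, realizations)]
    sse = sum(e ** 2 for e in errors)                # sum of squared errors
    mse = sse / len(errors)                          # mean squared error
    mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
    return sse, mse, mae


# Made-up precipitation values, for illustration only.
preds = [0.0, 0.5, 1.0, 0.0]
actual = [0.0, 0.0, 1.5, 0.5]
sse, mse, mae = regression_errors(preds, actual)
print(sse, mse, mae)
```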

#### Required:
1. Separate your data set into training and testing. (80/20 split)
1. Calculate the Precision and Recall for the classification heuristic you made on Sunday
1. Calculate the MSE, MAE, or SSE for the regression heuristic you made Monday.
1. Save your results and repeat the process 5 times.
1. Once you have repeated steps 1-4 five times and saved the results, calculate the average score from your saved results.
1. Submit your notebook to the Learn Platform when you have finished.
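
The split, score, save and average loop in the steps above can be sketched with the standard library alone. Everything here is a stand-in chosen for illustration: the rows are random made-up `(TMAX, RAIN)` pairs, and `toy_score` is a hypothetical scoring rule, not the heuristic from the previous assignment.

```python
import random


def split_80_20(rows, rng):
    """Shuffle a copy of rows and return (train, test) with an 80/20 split."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def toy_score(test_rows):
    """Hypothetical stand-in score: accuracy of the rule 'rain when TMAX <= 55'."""
    hits = sum(1 for tmax, rain in test_rows if int(tmax <= 55) == rain)
    return hits / len(test_rows)


rng = random.Random(0)
# Made-up (TMAX, RAIN) rows, for illustration only.
rows = [(rng.randint(30, 90), rng.randint(0, 1)) for _ in range(100)]

scores = []
for _ in range(5):
    train, test = split_80_20(rows, rng)  # step 1: a fresh 80/20 split
    scores.append(toy_score(test))        # steps 2-4: score and save the result

average_score = sum(scores) / len(scores)  # step 5: average the saved results
print(scores, average_score)
```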

In [1]:
# imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score, recall_score, precision_score

In [2]:
df = pd.read_csv("./data/seattle_weather_1948-2017.csv").dropna()
df["RAIN"] = df["RAIN"].astype(bool)

In [3]:
# Find nulls in PRCP column
df[pd.isnull(df['PRCP'])]

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN


In [4]:
# Convert the RAIN column from bool to 0/1
df['RAIN'] = df['RAIN'].replace(True , 1)
df['RAIN'] = df['RAIN'].replace(False , 0)
df.head()

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN
0,1948-01-01,0.47,51,42,1
1,1948-01-02,0.59,45,36,1
2,1948-01-03,0.42,45,35,1
3,1948-01-04,0.31,45,34,1
4,1948-01-05,0.17,45,32,1


In [5]:
df['PRCP'].value_counts()

0.00    14648
0.01      933
0.02      707
0.03      493
0.04      428
        ...  
2.58        1
2.49        1
2.18        1
5.02        1
2.61        1
Name: PRCP, Length: 207, dtype: int64

In [6]:
df['TMIN'].value_counts()

42    1042
50    1033
53    1024
40    1012
54     997
      ... 
7        4
2        1
1        1
71       1
0        1
Name: TMIN, Length: 68, dtype: int64

In [7]:
# How many rows have TMIN == 60 and PRCP == 0

condition_1 = df["TMIN"] == 60
condition_2 = df["PRCP"] == 0.00

df[(condition_1 & condition_2)].count()

DATE    169
PRCP    169
TMAX    169
TMIN    169
RAIN    169
dtype: int64

In [8]:
# Count dry (0) and rainy (1) days in the RAIN column
df["RAIN"].value_counts()

0    14648
1    10900
Name: RAIN, dtype: int64

In [9]:
# Splitting data

# Split into training and test sets
train, test = train_test_split(
    df, 
    train_size=0.8, # 80% of data to train
    test_size=0.2, # 20% of data to test
    random_state=42
)

## Rain

In [10]:
# HA model for Rain prediction 

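# Note

# x --> the row being predicted ("tomorrow")
# x-1 --> the previous row ("today")
# x-2 --> the row before that ("yesterday")
# Rows are consecutive days only if df is in date order; for x = 0 and x = 1
# the negative positions wrap around to the last rows of df.

def heuristic(df):

    preds = []

    for x in range(len(df)):
        # Only consider predicting rain if either of the previous two rows had rain
        if df.iloc[x-1]['RAIN'] or df.iloc[x-2]['RAIN']:
            if (df.iloc[x]['TMAX'] <= 55 and df.iloc[x]['TMAX'] >= 50):
                preds.append(1)
            # NOTE: TMIN cannot be both <= 39 and >= 45, so this branch never fires as written
            elif (df.iloc[x]['TMIN'] <= 39 and df.iloc[x]['TMIN'] >= 45):
                preds.append(1)
            else:
                preds.append(0)
        else:
            # No rain in the previous two rows: predict no rain
            preds.append(0)

    return preds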
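
The row-by-row loop above can be slow on roughly 20,000 training rows. The following is a hedged NumPy sketch of the same rule, not the notebook's own code. It assumes two things about the function as written: the `TMIN` branch never fires (no value is both `<= 39` and `>= 45`), so it is left out, and `iloc[x-1]` / `iloc[x-2]` wrap around to the end of the frame for the first two rows, which `np.roll` reproduces. The names `heuristic_vectorized` and `heuristic_loop` and the sample values are illustrative.

```python
import numpy as np


def heuristic_vectorized(rain, tmax):
    """Predict 1 when either of the two previous rows had rain and 50 <= TMAX <= 55."""
    rain = np.asarray(rain).astype(bool)
    tmax = np.asarray(tmax)
    # np.roll shifts with wrap-around, matching the negative positions used by the loop.
    rained_recently = np.roll(rain, 1) | np.roll(rain, 2)
    mild_max = (tmax >= 50) & (tmax <= 55)
    return (rained_recently & mild_max).astype(int)


def heuristic_loop(rain, tmax):
    """Plain-Python reference with the same wrap-around behaviour."""
    preds = []
    for x in range(len(rain)):
        if rain[x - 1] or rain[x - 2]:
            preds.append(1 if 50 <= tmax[x] <= 55 else 0)
        else:
            preds.append(0)
    return preds


# Made-up values, for illustration only.
rain = [1, 0, 0, 1, 0, 0]
tmax = [52, 53, 60, 51, 54, 50]
vectorized_preds = heuristic_vectorized(rain, tmax).tolist()
print(vectorized_preds)
print(vectorized_preds == heuristic_loop(rain, tmax))
```

With a DataFrame, the inputs would be `df["RAIN"].to_numpy()` and `df["TMAX"].to_numpy()`.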

In [11]:
# Apply Heuristic on training set
train['preds'] = heuristic(train)
train.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train['preds'] = heuristic(train)
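
The warning above appears because `train` is a subset taken from `df`, and pandas cannot tell whether the new column is meant for the subset or for the original frame. One common way to make the intent explicit, sketched here on a small made-up frame, is to take a copy of the subset before adding columns to it. This is an illustration of the pattern and an assumption about what would suit this notebook, not a change to the cells that follow.

```python
import pandas as pd

# Made-up frame, for illustration only.
df = pd.DataFrame({"TMAX": [51, 78, 45, 67], "RAIN": [1, 0, 1, 0]})

# .copy() makes the subset an independent DataFrame, so adding a column is unambiguous.
train = df[df["TMAX"] < 70].copy()
train["preds"] = [1, 0, 0]

print(train)
print("preds" in df.columns)  # the original frame is left untouched
```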


Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN,preds
11497,1979-06-24,0.0,78,55,0,0
14118,1986-08-27,0.0,88,65,0,0
1604,1952-05-23,0.0,67,45,0,0
3693,1958-02-10,0.03,52,39,1,0
4742,1960-12-25,0.25,44,36,1,0


In [12]:
# Calculate accuracy, recall and precision (returned in that order)
def sklearn_RAIN (df):
    
    actual = df["RAIN"]
    Prediction = df["preds"]
    
    accuracy = accuracy_score(actual, Prediction)
    recall = recall_score(actual, Prediction)
    precision= precision_score(actual, Prediction)
    
    return accuracy, recall, precision

In [13]:
# Calculate accuracy, recall and precision (in that order) for the training set
sklearn_RAIN(train)

(0.6112144045405618, 0.19650005718860802, 0.6510041682455475)

In [14]:
# Apply Heuristic on test set
test['preds'] = heuristic(test)
test.head()



Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN,preds
20415,2003-11-23,0.38,43,37,1,0
15959,1991-09-11,0.0,69,51,0,0
12282,1981-08-17,0.0,83,55,0,0
5183,1962-03-11,0.0,49,32,0,0
20488,2004-02-04,0.06,46,37,1,0


In [15]:
# Calculate accuracy, recall and precision (in that order) for the test set
sklearn_RAIN(test)

(0.6178082191780822, 0.20074177097821047, 0.6540785498489426)

In [16]:
# Run the split, predict and score steps several times (five by default)

def multiple_trails(data, train_size=0.8, test_size=0.2, iterations = 5):
    
    acc = []
    rec = []
    pre = []
    
    for x in range(iterations):
        # No random_state is set, so every iteration draws a different split
        train, test = train_test_split(data, test_size=test_size, train_size = train_size)
        test["preds"] = heuristic(test)
        # sklearn_RAIN returns (accuracy, recall, precision)
        results = np.round(sklearn_RAIN(test), 2)
        acc.append(results[0])
        rec.append(results[1])
        pre.append(results[2])
    return acc, rec, pre

In [17]:
# Run multiple trials
acc, rec, pre = multiple_trails(df)
print (f'acc = {acc}, rec = {rec}, pre = {pre}')

acc = [0.61, 0.61, 0.61, 0.61, 0.61], rec = [0.21, 0.19, 0.18, 0.19, 0.2], pre = [0.7, 0.64, 0.61, 0.61, 0.66]




In [18]:
# Compute the average of each metric across the trials
acc = round(sum(acc)/len(acc),2)
rec = round(sum(rec)/len(rec),2)
pre = round(sum(pre)/len(pre),2)

print (f'acc = {acc}, rec = {rec}, pre = {pre}')

acc = 0.61, rec = 0.19, pre = 0.64


## PRCP

In [19]:
# Create function to perform our PRCP heuristic
# (this redefines heuristic(), replacing the RAIN version above)

def heuristic(df):
    
    preds = []
    
    for x in range(len(df)):
        # TMIN or TMAX of at least 50: predict no precipitation
        if (df.iloc[x]['TMIN'] >= 50) | (df.iloc[x]['TMAX'] >= 50):
            preds.append(0)
        # Otherwise, TMIN or TMAX of at least 40: predict 0.02
        elif (df.iloc[x]['TMIN'] >= 40) | (df.iloc[x]['TMAX'] >= 40):
            preds.append(0.02)
        else:
            # Both temperatures below 40: predict 0.01
            preds.append(0.01)
                
    return preds

In [20]:
# Apply Heuristic on train set
train['preds'] = heuristic(train)
train.head()



Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN,preds
11497,1979-06-24,0.0,78,55,0,0.0
14118,1986-08-27,0.0,88,65,0,0.0
1604,1952-05-23,0.0,67,45,0,0.0
3693,1958-02-10,0.03,52,39,1,0.0
4742,1960-12-25,0.25,44,36,1,0.02


In [21]:
# Calculate MSE, MAE and RMSE (returned in that order)
def sklearn_PRCP (df):
    
    actual = df["PRCP"]
    Prediction = df["preds"]
    
    mse = mean_squared_error(actual, Prediction)
    mae = mean_absolute_error(actual, Prediction)
    # RMSE is the square root of MSE; np.sqrt avoids relying on the
    # squared=False argument, which newer scikit-learn releases deprecate
    rms = np.sqrt(mse)
    
    return mse, mae, rms

In [22]:
# computing the mse, mae, and rms for training set
sklearn_PRCP(train)

(0.06780925237303062, 0.10555680594970153, 0.2604020974820107)
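
As a quick sanity check on the tuple above, the third value (RMSE) should be the square root of the first (MSE). The short sketch below only re-uses the two numbers printed by the cell; the variable names are illustrative.

```python
import math

mse_train = 0.06780925237303062  # first value reported above
rmse_train = 0.2604020974820107  # third value reported above

# RMSE is defined as the square root of MSE, so the two should agree.
rmse_matches = math.isclose(math.sqrt(mse_train), rmse_train, rel_tol=1e-9)
print(rmse_matches)
```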

In [23]:
# Apply Heuristic
test['preds'] = heuristic(test)

test.head()



Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN,preds
20415,2003-11-23,0.38,43,37,1,0.02
15959,1991-09-11,0.0,69,51,0,0.0
12282,1981-08-17,0.0,83,55,0,0.0
5183,1962-03-11,0.0,49,32,0,0.02
20488,2004-02-04,0.06,46,37,1,0.02


In [24]:
# computing the mse, mae, and rms for test set
sklearn_PRCP(test)

(0.06489714285714285, 0.10565949119373778, 0.25474917636204997)
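
The assignment brief asks for the SSE, while the cell above reports the MSE. Because MSE is SSE divided by the number of rows, the SSE can be recovered by multiplying back. The sketch below assumes the test set holds 5,110 rows: 20% of the 25,548 rows counted in the `RAIN` value counts, with the fractional test size rounded up, which is how `train_test_split` is understood to behave here.

```python
import math

n_rows = 14648 + 10900            # dry + rainy days from the RAIN value counts
n_test = math.ceil(0.2 * n_rows)  # assumed size of the 20% test split
mse_test = 0.06489714285714285    # first value reported above

# SSE = MSE * number of rows
sse_test = mse_test * n_test
print(n_test, round(sse_test, 2))
```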

In [25]:
# Run the split, predict and score steps several times (five by default)

def multiple_trails_PRCP(data, train_size=0.8, test_size=0.2, iterations = 5):
    
    mse = []
    mae = []
    rms = []
    
    for x in range(iterations):
        # No random_state is set, so every iteration draws a different split
        train, test = train_test_split(data, test_size=test_size, train_size = train_size)
        test["preds"] = heuristic(test)
        # sklearn_PRCP returns (mse, mae, rms)
        results = np.round(sklearn_PRCP(test), 2)
        mse.append(results[0])
        mae.append(results[1])
        rms.append(results[2])
    return mse, mae, rms

In [26]:
# Run multiple trials
mse, mae, rms = multiple_trails_PRCP(df)
print (f'mse = {mse}, mae = {mae}, rms = {rms}')


mse = [0.06, 0.06, 0.07, 0.07, 0.07], mae = [0.1, 0.1, 0.1, 0.11, 0.11], rms = [0.25, 0.25, 0.26, 0.26, 0.26]




In [27]:
# Compute the average of each metric across the trials
mse = round(sum(mse)/len(mse),2)
mae = round(sum(mae)/len(mae),2)
rms = round(sum(rms)/len(rms),2)

print (f'mse = {mse}, mae = {mae}, rms = {rms}')

mse = 0.07, mae = 0.1, rms = 0.26
