
## ML - Assignment 3
Required:
1. Separate your data set into training and testing sets (80/20 split).
1. Calculate the precision and recall for the classification heuristic you made on Sunday.
1. Calculate the MSE, MAE, or SSE for the regression heuristic you made on Monday.
1. Save your results and repeat the process 5 times.
1. Once you have repeated steps 1-4 five times and saved the results, calculate the average score from your saved results.
1. Submit your notebook to the Learn Platform when you have finished.

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# plot_confusion_matrix is not imported: it is unused here and
# no longer exists in recent scikit-learn releases
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score, recall_score, precision_score

In [2]:
# Load the data
df = pd.read_csv('./data/seattle_weather_1948-2017.csv').dropna()
df.head()

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN
0,1948-01-01,0.47,51,42,True
1,1948-01-02,0.59,45,36,True
2,1948-01-03,0.42,45,35,True
3,1948-01-04,0.31,45,34,True
4,1948-01-05,0.17,45,32,True


In [3]:
# Find nulls in PRCP column
df[pd.isnull(df['PRCP'])]

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN


In [4]:
# Find nulls in RAIN column
df[pd.isnull(df['RAIN'])]

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN


In [5]:
def RAIN_INSERTION(cols):
    """
    Return False where RAIN is NaN, otherwise the original value.
    """
    RAIN = cols.iloc[0]  # first (and only) value in the row
    if pd.isnull(RAIN):
        return False
    else:
        return RAIN

In [6]:
def PRCP_INSERTION(col):
    """
    Return the mean of PRCP where PRCP is NaN, otherwise the original value.
    """
    PRCP = col.iloc[0]  # first (and only) value in the row
    if pd.isnull(PRCP):
        return df['PRCP'].mean()
    else:
        return PRCP

In [7]:
# Apply the functions
df['RAIN']=df[['RAIN']].apply(RAIN_INSERTION,axis=1)

In [8]:
df['PRCP']=df[['PRCP']].apply(PRCP_INSERTION,axis=1)

In [9]:
# Convert the RAIN column from True/False to 1/0
df['RAIN'] = df['RAIN'].replace(True , 1)
df['RAIN'] = df['RAIN'].replace(False , 0)
df.head()

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN
0,1948-01-01,0.47,51,42,1
1,1948-01-02,0.59,45,36,1
2,1948-01-03,0.42,45,35,1
3,1948-01-04,0.31,45,34,1
4,1948-01-05,0.17,45,32,1


In [10]:
df['PRCP'].value_counts()

0.00    14648
0.01      933
0.02      707
0.03      493
0.04      428
        ...  
2.58        1
2.49        1
2.18        1
5.02        1
2.61        1
Name: PRCP, Length: 207, dtype: int64

In [11]:
df['TMIN'].value_counts()

42    1042
50    1033
53    1024
40    1012
54     997
      ... 
7        4
2        1
1        1
71       1
0        1
Name: TMIN, Length: 68, dtype: int64

In [12]:
# How many rows have TMIN == 60 and PRCP == 0

condition_1 = df["TMIN"] == 60
condition_2 = df["PRCP"] == 0.00

df[(condition_1 & condition_2)].count()

DATE    169
PRCP    169
TMAX    169
TMIN    169
RAIN    169
dtype: int64

In [13]:
# Class balance of the RAIN column (0 = no rain, 1 = rain)
df["RAIN"].value_counts()

0    14648
1    10900
Name: RAIN, dtype: int64

## RAIN

In [14]:
# Split into training and test sets
train, test = train_test_split(
    df,
    train_size=0.8,  # 80% of the data for training
    test_size=0.2,   # 20% of the data for testing
)
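The rain heuristic in the next cell looks at the two preceding rows, so row order matters. `train_test_split` shuffles by default, which means neighbouring rows in `train` and `test` are generally not consecutive days. A minimal sketch of a chronological 80/20 split, assuming the frame can be sorted by `DATE`; the frame `toy` and its values are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the weather frame (hypothetical values)
toy = pd.DataFrame({
    "DATE": pd.date_range("1948-01-01", periods=10, freq="D"),
    "RAIN": [1, 1, 0, 0, 1, 0, 1, 1, 0, 0],
})

# shuffle=False keeps the row order: the first 80% of rows become
# the training set and the last 20% the test set.
toy_train, toy_test = train_test_split(
    toy.sort_values("DATE"), train_size=0.8, test_size=0.2, shuffle=False
)

# Every training day comes before every test day
print(toy_train["DATE"].max() < toy_test["DATE"].min())
```

An unshuffled split is deterministic, so it returns the same partition on every call and does not by itself satisfy the "repeat 5 times" step of the assignment.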

In [15]:
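# Heuristic model for rain prediction
#
# Index convention:
#   x     --> future state ("tomorrow")
#   x - 1 --> today
#   x - 2 --> yesterday

def heuristic(df):
    """Predict rain (1) or no rain (0) for every row of df."""
    preds = []

    for x in range(len(df)):
        # Only consider predicting rain if either of the two previous rows had rain.
        # For x = 0 and x = 1 the negative indices wrap around to the end of df.
        if df.iloc[x-1]['RAIN'] or df.iloc[x-2]['RAIN']:
            if (df.iloc[x]['TMAX'] <= 55 and df.iloc[x]['TMAX'] >= 50):
                preds.append(1)
            # This condition can never be true (TMIN cannot be <= 39 and >= 45
            # at once), so only the TMAX rule above ever predicts rain.
            elif (df.iloc[x]['TMIN'] <= 39 and df.iloc[x]['TMIN'] >= 45):
                preds.append(1)
            else:
                preds.append(0)
        else:
            # No rain on either previous row: predict no rain
            preds.append(0)

    return preds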
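The same kind of rule can be written without a Python loop. The sketch below is not a drop-in replacement: it assumes the rows are in date order so that `shift` really means "previous day", it treats missing history for the first two rows as "no rain" instead of wrapping around to the end of the frame, and it reads the TMIN range as 39 to 45, which is only a guess at the intent. `heuristic_vectorized` and the toy frame are hypothetical names.

```python
import pandas as pd

def heuristic_vectorized(df):
    """Predict rain (1/0) from the previous two rows and the current temperatures."""
    rain = df["RAIN"].astype(bool)
    # Rain on either of the two previous rows; missing history counts as no rain
    recent_rain = rain.shift(1, fill_value=False) | rain.shift(2, fill_value=False)
    mild_tmax = df["TMAX"].between(50, 55)  # inclusive on both ends
    cool_tmin = df["TMIN"].between(39, 45)  # assumed intent of the TMIN rule
    return (recent_rain & (mild_tmax | cool_tmin)).astype(int)

# Made-up rows for illustration
toy = pd.DataFrame({
    "RAIN": [1, 0, 0, 0, 1],
    "TMAX": [52, 53, 60, 51, 70],
    "TMIN": [44, 47, 41, 46, 55],
})
print(heuristic_vectorized(toy).tolist())
```

Because the TMIN rule is active here, this version would produce different scores from the ones recorded in this notebook.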

In [16]:
# Apply Heuristic on training set
train['preds'] = heuristic(train)
train.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train['preds'] = heuristic(train)
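
The message above is pandas' `SettingWithCopyWarning`. In the pandas version used here, `train` comes out of `train_test_split` as a subset of `df`, and pandas cannot tell whether the new column is meant for the subset or for the original frame. One common way to avoid the warning is to take an explicit copy before adding columns. A minimal sketch on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({"RAIN": [1, 0, 1, 0], "TMAX": [51, 60, 45, 70]})

# .copy() makes the subset an independent DataFrame, so adding a column
# to it is unambiguous and leaves the original frame untouched.
subset = toy[toy["TMAX"] < 65].copy()
subset["preds"] = [1, 0, 1]

print(subset)
print("preds" in toy.columns)
```

Applied to this notebook, that would mean calling `train = train.copy()` and `test = test.copy()` right after the split.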


Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN,preds
10909,1977-11-13,0.22,54,45,1,1
24150,2014-02-13,0.07,55,46,1,1
5531,1963-02-22,0.0,48,36,0,0
15336,1989-12-27,0.29,42,30,1,0
12349,1981-10-23,0.0,58,41,0,0


In [17]:
# Calculate accuracy, recall and precision
def sklearn_RAIN(df):
    """Return (accuracy, recall, precision) of df['preds'] against df['RAIN']."""
    actual = df["RAIN"]
    predicted = df["preds"]

    accuracy = accuracy_score(actual, predicted)
    recall = recall_score(actual, predicted)
    precision = precision_score(actual, predicted)

    return accuracy, recall, precision
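
To make the three scores concrete, here is a small worked example that computes them from the confusion counts by hand. The `actual` and `predicted` arrays are made up for illustration. Precision is the share of predicted rain days that really had rain; recall is the share of real rain days that were predicted.

```python
import numpy as np

# Made-up labels: 1 = rain, 0 = no rain
actual = np.array([1, 1, 1, 0, 0, 0, 1, 0])
predicted = np.array([1, 0, 0, 0, 1, 0, 1, 0])

tp = int(np.sum((actual == 1) & (predicted == 1)))  # rain predicted, rain observed
fp = int(np.sum((actual == 0) & (predicted == 1)))  # rain predicted, none observed
fn = int(np.sum((actual == 1) & (predicted == 0)))  # rain missed
tn = int(np.sum((actual == 0) & (predicted == 0)))  # dry predicted, dry observed

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(accuracy, recall, precision)
```

Note that `sklearn_RAIN` returns its values in the order accuracy, recall, precision.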

In [18]:
# Accuracy, recall and precision on the training set
sklearn_RAIN(train)

(0.6143947548683825, 0.1992619926199262, 0.6484052532833021)

In [19]:
# Apply Heuristic on test set
test['preds'] = heuristic(test)
test.head()


Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN,preds
15060,1989-03-26,0.15,51,42,1,1
2826,1955-09-27,0.14,56,47,1,0
5678,1963-07-19,0.0,70,54,0,0
8734,1971-11-30,0.0,46,37,0,0
20964,2005-05-25,0.0,77,50,0,0


In [20]:
# Accuracy, recall and precision on the test set
sklearn_RAIN(test)

(0.5925636007827788, 0.17953321364452424, 0.6116207951070336)

In [21]:
# Repeat the split-and-score procedure several times

def multiple_trails(data, train_size=0.8, test_size=0.2, iterations=5):
    """Score the rain heuristic on a fresh random test split per iteration."""
    acc = []
    rec = []
    pre = []

    for _ in range(iterations):
        # Only the test split is scored; the heuristic has nothing to fit
        train, test = train_test_split(data, test_size=test_size, train_size=train_size)
        test["preds"] = heuristic(test)
        accuracy, recall, precision = np.round(sklearn_RAIN(test), 2)
        acc.append(accuracy)
        rec.append(recall)
        pre.append(precision)
    return acc, rec, pre
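
Each call to `train_test_split` above draws a new random split, so the five scores change from run to run. A sketch of a reproducible variant, assuming one fixed seed per trial is acceptable; `repeated_holdout`, `always_dry` and `accuracy` are hypothetical helpers, and the toy frame is made up:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def repeated_holdout(data, predict_fn, score_fn, seeds=(0, 1, 2, 3, 4)):
    """Score predict_fn on one 80/20 test split per seed and return the scores."""
    scores = []
    for seed in seeds:
        _, test = train_test_split(data, test_size=0.2, random_state=seed)
        test = test.copy()  # independent frame, safe to add a column
        test["preds"] = predict_fn(test)
        scores.append(score_fn(test["RAIN"], test["preds"]))
    return scores

# Toy data and a trivial "never rains" predictor, for illustration only
toy = pd.DataFrame({"RAIN": [1, 0] * 10, "TMAX": list(range(40, 60))})
always_dry = lambda frame: np.zeros(len(frame), dtype=int)
accuracy = lambda y, p: float((y == p).mean())

first = repeated_holdout(toy, always_dry, accuracy)
second = repeated_holdout(toy, always_dry, accuracy)
print(first == second)  # same seeds, same splits, same scores
```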

In [22]:
# Run five trials
acc, rec, pre = multiple_trails(df)
print (acc, rec, pre)

[0.6, 0.62, 0.62, 0.61, 0.61] [0.19, 0.21, 0.2, 0.17, 0.2] [0.63, 0.68, 0.65, 0.64, 0.66]


In [23]:
# Compute the mean of each metric across the trials
acc = sum(acc) / len(acc)
rec = sum(rec) / len(rec)
pre = sum(pre) / len(pre)

print (acc, rec, pre)

0.6119999999999999 0.19400000000000003 0.652
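
The averages above hide how much the scores vary between trials. A small sketch that reports the mean together with the sample standard deviation, using the five per-trial values printed earlier in this notebook:

```python
import numpy as np

# Per-trial scores copied from the output of the five trials above
acc = [0.6, 0.62, 0.62, 0.61, 0.61]
rec = [0.19, 0.21, 0.2, 0.17, 0.2]
pre = [0.63, 0.68, 0.65, 0.64, 0.66]

for name, scores in [("accuracy", acc), ("recall", rec), ("precision", pre)]:
    # ddof=1 gives the sample standard deviation
    print(f"{name}: mean={np.mean(scores):.3f}, std={np.std(scores, ddof=1):.3f}")
```

With recall near 0.19 and precision near 0.65, the heuristic rarely predicts rain, but is right about two times out of three when it does.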


## PRCP

In [24]:
# Heuristic for the regression target (PRCP).
# Note: this redefines heuristic(), replacing the classification version above.

def heuristic(df):
    """Predict the precipitation amount for every row of df from its temperatures."""
    preds = []

    for x in range(len(df)):
        if (df.iloc[x]['TMIN'] >= 50) or (df.iloc[x]['TMAX'] >= 50):
            preds.append(0)
        elif (df.iloc[x]['TMIN'] >= 40) or (df.iloc[x]['TMAX'] >= 40):
            preds.append(0.02)
        else:
            # Both temperatures are below 40
            preds.append(0.01)

    return preds
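
The three-way rule above maps directly onto `np.select`, which avoids the row-by-row loop. A sketch with the same thresholds; `heuristic_prcp_vectorized` is a hypothetical name and the toy rows are made up:

```python
import numpy as np
import pandas as pd

def heuristic_prcp_vectorized(df):
    """Return a predicted precipitation amount per row, based on temperature bands."""
    warm = ((df["TMIN"] >= 50) | (df["TMAX"] >= 50)).to_numpy()
    mild = ((df["TMIN"] >= 40) | (df["TMAX"] >= 40)).to_numpy()
    # np.select takes, per row, the value of the first condition that holds
    return np.select([warm, mild], [0.0, 0.02], default=0.01)

toy = pd.DataFrame({"TMAX": [54, 48, 42, 35], "TMIN": [45, 36, 30, 28]})
print(heuristic_prcp_vectorized(toy).tolist())
```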

In [25]:
# Apply the heuristic to the training set
train['preds'] = heuristic(train)
train.head()


Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN,preds
10909,1977-11-13,0.22,54,45,1,0.0
24150,2014-02-13,0.07,55,46,1,0.0
5531,1963-02-22,0.0,48,36,0,0.02
15336,1989-12-27,0.29,42,30,1,0.02
12349,1981-10-23,0.0,58,41,0,0.0


In [26]:
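# Calculate MSE, MAE and RMSE
def sklearn_PRCP(df):
    """Return (MSE, MAE, RMSE) of df['preds'] against df['PRCP']."""
    actual = df["PRCP"]
    predicted = df["preds"]

    mse = mean_squared_error(actual, predicted)
    mae = mean_absolute_error(actual, predicted)
    # RMSE is the square root of MSE; taking the root directly avoids the
    # squared=False argument, which is deprecated in recent scikit-learn releases
    rms = np.sqrt(mse)

    return mse, mae, rms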
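
For reference, the error measures named in the assignment can be computed by hand with NumPy. The sketch below uses the five rows shown in the `train.head()` output above (actual PRCP and the heuristic's predictions) as a worked example. SSE is the sum of squared errors, MSE is SSE divided by the number of rows, and RMSE is the square root of MSE.

```python
import numpy as np

# PRCP and preds from the five rows shown in train.head() above
actual = np.array([0.22, 0.07, 0.00, 0.29, 0.00])
predicted = np.array([0.00, 0.00, 0.02, 0.02, 0.00])

errors = actual - predicted
sse = float(np.sum(errors ** 2))      # sum of squared errors
mse = sse / len(errors)               # mean squared error
mae = float(np.mean(np.abs(errors)))  # mean absolute error
rmse = float(np.sqrt(mse))            # root mean squared error

print(sse, mse, mae, rmse)
```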

In [27]:
# MSE, MAE and RMSE on the training set
sklearn_PRCP(train)

(0.06834696643507192, 0.10553919170173208, 0.2614325275000644)

In [28]:
# Apply the heuristic to the test set
test['preds'] = heuristic(test)

test.head()


Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN,preds
15060,1989-03-26,0.15,51,42,1,0.0
2826,1955-09-27,0.14,56,47,1,0.0
5678,1963-07-19,0.0,70,54,0,0.0
8734,1971-11-30,0.0,46,37,0,0.02
20964,2005-05-25,0.0,77,50,0,0.0


In [29]:
# MSE, MAE and RMSE on the test set
sklearn_PRCP(test)

(0.06274649706457926, 0.10572994129158512, 0.25049250899893044)

In [30]:
# Repeat the split-and-score procedure several times

def multiple_trails_PRCP(data, train_size=0.8, test_size=0.2, iterations=5):
    """Score the PRCP heuristic on a fresh random test split per iteration."""
    mse = []
    mae = []
    rms = []

    for _ in range(iterations):
        # Only the test split is scored; the heuristic has nothing to fit
        train, test = train_test_split(data, test_size=test_size, train_size=train_size)
        test["preds"] = heuristic(test)
        mse_score, mae_score, rms_score = np.round(sklearn_PRCP(test), 2)
        mse.append(mse_score)
        mae.append(mae_score)
        rms.append(rms_score)
    return mse, mae, rms

In [31]:
# Run five trials
mse, mae, rms = multiple_trails_PRCP(df)
print (mse, mae, rms)

[0.06, 0.07, 0.07, 0.07, 0.07] [0.1, 0.11, 0.11, 0.11, 0.11] [0.25, 0.26, 0.26, 0.26, 0.26]


In [32]:
# Compute the mean of each metric across the trials
mse = sum(mse) / len(mse)
mae = sum(mae) / len(mae)
rms = sum(rms) / len(rms)

print (mse, mae, rms)

0.068 0.10800000000000001 0.258
