# Ce se întamplă în cazul clasificarii binare daca se modifică pragul de decizie din 0.5 în alte valori. Cum se poate aprecia calitatea clasificatorului pentru diferite valori ale pragului?

#### Raspuns: Daca pragul este mai mare (de exemplu 0.8) atunci sansele ca algoritmul sa prezica corect label-ul pozitiv sunt mai mici. Aceasta situatie poate conduce de exemplu la clasificarea unor pacienti bolnavi ca fiind sanatosi, ceea ce nu este de dorit. Daca pragul este mai mic, sansele ca cei bolnavi sa fie clasificati corect sunt mai mari. (nu este atat de grav daca pacientii sanatosi sunt considerati bolnavi) 

# Rezolvarea unei probleme de regresie/clasificare prin: folosirea validarii încrucișate (K-fold cross validation) - pb cu happiness dupa pib

In [1]:
import csv
import os
import matplotlib.pyplot as plt
import numpy as np 
from sklearn import linear_model
import pandas as pd 
from sklearn.metrics import mean_squared_error


#Ce îi poate face pe oameni fericiți? - dupa PIB
def readData(dataPath: str):
    df = pd.read_csv(dataPath, delimiter=',', header='infer')
    df = df.dropna()
    return df

#split data frame in k sets
def splitDataInKSets(dataFrame, k):
    size = dataFrame.shape[0]
    arr = np.array_split(range(size),k) 
    input = [[dataFrame["Economy..GDP.per.Capita."].iloc[i] for i in index] for index in arr]
    output = [[dataFrame["Happiness.Score"].iloc[i] for i in index] for index in arr]
    return input, output

def getErrors(computed_output, validation_output):
    computedError = mean_squared_error(validation_output, computed_output)
    return computedError

def trainRegressor(regressor, dataFrame, k):
    errors = []
    input, output = splitDataInKSets(dataFrame, k)
    for i in range(0, k):
        validationInputSet = input[i]
        validationOutputSet = output[i]
        trainingInputSet = []
        trainingOutputSet = []
        for j in range(0, k):
            if j != i:
                trainingInputSet += input[j]
                trainingOutputSet += output[j]
        regressor.partial_fit([[trainingInputSet[ind]] for ind in range(0, len(trainingInputSet))], trainingOutputSet)
        computed_output = regressor.predict([[validationInputSet[ind]] for ind in range(0, len(validationInputSet))])
        errors.append(getErrors(computed_output, validationOutputSet))
    return errors

In [2]:
dataFrame = readData("2017.csv")
regressor = linear_model.SGDRegressor()
errors = trainRegressor(regressor, dataFrame, 6)
print(errors)
overallError = sum(errors) / len(errors)
print("Overall error = ", overallError)

[10.179377752289922, 2.28892688827479, 0.4669109843562096, 0.5233600573819017, 0.5543750269744534, 0.4012853667884819]
Overall error =  2.402372679344293


# Investigarea diferitelor funcții de loss - pt pb 1 - PIB

- **Squared Error** (squared_error)
    - The ordinary least squares is the square of the difference between the actual value and predicted value.
    - It tends to penalize model more and more for larger differences thereby giving more weight to outliers

- **Huber** (huber)
    - The mean squared error (MSE) or squared error gives too much importance to outliers and Mean Average error (MAE) (here instead of squaring we take absolute value of errors) gives equal weightage to all points
    - Huber loss combines MSE and MAE to give best of both wold- it is quadratic(MSE) when the error is small else MAE

- **Epsilon Insensitive** (epsilon_insensitive)
    - The value of epsilon determines the distance within which errors are considered to be zero . The loss function ignores error which are less than or equal to epsilon value by treating them zero.
    - Thus the loss function effectively forces the optimizer to find such a hyperplane that a tube of width epsilon around this hyperplane will contain all the datapoints.

In [4]:
from sklearn import metrics

def getTrainingAndValidationSets(dfWorldHappiness):
    dataSize = dfWorldHappiness.shape[0]
    
    trainingIndexSet = np.random.choice(range(dataSize), size=int(0.8 * dataSize), replace=False)
    validationIndexSet = [i for i in range(dataSize) if i not in trainingIndexSet]

    trainingInputSet = [dfWorldHappiness["Economy..GDP.per.Capita."].iloc[index] for index in trainingIndexSet]
    trainingOutputSet = [dfWorldHappiness["Happiness.Score"].iloc[index] for index in trainingIndexSet]

    validationInputSet = [dfWorldHappiness["Economy..GDP.per.Capita."].iloc[index] for index in validationIndexSet]
    validationOutputSet = [dfWorldHappiness["Happiness.Score"].iloc[index] for index in validationIndexSet]

    return trainingInputSet, trainingOutputSet, validationInputSet, validationOutputSet

def getRegressor(dataFrame, loss_type):
    trainingInputSet, trainingOutputSet, _, _ = getTrainingAndValidationSets(dataFrame)
    X = [[el] for el in trainingInputSet]
    regressor = linear_model.SGDRegressor(loss=loss_type)
    regressor.fit(X, trainingOutputSet)
    return regressor

def main():
    dataFrame = readData("2017.csv")
    _,_,validationInput,validationOutput = getTrainingAndValidationSets(dataFrame)
    for loss_type in linear_model.SGDRegressor().loss_functions:
        regressor = getRegressor(dataFrame, loss_type)
        computedOutput = regressor.predict([[validationInput[i]] for i in range(0, len(validationInput))]) 
        err = metrics.r2_score(validationOutput, computedOutput)
        print("Loss type: ", loss_type, " Error = ", err)

main()    

Loss type:  squared_error  Error =  0.47318649695354564
Loss type:  huber  Error =  0.24942280176284015
Loss type:  epsilon_insensitive  Error =  0.4844170281281762
Loss type:  squared_epsilon_insensitive  Error =  0.5031070050134103
