**Classification**

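Maximize the likelihood function, or equivalently minimize the negative log-likelihood function.

For binary labels $y_i \in \{0, 1\}$ and predicted probabilities $p_i$, the negative log-likelihood is the cross-entropy loss function:

$\min \; -\sum_i \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)$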
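The cross-entropy loss above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `binary_cross_entropy`, the clipping constant `eps`, and the sample probabilities are all assumptions made for the example.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Sum over samples of -(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip p away from 0 and 1 so that log() never receives 0.
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# Made-up labels and two sets of made-up predicted probabilities.
y_true = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.7]  # probabilities close to the labels
unsure = [0.6, 0.4, 0.5, 0.5]     # probabilities close to 0.5

loss_confident = binary_cross_entropy(y_true, confident)
loss_unsure = binary_cross_entropy(y_true, unsure)
print("confident: {:.3f}, unsure: {:.3f}".format(loss_confident, loss_unsure))
```

Predictions that put more probability on the correct label give a smaller loss, which is why minimizing this quantity is the same as maximizing the likelihood.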

### Evaluation for Binary Classification

**Balanced data**

When the data is balanced (roughly equal numbers of positive and negative samples), use the classification accuracy

$\text{accuracy} = \frac{\text{number of correct predictions}}{\text{number of total predictions}}$

* ```True positive (TP)```: prediction is positive, ground truth is positive
* ```False positive (FP)```: prediction is positive, ground truth is negative
* ```False negative (FN)```: prediction is negative, ground truth is positive
* ```True negative (TN)```: prediction is negative, ground truth is negative

$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$
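As a small sketch of the four outcomes and the accuracy formula, the snippet below counts them for a made-up set of labels and predictions. The helper `confusion_counts` is a hypothetical name for this example, and labels are assumed to be encoded as 1 (positive) and 0 (negative).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Made-up ground truth and predictions for eight samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + tn + fn)
print("TP={}, FP={}, FN={}, TN={}, accuracy={:.2f}".format(tp, fp, fn, tn, accuracy))
```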

**Imbalanced data**

When the data is imbalanced, accuracy can be misleading, so use recall or precision instead

$\text{recall} = \frac{TP}{TP + FN}$

$\text{precision} = \frac{TP}{TP + FP}$
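To illustrate why accuracy alone can mislead on imbalanced data, here is a sketch with made-up labels (5 positives, 95 negatives) and two hypothetical classifiers. The helper names and the convention of returning 0.0 when a denominator is zero are assumptions for this example.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

def accuracy_of(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Made-up imbalanced data: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Hypothetical classifier A: always predicts negative.
pred_a = [0] * 100
# Hypothetical classifier B: finds 4 of 5 positives, with 4 false alarms.
pred_b = [1, 1, 1, 1, 0] + [1] * 4 + [0] * 91

for name, pred in [("A", pred_a), ("B", pred_b)]:
    precision, recall = precision_recall(y_true, pred)
    print("{}: accuracy={:.2f}, precision={:.2f}, recall={:.2f}".format(
        name, accuracy_of(y_true, pred), precision, recall))
```

Both classifiers reach the same accuracy, yet A never finds a positive sample while B finds most of them; only recall and precision expose the difference.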

When to care about recall?
* ex. Covid-19 diagnosis
* important to find all positive samples

When to care about precision?
* ex. Google search
* users are sensitive to false positives (irrelevant results shown as relevant)

Relationship between recall & precision

```F1 score```: harmonic mean of recall and precision; it is high only when both recall and precision are high

$F_1 = \frac{2 \cdot \text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}}$
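The trade-off between recall and precision, and how the F1 score summarizes it, can be sketched by sweeping a decision threshold over predicted scores. The scores, labels, thresholds, and the helper `precision_recall_f1` below are all made up for illustration.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Predict positive when score >= threshold, then compute the metrics."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return precision, recall, f1

# Made-up ground truth (4 positives, 6 negatives) and predicted scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.15, 0.10, 0.05]

results = {}
for threshold in [0.25, 0.5, 0.75]:
    results[threshold] = precision_recall_f1(y_true, scores, threshold)
    print("threshold={:.2f}: precision={:.3f}, recall={:.3f}, F1={:.3f}".format(
        threshold, *results[threshold]))
```

On this toy data, raising the threshold increases precision and decreases recall; the F1 score is a single number that reflects both.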

In [None]:
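#preprocess
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#load the iris dataset and hold out 10% of it for testing
X, y = datasets.load_iris(return_X_y = True)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.1,
                                                    random_state = 0)

#normalize: fit the scaler on the training data only, then apply it to both sets
normalizer = StandardScaler()
X_train = normalizer.fit_transform(X_train)
X_test = normalizer.transform(X_test)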

In [None]:
#train logistic regression with an L2 penalty
clf = LogisticRegression(penalty='l2',C=1.0)
clf.fit(X_train,y_train)

y_train_pred = clf.predict(X_train)
acc = accuracy_score(y_train, y_train_pred)
print("training accuracy {:.4f}".format(acc))

In [None]:
#evaluate logistic regression
y_test_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_test_pred)
print("testing accuracy {:.4f}".format(acc))

In [None]:
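#test different regularization values

regularization_coeff = [0.1, 0.5, 1.0, 5.0]

for reg in regularization_coeff:
  #C is the inverse of the regularization strength: smaller C means stronger regularization
  clf = LogisticRegression(penalty='l2',C=reg)

  clf.fit(X_train, y_train)

  y_test_pred = clf.predict(X_test)
  acc = accuracy_score(y_test, y_test_pred)

  #print 1/C so that the reported coefficient grows with the regularization strength
  print("reg coeff: {}, accuracy: {:.3f}".format(1.0/reg, acc))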

### Model Selection: Threefold Split

```Training Set```: used to fit the model parameters during the training phase

```Validation Set```: used to select hyperparameters during the training phase

```Testing Set```: used only to evaluate the final model, after training and model selection are done

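Pros:

* Fast and simple

Cons:

* Large variance: the validation score depends on which samples happen to land in the validation set
* Wastes data: samples held out for validation are not used to fit the model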

In [None]:
#Threefold split

X, y = datasets.load_iris(return_X_y = True)
print(X.shape)

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y,
                                                            test_size = 0.1,
                                                            random_state = 0)

X_train, X_valid, y_train, y_valid = train_test_split(X_train_val,
                                                      y_train_val,
                                                      test_size = 0.1,
                                                      random_state = 0)

In [None]:
regularization_coeff = [0.1, 0.5, 1.0, 5.0]

best_acc = 0.0
best_reg = 0.0

#training
for reg in regularization_coeff:
  clf = LogisticRegression(penalty='l2',C=reg)

  clf.fit(X_train, y_train) #use training data

  y_valid_pred = clf.predict(X_valid) #use validation set
  acc = accuracy_score(y_valid, y_valid_pred)

  if acc > best_acc:
    best_acc = acc
    best_reg = reg

  print("reg coeff: {}, accuracy: {:.3f}".format(1.0/reg, acc))

#evaluation: retrain on training + validation data with the best hyperparameter
clf = LogisticRegression(penalty='l2',C=best_reg)
clf.fit(X_train_val, y_train_val)

y_test_pred = clf.predict(X_test) #use testing set
acc = accuracy_score(y_test, y_test_pred)
print("acc: {:.3f}".format(acc))

### Model Selection: Cross-validation

Training data: randomly partitioned into $k$ folds
* $k-1$ folds form the training set
* the remaining fold is the validation set

How to select the model
* for each hyperparameter value, train the model $k$ times, each time holding out a different fold for validation
* evaluate the model $k$ times, once on each held-out fold
* use the mean of the $k$ evaluations to select the model



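Pros:
* More data: every sample is used for both training and validation
* More stable: averaging over $k$ folds reduces the variance of the validation score

Cons:
* Slower: each hyperparameter value requires $k$ training runs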
In [None]:
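import numpy as np

#5-fold cross-validation on the train+validation data from the threefold split
num_folds = 5
regularization_coeff = [0.1, 0.5, 1.0, 5.0]

#index_of_folds is not defined earlier in this notebook; a random partition is
#assumed here: shuffle the sample indices and put one fold in each row
#(this reshape needs len(X_train_val) to be divisible by num_folds: 135 / 5 = 27)
rng = np.random.RandomState(0)
index_of_folds = rng.permutation(len(X_train_val)).reshape(num_folds, -1)

#reset the best values so results from earlier cells do not carry over
best_acc = 0.0
best_reg = 0.0

for reg in regularization_coeff:

  sum_acc = 0.0

  for fold in range(num_folds):
    #hold out one fold for validation, train on the remaining folds
    valid_index = index_of_folds[fold,:].reshape(-1)
    train_index = np.delete(index_of_folds, fold, 0).reshape(-1)

    X_train = X_train_val[train_index]
    y_train = y_train_val[train_index]

    X_valid = X_train_val[valid_index]
    y_valid = y_train_val[valid_index]

    clf = LogisticRegression(penalty='l2', C=reg)
    clf.fit(X_train, y_train)

    y_valid_pred = clf.predict(X_valid)
    acc = accuracy_score(y_valid, y_valid_pred)

    sum_acc += acc

  #mean validation accuracy over the folds
  cur_acc = sum_acc / num_folds
  print("reg coeff: {}, acc: {:.3f}".format(1.0/reg, cur_acc))

  if cur_acc > best_acc:
    best_acc = cur_acc
    best_reg = reg

#retrain on all train+validation data with the best hyperparameter
clf = LogisticRegression(penalty='l2',C=best_reg)
clf.fit(X_train_val, y_train_val)

#evaluate once on the held-out testing set
y_test_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_test_pred)

print("acc: {:.3f}".format(acc))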