<h1>Text Classification using Regularized Logistic Regression<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Regularization-Theory" data-toc-modified-id="Regularization-Theory-1">Regularization Theory</a></span></li><li><span><a href="#Machine-Learning-Project-Lifecycle:-Third-Iteration" data-toc-modified-id="Machine-Learning-Project-Lifecycle:-Third-Iteration-2">Machine Learning Project Lifecycle: Third Iteration</a></span><ul class="toc-item"><li><span><a href="#Problem-Statement" data-toc-modified-id="Problem-Statement-2.1">Problem Statement</a></span></li><li><span><a href="#Training-Data" data-toc-modified-id="Training-Data-2.2">Training Data</a></span></li><li><span><a href="#Preprocessing-+-Feature-Engineering" data-toc-modified-id="Preprocessing-+-Feature-Engineering-2.3">Preprocessing + Feature Engineering</a></span></li><li><span><a href="#Machine-Learning-Algorithm:-Logistic-Regression" data-toc-modified-id="Machine-Learning-Algorithm:-Logistic-Regression-2.4">Machine Learning Algorithm: Logistic Regression</a></span><ul class="toc-item"><li><span><a href="#Using-Sklearn-Implementation-for-Student-Loan-Class-Prediction" data-toc-modified-id="Using-Sklearn-Implementation-for-Student-Loan-Class-Prediction-2.4.1">Using Sklearn Implementation for Student Loan Class Prediction</a></span></li><li><span><a href="#Custom-Implementation-for-Student-Loan-Class-Prediction-using-l2-penalty" data-toc-modified-id="Custom-Implementation-for-Student-Loan-Class-Prediction-using-l2-penalty-2.4.2">Custom Implementation for Student Loan Class Prediction using l2 penalty</a></span></li></ul></li><li><span><a href="#Model-Evaluation" data-toc-modified-id="Model-Evaluation-2.5">Model Evaluation</a></span></li><li><span><a href="#Quality-Metrics" data-toc-modified-id="Quality-Metrics-2.6">Quality Metrics</a></span></li><li><span><a href="#Model-Evaluation-on-Test-Dataset" data-toc-modified-id="Model-Evaluation-on-Test-Dataset-2.7">Model Evaluation on Test Dataset</a></span></li></ul></li><li><span><a href="#Homework" data-toc-modified-id="Homework-3">Homework</a></span></li><li><span><a href="#Resources" data-toc-modified-id="Resources-4">Resources</a></span></li></ul></div>

<img src="../images/classification.png" alt="Classification" style="width: 700px;"/>

## Regularization Theory

## Machine Learning Project Lifecycle: Third Iteration

### Problem Statement

Classify the Financial Consumer Complaints into different Product Categories given consumer complaint text.

**Product Categories**

- Credit reporting, repair, or other
- Debt collection
- Student loan
- Money transfer, virtual currency, or money service
- Bank account or service

### Training Data

[Kaggle: Consumer Complaint Database](https://www.kaggle.com/selener/consumer-complaint-database)

In [5]:
import pandas as pd

In [6]:
complaints_training_dataset = pd.read_csv('../datasets/consumer_complaints_training_dataset.csv')

In [7]:
complaints_training_dataset.head()

Unnamed: 0,Product,Complaint_text
0,"Credit reporting, repair, or other","My name is XXXX XXXX XXXX , not XXXX X..."
1,"Credit reporting, repair, or other",I was shocked when I reviewed my credit report...
2,"Credit reporting, repair, or other",Equifax misused of credit file. Disputing acco...
3,"Credit reporting, repair, or other",I am disturbed that you continue to list the v...
4,"Credit reporting, repair, or other",I went to multiple different credit report web...


In [8]:
complaints_training_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Product         20000 non-null  object
 1   Complaint_text  20000 non-null  object
dtypes: object(2)
memory usage: 312.6+ KB


**Q) What is the distribution of complaints for each product type?**

In [9]:
complaints_training_dataset.Product.unique()

array(['Credit reporting, repair, or other', 'Debt collection',
       'Student loan',
       'Money transfer, virtual currency, or money service',
       'Bank account or service'], dtype=object)

In [10]:
complaints_training_dataset\
    .groupby('Product')\
    [['Complaint_text']]\
    .count()\
    .rename(columns={'Complaint_text': 'Count'})\
    .sort_values('Count', ascending=False)

Unnamed: 0_level_0,Count
Product,Unnamed: 1_level_1
Bank account or service,4000
"Credit reporting, repair, or other",4000
Debt collection,4000
"Money transfer, virtual currency, or money service",4000
Student loan,4000


**Q) Find out the Occurances of Duplicate Text messages if any?**

In [11]:
complaints_training_dataset['Complaint_text'].nunique()

19913

In [12]:
duplicate_complaints = complaints_training_dataset['Complaint_text']\
    .value_counts()\
    [complaints_training_dataset['Complaint_text'].value_counts() > 2].index

In [13]:
len(duplicate_complaints)

9

### Preprocessing + Feature Engineering

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

RANDOM_STATE = 19

In [15]:
count_vectorizer = CountVectorizer(stop_words='english', max_features=5000)

In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    complaints_training_dataset['Complaint_text'],
    complaints_training_dataset['Product'],
    test_size=.2,
    stratify=complaints_training_dataset['Product'],
    random_state=RANDOM_STATE)

In [17]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((16000,), (4000,), (16000,), (4000,))

In [18]:
X_train_count_vectorizer = count_vectorizer.fit_transform(X_train)
X_test_count_vectorizer = count_vectorizer.transform(X_test)

In [19]:
len(count_vectorizer.get_feature_names())

5000

In [20]:
count_vectorizer.get_feature_names()[:10]

['00', '000', '10', '100', '1000', '10000', '100000', '1005', '11', '110']

In [21]:
list(count_vectorizer.vocabulary_.items())[:10]

[('xxxx', 4976),
 ('account', 322),
 ('listed', 2727),
 ('credit', 1279),
 ('report', 3828),
 ('experian', 1842),
 ('paid', 3234),
 ('closed', 1021),
 ('2007', 63),
 ('like', 2712)]

In [22]:
X_train_count_vectorizer.shape, X_test_count_vectorizer.shape

((16000, 5000), (4000, 5000))

### Machine Learning Algorithm: Logistic Regression

#### Using Sklearn Implementation for Student Loan Class Prediction

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [24]:
sklearn_binary_classifier = LogisticRegression(penalty='none',
                                               max_iter=101,
                                               random_state=RANDOM_STATE)

In [25]:
sklearn_binary_classifier.fit(X_train_count_vectorizer, y_train == 'Student loan')

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=101,
                   multi_class='auto', n_jobs=None, penalty='none',
                   random_state=19, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [26]:
sklearn_binary_classifier_predictions = sklearn_binary_classifier.predict(X_test_count_vectorizer)

In [27]:
sklearn_binary_classifier_score = accuracy_score(y_test == 'Student loan', sklearn_binary_classifier_predictions)
sklearn_binary_classifier_score

0.954

#### Custom Implementation for Student Loan Class Prediction using l2 penalty

**Estimating Conditional Probability using Link Function**

$ \displaystyle P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))} $

In [28]:
import numpy as np

def sigmoid(scores):
    return 1.0 / (1 + np.exp(-scores))

def predict_probability(feature_matrix, coefficients):
    scores = np.dot(feature_matrix, coefficients)
    predictions = sigmoid(scores)
    return predictions

In [29]:
dummy_feature_matrix = np.array([[1.,2.,3.], [1.,-1.,-1]])
dummy_coefficients = np.array([1., 3., -1.])

correct_scores      = np.array( [ 1.*1. + 2.*3. + 3.*(-1.),          1.*1. + (-1.)*3. + (-1.)*(-1.) ] )
correct_predictions = np.array( [ 1./(1+np.exp(-correct_scores[0])), 1./(1+np.exp(-correct_scores[1])) ] )

print('The following outputs must match ')
print('------------------------------------------------')
print('correct_predictions           =', correct_predictions)
print('output of predict_probability =', predict_probability(dummy_feature_matrix, dummy_coefficients))

The following outputs must match 
------------------------------------------------
correct_predictions           = [0.98201379 0.26894142]
output of predict_probability = [0.98201379 0.26894142]


**Compute derivative with respect to a single coefficient with L2 Penalty(Ridge)**

- For coefficients($w_1 .. w_j$) other than intercept ($w0$) term

$
\displaystyle \frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) \color{red}{-2\lambda w_j }
$

- For intercept ($w0$) term

$
\displaystyle \frac{\partial\ell}{\partial w_0} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right)
$

We will now write a function that computes the derivative of log likelihood with respect to a single coefficient $w_j$. The function accepts two arguments:
* `errors` vector containing $\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})$ for all $i$.
* `feature` vector containing $h_j(\mathbf{x}_i)$  for all $i$.
* `coefficient` containing the current value of coefficient $w_j$.
* `l2_penalty` representing the L2 penalty constant $\lambda$
* `feature_is_constant` telling whether the $j$-th feature is constant or not.

In [41]:
def feature_derivative(errors, feature, coefficient=None, is_intercept=None,
                       penalty_type=None, penalty_value=None):
    derivative = np.dot(errors, feature)
    if penalty_type is not None and not is_intercept:
        if penalty_type == 'l2':
            derivative -= 2 * penalty_value * coefficient
    return derivative

**Compute log likelihood with l2 penalty which is given by**

$ \ell\ell(\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) \color{red}{-\lambda\|\mathbf{w}\|_2^2}$

In [42]:
def compute_log_likelihood(feature_matrix, target_labels, target_label,
                           coefficients, penalty_value=None, penalty_type=0):
    indicator = (target_labels == target_label)
    scores = np.dot(feature_matrix, coefficients)
    if penalty_type == 'l2':
        likelihood = np.sum((indicator - 1) * scores - np.log(1 + np.exp(-scores))) \
            - penalty_value*np.sum(coefficients[1:]**2)
    else:
        likelihood = np.sum((indicator - 1) * scores - np.log(1 + np.exp(-scores)))
    return likelihood

In [43]:
dummy_feature_matrix = np.array([[1.,2.,3.], [1.,-1.,-1]])
dummy_coefficients = np.array([1., 3., -1.])
dummy_sentiment = np.array([-1, 1])

correct_indicators = np.array([ -1==+1, 1==+1])
correct_scores      = np.array( [ 1.*1. + 2.*3. + 3.*(-1.),  1.*1. + (-1.)*3. + (-1.)*(-1.) ] )
correct_first_term  = np.array( [ (correct_indicators[0]-1)*correct_scores[0],
                                 (correct_indicators[1]-1)*correct_scores[1] ] )
correct_second_term = np.array( [ np.log(1. + np.exp(-correct_scores[0])), 
                                 np.log(1. + np.exp(-correct_scores[1])) ] )

correct_ll          =      sum( [ correct_first_term[0]-correct_second_term[0],
                                 correct_first_term[1]-correct_second_term[1] ] ) 

print('The following outputs must match ')
print('------------------------------------------------')
print('correct_log_likelihood           =', correct_ll)
print('output of compute_log_likelihood =', compute_log_likelihood(dummy_feature_matrix,
                                                                   dummy_sentiment,
                                                                   1,
                                                                   dummy_coefficients))

The following outputs must match 
------------------------------------------------
correct_log_likelihood           = -5.331411615436032
output of compute_log_likelihood = -5.331411615436032


**Any Questions????**

**Train Binary Logistic Regression Classifier model using Gradient Ascent**

In [44]:
def train_binary_lr_classifier(
        features_matrix, target_labels, target_label,
        initial_coefficients, step_size, max_iterations,
        penalty_type=None, penalty_value=0):
    coefficients = np.array(initial_coefficients)
    for iteration in range(max_iterations):
        predictions = predict_probability(features_matrix, coefficients)
        
        indicator = (target_labels == target_label)
        
        errors = indicator - predictions
        
        for j in range(len(coefficients)):
            is_intercept = (j == 0)
            derivative = feature_derivative(errors, features_matrix[:, j],
                                            coefficients[j], is_intercept,
                                            penalty_type, penalty_value)
            coefficients[j] += step_size * derivative
            
        if (iteration <= 100 and iteration % 10 == 0)\
            or (iteration <= 1000 and iteration % 100 == 0)\
            or (iteration <= 10000 and iteration % 1000 == 0)\
            or iteration % 10000 == 0:
            lp = compute_log_likelihood(features_matrix, target_labels,
                                        target_label, coefficients,
                                        penalty_type, penalty_value)
            print('----------------------------------')
            print(f'Iteration: {iteration} -> Likelihood value: {lp} for {target_label} classifier.')
            predicted_probabilities = predict_probability(features_matrix, coefficients)
            predicted_classes = predicted_probabilities > .5
            correct_predictions = predicted_classes == (target_labels == target_label)
            print(f'Minimum Probability:{predictions.min()},',
                  f'Maximum Probability:{predictions.max()},',
                  f'Current Accuracy: {correct_predictions.sum() / len(target_labels)}')
    return coefficients

In [45]:
def count_vectorized_features_to_features_matrix(count_vectorized_features):
    constant_feature = np.ones((count_vectorized_features.shape[0], 1))
    return np.hstack((constant_feature, count_vectorized_features.toarray()))

In [47]:
# Run LR without Penalty type(Unregularized LR)
# TODO: Capture the likelihood values for plotting & comparision with regularized techniques
# Hyper Paramters
step_size, max_iterations =1e-5, 101

# Target Variables
target_labels, target_label = y_train, 'Student loan'

# Feature Matrix & Initial Coefficients
X_train_features_matrix = count_vectorized_features_to_features_matrix(X_train_count_vectorizer)
initial_coefficients = np.zeros(X_train_features_matrix.shape[1])

print(X_train_features_matrix.shape, initial_coefficients.shape)

unregularized_binary_classifier_coeffs = train_binary_lr_classifier(
    X_train_features_matrix,
    target_labels,
    target_label,
    initial_coefficients,
    step_size,
    max_iterations)

(16000, 5001) (5001,)
----------------------------------
Iteration: 0 -> Likelihood value: -18351.684887894677 for Student loan classifier.
Minimum Probability:0.5, Maximum Probability:0.5, Current Accuracy: 0.8033125
----------------------------------
Iteration: 10 -> Likelihood value: -6158.331576286401 for Student loan classifier.
Minimum Probability:2.7025490668168023e-09, Maximum Probability:0.999999999982643, Current Accuracy: 0.891375
----------------------------------
Iteration: 20 -> Likelihood value: -4004.3037195526767 for Student loan classifier.
Minimum Probability:7.215761438369419e-14, Maximum Probability:0.9999999999956504, Current Accuracy: 0.9375625
----------------------------------
Iteration: 30 -> Likelihood value: -3576.3360214762492 for Student loan classifier.
Minimum Probability:5.858595032778935e-17, Maximum Probability:0.9999999999208924, Current Accuracy: 0.9464375
----------------------------------
Iteration: 40 -> Likelihood value: -3346.3896591181638 for 

In [48]:
# Run LR without Penalty type(Unregularized LR)
# TODO: Capture the likelihood values for plotting & comparision with regularized techniques
# Hyper Paramters
step_size, max_iterations =1e-5, 101

# Regularization hyperparameters
penalty_type = 'l2'
penalty_value = 1e2

# Target Variables
target_labels, target_label = y_train, 'Student loan'

# Feature Matrix & Initial Coefficients
X_train_features_matrix = count_vectorized_features_to_features_matrix(X_train_count_vectorizer)
initial_coefficients = np.zeros(X_train_features_matrix.shape[1])

print(X_train_features_matrix.shape, initial_coefficients.shape)

l2_regularized_binary_classifier_coeffs = train_binary_lr_classifier(
    X_train_features_matrix,
    target_labels,
    target_label,
    initial_coefficients,
    step_size,
    max_iterations,
    penalty_type,
    penalty_value)

(16000, 5001) (5001,)
----------------------------------
Iteration: 0 -> Likelihood value: -18351.684887894677 for Student loan classifier.
Minimum Probability:0.5, Maximum Probability:0.5, Current Accuracy: 0.8033125
----------------------------------
Iteration: 10 -> Likelihood value: -6607.806921825849 for Student loan classifier.
Minimum Probability:4.400568040057556e-09, Maximum Probability:0.9999999999974809, Current Accuracy: 0.8840625
----------------------------------
Iteration: 20 -> Likelihood value: -4056.971542149547 for Student loan classifier.
Minimum Probability:1.3739147296742533e-13, Maximum Probability:0.999999999997776, Current Accuracy: 0.9355625
----------------------------------
Iteration: 30 -> Likelihood value: -3617.133746377894 for Student loan classifier.
Minimum Probability:1.5747867717788922e-16, Maximum Probability:0.9999999998711926, Current Accuracy: 0.945875
----------------------------------
Iteration: 40 -> Likelihood value: -3396.6168907829187 for S

In [51]:
# Run LR without Penalty type(Unregularized LR)
# TODO: Capture the likelihood values for plotting & comparision with regularized techniques
# Hyper Paramters
step_size, max_iterations =1e-5, 101

# Regularization hyperparameters
penalty_type = 'l2'
penalty_value = 1e1

# Target Variables
target_labels, target_label = y_train, 'Student loan'

# Feature Matrix & Initial Coefficients
X_train_features_matrix = count_vectorized_features_to_features_matrix(X_train_count_vectorizer)
initial_coefficients = np.zeros(X_train_features_matrix.shape[1])

print(X_train_features_matrix.shape, initial_coefficients.shape)

l2_regularized_binary_classifier_coeffs1 = train_binary_lr_classifier(
    X_train_features_matrix,
    target_labels,
    target_label,
    initial_coefficients,
    step_size,
    max_iterations,
    penalty_type,
    penalty_value)

(16000, 5001) (5001,)
----------------------------------
Iteration: 0 -> Likelihood value: -18351.684887894677 for Student loan classifier.
Minimum Probability:0.5, Maximum Probability:0.5, Current Accuracy: 0.8033125
----------------------------------
Iteration: 10 -> Likelihood value: -6220.791067695103 for Student loan classifier.
Minimum Probability:2.867202244042661e-09, Maximum Probability:0.9999999999869293, Current Accuracy: 0.889625
----------------------------------
Iteration: 20 -> Likelihood value: -4009.0173025810373 for Student loan classifier.
Minimum Probability:7.696035592019838e-14, Maximum Probability:0.9999999999958935, Current Accuracy: 0.937375
----------------------------------
Iteration: 30 -> Likelihood value: -3580.353891354651 for Student loan classifier.
Minimum Probability:6.474260876238639e-17, Maximum Probability:0.9999999999163967, Current Accuracy: 0.946375
----------------------------------
Iteration: 40 -> Likelihood value: -3351.3495692917045 for Stu

In [52]:
# Run LR without Penalty type(Unregularized LR)
# TODO: Capture the likelihood values for plotting & comparision with regularized techniques
# Hyper Paramters
step_size, max_iterations =1e-5, 101

# Regularization hyperparameters
penalty_type = 'l2'
penalty_value = 1e3

# Target Variables
target_labels, target_label = y_train, 'Student loan'

# Feature Matrix & Initial Coefficients
X_train_features_matrix = count_vectorized_features_to_features_matrix(X_train_count_vectorizer)
initial_coefficients = np.zeros(X_train_features_matrix.shape[1])

print(X_train_features_matrix.shape, initial_coefficients.shape)

l2_regularized_binary_classifier_coeffs2 = train_binary_lr_classifier(
    X_train_features_matrix,
    target_labels,
    target_label,
    initial_coefficients,
    step_size,
    max_iterations,
    penalty_type,
    penalty_value)

(16000, 5001) (5001,)
----------------------------------
Iteration: 0 -> Likelihood value: -18351.684887894677 for Student loan classifier.
Minimum Probability:0.5, Maximum Probability:0.5, Current Accuracy: 0.8033125
----------------------------------
Iteration: 10 -> Likelihood value: -5814.662399479671 for Student loan classifier.
Minimum Probability:6.488224045428997e-09, Maximum Probability:0.9999999999234497, Current Accuracy: 0.901
----------------------------------
Iteration: 20 -> Likelihood value: -4756.739264664686 for Student loan classifier.
Minimum Probability:2.007953255282054e-11, Maximum Probability:0.9999999999998785, Current Accuracy: 0.91625
----------------------------------
Iteration: 30 -> Likelihood value: -4145.150522833872 for Student loan classifier.
Minimum Probability:7.433063284366743e-13, Maximum Probability:0.9999999999998315, Current Accuracy: 0.9284375
----------------------------------
Iteration: 40 -> Likelihood value: -3865.9952237466578 for Student

In [57]:
# Run LR without Penalty type(Unregularized LR)
# TODO: Capture the likelihood values for plotting & comparision with regularized techniques
# Hyper Paramters
step_size, max_iterations =1e-5, 101

# Regularization hyperparameters
penalty_type = 'l2'
penalty_value = 5

# Target Variables
target_labels, target_label = y_train, 'Student loan'

# Feature Matrix & Initial Coefficients
X_train_features_matrix = count_vectorized_features_to_features_matrix(X_train_count_vectorizer)
initial_coefficients = np.zeros(X_train_features_matrix.shape[1])

print(X_train_features_matrix.shape, initial_coefficients.shape)

l2_regularized_binary_classifier_coeffs3 = train_binary_lr_classifier(
    X_train_features_matrix,
    target_labels,
    target_label,
    initial_coefficients,
    step_size,
    max_iterations,
    penalty_type,
    penalty_value)

(16000, 5001) (5001,)
----------------------------------
Iteration: 0 -> Likelihood value: -18351.684887894677 for Student loan classifier.
Minimum Probability:0.5, Maximum Probability:0.5, Current Accuracy: 0.8033125
----------------------------------
Iteration: 10 -> Likelihood value: -6190.004230439016 for Student loan classifier.
Minimum Probability:2.784576699553309e-09, Maximum Probability:0.9999999999849836, Current Accuracy: 0.8908125
----------------------------------
Iteration: 20 -> Likelihood value: -4006.635450087092 for Student loan classifier.
Minimum Probability:7.451573003774772e-14, Maximum Probability:0.9999999999957703, Current Accuracy: 0.9375
----------------------------------
Iteration: 30 -> Likelihood value: -3578.3420588832523 for Student loan classifier.
Minimum Probability:6.158800465192896e-17, Maximum Probability:0.9999999999186562, Current Accuracy: 0.9464375
----------------------------------
Iteration: 40 -> Likelihood value: -3348.866979323787 for Stud

In [49]:
X_test_features_matrix = count_vectorized_features_to_features_matrix(X_test_count_vectorizer)
unregularized_binary_classifier_predictions = predict_probability(
    X_test_features_matrix,
    unregularized_binary_classifier_coeffs) > .5
unregularized_binary_classifier_score = accuracy_score(
    y_test == 'Student loan',
    unregularized_binary_classifier_predictions)
unregularized_binary_classifier_score

0.947

In [50]:
X_test_features_matrix = count_vectorized_features_to_features_matrix(X_test_count_vectorizer)
l2_regularized_binary_classifier_predictions = predict_probability(
    X_test_features_matrix,
    l2_regularized_binary_classifier_coeffs) > .5
l2_regularized_binary_classifier_score = accuracy_score(
    y_test == 'Student loan',
    l2_regularized_binary_classifier_predictions)
l2_regularized_binary_classifier_score

0.945

In [53]:
X_test_features_matrix = count_vectorized_features_to_features_matrix(X_test_count_vectorizer)
l2_regularized_binary_classifier_predictions1 = predict_probability(
    X_test_features_matrix,
    l2_regularized_binary_classifier_coeffs1) > .5
l2_regularized_binary_classifier_score1 = accuracy_score(
    y_test == 'Student loan',
    l2_regularized_binary_classifier_predictions1)
l2_regularized_binary_classifier_score1

0.94675

In [54]:
X_test_features_matrix = count_vectorized_features_to_features_matrix(X_test_count_vectorizer)
l2_regularized_binary_classifier_predictions2 = predict_probability(
    X_test_features_matrix,
    l2_regularized_binary_classifier_coeffs2) > .5
l2_regularized_binary_classifier_score2 = accuracy_score(
    y_test == 'Student loan',
    l2_regularized_binary_classifier_predictions2)
l2_regularized_binary_classifier_score2

0.922

In [58]:
X_test_features_matrix = count_vectorized_features_to_features_matrix(X_test_count_vectorizer)
l2_regularized_binary_classifier_predictions3 = predict_probability(
    X_test_features_matrix,
    l2_regularized_binary_classifier_coeffs3) > .5
l2_regularized_binary_classifier_score3 = accuracy_score(
    y_test == 'Student loan',
    l2_regularized_binary_classifier_predictions3)
l2_regularized_binary_classifier_score3

0.947

In [59]:
(sklearn_binary_classifier_score,
 unregularized_binary_classifier_score,
 l2_regularized_binary_classifier_score,
 l2_regularized_binary_classifier_score1,
 l2_regularized_binary_classifier_score2,
 l2_regularized_binary_classifier_score3)

(0.954, 0.947, 0.945, 0.94675, 0.922, 0.947)

### Model Evaluation

In [36]:
from sklearn.model_selection import cross_val_score

In [37]:
cv_scores = cross_val_score(LogisticRegression(penalty='none', max_iter=101, random_state=RANDOM_STATE),
                            X_train_count_vectorizer,
                            y_train,
                            cv=5)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver 

In [38]:
print(cv_scores.mean())
print(cv_scores)

0.8332499999999999
[0.82875   0.8365625 0.825625  0.841875  0.8334375]


### Quality Metrics

In [39]:
from sklearn.metrics import (accuracy_score,
                             confusion_matrix)

import seaborn as sns;
sns.set()

import matplotlib.pyplot as plt
%matplotlib inline

### Model Evaluation on Test Dataset

- Note: Retrain the model using full training [dataset](../datasets/consumer_complaints_training_dataset.csv) & test using the test [dataset](../datasets/consumer_complaints_test_dataset.csv).

## Homework

## Resources