### Conclusion
- Testing 5 vs All, using text tone as percentages
    - Logistic Regression was fitted with Recursive Feature Elimination (RFE), retraining and retesting the model for each number of selected features. Accuracy held at 69% down to the best 12 features and dropped to 49% with 11 features or fewer.
    - The best 12 features are: ['wf_resolved', 'wf_open', 'wf_in_progress', 'wf_reopened', 'wf_validation', 'wf_resolved_under_monitoring', 'wf_closed', 'wf_waiting', 'wf_under_review', 'wf_pending_deployment', 'wf_total_time', 'reporter_terms_count']. Note that no text tone feature is among them, so text tone has no effect in this model.
    - Recall for the "other" label is higher (87%), but its precision is low. The model therefore captures most of the "other" cases, but it also misclassifies a large share of category 5 cases as "other".
- Testing 5 vs All, using text tone as count
    - After RFE, the model was fitted on 35 features and performed slightly better, with 70% accuracy
    - The selected features are: ['issue_comments_count', 'wf_resolved', 'wf_open', 'wfe_open', 'wf_in_progress', 'wfe_in_progress', 'wf_reopened', 'wf_validation', 'wf_resolved_under_monitoring', 'wf_closed', 'wf_waiting', 'wf_under_review', 'wf_pending_deployment', 'turn', 'wf_total_time', 'processing_steps', 'assignee_utterances_count', 'reporter_utterances_count', 'others_utterances_count', 'assignee_terms_count', 'reporter_terms_count', 'others_terms_count', 'utr_assignee_open_close', 'utr_assignee_inform', 'utr_assignee_assignment_update', 'utr_assignee_status_update', 'utr_reporter_request', 'utr_others_open_close', 'utr_others_user_mention', 'utr_others_update_request', 'issue_type_Ticket', 'issue_priority_High', 'issue_priority_Medium', 'utr_reporter_inform', 'utr_reporter_technical']
    - The recall and precision of the model did not change
    - Text tone and related features were among the 35 selected features
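
The feature-count search described in the bullets above can be sketched as a loop that refits RFE for each candidate count and records the test accuracy. This is a minimal illustration, not the notebook's actual procedure: the helper name `rfe_accuracy_sweep`, the synthetic data from `make_classification`, and `max_iter=1000` are all assumptions, and the accuracies it prints are unrelated to the 69%, 70% and 49% figures reported here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def rfe_accuracy_sweep(x_train, y_train, x_test, y_test, feature_counts):
    """Return {n_features: test accuracy} for each requested feature count.

    Hypothetical helper, not part of the notebook's `classifiers` package.
    """
    results = {}
    for n in feature_counts:
        # RFE refits the estimator repeatedly, dropping the weakest feature each round.
        selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n)
        selector.fit(x_train, y_train)
        # RFE.predict reduces x_test to the selected features before predicting.
        results[n] = accuracy_score(y_test, selector.predict(x_test))
    return results


# Synthetic stand-in data (an assumption; not the notebook's 747-record dataset).
x, y = make_classification(n_samples=400, n_features=15, n_informative=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

sweep = rfe_accuracy_sweep(x_train, y_train, x_test, y_test, range(15, 0, -1))
for n, acc in sweep.items():
    print(f"{n:2d} features -> accuracy {acc:.2f}")
```

Reading the printed table from top to bottom shows where accuracy starts to fall, which is how a cut-off such as "12 features" could be chosen.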

## Do general imports

In [1]:
from classifiers.testing import cycle_test,TestType,TestInputs,DatasetFeatures

rfe_n_features_without_text_tone = 9 # done
rfe_n_features_with_text_tone_as_perentages = 21 # done
rfe_n_features_with_text_tone_as_count = 10

### The code

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import statsmodels.api as sm

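def pre_process(x,y):
    # Scale every numeric column to [0, 1]; non-numeric columns are left untouched.
    num_cols = x.select_dtypes(include='number').columns
    x[num_cols] = MinMaxScaler().fit_transform(x[num_cols])
    # Relabel rating 5 as the positive class (1). How the remaining ratings are
    # mapped is not shown here; presumably cycle_test handles that for FIVE_VS_ALL.
    y.loc[y['Q1'] == 5,'Q1'] = 1
    return x,y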

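def fit_and_test_statsm(inputs: TestInputs,features_in):
    # has_constant='add' always adds the intercept column, even if a feature
    # happens to be constant in one split (the default would silently skip it).
    x_train = sm.add_constant(inputs.x_train[features_in], has_constant='add')
    x_test = sm.add_constant(inputs.x_test[features_in], has_constant='add')
    log_reg = sm.Logit(inputs.y_train, x_train).fit(maxiter=100,method='lbfgs')
    print(log_reg.summary())
    predicted_prob = log_reg.predict(x_test)
    # Threshold the predicted probabilities at 0.5 to get 0/1 labels.
    return [int(p > 0.5) for p in predicted_prob]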

def fit_and_test_sklearn(inputs: TestInputs,features_in):
    clf = LogisticRegression().fit(inputs.x_train[features_in],inputs.y_train.iloc[:,0])
    predicted = clf.predict(inputs.x_test[features_in])
    print(clf.get_params(True))
    # for i,f in enumerate(features_in):
    #     coef = '{0:.5f}'.format(clf.coef_[0][i])
    #     print(f'{f}')
    # inter = '{0:.5f}'.format(clf.intercept_[0])
    # print(f'intercept {inter}')
    return predicted

def fit_and_test(inputs: TestInputs):    
    lr_model = LogisticRegression()
    rfe_n = rfe_n_features_without_text_tone
    if inputs.dataset_features == DatasetFeatures.WITH_TEXT_TONE_AS_COUNTS:
        rfe_n = rfe_n_features_with_text_tone_as_count
    if inputs.dataset_features == DatasetFeatures.WITH_TEXT_TONE_AS_PERCENTAGES:
        rfe_n = rfe_n_features_with_text_tone_as_perentages

    # Use RFE to select the top rfe_n features
    rfe = RFE(lr_model, n_features_to_select=rfe_n)
    rfe.fit(inputs.x_train, inputs.y_train.iloc[:,0])

    # Collect the names of the selected features
    features_in = list(inputs.x_train.columns[rfe.support_])
    return fit_and_test_sklearn(inputs, features_in)
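
Because `cycle_test` and `TestInputs` come from the project's own `classifiers.testing` module, the cell above cannot be run in isolation. The following is a rough, self-contained sketch of the same flow (5 vs. all labels, min-max scaling, RFE, refit on the selected columns) on synthetic data. The column names `f0` to `f5`, the random ratings, and the choice of 3 features are all made up for illustration. Fitting the scaler on the training split only is a design choice of this sketch; whether the notebook's `pre_process` runs before or after the split is not visible here.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 6 numeric features and a 1-5 rating "Q1".
features = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f"f{i}" for i in range(6)])
rating = pd.Series(rng.integers(1, 6, size=300), name="Q1")
# Shift f0 for rating-5 rows so that at least one feature carries signal.
features["f0"] += (rating == 5) * 2.0

# 5 vs. all: rating 5 becomes the positive class (1), every other rating becomes 0.
target = (rating == 5).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=0, stratify=target
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = MinMaxScaler().fit(x_train)
x_train = pd.DataFrame(scaler.transform(x_train), columns=features.columns, index=x_train.index)
x_test = pd.DataFrame(scaler.transform(x_test), columns=features.columns, index=x_test.index)

# Select 3 features with RFE; support_ is a boolean mask over the columns.
rfe = RFE(LogisticRegression(), n_features_to_select=3).fit(x_train, y_train)
selected = list(x_train.columns[rfe.support_])

# Refit a plain logistic regression on the selected columns and report per-class metrics.
clf = LogisticRegression().fit(x_train[selected], y_train)
report = classification_report(y_test, clf.predict(x_test[selected]), zero_division=0)
print(selected)
print(report)
```

The report has the same layout as the output of the test cell: one row per class (0 for "all others", 1 for rating 5) followed by accuracy and the averaged scores.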

### The test

In [3]:
cycle_test('Logistic Regression',fit_and_test,pre_processor=pre_process,test_type=TestType.FIVE_VS_ALL)

start test Logistic Regression to test 5 Vs All
Total records in dataset 747
{'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
5 Vs All - Without DA features
              precision    recall  f1-score   support

           0       0.88      0.61      0.72        70
           1       0.80      0.95      0.86       111

    accuracy                           0.82       181
   macro avg       0.84      0.78      0.79       181
weighted avg       0.83      0.82      0.81       181

{'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
5 Vs All - 