<h1>Model Training</h1>
<h3>
    <ol>
        <li>
            Accessing Training and Testing Data
        </li>
        <li>
            Training Various ML Models
            <ul>
                <li>Logistic Regression</li>
                <li>Linear Discriminant Analysis</li>
                <li>K-Nearest Neighbors</li>
                <li>Naive Bayes</li>
                <li>Decision Tree Classifier</li>
                <li>Support Vector Machine</li>
            </ul>
        </li>
        <li>
            Tabulating the Results of the Training
        </li>
    </ol>
</h3>

***

<h4>Accessing Training and Testing Data</h4>

In [1]:
import pandas as pd

# load the pickled training and test DataFrames
train_data = pd.read_pickle('../Data/model/train_data.pkl')
test_data = pd.read_pickle('../Data/model/test_data.pkl')

In [2]:
# training set: features are all columns but the last, the label is the last column
X_train = train_data.iloc[:, :-1]
y_train = train_data.iloc[:, -1]

In [3]:
X_train.shape

(384, 500)

In [4]:
y_train.shape

(384,)

In [5]:
# test set: features are all columns but the last, the label is the last column
X_test = test_data.iloc[:, :-1]
y_test = test_data.iloc[:, -1]

In [6]:
X_test.shape

(96, 500)

In [7]:
y_test.shape

(96,)

In [8]:
def get_frequency(data_set: pd.Series) -> float:
    """Return the percentage of labels that are not equal to 1."""
    return (data_set != 1).values.sum() / len(data_set) * 100

print(get_frequency(y_train))
print(get_frequency(y_test))

50.0
50.0
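
The helper above reports only the share of labels that differ from 1. A per-class breakdown is a slightly more general balance check. The following is a minimal sketch using only the standard library; `class_percentages` and the toy labels are hypothetical and not part of the notebook.

```python
from collections import Counter

def class_percentages(labels):
    """Return {label: percentage of samples carrying that label}."""
    labels = list(labels)
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * count / total for label, count in counts.items()}

# Toy labels standing in for y_train (assumed encoding: 0.0 and 1.0).
toy_labels = [0.0, 1.0, 1.0, 0.0]
print(class_percentages(toy_labels))
```

On a pandas Series, `y_train.value_counts(normalize=True)` gives the same information as fractions.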


<h4>Training Various ML Models</h4>

In [9]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix

In [13]:
import numpy as np

columns = index = ['scientific', 'conspiracy']

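def get_classification_report_html(model_predictions: np.ndarray) -> str:
    """Render the test-set classification report as an HTML table."""
    report = classification_report(y_test, model_predictions, output_dict=True)
    df = pd.DataFrame(report).transpose()
    return df.to_html()

def get_confusion_matrix_html(model_predictions: np.ndarray) -> str:
    """Render the test-set confusion matrix (rows: true class, columns: predicted class) as an HTML table."""
    matrix = confusion_matrix(y_test, model_predictions)
    return pd.DataFrame(matrix, index=index, columns=columns).to_html()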

<h5>Logistic Regression</h5>

In [10]:
from sklearn.linear_model import LogisticRegression

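# Note: not every solver/penalty pair below is valid (e.g. 'newton-cg' with 'l1'),
# so some sampled candidates fail to fit and receive error_score instead.
lg_grid = {
    "penalty": ['none', 'l1', 'l2', 'elasticnet'],
    "tol": [1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3],
    "C": [100, 10, 1.0, 0.1, 0.01],
    "solver": ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    "max_iter": [50, 100, 500, 1000, 5000, 10000, 20000]
}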

lg_search = RandomizedSearchCV(
    estimator = LogisticRegression(class_weight='balanced', random_state = 1),
    param_distributions = lg_grid,
    random_state = 1,
    n_iter = 60,
    return_train_score = True,
    error_score = 0
)

lg_model = lg_search.fit(X_train, y_train)
print(lg_search.best_params_)
print(lg_search.best_score_)



{'tol': 0.001, 'solver': 'liblinear', 'penalty': 'l2', 'max_iter': 500, 'C': 0.1}
0.7760765550239235


130 fits failed out of a total of 300.
The score on these train-test partitions for these parameters will be set to 0.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/isobarbaric/miniforge3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/isobarbaric/miniforge3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1091, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/Users/isobarbaric/miniforge3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 61, in _check_solver
    raise ValueError(
ValueError: Solver newton-cg supports only 'l2' or 'none' penaltie
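
Many of the failed fits come from sampling solver/penalty pairs that scikit-learn rejects, as the traceback shows for `newton-cg`. One way to avoid them is to split the search space into groups that contain only compatible pairs, since `RandomizedSearchCV` accepts a list of dictionaries for `param_distributions`. The sketch below only builds that list. The grouping follows the documented solver support for the scikit-learn release in the traceback (where `'none'` is still a string) and should be checked against the installed version. `elasticnet` is left out because it additionally requires `l1_ratio`. The names `shared`, `lg_param_groups`, and `valid_pairs` are assumptions.

```python
# Settings shared by every group (values copied from lg_grid).
shared = {
    "tol": [1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3],
    "C": [100, 10, 1.0, 0.1, 0.01],
    "max_iter": [50, 100, 500, 1000, 5000, 10000, 20000],
}

# Each group lists only solver/penalty pairs that can be fitted together.
lg_param_groups = [
    {**shared, "solver": ["newton-cg", "lbfgs", "sag"], "penalty": ["l2", "none"]},
    {**shared, "solver": ["liblinear"], "penalty": ["l1", "l2"]},
    {**shared, "solver": ["saga"], "penalty": ["l1", "l2", "none"]},
]

# Distinct solver/penalty pairs covered by the groups.
valid_pairs = {
    (solver, penalty)
    for group in lg_param_groups
    for solver in group["solver"]
    for penalty in group["penalty"]
}
print(len(valid_pairs))
```

Passing `param_distributions=lg_param_groups` to the search would sample only from these groups. The scores recorded in this notebook were obtained with the original grid, so a rerun may select different parameters.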

In [11]:
lg_pred = lg_model.predict(X_test)

In [12]:
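# Note: classification_report expects (y_true, y_pred); with the order used here,
# the precision and recall columns are swapped and support counts predictions.
print(classification_report(lg_pred, y_test))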

              precision    recall  f1-score   support

         0.0       0.75      0.90      0.82        40
         1.0       0.92      0.79      0.85        56

    accuracy                           0.83        96
   macro avg       0.83      0.84      0.83        96
weighted avg       0.85      0.83      0.83        96
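
scikit-learn's `classification_report` takes the true labels first and the predictions second. The cell above passes the predictions first, so its precision and recall columns are exchanged and its support column counts predicted labels (40 and 56) rather than the 48 true samples per class implied by the balance check. The effect can be seen with a minimal standard-library sketch; `precision_recall` and the toy labels are hypothetical.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, using the (y_true, y_pred) order."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # predicted as `label`
    actual = sum(1 for t in y_true if t == label)     # truly `label`
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

print(precision_recall(y_true, y_pred, 0))  # conventional order
print(precision_recall(y_pred, y_true, 0))  # swapped order: the two values trade places
```

Accuracy and F1 are symmetric in the two arguments, so those figures are unaffected by the order.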



<h5>Linear Discriminant Analysis</h5>

In [22]:
# linear discriminant analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

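lda_grid = {
    "solver": ['svd', 'lsqr', 'eigen'],
    # Note: 'float' is not a valid shrinkage value (scikit-learn expects None,
    # 'auto', or a number between 0 and 1), and 'svd' does not support shrinkage,
    # so only the 'auto' candidates with 'lsqr' or 'eigen' can be fitted.
    "shrinkage": ['auto', 'float']
}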

lda_search = RandomizedSearchCV(
    estimator = LinearDiscriminantAnalysis(),
    param_distributions = lda_grid,
    return_train_score = True,
    random_state = 1,
    n_iter = 60
)

lda_model = lda_search.fit(X_train, y_train)
print(lda_model.best_params_)
print(lda_model.best_score_)



{'solver': 'lsqr', 'shrinkage': 'auto'}
0.7447710184552291


20 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/isobarbaric/miniforge3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/isobarbaric/miniforge3/lib/python3.9/site-packages/sklearn/discriminant_analysis.py", line 589, in fit
    raise NotImplementedError("shrinkage not supported")
NotImplementedError: shrinkage not supported

--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/isobarbar
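
The failures above leave only two of the six candidates fitted. A search space that keeps `svd` free of shrinkage and gives the other solvers real shrinkage values could be sketched as follows. The numeric shrinkage values are arbitrary illustrations, and `lda_param_groups` and `expand` are hypothetical names.

```python
from itertools import product

lda_param_groups = [
    {"solver": ["svd"]},  # svd does not support shrinkage
    {"solver": ["lsqr", "eigen"], "shrinkage": [None, "auto", 0.1, 0.5, 0.9]},
]

def expand(groups):
    """List every concrete parameter combination described by the groups."""
    combos = []
    for group in groups:
        keys = list(group)
        for values in product(*(group[key] for key in keys)):
            combos.append(dict(zip(keys, values)))
    return combos

candidates = expand(lda_param_groups)
print(len(candidates))
```

`RandomizedSearchCV` accepts such a list of dictionaries directly as `param_distributions`.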

In [23]:
lda_pred = lda_model.predict(X_test)

In [24]:
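# Note: classification_report expects (y_true, y_pred); with the order used here,
# the precision and recall columns are swapped and support counts predictions.
print(classification_report(lda_pred, y_test))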

              precision    recall  f1-score   support

         0.0       0.83      0.78      0.81        51
         1.0       0.77      0.82      0.80        45

    accuracy                           0.80        96
   macro avg       0.80      0.80      0.80        96
weighted avg       0.80      0.80      0.80        96



<h5>K-Nearest Neighbors</h5>

In [26]:
# k-nearest neighbors
from sklearn.neighbors import KNeighborsClassifier

knn_grid = {
    "n_neighbors": [5, 10, 15],
    "weights": ['uniform', 'distance'],
    "algorithm": ['auto', 'ball_tree', 'kd_tree', 'brute'],
    "leaf_size": [15, 30, 60, 90]
}

knn_search = RandomizedSearchCV( 
    estimator = KNeighborsClassifier(),
    param_distributions = knn_grid,
    return_train_score = True,
    random_state = 1,
    n_iter = 60
)

knn_model = knn_search.fit(X_train, y_train)

In [27]:
knn_pred = knn_search.predict(X_test)

In [28]:
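# Note: classification_report expects (y_true, y_pred); with the order used here,
# the precision and recall columns are swapped and support counts predictions.
print(classification_report(knn_pred, y_test))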

              precision    recall  f1-score   support

         0.0       0.62      0.77      0.69        39
         1.0       0.81      0.68      0.74        57

    accuracy                           0.72        96
   macro avg       0.72      0.73      0.72        96
weighted avg       0.74      0.72      0.72        96



<h5>Naive Bayes</h5>

In [29]:
# naive-bayes
from sklearn.naive_bayes import GaussianNB

nb_grid = {
    "var_smoothing": [1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5]
}

nb_search = RandomizedSearchCV(
    estimator=GaussianNB(),
    param_distributions=nb_grid,
    return_train_score=True,
    random_state=1,
    n_jobs=-1,
    n_iter=50
)

nb_result = nb_search.fit(X_train, y_train)
nb_pred = nb_result.predict(X_test)



In [30]:
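# Note: classification_report expects (y_true, y_pred); with the order used here,
# the precision and recall columns are swapped and support counts predictions.
print(classification_report(nb_pred, y_test))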

              precision    recall  f1-score   support

         0.0       0.90      0.61      0.72        71
         1.0       0.42      0.80      0.55        25

    accuracy                           0.66        96
   macro avg       0.66      0.70      0.64        96
weighted avg       0.77      0.66      0.68        96



<h5>Decision Tree Classifier</h5>

In [31]:
# decision tree
from sklearn.tree import DecisionTreeClassifier

dt_grid = {
    "criterion": ['gini', 'entropy', 'log_loss'],
    "splitter": ['best', 'random'],
    "max_depth": [5, 10, 25, 50, 100, 500],
    "max_features": ['auto', 'sqrt', 'log2']
}

dt_search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_distributions=dt_grid,
    return_train_score=True,
    random_state=1,
    n_jobs=-1,
    n_iter=25
)
dt_result = dt_search.fit(X_train, y_train)
dt_pred = dt_result.predict(X_test)



In [32]:
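# Note: classification_report expects (y_true, y_pred); with the order used here,
# the precision and recall columns are swapped and support counts predictions.
print(classification_report(dt_pred, y_test))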

              precision    recall  f1-score   support

         0.0       0.65      0.78      0.70        40
         1.0       0.81      0.70      0.75        56

    accuracy                           0.73        96
   macro avg       0.73      0.74      0.73        96
weighted avg       0.74      0.73      0.73        96



<h5>Support Vector Machine</h5>

In [33]:
from sklearn.svm import SVC

# Note: SVC expects the kernel names 'poly' and 'rbf'; 'polynomial' and
# 'gaussian' are not recognized, so candidates that use them fail to fit.
svc_grid = {
    "kernel": ['polynomial', 'gaussian', 'linear'],
    "C": [100, 10, 1.0, 0.1, 0.01],
    "gamma": ['scale', 'auto'],
    "degree": [2, 3, 4, 5, 6],
    "shrinking": [True, False]
}

svc_search = RandomizedSearchCV(
    estimator=SVC(random_state=1),
    param_distributions=svc_grid,
    return_train_score=True,
    random_state=1,
    n_jobs=-1,
    n_iter=25
)
svc_result = svc_search.fit(X_train, y_train)
svc_pred = svc_result.predict(X_test)

75 fits failed out of a total of 125.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
55 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/isobarbaric/miniforge3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/isobarbaric/miniforge3/lib/python3.9/site-packages/sklearn/svm/_base.py", line 251, in fit
    fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
  File "/Users/isobarbaric/miniforge3/lib/python3.9/site-packages/sklearn/svm/_base.py", line 333, in _dense_fit
    ) = libsvm.fit(
  File "sklearn/svm/_libsvm.pyx", line 176, in sklearn.svm._libsvm.fit
ValueError: 'gaussian'
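
The `ValueError: 'gaussian'` arises because `SVC` identifies kernels by short names: the polynomial kernel is `'poly'` and the Gaussian (radial basis function) kernel is `'rbf'`. Judging by the failure counts, only the `linear` candidates appear to have been scored. A small guard that translates informal names before they reach the search might look like this sketch; `KERNEL_ALIASES` and `normalize_kernel` are hypothetical helpers, not scikit-learn API.

```python
# String kernel identifiers accepted by sklearn.svm.SVC.
VALID_KERNELS = {"linear", "poly", "rbf", "sigmoid", "precomputed"}

# Informal names mapped to the identifiers above (assumed aliases).
KERNEL_ALIASES = {"polynomial": "poly", "gaussian": "rbf"}

def normalize_kernel(name):
    """Translate an informal kernel name and reject unknown ones early."""
    kernel = KERNEL_ALIASES.get(name, name)
    if kernel not in VALID_KERNELS:
        raise ValueError(f"unknown SVC kernel: {name!r}")
    return kernel

requested = ["polynomial", "gaussian", "linear"]
print([normalize_kernel(name) for name in requested])
```

Using the normalized list as the `"kernel"` entry of the grid would let the polynomial and RBF candidates be fitted. The scores recorded in this notebook would then no longer apply.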

In [34]:
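# Note: classification_report expects (y_true, y_pred); with the order used here,
# the precision and recall columns are swapped and support counts predictions.
print(classification_report(svc_pred, y_test))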

              precision    recall  f1-score   support

         0.0       0.73      0.81      0.77        43
         1.0       0.83      0.75      0.79        53

    accuracy                           0.78        96
   macro avg       0.78      0.78      0.78        96
weighted avg       0.79      0.78      0.78        96



<h4>Tabulating the Results of the Training</h4>

In [35]:
from sklearn.metrics import classification_report, confusion_matrix

<style>
  .center {
    display: block;
    margin-left: auto;
    margin-right: auto;
  }
</style>
<img src="https://miro.medium.com/max/1838/1*fxiTNIgOyvAombPJx5KGeA.png"  height="300" class='center'/>

<h5>Generating Predictions</h5>

In [36]:
import json
from bag_of_words import BagOfWords
import pandas as pd

def build_predict_dataframe(test_content: str) -> pd.DataFrame:
    """Build a one-row DataFrame with one column per word in relevant_words.json."""
    with open('../Data/model/relevant_words.json') as f:
        relevant_words = json.load(f)

    current_test = BagOfWords(test_content, None)

    # words that do not occur in the text get a frequency of 0
    cols = {}
    for word in relevant_words:
        cols[word] = [current_test.freq_chart[word] if word in current_test.freq_chart else 0]

    return pd.DataFrame(data=cols)
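
The function above projects a new text onto the fixed vocabulary the models were trained on, so the resulting row has the same columns as `X_train`. `BagOfWords` is a project-local class whose tokenization is not shown here, so the following standard-library sketch only illustrates the idea under an assumed tokenization (lowercased runs of letters and apostrophes); `vectorize` and the toy vocabulary are hypothetical.

```python
import re
from collections import Counter

def vectorize(text, vocabulary):
    """Count how often each vocabulary word occurs in text, in vocabulary order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur in the text.
    return [counts[word] for word in vocabulary]

vocabulary = ["vaccine", "court", "data"]
print(vectorize("Vaccine data and more vaccine data for the Court.", vocabulary))
```

Keeping the vocabulary order fixed is what allows a model fitted on the training matrix to score the new row.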

In [37]:
s = "Jacob Puliyel, MD, a pediatrician in India for more than 40 years, brought suit in the Supreme Court of India against the Union of India and COVID-19 vaccine manufacturers in a legal challenge to the country\u2019s COVID vaccine program.1 Dr. Puliyel, who has served as Director of Research and Projects at Holy Family Hospital in Delhi and is a former member of the National Technical Advisory Group (NTAG) on immunizations in India, sued the government and COVID vaccine manufacturers seeking release of information related to the COVID vaccine approval process, as well as arguing for a policy change that allows unvaccinated persons to enter public spaces and access resources.\n\nSpecifically, Dr. Puliyel asked the Supreme Court of India for the release of each phase of clinical trial data for the COVID vaccines administered in India; disclosure of minutes from the meeting of the Subject Expert Committee and the NTGAI with regard to vaccines; release of information surrounding the approval or rejection of emergency use applications of vaccines by the Drugs Controller General of India (DCGI); and disclosure of post vaccination data related to COVID."

In [38]:
p_mat = build_predict_dataframe(s)

In [40]:
len(p_mat.columns)

500

In [42]:
lg_model.predict(p_mat)

array([0.])

In [43]:
nb_result.predict_proba(p_mat)

array([[1.00000000e+000, 3.39487718e-158]])

Table to Be Added

<table>
    <thead>
        <tr>
            <th></th>
            <th>Confusion Matrix</th>
            <th>Classification Report</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>Logistic Regression</th>
            <td>
                <table border="1" class="dataframe">
                    <thead>
                        <tr style="text-align: right;">
                        <th></th>
                        <th>positive</th>
                        <th>negative</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                        <th>positive</th>
                        <td>34</td>
                        <td>18</td>
                        </tr>
                        <tr>
                        <th>negative</th>
                        <td>5</td>
                        <td>47</td>
                        </tr>
                    </tbody>
                </table>
            </td>
            <td>
                <table border="1" class="dataframe">
                    <thead>
                        <tr style="text-align: right;">
                        <th></th>
                        <th>precision</th>
                        <th>recall</th>
                        <th>f1-score</th>
                        <th>support</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                        <th>0.0</th>
                        <td>0.871795</td>
                        <td>0.653846</td>
                        <td>0.747253</td>
                        <td>52.000000</td>
                        </tr>
                        <tr>
                        <th>1.0</th>
                        <td>0.723077</td>
                        <td>0.903846</td>
                        <td>0.803419</td>
                        <td>52.000000</td>
                        </tr>
                        <tr>
                        <th>accuracy</th>
                        <td>0.778846</td>
                        <td>0.778846</td>
                        <td>0.778846</td>
                        <td>0.778846</td>
                        </tr>
                        <tr>
                        <th>macro avg</th>
                        <td>0.797436</td>
                        <td>0.778846</td>
                        <td>0.775336</td>
                        <td>104.000000</td>
                        </tr>
                        <tr>
                        <th>weighted avg</th>
                        <td>0.797436</td>
                        <td>0.778846</td>
                        <td>0.775336</td>
                        <td>104.000000</td>
                        </tr>
                    </tbody>
                </table>            
            </td>
        </tr>
        <tr>
            <th>Linear Discriminant Analysis</th>
            <td>
                <table border="1" class="dataframe">
                    <thead>
                        <tr style="text-align: right;">
                        <th></th>
                        <th>positive</th>
                        <th>negative</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                        <th>positive</th>
                        <td>29</td>
                        <td>23</td>
                        </tr>
                        <tr>
                        <th>negative</th>
                        <td>11</td>
                        <td>41</td>
                        </tr>
                    </tbody>
                </table>
            </td>
            <td>
                <table border="1" class="dataframe">
                    <thead>
                        <tr style="text-align: right;">
                        <th></th>
                        <th>precision</th>
                        <th>recall</th>
                        <th>f1-score</th>
                        <th>support</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                        <th>0.0</th>
                        <td>0.725000</td>
                        <td>0.557692</td>
                        <td>0.630435</td>
                        <td>52.000000</td>
                        </tr>
                        <tr>
                        <th>1.0</th>
                        <td>0.640625</td>
                        <td>0.788462</td>
                        <td>0.706897</td>
                        <td>52.000000</td>
                        </tr>
                        <tr>
                        <th>accuracy</th>
                        <td>0.673077</td>
                        <td>0.673077</td>
                        <td>0.673077</td>
                        <td>0.673077</td>
                        </tr>
                        <tr>
                        <th>macro avg</th>
                        <td>0.682813</td>
                        <td>0.673077</td>
                        <td>0.668666</td>
                        <td>104.000000</td>
                        </tr>
                        <tr>
                        <th>weighted avg</th>
                        <td>0.682812</td>
                        <td>0.673077</td>
                        <td>0.668666</td>
                        <td>104.000000</td>
                        </tr>
                    </tbody>
                </table>
            </td>
        </tr>
        <tr>
            <th>K-Nearest Neighbors</th>
            <td>
                <table border="1" class="dataframe">
                <thead>
                    <tr style="text-align: right;">
                    <th></th>
                    <th>positive</th>
                    <th>negative</th>
                    </tr>
                </thead>
                <tbody>
                    <tr>
                    <th>positive</th>
                    <td>23</td>
                    <td>29</td>
                    </tr>
                    <tr>
                    <th>negative</th>
                    <td>20</td>
                    <td>32</td>
                    </tr>
                </tbody>
                </table>
            </td>
            <td>
                <table border="1" class="dataframe">
                    <thead>
                        <tr style="text-align: right;">
                        <th></th>
                        <th>precision</th>
                        <th>recall</th>
                        <th>f1-score</th>
                        <th>support</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                        <th>0.0</th>
                        <td>0.534884</td>
                        <td>0.442308</td>
                        <td>0.484211</td>
                        <td>52.000000</td>
                        </tr>
                        <tr>
                        <th>1.0</th>
                        <td>0.524590</td>
                        <td>0.615385</td>
                        <td>0.566372</td>
                        <td>52.000000</td>
                        </tr>
                        <tr>
                        <th>accuracy</th>
                        <td>0.528846</td>
                        <td>0.528846</td>
                        <td>0.528846</td>
                        <td>0.528846</td>
                        </tr>
                        <tr>
                        <th>macro avg</th>
                        <td>0.529737</td>
                        <td>0.528846</td>
                        <td>0.525291</td>
                        <td>104.000000</td>
                        </tr>
                        <tr>
                        <th>weighted avg</th>
                        <td>0.529737</td>
                        <td>0.528846</td>
                        <td>0.525291</td>
                        <td>104.000000</td>
                        </tr>
                    </tbody>
                </table>
            </td>
        </tr>
        <tr>
            <th>Naive Bayes</th>
            <td>
                <table border="1" class="dataframe">
                <thead>
                    <tr style="text-align: right;">
                    <th></th>
                    <th>positive</th>
                    <th>negative</th>
                    </tr>
                </thead>
                <tbody>
                    <tr>
                    <th>positive</th>
                    <td>37</td>
                    <td>15</td>
                    </tr>
                    <tr>
                    <th>negative</th>
                    <td>7</td>
                    <td>45</td>
                    </tr>
                </tbody>
                </table>
            </td>
            <td>
                <table border="1" class="dataframe">
                    <thead>
                        <tr style="text-align: right;">
                        <th></th>
                        <th>precision</th>
                        <th>recall</th>
                        <th>f1-score</th>
                        <th>support</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                        <th>0.0</th>
                        <td>0.840909</td>
                        <td>0.711538</td>
                        <td>0.770833</td>
                        <td>52.000000</td>
                        </tr>
                        <tr>
                        <th>1.0</th>
                        <td>0.750000</td>
                        <td>0.865385</td>
                        <td>0.803571</td>
                        <td>52.000000</td>
                        </tr>
                        <tr>
                        <th>accuracy</th>
                        <td>0.788462</td>
                        <td>0.788462</td>
                        <td>0.788462</td>
                        <td>0.788462</td>
                        </tr>
                        <tr>
                        <th>macro avg</th>
                        <td>0.795455</td>
                        <td>0.788462</td>
                        <td>0.787202</td>
                        <td>104.000000</td>
                        </tr>
                        <tr>
                        <th>weighted avg</th>
                        <td>0.795455</td>
                        <td>0.788462</td>
                        <td>0.787202</td>
                        <td>104.000000</td>
                        </tr>
                    </tbody>
                </table>
            </td>
        </tr>
        <tr>
            <th>Decision Tree Classifier</th>
            <td>
                <table border="1" class="dataframe">
                    <thead>
                        <tr style="text-align: right;">
                        <th></th>
                        <th>positive</th>
                        <th>negative</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                        <th>positive</th>
                        <td>33</td>
                        <td>19</td>
                        </tr>
                        <tr>
                        <th>negative</th>
                        <td>17</td>
                        <td>35</td>
                        </tr>
                    </tbody>
                </table>
            </td>
            <td>
                <table border="1" class="dataframe">
                    <thead>
                        <tr style="text-align: right;">
                        <th></th>
                        <th>precision</th>
                        <th>recall</th>
                        <th>f1-score</th>
                        <th>support</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                        <th>0.0</th>
                        <td>0.660000</td>
                        <td>0.634615</td>
                        <td>0.647059</td>
                        <td>52.000000</td>
                        </tr>
                        <tr>
                        <th>1.0</th>
                        <td>0.648148</td>
                        <td>0.673077</td>
                        <td>0.660377</td>
                        <td>52.000000</td>
                        </tr>
                        <tr>
                        <th>accuracy</th>
                        <td>0.653846</td>
                        <td>0.653846</td>
                        <td>0.653846</td>
                        <td>0.653846</td>
                        </tr>
                        <tr>
                        <th>macro avg</th>
                        <td>0.654074</td>
                        <td>0.653846</td>
                        <td>0.653718</td>
                        <td>104.000000</td>
                        </tr>
                        <tr>
                        <th>weighted avg</th>
                        <td>0.654074</td>
                        <td>0.653846</td>
                        <td>0.653718</td>
                        <td>104.000000</td>
                        </tr>
                    </tbody>
                </table>
            </td>
        </tr>     
        <tr>
            <th>Support Vector Machine</th>
            <td>
                <table border="1" class="dataframe">
                    <thead>
                        <tr style="text-align: right;">
                        <th></th>
                        <th>positive</th>
                        <th>negative</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                        <th>positive</th>
                        <td>32</td>
                        <td>20</td>
                        </tr>
                        <tr>
                        <th>negative</th>
                        <td>10</td>
                        <td>42</td>
                        </tr>
                    </tbody>
                </table>            
            </td>
            <td>
                <table border="1" class="dataframe">
                    <thead>
                        <tr style="text-align: right;">
                        <th></th>
                        <th>precision</th>
                        <th>recall</th>
                        <th>f1-score</th>
                        <th>support</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                        <th>0.0</th>
                        <td>0.761905</td>
                        <td>0.615385</td>
                        <td>0.680851</td>
                        <td>52.000000</td>
                        </tr>
                        <tr>
                        <th>1.0</th>
                        <td>0.677419</td>
                        <td>0.807692</td>
                        <td>0.736842</td>
                        <td>52.000000</td>
                        </tr>
                        <tr>
                        <th>accuracy</th>
                        <td>0.711538</td>
                        <td>0.711538</td>
                        <td>0.711538</td>
                        <td>0.711538</td>
                        </tr>
                        <tr>
                        <th>macro avg</th>
                        <td>0.719662</td>
                        <td>0.711538</td>
                        <td>0.708847</td>
                        <td>104.000000</td>
                        </tr>
                        <tr>
                        <th>weighted avg</th>
                        <td>0.719662</td>
                        <td>0.711538</td>
                        <td>0.708847</td>
                        <td>104.000000</td>
                        </tr>
                    </tbody>
                </table>
            </td>
        </tr>
    </tbody>
</table>
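
Each classification report in the table can be recomputed from the confusion matrix beside it, which is a quick consistency check. The sketch below assumes the layout returned by `sklearn.metrics.confusion_matrix` (rows are true classes, columns are predicted classes) and uses the Logistic Regression matrix from the table as input; `report_from_confusion` is a hypothetical helper.

```python
def report_from_confusion(matrix):
    """Per-class precision/recall and overall accuracy from a square confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n_classes))
    report = {}
    for i in range(n_classes):
        predicted_i = sum(matrix[r][i] for r in range(n_classes))  # column sum
        actual_i = sum(matrix[i])                                  # row sum
        report[i] = {
            "precision": matrix[i][i] / predicted_i if predicted_i else 0.0,
            "recall": matrix[i][i] / actual_i if actual_i else 0.0,
        }
    report["accuracy"] = correct / total
    return report

# Logistic Regression confusion matrix from the table.
lr = report_from_confusion([[34, 18], [5, 47]])
print(round(lr[0]["precision"], 6), round(lr[0]["recall"], 6), round(lr["accuracy"], 6))
```

For this matrix the three printed values match the 0.871795, 0.653846, and 0.778846 entries of the Logistic Regression report.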