# Preprocessing

To work with a variety of scikit-learn models, we built the AuthorClassifier class, which provides the base structure for using these machine learning models.

The constructor receives a machine learning model and a vectorizer from sklearn, and can optionally receive a scaler object (such as MinMaxScaler or StandardScaler) and a PCA object. These four elements define the steps of the pipeline, which is created by the fit method.
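The way these four elements could be chained is sketched below. This is a minimal illustration, not the actual AuthorClassifier code: the helper `build_pipeline`, the step names, the `to_dense` function, and the choice of TfidfVectorizer as the default vectorizer are all assumptions made for the example.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler


def to_dense(X):
    """Convert a SciPy sparse matrix to a dense NumPy array."""
    return X.toarray() if hasattr(X, "toarray") else X


def build_pipeline(clf, vectorizer=None, scaler=None, pca=None):
    """Hypothetical helper: assemble the pipeline steps in order."""
    # Assumption: fall back to TF-IDF when no vectorizer is given.
    steps = [("vectorizer", vectorizer if vectorizer is not None else TfidfVectorizer())]
    # The densifying step is only added when a scaler or PCA follows.
    if scaler is not None or pca is not None:
        steps.append(("to_dense", FunctionTransformer(to_dense)))
    if scaler is not None:
        steps.append(("scaler", scaler))
    if pca is not None:
        steps.append(("pca", pca))
    steps.append(("clf", clf))
    return Pipeline(steps)


# Toy data, invented for this sketch only.
texts = ["the cat sat", "the cat ran", "dogs bark loud", "dogs run fast"]
labels = ["u1", "u1", "u2", "u2"]

simple = build_pipeline(LogisticRegression())
full = build_pipeline(LogisticRegression(), scaler=MinMaxScaler(), pca=PCA(n_components=2))

simple_steps = [name for name, _ in simple.steps]
full_steps = [name for name, _ in full.steps]
predictions = full.fit(texts, labels).predict(texts)
```

Keeping the optional steps out of the pipeline when they are not requested means the simplest configuration stays sparse end to end.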

The predict method returns the predicted classes and stores the prediction probabilities, which are useful for calculating the AUC score. The evaluate method calculates (for binary problems only, so far) accuracy, precision, recall, F1-score, and ROC AUC. Precision, recall, and F1-score are reported with macro, micro, and weighted averaging, as well as for each class.
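The metrics listed above can be computed with functions from sklearn.metrics. The sketch below is one possible way to do it, not the evaluate method itself: the function name `evaluate_binary`, its arguments, the result keys, and the toy labels and probabilities are assumptions for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    roc_auc_score,
)


def evaluate_binary(y_true, y_pred, proba_positive, labels):
    """Hypothetical helper: metrics for a two-class problem.

    labels[1] is treated as the positive class for the AUC.
    """
    results = {"accuracy": accuracy_score(y_true, y_pred)}

    # Averaged precision, recall and F1.
    for avg in ("macro", "micro", "weighted"):
        p, r, f, _ = precision_recall_fscore_support(
            y_true, y_pred, labels=labels, average=avg, zero_division=0
        )
        results[f"precision_{avg}"] = p
        results[f"recall_{avg}"] = r
        results[f"f1_{avg}"] = f

    # Per-class precision, recall and F1 (average=None returns arrays).
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, average=None, zero_division=0
    )
    for i, label in enumerate(labels):
        results[f"precision_{label}"] = p[i]
        results[f"recall_{label}"] = r[i]
        results[f"f1_score_{label}"] = f[i]

    # ROC AUC needs scores (probabilities), not hard predictions.
    y_binary = [1 if y == labels[1] else 0 for y in y_true]
    results["auc_score"] = roc_auc_score(y_binary, proba_positive)
    return results


# Toy values, invented for this sketch only.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
proba_b = [0.1, 0.6, 0.8, 0.9]  # predicted probability of class "b"

metrics = evaluate_binary(y_true, y_pred, proba_b, labels=["a", "b"])
```

For single-label classification, micro-averaged precision, recall, and F1 all equal the accuracy, which is why those four values coincide in the outputs printed in this notebook.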

Note: TfidfVectorizer and CountVectorizer output a sparse matrix, so before applying PCA or rescaling we need to transform the features from sparse to dense. The SparseToArray class in the utils.py file does that.
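A transformer of this kind can be quite small. The class below is a sketch of what such a step might look like, assuming it follows the usual scikit-learn fit/transform interface; it is named `SparseToArraySketch` to make clear that it is not the actual SparseToArray implementation from utils.py.

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin


class SparseToArraySketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for SparseToArray: densify sparse input."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step works inside a Pipeline.
        return self

    def transform(self, X):
        # Sparse matrices are converted; dense input is passed through as an array.
        return X.toarray() if sparse.issparse(X) else np.asarray(X)


X_sparse = sparse.csr_matrix([[0.0, 1.0], [2.0, 0.0]])
X_dense = SparseToArraySketch().fit_transform(X_sparse)
```

Note that a dense array stores every zero explicitly, so this step can use a lot of memory when the vocabulary is large.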

In [1]:
import sys
sys.path.insert(1, '../../libs')
from utils import get_data
from autorship import AuthorClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
data = get_data("../../data/authors.csv")
data = data[data.username.isin(data.username.unique()[:2])] #select two authors

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    data.comment, data.username, test_size=0.33, random_state=42)
    
clf = AuthorClassifier(clf=LogisticRegression())
pipe = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(clf.evaluate(y_test, y_pred))

{'author1': 'BluePirate89', 'author2': 'Manada_2', 'precision_author1': 0.8471, 'recall_author1': 0.9028, 'f1_score_author1': 0.8741, 'precision_author2': 0.899, 'recall_author2': 0.8415, 'f1_score_author2': 0.8693, 'precision_weighted': 0.8734, 'precision_micro': 0.8717, 'precision_macro': 0.873, 'recall_weighted': 0.8717, 'recall_micro': 0.8717, 'recall_macro': 0.8721, 'f1_weighted': 0.8716, 'f1_micro': 0.8717, 'f1_macro': 0.8717, 'auc_score': 0.9505, 'accuracy': 0.8717}


Another example, this time passing an explicit vectorizer, a scaler, and a PCA object:

In [4]:
data = get_data("../../data/authors.csv")
data = data[data.username.isin(data.username.unique()[10:12])] #select two authors

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    data.comment, data.username, test_size=0.33, random_state=42)
    
clf = AuthorClassifier(vectorizer=TfidfVectorizer(),
                       clf=RandomForestClassifier(),
                       scaler=MinMaxScaler(),
                       pca=PCA(n_components=0.95))
pipe = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(pipe,"\n")
print(clf.evaluate(y_test, y_pred))

Pipeline(steps=[('TfidfVectorizer()', TfidfVectorizer()),
                ('SparseToArray()', SparseToArray()),
                ('MinMaxScaler()', MinMaxScaler()),
                ('PCA(n_components=0.95)', PCA(n_components=0.95)),
                ('RandomForestClassifier()', RandomForestClassifier())]) 

{'author1': 'CariocaSatanico', 'author2': 'xanax101010', 'precision_author1': 0.7567, 'recall_author1': 0.9688, 'f1_score_author1': 0.8497, 'precision_author2': 0.958, 'recall_author2': 0.6951, 'f1_score_author2': 0.8057, 'precision_weighted': 0.8584, 'precision_micro': 0.8305, 'precision_macro': 0.8573, 'recall_weighted': 0.8305, 'recall_micro': 0.8305, 'recall_macro': 0.832, 'f1_weighted': 0.8275, 'f1_micro': 0.8305, 'f1_macro': 0.8277, 'auc_score': 0.9522, 'accuracy': 0.8305}
