# Overview

We want to find the best-performing classification model. The models are trained to determine whether a given page of a PDF is a map (also called an alignment sheet) or not.


Once the necessary libraries are imported, the notebook performs the following steps:

- <strong>Load labeled data: </strong>
Generate the features using the "extract_features" function.  

- <strong>Train test split: </strong>
Split the dataset into test and train set. 

- <strong>Prepare validation set: </strong>
Create validation dataframe.

- <strong>Implement classification models: </strong>
Train various classification models, then compute the accuracy score and confusion matrix for the test and validation sets.  

- <strong>Compare models: </strong>
Compare the accuracy score and the confusion matrix and save the best model for future use. 

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm, tree
import xgboost
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import random

In [4]:
from feature_extraction import extract_features

# Use the parent of the current working directory as the project root
path = os.path.abspath('..')

### Load labeled data 

Here we have the PDF names (or Data IDs), and we have manually marked which pages of these PDFs are maps. First, we create the feature dataframe by extracting the features of each page with the function "extract_features". Then, using the manual marks, we create the dataframe for the dependent variable: 1 if the page is a map, 0 otherwise. The column "dataID_pageNo" is only used to identify a particular page. 
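
The labeling step can be pictured with a small, self-contained sketch. It assumes that each PDF is described by its ID and page count and that page numbers are 1-based, as in `Pages`. The helper name `label_pages` and the page counts below are made up for illustration and are not part of this notebook.

```python
def label_pages(page_counts, map_pages):
    """Return (page_id, label) pairs, one per page of every PDF.

    page_counts: list of (data_id, number_of_pages) tuples.
    map_pages:   list of page-number collections, in the same order as page_counts.
    """
    rows = []
    for (data_id, n_pages), maps in zip(page_counts, map_pages):
        maps = set(maps)  # works for lists and ranges alike
        for page in range(1, n_pages + 1):
            rows.append((f"{data_id}_{page}", 1 if page in maps else 0))
    return rows


# Toy input: hypothetical page counts, not those of the real PDFs.
rows = label_pages([(268712, 5), (486221, 3)], [[3, 4], range(1, 3)])
for page_id, label in rows:
    print(page_id, label)
```

Because labels are matched to PDFs by position, the order of the page lists has to follow the order of the Data IDs exactly.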

In [7]:
path_pdf = path + "\\TrainingSet\\"

DataIDHand = [268712,  486221, 500633, 
               555093, 684494, 
              895015, 2392922, 2445549, 2758927,
              2813701,  
              2967854, 2968069,  
              3891802,
              4036098]
Pages = [[3,4,5,8,9,10,14,15,24,25,26], range(1,5),  [5,9], 
          [6,9,33,34], [12,13,14],
         range(1,11), [1], range(1,4),  [9],
         [40, 92, 95, 143, 170, 180, 216, 217, 218, 219],
         [3,4],range(1,13),  
         [33, 34, 35, 89, 90, 91, 92, 93, 100, 146, 147, 148, 149, 153, 154, 159, 160, 161, 162, 165, 166, 169, 170, 173, 174, 177, 178, 181, 182, 184, 185, 188, 189], 
          []]

print(len(DataIDHand))
print(len(Pages))

14
14


In [None]:
# Fetch the features for the pages of the PDF files
X_df, dataIDs, error_files = extract_features(DataIDHand, path_pdf) 
#Features
#dataIDs
#error_files

File Starting: 268712. PDF 1 out of 14
File Starting: 486221. PDF 2 out of 14
File Starting: 500633. PDF 3 out of 14


In [None]:
X_df.to_csv(path + "\\root\\features_test_train.csv")
dataIDs.to_csv(path + "\\root\\dataIDs.csv")
print(len(error_files))

In [None]:
X_df = pd.read_csv(path + "\\root\\features_test_train.csv", index_col = 0)
dataIDs = pd.read_csv(path + "\\root\\dataIDs.csv", index_col = 0)
X_df.head()

In [None]:
X_df_features = X_df.copy()
X_df_features.drop(columns=['dataID_pageNo'], inplace=True)
X_df_features.head()

In [None]:
def get_Y_values(dataIDs, Pages):
    Y_class = []
    dataID_pageNo = []
    j = 0
    for index, row in dataIDs.iterrows():
        #print(row['DataIDs'])
        #print(row['Page_no'])
        for i in range(1,row['Page_no']+1):
            if i in Pages[j]:
                Y_class.append(1)
            else:
                Y_class.append(0)
            dataID_pageNo.append(str(row['DataIDs']) + "_" +str(i))
        j = j+1
    
    Y_df = pd.DataFrame({'dataID_pageNo' : dataID_pageNo, 
                         'Y_class' : Y_class})
    Y_dfclass = pd.DataFrame({'Y_class' : Y_class})
    
    return Y_df, Y_dfclass
    
                
Y_df, Y_dfclass = get_Y_values(dataIDs, Pages)

In [None]:
print(len(Y_df))
print(len(X_df))
print(len(Y_dfclass))

### Train test split

We split the dataset randomly into a training set (75%) and a test set (25%). Passing a fixed "random_state" to "train_test_split" makes the split reproducible, so rerunning the code gives the same results.
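
As a minimal sketch of why the fixed seed matters, the snippet below splits a toy dataset twice with the same `random_state` and checks that both splits agree. The toy data is invented, and `stratify=y` is an optional extra that this notebook does not use; it is shown only because maps are a minority of the pages.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 40 pages, 8 of which are maps (label 1).
X = [[i] for i in range(40)]
y = [1] * 8 + [0] * 32


def split(seed):
    # stratify=y keeps the share of maps equal in both parts (assumption, see above).
    return train_test_split(X, y, test_size=0.25, random_state=seed, stratify=y)


X_tr_a, X_te_a, y_tr_a, y_te_a = split(8)
X_tr_b, X_te_b, y_tr_b, y_te_b = split(8)

print("Identical test sets:", X_te_a == X_te_b)
print("Train labels:", Counter(y_tr_a), "Test labels:", Counter(y_te_a))
```

Python's own `random.seed` does not influence scikit-learn's split, so the value passed as `random_state` is what decides which pages end up in each set.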

In [None]:
random.seed(19)
X_train, X_test, y_train, y_test = train_test_split(X_df_features,
                                                    Y_dfclass,
                                                    test_size = 0.25,
                                                    random_state = 8)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
print("Training Set: ", len(y_train))
print("Alignment Sheets in Training Set: ", len(y_train[y_train.Y_class > 0]))
print()
print("Test Set: ", len(y_test))
print("Alignment Sheets in Test Set: ", len(y_test[y_test.Y_class > 0]))

### Prepare validation set 

The validation set uses new PDFs, which represent a new real-world problem. For the validation set we would expect a slightly lower accuracy than for the test set. A model that performs well on the validation set is more likely to perform well on unseen PDFs. 

We extract features using the "extract_features" function as before. Then we create validation dataframes for the features and the dependent variable. 
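
The validation score only says something about new documents if no validation PDF was also used for training. A quick check along these lines can confirm that; the ID lists are copied from this notebook, while the variable names `train_ids` and `valid_ids` are chosen here for illustration.

```python
train_ids = [268712, 486221, 500633, 555093, 684494, 895015, 2392922,
             2445549, 2758927, 2813701, 2967854, 2968069, 3891802, 4036098]
valid_ids = [3410189, 3970828, 2968356]

# Any ID present in both lists would leak training pages into validation.
overlap = set(train_ids) & set(valid_ids)
print("PDFs in both sets:", sorted(overlap))
assert not overlap, "validation PDFs must not appear in the training set"
```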

In [None]:
DataIDHand = [3410189, 3970828, 2968356]
Pages = [[], 
         [29, 35, 51, 59, 100, 101, 108, 109, 165, 179, 225, 231, 293, 294], 
         [9,18, 26]]

print(len(DataIDHand))
print(len(Pages))

In [None]:
path_pdf = path + "\\RF Validation Set\\"

# Fetch the features for the pages of the PDF files
X_df_valid, dataIDs_valid, error_files = extract_features(DataIDHand, path_pdf) 
# #Features
# #dataIDs
# #error_files

In [None]:
X_df_valid.to_csv(path + "\\root\\features_valid.csv")
dataIDs_valid.to_csv(path + "\\root\\dataIDs_valid.csv")
print(len(error_files))

In [None]:
X_df_valid = pd.read_csv(path + "\\root\\features_valid.csv", index_col = 0)
dataIDs_valid = pd.read_csv(path + "\\root\\dataIDs_valid.csv", index_col = 0)
X_df_valid.head()

In [None]:
X_df_features_valid = X_df_valid.copy()
X_df_features_valid.drop(columns=['dataID_pageNo'], inplace=True)
X_df_features_valid.head()

In [None]:
Y_df_valid, Y_dfclass_valid = get_Y_values(dataIDs_valid, Pages)

print(len(Y_df_valid))
print(len(X_df_features_valid))
print(len(X_df_valid))
print(len(Y_dfclass_valid))

### Implement classification models

In this section we also use regression models as classifiers by thresholding their output at 0.5, so all models are collectively referred to as classification models. <br>

First we store the classification models and their names in two lists. Then we fit each model on the training set and compute its confusion matrix and accuracy score on the test and validation sets. 
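
To make the scoring step concrete, here is a small pure-Python sketch of thresholding regressor outputs and counting the results. The helper names `to_labels` and `confusion_counts` and all numbers are made up for illustration; the matrix layout (rows are true labels, columns are predictions) follows scikit-learn's `confusion_matrix`.

```python
def to_labels(scores, threshold=0.5):
    """Turn continuous regressor outputs into 0/1 class labels."""
    return [1 if score > threshold else 0 for score in scores]


def confusion_counts(y_true, y_pred):
    """2x2 counts: rows are true labels, columns are predicted labels."""
    matrix = [[0, 0], [0, 0]]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


y_true = [0, 0, 1, 1, 1]                  # made-up labels
scores = [0.02, 0.49, 0.50, 0.51, 0.97]   # made-up regressor outputs
y_pred = to_labels(scores)

matrix = confusion_counts(y_true, y_pred)
accuracy = (matrix[0][0] + matrix[1][1]) / len(y_true)
print(y_pred, matrix, accuracy)
```

With a strict `>` comparison, a score of exactly 0.5 is labeled 0, so the third page in this toy example counts as a missed map.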

In [None]:
classifiers = []
name = []
# we will create an array of Classifiers and append different classification models to our array.
model1 = xgboost.XGBClassifier()
classifiers.append(model1)
name.append("xgboost")

model2 = svm.SVC()
classifiers.append(model2)
name.append("svc")

model3 = tree.DecisionTreeClassifier()
classifiers.append(model3)
name.append("decisiontree")

model4 = RandomForestClassifier()
classifiers.append(model4)
name.append("rfc")


model5 = RandomForestRegressor(n_estimators=5)
classifiers.append(model5)
name.append("rfr5")

model6 = RandomForestRegressor(n_estimators=25)
classifiers.append(model6)
name.append("rfr25")

model7 = RandomForestRegressor(n_estimators=50)
classifiers.append(model7)
name.append("rfr50")

model8 = RandomForestRegressor(n_estimators=75)
classifiers.append(model8)
name.append("rfr75")

model9 = RandomForestRegressor(n_estimators=100)
classifiers.append(model9)
name.append("rfr100")


model10 = XGBRegressor(n_estimators=5)
classifiers.append(model10)
name.append("xgbr5")

model11 = XGBRegressor(n_estimators=25)
classifiers.append(model11)
name.append("xgbr25")

model12 = XGBRegressor(n_estimators=50)
classifiers.append(model12)
name.append("xgbr50")

model13 = XGBRegressor(n_estimators=75)
classifiers.append(model13)
name.append("xgbr75")

model14 = XGBRegressor(n_estimators=100)
classifiers.append(model14)
name.append("xgbr100")

In [None]:
i = 0
random.seed(10)
test_accuracy = []
valid_accuracy = []
cm_test = []
cm_valid = []
for clf in classifiers:
    print("________________________________________________________")
    print("________________________________________________________")
    #fit our algorithms in our Train dataset 
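    # Pass the labels as a 1-D Series; a one-column DataFrame triggers a shape warning in scikit-learn
    clf.fit(X_train, y_train["Y_class"])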
    
    #get test dataset prediction
    # Regressors return continuous scores, so only their output is thresholded
    if "rfr" in name[i] or "xgbr" in name[i]:
        y_pred_nb = clf.predict(X_test)
        #y_pred.shape
        #y_pred
        y_pred = []
        for y in y_pred_nb:
            if y > 0.50:
                y_pred.append(1)
            else:
                y_pred.append(0)
    else:
        y_pred= clf.predict(X_test)
        
    print(name[i])
    acc = accuracy_score(y_test, y_pred)
    test_accuracy.append(acc)
    print("Accuracy of %s is %s"%(clf, acc))
    cm = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix of %s is %s"%(clf, cm))
    cm_test.append(cm)
    
    
    print("________________Validation Set ___________________________")
    #get validation accuracy
    # Regressors return continuous scores, so only their output is thresholded
    if "rfr" in name[i] or "xgbr" in name[i]:
        y_pred_nb = clf.predict(X_df_features_valid)
        #y_pred.shape
        #y_pred
        y_pred = []
        for y in y_pred_nb:
            if y > 0.50:
                y_pred.append(1)
            else:
                y_pred.append(0)
    else:
        y_pred= clf.predict(X_df_features_valid)
        
    print(name[i])
    acc = accuracy_score(Y_dfclass_valid["Y_class"], y_pred)
    valid_accuracy.append(acc)
    print("Accuracy of %s is %s"%(clf, acc))
    cm = confusion_matrix(Y_dfclass_valid["Y_class"], y_pred)
    print("Confusion Matrix of %s is %s"%(clf, cm))
    cm_valid.append(cm)
    i = i +1

### Compare classification models

Here we collect the accuracy scores and confusion matrices for the test and validation sets in one dataframe. Then we save a pickled version of the best classification model for future use. 
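
The ranking and the save/load round trip can be sketched with the standard library alone. The scores below are invented and are not results from this notebook, and a plain dict stands in for a fitted model so the example runs without any trained estimator.

```python
import os
import pickle
import tempfile

# Made-up scores for illustration only.
results = [
    {"name": "model_a", "test_accuracy": 0.90, "valid_accuracy": 0.80},
    {"name": "model_b", "test_accuracy": 0.95, "valid_accuracy": 0.90},
    {"name": "model_c", "test_accuracy": 0.99, "valid_accuracy": 0.70},
]
for row in results:
    # The product rewards models that do well on both sets.
    row["product"] = row["test_accuracy"] * row["valid_accuracy"]

best = max(results, key=lambda row: row["product"])
print("Best model:", best["name"])

# Round trip through pickle; the dict is a stand-in for a fitted model.
stand_in_model = {"kind": "stand-in", "name": best["name"]}
with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "best_model.sav")
    with open(filename, "wb") as handle:
        pickle.dump(stand_in_model, handle)
    with open(filename, "rb") as handle:
        restored = pickle.load(handle)

print("Restored object matches:", restored == stand_in_model)
```

A pickled scikit-learn or XGBoost model should be loaded with the same library versions that wrote it, and pickle files from untrusted sources should not be loaded at all.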

In [None]:
classification_models = pd.DataFrame({'name': name, 
                                     'test_accuracy': test_accuracy,
                                     'test_cm': cm_test, 
                                     'valid_accuracy':valid_accuracy,
                                     'valid_cm': cm_valid})
classification_models["product"] = classification_models["test_accuracy"]*classification_models["valid_accuracy"]

# Sort in descending order so that the best-scoring model appears first
classification_models = classification_models.sort_values(by=['product'], ascending=False)
classification_models.head(15)

In [None]:
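import pickle

random.seed(10)

# Refit the random forest classifier ("rfc") and save it with pickle
for model_name, clf in zip(name, classifiers):
    if model_name != "rfc":
        continue
    print(model_name)
    clf.fit(X_train, y_train)
    filename = path + "\\root\\alignment_sheet_classifier_rfc.sav"
    with open(filename, 'wb') as f:
        pickle.dump(clf, f)

    # A second copy goes to the working directory. The file name says
    # "rfr50", but the object written is the "rfc" model.
    filename = "alignment_sheet_classifier_rfr50.sav"
    with open(filename, 'wb') as f:
        pickle.dump(clf, f)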

### Further analysis 

Here we look at the features and their importance scores for the classification model. 
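
As a rough illustration of what the importance scores look like, the sketch below fits a small random forest on toy data in which only the first column determines the label. The data and the feature names are hypothetical; in this notebook the names would come from the columns of `X_df_features`.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy data: the label depends only on the first column.
X = [[i % 2, (i // 2) % 5, (i // 10) % 3] for i in range(60)]
y = [row[0] for row in X]
feature_names = ["feature_a", "feature_b", "feature_c"]  # hypothetical names

model = RandomForestClassifier(n_estimators=25, random_state=0)
model.fit(X, y)

# Impurity-based importances are normalized to sum to 1.
ranking = sorted(zip(feature_names, model.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for feature_name, importance in ranking:
    print(f"{feature_name}: {importance:.3f}")
```

These scores describe how much each feature contributed to the splits of this particular fitted model, so they can shift between models and between random seeds.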

In [None]:
# clf is the model the previous loop ended on; it must expose feature_importances_
f_importance = clf.feature_importances_
feature = list(X_df_features.columns)

df_f_importance = pd.DataFrame({'Feature_Name': feature,
                                'Importance': f_importance})
df_f_importance