# Solution for classification of documents


## 1.1 Explanation
We will use an SGD classifier and a naive Bayes classifier to assign new images of academic documents to one of our four classes. Most certificates and school documents of a given type are nearly identical; usually only the name, year, or class changes. I therefore expect most classifiers to reach very high accuracy on this task.

In [1]:
import os

path = os.getcwd()
print("The current working directory is {}".format(path))

The current working directory is C:\Users\Alabi Oluwatosin\Documents\Projects\document-ocr


In [2]:
classes = ["school_id", "french_club_award", "jet_club_award", "report_card"]
dataFolders = ["train", "test", "val"]

In [3]:
import shutil
import os
import random
from pathlib import Path
try:
    from PIL import Image
except ImportError:
    import Image

In [4]:
import pandas as pd
import numpy as np
import pickle
from sklearn.preprocessing import LabelBinarizer
import sklearn.datasets as skds
from pathlib import Path

## 1.2 Machine learning algorithms
We will use the naive Bayes algorithm and the SGD classifier to classify the documents. Both are trained on the text extracted from each document, which is loaded below from CSV files.



In [5]:
extractedtrainData = pd.read_csv("train_data.csv") 
extractedtestData = pd.read_csv("test_data.csv")

In [6]:
extractedtrainData.head()

Unnamed: 0,file_name,extracted_information,class_name
0,ID (11).txt,CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\neee JADE,school_id
1,ID (2).txt,CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\nTOLU FOLA,school_id
2,ID (6).txt,CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\nPanu SIMEON,school_id
3,ID (9).txt,CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\nFEMI QUADRI,school_id
4,french_participation (1).txt,Chapel Science High School awards\n\nfor parti...,french_club_award


In [7]:
extractedtestData.head()

Unnamed: 0,file_name,extracted_information,class_name
0,ID (12).txt,CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\nPenna SALA,school_id
1,ID (13).txt,CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\nYUSUF BELLO,school_id
2,ID (14).txt,CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\nAZEEZ BALA,school_id
3,ID.txt,CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\noa OLADEJI,school_id
4,french_participation (11).txt,Chapel Science High School awards\n\nfor parti...,french_club_award


In [8]:
# Label-encode the class_name column before feeding the data to the classifiers
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

le.fit(extractedtrainData['class_name'])

LabelEncoder()

In [9]:
list(le.classes_)

['french_club_award', 'jet_club_award', 'report_card', 'school_id']

In [10]:
label_index = le.transform(extractedtrainData['class_name']) 
test_label_index = le.transform(extractedtestData['class_name']) 
print(label_index)
print(test_label_index)

[3 3 3 3 0 0 0 0 1 1 1 1 2 2 2 2]
[3 3 3 3 0 0 0 0 1 1 1 1 2 2 2 2]
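The integer codes above follow the alphabetical order of the class names, not the order in which `classes` was declared. A minimal sketch of that mapping, using only the four class names from this notebook (the variable names `encoder`, `mapping`, and `decoded` are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# The four class names used in this notebook, in the order they were declared.
classes = ["school_id", "french_club_award", "jet_club_award", "report_card"]

encoder = LabelEncoder().fit(classes)

# classes_ is stored sorted, so the integer codes follow alphabetical order.
codes = encoder.transform(classes)
mapping = {name: int(code) for name, code in zip(classes, codes)}
print(mapping)

# inverse_transform maps integer predictions back to readable class names.
decoded = list(encoder.inverse_transform([3, 0, 1, 2]))
print(decoded)
```

Keeping the fitted encoder around matters later, because the classifiers only ever see and return these integers.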


In [11]:
train_text = extractedtrainData["extracted_information"] 
train_tags = extractedtrainData['class_name']
train_files_names = extractedtrainData["file_name"]

test_data_text = extractedtestData["extracted_information"]

print(train_text.shape)
print(test_data_text.shape)

(16,)
(16,)


In [12]:
# Vectorize the text with scikit-learn's TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(train_text)
vectors.shape

(16, 69)
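The shape `(16, 69)` means one row per document and one column per distinct token in the training text. The sketch below shows the same idea on two short strings; these strings are hypothetical stand-ins, not rows from `train_data.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical OCR snippets, loosely modelled on the rows shown earlier.
sample_texts = [
    "CHAPEL SCIENCE\n\nHIGH SCHOOL\n\nTOLU FOLA",
    "Chapel Science High School awards",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(sample_texts)

# Tokens are lowercased by default, so "CHAPEL" and "Chapel" share a column.
vocabulary = sorted(vectorizer.vocabulary_)
print(matrix.shape)
print(vocabulary)
```

Because the vocabulary is learned from the training text only, words that appear for the first time in test documents (new student names, for instance) are simply ignored at prediction time.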

In [13]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics


# A small alpha applies only light smoothing to the word probabilities
clf = MultinomialNB(alpha=.01)
clf.fit(vectors, label_index)

# Transform the test text with the vocabulary learned from the training text
vectors_test = vectorizer.transform(test_data_text)

pred = clf.predict(vectors_test)
# Macro-averaged F1: the unweighted mean of the per-class F1 scores
metrics.f1_score(test_label_index, pred, average='macro')

1.0
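A single macro F1 value hides which classes get confused with each other. One possible way to look closer is a confusion matrix together with per-class F1 scores. The labels below are hypothetical (one `report_card` deliberately mislabelled) and are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels: 4 classes with 4 documents each, and one report_card (2)
# mistaken for a jet_club_award (1). These are NOT results from this notebook.
y_true = [3, 3, 3, 3, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [3, 3, 3, 3, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
per_class_f1 = f1_score(y_true, y_pred, average=None, labels=[0, 1, 2, 3])
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(cm)
print(per_class_f1)
print(macro_f1)
```

In the notebook itself, `test_label_index` and `pred` would take the place of `y_true` and `y_pred`.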

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

# loss='hinge' makes SGDClassifier train a linear SVM;
# the pipeline vectorizes the raw text before classifying it
text_clf_svm = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                              max_iter=10, random_state=42)),
])

text_clf_svm = text_clf_svm.fit(train_text, label_index)
predicted_svm = text_clf_svm.predict(test_data_text)

# Accuracy: the fraction of test documents classified correctly
np.mean(predicted_svm == test_label_index)

1.0

As expected, both classifiers score 1.0 on the 16-document test set, which reflects the kind of documents being used. In an academic setting such as Nigeria's, the documents commonly seen include the WAEC certificate, the JAMB result, school results, and testimonials. Documents of the same type are nearly identical, differing only in details such as the name, date, and student number, so most reasonable classifiers should classify them correctly almost 100% of the time.
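With only 16 test documents, one perfect score is fairly weak evidence on its own. A hedged way to check how stable the result is would be stratified cross-validation. The sketch below assumes a small made-up corpus: the templates and names are hypothetical stand-ins for the OCR text, so its scores say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical templates and names standing in for the text in train_data.csv.
templates = {
    "school_id": "CHAPEL SCIENCE HIGH SCHOOL student identity card {name}",
    "french_club_award": "Chapel Science High School awards {name} french club",
    "jet_club_award": "Chapel Science High School awards {name} jet club",
    "report_card": "Chapel Science High School report card for {name} term results",
}
names = ["ADA OBI", "KOLA ADE", "UCHE EZE", "BISI ALAO"]

texts = [template.format(name=name) for template in templates.values() for name in names]
labels = [label for label in templates for _ in names]

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB(alpha=0.01))])

# Four stratified folds: each fold holds out one document per class.
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=folds, scoring="f1_macro")

print(scores, scores.mean())
```

On the real data, `train_text` and `label_index` would replace `texts` and `labels`.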

Save the SGD classifier model, then test the saved model.

In [15]:
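# Use a context manager so the file is flushed and closed after writing
with open('model.pkl', 'wb') as model_file:
    pickle.dump(text_clf_svm, model_file)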
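Before relying on a saved model, it can help to confirm that the restored pipeline predicts exactly what the original did. The sketch below does that round trip in a temporary directory; the training strings and labels are hypothetical, and the pipeline only resembles the one above.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical training data: 0 = award text, 1 = ID card text.
texts = [
    "Chapel Science High School awards french club",
    "Chapel Science High School awards jet club",
    "CHAPEL SCIENCE HIGH SCHOOL ADA OBI",
    "CHAPEL SCIENCE HIGH SCHOOL KOLA ADE",
]
labels = [0, 0, 1, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(loss="hinge", random_state=42)),
])
pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pkl"
    # Context managers close the file handles even if pickling fails.
    with open(model_path, "wb") as handle:
        pickle.dump(pipeline, handle)
    with open(model_path, "rb") as handle:
        restored = pickle.load(handle)

same_predictions = list(restored.predict(texts)) == list(pipeline.predict(texts))
print(same_predictions)
```

Pickle files should only be loaded from sources you trust, and ideally with the same scikit-learn version that wrote them.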

## 1.3 Test the classifier with never-before-seen data

In [17]:
extractedValData = pd.read_csv("val_data.csv")
extractedValData

Unnamed: 0,file_name,extracted_information,class_name
0,ID (1).txt,CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\nALABI TOSIN,school_id
1,ID (10).txt,CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\nPETER JOHN,school_id
2,ID (3).txt,CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\nDAVID INO,school_id
3,ID (4).txt,CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\nEU JAMES,school_id
4,ID (5).txt,CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\npeace BA...,school_id
5,ID (7).txt,CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\nVICTOR O...,school_id
6,ID (8).txt,CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\neee CHID...,school_id
7,french_participation (5).txt,Chapel Science High School awards\n\nfor parti...,french_club_award
8,french_participation (6).txt,Chapel Science High School awards\n\nfor parti...,french_club_award
9,french_participation (7).txt,Chapel Science High School awards\n\nfor parti...,french_club_award


In [18]:
validation_sentence = extractedValData.loc[0, 'extracted_information']
validation_sentence

'CHAPEL SCIENCE\n\nHIGH SCHOOL\n\n \n\nALABI TOSIN'

In [19]:
with open('model.pkl', 'rb') as model_file:
    model = pickle.load(model_file)
classInteger = model.predict([validation_sentence])
# Map the integer prediction back to its class name
list(le.inverse_transform(classInteger))

['school_id']
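The cell above checks a single row. A possible extension is to predict every row of the validation frame at once and compare against `class_name`. The sketch below assumes tiny hypothetical `train` and `val` frames with the same column names as the CSV files; it is an illustration, not the notebook's saved model.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for train_data.csv and val_data.csv.
train = pd.DataFrame({
    "extracted_information": [
        "CHAPEL SCIENCE HIGH SCHOOL ADA OBI",
        "CHAPEL SCIENCE HIGH SCHOOL KOLA ADE",
        "Chapel Science High School awards french club",
        "Chapel Science High School awards french club prize",
    ],
    "class_name": ["school_id", "school_id", "french_club_award", "french_club_award"],
})
val = pd.DataFrame({
    "extracted_information": ["CHAPEL SCIENCE HIGH SCHOOL UCHE EZE", "awards french club"],
    "class_name": ["school_id", "french_club_award"],
})

encoder = LabelEncoder().fit(train["class_name"])
model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB(alpha=0.01))])
model.fit(train["extracted_information"], encoder.transform(train["class_name"]))

# Predict all rows, decode the integers, and flag which rows match the true class.
val["predicted_class"] = encoder.inverse_transform(model.predict(val["extracted_information"]))
val["correct"] = val["predicted_class"] == val["class_name"]
print(val[["class_name", "predicted_class", "correct"]])
```

In the notebook, the loaded `model`, the fitted `le`, and `extractedValData` would play the roles of `model`, `encoder`, and `val`.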

## 1.5 Extracting keywords from a file


In [19]:
validation_sentence

'SUMMER CERTIFICATE\n\nThe Kwara University awards\n\nJohn Peter\n\nfor completing our short course, Basic Home\nTraining for Puppies.\n\nAVAZI OMEIZA ABDULGANIYU AMBALI\nHOD Vice chancellor'

In [1]:
from collections import Counter
from pprint import pprint

import spacy
from spacy import displacy
import en_core_web_sm

# Load spaCy's small English pipeline
nlp = en_core_web_sm.load()

In [20]:
doc = nlp(validation_sentence)
# Print the text of each named entity; use (ent.text, ent.label_) to also see the entity type
pprint([ent.text for ent in doc.ents])

['SUMMER', 'Kwara University', 'John Peter', 'Puppies', 'HOD']
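The flat list above mixes names, institutions, and other entities. One way to make the keywords easier to use is to group them by entity label. The helper below, `group_entities`, is a hypothetical name introduced for this sketch, and the labels in `pairs` are assumed for illustration rather than taken from spaCy's actual output on this document.

```python
from collections import defaultdict


def group_entities(pairs):
    """Group (text, label) pairs into {label: [texts]}, skipping repeats."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# With spaCy, the pairs would come from:
#     pairs = [(ent.text, ent.label_) for ent in doc.ents]
# The labels below are hypothetical, chosen only to illustrate the helper.
pairs = [
    ("Kwara University", "ORG"),
    ("John Peter", "PERSON"),
    ("Kwara University", "ORG"),
]

keywords = group_entities(pairs)
print(keywords)
```

Grouping this way makes it straightforward to pull out, say, only the person names from a certificate.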
