# Read Dataset

We decide that our AI will only work on those subject:
- Biology
- Computer Science
- Physics
- Chemistry
- Philosophy

To make our AI understand which subject one file is in, we decide that if a file has some keys words, then it may be related to this subject.
So, we have to create a dataset, where for each subject, there is a list of keys words. Our dataset is in the file 'Dataset_Topics.txt'

In [1]:
f = open("Dataset_Topics.txt", "r")

# We create a dictionary where the key is a school subject
# and the value is a set of words related to this subject
dataset = {"biology": set(dict.fromkeys(f.readline().split(";"))),
           "compsci": set(dict.fromkeys(f.readline().split(";"))),
           "physics": set(dict.fromkeys(f.readline().split(";"))),
           "chemistry": set(dict.fromkeys(f.readline().split(";"))),
           "philosophy": set(dict.fromkeys(f.readline().split(";")))}

f.close()

# print(dataset)

# Create training/validation/testing set

Now that we have our dataset, we need to create a training set, a validation set and a testing set. We have decided that our AI will just read PDF file only (possibly that in the future that we had other format). It will be easier to do a supervised learning. So, we'll just select a lot of file and labelised them.

In [4]:
%pip install PyPDF2

Note: you may need to restart the kernel to use updated packages.


In [2]:
import PyPDF2
import re
import os
import numpy as np
from tqdm import tqdm
import pandas as pd

In [77]:
key = ['biology', 'compsci', 'physics', 'chemistry', 'philosophy']
idx = dict()
for i in range(0,len(key)):
    idx[key[i]] = i

# Path towards the folder where there are all files
folder_path = os.path.abspath(os.getcwd()) + r'\FileForTraining'

# For each file, we will count
scores = list()
data_filename_topics = pd.read_csv('Dataset_fileName-Topics.csv')
for filename, _ in tqdm(data_filename_topics.values):
    file = os.path.join(folder_path, filename)
    if(os.path.isfile(file)):
        text = None
        extension = os.path.splitext(file)[1]
        if extension == ".pdf":  # If the file is a pdf file
            with open(file, 'rb') as pdfFileObj:
                pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict = False)
                text = re.sub(r'[^\w\s]', ' ', pdfReader.getPage(0).extractText())
                for pageNumber in range(1, pdfReader.numPages):
                    pageText = re.sub(r'[^\w\s]', ' ', pdfReader.getPage(pageNumber).extractText())
                    text = ' '.join([text, pageText])

                text = text.split(' ')

        # If the file is a pdf, we can compute his score
        if text != None:
            score = np.zeros(len(key))
            for word in text:
                w = word.lower()
                for subject in dataset:
                    if(w in dataset[subject]):
                        score[idx[subject]] += 1
            scores.append(score)
    else:
        print("The file", file, "is not supported.")

# We decide to put all those information in dataframe
df_x = pd.DataFrame(np.array(scores), columns = key)
df_y = data_filename_topics['topic']

100%|██████████| 304/304 [02:51<00:00,  1.78it/s]


In [78]:
print(df_x)
print(df_y)

     biology  compsci  physics  chemistry  philosophy
0        0.0     21.0      0.0        4.0         1.0
1        0.0     52.0      1.0        2.0         2.0
2        1.0     98.0      1.0        3.0         6.0
3        2.0    144.0      7.0        2.0        11.0
4        0.0    143.0      3.0        4.0        10.0
..       ...      ...      ...        ...         ...
299    161.0     51.0     23.0       51.0        20.0
300    615.0    256.0    209.0      115.0       150.0
301     46.0    688.0   1763.0      671.0      1126.0
302      3.0      7.0      2.0        2.0         7.0
303      5.0    103.0    258.0       63.0       277.0

[304 rows x 5 columns]
0      1
1      1
2      1
3      1
4      1
      ..
299    0
300    0
301    2
302    4
303    2
Name: topic, Length: 304, dtype: int64


## Let's analyse a bit the data

In [79]:
for i in range(0, len(key)):
    print(f"{key[i]}: {len(df_y[df_y == i])}")

biology: 48
compsci: 83
physics: 96
chemistry: 46
philosophy: 31


The number of files per topics isn't really well balanced but we'll work with that.

Let's see, if we sort according to the greatest number of words in a topic, if this corresponds to the related topic.

In [80]:
prediction_max = np.array([np.argmax(row) for row in df_x.values])
print(f'Number of correct: {sum(df_y.values == prediction_max)} on {len(df_y)} ({sum(df_y.values == prediction_max)*100/len(df_y):.4} %)')

Number of correct: 189 on 304 (62.17 %)


It seems that just selecting the topic with the most commun words doesn't always work.

## Split data

Now that we have our dataframe, we have to split it into 3 sets : training, validation, testing set.

In [78]:
%pip install torch==1.12.0+cpu torchvision==0.13.0+cpu torchaudio==0.12.0 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.htmlNote: you may need to restart the kernel to use updated packages.



In [81]:
from sklearn.model_selection import train_test_split
import torch as t

# Split the data into 70% for training, 15% for validation and 15% for testing
train_x, rest_x, train_y, rest_y = train_test_split(df_x.values, df_y.values, train_size=0.7, shuffle=True)
val_x, test_x, val_y, test_y = train_test_split(rest_x, rest_y, train_size=0.5, shuffle=True)

# Transformation and normalization
train_x = t.tensor(train_x, dtype = t.float32)
val_x = t.tensor(val_x, dtype = t.float32)
test_x = t.tensor(test_x, dtype = t.float32)

train_y = t.tensor(train_y, dtype= int)
val_y = t.tensor(val_y, dtype= int)
test_y = t.tensor(test_y, dtype= int)

## Create the model

In [87]:
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import torch.nn.functional as F

class MLP(nn.Module):
  def __init__(self, D_in, H, D_out):
    super(MLP, self).__init__()

    # Inputs to hidden layer linear transformation
    self.input = nn.Linear(D_in, H)
    self.hidden = nn.Linear(H, H)
    self.hidden2 = nn.Linear(H,H)
    self.output = nn.Linear(H, D_out)

  def forward(self, x):
    x = F.relu(self.input(x))
    x = F.relu(self.hidden(x))
    x = F.relu(self.hidden2(x))
    y_pred = self.output(x)

    return y_pred

def train_model(model, criterion, optimizer, train_x, train_y, val_x, val_y, num_epochs = 10, batch_size = 64, show_info = False):
  # Set model to train mode
  model.train()

  # Training loop
  for epoch in range(0,num_epochs):
    perm = t.randperm(len(train_y))
    sum_loss = 0.

    for i in range(0, len(train_y), batch_size):
      x1 = Variable(train_x[perm[i:i + batch_size]], requires_grad=False)
      y1 = Variable(train_y[perm[i:i + batch_size]], requires_grad=False)

      # Reset gradient
      optimizer.zero_grad()
      
      # Forward
      fx = model(x1)
      loss = criterion(fx, y1)
      
      # Backward
      loss.backward()
      
      # Update parameters
      optimizer.step()
      
      sum_loss += loss.item()

    val_loss = validation_model(model, criterion, val_x, val_y, batch_size)
    if(show_info and epoch%10==0):
      print(f"Epoch: {epoch} \tTraining Loss: {sum_loss} \tValidation Loss: {val_loss}")

def validation_model(model, criterion, val_x, val_y, batch_size):
  valid_loss = 0
  perm = t.randperm(len(val_y))

  # Set to validation mode
  model.eval()
  
  for i in range(0, len(val_y), batch_size):
      x1 = Variable(val_x[perm[i:i + batch_size]], requires_grad=False)
      y1 = Variable(val_y[perm[i:i + batch_size]], requires_grad=False)
      
      # Forward
      fx = model(x1)
      loss = criterion(fx, y1)
      
      valid_loss += loss.item()

  return valid_loss

def evaluate_model(model, test_x, test_y):
  model.eval()
  y_pred = model(test_x)

  y_pred = t.max(y_pred,1).indices
  accuracy = t.sum(y_pred == test_y)/len(y_pred)
  
  return accuracy

In [88]:
# Hyperparameters
learning_rate = 1e-3
epochs = 100
batch_size = 8

D_in, H, D_out = train_x.shape[1], 256, len(key)
model = MLP(D_in, H, D_out)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr = learning_rate)

# Train the model
train_model(model, criterion, optimizer, train_x, train_y,
            val_x, val_y, epochs, batch_size, show_info = True)

#Evaluate the model
accuracy = evaluate_model(model, test_x, test_y)*100
print(f'Accuracy: {accuracy} %')

Epoch: 0 	Training Loss: 44.20754516124725 	Validation Loss: 7.679714024066925
Epoch: 10 	Training Loss: 25.072161182761192 	Validation Loss: 18.944414913654327
Epoch: 20 	Training Loss: 16.2971902936697 	Validation Loss: 41.86445939540863
Epoch: 30 	Training Loss: 11.428470591083169 	Validation Loss: 51.15706565976143
Epoch: 40 	Training Loss: 7.9906492829322815 	Validation Loss: 36.69418954849243
Epoch: 50 	Training Loss: 11.928312636911869 	Validation Loss: 114.44620054960251
Epoch: 60 	Training Loss: 8.908174134790897 	Validation Loss: 172.9312653541565
Epoch: 70 	Training Loss: 4.992218680679798 	Validation Loss: 133.90561950206757
Epoch: 80 	Training Loss: 6.442145840032026 	Validation Loss: 131.05893683433533
Epoch: 90 	Training Loss: 4.327533654868603 	Validation Loss: 69.04219061136246
Accuracy: 84.78260803222656 %


In [89]:
y_pred = model(test_x)
y_pred = t.max(y_pred,1).indices

key2 = key.copy()
key2.append('Total')
df_result = pd.DataFrame(np.zeros((len(key),len(key) + 1), dtype= int), columns = key2,  index = key)
df_test_y = pd.DataFrame(test_y, dtype = int)
df_y_pred = pd.DataFrame(y_pred, dtype = int)
for i in range(0,len(key)):
    l = df_test_y[df_y_pred[0] == i]
    df_result.values[i][len(key)] = len(l)
    for j in range(0,len(key)):
        df_result.values[i][j] = len(l[l[0] == j])

print(df_result)

            biology  compsci  physics  chemistry  philosophy  Total
biology           7        0        0          0           0      7
compsci           0       13        1          0           0     14
physics           1        0       12          1           0     14
chemistry         1        0        0          4           1      6
philosophy        1        1        0          0           3      5


In this table, each row represents a kind of file that the neural network should initially associate. The columns represent the number of files that the neural network had associated to a subject.
Thanks to this table, we can observe the file that our AI could mistake. We don't have many files for the testing set but we can observe that some computer science files could be mistaken with physics files.

## Save the model

In [90]:
save_model_path = os.path.abspath(os.getcwd()) + '/save_model.pt'
t.save(model, save_model_path)

## How use our model?

Firstly, you need to load the model.

In [107]:
import os
import numpy as np
import torch as t
from tkinter import Tk, filedialog
import shutil
from tqdm import tqdm

load_model_path = os.path.abspath(os.getcwd()) + r'\save_model.pt'
load_model = t.load(load_model_path)
load_model.eval()

MLP(
  (input): Linear(in_features=5, out_features=256, bias=True)
  (hidden): Linear(in_features=256, out_features=256, bias=True)
  (hidden2): Linear(in_features=256, out_features=256, bias=True)
  (output): Linear(in_features=256, out_features=5, bias=True)
)

Now that the model is loaded, you have to select the folder where all the files, you want to sort, are stored. After that, you will find some folder with your files in, sorted by our AI, in the folder 'Files_sorted_by_AI'.

In [108]:
# This the list of all topics that our AI knows and was train for
key = ['biology', 'compsci', 'physics', 'chemistry', 'philosophy']

root = Tk()
root.withdraw()

root.attributes('-topmost', True)
folder_path = filedialog.askdirectory()

# For each file, we will count
scores = list()
filename_list = list()
print("Scan of the files in progress...\t(can take some times with several files)")
for filename in tqdm(os.listdir(folder_path)):
    file = os.path.join(folder_path, filename)
    if(os.path.isfile(file)):
        text = None
        extension = os.path.splitext(file)[1]
        if extension == ".pdf":  # If the file is a pdf file
            with open(file, 'rb') as pdfFileObj:
                pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict = False)
                text = re.sub(r'[^\w\s]', ' ', pdfReader.getPage(0).extractText())
                for pageNumber in range(1, pdfReader.numPages):
                    pageText = re.sub(r'[^\w\s]', ' ', pdfReader.getPage(pageNumber).extractText())
                    text = ' '.join([text, pageText])

                text = text.split(' ')

        # If the file is a pdf, we can compute his score
        if text != None:
            score = np.zeros(len(key))
            for word in text:
                w = word.lower()
                for subject in dataset:
                    if(w in dataset[subject]):
                        score[idx[subject]] += 1
            scores.append(score)
            filename_list.append(file)
    else:
        print("The file", file, "is not supported.")
print("Scan finish")

Scan of the files in progress...	(can take some times with several files)


100%|██████████| 45/45 [00:31<00:00,  1.44it/s]

The file C:/Users/apira/Downloads/Test\biology is not supported.
The file C:/Users/apira/Downloads/Test\chemistry is not supported.
The file C:/Users/apira/Downloads/Test\compsci is not supported.
The file C:/Users/apira/Downloads/Test\philosophy is not supported.
The file C:/Users/apira/Downloads/Test\physics is not supported.
Scan finish





In [109]:
# Path where the AI will create the folder with the files sorted
# The result will appear in the same folder that you have selected but if you want, you can modify the path
files_sorted_path = folder_path

data = t.tensor(np.array(scores), dtype = t.float32)

# We predict to which topic are related each files
load_model.eval()
data_prediction = load_model(data)
data_prediction = t.max(data_prediction,1).indices

# We create the folder of the topics found by the AI
for i in range(len(key)):
    if i in data_prediction:
        if not os.path.exists(files_sorted_path + f'/{key[i]}'):
            os.makedirs(files_sorted_path + f'/{key[i]}')

# We move all the files in the topic sorted by the AI
for i in range(len(filename_list)):
    shutil.move(filename_list[i], files_sorted_path + f"/{key[data_prediction[i]]}/{os.path.basename(filename_list[i])}")