# Read Dataset

We decide that our AI will only work on those subject:
- Biology
- Computer Science
- Physics
- Chemistry
- Philosophy

To make our AI understand which subject one file is in, we decide that if a file has some keys words, then it may be related to this subject.
So, we have to create a dataset, where for each subject, there is a list of keys words. Our dataset is in the file 'Dataset_Topics.txt'

In [1]:
f = open("Dataset_Topics.txt", "r")

# We create a dictionary where the key is a school subject
# and the value is a set of words related to this subject
dataset = {"biology": set(dict.fromkeys(f.readline().split(";"))),
           "compsci": set(dict.fromkeys(f.readline().split(";"))),
           "physics": set(dict.fromkeys(f.readline().split(";"))),
           "chemistry": set(dict.fromkeys(f.readline().split(";"))),
           "philosophy": set(dict.fromkeys(f.readline().split(";")))}

f.close()

# print(dataset)

# Create training/validation/testing set

Now that we have our dataset, we need to create a training set, a validation set and a testing set. We have decided that our AI will just read Word or PDF file only (possibly that in the future that we had other format). It will be easier to do a supervised learning. So, we'll just select a lot of file and labelised them.

In [4]:
%pip install PyPDF2

Note: you may need to restart the kernel to use updated packages.


In [3]:
import PyPDF2
import re
import os
import tkinter
from tkinter import filedialog
import numpy as np
from tqdm import tqdm
import pandas as pd

In [4]:
key = ['biology', 'compsci', 'physics', 'chemistry', 'philosophy']
index = dict()
for ind in range(0,len(key)):
    index[key[ind]] = ind

# Path towards the folder where there are all files
folder_path = os.path.abspath(os.getcwd()) + '\FileForTraining'

# For each file, we will count
scores = list()
data_filename_topics = pd.read_csv('Dataset_fileName-Topics.csv')
for filename, _ in tqdm(data_filename_topics.values):
    file = os.path.join(folder_path, filename)
    if(os.path.isfile(file)):
        text = None
        extension = os.path.splitext(file)[1]
        if extension == ".pdf":  # If the file is a pdf file
            with open(file, 'rb') as pdfFileObj:
                pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
                text = re.sub(r'[^\w\s]', ' ', pdfReader.getPage(0).extractText())
                for pageNumber in range(pdfReader.numPages):
                    pageObj = pdfReader.getPage(pageNumber)
                    pageText = re.sub(r'[^\w\s]', ' ', pageObj.extractText())
                    text = ' '.join([text, pageText])

                text = text.split(' ')
        elif extension == ".doc" or extension == ".docx":  # If the file is a docx or doc
            # text = textfromword(file)
            pass

        # If the file is a pdf or a word, we can compute his score
        if text != None:
            score = np.zeros(len(key))
            for word in text:
                w = word.lower()
                for subject in dataset:
                    if(w in dataset[subject]):
                        score[index[subject]] += 1
            scores.append(score)
    else:
        print("The file", file, "is not supported.")

# We decide to put all those information in dataframe
df_x = pd.DataFrame(np.array(scores), columns = key)
df_y = data_filename_topics['topic']

  4%|▍         | 13/304 [00:14<05:08,  1.06s/it]Xref table not zero-indexed. ID numbers for objects will be corrected.
Xref table not zero-indexed. ID numbers for objects will be corrected.
  5%|▍         | 15/304 [00:14<02:55,  1.65it/s]Xref table not zero-indexed. ID numbers for objects will be corrected.
  5%|▌         | 16/304 [00:14<02:21,  2.03it/s]Xref table not zero-indexed. ID numbers for objects will be corrected.
Xref table not zero-indexed. ID numbers for objects will be corrected.
  6%|▌         | 18/304 [00:14<01:33,  3.06it/s]Xref table not zero-indexed. ID numbers for objects will be corrected.
  6%|▋         | 19/304 [00:15<01:25,  3.35it/s]Xref table not zero-indexed. ID numbers for objects will be corrected.
  7%|▋         | 21/304 [00:15<01:38,  2.88it/s]Xref table not zero-indexed. ID numbers for objects will be corrected.
  7%|▋         | 22/304 [00:16<01:38,  2.85it/s]Xref table not zero-indexed. ID numbers for objects will be corrected.
  8%|▊         | 23/304 [

In [5]:
print(df_x)
print(df_y)

     biology  compsci  physics  chemistry  philosophy
0        0.0     21.0      0.0        4.0         1.0
1        0.0     52.0      1.0        2.0         2.0
2        1.0     98.0      1.0        3.0         6.0
3        2.0    144.0      7.0        2.0        11.0
4        0.0    143.0      3.0        4.0        10.0
..       ...      ...      ...        ...         ...
299    161.0     52.0     23.0       51.0        20.0
300    617.0    256.0    209.0      115.0       150.0
301     46.0    688.0   1772.0      672.0      1130.0
302      4.0     10.0      4.0        3.0        11.0
303      6.0    105.0    289.0       65.0       280.0

[304 rows x 5 columns]
0      1
1      1
2      1
3      1
4      1
      ..
299    0
300    0
301    2
302    4
303    2
Name: topic, Length: 304, dtype: int64


In [6]:
for i in range(0, len(key)):
    print(f"{key[i]}: {len(df_y[df_y == i])}")

biology: 48
compsci: 83
physics: 96
chemistry: 46
philosophy: 31


Now that we have our dataframe, we have to split it into 3 sets : training, validation, testing set.

In [78]:
%pip install torch==1.12.0+cpu torchvision==0.13.0+cpu torchaudio==0.12.0 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.htmlNote: you may need to restart the kernel to use updated packages.



In [7]:
from sklearn.model_selection import train_test_split
import torch as t

# Split the data into 70% for training, 15% for validation and 15% for testing
train_x, rest_x, train_y, rest_y = train_test_split(df_x.values, df_y.values, train_size=0.7)
val_x, test_x, val_y, test_y = train_test_split(rest_x, rest_y, train_size=0.5)

# Transformation and normalization
train_x = t.tensor(train_x, dtype = t.float32)
val_x = t.tensor(val_x, dtype = t.float32)
test_x = t.tensor(test_x, dtype = t.float32)

train_y = t.tensor(train_y, dtype= int)
val_y = t.tensor(val_y, dtype= int)
test_y = t.tensor(test_y, dtype= int)

In [8]:
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import torch.nn.functional as F

class MLP(nn.Module):
  def __init__(self, D_in, H, D_out):
    super(MLP, self).__init__()

    # Inputs to hidden layer linear transformation
    self.input = nn.Linear(D_in, H)
    self.hidden = nn.Linear(H, H)
    self.output = nn.Linear(H, D_out)

    self.logsoftmax = nn.LogSoftmax()

  def forward(self, x):
    x = F.relu(self.input(x))
    x = F.relu(self.hidden(x))
    y_pred = self.output(x)

    return y_pred

def train_model(model, criterion, optimizer, train_x, train_y, val_x, val_y, num_epochs = 10, batch_size = 64, show_info = False):
  # Set model to train mode
  model.train()

  # Training loop
  for epoch in range(0,num_epochs):
    perm = t.randperm(len(train_y))
    sum_loss = 0.

    for i in range(0, len(train_y), batch_size):
      x1 = Variable(train_x[perm[i:i + batch_size]], requires_grad=False)
      y1 = Variable(train_y[perm[i:i + batch_size]], requires_grad=False)

      # Reset gradient
      optimizer.zero_grad()
      
      # Forward
      fx = model(x1)
      loss = criterion(fx, y1)
      
      # Backward
      loss.backward()
      
      # Update parameters
      optimizer.step()
      
      sum_loss += loss.item()

    val_loss = validation_model(model, criterion, val_x, val_y, batch_size)
    if(show_info):
      print(f"Epoch: {epoch+1}\tTraining Loss: {sum_loss}\tValidation Loss: {val_loss}")

def validation_model(model, criterion, val_x, val_y, batch_size):
  valid_loss = 0
  perm = t.randperm(len(val_y))

  # Set to validation mode
  model.eval()
  
  for i in range(0, len(val_y), batch_size):
      x1 = Variable(val_x[perm[i:i + batch_size]], requires_grad=False)
      y1 = Variable(val_y[perm[i:i + batch_size]], requires_grad=False)
      
      # Forward
      fx = model(x1)
      loss = criterion(fx, y1)
      
      valid_loss += loss.item()

  return valid_loss

def evaluate_model(model, test_x, test_y):
  model.eval()
  y_pred = model(test_x)

  y_pred = t.max(y_pred,1).indices
  accuracy =  t.mean(t.Tensor([i == j for i, j in zip(y_pred, test_y)]))

  return accuracy

In [13]:
# Hyperparameters
learning_rate = 1e-3
epochs = 200
batch_size = 8

D_in, H, D_out = train_x.shape[1], 256, len(key)
model = MLP(D_in, H, D_out)

criterion = nn.CrossEntropyLoss()
optimizer = t.optim.Adam(model.parameters(), lr = learning_rate)

# Train the model
train_model(model, criterion, optimizer, train_x, train_y,
            val_x, val_y, epochs, batch_size, show_info = True)

#Evaluate the model
accuracy = evaluate_model(model, test_x, test_y)*100
print(f'Accuracy: {accuracy} %')

Epoch: 1	Training Loss: 80.41137282550335	Validation Loss: 22.816313683986664
Epoch: 2	Training Loss: 68.61460008472204	Validation Loss: 4.923495844006538
Epoch: 3	Training Loss: 50.0225683003664	Validation Loss: 10.803473770618439
Epoch: 4	Training Loss: 43.37522664666176	Validation Loss: 4.83735679090023
Epoch: 5	Training Loss: 47.27895198762417	Validation Loss: 8.270422205328941
Epoch: 6	Training Loss: 55.34852270781994	Validation Loss: 7.3193947076797485
Epoch: 7	Training Loss: 25.690837129950523	Validation Loss: 7.590088665485382
Epoch: 8	Training Loss: 41.590294390916824	Validation Loss: 4.740236647427082
Epoch: 9	Training Loss: 31.90928490459919	Validation Loss: 5.866961777210236
Epoch: 10	Training Loss: 26.048245646059513	Validation Loss: 8.449291855096817
Epoch: 11	Training Loss: 33.00748674571514	Validation Loss: 9.816524595022202
Epoch: 12	Training Loss: 28.086455792188644	Validation Loss: 5.482224375009537
Epoch: 13	Training Loss: 24.363523855805397	Validation Loss: 5.23316

## Test avec keras

In [None]:
%pip install keras
%pip install tensorflow

In [None]:
import tensorflow as tf
from keras.models import Sequential
from keras import Dense

model = Sequential(
    Dense(D_in, activation = "relu"),
    Dense(H, activation = "relu"),
    Dense(D_out)
)