<a href="https://colab.research.google.com/github/martinTan1215/Insurance_Document_Detection/blob/main/InsuranceDocumentDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Training


In [None]:
from PIL import Image
import numpy as np
import os
import pytesseract
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
import joblib

In this section, I import the libraries used throughout the code: PIL for image processing, numpy for numerical operations, os for file and directory operations, pytesseract for extracting text from images, and scikit-learn modules for feature extraction, model training, and evaluation. I also import joblib for saving and loading the trained model.
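pytesseract is a Python wrapper around the Tesseract OCR engine, which has to be installed separately from the Python package. Below is a minimal sketch of a pre-flight check, assuming the executable is called `tesseract` and is expected on the `PATH`. The helper name `find_tesseract` is hypothetical and not part of pytesseract.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


# Assumption: the binary is named "tesseract"; adjust if your install differs
tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract was not found on PATH; pytesseract calls would fail.")
else:
    print("Using Tesseract at:", tesseract_path)
```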

In [None]:
# Set the path to your images
image_dir = '/Users/tanqiao/Downloads/Insurance_Dataset'

# Set the names of your classes
classes = ['Home Insurance Claim Form', 'Car Insurance Claim Form',
           'Health Insurance Claim Form']

In this section, I specify the path to the directory where the images are stored; replace it with the actual path on your system. I also define the class names, which represent the different types of insurance claim forms. Each class name must match a sub-folder of image_dir, because the loading loop reads the images of a class from the folder with that name.
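Since the loading loop expects one sub-folder per class, it can help to check the layout before running OCR. Below is a minimal sketch, demonstrated on a temporary directory rather than the real dataset. The helper `missing_class_dirs` and the variable `demo_classes` are hypothetical names introduced only for this example.

```python
import os
import tempfile


def missing_class_dirs(image_dir, classes):
    """Return the class names that have no matching sub-folder in image_dir."""
    return [name for name in classes
            if not os.path.isdir(os.path.join(image_dir, name))]


# Demonstration on a throwaway directory (assumption: same class names as above)
demo_classes = ['Home Insurance Claim Form', 'Car Insurance Claim Form',
                'Health Insurance Claim Form']
with tempfile.TemporaryDirectory() as demo_dir:
    # Create only two of the three expected class folders
    os.makedirs(os.path.join(demo_dir, demo_classes[0]))
    os.makedirs(os.path.join(demo_dir, demo_classes[1]))
    missing = missing_class_dirs(demo_dir, demo_classes)

print("Missing class folders:", missing)
```

With the real dataset, the same function could be called as `missing_class_dirs(image_dir, classes)`; an empty list means every class folder is present.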

In [None]:
# Create lists to store the data and labels
x_data = []
y_data = []


# Loop over each class
for label, class_name in enumerate(classes):
    # Set the path to the class folder
    class_dir = os.path.join(image_dir, class_name)

    # Loop over each image in the class folder
    for filename in os.listdir(class_dir):
        # Open the image
        image = Image.open(os.path.join(class_dir, filename))

        # Resize the image to 500x500 pixels
        image = image.resize((500, 500))

        # Convert the image to grayscale
        image = image.convert('L')

        # Use pytesseract to convert image to text
        text = pytesseract.image_to_string(image)

        # Map 'Auto', 'Vehicle', and 'Motor' insurance variants to 'Car Insurance Claim Form'
        text = text.replace('Auto Insurance Claim Form',
                            'Car Insurance Claim Form')
        text = text.replace('Auto Insurance', 'Car Insurance Claim Form')
        text = text.replace('Vehicle Insurance Claim Form',
                            'Car Insurance Claim Form')
        text = text.replace('Vehicle Insurance', 'Car Insurance Claim Form')
        text = text.replace('Motor Insurance', 'Car Insurance Claim Form')

        # Replace 'Property Insurance' with 'Home Insurance'
        text = text.replace('Property Insurance Claim Form',
                            'Home Insurance Claim Form')
        text = text.replace('Property Insurance', 'Home Insurance Claim Form')

        # Expand 'Health Insurance' to the full class name and map 'Life Insurance' to it
        text = text.replace('Health Insurance', 'Health Insurance Claim Form')
        text = text.replace('Life Insurance', 'Health Insurance Claim Form')

        # Add the text to the data list
        x_data.append(text)

        # Add the label to the labels list
        y_data.append(label)

This section prepares the data and labels for training the model. I initialize two lists, x_data and y_data, to store the text data and the corresponding labels. I then loop over each class and each image in the class folder. For each image, I open it, resize it to 500x500 pixels, convert it to grayscale, and extract the text using pytesseract. I perform some text replacements to standardize the class names. The extracted text is added to x_data, and the label (the index of the class in the classes list) is added to y_data.
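The same replacements are needed again at prediction time, so one option is to keep them in a single helper. The sketch below reproduces the replace() calls in the same order. The names `REPLACEMENTS` and `normalize_text` are hypothetical and do not appear in the original code.

```python
# Ordered (old, new) pairs, in the same order as the replace() calls in the loop
REPLACEMENTS = [
    ('Auto Insurance Claim Form', 'Car Insurance Claim Form'),
    ('Auto Insurance', 'Car Insurance Claim Form'),
    ('Vehicle Insurance Claim Form', 'Car Insurance Claim Form'),
    ('Vehicle Insurance', 'Car Insurance Claim Form'),
    ('Motor Insurance', 'Car Insurance Claim Form'),
    ('Property Insurance Claim Form', 'Home Insurance Claim Form'),
    ('Property Insurance', 'Home Insurance Claim Form'),
    ('Health Insurance', 'Health Insurance Claim Form'),
    ('Life Insurance', 'Health Insurance Claim Form'),
]


def normalize_text(text):
    """Apply each replacement in order and return the standardized text."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


car_example = normalize_text('Auto Insurance Claim Form')
motor_example = normalize_text('Motor Insurance policy')
health_example = normalize_text('Health Insurance Claim Form')

print(car_example)
print(motor_example)
print(health_example)
```

Order matters: the longer phrase is replaced before its shorter prefix, so 'Auto Insurance Claim Form' does not end up with a doubled suffix. The 'Health Insurance' rule has no such guard, so text that already reads 'Health Insurance Claim Form' gains a second 'Claim Form'. This happens identically in training and prediction, so the features stay consistent, but it is worth knowing when inspecting the extracted text.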

In [None]:
# Convert the data and labels to NumPy arrays
x_data = np.array(x_data)
y_data = np.array(y_data)

# Vectorize the text data
vectorizer = TfidfVectorizer()
x_data = vectorizer.fit_transform(x_data)

# Save the fitted vectorizer
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")

In this section, I convert the data and labels into NumPy arrays for further processing. I then use the TfidfVectorizer from scikit-learn to convert the text data into a numerical representation using TF-IDF (Term Frequency-Inverse Document Frequency) encoding. The vectorizer is fitted on the text data using the fit_transform() method. I also save the fitted vectorizer for later use in prediction.
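To make the TF-IDF step more concrete, here is a small sketch on a made-up three-document corpus. The texts and the names `toy_texts`, `toy_vectorizer`, and `toy_matrix` are illustrative assumptions, not data from the insurance dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus standing in for OCR output
toy_texts = [
    "car insurance claim form vehicle registration",
    "home insurance claim form property address",
    "health insurance claim form patient name",
]

toy_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights, then encodes the corpus
toy_matrix = toy_vectorizer.fit_transform(toy_texts)

print("Matrix shape (documents, vocabulary):", toy_matrix.shape)
print("Vocabulary:", sorted(toy_vectorizer.vocabulary_))

# transform() reuses the fitted vocabulary; unknown words are simply ignored
unseen = toy_vectorizer.transform(["car claim with unknownword"])
print("Non-zero features in unseen text:", unseen.nnz)
```

This is why the fitted vectorizer has to be saved: prediction must call transform() with the same vocabulary, not fit a new one.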

In [None]:
# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.2)

This section splits the data and labels into training and testing sets using the train_test_split() function from scikit-learn. The test_size parameter specifies the proportion of the data to be allocated for testing (in this case, 20%).
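By default the split is random and changes on every run. The sketch below shows two optional parameters of train_test_split that may help on a small dataset: `random_state` for reproducibility and `stratify` to keep class proportions equal in both sets. The toy arrays and the seed value 42 are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 30 samples, 10 per class (stand-ins for the vectorized documents)
toy_x = np.arange(30).reshape(-1, 1)
toy_y = np.repeat([0, 1, 2], 10)

tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y,
    test_size=0.2,      # 20% of the samples go to the test set
    stratify=toy_y,     # keep the class balance in both sets
    random_state=42)    # arbitrary seed so the split is repeatable

print("Train size:", len(ty_train), "Test size:", len(ty_test))
print("Test class counts:", np.bincount(ty_test, minlength=3))
```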

In [None]:
# Create and train a classifier
clf = SVC()

# Perform cross-validation on the training data
cv_scores = cross_val_score(clf, x_train, y_train, cv=5)

# Fit the classifier on the training data
clf.fit(x_train, y_train)

# Evaluate the classifier on the training set
train_accuracy = clf.score(x_train, y_train)

# Evaluate the classifier on the test set
test_accuracy = clf.score(x_test, y_test)

# Print the cross-validation scores and performance metrics
print("Cross-validation scores:", cv_scores)
print("Mean cross-validation score:", np.mean(cv_scores))
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)

In this section, I create an SVM (Support Vector Machine) classifier using the SVC class from scikit-learn with its default settings. I run 5-fold cross-validation on the training data with cross_val_score(), which gives an estimate of how the model performs on data it was not fitted on. I then fit the classifier on the full training set using the fit() method and measure its accuracy on both the training set (train_accuracy) and the test set (test_accuracy). Finally, I print the cross-validation scores, their mean, and the two accuracy values.
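One caveat with the cells above: the vectorizer is fitted on the full dataset before the split, so the test documents influence the vocabulary and IDF weights. Below is a minimal sketch of one way to avoid this, assuming a scikit-learn Pipeline is acceptable here. The toy corpus and the names `cv_texts`, `cv_labels`, and `cv_pipeline` are made up for illustration, and this is not the notebook's original approach.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up raw texts, six per class (labels: 0 = Home, 1 = Car, 2 = Health)
cv_texts = (
    ["car insurance claim vehicle %d" % i for i in range(6)]
    + ["home insurance claim property %d" % i for i in range(6)]
    + ["health insurance claim patient %d" % i for i in range(6)]
)
cv_labels = [1] * 6 + [0] * 6 + [2] * 6

# The pipeline refits the vectorizer inside every fold, on that fold's
# training part only, so held-out texts never influence the features
cv_pipeline = make_pipeline(TfidfVectorizer(), SVC())
cv_scores_demo = cross_val_score(cv_pipeline, cv_texts, cv_labels, cv=3)

print("Fold accuracies:", cv_scores_demo)
print("Mean accuracy:", cv_scores_demo.mean())
```

With this layout, the split would be applied to the raw texts rather than to the TF-IDF matrix, and the pipeline would be fitted on the training texts only.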

In [None]:
# Save the trained model
joblib.dump(clf, "insurance_classification_model.pkl")

In this section, I save the trained classifier using the joblib.dump() function. The model is written to a file named "insurance_classification_model.pkl", so it can be loaded later to make predictions without retraining.
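The model and the vectorizer are saved as two separate files, and they only work as a pair. As a hedged alternative, the sketch below stores both in one Pipeline object and checks that the reloaded copy gives the same predictions. The toy texts, the file name `demo_pipeline.pkl`, and the variable names are assumptions for the example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up training texts (labels: 0 = Home, 1 = Car, 2 = Health)
demo_texts = ["car vehicle claim", "home property claim",
              "health patient claim", "car vehicle accident",
              "home property damage", "health patient hospital"]
demo_labels = [1, 0, 2, 1, 0, 2]

# Vectorizer and classifier travel together in a single object
demo_model = make_pipeline(TfidfVectorizer(), SVC()).fit(demo_texts, demo_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    joblib.dump(demo_model, model_path)   # save
    restored = joblib.load(model_path)    # load it back

same = list(restored.predict(demo_texts)) == list(demo_model.predict(demo_texts))
print("Restored model reproduces predictions:", same)
```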

Overall, the Model Training code performs the following steps:


1.   Reads and preprocesses the image data.
2.   Extracts the text from the images using OCR.
3.   Standardizes the class names in the text data.
4.   Converts the text data into numerical features using TF-IDF encoding.
5.   Splits the data into training and testing sets.
6.   Trains an SVM classifier on the training data and evaluates its performance.
7.   Saves the trained model for future use.

# Prediction

In [None]:
import joblib
from PIL import Image
import pytesseract
import os
import shutil

In this section, I import the necessary libraries for image processing, text extraction, file operations, and copying files.

In [None]:
# Load the saved model
clf = joblib.load("insurance_classification_model.pkl")

# Load the fitted vectorizer
vectorizer = joblib.load("tfidf_vectorizer.pkl")

Here, I load the previously trained model (saved as insurance_classification_model.pkl) and the fitted vectorizer (saved as tfidf_vectorizer.pkl) using the joblib library.
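If the prediction code runs before training, joblib.load() fails on the missing files. The sketch below wraps the call in a small helper that reports which file is missing. The helper name `load_artifact` and the demo file names are hypothetical. Note also that joblib files are pickle-based, so they should only be loaded from sources you trust.

```python
import os
import tempfile

import joblib


def load_artifact(path):
    """Load a joblib file, raising a clear error when the file is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"{path} not found; run the Model Training section first.")
    return joblib.load(path)


# Demonstration with a throwaway file instead of the real model
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.pkl")
    joblib.dump({"classes": 3}, demo_path)
    loaded = load_artifact(demo_path)

    try:
        load_artifact(os.path.join(tmp_dir, "missing.pkl"))
        missing_error = None
    except FileNotFoundError as exc:
        missing_error = str(exc)

print("Loaded object:", loaded)
print("Error for missing file:", missing_error)
```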

In [None]:
# Set the path to the folder containing the images to predict
folder_path = "/Users/tanqiao/Desktop/Test_Dataset"

# Set the paths for the output folders
output_folder_home = "/Users/tanqiao/Desktop/DestinationFolder/Home"
output_folder_car = "/Users/tanqiao/Desktop/DestinationFolder/Car"
output_folder_health = "/Users/tanqiao/Desktop/DestinationFolder/Health"

# Create the output folders if they don't exist
os.makedirs(output_folder_home, exist_ok=True)
os.makedirs(output_folder_car, exist_ok=True)
os.makedirs(output_folder_health, exist_ok=True)

In this section, I set the path to the folder containing the images to predict (folder_path). I also set the paths for the three output folders (output_folder_home, output_folder_car, output_folder_health). Any output folder that does not exist yet is created with os.makedirs(); passing exist_ok=True means no error is raised when the folder is already there.
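The three folder variables can also be kept in a dictionary keyed by the numeric label, which makes it easier to add a class later. Below is a minimal sketch on a temporary directory. The names `FOLDER_NAMES` and `make_output_folders` are hypothetical, and the label order (0 = Home, 1 = Car, 2 = Health) is assumed to follow the classes list used in training.

```python
import os
import tempfile

# Assumed label order, matching the classes list used in training
FOLDER_NAMES = {0: "Home", 1: "Car", 2: "Health"}


def make_output_folders(base_dir, folder_names=FOLDER_NAMES):
    """Create one sub-folder per label and return a label -> path mapping."""
    folders = {}
    for label, name in folder_names.items():
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        folders[label] = path
    return folders


# Demonstration in a throwaway directory instead of the Desktop path
with tempfile.TemporaryDirectory() as base_dir:
    output_folders = make_output_folders(base_dir)
    all_exist = all(os.path.isdir(p) for p in output_folders.values())
    car_folder_name = os.path.basename(output_folders[1])

print("All output folders created:", all_exist)
```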

In [None]:
# Iterate over the images in the folder
for file_name in os.listdir(folder_path):
    # Construct the full path to the image
    image_path = os.path.join(folder_path, file_name)

    # Open the image
    image = Image.open(image_path)

    # Resize the image to 500x500 pixels
    image = image.resize((500, 500))

    # Convert the image to grayscale
    image = image.convert('L')

    # Use pytesseract to convert the image to text
    text = pytesseract.image_to_string(image)

    # Preprocess the text
    text = text.replace('Auto Insurance Claim Form', 'Car Insurance Claim Form')
    text = text.replace('Auto Insurance', 'Car Insurance Claim Form')
    text = text.replace('Vehicle Insurance Claim Form',
                        'Car Insurance Claim Form')
    text = text.replace('Vehicle Insurance', 'Car Insurance Claim Form')
    text = text.replace('Motor Insurance', 'Car Insurance Claim Form')
    text = text.replace('Property Insurance Claim Form',
                        'Home Insurance Claim Form')
    text = text.replace('Property Insurance', 'Home Insurance Claim Form')
    text = text.replace('Health Insurance', 'Health Insurance Claim Form')
    text = text.replace('Life Insurance', 'Health Insurance Claim Form')

    # Vectorize the preprocessed text using the fitted vectorizer
    x_new = vectorizer.transform([text])

    # Make the prediction
    predicted_label = clf.predict(x_new)[0]
    class_names = ['Home Insurance Claim Form', 'Car Insurance Claim Form',
                   'Health Insurance Claim Form']
    predicted_class = class_names[predicted_label]

    # Define the destination folder based on the predicted class
    if predicted_class == 'Home Insurance Claim Form':
        destination_folder = output_folder_home
    elif predicted_class == 'Car Insurance Claim Form':
        destination_folder = output_folder_car
    elif predicted_class == 'Health Insurance Claim Form':
        destination_folder = output_folder_health

    # Copy the image to the corresponding destination folder
    shutil.copy(image_path, os.path.join(destination_folder, file_name))

    # Print the predicted class and destination folder for current image
    print(f"Predicted class for {file_name}: {predicted_class}")
    print(f"Destination folder: {destination_folder}")

This section iterates over each image in the specified folder. It performs image processing steps such as resizing the image to 500x500 pixels and converting it to grayscale. The image is then converted to text using pytesseract. The text is preprocessed by replacing specific terms to match the trained model's class labels. The preprocessed text is vectorized using the fitted vectorizer. The model predicts the class label for the image, and based on the predicted class, the corresponding destination folder is defined. The image is then copied to the appropriate destination folder using shutil. Finally, the predicted class and the destination folder for each image are printed.
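Two details of the loop can be sketched separately. First, os.listdir() returns every entry in the folder, including hidden files such as `.DS_Store` on macOS, which Image.open() cannot read. Second, the if/elif chain can be expressed as a dictionary lookup. The extension list and the helpers `is_image_file` and `route_file` below are assumptions for illustration, and placeholder bytes stand in for real images.

```python
import os
import shutil
import tempfile

# Assumed set of image extensions; extend it to match your dataset
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def is_image_file(file_name):
    """Return True when the file extension looks like an image format."""
    return os.path.splitext(file_name)[1].lower() in IMAGE_EXTENSIONS


def route_file(image_path, predicted_label, folders_by_label):
    """Copy the file into the folder registered for the predicted label."""
    destination = os.path.join(folders_by_label[predicted_label],
                               os.path.basename(image_path))
    shutil.copy(image_path, destination)
    return destination


with tempfile.TemporaryDirectory() as tmp_dir:
    source_dir = os.path.join(tmp_dir, "source")
    car_dir = os.path.join(tmp_dir, "Car")
    os.makedirs(source_dir)
    os.makedirs(car_dir)
    # One fake image and one hidden file that should be skipped
    for name in ("claim_001.png", ".DS_Store"):
        with open(os.path.join(source_dir, name), "wb") as handle:
            handle.write(b"placeholder")

    candidates = sorted(f for f in os.listdir(source_dir) if is_image_file(f))
    # Pretend the classifier predicted label 1 (Car) for the first candidate
    copied = route_file(os.path.join(source_dir, candidates[0]), 1, {1: car_dir})
    copied_exists = os.path.isfile(copied)
    copied_name = os.path.basename(copied)

print("Files considered for prediction:", candidates)
```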

In summary, this code loads the trained model and vectorizer, processes the images in a specified folder, extracts text from each image using pytesseract, preprocesses the text, and predicts the class label for each image. Each image is then copied to the output folder that corresponds to its predicted class.