# Help visually impaired people to see the world

Combining leading-edge technologies to help people – Connect the dots:


1. **Classifying** different **pictures** (tickets, floorplans and documents)
2. **Recognizing text** in the images
3. Reading out loud through **text-to-speech** (English)



## <font color=red>Advice:</font>
<font color=red>To train the network quickly, enable GPU hardware acceleration from the Runtime menu: 'Runtime -> Change runtime type -> Hardware accelerator: GPU'.

In the Spanish interface: 'Entorno de ejecución -> Cambiar tipo de entorno de ejecución -> Acelerador por hardware: GPU'.</font>

**STEPS:**

1. Understanding the text of the images
2. Creating a model to predict the type of information of tickets




### 1. Understanding the text of the images

Once we have detected a ticket picture, we need to select its relevant information, so we have to understand its text. To do so, we need to know which part of the text corresponds to each type of data we want to obtain (`company name`, `address`, `date` and `final amount`).

We have the real text of each ticket (`bbox` files) and its final summary, but no field that relates them. So we have to match each text box with the summary sentence for each tag.

To do so, we use the TF-IDF method to turn each sentence of the ticket, and each sentence of the summary, into a numeric vector. We then compare the vectors with cosine similarity.
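
As a minimal sketch of this idea, the snippet below computes a token-level TF-IDF vector for two texts and their cosine similarity using only the standard library. The helper names (`tokenize`, `tf_idf_vector`, `cosine`, `toy_similarity`) and the four sample lines are assumptions made for illustration. The values are not guaranteed to match those of the `nltk` `TextCollection` used later in this notebook.

```python
import math
import re

def tokenize(s):
    # Split on runs of word characters, similar to RegexpTokenizer(r'\w+')
    return re.findall(r'\w+', s.upper())

def tf_idf_vector(tokens, vocab, corpus):
    # tf = relative frequency in this text, idf = log(N / documents containing the word)
    n_docs = len(corpus)
    vec = []
    for word in vocab:
        tf = tokens.count(word) / len(tokens)
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log(n_docs / df) if df else 0.0
        vec.append(tf * idf)
    return vec

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    if n1 == 0 or n2 == 0:  # avoid 0/0 when a vector is all zeros
        return 0.0
    return dot / (n1 * n2)

def toy_similarity(t1, t2, corpus_lines):
    corpus = [tokenize(line) for line in corpus_lines]
    w1, w2 = tokenize(t1), tokenize(t2)
    if not w1 or not w2:
        return 0.0
    vocab = sorted(set(w1) | set(w2))
    return cosine(tf_idf_vector(w1, vocab, corpus), tf_idf_vector(w2, vocab, corpus))

# Hypothetical ticket lines, used only for this illustration
lines = ["KEDAI RUNCIT CHONG HWA", "3, JALAN PERDANA 5,", "KEPONG, 52100 KL.", "TOTAL RM33.90"]
sim_same = toy_similarity("KEDAI RUNCIT CHONG HWA", lines[0], lines)
sim_other = toy_similarity("KEDAI RUNCIT CHONG HWA", lines[2], lines)
print(round(sim_same, 3), round(sim_other, 3))
```

A summary sentence that matches a ticket line word for word gets a similarity close to 1, while a line sharing no words with it gets 0.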

After that, we apply some post-processing steps to check that all sentences tagged `Company` are consecutive, and likewise for the `Address` tag, giving preference to the largest detected chunk. In addition, if neighbouring sentences have been given the same tag (`Company` or `Address`), we adjust the tags so that the whole chunk stays together.
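
The grouping step can be pictured with a simplified sketch: starting from the row with the best score, keep only the candidate rows that form an unbroken run around it. The function `consecutive_run` is a hypothetical helper written for this illustration. It ignores the extra check on rows that already carry a different tag, which the notebook's own functions perform.

```python
def consecutive_run(best, candidates):
    """Return the sorted run of consecutive row indices in `candidates` that contains `best`."""
    cand = set(candidates)
    run = [best]
    # Extend the run upwards (towards smaller row indices)
    i = best - 1
    while i in cand:
        run.insert(0, i)
        i -= 1
    # Extend the run downwards (towards larger row indices)
    i = best + 1
    while i in cand:
        run.append(i)
        i += 1
    return run

# Rows 3, 4 and 5 are adjacent; rows 1 and 9 are isolated and are left out
address_rows = consecutive_run(3, [1, 3, 4, 5, 9])
print(address_rows)
```

With the best match on row 3, only rows 3 to 5 would be tagged as one address chunk.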

In [None]:
!sudo apt install tesseract-ocr
!pip install pytesseract
!pip install easyocr
!pip install planar
!pip install spacy-transformers

Reading package lists... Done
Building dependency tree       
Reading state information... Done
tesseract-ocr is already the newest version (4.00~git2288-10f4998a-2).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 20 not upgraded.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# import necessary packages
import pandas as pd
import numpy as np

from glob import glob
from tqdm.notebook import tqdm

import matplotlib.pyplot as plt
from PIL import Image

import zipfile
import shutil
import os
import json

import csv
from csv import reader

import nltk
from nltk.text import TextCollection
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

from numpy.linalg import norm

from functools import lru_cache

In [None]:
# specify path to the dataset
ROOT = "/content/drive/MyDrive/ColabNotebooks/"
DOCS_DATASET_PATH = "Database"
NEW_DATA_PATH = 'DatabaseOrdered'
OCR_DATA_PATH = 'boxesAndKey'

# specify the paths to our training and validation set 
TRAIN = "train"
TEST = "test"

classes = ['Documents', 'Plans', 'Tickets']

ocr_extract = ['boxes', 'key']
boxColPos = ['0', '1', '2', '3', '4', '5', '6', '7']
worldCol = [*(f'word{i:02}' for i in range(1, 10))]


In [None]:
img_fns = glob(ROOT + NEW_DATA_PATH + '/' + TRAIN + '/Tickets/*') + glob(ROOT + NEW_DATA_PATH + '/' + TEST + '/Tickets/*')
print(len(img_fns))
print(img_fns[0:2])
print(img_fns[-2:])

626
['/content/drive/MyDrive/ColabNotebooks/DatabaseOrdered/train/Tickets/000.jpg', '/content/drive/MyDrive/ColabNotebooks/DatabaseOrdered/train/Tickets/001.jpg']
['/content/drive/MyDrive/ColabNotebooks/DatabaseOrdered/test/Tickets/624.jpg', '/content/drive/MyDrive/ColabNotebooks/DatabaseOrdered/test/Tickets/625.jpg']


In [None]:
with zipfile.ZipFile(ROOT+DOCS_DATASET_PATH+'.zip', 'r') as external_zip:
  for c in classes:
    with external_zip.open(c+'.zip') as internal_zip:
        with zipfile.ZipFile(internal_zip, 'r') as internal_zip_file:
          internal_zip_file.extractall(ROOT)
          
          if not os.path.exists(ROOT+OCR_DATA_PATH+'/'+ocr_extract[0]):
            os.makedirs(ROOT+OCR_DATA_PATH+'/'+ocr_extract[0])

          if not os.path.exists(ROOT+OCR_DATA_PATH+'/'+ocr_extract[1]):
            os.makedirs(ROOT+OCR_DATA_PATH+'/'+ocr_extract[1])

          filenames = internal_zip_file.namelist()
          for f in filenames:
            if f.endswith('.csv') and f.startswith(c+'/'+ocr_extract[0]):
              shutil.copy(ROOT+f, ROOT+OCR_DATA_PATH+'/'+ocr_extract[0])
            elif f.endswith('.json') and f.startswith(c+'/'+ocr_extract[1]):
              shutil.copy(ROOT+f, ROOT+OCR_DATA_PATH+'/'+ocr_extract[1])

          print(len(os.listdir(ROOT+OCR_DATA_PATH+'/'+ocr_extract[0])))
          print(len(os.listdir(ROOT+OCR_DATA_PATH+'/'+ocr_extract[1])))

In [None]:
def load_real_text(image_id):
  # image_id is the file name without extension (e.g. '000'); the caller has already extracted it
  boxFile = ROOT+OCR_DATA_PATH+'/'+ocr_extract[0]+'/'+image_id+'.csv'
  colNames = boxColPos + ['text']

  boxes = pd.DataFrame(columns=colNames)

  with open(boxFile, 'r') as read_obj:
    csv_reader = reader(read_obj)
    for row in csv_reader:
      d_box = row[0:8]
      d_text = row[8:len(row)]
      text = (','.join(d_text))
      d_csv = d_box+[text]

      boxes.loc[len(boxes)]=d_csv
  return boxes

In [None]:
def transform_boxes_file(boxes):
  bboxNewColumn = []
  for r in range(len(boxes)):
    bboxNewColumn.append([[boxes.iloc[r]['0'], boxes.iloc[r]['1']], 
                            [boxes.iloc[r]['2'], boxes.iloc[r]['3']],
                            [boxes.iloc[r]['4'], boxes.iloc[r]['5']],
                            [boxes.iloc[r]['6'], boxes.iloc[r]['7']],
                            ])

  boxes['bbox'] = bboxNewColumn

  boxes = boxes.drop(boxColPos, axis='columns')
  return boxes

In [None]:
def get_actual_text_and_boxes(id):
  image_id = img_fns[id].split('/')[-1].split('.')[0]
  boxes_image_id = load_real_text(image_id)
  boxesClean_image_id = transform_boxes_file(boxes_image_id)
  return boxesClean_image_id

In [None]:
id=0
text = get_actual_text_and_boxes(id)
text.head(10)

Unnamed: 0,text,bbox
0,TAN WOON YANN,"[[72, 25], [326, 25], [326, 64], [72, 64]]"
1,BOOK TA .K(TAMAN DAYA) SDN BND,"[[50, 82], [440, 82], [440, 121], [50, 121]]"
2,789417-W,"[[205, 121], [285, 121], [285, 139], [205, 139]]"
3,"NO.53 55,57 & 59, JALAN SAGU 18,","[[110, 144], [383, 144], [383, 163], [110, 163]]"
4,"TAMAN DAYA,","[[192, 169], [299, 169], [299, 187], [192, 187]]"
5,"81100 JOHOR BAHRU,","[[162, 193], [334, 193], [334, 211], [162, 211]]"
6,JOHOR.,"[[217, 216], [275, 216], [275, 233], [217, 233]]"
7,DOCUMENT NO : TD01167104,"[[50, 342], [279, 342], [279, 359], [50, 359]]"
8,DATE:,"[[50, 372], [96, 372], [96, 390], [50, 390]]"
9,25/12/2018 8:13:39 PM,"[[165, 372], [342, 372], [342, 389], [165, 389]]"


In [None]:
def get_actual_key_values(id):
  # id is the position of the image in img_fns, not the file name
  image_id = img_fns[id].split('/')[-1].split('.')[0]
  keyFile = ROOT+OCR_DATA_PATH+'/'+ocr_extract[1]+'/'+image_id+'.json'
  with open(keyFile) as f:
    jskey = json.load(f)
  return jskey

In [None]:
summary = get_actual_key_values(id)
summary

{'company': 'BOOK TA .K (TAMAN DAYA) SDN BHD',
 'date': '25/12/2018',
 'address': 'NO.53 55,57 & 59, JALAN SAGU 18, TAMAN DAYA, 81100 JOHOR BAHRU, JOHOR.',
 'total': '9.00'}

In [None]:
taggs = ['date', 'total', 'company', 'address']

@lru_cache(maxsize=1000000)
def tf_idf(word, t1, mytexts):
    return mytexts.tf_idf(word, t1)

def similarity(t1, t2, mytexts):
    words1 = tokenizer.tokenize(t1)
    words2 = tokenizer.tokenize(t2)

    if len(words1) == 0 or len(words2) == 0:
      return 0
    vocab = list(set(words1).union(words2))    # vocab contains the vocabulary (unique words) of both texts
    v1 = np.array([tf_idf(word, t1, mytexts) for word in vocab])
    v2 = np.array([tf_idf(word, t2, mytexts) for word in vocab])
    denom = norm(v1) * norm(v2)
    if denom == 0:    # an all-zero vector would give 0/0 (NaN) and a RuntimeWarning
      return 0
    sim = np.dot(v1, v2) / denom
    return round(sim, 3)
  


#find the consecutive values in a list given the lowest one
def find_cons_min_add(i_max, ind_v):
  ind_v = sorted(ind_v)
  for i in ind_v[1:]:
    if i - i_max == 1:
      if text.at[i,'final_tag'] != '' and text.at[i,'final_tag'] != 'address':
        break
      else:
        text.at[i,'final_tag'] = 'address'
        i_max = i
    else:
      break

#find the consecutive values in a list given the biggest one
def find_cons_max_add(i_max, ind_v):
  ind_v = sorted(ind_v, reverse=True)
  for i in ind_v[1:]:
    if i_max - i == 1:
      if text.at[i,'final_tag'] != '' and text.at[i,'final_tag'] != 'address':
        break
      else:
        text.at[i,'final_tag'] = 'address'
        i_max = i
    else:
      break

def first_non_consecutive(lst):
  for i, j in enumerate(lst, lst[0]):
    if i!=j:
      return j
  return -1

def check_tags(text, i):
  #print('CHECK')
  act_tag = text.at[i,'final_tag']
  indx = text.index[text['final_tag']==act_tag].tolist()
  #print(text.iloc[indx].head(5))
  f = first_non_consecutive(indx)
  #print('i', i)
  #print('f', f)
  if f != -1:
    pos = indx.index(f)
    l_indx_inf = indx[0:pos+1]
    l_indx_inf = l_indx_inf[:-1]
    l_indx_sup = indx[pos:len(indx)]
    l_indx_sup = l_indx_sup[1:]
    #print('l_indx_sup', l_indx_sup)
    #print('l_indx_inf', l_indx_inf)

    if(len(l_indx_sup)>len(l_indx_inf)):
      #print('\tIF')
      for j in l_indx_inf:
        print('\tupdated!')
        text.at[j, 'final_tag'] = 'kk'
    elif (len(l_indx_sup)<len(l_indx_inf)):
      #print('\tELEIF')
      for j in l_indx_sup:
        print('\tupdated!')
        text.at[j, 'final_tag'] = 'kk'
    else:
      #print('\tELSE')
      col = 'tag_' + act_tag
      v_sup = max(text.iloc[l_indx_sup][col])
      v_inf = max(text.iloc[l_indx_inf][col])
      if v_sup > v_inf:
        for j in l_indx_inf:
          print('\tupdated!')
          text.at[j, 'final_tag'] = 'kk'
      else:
        for j in l_indx_sup:
          print('\tupdated!')
          text.at[j, 'final_tag'] = 'kk'




def find_cons_min_comp(i_max, ind_v):
  #print('MIN')
  ind_v = sorted(ind_v)
  for i in ind_v[1:]:
    if i - i_max == 1:
      if text.at[i,'final_tag'] != '' and text.at[i,'final_tag'] != 'company':
        check_tags(text, i)
        break
      else:
        text.at[i,'final_tag'] = 'company'
        i_max = i
    else:
      break

def find_cons_max_comp(i_max, ind_v):
  #print('MAX')
  ind_v = sorted(ind_v, reverse=True)
  for i in ind_v[1:]:
    if i_max - i == 1:
      if text.at[i,'final_tag'] != '' and text.at[i,'final_tag'] != 'company':
        check_tags(text, i)
        break
      else:
        text.at[i,'final_tag'] = 'company'
        i_max = i
    else:
      break

def tag_textBoxes(text, summary):
  #Create empty columns to save similarities
  text['tag_date'] = ''
  text['tag_total'] = ''
  text['tag_company'] = ''
  text['tag_address'] = ''
  text['final_tag'] = ''

  mytexts = TextCollection(list(map(lambda x: x.upper(), text['text'].tolist())))

  #Get the similarity by every summarized field with the text of the receipt 
  for t in summary.keys():
    search_t = summary[t]
    for i in range(len(text['text'])):
      simT = similarity(search_t, text['text'].iloc[i], mytexts)
      colname = 'tag_'+t
      text.at[i,colname] = float(simT)

  #Process bad results
  text['tag_date'] = text['tag_date'].fillna(0)
  text['tag_total'] = text['tag_total'].fillna(0)
  text['tag_company'] = text['tag_company'].fillna(0)
  text['tag_address'] = text['tag_address'].fillna(0)
  


  #Find the DATE value in the receipt
  max_values = text.sort_values(by=['tag_date'], ascending=False)
  i = max_values.index[0]

  if text.at[i,'tag_date'] > 0:    # tag only when some similarity was found; a date on row 0 is valid too
    text.at[i,'final_tag'] = 'date'

  #Find the TOTAL value in the receipt
  max_value = max(text['tag_total'].tolist())
  
  if max_value>0:
    price = sorted(text.index[text['tag_total']==max_value].tolist(), reverse=True)
    for i in price:
      if text.at[i,'final_tag'] == '':
        text.at[i,'final_tag'] = 'total'
        break
  
  #Find the ADDRESS value in the receipt
  max_values = text.sort_values(by=['tag_address'], ascending=False)['tag_address'].tolist()
  m_values = max_values[0:5]


  m_values = [i for i in m_values if i != 0]
  if len(m_values)>0:
    top_m = m_values[0]
    m_values=list(set(m_values))

    ind_v = []
    for m_v in m_values:
      ind_v += text.index[text['tag_address']==m_v].tolist()

    i_max = text.index[text['tag_address']==top_m].tolist()[0]
    text.at[i_max,'final_tag'] = 'address'

    if i_max == min(ind_v):
      find_cons_min_add(i_max, ind_v)

    if i_max == max(ind_v):
      find_cons_max_add(i_max, ind_v)

    if (i_max != max(ind_v) and (i_max != min(ind_v))):
      ind_v = sorted(ind_v)
      pos = ind_v.index(i_max)

      l_inf = ind_v[0:pos+1]
      find_cons_max_add(i_max, l_inf)

      l_sup = ind_v[pos:len(ind_v)]
      find_cons_min_add(i_max, l_sup)

  #Find the COMPANY value in the receipt
  #print('-----COMPANY-------')
  max_values = text.sort_values(by=['tag_company'], ascending=False)['tag_company'].tolist()
  m_values = max_values[0:5]

  m_values = [i for i in m_values if i != 0]
  #print('m_values', m_values)

  if len(m_values)>0:
    top_m = m_values[0]
    m_values=list(set(m_values))


    ind_v = []
    for m_v in m_values:
      ind_v += text.index[text['tag_company']==m_v].tolist()

    i_max = text.index[text['tag_company']==top_m].tolist()[0]
    #print(text.iloc[ind_v].head(5))


    text.at[i_max,'final_tag'] = 'company'
    if top_m < 1:
      if i_max == min(ind_v):
        #print('IF 1')
        find_cons_min_comp(i_max, ind_v)

      if i_max == max(ind_v):
        #print('IF 2')
        find_cons_max_comp(i_max, ind_v)

      if (i_max != max(ind_v) and (i_max != min(ind_v))):
        #print('IF 3')
        #print('i_max', i_max)
        
        ind_v = sorted(ind_v)
        #print('ind_v', ind_v)
        pos = ind_v.index(i_max)
        l_inf = ind_v[0:pos+1]
        #print('l_inf', l_inf)
        find_cons_max_comp(i_max, l_inf)

        l_sup = ind_v[pos:len(ind_v)]
        #print('l_sup', l_sup)
        find_cons_min_comp(i_max, l_sup)

  return text

In [None]:
id=414
text = get_actual_text_and_boxes(id)
summary = get_actual_key_values(id)
print(summary)

{'company': 'KEDAI UHAT DAN RUNCIT CHONG HWA', 'date': 'OCT 3, 2016', 'address': '3, JALAN PERDANA 5, TAMAN INDAH PERDANA, KEPONG, 52100 KL.', 'total': 'RM33.90'}


In [None]:
t = tag_textBoxes(text, summary)
t.loc[t['final_tag'].isin(taggs)]

Unnamed: 0,text,bbox,tag_date,tag_total,tag_company,tag_address,final_tag
0,KEDAI UHAT DAN RUNCIT CHONG HWA,"[[166, 183], [793, 183], [793, 224], [166, 224]]",0.0,0.0,1.0,0.146,company
1,"3, JALAN PERDANA 5, TAMAN INDAH PERDANA,","[[63, 226], [868, 226], [868, 281], [63, 281]]",0.039,0.089,0.175,0.846,address
2,"KEPONG, 52100 KL.","[[304, 282], [648, 282], [648, 325], [304, 325]]",0.0,0.0,0.0,0.614,address
8,"OCT 3, 2016 12:16:25 PM","[[349, 530], [810, 530], [810, 572], [349, 572]]",0.773,0.082,0.0,0.108,date
23,RM33.90,"[[742, 1028], [896, 1028], [896, 1102], [742, ...",0.168,1.0,0.0,0.071,total


In [None]:
id=32
text = get_actual_text_and_boxes(id)
summary = get_actual_key_values(id)
print(summary)

{'company': 'UNIHAKKA INTERNATIONAL SDN BHD', 'date': '10 MAR 2018', 'address': '12, JALAN TAMPOI 7/4,KAWASAN PERINDUSTRIAN TAMPOI,81200 JOHOR BAHRU,JOHOR', 'total': ''}


In [None]:
t = tag_textBoxes(text, summary)
t.loc[t['final_tag'].isin(taggs)]

Unnamed: 0,text,bbox,tag_date,tag_total,tag_company,tag_address,final_tag
0,UNIHAKKA INTERNATIONAL SDN BHD,"[[338, 337], [628, 337], [628, 354], [338, 354]]",0.0,0.0,1.0,0.0,company
1,10 MAR 2018 18:24,"[[430, 353], [536, 353], [536, 367], [430, 367]]",0.854,0.0,0.0,0.024,date
3,"12, JALAN TAMPOI 7/4,KAWASAN PERINDUSTRIAN","[[358, 389], [607, 389], [607, 405], [358, 405]]",0.0,0.0,0.0,0.727,address
4,"TAMPOI,81200 JOHOR BAHRU,JOHOR","[[386, 407], [576, 407], [576, 425], [386, 425]]",0.0,0.0,0.0,0.831,address


In [None]:
textes = []
bboxes = []
final_tags = []

for id in range(len(img_fns)):
  text = get_actual_text_and_boxes(id)
  summary = get_actual_key_values(id)
  t = tag_textBoxes(text, summary)
  #t = t.loc[t['final_tag'].isin(taggs)]

  textes += t['text'].tolist()
  bboxes += t['bbox'].tolist()
  final_tags += t['final_tag'].tolist()

df = pd.DataFrame({'text': textes, 'bbox': bboxes, 'final_tag': final_tags})
df.head()

print('DONE')

	updated!
DONE


In [None]:
df = pd.DataFrame({'text': textes, 'bbox': bboxes, 'final_tag': final_tags})
df.head(10)

Unnamed: 0,text,bbox,final_tag
0,TAN WOON YANN,"[[72, 25], [326, 25], [326, 64], [72, 64]]",
1,BOOK TA .K(TAMAN DAYA) SDN BND,"[[50, 82], [440, 82], [440, 121], [50, 121]]",company
2,789417-W,"[[205, 121], [285, 121], [285, 139], [205, 139]]",
3,"NO.53 55,57 & 59, JALAN SAGU 18,","[[110, 144], [383, 144], [383, 163], [110, 163]]",address
4,"TAMAN DAYA,","[[192, 169], [299, 169], [299, 187], [192, 187]]",address
5,"81100 JOHOR BAHRU,","[[162, 193], [334, 193], [334, 211], [162, 211]]",address
6,JOHOR.,"[[217, 216], [275, 216], [275, 233], [217, 233]]",address
7,DOCUMENT NO : TD01167104,"[[50, 342], [279, 342], [279, 359], [50, 359]]",
8,DATE:,"[[50, 372], [96, 372], [96, 390], [50, 390]]",
9,25/12/2018 8:13:39 PM,"[[165, 372], [342, 372], [342, 389], [165, 389]]",date


In [None]:
df.tail(10)

Unnamed: 0,text,bbox,final_tag
33598,(RM),"[[314, 713], [357, 713], [357, 735], [314, 735]]",
33599,(RM),"[[433, 712], [478, 712], [478, 734], [433, 734]]",
33600,SR @ A,"[[125, 747], [197, 747], [197, 766], [125, 766]]",
33601,11.32,"[[318, 745], [366, 745], [366, 763], [318, 763]]",
33602,0.68,"[[437, 745], [478, 745], [478, 762], [437, 762]]",
33603,TOTAL,"[[124, 772], [170, 772], [170, 790], [124, 790]]",
33604,11.32,"[[318, 773], [366, 773], [366, 790], [318, 790]]",
33605,0.68,"[[439, 772], [478, 772], [478, 789], [439, 789]]",
33606,THANK YOU,"[[242, 810], [362, 810], [362, 828], [242, 828]]",
33607,"FOR ANY ENQUIRY, PLEASE CONTACT US:","[[159, 837], [461, 837], [461, 853], [159, 853]]",


### 2. Creating a model to predict the type of information of tickets

Now we have a dataset with the text of all tickets and their tags, so we can train a supervised model that assigns the corresponding tag to future texts. We will use it after extracting the text from future tickets.



Let's use *sklearn* to build the set of vectors for all sentences using the **tf-idf** model. 

Variable *tfidf* will be a sparse matrix of dimensions "number of sentences" by "size of the vocabulary"
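
To see where these dimensions come from, here is a small standard-library sketch that builds a sparse sentence-by-vocabulary matrix for four sample sentences. It stores raw term counts rather than tf-idf weights to stay short, and the `tokenize` helper is an assumption that mirrors the default token pattern of `TfidfVectorizer` (lower-cased tokens of two or more word characters).

```python
import re

def tokenize(sentence):
    # Lower-case and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", sentence.lower())

sentences = ["TAN WOON YANN", "TAMAN DAYA,", "81100 JOHOR BAHRU,", "JOHOR."]
tokenized = [tokenize(s) for s in sentences]

# One column per distinct term
vocabulary = sorted({tok for toks in tokenized for tok in toks})
column = {term: j for j, term in enumerate(vocabulary)}

# Sparse representation: one dict per row, holding only the non-zero columns
rows = []
for toks in tokenized:
    row = {}
    for tok in toks:
        row[column[tok]] = row.get(column[tok], 0) + 1
    rows.append(row)

shape = (len(rows), len(vocabulary))
non_zero = sum(len(r) for r in rows)
print(shape, non_zero)
```

Most cells are zero because each sentence uses only a few terms of the vocabulary, which is why a sparse matrix is a natural fit.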

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(df['text'])
print('Sentences:', tfidf.shape[0], "  Vocabulary size:", tfidf.shape[1])

Sentences: 33608   Vocabulary size: 7599


In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
print(tfidf_feature_names)





Let's try to predict the tag of some texts in the corpus using the **tf-idf** model.

In order to predict the tag of a sentence we will do the following:

1. We create the vectorized tf-idf matrix.
2. We separate the data into two different datasets: One for training the model and another for testing. The data for testing will be used to measure the performance of the built model and it will not be used for building the model.
3. We create the model. In this case we use a simple method called *KNeighborsClassifier*. When asked to make a prediction, it looks for the most similar sentences in the training set (according to the given similarity metric; 5 neighbours with the default settings) and returns the most common tag among them as the prediction.
4. Test the performance of the model by comparing predictions of the model for the testing texts with the actual tag of the text, and print it. Special attention should be paid to the value of the **accuracy** of the results.
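
The prediction rule in step 3 and the accuracy in step 4 can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the scikit-learn implementation: `cosine_distance` and `knn_predict` are hypothetical helpers, and the 3-dimensional vectors stand in for real tf-idf rows.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (na * nb)

def knn_predict(train_vectors, train_tags, query, k=5):
    # Rank training rows by distance to the query and vote among the k closest
    order = sorted(range(len(train_vectors)),
                   key=lambda i: cosine_distance(train_vectors[i], query))
    votes = Counter(train_tags[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training and test data
train_vectors = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.8, 0.2], [0, 0, 1]]
train_tags = ['company', 'company', 'address', 'address', '']
test_vectors = [[0.8, 0.2, 0], [0.1, 0.9, 0.1]]
test_tags = ['company', 'address']

predictions = [knn_predict(train_vectors, train_tags, v, k=3) for v in test_vectors]
accuracy = sum(p == t for p, t in zip(predictions, test_tags)) / len(test_tags)
print(predictions, accuracy)
```

Each test vector takes the tag that wins the vote among its three nearest training vectors, and accuracy is the share of test rows whose predicted tag equals the actual one.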


In [None]:
import pickle

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

#1
tfidf = tfidf_vectorizer.fit_transform(df['text'])


#2
X_train, X_test, Y_train, Y_test = train_test_split(tfidf, np.array(df.final_tag), test_size=0.3, shuffle=True, random_state=42)

#3
neigh = KNeighborsClassifier(metric='cosine')
neigh.fit(X_train, Y_train)

#4
Y_pred = neigh.predict(X_test)
print(classification_report(Y_test, Y_pred))

              precision    recall  f1-score   support

                   0.96      0.99      0.98      8991
     address       0.95      0.82      0.88       509
     company       0.94      0.79      0.86       225
        date       0.86      0.67      0.75       166
       total       0.07      0.01      0.01       192

    accuracy                           0.96     10083
   macro avg       0.75      0.66      0.70     10083
weighted avg       0.94      0.96      0.95     10083
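
When reading this report, note that most test rows carry the empty tag, so accuracy is dominated by that class. As a rough check, the snippet below uses the support values printed above to compute the accuracy of a hypothetical baseline that always predicts the most frequent tag.

```python
# Support values copied from the classification report above
support = {'': 8991, 'address': 509, 'company': 225, 'date': 166, 'total': 192}

total_rows = sum(support.values())
majority_tag, majority_count = max(support.items(), key=lambda kv: kv[1])

# Accuracy of always predicting the most frequent tag
baseline_accuracy = majority_count / total_rows
print(total_rows, repr(majority_tag), round(baseline_accuracy, 3))
```

Such a baseline already reaches about 0.89 accuracy, so the per-class rows (for example the f1-score of 0.01 for `total`) say more about the model than the overall accuracy of 0.96.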



In [None]:
filename = ROOT+'tag_model.sav'
with open(filename, 'wb') as f:
  pickle.dump(neigh, f)

# load the model from disk
with open(filename, 'rb') as f:
  loaded_model = pickle.load(f)
result = loaded_model.predict(X_test)
print(classification_report(Y_test, result))

              precision    recall  f1-score   support

                   0.96      0.99      0.98      8991
     address       0.95      0.82      0.88       509
     company       0.94      0.79      0.86       225
        date       0.86      0.67      0.75       166
       total       0.07      0.01      0.01       192

    accuracy                           0.96     10083
   macro avg       0.75      0.66      0.70     10083
weighted avg       0.94      0.96      0.95     10083

