# Extract Text From Images

As we saw in the previous notebook, some messages contain no text, so we will try to extract information from the images attached to them.

We will run OCR on every image, not only on those attached to messages without text, because a photo sometimes contains information that the message text leaves out.

### PaddleOCR

We will use PaddleOCR to extract text from the images. PaddleOCR is an open-source optical character recognition (OCR) toolkit and one of the more accurate options available. It can detect and recognize more than 80 languages, supports multiple detection and recognition algorithms and models, and has straightforward documentation. Its main advantage for this project is that it handles text-in-the-wild (scene text) well, which document-oriented OCR engines such as Tesseract generally struggle with.

Our problem is a text-in-the-wild problem: job advertisements vary widely in design and layout, so the images are neither scanned documents nor structured forms. That makes PaddleOCR a good fit.

You can find the project here: [PaddleOCR GitHub](https://github.com/PaddlePaddle/PaddleOCR)


In [2]:
import numpy as np
from paddleocr import PaddleOCR
import pandas as pd


In [3]:
# reading data
data = pd.read_csv("../data/data_v1.csv")
data.head()

Unnamed: 0,id,date,photo,width,height,text,text_entities,raw_text,type
0,1567,2022-01-02T13:46:28,photos/photo_1205@02-01-2022_13-46-28.jpg,800.0,419.0,"['Job Title:', {'type': 'hashtag', 'text': '#s...","[{'type': 'plain', 'text': 'Job Title:'}, {'ty...",Job Title:#senior and a junior #developer\n \n...,job_post
1,1568,2022-01-03T11:09:36,photos/photo_1206@03-01-2022_11-09-36.jpg,1110.0,1124.0,"['Job Title:', {'type': 'hashtag', 'text': '#c...","[{'type': 'plain', 'text': 'Job Title:'}, {'ty...",Job Title:#cashier\nJob Type: #full_time\n \nش...,job_post
2,1569,2022-01-03T14:28:11,photos/photo_1207@03-01-2022_14-28-11.jpg,1280.0,1267.0,"['Company: ', {'type': 'hashtag', 'text': '#Na...","[{'type': 'plain', 'text': 'Company: '}, {'typ...",Company: #National_Technology_Group #NTG)\nJob...,job_post
3,1570,2022-01-03T17:12:13,photos/photo_1208@03-01-2022_17-12-13.jpg,1014.0,1124.0,"['Job title: ', {'type': 'hashtag', 'text': '#...","[{'type': 'plain', 'text': 'Job title: '}, {'t...",Job title: #Employees for Operations Departmen...,job_post
4,1571,2022-01-03T19:16:11,photos/photo_1209@03-01-2022_19-16-11.jpg,1062.0,1125.0,"[{'type': 'link', 'text': 'https://www.faceboo...","[{'type': 'link', 'text': 'https://www.faceboo...",https://www.facebook.com/384708578676644/posts...,link


In [10]:
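# Initialize the model: English recognition, no angle classifier, GPU inference,
# and discard recognized lines whose confidence is below drop_score (0.92)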
ocr = PaddleOCR(
    use_angle_cls=False, lang="en", drop_score=0.92, show_log=False, use_gpu=True
)
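
To make the effect of `drop_score` concrete, here is a minimal sketch of confidence filtering on a hand-written result. It assumes each recognized line has the shape `[box, (text, score)]`, which is the shape the extraction loop in this notebook indexes with `line[1][0]`. The helper `keep_confident` and the sample data are hypothetical and are not part of PaddleOCR; the library applies its own filtering internally.

```python
from typing import List, Sequence, Tuple

Box = List[List[float]]
OcrLine = Tuple[Box, Tuple[str, float]]


def keep_confident(lines: Sequence[OcrLine], min_score: float = 0.92) -> List[str]:
    """Return the text of every line whose confidence is at least min_score."""
    return [text for _box, (text, score) in lines if score >= min_score]


# Hypothetical OCR output for one image (made-up boxes, texts, and scores)
sample = [
    ([[0, 0], [10, 0], [10, 5], [0, 5]], ("Senior Developer", 0.98)),
    ([[0, 6], [10, 6], [10, 11], [0, 11]], ("App1y n0w", 0.61)),
]

print(keep_confident(sample))
```

A high threshold such as 0.92 trades recall for precision: fewer misread words end up in the extracted text, but low-contrast or stylized text may be dropped entirely.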


In [4]:
data.shape, data.photo.notna().sum()


((892, 9), 708)

The data contains 892 messages, 708 of which have a photo.


In [13]:
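# Object array holding one OCR string per row
ocr_res = np.empty(data.photo.shape, dtype="O")
for i, img_path in enumerate(data.photo.values):
    # pd.notna is more robust than an identity check against np.nan
    if pd.notna(img_path):
        result = ocr.ocr("../data/" + img_path, cls=False, det=True, rec=True)
        # Each line is [box, (text, score)]; keep only the text
        txts = [line[1][0] for line in (result or [])]
    else:
        txts = [""]
    ocr_res[i] = " ".join(txts)
    if i % 100 == 0:
        print(f"{i} images done!")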


0 images done!
100 images done!
200 images done!
300 images done!
400 images done!
500 images done!
600 images done!
700 images done!
800 images done!
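
The text-joining step of the loop can be pulled out into a small helper that also tolerates an empty or missing result. This is only a sketch: `join_ocr_text` is a hypothetical name, and it assumes the flat `[box, (text, score)]` line shape used in this notebook. If your installed PaddleOCR version nests results one level deeper (one list per image), unwrap that level before calling it.

```python
def join_ocr_text(result) -> str:
    """Join the recognized text of all lines with spaces.

    Assumes `result` is a flat list of [box, (text, score)] entries,
    or None / empty when nothing was recognized.
    """
    if not result:
        return ""
    return " ".join(line[1][0] for line in result)


# Hand-written example in the assumed shape (not real OCR output)
fake_result = [
    [[[0, 0], [1, 0], [1, 1], [0, 1]], ("Job Title:", 0.99)],
    [[[0, 2], [1, 2], [1, 3], [0, 3]], ("Cashier", 0.97)],
]

print(join_ocr_text(fake_result))
print(repr(join_ocr_text(None)))
```

Keeping this logic in one function makes it easy to test without running the model.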


In [14]:
# append ocr results to the data
data["ocr_res"] = ocr_res


In [15]:
data.loc[data.ocr_res == "", "type"].value_counts()


job_post    151
others       18
empty        15
link          2
Name: type, dtype: int64
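
An empty OCR result can mean two different things: the message had no photo at all, or it had a photo in which no text passed the confidence threshold. The sketch below separates the two cases on a few made-up rows. The rows, the helper `empty_ocr_reason`, and the category names are hypothetical; in the real DataFrame a missing photo is `NaN` rather than `None`, so the check would use `pd.isna` instead.

```python
from collections import Counter

# Made-up rows mimicking the columns used in this notebook
rows = [
    {"photo": "photos/a.jpg", "ocr_res": "Senior Developer", "type": "job_post"},
    {"photo": "photos/b.jpg", "ocr_res": "", "type": "job_post"},
    {"photo": None, "ocr_res": "", "type": "others"},
    {"photo": None, "ocr_res": "", "type": "job_post"},
]


def empty_ocr_reason(row: dict) -> str:
    """Classify a row by why its OCR result is (or is not) empty."""
    if row["ocr_res"] != "":
        return "has_text"
    return "no_photo" if row["photo"] is None else "photo_without_text"


reasons = Counter(empty_ocr_reason(row) for row in rows)
print(reasons)
```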

### Creating Full Text
We will create a new column called *full_text* that contains the original message text followed by the OCR result.

In [16]:
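# Replace missing values with empty strings; plain assignment avoids
# the pitfalls of calling fillna(inplace=True) on an attribute-accessed column
data["raw_text"] = data["raw_text"].fillna("")
data["ocr_res"] = data["ocr_res"].fillna("")

# Message text first, then the OCR text on a new line
data["full_text"] = data["raw_text"] + "\n" + data["ocr_res"]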
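
For a single row, the concatenation works roughly as in the plain-Python sketch below. `build_full_text` is a hypothetical helper written only to illustrate the per-row logic, including how missing values become empty strings; the notebook itself uses the vectorized pandas expression.

```python
import math


def build_full_text(raw_text, ocr_text) -> str:
    """Combine message text and OCR text, treating None/NaN as empty."""

    def clean(value) -> str:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return ""
        return str(value)

    return clean(raw_text) + "\n" + clean(ocr_text)


print(repr(build_full_text("Job Title: #cashier", "Cashier needed")))
print(repr(build_full_text(float("nan"), "Cashier needed")))
```

Note that the separator is always present, so a row with neither text nor OCR output ends up as a single newline rather than an empty string.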

In [18]:
data.to_csv("../data/data_v2.csv", index=False)