## Installing Tesseract

In [1]:
!apt-get update
!apt-get install -y tesseract-ocr
!pip install pytesseract
!tesseract -v

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:2 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:3 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:10 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:11 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [8,696 kB]
Get:12 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [1,318 kB]
Get:13 https://r2u.stat.illinois.edu/ubuntu jammy/
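
The pipeline further down hard-codes `/usr/bin/tesseract` as the binary path. On other machines the binary may live elsewhere, so it can help to look it up first. A minimal sketch using only the standard library; the helper name `find_tesseract` is hypothetical and not part of the notebook:

```python
import shutil

def find_tesseract(binary_name="tesseract"):
    """Return the absolute path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(binary_name)

path = find_tesseract()
if path is None:
    print("tesseract not found on PATH; install it before running the OCR cells")
else:
    # With pytesseract installed, the path could then be assigned like this:
    # pytesseract.pytesseract.tesseract_cmd = path
    print("tesseract found at", path)
```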

## First test

In [3]:
import pytesseract
from PIL import Image

# Load the image
image_path = "image6.png"
image = Image.open(image_path)

# Extract the text with OCR
text = pytesseract.image_to_string(image)

print(text)

Sent on: Friday, June 23, 2023 11:31:12 AM
To:
Subject: YOUR ACCOUNT IS AT RISK!!

Dear Valued User ,

We received a request from you to terminate your Office 365 email due to a dual
college/universities account. This process has begun by our administrator. If you did not
authorize this action and you have no knowledge of it, you are advised to re-verify your account.
Please give us 24 hours to terminate your account if you initiated the request. Failure to re-verify
will result in the closure of your account and you will lose all of my files on these 365 accounts.

If this request was made accidentally and you have no knowledge of it, you are advised to copy
and paste the URL Below into the address bar of your web browser to fill in the form.

cutt.ly/OwtNi6KO
Failure to Verify will result in the closure of your account.

lowa State University
IT Helpdesk All Right Reserved.



## Integration with NLP

### Building the keyword dictionary
Keyword dictionary generated with GPT.

In [4]:
keywords_list = {
    "urgency": [
        "urgent", "immediately", "important", "required", "warning", "alert",
        "final notice", "act now", "verify now", "last chance", "immediate action",
        "action required", "limited time", "confirm immediately", "security alert",
        "urgent action", "expires soon", "time sensitive", "you must act fast",
        "one-time offer", "before it’s too late", "failure to re-verify",
        "will result in the closure of your account", "loss of my files", "closure of your account",
        "failure to verify", "you are advised", "please give us 24 hours", "last chance to act"
    ],
    "account_security": [
        "verify", "account", "login", "credentials", "username", "password",
        "access", "unauthorized", "locked", "suspended", "disabled", "unusual activity",
        "reset password", "update required", "security check", "confirm identity",
        "new login", "account recovery", "secure your account", "password change",
        "identity verification", "check your account", "re-verify your account", "terminate your office 365 email"
    ],
    "financial": [
        "invoice", "billing", "payment", "refund", "charge", "credit", "debit",
        "balance", "transaction", "statement", "overdue", "amount due", "penalty",
        "wire transfer", "escrow", "funds", "processing fee", "unpaid balance",
        "suspicious transaction", "credit card", "bank details", "payment request",
        "transaction alert", "funds transfer", "payment verification", "tax payment"
    ],
    "banks_and_services": [
        "paypal", "bank", "visa", "mastercard", "american express", "discover",
        "chase", "wells fargo", "citibank", "hsbc", "capital one", "revolut",
        "apple pay", "google pay", "zelle", "venmo", "western union", "moneygram",
        "secure payment", "credit union", "atm", "bank account", "payment gateway",
        "funds transfer", "debit card", "financial service"
    ],
    "scam_indicators": [
        "prize", "winner", "lottery", "congratulations", "free", "gift", "claim",
        "reward", "guaranteed", "exclusive", "special offer", "limited offer",
        "you've been selected", "unclaimed", "act fast", "pre-approved", "risk-free",
        "winner notification", "prize claim", "rewards program", "bonus", "free gift",
        "don’t miss out", "exclusive offer", "failure"
    ],
    "identity_and_documents": [
        "ssn", "social security", "tax", "irs", "passport", "driver's license",
        "national id", "identity verification", "government", "official notice",
        "identification required", "compliance", "fraud alert", "secure documents",
        "identity theft", "personal information", "tax identification", "verify your identity",
        "scan documents", "email verification", "document submission"
    ],
    "malicious_intent": [
        "click here", "download", "attachment", "open file", "review document",
        "security message", "suspicious activity", "deceptive", "unauthorized access",
        "security threat", "email verification", "confirm details", "risk assessment",
        "secure link", "sensitive information", "update your account", "urgent security patch",
        "contact us immediately", "verify your account now", "report suspicious activity",
        "fill in the form", "click this link", "please reply"
    ],
    "delivery_and_shipping": [
        "package", "tracking code", "shipment", "delivery", "tracking number",
        "delivery preferences", "shipping update", "delivery status", "shipping address",
        "parcel", "order status", "tracking information", "shipment confirmation",
        "we've received your order", "shipping confirmation", "delivery failure", "pending shipment"
    ],
    "fake_emails_and_links": [
        "click this link", "activate now", "please reply", "click to confirm", "open now",
        "confirm now", "don’t miss this", "click here for details",
        "to claim your reward", "secure link", "confirmation needed", "check your account",
        "new message", "important notification", "secure page", "immediate confirmation",
        "fill in the form", "cutt.ly"
    ]
}
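
The pipeline matches these entries as plain substrings of the lowercased text, so entries containing capital letters can never match, and a short entry such as "atm" also fires inside unrelated words. A minimal sketch of case-insensitive, whole-word matching follows; the helper `match_keywords` and the two-category toy dictionary are hypothetical and only mirror the shape of `keywords_list`:

```python
import re

# Toy dictionary for illustration only; it has the same shape as keywords_list.
sample_keywords = {
    "urgency": ["failure to verify", "act now"],
    "banks_and_services": ["atm", "bank"],
}

def match_keywords(text, keyword_dict):
    """Return {category: [matched terms]} using case-insensitive, whole-word matching."""
    matches = {}
    for category, terms in keyword_dict.items():
        found = []
        for term in terms:
            # (?<!\w) and (?!\w) behave like word boundaries and also work
            # for terms that contain punctuation (e.g. "cutt.ly").
            pattern = rf"(?<!\w){re.escape(term)}(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.append(term)
        if found:
            matches[category] = found
    return matches

# "treatment" contains the substring "atm" but should not count as a hit.
sample = "Failure to Verify will end your treatment plan."
print(match_keywords(sample, sample_keywords))
```

Whether whole-word matching is preferable depends on how noisy the OCR output is: missing spaces between words would hide terms that substring matching still finds.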

### OCR + NLP pipeline

In [20]:
import re
import pytesseract
import spacy
from PIL import Image

# Tesseract configuration
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'

# spaCy English model
nlp = spacy.load("en_core_web_sm")

def clean_text(text):
    """Clean the extracted text by removing unwanted characters and normalizing it."""

    text = text.lower()

    # Mail-client interface words and header labels; any line starting with one is dropped
    interface_keywords = [
        "compose", "inbox", "starred", "snoozed", "sent", "drafts",
        "notes", "more", "spam", "trash", "sent mail", "archive", "draft",
        "search", "reply", "forward", "to", "from", "subject", "cc", "bcc"
    ]
    for keyword in interface_keywords:
        # \b keeps short entries such as "to" from matching lines that start with "today"
        text = re.sub(rf"^\s*{re.escape(keyword)}\b.*$", "", text, flags=re.MULTILINE)

    text = re.sub(r"\S+@\S+", "", text) # Email addresses
    text = re.sub(r"\b(?:\w{3,9} \d{1,2},? \d{4},? \d{1,2}:\d{2}[APMapm]+)\b", "", text) # Dates followed by a time
    text = re.sub(r"\b\d{3}[-.\s]?\d{4}\b", "", text) # Phone numbers
    text = re.sub(r"[^\x00-\x7F]+", "", text) # Non-ASCII characters
    text = re.sub(r"[^\w\s.,!?;]", "", text) # Symbols
    text = re.sub(r"\s{2,}", " ", text) # Runs of whitespace (this also merges blank lines into a single space)
    text = re.sub(r'[^\w\s]', '', text) # Remaining punctuation
    text = re.sub(r"\n\s*\n", "\n", text) # Blank lines left over after punctuation removal
    text = text.strip()
    return text

def extract_text_from_image(image_path):
    """Extract the text from the image."""
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image)
    return text

def extract_info_with_nlp(text):
    """Use NLP to extract emails, links and sensitive data from the text."""
    # Run the text through spaCy (the resulting doc is not used further in this function)
    doc = nlp(text)

    # Extract email addresses and their domains
    email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
    emails = re.findall(email_pattern, text)
    email_domains = [email.split('@')[1] for email in emails]

    # Extract links, skipping any that contain an email domain
    link_pattern = r'(https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(?:/[^ \n]*)?|www\.[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(?:/[^ \n]*)?|[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(?:/[^ \n]*)?)'
    links = re.findall(link_pattern, text)
    valid_links = [link for link in links if not any(domain in link for domain in email_domains)]

    # Extract keywords, grouped by category (substring match on the lowercased text)
    lowered = text.lower()
    keywords = {
        category: [term for term in terms if term in lowered]
        for category, terms in keywords_list.items()
        if any(term in lowered for term in terms)
    }

    # Extract phone numbers
    phone_pattern = r'\b(?:\+?\d{1,3}[\s.-]?)?(?:\(?\d{2,4}\)?[\s.-]?)?\d{3,4}[\s.-]?\d{3,4}\b'
    phones = re.findall(phone_pattern, text)

    # Clean the text
    cleaned_text = clean_text(text)

    return {
        "emails_domain": email_domains,
        "links": valid_links,
        "phones": phones,
        "keywords": keywords,
        "cleaned_text": cleaned_text
    }
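
Because the last alternative of `link_pattern` accepts any `name.tld` token, the local part of an e-mail address (for example `amazon.service` in `amazon.service@013802mail.com`) can end up in `links`, since it does not contain the e-mail domain and so passes the filter. One possible way around this is to blank out e-mail addresses before searching for links. The sketch below uses a hypothetical `extract_links` helper and a simplified link pattern; neither is part of the notebook:

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
# Simplified link pattern (an assumption for this sketch): optional scheme or "www.",
# then a dotted host name and an optional path.
LINK_RE = re.compile(
    r"(?:https?://|www\.)?[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z]{2,}(?:/\S*)?"
)

def extract_links(text):
    """Find link-like tokens after replacing e-mail addresses with a space."""
    without_emails = EMAIL_RE.sub(" ", text)
    return LINK_RE.findall(without_emails)

sample = "Amazon Service <amazon.service@013802mail.com> visit cutt.ly/OwtNi6KO or Amazon.com"
print(extract_links(sample))
```

A pattern this permissive still treats OCR slips such as `account.This` (a missing space after a full stop) as a link, so a check against known top-level domains may be worth adding.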



In [22]:
image_path = "image6.png"
text_extracted = extract_text_from_image(image_path)
structured_info = extract_info_with_nlp(text_extracted)

print("\n Extracted text", text_extracted)
print("\n Structured text:\n", structured_info)


 Extracted text Sent on: Friday, June 23, 2023 11:31:12 AM
To:
Subject: YOUR ACCOUNT IS AT RISK!!

Dear Valued User ,

We received a request from you to terminate your Office 365 email due to a dual
college/universities account. This process has begun by our administrator. If you did not
authorize this action and you have no knowledge of it, you are advised to re-verify your account.
Please give us 24 hours to terminate your account if you initiated the request. Failure to re-verify
will result in the closure of your account and you will lose all of my files on these 365 accounts.

If this request was made accidentally and you have no knowledge of it, you are advised to copy
and paste the URL Below into the address bar of your web browser to fill in the form.

cutt.ly/OwtNi6KO
Failure to Verify will result in the closure of your account.

lowa State University
IT Helpdesk All Right Reserved.


 Structured text:
 {'emails_domain': [], 'links': ['cutt.ly/OwtNi6KO'], 'phones': [], 'k

In [23]:
print("\nContents of cleaned_text:\n", structured_info["cleaned_text"])


Contents of cleaned_text:
 dear valued user  we received a request from you to terminate your office 365 email due to a dual
collegeuniversities account this process has begun by our administrator if you did not
authorize this action and you have no knowledge of it you are advised to reverify your account
please give us 24 hours to terminate your account if you initiated the request failure to reverify
will result in the closure of your account and you will lose all of my files on these 365 accounts if this request was made accidentally and you have no knowledge of it you are advised to copy
and paste the url below into the address bar of your web browser to fill in the form cuttlyowtni6ko
failure to verify will result in the closure of your account lowa state university
it helpdesk all right reserved
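
In the output above, the blank lines between paragraphs have turned into spaces, which joins the surrounding lines: `\s{2,}` also matches runs of newlines, so it collapses them before the `\n\s*\n` substitution gets a chance to run. If every line break should survive as a single newline, one option is to treat horizontal whitespace and newline runs separately. The helper name `normalize_whitespace` below is hypothetical:

```python
import re

def normalize_whitespace(text):
    """Collapse spaces/tabs within lines and squeeze newline runs to one newline."""
    text = re.sub(r"[^\S\n]+", " ", text)     # runs of whitespace other than newlines
    text = re.sub(r" ?\n[ \n]*", "\n", text)  # newline runs (with stray spaces) -> one newline
    return text.strip()

sample = "dear valued user ,\n\nwe received  a request\n \nfrom you"
print(repr(normalize_whitespace(sample)))
```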


## Testing with images

### Mount Drive

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [16]:
image_path = "/content/drive/MyDrive/ColabNotebooks/KeepCodingIA/TFM/Models/OCR/Data/Images/"

In [21]:
import os

image_files = [f for f in os.listdir(image_path) if f.endswith(('.jpg', '.jpeg', '.png'))]

for image_name in image_files:
    image = os.path.join(image_path, image_name)
    text_extracted = extract_text_from_image(image)
    structured_info = extract_info_with_nlp(text_extracted)

    print("\nExtracted text:", text_extracted)
    print("\nStructured text:\n", structured_info)



Extracted text: > Mail

Qo

eOVO*

Compose

Inbox
Starred
Snoozed
Sent
Drafts
Notes

More

@ OW BOG BD 7of4a01 < >

Amazon Account - - - SUSPENDED !!_ inbox x o eG
‘Amazon Service <amazon.service@013802mail.com> ‘Apr 12, 2022, 1:54PM (22hoursago) yy  ¢
tome

amazon
aT

Dear Amazon Customer,
YOUR ACCOUNT HAS BEEN LOCKED
Due to suspicious activity including several unusual transactions on your Amazon Account your Account is suspended until further notice.

To validate your identity, unfreeze your Account, and cancel any unwanted charges please, call our Security Support Team on the following number
IMMEDIATELY and be ready too provide your billing address, usemame and pass word:

= 555-5555

After youve been verified your Account will be reactivated with in 24 hrs. If we do not here from you in three working days the charges on your Account
will be non refundable!

Regard,
‘Amazon Customer Service
‘Amazon.com


Structured text:
 {'emails_domain': ['013802mail.com'], 'links': ['amazo