Improve the email and href parsing regexes a little. Also add descriptive comments since regexes are rarely capable of standing on their own.

Read the comments for details on the new rules. I tested them fairly thoroughly and they seem to work satisfactorily. Let me know if you find any issues.

Note that we really need to do this with a parse tree, or at least replace the URLs and emails with some marker and then put them back in later. Parsing the whole string repeatedly caused me all kinds of problems, like emails matching already matched URLs, having to match entities instead of the real characters, etc. These problems force us to tighten the rules used to match emails (like enforcing a subset of characters preceding email addresses). I may do this at some point but not now.
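
The marker idea mentioned above could look roughly like the following. This is only a sketch under stated assumptions, not MantisBT code: `sketch_insert_hrefs()` is a hypothetical name, both patterns are deliberately simplified stand-ins for the real ones, entity handling is ignored, and the closure syntax assumes PHP 5.3 or later.

```php
<?php
// Hypothetical sketch, not MantisBT code: link URLs first, park each
// generated anchor behind a placeholder, link emails, then restore.
function sketch_insert_hrefs( $p_string ) {
	$t_saved = array();

	// Simplified URL pattern (an assumption made for brevity).
	$p_string = preg_replace_callback( '/[a-z][-+.a-z0-9]*:\/\/[^\s<]+/i',
		function ( $p_match ) use ( &$t_saved ) {
			// Control characters delimit the marker so that the email
			// pattern below cannot match inside it.
			$t_key = "\x01" . count( $t_saved ) . "\x01";
			$t_saved[$t_key] = '<a href="' . $p_match[0] . '">' . $p_match[0] . '</a>';
			return $t_key;
		},
		$p_string );

	// Simplified email pattern (an assumption made for brevity). URLs
	// containing @ are hidden behind markers at this point.
	$p_string = preg_replace( '/[a-z0-9_.-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+/i',
		'<a href="mailto:\0">\0</a>',
		$p_string );

	// Put the parked URL anchors back.
	return $t_saved ? strtr( $p_string, $t_saved ) : $p_string;
}

echo sketch_insert_hrefs( 'see ftp://user@host.example/file or mail foo@bar.baz' ), "\n";
```

Because the URL is replaced by a marker before the email pass runs, the `user@host.example` part of the URL is never seen by the email pattern, which would remove the need for the preceding-character restrictions in such a design.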

* core/string_api.php
  (string_insert_hrefs): redo regexes to be more comprehensive
  (string_strip_hrefs): match much more generally, including href anchors
    with other attributes or extraneous whitespace
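
To make the string_strip_hrefs entry concrete, here is a small sketch that runs the two new patterns on a few anchors. The wrapper name `sketch_strip_hrefs()` and the sample inputs are hypothetical; the two `preg_replace()` calls are copied from the lines this commit adds.

```php
<?php
// Hypothetical wrapper for illustration; the two preg_replace() calls are
// copied from the lines added to string_strip_hrefs() in this commit.
function sketch_strip_hrefs( $p_string ) {
	// mailto: anchors collapse to the bare address
	$p_string = preg_replace( '/<a\s[^\>]*href="mailto:([^\"]+)"[^\"]*>[^\<]*<\/a>/s',
		'\1',
		$p_string );
	// any other anchor collapses to its href value
	$p_string = preg_replace( '/<a\s[^\>]*href="([^\"]+)"[^\"]*>[^\<]*<\/a>/s',
		'\1',
		$p_string );
	return $p_string;
}

// An attribute and extra whitespace before href
$t_url = sketch_strip_hrefs( 'go to <a class="x"   href="http://example.com/">here</a> now' );
// A mailto: anchor with an attribute before href
$t_mail = sketch_strip_hrefs( '<a title="t" href="mailto:foo@bar.baz">write me</a>' );
// A quoted attribute after href
$t_after = sketch_strip_hrefs( '<a href="mailto:foo@bar.baz" target="_new">foo@bar.baz</a>' );
```

In this sketch the first two inputs are reduced to `go to http://example.com/ now` and `foo@bar.baz`. The third input comes back unchanged: the patterns use `[^\"]*` between the href value and the closing `>`, so a quoted attribute after href does not appear to match. Whether that was intended is not stated in the commit.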


git-svn-id: http://mantisbt.svn.sourceforge.net/svnroot/mantisbt/trunk@2106 f5dc347c-c33d-0410-90a0-b07cc1902cb9
Julian Fitzell committed Mar 17, 2003
1 parent 5d61610 commit 8198742
Showing 1 changed file with 39 additions and 12 deletions.
51 changes: 39 additions & 12 deletions core/string_api.php
@@ -6,7 +6,7 @@
 # See the README and LICENSE files for details

 # --------------------------------------------------------
-# $Id: string_api.php,v 1.33 2003-03-14 18:54:32 int2str Exp $
+# $Id: string_api.php,v 1.34 2003-03-17 00:25:47 jfitzell Exp $
 # --------------------------------------------------------

 $t_core_dir = dirname( __FILE__ ).DIRECTORY_SEPARATOR;
@@ -165,26 +165,53 @@ function string_insert_hrefs( $p_string ) {
 return $p_string;
 }

-$p_string = eregi_replace( "([[:alnum:]]+)://([^[:space:]<]*)([[:alnum:]#?/&=@])",
-"<a href=\"\\1://\\2\\3\">\\1://\\2\\3</a>",
+# This is based on the description in RFC 2396 which specifies how
+# to match URLs generically without knowing their type
+$p_string = preg_replace( '/(([[:alpha:]][-+.[:alnum:]]*):\/\/(%[[:digit:]A-Fa-f]{2}|[-_.!~*\';\/?:@&=+$,[:alnum:]])+)/s',
+'<a href="\1">\1</a>',
 $p_string);
-$p_string = eregi_replace( "^(([a-z0-9_]|\\-|\\.)+@([^[:space:]<]*)([[:alnum:]-]))",
-"<a href=\"mailto:\\1\" target=\"_new\">\\1</a>",
-$p_string);
-$p_string = eregi_replace( "( )(([a-z0-9_]|\\-|\\.)+@([^[:space:]<]*)([[:alnum:]-]))",
-"\\1<a href=\"mailto:\\2\" target=\"_new\">\\2</a>",
+
+# Set up a simple subset of RFC 822 email address parsing
+# We don't allow domain literals or quoted strings
+# We also don't allow the & character in domains even though the RFC
+# appears to do so. This was to prevent &gt; etc from being included.
+# Note: we could use email_get_rfc822_regex() but it doesn't work well
+# when applied to data that has already had entities inserted.
+$t_atom = '(?:[^()<>@,;:\\\".\[\]\000-\037\177 &]+)';
+
+# In order to avoid selecting URLs containing @ characters as email
+# addresses we limit our selection to addresses that are preceded by:
+# * the beginning of the string
+# * a &lt; entity (allowing '<foo@bar.baz>')
+# * whitespace
+# * a : (allowing 'send email to:foo@bar.baz')
+# * a \n, \r, or > (because newlines have been replaced with <br />
+# and > isn't valid in URLs anyway
+#
+# At the end of the string we allow the opposite:
+# * the end of the string
+# * a &gt; entity
+# * whitespace
+# * a , character (allowing 'email foo@bar.baz, or ...')
+# * a \n, \r, or <
+$p_string = preg_replace( '/(?<=^|&lt;|[\s\:\>\n\r])('.$t_atom.'(?:\.'.$t_atom.')*\@'.$t_atom.'(?:\.'.$t_atom.')*)(?=$|&gt;|[\s\,\<\n\r])/s',
+'<a href="mailto:\1" target="_new">\1</a>',
 $p_string);
 return $p_string;
 }

 # --------------------
 # Detect href anchors in the string and replace them with URLs and email addresses
 function string_strip_hrefs( $p_string ) {
-$p_string = eregi_replace( "<a href=\"mailto:(([a-z0-9_]|\\-|\\.)+@([^[:space:]]*)([[:alnum:]-]))\" target=\"_new\">(([a-z0-9_]|\\-|\\.)+@([^[:space:]]*)([[:alnum:]-]))</a>",
-"\\1",
+# First grab mailto: hrefs. We don't care whether the URL is actually
+# correct - just that it's inside an href attribute.
+$p_string = preg_replace( '/<a\s[^\>]*href="mailto:([^\"]+)"[^\"]*>[^\<]*<\/a>/s',
+'\1',
 $p_string);
-$p_string = eregi_replace( "<a href=\"([[:alnum:]]+://[^[:space:]]*)([[:alnum:]#?/&=])\">([^<]*)</a>",
-"\\1",
+
+# Then grab any other href
+$p_string = preg_replace( '/<a\s[^\>]*href="([^\"]+)"[^\"]*>[^\<]*<\/a>/s',
+'\1',
 $p_string);
 return $p_string;
 }
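
The preceding and following contexts listed in the email comments can be tried out in isolation. The sketch below copies `$t_atom` and the email pattern from the added lines; the helper name `sketch_find_emails()` and the sample strings are hypothetical and only serve to show which surroundings let an address match.

```php
<?php
// Hypothetical helper for illustration: returns the addresses that the
// email pattern from this commit would wrap in a mailto: anchor.
function sketch_find_emails( $p_string ) {
	// Both definitions are copied from the added lines of string_insert_hrefs()
	$t_atom = '(?:[^()<>@,;:\\\".\[\]\000-\037\177 &]+)';
	$t_pattern = '/(?<=^|&lt;|[\s\:\>\n\r])('.$t_atom.'(?:\.'.$t_atom.')*\@'.$t_atom.'(?:\.'.$t_atom.')*)(?=$|&gt;|[\s\,\<\n\r])/s';

	preg_match_all( $t_pattern, $p_string, $t_matches );
	return $t_matches[1];
}

// Preceded by whitespace, followed by a comma
$t_plain = sketch_find_emails( 'mail foo@bar.baz, or call' );
// Wrapped in &lt; and &gt; entities
$t_angle = sketch_find_emails( '&lt;foo@bar.baz&gt;' );
// Preceded by a colon, followed by the end of the string
$t_colon = sketch_find_emails( 'to:foo@bar.baz' );
// Wrapped in parentheses, which are not among the accepted contexts
$t_paren = sketch_find_emails( '(foo@bar.baz)' );
```

The first three calls each return `array( 'foo@bar.baz' )`, and the last returns an empty array because neither `(` nor `)` is accepted next to an address.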