Mailcheck for PHP #173
Comments
@devnix I wrote a PHP implementation of Mailcheck 8 months ago. It is a single-file include for your codebase. I would love to collaborate with you on implementing the features your PHP library covers that mine doesn't.
Hi @msigley! Let me know if I can help you. Anyway, I find that the original implementation of this library is buggy and not very flexible, so I decided to change some things, like the distance algorithm, the data source, etc. I also tried to implement the Soundex algorithm, but it was not an improvement. I found a lot of PHP implementations similar to the JS one, but none of them would work as intended on a production system. That's why I decided to write yet another one. I would be eager to try to standardize on one library and join efforts on one project.
@devnix The Soundex algorithm really isn't good for helping with typos. It's designed to find words that sound like your given word in spoken language, not to fix typos. I implemented a Sift4 string distance algorithm for typo corrections. I would also love to collaborate to create one standard library. Try my library out and open issues for feature enhancements and improvements. I will also gladly accept PRs.
I also found that the Levenshtein algorithm gave better suggestions than Sift4, and had more than enough performance to check a list of 10,000+ domains. Every improvement that I felt I could make is in my repo, and you are free to use all the code if you comply with the attached license. I'm not eager to start an implementation battle :-)
@devnix We use the same license, GPL v3, so any merges shouldn't be an issue. The approach I take to domain matching is vastly different from the JS version. I break the domain down into its parts, then apply the string distance algorithm to each part. Sift4 is faster, which is why I used it, and for small strings (the parts of a domain are usually 10 characters at most) it makes almost no difference to use Levenshtein. I also implemented DNS email domain validation for unknown domains. I don't want to start an implementation battle either. I'll send you an email so we can discuss this further away from this GitHub issue.
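To make the "break the domain down, then apply the string distance" idea concrete, here is a minimal Python sketch. It is not code from either PHP library: the dictionaries, the `max_distance` threshold, and the function names are assumptions made up for illustration, and plain Levenshtein stands in for Sift4 to keep the example short.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest(word, candidates, max_distance=2):
    """Return the nearest candidate, or None if nothing is within max_distance."""
    best = min(candidates, key=lambda c: levenshtein(word, c))
    return best if levenshtein(word, best) <= max_distance else None


# Hypothetical dictionaries; a real library would ship much larger ones.
KNOWN_NAMES = ["gmail", "yahoo", "hotmail", "outlook"]
KNOWN_TLDS = ["com", "net", "org", "co.uk"]


def suggest_by_parts(domain):
    """Correct the name and the TLD independently, then rejoin them."""
    name, _, tld = domain.partition(".")
    name_fix = closest(name, KNOWN_NAMES)
    tld_fix = closest(tld, KNOWN_TLDS)
    if name_fix is None or tld_fix is None:
        return None  # at least one part is too far from anything known
    suggestion = f"{name_fix}.{tld_fix}"
    return None if suggestion == domain else suggestion
```

With these toy dictionaries, `suggest_by_parts("gmial.con")` corrects both parts on their own and rejoins them, while an already-correct `"gmail.com"` yields no suggestion.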
What came from this discussion? 🤔
He sent me this email, but I didn't reply to him:
I think my implementation is completely different (rather than a mere JS to PHP translation), in order to yield what are, from my point of view, more coherent suggestions. I honestly didn't get the point of the whole conversation. I didn't want to say anything else because English is not my first language and maybe I'm misunderstanding him, but I felt like he wanted me to implement my features in his repo. I really, really don't get the point.
@devnix I was really hoping for a collaborative effort towards a standardized solution for doing a Mailcheck in PHP when I sent that email. Neither you nor I is the first person to write a library that does this. I don't care about whose code is better. I was mostly hoping to understand your use cases so we could compare our solutions and understand the pros and cons of our two approaches. For example:
Why did you choose to add all ~1,500 top level domains (TLDs) to your dictionary when most email addresses in use live on a small fraction of them? I purposely left a bunch of them out so typos of common TLDs wouldn't match a valid TLD.
Is there a reason why you chose to use the Levenshtein string distance algorithm? I chose to use Sift4 because in my tests it gave better suggestions and was faster for small strings.
Is there a reason why you chose to compare domains as a single string instead of breaking them down into their individual parts?
I also implemented DNS record validation on emails as a catch-all. Did you choose not to do this for performance reasons? I thought this was a good way to take advantage of the fact that PHP is a server language.
@msigley Hi! Excuse me if I misunderstood your intentions. Please let me answer your questions:
I chose it because if you write a valid domain with a valid TLD that is not listed, the suggestion algorithm will always tell you that your email is not valid, with an annoying list of suggestions.
Yeah! Levenshtein gave me better results than Sift4 without any noticeable slowness on my side, running it against a HUGE list. I started the library by writing several tests describing what I felt the library should suggest, and that's what drove my implementation. I'm happy to hear feedback and test more email address cases to ensure it gives the best suggestion possible.
Absolutely! Because you want to compare it as a single domain. It is a single domain. If I have an email address that finishes with Here it helps to have a growing list of known email providers with their localized, known TLDs (in the original list I used I found If it doesn't find any known full domain to suggest, it will only check for TLDs to suggest. If you entered a correct TLD and it looks like a correct email that is not listed in the full domains array, it will assume the address is correct.
I find this totally out of the scope of the package. It is a suggestion library, not a validation one. There are plenty of options out there for performing this kind of validation. I started the library by writing a validation part, until I realized it was absurd, because half the time the user will input an incorrect email address that needs a suggestion in order to become valid! Also, I have observed that DNS/MX checking is not the best way to determine whether an email address is wrong: https://symfony.com/doc/4.4/reference/constraints/Email.html#checkmx As I said, this is not a validation library, but I would prefer a false positive to a blocking false negative.
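The strategy described in these answers (match against known full domains first, fall back to a TLD-only suggestion, and accept unknown domains whose TLD is already valid) can be sketched roughly as follows. This is an illustrative Python sketch, not the PHP code of devnix/mailcheck: the lists, thresholds, and function names are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def closest(word, candidates, max_distance):
    """Return the nearest candidate, or None if nothing is within max_distance."""
    best = min(candidates, key=lambda c: levenshtein(word, c))
    return best if levenshtein(word, best) <= max_distance else None


# Hypothetical data; the real library uses far larger lists.
KNOWN_DOMAINS = ["gmail.com", "hotmail.com", "yahoo.es"]
KNOWN_TLDS = ["com", "es", "net", "org"]


def suggest(email):
    """Return a corrected address, or None when no suggestion is needed."""
    local, _, domain = email.rpartition("@")
    domain = domain.lower()
    if domain in KNOWN_DOMAINS:
        return None  # exact match, nothing to suggest

    # Stage 1: compare the whole domain as a single string.
    full = closest(domain, KNOWN_DOMAINS, max_distance=2)
    if full is not None:
        return f"{local}@{full}"

    # Stage 2: unknown domain, so only the TLD is checked.
    name, _, tld = domain.rpartition(".")
    if tld in KNOWN_TLDS:
        return None  # valid TLD on an unlisted domain: assume it is correct
    tld_fix = closest(tld, KNOWN_TLDS, max_distance=1)
    return f"{local}@{name}.{tld_fix}" if tld_fix else None
```

Under these assumptions, `suggest("bob@gmial.com")` proposes the Gmail address, `suggest("ana@mycompany.con")` only repairs the TLD, and `suggest("ana@mycompany.net")` returns `None`, which is the "false positive rather than a blocking false negative" trade-off.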
@devnix Thanks for the reply. I honestly think we just took two different approaches to this problem and came up with two different solutions. See I disagree that sanitation and validation are out of scope. If you want to do suggestions, you should always sanitize and validate your email strings first. Sanitation removes a lot of obvious typos, such as characters that should never be in an email address like #, &, etc. Validation ensures you are only offering suggestions for invalid email strings. DNS validation is a nice way to avoid needing a huge dictionary for suggestions. It can also be done very reliably. I read RFC2821 when I implemented DNS validation. I also handle bad DNS providers returning an A record for all NXDOMAIN requests: If a domain doesn't have a DNS record, it can't receive mail, so I don't understand the point of this note in the Symfony docs: Do you have a large list of typos and corrections I could run a test against? I'm curious to see which of our approaches gives better suggestions. I have all of my documented typo'ed emails here:
I have used a server-side Java version (which is no longer available) of the isEmail PHP script for a couple of years now. It's quite effective and optionally performs DNS & MX checks. http://isemail.info/ Please note that this won't offer alternatives to popular domain names, but it will let you know if the domain or mail servers are no longer configured. (Example: I had a client contact us to complain that her friend's email address was valid and our website wasn't accepting it. She finally reached out via phone and learned that while the domain was still up, they had stopped using email for that domain and no longer had it configured.)
@JamoCA Technically, according to RFC2821, if a domain has an A record it can accept mail even if no MX record exists. It sounds like that library only checks MX records. Let's not get too off topic though. The focus of this conversation is Mailcheck.
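The MX-then-A fallback mentioned here can be sketched as below. This is a hedged Python illustration, not the code of either library or of isEmail: the `lookup` callable is a hypothetical stand-in for a real DNS resolver and is injected so the logic can be shown without network access, and the `.example` domains and addresses are invented.

```python
from typing import Callable, Dict, List, Tuple

# (domain, record_type) -> list of record values; hypothetical resolver interface.
Lookup = Callable[[str, str], List[str]]


def can_receive_mail(domain: str, lookup: Lookup) -> bool:
    """Prefer MX records; with no MX, an A record acts as an implicit mail host."""
    if lookup(domain, "MX"):
        return True
    return bool(lookup(domain, "A"))


def static_lookup(records: Dict[Tuple[str, str], List[str]]) -> Lookup:
    """Build a fake resolver from a fixed table, for demonstration only."""
    return lambda domain, record_type: records.get((domain, record_type), [])


# Invented records under the reserved .example TLD.
demo = static_lookup({
    ("mail.example", "MX"): ["mx1.mail.example"],
    ("web-only.example", "A"): ["192.0.2.10"],
})
```

Here `can_receive_mail("web-only.example", demo)` is true even though no MX record exists, which also shows the limit of the check: a domain that still resolves but has stopped handling mail, as in the anecdote above, would pass.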
Technically UTF-8 characters are allowed in an email address, but I don't see that functioning anywhere yet. You still might want to check out the isEmail PHP library regarding their validation & reasoning. (Sorry for going off topic. I'll continue to invalidate email if no MX record exists.)
I just think there are plenty of sanitation and validation utilities out there way more advanced than what I could develop in a small amount of time, better tested and widely adopted. Also, there is no point in validating an email address that probably will not be valid, because the purpose of the library is to make a suggestion from a mistyped email address.
Disclaimer: Hey, this library is no longer maintained. I rewrote it entirely and updated it for 2022 and onward: https://github.com/ZooTools/email-spell-checker 💙 It is written in TypeScript and no longer depends on jQuery. Come check it out and give it a ⭐️ for the effort.
Hi! I've just released a similar library for PHP: https://github.com/devnix/mailcheck
It uses the Levenshtein algorithm to find suggestions and uses a different strategy for suggesting unknown hostnames by suggesting just the TLDs.
I wrote it because I felt the JavaScript version was too buggy for my needs. If you are willing to give the project a try and help test and standardize the libraries, I would love to get my repo added to the organization. Otherwise, I would prefer to maintain it myself.
I would love to hear your thoughts!