Mailcheck for PHP #173

Open
devnix opened this issue Nov 13, 2019 · 15 comments

Comments

@devnix

devnix commented Nov 13, 2019

Hi! I've just released a similar library for PHP: https://github.com/devnix/mailcheck

It uses the Levenshtein algorithm to find suggestions and uses a different strategy for suggesting unknown hostnames by suggesting just the TLDs.

I wrote it because I found the JavaScript version too buggy for my needs. If you are willing to breathe new life into the project and try to test and standardize the libraries, I would love to get my repo added to the organization. Otherwise, I would prefer to maintain it myself.

I would love to hear your thoughts!

@msigley

msigley commented Dec 10, 2019

@devnix I wrote a PHP implementation of Mailcheck 8 months ago. It is a single file include in your codebase. I would love to collaborate with you on implementing features your PHP library covers that mine doesn't.
https://github.com/msigley/mailcheck-php

@devnix
Author

devnix commented Dec 11, 2019

Hi @msigley! Let me know if I can help you. Anyway, I find that the original implementation of this library is buggy and not very flexible, so I decided to modify some things like the distance algorithm, the data source, etc. I also tried to implement the Soundex algorithm but it was not an improvement.

I found a lot of PHP implementations similar to the JS one, but none of them would work as intended on a production system. That's why I decided to write yet another one. I would be eager to try to standardize one library and join efforts on one project.

@msigley

msigley commented Dec 11, 2019

@devnix The Soundex algorithm really isn't good for helping with typos. It's designed to find words that sound like your given word in spoken language, not to fix typos.

I implemented a Sift4 string distance algorithm for typo corrections. I would also love to collaborate to create one standard library.

Try my library out and open issues for feature enhancements and improvements. I will also gladly accept PRs.
https://github.com/msigley/mailcheck-php

@devnix
Author

devnix commented Dec 12, 2019

I also found that the Levenshtein algorithm resulted in better suggestions than Sift4, and had more than enough performance to check a list of 10,000+ domains.

Every improvement that I felt I could make is in my repo, and you are free to use all the code if you comply with the attached license. I'm not eager to start an implementation battle :-)

@msigley

msigley commented Dec 12, 2019

@devnix We use the same license, GPL v3, so merging shouldn't be an issue.

The approach to domain matching I take is vastly different than the JS version. I break the domain down, then apply the string distance algorithm to each part. Sift4 is faster, which is why I used it. For small strings (the parts of a domain are usually at most 10 characters) it makes almost no difference compared with Levenshtein.
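As a rough illustration of the per-part matching described above, here is a minimal Python sketch. It is not code from mailcheck-php: the dictionaries and the names `closest` and `suggest_by_parts` are hypothetical, and the similarity ratio from Python's `difflib` stands in for Sift4.

```python
from difflib import get_close_matches

# Hypothetical, deliberately small dictionaries for illustration only.
KNOWN_NAMES = ["gmail", "hotmail", "yahoo", "outlook"]
KNOWN_SUFFIXES = ["com", "net", "org", "co.uk"]


def closest(word, candidates):
    """Return the most similar candidate, or the word itself if nothing is close."""
    # difflib's ratio is a stand-in here; the thread's library uses Sift4.
    matches = get_close_matches(word, candidates, n=1, cutoff=0.6)
    return matches[0] if matches else word


def suggest_by_parts(domain):
    """Correct the name and the suffix independently; None means no change."""
    domain = domain.lower()
    # Split at the first dot so multi-label suffixes such as "co.uk" stay together.
    name, _, suffix = domain.partition(".")
    if not name or not suffix:
        return None  # not a dotted domain, nothing sensible to suggest
    suggestion = f"{closest(name, KNOWN_NAMES)}.{closest(suffix, KNOWN_SUFFIXES)}"
    return None if suggestion == domain else suggestion


print(suggest_by_parts("gmial.con"))
print(suggest_by_parts("hotmail.co.uk"))
```

Splitting at the first dot is a simplification: it keeps suffixes like `co.uk` together but would mishandle subdomains such as `mail.example.com`.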

I also implemented DNS email domain validation for unknown domains.

I don't want to start an implementation battle either. I'll send you an email so we can discuss this further away from this GitHub issue.

@alystair

What came from this discussion? 🤔

@devnix
Author

devnix commented Mar 26, 2020

He sent me this email, but I didn't reply to him:

Hi Dev_NIX,

My name is Matthew and we have been discussing merging our Mailcheck implementations together on Github.

I would love it if you had the time to evaluate the features I have implemented and let me know what you think and what features we are missing for the use case you wrote your Mailcheck implementation for. English is my first language and I only have the library live in 3 use cases, so I know there are improvements to be made.

I'll gladly look through your codebase for improvements, but it is a lot larger than mine and the approaches to domain matching and such are vastly different, so I am not sure how much would translate.

I know I am asking for a lot, so if nothing else, it would be great if you could contribute some test cases or point me towards yours to help improve the suggestions for your use case.

Matthew

I think my implementation is completely different (rather than a mere JS to PHP translation), and yields more coherent suggestions from my point of view.

I honestly didn't get the point of the whole conversation. I didn't want to say anything else because English is not my first language and maybe I'm misunderstanding him, but I felt like he wanted me to implement my features in his repo. I really, really don't get the point.

@msigley

msigley commented Mar 26, 2020

@devnix I was really hoping for a collaborative effort towards a standardized solution for doing a Mailcheck in PHP when I sent that email. You and I are not the first people to write a library that does this. I don't care about whose code is better.

I was mostly hoping to understand your use cases to compare both of our solutions to understand the pros and cons of our two approaches. For example:

Why did you choose to add all ~1,500 top level domains (TLDs) to your dictionary when most email addresses in use live on a small fraction of them? I purposely left a bunch of them out so typos of common TLDs wouldn't match a valid TLD.

Is there a reason why you chose to use the Levenshtein string distance algorithm? I chose to use Sift4 because in my tests it gave better suggestions and was faster for small strings.

Is there a reason why you chose to compare domains as a single string instead of breaking it down into its individual parts?

I also implemented DNS record validation on emails as a catch-all. Did you choose not to do this for performance reasons? I thought this was a good way to take advantage of the fact that PHP is a server language.

@devnix
Author

devnix commented Mar 27, 2020

@msigley Hi! Excuse me if I misunderstood your intentions. Please let me answer your questions:

I was mostly hoping to understand your use cases to compare both of our solutions to understand the pros and cons of our two approaches. For example:

Why did you choose to add all ~1,500 top level domains (TLDs) to your dictionary when most email addresses in use live on a small fraction of them? I purposely left a bunch of them out so typos of common TLDs wouldn't match a valid TLD.

I chose it because if you write a valid domain with a valid TLD that is not listed, the suggestion algorithm will always tell you that your email is not valid, with an annoying list of suggestions.
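The trade-off between the two dictionary sizes can be made concrete with a small Python sketch. Everything here is assumed for illustration: the two tiny lists, the function name `suggest_tld`, and the use of `difflib` as the similarity measure (neither library in this thread uses it). `.co` and `.cm` are real country-code TLDs that sit one keystroke away from `.com`.

```python
from difflib import get_close_matches

# Assumed stand-ins: a reduced dictionary and a slice of the full ~1,500-entry list.
COMMON_TLDS = ["com", "net", "org"]
ALL_TLDS = COMMON_TLDS + ["co", "cm", "dev"]


def suggest_tld(tld, dictionary):
    """Return a suggested TLD, or None when the TLD is listed or nothing is close."""
    if tld in dictionary:
        return None  # listed TLDs are accepted as-is
    matches = get_close_matches(tld, dictionary, n=1, cutoff=0.6)
    return matches[0] if matches else None


# Reduced list: a legitimate .co address is nagged with a suggestion.
print(suggest_tld("co", COMMON_TLDS))
# Full list: the same address passes, but so does a .cm typo of .com.
print(suggest_tld("co", ALL_TLDS))
print(suggest_tld("cm", ALL_TLDS))
```

With the reduced list, valid but unlisted TLDs trigger unwanted suggestions; with the full list, a typo that happens to be a real TLD goes unnoticed unless a full-domain match catches it first.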

Is there a reason why you chose to use the Levenshtein string distance algorithm? I chose to use Sift4 because in my tests it gave better suggestions and was faster for small strings.

Yeah! Levenshtein gave me better results than Sift4 without any noticeable slowness on my side, running it against a HUGE list. I started the library by writing several tests describing what I felt the library should suggest, and that's what drove my implementation. I'm open to feedback and to testing more email address cases to ensure it gives the best suggestion possible.

Is there a reason why you chose to compare domains as a single string instead of breaking it down into its individual parts?

Absolutely! Because you want to compare it as a single domain. It is a single domain.

If I have an email address that finishes with hotmail.fr and I write htmail.f, I don't want hotmail.com as a first suggestion, right? There are closer answers for this case.

Here it helps to have a growing list of known email providers with their localized, known TLDs (in the original list I used I found gmail.es which, as far as I know, never existed, and would yield incorrect suggestions too).

If it doesn't find any known full-domain suggestion, it will only check for TLDs to suggest. If you entered a correct TLD and it looks like a correct email not listed in the full domains array, it will assume the address is correct.
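The two-stage strategy described above (whole-domain match first, TLD-only fallback second) might look roughly like this in Python. This is a sketch under assumptions, not the code from devnix/mailcheck: the dictionaries, the distance threshold of 2, and the function names are all hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical, deliberately tiny dictionaries and threshold.
KNOWN_DOMAINS = ["gmail.com", "hotmail.com", "hotmail.fr", "yahoo.com"]
KNOWN_TLDS = ["com", "fr", "net", "org", "io"]
MAX_DISTANCE = 2


def closest(word, candidates):
    """Return the nearest candidate within MAX_DISTANCE edits, else None."""
    distance, best = min((levenshtein(word, c), c) for c in candidates)
    return best if distance <= MAX_DISTANCE else None


def suggest_domain(domain):
    domain = domain.lower()
    if domain in KNOWN_DOMAINS:
        return None
    # Stage 1: compare the whole domain as a single string.
    full_match = closest(domain, KNOWN_DOMAINS)
    if full_match is not None:
        return full_match
    # Stage 2: unknown domain, so only the TLD is checked.
    name, _, tld = domain.rpartition(".")
    if not name or tld in KNOWN_TLDS:
        return None  # a listed TLD on an unknown domain is assumed correct
    tld_match = closest(tld, KNOWN_TLDS)
    return f"{name}.{tld_match}" if tld_match else None


print(suggest_domain("htmail.f"))     # hotmail.fr
print(suggest_domain("example.con"))  # example.com
print(suggest_domain("example.io"))   # None
```

Because `htmail.f` is compared as one string, `hotmail.fr` (two insertions away) beats `hotmail.com` (four edits away), which is the behaviour the hotmail.fr example asks for.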

I also implemented DNS record validation on emails as a catch-all. Did you choose not to do this for performance reasons? I thought this was a good way to take advantage of the fact that PHP is a server language.

I find this totally out of the scope of the package. It is a suggestion library, not a validation one. There are plenty of options out there to perform this kind of validation. I started the library by writing a validation part, until I realized it was absurd, because half of the time the user will input an incorrect email address that needs a suggestion in order to be valid!

Also, I have been observing that DNS/MX checking is not the best way to determine if the email address is wrong: https://symfony.com/doc/4.4/reference/constraints/Email.html#checkmx

As I said, this is not a validation library, but I would prefer a false positive rather than a blocking false negative.

@msigley

msigley commented Mar 27, 2020

@devnix Thanks for the reply. I honestly think we just took two different approaches to this problem and came up with two different solutions.

See I disagree that sanitation and validation are out of scope. If you want to do suggestions you should always sanitize and validate your email strings first. Sanitation will remove a lot of obvious typos like characters that should never be in an email address like #, &, etc. Validation ensures you are only offering suggestions for invalid email strings.

DNS validation is a nice way to avoid needing a super huge dictionary for suggestions. It can also be done very reliably. I read RFC 2821 when I implemented DNS validation. I also handle bad DNS providers returning an A record for all NXDOMAIN requests:
https://github.com/msigley/mailcheck-php/blob/b9e8003c59ea7d542100b6ba991676ea69220b62/mailcheck.php#L169-L212

If a domain doesn't have a DNS record, it can't receive mail, so I don't understand the point of this note in the Symfony docs:
https://symfony.com/doc/4.4/reference/constraints/Email.html#checkmx
Maybe they are referring to the previously mentioned NXDOMAIN issue?
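For illustration, the DNS check discussed above (MX first, A record as the implicit fallback from RFC 2821, plus a guard against resolvers that rewrite NXDOMAIN) could be sketched in Python as follows. This is not the linked mailcheck.php code. The function name, the injected `lookup_mx` and `lookup_a` callables, and the probe-based hijack test are assumptions. The lookups are injected because Python's standard library has no MX resolver.

```python
from typing import Callable, List

# A lookup takes a domain and returns its record values (empty list if none).
Lookup = Callable[[str], List[str]]


def can_receive_mail(domain: str, lookup_mx: Lookup, lookup_a: Lookup,
                     probe: str = "nxdomain-probe.invalid") -> bool:
    """Best-effort check of whether a domain could accept mail."""
    if lookup_mx(domain):
        return True
    # No MX: fall back to the address record (the implicit MX rule).
    addresses = set(lookup_a(domain))
    if not addresses:
        return False
    # ".invalid" is a reserved TLD, so any answer for the probe means the
    # resolver rewrites NXDOMAIN; distrust answers that match it (assumed heuristic).
    hijack_addresses = set(lookup_a(probe))
    return addresses.isdisjoint(hijack_addresses)


# Usage with fake lookups; a real caller would plug in a DNS client.
mx_records = {"example.com": ["mail.example.com"]}
a_records = {"a-only.example": ["192.0.2.10"]}
print(can_receive_mail("example.com",
                       lambda d: mx_records.get(d, []),
                       lambda d: a_records.get(d, [])))
```

Injecting the lookups keeps the decision logic separate from any particular DNS client and makes it testable without network access.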

Do you have a large list of typos and corrections I could run a test against? I'm curious to see which one of our approaches gives better suggestions. I have all of my documented typo'ed emails here:
https://github.com/msigley/mailcheck-php/blob/master/mailcheck-test.php
The list isn't super huge, but it contains test cases that other suggestion libraries failed at the time I wrote this. How well does your library handle double TLDs, for example:
'com.tw', 'co.nz', 'co.uk'

@JamoCA

JamoCA commented Mar 27, 2020

I have used a server-side Java version (which is no longer available) of the isEmail PHP script for a couple of years now. It's quite effective and optionally performs DNS & MX checks. http://isemail.info/

Please note that this won't offer alternatives to popular domain names, but it will let you know if the domain or mail servers are no longer configured. (Example: I had a client contact us to complain that her friend's email address was valid and our website wasn't accepting it. She finally reached out via phone and learned that while the domain was still up, they had stopped using email for that domain and no longer had it configured.)

@msigley

msigley commented Mar 27, 2020

@JamoCA Technically, according to RFC 2821, if a domain has an A record it can accept mail even if no MX record exists. It sounds like that library only checks MX records. Let's not get too off topic though. The focus of this conversation is Mailcheck.

@JamoCA

JamoCA commented Mar 27, 2020

Technically UTF-8 characters are allowed in an email address, but I don't see that functioning anywhere yet. You still might want to check out the isEmail PHP library regarding their validation & reasoning. (Sorry for going off topic. I'll continue to invalidate email if no MX record exists.)

@devnix
Author

devnix commented Mar 30, 2020

I just think there are plenty of sanitation and validation utilities out there way more advanced than what I could develop in a small amount of time, better tested and widely adopted.

Also, there is no point in validating an email address that probably will not be valid, because the purpose of the library is to make a suggestion from a mistyped email address.

@ferreiro

Disclaimer: Hey, this library is no longer maintained.

I rewrote it entirely and updated it for 2022 and onward: https://github.com/ZooTools/email-spell-checker

💙 Written in TypeScript and removed jQuery
✅ Fixed a couple of bugs like ZooTools/email-spell-checker#3 or ZooTools/email-spell-checker#4
🚀 Reduced bundle size to <2KB.

✨ Updated TLDs (69+) and added modern startup domains (like .io, .so, .xyz or .dev)
🙏 Implemented suggestions that people made in this repo that were ignored.
Link: https://github.com/ZooTools/email-spell-checker

Come check it out and give it a ⭐️ for the effort.
