Add data validation to prevent rogue entries #32

Closed
jacobsalmela opened this issue Aug 8, 2015 · 4 comments

@jacobsalmela jacobsalmela commented Aug 8, 2015

Sometimes entries from the third-party lists have errors, such as adlog..com (two periods instead of one).

This will probably require some sed and awk skills to check for the proper formatting of:

subdomain(s) (if applicable), a period (.), the domain, another period (.), and finally, the top level domain.
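
A rough sketch of that format check, assuming POSIX `grep -E` is available. The function name `is_valid_domain` and the exact label rules (letters, digits and hyphens per label, an alphabetic top level domain of two or more letters) are assumptions for illustration, not something the project has settled on:

```shell
# Hypothetical helper: succeed only if $1 looks like label(.label)*.tld
# Each label must start and end with a letter or digit; hyphens are allowed inside.
is_valid_domain() {
  printf '%s\n' "$1" |
    grep -Eq '^([A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z]{2,}$'
}

is_valid_domain "adlog.com"  && echo "adlog.com: ok"
is_valid_domain "adlog..com" || echo "adlog..com: rejected"
```

A check like this only rejects malformed entries; it does not try to repair them.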

@korhadris korhadris commented Aug 19, 2015

Here's a sed command to remove duplicated '.' characters:

sed 's/\.\+/./g'

This is probably faster as it will only make the replacement if there are two or more '.' characters ...

sed 's/\.\.\+/./g'
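
As a quick check that the two forms give the same result, here is a small sketch run against the malformed entry from the issue. It assumes GNU sed, since `\+` is a GNU extension in basic regular expressions; the function names `collapse_any` and `collapse_runs` are hypothetical:

```shell
# Collapse runs of '.' with each variant (GNU sed: \+ means "one or more").
collapse_any()  { sed 's/\.\+/./g'; }     # rewrites every '.', even single ones
collapse_runs() { sed 's/\.\.\+/./g'; }   # only touches runs of two or more

echo "adlog..com"        | collapse_any    # adlog.com
echo "adlog..com"        | collapse_runs   # adlog.com
echo "ads...example.com" | collapse_runs   # ads.example.com
```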

@kurumushi kurumushi commented Aug 19, 2015

The real challenge is to see whether these can be combined into one; I'm not very good at simplification.
Here are a few more useful ones:
sed -e 's/ \.//g': remove periods at the start and the middle of domains.
sed -e 's/\[. ]+$/$/g': strip a trailing period and trailing white space from domains.

@korhadris korhadris commented Aug 19, 2015

Yeah, simplifying regexp statements to optimize them is rough. As long as I don't really need the speed, I usually combine statements like this:

sed -e 's/\.\.\+/./g' -e 's/^[. ]*//' -e 's/\.* \+\.*/ /g' -e 's/[. ]\+$//'

This way it's easier for a human to look and see what's going on.
In order, the expressions: replace runs of duplicated . characters with a single .; remove spaces and . characters at the front of the line; collapse extra spaces and . characters in the middle into a single space; and remove spaces and . characters at the end of the line.

I tested it with the following example:

echo "  .192.168.23.515.. ...foo...bar..com.  " | sed -e 's/\.\.\+/./g' -e 's/^[. ]*//' -e 's/\.* \+\.*/ /g' -e 's/[. ]\+$//'
192.168.23.515 foo.bar.com
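
The same combined command can be wrapped in a function and run over a whole list, one entry per line. This is only a sketch assuming GNU sed; the function name `clean_entries` and the sample entries other than adlog..com are made up for illustration:

```shell
# Hypothetical wrapper around the combined command above; reads stdin, writes stdout.
clean_entries() {
  sed -e 's/\.\.\+/./g' \
      -e 's/^[. ]*//' \
      -e 's/\.* \+\.*/ /g' \
      -e 's/[. ]\+$//'
}

# Three malformed sample lines, each cleaned independently.
printf '%s\n' 'adlog..com' '.ads.example.com.' '  tracker...example.net  ' | clean_entries
# adlog.com
# ads.example.com
# tracker.example.net
```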

I also fixed our regexp examples in this post (we both had some errors). I had an extra / at the end of my commands above (now edited). In a basic regular expression, sed needs + escaped as \+, or it will try to match a literal + character. The $ in the second (replacement) section of the substitution will put in an actual $ character. There was an extra \ before the [, which would try to match a literal [ character instead of starting a list of possible characters to match. I also removed the g from the end of the substitution that can only match at the end of the line, since the global flag isn't needed there (having it doesn't break anything, though).
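
The pitfalls described above can be reproduced with a few one-liners. This is a sketch assuming GNU sed; the function name `strip_trailing` is hypothetical and the sample input is made up:

```shell
# Unescaped + in a basic regex is a literal '+', so nothing matches here.
echo "foo..com" | sed 's/\.+/./g'         # foo..com (unchanged)

# \[ matches a literal '[' instead of opening a character list, so no match.
echo "foo.com." | sed 's/\[. ]\+$//'      # foo.com. (unchanged)

# A '$' in the replacement section is inserted literally.
echo "foo.com." | sed 's/[. ]\+$/$/'      # foo.com$

# Corrected form: escaped +, no backslash before [, empty replacement.
strip_trailing() { sed 's/[. ]\+$//'; }
echo "foo.com. " | strip_trailing         # foo.com
```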

@jacobsalmela jacobsalmela commented Aug 22, 2015

It looks pretty good, but it's a bit out of my skill level. Maybe someone could make a pull request and I can test it out.

jacobsalmela added a commit that referenced this issue Sep 6, 2015