Create spam classification tutorial #112

bkmgit · 2020-08-24T06:33:01Z

New command line tutorial

mlpack-bot · 2020-08-24T06:33:03Z

Thanks for opening your first pull request in this repository! Someone will review it when they have a chance. In the meantime, please be sure that you've handled the following things, to make the review process quicker and easier:

All code should follow the style guide
Documentation added for any new functionality
Tests added for any new functionality
Tests that are added follow the testing guide
Headers and license information added to the top of any new code files
HISTORY.md updated if the changes are big or user-facing
All CI checks should be passing

Thank you again for your contributions! 👍

zoq · 2020-08-28T17:59:52Z

spam/tutorial.md

+rm labels.txt
+```
+
+The next step is to convert all text in the messages to lower case and for simplicity remove punctuation and any symbols that are not spaces, line endings or in the range a-z (one would need expand this range of symbols for production use) 


Not sure I get the last sentence.

Perhaps it can be reworded as:

To enable easy comparison of words, which will be used as the features, only the letters a-z, line endings (\n), and spaces are kept. A larger character set can be helpful, but in small data sets other symbols do not occur frequently enough to help in classification.
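A minimal sketch of the normalization step discussed here, assuming plain-text input with one message per line. The file names `messages_sample.txt` and `messages_clean.txt` and the sample text are hypothetical and not taken from the tutorial.

```shell
set -eu

# Hypothetical sample input (not from the tutorial's dataset).
printf 'Hello, WORLD! 50%% off\nCall NOW: 555-1234\n' > messages_sample.txt

# 1. Map upper case to lower case.
# 2. Delete every character that is not a-z, a space, or a newline.
# LC_ALL=C keeps tr byte-oriented, so a-z means exactly ASCII a-z.
LC_ALL=C tr 'A-Z' 'a-z' < messages_sample.txt \
  | LC_ALL=C tr -cd 'a-z \n' > messages_clean.txt

cat messages_clean.txt
```

Removed characters leave their neighbouring spaces behind, so a later word-splitting step would need to tolerate repeated spaces.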

zoq · 2020-08-28T18:01:22Z

spam/tutorial.md

+rm messagesLower.txt
+```
+
+We now obtain a sorted list of unique words used (this step may take a few minutes, so use nice to give it a low priority while you continue with other tasks on your computer).


Hm, I would remove nice as the default behaviour; we could mention it on the side.

On a low-end laptop, nice is quite useful to enable other work. On a more powerful machine, the effect will not be too drastic, so the code works in both cases.
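For reference, a sketch of how a word-list step could be run with and without nice. The file names and sample text are hypothetical, and the tutorial's actual pipeline may differ.

```shell
set -eu

# Hypothetical cleaned input: lower-case words separated by spaces.
printf 'free prize now\ncall now for free prize\n' > words_sample.txt

# Put one word per line (squeezing repeated separators), then sort and deduplicate.
tr -s ' ' '\n' < words_sample.txt | sort -u > unique_words.txt

# The same pipeline with the sort at the lowest scheduling priority.
# The output is identical; only the CPU priority of the sort changes.
tr -s ' ' '\n' < words_sample.txt | nice -n 19 sort -u > unique_words_nice.txt

cat unique_words.txt
```

Because nice only wraps the command it prefixes, it can be mentioned as an optional variant without changing the default command.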

bkmgit · 2020-08-28T18:29:09Z

Thanks for the review. Can also put all commands in a bash script. Should there be a styleguide for the examples?

zoq · 2020-08-29T20:47:17Z

> Thanks for the review. Can also put all commands in a bash script. Should there be a styleguide for the examples?

I like the idea; maybe we can split it up into multiple scripts. I'm also wondering if we should add a notebook that runs the bash scripts as well. The examples follow: https://github.com/mlpack/mlpack/wiki/DesignGuidelines

bkmgit · 2020-08-30T07:58:48Z

Can split it up into multiple scripts. Not so keen on notebooks for this; they are not so great for integration into production use. For example, see rspamd and the SpamAssassin perceptron description. For SMS spam, Spam Hound is available, but I'm not sure of an open source equivalent. Android does support C++; see documentation here.

rcurtin

Hey @bkmgit, this is really nice! Thank you for taking the time to put this together. I left some comments, mostly simple and stylistic; let me know what you think.

> Can split it up into multiple scripts. Not so keen on notebooks for this, not so great for integration into production use.

I agree with what you mean here---most people using mlpack from the command-line won't be using notebooks. It might be simple to add a notebook that just runs the various scripts individually, but I do agree that perhaps in addition to tutorial.md, we should have a number of scripts in the directory that users can run directly. Like, e.g., spam-classification.sh could run everything, and then this could call out to other auxiliary scripts.

My mental model is that the most effective tutorials are ones where users can very quickly and easily run something (e.g. type a command or click 'run' in a notebook cell), and then once they see results, they can dig into the code to understand what's actually going on. So as long as we can structure it in a way such that the above is reasonably true, I think it's great! It may even be "easiest" to structure all the text in tutorial.md as comments in, e.g., some spam-classification.sh file, and then have lower-level comments in other auxiliary scripts? Just tossing the idea out there, don't feel obligated. Maybe there are better ways. :)

rcurtin · 2020-09-12T16:20:06Z

spam/tutorial.md

+rm dataset.csv
+rm dataset1.csv
+rm dataset.txt
+```


It's nice to see other people who do data science with sed, awk, rev, tr, and grep too! 😄

rcurtin · 2021-07-30T17:10:41Z

@bkmgit I think you just need to chmod +x spam_classification.sh. Here's the relevant Travis build line:

/home/travis/.travis/functions: line 109: ./spam_classification.sh: Permission denied
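A small sketch of the permission fix, using a throwaway script name (`demo_script.sh`, hypothetical) so it can be tried without touching the repository.

```shell
set -eu

# Hypothetical stand-in for spam_classification.sh.
printf '#!/bin/sh\necho "script ran"\n' > demo_script.sh

# A freshly created file is normally not executable, so invoking it as
# ./demo_script.sh would fail with "Permission denied", as in the log above.
chmod +x demo_script.sh

# Now the script can be invoked directly.
./demo_script.sh
```

The mode change still has to be committed for CI to see it. In a git checkout, `git update-index --chmod=+x <file>` records the executable bit in the index directly.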

bkmgit · 2021-07-30T17:37:47Z

Thanks. Done.

bkmgit · 2021-07-30T18:40:24Z

It does seem to run, but times out.

bkmgit · 2021-07-31T05:39:55Z

Results should not vary by more than say 15%. The part that takes a long time is creating the term frequency matrix. This could perhaps be saved for use in the tests since it does not use anything in mlpack. Alternatively, a C++ implementation could be used, some examples include:

- https://github.com/ClarityNLP/ClarityNLP/blob/master/nlp/algorithms/matrix_preprocessor/src/term_frequency_matrix.cpp
- http://taozhaojie.github.io/2015/06/12/tfidf/

This could either be a standalone program, or if more use of NLP is expected, something to add to mlpack.
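As an illustration of what a term frequency matrix contains, here is a sketch that builds one with a single awk pass over the messages. It is not the script's actual implementation, the file names and sample data are hypothetical, and whether it is faster than the current approach has not been measured here.

```shell
set -eu

# Hypothetical inputs: a sorted vocabulary and cleaned messages, one per line.
printf 'call\nfree\nnow\n' > vocab_sample.txt
printf 'free free now\ncall now\n' > msgs_sample.txt

# First file (NR == FNR): remember the column index of every vocabulary word.
# Second file: count the words of each message and print one CSV row per message.
awk '
  NR == FNR { col[$1] = FNR; n = FNR; next }
  {
    for (j = 1; j <= n; j++) count[j] = 0
    for (i = 1; i <= NF; i++) if ($i in col) count[col[$i]]++
    row = count[1]
    for (j = 2; j <= n; j++) row = row "," count[j]
    print row
  }
' vocab_sample.txt msgs_sample.txt > term_frequency.csv

cat term_frequency.csv
```

Each output row corresponds to one message and each column to one vocabulary word, so the first sample message ("free free now") yields the row 0,2,1.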

rcurtin · 2021-07-31T18:52:18Z

spam/spam_classification.sh

+message is then removed and placed in another file.
+COMMENT
+
+tr '\r' '\n' < ../data/dataset_sms_spam_bhs_indonesia_v1/dataset_sms_spam_v1.csv > dataset.txt


Is this necessary on all systems? I found that the output I got here for dataset.txt when I ran locally was like this:

$ head dataset.txt
Teks,label
"[PROMO] Beli paket Flash mulai 1GB di MY TELKOMSEL APP dpt EXTRA kuota 2GB 4G LTE dan EXTRA nelpon hingga 100mnt/1hr. Buruan, cek di tsel.me/mytsel1 S&K",2
2.5 GB/30 hari hanya Rp 35 Ribu Spesial buat Anda yang terpilih. Aktifkan sekarang juga di *550*905#. Promo sd 30 Nov 2015.Buruan aktifkan sekarang. S&K,2
"2016-07-08 11:47:11.Plg Yth, sisa kuota Flash Anda 478KB. Download MyTelkomsel apps di http://tsel.me/tsel utk cek kuota&beli paket Flash atau hub *363#",2
"2016-08-07 11:29:47.Plg Yth, sisa kuota Flash Anda 7160KB. Download MyTelkomsel apps di http://tsel.me/tsel utk cek kuota&beli paket Flash atau hub *363#",2

What's the goal of the line? Maybe we can use dos2unix or something instead?

Added sed '/^$/d' dataset2.csv > dataset.csv to remove the extra lines.
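To illustrate the exchange above: on a file with CRLF line endings, translating \r to \n turns every line ending into two newlines, which would explain extra blank lines. The sketch uses a hypothetical sample file. Whether the real dataset uses CRLF or bare CR endings is not stated in this thread.

```shell
set -eu

# Hypothetical two-line sample with CRLF line endings.
printf 'Teks,label\r\nsome message,2\r\n' > crlf_sample.csv

# Translating CR to LF yields "\n\n" after each line, i.e. a blank line.
tr '\r' '\n' < crlf_sample.csv > translated.txt

# Deleting empty lines afterwards restores one record per line.
sed '/^$/d' translated.txt > translated_no_blank.txt

# Alternative for CRLF input: delete the CR instead of translating it.
# (This would not work for files that use a bare CR as the line ending.)
tr -d '\r' < crlf_sample.csv > cr_deleted.txt

wc -l < translated.txt
wc -l < translated_no_blank.txt
```

For CRLF input the two approaches produce the same file. The `tr '\r' '\n'` plus `sed '/^$/d'` combination also handles bare CR endings, at the cost of dropping any intentionally empty lines.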

Regenerating data files. mlpack_preprocess_split fails with an error if one of the labels is a .

Yeah, I encountered the same thing. I see that labels.csv has one line that's just a ., which can't be parsed:

$ cat labels.csv | sort |uniq -c
1 .
569 0
574 1

Maybe there is a bug or an extra case that needs to be handled in the preprocessing script?

Fixed this, incorrect lines were joined. Checking build.
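A check along the following lines could catch such a stray label before mlpack_preprocess_split is run. This is a sketch on a hypothetical sample file, assuming the labels are expected to be exactly 0 or 1 as in the counts above.

```shell
set -eu

# Hypothetical labels file containing one stray "." line.
printf '0\n1\n.\n1\n0\n' > labels_sample.csv

# Count each distinct label value.
sort labels_sample.csv | uniq -c

# Report (with line numbers) every line that is not exactly 0 or 1.
# grep exits non-zero when nothing matches, hence the "|| true" under set -e.
grep -vnE '^[01]$' labels_sample.csv > bad_labels.txt || true

cat bad_labels.txt
```

The line numbers in `bad_labels.txt` point back to the offending rows, which helps when tracing the problem to the preprocessing step that produced them.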

Thanks! :) I'm checking to see if it works on my system too. 👍

bkmgit · 2021-08-01T17:29:09Z

It seems to run correctly now.

Can add some checks that confusion matrix values are within a difference of 10 from the reported ones.
If the data.csv and labels.csv files are added to the dataset, can modify the writeup so that these only need to be recreated if someone wants to check. This will reduce running time in CI.
Building the command line programs adds some time to the CI build process, but maybe this is ok, willing to add some further tests.
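The tolerance check proposed in the first item could be sketched as below. The function name `within_tolerance` and the numbers are hypothetical and are not results from the tutorial.

```shell
set -eu

# within_tolerance ACTUAL EXPECTED TOLERANCE
# Succeeds when |ACTUAL - EXPECTED| <= TOLERANCE (integers only).
within_tolerance() {
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then diff=$(( 0 - diff )); fi
  [ "$diff" -le "$3" ]
}

# Hypothetical numbers, not results from the tutorial.
expected_true_positives=100
actual_true_positives=93

if within_tolerance "$actual_true_positives" "$expected_true_positives" 10; then
  echo "ok: within tolerance"
else
  echo "warning: value deviates from the expected one" >&2
fi
```

Printing a warning rather than exiting keeps a successful run green while still flagging drift. Replacing the `echo` in the else branch with `exit 1` would turn it into a hard failure.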

rcurtin · 2021-08-01T20:33:25Z

Perfect, everything works locally for me. Do you want to revert .travis.yml to its original form, but leave the spam classification example there, commented out? Due to Travis's limits, we'll have to move to Jenkins, where it doesn't matter if things take a long time. @zoq and I discussed it a bit in #174; maybe we can use a Docker container with all the command-line programs available already.

I don't think it's a problem to leave out data.csv and labels.csv---part of what makes the example interesting, in my opinion, is the command-line data science preprocessing. :)

> Can add some checks that confusion matrix values are within a difference of 10 from the reported ones.

Up to you---I think personally as long as the script completes successfully, then everything should be working fine. If there is a failure, probably the mlpack programs will issue a non-zero exit code and then spam_classification.sh will return a non-zero exit code too.
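One caveat worth illustrating: a shell script only passes a failure on if it stops at the failing command (for example via `set -e`) or if that command is the last one. Whether spam_classification.sh does so is not shown in this thread. The sketch below uses a hypothetical script in which `false` stands in for a failing mlpack program.

```shell
set -eu

# Hypothetical stand-in for a pipeline script: the second step fails.
cat > failing_pipeline.sh <<'EOF'
#!/bin/sh
set -e            # abort at the first command that returns non-zero
echo "step 1"
false             # stands in for an mlpack program that fails
echo "step 2"     # never reached
EOF

# Run it and capture the exit status without aborting this shell.
status=0
sh failing_pipeline.sh > pipeline_output.txt || status=$?

echo "exit status: $status"
```

Without the `set -e` line, the script would continue to "step 2" and exit with status 0, so CI would not notice the failure.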

bkmgit · 2021-08-02T05:48:01Z

Ok. Will revert to the state before adding the CSV files, and then include the update script, with it commented out in .travis.yml.

The mlpack tests (https://github.com/mlpack/mlpack/tree/master/src/mlpack/tests) are quite complete, but maybe checking the accuracy of the examples would not hurt. A warning can be issued if a run succeeds but the values deviate significantly from the expected ones. Having some automated tests associated with each example may be good practice (#120).

For notebooks, perhaps https://github.com/davidbrochart/nbterm or https://github.com/jupyter/nbconvert could be used at some point.

rcurtin · 2021-08-05T01:27:10Z

Sounds good! I agree on your comments; maybe we should open issues and handle those separately?

bkmgit · 2021-11-17T16:44:30Z

@rcurtin Ok, made changes. Will open separate issues.

rcurtin

Sorry it took so long to get back to this. Everything looks good to me! Right now Travis is no longer hooked up to this repository, but I'll open an issue to set up a Jenkins job instead.

Thanks again for the contribution!

bkmgit · 2022-01-03T04:30:24Z

Thanks for the feedback. Looking forward to making further contributions.

mlpack-bot

Second approval provided automatically after 24 hours. 👍

rcurtin · 2022-01-05T00:20:36Z

Thanks again @bkmgit!

initial creation of spam tutorial and update of data download script

3bf2dd2

mlpack-bot bot added s: needs review s: unanswered s: unlabeled labels Aug 24, 2020

bkmgit mentioned this pull request Aug 24, 2020

Spam classification example #102

Closed

zoq added c: examples and removed s: unanswered s: unlabeled labels Aug 26, 2020

zoq reviewed Aug 28, 2020

bkmgit and others added 7 commits August 28, 2020 21:12

Update spam/tutorial.md

aa5cec5

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update spam/tutorial.md

0236200

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update spam/tutorial.md

65d1df2

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update spam/tutorial.md

9b04377

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update spam/tutorial.md

943266c

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update spam/tutorial.md

ea562e7

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update spam/tutorial.md

1586e9b

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

rcurtin reviewed Sep 12, 2020

bkmgit and others added 9 commits September 17, 2020 18:59

Update spam/tutorial.md

0c6ba9c

Co-authored-by: Ryan Curtin <ryan@ratml.org>

Update spam/tutorial.md

11a07b7

Co-authored-by: Ryan Curtin <ryan@ratml.org>

Update spam/tutorial.md

d7c4891

Co-authored-by: Ryan Curtin <ryan@ratml.org>

Update spam/tutorial.md

1ed293c

Co-authored-by: Ryan Curtin <ryan@ratml.org>

Update spam/tutorial.md

b4df918

Co-authored-by: Ryan Curtin <ryan@ratml.org>

Update spam/tutorial.md

1e57f69

Co-authored-by: Ryan Curtin <ryan@ratml.org>

Update spam/tutorial.md

da82a46

Co-authored-by: Ryan Curtin <ryan@ratml.org>

Update tutorial.md

7e5de02

Update formatting to make it neater

add tutorial as a bash script

3884cfc

rcurtin reopened this Jul 30, 2021

bkmgit added 2 commits July 30, 2021 19:07

Update version of Ubuntu

0677b0a

check if build will work without build script

a58a6f4

https://docs.travis-ci.com/user/reference/focal/

update script permissions

9d9c401

rcurtin reviewed Jul 31, 2021

bkmgit added 6 commits August 1, 2021 08:12

remove spam pre-processing in CI

72460e6

fix error in ordering of commands

745e953

enable building of command line executables

fef1858

remove example builds due to time constraint

4285f0e

temporarily disable dataset download, travis

a518cb2

update data files

2db8885

bkmgit added 4 commits November 17, 2021 19:31

Merge branch 'master' into master

1ced319

remove file as it can be pre-processed

32fec35

remove file as it can be pre-processed

cbd31c3

skip processing of spam

f483f4e

rcurtin approved these changes Jan 3, 2022

mlpack-bot bot approved these changes Jan 4, 2022

mlpack-bot bot removed the s: needs review label Jan 4, 2022

rcurtin merged commit c1e3eeb into mlpack:master Jan 5, 2022
