NLP to detect hit keywords for SPAM/HAM dataset #13

ajwad-shaikh · 2020-11-05T07:49:03Z

Implement NLP to extract keywords from SPAM and HAM corpus.

A frequency vector of these keywords would be a great feature for our model. To make sure we have keywords specific to the SPAM and HAM characteristics of the PR, we will do the following.

N = complexity of the model (starting with 30, might change iteratively to achieve better results)

A = Top N keywords list from SPAM dataset
B = Top N keywords list from HAM dataset

SPAM_KEYWORDS = (A - B)
HAM_KEYWORDS = (B - A)

Suggestion: use multi-rake for rapid keyword extraction from the corpus.
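
A minimal sketch of the A/B scheme above, for illustration only. It assumes plain word-frequency counting as a stand-in for multi-rake, and the helper names (`top_keywords`, `split_keywords`, `frequency_vector`) are hypothetical, not part of the project:

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(docs, n):
    """Return the n most frequent tokens across docs.

    Assumption: raw frequency stands in for a real keyword extractor.
    """
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(n)]


def split_keywords(spam_docs, ham_docs, n=30):
    """Return (SPAM_KEYWORDS, HAM_KEYWORDS) as set differences A - B and B - A."""
    a = set(top_keywords(spam_docs, n))  # A: top N keywords from SPAM
    b = set(top_keywords(ham_docs, n))   # B: top N keywords from HAM
    return a - b, b - a


def frequency_vector(text, keywords):
    """Count each keyword's occurrences in text, in sorted keyword order."""
    counts = Counter(tokenize(text))
    return [counts[k] for k in sorted(keywords)]
```

Keywords that rank in the top N of both corpora drop out of both sets, so only class-specific terms end up in the feature vector. Swapping `top_keywords` for a multi-rake based extractor would leave the rest unchanged.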

ajwad-shaikh assigned vrushti-mody Nov 5, 2020

ajwad-shaikh added the machine-learning label (tasks related to Machine Learning model building) Nov 5, 2020

vrushti-mody linked a pull request Nov 5, 2020 that will close this issue

Generating Spam keywords #14

Merged

ajwad-shaikh closed this as completed Nov 6, 2020
