emails/ folder not included for training #72

Closed
oxlsf opened this issue Dec 22, 2015 · 5 comments

Comments

@oxlsf

oxlsf commented Dec 22, 2015

Hi,

It seems the raw emails you used for the ML training are not included in the repo. I'd like to train the classifier on my own emails. Can you tell me the right format to use?

@obukhov-sergey
Member

Hi @oxlsf, we plan to open-source the emails we used to create the dataset, but it will take some work and time: unlike the emails from the public Enron dataset, not all of them were originally publicly available, so we'll need to remove all sensitive information first.

Right now we provide just the processed dataset here: https://github.com/mailgun/talon/blob/master/talon/signature/data/train.data. Each row represents one line from an email. Every element in a row except the last is either 0 or 1: it's 1 if the corresponding feature from the feature set is true for that line and 0 otherwise. The last element is the label: 1 if the line belongs to a signature and -1 otherwise. Here's the feature set that we used: https://github.com/mailgun/talon/blob/master/talon/signature/learning/featurespace.py#L15
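
To make the row format concrete, here is a minimal sketch of reading such rows in plain Python. It assumes the values are comma-separated, which the description above does not state, and the function names `parse_train_line` and `load_train_data` are hypothetical helpers, not part of talon.

```python
from typing import Iterable, List, Tuple


def parse_train_line(line: str, delimiter: str = ",") -> Tuple[List[int], int]:
    """Split one row into (binary feature flags, label).

    Assumption: values are separated by `delimiter` (comma by default).
    """
    values = [int(float(v)) for v in line.strip().split(delimiter)]
    if len(values) < 2:
        raise ValueError("expected at least one feature and a label")
    *features, label = values
    if any(f not in (0, 1) for f in features):
        raise ValueError("feature values must be 0 or 1")
    if label not in (1, -1):
        raise ValueError("label must be 1 (signature) or -1 (not signature)")
    return features, label


def load_train_data(
    lines: Iterable[str], delimiter: str = ","
) -> Tuple[List[List[int]], List[int]]:
    """Parse all non-empty rows into a feature matrix X and a label vector y."""
    rows = [parse_train_line(l, delimiter) for l in lines if l.strip()]
    X = [features for features, _ in rows]
    y = [label for _, label in rows]
    return X, y


# Two made-up rows with three features each, for illustration only.
sample = ["0,1,0,1", "1,0,0,-1"]
X, y = load_train_data(sample)
print(X)  # [[0, 1, 0], [1, 0, 0]]
print(y)  # [1, -1]
```

To read the real file, pass an open file handle instead of `sample`, for example `load_train_data(open("train.data"))`.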

@dichen001

Hi @obukhov-sergey, how is your progress on open-sourcing your training data? I have the same need as @oxlsf.
By the way, could you kindly provide the processed data from Enron (i.e. with each line marked with #sig# if it belongs to a signature)?

@obukhov-sergey
Member

@oxlsf @dichen001 we recently open-sourced an annotated email dataset. It's not the one used by talon, but we plan to switch to it once it has over 600 emails (it currently has over 190).

The idea is to use cleansed data that doesn't have private or personal information and encourage people to contribute their own emails (with cleansed phone numbers, URLs, etc) to keep the dataset up to date.
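
As a rough illustration of that kind of cleansing, here is a minimal sketch that masks URLs and phone-like numbers with regular expressions. The patterns, the placeholder tokens `<URL>` and `<PHONE>`, and the function name `cleanse` are all assumptions made for this example; they are not the conventions of the dataset, and a real pass would need manual review for names, addresses, and other personal details.

```python
import re

# Assumption: anything starting with http:// or https:// up to the next
# whitespace is treated as a URL.
URL_RE = re.compile(r"https?://\S+")

# Assumption: a phone number is a run of at least nine characters made of
# digits, spaces, parentheses, dots, or hyphens, starting (after an optional
# "+") and ending with a digit. This is deliberately loose.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def cleanse(text: str) -> str:
    """Replace URLs and phone-like numbers with placeholder tokens."""
    # Mask URLs first so digits inside them are not mistaken for phones.
    text = URL_RE.sub("<URL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text


example = "Call +1 (555) 123-4567 or visit https://example.com/x"
print(cleanse(example))
```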

Feel free to contribute :) I'll be adding more emails shortly. I also plan to refactor the code that prepares the train data so that it's easy to add more emails to the dataset and test the library.

@itsvivekshetty

Hi @obukhov-sergey, I also need to train the classifier on my own data. Is there a way I can do it now? Could you please provide the steps for training?

@obukhov-sergey
Member

@itsvivekshetty @dichen001 @oxlsf here's the open-sourced dataset: https://github.com/mailgun/forge. It's not the one used for training, but once it has more data we'll use it instead. PRs are welcome. I've also added a section to the Readme with more info on how to retrain the classifier with your own raw emails.
