emails/ folder not included for training #72

oxlsf · 2015-12-22T11:46:04Z

Hi,

It seems the raw emails you used for the ML training are not included in the repo. I'd like to train the AI on my own emails. Can you tell me the right format to use?

obukhov-sergey · 2015-12-29T21:47:12Z

Hi @oxlsf, we plan to open-source the emails we used to create the dataset, but it will take some work and time: not all of them were originally publicly available the way the emails in the public Enron dataset are, so we'll need to remove all sensitive information first.

Right now we provide only the processed dataset, here: https://github.com/mailgun/talon/blob/master/talon/signature/data/train.data. Each line of that file represents one line from an email. Every element in a line except the last is either 0 or 1: it's 0 if the corresponding feature from the feature set is false for that email line and 1 otherwise. The last element is 1 if the email line belongs to a signature and -1 otherwise. Here's the feature set that we used: https://github.com/mailgun/talon/blob/master/talon/signature/learning/featurespace.py#L15
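
As a rough illustration of the row format described above (a sketch, not code from talon), one row could be split into its feature flags and its label like this. The function name `parse_train_line` is hypothetical, and the comma delimiter is an assumption; the description above only specifies the 0/1 features and the trailing 1 / -1 label.

```python
def parse_train_line(line, delimiter=","):
    """Split one train.data row into (features, is_signature).

    Assumption: values are integers separated by `delimiter`. The format
    description only guarantees 0/1 features followed by a 1 / -1 label.
    """
    values = [int(v) for v in line.strip().split(delimiter)]
    if len(values) < 2:
        raise ValueError("expected at least one feature and a label")

    # Everything but the last element is a feature flag; the last is the label.
    *features, label = values
    if any(f not in (0, 1) for f in features):
        raise ValueError("feature values must be 0 or 1")
    if label not in (1, -1):
        raise ValueError("label must be 1 (signature) or -1 (not signature)")

    return features, label == 1


# Made-up example row: five feature flags, then the label.
features, is_signature = parse_train_line("0,1,0,0,1,1")
print(features, is_signature)
```

The position of each flag corresponds to one entry in the feature set linked above, so the number of features per row should match the length of that list.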

dichen001 · 2016-02-24T03:11:25Z

Hi @obukhov-sergey, how is the progress on open-sourcing your training data? I have the same need as @oxlsf.
By the way, could you kindly provide the processed data from Enron (i.e., with each line marked with #sig# if it belongs to a signature)?

obukhov-sergey · 2016-06-30T22:07:25Z

@oxlsf @dichen001 we recently open-sourced an annotated email dataset. It's not the one used by talon, but we plan to switch to it once it has over 600 emails (it currently has over 190).

The idea is to use cleansed data that doesn't have private or personal information and encourage people to contribute their own emails (with cleansed phone numbers, URLs, etc) to keep the dataset up to date.

Feel free to contribute :) I'll be adding more emails shortly. I also plan to refactor the code that prepares the training data so that it's easy to add more emails to the dataset and test the library.

itsvivekshetty · 2018-10-01T09:49:53Z

Hi @obukhov-sergey, I also need to train on my own data. Is there a way I can do it now? Could you please provide the steps for training?

obukhov-sergey · 2018-11-02T12:43:36Z

@itsvivekshetty @dichen001 @oxlsf here's the open-sourced dataset: https://github.com/mailgun/forge. It's not the one used for training, but once it has more data we'll use it instead. PRs are welcome. I've also added a section to the Readme with more info on how to retrain the classifier with your own raw emails.

dichen001 mentioned this issue Feb 24, 2016

Asking for Processed Training data. dichen001/Email-Signature-Finder-Parser#1

Closed

EdenHazard10 mentioned this issue May 25, 2017

Example format of training data #142

Open

qxh5696 mentioned this issue Oct 11, 2018

Training Talon with our email set #173

Closed

obukhov-sergey closed this as completed Nov 2, 2018
