how to create train.csv, validation.csv, test.csv #19

Closed
akmalsabri opened this issue Jun 30, 2020 · 11 comments

Comments

@akmalsabri

Hi, how do I set up and create train.csv, validation.csv, and test.csv from the Financial Phrase Bank data?

@emes83

emes83 commented Jun 30, 2020

> Hi, how do I set up and create train.csv, validation.csv, and test.csv from the Financial Phrase Bank data?

The same here.

@akmalsabri
Author

@emes83 Or maybe we need to create them ourselves from the data at https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10

@emes83

emes83 commented Jun 30, 2020

> @emes83 Or maybe we need to create them ourselves from the data at https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10

Maybe. I believe so, yes, but how? I'm wondering how to do this so that I can adapt the solution to another language in the future.

@akmalsabri
Author

@emes83 I tried the data from #5 (comment); maybe you should try it too.
However, I still get the error IndexError: list index out of range when running train_data = finbert.get_data('train')

@emes83

emes83 commented Jul 1, 2020

> @emes83 I tried the data from #5 (comment); maybe you should try it too.
> However, I still get the error IndexError: list index out of range when running train_data = finbert.get_data('train')

The same here.

@praslisa

praslisa commented Jul 5, 2020

I am getting the same error, IndexError: list index out of range, and the dataset in the link has invalid characters as well. Could anyone fix the issue?

@emes83

emes83 commented Jul 20, 2020

Can somebody please share the train/validation/test files in a proper, working format?

@valentintsl

Hello mates!
After a thorough search through the code, I found the layout expected for the train, test, and validation inputs.
The CSV files need to have the column names [text, label] WITH THE INDEX column.
The fields need to be separated by a tab ('\t').

I have attached the CSV files that I made from the Financial Phrase Bank dataset.

FinancialPhraseBankforFinBERT.zip

Valentin TASSEL
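
For readers who want to see the layout described above on disk, here is a minimal sketch using only the standard library. The two sentences are made-up placeholders, not rows from the dataset, and the empty first header cell is an assumption based on how pandas writes an unnamed index with `to_csv(sep='\t')`.

```python
import csv
import io

# Placeholder rows for illustration only; they are not taken from the dataset.
rows = [
    ("Operating profit rose compared with the previous year.", "positive"),
    ("The company did not disclose the value of the order.", "neutral"),
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["", "text", "label"])  # empty first header cell = unnamed index column
for index, (text, label) in enumerate(rows):
    writer.writerow([index, text, label])

tsv_text = buffer.getvalue()
print(tsv_text)

# Read the content back to confirm every row has three tab-separated fields.
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
```

Each data row therefore has three fields: the index, the sentence, and the label. A file that is comma-separated or lacks the index column would not match this layout.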

@evanzd

evanzd commented Sep 14, 2020

The labels provided by @valentintsl are not balanced. I'm using the script below to create these datasets:

import pandas as pd

# Decode as UTF-8, dropping any bytes that cannot be decoded.
with open('Sentences_50Agree.txt', 'rb') as f:
    data = f.read().decode(errors='ignore')

# Each line is "<sentence>@<label>". Split on the last '@' only, in case a
# sentence contains '@', and accept both '\r\n' and '\n' line endings.
lines = data.replace('\r\n', '\n').strip().split('\n')
df = pd.DataFrame([x.rsplit('@', 1) for x in lines], columns=['text', 'label'])

pos = df.query('label=="positive"')
pos = pos.sample(len(pos), random_state=0) # shuffle samples

neg = df.query('label=="negative"')
neg = neg.sample(len(neg), random_state=0)

neu = df.query('label=="neutral"')
neu = neu.sample(len(neu), random_state=0)

# Per label: the last 20% goes to test, the 20% before it to validation,
# and the remaining rows to train.
n_pos = int(len(pos)*0.2)
n_neg = int(len(neg)*0.2)
n_neu = int(len(neu)*0.2)

# Write tab-separated files, keeping the index as the first column.
pd.concat([pos[:-n_pos*2], neg[:-n_neg*2], neu[:-n_neu*2]], axis=0).to_csv('train.csv', sep='\t')
pd.concat([pos[-n_pos*2:-n_pos], neg[-n_neg*2:-n_neg], neu[-n_neu*2:-n_neu]], axis=0).to_csv('validation.csv', sep='\t')
pd.concat([pos[-n_pos:], neg[-n_neg:], neu[-n_neu:]], axis=0).to_csv('test.csv', sep='\t')
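
One way to check that the resulting files carry similar label proportions is to count the labels in each one. The helper below is a sketch with a hypothetical name, `label_shares`; it assumes the index/text/label tab-separated layout written above, and the sample content is made up purely to show the call.

```python
import csv
import io
from collections import Counter


def label_shares(tsv_text):
    """Return {label: share of rows} for index/text/label TSV content."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter(row["label"] for row in reader)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Tiny made-up split, only to demonstrate the call.
sample = (
    "\ttext\tlabel\n"
    "0\tplaceholder a\tpositive\n"
    "1\tplaceholder b\tneutral\n"
    "2\tplaceholder c\tneutral\n"
    "3\tplaceholder d\tnegative\n"
)
shares = label_shares(sample)
print(shares)
```

For the real files, the same function could be called on the text read from train.csv, validation.csv, and test.csv in turn. If the per-label split worked as intended, the three results should be close to one another.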

@nithinreddyy

trained_model = finbert.train(train_examples = train_data, model = model)

The error is:

TypeError                                 Traceback (most recent call last)
<ipython-input-11-2ebf0cb3d4e8> in <module>
----> 1 trained_model = finbert.train(train_examples = train_data, model = model)

~\finBERT-master\finbert\finbert.py in train(self, train_examples, model)
    482                     print('No best model found')
    483                 torch.save({'epoch': str(i), 'state_dict': model.state_dict()},
--> 484                            self.config.model_dir / ('temporary' + str(i)))
    485                 best_model = i
    486 

TypeError: unsupported operand type(s) for /: 'str' and 'str'

Can anyone check this and help me, please?
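
The traceback above points at the expression `self.config.model_dir / ('temporary' + str(i))`, and the message says both operands are `str`, which suggests the model directory was supplied as a plain string. The sketch below reproduces the same error in isolation and shows that a `pathlib.Path` supports the `/` operator. The path `models/sentiment` is a hypothetical placeholder, and how the config object is constructed is not shown in this thread, so wrapping the directory in `Path` before passing it in is an assumption.

```python
from pathlib import Path

model_dir_as_str = "models/sentiment"  # hypothetical placeholder path

# Dividing one str by another raises the TypeError seen in the traceback.
try:
    model_dir_as_str / ("temporary" + str(0))
except TypeError as error:
    message = str(error)
    print(message)

# A Path object defines "/" as path joining, so the same expression works.
model_dir = Path(model_dir_as_str)
checkpoint = model_dir / ("temporary" + str(0))
print(checkpoint)
```

If this is indeed the cause, converting the directory to a `Path` wherever it is first defined should let the checkpoint path be built.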

@doguaraci
Member

You can find the instructions for creating these files in the updated README.
