CSV Database filter script #19

Closed
sanketgarade opened this issue Jul 9, 2021 · 10 comments

Comments

@sanketgarade
Contributor

sanketgarade commented Jul 9, 2021

A program that takes a database (currently in CSV format) as input, keeps only the necessary data according to a filter criterion (a second input), and outputs that data in the same format as the input database.

@sanketgarade
Contributor Author

purpose -

to filter the input CSV data according to the provided arguments

filter types

  • invalid data
  • all words
  • specific topic
  • specific letter (the initial letter of the English word)
  • TBD (more can be added)

about invalid data filter

the invalid data filter must always run before any other filter, because it removes entries that have insufficient or invalid data

  • insufficient data - either the English or the Marathi word is missing

  • invalid data - the English word contains non-English characters, or the Marathi word contains non-Marathi characters (this check is low priority right now and can be worked out later)
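A minimal sketch of the insufficient-data check described above, assuming each CSV row has been read into a dict. The column names `en` and `mr` are hypothetical, since the real CSV headers are not given in this thread. The character-validity check is left out because it is marked as low priority.

```python
# Hypothetical column names; adjust them to the real CSV headers.
EN_COL = "en"
MR_COL = "mr"


def filter_invalid(rows):
    """Keep only rows that have both an English and a Marathi word."""
    kept = []
    for row in rows:
        # Treat a missing key, None, or whitespace-only text as "missing".
        en = (row.get(EN_COL) or "").strip()
        mr = (row.get(MR_COL) or "").strip()
        if en and mr:
            kept.append(row)
    return kept


rows = [
    {"en": "water", "mr": "पाणी"},
    {"en": "", "mr": "घर"},       # English word missing
    {"en": "tree", "mr": "   "},  # Marathi word is only whitespace
]
print(filter_invalid(rows))
```

Only the first row survives here, since the other two each lack one of the two main words.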

IMP

  • make separate functions for each filter type
  • call the relevant function according to the "filter type" argument that was passed

steps

  1. take the passed CSV data and the filter type as arguments
  2. generate a truncated CSV data structure as per the target filter
     • call the specific filter function here
  3. return this truncated data to the calling function
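The steps above could be sketched roughly as follows, with one function per filter type and a lookup table that picks the function from the "filter type" argument. This is only an illustration under assumptions: the function names, the filter type strings (`"all"`, `"initial"`), and the column names `en`, `mr`, `topic` are all hypothetical and not taken from the repository.

```python
import csv
import io


def filter_invalid(rows):
    # Always runs first: drop rows missing either of the two main words.
    return [r for r in rows
            if (r.get("en") or "").strip() and (r.get("mr") or "").strip()]


def filter_all_words(rows):
    # "all words": no filtering beyond the invalid-data pass.
    return rows


def filter_by_initial(rows, letter):
    # Keep rows whose English word starts with the given letter.
    return [r for r in rows
            if r["en"].strip().lower().startswith(letter.lower())]


# Hypothetical mapping from filter type to its function.
FILTERS = {
    "all": filter_all_words,
    "initial": filter_by_initial,
}


def filter_csv(csv_text, filter_type, **kwargs):
    # Step 1: take the CSV data and the filter type as arguments.
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = filter_invalid(list(reader))
    # Step 2: call the specific filter function.
    rows = FILTERS[filter_type](rows, **kwargs)
    # Step 3: return the truncated data in the same CSV format.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


data = "en,mr,topic\nwater,पाणी,nature\napple,,food\nant,मुंगी,animals\n"
print(filter_csv(data, "initial", letter="a"))
```

With this sample, the `apple` row is dropped by the invalid-data pass (no Marathi word), so only the `ant` row remains under the header. A dict lookup keeps the dispatch in one place, so adding a filter type means adding one function and one table entry.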

@zarbod
Collaborator

zarbod commented Jul 12, 2021

Hey, when you say "insufficient data", do you mean only missing English or Marathi words, or does it also include missing examples and tags?

@sanketgarade
Contributor Author

Only the two main words: one English (en) and one Marathi (mr).

@zarbod
Collaborator

zarbod commented Jul 12, 2021

Thanks. Also could you explain what the "all words" filter is supposed to do?

@sanketgarade
Contributor Author

> Thanks. Also could you explain what the "all words" filter is supposed to do?

"All words" basically means no filtering (other than the invalid/insufficient data, of course).

@zarbod
Collaborator

zarbod commented Jul 12, 2021

So I would just call the invalid/insufficient data scripts when the filter type is "All words"?

@sanketgarade
Contributor Author

Yes. Pretty much.

@zarbod
Collaborator

zarbod commented Jul 12, 2021

Also, the filter by topic function will require the topic as an argument. Do you want me to add an optional topic argument to the main filter function?

@sanketgarade
Contributor Author

Yes, you can do it in whichever way makes the functions easy to use and reusable.

What I've written in the gen-out.py file is just a basic example.
You can fill in the details and the gaps.
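One possible shape for the optional topic argument discussed above, as a sketch only. The names `filter_rows` and `filter_by_topic`, the filter type strings, and the assumption that topics live in a semicolon-separated `tags` column are all hypothetical; gen-out.py may be organized differently.

```python
def filter_by_topic(rows, topic):
    # Assumes a hypothetical "tags" column holding semicolon-separated topics.
    wanted = topic.strip().lower()
    kept = []
    for row in rows:
        tags = [t.strip().lower() for t in (row.get("tags") or "").split(";")]
        if wanted in tags:
            kept.append(row)
    return kept


def filter_rows(rows, filter_type, topic=None):
    # topic is optional and only required when filter_type is "topic".
    if filter_type == "topic":
        if topic is None:
            raise ValueError("filter type 'topic' requires a topic argument")
        return filter_by_topic(rows, topic)
    if filter_type == "all":
        return list(rows)
    raise ValueError(f"unknown filter type: {filter_type!r}")


rows = [
    {"en": "apple", "mr": "सफरचंद", "tags": "food; fruit"},
    {"en": "dog", "mr": "कुत्रा", "tags": "animals"},
]
print(filter_rows(rows, "topic", topic="food"))
```

Raising an error when the topic is missing makes a forgotten argument fail loudly instead of silently returning everything.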

@sanketgarade
Contributor Author

Pending issues from PR #32

Among these, the priority ones are # 1 and # 3.

sanketgarade changed the title from "Database filter script" to "CSV Database filter script" on Jul 16, 2021
sanketgarade added a commit that referenced this issue Jul 16, 2021