arabic data #295

Closed
mcfrank opened this issue May 11, 2023 · 25 comments

@mcfrank
Member

mcfrank commented May 11, 2023

import arabic data from https://github.com/langcog/ArabicCAT

@mcfrank mcfrank added the data label May 11, 2023
@alvinwmtan
Contributor

alvinwmtan commented Jul 30, 2023

@mcfrank
Member Author

mcfrank commented Jul 31, 2023 via email

@alvinwmtan
Contributor

JISH:
Contributor: Jeddah Institute for Speech and Hearing
Citation: Dashash, N., & Safi, S. (2014). JISH Arabic Communicative Development Inventory: Saudi population JACDI: User’s guide and technical manual. Jeddah: Jeddah Institute for Speech and Hearing

Alroqi:
Contributors:
Haifa Alroqi, King Abdulaziz University
Alaa Almohammadi, King Abdulaziz University
Khadeejah Alaslani, Purdue University
Citation: TBD

@HenryMehta
Contributor

@alvinwmtan
I've started on Arabic (Saudi).

A couple of problems. WS is too big to fit in a single database row: there are 1079 items, and the program creates a 15-character text field for each, which makes the row too large for MySQL, the database we're using. I'm trying to find a solution but have made no progress yet (and I'm not confident).

WG has a new category (negation_words). I need to add this to the categories.csv file. I need to add it with a lexical_category and a lexical_class. I have used function_words for both for the time being as this seems to be used quite a lot.

Finally, some of the cells contain "Understands ONLY, Understands & Says". They should be one or the other. No cells have the two reversed, so I think this is the actual value. I can map these so that they result in "produces", BUT I will need to amend the file so that these use a semicolon instead of a comma, because the comma specifies a different field.
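
For a rough sense of why 1079 fields of 15 characters can overflow a row, the arithmetic can be sketched as below. MySQL documents a maximum row size of 65,535 bytes. The sketch assumes each item is stored as a `VARCHAR(15)` in a 4-byte-per-character charset such as `utf8mb4`, which this thread does not confirm, and it ignores other columns and per-row overhead.

```python
# Rough estimate of the declared row size for a wide table of short VARCHAR columns.
MYSQL_MAX_ROW_BYTES = 65_535  # MySQL's documented maximum row size


def estimated_row_bytes(n_columns, chars_per_field, bytes_per_char):
    """Estimate row bytes, assuming one length byte per short VARCHAR column."""
    per_column = chars_per_field * bytes_per_char + 1
    return n_columns * per_column


def fits_in_row(n_columns, chars_per_field, bytes_per_char):
    """Return True if the estimate stays within MySQL's row size limit."""
    return estimated_row_bytes(n_columns, chars_per_field, bytes_per_char) <= MYSQL_MAX_ROW_BYTES


# Assumed storage: VARCHAR(15) with up to 4 bytes per character.
long_form = estimated_row_bytes(1079, 15, 4)
# Same assumption, but with one-character values.
short_form = estimated_row_bytes(1079, 1, 4)
```

Under these assumptions the 1079 fields need 65,819 bytes, just over the limit, while one-character values would need 5,395 bytes.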

@alvinwmtan
Contributor

@HenryMehta

  • WS too big: hmm, I'm not really sure what an alternative solution would be. It would be sad to have to drop some rows—it just happens that the form for this language is particularly large...
  • WG negation_words: function_words is good for them.
  • "Understands ONLY, Understands & Says": let's map these to "produces" as you mentioned. It should be okay to amend the original file.
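
The agreed mapping could be sketched as a small lookup. Only the combined label and its target "produces" come from this thread; the other two entries and the function name are assumptions added for illustration.

```python
# Map raw response labels to stored values.
RESPONSE_MAP = {
    "Understands ONLY": "understands",  # assumed mapping
    "Understands & Says": "produces",  # assumed mapping
    "Understands ONLY, Understands & Says": "produces",  # agreed in this thread
}


def map_response(cell):
    """Return the stored value for a raw cell, or None for an empty cell."""
    label = cell.strip()
    if not label:
        return None
    if label not in RESPONSE_MAP:
        # Fail loudly so that unknown labels are not silently dropped.
        raise ValueError(f"unexpected response label: {label!r}")
    return RESPONSE_MAP[label]
```

Raising on unknown labels is a design choice in this sketch: it surfaces any further unexpected cell contents at import time.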

@HenryMehta
Contributor

@alvinwmtan

Arabic (Saudi) WG is now available to test.

I cannot load WS until we have a decision about whether we could use "u" instead of "understands" and "p" instead of "produces". This would need to apply across all datasets and would impact the shiny app, as previously mentioned.

@HenryMehta HenryMehta self-assigned this Dec 4, 2023
@alvinwmtan
Contributor

(fixing by switching to "u" and "p", as in #298)

@mcfrank
Member Author

mcfrank commented Dec 4, 2023 via email

@HenryMehta
Contributor

@alvinwmtan We still have an issue here. I am now getting a "Too many columns" error message. I've done some reading about this and I cannot increase any parameters to allow more fields. I therefore propose we split the Arabic (Saudi) WS into 2 files and hence 2 tables.

@HenryMehta
Contributor

I endorse this suggestion since it may come up again and will generally save space. But we do need to update the shiny apps as noted. @mikabr may need to update. Will we need to change all instruments or are "understands" and "u" now both options?

@mcfrank @alvinwmtan
For now I've applied it to French (French) WS plus all future instruments added

@alvinwmtan
Contributor

@HenryMehta Hm okay. Do you know what the column limit is?

@HenryMehta
Contributor

@alvinwmtan It's not actually that simple because it also depends on the column names. I could probably work it out, but it would take some time. I think we should aim to keep the max to 750.
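
If a cap of 750 item columns per table were adopted, the number of tables an instrument needs follows directly. A minimal sketch, in which the function name and the near-equal split are assumptions rather than anything the application does:

```python
import math

MAX_ITEM_COLUMNS = 750  # target suggested in this thread, not a MySQL constant


def split_items(item_ids, max_columns=MAX_ITEM_COLUMNS):
    """Split item ids into near-equal chunks of at most max_columns each."""
    if not item_ids:
        return []
    n_chunks = math.ceil(len(item_ids) / max_columns)
    size = math.ceil(len(item_ids) / n_chunks)
    return [item_ids[i:i + size] for i in range(0, len(item_ids), size)]


# 1079 items, as in the Arabic (Saudi) WS form.
chunks = split_items(list(range(1079)))
chunk_sizes = [len(chunk) for chunk in chunks]
```

For 1079 items this gives two chunks of 540 and 539 items, consistent with the proposal of 2 files and 2 tables.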

@alvinwmtan
Contributor

@HenryMehta Given that the size of the col names also matters, do you think it might be possible to retain the full table if we converted all the colnames to just numbers? That would reduce the size. If not I'll think about how to split the dataset up.

@HenryMehta
Contributor

@alvinwmtan We could try, but I don't know how many columns that would give us, and the names would actually need changing for every study because of the way the application works. We would need to change the code as well, because column names are currently called 'item_xx', where xx is the column number. We could shorten the name to 'ixx', because column names must start with a letter.
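
The renaming floated here could be sketched as a pure string transformation. The 'item_xx' and 'ixx' patterns are taken from the comment above; the function names are hypothetical, and this does not cover the application code that would also need changing.

```python
import re

# Matches the existing naming scheme, e.g. "item_12".
ITEM_COLUMN = re.compile(r"^item_(\d+)$")


def shorten_column(name):
    """Turn 'item_12' into 'i12'; leave any other column name unchanged."""
    match = ITEM_COLUMN.match(name)
    return f"i{match.group(1)}" if match else name


def shorten_columns(names):
    """Apply shorten_column to every name in a header row."""
    return [shorten_column(name) for name in names]
```

Each renamed column saves four characters, so the sketch shortens the header without changing the column count.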

@alvinwmtan
Contributor

@HenryMehta Here is one attempt: I've separated the words (WS) and all other item types (WSOther); WS still has >800 items but hopefully it will be okay. The WS from Alroqi is unchanged. Let me know if this split is still too large and I will find a different solution.

[ArabicSaudi_WS].csv
[ArabicSaudi_WSOther].csv
ArabicSaudiWS_JISH_data.csv
ArabicSaudiWS_JISH_fields.csv
ArabicSaudiWS_JISH_values.csv
ArabicSaudiWSOther_JISH_data.csv
ArabicSaudiWSOther_JISH_fields.csv
ArabicSaudiWSOther_JISH_values.csv
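
A split like this one can be sketched as a partition of the item list by type. The thread does not show the layout of the fields files, so the `(item_id, item_type)` pairs and the `"word"` type label below are assumptions for illustration.

```python
def partition_items(items, word_type="word"):
    """Split (item_id, item_type) pairs into word items and all other items."""
    words, other = [], []
    for item_id, item_type in items:
        (words if item_type == word_type else other).append(item_id)
    return words, other


# Hypothetical items: two words and one non-word item.
items = [("item_1", "word"), ("item_2", "complexity"), ("item_3", "word")]
words, other = partition_items(items)
```

Here `words` would feed the WS table and `other` the WSOther table.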

@HenryMehta
Contributor

@alvinwmtan You've split the JISH files but not the Alroqi

@alvinwmtan
Contributor

@HenryMehta I believe the Alroqi files are all still within "WS" (only the JISH had items that now fall in "WSOther")

@HenryMehta
Contributor

OK

HenryMehta added a commit that referenced this issue Dec 14, 2023
@HenryMehta
Contributor

@alvinwmtan Deploying to dev now - will need about 40 minutes to load

@mikabr
Member

mikabr commented Dec 15, 2023

I've implemented allowing "u" and "p" values in wordbankr, but none of the Saudi Arabic tables seem to have those values, and the WSOther table seems to have zero rows (I'm connecting to wordbank2-dev-3).
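
Accepting both spellings on the reading side might look like the sketch below. It is written in Python for illustration and is not wordbankr's implementation (wordbankr is an R package); the function name and the choice of the long form as canonical are assumptions.

```python
# Short codes discussed in this thread and their long forms.
SHORT_TO_LONG = {"u": "understands", "p": "produces"}


def canonical_value(value):
    """Expand 'u'/'p' to the long form; pass other values through unchanged."""
    if value is None:
        return None
    return SHORT_TO_LONG.get(value, value)
```

With a step like this, tables that still store "understands" and "produces" and tables that store "u" and "p" can be read the same way.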

@alvinwmtan
Contributor

@HenryMehta WS looks good, don't seem to see any WSOther data

@HenryMehta
Contributor

@alvinwmtan try now

@alvinwmtan
Contributor

@HenryMehta WS and WSOther look good. I realised I also failed to disambiguate some of the items in the WG; these should be de-conflicted now:

ArabicSaudiWG_Alroqi_data.csv
ArabicSaudiWG_Alroqi_fields.csv

HenryMehta added a commit that referenced this issue Dec 20, 2023
@HenryMehta
Contributor

@alvinwmtan You've re-introduced the cells with "understands only, understands & says" instead of just one. I have previously changed these to "understands & says". I have reapplied this change

@alvinwmtan
Contributor

@HenryMehta thanks for catching that; looks good to me now!
