This step extracts speech from the CHILDES corpus and organises it into the following layout:

```
├── Eng-NA
│   ├── adult
│   │   ├── *.txt
│   ├── child
│   │   ├── *.txt
│   └── metadata.csv
└── Eng-UK
    ├── adult
    ├── child
    └── metadata.csv
```
Each top-level folder corresponds to a language and accent (`Language_Accent`):
- `Eng-NA`: English, North America
- `Eng-UK`: English, United Kingdom

The `child` folder contains child speech; the `adult` folder contains adult speech.
Files are named after the original CHILDES folders, so the link between the child and adult sets is preserved.
The `metadata.csv` file describes the contents of each set:

| file_id           | lang | child_gender | child_age |
|-------------------|------|--------------|-----------|
| Haggerty_haggerty | eng  | female       | 2;07.18   |
| Brown_Eve_010600b | eng  | female       | 1;06.00   |
| Brown_Eve_010600a | eng  | female       | 1;06.00   |

Rows with no `child_gender` or `child_age` value correspond to files that did not contain child speech.
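As a rough illustration of how this metadata could be used, the sketch below keeps only the rows that describe child speech. It assumes `metadata.csv` is a comma-separated file whose header matches the table above; the sample text, the `Example_adult_only` row, and the `files_with_child_speech` helper are hypothetical and not part of `lexical_benchmark`. It also assumes a row needs both fields to count as child speech.

```python
import csv
import io

# Hypothetical excerpt of a metadata.csv file. Column names follow the table
# above; the last row is an invented example of a file without child speech.
SAMPLE = """file_id,lang,child_gender,child_age
Haggerty_haggerty,eng,female,2;07.18
Brown_Eve_010600b,eng,female,1;06.00
Example_adult_only,eng,,
"""


def files_with_child_speech(handle):
    """Return file_ids of rows where both child_gender and child_age are set."""
    reader = csv.DictReader(handle)
    return [
        row["file_id"]
        for row in reader
        if (row.get("child_gender") or "").strip()
        and (row.get("child_age") or "").strip()
    ]


child_files = files_with_child_speech(io.StringIO(SAMPLE))
print(child_files)
```

To run this against a real export, the `io.StringIO` object could be swapped for an open file handle on the `metadata.csv` of a given language folder.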

In [1]:
from lexical_benchmark import settings
from lexical_benchmark.datasets.human import CHILDESPreparation

prep = CHILDESPreparation()
prep.load_dir(settings.PATH.source_childes / "Eng-NA", "Eng-NA")
prep.load_dir(settings.PATH.source_childes / "Eng-UK", "Eng-UK")
prep.export_to_dir(settings.PATH.raw_childes, show_progress=True)
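Because exported files keep their CHILDES-derived names, child and adult transcripts can in principle be matched by file name. The sketch below assumes the same stem appears in both the `child` and `adult` folders; `paired_transcripts` is a hypothetical helper (not part of `lexical_benchmark`), and it is demonstrated on a throwaway directory with placeholder text rather than on real data.

```python
import tempfile
from pathlib import Path


def paired_transcripts(lang_dir: Path) -> dict:
    """Map each file stem found in both child/ and adult/ to its two paths."""
    child = {p.stem: p for p in (lang_dir / "child").glob("*.txt")}
    adult = {p.stem: p for p in (lang_dir / "adult").glob("*.txt")}
    shared = sorted(child.keys() & adult.keys())
    return {stem: (child[stem], adult[stem]) for stem in shared}


# Demo on a temporary directory that mimics the exported layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "Eng-NA"
    for speaker in ("child", "adult"):
        (root / speaker).mkdir(parents=True)
    (root / "child" / "Brown_Eve_010600a.txt").write_text("placeholder child text\n")
    (root / "adult" / "Brown_Eve_010600a.txt").write_text("placeholder adult text\n")
    (root / "adult" / "Haggerty_haggerty.txt").write_text("placeholder adult text\n")

    paired_ids = list(paired_transcripts(root))

print(paired_ids)
```

Files present on only one side (here `Haggerty_haggerty`) are simply left out of the result.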

Splitting speech by age.

We separate child speech by the following age groups:

- `00 months - 06 months`
- `06 months - 12 months`
- `12 months - 18 months`
- `18 months - 24 months`
- `24 months - 30 months`

This splits the data into 6-month intervals covering ages 0 to 2.5 years.
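The grouping above can be sketched as follows. CHILDES records ages as `years;months.days` (for example `1;06.00`), so the sketch converts that string to months and looks up the matching interval. `AGE_RANGES`, `age_to_months`, and `age_bin` are hypothetical stand-ins, not the `OrganizeByAge` implementation; treating each interval as half-open (`min <= age < max`) and counting a month as 30 days are both assumptions.

```python
import re

# Interval edges in months, mirroring the list above.
# Half-open intervals [min, max) are an assumption of this sketch.
AGE_RANGES = [(0, 6), (6, 12), (12, 18), (18, 24), (24, 30)]

# Matches ages such as "1;06.00" or "2;07" (the day part is optional).
AGE_PATTERN = re.compile(r"^(\d+);(\d{1,2})(?:\.(\d{1,2}))?$")


def age_to_months(age: str) -> float:
    """Convert a 'years;months.days' age string to (approximate) months."""
    match = AGE_PATTERN.match(age.strip())
    if match is None:
        raise ValueError(f"Unrecognised age format: {age!r}")
    years, months, days = match.groups()
    # Days are converted with a 30-day month, which is only an approximation.
    return int(years) * 12 + int(months) + int(days or 0) / 30


def age_bin(age: str):
    """Return the (min, max) interval containing the age, or None if outside."""
    months = age_to_months(age)
    for low, high in AGE_RANGES:
        if low <= months < high:
            return (low, high)
    return None


print(age_bin("1;06.00"))
print(age_bin("2;07.18"))
```

Under these assumptions `1;06.00` lands in the 18 to 24 month group, while `2;07.18` (about 31.6 months) falls outside every group and would not be assigned to a split.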

In [2]:
from lexical_benchmark import settings
from lexical_benchmark.datasets.human import OrganizeByAge
from rich.console import Console


with Console().status("Building child age splits..."):
    organizer = OrganizeByAge(settings.PATH.raw_childes, ["Eng-NA", "Eng-UK"])
    organizer.make_age_splits()
    organizer.build_splits()
print("Splits built successfully.")

Output()

In [6]:
from lexical_benchmark import settings

for lang in ("Eng-NA", "Eng-UK"):
    print(f"--{lang}--")
    for min_age, max_age in settings.CHILDES_AGE_RANGES:
        location = settings.PATH.raw_childes / lang / "child_by_age" / f"{min_age}_{max_age}"
        print(f"[{min_age},{max_age}]:: {len(list(location.glob('*.txt')))} number of entries.")

--Eng-NA--
[0,6]:: 3 number of entries.
[6,12]:: 678 number of entries.
[12,18]:: 322 number of entries.
[18,24]:: 651 number of entries.
[24,30]:: 858 number of entries.
--Eng-UK--
[0,6]:: 1 number of entries.
[6,12]:: 194 number of entries.
[12,18]:: 174 number of entries.
[18,24]:: 136 number of entries.
[24,30]:: 619 number of entries.
