**Author**: Naomi Baes and ChatGPT  

**Note**: 
- The `"step4_get_sentences_target.py"` script in the `"0.0_corpus_preprocessing"` folder must be run to generate files in the `"natural_lines_targets"` folder (corpus lines containing the target term for each target).
- This notebook (1) randomly samples sentences containing the targets, (2) encodes them using a sentence transformer model, and (3) compiles the final cds dataframes for analysis.

## Step 0: Get 1500 synthetic sentences

In [18]:
# Script 2: Samples up to 1,500 unique synthetic sentences for each target word and epoch.

# Ensures no duplicate sentences per output file, with sibling-to-target replacement for each sampled sentence. 
# If fewer than 1,500 sentences are available, outputs all available sentences and logs a warning.

# Output files are named systematically by target and epoch (e.g., trauma_1970-1974.synthetic_1500_sentences.tsv) for easy identification. 
# Uses ThreadPoolExecutor for parallel processing, ensuring efficient handling of multiple (target, interval) combinations.

#!pip install spacy
#python -m spacy download en_core_web_sm

%run step0_get_1500_unique_sentences_5-year.py

  synthetic_corpus = pd.read_csv(synthetic_corpus_file, sep="\t", header=None, names=["sentence", "year", "label"])
Processing:   0%|          | 0/60 [00:00<?, ?it/s]

[INFO] Processing target 'trauma' for interval 1970-1974.


Processing trauma:  77%|██████████████████████████████▋         | 1149/1500 [02:57<00:54,  6.47it/s]
Processing:   2%|▏         | 1/60 [02:57<2:54:42, 177.68s/it]

[INFO] No valid sentences found in this round for target 'trauma'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\trauma_1970-1974.synthetic_1500_sentences.tsv
[INFO] Processing target 'trauma' for interval 1975-1979.


Processing trauma: 100%|████████████████████████████████████████| 1500/1500 [02:04<00:00, 12.06it/s]
Processing:   3%|▎         | 2/60 [05:02<2:21:28, 146.35s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\trauma_1975-1979.synthetic_1500_sentences.tsv
[INFO] Processing target 'trauma' for interval 1980-1984.


Processing trauma: 100%|████████████████████████████████████████| 1500/1500 [01:57<00:00, 12.73it/s]
Processing:   5%|▌         | 3/60 [07:00<2:06:41, 133.37s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\trauma_1980-1984.synthetic_1500_sentences.tsv
[INFO] Processing target 'trauma' for interval 1985-1989.


Processing trauma: 100%|████████████████████████████████████████| 1500/1500 [02:00<00:00, 12.47it/s]
Processing:   7%|▋         | 4/60 [09:00<1:59:39, 128.21s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\trauma_1985-1989.synthetic_1500_sentences.tsv
[INFO] Processing target 'trauma' for interval 1990-1994.


Processing trauma: 100%|████████████████████████████████████████| 1500/1500 [01:55<00:00, 12.99it/s]
Processing:   8%|▊         | 5/60 [10:55<1:53:20, 123.64s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\trauma_1990-1994.synthetic_1500_sentences.tsv
[INFO] Processing target 'trauma' for interval 1995-1999.


Processing trauma: 100%|████████████████████████████████████████| 1500/1500 [02:16<00:00, 10.97it/s]
Processing:  10%|█         | 6/60 [13:12<1:55:17, 128.10s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\trauma_1995-1999.synthetic_1500_sentences.tsv
[INFO] Processing target 'trauma' for interval 2000-2004.


Processing trauma: 100%|████████████████████████████████████████| 1500/1500 [02:23<00:00, 10.46it/s]
Processing:  12%|█▏        | 7/60 [15:36<1:57:36, 133.14s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\trauma_2000-2004.synthetic_1500_sentences.tsv
[INFO] Processing target 'trauma' for interval 2005-2009.


Processing trauma: 100%|████████████████████████████████████████| 1500/1500 [03:02<00:00,  8.20it/s]
Processing:  13%|█▎        | 8/60 [18:39<2:09:08, 149.00s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\trauma_2005-2009.synthetic_1500_sentences.tsv
[INFO] Processing target 'trauma' for interval 2010-2014.


Processing trauma: 100%|████████████████████████████████████████| 1500/1500 [03:59<00:00,  6.27it/s]
Processing:  15%|█▌        | 9/60 [22:38<2:30:40, 177.27s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\trauma_2010-2014.synthetic_1500_sentences.tsv
[INFO] Processing target 'trauma' for interval 2015-2019.


Processing trauma: 100%|████████████████████████████████████████| 1500/1500 [03:48<00:00,  6.55it/s]
Processing:  17%|█▋        | 10/60 [26:27<2:41:02, 193.24s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\trauma_2015-2019.synthetic_1500_sentences.tsv
[INFO] Processing target 'anxiety' for interval 1970-1974.


Processing anxiety:  34%|█████████████▍                          | 503/1500 [01:05<02:10,  7.64it/s]
Processing:  18%|█▊        | 11/60 [27:33<2:05:58, 154.26s/it]

[INFO] No valid sentences found in this round for target 'anxiety'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\anxiety_1970-1974.synthetic_1500_sentences.tsv
[INFO] Processing target 'anxiety' for interval 1975-1979.


Processing anxiety:  75%|█████████████████████████████▍         | 1131/1500 [05:10<01:41,  3.65it/s]
Processing:  20%|██        | 12/60 [32:43<2:41:21, 201.70s/it]

[INFO] No valid sentences found in this round for target 'anxiety'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\anxiety_1975-1979.synthetic_1500_sentences.tsv
[INFO] Processing target 'anxiety' for interval 1980-1984.


Processing anxiety: 100%|███████████████████████████████████████| 1500/1500 [03:42<00:00,  6.74it/s]
Processing:  22%|██▏       | 13/60 [36:26<2:42:55, 207.99s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\anxiety_1980-1984.synthetic_1500_sentences.tsv
[INFO] Processing target 'anxiety' for interval 1985-1989.


Processing anxiety: 100%|███████████████████████████████████████| 1500/1500 [02:50<00:00,  8.81it/s]
Processing:  23%|██▎       | 14/60 [39:16<2:30:44, 196.62s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\anxiety_1985-1989.synthetic_1500_sentences.tsv
[INFO] Processing target 'anxiety' for interval 1990-1994.


Processing anxiety: 100%|███████████████████████████████████████| 1500/1500 [02:01<00:00, 12.32it/s]
Processing:  25%|██▌       | 15/60 [41:18<2:10:32, 174.06s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\anxiety_1990-1994.synthetic_1500_sentences.tsv
[INFO] Processing target 'anxiety' for interval 1995-1999.


Processing anxiety: 100%|███████████████████████████████████████| 1500/1500 [02:20<00:00, 10.71it/s]
Processing:  27%|██▋       | 16/60 [43:38<2:00:08, 163.83s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\anxiety_1995-1999.synthetic_1500_sentences.tsv
[INFO] Processing target 'anxiety' for interval 2000-2004.


Processing anxiety: 100%|███████████████████████████████████████| 1500/1500 [02:42<00:00,  9.22it/s]
Processing:  28%|██▊       | 17/60 [46:20<1:57:10, 163.50s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\anxiety_2000-2004.synthetic_1500_sentences.tsv
[INFO] Processing target 'anxiety' for interval 2005-2009.


Processing anxiety: 100%|███████████████████████████████████████| 1500/1500 [03:00<00:00,  8.31it/s]
Processing:  30%|███       | 18/60 [49:21<1:58:02, 168.63s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\anxiety_2005-2009.synthetic_1500_sentences.tsv
[INFO] Processing target 'anxiety' for interval 2010-2014.


Processing anxiety: 100%|███████████████████████████████████████| 1500/1500 [04:00<00:00,  6.23it/s]
Processing:  32%|███▏      | 19/60 [53:22<2:10:04, 190.34s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\anxiety_2010-2014.synthetic_1500_sentences.tsv
[INFO] Processing target 'anxiety' for interval 2015-2019.


Processing anxiety: 100%|███████████████████████████████████████| 1500/1500 [04:15<00:00,  5.87it/s]
Processing:  33%|███▎      | 20/60 [57:38<2:19:59, 209.99s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\anxiety_2015-2019.synthetic_1500_sentences.tsv
[INFO] Processing target 'depression' for interval 1970-1974.


Processing depression:  64%|███████████████████████▋             | 962/1500 [03:46<02:06,  4.26it/s]
Processing:  35%|███▌      | 21/60 [1:01:24<2:19:37, 214.82s/it]

[INFO] No valid sentences found in this round for target 'depression'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\depression_1970-1974.synthetic_1500_sentences.tsv
[INFO] Processing target 'depression' for interval 1975-1979.


Processing depression: 100%|████████████████████████████████████| 1500/1500 [05:41<00:00,  4.40it/s]
Processing:  37%|███▋      | 22/60 [1:07:05<2:40:05, 252.78s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\depression_1975-1979.synthetic_1500_sentences.tsv
[INFO] Processing target 'depression' for interval 1980-1984.


Processing depression: 100%|████████████████████████████████████| 1500/1500 [04:16<00:00,  5.84it/s]
Processing:  38%|███▊      | 23/60 [1:11:22<2:36:38, 254.02s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\depression_1980-1984.synthetic_1500_sentences.tsv
[INFO] Processing target 'depression' for interval 1985-1989.


Processing depression: 100%|████████████████████████████████████| 1500/1500 [03:49<00:00,  6.55it/s]
Processing:  40%|████      | 24/60 [1:15:11<2:27:56, 246.56s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\depression_1985-1989.synthetic_1500_sentences.tsv
[INFO] Processing target 'depression' for interval 1990-1994.


Processing depression: 100%|████████████████████████████████████| 1500/1500 [02:33<00:00,  9.77it/s]
Processing:  42%|████▏     | 25/60 [1:17:45<2:07:33, 218.68s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\depression_1990-1994.synthetic_1500_sentences.tsv
[INFO] Processing target 'depression' for interval 1995-1999.


Processing depression: 100%|████████████████████████████████████| 1500/1500 [02:59<00:00,  8.33it/s]
Processing:  43%|████▎     | 26/60 [1:20:45<1:57:21, 207.09s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\depression_1995-1999.synthetic_1500_sentences.tsv
[INFO] Processing target 'depression' for interval 2000-2004.


Processing depression: 100%|████████████████████████████████████| 1500/1500 [03:07<00:00,  8.02it/s]
Processing:  45%|████▌     | 27/60 [1:23:52<1:50:36, 201.12s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\depression_2000-2004.synthetic_1500_sentences.tsv
[INFO] Processing target 'depression' for interval 2005-2009.


Processing depression: 100%|████████████████████████████████████| 1500/1500 [03:38<00:00,  6.87it/s]
Processing:  47%|████▋     | 28/60 [1:27:30<1:50:01, 206.30s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\depression_2005-2009.synthetic_1500_sentences.tsv
[INFO] Processing target 'depression' for interval 2010-2014.


Processing depression: 100%|████████████████████████████████████| 1500/1500 [04:48<00:00,  5.21it/s]
Processing:  48%|████▊     | 29/60 [1:32:19<1:59:17, 230.89s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\depression_2010-2014.synthetic_1500_sentences.tsv
[INFO] Processing target 'depression' for interval 2015-2019.


Processing depression: 100%|████████████████████████████████████| 1500/1500 [04:42<00:00,  5.31it/s]
Processing:  50%|█████     | 30/60 [1:37:02<2:03:13, 246.46s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\depression_2015-2019.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_health' for interval 1970-1974.


Processing mental_health:  76%|█████████████████████████        | 1137/1500 [02:49<00:54,  6.71it/s]
Processing:  52%|█████▏    | 31/60 [1:39:51<1:47:58, 223.40s/it]

[INFO] No valid sentences found in this round for target 'mental_health'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\mental_health_1970-1974.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_health' for interval 1975-1979.


Processing mental_health: 100%|█████████████████████████████████| 1500/1500 [02:21<00:00, 10.58it/s]
Processing:  53%|█████▎    | 32/60 [1:42:13<1:32:49, 198.91s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_health_1975-1979.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_health' for interval 1980-1984.


Processing mental_health: 100%|█████████████████████████████████| 1500/1500 [01:53<00:00, 13.16it/s]
Processing:  55%|█████▌    | 33/60 [1:44:07<1:18:03, 173.45s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_health_1980-1984.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_health' for interval 1985-1989.


Processing mental_health: 100%|█████████████████████████████████| 1500/1500 [02:03<00:00, 12.16it/s]
Processing:  57%|█████▋    | 34/60 [1:46:10<1:08:39, 158.43s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_health_1985-1989.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_health' for interval 1990-1994.


Processing mental_health: 100%|█████████████████████████████████| 1500/1500 [01:59<00:00, 12.51it/s]
Processing:  58%|█████▊    | 35/60 [1:48:10<1:01:12, 146.89s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_health_1990-1994.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_health' for interval 1995-1999.


Processing mental_health: 100%|█████████████████████████████████| 1500/1500 [02:21<00:00, 10.58it/s]
Processing:  60%|██████    | 36/60 [1:50:32<58:08, 145.36s/it]  

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_health_1995-1999.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_health' for interval 2000-2004.


Processing mental_health: 100%|█████████████████████████████████| 1500/1500 [02:21<00:00, 10.62it/s]
Processing:  62%|██████▏   | 37/60 [1:52:53<55:15, 144.17s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_health_2000-2004.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_health' for interval 2005-2009.


Processing mental_health: 100%|█████████████████████████████████| 1500/1500 [03:02<00:00,  8.22it/s]
Processing:  63%|██████▎   | 38/60 [1:55:56<57:05, 155.71s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_health_2005-2009.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_health' for interval 2010-2014.


Processing mental_health: 100%|█████████████████████████████████| 1500/1500 [04:14<00:00,  5.89it/s]
Processing:  65%|██████▌   | 39/60 [2:00:11<1:04:53, 185.42s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_health_2010-2014.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_health' for interval 2015-2019.


Processing mental_health: 100%|█████████████████████████████████| 1500/1500 [04:13<00:00,  5.91it/s]
Processing:  67%|██████▋   | 40/60 [2:04:25<1:08:40, 206.00s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_health_2015-2019.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_illness' for interval 1970-1974.


Processing mental_illness:  76%|████████████████████████▎       | 1138/1500 [02:35<00:49,  7.30it/s]
Processing:  68%|██████▊   | 41/60 [2:07:01<1:00:28, 190.97s/it]

[INFO] No valid sentences found in this round for target 'mental_illness'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\mental_illness_1970-1974.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_illness' for interval 1975-1979.


Processing mental_illness: 100%|████████████████████████████████| 1500/1500 [02:01<00:00, 12.34it/s]
Processing:  70%|███████   | 42/60 [2:09:02<51:03, 170.18s/it]  

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_illness_1975-1979.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_illness' for interval 1980-1984.


Processing mental_illness: 100%|████████████████████████████████| 1500/1500 [01:51<00:00, 13.44it/s]
Processing:  72%|███████▏  | 43/60 [2:10:54<43:14, 152.62s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_illness_1980-1984.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_illness' for interval 1985-1989.


Processing mental_illness: 100%|████████████████████████████████| 1500/1500 [01:43<00:00, 14.50it/s]
Processing:  73%|███████▎  | 44/60 [2:12:38<36:46, 137.90s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_illness_1985-1989.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_illness' for interval 1990-1994.


Processing mental_illness: 100%|████████████████████████████████| 1500/1500 [01:48<00:00, 13.79it/s]
Processing:  75%|███████▌  | 45/60 [2:14:26<32:17, 129.17s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_illness_1990-1994.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_illness' for interval 1995-1999.


Processing mental_illness: 100%|████████████████████████████████| 1500/1500 [02:13<00:00, 11.20it/s]
Processing:  77%|███████▋  | 46/60 [2:16:40<30:28, 130.60s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_illness_1995-1999.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_illness' for interval 2000-2004.


Processing mental_illness: 100%|████████████████████████████████| 1500/1500 [02:22<00:00, 10.55it/s]
Processing:  78%|███████▊  | 47/60 [2:19:03<29:03, 134.09s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_illness_2000-2004.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_illness' for interval 2005-2009.


Processing mental_illness: 100%|████████████████████████████████| 1500/1500 [02:54<00:00,  8.58it/s]
Processing:  80%|████████  | 48/60 [2:21:58<29:16, 146.34s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_illness_2005-2009.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_illness' for interval 2010-2014.


Processing mental_illness: 100%|████████████████████████████████| 1500/1500 [04:22<00:00,  5.71it/s]
Processing:  82%|████████▏ | 49/60 [2:26:20<33:13, 181.22s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_illness_2010-2014.synthetic_1500_sentences.tsv
[INFO] Processing target 'mental_illness' for interval 2015-2019.


Processing mental_illness: 100%|████████████████████████████████| 1500/1500 [04:19<00:00,  5.78it/s]
Processing:  83%|████████▎ | 50/60 [2:30:40<34:07, 204.73s/it]

[INFO] Output file saved at: synthetic/output/unique_5-year\mental_illness_2015-2019.synthetic_1500_sentences.tsv
[INFO] Processing target 'abuse' for interval 1970-1974.


Processing abuse:   1%|▍                                          | 14/1500 [00:04<08:00,  3.09it/s]
Processing:  85%|████████▌ | 51/60 [2:30:44<21:42, 144.69s/it]

[INFO] No valid sentences found in this round for target 'abuse'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\abuse_1970-1974.synthetic_1500_sentences.tsv
[INFO] Processing target 'abuse' for interval 1975-1979.


Processing abuse:   5%|██                                         | 70/1500 [00:08<02:57,  8.05it/s]
Processing:  87%|████████▋ | 52/60 [2:30:53<13:51, 103.91s/it]

[INFO] No valid sentences found in this round for target 'abuse'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\abuse_1975-1979.synthetic_1500_sentences.tsv
[INFO] Processing target 'abuse' for interval 1980-1984.


Processing abuse:   7%|██▊                                       | 101/1500 [00:13<03:01,  7.72it/s]
Processing:  88%|████████▊ | 53/60 [2:31:06<08:56, 76.68s/it] 

[INFO] No valid sentences found in this round for target 'abuse'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\abuse_1980-1984.synthetic_1500_sentences.tsv
[INFO] Processing target 'abuse' for interval 1985-1989.


Processing abuse:  13%|█████▍                                    | 195/1500 [00:50<05:39,  3.84it/s]
Processing:  90%|█████████ | 54/60 [2:31:57<06:53, 68.92s/it]

[INFO] No valid sentences found in this round for target 'abuse'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\abuse_1985-1989.synthetic_1500_sentences.tsv
[INFO] Processing target 'abuse' for interval 1990-1994.


Processing abuse:  24%|██████████                                | 360/1500 [02:23<07:32,  2.52it/s]
Processing:  92%|█████████▏| 55/60 [2:34:20<07:35, 91.17s/it]

[INFO] No valid sentences found in this round for target 'abuse'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\abuse_1990-1994.synthetic_1500_sentences.tsv
[INFO] Processing target 'abuse' for interval 1995-1999.


Processing abuse:  41%|█████████████████▏                        | 613/1500 [06:19<09:09,  1.62it/s]
Processing:  93%|█████████▎| 56/60 [2:40:40<11:50, 177.70s/it]

[INFO] No valid sentences found in this round for target 'abuse'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\abuse_1995-1999.synthetic_1500_sentences.tsv
[INFO] Processing target 'abuse' for interval 2000-2004.


Processing abuse:  41%|█████████████████▏                        | 614/1500 [07:36<10:58,  1.34it/s]
Processing:  95%|█████████▌| 57/60 [2:48:16<13:04, 261.40s/it]

[INFO] No valid sentences found in this round for target 'abuse'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\abuse_2000-2004.synthetic_1500_sentences.tsv
[INFO] Processing target 'abuse' for interval 2005-2009.


Processing abuse:  55%|███████████████████████▎                  | 832/1500 [14:04<11:18,  1.02s/it]
Processing:  97%|█████████▋| 58/60 [3:02:22<14:33, 436.50s/it]

[INFO] No valid sentences found in this round for target 'abuse'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\abuse_2005-2009.synthetic_1500_sentences.tsv
[INFO] Processing target 'abuse' for interval 2010-2014.


Processing abuse:  74%|██████████████████████████████▏          | 1106/1500 [24:32<08:44,  1.33s/it]
Processing:  98%|█████████▊| 59/60 [3:26:54<12:27, 747.31s/it]

[INFO] No valid sentences found in this round for target 'abuse'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\abuse_2010-2014.synthetic_1500_sentences.tsv
[INFO] Processing target 'abuse' for interval 2015-2019.


Processing abuse:  88%|███████████████████████████████████▉     | 1315/1500 [28:16<03:58,  1.29s/it]
Processing: 100%|██████████| 60/60 [3:55:11<00:00, 235.18s/it] 

[INFO] No valid sentences found in this round for target 'abuse'. Skipping further siblings.
[INFO] Output file saved at: synthetic/output/unique_5-year\abuse_2015-2019.synthetic_1500_sentences.tsv
[INFO] Summary file saved at: synthetic/output/unique_5-year/summary.csv
[INFO] Sampling and file generation completed.
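
The sampling behaviour described in the Step 0 cell comments (unique sentences per output file, sibling-to-target replacement, a warning when fewer sentences than requested are available, and parallel processing of (target, interval) pairs) can be sketched roughly as follows. This is a minimal illustration and not the contents of `step0_get_1500_unique_sentences_5-year.py`: the function name `sample_unique`, the `SIBLINGS` mapping, the toy `CORPUS`, and the `synthetic_<sibling>` label format are all assumptions made for the example.

```python
import random
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sibling terms per target (illustrative only).
SIBLINGS = {"trauma": ["injury", "shock"]}

# Toy (sentence, year) pairs standing in for the synthetic corpus.
CORPUS = [
    ("The injury changed her life.", 1971),
    ("He was in shock after the crash.", 1972),
    ("The injury changed her life.", 1973),  # same text as the first row
    ("An old injury flared up.", 1976),
]

def sample_unique(target, start, end, max_n=1500, seed=0):
    """Return up to max_n unique sentences for one (target, interval) pair."""
    rng = random.Random(seed)
    candidates = [(s, y) for s, y in CORPUS if start <= y <= end]
    rng.shuffle(candidates)
    seen, rows = set(), []
    for sentence, year in candidates:
        for sibling in SIBLINGS[target]:
            pattern = rf"\b{re.escape(sibling)}\b"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                # Sibling-to-target replacement.
                replaced = re.sub(pattern, target, sentence, flags=re.IGNORECASE)
                if replaced not in seen:  # keep each sentence only once
                    seen.add(replaced)
                    rows.append((replaced, year, f"synthetic_{sibling}"))
                break
        if len(rows) == max_n:
            break
    if len(rows) < max_n:
        print(f"[WARN] only {len(rows)} sentences for {target} {start}-{end}")
    return rows

# One job per (target, interval) combination, processed in parallel.
jobs = [("trauma", 1970, 1974), ("trauma", 1975, 1979)]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda job: sample_unique(*job), jobs))
```

In this toy run each interval has far fewer than 1,500 candidates, so both jobs take the warning path and return every unique sentence they found, which mirrors the shortfall cases visible in the log above.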





Plot descriptives (sentence counts per target and 5-year epoch) for synthetic sentences from the file "summary.csv"

In [11]:
import pandas as pd
import os
import matplotlib.pyplot as plt

# Define file paths
input_dir = os.path.abspath("synthetic/output/unique_5-year")  # Updated directory for input files
output_folder = os.path.abspath("synthetic/output/unique_5-year")  # Updated output directory

# Ensure the output folder exists
os.makedirs(output_folder, exist_ok=True)

# Read the summary dataset
summary_file = os.path.join(input_dir, "summary.csv")
summary_data = pd.read_csv(summary_file)

# Process the data for plotting
epoch_count_summary = summary_data.groupby(["epoch", "target"]).agg(
    total_count=pd.NamedAgg(column="count", aggfunc="sum")
).reset_index()

# Generate plots for each target term using the summarized data
targets = epoch_count_summary['target'].unique()
colors = {
    'abuse': '#8B0000',
    'anxiety': '#FF6347',
    'depression': '#4B0082',
    'mental_health': '#008080',
    'mental_illness': '#800080',
    'trauma': '#DC143C',
}

n_targets = len(targets)
n_cols = 2
n_rows = (n_targets + n_cols - 1) // n_cols
fig, axs = plt.subplots(n_rows, n_cols, figsize=(10, n_rows * 3), sharex=True, constrained_layout=True)

axs = axs.flatten()

for i, target in enumerate(targets):
    ax = axs[i]
    target_data = epoch_count_summary[epoch_count_summary["target"] == target]
    ax.bar(
        target_data["epoch"], target_data["total_count"],
        color=colors.get(target, "grey"), edgecolor='black', linewidth=0.8
    )
    ax.set_title(target.capitalize(), fontsize=20)
    ax.axhline(500, color="black", linestyle="--", linewidth=1.5, label="Low threshold")
    if target_data["total_count"].max() >= 1500:
        ax.axhline(1500, color="darkgreen", linestyle="--", linewidth=1.5, label="High threshold")
    ax.set_xlabel("Epoch", fontsize=18)
    ax.set_ylabel("Count", fontsize=18)
    ax.tick_params(axis='both', labelsize=16)
    ax.legend(loc='upper right', fontsize=14)

# Rotate the x-tick labels on the last-row axes (sharex=True hides them on the other rows).
# Styling the existing labels avoids the warning raised by set_xticklabels() and does not
# reuse target_data from the final loop iteration.
for ax in axs[-n_cols:]:
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", fontsize=16)

# Remove unused subplots if any
for j in range(len(targets), len(axs)):
    fig.delaxes(axs[j])

# Save the plot
plot_file_path = "../figures/plot_appendixC_breadth.png"
plt.savefig(plot_file_path, dpi=300, bbox_inches='tight')
plt.close()
print(f"Plot saved to {plot_file_path}.")


Plot saved to ../figures/plot_appendixC_breadth.png.


##### Generate merged files from 1500 5-year interval samples for all-year sampling strategy, to match Sentiment and Intensity restrictions

In [22]:
import pandas as pd
import os

# Define the directory where the CSV files are located
input_directory = 'synthetic/output/unique_5-year'  # Folder with the 5-year CSV files
output_directory = 'synthetic/output/unique_all-year'  # Folder to save the merged files

# List of targets (you can add more targets to this list)
targets = ['abuse', 'anxiety', 'depression', 'mental_health', 'mental_illness', 'trauma']

# Create the output directory if it does not exist
os.makedirs(output_directory, exist_ok=True)

# Loop through each target and process the corresponding files
for target in targets:
    df_list = []  # Initialize an empty list to collect DataFrames for the current target

    # Loop through the files in the input directory
    for file_name in os.listdir(input_directory):
        if file_name.startswith(target) and file_name.endswith("1500_sentences.tsv"):  # Check for correct file format
            file_path = os.path.join(input_directory, file_name)
            try:
                # Read the tab-separated file, which has no header row; column names are assigned here
                df = pd.read_csv(file_path, header=None, names=['sentence', 'label', 'year'], sep='\t')
                df_list.append(df)
                print(f"Found and added file: {file_name}")  # Debugging print
            except pd.errors.ParserError as e:
                print(f"Failed to parse {file_name}: {e}")  # Print error if failed to parse

    # Check if any files were added to df_list
    if not df_list:
        print(f"No files found for target: {target}")  # Inform if no files are found
        continue

    # Concatenate all DataFrames for the current target into one
    merged_df = pd.concat(df_list, ignore_index=True)
    # Save the merged DataFrame to the new output folder
    merged_df.to_csv(os.path.join(output_directory, f'{target}_synthetic_sentences.csv'), index=False)
    print(f"Files for '{target}' merged successfully into '{output_directory}/{target}_synthetic_sentences.csv'")

Found and added file: abuse_1970-1974.synthetic_1500_sentences.tsv
Found and added file: abuse_1975-1979.synthetic_1500_sentences.tsv
Found and added file: abuse_1980-1984.synthetic_1500_sentences.tsv
Found and added file: abuse_1985-1989.synthetic_1500_sentences.tsv
Found and added file: abuse_1990-1994.synthetic_1500_sentences.tsv
Found and added file: abuse_1995-1999.synthetic_1500_sentences.tsv
Found and added file: abuse_2000-2004.synthetic_1500_sentences.tsv
Found and added file: abuse_2005-2009.synthetic_1500_sentences.tsv
Found and added file: abuse_2010-2014.synthetic_1500_sentences.tsv
Found and added file: abuse_2015-2019.synthetic_1500_sentences.tsv
Files for 'abuse' merged successfully into 'synthetic/output/unique_all-year/abuse_synthetic_sentences.csv'
Found and added file: anxiety_1970-1974.synthetic_1500_sentences.tsv
Found and added file: anxiety_1975-1979.synthetic_1500_sentences.tsv
Found and added file: anxiety_1980-1984.synthetic_1500_sentences.tsv
Found and added

In [23]:
%run step2_get_descriptives_synthetic.py 

Sibling frequencies saved to synthetic/output/unique_all-year\z_sibling_frequencies_1500.csv


## Step 1: Randomly sample contexts (sentences) with which to compute breadth measure

#### 5-year.cosine: Random Sampling without Replacement (Uniform)
Explanation: The script randomly samples up to 50 sentences 10 times from input files for specified target terms, divided into 5-year intervals (e.g., 1970-1974, 1975-1979). It processes the files, checks if they exist, and saves the sampled sentences to an output directory, with each sentence including its associated year and category.

- Sampling strategy: Generate up to 50 random sentences 10 times from target_term filtered lines (in the input folder) for further analysis. 
- Output example for generate_interval_samples(): mental_illness.sentences.psych.1970-1974.1; [...].2, etc.
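
The sampling strategy in the bullets above can be sketched as follows. This is a hedged illustration rather than the code in `step1_randomly_sample_sentences_5-year.py`: the name `generate_interval_samples` comes from the output example above, but its signature, the seed handling, and the toy input lines are assumptions.

```python
import random

def generate_interval_samples(sentences, n_samples=10, sample_size=50, seed=42):
    """Draw n_samples samples; each one is drawn without replacement."""
    rng = random.Random(seed)
    # "Up to 50": if fewer lines are available, take all of them.
    k = min(sample_size, len(sentences))
    return [rng.sample(sentences, k) for _ in range(n_samples)]

# Toy input standing in for the target-filtered lines of one 5-year interval.
lines = [f"sentence {i}" for i in range(120)]
samples = generate_interval_samples(lines)

# Output names following the pattern shown above (suffixes .1 to .10).
names = [f"mental_illness.sentences.psych.1970-1974.{i}" for i in range(1, len(samples) + 1)]
```

Within one sample no sentence repeats, but the same sentence can appear in several of the ten samples because each draw starts again from the full list.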

This script creates synthetic datasets by sampling sentences associated with sibling terms. It shuffles and selects from these lists to ensure diversity and prevent duplicates within each iteration. The balance of synthetic to natural sentences is controlled by predefined injection ratios. `max_attempts` caps the number of cycles through the sibling list when collecting sentences, so the script does not stall if the required number of synthetic sentences isn't available. This cap matters most when working with small or highly specific datasets.

If there is a shortage of unique synthetic sentences, the script can reuse previously selected sentences once all unique options have been exhausted.
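
A rough sketch of how a `max_attempts` cap and a reuse fallback could fit together is given below. It is an assumption-laden illustration, not the script's actual logic: the function `collect_synthetic`, the `pools` structure (sibling term mapped to its sentences), and the exact point at which reuse starts are hypothetical.

```python
import random

def collect_synthetic(pools, n_needed, max_attempts=3, seed=0):
    """Collect n_needed sentences from per-sibling pools, cycling at most max_attempts times."""
    rng = random.Random(seed)
    siblings = list(pools)
    # Shuffled copy of each sibling's sentences, consumed as sentences are selected.
    remaining = {sib: rng.sample(sents, len(sents)) for sib, sents in pools.items()}
    chosen, used = [], set()
    for _ in range(max_attempts):  # one attempt = one cycle through the sibling list
        rng.shuffle(siblings)
        for sib in siblings:
            while remaining[sib]:
                sentence = remaining[sib].pop()
                if sentence not in used:  # skip duplicates across siblings
                    used.add(sentence)
                    chosen.append(sentence)
                    break
            if len(chosen) == n_needed:
                return chosen
    # Cap reached without enough unique sentences: reuse ones already selected.
    unique = list(chosen)
    while unique and len(chosen) < n_needed:
        chosen.append(rng.choice(unique))
    return chosen

pools = {"injury": ["s1", "s2"], "shock": ["s3"]}
enough = collect_synthetic(pools, n_needed=2)
padded = collect_synthetic(pools, n_needed=5)
```

Here `enough` contains two distinct sentences, while `padded` contains the three available unique sentences plus two reused ones, because the pools run dry before five unique sentences are found.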


Note: The strategy we adopted iterates through the synthetic data and selects sentences at random, without enforcing uniqueness, so that it mimics a random selection process limited only by the interval. For a forced-unique-sentence approach, see OLD > "final_combined.5-year.cds_mpnet.uniform-unique-sent.csv"; as discussed, that approach would not be random.

In [27]:
# run stratified random sampling strategy
%run step1_randomly_sample_sentences_5-year.py

100%|██████████| 60/60 [00:12<00:00,  4.92it/s]


In [28]:
# get descriptives to examine sibling distribution from random sampling
%run step2_get_descriptives_5-year.py

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  synthetic_sentences['sibling'] = synthetic_sentences['type'].str.extract(r"synthetic_(\S+)")

#### all-year.cosine: Random Sampling with Replacement (bootstrapped)
- Sentences can be repeated within the same sample because we are sampling with replacement (but we apply a 3-sentence cap for datasets under 97478 rows).
- Used to estimate variability or generate robust results even with small datasets.
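
Sampling with replacement under a repetition cap could look like the sketch below. It assumes that the "3-sentence cap" means a single sentence may appear at most three times in one sample, and that the cap applies only when the dataset has fewer than 97478 rows; the function `bootstrap_sample` and its parameters are hypothetical and not taken from `step1_randomly_sample_sentences_all-year.py`.

```python
import random
from collections import Counter

def bootstrap_sample(sentences, sample_size, cap=3, cap_below=97478, seed=0):
    """Sample with replacement; for small datasets, limit each sentence to `cap` draws."""
    rng = random.Random(seed)
    if len(sentences) >= cap_below:
        # Large dataset: plain sampling with replacement, no cap (assumed behaviour).
        return rng.choices(sentences, k=sample_size)
    if sample_size > cap * len(set(sentences)):
        raise ValueError("sample_size cannot be reached under the cap")
    counts, sample = Counter(), []
    while len(sample) < sample_size:
        sentence = rng.choice(sentences)
        if counts[sentence] < cap:  # reject draws that would exceed the cap
            counts[sentence] += 1
            sample.append(sentence)
    return sample

pool = ["a", "b", "c", "d"]
sample = bootstrap_sample(pool, sample_size=10)
```

The rejection step keeps the draw random while preventing any one sentence from dominating a small sample.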

In [15]:
%run step1_randomly_sample_sentences_all-year.py 

6it [00:04,  1.23it/s]


In [16]:
# get descriptives to examine sibling distribution from random sampling
%run step2_get_descriptives_all-year.py

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  synthetic_sentences['sibling'] = synthetic_sentences['type'].str.extract(r"synthetic_(\S+)")

Sibling frequencies saved to output\sibling_frequencies_all-year.cosine.csv



# End of notebook