**Authors: Naomi Baes and ChatGPT**

The script builds mixed samples of natural and synthetic sentences for each target term, across predefined time intervals. The synthetic sentences come in two variants (higher and lower intensity, saved under `high` and `low`). Within each batch, sentences are sampled without replacement so that no sentence appears twice. The mixed samples are saved to separate directories by variant, with one file per injection ratio, so that different proportions of synthetic data can be compared in the later analysis.

*myenv (Python 3.11.9)*

# Compute Intensity Index

## Sample sentences from natural and synthetic corpora with which to compute Intensity 

Input:
- Natural corpus: 0.0_corpus_preprocessing/output/natural_lines_targets/"abuse_lines_targets" etc.
- Synthetic corpus: 
    3_intensity/synthetic/output/5-year/"abuse_1970-1974.synthetic_sentences.csv" etc.
    3_intensity/synthetic/output/all-year/"abuse_synthetic_sentences.csv" etc.

Output:
- 5-year sampling strategy
    5-year/high/"abuse_1970-1974_synthetic_intensity_0.1.tsv" etc.
    5-year/low/"abuse_1970-1974_synthetic_intensity_0.1.tsv" etc.
- all-year sampling strategy
    all-year/high/"abuse_synthetic_intensity_0.1.tsv" etc.
    all-year/low/"abuse_synthetic_intensity_0.1.tsv" etc.
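
The sampling step described at the top of this notebook can be sketched as follows. This is a minimal illustration, not the code in `step0_randomly_sample_sentences_5-year.py`: the function name `mix_samples`, the `sample_size` parameter, and the toy sentences are assumptions made for the example, and the injection ratio is treated as a plain proportion of the batch.

```python
import random

def mix_samples(natural, synthetic, injection_ratio, sample_size, seed=None):
    """Mix natural and synthetic sentences into one batch (hypothetical helper).

    round(sample_size * injection_ratio) sentences are drawn from `synthetic`,
    the rest from `natural`, each without replacement.
    """
    rng = random.Random(seed)
    n_synthetic = round(sample_size * injection_ratio)
    n_natural = sample_size - n_synthetic
    if n_synthetic > len(synthetic) or n_natural > len(natural):
        raise ValueError("not enough sentences to sample without replacement")
    # random.sample never returns the same list position twice
    mixed = rng.sample(natural, n_natural) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)
    return mixed

# Toy data standing in for one target term in one epoch
natural = [f"natural sentence {i}" for i in range(100)]
synthetic = [f"synthetic sentence {i}" for i in range(100)]

batch = mix_samples(natural, synthetic, injection_ratio=0.2, sample_size=50, seed=0)
print(len(batch))                                        # 50
print(sum(s.startswith("synthetic") for s in batch))     # 10
```

Raising an error when a pool is too small keeps the no-replacement guarantee explicit instead of silently returning a shorter batch.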

In [3]:
# Randomly sample using the random stratified sampling strategy (5-year intervals)

%run step0_randomly_sample_sentences_5-year.py

100%|██████████| 60/60 [00:19<00:00,  3.01it/s]


In [None]:
# Randomly sample using the bootstrapped sampling strategy (whole corpus)
%run step0_randomly_sample_sentences_all-year.py

## 5-year Intensity: Preprocess sentences and calculate index

In [1]:
#!pip install spacy pandas tqdm
#!python -m spacy download en_core_web_lg

%run step1_lemmatize_sentences_intensity_5-year.py  # base (Python 3.11.4)

Lemmatizing files: 100%|██████████| 3627/3627 [12:15<00:00,  4.93it/s]
Lemmatizing files: 100%|██████████| 3600/3600 [11:20<00:00,  5.29it/s]


In [2]:
%run step2_get_collocates_5-year.py

Output written to output/5-year.cosine/collocates\high/abuse_1970-1974_synthetic_intensity_0.10_lemmatized_collocates.csv
Output written to output/5-year.cosine/collocates\high/abuse_1970-1974_synthetic_intensity_0.10_lemmatized_lemmatized_collocates.csv
Output written to output/5-year.cosine/collocates\high/abuse_1970-1974_synthetic_intensity_0.1_lemmatized_collocates.csv
Output written to output/5-year.cosine/collocates\high/abuse_1970-1974_synthetic_intensity_0.1_lemmatized_lemmatized_collocates.csv
Output written to output/5-year.cosine/collocates\high/abuse_1970-1974_synthetic_intensity_0.2_lemmatized_collocates.csv
Output written to output/5-year.cosine/collocates\high/abuse_1970-1974_synthetic_intensity_0.2_lemmatized_lemmatized_collocates.csv
Output written to output/5-year.cosine/collocates\high/abuse_1970-1974_synthetic_intensity_0.3_lemmatized_collocates.csv
Output written to output/5-year.cosine/collocates\high/abuse_1970-1974_synthetic_intensity_0.3_lemmatized_lemmatized_c

This script parses each collocate filename to extract the target term, epoch, and injection ratio. It then reads the corresponding collocate CSV, merges it with the valence ratings from the Warriner dataset, computes a weighted valence index for each file, and saves the results to a single combined CSV for further analysis. It does so for the collocates in both the `high` and `low` folders.
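
A minimal sketch of the weighted index for one file is given below. The exact weighting used in `step3_compute_intensity-index_5-year.py` is not shown in this notebook, so the sketch assumes a frequency-weighted mean of the ratings and skips collocates that have no rating. The function name `weighted_index` and the toy counts and ratings are made up for illustration; they are not values from the Warriner dataset.

```python
def weighted_index(collocate_counts, ratings):
    """Frequency-weighted mean rating of the collocates (assumed weighting).

    collocate_counts: dict mapping collocate -> count in the sample
    ratings: dict mapping word -> rating from the norms
    Returns None if no collocate has a rating.
    """
    weighted_sum = 0.0
    total_weight = 0
    for word, count in collocate_counts.items():
        rating = ratings.get(word)
        if rating is None:
            continue  # collocate not covered by the norms: leave it out
        weighted_sum += count * rating
        total_weight += count
    if total_weight == 0:
        return None
    return weighted_sum / total_weight

# Toy inputs: "zzz" has no rating, so it does not affect the result
collocates = {"severe": 3, "mild": 1, "zzz": 5}
toy_ratings = {"severe": 6.0, "mild": 2.0}

index = weighted_index(collocates, toy_ratings)
print(index)  # 5.0, i.e. (3 * 6.0 + 1 * 2.0) / 4
```

Dropping unrated collocates from both the numerator and the denominator corresponds to an inner join between the collocate table and the norms.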

In [3]:
%run step3_compute_intensity-index_5-year.py

Processing files:   0%|          | 0/7227 [00:00<?, ?it/s]

## all-year Intensity: Preprocess sentences and calculate index

In [2]:
%run step1_lemmatize_sentences_intensity_all-year.py # base (Python 3.11.4)

Processing output/all-year.cosine/high: 100%|██████████| 3600/3600 [20:36<00:00,  2.91it/s]
Processing output/all-year.cosine/low: 100%|██████████| 3600/3600 [07:18<00:00,  8.21it/s]


In [3]:
%run step2_get_collocates_all-year.py

Output written to output/all-year.cosine/collocates\high/abuse_synthetic_intensity_0.100_lemmatized_collocates.csv
Output written to output/all-year.cosine/collocates\high/abuse_synthetic_intensity_0.10_lemmatized_collocates.csv
Output written to output/all-year.cosine/collocates\high/abuse_synthetic_intensity_0.11_lemmatized_collocates.csv
Output written to output/all-year.cosine/collocates\high/abuse_synthetic_intensity_0.12_lemmatized_collocates.csv
Output written to output/all-year.cosine/collocates\high/abuse_synthetic_intensity_0.13_lemmatized_collocates.csv
Output written to output/all-year.cosine/collocates\high/abuse_synthetic_intensity_0.14_lemmatized_collocates.csv
Output written to output/all-year.cosine/collocates\high/abuse_synthetic_intensity_0.15_lemmatized_collocates.csv
Output written to output/all-year.cosine/collocates\high/abuse_synthetic_intensity_0.16_lemmatized_collocates.csv
Output written to output/all-year.cosine/collocates\high/abuse_synthetic_intensity_0.17

In [14]:
%run step3_compute_intensity-index_all-year.py

Processing files:   0%|          | 0/7200 [00:00<?, ?it/s]

File: abuse_synthetic_intensity_0.100_lemmatized_collocates.csv, Target: abuse, Injection Ratio: 0
File: abuse_synthetic_intensity_0.10_lemmatized_collocates.csv, Target: abuse, Injection Ratio: 0
File: abuse_synthetic_intensity_0.11_lemmatized_collocates.csv, Target: abuse, Injection Ratio: 0
File: abuse_synthetic_intensity_0.12_lemmatized_collocates.csv, Target: abuse, Injection Ratio: 0
File: abuse_synthetic_intensity_0.13_lemmatized_collocates.csv, Target: abuse, Injection Ratio: 0
File: abuse_synthetic_intensity_0.14_lemmatized_collocates.csv, Target: abuse, Injection Ratio: 0
File: abuse_synthetic_intensity_0.15_lemmatized_collocates.csv, Target: abuse, Injection Ratio: 0
File: abuse_synthetic_intensity_0.16_lemmatized_collocates.csv, Target: abuse, Injection Ratio: 0
File: abuse_synthetic_intensity_0.17_lemmatized_collocates.csv, Target: abuse, Injection Ratio: 0
File: abuse_synthetic_intensity_0.18_lemmatized_collocates.csv, Target: abuse, Injection Ratio: 0
File: abuse_synthet
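
The log above prints `Injection Ratio: 0` for every file, although the filenames carry different ratio tokens (`0.100`, `0.10`, `0.11`, and so on). One possible cause is that only the digits before the decimal point are captured; the script itself is not shown here, so this is a guess. The sketch below parses the filename pattern seen in the output with a regular expression. The names `FILENAME_RE` and `parse_collocate_filename` are hypothetical. The ratio is kept as the raw string token because the 5-year output contains both `0.1` and `0.10`, which would collapse to the same number under `float()`; how the tokens map to numeric ratios is not stated in this notebook.

```python
import re

# Pattern inferred from the filenames printed in this notebook (an assumption)
FILENAME_RE = re.compile(
    r"^(?P<target>.+?)"                # target term, e.g. "abuse"
    r"(?:_(?P<epoch>\d{4}-\d{4}))?"    # optional epoch, e.g. "1970-1974"
    r"_synthetic_intensity_"
    r"(?P<ratio>\d+(?:\.\d+)?)"        # ratio token, kept as text
    r"(?:_lemmatized)+"                # tolerates a repeated "_lemmatized" suffix
    r"_collocates\.csv$"
)

def parse_collocate_filename(filename):
    """Return a dict with target, epoch, and ratio, or None if no match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return match.groupdict()

print(parse_collocate_filename(
    "abuse_1970-1974_synthetic_intensity_0.10_lemmatized_collocates.csv"))
# {'target': 'abuse', 'epoch': '1970-1974', 'ratio': '0.10'}

print(parse_collocate_filename(
    "abuse_synthetic_intensity_0.100_lemmatized_collocates.csv"))
# {'target': 'abuse', 'epoch': None, 'ratio': '0.100'}
```

Returning `None` for non-matching names makes it easy to log and skip unexpected files rather than mis-parse them.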

# End of Script