# ```preprocessing_bigrams.py```
Preprocesses sentences for bigram analysis.

### Input
- --files, -f
    - tsv files of sentences 

### Other arguments
- --cores, -p
    - Number of cores to use. 
    - Defaults to 4.

- --extension, -e
    - Custom extension for filename. 
    - Adds string to the end of filename.
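
The flags above could be declared with `argparse` roughly as follows. This is only a sketch of the documented interface, not the script's actual parser: the `nargs="+"` handling of `--files`, the plain-string file paths, and the empty-string default for `--extension` are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(
        description="Preprocesses sentences for bigram analysis."
    )
    # Assumption: one or more tsv paths, kept as plain strings.
    parser.add_argument("--files", "-f", nargs="+", required=True)
    # Documented default: 4 cores.
    parser.add_argument("--cores", "-p", type=int, default=4)
    # Assumption: the extension defaults to an empty string.
    parser.add_argument("--extension", "-e", default="")
    return parser


args = build_parser().parse_args(["--files", "sent01.tsv", "-p", "25"])
print(args.files, args.cores, repr(args.extension))
```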

### Output
- txt files of preprocessed sentences (in preprocessed_sents/)
    - Filename is automatically formatted using input file name.
        > ```fname = PATH + fd.name.split("/")[-1].split(".")[0] + "_" + args.extension + ".txt"```
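
To make the naming rule concrete, here is a small sketch that applies the same expression to a plain path string. The function name `output_name` and the `out_dir` default are illustrative assumptions (the script itself uses `PATH`, `fd.name`, and `args.extension`).

```python
def output_name(input_name, extension, out_dir="preprocessed_sents/"):
    """Apply the documented formatting rule to an input file name."""
    # Keep only the last path component, then cut at the first dot.
    stem = input_name.split("/")[-1].split(".")[0]
    return out_dir + stem + "_" + extension + ".txt"


print(output_name("../../2023-01-04/sent01.tsv", "test"))
# preprocessed_sents/sent01_test.txt
```

Because the stem is cut at the first dot, an input named `sent.v2.tsv` would be shortened to `sent`.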

### Example usage

!nohup python3 preprocessing_bigrams.py --files ../../2023-01-04/sent01.tsv -p 25

nohup: ignoring input and appending output to 'nohup.out'


# ```bigram_analysis.py```
Makes bigrams from preprocessed sentence files, dumps them, and writes the bigrams that pass the metrics (frequency, mutual information (MI), chi-squared) along with their scores.

### Input
- --files, -f
    - Preprocessed txt files of sentences (punctuation removed, sentences separated by \n) in ```preprocessed_sents/``` to make bigrams from.
    - Unnecessary if using dumped bigrams.

- --bigram_file, -bf
    - File to dump/load bigrams (in ```bigrams/```).

### Other arguments
- --frequency, -F   
    - Threshold for frequency.
    - Defaults to 100.

- --mi, -MI
    - Threshold for MI.
    - Defaults to 5.

- --chi_2, -C
    - Confidence level for the chi-squared test.
    - Defaults to 0.95.

- --all, -A
    - Do not apply any filter.
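
As a rough illustration of how the three thresholds could be applied, the sketch below scores bigrams with the textbook definitions of pointwise mutual information and the 2x2 chi-squared statistic. It is not taken from `bigram_analysis.py`: the function names, the whitespace tokenization, the requirement that a bigram pass all three tests, and the inclusive comparisons are assumptions.

```python
import math
from collections import Counter
from statistics import NormalDist


def score_bigrams(sentences):
    """Return {bigram: (frequency, pmi, chi2)} for whitespace-tokenized sentences."""
    bigrams = Counter()
    for sent in sentences:
        tokens = sent.split()
        bigrams.update(zip(tokens, tokens[1:]))

    n = sum(bigrams.values())  # total number of bigram tokens
    left, right = Counter(), Counter()
    for (w1, w2), count in bigrams.items():
        left[w1] += count   # bigrams starting with w1
        right[w2] += count  # bigrams ending with w2

    scores = {}
    for (w1, w2), o11 in bigrams.items():
        r1, c1 = left[w1], right[w2]
        pmi = math.log2(o11 * n / (r1 * c1))
        # Remaining cells of the 2x2 contingency table.
        o12, o21 = r1 - o11, c1 - o11
        o22 = n - r1 - c1 + o11
        denom = r1 * c1 * (n - r1) * (n - c1)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
        scores[(w1, w2)] = (o11, pmi, chi2)
    return scores


def chi2_critical(confidence=0.95):
    """Critical value for 1 degree of freedom (square of a standard normal quantile)."""
    return NormalDist().inv_cdf((1 + confidence) / 2) ** 2


def passes(score, min_freq=100, min_mi=5.0, confidence=0.95):
    """Assumption: a bigram must pass all three filters."""
    freq, pmi, chi2 = score
    return freq >= min_freq and pmi >= min_mi and chi2 >= chi2_critical(confidence)
```

With the default confidence of 0.95, the critical value is about 3.84, so a bigram would need a chi-squared score above that to be kept under these assumptions.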

### Output
- --output, -o
    - Path of txt file to write the bigrams and their scores to.
    - Prints results if unspecified.

### Example usage

!python3 bigram_analysis.py --files preprocessed_sents/sent01.txt --bigram_file bigrams/sent01_test.pk --output bigram_analysis/bigram_analysis_sent01_test.txt --frequency 10

done <_io.TextIOWrapper name='preprocessed_sents/sent01.txt' mode='r' encoding='UTF-8'>
Percent done: 0.0%
dumping
done dumping
making unigrams
frequency
making unigram counter
MI
chi
done making dict
done calculating


# ```collocation_replacer.py```
Rewrites preprocessed sentences with merged collocations, based on the output of ```bigram_analysis.py``` (or a \n-separated list of collocations).
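
The rewriting step can be pictured with the minimal sketch below. It assumes that a collocation is an adjacent word pair and that merging joins the pair with an underscore (suggested by the `--underscore_test` flag, but not confirmed here). The greedy left-to-right, non-overlapping strategy and the function name are illustrative choices, not necessarily what `collocation_replacer.py` does.

```python
def merge_collocations(sentence, collocations, joiner="_"):
    """Join adjacent token pairs that appear in `collocations` (a set of 2-tuples)."""
    tokens = sentence.split()
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in collocations:
            merged.append(joiner.join(pair))
            i += 2  # skip both words so that merges never overlap
        else:
            merged.append(tokens[i])
            i += 1
    return " ".join(merged)


print(merge_collocations("i love new york city", {("new", "york")}))
# i love new_york city
```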

### Input
- --files, -f
    - txt files to read sentences from (in ```merged_sents/``` or ```preprocessed_sents/```).

- --collocation_file, -cf
    - File to read collocations from (in ```bigram_analysis/```, usually the output of ```bigram_analysis.py```).

### Other arguments
- --cores, -p
    - Number of cores to use. 
    - Defaults to 4.

- --frequency, -F   
    - Threshold for frequency.
    - Defaults to 100.

- --mi, -MI
    - Threshold for MI.
    - Defaults to 5.

- --chi_2, -C
    - Confidence level for the chi-squared test.
    - Defaults to 0.95.

- --all, -A
    - Do not apply any filter.

- --merge_collocations, -m
    - Merge collocations.

- --underscore_test, -t
    - Prints words with multiple underscores.

- --print_time, -P
    - Prints time.

- --extension, -e
    - Custom extension for filename. 
    - Adds string to the end of filename.

- --path, -fp
    - Path of the directory to save the output file in.
    - Defaults to ```merged_sents```.

### Output
-  txt file of merged sentences 
    - The file is saved in ```merged_sents/aggressive``` or ```merged_sents/non-aggressive```, depending on whether -m is used, unless a path is specified with -fp.
    - Filenames are derived automatically from the input file name.
    - If -fp is used, the output is saved to that path with the automatically formatted file name.

### Example usage

!nohup python3 collocation_replacer.py --files preprocessed_sents/sent01.txt --collocation_file bigram_analysis/bigram_analysis_sent01_test.txt --path merged_sents/non-aggressive -p 25 --extension test

nohup: ignoring input and appending output to 'nohup.out'
