**Author**: Naomi Baes and ChatGPT

**Purpose**: This script filters corpora (abstracts and text articles) and extracts every line that contains one of the specified target terms (here `mental_health`, `mental_illness`, and `perception`).

**Note**: Example output lines from `filter_lines()`. Fields within each line are separated by `|||||`:
    # mental_illness.context.cohacoca "" How does one feed into the other ? " " Can we define the essential elements of each of the arts ? " " What of these elements are accessible to young children ? " We also need to assess our current early childhood art activities and determine their validity for young children within the arts . Are the arts being used simply as a vehicle to achieve goals outside of the arts themselves ? Or are children being helped to understand the aesthetic qualities of their world ? And are children being helped to understand the nature of artistic expression and to use media in more meaningful ways from an artistic point of view ? When these latter questions can be answered affirmatively , then we are providing children with educationally worthwhile art experiences in their early childhood education programs . ||||| 1993 ||||| nf ||||| 746761"
    # mental_illness.context.psych " complex scientific age. For myself, I have no doubt that religion can provide many of the positive elements of good mental_health, and I believe that this concept will grow to full maturity in the years ahead. ||||| 1965 ||||| Pastoral Psychology ||||| Pastoral Psychology "
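
The example lines above carry several fields separated by `|||||`. A minimal sketch of splitting such a line back into its fields follows. The helper name `split_record` is hypothetical, and it assumes the delimiter never occurs inside the text field and that the second field is a year, as it is in both examples.

```python
# Hypothetical helper, not part of the original script.
DELIMITER = "|||||"

def split_record(line):
    """Split one filtered line into its delimiter-separated fields."""
    return [field.strip() for field in line.split(DELIMITER)]

# Shortened version of the psych example line shown above.
example = (
    "good mental_health in the years ahead. "
    "||||| 1965 ||||| Pastoral Psychology ||||| Pastoral Psychology"
)

fields = split_record(example)
text = fields[0]
year = int(fields[1])  # assumption: the second field is always a year
print(year, fields[2:])
```

Keeping the fields as a plain list avoids committing to column names, since the meaning of the last two fields appears to differ between the two corpora.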

In [5]:
# Setup

import re
import os

# List of input files for different datasets
input_files = [
    "C:/Users/naomi/OneDrive/COMP80004_PhDResearch/RESEARCH/DATA/CORPORA/COHACOCA/coha.coca.cleaned2.mental",
    "C:/Users/naomi/OneDrive/COMP80004_PhDResearch/RESEARCH/DATA/CORPORA/Psychology/abstract_year_journal.csv.mental"
]

# Specify target terms for filtering
target_terms = ["mental_health", "mental_illness", "perception"]
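
The target terms are underscore-joined tokens, so it is worth checking how a case-insensitive, word-boundary regular expression treats them. The sketch below is illustrative only: `build_pattern` is a hypothetical helper and the sample strings are made up. The key point is that `_` counts as a word character for `\b`, so a term preceded by `_` or followed by more letters is not matched.

```python
import re

def build_pattern(term):
    """Compile a case-insensitive, whole-token pattern for one term."""
    return re.compile(r"\b" + re.escape(term) + r"\b", flags=re.IGNORECASE)

pattern = build_pattern("mental_health")

# Made-up sample strings mapped to whether the pattern should match.
samples = {
    "good Mental_Health outcomes": True,   # case is ignored
    "the state of mental_health.": True,   # punctuation ends the token
    "mental_healthcare costs": False,      # letters continue the token
    "mental health services": False,       # space instead of underscore
    "non_mental_health programs": False,   # "_" is a word character
}

for text, expected in samples.items():
    matched = bool(pattern.search(text))
    print(f"{matched!s:5}  {text}")
```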

In [6]:
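def filter_lines(input_file, target_terms, corpus):
    """Write every line of input_file that contains a target term to
    output/<target_term>.lines.<corpus>, one output file per term."""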
    output_dir = "output"
    os.makedirs(output_dir, exist_ok=True)
    
    print(f"Processing {input_file} for target terms: {target_terms} and corpus: {corpus}")
    
    for target_term in target_terms:
        output_file = os.path.join(output_dir, f"{target_term}.lines.{corpus}")

        # Create a regular expression pattern to match the target term (case-insensitive)
        pattern = re.compile(r'\b' + re.escape(target_term) + r'\b', flags=re.IGNORECASE)

        # Initialize lines_written counter
        lines_written = 0

        # Open input and output files
        with open(input_file, "r", encoding="utf-8") as infile, open(output_file, "w", encoding="utf-8") as outfile:
            # Iterate over each line in the input file
            for line in infile:
                if pattern.search(line):
                    outfile.write(line)  # Write the line to the output file
                    lines_written += 1  # Increment the lines_written counter
            
            # Print summary of lines written for the current target term
            print(f"    Lines containing '{target_term}' written to: {output_file} ({lines_written} lines)")
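
`filter_lines()` re-reads the input file once per target term. For large corpora, a single-pass variant may be worth considering. The sketch below is not the original method: `filter_lines_single_pass` and its `output_dir` parameter are hypothetical, and it assumes the same output naming scheme (`<term>.lines.<corpus>`). It opens all output files together with `contextlib.ExitStack` and returns the per-term line counts instead of printing them.

```python
import os
import re
from contextlib import ExitStack

def filter_lines_single_pass(input_file, target_terms, corpus, output_dir="output"):
    """Hypothetical variant of filter_lines() that reads the input only once."""
    os.makedirs(output_dir, exist_ok=True)

    # One compiled whole-token, case-insensitive pattern per term.
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", flags=re.IGNORECASE)
        for term in target_terms
    }
    counts = {term: 0 for term in target_terms}

    with ExitStack() as stack:
        infile = stack.enter_context(open(input_file, "r", encoding="utf-8"))
        # Open one output file per term; ExitStack closes them all on exit.
        outfiles = {
            term: stack.enter_context(
                open(os.path.join(output_dir, f"{term}.lines.{corpus}"),
                     "w", encoding="utf-8")
            )
            for term in target_terms
        }
        for line in infile:
            for term, pattern in patterns.items():
                if pattern.search(line):
                    outfiles[term].write(line)
                    counts[term] += 1

    return counts
```

A line that contains several target terms is written to each matching output file, which mirrors what the per-term loop in `filter_lines()` does.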

In [7]:
# Process each input file for different target terms and corpora
for input_file in input_files:
    if 'COHACOCA' in input_file:
        corpus = 'cohacoca'
    elif 'Psychology' in input_file:
        corpus = 'psych'
    else:
        corpus = 'unknown'  # Handle unknown corpus
    
    # Call function
    filter_lines(input_file, target_terms, corpus)

print("Filtering completed.")

Processing C:/Users/naomi/OneDrive/COMP80004_PhDResearch/RESEARCH/DATA/CORPORA/COHACOCA/coha.coca.cleaned2.mental for target terms: ['mental_health', 'mental_illness', 'perception'] and corpus: cohacoca
    Lines containing 'mental_health' written to: output\mental_health.lines.cohacoca (3604 lines)
    Lines containing 'mental_illness' written to: output\mental_illness.lines.cohacoca (1721 lines)
    Lines containing 'perception' written to: output\perception.lines.cohacoca (11371 lines)
Processing C:/Users/naomi/OneDrive/COMP80004_PhDResearch/RESEARCH/DATA/CORPORA/Psychology/abstract_year_journal.csv.mental for target terms: ['mental_health', 'mental_illness', 'perception'] and corpus: psych
    Lines containing 'mental_health' written to: output\mental_health.lines.psych (26274 lines)
    Lines containing 'mental_illness' written to: output\mental_illness.lines.psych (3873 lines)
    Lines containing 'perception' written to: output\perception.lines.psych (38614 lines)
Filtering completed.