Authors: Naomi Baes and ChatGPT

# Tokenize corpus into sentences

Model: en_core_web_sm https://spacy.io/models/en
- SENTS_P	Sentence segmentation (precision) 0.92
- SENTS_R	Sentence segmentation (recall) 0.89
- SENTS_F	Sentence segmentation (F-score)	0.91

## Step 1: Create a test corpus to run the model/scripts on first (can be skipped, but not advised)

In [2]:
import pandas as pd

# Set pandas options to display full content
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.expand_frame_repr', False)

# File paths
input_file_path = "C:/Users/naomi/OneDrive/COMP80004_PhDResearch/RESEARCH/DATA/CORPORA/Psychology/abstract_year_journal.csv.mental"

# Load the input file using ' IIIIII ' as the delimiter (a multi-character separator requires engine='python')
df = pd.read_csv(input_file_path, sep=' IIIIII ', engine='python')

# Select the first 50 rows
test_set = df.head(50)

# Save the test set as a tab-separated file (the output/ directory must already exist)
test_set.to_csv('output/output_test_set.tsv', sep='\t', index=False)

## Step 2: Run spaCy model to tokenize corpus into sentences

Notes:
- The `ner`, `parser`, and `tagger` components are disabled because sentence splitting does not need them.
- The rule-based `sentencizer` is used because the text is well-structured and formal (academic papers). The `parser` is the better choice when sentence boundaries are ambiguous or the text is informal (e.g., social media, speech transcripts).
- The script also estimates where full stops are missing and inserts them to avoid errors, which substantially lengthens run-time.
- Ignore the spaCy lemmatizer warning about missing POS annotation; POS tags are not needed for sentence tokenization.
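
The sentence-splitting setup described in these notes can be sketched roughly as follows. This is not the contents of `step1_spacy_tokenizer_sentence.py`: it assumes a blank English pipeline (tokenizer only) in place of `en_core_web_sm` with components disabled, it leaves out the missing-full-stop estimation, and the function name `split_rows` and the sample row are hypothetical.

```python
import spacy

# Blank English pipeline: tokenizer only, so no ner/parser/tagger is loaded.
# Assumption: the real script loads en_core_web_sm with those components disabled.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based splitting on sentence-final punctuation


def split_rows(rows, batch_size=1000):
    """Yield (sentence, year, journal) for every sentence of every abstract.

    `rows` is an iterable of (abstract_text, year, journal) tuples.
    """
    pairs = ((text, (year, journal)) for text, year, journal in rows)
    # as_tuples=True carries the (year, journal) context alongside each Doc
    for doc, (year, journal) in nlp.pipe(pairs, as_tuples=True, batch_size=batch_size):
        for sent in doc.sents:
            sentence = sent.text.strip()
            if sentence:
                yield sentence, year, journal


# Illustrative input row built from two sentences shown in the output below
rows = [(
    "Habituation disappears after appropriate rest. "
    "It may be deepened by further stimulation after response has ceased.",
    1930,
    "Psychological Research",
)]
sentences = list(split_rows(rows))
for sentence, year, journal in sentences:
    print(f"{sentence}\t{year}\t{journal}")
```

Streaming the abstracts through `nlp.pipe` in batches is usually faster than calling `nlp` on one text at a time, which matters for a corpus of this size.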


In [28]:
%run step1_spacy_tokenizer_sentence.py # Output file: "..(etc)/CORPORA/Psychology/abstract_year_journal.csv.mental.sentence.tsv"

100%|██████████| 871343/871343 [2:50:36<00:00, 85.12it/s]   


First 10 tokenized rows:
1: The withdrawal response of the land snail helix albolabris disappears on appropriate repetition of the (mechanical) stimulus.	1930	Psychological Research
2: In the usual terminology, the snail becomes habituated to the stimulus.	1930	Psychological Research
3: The disappearance of the response cannot be due to fatigue, as this term is usually understood; for (a) a single stimulus may be sufficient to effect it, (b) more intense stimulation will, under conditions not quantitatively ascertained, remove the habituation, a fact difficult to reconcile with the ordinary conception of fatigue, (c) cases were observed where the habituation took longer to re-effect after the extraneous stimulus than before.	1930	Psychological Research
4: Habituation disappears after appropriate rest.	1930	Psychological Research
5: It may be deepened by further stimulation after response has ceased.	1930	Psychological Research
6: The hypothesis is put forward of a physiological state o

## Step 3: Clean

Aim:
- Remove sentences that (1) contain "|||||", (2) are empty, or (3) are malformed, and log them to "removed_lines_log.txt"
- Write two cleaned output files: one with journal information in column 3 and one without journal information
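
A minimal sketch of these cleaning rules, for illustration only. The notebook does not show how `step2_clean_corpus.py` defines "malformed"; here it is assumed to mean a line that does not have exactly three tab-separated fields or whose year field is not numeric. The function name `clean_lines` and the sample lines are hypothetical.

```python
MARKER = "|||||"  # sentences containing this marker are dropped


def clean_lines(lines):
    """Split tab-separated lines into kept rows (with/without journal) and removed lines."""
    kept_with_journal, kept_without_journal, removed = [], [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: "malformed" = wrong number of fields or a non-numeric year
        malformed = len(fields) != 3 or not fields[1].strip().isdigit()
        if malformed or MARKER in line or not fields[0].strip():
            removed.append(line)
            continue
        sentence, year, journal = (field.strip() for field in fields)
        kept_with_journal.append((sentence, year, journal))
        kept_without_journal.append((sentence, year))
    return kept_with_journal, kept_without_journal, removed


# Hypothetical sample lines covering each removal rule
lines = [
    "Habituation disappears after appropriate rest.\t1930\tPsychological Research\n",
    "Broken ||||| record\t1930\tPsychological Research\n",  # contains the marker
    "\t1930\tPsychological Research\n",                      # empty sentence
    "No year or journal here\n",                             # malformed
]
with_journal, without_journal, removed = clean_lines(lines)
print(f"Total lines removed: {len(removed)}")
```

Returning the removed lines separately makes it easy to write them to a log file and audit what was dropped.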

In [8]:
%run step2_clean_corpus.py # Main output file: "..(etc)/CORPORA/Psychology/abstract_year_journal.csv.mental.sentence-CLEAN.tsv"

Total lines removed: 881739
Cleaned corpus with journal_title written to C:/Users/naomi/OneDrive/COMP80004_PhDResearch/RESEARCH/DATA/CORPORA/Psychology/abstract_year_journal.csv.mental.sentence-CLEAN-journals.tsv
Cleaned corpus without journal_title written to C:/Users/naomi/OneDrive/COMP80004_PhDResearch/RESEARCH/DATA/CORPORA/Psychology/abstract_year_journal.csv.mental.sentence-CLEAN.tsv
Removed lines written to output/removed_lines_log.txt


In [10]:
import pandas as pd
df_cleaned = pd.read_csv("C:/Users/naomi/OneDrive/COMP80004_PhDResearch/RESEARCH/DATA/CORPORA/Psychology/abstract_year_journal.csv.mental.sentence-CLEAN.tsv", sep='\t', header=0)
print(df_cleaned.iloc[:, 1].unique())

[1930 1931 1932 1933 1934 1936 1937 1938 1939 1940 1941 1942 1943 1944
 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958
 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972
 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
 2015 2016 2017 2018 2019]


## Step 4: Inspect tokenized corpus

Aim: Get summary statistics for the cleaned corpus file.

Note: Change the input file in the script to switch between the TSV with journals and the TSV without journals.
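
The notebook does not list which statistics `step3_get_summary_stats.py` reports. As a hedged illustration, the sketch below computes a few plausible ones (sentence count, mean whitespace-token length, sentences per year) from a TSV with the `sentence` and `publication_year` columns shown in the header below. The function name `summarize` and the choice of statistics are assumptions.

```python
import csv
import io
from collections import Counter
from statistics import mean


def summarize(tsv_file):
    """Return basic counts for a TSV with 'sentence' and 'publication_year' columns."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    per_year = Counter()
    lengths = []
    for row in reader:
        per_year[int(row["publication_year"])] += 1
        lengths.append(len(row["sentence"].split()))  # whitespace tokens
    return {
        "n_sentences": len(lengths),
        "mean_tokens": mean(lengths) if lengths else 0.0,
        "sentences_per_year": dict(sorted(per_year.items())),
    }


# Small in-memory sample using two sentences from the corpus output
sample = io.StringIO(
    "sentence\tpublication_year\n"
    "Habituation disappears after appropriate rest.\t1930\n"
    "In the usual terminology, the snail becomes habituated to the stimulus.\t1930\n"
)
stats = summarize(sample)
print(stats)
```

For the real corpus, the same function can be given a file opened with `open(path, encoding="utf-8", newline="")` instead of the in-memory sample.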

In [6]:
%run step3_get_summary_stats.py

Header: ['sentence', 'publication_year']

First 100 lines of the file (printed as they appear):

The withdrawal response of the land snail helix albolabris disappears on appropriate repetition of the (mechanical) stimulus.	1930
In the usual terminology, the snail becomes habituated to the stimulus.	1930
The disappearance of the response cannot be due to fatigue, as this term is usually understood; for (a) a single stimulus may be sufficient to effect it, (b) more intense stimulation will, under conditions not quantitatively ascertained, remove the habituation, a fact difficult to reconcile with the ordinary conception of fatigue, (c) cases were observed where the habituation took longer to re-effect after the extraneous stimulus than before.	1930
Habituation disappears after appropriate rest.	1930
It may be deepened by further stimulation after response has ceased.	1930
The hypothesis is put forward of a physiological state or process tending to diminish action; such process would be i

## Step 5: Filter the corpus for lines containing the target terms

In [1]:
%run step4_get_sentences_target.py

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\naomi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Processing C:/Users/naomi/OneDrive/COMP80004_PhDResearch/RESEARCH/DATA/CORPORA/Psychology/abstract_year_journal.csv.mental.sentence-CLEAN.tsv for target term: mental_health in natural corpus
Lines containing 'mental_health' written to: output/natural_lines_targets\mental_health.lines.psych (43771 sentences)
Processing C:/Users/naomi/OneDrive/COMP80004_PhDResearch/RESEARCH/DATA/CORPORA/Psychology/abstract_year_journal.csv.mental.sentence-CLEAN.tsv for target term: mental_illness in natural corpus
Lines containing 'mental_illness' written to: output/natural_lines_targets\mental_illness.lines.psych (5806 sentences)
Processing C:/Users/naomi/OneDrive/COMP80004_PhDResearch/RESEARCH/DATA/CORPORA/Psychology/abstract_year_journal.csv.mental.sentence-CLEAN.tsv for target term: trauma in natural corpus
Lines containing 'trauma' written to: output/natural_lines_targets\trauma.lines.psych (20236 sentences)
Processing C:/Users/naomi/OneDrive/COMP80004_PhDResearch/RESEARCH/DATA/CORPORA/Psychology/ab
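
The filtering step can be sketched as below. This is a rough illustration, not the logic of `step4_get_sentences_target.py`: it assumes an underscore in a target such as `mental_health` stands for a space (or a literal underscore) between words, and that matching is case-insensitive on whole words. The names `target_pattern` and `filter_sentences` and the sample rows are hypothetical.

```python
import re


def target_pattern(target):
    """Compile a whole-word, case-insensitive pattern for a target term."""
    # Assumption: "_" in the target matches either a space or an underscore
    parts = [re.escape(part) for part in target.split("_")]
    return re.compile(r"\b" + "[ _]".join(parts) + r"\b", re.IGNORECASE)


def filter_sentences(rows, target):
    """Keep the (sentence, year) rows whose sentence mentions the target."""
    pattern = target_pattern(target)
    return [(sentence, year) for sentence, year in rows if pattern.search(sentence)]


# Hypothetical rows for illustration (not taken from the corpus)
rows = [
    ("Mental health services were evaluated.", 2001),
    ("Trauma exposure was assessed at baseline.", 2005),
    ("Participants were mentally healthy adults.", 2010),
]
mental_health_rows = filter_sentences(rows, "mental_health")
trauma_rows = filter_sentences(rows, "trauma")
print(len(mental_health_rows), len(trauma_rows))
```

Whole-word matching excludes derived forms such as "traumatic"; if the original script counts those as hits, the trailing `\b` would need to be dropped.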

In [None]:
# End of notebook