<a href="https://colab.research.google.com/github/jonhnlee/EDUC-5913-Programming-fundamentals-of-Python/blob/main/HW_Week12.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Investigating the Top 30 Most Frequent Words in Disfluency Research

In this exercise, I investigated the 30 most frequent words used in disfluency research.

## Methods

 I downloaded two datasets from Scopus: one containing all articles from the search query "disfluency AND linguistic OR language," and another containing "disfluency AND linguistic OR language AND stuttering OR patient OR pathology." Since my focus was on non-stuttering disfluency research, I excluded all stuttering-related articles by comparing the DOIs of each document in the two datasets and removing any overlapping entries. After this filtering process, I used spaCy to analyze the abstracts of the remaining non-stuttering disfluency articles and identified the 30 most frequently occurring words.

In [1]:
import pandas as pd
import spacy
from collections import Counter

In [27]:
# Import data sets
abstract_df = pd.read_excel('disfluency_abstract.xlsx')
stuttering_df = pd.read_excel('disfluency_abstract.xlsx')

In [29]:
# Strip whitespace in DOI column
abstract_df['DOI'] = abstract_df['DOI'].str.strip().str.lower()
stuttering_df['DOI'] = stuttering_df['DOI'].str.strip().str.lower()

# Identify DOIs to exclude
dois_to_exclude = set(stuttering_df['DOI'].dropna())  # Remove any NaN values

# Exclude abstract_df where DOI matches any in stuttering_df
df = abstract_df[~abstract_df['DOI'].isin(dois_to_exclude)]

In [30]:
# Combine the abstracts of all remaining articles
abstracts = df["Abstract"].dropna().tolist()  # Drop NaN values and convert to list
text = " ".join(abstracts)  # Combine all abstracts into a single string

In [44]:
# Preprossing
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)  # Process the combined text with spaCy
# Filter tokens
filtered_tokens = [
    token.lemma_.lower() for token in doc
    if not token.is_stop and not token.is_punct and token.lemma_.isalpha()
]

In [45]:
# Count word frequency
word_freq = Counter(filtered_tokens)

In [46]:
# Top 30 most frequent words
top_30 = word_freq.most_common(30)

# Results

In [47]:
print("Top 30 most frequent words and their frequencies:")
for word, freq in top_30:
    print(f"{word}: {freq}")

Top 30 most frequent words and their frequencies:
speech: 711
disfluency: 647
language: 379
model: 291
word: 227
system: 176
base: 172
paper: 169
spontaneous: 151
result: 148
speaker: 145
detection: 139
task: 133
feature: 128
spoken: 127
text: 119
study: 117
datum: 115
recognition: 113
approach: 111
corpus: 111
pause: 99
performance: 98
error: 93
utterance: 92
type: 92
sentence: 89
present: 87
analysis: 86
use: 84
