**This Jupyter Notebook processes speech transcripts from the ADReSS-2020 dataset (train and test sets) to analyze linguistic patterns in speech from participants with dementia (CD) versus healthy controls (CC). It performs exploratory data analysis (EDA), feature extraction, and preprocessing for machine learning. Some of these analyses are reused as handcrafted features in a companion Jupyter Notebook (see handcraft_models.ipynb).**

In [53]:
import os
import re
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import scipy.stats as stats
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter

# Speech Data Analysis 

### The following functions are used to conduct the analysis:

1. **Speech Extraction**: Speech utterances are extracted from `.cha` files, and annotations are cleaned.

2. **Sentence and Word Counting**: The number of sentences and words is calculated for each transcript.

3. **Data Summary**: A summary of the total sentences and words per file is displayed.

4. **Number of Sentences vs. Number of Words**: A scatter plot to visualize the relationship between sentence count and word count for each transcript.

5. **Distribution of Sentences and Words**: Histograms of the sentence counts and word counts across all files. A **normal distribution** is fitted to each histogram and overlaid as a curve. The curve reflects the fitted mean and standard deviation and makes it easier to judge visually how close the data are to a Gaussian distribution.

6. **remove_outliers:** Removes outliers using the IQR method.

7. **Word Cloud**: A word cloud generated from the combined speech data of a group to highlight its most frequent words.


In [21]:
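def extract_speech_from_cha(file_path):
    """Return the participant (*PAR:) utterances of a .cha file as a list of strings.

    Only lines that start with "*PAR:" are read; tab-indented continuation
    lines are not included.
    """
    speech_data = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            if line.startswith("*PAR:"):
                # Drop the \x15-delimited spans (presumably time-alignment marks), then trim
                clean_line = re.sub(r"\x15.*?\x15", "", line[5:]).strip()
                speech_data.append(clean_line)
    return speech_data

def count_sentences_and_words(speech_data):
    """Count sentences (non-empty pieces split on . ! ?) and whitespace-separated words.

    Standalone punctuation and annotation tokens count as words.
    """
    num_sentences = 0
    num_words = 0
    for speech in speech_data:
        sentences = re.split(r'[.!?]', speech)
        num_sentences += len([s for s in sentences if s.strip()])
        num_words += len(speech.split())
    return num_sentences, num_words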

def extract_and_summarize_speech_data(folder_path, outliers=True):
    summary_data = []
    all_speech_data = []

    for filename in os.listdir(folder_path):
        if filename.endswith(".cha"):
            file_path = os.path.join(folder_path, filename)
            speech_data = extract_speech_from_cha(file_path)
            num_sentences, num_words = count_sentences_and_words(speech_data)
            summary_data.append([filename, num_sentences, num_words])
            all_speech_data.extend(speech_data)

    summary_df = pd.DataFrame(summary_data, columns=["File Name", "Num-Sentences", "Num-Words"])

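    # Drop files with outlying Num-Sentences or Num-Words when outliers=False.
    # all_speech_data is not filtered, so it still includes text from dropped files.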
    if not outliers:
        summary_df = remove_outliers(summary_df, "Num-Sentences")
        summary_df = remove_outliers(summary_df, "Num-Words")

    return summary_df, all_speech_data

def remove_outliers(df, column_name, multiplier=1.5):
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - multiplier * IQR
    upper_bound = Q3 + multiplier * IQR

    # Keep only rows whose value lies within [lower_bound, upper_bound]
    return df[(df[column_name] >= lower_bound) & (df[column_name] <= upper_bound)]

def visualize_data_with_fitted_line(summary_df, all_speech_data):
    fig1 = px.scatter(summary_df, x="Num-Sentences", y="Num-Words", 
                      title="Number of Sentences vs Number of Words",
                      labels={"Num-Sentences": "Number of Sentences", "Num-Words": "Number of Words"},
                      hover_data=["File Name"])
    fig1.show()

    num_sentences = summary_df["Num-Sentences"]
    mu, std = stats.norm.fit(num_sentences)
    xmin, xmax = np.min(num_sentences), np.max(num_sentences)
    x = np.linspace(xmin, xmax, 100)
    p = stats.norm.pdf(x, mu, std)
    
    hist_data_sentences = go.Histogram(
        x=num_sentences, 
        nbinsx=30, 
        histnorm="probability density", 
        name="Histogram of Sentences",
        opacity=0.75,
        marker=dict(color="skyblue")
    )
    
    fitted_line_sentences = go.Scatter(
        x=x, 
        y=p, 
        mode="lines", 
        name="Fitted Normal Distribution (Sentences)", 
        line=dict(color="red", width=2)
    )
    
    fig2 = go.Figure(data=[hist_data_sentences, fitted_line_sentences])
    fig2.update_layout(
        title="Distribution of Number of Sentences per File with Fitted Line",
        xaxis_title="Number of Sentences",
        yaxis_title="Density",
        showlegend=True
    )
    fig2.show()

    num_words = summary_df["Num-Words"]
    mu, std = stats.norm.fit(num_words)
    xmin, xmax = np.min(num_words), np.max(num_words)
    x = np.linspace(xmin, xmax, 100)
    p = stats.norm.pdf(x, mu, std)
    
    hist_data_words = go.Histogram(
        x=num_words, 
        nbinsx=30, 
        histnorm="probability density", 
        name="Histogram of Words",
        opacity=0.75,
        marker=dict(color="lightgreen")
    )
    
    fitted_line_words = go.Scatter(
        x=x, 
        y=p, 
        mode="lines", 
        name="Fitted Normal Distribution (Words)", 
        line=dict(color="red", width=2)
    )
    
    fig3 = go.Figure(data=[hist_data_words, fitted_line_words])
    fig3.update_layout(
        title="Distribution of Number of Words per File with Fitted Line",
        xaxis_title="Number of Words",
        yaxis_title="Density",
        showlegend=True
    )
    fig3.show()

    all_speech_text = " ".join(all_speech_data)
    wordcloud = WordCloud(width=800, height=400, background_color="white").generate(all_speech_text)

    fig4 = go.Figure()
    fig4.add_trace(go.Image(
        z=wordcloud.to_array()
    ))
    fig4.update_layout(
        title="Word Cloud of Speech Data",
        xaxis=dict(showgrid=False, zeroline=False),
        yaxis=dict(showgrid=False, zeroline=False)
    )
    fig4.show()


##  Healthy (CC)

In [22]:
folder_path = "ADReSS-IS2020-data/train/transcription/cc"

# Pass outliers=False to remove outliers with the IQR rule
summary_df_cc, all_speech_data_cc = extract_and_summarize_speech_data(folder_path, outliers=True)

summary_df_cc.head()

| | File Name | Num-Sentences | Num-Words |
|---|---|---|---|
| 0 | S001.cha | 15 | 164 |
| 1 | S002.cha | 13 | 103 |
| 2 | S003.cha | 30 | 160 |
| 3 | S004.cha | 40 | 280 |
| 4 | S005.cha | 15 | 131 |


In [24]:
visualize_data_with_fitted_line(summary_df_cc, all_speech_data_cc)

**In the CC group, most files contain between 5 and 25 sentences and 50 to 200 words, or roughly 5–12 words per sentence. Sentence counts peak around 12–15, and few files exceed 30 sentences; the distribution is approximately normal with a slight right skew. Word counts show the same approximately normal, slightly right-skewed shape. Two outliers have unusually high sentence and word counts.**
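
The "approximately normal with a slight right skew" reading above comes from the plots. A minimal sketch of how it could be checked numerically, assuming sample skewness and a Shapiro-Wilk test are acceptable diagnostics. The helper `describe_distribution` and the toy counts are hypothetical and are not taken from the dataset.

```python
import numpy as np
from scipy import stats

def describe_distribution(values):
    """Return fitted mean/std, sample skewness, and the Shapiro-Wilk p-value."""
    values = np.asarray(values, dtype=float)
    mu, sigma = stats.norm.fit(values)    # same fit as in the plotting function
    skewness = stats.skew(values)         # > 0 indicates a longer right tail
    _, p_value = stats.shapiro(values)    # a small p-value suggests non-normality
    return {"mean": float(mu), "std": float(sigma),
            "skew": float(skewness), "shapiro_p": float(p_value)}

# Hypothetical sentence counts for illustration only (not from ADReSS)
toy_counts = [8, 10, 11, 12, 12, 13, 14, 15, 15, 16, 18, 21, 40]
summary = describe_distribution(toy_counts)
print(summary)
```

On the real data, the same helper could be applied to `summary_df_cc["Num-Sentences"]` and `summary_df_cc["Num-Words"]`.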

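**The most frequent tokens among healthy controls include gram, exc, uh, sink, water, cookie, mother, stool, um, and boy. Content words such as sink, water, cookie, mother, stool, and boy are typical when describing the Cookie Theft picture, indicating a focused and relevant narrative. The tokens uh and um are fillers, while gram and exc are most likely residue of CHAT annotation codes (for example `[+ gram]` and `[+ exc]`) that the cleaning step does not remove, rather than spoken words.**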
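
Tokens such as gram and exc in the word cloud suggest that some annotation codes survive cleaning. A minimal sketch of one way to strip them before building the word cloud, assuming the codes appear in square brackets (for example `[+ gram]`). The helper `strip_bracket_codes` and the example utterance are hypothetical.

```python
import re

# Assumption: annotation codes are enclosed in square brackets, e.g. "[+ gram]"
BRACKET_CODE = re.compile(r"\[[^\]]*\]")

def strip_bracket_codes(utterance):
    """Remove bracketed annotation codes and collapse the leftover whitespace."""
    without_codes = BRACKET_CODE.sub(" ", utterance)
    return re.sub(r"\s+", " ", without_codes).strip()

# Hypothetical utterance for illustration only
example = "the boy is on the stool [+ gram] . and the water [+ exc] ."
cleaned = strip_bracket_codes(example)
print(cleaned)
```

Applying such a step to each utterance before joining them would keep the word cloud focused on spoken words. Note that it would also change the word counts reported earlier.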

## Dementia (CD)

In [18]:
folder_path = "ADReSS-IS2020-data/train/transcription/cd"

# Pass outliers=False to remove outliers with the IQR rule
summary_df_cd, all_speech_data_cd = extract_and_summarize_speech_data(folder_path, outliers=True)

summary_df_cd.head()

| | File Name | Num-Sentences | Num-Words |
|---|---|---|---|
| 0 | S079.cha | 13 | 153 |
| 1 | S080.cha | 12 | 48 |
| 2 | S081.cha | 19 | 156 |
| 3 | S082.cha | 32 | 254 |
| 4 | S083.cha | 19 | 112 |


In [20]:
visualize_data_with_fitted_line(summary_df_cd, all_speech_data_cd)

**The Dementia (CD) group shows a broad range in both sentence and word counts. Most files contain 15–40 sentences and 50–200 words, or roughly 4–10 words per sentence. A single strong outlier (over 120 sentences and 500 words) stretches the axes and flattens the fitted normal curve, so the rest of the data appears compressed. This participant may have spoken far more than the others, or their speech may not have been clearly divided into sentences.**
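
The effect of a single extreme file can be illustrated with the same IQR rule that `remove_outliers` applies. A minimal sketch on toy data: the helper `iqr_bounds`, the file names, and the counts are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def iqr_bounds(series, multiplier=1.5):
    """Return the lower and upper IQR fences for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

# Hypothetical sentence counts with one extreme file (illustration only)
toy = pd.DataFrame({
    "File Name": [f"toy_{i}.cha" for i in range(8)],
    "Num-Sentences": [12, 15, 19, 19, 22, 25, 32, 125],
})

lower, upper = iqr_bounds(toy["Num-Sentences"])
outside = (toy["Num-Sentences"] < lower) | (toy["Num-Sentences"] > upper)
flagged = toy[outside]
print(flagged)
```

Here only the file with 125 sentences falls outside the fences. On the real data, passing `outliers=False` to `extract_and_summarize_speech_data` applies the same rule to both count columns.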

**In the word cloud, frequent tokens include gram, exc, uh, sink, water, dishes, cookie, mother, know, jar, see, and boy. Many of these also appear for healthy controls, but additional frequent words such as know, jar, and see suggest either more descriptive or potentially more repetitive language.**

# Analysis of Fillers and Vague Words

The word clouds show that fillers such as uh and um are frequent in the participants' speech. Analyzing fillers can show whether the groups differ significantly, which would make them candidate features for our machine learning models.

### The following functions are used to conduct the analysis of fillers and vague words in patient speech:

1. **extract_speech_pause_from_cha:** Extracts participant speech along with the number of pauses (occurrences of "..." in each utterance line) in the .cha file.

2. **process_group_data:** Processes multiple .cha files in a folder, extracting speech data and counting fillers and words to build a summary.

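3. **analyze_speech_data:** Analyzes the extracted speech data by calculating the Type-Token Ratio (TTR) for fillers (distinct fillers divided by total filler occurrences), an approximate clause count per sentence (commas plus occurrences of " and "), and counts of vague words.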

4. **collect_pauses:** Collects pause data from multiple .cha files in a folder and returns it as a DataFrame for further analysis.

5. **scatter_sentence_complexity:** Plots sentence length (in words) against the estimated number of clauses per sentence as a scatter plot and returns the underlying DataFrame.
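
Counting the substring "..." treats every occurrence as a pause. A minimal sketch of a finer-grained alternative, assuming the transcripts follow the CHAT convention of marking untimed pauses as `(.)`, `(..)`, and `(...)`. Whether the ADReSS files use these markers consistently is an assumption here, and `count_pauses_by_type` is a hypothetical helper.

```python
import re

# Assumed CHAT pause notation: (.) short, (..) medium, (...) long
PAUSE_PATTERNS = {
    "short": re.compile(r"\(\.\)"),
    "medium": re.compile(r"\(\.\.\)"),
    "long": re.compile(r"\(\.\.\.\)"),
}

def count_pauses_by_type(utterance):
    """Count each pause marker separately in one utterance."""
    return {name: len(pattern.findall(utterance))
            for name, pattern in PAUSE_PATTERNS.items()}

# Hypothetical utterance; "+..." marks trailing off and is not a pause marker
example = "the boy (.) is taking (...) cookies (..) and +..."
counts = count_pauses_by_type(example)
print(counts)
print(example.count("..."))  # the substring count also picks up "+..."
```

The plain substring count sees two matches in this example, one of which is the trailing-off marker, while the pattern-based count separates the three pause lengths.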

In [42]:
train_cc = "ADReSS-IS2020-data/train/transcription/cc"
train_cd = "ADReSS-IS2020-data/train/transcription/cd"

In [54]:
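# Common fillers and vague words.
# Note: process_group_data matches single tokens only, so the two-word
# fillers "you know" and "i mean" are never counted by it.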
FILLERS = {"uh", "um", "erm", "er", "like", "you know", "i mean"}
VAGUE_WORDS = {"thing", "stuff", "that", "it", "this", "something"}


def extract_speech_pause_from_cha(file_path):
    """Extracts both clean speech and number of pauses from a .cha file."""
    speech_data = []
    pauses = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            if line.startswith("*PAR:"):
                clean_line = re.sub(r"\x15.*?\x15", "", line[5:]).strip()
                if clean_line:
                    speech_data.append(clean_line)
                    pause_count = line.count("...")
                    pauses.append(pause_count)
    return speech_data, pauses

def process_group_data(folder_path):
    all_sentences = []
    filler_counter = Counter()
    word_counter = Counter()

    for filename in os.listdir(folder_path):
        if filename.endswith(".cha"):
            file_path = os.path.join(folder_path, filename)
            speech_data = extract_speech_from_cha(file_path)
            all_sentences.extend(speech_data)

            for sentence in speech_data:
                words = re.findall(r"\b\w+\b", sentence.lower())
                word_counter.update(words)
                filler_counter.update([word for word in words if word in FILLERS])

    return all_sentences, filler_counter, word_counter

def analyze_speech_data(all_sentences, filler_counter, word_counter):
    total_fillers = sum(filler_counter.values())
    unique_fillers = len(filler_counter)
    ttr_fillers = unique_fillers / total_fillers if total_fillers else 0

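    # Approximate clause boundaries by commas and " and ". Unlike count_clauses in
    # scatter_sentence_complexity, this omits " but " and the +1 for the main clause.
    num_clauses = sum(sentence.count(",") + sentence.count(" and ") for sentence in all_sentences)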
    avg_clauses_per_sentence = num_clauses / len(all_sentences) if all_sentences else 0

    vague_word_counts = {word: word_counter.get(word, 0) for word in VAGUE_WORDS}

    print(f"Type-Token Ratio (TTR) for Fillers: {ttr_fillers:.3f}")
    print(f"Average Clauses per Sentence: {avg_clauses_per_sentence:.2f}")
    print("Usage of Vague Words:", vague_word_counts)

    return {
        "TTR": ttr_fillers,
        "AvgClausesPerSentence": avg_clauses_per_sentence,
        "VagueWords": vague_word_counts
    }

def collect_pauses(folder_path, group_label="Group"):
    pauses = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".cha"):
            file_path = os.path.join(folder_path, filename)
            _, file_pauses = extract_speech_pause_from_cha(file_path)
            pauses.extend(file_pauses)

    return pd.DataFrame({
        "Pauses": pauses,
        "Group": [group_label] * len(pauses)
    })

def scatter_sentence_complexity(folder_path):
    sentence_lengths = []
    num_clauses = []

    def count_clauses(sentence):
        return sentence.count(",") + sentence.count(" and ") + sentence.count(" but ") + 1

    for filename in os.listdir(folder_path):
        if filename.endswith(".cha"):
            file_path = os.path.join(folder_path, filename)
            speech_data = extract_speech_from_cha(file_path)

            for sentence in speech_data:
                sentence_lengths.append(len(sentence.split()))
                num_clauses.append(count_clauses(sentence))

    df_complexity = pd.DataFrame({
        "Sentence Length": sentence_lengths,
        "Clauses": num_clauses
    })

    fig = px.scatter(df_complexity, x="Sentence Length", y="Clauses", opacity=0.6,
                     title="Sentence Complexity vs. Length")
    fig.show()
    return df_complexity
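
Because `process_group_data` compares single tokens against `FILLERS`, the two-word entries cannot match. A minimal sketch of one way to count multi-word fillers as well, using a single regular expression built from the same set. The helper `count_fillers` and the example sentence are hypothetical, and this is not the method used for the results in this notebook.

```python
import re
from collections import Counter

FILLERS = {"uh", "um", "erm", "er", "like", "you know", "i mean"}

# Try longer phrases first so multi-word fillers are matched as a unit
_alternatives = sorted(FILLERS, key=len, reverse=True)
FILLER_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in _alternatives) + r")\b"
)

def count_fillers(sentence):
    """Count single- and multi-word fillers in one sentence (case-insensitive)."""
    return Counter(FILLER_PATTERN.findall(sentence.lower()))

# Hypothetical sentence for illustration only
example = "um you know the boy is uh I mean he is like falling"
filler_counts = count_fillers(example)
print(filler_counts)
```

The word boundaries keep short fillers such as "er" from matching inside longer words like "mother". As in the original set, "like" is counted whether or not it is used as a filler.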

In [55]:
# Get sentence/filler/word info
all_speech_data_cc, filler_counts_cc, word_counts_cc = process_group_data(train_cc)
all_speech_data_cd, filler_counts_cd, word_counts_cd = process_group_data(train_cd)

# Analyze
cc_analysis = analyze_speech_data(all_speech_data_cc, filler_counts_cc, word_counts_cc)
cd_analysis = analyze_speech_data(all_speech_data_cd, filler_counts_cd, word_counts_cd)

print("CC Analysis:", cc_analysis)
print("CD Analysis:", cd_analysis)

# Violin Plot of Pauses
df_pauses = pd.concat([
    collect_pauses(train_cd, "Dementia"),
    collect_pauses(train_cc, "Healthy")
])
fig_violin = px.violin(df_pauses, x="Group", y="Pauses", box=True, points="all",
                       title="Hesitation & Pauses in Speech")
fig_violin.show()


Type-Token Ratio (TTR) for Fillers: 0.016
Average Clauses per Sentence: 0.16
Usage of Vague Words: {'it': 55, 'something': 4, 'stuff': 1, 'this': 6, 'that': 79, 'thing': 3}
Type-Token Ratio (TTR) for Fillers: 0.020
Average Clauses per Sentence: 0.13
Usage of Vague Words: {'it': 92, 'something': 9, 'stuff': 0, 'this': 48, 'that': 90, 'thing': 6}
CC Analysis: {'TTR': 0.015544041450777202, 'AvgClausesPerSentence': 0.1554054054054054, 'VagueWords': {'it': 55, 'something': 4, 'stuff': 1, 'this': 6, 'that': 79, 'thing': 3}}
CD Analysis: {'TTR': 0.01990049751243781, 'AvgClausesPerSentence': 0.13031914893617022, 'VagueWords': {'it': 92, 'something': 9, 'stuff': 0, 'this': 48, 'that': 90, 'thing': 6}}
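
The vague-word figures above are raw counts, so they partly reflect how much each group spoke. A minimal sketch of a rate per 1,000 tokens, which makes groups of different size comparable. The helper `vague_rate_per_1000` and the toy sentences are hypothetical and are not taken from the dataset.

```python
import re
from collections import Counter

VAGUE_WORDS = {"thing", "stuff", "that", "it", "this", "something"}

def vague_rate_per_1000(sentences):
    """Return occurrences of each vague word per 1,000 tokens."""
    tokens = [w for s in sentences for w in re.findall(r"\b\w+\b", s.lower())]
    if not tokens:
        return {word: 0.0 for word in sorted(VAGUE_WORDS)}
    counts = Counter(tokens)
    return {word: 1000 * counts[word] / len(tokens) for word in sorted(VAGUE_WORDS)}

# Hypothetical sentences for illustration only (13 tokens in total)
toy_sentences = ["the boy is taking it", "that is a cookie jar", "it is falling"]
rates = vague_rate_per_1000(toy_sentences)
print(rates)
```

On the real data, the function could be called with `all_speech_data_cc` and `all_speech_data_cd` to compare rates instead of raw counts.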


### Scatter sentence complexity for CC

In [56]:
scatter_sentence_complexity(train_cc)

| | Sentence Length | Clauses |
|---|---|---|
| 0 | 11 | 1 |
| 1 | 6 | 1 |
| 2 | 13 | 1 |
| 3 | 10 | 1 |
| 4 | 13 | 2 |
| ... | ... | ... |
| 735 | 12 | 1 |
| 736 | 14 | 3 |
| 737 | 14 | 4 |
| 738 | 14 | 2 |


### Scatter sentence complexity for CD

In [57]:
scatter_sentence_complexity(train_cd) 

| | Sentence Length | Clauses |
|---|---|---|
| 0 | 2 | 1 |
| 1 | 11 | 1 |
| 2 | 11 | 1 |
| 3 | 12 | 1 |
| 4 | 12 | 2 |
| ... | ... | ... |
| 747 | 13 | 1 |
| 748 | 6 | 1 |
| 749 | 13 | 2 |
| 750 | 8 | 1 |


**The CC group uses slightly more clause connectors per sentence (0.16 vs. 0.13 on average) and fewer vague words than the CD group. The CD group, while using similar sentence structures, relies more on specific vague words, most notably "this" (48 vs. 6) and "it" (92 vs. 55) and to a lesser extent "that" (90 vs. 79), and has a slightly higher filler TTR (0.020 vs. 0.016). The vague-word figures are raw counts and are not normalized by the amount of speech in each group.**

**However, the observed differences between the groups are small. We will carry these features into the next Jupyter notebook to determine whether they help classify the groups.**
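
Before using such features in a classifier, a simple significance test can indicate whether a small group difference is more than noise. A minimal sketch using a two-sided Mann-Whitney U test, chosen here as an assumption because count features are unlikely to be normally distributed. The helper `compare_groups` and the toy values are hypothetical and are not taken from the dataset.

```python
from scipy import stats

def compare_groups(values_a, values_b):
    """Two-sided Mann-Whitney U test; returns (U statistic, p-value)."""
    result = stats.mannwhitneyu(values_a, values_b, alternative="two-sided")
    return float(result.statistic), float(result.pvalue)

# Hypothetical per-sentence clause counts for illustration only
toy_cc = [1, 1, 2, 1, 3, 2, 1, 4, 2, 1]
toy_cd = [1, 1, 1, 2, 1, 1, 2, 1, 1, 3]

u_stat, p_value = compare_groups(toy_cc, toy_cd)
print(f"U = {u_stat:.1f}, p = {p_value:.3f}")
```

On the real data, the `Clauses` columns returned by `scatter_sentence_complexity(train_cc)` and `scatter_sentence_complexity(train_cd)` could be passed in directly.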