Semantic Similarity Linguistic Analysis

Jessica Boyle


# Introduction

Classroom communication significantly influences students' academic performance (Thatcher et al., 2008). While existing research often concentrates on students' language development, effective communication encompasses the creation of shared meaning between teachers and students. Examining teachers' language use in delivering instructional messages becomes crucial in this context. Clarity in teacher expression involves the use of concrete, explicit, and fluent language, incorporating consistent vocabulary and content-related words (Brophy, 1988; Hollow & Wehby, 2017). Research suggests that unclear language negatively impacts student achievement, particularly for those with lower language proficiency (Ernst-Slavit & Mason, 2011). Clear language use benefits all students, with particular advantages for those struggling with academic and behavioral demands (Archer & Hughes, 2011). Additionally, teachers employing appropriate sequencing and explicit instruction break down complex skills into smaller units, addressing cognitive overload and considering students' working memory.

Critiques of research on instructional talk clarity highlight poorly defined constructs and high-inference measures, often requiring observers to make binary judgments (Titsworth et al., 2015). Leveraging natural language processing tools may provide a more robust and objective approach to studying teachers' spoken language.

This analysis aims to assess the cohesiveness of teacher talk during classroom instruction to determine how it relates to a commonly used classroom observational measure of teachers' instructional quality. The data include classroom transcripts of teacher and student talk and the Classroom Assessment Scoring System (CLASS) observation scores (Demszky & Hill, 2022). The CLASS is an observational measurement tool that assesses the quality of teachers’ social and instructional interactions with students as well as the intentionality and productivity evident in classroom settings (Pianta et al., 2008). spaCy is used to calculate the semantic similarity between adjacent teacher turns in each transcript. An average semantic similarity score is then calculated for each observation to provide an overall semantic similarity score. Each transcript is linked to its corresponding instructional dialogue, instructional support, and overall quality ratings obtained from the CLASS observations. This analysis examines the semantic similarity between teachers' adjacent utterances and its relation to the quality ratings for instructional dialogue, overall instructional support, and the overall CLASS score.

**The research question that guides this study is:**

How does the semantic similarity of teachers' adjacent talk turns/utterances relate to their quality ratings for instructional dialogue, as scored in Classroom Assessment Scoring System (CLASS) observations?




# Method
## Dataset
The open-source National Center for Teacher Effectiveness (NCTE) transcript dataset includes anonymized transcripts from teachers’ classroom observations from the NCTE Main Study [Kane]. The observations occurred between 2010 and 2013 in 4th- and 5th-grade elementary math classrooms across four districts, predominantly serving historically marginalized students. The transcripts are linked with a variety of outcome variables including classroom observation scores, demographic information, survey responses, and student scores. The dataset can be found at: https://github.com/ddemszky/classroom-transcript-analysis. This analysis focused on the classroom transcripts and the linked Classroom Assessment Scoring System (CLASS) data.

<ul type="none">
  <li>Classroom Transcripts</li>
    <ul>
     <li>This analysis includes 1325 observation transcripts from 301 teachers, each with an average of 4 transcripts. Classroom lessons were captured using three cameras, a lapel microphone for teacher talk, and a bidirectional microphone for student talk. The recordings were transcribed by professional transcribers working under contract for a commercial transcription company.
     
The transcripts were organized by speaker turns (teacher, students, multiple students), where each row of the transcript data frame represents a speaker turn or utterance that may contain one or more speech acts or "sentences". In this analysis, student talk was removed and only teacher turns were analyzed. On average, a transcript contains 5,733 words, 87.7% of which are spoken by the teacher, and 172 teacher utterances.
     
The transcripts are fully anonymized: student and teacher names are replaced with terms like “Student J”, “Teacher” or “Mrs. H”. Transcribers used square brackets to indicate when speech was [Inaudible], if they were unsure of a particular word due to audio quality, or to include meta-data such as [laughter], [students putting away materials]. All bracketed information was removed from the transcripts for this analysis.  </li>
     </ul>
</ul>

<ul type="none">
  <li>Classroom Assessment Scoring System (CLASS) </li>
    <ul>
      <li> The CLASS includes 3 domains and 11 sub-dimensions measuring teacher-student interactions. This analysis includes the overall CLASS score, the Instructional Support domain, and the Instructional Dialogue dimension. Observers score each dimension using a 7-point scale.
      </ul>
      </ul>

The **overall CLASS score** represents the teachers' average ability across the 3 domains and 11 dimensions. The Emotional Support domain includes the Positive Climate, Teacher Sensitivity, and Regard for Student Perspectives dimensions; the Classroom Organization domain includes the Behavior Management, Productivity, and Negative Climate dimensions; and the Instructional Support domain includes the Instructional Learning Format, Content Understanding, Analysis and Inquiry, Quality Feedback, and Instructional Dialogue dimensions.

The **Instructional Support domain** measures the teachers' ability to enhance learning through consistent, process-oriented feedback, a focus on higher-order thinking skills, and presentation of new content within a broader, meaningful context. The Instructional Support domain score is calculated by averaging the scores of its 5 dimensions.

The **Instructional Dialogue dimension** is defined as the purposeful use of cumulative content-focused discussion among teachers and students. It measures whether teachers actively support students in connecting ideas and fostering a deeper understanding of the content. Lower scores (1,2) are assigned when there are minimal or no discussions in the classroom, and when the teacher seldom acknowledges, repeats, or extends on comments. Mid-range scores (3,4,5) are given when discussions occur, but they are brief or shift rapidly between topics without subsequent questions or comments. Higher scores (6,7) indicate the presence of frequent, content-driven discussions between teachers and students, fostering cumulative exchanges where teachers actively promote elaborate dialogue through open-ended questions and repetitions.
</li>
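The scoring rules described above can be illustrated with a small sketch. The dimension scores below are hypothetical values, not data from the study, and the cut-offs used for non-integer scores (for example, treating anything below 3 as low) are an assumption made only for this illustration.

```python
from statistics import mean

# Hypothetical dimension scores for one observation (1-7 scale); not real data.
instructional_support_dimensions = {
    "instructional_learning_format": 4.0,
    "content_understanding": 3.5,
    "analysis_and_inquiry": 3.0,
    "quality_feedback": 4.5,
    "instructional_dialogue": 5.0,
}

def domain_score(dimension_scores):
    """Average the dimension scores that make up a domain."""
    return mean(dimension_scores.values())

def score_band(score):
    """Map a 1-7 score to the low (1-2), mid (3-5), or high (6-7) range.

    Treating fractional scores below 3 as low and below 6 as mid
    is an assumption for this sketch.
    """
    if score < 3:
        return "low"
    if score < 6:
        return "mid"
    return "high"

support = domain_score(instructional_support_dimensions)
dialogue_band = score_band(instructional_support_dimensions["instructional_dialogue"])
print(support, dialogue_band)
```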

## Linguistic Analysis
The analysis of semantic similarity aimed to assess the cohesion of teachers’ discourse by calculating the similarity between their adjacent utterances. Two distinct analyses were conducted to provide a richer understanding of semantic similarity in teacher discourse: one considering all words and the other focusing solely on content words.

For the analysis that included all the words, punctuation was removed from the original NCTE transcripts. This step was critical due to the arbitrary nature of punctuation in the transcription of spoken language, where people often express themselves without strictly adhering to conventional written sentence structures, especially in a context such as a classroom. Eliminating punctuation ensured the analysis focused on the teachers’ words and the semantic similarity scores were not impacted by punctuation. For the analysis that included only content words, the Natural Language Toolkit (NLTK) was used to remove stop words, a predetermined set of commonly used words (e.g., a, the, is) that carry minimal useful information. By filtering out the stop words, the analysis isolated essential content words, providing a focused examination of the words carrying the primary message in the teachers’ utterances.

Both analyses used spaCy’s large English model to calculate the semantic similarity scores. The model represents words as dense vectors in a continuous vector space using pre-trained static word embeddings (Word2Vec-style), which capture semantic relationships based on contextual usage in a large corpus. When calculating semantic similarity, spaCy uses these pre-trained word embeddings to measure the similarity between individual words, sentences, or documents. For this analysis, spaCy’s ‘doc.similarity’ function treated each teacher utterance as a document, represented by default as the average of its word vectors, and compared documents using cosine similarity. Each utterance, whether comprising one word, one statement, or multiple statements, was compared to the adjacent utterance. The resulting output was a floating-point number, in practice mostly between 0 and 1 (cosine similarity can in principle be negative), indicating the degree of similarity between the two utterances.
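To make the similarity computation concrete, the following is a minimal sketch of the idea behind comparing two utterances: average the word vectors of each utterance, then take the cosine of the angle between the two averages. The three-dimensional `toy_vectors` and the helper names are hypothetical; the real model uses much larger pre-trained vectors, and this sketch is not spaCy's implementation.

```python
import math

# Toy 3-dimensional "word vectors" (hypothetical values for illustration only).
toy_vectors = {
    "draw": [1.0, 0.0, 0.0],
    "picture": [0.8, 0.6, 0.0],
    "gallon": [0.0, 1.0, 0.0],
    "pencil": [0.0, 0.0, 1.0],
}

def utterance_vector(words):
    """Average the vectors of the words that have one; None if no word does."""
    known = [toy_vectors[w] for w in words if w in toy_vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def utterance_similarity(first, second):
    """Similarity of two utterances; 0.0 when either has no known words."""
    u = utterance_vector(first.split())
    v = utterance_vector(second.split())
    if u is None or v is None:
        return 0.0
    return cosine(u, v)

print(utterance_similarity("draw picture", "draw gallon"))
```

Because every word contributes equally to the average, frequent function words can dominate the vector of a short utterance, which is one motivation for running the content-word analysis alongside the all-words analysis.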

## Statistical Analysis
Each transcript included numerous teacher utterances, resulting in multiple semantic similarity scores per transcript, one for each pairwise comparison between two adjacent utterances. To enable linking the semantic similarity scores with the CLASS scores, a mean similarity score was computed for each transcript. This involved grouping by transcript ID, summing all semantic similarity values, and then dividing by the total number of values for that transcript. Mean similarity scores were calculated separately for all words and content words.
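The grouping step can be sketched in a few lines. The (transcript ID, score) pairs below are hypothetical, and the function name is illustrative; the study itself carried out this step in R.

```python
from collections import defaultdict

# Hypothetical (transcript ID, similarity score) pairs; not real data.
scores = [(2119, 0.75), (2119, 0.85), (750, 0.5), (750, 0.7), (750, 0.9)]

def mean_similarity_by_transcript(rows):
    """Sum the scores per transcript ID and divide by the number of scores."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for obsid, score in rows:
        totals[obsid] += score
        counts[obsid] += 1
    return {obsid: totals[obsid] / counts[obsid] for obsid in totals}

means = mean_similarity_by_transcript(scores)
print(means)
```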

The relationship between the mean semantic similarity scores and quality ratings for instructional dialogue, instructional support, and the overall CLASS score was analyzed and visually represented using the PerformanceAnalytics and ggplot2 packages in R.



# NLP Analysis

The following steps were taken for the NLP analysis:
1. Read in the transcript and CLASS data
2. Clean the transcript dataframe
3. Load the spaCy package
4. Remove stop words and punctuation
5. Calculate semantic similarity scores for teacher utterances
6. Save final dataframe to .csv file

### 1. Reading in transcripts and outcome data

In [2]:
# Load the Google Drive helper
from google.colab import drive
# mounting my drive to make it available
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
#navigating to the google drive folder where the data is located
import os #package for figuring out operating system

os.getcwd() #current working directory
os.listdir() #files in the current working directory
os.chdir("drive/MyDrive/Colab_Notebooks/TextAnalytics_Final") #change directory to where the data is saved
os.listdir() #file I need is located in this folder

['class_data.csv',
 'ncte_single_utterances.csv',
 'ncte_single_utterances_dashesremoved.numbers',
 'ncte_single_utterances_dashesremoved.csv',
 'ncte_teacher_similarity_scores.csv',
 'ncte_test_text.csv']

**Reading in the classroom transcripts' csv file**

In [4]:
#import pandas
import pandas as pd

#read in the .csv file that includes the observation transcripts
ncte_utterances = pd.read_csv('ncte_single_utterances_dashesremoved.csv', encoding = "utf-8", usecols=['speaker','text','OBSID']) #utf-8 due to apostrophes in the dataframe being read in as odd unicode symbols

#reorder columns
ncte_utterances = ncte_utterances.reindex(columns=['OBSID', 'speaker', 'text'])

#lower casing everything in the text column
ncte_utterances['text'] = ncte_utterances['text'].str.lower()


In [5]:
# looking at the first 10 rows of the data
ncte_utterances.head(10)

Unnamed: 0,OBSID,speaker,text
0,2119,teacher,"friends, yesterday we started off by working o..."
1,2119,student,yes.
2,2119,teacher,"and yesterday towards the end of the period, y..."
3,2119,multiple students,yes.
4,2119,teacher,some of you might be done. if you are finished...
5,2119,student,can we find new ways to check it or [inaudible]?
6,2119,teacher,"well, the best way to check it is going to be ..."
7,2119,student,can you [find] the gallon?
8,2119,student,i don't know how to.
9,2119,teacher,i guess draw it basically remember yesterday w...


Clean the text

In [6]:
# there are some non alphabetic characters included in the text due to transcription and encoding variations

def clean_text(text):
    text = text.replace('¾', 'three-fourths')
    text = text.replace('½', 'one-half')
    text = text.replace('é', 'e')
    text = text.replace('¼', 'one-fourth')
    text = text.replace('à', 'a')
    text = text.replace('ĕ', 'e')
    text = text.replace('ē', 'e')
    text = text.replace('°', ' degrees')
    return text

# Apply the clean_text function to the 'text' column
ncte_utterances['text'] = ncte_utterances['text'].apply(clean_text)
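The cell above lists each accented character by hand. As a hedged alternative, the sketch below spells out the fraction and degree symbols explicitly and then uses Unicode decomposition to strip accents from any remaining letters. The function name `clean_text_general` and the `SPELLED_OUT` table are hypothetical, and this is not the code used to produce the results in this notebook.

```python
import unicodedata

# Symbols that should be written out as words (same mapping as clean_text).
SPELLED_OUT = {
    "¾": "three-fourths",
    "½": "one-half",
    "¼": "one-fourth",
    "°": " degrees",
}

def clean_text_general(text):
    """Spell out known symbols, then drop accents from any letter."""
    for symbol, words in SPELLED_OUT.items():
        text = text.replace(symbol, words)
    # NFD splits an accented letter into the base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(clean_text_general("café ½ 90°"))
```

The advantage of this approach is that accented characters not anticipated in the hand-written list are handled as well; the trade-off is that it removes every accent, which may not be wanted in other corpora.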

Remove meta-data


In [7]:
#we can see meta-data included in the text above. The meta-data are indicated by square brackets []

import re

# Define a regex pattern to match the square brackets and the metadata within them
pattern = r'\[.*?\]'

# Use the re.sub() function to remove the square brackets and metadata
ncte_utterances['text'] = ncte_utterances['text'].apply(lambda x: re.sub(pattern, '', x))
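A small standalone sketch shows what the non-greedy pattern does to a few sample utterances. The whitespace clean-up in `strip_metadata` is an addition for illustration and is not part of the cell above; the function name is hypothetical.

```python
import re

# Non-greedy match: each bracketed span is removed separately.
BRACKETED = re.compile(r"\[.*?\]")

def strip_metadata(text):
    """Remove [bracketed] spans, then collapse leftover runs of whitespace."""
    without_brackets = BRACKETED.sub("", text)
    return re.sub(r"\s+", " ", without_brackets).strip()

samples = [
    "can we find new ways to check it or [inaudible]?",
    "can you [find] the gallon?",
    "[laughter].",
    "[students putting away materials]",
]
for sample in samples:
    print(repr(strip_metadata(sample)))
```

Note that an utterance consisting only of bracketed metadata becomes an empty string, and one followed by a period becomes a lone period, which is why such rows need to be handled afterwards.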

In [8]:
ncte_utterances.head(10) #check that meta-data are removed

Unnamed: 0,OBSID,speaker,text
0,2119,teacher,"friends, yesterday we started off by working o..."
1,2119,student,yes.
2,2119,teacher,"and yesterday towards the end of the period, y..."
3,2119,multiple students,yes.
4,2119,teacher,some of you might be done. if you are finished...
5,2119,student,can we find new ways to check it or ?
6,2119,teacher,"well, the best way to check it is going to be ..."
7,2119,student,can you the gallon?
8,2119,student,i don't know how to.
9,2119,teacher,i guess draw it basically remember yesterday w...


In [9]:
#get dimensions
ncte_utterances.shape

(580408, 3)

Remove rows that now only have a period because of removing meta-data

In [10]:
# Remove rows containing only "."
ncte_utterances = ncte_utterances[ncte_utterances != "."].dropna()

In [11]:
#get dimensions
ncte_utterances.shape

#looks like the above removed about 25,000 rows that did not contain text

(554784, 3)
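The filter above targets cells that are exactly ".". Rows left as an empty string or as other stray punctuation after metadata removal would pass through it and are instead assigned a similarity of 0 later in the pipeline. A broader check could look like the sketch below; `has_no_words` is a hypothetical helper, and applying it would change the row counts reported in this notebook.

```python
import string

# Characters that do not count as words on their own.
NON_WORD_CHARS = string.whitespace + string.punctuation

def has_no_words(text):
    """Return True when the text holds only whitespace and ASCII punctuation."""
    return text.strip(NON_WORD_CHARS) == ""

# Hypothetical utterances as they might look after bracket removal.
utterances = ["yes.", ".", "", " ? ", "i don't know how to."]
kept = [u for u in utterances if not has_no_words(u)]
print(kept)
```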

**Read in the CLASS outcome variables**

In [12]:
#read in the .csv file that includes the text and outcome variables
class_data = pd.read_csv('class_data.csv', encoding = "ISO-8859-1", usecols=['OBSID','instructional_dialogue','instruct_support','overall_score'])
#usecols to select only the variables related to instructional dialogue, as well as the observation ID numbers named OBSID

In [13]:
#we can see the observation ID number that will be used to join the semantic similarity scores with the CLASS scores later as well as the outcome variables
class_data.head(10)

Unnamed: 0,OBSID,instructional_dialogue,instruct_support,overall_score
0,33,4.25,4.0,4.863636
1,128,3.333333,3.8,4.090909
2,455,3.75,3.55,4.522727
3,30,5.0,5.0,5.272727
4,130,2.333333,2.8,4.030303
5,585,4.0,4.25,5.090909
6,4250,3.0,3.3,3.818182
7,4599,5.75,5.2,5.5
8,4678,3.0,3.15,3.954545
9,2191,4.333333,4.066667,4.575758


**Keep only the observation identifiers (OBSID) that exist in both dataframes**

In [14]:
# Get the OBSIDs that exist in both data frames
common_OBSIDs = ncte_utterances['OBSID'].isin(class_data['OBSID'])

# Filter the 'ncte_utterances' data frame to keep only the OBSIDs that exist in 'class_data'
ncte_utterances = ncte_utterances[common_OBSIDs]

In [15]:
#get dimensions
ncte_utterances.shape

#the above code appears to have removed about 150,000 rows from observation IDs that were not in the class data. Removing these observations before the NLP analysis will save some compute time.

(401705, 3)

In [16]:
# Calculate the number of unique OBSIDs in the 'ncte_utterances' data frame
num_unique_ncte_OBSIDs = ncte_utterances['OBSID'].nunique()
num_unique_class_OBSIDs = class_data['OBSID'].nunique()

# Print the number of unique OBSIDs
print("Number of unique OBSIDs in class_data:", num_unique_class_OBSIDs)
print("Number of unique OBSIDs in ncte_utterances:", num_unique_ncte_OBSIDs)

Number of unique OBSIDs in class_data: 1325
Number of unique OBSIDs in ncte_utterances: 1325


**Select only the rows where the teacher is the speaker**

In [17]:
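# Keep only the teacher rows; .copy() creates an independent dataframe so that
# new columns can be added later without a SettingWithCopyWarning
ncte_teacher_utterances = ncte_utterances[ncte_utterances['speaker'] == 'teacher'].copy()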


In [18]:
# Print the modified DataFrame
ncte_teacher_utterances.head(10)

Unnamed: 0,OBSID,speaker,text
0,2119,teacher,"friends, yesterday we started off by working o..."
2,2119,teacher,"and yesterday towards the end of the period, y..."
4,2119,teacher,some of you might be done. if you are finished...
6,2119,teacher,"well, the best way to check it is going to be ..."
9,2119,teacher,i guess draw it basically remember yesterday w...
11,2119,teacher,that works. so now show three fourths of that.
13,2119,teacher,you might wanna draw it a little bit bigger. l...
15,2119,teacher,do the same thing with what you're doing there...
17,2119,teacher,that's what you have going on? because student...
19,2119,teacher,"so four, five."


In [19]:
#get dimensions
ncte_teacher_utterances.shape

#about half of the rows were from teacher talk

(207427, 3)

### 3. Load in spaCy, NLTK, & string

In [20]:
import spacy
import spacy.cli #spacy command line interface

spacy.cli.download("en_core_web_lg") #downloading large model so that word vectors are available for the semantic similarity analysis

nlp = spacy.load("en_core_web_lg") #call for the spacy model

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [21]:
import nltk
from nltk.corpus import stopwords #will be used to remove stop words
import string #will be used to remove punctuation

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### 4. Remove Stop Words and Punctuation

In this section, punctuation is removed from the text column to create a new "text_no_punctuation" column, which allows us to calculate semantic similarity on all text in the transcripts. Stop words are then removed to create a new "content_words" column, which allows us to calculate semantic similarity on only the content words within the transcripts.

**Using string.punctuation and regular expression to remove only punctuation**

In [22]:
def remove_punctuation(text):
    additional_punctuations = ['“', '”', '‘', '…', '−', '’', '—']  # stubborn punctuation that was not successfully getting removed with string.punctuation
    punctuation_pattern = '[' + re.escape(string.punctuation) + ''.join(map(re.escape, additional_punctuations)) + ']'
    cleaned_text = re.sub(punctuation_pattern, '', text)
    return cleaned_text

# Apply the remove_punctuation function to the 'text' column
ncte_teacher_utterances['text_no_punctuation'] = ncte_teacher_utterances['text'].apply(remove_punctuation)



In [23]:
ncte_teacher_utterances.head(10)

Unnamed: 0,OBSID,speaker,text,text_no_punctuation
0,2119,teacher,"friends, yesterday we started off by working o...",friends yesterday we started off by working on...
2,2119,teacher,"and yesterday towards the end of the period, y...",and yesterday towards the end of the period yo...
4,2119,teacher,some of you might be done. if you are finished...,some of you might be done if you are finished ...
6,2119,teacher,"well, the best way to check it is going to be ...",well the best way to check it is going to be t...
9,2119,teacher,i guess draw it basically remember yesterday w...,i guess draw it basically remember yesterday w...
11,2119,teacher,that works. so now show three fourths of that.,that works so now show three fourths of that
13,2119,teacher,you might wanna draw it a little bit bigger. l...,you might wanna draw it a little bit bigger li...
15,2119,teacher,do the same thing with what you're doing there...,do the same thing with what youre doing there ...
17,2119,teacher,that's what you have going on? because student...,thats what you have going on because student h...
19,2119,teacher,"so four, five.",so four five


In [24]:
# Check for remaining stubborn punctuations in the 'text_no_punctuation' column
stubborn_punctuations = ncte_teacher_utterances['text_no_punctuation'].str.extractall(r'([^a-zA-Z\s])')[0].unique()

# Print the stubborn punctuations
print("Stubborn Punctuations:", stubborn_punctuations)

#only numbers remain which are fine to stay in this analysis

Stubborn Punctuations: ['4' '2' '1' '5' '7' '8' '0' '9' '3' '6']
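One side effect of deleting punctuation outright is that the two halves of a hyphenated word are joined, so a replacement such as "three-fourths" introduced during cleaning becomes "threefourths". The sketch below reproduces the character class used above and adds a hypothetical variant, `remove_punctuation_keep_gaps`, that turns hyphens and dashes into spaces first. The variant is an illustration only and was not used for the reported scores.

```python
import re
import string

# Same idea as the cell above: ASCII punctuation plus a few typographic marks.
EXTRA_MARKS = ["“", "”", "‘", "’", "…", "−", "—"]
PUNCT_PATTERN = re.compile(
    "[" + re.escape(string.punctuation) + "".join(map(re.escape, EXTRA_MARKS)) + "]"
)
# Hyphen-like characters that separate words.
DASHES = re.compile(r"[-−—]")

def remove_punctuation(text):
    """Delete every punctuation character."""
    return PUNCT_PATTERN.sub("", text)

def remove_punctuation_keep_gaps(text):
    """Replace dashes with a space, delete other punctuation, tidy spaces."""
    spaced = DASHES.sub(" ", text)
    return " ".join(PUNCT_PATTERN.sub("", spaced).split())

print(remove_punctuation("so now show three-fourths of that."))
print(remove_punctuation_keep_gaps("so now show three-fourths of that."))
```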


**Using NLTK stopwords list to remove stop words**

In [25]:
# Build the stop-word set once instead of on every function call
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # keep only the words that are not in NLTK's English stop-word list
    words = [word for word in text.split() if word.lower() not in stop_words]
    cleaned_text = ' '.join(words)
    return cleaned_text

ncte_teacher_utterances['content_words'] = ncte_teacher_utterances['text_no_punctuation'].apply(remove_stopwords)



In [26]:
# check dataframe
ncte_teacher_utterances.head(10)

Unnamed: 0,OBSID,speaker,text,text_no_punctuation,content_words
0,2119,teacher,"friends, yesterday we started off by working o...",friends yesterday we started off by working on...,friends yesterday started working word problem...
2,2119,teacher,"and yesterday towards the end of the period, y...",and yesterday towards the end of the period yo...,yesterday towards end period got word problem ...
4,2119,teacher,some of you might be done. if you are finished...,some of you might be done if you are finished ...,might done finished look make sure happy way d...
6,2119,teacher,"well, the best way to check it is going to be ...",well the best way to check it is going to be t...,well best way check going draw picture math pa...
9,2119,teacher,i guess draw it basically remember yesterday w...,i guess draw it basically remember yesterday w...,guess draw basically remember yesterday showed...
11,2119,teacher,that works. so now show three fourths of that.,that works so now show three fourths of that,works show three fourths
13,2119,teacher,you might wanna draw it a little bit bigger. l...,you might wanna draw it a little bit bigger li...,might wanna draw little bit bigger like wanna ...
15,2119,teacher,do the same thing with what you're doing there...,do the same thing with what youre doing there ...,thing youre gallon gas im talking case draw pi...
17,2119,teacher,that's what you have going on? because student...,thats what you have going on because student h...,thats going student h using pencil
19,2119,teacher,"so four, five.",so four five,four five
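The content_words column above still contains forms such as "youre", "im", and "thats". One likely reason is the order of operations: stop-word lists such as NLTK's store contractions with their apostrophes (for example "don't" and "you're"), so once punctuation has been stripped the apostrophe-free forms no longer match. The sketch below uses a small hypothetical stop list, not the NLTK list, to show the effect.

```python
# A small hypothetical stop list for illustration (not the NLTK list).
STOP_WORDS = {"i", "you", "do", "don't", "to", "have", "the", "a", "is"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated word found in the stop list."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

with_apostrophe = remove_stopwords("no you don't have to do that")
without_apostrophe = remove_stopwords("no you dont have to do that")
print(with_apostrophe)
print(without_apostrophe)
```

If contractions should count as stop words, one option is to filter stop words before removing punctuation; another is to add the apostrophe-free forms to the stop list. Either choice would change the content-word scores reported here.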


### 5. Calculate Semantic Similarity

Semantic similarity is calculated for the comparisons for all words and content words and both similarity score lists are added to the data frame.

**For all words (text_no_punctuation)**

In [27]:
number_docs = []  # holder for the utterance strings
docs = []  # holder for spaCy docs. This list is outside the loop so the docs accumulate across all the rows
doc_similarity = []  # holder list for the semantic similarity scores

for index, row in ncte_teacher_utterances.iterrows():  # go through the Pandas dataframe row by row
    text = str(row["text_no_punctuation"])  # convert the utterance to a string for spaCy processing
    number_docs.append(text)
    tokenized_doc = nlp(text)  # run the spaCy pipeline on the text
    docs.append(tokenized_doc)

# calculate semantic similarity between consecutive documents
# note: pairs are formed over the whole dataframe, so the last utterance of one
# observation is compared with the first utterance of the next observation
for i in range(len(number_docs) - 1):  # iterate over every document except the last
    j = i + 1  # index of the next document, which is compared with document i
    #print(f'this is the first doc: {docs[i]}\nthis is the second doc: {docs[j]}')

    # Check if either of the documents is empty (no text because the original document was empty or got removed in cleaning)
    if docs[i].text == '' or docs[j].text == '':
        similarity_score = 0 #any time there is no text in one row to compare to the other, give it a semantic similarity score of 0
    elif docs[i].has_vector and docs[j].has_vector:
        similarity_score = docs[i].similarity(docs[j]) #compare the two utterances and store their similarity score
    else:
        similarity_score = 0 #at least one document has no word vectors, so no similarity can be computed

    #print(f"The similarity between doc: {i}, and doc: {j}: {similarity_score}")
    doc_similarity.append(similarity_score)

#print(f'list of doc similarities: {doc_similarity}')
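The loop above compares each row with the next row of the dataframe regardless of OBSID. If the intent is to compare only utterances from the same observation, the pairing can be done per group, as in the sketch below. The `word_overlap` function is a stand-in for the spaCy similarity call so that the example runs without a language model; both function names and the sample rows are hypothetical.

```python
from itertools import groupby

def word_overlap(first, second):
    """Stand-in similarity: shared words divided by all distinct words."""
    a, b = set(first.split()), set(second.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def adjacent_similarities(rows, similarity):
    """Score each utterance against the next one in the same observation.

    rows: (obsid, text) tuples in transcript order.
    Returns one value per row; the last utterance of each observation
    has no following utterance and gets None instead of a score.
    """
    scores = []
    for _, group in groupby(rows, key=lambda row: row[0]):
        texts = [text for _, text in group]
        for current, following in zip(texts, texts[1:]):
            scores.append(similarity(current, following))
        scores.append(None)
    return scores

rows = [
    (2119, "draw a picture"),
    (2119, "draw a gallon"),
    (750, "right very good"),
]
result = adjacent_similarities(rows, word_overlap)
print(result)
```

Using None rather than 0 for rows without a following utterance keeps those placeholders out of any later average, whereas a 0 would pull a transcript's mean similarity down slightly.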

Add all text similarity score to dataframe

In [28]:
# Append a similarity score of 0 for the last document
doc_similarity.append(0)

# Add the 'alltext_similar_score' column, which holds the semantic similarity score calculated for each comparison
ncte_teacher_utterances['alltext_similar_score'] = doc_similarity



In [29]:
ncte_teacher_utterances.head(10) #double checking dataframe contains scores

Unnamed: 0,OBSID,speaker,text,text_no_punctuation,content_words,alltext_similar_score
0,2119,teacher,"friends, yesterday we started off by working o...",friends yesterday we started off by working on...,friends yesterday started working word problem...,0.749705
2,2119,teacher,"and yesterday towards the end of the period, y...",and yesterday towards the end of the period yo...,yesterday towards end period got word problem ...,0.86929
4,2119,teacher,some of you might be done. if you are finished...,some of you might be done if you are finished ...,might done finished look make sure happy way d...,0.884623
6,2119,teacher,"well, the best way to check it is going to be ...",well the best way to check it is going to be t...,well best way check going draw picture math pa...,0.953867
9,2119,teacher,i guess draw it basically remember yesterday w...,i guess draw it basically remember yesterday w...,guess draw basically remember yesterday showed...,0.805809
11,2119,teacher,that works. so now show three fourths of that.,that works so now show three fourths of that,works show three fourths,0.806185
13,2119,teacher,you might wanna draw it a little bit bigger. l...,you might wanna draw it a little bit bigger li...,might wanna draw little bit bigger like wanna ...,0.907149
15,2119,teacher,do the same thing with what you're doing there...,do the same thing with what youre doing there ...,thing youre gallon gas im talking case draw pi...,0.857993
17,2119,teacher,that's what you have going on? because student...,thats what you have going on because student h...,thats going student h using pencil,0.388658
19,2119,teacher,"so four, five.",so four five,four five,0.415478


**For content words only**

In [30]:
number_docs = []  # holder for the utterance strings
docs = []  # holder for spaCy docs. This list is outside the loop so the docs accumulate across all the rows
doc_similarity_2 = []  # holder list for the content-word semantic similarity scores

for index, row in ncte_teacher_utterances.iterrows():  # go through the Pandas dataframe row by row
    text = str(row["content_words"])  # convert the content words to a string for spaCy processing
    number_docs.append(text)
    tokenized_doc = nlp(text)  # run the spaCy pipeline on the text
    docs.append(tokenized_doc)

# calculate semantic similarity between consecutive documents
for i in range(len(number_docs) - 1):  # iterate over every document except the last
    j = i + 1  # index of the next document, which is compared with document i
    #print(f'this is the first doc: {docs[i]}\nthis is the second doc: {docs[j]}')

    # Check if either of the documents is empty (no text because the original document was all stop words or punctuation)
    if docs[i].text == '' or docs[j].text == '':
        similarity_score = 0 #any time there is no text in one row to compare to the other, give it a semantic similarity score of 0
    elif docs[i].has_vector and docs[j].has_vector:
        similarity_score = docs[i].similarity(docs[j]) #compare the two utterances and store their similarity score
    else:
        similarity_score = 0 #at least one document has no word vectors, so no similarity can be computed

    #print(f"The similarity between doc: {i}, and doc: {j}: {similarity_score}")
    doc_similarity_2.append(similarity_score)

#print(f'list of doc similarities: {doc_similarity_2}')

Add the content word similarity score to dataframe

In [31]:
# Append a similarity score of 0 for the last document
doc_similarity_2.append(0)

# Add the 'cw_similar_score' column, which holds the content-word semantic similarity score calculated for each comparison
ncte_teacher_utterances['cw_similar_score'] = doc_similarity_2



In [32]:
ncte_teacher_utterances.head(10) #double checking dataframe contains scores

Unnamed: 0,OBSID,speaker,text,text_no_punctuation,content_words,alltext_similar_score,cw_similar_score
0,2119,teacher,"friends, yesterday we started off by working o...",friends yesterday we started off by working on...,friends yesterday started working word problem...,0.749705,0.766437
2,2119,teacher,"and yesterday towards the end of the period, y...",and yesterday towards the end of the period yo...,yesterday towards end period got word problem ...,0.86929,0.653268
4,2119,teacher,some of you might be done. if you are finished...,some of you might be done if you are finished ...,might done finished look make sure happy way d...,0.884623,0.758269
6,2119,teacher,"well, the best way to check it is going to be ...",well the best way to check it is going to be t...,well best way check going draw picture math pa...,0.953867,0.858346
9,2119,teacher,i guess draw it basically remember yesterday w...,i guess draw it basically remember yesterday w...,guess draw basically remember yesterday showed...,0.805809,0.517561
11,2119,teacher,that works. so now show three fourths of that.,that works so now show three fourths of that,works show three fourths,0.806185,0.740201
13,2119,teacher,you might wanna draw it a little bit bigger. l...,you might wanna draw it a little bit bigger li...,might wanna draw little bit bigger like wanna ...,0.907149,0.756144
15,2119,teacher,do the same thing with what you're doing there...,do the same thing with what youre doing there ...,thing youre gallon gas im talking case draw pi...,0.857993,0.677561
17,2119,teacher,that's what you have going on? because student...,thats what you have going on because student h...,thats going student h using pencil,0.388658,0.067523
19,2119,teacher,"so four, five.",so four five,four five,0.415478,0.024155


In [33]:
ncte_teacher_utterances.tail(10) #double checking that the last line in the dataframe received a score of 0

Unnamed: 0,OBSID,speaker,text,text_no_punctuation,content_words,alltext_similar_score,cw_similar_score
579869,750,teacher,"two. two, zero. so you start where?",two two zero so you start where,two two zero start,0.719059,0.371261
579871,750,teacher,very good. how many times are you gonna move a...,very good how many times are you gonna move ac...,good many times gonna move across,0.51067,0.575136
579873,750,teacher,move across two times. across. stay on the lin...,move across two times across stay on the line ...,move across two times across stay line numbers...,0.687832,0.595159
579875,750,teacher,now the next one is zero. so to move zero time...,now the next one is zero so to move zero times...,next one zero move zero times gonna end stayin...,0.698754,0.494072
579877,750,teacher,write the letter. very good.,write the letter very good,write letter good,0.453335,0.226874
579879,750,teacher,no you don’t have to do that right now. i’m so...,no you dont have to do that right now im sorry,dont right im sorry,0.51852,0.334875
579881,750,teacher,which coordinates are h and b . so the coordin...,which coordinates are h and b so the coordina...,coordinates h b coordinates remember numbers l...,0.897429,0.690558
579883,750,teacher,one and seven. what are the coordinates for b?,one and seven what are the coordinates for b,one seven coordinates b,0.887031,0.916133
579885,750,teacher,"so out of those, which one is the same? h is o...",so out of those which one is the same h is one...,one h one seven b three seven number,0.673991,0.314868
579887,750,teacher,right. very good.,right very good,right good,0.0,0.0


### 6. Save Dataframe

In [34]:
#save dataframe for analysis in R

ncte_teacher_utterances.to_csv('ncte_teacher_similarityscores.csv')

# Results








   






**Mean Similarity Scores**

The distribution and density of mean similarity scores for all words and for content words are shown as violin plots in Figure 1 and Figure 2. Across teachers, the mean of the mean semantic similarity scores for all words was 0.67, with a standard deviation of 0.07. Notably, a higher density of scores is observed within the range of 0.60 to 0.75. In contrast, the mean of the mean similarity scores for content words across teachers was 0.53, with a standard deviation of 0.07. Here, a higher density of scores is concentrated within the range of 0.45 to 0.60.

<img src="https://drive.google.com/uc?export=view&id=1a00NIbEeDX5Ei2DSnTo3sRn6L7FrpA5K" width="600" height="350">

<img src="https://drive.google.com/uc?export=view&id=1RJLjngh6kVzBOC-YurZxa1C6EnouHNuy" width="600" height="350">
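
The summary statistics above can be reproduced from the saved dataframe with a two-step aggregation. The following is only a minimal sketch, not the code used for the reported results (the statistical analysis was carried out in R): it assumes the per-observation mean is taken over all rows sharing an `OBSID`, and it runs on a small made-up dataframe (`toy`) rather than the real data. The helper name `summarize_mean_similarity` is hypothetical. The sketch averages every row as given, so whether the placeholder score of 0.0 on the final utterance should be excluded is left open.

```python
import pandas as pd

# Made-up stand-in for ncte_teacher_utterances (values are illustrative only).
toy = pd.DataFrame({
    "OBSID": [2119, 2119, 2119, 750, 750],
    "alltext_similar_score": [0.75, 0.87, 0.88, 0.72, 0.51],
    "cw_similar_score": [0.77, 0.65, 0.76, 0.37, 0.58],
})

score_cols = ["alltext_similar_score", "cw_similar_score"]


def summarize_mean_similarity(utterances, score_cols):
    """Return per-observation mean scores and their mean/SD across observations."""
    # Step 1: average the utterance-level scores within each observation.
    per_obs = utterances.groupby("OBSID")[score_cols].mean()
    # Step 2: describe the distribution of those per-observation means.
    summary = per_obs.agg(["mean", "std"])
    return per_obs, summary


per_obs, summary = summarize_mean_similarity(toy, score_cols)
print(per_obs.round(3))
print(summary.round(3))
```

On the real data, the same function could be called on `ncte_teacher_utterances` directly, since it only relies on the `OBSID` column and the two score columns shown in the dataframe preview.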


**Relationship between Mean Similarity Scores & CLASS**

The correlation matrix, shown in Figure 3, provides a visual representation of the distribution of mean semantic similarity scores alongside CLASS scores, offering insight into the relationship between these variables. Histograms embedded within the matrix indicate that mean semantic similarity scores for all words and for content words are approximately normally distributed. Additionally, the scatterplots show a weak linear association between both mean similarity scores and the instructional support variables.

The correlation coefficients affirm the visual trends in the scatterplots. For all words, the coefficients indicate positive correlations between the mean similarity scores and instructional dialogue, instructional support, and the overall CLASS scores. However, the correlation between the mean similarity scores and instructional dialogue was not statistically significant. The statistically significant correlations between similarity scores and instructional support, as well as the overall CLASS scores, demonstrated very weak effect sizes (r < 0.1).

Similarly, for content words, the coefficients indicate a positive correlation between the mean similarity scores and instructional support, as well as the overall CLASS scores, but a negative relationship with the instructional dialogue dimension. The correlations of the mean semantic similarity score with instructional dialogue and with instructional support were not statistically significant. Despite its statistical significance, the correlation between similarity scores and the overall CLASS scores exhibits a very weak effect size (r < 0.1).

<img src="https://drive.google.com/uc?export=view&id=1j4zKhJUfMlmakEwaY6jTRxYzN1qLwklp" width="600" height="350">
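
To make the effect sizes concrete, the sketch below computes Pearson's r and the shared variance r² for each pairing of a similarity measure with a CLASS score. It is an illustration under stated assumptions rather than the analysis reported here, which was run in R: the observation-level table `obs`, its column names, and every value in it are made up, and the helper `correlation_table` is hypothetical. Significance tests are not included.

```python
import pandas as pd

# Hypothetical observation-level table: one row per observation.
# Column names and values are invented for illustration only.
obs = pd.DataFrame({
    "mean_alltext_similarity": [0.60, 0.65, 0.70, 0.75, 0.68, 0.62],
    "mean_cw_similarity": [0.45, 0.50, 0.58, 0.60, 0.52, 0.49],
    "instructional_dialogue": [3.0, 4.0, 3.5, 4.5, 3.0, 4.0],
    "instructional_support": [3.5, 3.0, 4.0, 4.5, 3.5, 3.0],
    "class_overall": [4.0, 4.5, 4.0, 5.0, 4.5, 4.0],
})

similarity_cols = ["mean_alltext_similarity", "mean_cw_similarity"]
class_cols = ["instructional_dialogue", "instructional_support", "class_overall"]


def correlation_table(frame, similarity_cols, class_cols):
    """Pearson's r and r squared for every similarity/CLASS column pair."""
    rows = []
    for sim in similarity_cols:
        for cls in class_cols:
            # Series.corr uses the Pearson method by default.
            r = frame[sim].corr(frame[cls])
            rows.append({
                "similarity": sim,
                "class_score": cls,
                "r": r,
                "r_squared": r ** 2,  # proportion of shared variance
            })
    return pd.DataFrame(rows)


table = correlation_table(obs, similarity_cols, class_cols)
print(table.round(3))
```

The `r_squared` column expresses each coefficient as a proportion of shared variance, which is often easier to interpret than r when judging practical significance.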








**Examining Raw Similarity Scores**

These weak effect sizes prompt reflection on the practical significance of the observed associations. With r < 0.1, each similarity measure shares less than 1% of its variance with the corresponding CLASS score (r² < 0.01). The findings therefore suggest that the semantic similarity between teachers’ adjacent utterances accounts for very little of the observed variation in quality ratings for instructional dialogue, instructional support, or the overall CLASS score.

<img src="https://drive.google.com/uc?export=view&id=1QnWFWgiy45GnogL6flOhj5ZjYRxN3OH-" width="600" height="350">


