In [1]:
import tendims
import gensim
import pandas as pd
import re


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mitra\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
model = tendims.TenDimensionsClassifier(is_cuda=False, models_dir = './models/lstm_trained_models', 
                                        embeddings_dir='./embeddings')
dimensions = model.dimensions_list
print(dimensions)

Loaded word embeddings from ./embeddings\glove/glove.42B.300d.wv!
Vocab size: 1917494
Loaded word embeddings from ./embeddings\word2vec/GoogleNews-vectors-negative300.wv!
Vocab size: 3000000
Loaded word embeddings from ./embeddings\fasttext/wiki-news-300d-1M-subword.wv!
Vocab size: 999994
['support', 'knowledge', 'conflict', 'power', 'similarity', 'fun', 'status', 'trust', 'identity', 'romance']


Sentence-level classification

The classifier was trained on individual sentences. Although it accepts text of any length, we recommend computing the scores sentence by sentence. The function compute_score_split does that for you and returns aggregate statistics, including the maximum and average values. When using the maximum, keep in mind that the longer the text, the higher the likelihood of getting a larger maximum value. So, if you use the maximum, be sure to account for text length in your analysis (i.e., a high maximum score on a text of 10 words is not comparable with the same value on a text of 100 words). You can always split the sentences yourself and aggregate sentence-level values as you deem appropriate.
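
As a rough illustration of splitting and aggregating yourself, the sketch below scores each sentence separately and reports the mean and maximum together with the sentence count, so the maximum can be read against text length. The names split_sentences, aggregate_scores and toy_score are hypothetical, and toy_score is only a stand-in for a real per-sentence scorer.

```python
import re
import statistics

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def aggregate_scores(sentences, score_fn):
    """Score each sentence and return mean, max and the number of sentences."""
    scores = [score_fn(s) for s in sentences]
    if not scores:
        return {"mean": float("nan"), "max": float("nan"), "n_sentences": 0}
    return {
        "mean": statistics.mean(scores),
        "max": max(scores),
        "n_sentences": len(scores),
    }

# Hypothetical stand-in scorer, for illustration only:
# the fraction of words in the sentence that are exactly "help".
def toy_score(sentence):
    words = re.findall(r"\w+", sentence.lower())
    return sum(w == "help" for w in words) / len(words) if words else 0.0

text = "Hello, my name is Mike. I am willing to help you, whatever you need."
result = aggregate_scores(split_sentences(text), toy_score)
print(result)
```

With the classifier loaded, toy_score could presumably be swapped for a call such as lambda s: model.compute_score(s, 'support'). Keeping n_sentences next to the maximum makes it easier to compare texts of different lengths.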

Score distribution

The classifier returns confidence scores in the range [0,1]. This number is proportional to the likelihood of the text containing the selected dimension. Depending on the input data and on the aggregation performed, the empirical distributions of the confidence scores may differ across dimensions (they may be bell-shaped, skewed, bimodal, etc.). For this reason, binarizing the scores with a fixed threshold might not be the best approach. An approach that proved effective is to binarize based on a high percentile (e.g., the 75th or 85th percentile) computed on your empirical distribution of scores.
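
A minimal sketch of percentile-based binarization, assuming the scores for one dimension have already been collected in a list. The function name binarize_by_percentile and the example scores are made up for illustration. It uses only the standard library; np.percentile with its default linear interpolation should give the same threshold.

```python
import statistics

def binarize_by_percentile(scores, percentile=85):
    """Return (threshold, flags); flags[i] is True when scores[i] >= threshold."""
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be an integer between 1 and 99")
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99,
    # computed on the empirical distribution of the given scores.
    cut_points = statistics.quantiles(scores, n=100, method="inclusive")
    threshold = cut_points[percentile - 1]
    return threshold, [s >= threshold for s in scores]

# Made-up confidence scores, for illustration only
example_scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.50, 0.80, 0.95]
threshold, flags = binarize_by_percentile(example_scores, percentile=75)
print(threshold, sum(flags))
```

Because the threshold is derived from the data, it should be computed separately for each dimension and each aggregation (mean, maximum, and so on).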

Directionality

The classifier was trained to identify expressions that "convey" dimension D from the speaker to the listener. For example, in the case of the dimension support, the classifier is supposed to find expressions indicating that the speaker is offering some support to the listener. In practice, this directionality is not guaranteed, and the classifier picks up different types of verbal expressions of the social dimensions. For example, "I am willing to help you, whatever you need" and "Clara is willing to help George, whatever he needs" both have relatively high scores for the dimension support (0.86 and 0.75, respectively), but only the first one is an expression of the speaker offering support. To enforce directionality more strongly, an approach that proved effective is to consider only sentences containing second-person pronouns.
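
The second-person filter can be sketched with a regular expression. The pronoun list and the names SECOND_PERSON and has_second_person are assumptions made for this example, not part of the classifier.

```python
import re

# Word-boundary match on common second-person forms (the exact list is an assumption).
# "you're" and "you'll" also match, because the apostrophe ends the word "you".
SECOND_PERSON = re.compile(r"\b(you|your|yours|yourself|yourselves)\b", re.IGNORECASE)

def has_second_person(sentence):
    """Return True if the sentence contains a second-person pronoun."""
    return SECOND_PERSON.search(sentence) is not None

sentences = [
    "I am willing to help you, whatever you need",
    "Clara is willing to help George, whatever he needs",
]

# Keep only sentences addressed to the listener before scoring them
directed = [s for s in sentences if has_second_person(s)]
print(directed)
```

Only the first sentence passes the filter, which matches the distinction drawn in the example above. The filter is a heuristic: it will also keep sentences where "you" is used generically.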

Errors

Be aware that the classifier was trained mostly on Reddit data. It can be used on any piece of text, but you should expect some performance drop on textual data with a very different style or distribution of words (e.g., Twitter). Lastly, like everything in life, the classifications made by this tool are not perfect, but given enough data you will be able to see interesting and meaningful trends.

In [None]:
# example
model.compute_score_split('Hello, my name is Mike. I am willing to help you, whatever you need.', dimensions='support')
# (np.mean(scores), np.max(scores), np.min(scores), np.std(scores))

Testing individual threads and comments

In [None]:
# Load & read the data
thread_rknr7b = pd.read_csv("../data/threads/rknr7b.csv")
thread_rknr7b.head()

In [None]:
# Display all column names to confirm the cleaned body column's name
print(thread_rknr7b.columns)

# Get the number of rows in the DataFrame
num_rows = thread_rknr7b.shape[0]
print("Number of rows in the DataFrame:", num_rows)

In [None]:
# Display the full content of the 'body' for the first row to avoid truncation

# Set display option to show full text in columns
pd.set_option('display.max_colwidth', None)

print("*******1\nFull Body Text of the First Entry:\n", thread_rknr7b.loc[0, "body"])
print("*******3\nFull Body Text of the Entry at Index 3:\n", thread_rknr7b.loc[3, "body"])
print("*******5\nFull Body Text of the Entry at Index 5:\n", thread_rknr7b.loc[5, "body"])

# Check the data type of the 'body' column
print("\n *******\nData type of 'body' column:\n", thread_rknr7b["body"].dtype)

# Fill NaN values in 'body' with an empty string before calculating the length
thread_rknr7b["body"] = thread_rknr7b["body"].fillna("")

# Calculate the length of the text in 'body' for each entry and add it as a new column
thread_rknr7b["body_length"] = thread_rknr7b["body"].apply(len)


# Show summary statistics for the body lengths
print("\n *******\nSummary of 'body' lengths:\n", thread_rknr7b["body_length"].describe())

# Display the first few rows to inspect the 'body_length' column
print("\n******* \nSample rows with 'body' text and its length:\n")
print(thread_rknr7b[["body", "body_length"]].head(10))



In [None]:
# format 1> analyze_text
# Define a function to calculate additional metrics and check for special characters
def analyze_text(text):
    num_chars = len(text) if isinstance(text, str) else 0
    num_words = len(text.split()) if isinstance(text, str) else 0
    num_sentences = len(re.split(r'[.!?]', text)) - 1 if isinstance(text, str) else 0
    has_special_chars = bool(re.search(r'[\[\]\{\}\*\&]', text)) if isinstance(text, str) else False  # Check for special characters
    return pd.Series({
        "num_chars": num_chars,
        "num_words": num_words,
        "num_sentences": num_sentences,
        "has_special_chars": has_special_chars
    })

# Apply the function to each row in 'body' and store results in new columns
thread_rknr7b[["num_chars", "num_words", "num_sentences", "has_special_chars"]] = thread_rknr7b["body"].apply(analyze_text)

# Display the extracted metrics table without the 'body' column
print(thread_rknr7b[["num_chars", "num_words", "num_sentences", "has_special_chars"]].head())


In [None]:
# format 2> analyze_text
# Define the function to analyze text
def analyze_text_with_special_chars(text):
    if isinstance(text, str):  # Check if the input is a string
        special_chars = re.findall(r'[^\w\s\.\?\!]', text)  # Find characters that aren't alphanumeric, space, ., ?, !
        has_special_chars = len(special_chars) > 0
        return pd.Series({
            "num_chars": len(text),
            "num_words": len(text.split()),
            "num_sentences": len(re.split(r'[.!?]', text)) - 1,
            "has_special_chars": has_special_chars,
            "special_chars": ''.join(set(special_chars))  # Unique special characters
        })
    else:
        return pd.Series({
            "num_chars": 0,
            "num_words": 0,
            "num_sentences": 0,
            "has_special_chars": False,
            "special_chars": ""
        })
    
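# Display other statistical features for the first entry
body_text = thread_rknr7b.loc[0, "body"]  # first post, used as the sample text
body_words = body_text.split()
print("\nAdditional Features:")
print("Number of Words in 'body':", len(body_words))
print("Number of Unique Words in 'body':", len(set(body_words)))
# Guard against an empty body to avoid a division by zero
print("Average Word Length in 'body':", sum(len(word) for word in body_words) / len(body_words) if body_words else 0)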

# Apply the function to each row in 'body' and store results in new columns
thread_rknr7b[["num_chars", "num_words", "num_sentences", "has_special_chars", "special_chars"]] = thread_rknr7b["body"].apply(analyze_text_with_special_chars)

# Display the metrics without the body content
print(thread_rknr7b[["num_chars", "num_words", "num_sentences", "has_special_chars", "special_chars"]].head())


In [None]:
# format 3> analyze_text

# Define the function to analyze text
def analyze_text_with_special_chars(text):
    if isinstance(text, str):  # Check if the input is a string
        # Find special characters that aren't alphanumeric, spaces, ., ?, !
        special_chars = re.findall(r'[^\w\s\.\?\!]', text)
        has_special_chars = len(special_chars) > 0
        words = text.split()
        
        # Calculate unique words and average word length
        num_unique_words = len(set(words))
        avg_word_length = sum(len(word) for word in words) / len(words) if words else 0

        return pd.Series({
            "num_chars": len(text),
            "num_words": len(words),
            "num_sentences": len(re.split(r'[.!?]', text)) - 1,
            "has_special_chars": has_special_chars,
            "special_chars": ''.join(set(special_chars)),  # Unique special characters
            "num_unique_words": num_unique_words,
            "avg_word_length": avg_word_length
        })
    else:
        # Handle non-string entries
        return pd.Series({
            "num_chars": 0,
            "num_words": 0,
            "num_sentences": 0,
            "has_special_chars": False,
            "special_chars": "",
            "num_unique_words": 0,
            "avg_word_length": 0
        })

# Apply the function to each row in 'body' and store results in new columns
thread_rknr7b[["num_chars", "num_words", "num_sentences", "has_special_chars", "special_chars", 
               "num_unique_words", "avg_word_length"]] = thread_rknr7b["body"].apply(analyze_text_with_special_chars)

# Display the metrics without the body content
print(thread_rknr7b[["num_chars", "num_words", "num_sentences", "has_special_chars", 
                     "special_chars", "num_unique_words", "avg_word_length"]].head())


In [None]:
# Let's first work on the first body: body[0] is our sample
# Display the full content of the 'body' column for the first row
body_text1 = thread_rknr7b.loc[0, "body"]  # Access the full text of the first entry
print(body_text1)

In [None]:
# Check the type of the 'body' content
print("Type of 'body':", type(body_text1))

In [None]:
# Check the length
print("Length of 'body':", len(body_text1))

# Display the last 100 characters to confirm if there’s more to the text
print("Last part of text:", body_text1[-100:])

In [None]:
# Filter rows where 'has_delta' is 1
delta_rows = thread_rknr7b[thread_rknr7b['has_delta'] == 1]

delta_rows

In [None]:
# Count rows with has_delta as 0 and 1
delta_counts= thread_rknr7b['has_delta'].value_counts()
print(delta_counts)


In [None]:
# Filter rows where has_delta is 1 and display the first few
delta_rows_sample = thread_rknr7b[thread_rknr7b['has_delta'] == 1].head(4)

# Display the 'body' column and other relevant columns if needed
delta_rows_sample[['body', 'has_delta']]


In [None]:
posts_with_delta = thread_rknr7b[thread_rknr7b['has_delta'] == 1]
posts_with_delta.head()

In [None]:
test_comment = posts_with_delta.iloc[0]['body']
print(test_comment)

In [None]:
import re
import pandas as pd

# Define the cleaning function
def clean_and_split_text(text):
    # Remove special characters except ., ?, !
    text = re.sub(r'[^\w\s\.\?\!]', '', text)  
    
    # Split into sentences based on punctuation marks and clean up spaces
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    
    # Remove empty sentences and ensure they are properly formatted
    clean_sentences = [sentence.strip() for sentence in sentences if sentence]
    
    return clean_sentences

# Apply the cleaning function to the 'body' column and add the result as a new column
thread_rknr7b["clean_body"] = thread_rknr7b["body"].apply(clean_and_split_text)

# Display the cleaned and split body
print(thread_rknr7b[["body", "clean_body"]].head())


In [None]:
def clean_and_split_text_improved(text):
    # Remove special characters except ., ?, !
    text = re.sub(r'&#x200B;|\n|\t', ' ', text)  # Removes specific placeholders and newline characters
    text = re.sub(r'[^\w\s\.\?\!]', '', text)  # Removes all other special characters except ., ?, !
    
    # Split into sentences based on punctuation marks and clean up spaces
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    
    # Remove empty sentences and ensure they are properly formatted
    clean_sentences = [sentence.strip() for sentence in sentences if sentence]
    
    return clean_sentences

# Apply the improved cleaning function to the 'body' column
thread_rknr7b["clean_body"] = thread_rknr7b["body"].apply(clean_and_split_text_improved)

# Display the comparison of the first entry
print("Original Body:\n", thread_rknr7b["body"].iloc[0])
print("\nCleaned and Split Sentences:\n", thread_rknr7b["clean_body"].iloc[0])

In [None]:

def advanced_clean_and_split_text(text):
    text = re.sub(r'&#x200B;|\n|\t', ' ', text)  # Remove specific placeholders, newline characters, and tabs
    text = re.sub(r'http\S+', '', text)   # Remove URLs
    #text = re.sub(r'([a-z])([A-Z])', r'\1 \2', text)  # Insert spaces where lowercase and uppercase letters are merged
    #text = re.sub(r"\bIm\b", "I'm", text) # Handle common contractions (expand or correct them)
    #text = re.sub(r"\bIts\b", "It's", text) # Handle common contractions (expand or correct them)
    #text = re.sub(r"\bDont\b", "Don't", text)  # Handle common contractions (expand or correct them)
    text = re.sub(r'[^\w\s\.\?\!]', '', text)  # Remove other special characters except ., ?, !
    
    # Split into sentences based on punctuation marks
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Clean up spaces and filter out any empty sentences
    clean_sentences = [sentence.strip() for sentence in sentences if sentence]
    return clean_sentences

# Apply this new cleaning function to the 'body' column
thread_rknr7b["clean_body"] = thread_rknr7b["body"].apply(advanced_clean_and_split_text)

# Display the comparison for the entry at index 3 to check improvements
print("Original Body:\n", thread_rknr7b["body"].iloc[3])
print("\nAdvanced Cleaned and Split Sentences:\n", thread_rknr7b["clean_body"].iloc[3])





In [None]:
thread_rknr7b.head()


In [None]:
import re

def clean_text(text):
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    text = re.sub(r'&#\w+;|\\x\w\w', '', text)  # Remove HTML entities and unicode hex characters
    text = re.sub(r'[^\w\s\.\?\!]', '', text)  # Remove special characters except for ., ?, !
    text = re.sub(r'(?<=\w)([.?!])(?=\w)', r'\1 ', text)  # Ensure spacing after punctuation
    text = text.strip().lower()  # Trim and convert to lowercase
    return text

# Apply the function
cleaned_test_comment = clean_text(test_comment)
print(cleaned_test_comment)


In [None]:
#thread_rknr7b['body_clean'] = thread_rknr7b[["body"]].astype({"body":"string"}).apply(clean_text,axis=1)

# Apply the clean_text function to each element in the 'body' column
thread_rknr7b['body_clean'] = thread_rknr7b['body'].astype("string").apply(clean_text)

# Display the first few rows to verify
print(thread_rknr7b[['body', 'body_clean']].head())


In [None]:
model.compute_score_split(cleaned_test_comment, 'knowledge')

In [None]:
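# clean_body holds a list of sentences; join it back into one string, since compute_score_split takes text
model.compute_score_split(' '.join(thread_rknr7b["clean_body"].iloc[0]), 'knowledge')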

In [None]:
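# clean_body holds a list of sentences; join it back into one string, since compute_score_split takes text
model.compute_score_split(' '.join(thread_rknr7b["clean_body"].iloc[0]), 'trust')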

In [None]:
model.compute_score_split(cleaned_test_comment, 'similarity')

In [None]:
model.compute_score_split(cleaned_test_comment, 'trust')

In [None]:
# Display all column names to confirm the cleaned body column's name
print(thread_rknr7b.columns)



In [None]:
def compute_knowledge_scores(row):
    # clean_body is a list of sentences; join it into one string before scoring
    mean, max_score, min_score, std = model.compute_score_split(' '.join(row['clean_body']), 'knowledge')
    return pd.Series([row['clean_body'], row['has_delta'], mean, max_score, min_score, std],
                     index=['clean_body', 'has_delta', 'knowledge_mean', 'knowledge_max', 'knowledge_min', 'knowledge_std'])
# Dataframe for knowledge scores
knowledge_scores_df = thread_rknr7b.apply(compute_knowledge_scores, axis=1)
# Display the first few rows of the knowledge scores dataframe
print("Knowledge Scores Dataframe:\n", knowledge_scores_df.head())


In [None]:
def compute_knowledge_scores(row):
    mean, max_score, min_score, std = model.compute_score_split(row['body_clean'], 'knowledge')
    return pd.Series([mean, max_score, min_score, std], index=['knowledge_mean', 'knowledge_max', 'knowledge_min', 'knowledge_std'])

thread_rknr7b[['knowledge_mean', 'knowledge_max', 'knowledge_min', 'knowledge_std']] = thread_rknr7b.apply(compute_knowledge_scores, axis=1)

In [None]:
def compute_similarity_scores(row):
    mean, max_score, min_score, std = model.compute_score_split(row['body_clean'], 'similarity')
    return pd.Series([mean, max_score, min_score, std], index=['similarity_mean', 'similarity_max', 'similarity_min', 'similarity_std'])

thread_rknr7b[['similarity_mean', 'similarity_max', 'similarity_min', 'similarity_std']] = thread_rknr7b.apply(compute_similarity_scores, axis=1)

In [None]:
def compute_trust_scores(row):
    # clean_body is a list of sentences; join it into one string before scoring
    mean, max_score, min_score, std = model.compute_score_split(' '.join(row['clean_body']), 'trust')
    return pd.Series([row['clean_body'], row['has_delta'], mean, max_score, min_score, std],
                     index=['clean_body', 'has_delta', 'trust_mean', 'trust_max', 'trust_min', 'trust_std'])


In [None]:
def compute_trust_scores(row):
    mean, max_score, min_score, std = model.compute_score_split(row['body_clean'], 'trust')
    return pd.Series([mean, max_score, min_score, std], index=['trust_mean', 'trust_max', 'trust_min', 'trust_std'])

thread_rknr7b[['trust_mean', 'trust_max', 'trust_min', 'trust_std']] = thread_rknr7b.apply(compute_trust_scores, axis=1)

In [None]:
# List of all 10 dimensions
dimensions = ['knowledge', 'power', 'status', 'trust', 'support', 'romance', 
              'similarity', 'identity', 'fun', 'conflict']

# Function to compute scores for a specific dimension
def compute_dimension_scores(row, dimension):
    # clean_body is a list of sentences; join it into one string before scoring
    mean, max_score, min_score, std = model.compute_score_split(' '.join(row['clean_body']), dimension)
    return pd.Series([row['clean_body'], row['has_delta'], mean, max_score, min_score, std],
                     index=['clean_body', 'has_delta', f'{dimension}_mean', f'{dimension}_max', 
                            f'{dimension}_min', f'{dimension}_std'])

# Loop over dimensions and compute scores for each, storing results in separate DataFrames
dimension_score_dfs = {}  # Dictionary to store DataFrames for each dimension
for dim in dimensions:
    score_df = thread_rknr7b.apply(lambda row: compute_dimension_scores(row, dim), axis=1)
    dimension_score_dfs[dim] = score_df  # Store each dimension's DataFrame
    

# Example: Display the first few rows of the knowledge scores DataFrame

print("Knowledge Scores DataFrame:\n", dimension_score_dfs['knowledge'].head())            #1>knowledge
print("trust Scores DataFrame:\n", dimension_score_dfs['trust'].head())                #2>trust
print("similarity Scores DataFrame:\n", dimension_score_dfs['similarity'].head())           #3>similarity
print("status Scores DataFrame:\n", dimension_score_dfs['status'].head())               #4>status
print("support Scores DataFrame:\n", dimension_score_dfs['support'].head())              #5>support
print("power Scores DataFrame:\n", dimension_score_dfs['power'].head())                #6>power
print("identity Scores DataFrame:\n", dimension_score_dfs['identity'].head())             #7>identity
print("conflict Scores DataFrame:\n", dimension_score_dfs['conflict'].head())             #8>conflict
print("fun Scores DataFrame:\n", dimension_score_dfs['fun'].head())                  #9>fun                            
print("romance Scores DataFrame:\n", dimension_score_dfs['romance'].head())              #10>romance
print("*******************end******************************************************************************")




In [None]:
# List of all 10 dimensions
dimensions = ['knowledge', 'power', 'status', 'trust', 'support', 'romance', 
              'similarity', 'identity', 'fun', 'conflict']

# Function to compute scores for a specific dimension
def compute_dimension_scores(row, dimension):
    # clean_body is a list of sentences; join it into one string before scoring
    mean, max_score, min_score, std = model.compute_score_split(' '.join(row['clean_body']), dimension)
    return pd.Series([row['clean_body'], row['has_delta'], mean, max_score, min_score, std],
                     index=['clean_body', 'has_delta', f'{dimension}_mean', f'{dimension}_max', 
                            f'{dimension}_min', f'{dimension}_std'])

# Dictionary to store DataFrames for each dimension
dimension_score_dfs = {}

# Loop over dimensions, compute scores for each, and print results
for dim in dimensions:
    score_df = thread_rknr7b.apply(lambda row: compute_dimension_scores(row, dim), axis=1)
    dimension_score_dfs[dim] = score_df  # Store each dimension's DataFrame
    
    # Print each dimension's scores DataFrame head
    print(f"\n{dim.capitalize()} Scores DataFrame:\n", score_df.head())

print("\n*******************End of All Dimension Scores*******************")


In [None]:
s= "Only a fully trained Jedi Knight, with The Force as his ally, will conquer Vader and his Emperor. If you end your training now, if you choose the quick and easy path, as Vader did, you will become an agent of evil"





In [None]:
import pandas as pd
from tendims import TenDimensionsClassifier  # same module that is imported in the first cell

# Initialize the classifier with your specified models and embeddings directories
classifier = TenDimensionsClassifier(
    models_dir='./models/lstm_trained_models',  # Path where the models are stored
    embeddings_dir='./embeddings',              # Path where the embeddings are stored
    is_cuda=False                               # Set to True if using GPU
)

# Function to compute all ten dimensions' scores for each row's clean_body
def compute_all_dimensions_scores(row):
    text = ' '.join(row['clean_body'])  # clean_body is a list of sentences; join it into one string
    scores_dict = {}
    
    # Iterate over each dimension to calculate the scores
    for dimension in classifier.dimensions_list:
        # Calculate the mean, max, min, and std for each dimension at the sentence level
        mean, max_score, min_score, std = classifier.compute_score_split(text, dimension)
        
        # Store the results in a dictionary with clear naming
        scores_dict.update({
            f"{dimension}_mean": mean,
            f"{dimension}_max": max_score,
            f"{dimension}_min": min_score,
            f"{dimension}_std": std
        })
        
    return pd.Series(scores_dict)

# Apply the function to each row in your dataframe to get dimension scores
dimension_scores_df = thread_rknr7b.apply(compute_all_dimensions_scores, axis=1)

# Combine the new dimension scores with the original dataframe
final_df = pd.concat([thread_rknr7b, dimension_scores_df], axis=1)

# Display the final dataframe with dimension scores
print(final_df.head())


In [None]:
import re
import numpy as np
import pandas as pd
import tendims

# Initialize the classifier with the same arguments as in the first cells of this notebook
model = tendims.TenDimensionsClassifier(is_cuda=False, models_dir='./models/lstm_trained_models',
                                        embeddings_dir='./embeddings')

# Assuming `thread_rknr7b` is your dataset
# Apply cleaning function to split the 'body' into sentences and save it in 'clean_body'
def clean_and_split_text(text):
    # Replace this with your cleaning function
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [sentence.strip() for sentence in sentences if sentence]

thread_rknr7b["clean_body"] = thread_rknr7b["body"].apply(clean_and_split_text)

# Function to calculate scores for a single dimension
def compute_dimension_scores(row, dimension):
    scores = [model.compute_score_split(sentence, dimension) for sentence in row['clean_body']]
    means = [score[0] for score in scores]
    max_scores = [score[1] for score in scores]
    min_scores = [score[2] for score in scores]
    std_devs = [score[3] for score in scores]
    return pd.Series({
        f'{dimension}_mean': np.mean(means),
        f'{dimension}_max': np.max(max_scores),
        f'{dimension}_min': np.min(min_scores),
        f'{dimension}_std': np.std(means)
    })

# Calculate scores for each dimension
dimensions = ['knowledge', 'trust', 'support', 'similarity', 'power', 'status', 'identity', 'fun', 'romance', 'conflict']
for dim in dimensions:
    # Apply the scoring function to each row
    score_df = thread_rknr7b.apply(lambda row: compute_dimension_scores(row, dim), axis=1)
    # Concatenate the resulting scores to your main DataFrame
    thread_rknr7b = pd.concat([thread_rknr7b, score_df], axis=1)

# Display the first few rows to verify
print(thread_rknr7b.head())


In [None]:
# Set the first cleaned body in a variable
test_std_body1 = thread_rknr7b["clean_body"].iloc[0]


# Number of sentences in the cleaned body
num_sentences = len(test_std_body1)

print("Number of Sentences:", num_sentences)


# Initialize a list to store knowledge scores for each sentence
knowledge_scores = []

# Loop through each sentence in the cleaned body and compute the knowledge score
for sentence in test_std_body1:
    score = model.compute_score(sentence, 'knowledge')
    knowledge_scores.append(score)

# Calculate mean, max, min, and standard deviation of the scores
mean_score = sum(knowledge_scores) / len(knowledge_scores)
max_score = max(knowledge_scores)
min_score = min(knowledge_scores)
std_dev = (sum((x - mean_score) ** 2 for x in knowledge_scores) / len(knowledge_scores)) ** 0.5

# Print the results
print("Knowledge Scores for Each Sentence:", knowledge_scores)
print("Mean Knowledge Score:", mean_score)
print("Max Knowledge Score:", max_score)
print("Min Knowledge Score:", min_score)
print("Standard Deviation of Knowledge Scores:", std_dev)


In [None]:
import numpy as np
import pandas as pd

# Function to compute scores for a specific dimension with sentence-level aggregation
def compute_sentence_level_scores(row, dimension):
    # Get the list of sentences in the body
    sentences = row['clean_body']
    # Calculate the score for each sentence, handling empty scores
    scores = [model.compute_score(sentence, dimension) for sentence in sentences if sentence]
    
    # Check if scores is not empty to avoid calculation on empty lists
    if scores:
        mean_score = np.mean(scores)
        max_score = np.max(scores)
        min_score = np.min(scores)
        std_score = np.std(scores)
    else:
        mean_score, max_score, min_score, std_score = np.nan, np.nan, np.nan, np.nan
    
    return pd.Series([row['clean_body'], row['has_delta'], mean_score, max_score, min_score, std_score],
                     index=['clean_body', 'has_delta', f'{dimension}_mean', f'{dimension}_max', 
                            f'{dimension}_min', f'{dimension}_std'])

# Apply this function for each dimension in the dimensions list
dimension_score_dfs = {}

for dim in dimensions:
    score_df = thread_rknr7b.apply(lambda row: compute_sentence_level_scores(row, dim), axis=1)
    dimension_score_dfs[dim] = score_df  # Store each dimension's DataFrame
    print(f"\n{dim.capitalize()} Scores DataFrame:\n", score_df.head())


In [None]:
thread_rknr7b.head()

Extracting 3 dimensions for all posts

In [None]:
m_posts = pd.read_csv("../data/posts_final.csv", index_col=False)

In [None]:
# Clean text but keep sentence structure due to sentence-level classification
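def clean_text(row):
    # Accept either a plain string or a one-column row (as passed by DataFrame.apply with axis=1);
    # indexing a plain string with [0] would keep only its first character
    text = str(row.iloc[0]) if isinstance(row, pd.Series) else str(row)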
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    text = re.sub(r'[^\w\s\.\?\!]', '', text)  # Remove special characters except for ., ?, !
    text = text.strip()  # Remove leading and trailing spaces
    text = text.lower()  # Convert to lowercase
    text = text.replace('x200b', '') # Remove x200b
    return text

cleaned_test_comment = clean_text(test_comment)
print(cleaned_test_comment)

In [None]:
m_posts['body_clean'] = m_posts[["body"]].astype({"body":"string"}).apply(clean_text,axis=1)

In [None]:
def compute_knowledge_scores(row):
    mean, max_score, min_score, std = model.compute_score_split(row['body_clean'], 'knowledge')
    return pd.Series([mean, max_score, min_score, std], index=['knowledge_mean', 'knowledge_max', 'knowledge_min', 'knowledge_std'])

m_posts[['knowledge_mean', 'knowledge_max', 'knowledge_min', 'knowledge_std']] = m_posts.apply(compute_knowledge_scores, axis=1)

In [None]:
def compute_similarity_scores(row):
    mean, max_score, min_score, std = model.compute_score_split(row['body_clean'], 'similarity')
    return pd.Series([mean, max_score, min_score, std], index=['similarity_mean', 'similarity_max', 'similarity_min', 'similarity_std'])

m_posts[['similarity_mean', 'similarity_max', 'similarity_min', 'similarity_std']] = m_posts.apply(compute_similarity_scores, axis=1)

In [None]:
def compute_trust_scores(row):
    mean, max_score, min_score, std = model.compute_score_split(row['body_clean'], 'trust')
    return pd.Series([mean, max_score, min_score, std], index=['trust_mean', 'trust_max', 'trust_min', 'trust_std'])

m_posts[['trust_mean', 'trust_max', 'trust_min', 'trust_std']] = m_posts.apply(compute_trust_scores, axis=1)

In [None]:
m_posts.head()

In [None]:
# # examples
# sentences = {
# 'knowledge' : [
#     "Only a fully trained Jedi Knight, with The Force as his ally, will conquer Vader and his Emperor. If you end your training now, if you choose the quick and easy path, as Vader did, you will become an agent of evil",
#     "Well, in layman's terms, you use a rotating magnetic field to focus a narrow beam of gravitons; these in turn fold space-time consistent with Weyl tensor dynamics until the space-time curvature becomes infinitely large and you have a singularity",
#     "Since positronic signatures have only been known to emanate from androids such as myself, it is logical to theorize that there is an android such as myself on Kolarus III",
# ],

# 'power' : [
#     "Now if you don't want to be the fifth person ever to die in meta-shock from a planar rift, I suggest you get down behind that desk and don't move until we give you the signal",
#     "You can ask any price you want, but you must give me those letters ",
#     "Right now you're in no position to ask questions! And your snide remarks..."
# ],

# 'status' : [
#     "I want to thank you, sir, for giving me the opportunity to work",
#     "Frankie, you're a good old man, and you've been loyal to my Father for years...so I hope you can explain what you mean",
#     "And we drink to her, and we all congratulate her on her wonderful accomplishment during this last year...her great success in A Doll's House!"
# ],

# 'trust' : [
#     "I'm trying to tell you – and this is where you have to trust me – but, I think your life might be in real danger",
#     "Mr. Lebowski is prepared to make a generous offer to you to act as courier once we get instructions for the money",
#     "Take the Holy Gospels in your hand and swear to tell the whole truth concerning everything you will be asked"
# ],

# 'support' : [
#     "I'm sorry, I just feel like... I know I shouldn't ask, I just need some kind of help, I just, I have a deadline tomorrow",
#     "Look, Dave, I know that you're sincere and that you're trying to do a competent job, and that you're trying to be helpful, but I can assure the problem is with the AO-units, and with your test gear",
#     "Well... listen, if you need any help, you know, back up, call me, OK?"
# ],

# 'romance' : [
#     "I'm going to marry the woman I love",
#     "If you are truly wild at heart, you'll fight for your dreams... Don’t turn away from love, Sailor ",
#     "You admit to me you do not love your fiance?"
# ],

# 'identity' : [
#     "Hey, I know what I'm talkin' about, black women ain't the same as white women ",
#     "That's how it was in the old world, Pop, but this is not Sicily",
#     "But, as you are so fond of observing, Doctor, I'm not human"
# ],

# 'fun' : [
#     "It’s just funny...who needs a serial psycho in the woods with a chainsaw when we have ourselves",
#     "I do enjoy playing bingo, if you'd like to join me for a game tomorrow night at church you’re welcome to",
#     "Oh, I'm sure it’s a lot of fun, 'cause the Incas did it, you know, and-and they-they-they were a million laughs"
# ],

# 'conflict' : [
#     "Forgive me for askin', son, and I don’t mean to belabor the obvious, but why is it that you’ve got your head so far up your own ass?",
#     "If you're lying to me you poor excuse for a human being, I'm gonna blow your brains all over this car",
#     "I couldn't give a shit if you believe me or not, and frankly I'm too tired to prove it to you"
# ]}

In [None]:
# for dim in sentences:
#     print(f' === {dim.upper()} ===')
#     for s in sentences[dim]:
#         score = model.compute_score(s, dim)
#         print (f'{s} -- {dim}={score:.2f}')