# B.1 - Preprocessing, Network Construction, and Language Detection

## Necessary Input Files:
- `goals_with_attributes.json`: Complete goals data with titles, descriptions, and metadata
- `users_with_goals.json`: User-goal relationships showing which goals each user has included
- `../02_preprocessing/lid.176.bin`: fastText language detection model (for Part II)

## Workflow:

### Part I: Network Construction (Sections 1-9)
1. **Load and Filter Data**: Load goal and user data, filter goals to only those with valid descriptions
2. **Create Co-occurrence Network**: Build initial network where goals are connected if they appear in the same user's list
3. **Show Sample Goals**: Display examples of goals with/without descriptions
4. **Prepare Text for Embedding**: Combine title and description for ALL goals
5. **Compute Embeddings**: Use sentence transformers to create embeddings (saves to `goal_embeddings.npy` for reuse)
6. **Find Similar Goals**: Use cosine similarity threshold (0.9) to group similar goals via Union-Find algorithm
7. **Analyze Similar Groups**: Show statistics and examples of all groups found
8. **Export Mapping**: Create Excel file with goal-to-representative mappings for validation
9. **Create Filtered Network**: Rebuild network using only representative goals and export to pickle file

### Part II: Network and Language Inspection
This part can run independently of Part I by loading the saved network file:
- **Load Network**: Read the filtered network from pickle file
- **Inspect Network**: Show connected components statistics
- **Language Detection (Section 10)**: Use fastText to detect language of each goal and export to Excel

## Output Files:
- `goal_embeddings.npy`: Cached sentence embeddings for all goals
- `goal_embedding_ids.json`: Goal IDs corresponding to the embeddings
- `goal_mapping_validation.xlsx`: Excel file with three sheets:
  - 'Goal Mapping': All goals in groups with their representatives
  - 'Summary': Overview statistics of the merging process
  - 'Representatives': Final list of representative goals
- `filtered_network.pkl`: NetworkX graph with merged goals (representative nodes only)
- `filtered_network_nodes.xlsx`: Node attributes including language detection results

After the first run of this notebook, the embeddings were cached so that later runs could skip the expensive computation. The cached files are too large to store on GitHub, so the notebook has to recompute the embeddings when it is run from a fresh clone. The same applies to the fastText language detection model (`lid.176.bin`), which was downloaded beforehand and could not be stored on GitHub either. You can either refer to the outputs shown in this notebook or download the model yourself from https://fasttext.cc/docs/en/language-identification.html.

## 1. Setup and Load Data
Import necessary libraries and load the input JSON files.

In [1]:
# Import libraries
import json
import numpy as np
import pandas as pd
from collections import Counter, defaultdict
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import os

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)



In [2]:
# Load data files
with open('../Data/goals_with_attributes.json', 'r', encoding='utf-8') as f:
    goals = json.load(f)
print(f"Loaded {len(goals):,} goals")

with open('../Data/users_with_goals.json', 'r', encoding='utf-8') as f:
    users_data = json.load(f)
print(f"Loaded {len(users_data):,} users")

# Explore data structure
sample_goal_id = list(goals.keys())[0]
print(f"\nSample Goal (ID: {sample_goal_id}):")
print(f"  Title: {goals[sample_goal_id]['title']}")
print(f"  Description: {goals[sample_goal_id]['description'][:100] if goals[sample_goal_id]['description'] else '(empty)'}...")
print(f"  Wants to do: {goals[sample_goal_id]['wants_to_do']}")
print(f"  Have done: {goals[sample_goal_id]['have_done']}")
print(f"  Comments: {len(goals[sample_goal_id]['comments'])} comments")
print(f"  Tags: {len(goals[sample_goal_id]['tags'])} tags")

sample_user = list(users_data.keys())[0]
print(f"\nSample User ({sample_user}):")
print(f"  Number of goals: {len(users_data[sample_user])}")
print(f"  First goal: {users_data[sample_user][0]}")

# Filter goals with valid descriptions NOW (before network creation)
goals_with_descriptions = {}
for gid, gdata in goals.items():
    desc = gdata.get('description', '')
    if desc.strip():  # Non-empty after stripping whitespace
        goals_with_descriptions[gid] = gdata

print(f"\nGoals with valid descriptions: {len(goals_with_descriptions):,} ({100*len(goals_with_descriptions)/len(goals):.1f}%)")
print(f"Goals without valid descriptions: {len(goals) - len(goals_with_descriptions):,} ({100*(len(goals)-len(goals_with_descriptions))/len(goals):.1f}%)")

Loaded 231,269 goals
Loaded 10,087 users

Sample Goal (ID: jE2QgdsE):
  Title: Try Geocaching
  Description: (empty)...
  Wants to do: 579
  Have done: 156
  Comments: 12 comments
  Tags: 35 tags

Sample User (spockaholic):
  Number of goals: 86
  First goal: {'id': 'jE2QgdsE', 'href': '/goal/jE2QgdsE', 'text': 'Try Geocaching'}

Goals with valid descriptions: 3,394 (1.5%)
Goals without valid descriptions: 227,875 (98.5%)


## 2. Create Initial Co-occurrence Network
Build network where goals are connected if they co-occur in a user's list (only includes goals with valid descriptions).

In [3]:
# Build co-occurrence network (only goals with descriptions)
G = nx.Graph()

# Add only goals with descriptions as nodes (with all attributes)
for gid, gdata in goals_with_descriptions.items():
    G.add_node(gid, 
               title=gdata['title'],
               description=gdata['description'],
               wants_to_do=gdata['wants_to_do'],
               have_done=gdata['have_done'],
               comments=gdata['comments'],
               tags=gdata['tags'])

# Build edges from co-occurrences
edge_counter = Counter()

for username, goals_list in users_data.items():
    # Extract goal IDs that have descriptions
    user_goal_ids = [item['id'] for item in goals_list 
                     if 'id' in item and item['id'] in goals_with_descriptions]
    
    # Create edges between all pairs
    for i, goal_i in enumerate(user_goal_ids):
        for goal_j in user_goal_ids[i+1:]:
            edge = tuple(sorted([goal_i, goal_j]))
            edge_counter[edge] += 1

# Add edges to graph (with weights = co-occurrence count)
for (g1, g2), weight in edge_counter.items():
    G.add_edge(g1, g2, weight=weight)

# Calculate network statistics
num_nodes = G.number_of_nodes()
num_edges = G.number_of_edges()
max_possible_edges = (num_nodes * (num_nodes - 1)) // 2
density = nx.density(G)

print("\nInitial network statistics:")
print(f"  Nodes (goals): {num_nodes:,}")
print(f"  Edges: {num_edges:,}")
print(f"  Max possible edges: {max_possible_edges:,}")
print(f"  Network density: {density:.8f}")
print(f"  Average degree: {2*num_edges/num_nodes:.2f}")

# Connected components
num_components = nx.number_connected_components(G)
largest_cc = max(nx.connected_components(G), key=len) if num_components > 0 else set()
print(f"\n  Connected components: {num_components:,}")
print(f"  Largest component size: {len(largest_cc):,} nodes ({100*len(largest_cc)/num_nodes:.1f}%)")

# Store for later comparison
original_num_nodes = num_nodes
original_num_edges = num_edges
original_density = density


Initial network statistics:
  Nodes (goals): 3,394
  Edges: 258,027
  Max possible edges: 5,757,921
  Network density: 0.04481253
  Average degree: 152.05

  Connected components: 32
  Largest component size: 3,359 nodes (99.0%)
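The edge-counting step and the reported density can be reproduced on a small scale. The following is a minimal sketch, not the notebook's exact code: `toy_users`, `count_cooccurrences`, and `density` are hypothetical names, and the toy data only mimics the shape of `users_with_goals.json`. Unlike the cell above, it de-duplicates each user's goals first, which is an assumption about what is wanted if a goal ever appears twice in one list.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data in the same shape as users_with_goals.json
toy_users = {
    "alice": [{"id": "g1"}, {"id": "g2"}, {"id": "g3"}],
    "bob": [{"id": "g2"}, {"id": "g3"}],
    "carol": [{"id": "g3"}, {"text": "entry without an id"}],
}


def count_cooccurrences(users):
    """Count how many users list each unordered pair of goals."""
    counts = Counter()
    for items in users.values():
        # De-duplicate and sort so each pair has one canonical key
        goal_ids = sorted({item["id"] for item in items if "id" in item})
        for pair in combinations(goal_ids, 2):
            counts[pair] += 1
    return counts


def density(num_nodes, num_edges):
    """Undirected graph density: edges divided by n * (n - 1) / 2."""
    if num_nodes < 2:
        return 0.0
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


edge_weights = count_cooccurrences(toy_users)
print(dict(edge_weights))

# Plugging in the node and edge counts printed above
print(f"{density(3394, 258027):.8f}")
```

With the node and edge counts from the output above, `density` gives the same value that `nx.density(G)` reported.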


## 3. Show Sample Goals
Display examples of goals with and without valid descriptions to understand what's included/excluded.

In [4]:
# Show examples of filtered goals
print("\nSample goals with valid descriptions (used in initial network):")
for i, (gid, gdata) in enumerate(list(goals_with_descriptions.items())[:5], 1):
    print(f"\n{i}. {gdata['title']}")
    desc = gdata['description']
    print(f"   Description: {desc[:150]}...")
    print(f"   Stats: {gdata['wants_to_do']} want to do, {gdata['have_done']} have done, {len(gdata['comments'])} comments")

print("\nSample goals without valid descriptions:")
goals_without_desc = {k: v for k, v in goals.items() if k not in goals_with_descriptions}
for i, (gid, gdata) in enumerate(list(goals_without_desc.items())[:5], 1):
    desc = gdata.get('description', '')
    print(f"\n{i}. {gdata['title']}")
    print(f"   Description: '{desc}'")
    print(f"   Stats: {gdata['wants_to_do']} want to do, {gdata['have_done']} have done")


Sample goals with valid descriptions (used in initial network):

1. Make ice cream from scratch
   Description: Ice cream or ice-cream is a frozen dessert usually made from dairy products, such as milk and cream, and often combined with fruits or other ingredien...
   Stats: 3010 want to do, 973 have done, 46 comments

2. Leave an inspirational note inside a book for someone to find
   Description: Imagine the joy of discovering a heartfelt message nestled within the pages of a book, a simple yet profound gesture that can brighten someone's day o...
   Stats: 9856 want to do, 966 have done, 56 comments

3. Fly a kite
   Description: A kite is a tethered heavier-than-air or lighter-than-air craft with wing surfaces that react against the air to create lift and drag forces. A kite c...
   Stats: 3030 want to do, 1199 have done, 37 comments

4. Go skydiving
   Description: Parachuting, including also skydiving, is a method of transiting from a high point in the atmosphere to the surface 

## 4. Prepare Text for ALL Goals (for Similarity Analysis)
For similarity mapping, embed all goals: the title plus the description when the description is non-empty, otherwise the title alone.

In [5]:
# Prepare text for embedding - USE ALL GOALS
goal_ids_for_embedding = list(goals.keys())  # ALL goals, not just with descriptions
goal_combined_texts = []

for gid in goal_ids_for_embedding:
    title = goals[gid]['title']
    desc = goals[gid].get('description', '')
    
    # Check if description is valid (non-empty after stripping)
    description = desc.strip()
    
    # Combine title and description
    combined = f"{title}. {description}" if description else title
    goal_combined_texts.append(combined.strip())

print(f"Prepared {len(goal_combined_texts):,} texts for embedding (ALL goals)")
print(f"  - {len(goals_with_descriptions):,} with valid descriptions")
print(f"  - {len(goals) - len(goals_with_descriptions):,} with only title (no valid description)")

# Text statistics
text_lengths = [len(t) for t in goal_combined_texts]
print(f"\nText length statistics:")
print(f"  Mean: {np.mean(text_lengths):.0f} characters")
print(f"  Median: {np.median(text_lengths):.0f} characters")
print(f"  Max: {np.max(text_lengths):,} characters")
print(f"  Min: {np.min(text_lengths):,} characters")

Prepared 231,269 texts for embedding (ALL goals)
  - 3,394 with valid descriptions
  - 227,875 with only title (no valid description)

Text length statistics:
  Mean: 39 characters
  Median: 29 characters
  Max: 2,317 characters
  Min: 1 characters


## 5. Create Embeddings using Sentence Transformer
Load the pre-trained model and generate embeddings for all goal texts. Note that the output shown for this cell comes from a run in which the embeddings had already been created, so it reflects the cache-loading branch rather than a fresh computation.

In [6]:
# Load sentence transformer model and create embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')  # Fast and efficient model

# Check if embeddings already exist (to save time on reruns)
embeddings_file = '../Data/Embeddings, Similarity and Language Detection/goal_embeddings.npy'
embedding_ids_file = '../Data/Embeddings, Similarity and Language Detection/goal_embedding_ids.json'

if os.path.exists(embeddings_file) and os.path.exists(embedding_ids_file):
    print(f"\nLoading existing embeddings from {embeddings_file}...")
    embeddings = np.load(embeddings_file)
    
    with open(embedding_ids_file, 'r', encoding='utf-8') as f:
        saved_goal_ids = json.load(f)
    
    # Verify that saved IDs match current goal_ids_for_embedding
    if saved_goal_ids == goal_ids_for_embedding:
        print(f"Loaded embeddings for {len(embeddings):,} goals")
        print(f"  Shape: {embeddings.shape}")
        print(f"  Embedding dimension: {embeddings.shape[1]}")
    else:
        print("Warning: Saved embeddings don't match current goals. Recomputing...")
        embeddings = model.encode(goal_combined_texts, show_progress_bar=True, batch_size=32)
        np.save(embeddings_file, embeddings)
        with open(embedding_ids_file, 'w', encoding='utf-8') as f:
            json.dump(goal_ids_for_embedding, f)
        print(f"Created and saved embeddings for {len(embeddings):,} goals")
else:
    print(f"\nCreating embeddings for {len(goal_combined_texts):,} goal texts...")
    print("(This may take a few minutes...)")
    embeddings = model.encode(goal_combined_texts, show_progress_bar=True, batch_size=32)
    
    # Save embeddings for future use
    print(f"\nSaving embeddings to {embeddings_file}...")
    np.save(embeddings_file, embeddings)
    
    with open(embedding_ids_file, 'w', encoding='utf-8') as f:
        json.dump(goal_ids_for_embedding, f)
    
    print(f"Created and saved embeddings for {len(embeddings):,} goals")
    print(f"  Shape: {embeddings.shape}")
    print(f"  Embedding dimension: {embeddings.shape[1]}")

# Show sample embeddings
print(f"\nSample embedding for first goal:")
print(f"  Goal: {goal_combined_texts[0][:80]}...")
print(f"  Embedding (first 10 dims): {embeddings[0][:10]}")
print(f"  L2 norm: {np.linalg.norm(embeddings[0]):.4f}")


Loading existing embeddings from ../Data/Embeddings, Similarity and Language Detection/goal_embeddings.npy...
Loaded embeddings for 231,269 goals
  Shape: (231269, 384)
  Embedding dimension: 384

Sample embedding for first goal:
  Goal: Try Geocaching...
  Embedding (first 10 dims): [ 0.00925845 -0.0417789  -0.02357222 -0.04139975  0.0457349  -0.10426646
 -0.06940327 -0.07827378 -0.04529943  0.03908972]
  L2 norm: 1.0000
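The L2 norm of 1.0000 for the sample suggests that the embeddings are unit-length. If that holds for every row (an assumption, since only the first embedding is printed), cosine similarity reduces to a plain dot product. The sketch below illustrates this with hypothetical two-dimensional vectors in pure Python; the helper names are illustrative only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = l2_norm(v)
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity for arbitrary non-zero vectors."""
    return dot(a, b) / (l2_norm(a) * l2_norm(b))


# Hypothetical vectors standing in for two embeddings
u = normalize([3.0, 4.0])
v = normalize([4.0, 3.0])

print(round(l2_norm(u), 4))
print(round(cosine(u, v), 4), round(dot(u, v), 4))
```

For unit vectors both printed similarity values agree, which is why a threshold on cosine similarity can equally be read as a threshold on the dot product.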


## 6. Find Similar Goals using Cosine Similarity
Group goals whose cosine similarity is at least the threshold (0.9) and select the most popular goal with a valid description as each group's representative.

In [7]:
# Count popularity first (needed for representative selection)
print("Counting goal popularity (included_by_our_users)...")
included_by_our_users = Counter()
for username, goals_list in users_data.items():
    for item in goals_list:
        if 'id' in item:
            included_by_our_users[item['id']] += 1

print(f"Goal popularity statistics:")
pop_values = list(included_by_our_users.values())
print(f"  Mean: {np.mean(pop_values):.1f}")
print(f"  Median: {np.median(pop_values):.0f}")
print(f"  Max: {max(pop_values):,}")
print(f"  Min: {min(pop_values):,}")

print(f"\nNote: Computing full similarity matrix ({len(embeddings):,} x {len(embeddings):,}) would require ~199 GB")
print("Instead, we'll compute similarities in batches and find groups incrementally...")

Counting goal popularity (included_by_our_users)...
Goal popularity statistics:
  Mean: 2.2
  Median: 1
  Max: 1,672
  Min: 1

Note: Computing full similarity matrix (231,269 x 231,269) would require ~199 GB
Instead, we'll compute similarities in batches and find groups incrementally...
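The ~199 GB figure can be checked with simple arithmetic. The sketch below assumes 4 bytes per value (float32) and reports binary gigabytes (GiB); both are assumptions, since the notebook does not state the dtype or unit. `similarity_matrix_gib` is a hypothetical helper.

```python
def similarity_matrix_gib(n_rows, n_cols, bytes_per_value=4):
    """Memory for a dense n_rows x n_cols matrix, in GiB (2**30 bytes)."""
    return n_rows * n_cols * bytes_per_value / 2**30


N_GOALS = 231_269  # number of goals reported above
BATCH_ROWS = 1000  # batch size used in the next cell

full_gib = similarity_matrix_gib(N_GOALS, N_GOALS)
batch_gib = similarity_matrix_gib(BATCH_ROWS, N_GOALS)

print(f"Full matrix: {full_gib:.1f} GiB")
print(f"One batch:   {batch_gib:.2f} GiB")
```

Under these assumptions, a single batch of 1,000 rows against all goals needs less than 1 GiB, which is what makes the batched approach practical.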


In [8]:
# Find similar goal groups using batch processing (memory efficient)
# Uses Union-Find to properly merge groups across batches
SIMILARITY_THRESHOLD = 0.9  # Adjustable threshold
BATCH_SIZE = 1000  # Process 1000 goals at a time

# Check if similarity results already exist
import pickle as pkl
similarity_cache_file = '../Data/Embeddings, Similarity and Language Detection/similarity_results_cache.pkl'

if os.path.exists(similarity_cache_file):
    print(f"\nLoading existing similarity results from {similarity_cache_file}...")
    
    # Load cached results
    with open(similarity_cache_file, 'rb') as f:
        cache_data = pkl.load(f)
    
    # Verify threshold matches
    if cache_data['threshold'] == SIMILARITY_THRESHOLD:
        print(f"Loaded similarity results (threshold = {SIMILARITY_THRESHOLD})")
        
        # Extract cached variables
        similar_groups = cache_data['similar_groups']
        groups_dict = cache_data['groups_dict']
        parent = cache_data['parent']
        total_in_groups = cache_data['total_in_groups']
        num_merged = cache_data['num_merged']
        total_goals_with_desc_in_groups = cache_data['total_goals_with_desc_in_groups']
        
        # Reconstruct find and union functions (needed by subsequent cells)
        def find(x):
            """Find root of element x with path compression"""
            if parent[x] != x:
                parent[x] = find(parent[x])
            return parent[x]
        
        def union(x, y):
            """Union two elements"""
            root_x = find(x)
            root_y = find(y)
            if root_x != root_y:
                parent[root_y] = root_x
        
        print(f"\nLoaded {len(similar_groups):,} groups of similar goals")
        print(f"\nGroup statistics:")
        print(f"  Total goals in groups: {total_in_groups:,}")
        print(f"  Goals with descriptions in groups: {total_goals_with_desc_in_groups:,}")
        print(f"  Groups (representatives): {len(similar_groups):,}")
        print(f"  Goals merged: {num_merged:,}")
        print(f"  Reduction: {num_merged:,} goals ({100*num_merged/len(goals):.1f}% of all goals)")
        
    else:
        print(f"Warning: Cached threshold ({cache_data['threshold']}) doesn't match current threshold ({SIMILARITY_THRESHOLD})")
        print("Recomputing similarities with current threshold...")
        # Keep similarity_cache_file intact: the check below already triggers
        # recomputation on a threshold mismatch, and the path is needed for saving

if not os.path.exists(similarity_cache_file) or cache_data['threshold'] != SIMILARITY_THRESHOLD:
    # Compute similarities from scratch
    print(f"\nFinding similar groups (threshold = {SIMILARITY_THRESHOLD})...")
    print(f"Processing in batches of {BATCH_SIZE} with proper cross-batch merging...\n")
    
    # Union-Find data structure to properly merge groups
    parent = {gid: gid for gid in goal_ids_for_embedding}
    
    def find(x):
        """Find root of element x with path compression"""
        if parent[x] != x:
            parent[x] = find(parent[x])
        return parent[x]
    
    def union(x, y):
        """Union two elements"""
        root_x = find(x)
        root_y = find(y)
        if root_x != root_y:
            parent[root_y] = root_x
    
    num_batches = (len(embeddings) + BATCH_SIZE - 1) // BATCH_SIZE
    
    # Process batches and build union-find structure
    for batch_idx in range(num_batches):
        start_idx = batch_idx * BATCH_SIZE
        end_idx = min((batch_idx + 1) * BATCH_SIZE, len(embeddings))
        
        if batch_idx % 10 == 0:
            print(f"  Processing batch {batch_idx + 1}/{num_batches} (goals {start_idx:,} to {end_idx:,})...")
        
        # Compute similarities for this batch against ALL embeddings
        batch_embeddings = embeddings[start_idx:end_idx]
        batch_similarities = cosine_similarity(batch_embeddings, embeddings)
        
        # For each goal in batch, find all similar goals and union them
        for i in range(len(batch_embeddings)):
            global_i = start_idx + i
            gid_i = goal_ids_for_embedding[global_i]
            
            # Find all goals similar to this one (across all batches)
            similar_indices = np.where(batch_similarities[i] >= SIMILARITY_THRESHOLD)[0]
            
            # Union all similar goals together
            for j in similar_indices:
                gid_j = goal_ids_for_embedding[j]
                union(gid_i, gid_j)
    
    print("\nSimilarity computation complete. Building final groups...\n")
    
    # Group goals by their root (representative in union-find)
    groups_dict = defaultdict(list)
    
    for gid in goal_ids_for_embedding:
        root = find(gid)
        groups_dict[root].append(gid)
    
    # Filter to groups with size > 1 and at least one description
    similar_groups = []
    
    for root, members in groups_dict.items():
        if len(members) > 1:
            # Check if at least one member has a valid description
            members_with_desc = [gid for gid in members if gid in goals_with_descriptions]
            
            if len(members_with_desc) > 0:
                # Sort by popularity - only among those with descriptions
                members_with_desc_sorted = sorted(members_with_desc, 
                                                  key=lambda x: included_by_our_users.get(x, 0), 
                                                  reverse=True)
                
                # Most popular goal WITH description becomes representative
                representative = members_with_desc_sorted[0]
                
                # Sort all members by popularity
                members_sorted = sorted(members, key=lambda x: included_by_our_users.get(x, 0), reverse=True)
                
                similar_groups.append({
                    'representative': representative,
                    'members': members_sorted,
                    'members_with_desc': members_with_desc_sorted,
                    'included_by_our_users': [included_by_our_users.get(gid, 0) for gid in members_sorted]
                })
    
    print(f"Found {len(similar_groups):,} groups of similar goals (with at least one description)")
    
    # Calculate statistics
    total_in_groups = sum(len(g['members']) for g in similar_groups)
    num_merged = total_in_groups - len(similar_groups)
    total_goals_with_desc_in_groups = sum(len(g['members_with_desc']) for g in similar_groups)
    
    print(f"\nGroup statistics:")
    print(f"  Total goals in groups: {total_in_groups:,}")
    print(f"  Goals with descriptions in groups: {total_goals_with_desc_in_groups:,}")
    print(f"  Groups (representatives): {len(similar_groups):,}")
    print(f"  Goals merged: {num_merged:,}")
    print(f"  Reduction: {num_merged:,} goals ({100*num_merged/len(goals):.1f}% of all goals)")
    
    # Save results to cache file
    print(f"\nSaving similarity results to {similarity_cache_file}...")
    cache_data = {
        'threshold': SIMILARITY_THRESHOLD,
        'similar_groups': similar_groups,
        'groups_dict': dict(groups_dict),  # Convert defaultdict to dict for pickling
        'parent': parent,
        'total_in_groups': total_in_groups,
        'num_merged': num_merged,
        'total_goals_with_desc_in_groups': total_goals_with_desc_in_groups
    }
    
    with open(similarity_cache_file, 'wb') as f:
        pkl.dump(cache_data, f)
    
    print("Similarity results cached successfully")



Loading existing similarity results from ../Data/Embeddings, Similarity and Language Detection/similarity_results_cache.pkl...
Loaded similarity results (threshold = 0.9)

Loaded 283 groups of similar goals

Group statistics:
  Total goals in groups: 866
  Goals with descriptions in groups: 787
  Groups (representatives): 283
  Goals merged: 583
  Reduction: 583 goals (0.3% of all goals)
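One property of Union-Find grouping is worth keeping in mind when validating the groups: merging is transitive, so two goals can end up in the same group even if their own similarity is below the threshold, as long as a chain of similar goals connects them. The following is a minimal sketch with hypothetical IDs and similarity values; `group_by_threshold` is an illustrative helper, and it uses an iterative `find` instead of the recursive one above.

```python
from collections import defaultdict


def group_by_threshold(ids, pair_similarity, threshold):
    """Group ids whose pairwise similarity reaches threshold (transitively)."""
    parent = {x: x for x in ids}

    def find(x):
        # Locate the root, then compress the path iteratively
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            nxt = parent[x]
            parent[x] = root
            x = nxt
        return root

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    for (a, b), sim in pair_similarity.items():
        if sim >= threshold:
            union(a, b)

    groups = defaultdict(list)
    for x in ids:
        groups[find(x)].append(x)
    # Keep only groups with more than one member, as in the notebook
    return sorted(sorted(m) for m in groups.values() if len(m) > 1)


# Hypothetical similarities: a~b and b~c pass the threshold, a~c does not
toy_similarity = {("a", "b"): 0.95, ("b", "c"): 0.92, ("a", "c"): 0.80}
groups = group_by_threshold(["a", "b", "c", "d"], toy_similarity, 0.9)
print(groups)
```

Here `a` and `c` share a group through `b` even though their direct similarity is 0.80, which may help explain why some large groups contain loosely related titles.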


## 7. Analyze Similar Goal Groups
Show statistics for all groups found, not just those with descriptions

In [9]:
# Analyze ALL groups found by Union-Find (including those without descriptions)
print("\nComprehensive group analysis:")

# Count all groups (size > 1)
all_groups_info = []

for root, members in groups_dict.items():
    if len(members) > 1:
        members_with_desc = [gid for gid in members if gid in goals_with_descriptions]
        has_any_description = len(members_with_desc) > 0
        
        all_groups_info.append({
            'members': members,
            'members_with_desc': members_with_desc,
            'has_description': has_any_description,
            'size': len(members)
        })

# Split into two categories
groups_with_desc = [g for g in all_groups_info if g['has_description']]
groups_without_desc = [g for g in all_groups_info if not g['has_description']]

print(f"\nTOTAL GROUPS FOUND (similarity >= {SIMILARITY_THRESHOLD}):")
print(f"  Total groups: {len(all_groups_info):,}")
print(f"  Groups WITH at least 1 description: {len(groups_with_desc):,} ({100*len(groups_with_desc)/len(all_groups_info):.1f}%)")
print(f"  Groups WITHOUT any description: {len(groups_without_desc):,} ({100*len(groups_without_desc)/len(all_groups_info):.1f}%)")

total_goals_in_all_groups = sum(g['size'] for g in all_groups_info)
total_goals_in_groups_with_desc = sum(g['size'] for g in groups_with_desc)
total_goals_in_groups_without_desc = sum(g['size'] for g in groups_without_desc)

print(f"\nGOALS IN GROUPS:")
print(f"  Total goals in all groups: {total_goals_in_all_groups:,}")
print(f"  Goals in groups WITH description: {total_goals_in_groups_with_desc:,}")
print(f"  Goals in groups WITHOUT description: {total_goals_in_groups_without_desc:,}")

print(f"\nEXCLUDED FROM ANALYSIS:")
print(f"  Groups excluded (no descriptions): {len(groups_without_desc):,}")
print(f"  Goals excluded (in groups without descriptions): {total_goals_in_groups_without_desc:,}")
print(f"  Note: These groups were found but excluded from Excel/network because no member has a description")

# Group size statistics
with_desc_sizes = [g['size'] for g in groups_with_desc]
without_desc_sizes = [g['size'] for g in groups_without_desc]

print(f"\nGROUP SIZE STATISTICS:")
print(f"\n  Groups WITH description:")
print(f"    - Mean size: {np.mean(with_desc_sizes):.1f}")
print(f"    - Median size: {np.median(with_desc_sizes):.0f}")
print(f"    - Max size: {max(with_desc_sizes):,}")
print(f"    - Min size: {min(with_desc_sizes):,}")

print(f"\n  Groups WITHOUT description:")
print(f"    - Mean size: {np.mean(without_desc_sizes):.1f}")
print(f"    - Median size: {np.median(without_desc_sizes):.0f}")
print(f"    - Max size: {max(without_desc_sizes):,}")
print(f"    - Min size: {min(without_desc_sizes):,}")


Comprehensive group analysis:

TOTAL GROUPS FOUND (similarity >= 0.9):
  Total groups: 17,098
  Groups WITH at least 1 description: 283 (1.7%)
  Groups WITHOUT any description: 16,815 (98.3%)

GOALS IN GROUPS:
  Total goals in all groups: 82,350
  Goals in groups WITH description: 866
  Goals in groups WITHOUT description: 81,484

EXCLUDED FROM ANALYSIS:
  Groups excluded (no descriptions): 16,815
  Goals excluded (in groups without descriptions): 81,484
  Note: These groups were found but excluded from Excel/network because no member has a description

GROUP SIZE STATISTICS:

  Groups WITH description:
    - Mean size: 3.1
    - Median size: 2
    - Max size: 43
    - Min size: 2

  Groups WITHOUT description:
    - Mean size: 4.8
    - Median size: 2
    - Max size: 738
    - Min size: 2


Show concrete examples from both types of groups (with and without descriptions).

In [10]:
# Show examples of BOTH types of groups
print("\nExample groups with descriptions (included in analysis):")

for i, group_info in enumerate(groups_with_desc[:2], 1):
    members = group_info['members']
    members_with_desc = group_info['members_with_desc']   
    print(f"\nExample {i}: {len(members)} similar goals ({len(members_with_desc)} with descriptions)")
    
    for j, gid in enumerate(members[:5], 1):  # Show first 5
        title = goals[gid]['title']
        has_desc = "[desc]" if gid in goals_with_descriptions else "[none]"
        pop = included_by_our_users.get(gid, 0)
        print(f"  {has_desc} [{j}] {title[:60]:<60} (pop: {pop:>4})")
    
    if len(members) > 5:
        print(f"  ... and {len(members) - 5} more similar goals")

print("\n\n\nExample groups without descriptions (excluded from analysis):")
print("Note: These groups were found but excluded because NO member has a description")

for i, group_info in enumerate(groups_without_desc[:2], 1):
    members = group_info['members']
    print(f"\nExample {i}: {len(members)} similar goals (0 with descriptions)")
    
    for j, gid in enumerate(members[:5], 1):  # Show first 5
        title = goals[gid]['title']
        pop = included_by_our_users.get(gid, 0)
        desc = goals[gid].get('description', '')
        print(f"    [{j}] {title[:60]:<60} (pop: {pop:>4})")
        print(f"        Description: '{desc[:80]}'")
    
    if len(members) > 5:
        print(f"  ... and {len(members) - 5} more similar goals")

print(f"\n\n\nOnly the {len(groups_with_desc):,} groups WITH descriptions are used in the Excel file and network")


Example groups with descriptions (included in analysis):

Example 1: 43 similar goals (1 with descriptions)
  [desc] [1] Leave an inspirational note inside a book for someone to fin (pop: 1183)
  [none] [2] Leave an inspirational note inside of a book for someone to  (pop:    2)
  [none] [3] Leave an inspirational note inside a book for someone        (pop:   56)
  [none] [4] Leave an inspirational note in a book                        (pop:    3)
  [none] [5] Leave an inspirational note inside a library book            (pop:    2)
  ... and 38 more similar goals

Example 2: 3 similar goals (3 with descriptions)
  [desc] [1] Learn French                                                 (pop:  161)
  [desc] [2] Improve my French                                            (pop:   64)
  [desc] [3] Learn to speak French                                        (pop:   20)



Example groups without descriptions (excluded from analysis):
Note: These groups were found but excluded because NO me

## 8. Create Goal Mapping and Export to Excel
Create a mapping from every grouped goal to its representative and export it for validation.

Visualize the representative selection process with detailed examples.
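The selection rule itself is small enough to state as a function. This is a sketch under stated assumptions: `pick_representative` is a hypothetical helper, the goal IDs are invented, and the popularity counts are taken from the "Learn French" example shown earlier.

```python
def pick_representative(members, popularity, described_ids):
    """Return the most popular member that has a description, or None."""
    candidates = [gid for gid in members if gid in described_ids]
    if not candidates:
        # Groups without any described member are excluded in the notebook
        return None
    return max(candidates, key=lambda gid: popularity.get(gid, 0))


# Hypothetical IDs; counts mirror the "Learn French" group above
popularity = {"fr1": 161, "fr2": 64, "fr3": 20, "x1": 500}
described_ids = {"fr1", "fr2", "fr3"}

rep_all_described = pick_representative(["fr2", "fr1", "fr3"], popularity, described_ids)
rep_mixed = pick_representative(["x1", "fr3"], popularity, described_ids)
rep_none = pick_representative(["x1"], popularity, described_ids)

print(rep_all_described, rep_mixed, rep_none)
```

The second call shows the consequence of the rule: a member without a description is never chosen, even when it is the most popular goal in its group.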

In [11]:
# Show example groups
print("\nExample similar goal groups:")
print(f"Showing first 5 groups with selection process:")

for i, group in enumerate(similar_groups[:5], 1):
    rep = group['representative']
    members = group['members']
    members_with_desc = group['members_with_desc']
    pops = group['included_by_our_users']
    
    print(f"\nGroup {i}: {len(members)} similar goals ({len(members_with_desc)} with descriptions)")
    
    for j, (gid, pop) in enumerate(zip(members, pops)):
        title = goals[gid]['title']
        has_desc = "[desc]" if gid in goals_with_descriptions else "[none]"
        is_rep = "<<< SELECTED" if gid == rep else ""
        wants = goals[gid]['wants_to_do']
        have = goals[gid]['have_done']
        print(f"  {has_desc} [{j+1}] {title[:45]:<45} (pop: {pop:>4}, {wants}W/{have}D) {is_rep}")
    
    print(f"\n  Selected: {goals[rep]['title']}")
    print(f"  Reason: Most popular goal WITH description ({included_by_our_users.get(rep, 0)} occurrences)")
    
    if i == 5:
        print(f"\n... (showing first 5 of {len(similar_groups)} total groups)")


Example similar goal groups:
Showing first 5 groups with selection process:

Group 1: 43 similar goals (1 with descriptions)
  [desc] [1] Leave an inspirational note inside a book for (pop: 1183, 9856W/966D) <<< SELECTED
  [none] [2] Leave an inspirational note for someone to fi (pop:   59, 390W/54D) 
  [none] [3] Leave an inspirational note inside a book for (pop:   56, 260W/33D) 
  [none] [4] Leave an inspirational note in a book for som (pop:   44, 295W/37D) 
  [none] [5] Leave an inspirational note inside a book for (pop:   24, 74W/10D) 
  [none] [6] leave an inspirational note inside a book for (pop:    6, 39W/4D) 
  [none] [7] Leave an inspirational note inside a library  (pop:    5, 41W/3D) 
  [none] [8] leave an inspirational note in a book for som (pop:    4, 11W/3D) 
  [none] [9] Leave an inspirational note in a book         (pop:    3, 14W/3D) 
  [none] [10] Leave an inspirational note in a library book (pop:    3, 6W/0D) 
  [none] [11] Leave an inspirational note inside of
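
The selection rule printed above (the most popular member that has a description) can be sketched as a small standalone function. This is a minimal sketch, not the notebook's actual selection code: `pick_representative` and its arguments are hypothetical names, and the fallback to the most popular member when no member has a description is an assumption.

```python
def pick_representative(members, popularity, has_description):
    """Return the most popular member that has a description.

    members: list of goal IDs in one similarity group
    popularity: dict mapping goal ID -> number of users who included it
    has_description: set of goal IDs that have a description
    """
    with_desc = [gid for gid in members if gid in has_description]
    # Assumption: fall back to all members if none has a description
    candidates = with_desc or members
    # max() returns the first maximal element, so ties keep list order
    return max(candidates, key=lambda gid: popularity.get(gid, 0))


# Toy group with made-up IDs: "b" is the most popular but has no description
members = ["a", "b", "c"]
popularity = {"a": 10, "b": 50, "c": 3}
has_description = {"a", "c"}
representative = pick_representative(members, popularity, has_description)
print(representative)
```

Under this rule a very popular goal without a description never becomes the representative as long as at least one member has one.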

Build the complete mapping dictionary and prepare data for Excel export.

In [12]:
# Create mapping dictionary
goal_mapping = {}

# Map each goal in a group to its representative
for group in similar_groups:
    rep = group['representative']
    for member in group['members']:
        goal_mapping[member] = rep

print(f"Created mapping for {len(goal_mapping):,} goals")

# Create detailed mapping for export
mapping_data = []

# enumerate() supplies the 1-based group ID directly; list.index() would rescan
# the list for every member and return the wrong ID for two identical groups
for group_id, group in enumerate(similar_groups, 1):
    rep = group['representative']
    rep_data = goals[rep]  # Use main goals dict (rep always has description)
    rep_pop = included_by_our_users.get(rep, 0)
    
    for i, (member, pop) in enumerate(zip(group['members'], group['included_by_our_users'])):
        member_data = goals[member]  # Use main goals dict (some members may not have descriptions)
        
        # Get description (may be empty for some members)
        member_desc = member_data.get('description', '')
        rep_desc = rep_data.get('description', '')
        
        mapping_data.append({
            'group_id': group_id,
            'is_representative': member == rep,
            'goal_id': member,
            'goal_title': member_data['title'],
            'goal_description': member_desc[:200] if member_desc else '',  # Truncate for Excel
            'has_description': member in goals_with_descriptions,
            'included_by_our_users': pop,
            'wants_to_do': member_data['wants_to_do'],
            'have_done': member_data['have_done'],
            'num_comments': len(member_data['comments']),
            'num_tags': len(member_data['tags']),
            'representative_id': rep,
            'representative_title': rep_data['title'],
            'representative_description': rep_desc[:200] if rep_desc else '',
            'representative_included_by_our_users': rep_pop,
            'group_size': len(group['members']),
            'position_in_group': i + 1
        })

# Create DataFrame
mapping_df = pd.DataFrame(mapping_data)

# Sort by group_id and position
mapping_df = mapping_df.sort_values(['group_id', 'position_in_group'])

print(f"\nMapping DataFrame created: {len(mapping_df):,} rows")
print(f"Columns: {list(mapping_df.columns)}")

# Show sample
print("\nSample mapping data:")
print(mapping_df.head(10).to_string(index=False))

Created mapping for 866 goals

Mapping DataFrame created: 866 rows
Columns: ['group_id', 'is_representative', 'goal_id', 'goal_title', 'goal_description', 'has_description', 'included_by_our_users', 'wants_to_do', 'have_done', 'num_comments', 'num_tags', 'representative_id', 'representative_title', 'representative_description', 'representative_included_by_our_users', 'group_size', 'position_in_group']

Sample mapping data:
 group_id  is_representative  goal_id                                                            goal_title                                                                                                                                                                                         goal_description  has_description  included_by_our_users  wants_to_do  have_done  num_comments  num_tags representative_id                                          representative_title                                                                                                 
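
Because the filtered network later relies on `goal_mapping.get(gid, gid)`, it can be worth checking that every representative maps to itself before exporting. The following is a minimal sketch of such a check; `find_inconsistent_representatives` is a hypothetical helper, and the toy mapping uses made-up IDs.

```python
def find_inconsistent_representatives(goal_mapping):
    """Return representatives that do not map to themselves.

    goal_mapping: dict mapping goal ID -> representative goal ID
    """
    problems = []
    for rep_id in sorted(set(goal_mapping.values())):
        # A valid representative is a key of the mapping and its own target
        if goal_mapping.get(rep_id) != rep_id:
            problems.append(rep_id)
    return problems


# Toy mapping: "d" is used as a representative but never mapped itself
toy_mapping = {"a": "a", "b": "a", "c": "d"}
problems = find_inconsistent_representatives(toy_mapping)
num_groups = len(set(toy_mapping.values()))
print(problems, num_groups)
```

For a consistent mapping the helper returns an empty list, and the number of distinct values equals the number of groups.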

Export the mapping to Excel with three sheets: Goal Mapping, Summary, and Representatives.

In [13]:
# Export to Excel
output_file = '../Data/Validation/goal_mapping_validation.xlsx'

# Calculate num_unique_after_merge for summary
num_unique_after_merge = len(goals_with_descriptions) - num_merged

with pd.ExcelWriter(output_file, engine='openpyxl') as writer:
    # Main mapping sheet
    mapping_df.to_excel(writer, sheet_name='Goal Mapping', index=False)
    
    # Summary sheet
    summary_data = {
        'Metric': [
            'Total goals with descriptions',
            'Similar groups found',
            'Goals merged',
            'Unique goals after merge',
            'Reduction percentage',
            'Similarity threshold used'
        ],
        'Value': [
            len(goals_with_descriptions),
            len(similar_groups),
            num_merged,
            num_unique_after_merge,
            f"{100*num_merged/len(goals_with_descriptions):.2f}%",
            SIMILARITY_THRESHOLD
        ]
    }
    summary_df = pd.DataFrame(summary_data)
    summary_df.to_excel(writer, sheet_name='Summary', index=False)
    
    # Representatives only (for final network)
    representatives_data = []
    for group in similar_groups:
        rep = group['representative']
        rep_data = goals_with_descriptions[rep]
        representatives_data.append({
            'goal_id': rep,
            'title': rep_data['title'],
            'description': rep_data['description'],
            'wants_to_do': rep_data['wants_to_do'],
            'have_done': rep_data['have_done'],
            'num_comments': len(rep_data['comments']),
            'num_tags': len(rep_data['tags']),
            'included_by_our_users': included_by_our_users.get(rep, 0),
            'group_size': len(group['members']),
            'merged_goal_ids': ', '.join(m for m in group['members'] if m != rep)  # All except representative, wherever it sits in the list
        })
    
    rep_df = pd.DataFrame(representatives_data)
    rep_df = rep_df.sort_values('included_by_our_users', ascending=False)
    rep_df.to_excel(writer, sheet_name='Representatives', index=False)

print(f"Exported to {output_file}")
print(f"  - Sheet 'Goal Mapping': {len(mapping_df)} rows (all goals in groups)")
print(f"  - Sheet 'Summary': Overview statistics")
print(f"  - Sheet 'Representatives': {len(representatives_data)} representative goals")

Exported to ../Data/Validation/goal_mapping_validation.xlsx
  - Sheet 'Goal Mapping': 866 rows (all goals in groups)
  - Sheet 'Summary': Overview statistics
  - Sheet 'Representatives': 283 representative goals


## 9. Create Filtered Network with Merged Goals
Rebuild the network using only goals with descriptions, collapsing each group of similar goals into its representative.

In [14]:
# Build filtered network
print("Building filtered network...")
print("(Only goals with descriptions, merging similar goals)\n")

G_filtered = nx.Graph()

# First, create a mapping of representatives to their merged goals
rep_to_merged_goals = defaultdict(list)
for gid, rep_id in goal_mapping.items():
    if gid != rep_id:  # Only include goals that were merged (not the representative itself)
        rep_to_merged_goals[rep_id].append(gid)

# Add representative goals as nodes (with all attributes including merged_goals list)
for gid in goals_with_descriptions.keys():
    # Use representative if this goal was mapped
    rep_id = goal_mapping.get(gid, gid)
    if rep_id not in G_filtered:
        rep_data = goals_with_descriptions[rep_id]
        
        # Get list of merged goals (empty list if not a representative)
        merged_goals_list = rep_to_merged_goals.get(rep_id, [])
        
        G_filtered.add_node(rep_id, 
                           title=rep_data['title'],
                           description=rep_data['description'],
                           wants_to_do=rep_data['wants_to_do'],
                           have_done=rep_data['have_done'],
                           comments=rep_data['comments'],
                           tags=rep_data['tags'],
                           included_by_our_users=included_by_our_users.get(rep_id, 0),
                           merged_goals=merged_goals_list)

print(f"Added merged_goals attribute to all nodes")
print(f"  Representatives (with merged goals): {sum(1 for n in G_filtered.nodes() if len(G_filtered.nodes[n]['merged_goals']) > 0):,}")
print(f"  Non-representatives (empty list): {sum(1 for n in G_filtered.nodes() if len(G_filtered.nodes[n]['merged_goals']) == 0):,}")

# Build edges
edge_counter_filtered = Counter()

for username, goals_list in users_data.items():
    user_goal_ids = [item['id'] for item in goals_list if 'id' in item]
    
    # Filter to goals with descriptions and map to representatives
    user_goal_ids_filtered = []
    for gid in user_goal_ids:
        if gid in goals_with_descriptions:
            rep_id = goal_mapping.get(gid, gid)
            user_goal_ids_filtered.append(rep_id)
    
    # Remove duplicates (if multiple goals map to same representative)
    user_goal_ids_filtered = list(set(user_goal_ids_filtered))
    
    # Create edges
    for i, goal_i in enumerate(user_goal_ids_filtered):
        for goal_j in user_goal_ids_filtered[i+1:]:
            edge = tuple(sorted([goal_i, goal_j]))
            edge_counter_filtered[edge] += 1

# Add edges
for (g1, g2), weight in edge_counter_filtered.items():
    if g1 in G_filtered and g2 in G_filtered:
        G_filtered.add_edge(g1, g2, weight=weight)

# Calculate statistics
num_nodes_filtered = G_filtered.number_of_nodes()
num_edges_filtered = G_filtered.number_of_edges()
density_filtered = nx.density(G_filtered)

print("\nFiltered & merged network statistics:")
print(f"  Nodes: {num_nodes_filtered:,}")
print(f"  Edges: {num_edges_filtered:,}")
print(f"  Density: {density_filtered:.8f}")
print(f"  Average degree: {2*num_edges_filtered/num_nodes_filtered:.2f}")

# Connected components
num_components_filtered = nx.number_connected_components(G_filtered)
if num_components_filtered > 0:
    largest_cc_filtered = max(nx.connected_components(G_filtered), key=len)
    print(f"\n  Connected components: {num_components_filtered:,}")
    print(f"  Largest component: {len(largest_cc_filtered):,} nodes ({100*len(largest_cc_filtered)/num_nodes_filtered:.1f}%)")

# Comparison with original
print("\n" + "="*80)
print("COMPARISON: ORIGINAL vs FILTERED & MERGED")
print("\nComparison: Original vs Filtered & Merged")
print(f"{'Metric':<25} {'Original':<20} {'Filtered & Merged':<20} {'Change':<20}")

node_change = num_nodes_filtered - original_num_nodes
edge_change = num_edges_filtered - original_num_edges
density_change = density_filtered - original_density

print(f"{'Nodes':<25} {original_num_nodes:<20,} {num_nodes_filtered:<20,} {node_change:>+,} ({100*node_change/original_num_nodes:+.1f}%)")
print(f"{'Edges':<25} {original_num_edges:<20,} {num_edges_filtered:<20,} {edge_change:>+,} ({100*edge_change/original_num_edges:+.1f}%)")
print(f"{'Density':<25} {original_density:<20.8f} {density_filtered:<20.8f} {density_change:>+.8f} ({100*density_change/original_density:+.1f}%)")

# Show sample node attributes
print("\n" + "="*80)
print("SAMPLE NODE ATTRIBUTES IN FILTERED NETWORK")
print("="*80)
print("\nSample node attributes in filtered network:")
sample_node = list(G_filtered.nodes())[0]
print(f"\nNode ID: {sample_node}")
for attr, value in G_filtered.nodes[sample_node].items():
    if attr in ['comments', 'tags']:
        print(f"  {attr}: {len(value)} items")
    elif attr == 'description':
        print(f"  {attr}: {value[:80]}..." if len(value) > 80 else f"  {attr}: {value}")
    elif attr == 'merged_goals':
        print(f"  {attr}: {len(value)} merged goals")
        if len(value) > 0:
            print(f"    First few: {value[:3]}")
    else:
        print(f"  {attr}: {value}")


Building filtered network...
(Only goals with descriptions, merging similar goals)

Added merged_goals attribute to all nodes
  Representatives (with merged goals): 283
  Non-representatives (empty list): 2,607

Filtered & merged network statistics:
  Nodes: 2,890
  Edges: 219,130
  Density: 0.05249119
  Average degree: 151.65

  Connected components: 28
  Largest component: 2,860 nodes (99.0%)

COMPARISON: ORIGINAL vs FILTERED & MERGED

Comparison: Original vs Filtered & Merged
Metric                    Original             Filtered & Merged    Change              
Nodes                     3,394                2,890                -504 (-14.8%)
Edges                     258,027              219,130              -38,897 (-15.1%)
Density                   0.04481253           0.05249119           +0.00767867 (+17.1%)

SAMPLE NODE ATTRIBUTES IN FILTERED NETWORK

Sample node attributes in filtered network:

Node ID: dQggEQQH
  title: Make ice cream from scratch
  description: Ice cream o
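
The comparison table shows density rising even though both nodes and edges fall. A short worked check makes the reason visible: density divides the edge count by the number of possible node pairs, and removing 14.8% of the nodes removes roughly 27.5% of the possible pairs, while only 15.1% of the edges disappear. The sketch below recomputes the reported figures with the standard formula for a simple undirected graph; the helper names are illustrative.

```python
def density(num_nodes, num_edges):
    """Density of a simple undirected graph: 2E / (N * (N - 1))."""
    if num_nodes < 2:
        return 0.0
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


def average_degree(num_nodes, num_edges):
    """Each edge contributes to the degree of two nodes."""
    return 2 * num_edges / num_nodes


# Node and edge counts taken from the comparison table above
original_density = density(3394, 258027)
filtered_density = density(2890, 219130)
filtered_avg_degree = average_degree(2890, 219130)

# Share of possible node pairs that remain after filtering and merging
pair_ratio = (2890 * 2889) / (3394 * 3393)

print(f"Original density: {original_density:.8f}")
print(f"Filtered density: {filtered_density:.8f}")
print(f"Average degree:   {filtered_avg_degree:.2f}")
print(f"Possible pairs remaining: {100 * pair_ratio:.1f}%")
```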

In [15]:
# Sanity check: Count nodes with merged_goals list size > 0
nodes_with_merges = sum(1 for n in G_filtered.nodes() if len(G_filtered.nodes[n]['merged_goals']) > 0)
nodes_without_merges = sum(1 for n in G_filtered.nodes() if len(G_filtered.nodes[n]['merged_goals']) == 0)

print("\nSanity check - merged_goals attribute:")
print(f"  Nodes with merged goals (list size > 0): {nodes_with_merges:,}")
print(f"  Nodes without merged goals (empty list): {nodes_without_merges:,}")
print(f"  Total nodes: {G_filtered.number_of_nodes():,}")
print(f"  Percentage with merges: {100*nodes_with_merges/G_filtered.number_of_nodes():.1f}%")


Sanity check - merged_goals attribute:
  Nodes with merged goals (list size > 0): 283
  Nodes without merged goals (empty list): 2,607
  Total nodes: 2,890
  Percentage with merges: 9.8%


Calculate and display the number of isolated nodes in the filtered network.

In [16]:
# Show number of isolated nodes
num_isolated = len(list(nx.isolates(G_filtered)))
print(f"\n  Isolated nodes in filtered network: {num_isolated:,} ({100*num_isolated/num_nodes_filtered:.1f}%)")


  Isolated nodes in filtered network: 25 (0.9%)
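
A node ends up isolated when no user lists it together with any other goal that survives filtering. The sketch below reproduces the co-occurrence edge counting on toy data using only the standard library and derives the isolated goals from it. The function names and the toy user lists are illustrative, not part of the notebook.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(user_goals):
    """Count, for each unordered goal pair, how many users list both goals."""
    edge_counter = Counter()
    for goal_ids in user_goals.values():
        # Deduplicate and sort so each pair has one canonical orientation
        unique_ids = sorted(set(goal_ids))
        edge_counter.update(combinations(unique_ids, 2))
    return edge_counter


def isolated_goals(all_goal_ids, edge_counter):
    """Goals that never appear in any co-occurrence edge."""
    connected = {gid for edge in edge_counter for gid in edge}
    return sorted(set(all_goal_ids) - connected)


# Toy data: g4 is listed alone, g5 is listed by nobody
user_goals = {
    "u1": ["g1", "g2", "g3"],
    "u2": ["g2", "g3"],
    "u3": ["g4"],
}
edges = cooccurrence_edges(user_goals)
isolated = isolated_goals(["g1", "g2", "g3", "g4", "g5"], edges)
print(dict(edges))
print(isolated)
```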


In [17]:
# show all attributes of one sample node
print("\n" + "="*80)
print("SAMPLE NODE ATTRIBUTES IN FILTERED NETWORK")

print("\nSample node attributes in filtered network:")
sample_node = list(G_filtered.nodes())[0]
print(f"\nNode ID: {sample_node}")
for attr, value in G_filtered.nodes[sample_node].items():
    if attr in ['comments', 'tags']:
        print(f"  {attr}: {len(value)} items")
    elif attr == 'description':
        print(f"  {attr}: {value[:80]}..." if len(value) > 80 else f"  {attr}: {value}")
    elif attr == 'merged_goals':
        print(f"  {attr}: {len(value)} merged goals")
        if len(value) > 0:
            print(f"    First few: {value[:3]}")
    else:
        print(f"  {attr}: {value}")


SAMPLE NODE ATTRIBUTES IN FILTERED NETWORK

Sample node attributes in filtered network:

Node ID: dQggEQQH
  title: Make ice cream from scratch
  description: Ice cream or ice-cream is a frozen dessert usually made from dairy products, suc...
  wants_to_do: 3010
  have_done: 973
  comments: 46 items
  tags: 32 items
  included_by_our_users: 438
  merged_goals: 0 merged goals


Export the filtered network to a pickle file for use in subsequent analyses.

In [18]:
# Export filtered network to pickle file
import pickle

output_pickle = '../Networks/Prior Network Versions/b1_network.pkl'
with open(output_pickle, 'wb') as f:
    pickle.dump(G_filtered, f)

print(f"Network exported successfully")
print(f"  Nodes: {G_filtered.number_of_nodes():,}")
print(f"  Edges: {G_filtered.number_of_edges():,}")
print(f"\nNode attributes preserved:")
sample_node = list(G_filtered.nodes())[0]
for attr in G_filtered.nodes[sample_node].keys():
    print(f"  - {attr}")

Network exported successfully
  Nodes: 2,890
  Edges: 219,130

Node attributes preserved:
  - title
  - description
  - wants_to_do
  - have_done
  - comments
  - tags
  - included_by_our_users
  - merged_goals
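
The export cell lists the attributes of a single sample node. To confirm that every node carries the same attributes, a check along the following lines could be run before pickling. This is a sketch under assumptions: `nodes_missing_attributes` is a hypothetical helper, and it operates on a plain dictionary such as the one produced by `dict(G_filtered.nodes(data=True))`.

```python
REQUIRED_ATTRIBUTES = {
    "title", "description", "wants_to_do", "have_done",
    "comments", "tags", "included_by_our_users", "merged_goals",
}


def nodes_missing_attributes(node_attrs, required):
    """Map node ID -> sorted list of required attributes that node lacks."""
    missing = {}
    for node_id, attrs in node_attrs.items():
        absent = sorted(set(required) - set(attrs))
        if absent:
            missing[node_id] = absent
    return missing


# Toy stand-in for dict(G_filtered.nodes(data=True)); IDs are made up
complete = {name: None for name in REQUIRED_ATTRIBUTES}
incomplete = {k: v for k, v in complete.items() if k not in {"tags", "comments"}}
toy_nodes = {"n1": complete, "n2": incomplete}

missing = nodes_missing_attributes(toy_nodes, REQUIRED_ATTRIBUTES)
print(missing)
```

An empty result means all nodes share the full attribute set.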


# Part II: Network and Language Inspection

Part II works on the network saved in Part I. It loads the pickle file directly, so it can be run on its own without re-running the full pipeline.

Load the previously saved network and display its basic properties.

In [19]:
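# Read in pkl file and show all attributes
# Part II runs standalone, so import everything the remaining cells use
import pickle
import warnings
from collections import Counter

import fasttext
import networkx as nx
import numpy as np
import pandas as pd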

pickle_file = '../Networks/Prior Network Versions/b1_network.pkl'

with open(pickle_file, 'rb') as f:
    G_loaded = pickle.load(f)
print(f"\nLoaded network from {pickle_file}:")
print(f"  Nodes: {G_loaded.number_of_nodes():,}")
print(f"  Edges: {G_loaded.number_of_edges():,}")
sample_node = next(iter(G_loaded.nodes()))
print(f"\nSample node attributes for node ID '{sample_node}':")
for attr, value in G_loaded.nodes[sample_node].items():
    print(f"  {attr}: {value}")


Loaded network from ../Networks/Prior Network Versions/b1_network.pkl:
  Nodes: 2,890
  Edges: 219,130

Sample node attributes for node ID 'dQggEQQH':
  title: Make ice cream from scratch
  description: Ice cream or ice-cream is a frozen dessert usually made from dairy products, such as milk and cream, and often combined with fruits or other ingredients and flavours. Most varieties contain sugar, although some are made with other sweeteners. In some cases, artificial flavourings and colourings are used in addition to (or in replacement of) the natural ingredients.
  wants_to_do: 3010
  have_done: 973
  comments: ['We made ice cream using a friend’s ice cream maker.  We adjusted a plain vanilla recipe by adding a flavour essence ‘Mermaid Magic’ and sprinkles near the end.  While we were impatient and ate this as soon as it got to soft-serve consistency, it was still amazingly good.', '06/26/2022-Made a quart of French Vanilla', 'Got the machineVanillaLemon iceTomato sorbet StrawberryCo

Analyze the connectivity of the loaded network.

In [20]:
# Show number of connected components
num_components_loaded = nx.number_connected_components(G_loaded)
print(f"\nNumber of connected components in loaded network: {num_components_loaded:,}")

# Length of largest connected component
largest_cc_loaded = max(nx.connected_components(G_loaded), key=len)
print(f"Size of largest connected component: {len(largest_cc_loaded):,} nodes")


Number of connected components in loaded network: 28
Size of largest connected component: 2,860 nodes
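
Combined with the 25 isolated nodes reported in Part I, these figures imply that the remaining 2 components share 5 nodes (28 - 1 - 25 = 2 components, 2,890 - 2,860 - 25 = 5 nodes). A size distribution makes this kind of breakdown explicit. The sketch below assumes an iterable of node sets, which is the shape `nx.connected_components` yields; the helper name and the toy components are illustrative.

```python
from collections import Counter


def component_size_distribution(components):
    """Count how many components exist for each component size."""
    return Counter(len(component) for component in components)


# Toy stand-in for nx.connected_components(G_loaded)
toy_components = [{"a", "b", "c"}, {"d"}, {"e"}, {"f", "g"}]
size_counts = component_size_distribution(toy_components)

for size, count in sorted(size_counts.items(), reverse=True):
    print(f"  {count} component(s) of size {size}")
```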


## 10. Language Detection on Network Goals
Detect the language of each goal in the filtered network using fastText.

In [21]:
# Get goals from the filtered network
network_goal_ids = list(G_loaded.nodes())
print(f"Goals in filtered network: {len(network_goal_ids):,}")

# Prepare text for language detection
goal_combined_texts_network = []
for gid in network_goal_ids:
    title = G_loaded.nodes[gid].get('title', '')
    desc = G_loaded.nodes[gid].get('description', '')
    
    # Combine title and description
    combined = f"{title}. {desc}" if desc else title
    goal_combined_texts_network.append(combined.strip())

print(f"Prepared {len(goal_combined_texts_network):,} texts for language detection")

# Text statistics
text_lengths = [len(t) for t in goal_combined_texts_network]
print(f"\nText length statistics:")
print(f"  Mean: {np.mean(text_lengths):.0f} characters")
print(f"  Median: {np.median(text_lengths):.0f} characters")
print(f"  Max: {np.max(text_lengths):,} characters")
print(f"  Min: {np.min(text_lengths):,} characters")

Goals in filtered network: 2,890
Prepared 2,890 texts for language detection

Text length statistics:
  Mean: 485 characters
  Median: 474 characters
  Max: 2,317 characters
  Min: 69 characters


Run language detection with the fastText `lid.176.bin` model, keeping the top-1 predicted language and its confidence score for each text.

In [22]:
warnings.filterwarnings('ignore')

print("Loading fastText language detection model...")
model_path = '../Data/Embeddings, Similarity and Language Detection/lid.176.bin'
ft_model = fasttext.load_model(model_path)
print("Model loaded\n")

print(f"Detecting language for {len(goal_combined_texts_network):,} texts...")
print("(Using fasttext - most accurate for short texts)\n")

languages_network = []
language_scores_network = []

for i, text in enumerate(goal_combined_texts_network):
    if i % 10000 == 0 and i > 0:
        print(f"  Processed {i:,} / {len(goal_combined_texts_network):,} texts...")
    
    try:
        # Predict language
        predictions = ft_model.predict(text.replace('\n', ' '), k=1)
        lang = predictions[0][0].replace('__label__', '')
        score = predictions[1][0]
        
        languages_network.append(lang)
        language_scores_network.append(score)
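    except Exception:
        # Fall back to 'unknown' if prediction fails for this text
        # (a bare except would also swallow KeyboardInterrupt)
        languages_network.append('unknown')
        language_scores_network.append(0.0)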

print(f"\nLanguage detection complete")
print(f"  Total texts: {len(languages_network):,}")
print(f"  Unique languages: {len(set(languages_network))}")

# Count language distribution
lang_counts_network = Counter(languages_network)

print(f"\nTop 10 languages detected:")
for lang, count in lang_counts_network.most_common(10):
    percentage = 100 * count / len(languages_network)
    print(f"  {lang:<10} {count:>8,} ({percentage:>5.1f}%)")

Loading fastText language detection model...
Model loaded

Detecting language for 2,890 texts...
(Using fasttext - most accurate for short texts)


Language detection complete
  Total texts: 2,890
  Unique languages: 1

Top 10 languages detected:
  en            2,890 (100.0%)
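
The detection loop accepts the top-1 label regardless of its score. If low-confidence predictions should be treated separately, the label parsing could be wrapped as in the sketch below. `parse_prediction` is a hypothetical helper, and the `min_score` cutoff of 0.5 is an arbitrary assumption rather than a value used in the notebook. The inputs mirror the `(labels, scores)` pair that `ft_model.predict(text, k=1)` returns, with plain lists standing in for the score array.

```python
def parse_prediction(labels, scores, min_score=0.5):
    """Turn a top-1 fastText-style prediction into (language, score).

    labels: sequence such as ('__label__en',)
    scores: sequence of confidences aligned with labels
    min_score: assumed cutoff below which the language counts as unknown
    """
    if not labels:
        return "unknown", 0.0
    language = labels[0].replace("__label__", "")
    score = float(scores[0])
    if score < min_score:
        return "unknown", score
    return language, score


confident = parse_prediction(("__label__en",), [0.98])
uncertain = parse_prediction(("__label__de",), [0.31])
print(confident, uncertain)
```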


Add language information to the network nodes and export to Excel for analysis.

In [None]:
# Add language attributes to the network nodes
for i, gid in enumerate(network_goal_ids):
    G_loaded.nodes[gid]['language'] = languages_network[i]
    G_loaded.nodes[gid]['language_score'] = language_scores_network[i]

print("Added language attributes to network nodes")

# Export node attributes to Excel
df_nodes = pd.DataFrame.from_dict(dict(G_loaded.nodes(data=True)), orient='index')
df_nodes.to_excel('../Data/Validation/b1_network_nodes.xlsx')
print(f"  Columns: {list(df_nodes.columns)}")
print(f"  Rows: {len(df_nodes):,}")

Added language attributes to network nodes
  Columns: ['title', 'description', 'wants_to_do', 'have_done', 'comments', 'tags', 'included_by_our_users', 'merged_goals', 'language', 'language_score']
  Rows: 2,890
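
The exported columns include list-valued attributes (`comments`, `tags`, `merged_goals`), and Excel stores at most 32,767 characters per cell. One way to keep the sheet readable is to flatten lists into delimited strings and truncate long text before building the DataFrame. The following is a sketch only: `flatten_for_excel` and the `" | "` separator are illustrative choices, not part of the notebook.

```python
EXCEL_CELL_LIMIT = 32767  # maximum number of characters Excel keeps per cell


def flatten_for_excel(attrs, separator=" | "):
    """Return a copy of one node's attributes that is safe to write to a cell."""
    flat = {}
    for key, value in attrs.items():
        if isinstance(value, (list, tuple, set)):
            # Join list-like values into a single delimited string
            value = separator.join(str(item) for item in value)
        if isinstance(value, str):
            value = value[:EXCEL_CELL_LIMIT]
        flat[key] = value
    return flat


# Toy node with made-up values
node = {"title": "Example", "tags": ["a", "b"], "wants_to_do": 3,
        "description": "x" * 40000}
flat = flatten_for_excel(node)
print(flat["tags"], len(flat["description"]))

# Possible use with the loaded network (not executed here):
# rows = {n: flatten_for_excel(a) for n, a in G_loaded.nodes(data=True)}
# df_nodes = pd.DataFrame.from_dict(rows, orient='index')
```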
