# Instagram Network Analysis - Advanced Analysis

This notebook provides additional analysis tools and visualizations for Instagram network data collected using the main pipeline. It allows for deeper exploration of the network structure, influence patterns, and audience insights.

## Setup and Configuration

First, let's set up the environment and load the required modules.

In [None]:
# Import required libraries
import os
import sys
import json
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from pathlib import Path
from datetime import datetime
from IPython.display import Image, display

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook", font_scale=1.2)

# Add project root to path
sys.path.append('.')

# Import project modules
from config.settings import INSTAGRAM, STORAGE, PROCESSING, VISUALIZATION
from config.logging_config import setup_logging, get_logger
from src.utils.storage import DataStorage
from src.processors.network_processor import NetworkProcessor
from src.visualizers.network_visualizer import NetworkVisualizer

# Set up logging
logger = setup_logging(log_dir='logs')

## Load Processed Data

Load the data processed by the main pipeline.

In [None]:
# Initialize storage
storage = DataStorage(base_dir='.')

# Find the latest rankings file
rankings_files = sorted(glob.glob("data/results/rankings_*.json*"), key=os.path.getmtime, reverse=True)

if not rankings_files:
    print("❌ No rankings files found. Please run the main pipeline first.")
else:
    # Load the latest rankings file
    rankings_file = os.path.basename(rankings_files[0])
    rankings = storage.load_data(
        filename=rankings_file,
        data_type='results',
        format='json',
        decompress=rankings_file.endswith('.gz')
    )
    
    print(f"✅ Loaded rankings data for @{rankings['metadata']['target_username']}")
    print(f"   - Total followers analyzed: {rankings['metadata']['total_followers_analyzed']}")
    print(f"   - Total unique accounts: {rankings['metadata']['total_unique_accounts']}")
    print(f"   - Generated at: {rankings['metadata']['generated_at']}")
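
The "latest file" lookup above can also be sketched without the project's `DataStorage` class, which is useful when inspecting result files outside the pipeline. The helper name `load_latest_json` is hypothetical, and the sketch assumes the files are plain or gzip-compressed UTF-8 JSON.

```python
import glob
import gzip
import json
import os


def load_latest_json(pattern):
    """Return (path, data) for the most recently modified match, or None if nothing matches."""
    paths = sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)
    if not paths:
        return None
    latest = paths[0]
    # Choose the opener from the extension: gzip for .gz, plain text otherwise
    opener = gzip.open if latest.endswith(".gz") else open
    with opener(latest, "rt", encoding="utf-8") as handle:
        return latest, json.load(handle)
```

Returning `None` instead of printing lets later cells check the result explicitly before they touch `rankings`.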

## Advanced Network Analysis

Perform more detailed analysis of the network structure.

In [None]:
# Convert rankings to DataFrames for easier analysis
follower_df = pd.DataFrame(rankings['by_follower_count'])
influence_df = pd.DataFrame(rankings['by_influence_score'])
penetration_df = pd.DataFrame(rankings['by_penetration_rate'])

# Display basic statistics
print("\n📊 Basic Statistics:")
print(follower_df[['follower_count', 'penetration_rate', 'influence_score']].describe())

### Correlation Analysis

Analyze correlations between different metrics.

In [None]:
# Calculate correlations
correlation = follower_df[['follower_count', 'penetration_rate', 'influence_score']].corr()

# Plot correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation Between Metrics')
plt.tight_layout()
plt.show()

print("\n🔍 Interpretation:")
print("- A high correlation between follower count and influence score indicates that popularity")
print("  is a major factor in determining influence in this network. Some correlation is expected,")
print("  because follower count is itself a component of the influence score.")
print("- A high correlation between penetration rate and influence score suggests that accounts")
print("  reaching a large share of the target's followers also tend to score as influential.")
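
`DataFrame.corr()` computes Pearson correlation by default, which a few very large accounts can distort. As a cross-check, a rank-based (Spearman) correlation can be computed by correlating the ranks. The sketch below uses made-up values, not data from the pipeline.

```python
import pandas as pd

# Made-up values for illustration only; one account is far larger than the rest
toy_df = pd.DataFrame({
    "follower_count": [100, 200, 300, 400, 1_000_000],
    "influence_score": [0.10, 0.20, 0.30, 0.40, 0.50],
})

# Pearson correlation on the raw values
pearson = toy_df.corr().loc["follower_count", "influence_score"]
# Spearman correlation is the Pearson correlation of the ranks
spearman = toy_df.rank().corr().loc["follower_count", "influence_score"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

When the two values differ noticeably, the relationship is monotonic but not linear, and the heatmap is best read alongside the rank-based figure.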

### Verified vs. Non-Verified Accounts Analysis

Compare metrics between verified and non-verified accounts.

In [None]:
# Group by verification status
verification_groups = follower_df.groupby('is_verified')
verification_stats = verification_groups[['follower_count', 'penetration_rate', 'influence_score']].mean()

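# Plot comparison
# Label groups by their actual values so the ticks stay correct even if only one group is present
verification_stats.index = verification_stats.index.map({False: 'Non-Verified', True: 'Verified'})
verification_stats.plot(kind='bar', figsize=(12, 6))
plt.title('Average Metrics by Verification Status')
plt.xlabel('Verification Status')
plt.ylabel('Average Value')
plt.xticks(rotation=0)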
plt.legend(title='Metric')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Count verified vs non-verified
verification_counts = follower_df['is_verified'].value_counts()
print(f"\n📊 Verification Status Counts:")
print(f"- Verified accounts: {verification_counts.get(True, 0)} ({verification_counts.get(True, 0)/len(follower_df)*100:.1f}%)")
print(f"- Non-verified accounts: {verification_counts.get(False, 0)} ({verification_counts.get(False, 0)/len(follower_df)*100:.1f}%)")

### Advanced Network Visualization

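Create a more detailed network visualization with community detection. The edges are a proxy rather than real following relationships: two accounts are linked when their influence scores differ by less than 20%, so the detected communities group accounts with similar scores.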

In [None]:
# Create a network graph of the top accounts
def create_advanced_network_graph(top_n=50):
    # Create graph
    G = nx.Graph()
    
    # Add nodes (top accounts by influence score)
    top_accounts = influence_df.head(top_n)
    
    for _, account in top_accounts.iterrows():
        G.add_node(account['username'], 
                  size=account['follower_count'],
                  verified=account['is_verified'],
                  influence=account['influence_score'],
                  penetration=account['penetration_rate'])
    
    # Add edges (connections between accounts)
    # This is a simplified version - in a real implementation, we would use actual following relationships
    for i, (_, account1) in enumerate(top_accounts.iterrows()):
        for _, account2 in top_accounts.iloc[i+1:].iterrows():
            # Connect accounts with similar influence scores
            influence_diff = abs(account1['influence_score'] - account2['influence_score'])
            influence_max = max(account1['influence_score'], account2['influence_score'])

            # Guard against division by zero when both scores are 0 (treated as identical)
            relative_diff = influence_diff / influence_max if influence_max > 0 else 0.0

            if relative_diff < 0.2:  # If difference is less than 20%
                G.add_edge(account1['username'], account2['username'],
                          weight=1 - relative_diff)
    
    # Detect communities
    communities = nx.community.greedy_modularity_communities(G)
    
    # Assign community to each node
    community_map = {}
    for i, community in enumerate(communities):
        for node in community:
            community_map[node] = i
    
    # Set up plot
    plt.figure(figsize=(16, 12))
    
    # Calculate node sizes based on follower count
    sizes = [G.nodes[node]['size'] for node in G.nodes]
    max_size = max(sizes) if sizes else 1
    node_sizes = [100 + (1000 * (size / max_size)) for size in sizes]
    
    # Calculate node colors based on community
    # plt.cm.get_cmap was deprecated in Matplotlib 3.7 and removed in 3.9; plt.get_cmap still works
    cmap = plt.get_cmap('tab10', len(communities))
    node_colors = [cmap(community_map.get(node, 0)) for node in G.nodes]
    
    # Calculate edge widths based on weight
    edge_widths = [G[u][v]['weight'] * 2 for u, v in G.edges]
    
    # Draw the graph
    pos = nx.spring_layout(G, seed=42, k=0.3)
    
    # Draw edges
    nx.draw_networkx_edges(G, pos, width=edge_widths, alpha=0.3, edge_color='gray')
    
    # Draw nodes
    nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color=node_colors, alpha=0.8)
    
    # Draw labels for larger nodes
    large_nodes = [node for node, size in zip(G.nodes, node_sizes) if size > 300]
    nx.draw_networkx_labels(G, pos, {node: node for node in large_nodes}, font_size=10, font_weight='bold')
    
    # Add legend for communities
    legend_elements = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=cmap(i), 
                                 label=f'Community {i+1}', markersize=10) 
                      for i in range(len(communities))]
    
    # Add legend for verification
    verified_nodes = [node for node in G.nodes if G.nodes[node]['verified']]
    if verified_nodes:
        legend_elements.append(plt.Line2D([0], [0], marker='*', color='w', markerfacecolor='gold', 
                                        label='Verified', markersize=15))
    
    plt.legend(handles=legend_elements, loc='upper right')
    
    # Add title and adjust layout
    plt.title(f"Network of Top {len(G.nodes)} Instagram Accounts by Influence Score\nColored by Community")
    plt.tight_layout()
    plt.axis('off')
    
    # Add verification stars for verified accounts
    for node in verified_nodes:
        x, y = pos[node]
        plt.text(x, y+0.02, '★', color='gold', fontsize=15, ha='center', va='center')
    
    plt.show()
    
    # Print community information
    print("\n🔍 Community Analysis:")
    for i, community in enumerate(communities):
        print(f"\nCommunity {i+1} ({len(community)} accounts):")
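        # Communities are unordered sets, so sort by influence to show the actual top 5
        top_members = sorted(community, key=lambda n: G.nodes[n]['influence'], reverse=True)[:5]
        for node in top_members: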
            verified = "✓" if G.nodes[node]['verified'] else " "
            print(f"  {verified} @{node} - Influence: {G.nodes[node]['influence']:.2f}, Penetration: {G.nodes[node]['penetration']:.2f}%")
        if len(community) > 5:
            print(f"  ... and {len(community) - 5} more accounts")

# Create the advanced network graph
create_advanced_network_graph(top_n=50)
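
The edge rule used in the graph above (link two accounts when their scores differ by less than 20% of the larger score) can be isolated as a small function, which makes it easier to test. This is a minimal sketch: the function name, usernames, and scores are hypothetical.

```python
from itertools import combinations


def similarity_edges(scores, threshold=0.2):
    """Yield (a, b, weight) for pairs whose scores differ by less than `threshold`, relatively."""
    for (name_a, score_a), (name_b, score_b) in combinations(scores.items(), 2):
        largest = max(score_a, score_b)
        # Two zero scores are treated as identical rather than dividing by zero
        relative_diff = abs(score_a - score_b) / largest if largest > 0 else 0.0
        if relative_diff < threshold:
            yield name_a, name_b, 1 - relative_diff


# Hypothetical usernames and scores
scores = {"alice": 10.0, "bob": 9.0, "carol": 5.0}
edges = list(similarity_edges(scores))
print(edges)
```

Because the rule depends only on score similarity, it links neighbours in the influence ranking, so the resulting communities tend to be bands of similar scores.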

## Audience Interest Analysis

Analyze the interests of the target's followers based on who they follow.

In [None]:
# Load clusters data if available
clusters_files = sorted(glob.glob("data/results/clusters_*.json*"), key=os.path.getmtime, reverse=True)

if clusters_files:
    # Load the latest clusters file
    clusters_file = os.path.basename(clusters_files[0])
    clusters = storage.load_data(
        filename=clusters_file,
        data_type='results',
        format='json',
        decompress=clusters_file.endswith('.gz')
    )
    
    # Create a pie chart of cluster sizes
    cluster_sizes = [cluster['size'] for cluster in clusters.values()]
    cluster_labels = [f"{cluster_id}: {', '.join(cluster['top_features'][:3])}" 
                     for cluster_id, cluster in clusters.items()]
    
    plt.figure(figsize=(12, 8))
    plt.pie(cluster_sizes, labels=cluster_labels, autopct='%1.1f%%', startangle=90, 
           shadow=False, explode=[0.05] * len(cluster_sizes))
    plt.title('Distribution of Interest Clusters')
    plt.axis('equal')
    plt.tight_layout()
    plt.show()
    
    # Print detailed cluster information
    print("\n🔍 Interest Cluster Analysis:")
    for cluster_id, cluster_data in clusters.items():
        print(f"\n{cluster_id} ({cluster_data['size']} accounts):")
        print(f"  Top features: {', '.join(cluster_data['top_features'])}")
        print(f"  Top accounts:")
        for i, account in enumerate(cluster_data['top_accounts'][:5]):
            verified = "✓" if account['is_verified'] else " "
            print(f"    {i+1}. {verified} @{account['username']} - {account['follower_count']} followers")
else:
    print("\n❌ No clusters data found. Run the cluster identification in the main pipeline first.")

## Penetration Rate Analysis

Analyze the penetration rate distribution and identify accounts with unusually high penetration.

In [None]:
# Plot penetration rate distribution
plt.figure(figsize=(12, 6))
sns.histplot(follower_df['penetration_rate'], bins=30, kde=True)
plt.title('Distribution of Penetration Rates')
plt.xlabel('Penetration Rate (%)')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Identify accounts with unusually high penetration rates
# Define high penetration as > mean + 2*std
mean_penetration = follower_df['penetration_rate'].mean()
std_penetration = follower_df['penetration_rate'].std()
high_penetration_threshold = mean_penetration + 2 * std_penetration

high_penetration_accounts = follower_df[follower_df['penetration_rate'] > high_penetration_threshold]
high_penetration_accounts = high_penetration_accounts.sort_values('penetration_rate', ascending=False)

print(f"\n🔍 Accounts with Unusually High Penetration Rates (>{high_penetration_threshold:.2f}%):")
max_shown = 20  # Show top 20
for i, (_, account) in enumerate(high_penetration_accounts.head(max_shown).iterrows()):
    verified = "✓" if account['is_verified'] else " "
    print(f"{i+1}. {verified} @{account['username']} - {account['penetration_rate']:.2f}% penetration")

# Only report a remainder when accounts were actually left out
remaining = len(high_penetration_accounts) - max_shown
if remaining > 0:
    print(f"... and {remaining} more accounts")
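
The mean + 2·std rule can be checked by hand on a small sample. The sketch below uses made-up penetration rates and the standard library's `statistics` module, whose `stdev` is the sample standard deviation, the same default as pandas' `Series.std()`.

```python
import statistics

# Made-up penetration rates (%): nine typical accounts and one outlier
rates = [2.0] * 9 + [40.0]

mean_rate = statistics.mean(rates)
std_rate = statistics.stdev(rates)  # sample standard deviation (n - 1 in the denominator)
threshold = mean_rate + 2 * std_rate

# Keep only the rates that exceed the threshold
flagged = [rate for rate in rates if rate > threshold]
print(f"threshold = {threshold:.2f}%, flagged = {flagged}")
```

Note that the outlier itself raises both the mean and the standard deviation, so on small or heavily skewed samples this rule can be conservative. A percentile cutoff is a common alternative.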

## Influence Score Components Analysis

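Analyze the components that contribute to the influence score. In this notebook the engagement rate and recency components are fixed placeholders (0.5 times their weight), so only the follower count and verified status components vary between accounts.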

In [None]:
# Create a function to calculate influence score components
def calculate_influence_components(account, total_followers):
    from config.settings import PROCESSING
    
    weights = PROCESSING['INFLUENCE_SCORE_WEIGHTS']
    
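    # Follower count component: ratio to the followers analyzed, scaled by 10 and capped at 1.0
    follower_ratio = account['follower_count'] / total_followers if total_followers else 0.0
    follower_component = min(follower_ratio * 10, 1.0) * weights['FOLLOWER_COUNT']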
    
    # Verified status component
    verified_component = 1.0 if account['is_verified'] else 0.0
    verified_component *= weights['VERIFIED_STATUS']
    
    # Engagement rate component (placeholder)
    engagement_component = 0.5 * weights['ENGAGEMENT_RATE']
    
    # Recency component (placeholder)
    recency_component = 0.5 * weights['RECENCY']
    
    return {
        'follower_component': follower_component,
        'verified_component': verified_component,
        'engagement_component': engagement_component,
        'recency_component': recency_component
    }

# Calculate components for top accounts
top_accounts = influence_df.head(10).to_dict('records')
total_followers = rankings['metadata']['total_followers_analyzed']

for account in top_accounts:
    account['components'] = calculate_influence_components(account, total_followers)

# Create a stacked bar chart of influence components
components_df = pd.DataFrame([
    {
        'username': account['username'],
        'Follower Count': account['components']['follower_component'],
        'Verified Status': account['components']['verified_component'],
        'Engagement Rate': account['components']['engagement_component'],
        'Recency': account['components']['recency_component']
    }
    for account in top_accounts
])

# Plot stacked bar chart
# DataFrame.plot creates its own figure, so pass figsize here instead of calling plt.figure()
components_df.set_index('username')[
    ['Follower Count', 'Verified Status', 'Engagement Rate', 'Recency']
].plot(kind='bar', stacked=True, colormap='viridis', figsize=(14, 8))
plt.title('Influence Score Components for Top 10 Accounts')
plt.xlabel('Username')
plt.ylabel('Influence Score Component')
plt.legend(title='Component')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("\n🔍 Influence Score Component Analysis:")
print("The influence score is calculated based on multiple components:")
# Use the percent format specifier to avoid float artifacts such as 30.000000000000004%
print(f"- Follower Count: {PROCESSING['INFLUENCE_SCORE_WEIGHTS']['FOLLOWER_COUNT']:.1%} of the score")
print(f"- Verified Status: {PROCESSING['INFLUENCE_SCORE_WEIGHTS']['VERIFIED_STATUS']:.1%} of the score")
print(f"- Engagement Rate: {PROCESSING['INFLUENCE_SCORE_WEIGHTS']['ENGAGEMENT_RATE']:.1%} of the score")
print(f"- Recency: {PROCESSING['INFLUENCE_SCORE_WEIGHTS']['RECENCY']:.1%} of the score")
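
To make the weighting concrete, here is a worked example with hypothetical weights. The real values live in `config.settings.PROCESSING` and may differ, and the sketch assumes the score is the plain sum of the four weighted components.

```python
# Hypothetical weights for illustration; the real values come from config.settings.PROCESSING
example_weights = {
    "FOLLOWER_COUNT": 0.4,
    "VERIFIED_STATUS": 0.2,
    "ENGAGEMENT_RATE": 0.3,
    "RECENCY": 0.1,
}


def influence_components(follower_count, is_verified, total_followers, weights):
    """Split an influence score into its weighted parts, mirroring the cell above."""
    follower_ratio = follower_count / total_followers if total_followers else 0.0
    return {
        "follower": min(follower_ratio * 10, 1.0) * weights["FOLLOWER_COUNT"],
        "verified": (1.0 if is_verified else 0.0) * weights["VERIFIED_STATUS"],
        "engagement": 0.5 * weights["ENGAGEMENT_RATE"],  # placeholder, as in the notebook
        "recency": 0.5 * weights["RECENCY"],  # placeholder, as in the notebook
    }


# A verified account with 500 followers, measured against 10,000 analyzed followers
parts = influence_components(500, True, 10_000, example_weights)
total = sum(parts.values())
print(parts, round(total, 3))
```

With these weights the two placeholder components contribute the same amount for every account, so differences in the total come only from follower count and verification.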

## Recommendations Based on Analysis

Generate recommendations for the target account based on the analysis.

In [None]:
# Generate recommendations
print("\n🌟 Recommendations for @{}:".format(rankings['metadata']['target_username']))
print("\n1. Potential Collaborations:")
for i, account in enumerate(influence_df.head(5).to_dict('records')):
    verified = "✓" if account['is_verified'] else " "
    print(f"   {verified} @{account['username']} - High influence score of {account['influence_score']:.2f}")

print("\n2. Audience Interest Areas:")
if 'clusters' in locals():
    for cluster_id, cluster_data in list(clusters.items())[:3]:
        print(f"   - {', '.join(cluster_data['top_features'][:5])}")
else:
    print("   Run cluster analysis to identify audience interest areas")

print("\n3. Engagement Strategy:")
print("   - Focus on creating content related to the identified interest clusters")
print("   - Engage with high-penetration accounts to increase visibility")
print("   - Consider the community structure when planning outreach")

print("\n4. Growth Opportunities:")
print("   - Identify accounts with high influence but low follower count for potential partnerships")
print("   - Target communities where your presence is currently low")
print("   - Analyze verified vs. non-verified account performance in your niche")

## Export Advanced Analysis Results

Export the results of the advanced analysis for further use.

In [None]:
# Create a comprehensive report
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
report_filename = f"data/results/advanced_analysis_report_{timestamp}.md"

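# Make sure the output directory exists, and write UTF-8 so the "✓" marks encode on every platform
os.makedirs(os.path.dirname(report_filename), exist_ok=True)
with open(report_filename, 'w', encoding='utf-8') as f: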
    f.write(f"# Advanced Instagram Network Analysis Report\n\n")
    f.write(f"## Target Account: @{rankings['metadata']['target_username']}\n\n")
    f.write(f"- **Analysis Date:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    f.write(f"- **Followers Analyzed:** {rankings['metadata']['total_followers_analyzed']}\n")
    f.write(f"- **Unique Accounts:** {rankings['metadata']['total_unique_accounts']}\n\n")
    
    f.write(f"## Top Accounts by Influence Score\n\n")
    f.write(f"| Rank | Username | Verified | Influence Score | Follower Count | Penetration Rate |\n")
    f.write(f"| ---- | -------- | -------- | --------------- | -------------- | ---------------- |\n")
    
    for i, account in enumerate(influence_df.head(20).to_dict('records')):
        verified = "✓" if account['is_verified'] else " "
        f.write(f"| {i+1} | @{account['username']} | {verified} | {account['influence_score']:.2f} | ")
        f.write(f"{account['follower_count']} | {account['penetration_rate']:.2f}% |\n")
    
    f.write(f"\n## Audience Interest Areas\n\n")
    
    if 'clusters' in locals():
        for cluster_id, cluster_data in clusters.items():
            f.write(f"### {cluster_id} ({cluster_data['size']} accounts)\n\n")
            f.write(f"**Top Features:** {', '.join(cluster_data['top_features'])}\n\n")
            f.write(f"**Top Accounts:**\n\n")
            f.write(f"| Rank | Username | Verified | Follower Count |\n")
            f.write(f"| ---- | -------- | -------- | -------------- |\n")
            
            for i, account in enumerate(cluster_data['top_accounts'][:10]):
                verified = "✓" if account['is_verified'] else " "
                f.write(f"| {i+1} | @{account['username']} | {verified} | {account['follower_count']} |\n")
            
            f.write(f"\n")
    else:
        f.write(f"*Cluster analysis not available*\n\n")
    
    f.write(f"## Recommendations\n\n")
    
    f.write(f"### Potential Collaborations\n\n")
    for i, account in enumerate(influence_df.head(10).to_dict('records')):
        verified = "✓" if account['is_verified'] else " "
        f.write(f"- {verified} @{account['username']} - Influence score: {account['influence_score']:.2f}, ")
        f.write(f"Penetration rate: {account['penetration_rate']:.2f}%\n")
    
    f.write(f"\n### Engagement Strategy\n\n")
    f.write(f"- Focus on creating content related to the identified interest clusters\n")
    f.write(f"- Engage with high-penetration accounts to increase visibility\n")
    f.write(f"- Consider the community structure when planning outreach\n\n")
    
    f.write(f"### Growth Opportunities\n\n")
    f.write(f"- Identify accounts with high influence but low follower count for potential partnerships\n")
    f.write(f"- Target communities where your presence is currently low\n")
    f.write(f"- Analyze verified vs. non-verified account performance in your niche\n")

print(f"\n✅ Advanced analysis report saved to {report_filename}")
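
The report builds several Markdown tables with repeated `f.write` calls. One possible refactoring is a small helper that renders a whole table as a string. The helper name `markdown_table` and the sample rows are hypothetical.

```python
def markdown_table(headers, rows):
    """Render headers and rows (sequences of values) as a Markdown table string."""
    lines = [
        "| " + " | ".join(headers) + " |",
        # Separator row: one run of dashes per column, at least three wide
        "| " + " | ".join("-" * max(len(h), 3) for h in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines) + "\n"


# Hypothetical account data
table = markdown_table(
    ["Rank", "Username", "Followers"],
    [(1, "@alice", 1200), (2, "@bob", 950)],
)
print(table)
```

The returned string can be passed to `f.write` directly, which keeps the header, separator, and row formatting in one place.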

## Conclusion

This advanced analysis provides deeper insights into the Instagram follower network for the target account. The results can be used to inform content strategy, identify potential collaborations, and understand audience interests.

### Key Takeaways

- The influence score combines multiple factors, with follower count and verification status being significant contributors
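- Community detection groups accounts with similar influence scores; with actual following relationships as edges, it would reveal clusters of related accounts that can inform content strategy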
- Accounts with high penetration rates are particularly valuable for reaching the target's audience
- The correlation between different metrics helps understand what drives influence in this specific network

### Next Steps

- Use the recommendations to inform your Instagram strategy
- Run this analysis periodically to track changes in the network
- Compare results across different target accounts to identify broader trends
- Incorporate additional data sources (e.g., engagement metrics) for more comprehensive analysis