# Instagram Network Analysis - Main Pipeline

This notebook implements the main pipeline for analyzing Instagram follower networks. It collects followers from a target Instagram profile, analyzes who they follow, and generates insights about the most influential accounts in this network.

## Setup and Configuration

First, let's set up the environment and install required dependencies.

In [None]:
# Install required packages (asyncio ships with Python, so it is not installed from PyPI)
!pip install -q requests pandas numpy matplotlib seaborn networkx scikit-learn aiohttp tqdm ipywidgets

In [None]:
# Clone the repository
!git clone https://github.com/your-username/instagram-network-analysis.git
%cd instagram-network-analysis

## Import Modules

Now let's import the necessary modules for our analysis.

In [None]:
import os
import sys
import json
import asyncio
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from pathlib import Path
from tqdm.notebook import tqdm

# Add project root to path
sys.path.append('.')

# Import project modules
from config.settings import INSTAGRAM, STORAGE, PROCESSING, VISUALIZATION
from config.logging_config import setup_logging, get_logger
from src.utils.auth import InstagramAuth
from src.utils.storage import DataStorage
from src.collectors.follower_collector import FollowerCollector
from src.collectors.following_collector import FollowingCollector
from src.processors.network_processor import NetworkProcessor
from src.visualizers.network_visualizer import NetworkVisualizer

# Set up logging
logger = setup_logging(log_dir='logs')

## Google Cloud Storage Integration (Optional)

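To save data to Google Cloud Storage (GCS), set `USE_GCS = True` and fill in the bucket, project, and credentials values below. The cell also mounts Google Drive in Colab so that a credentials file can be read from it.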

In [None]:
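# Mount Google Drive (only available when running in Colab)
try:
    from google.colab import drive
    drive.mount('/content/drive')
except ImportError:
    print("Not running in Colab; skipping Google Drive mount.")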

# Configure GCS (optional)
USE_GCS = False  # Set to True to enable GCS integration

if USE_GCS:
    # Install GCS libraries
    !pip install -q google-cloud-storage
    
    # Update settings
    STORAGE['GCS']['ENABLED'] = True
    STORAGE['GCS']['BUCKET_NAME'] = 'your-bucket-name'
    STORAGE['GCS']['PROJECT_ID'] = 'your-project-id'
    
    # Path to credentials file (if needed)
    STORAGE['GCS']['CREDENTIALS_FILE'] = '/content/drive/MyDrive/path/to/credentials.json'

## Authentication Setup

Set up authentication with Instagram using a session cookie.

In [None]:
# Function to get session cookie from user
def get_session_cookie():
    from IPython.display import display, HTML
    from ipywidgets import widgets
    
    print("\nInstructions to get your Instagram session cookie:")
    print("1. Log in to Instagram in your browser")
    print("2. Open developer tools (F12 or right-click > Inspect)")
    print("3. Go to the 'Application' or 'Storage' tab")
    print("4. Under 'Cookies', find 'instagram.com'")
    print("5. Find the 'sessionid' cookie and copy its value")
    print("\nIMPORTANT: Keep this value private and do not share it with anyone!\n")
    
    cookie_input = widgets.Password(description='Session Cookie:', style={'description_width': 'initial'}, layout={'width': '500px'})
    display(cookie_input)
    
    return cookie_input

# Get session cookie
cookie_input = get_session_cookie()

In [None]:
# Read the session cookie from the widget (run this cell after pasting the value above)
session_cookie = cookie_input.value.strip()

# Create auth instance
auth = InstagramAuth(session_cookie=session_cookie)

# Validate authentication
if auth.validate_auth():
    print("✅ Authentication successful!")
else:
    print("❌ Authentication failed. Please check your session cookie and try again.")
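
For orientation, the sketch below shows one way a `sessionid` cookie can be attached to a `requests.Session`. It is not the project's `InstagramAuth` implementation, which is not shown in this notebook. The helper name `build_session` and the User-Agent value are assumptions made for illustration.

```python
import requests


def build_session(session_cookie: str) -> requests.Session:
    """Return a requests.Session carrying the given sessionid cookie.

    Hypothetical helper: InstagramAuth may build its session differently.
    """
    cookie = session_cookie.strip()
    if not cookie:
        raise ValueError("session cookie is empty")

    session = requests.Session()
    # Scope the cookie to instagram.com so it is not sent to other hosts.
    session.cookies.set("sessionid", cookie, domain=".instagram.com", path="/")
    # Illustrative header only; a real client may send different headers.
    session.headers.update({"User-Agent": "Mozilla/5.0"})
    return session


session = build_session("  example-cookie-value  ")
print(session.cookies.get("sessionid", domain=".instagram.com"))
```

Rejecting an empty value up front gives a clearer error than a failed request later, which is useful if the cell is run before the widget has been filled in.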

## Initialize Storage

Set up storage for data collection and processing.

In [None]:
# Initialize storage
storage = DataStorage(base_dir='.', use_gcs=USE_GCS)

# Create directories if they don't exist
for directory in ['data/raw', 'data/processed', 'data/results']:
    os.makedirs(directory, exist_ok=True)

## Collect Followers

Collect followers from the target Instagram profile.

In [None]:
# Target profile to analyze
target_username = input("Enter the Instagram username to analyze: ")

# Maximum number of followers to collect
max_followers = int(input("Enter the maximum number of followers to collect (default: 20000): ") or "20000")

# Initialize follower collector
follower_collector = FollowerCollector(auth.get_session(), storage=storage)

# Collect followers
print(f"\nCollecting followers for @{target_username} (max: {max_followers})...")
print("This may take some time. Please be patient.")

# Run the collection asynchronously
async def collect_followers():
    return await follower_collector.collect_followers(
        username=target_username,
        max_followers=max_followers,
        save_interval=100
    )

# Run the collection
followers = await collect_followers()

print(f"\n✅ Collected {len(followers)} followers for @{target_username}")
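
The `max_followers` and `save_interval` arguments suggest paginated collection with periodic checkpoints. The following is a minimal sketch of that pattern, not the actual `FollowerCollector` code: `fetch_page`, `collect_with_checkpoints`, and the `save` callback are hypothetical names, and the fake fetcher stands in for real network calls.

```python
import asyncio


async def fetch_page(cursor: int, page_size: int = 50, total: int = 230):
    """Stand-in for one paginated API call; returns (items, next_cursor)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    end = min(cursor + page_size, total)
    items = [f"user_{i}" for i in range(cursor, end)]
    next_cursor = end if end < total else None
    return items, next_cursor


async def collect_with_checkpoints(max_items: int, save_interval: int, save):
    """Collect up to max_items, calling save() about every save_interval items."""
    collected, cursor, last_saved = [], 0, 0
    while cursor is not None and len(collected) < max_items:
        items, cursor = await fetch_page(cursor)
        # Trim the final page so the cap is never exceeded.
        collected.extend(items[: max_items - len(collected)])
        if len(collected) - last_saved >= save_interval:
            save(list(collected))
            last_saved = len(collected)
    if len(collected) > last_saved:
        save(list(collected))  # final flush of anything not yet saved
    return collected


checkpoints = []  # size of each snapshot that was "saved"
# In a notebook cell, use a top-level `await` instead of asyncio.run(),
# because Jupyter already runs an event loop.
followers_demo = asyncio.run(
    collect_with_checkpoints(
        max_items=120,
        save_interval=100,
        save=lambda snapshot: checkpoints.append(len(snapshot)),
    )
)
print(len(followers_demo), checkpoints)
```

Saving at intervals means an interrupted run loses at most one interval of work, which matters when a collection takes hours.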

## Collect Following Data

For each follower, collect the accounts they follow.

In [None]:
# Maximum number of following to collect per user
max_following_per_user = int(input("Enter the maximum number of following to collect per user (default: 1000): ") or "1000")

# Maximum number of parallel requests
max_parallel = int(input("Enter the maximum number of parallel requests (default: 3): ") or "3")

# Initialize following collector
following_collector = FollowingCollector(auth.get_session(), storage=storage)

# Collect following data
print(f"\nCollecting following data for {len(followers)} followers...")
print("This will take a significant amount of time. Please be patient.")

# Run the collection asynchronously
async def collect_following():
    return await following_collector.collect_following_for_followers(
        followers=followers,
        max_following=max_following_per_user,
        max_parallel=max_parallel
    )

# Run the collection
following_data = await collect_following()

print(f"\n✅ Collected following data for {len(following_data)} followers")
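
A common way to honor a limit such as `max_parallel` is an `asyncio.Semaphore`. The sketch below illustrates the idea with simulated requests. It is an assumption about how such a cap could be enforced, not a description of `FollowingCollector`, and `fetch_following` and `collect_all` are hypothetical names.

```python
import asyncio


async def fetch_following(username: str, semaphore: asyncio.Semaphore, stats: dict):
    """Stand-in for fetching one user's following list under a concurrency cap."""
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # placeholder for the network round trip
        stats["active"] -= 1
        return username, [f"{username}_follows_{i}" for i in range(2)]


async def collect_all(usernames, max_parallel: int = 3):
    """Fetch every user's following list, at most max_parallel at a time."""
    semaphore = asyncio.Semaphore(max_parallel)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fetch_following(name, semaphore, stats) for name in usernames)
    )
    return dict(results), stats["peak"]


# In a notebook cell, use a top-level `await` instead of asyncio.run().
following_demo, peak = asyncio.run(collect_all([f"user_{i}" for i in range(10)]))
print(len(following_demo), peak)
```

The `peak` counter records the highest number of simultaneous fetches, so it can be used to confirm that the cap holds.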

## Process Network Data

Process the collected data to generate insights.

In [None]:
# Initialize network processor
processor = NetworkProcessor(storage=storage)

# Get the latest followers and following data files
followers_file = f"followers_{target_username}_*.json.gz"
following_file = "following_collection_*.json.gz"

import glob
followers_files = sorted(glob.glob(f"data/raw/{followers_file}"), key=os.path.getmtime, reverse=True)
following_files = sorted(glob.glob(f"data/raw/{following_file}"), key=os.path.getmtime, reverse=True)

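if not followers_files or not following_files:
    # Stop here so later cells do not run with `rankings` and `clusters` undefined
    raise FileNotFoundError("❌ Could not find data files in data/raw. Please check that data collection was successful.")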
else:
    # Load the latest data files
    processor.load_data(
        followers_file=os.path.basename(followers_files[0]),
        following_file=os.path.basename(following_files[0])
    )
    
    # Process the data
    print("\nProcessing network data...")
    rankings = processor.process_network_data()
    
    # Identify clusters (optional)
    print("\nIdentifying interest clusters...")
    clusters = processor.identify_clusters(n_clusters=5)
    
    print("\n✅ Data processing complete")
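
The `rankings` dictionary is later read through the keys `by_follower_count`, `by_influence_score`, and `by_penetration_rate`, but this notebook does not show how those metrics are defined. As a rough illustration, the sketch below assumes that penetration rate means the share of sampled followers who follow a given account. That definition, the toy data, and the function name `penetration_rates` are assumptions, not the `NetworkProcessor` implementation.

```python
from collections import Counter

# Toy data: each sampled follower mapped to the accounts they follow.
sample_following = {
    "alice": ["nasa", "natgeo", "bbc"],
    "bob": ["nasa", "bbc"],
    "carol": ["nasa", "espn"],
    "dave": ["natgeo"],
}


def penetration_rates(following_by_user):
    """Return (account, rate) pairs, highest share of followers first."""
    total_users = len(following_by_user)
    counts = Counter()
    for accounts in following_by_user.values():
        counts.update(set(accounts))  # count each account once per user
    # Sort by count descending, then by name, so ties are ordered deterministically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(account, count / total_users) for account, count in ordered]


for account, rate in penetration_rates(sample_following):
    print(f"@{account}: {rate:.0%}")
```

Under this assumed definition, an account followed by three of four sampled followers has a penetration rate of 75%, regardless of how many followers it has overall.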

## Visualize Results

Create visualizations of the results.

In [None]:
# Initialize visualizer
visualizer = NetworkVisualizer(storage=storage)

# Create visualizations
print("\nCreating visualizations...")

# Network graph
network_graph_path = visualizer.create_network_graph(rankings)

# Follower distribution
distribution_path = visualizer.create_follower_distribution(rankings)

# Top accounts by follower count
follower_chart_path = visualizer.create_top_accounts_bar_chart(rankings, metric='follower_count')

# Top accounts by influence score
influence_chart_path = visualizer.create_top_accounts_bar_chart(rankings, metric='influence_score')

# Metric comparison
comparison_path = visualizer.create_metric_comparison_scatter(rankings)

# Cluster visualization (if clusters were identified)
if clusters:
    cluster_path = visualizer.create_cluster_visualization(clusters)

# Create dashboard
dashboard_path = visualizer.create_dashboard(rankings, clusters)

print("\n✅ Visualizations created")

## Display Results

Display the top accounts and visualizations.

In [None]:
# Display top accounts by follower count
print("\n🏆 Top 10 Accounts by Follower Count:")
for rank, account in enumerate(rankings['by_follower_count'][:10], start=1):
    verified = "✓" if account['is_verified'] else " "
    print(f"{rank}. {verified} @{account['username']} - {account['follower_count']} followers")

In [None]:
# Display top accounts by influence score
print("\n🌟 Top 10 Accounts by Influence Score:")
for rank, account in enumerate(rankings['by_influence_score'][:10], start=1):
    verified = "✓" if account['is_verified'] else " "
    print(f"{rank}. {verified} @{account['username']} - {account['influence_score']:.2f} score")

In [None]:
# Display dashboard
from IPython.display import Image, display

print("\n📊 Dashboard:")
display(Image(dashboard_path))

## Export Results

Export the results to various formats.

In [None]:
# Export rankings to CSV
print("\nExporting results to CSV...")

# Convert rankings to DataFrames
follower_df = pd.DataFrame(rankings['by_follower_count'])
influence_df = pd.DataFrame(rankings['by_influence_score'])
penetration_df = pd.DataFrame(rankings['by_penetration_rate'])

# Export to CSV
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
follower_df.to_csv(f"data/results/rankings_by_follower_{timestamp}.csv", index=False)
influence_df.to_csv(f"data/results/rankings_by_influence_{timestamp}.csv", index=False)
penetration_df.to_csv(f"data/results/rankings_by_penetration_{timestamp}.csv", index=False)

print("✅ Results exported to CSV files in the data/results directory")

## Conclusion

You've successfully analyzed the Instagram follower network for the target account. The results show which accounts are most influential among its followers, which can inform marketing, content strategy, and audience research.

### Next Steps

- Explore the `analysis.ipynb` notebook for more detailed analysis
- Run this analysis for different target accounts to compare results
- Adjust parameters like `max_followers` and `max_following_per_user` to balance depth and runtime
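
To get a feel for how `max_followers` and `max_following_per_user` affect runtime, a back-of-the-envelope estimate can help. The sketch below assumes 50 items per paginated request and 2 seconds per request. Both figures are illustrative guesses rather than values taken from the project settings or from Instagram, so treat the result as an order-of-magnitude upper bound.

```python
import math


def estimate_requests(num_followers, max_following_per_user, page_size=50):
    """Rough upper bound on paginated requests for a full run.

    page_size is an assumed value, not one taken from the project settings.
    """
    follower_pages = math.ceil(num_followers / page_size)
    following_pages_per_user = math.ceil(max_following_per_user / page_size)
    return follower_pages + num_followers * following_pages_per_user


def estimate_hours(total_requests, seconds_per_request=2.0, max_parallel=3):
    """Convert a request count to hours, assuming requests spread evenly in parallel."""
    return total_requests * seconds_per_request / max_parallel / 3600


requests_needed = estimate_requests(num_followers=20000, max_following_per_user=1000)
print(requests_needed, round(estimate_hours(requests_needed), 1))
```

Because the per-user term is multiplied by the number of followers, halving `max_following_per_user` roughly halves the estimate, while the follower-list pages contribute very little.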

### Important Notes

- This analysis respects Instagram's rate limits to avoid blocks
- The session cookie is only used for authentication and is not stored or shared
- All data is stored locally and/or in your GCS bucket if configured