# OONI Data Analysis - Internet Blocking Detection

## Introduction

The Open Observatory of Network Interference (OONI) is a global initiative that monitors Internet censorship and network interference. This notebook analyzes OONI measurement data to identify instances of Internet blocking and to assess how likely it is that an observed anomaly reflects genuine blocking rather than, for example, a transient network fault.

## Data Preparation

In this section, we'll load the provided CSV file and conduct an initial exploration of the dataset to understand its structure.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

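# Setting plot styles
# The 'seaborn-v0_8-*' style names exist in matplotlib >= 3.6; older releases use 'seaborn-whitegrid'
style_name = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(style_name)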
sns.set_context("notebook", font_scale=1.2)

# For better display of dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

In [None]:
# Load the CSV file
file_path = '202505-ooni-hpi-sample.csv'
df = pd.read_csv(file_path)

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print(f"\nNumber of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

In [None]:
# Display the first few rows of the dataset
df.head()

In [None]:
# Get a list of columns
print("Columns in the dataset:")
for i, col in enumerate(df.columns, 1):
    print(f"{i}. {col}")

In [None]:
# Get summary statistics
df.describe(include='all')

In [None]:
# Check data types and missing values
df_info = pd.DataFrame({
    'Data Type': df.dtypes,
    'Non-Null Count': df.count(),
    'Missing Values': df.isnull().sum(),
    'Missing Percentage': (df.isnull().sum() / len(df) * 100).round(2)
})

df_info

In [None]:
# Check unique values for categorical columns
categorical_columns = df.select_dtypes(include=['object', 'category']).columns.tolist()

for col in categorical_columns[:10]:  # Limiting to first 10 columns to avoid overwhelming output
    unique_values = df[col].nunique()
    if unique_values < 20:  # Only show value counts for columns with fewer than 20 unique values
        print(f"\n{col} - {unique_values} unique values:")
        # value_counts() already sorts by descending frequency
        print(df[col].value_counts().head(10))
    else:
        print(f"\n{col} - {unique_values} unique values (too many to display)")

### Temporal Distribution of Data

Let's examine the distribution of measurements over time.

In [None]:
# Check if there's a timestamp column.
# Substring matching can also catch non-timestamp columns (for example, a duration
# field whose name contains 'time'), so review the candidates printed below.
timestamp_cols = [col for col in df.columns if 'time' in col.lower() or 'date' in col.lower()]

if timestamp_cols:
    print(f"Potential timestamp columns: {timestamp_cols}")

    # Attempt to parse the first timestamp column
    time_col = timestamp_cols[0]
    try:
        if pd.api.types.is_numeric_dtype(df[time_col]):
            # Numeric values are assumed to be Unix epoch seconds
            df['parsed_time'] = pd.to_datetime(df[time_col], unit='s')
        else:
            # Otherwise treat the column as string timestamps
            df['parsed_time'] = pd.to_datetime(df[time_col])
            
        # Plot time distribution
        plt.figure(figsize=(12, 6))
        df['parsed_time'].dt.date.value_counts().sort_index().plot(kind='line')
        plt.title('Distribution of Measurements Over Time')
        plt.xlabel('Date')
        plt.ylabel('Number of Measurements')
        plt.tight_layout()
        plt.show()
    except Exception as e:
        print(f"Could not parse timestamp column {time_col}: {e}")
else:
    print("No obvious timestamp columns found.")
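
The cell above treats any numeric timestamp as Unix seconds. If the column instead holds milliseconds or microseconds, the parsed dates will be wrong or out of range. One possible guard is a magnitude-based heuristic, sketched below on synthetic values. The helper name `infer_epoch_unit` and its thresholds are illustrative assumptions (they presume reasonably recent timestamps) and are not derived from the OONI schema.

```python
import pandas as pd

def infer_epoch_unit(values: pd.Series) -> str:
    """Guess the unit of a numeric Unix timestamp column from its magnitude.

    Hypothetical helper: the thresholds assume recent (post-2001) timestamps.
    """
    median = values.dropna().abs().median()
    if median < 1e11:   # about 10 digits: seconds
        return 's'
    if median < 1e14:   # about 13 digits: milliseconds
        return 'ms'
    if median < 1e17:   # about 16 digits: microseconds
        return 'us'
    return 'ns'

# Synthetic example: the same instants expressed in seconds and in milliseconds
seconds = pd.Series([1746057600, 1746144000])
millis = seconds * 1000

parsed_s = pd.to_datetime(seconds, unit=infer_epoch_unit(seconds))
parsed_ms = pd.to_datetime(millis, unit=infer_epoch_unit(millis))
print(parsed_s.equals(parsed_ms))
```

In the notebook, the inferred unit could replace the hard-coded `unit='s'`, although checking a few parsed values by eye remains advisable.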

### Geographical Distribution

Let's examine the geographical distribution of the measurements.

In [None]:
# Check for country or location columns
# ('probe_cc' is the OONI field holding the probe's two-letter country code)
geo_cols = [col for col in df.columns if any(term in col.lower() for term in ['country', 'location', 'geo', 'region', 'city', 'probe_cc'])]

if geo_cols:
    print(f"Potential geographical columns: {geo_cols}")
    
    # Plot distribution for the first identified geographical column
    geo_col = geo_cols[0]
    plt.figure(figsize=(12, 8))
    top_locations = df[geo_col].value_counts().head(20)
    sns.barplot(x=top_locations.values, y=top_locations.index)
    plt.title(f'Top 20 {geo_col} Values by Number of Measurements')
    plt.xlabel('Number of Measurements')
    plt.ylabel(geo_col)
    plt.tight_layout()
    plt.show()
else:
    print("No obvious geographical columns found.")
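
The bar chart above shows raw counts. Expressing each location as a share of all measurements, together with a running total, makes it easier to judge how concentrated the sample is. The sketch below uses synthetic placeholder codes; `probe_cc` is assumed to be the country-code column and may be named differently in the actual file.

```python
import pandas as pd

# Synthetic stand-in; 'probe_cc' is an assumed column name and the codes are placeholders
codes = pd.Series(['AA'] * 6 + ['BB'] * 3 + ['CC'] * 1, name='probe_cc')

# Fraction of all measurements contributed by each code, largest first
share = codes.value_counts(normalize=True).rename('share').to_frame()

# Running total: how much of the dataset the top-k codes cover
share['cumulative_share'] = share['share'].cumsum()
print(share)
```

On the real data, `codes` would be replaced by `df[geo_col]`.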

### Test Results Analysis

Now, let's look at the distribution of test results to start identifying potential blocking.

In [None]:
# Check for result or status columns
result_cols = [col for col in df.columns if any(term in col.lower() for term in ['result', 'status', 'block', 'censor', 'anomaly', 'outcome'])]

if result_cols:
    print(f"Potential result columns: {result_cols}")
    
    # Analyze the first result column
    result_col = result_cols[0]
    plt.figure(figsize=(10, 6))
    df[result_col].value_counts().plot(kind='bar')
    plt.title(f'Distribution of {result_col}')
    plt.xlabel('Result Value')
    plt.ylabel('Count')
    plt.tight_layout()
    plt.show()
else:
    print("No obvious result columns found.")
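
Raw counts of a result value do not account for how many measurements each group contributed. One way to begin assessing likelihood is to compute an anomaly rate per country and attach a Wilson score interval, so that rates based on few measurements are visibly less certain. This is a minimal sketch on synthetic data, not the method used later in the notebook. The column names `probe_cc` and `anomaly` are assumptions about the CSV, and `wilson_interval` is a helper written for this example.

```python
import numpy as np
import pandas as pd

def wilson_interval(successes, totals, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95% confidence)."""
    successes = np.asarray(successes, dtype=float)
    totals = np.asarray(totals, dtype=float)
    p = successes / totals
    denom = 1 + z**2 / totals
    centre = (p + z**2 / (2 * totals)) / denom
    half = z * np.sqrt(p * (1 - p) / totals + z**2 / (4 * totals**2)) / denom
    return centre - half, centre + half

# Synthetic stand-in for the real data; column names and country codes are placeholders
sample = pd.DataFrame({
    'probe_cc': ['AA'] * 4 + ['BB'] * 200,
    'anomaly': [True, True, False, False] + [True] * 100 + [False] * 100,
})

# Count anomalous and total measurements per country
rates = sample.groupby('probe_cc')['anomaly'].agg(anomalies='sum', total='count')
rates['rate'] = rates['anomalies'] / rates['total']
rates['ci_low'], rates['ci_high'] = wilson_interval(rates['anomalies'], rates['total'])
print(rates)
```

Both synthetic groups have the same 50% anomaly rate, but the interval for the group with only four measurements is far wider, which is the signal to treat small samples with caution.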

## Summary of Initial Data Exploration

Based on the initial exploration, we can summarize the following about the OONI dataset:

1. **Dataset Size**: The dataset contains [to be filled after running] rows and [to be filled after running] columns.
2. **Key Columns**: [To be identified after running]
3. **Temporal Distribution**: [To be described after running]
4. **Geographical Distribution**: [To be described after running]
5. **Test Results**: [To be described after running]
6. **Missing Data**: [To be summarized after running]
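
Several of the placeholders above can be computed directly from the DataFrame once the notebook has been run. The sketch below shows one way to gather them, using a small synthetic frame in place of `df`. The helper name `summarize_dataset` is hypothetical and the column names are assumptions.

```python
import pandas as pd

def summarize_dataset(frame: pd.DataFrame, time_col=None, top_n=3) -> dict:
    """Collect headline figures for the exploration summary (hypothetical helper)."""
    # Percentage of missing values per column
    missing_pct = (frame.isnull().mean() * 100).round(2)
    summary = {
        'rows': len(frame),
        'columns': frame.shape[1],
        'most_missing': missing_pct.sort_values(ascending=False).head(top_n).to_dict(),
    }
    if time_col is not None:
        # Earliest and latest measurement timestamps
        times = pd.to_datetime(frame[time_col])
        summary['time_range'] = (times.min(), times.max())
    return summary

# Synthetic stand-in for df; column names and values are illustrative only
demo = pd.DataFrame({
    'measurement_start_time': ['2025-05-01 00:00:00', '2025-05-02 12:00:00', '2025-05-03 06:30:00'],
    'probe_cc': ['AA', 'BB', None],
})
summary = summarize_dataset(demo, time_col='measurement_start_time')
print(summary)
```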

In the next sections, we'll dive deeper into analyzing specific aspects of the data to identify instances of Internet blocking.