# MS MARCO v1.1 Dataset Exploration

This notebook provides an exploratory data analysis (EDA) of the MS MARCO v1.1 training dataset (`train.parquet`).

MS MARCO is a collection of datasets focused on deep learning in search, originally released at NIPS 2016. Its questions are sampled from real, anonymized Bing search queries and are paired with human-generated answers.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from tqdm.notebook import tqdm
import json

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

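# Set plotting style
# (sns.set() would re-apply seaborn's default theme and override ggplot,
# so only the font scale is changed here)
plt.style.use('ggplot')
sns.set_context('notebook', font_scale=1.2)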

## Dataset Overview

Based on the dataset info, MS MARCO v1.1 contains the following features:

- **answers**: List of string answers
- **passages**: List of dictionaries containing:
  - **is_selected**: Integer flag (1 if passage was selected as relevant, 0 otherwise)
  - **passage_text**: The text content of the passage
  - **url**: Source URL of the passage
- **query**: The search query text
- **query_id**: Unique identifier for the query
- **query_type**: Type of the query
- **wellFormedAnswers**: List of well-formatted answers
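
The cells below iterate over `passages` as a list of per-passage dictionaries. Some Parquet exports may instead store the field as a single dictionary of parallel arrays (one array per key); whether your file does is an assumption to check against the first row. The following is a minimal sketch of a helper that accepts either layout. The name `normalize_passages` and the example record are hypothetical and not part of the dataset.

```python
def normalize_passages(passages):
    """Return passages as a list of per-passage dicts.

    Accepts either a sequence of dicts (the layout described above) or a
    single dict of parallel sequences (an assumed alternative layout).
    """
    if isinstance(passages, dict):
        keys = list(passages)
        columns = [passages[key] for key in keys]
        # Zip the parallel sequences back into one dict per passage
        return [dict(zip(keys, values)) for values in zip(*columns)]
    return [dict(p) for p in passages]


# Hypothetical record following the schema above; all values are made up.
row_layout = [
    {"is_selected": 0, "passage_text": "First passage.", "url": "http://example.com/a"},
    {"is_selected": 1, "passage_text": "Second passage.", "url": "http://example.com/b"},
]
column_layout = {
    "is_selected": [0, 1],
    "passage_text": ["First passage.", "Second passage."],
    "url": ["http://example.com/a", "http://example.com/b"],
}

# Both layouts normalize to the same list of dicts
print(normalize_passages(column_layout) == normalize_passages(row_layout))
```

If the loaded column turns out to use the parallel-array layout, applying the helper once with `df['passages'].apply(normalize_passages)` before the passage cells should let them run unchanged.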

In [None]:
# Load the train.parquet dataset
parquet_path = os.path.join('data', 'raw', 'parquet', 'train.parquet')

# Look for the file in the expected location first, then in fallback locations
candidate_paths = [
    parquet_path,
    os.path.join('data', 'parquet', 'train.parquet'),
    'train.parquet',
]

for path in candidate_paths:
    if os.path.exists(path):
        print(f"Loading dataset from {path}...")
        df = pd.read_parquet(path)
        print(f"Dataset loaded successfully with {len(df):,} rows.")
        break
else:
    # Fail here rather than with a NameError on `df` in a later cell
    raise FileNotFoundError(
        f"Could not find train.parquet. Checked: {candidate_paths}"
    )

In [None]:
# Display basic information about the dataset
print("Dataset shape:", df.shape)
print("\nColumn names:")
for col in df.columns:
    print(f"- {col}")

print("\nMemory usage:")
print(f"{df.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")

In [None]:
# Display a sample of the data to understand its structure
df.head(3)

## Query Analysis

In [None]:
# Basic statistics about queries
df['query_length'] = df['query'].apply(len)
df['query_word_count'] = df['query'].apply(lambda x: len(x.split()))

print("Query length statistics (characters):")
print(df['query_length'].describe())

print("\nQuery word count statistics:")
print(df['query_word_count'].describe())

In [None]:
# Visualize query length distribution
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.histplot(df['query_length'], bins=50, kde=True)
plt.title('Distribution of Query Lengths (characters)')
plt.xlabel('Number of Characters')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
sns.histplot(df['query_word_count'], bins=30, kde=True)
plt.title('Distribution of Query Word Counts')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
# Analyze query types
query_type_counts = df['query_type'].value_counts()

print("Query type distribution:")
print(query_type_counts)

# Plot query types
plt.figure(figsize=(12, 6))
query_type_counts.plot(kind='bar')
plt.title('Distribution of Query Types')
plt.xlabel('Query Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## Passages Analysis

In [None]:
# Count passages per query
df['num_passages'] = df['passages'].apply(len)

print("Passages per query statistics:")
print(df['num_passages'].describe())

# Count selected passages per query (passages marked as relevant)
df['num_selected_passages'] = df['passages'].apply(lambda x: sum(1 for p in x if p['is_selected'] == 1))

print("\nSelected passages per query statistics:")
print(df['num_selected_passages'].describe())
print("\nDistribution of selected passages per query:")
print(df['num_selected_passages'].value_counts().sort_index())

In [None]:
# Visualize passages distribution
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.histplot(df['num_passages'], bins=20, kde=False)
plt.title('Distribution of Passages per Query')
plt.xlabel('Number of Passages')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
sns.countplot(y=df['num_selected_passages'])
plt.title('Distribution of Selected Passages per Query')
plt.xlabel('Frequency')
plt.ylabel('Number of Selected Passages')

plt.tight_layout()
plt.show()

In [None]:
# Analyze passage lengths
# This might take a while for large datasets
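def get_passage_lengths(row):
    lengths = [len(p['passage_text']) for p in row]
    # Length of the first selected passage; NaN when none is selected,
    # so queries without a selected passage do not count as zero-length
    selected_length = next(
        (len(p['passage_text']) for p in row if p['is_selected'] == 1),
        np.nan
    )
    return {
        'min_length': min(lengths) if lengths else 0,
        'max_length': max(lengths) if lengths else 0,
        'avg_length': sum(lengths) / len(lengths) if lengths else 0,
        'selected_length': selected_length
    }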

# Run on a sample to avoid long processing time
sample_size = min(10000, len(df))
sample_df = df.sample(sample_size, random_state=42)

passage_lengths = sample_df['passages'].apply(get_passage_lengths)
passage_stats = pd.DataFrame(passage_lengths.tolist())

print("Passage length statistics (characters) from sample of", sample_size, "rows:")
print(passage_stats.describe())

# Plot passage lengths
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
sns.histplot(passage_stats['avg_length'], bins=50, kde=True)
plt.title('Distribution of Average Passage Length')
plt.xlabel('Average Length (characters)')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
sns.histplot(passage_stats['selected_length'], bins=50, kde=True)
plt.title('Distribution of Selected Passage Length')
plt.xlabel('Length (characters)')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
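
The per-row helper above can also be expressed by flattening the nested column into one row per passage, which makes further slicing (for example, selected versus unselected passages) a plain `groupby`. The following is a sketch on a made-up two-query frame, assuming `passages` holds a list of dictionaries as described in the overview. The names `toy`, `long_df`, `per_query`, and `selected` are illustrative only.

```python
import pandas as pd

# Made-up data with the same nesting as the `passages` column
toy = pd.DataFrame({
    "query_id": [1, 2],
    "passages": [
        [
            {"is_selected": 1, "passage_text": "abc", "url": "u1"},
            {"is_selected": 0, "passage_text": "abcdef", "url": "u2"},
        ],
        [
            {"is_selected": 0, "passage_text": "xy", "url": "u3"},
        ],
    ],
})

# One row per (query, passage) pair
long_df = toy.explode("passages", ignore_index=True)

# Expand each passage dict into its own columns
fields = pd.DataFrame(long_df["passages"].tolist())
long_df = pd.concat([long_df[["query_id"]], fields], axis=1)
long_df["passage_length"] = long_df["passage_text"].str.len()

# Per-query length statistics, and lengths of selected passages only
per_query = long_df.groupby("query_id")["passage_length"].agg(["min", "max", "mean"])
selected = long_df.loc[long_df["is_selected"] == 1, "passage_length"]

print(per_query)
print(selected.tolist())
```

Unlike the `next(...)` lookup in the helper, this form keeps every selected passage when a query has more than one. On the full dataset the exploded frame has one row per passage, so running it on a sample first is probably sensible.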

## Answers Analysis

In [None]:
# Analyze answer counts and lengths
df['num_answers'] = df['answers'].apply(len)
df['num_well_formed_answers'] = df['wellFormedAnswers'].apply(len)

print("Number of answers per query:")
print(df['num_answers'].describe())
print("\nNumber of well-formed answers per query:")
print(df['num_well_formed_answers'].describe())

In [None]:
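# Distribution of number of answers
# (counts that occur in only one of the two columns are filled with 0 instead of NaN)
answer_counts = pd.DataFrame({
    'Regular Answers': df['num_answers'].value_counts().sort_index(),
    'Well-Formed Answers': df['num_well_formed_answers'].value_counts().sort_index()
}).fillna(0).astype(int)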

print("Distribution of answers per query:")
print(answer_counts)

# Plot answer counts
# (DataFrame.plot creates its own figure, so the size is passed to it directly)
answer_counts.plot(kind='bar', figsize=(12, 6))
plt.title('Distribution of Answers and Well-Formed Answers per Query')
plt.xlabel('Number of Answers')
plt.ylabel('Frequency')
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Compute answer lengths for a sample
def get_answer_lengths(answers):
    lengths = [len(a) for a in answers]
    return {
        'min_length': min(lengths) if lengths else 0,
        'max_length': max(lengths) if lengths else 0,
        'avg_length': sum(lengths) / len(lengths) if lengths else 0
    }

# Only process rows with at least one answer
has_answers = sample_df[sample_df['num_answers'] > 0]
answer_lengths = has_answers['answers'].apply(get_answer_lengths)
answer_stats = pd.DataFrame(answer_lengths.tolist())

print("Answer length statistics (characters) from sample:")
print(answer_stats.describe())

# Plot answer lengths
plt.figure(figsize=(10, 6))
sns.histplot(answer_stats['avg_length'], bins=50, kde=True)
plt.title('Distribution of Average Answer Length')
plt.xlabel('Average Length (characters)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

## Relationship Between Query Type and Other Metrics

In [None]:
# Analyze metrics by query type
query_type_metrics = df.groupby('query_type').agg({
    'query_length': 'mean',
    'query_word_count': 'mean',
    'num_passages': 'mean',
    'num_selected_passages': 'mean',
    'num_answers': 'mean',
    'num_well_formed_answers': 'mean',
    'query_id': 'count'
}).rename(columns={'query_id': 'count'})

print("Metrics by query type:")
query_type_metrics

In [None]:
# Visualize metrics by query type
metrics_to_plot = ['query_length', 'query_word_count', 'num_passages', 'num_selected_passages', 
                   'num_answers', 'num_well_formed_answers']

fig, axes = plt.subplots(3, 2, figsize=(16, 18))
axes = axes.flatten()

for i, metric in enumerate(metrics_to_plot):
    sns.barplot(x=query_type_metrics.index, y=query_type_metrics[metric], ax=axes[i])
    axes[i].set_title(f'Average {metric.replace("_", " ").title()} by Query Type')
    axes[i].set_xlabel('Query Type')
    axes[i].set_ylabel('Average Value')
    axes[i].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## Example Queries and Passages

In [None]:
# Display example queries and their passages
def display_example(row):
    print(f"Query ID: {row['query_id']}")
    print(f"Query: {row['query']}")
    print(f"Query Type: {row['query_type']}")
    print(f"Number of Passages: {len(row['passages'])}")
    
    print("\nAnswers:")
    for i, answer in enumerate(row['answers']):
        print(f"  {i+1}. {answer}")
    
    # Use len() so the check also works when the column holds NumPy arrays
    if len(row['wellFormedAnswers']) > 0:
        print("\nWell-Formed Answers:")
        for i, answer in enumerate(row['wellFormedAnswers']):
            print(f"  {i+1}. {answer}")
    
    print("\nSelected Passages:")
    selected = [(i, p) for i, p in enumerate(row['passages']) if p['is_selected'] == 1]
    if selected:
        for i, passage in selected:
            print(f"  Passage {i+1} (URL: {passage['url']}):")
            # Truncate long passages; keep the same indentation in both cases
            text = passage['passage_text']
            print(f"  {text[:300]}..." if len(text) > 300 else f"  {text}")
            print()
    else:
        print("  No passages were selected as relevant.")
    
    print("-" * 80)

# Display 3 random examples
print("Example Queries and Passages:\n")
for _, row in df.sample(3, random_state=42).iterrows():
    display_example(row)

## Correlation Analysis

In [None]:
# Correlation analysis of numerical features
numeric_cols = ['query_length', 'query_word_count', 'num_passages', 'num_selected_passages', 
                'num_answers', 'num_well_formed_answers']

corr_matrix = df[numeric_cols].corr()

# Plot correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix of Numerical Features')
plt.tight_layout()
plt.show()
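
`DataFrame.corr()` computes the Pearson coefficient by default, which measures linear association and can be dominated by a few extreme values. Since the count columns here take only a handful of distinct values and the length columns may be skewed, a rank-based coefficient could serve as a cross-check. The following is a small sketch on made-up numbers (not taken from the dataset) showing how the two can differ.

```python
import pandas as pd

# Made-up data: y rises with x, but x contains one extreme value
toy = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [1, 2, 3, 4, 5]})

# Pearson is pulled down by the outlier; Spearman only looks at ranks
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

For the notebook's data, the equivalent call would be `df[numeric_cols].corr(method='spearman')`, which can be passed to the same heatmap code.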

## Summary Statistics and Insights

In [None]:
# Generate summary statistics
summary = {
    'Total Queries': len(df),
    'Query Types': len(df['query_type'].unique()),
    'Avg Passages Per Query': df['num_passages'].mean(),
    'Avg Selected Passages Per Query': df['num_selected_passages'].mean(),
    'Queries with Answers (%)': (df['num_answers'] > 0).mean() * 100,
    'Queries with Well-Formed Answers (%)': (df['num_well_formed_answers'] > 0).mean() * 100,
    'Avg Answers Per Query': df['num_answers'].mean(),
    'Avg Query Length (chars)': df['query_length'].mean(),
    'Avg Query Words': df['query_word_count'].mean()
}

print("Dataset Summary Statistics:")
for key, value in summary.items():
    print(f"{key}: {value:.2f}" if isinstance(value, float) else f"{key}: {value}")

## Conclusion

This notebook has provided an exploratory data analysis of the MS MARCO v1.1 training dataset, focusing on:

1. **Query analysis**: Length, word count, and query types
2. **Passages analysis**: Distribution of passages and selected passages per query, plus passage lengths on a 10,000-row sample
3. **Answers analysis**: Distribution of regular and well-formed answers, plus answer lengths on the same sample
4. **Relationship between query types and various metrics**
5. **Example queries and passages**
6. **Correlation analysis** between different numerical features

The insights gained from this analysis can be used to better understand the MS MARCO dataset and inform subsequent modeling approaches for search relevance and question answering tasks.