Our goal is to build a list of participants across IETF groups. With that list, we can evaluate patterns of participation: how many people participate and in which groups; how affiliation, gender, RFC authorship, or other characteristics relate to levels of participation; and a variety of related questions.

Start by importing the necessary libraries.

In [1]:
%matplotlib inline
import bigbang.mailman as mailman
import bigbang.graph as graph
import bigbang.process as process
from bigbang.parse import get_date
from bigbang.archive import Archive
import bigbang.utils as utils
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
import pytz
import pickle
import os
import csv
import re
import scipy
import scipy.cluster.hierarchy as sch
import email.utils # parseaddr is used below; import the submodule explicitly

Let's start with a single IETF mailing list. (Later, we can expand to all current groups, or all IETF lists ever.)

In [2]:
list_url = 'https://www.ietf.org/mail-archive/text/perpass/' # perpass happens to be one that I subscribe to

ietf_archives_dir = '../../ietf-archives' # relative location of the ietf-archives directory/repo

list_archive = mailman.open_list_archives(list_url, ietf_archives_dir)
activity = Archive(list_archive).get_activity()

Opening 43 archive files


In [3]:
people = pd.DataFrame(activity.sum(0), columns=['perpass']) # total message count per sender, summed over all dates

In [4]:
people.describe()

,perpass
count,261.0
mean,8.015326
std,18.733961
min,1.0
25%,1.0
50%,2.0
75%,7.0
max,231.0
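
To illustrate the summing step, here is a minimal sketch on made-up data. It assumes the activity table has one row per date and one column per sender, which is what the `sum(0)` call above suggests; the addresses and the `toy_list` column name are invented for the example.

```python
import pandas as pd

# Toy activity table (invented data): rows are dates, columns are senders.
toy_activity = pd.DataFrame(
    {"alice@example.org": [2, 0, 1], "bob@example.org": [0, 3, 0]},
    index=pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-03"]),
)

# Summing along axis 0 collapses the dates and leaves one total per sender.
totals = toy_activity.sum(axis=0).to_frame("toy_list")
print(totals)
```

Each row of `totals` is then one participant, so `totals.describe()` summarizes the distribution of message counts across people rather than across days.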


Now repeat this for every mailing list in the corpus, parsing the archives and collecting the activity for each list. To make this faster, we try to open pre-created `-activity.csv` files, which contain the activity summary for the full list archive. These files are created with `bin/mail_to_activity.py`, or may be included in the mailing list archive repository.
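
The general idea of reusing a pre-computed summary can be sketched as follows. This is not BigBang's implementation: `load_or_build_summary`, `build_toy_summary`, and the file name `toy-activity.csv` are hypothetical names used only for illustration.

```python
import os
import tempfile

import pandas as pd


def load_or_build_summary(csv_path, build):
    """Return the summary cached at csv_path, or build it and cache it (hypothetical helper)."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path, index_col=0)
    summary = build()
    summary.to_csv(csv_path)
    return summary


def build_toy_summary():
    # Stand-in for the expensive step of parsing a full archive (invented data).
    index = pd.Index(["alice@example.org", "bob@example.org"], name="From")
    return pd.DataFrame({"Message Count": [3, 1]}, index=index)


cache_path = os.path.join(tempfile.mkdtemp(), "toy-activity.csv")
first = load_or_build_summary(cache_path, build_toy_summary)   # builds and writes the CSV
second = load_or_build_summary(cache_path, build_toy_summary)  # reads the CSV back
print(second)
```

The second call never runs the build function, which is the saving we are after when a corpus contains hundreds of lists.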

In [12]:
reload(mailman)

<module 'bigbang.mailman' from '/Users/nick/code/mailing-list-analysis/bigbang/bigbang/mailman.py'>

In [19]:
with open('ietf_lists_normalized.txt', 'r') as f:
    # drop trailing newlines and skip blank lines
    ietf_lists = [line.strip() for line in f if line.strip()]

list_activities = []

for list_url in ietf_lists:
    try:
        activity_summary = mailman.open_activity_summary(list_url, ietf_archives_dir)
        if activity_summary is not None:
            list_activities.append((list_url, activity_summary))
    except Exception as e:
        print(str(e)) # report the failure and continue with the next list

In [20]:
len(list_activities)

335

In [23]:
for (list_url, activity_summary) in list_activities:
    list_name = mailman.get_list_name(list_url)
    activity_summary.rename(columns={'Message Count': list_name}, inplace=True) # name the message count column for the list
    people = pd.merge(people, activity_summary, how='outer', left_index=True, right_index=True)

In [24]:
people

From,perpass_x,16ng,6lo,6lowpan,ipv6_x,renum,6tisch,6tsch,abfab,accord,...,weirds,widex,woes,wpkops,ietf-and-github,xcon,w3c-ietf-xmldsig,xmpp,xrblock,yam
<yeonche@nownuri.net>,,,,,,,,,,,...,,,,,,,,,,
<ddzmkzhtdzqgie@cw-sol.com>,,,,,,,,,,,...,,,,,,,,,,
<lwcibcautxd@exwe01.exch.eds.com>,,,,,,,,,,,...,,,,,,,,,,
<yfghlnyfv@elvischarity.com>,,,,,,,,,,,...,,,,,,,,,,
2006 JPMorgan Chase & Co. <message.center@chase.com>,,,,,,,,,,,...,,,,,,,,,,
8 <sexy@abn2.com>,,,,,,,,,,,...,,,,,,,,,,
<abc@gosok.com>,,,,,,,,,,,...,,,,,,,,,,
<bbworld@bbconcert.com>,,,,,,,,,,,...,,,,,,,,,,
<byvehcenpaewtp@crazyfisherman.com>,,,,,,,,,,,...,,,,,,,,,,
<chlwlgur22@hanmail.net>,,,,,,,,,,,...,,,,,,,,,,
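
The `_x` suffix on columns such as `perpass_x` is what pandas produces when both sides of a merge carry a column with the same name. A plausible explanation here is that `people` already had a `perpass` column before the perpass summary was merged in again. The sketch below reproduces that behavior on invented data; the frames and addresses are assumptions for illustration only.

```python
import pandas as pd

senders = ["alice@example.org", "bob@example.org"]
people_toy = pd.DataFrame({"perpass": [5, 2]}, index=senders)
summary_perpass = pd.DataFrame({"perpass": [5, 2]}, index=senders)     # same column name
summary_tls = pd.DataFrame({"tls": [1]}, index=["carol@example.org"])  # new sender, new list

# Outer joins on the index keep every sender seen on either side.
merged = pd.merge(people_toy, summary_perpass, how="outer",
                  left_index=True, right_index=True)
merged = pd.merge(merged, summary_tls, how="outer",
                  left_index=True, right_index=True)
print(merged)
```

The overlapping name yields `perpass_x` and `perpass_y`, and senders who never posted to a list get `NaN` in that list's column, which is why the table above is mostly empty.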


In [48]:
# The index contains NaN values (not sure how they got there); replace them with a string so the later steps work
new_index = people.index.fillna('missing')
people.index = new_index
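
As a small check of what this step does, here is a sketch on an invented frame whose index contains a missing label:

```python
import numpy as np
import pandas as pd

# Toy frame (invented data) with one missing index label.
toy = pd.DataFrame({"count": [1, 2, 3]},
                   index=["a@example.org", np.nan, "b@example.org"])

# Index.fillna returns a new index with the missing labels replaced.
toy.index = toy.index.fillna("missing")
print(toy)
```

After the replacement every label is a string, so string-based processing of the index no longer has to deal with `NaN`.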

Split the From header we started with into its email address and display name.

In [46]:
froms = pd.Series(people.index)
emails = froms.apply(lambda x: email.utils.parseaddr(x)[1])
emails.index = people.index
names = froms.apply(lambda x: email.utils.parseaddr(x)[0])
names.index = people.index
people['email'] = emails
people['name'] = names
people

From,perpass_x,16ng,6lo,6lowpan,ipv6_x,renum,6tisch,6tsch,abfab,accord,...,woes,wpkops,ietf-and-github,xcon,w3c-ietf-xmldsig,xmpp,xrblock,yam,email,name
<yeonche@nownuri.net>,,,,,,,,,,,...,,,,,,,,,yeonche@nownuri.net,
<ddzmkzhtdzqgie@cw-sol.com>,,,,,,,,,,,...,,,,,,,,,ddzmkzhtdzqgie@cw-sol.com,
<lwcibcautxd@exwe01.exch.eds.com>,,,,,,,,,,,...,,,,,,,,,lwcibcautxd@exwe01.exch.eds.com,
<yfghlnyfv@elvischarity.com>,,,,,,,,,,,...,,,,,,,,,yfghlnyfv@elvischarity.com,
2006 JPMorgan Chase & Co. <message.center@chase.com>,,,,,,,,,,,...,,,,,,,,,message.center@chase.com,2006 JPMorgan Chase & Co.
8 <sexy@abn2.com>,,,,,,,,,,,...,,,,,,,,,sexy@abn2.com,8
<abc@gosok.com>,,,,,,,,,,,...,,,,,,,,,abc@gosok.com,
<bbworld@bbconcert.com>,,,,,,,,,,,...,,,,,,,,,bbworld@bbconcert.com,
<byvehcenpaewtp@crazyfisherman.com>,,,,,,,,,,,...,,,,,,,,,byvehcenpaewtp@crazyfisherman.com,
<chlwlgur22@hanmail.net>,,,,,,,,,,,...,,,,,,,,,chlwlgur22@hanmail.net,
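
For reference, this is how `email.utils.parseaddr` splits a header into a `(name, address)` pair. The sketch uses invented headers and parses each one only once, which is a minor variation on the cell above rather than the notebook's own code.

```python
from email.utils import parseaddr

import pandas as pd

# Invented From headers: one with a display name, one with only an address.
headers = ["Jane Doe <jane@example.org>", "<anon@example.org>"]

# parseaddr returns a (display name, email address) tuple for each header.
parsed = pd.DataFrame([parseaddr(h) for h in headers],
                      columns=["name", "email"], index=headers)
print(parsed)
```

Headers without a display name yield an empty string in the `name` column, which matches the blank `name` cells in the output above.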
