# Table of Contents
* [1 Collecting Events](#Collecting-Events)
  * [1.1 Number of unique tweets](#Number-of-unique-tweets)
  * [1.2 Tweets in each dataset](#Tweets-in-each-dataset)
  * [1.3 Number of datasets tweet is included in by number of tweets](#Number-of-datasets-tweet-is-included-in-by-number-of-tweets)
  * [1.4 Number of tweets that are unique to each dataset](#Number-of-tweets-that-are-unique-to-each-dataset)

# Collecting Events

How each dataset was collected:
* sfm_during_search: A one-time search initiated in the middle of the event using Social Feed Manager.
* sfm_filter: A filter stream initiated prior to the event using Social Feed Manager. The filter is briefly interrupted every 30 minutes.
* sfm_post_search: A one-time search initiated after the conclusion of the event using Social Feed Manager.
* sfm_search: Recurring incremental searches initiated prior to the event using Social Feed Manager. A search is skipped if the previous search has not yet completed.
* mith_filter: A filter stream that ran from 2017-09-05 04:11:55 to 2017-09-13 16:52:42 using twarc.
* mith_search: An hourly search that ran from 2017-09-01 13:51:02 to 2017-09-17 07:30:03 using twarc_archive.

In [1]:
import csv
import pandas as pd
import numpy as np
from dateutil.parser import parse as date_parse
import os
import glob
import logging
import gzip

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

def load_iter(base_filename, start_datetime, end_datetime):
    """Yield the IDs of tweets created in [start_datetime, end_datetime).

    Reads data/<base_filename>-00.csv.gz, -01.csv.gz, ... in order and stops
    at the first file that cannot be opened. Column 0 holds the tweet ID and
    column 1 the created-at timestamp.
    """
    file_count = 0
    try:
        while True:
            filepath = 'data/{}-{:02d}.csv.gz'.format(base_filename, file_count)
            with gzip.open(filepath, 'rt') as f:
                logging.debug('Loading %s', filepath)
                reader = csv.reader(f)
                for count, line in enumerate(reader):
                    if count % 100000 == 0:
                        logging.debug('Loaded %s', count)
                    created_at_datetime = date_parse(line[1])
                    if start_datetime <= created_at_datetime < end_datetime:
                        yield line[0]
            file_count += 1
    except IOError:
        # No further numbered file exists for this dataset.
        pass

def load(base_filename, start_datetime, end_datetime):
    """Return a Series of 1s indexed by unique tweet ID and named after the dataset."""
    tweet_series = pd.Series(1, index=load_iter(base_filename, start_datetime, end_datetime), name=base_filename)
    return tweet_series[~tweet_series.index.duplicated()]

# Sept 9-11 2017
start_date = date_parse('2017-09-09 00:00:00-04:00')
end_date = date_parse('2017-09-12 00:00:00-04:00')

tweet_dfs = []
for filepath in glob.glob('data/*-00.csv.gz'):
    # Strip the trailing '-00.csv.gz' (10 characters) to get the dataset name.
    base_filename = os.path.basename(filepath)[:-10]
    tweet_dfs.append(load(base_filename, start_date, end_date))

tweet_df = pd.concat(tweet_dfs, axis=1).fillna(0)
tweet_df["total"] = tweet_df.sum(axis=1)

DEBUG:root:Loading data/mith_filter.csv-00.csv.gz
DEBUG:root:Loaded 0
DEBUG:root:Loaded 100000
...
DEBUG:root:Loaded 1900000
DEBUG:root:Loading data/mith_filter.csv-01.csv.gz
DEBUG:root:Loaded 0
...
DEBUG:root:Loading data/mith_search.csv-06.csv.gz
DEBUG:root:Loaded 0
DEBUG:root:Loading data/sfm_during_search-00.csv.gz
DEBUG:root:Loaded 0
...
DEBUG:root:Loaded 1900000
DEBUG:root:Loading data/sfm_during_search-01.csv.gz
DEBUG:root:Loaded 0
...
DEBUG:root:Loading data/sfm_post_search-01.csv.gz
DEBUG:root:Loaded 0
...
DEBUG:root:Loaded 1900000
DEBUG:root:Loading data/sfm_post_search-02.csv.gz
DEBUG:root:Loaded 0
...
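
The loader keeps a tweet only when its creation time falls in the half-open window from `start_date` (inclusive) to `end_date` (exclusive). The sketch below illustrates that boundary behaviour. It assumes the timestamps in the CSV files carry an explicit UTC offset, which the notebook does not confirm; the sample timestamps and the `in_window` helper are hypothetical.

```python
from dateutil.parser import parse as date_parse

# Window boundaries copied from the notebook (offset -04:00).
start_date = date_parse('2017-09-09 00:00:00-04:00')
end_date = date_parse('2017-09-12 00:00:00-04:00')

def in_window(created_at, start, end):
    """Hypothetical helper: True when start <= created_at < end."""
    return start <= date_parse(created_at) < end

# Hypothetical timestamps, expressed in UTC (assumed format).
samples = [
    '2017-09-09 03:59:59+00:00',  # one second before the window opens
    '2017-09-09 04:00:00+00:00',  # exactly the start: included
    '2017-09-12 03:59:59+00:00',  # last second inside the window
    '2017-09-12 04:00:00+00:00',  # exactly the end: excluded
]
results = [in_window(s, start_date, end_date) for s in samples]
print(results)
```

If the stored timestamps had no offset, comparing them with these timezone-aware boundaries would raise a `TypeError`, so the offset assumption matters.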

In [2]:
tweet_df.head()

| | mith_filter.csv | mith_search.csv | sfm_during_search | sfm_filter | sfm_post_search | sfm_search | total |
|---|---|---|---|---|---|---|---|
| 906366551712038912 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 5.0 |
| 906366551749570560 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 5.0 |
| 906366551871418369 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 5.0 |
| 906366552181796864 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 5.0 |
| 906366552181805057 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 5.0 |
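
Each row of `tweet_df` is a tweet ID, each dataset column is 1 if that dataset contains the tweet and 0 otherwise, and `total` counts the datasets containing it. A minimal sketch of the same construction on toy data follows; the dataset names and tweet IDs are hypothetical, and the real notebook reads the IDs from gzipped CSV files instead.

```python
import pandas as pd

# Hypothetical tweet IDs per dataset.
collected = {
    'dataset_a': ['t1', 't2', 't3', 't3'],  # 't3' is duplicated on purpose
    'dataset_b': ['t2', 't3'],
    'dataset_c': ['t3', 't4'],
}

columns = []
for name, ids in collected.items():
    series = pd.Series(1, index=ids, name=name)
    # Drop repeated IDs so each tweet counts once per dataset.
    columns.append(series[~series.index.duplicated()])

# Outer-join the columns on tweet ID; a missing value means "not in this dataset".
membership = pd.concat(columns, axis=1).fillna(0)
membership['total'] = membership.sum(axis=1)
print(membership)
```

Removing duplicate IDs before the concatenation matters: `pd.concat(..., axis=1)` cannot align Series whose indexes contain duplicates.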


## Number of unique tweets

In [3]:
tweet_df[['total']].count()

total    8073438
dtype: int64

## Tweets in each dataset

In [4]:
tweet_summary_df = pd.DataFrame(tweet_df.sum(), columns=['tweet_count']).drop('total')
tweet_summary_df['percentage'] = tweet_summary_df.tweet_count.div(tweet_df.count()).mul(100).round(1).astype(str) + '%'
tweet_summary_df

| | tweet_count | percentage |
|---|---|---|
| mith_filter.csv | 7384737.0 | 91.5% |
| mith_search.csv | 4948024.0 | 61.3% |
| sfm_during_search | 4916920.0 | 60.9% |
| sfm_filter | 7284440.0 | 90.2% |
| sfm_post_search | 4381984.0 | 54.3% |
| sfm_search | 4206151.0 | 52.1% |
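
Each percentage is the dataset's tweet count divided by the number of unique tweets across all datasets. Because the membership columns hold only 0s and 1s, the column mean gives the same ratio directly. The sketch below shows this on a hypothetical toy matrix and then rechecks one figure from the table above.

```python
import pandas as pd

# Hypothetical 0/1 membership matrix: rows are tweets, columns are datasets.
membership = pd.DataFrame(
    {'dataset_a': [1, 1, 1, 0], 'dataset_b': [0, 1, 1, 0], 'dataset_c': [0, 0, 1, 1]},
    index=['t1', 't2', 't3', 't4'],
)

summary = pd.DataFrame({
    'tweet_count': membership.sum(),
    # Mean of a 0/1 column is the share of tweets present in that dataset.
    'percentage': membership.mean().mul(100).round(1),
})
print(summary)

# Recheck of the mith_filter.csv row using the counts reported in the notebook.
mith_filter_pct = round(100 * 7384737 / 8073438, 1)
print(mith_filter_pct)  # 91.5
```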


## Number of datasets tweet is included in by number of tweets
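That is, for each possible number of datasets (1 to 6), how many tweets appear in exactly that many datasets. For example, 3,032,328 tweets appear in all 6 datasets.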

In [5]:
tweet_df.total.value_counts()

6.0    3032328
2.0    2800470
5.0    1196515
4.0     486682
3.0     420301
1.0     137142
Name: total, dtype: int64
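
These counts can be cross-checked against the earlier results using only arithmetic. The groups should add up to the number of unique tweets, and weighting each group by its number of datasets should reproduce the sum of the per-dataset tweet counts. The sketch below uses figures copied from the outputs in this notebook; the variable names are illustrative.

```python
# Figures copied from the notebook outputs.
datasets_per_tweet = {6: 3032328, 2: 2800470, 5: 1196515,
                      4: 486682, 3: 420301, 1: 137142}
tweets_per_dataset = {
    'mith_filter.csv': 7384737,
    'mith_search.csv': 4948024,
    'sfm_during_search': 4916920,
    'sfm_filter': 7284440,
    'sfm_post_search': 4381984,
    'sfm_search': 4206151,
}
unique_tweets = 8073438

# Every unique tweet falls into exactly one group.
group_total = sum(datasets_per_tweet.values())

# A tweet found in k datasets contributes k to the per-dataset counts.
memberships = sum(k * n for k, n in datasets_per_tweet.items())

print(group_total == unique_tweets)                      # True
print(memberships == sum(tweets_per_dataset.values()))   # True
print(memberships)                                       # 33122256
```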

## Number of tweets that are unique to each dataset

In [6]:
tweet_df[tweet_df.total == 1].sum().drop('total')

mith_filter.csv      77740.0
mith_search.csv       1100.0
sfm_during_search     6593.0
sfm_filter           36871.0
sfm_post_search      13482.0
sfm_search            1356.0
dtype: float64
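
The filter `tweet_df.total == 1` keeps tweets found in exactly one dataset, so summing the remaining rows gives the number of tweets exclusive to each dataset. A minimal sketch on a hypothetical toy matrix follows, together with a check that the reported exclusive counts add up to the 137,142 tweets found in a single dataset.

```python
import pandas as pd

# Hypothetical 0/1 membership matrix: rows are tweets, columns are datasets.
membership = pd.DataFrame(
    {'dataset_a': [1, 1, 1, 0], 'dataset_b': [0, 1, 1, 0], 'dataset_c': [0, 0, 1, 1]},
    index=['t1', 't2', 't3', 't4'],
)
membership['total'] = membership.sum(axis=1)

# Keep tweets present in exactly one dataset, then count them per dataset.
unique_counts = membership[membership.total == 1].sum().drop('total')
print(unique_counts)

# Exclusive counts copied from the notebook output above.
reported_unique = [77740, 1100, 6593, 36871, 13482, 1356]
print(sum(reported_unique))  # 137142
```

In the toy data only `t1` and `t4` are exclusive, so `dataset_a` and `dataset_c` each get a count of 1 and `dataset_b` gets 0.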