# Data Inspection

**Prerequisite:** Run the `binary-matrix.sh` and `time-series.sh` scripts on the data folder of interest before using this notebook.

#### Structure of this notebook:  
    **Exploring the data and descriptive statistics:**   

    A. Package and Data Load  
    B. Understanding the Likers & Retweeters datasets  
    C. Understanding script performance
    D. Understanding user activity

# A. Package and Data Load
Specify your data directory in this section (`my_pull`).

In [None]:
import os
import glob
import json
import pandas as pd
import csv
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
import itertools
from matplotlib import pyplot as plt
from collections import Counter
from ast import literal_eval
newest_pull_directory = max(glob.glob('../Pull*'), key=os.path.getmtime)
from resources.datainspection import *

`my_pull`: Set the data directory you want to inspect, e.g. Pull-DD-MM-YYYY-hour:minute:second, or use the newest by setting `my_pull = newest_pull_directory`.

In [None]:
my_pull = newest_pull_directory
my_pull
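If no `Pull*` directory exists next to the notebook, calling `max()` on the empty `glob` result raises a `ValueError`. The following is a minimal sketch of a more defensive selection, assuming the same `Pull*` naming scheme. The helper `newest_pull` is hypothetical and not part of `resources.datainspection`.

```python
import glob
import os
import tempfile


def newest_pull(base="..", pattern="Pull*"):
    """Return the most recently modified directory matching pattern, or None."""
    candidates = [p for p in glob.glob(os.path.join(base, pattern)) if os.path.isdir(p)]
    # default=None avoids the ValueError that max() raises on an empty list
    return max(candidates, key=os.path.getmtime, default=None)


# Demonstration in a throwaway directory, so no real data folder is touched
with tempfile.TemporaryDirectory() as base:
    empty_result = newest_pull(base)  # no Pull* folders exist yet
    old = os.path.join(base, "Pull-old")
    new = os.path.join(base, "Pull-new")
    os.mkdir(old)
    os.mkdir(new)
    os.utime(old, (1_000_000_000, 1_000_000_000))  # force an older modification time
    os.utime(new, (2_000_000_000, 2_000_000_000))
    found = os.path.basename(newest_pull(base))
```

A `None` result can then be checked explicitly before any file is loaded.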

Load the data.

The data structures are described further below. In overview,
- `likers` is the dataframe of tweets and all their observed likers,
- `finalharvest_l` is the dataframe of tweets and likers observed during the final harvest,
- `retweeters` and `finalharvest_r` are the same, but for retweeters,
- `timeseries_likes` and `timeseries_retweets` are dataframes with all like/retweet counts over time.

In [None]:
likers = pd.read_pickle(os.path.join(my_pull,'binary-matrix-likers.pkl'))
retweeters = pd.read_pickle(os.path.join(my_pull,'binary-matrix-retweeters.pkl'))
finalharvest_l = pd.read_pickle(os.path.join(my_pull,'likers_final_harvest_complete.pkl'))
finalharvest_r = pd.read_pickle(os.path.join(my_pull,'retweeters_final_harvest_complete.pkl'))
finalharvest_l.index.names = ['tweet']
finalharvest_r.index.names = ['tweet']
timeseries_likes = pd.read_pickle(os.path.join(my_pull,'timeseries_likes.pkl'))
timeseries_retweets = pd.read_pickle(os.path.join(my_pull,'timeseries_retweets.pkl'))

# B. Exploring

## Summary Numbers

The following dataframe includes some summary totals of both `likers` and `retweeters`:

In [None]:
totals = pd.DataFrame()
totals.loc[1,'Total no. of: Liked Tweets'] = likers.shape[0]
totals.loc[1,'Likers'] = likers.shape[1]
totals.loc[1,'Likes'] = sum(likers.sum(axis = 1, skipna = True))
totals.loc[1,'Retweeted Tweets'] = retweeters.shape[0]
totals.loc[1,'Retweeters'] = retweeters.shape[1]
totals.loc[1,'Retweets'] = sum(retweeters.sum(axis = 1, skipna = True))
totals
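To make the three totals concrete, here is a small sketch on a hypothetical toy matrix with the same layout as `likers` (tweets in rows, users in columns, `1` for a like, `NaN` otherwise). The names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix: 2 tweets (rows) x 3 users (columns)
toy_likers = pd.DataFrame(
    [[1, np.nan, 1],
     [np.nan, 1, 1]],
    index=pd.Index([101, 102], name="tweet"),
    columns=["alice", "bob", "carol"],
)

n_tweets = toy_likers.shape[0]         # number of rows
n_users = toy_likers.shape[1]          # number of columns
n_likes = int(toy_likers.sum().sum())  # NaN is skipped, so this counts the 1s
```

Here `n_tweets` is 2, `n_users` is 3 and `n_likes` is 4.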

## Likes and Retweets over Time

We plot `timeseries_likes` and `timeseries_retweets` to see how the likes and retweets of our tweets develop over time.

Pay attention to whether the plot lines are generally only increasing. The more up-and-down oscillation you see, the less precise the estimation of dataset completeness below will be.

In [None]:
likes_transposed = timeseries_likes.T
retweets_transposed = timeseries_retweets.T

In [None]:
# All tweets:
likes_transposed.plot.line(legend=False)
plt.xlabel("Timesteps")
plt.ylabel('Likecount')
# Subset only:
# likes_transposed.iloc[:,0:20].plot.line(legend=False)
# plt.xlabel("Timesteps")
# plt.ylabel('Likecount')

In [None]:
# All tweets:
retweets_transposed.plot.line(legend=False)
plt.xlabel("Timesteps")
plt.ylabel('Retweetcount')
# 
# Subset only:
# retweets_transposed.iloc[:,0:20].plot.line(legend=False)
# plt.xlabel("Timesteps")
# plt.ylabel('Retweetcount')
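Besides eyeballing the plots, the amount of oscillation can be quantified. The following sketch counts, per tweet, how often the count drops from one timestep to the next. It runs on a hypothetical toy dataframe with the same orientation as the time series dataframes (tweets in rows, timesteps in columns) and is only one possible way to measure this.

```python
import numpy as np
import pandas as pd

# Hypothetical toy time series: rows are tweets, columns are timesteps
toy_ts = pd.DataFrame(
    [[1, 3, 5, 5],
     [4, 6, 5, np.nan],
     [2, 2, 1, 3]],
    index=[101, 102, 103],
)

# Difference between consecutive timesteps; NaN where a value is missing
steps = toy_ts.diff(axis=1)

# Number of timesteps at which the count dropped, per tweet
n_drops = (steps < 0).sum(axis=1)

# Share of tweets with at least one drop
share_with_drops = (n_drops > 0).mean()
```

In this toy data, tweets 102 and 103 each show one drop, so two thirds of the tweets oscillate at least once.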

You may notice that some lines end before others. That is because we stopped tracking the tweet at that point.

Take a peek at `timeseries_retweets.head()` and see that tweets stop having recorded retweet counts at some point and show `NaN` instead.

In [None]:
timeseries_retweets.head()

If you want to look up the like/retweet count of a given tweet at a given time, you can use syntax like this (`.at` and `.loc` look up by tweet ID and column label, `.iloc` by row and column position):

timeseries_retweets.at[1537520976086421504,2]

timeseries_retweets.loc[1537520976086421504,2]

timeseries_retweets.iloc[1,2]
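The difference between label-based and position-based lookup is easy to miss, so here is a short sketch on a hypothetical toy dataframe (tweet IDs and counts are made up):

```python
import pandas as pd

# Hypothetical toy time series: index = tweet IDs, columns = timestep labels
toy_ts = pd.DataFrame(
    [[10, 11, 12],
     [20, 21, 22]],
    index=[555, 444],
    columns=[0, 1, 2],
)

by_label_at = toy_ts.at[444, 2]    # label-based, single cell
by_label_loc = toy_ts.loc[444, 2]  # label-based, also accepts slices and lists
by_position = toy_ts.iloc[1, 2]    # position-based: second row, third column
first_row = toy_ts.iloc[0, 2]      # position 0 is tweet 555, whatever its ID
```

The first three lookups return the same cell only because tweet 444 happens to sit in the second row.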

## Lists of Likers

Let us look at the dataset of liking users, stored in `likers`. In `likers`, the row index is the tweet ID and the column names are user names. A cell contains `1` if the user liked the tweet, else `NaN`. Both rows and columns are sorted: rows numerically, columns alphabetically. The `retweeters` and `finalharvest_` dataframes are structured in the same way.

This section provides some examples of how to prod the `likers` matrix.

To find some tweet IDs, we may want to look at the subsection of the first 3 tweets and the first 5 users, using `.iloc`:

In [None]:
likers.iloc[0:3,0:5]

Or perhaps we want to look at the tweets with the highest like count (retweet count) logged for the final harvest:

In [None]:
finalharvest_l[['like_count']].sort_values('like_count', ascending=False)
#finalharvest_r[['retweet_count']].sort_values('retweet_count', ascending=False)

As the index is the tweet ID, we can look up the row of a single tweet by using its ID and `.loc`.
Here, we pass the list `[tweet]` to return a nicely formatted dataframe. If you would rather have a Series, pass just `tweet`.

In [None]:
tweet = 1537712147500781569
# likers.loc[tweet] # Series
likers.loc[[tweet]] # Dataframe

We can subset `likers` to only the users that have liked the tweet by dropping columns with NaN values:

In [None]:
likers.loc[[tweet]].dropna(axis='columns')

We get a list of the liking users of the tweet by listing the column names:

In [None]:
likers.loc[[tweet]].dropna(axis='columns').columns.values.tolist()
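The same lookup can be written on the Series form of the row, and the matrix also supports the reverse question of which tweets a given user liked. A minimal sketch on a hypothetical toy matrix (user names and tweet IDs are made up):

```python
import numpy as np
import pandas as pd

toy_likers = pd.DataFrame(
    [[1, np.nan, 1],
     [np.nan, 1, 1]],
    index=pd.Index([101, 102], name="tweet"),
    columns=["alice", "bob", "carol"],
)

row = toy_likers.loc[101]  # Series indexed by user name
likers_of_tweet = row[row.notna()].index.tolist()

# Same result via the dataframe route: drop the NaN columns of the one-row frame
same = toy_likers.loc[[101]].dropna(axis="columns").columns.tolist()

# The reverse lookup: which tweets did a given user like?
tweets_by_carol = toy_likers.index[toy_likers["carol"].notna()].tolist()
```

Both routes give `['alice', 'carol']` for tweet 101, and the reverse lookup gives `[101, 102]` for `carol`.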

## C. Checking dataset completeness

This section investigates how complete the dataset is with respect to the collected likers and retweeters. 

To evaluate how complete the dataset is, we compare the number of likers/retweeters the script collected per tweet to the maximum number of likes/retweets the tweet has received across all observations. We suggest comparing with the maximum number (instead of, e.g., the latest number) of likes to account for retracted and deleted likes/retweets as well as deleted tweets.

Example: We compare to the maximum number of likes instead of the latest number (e.g. last logged for final harvest) to account for the following case: A tweet may have gotten 100 likes early on that the script misses, and then slowly gets 100 more likes that the script all collects as intended, while the first 100 unlike or have their accounts deleted. Going by the latest like count, this would not inform us that we missed out, but tell us that we got 100 out of 100 likers.
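The arithmetic of this example can be spelled out as follows. The sketch assumes that the later 100 likes arrive before the first 100 are retracted, so that the highest observed like count is 200. If likes and retractions overlap, the observed maximum is lower and the measure is less informative.

```python
# Numbers from the example above
missed_early = 100  # likes placed early, whose likers the script missed
collected = 100     # likers the script did collect

# Assumption: the peak is reached before any of the early likes disappear
max_count = missed_early + collected     # highest like count ever observed
latest_count = max_count - missed_early  # count after the first 100 are gone

completeness_vs_latest = collected / latest_count  # looks perfect
completeness_vs_max = collected / max_count        # reveals the gap
```

Measured against the latest count, the collection appears complete (1.0). Measured against the maximum, it shows that only half of the likers were collected (0.5).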

Further, please note that the like count at final harvest is the last like count logged during the observation period, and does not reflect the like count *at* time of final harvest collection. The logged final harvest like count only serves to sort the final harvest likers/retweeters collection such that likers of tweets with higher like counts are collected first.

We apply the described completeness measure to the tweets that were also subject to collection in the final harvest.

### C0. Maximal Like and Retweet Counts

To implement the completeness measure, we add the final harvest like/retweet count to the timeseries dataframes, so we can find the maximum number of likes/retweets across all observations:

In [None]:
# Work on copies so the original timeseries dataframes do not gain a 'final' column
timeseries_likes_all = timeseries_likes.copy()
timeseries_retweets_all = timeseries_retweets.copy()
for tweet in finalharvest_l.index:
    timeseries_likes_all.at[tweet,'final'] = finalharvest_l.at[tweet,'like_count']
for tweet in finalharvest_r.index:
    timeseries_retweets_all.at[tweet,'final'] = finalharvest_r.at[tweet,'retweet_count']

max_likes = timeseries_likes_all.max(axis=1)
max_retweets = timeseries_retweets_all.max(axis=1)
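Note that a plain assignment such as `a = b` does not copy a dataframe: both names refer to the same object, so a column added through one name is visible through the other. The following sketch on hypothetical toy data uses `.copy()` and a column assignment that aligns on the tweet ID index. Unlike the loop with `.at`, a column assignment only fills tweets that are already present in the time series and does not add rows.

```python
import numpy as np
import pandas as pd

toy_ts = pd.DataFrame(
    [[3, 5, np.nan],
     [7, 6, 4]],
    index=pd.Index([101, 102], name="tweet"),
    columns=[0, 1, 2],
)
toy_final = pd.Series({101: 4, 102: 9}, name="like_count")

toy_all = toy_ts.copy()       # independent copy; toy_ts stays untouched
toy_all["final"] = toy_final  # values are aligned on the tweet ID index

toy_max = toy_all.max(axis=1)  # NaN entries are skipped
```

The row-wise maximum is 5 for tweet 101 and 9 for tweet 102, and `toy_ts` still has only its three timestep columns.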

For fun, let us print the tweet IDs of the tweets with the most likes and retweets:

In [None]:
print("Likes:   ", max_likes.idxmax(), max_likes.max())
print("Retweets:", max_retweets.idxmax(), max_retweets.max())

### C1. How many tweets were included in the final harvest?

As we restrict the analysis below to tweets included in the final harvest, it may be of interest to know how large a portion of all collected tweets this is:

In [None]:
all_vs_finalharvest = pd.DataFrame()
all_vs_finalharvest.loc[1,'All'] = likers.shape[0]
all_vs_finalharvest.loc[1,'Final harvest'] = finalharvest_l.shape[0]
all_vs_finalharvest.loc[1,'%'] = (finalharvest_l.shape[0]/likers.shape[0])*100
all_vs_finalharvest.loc[2,'All'] = retweeters.shape[0]
all_vs_finalharvest.loc[2,'Final harvest'] = finalharvest_r.shape[0]
all_vs_finalharvest.loc[2,'%'] = (finalharvest_r.shape[0]/retweeters.shape[0])*100
all_vs_finalharvest = all_vs_finalharvest.apply(np.floor).astype('int')
all_vs_finalharvest.index = ['Likes', 'Retweets']
all_vs_finalharvest

### C2. Plot overviews of missed likers and retweeters

This section gives a visual overview of how complete the dataset is (again, measured by number of collected likers vs. maximum like count across all observations to account for deletions).

In [None]:
getindex = likers.index.intersection(finalharvest_l.index)
subset_likerscomplete = likers.loc[getindex]
# drop any columns that as a result only contain NaN, i.e. users that have not liked any of the final harvest tweets:
subset_likerscomplete = subset_likerscomplete.dropna(how='all', axis=1) 

getindex = retweeters.index.intersection(finalharvest_r.index)
subset_retweeterscomplete = retweeters.loc[getindex]
# drop any columns that as a result only contain NaN, i.e. users that have not retweeted any of the final harvest tweets:
subset_retweeterscomplete = subset_retweeterscomplete.dropna(how='all', axis=1) 

# like count logged for the final harvest (last count logged during observation)
likecount = finalharvest_l['like_count']

# Max like count filtered for tweets relevant at final harvest:
max_likes.index.names = ['tweet']
getindex = max_likes.index.intersection(likecount.index)
subset_maxlikes = max_likes[getindex]

# number of collected likers per tweet
likerscollected = subset_likerscomplete.sum(axis = 1, skipna = True) 

# retweet count logged for the final harvest (last count logged during observation)
retweetcount = finalharvest_r['retweet_count']

# Max retweet count filtered for tweets relevant at final harvest:
max_retweets.index.names = ['tweet']
getindex = max_retweets.index.intersection(retweetcount.index)
subset_maxretweets = max_retweets[getindex]

# number of collected retweeters
retweeterscollected = subset_retweeterscomplete.sum(axis = 1, skipna = True) 

Sort the number of likers/retweeters we collected and the maximum like/retweet count the same way (for plotting):

In [None]:
subset_maxlikes = subset_maxlikes.sort_values(ascending = False)
likerscollected = likerscollected.reindex(subset_maxlikes.index)

subset_maxretweets = subset_maxretweets.sort_values(ascending = False)
retweeterscollected = retweeterscollected.reindex(subset_maxretweets.index)
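The sort-then-reindex pattern keeps both series in the same tweet order. A small sketch on hypothetical toy data shows the effect, including what happens to a tweet that is missing from the second series:

```python
import pandas as pd

toy_max = pd.Series({101: 5, 102: 40, 103: 12, 104: 3})
toy_collected = pd.Series({101: 5, 102: 31, 103: 12})  # tweet 104 is absent

# Sort by popularity, then put the collected counts into the same order
toy_max_sorted = toy_max.sort_values(ascending=False)
toy_collected_aligned = toy_collected.reindex(toy_max_sorted.index)
```

After reindexing, the tweets appear in the order 102, 103, 101, 104, and the absent tweet 104 gets `NaN` instead of raising an error.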

### C2.1 Relative number of missed likers and retweeters given the popularity of a tweet (absolute number of likes/retweets it attracts)

This plot presents the relative number of missed likers (blue) and retweeters (orange), with tweets sorted by their popularity (number of likes/retweets they attract) on the x-axis and the share of missed likers and retweeters per tweet on the y-axis. 

You learn how many likers/retweeters relative to the total number of attracted likes/retweets the script missed in collecting.  

You can learn from it that the script prioritises the collection of likers/retweeters for popular tweets. You should see that the script misses relatively fewer likers/retweeters when a tweet attracted many of them. The script de-prioritises the collection of likers/retweeters for tweets that get few likes/retweets.

First, we plot retweeters and likers separately, followed by a combined plot.

Retweeters:

In [None]:
plot_missed_scatter_single(subset_maxretweets,retweeterscollected, label = 'retweets', col = 'orange')

Likers:

In [None]:
plot_missed_scatter_single(subset_maxlikes,likerscollected, label = 'likes', col = 'dodgerblue')

Likers and Retweeters in one plot:

In [None]:
plot_missed_scatter_combined(subset_maxlikes, likerscollected, subset_maxretweets,retweeterscollected, label1 = 'likes', label2='retweets', col1='dodgerblue', col2='orange')
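The plotting helpers come from `resources.datainspection` and their internals are not shown here. As an assumption about the quantity on the y-axis, the share of missed likers per tweet can be sketched as the maximum count minus the collected count, divided by the maximum count. The toy values are hypothetical.

```python
import pandas as pd

toy_max = pd.Series({101: 200, 102: 50, 103: 4})
toy_collected = pd.Series({101: 190, 102: 40, 103: 1})

missed_abs = toy_max - toy_collected  # absolute number of missed likers
missed_share = missed_abs / toy_max   # share of missed likers per tweet
```

Tweets 101 and 102 both miss 10 likers, but that is 5% of the popular tweet and 20% of the less popular one. Tweet 103 misses only 3 likers, which is 75% of its total.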

### C2.2 Absolute number of missed likers and retweeters 

This plot presents the *absolute* number of missed likers (blue) and retweeters (orange), with tweets on the x-axis and the number of missed likers and retweeters on the y-axis. 

You learn how many likers/retweeters the script missed in collecting, e.g. due to a large batch of likes/retweets (say, 200) placed simultaneously.


In [None]:
# Absolute number of missed likes/retweets per tweet
plot_missed(subset_maxlikes, likerscollected, subset_maxretweets, retweeterscollected)

### C2.3 Combined absolute and relative missed likers and retweeters

This plot, like C2.1 but visualized slightly differently, presents the relative number of missed likes (blue) and retweets (orange), with tweets on the x-axis and the share of missed likes and retweets on the y-axis. 

You learn how many likers/retweeters relative to the total number of attracted likes/retweets the script missed in collecting.  

The dotted lines further tell you the total number of likes/retweets.

As above, you can learn from it that the script prioritises the collection of likers/retweeters for popular tweets. You should see that the script misses relatively fewer likers/retweeters when a tweet attracted many of them. The script de-prioritises the collection of likers/retweeters for tweets that get few likes/retweets.

In [None]:
# Supplemented with total number of likes/retweets each tweet attracted: 
plot_missed_relative_absolutecount(subset_maxlikes, likerscollected, subset_maxretweets, retweeterscollected)


### C3. Details of missed likers and retweeters

We now check in detail where likers and retweeters were missed.

The following dataframe lists tweets sorted by the difference between the maximum like count and the number of collected likers:

In [None]:
# inspect numbers more closely: likers
d = {'Collected likers |': likerscollected, 'Max like count |': subset_maxlikes, 'Absolute difference |': subset_maxlikes-likerscollected, 'Percent missed': ((subset_maxlikes-likerscollected)/subset_maxlikes)*100}
inspectlikes = pd.DataFrame(data=d).sort_values(by=['Absolute difference |'], ascending=False)
inspectlikes#.astype({'Collected likers':'int', 'Absolute difference':'int'})

The following dataframe lists tweets sorted by the difference between the maximum retweet count and the number of collected retweeters:

In [None]:
# inspect numbers more closely: retweeters
d = {'Collected retweeters |': retweeterscollected, 'Max retweet count |': subset_maxretweets, 'Absolute difference |': subset_maxretweets-retweeterscollected, 'Percent missed': ((subset_maxretweets-retweeterscollected)/subset_maxretweets)*100}
inspectretweets = pd.DataFrame(data=d).sort_values(by=['Absolute difference |'], ascending=False)
inspectretweets#.astype({'Collected retweeters':'int', 'Absolute difference':'int'})

The following dataframe gives you an idea of the number of tweets (as a share between 0 and 1 of the tweets included in the final harvest) where the script missed likers/retweeters in the collection, or where the script collected too many (when likes/retweets were deleted/retracted). 

Likers and retweeters are specified in rows 1 and 2, respectively. E.g., from the first column you can learn for what share of tweets the script collected 10 or more too many likers/retweeters. Or, from the last column, you can learn for what share of tweets the script collected a complete collection of likers/retweeters (compared to the maximum like/retweet count). 

In [None]:
perf = pd.DataFrame()
# 'Absolute difference |' is max count minus collected users:
# negative values mean too many were collected, positive values mean some were missed.
perf.loc[1,'Too many: % with 10 or more too many:'] = round(len(inspectlikes[inspectlikes['Absolute difference |'] <= -10])/len(inspectlikes),4)
perf.loc[2,'Too many: % with 10 or more too many:'] = round(len(inspectretweets[inspectretweets['Absolute difference |'] <= -10])/len(inspectretweets), 4)

perf.loc[1, 'Too few: % with 10 or more missed:'] = round(len(inspectlikes[inspectlikes['Absolute difference |'] >= 10])/len(inspectlikes),4)
perf.loc[2, 'Too few: % with 10 or more missed:'] = round(len(inspectretweets[inspectretweets['Absolute difference |'] >= 10])/len(inspectretweets),4)

perf.loc[1,'Too many: % with 10% or more too many:'] = round(len(inspectlikes[inspectlikes['Percent missed'] <= -10])/len(inspectlikes),4)
perf.loc[2,'Too many: % with 10% or more too many:'] = round(len(inspectretweets[inspectretweets['Percent missed'] <= -10])/len(inspectretweets),4)

perf.loc[1, 'Too few: % with 10% or more missed:'] = round(len(inspectlikes[inspectlikes['Percent missed'] >= 10])/len(inspectlikes),4)
perf.loc[2, 'Too few: % with 10% or more missed:'] = round(len(inspectretweets[inspectretweets['Percent missed'] >= 10])/len(inspectretweets),4)

perf.loc[1, 'Match: % with neither too many nor too few:'] = round(len(inspectlikes[inspectlikes['Absolute difference |'] == 0])/len(inspectlikes),4)
perf.loc[2, 'Match: % with neither too many nor too few:'] = round(len(inspectretweets[inspectretweets['Absolute difference |'] == 0])/len(inspectretweets),4)


perf.index = ['Likes', 'Retweets']

perf
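Each entry of this table is the share of tweets satisfying a condition on the difference column. A compact way to compute such shares is to take the mean of a boolean mask, sketched here on hypothetical difference values (maximum count minus collected users):

```python
import pandas as pd

# Hypothetical differences for five tweets: negative = too many collected
diff = pd.Series([-12, 0, 3, 10, 25])

share_too_many = (diff <= -10).mean()  # collected at least 10 more than the max count
share_too_few = (diff >= 10).mean()    # missed at least 10
share_match = (diff == 0).mean()       # exact match
```

With these values, one of five tweets has too many, two of five have too few, and one of five matches exactly.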

## D. Understanding user activity

### How many likes/retweets did the users place? How many unique likers/retweeters are in the dataset? 

The following tables and the plot provide some information about the frequency with which likers and retweeters were active. You'll learn how many very active users (those who place many likes/retweets) the dataset contains, and how many users liked/retweeted only once. You can explore the frequency tables for this purpose, or look at the plot. In the plot, likers are blue, retweeters are orange, and the number of placed likes/retweets is found on the x-axis, with the share of likers/retweeters on the y-axis. 

The bottom dataframe summarizes this data. Again, you'll find the data concerning likes and retweets in the rows, respectively. In the columns, we collect some data summaries about, e.g., the (share of/number of) users that placed more than 1 like/retweet, up to the (share of/number of) users that placed more than 50 likes/retweets.

In [None]:
freqtable_l, freqtable_r = make_frequency_table(likers, retweeters)

In [None]:
freqtable_l.head()

In [None]:
freqtable_r.head()

In [None]:
plot_frequency(freqtable_l, freqtable_r)
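The implementation of `make_frequency_table` lives in `resources.datainspection` and is not shown here. As an assumption about what such a table contains, the following sketch builds one from a hypothetical toy matrix, reusing the column names `placedlikes` and `freqlikers` that appear in the cells below.

```python
import numpy as np
import pandas as pd

toy_likers = pd.DataFrame(
    [[1, np.nan, 1],
     [np.nan, 1, 1],
     [np.nan, np.nan, 1]],
    index=[101, 102, 103],
    columns=["alice", "bob", "carol"],
)

# Likes placed per user (column sums), then how many users placed k likes
likes_per_user = toy_likers.sum(axis=0)
counts = likes_per_user.value_counts().sort_index()

toy_freqtable = pd.DataFrame({
    "placedlikes": counts.index.astype(int).tolist(),
    "freqlikers": counts.tolist(),
})

# Users who placed more than one like
more_than_one = toy_freqtable.loc[toy_freqtable["placedlikes"] > 1, "freqlikers"].sum()
```

In this toy data, two users placed one like and one user placed three, so exactly one user placed more than one like.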

In [None]:
users = pd.DataFrame()

# Total number of unique likers/retweeters, used as the denominator for the shares
total_likers = sum(freqtable_l['freqlikers'])
total_retweeters = sum(freqtable_r['freqretweeters'])

for threshold in [1, 2, 3, 4, 50]:
    # Number of users that placed more than `threshold` likes/retweets
    n_likers = freqtable_l.loc[freqtable_l['placedlikes'] > threshold, 'freqlikers'].sum()
    n_retweeters = freqtable_r.loc[freqtable_r['placedretweets'] > threshold, 'freqretweeters'].sum()

    users.loc[1, f'users placed more than {threshold}:'] = n_likers
    users.loc[2, f'users placed more than {threshold}:'] = n_retweeters

    users.loc[1, f'% users placed more than {threshold}:'] = round(n_likers/total_likers, 4)
    users.loc[2, f'% users placed more than {threshold}:'] = round(n_retweeters/total_retweeters, 4)

users.index = ['Likes', 'Retweets']
users
