# Data Inspection

**Prerequisite:** The `binary-matrix.sh` script must have been run on the data folder of interest before this notebook is used.

Action items: `TODO`, `QUESTION`

#### Structure of this notebook:  
**Exploring the data and descriptive statistics:**  

A. Package and Data Load  
B. Understanding the Likers & Retweeters datasets  
C. Understanding script performance  
D. Understanding user activity

# A. Package and Data Load
Specify your data directory in this section (`my_pull`).

In [None]:
import os
import glob
import json
import pandas as pd
import csv
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
import itertools
from matplotlib import pyplot as plt
from collections import Counter
from ast import literal_eval
newest_pull_directory = max(glob.glob('../Pull*'), key=os.path.getmtime)
from resources.datainspection import *

`my_pull`: Set the data directory you want to inspect, e.g. Pull-DD-MM-YYYY-hour:minute:second, or use the newest by setting `my_pull = newest_pull_directory`.

In [None]:
my_pull = newest_pull_directory
my_pull
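
If no `Pull*` folder exists next to the notebook, `max()` on the empty glob result raises a `ValueError`. The following is a minimal sketch of a more defensive lookup. The helper name `newest_pull` and the folder names in the demonstration are made up for illustration and are not part of the notebook's resources.

```python
import glob
import os
import tempfile


def newest_pull(parent, pattern="Pull*"):
    """Return the most recently modified directory matching pattern, or None."""
    candidates = [
        path
        for path in glob.glob(os.path.join(parent, pattern))
        if os.path.isdir(path)
    ]
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


# Demonstration in a throwaway directory with two hypothetical pull folders.
with tempfile.TemporaryDirectory() as parent:
    older = os.path.join(parent, "Pull-01-01-2022-10-00-00")
    newer = os.path.join(parent, "Pull-02-01-2022-10-00-00")
    os.mkdir(older)
    os.mkdir(newer)
    # Set modification times explicitly so the result does not depend on timing.
    os.utime(older, (1_000_000_000, 1_000_000_000))
    os.utime(newer, (1_000_000_100, 1_000_000_100))
    selected = os.path.basename(newest_pull(parent))
    empty_result = newest_pull(parent, pattern="NoSuchPrefix*")

print(selected)
print(empty_result)
```

Returning `None` instead of raising lets the calling cell decide whether to stop or to ask for an explicit `my_pull`.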

Load the data:

In [None]:
likers = pd.read_pickle(os.path.join(my_pull,'binary-matrix-likers.pkl'))
retweeters = pd.read_pickle(os.path.join(my_pull,'binary-matrix-retweeters.pkl'))
likers_complete = likers
retweeters_complete = retweeters
finalharvest_l = pd.read_pickle(os.path.join(my_pull,'likers_final_harvest_complete.pkl'))
finalharvest_r = pd.read_pickle(os.path.join(my_pull,'retweeters_final_harvest_complete.pkl'))
finalharvest_l.index.names = ['tweet']
finalharvest_r.index.names = ['tweet']

# B. Exploring

## Summary Numbers

The following dataframe includes some summary numbers of both `likers` and `retweeters`:

In [None]:
totals = pd.DataFrame()
totals.loc[1,'Tweets Liked'] = likers.shape[0]
totals.loc[1,'Likers'] = likers.shape[1]
totals.loc[1,'Likes'] = sum(likers.sum(axis = 1, skipna = True))
totals.loc[1,'Tweets Retweeted'] = retweeters.shape[0]
totals.loc[1,'Retweeters'] = retweeters.shape[1]
totals.loc[1,'Retweets'] = sum(retweeters.sum(axis = 1, skipna = True))
totals
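
To see what these summary numbers measure, here is a small sketch on a toy matrix. It assumes the layout used throughout this notebook (tweets as rows, users as columns, `1` for an interaction, `NaN` otherwise). The tweet IDs and user names are invented.

```python
import numpy as np
import pandas as pd

# Toy binary matrix: 2 tweets (rows) x 3 users (columns).
toy = pd.DataFrame(
    [[1, np.nan, 1], [np.nan, np.nan, 1]],
    index=pd.Index([101, 102], name="tweet"),
    columns=["alice", "bob", "carol"],
)

n_tweets, n_users = toy.shape  # rows are tweets, columns are users
# Summing the row sums counts every cell that holds a 1.
n_likes = int(toy.sum(axis=1, skipna=True).sum())
# Counting non-missing cells gives the same number for a 1/NaN matrix.
n_likes_alt = int(toy.notna().sum().sum())

print(n_tweets, n_users, n_likes, n_likes_alt)
```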

## Likers

Let us look at the dataset of liking users, stored in `likers`. In `likers`, the row index is the tweet ID and the column names are user names. A cell contains `1` if the user liked the tweet, else `NaN`. Both rows and columns are sorted: rows numerically, columns alphabetically. The `retweeters` and `finalharvest_` dataframes are structured in the same way.

This section provides some examples of how to query the `likers` matrix.

To find some tweet IDs, we can look at the subsection covering the first 3 tweets and the first 5 users, using `.iloc`:

In [None]:
likers.iloc[0:3,0:5]

As the index is the tweet ID, we can look up the row of a single tweet by using its ID and `.loc`.
Here, we pass the list `[tweet]` to return a nicely formatted dataframe. If you would rather have a series, pass just `tweet`.

In [None]:
tweet = 1537712147500781569
# likers.loc[tweet] # Series
likers.loc[[tweet]] # Dataframe

We can subset the dataframe to only those users who liked the tweet by dropping columns with NaN values:

In [None]:
likers.loc[[tweet]].dropna(axis='columns')

We get a list of the users who liked the tweet by listing the column names:

In [None]:
likers.loc[[tweet]].dropna(axis='columns').columns.values.tolist()
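
The same matrix can also be read in the other direction, from a user to the tweets that user liked. The sketch below shows both lookups on a toy matrix with invented tweet IDs and user names, assuming the `1`/`NaN` layout described above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    [[1, np.nan, 1], [np.nan, np.nan, 1]],
    index=pd.Index([101, 102], name="tweet"),
    columns=["alice", "bob", "carol"],
)

# Tweet -> users: keep only the columns that are not NaN in this row.
tweet = 101
liking_users = toy.loc[[tweet]].dropna(axis="columns").columns.tolist()

# User -> tweets: keep only the rows that are not NaN in this column.
user = "carol"
liked_tweets = toy[user].dropna().index.tolist()

print(liking_users)
print(liked_tweets)
```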

# C. Understanding script performance
## How many likers/retweeters did the script collect?

This section looks only at those tweets that were collected once more in the final harvest.

TODO: Write text for this section

In [None]:
# Optional: How many tweets got <my_likersAtLeast> likes? How many tweets got <my_retweetersAtLeast> retweets?
# see parameters my_likersAtLeast/my_retweetersAtLeast
my_likersAtLeast = 10 # TODO SET YOUR PARAMETER HERE
my_retweetersAtLeast = 3 # TODO SET YOUR PARAMETER HERE

Atleast = pd.DataFrame()
Atleast.loc[1, 'All tweets liked'] = likers.shape[0]
Atleast.loc[1,'Tweets with my_likersAtLeast'] = sum((likers.sum(axis = 1, skipna = True)) >= my_likersAtLeast) 
Atleast.loc[1, 'All tweets retweeted'] = retweeters.shape[0]
Atleast.loc[1,'Tweets with my_retweetersAtLeast'] = sum((retweeters.sum(axis = 1, skipna = True)) >= my_retweetersAtLeast) 
Atleast
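
The threshold count works by summing each row to get the number of collected likers per tweet and then counting the rows that reach the threshold. A minimal sketch on invented data, again assuming the `1`/`NaN` layout:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    [[1, 1, 1], [np.nan, np.nan, 1], [1, np.nan, 1]],
    index=pd.Index([101, 102, 103], name="tweet"),
    columns=["alice", "bob", "carol"],
)

# Number of collected likers per tweet (NaN cells are skipped).
likes_per_tweet = toy.sum(axis=1, skipna=True)

threshold = 2  # hypothetical value, plays the role of my_likersAtLeast
# A boolean mask sums to the number of True entries.
n_at_least = int((likes_per_tweet >= threshold).sum())

print(likes_per_tweet.tolist())
print(n_at_least)
```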

In [None]:
subset_likerscomplete = pd.merge(likers_complete, finalharvest_l, left_index=True, right_index=True)
subset_retweeterscomplete = pd.merge(retweeters_complete, finalharvest_r, left_index=True, right_index=True)

In [None]:
# like count at time of final harvest
likecount = finalharvest_l['like_count']
# number of collected likers 
likerscollected = subset_likerscomplete.sum(axis = 1, skipna = True) 
# retweet count at time of final harvest
retweetcount = finalharvest_r['retweet_count']
# number of collected retweeters
retweeterscollected = subset_retweeterscomplete.sum(axis = 1, skipna = True) 
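
A note on the merge: `pd.merge` with `left_index=True, right_index=True` performs an inner join by default, so only tweets present in both frames survive, and the columns of the final-harvest frame are appended to the matrix. If that frame carries a numeric column such as `like_count`, a row sum over the merged frame would include it. Whether this applies depends on the actual columns of `finalharvest_l` and `finalharvest_r`, which this notebook does not show. The sketch below uses invented data and assumes such a count column exists.

```python
import numpy as np
import pandas as pd

matrix = pd.DataFrame(
    [[1, np.nan], [1, 1], [np.nan, 1]],
    index=pd.Index([101, 102, 103], name="tweet"),
    columns=["alice", "bob"],
)
# Assumed layout: one count column per tweet; tweet 103 is missing here.
final = pd.DataFrame(
    {"like_count": [5, 2]},
    index=pd.Index([101, 102], name="tweet"),
)

merged = pd.merge(matrix, final, left_index=True, right_index=True)

# Summing every column also adds like_count to the total.
sum_all = merged.sum(axis=1, skipna=True)
# Restricting to the user columns counts collected likers only.
sum_users = merged[matrix.columns].sum(axis=1, skipna=True)

print(merged.index.tolist())
print(sum_all.tolist())
print(sum_users.tolist())
```

Selecting `merged[matrix.columns]` before summing is one way to keep the collected count independent of whatever extra columns the final harvest provides.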

In [None]:
# Absolute number of missed likes/retweets per tweet
plot_missed(likecount, likerscollected, retweetcount, retweeterscollected)

In [None]:
# Share of missed likes/retweets given total of received likes/retweets per tweet
plot_missed_relative(likecount, likerscollected, retweetcount, retweeterscollected)

In [None]:
# Supplemented with total number of likes/retweets each tweet attracted: 
plot_missed_relative_absolutecount(likecount, likerscollected, retweetcount, retweeterscollected)


In [None]:
# inspect numbers more closely: likers
d = {'collected likers': likerscollected, 'likecount': likecount, 'difference': likecount-likerscollected, 'percent': ((likecount-likerscollected)/likecount)}
inspectlikes = pd.DataFrame(data=d)
inspectlikes

In [None]:
# inspect numbers more closely: retweeters
d = {'collected retweeters': retweeterscollected, 'retweetcount': retweetcount, 'difference': retweetcount-retweeterscollected, 'percent': ((retweetcount-retweeterscollected)/retweetcount)}
inspectretweets = pd.DataFrame(data=d)
inspectretweets
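
In both tables, `difference` is the count reported at the final harvest minus the number of collected users: positive values are missed interactions, negative values are surplus ones (for example likes that were later withdrawn). `percent` is that difference as a fraction of the reported count. The sketch below reproduces the arithmetic on invented numbers and, as an assumption beyond the cells above, guards against tweets whose reported count is zero.

```python
import numpy as np
import pandas as pd

index = pd.Index([101, 102, 103], name="tweet")
count = pd.Series([10, 4, 0], index=index)      # reported at final harvest
collected = pd.Series([8, 5, 0], index=index)   # users collected by the script

difference = count - collected
# Replace zero counts with NaN so the share is undefined instead of inf or 0/0.
share = difference / count.replace(0, np.nan)

inspect = pd.DataFrame(
    {"collected": collected, "count": count, "difference": difference, "share": share}
)
print(inspect)
```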

In [None]:
perf = pd.DataFrame()
# Sign convention: difference = reported count - collected users.
# Negative values mean too many were collected, positive values mean some were missed.
perf.loc[1, '% tweets with 10 or more too many (deleted):'] = round(len(inspectlikes[inspectlikes['difference'] <= -10])/len(inspectlikes),4)
perf.loc[2,'% tweets with 10 or more too many (deleted):'] = round(len(inspectretweets[inspectretweets['difference'] <= -10])/len(inspectretweets), 4)

perf.loc[1, '% tweets with 10 or more missed:'] = round(len(inspectlikes[inspectlikes['difference'] >= 10])/len(inspectlikes),4)
perf.loc[2, '% tweets with 10 or more missed:'] = round(len(inspectretweets[inspectretweets['difference'] >= 10])/len(inspectretweets),4)

# The same sign convention applies to likes and retweets.
perf.loc[1, '% tweets with 10% or more too many (deleted):'] = round(len(inspectlikes[inspectlikes['percent'] <= -.1])/len(inspectlikes),4)
perf.loc[2,'% tweets with 10% or more too many (deleted):'] = round(len(inspectretweets[inspectretweets['percent'] <= -.1])/len(inspectretweets),4)

perf.loc[1, '% tweets with 10% or more missed:'] = round(len(inspectlikes[inspectlikes['percent'] >= .1])/len(inspectlikes),4)
perf.loc[2, '% tweets with 10% or more missed:'] = round(len(inspectretweets[inspectretweets['percent'] >= .1])/len(inspectretweets),4)

perf.loc[1, '% tweets with complete:'] = round(len(inspectlikes[inspectlikes['difference'] == 0])/len(inspectlikes),4)
perf.loc[2, '% tweets with complete:'] = round(len(inspectretweets[inspectretweets['difference'] == 0])/len(inspectretweets),4)


perf.index = ['Likes', 'Retweets']

perf
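
Each entry of `perf` is the share of tweets that satisfy a condition on `difference` or `percent`. As a compact alternative to `len(filtered) / len(all)`, the mean of a boolean mask gives the same share. The sketch below uses invented differences and inclusive thresholds; whether a threshold should be inclusive or strict is a choice to make for your analysis.

```python
import pandas as pd

# Invented per-tweet differences (reported count minus collected users).
difference = pd.Series([12, -15, 0, 3])

# The mean of a boolean mask is the fraction of True entries.
share_missed = round((difference >= 10).mean(), 4)
share_surplus = round((difference <= -10).mean(), 4)
share_complete = round((difference == 0).mean(), 4)

print(share_missed, share_surplus, share_complete)
```

Note that these values are fractions between 0 and 1, not percentages.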

In [None]:
# Inspect (highly popular) tweets in terms of like count
likecount

In [None]:
# Inspect (highly popular) tweets in terms of retweet count
retweetcount

# D. Understanding user activity

## How many likes/retweets did the users place? How many unique likers/retweeters are in the dataset?

TODO: Write text for this section

In [None]:
freqtable_l, freqtable_r = make_frequency_table(likers_complete, retweeters_complete)

In [None]:
freqtable_l.head()

In [None]:
freqtable_r.head()

In [None]:
plot_frequency(freqtable_l, freqtable_r)
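
The frequency tables list, for each number of placed likes (or retweets), how many users placed exactly that many. `make_frequency_table` lives in `resources.datainspection` and its implementation is not shown here, so the following is only one plausible way to derive such a table from the binary matrix. The column names `placedlikes` and `freqlikers` follow the cells in this notebook; the data is invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    [[1, 1, 1, np.nan], [np.nan, np.nan, 1, 1], [1, np.nan, 1, np.nan]],
    index=pd.Index([101, 102, 103], name="tweet"),
    columns=["alice", "bob", "carol", "dave"],
)

# Likes placed per user: column sums of the binary matrix.
likes_per_user = toy.sum(axis=0, skipna=True).astype(int)

# How many users placed 1, 2, 3, ... likes.
freq = (
    likes_per_user.value_counts()
    .sort_index()
    .rename_axis("placedlikes")
    .reset_index(name="freqlikers")
)

# Share of users who placed more than one like.
n_more_than_one = freq.loc[freq["placedlikes"] > 1, "freqlikers"].sum()
share_more_than_one = n_more_than_one / freq["freqlikers"].sum()

print(freq)
print(share_more_than_one)
```

The sum of `freqlikers` equals the number of unique users in the matrix, which is why it serves as the denominator for the shares.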

In [None]:
users = pd.DataFrame()

# Total number of unique likers/retweeters (denominators for the shares)
total_likers = sum(freqtable_l['freqlikers'])
total_retweeters = sum(freqtable_r['freqretweeters'])

# For each threshold: number and share of users who placed more than that many likes/retweets
for threshold in [1, 2, 3, 4, 50]:
    n_likers = freqtable_l.loc[freqtable_l['placedlikes'] > threshold, 'freqlikers'].sum()
    n_retweeters = freqtable_r.loc[freqtable_r['placedretweets'] > threshold, 'freqretweeters'].sum()

    users.loc[1, f'users placed more than {threshold}:'] = n_likers
    users.loc[2, f'users placed more than {threshold}:'] = n_retweeters

    users.loc[1, f'% users placed more than {threshold}:'] = round(n_likers/total_likers, 4)
    users.loc[2, f'% users placed more than {threshold}:'] = round(n_retweeters/total_retweeters, 4)

users.index = ['Likes', 'Retweets']
users
