# Author Prediction

It is possible to predict an author or a "new author" at the same time by setting an author's category to 1 if that author is to be predicted, but
only if it is not a new author at that point in the conversation (new authors get the "new author" category instead). Because of memory limits, only Twitter or Reddit data can be predicted in one run.
The full dataset does not fit into a laptop's memory and is computed on the cluster (which in turn has no GPU support).

The probability of predicting an author is calculated from each relationship (root distance to another node, reply distance to other nodes, and reply distance to nodes with the same author). In the future, the author follower network will also be included in the feature set.

The overall sum of the probabilities of predicting an author (on average) is interpreted as the likelihood of any known author writing at any time in the conversation (again, because it is not a new author). This is then read as the author being present in the conversation, because it is another measure of an author being available in all branches and positions of the conversation.
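The aggregation described above can be sketched on a toy example. The probabilities and the column names `Author_a` and `Author_b` are made up for illustration; only the group-then-average-then-sum pattern is meant to mirror the notebook.

```python
import pandas as pd

# Hypothetical per-post prediction probabilities for two known authors
# plus the "new author" class (each row sums to 1, as after a softmax).
toy_predictions = pd.DataFrame({
    "conversation_id": [1, 1, 2, 2],
    "Author_a": [0.5, 0.3, 0.1, 0.1],
    "Author_b": [0.25, 0.25, 0.5, 0.25],
    "Author_is_new": [0.25, 0.45, 0.4, 0.65],
})

# Average the predictions over all posts of a conversation ...
per_conversation = toy_predictions.groupby("conversation_id").mean()

# ... and sum over the known authors: the share of probability mass that
# goes to authors already seen, read here as "author presence".
presence = per_conversation.drop(columns="Author_is_new").sum(axis=1)
print(presence)
```

Because the rows come from a softmax, this presence value is one minus the average "new author" probability of the conversation, so the two quantities carry the same information.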



In [25]:
from platform import python_version
# import ipyparallel as ipp
# c = ipp.Client(profile="slurm")
# c.ids

print(python_version())

3.8.10


In [26]:
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import pickle
from keras import backend as K
# import pickle5 as pickle

is_cuda_gpu_available = tf.test.is_gpu_available(cuda_only=True)
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to allocate at most 2024 MB (about 2 GB) of memory on the first GPU
    try:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=2024)])
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
print("cuda gpu is available: {}".format(is_cuda_gpu_available))

file_name = "data/vision_forward_graph_data_local_05_08_22.pkl"
# file_name = "data/vision_forward_graph_data_08_09_22.pkl"
with open(file_name, 'rb') as f:
    df = pickle.load(f)

df.shape

1 Physical GPUs, 1 Logical GPUs
cuda gpu is available: True


2022-08-16 14:49:35.364911: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

(809823, 81)

In [27]:
# importing utility functions
%run author_vision_util.ipynb

In [28]:
df = equalize_samples(df)
df = df[df["platform"] == "reddit"]
df.shape

chosen 24 conversations and gotten 28579 from twitter compared to 24497 from reddit


(24497, 81)

#### Create a one hot vector representation of the possible authors
- create an artificial user that represents a new user in a conversation up to that point
- get a matrix with the authors as columns and a 1 if the author wrote the post
- join it with the feature matrix
- drop the author column
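
The steps above can be sketched on a toy conversation. This is a minimal illustration with made-up posts, not the notebook's data; the names `posts`, `first_seen` and `is_new` are assumptions introduced only for this example.

```python
import pandas as pd

# Toy conversation: each author posts twice, so only the second post of an
# author is labelled with the author's own column.
posts = pd.DataFrame({
    "conversation_id": [7, 7, 7, 7],
    "author": ["a", "b", "b", "a"],
    "current_time": [1, 2, 3, 4],
})

# First time each author appears in each conversation, aligned to the rows.
first_seen = posts.groupby(["conversation_id", "author"])["current_time"].transform("min")
is_new = posts["current_time"] <= first_seen

# One-hot encode the authors, then zero out the rows flagged as "new author".
one_hot = pd.get_dummies(posts["author"], prefix="Author").astype(int)
one_hot = one_hot.mul((~is_new).astype(int), axis=0)

# Append the artificial "new author" column so every row has exactly one 1.
labels = one_hot.assign(Author_is_new=is_new.astype(int))
print(labels)
```

Every row ends up with exactly one active label, which is what a softmax output with categorical cross-entropy expects.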


In [29]:
# compute a fake user that symbolizes that the given user has not been seen at a given stage in the conversation
df_conversation_authors = df[["conversation_id", "author", "current_time"]]
first_times = df_conversation_authors.groupby(["conversation_id", "author"]).min()

def is_new_author(row):
    earliest_author_post = first_times.loc[row["conversation_id"],row["author"]]
    current_post_time = row["current_time"]
    return earliest_author_post >= current_post_time

new_author_column = df[["conversation_id", "author", "current_time"]].apply(is_new_author, axis=1)
new_author_column = new_author_column.rename(columns={'current_time': "Author_is_new"})
#new_author_column.describe()
# current author has not been the beam_node
new_author_column.value_counts()

Author_is_new
False            14982
True              9515
dtype: int64

In [30]:
def compute_new_author_column(df):
    author_one_hot = pd.get_dummies(df.author, prefix="Author", sparse=True)
    # make author cells 0 that are now represented as "new author"
    author_one_hot = author_one_hot.astype(bool).apply(lambda x: x & ~new_author_column.Author_is_new).astype(int)
    # delete columns that are all 0 
    author_one_hot = author_one_hot.loc[:, (author_one_hot != 0).any(axis=0)]
    # join the new author column to the labels
    labels = author_one_hot.join(new_author_column.astype(int))
    features = take_features(df, ["author", "current_time", "beam_node_time"])
    combined_set = features.join(labels)
    return combined_set, features, labels

combined_set, features, labels = compute_new_author_column(df)

#### Training a NN to predict the author that would write next
- included a "new author" category to capture predicting unknown authors
- using multi-class classification (instead of multi-label)
- relu and sigmoid activation functions in the hidden layers have the same effect
- precision grew significantly when using more than 3-5 layers
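
The difference between the multi-class and the multi-label setup comes down to the output activation. The sketch below uses plain NumPy and made-up scores; it is only meant to show why a softmax output yields one probability distribution per post, while sigmoid outputs would score each author independently.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum for numerical stability.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw scores for [author_1, author_2, new_author].
logits = np.array([[2.0, 1.0, 0.0]])

multi_class = softmax(logits)  # competing classes, one distribution per post
multi_label = sigmoid(logits)  # independent yes/no score per class

print(multi_class, multi_class.sum())
print(multi_label, multi_label.sum())
```

Only the softmax scores sum to 1, which is what allows the per-author probabilities to be summed and compared against the "new author" probability later on.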

In [31]:
from keras.layers import Dropout
from keras.optimizer_v2.rmsprop import RMSprop

# selecting train and test datasets
train, test = train_test_split(combined_set, test_size=0.2, shuffle=False)
print("split training and test set")

split training and test set


In [32]:
# train the model
y = train.drop(features.columns, axis=1)
x = train.drop(labels.columns, axis=1)
print("seperated features and y with shapes:")
print(x.shape)
print(y.shape)

# import tensorflow and train the model
# print(tf.__version__)
input_shape = (x.shape[1],)
output_shape = y.shape[1]
print("inputshape is {}".format(input_shape))
model = Sequential([
    # only the first layer needs the input shape; the later layers infer it
    Dense(output_shape, activation='relu', input_shape=input_shape),
    Dense(output_shape, activation='relu'),
    Dense(output_shape, activation='relu'),
    Dense(output_shape, activation='relu'),
    Dense(output_shape, activation='relu'),
    Dense(output_shape, activation='relu'),
    # softmax output: one probability distribution over all author labels
    Dense(output_shape, activation='softmax')
])
print("defined model as {}".format(model.layers))
# RMSprop (a variant of stochastic gradient descent) is used as the optimizer
model.compile(
    optimizer=RMSprop(),
    loss='categorical_crossentropy',
    metrics=['categorical_accuracy', 'accuracy' ,'mae']
)
print("compiled model")

seperated features and y with shapes:
(19597, 71)
(19597, 136)
inputshape is (71,)
defined model as [<keras.layers.core.Dense object at 0x7f25b84cc820>, <keras.layers.core.Dense object at 0x7f25b864ab50>, <keras.layers.core.Dense object at 0x7f25b864a7f0>, <keras.layers.core.Dense object at 0x7f25b864a640>, <keras.layers.core.Dense object at 0x7f25b8655c40>, <keras.layers.core.Dense object at 0x7f25b8655ca0>, <keras.layers.core.Dense object at 0x7f25b8655820>]
compiled model


In [33]:
#model.fit(x, y, epochs=3)
model.fit(x, y)
#model.fit(x, y, epochs=10, shuffle=True)
# evaluate the model on the test set
test_y = test.drop(features.columns, axis=1)
test_x = test.drop(labels.columns, axis=1)
#test_x = test_x.drop("timedelta", axis=1)

loss, cat_accuracy, accuracy, mae = model.evaluate(test_x, test_y)
print("the accuracy on the test set is cat acc {}, reg acc {} and the mae is {}".format(cat_accuracy, accuracy, mae))

the accuracy on the test set is cat acc 0.20224489271640778, reg acc 0.20224489271640778 and the mae is 0.013512490317225456
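
The small `mae` value is less informative than it looks. For a one-hot target with K classes and a softmax prediction, the absolute errors of one row add up to 2 * (1 - p_true), so the row-wise MAE can never exceed 2 / K. The sketch below checks this with K = 136 (the number of label columns above); the uniform prediction is only an illustrative assumption.

```python
import numpy as np

K = 136                      # number of label columns in the notebook
target = np.zeros(K)
target[0] = 1.0              # one-hot target, true class at index 0

# Illustrative prediction: a uniform distribution over all classes.
uniform = np.full(K, 1.0 / K)

mae_uniform = np.abs(target - uniform).mean()
mae_bound = 2.0 / K          # reached when the true class gets probability 0

print(mae_uniform, mae_bound)
```

Both values are roughly 0.0146 to 0.0147, so an MAE of about 0.0135 is close to the ceiling and mostly reflects the number of classes rather than the quality of the predictions.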


In [34]:
import numpy as np

sample_df = df.sample(frac=1).reset_index(drop=True).groupby('conversation_id').apply(lambda x: x.sample(n=1)).reset_index(drop = True)
sample_features = take_features(sample_df, ["author", "current_time", "beam_node_time"])
sample_prediction = model.predict(sample_features)
np.matrix(sample_prediction)[0:5, -1]  # the last column is the "new author" label and should contain a high value

matrix([[0.39269435],
        [0.39269435],
        [0.39269435],
        [0.39269435],
        [0.39269435]], dtype=float32)

#### Predicting the author presence based on prediction probabilities
- compute predictions for the whole dataframe
- drop all feature and non-feature columns except conversation and platform
- reshape the author columns from wide to long so that the author id becomes an index level
- group by conversation and platform
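
The wide-to-long step can be sketched on a toy frame. The values and author ids below are made up; the call to `pd.wide_to_long` uses the same `stubnames`, `i` and `j` arguments as the notebook cells further down.

```python
import pandas as pd

# Toy wide frame: one row per conversation, one column per author.
wide = pd.DataFrame({
    "platform": ["reddit", "reddit"],
    "conversation_id": [1, 2],
    "Author_10": [0.2, 0.1],
    "Author_20": [0.3, 0.6],
})

# Columns named "Author_<id>" become rows; the numeric suffix moves into
# the new index level "author_id" and the values land in column "Author_".
long_df = pd.wide_to_long(
    wide, stubnames="Author_", i=["platform", "conversation_id"], j="author_id"
)

# Average over the authors of each conversation.
per_conversation = long_df.groupby(["platform", "conversation_id"]).mean()
print(long_df)
print(per_conversation)
```

Note that the default suffix pattern only matches digits, so a column such as `Author_is_new` has to be dropped before reshaping.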

In [35]:
all_features = take_features(df, ["author", "current_time", "beam_node_time"])
predictions = model.predict(all_features)
column_names = labels.columns
predictions = pd.DataFrame(predictions, columns=column_names)
print(type(predictions))
print(predictions.shape)

<class 'pandas.core.frame.DataFrame'>
(24497, 136)


In [36]:
all_non_features = df[["conversation_id", "platform"]]
print(type(all_non_features))
print(all_non_features.shape)
all_non_features.reset_index(drop=True, inplace=True)
joined_dataframe = all_non_features.join(predictions)
# not_needed_list = ["beam_node", "has_followed_path", "has_follow_path", "beam_node_author", "current"]
# author_predictions = joined_dataframe.drop(not_needed_list, axis=1)
# joined_dataframe.groupby(["platform", "conversation_id"]).mean()
# joined_dataframe["id"] = joined_dataframe.index

<class 'pandas.core.frame.DataFrame'>
(24497, 2)


In [37]:
joined_dataframe.Author_is_new.describe()  # no idea why the prediction is (almost) the same for all rows

count    24497.000000
mean         0.392698
std          0.000399
min          0.389762
25%          0.392694
50%          0.392694
75%          0.392694
max          0.436718
Name: Author_is_new, dtype: float64

In [38]:
# joined_dataframe.describe()
joined_dataframe = joined_dataframe.groupby(["platform", "conversation_id"]).mean()
joined_dataframe.head(1)

| platform | conversation_id | Author_538210 | Author_1125209 | Author_1920977 | Author_1997405 | Author_2280486 | Author_2600925 | Author_3783312 | Author_3919689 | Author_4153484 | Author_4212372 | ... | Author_92606372 | Author_93354514 | Author_93631770 | Author_94543394 | Author_95335292 | Author_95977208 | Author_97589063 | Author_98781300 | Author_99195573 | Author_is_new |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| reddit | 661614 | 0.007504 | 0.000886 | 0.00171 | 0.000215 | 0.000142 | 0.025926 | 9e-06 | 0.000184 | 0.000837 | 7.4e-05 | ... | 0.000222 | 0.005754 | 2.2e-05 | 1.1e-05 | 0.006104 | 0.009734 | 0.001911 | 0.004756 | 0.001 | 0.392694 |


In [39]:


author_predictions_existing = joined_dataframe.drop(["Author_is_new"], axis=1)
author_predictions_existing.reset_index(level=['platform', 'conversation_id'],inplace=True)
author_predictions_existing_reshaped = pd.wide_to_long(author_predictions_existing, stubnames="Author_", i=['platform', 'conversation_id'], j="author_id")
author_predictions_existing_reshaped.head(3)


| platform | conversation_id | author_id | Author_ |
|---|---|---|---|
| reddit | 661614 | 538210 | 0.007504 |
| reddit | 661614 | 1125209 | 0.000886 |
| reddit | 661614 | 1920977 | 0.00171 |


In [40]:
# avg_author_pred = author_predictions_existing_reshaped.groupby(["platform", "conversation_id", "author_id"]).mean()
# avg_author_pred.head(3)

In [41]:
avg_conversation_pred  = author_predictions_existing_reshaped.groupby(["platform", "conversation_id"]).mean()
avg_conversation_pred.head(3)


| platform | conversation_id | Author_ |
|---|---|---|
| reddit | 661614 | 0.004499 |
| reddit | 10955776 | 0.004499 |
| reddit | 15848916 | 0.004499 |


In [42]:
avg_platform_pred = avg_conversation_pred.groupby(["platform"]).mean()
print(avg_platform_pred)
avg_platform_pred  # picking the correct author seems to be exceedingly difficult


           Author_
platform          
reddit    0.004499


| platform | Author_ |
|---|---|
| reddit | 0.004499 |


#### Notes
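- inserting the new author column increased precision by a factor of 10
- categorical accuracy and regular accuracy match; this is expected, because Keras resolves the 'accuracy' string to categorical accuracy when the targets are one-hot encoded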

In [47]:
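# sum (instead of mean) over the authors: total probability assigned to known authors per conversation
avg_conversation_pred = author_predictions_existing_reshaped.groupby(["platform", "conversation_id"]).sum()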
avg_conversation_pred.head(3)

| platform | conversation_id | Author_ |
|---|---|---|
| reddit | 661614 | 0.607306 |
| reddit | 10955776 | 0.607306 |
| reddit | 15848916 | 0.607306 |


In [48]:
avg_platform_pred = avg_conversation_pred.groupby(["platform"]).mean()
print(avg_platform_pred)
avg_platform_pred  # picking the correct author seems to be exceedingly difficult


           Author_
platform          
reddit    0.607303


| platform | Author_ |
|---|---|
| reddit | 0.607303 |

