# Transformation Part

In [1]:
from preprocess import Preprocessor
from text_preprocess import TextPreprocesser
from network_preprocess import NetworkTransformer
from text_clustering import Clusterer
from utils import load_data

import numpy as np
from sentence_transformers import SentenceTransformer

INFO:tensorflow:Enabling eager execution
INFO:tensorflow:Enabling v2 tensorshape
INFO:tensorflow:Enabling resource variables
INFO:tensorflow:Enabling tensor equality
INFO:tensorflow:Enabling control flow v2


## Loading the Data

In [2]:
PATH = 'comments/comments_students.csv'

dataset = load_data(PATH, split=False)

## First Preprocessing

Let's start with some feature engineering. We extract the depth of each comment, the time since its parent, whether the comment is a root comment, some time information (weekday and hour), and finally aggregate measures such as the number of comments the author had written at that point in time.

In [5]:
preprocessor = Preprocessor(dataset)
preprocessor.preprocess()
dataset = preprocessor.data
dataset.head()

Unnamed: 0,created_utc,ups,subreddit_id,link_id,name,subreddit,id,author,body,parent_id,parent_ups,time_since_parent,is_root,depth,weekday,hour,num_comments,deleted,author_count,parent_id_count
0,1430438400,3.0,t5_2qh1i,t3_34f9rh,t1_cqug90j,AskReddit,cqug90j,jesse9o3,No one has a European accent either because i...,t1_cqug2sr,0.0,0.0,0,0.0,4,2,1,0,54,1
1,1430438400,3.0,t5_2qh1i,t3_34fvry,t1_cqug90k,AskReddit,cqug90k,beltfedshooter,That the kid ..reminds me of Kevin. so sad :-(,t3_34fvry,0.0,0.0,1,0.0,4,2,1,0,5,1443
2,1430438400,5.0,t5_2qh1i,t3_34ffo5,t1_cqug90z,AskReddit,cqug90z,InterimFatGuy,NSFL,t1_cqu80zb,0.0,0.0,0,0.0,4,2,1,0,57,5
3,1430438401,1.0,t5_2qh1i,t3_34aqsn,t1_cqug91c,AskReddit,cqug91c,JuanTutrego,I'm a guy and I had no idea this was a thing g...,t1_cqtdj4m,0.0,0.0,0,0.0,4,2,1,0,5,4
4,1430438401,101.0,t5_2qh1i,t3_34f9rh,t1_cqug91e,AskReddit,cqug91e,dcblackbelt,"Mid twenties male rocking skinny jeans/pants, ...",t1_cquc4rc,0.0,0.0,0,0.0,4,2,1,0,2,2
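The `Preprocessor` internals are not shown here, so the following is only a minimal sketch of how a few of these columns could be derived with pandas. The function name `add_basic_features` is hypothetical, and three choices are assumptions: a comment is treated as root when its `parent_id` starts with `t3_` (it points at a link rather than at another comment), time features are taken in UTC, and `author_count` is read as a running count per author.

```python
import pandas as pd


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch; not the actual Preprocessor implementation."""
    # Sort chronologically so the running count reflects "at that point in time".
    out = df.sort_values("created_utc", kind="stable").copy()

    # Assumption: a parent_id starting with 't3_' refers to the link itself.
    out["is_root"] = out["parent_id"].str.startswith("t3_").astype(int)

    # created_utc holds Unix timestamps in seconds.
    ts = pd.to_datetime(out["created_utc"], unit="s", utc=True)
    out["weekday"] = ts.dt.weekday  # Monday == 0
    out["hour"] = ts.dt.hour        # UTC hour; the notebook may use another timezone

    # Assumption: number of comments written by the author so far, including this one.
    out["author_count"] = out.groupby("author").cumcount() + 1
    return out
```

If the notebook's `author_count` is instead the author's total over the whole dataset, `groupby("author")["author"].transform("size")` would be the corresponding one-liner.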


## Text Preprocessing

Let's add some text features to our dataset, computed from the `body` column.

In [11]:
text_preprocessor = TextPreprocesser(dataset)
text_preprocessor.preprocess()
dataset = text_preprocessor.data
dataset.head()

100%|██████████| 4234970/4234970 [00:09<00:00, 448556.62it/s]
100%|██████████| 4234970/4234970 [00:22<00:00, 190530.24it/s]
100%|██████████| 4234970/4234970 [00:45<00:00, 92890.47it/s] 
100%|██████████| 4234970/4234970 [00:07<00:00, 559411.22it/s]
100%|██████████| 4234970/4234970 [00:09<00:00, 452217.88it/s]
100%|██████████| 4234970/4234970 [00:16<00:00, 252887.36it/s]
100%|██████████| 4234970/4234970 [00:17<00:00, 243052.29it/s]
100%|██████████| 4234970/4234970 [00:12<00:00, 349830.31it/s]
100%|██████████| 4234970/4234970 [00:19<00:00, 213387.48it/s]
100%|██████████| 4234970/4234970 [00:13<00:00, 313967.72it/s]
100%|██████████| 4234970/4234970 [00:48<00:00, 87935.38it/s] 
100%|██████████| 4234970/4234970 [11:20<00:00, 6223.37it/s] 
100%|██████████| 4234970/4234970 [11:15<00:00, 6266.84it/s] 


Unnamed: 0,created_utc,ups,subreddit_id,link_id,name,subreddit,id,author,body,parent_id,...,word_density,punctuation_count,upper_case_word_count,stopword_count,count_words_title,contains_url,mean_word_len,punct_percent,polarity,subjectivity
0,1430438400,3.0,t5_2qh1i,t3_34f9rh,t1_cqug90j,AskReddit,cqug90j,jesse9o3,No one has a European accent either because i...,t1_cqug2sr,...,5.409091,3,0,12,5,0,4.666667,9.52381,0.0,0.0
1,1430438400,3.0,t5_2qh1i,t3_34fvry,t1_cqug90k,AskReddit,cqug90k,beltfedshooter,That the kid ..reminds me of Kevin. so sad :-(,t3_34fvry,...,4.363636,6,0,5,2,0,3.7,40.0,-0.625,1.0
2,1430438400,5.0,t5_2qh1i,t3_34ffo5,t1_cqug90z,AskReddit,cqug90z,InterimFatGuy,NSFL,t1_cqu80zb,...,2.0,0,1,0,0,0,4.0,0.0,0.0,0.0
3,1430438401,1.0,t5_2qh1i,t3_34aqsn,t1_cqug91c,AskReddit,cqug91c,JuanTutrego,I'm a guy and I had no idea this was a thing g...,t1_cqtdj4m,...,3.6,2,1,8,1,0,2.928571,7.142857,0.0,0.0
4,1430438401,101.0,t5_2qh1i,t3_34f9rh,t1_cqug91e,AskReddit,cqug91e,dcblackbelt,"Mid twenties male rocking skinny jeans/pants, ...",t1_cquc4rc,...,5.477273,7,0,15,3,0,4.604651,13.953488,0.189062,0.427083
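To make a few of these columns concrete, here is a small standard-library sketch. It is not the `TextPreprocesser` code, which is not shown; the function name `text_features` and the exact definitions (whitespace tokenisation, `string.punctuation` as the punctuation set, a simple `http` substring check for URLs) are assumptions.

```python
import string


def text_features(body: str) -> dict:
    """Hypothetical sketch of a few per-comment text features."""
    words = body.split()  # assumption: tokens are whitespace-separated
    return {
        # Count every ASCII punctuation character in the comment.
        "punctuation_count": sum(ch in string.punctuation for ch in body),
        # Count tokens written entirely in upper case (e.g. "NSFL").
        "upper_case_word_count": sum(w.isupper() for w in words),
        # Assumption: a URL is detected by the substring "http".
        "contains_url": int("http" in body),
        # Average token length; 0.0 for an empty comment.
        "mean_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


features = text_features("That the kid ..reminds me of Kevin. so sad :-(")
```

Under these definitions, the second sample row yields 6 punctuation characters and a mean word length of 3.7, which agrees with the values in the table above; other columns may well be defined differently.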


## Network Preprocessing

Let's add some network features to our dataset.

In [12]:
transformer = NetworkTransformer()
transformer = transformer.fit(dataset)
dataset = transformer.transform(dataset)

Fit networks started
Fit networks attributes
Fit author attributes
First link attributes
Done
Transform networks started
Transforming comments
Transforming authors


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[X['author_is_deleted'] == 1][key] = 0

Transforming links
Done
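The warning in this output comes from chained indexing: `X[mask][key] = 0` assigns into a temporary copy, so the original frame may be left unchanged. A minimal sketch of the usual fix with a single `.loc` call follows; the toy frame and the choice of `key` are made up for illustration.

```python
import pandas as pd

# Toy data; the column used as `key` is only an example.
X = pd.DataFrame({
    "author_is_deleted": [0, 1, 0],
    "author_implication": [54, 7, 5],
})
key = "author_implication"

# Select rows and column in one .loc call so the assignment hits X itself.
X.loc[X["author_is_deleted"] == 1, key] = 0
```

After this assignment only the row of the deleted author is zeroed, and pandas no longer emits the warning.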


In [13]:
dataset.head()

Unnamed: 0,created_utc,ups,subreddit_id,link_id,name,subreddit,id,author,body,parent_id,...,com_out_degree_centrality,com_in_degree_centrality,author_is_moderator,author_is_deleted,author_implication,author_degree_centrality,author_out_degree_centrality,author_in_degree_centrality,author_is_influential,link_popularity
0,1430438400,3.0,t5_2qh1i,t3_34f9rh,t1_cqug90j,AskReddit,cqug90j,jesse9o3,No one has a European accent either because i...,t1_cqug2sr,...,2.27545e-07,0.0,0,0,54,0.000194,9.3e-05,0.000101,0.0,2786
1,1430438400,3.0,t5_2qh1i,t3_34fvry,t1_cqug90k,AskReddit,cqug90k,beltfedshooter,That the kid ..reminds me of Kevin. so sad :-(,t3_34fvry,...,2.27545e-07,0.0,0,0,5,7e-06,7e-06,0.0,0.0,12349
2,1430438400,5.0,t5_2qh1i,t3_34ffo5,t1_cqug90z,AskReddit,cqug90z,InterimFatGuy,NSFL,t1_cqu80zb,...,2.27545e-07,0.0,0,0,57,0.000177,0.000106,7.1e-05,0.0,7247
3,1430438401,1.0,t5_2qh1i,t3_34aqsn,t1_cqug91c,AskReddit,cqug91c,JuanTutrego,I'm a guy and I had no idea this was a thing g...,t1_cqtdj4m,...,2.27545e-07,0.0,0,0,5,0.0,0.0,0.0,0.0,1466
4,1430438401,101.0,t5_2qh1i,t3_34f9rh,t1_cqug91e,AskReddit,cqug91e,dcblackbelt,"Mid twenties male rocking skinny jeans/pants, ...",t1_cquc4rc,...,2.27545e-07,2.27545e-07,0,0,2,5e-06,2e-06,2e-06,0.0,2786
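As an illustration of the centrality columns, the sketch below computes in-, out- and total degree centrality on a directed reply graph using the conventional normalisation (degree divided by `n - 1`). Whether `NetworkTransformer` uses exactly this definition is an assumption, and the function name `degree_centralities` and the `(child, parent)` edge orientation are hypothetical.

```python
from collections import Counter


def degree_centralities(edges):
    """Degree centralities for a directed graph given as (source, target) pairs."""
    nodes = {node for edge in edges for node in edge}
    # Conventional normalisation: divide each degree by (n - 1).
    scale = 1.0 / (len(nodes) - 1) if len(nodes) > 1 else 0.0
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    return {
        node: {
            "in": in_deg[node] * scale,
            "out": out_deg[node] * scale,
            "total": (in_deg[node] + out_deg[node]) * scale,
        }
        for node in nodes
    }


# Assumed orientation: each reply points from the child comment to its parent.
centrality = degree_centralities([("b", "a"), ("c", "a")])
```

Here comment `a` receives two replies in a three-node graph, so its in-degree centrality is 1.0, while each reply has an out-degree centrality of 0.5.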


## Text Clustering
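Let's cluster the comments with FAISS, a library for similarity search and clustering of dense vectors. The clustering takes sentence embeddings as input, which are produced by a model from the sentence_transformers library; here we load precomputed embeddings from disk.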

In [3]:
# The embedding model was downloaded beforehand; if you don't have it locally,
# specify model=None and it will be pulled from the Internet.
# model = SentenceTransformer('./models/distilroberta-base-paraphase-v1')
embeddings = np.load('sentence_embeddings.npy', allow_pickle=True)
clusterer = Clusterer(dataset)
clusterer.process(embeddings, with_gpu=True)  # if you didn't install faiss-gpu, pass with_gpu=False

Clustering...
Clustering done...
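The `Clusterer` internals are not shown, so it is an assumption that the `clusters` column comes from k-means, the clustering that FAISS provides for dense vectors. The sketch below is a plain NumPy k-means used as a stand-in to show the idea; it is not the FAISS API, and the function name `kmeans` and its parameters are hypothetical.

```python
import numpy as np


def kmeans(x: np.ndarray, k: int, n_iter: int = 20, seed: int = 0):
    """Plain k-means (Lloyd's algorithm) on an (n, d) array; a stand-in, not FAISS."""
    x = np.asarray(x, dtype=np.float32)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()

    def assign(points, centers):
        # Squared Euclidean distance from every point to every centroid: (n, k).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(x, centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)

    return assign(x, centroids), centroids


# Toy "embeddings": two well-separated groups of one-dimensional vectors.
toy = np.array([[0], [1], [2], [3], [4], [100], [101], [102], [103], [104]])
labels, centroids = kmeans(toy, k=2)
```

On the real data, `x` would be the array loaded from `sentence_embeddings.npy`. The full pairwise distance matrix built here would not fit in memory for millions of comments, which is one reason to use a dedicated library such as FAISS.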


In [5]:
dataset = clusterer.data
dataset.head()

Unnamed: 0,created_utc,ups,subreddit_id,link_id,name,subreddit,id,author,body,parent_id,...,com_in_degree_centrality,author_is_moderator,author_is_deleted,author_implication,author_degree_centrality,author_out_degree_centrality,author_in_degree_centrality,author_is_influential,link_popularity,clusters
0,1430438400,3.0,t5_2qh1i,t3_34f9rh,t1_cqug90j,AskReddit,cqug90j,jesse9o3,No one has a European accent either because i...,t1_cqug2sr,...,0.0,0,0,54,0.000194,9.3e-05,0.000101,0.0,2786,5
1,1430438400,3.0,t5_2qh1i,t3_34fvry,t1_cqug90k,AskReddit,cqug90k,beltfedshooter,That the kid ..reminds me of Kevin. so sad :-(,t3_34fvry,...,0.0,0,0,5,7e-06,7e-06,0.0,0.0,12349,5
2,1430438400,5.0,t5_2qh1i,t3_34ffo5,t1_cqug90z,AskReddit,cqug90z,InterimFatGuy,NSFL,t1_cqu80zb,...,0.0,0,0,57,0.000177,0.000106,7.1e-05,0.0,7247,2
3,1430438401,1.0,t5_2qh1i,t3_34aqsn,t1_cqug91c,AskReddit,cqug91c,JuanTutrego,I'm a guy and I had no idea this was a thing g...,t1_cqtdj4m,...,0.0,0,0,5,0.0,0.0,0.0,0.0,1466,4
4,1430438401,101.0,t5_2qh1i,t3_34f9rh,t1_cqug91e,AskReddit,cqug91e,dcblackbelt,"Mid twenties male rocking skinny jeans/pants, ...",t1_cquc4rc,...,2.27545e-07,0,0,2,5e-06,2e-06,2e-06,0.0,2786,5


In [6]:
dataset.reset_index(drop=True).to_feather('comments_processed.ft')