# Reddit Posts
This notebook allows to interaact with the reddit posts application using different configurations as well as to observe the result sets. The notebook does comparisons of new Reddit posts vs the Reddit posts from the previous execution.
- new posts since last execution
- posts that are no longer in top 75 since last execution
- posts whose scores changed since last execution

A markdown section has been added above each notebook cell explaining what it does.

## Constants
Constants are defined to track the path locations in the application.

In [24]:
BASE_PATH = '/Users/johnrojas/Development/vcs/github/johnwrf/reddit_posts'
SRC_PATH = '/Users/johnrojas/Development/vcs/github/johnwrf/reddit_posts/src'
DATA_PATH = '/Users/johnrojas/Development/vcs/github/johnwrf/reddit_posts/data'

NEW_POSTS_LISTING = "new"
TOP_POSTS_LISTING = "top"

NEW_POSTS_COUNT = 100
TOP75_POSTS_COUNT = 75

POST_COLUMNS = [
    "author_fullname",
    "title",
    "name",
    "score",
    "created",
    "view_count",
    "id",
    "author",
    "url",
    "created_utc"
    ]

POST_COLUMNS_X = {POST_COLUMNS[i]+"_x":POST_COLUMNS[i] for i in range(len(POST_COLUMNS))}
POST_COLUMNS_Y = {POST_COLUMNS[i]+"_y":POST_COLUMNS[i] for i in range(len(POST_COLUMNS))}


## Imports and configuration
This section imports dependant modules and makes sure the notebook has access to the Reddit Posts application source.
Finally the logging module is configured so that application log statements are visible when each notbook cell is executed.

In [25]:
import datatable as dt
import pandas as pd

# patch to import source code
import sys
sys.path.append(SRC_PATH)
print(sys.path)

from apis.reddit.reddit_posts import RedditPosts
from apis.io.post_io import load_posts, save_posts
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.StreamHandler(sys.stdout)
    ]
)

['/Users/johnrojas/Development/vcs/github/johnwrf/reddit_posts/notebooks', '/usr/local/anaconda3/lib/python39.zip', '/usr/local/anaconda3/lib/python3.9', '/usr/local/anaconda3/lib/python3.9/lib-dynload', '', '/usr/local/anaconda3/lib/python3.9/site-packages', '/usr/local/anaconda3/lib/python3.9/site-packages/aeosa', '/Users/johnrojas/Development/vcs/github/johnwrf/reddit_posts/src', '/Users/johnrojas/Development/vcs/github/johnwrf/reddit_posts/src']


## Load new posts

This section loads the latest new posts and the new posts from the last execution

Note that the same code is used to load both new posts and top posts

In [26]:
reddit_new = RedditPosts(listing=NEW_POSTS_LISTING, limit=NEW_POSTS_COUNT, timeframe="hour")
new_posts = reddit_new.load(columns=POST_COLUMNS)
logging.info(f"loaded latest new posts count {new_posts.nrows}")

last_new_posts = load_posts(listing=NEW_POSTS_LISTING, columns=POST_COLUMNS, base_path=DATA_PATH)
logging.info(f"loaded previous new posts count {last_new_posts.nrows}")

2022-09-19 23:54:43,601 [INFO] loaded latest new posts count 100
2022-09-19 23:54:43,605 [INFO] loaded previous new posts count 100


## Load Top 75 posts

This section loads the latest top 75 posts and the top 75 posts from the last execution

Note that the same code is used to load both new posts and top posts

In [27]:
reddit_top = RedditPosts(listing=TOP_POSTS_LISTING, limit=TOP75_POSTS_COUNT, timeframe="hour")
top_posts = reddit_top.load(columns=POST_COLUMNS)
logging.info(f"loaded latest top posts count {top_posts.nrows}")

last_top_posts = load_posts(listing=TOP_POSTS_LISTING, columns=POST_COLUMNS, base_path=DATA_PATH)
logging.info(f"loaded previous top posts count {last_top_posts.nrows}")

2022-09-19 23:54:49,100 [INFO] loaded latest top posts count 75
2022-09-19 23:54:49,103 [INFO] loaded previous top posts count 75


## Save posts
This section saves the lastest new and top75 posts to files on disk

In [20]:
save_posts(listing=NEW_POSTS_LISTING, posts=new_posts, base_path=DATA_PATH)
save_posts(listing=TOP_POSTS_LISTING, posts=top_posts, base_path=DATA_PATH)

True

## Determine new posts since last run
A pandas dataframe is used because the datatable df does not yet support left/right outer joins
The datatable df can easily be converted to a pandas df
The pandas merge function is then used with the indicator=True option, 
which generates a new "_merge" column, with values: left_only, right_only and both

Once the merge is performed, all the latest new posts are marked with "_merge"="left_only", meaning those are the posts that only appeared in the latest results.

Finally, column names are cleaned up and only the new posts data is returned

In [28]:
df_this_run = new_posts.to_pandas()
df_last_run = last_new_posts.to_pandas()
df_new_posts = pd.merge(df_this_run, df_last_run, on=['id'], how="outer", indicator=True)
display(df_new_posts.groupby(['_merge'])['_merge'].count())

df_new_since_last = df_new_posts[df_new_posts['_merge'] == 'left_only']
df_new_since_last = df_new_since_last.rename(POST_COLUMNS_X, axis=1)
df_new_since_last = df_new_since_last[POST_COLUMNS]
df_new_since_last.head

_merge
left_only     100
right_only    100
both            0
Name: _merge, dtype: int64

<bound method NDFrame.head of    author_fullname                                              title  \
0      t2_hm0yyww8  If People Got Smart with Their Purchases, Woul...   
1      t2_sn1qagym                             Exclusive rare links ✅   
2      t2_8oitj7a5                           Crypto Insider here, AMA   
3      t2_n48p04r5                                            Damn 💪🏾   
4      t2_802lmiz6  waiting on group ride. I'm going to have to go...   
..             ...                                                ...   
95     t2_bb2q5i0f  Fuse question for possible kill switch/"manual...   
96     t2_jetjf1at  Maricel Soriano gears up for new series with A...   
97      t2_2sqbfgp  Got to talk to Ben and Henry on Open Lines ton...   
98        t2_jkwb7  You know? My favorite variant is Delta 🙂 Parti...   
99     t2_3ihgwaog  CardboardCollectible &amp; Sentinel Games to d...   

         name  score             created  view_count      id  \
0   t3_xiyb5e    1.0 2022-09-

## Determine posts no longer in top 75 since last run
A pandas dataframe is used because the datatable df does not yet support left/right outer joins
The datatable df can easily be converted to a pandas df
The pandas merge function is then used with the indicator=True option, 
which generates a new "_merge" column, with values: left_only, right_only and both

Once the merge is performed, the previous posts that were part of top 75 are marked with "_merge"="right_only", meaning those are the posts that no longer appeared in the top75 results.

Finally, column names are cleaned up and only the new posts data is returned

In [29]:
df_this_run = top_posts.to_pandas()
df_last_run = last_top_posts.to_pandas()
df_top_posts = pd.merge(df_this_run, df_last_run, on=['id'], how="outer", indicator=True)
display(df_top_posts.groupby(['_merge'])['_merge'].count())

df_no_longer_top75 = df_top_posts[df_top_posts['_merge'] == 'right_only']
df_no_longer_top75 = df_no_longer_top75.rename(POST_COLUMNS_Y, axis=1)
df_no_longer_top75 = df_no_longer_top75[POST_COLUMNS]
df_no_longer_top75.head

_merge
left_only     15
right_only    15
both          60
Name: _merge, dtype: int64

<bound method NDFrame.head of    author_fullname                                              title  \
75       t2_15wpdc                              Destiny defends Hasan   
76     t2_iteuvd6g  British volunteer fighter Viktor Yatsunyk has ...   
77        t2_g0kwx  The Rise Of Remote Work Sparks A Broader Globa...   
78        t2_g81s0                   Destiny Responds to Hasan's Take   
79     t2_p1777439  The Ukrainian soldier Vasyl Pelysh was capture...   
80        t2_lgedr                       Asmon Goes off on CrazySlick   
81     t2_hm0jhj9p                Destiny covering today's drama like   
82     t2_6d4tqcbn  See a Salamander Grow From a Single Cell in th...   
83     t2_1kpk6luj  Positive Interpretation Bias Predicts Longitud...   
84      t2_7uwwwcr  IMX / Gamestop NFT Battle Royale Game Kiravers...   
85     t2_52x3x7j1  I guess turning a blind eye because he is Geo ...   
86     t2_7z4299qf                                              Relax   
87       t2_8xeiiq   

## Determine posts whose score changed
To determine posts whose scores changed, we combine the new and top 75 posts, drop duplicates.
Then, we filter the posts that appeared during both program executions and where the score changed

In [35]:
df_scores = pd.concat([df_top_posts, df_new_posts])
df_scores.drop_duplicates(subset=['id'])

df_scores = df_scores[(df_scores['_merge'] == 'both') & (df_scores['score_x'] != df_scores['score_y'])]

df_scores = df_scores.rename(POST_COLUMNS_X, axis=1)
df_scores = df_scores[POST_COLUMNS+['score_y']]
df_scores['score_change'] = df_scores.score - df_scores.score_y
df_scores[['title','score','score_y','score_change']]

Unnamed: 0,title,score,score_y,score_change
0,Adrianah's opinion on xQc and Train not mentio...,959.0,868.0,91.0
1,Destiny explains the problems with revealing s...,438.0,420.0,18.0
2,Worst part of the game,364.0,289.0,75.0
3,Adrianah says Destiny has been one of the most...,256.0,242.0,14.0
4,The increased prevalence of depression and anx...,274.0,252.0,22.0
5,"Hasan straight up calls XQC a liar, and Adrian...",242.0,237.0,5.0
6,do we hate them? yes. are we worried about the...,254.0,235.0,19.0
7,"“Devils Horns” sunrise captured in Qatar, duri...",247.0,224.0,23.0
8,Adrianah says she appreciates Maya apology,217.0,210.0,7.0
9,Adrianah never told xQc about the sexual assault,193.0,180.0,13.0
