# Reddit Posts
This notebook lets you interact with the Reddit Posts application using different configurations and inspect the result sets. It compares the Reddit posts from the current execution with those from the previous execution and reports:
- new posts since last execution
- posts that are no longer in top 75 since last execution
- posts whose scores changed since last execution

A markdown section has been added above each notebook cell explaining what it does.

## Constants
Constants are defined to track the path locations in the application.
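
The source and data directories in this notebook sit directly under the repository root, so they can be derived from a single base path instead of being repeated. A minimal sketch of that idea, assuming a placeholder root (`/path/to/reddit_posts` is hypothetical, and the lowercase names are illustrative only):

```python
from pathlib import PurePosixPath

# Hypothetical repository root; substitute the real checkout location.
base_path = PurePosixPath("/path/to/reddit_posts")

# Derive the sub-directories from the root so the root is written only once.
src_path = str(base_path / "src")
data_path = str(base_path / "data")

print(src_path)
print(data_path)
```

`PurePosixPath` is used here only so the sketch behaves the same on every platform; inside the notebook, `pathlib.Path` would be the usual choice.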

In [36]:
BASE_PATH = '/Users/johnrojas/Development/vcs/github/johnwrf/reddit_posts'
SRC_PATH = '/Users/johnrojas/Development/vcs/github/johnwrf/reddit_posts/src'
DATA_PATH = '/Users/johnrojas/Development/vcs/github/johnwrf/reddit_posts/data'

NEW_POSTS_LISTING = "new"
TOP_POSTS_LISTING = "top"

NEW_POSTS_COUNT = 100
TOP75_POSTS_COUNT = 75

POST_COLUMNS = [
    "author_fullname",
    "title",
    "name",
    "score",
    "created",
    "view_count",
    "id",
    "author",
    "url",
    "created_utc"
    ]

# maps from the "_x"/"_y" suffixed names that pandas merge produces back to the plain column names
POST_COLUMNS_X = {f"{col}_x": col for col in POST_COLUMNS}
POST_COLUMNS_Y = {f"{col}_y": col for col in POST_COLUMNS}


## Imports and configuration
This section imports the dependent modules and makes sure the notebook has access to the Reddit Posts application source.
Finally, the logging module is configured so that application log statements are visible when each notebook cell is executed.
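
Re-running the import cell appends the source directory to `sys.path` each time, which is why the printed path list shows it more than once. One possible guard is sketched below; the helper name `add_to_sys_path` and the example directory are assumptions made for illustration and are not part of the application.

```python
import sys


def add_to_sys_path(path: str) -> bool:
    """Append path to sys.path only if it is not already there.

    Returns True when the path was added, False when it was already present.
    """
    if path in sys.path:
        return False
    sys.path.append(path)
    return True


# Hypothetical placeholder; in the notebook this would be SRC_PATH.
example_src = "/tmp/reddit_posts_example/src"

first = add_to_sys_path(example_src)
second = add_to_sys_path(example_src)  # no-op on the second call
print(first, second, sys.path.count(example_src))
```

With a guard like this the cell can be executed repeatedly without growing `sys.path`.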

In [38]:
import datatable as dt
import pandas as pd

# patch to import source code
import sys
sys.path.append(SRC_PATH)
print(sys.path)

from apis.reddit.reddit_posts import RedditPosts
from apis.io.post_io import load_posts, save_posts
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.StreamHandler(sys.stdout)
    ]
)

['/Users/johnrojas/Development/vcs/github/johnwrf/reddit_posts/notebooks', '/usr/local/anaconda3/lib/python39.zip', '/usr/local/anaconda3/lib/python3.9', '/usr/local/anaconda3/lib/python3.9/lib-dynload', '', '/usr/local/anaconda3/lib/python3.9/site-packages', '/usr/local/anaconda3/lib/python3.9/site-packages/aeosa', '/Users/johnrojas/Development/vcs/github/johnwrf/reddit_posts/src', '/Users/johnrojas/Development/vcs/github/johnwrf/reddit_posts/src', '/Users/johnrojas/Development/vcs/github/johnwrf/reddit_posts/src', '/Users/johnrojas/Development/vcs/github/johnwrf/reddit_posts/src']


## Load new posts

This section loads the latest new posts and the new posts saved by the last execution.

Note that the same code is used to load both new posts and top posts.

In [42]:
reddit_new = RedditPosts(listing=NEW_POSTS_LISTING, limit=NEW_POSTS_COUNT, timeframe="hour")
new_posts = reddit_new.load(columns=POST_COLUMNS)
logging.info(f"loaded latest new posts count {new_posts.nrows}")

last_new_posts = load_posts(listing=NEW_POSTS_LISTING, columns=POST_COLUMNS, base_path=DATA_PATH)
logging.info(f"loaded previous new posts count {last_new_posts.nrows}")

2022-09-20 01:49:13,837 [INFO] loaded latest new posts count 100
2022-09-20 01:49:13,841 [INFO] loaded previous new posts count 100


## Load Top 75 posts

This section loads the latest top 75 posts and the top 75 posts saved by the last execution.

Note that the same code is used to load both new posts and top posts.

In [46]:
reddit_top = RedditPosts(listing=TOP_POSTS_LISTING, limit=TOP75_POSTS_COUNT, timeframe="hour")
top_posts = reddit_top.load(columns=POST_COLUMNS)
logging.info(f"loaded latest top posts count {top_posts.nrows}")

last_top_posts = load_posts(listing=TOP_POSTS_LISTING, columns=POST_COLUMNS, base_path=DATA_PATH)
logging.info(f"loaded previous top posts count {last_top_posts.nrows}")

2022-09-20 01:50:25,375 [INFO] loaded latest top posts count 75
2022-09-20 01:50:25,377 [INFO] loaded previous top posts count 75


## Save posts
This section saves the latest new and top 75 posts to files on disk.

In [41]:
save_posts(listing=NEW_POSTS_LISTING, posts=new_posts, base_path=DATA_PATH)
save_posts(listing=TOP_POSTS_LISTING, posts=top_posts, base_path=DATA_PATH)

True

## Determine new posts since last run
A pandas dataframe is used because the datatable df does not yet support the outer join needed here.
The datatable df can easily be converted to a pandas df.
The pandas merge function is then called with the indicator=True option, which generates a new "_merge" column with the values left_only, right_only and both.

Once the merge is performed, posts that appear only in the latest results are marked with "_merge"="left_only"; these are the new posts since the last run.

Finally, column names are cleaned up and only the new posts data is kept.
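
To make the role of the indicator column concrete, here is a small sketch on made-up data. The two frames and their values are invented for illustration; only `pd.merge` with `how="outer"` and `indicator=True` mirrors what the notebook does.

```python
import pandas as pd

# Toy stand-ins for the current and previous runs (values are made up).
this_run = pd.DataFrame({"id": ["a", "b", "c"], "score": [10, 20, 30]})
last_run = pd.DataFrame({"id": ["b", "c", "d"], "score": [18, 30, 5]})

# Full outer join on id; "_merge" says which side each row came from.
merged = pd.merge(this_run, last_run, on=["id"], how="outer", indicator=True)

# Rows found only on the left side are the posts that are new in this run.
new_only = merged[merged["_merge"] == "left_only"]

# Non-key columns got the "_x" suffix from the left frame; map it back.
new_only = new_only.rename({"score_x": "score"}, axis=1)[["id", "score"]]

new_ids = new_only["id"].tolist()
print(new_ids)  # ['a']
```

Post "a" exists only in the current run, "d" only in the previous one, and "b" and "c" in both, so only "a" survives the `left_only` filter.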

In [44]:
df_this_run = new_posts.to_pandas()
df_last_run = last_new_posts.to_pandas()
# full outer join on the post id; "_merge" records which run each row came from
df_new_posts = pd.merge(df_this_run, df_last_run, on=['id'], how="outer", indicator=True)
display(df_new_posts.groupby(['_merge'])['_merge'].count())

# keep the posts seen only in this run and restore the unsuffixed column names
df_new_since_last = df_new_posts[df_new_posts['_merge'] == 'left_only']
df_new_since_last = df_new_since_last.rename(POST_COLUMNS_X, axis=1)
df_new_since_last = df_new_since_last[POST_COLUMNS]
df_new_since_last.head()

_merge
left_only     100
right_only    100
both            0
Name: _merge, dtype: int64

   author_fullname                                              title  \
0      t2_8f3682f2                                                Her   
1      t2_r35hbouv                                    My home screen.   
2      t2_ezv57rrm                                     Sharpness help   
3      t2_2ehxrk9c                                   Aeon credit card   
4       t2_9wfjfkx  [sway] Just came back to manjaro for productivity   

         name  score             created  view_count      id  \
0   t3_xj0gz7    1.0 2022-09-

## Determine posts no longer in top 75 since last run
A pandas dataframe is used because the datatable df does not yet support the outer join needed here.
The datatable df can easily be converted to a pandas df.
The pandas merge function is then called with the indicator=True option, which generates a new "_merge" column with the values left_only, right_only and both.

Once the merge is performed, posts that were in the previous top 75 but not in the latest one are marked with "_merge"="right_only"; these are the posts that no longer appear in the top 75 results.

Finally, column names are cleaned up and only the posts that dropped out of the top 75 are kept.
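
For rows that exist only in the previous run, the values live in the "_y" columns, while the "_x" columns are empty. The sketch below shows this on invented data; the frames and titles are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the current and previous top listings (values are made up).
this_run = pd.DataFrame({"id": ["b", "c"], "title": ["B", "C"]})
last_run = pd.DataFrame({"id": ["a", "b"], "title": ["A", "B"]})

merged = pd.merge(this_run, last_run, on=["id"], how="outer", indicator=True)

# Rows found only on the right side are posts that dropped out of the listing.
dropped = merged[merged["_merge"] == "right_only"]

# The left-hand ("_x") values are missing for these rows ...
x_is_missing = bool(dropped["title_x"].isna().all())

# ... so the right-hand ("_y") columns are the ones to rename and keep.
dropped = dropped.rename({"title_y": "title"}, axis=1)[["id", "title"]]

dropped_titles = dropped["title"].tolist()
print(x_is_missing, dropped_titles)  # True ['A']
```

This is why the cell that follows renames with the "_y" map rather than the "_x" map.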

In [47]:
df_this_run = top_posts.to_pandas()
df_last_run = last_top_posts.to_pandas()
# full outer join on the post id; "_merge" records which run each row came from
df_top_posts = pd.merge(df_this_run, df_last_run, on=['id'], how="outer", indicator=True)
display(df_top_posts.groupby(['_merge'])['_merge'].count())

# keep the posts seen only in the previous run and restore the unsuffixed column names
df_no_longer_top75 = df_top_posts[df_top_posts['_merge'] == 'right_only']
df_no_longer_top75 = df_no_longer_top75.rename(POST_COLUMNS_Y, axis=1)
df_no_longer_top75 = df_no_longer_top75[POST_COLUMNS]
df_no_longer_top75.head()

_merge
left_only      5
right_only     5
both          70
Name: _merge, dtype: int64

   author_fullname                                              title  \
75     t2_9rd7708d                        the hypocrisy of those guys   
76     t2_nbrghm5b                                   Congratulations.   
77     t2_cm6ogjgd  This beautiful couple was trying to have a pho...   
78       t2_16aol1  Duckabush River, Brothers Wilderness, WA, USA ...   
79        t2_6udar                   🔥 Wedge tailed eagle in flight 🔥   

         name  score             created view_count      id            author  \
75  t3_xizdq9   48.0 2022-09-20 04:49:29      False  xizdq9         Blenny125   
76  t3_xizdu9   46.0 2022-09-20 04:49:40      False  xizdu9       PettyLustre   
77  t3_xizdxp   39.0 2022-09-20 04:49:50      False  xizdxp         mortissed   
78  t3_xize4h   34.0 2022-09-20 04:50:07      False  xize4h  jackrussellcorgi   
79  t3_xize1v   28.0 2022-09-20 04:50:01      False  xize1v         katmonday   

                                    url     

## Determine posts whose score changed
To determine posts whose scores changed, we combine the merged new and top 75 posts and drop duplicates by post id.
Then we keep the posts that appeared during both program executions and whose score differs between them.
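
The steps above can be sketched on toy data as follows. The two frames imitate already-merged results with "score_x" (this run), "score_y" (last run) and "_merge" columns; all values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the merged top and new frames (values are made up).
top = pd.DataFrame({
    "id": ["a", "b"],
    "score_x": [12.0, 7.0],
    "score_y": [10.0, 7.0],
    "_merge": ["both", "both"],
})
new = pd.DataFrame({
    "id": ["a", "c"],
    "score_x": [12.0, 3.0],
    "score_y": [10.0, None],
    "_merge": ["both", "left_only"],
})

combined = pd.concat([top, new])

# drop_duplicates returns a new frame, so the result has to be assigned.
combined = combined.drop_duplicates(subset=["id"])

# Keep posts seen in both runs whose score differs between the runs.
mask = (combined["_merge"] == "both") & (combined["score_x"] != combined["score_y"])
changed = combined[mask].copy()
changed["score_change"] = changed["score_x"] - changed["score_y"]

changed_ids = changed["id"].tolist()
changes = changed["score_change"].tolist()
print(changed_ids, changes)  # ['a'] [2.0]
```

Post "a" appears in both listings but is counted once, "b" is unchanged, and "c" was not present in the previous run, so only "a" is reported.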

In [48]:
df_scores = pd.concat([df_top_posts, df_new_posts])
# drop_duplicates returns a new frame, so the result must be assigned
df_scores = df_scores.drop_duplicates(subset=['id'])

# keep posts present in both runs whose score differs
df_scores = df_scores[(df_scores['_merge'] == 'both') & (df_scores['score_x'] != df_scores['score_y'])]

df_scores = df_scores.rename(POST_COLUMNS_X, axis=1)
# copy so the new column is added to an independent frame
df_scores = df_scores[POST_COLUMNS+['score_y']].copy()
df_scores['score_change'] = df_scores['score'] - df_scores['score_y']
df_scores[['title','score','score_y','score_change']]

|    | title | score | score_y | score_change |
|----|-------|-------|---------|--------------|
| 0  | KaiCenat reacts to Ice Posieden tweet about Mi... | 367.0 | 355.0 | 12.0 |
| 1  | Sliker's ex gf on living with the Austin texas... | 216.0 | 213.0 | 3.0 |
| 2  | Adrianah tweets thread with 5 more ppl sharing... | 139.0 | 133.0 | 6.0 |
| 3  | Ranger's Rank Shakeup | 112.0 | 108.0 | 4.0 |
| 4  | 2.3 Trillion Interception | 112.0 | 110.0 | 2.0 |
| 5  | Trump lawyers acknowledge Mar-a-Lago probe cou... | 99.0 | 98.0 | 1.0 |
| 7  | Kai gives Mizkif an L | 96.0 | 95.0 | 1.0 |
| 8  | Somo Inu - Bearer Good Fortune is a Meme Token... | 90.0 | 89.0 | 1.0 |
| 9  | Bunny the Dog | 82.0 | 81.0 | 1.0 |
| 11 | Confirmed. No music after goals at the Grand F... | 68.0 | 66.0 | 2.0 |
