# Preliminary Research: A First Look at Reddit

Here I play around with a small slice of the data to see what interesting things I can derive from it.

In [1]:
import json
import pandas as pd

In [2]:
with open("small_reddit.json", "r") as infile:
    tiny_april = json.load(infile)

`small_reddit.json` holds 1.2 GB of the ~120 GB of Reddit submission data for April 2021.

In [3]:
len(tiny_april)

305238

In [4]:
tiny_april[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'archived': False,
 'author': '[deleted]',
 'author_created_utc': None,
 'author_flair_background_color': '',
 'author_flair_css_class': None,
 'author_flair_template_id': None,
 'author_flair_text': None,
 'author_flair_text_color': 'dark',
 'can_gild': False,
 'category': None,
 'content_categories': None,
 'contest_mode': False,
 'created_utc': 1617260457,
 'discussion_type': None,
 'distinguished': None,
 'domain': 'self.BBRae_Community',
 'edited': False,
 'event_end': 1617346740.0,
 'event_is_live': False,
 'event_start': 1617260400.0,
 'gilded': 0,
 'gildings': {},
 'hidden': False,
 'hide_score': False,
 'id': 'lupsor',
 'is_created_from_ads_ui': False,
 'is_crosspostable': False,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': False,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '#3af221',
 'link_flair_css_class': 'Only used on April Fools PS

Let's look at subreddit statistics.

In [6]:
subreddits = set()

In [7]:
for entry in tiny_april:
    subreddits.add(entry['subreddit'])

In [8]:
len(subreddits)

47231
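Beyond the number of distinct subreddits, it may also help to know how many submissions each one has. A minimal sketch, assuming every entry carries a `subreddit` key as above; `sample_entries` and `count_subreddits` are hypothetical names used only for illustration, and the sample stands in for `tiny_april`:

```python
from collections import Counter

# Hypothetical stand-in for a few entries of tiny_april; only the
# 'subreddit' key is used here.
sample_entries = [
    {"subreddit": "hmm"},
    {"subreddit": "lojban"},
    {"subreddit": "hmm"},
    {"subreddit": "u_P4BU"},
]


def count_subreddits(entries):
    """Count how many submissions each subreddit has."""
    return Counter(entry["subreddit"] for entry in entries)


subreddit_counts = count_subreddits(sample_entries)

# The number of distinct keys matches len(set(...)) from the cells above.
print(len(subreddit_counts))
# Most active subreddit in the sample.
print(subreddit_counts.most_common(1))
```

On the real data, `count_subreddits(tiny_april).most_common(20)` would list the twenty busiest subreddits in the sample.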

In [14]:
subreddits

{'cbr',
 'watchpeoplesurvive',
 'TopAsiaFX',
 'Completedcomedymanga',
 'SneakerMarketRefs',
 'u_WomplyWombat',
 'u_gothjock216',
 'faketits',
 'u_2Dpins',
 'u_originaldior',
 'u_throwaway1345545',
 'ParasiteEve',
 'u_edgar866',
 'serbiancringe',
 'ivernmains',
 'u_AEGISAlliance',
 'dahyun',
 'WhatShouldIMix',
 'UlalaIdleAdventure',
 'RedditListings',
 'ShittyHandbras',
 'u_SLB_Model',
 'u_0873177792',
 'aftk',
 'SFTC',
 'ketogains',
 'GayCruisingSpots',
 'u_SavageRouge',
 'GirlsInBlack',
 'BlowjobsPorn',
 'lojban',
 'hmm',
 'u_They_call_me_divine',
 'LucyLi',
 'CallMeCarson_2',
 'IntelligenceFiles',
 'GenshinLewds',
 'RammusMains',
 'gloving',
 'OffMyChestPH',
 'u_Okay-But-No',
 'botafogo',
 'Seussism_',
 'guitarstrapporn',
 'LGV60',
 'atlus',
 'bara',
 'mindMYcrack',
 'u_SugarKPumpkin',
 'MilenaBilska',
 'u_sototally99',
 'ryanadams',
 'clevercomebacks',
 'TeluguBabes',
 'ManualTransmissions',
 'untrustworthypoptarts',
 'iamveryedgy',
 'miband',
 'u_P4BU',
 'Rogers',
 'RedditAndChill'

I could cluster the subreddit names with a string-similarity measure such as cosine similarity or Jaro-Winkler. Reference: https://medium.com/@appaloosastore/string-similarity-algorithms-compared-3f7b4d12f0ff

Cosine similarity does not take letter ordering into account: "niche" and "chine" come out as very similar, even though for our purposes they are not.
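To make the ordering problem concrete, here is a small sketch that assumes the vectors are plain character counts (one of several possible vectorizations; the referenced article may use a different one). `char_cosine` is a hypothetical helper name:

```python
import math
from collections import Counter


def char_cosine(a, b):
    """Cosine similarity between the character-count vectors of two strings."""
    counts_a, counts_b = Counter(a), Counter(b)
    # Dot product over characters; a Counter returns 0 for missing keys.
    dot = sum(counts_a[ch] * counts_b[ch] for ch in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Anagrams have identical count vectors, so the score is (numerically) 1.
print(char_cosine("niche", "chine"))
```

Because the two words are anagrams, their count vectors are identical and the score is 1 up to floating-point error, regardless of letter order.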

Levenshtein measures raw edit distance, so "Foo" and "Bar" are exactly as far apart (3 edits) as "Beautiful" and "Beauties" (also 3 edits), even though the second pair is clearly more alike. Perhaps not the best algorithm for our case.
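A quick check of that claim, as a sketch using the textbook dynamic-programming formulation. The helper names are hypothetical, and the length-normalized variant is my own addition rather than something taken from the reference:

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """1 - distance / longer length, so longer shared parts count for more."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


print(levenshtein("Foo", "Bar"))             # 3
print(levenshtein("Beautiful", "Beauties"))  # 3
print(normalized_similarity("Foo", "Bar"))
print(normalized_similarity("Beautiful", "Beauties"))
```

Dividing by the longer string's length separates the two pairs (0 versus roughly 0.67), which might make the measure usable after all.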

Trigrams could also work (or other n-gram sizes).
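One common way to turn n-grams into a score is Jaccard overlap between the two sets of character n-grams. The sketch below assumes that choice and lowercases the names first; both are my assumptions, and `char_ngrams` and `ngram_jaccard` are hypothetical helper names:

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    if len(text) < n:
        # Strings shorter than n are kept whole so they still compare.
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_jaccard(a, b, n=3):
    """Jaccard similarity of the two n-gram sets (shared / total distinct)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Two names from the subreddit set that share the "mains" suffix.
print(ngram_jaccard("RammusMains", "ivernmains"))
# Unlike character-count cosine, anagrams share no trigrams here.
print(ngram_jaccard("niche", "chine"))
```

The first pair shares the trigrams `mai`, `ain` and `ins` out of 14 distinct ones, while the anagram pair shares none, so letter order is taken into account.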

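Jaro-Winkler seems good: it rewards matching characters that sit in roughly the same positions and gives extra weight to a shared prefix.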
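For reference, a pure-Python sketch of Jaro-Winkler following the usual definition, with the conventional prefix scale of 0.1 and a prefix cap of 4 characters. A dedicated library would probably be the better choice on the full dataset; this is only meant to show what the score does:

```python
def jaro(s1, s2):
    """Jaro similarity: matches within a sliding window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order and count mismatches.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


# Classic textbook pair; the score is roughly 0.961.
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))
```

The prefix boost means names that start the same way score higher than names that only end the same way, which may or may not be what we want for subreddit names.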

I'm thinking of using multiple algorithms and seeing how the clusters differ.
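A rough sketch of how such a comparison could be set up: a greedy one-pass clustering that takes the similarity function as a parameter, so different measures can be swapped in. The clustering scheme, the threshold, and the helper names are all assumptions of mine, and `difflib.SequenceMatcher` is used here purely as a stand-in similarity so the block is self-contained:

```python
from difflib import SequenceMatcher


def cluster_names(names, similarity, threshold):
    """Greedy clustering: put each name into the first cluster whose
    first member is at least `threshold` similar, else start a new one."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def ratio_similarity(a, b):
    """Stand-in measure: difflib's matching-block ratio on lowercased names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


names = ["RammusMains", "ivernmains", "lojban"]

# A loose threshold groups the two "mains" subreddits together.
print(cluster_names(names, ratio_similarity, threshold=0.4))
# A strict threshold keeps every name in its own cluster.
print(cluster_names(names, ratio_similarity, threshold=0.9))
```

Passing a different `similarity` callable (cosine, trigram, Jaro-Winkler) to the same function would make the resulting clusters directly comparable. Note that the greedy scheme depends on input order, so a proper clustering method may be preferable later on.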

## Finding URLs

Let's see where URLs show up. I expect them in `entry["url"]` and in `entry["selftext"]`.
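If both fields do hold links, collecting them could look roughly like the sketch below. It assumes that for self posts `entry["url"]` is just the permalink of the post itself and is therefore skipped, and it uses a deliberately simple regular expression that will miss or mangle some edge cases. `URL_PATTERN` and `extract_urls` are hypothetical names:

```python
import re

# Simple pattern: scheme, then anything up to whitespace or a bracket.
URL_PATTERN = re.compile(r"https?://[^\s()\[\]<>]+")


def extract_urls(entry):
    """Collect outbound URLs from a submission's url and selftext fields."""
    urls = []
    # Assumption: for self posts, 'url' points back at the post itself.
    if not entry.get("is_self") and entry.get("url"):
        urls.append(entry["url"])
    selftext = entry.get("selftext") or ""
    for match in URL_PATTERN.findall(selftext):
        # Drop trailing sentence punctuation picked up by the pattern.
        urls.append(match.rstrip(".,;:!?"))
    return urls


# Hypothetical entries shaped like the submissions in tiny_april.
link_post = {"is_self": False, "url": "https://i.imgur.com/7pTMnOg.jpg", "selftext": ""}
self_post = {
    "is_self": True,
    "url": "https://www.reddit.com/r/example/comments/abc123/post/",
    "selftext": "See [docs](https://example.com/a) and https://example.org/b.",
}

print(extract_urls(link_post))
print(extract_urls(self_post))
```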

In [13]:
for entry in tiny_april[100:150]:
    print("--------------------------- SELFTEXT -------------------------")
    print(entry["selftext"])
    print("--------------------------- URL -------------------------")
    print(entry["url"])
    print(f"-------------- IS_SELF: {entry['is_self']} -----------------------")
    print("================*************==============")

--------------------------- SELFTEXT -------------------------
[deleted]
--------------------------- URL -------------------------
https://www.redgifs.com/watch/basicwhispereddaddylonglegs
-------------- IS_SELF: False -----------------------
--------------------------- SELFTEXT -------------------------
My doctor wrote a prescription for me to take 2 tablets of Finasteride 5 mg 3 times a day. Is that too much? 

I’ve only read about people taking one tablet a day . 

Any advice will be greatly appreciated.

Thank you
--------------------------- URL -------------------------
https://www.reddit.com/r/HairTransplants/comments/mhj2j4/finasteride_5_mg_3_times_daily/
-------------- IS_SELF: True -----------------------
--------------------------- SELFTEXT -------------------------

--------------------------- URL -------------------------
https://i.imgur.com/7pTMnOg.jpg
-------------- IS_SELF: False -----------------------
--------------------------- SELFTEXT -------------------------

----