In [24]:
import sqlite3
import pprint
import pandas as pd
import plotly.express as px
import yaml
import wordcloud

conn = sqlite3.connect("data/scraping_stats.db")

In [25]:
with open("config.yml", mode="r", encoding="utf-8") as file:
    try:
        config = yaml.safe_load(file)
    except yaml.YAMLError as exc:
        # Later cells need `config`, so report the parse error and stop here.
        print(exc)
        raise

In [26]:
table_name = config["log_table"]

query = f"SELECT * FROM {table_name}"
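
Because the table name comes from `config.yml` and is interpolated straight into the SQL string, it may be worth checking it first. SQLite cannot bind a table name as a query parameter, so one option is to accept only plain identifiers. The sketch below is a minimal illustration of that idea, not part of the original notebook: `build_select_all` and the `demo_log` table are hypothetical names, and the demo runs against an in-memory database.

```python
import re
import sqlite3

# Hypothetical helper: accept only plain SQL identifiers before interpolation.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_select_all(table_name: str) -> str:
    """Return a SELECT * query, rejecting names that are not plain identifiers."""
    if not _IDENTIFIER.fullmatch(table_name):
        raise ValueError(f"Unsafe table name: {table_name!r}")
    return f'SELECT * FROM "{table_name}"'


# Demo against an in-memory database with a made-up table name.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_log (logging_time TEXT, mean_selftext_len REAL)")
demo_conn.execute("INSERT INTO demo_log VALUES ('07/27/22 17:13:34', 52.16)")
demo_rows = demo_conn.execute(build_select_all("demo_log")).fetchall()
```

In the notebook this would amount to `query = build_select_all(config["log_table"])`, assuming the configured name is a simple identifier.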

In [27]:
data = pd.read_sql(query, con=conn)

In [28]:
data.head()

Unnamed: 0,index,logging_time,over_18_posts_prop,most_popular_sub,count_most_pop_sub,mean_selftext_len,common_subreddit,count_common_sub
0,0,07/27/22 17:13:34,0.0,AskReddit,36620912,52.16,Mastodon,1
1,0,07/27/22 17:13:40,0.02,AskReddit,36620913,47.08,Decoders,1
2,0,07/27/22 17:13:45,0.02,news,25000369,38.28,cute,1
3,0,07/27/22 17:13:50,0.0,harrypotter,1332672,25.3,TherapeuticKetamine,1
4,0,07/27/22 17:13:55,0.0,AskReddit,36620915,26.56,AskReddit,2


In [29]:
px.line(data_frame=data, x="logging_time", y="mean_selftext_len")

Mean post length per log entry mostly falls between 20 and 100 words, though it varies quite a bit from one entry to the next. Posts without text have been excluded from this metric.
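
The `logging_time` values in the table above look like plain strings, so the x-axis may be treated as categories rather than as a time axis. A minimal sketch of converting them first, assuming the format is month/day/two-digit year followed by a 24-hour time (this is inferred from the sample rows, not confirmed by the source):

```python
import pandas as pd

# Two sample rows copied from the data.head() output above.
sample = pd.DataFrame(
    {
        "logging_time": ["07/27/22 17:13:34", "07/27/22 17:13:40"],
        "mean_selftext_len": [52.16, 47.08],
    }
)

# Assumed format: month/day/two-digit year, then 24-hour time.
sample["logging_time"] = pd.to_datetime(
    sample["logging_time"], format="%m/%d/%y %H:%M:%S"
)
```

Applying the same conversion to `data["logging_time"]` before calling `px.line` should give a proper datetime axis, provided every row follows that format.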

In [30]:
px.bar(data, x="most_popular_sub")

The most popular subreddits by subscribers are **AskReddit**, **aww**, and **funny**. Several other popular subreddits cover sports, gaming, and general social or news content.

In [31]:
px.bar(data, x="common_subreddit")

We can see that a number of game-focused subs have a high amount of interaction (by number of posts). Some high-post subreddits, such as **autonewspaper**, consist mostly of bot posts (AutoModerator, in Reddit terms). One interesting subreddit, "Shrouds", posts only completely random text, apparently with the intent of making Reddit harder to search.
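
To rank subreddits by total posts rather than by how often they appear in the log, the per-entry counts can be summed before plotting. The sketch below is one possible approach, assuming `count_common_sub` holds the number of posts for `common_subreddit` in each log entry. It uses the five rows shown in the `data.head()` output as sample input.

```python
import pandas as pd

# Sample rows copied from the data.head() output above.
sample = pd.DataFrame(
    {
        "common_subreddit": [
            "Mastodon", "Decoders", "cute", "TherapeuticKetamine", "AskReddit",
        ],
        "count_common_sub": [1, 1, 1, 1, 2],
    }
)

# Sum the per-entry post counts for each subreddit, largest first.
totals = (
    sample.groupby("common_subreddit", as_index=False)["count_common_sub"]
    .sum()
    .sort_values("count_common_sub", ascending=False)
)

# The aggregated frame could then be plotted with, for example:
# px.bar(totals, x="common_subreddit", y="count_common_sub")
```

Summing first gives each subreddit a single bar whose height is its total post count, which makes bot-heavy subreddits easier to spot.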

In [32]:
df = pd.read_csv("data/response.csv", header=0, sep="|", engine="python")

In [33]:
df["subreddit"].value_counts()

PokemonGoRaids        70
AskReddit             33
Shrouds               28
PokemonGoFriends      27
Morbius               25
                      ..
dutchbros              1
mathmemes              1
RatchetAndClank        1
Art                    1
Starcitizen_guilds     1
Name: subreddit, Length: 1182, dtype: int64

In [38]:
df["author"].value_counts()

AutoModerator        19
YossieBidenfFan      17
bobcat                9
shynailgirl           6
OurProgressive        5
                     ..
Fantastic_Octopus     1
RaVeNSentrY           1
peace204              1
TheDailyNick          1
No_Bike4081           1
Name: author, Length: 1344, dtype: int64

In [35]:
df.drop_duplicates(subset="title", inplace=True)

In [36]:
df.subreddit.value_counts().head(10)

PokemonGoRaids      34
AskReddit           19
PokemonGoFriends    18
Shrouds             17
Morbius             12
teenagers           10
nytimes              9
FreeKarma4You        8
betterCallSaul       6
ThisCelebrity        6
Name: subreddit, dtype: int64

In [37]:
df[df.subreddit=="AskReddit"].value_counts("author")

author
Apart_Bumblebee3144     1
TheHoomanBean2804       1
twilightw0rld           1
sneakysecrets1          1
planetjiji              1
mugglejulia             1
bttf7000                1
beer-thinker            1
athwolf                 1
Slave_Schatz            1
Between3N20Karakters    1
No_Inflation_28         1
Mozzie_501              1
MapsandPics             1
Lespade                 1
Itz_shankr              1
HellenKellersMonocle    1
Connor1854Jordan        1
zanaj                   1
dtype: int64

A surprising number of people apparently still play Pokemon Go! There are also usually many **AskReddit** posts, where people solicit opinions or experiences from others.