# Initial Dataset Exploration (Republican Opinion)

Import necessary modules:

In [4]:
# Python STL
from pathlib import Path

# Data Analysis
import numpy as np
import pandas as pd

# Data Visualisation
import plotly.express as px

Create necessary `Path` objects:

In [5]:
root_dir = Path.cwd().parent  # project root: parent of the notebook's working directory
data_dir = root_dir / "data"

Discover datasets in `data/` directory:

In [6]:
print("CSV Files: ")
for csv_path in data_dir.glob("*.csv"):  # avoid shadowing the stdlib `csv` module name
    print(csv_path)

CSV Files: 
/Users/jakegodsall/Documents/dev/data-science/reddit-political-opinion/data/reddit_opinion_democrats.csv
/Users/jakegodsall/Documents/dev/data-science/reddit-political-opinion/data/reddit_opinion_republican.csv


## Republican Opinion Dataset Initial Exploration

Load the dataset into a `pandas.DataFrame`:

In [7]:
rep_df = pd.read_csv(data_dir / "reddit_opinion_republican.csv", delimiter=",")

rep_df.head()

| | comment_id | score | self_text | subreddit | created_time | post_id | author_name | controversiality | ups | downs | ... | user_link_karma | user_comment_karma | user_total_karma | post_score | post_self_text | post_title | post_upvote_ratio | post_thumbs_ups | post_total_awards_received | post_created_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | kbc02gf | 1 | 85% acceptance rate and 65% graduation rate? C... | politics | 2023-11-29 23:40:57 | 187242w | ShinySpines | 0 | 1 | 0 | ... | 239.0 | 53391.0 | 54801.0 | 442 | NaN | Trump Caught Moving Money Around to Pay Massiv... | 0.98 | 442 | 0 | 2023-11-29 22:33:12 |
| 1 | kbbzzvm | 1 | The video is crazier. | VoteDEM | 2023-11-29 23:40:28 | 186mtny | Eightysixedit | 0 | 1 | 0 | ... | 2.0 | 18983.0 | 18985.0 | 19 | After a year of hard work, we can enjoy the re... | Daily Discussion Thread: November 29, 2023 | 0.9 | 19 | 0 | 2023-11-29 11:00:07 |
| 2 | kbbzye4 | 1 | They will release the iron-clad proof of 'The ... | politics | 2023-11-29 23:40:11 | 1870jp8 | NovelRelationship830 | 0 | 1 | 0 | ... | 1561.0 | 11992.0 | 13553.0 | 1003 | NaN | Republicans Trip Over Their Own Assholes Tryin... | 0.98 | 1003 | 0 | 2023-11-29 21:26:26 |
| 3 | kbbzxhi | 1 | [https://www.whitehouse.gov/omb/budget/histori... | changemyview | 2023-11-29 23:40:00 | 186x25w | CalLaw2023 | 0 | 1 | 0 | ... | 2.0 | 1274.0 | 1276.0 | 0 | I ran into this article where an interest grou... | CMV:Social security contributes to federal debt | 0.3 | 0 | 0 | 2023-11-29 18:55:07 |
| 4 | kbbzx4z | 1 | Republicans should move away from conspiracy t... | AskReddit | 2023-11-29 23:39:56 | 18730jr | areallycleverid | 0 | 1 | 0 | ... | 2527.0 | 4371.0 | 6898.0 | 0 | NaN | Biden has retaken the lead on the latest natio... | 0.14 | 0 | 0 | 2023-11-29 23:11:32 |
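
`pd.read_csv` leaves timestamp columns such as `created_time` and `post_created_time` as plain strings unless `parse_dates` is passed. A minimal sketch of converting them afterwards, assuming a toy frame (`toy_df`, hypothetical) that reuses the timestamps from the first two rows shown above:

```python
import pandas as pd

# Toy frame standing in for rep_df; timestamps copied from the first two rows above.
toy_df = pd.DataFrame(
    {
        "created_time": ["2023-11-29 23:40:57", "2023-11-29 23:40:28"],
        "post_created_time": ["2023-11-29 22:33:12", "2023-11-29 11:00:07"],
    }
)

# Convert the string columns to datetime64 so they support date arithmetic.
for col in ["created_time", "post_created_time"]:
    toy_df[col] = pd.to_datetime(toy_df[col], format="%Y-%m-%d %H:%M:%S")

# Time elapsed between the post and each comment.
delay = toy_df["created_time"] - toy_df["post_created_time"]
print(delay)
```

The same conversion could be applied to `rep_df` if time-based analysis is needed later.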


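Determine the size of the dataset. Note that `DataFrame.size` returns the total number of elements (rows × columns), not the number of rows; with 24 columns, the value below corresponds to 199,038 rows:

In [8]:
print("Number of elements: ", rep_df.size)

Number of elements:  4776912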
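
`DataFrame.size` counts every cell, so it equals rows times columns, whereas `len(df)` or `df.shape[0]` gives the row count. A minimal sketch of the difference, assuming a small made-up frame (`toy_df`, hypothetical) in place of `rep_df`:

```python
import pandas as pd

# Toy frame standing in for rep_df: 3 rows, 2 columns, made-up values.
toy_df = pd.DataFrame(
    {"score": [1, 2, 3], "subreddit": ["politics", "news", "trump"]}
)

n_rows, n_cols = toy_df.shape  # (rows, columns)
n_elements = toy_df.size       # rows * columns

print("Rows:", n_rows)
print("Columns:", n_cols)
print("Elements:", n_elements)
```

Applied to `rep_df`, `rep_df.shape` would report both dimensions in one call.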


Count missing (`NaN`) values in each column:

In [9]:
rep_df.isna().sum()

comment_id                         0
score                              0
self_text                          1
subreddit                          0
created_time                       0
post_id                            0
author_name                        0
controversiality                   0
ups                                0
downs                              0
user_is_verified                4677
user_account_created_time       4677
user_awardee_karma                 3
user_awarder_karma                 3
user_link_karma                    3
user_comment_karma                 3
user_total_karma                   3
post_score                         0
post_self_text                155275
post_title                         0
post_upvote_ratio                  0
post_thumbs_ups                    0
post_total_awards_received         0
post_created_time                  0
dtype: int64
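
Raw counts are easier to judge relative to the number of rows. A minimal sketch of reporting the share of missing values per column, assuming a toy frame (`toy_df`, hypothetical, made-up values) rather than the real dataset:

```python
import pandas as pd

# Toy frame standing in for rep_df; None marks a missing value.
toy_df = pd.DataFrame(
    {
        "self_text": ["a", None, "c", "d"],
        "post_self_text": [None, None, None, "x"],
        "score": [1, 2, 3, 4],
    }
)

# isna() yields booleans, so the column mean is the fraction of missing values.
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On `rep_df`, the same expression would show that `post_self_text` is missing for most rows (155,275 of them), which is expected for link posts without body text.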

Inspect the values taken by the `score`, `subreddit`, and `controversiality` columns:

In [14]:
rep_df.loc[:, "score"].value_counts()

score
 1       34661
 2       23125
 3       17721
 5        9526
 4        9248
         ...  
 719         1
 579         1
 1454        1
 564         1
-81          1
Name: count, Length: 1370, dtype: int64
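
With 1,370 distinct scores, the raw `value_counts()` output is hard to read in full. A minimal sketch of two alternatives, summary statistics and binned counts, assuming a made-up `scores` series and arbitrary bin edges chosen only for illustration:

```python
import pandas as pd

# Made-up scores standing in for rep_df["score"].
scores = pd.Series([1, 1, 2, 3, 5, -81, 1454], name="score")

# Count, mean, std, min, quartiles, and max in one call.
summary = scores.describe()

# Group scores into coarse, right-inclusive bins; the edges are illustrative assumptions.
bins = [-float("inf"), 0, 1, 10, 100, float("inf")]
labels = ["<=0", "1", "2-10", "11-100", ">100"]
binned = pd.cut(scores, bins=bins, labels=labels).value_counts().sort_index()

print(summary)
print(binned)
```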

In [11]:
rep_df.loc[:, "subreddit"].unique()

array(['politics', 'VoteDEM', 'changemyview', 'AskReddit',
       'PoliticalHumor', 'neoliberal', 'Conservative',
       'WhitePeopleTwitter', 'news', 'Republican_misdeeds', 'democrats',
       'conspiracy', 'RepublicanValues', 'Republican', 'uspolitics',
       'ShitPoliticsSays', 'Political_Revolution', 'AskThe_Donald',
       'trump', 'WayOfTheBern', 'conservatives', 'Libertarian',
       'progressive', 'Republican_memes', 'SandersForPresident',
       'ConservativesOnly', 'EnoughTrumpSpam', 'republicanmemes',
       'ChristianDemocrat', 'ConservativeDemocrat'], dtype=object)

In [12]:
rep_df.loc[:, "subreddit"].value_counts()

subreddit
politics                78383
VoteDEM                 25552
WhitePeopleTwitter      16219
PoliticalHumor          12441
democrats               11005
neoliberal               7813
Conservative             5544
ShitPoliticsSays         4969
Libertarian              4415
changemyview             3647
conspiracy               3441
Republican               2817
trump                    2604
uspolitics               2549
EnoughTrumpSpam          2503
Political_Revolution     2339
progressive              2262
AskThe_Donald            1903
RepublicanValues         1517
SandersForPresident      1477
Republican_misdeeds      1352
news                      972
WayOfTheBern              937
conservatives             828
AskReddit                 445
ConservativesOnly         327
ChristianDemocrat         306
republicanmemes           246
Republican_memes          113
ConservativeDemocrat      112
Name: count, dtype: int64
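
The counts above are heavily skewed: `politics` alone accounts for roughly 39% of comments (78,383 of 199,038). Relative shares can be obtained directly with `normalize=True`. A minimal sketch, assuming a made-up `subreddits` series in place of `rep_df["subreddit"]`:

```python
import pandas as pd

# Made-up values standing in for rep_df["subreddit"].
subreddits = pd.Series(
    ["politics", "politics", "politics", "VoteDEM", "news"], name="subreddit"
)

# normalize=True returns fractions of the total instead of raw counts.
shares = subreddits.value_counts(normalize=True)
print(shares)
```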

In [16]:
rep_df.loc[:, "controversiality"].unique()

array([0, 1])
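
Because `controversiality` only takes the values 0 and 1, its mean is the fraction of comments flagged as controversial. A minimal sketch of comparing that fraction across subreddits, assuming a toy frame (`toy_df`, hypothetical, made-up values):

```python
import pandas as pd

# Toy frame standing in for rep_df.
toy_df = pd.DataFrame(
    {
        "subreddit": ["politics", "politics", "news", "news", "news", "trump"],
        "controversiality": [0, 1, 0, 0, 0, 1],
    }
)

# Mean of a 0/1 flag per group = share of controversial comments in that group.
rate = (
    toy_df.groupby("subreddit")["controversiality"]
    .mean()
    .sort_values(ascending=False)
)
print(rate)
```

Subreddits with few comments will give noisy rates, so a minimum group size may be worth applying on the real data.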