**In this notebook we analyze a uniform random sample of a Reddit comments dataset to find problems, outliers, and patterns in the data.**

# Loading the dataset from Gdrive

In [None]:
# Download the dataset file from Google Drive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download a file based on its file ID.

file_id = '1-D_uHkn37M5ptWVQl8a5-q8NBv9jaLWr'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('dataset.bz2')

# Sampling 10k records from the dataset using uniform random sampling

In [None]:
import sys
import bz2
import random
import time
import pandas as pd

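def read_comment(afile):
    # Skip the rest of the (probably partial) line at the current offset.
    afile.readline()
    # Return the next full line, including its trailing newline.
    # readline() returns b"" at end of file, so an offset past the
    # last comment cannot loop forever.
    return afile.readline()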


num_samples = 10000
random.seed(time.time())
sampled_bytes = [random.randint(0, 500000000) for _ in range(num_samples)]
sampled_bytes.sort()
output = b""
with bz2.BZ2File("dataset.bz2", mode="rb") as input_file:
    for i in range(num_samples):
        input_file.seek(sampled_bytes[i])
        output += read_comment(input_file)
print(num_samples, 'comments are sampled.\n')

with open("sample.json", "wb") as output_file:
    output_file.write(output)
df = pd.read_json("sample.json", lines=True)
df.to_csv('sampled_data.csv', index=False)

10000 comments are sampled.
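
Seeking to random byte offsets tends to favor long comments (a longer line covers more offsets) and can return the same comment more than once. As a possible alternative, the sketch below uses reservoir sampling (Algorithm R), which reads the stream once and gives every line the same chance of being kept. The names `reservoir_sample` and `sample_comments` and the `seed` parameter are illustrative and not part of the original notebook.

```python
import bz2
import random


def reservoir_sample(lines, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for index, line in enumerate(lines):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Keep the new item with probability k / (index + 1),
            # replacing a uniformly chosen existing item.
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = line
    return reservoir


def sample_comments(path, k, seed=None):
    """Hypothetical wrapper: sample k raw JSON lines from a bz2 dump."""
    with bz2.open(path, mode="rb") as input_file:
        return reservoir_sample(input_file, k, seed=seed)
```

A call such as `sample_comments("dataset.bz2", 10000)` would return a list of raw lines without duplicates, at the cost of one full pass over the file.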



# Analyzing the dataset using a pandas DataFrame

In [None]:
red_df = pd.read_csv('sampled_data.csv')
red_df.head()

Unnamed: 0,retrieved_on,ups,author_flair_css_class,author_flair_text,gilded,controversiality,subreddit_id,edited,subreddit,parent_id,...,downs,body,distinguished,id,archived,score,author,score_hidden,link_id,name
0,1425124281,1.0,,,0,0,t5_2qnfs,0,Bushcraft,t1_cn9co47,...,0.0,"Thanks, I'll look into one of those!",,cnas935,False,1.0,naivesuperiority,False,t3_2ohma5,t1_cnas935
1,1425124281,3.0,,,0,0,t5_2qh61,0,WTF,t1_cna8b1r,...,0.0,"I'm not religious at all, and I'm not into gun...",,cnas93k,False,3.0,WorldsGreatestPoop,False,t3_2qwr9k,t1_cnas93k
2,1425124281,3.0,,,0,0,t5_2qh61,0,WTF,t1_cna8b1r,...,0.0,"I'm not religious at all, and I'm not into gun...",,cnas93k,False,3.0,WorldsGreatestPoop,False,t3_2qwr9k,t1_cnas93k
3,1425124279,2.0,i-gpcm,8350-GTX760-16GB-256SSD-HAFXB-K70RGB,0,0,t5_2sgp1,0,pcmasterrace,t1_cnarrfo,...,0.0,I hear you. Due to the lingering effects of a...,,cnas98r,False,2.0,Head_Cockswain,False,t3_2qy3j3,t1_cnas98r
4,1425124279,2.0,,,0,0,t5_2qstm,0,personalfinance,t1_cnaqt37,...,0.0,"If you are completely inexperienced, then choo...",,cnas98s,False,2.0,tccommentate,False,t3_2qyap3,t1_cnas98s


In [None]:
red_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10005 entries, 0 to 10004
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   retrieved_on            10002 non-null  object 
 1   ups                     10000 non-null  float64
 2   author_flair_css_class  2954 non-null   object 
 3   author_flair_text       2680 non-null   object 
 4   gilded                  10005 non-null  int64  
 5   controversiality        10005 non-null  object 
 6   subreddit_id            10005 non-null  object 
 7   edited                  10005 non-null  object 
 8   subreddit               10005 non-null  object 
 9   parent_id               10000 non-null  object 
 10  created_utc             10000 non-null  float64
 11  downs                   10000 non-null  float64
 12  body                    9998 non-null   object 
 13  distinguished           83 non-null     object 
 14  id                      9995 non-null 

## controversiality

In [None]:
red_df['controversiality'].value_counts()

0                  10000
twixasaurousrex        1
youcefhd               1
elaintahra             1
Relacuna               1
brim4brim              1
Name: controversiality, dtype: int64

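Conclusion:


---


All 10,000 comments with a valid entry have a controversiality score of zero, so the column carries no useful signal for this sample. The five remaining values look like usernames rather than scores, which suggests that a handful of rows were misaligned when the file was written or parsed.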
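
To inspect the entries in `controversiality` that are not scores, one option is to coerce the column to numbers and keep the rows that fail to parse. This is a minimal sketch: the helper name `non_numeric_rows` and the tiny demo frame are made up for illustration.

```python
import pandas as pd


def non_numeric_rows(df, column):
    """Return rows whose value in `column` is present but not a number."""
    parsed = pd.to_numeric(df[column], errors="coerce")
    mask = parsed.isna() & df[column].notna()
    return df[mask]


# Tiny illustrative frame (made-up values, not taken from the dataset).
demo = pd.DataFrame({
    "controversiality": ["0", "0", "some_user", "0"],
    "body": ["a", "b", "c", "d"],
})
bad_rows = non_numeric_rows(demo, "controversiality")
print(bad_rows)
```

On the notebook's data the equivalent call would be `non_numeric_rows(red_df, 'controversiality')`.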

## downvotes

In [None]:
red_df['downs'].value_counts()

0.0    10000
Name: downs, dtype: int64

Conclusion:


---
All comments have zero downvotes.


## upvotes

In [None]:
sum(red_df['ups'] < 0)

448

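Conclusion:


---
About 4.5% of the comments (448 of 10,005) have a negative `ups` value, which does not make sense for a raw count of upvotes. One possible explanation, not verified here, is that `ups` stores the net score rather than the number of upvotes; that would also be consistent with `downs` being zero for every comment.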
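
One way to probe the negative values is to compare `ups` with the `score` column. The sketch below counts how often the two agree. The helper name `compare_ups_to_score` and the demo values are illustrative, and the idea that `ups` mirrors `score` is an assumption to be checked, not a finding.

```python
import pandas as pd


def compare_ups_to_score(df):
    """Count rows where `ups` equals `score` and rows where `ups` is negative."""
    ups = pd.to_numeric(df["ups"], errors="coerce")
    score = pd.to_numeric(df["score"], errors="coerce")
    # Only compare rows where both values parsed as numbers.
    valid = ups.notna() & score.notna()
    return {
        "rows_compared": int(valid.sum()),
        "ups_equals_score": int((ups[valid] == score[valid]).sum()),
        "negative_ups": int((ups[valid] < 0).sum()),
    }


# Made-up values for illustration only.
demo = pd.DataFrame({
    "ups": [1.0, -3.0, 2.0, None],
    "score": [1.0, -3.0, 5.0, 4.0],
})
summary = compare_ups_to_score(demo)
print(summary)
```

If `ups_equals_score` matches `rows_compared` on the real sample, the negative values would simply be negative net scores.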


## authors

In [None]:
from collections import Counter

# Count comments per author and keep the ten most frequent authors.
top_users = dict(Counter(red_df['author'].tolist()).most_common(10))
top_users

{'[deleted]': 807,
 'AutoModerator': 60,
 nan: 10,
 'jonandkaylatoler': 7,
 'havoc_bot': 6,
 'ricky_king': 5,
 'Mrs_Holman_7': 5,
 'autowikibot': 5,
 'Sabrina_Cage': 5,
 'Shadow-Pie': 4}

Conclusion:


---


After listing the top users in the sample, we noted the following:

- About 8% of the comments (807 of 10,005) come from deleted users. A user may be banned or may delete their account, but their comments remain on the platform.
- Several of the most active accounts are bots, such as AutoModerator, havoc_bot, and autowikibot. AutoModerator, for example, is used for automated moderation.
- Ten entries have a missing (NaN) author.
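
The shares in the list above can be computed directly from the `author` column. The sketch below is one way to do it. The helper name `author_summary` is illustrative, and the rule for spotting bots (the name contains "bot" or is in an explicit list) is a rough heuristic assumed here, so it can miss bots or flag ordinary users.

```python
import pandas as pd


def author_summary(authors, bot_names=("AutoModerator",)):
    """Return the share of deleted, missing, and bot-like authors."""
    total = len(authors)
    names = authors.dropna().astype(str)
    # Heuristic (an assumption): a name is bot-like if it contains "bot"
    # or appears in the explicit `bot_names` list.
    contains_bot = names.str.lower().str.contains("bot", regex=False)
    bot_like = contains_bot | names.isin(bot_names)
    return {
        "deleted_share": float((names == "[deleted]").sum()) / total,
        "missing_share": float(authors.isna().sum()) / total,
        "bot_like_share": float(bot_like.sum()) / total,
    }


# Made-up author list for illustration only.
demo_authors = pd.Series([
    "[deleted]", "AutoModerator", "havoc_bot", None,
    "alice", "autowikibot", "bob", "[deleted]",
])
shares = author_summary(demo_authors)
print(shares)
```

On the notebook's data the call would be `author_summary(red_df['author'])`.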

## body

In [None]:
sum(red_df['body'] == '[deleted]')

645

About 6.4% of the comments (645 of 10,005) also have a deleted body.
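
Since both `body` and `author` can be `[deleted]`, it may help to see how the two overlap. The following sketch cross-tabulates the two flags. The helper name `deleted_overlap` and the demo rows are illustrative only.

```python
import pandas as pd


def deleted_overlap(df):
    """Cross-tabulate deleted comment bodies against deleted authors."""
    body_deleted = (df["body"] == "[deleted]").rename("body_deleted")
    author_deleted = (df["author"] == "[deleted]").rename("author_deleted")
    return pd.crosstab(body_deleted, author_deleted)


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "body": ["[deleted]", "hi", "[deleted]", "yo"],
    "author": ["[deleted]", "[deleted]", "x", "y"],
})
table = deleted_overlap(demo)
print(table)
```

The cell `table.loc[True, True]` counts comments where both the body and the author are deleted.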

## gilded

In [None]:
red_df['gilded'].value_counts()

0     9997
1        5
2        2
25       1
Name: gilded, dtype: int64

Conclusion


---


`gilded` is the number of Reddit gold awards a comment has received. Almost all comments (9,997 of 10,005) have a value of 0, so it is not a useful feature for our analysis.
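
The same check, how much of a column is taken up by a single value, applies to `gilded`, `downs`, and `controversiality`. The sketch below measures it with a hypothetical helper, `dominant_value_share`, using a series rebuilt from the `gilded` value counts shown above.

```python
import pandas as pd


def dominant_value_share(series):
    """Return the most frequent value and the fraction of rows that carry it."""
    counts = series.value_counts(dropna=False)
    return counts.index[0], counts.iloc[0] / len(series)


# Series rebuilt from the value counts printed for `gilded`.
gilded = pd.Series([0] * 9997 + [1] * 5 + [2] * 2 + [25])
value, share = dominant_value_share(gilded)
print(value, round(share, 4))
```

A share close to 1 indicates that the column is nearly constant in this sample.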