# Project 3, Reddit Classification


## Problem Statement : 
Create a classification model that will predict which subreddit the post is from.

## Executive Summary:

Many people assume that Psychology and Sociology subreddit pages are very similar to each other. Through this reddit classification we can showcase how different these 2 subreddit pages. This can be shown by how precisely we predict the subreddit category based on the post.  



In [2]:
#Importing libraries that are needed
import time
import pandas as pd
import requests
import random

In [3]:
#We are first loading the apple_url_reddit page
psy_url = 'https://www.reddit.com/r/askpsychology/.json'

In [4]:
psy_res = requests.get(psy_url, headers = {'User-agent': 'Data Inc 1.0'})

In [5]:
psy_res.status_code

200

In [6]:
psy_dict = psy_res.json()

In [7]:
psy_dict.keys()

dict_keys(['kind', 'data'])

In [8]:
psy_dict['kind']

'Listing'

In [9]:
psy_dict['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [10]:
len(psy_dict['data']['children'])

26

In [11]:
psy_dict['data']['children'][0]

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'askpsychology',
  'selftext': 'Hello again, r/askpsychology! As promised, we have updated the rules and posting requirements for this sub. They can be found on the sidebar in the redesign, but I will also be posting them here for full visibility. The updated rules are as follows:\n\n1. **Do not ask for diagnostic or analytic impressions.** It is unethical to give diagnoses or analyses over the internet. These types of questions are better asked of a therapist in\\-person. \n2. **Questions must be asked clearly in the post title.** Violations of this rule will be removed immediately. You may give additional or clarifying information in the body of your post.\n3. **Answers must be evidence\\-based.** Answers given must reflect the scientific consensus and ideally should cite sources. Anecdotal evidence or pop\\-psychology will be removed.\n4. **No leading questions.**  Examples of these questions are: "Why is Group X so st

In [12]:
psy_dict['data']['children'][0].keys()

dict_keys(['kind', 'data'])

In [13]:
psy_dict['data']['children'][0]['kind']

't3'

In [14]:
psy_dict['data']['children'][0]['data']['subreddit']

'askpsychology'

In [15]:
psy_dict['data']['children'][0]['data']['title']

'Updated rules and posting requirements'

In [16]:
psy_dict['data']['children'][0]['data']['selftext']

'Hello again, r/askpsychology! As promised, we have updated the rules and posting requirements for this sub. They can be found on the sidebar in the redesign, but I will also be posting them here for full visibility. The updated rules are as follows:\n\n1. **Do not ask for diagnostic or analytic impressions.** It is unethical to give diagnoses or analyses over the internet. These types of questions are better asked of a therapist in\\-person. \n2. **Questions must be asked clearly in the post title.** Violations of this rule will be removed immediately. You may give additional or clarifying information in the body of your post.\n3. **Answers must be evidence\\-based.** Answers given must reflect the scientific consensus and ideally should cite sources. Anecdotal evidence or pop\\-psychology will be removed.\n4. **No leading questions.**  Examples of these questions are: "Why is Group X so stupid?" or "What is wrong with Group X?" \n5. **No jokes, memes, insults, or slurs.**  This is an

In [17]:
psy_posts = [p['data'] for p in psy_dict['data']['children']]

In [18]:
pd.DataFrame(psy_posts)

Unnamed: 0,all_awardings,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,thumbnail,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,[],,,True,NawtAGoodNinja,,,[],,M.Sc. | Counseling Psychology,...,,Updated rules and posting requirements,0,27,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
1,[],,,False,PsyNimo,,,[],,,...,,Is talking to yourself healthy?,0,42,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
2,[],,,False,Apprehensive_Bowl,,,[],,,...,,How do we know that the autism spectrum is at ...,0,6,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
3,[],,,False,EggShellEmotions,,,[],,,...,,Is there a maze or puzzle that you can solve o...,0,8,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
4,[],,,False,kishahatesyou,,,[],,,...,,What are great psychology-related research top...,0,8,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
5,[],,,False,Ghostcruncher,,,[],,,...,,How does the sudden death of a parent affect a...,0,5,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
6,[],,,False,sesamechicken4evr,,,[],,,...,,What’s the difference between each SSRI?,0,2,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
7,[],,,False,IfNoUniverseWhatBe,,,[],,,...,,"I don’t need a diagnosis, I am a kind of recov...",0,1,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
8,[],,,False,earthquakest,,,[],,,...,,What factors can make a person become obsessiv...,0,12,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
9,[],,,False,potatornapart,,,[],,,...,,How do we know that our memories as we recall ...,0,3,https://www.reddit.com/r/askpsychology/comment...,[],,False,,


In [19]:
pd.DataFrame(psy_posts).to_csv('psy_posts.csv')

In [20]:
psy_dict['data']['after']

't3_bu9llu'

In [21]:
pd.DataFrame(psy_posts)['name']

0     t3_8gaeqd
1     t3_bzd2ov
2     t3_bzjj11
3     t3_bzelbn
4     t3_bzefuj
5     t3_bzgcnf
6     t3_bzgs65
7     t3_bzjc9f
8     t3_bz9fwz
9     t3_bz4h2c
10    t3_bz30mq
11    t3_byy2xl
12    t3_bys1ie
13    t3_bvq1lr
14    t3_bvi4ni
15    t3_bvnyj5
16    t3_bvauo1
17    t3_bvgotf
18    t3_bvdrfd
19    t3_bv3h7r
20    t3_bvd9pw
21    t3_but5qf
22    t3_buwvwx
23    t3_buhquo
24    t3_bu808l
25    t3_bu9llu
Name: name, dtype: object

In [22]:
psy_posts = []
after = None

for a in range(40):
    if after == None:
        current_url = psy_url
    else:
        current_url  = psy_url+'?after='+after
    print(current_url)
    res = requests.get(current_url, headers = {'User-agent': 'Data Inc 1.0'})
    
    if res.status_code != 200:
        print('Status Error', res.status_code)
        break
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    psy_posts.extend(current_posts)
    after = current_dict['data']['after']
    
    if a>0:
        prev_posts = pd.read_csv('psy.csv')
        current_df = pd.DataFrame(psy_posts)
    else:
        pd.DataFrame(psy_posts).to_csv('psy.csv',index = False)
        
    sleep_duration = random.randint(1,3)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/askpsychology/.json
1
https://www.reddit.com/r/askpsychology/.json?after=t3_bu9llu
3
https://www.reddit.com/r/askpsychology/.json?after=t3_brym7u
2
https://www.reddit.com/r/askpsychology/.json?after=t3_bq7obm
3
https://www.reddit.com/r/askpsychology/.json?after=t3_bmxsv7
1
https://www.reddit.com/r/askpsychology/.json?after=t3_b7s066
3
https://www.reddit.com/r/askpsychology/.json?after=t3_b5tp5p
2
https://www.reddit.com/r/askpsychology/.json?after=t3_b4p2ee
1
https://www.reddit.com/r/askpsychology/.json?after=t3_b2rhlq
1
https://www.reddit.com/r/askpsychology/.json?after=t3_b1iffe
1
https://www.reddit.com/r/askpsychology/.json?after=t3_b0dlqk
1
https://www.reddit.com/r/askpsychology/.json?after=t3_az1v1i
3
https://www.reddit.com/r/askpsychology/.json?after=t3_ax2vvo
1
https://www.reddit.com/r/askpsychology/.json?after=t3_avcv8e
2
https://www.reddit.com/r/askpsychology/.json?after=t3_au2sj2
3
https://www.reddit.com/r/askpsychology/.json?after=t3_as2anx
1
https://

In [23]:
(len(psy_posts))

1001

In [24]:
pd.DataFrame(psy_posts).to_csv('psy.csv', index = False)

## We now move on to attain the Sociology Reddit details

In [25]:
#We are first loading the apple_url_reddit page
soc_url = 'https://www.reddit.com/r/sociology/.json'

In [26]:
soc_res = requests.get(soc_url, headers = {'User-agent': 'Mozilla 5.0'})

In [27]:
soc_res.status_code

200

In [28]:
soc_dict = soc_res.json()

In [29]:
soc_dict.keys()

dict_keys(['kind', 'data'])

In [30]:
soc_dict['kind']

'Listing'

In [31]:
soc_dict['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [32]:
len(soc_dict['data']['children'])

25

In [33]:
soc_dict['data']['children'][0]

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'sociology',
  'selftext': 'Hello, I would like to ask for some guidance. I assume many here are familiar with research methods, and I don\'t know where else to ask this. I am currently researching which theory is more appropriate to explain the actions and policies of country A over country B. I plan to compare these theories. Is there a methodology that comes close or that is appropriate to my "method"? Something like "Content analysis"? Or maybe "Comparative research"?\n\n&amp;#x200B;\n\nAny ideas?\n\nhttps://i.redd.it/696r3vt0as331.jpg',
  'author_fullname': 't2_1120zf61',
  'saved': False,
  'mod_reason_title': None,
  'gilded': 0,
  'clicked': False,
  'title': 'Help with choosing an appropriate research methodology',
  'link_flair_richtext': [],
  'subreddit_name_prefixed': 'r/sociology',
  'hidden': False,
  'pwls': None,
  'link_flair_css_class': None,
  'downs': 0,
  'hide_score': False,
  'media_metadata': {'69

In [34]:
soc_dict['data']['children'][0].keys()

dict_keys(['kind', 'data'])

In [35]:
soc_dict['data']['children'][0]['kind']

't3'

In [36]:
soc_dict['data']['children'][0]['data']['subreddit']

'sociology'

In [37]:
soc_dict['data']['children'][0]['data']['title']

'Help with choosing an appropriate research methodology'

In [38]:
soc_dict['data']['children'][0]['data']['selftext']

'Hello, I would like to ask for some guidance. I assume many here are familiar with research methods, and I don\'t know where else to ask this. I am currently researching which theory is more appropriate to explain the actions and policies of country A over country B. I plan to compare these theories. Is there a methodology that comes close or that is appropriate to my "method"? Something like "Content analysis"? Or maybe "Comparative research"?\n\n&amp;#x200B;\n\nAny ideas?\n\nhttps://i.redd.it/696r3vt0as331.jpg'

In [39]:
soc_posts = [p['data'] for p in psy_dict['data']['children']]

In [40]:
pd.DataFrame(soc_posts)

Unnamed: 0,all_awardings,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,thumbnail,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,[],,,True,NawtAGoodNinja,,,[],,M.Sc. | Counseling Psychology,...,,Updated rules and posting requirements,0,27,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
1,[],,,False,PsyNimo,,,[],,,...,,Is talking to yourself healthy?,0,42,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
2,[],,,False,Apprehensive_Bowl,,,[],,,...,,How do we know that the autism spectrum is at ...,0,6,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
3,[],,,False,EggShellEmotions,,,[],,,...,,Is there a maze or puzzle that you can solve o...,0,8,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
4,[],,,False,kishahatesyou,,,[],,,...,,What are great psychology-related research top...,0,8,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
5,[],,,False,Ghostcruncher,,,[],,,...,,How does the sudden death of a parent affect a...,0,5,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
6,[],,,False,sesamechicken4evr,,,[],,,...,,What’s the difference between each SSRI?,0,2,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
7,[],,,False,IfNoUniverseWhatBe,,,[],,,...,,"I don’t need a diagnosis, I am a kind of recov...",0,1,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
8,[],,,False,earthquakest,,,[],,,...,,What factors can make a person become obsessiv...,0,12,https://www.reddit.com/r/askpsychology/comment...,[],,False,,
9,[],,,False,potatornapart,,,[],,,...,,How do we know that our memories as we recall ...,0,3,https://www.reddit.com/r/askpsychology/comment...,[],,False,,


In [41]:
pd.DataFrame(soc_posts).to_csv('soc_posts.csv')

In [42]:
soc_dict['data']['after']

't3_btbib0'

In [43]:
pd.DataFrame(soc_posts)['name']

0     t3_8gaeqd
1     t3_bzd2ov
2     t3_bzjj11
3     t3_bzelbn
4     t3_bzefuj
5     t3_bzgcnf
6     t3_bzgs65
7     t3_bzjc9f
8     t3_bz9fwz
9     t3_bz4h2c
10    t3_bz30mq
11    t3_byy2xl
12    t3_bys1ie
13    t3_bvq1lr
14    t3_bvi4ni
15    t3_bvnyj5
16    t3_bvauo1
17    t3_bvgotf
18    t3_bvdrfd
19    t3_bv3h7r
20    t3_bvd9pw
21    t3_but5qf
22    t3_buwvwx
23    t3_buhquo
24    t3_bu808l
25    t3_bu9llu
Name: name, dtype: object

In [44]:
soc_posts = []
after = None

for a in range(40):
    if after == None:
        current_url = soc_url
    else:
        current_url  = soc_url+'?after='+after
    print(current_url)
    res = requests.get(current_url, headers = {'User-agent': 'Data Inc 1.0'})
    
    if res.status_code != 200:
        print('Status Error', res.status_code)
        break
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    soc_posts.extend(current_posts)
    after = current_dict['data']['after']
    
    if a>0:
        prev_posts = pd.read_csv('soc.csv')
        current_df = pd.DataFrame(soc_posts)
    else:
        pd.DataFrame(soc_posts).to_csv('soc.csv',index = False)
        
    sleep_duration = random.randint(1,3)
    print(sleep_duration)
    time.sleep(sleep_duration)
    


https://www.reddit.com/r/sociology/.json
2
https://www.reddit.com/r/sociology/.json?after=t3_btbib0
3
https://www.reddit.com/r/sociology/.json?after=t3_bnb8n7
2
https://www.reddit.com/r/sociology/.json?after=t3_bk8zqu
1
https://www.reddit.com/r/sociology/.json?after=t3_bgpo99
3
https://www.reddit.com/r/sociology/.json?after=t3_bbgl1b
1
https://www.reddit.com/r/sociology/.json?after=t3_b8igvs
1
https://www.reddit.com/r/sociology/.json?after=t3_b4757z
3
https://www.reddit.com/r/sociology/.json?after=t3_b1fk6s
1
https://www.reddit.com/r/sociology/.json?after=t3_axrtfe
3
https://www.reddit.com/r/sociology/.json?after=t3_atxdn8
1
https://www.reddit.com/r/sociology/.json?after=t3_aqtwlp
2
https://www.reddit.com/r/sociology/.json?after=t3_ao6fsj
1
https://www.reddit.com/r/sociology/.json?after=t3_akd970
3
https://www.reddit.com/r/sociology/.json?after=t3_afvxm7
3
https://www.reddit.com/r/sociology/.json?after=t3_aalgud
1
https://www.reddit.com/r/sociology/.json?after=t3_a6r0f3
1
https://www.r

In [45]:
(len(psy_posts))

1001

In [46]:
pd.DataFrame(soc_posts).to_csv('soc.csv', index = False)

In [47]:
#Check if there are any duplicates in the soc_posts
len(set([p['name'] for p in soc_posts]))
#There are no duplicates as the length of soc_posts list and the set are the same

988

In [48]:
#Check if there are any duplicates in the psy_posts
len(set([p['name'] for p in psy_posts]))
#There are no duplicates as the length of psy_posts list and the set are the same

1001

In [49]:
soci = pd.read_csv('soc.csv')

In [50]:
soci.isnull().sum().sort_values().tail(45)

hide_score                         0
id                                 0
archived                           0
created_utc                        0
author_fullname                   36
author_flair_type                 36
author_patreon_flair              36
author_flair_richtext             36
selftext_html                    310
selftext                         310
author_flair_text_color          952
secure_media                     980
media                            980
author_cakeday                   982
crosspost_parent_list            985
crosspost_parent                 985
media_metadata                   986
author_flair_background_color    988
author_flair_template_id         988
thumbnail                        988
approved_by                      988
view_count                       988
author_flair_css_class           988
approved_at_utc                  988
author_flair_text                988
suggested_sort                   988
category                         988
b

In [51]:
psychi = pd.read_csv('psy.csv')
psychi.isnull().sum().sort_values().tail(45)

gildings                            0
is_crosspostable                    0
id                                  0
hide_score                          0
gilded                              0
hidden                              0
author_fullname                    36
author_flair_type                  36
author_flair_richtext              36
author_patreon_flair               36
selftext                          176
selftext_html                     176
author_flair_text_color           962
author_cakeday                    997
distinguished                     998
author_flair_text                 998
link_flair_background_color      1000
link_flair_text                  1000
link_flair_template_id           1000
approved_by                      1001
author_flair_template_id         1001
author_flair_background_color    1001
approved_at_utc                  1001
suggested_sort                   1001
thumbnail                        1001
view_count                       1001
author_flair