### Setting up config for calls to the Reddit API
Reddit offers an API that can be used to access the content on the site.
The following cells set up the configuration needed to access the API and then make a test call.

In [28]:
import requests
import json
import praw

In [4]:
with open('config.json', 'r') as f:
    config = json.load(f)
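Before the credentials are used, it can help to confirm that the config file holds every key the later cells read. The sketch below is only an illustration: the key names are the ones this notebook reads, while `REQUIRED_KEYS`, `missing_config_keys` and the placeholder values are hypothetical.

```python
import json

# Keys that the cells in this notebook read from config.json.
REQUIRED_KEYS = ("app_client_id", "app_secret", "username", "pwd")


def missing_config_keys(config):
    """Return the required keys that are absent or empty in config."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Placeholder values only; real credentials belong in config.json.
example_config = json.loads(
    '{"app_client_id": "my-client-id", "app_secret": "my-secret", "username": "my-user"}'
)
print(missing_config_keys(example_config))  # ['pwd']
```

An empty list means the config is complete for the calls made here.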

In [6]:
app_auth = requests.auth.HTTPBasicAuth(config['app_client_id'], config['app_secret'])
post_data = {"grant_type": "password", "username": config['username'], "password": config['pwd']}
headers = {"User-Agent": "ChangeMeClient/0.1 by "+ config['username']}

In [134]:
response = requests.post("https://www.reddit.com/api/v1/access_token", auth=app_auth, data=post_data, headers=headers)

auth_dict = response.json()

In [49]:
auth_dict

{'access_token': '424623723-SVqCItj-nok1K_Nj2DBud7sUwt0',
 'expires_in': 3600,
 'scope': '*',
 'token_type': 'bearer'}
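The token response is only usable if it actually contains an `access_token`. A minimal sketch of a guard around that step follows; the helper name `bearer_headers` is hypothetical, and the assumption that a failed request returns a JSON body without `access_token` (for example an `error` field) should be checked against the Reddit API documentation.

```python
def bearer_headers(auth_dict, user_agent):
    """Build OAuth request headers, or raise if no token was returned."""
    if "access_token" not in auth_dict:
        # Assumption: a failed token request yields a body such as {"error": "..."}.
        raise ValueError(f"Token request failed: {auth_dict}")
    return {
        "Authorization": "bearer " + auth_dict["access_token"],
        "User-Agent": user_agent,
    }


# Placeholder token and user agent, not real credentials.
print(bearer_headers({"access_token": "abc123"}, "ChangeMeClient/0.1 by my-user"))
```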

In [50]:
headers = {"Authorization": "bearer " + auth_dict['access_token'], "User-Agent": "ChangeMeClient/0.1 by " + config['username']}

In [61]:
api_call = requests.get("https://oauth.reddit.com/r/India/about", headers=headers)

In [62]:
api_call.json()

{'data': {'accounts_active': 2866,
  'accounts_active_is_fuzzed': False,
  'active_user_count': 2866,
  'advertiser_category': 'Local',
  'all_original_content': False,
  'allow_chat_post_creation': False,
  'allow_discovery': True,
  'allow_images': True,
  'allow_polls': True,
  'allow_videogifs': True,
  'allow_videos': True,
  'banner_background_color': '',
  'banner_background_image': 'https://styles.redditmedia.com/t5_2qh1q/styles/bannerBackgroundImage_in37li1rxvx11.jpg',
  'banner_img': '',
  'banner_size': None,
  'can_assign_link_flair': True,
  'can_assign_user_flair': True,
  'collapse_deleted_comments': True,
  'comment_score_hide_mins': 60,
  'community_icon': 'https://styles.redditmedia.com/t5_2qh1q/styles/communityIcon_iv9fbruq3ge31.png',
  'created': 1201263626.0,
  'created_utc': 1201234826.0,
  'description': '### We are looking for additional moderators. If you believe you can help, apply [via modmail](https://www.reddit.com/message/compose?to=%2Fr%2Findia)\n\n###[r/

### Trying out the PRAW API Library
While researching how to access different parts of the site through the Reddit API, I came across PRAW (Python Reddit API Wrapper), which offers simple and intuitive methods for the same tasks. Since it is a popular package that is used and recommended on Reddit forums, I decided to go ahead with it instead of working with the base API URIs directly.

In [63]:
reddit = praw.Reddit(client_id=config['app_client_id'],
                     client_secret=config['app_secret'],
                     user_agent="ChangeMeClient/0.1 by " + config['username'])

In [83]:
for submission in reddit.subreddit('India').new(limit=2):
    print(submission.title)
    # Flatten the comment tree into a single list
    comments = submission.comments.list()
    print(submission.link_flair_text)
    print(submission.id)

One life for another, very weird statistics btw
Coronavirus
g1vrg5
Given the negative growth in employment and consumption in the rural economy, the 2020 Union Budget seems like a cruel joke on the plight of the poor, in general, and women, in particular.
Policy/Economy
g1vnmq


In [43]:
submission.selftext

"###[Covid-19 Fundraisers & Donation Links](https://amnesty.org.in/support-indias-most-vulnerable-fight-covid-19-a-list-of-fundraisers-you-can-donate-to/) via Amnesty International\n* [This link covers](https://amnesty.org.in/support-indias-most-vulnerable-fight-covid-19-a-list-of-fundraisers-you-can-donate-to/) Migrant Workers Day-Labourers, Other Vulnerable Groups, Urban Poor, Transgender Community, Waste-pickers and Sanitation Workers, Healthcare Workers and Doctors, Older Persons & Children and Animal Care \n\n------------------------------------------------------------------------------------------------------\n\n#####Indian Goverment\n* [Official Twitter Collection of Indian Govt. Communications](https://twitter.com/i/events/1240662046280048646)\n* [State and District Wise Details of Cases in India](https://www.mohfw.gov.in/pdf/DistrictWiseList324.pdf)\n* All India Helplines: 1075 (Toll Free) | 1930 (Toll Free) | 1944 (Northeast India Only) | +911123978046 | Email ID: ncov2019@go

In [42]:
comments[0].body

'###[Covid-19 Fundraisers & Donation Links](https://amnesty.org.in/support-indias-most-vulnerable-fight-covid-19-a-list-of-fundraisers-you-can-donate-to/) via Amnesty International\n* [This link covers](https://amnesty.org.in/support-indias-most-vulnerable-fight-covid-19-a-list-of-fundraisers-you-can-donate-to/) Migrant Workers Day-Labourers, Other Vulnerable Groups, Urban Poor, Transgender Community, Waste-pickers and Sanitation Workers, Healthcare Workers and Doctors, Older Persons & Children and Animal Care \n\n------------------------------------------------------------------------------------------------------\n\n**I am looking for volunteers** who are willing to create and update their state level threads for the time of lockdown. Updates will mostly consist of latest news in those states with respect to Coronavirus and ongoing lockdown.\n\n#####🔴 Require Volunteers for\nJammu & Kashmir and Ladakh | Sikkim | Manipur | Mizoram | Assam | Meghalaya | Tripura | Arunachal Pradesh | Jh

## Premise behind data collection
The data collection is based on two ideas:
1. The flair of a submission is most likely tied to its text and the topic it covers. For this reason, it is important to collect the title, body, and a few comments so that there is enough text for both the analysis and the classifier.
2. Other factors may be less intuitively related to the flair, but there may still be a pattern linking the various metrics and tags to a particular flair.
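One way to act on the first idea is to merge the title, body, and first few comments of a post into a single text field. The function below is a sketch under that assumption; `combine_text` and its `max_comments` parameter are hypothetical names, not part of PRAW or of this notebook.

```python
def combine_text(title, body, comments, max_comments=5):
    """Join a post's title, body, and first few comments into one string."""
    parts = [title, body, *comments[:max_comments]]
    # Link posts have an empty body, so drop empty or whitespace-only pieces.
    return "\n".join(part.strip() for part in parts if part and part.strip())


print(combine_text("Storm in Assam", "", ["Is this true?", "  Stay safe  "]))
```

The result keeps the title first, so a classifier that weights early tokens sees the most descriptive text before the comments.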

### Retrieving and storing data from India subreddit
Having worked out how to retrieve different parts of the data from a subreddit with the PRAW library, the next step is to gather some test data for both the analysis and the classifier.
This initial approach focuses primarily on the title, the first five comments, and the flair.

In [64]:
import pandas as pd

In [89]:
reddit_india_data = pd.DataFrame()

In [90]:
seen_ids = {}

In [91]:
for submission in reddit.subreddit('India').hot(limit=200):
    # Skip posts that have already been collected
    if submission.id not in seen_ids:
        seen_ids[submission.id] = True
        curr_data = pd.Series({'title': submission.title, 'title_text': submission.selftext,
                               'comments': [comment.body for comment in submission.comments.list()[:5]],
                               'flair': submission.link_flair_text,
                               'url': submission.url, 'id': submission.id})
        # Note: DataFrame.append was removed in pandas 2.0; pd.concat is the replacement there
        reddit_india_data = reddit_india_data.append(curr_data, ignore_index=True)

In [93]:
for submission in reddit.subreddit('India').top(limit=200):
    # Skip posts already collected from the hot listing
    if submission.id not in seen_ids:
        seen_ids[submission.id] = True
        curr_data = pd.Series({'title': submission.title, 'title_text': submission.selftext,
                               'comments': [comment.body for comment in submission.comments.list()[:5]],
                               'flair': submission.link_flair_text,
                               'url': submission.url, 'id': submission.id})
        # Note: DataFrame.append was removed in pandas 2.0; pd.concat is the replacement there
        reddit_india_data = reddit_india_data.append(curr_data, ignore_index=True)
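On recent pandas versions, where `DataFrame.append` is no longer available, the same collection pattern can be written by gathering rows in a list and building the frame once. This is a sketch rather than the code that produced the output in this notebook; the `posts` list is a hypothetical stand-in for the fields read from each PRAW submission.

```python
import pandas as pd

# Hypothetical stand-ins for the fields read from each submission.
posts = [
    {"id": "a1", "title": "First post", "flair": "Politics"},
    {"id": "b2", "title": "Second post", "flair": "Non-Political"},
    {"id": "a1", "title": "First post", "flair": "Politics"},  # duplicate ID
]

seen = set()
rows = []
for post in posts:
    if post["id"] in seen:
        continue  # already collected from an earlier listing
    seen.add(post["id"])
    rows.append(post)

# Build the DataFrame once instead of appending row by row.
frame = pd.DataFrame(rows)
print(frame.shape)  # (2, 3)
```

Building the frame in one step also avoids copying the whole table on every iteration.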

The top 200 and hot 200 listings share 12 posts: going through both produced only 388 unique records instead of 400.
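The overlap follows from simple set arithmetic: unique records = hot + top - shared, so 200 + 200 - 12 = 388. The toy IDs below are hypothetical and only illustrate the relationship.

```python
def overlap_count(first_ids, second_ids):
    """Count the IDs that appear in both listings."""
    return len(set(first_ids) & set(second_ids))


# Toy IDs standing in for the hot and top listings.
hot_ids = ["a", "b", "c", "d"]
top_ids = ["c", "d", "e"]

shared = overlap_count(hot_ids, top_ids)
unique_total = len(set(hot_ids) | set(top_ids))
print(shared, unique_total)  # 2 5

# The same identity applied to the counts reported in this notebook.
print(200 + 200 - 388)  # 12
```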

In [94]:
reddit_india_data

Unnamed: 0,comments,flair,id,title,title_text,url
0,[###[Covid-19 Fundraisers & Donation Links](ht...,Coronavirus,fqqdsg,Coronavirus (COVID-19) Megathread - News and U...,###[Covid-19 Fundraisers & Donation Links](htt...,https://www.reddit.com/r/india/comments/fqqdsg...
1,[I do t think we are intentionally under repor...,Politics,g17fnr,Is India underreporting the coronavirus outbre...,,https://www.youtube.com/watch?v=JIhNKZOHJ74
2,"[reminded me of that doggo meme, Is this true?...",Non-Political,g1owee,"Storm in Assam, Gauhati Today",,https://v.redd.it/uyyrvk4cdys41
3,[This is very strange since all other porn sub...,Non-Political,g1ll83,/r/IndiansGoneWild is apparently blocked in In...,Opening /r/IndiansGoneWild (NSFW) today gives ...,https://www.reddit.com/r/india/comments/g1ll83...
4,"[WTF how much rape goes on in Delhi?!?!, Repor...",Non-Political,g1jwxm,83% drop in rape cases in Delhi during lockdown,,https://www.thehindu.com/news/cities/Delhi/83-...
...,...,...,...,...,...,...
383,"[No matter what crappy news comes elsewhere, i...",Science/Technology,ctyr9r,ISRO releases photo of moon taken by Chandraya...,,https://i.redd.it/zavrpg8hn0i31.png
384,"[Ideas are bulletproof, ""There are many forms ...",Politics,dc6cqr,An ode to Gandhi ( from twitter ),,https://i.redd.it/mai3yiobm2q31.jpg
385,"[Hey OP, you can't post a picture of my father...",Politics,bryt3s,Every home today morning,,https://imgur.com/XHyphiN
386,[News Article:\n\n> https://www.indiatoday.in/...,[R]eddiquette,bepj6a,Need all the help possible.,,https://i.redd.it/t4465qnpo2t21.jpg


In [96]:
reddit_india_data.to_excel('reddit_india_data.xlsx', index=False)

### Gathering additional columns for EDA
The data gathered so far consists primarily of the text associated with each submission in the subreddit.
However, various other metrics and data points are associated with each submission and can help with the analysis.
That data is gathered here and then joined with the original data on the submission 'id' to make a combined dataset.
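A join on 'id' matches rows by key rather than by position, so the two tables do not need to be in the same order. The sketch below uses two small hypothetical frames and `DataFrame.merge` with `validate="one_to_one"`, which raises an error if an ID is duplicated on either side; it illustrates the idea and is not the call used in this notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the text data and the metric data.
text_data = pd.DataFrame({"id": ["a1", "b2"], "title": ["First", "Second"]})
metric_data = pd.DataFrame({"id": ["b2", "a1"], "n_comm": [7, 3]})

# Left join keeps every text row and attaches the matching metrics.
combined = text_data.merge(metric_data, on="id", how="left", validate="one_to_one")
print(combined)
```

Each title ends up next to its own comment count even though the metric rows were listed in the opposite order.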

In [122]:
additional_data = pd.DataFrame()

for r_id in seen_ids:
    # Re-fetch each submission by ID to read its engagement metrics
    submission = reddit.submission(id=r_id)
    curr_data = pd.Series({'up_count': submission.ups, 'down_count': submission.downs,
                           'domain': submission.domain, 'category': submission.category,
                           'is_orig': submission.is_original_content, 'n_comm': submission.num_comments,
                           'upvote_rat': submission.upvote_ratio, 'views': submission.view_count,
                           'total_awards': submission.total_awards_received, 'id': submission.id})

    additional_data = additional_data.append(curr_data, ignore_index=True)

In [135]:
# Columns gathered: ups, downs, domain, category, is_original_content, number of comments, upvote ratio, view count, total awards received

In [123]:
additional_data.head()

Unnamed: 0,category,domain,down_count,id,is_orig,n_comm,total_awards,up_count,upvote_rat,views
0,,self.india,0.0,fqqdsg,0.0,10949.0,2.0,416.0,0.97,
1,,youtube.com,0.0,g17fnr,0.0,136.0,0.0,207.0,0.79,
2,,v.redd.it,0.0,g1owee,0.0,108.0,0.0,1950.0,0.99,
3,,self.india,0.0,g1ll83,0.0,483.0,0.0,1119.0,0.96,
4,,thehindu.com,0.0,g1jwxm,0.0,119.0,0.0,1081.0,0.97,


In [124]:
additional_data.shape

(388, 10)

In [129]:
complete_data = reddit_india_data.join(additional_data.set_index('id'), on='id')

In [130]:
complete_data.head()

Unnamed: 0,comments,flair,id,title,title_text,url,category,domain,down_count,is_orig,n_comm,total_awards,up_count,upvote_rat,views
0,[###[Covid-19 Fundraisers & Donation Links](ht...,Coronavirus,fqqdsg,Coronavirus (COVID-19) Megathread - News and U...,###[Covid-19 Fundraisers & Donation Links](htt...,https://www.reddit.com/r/india/comments/fqqdsg...,,self.india,0.0,0.0,10949.0,2.0,416.0,0.97,
1,[I do t think we are intentionally under repor...,Politics,g17fnr,Is India underreporting the coronavirus outbre...,,https://www.youtube.com/watch?v=JIhNKZOHJ74,,youtube.com,0.0,0.0,136.0,0.0,207.0,0.79,
2,"[reminded me of that doggo meme, Is this true?...",Non-Political,g1owee,"Storm in Assam, Gauhati Today",,https://v.redd.it/uyyrvk4cdys41,,v.redd.it,0.0,0.0,108.0,0.0,1950.0,0.99,
3,[This is very strange since all other porn sub...,Non-Political,g1ll83,/r/IndiansGoneWild is apparently blocked in In...,Opening /r/IndiansGoneWild (NSFW) today gives ...,https://www.reddit.com/r/india/comments/g1ll83...,,self.india,0.0,0.0,483.0,0.0,1119.0,0.96,
4,"[WTF how much rape goes on in Delhi?!?!, Repor...",Non-Political,g1jwxm,83% drop in rape cases in Delhi during lockdown,,https://www.thehindu.com/news/cities/Delhi/83-...,,thehindu.com,0.0,0.0,119.0,0.0,1081.0,0.97,


In [131]:
complete_data.shape

(388, 15)

In [133]:
complete_data.to_excel('compl_reddit_india_data.xlsx', index=False)