# Script to scrape and collect data.


We use the Python Reddit API Wrapper (PRAW) to collect the data.<br>
Documentation: https://praw.readthedocs.io/en/latest/ <br>
We collect up to 200 posts for each flair.*<br>

The flairs we classify are those in use on r/india as of 7th April 2020: <br>
1) Scheduled <br>
2) Politics <br>
3) Photography <br>
4) AskIndia <br>
5) Sports <br>
6) Non-Political <br>
7) Science/Technology <br>
8) Food <br>
9) Business/Finance <br>
10) Coronavirus<br>
11) CAA-NRC-NPR <br>
*Flairs may be subject to change.

In [1]:
import praw
import pandas as pd
import datetime as dt
import nltk
import re
from tqdm import tqdm_notebook

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from praw.models import MoreComments

For each flair, up to 200 posts are retrieved along with the necessary attributes.
For each post we extract the top-level comments.
We restrict ourselves to top-level comments so that the text stays relevant to the post.
The dataset is then written to a CSV file.
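
One caveat about the search call used in the next cell: a bare flair name passed to `search` is treated as an ordinary keyword query, so it can also match posts that merely mention the word (for example, "Scheduled" matching posts about Scheduled Castes). To my understanding, Reddit's search syntax offers a `flair:` operator that restricts results to a given flair. The sketch below only builds such a query string; `flair_query` is a hypothetical helper, not part of PRAW, and its result would be passed to `india.search(...)` in place of the bare flair name.

```python
def flair_query(flair):
    """Build a search query restricted to one flair (hypothetical helper)."""
    # Quote the name so that flairs containing '/' or '-' stay a single term.
    return 'flair:"{}"'.format(flair)


# Example: queries for a few of the flairs listed above.
queries = [flair_query(f) for f in ["Politics", "Science/Technology", "CAA-NRC-NPR"]]
print(queries)
```

Whether the restricted search returns enough posts per flair would need to be checked against the live subreddit.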

In [2]:
flairs = ["Scheduled", "Politics", "Photography", "AskIndia", "Sports",
        "Non-Political", "Science/Technology", "Food", "Business/Finance",
        "Coronavirus", "CAA-NRC-NPR"]


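# Credentials are read from environment variables instead of being hard-coded.
# The variable names below are assumptions; set them to match your own setup.
import os
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='Kevin Stephen')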
india = reddit.subreddit('india')

results = []

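for flair in flairs:
    # Keyword search for the flair name, capped at 200 posts per flair.
    posts = india.search(flair, limit=200)

    for post in posts:
        comment = ''
        # Fetch only a few of the best-ranked comments for each post.
        post.comment_sort = 'best'
        post.comment_limit = 3
        for top_level_comment in post.comments:
            # Skip "load more comments" placeholders.
            if isinstance(top_level_comment, MoreComments):
                continue
            # Concatenate the top-level comment bodies into one string.
            comment += top_level_comment.body
        results.append([post.title, post.score, post.id, post.url, post.num_comments, post.selftext, post.created, comment, flair])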

results = pd.DataFrame(results, columns=['title', 'score', 'id', 'url', 'num_comments', 'body', 'created', 'Comment', 'Flair'])
print(results)
results.to_csv('reddit_india_test.csv')


                                                  title  score      id  \
0     Untouchability, even in quarantine. 'We have n...     58  fzvwz8   
1     Delhi Govt Sources: Names of CM Arvind Kejriwa...    304  f7ogd8   
2     Delhi: AP Singh, advocate of 2012 Delhi gang-r...     16  flgvah   
3     No 100% quota for tribal teachers in schools l...     18  g698qu   
4     Why the Supreme Court’s verdict on SC/ST quota...    105  f1o839   
...                                                 ...    ...     ...   
2102  The state if suppression and labeling those in...      2  evi93u   
2103  Why can't we protest for relections? They brok...     16  efrzjb   
2104   Something to reflect on during ongoing protests.     30  el6n9o   
2105  Unofficial termination letter by Students of J...     94  ef0n1x   
2106  Citizenship Amendment Act(CAB/CAA), National R...     14  ec4rbh   

                                                    url  num_comments  \
0     https://www.telegraphindia.com/i

In [3]:
df_india = pd.read_csv('reddit_india_test.csv')
df_india['Comment'].head(20)

0     Let them feel hungry for a couple of days, max...
1     This is beyond petty.> The inclusion of a Delh...
2     My hunch is , this guy is trying to expose the...
3     When SC has to point out that 100% quota in no...
4     Muslims and reservation are two distractions u...
5     Bachega India, tabhi toh Padhega India.Gand ma...
6     Oh boy. Chalo bhaisahab. Sabji ka dukaan main ...
7     AFAIK, U.S. still hasn't banned entries of Ind...
8     But we are funding Sanskrit Universities with ...
9     Someone give me a ELI5? What's this drug used ...
10    Yes I think you are legally eligible to claim ...
11    Bc students in 12 and 10 are not immune to the...
12                                                  NaN
13    Hey man just sleep already, maintain a sleep h...
14    Unless you are in a dire need of job, skip the...
15    Our society is going back to the rotten eras o...
16    The single biggest minus point about Indian so...
17    Bangladesh is literally the only friendly 

In [4]:
import string 
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# Text cleaning


Before creating any features from the raw text, we must clean it
so that no distortions are introduced into the model. The following steps are applied:

Special character cleaning: special characters such as “\n” and double quotes
must be removed from the text, since we do not expect them to carry any
predictive power.

Upcase/downcase: we would expect, for example, “Book” and “book” to be
the same word and have the same predictive power. For that reason we
downcase every word.

Punctuation signs: characters such as “?”, “!” and “;” are removed.

Possessive forms: in addition, we would expect “Trump” and
“Trump’s” to have the same predictive power.

Stemming or Lemmatization: stemming is the process of reducing
derived words to their root. Lemmatization is the process of reducing a word
to its lemma. The main difference between the two methods is that
lemmatization produces existing words, whereas stemming produces the
root, which may not be an existing word. We use a lemmatizer based
on WordNet. I prefer lemmatization over stemming because it is more accurate: e.g. “chance” may become “chanc” under stemming but remains “chance” under lemmatization.

Stop words: words such as “what” or “the” won’t have any predictive power
since they will presumably be common to all the documents. For this reason,
they may represent noise that can be eliminated. We have downloaded a list
of English stop words from the nltk package and then deleted them from the
corpus.
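
The steps above can be traced on a single sentence. The sketch below is a simplified, standard-library-only stand-in rather than the notebook's actual pipeline: it assumes a tiny illustrative stop-word list instead of NLTK's, splits on whitespace instead of using `word_tokenize`, and omits lemmatization. The name `clean` and the sample sentence are hypothetical.

```python
import string

# Tiny illustrative stop-word list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "is", "a", "what", "of", "on"}


def clean(text):
    """Apply the cleaning steps in order and return a list of tokens."""
    text = text.replace("\n", " ")   # special characters
    text = text.lower()              # downcase
    # Punctuation signs, including quotes and the apostrophe of possessives.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = text.split()            # whitespace split as a stand-in tokenizer
    return [t for t in tokens if t not in STOP_WORDS]   # stop words


print(clean('What is the "Book" on\nTrump\'s desk?'))
# → ['book', 'trumps', 'desk']
```

Note that removing punctuation turns “Trump’s” into “trumps” rather than “trump”; in this sketch nothing reduces it further, since the lemmatization step is left out.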

In [5]:
def ret_string(text):
    return str(text)

def remove_punc(text):
    text_nopunc = "".join([char for char in text if char not in string.punctuation])
    return text_nopunc

def tokenize(text):
    tokens = word_tokenize(text)
    tokens = [w.lower() for w in tokens]
    return tokens

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    text = [word for word in text if word not in stop_words]
    return text

def lemmatizer(text):
    wn = nltk.WordNetLemmatizer()
    text = [wn.lemmatize(word) for word in text]
    return text
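
One side effect of `ret_string` is worth noting: pandas reads an empty CSV cell as the float NaN (as in row 12 of the `Comment` output above), and `str` turns that into the literal string `'nan'`, which would then pass through the remaining steps as an ordinary token. A minimal sketch of one possible alternative follows; `to_text` is a hypothetical name that does not appear in the original notebook.

```python
import math


def to_text(value):
    """Return '' for missing values (None or NaN), otherwise str(value)."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value)


print(repr(str(float("nan"))))       # what a plain str() call yields for a missing cell
print(repr(to_text(float("nan"))))   # missing cell mapped to an empty string
print(repr(to_text("Some comment"))) # ordinary text is returned unchanged
```

An empty string tokenizes to an empty list, so posts without comments would contribute nothing to the combined feature.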

In [6]:
# Apply the same cleaning steps, in order, to every text column.
for column in ['title', 'Comment', 'body', 'url']:
    df_india[column] = (df_india[column]
                        .apply(ret_string)
                        .apply(remove_punc)
                        .apply(tokenize)
                        .apply(remove_stopwords)
                        .apply(lemmatizer))


# Create new feature
We find empirically that the best results are obtained by combining the title and the top-level comments. <br>
I have chosen to leave out the body as a feature because its length is highly variable and it may contain irrelevant terms, as these posts are penned by thousands of different people across the country. <br> 

In [7]:
df_india['new_feature'] = df_india['title'] + df_india['Comment']
df_india['new_feature'].head(20)

0     [untouchability, even, quarantine, never, take...
1     [delhi, govt, source, name, cm, arvind, kejriw...
2     [delhi, ap, singh, advocate, 2012, delhi, gang...
3     [100, quota, tribal, teacher, school, located,...
4     [supreme, court, ’, verdict, scst, quota, crea...
5     [entrance, exam, scheduled, may, bachega, indi...
6     [advisory, scheduled, international, commercia...
7     [roommate, india, he, indian, american, citize...
8     [ministry, score, 100, fund, utilisation, sche...
9     [hydroxychloroquine, schedule, h1, drug, sold,...
10    [askindia, indigo, cancelled, flight, le, 8, h...
11    [maharashtra, govt, school, urban, area, mahar...
12    [nirbhaya, case, sc, quashes, death, row, conv...
13    [let, fix, sleep, schedule, hey, man, sleep, a...
14    [hr, people, people, ladder, applied, role, tr...
15    [massive, mob, storm, scheduled, caste, colony...
16    [india, highly, questionable, work, culture, b...
17    [source, bangladesh, foreign, minister, ak

# Store to CSV

In [8]:
df_india.to_csv('clean_reddit_india.csv')
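
Because the cleaned columns hold Python lists, `to_csv` writes each cell as the list's string form (for example `"['delhi', 'govt']"`), so reading the file back yields strings rather than lists. A minimal sketch of recovering the tokens with the standard library's `ast.literal_eval`; `parse_tokens` is a hypothetical helper and the sample tokens are illustrative.

```python
import ast


def parse_tokens(cell):
    """Turn the string form of a token list back into a list of strings."""
    tokens = ast.literal_eval(cell)   # safely evaluates Python literals only
    if not isinstance(tokens, list):
        raise ValueError("expected a list literal, got: {!r}".format(cell))
    return tokens


stored = str(["delhi", "govt", "source"])   # how a list cell appears in the CSV
restored = parse_tokens(stored)
print(restored)
```

With pandas, such a function could be supplied through the `converters` argument of `read_csv`, e.g. `converters={'new_feature': parse_tokens}`.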