### Scraping the Reddit data 
![alt text](reddit_pranked.png) 

Initial Steps
1) Create a Reddit API app - https://www.reddit.com/prefs/apps
2) Make a note of the app's client ID and secret
3) Install PRAW (Python Reddit API Wrapper), a Python wrapper around the Reddit API
   Navigate - https://praw.readthedocs.io/en/stable/

With these in place, we are good to go.

In [96]:
import praw # Wrapper
import pandas
import matplotlib
import nltk
from pprint import PrettyPrinter
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [39]:
nltk.download('vader_lexicon',quiet = True)

True

In [3]:
user = "anonymous"  # user agent string sent with every request

reddit = praw.Reddit(
    client_id="",      # the app's client ID
    client_secret="",  # the app's secret
    user_agent=user,
)
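
The client ID and secret are left blank above so they do not end up in the notebook. One way to supply them without hard-coding is to read them from environment variables. The sketch below is only an illustration: the helper `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT` are hypothetical and not part of PRAW.

```python
import os


def load_reddit_credentials(env=None):
    """Build keyword arguments for praw.Reddit from environment variables.

    The variable names used here are hypothetical; choose any names you like.
    """
    env = os.environ if env is None else env
    required = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
    }
    # Fail early with a readable message if a credential is not set.
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    creds = {key: env[var] for key, var in required.items()}
    # Fall back to the same user agent string used in this notebook.
    creds["user_agent"] = env.get("REDDIT_USER_AGENT", "anonymous")
    return creds


# Usage (needs praw installed and real credentials in the environment):
# reddit = praw.Reddit(**load_reddit_credentials())
```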

In [31]:
getheadlines = set() # To avoid duplicates
for content in reddit.subreddit('conspiracy').hot(limit=None):
    
    getheadlines.add(content.title)
    # print(content.id)
    # print(content.title)
    # print(content.author)
    
print(len(getheadlines))

915


Inference: The hot listing of the conspiracy community yielded 915 unique submission titles (duplicate titles are dropped by the set).

The subreddit object gives access to many more features.

Navigate - https://praw.readthedocs.io/en/stable/code_overview/models/subreddit.html#praw.models.Subreddit
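
As a rough illustration of using more of those features, the sketch below keeps a few fields per submission instead of only the title, and removes duplicates by submission id rather than by title. The helper `submissions_to_records` is hypothetical. It relies only on the `id`, `title`, `score` and `num_comments` attributes of a submission, and uses stand-in objects so it runs without network access.

```python
from types import SimpleNamespace


def submissions_to_records(submissions):
    """Collect a few fields per submission, skipping repeated ids."""
    seen_ids = set()
    records = []
    for post in submissions:
        if post.id in seen_ids:
            continue  # same submission seen twice
        seen_ids.add(post.id)
        records.append(
            {
                "id": post.id,
                "title": post.title,
                "score": post.score,
                "num_comments": post.num_comments,
            }
        )
    return records


# Stand-in objects that mimic the attributes read above (made-up values).
fake_posts = [
    SimpleNamespace(id="a1", title="First headline", score=10, num_comments=3),
    SimpleNamespace(id="a1", title="First headline", score=10, num_comments=3),
    SimpleNamespace(id="b2", title="Second headline", score=5, num_comments=0),
]
records = submissions_to_records(fake_posts)
print(len(records))
```

With a real connection, the same function could be called as `submissions_to_records(reddit.subreddit('conspiracy').hot(limit=None))`, and the resulting list passed to `pandas.DataFrame.from_records`.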

In [34]:
redditdf = pandas.DataFrame(getheadlines)
redditdf.head()

Unnamed: 0,0
0,My 3 yo daughter keeps repeating this for no r...
1,Tachyons may be behind precognition/clairvoyance
2,Lady Diana - Murdered or Accident
3,Just my 2 cents
4,Bethel park school shooting threats


Inference: Converted the Reddit headlines into a DataFrame for further analysis.

In [35]:
redditdf.to_csv("redditdata.csv",encoding='utf-8',index=False,header=False)

Inference: Saved the Reddit headlines to a CSV file.

Sentiment Analysis on the Data

In [47]:
analyse = SentimentIntensityAnalyzer()
outcome = []

for row in getheadlines:
    sentiment = analyse.polarity_scores(row)
    sentiment['headlines'] = row
    outcome.append(sentiment)
print(outcome[:1])

[{'neg': 0.258, 'neu': 0.742, 'pos': 0.0, 'compound': -0.4184, 'headlines': 'My 3 yo daughter keeps repeating this for no reason!!'}]


In [54]:
newdf = pandas.DataFrame.from_records(outcome)
newdf.head()

Unnamed: 0,neg,neu,pos,compound,headlines
0,0.258,0.742,0.0,-0.4184,My 3 yo daughter keeps repeating this for no r...
1,0.0,1.0,0.0,0.0,Tachyons may be behind precognition/clairvoyance
2,0.714,0.286,0.0,-0.8176,Lady Diana - Murdered or Accident
3,0.0,1.0,0.0,0.0,Just my 2 cents
4,0.412,0.588,0.0,-0.4215,Bethel park school shooting threats


In [55]:
newdf['label'] = 0
newdf.loc[newdf['compound'] > 0.2, 'label'] = 1 # Positive
newdf.loc[newdf['compound'] < -0.2, 'label'] = -1 # Negative
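
The labelling rule used in this cell can be written as a small standalone function, which makes the cut-off easy to change and to test. This is a minimal sketch: the name `label_from_compound` and the `threshold` parameter are introduced here for illustration, with 0.2 taken from the cell above.

```python
def label_from_compound(compound, threshold=0.2):
    """Map a VADER compound score to 1 (positive), -1 (negative) or 0 (neutral)."""
    if compound > threshold:
        return 1
    if compound < -threshold:
        return -1
    # Scores within [-threshold, threshold] are treated as neutral.
    return 0


# Compound scores taken from the rows shown in this notebook.
for score in (-0.4184, 0.0, 0.4404):
    print(score, label_from_compound(score))
```

Applied to the DataFrame, this would be `newdf['label'] = newdf['compound'].apply(label_from_compound)`, which gives the same labels as the two `.loc` assignments.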

In [82]:
newdf.head(15)

Unnamed: 0,neg,neu,pos,compound,headlines,label
0,0.258,0.742,0.0,-0.4184,My 3 yo daughter keeps repeating this for no r...,-1
1,0.0,1.0,0.0,0.0,Tachyons may be behind precognition/clairvoyance,0
2,0.714,0.286,0.0,-0.8176,Lady Diana - Murdered or Accident,-1
3,0.0,1.0,0.0,0.0,Just my 2 cents,0
4,0.412,0.588,0.0,-0.4215,Bethel park school shooting threats,-1
5,0.417,0.583,0.0,-0.6486,If the dead internet theory is true..,-1
6,0.0,1.0,0.0,0.0,Audio analysis 2 shooters at Trump rally - Chr...,0
7,0.0,1.0,0.0,0.0,Trump wasn’t supposed to make it,0
8,0.252,0.748,0.0,-0.7227,But if you say that “Jews are secretly running...,-1
9,0.0,0.896,0.104,0.4404,The Democratic national committee protected Bi...,1


In [63]:
newdf.to_csv('finaldata.csv',encoding="utf-8")

In [67]:
newdf['label'].value_counts()


label
 0    471
-1    286
 1    158
Name: count, dtype: int64

In [99]:
pp = PrettyPrinter()

In [106]:
import pprint  # the module itself; PrettyPrinter above is a class inside it
print(type(pprint))

<class 'module'>


In [110]:
print("Positive Headlines")
positive_list = list(newdf[newdf['label'] == 1].headlines)[:6]
pp.pprint(positive_list)

Positive Headlines
['The Democratic national committee protected Biden from debating RFK Jr. '
 'during the primaries. Had they not everyone would have seen these '
 'deficiencies and not a tenable candidate.',
 'The Great Replacement is real. Here are facts for Greece. ',
 'Thomas Matthew Crooks’ Motive— Independent Party Affiliation',
 "Baron Trump's Marvelous Underground Journey by Ingersol Lockwood 1892",
 'This is pretty creepy if you ask me',
 'Interesting theory. Snipers in building actually were the ones who shot at '
 'Trump and the boy on the roof was the fall guy.']


In [109]:
print("Neutral Headlines")
neutral_list = list(newdf[newdf['label'] == 0].headlines)[:6]
pp.pprint(neutral_list)

Neutral Headlines
['Tachyons may be behind precognition/clairvoyance',
 'Just my 2 cents',
 'Audio analysis 2 shooters at Trump rally - Chris Martenson',
 'Trump wasn’t supposed to make it',
 'Biden Exiting Race Sunday',
 'Second shooter in Water tower..']


In [108]:
print("Negative Headlines")
negative_list = list(newdf[newdf['label'] == -1].headlines)[:6]
pp.pprint(negative_list)

Negative Headlines
['My 3 yo daughter keeps repeating this for no reason!!',
 'Lady Diana - Murdered or Accident',
 'Bethel park school shooting threats',
 'If the dead internet theory is true.. ',
 'But if you say that “Jews are secretly running the government”, that is '
 'antisemitic hate speech',
 'Catalyst Events and The Trump Assassination Attempt']


Inference: Displayed sample headlines for each sentiment label.

In [91]:
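# normalize=True returns fractions of the total; multiply by 100 for percentages
print((newdf['label'].value_counts(normalize=True) * 100).round(2))

label
 0    51.48
-1    31.26
 1    17.27
Name: proportion, dtype: float64


Inference: Displayed the percentage of headlines in each sentiment class.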
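
For reference, the share of each label is its count divided by the total number of headlines, times 100. The sketch below works this out in plain Python from the label counts reported earlier (471 neutral, 286 negative, 158 positive). The helper name `label_percentages` is hypothetical.

```python
from collections import Counter


def label_percentages(labels):
    """Return the percentage of items carrying each label, rounded to 2 places."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing to summarise
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Rebuild the label column from the counts shown by value_counts() above.
labels = [0] * 471 + [-1] * 286 + [1] * 158
shares = label_percentages(labels)
print(shares)
```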

![alt text](reddit_clap.png)

Conclusion

Scraped headlines from Reddit through the Reddit API, using the PRAW Python library for communication, and then analysed the sentiment of the headlines in one community. The labelled headlines were saved as a dataset that can be used for further model building.