<p align = "center" draggable="false" ><img src="https://user-images.githubusercontent.com/37101144/161836199-fdb0219d-0361-4988-bf26-48b0fad160a3.png" 
     width="200px"
     height="auto"/>
</p>

# <h1 align="center" id="heading">Sentiment Analysis of Reddit Data using Reddit API</h1>

In this live coding session, we use the Python Reddit API Wrapper (`PRAW`) to retrieve data from subreddits on [Reddit](https://www.reddit.com) and perform sentiment analysis using [`pipelines`](https://huggingface.co/docs/transformers/main_classes/pipelines) from [HuggingFace (🤗 the GitHub of Machine Learning)](https://techcrunch.com/2022/05/09/hugging-face-reaches-2-billion-valuation-to-build-the-github-of-machine-learning/), powered by the [transformer](https://arxiv.org/pdf/1706.03762.pdf) architecture.

## Objectives

At the end of the session, you will 

- know how to work with APIs
- feel more comfortable navigating through documentation, and even inspecting the source code
- understand what a `pipeline` object is in HuggingFace
- perform sentiment analysis using `pipeline`
- run a Python script from the command line and get the results

## How to Submit

- At the end of each task, commit* the work to the repository you created before the assignment
- After completing all tasks, make sure to push the notebook containing all code blocks and output cells to that repository
- Submit the link to the notebook in Canvas

\***NEVER** commit a notebook displaying errors unless instructed otherwise. However, commit often; recall git ABC = **A**lways **B**e **C**ommitting.

## Tasks

### Task I: Instantiate a Reddit API Object

The first task is to instantiate a Reddit API object using [PRAW](https://praw.readthedocs.io/en/stable/), through which you will retrieve data. PRAW is a wrapper for the [Reddit API](https://www.reddit.com/dev/api) that makes interacting with it easier, especially if you are not already an expert with [`requests`](https://docs.python-requests.org/en/latest/).

#### 1. Install packages

Please ensure you've run all the cells in `imports.ipynb`, located [here](https://github.com/FourthBrain/MLE-8/blob/main/assignments/week-3-analyze-sentiment-subreddit/imports.ipynb), so that you have all the required packages for today's assignment.

####  2. Create a new app on Reddit 

Create a new app on Reddit and save the secret tokens; refer to [this post on Medium](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c) for more details.

- Create a Reddit account if you don't have one, then log into your account.
- To access the API, we need to create an app. The website has changed slightly since the post was written: navigate to `preferences` > `apps`, or click [this link](https://www.reddit.com/prefs/apps) and scroll all the way down.
- Click to create a new app, fill in the **name**, choose `script`, and fill in the **description** and **redirect uri**. The redirect URI is where the user is sent after they've granted OAuth access to your application (more info [here](https://github.com/reddit-archive/reddit/wiki/OAuth2)). For our purpose, you can enter any URL, e.g., www.google.com, as shown below.


    <img src="https://miro.medium.com/max/700/1*lRBvxpIe8J2nZYJ6ucMgHA.png" width="500"/>
- Jot down `client_id` (upper left corner) and `client_secret`.

    NOTE: CLIENT_ID refers to "personal use script" and CLIENT_SECRET to "secret".
    
    <div>
    <img src="https://miro.medium.com/max/700/1*7cGAKth1PMrEf2sHcQWPoA.png" width="300"/>
    </div>

- Create `secrets_reddit.py` in the same directory as this notebook and fill in the `client_id` and `client_secret` obtained in the last step. We will need to import those constants in the next step.
    ```
    REDDIT_API_CLIENT_ID = "client_id"
    REDDIT_API_CLIENT_SECRET = "client_secret"
    REDDIT_API_USER_AGENT = "any string except bot; ex. My User Agent"
    ```
- Add `secrets_reddit.py` to your `.gitignore` file if not already done. NEVER push credentials to a repo, private or public. 
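
Before committing, it can help to confirm that the secrets file really is listed in `.gitignore`. The following is a minimal sketch, assuming a plain `.gitignore` with one literal entry per line. The helper name `is_listed_in_gitignore` is hypothetical, and it does not implement Git's full pattern matching (wildcards, negation, directory rules).

```python
def is_listed_in_gitignore(gitignore_text: str, filename: str) -> bool:
    """Return True if `filename` appears as a literal entry in the .gitignore text."""
    for line in gitignore_text.splitlines():
        entry = line.strip()
        # Skip blank lines and comments
        if not entry or entry.startswith("#"):
            continue
        # Accept both "secrets_reddit.py" and "/secrets_reddit.py"
        if entry.lstrip("/") == filename:
            return True
    return False


# Example .gitignore content (made up for illustration)
example = "# credentials\nsecrets_reddit.py\n.ipynb_checkpoints/\n"
print(is_listed_in_gitignore(example, "secrets_reddit.py"))
```

In practice you would pass the contents of your own `.gitignore` file; `git check-ignore secrets_reddit.py` on the command line is the authoritative check.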

#### 3. Instantiate a `Reddit` object

Now you are ready to create a read-only `Reddit` instance. Refer to [documentation](https://praw.readthedocs.io/en/stable/code_overview/reddit_instance.html) when necessary.

In [12]:
import praw
import secrets_reddit

# Create a Reddit object which allows us to interact with the Reddit API
reddit = praw.Reddit(
    client_id=secrets_reddit.REDDIT_API_CLIENT_ID,
    client_secret=secrets_reddit.REDDIT_API_CLIENT_SECRET,
    user_agent=secrets_reddit.REDDIT_API_USER_AGENT,
)

In [13]:
print(reddit) 

<praw.reddit.Reddit object at 0x7fbd76010c10>


<details>
<summary>Expected output:</summary>   

```<praw.reddit.Reddit object at 0x10f8a0ac0>```
</details>

#### 4. Instantiate a `subreddit` object

Lastly, create a `subreddit` object for your favorite subreddit and inspect the object. The expected outputs you will see are from `r/machinelearning` unless otherwise specified.

In [14]:
subreddit = reddit.subreddit("machinelearning")

What is the display name of the subreddit?

In [15]:
print(subreddit.display_name)

machinelearning


<details>
<summary>Expected output:</summary>   

    machinelearning
</details>

How about its title? Is it different from the display name?

In [16]:
print(subreddit.title)

Machine Learning


<details>
<summary>Expected output:</summary>   

    Machine Learning
</details>

Print out the description of the subreddit:

In [17]:
print(subreddit.description)

**[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
--------
+[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
--------
+[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
--------
+[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
--------
+[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ANews)
--------
***[@slashML on Twitter](https://twitter.com/slashML)***
--------
***[Chat with us on Slack](https://join.slack.com/t/rml-talk/shared_invite/enQtNjkyMzI3NjA2NTY2LWY0ZmRjZjNhYjI5NzYwM2Y0YzZhZWNiODQ3ZGFjYmI2NTU3YjE1ZDU5MzM2ZTQ4ZGJmOTFmNWVkMzFiMzVhYjg)***
--------
**Beginners:**
--------
Please have a look at [our FAQ and Link-Collection](http://www.reddit.com/r/MachineLearning/wiki/index)

[Metacademy](http://www.metacademy.org) is a great resource which compiles le

<details>
<summary>Expected output:</summary>

    **[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
    --------
    +[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
    --------
    +[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
    --------
    +[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
    --------
    +[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict
</details>
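
The three attributes inspected in this step can be gathered in one place. The sketch below uses a hypothetical helper, `summarize_subreddit`, and a `SimpleNamespace` stand-in so that it runs without credentials or network access. With a real PRAW `Subreddit`, reading these attributes may trigger a request to Reddit.

```python
from types import SimpleNamespace


def summarize_subreddit(subreddit, max_chars: int = 80) -> dict:
    """Collect the display name, title, and a shortened description."""
    description = subreddit.description or ""
    return {
        "display_name": subreddit.display_name,
        "title": subreddit.title,
        "description_preview": description[:max_chars],
    }


# Stand-in object with the same three attributes used in this task
fake_subreddit = SimpleNamespace(
    display_name="machinelearning",
    title="Machine Learning",
    description="**[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**",
)
summary = summarize_subreddit(fake_subreddit, max_chars=20)
print(summary)
```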

### Task II: Parse comments

#### 1. Top Posts of All Time

Find the titles of the top 10 posts of **all time** from your favorite subreddit. Refer to the [Obtain Submission Instances from a Subreddit section](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html) if necessary. Verify that the titles match what you see on Reddit.

In [23]:
# Try running this line. What do you see? Press q once you are done.
?subreddit.top 

In [22]:
for submission in subreddit.top(limit=10):
    print(submission.title)

[Project] From books to presentations in 10s with AR + ML
[D] A Demo from 1993 of 32-year-old Yann LeCun showing off the World's first Convolutional Network for Text Recognition
[R] First Order Motion Model applied to animate paintings
[N] AI can turn old photos into moving Images / Link is given in the comments - You can also turn your old photo like this
[D] This AI reveals how much time politicians stare at their phone at work
[D] Types of Machine Learning Papers
[D] The machine learning community has a toxicity problem
[Project] NEW PYTHON PACKAGE: Sync GAN Art to Music with "Lucid Sonic Dreams"! (Link in Comments)
I made a robot that punishes me if it detects that if I am procrastinating on my assignments [P]
[P] Using oil portraits and First Order Model to bring the paintings back to life


<details> <summary>Expected output:</summary>

    [Project] From books to presentations in 10s with AR + ML
    [D] A Demo from 1993 of 32-year-old Yann LeCun showing off the World's first Convolutional Network for Text Recognition
    [R] First Order Motion Model applied to animate paintings
    [N] AI can turn old photos into moving Images / Link is given in the comments - You can also turn your old photo like this
    [D] This AI reveals how much time politicians stare at their phone at work
    [D] Types of Machine Learning Papers
    [D] The machine learning community has a toxicity problem
    [Project] NEW PYTHON PACKAGE: Sync GAN Art to Music with "Lucid Sonic Dreams"! (Link in Comments)
    [P] Using oil portraits and First Order Model to bring the paintings back to life
    [D] Convolution Neural Network Visualization - Made with Unity 3D and lots of Code / source - stefsietz (IG)    
</details>

#### 2. Top 10 Posts of This Week

What are the titles of the top 10 posts of **this week** from your favorite subreddit?

In [25]:
for submission in subreddit.top(time_filter="week", limit=10):
    print(submission.title)

30% of Google's Reddit Emotions Dataset is Mislabeled [D]
[R] mixed reality future — see the world through artistic lenses — made with NeRF
[N] First-Ever Course on Transformers: NOW PUBLIC
[D] Why are Corgi dogs so popular in machine learning (especially in the image generation community)?
[D] Are there any rejected papers that ended up having significant impact in the long run?
[D] Noam Chomsky on LLMs and discussion of LeCun paper (MLST)
[N] Andrej Karpathy is leaving Tesla
[R] So someone actually peer-reviewed this and thought "yeah, looks good"?
[D] How do you verify the novelty of your research?
[N] BigScience Releases their 176 Billion Parameter Open-access Multilingual Language Model


<details><summary>Expected output:</summary>

    [N] Ian Goodfellow, Apple’s director of machine learning, is leaving the company due to its return to work policy. In a note to staff, he said “I believe strongly that more flexibility would have been the best policy for my team.” He was likely the company’s most cited ML expert.
    [R][P] Thin-Plate Spline Motion Model for Image Animation + Gradio Web Demo
    [P] I’ve been trying to understand the limits of some of the available machine learning models out there. Built an app that lets you try a mix of CLIP from Open AI + Apple’s version of MobileNet, and more directly on your phone's camera roll.
    [R] Meta is releasing a 175B parameter language model
    [N] Hugging Face raised $100M at $2B to double down on community, open-source & ethics
    [P] T-SNE to view and order your Spotify tracks
    [D] : HELP Finding a Book - A book written for Google Engineers about foundational Math to support ML
    [R] Scaled up CLIP-like model (~2B) shows 86% Zero-shot on Imagenet
    [D] Do you use NLTK or Spacy for text preprocessing?
    [D] Democratizing Diffusion Models - LDMs: High-Resolution Image Synthesis with Latent Diffusion Models, a 5-minute paper summary by Casual GAN Papers
</details>

💽❓ Data Question:

Check out what other attributes the `praw.models.Submission` class has in the [docs](https://praw.readthedocs.io/en/stable/code_overview/models/submission.html). 

1. After having a chance to look through the docs, is there any other information that you might want to extract? How might this additional data help you?

Below, write a sample piece of code that extracts three additional pieces of information from each submission.

In [28]:
for submission in subreddit.top(time_filter="week", limit=10):
    print(f'{submission.title}|{submission.num_comments} | {submission.upvote_ratio}| {submission.score}')

30% of Google's Reddit Emotions Dataset is Mislabeled [D]|137 | 0.98| 870
[R] mixed reality future — see the world through artistic lenses — made with NeRF|15 | 0.96| 355
[N] First-Ever Course on Transformers: NOW PUBLIC|37 | 0.92| 345
[D] Why are Corgi dogs so popular in machine learning (especially in the image generation community)?|68 | 0.92| 318
[D] Are there any rejected papers that ended up having significant impact in the long run?|99 | 0.98| 292
[D] Noam Chomsky on LLMs and discussion of LeCun paper (MLST)|242 | 0.88| 272
[N] Andrej Karpathy is leaving Tesla|120 | 0.94| 269
[R] So someone actually peer-reviewed this and thought "yeah, looks good"?|81 | 0.96| 239
[D] How do you verify the novelty of your research?|55 | 0.99| 184
[N] BigScience Releases their 176 Billion Parameter Open-access Multilingual Language Model|32 | 0.98| 183
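
Printing pipe-separated strings is fine for a quick look, but keeping each submission as a small record makes sorting and filtering easier. The sketch below is one way to do that, assuming only the four attributes printed above. The helper names `submission_to_row` and `rank_by_comments` are hypothetical, and `SimpleNamespace` objects stand in for real submissions so the code runs offline.

```python
from types import SimpleNamespace


def submission_to_row(submission) -> dict:
    """Pick the fields printed above out of a submission-like object."""
    return {
        "title": submission.title,
        "num_comments": submission.num_comments,
        "upvote_ratio": submission.upvote_ratio,
        "score": submission.score,
    }


def rank_by_comments(submissions) -> list:
    """Return rows sorted from most to least discussed."""
    rows = [submission_to_row(s) for s in submissions]
    return sorted(rows, key=lambda row: row["num_comments"], reverse=True)


# Stand-ins built from two rows of the output above
samples = [
    SimpleNamespace(title="[N] Andrej Karpathy is leaving Tesla",
                    num_comments=120, upvote_ratio=0.94, score=269),
    SimpleNamespace(title="[D] Noam Chomsky on LLMs and discussion of LeCun paper (MLST)",
                    num_comments=242, upvote_ratio=0.88, score=272),
]
ranked = rank_by_comments(samples)
print(ranked[0]["title"])
```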


💽❓ Data Question:

2. Is there any information available that might be a concern when it comes to Ethical Data?
```
__ANS:__ Not necessarily. However, upvote_ratio can be a concern: if the audience is biased towards a certain topic, voting can bury useful information.
```

#### 3. Comment Code

Add comments to the code block below to describe what each line of the code does (Refer to [Obtain Comment Instances Section](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html) when necessary). The code is adapted from [this tutorial](https://praw.readthedocs.io/en/stable/tutorials/comments.html)

The purpose is
1. to understand what the code is doing
2. to start commenting your code whenever it is not self-explanatory, if you have not already (others will thank you, YOU will thank you later 😊)

In [30]:
%%time
from praw.models import MoreComments

# Initialize an empty list of top comments
top_comments = []

# Loop through the submission objects obtained by calling subreddit.top(limit=10)
for submission in subreddit.top(limit=10):
    # Inner loop over the top-level comments in submission.comments
    for top_level_comment in submission.comments:
        # Check if top_level_comment is of type MoreComments; if so, skip to the next iteration
        if isinstance(top_level_comment, MoreComments):
            continue
        # Otherwise, append top_level_comment.body to the top_comments list
        top_comments.append(top_level_comment.body)

CPU times: user 405 ms, sys: 33.5 ms, total: 438 ms
Wall time: 1min 53s
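
To check your understanding of the loop without waiting on the network, the same logic can be exercised on stand-in objects. This is a sketch under assumptions: `collect_top_level_bodies` and the `Fake*` classes are hypothetical, and the type to skip is passed in as a parameter so that the code runs without PRAW installed.

```python
def collect_top_level_bodies(submissions, skip_type) -> list:
    """Gather the text of top-level comments, skipping placeholder objects."""
    bodies = []
    for submission in submissions:
        for comment in submission.comments:
            # Placeholders (MoreComments in PRAW) have no .body attribute
            if isinstance(comment, skip_type):
                continue
            bodies.append(comment.body)
    return bodies


# Stand-in classes so the sketch runs without PRAW or network access
class FakeComment:
    def __init__(self, body):
        self.body = body


class FakeMoreComments:
    pass


class FakeSubmission:
    def __init__(self, comments):
        self.comments = comments


submissions = [
    FakeSubmission([FakeComment("Great work, mate!"), FakeMoreComments()]),
    FakeSubmission([FakeComment("Awesome visualisation")]),
]
bodies = collect_top_level_bodies(submissions, skip_type=FakeMoreComments)
print(bodies)
```

With PRAW available, the same function could be called as `collect_top_level_bodies(subreddit.top(limit=10), skip_type=MoreComments)`.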


#### 4. Inspect Comments

How many comments did you extract from the last step? Examine a few comments. 

In [31]:
len(top_comments)  # the answer may vary; e.g., 693 for r/machinelearning

740

In [38]:
import random

[random.choice(top_comments) for i in range(3)]

['But the question is: is it the validation set? 😁',
 'I grew up wanting to be a scientist but became disillusioned by the idea when it became clear that the problems you mentioned were ubiquitous in modern science.',
 'Great work, mate!']

<details> <summary>Some of the comments from `r/machinelearning` subreddit are:</summary>

    ['Awesome visualisation',
    'Similar to a stack or connected neurons.',
    'Will this Turing pass the Turing Test?']
</details>

💽❓ Data Question:

3. After reviewing a few samples of 5 comments from the subreddit, what can you say about the data?

HINT: Think about the "cleanliness" of the data, the content of the data, think about what you're trying to do - how does this data line up with your goal?

__ANS__: After reviewing a few comments, it is clear that the language is far more legible than in some of the Twitter data I've seen. However, the sentiment is fairly balanced, so many comments may come out as neutral.
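
Eyeballing three comments at a time only goes so far. A few rough counts over the whole list can back up the impression of "cleanliness". The sketch below is one possible set of indicators, not a standard: `describe_comments` is a hypothetical helper, and treating `[deleted]` and `[removed]` as placeholder bodies is an assumption about how Reddit marks missing comments.

```python
import re
from statistics import median

URL_PATTERN = re.compile(r"https?://\S+")
PLACEHOLDERS = {"[deleted]", "[removed]"}  # assumed placeholder bodies


def describe_comments(comments) -> dict:
    """Compute a few rough cleanliness indicators for a list of comment strings."""
    word_counts = [len(c.split()) for c in comments]
    return {
        "n_comments": len(comments),
        "median_words": median(word_counts) if word_counts else 0,
        "with_url": sum(1 for c in comments if URL_PATTERN.search(c)),
        "placeholders": sum(1 for c in comments if c.strip() in PLACEHOLDERS),
        "non_ascii": sum(1 for c in comments if not c.isascii()),
    }


# Three comments from the expected output above plus two made-up ones
sample = [
    "Awesome visualisation",
    "Similar to a stack or connected neurons.",
    "Will this Turing pass the Turing Test?",
    "[deleted]",
    "See https://example.com \U0001F601",
]
stats = describe_comments(sample)
print(stats)
```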

#### 5. Extract Top Level Comment from Subreddit `TSLA`.

Write your code to extract top level comments from the top 10 topics of a time period, e.g., year, from subreddit `TSLA` and store them in a list `top_comments_tsla`.  

In [39]:
subreddit = reddit.subreddit("TSLA")

top_comments_tsla = []

for submission in subreddit.top(limit=10, time_filter="year"):
    for top_level_comment in submission.comments:
        if isinstance(top_level_comment, MoreComments):
            continue
        top_comments_tsla.append(top_level_comment.body)

In [40]:
len(top_comments_tsla)  # Expected: 174 for r/TSLA; the answer may vary

158

In [44]:
[random.choice(top_comments_tsla) for i in range(3)]

['🖐️not touching my TSLA for the next 4-5 years at least.',
 'I’m %70 TSLA',
 'I’m here for the split 😎']

<details>
<summary>Some of the comments from `r/TSLA` subreddit:</summary>

    ['I bought puts',
    '100%',
    'Yes. And I’m bag holding 1200 calls for Friday and am close to throwing myself out the window']
</details>

💽❓ Data Question:

4. Now that you've had a chance to review another subreddit's comments, do you see any differences in the kinds of comments either subreddit has, and how might this relate to bias?

__ANS__: Comments are either strongly in favor or strongly against. TSLA comments will be strongly biased one way or the other, depending on the number of fans or haters participating in the discussion.

### Task III: Sentiment Analysis

Let us analyze the sentiment of comments scraped from `r/TSLA` using a pre-trained HuggingFace model to make the inference. Take a [Quick tour](https://huggingface.co/docs/transformers/quicktour). 

#### 1. Import `pipeline`

In [45]:
from transformers import pipeline # YOUR CODE HERE

#### 2. Create a Pipeline to Perform Task "sentiment-analysis"

In [46]:
sentiment_model = pipeline("sentiment-analysis")# YOUR CODE HERE

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


#### 3. Get one comment from list `top_comments_tsla` from Task II - 5.

In [47]:
comment = random.choice(top_comments_tsla)

In [48]:
comment

'I can see a split after Elon sells his 10% as an announcement would drive up the price.'

The example comment is: `'Bury Burry!!!!!'`. Print out what you get. For reproducibility, use the same comment in the next step; consider setting a seed.
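
One way to make the pick repeatable is a dedicated, seeded generator. This is a minimal sketch: `pick_comment` is a hypothetical helper and the seed value is arbitrary. The pick is only reproducible while the list itself stays the same, and the comments retrieved from Reddit may change between runs.

```python
import random


def pick_comment(comments, seed: int = 42) -> str:
    """Pick one comment with a dedicated, seeded random generator."""
    rng = random.Random(seed)  # local generator; leaves the global random state untouched
    return rng.choice(comments)


comments = ["I bought puts", "100%", "Bury Burry!!!!!"]
first = pick_comment(comments, seed=42)
second = pick_comment(comments, seed=42)
print(first == second)  # True: the same seed and list give the same pick
```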

#### 4. Make Inference!

In [51]:
sentiment = sentiment_model(comment)# YOUR CODE HERE 

What is the type of the output `sentiment`?

```
list (a one-element list holding a dict with the keys "label" and "score")
```

In [52]:
print(f'The comment: {comment}')
print(f'Predicted Label is {sentiment[0]["label"]} and the score is {sentiment[0]["score"]:.3f}')

The comment: I can see a split after Elon sells his 10% as an announcement would drive up the price.
Predicted Label is NEGATIVE and the score is 0.998


For the example comment, the output is:

    The comment: Bury Burry!!!!!
    Predicted Label is NEGATIVE and the score is 0.989

🖥️❓ Model Question:

1. What does the score represent?

```
The confidence of the model (distilbert-base-uncased-finetuned-sst-2-english in this case) in its predicted label (NEGATIVE in this case), i.e., the probability it assigns to that label.
```
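
To make the score concrete: a classifier like this one produces one raw value (logit) per label, and, as far as I understand the pipeline's default behavior for this model, a softmax turns those values into probabilities, with the largest reported as the score. The sketch below illustrates that idea in plain Python. The logits are hypothetical, not taken from the real model, and `predict` is an illustrative helper rather than part of `transformers`.

```python
import math


def softmax(logits):
    """Convert raw classifier outputs into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, labels=("NEGATIVE", "POSITIVE")):
    """Return the most probable label and its probability, shaped like one pipeline result."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


# Hypothetical logits, not taken from the real model
result = predict([2.0, -2.5])
print(result["label"], round(result["score"], 3))
```

A score near 1.0 therefore means the model put almost all of the probability on one label. It does not mean the prediction is correct.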

### Task IV: Put All Together

Let's put all the pieces together and create a simple script that will

- get the subreddit
- get comments from the top posts for given subreddit
- run sentiment analysis 

#### Complete the Script

Once you complete the code, running the following block writes it to a new Python script and saves it as `top_tlsa_comment_sentiment.py` in the same directory as the notebook.

In [65]:
%%writefile top_tlsa_comment_sentiment.py

import secrets_reddit
import random

from typing import Dict, List

from praw import Reddit
from praw.models.reddit.subreddit import Subreddit
from praw.models import MoreComments

from transformers import pipeline


def get_subreddit(display_name:str) -> Subreddit:
    """Get subreddit object from display name

    Args:
        display_name (str): Name of the subreddit without the "r/" prefix, e.g. "TSLA"

    Returns:
        Subreddit: PRAW object representing the subreddit
    """
    reddit = Reddit(
        client_id=secrets_reddit.REDDIT_API_CLIENT_ID,        
        client_secret=secrets_reddit.REDDIT_API_CLIENT_SECRET,
        user_agent=secrets_reddit.REDDIT_API_USER_AGENT
        )
    
    subreddit = reddit.subreddit(display_name)
    return subreddit

def get_comments(subreddit:Subreddit, limit:int=3) -> List[str]:
    """Get top-level comments from the top posts of a subreddit

    Args:
        subreddit (Subreddit): Subreddit to read from
        limit (int, optional): Number of top posts to scan. Defaults to 3.

    Returns:
        List[str]: Bodies of the top-level comments
    """
    top_comments = []
    for submission in subreddit.top(limit=limit):
        for top_level_comment in submission.comments:
            if isinstance(top_level_comment, MoreComments):
                continue
            top_comments.append(top_level_comment.body)
    return top_comments

def run_sentiment_analysis(comment:str) -> Dict:
    """Run sentiment analysis on comment using default distilbert model
    
    Args:
        comment (str): Text of a single comment
        
    Returns:
        Dict: Prediction with the keys "label" and "score"
    """
    sentiment_model = pipeline("sentiment-analysis")
    sentiment = sentiment_model(comment)
    return sentiment[0]


if __name__ == '__main__':
    subreddit = get_subreddit("TSLA")
    comments = get_comments(subreddit)
    comment = random.choice(comments)
    sentiment = run_sentiment_analysis(comment)
    
    print(f'The comment: {comment}')
    print(f'Predicted Label is {sentiment["label"]} and the score is {sentiment["score"]:.3f}')

Overwriting top_tlsa_comment_sentiment.py


Run the following block to see the output.

In [66]:
!python top_tlsa_comment_sentiment.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
The comment: Karma would eventually catch up with TSLA making a shitty product which is overpriced and all the false claims about autopilot and putting people’s lives in danger
Predicted Label is NEGATIVE and the score is 1.000


<details><summary> Expected output:</summary>

    No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
    The comment: When is DOGE flying
    Predicted Label is POSITIVE and the score is 0.689
</details>
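
The script scores a single random comment. To get a picture of the whole subreddit, one option is to score every comment and then summarize the results. The sketch below covers only the summarizing step and assumes each result is a dict with `label` and `score` keys, as in the output above. `summarize_sentiments` is a hypothetical helper and the sample values are made up.

```python
from collections import Counter


def summarize_sentiments(results) -> dict:
    """Count the predicted labels and average the score per label."""
    counts = Counter(r["label"] for r in results)
    mean_score = {
        label: sum(r["score"] for r in results if r["label"] == label) / n
        for label, n in counts.items()
    }
    return {"counts": dict(counts), "mean_score": mean_score}


# Hand-written results in the pipeline's output format (values are made up)
results = [
    {"label": "NEGATIVE", "score": 0.9},
    {"label": "POSITIVE", "score": 0.7},
    {"label": "NEGATIVE", "score": 0.8},
]
summary = summarize_sentiments(results)
print(summary["counts"])
```

When scoring many comments, it is usually worth creating the pipeline once and reusing it, rather than building it inside the function on every call as the script does.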

💽❓ Data Question:

5. Is the subreddit active? About how many posts or threads per day? How could you find this information?

### ANS

The TSLA subreddit does not seem to be very active. It averages about 1 post per day and 2 comments per day. I could have done some analysis myself, but instead chose the lazy path (choose battles wisely!) and Googled for Reddit analytics, landing at https://subredditstats.com/r/tsla
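
The activity level can also be estimated directly from the data. A sketch, assuming you have collected the `created_utc` timestamps (Unix seconds) of recent submissions, for example from `subreddit.new(limit=...)`. The helper `posts_per_day` is hypothetical and the timestamps below are made up.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def posts_per_day(created_utc_values) -> float:
    """Average number of posts per day over the span covered by the timestamps."""
    if len(created_utc_values) < 2:
        raise ValueError("need at least two timestamps to measure a time span")
    span_seconds = max(created_utc_values) - min(created_utc_values)
    # Treat any span shorter than a day as one day to avoid inflated rates
    span_days = max(span_seconds / SECONDS_PER_DAY, 1.0)
    return len(created_utc_values) / span_days


# Made-up timestamps: 5 posts, one per day, spanning exactly 4 days
start = datetime(2022, 7, 1, tzinfo=timezone.utc).timestamp()
timestamps = [start + day * SECONDS_PER_DAY for day in range(5)]
rate = posts_per_day(timestamps)
print(rate)
```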

💽❓ Data Question:

6. Does there seem to be a large distribution of posters or a smaller concentration of posters who are very active? What kind of impact might this have on the data?

__ANS__ : Looking at the top submissions of the last month, there appears to be a handful of posters landing upvoted submissions. This is a red flag for biased data, as the sample of posters becomes limited.
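
This impression can be quantified by counting posts per author. The sketch below assumes you have already collected author names as strings, for example via `str(submission.author)` (the author can be missing for deleted accounts). `poster_concentration` is a hypothetical helper and the names are made up.

```python
from collections import Counter


def poster_concentration(author_names, top_n: int = 3) -> dict:
    """Share of posts written by the `top_n` most active authors."""
    counts = Counter(author_names)
    top = counts.most_common(top_n)
    top_posts = sum(n for _, n in top)
    return {
        "unique_authors": len(counts),
        "top_authors": top,
        "top_share": top_posts / len(author_names) if author_names else 0.0,
    }


# Made-up author names for illustration
authors = ["alice", "bob", "alice", "carol", "alice", "bob", "dave", "alice"]
stats = poster_concentration(authors, top_n=2)
print(stats["top_share"])
```

A high `top_share` with few unique authors would support the concern that a handful of posters dominate the sample.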