<p align="center" draggable="false"><img src="https://user-images.githubusercontent.com/37101144/161836199-fdb0219d-0361-4988-bf26-48b0fad160a3.png" 
     width="200px"
     height="auto"/>
</p>

# <h1 align="center" id="heading">Sentiment Analysis of Reddit Data using Reddit API</h1>

In this live coding session, we leverage the Python Reddit API Wrapper (`PRAW`) to retrieve data from subreddits on [Reddit](https://www.reddit.com), and perform sentiment analysis using [`pipelines`](https://huggingface.co/docs/transformers/main_classes/pipelines) from [HuggingFace ( 🤗 the GitHub of Machine Learning )](https://techcrunch.com/2022/05/09/hugging-face-reaches-2-billion-valuation-to-build-the-github-of-machine-learning/), powered by [transformer](https://arxiv.org/pdf/1706.03762.pdf).

## Objectives

At the end of the session, you will 

- know how to work with APIs
- feel more comfortable navigating through documentation, even inspecting the source code
- understand what a `pipeline` object is in HuggingFace
- perform sentiment analysis using `pipeline`
- run a Python script from the command line and get the results

## How to Submit

- At the end of each task, commit* your work to the repository you created before the assignment
- After completing all tasks, push the notebook containing all code blocks and output cells to that repository
- Submit the link to the notebook in Canvas

\***NEVER** commit a notebook displaying errors unless instructed otherwise. However, commit often; recall git ABC = **A**lways **B**e **C**ommitting.

## Tasks

### Task I: Instantiate a Reddit API Object

The first task is to instantiate a Reddit API object using [PRAW](https://praw.readthedocs.io/en/stable/), through which you will retrieve data. PRAW is a wrapper for the [Reddit API](https://www.reddit.com/dev/api) that makes interacting with it easier, unless you are already an expert in [`requests`](https://docs.python-requests.org/en/latest/).

#### 1. Install packages

Please ensure you've run all the cells in `imports.ipynb`, located [here](https://github.com/FourthBrain/MLE-8/blob/main/assignments/week-3-analyze-sentiment-subreddit/imports.ipynb), to make sure you have all the required packages for today's assignment.
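
As a quick sanity check after running the import notebook, the sketch below reports which packages cannot be found. The package list is an assumption based on the libraries used in this assignment (`praw` and `transformers`); adjust it to match what `imports.ipynb` actually installs.

```python
import importlib.util

# Assumed package list for this assignment; adjust it to match imports.ipynb.
REQUIRED_PACKAGES = ["praw", "transformers"]


def find_missing_packages(names):
    """Return the top-level package names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```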

####  2. Create a new app on Reddit 

Create a new app on Reddit and save the secret tokens; refer to [this Medium post](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c) for more details.

- Create a Reddit account if you don't have one, then log in.
- To access the API, we need to create an app. The website has changed slightly since the post was written: navigate to `preferences` > `apps`, or click [this link](https://www.reddit.com/prefs/apps) and scroll all the way down.
- Click to create a new app, fill in the **name**, choose `script`, and fill in the **description** and **redirect uri**. The redirect URI is where the user is sent after they've granted OAuth access to your application (more info [here](https://github.com/reddit-archive/reddit/wiki/OAuth2)). For our purpose, you can enter any URL, e.g., www.google.com, as shown below.


    <img src="https://miro.medium.com/max/700/1*lRBvxpIe8J2nZYJ6ucMgHA.png" width="500"/>
- Jot down `client_id` (left upper corner) and `client_secret` 

    NOTE: `client_id` refers to the string under "personal use script" and `client_secret` to "secret".
    
    <div>
    <img src="https://miro.medium.com/max/700/1*7cGAKth1PMrEf2sHcQWPoA.png" width="300"/>
    </div>

- Create `secrets_reddit.py` in the same directory as this notebook and fill in the `client_id` and `client_secret` obtained in the last step. We will import these constants in the next step.
    ```
    REDDIT_API_CLIENT_ID = "client_id"
    REDDIT_API_CLIENT_SECRET = "secret_id"
    REDDIT_API_USER_AGENT = "any string except bot; ex. My User Agent"
    ```
- Add `secrets_reddit.py` to your `.gitignore` file if not already done. NEVER push credentials to a repo, private or public. 
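
An alternative that keeps credentials out of the repository entirely is to read them from environment variables. The sketch below is one way to do that, not part of the assignment: the function name `load_reddit_credentials` is hypothetical, and it assumes you have exported variables with the same names as the constants above.

```python
import os
from typing import Mapping

# Same names as the constants in secrets_reddit.py (assumed to be exported in the shell).
REQUIRED_KEYS = (
    "REDDIT_API_CLIENT_ID",
    "REDDIT_API_CLIENT_SECRET",
    "REDDIT_API_USER_AGENT",
)


def load_reddit_credentials(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the three credentials from a mapping, failing loudly if any is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing credentials: {', '.join(missing)}")
    return {key: env[key] for key in REQUIRED_KEYS}


# Demonstration with a plain dict standing in for the real environment.
demo_env = {
    "REDDIT_API_CLIENT_ID": "client_id",
    "REDDIT_API_CLIENT_SECRET": "secret_id",
    "REDDIT_API_USER_AGENT": "My User Agent",
}
credentials = load_reddit_credentials(demo_env)
print(sorted(credentials))
```

Passing the mapping as a parameter makes the function easy to try without touching the real environment.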

#### 3. Instantiate a `Reddit` object

Now you are ready to create a read-only `Reddit` instance. Refer to [documentation](https://praw.readthedocs.io/en/stable/code_overview/reddit_instance.html) when necessary.

In [2]:
import praw
import secrets_reddit

# Create a Reddit object which allows us to interact with the Reddit API
reddit = praw.Reddit(
    client_id=secrets_reddit.REDDIT_API_CLIENT_ID,
    client_secret=secrets_reddit.REDDIT_API_CLIENT_SECRET,
    user_agent=secrets_reddit.REDDIT_API_USER_AGENT
)

In [3]:
print(reddit) 

<praw.reddit.Reddit object at 0x7f60e8381b80>


<details>
<summary>Expected output:</summary>   

```<praw.reddit.Reddit object at 0x10f8a0ac0>```
</details>

#### 4. Instantiate a `subreddit` object

Lastly, create a `subreddit` object for your favorite subreddit and inspect the object. The expected outputs you will see are from `r/machinelearning` unless otherwise specified.

In [4]:
# YOUR CODE HERE
subreddit = reddit.subreddit("machinelearning")

What is the display name of the subreddit?

In [5]:
# YOUR CODE HERE
print(subreddit.display_name)

machinelearning


<details>
<summary>Expected output:</summary>   

    machinelearning
</details>

How about its title, is it different from the display name?

In [6]:
# YOUR CODE HERE
print(subreddit.title)

Machine Learning


<details>
<summary>Expected output:</summary>   

    Machine Learning
</details>

Print out the description of the subreddit:

In [7]:
# YOUR CODE HERE
print(subreddit.description)

**[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
--------
+[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
--------
+[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
--------
+[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
--------
+[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ANews)
--------
***[@slashML on Twitter](https://twitter.com/slashML)***
--------
***[Chat with us on Slack](https://join.slack.com/t/rml-talk/shared_invite/enQtNjkyMzI3NjA2NTY2LWY0ZmRjZjNhYjI5NzYwM2Y0YzZhZWNiODQ3ZGFjYmI2NTU3YjE1ZDU5MzM2ZTQ4ZGJmOTFmNWVkMzFiMzVhYjg)***
--------
**Beginners:**
--------
Please have a look at [our FAQ and Link-Collection](http://www.reddit.com/r/MachineLearning/wiki/index)

[Metacademy](http://www.metacademy.org) is a great resource which compiles le

<details>
<summary>Expected output:</summary>

    **[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
    --------
    +[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
    --------
    +[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
    --------
    +[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
    --------
    +[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict
</details>

### Task II: Parse comments

#### 1. Top Posts of All Time

Find the titles of the top 10 posts of **all time** from your favorite subreddit. Refer to the [Obtain Submission Instances from a Subreddit section](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html) if necessary. Verify that the titles match what you see on Reddit.

In [8]:
# try running this line; what do you see? press q once you are done
?subreddit.top 

Signature:
subreddit.top(
    *,
    time_filter: str = 'all',
    **generator_kwargs: Union[str, int, Dict[str, str]],
) -> Iterator[Any]
Docstring:
Return a :class:`.ListingGenerator` for top items.

:param time_filter: Can be one of: ``"all"``, ``"day"``, ``"hour"``,
    ``"month"``, ``"week"``, or ``"year"`` (default: ``"all"``).

:raises: :py:class:`ValueError` if ``time_filter`` is invalid.

Additional keyword arguments are passed in the initialization of
:class:`.ListingGenerator`.

This method can be used

In [9]:
# YOUR CODE HERE
for submission in subreddit.top(limit=10):
    print(submission.title)

[Project] From books to presentations in 10s with AR + ML
[D] A Demo from 1993 of 32-year-old Yann LeCun showing off the World's first Convolutional Network for Text Recognition
[R] First Order Motion Model applied to animate paintings
[N] AI can turn old photos into moving Images / Link is given in the comments - You can also turn your old photo like this
[D] This AI reveals how much time politicians stare at their phone at work
[D] Types of Machine Learning Papers
[D] The machine learning community has a toxicity problem
[Project] NEW PYTHON PACKAGE: Sync GAN Art to Music with "Lucid Sonic Dreams"! (Link in Comments)
I made a robot that punishes me if it detects that if I am procrastinating on my assignments [P]
[P] Using oil portraits and First Order Model to bring the paintings back to life


<details> <summary>Expected output:</summary>

    [Project] From books to presentations in 10s with AR + ML
    [D] A Demo from 1993 of 32-year-old Yann LeCun showing off the World's first Convolutional Network for Text Recognition
    [R] First Order Motion Model applied to animate paintings
    [N] AI can turn old photos into moving Images / Link is given in the comments - You can also turn your old photo like this
    [D] This AI reveals how much time politicians stare at their phone at work
    [D] Types of Machine Learning Papers
    [D] The machine learning community has a toxicity problem
    [Project] NEW PYTHON PACKAGE: Sync GAN Art to Music with "Lucid Sonic Dreams"! (Link in Comments)
    [P] Using oil portraits and First Order Model to bring the paintings back to life
    [D] Convolution Neural Network Visualization - Made with Unity 3D and lots of Code / source - stefsietz (IG)    
</details>

#### 2. Top 10 Posts of This Week

What are the titles of the top 10 posts of **this week** from your favorite subreddit?

In [10]:
# YOUR CODE HERE
for submission in subreddit.top(time_filter='week', limit=10):
    print(submission.title)

30% of Google's Reddit Emotions Dataset is Mislabeled [D]
[R] mixed reality future — see the world through artistic lenses — made with NeRF
[N] First-Ever Course on Transformers: NOW PUBLIC
[D] Why are Corgi dogs so popular in machine learning (especially in the image generation community)?
[D] Are there any rejected papers that ended up having significant impact in the long run?
[D] Noam Chomsky on LLMs and discussion of LeCun paper (MLST)
[N] Andrej Karpathy is leaving Tesla
[R] So someone actually peer-reviewed this and thought "yeah, looks good"?
[D] How do you verify the novelty of your research?
[N] BigScience Releases their 176 Billion Parameter Open-access Multilingual Language Model


<details><summary>Expected output:</summary>

    [N] Ian Goodfellow, Apple’s director of machine learning, is leaving the company due to its return to work policy. In a note to staff, he said “I believe strongly that more flexibility would have been the best policy for my team.” He was likely the company’s most cited ML expert.
    [R][P] Thin-Plate Spline Motion Model for Image Animation + Gradio Web Demo
    [P] I’ve been trying to understand the limits of some of the available machine learning models out there. Built an app that lets you try a mix of CLIP from Open AI + Apple’s version of MobileNet, and more directly on your phone's camera roll.
    [R] Meta is releasing a 175B parameter language model
    [N] Hugging Face raised $100M at $2B to double down on community, open-source & ethics
    [P] T-SNE to view and order your Spotify tracks
    [D] : HELP Finding a Book - A book written for Google Engineers about foundational Math to support ML
    [R] Scaled up CLIP-like model (~2B) shows 86% Zero-shot on Imagenet
    [D] Do you use NLTK or Spacy for text preprocessing?
    [D] Democratizing Diffusion Models - LDMs: High-Resolution Image Synthesis with Latent Diffusion Models, a 5-minute paper summary by Casual GAN Papers
</details>

💽❓ Data Question:

Check out what other attributes the `praw.models.Submission` class has in the [docs](https://praw.readthedocs.io/en/stable/code_overview/models/submission.html). 

1. After having a chance to look through the docs, is there any other information that you might want to extract? How might this additional data help you?

Write a sample piece of code below extracting three additional pieces of information from the submission below.

#### **Data Question:**
Several other attributes could be useful: `clicked`, `comments`, `num_comments`, `score`, and `upvote_ratio`. They help gauge how the community received a post. `upvote_ratio`, for example, is the share of votes that are upvotes, so it hints at how much of the community agrees with the post, while `score` and `num_comments` indicate popularity and engagement. Popular posts are more likely to contain useful information, so each of the attributes I listed could be used to filter posts and focus on the most informative ones.

In [15]:
# YOUR CODE HERE
for submission in subreddit.top(time_filter='week', limit=10):
    print("Submission Information:")
    print("    Title:", submission.title)
    print("    Upvote Ratio:", submission.upvote_ratio)
    print("    Upvote Score:", submission.score)
    print("    Number of comments:", submission.num_comments)

Submission Information:
    Title: 30% of Google's Reddit Emotions Dataset is Mislabeled [D]
    Upvote Ratio: 0.98
    Upvote Score: 863
    Number of comments: 137
Submission Information:
    Title: [R] mixed reality future — see the world through artistic lenses — made with NeRF
    Upvote Ratio: 0.96
    Upvote Score: 356
    Number of comments: 15
Submission Information:
    Title: [N] First-Ever Course on Transformers: NOW PUBLIC
    Upvote Ratio: 0.92
    Upvote Score: 340
    Number of comments: 37
Submission Information:
    Title: [D] Why are Corgi dogs so popular in machine learning (especially in the image generation community)?
    Upvote Ratio: 0.92
    Upvote Score: 315
    Number of comments: 68
Submission Information:
    Title: [D] Are there any rejected papers that ended up having significant impact in the long run?
    Upvote Ratio: 0.98
    Upvote Score: 291
    Number of comments: 98
Submission Information:
    Title: [D] Noam Chomsky on LLMs and discussion of LeC
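
The attributes printed above can also be gathered into records, which makes filtering easier than reading printed text. The sketch below uses `SimpleNamespace` objects as stand-ins for `praw.models.Submission` so that it runs offline; the helper names are hypothetical and the values are copied from the output above.

```python
from types import SimpleNamespace

FIELDS = ("title", "score", "upvote_ratio", "num_comments")


def submission_to_record(submission):
    """Copy selected attributes of a submission-like object into a dict."""
    return {field: getattr(submission, field, None) for field in FIELDS}


def filter_by_upvote_ratio(records, min_ratio):
    """Keep records whose upvote ratio is known and at least min_ratio."""
    return [
        record
        for record in records
        if record["upvote_ratio"] is not None and record["upvote_ratio"] >= min_ratio
    ]


# Stand-ins for Submission objects (assumption: real ones expose the same attribute names).
fake_submissions = [
    SimpleNamespace(
        title="30% of Google's Reddit Emotions Dataset is Mislabeled [D]",
        score=863,
        upvote_ratio=0.98,
        num_comments=137,
    ),
    SimpleNamespace(
        title="[N] First-Ever Course on Transformers: NOW PUBLIC",
        score=340,
        upvote_ratio=0.92,
        num_comments=37,
    ),
]

records = [submission_to_record(s) for s in fake_submissions]
well_received = filter_by_upvote_ratio(records, min_ratio=0.95)
print([record["title"] for record in well_received])
```

With PRAW, the same helper could be applied to the items yielded by `subreddit.top(time_filter='week', limit=10)`.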

💽❓ Data Question:

2. Is there any information available that might be a concern when it comes to Ethical Data?

#### **Data Question**
Reddit is a public forum where people are freely sharing information. From this perspective, I don't see information available that would be of concern. Reddit doesn't seem to host personal or private information.  

#### 3. Comment Code

Add comments to the code block below to describe what each line of the code does (Refer to [Obtain Comment Instances Section](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html) when necessary). The code is adapted from [this tutorial](https://praw.readthedocs.io/en/stable/tutorials/comments.html)

The purpose is 
1. to understand what the code is doing 
2. to start commenting your code whenever it is not self-explanatory, if you have not already (others will thank you, YOU will thank you later 😊) 

In [16]:
%%time
from praw.models import MoreComments

# list to hold the top-level comment bodies
top_comments = []

# loop through the top 10 submissions of all time in the subreddit
for submission in subreddit.top(limit=10):
    # iterate over the top-level comments of the submission
    for top_level_comment in submission.comments:
        # skip MoreComments placeholders ("load more comments" stubs) instead of expanding them
        if isinstance(top_level_comment, MoreComments):
            continue
        # add the body (text) of the top-level comment to the list
        top_comments.append(top_level_comment.body)

CPU times: user 223 ms, sys: 3.19 ms, total: 226 ms
Wall time: 12.8 s
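
The loop above can be wrapped in a reusable function. The sketch below does so with stand-in objects so that it runs without network access; `PlaceholderComment` is a hypothetical substitute for `praw.models.MoreComments`, and the function name is an assumption, not part of PRAW.

```python
from types import SimpleNamespace


class PlaceholderComment:
    """Hypothetical stand-in for praw.models.MoreComments."""


def collect_top_level_bodies(submissions, placeholder_type):
    """Return the text of every top-level comment, skipping placeholder objects."""
    bodies = []
    for submission in submissions:
        for comment in submission.comments:
            # Placeholders have no body; skip them rather than expanding them.
            if isinstance(comment, placeholder_type):
                continue
            bodies.append(comment.body)
    return bodies


# Stand-ins for submissions, each with a list of top-level comments.
fake_submissions = [
    SimpleNamespace(
        comments=[SimpleNamespace(body="Awesome visualisation"), PlaceholderComment()]
    ),
    SimpleNamespace(
        comments=[SimpleNamespace(body="Will this Turing pass the Turing Test?")]
    ),
]

bodies = collect_top_level_bodies(fake_submissions, PlaceholderComment)
print(len(bodies))
```

With PRAW, the call would look like `collect_top_level_bodies(subreddit.top(limit=10), MoreComments)`. The PRAW comments tutorial also describes `submission.comments.replace_more(limit=0)` as another way to drop the placeholders before iterating.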


#### 4. Inspect Comments

How many comments did you extract from the last step? Examine a few comments. 

In [17]:
# YOUR CODE HERE  # the answer may vary; e.g., 693 for r/machinelearning
print(len(top_comments))

740


In [18]:
import random

[random.choice(top_comments) for i in range(3)]

['And unfortunately some Data Science teams inherit that toxicity.\n\nHere are some some elements that can help:\n\n* place everyone on the same level\n* promote diversity\n* reward inclusivity and support between teammates',
 'This looks fantastic!! Very excited to give this a try later',
 'This wouldn’t be able to get the correct facial movements and body language of the person though. Kinda ruins it for me']

<details> <summary>Some of the comments from `r/machinelearning` subreddit are:</summary>

    ['Awesome visualisation',
    'Similar to a stack or connected neurons.',
    'Will this Turing pass the Turing Test?']
</details>

In [19]:
import random

[random.choice(top_comments) for i in range(5)]

['u/savevideo',
 "I see a lot of comments talking about all the short comings of ML. How there are too many people in the field, how there are not enough, too many graduate students, requirements are to strick or too lenient. \n\nAs someone who is about to enter grad school, in ML, and who is committed to the idea of being apart of this apperant broken machine. How can I be apart of the change that results in something better? Sure, read more papers, be better at research, be more creative, blah blah blah, descriptors that are easy for the experienced to understand and impossible for the young and learning to interpret. \n\nThe reason that science seems to be only nudged by the many and truly pushed by the few is because, in my opinion, success is hardly documented and faults and critism are plentiful. I think if more were willing to mentor, teach and share then we could see more progress. I know I could be better.\n\nFinally, we need a less hand wavy approach to learning how to resear

💽❓ Data Question:

3. After having a chance to review a few samples of 5 comments from the subreddit, what can you say about the data? 

HINT: Think about the "cleanliness" of the data, the content of the data, think about what you're trying to do - how does this data line up with your goal?

#### **Data Question:**
This is an interesting question, because the data is not necessarily "clean." Each comment reflects the perspective of its author, so the data is heavily biased, and the tone on Reddit is casual. If the goal is to monitor public perception or spot trends, Reddit may be an acceptable source; whether it fits really depends on how specific the goal is. Some comments carry no value at all (for example, a comment that only mentions a bot, such as `u/savevideo`), so data cleaning would be needed before we could extract usable data.
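
A minimal cleaning pass might look like the sketch below. The rules are assumptions chosen for illustration (drop comments that are only a username mention, collapse whitespace, drop comments with fewer than two words); a real project would tune them to its goal.

```python
import re

# A comment that consists only of a username mention, e.g. "u/savevideo".
USERNAME_ONLY = re.compile(r"^/?u/[\w-]+$")


def clean_comments(comments, min_words=2):
    """Normalize whitespace and drop comments unlikely to carry sentiment."""
    cleaned = []
    for text in comments:
        # Collapse newlines and repeated spaces into single spaces.
        text = re.sub(r"\s+", " ", text).strip()
        if USERNAME_ONLY.match(text):
            continue
        if len(text.split()) < min_words:
            continue
        cleaned.append(text)
    return cleaned


sample = [
    "u/savevideo",
    "This looks fantastic!! Very excited to give this a try later",
    "Here are some elements:\n\n* place everyone on the same level",
    "   ",
]
cleaned = clean_comments(sample)
print(len(sample), "->", len(cleaned))
```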

#### 5. Extract Top Level Comment from Subreddit `TSLA`.

Write your code to extract top level comments from the top 10 topics of a time period, e.g., year, from subreddit `TSLA` and store them in a list `top_comments_tsla`.  

In [20]:
# YOUR CODE HERE
subreddit = reddit.subreddit("TSLA")
print(subreddit.display_name)

TSLA


In [24]:
%%time
from praw.models import MoreComments

# list to hold the top-level comment bodies
top_comments_tsla = []

# loop through the top 10 submissions of the past year in r/TSLA
for submission in subreddit.top(limit=10, time_filter="year"):
    # iterate over the top-level comments of the submission
    for top_level_comment in submission.comments:
        # skip MoreComments placeholders instead of expanding them
        if isinstance(top_level_comment, MoreComments):
            continue
        # add the body (text) of the top-level comment to the list
        top_comments_tsla.append(top_level_comment.body)

CPU times: user 45.2 ms, sys: 4.11 ms, total: 49.3 ms
Wall time: 11.4 s


In [25]:
len(top_comments_tsla) # Expected: 174 for r/TSLA; your count may vary

158

In [26]:
[random.choice(top_comments_tsla) for i in range(3)]

['Ooof better sell it now and you can buy back a full share!',
 'Been holding this whole time. Just waiting out the pain.',
 'Bought 10 shares at $655!  It was a good time!']

In [27]:
[random.choice(top_comments_tsla) for i in range(5)]

["Hell yeah :D good approach I say.. personally It's my 2nd biggest holding because my crypto grew so much.. But yeah tesla is great\n\nI'm like 35% tsla 50% eth,sol,btc and 15% PLTR and ARKG",
 'golf clap',
 'Congratz all around. I was a little worried when we were at 900 and losing steam.  But I am all in 1370 pre split.  I am arguing over how much icing I want on my cake.',
 'All in since 2019',
 "Holding 415 shares since 450 (presplit)\nI'm gonna hold for another 5-10 years.\n\n\nIf Tsla eventually overpasses Apple as the most valuable company ill become a Teslionaire!!"]

<details>
<summary>Some of the comments from `r/TSLA` subreddit:</summary>

    ['I bought puts',
    '100%',
    'Yes. And I’m bag holding 1200 calls for Friday and am close to throwing myself out the window']
</details>

💽❓ Data Question:

4. Now that you've had a chance to review another subreddit's comments, do you see any differences in the kinds of comments either subreddit has - and how might this relate to bias?

#### **Data Question:**
The two subreddits are very different, which is expected because they cover different topics. `r/TSLA` comments focus on Tesla stock and personal investing (share counts, entry prices, and occasionally crypto holdings), whereas `r/machinelearning` comments discuss research and projects. The topic shapes who posts and what they say, so each subreddit carries its own bias: the `r/TSLA` comments sampled above, for instance, come mostly from people who say they hold the stock.

### Task III: Sentiment Analysis

Let us analyze the sentiment of comments scraped from `r/TSLA` using a pre-trained HuggingFace model to make the inference. Take a [Quick tour](https://huggingface.co/docs/transformers/quicktour). 

#### 1. Import `pipeline`

In [28]:
from transformers import pipeline  # YOUR CODE HERE

#### 2. Create a Pipeline to Perform Task "sentiment-analysis"

In [29]:
sentiment_model = pipeline("sentiment-analysis")  # YOUR CODE HERE

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


#### 3. Get one comment from list `top_comments_tsla` from Task II - 5.

In [30]:
comment = random.choice(top_comments_tsla)

In [31]:
comment

'It’s gonna happen 10000000000000% on December 9th exactly'

The example comment is: `'Bury Burry!!!!!'`. Print out what you get. For reproducibility, use the same comment in the next step; consider setting a seed.
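
One way to make the random pick reproducible is to draw from a seeded generator. The sketch below is a minimal illustration: the helper name `pick_comment` and the seed value are arbitrary, and the short list stands in for `top_comments_tsla`.

```python
import random


def pick_comment(comments, seed):
    """Pick one comment using a local, seeded generator."""
    rng = random.Random(seed)  # local generator; leaves the global random state alone
    return rng.choice(comments)


# Stand-in for top_comments_tsla.
sample_comments = ["I bought puts", "100%", "All in since 2019"]

first = pick_comment(sample_comments, seed=42)
second = pick_comment(sample_comments, seed=42)
print(first == second)
```

The same seed only reproduces the same comment if the list itself is unchanged, and the list returned by the API can differ between runs.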

#### 4. Make Inference!

In [33]:
sentiment = sentiment_model(comment) # YOUR CODE HERE 
print(type(sentiment))

<class 'list'>


What is the type of the output `sentiment`?

#### **Answer**
The output `sentiment` is a `list`. As the next cell shows, it holds one dictionary per input, with `label` and `score` keys.

In [34]:
print(f'The comment: {comment}')
print(f'Predicted Label is {sentiment[0]["label"]} and the score is {sentiment[0]["score"]:.3f}')

The comment: It’s gonna happen 10000000000000% on December 9th exactly
Predicted Label is POSITIVE and the score is 0.895


For the example comment, the output is:

    The comment: Bury Burry!!!!!
    Predicted Label is NEGATIVE and the score is 0.989

🖥️❓ Model Question:

1. What does the score represent?

#### **Answer**
The score is the probability the model assigns to the predicted label, i.e., how confident it is that the comment is POSITIVE or NEGATIVE. In my output, the model gave the comment a probability of 89.5% of being POSITIVE.
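
To make the idea of a score concrete, the sketch below turns two raw model outputs (logits) into probabilities with a softmax and reports the larger one. It assumes the pipeline applies a softmax over the two classes, which is the usual behaviour for single-label classification models; the logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]
logits = [-1.0, 1.2]  # made-up values, not taken from the real model

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
result = {"label": labels[best], "score": probs[best]}
print(result["label"], round(result["score"], 3))
```

With two classes, the reported score is always at least 0.5, because the predicted label is the one with the larger probability.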

### Task IV: Put All Together

Let's put all the pieces together and create a simple script that does the following:

- get the subreddit
- get comments from the top posts for given subreddit
- run sentiment analysis 

#### Complete the Script

Once you complete the code, running the following block writes it to a new Python script saved as `top_tlsa_comment_sentiment.py` in the same directory as the notebook. 

In [45]:
%%writefile top_tlsa_comment_sentiment.py

import secrets_reddit
import random

from typing import Dict, List

from praw import Reddit
from praw.models.reddit.subreddit import Subreddit
from praw.models import MoreComments

from transformers import pipeline


def get_subreddit(display_name: str) -> Subreddit:
    """Get subreddit object from display name

    Args:
        display_name (str): Name of the subreddit without the "r/" prefix, e.g. "TSLA"

    Returns:
        Subreddit: Subreddit object for the given display name
    """
    reddit = Reddit(
        client_id=secrets_reddit.REDDIT_API_CLIENT_ID,
        client_secret=secrets_reddit.REDDIT_API_CLIENT_SECRET,
        user_agent=secrets_reddit.REDDIT_API_USER_AGENT,
    )

    subreddit = reddit.subreddit(display_name)  # YOUR CODE HERE
    return subreddit

def get_comments(subreddit: Subreddit, limit: int = 3) -> List[str]:
    """Get top-level comments from the top posts of a subreddit

    Args:
        subreddit (Subreddit): Subreddit to read from
        limit (int, optional): Number of top posts to scan. Defaults to 3.

    Returns:
        List[str]: List of comments
    """
    top_comments = []
    for submission in subreddit.top(limit=limit):
        for top_level_comment in submission.comments:
            if isinstance(top_level_comment, MoreComments):
                continue
            top_comments.append(top_level_comment.body)
    return top_comments

def run_sentiment_analysis(comment: str) -> Dict:
    """Run sentiment analysis on comment using default distilbert model

    Args:
        comment (str): Text of the comment to classify

    Returns:
        Dict: Sentiment analysis result with "label" and "score" keys
    """
    sentiment_model = pipeline("sentiment-analysis")  # YOUR CODE HERE
    sentiment = sentiment_model(comment)
    return sentiment[0]


if __name__ == '__main__':
    subreddit = get_subreddit('TSLA')  # YOUR CODE HERE
    comments = get_comments(subreddit)
    comment = random.choice(comments)  # YOUR CODE HERE
    sentiment = run_sentiment_analysis(comment)
    
    print(f'The comment: {comment}')
    print(f'Predicted Label is {sentiment["label"]} and the score is {sentiment["score"]:.3f}')

Overwriting top_tlsa_comment_sentiment.py


Run the following block to see the output.

In [46]:
!python top_tlsa_comment_sentiment.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
The comment: Let’s see if this investment has any effect on the actual stock. Good or bad.
Predicted Label is NEGATIVE and the score is 0.999


<details><summary> Expected output:</summary>

    No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
    The comment: When is DOGE flying
    Predicted Label is POSITIVE and the score is 0.689
</details>
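
A single random comment says little about the subreddit as a whole. As a possible next step, the sketch below summarizes a batch of results by label. It works on plain dictionaries shaped like the pipeline output; the scores are taken from outputs shown in this notebook, and the function name is hypothetical.

```python
from collections import Counter


def summarize_sentiments(results):
    """Count predictions per label and compute the mean score per label."""
    counts = Counter(result["label"] for result in results)
    mean_scores = {
        label: sum(r["score"] for r in results if r["label"] == label) / count
        for label, count in counts.items()
    }
    return counts, mean_scores


# Dictionaries shaped like the pipeline output (stand-ins, no model is loaded here).
fake_results = [
    {"label": "NEGATIVE", "score": 0.999},
    {"label": "POSITIVE", "score": 0.689},
    {"label": "POSITIVE", "score": 0.895},
]

counts, mean_scores = summarize_sentiments(fake_results)
for label, count in counts.items():
    print(label, count, round(mean_scores[label], 3))
```

With the real pipeline, calling `sentiment_model` on a list of comment strings should return one such dictionary per comment, which could then be passed to this function.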

💽❓ Data Question:

5. Is the subreddit active? About how many posts or threads per day? How could you find this information?

#### **Data Question:**
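Yes, the subreddit is active, but only modestly so: the top posts of the day amount to three submissions with a handful of comments each. The 300 printed by the second code block below is not a daily comment count. `subreddit.comments()` returns the subreddit's most recent comments (up to 100 by default) regardless of date, and the loop calls it once per submission, so the same batch is counted three times. A more direct way to answer the question is to count recent submissions by their creation date. See the code blocks below.  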

In [49]:
for submission in subreddit.top(time_filter='day', limit=10):
    print("Submission Information:")
    print("    Title:", submission.title)
    print("    Upvote Ratio:", submission.upvote_ratio)
    print("    Upvote Score:", submission.score)
    print("    Number of comments:", submission.num_comments)

Submission Information:
    Title: Texas Power Grid Woes Hit Toyota, Tesla
    Upvote Ratio: 0.88
    Upvote Score: 6
    Number of comments: 4
Submission Information:
    Title: Think Tesla's Going Down More? New ETF Rises When Tesla Falls
    Upvote Ratio: 0.84
    Upvote Score: 4
    Number of comments: 4
Submission Information:
    Title: Munich court orders Tesla to reimburse customer for Autopilot Phantom Breaking problems
    Upvote Ratio: 0.76
    Upvote Score: 2
    Number of comments: 2


In [62]:
comment_array = []
author_array = []
for submission in subreddit.top(time_filter='day'):
    print("Submission Information:")
    print("    Title:", submission.title)
    print("    Upvote Ratio:", submission.upvote_ratio)
    print("    Upvote Score:", submission.score)
    print("    Number of comments:", submission.num_comments)
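    # NOTE: subreddit.comments() yields the subreddit's most recent comments
    # (up to 100 by default), not the comments of this submission, so every
    # pass through the outer loop appends the same batch again.
    for comment in subreddit.comments():
        comment_array.append(comment)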
print(len(comment_array))

Submission Information:
    Title: Texas Power Grid Woes Hit Toyota, Tesla
    Upvote Ratio: 0.89
    Upvote Score: 7
    Number of comments: 4
Submission Information:
    Title: Think Tesla's Going Down More? New ETF Rises When Tesla Falls
    Upvote Ratio: 1.0
    Upvote Score: 5
    Number of comments: 4
Submission Information:
    Title: Munich court orders Tesla to reimburse customer for Autopilot Phantom Breaking problems
    Upvote Ratio: 0.76
    Upvote Score: 2
    Number of comments: 2
300
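
To estimate posts per day more directly, one option is to group submissions by the UTC date of their `created_utc` timestamp. The sketch below uses stand-in objects with hypothetical timestamps so that it runs offline; the function name is an assumption.

```python
from collections import Counter
from datetime import datetime, timezone
from types import SimpleNamespace


def posts_per_day(submissions):
    """Count submissions per UTC calendar day, based on created_utc (Unix seconds)."""
    days = Counter()
    for submission in submissions:
        day = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc).date()
        days[day] += 1
    return days


def utc_timestamp(year, month, day, hour):
    """Helper for building demo timestamps."""
    return datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp()


# Stand-ins for Submission objects; the dates are hypothetical.
fake_submissions = [
    SimpleNamespace(created_utc=utc_timestamp(2022, 7, 15, 9)),
    SimpleNamespace(created_utc=utc_timestamp(2022, 7, 15, 18)),
    SimpleNamespace(created_utc=utc_timestamp(2022, 7, 16, 7)),
]

per_day = posts_per_day(fake_submissions)
for day, count in sorted(per_day.items()):
    print(day.isoformat(), count)
```

With PRAW, the submissions could come from `subreddit.new(limit=...)`, which lists the most recent posts rather than the highest-scoring ones.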


💽❓ Data Question:

6. Does there seem to be a large distribution of posters or a smaller concentration of posters who are very active? What kind of impact might this have on the data?

In [73]:
%%time
from praw.models import MoreComments

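# lists to hold the top-level comment bodies and their authors
top_comments_tsla = []
author = []

# loop through the day's top submissions (far fewer than the limit of 1000 exist)
for submission in subreddit.top(limit=1000, time_filter="day"):
    # iterate over the top-level comments of the submission
    for top_level_comment in submission.comments:
        # skip MoreComments placeholders instead of expanding them
        if isinstance(top_level_comment, MoreComments):
            continue
        # store the comment text and its author (None if the account was deleted)
        top_comments_tsla.append(top_level_comment.body)
        author.append(top_level_comment.author)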

CPU times: user 12.3 ms, sys: 1.42 ms, total: 13.7 ms
Wall time: 30.6 s


In [74]:
print(len(top_comments_tsla))
print(len(author))

6
6


#### **Data Question**
It appears that only a small number of posters are active: the day's top posts yielded just 6 top-level comments. The impact on the data is low diversity and potentially large bias, since we are not capturing a broad, representative portion of the population. 
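
To put a number on poster concentration, one could measure what share of comments comes from the most active authors. The sketch below works on a plain list of author names, with `None` standing for deleted accounts; the names are made up, and the function is a hypothetical helper rather than part of PRAW.

```python
from collections import Counter


def author_concentration(author_names, top_k=3):
    """Return the share of comments written by the top_k most active authors."""
    names = [name for name in author_names if name is not None]
    if not names:
        return 0.0
    counts = Counter(names)
    top_total = sum(count for _, count in counts.most_common(top_k))
    return top_total / len(names)


# Made-up author names; None marks a comment whose author is unavailable.
fake_authors = ["alice", "alice", "alice", "bob", None, "carol"]

share = author_concentration(fake_authors, top_k=1)
print(f"Top author wrote {share:.0%} of the comments with a known author")
```

A share close to 1 would indicate that a few posters dominate the sample, while a low share points to a broader spread of authors. With only a handful of comments, as in the output above, the figure is not very informative.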