# Project 3: Web APIs & NLP

## **Problem Statement**

- Is it clear what the goal of the project is?
- What type of model will be developed?
- How will success be evaluated?
- Is the scope of the project appropriate?
- Is it clear who cares about this or why this is important to investigate?
- Does the student consider the audience and the primary and secondary stakeholders?

## **Data Collection**

- Was enough data gathered to generate a significant result? (At least 1000 posts per subreddit)
- Was data collected that was useful and relevant to the project?
- Was data collection and storage optimized through custom functions, pipelines, and/or automation?
- Was thought given to the server receiving the requests, for example by limiting the number of requests per second?
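
As a rough illustration of the last point, the sketch below spaces out work items so that a server is never hit faster than a chosen rate. The helper name `throttled` and its parameters are hypothetical and not part of this notebook. PRAW already paces its own requests, so a helper like this would mainly matter when calling an API directly with `requests`.

```python
import time


def throttled(items, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than one per `min_interval` seconds.

    `sleep` and `clock` are injectable so the pacing can be checked without waiting.
    """
    last = None
    for item in items:
        if last is not None:
            # Wait only for the part of the interval that has not already elapsed
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item


# Hypothetical usage: at most two iterations per second
for page in throttled(range(3), min_interval=0.5):
    pass  # a real loop would issue one HTTP request per iteration here
```

Wrapping the iterable keeps the pacing logic out of the request code, so the interval can be tuned in one place.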

In [174]:
#imports
import praw
import requests
import pandas as pd
import numpy as np
import time
import sys
import re


from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

import spacy
nlp = spacy.load("en_core_web_sm")

if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

In [175]:
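# Read credentials from environment variables rather than hard-coding secrets in the notebook.
# The variable names below are assumptions; set them in the shell before starting Jupyter.
import os

reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent=os.environ["REDDIT_USER_AGENT"],
    username=os.environ["REDDIT_USERNAME"],
    password=os.environ["REDDIT_PASSWORD"],
)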

In [176]:
# List of subreddits to collect posts from
subreddit_names = ["Marvel", "harrypotter"]

In [177]:
# Dictionary to store collected data
collected_data = {subreddit: [] for subreddit in subreddit_names}

In [178]:
analyzer = SentimentIntensityAnalyzer()

# Define a custom function to calculate sentiment
def calculate_sentiment(text):
    sentiment = analyzer.polarity_scores(text)
    return sentiment['compound']

# Register a custom Doc extension for sentiment (nlp is already loaded in the imports cell).
# spaCy passes the Doc itself to the getter, so hand VADER the Doc's raw text.
spacy.tokens.Doc.set_extension('sentiment', getter=lambda doc: calculate_sentiment(doc.text), force=True)

In [179]:
# Define a function to count URLs in a post's title and body
def count_external_links(title, text):
    # Match http(s) URLs up to the next whitespace, quote, or closing bracket
    url_pattern = r'https?://[^\s)\]>"]+'
    combined_text = f"{title} {text}"
    return len(re.findall(url_pattern, combined_text))
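
The counter above treats every URL as a link, including links back to Reddit itself. If only off-site links are of interest, one possible refinement is sketched below. The function name `count_links_by_kind`, the simplified URL pattern, and the `INTERNAL_HOSTS` list are assumptions made for illustration and are not used elsewhere in this notebook.

```python
import re
from urllib.parse import urlparse

# Simplified pattern: an http(s) URL up to the next whitespace, quote, or closing bracket
URL_PATTERN = re.compile(r'https?://[^\s)\]>"]+')

# Assumption: hosts that should count as Reddit-internal
INTERNAL_HOSTS = ("reddit.com", "redd.it")


def count_links_by_kind(title, text):
    """Return the total number of URLs and how many point outside Reddit."""
    urls = URL_PATTERN.findall(f"{title} {text}")
    external = 0
    for url in urls:
        host = urlparse(url).hostname or ""
        is_internal = any(host == h or host.endswith("." + h) for h in INTERNAL_HOSTS)
        if not is_internal:
            external += 1
    return {"total": len(urls), "external": external}


# Made-up example strings, not taken from the collected data
result = count_links_by_kind(
    "See https://www.reddit.com/r/Marvel",
    "and https://example.com/page (source)",
)
```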

In [180]:
# Collect posts from each subreddit
desired_post_count = 1000
# A single listing may return fewer posts than needed, so several time filters are tried in turn
time_filters = ["all", "year", "month", "week", "day"]

for subreddit_name in subreddit_names:
    subreddit = reddit.subreddit(subreddit_name)

    # Track collected IDs in a set so the duplicate check is a constant-time lookup
    seen_ids = {post["id"] for post in collected_data[subreddit_name]}
    collected_posts = 0

    for time_filter in time_filters:
        if collected_posts >= desired_post_count:
            break

        # PRAW pages through the listing lazily, fetching up to 100 posts per request
        for submission in subreddit.top(limit=None, time_filter=time_filter):
            if collected_posts >= desired_post_count:
                break  # Stop once the desired count is reached

            # Skip posts already collected under an earlier time filter
            if submission.id in seen_ids:
                continue
            seen_ids.add(submission.id)

            post_data = {
                "id": submission.id,
                "title": submission.title,
                "content": submission.selftext or "",
                "score": submission.score,
                "num_comments": submission.num_comments,
                "author": submission.author.name if submission.author else "Unknown",
                "created_utc": submission.created_utc,
                "upvote_ratio": submission.upvote_ratio,
                "subreddit_flair": submission.link_flair_text,
                "submission_datetime": submission.created_utc,
                "post_type": submission.is_self,
            }

            # Entity recognition on the title plus body text
            doc = nlp(submission.title + " " + submission.selftext)
            post_data["entity_recognition"] = [ent.text for ent in doc.ents]

            # Sentiment analysis: VADER compound score of the same text
            post_data["text_sentiment"] = calculate_sentiment(doc.text)

            # Count URLs that appear in the title and body
            post_data["external_links_count"] = count_external_links(submission.title, submission.selftext)

            collected_data[subreddit_name].append(post_data)
            collected_posts += 1

In [181]:
# Create dataframes from collected data
dataframes = {subreddit: pd.DataFrame(data) for subreddit, data in collected_data.items()}

In [182]:
for subreddit, df in dataframes.items():
    print(f"Data from r/{subreddit}:")
    display(df.head())

Data from r/Marvel:


| | id | title | content | score | num_comments | author | created_utc | upvote_ratio | subreddit_flair | submission_datetime | post_type | entity_recognition | text_sentiment | external_links_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | yktbuk | We need more scenes like this | | 46797 | 1188 | steikul | 1667453000.0 | 0.91 | Film/Television | 1667453000.0 | False | [] | 0.4173 | 0 |
| 1 | 8jok7z | Fun in the sun with Dr. Strange | | 42427 | 198 | Scaulbylausis | 1526414000.0 | 0.95 | Fan Made | 1526414000.0 | False | [Strange] | 0.3612 | 0 |
| 2 | 87i3qf | Neat | | 40204 | 216 | Isai76 | 1522155000.0 | 0.92 | Other | 1522155000.0 | False | [] | 0.4588 | 0 |
| 3 | bjkx77 | My Son's Graduation Mortarboard | | 33414 | 239 | TehErk | 1556737000.0 | 0.91 | Fan Made | 1556737000.0 | False | [] | 0.0 | 0 |
| 4 | 8m78k8 | Chris Evans being great as usual | | 33406 | 596 | Unknown | 1527303000.0 | 0.93 | | 1527303000.0 | False | [Chris Evans] | 0.6249 | 0 |


Data from r/harrypotter:


| | id | title | content | score | num_comments | author | created_utc | upvote_ratio | subreddit_flair | submission_datetime | post_type | entity_recognition | text_sentiment | external_links_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rtqob0 | How to make a grown man cry: Hagrid Edition. | | 80480 | 871 | vpsj | 1641066000.0 | 0.94 | Discussion | 1641066000.0 | False | [] | -0.4767 | 0 |
| 1 | ll4kki | Warwick Davis and his various roles in Harry P... | | 79977 | 1012 | 200020124 | 1613485000.0 | 0.96 | Behind the Scenes | 1613485000.0 | False | [Warwick Davis, Harry Potter] | 0.0 | 0 |
| 2 | ik1goe | After waiting nearly 2 months for my missing p... | | 76674 | 1186 | blackmachine312 | 1598892000.0 | 0.91 | Merchandise | 1598892000.0 | False | [nearly 2 months] | -0.4184 | 0 |
| 3 | sfgxxy | Another dungbomb from my pensive | | 67758 | 310 | Voldyneedsnose | 1643458000.0 | 0.88 | Dungbomb | 1643458000.0 | False | [] | 0.0772 | 0 |
| 4 | dfj3pq | I made a model of Azkaban | | 62153 | 905 | BlandDandelion | 1570637000.0 | 0.96 | Fanworks | 1570637000.0 | False | [Azkaban] | 0.0 | 0 |


In [183]:
# For r/harrypotter
# Check the title and content of each row in the DataFrame
df = dataframes["harrypotter"]  # select explicitly instead of reusing the earlier loop variable
for index, row in df.iterrows():
    title = row['title']
    content = row['content']

    # pd.notna only catches missing values; an empty string still prints as a blank
    if pd.notna(title):
        print(f"Title (Row {index}): {title}")
    else:
        print(f"Title (Row {index}): Empty")

    if pd.notna(content):
        print(f"Content (Row {index}): {content}")
    else:
        print(f"Content (Row {index}): Empty")

Title (Row 0): How to make a grown man cry: Hagrid Edition.
Content (Row 0): 
Title (Row 1): Warwick Davis and his various roles in Harry Potter
Content (Row 1): 
Title (Row 2): After waiting nearly 2 months for my missing pieces, it's finally complete!!
Content (Row 2): 
Title (Row 3): Another dungbomb from my pensive
Content (Row 3): 
Title (Row 4): I made a model of Azkaban
Content (Row 4): 
Title (Row 5): You should've atleast asked, Potter
Content (Row 5): 
Title (Row 6): Accidentally bought the gen Z/ how do you do fellow kids dialect version of Philosopher's Stone and I'm dead 💀
Content (Row 6): 
Title (Row 7): Never thought about that.
Content (Row 7): 
Title (Row 8): We all know who the favorite child was
Content (Row 8): 
Title (Row 9): In this perspective....
Content (Row 9): 
Title (Row 10): My Halloween costume !
Content (Row 10): 
Title (Row 11): And that just makes it better
Content (Row 11): 
Title (Row 12): So not a true fan
Content (Row 12): 
Title (Row 13): Tom Fe

In [184]:
# For r/harrypotter
# Count posts with and without body text
df = dataframes["harrypotter"]  # select explicitly instead of reusing the earlier loop variable

# A post counts as empty when its content is missing or only whitespace
is_empty = df["content"].isna() | (df["content"].str.strip() == "")
empty_content_count = int(is_empty.sum())
full_content_count = int((~is_empty).sum())

print("Number of Full Content:", full_content_count)
print("Number of Empty Content:", empty_content_count)

Number of Full Content: 36
Number of Empty Content: 964


In [185]:
# Save dataframes as CSV files
for subreddit, df in dataframes.items():
    # Define the filename for each subreddit
    filename = f"{subreddit.lower().replace(' ', '')}.csv"
    # Save the dataframe as a CSV file
    df.to_csv(filename, index=False)

## **Data Cleaning and EDA**

- Are missing values imputed/handled appropriately?
- Are distributions examined and described?
- Are outliers identified and addressed?
- Are appropriate summary statistics provided?
- Are steps taken during data cleaning and EDA framed appropriately?
- Does the student address whether or not they are likely to be able to answer their problem statement with the provided data given what they've discovered during EDA?
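
Since 964 of the 1,000 r/harrypotter posts collected above have no body text, one way to handle the missing content is to fold the title and content into a single text field rather than dropping rows. The sketch below shows that idea on plain dictionaries. The helper name `build_text_field` is hypothetical, and treating any non-string value (such as `None` or NaN) as empty is an assumption.

```python
def _clean(value):
    """Return a stripped string, treating None, NaN, and other non-strings as empty."""
    return value.strip() if isinstance(value, str) else ""


def build_text_field(post):
    """Combine a post's title and content into one text field for modeling."""
    title = _clean(post.get("title"))
    content = _clean(post.get("content"))
    return f"{title} {content}".strip()


# Titles taken from the sample output above; the second content value is made up
posts = [
    {"title": "Neat", "content": ""},
    {"title": "I made a model of Azkaban", "content": "  Built from cardboard.  "},
    {"title": "Never thought about that.", "content": float("nan")},
]
texts = [build_text_field(post) for post in posts]
```

With a pandas DataFrame, the same function could be applied row by row, for example with `df.apply(build_text_field, axis=1)`.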

## **Preprocessing and Modeling**

- Is text data successfully converted to a matrix representation?
- Are methods such as stop words, stemming, and lemmatization explored?
- Does the student properly split and/or sample the data for validation/training purposes?
- Does the student test and evaluate a variety of models to identify a production algorithm (**AT MINIMUM:** two models)?
- Does the student defend their choice of production model relevant to the data at hand and the problem?
- Does the student explain how the model works and evaluate its performance successes/downfalls?
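
To make "matrix representation" concrete, here is a minimal bag-of-words sketch using only the standard library. It is illustrative only: the function names are hypothetical, the tokenizer is deliberately crude, and a real project would more likely use scikit-learn's `CountVectorizer`. The titles are taken from the sample output earlier in this notebook.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_vocabulary(texts):
    """Map every token seen in the training texts to a column index."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(vocab)}


def transform(texts, vocabulary):
    """Turn each text into a row of token counts; unseen tokens are ignored."""
    matrix = []
    for text in texts:
        counts = Counter(tokenize(text))
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:
                row[vocabulary[tok]] = n
        matrix.append(row)
    return matrix


titles = ["I made a model of Azkaban", "Neat", "Fun in the sun with Dr. Strange"]
vocabulary = fit_vocabulary(titles)
X = transform(titles, vocabulary)  # one row per title, one column per token
```

Fitting the vocabulary on the training texts only, and then reusing it to transform the validation texts, keeps information from the held-out data out of the model.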

## **Evaluation and Conceptual Understanding**

- Does the student accurately identify and explain the baseline score?
- Does the student select and use metrics relevant to the problem objective?
- Does the student interpret the results of their model for purposes of inference?
- Is domain knowledge demonstrated when interpreting results?
- Does the student provide appropriate interpretation with regards to descriptive and inferential statistics?
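
For a classification task, the baseline score is commonly taken to be the accuracy of always predicting the most frequent class. The sketch below computes it. The function name `baseline_accuracy` is hypothetical, and the label counts assume the collection step reached its target of 1,000 posts for each subreddit.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return the majority label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Assumption: 1,000 posts were collected from each subreddit
labels = ["Marvel"] * 1000 + ["harrypotter"] * 1000
majority_label, baseline = baseline_accuracy(labels)
print(f"Baseline accuracy: {baseline:.2f}")
```

With balanced classes, a model has to beat an accuracy of 0.5 to show that it has learned anything from the text.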

## **Conclusion and Recommendations**

- Does the student provide appropriate context to connect individual steps back to the overall project?
- Is it clear how the final recommendations were reached?
- Are the conclusions/recommendations clearly stated?
- Does the conclusion answer the original problem statement?
- Does the student address how findings of this research can be applied for the benefit of stakeholders?
- Are future steps to move the project forward identified?