# Problem Statement

Disney+ launched in November 2019 and has quickly become a major player in the streaming industry.

Even so, Disney+ faces a formidable challenge in competing with Netflix’s market dominance.

This analysis aims to understand public perception, user preferences, and emerging trends through posts on Reddit.

# Reddit Exploration

#### Public Perception from Reddit

Every day, millions of people around the world post, vote, and comment in communities organized around their interests.

Subreddits are user-created communities where discussions on Reddit are organized, and are denoted by the prefix “r/”.
For this exercise, we will scrape posts from r/DisneyPlus and r/netflix.


In [None]:
# Import relevant libraries
import requests
import time
import pandas as pd

In [6]:
# Provide the url to be scraped and initiate requests
disney_url = 'https://www.reddit.com/r/disneyplus/.json'
headers = {'User-agent': 'Irvan'}
res = requests.get(disney_url, headers=headers)

In [7]:
# Use status codes to understand how the target server responds to your request.
res.status_code
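
As a rough guide to reading the value returned by the cell above, the sketch below maps a status code to a short description. The helper name `describe_status` and the wording of each outcome are assumptions made for illustration; only `http.HTTPStatus` comes from the Python standard library.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable summary of an HTTP status code.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    try:
        phrase = HTTPStatus(code).phrase  # e.g. "OK", "Too Many Requests"
    except ValueError:
        phrase = "Unknown"

    if 200 <= code < 300:
        outcome = "success"
    elif code == 429:
        outcome = "rate limited, slow down and retry later"
    elif 400 <= code < 500:
        outcome = "client error, check the URL and headers"
    elif 500 <= code < 600:
        outcome = "server error, retry later"
    else:
        outcome = "other"
    return f"{code} {phrase}: {outcome}"


for code in (200, 403, 429, 503):
    print(describe_status(code))
```

In the notebook this could be called as `describe_status(res.status_code)` before attempting `res.json()`.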

In [10]:
disney_json = res.json()
disney_json

In [12]:
# Evaluate the dictionary keys
sorted(disney_json.keys())

['data', 'kind']

In [13]:
# And check the keys inside 'data'
sorted(disney_json['data'].keys())

['after', 'before', 'children', 'dist', 'geo_filter', 'modhash']

In [14]:
# The posts are contained in the 'children' key, hence, check and see what is inside 'children'
disney_json['data']['children'][2]

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'DisneyPlus',
  'selftext': 'This movie is NOT bad it’s not my favorite Disney movie but the songs are catchy and it’s a good family movie. I think we forget didn’t had so many hits they can make some that are just good and it’s not the end of the world ',
  'author_fullname': 't2_noc710nje',
  'saved': False,
  'mod_reason_title': None,
  'gilded': 0,
  'clicked': False,
  'title': 'Just watched wish ',
  'link_flair_richtext': [{'a': ':Discussion:',
    'e': 'emoji',
    'u': 'https://emoji.redditmedia.com/8yr06i86r5fa1_t5_r0hux/Discussion'},
   {'e': 'text', 't': ' Discussion'}],
  'subreddit_name_prefixed': 'r/DisneyPlus',
  'hidden': False,
  'pwls': 6,
  'link_flair_css_class': '',
  'downs': 0,
  'thumbnail_height': None,
  'top_awarded_type': None,
  'hide_score': False,
  'name': 't3_1bw34pt',
  'quarantine': False,
  'link_flair_text_color': 'light',
  'upvote_ratio': 0.78,
  'author_flair_background_color': Non

In [15]:
disney_json['data']['after']

't3_1bv325e'
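
The `after` value is the token that points to the next page of the listing: passing it back as the `after` query parameter returns the posts that follow. A minimal sketch of that pagination pattern is shown here. The generator name `iter_listing_pages` and the injected `fetch_page` callable are assumptions for illustration, chosen so the loop can be tried without network access.

```python
def iter_listing_pages(fetch_page, max_pages=50):
    """Yield listing pages, following the 'after' token until it runs out.

    fetch_page is a hypothetical callable: it takes a dict of query
    parameters and returns the parsed JSON listing as a dict.
    """
    after = None
    for _ in range(max_pages):
        # The first request has no token; later requests pass the last one seen
        params = {"after": after} if after else {}
        listing = fetch_page(params)
        yield listing
        after = listing["data"]["after"]
        if after is None:  # no further pages
            break


# Offline demonstration with two made-up pages
fake_pages = {
    None: {"data": {"after": "t3_1bv325e", "children": ["page 1 posts"]}},
    "t3_1bv325e": {"data": {"after": None, "children": ["page 2 posts"]}},
}
for page in iter_listing_pages(lambda params: fake_pages[params.get("after")]):
    print(page["data"]["children"], "-> next token:", page["data"]["after"])
```

With the objects defined earlier in the notebook, `fetch_page` could be `lambda params: requests.get(disney_url, params=params, headers=headers).json()`.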

# Gather Data

In this section, we collect data by fetching posts from several listings of the r/DisneyPlus and r/netflix subreddits through Reddit's public JSON endpoints. The code requests the URLs for the **hot, new, top, and rising** listings and records information about each post, including its **ID, title, selftext, creation date, URL, and upvote ratio**. Here are the details of each field:

1. id: the unique ID of the post (not of the user who wrote it)
2. title: the title of the post
3. selftext: the body text of the post, which is empty for link-only posts
4. created_date: the timestamp of when the post was created, in Unix time format (taken from the created field)
5. url: the URL the post points to
6. upvote_ratio: the ratio of upvotes to total votes cast on the post
7. num_comments: the number of comments on the post (collected for r/DisneyPlus only)
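
To make the field list concrete, here is a small sketch that pulls those fields out of one entry of `children` and converts the Unix timestamp into a readable UTC datetime. The helper names `extract_post_fields` and `unix_to_utc` are hypothetical, and the `created` and `url` values in the sample are placeholders, since the truncated output above does not show them.

```python
from datetime import datetime, timezone


def extract_post_fields(post):
    """Pick the fields of interest from one entry of data['children']."""
    data = post["data"]
    return {
        "id": data["id"],
        "title": data["title"],
        "selftext": data["selftext"],
        "created_date": data["created"],  # Unix time, in seconds
        "url": data["url"],
        "upvote_ratio": data["upvote_ratio"],
    }


def unix_to_utc(timestamp):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


# Sample post: title and upvote_ratio are taken from the output shown earlier,
# while 'created' and 'url' are made-up placeholder values.
sample_post = {
    "kind": "t3",
    "data": {
        "id": "1bw34pt",
        "title": "Just watched wish ",
        "selftext": "This movie is NOT bad ...",
        "created": 1712275200.0,  # placeholder
        "url": "https://www.reddit.com/r/DisneyPlus/",  # placeholder
        "upvote_ratio": 0.78,
    },
}

row = extract_post_fields(sample_post)
print(row["id"], row["upvote_ratio"])
print(unix_to_utc(row["created_date"]).isoformat())
```

Keeping the raw Unix value in the dataset and converting only when needed avoids losing information if a different time zone is wanted later.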

### DisneyPlus Reddit Data Scraping

In [17]:
# Create an empty list for the posts, a set to track seen IDs, and the URLs to fetch content from

disney_posts = []
disney_post_ids = set()
headers = {'User-Agent': 'Irvan'}
disney_urls = ['https://www.reddit.com/r/DisneyPlus/hot/.json',
               'https://www.reddit.com/r/DisneyPlus/new/.json',
               'https://www.reddit.com/r/DisneyPlus/top/.json',
               'https://www.reddit.com/r/DisneyPlus/rising/.json']

# Make up to 50 requests for each URL
for url in disney_urls:
    after = None  # Restart pagination for each listing
    for i in range(50):
        params = {'after': after} if after else {}
        res = requests.get(url, params=params, headers=headers)

        # Stop paging this URL if the request fails
        if res.status_code != 200:
            print(res.status_code)
            break

        # Collect the required fields from each post not seen before
        json_data = res.json()
        for post in json_data['data']['children']:
            post_id = post['data']['id']
            if post_id not in disney_post_ids:
                post_data = {
                    'id': post_id,
                    'title': post['data']['title'],
                    'selftext': post['data']['selftext'],
                    'created_date': post['data']['created'],
                    'url': post['data']['url'],
                    'upvote_ratio': post['data']['upvote_ratio'],
                    'num_comments': post['data']['num_comments']
                }
                disney_posts.append(post_data)
                disney_post_ids.add(post_id)

        # Add a delay to avoid hitting rate limits
        time.sleep(1)

        # A missing 'after' token means there are no more pages for this URL
        after = json_data['data']['after']
        if after is None:
            break

### Netflix Reddit Data Scraping

In [18]:
# Apply the same code to request posts from the r/netflix subreddit

netflix_posts = []
netflix_post_ids = set()
headers = {'User-Agent': 'Irvan'}
netflix_urls = ['https://www.reddit.com/r/netflix/hot/.json',
                'https://www.reddit.com/r/netflix/new/.json',
                'https://www.reddit.com/r/netflix/top/.json',
                'https://www.reddit.com/r/netflix/rising/.json']

for url in netflix_urls:
    after = None  # Restart pagination for each listing
    for i in range(50):  # Make up to 50 requests for each URL
        params = {'after': after} if after else {}
        res = requests.get(url, params=params, headers=headers)

        # Stop paging this URL if the request fails
        if res.status_code != 200:
            print(res.status_code)
            break

        # Collect the required fields from each post not seen before
        json_data = res.json()
        for post in json_data['data']['children']:
            post_id = post['data']['id']
            if post_id not in netflix_post_ids:
                post_data = {
                    'id': post_id,
                    'title': post['data']['title'],
                    'selftext': post['data']['selftext'],
                    'created_date': post['data']['created'],
                    'url': post['data']['url'],
                    'upvote_ratio': post['data']['upvote_ratio'],
                }
                netflix_posts.append(post_data)
                netflix_post_ids.add(post_id)

        time.sleep(1)  # Add a delay to avoid hitting rate limits

        # A missing 'after' token means there are no more pages for this URL
        after = json_data['data']['after']
        if after is None:
            break
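
Because the two scraping cells are nearly identical, the shared logic could be folded into one function. The sketch below is one possible shape, not the notebook's actual implementation: the names `scrape_subreddit`, `SORTS`, and `fetch_page` are hypothetical, and the HTTP call is injected as a callable so the loop can be exercised offline with made-up data.

```python
import time

SORTS = ("hot", "new", "top", "rising")


def scrape_subreddit(subreddit, fetch_page, sorts=SORTS, max_pages=50, delay=1.0):
    """Collect unique posts from several listings of one subreddit.

    fetch_page(url, params) is a hypothetical callable that returns the
    parsed JSON listing for one request.
    """
    posts, seen_ids = [], set()
    for sort in sorts:
        url = f"https://www.reddit.com/r/{subreddit}/{sort}/.json"
        after = None  # restart pagination for each listing
        for _ in range(max_pages):
            params = {"after": after} if after else {}
            listing = fetch_page(url, params)
            for child in listing["data"]["children"]:
                data = child["data"]
                if data["id"] in seen_ids:
                    continue  # the same post can appear under several sorts
                seen_ids.add(data["id"])
                posts.append({
                    "id": data["id"],
                    "title": data["title"],
                    "selftext": data["selftext"],
                    "created_date": data["created"],
                    "url": data["url"],
                    "upvote_ratio": data["upvote_ratio"],
                })
            time.sleep(delay)  # pause between requests
            after = listing["data"]["after"]
            if after is None:  # no more pages for this listing
                break
    return posts


# Offline demonstration: every listing returns the same single made-up post
def fake_fetch(url, params):
    post = {"id": "abc", "title": "Example", "selftext": "", "created": 0.0,
            "url": url, "upvote_ratio": 1.0}
    return {"data": {"after": None, "children": [{"data": post}]}}


demo_posts = scrape_subreddit("DisneyPlus", fake_fetch, delay=0)
print(len(demo_posts))
```

For real requests, `fetch_page` could be `lambda url, params: requests.get(url, params=params, headers=headers).json()`, ideally with a status check before parsing.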

In [21]:
# Convert the collected Disney+ posts into a DataFrame and export it to CSV
disney_df = pd.DataFrame(disney_posts)
disney_df.to_csv('disney_data.csv')

In [23]:
# Convert the collected Netflix posts into a DataFrame and export it to CSV
netflix_df = pd.DataFrame(netflix_posts)
netflix_df.to_csv('netflix_data.csv')
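
One detail worth knowing when these files are read back: by default `to_csv` also writes the DataFrame index, which reappears as an extra `Unnamed: 0` column on `read_csv`. The sketch below illustrates the difference using an in-memory buffer and two sample rows; the second title and the `roundtrip` helper are made up for illustration.

```python
import io

import pandas as pd

# Two sample rows; the second title is a made-up placeholder
posts = [
    {"id": "1bw34pt", "title": "Just watched wish "},
    {"id": "1bv325e", "title": "Hypothetical second post"},
]
df = pd.DataFrame(posts)


def roundtrip(frame, **to_csv_kwargs):
    """Write a DataFrame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


with_index = roundtrip(df)                  # default: index is written too
without_index = roundtrip(df, index=False)  # index is left out

print(list(with_index.columns))
print(list(without_index.columns))
```

Either form works for later analysis; passing `index=False` on export, or `index_col=0` on import, simply avoids carrying the extra column around.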