# Emotional Consistency among Political Ideologies: An Approach to Address Polarization on Youtube

Group 5:
- Chance Landis (ChancL), Hanna Lee (Lee10), Jason Sun (YongXs), Andy Wong (WongA22)

## Data Collection

### Sources of Information
- **AllSides**: A media bias tool that provides a rating based on "multi-partisan Editorial Reviews by trained experts and Blind Bias Surveys™ in which participants rate content without knowing the source." We used this tool to determine how we should classify the most popular (based on subscriber count) YouTube channels we found. (Source: https://www.allsides.com/media-bias/media-bias-rating-methods)
- **HypeAuitor**: A company that uses a data-driven approach to influencer marketing. In the process, they collated lists of YouTube based on category, subscriber count, and country. This allowed us to find YouTube channels that focused on news and politics with the most subscribers. (Source: https://hypeauditor.com/about/company/, https://hypeauditor.com/top-youtube-news-politics-united-states/)
- **Pew Research Center**: A nonpartisan, nonprofit organization that conducts research on public opinion, demographic trends, and social issues. It provides data-driven insights into various aspects of social science issues, explicitly stating they do not take a stance on political issues. For our research, we relied on their studies on political ideologies and alignment with political parties as a reference. (Source: https://www.pewresearch.org/about/, https://www.pewresearch.org/politics/2016/06/22/5-views-of-parties-positions-on-issues-ideologies/)
- **YouTube**: As a group, we've chosen to expand our collection of YouTube videos by selecting additional keywords associated with the ideology we're studying. Our focus will be on gathering comments from these videos to conduct our research.
    - We used a combination of Andy and Hanna's code to get the comments from YouTube channels.

### Top 5 Democratic YouTube Channels
Vice, Vox, MSNBC, The Daily Show, The Young Turks

In [1]:
pip install --upgrade pip

Collecting pip
  Downloading pip-24.0-py3-none-any.whl (2.1 MB)
     ---------------------------------------- 2.1/2.1 MB 14.9 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-24.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
!pip install --upgrade google-api-python-client --quiet

In [4]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Collecting click (from nltk)
  Downloading click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting joblib (from nltk)
  Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2023.12.25-cp310-cp310-win_amd64.whl.metadata (41 kB)
     ---------------------------------------- 0.0/42.0 kB ? eta -:--:--
     ---------------------------------------- 42.0/42.0 kB 2.0 MB/s eta 0:00:00
Collecting tqdm (from nltk)
  Downloading tqdm-4.66.2-py3-none-any.whl.metadata (57 kB)
     ---------------------------------------- 0.0/57.6 kB ? eta -:--:--
     ---------------------------------------- 57.6/57.6 kB 3.2 MB/s eta 0:00:00
Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
   ---------------------------------------- 0.0/1.5 MB ? eta -:--:--
   ----------- ---------------------------- 0.5/1.5 MB 14.2 MB/s eta 0:00:01
   ---------------------------------------

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\me\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
!pip install pandas

Collecting pandas
  Downloading pandas-2.2.0-cp310-cp310-win_amd64.whl.metadata (19 kB)
Collecting numpy<2,>=1.22.4 (from pandas)
  Downloading numpy-1.26.4-cp310-cp310-win_amd64.whl.metadata (61 kB)
     ---------------------------------------- 0.0/61.0 kB ? eta -:--:--
     ---------------------------------------- 61.0/61.0 kB 3.4 MB/s eta 0:00:00
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2024.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.0-cp310-cp310-win_amd64.whl (11.6 MB)
   ---------------------------------------- 0.0/11.6 MB ? eta -:--:--
   - -------------------------------------- 0.4/11.6 MB 13.1 MB/s eta 0:00:01
   ---- ----------------------------------- 1.2/11.6 MB 15.4 MB/s eta 0:00:01
   -------- ------------------------------- 2.6/11.6 MB 20.6 MB/s eta 0:00:01
   ------------ --------------------------- 3.6/11.6 MB 20.8 MB/s eta 0:00

In [2]:
!pip install nrclex

Collecting nrclex
  Downloading NRCLex-4.0-py3-none-any.whl (4.4 kB)
Collecting textblob (from nrclex)
  Downloading textblob-0.18.0.post0-py3-none-any.whl.metadata (4.5 kB)
INFO: pip is looking at multiple versions of nrclex to determine which version is compatible with other requirements. This could take a while.
Collecting nrclex
  Downloading NRCLex-3.0.0.tar.gz (396 kB)
     ---------------------------------------- 0.0/396.4 kB ? eta -:--:--
     --------------------------- ---------- 286.7/396.4 kB 5.9 MB/s eta 0:00:01
     -------------------------------------- 396.4/396.4 kB 6.1 MB/s eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Installing backend dependencies: started
  Installing backend dependencies: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata

In [7]:
!pip install stopwords

Collecting stopwords
  Downloading stopwords-1.0.0-py2.py3-none-any.whl (37 kB)
Installing collected packages: stopwords
Successfully installed stopwords-1.0.0


In [39]:
# imports
import json
import pandas as pd

import googleapiclient
import googleapiclient.discovery
import googleapiclient.errors
from googleapiclient.errors import HttpError

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nrclex import NRCLex
import datetime
from datetime import datetime
from nltk.tokenize import casual

In [23]:
# API call
# vice: AIzaSyA2rNi_MI-3LQkBzzQ6Tn4EF0lgXWoilfc
# vox: AIzaSyAoeLCEEfqmnpRHR4xRMKt1YdbeUUw75ao
# msnbc: AIzaSyBZDxP2HEW50EDfExcZJag7J2mRroZ9_vk
# daily show: AIzaSyD8adQZlhLNVQrQXpU5-u3s1Y-9TZs20ik
# young turk: AIzaSyB8yyrUrfQGLrlQRmF555oc1emrIDXF7yU

# Others: API_KEY = "AIzaSyCjWja_yyRROSw5tcP_KxYjasJgHLX3oKE"
# API_KEY = "AIzaSyCjWja_yyRROSw5tcP_KxYjasJgHLX3oKE"
API_KEY = "AIzaSyCfjrHtWz-ySCQMobOW0DrN3IwvIZL_YEE"

youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=API_KEY)

In [11]:
# Define channels
channels = ["Vice", "Vox", "msnbc", "thedailyshow", "TheYoungTurks"]

In [12]:
# Define keywords
keyword_lists = {
    "isis": ["ISIS", "Terrorism", "Radicalist", "Jihad", "Suicide Bombing"],
    "guns": ["Gun", "Shooting", "School shooting", "Firearm", "Gun control", "NRA", "Second Amendment"],
    "immigration": ["Immigration", "Border control", "Mexico", "Visa", "Citizenship", "Asylum", "Deportation", "Refugee"],
    "economy": ["Economy", "Budget deficit", "Unemployed", "Inflation", "Interest rate", "Federal reserve", "Market", "Employment"],
    "healthcare": ["Health care", "Medicaid", "Covid", "Obamacare", "Public health", "Insurance"],
    "socioeco": ["Socio-economic", "Rich", "Poor", "Income inequality", "Poverty", "Wealth distribution"],
    "abortion": ["Abortion", "Pregnancy", "Unwanted Pregnancy", "Roe", "Wade", "Pro-life", "Rape", "Incest", "Life of mother", "Religion"],
    "climate": ["Climate change", "Global Warming", "Carbon", "Alternative Energy", "Climate", "Methane", "Emissions", "Gas", "Greenhouse"]
}

In [13]:
# Function for getting channel id based on name
def get_channel_id(channel):  
    channel_id = youtube.search().list(
        part="snippet",
        type="channel",
        q=channel
    )

    res_channel = channel_id.execute()
    chan_id = res_channel["items"][0]["id"]["channelId"]

    return chan_id

In [14]:
# Function for retrieving the upload playlist id using channel id
def get_upload_id(channel):
    request = youtube.channels().list(
        part="contentDetails",
        id=channel
    )

    res = request.execute()
    uploads_playlist_id = res["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

    return uploads_playlist_id

In [15]:
up_id = []

for channel in channels:
    print(channel)
    chan_id = get_channel_id(channel)
    upload_id = get_upload_id(chan_id)
    up_id.append(upload_id)

Vice
Vox
msnbc
thedailyshow
TheYoungTurks


In [40]:
up_id

['UUn8zNIfYAQNdrFRrr8oibKw',
 'UULXo7UDZvByw2ixzpQCufnA',
 'UUaXkIU1QidjPwiAYu6GcHjg',
 'UUwWhs_6x42TyRM4Wstoq8HA',
 'UU1yBKRuGpC1tSM73A0ZjYjQ']

In [16]:
# Initialize PorterStemmer
ps = PorterStemmer()

# Function to check if a video title contains any of the keywords
def contains_keyword(title, keywords):
    title_lower = title.lower()
    words = word_tokenize(title_lower)
    
    # Stem each word in the title + keyword
    stemmed_words = [ps.stem(word) for word in words]
    for keyword in keywords:
        keyword_stemmed = ps.stem(keyword.lower())
        if keyword_stemmed in stemmed_words:
            return keyword
    return None

In [17]:
# function to fetch videos from a playlist and get title with keywords
def keyword_videos(playlist_id, keywords, channel_name):
    videos_info = []
    next_page_token = None

    while True:
        # Make the next API request using the nextPageToken
        request = youtube.playlistItems().list(
            part="snippet",
            playlistId=playlist_id,
            pageToken=next_page_token
        ) 
        res = request.execute()

        # Process the response and save video info
        for v in res["items"]:
            video_title = v["snippet"]["title"]
            detected_word = contains_keyword(video_title, keywords)
            if detected_word:
                # Separate Resource Call to retrieve video views
                views = youtube.videos().list(id=v['snippet']['resourceId']['videoId'], part="snippet,contentDetails,statistics")
                view_temp = views.execute()
                video_views = view_temp['items'][0]['statistics']['viewCount']

                # Append video information with views to videos_info list
                videos_info.append({
                    "id": v["snippet"]["resourceId"]["videoId"],
                    "title": video_title,
                    "keyword": detected_word,
                    "published_at": v["snippet"]["publishedAt"],
                    "VideoViews": video_views
                })
        # Update the nextPageToken for the next iteration
        next_page_token = res.get('nextPageToken')

        if not next_page_token or (len(videos_info) > 30):
            break
    return videos_info

In [None]:
# for channel, upload_id in zip(channels, up_id):
#     for keyword_name, keywords in keyword_lists.items():
#         videos_info = keyword_videos('UUn8zNIfYAQNdrFRrr8oibKw', keywords, 'Vice')

In [22]:
def get_video_comments(channels, up_id, keyword_lists, limit=30):
    # Function to fetch videos from a playlist and get title with keywordsand 
    def keyword_videos(playlist_id, keywords, channel_name):
        videos_info = []
        next_page_token = None

        while True:
            # Make the next API request using the nextPageToken
            request = youtube.playlistItems().list(
                part="snippet",
                playlistId=playlist_id,
                pageToken=next_page_token
            ) 
            res = request.execute()

            # Process the response and save video info
            for v in res["items"]:
                video_title = v["snippet"]["title"]
                detected_word = contains_keyword(video_title, keywords)
                if detected_word:
                    videos_info.append(
                    {
                        "channel": channel_name,
                        "video_id": v["snippet"]["resourceId"]["videoId"],
                        "title": video_title,
                        "keyword": detected_word,
                        "published_at": v["snippet"]["publishedAt"]
                    }
                    )

            # Update the nextPageToken for the next iteration
            next_page_token = res.get('nextPageToken')

            if not next_page_token or (len(videos_info) > 15):
                break
        return videos_info

    # Function for getting top 30 relevant comments for a list of videos
    def get_vid_comments(vid_lst, limit):
        vids_final = []

        # Iterate through each video in the video list
        for vid in vid_lst:
            try:
                request = youtube.commentThreads().list(
                    videoId=vid['video_id'],
                    part='id,snippet,replies',
                    textFormat='plainText',
                    order='relevance',
                    maxResults=50)
                res = request.execute()

                # Iterate through each comment
                for v in res["items"]:

                    # Create a copy of dictionary of current video that is being iterated. This is because each comment is also contained with the video data
                    vid_temp = vid.copy()
                    vid_temp.update({'CommentId':v['id']})
                    vid_temp.update({'CommentTitle':v['snippet']['topLevelComment']['snippet']['textOriginal']})
                    vid_temp.update({'CommentCreationTime':v['snippet']['topLevelComment']['snippet']['publishedAt']})
                    vid_temp.update({'CommentLikes':v['snippet']['topLevelComment']['snippet']['likeCount']})
                    vids_final.append(vid_temp)

                nextPageToken = res.get('nextPageToken')

                while nextPageToken:
                    try:
                        request = youtube.commentThreads().list(
                            videoId=vid['video_id'],
                            part='id,snippet,replies',
                            textFormat='plainText',
                            order='relevance',
                            maxResults=50,
                            pageToken=nextPageToken)

                        res = request.execute()

                        nextPageToken = res.get('nextPageToken')

                        for v in res["items"]:
                            # Create a copy of dictionary of current video that is being iterated. This is because each comment is also contained with the video data
                            vid_temp = vid.copy()
                            vid_temp.update({'CommentId':v['id']})
                            vid_temp.update({'CommentTitle':v['snippet']['topLevelComment']['snippet']['textOriginal']})
                            vid_temp.update({'CommentCreationTime':v['snippet']['topLevelComment']['snippet']['publishedAt']})
                            vid_temp.update({'CommentLikes':v['snippet']['topLevelComment']['snippet']['likeCount']})
                            vids_final.append(vid_temp)

                        # If the number of saved videos is larger than self-defined limit, break while loop and return the list of videos
                        if len(vids_final) >= limit:
                            return vids_final
                    except KeyError:
                        break

            # Error handling for videos with disabled comments
            # Got the answer format from StackOverflow (https://stackoverflow.com/questions/19342111/get-http-error-code-from-requests-exceptions-httperror)
            except HttpError as e:
                if e.resp.status == 403:
                    print(f"Comments are disabled for the video with videoId: {vid['video_id']}")
                else:
                    print("An HTTP error occurred:", e)
                # Continue to the next video
                continue
                
        return vids_final
    
    all_comments = []
    #for channel, upload_id in zip(channels, up_id):
    for keyword_name, keywords in keyword_lists.items():
        videos_info = keyword_videos('UUaXkIU1QidjPwiAYu6GcHjg', keywords, 'MSNBC')
        video_comments = get_vid_comments(videos_info, limit)
        all_comments.extend(video_comments)
    
    return all_comments

#### Vice

In [34]:
vice_comments = get_video_comments('Vice', 'UUn8zNIfYAQNdrFRrr8oibKw', keyword_lists, limit=30)

Comments are disabled for the video with videoId: EEIvWNhuL8U


In [45]:
len(vice_comments)

872

In [46]:
# Check output, commented out for viewing purposes
# vice_comments[:10]

[{'channel': 'Vice',
  'video_id': 'SwoRx3tstxY',
  'title': 'We Uncovered an ISIS Mass Grave | Super Users',
  'keyword': 'ISIS',
  'published_at': '2022-04-11T15:00:12Z',
  'CommentId': 'Ugws1dFQrp7AovnexrB4AaABAg',
  'CommentTitle': 'Bless the hard work of journalists! Seeing the deplorable and terrible things done by monstrous groups like ISIS in one spot must be so difficult. We’re with you!',
  'CommentCreationTime': '2022-04-12T22:53:43Z',
  'CommentLikes': 146},
 {'channel': 'Vice',
  'video_id': 'SwoRx3tstxY',
  'title': 'We Uncovered an ISIS Mass Grave | Super Users',
  'keyword': 'ISIS',
  'published_at': '2022-04-11T15:00:12Z',
  'CommentId': 'UgxcNLZW2rAeMBklWD14AaABAg',
  'CommentTitle': "Also I can't imagine  the amount mental trauma this work puts these journalists and their teams  undergo having to file through hours of footage of some of the most horrific acts enacted upon people in order to try and piece together what really happened.  If they can uncover even some o

In [60]:
# Change to DF
vice_comments_df = pd.DataFrame(vice_comments)

In [61]:
# Check output, commented out for viewing purposes
# vice_comments_df.head()

Unnamed: 0,channel,video_id,title,keyword,published_at,CommentId,CommentTitle,CommentCreationTime,CommentLikes
0,Vice,SwoRx3tstxY,We Uncovered an ISIS Mass Grave | Super Users,ISIS,2022-04-11T15:00:12Z,Ugws1dFQrp7AovnexrB4AaABAg,Bless the hard work of journalists! Seeing the...,2022-04-12T22:53:43Z,146
1,Vice,SwoRx3tstxY,We Uncovered an ISIS Mass Grave | Super Users,ISIS,2022-04-11T15:00:12Z,UgxcNLZW2rAeMBklWD14AaABAg,Also I can't imagine the amount mental trauma...,2022-04-11T16:49:08Z,726
2,Vice,SwoRx3tstxY,We Uncovered an ISIS Mass Grave | Super Users,ISIS,2022-04-11T15:00:12Z,UgzFcqbEHILJ93hjvqh4AaABAg,This is so heartbreaking. What a horrific disp...,2022-04-11T15:08:45Z,251
3,Vice,SwoRx3tstxY,We Uncovered an ISIS Mass Grave | Super Users,ISIS,2022-04-11T15:00:12Z,UgxYrGf2RJZeKbFEap14AaABAg,7:35 Social-media shouldn't be just summarily ...,2022-04-11T15:59:18Z,219
4,Vice,SwoRx3tstxY,We Uncovered an ISIS Mass Grave | Super Users,ISIS,2022-04-11T15:00:12Z,Ugx5rmcME6ua2pMkojt4AaABAg,VICE NEVER Dissapoints! Amazing documentaries!...,2022-04-11T15:04:32Z,127


In [62]:
# Save df to a CSV file
vice_comments_df.to_csv("vice_comments.csv", index=False)

### Vox

In [48]:
vox_comments = get_video_comments('Vox', 'UULXo7UDZvByw2ixzpQCufnA', keyword_lists, limit=30)

In [49]:
len(vox_comments)

778

In [50]:
# Check output, commented out for viewing purposes
# vox_comments[:10]

[{'channel': 'Vox',
  'video_id': 'IbTBehjdlc0',
  'title': 'How Florida legally terrorized gay students',
  'keyword': 'Terrorism',
  'published_at': '2019-11-04T13:00:06Z',
  'CommentId': 'UgxCgDXiTTfjLBdkAE94AaABAg',
  'CommentTitle': '"It was a low degree of terror."\nThis man could barely choke out those words a half century later...\nLow degree?  I don\'t think so.',
  'CommentCreationTime': '2019-11-05T08:38:22Z',
  'CommentLikes': 13004},
 {'channel': 'Vox',
  'video_id': 'IbTBehjdlc0',
  'title': 'How Florida legally terrorized gay students',
  'keyword': 'Terrorism',
  'published_at': '2019-11-04T13:00:06Z',
  'CommentId': 'UgzpaYXLMtGz3ymErOJ4AaABAg',
  'CommentTitle': 'Could you imagine having to question if every single person you engage with is a plant by some organization?',
  'CommentCreationTime': '2019-11-04T13:16:13Z',
  'CommentLikes': 7158},
 {'channel': 'Vox',
  'video_id': 'IbTBehjdlc0',
  'title': 'How Florida legally terrorized gay students',
  'keyword': 'Terr

In [64]:
# Change to DF
vox_comments_df = pd.DataFrame(vox_comments)

# Check output, commented out for viewing purposes
# vox_comments_df.head()

Unnamed: 0,channel,video_id,title,keyword,published_at,CommentId,CommentTitle,CommentCreationTime,CommentLikes
0,Vox,IbTBehjdlc0,How Florida legally terrorized gay students,Terrorism,2019-11-04T13:00:06Z,UgxCgDXiTTfjLBdkAE94AaABAg,"""It was a low degree of terror.""\nThis man cou...",2019-11-05T08:38:22Z,13004
1,Vox,IbTBehjdlc0,How Florida legally terrorized gay students,Terrorism,2019-11-04T13:00:06Z,UgzpaYXLMtGz3ymErOJ4AaABAg,Could you imagine having to question if every ...,2019-11-04T13:16:13Z,7158
2,Vox,IbTBehjdlc0,How Florida legally terrorized gay students,Terrorism,2019-11-04T13:00:06Z,UgweQtEREv31yKgtumB4AaABAg,That poor man never trusted another person and...,2019-11-30T14:30:31Z,4516
3,Vox,IbTBehjdlc0,How Florida legally terrorized gay students,Terrorism,2019-11-04T13:00:06Z,UgwaArmK_X6VOrDrjrZ4AaABAg,And this is only 1 person’s story.,2019-11-06T13:46:52Z,2228
4,Vox,IbTBehjdlc0,How Florida legally terrorized gay students,Terrorism,2019-11-04T13:00:06Z,UgxzPYwa0EGXhdYhtMF4AaABAg,"""These interrogators, the investigators, they ...",2019-11-05T02:56:20Z,3026


In [65]:
# Save df to a CSV file
vox_comments_df.to_csv("vox_comments.csv", index=False)

#### MSNBC

In [None]:
msnbc_comments = get_video_comments('MSNBC', 'UUaXkIU1QidjPwiAYu6GcHjg', keyword_lists, limit=30)

In [None]:
len(msnbc_comments)

In [None]:
# Check output, commented out for viewing purposes
msnbc_comments[:10]

In [None]:
# Change to DF
msnbc_comments_df = pd.DataFrame(msnbc_comments)

# Check output, commented out for viewing purposes
# msnbc_comments_df.head()

In [None]:
# Save df to a CSV file
msnbc_comments_df.to_csv("msnbc_comments.csv", index=False)

#### The Daily Show

In [24]:
dailyshow_comments = get_video_comments('The Daily Show', 'UUwWhs_6x42TyRM4Wstoq8HA', keyword_lists, limit=30)

HttpError: <HttpError 500 when requesting https://youtube.googleapis.com/youtube/v3/playlistItems?part=snippet&playlistId=UUaXkIU1QidjPwiAYu6GcHjg&pageToken=EAAaflBUOkNOZFdJaEJEUkVVMFFUTTJNVVZETnpsQ056UkVLQUZJaUpHano4LTRoQU5RQVZvNElrTm9hRlpXVjBaWllUQnNWazFXUm5CYVIzQlJaREpzUWxkWVZUSlNNazVKWVcxalUwUkJhbk15WXkxMVFtaEVRWFJ3VTFSQlVTSQ&key=AIzaSyCfjrHtWz-ySCQMobOW0DrN3IwvIZL_YEE&alt=json returned "Internal error encountered.". Details: "[{'domain': 'youtube.CoreErrorDomain', 'reason': 'SERVICE_UNAVAILABLE'}]">

In [None]:
len(dailyshow_comments)

In [None]:
# Check output, commented out for viewing purposes
dailyshow_comments[:10]

In [None]:
# Change to DF
dailyshow_comments_df = pd.DataFrame(dailyshow_comments)

# Check output, commented out for viewing purposes
# dailyshow_comments_df.head()

In [None]:
# Save df to a CSV file
dailyshow_comments_df.to_csv("dailyshow_comments.csv", index=False)

#### Young Turks

In [None]:
yturk_comments = get_video_comments('The Young Turks', 'UU1yBKRuGpC1tSM73A0ZjYjQ', keyword_lists, limit=30)

In [None]:
len(yturk_comments)

In [None]:
# Check output, commented out for viewing purposes
yturk_comments[:10]

In [None]:
# Change to DF
yturk_comments_df = pd.DataFrame(yturk_comments)

# Check output, commented out for viewing purposes
# yturk_comments_df.head()

In [None]:
# Save df to a CSV file
yturk_comments_df.to_csv("yturk_comments.csv", index=False)

# Step 3

In [11]:
right_comment_df = pd.read_csv('Project_yt_comments.csv')
right_title_df = pd.read_csv('Project_yt_titles.csv')
demo_df = pd.read_csv('combine_democ_comments.csv')

  right_comment_df = pd.read_csv('Project_yt_comments.csv')


In [12]:
def textcleaner(row):
    row = str(row)
    row = row.lower()
    # remove punctuation
    row = re.sub(r'[^\w\s]', '', row)
    #remove urls
    row  = re.sub(r'http\S+', '', row)
    #remove mentions
    row = re.sub(r"(?<![@\w])@(\w{1,25})", '', row)
    #remove hashtags
    row = re.sub(r"(?<![#\w])#(\w{1,25})", '',row)
    #remove other special characters
    row = re.sub('[^A-Za-z .-]+', '', row)
        #remove digits
    row = re.sub('\d+', '', row)
    row = row.strip(" ")
    row = re.sub('\s+', ' ', row)
    return row
    
stopeng = set(stopwords.words('english'))
def remove_stop(text):
    try:
        words = text.split(' ')
        valid = [x for x in words if x not in stopeng]
        return(' '.join(valid))
    except AttributeError:
        return('')

In [36]:
# Drop NaN
right_comment_df = right_comment_df.dropna()
right_title_df = right_title_df.dropna()
demo_df = demo_df.dropna()

In [37]:
# Change from datetime to date
right_comment_df['CommentCreationTime'] = right_comment_df['CommentCreationTime'].apply(lambda x: datetime.strptime(str(x)[0:10], '%Y-%m-%d').date())
right_title_df['published_at'] = right_title_df['published_at'].apply(lambda x: datetime.strptime(str(x)[0:10], '%Y-%m-%d').date())
demo_df['CommentCreationTime'] = demo_df['CommentCreationTime'].apply(lambda x: datetime.strptime(str(x)[0:10], '%Y-%m-%d').date())
demo_df['published_at'] = demo_df['published_at'].apply(lambda x: datetime.strptime(str(x)[0:10], '%Y-%m-%d').date())

In [40]:
# Tokenize
right_comment_df['TweetToken'] = right_comment_df['CommentTitle'].apply(lambda x: casual.TweetTokenizer().tokenize(x))
right_title_df['TweetToken'] = right_title_df['title'].apply(lambda x: casual.TweetTokenizer().tokenize(x))
demo_df['TweetTokenTitle'] = demo_df['title'].apply(lambda x: casual.TweetTokenizer().tokenize(x))
demo_df['TweetTokenComment'] = demo_df['CommentTitle'].apply(lambda x: casual.TweetTokenizer().tokenize(x))


In [41]:
# Clean text
right_comment_df['CommentCleaned'] = right_comment_df['TweetToken'].apply(lambda x: remove_stop(textcleaner(x)))
right_title_df['TitleCleaned'] = right_title_df['TweetToken'].apply(lambda x: remove_stop(textcleaner(x)))
demo_df['TitleCleaned'] = demo_df['TweetTokenTitle'].apply(lambda x: remove_stop(textcleaner(x)))
demo_df['CommentCleaned'] = demo_df['TweetTokenComment'].apply(lambda x: remove_stop(textcleaner(x)))


In [44]:
def nrc_sen(text, cat):
    sen = NRCLex(text)
    if cat == 'pos':
        return sen.affect_frequencies['positive']
    else:
        return sen.affect_frequencies['negative']

In [45]:
right_comment_df['PositiveScore'] = right_comment_df['CommentCleaned'].apply(lambda x: nrc_sen(x, 'pos'))
right_comment_df['NegativeScore'] = right_comment_df['CommentCleaned'].apply(lambda x: nrc_sen(x, 'neg'))        
right_title_df['PositiveScore'] = right_title_df['TitleCleaned'].apply(lambda x: nrc_sen(x, 'pos'))
right_title_df['NegativeScore'] = right_title_df['TitleCleaned'].apply(lambda x: nrc_sen(x, 'neg'))        

demo_df['PositiveScoreTitle'] = demo_df['TitleCleaned'].apply(lambda x: nrc_sen(x, 'pos'))
demo_df['NegativeScoreTitle'] = demo_df['TitleCleaned'].apply(lambda x: nrc_sen(x, 'neg'))    
demo_df['PositiveScoreComment'] = demo_df['CommentCleaned'].apply(lambda x: nrc_sen(x, 'pos'))    
demo_df['NegativeScoreComment'] = demo_df['CommentCleaned'].apply(lambda x: nrc_sen(x, 'neg'))        

In [48]:
def nrc_emo(text, ver):
    emo = NRCLex(text).affect_frequencies
    max_emo = max(emo, key=emo.get)
    max_score = emo[max_emo]
    if ver == 'score':
        return max_score
    else:
        return max_emo

In [52]:
right_comment_df['Emotion'] = right_comment_df['CommentCleaned'].apply(lambda x: nrc_emo(x, 'emo'))
right_comment_df['EmotionScore'] = right_comment_df['CommentCleaned'].apply(lambda x: nrc_emo(x, 'score'))        
right_title_df['Emotion'] = right_title_df['TitleCleaned'].apply(lambda x: nrc_emo(x, 'emo'))
right_title_df['EmotionScore'] = right_title_df['TitleCleaned'].apply(lambda x: nrc_emo(x, 'score'))        

demo_df['EmotionTitle'] = demo_df['TitleCleaned'].apply(lambda x: nrc_emo(x, 'emo'))
demo_df['EmotionScoreTitle'] = demo_df['TitleCleaned'].apply(lambda x: nrc_emo(x, 'score'))    
demo_df['EmotionComment'] = demo_df['CommentCleaned'].apply(lambda x: nrc_emo(x, 'emo'))
demo_df['EmotionScoreComment'] = demo_df['CommentCleaned'].apply(lambda x: nrc_emo(x, 'score'))    

In [54]:
demo_df

Unnamed: 0,channel,video_id,title,keyword,published_at,CommentId,CommentTitle,CommentCreationTime,CommentLikes,TitleCleaned,...,PositiveScoreTitle,NegativeScoreTitle,PositiveScoreComment,NegativeScoreComment,Emotion,EmotionScore,EmotionTitle,EmotionScoreTitle,EmotionComment,EmotionScoreComment
0,Vice,SwoRx3tstxY,We Uncovered an ISIS Mass Grave | Super Users,ISIS,2022-04-11,Ugws1dFQrp7AovnexrB4AaABAg,Bless the hard work of journalists! Seeing the...,2022-04-12,146,uncovered isis mass grave super users,...,0.000000,0.333333,0.066667,0.133333,fear,0.333333,fear,0.333333,fear,0.200000
1,Vice,SwoRx3tstxY,We Uncovered an ISIS Mass Grave | Super Users,ISIS,2022-04-11,UgxcNLZW2rAeMBklWD14AaABAg,Also I can't imagine the amount mental trauma...,2022-04-11,726,uncovered isis mass grave super users,...,0.000000,0.333333,0.230769,0.076923,fear,0.333333,fear,0.333333,positive,0.230769
2,Vice,SwoRx3tstxY,We Uncovered an ISIS Mass Grave | Super Users,ISIS,2022-04-11,UgzFcqbEHILJ93hjvqh4AaABAg,This is so heartbreaking. What a horrific disp...,2022-04-11,251,uncovered isis mass grave super users,...,0.000000,0.333333,0.000000,0.222222,fear,0.333333,fear,0.333333,fear,0.222222
3,Vice,SwoRx3tstxY,We Uncovered an ISIS Mass Grave | Super Users,ISIS,2022-04-11,UgxYrGf2RJZeKbFEap14AaABAg,7:35 Social-media shouldn't be just summarily ...,2022-04-11,219,uncovered isis mass grave super users,...,0.000000,0.333333,0.166667,0.166667,fear,0.333333,fear,0.333333,fear,0.166667
4,Vice,SwoRx3tstxY,We Uncovered an ISIS Mass Grave | Super Users,ISIS,2022-04-11,Ugx5rmcME6ua2pMkojt4AaABAg,VICE NEVER Dissapoints! Amazing documentaries!...,2022-04-11,127,uncovered isis mass grave super users,...,0.000000,0.333333,0.285714,0.071429,fear,0.333333,fear,0.333333,positive,0.285714
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2416,MSNBC,BZ_f66aoZ0I,Kimberly Atkins Stohr: GA District Attorney Fa...,Gas,2024-02-16,UgzBO4V2k1_ayMG_1tF4AaABAg,We don't trust bank s,2024-02-18,1,kimberly atkins stohr ga district attorney fan...,...,0.333333,0.000000,0.000000,0.000000,fear,0.333333,fear,0.333333,trust,1.000000
2417,MSNBC,BZ_f66aoZ0I,Kimberly Atkins Stohr: GA District Attorney Fa...,Gas,2024-02-16,UgxJiE3xA5crSIatswx4AaABAg,MSNBC at its worst. Fani Willis the new atm ca...,2024-02-18,3,kimberly atkins stohr ga district attorney fan...,...,0.333333,0.000000,0.142857,0.000000,fear,0.333333,fear,0.333333,trust,0.285714
2418,MSNBC,BZ_f66aoZ0I,Kimberly Atkins Stohr: GA District Attorney Fa...,Gas,2024-02-16,UgxXK_3Y17CdewRaIMJ4AaABAg,I bet Farni is not the only prosecutor in Ga. ...,2024-02-17,9,kimberly atkins stohr ga district attorney fan...,...,0.333333,0.000000,0.000000,1.000000,fear,0.333333,fear,0.333333,negative,1.000000
2419,MSNBC,BZ_f66aoZ0I,Kimberly Atkins Stohr: GA District Attorney Fa...,Gas,2024-02-16,Ugzd5fBC6pIR8k7JI214AaABAg,Exactly what defence can she use and according...,2024-02-16,6,kimberly atkins stohr ga district attorney fan...,...,0.333333,0.000000,0.074074,0.111111,fear,0.333333,fear,0.333333,anger,0.222222


## Andy's Section

In [None]:
# Function for retrieving the upload playlist id of a channel
def get_upload_id(channel):
    request = youtube.channels().list(part='contentDetails', forUsername=channel)
    res = request.execute()
    return res["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

# Function for retrieving all vids within the upload playlist of a channel, stopping once a limit INT has been reached
def get_vids(channel, limit, keywords, ideology):
    
    # Output list
    vid_lst=[]

    request = youtube.playlistItems().list(part='snippet',playlistId=get_upload_id(channel),maxResults=50)
        
    res = request.execute()
    nextPageToken = res['nextPageToken']

    # Iterate through each video in the playlist
    for v in res["items"]:

        # Normalization of video title to check for keywords
        title = v['snippet']['title']
        title = title.lower()
        title = re.sub(r'[^\w\s]','', title)

        # Check for key words. If key word detected, then counter +1. If counter > 0, then the post will be flagged and added.
        counter = 0
        for word in title.split():
            counter = 0
            if word in keywords:
                counter += 1
        if counter == 0:
            continue

        # Create temp dictionary per video, and add video-specific information to dictionary
        vid_dict = {}
        vid_dict['ChannelName'] = v['snippet']['channelTitle']
        vid_dict['VideoId'] = v['snippet']['resourceId']['videoId']
        vid_dict['VideoTitle'] = v['snippet']['title']
        vid_dict['Ideology'] = ideology

        # Separate Resource Call to retrieve video views
        views = youtube.videos().list(id=v['snippet']['resourceId']['videoId'], part="snippet,contentDetails,statistics")
        view_temp = views.execute()
        vid_dict['VideoViews'] = view_temp['items'][0]['statistics']['viewCount']

        # Append dictionary to greater list
        vid_lst.append(vid_dict)

    # Iterate until no more next page
    while nextPageToken:
        try:
            request = youtube.playlistItems().list(part='snippet', playlistId=get_upload_id(channel), maxResults=50, pageToken = res['nextPageToken'])                
            res = request.execute()

            # Redefine next page token to check @ next iteration
            nextPageToken = res['nextPageToken']

            # Iterate through each video
            for v in res["items"]:

                # Normalization of video title to check for keywords
                title = v['snippet']['title']
                title = title.lower()
                title = re.sub(r'[^\w\s]','', title)

                # Check for key words. If key word detected, then counter +1. If counter > 0, then the post will be flagged and added.
                counter = 0
                for word in title.split():
                    if word in keywords:
                        counter += 1
                if counter == 0:
                    continue

                # Create temp dictionary per video, and add video-specific information to dictionary
                vid_dict = {}
                vid_dict['ChannelName'] = v['snippet']['channelTitle']
                vid_dict['VideoId'] = v['snippet']['resourceId']['videoId']
                vid_dict['VideoTitle'] = v['snippet']['title']
                                
                # Separate Resource Call to retrieve video views
                views = youtube.videos().list(id=v['snippet']['resourceId']['videoId'], part="snippet,contentDetails,statistics")
                view_temp = views.execute()
                vid_dict['VideoViews'] = view_temp['items'][0]['statistics']['viewCount']
                
                vid_lst.append(vid_dict)

            # If the number of saved videos is larger than self-defined limit, break while loop and return the list of videos
            if len(vid_lst) >= limit:
                return(vid_lst)

        # Error case handling
        except KeyError:
            break

# Function for getting top 30 relevant comments for a list of videos
def get_vid_comments(vid_lst, limit):
    vids_final = []

    # Iterate through each video in the video list
    for vid in vid_lst:
        
        request = youtube.commentThreads().list(videoId=vid['VideoId'],part='id,snippet,replies',textFormat='plainText',order='relevance',maxResults=50)
        res = request.execute()

        # Iterate through each comment
        for v in res["items"]:
            
            # Create a copy of dictionary of current video that is being iterated. This is because each comment is also contained with the video data
            vid_temp = copy.copy(vid)
            vid_temp.update({'CommentId':v['id']})
            vid_temp.update({'CommentTitle':v['snippet']['topLevelComment']['snippet']['textOriginal']})
            vid_temp.update({'CommentCreationTime':v['snippet']['topLevelComment']['snippet']['publishedAt']})
            vid_temp.update({'CommentLikes':v['snippet']['topLevelComment']['snippet']['likeCount']})
            vids_final.append(vid_temp)

        while nextPageToken:
            try:
                request = youtube.commentThreads().list(videoId=vid['VideoId'],part='id,snippet,replies',textFormat='plainText',order='relevance',maxResults=50)
                res = request.execute()
        
                nextPageToken = res['nextPageToken']
                
                for v in res["items"]:
                    # Create a copy of dictionary of current video that is being iterated. This is because each comment is also contained with the video data
                    vid_temp = copy.copy(vid)
                    vid_temp.update({'CommentId':v['id']})
                    vid_temp.update({'CommentTitle':v['snippet']['topLevelComment']['snippet']['textOriginal']})
                    vid_temp.update({'CommentCreationTime':v['snippet']['topLevelComment']['snippet']['publishedAt']})
                    vid_temp.update({'CommentLikes':v['snippet']['topLevelComment']['snippet']['likeCount']})
                    vids_final.append(vid_temp)
                    
                # If the number of saved videos is larger than self-defined limit, break while loop and return the list of videos
                if len(vids_final) >= limit:
                    return(vids_final)
            except KeyError:
                break
            
    return vids_final

# from Lab9
def textcleaner(row):
    row = str(row)
    row = row.lower()
    # remove punctuation
    row = re.sub(r'[^\w\s]', '', row)
    #remove urls
    row  = re.sub(r'http\S+', '', row)
    #remove mentions
    row = re.sub(r"(?<![@\w])@(\w{1,25})", '', row)
    #remove hashtags
    row = re.sub(r"(?<![#\w])#(\w{1,25})", '',row)
    #remove other special characters
    row = re.sub('[^A-Za-z .-]+', '', row)
        #remove digits
    row = re.sub('\d+', '', row)
    row = row.strip(" ")
    row = re.sub('\s+', ' ', row)
    return row
    
stopeng = set(stopwords.words('english'))
def remove_stop(text):
    try:
        words = text.split(' ')
        valid = [x for x in words if x not in stopeng]
        return(' '.join(valid))
    except AttributeError:
        return('')

def df_clean_process(df):

    # Change datetime to date
    df['VideoPublishedDate'] = df['VideoPublishedDate'].apply(lambda x: datetime.strptime(x[0:10], '%Y-%m-%d').date())
    df['CommentCreationTime'] = df['CommentCreationTime'].apply(lambda x: datetime.strptime(x[0:10], '%Y-%m-%d').date())

    # Check NaN, if < 10% of total dataset, drop NaN
    if df.isnull().values.any():
        if len(df[df.isna().any(axis=1)]) < len(df) * 0.1:
            df = df.dropna()

    # Split into separate df for computational load reduction
    title_df = df[['ChannelName', 'VideoTitle', 'VideoPublishedDate', 'VideoViews', 'Ideology']].drop_duplicates()
    comment_df = df[['ChannelName', 'VideoViews', 'CommentTitle', 'CommentCreationTime', 'CommentLikes', 'Ideology']]

    # tokenize
    title_df['TweetToken'] = title_df['VideoTitle'].apply(lambda x: casual.TweetTokenizer().tokenize(x))
    comment_df['TweetToken'] = comment_df['CommentTitle'].apply(lambda x: casual.TweetTokenizer().tokenize(x))

    # clean
    title_df['Cleaned'] = title_df['TweetToken'].apply(lambda x: remove_stop(textcleaner(x)))
    comment_df['Cleaned'] = comment_df['TweetToken'].apply(lambda x: remove_stop(textcleaner(x)))

    return (title_df, comment_df)

    # Sentiment analysis

In [None]:
# define channels
channels_left = ['VICE', 'Vox', 'MSNBC', 'The Daily Show', 'TheYoungTurks']
channels_right = ['Fox News', 'Ben Shapiro', 'StevenCrowder', 'Daily Mail', 'DailyWire+']

# define key ideologies/associated keywords to look for in title
isis_keywords = ['terrorism', 'terrorist', 'extremism', 'radicalist', 'radicalism']
guns_keywords = ['shooting', 'shootings', 'school shooting', 'school shootings', 'firearms', 'firearm', 'gun', 'gun control', 'guns', 'nra', 'second amendment']
immigration_keywords = ['border control', 'mexico', 'visa', 'citizenship', 'asylum', 'deportation', 'refugee']
economy_keywords = ['budget', 'budget deficit', 'unemployed', 'inflation', 'interest rate',' federal reserve', 'market', 'employment']
health_care_keywords = ['medicaid', 'covid', 'obamacare', 'public health', 'insurance']
socioeconomic_keywords = ['rich', 'poor', 'income inequality', 'poverty',' wealth distribution']
abortion_keywords = ['pregnancy', 'unwanted pregnancy', 'roe', 'wade', 'abortion', 'pro-life', 'rape', 'incest', 'life of mother', 'religion']
climate_change_keywords = ['global warming', 'carbon', 'alternative energy', 'climate', 'methane', 'emissions','gas','greenhouse']

# Define for iteration
keywords = [isis_keywords, guns_keywords, immigration_keywords, economy_keywords, health_care_keywords, socioeconomic_keywords, abortion_keywords, climate_change_keywords]

# Pre-define empty df
left_df = pd.DataFrame(columns=['ChannelName', 'VideoId', 'VideoTitle', 'Ideology', 'VideoPublishedDate', 'VideoViews', 'CommentId', 'CommentTitle', 'CommentCreationTime', 'CommentLikes'])

# Loop through all left channels
for channel in channels_left:

    # Loop through all keywords/ideologies
    for keyword, ideology in zip(keywords, ['ISIS', 'GUNS', 'IMMIGRATION', 'ECONOMY', 'HEALTH CARE', 'SOCIOECONOMIC', 'ABORTION', 'CLIMATE CHANGE']):

        # Return temp df for one ideology for one channel
        temp_df = pd.DataFrame(get_vid_comments(get_vids(channel, 50, keyword, ideology)[0:50], 150))

        # Append temp df to master df
        left_df = pd.concat([left_df,temp_df])

# Pre-define empty df
right_df = pd.DataFrame(columns=['ChannelName', 'VideoId', 'VideoTitle', 'Ideology', 'VideoPublishedDate', 'VideoViews', 'CommentId', 'CommentTitle', 'CommentCreationTime', 'CommentLikes'])
for channel in channels_right:

    # Loop through all keywords/ideologies
    for keyword, ideology in zip(keywords, ['ISIS', 'GUNS', 'IMMIGRATION', 'ECONOMY', 'HEALTH CARE', 'SOCIOECONOMIC', 'ABORTION', 'CLIMATE CHANGE']):

        # Return temp df for one ideology for one channel
        temp_df = pd.DataFrame(get_vid_comments(get_vids(channel, 50, keyword, ideology)[0:50], 150))

        # Append temp df to master df
        right_df = pd.concat([right_df,temp_df])

(left_title_df, left_comment_df) = df_clean_process(left_df)
(right_title_df, right_comment_df) = df_clean_process(right_df)
# Loop through all right channels
for channel in channels_right:

    # Loop through all keywords/ideologies
    for keyword, ideology in zip(keywords, ['ISIS', 'GUNS', 'IMMIGRATION', 'ECONOMY', 'HEALTH CARE', 'SOCIOECONOMIC', 'ABORTION', 'CLIMATE CHANGE']):

        # Return temp df for one ideology for one channel
        temp_df = pd.DataFrame(get_vid_comments(get_vids(channel, 50, keyword, ideology)[0:50], 150))

        # Append temp df to master df
        right_df = pd.concat([right_df,temp_df])

(left_title_df, left_comment_df) = df_clean_process(left_df)
(right_title_df, right_comment_df) = df_clean_process(right_df)