<a href="https://colab.research.google.com/github/halboug/Text_Analysis_Final_Project/blob/main/final_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Voices of the Salman Doctrine: Textual Analysis of Saudi Leadership Rhetoric

---

##Introduction and Context

<p>Since the death of King Abdullah and the subsequent rise of what is being called the “Salman Doctrine” in Saudi Arabia, the nation has seen tremendous shifts in policy. These shifts gained further momentum with the ascension of Crown Prince Mohammed bin Salman, also known as MBS. The sweeping changes span a broad spectrum, from social reforms to foreign policy.</p>

<p>During the same period, the government has been touting its achievements and promising more structural reforms. Officials have become more vocal and have participated in media events including panels, speeches, and interviews. But it is difficult to discern which of these reforms are actually being implemented and producing real results.</p>

<p>One way to assess whether the government is producing results is to test the priorities that appear in its rhetoric against global indices and major policies. Analyzing the content of Saudi leaders' media appearances offers a way to uncover the key policy issues and trends in that rhetoric. Understanding the new direction not only provides a clearer picture of the kingdom's current trajectory but also helps gauge the progress, or lack thereof, that the country has made in recent years.</p>

##The Hypothesis

<p>The hypothesis is that by analyzing the content of media appearances, speeches, and interviews of the new Saudi leadership since the rise of MBS, we can identify the specific policy issues and trends that have become central to the 'Salman Doctrine'. By then comparing them to global indices, we can gauge progress in each area.</p>

##Methodology

###Data collection:

<p>The YouTube API was used to extract transcripts of media appearances by Saudi leaders, including panel participation, interviews, and speeches. The selection criterion was leaders in ministerial-level or similar positions who frequently participate in media events. After the text is processed and cleaned, it moves on to the next phase.</p>

###Data Analysis:

<p>The primary analytical method is topic modeling, which breaks down large amounts of text into distinct clusters of co-occurring words. This gives insight into the topics and policies that are most frequent in the rhetoric and, by inference, into the direction and focus of the government.</p>

###Evaluation:
<p>After the top policy themes in the transcripts are identified, they are evaluated against the respective global indices. Progress is measured in comparison to other countries, depending on context. Major policy changes are also considered where relevant.</p>

##Challenges and limitations
<p>However, there are also limitations and challenges. First, it is important to acknowledge that this approach does not give a comprehensive evaluation of the Saudi government; it is only one piece of the puzzle. Second, when a media appearance is structured as an interview or a panel, the speech of the interviewer or other participants is part of the transcript and might influence the results. Third, the focus is on appearances conducted in English, which Saudi leaders use to present themselves to the world, as Arabic is usually reserved for internal appearances. Even then, Arabic proper nouns or terms within English transcripts might pose challenges, potentially leading to misinterpretations or inaccuracies during analysis.</p>



In [27]:
# Installing required libraries
!pip install google-api-python-client pandas
!pip install youtube-transcript-api
!pip install tomotopy
!pip install seaborn
!pip install gensim nltk





In [28]:
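# Read the API key from an environment variable (the name YOUTUBE_API_KEY is an
# assumption) rather than hardcoding the secret in the notebook
import os
api_key = os.environ['YOUTUBE_API_KEY']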

In [29]:
# Importing libraries, modules, and setting up access to the youtube API
import pandas as pd
import re
import os
from googleapiclient.discovery import build
youtube = build('youtube', 'v3', developerKey=api_key)

In [30]:
# I used ChatGPT to generate a function that extracts the video ID from each
# YouTube link in a CSV file and saves the IDs to another CSV file
# I was surprised how well it worked!
# https://chat.openai.com/share/5e2d22dd-6aae-499e-9d78-35cc07e2ff7a


def extract_video_id(url):
    regex_patterns = [
        r'(?:https?:\/\/)?(?:www\.)?youtube\.com\/watch\?v=([^\&\?\/]+)',  # Standard URL
        r'(?:https?:\/\/)?(?:www\.)?youtu\.be\/([^\&\?\/]+)',             # Shortened URL
        r'(?:https?:\/\/)?(?:www\.)?youtube\.com\/embed\/([^\&\?\/]+)'    # Embed URL
    ]

    for pattern in regex_patterns:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    return None

def process_csv_with_youtube_links(csv_file_path):
    df = pd.read_csv(csv_file_path)
    df['video_id'] = df['youtube_links'].apply(extract_video_id)
    output_file_path = os.path.join(os.path.dirname(csv_file_path), 'video_id.csv')
    df[['video_id']].to_csv(output_file_path, index=False)
    return df


csv_file_path = '/content/drive/MyDrive/Colab Notebooks/Final Project/youtube_links.csv'
processed_df = process_csv_with_youtube_links(csv_file_path)

# The video_id column is now saved in a new file named 'video_id.csv' in the same directory
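
The regex patterns above expect `v=` to be the first query parameter, so a link such as `watch?feature=share&v=...` would come back as `None`. As a possible alternative, the sketch below parses the URL with the standard library instead of regular expressions. The function name `extract_video_id_urlparse` and the list of accepted hosts are assumptions made for illustration; they are not part of the original notebook.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id_urlparse(url):
    """Hypothetical alternative to extract_video_id that parses the URL."""
    url = url.strip()
    # urlparse only fills in the hostname when a scheme is present
    if "://" not in url:
        url = "https://" + url
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    # Shortened links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/").split("/")[0] or None

    # Assumed set of accepted hosts for standard and embed links
    if host in ("youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            # parse_qs finds v= regardless of its position in the query string
            return parse_qs(parsed.query).get("v", [None])[0]
        if parsed.path.startswith("/embed/"):
            return parsed.path.split("/")[2] or None
    return None


# Video IDs taken from the transcript output of this notebook
print(extract_video_id_urlparse("https://www.youtube.com/watch?feature=share&v=w0NxI44yBDM"))
print(extract_video_id_urlparse("https://youtu.be/BsA8EST3AnU?t=10"))
print(extract_video_id_urlparse("youtube.com/embed/BeBvM2GmNKM"))
```

If this variant were adopted, it could be passed to `df['youtube_links'].apply(...)` in the same way as `extract_video_id`.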

In [31]:

# Function that fetches the transcript for a video ID with the
# youtube-transcript-api library, handling the cases where it can't be retrieved
# This was a mixture of sources including the API documentation, ChatGPT, and trial and error

from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled, NoTranscriptFound

def get_transcript(video_id):
    try:
        transcript_list = YouTubeTranscriptApi.get_transcript(video_id)

        transcript = ' '.join([entry['text'] for entry in transcript_list])
        return transcript

    except TranscriptsDisabled:
        print(f"Transcripts are disabled for video {video_id}")
        return None
    except NoTranscriptFound:
        print(f"No transcript found for video {video_id}")
        return None
    except Exception as e:
        print(f"Error fetching transcript for video {video_id}: {e}")
        return None


In [32]:
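# Function to save a transcript to a .txt file named after the video ID
def save_transcript(video_id, transcript, directory='/content/drive/MyDrive/Colab Notebooks/Final Project/txt_files/'):
    # Create the output folder if needed and build the file path safely
    os.makedirs(directory, exist_ok=True)
    filename = os.path.join(directory, f"{video_id}.txt")
    # Write as UTF-8 so the files match the encoding used when reading them back
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(transcript)
    print(f"Transcript for video {video_id} saved to {filename}")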

In [33]:
# Read CSV file with video IDs
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Final Project/video_id.csv')

In [34]:
# Loop over all video IDs, skipping rows where no ID could be extracted,
# and save every transcript that could be fetched
for video_id in df['video_id'].dropna():
    transcript = get_transcript(video_id)
    if transcript:
        save_transcript(video_id, transcript)


Transcript for video w0NxI44yBDM saved to /content/drive/MyDrive/Colab Notebooks/Final Project/txt_files/w0NxI44yBDM.txt
Transcripts are disabled for video PjyKmUKu7GQ
Transcript for video BsA8EST3AnU saved to /content/drive/MyDrive/Colab Notebooks/Final Project/txt_files/BsA8EST3AnU.txt
Transcript for video BeBvM2GmNKM saved to /content/drive/MyDrive/Colab Notebooks/Final Project/txt_files/BeBvM2GmNKM.txt
Transcript for video bli9VGyCL2c saved to /content/drive/MyDrive/Colab Notebooks/Final Project/txt_files/bli9VGyCL2c.txt
Transcript for video AUmsfaO2d-8 saved to /content/drive/MyDrive/Colab Notebooks/Final Project/txt_files/AUmsfaO2d-8.txt
Transcript for video 3zwXONh6vYE saved to /content/drive/MyDrive/Colab Notebooks/Final Project/txt_files/3zwXONh6vYE.txt
Transcript for video TnrmlImXtmE saved to /content/drive/MyDrive/Colab Notebooks/Final Project/txt_files/TnrmlImXtmE.txt
Transcript for video zrggxoUF-vE saved to /content/drive/MyDrive/Colab Notebooks/Final Project/txt_files/z

In [35]:
#importing more libraries and setting up nltk
import tomotopy as tp
import seaborn
import glob
import os
from pathlib import Path
import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim import corpora, models
stops = stopwords.words('english')
# This is what I call the "how are these not stopwords or am I doing
# something wrong" list: filler words that are common in spoken transcripts
how_are_these_not_stopwords_or_am_i_doing_something_wrong = ['yeah','um','na','uh', 'also', 'see', 'tell', 'yes', 'say', 'one', 'two','like','much','way','said','good', 'day','let', 'take', 'get','could','thing','look','talk','think','obviously','every','want','something']
stops.extend(how_are_these_not_stopwords_or_am_i_doing_something_wrong)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [36]:
#Defining the directory for the transcript txt files
txt_directory = "/content/drive/MyDrive/Colab Notebooks/Final Project/txt_files"

In [37]:
# Collecting the paths of all transcript .txt files in the directory
files = glob.glob(f"{txt_directory}/*.txt")

In [38]:
#Built mostly using code from the workshop with minor adjustments
training_data = []
original_texts = []
titles = []

for file in files:
    # Read each transcript and close the file handle afterwards
    with open(file, encoding='utf-8') as f:
        text = f.read()
    # Tokenize, keep lowercase alphabetic tokens, and drop stopwords
    text_tokens = nltk.word_tokenize(text)
    text_lower = [t.lower() for t in text_tokens if t.isalpha()]
    text_stops = [t for t in text_lower if t not in stops]
    text_string = ' '.join(text_stops)
    training_data.append(text_string)
    original_texts.append(text)
    titles.append(Path(file).stem)

# Sanity check after the loop: the three lists should have the same length
len(training_data), len(original_texts), len(titles)

In [39]:
#Built mostly using code from the workshop with minor adjustments

# Number of topics to return
num_topics = 10
# Number of topic words to print out
num_topic_words = 10

# Initialize the model
model = tp.LDAModel(k=num_topics)

# Add each document to the model, after removing white space (strip)
# and splitting it up into words (split)
for text in training_data:
    model.add_doc(text.strip().split())

# Train in steps of 10 iterations (100 in total) and print the per-word
# log-likelihood after each step; values closer to zero indicate a better fit
print("Topic Model Training...\n\n")
iterations = 10
for i in range(0, 100, iterations):
    model.train(iterations)
    print(f'Iteration: {i}\tLog-likelihood: {model.ll_per_word}')


print("\nTopic Model Results:\n\n")
# Print out top 10 words for each topic
topics = []
topic_individual_words = []
for topic_number in range(0, num_topics):
    topic_words = ' '.join(word for word, prob in model.get_topic_words(topic_id=topic_number, top_n=num_topic_words))
    topics.append(topic_words)
    topic_individual_words.append(topic_words.split())
    print(f"✨Topic {topic_number}✨\n\n{topic_words}\n")

Topic Model Training...


Iteration: 0	Log-likelihood: -8.939888422016486
Iteration: 10	Log-likelihood: -8.677240944245984
Iteration: 20	Log-likelihood: -8.57640523543793
Iteration: 30	Log-likelihood: -8.514648094476224
Iteration: 40	Log-likelihood: -8.465140298597355
Iteration: 50	Log-likelihood: -8.434845301778935
Iteration: 60	Log-likelihood: -8.41052956097095
Iteration: 70	Log-likelihood: -8.420756108688645
Iteration: 80	Log-likelihood: -8.403094966586579
Iteration: 90	Log-likelihood: -8.389579120508568

Topic Model Results:


✨Topic 0✨

global future economy challenges last dialogue transformation market sector green

✨Topic 1✨

kingdom international peace security stability efforts development united region president

✨Topic 2✨

question great new today time riyadh create right people cities

✨Topic 3✨

saudi arabia know years going work relationship investment working come

✨Topic 4✨

people us country oil back foreign continue policy time trying

✨Topic 5✨

kingdom today data b

##Results
From the results of the topic modeling, it initially seemed difficult to decipher the policy priorities. I went back and added more words to the list of words removed from the text, and the generated topics now seem a little more coherent. Context definitely matters in this case, as the model does not understand what I want (to be fair, I usually don’t understand what I want). There also seems to be some overlap between the topics [(Kessel, 2019)](https://medium.com/pew-research-center-decoded/interpreting-and-validating-topic-models-ff8f67e07a32).
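
The manual pass over the stopword list could be supported by a small helper that surfaces words appearing in most documents, since those are the likeliest filler candidates. The sketch below is only an illustration on toy data: the function name `stopword_candidates`, the `min_doc_share` threshold, and the toy documents are all assumptions, and in the notebook the input would be the cleaned strings in `training_data`.

```python
from collections import Counter


def stopword_candidates(documents, min_doc_share=0.8, top_n=20):
    """Return (token, document count) pairs for tokens found in many documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each token once per document (document frequency)
        doc_freq.update(set(doc.split()))
    threshold = min_doc_share * len(documents)
    frequent = [(tok, n) for tok, n in doc_freq.most_common() if n >= threshold]
    return frequent[:top_n]


# Toy stand-in for training_data (hypothetical text, not real transcripts)
toy_docs = [
    "really kingdom energy oil really",
    "really women development kingdom",
    "really peace security region",
]
candidates = stopword_candidates(toy_docs, min_doc_share=0.6)
print(candidates)
```

On the toy data both "really" and "kingdom" are flagged, which shows why the output still needs a human decision: a word like "kingdom" is frequent but may carry meaning for the analysis.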


When context is taken into account, we can identify some common words that indicate a policy direction, so unfortunately this challenge has to be met with manual brain power. For example, the words (people, opportunity, development, women) would indicate something along the lines of women’s rights and/or social development.


<p>Using coffee and a sleep-deprived brain, we analyze the rest of the topics from the topic modeling results in the same way. We can infer the following top three policy priorities for the new Saudi leadership:</p>


1.   Energy and Environment: This involves managing the key oil and gas sector while embracing renewable energy and sustainability. It highlights the kingdom's effort to balance being a major energy producer with evolving global environmental trends.
2.   Women’s Rights and Social Development: This includes advancing women's rights and broader social reforms, emphasizing gender equality, and aligning social policies with international standards.
3.   International Cooperation and Regional Stability: This involves diplomatic efforts, conflict resolution, and collaborating with global and regional partners to address shared challenges and promote peace and security.
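
The manual labelling behind this list could be made a little more repeatable by scoring each topic against a set of seed words per theme. The sketch below is a rough illustration rather than the method used in this notebook: the `THEME_KEYWORDS` seed sets and the `label_topic` function are assumptions, while the two topic strings are copied from the model output above.

```python
# Hypothetical seed words per theme (chosen by hand for illustration)
THEME_KEYWORDS = {
    "Energy and Environment": {"oil", "energy", "green", "market", "sector"},
    "Women's Rights and Social Development": {"people", "opportunity", "development", "women"},
    "International Cooperation and Regional Stability": {
        "peace", "security", "stability", "international", "region",
    },
}


def label_topic(topic_words, theme_keywords=THEME_KEYWORDS):
    """Return (theme, overlap) for the theme sharing the most words with a topic."""
    words = set(topic_words.split())
    scores = {theme: len(words & seeds) for theme, seeds in theme_keywords.items()}
    best = max(scores, key=scores.get)
    # No overlap at all means the topic stays unlabelled
    return (best, scores[best]) if scores[best] > 0 else (None, 0)


# Topic 0 and Topic 1 as printed by the model above
topic_0 = "global future economy challenges last dialogue transformation market sector green"
topic_1 = "kingdom international peace security stability efforts development united region president"

print(label_topic(topic_0))
print(label_topic(topic_1))
```

A simple overlap count like this only suggests a label; topics with little or no overlap still need to be read in context.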

##Evaluation


<p>Now we test whether there has been progress since the leadership transition in 2016. For each policy theme, we choose relevant indices and major policy reforms. It is important to note that this approach does not give the full picture, because context can influence these indices. For example, the 2020 pandemic affected the Human Development Index (HDI) worldwide. This was taken into account in the evaluation, but other pieces of information that give context to the progress of the policies might be missing.</p>

#### 1- Energy and Environment: Decline
According to the Environmental Performance Index (EPI), Saudi Arabia ranked 95th out of 180 countries in 2016. The latest report, as of 2022, places the country 109th out of 180, which means it actually fell 14 spots [(EPI)](https://sedac.ciesin.columbia.edu/data/collection/epi/sets/browse).

On the ACEEE International Energy Efficiency Scorecard, Saudi Arabia scored 10 points out of a possible 60 and ranked last in the set of countries. The methodology of the index changed in 2022, when the score was 25 out of 100. This is a notable increase, but the country still ranked 23rd out of 25 [(ACEEE)](https://www.aceee.org/sites/default/files/pdfs/2022_International_Scorecard/Saudi%20Arabia%20One-Pager.pdf).
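
Because the two ACEEE scores use different maximums, one rough way to put them side by side is to express each as a share of the maximum, alongside the EPI rank change. The sketch below only redoes the arithmetic on the figures quoted above; the helper names are assumptions, and since the scorecard methodology changed, the two shares are indicative rather than strictly comparable.

```python
def rank_change(old_rank, new_rank):
    """Positive result = moved up the ranking, negative = moved down."""
    return old_rank - new_rank


def share_of_max(score, max_score):
    """Express a score as a fraction of the maximum achievable score."""
    return score / max_score


# Figures quoted in the text above
epi_change = rank_change(old_rank=95, new_rank=109)
aceee_earlier = share_of_max(10, 60)
aceee_2022 = share_of_max(25, 100)

print(f"EPI rank change: {epi_change:+d}")
print(f"ACEEE share of maximum: {aceee_earlier:.1%} -> {aceee_2022:.1%}")
```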

#### 2- Women’s Rights and Social Development: Relative Improvement
Using the Human Development Index (HDI), which includes gender disparity and inequality, we see very little growth: a value of 0.864 in 2016 and 0.875 in 2021. It is worth mentioning, however, that the world average in 2021 was 0.732 [(UNDP)](https://hdr.undp.org/data-center/human-development-index#/indicies/HDI).  
In addition, the HDI plateaued for a significant number of countries during the same period. It is also notable that in 2018 the Saudi government lifted its ban on women driving, a significant milestone for the traditionally conservative country [(NYT, 2019)](https://www.nytimes.com/2019/06/24/world/middleeast/saudi-driving-ban-anniversary.html).
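
To make "very little growth" concrete, the HDI figures quoted above can be turned into an absolute change, a relative change, and a gap to the world average. This is just the arithmetic on those three numbers; the variable names are assumptions for illustration.

```python
# HDI values quoted in the text above
hdi_2016 = 0.864
hdi_2021 = 0.875
world_2021 = 0.732

absolute_change = hdi_2021 - hdi_2016          # change in index points
relative_change = absolute_change / hdi_2016   # change relative to the 2016 value
gap_to_world = hdi_2021 - world_2021           # distance above the world average

print(f"Absolute change 2016-2021: {absolute_change:.3f}")
print(f"Relative change 2016-2021: {relative_change:.2%}")
print(f"Gap to world average (2021): {gap_to_world:.3f}")
```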

#### 3- International Cooperation and Regional Stability: Significant Improvement
According to the Global Peace Index, Saudi Arabia made the biggest leap in 2022 out of all countries. The country jumped 8 positions in the ranking. While it still ranked 119 out of 163 countries, this marks a significant improvement [(GPI)](https://www.visionofhumanity.org/5-countries-that-recorded-largest-improvements-in-peace-in-2022/).

##Conclusion
<p>In conclusion, the topic modeling results, enhanced by careful word selection and context consideration, reveal key policy directions for Saudi Arabia's new leadership. The top three identified priorities are Energy and Environment, Women's Rights and Social Development, and International Cooperation and Regional Stability. An evaluation of progress since the leadership transition in 2016 using relevant indices shows mixed results: a decline in Energy and Environment, relative improvement in Women's Rights and Social Development, and significant advancement in International Cooperation and Regional Stability. This analysis underscores the kingdom's ongoing efforts and challenges in these critical areas. </p>