
# Lecture 5: Text Summarization Using Large Language Models (LLMs)
Date: 23/11/2023  
Duration: 3 hours

## Introduction

This lecture will introduce the concept of text summarization using Large Language Models (LLMs). We will explore how these models work, their applications in various media fields, and engage in practical exercises for hands-on experience.
    

#  Text Summarization through Large Language Models (LLMs)

## How LLMs Work in Summarization

Large Language Models (LLMs) like GPT-4 represent a significant advancement in the field of Natural Language Processing (NLP). They can produce fluent summaries written in new wording, something that simpler, mostly extractive algorithms (which select and reorder existing sentences) could not do. This section explains how these models summarize text.

### Understanding Large Language Models

LLMs are trained on extensive datasets encompassing a wide array of text sources, from news articles and scientific papers to books and internet content. This extensive training enables them to understand context, detect nuances, and generate human-like text. They function based on transformer architectures, which allow them to process words in relation to all other words in a text, rather than sequentially. This contextual understanding is critical in summarization.

### The Summarization Process

The process of summarization by an LLM involves several key steps:

1. **Contextual Understanding:** When presented with a text, the LLM first comprehends its context. This understanding is not just about individual words but the interplay between them—how they contribute to the overall meaning.

2. **Key Information Identification:** The model identifies key information, themes, and messages in the text. This step is crucial as it determines what content is essential and what can be omitted in the summary.

3. **Condensing Information:** Using its training and algorithms, the LLM then condenses this information. It retains the core ideas and facts while discarding redundant or less important details.

4. **Generating the Summary:** Finally, the model generates a summary. This step isn't just about cutting down the text; it's about rephrasing and restructuring information to make the summary coherent, concise, and readable.
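In practice these four steps are not separate programs: the model performs them implicitly while generating its answer, and the user only supplies a prompt. The sketch below shows one way to phrase such a prompt. It is a minimal illustration, not a specific vendor's API: `llm` is a hypothetical callable that takes a prompt string and returns the model's text, and `fake_llm` is a stand-in so the example runs without network access.

```python
from typing import Callable


def build_summary_prompt(text: str, max_words: int = 100) -> str:
    """Wrap the source text in an instruction asking for a short summary."""
    return (
        f"Summarize the following text in at most {max_words} words. "
        "Keep the key facts and themes, and omit redundant detail.\n\n"
        f"TEXT:\n{text.strip()}\n\nSUMMARY:"
    )


def summarize(text: str, llm: Callable[[str], str], max_words: int = 100) -> str:
    """Send the prompt to `llm` (hypothetical: prompt in, completion out)."""
    return llm(build_summary_prompt(text, max_words)).strip()


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; it always returns the same fixed string.
    return " A placeholder summary. "


article = "The city council approved a new transit plan on Monday after months of debate."
prompt = build_summary_prompt(article, max_words=30)
summary = summarize(article, fake_llm, max_words=30)
print(prompt)
print(summary)
```

Replacing `fake_llm` with a function that calls a real model is the only change needed to obtain actual summaries; the prompt wording itself is an assumption and can be tuned.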

### Techniques Used in Summarization

LLMs employ a variety of techniques for effective summarization:

- **Abstractive Summarization:** LLMs often use abstractive methods, generating summaries that may not use the exact phrasing of the original text but convey the same meaning. This approach allows for more flexible and natural summarization.
  
- **Attention Mechanisms:** These models utilize attention mechanisms to weigh different parts of the text differently, focusing more on the parts deemed more important.
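To make the idea of "weighing different parts of the text" concrete, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. The 2-dimensional vectors are made up for illustration; real models use learned vectors with hundreds or thousands of dimensions and many attention heads.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(dimension)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


# Toy vectors, one key/value pair per token (illustrative only)
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.2]

weights, output = attention(query, keys, values)
print(weights)  # the third token receives the largest weight
print(output)
```

The tokens whose keys align best with the query contribute most to the output, which is the sense in which the model "focuses" on some parts of the text.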

### Applications and Considerations

The use of LLMs in summarization has broad applications, from summarizing news articles for quick consumption to condensing academic papers for research. However, it’s crucial to note that while LLMs are powerful, they aren't infallible. They can sometimes omit crucial details or introduce biases present in their training data. Therefore, human oversight and critical evaluation remain essential in the summarization process.

In conclusion, LLMs represent a transformative tool in text summarization. Their ability to understand and condense large volumes of text accurately and coherently makes them invaluable in various fields, from journalism to academia. As these models continue to evolve, so too will their capabilities and applications in text summarization.

## Types of Media Texts for Summarization

In the realm of media and communication, the need for concise and effective information is paramount. Large Language Models (LLMs) offer powerful tools for summarizing a wide array of media texts. This capability is particularly crucial in an age overwhelmed with information. We will explore the different types of media texts that are amenable to summarization through LLMs.

### News Articles

The most common application of text summarization is in the news industry. With the constant influx of news reports and articles, summarization helps in distilling the essence of stories for quick and efficient consumption. LLMs can capture the key points of news articles, providing brief versions that retain the critical information and context, which is especially useful for keeping up with fast-paced news cycles.

### Academic Papers

In academia, the volume of research papers published is vast. Summarization through LLMs can assist researchers and students by providing concise summaries of lengthy papers, allowing for quicker understanding of research findings and theories. This not only saves time but also aids in literature review and research analysis.

### Books and Literature

LLMs can summarize entire books, chapters, or specific sections, making it easier to grasp the main themes and narratives. This is particularly beneficial for educational purposes, literature reviews, and analysis in literary studies.

### Social Media Content

With the explosion of content on social media platforms, summarization can be used to aggregate and distill trends, opinions, and discussions. This is valuable for understanding public sentiment, market research, and media monitoring.

### Scripts and Screenplays

In the world of film and television, summarizing scripts and screenplays can help in quickly understanding plot points and narrative structures. This application is beneficial for scriptwriters, producers, and media students studying narrative techniques.

### Corporate and Legal Documents

Summarization can also be applied to lengthy corporate reports, legal documents, and policy papers. This helps in quickly grasping the key points and decisions in business and legal contexts.

### Reviews and Consumer Feedback

Summarizing product reviews and consumer feedback is essential for businesses to understand customer sentiment and market trends. LLMs can provide concise summaries of vast amounts of customer feedback, aiding in market analysis and product development.

### Speeches and Interviews

Transcripts of speeches and interviews can be lengthy. Summarization allows for a quick grasp of the key points and messages conveyed.

### Conclusion

In summary, the use of LLMs in text summarization spans a wide range of media texts. From news articles and academic papers to social media content and corporate documents, the ability to distill essential information from extensive texts is a valuable asset in various domains. As technology advances, the scope of text types amenable to summarization by LLMs is likely to expand further, enhancing our capacity to manage and comprehend the ever-growing expanse of information.

## Applications in Various Fields

LLMs' ability to summarize text has far-reaching implications across multiple fields. This section explores how different sectors can leverage text summarization to enhance efficiency and knowledge dissemination.

### Academia and Research

In academia, summarization aids in literature reviews and data analysis. Researchers can quickly assimilate key findings from multiple papers, facilitating a broader understanding of a field. Students benefit from concise summaries of complex academic texts, making learning more efficient.

### Journalism

Journalists can use LLMs to quickly digest reports, interviews, and research, aiding in the rapid creation of news articles. Summarization helps in covering more stories with accuracy and depth, especially in breaking news situations.

### Media Production

In media production, from film to broadcasting, summarizing scripts and project proposals helps in evaluating content swiftly. This application is crucial in pre-production stages, where time and clarity are paramount.

### Business Intelligence

Business professionals use summarization for swift analysis of market trends, consumer feedback, and internal reports. This tool enhances decision-making by condensing large volumes of data into actionable insights.

### Legal Sector

In the legal sector, summarizing case files and legal precedents can save significant time. Lawyers and judges can use summaries to quickly understand case nuances, aiding in legal research and courtroom preparations.

### Healthcare

Summarization can transform how medical professionals access and use medical literature and patient records. Quick summaries of patient histories and research papers enhance diagnostic and treatment efficiency.

### Government and Policy Making

For government officials and policymakers, summarization aids in digesting policy documents, research, and public feedback. This tool assists in informed decision-making and public communication.

### Education

In education, summarizing educational materials makes learning more accessible. Teachers and students can focus on key concepts, enhancing the educational process.

### Digital Marketing

Digital marketers use summarization to analyze consumer reviews and social media trends. This insight is crucial for strategy development and understanding market dynamics.

### Public Relations

In PR, summarizing news articles, reports, and public opinions helps in crafting appropriate responses and strategies. It aids in maintaining a positive public image and effective communication.

### Conclusion

The applications of text summarization using LLMs are vast and varied. From enhancing academic research to streamlining business processes and aiding in content creation, the potential of these tools is immense. As LLMs continue to advance, their impact across these fields is likely to grow, offering new ways to process and understand the ever-increasing volume of text-based information.

## Demonstration Overview

In the following demonstration, we will interact with a GPT model to summarize a selected PDF document. This exercise will highlight the model's natural language understanding and generation capabilities, showcasing its potential as a tool for media analysts and researchers.

### Guglielmo
https://myaidrive.com/4UY7KuLp2crD22Kg/Amanda-D.-Lo.pdf

Please provide a summary of the above document. 

### AI PDF GPT
The document titled "Netflix and Streaming Video" by Amanda D. Lotz, published by Polity Press in 2022, provides a comprehensive exploration of the subscriber-funded video on-demand (SVOD) services, focusing particularly on Netflix. The work is divided into two main parts: the first examines the differences between SVODs and traditional linear services, while the second part delves into how Netflix, as a mature and unique player, differs from other SVODs and past services.

In the introduction, Lotz emphasizes that SVODs, while appearing similar to traditional television services from a viewer's perspective, are industrially different due to their business models, on-demand delivery, and transnational reach. The book addresses the need to understand these services in light of their distinct business models and the industrial implications of their strategies, particularly in terms of content commissioning and distribution.

The first part of the book details the characteristics that differentiate SVODs from linear services, including their reliance on subscriber funding, provision of on-demand content, and global reach. It also discusses the implications of these features on the content strategies and business norms of SVODs, suggesting that the traditional distinctions between series and movies are becoming less relevant in the context of SVODs.

Lotz uses key terms like 'internet-distributed video', 'commission', 'linear vs. non-linear/on-demand', and 'scripted fiction' to frame her analysis. She explains that SVODs like Netflix have introduced changes in the norms of content distribution and consumption, with a focus on subscriber metrics rather than advertising-based metrics.

In the second part, the book explores Netflix's unique position in the SVOD landscape. Unlike other services, Netflix is a pure play, subscriber-funded, video-on-demand service. It differs from services with owned intellectual property (IP), like Disney+, HBO Max, and Paramount+, which have different content strategies and corporate objectives. The discussion includes Netflix's approach to content commissioning, its global strategy, and how it contrasts with other SVODs and traditional media services.

Overall, the book aims to provide a foundational understanding of SVODs and their role in the evolving landscape of media and culture, offering insights into their economic and technological characteristics. The document summarized covers the first 40 pages out of a total of 220 pages, focusing primarily on the introduction and the conceptual framework used in the book.

### Guglielmo
Can you please elaborate on the strategies of Netflix?

### AI PDF GPT
The strategies of Netflix, as described in "Netflix and Streaming Video" by Amanda D. Lotz, are multifaceted and designed to cater to a variety of audience preferences, distinguishing the service from traditional linear ad-supported services and theatrical film. Key elements of Netflix's strategy include:

1. **Diverse Content Strategy**: Netflix does not adhere to a singular content strategy. Instead, it aims to offer content that is subtly distinct from typical linear ad-supported services, focusing on different "taste clusters" or communities based on viewers' preferences. This strategy leads to a multifaceted library that may appear contradictory but is intentional to appeal to diverse sensibilities [page 162,163](https://myaidrive.com/?r=c#/home?folder=&file=Amanda%20D.%20Lotz%20-%20Netflix%20and%20Streaming%20Video_%20The%20Business%20of%20Subscriber-Funded%20Video%20on%20Demand-Polity%20Press%20_2022_.pdf&pdfPage=162).

2. **Global Strategy**: Netflix aims to be a cosmopolitan service, disrupting traditional norms of national identity in constructing video audiences. Its strategy involves commissioning content that is deeply rooted in local cultures and stories, which paradoxically achieves universal appeal due to their authenticity and specificity [page 178,179](https://myaidrive.com/?r=c#/home?folder=&file=Amanda%20D.%20Lotz%20-%20Netflix%20and%20Streaming%20Video_%20The%20Business%20of%20Subscriber-Funded%20Video%20on%20Demand-Polity%20Press%20_2022_.pdf&pdfPage=178).

3. **Library Strategy**: Netflix's library strategy is to optimize options for various taste clusters, providing value to different viewer preferences. Unlike linear services, Netflix focuses on building a comprehensive library rather than scheduling specific content, which allows them to cater to a wide range of viewers [page 150,151](https://myaidrive.com/?r=c#/home?folder=&file=Amanda%20D.%20Lotz%20-%20Netflix%20and%20Streaming%20Video_%20The%20Business%20of%20Subscriber-Funded%20Video%20on%20Demand-Polity%20Press%20_2022_.pdf&pdfPage=150).

4. **Targeting Niche Audiences**: Netflix specifically targets certain niches, such as young adults or specific genres like sci-fi, which might not attract a large audience for linear ad-supported services. This strategy allows them to cater to specific audience segments effectively [page 168,167](https://myaidrive.com/?r=c#/home?folder=&file=Amanda%20D.%20Lotz%20-%20Netflix%20and%20Streaming%20Video_%20The%20Business%20of%20Subscriber-Funded%20Video%20on%20Demand-Polity%20Press%20_2022_.pdf&pdfPage=168).

5. **Commissioning Content**: Over time, Netflix has increased its focus on commissioning content, a strategic shift that allows them to craft their own identity and prepare for a future with more competition. This approach helps Netflix to stand out and maintain subscriber interest [page 155,156](https://myaidrive.com/?r=c#/home?folder=&file=Amanda%20D.%20Lotz%20-%20Netflix%20and%20Streaming%20Video_%20The%20Business%20of%20Subscriber-Funded%20Video%20on%20Demand-Polity%20Press%20_2022_.pdf&pdfPage=155).

6. **Utilization of Data**: Netflix employs sophisticated data use to sort subscribers into different taste clusters based on their viewing behavior. This data-driven approach enables them to target content to specific viewer preferences rather than broad demographics [page 87](https://myaidrive.com/?r=c#/home?folder=&file=Amanda%20D.%20Lotz%20-%20Netflix%20and%20Streaming%20Video_%20The%20Business%20of%20Subscriber-Funded%20Video%20on%20Demand-Polity%20Press%20_2022_.pdf&pdfPage=87).

These strategies collectively illustrate Netflix's approach to becoming a dominant player in the SVOD market, focusing on diversity, global reach, and data-driven content curation.

### Guglielmo
Can you please elaborate on the "conglomerated niche" strategy?

### AI PDF GPT
The "conglomerated niche" strategy as described in Amanda D. Lotz's "Netflix and Streaming Video" refers to a content aggregation approach that combines elements of mass broadcasting and niche segmentation strategies. This strategy, especially prominent in SVODs like Netflix, is aimed at servicing multiple taste segments and sensibilities simultaneously.

1. **Combination of Broadcasting and Niche Segmentation**: SVODs with general libraries blend the mass scale of broadcasting with the niche segmentation of cable/satellite channels. This is facilitated by their ability to provide on-demand access, enabling them to execute this distinctive strategy of serving various tastes and sensibilities at the same time [page 63](https://myaidrive.com/?r=c#/home?folder=&file=Amanda%20D.%20Lotz%20-%20Netflix%20and%20Streaming%20Video_%20The%20Business%20of%20Subscriber-Funded%20Video%20on%20Demand-Polity%20Press%20_2022_.pdf&pdfPage=63).

2. **Transnational Scale**: The conglomerated niche strategy is particularly effective for transnational services, as it achieves scale by serving segmented audiences on a global basis. This approach allows SVODs to cater to specific audience segments that may be too small or diverse when segmented by nation [page 107,108](https://myaidrive.com/?r=c#/home?folder=&file=Amanda%20D.%20Lotz%20-%20Netflix%20and%20Streaming%20Video_%20The%20Business%20of%20Subscriber-Funded%20Video%20on%20Demand-Polity%20Press%20_2022_.pdf&pdfPage=107).

3. **Precedent in Cable and Broadcast Channels**: This strategy is not entirely unprecedented. In the past, conglomerates like Viacom attempted a version of this strategy by owning multiple niche, linear-delivered, nationally bound program services [page 64,63](https://myaidrive.com/?r=c#/home?folder=&file=Amanda%20D.%20Lotz%20-%20Netflix%20and%20Streaming%20Video_%20The%20Business%20of%20Subscriber-Funded%20Video%20on%20Demand-Polity%20Press%20_2022_.pdf&pdfPage=64).

4. **Library Strategy and On-Demand Access**: The key to this strategy is building a diverse library rather than a schedule, with SVODs offering depth in content to appeal to different viewer preferences. This strategy is effective because subscribers can access any part of the library on demand [page 60,61](https://myaidrive.com/?r=c#/home?folder=&file=Amanda%20D.%20Lotz%20-%20Netflix%20and%20Streaming%20Video_%20The%20Business%20of%20Subscriber-Funded%20Video%20on%20Demand-Polity%20Press%20_2022_.pdf&pdfPage=60).

5. **Diverse and Contradictory Content**: The strategy results in a library where different content caters to diverse and sometimes contradictory taste clusters. This approach is similar to how HBO, a subscriber-funded cable channel, aggregates niches but with greater scale and content variety [page 65](https://myaidrive.com/?r=c#/home?folder=&file=Amanda%20D.%20Lotz%20-%20Netflix%20and%20Streaming%20Video_%20The%20Business%20of%20Subscriber-Funded%20Video%20on%20Demand-Polity%20Press%20_2022_.pdf&pdfPage=65).

6. **Content Strategy Alignment with Viewer Preferences**: This strategy also involves sorting viewers into different taste clusters based on their viewing behavior, which helps in aligning the content strategy closely with viewer preferences [page 87](https://myaidrive.com/?r=c#/home?folder=&file=Amanda%20D.%20Lotz%20-%20Netflix%20and%20Streaming%20Video_%20The%20Business%20of%20Subscriber-Funded%20Video%20on%20Demand-Polity%20Press%20_2022_.pdf&pdfPage=87).

Overall, the conglomerated niche strategy is a sophisticated approach to content curation in SVODs, allowing them to appeal to a wide range of audience segments by offering a diverse array of content that caters to multiple, often distinct, tastes and preferences.

![vector_database.webp](attachment:vector_database.webp)

## RAG: In-Depth Explanation

**RAG** stands for **Retrieval-Augmented Generation**. This system is a hybrid model combining two distinct phases in information processing:

1. **Retrieval**: The retriever component is responsible for sifting through a vast corpus of data to find relevant pieces of information. It uses a vector database to efficiently search through high-dimensional data representations. This is essential because the retriever can access a much larger body of information than what can be stored in the parameters of a generative model.

2. **Generation**: Once relevant information is retrieved, the generator component comes into play. It synthesizes the retrieved data to compose a coherent and contextually appropriate response. This generator is typically a language model trained to produce human-like text based on the input it receives.

### Necessity of RAG

The necessity of RAG arises from the limitations of standalone generative models in handling complex queries that require external knowledge or up-to-date information. Here are several reasons why RAG is essential:

- **Volume of Data**: No single model can store all the world's knowledge within its parameters. RAG allows a model to tap into vast external databases without needing to memorize the information.

- **Up-to-Date Information**: Generative models are often trained on datasets that become outdated quickly. RAG can retrieve the latest information from updated databases, ensuring the responses reflect current knowledge.

- **Contextual Relevance**: Sometimes, the context required to answer a question is not available in the prompt given to the generator. RAG ensures that the generator has all relevant information needed to provide accurate answers.

- **Specificity and Detail**: Generative models might struggle with providing detailed responses to specific queries. By using RAG, these models can pull in specific details from retrieved documents to enhance the quality and specificity of their responses.

- **Efficiency in Information Retrieval**: Without RAG, a generative model must answer from whatever is implicitly stored in its parameters, which can produce vague or inaccurate responses. RAG narrows the generator's input to the passages most relevant to the query, so only pertinent information is considered when generating a response.

### Example of RAG Necessity

Imagine you're using a GPS application that has an integrated AI chatbot to help you find restaurants. If you ask for the best sushi place in town, the chatbot needs to:

1. Understand the question (which it can do with the generative model alone).
2. Know the latest and most relevant local information (which it may not have if it only relies on its pre-existing knowledge base).

By using RAG, the chatbot can access a vector database of the latest restaurant reviews and recommendations, retrieve the most relevant ones, and then generate a response that accurately reflects the current best sushi places.
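The sushi scenario can be sketched in a few lines. This is a toy illustration under stated assumptions: the restaurant snippets are invented, the hand-written 3-dimensional vectors stand in for real embeddings, and a plain Python list stands in for a vector database. The final prompt is what would be handed to the generator.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, store, k=2):
    """Retrieval step: return the k passages most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item["vector"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:k]]


def build_rag_prompt(question, passages):
    """Generation step input: the question plus the retrieved context."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Invented passages with hand-made vectors (not real embeddings)
store = [
    {"text": "Sakura Sushi reopened last month with a new chef.", "vector": [0.9, 0.1, 0.0]},
    {"text": "The city library is extending its opening hours.", "vector": [0.0, 0.2, 0.9]},
    {"text": "Umi Sushi Bar received top reviews this week.", "vector": [0.8, 0.3, 0.1]},
]
question = "What is the best sushi place in town?"
query_vec = [1.0, 0.2, 0.0]  # assumed output of an embedding model for the question

passages = retrieve(query_vec, store, k=2)
prompt = build_rag_prompt(question, passages)
print(prompt)
```

Only the two sushi-related passages reach the prompt; the unrelated library notice is filtered out before the generator ever sees it.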

In summary, RAG is essential for creating AI systems that provide accurate, detailed, and up-to-date responses, which is especially critical in dynamic fields such as news, medical advice, legal assistance, and more.

![embedding.webp](attachment:embedding.webp)

## Detailed Insights into Embeddings

### Conceptualizing Embeddings
**Embeddings** are a transformative approach in machine learning that turn qualitative, often categorical data, into a quantitative format that algorithms can process. They map entities—whether words, images, or other types of data—into vectors of real numbers.

### Purpose and Advantages of Embeddings
The key purpose of embeddings is to capture and quantify the relationships and contextual similarities between data points in a way that reflects human intuition. They:
- Enable pattern recognition in complex data.
- Support operations like clustering, similarity detection, and anomaly detection.
- Offer dimensionality reduction, turning potentially vast arrays of data into manageable vector spaces while preserving most of the information that matters for the task.

### Training Embeddings
Embeddings are learned from data. Using models like neural networks, the system is trained to recognize patterns and associations, embedding them within the geometry of the vector space. Through training, the model learns to place similar entities closer together and dissimilar ones further apart.

### Multidimensionality of Embeddings
Although the uploaded image represents embeddings in two dimensions for visualization purposes, actual embeddings often exist in higher dimensions (hundreds or even thousands) to capture the rich semantics and relationships of the data.

### From the Diagram: A Visual Representation
In the provided image, we see a two-dimensional embedding space populated with various fruits and vegetables. Each item's position is a simplified representation of potentially complex relationships. The coordinates `[0.86, 0.77]` for the 'Peach' reflect its 'location' in this conceptual space relative to other items.

### Similarity and Distance Metrics
The proximity between points in an embedding space is measured using metrics like cosine similarity or Euclidean distance. The two read in opposite directions: for Euclidean distance, smaller values indicate greater similarity, while for cosine similarity, larger values (closer to 1) indicate greater similarity. The sidebar in the diagram likely lists the distance from 'Peach' to the other items, with 'Orange' being the closest at a score of 1.2.
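A short sketch of both metrics follows. The coordinates for 'Peach' are taken from the diagram; the coordinates for 'Orange' and 'Carrot' are hypothetical values chosen only to illustrate the comparison.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between vectors; larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


peach = [0.86, 0.77]   # from the diagram
orange = [0.80, 0.70]  # hypothetical
carrot = [0.20, 0.30]  # hypothetical

for name, vec in [("orange", orange), ("carrot", carrot)]:
    print(name, euclidean_distance(peach, vec), cosine_similarity(peach, vec))
```

With these values, 'Orange' is closer to 'Peach' than 'Carrot' under both metrics: its distance is smaller and its cosine similarity is larger.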

### Embeddings in Context: Enhancing RAG
When applied to **Retrieval-Augmented Generation** (RAG), embeddings empower the system to:
- Efficiently retrieve contextually relevant data from a larger corpus.
- Understand and respond to queries with a level of nuance and specificity that mirrors human understanding.

### Practical Applications and Challenges
Embeddings are used in:
- Search engines, to surface content related to a query.
- Recommendation systems, to suggest items similar to a user's interests.
- Natural language processing, to grasp the nuances of human language.

However, embeddings can also be challenging:
- They require large datasets and computational power to train effectively.
- They can inadvertently capture and perpetuate biases present in the training data.

### Conclusion
Embeddings represent a leap forward in our ability to quantitatively analyze qualitative data. They are the linchpin of systems that require a deep understanding of content, context, and the subtle connections between data points, such as RAG. The fruitful application of embeddings spans across various domains, from digital assistants to complex data analytics in media studies.



## Interactive Exploration of "Grey's Anatomy" Narrative Ecosystem

In this exercise, we'll embark on a journey through the narrative ecosystem of the popular medical drama series "Grey's Anatomy." We'll leverage various tools and APIs to extract episode summaries, create a searchable in-memory database, and construct a chat interface that allows us to converse with the series as if it were an entity itself.

### Scraping the Wiki for Summaries
Our first step involves collecting data from the "Grey's Anatomy" wiki. This website is a treasure trove of information, containing detailed summaries of every episode aired. By scraping the wiki, we can compile a comprehensive dataset of episode summaries.

#### Objective
- Use web scraping techniques to extract short summaries of each "Grey's Anatomy" episode.
- Clean and structure the data for further processing.

### Guglielmo
*Attached file: Drowning on Dry Land Grey's Anatomy Universe Wiki Fandom.html*

Please analyze this web page, find the text that corresponds to "Short Summary" and show it to me.

### ChatGPT
Here is the short summary from the "Drowning on Dry Land" Grey's Anatomy Universe Wiki page:

"The mass casualty incident continues to swamp Seattle Grace. Alex cares for Jane Doe while trying to find her husband, as well as please the hopeful, yet angry crowd in the clinic trying to get information. George crosses a professional line, and begins to regret it. Izzie tries her hand at a burr hole, not an easy feat for her as she doesn't have surgical equipment, nor much recent hands-on OR time, and Derek is sent into the field to manage some of the trauma, specifically neuro, coming in, but finds something else."

### Guglielmo
*Attached spreadsheet: greys_reduced.csv*

Very good. Now, analyze the CSV file I uploaded. The "Title_Link" column contains links to web pages similar to the one where you found the Short Summary previously.
I need a script that will read each page in "Title_Link," extract the Short Summary, and insert it into the "greys_reduced.csv" file in a new "Short Summary" column.

### ChatGPT
To achieve your task of reading each web page from the "Title_Link" column in the "greys_reduced.csv" file, extracting the Short Summary, and then inserting it into a new "Short Summary" column in the same file, you need a script that can:

1. Read the CSV file and extract the URLs from the "Title_Link" column.
2. Access and parse each web page to extract the Short Summary.
3. Write the extracted summaries back into the CSV file in a new column.

However, there are some limitations to consider:

- My current environment does not support direct access to external web pages, so I cannot fetch data from the URLs provided in your CSV file.
- My environment also does not allow running asynchronous tasks, which would be required to efficiently process multiple web pages.

Given these constraints, the best I can do is provide you with a Python script that outlines the necessary steps. You would then need to run this script in your own environment where you can access the internet and process these web pages. The script will use libraries like `pandas` for handling the CSV file and `requests` and `beautifulsoup4` for fetching and parsing the web pages.

Here's a basic outline of the script:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def extract_short_summary(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            # Find the Short Summary based on its unique identifiers (like id or class)
            summary_element = soup.find('span', id='Short_Summary')
            if summary_element and summary_element.find_next_sibling('p'):
                return summary_element.find_next_sibling('p').get_text(strip=True)
        return "Summary not found or inaccessible"
    except requests.RequestException:
        return "Error in accessing page"

# Load the CSV file
df = pd.read_csv('path_to/greys_reduced.csv')

# Apply the function to each URL in the Title_Link column
df['Short Summary'] = df['Title_Link'].apply(extract_short_summary)

# Save the updated DataFrame to a new CSV file
df.to_csv('path_to/greys_reduced_with_summaries.csv', index=False)
```

Please ensure you have the necessary Python packages installed (`pandas`, `requests`, `beautifulsoup4`) before running this script. Replace `'path_to/greys_reduced.csv'` with the actual path to your CSV file. The script will save the updated DataFrame to a new file named 'greys_reduced_with_summaries.csv'.

### Guglielmo
    </div></div></div></div><h2><span class="mw-headline" id="Short_Summary">Short Summary</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="https://greysanatomy.fandom.com/wiki/Drowning_on_Dry_Land?veaction=edit&amp;section=1" class="mw-editsection-visualeditor" title="Edit section: Short Summary" data-tracking-label="edit-section-anon"><svg class="wds-icon wds-icon-tiny"><use xlink:href="#wds-icons-pencil-tiny"></use></svg></a><span class="mw-editsection-bracket">]</span></span></h2>
    <p>The mass casualty incident continues to swamp Seattle Grace. Alex 
    cares for Jane Doe while trying to find her husband, as well as please 
    the hopeful, yet angry crowd in the clinic trying to get 
    information.&nbsp;George crosses a professional line, and begins to 
    regret it.&nbsp;Izzie tries her hand at a burr hole, not an easy feat 
    for her as she doesn't have surgical equipment, nor much recent hands-on
     OR time,&nbsp;and Derek is sent into the field to manage some of the 
    trauma, specifically neuro, coming in, but finds something else.
    </p>


### ChatGPT
Thank you for providing the specific HTML structure. Based on this structure, the Short Summary is contained within a `<p>` tag that follows an `<h2>` tag with an inner `<span>` whose id is `Short_Summary`.

I will adjust the script to accurately target this structure. Here's the updated script:

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

def extract_short_summary(url):
    """Return the text of the paragraph that follows the "Short Summary" heading."""
    try:
        # A timeout keeps one unresponsive page from stalling the whole run
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            # The span with id "Short_Summary" sits inside an <h2>;
            # the summary is the first <p> sibling after that heading
            summary_element = soup.find('span', id='Short_Summary')
            heading = summary_element.find_parent('h2') if summary_element else None
            if heading:
                summary_paragraph = heading.find_next_sibling('p')
                if summary_paragraph:
                    return summary_paragraph.get_text(strip=True)
        return "Summary not found or inaccessible"
    except requests.RequestException as e:
        return f"Error in accessing page: {e}"

# Load the CSV file
df = pd.read_csv('Greys_Anatomy_Episodes_Cleaned.csv')

# Apply the function to each URL in the Title_Link column
df['Short Summary'] = df['Title_Link'].apply(extract_short_summary)

# Save the updated DataFrame to a new CSV file
df.to_csv('Greys_Anatomy_Episodes_Cleaned_with_summaries.csv', index=False)
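
Before running the script against every URL in the CSV, it can help to check the extraction logic offline on a fragment like the one pasted above. The following is a minimal sketch that does this with Python's standard-library `html.parser` instead of BeautifulSoup (a deliberate swap, so the check has no third-party dependencies). The class name `ShortSummaryParser` and the shortened sample fragment are illustrative assumptions, not part of the original script.

```python
from html.parser import HTMLParser


class ShortSummaryParser(HTMLParser):
    """Collect the text of the first <p> that follows the Short_Summary span."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.seen_heading = False   # True once <span id="Short_Summary"> was seen
        self.in_summary = False     # True while inside the target <p>
        self.done = False           # True after the target </p>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "span" and dict(attrs).get("id") == "Short_Summary":
            self.seen_heading = True
        elif tag == "p" and self.seen_heading:
            self.in_summary = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_summary:
            self.in_summary = False
            self.done = True

    def handle_data(self, data):
        if self.in_summary:
            self.parts.append(data)

    def summary(self):
        # Collapse line breaks and non-breaking spaces into single spaces
        return " ".join("".join(self.parts).split())


# Shortened, hypothetical fragment with the same structure as the real page
sample_html = """
<h2><span class="mw-headline" id="Short_Summary">Short Summary</span></h2>
<p>The mass casualty incident continues to swamp Seattle Grace.&nbsp;George
crosses a professional line, and begins to regret it.
</p>
<p>A later paragraph that should not be captured.</p>
"""

parser = ShortSummaryParser()
parser.feed(sample_html)
parser.close()
extracted = parser.summary()
print(extracted)
```

If the extracted text matches the paragraph under the heading, the same heading-then-paragraph logic in the BeautifulSoup script is likely targeting the right element.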


### Creating a Searchable Index with DocArray
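With our dataset in hand, the next step is to transform this collection of summaries into a searchable index. For this, we'll use `DocArrayInMemorySearch`, a vector store class in LangChain built on the DocArray library. It keeps the document embeddings in memory, which is convenient for a small dataset like ours because no external database is needed.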

#### Objective
- Initialize a `DocArrayInMemorySearch` instance with the episode summaries.
- Index the summaries to create a searchable in-memory database.
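
Conceptually, an in-memory vector index stores one embedding per document and answers a query by ranking the documents by their similarity to the query's embedding. The sketch below illustrates that idea in plain Python; it is not how `DocArrayInMemorySearch` is implemented. The bag-of-words `embed` function is a crude stand-in for a learned embedding model, and the three summaries are invented for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a learned embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyInMemoryIndex:
    """Keeps (text, vector) pairs in a list and searches them by brute force."""

    def __init__(self, texts):
        self.entries = [(text, embed(text)) for text in texts]

    def similarity_search(self, query, k=2):
        query_vector = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


# Invented summaries, for illustration only
summaries = [
    "a bomb threatens the hospital and the surgeons",
    "the interns compete for a rare heart surgery",
    "a ferry accident floods the emergency room with patients",
]

index = ToyInMemoryIndex(summaries)
top_hits = index.similarity_search("what happens with the bomb", k=1)
print(top_hits)
```

A real embedding model places texts with similar meaning close together even when they share no words, which word counts cannot do; the ranking step, however, works the same way.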

### Building a Chat Interface with ChatGPT's API
Now comes the interactive part. Using the ChatGPT API, we'll create a chat interface that enables us to query the "Grey's Anatomy" narrative ecosystem. This setup will allow us to ask questions about the series and receive responses that draw directly from the episode summaries in our database.

#### Objective
- Develop a chat interface using the ChatGPT API.
- Integrate the searchable index with the chat interface to pull in relevant information from the episode summaries.

### The User Experience
Imagine asking the chat interface, "What happens to Meredith in the bomb episode?" The system would then:
1. Parse the question to understand it's referring to a specific event in the series.
2. Query the `DocArrayInMemorySearch` to retrieve relevant episode summaries.
3. Use the ChatGPT API to generate a conversational response that details the event, drawing from the retrieved summaries.
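
The steps above can be sketched as a small function. To keep the example runnable without an API key, the retriever and the language model are passed in as plain callables. The names `build_prompt`, `answer_question`, `fake_retriever`, and `fake_llm` are hypothetical and used only for illustration; in the notebook, their roles would be played by the vector store and the chat model.

```python
def build_prompt(question, summaries):
    """Combine retrieved summaries and the user's question into one prompt."""
    context = "\n\n".join(summaries)
    return (
        "Answer the question using only the episode summaries below.\n\n"
        f"Summaries:\n{context}\n\n"
        f"Question: {question}"
    )


def answer_question(question, retrieve, generate):
    """Retrieve relevant summaries, then ask the model to answer from them."""
    summaries = retrieve(question)              # query the index
    prompt = build_prompt(question, summaries)  # pack context and question together
    return generate(prompt)                     # generate the response


# Stand-ins so the sketch runs offline (both are assumptions, not real services)
def fake_retriever(question):
    return ["Invented summary: a bomb is found inside a patient."]


def fake_llm(prompt):
    return f"(model reply based on a prompt of {len(prompt)} characters)"


question = "What happens in the bomb episode?"
prompt_preview = build_prompt(question, fake_retriever(question))
reply = answer_question(question, fake_retriever, fake_llm)
print(prompt_preview)
print(reply)
```

Passing the retriever and model as arguments keeps the orchestration logic separate from any particular vector store or API, which also makes it easy to test.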

### Conclusion
By the end of this exercise, we will have created an end-to-end system that not only holds detailed knowledge of "Grey's Anatomy" episodes but also interacts with users in a natural, intuitive manner. This example illustrates the potential of combining web scraping, vector search databases, and conversational AI to engage with narrative ecosystems in a deep and meaningful way.

In [None]:
# !pip install --upgrade langchain
# !pip install openai
# !pip install python-dotenv
# !pip install docarray
# !pip install tiktoken

In [None]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file (expected to define OPENAI_API_KEY)

In [None]:
llm_model = "gpt-4-1106-preview"

In [None]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown

In [None]:
file = 'Greys_Anatomy_Episodes_Cleaned_with_summaries.csv'
loader = CSVLoader(file_path=file)

In [None]:
docs = loader.load()

In [None]:
docs[0]

In [None]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [None]:
embed = embeddings.embed_query("My name is Guglielmo")

In [None]:
print(len(embed))

In [None]:
print(embed[:5])

In [None]:
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
)

In [None]:
query = "Please list all episodes about covid"

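In [None]:
# similarity_search returns only the top k matches (k=4 by default).
# A new name keeps the full list of loaded documents in `docs` intact.
relevant_docs = db.similarity_search(query)

In [None]:
len(relevant_docs)

In [None]:
retriever = db.as_retriever()

In [None]:
llm = ChatOpenAI(temperature = 0.0, model=llm_model)

In [None]:
# Separate the retrieved summaries with blank lines so they do not run together
qdocs = "\n\n".join(doc.page_content for doc in relevant_docs)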

In [None]:
response = llm.call_as_llm(f"{qdocs} Question: Please list all \
episodes about covid in a table and report the summary. The table should be in markdown.") 

In [None]:
display(Markdown(response))

# Sentiment Analysis
Sentiment analysis, also known as opinion mining, is a field of study that analyzes people's sentiments, opinions, evaluations, attitudes, and emotions from written language. It is a sub-discipline of natural language processing (NLP) and computational linguistics that focuses on identifying subjective information in text data.

### What is Sentiment Analysis?

Sentiment analysis involves using algorithms to detect the polarity (positive, negative, neutral) and sometimes the emotions (happy, sad, angry, etc.) expressed in a piece of text. This goes beyond merely searching for positive and negative words: it requires understanding the nuances of language as used in context. It has become a key tool in the arsenal of data scientists, allowing them to uncover the sentiments behind social media posts, reviews, and any user-generated content.

### Main Uses of Sentiment Analysis

The applications of sentiment analysis are extensive and include:

- **Business Intelligence**: Companies use sentiment analysis to gauge public opinion, monitor brand and product reputation, and understand customer experiences.
- **Market Research**: Understanding consumer sentiment towards products or services to drive marketing strategies.
- **Politics**: Analyzing public sentiment towards policies, campaigns, or political figures.
- **Social Monitoring**: Tracking societal trends, public health responses, or reactions to events and news.

### Sentiment Analysis with Traditional Tools

Traditional sentiment analysis tools like VADER (Valence Aware Dictionary and sEntiment Reasoner) and TextBlob are rule-based systems that use a dictionary of words and their associated sentiment scores to evaluate the sentiment of a text. VADER, for instance, is specifically tuned to sentiments expressed in social media and can understand text with nuances such as capitalization and exclamation points, which express intensity.

TextBlob, on the other hand, provides a simple API for diving into common NLP tasks such as part-of-speech tagging, noun phrase extraction, and sentiment analysis. These tools work well with straightforward texts but can struggle with complex sentence structures, sarcasm, and context-dependent expressions.
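
To make the rule-based approach concrete, the sketch below scores a sentence with a tiny hand-made lexicon and two simple rules (negation and exclamation marks). It is a toy illustration of the general idea, not VADER's or TextBlob's actual algorithm. The lexicon, the weights, and the function name `lexicon_sentiment` are all invented for this example.

```python
import re

# Invented mini-lexicon: word -> valence (positive > 0, negative < 0)
LEXICON = {"love": 2.0, "great": 1.5, "good": 1.0,
           "bad": -1.0, "short": -0.5, "terrible": -2.0}
NEGATIONS = {"not", "never", "no"}


def lexicon_sentiment(text):
    """Sum word valences, flipping the sign after a negation word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    for i, token in enumerate(tokens):
        valence = LEXICON.get(token, 0.0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            valence = -valence  # "not good" counts as negative
        score += valence
    # Exclamation marks amplify whatever sentiment is present
    score *= 1.0 + 0.25 * min(text.count("!"), 3)
    return score


def label(score):
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


examples = [
    "I love my phone, but the battery life is too short.",
    "This is not good",
    "Great, my flight got cancelled again!",
]
scores = [lexicon_sentiment(text) for text in examples]
for text, score in zip(examples, scores):
    print(f"{score:+.2f} {label(score):8s} {text}")
```

The third sentence is sarcastic, yet the toy scorer rates it positive because it only sees the word "great". This is the kind of failure on sarcasm and context-dependent expressions described above.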

### Sentiment Analysis with Transformers and Large Language Models (LLMs)

With the advent of deep learning, newer models like transformers have taken sentiment analysis to the next level. Transformers process all the tokens of a sequence in relation to one another rather than strictly one after another, which makes them particularly good at capturing context within sentences and across longer passages. Their self-attention mechanism computes, for each token, how much weight to give every other token when building that token's representation, so a word such as "short" can be read differently in "short battery life" and "short wait".
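
The core of self-attention can be written in a few lines. The sketch below implements scaled dot-product attention for a single head in plain Python, using the token vectors themselves as queries, keys, and values (real transformers first apply learned projection matrices, which are omitted here). The three 2-dimensional token vectors are made-up numbers chosen only to keep the arithmetic readable.

```python
import math


def softmax(values):
    """Turn raw scores into weights that are positive and sum to 1."""
    shift = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens):
    """Single-head scaled dot-product attention with Q = K = V = tokens."""
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dimension)
        scores = [dot(query, key) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        # New representation: weighted average of all token vectors
        mixed = [sum(w * value[d] for w, value in zip(weights, tokens))
                 for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up 2-dimensional vectors for three tokens
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence. The first token gives more weight to the second token, whose vector is similar, than to the third.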

LLMs like BERT, GPT, and their variants have been trained on vast datasets and can perform sentiment analysis with more nuanced understanding of language. They are not just looking for specific sentiment-laden words but also understanding the context, the flow of the conversation, and the subtleties of language that traditional models might miss.

### Programmatically Using GPT for Sentiment Analysis

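GPT (Generative Pre-trained Transformer) models can be used for sentiment analysis either by fine-tuning them on a labeled dataset or by prompting them directly through the APIs provided by OpenAI. In the prompting approach, the input text is sent to the model together with an instruction, and the model returns a sentiment label or score as generated text.

Here's a simplified example of programmatic sentiment analysis with a pre-trained transformer. Note that it uses the Hugging Face `transformers` library rather than GPT: when no model is specified, `pipeline("sentiment-analysis")` loads a default DistilBERT classifier fine-tuned on the SST-2 dataset.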

```python
from transformers import pipeline

# Load a pre-trained model and tokenizer
sentiment_pipeline = pipeline("sentiment-analysis")

# Example text
text = "I love my phone, but the battery life is too short."

# Get sentiment prediction
result = sentiment_pipeline(text)

# The result is a list of dictionary items with the 'label' and 'score' keys
print(result)
```

The output will typically provide a label such as 'POSITIVE' or 'NEGATIVE', along with a confidence score. 

In practice, when using GPT or other transformer models for sentiment analysis, there are important considerations to keep in mind, such as the quality and relevance of the training data, the need for context in the analysis, and the interpretability of the model outputs.
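
For comparison, here is one way the prompting approach could look with a GPT-style chat model. This is a sketch under assumptions: the prompt wording, the helper names `build_sentiment_prompt`, `parse_label`, and `classify_sentiment`, and the three-label scheme are illustrative choices, and the model call is passed in as a function so the example runs without network access. With the LangChain setup used earlier in this lecture, a real call could be supplied as `llm.call_as_llm`.

```python
LABELS = ("positive", "negative", "neutral")


def build_sentiment_prompt(text):
    """Ask for exactly one label so the reply is easy to parse."""
    return (
        "Classify the sentiment of the text below as positive, negative, "
        "or neutral. Reply with a single word.\n\n"
        f"Text: {text}"
    )


def parse_label(reply):
    """Map a free-text model reply onto one of the allowed labels."""
    cleaned = reply.strip().lower().strip(".!\"'")
    return cleaned if cleaned in LABELS else "unknown"


def classify_sentiment(text, generate):
    """`generate` is any function that takes a prompt and returns a string."""
    return parse_label(generate(build_sentiment_prompt(text)))


# Offline stand-in for a chat model (an assumption for this sketch)
def fake_model(prompt):
    return " Negative.\n"


result = classify_sentiment("The battery died after an hour.", fake_model)
print(result)
```

Constraining the reply to a single word and mapping anything unexpected to `"unknown"` addresses part of the interpretability concern: downstream code never has to guess what a free-form answer meant.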

### Conclusion

Sentiment analysis represents a vital intersection between computer science and human psychology. Whether through traditional NLP tools or advanced LLMs, it allows us to understand and quantify the complex emotional dynamics of language. As we progress, the technology will continue to evolve, offering ever more accurate insights into the vast and nuanced world of human sentiment.


Interactive sentiment analysis demo (Streamlit app): https://gaganpreetkaurkalsi-sentimentanalysis-streamlit-app-ujm4r5.streamlit.app/


   ## Q&A and Discussion

    [Space for student questions and interactive discussion]
    