# Module 5 - Society & LLMs

> **Risks and Limitations of LLMs**

Many of these risks and limitations are really hard to mitigate, especially when irresponsible
or malicious actors are involved. But we'll first look at the source that
enables LLM power today, which is the data, and look at how data can also translate to model bias.
Then we will cover different aspects of misuse, whether intentional or not. We will also go
over LLMs' potential impact on jobs and their environmental costs. But I would like to note that
going through this list of risks and limitations doesn't mean that we're asking you to
stop using LLMs altogether. Rather, I hope that this session will serve as an invitation to all
of you, whether you are a user or a developer, to think about how we can collectively
be more responsible as we interact with these models and as we build these applications.

Examples:
- The debate between the "art world" and generative AI tools such as DALL-E, including copyright problems and other ethical discussions.

- Automation displaces jobs and increases inequality.

- Training an LLM from scratch incurs high environmental and financial costs, measured in carbon emissions and in millions of US dollars.

- Having a large volume of data available for training an LLM does not imply that it is good data. Most of the data is extracted from the internet, so it can be biased. For websites such as Reddit or Wikipedia, over 80% of the total traffic comes from males under the age of 29, and in the case of Wikipedia, only about 15% are female. So the idea is that perhaps the language models that we're using today are really not that smart; they are simply parrots that are really good at copying what humans say. It means that if the data is bad, we probably cannot expect these language models to do much better.

- Do we actually have good data quality? And how do we even start auditing when the data is so big? When the input data is biased, we can almost certainly expect the model to be biased as well. "Garbage in, garbage out" is an old adage that still applies to language models. Another fundamental limitation with the data today is that only certain types of stories make it into the news. For instance, a peaceful protest is much less eye-catching in a newspaper than a violent protest. Therefore, the former often goes unreported, which means our model doesn't know about it. A further limitation is that we cannot afford to update our model all that often, even if we can update our data. We established earlier that training a model from scratch is expensive, so when we cannot retrain on updated data, we risk having an outdated model.

- Models can be highly toxic, discriminatory, and exclusive, and the reason is that our data is flawed. If you look at the examples on the slide, we see on the right-hand side that many more females are represented in a family context than in a politics context. In fact, the paper found that female-sounding names are often depicted as less powerful. We can argue that this is a reflection of society, but it also means that we need to carefully consider how to use such a model when it can embed bias that we may not want it to incorporate. Some other models exhibit bias against certain demographic groups. It is also not surprising that these models can perform worse for some languages because of the data problem that we mentioned.

- The next risk has to do with information hazards, and it comes in two prongs. The first is when we accidentally compromise privacy by leaking or inferring private information. The slide shows how Sydney, the Microsoft chatbot, accidentally reveals itself to be Sydney. Employees can also accidentally leak company secrets by interacting with a closed-source model. What is really interesting but also concerning is the image on the bottom, which shows that an LLM can confidently output information that is incorrect. In fact, it suggests that violence within a couple can actually be good. LLMs can also facilitate many malicious use cases, for example fraud, censorship, surveillance, or cyber attacks.

- Lastly, and this is something that we're all prone to, there is the risk of relying on this technology way too much and giving these models way too much trust. For example, if I were struggling with my mental health, it would probably not be wise for me to consult a chatbot on what to do. Many of these generated texts indicate that large language models tend to hallucinate. We haven't touched upon the term "hallucination" yet, but we will address it soon.

> **Hallucinations**

Hallucination refers to when the generated content is nonsensical or unfaithful to the provided source content. It means that the output can sound completely natural and fluent and it also means that the output can sound really confident even when it's wrong.

There are two types of hallucination: intrinsic and extrinsic. Each of us may have a different tolerance level for how faithful or how factual we expect these outputs to be, and we'll talk in just a second about what faithful actually means in this context.
- Intrinsic hallucination refers to when the output directly contradicts the source. Suppose I give a source text that indicates the first Ebola vaccine was approved by the FDA in 2019, but a summary output indicates that the first Ebola vaccine was approved in 2021. This is a very clear case of contradiction, and it means that the output is not faithful to the source text. It also means that this output is not factual.
- Extrinsic hallucination refers to when we cannot verify the output from the source, but the model itself might not be wrong. For example, suppose I say that Alice won first prize in fencing last week, and then the model tells me that Alice won first prize in fencing for the first time last week and she was ecstatic. It may well be true that Alice won for the first time and that she was really excited about it, but we cannot verify that from the source. It means that we cannot really say that the output is factual or faithful to the source.

![Screen Shot 2024-01-20 at 19.01.40.png](attachment:57572669-6c58-46ab-bec0-81d81eda5fb8.png)


What leads to hallucination? 

The first component, probably without any surprises, is the data. How we collect data matters a lot in terms of how the model performs. We talked in the earlier segment about how, when we have big data, it's really hard to do audits well or to do any audits at all. The same goes for the data collection process. We may gather any text that we have without any factual verification. We also do not filter out exact duplicates most of the time. For example, if you were to ingest the same Reddit thread twice, that counts as a duplicate, and duplicates can bias the model. If the same Reddit threads show up many times in the data, it is more likely that the model will output responses from those Reddit threads.

The other problem regarding the data is how open-ended these generative tasks are. For example, in a chat application, we will probably want the chatbot to be more engaging, which means that we expect more diverse responses. If I were to ask the chatbot about the same thing many times and it always repeated the same things, it would probably be a chatbot that we wouldn't use for very long. So we want the chatbot, or some applications, to have more diversity to improve engagement, but this type of diversity can also correlate with bad hallucination, especially when we need factual and reliable output. When we ask the chatbot about medical literature, our tolerance level for anything non-factual will be quite low compared to when we ask how to make a perfect salad. This open-ended nature of generative tasks is a really hard-to-avoid problem, and it's something that we will have to deal with as users of LLM applications.
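The duplicate problem mentioned above can be illustrated with a minimal sketch of exact-duplicate filtering. This is only an illustration under stated assumptions, not the pipeline used for any real model: the function names `normalize` and `drop_exact_duplicates` and the sample threads are hypothetical, and the normalization step (lowercasing and collapsing whitespace) is one assumed choice among many.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def drop_exact_duplicates(documents):
    """Keep the first occurrence of each document, comparing normalized text by hash."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Hypothetical sample data: the first two entries are the same thread ingested twice.
threads = [
    "How do I fix my dishwasher?",
    "how do I fix   my dishwasher?",
    "What is the best salad dressing?",
]
deduplicated = drop_exact_duplicates(threads)
print(len(deduplicated))  # 2
```

Hashing keeps memory use small for large corpora, but it only catches exact copies after normalization; near-duplicates, such as a thread quoted with small edits, would need a fuzzier technique.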

![Screen Shot 2024-01-20 at 19.05.37.png](attachment:c7458d9e-8b96-48f2-99c1-b4a7c190614f.png)

The second component that leads to hallucination is the model. The first reason within the model
itself is imperfect encoder learning: the encoder learns wrong correlations between
parts of the training data. The second reason can happen at decoding time:
when the model is trying to generate text output, the decoder
attends to the wrong part of the input source. There are also decoding strategies that,
by design, encourage randomness and unexpectedness, for example top-k sampling.
With that type of decoding, rather than always picking the most likely token, the model selects any one
of the four candidates that you see on the slide to generate the next token.
The third reason is exposure bias. Very technically speaking, this means that there is
a discrepancy in decoding between training time and inference time. Plainly speaking,
it means that the model tends to generate output based on its own previously generated tokens,
so the model can veer off topic. When you start off asking about a dishwasher,
the model might then start generating content about the dryer.
The fourth reason has to do with parametric knowledge bias.
In short, it means that the model will stick to what it knows: models tend
to generate output based on what they have memorized rather than on the provided input.
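To make the top-k sampling idea concrete, here is a minimal sketch in plain Python. It assumes the model has already produced one score (logit) per vocabulary token; the function name `top_k_sample`, the toy vocabulary, and the logit values are all hypothetical, and `k=4` is chosen only to mirror the four candidates mentioned above.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample a token index from the k highest-scoring entries of `logits`."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over only those k logits (max subtracted for stability).
    max_logit = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - max_logit) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


# Hypothetical vocabulary and scores for the next token.
vocab = ["dishwasher", "dryer", "salad", "vaccine", "fencing", "protest"]
logits = [3.2, 2.9, 0.1, -1.0, 2.5, 2.4]

rng = random.Random(0)
samples = [top_k_sample(logits, k=4, rng=rng) for _ in range(1000)]
print(sorted({vocab[i] for i in samples}))
```

With `k=1` this reduces to always picking the most likely token. Larger values of `k` allow less likely tokens through, which adds diversity but also gives the model more chances to drift away from the source.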

![Screen Shot 2024-01-20 at 19.09.11.png](attachment:78845ca4-7403-4983-8cf2-86b9aa7a674f.png)

How do we evaluate hallucination?

Evaluating hallucination is tricky and imperfect. As I mentioned before, different individuals
can have different expectations about how the models should behave, and we can also have very
different decision criteria to determine whether certain content is toxic or why certain
content is classified as misinformation. There are two categories of metrics that we can rely
on to assist with evaluating hallucination.

The first category is statistical metrics.
BLEU, ROUGE, and METEOR have been around for some time in NLP, and when using these metrics,
we see that approximately 25% of summaries contain hallucination,
which means that they contain unsupported information.
The second metric is called PARENT, which measures hallucination using both
the source and the output text. Behind the scenes, it uses n-grams
to capture what is in the source versus the target, and then it calculates the F1 score.
The third type of metric is called BVSS, which stands for Bag-of-Vectors
Sentence Similarity. It measures whether the translation output has
the same amount of information as the translation reference.

The second category is model-based metrics, which means that we are leveraging another model to help
us evaluate hallucination. It also means that any errors
from the models that we're leveraging get propagated as well.
The first type of model that we can leverage is information extraction. This is especially
useful for named entity recognition use cases: we extract knowledge
so that we can compare it with the language model output.
The second metric is question-answering-based. It means
that we can measure faithfulness by measuring the similarity
among the different answers to the same question.
The third metric is faithfulness.
It asks the question "does the output actually contain any unsupported information?"
The last one is language-model-based, which means that we are using a language model to
help us calculate the ratio of hallucinated tokens to the total number of target tokens.

So as you can see, there are a variety of metrics to help us evaluate hallucination,
but none of them is perfect. In the next segment, we'll talk about mitigation strategies.
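The n-gram idea behind the statistical metrics can be sketched as follows. This is a deliberately simplified overlap score and is not the PARENT metric or any library's implementation: the function names `ngrams` and `ngram_f1` are hypothetical, and whitespace tokenization is an assumption made to keep the example short. The example sentences reuse the Ebola vaccine case from the hallucination section.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_f1(source, output, n=1):
    """F1 of n-gram overlap between a source text and a generated output."""
    src = ngrams(source.lower().split(), n)
    out = ngrams(output.lower().split(), n)
    overlap = sum((src & out).values())  # n-grams shared by both texts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out.values())  # share of the output found in the source
    recall = overlap / sum(src.values())     # share of the source covered by the output
    return 2 * precision * recall / (precision + recall)


source = "the first ebola vaccine was approved by the fda in 2019"
faithful = "the first ebola vaccine was approved in 2019"
hallucinated = "the first ebola vaccine was approved in 2021"

print(round(ngram_f1(source, faithful), 2))      # 0.84
print(round(ngram_f1(source, hallucinated), 2))  # 0.74
```

In this toy run the wrong year lowers the score only slightly, even though it changes the central fact of the summary. That is one way to see why overlap-based metrics are an imperfect signal for hallucination.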

![Screen Shot 2024-01-20 at 19.13.31.png](attachment:ea379fcd-ab1a-45cd-80a3-24233498071b.png)

> **Hallucination Mitigation Strategies**

Since hallucination stems from both the data and the model, it is only appropriate that we
address it from both the data and the model perspectives as well.

- The first is to build a faithful data set. This involves having humans write clean and faithful targets from scratch, given the source text. We may also want to rewrite real sentences from the web and have humans filter out any non-factual data or make corrections to the current data. We may also want to look into augmenting the input data with more sources.

- The second angle is about doing more architectural research and experimentation to improve the current modeling and inference methods. Maybe this involves using more reinforcement learning, and maybe it also involves using more multi-task learning, because hallucination can stem from reliance on a single data set. We can also do more post-processing corrections, which will involve a human in the loop.


How to reduce risks and limitations?

How do we reduce risks and limitations, in general, for all large language models? We talked about data bias, toxic models, information hazards, and malicious users. We have to combat all of these on a united front. For data bias, we want to look at the data slices and maybe even update our data more frequently. For toxic models, this will require approaches from multiple angles.

- First of all, it has to do with assessing the data, but we can also incorporate some post-processing tools from Hugging Face and Spark NLP. In fact, these are two tools I will look at in the code later on. We can also implement guardrails around these large language models, for example with NeMo Guardrails, and curate more data for fine-tuning.

- For misinformation or information hazards, we want to look at the source, that is, where the information actually comes from. That can include curating data for fine-tuning or fine-tuning your own model.

- For malicious users, catching these bad actors requires some type of regulation. In fact, chances are that many of the risks and limitations that we see with large language models need some type of regulation to help us govern their usage.
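As a rough illustration of the post-processing and guardrail idea in the list above, here is a minimal sketch that redacts private-looking strings from a model's output before it reaches the user. It does not use the NeMo Guardrails, Hugging Face, or Spark NLP APIs. The names `BLOCKED_PATTERNS`, `redact`, `guarded_generate`, and `fake_model` are hypothetical, and the two regular expressions are simplistic stand-ins for a real detector.

```python
import re

# Hypothetical patterns for private-looking strings (illustrative only).
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),     # US Social Security number shape
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email address shape
]


def redact(text, placeholder="[REDACTED]"):
    """Replace every match of a blocked pattern with a placeholder."""
    for pattern in BLOCKED_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def guarded_generate(generate, prompt):
    """Call a text-generation function and post-process its output."""
    return redact(generate(prompt))


def fake_model(prompt):
    """Stand-in for an LLM call; always returns the same leaky answer."""
    return "Contact alice@example.com, SSN 123-45-6789."


print(guarded_generate(fake_model, "Who should I contact?"))
# Contact [REDACTED], SSN [REDACTED].
```

A wrapper like this only addresses what the patterns can recognize. It does nothing for bias or factual errors, which is why the approaches above are meant to be combined.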

![Screen Shot 2024-01-20 at 19.23.37.png](attachment:01d01ea2-d15e-43c4-9382-50debd95186b.png)


How do we govern and audit LLMs?

We can think about regulation from a three-layered standpoint. A
paper from 2023 proposes that we can audit on three separate layers. The first is governance,
which means that we should audit technology providers,
that is, all the big companies that are providing us the models to use.
The second audit is at the model level, which means that we audit a model prior to any
release to the public. The third audit is at the application level, which means that
we assess the risk of these models based on how users are actually using them.
But even with this framework, there are still some open questions for us as a community to answer.
How do we capture the entire landscape when we cannot be totally sure how users are
interacting with these models? And how do we audit closed-source models?
Thankfully, there have been a lot of discussions in the works recently in different countries globally.
Perhaps the biggest question of all is, even with the auditing framework in place, who should
be the one doing these audits? Any auditing is really only as good as the institution
delivering it. We also have to recognize that no large language model can have zero risk,
so we will have to come up with thresholds, however arbitrary, for what level of risk we accept.
But how do we capture deliberate misuse? And how do we address gray areas,
like when we are using large language models to generate creative products?
These are the types of questions that we, as a society, will need to wrestle with
as we advance more and more in large language model inventions and technology.

# Summary

- LLMs have tremendous potential to transform and truly revolutionize every single industry.
- But we need better data to build better models in the long term, apart from just training bigger and bigger models.
- But LLMs, in general, can hallucinate, cause harm, and influence human behavior when we over-rely on them.
- We still have a really long way to go to properly evaluate large language models and oftentimes the metrics that we use can be very subjective.
- We need regulatory standards that define what ethical and responsible usage of large language models looks like.