# Create evaluation dataset for Redbox RAG chat  <a class="anchor" id="title"></a>

**Before running this notebook**

Set the version of the evaluation dataset you are creating **[HERE](#setversion)**

## Table of Contents <a class="anchor" id="toc"></a>
* [Set version of the evaluation dataset you are creating](#setversion)
* [Overview](#overview)
* [Notebook Setup](#setup)
* [Generate Evaluation Dataset](#ragas)
* [Save Evaluation Dataset](#six-section)
* [Troubleshooting](#troubleshooting)
* [#TODO - Next Steps](#todo)

## Overview <a class="anchor" id="overview"></a>

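This notebook loads source files with the Langchain `DirectoryLoader`, generates a synthetic evaluation dataset with RAGAS, and saves it as a pickle, a DeepEval compatible CSV and a JSON file.

If you hit errors while running it, see the [Troubleshooting](#troubleshooting) section at the end of this notebook.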

[Back to top](#title)

## Notebook Setup <a id="setup"></a>

In [3]:
from tqdm.auto import tqdm
import pandas as pd
from typing import Optional, List, Tuple
import json

pd.set_option("display.max_colwidth", None)

[Back to top](#title)

------

## Set version of the evaluation dataset you are creating <a class="anchor" id="setversion"></a>

**Before running this notebook**

Set the version of the evaluation dataset you are creating

In [14]:
version = "0.1.0"

**It is really important to version the evaluations we are doing**, including the input data used to generate evaluation datasets

#### Create a data directory for this version of evaluation data

In [None]:
from pathlib import Path
dir_name = f"evaluation_data_v{version}"
dir_path = Path(f"./data/{dir_name}")

# Create the versioned data directory (no error if it already exists)
dir_path.mkdir(parents=True, exist_ok=True)

**Now copy all the files you want to use to generate your evaluation dataset into the new directory created above**
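
If you would rather script the copy step than do it by hand, a minimal sketch is shown here. The helper name `copy_source_files`, the default `patterns`, and the source folder in the usage note are all hypothetical; adjust them to wherever your input files actually live.

```python
import shutil
from pathlib import Path


def copy_source_files(source_dir, target_dir, patterns=("*.pdf", "*.docx", "*.txt")):
    """Copy files matching `patterns` from source_dir into target_dir.

    Returns the names of the copied files. The default patterns are an
    assumption, not a list of formats required by the notebook.
    """
    source, target = Path(source_dir), Path(target_dir)
    # Create the target directory if it does not exist yet
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for path in sorted(source.glob(pattern)):
            if path.is_file():
                # copy2 also preserves file metadata such as timestamps
                shutil.copy2(path, target / path.name)
                copied.append(path.name)
    return copied
```

For example, `copy_source_files("./my_source_documents", dir_path)` would copy matching files into the versioned directory created above, where `./my_source_documents` is a placeholder path.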

[Back to top](#title)

---------------

## Synthetically generate evaluation dataset <a class="anchor" id="ragas"></a>

RAGAS can generate a synthetic test set, as detailed [HERE](https://docs.ragas.io/en/stable/getstarted/testset_generation.html). It is perhaps not as state of the art as DeepEval (validate!), but it creates both `input` AND `expected_output` for us.

Note that the input questions are not generated from our own chunking strategy; however, they are generated from the same files.

In [15]:
# Takes about 4 minutes for 4 docs. Consider Langchain `unstructured`
from langchain.document_loaders import DirectoryLoader
# loader = DirectoryLoader("./data/evaluation_files")
loader = DirectoryLoader(f"./data/{dir_name}")
documents = loader.load()

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


#### Save Langchain documents for future use

In [16]:
import typing as t
import jsonlines
from langchain.schema import Document


def save_docs_to_jsonl(documents: t.Iterable[Document], file_path: str) -> None:
    with jsonlines.open(file_path, mode="w") as writer:
        for doc in documents:
            writer.write(doc.dict())


def load_docs_from_jsonl(file_path: str) -> t.List[Document]:
    documents = []
    with jsonlines.open(file_path, mode="r") as reader:
        for doc in reader:
            documents.append(Document(**doc))
    return documents

In [17]:
#TODO: Use the functions above!
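# Make sure the output directory exists before writing to it
Path("./data/synthetic_data").mkdir(parents=True, exist_ok=True)
save_docs_to_jsonl(documents, f"./data/synthetic_data/langchain_documents_for_ragas_v{version}.jsonl")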

-----------

In [18]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# generator with openai models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

## TODO: Next steps  <a class="anchor" id="todo"></a>

In [22]:
#TODO: Add code to handle rate limits in the generator - particularly for the critic using GPT-4: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_handle_rate_limits.ipynb
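
One possible starting point for the rate-limit TODO is a generic retry wrapper with exponential backoff. This is only a sketch: the decorator name `retry_with_backoff` and its parameters are hypothetical, and which exception to retry on (for example a rate-limit error raised by the OpenAI client) is an assumption you would need to confirm for the library versions in use.

```python
import random
import time
from functools import wraps


def retry_with_backoff(max_retries=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Retry the wrapped function, doubling the wait after each failure.

    `retry_on` defaults to every Exception, which is deliberately broad;
    narrow it to the rate-limit exception of your client (an assumption).
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    # Out of retries: re-raise the last error
                    if attempt == max_retries:
                        raise
                    # Waits grow as 1x, 2x, 4x ... of base_delay, capped at max_delay
                    delay = min(max_delay, base_delay * 2 ** attempt)
                    # Up to 10% random jitter so parallel callers do not retry in lockstep
                    sleep(delay + random.uniform(0, delay * 0.1))
        return wrapper
    return decorator
```

Wrapping a whole generation call this way would repeat the entire call on failure, so it may be more economical to apply the retry closer to the individual LLM requests if the framework allows it.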

In [None]:
#TODO: Replace the `generate_with_langchain_docs` method with something utilising Redbox chunking strategy

In [25]:
#TODO: Investigate why RAGAS sometimes creates NaN values!

In [None]:
#TODO: Can we add cost estimate to RAGAS synthetic test generation?

**CHANGE `test_size` to generate more evaluation data** (in cell below)

In [None]:
# generate testset
testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.4, reasoning: 0.3, multi_context: 0.3})

#### Save RAGAS generated testset as pickle <a class="anchor" id="six-section"></a>

In [10]:
import pickle

with open(f'./data/synthetic_data/ragas_testset_v{version}.pkl', 'wb') as f:
    pickle.dump(testset, f)

Export to Pandas

In [None]:
## To view dataframe in notebook (very text heavy!)
# testset.to_pandas()

#### Convert dataframe into a DeepEval compatible CSV & save

In [21]:
testset_df = testset.to_pandas()

# Rename the columns
new_column_names = {
    'question': 'input',
    'contexts': 'context',
    'ground_truth': 'expected_output',
    # Add more column names here
}

testset_df_renamed = testset_df.rename(columns=new_column_names)

#  DeepEval dataset format requires an 'actual_output' column
testset_df_renamed['actual_output'] = ''
testset_df_renamed = testset_df_renamed.drop(['evolution_type', 'metadata', 'episode_done'], axis=1)

# Convert all columns to string - otherwise DeepEval will throw a Pydantic validation error
testset_df_renamed = testset_df_renamed.astype(str)

# save as CSV
testset_df_renamed.to_csv(f'./data/synthetic_data/ragas_synthetic_data_v{version}.csv', index=False)

In [26]:
testset_df_renamed.head()

Unnamed: 0,input,context,expected_output,actual_output
0,What were the findings regarding the prevalence of behavioural disorders among EBCI children who received UBI-like payments?,"[' among EBCI children who received payments, levels of psychiatric symptoms fell significantly, with a clear downward trend in the prevalence of these disorders amongst EBCI children in receipt of payments over time. Specifically, Akee and colleagues (2018) identified a reduction in the prevalence of symptoms of behavioural disorders (-23% of a standard deviation) and emotional disorders (-37% of a standard deviation) among EBCI children who had received UBI-like payments for four years. Similarly, Costello and colleagues (2003) identified a 40% decrease in symptoms of behavioural disorders (such as conduct and oppositional disorders) amongst EBCI children who were lifted from poverty as a result of payments (p=0.002). Given the known trajectory of such disorders in childhood toward substance misuse, criminality and unemployment in adulthood, this is of considerable significance.\n\nincome changes alone did not account for the observed improvements in child wellbeing, rather the most significant mediating factor was an improvement in parental supervision.\n\nIn fact, in isolation, the effect of changing poverty status was nonsignificant, while the effect of increased parental supervision accounted for approximately 77% of the reduction in the number of psychiatric symptoms observed. Further analysis found that the increased parental supervision amongst those receiving payments was almost exclusively due to reduced time constraints within the family. Simply put, providing unconditional payments offered parents the opportunity to spend more quality time with their children, with significant and long-lasting benefits.\n\nSecondary analyses conducted by both studies subsequently explored which factors mediated these findings. 
Both\n\nA further study conducted in 2010, assessed the prevalence of psychiatric disorders amongst the original sample of EBCI children\n\nPanel A. Coefficients on EBCI children by wave for behavioural disorder\n\nPanel B. Coefficients on EBCI children by wave for emotional disorder\n\no2 + Estimated coefficient ow ® ° nd ow IN Time in years ° a 1 ° 1 ° a Estimated coefficient T T T T T 3 2 4 ° 1 nd Time in years\n\nNote: From “Akee, R., Copeland, W., Costello, E. J., & Simeonova, E. (2018). How does household income affect child personality traits and behaviors?. American Economic Review, 108(3), 775-827.”\n\n—_\n\nSS\n\n®\n\nMental Health Foundation Scotland\n\n——————________\n\n14.\n\nThe mental health effects of a universal basic income - Summary of study findings\n\nat 21 years of age (Costello et al. 2010). This additionally allowed for a comparison of the effects of onset and length of exposure to payments (participants aged 12, 14 and 16 at the onset of payments). In doing so it was identified that...\n\nchildren with longer dividend exposure were 22% less likely to have been arrested at ages 16-17 and 7% less likely to have dealt drugs by age 21\n\nEBCI adults who had benefited from payments as children were significantly less likely to suffer from any form of psychiatric disorder as adults than adults who had not received payments as children (30.2% vs 36.0%; p=0.001) and were less likely to suffer from substance use disorders in particular\n\n(Akee et al. 2010).\n\nFor children who were in poverty at baseline, $4,000 p.a. in extra income was associated with completing an extra year of education, and school attendance increased by four days per quarter. 
Again, secondary analysis identified that the most significant factors accounting for this change, was an improvement in parental supervision (by 3-5%) due to reduced time constraints within the family.\n\n(28.6% vs 30.6%; p=0.014), with a reduction in alcohol (20.3% vs 23.8%; p=0.006) and cannabis use or dependence (16.7% vs 19.5%; p=0.049). On closer analysis these differences were most pronounced amongst the youngest EBCI study cohort (aged 12 at onset of casino dividends) who had had the longest exposure to the intervention, among whom there was a lower prevalence of psychiatric disorders (31.4%) in comparison with EBCI adults who had been older (aged 14) when payments were initiated (41.7%, p=0.005).\n\nFinally, one further study explored whether length of exposure to UBI-like payments in childhood (six years versus two years) resulted in differences in years of education, school attendance, probability of being arrested and probability of dealing drugs. They found that...\n\nIt is unfortunate that race/ethnicity in these studies was almost entirely merged with the intervention, making it difficult to differentiate the effects of the UBI like payments from the effects of being an EBCI member. However, it would appear race alone did not solely account for the observed findings, as onset and length of exposure to UBI-like payments were also of']",Akee and colleagues (2018) identified a reduction in the prevalence of symptoms of behavioural disorders (-23% of a standard deviation) among EBCI children who had received UBI-like payments for four years. Costello and colleagues (2003) identified a 40% decrease in symptoms of behavioural disorders (such as conduct and oppositional disorders) amongst EBCI children who were lifted from poverty as a result of payments (p=0.002).,
1,How is occupational exposure to AI categorized based on skill levels?,"[' assign each occupation to one of four skill levels21 gained through education or work-related experience:\n\n21 SOC2010 volume 1: structure and descriptions of unit groups - Office for National Statistics\n\n13\n\nLevel 1 – general compulsory education.\n\n\n\nLevel 2 – general compulsory education with a longer period of work-related training or work experience.\n\nLevel 3 – post-compulsory education below degree level.\n\n\n\nLevel 4 – professional occupations normally requiring a degree or equivalent period of relevant work experience.\n\nFigure 1 shows that professional occupations (those at skill level 4) are more exposed to AI than other occupations. These include many of the top 20 occupations listed in the previous section, including management consultants and business analysts, financial managers and directors, chartered and certified accountants, and psychologists.\n\nFigure 1: Exposure to all AI by skill level of occupation\n\nA C B \n\nHow to read this chart\n\nThe boxes show the upper 25%, lower 25%, and average value for the AIOE score. • The error bars show the highest and lowest AIOE scores. • Each dot represents the AIOE score of an individual occupation. • The AIOE score is a relative measure so negative values still indicate some exposure to AI.\n\nThe professional occupations least exposed to AI (marked by ‘C’ on Figure 1) are veterinarians, medical radiographers, dental practitioners, physiotherapists, and senior police officers. Despite being less exposed to AI compared to other professional occupations, they rank among the middle for exposure to AI across all occupations. 
It may be expected that occupations such as radiographers would be more exposed, but this may be explained by the current use of technology and AI within these roles.\n\nSimilarly, those occupations requiring the lowest levels of education or relevant work experience are less exposed to AI (those at skill level 1). The exception to this is security\n\n14\n\nguards (as shown by ‘A’ in Figure 1) where potential uses of AI have been documented to be anything from monitoring live video to AI powered patrol bots.22 Whilst lower skilled occupations are generally less exposed to AI, there are still some higher skilled occupations at skill level 3 that are less exposed, such as roofers and sports players (as shown by ‘B’ in Figure 1).\n\nThese results are consistent with the findings of similar research23 which suggests that occupations requiring a lower level of education tend to be more manual and often technically difficult roles, which have already seen extensive changes due to developments in technologies, and it is unlikely to be cost effective to apply further automation.24, 25 Furthermore, more recent advancements in AI have been more applicable to software and technologies and either require skills in technical coding or use of specific software as part of the job, e.g. accountancy and finance.\n\n22 Artificial Intelligence and its applications in physical security | G4S United Kingdom 23 Eloundou et al (2023), Felten et al (2023), Brynjolfsson et al (2023) 24 How automation has affected jobs through the ages | World Economic Forum (weforum.org) 25 AI-and-work-evidence-synthesis.pdf (thebritishacademy.ac.uk)\n\n15\n\n3 Exposure to AI across industries and geography\n\n3.1 Exposure to AI across industry\n\nThe industry estimate of exposure to AI is constructed by taking a weighted average of the AI Occupational Exposure (AIOE) scores across occupations within an industry. This provides an average AIOE score for each industry, which are shown in Figure 2. 
In general, the industries more exposed to AI follow the same themes as discussed earlier in this report.\n\nThe finance & insurance sector is more exposed to AI than any other sector. This sector features a large number of finance and clerical roles which have high AIOE scores. There are five other sectors that are highly exposed to AI: information & communication; professional, scientific & technical; property; public administration & defence; and education.\n\nThe industries least exposed to AI are accommodation & food services; motor trades, agriculture, forestry, and fishing; transport & storage and construction.\n\nSome of these industries capture a range of activities. For example, the veterinary activities sub-sector has much lower exposure to AI (average AIOE of -0.06) compared to the professional, scientific & technical industry as whole (average AIOE of 0.86).\n\nFigure 2: Exposure to AI by industry\n\nFinance & insurance Information & communication Professional, scientific & technical Property Public administration & defence Education All industries Health Wholesale Arts, entertainment, recreation & other services Production Business administration & support services Retail Construction Transport & Storage (inc. postal) Agriculture, forestry & fishing Motor trades Accommodation & food services -0.8 -0.6 -0.']",Occupational exposure to AI is categorized based on skill levels. Level 1 includes occupations that require general compulsory education. Level 2 includes occupations that require general compulsory education with a longer period of work-related training or experience. Level 3 includes occupations that require post-compulsory education below degree level. Level 4 includes professional occupations that normally require a degree or equivalent period of relevant work experience.,
2,What are the mental health effects of conditional versus unconditional cash transfers?,"[' the conditionality applied to them, would contribute to reducing poverty stigma, with implied societal and population benefits.\n\n—_\n\nSS\n\n®\n\nMental Health Foundation Scotland\n\n20.\n\nThe mental health effects of a universal basic income - Conclusions\n\nConclusions T he purpose of this review was to\n\nexamine the existing literature on previous UBI pilots, in order to assess the relative influence of individual, universal and unconditional payments on mental wellbeing. In Scotland, where close to a quarter of all children and a fifth of ‘working age’ adults live in poverty, public and political interest in UBI is growing. There is therefore a need for evidence on the potential benefits and drawbacks of the policy.\n\nUnfortunately, given the limited available evidence, it is not possible to draw any conclusions on the potential benefits or pitfalls of universal cash transfers. However, several pilots did directly compare the mental health effects of conditional versus unconditional cash transfers and their findings are striking. For adults, studies consistently found that removing the conditions associated with traditional welfare benefits was associated with improved mental wellbeing among participants, suggesting this holds considerable gains for population mental health. 
From a reduction in reported feelings of stress, symptoms of psychiatric disorder and perceptions of stigma and marginalisation, to overall improvements in mental wellbeing and better cognitive functioning, studies consistently reported clear and significant improvements in mental health when the conditionality\n\nremoved or replaced with more supportive, tailored, unconditional interventions.\n\nFurthermore, while UBI opponents assert\n\nthat the conditions associated with traditional welfare benefits are necessary in order to ensure individuals are motivated to return to work, in the studies we identified the removal of conditionality had no effect on either the rate of employment amongst recipients or their motivation for and attempts to secure it. Although the current review is limited to the previous UBI pilots, these findings echo the results of other benefit sanction reviews. In 2016, a comprehensive review by the National Audit Office concluded there was limited evidence sanctions actually worked and concluded the sanctions system was in urgent need of reform (National Audit Office, 2016). More recently, in the most extensive study of welfare conditionality in the UK ever conducted, sanctions were overall found to be ineffective at getting jobless people into work (Economic and Social Research Council [ESRC], 2018). Moreover, it emerged that they do little to enhance motivation to find employment and for some individuals, increase their risk of exposure to further poverty, ill-health and even survival crime (ESRC, 2018). 
Apportioning benefit sanctions based on the failure to comply with the current conditionality attached to Universal Credit would therefore appear\n\nassociated with traditional welfare was\n\nnot only detrimental to mental wellbeing,\n\n®\n\nMental Health Foundation Scotland\n\n21.\n\nThe mental health effects of a universal basic income - Conclusions\n\nbut also potentially harmful, ineffective and unnecessary.\n\nThere was also some preliminary evidence to suggest that removing conditionality is associated with a decrease in healthcare utilisation and reduction in health-risk behaviours, including alcohol and substance misuse. However more rigorous research is needed in order to definitively establish this, as well as further studies to determine the particular tenets of UBI which contribute to these reductions. Nonetheless, given the staggering human and financial cost of hospital admissions, as well as drug and alcohol misuse in Scotland, these results are of considerable interest.\n\nFinally, in the studies of unconditional cash transfers for children, we identified significant and long-lasting benefits in their mental health, particularly when these were introduced early. The detrimental effects of growing up in poverty for children’s mental health and overall development are well known. However, findings from Western Carolina indicate that income changes in isolation did not significantly mediate improved mental health outcomes for children. Rather it was increased parental supervision and an improvement in parent-child relationships due to reduced time constraints within the family which were central. As such, Scotland’s efforts to improve children’s mental wellbeing through reducing childhood poverty will be ineffective if the wider effects of all policies on children’s wellbeing are not evaluated. 
This must therefore be carefully considered\n\nmost effective strategies for improving children’s wellbeing, with implications that providing low-income households with non-stigmatising, unconditional payments offers parents the opportunity to spend more quality time with their children, with significant and long-lasting benefits.\n\nIn summary, it was not the purpose of this review to explore the economic impact or affordability of UBI, or indeed its ability to lift people out of poverty, rather our aim was to explore the impact of UBI-like schemes on mental health and wellbeing. Although none of the identified studies evaluated the mental health effects of a UBI in its purest sense, given none of these pilots were truly universal, our findings add an important contribution to our understanding of how the current welfare system could be improved to better support everyone’s mental health. Numerous reports have highlighted the detrimental mental health impacts the UK’s existing social security system holds for claimants.']","Removing the conditions associated with traditional welfare benefits and replacing them with unconditional cash transfers is associated with improved mental wellbeing among participants. Studies consistently reported clear and significant improvements in mental health when conditionality was removed or replaced with more supportive, tailored, unconditional interventions.",
3,What were the effects of cash payments on recipients in the B-MINCOME study and the Ontario Basic Income Pilot?,"['\nOverall, at the end of the study, almost a quarter of those who didn’ t receive UBI payments (24%) were found to have Mental Health Inventory scores indicative of a mental health difficulty, compared with less than one fifth of UBI recipients (17%) (p=0.001) (Kela, 2020).\n\nthe key distinction between B-MINCOME payments and traditional welfare benefits in Spain, was that they were not granted on the basis of being legally unemployed and providing proof of actively seeking employment. Rather they were granted to low-income households irrespective of employment status and the only condition applied to one of the study arms was the requirement to engage in a community- based group.\n\nAgain...\n\nIt is worth noting that, as contrasted with UBI pilots in other High-Income Countries, Finland’s Basic Income experiment was statutory and randomised, significantly increasing confidence in their findings. In addition, all of these findings were found to remain significant after controlling for sociodemographic variables, including gender, age, education, household structure and income.\n\nsignificant improvements in wellbeing among all four groups receiving B-MINCOME payments were observed in comparison with the control group who continued to receive standard benefits.\n\nBarcelona’s B-MINCOME study compared traditional welfare benefits, with the effects of four forms of cash payments, namely; (1) cash payments received on an unlimited basis; (2) cash payments received on a limited basis (whereby they were incrementally reduced according to any household earnings in excess of the basic threshold); (3) cash payments received unconditionally; and (4) cash payments received conditionally on the basis of involvement in a social programme (Kirchner et al. 2019). 
Although B-MINCOME had a complex study design,\n\nGeneral satisfaction with life, assessed according to a 10-point Likert scale, increased by 27% amongst B-MINCOME groups as a whole, and at the end of the study the probability of participants reporting a ‘high level of life satisfaction’ (rating their wellbeing as ≥ 7/10) was 11% higher amongst those receiving any form of B-MINCOME payment compared with those in the control group (Kirchner et al. 2019). Self-reported experiences of mental illness were also found to be significantly lower (9.6%), although it is not clear from the study reporting exactly how ‘experiences of mental illness’ were enquired about. In comparison with the\n\nwith various different treatment groups,\n\ncontrol group, the biggest differences\n\n®\n\nMental Health Foundation Scotland\n\n10.\n\nThe mental health effects of a universal basic income - Summary of study findings\n\nin these outcomes were found amongst those receiving payments in an unlimited fashion and amongst those receiving payments based on their involvement in a social participation programme. Interviews with participants also found that a renewed sense of hope for the future was ubiquitous among recipients (The Young Foundation, 2019; 2020).\n\nThe Netherlands Social Assistance Experiments sought to explore whether changes to social security payments could improve employment rates. As such, a sample of participants who were already in receipt of welfare benefits had their payments changed in one of three ways, with a control group who remained on welfare benefits as usual. In one group all conditionality surrounding social security was removed, in a second group increased help and guidance in finding employment was provided and in a third group recipients were allowed to earn extra money before their social security payments were withdrawn (Verlaat et al. 2020). 
A survey of all groups conducted at the end of the pilot identified positive treatment effects in terms of subjective wellbeing for all three interventions compared to the control group, although these did not reach statistical significance. However, a statistically significant treatment effect on participant’s self- efficacy (defined as a combined measure of self-confidence and perceived ability to find work) was found among both those who had had the conditionality\n\nsurrounding their benefits removed\n\nhad received additional help in finding employment (0.573, p=0.055) (Verlaat et al. 2020). In addition, in interviews, participants in both of these groups consistently reported increased wellbeing and a reduction in stress and anxiety.\n\nThe Ontario Basic Income Pilot (OBIP) was first launched in 2018, with the aim of providing a fixed income for three years to residents of three communities (Hamilton, Thunder Bay and Lindsay). Potential participants were those aged 18 to 64 who were living on low or no income, and data was collected from 4000 individuals receiving payments and 2000 control participants (Mendelson, 2019). Although, all of the tenets of UBI were therefore clearly not met, the pilot did provide a guaranteed, unconditional income equivalent to 75% of the Low-Income Measure (Mendelson, 2019). Two cross-sectional surveys of OBIP recipients subsequently explored recipients’ experiences of these payments in comparison with the traditional']","The effects of cash payments on recipients in the B-MINCOME study included significant improvements in wellbeing, increased general satisfaction with life, lower self-reported experiences of mental illness, and a renewed sense of hope for the future. The effects of cash payments on recipients in the Ontario Basic Income Pilot included providing a fixed income for three years to residents of three communities.",
4,Why is coordinated action necessary in addressing frontier AI risks?,"[' results of which will be showcased on Day 1 of the Summit.\n\nHere, our team will present 10-minute demonstrations, focused on 4 key areas of risk:\n\ne misuse\n\nmisuse societal harm\n\ne\n\nhttps://www.gov.uk/government/publications/frontier-ai-taskforce-second-progress-report/frontier-ai-taskforce-second-progress-report\n\n7/9\n\n30/10/2023, 10:11\n\nFrontier AI Taskforce: second progress report - GOV.UK\n\ne\n\ne\n\nloss of human control unpredictable progress\n\nWe believe these demonstrations will be the most compelling and nuanced presentations of frontier AI risks done by any government to date. Our hope is that these demonstrations will raise awareness of frontier AI risk and the need for coordinated action before new - more capable - systems are developed and deployed.\n\nTo the emerging network\n\nAI is a general purpose and dual-use technology. We need a clear-eyed commitment to empirically understanding and mitigating the risks of AI so we can enjoy the beneﬁts. In 1955 Von Neumann, pioneer of nuclear weapons and computing wrote a crisp and prescient essay “Can we survive technology?” in which he considered the implications of accelerating technological progress. His ﬁnal conclusion rings true today:\n\nAny attempt to ﬁnd automatically safe channels for the present explosive variety of progress must lead to frustration. The only safety possible is relative, and it lies in an intelligent exercise of day-to-day judgement”\n\nWe must do our best and contribute to that empirical foundation for day-to-day judgement.\n\nOne thing that struck me on a recent trip to the Bay Area to meet AI companies and researchers was the number of people who, unbeknown to us, have been rooting for the Taskforce behind the scenes. There have been AI researchers telling their network to join and ﬁghting for us in private. 
There have been people spending their goodwill with company leaders to promote the importance of our work. To the emerging international network of people who take this technology seriously - we are very grateful for your support, and can’t wait to tell you more about what we’ve been working towards in just a few days time at the AI Safety Summit.\n\nA home for AI Safety research\n\nLast week the Prime Minister announced (https://www.gov.uk/government/speeches/prime-ministers-speech-on-ai-26-october- 2023) that he is putting the UK’s work on AI safety on a longer term basis by creating an AI Safety Institute in which our work will continue. The AI Safety Institute is the ﬁrst state-backed organisation focused on frontier AI safety for\n\nhttps://www.gov.uk/government/publications/frontier-ai-taskforce-second-progress-report/frontier-ai-taskforce-second-progress-report\n\n8/9\n\n30/10/2023, 10:11\n\nFrontier AI Taskforce: second progress report - GOV.UK\n\nthe public interest. Its mission is to minimise surprise to the UK and humanity from rapid and unexpected advances in AI, and will work towards this by developing the sociotechnical infrastructure needed to understand the risks of advanced AI and support its governance. AI has the power to revolutionise industries, enhance our lives and address complex global challenges, but we must also confront the global risks. The future of AI is safe AI.\n\nBack to top\n\nOGL\n\nAll content is available under the Open Government Licence v3.0, except where otherwise stated\n\n© Crown copyright\n\nhttps://www.gov.uk/government/publications/frontier-ai-taskforce-second-progress-report/frontier-ai-taskforce-second-progress-report\n\n9/9']","Coordinated action is necessary in addressing frontier AI risks because it is important to raise awareness of these risks and mitigate them before new and more capable AI systems are developed and deployed. 
The demonstrations presented by the team will showcase the compelling and nuanced presentations of frontier AI risks done by any government to date, with the hope of increasing awareness and promoting coordinated action.",


#### Convert dataframe into a DeepEval compatible JSON

In [None]:
# Convert the DataFrame to a JSON object
ragas_synthetic_data_json = testset_df_renamed.to_json(orient='records')

data = json.loads(ragas_synthetic_data_json)

# Convert the Python object back to a JSON string, with indentation for prettifying
pretty_json = json.dumps(data, indent=4)

# Define the path to the output file
output_file_path = f'./data/synthetic_data/ragas_synthetic_data_v{version}.json'


# Save the prettified JSON string to a file
# (write the string directly; passing it to json.dump would encode it a second time)
with open(output_file_path, 'w') as f:
    f.write(pretty_json)

In [None]:
data = json.loads(ragas_synthetic_data_json)

# Convert the Python object back to a JSON string, with indentation for prettifying
pretty_json = json.dumps(data, indent=4)

print(pretty_json)

[Back to top](#title)

-----------------------

## Troubleshooting <a class="anchor" id="troubleshooting"></a>

#### Langchain DirectoryLoader Error

If you run into a poppler path error even though poppler is installed and can be accessed from your virtual environment (check by running `pdfinfo -v`), close the notebook and restart the Jupyter server from a terminal where the path is correctly set (for example, by running `code notebooks/evaluation/evaluation_dataset_generation.ipynb`).

#### RAGAS synthetically generated evaluation data

We have found that some rows of evaluation data synthetically generated with the RAGAS framework contain NaN values and/or values that are not of type str. These rows fail Pydantic validation, which results in an error for DeepEval metrics.

To avoid this, ensure you convert the RAGAS synthetically generated evaluation data to type str and remove rows of data containing NaN.
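
A minimal sketch of that clean-up with pandas is given here. The function name `clean_for_deepeval` and the choice of `required` columns are hypothetical. The order of the two steps matters: `astype(str)` turns a missing value into the literal string `"nan"`, so rows with NaN need to be dropped before the cast.

```python
import pandas as pd


def clean_for_deepeval(df, required=("input", "expected_output")):
    """Drop rows with missing required fields, then cast every column to str.

    `required` is an assumption about which columns must be populated.
    """
    # Drop first: after astype(str) a NaN would become the string "nan"
    # and would no longer be detected by dropna
    cleaned = df.dropna(subset=list(required))
    # Cast all remaining values to str and renumber the rows from 0
    return cleaned.astype(str).reset_index(drop=True)
```

Applied to the renamed dataframe, this would be called as `clean_for_deepeval(testset_df_renamed)` before the `actual_output` column is filled in or the CSV is written.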

#### DeepEval framework

At the moment, I can only load from CSV into DeepEval test cases, so there may be something wrong with the JSON created above. #TODO: Debug
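
One thing worth checking while debugging is whether the top level of the saved file parses to a list of records or to a single string. The latter happens when an already-serialised JSON string is passed through `json.dump` or `json.dumps` a second time. The sketch below shows the difference in memory; the `records` sample and the helper name `top_level_type` are hypothetical.

```python
import json

# Hypothetical sample in the same shape as the renamed dataframe records
records = [{"input": "What is X?", "expected_output": "X is ...", "actual_output": ""}]


def top_level_type(json_text):
    """Return the Python type name that the JSON text parses to."""
    return type(json.loads(json_text)).__name__


# Serialised once: parses back to a list of dicts
single_encoded = json.dumps(records, indent=4)

# Serialised twice: the outer layer is just one long JSON string
double_encoded = json.dumps(single_encoded)

print(top_level_type(single_encoded))
print(top_level_type(double_encoded))
```

If the file on disk reports `str` rather than `list`, a loader that expects a list of test-case objects would be unable to read it, which may or may not be the cause of the problem described above.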

[Back to top](#title)

-------