In [6]:
# Step 1: Import necessary libraries

from datasets import load_dataset
import pandas as pd

# Dataset Exploration: Instructions and Outputs

This notebook explores the `instruction` and `output` columns of the dataset loaded from `selfrag/selfrag_train_data`. The goal is to understand the structure and content of these two fields, with the Pandas display options adjusted so that long cell contents are shown in full rather than truncated.


In [7]:
# Import necessary libraries
import pandas as pd
from datasets import load_dataset

# Adjust Pandas display options
pd.set_option('display.max_colwidth', None)  # Ensure full visibility of content within columns
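`pd.set_option` changes the display width for the rest of the session. Where that is too broad, a scoped alternative is `pd.option_context`, which restores the previous setting when the block exits. The sketch below is illustrative only; `demo`, `full_view`, and `short_view` are hypothetical names, and the 200-character string is a stand-in for a long instruction.

```python
import pandas as pd

# Stand-in frame with one long text cell (not taken from the dataset)
demo = pd.DataFrame({"instruction": ["x" * 200]})

# Inside the context manager the column width limit is lifted
with pd.option_context("display.max_colwidth", None):
    full_view = repr(demo)

# With a 50-character limit the same cell is cut short with an ellipsis
with pd.option_context("display.max_colwidth", 50):
    short_view = repr(demo)

print(len(full_view) > len(short_view))
```

The session-wide `set_option` call used in this notebook is simpler when every cell should show untruncated text.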


In [8]:
# Load the dataset
dataset = load_dataset("selfrag/selfrag_train_data")

# Access the 'train' split (or adjust as necessary)
data = dataset['train']

# Convert to Pandas DataFrame for focused exploration
df = data.to_pandas()  # Dataset.to_pandas() is the conversion method provided by the datasets library


In [9]:
# Select only the 'instruction' and 'output' columns for analysis
focused_df = df[['instruction', 'output']].copy()  # .copy() keeps later edits from touching df

# Display the first few entries to verify selection
display(focused_df.head())


Unnamed: 0,instruction,output
0,"In this task, you are given a context paragraph of the tweet and question. Your task is to generate right answer of given question based on given context tweet paragraph.\n\nExample input: Context: Our prayers are with the students, educators & families at Independence High School & all the first responders on the scene. #PatriotPride— Doug Ducey (@dougducey) February 12, 2016 Question: at which school were first responders on the scene for?\nExample output: independence high school\nExample explanation: From the context tweet, we can see that independence high school is the right answer.\nQ: Context: BREAKING: ATF sending additional agents from Portland along with K9 team to #UCCShooting tragedy.— ATF HQ (@ATFHQ) October 1, 2015 Question: where are the agents from?\nA:","[Retrieval]<paragraph>2017 Portland train attack\nin economics and worked for the Cadmus Group consulting firm. A passenger who helped him reported his last words: ""Tell everyone on this train I love them."" Micah David-Cole Fletcher, 21, survived and was treated at Legacy Emanuel Medical Center for serious but not life-threatening wounds, specifically a knife injury to his throat. The attacker had narrowly missed Fletcher's jugular vein. Fletcher was released from the hospital in time to attend Christian's arraignment on May 30, 2017. Fletcher, a student at Portland State University, won a poetry contest in 2013 while he was a junior in high school with a</paragraph>[Irrelevant]portland[Utility:5]"
1,"Question: Write a text based on ""rangers show some iron to tie the series""\nText: along with being talented , self-assured and highly paid , these experienced rangers are also considerate .\n\nQuestion: Write a text based on ""union wo n't dismantle blockage of gm canada headquarters""\nText: canadian auto workers officials friday refused to end a blockade of general motors canada headquarters despite an offer to potentially bring new car production to a complex where a truck plant is slated for closure .\n\nQuestion: Write a text based on ""six azerbaijan opposition parliamentary candidates declare hunger strike""\nText:",[No Retrieval]six opposition candidates in the upcoming parliamentary elections in azerbaijan declared a hunger strike friday to pressure the government into ensuring fair polls .[Utility:5]
2,"You will be given a sentence. Check whether the sentence is grammatically correct and is meaningful. If the sentence is grammatically correct, then answer with '1', otherwise answer with '0'.\n\nQ: You are the only person that I can rely on.\n\nA: 0\n****\nQ: Fruit at once hit the roof from the tree.\n\nA: 1\n****\nQ: Students who fail the final exam or who do not do the reading will be executed.\n\nA:",[No Retrieval]1\n****\n[Utility:5]
3,"Q:Is there a negative or positive tone to this product review? === Title: At the bottom of the River, a review by Dylan Review: At the Bottom of the river isn't, in my perspective, a very good book. I gave it one star. I gave it one star because there isn't really a plot, main character (at least with a name), and it is boring. The book is sort of written in a mix between the first person and third person perspectives. The book is hard to understand, especially with 2 two page long sentences! The book kind of seems like it is someone thinking, with no real reason. It skips from one place to another in one chapter. It is hard to get into the story, and since there is no plot, there is little suspense. All in all, it isn't a very good book, and I don't recommend it. Answer:\nA:",[No Retrieval]Negative[Utility:5]
4,"Question: Fertility Clinics Vary on Embryo Disposal Techniques\n\nBy DAVID B. CARUSO PHILADELPHIA (AP) -- The nation's fertility clinics vary widely when it comes to how they perform one of the most delicate aspects of their jobs: disposing of unused frozen human embryos that were created to help infertile women become pregnant...\n\nWhich topic is this article about?\nOPTIONS:\n- World\n- Sports\n- Business\n- Science/Tech\nAnswer: Science/Tech\n\nQuestion: Canada #39;s Inflation Rate Unexpectedly Picks Up to 2.4 (Update1)\n\nCanada #39;s inflation rate unexpectedly accelerated in November to a 2.4 percent rate, reflecting a jump in prices for new cars, gasoline and houses from a year earlier.\n\nWhich topic is this article about?\nOPTIONS:\n- World\n- Sports\n- Business\n- Science/Tech\nAnswer: Business\n\nQuestion: Paris bomb at Indonesia Embassy wounds 10\n\nPARIS A bomb rocked a quiet, elegant neighborhood of apartment buildings in the west of the city, slightly wounding at least 10 people, including 4 workers at the Indonesian Embassy, where a rigged package exploded below its crimson and white national flag \n\nWhich topic is this article about?\nOPTIONS:\n- World\n- Sports\n- Business\n- Science/Tech\nAnswer: World\n\nQuestion: UN Urged to Ignore Bush Plea for Human Cloning Ban\n\n LONDON (Reuters) - Britain's national academy of science urged the United Nations on Monday to ignore a call by President Bush to ban all forms of human cloning.\n\nWhich topic is this article about?\nOPTIONS:\n- World\n- Sports\n- Business\n- Science/Tech",[No Retrieval]Answer: World[Utility:5]


## Dataset Structure Exploration

Let's take a closer look at the structure of the focused dataset: the data type of the `instruction` and `output` columns, the number of non-null entries in each, and any immediate patterns or inconsistencies.


In [10]:
# Print the structure of the focused dataset
print("Focused Dataset Structure:")
focused_df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()


Focused Dataset Structure:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145619 entries, 0 to 145618
Data columns (total 2 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   instruction  145619 non-null  object
 1   output       145619 non-null  object
dtypes: object(2)
memory usage: 2.2+ MB
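`info()` confirms that neither column has nulls, but it does not reveal blank strings or how long the texts are. A minimal sketch of such a check follows, run on a small stand-in frame so that it is self-contained. `text_summary` and `sample` are hypothetical names, and the whitespace-only row is invented purely to exercise the blank check; it is not a row from the dataset.

```python
import pandas as pd

# Stand-in for focused_df: two rows copied from the notebook's samples,
# plus one invented whitespace-only instruction to exercise the check
sample = pd.DataFrame({
    "instruction": [
        "who sings i could fall in love with you",
        "   ",
        "a voting district made up of a few hundred voters is called a",
    ],
    "output": [
        "[No Retrieval]A[Utility:5]",
        "[No Retrieval]Negative[Utility:5]",
        "[No Retrieval]electoral district[Utility:5]",
    ],
})

def text_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count blank strings and summarize character lengths per text column."""
    rows = {}
    for col in df.columns:
        lengths = df[col].str.len()
        rows[col] = {
            "blank": int((df[col].str.strip() == "").sum()),
            "min_chars": int(lengths.min()),
            "median_chars": float(lengths.median()),
            "max_chars": int(lengths.max()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

summary = text_summary(sample)
print(summary)
```

Calling `text_summary(focused_df)` would apply the same check to the full data, assuming both columns hold only strings.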


## Detailed Exploration of Instructions and Outputs

In this section, we look more closely at the content of the `instruction` and `output` columns. The goal is to identify characteristics, patterns, or inconsistencies that could inform further analysis or preprocessing steps.


In [11]:
# Display a random sample to get a broader sense of the variability and structure of the content.
# value_counts() suits categorical columns; free text like this is better served by sampling
# or by custom checks tailored to the analysis.
display(focused_df.sample(5))


Unnamed: 0,instruction,output
80742,Rewrite the following sentence without changing the meaning: His car was wrecked by a careless driver.,[No Retrieval]A careless driver wrecked his car.[Utility:5]
35097,Suggest a replacement word for the italicized that better suits the context.\n\nHe silently indulged himself with a hamburger.,[No Retrieval]He silently *enjoyed* himself with a hamburger.[Utility:5]
109098,"Given four answer candidates, A, B, C and D, choose the best answer choice.## Input:\n\nCutting down a tree\nA: ceases its ability to grow\nB: will cause it to grown 10x bigger\nC: will decrease the likelihood of deforestation\nD: will cause the tree to flourish",[No Retrieval]A[Utility:5]
119827,who sings i could fall in love with you,"[Retrieval]<paragraph>I Could Fall in Love\nI Could Fall in Love ""I Could Fall in Love"" is a song recorded by American Tejano singer Selena for her fifth studio album, ""Dreaming of You"" (1995), released posthumously by EMI Latin on 26 June 1995. ""I Could Fall in Love"" and ""Tú Sólo Tú"" were the album's lead promotional recordings, showcasing her musical transition from Spanish- to English-language songs. The lyrics explore feelings of heartbreak and despair and express the singer's fear of rejection by a man she finds herself falling in love with. Composed by Keith Thomas, ""I Could Fall in Love"" is a pop ballad with</paragraph>[Relevant]Selena[Fully supported][Utility:5]"
118176,a voting district made up of a few hundred voters is called a,[No Retrieval]electoral district[Utility:5]
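
The sampled outputs interleave bracketed markers such as `[Retrieval]`, `[No Retrieval]`, `[Relevant]`, `[Fully supported]`, and `[Utility:5]` with the answer text. A minimal sketch for tallying them is given below. It assumes every marker has the form `[Words]` or `[Words:digits]`, which is inferred only from the rows displayed in this notebook and not from a dataset specification. `TOKEN_PATTERN`, `extract_tokens`, and `sample` are illustrative names, and the retrieved passage in the third row is shortened.

```python
import re
from collections import Counter

import pandas as pd

# Assumed marker shape: [Words] or [Words:digits], e.g. [No Retrieval], [Utility:5]
TOKEN_PATTERN = re.compile(r"\[[A-Za-z ]+(?::\d+)?\]")

def extract_tokens(output: str) -> list:
    """Return the bracketed markers found in one output string, in order."""
    return TOKEN_PATTERN.findall(output)

# Stand-in rows based on the samples above (passage text shortened)
sample = pd.DataFrame({"output": [
    "[No Retrieval]A careless driver wrecked his car.[Utility:5]",
    "[No Retrieval]A[Utility:5]",
    "[Retrieval]<paragraph>I Could Fall in Love</paragraph>"
    "[Relevant]Selena[Fully supported][Utility:5]",
]})

# Tally every marker across all rows
token_counts = Counter(
    token for text in sample["output"] for token in extract_tokens(text)
)
print(token_counts.most_common())
```

On the full data the same function could be applied with `focused_df['output'].map(extract_tokens)`. Bracketed text that happens to appear inside a retrieved passage would also match the pattern, so the resulting counts should be read as approximate.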
