# Task
Load the 'test' and 'train' splits from the 'strategyqa_dataset', 'maqa_datasets', and 'gsm8k_datasets' into pandas DataFrames.

## Explore Dataset Directories

### Subtask:
Examine the file structure and types within the 'strategyqa_dataset', 'maqa_datasets', and 'gsm8k_datasets' directories to determine the best method for loading data into pandas DataFrames, specifically looking for 'test' and 'train' splits. This involves listing files and their extensions.
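
As a portable alternative to shell listings, the file and extension survey described above could be sketched in plain Python. The helper names `summarize_extensions` and `find_split_files` are hypothetical, and matching split names by file-name prefix is an assumption made for illustration, not something the datasets guarantee.

```python
from collections import Counter
from pathlib import Path


def summarize_extensions(root):
    """Count files under `root` (recursively) by lowercase file extension."""
    root = Path(root)
    if not root.is_dir():
        # A missing directory is reported as empty instead of raising.
        return Counter()
    return Counter(
        p.suffix.lower() or "<none>" for p in root.rglob("*") if p.is_file()
    )


def find_split_files(root, splits=("train", "test", "dev")):
    """Map each split keyword to the file names whose stem starts with it.

    Prefix matching is a heuristic (an assumption of this sketch).
    """
    root = Path(root)
    found = {split: [] for split in splits}
    if not root.is_dir():
        return found
    for p in sorted(root.rglob("*")):
        if p.is_file():
            for split in splits:
                if p.stem.lower().startswith(split):
                    found[split].append(p.name)
    return found
```

Calling `summarize_extensions(some_dataset_dir)` returns a `Counter` keyed by extension, and `find_split_files(some_dataset_dir)` returns a dictionary of candidate split files; a directory whose files follow another naming scheme would simply yield empty lists.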


**Reasoning**:
To begin exploring the dataset directories, I will first run `ls -R` on `/content/` and on each expected dataset path to confirm whether the requested directories exist there and how they are structured.



In [None]:
import os

# List the contents of the /content directory to see available files and directories
print("Contents of /content/:")
!ls -R /content/

# Try to list the contents of the specified dataset directories directly.
# If they don't exist, the command will fail, and we'll address that in the next step.

# 1. List the contents of 'strategyqa_dataset' directory
print("\nContents of strategyqa_dataset:")
!ls -R /content/strategyqa_dataset

# 2. List the contents of 'maqa_datasets' directory
print("\nContents of maqa_datasets:")
!ls -R /content/maqa_datasets

# 3. List the contents of 'gsm8k_datasets' directory
print("\nContents of gsm8k_datasets:")
!ls -R /content/gsm8k_datasets

Contents of /content/:
/content/:
drive  sample_data

/content/drive:
MyDrive

/content/drive/MyDrive:
'aasta 1'
 andmetehnika
'andmete visualiseerimine'
'atm andmeteaduse meetodid'
'Colab Notebooks'
 DataEngineeringProject.gdoc
 digihumanitaaria
'eesti keel pythonis'
'koodihalduse alused'
'Machine Learning Homework 05.gdoc'
 masinope
'notebook converter'
'Slideless presentation.gdoc'
 transformers
'Transformers course homework.gdoc'
'Transformers HW 03.gdoc'
'Transformers Project.gdoc'
'Üksuse UT ML project proposals 2025 koopia.gsheet'

'/content/drive/MyDrive/aasta 1':

/content/drive/MyDrive/andmetehnika:
EntityDiagram.drawio

'/content/drive/MyDrive/andmete visualiseerimine':
kodutöö2  praks10

'/content/drive/MyDrive/andmete visualiseerimine/kodutöö2':
births_year.RData  Kodutoo2.Rmd  linnad.RData  ylevaatused.RData

'/content/drive/MyDrive/andmete visualiseerimine/praks10':
Blah.pdf  images  PF16_2024.RData  Praktikum10-2025S.Rmd  test.RData

'/content/drive/MyDrive/andmete

**Reasoning**:
The previous attempt to list the dataset directories failed because they were not located directly under `/content/`. Based on the `ls -R /content/` output, the datasets are located within `/content/drive/MyDrive/transformers/projekt/tf_project/hendrik_trying_out_things/`. I will now list the contents of these directories using their correct paths to identify the 'test' and 'train' splits and their file extensions.



In [None]:
import os

# Base path for the datasets
base_path = "/content/drive/MyDrive/transformers/projekt/tf_project/hendrik_trying_out_things"

# 1. List the contents of 'strategyqa_dataset' directory
print("\nContents of strategyqa_dataset:")
!ls -R {base_path}/strategyqa_dataset

# 2. List the contents of 'maqa_datasets' directory
print("\nContents of maqa_datasets:")
!ls -R {base_path}/maqa_datasets

# 3. List the contents of 'gsm8k_datasets' directory
print("\nContents of gsm8k_datasets:")
!ls -R {base_path}/gsm8k_datasets


Contents of strategyqa_dataset:
/content/drive/MyDrive/transformers/projekt/tf_project/hendrik_trying_out_things/strategyqa_dataset:
bart_decomp_dev_predictions.jsonl
dev.json
queries_cache.json
strategyqa_train_paragraphs.json
train.json
transformer_qa_ORA-P_dev_no_placeholders.json
transformer_qa_ORA-P_train_no_placeholders.json

Contents of maqa_datasets:
/content/drive/MyDrive/transformers/projekt/tf_project/hendrik_trying_out_things/maqa_datasets:
 MAQA_commonsense_reasoning.json
 MAQA_mathmatical_reasoning.json
 MAQA_world_knowledge_HLS.json
 MAQA_world_knowledge_nq.json
'single_commonsens_reasoning(StrategyQA).json'
'single_mathematical_reasoning(gsm8k).json'
'single_world_knowledge(NQ).json'

Contents of gsm8k_datasets:
/content/drive/MyDrive/transformers/projekt/tf_project/hendrik_trying_out_things/gsm8k_datasets:
'test-00000-of-00001 (1).parquet'  'train-00000-of-00001 (1).parquet'
 test-00000-of-00001.parquet	    train-00000-of-00001.parquet


### Summary of Dataset Splits and File Types:

**1. `strategyqa_dataset`:**
*   **Test Split:** `dev.json` (JSON format; the directory has no `test.json`, so the dev split is treated as the test split)
*   **Train Split:** `train.json` (JSON format)

**2. `maqa_datasets`:**
*   This directory contains seven JSON files, none of which is explicitly named 'test' or 'train'. The data appears to be organized by reasoning type: four `MAQA_*.json` files (`MAQA_commonsense_reasoning.json`, `MAQA_mathmatical_reasoning.json`, `MAQA_world_knowledge_HLS.json`, `MAQA_world_knowledge_nq.json`) and three `single_*.json` files (`single_commonsens_reasoning(StrategyQA).json`, `single_mathematical_reasoning(gsm8k).json`, `single_world_knowledge(NQ).json`). Whether 'test' and 'train' splits exist inside these files, combined or under different names, can only be determined by inspecting their contents.

**3. `gsm8k_datasets`:**
*   **Test Split:** `test-00000-of-00001.parquet` (Parquet format)
*   **Train Split:** `train-00000-of-00001.parquet` (Parquet format)
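
One way to carry out the further inspection suggested for `maqa_datasets` is to look at each file's top-level JSON structure before handing it to pandas. The following is a minimal sketch; `describe_json` is a hypothetical helper, and it assumes each file is small enough to load fully into memory.

```python
import json
from pathlib import Path


def describe_json(path):
    """Return a small summary of a JSON file's top-level structure."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)

    summary = {"file": Path(path).name, "type": type(data).__name__}
    if isinstance(data, dict):
        # Top-level keys such as "train"/"test" would indicate embedded splits.
        summary["keys"] = list(data.keys())[:10]
    elif isinstance(data, list):
        summary["length"] = len(data)
        if data and isinstance(data[0], dict):
            # Field names of the first record hint at the eventual columns.
            summary["record_keys"] = list(data[0].keys())
    return summary
```

If a summary reports a top-level dictionary with split-like keys, the splits are embedded in the file; a flat list of records suggests the file is a single unsplit collection.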

## Load Data into DataFrames

### Subtask:
Load the 'test' and 'train' splits from 'strategyqa_dataset' and 'gsm8k_datasets', and all JSON files from 'maqa_datasets' into pandas DataFrames.


**Reasoning**:
The subtask requires loading specific 'test' and 'train' splits from `strategyqa_dataset` and `gsm8k_datasets` and all JSON files from `maqa_datasets` into pandas DataFrames. This involves importing pandas, setting the base path, reading JSON and Parquet files, and iterating through a directory for multiple JSON files.



In [None]:
import pandas as pd
import os

# Define the base path for the datasets
base_path = '/content/drive/MyDrive/transformers/projekt/tf_project/hendrik_trying_out_things'

# StrategyQA: dev.json serves as the test split
strategyqa_dev_df = pd.read_json(os.path.join(base_path, 'strategyqa_dataset', 'dev.json'))
strategyqa_train_df = pd.read_json(os.path.join(base_path, 'strategyqa_dataset', 'train.json'))

# GSM8K: load the test and train Parquet files
gsm8k_test_df = pd.read_parquet(os.path.join(base_path, 'gsm8k_datasets', 'test-00000-of-00001.parquet'))
gsm8k_train_df = pd.read_parquet(os.path.join(base_path, 'gsm8k_datasets', 'train-00000-of-00001.parquet'))

# MAQA: no explicit splits, so load every JSON file, keyed by file name
maqa_dfs = {}
maqa_dataset_path = os.path.join(base_path, 'maqa_datasets')
for filename in os.listdir(maqa_dataset_path):
    if filename.endswith('.json'):
        file_path = os.path.join(maqa_dataset_path, filename)
        df_name = os.path.splitext(filename)[0]  # File name without extension
        maqa_dfs[df_name] = pd.read_json(file_path)

# Print the first few rows and shape of each loaded DataFrame
print("\n--- strategyqa_dev_df ---")
print(strategyqa_dev_df.head())
print(f"Shape: {strategyqa_dev_df.shape}")

print("\n--- strategyqa_train_df ---")
print(strategyqa_train_df.head())
print(f"Shape: {strategyqa_train_df.shape}")

print("\n--- gsm8k_test_df ---")
print(gsm8k_test_df.head())
print(f"Shape: {gsm8k_test_df.shape}")

print("\n--- gsm8k_train_df ---")
print(gsm8k_train_df.head())
print(f"Shape: {gsm8k_train_df.shape}")

print("\n--- maqa_datasets DataFrames ---")
for name, df in maqa_dfs.items():
    print(f"\nDataFrame: {name}")
    print(df.head())
    print(f"Shape: {df.shape}")



--- strategyqa_dev_df ---
                    qid                              term  \
0  e0044a7b4d146d611e73                   Albany, Georgia   
1  c69397b4341b65ed080f  Saint Vincent and the Grenadines   
2  be5c9933987f046b476e                             Greed   
3  1932e05f10680ece229f                      Sea of Japan   
4  fb8b656051c742f5bd27                           Lil Jon   

                                         description  \
0                     City in Georgia, United States   
1                           Country in the Caribbean   
2  an inordinate or insatiable longing, especiall...   
3       Marginal sea between Japan, Russia and Korea   
4  American rapper, record producer and DJ from G...   

                                            question  answer  \
0  Will the Albany in Georgia reach a hundred tho...   False   
1  Is the language used in Saint Vincent and the ...    True   
2  Is greed the most prevalent of the Seven Deadl...   False   
3  Would the 

## Store and Access DataFrames

### Subtask:
Organize the loaded DataFrames into a single dictionary for easy access and further processing.


## Summary:

### Data Analysis Key Findings

*   The initial attempt to locate datasets under `/content/` failed, but the correct base path was identified as `/content/drive/MyDrive/transformers/projekt/tf_project/hendrik_trying_out_things/`.
*   **`strategyqa_dataset`**:
    *   `dev.json` (acting as test split) was loaded into `strategyqa_dev_df` with 229 rows and 8 columns (`qid`, `term`, `description`, `question`, `answer`, `facts`, `decomposition`, `evidence`).
    *   `train.json` was loaded into `strategyqa_train_df` with 2061 rows and 8 columns.
*   **`gsm8k_datasets`**:
    *   `test-00000-of-00001.parquet` was loaded into `gsm8k_test_df` with 1319 rows and 2 columns (`question`, `answer`).
    *   `train-00000-of-00001.parquet` was loaded into `gsm8k_train_df` with 7473 rows and 2 columns.
*   **`maqa_datasets`**: This directory did not contain explicit 'test' or 'train' splits. Instead, seven individual JSON files were loaded, categorized by reasoning type, each with 2 columns:
    *   `single_mathematical_reasoning(gsm8k)`: 1319 rows.
    *   `MAQA_commonsense_reasoning`: 1000 rows.
    *   `MAQA_mathmatical_reasoning`: 400 rows.
    *   `MAQA_world_knowledge_nq`: 592 rows.
    *   `single_commonsens_reasoning(StrategyQA)`: 2290 rows.
    *   `MAQA_world_knowledge_HLS`: 50 rows.
    *   `single_world_knowledge(NQ)`: 1288 rows.

### Insights or Next Steps

*   Organize all loaded DataFrames into a single dictionary so that subsequent analysis steps can access them from one place.
*   Given the varying column structures (e.g., `strategyqa` having 8 columns, `gsm8k` and `maqa` datasets having 2 columns), consider a standardization process or specific handling for each dataset type during combined processing.
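
The standardization idea in the last point could look roughly like the sketch below, which keeps only the shared `question` and `answer` columns and tags each row with its origin. The function name `to_common_schema` and the added `source` column are assumptions introduced for illustration; the sketch also leaves answer values untouched, so mixed types (for example the boolean StrategyQA answers next to text answers) would still need dataset-specific handling.

```python
import pandas as pd


def to_common_schema(datasets, columns=("question", "answer")):
    """Stack DataFrames on their shared columns, adding a `source` column.

    `datasets` maps a name to a DataFrame. Extra columns are dropped.
    """
    frames = []
    for name, df in datasets.items():
        missing = [c for c in columns if c not in df.columns]
        if missing:
            raise KeyError(f"{name} is missing columns: {missing}")
        part = df.loc[:, list(columns)].copy()
        part.insert(0, "source", name)  # Record which dataset each row came from
        frames.append(part)

    if not frames:
        # pd.concat rejects an empty list, so return an empty frame instead.
        return pd.DataFrame(columns=["source", *columns])
    return pd.concat(frames, ignore_index=True)
```

Dropping the extra StrategyQA columns is a deliberate simplification here; if fields such as `facts` or `decomposition` matter downstream, per-dataset handling may be the better choice.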


In [None]:
print("\n--- Columns for strategyqa_dev_df ---")
print(strategyqa_dev_df.columns)

print("\n--- Columns for strategyqa_train_df ---")
print(strategyqa_train_df.columns)

print("\n--- Columns for gsm8k_test_df ---")
print(gsm8k_test_df.columns)

print("\n--- Columns for gsm8k_train_df ---")
print(gsm8k_train_df.columns)

print("\n--- Columns for maqa_datasets DataFrames ---")
for name, df in maqa_dfs.items():
    print(f"\nDataFrame: {name}")
    print(df.columns)


--- Columns for strategyqa_dev_df ---
Index(['qid', 'term', 'description', 'question', 'answer', 'facts',
       'decomposition', 'evidence'],
      dtype='object')

--- Columns for strategyqa_train_df ---
Index(['qid', 'term', 'description', 'question', 'answer', 'facts',
       'decomposition', 'evidence'],
      dtype='object')

--- Columns for gsm8k_test_df ---
Index(['question', 'answer'], dtype='object')

--- Columns for gsm8k_train_df ---
Index(['question', 'answer'], dtype='object')

--- Columns for maqa_datasets DataFrames ---

DataFrame: single_mathematical_reasoning(gsm8k)
Index(['question', 'answer'], dtype='object')

DataFrame: MAQA_commonsense_reasoning
Index(['question', 'answer'], dtype='object')

DataFrame: MAQA_mathmatical_reasoning
Index(['question', 'answer'], dtype='object')

DataFrame: MAQA_world_knowledge_nq
Index(['question', 'answer'], dtype='object')

DataFrame: single_commonsens_reasoning(StrategyQA)
Index(['question', 'answer'], dtype='object')

DataFrame: 

In [None]:
all_datasets = {
    'strategyqa_dev': strategyqa_dev_df,
    'strategyqa_train': strategyqa_train_df,
    'gsm8k_test': gsm8k_test_df,
    'gsm8k_train': gsm8k_train_df
}

# Add MAQA datasets to the main dictionary
for name, df in maqa_dfs.items():
    all_datasets[f'maqa_{name}'] = df

print("All loaded DataFrames are now organized into the 'all_datasets' dictionary.")
print(f"Total DataFrames in dictionary: {len(all_datasets)}")
print("Keys in all_datasets:")
for key in all_datasets.keys():
    print(f"- {key}")


All loaded DataFrames are now organized into the 'all_datasets' dictionary.
Total DataFrames in dictionary: 11
Keys in all_datasets:
- strategyqa_dev
- strategyqa_train
- gsm8k_test
- gsm8k_train
- maqa_single_mathematical_reasoning(gsm8k)
- maqa_MAQA_commonsense_reasoning
- maqa_MAQA_mathmatical_reasoning
- maqa_MAQA_world_knowledge_nq
- maqa_single_commonsens_reasoning(StrategyQA)
- maqa_MAQA_world_knowledge_HLS
- maqa_single_world_knowledge(NQ)
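
Once the DataFrames live in one dictionary, accessing them can be wrapped in small helpers. The sketch below is illustrative only: `summarize_datasets` and `select_by_prefix` are hypothetical names, and both assume a plain dictionary that maps string keys to DataFrames, as built above.

```python
import pandas as pd


def summarize_datasets(datasets):
    """Build one row per DataFrame with its key, row count, and column names."""
    rows = [
        {"name": name, "rows": len(df), "columns": ", ".join(map(str, df.columns))}
        for name, df in datasets.items()
    ]
    return pd.DataFrame(rows, columns=["name", "rows", "columns"])


def select_by_prefix(datasets, prefix):
    """Return the subset of the dictionary whose keys start with `prefix`."""
    return {name: df for name, df in datasets.items() if name.startswith(prefix)}
```

For example, `select_by_prefix(all_datasets, "maqa_")` would return only the MAQA entries, and `summarize_datasets(all_datasets)` gives a compact overview table in place of separate `print` calls.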
