# Basic file format loading, conversion, and CSV standardization

In this notebook, we perform the initial loading tasks: we load the .xlsx files and save them as UTF-8 encoded CSV files. One dataset, `annotators.csv`, already comes in CSV format but has structural issues: it uses semicolon separators and a non-UTF-8 encoding. We re-encode it as UTF-8 and standardize its structure by switching to comma separators, with quoting protecting any cells that contain commas.

## Import Required Libraries

In [1]:
# import required libraries
import pandas as pd
import chardet


## Load and Examine Dataset Structure

### 1. Detect the encoding of `annotators.csv` with `chardet`

In [2]:
# detect the encoding of a file from its raw bytes with chardet
def detect_encoding(file_path):
    with open(file_path, "rb") as file:
        raw_data = file.read()
        result = chardet.detect(raw_data)
        return result


encoding_info = detect_encoding("../data/annotators.csv")
print(encoding_info)


{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
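
A confidence of 0.73 is a best guess rather than a guarantee, so a cheap cross-check can help. The sketch below tries a strict decode under each candidate encoding. The helper `can_decode` and the sample bytes are hypothetical and not part of the dataset. A failed decode rules an encoding out; a successful one does not prove it is correct, because Windows-1252 accepts almost any byte sequence.

```python
def can_decode(raw_bytes, encoding):
    """Return True if raw_bytes decodes under `encoding` without errors."""
    try:
        raw_bytes.decode(encoding)  # strict mode raises on invalid byte sequences
        return True
    except UnicodeDecodeError:
        return False


# Made-up sample: semicolon-separated text with accented characters,
# encoded as Windows-1252 (Python alias: cp1252)
sample = "José;Zoë;café".encode("cp1252")

print(can_decode(sample, "utf-8"))   # False
print(can_decode(sample, "cp1252"))  # True
```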


### 2. Standardize the CSV file: write a UTF-8 version with comma separators while protecting comma-containing cells

In [3]:
# define a function that standardizes a problematic csv file
def safe_standardize_csv(
    input_file, output_file, input_sep=";", input_encoding="windows-1252"
):
    df = pd.read_csv(input_file, sep=input_sep, encoding=input_encoding)

    # Check if loading worked correctly (basic validation)
    print(f"Loaded shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")

    # Save as UTF-8 with comma separators; by default pandas quotes
    # any cell that contains a comma, so such cells stay intact
    df.to_csv(output_file, encoding="utf-8", index=False)
    print(f"Standardized: {input_file} → {output_file}")
    return df


# standardize the annotators.csv file
annotators_standardized = safe_standardize_csv(
    "../data/annotators.csv",
    "../data/annotators_utf8.csv",
    input_sep=";",
    input_encoding="Windows-1252",
)


Loaded shape: (1345, 9)
Columns: ['id', 'age', 'gender', 'education', 'native_english_speaker', 'political_ideology', 'followed_news_outlets', 'news_check_frequency', 'survey_completed']
Standardized: ../data/annotators.csv → ../data/annotators_utf8.csv
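
To illustrate why switching to comma separators does not break comma-containing cells, here is a minimal sketch with a made-up row (the cell contents are an assumption, not taken from the dataset). With its default minimal quoting, `to_csv` wraps such a cell in double quotes, and `read_csv` recovers it as a single value.

```python
import io

import pandas as pd

# Made-up row: one cell contains commas, as a list of news outlets might
df = pd.DataFrame({"id": [1], "followed_news_outlets": ["BBC, CNN, Fox"]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # default quoting is csv.QUOTE_MINIMAL
csv_text = buffer.getvalue()
print(csv_text)  # the comma-containing cell appears wrapped in double quotes

# Reading the text back yields two columns, not four
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.shape)
```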


### 3. Convert the .xlsx files to CSV with UTF-8 encoding

In [4]:
# load excel files
annotations = pd.read_excel("../data/annotations.xlsx")
labeled_dataset = pd.read_excel("../data/labeled_dataset.xlsx")

# and convert to utf-8 csv files
annotations.to_csv("../data/annotations_utf8.csv", encoding="utf-8", index=False)
labeled_dataset.to_csv(
    "../data/labeled_dataset_utf8.csv", encoding="utf-8", index=False
)




### 4. Check that the datasets loaded properly

In [5]:
# check that everything is loaded correctly
print(f"Annotators_standardized shape: {annotators_standardized.shape}")
print(f"Annotations shape: {annotations.shape}")
print(f"Labeled_dataset shape: {labeled_dataset.shape}")
print("Data loaded successfully!")


Annotators_standardized shape: (1345, 9)
Annotations shape: (17775, 23)
Labeled_dataset shape: (1700, 12)
Data loaded successfully!
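
The shapes above describe the in-memory DataFrames. To also confirm that a saved CSV file re-loads as expected, a check along the following lines could be used. This is a sketch under assumptions: `reload_matches` is a hypothetical helper, and the demo frame and temporary file stand in for the real datasets.

```python
import tempfile
from pathlib import Path

import pandas as pd


def reload_matches(df, csv_path):
    """Return True if the CSV at csv_path re-loads with df's shape and columns."""
    reloaded = pd.read_csv(csv_path, encoding="utf-8")
    return reloaded.shape == df.shape and list(reloaded.columns) == list(df.columns)


# Made-up frame standing in for one of the datasets
demo = pd.DataFrame({"id": [1, 2], "text": ["plain", "with, comma"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_utf8.csv"
    demo.to_csv(demo_path, encoding="utf-8", index=False)
    matches = reload_matches(demo, demo_path)

print(matches)
```

In the notebook, the same helper could be called with, for example, `annotations` and `"../data/annotations_utf8.csv"`.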


### 5. When needed, work on test copies to preserve the raw data

In [6]:
# create copies for testing purposes
annotators_test = annotators_standardized.copy()
annotations_test = annotations.copy()
labeled_dataset_test = labeled_dataset.copy()
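
The reason for copying is that `DataFrame.copy()` makes a deep copy by default, so edits to the copy do not reach the original. A minimal illustration with made-up values:

```python
import pandas as pd

original = pd.DataFrame({"age": [25, 40]})
working_copy = original.copy()  # deep=True is the default

working_copy.loc[0, "age"] = 99  # modify only the copy

print(original.loc[0, "age"])      # 25
print(working_copy.loc[0, "age"])  # 99
```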
