# **Min-Max Scaling & Data Validation**
This notebook applies **Min-Max Scaling** to normalize composite survey scores, ensuring that all values are scaled between **0 and 1**. This transformation allows for **consistent comparisons between respondents**, as the original survey data contained composite scores on different numerical scales.

Additionally, we implement **automated validation tests** to confirm that: 

    ✅ Data and respondents are correctly loaded and all rows are present
    ✅ All transformed values **fall within the expected range** `[0,1]`
    ✅ The re-scaled data is complete after undergoing transformations

Once the dataset has been successfully transformed and validated, it will be saved as a new CSV (CSV2.csv) which will be used in future notebooks.

In [19]:
# The following notebook uses these packages:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import os

In [20]:
# Load the file and preview the columns and data format
file_path = r"C:\Users\12012\OneDrive\Desktop\CSV1.csv"
if os.path.exists(file_path):
    df = pd.read_csv(file_path)
    print("✅")
    display(df.head())  # Display first rows
else:
    print("❌ File not found. Check the filename and/or path.")


✅


Unnamed: 0,Respondent ID,Respondant Zip Code,Urban or Rural,Age,Gender,Ethnicity,Income Re-scaled see document for re-scaling,Education Re-scaled see document for rescaling,Household size Re-scaled see document for rescaling,Marital,...,Composite Crisis Score (Sum of all Crisis Scores),Systems Thinking Score 0-36,Trust Score 1-5,Conspiracy 5-20,Complexity 5-20,Openness 3-21,Conscienciousness 3-21,Extroversion 3-21,Agreeableness 3-21,Neuroticism 3-21
0,2590,85716,Urban,46,Female,Black,4,3,3,Single,...,30,28,1.0,8,13,8,21,7,21,17
1,1713,85396,Rural,20,Female,Hispanic,1,1,6,Single,...,30,21,1.0,10,13,10,12,14,11,12
2,2453,0,Rural,63,Male,White,6,3,2,Married,...,30,18,4.0,10,9,12,18,12,15,11
3,2052,85225,Urban,32,Female,White,5,2,3,Married,...,30,32,2.666667,15,13,12,17,12,17,21
4,1736,85297,Urban,43,Male,Asian,9,6,4,Married,...,30,36,3.333333,20,19,12,12,12,12,12


**🔹 Test 1: Verifying All Respondents Are Present:**
This test ensures that all 1042 survey responses from the Arizona-based cohort are included in the dataset. If the expected number of rows is not found, the data may be loading incorrectly or the file may be incomplete.

In [21]:
expected_rows = 1042
assert df.shape[0] == expected_rows, f"❌ Error: Expected {expected_rows} rows, but found {df.shape[0]}!"
print(f"✅ Test Passed: All {expected_rows} respondents included.")

✅ Test Passed: All 1042 respondents included.
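A row count alone would not reveal a respondent who appears twice. As a possible extension of Test 1, the sketch below pairs the row count with a uniqueness check on `Respondent ID`. The helper `check_respondents` and the toy IDs are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the survey data; these rows are made up for illustration.
toy = pd.DataFrame({"Respondent ID": [2590, 1713, 2453, 2052, 1736]})


def check_respondents(frame, expected_rows, id_column="Respondent ID"):
    """Raise AssertionError if the row count is off or an ID is repeated."""
    assert len(frame) == expected_rows, (
        f"Expected {expected_rows} rows, but found {len(frame)}"
    )
    duplicated = int(frame[id_column].duplicated().sum())
    assert duplicated == 0, f"{duplicated} duplicated values in '{id_column}'"
    return True


result = check_respondents(toy, expected_rows=5)
print("Row count and ID uniqueness check passed:", result)
```

In the notebook itself, the same call could be made with `df` and `expected_rows=1042`, assuming each respondent is meant to appear exactly once.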


In [22]:
print(df.columns) # Full list of all columns and current scaling

Index(['Respondent ID', 'Respondant Zip Code', 'Urban or Rural', 'Age',
       'Gender', 'Ethnicity', 'Income Re-scaled see document for re-scaling',
       'Education Re-scaled see document for rescaling',
       'Household size Re-scaled see document for rescaling', 'Marital',
       'Employment Simple',
       'Elections: I think elections are fair and reliable Re-scaled see document for rescaling',
       'Last Election: Didn't Vote 0, Democrat 1, Republican 2',
       'Elections: I understand the term 'democracy' to mean:',
       'Elections: How would you define the political system in which you live? ',
       'Political System Oppinions: Which of these two political systems do you think is the most effective and functional for making complex decisions? ',
       'Political System Oppinions: Which of these two political systems would you prefer to live in?',
       'Economic Instability Re-scaled: 0=no impact 3=high impact',
       'Energy Crisis Re-scaled 0=no impact 3=high imp

## Composite Scores: 
**Uses:** A key feature of this project is the ability to **compare respondents** by generating **similarity scores** based on survey data. Some variables—like **age or income**—can be directly compared because they exist on a measurable scale. Others, such as **gender, ethnicity, household size, and education level**, can still be used for comparison but require more **contextual interpretation** rather than numerical scaling.  

To facilitate comparisons, the survey includes **composite scores**, which aggregate responses from multiple related questions into a single metric. These scores help **identify patterns** across respondents, but they also have **limitations**—they summarize complex behaviors and attitudes into a number, which may not capture the full **nuance of individual responses**.  

**Limitations:** While composite scores allow for broad **comparisons across the population**, they should be used with **an understanding of their constraints**. They are designed to be reductive rather than expansive in nature, but the individual questions that contribute to these scales can still provide deeper insights into specific trends.  

**Challenges:** Some composite scores in the dataset **use different numerical ranges** (e.g., **0–36, 1–5, 3–21**). Without standardization, **direct comparisons are not meaningful**—a score of **10 in one category** might not represent the same level of agreement as **10 in another category**. Before meaningful comparisons can be made, we need to standardize the scores, by re-scaling them.

**Next Steps** Min-Max scaling will transform all columns holding composite data into a **range from 0 to 1**. This ensures that **all composite scores** contribute equally to **similarity calculations**, making comparisons between respondents **more interpretable and reliable**.
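As a rough illustration of the transformation described above, Min-Max scaling maps each value `x` in a column to `(x - min) / (max - min)`. The sketch below applies that formula by hand with pandas to a few made-up scores. The helper `min_max` is hypothetical and is shown only to make the arithmetic visible; it uses the minimum and maximum observed in each column, which is also what `MinMaxScaler` does with its default settings.

```python
import pandas as pd


def min_max(series):
    """Rescale a numeric Series to [0, 1] using its observed min and max."""
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo)


# Made-up scores on two different scales (illustrative only).
toy = pd.DataFrame({
    "Trust Score 1-5": [1.0, 4.0, 3.0, 5.0],
    "Openness 3-21": [3, 12, 21, 8],
})

# Apply the formula column by column.
scaled = toy.apply(min_max)
print(scaled)
```

After scaling, a trust score of 4 on the 1 to 5 scale and an openness score of 12 on the 3 to 21 scale become 0.75 and 0.5, so they can be compared on a common footing.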

In [23]:
from sklearn.preprocessing import MinMaxScaler

# Define the composite score columns
composite_columns = [
    "Systems Thinking Score 0-36", 
    "Trust Score 1-5", 
    "Conspiracy 5-20", 
    "Complexity 5-20", 
    "Openness 3-21", 
    "Conscienciousness 3-21", 
    "Extroversion 3-21", 
    "Agreeableness 3-21", 
    "Neuroticism 3-21"
]

# Initialize the scaler, apply to columns defined above:
scaler = MinMaxScaler()
df[composite_columns] = scaler.fit_transform(df[composite_columns])
# Display the first few rows after scaling for checking
df.head()


Unnamed: 0,Respondent ID,Respondant Zip Code,Urban or Rural,Age,Gender,Ethnicity,Income Re-scaled see document for re-scaling,Education Re-scaled see document for rescaling,Household size Re-scaled see document for rescaling,Marital,...,Composite Crisis Score (Sum of all Crisis Scores),Systems Thinking Score 0-36,Trust Score 1-5,Conspiracy 5-20,Complexity 5-20,Openness 3-21,Conscienciousness 3-21,Extroversion 3-21,Agreeableness 3-21,Neuroticism 3-21
0,2590,85716,Urban,46,Female,Black,4,3,3,Single,...,30,0.777778,0.0,0.2,0.533333,0.277778,1.0,0.222222,1.0,0.777778
1,1713,85396,Rural,20,Female,Hispanic,1,1,6,Single,...,30,0.583333,0.0,0.333333,0.533333,0.388889,0.5,0.611111,0.444444,0.5
2,2453,0,Rural,63,Male,White,6,3,2,Married,...,30,0.5,0.75,0.333333,0.266667,0.5,0.833333,0.5,0.666667,0.444444
3,2052,85225,Urban,32,Female,White,5,2,3,Married,...,30,0.888889,0.416667,0.666667,0.533333,0.5,0.777778,0.5,0.777778,1.0
4,1736,85297,Urban,43,Male,Asian,9,6,4,Married,...,30,1.0,0.583333,1.0,0.933333,0.5,0.5,0.5,0.5,0.5


**Note** Min-Max scaling has been applied, as shown by the decimal values that replaced the previous whole numbers. The column names still show the original range from before rescaling; these range labels will be removed in future versions of the data.
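Because `MinMaxScaler` scales each column by the minimum and maximum it actually observes, a scaled value of 1.0 corresponds to the top of the nominal scale only if some respondent reached it. The sketch below is one way to check this on unscaled data, assuming the nominal range is the trailing `low-high` part of the column name, as in the composite columns used here. The helper `nominal_range` and the toy values are hypothetical.

```python
import re

import pandas as pd


def nominal_range(column_name):
    """Parse a trailing 'low-high' range such as '3-21' from a column name."""
    match = re.search(r"(\d+)-(\d+)$", column_name)
    if match is None:
        raise ValueError(f"No range found in column name: {column_name!r}")
    return float(match.group(1)), float(match.group(2))


# Made-up, unscaled scores (illustrative only).
raw = pd.DataFrame({
    "Trust Score 1-5": [1.0, 4.0, 5.0],
    "Openness 3-21": [8, 10, 12],  # never reaches 3 or 21
})

# Columns whose observed extremes differ from the range in their name.
mismatched = [
    col for col in raw.columns
    if (raw[col].min(), raw[col].max()) != nominal_range(col)
]
print("Columns not spanning their nominal range:", mismatched)
```

An empty list would mean the observed and nominal ranges agree for every column. In the notebook this check would need to run on the data before the scaling cell, or on a saved copy of it.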

**🔹 Test 2: Ensure Composite Columns Have Been Rescaled Correctly**
The test below ensures that:
1. All **expected columns** are correctly re-scaled (columns containing demographic or other scale data are not Min-Max rescaled).
2. No values fall **below 0** or **above 1** after Min-Max scaling.

In [24]:
# TEST 2: Testing for Correct Value Range
assert df[composite_columns].min().min() >= 0, "❌ Error: Min value is below 0"
assert df[composite_columns].max().max() <= 1, "❌ Error: Max value is above 1"
print("✅ Test 2 Passed: All values are between 0 and 1.")


✅ Test 2 Passed: All values are between 0 and 1.
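The range check above covers only the composite columns. A complementary check, sketched below under the assumption that a copy of the data is kept before scaling, confirms that the remaining columns were left untouched. The names `before`, `after`, and `composite` are hypothetical, the data is made up, and the scaling step is a hand-written stand-in rather than the notebook's `MinMaxScaler` call.

```python
import pandas as pd

# Made-up data: one demographic column and one composite column.
before = pd.DataFrame({"Age": [46, 20, 63], "Trust Score 1-5": [1.0, 4.0, 5.0]})
composite = ["Trust Score 1-5"]

# Stand-in for the scaling step: rescale only the composite columns.
after = before.copy()
col_min = before[composite].min()
col_max = before[composite].max()
after[composite] = (before[composite] - col_min) / (col_max - col_min)

# Every column outside the composite list should be identical to the original.
other_columns = [c for c in before.columns if c not in composite]
untouched = after[other_columns].equals(before[other_columns])
assert untouched, "❌ Error: A non-composite column changed during scaling"
print("✅ Non-composite columns are unchanged.")
```

In the notebook, `before` could be created with `df.copy()` immediately ahead of the scaling cell.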


**🔹 Test 3: Ensure No Missing Values in the Data**
The final test makes sure that:

1. No missing values (**NaN**) are present after the transformations. Missing values could indicate that the original data was in an incorrect location or format.

In [25]:
# Count the total number of missing values across the entire dataset
total_missing_values = df.isnull().sum().sum()  

# Rule for warning if there are any missing values
if total_missing_values > 0:
    print(f"❌ Warning: The dataset contains {total_missing_values} missing values!")

# Assertion to ensure there are no missing values: if the number is 0, the test can pass, if not, it raises an error.
assert total_missing_values == 0, "❌ Error: There are missing values in the dataset!"

# If no missing values were found, print a success message
print("✅ Test 3 Passed: No missing values in the dataset.")

✅ Test 3 Passed: No missing values in the dataset.
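If Test 3 ever fails, the total count does not say where the gaps are. A per-column breakdown, sketched below on made-up data, may help locate them; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up data with deliberate gaps (illustrative only).
toy = pd.DataFrame({
    "Age": [46, 20, np.nan],
    "Trust Score 1-5": [0.0, np.nan, np.nan],
    "Gender": ["Female", "Female", "Male"],
})

# Count missing values per column and keep only the columns that have any.
missing_per_column = toy.isnull().sum()
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)
```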


## **Future Work with the Re-Scaled Data**

The re-scaled dataset will be the foundation for **future analyses**, allowing for easier **comparisons between respondents**. While users could download the current version, we want to ensure that the **Min-Max Scaling process is transparent and easy to understand**.

To achieve this, we have created a **new version** with **clearly labeled column names**, explicitly indicating that Min-Max Scaling has been applied.

This updated file will be saved as:  
📄 **CSV2.csv**
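The renaming and saving step is not shown in this notebook, so the following is only a sketch of how it might look. The naming scheme (replacing the old range suffix with `(Min-Max 0-1)`) is an assumption made for illustration, the data is made up, and the file is written to a temporary directory rather than to the project's real location.

```python
import os
import tempfile

import pandas as pd

# Made-up, already-scaled data (illustrative only).
scaled = pd.DataFrame({"Age": [46, 20], "Trust Score 1-5": [0.0, 0.75]})
composite = ["Trust Score 1-5"]

# Hypothetical naming scheme: drop the old range and flag the new one.
renamed = scaled.rename(
    columns={col: col.rsplit(" ", 1)[0] + " (Min-Max 0-1)" for col in composite}
)

# index=False keeps the row index from being written as an extra unnamed column.
output_path = os.path.join(tempfile.mkdtemp(), "CSV2.csv")
renamed.to_csv(output_path, index=False)

# Reload to confirm the file round-trips with the new column names.
reloaded = pd.read_csv(output_path)
print(list(reloaded.columns))
```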
