In [1]:
import pandas as pd
import numpy as np


First Step: Preprocess the Data
1. We want each row to represent one observation or instance (e.g., data for one county in one year), and each column to represent one feature, that is, one type of information about the observation. To get this layout, we reshaped the data into wide format with pandas in the AllData.ipynb file.
2. Non-Numeric to Numeric Conversion: Columns that were previously read as objects (because of "N/A" strings) now show as float64, which means they have been successfully converted to numeric. This is important for numerical analysis.
3. Handling Missing Values: Missing values are now represented as NaN (Not a Number), which pandas recognizes as missing. Some ML estimators (for example, scikit-learn's KMeans) do not accept NaN, so these values still need to be imputed or dropped before clustering.
4. Data Summary: use describe() to show column statistics, the dtypes attribute to confirm that all columns have the correct data type, and nunique() to check that each column's data has been read successfully.
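
As a rough illustration of the reshaping in item 1, the sketch below pivots a small long-format table into one row per county and year. The long-format column names "Measure" and "Value" are assumptions made for this example; the actual layout used in AllData.ipynb is not shown here and may differ.

```python
import pandas as pd

# Hypothetical long-format data: one row per (county, year, measure).
# The column names "Measure" and "Value" are assumptions for illustration.
long_data = pd.DataFrame({
    "County": ["Appling", "Appling", "Appling", "Appling"],
    "Year": [2011, 2011, 2012, 2012],
    "Measure": ["% Smokers", "% Uninsured", "% Smokers", "% Uninsured"],
    "Value": [23.0, 24.6, 20.2, 23.7],
})

# Pivot so that each measure becomes its own column
wide_data = (
    long_data
    .pivot_table(index=["County", "Year"], columns="Measure",
                 values="Value", aggfunc="first")
    .reset_index()
)
wide_data.columns.name = None  # drop the leftover "Measure" axis label

print(wide_data)
```

Each row of wide_data is now one county-year observation, and each measure is a separate feature column.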

In [3]:

values_data = pd.read_csv("pivot_wide_format_data.csv")

# Normalize missing-value markers: convert any "N/A" strings to NaN
values_data = values_data.replace("N/A", np.nan)

# List of column pairs to merge
column_pairs = [
    ('Child Mortality Rate (White)', 'Child Mortality Rate (white)'),
    ('Drug Overdose Mortality Rate (White)', 'Drug Overdose Mortality Rate (white)'),
    ('Firearm Fatalities Rate (White)', 'Firearm Fatalities Rate (white)'),
    ('Homicide Rate (White)', 'Homicide Rate (white)'),
    ('Infant Mortality Rate (White)', 'Infant Mortality Rate (white)'),
    ('Injury Death Rate (White)', 'Injury Death Rate (white)'),
    ('MV Mortality Rate (White)', 'MV Mortality Rate (white)'),
    ('Suicide Rate (White)', 'Suicide Rate (white)'),
    ('Teen Birth Rate (White)', 'Teen Birth Rate (white)'),
    ('YPLL Rate (White)', 'YPLL Rate (white)'),
    ('Preventable Hospitalization Rate', 'Preventable Hosp. Rate')
]

for column1, column2 in column_pairs:
    # Combine the columns and prioritize the non-null values in the first column
    values_data[column1] = values_data[column1].combine_first(values_data[column2])
    # Drop the second column as its data has been merged
    values_data = values_data.drop(columns=column2)

#Convert object columns that should be numeric to float
for column in values_data.columns:
    if values_data[column].dtype == "object" and column not in ["County", "Year"]:
        values_data[column] = pd.to_numeric(values_data[column], errors="coerce")

# Save the cleaned data
values_data.to_csv("pivot_wide_format_data_cleaned.csv", index=False)





In [4]:
print(values_data.head())

print(values_data.dtypes)

print(values_data.describe())

print(values_data.nunique())


    County  Year  % Adults with Obesity  % Children in Poverty  \
0  Appling  2011                    NaN                   31.3   
1  Appling  2012                    NaN                   35.7   
2  Appling  2013                    NaN                   34.0   
3  Appling  2014                    NaN                   34.8   
4  Appling  2015                    NaN                   40.3   

   % Excessive Drinking  % Smokers  % Unemployed  % Uninsured  ACSC Rate  \
0                  13.2       23.0           NaN    24.600000        NaN   
1                  11.0       20.2     10.700000    23.700000        NaN   
2                   9.6       24.8     11.000000    24.900000        NaN   
3                   NaN       27.8     10.317788    25.189603        NaN   
4                   NaN       27.8     10.023661    24.231524        NaN   

   Age-Adjusted Death Rate  ...  Teen Birth Rate (Black)  \
0                      NaN  ...                      NaN   
1                      NaN

Second Step: Data Scaling
Scale the numerical data so that all features contribute comparably to the distance calculation in the clustering algorithm. Without scaling, features with larger ranges would dominate the distances, and therefore the clusters, simply because of their units.
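
To make the point concrete, here is a minimal sketch with made-up numbers (not taken from the dataset): one feature is a percentage and the other is a rate in the thousands. Before scaling, the Euclidean distance is driven almost entirely by the rate column; after standardizing each column, both features matter.

```python
import numpy as np

# Hypothetical observations with two features on very different scales:
# column 0 is a percentage, column 1 is a rate in the thousands.
X = np.array([
    [10.0, 5000.0],
    [40.0, 5100.0],   # differs from row 0 mostly in the percentage
    [12.0, 9000.0],   # differs from row 0 mostly in the rate
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

raw_01 = euclidean(X[0], X[1])
raw_02 = euclidean(X[0], X[2])

# Standardize each column: subtract its mean, divide by its standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_01 = euclidean(X_scaled[0], X_scaled[1])
scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print("raw distances:   ", raw_01, raw_02)
print("scaled distances:", scaled_01, scaled_02)
```

On the raw data, row 2 looks far more distant from row 0 than row 1 does, only because the rate column has a larger range. After scaling, the two distances are of similar size.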

For K-Means clustering, we will use standardization (z-score normalization), since K-Means is a distance-based algorithm. Standardization rescales each feature to have a mean of 0 and a standard deviation of 1. It does not bound or clip the values, so outliers remain in the data as large z-scores. It is calculated per feature as z = (x - mean) / (standard deviation). We will use 'StandardScaler' from the 'sklearn.preprocessing' module of scikit-learn.
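
A minimal sketch of this step, assuming a small made-up frame in place of the real feature columns: it applies StandardScaler and checks the result against the formula computed by hand. StandardScaler uses the population standard deviation, which corresponds to ddof=0 in pandas.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small illustrative frame; the values are made up for this example
features = pd.DataFrame({
    "% Smokers": [23.0, 20.2, 24.8, 27.8],
    "% Uninsured": [24.6, 23.7, 24.9, 25.2],
})

# Fit the scaler and transform the features in one step
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

# The same result computed by hand: z = (x - mean) / standard deviation
manual = (features - features.mean()) / features.std(ddof=0)

print(scaled.round(3))
print("matches manual formula:", np.allclose(scaled.values, manual.values))
```

Wrapping the result back into a DataFrame keeps the column names, since fit_transform returns a plain NumPy array by default.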

Other data scaling techniques include min-max scaling (rescaling all values into a given range, such as 0 to 1) and robust scaling (which uses the median and the interquartile range, or IQR, so it is robust to outliers).
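
For comparison, the sketch below applies both alternatives to a single made-up feature containing one outlier. It is only an illustration of how the two scalers behave, not part of the pipeline used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# One illustrative feature with an outlier (values are made up)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling: maps the minimum to 0 and the maximum to 1
minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(x)

# Robust scaling: subtracts the median and divides by the IQR
robust = RobustScaler().fit_transform(x)

print("min-max:", minmax.ravel())
print("robust: ", robust.ravel())
```

With min-max scaling, the outlier squeezes the other four values into a narrow band near 0. With robust scaling, the typical values stay spread around 0 and the outlier stands out as a large value.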