# Assignment 01: Data Preprocessing Project - Matthew Mendoza

The goal of this assignment is to account for missing values and to concatenate
rows and columns.

I will be using a dataset donated to the UC Irvine Machine Learning Repository:
[Diabetes 130-US hospitals for years 1999-2008](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008)

The dataset represents a ten-year (1999-2008) clinical record of diabetes
patients in 130 US hospitals, with the key goal of predicting early readmission
within 30 days of discharge to address inconsistent diabetes management, which
impacts both hospital costs and patient health.

## Dataset citation

Clore, John, Cios, Krzysztof, DeShazo, Jon, and Strack, Beata. (2014). Diabetes 130-US hospitals for years 1999-2008. UCI Machine Learning Repository. https://doi.org/10.24432/C5230J.

## [Python](https://www.python.org/) dependencies

For analysis and processing of the data I will be using the
[pandas package](https://pypi.org/project/pandas/).


## Missing values

In this dataset, missing values are indicated by a question mark (`?`).
The features with missing values are categorical and include the following:

- `race`: Caucasian, Asian, African American, Hispanic, and other
- `weight`: Weight in pounds
- `payer_code`: Integer identifier corresponding to 23 distinct values,
  for example, Blue Cross/Blue Shield, Medicare, and self-pay
- `medical_specialty`: Integer identifier of a specialty of the admitting
  physician, corresponding to 84 distinct values, for example, cardiology,
  internal medicine, family/general practice, and surgeon
- `diag_1`: The primary diagnosis (coded as first three digits of ICD9);
  848 distinct values
- `diag_2`: Secondary diagnosis (coded as first three digits of ICD9);
  923 distinct values
- `diag_3`: Additional secondary diagnosis (coded as first three digits of ICD9);
  954 distinct values
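
As a minimal sketch of how the `?` placeholder can be turned into proper missing values, the example below reads a small, made-up CSV sample (not taken from the real file) with `na_values="?"` and counts the missing entries per column. The sample rows and the helper name `count_missing` are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical sample rows shaped like the dataset; these are not real records.
SAMPLE_CSV = """race,weight,payer_code,diag_1
Caucasian,?,MC,250
?,[75-100),?,428
Hispanic,?,?,?
"""


def count_missing(csv_text: str) -> pd.Series:
    """Read CSV text, treat '?' as missing, and count NaNs per column."""
    frame = pd.read_csv(io.StringIO(csv_text), na_values="?")
    return frame.isna().sum()


missing_counts = count_missing(SAMPLE_CSV)
print(missing_counts)
```

Running the same count on the full file would show which of the features listed above are affected most.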

### Cautions and caveats

Some "massaging" of the data is required...

`age` values are not single numbers but ranges with an inclusive lower bound
and an exclusive upper bound, in 10-year increments; for example, `[0-10)`,
`[10-20)`, `[20-30)`, etc.

`weight`, like `age`, is also a range with an inclusive lower bound and an
exclusive upper bound, but in 25-pound increments
(e.g., `[0-25)`, `[50-75)`, `[100-125)`, ...).
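
The bracket notation described above can be parsed with a regular expression. The sketch below assumes every range has the form `[low-high)`; the names `parse_range` and `midpoint` are hypothetical, and a value in any other form (including the `?` placeholder) is returned as `None`.

```python
import re
from typing import Optional, Tuple

# Matches half-open ranges such as "[50-75)": inclusive lower, exclusive upper.
RANGE_PATTERN = re.compile(r"^\[(\d+)-(\d+)\)$")


def parse_range(value: object) -> Optional[Tuple[int, int]]:
    """Return (lower, upper) for a "[low-high)" string, otherwise None."""
    if not isinstance(value, str):
        return None
    match = RANGE_PATTERN.match(value.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def midpoint(value: object) -> Optional[float]:
    """Return the midpoint of a "[low-high)" range, otherwise None."""
    bounds = parse_range(value)
    if bounds is None:
        return None
    lower, upper = bounds
    return (lower + upper) / 2


print(parse_range("[10-20)"))  # (10, 20)
print(midpoint("[50-75)"))     # 62.5
print(parse_range("?"))        # None
```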

Note that `weight` is not consistently recorded in the dataset; in addition,
the records that do include a `weight` belong only to `Caucasian` patients
(none belong to `Asian`, `African American`, `Hispanic`, or `other` patients).
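
One way to check a claim like this is to count the recorded `weight` values per `race`. The sketch below uses a small invented sample rather than the real file, so its counts say nothing about the actual dataset; the helper name `weight_counts_by_race` is hypothetical.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only; NaN marks a missing weight.
sample = pd.DataFrame({
    "race": ["Caucasian", "Caucasian", "Asian", "Hispanic", "Caucasian"],
    "weight": ["[50-75)", np.nan, np.nan, np.nan, "[75-100)"],
})


def weight_counts_by_race(frame: pd.DataFrame) -> pd.Series:
    """Count non-missing `weight` entries for each `race` value."""
    return frame.groupby("race")["weight"].count()


counts = weight_counts_by_race(sample)
print(counts)
```

A count of zero for a group means that none of its rows has a recorded `weight`.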

For this reason I list the median weight of each weight range, regardless of
gender, for `Caucasian` patients only.


```python
import os

import numpy as np
import pandas as pd

# Set the path to the data directory (relative to the current working directory)
DATA_PATH: str = os.path.join(
    os.getcwd(), "data", "diabetes+130-us+hospitals+for+years+1999-2008")

# Construct file paths
input_file = os.path.join(DATA_PATH, "diabetic_data.csv")
output_file = os.path.join(
    DATA_PATH, "diabetic_data_caucasian_median_weight.csv")

# Load the original CSV data, treating '?' as a missing value.
# low_memory=False infers each column's dtype from the whole file
# rather than chunk by chunk.
data = pd.read_csv(input_file, na_values='?', low_memory=False)

# Filter the data to include only Caucasian patients
caucasian_data = data[data['race'] == 'Caucasian']


def calculate_median_weight(weight_range: str) -> float:
    """Calculate the median (midpoint) weight of a weight range.

    Args:
        weight_range (str): A string containing a weight range, e.g. "[50-75)".

    Returns:
        float: The median of the two bounds, or NaN if the value is missing
            or does not contain exactly two numeric bounds.
    """
    if isinstance(weight_range, str):
        bounds: list = [int(val) for val in weight_range.strip(
            '[]()').split('-') if val.isdigit()]
        if len(bounds) == 2:
            # If the value is a range, return the median of its bounds
            return float(np.median(bounds))
    return np.nan  # Handle missing and non-range values as NaN


# Create a new DataFrame with the desired columns; .copy() makes it
# independent of `data`, so the assignment below does not act on a slice
output_data = caucasian_data[['encounter_id',
                              'patient_nbr', 'race', 'gender', 'age', 'weight']].copy()

# Replace each weight range with its median; missing weights stay NaN
output_data['weight'] = output_data['weight'].apply(calculate_median_weight)

# Drop rows with missing values (NaN) in any column
output_data = output_data.dropna()

# Save the filtered data to a new CSV file
output_data.to_csv(output_file, index=False)
```




## Concatenating rows and columns
