# Applied Data Science with Python: Incremental Capstone 1 Session 3

### Project Statement

Develop a comprehensive solution for data aggregation, wrangling, and visualization using a healthcare dataset for the Aura platform. The goal is to manage and process complex healthcare data effectively, enabling insightful analysis and better data-driven decision-making within Aura.


### Variable Descriptions

| **Variable**   | **Description**                                                                 | **Type** |
|----------------|---------------------------------------------------------------------------------|---------------|
| `visits`       | Number of physician office visits                                               | numeric         |
| `nvisits`      | Number of non-physician office visits                                           | numeric         |
| `ovisits`      | Number of physician hospital outpatient visits                                  | numeric         |
| `novisits`     | Number of non-physician hospital outpatient visits                              | numeric         |
| `emergency`    | Emergency room visits                                                           | numeric         |
| `health`       | Factor indicating self-perceived health                                         | categorical        |
| `chronic`      | Number of chronic conditions                                                    | numeric         |
| `adl`          | Factor indicating whether the individual has a condition that limits activities of daily living | categorical        |
| `region`       | Factor indicating region                                                        | categorical        |
| `age`          | Age in years (divided by 10)                                                    | numeric       |
| `hospital`     | Number of hospital stays                                                        | numeric         |
| `gender`       | Factor indicating gender                                                        | categorical        |
| `school`       | Number of years of education                                                    | numeric         |
| `employed`     | Factor. Is the individual employed?                                             | categorical        |
| `medicaid`     | Factor. Is the individual covered by Medicaid?                                  | categorical        |
| `married`      | Factor. Is the individual married?                                              | categorical        |
| `income`       | Family income in units of USD 10,000                                            | numeric       |
| `insurance`    | Factor. Is the individual covered by private insurance?                         | categorical        |

## Two Huge Problems With This Dataset

1. Medicare: There is no explicit variable for __Medicare__. This is a significant omission, especially when analyzing individuals aged 65 and over, who are generally eligible for Medicare in the U.S. As a result, some individuals who appear uninsured may in fact be covered by Medicare. Their coverage may instead be:
   - Captured under the generic `insurance` category
   - Misclassified due to self-reporting, question ambiguity, or cognitive factors

2. Class Imbalance: In classification problems, it is important to check for **class imbalance**, which occurs when one category appears much more frequently than another. In this dataset, 90% of respondents are marked as `employed = no` and only 10% as `yes`. A model trained on this data may learn to always predict `no` and still appear accurate, even though it fails to detect the minority class. There are numerous class imbalances in this dataset, so tread carefully.
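
A quick way to check for this kind of imbalance is to look at the class shares and compare them with a majority-class baseline. The sketch below uses a hypothetical ten-row `employed` column (not the real data), built only to mirror the 90/10 split described above.

```python
import pandas as pd

# Hypothetical stand-in for the real `employed` column (9 "no", 1 "yes")
employed = pd.Series(["no"] * 9 + ["yes"], name="employed")

# Share of each class
class_shares = employed.value_counts(normalize=True)
print(class_shares)

# Accuracy of a "model" that always predicts the most common class
baseline_accuracy = class_shares.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.0%}")
```

Any classifier trained on such a column should be judged against this baseline rather than against 50%.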

### Session 3 Tasks
1. Import relevant Python libraries.
2. Import the CSV file – NSMES1988updated.csv into a dataframe.
3. Identify the different data types: numerical and categorical.
4. Perform detailed data pivoting on the features `health` and `region` and report your findings.
5. Perform analysis based on the following criteria: Different types of visits, Gender, Marital Status, School, Income, Employment Status, Insurance, and Medicaid.
6. Explore the dataset to see how different factors relate to each other. Group the data by the demographic and economic criteria below and create the following distribution tables:
   - __Age and Gender Distribution__: Generate a table to view the number of individuals within each age group, separated by gender.
   - __Health Status by Gender__: Create a distribution table that categorizes individuals by their health status, differentiated by gender.
   - __Income Distribution by Gender__: Compile a table to examine the income distribution across genders.
   - __Regional Income Distribution__: Prepare a table to display the income distribution across various regions.
   - __Age-wise Income Analysis__: Develop a table to analyze the relationship between age and income.
7. Report your findings.


### Task 1: Import relevant Python libraries.
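
A minimal starting point, assuming pandas and NumPy are enough for the aggregation and wrangling in this session (plotting libraries can be added later if needed):

```python
import numpy as np
import pandas as pd

# Show every column when printing wide frames (optional convenience setting)
pd.set_option("display.max_columns", None)

print("pandas", pd.__version__, "| numpy", np.__version__)
```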

### Task 2: Import the CSV file – NSMES1988updated.csv into a dataframe.
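
One way to load the file, assuming it sits in the same folder as the notebook. The `CSV_PATH` constant and the `load_dataset` helper are illustrative names, not part of the assignment.

```python
from pathlib import Path

import pandas as pd

# Assumed location: same directory as the notebook
CSV_PATH = Path("NSMES1988updated.csv")


def load_dataset(path=CSV_PATH):
    """Read the CSV file into a DataFrame."""
    return pd.read_csv(path)


if CSV_PATH.exists():
    df = load_dataset()
    print(df.shape)
    print(df.head())
else:
    print(f"{CSV_PATH} not found; point CSV_PATH at the file's location.")
```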

### Task 3: Identify the different data types: numerical and categorical.
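
A sketch of splitting columns by type with `select_dtypes`. It runs on a hypothetical three-row sample so it works without the CSV; the values are made up, and on the real data you would apply the same two calls to the loaded dataframe.

```python
import pandas as pd

# Hypothetical sample standing in for the real dataset
sample = pd.DataFrame({
    "visits": [5, 1, 13],
    "age": [6.9, 7.4, 6.6],
    "health": ["average", "poor", "average"],
    "gender": ["male", "female", "female"],
})

# Numeric columns (integers and floats)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical here
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

Comparing the two lists against the Variable Descriptions table is a useful sanity check, since a numeric column read in as text usually signals a data-quality problem.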

### Task 4: Perform detailed data pivoting on the features `health` and `region` and report your findings.
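
One possible approach, sketched on a hypothetical six-row sample: `pd.pivot_table` for the mean number of physician visits per health and region combination, and `pd.crosstab` for the respondent counts. The category labels used here are assumptions and may differ from those in the CSV.

```python
import pandas as pd

# Hypothetical sample; category labels and values are assumptions
sample = pd.DataFrame({
    "health": ["average", "average", "poor", "excellent", "poor", "average"],
    "region": ["west", "other", "west", "other", "other", "west"],
    "visits": [4, 2, 10, 1, 6, 8],
})

# Mean physician office visits for each health x region cell
mean_visits = pd.pivot_table(
    sample, index="health", columns="region", values="visits", aggfunc="mean"
)
print(mean_visits)

# Number of respondents in each health x region cell
counts = pd.crosstab(sample["health"], sample["region"])
print(counts)
```

Reading the two tables side by side helps avoid over-interpreting a mean that rests on only a handful of respondents.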

### Task 5: Perform analysis based on the following criteria: Different types of visits, Gender, Marital Status, School, Income, Employment Status, Insurance, and Medicaid.
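
A minimal sketch of one way to structure this analysis: group by each categorical factor in turn and compare the means of the numeric columns. The sample below is hypothetical and uses only a subset of the columns; on the real data, the factor list could be extended with `employed`, `insurance`, and `medicaid`, and the value list with the remaining visit types.

```python
import pandas as pd

# Hypothetical sample; values are made up for illustration
sample = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "married": ["yes", "no", "yes", "yes"],
    "visits": [2, 4, 6, 10],
    "emergency": [0, 1, 1, 0],
    "school": [12, 10, 16, 8],
    "income": [2.0, 1.0, 3.0, 4.0],
})

value_cols = ["visits", "emergency", "school", "income"]
factor_cols = ["gender", "married"]

# One summary table of means per categorical factor
summaries = {
    factor: sample.groupby(factor)[value_cols].mean() for factor in factor_cols
}

for factor, table in summaries.items():
    print(f"\nMean values by {factor}:")
    print(table.round(2))
```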

### Task 6: Explore the dataset to see how different factors relate to each other. Group the data by the demographic and economic criteria below and create the following distribution tables:

- __Age and Gender Distribution__: Generate a table to view the number of individuals within each age group, separated by gender.

- __Health Status by Gender__: Create a distribution table that categorizes individuals by their health status, differentiated by gender.

- __Income Distribution by Gender__: Compile a table to examine the income distribution across genders.

- __Regional Income Distribution__: Prepare a table to display the income distribution across various regions.

- __Age-wise Income Analysis__: Develop a table to analyze the relationship between age and income.
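
The five tables above could be built along the following lines. This is a sketch on a hypothetical six-row sample: the age bin edges, the group labels, and the choice of count, mean, and median as income summaries are all assumptions to adapt to the real data. Note that `age` is stored in decades, so it is multiplied by 10 before binning.

```python
import pandas as pd

# Hypothetical sample; values and category labels are assumptions
sample = pd.DataFrame({
    "age": [6.6, 7.4, 8.1, 6.9, 7.8, 9.0],  # years divided by 10
    "gender": ["male", "female", "female", "male", "female", "male"],
    "health": ["average", "poor", "average", "excellent", "average", "poor"],
    "region": ["west", "other", "west", "midwest", "other", "west"],
    "income": [1.5, 0.8, 2.5, 4.0, 1.0, 3.0],  # units of USD 10,000
})

# Convert age back to years and bin it (bin edges are an assumption)
sample["age_group"] = pd.cut(
    sample["age"] * 10,
    bins=[65, 75, 85, 120],
    right=False,
    labels=["65-74", "75-84", "85+"],
)

income_stats = ["count", "mean", "median"]

# Age and gender distribution
age_gender = pd.crosstab(sample["age_group"], sample["gender"])

# Health status by gender
health_gender = pd.crosstab(sample["health"], sample["gender"])

# Income distribution by gender, by region, and by age group
income_by_gender = sample.groupby("gender")["income"].agg(income_stats)
income_by_region = sample.groupby("region")["income"].agg(income_stats)
income_by_age = sample.groupby("age_group", observed=True)["income"].agg(income_stats)

for name, table in [
    ("Age group x gender", age_gender),
    ("Health x gender", health_gender),
    ("Income by gender", income_by_gender),
    ("Income by region", income_by_region),
    ("Income by age group", income_by_age),
]:
    print(f"\n{name}")
    print(table)
```

Reporting the median alongside the mean is a deliberate choice here, since income distributions are often skewed and a few high values can pull the mean upward.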

### Task 7: Report your findings.