# Written Analysis: Machine Learning

## Question: How does where you live in the US affect your quality of life?

### In this project, we attempted to use machine learning to predict factors related to quality of life. These factors included household income, poverty rate, pollution data, and Medicare data. With the Medicare data, we focused on preventative care, such as participation in Part B, ambulatory visits, and diagnostic tests.

### We conducted our project using a linear model. Over the course of the project, we tested various output variables and features to see which would best predict a person's quality of life.

### Output variables were chosen as indicators of quality of life - for example, we looked at participation in Medicare Part B. We wanted to see if we could use factors such as household income to predict Part B participation.

#### Medicare Part B is optional and costs a monthly premium. It covers preventative care, such as doctor visits and diagnostic tests. Part A, by contrast, is automatic, costs no extra money out of pocket, and is mainly for emergencies. Our theory was that higher participation in Part B would indicate a higher overall quality of life, because participants are able to afford it and are able to prevent medical issues.

## Trials

### For each trial, we used an SVC (support vector classifier) model and changed the output variable, the features, or both, to see which data would give us the best predictions. Each trial involved the following steps:

1. Assigning features - we assigned the data we wished to use to predict our output to the features.
2. Assigning our output variable to "y" - we assigned the data we wanted to be able to predict to our "y" value.
3. Splitting our data into training data and testing data using `train_test_split`.
4. Using the standard scaler to scale our data.
5. Training the model.
6. Calculating a classification report to see if our data was likely to be effective at predicting our output variable.
7. Identifying the most significant features for our chosen output variable (on some trials, when the classification report showed low precision scores).
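
Steps 1 through 6 can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the data is synthetic, the variable names are placeholders, and `kernel="linear"` is an assumption based on the description of the model as linear.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical; not the project's datasets).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))            # Step 1: two placeholder features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # Step 2: placeholder class labels

# Step 3: split into training and testing data (default is 75% / 25%).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 4: fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 5: train the model (linear kernel is an assumption).
model = SVC(kernel="linear")
model.fit(X_train_scaled, y_train)

# Step 6: build the classification report from predictions on the test set.
predictions = model.predict(X_test_scaled)
report = classification_report(y_test, predictions, zero_division=0)
print(report)
```

Fitting the scaler on the training data alone keeps information from the test set out of the model, which makes the report a fairer estimate of how the model handles unseen data.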


### Trial 1: Comparing Household Income and Population to Medicare Part B Beneficiaries
For our first trial, we wanted to see if we could predict participation in Part B, our output variable, by using the features of household income and population. Our theory was that the higher the household income and population, the higher participation in Part B would be, indicating a higher quality of life.

We used the classification report to gauge the effectiveness of our chosen data at making our desired prediction, and saw that the majority of "precision" scores were 0.00. We then determined the importance of our features, so we could change to the highest-rated ones and try again in Trial 2.
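
The method used to rank features is not described here. One common option for a linear-kernel SVC is to compare the absolute values of the fitted coefficients, as in the sketch below. The data, the feature names, and the choice of `coef_` as the importance measure are all assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with hypothetical feature names.
rng = np.random.default_rng(0)
feature_names = ["feature_a", "feature_b", "feature_c"]
X = rng.normal(size=(300, 3))
y = (X[:, 1] > 0).astype(int)  # labels depend only on feature_b by construction

# Scale first so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
model = SVC(kernel="linear").fit(X_scaled, y)

# coef_ has one row per pair of classes; average the absolute weights per feature.
importance = np.abs(model.coef_).mean(axis=0)
ranked = sorted(zip(feature_names, importance), key=lambda pair: pair[1], reverse=True)

for name, weight in ranked:
    print(f"{name}: {weight:.3f}")
```

Note that `coef_` is only available when the kernel is linear; other kernels would need a different approach, such as permutation importance.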

### Trial 2: Comparing Ambulatory Discharges and Ambulatory Visits to Medicare Part B Beneficiaries
For our second trial, we adjusted the features to see if we could predict participation in Part B, our output variable, by using ambulatory discharges and ambulatory visits, which we had found to be the most significant features in our datasets. Our theory was that the higher the ambulatory discharges and ambulatory visits, the higher participation in Part B would be, indicating a higher quality of life.

We used the classification report to gauge the effectiveness of our chosen data at making our desired prediction, and saw that the majority of "precision" scores had improved to 1.00, meaning our chosen data would be effective at making our desired prediction. Since both features and our output variable all came from the same Medicare dataset, we decided to try again in Trial 3 using different features pulled from our census data and our pollution data.

### Trial 3: Comparing Pollutants (Nitrogen Dioxide and Sulfur Dioxide) Levels to Household Income
For our third trial, we wanted to see if we could predict household income, our output variable, by using the features of Nitrogen Dioxide and Sulfur Dioxide.  Our theory was that the higher the level of pollutants, the lower the household income would be, indicating a lower quality of life.

We used the classification report to gauge the effectiveness of our chosen data at making our desired prediction, and saw that the majority of "precision" scores were 0.00. We then determined the importance of our features, and discovered that the highest-rated features for determining income were poverty rates and population, both from the census dataset. We then decided to change our output variable and our features so that Trial 4 would draw on more than one of our datasets.

### Trial 4: Comparing Income and Pollutants (Nitrogen Dioxide) to Poverty Rate
For our fourth trial, we changed both the output variable and the features to see if we could predict poverty rates, our new output variable, by using the features of household income and pollutants (Nitrogen Dioxide). Our theory was that the higher the household income and the lower the pollutants, the lower the poverty rate would be, indicating a higher quality of life.

We used the classification report to gauge the effectiveness of our chosen data at making our desired prediction, and saw that the precision scores were mixed, ranging from 0.00 to 1.00, with more instances of 1.00 than any other value. This indicated our chosen data would not be highly effective for making our desired predictions, so we moved on to Trial 5.

### Trial 5: Comparing Income, Part B Beneficiaries, and Unhealthy Days to Poverty Rate
For our fifth trial, we wanted to see if we could predict poverty rates, our output variable, by using the features of household income, Part B Beneficiaries, and days with so many pollutants recorded they were labeled as "Unhealthy Days." Our theory was that the higher the household income, the higher the Part B participation, and the lower the rate of unhealthy days, would correlate to lower poverty rates, indicating a higher quality of life.

We used the classification report to gauge the effectiveness of our chosen data at making our desired prediction, and saw that the precision scores were mixed, ranging from 0.00 to 1.00, with fewer instances of 0.00 than any other value and most falling between 0.33 and 0.75. This indicated our chosen data would not be highly effective for making our desired predictions.

### Overall, we found that the trial with the highest classification scores was Trial 4. We chose this model to fine-tune using a grid search.
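
A grid search over an SVC can be sketched with scikit-learn's `GridSearchCV`. The parameter grid, the synthetic data, and the use of a pipeline below are illustrative assumptions; the values actually searched in the project are not stated here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical; not the Trial 4 data).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Putting the scaler in a pipeline means it is refit inside each cross-validation fold.
pipeline = make_pipeline(StandardScaler(), SVC())

# Hypothetical grid: make_pipeline names the SVC step "svc", hence the "svc__" prefix.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy of the best model on held-out data
```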

### We then created a linear regression model with the output variable and features from Trial 4.
We used Matplotlib to visualize whether the fitted line was a good fit for our data. The residuals are the differences between the values predicted by the line and the actual data, so their values show how well the line represents the data. The residuals in our visualization appeared randomly scattered, indicating that a linear regression model was a good fit for our data.
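
A residual plot of this kind can be produced roughly as follows. This is a sketch on synthetic data with placeholder features, not the project's figure; the coefficients used to generate the data are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: a linear relationship plus random noise (hypothetical).
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(150, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=150)

model = LinearRegression().fit(X, y)
predictions = model.predict(X)

# Residual = actual value minus the value predicted by the fitted line.
residuals = y - predictions

# Plot residuals against predictions; a patternless band around zero suggests a good linear fit.
fig, ax = plt.subplots()
ax.scatter(predictions, residuals, alpha=0.6)
ax.axhline(0, color="black", linewidth=1)
ax.set_xlabel("Predicted value")
ax.set_ylabel("Residual (actual - predicted)")
ax.set_title("Residual plot")
```

Calling `plt.show()` afterwards displays the figure in an interactive session. A curve or funnel shape in the points, rather than a random scatter, would suggest that a straight line does not describe the data well.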