
# Final Project
Choose one of the research brief options below to focus on for this final project:

1. [Trust in Health Information Sources among American Adults](https://hints.cancer.gov/docs/Briefs/HINTS_Brief_39.pdf)

Use this dataset to complete the following steps:

https://drive.google.com/file/d/1c7Nv3xTjGsYpHlT5oXPFfXurmDbl1AlB/view?usp=sharing

1. First, read through the brief you've chosen.
2. Next, using what you have learned about cleaning data and creating visualizations, reproduce one of the data visualizations included in your brief. Use the Python libraries we discussed in class for this project (Pandas, matplotlib), though you can also use other libraries as needed. Try to recreate it as faithfully as possible (using the same ranges, numbers, and general layout).
3. Then, create another visualization based on a different demographic variable of your choice from the data. This visualization should relate generally to the topic of your brief, but should investigate a new aspect of it using the HINTS dataset. For example, if a brief covered age, amount of weekly exercise, and heart disease, you could investigate age, amount of exercise, and BMI category for respondents.
4. Finally, write a reflection about this project (see reflection prompts below).

If needed, you can find the codebook for this dataset here:

https://hints.cancer.gov/data/survey-instruments.aspx#H5C4

And you can download the original data here:

https://hints.cancer.gov/data/Default.aspx


# Recreate a visualization

Insert your code into the code box below to recreate a visualization from your brief. Make sure to comment your code!

In [1]:
# code for recreating a visualization

# first import libraries

!pip install seaborn
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib import style



In [2]:
# now import data

hints_data_raw = pd.read_csv("/content/hints5cycle4finalproject.csv")

In [32]:
# We are looking at the percentage of Americans reporting some or a lot of trust in different health info sources, so let's create a dataframe subset that contains only the relevant variables

hints_data_subset = hints_data_raw[["CancerTrustDoctor", "CancerTrustFamily", "CancerTrustGov", "CancerTrustCharities", "CancerTrustReligiousOrgs"]]

hints_data_subset

Unnamed: 0,CancerTrustDoctor,CancerTrustFamily,CancerTrustGov,CancerTrustCharities,CancerTrustReligiousOrgs
0,A lot,Some,Some,Some,A little
1,Some,Not At All,Some,Not At All,Not At All
2,A lot,A lot,Some,Some,Not At All
3,A lot,A little,A lot,Some,Not At All
4,A lot,A little,Some,A little,A little
...,...,...,...,...,...
3860,A lot,Some,A lot,A lot,A little
3861,Not At All,Missing data (Not Ascertained),Missing data (Not Ascertained),Missing data (Not Ascertained),Missing data (Not Ascertained)
3862,A lot,Missing data (Not Ascertained),A lot,Missing data (Not Ascertained),Missing data (Not Ascertained)
3863,A lot,A little,A little,Some,A little


In [42]:
# Some rows contain non-answer codes, as seen above. Convert those codes to NaN so the affected rows can be removed from the analysis.
# Note: np.nan is used because the np.NaN alias was removed in NumPy 2.0.

hints_data_subset_all = hints_data_subset.replace("Missing data (Not Ascertained)", np.nan)

hints_data_subset_all2 = hints_data_subset_all.replace("Multiple responses selected in error", np.nan)

# dropna() removes every row that has a missing value in at least one column
hints_data_subset_nanremoved = hints_data_subset_all2.dropna()

hints_data_subset_nanremoved


Unnamed: 0,CancerTrustDoctor,CancerTrustFamily,CancerTrustGov,CancerTrustCharities,CancerTrustReligiousOrgs
0,A lot,Some,Some,Some,A little
1,Some,Not At All,Some,Not At All,Not At All
2,A lot,A lot,Some,Some,Not At All
3,A lot,A little,A lot,Some,Not At All
4,A lot,A little,Some,A little,A little
...,...,...,...,...,...
3856,A lot,A lot,Some,Some,A little
3857,A lot,A little,Some,A little,Not At All
3859,Some,Not At All,Some,Not At All,Not At All
3860,A lot,Some,A lot,A lot,A little
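
Before dropping rows, it can help to confirm which response values each column actually contains and how many rows each column would lose. The following is a minimal sketch on a toy dataframe, not the real HINTS file; the `toy` data and the `missing_codes` list are assumptions for illustration, and the real data may contain other non-answer codes beyond the two handled in the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hints_data_subset (assumption: real values come from the HINTS CSV)
toy = pd.DataFrame({
    "CancerTrustDoctor": ["A lot", "Some", "Missing data (Not Ascertained)", "A lot"],
    "CancerTrustFamily": ["Some", "Multiple responses selected in error", "A little", "Not At All"],
})

# The two non-answer codes handled in the cell above
missing_codes = ["Missing data (Not Ascertained)", "Multiple responses selected in error"]

# List the distinct responses per column to spot any other non-answer codes
for col in toy.columns:
    print(col, sorted(toy[col].unique()))

# Replace both codes in a single call
cleaned = toy.replace(missing_codes, np.nan)

# Count missing answers per column before deciding how to drop rows
missing_per_column = cleaned.isna().sum()
print(missing_per_column)

# dropna() with no arguments removes a row if ANY column is missing
rows_kept = len(cleaned.dropna())
print(rows_kept)
```

Because `dropna()` discards a respondent for a missing answer in any one column, every trust source ends up with the same set of respondents. Dropping missing values column by column instead would keep more data per source, at the cost of different sample sizes for each bar.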


In [53]:
# Convert the trust variables to the category dtype so the responses can be counted and ordered

hints_data_subset_nanremoved_categories = hints_data_subset_nanremoved[["CancerTrustDoctor", "CancerTrustFamily", "CancerTrustGov", "CancerTrustCharities", "CancerTrustReligiousOrgs"]].astype("category")

# Put the answers in a logical order (most to least trust) so they make sense in the visualization.
# Note: only CancerTrustDoctor is reordered here; the other columns keep their default category order.

hints_data_subset_nanremoved_categories_ordered = hints_data_subset_nanremoved_categories["CancerTrustDoctor"].cat.reorder_categories(["A lot", "Some", "A little", "Not At All"], ordered=True)
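
The ordering step above covers a single column. One possible way to finish the reproduction is sketched below: apply the same ordered categories to every trust column, compute the percentage of respondents answering "Some" or "A lot" for each source, and draw a horizontal bar chart. The `toy` dataframe is made-up data standing in for `hints_data_subset_nanremoved`, and the chart layout is an assumption rather than a copy of the brief's figure. Published HINTS estimates are typically survey-weighted, so unweighted percentages like these may not match the brief exactly.

```python
import pandas as pd
import matplotlib.pyplot as plt

trust_order = ["A lot", "Some", "A little", "Not At All"]

# Toy stand-in for hints_data_subset_nanremoved (made-up values for illustration)
toy = pd.DataFrame({
    "CancerTrustDoctor": ["A lot", "A lot", "Some", "A little"],
    "CancerTrustGov": ["Some", "A little", "Not At All", "A lot"],
    "CancerTrustReligiousOrgs": ["Not At All", "A little", "Not At All", "Some"],
})

# Apply one ordered categorical dtype to every column at once.
# Caution: any value not listed in trust_order becomes NaN.
trust_dtype = pd.CategoricalDtype(categories=trust_order, ordered=True)
ordered = toy.astype(trust_dtype)

# Percentage of respondents reporting "A lot" or "Some" trust, per source
pct_high = pd.Series({
    col: ordered[col].isin(["A lot", "Some"]).mean() * 100
    for col in ordered.columns
}).sort_values()

# Horizontal bar chart with a fixed 0-100 axis so the bars are comparable
fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(pct_high.index, pct_high.values)
ax.set_xlim(0, 100)
ax.set_xlabel("Percent reporting some or a lot of trust")
ax.set_title("Trust in cancer information sources (toy data)")
fig.tight_layout()
plt.show()
```

To run this on the real data, replace `toy` with `hints_data_subset_nanremoved` and adjust the axis range, labels, and bar order to match the figure in the brief.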


# Your explanation for the new visualization

**What does your new visualization show?**

Use this text block (double click to edit and replace this text) to briefly explain (one or two sentences) what your new visualization will explore.

# Your code for the new visualization

Insert your code in the code block below to create a new visualization based on a different demographic variable of your choice. Make sure to comment your code!
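
As one possible starting point, a row-normalized cross-tabulation shows how the trust responses are distributed within each demographic group. This is a sketch on made-up data: the `AgeGroup` column name and its values are hypothetical, so check the codebook for the actual variable name and response labels before adapting it.

```python
import pandas as pd
import matplotlib.pyplot as plt

trust_order = ["A lot", "Some", "A little", "Not At All"]

# Toy data; "AgeGroup" is a hypothetical column name, not taken from the codebook
toy = pd.DataFrame({
    "AgeGroup": ["18-34", "18-34", "35-49", "35-49", "50+", "50+"],
    "CancerTrustGov": ["Some", "A little", "A lot", "Not At All", "A lot", "Some"],
})

# normalize="index" turns counts into within-group proportions (each row sums to 1)
pct_by_group = pd.crosstab(
    toy["AgeGroup"], toy["CancerTrustGov"], normalize="index"
).mul(100)

# Put the response columns in most-to-least trust order
pct_by_group = pct_by_group.reindex(columns=trust_order, fill_value=0)

# Stacked bars: one bar per group, segments show the share of each response
ax = pct_by_group.plot(kind="bar", stacked=True, figsize=(8, 4))
ax.set_ylabel("Percent of respondents")
ax.set_xlabel("Age group (hypothetical variable)")
ax.set_title("Trust in government health agencies by age group (toy data)")
ax.legend(title="Trust level", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()
```

A stacked bar chart suits this comparison because every bar sums to 100 percent, which keeps groups of different sizes comparable.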

In [None]:
# new visualization code


# Reflections
Edit this text block to answer the following questions.

**1. Was the original visualization that you reproduced from your brief useful for the data that were being represented? Why or why not? How could it have been improved?**



**2. For the new visualization you created, what type of visualization did you choose and why did you choose that type?**



**3. Were you able to understand the data you were working with based on the brief you chose? Were there any additional data you wish you had to enhance your analysis / visualization?**

