# Exploratory Data Analysis for Final Project

In this assignment, your task is to put your new skills to use in an open-ended exploratory data analysis. In particular, I'm looking for you to demonstrate that you can use the programmatic tools we've been learning to access data, manipulate it, and visualize some findings, while working toward the ultimate goal of developing a final project proposal. Please include the following in your assignment:

  1. A brief summary of the topic (a few sentences)
  2. Access two or more datasets, at least one of them through an API. (If your project does not use any API data yet, don't worry about it; do this anyway for practice with APIs. You might use it later.)
  3. Demonstrate the use of Pandas operations to filter out missing data and/or outliers.
  4. Demonstrate your capacity to use some of the "group-by" operations to produce pivot tables or statistical summaries of your data.
  5. Use Matplotlib or Seaborn to produce 2-3 visualizations that both explore your data and highlight any notable patterns.
  6. Include a short written analysis of your interpretation of the data.
  7. In a few paragraphs, describe the research question you intend to investigate in your final project, and the plan for the data analysis you intend to perform.

Note that this exercise is intended to help you formulate your project topic, but it is not a binding contract. Your project will most likely evolve over the rest of the semester, so use this as an opportunity to be creative: throw some ideas against the wall and see what sticks. I will release the final project guidelines shortly. In the meantime, dig in!

And as always, please submit this assignment as a PR on GitHub, and post the URL of your PR on bCourses.

In [14]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

In [25]:
#Import Dataset1: CSV document of 2022 AARP Healthy Living Survey of Adults Age 50 and Older
AARP = pd.read_csv('AARP2022.csv', 
                   usecols=['age', 'racethni', 'educ5', 'marital', 'state', 'metro', 'internet'],
                   encoding='ISO-8859-1') 
AARP

,age,racethni,educ5,marital,state,metro,internet
0,76,"White, non-Hispanic",Vocational/tech school/some college/ associates,Married,Minnesota,Metro Area,Internet Household
1,86,"White, non-Hispanic",Bachelor's degree,Married,Pennsylvania,Metro Area,Internet Household
2,71,"White, non-Hispanic",HS graduate or equivalent,Separated,New Jersey,Metro Area,Internet Household
3,88,"White, non-Hispanic",Vocational/tech school/some college/ associates,Married,Washington,Metro Area,Internet Household
4,72,"White, non-Hispanic",Post grad study/professional degree,Married,North Carolina,Metro Area,Internet Household
...,...,...,...,...,...,...,...
1959,53,Hispanic,Post grad study/professional degree,Divorced,Florida,Metro Area,Internet Household
1960,76,Hispanic,Post grad study/professional degree,Married,Florida,Metro Area,Internet Household
1961,73,Hispanic,Post grad study/professional degree,Separated,Florida,Metro Area,Internet Household
1962,60,Hispanic,Post grad study/professional degree,Married,Mississippi,Metro Area,Internet Household


In [26]:
!pip install requests pandas



In [27]:
import requests
import pandas as pd

# Example URL for the Census Bureau API (you'll need to replace this with the actual URL you're using)
url = "https://api.census.gov/data.json"

# Make the API request (the timeout keeps the cell from hanging indefinitely)
response = requests.get(url, timeout=30)

# Check if the request was successful
if response.status_code == 200:
    # Convert the response to JSON
    data = response.json()
    if isinstance(data, list):
        # Data queries return a list of rows whose first item is the header
        df = pd.DataFrame(data[1:], columns=data[0])
    else:
        # This URL returns a JSON object (a dict), and slicing a dict with
        # data[1:] is what raised "TypeError: unhashable type: 'slice'".
        # Assumption: the dataset records sit under a "dataset" key.
        df = pd.json_normalize(data.get("dataset", data))
else:
    print(f"Failed to fetch data: {response.status_code}")
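
The header-first conversion used in the cell above only applies when the API returns a JSON array of rows. A JSON object (a Python dict) cannot be sliced, and trying to do so raises `TypeError: unhashable type: 'slice'`. As a minimal sketch, assuming a response shaped like a list whose first row is the header, the conversion could look like the following. The helper name `rows_to_frame` and the toy payload are hypothetical and are not real Census figures.

```python
import pandas as pd


def rows_to_frame(payload):
    """Turn a header-first list of rows into a DataFrame.

    Raises a ValueError when the payload is not a list of rows, for example
    when the endpoint returned a JSON object (dict) instead.
    """
    if not isinstance(payload, list) or len(payload) < 1:
        raise ValueError("expected a non-empty list whose first row is the header")
    header, rows = payload[0], payload[1:]
    return pd.DataFrame(rows, columns=header)


# Hypothetical toy payload (illustrative values only, not real Census data)
toy_payload = [
    ["NAME", "value", "state"],
    ["Area A", "10", "01"],
    ["Area B", "20", "02"],
]

toy_df = rows_to_frame(toy_payload)
# The toy values are strings, so convert the numeric column explicitly
toy_df["value"] = pd.to_numeric(toy_df["value"])
print(toy_df)
```

Checking the payload's type before slicing turns a confusing error into a clear message about what the endpoint actually returned.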

In [29]:
# Drop rows with a missing value in any of the selected columns
AARP.dropna(inplace=True)
AARP

,age,racethni,educ5,marital,state,metro,internet
0,76,"White, non-Hispanic",Vocational/tech school/some college/ associates,Married,Minnesota,Metro Area,Internet Household
1,86,"White, non-Hispanic",Bachelor's degree,Married,Pennsylvania,Metro Area,Internet Household
2,71,"White, non-Hispanic",HS graduate or equivalent,Separated,New Jersey,Metro Area,Internet Household
3,88,"White, non-Hispanic",Vocational/tech school/some college/ associates,Married,Washington,Metro Area,Internet Household
4,72,"White, non-Hispanic",Post grad study/professional degree,Married,North Carolina,Metro Area,Internet Household
...,...,...,...,...,...,...,...
1959,53,Hispanic,Post grad study/professional degree,Divorced,Florida,Metro Area,Internet Household
1960,76,Hispanic,Post grad study/professional degree,Married,Florida,Metro Area,Internet Household
1961,73,Hispanic,Post grad study/professional degree,Separated,Florida,Metro Area,Internet Household
1962,60,Hispanic,Post grad study/professional degree,Married,Mississippi,Metro Area,Internet Household
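
Before dropping rows, it can help to see how many values are actually missing in each column, since the displayed rows look the same before and after `dropna`. The sketch below uses a hypothetical toy DataFrame (the values are illustrative, not taken from the survey) to show one way to count missing values and to drop rows either across all columns or only for the columns an analysis needs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy survey rows (illustrative values only)
toy = pd.DataFrame({
    "age": [76, 86, np.nan, 60],
    "metro": ["Metro Area", None, "Metro Area", "Non-Metro Area"],
    "internet": ["Internet Household"] * 4,
})

# Count missing values per column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop rows with a missing value in any column; returns a new DataFrame
complete = toy.dropna()

# Or drop only rows missing a value in the columns the analysis needs
needs_age = toy.dropna(subset=["age"])

print(len(toy), len(complete), len(needs_age))
```

Comparing the row counts before and after makes it explicit how much data the cleaning step removes.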


In [33]:
Q1 = AARP['age'].quantile(0.25)
Q3 = AARP['age'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds for the outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter outliers
AARP_filtered = AARP[(AARP['age'] >= lower_bound) & (AARP['age'] <= upper_bound)]
AARP_filtered

,age,racethni,educ5,marital,state,metro,internet
0,76,"White, non-Hispanic",Vocational/tech school/some college/ associates,Married,Minnesota,Metro Area,Internet Household
1,86,"White, non-Hispanic",Bachelor's degree,Married,Pennsylvania,Metro Area,Internet Household
2,71,"White, non-Hispanic",HS graduate or equivalent,Separated,New Jersey,Metro Area,Internet Household
3,88,"White, non-Hispanic",Vocational/tech school/some college/ associates,Married,Washington,Metro Area,Internet Household
4,72,"White, non-Hispanic",Post grad study/professional degree,Married,North Carolina,Metro Area,Internet Household
...,...,...,...,...,...,...,...
1959,53,Hispanic,Post grad study/professional degree,Divorced,Florida,Metro Area,Internet Household
1960,76,Hispanic,Post grad study/professional degree,Married,Florida,Metro Area,Internet Household
1961,73,Hispanic,Post grad study/professional degree,Separated,Florida,Metro Area,Internet Household
1962,60,Hispanic,Post grad study/professional degree,Married,Mississippi,Metro Area,Internet Household
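
The cell above applies the common 1.5 × IQR rule: values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are treated as outliers. The displayed rows are identical before and after the filter, so it is worth checking how many rows the rule actually removes. Here is a small worked sketch on hypothetical toy ages; the helper name `iqr_filter` is an assumption and not part of the notebook.

```python
import pandas as pd


def iqr_filter(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Series.between is inclusive on both ends by default
    return df[df[column].between(lower, upper)], (lower, upper)


# Hypothetical toy ages with one implausible value (140)
toy_ages = pd.DataFrame({"age": [50, 52, 55, 60, 63, 67, 70, 74, 78, 140]})

kept, (lower, upper) = iqr_filter(toy_ages, "age")
print(lower, upper)  # 31.125 98.125
print(len(toy_ages) - len(kept), "row(s) removed")
```

For these toy values Q1 = 56.25 and Q3 = 73.0, so IQR = 16.75 and the bounds are 31.125 and 98.125; only the value 140 falls outside them.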


In [35]:
# Percentage of internet vs. non-internet households within each metro category
internet_access_by_metro = AARP_filtered.groupby('metro')['internet'].value_counts(normalize=True).unstack(fill_value=0) * 100
internet_access_by_metro

metro,Internet Household (%),Non-internet household (%)
Metro Area,88.850174,11.149826
Non-Metro Area,80.753138,19.246862
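
The same kind of within-group percentage table can also be built with `pd.crosstab`, which is one possible alternative to chaining `groupby`, `value_counts`, and `unstack`. The sketch below uses hypothetical toy rows (illustrative values only) and also prints raw counts, which help judge whether a group is large enough for its percentages to be meaningful.

```python
import pandas as pd

# Hypothetical toy rows (illustrative values only)
toy = pd.DataFrame({
    "metro": ["Metro Area"] * 4 + ["Non-Metro Area"] * 2,
    "internet": [
        "Internet Household", "Internet Household", "Internet Household",
        "Non-internet household",
        "Internet Household", "Non-internet household",
    ],
})

# Row percentages: share of each internet category within each metro group
share = pd.crosstab(toy["metro"], toy["internet"], normalize="index") * 100

# Raw counts, useful for spotting groups too small to trust
counts = pd.crosstab(toy["metro"], toy["internet"])

print(share)
print(counts)
```

With `normalize="index"`, each row sums to 100, matching the layout of the table above.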


In [36]:
# Percentage of internet vs. non-internet households within each race/ethnicity group
internet_access_by_race = AARP_filtered.groupby('racethni')['internet'].value_counts(normalize=True).unstack(fill_value=0) * 100
internet_access_by_race

racethni,Internet Household (%),Non-internet household (%)
"2+, non-Hispanic",90.322581,9.677419
"Asian, non-Hispanic",100.0,0.0
"Black, non-Hispanic",80.498866,19.501134
Hispanic,86.386139,13.613861
"Other, non-Hispanic",86.666667,13.333333
"White, non-Hispanic",91.254753,8.745247

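The assignment asks for 2-3 visualizations, and the race/ethnicity table above is a natural candidate for a bar chart. As a minimal sketch, the block below re-enters the "Internet Household" percentages from that table (rounded to one decimal place) so that it runs on its own; in the notebook, the `internet_access_by_race` DataFrame could presumably be plotted directly instead. Group sizes are not shown in the output, so a value such as 100% may reflect a small subgroup.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Percentages copied from the table above, rounded to one decimal place
internet_share = pd.Series(
    {
        "2+, non-Hispanic": 90.3,
        "Asian, non-Hispanic": 100.0,
        "Black, non-Hispanic": 80.5,
        "Hispanic": 86.4,
        "Other, non-Hispanic": 86.7,
        "White, non-Hispanic": 91.3,
    },
    name="Internet Household (%)",
).sort_values()

# Horizontal bars keep the long category labels readable
fig, ax = plt.subplots(figsize=(7, 4))
ax.barh(internet_share.index, internet_share.values, color="steelblue")
ax.set_xlabel("Households with internet access (%)")
ax.set_ylabel("Race/ethnicity (racethni)")
ax.set_title("Internet access by race/ethnicity, adults 50+")
ax.set_xlim(0, 100)
fig.tight_layout()
```

Sorting the values before plotting makes the gap between the lowest and highest groups easier to read at a glance.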
