# EDA

Youth Risk Behavior Survey (YRBS) data has been converted from fixed-width format to comma-separated values. The dataset was split in two by state name, A-M and N-Z. In this notebook, we will merge the two halves and perform some basic cleanup and initial exploration.
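
As a minimal sketch of the merge step, the two halves can be stacked row-wise with `pd.concat`. The CSV contents below are tiny in-memory stand-ins and the column names are assumptions for illustration; with real files, the two paths would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the A-M and N-Z CSV files.
csv_a_m = io.StringIO("sitename,year,age\nAlabama,2019,4\nMaine,2019,5\n")
csv_n_z = io.StringIO("sitename,year,age\nNevada,2019,3\nUtah,2019,6\n")

a_m_df = pd.read_csv(csv_a_m)
n_z_df = pd.read_csv(csv_n_z)

# Stack the halves row-wise; ignore_index rebuilds a clean 0..n-1 index.
merged_df = pd.concat([a_m_df, n_z_df], ignore_index=True)
print(merged_df.shape)
```

Both halves must share the same column names for the rows to line up; columns present in only one half would be filled with missing values.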


In [None]:
import sys
import time
import pandas as pd
import numpy as np
from pathlib import Path
from sqlalchemy import inspect, create_engine
import hvplot.pandas

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## Read our SQL database


In [None]:
connection_string = 'sqlite:///data/cdc_yrbs_state_data.db'
engine = create_engine(connection_string)
insp = inspect(engine)
print(insp.get_table_names())
display(pd.read_sql_query("SELECT * FROM STATE LIMIT 3;", con=engine))
display(pd.read_sql_query("SELECT COUNT(*) FROM STATE", con=engine))
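
Before loading everything, it can help to list a table's columns and row count. The sketch below uses the standard-library `sqlite3` driver rather than SQLAlchemy, against an in-memory database with a toy `STATE` table; the rows and columns are hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory database with a toy STATE table (hypothetical rows and columns).
conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {"sitename": ["Alabama", "Alaska", "Arizona"], "year": [2019, 2019, 2017]}
).to_sql("STATE", conn, index=False)

# PRAGMA table_info returns one row per column of the table.
columns_df = pd.read_sql_query("PRAGMA table_info(STATE);", con=conn)
column_names = columns_df["name"].tolist()

# COUNT(*) is the portable spelling of a row count.
count_df = pd.read_sql_query("SELECT COUNT(*) AS n FROM STATE;", con=conn)
row_count = int(count_df["n"].iloc[0])

print(column_names, row_count)
conn.close()
```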

## Read the Full Database into a Pandas DataFrame

In [None]:
df = pd.read_sql_query("SELECT * FROM STATE;", con=engine)
df.shape

In [None]:
print(f"Imported from SQL Dataframe Shape: {df.shape}\n\n")
print("Head:")
display(df.head(3))
print("Tail:")
display(df.tail(3))
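
If the full table is too large to load comfortably in one call, `pd.read_sql_query` accepts a `chunksize` argument and then yields DataFrames piece by piece. This is a sketch against a toy in-memory table, not the real database; the table contents are hypothetical.

```python
import sqlite3
import pandas as pd

# Toy in-memory table standing in for the real STATE table.
conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {"sitename": ["A", "B", "C", "D", "E"], "year": [2015, 2015, 2017, 2019, 2019]}
).to_sql("STATE", conn, index=False)

# With chunksize set, read_sql_query returns an iterator of DataFrames.
chunk_sizes = []
parts = []
for chunk in pd.read_sql_query("SELECT * FROM STATE;", con=conn, chunksize=2):
    chunk_sizes.append(len(chunk))
    parts.append(chunk)

# Reassemble the pieces into one DataFrame with a fresh index.
full_df = pd.concat(parts, ignore_index=True)
print(chunk_sizes, full_df.shape)
conn.close()
```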

## Column Enumeration

We have SAS files to help with this, but Python dictionaries give us more granular control over our analysis.

SAS Input File sample

``` SAS
data dataout.sadc_2019_xxxxxxx;
infile datain lrecl=900;
input
sitecode $ 1-5
sitename $ 6-55
sitetype $ 56-105
sitetypenum 106-113
...
...
```
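
The column definitions in the SAS input statement can be turned into column specifications for a fixed-width reader. The sketch below parses only the four sample lines shown above; the regular expression is an assumption about the line layout (name, optional `$`, then `start-end`) and may not cover every form a real SAS input file uses.

```python
import re

# The four column definitions from the SAS sample above.
sas_input_lines = [
    "sitecode $ 1-5",
    "sitename $ 6-55",
    "sitetype $ 56-105",
    "sitetypenum 106-113",
]

# name, optional "$" marker for character columns, then start-end positions.
pattern = re.compile(r"^(\w+)\s+(\$\s+)?(\d+)-(\d+)$")

names, colspecs = [], []
for line in sas_input_lines:
    match = pattern.match(line.strip())
    if match is None:
        continue  # skip lines that are not column definitions
    name, _, start, end = match.groups()
    names.append(name)
    # SAS positions are 1-based and inclusive; convert to 0-based half-open.
    colspecs.append((int(start) - 1, int(end)))

print(list(zip(names, colspecs)))
```

The resulting `names` and `colspecs` lists have the shape that `pandas.read_fwf` expects for its `names` and `colspecs` arguments.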

SAS Proc Format sample

``` SAS
proc format library=library;
value $SITE
"AB" = "Albuquerque, NM" 
"AK" = "Alaska"
"AL" = "Alabama"
"AR" = "Arkansas"
"AZB" = "Arizona"
...

```
etc
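
A `value` block like the one above maps naturally onto a Python dictionary. As a rough sketch, the quoted pairs can be pulled out with a regular expression; the pattern assumes every entry has the form `"code" = "label"` with no embedded double quotes, which holds for the sample but is not verified for the full file.

```python
import re

# The sample entries from the proc format block above.
sas_format_text = '''
value $SITE
"AB" = "Albuquerque, NM"
"AK" = "Alaska"
"AL" = "Alabama"
"AR" = "Arkansas"
"AZB" = "Arizona"
'''

# Capture each quoted code and its quoted label.
pair_pattern = re.compile(r'"([^"]*)"\s*=\s*"([^"]*)"')
site_dict = dict(pair_pattern.findall(sas_format_text))

print(site_dict["AK"])
```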

### Python Dictionaries, Ready for Pandas!

Python dictionaries are used for value translation. However, we will perform this task later on an ad hoc basis, so for now we will save our DataFrame as a "cleaned" Pandas DataFrame.

In [None]:
# Function? :( maybe not
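def use_dict(column_in, value_in):
    """
    Translates values into understandable output.

    Returns None when the column or the code is not recognized.
    """
    val = None  # default, so an unknown column does not raise UnboundLocalError
    if column_in == "age":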
        age = {
            1: "12 years old or younger",
            2: "13 years old",
            3: "14 years old",
            4: "15 years old",
            5: "16 years old",
            6: "17 years old",
            7: "18 years old or older"
        }
        val = age.get(value_in)

    elif column_in == "sex":
        sex = {
            1: "Female",
            2: "Male",
            3: "Other"
        }
        val = sex.get(value_in)
        
    elif column_in == "grade":
        grade = {
            1: "9th grade",
            2: "10th grade",
            3: "11th grade",
            4: "12th grade",
            5: "Ungraded or other grade"
        }
        val = grade.get(value_in)

    elif column_in == "race4":
        race4 = {
            1: "White",
            2: "Black or African American",
            3: "Hispanic/Latino",
            4: "All Other Races"
        }
        val = race4.get(value_in)

    elif column_in == "race7":
        race7 = {
            1: "American Indian/Alaska Native",
            2: "Asian",
            3: "Black or African American",
            4: "Hispanic/Latino",
            5: "Native Hawaiian/Other Pacific Islander",
            6: "White",
            7: "Multiple Races (Non-Hispanic)"
        }
        val = race7.get(value_in)
        
    elif column_in == "q66":
        q66 = {
            1: "Heterosexual (straight)",
            2: "Gay or lesbian",
            3: "Bisexual",
            4: "Not sure"
        }
        val = q66.get(value_in)
        
    elif column_in == "q65":
        q65 = {
            1: "I have never had sexual contact",
            2: "Females",
            3: "Males",
            4: "Females and males"
        }
        val = q65.get(value_in)
        
       
        
    return val

print(use_dict("age", 6))
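
The same lookup can be written without the `if`/`elif` chain by nesting the dictionaries in one table keyed by column name. This is only a sketch of that alternative; `CODEBOOK` and `translate` are hypothetical names, and only three of the columns are included to keep it short.

```python
# Hypothetical codebook: column name -> {code: label}.
CODEBOOK = {
    "age": {
        1: "12 years old or younger",
        2: "13 years old",
        3: "14 years old",
        4: "15 years old",
        5: "16 years old",
        6: "17 years old",
        7: "18 years old or older",
    },
    "sex": {1: "Female", 2: "Male", 3: "Other"},
    "grade": {
        1: "9th grade",
        2: "10th grade",
        3: "11th grade",
        4: "12th grade",
        5: "Ungraded or other grade",
    },
}


def translate(column, value, default=None):
    """Return the label for a coded value, or `default` if unknown."""
    return CODEBOOK.get(column, {}).get(value, default)


print(translate("age", 6))
```

Adding a column then means adding one entry to the table rather than another branch, and unknown columns or codes fall back to `default` instead of raising.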

In [None]:
# Define dictionaries

sex_dict = {
    1: "Female",
    2: "Male",
    3: "Other"
}

age_dict = {
    1: "12 years old or younger",
    2: "13 years old",
    3: "14 years old",
    4: "15 years old",
    5: "16 years old",
    6: "17 years old",
    7: "18 years old or older"
}

grade_dict = {
    1: "9th grade",
    2: "10th grade",
    3: "11th grade",
    4: "12th grade",
    5: "Ungraded or other grade"
}

race4_dict = {
    1: "White",
    2: "Black or African American",
    3: "Hispanic/Latino",
    4: "All Other Races"
}

race7_dict = {
    1: "American Indian/Alaska Native",
    2: "Asian",
    3: "Black or African American",
    4: "Hispanic/Latino",
    5: "Native Hawaiian/Other Pacific Islander",
    6: "White",
    7: "Multiple Races (Non-Hispanic)"
}

q65_dict = {
    1: "I have never had sexual contact",
    2: "Females",
    3: "Males",
    4: "Females and males"
}

q66_dict = {
    1: "Heterosexual (straight)",
    2: "Gay or lesbian",
    3: "Bisexual",
    4: "Not sure"
}


In [None]:
# Rewrite column values using the survey question mapping dictionaries
df = df.replace({
    "age": age_dict,
    "sex": sex_dict,
    "grade": grade_dict,
    "race4": race4_dict,
    "race7": race7_dict,
    "q66": q66_dict,
    "q65": q65_dict,
})
# df
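
One thing to be aware of with `replace` is that codes missing from a dictionary are left unchanged, so a column can end up mixing labels and raw numbers. A possible alternative, sketched here on a toy DataFrame, is `Series.map` into a new column: the raw codes are kept, and unmapped codes become `NaN`. The `grade_label` column name and the toy values are assumptions for illustration.

```python
import pandas as pd

grade_dict = {
    1: "9th grade",
    2: "10th grade",
    3: "11th grade",
    4: "12th grade",
    5: "Ungraded or other grade",
}

# Toy data; the code 9 is deliberately absent from grade_dict.
toy_df = pd.DataFrame({"grade": [1, 3, 5, 9]})

# map() writes labels to a new column and leaves the raw codes untouched.
toy_df["grade_label"] = toy_df["grade"].map(grade_dict)

print(toy_df)
```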

In [None]:
df["age"].value_counts()

In [None]:
df["race4"].value_counts()

In [None]:
df["q66"].value_counts()

In [None]:
df["q65"].value_counts()
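
By default `value_counts` drops missing answers, which can hide how many respondents skipped a question. The sketch below, on a hypothetical toy Series, shows the `dropna=False` and `normalize=True` options.

```python
import pandas as pd

# Hypothetical answers, including two missing responses.
answers = pd.Series(["Bisexual", "Not sure", None, "Bisexual", None, "Bisexual"])

# Include missing responses in the tally.
counts = answers.value_counts(dropna=False)

# Shares among non-missing responses only (dropna defaults to True).
shares = answers.value_counts(normalize=True)

print(counts)
print(shares)
```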

## Summarize and Visualize
Extract a subset of the data.

In [None]:
cols = ["sitename", "year", "age", "sex", "race7", "stheight", "stweight", "bmi", "bmipct", "q66", "q65"]
summary_df = df[cols].copy()
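
With a subset in hand, a first summary could be a grouped mean per survey year. This is a sketch on toy data only; the values are made up, and whether `bmi` is stored as a numeric column in the real table is an assumption.

```python
import pandas as pd

# Toy stand-in for summary_df with made-up values.
toy_summary_df = pd.DataFrame({
    "year": [2017, 2017, 2019, 2019],
    "sex": ["Female", "Male", "Female", "Male"],
    "bmi": [21.0, 23.0, 22.0, 24.0],
})

# Mean BMI per survey year.
bmi_by_year = toy_summary_df.groupby("year")["bmi"].mean()

# Number of respondents per year and sex.
respondents = toy_summary_df.groupby(["year", "sex"]).size()

print(bmi_by_year)
print(respondents)
```

A Series like `bmi_by_year` is indexed by year, so it is a natural input for a line plot over time.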

In [None]:
# # Concatenate dataframes
# summary_df = pd.concat([summary_a_m_df, summary_n_z_df])
# print(f"summary_df Shape: {summary_df.shape}\n\n")
# print("Head:")
# display(summary_df.head(3))
# print("Tail:")
# display(summary_df.tail(3))
# sorted_df = summary_df.sort_values(["year"], ascending=True)
# sorted_df.hvplot(x="year")