# Identifying Predictive Factors of Tobacco Use in Youth
(Exploratory Data Analysis and Data Preprocessing)

# Environment Setup (Do this before running any code cell)
1. In VS Code, open the Command Palette with `cmd+shift+p`.
2. Select `Python: Create Environment` -> `Venv`. This creates a Python venv in which to install all your Python packages. After creating it, VS Code automatically activates it.
3. Run the command under **Install Packages** below to install all required packages from the `requirements.txt` file.

### Install Packages
Install all the required packages directly from the requirements.txt file

In [42]:
# Run this to install required packages
%pip install -r ../requirements.txt

Note: you may need to restart the kernel to use updated packages.


### Global Imports
Import everything you need

In [19]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

import sys
from pathlib import Path
sys.path.append(str(Path().resolve().parent))
from helpers.drop_list import dropped


# Load the dataset
df = pd.read_csv('../data/nyts2023.csv')
tobacco_user_df = pd.read_csv('../data/tobacco_users.csv')
nonuser_df = pd.read_csv('../data/nonusers.csv')



### Update requirements.txt
If you install any new packages, run this to update the `requirements.txt` file.

In [39]:
%pip freeze > ../requirements.txt

Note: you may need to restart the kernel to use updated packages.


# Selected Target Variable
The target label for our model is a binary classification: tobacco user or non-user. The label is based on Q100: "During the past 30 days, on how many days did you use any tobacco product(s)?" Respondents who reported a value of 1 or greater are labeled as tobacco users. Respondents who either skipped Q100 or reported a value of 0 are labeled as non-users.

In [16]:
# Filter out rows where Q100 is either 0, skipped, or missing
# Keep only rows where Q100 has a numeric value of 1 or greater, indicating Tobacco Use

tobacco_user_df = df[df['Q100'].apply(lambda x: str(x).isdigit() and int(x) >= 1)]

# Count the number of respondents who use Tobacco
num_respondents = len(tobacco_user_df)
print(f"Number of respondents who use Tobacco: {num_respondents}")

# Create a DataFrame for non-users by negating the condition for tobacco use
nonuser_df = df[~df['Q100'].apply(lambda x: str(x).isdigit() and int(x) >= 1)]

# Count the number of respondents who do not use Tobacco
num_respondents_non = len(nonuser_df)
print(f"Number of respondents who do not use Tobacco: {num_respondents_non}")

# Export the filtered data to new CSV files in ../data/,
# the directory the Global Imports cell loads them from
tobacco_user_df.to_csv('../data/tobacco_users.csv', index=False)
nonuser_df.to_csv('../data/nonusers.csv', index=False)
print("Export Success.")

Number of respondents who use Tobacco: 1760
Number of respondents who do not use Tobacco: 20309
Export Success.
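
The labeling rule above can also be expressed as a single 0/1 column rather than two separate DataFrames. The following is a minimal sketch on invented toy values, not the notebook's actual data; the column name `is_tobacco_user` and the `.S` skip code are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the survey data; every value here is invented.
toy = pd.DataFrame({"Q100": ["0", "5", None, "30", ".S", 2]})

# Coerce everything to numbers; blanks and non-numeric codes become NaN.
days_used = pd.to_numeric(toy["Q100"], errors="coerce")

# NaN >= 1 evaluates to False, so skipped or missing rows become non-users.
toy["is_tobacco_user"] = (days_used >= 1).astype(int)

print(toy)
```

One difference from the `str(x).isdigit()` check: a value stored as the float `5.0` fails `isdigit()` (its string form is `"5.0"`), whereas `pd.to_numeric` would count it as a user. Whether that matters depends on how Q100 is stored in the CSV.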


# Exploratory Data Analysis
For this phase I did not produce many visualizations. Instead, I did a manual deep dive through the survey questions and took the notes below. I created two files, `map.py` and `map_annotated.py`: the first contains the original mapping of questions to column IDs, and the second is an annotated version noting which columns should be merged, one-hot encoded, or removed.

**Notes:**
- Many multiple-choice questions are already split into separate columns and do not need to be one-hot encoded. However, remove any columns that would cause a dummy variable trap.
- Questions noted as Categorical need to be one-hot encoded. For example, for QN1, the categories are 0-13, 14-18, and 19+; each category needs to be its own column holding 1 if the respondent belongs to that group and 0 otherwise.
- "MERGE to NC" means merge into a new column; TC = target column.
  - A MERGE means that if a respondent answered any of the merged questions, their value in the new column is 1. If they answered none of them, their value is 0.
- Questions like "Why do you currently use e-cigarettes? (They are available in flavors, such as menthol, mint, candy, fruit, or chocolate)" can be thought of as
  "Respondent used e-cigarettes due to their availability in flavors such as menthol, mint, candy, fruit, or chocolate".
- Skip-logic questions do not seem to be a problem, because question answers are split into separate columns (as if they were already one-hot encoded) and dummy variable traps can be removed.
  All other questions can seemingly be one-hot encoded (split into categories), merged (several at a time) into new columns, or removed because they are not relevant to the analysis.
  - The few skip-logic questions that are exceptions are removed anyway.
- Some questions, such as Q48-51, are weak predictors of tobacco use on their own. For example, being curious about trying a cigarette (Q48)
  would not be significant for predicting whether a respondent is a tobacco user. Combined with smoking in the household, however, it could be significant, e.g.
  'Respondent is curious about smoking and is exposed to it within the household'. Just a suggestion.
- In hindsight, I overlooked the importance of respondents potentially using flavored versus non-flavored nicotine products.
- Used Q39, Q100, and Q101 as the target label.
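
The MERGE rule in the notes (1 if the respondent answered any of the merged questions, 0 otherwise) could be sketched as follows. The column names are hypothetical placeholders; the real groupings are the ones recorded in `map_annotated.py`.

```python
import pandas as pd

# Hypothetical columns standing in for a group of related survey questions.
toy = pd.DataFrame({
    "reason_friend": [1.0, None, None],
    "reason_family": [None, None, 1.0],
    "reason_media": [1.0, None, None],
})
merge_cols = ["reason_friend", "reason_family", "reason_media"]

# 1 if the respondent answered any of the merged questions, else 0.
toy["reason_social_exposure"] = toy[merge_cols].notna().any(axis=1).astype(int)

# Drop the originals so only the umbrella column remains.
toy = toy.drop(columns=merge_cols)

print(toy)
```

This treats any non-missing value as "answered". If a column can hold an explicit "no" code, the condition would need to test for the "yes" value instead of `notna()`.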

After that I decided to do some data cleaning.

# Preliminary Data Preprocessing

##### Remove unnecessary columns (manually determined)

In [31]:
# Drop the columns listed in `dropped` (from helpers.drop_list), ignoring any that are not present
df_new = df.drop(columns=dropped, errors='ignore')

# Calculate the number of columns before and after dropping
original_column_count = df.shape[1]
new_column_count = df_new.shape[1]
columns_dropped = original_column_count - new_column_count

# Print the result
print(f"Number of columns dropped: {columns_dropped}")
print(f"Original columns: {original_column_count}, New columns: {new_column_count}")

Number of columns dropped: 550
Original columns: 1469, New columns: 919


##### Remove all columns whose ID contains 'TEXT' (text-value responses)

In [28]:
# Step 2: Identify and drop columns that contain 'TEXT' in their column IDs
text_columns = [col for col in df_new.columns if 'TEXT' in col]
df_new_notext = df_new.drop(columns=text_columns, errors='ignore')

# Calculate and print the result for 'TEXT' columns
original_column_count = df_new.shape[1]
new_column_count = df_new_notext.shape[1]
columns_dropped = original_column_count - new_column_count

# Print the result
print(f"Number of columns dropped: {columns_dropped}")
print(f"Original columns: {original_column_count}, New columns: {new_column_count}")

df_new_notext.head()

Number of columns dropped: 53
Original columns: 919, New columns: 866


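| | artificial_id | Non_SOGI_School | Location | QN1 | QN2 | QN3 | QN4b | QN4c | QN4d | QN4e | ... | CBIDIS | CHTP | CPOUCH | Stratum | Stratum_num | PSU | PSU_num | WT_analysis | QN141R | QN142R |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B2100007 | 2 | 1.0 | 5.0 | 2.0 | 2.0 | NaN | NaN | NaN | NaN | ... | 1.0 | 2.0 | 2.0 | S05 | 5 | P21 | 21 | 4232.149929 | NaN | NaN |
| 1 | B2100018 | 2 | 2.0 | 8.0 | 2.0 | 1.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | S02 | 2 | P09 | 9 | 514.656322 | NaN | NaN |
| 2 | B2100021 | 2 | 1.0 | 5.0 | 1.0 | 2.0 | 1.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | S08 | 8 | P32 | 32 | 244.855983 | NaN | NaN |
| 3 | B2100035 | 2 | 1.0 | 4.0 | 2.0 | 2.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 2.0 | S08 | 8 | P34 | 34 | 775.983192 | NaN | NaN |
| 4 | B2100036 | 2 | 1.0 | 4.0 | 2.0 | 1.0 | NaN | NaN | NaN | NaN | ... | 2.0 | 2.0 | 2.0 | S04 | 4 | P16 | 16 | 353.735565 | 1.0 | 1.0 |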


##### Check for and display all columns with missing values
Select `View as a scrollable element` to exit the truncated view and see all columns. Displaying every column with its missing-value count shows which gaps come from skip logic, where a previous answer disqualified the respondent from a question or the question did not apply to them: those columns have a very high number of missing values. Smaller counts are most often due to edit errors or to questions that were not answered or not displayed (it is still unclear what "not displayed" means).

In [37]:
pd.set_option('display.max_rows', None)
df_new_notext.isnull().sum()

artificial_id          0
Non_SOGI_School        0
Location              69
QN1                   90
QN2                  152
QN3                   76
QN4b               18205
QN4c               21564
QN4d               21883
QN4e               19790
QN5a               19631
QN5b               19176
QN5c               17828
QN5d               21377
QN5e                9636
QN6                  184
QN7                18705
QN8                18742
QN9                18791
QN11a              20404
QN11b              21285
QN11e              21889
QN11f              21890
QN11l              21298
QN12a              21639
QN12b              21890
QN12e              21949
QN12f              22009
QN12h              21841
QN12l              21470
QN14j              21893
QN16               20561
QN17               20565
QN18e_a            22043
QN18e_b            22017
QN18e_c            21958
QN18e_d            22034
QN18e_e            22023
QN18e_f            22043
QN18e_g            22039
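
To separate the two kinds of missingness described above without reading the full list by eye, one option is to bucket columns by missing rate. This is a rough sketch on invented toy values; the 0.5 cutoff is an assumption, not a value taken from the survey documentation.

```python
import pandas as pd

# Toy data: QN1 is rarely missing, QN7 is mostly missing (skip logic).
toy = pd.DataFrame({
    "QN1": [5.0, 8.0, None, 4.0],
    "QN7": [None, None, 3.0, None],
})

# Fraction of missing values per column.
missing_rate = toy.isnull().mean()

threshold = 0.5  # assumed cutoff between "skipped" and "not answered"
likely_skip = missing_rate[missing_rate > threshold].index.tolist()
likely_nonresponse = missing_rate[
    (missing_rate > 0) & (missing_rate <= threshold)
].index.tolist()

print("Likely skip-logic columns:", likely_skip)
print("Likely nonresponse columns:", likely_nonresponse)
```

On the real data, the same two lists could be built from `df_new_notext`, and columns near the cutoff would still deserve a manual look.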


##### Convert all columns to numerical format

In [None]:
# Check if any columns contain numeric-like data stored as strings (do not run yet! WIP) 
for column in df.columns:
    # Ensure the column is of object (string-like) type
    if df[column].dtype == 'object':
        # Now safely apply the str accessor
        if df[column].str.isnumeric().any():
            print(f"Column {column} contains numeric-like data but is stored as a string.")
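
Once numeric-like string columns have been identified, the conversion itself might look like the sketch below. It runs on invented toy values and converts a column only when no non-missing value would be lost, so identifier-style columns such as `Stratum` are left alone. That rule is an assumption about what is wanted here, not something already decided in this notebook.

```python
import pandas as pd

# Toy data: QN1 holds numbers stored as strings, Stratum is truly non-numeric.
toy = pd.DataFrame({
    "QN1": ["5", "8", None],
    "Stratum": ["S05", "S02", "S08"],
})

for column in toy.columns:
    # Skip columns that are already numeric.
    if pd.api.types.is_numeric_dtype(toy[column]):
        continue
    # Values that cannot be parsed become NaN.
    converted = pd.to_numeric(toy[column], errors="coerce")
    # Convert only if every non-missing value survived the conversion.
    if converted.notna().sum() == toy[column].notna().sum():
        toy[column] = converted

print(toy.dtypes)
```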

# To Do:
- Make appropriate transformations in map_annotated before handling missing values.
  1. Merge columns together into new columns and drop all the old ones that were previously separate. Similar columns become one umbrella column. Ex. "Respondent used e-cigarettes due to exposure from friends, media, or family" encompasses multiple columns and reduces dimensions.
  2. Use one-hot encoding to split each categorical column into a separate column per category. Ex. Ages 0-13, 14-18, 19+ become their own columns.
  3. Consider combining attributes together that aren't necessarily similar but may be correlated.
- Then handle columns with a high missing rate by replacing the missing values with 0. The reasoning is as follows:

Take QN4, e.g. "QN4B: Are you Hispanic, Latino, Latina, or of Spanish origin? (Yes, Mexican, Mexican American, Chicano, or Chicana)". A missing response indicates that the respondent does not belong to the group the question asks about.
In such cases, you would not impute a median or mean, because the only possible values are 1 (yes) or no response.
You would also not drop the column just because over 50% of participants left it empty; that only means that 50% or more of participants are not in the group the question asks about.
 
Another example is "QN7: How old were you when you first used an e-cigarette, even once or twice?". This column also has a very high number of missing values, because the previous question asks whether the respondent has ever used an e-cigarette, and everyone who answered no skips QN7. Here, 0 represents the absence of the behavior, i.e. never having used an e-cigarette.
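
A minimal sketch of this zero-fill step, on invented toy values: only the columns judged to be skip-logic or "select if applies" columns are filled, and ordinary nonresponse is left as missing. The list `zero_fill_cols` is a hypothetical hand-picked selection, not a list defined elsewhere in this project.

```python
import pandas as pd

toy = pd.DataFrame({
    "QN4b": [1.0, None, None],  # 1 = yes, blank = not selected
    "QN7": [None, 3.0, None],   # blank = question skipped
    "QN1": [5.0, None, 4.0],    # ordinary nonresponse, left untouched
})

# Assumed selection of columns where a blank means "absence".
zero_fill_cols = ["QN4b", "QN7"]
toy[zero_fill_cols] = toy[zero_fill_cols].fillna(0)

print(toy)
```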

Assuming the first model we use is multiple linear regression, we would ideally want all values to be binary (0/1). One-hot encoding, merging, and replacing missing values with 0, after strategically dropping irrelevant columns, all prepare the data for that.
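
As an illustration of the one-hot encoding step, here is a short sketch using `pd.get_dummies` on an invented age-group column. The column name `age_group` and the `age` prefix are placeholders; the real categorical columns are the ones marked in `map_annotated.py`.

```python
import pandas as pd

# Toy categorical column using the age brackets mentioned in the notes.
toy = pd.DataFrame({"age_group": ["0-13", "14-18", "19+", "14-18"]})

# One 0/1 column per category.
encoded = pd.get_dummies(toy, columns=["age_group"], prefix="age", dtype=int)

print(encoded)
```

Passing `drop_first=True` to `pd.get_dummies` drops one category per encoded column, which is one way to avoid the dummy variable trap noted earlier.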
