# Identifying Predicting Factors of Tobacco Use in the Youth
(Exploratory Data Analysis and Data Preprocessing)

# Environment Setup (Do this before running any code cell)
1. While in VSCode, open the Command Palette with `cmd+shift+p`.
2. Select `Python: Create Environment` -> `Venv`. This creates a Python venv to install all your Python packages in. After creating it, VSCode will automatically activate it.
3. Run the command in **Install Packages** below to automatically install all required packages from the `requirements.txt` file.

### Install Packages
Install all the required packages directly from the requirements.txt file

In [None]:
# Run this to install required packages
%pip install -r ../requirements.txt

### Global Imports
Import everything you need

In [2]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

import sys
from pathlib import Path
sys.path.append(str(Path().resolve().parent))
from helpers.drop_list import dropped


# Load the dataset
df = pd.read_csv('../data/nyts2023.csv')
tobacco_user_df = pd.read_csv('../data/tobacco_users.csv')
nonuser_df = pd.read_csv('../data/nonusers.csv')



### Update requirements.txt
If you install any new packages, run this to update the requirements.txt file.

In [None]:
%pip freeze > ../requirements.txt

# Selected Target Variable
The target label for our model is a binary classification: tobacco user or non-user. This label is based on Q100: "During the past 30 days, on how many days did you use any tobacco product(s)?". Respondents with a response value of 1 or greater are labeled as tobacco users. Respondents who either skipped Q100 or reported a value of 0 are labeled as non-users.

In [None]:
# Q100 may hold non-numeric skip codes or missing values; coerce those to NaN
q100 = pd.to_numeric(df['Q100'], errors='coerce')
# NaN compares as False, so skipped or missing responses count as non-users
is_user = q100 >= 1

# Keep only rows where Q100 has a numeric value of 1 or greater, indicating tobacco use
tobacco_user_df = df[is_user]

# Count the number of respondents who use Tobacco
num_respondents = len(tobacco_user_df)
print(f"Number of respondents who use Tobacco: {num_respondents}")

# Create a DataFrame for non-users by negating the condition for tobacco use
nonuser_df = df[~is_user]

# Count the number of respondents who do not use Tobacco
num_respondents_non = len(nonuser_df)
print(f"Number of respondents who do not use Tobacco: {num_respondents_non}")

# Export the filtered data to new CSV files
# Write to the same data folder the files are loaded from in Global Imports
tobacco_user_df.to_csv('../data/tobacco_users.csv', index=False)
nonuser_df.to_csv('../data/nonusers.csv', index=False)
print("Export Success.")

# Data Visualization

In [None]:
#comparison between tobacco users and non-users
user_counts = [len(tobacco_user_df), len(nonuser_df)]
labels = ['Tobacco Users', 'Non-Users']

plt.figure(figsize=(8, 5))
plt.bar(labels, user_counts, color=['steelblue', 'lightcoral'])
plt.title('Number of Respondents - Tobacco Users vs Non-Users')
plt.xlabel('Group')
plt.ylabel('Count')
plt.show()

In [None]:
age_mapping = {
    1: '9 years old',
    2: '10 years old',
    3: '11 years old',
    4: '12 years old',
    5: '13 years old',
    6: '14 years old',
    7: '15 years old',
    8: '16 years old',
    9: '17 years old',
    10: '18 years old',
    11: '19 years old or older'
}

#showing actual ages instead of index
df['age_labeled'] = df['QN1'].map(age_mapping)

#plotting the proportion of respondents by age
df['age_labeled'].value_counts().plot(kind='pie', autopct='%1.1f%%', figsize=(8, 8))
plt.title('Proportion of Respondents by Age')
plt.ylabel('')
plt.show()


# Exploratory Data Analysis
For this phase I did not do much visualization. Instead, I did a manual deep dive through the questions and took the notes below. I created two files, `map.py` and `map_annotated.py`: the first contains the original mapping of questions to column IDs, and the second is an annotated version noting which columns should be merged, one-hot encoded, or removed.

**Notes:**
- Many multiple-choice questions are already split into separate columns and do not need to be one-hot encoded. Remove any columns that would cause the dummy variable trap.
- Questions noted as Categorical need to be one-hot encoded. For example, for QN1 the categories are 0-13, 14-18, and 19+; each category needs to be its own column with a value of 0 or 1 depending on whether the respondent belongs to that group.
- MERGE to NC means merge into a new column; TC = target column.
  - A MERGE means that if a respondent answered any of the listed questions, their value in the new column is 1. If they answered none of them, their value is 0.
- Questions like "Why do you currently use e-cigarettes? (They are available in flavors, such as menthol, mint, candy, fruit, or chocolate)" can be thought of as
  "Respondent used e-cigarettes due to their availability in flavors such as menthol, mint, candy, fruit, or chocolate".
- Skip-logic questions do not seem to be a problem, since the answers are split into separate columns (as if they were already one-hot encoded) and dummy variable traps can be removed.
  All other questions can seemingly be one-hot encoded (split into categories), merged (several of them) into new columns, or removed because they are not relevant to the analysis.
  - The few exceptions among the skip-logic questions are ones that are removed anyway.
- Some questions, like 48-51, are weak predictors of tobacco use on their own. For example, being curious about trying a cigarette (Q48)
  would probably not be significant for predicting whether someone is a tobacco user. Combined with smoking in the household, it could be significant, e.g.
  'Respondent is curious about smoking and is exposed to it within the household'. Just a suggestion.
- In hindsight, I overlooked the importance of respondents potentially using flavored versus non-flavored nicotine products.
- Used Q39, Q100, and Q101 as the target label.
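
The one-hot encoding note above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the project's final encoding: the toy frame, the `age_group` column, the group labels, and the bin edges are all assumptions, chosen to mirror the 0-13, 14-18, 19+ groups using the QN1 codes (1 = 9 years old up to 11 = 19 years old or older).

```python
import pandas as pd

# Toy QN1 codes (1 = 9 years old ... 11 = 19 years old or older); not real survey rows
toy = pd.DataFrame({'QN1': [1, 5, 6, 10, 11, None]})

# Bin the codes into the three age groups from the notes (bin edges are an assumption)
toy['age_group'] = pd.cut(
    toy['QN1'],
    bins=[0, 5, 10, 11],  # intervals (0, 5], (5, 10], (10, 11]
    labels=['age_0_13', 'age_14_18', 'age_19_plus'],
)

# One column per group: 1 if the respondent belongs to it, else 0
age_dummies = pd.get_dummies(toy['age_group']).astype(int)
print(age_dummies)
```

A respondent with a missing QN1 ends up with 0 in every group column, so missing ages would still need to be handled separately.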

After that I decided to do some data cleaning.

# Preliminary Data Preprocessing

##### Remove unnecessary columns (manually determined)

In [None]:
# Drop columns that are not needed, in 'drop_list' and ignore any missing columns
df_new = df.drop(columns=dropped, errors='ignore')

# Calculate the number of columns before and after dropping
original_column_count = df.shape[1]
new_column_count = df_new.shape[1]
columns_dropped = original_column_count - new_column_count

# Print the result
print(f"Number of columns dropped: {columns_dropped}")
print(f"Original columns: {original_column_count}, New columns: {new_column_count}")

##### Remove all columns whose ID contains 'TEXT'.

In [None]:
# Step 2: Identify and drop columns that contain 'TEXT' in their column IDs
text_columns = [col for col in df_new.columns if 'TEXT' in col]
df_new_notext = df_new.drop(columns=text_columns, errors='ignore')

# Calculate and print the result for 'TEXT' columns
original_column_count = df_new.shape[1]
new_column_count = df_new_notext.shape[1]
columns_dropped = original_column_count - new_column_count

# Print the result
print(f"Number of columns dropped: {columns_dropped}")
print(f"Original columns: {original_column_count}, New columns: {new_column_count}")

df_new_notext.head()

##### Check for and display all columns with missing values. 
Select `View as a scrollable element` to exit the truncated view and see all columns. Displaying every column with its missing-value count shows which gaps come from skip logic, where an earlier answer disqualified the respondent or the question did not apply to them: those columns have a very high number of missing values. Smaller counts are most often due to edit errors or to questions that were not answered or not displayed (what does "not displayed" mean?).

In [None]:
pd.set_option('display.max_rows', None)
df_new_notext.isnull().sum()
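
To make the skip-logic columns easier to spot, the counts can be sorted and shown next to the share of rows they represent. The sketch below uses a made-up frame standing in for `df_new_notext`; the values and the `summary` layout are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_new_notext; values are made up
toy = pd.DataFrame({
    'QN6': [1, 2, 1, 2, np.nan, 1],                      # one unanswered row
    'QN7': [np.nan, 3, np.nan, np.nan, np.nan, np.nan],  # mostly skipped via skip logic
    'QN1': [4, 5, 6, 7, 8, 9],                           # complete
})

missing = toy.isna().sum()
summary = pd.DataFrame({
    'missing_count': missing,
    'missing_share': missing / len(toy),
})

# Keep only columns that have gaps, largest first
summary = summary[summary['missing_count'] > 0].sort_values(
    'missing_count', ascending=False
)
print(summary)
```

Columns near the top of this table are the likely skip-logic candidates; columns near the bottom are more likely ordinary non-response.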

##### Convert all columns to numerical format

In [None]:
# Check if any columns contain numeric-like data stored as strings (do not run yet! WIP) 
for column in df.columns:
    # Ensure the column is of object (string-like) type
    if df[column].dtype == 'object':
        # Now safely apply the str accessor
        if df[column].str.isnumeric().any():
            print(f"Column {column} contains numeric-like data but is stored as a string.")

Now convert all string data into numerical data:

In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

for column in df.columns:
    # Ensure the column is of object (string-like) type
    if df[column].dtype == 'object':
        # Check if it contains numeric-like data stored as strings
        if df[column].str.isnumeric().any():
            # Convert numeric-like strings to actual numbers
            df[column] = pd.to_numeric(df[column], errors='coerce')
        else:
            # Convert categorical strings to numerical labels using LabelEncoder
            df[column] = label_encoder.fit_transform(df[column])

print("All string-based categorical columns have been converted to numerical values.")

Verify all string based columns have been converted to numerical values

In [None]:
for column in df.columns:
    # Ensure the column is of object (string-like) type
    if df[column].dtype == 'object':
        # Now safely apply the str accessor
        if df[column].str.isnumeric().any():
            print(f"Column {column} contains numeric-like data but is stored as a string.")

# To Do:
- Make appropriate transformations in map_annotated before handling missing values.
  1. Merge columns into new columns and drop the old, previously separate ones. Similar columns become one umbrella column. Ex. "Respondent used e-cigarettes due to exposure from friends, media, or family" encompasses multiple columns and reduces dimensionality.
  2. Use one-hot encoding to split a categorical column into a separate column for each category. Ex. Ages 0-13, 14-18, 19+ each become their own column.
  3. Consider combining attributes that aren't necessarily similar but may be correlated.
- Then handle columns with a high missing rate by replacing the missing values with 0. The reasoning is as follows:

In the case of QN4, e.g. "QN4B: Are you Hispanic, Latino, Latina, or of Spanish origin? (Yes, Mexican, Mexican American, Chicano, or Chicana)", a missing response indicates that the respondent does not belong to the group the question asks about.
In cases like this, you would not use a median or mean value, because the only possible values are 1 or no response (Yes or No).
You would also not drop the column just because over 50% of participants left it empty. That only means that 50% or more of participants do not belong to the group the question asks about.

Another example is "QN7: How old were you when you first used an e-cigarette, even once or twice?". This column also has a very high number of missing values, because the previous question asks whether the respondent has ever used an e-cigarette, so everyone who answered no skips this one. Here 0 would represent the absence of e-cigarette use.

Assuming the first model we use is multiple linear regression, we would ideally want all values to be binary (0/1). One-hot encoding, merging, and replacing missing values with 0 after strategically dropping irrelevant columns all prepare the data for that.
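
The "missing means No" reasoning for the QN4 columns can be illustrated with a small sketch. The rows below are made up, and treating each column as a tick box (1 = ticked, NaN = left empty) is an assumption about how the data is coded.

```python
import numpy as np
import pandas as pd

# Toy check-all-that-apply columns: 1 = box ticked, NaN = box left empty (made-up rows)
toy = pd.DataFrame({
    'QN4b': [1, np.nan, np.nan, 1],
    'QN4c': [np.nan, np.nan, 1, np.nan],
})

# A missing tick is treated as "No", so NaN becomes 0 and each column is binary
binary = toy.fillna(0).astype(int)
print(binary)
```

The same fill would not be appropriate for a column where 0 is already a meaningful answer, so it is worth checking each column's codes before applying it.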


Merge columns into new columns and drop the old columns:

# NC1:

In [None]:
#before merging
qn11_columns = ['QN11a', 'QN11b', 'QN11f']  
qn12_columns = ['QN12a', 'QN12b', 'QN12f']  
print("Before NC1 Column Creation:")
print(df[qn11_columns + qn12_columns].tail())

In [None]:
def check_influence(row):
    # if there is a value in those two columns it means they are influenced
    if row[qn11_columns + qn12_columns].any(): 
        #we can return 1 to indicate influence
        return "1"
    # and 0 for no influence
    # returning zero here, we can also fill NA values alongside
    else:
        return "0"

# Apply the function to create a new Influence column
df['NC1'] = df.apply(check_influence, axis=1)

In [None]:
#after merging:
print("After NC1 Column Creation:")
print(df[qn11_columns + qn12_columns + ['NC1']].tail())

Then we can drop those columns:

In [None]:
# List of columns used for the first NC1 merge
nc1_columns = ['QN11a', 'QN11b', 'QN11f', 'QN12a', 'QN12b', 'QN12f']

# Drop these columns from the DataFrame
df = df.drop(columns=nc1_columns)
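
The NC1 merge above applies a Python function row by row, and `Series.any()` tests truthiness, so an answered value of 0 would not count as a response. A vectorized alternative is sketched below, assuming "answered" simply means "not missing". The helper name `merge_any_answered` and the toy rows are hypothetical, and the helper returns integers rather than the strings `"1"`/`"0"` used above.

```python
import numpy as np
import pandas as pd

def merge_any_answered(frame, columns):
    """Return 1 where at least one of `columns` is non-missing in a row, else 0."""
    return frame[columns].notna().any(axis=1).astype(int)

# Made-up rows using some of the NC1 source column names
toy = pd.DataFrame({
    'QN11a': [1, np.nan, np.nan],
    'QN11b': [np.nan, np.nan, np.nan],
    'QN12a': [np.nan, 0, np.nan],  # an answered 0 still counts as a response here
})
toy['NC1'] = merge_any_answered(toy, ['QN11a', 'QN11b', 'QN12a'])
print(toy)
```

Whether an answered 0 should count as "influenced" depends on how each source column is coded, so the choice between `notna()` and truthiness is a judgment call per merge.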



# NC2

In [None]:
#before merge
nc2_columns = ['QN12l']

print("Before NC2 Column Creation:")
print(df[nc2_columns].tail())

In [None]:
def check_mental_state(row):
    if row[nc2_columns].any():
        return "1"  # Bad mental state present
    else:
        return "0"  # No bad mental state

# Apply the function to create NC2
df['NC2'] = df.apply(check_mental_state, axis=1)
# After merging, display NC2 column alongside related columns
print("After NC2 Column Creation:")
print(df[nc2_columns + ['NC2']].tail())


In [None]:

columns_to_drop = ['QN12c', 'QN12d', 'QN12e', 'QN12g', 'QN12h', 'QN12i', 'QN12j', 'QN12k', 'QN12l','QN12m', 'QN12n']

# Drop these columns from the DataFrame
df = df.drop(columns=columns_to_drop)

# NC3

In [None]:
#before merging
qn16_columns = ['QN16']  
qn17_columns = ['QN17']  
print("Before NC3 Column Creation:")
print(df[qn16_columns + qn17_columns].tail())

In [None]:

nc3_columns = qn16_columns + qn17_columns

# Define the function to check for nicotine use
def check_nicotine_use(row):
    if row[nc3_columns].any():  # Checks if there's any value in QN16 or QN17
        return "1"
    else:
        return "0"

# Apply the function to create NC3
df['NC3'] = df.apply(check_nicotine_use, axis=1)


In [None]:
print("After NC3 Column Creation:")
print(df[qn16_columns + qn17_columns + ['NC3']].tail())


Now we drop the columns used to create NC3:

In [None]:
df = df.drop(columns=nc3_columns)


# Now merging NC4

In [None]:
# Define columns for NC4, including QN77
nc4_columns = ['QN18e_a', 'QN18e_b', 'QN18e_c', 'QN18e_d', 'QN18e_e', 
               'QN18e_f', 'QN18e_g', 'QN18e_h', 'QN18e_i', 'QN18e_j', 'QN18e_k', 'QN77']

print("Before NC4 Column Creation:")
print(df[nc4_columns].tail())


In [None]:
def check_nicotine_pouch_use(row):
    # If there's a value in any nc4_columns, it indicates nicotine pouch use
    if row[nc4_columns].any():
        return "1"  # Nicotine pouch use present
    else:
        return "0"  # No nicotine pouch use
    

# Apply the function to create NC4
df['NC4'] = df.apply(check_nicotine_pouch_use, axis=1)


print("After NC4 Column Creation:")
print(df[nc4_columns + ['NC4']].tail())



Drop the rest of the columns that were used to create NC4

In [None]:
df = df.drop(columns=nc4_columns)


# Now merging NC5

In [None]:
# Define columns for NC5, including QN83
nc5_columns = ['QN18f_a', 'QN18f_b', 'QN18f_c', 'QN18f_d', 'QN18f_e', 
               'QN18f_f', 'QN18f_g', 'QN18f_h', 'QN18f_i', 'QN18f_j', 'QN18f_k', 'QN83']

print("Before NC5 Column Creation:")
print(df[nc5_columns].tail())


In [None]:
def check_oral_nicotine_use(row):
    # If there's a value in any nc5_columns, it indicates oral nicotine use
    if row[nc5_columns].any():
        return "1"  # Oral nicotine use present
    else:
        return "0"  # No oral nicotine use


# Apply the function to create NC5
df['NC5'] = df.apply(check_oral_nicotine_use, axis=1)

print("After NC5 Column Creation:")
print(df[nc5_columns + ['NC5']].tail())


Drop the columns that made NC5

In [None]:
df = df.drop(columns=nc5_columns)


# Now merging NC6

In [None]:
# Define columns for NC6
nc6_columns = ['QN21a_a', 'QN21a_b', 'QN21a_c', 'QN21a_d', 'QN21a_e', 
               'QN21a_f', 'QN21a_g']

print("Before NC6 Column Creation:")
print(df[nc6_columns].tail())


In [None]:
def check_ecig_access(row):
    # If there's a value in any nc6_columns, it indicates e-cigarette access
    if row[nc6_columns].any():
        return "1"  # E-cigarette access present
    else:
        return "0"  # No e-cigarette access


# Apply the function to create NC6
df['NC6'] = df.apply(check_ecig_access, axis=1)

print("After NC6 Column Creation:")
print(df[nc6_columns + ['NC6']].tail())


Now drop the rest of the columns

In [None]:
df = df.drop(columns=nc6_columns)



# Now merging NC7

In [None]:
# Define the columns for NC7
nc7_columns = ['QN30', 'QN31', 'QN32', 'QN33']

# Print the columns before merging
print("Before NC7 Column Creation:")
print(df[nc7_columns].tail())



In [None]:
# Define the function to check for curiosity about e-cigarettes
def check_curiosity(row):
    # If there's a value in any nc7_columns, it indicates curiosity or openness
    if row[nc7_columns].any():
        return "1"  # Curiosity or openness present
    else:
        return "0"  # No curiosity or openness
    

# Apply the function to create NC7
df['NC7'] = df.apply(check_curiosity, axis=1)

print("After NC7 Column Creation:")
print(df[nc7_columns + ['NC7']].tail())


Drop the columns we used to make NC7

In [None]:
df = df.drop(columns=nc7_columns)


# Now merging NC8

In [None]:
# Define columns for NC8
nc8_columns = ['QN34a', 'QN34b', 'QN34c', 'QN35a', 'QN35b', 'QN35c']

# Print the columns before merging
print("Before NC8 Column Creation:")
print(df[nc8_columns].tail())



In [None]:
# Define function to check for vaping marijuana, CBD, or THC products
def check_vaping_substances(row):
    # If there's a value in any nc8_columns, it indicates vaping of these substances
    if row[nc8_columns].any():
        return "1"  # Vaping these substances present
    else:
        return "0"  # No vaping of these substances
    


# Apply the function to create NC8
df['NC8'] = df.apply(check_vaping_substances, axis=1)
print("After NC8 Column Creation:")
print(df[nc8_columns + ['NC8']].tail())


Now drop the rest of the columns

In [None]:
df = df.drop(columns=nc8_columns)


Now merging columns 'QN135a', 'QN135b', 'QN135c', 'QN135d', and 'QN135e' to create NC9

In [None]:
# Define columns for NC9
nc9_columns = ['QN135a', 'QN135b', 'QN135c', 'QN135d', 'QN135e']

# Print the columns before merging
print("Before NC9 Column Creation:")
print(df[nc9_columns].tail())

In [None]:
# Define function to check if e-cigarette use was witnessed at school
def check_ecig_witnessed(row):
    # If there's a value in any nc9_columns, it indicates e-cigarette use witnessed at school
    if row[nc9_columns].any():
        return "1"  # E-cigarette use witnessed
    else:
        return "0"  # No e-cigarette use witnessed


# Apply the function to create NC9
df['NC9'] = df.apply(check_ecig_witnessed, axis=1)
print("After NC9 Column Creation:")
print(df[nc9_columns + ['NC9']].tail())


Now drop the rest of the columns 

In [None]:
df = df.drop(columns=nc9_columns)


Now merging columns 'QN136b', 'QN136c', 'QN136d', 'QN136e', 'QN136h',
                'QN136i', 'QN136j', 'QN136k', and 'QN136l' to create NC10

In [None]:
# Define columns for NC10
nc10_columns = ['QN136b', 'QN136c', 'QN136d', 'QN136e', 'QN136h', 
                'QN136i', 'QN136j', 'QN136k', 'QN136l']

# Print the columns before merging
print("Before NC10 Column Creation:")
print(df[nc10_columns].tail())

In [None]:
# Define function to check if respondent lives with someone who uses tobacco products
def check_tobacco_use_at_home(row):
    # If there's a value in any nc10_columns, it indicates tobacco use at home
    if row[nc10_columns].any():
        return "1"  # Tobacco use present in household
    else:
        return "0"  # No tobacco use in household


# Apply the function to create NC10
df['NC10'] = df.apply(check_tobacco_use_at_home, axis=1)
print("After NC10 Column Creation:")
print(df['NC10'].tail())


Now drop the rest of the columns

In [None]:
df = df.drop(columns=nc10_columns)


Now merging columns 'QN137a', 'QN137b', 'QN137c', 'QN137d', and 'QN137e' to create NC11

In [None]:
# Define columns for NC11
nc11_columns = ['QN137a', 'QN137b', 'QN137c', 'QN137d', 'QN137e']

# Print the columns before merging
print("Before NC11 Column Creation:")
print(df[nc11_columns].tail())


In [None]:
# Define function to check for experiences of psychological distress or discrimination at school
def check_school_distress(row):
    # If there's a value in any nc11_columns, it indicates distress or discrimination experience
    if row[nc11_columns].any():
        return "1"  # Psychological distress or discrimination present
    else:
        return "0"  # No psychological distress or discrimination

# Apply the function to create NC11
df['NC11'] = df.apply(check_school_distress, axis=1)
print("After NC11 Column Creation:")
print(df['NC11'].tail())


Now drop the columns that were used to create NC11

In [None]:
df = df.drop(columns=nc11_columns)


Now merging columns 'QN137h', 'QN137i', 'QN137j', 'QN137k', 'QN137l', 'QN137m', and 'QN137n' to create NC12

In [None]:
# Define columns for NC12
nc12_columns = ['QN137h', 'QN137i', 'QN137j', 'QN137k', 'QN137l', 'QN137m', 'QN137n']

# Print the columns before merging
print("Before NC12 Column Creation:")
print(df[nc12_columns].tail())

In [None]:
# Define function to check for general experiences of discrimination
def check_discrimination(row):
    # If there's a value in any nc12_columns, it indicates discrimination experience
    if row[nc12_columns].any():
        return "1"  # Discrimination experience present
    else:
        return "0"  # No discrimination experience

# Apply the function to create NC12
df['NC12'] = df.apply(check_discrimination, axis=1)
print("After NC12 Column Creation:")
print(df[nc12_columns + ['NC12']].tail())


Now drop the columns that were used to create NC12

In [None]:
df = df.drop(columns=nc12_columns)


Displaying all new columns:

In [None]:
nc_columns = ['NC1', 'NC2', 'NC3', 'NC4', 'NC5', 'NC6', 'NC7', 'NC8', 'NC9', 'NC10', 'NC11', 'NC12']
print(df[nc_columns].head())

Keep only QN or NC columns

In [None]:
# DataFrame with columns that start with 'QN' or contain 'NC'
df_qn_nc = df.filter(regex='(^QN|NC)')
print(list(df_qn_nc.columns))
# save into df
df = df_qn_nc
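
One detail of the pattern above is worth checking: `'(^QN|NC)'` anchors only the `QN` alternative, so any column whose name contains `NC` anywhere is kept. The sketch below contrasts it with a stricter pattern on made-up column names; `ENCODED_FLAG` is a hypothetical column used only to show the difference, and the stricter regex is a suggestion rather than what the notebook uses.

```python
import pandas as pd

# Empty toy frame; only the column names matter here (ENCODED_FLAG is hypothetical)
toy = pd.DataFrame(columns=['QN1', 'NC1', 'PSU', 'ENCODED_FLAG', 'artificial_id'])

# Pattern from the cell above: 'NC' may match anywhere in the name
loose = toy.filter(regex='(^QN|NC)').columns.tolist()

# Stricter variant: the name must start with QN or NC followed by a digit
strict = toy.filter(regex=r'^(QN|NC)\d').columns.tolist()

print(loose)
print(strict)
```

If no other column names in the dataset contain `NC`, both patterns select the same columns.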


Drop all columns in the `dropped` list (from `helpers.drop_list`)

In [None]:
# Drop columns that are not needed, in 'drop_list' and ignore any missing columns
df_new = df.drop(columns=dropped, errors='ignore')

# Calculate the number of columns before and after dropping
original_column_count = df.shape[1]
new_column_count = df_new.shape[1]
columns_dropped = original_column_count - new_column_count

# Print the result
print(f"Number of columns dropped: {columns_dropped}")
print(f"Original columns: {original_column_count}, New columns: {new_column_count}")
df = df_new

In [None]:
print(list(df.columns))

Now handle the rest of the values that are NA:

In [5]:
# Display NA counts in a DataFrame format
na_counts_df = df.isna().sum().reset_index()
na_counts_df.columns = ['Column', 'NA_Count']
print(na_counts_df)



               Column  NA_Count
0       artificial_id         0
1     Non_SOGI_School         0
2            Location        69
3                 QN1        90
4                 QN2       152
...               ...       ...
1464              PSU         0
1465          PSU_num         0
1466      WT_analysis         0
1467           QN141R      4200
1468           QN142R      3991

[1469 rows x 2 columns]


Now we handle missing values.
In our dataset, some values are naturally missing because of question branching (skip logic). Columns affected by branching have far more missing values than columns where respondents simply did not answer.
Our approach is to replace the naturally missing values with -1 and use that as an indicator when we train the model. For questions the respondent did not answer, we fill with 0, indicating an actual missing value.

In [None]:
threshold = 2993
fill_value = -1
high_na_columns = df.columns[df.isna().sum() > threshold]
df[high_na_columns] = df[high_na_columns].fillna(fill_value)

print(df.isna().sum())
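
The cutoff above is an absolute row count, which ties it to the size of this particular dataset. A variant that expresses the same idea as a share of rows is sketched below on made-up data. The name `max_na_share` and the 0.5 value are assumptions for illustration, not the notebook's setting.

```python
import numpy as np
import pandas as pd

# Made-up data: 'QN7' is mostly missing (skip logic), 'QN6' has one true non-response
toy = pd.DataFrame({
    'QN6': [1, 2, np.nan, 1, 2],
    'QN7': [np.nan, np.nan, 3, np.nan, np.nan],
})

max_na_share = 0.5  # hypothetical cutoff: more than half of the rows missing
fill_value = -1

na_share = toy.isna().mean()  # fraction of missing rows per column
high_na_columns = na_share[na_share > max_na_share].index
toy[high_na_columns] = toy[high_na_columns].fillna(fill_value)
print(toy)
```

Only `QN7` crosses the cutoff here, so its gaps become -1 while the single gap in `QN6` is left for a later imputation step.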


In [6]:
df['QN6'].isna().sum()

184

TODO:
Go through the remaining columns with NA values and fill them with the mean, median, or mode.

In [7]:
columns_to_fill_mode = ['QN1', 'QN2', 'QN3', 'QN117', 'QN119', 'QN120', 
                        'QN127', 'QN128', 'QN131', 'QN132', 'QN140', 
                        'QN148', 'QN149']

# Fill each column with its mode
for col in columns_to_fill_mode:
    df[col] = df[col].fillna(df[col].mode()[0])
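
Two edge cases of mode imputation are worth keeping in mind: a tie between values, and a column with no observed values at all (where `mode()` returns an empty result and `[0]` would fail). The sketch below shows one way to guard against both on made-up data; the toy values are assumptions, and skipping all-missing columns is one possible choice among several.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'QN1': [5, 5, 6, np.nan],
    'QN2': [1, 2, np.nan, np.nan],            # tie: mode() returns both values, sorted
    'QN3': [np.nan, np.nan, np.nan, np.nan],  # nothing to impute from
})

for col in toy.columns:
    modes = toy[col].mode()  # NaN is ignored by default
    if modes.empty:
        continue             # leave all-missing columns untouched
    toy[col] = toy[col].fillna(modes.iloc[0])

print(toy)
```

On a tie, the smallest of the tied values is used because `mode()` returns its results in sorted order.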

You can check which columns still have NA values here:

In [4]:
df.columns[df.isna().sum() > 0]

Index(['Location', 'QN1', 'QN2', 'QN3', 'QN4a', 'QN4b', 'QN4c', 'QN4d', 'QN4e',
       'QN5a',
       ...
       'CHOOKAH', 'CROLLCIGTS', 'CPIPE', 'CSNUS', 'CORAL', 'CBIDIS', 'CHTP',
       'CPOUCH', 'QN141R', 'QN142R'],
      dtype='object', length=1248)

# Model Training

**Random Forest Model**

Since our dataset is based on a questionnaire, the most straightforward approach is to use a decision tree.
However, our dataset contains responses from over 1k respondents and over 150 questions, so a single decision tree is no longer a good fit for a dataset of this size.

Hence, our approach is to use a random forest model.
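
One way to check this choice empirically is to compare a single tree with a forest under cross-validation. The sketch below does that on synthetic data from scikit-learn's `make_classification`, not on the NYTS data; the sample size, feature counts, and `n_estimators=50` are arbitrary assumptions, so the scores say nothing about how either model performs on the survey.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a wide survey table: 500 rows, 150 features (not NYTS data)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=150, n_informative=10, random_state=42
)

models = {
    'decision tree': DecisionTreeClassifier(random_state=42),
    'random forest': RandomForestClassifier(n_estimators=50, random_state=42),
}

# Mean 5-fold cross-validated accuracy for each model
cv_scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=5).mean()
    for name, model in models.items()
}
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Running the same comparison on the cleaned survey features would show whether the forest actually helps for this dataset.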

In [3]:
print(df.isna().sum())

artificial_id         0
Non_SOGI_School       0
Location             69
QN1                  90
QN2                 152
                   ... 
PSU                   0
PSU_num               0
WT_analysis           0
QN141R             4200
QN142R             3991
Length: 1469, dtype: int64


In [None]:
# Create indicator columns for -1 values (skipped responses)
for col in df.columns:
    if (df[col] == -1).any():  # Column contains at least one -1 fill value
        df[col + '_skipped'] = (df[col] == -1).astype(int)  # Create indicator column

# Binary target as defined above: 1 if QN100 is 1 or greater, otherwise 0
y = (pd.to_numeric(df['QN100'], errors='coerce') >= 1).astype(int)
# Drop the target and its skip indicator (if present) so they cannot leak into the features
X = df.drop(columns=['QN100', 'QN100_skipped'], errors='ignore')

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score, classification_report

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")
print(classification_report(y_test, y_pred))

# Optional: Display feature importance
feature_importances = model.feature_importances_
feature_names = X.columns
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
print(importance_df)