This hackathon was done by Haifa Al Ashkar (ID: 202201534).

*Note: upload the dataset and rename it to `Survey.xls`, the file name the loading code expects.*

# Objective
The objective of this project is to predict how many cigarettes a respondent smokes per day, recorded in the survey as four ordered brackets (from "10 or less" to "31 or more"), based on socio-economic, psychological, and behavioral factors. This can help in understanding smoking patterns and targeting interventions for smokers.

# Why Choose These Features?
### Behavioral Traits:
Features like "difficulty refraining from smoking in forbidden places" and "time to first cigarette after waking up" are strong indicators of nicotine dependency.
### Socio-Economic Factors:
Income levels and peer influence are known to significantly affect smoking habits.
### Personality Traits:
Psychological traits such as anxiety and creativity might be linked to smoking behavior.

Together, these features provide a mix of behavioral, economic, and psychological variables for comprehensive prediction.



# Steps Taken to Clean Data

1. **Replace Non-Informative Entries**:
   - Replaced responses like `"I don't know"`, `"I prefer not to say"`, and `"Prefer not to answer"` with `NaN`.

2. **Handle Categorical Columns with Ordinal Encoding**:
   - Encoded Likert-scale responses for personality traits, mapping `"Disagree strongly"` to `1` and `"Agree strongly"` to `7`.
   - Mapped other ordinal columns such as:
     - **Cigarettes per day** (the prediction target): `"10 or less cigarettes/day"` → `1`, `"31 cigarettes/day or more"` → `4`.
     - **Stress levels**: `"Never"` → `1`, `"Constantly"` → `5`.
     - **Exercise frequency**, **education level**, **income ranges**, and similar ordinal responses were encoded with numerical scales.

3. **Binary Encoding**:
   - Converted `Yes/No` responses to `1/0`.
   - Encoded gender as `"Male"` → `1` and `"Female"` → `0`.
   - Encoded sector type as `"Public"` → `1` and `"Private"` → `0`.

4. **Standardize Cigarette Brands**:
   - Standardized brand names like `"Marlboro Gold"`, `"Malboro"`, and `"Malborow"` into `"Marlboro"`.
   - Grouped rare brands under `"Others"`.

5. **One-Hot Encoding**:
   - Applied one-hot encoding to categorical variables such as:
     - **Employment status**.
     - **Favorite cigarette brands**.
     - **Governorates** (regions).
     - **Income sources**.

6. **Handle Missing Values**:
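   - Filled missing values with `0` in the income, income-sufficiency, financial-impact, social-media, and employment columns during cleaning. Remaining gaps in numeric columns are filled with the column median before modeling.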

7. **Remove Irrelevant Columns**:
   - Dropped every column whose name contains `"comment"` (case-insensitive) to avoid free-text data.

8. **Map Frequency-Based Columns**:
   - Mapped ordinal frequency-based responses like:
     - **"How often do you feel stressed?"**: `"Never"` → `1`, `"Constantly"` → `5`.
     - **"On average, how many hours per day do you spend on social media?"**: `"I don't use any social media platforms"` → `0`, `"Less than 1 hour"` → `1`, `"More than 4 hours"` → `5`.

9. **Final Dataset**:
   - The dataset is now clean, numerical, and ready for use in machine learning models.
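
One caveat with the dictionary-based encodings listed above: `Series.map` silently turns any label that is missing from the mapping into `NaN`. The sketch below uses a few made-up responses (not survey data) and a hypothetical `unmapped` check to show how such labels could be spotted before they are imputed.

```python
import pandas as pd

# Same stress scale as in the cleaning steps
stress_mapping = {"Never": 1, "Rarely": 2, "Occasionally": 3, "Frequently": 4, "Constantly": 5}

# Hypothetical responses; "Sometimes" is a label the mapping does not know
responses = pd.Series(["Never", "Constantly", "Sometimes", None])

# Labels absent from the mapping (and missing values) become NaN
encoded = responses.map(stress_mapping)

# List the non-missing labels that the mapping would silently drop
unmapped = sorted(set(responses.dropna()) - set(stress_mapping))

print(encoded.tolist())
print(unmapped)
```

Running a check like this per column before filling missing values makes it easier to tell genuine non-responses apart from spelling variants of valid answers.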


In [22]:
import numpy as np
import pandas as pd

def clean_survey_data(data):
    # Replace non-informative entries with NaN
    non_informative = ["I don't know", "I prefer not to say", "Prefer not to answer"]
    # Reassign instead of using inplace=True so the caller's DataFrame is not modified
    data = data.replace(non_informative, np.nan)


    # Ordinal encoding for Likert-scale personality traits
    likert_scale_mapping = {
        "Disagree strongly": 1,
        "Disagree moderately": 2,
        "Disagree a little": 3,
        "Neither agree nor disagree": 4,
        "Agree a little": 5,
        "Agree moderately": 6,
        "Agree strongly": 7
    }
    likert_scale_columns = [
        "I see myself as someone who is extraverted, enthusiastic:",
        "I see myself as someone who is critical, quarrelsome:",
        "I see myself as someone who is dependable, self-disciplined:",
        "I see myself as someone who is anxious, easily upset:",
        "I see myself as someone who is open to new experiences:",
        "I see myself as someone who is reserved, quiet:",
        "I see myself as someone who is sympathetic, warm:",
        "I see myself as someone who is disorganized, careless:",
        "I see myself as someone who is calm, emotionally stable:",
        "I see myself as someone who is conventional, uncreative:"
    ]
    for col in likert_scale_columns:
        if col in data.columns:
            data[col] = data[col].map(likert_scale_mapping)

    gender_mapping = {"Male": 1, "Female": 0}
    data["Gender:"] = data["Gender:"].map(gender_mapping)

    # data.columns = data.columns.str.strip().str.replace(r'[^\w\s\(\)\?\,\.0-9\']', '', regex=True)
    # Replace all "Yes" with 1 and "No" with 0 across the entire dataset
    data = data.replace({"Yes": 1, "No": 0})


    # Earlier per-column Yes/No encoding, kept for reference (superseded by the global replace)
    # yes_no_cols = [
    #     "Have you smoked at least one full tobacco cigarette (excluding e-cigarettes) once or more in the past 30 days?",
    #     "Do you find it difficult to refrain from smoking where it is forbidden (church, library, cinema, plane, etc...)?",
    #     "Do you smoke more frequently during the first hours after waking up than during the rest of the day?Â",
    #     "Do you smoke if you are so ill that you are in bed most of the day?",
    #     "Are you currently able to afford your favorite or preferred cigarette brand(s)?",
    #     "Do you have close friends?",

    # ]
    # for col in yes_no_cols:
    #     if col in data.columns:
    #         data[col] = data[col].map({"Yes": 1, "No": 0})

    sector_col =["Sector"]
    for col in sector_col:
        if col in data.columns:
            data[col] = data[col].map({"Public": 1, "Private": 0})

    # Ordinal encode frequency-based columns
    stress_mapping = {"Never": 1, "Rarely": 2, "Occasionally": 3, "Frequently": 4, "Constantly": 5}
    exercise_mapping = {"Never": 1, "Rarely": 2, "Often or at least 3 days every week": 3, "Every day or at least 5 times every week": 4}
    social_media_mapping = {
        "Less than 1 hour": 1,
        "Between 1 and 2 hours": 2,
        "Between 2 and 3 hours": 3,
        "More than 4 hours": 4
    }

    if "How often do you feel stressed?" in data.columns:
        data["How often do you feel stressed?"] = data["How often do you feel stressed?"].map(stress_mapping)
    if "How often do you exercise?" in data.columns:
        data["How often do you exercise?"] = data["How often do you exercise?"].map(exercise_mapping)
    if "On average, how many hours per day do you spend on social media for entertainment and social interaction (Facebook, Instagram, YouTube, etc...)" in data.columns:
        data["On average, how many hours per day do you spend on social media for entertainment and social interaction (Facebook, Instagram, YouTube, etc...)"] = data["On average, how many hours per day do you spend on social media for entertainment and social interaction (Facebook, Instagram, YouTube, etc...)"].map(social_media_mapping)

    # One-hot encode employment status
    employment_column = "What is your current employment status?"
    if employment_column in data.columns:
        data = pd.get_dummies(data, columns=[employment_column], drop_first=True)

    # Define the mapping for ordinal encoding
    cigarettes_mapping = {
        "10 or less cigarettes/day": 1,
        "11 to 20 cigarettes": 2,
        "21 to 30 cigarettes": 3,
        "31 cigarettes/day or more": 4
    }

    # Apply the mapping to the column
    data["How many cigarettes do you smoke each day?"] = data["How many cigarettes do you smoke each day?"].map(cigarettes_mapping)

    # Define the mapping for ordinal encoding
    wake_up_cigarette_mapping = {
        "Within 5 minutes": 1,
        "6 to 30 minutes": 2,
        "31 to 60 minutes": 3,
        "After 60 minutes": 4
    }

    # Apply the mapping to the column
    data["How soon after you wake up do you smoke your first cigarette?"] = data["How soon after you wake up do you smoke your first cigarette?"].map(wake_up_cigarette_mapping)

    # Define the mapping for binary encoding
    cigarette_preference_mapping = {
        "The first one in the morning": 1,
        "All others": 0
    }

    # Apply the mapping to the column
    data["Which cigarette would you mostly hate to give up?"] = data["Which cigarette would you mostly hate to give up?"].map(cigarette_preference_mapping)

    # Define the mapping for ordinal encoding
    smoking_behavior_mapping = {
        "The number of cigarettes I smoke per day has decreased": 1,
        "The number of cigarettes I smoke per day has remained the same": 2,
        "The number of cigarettes I smoke per day has increased": 3
    }

    # Apply the mapping to the column
    data["How would you describe your current smoking behavior compared to your smoking behavior before Lebanon's economic crisis and revolution began in 2019?"] = data["How would you describe your current smoking behavior compared to your smoking behavior before Lebanon's economic crisis and revolution began in 2019?"].map(smoking_behavior_mapping)

    # Standardize brand names with map_brand, which is defined at module level below
    data["What is your favorite or preferred cigarette brand(s) if you were able to access it?"] = data["What is your favorite or preferred cigarette brand(s) if you were able to access it?"].apply(map_brand)

    # One-hot encode the standardized brands
    data = pd.get_dummies(data, columns=["What is your favorite or preferred cigarette brand(s) if you were able to access it?"], prefix="Brand")

    data["What cigarette brand(s) are you currently using?"] = data["What cigarette brand(s) are you currently using?"].apply(map_brand)

    # One-hot encode the standardized brands
    data = pd.get_dummies(data, columns=["What cigarette brand(s) are you currently using?"], prefix="Current_Brand")

    # Define the mapping for ordinal encoding
    switch_mapping = {
        "No, I am currently using my favorite or preferred cigarette brand(s)": 0,
        "Yes, I am currently using a cheaper alternative": 1,
        "Yes, I am currently using a more expensive alternative": 2
    }

    # Apply the mapping to the column
    data["Has 2019's revolution or economic crisis caused you to switch away from your favorite or preferred cigarette brand(s) to an  alternative?"] = data["Has 2019's revolution or economic crisis caused you to switch away from your favorite or preferred cigarette brand(s) to an  alternative?"].map(switch_mapping)

    # One-hot encode the governorates column
    data = pd.get_dummies(data, columns=["Which governerate do you live in or spend most of your time in?"], prefix="Governorate")

    # Define the mapping for ordinal encoding
    education_mapping = {
        "Less than high school": 1,
        "High school degree or equivalent (e.g. GED)": 2,
        "Incomplete bachelor's degree": 3,
        "Bachelor's degree (BA/BS)": 4,
        "Incomplete graduate degree": 5,
        "Graduate degree (MA/MS)": 6,
        "Post-graduate degree (PhD, MD, or other)": 7
    }

    # Apply the mapping to the column
    data["What is the highest level of education you have attained?"] = data["What is the highest level of education you have attained?"].map(education_mapping)


    # Define the mapping for ordinal encoding
    marital_status_mapping = {
        "Single": 1,
        "Engaged": 2,
        "Married": 3,
        "Divorced/Separated": 4,
        "Other, please specify": 5  # Treating "Other" as a separate category
    }

    # Apply the mapping to the column
    data["What is your current marital status?"] = data["What is your current marital status?"].map(marital_status_mapping)


    # One-hot encode the main source of income
    data = pd.get_dummies(data, columns=["What is your main source of income?"], prefix="Income_Source")

    # One-hot encode the income/financial support column
    data = pd.get_dummies(data, columns=["What type of income or financial support does your household receive?"], prefix="Income_Type")

    # Define the mapping for ordinal encoding
    income_mapping = {
        "Less than 1 million L.L": 1,
        "Between 1 and 4 million L.L": 2,
        "Between 4 and 8 million L.L": 3,
        "Between 8 and 12 million L.L": 4,
        "Between 12 and 16 million L.L": 5,
        "Between 16 and 20 million L.L": 6,
        "More than 20 million L.L": 7
    }

    # Apply the mapping to the column
    data["If you receive payment in Lebanese Lira, what is your current estimated monthly household income? (If income is in US Dollars, then refer to the current black market exchange)."] = data["If you receive payment in Lebanese Lira, what is your current estimated monthly household income? (If income is in US Dollars, then refer to the current black market exchange)."].map(income_mapping)

    # Handle missing values by filling with 0 or another strategy (e.g., mean, median)
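    data["If you receive payment in Lebanese Lira, what is your current estimated monthly household income? (If income is in US Dollars, then refer to the current black market exchange)."] = data["If you receive payment in Lebanese Lira, what is your current estimated monthly household income? (If income is in US Dollars, then refer to the current black market exchange)."].fillna(0)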

    # Define the mapping for ordinal encoding
    income_sufficiency_mapping = {
        "Very low income: does not cover basic needs for a month": 1,
        "Low: barely covers basic needs for a month": 2,
        "Medium: covers all basic needs": 3,
        "High: completely covers necessities with a few luxury items": 4,
        "Extremely high: covers a wide range of luxury items": 5
    }

    # Apply the mapping to the column
    data["How would you describe your current income sufficiency?"] = data["How would you describe your current income sufficiency?"].map(income_sufficiency_mapping)

    # Handle missing values (if any exist)
    data["How would you describe your current income sufficiency?"] = data["How would you describe your current income sufficiency?"].fillna(0)  # Replace with 0 or use a strategy like mean/median

    # Define the mapping for ordinal encoding
    financial_impact_mapping = {
        "Not at all": 1,
        "Slightly": 2,
        "Moderately": 3,
        "Very": 4,
        "Extremely": 5
    }

    # Apply the mapping to the column
    data["To what extent were you financially (negatively) affected by the deterioration of the Lebanese economy?"] = data["To what extent were you financially (negatively) affected by the deterioration of the Lebanese economy?"].map(financial_impact_mapping)

    # Handle missing values (if any exist)
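    data["To what extent were you financially (negatively) affected by the deterioration of the Lebanese economy?"] = data["To what extent were you financially (negatively) affected by the deterioration of the Lebanese economy?"].fillna(0)  # Replace with 0 or another strategy like mean/median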

    # Define the mapping for ordinal encoding
    social_media_usage_mapping = {
        "I don't use any social media platforms": 0,
        "Less than 1 hour": 1,
        "Between 1 hour and 2 hours": 2,
        "Between 2 and 3 hours": 3,
        "Between 3 and 4 hours": 4,
        "More than 4 hours": 5
    }

    # Apply the mapping to the column
    data["On average, how many hours per day do you spend on social media for entertainment and social interaction (Facebook, Instagram, YouTube, etc...)?"] = data["On average, how many hours per day do you spend on social media for entertainment and social interaction (Facebook, Instagram, YouTube, etc...)?"].map(social_media_usage_mapping)

    # Handle missing values (if any exist)
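    data["On average, how many hours per day do you spend on social media for entertainment and social interaction (Facebook, Instagram, YouTube, etc...)?"] = data["On average, how many hours per day do you spend on social media for entertainment and social interaction (Facebook, Instagram, YouTube, etc...)?"].fillna(0)  # Replace with 0 or another strategy like mean/median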

    # Define the mapping for binary encoding
    employment_status_mapping = {
        "Employed": 1,
        "Unemployed": 0
    }

    # Apply the mapping to the column
    data["Employment Status"] = data["Employment Status"].map(employment_status_mapping)

    # Handle missing values (if any exist)
    data["Employment Status"] = data["Employment Status"].fillna(0)  # Replace with 0 (assumes missing = "Unemployed") or another strategy

    # Drop all columns that contain "comments" in their names
    data = data.loc[:, ~data.columns.str.contains("comment", case=False)]


    return data

brand_mapping = {
        "Marlboro": "Marlboro",
        "Marlboro Gold": "Marlboro",
        "Malboro": "Marlboro",
        "Malborow": "Marlboro",
        "Marlboro Red": "Marlboro",
        "Davidoff": "Davidoff",
        "Kent": "Kent",
        "Camel": "Camel",
        "Cedars": "Cedars",
        "Heets": "Heets",
        "Winston": "Winston",
        "IQOS": "IQOS",
        # Combine similar rare brands into 'Others'
        "Other": "Others",
        "Cohiba": "Others",
        "Golden Virginia": "Others",
        "Parliament": "Others",
        "Gitanes": "Others",
        "Rolling Tobacco": "Others",
    }


# Function to standardize brand names
def map_brand(value):
    # Lowercase and trim whitespace so matching is case-insensitive;
    # missing values become the string "nan" and fall through to "Others"
    value = str(value).lower().strip()
    for key, standardized in brand_mapping.items():
        if key.lower() in value:
            return standardized
    return "Others"  # Default to 'Others' for unknown brands


# Load the dataset
file_path = 'Survey.xls'  # Update the path to your file
data = pd.read_excel(file_path)

# Clean the dataset
cleaned_data = clean_survey_data(data)

# Save or inspect the cleaned data
cleaned_data.to_csv('cleaned_survey_data.csv', index=False)
print("Data cleaning completed. Cleaned data saved as 'cleaned_survey_data.csv'.")


Data cleaning completed. Cleaned data saved as 'cleaned_survey_data.csv'.


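
The `map_brand` helper in the cleaning cell matches by substring and returns the first dictionary key that fits, which has a few consequences worth knowing. The sketch below copies the helper with a shortened, illustrative subset of `brand_mapping` and runs it on made-up inputs (not survey responses).

```python
# Shortened subset of the notebook's brand_mapping, for illustration only
brand_mapping = {
    "Marlboro": "Marlboro",
    "Malboro": "Marlboro",
    "Kent": "Kent",
    "Other": "Others",
}

def map_brand(value):
    # Lowercase and trim so matching is case-insensitive
    value = str(value).lower().strip()
    # Return the standardized name of the first key found inside the answer
    for key, standardized in brand_mapping.items():
        if key.lower() in value:
            return standardized
    return "Others"

print(map_brand("  Marlboro Gold "))    # spelling variant with extra spaces
print(map_brand("malboro red"))         # misspelling caught by the "Malboro" key
print(map_brand(float("nan")))          # missing answer, stringified to "nan"
print(map_brand("Kent and Marlboro"))   # multi-brand answer
```

Under these assumptions, a missing answer ends up in `"Others"` rather than staying missing, and a multi-brand answer is assigned to whichever matching key appears first in the dictionary (here `"Marlboro"`), regardless of the order in which the respondent wrote the brands.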

In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
file_path = 'cleaned_survey_data.csv'  # Replace with the actual path
data = pd.read_csv(file_path)

# Step 1: Drop Irrelevant or Redundant Columns
columns_to_drop = [
    'Unnamed: 0',  # Index column
    'Last page',  # Metadata
    'What is your current marital status? [Comment]',  # Redundant
    'What is your main source of income? [Comment]',  # Sparse free-text data
    'What type of income or financial support does your household receive? [Comment]',  # Sparse free-text data
    'Employment Status'  # Original column, if one-hot encoded columns are present
]
data = data.drop(columns=columns_to_drop, errors='ignore')

# Step 2: Handle Missing Values
# Fill missing values for numerical features with the median.
# Note: the encoded target column is numeric too, so any missing targets are also median-filled here.
for col in data.select_dtypes(include=['float64', 'int64']).columns:
    data[col] = data[col].fillna(data[col].median())

# Step 3: Prepare Target and Features
target_column = 'How many cigarettes do you smoke each day?'  # Target variable
X = data.drop(columns=[target_column], errors='ignore')  # Features
y = data[target_column]  # Target

# Step 4: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Output the preprocessed dataset shapes
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

# Optional: Display the remaining columns in X to verify
print("Features in the dataset after preprocessing:")
print(X.columns.tolist())


X_train shape: (169, 76)
X_test shape: (43, 76)
y_train shape: (169,)
y_test shape: (43,)
Features in the dataset after preprocessing:
['Sector', 'Have you smoked at least one full tobacco cigarette (excluding e-cigarettes) once or more in the past 30 days?', 'I see myself as someone who is extraverted, enthusiastic:', 'I see myself as someone who is critical, quarrelsome:', 'I see myself as someone who is dependable, self-disciplined:', 'I see myself as someone who is anxious, easily upset:', 'I see myself as someone who is open to new experiences:', 'I see myself as someone who is reserved, quiet:', 'I see myself as someone who is sympathetic, warm:', 'I see myself as someone who is disorganized, careless:', 'I see myself as someone who is calm, emotionally stable:', 'I see myself as someone who is conventional, uncreative:', 'Do you find it difficult to refrain from smoking where it is forbidden (church, library, cinema, plane, etc...)?', 'Do you smoke more frequently during the fir

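# Model Used and Why
### Random Forest Regressor:

#### Why?
- Works directly on the mix of ordinal, binary, and one-hot encoded features produced by the cleaning step, with no feature scaling required.
- Relatively robust to outliers. Missing values are imputed with the median before training, so the model is never fitted on `NaN`.
- Automatically captures non-linear relationships and feature interactions.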


In [24]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Step 1: Train the Random Forest Regressor
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)

# Step 2: Make Predictions
y_pred = rf_model.predict(X_test)

# Step 3: Evaluate the Model
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Step 4: Feature Importance
feature_importances = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Print Performance Metrics
print("Mean Absolute Error (MAE):", mae)
print("Root Mean Squared Error (RMSE):", rmse)

# Display Feature Importance
print("Top 10 Important Features:")
print(feature_importances.head(10))


Mean Absolute Error (MAE): 0.4813953488372093
Root Mean Squared Error (RMSE): 0.6270788682269114
Top 10 Important Features:
                                              Feature  Importance
12  Do you find it difficult to refrain from smoki...    0.259658
27  If you receive payment in Lebanese Lira, what ...    0.202230
15  How soon after you wake up do you smoke your f...    0.108875
11  I see myself as someone who is conventional, u...    0.042947
5   I see myself as someone who is anxious, easily...    0.027044
26  Of the five closest friends or acquaintances t...    0.025731
13  Do you smoke more frequently during the first ...    0.023241
17  How old were you the first time you smoked a f...    0.023197
29  Including yourself, how many people currently ...    0.022417
22                                   How old are you?    0.017080
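
With only 43 test rows, a single train-test split gives a fairly noisy error estimate. As a hedged sketch (not part of the original analysis), k-fold cross-validation could be used to see how much the MAE varies across splits. The data below is synthetic stand-in data with the same number of rows as the survey; `X_demo`, `y_demo`, and the choice of 5 folds and 50 trees are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 212 rows, 6 ordinal-like features (NOT the survey)
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 5, size=(212, 6)).astype(float)

# Synthetic target on a 1-4 scale, loosely driven by the first feature
noise = rng.normal(0, 0.5, size=212)
y_demo = np.clip(np.round(1 + 0.5 * X_demo[:, 0] + noise), 1, 4)

model = RandomForestRegressor(n_estimators=50, random_state=42)

# scikit-learn reports errors as negative scores, so flip the sign
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error")
cv_mae = -scores

print("MAE per fold:", cv_mae)
print("Mean MAE:", cv_mae.mean(), "Std:", cv_mae.std())
```

Replacing `X_demo` and `y_demo` with the notebook's `X` and `y` would give a per-fold MAE for the real data, and the spread across folds indicates how much weight a single-split figure can bear.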


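# Results Analysis
## Performance Metrics:

- **Mean Absolute Error (MAE)**: 0.48
- **Root Mean Squared Error (RMSE)**: 0.63

Both metrics are in units of the encoded target (the 1 to 4 cigarettes-per-day bracket), not individual cigarettes. On average, predictions are off by about half a bracket on the 43-row test set. No baseline was computed, so these figures suggest a reasonable fit for this dataset rather than a benchmarked result.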
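
One way to put the MAE in context is to compare it with a trivial baseline that always predicts the median training label. The sketch below shows the calculation on made-up encoded labels (not survey data); `y_train_demo`, `y_test_demo`, and the helper `mae` are hypothetical names introduced for illustration.

```python
from statistics import median

def mae(y_true, y_pred):
    # Mean absolute error between two equal-length sequences
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical encoded labels on the 1-4 bracket scale; NOT taken from the survey
y_train_demo = [1, 1, 2, 2, 2, 3, 1, 2, 4, 2]
y_test_demo = [1, 2, 2, 3, 4]

# Baseline: predict the training median for every test row
baseline_value = median(y_train_demo)
baseline_pred = [baseline_value] * len(y_test_demo)
baseline_mae = mae(y_test_demo, baseline_pred)

print("Baseline prediction:", baseline_value)
print("Baseline MAE:", baseline_mae)
```

Applied to the notebook's `y_train` and `y_test`, the same calculation would show how much of the 0.48 MAE is an improvement over always guessing the most typical bracket. scikit-learn's `DummyRegressor(strategy="median")` performs the equivalent computation.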

### Top Predictors:
- "Difficulty refraining from smoking in forbidden places" (26.0% importance).
- "Monthly household income in Lebanese Lira" (20.2% importance).
- "Time to first cigarette after waking up" (10.9% importance).

Two of the three strongest predictors are dependence-related behaviors, and household income ranks second. Personality traits and the remaining socio-economic features each contribute less than 5%.
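
Random forest importances sum to 1 across all features, so they can be read as shares. The short calculation below takes the ten values printed in the feature-importance output and adds them up; the variable names are illustrative.

```python
# Importance values copied from the "Top 10 Important Features" output
top10 = [
    0.259658, 0.202230, 0.108875, 0.042947, 0.027044,
    0.025731, 0.023241, 0.023197, 0.022417, 0.017080,
]

top3_share = sum(top10[:3])
top10_share = sum(top10)

print(f"Top 3 features:  {top3_share:.1%}")
print(f"Top 10 features: {top10_share:.1%}")
```

The three leading features account for roughly 57% of total importance and the top ten for roughly 75%, leaving about a quarter spread across the remaining 66 features.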
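### Insights:

- In this model, cigarette consumption is most strongly associated with dependence-related behaviors and household income. Feature importance reflects predictive value, not causation.
- Personality traits such as anxiety ("anxious, easily upset") and conventionality ("conventional, uncreative") also contribute, though to a much smaller degree.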