<div style="text-align:center; border-radius:15px; padding:15px; color:white; margin:0; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden; margin-bottom: 1em;">  <div style="font-size:150%; color:#FEE100"><b>Exploring Health, Behavior and Socioeconomic Data</b></div>  <div>This notebook was created with the help of <a href="https://devra.ai/ref/kaggle" style="color:#6666FF">Devra AI</a></div></div>

This notebook explores how lifestyle details, socioeconomic status, and national-level mental health data relate to one another. We clean the data, remove outliers, relate two seemingly disparate datasets by age group, assemble a dashboard of visualizations, and build a model that predicts smoking status. If you find this notebook useful, please consider upvoting it.

## Table of Contents
- [Data Loading](#Data-Loading)
- [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
- [Data Merging and Relation](#Data-Merging-and-Relation)
- [Exploratory Data Analysis and Visualizations](#Exploratory-Data-Analysis-and-Visualizations)
- [Dashboard](#Dashboard)
- [Machine Learning Model](#Machine-Learning-Model)
- [Summary and Conclusions](#Summary-and-Conclusions)

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Import necessary libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
# Render figures inline in the notebook (the non-interactive Agg backend would not display them)
%matplotlib inline

import seaborn as sns
sns.set(style='whitegrid')

# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# For reproducibility
np.random.seed(42)

## Data Loading

In [None]:
# Load the datasets

try:
    df1 = pd.read_csv('1.csv', delimiter=',', encoding='ascii')
    print('File 1 loaded successfully.')
except Exception as e:
    print('Error loading 1.csv:', e)
    raise  # Stop here: later cells cannot run without df1

try:
    master_df = pd.read_csv('master.csv', delimiter=',', encoding='ascii')
    print('Master file loaded successfully.')
except Exception as e:
    print('Error loading master.csv:', e)
    raise  # Stop here: later cells cannot run without master_df

# Optional: Display the first few rows (not included in output)
# df1.head(), master_df.head()

## Data Cleaning and Preprocessing

In [None]:
# For file 1 (df1), we remove outliers using the IQR method for numeric columns
numeric_cols_df1 = ['Age', 'Number of Children', 'Income']

for col in numeric_cols_df1:
    if col in df1.columns:
        Q1 = df1[col].quantile(0.25)
        Q3 = df1[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Only keep data within the bounds
        df1 = df1[(df1[col] >= lower_bound) & (df1[col] <= upper_bound)]
        
# For master_df, remove outliers for selected numeric columns
numeric_cols_master = ['suicides_no', 'population', 'suicides/100k pop', 'HDI for year', 'gdp_per_capita ($)']

for col in numeric_cols_master:
    if col in master_df.columns:
        Q1 = master_df[col].quantile(0.25)
        Q3 = master_df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        master_df = master_df[(master_df[col] >= lower_bound) & (master_df[col] <= upper_bound)]

# Note: The IQR rule flags extreme values that could distort the analysis, but on skewed columns
# it can discard a large share of rows, so check the row counts before and after this cell.
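
The loops in this cell filter one column at a time, so each column's quartiles are computed on the rows that survived the previous filters. A minimal sketch of a variant follows, assuming we would rather compute every fence on the unfiltered frame and apply one combined mask. The helper `iqr_mask` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def iqr_mask(frame, columns, k=1.5):
    """Return a boolean Series that is True for rows inside the IQR fences of every listed column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        if col not in frame.columns:
            continue  # Mirror the notebook: silently skip columns that are absent
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        # between() is inclusive on both ends, like the >= / <= comparisons in the cell above
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return mask

# Toy data with one implausible age; the values are invented for illustration
toy = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50, 120],
    'Income': [30000, 32000, 35000, 38000, 40000, 42000, 41000],
})
keep = iqr_mask(toy, ['Age', 'Income', 'Number of Children'])
cleaned = toy[keep]
print(f'Dropped {len(toy) - len(cleaned)} of {len(toy)} rows')
```

Printing the number of dropped rows makes it easy to notice when the rule removes more data than intended.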

## Data Merging and Relation

In [None]:
# While the two datasets come from different levels of aggregation, we can relate them by age groups.
# We create an age group column in df1 that aligns with the age categories in master_df.

# Define bins for age. It is assumed that master_df age categories follow similar ranges.
bins = [0, 14, 24, 34, 54, 74, 150]
labels = ['5-14', '15-24', '25-34', '35-54', '55-74', '75+']

# Create new column 'age_group' in df1
df1['age_group'] = pd.cut(df1['Age'], bins=bins, labels=labels, right=True)

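# In master_df, the 'age' column is already categorical. To be consistent, we rename it to 'age_group'.
# If its values carry a suffix such as ' years' (an assumption about master.csv), strip it so they match the labels above;
# otherwise the inner merge below would find no common keys.
if 'age' in master_df.columns:
    master_df.rename(columns={'age': 'age_group'}, inplace=True)
    master_df['age_group'] = master_df['age_group'].str.replace(' years', '', regex=False).str.strip()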

# Aggregation: Calculate average Income from df1 by age_group and average suicides/100k pop from master_df by age_group
df1_agg = df1.groupby('age_group')['Income'].mean().reset_index()
master_agg = master_df.groupby('age_group')['suicides/100k pop'].mean().reset_index()

# Merge the aggregated data on the common 'age_group'
merged_df = pd.merge(df1_agg, master_agg, on='age_group', how='inner')

# This merged_df allows us to investigate potential relationships between average income and suicide rates by age group.
# Note: Data merging of different granularities should be approached with caution and may require additional justification in a real analysis.
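
The binning and merge can be sanity-checked on toy data before trusting the result. The sketch below assumes the national table spells its categories like `'15-24 years'`, which is a guess about `master.csv`. The frames `people` and `national` and all their values are invented stand-ins for `df1` and `master_df`.

```python
import pandas as pd

bins = [0, 14, 24, 34, 54, 74, 150]
labels = ['5-14', '15-24', '25-34', '35-54', '55-74', '75+']

# Toy individual-level rows standing in for df1 (values are made up)
people = pd.DataFrame({'Age': [14, 15, 24, 25, 80],
                       'Income': [0, 10000, 20000, 30000, 15000]})
# right=True makes each bin include its upper edge, so 14 -> '5-14' and 24 -> '15-24'
people['age_group'] = pd.cut(people['Age'], bins=bins, labels=labels, right=True).astype(str)

# Toy country-level rows standing in for master_df; the ' years' suffix is an assumption
national = pd.DataFrame({'age_group': ['5-14 years', '15-24 years', '25-34 years', '35-54 years'],
                         'suicides/100k pop': [0.5, 9.0, 12.0, 15.0]})
national['age_group'] = national['age_group'].str.replace(' years', '', regex=False).str.strip()

income_by_group = people.groupby('age_group', as_index=False)['Income'].mean()
rate_by_group = national.groupby('age_group', as_index=False)['suicides/100k pop'].mean()

# how='outer' with indicator=True shows which groups exist on only one side
check = pd.merge(income_by_group, rate_by_group, on='age_group', how='outer', indicator=True)
print(check[['age_group', '_merge']])
```

Any row marked `left_only` or `right_only` is an age group that an inner merge would silently drop.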

## Exploratory Data Analysis and Visualizations

In [None]:
# Visualization 1: Heatmap for correlation among numeric variables in df1 (if 4 or more present)
numeric_df1 = df1.select_dtypes(include=[np.number])
if numeric_df1.shape[1] >= 4:
    plt.figure(figsize=(10, 8))
    corr = numeric_df1.corr()
    sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation Heatmap for File 1 Numeric Data')
    plt.tight_layout()
    plt.show()

# Visualization 2: Pair Plot for numeric columns in df1
sns.pairplot(numeric_df1)
plt.suptitle('Pair Plot for File 1 Numeric Data', y=1.02)
plt.show()

# Visualization 3: Histogram for Age distribution in df1
plt.figure(figsize=(8, 6))
sns.histplot(df1['Age'], kde=True, color='blue')
plt.title('Age Distribution in File 1')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

# Visualization 4: Bar Plot for average Income by Age Group from merged_df
plt.figure(figsize=(8, 6))
sns.barplot(x='age_group', y='Income', data=merged_df, palette='viridis')
plt.title('Average Income by Age Group (File 1)')
plt.xlabel('Age Group')
plt.ylabel('Average Income')
plt.tight_layout()
plt.show()

# Visualization 5: Grouped Bar Plot: Compare Average Income (file1) and Suicide Rate (master) by Age Group
merged_melt = pd.melt(merged_df, id_vars='age_group', value_vars=['Income', 'suicides/100k pop'],
                      var_name='Metric', value_name='Value')
plt.figure(figsize=(8, 6))
sns.barplot(x='age_group', y='Value', hue='Metric', data=merged_melt, palette='magma')
plt.title('Comparison of Income and Suicide Rate by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Value')
plt.legend(title='Metric')
plt.tight_layout()
plt.show()
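
Income and suicides per 100k population are on very different scales, so on a shared y-axis the suicide-rate bars can be almost invisible next to the income bars. One possible workaround, sketched here on invented numbers, is to min-max scale each metric to the range 0 to 1 before melting. `toy_merged` is a hypothetical stand-in for `merged_df`.

```python
import pandas as pd

# Toy stand-in for merged_df; the numbers are invented for illustration
toy_merged = pd.DataFrame({
    'age_group': ['15-24', '25-34', '35-54'],
    'Income': [20000.0, 40000.0, 60000.0],
    'suicides/100k pop': [8.0, 12.0, 16.0],
})

metrics = ['Income', 'suicides/100k pop']
scaled = toy_merged.copy()
for col in metrics:
    span = scaled[col].max() - scaled[col].min()
    # Guard against a constant column, which would otherwise divide by zero
    scaled[col] = (scaled[col] - scaled[col].min()) / span if span else 0.0

scaled_melt = pd.melt(scaled, id_vars='age_group', value_vars=metrics,
                      var_name='Metric', value_name='Scaled value')
print(scaled_melt)
```

The resulting `scaled_melt` can be passed to `sns.barplot(x='age_group', y='Scaled value', hue='Metric', data=scaled_melt)`. Scaling hides the original units, so a secondary y-axis created with `Axes.twinx()` is an alternative when the raw values matter.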

## Dashboard

In [None]:
# Create a dashboard with multiple subplots to view key visualizations in one figure
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# Subplot 1: Histogram for Age Distribution
sns.histplot(df1['Age'], kde=True, ax=axes[0, 0], color='skyblue')
axes[0, 0].set_title('Age Distribution')

# Subplot 2: Box Plot for Income in df1
sns.boxplot(x=df1['Income'], ax=axes[0, 1], color='lightgreen')
axes[0, 1].set_title('Income Box Plot')

# Subplot 3: Bar Plot for Average Income by Age Group
sns.barplot(x='age_group', y='Income', data=merged_df, ax=axes[1, 0], palette='pastel')
axes[1, 0].set_title('Average Income by Age Group')

# Subplot 4: Bar Plot for Suicide Rate by Age Group
sns.barplot(x='age_group', y='suicides/100k pop', data=merged_df, ax=axes[1, 1], palette='deep')
axes[1, 1].set_title('Suicide Rate by Age Group')

plt.tight_layout()
plt.show()

## Machine Learning Model

In [None]:
# For our machine learning predictor, we will use the data from file 1 (df1) to predict the 'Smoking Status'.
# The steps include: 
# 1. Preprocessing the data: encoding categorical variables, handling missing values, and dropping irrelevant columns.
# 2. Splitting the data into training and testing sets.
# 3. Building a RandomForestClassifier.
# 4. Evaluating the accuracy and displaying a confusion matrix.

# Select target and features
target = 'Smoking Status'

# Drop columns that are not useful for prediction
drop_cols = ['Name']  # Exclude Name as it is non-informative
features = df1.drop(columns=drop_cols + [target, 'age_group'])

# Identify categorical and numeric features
categorical_cols = features.select_dtypes(include=['object']).columns.tolist()
numeric_cols = features.select_dtypes(include=[np.number]).columns.tolist()

# Create preprocessing pipelines for both numeric and categorical data
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Prepare the complete pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Split the data
X = features
y = df1[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# Train the model
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Evaluate accuracy
acc = accuracy_score(y_test, y_pred)
print('Accuracy of Smoking Status Prediction: {:.2f}%'.format(acc * 100))

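# Compute and display the confusion matrix, labelling rows and columns with the class names
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=clf.classes_, yticklabels=clf.classes_)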
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for Smoking Status Prediction')
plt.tight_layout()
plt.show()
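
Accuracy alone can flatter a classifier when one class dominates the target. A minimal sketch of a sanity check follows: compare the forest against a baseline that always predicts the most frequent class, and print per-class precision and recall. The data here is synthetic. The three classes and their 70/20/10 split are assumptions, not properties of the actual `Smoking Status` column.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 3-class data standing in for the notebook's features and target
X_demo, y_demo = make_classification(n_samples=600, n_features=8, n_informative=4,
                                     n_classes=3, weights=[0.7, 0.2, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=42, stratify=y_demo)

# Baseline that always predicts the most frequent class seen in the training split
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
forest = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
forest_acc = accuracy_score(y_te, forest.predict(X_te))
print(f'Baseline accuracy: {baseline_acc:.2f}')
print(f'Random forest accuracy: {forest_acc:.2f}')
# Per-class precision, recall, and F1 show which classes the model actually separates
print(classification_report(y_te, forest.predict(X_te), zero_division=0))
```

In the notebook itself, the same comparison could be run by fitting a `DummyClassifier` on `X_train` and `y_train` and scoring it on `X_test`. A model that barely beats this baseline has learned little beyond the class frequencies.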

## Summary and Conclusions

In this notebook we analyzed two datasets from the health and socioeconomic domain. We cleaned the data by removing outliers with the IQR method and bridged the different granularities by matching age groups. Heatmaps, pair plots, histograms, and grouped bar plots then let us compare average income and suicide rates across age groups. Because that comparison rests on age-group averages from two unrelated sources, any pattern in it is descriptive and should be read with the caution noted in the merging step.

Furthermore, we built a RandomForestClassifier model to predict Smoking Status from lifestyle and socioeconomic variables. The accuracy and confusion matrix give an initial gauge of how well these variables predict it. The analysis shows how data cleaning, integration, visualization, and machine learning fit together in a single workflow.

Future analysis could include using more advanced feature engineering techniques, exploring alternative modeling approaches, and integrating additional external datasets to enrich the context and predictive power.
