# Data Analysis Project: Social Media Usage and Emotional Well-Being

This notebook analyzes three splits of one dataset (`train.csv`, `val.csv`, and `test.csv`) that record social media usage and each user's dominant emotional state. The dataset, researched and prepared by AI Inventor Emirhan BULUT, explores the relationship between social media usage patterns and emotional well-being.

## Dataset Features

- **User_ID**: Unique identifier for the user.
- **Age**: Age of the user.
- **Gender**: Gender of the user (Female, Male, Non-binary).
- **Platform**: Social media platform used (e.g., Instagram, Twitter, Facebook, LinkedIn, Snapchat, WhatsApp, Telegram).
- **Daily_Usage_Time (minutes)**: Daily time spent on the platform in minutes.
- **Posts_Per_Day**: Number of posts made per day.
- **Likes_Received_Per_Day**: Number of likes received per day.
- **Comments_Received_Per_Day**: Number of comments received per day.
- **Messages_Sent_Per_Day**: Number of messages sent per day.
- **Dominant_Emotion**: User's dominant emotional state during the day (e.g., Happiness, Sadness, Anger, Anxiety, Boredom, Neutral).

## Files

- **train.csv**: Data for training models.
- **test.csv**: Data for testing models.
- **val.csv**: Data for validation purposes.

## Usage

This dataset can be used for various analyses, including but not limited to:

- Predicting users' emotional well-being based on their social media usage.
- Clustering users based on usage patterns.
- Examining the impact of different platforms on users' emotions.


## Dataset Citation

This dataset was meticulously researched and prepared by AI Inventor Emirhan BULUT. If you use this dataset in your work, please cite or reference it as follows:

BULUT, E. "Social Media Usage and Emotional Well-Being Dataset." (https://www.kaggle.com/datasets/emirhanai/social-media-usage-and-emotional-well-being).
 


## Loading the Data

First, we need to load the datasets into pandas DataFrames. This will allow us to manipulate and analyze the data effectively.


In [None]:
import pandas as pd

# Function to load CSV and handle errors
def load_csv_with_error_handling(filepath):
    try:
        return pd.read_csv(filepath)
    except pd.errors.ParserError:
        print(f"Error reading {filepath}. Trying with on_bad_lines='skip'.")
        return pd.read_csv(filepath, on_bad_lines='skip')

# Load the datasets with error handling
train_df = load_csv_with_error_handling('train.csv')
val_df = load_csv_with_error_handling('val.csv')
test_df = load_csv_with_error_handling('test.csv')

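# Drop rows whose Platform field holds '55' or 'SS', which are not platform
# names. The .copy() avoids SettingWithCopyWarning when columns are
# reassigned later during cleaning.
def remove_outliers(df):
    return df[~df['Platform'].isin(['55', 'SS'])].copy()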

# Apply the function to each dataframe
train_df = remove_outliers(train_df)
val_df = remove_outliers(val_df)
test_df = remove_outliers(test_df)

# Display the first few rows of each dataset
display(train_df.head())
display(val_df.head())
display(test_df.head())
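
The filter above hard-codes the two bad values seen so far. As a minimal sketch of an alternative, assuming the platform names listed under "Dataset Features" are the complete set (the source only gives them as examples), we could keep rows with a known platform instead. The names `KNOWN_PLATFORMS` and `keep_known_platforms` are hypothetical, and the toy frame stands in for the real data.

```python
import pandas as pd

# Platform names as listed under "Dataset Features" (assumed to be exhaustive)
KNOWN_PLATFORMS = {
    "Instagram", "Twitter", "Facebook", "LinkedIn",
    "Snapchat", "WhatsApp", "Telegram",
}

def keep_known_platforms(df, known=KNOWN_PLATFORMS):
    """Return a copy of df containing only rows whose Platform is in `known`."""
    mask = df["Platform"].isin(known)
    return df[mask].copy()

# Toy frame with two malformed rows, mimicking the '55' and 'SS' values
toy = pd.DataFrame({
    "User_ID": [1, 2, 3, 4],
    "Platform": ["Instagram", "55", "Twitter", "SS"],
})
cleaned = keep_known_platforms(toy)
print(cleaned["Platform"].tolist())
```

An allow-list also catches malformed values that have not been seen yet, at the cost of dropping any legitimate platform missing from the list.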


## Data Overview

We will now take a look at the basic structure and summary statistics of the datasets. This includes checking the data types, non-null counts, and basic descriptive statistics.


In [None]:
# Display column names
print("Train Dataset Columns:")
print(train_df.columns)
print("Validation Dataset Columns:")
print(val_df.columns)
print("Test Dataset Columns:")
print(test_df.columns)

# Basic information about the datasets
print("Train Dataset Info:")
train_df.info()
print("\nValidation Dataset Info:")
val_df.info()
print("\nTest Dataset Info:")
test_df.info()

# Display basic statistics
print("Train Dataset Statistics:")
display(train_df.describe())
print("Validation Dataset Statistics:")
display(val_df.describe())
print("Test Dataset Statistics:")
display(test_df.describe())


## Data Cleaning

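Next, we coerce the numeric columns to numeric types, fill missing numeric values with the column median and missing categorical values with the column mode, and then verify that no missing values remain. This step is crucial to ensure the quality and accuracy of our analysis and models.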


In [None]:
# Identify numeric columns and convert to numeric, coercing errors to NaN
numeric_cols = ['Age', 'Daily_Usage_Time (minutes)', 'Posts_Per_Day', 'Likes_Received_Per_Day', 'Comments_Received_Per_Day', 'Messages_Sent_Per_Day']

for col in numeric_cols:
    if col in train_df.columns:
        train_df[col] = pd.to_numeric(train_df[col], errors='coerce')
    if col in val_df.columns:
        val_df[col] = pd.to_numeric(val_df[col], errors='coerce')
    if col in test_df.columns:
        test_df[col] = pd.to_numeric(test_df[col], errors='coerce')

# Fill missing values for numeric columns with median
train_df[numeric_cols] = train_df[numeric_cols].fillna(train_df[numeric_cols].median())
val_df[numeric_cols] = val_df[numeric_cols].fillna(val_df[numeric_cols].median())
test_df[numeric_cols] = test_df[numeric_cols].fillna(test_df[numeric_cols].median())

# Fill missing values for categorical columns with mode
categorical_cols = ['User_ID', 'Gender', 'Platform', 'Dominant_Emotion']

train_df[categorical_cols] = train_df[categorical_cols].fillna(train_df[categorical_cols].mode().iloc[0])
val_df[categorical_cols] = val_df[categorical_cols].fillna(val_df[categorical_cols].mode().iloc[0])
test_df[categorical_cols] = test_df[categorical_cols].fillna(test_df[categorical_cols].mode().iloc[0])

# Verify no missing values remain
print("Missing values in Train Dataset after cleaning:")
display(train_df.isnull().sum())
print("Missing values in Validation Dataset after cleaning:")
display(val_df.isnull().sum())
print("Missing values in Test Dataset after cleaning:")
display(test_df.isnull().sum())
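
The cell above fills each split with its own medians. A common alternative, sketched here as an assumption rather than what this notebook does, is to compute the medians on the training split only and reuse them for the other splits, so that no statistics from held-out data enter preprocessing. The helper name `impute_with_train_medians` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_with_train_medians(train, other, cols):
    """Fill NaNs in `cols` of both frames using medians computed on `train` only."""
    medians = train[cols].median()
    train_filled = train.copy()
    other_filled = other.copy()
    train_filled[cols] = train_filled[cols].fillna(medians)
    other_filled[cols] = other_filled[cols].fillna(medians)
    return train_filled, other_filled

# Toy data: the train median of Age is 30.0 (from 20, 30, 40)
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_val = pd.DataFrame({"Age": [np.nan, 50.0]})

toy_train_f, toy_val_f = impute_with_train_medians(toy_train, toy_val, ["Age"])
print(toy_val_f["Age"].tolist())
```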


## Exploratory Data Analysis (EDA)

In this section, we will visualize the data to gain insights into the distribution of different features and the relationships between them. This step helps in understanding the underlying patterns and trends in the data.


In [None]:
import matplotlib.pyplot as plt

# Example visualizations
# Distribution of Daily Usage Time
plt.hist(train_df['Daily_Usage_Time (minutes)'], bins=30)
plt.title('Distribution of Daily Usage Time')
plt.xlabel('Daily Usage Time (minutes)')
plt.ylabel('Frequency')
plt.show()


### Distribution of Daily Usage Time

The histogram above shows the distribution of the daily usage time of social media platforms among users. It helps us understand how much time users typically spend on social media each day.


In [None]:
# Distribution of Dominant Emotions
train_df['Dominant_Emotion'].value_counts().plot(kind='bar')
plt.title('Distribution of Dominant Emotions')
plt.xlabel('Dominant Emotion')
plt.ylabel('Frequency')
plt.show()


### Distribution of Dominant Emotions

The bar chart above illustrates the frequency of different dominant emotional states among users. This visualization gives us an overview of the prevalent emotional states in the dataset.
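
Raw counts can hide how unbalanced the classes are. A minimal sketch of turning the same counts into shares, using a toy series in place of `train_df['Dominant_Emotion']` (the values and the `imbalance_ratio` name are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in for train_df['Dominant_Emotion']
toy_emotions = pd.Series(
    ["Happiness", "Happiness", "Neutral", "Anxiety"], name="Dominant_Emotion"
)

# normalize=True returns each class as a fraction of all rows
shares = toy_emotions.value_counts(normalize=True)
print(shares)

# Ratio of the most common class to the least common one
imbalance_ratio = shares.max() / shares.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio far above 1 would suggest reading accuracy alongside per-class metrics.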


In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate correlation matrix using only numeric columns
numeric_train_df = train_df.select_dtypes(include=[np.number])

# Ensure all columns are numeric and drop any rows with NaN values that might remain
numeric_train_df = numeric_train_df.dropna()

# Calculate the correlation matrix
correlation_matrix = numeric_train_df.corr()

# Plot the correlation matrix with annotations using matplotlib directly
plt.figure(figsize=(12, 10))
heatmap = plt.pcolor(correlation_matrix, cmap='viridis')

plt.colorbar(heatmap)
plt.xticks(np.arange(0.5, len(correlation_matrix.columns), 1), correlation_matrix.columns, rotation=45)
plt.yticks(np.arange(0.5, len(correlation_matrix.index), 1), correlation_matrix.index)

# Adding annotations
for y in range(correlation_matrix.shape[0]):
    for x in range(correlation_matrix.shape[1]):
        plt.text(x + 0.5, y + 0.5, f'{correlation_matrix.iloc[y, x]:.2f}',
                 horizontalalignment='center',
                 verticalalignment='center',
                 color='white')

plt.title('Correlation Matrix')
plt.show()
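
A heatmap is easier to read with a ranked list next to it. The sketch below lists the feature pairs with the largest absolute Pearson correlation. The helper `top_correlated_pairs` is hypothetical, and the toy values merely reuse three of the dataset's column names.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.loc[order].head(n)

# Toy values: likes grow linearly with usage time, posts do not
toy = pd.DataFrame({
    "Daily_Usage_Time (minutes)": [10, 20, 30, 40],
    "Likes_Received_Per_Day": [1, 2, 3, 4],
    "Posts_Per_Day": [4, 1, 3, 2],
})
top = top_correlated_pairs(toy, n=2)
print(top)
```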


## Model Preparation

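Before building models, we need to prepare the data by defining our features (input variables) and the target variable (output). The data already comes split into training, validation, and test files, so we only separate features from the target in each split. `User_ID`, `Gender`, and `Platform` are dropped, leaving only numeric features.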


In [None]:
X_train = train_df.drop(columns=['User_ID', 'Dominant_Emotion', 'Gender', 'Platform'])
y_train = train_df['Dominant_Emotion']
X_val = val_df.drop(columns=['User_ID', 'Dominant_Emotion', 'Gender', 'Platform'])
y_val = val_df['Dominant_Emotion']
X_test = test_df.drop(columns=['User_ID', 'Dominant_Emotion', 'Gender', 'Platform'])
y_test = test_df['Dominant_Emotion']
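
The cell above drops `Gender` and `Platform`. If we wanted to keep them as features, one option (an assumption, not part of this notebook's pipeline) is one-hot encoding with the validation and test columns aligned to the training columns. The helper `one_hot_aligned` and the toy frames are hypothetical.

```python
import pandas as pd

def one_hot_aligned(train, other, cat_cols):
    """One-hot encode cat_cols, forcing `other` to share train's columns."""
    train_enc = pd.get_dummies(train, columns=cat_cols)
    other_enc = pd.get_dummies(other, columns=cat_cols)
    # Categories absent from `other` become all-zero columns; extras are dropped
    other_enc = other_enc.reindex(columns=train_enc.columns, fill_value=0)
    return train_enc, other_enc

toy_train = pd.DataFrame({
    "Age": [21, 35, 28],
    "Platform": ["Instagram", "Twitter", "LinkedIn"],
})
toy_val = pd.DataFrame({"Age": [30], "Platform": ["Twitter"]})

enc_train, enc_val = one_hot_aligned(toy_train, toy_val, ["Platform"])
print(list(enc_val.columns))
```

Aligning on the training columns keeps the feature matrix the same shape for every split, which most estimators require.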

## Modeling with FLAML

We will use FLAML (Fast and Lightweight AutoML) to automatically select the best model and hyperparameters for predicting the dominant emotional state of users based on their social media usage. FLAML provides an efficient and effective way to perform automated machine learning.


In [None]:
from flaml import AutoML

# Define the AutoML model
autoML = AutoML()

# Specify settings
settings = {
    "time_budget": 30,  # Total running time in seconds
    "metric": 'roc_auc_ovo',  # Primary metric
    "task": 'classification',  # Task type
    "log_file_name": "flaml.log",  # Log file name
}

# Train the AutoML model
autoML.fit(X_train=X_train, y_train=y_train, **settings)


In [None]:
print(f'Best AutoML Model: {autoML.best_estimator}\n')
print(f'Best Parameter AutoML Model:\n {autoML.best_config}\n')
print(f'Best roc_auc_ovo On Val data: {1 - autoML.best_loss:.4g}\n')
print(f'Best Run Training duration: {autoML.best_config_train_time:.4g} s\n')
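
The cell above reports `1 - autoML.best_loss`, which suggests the loss minimized for this metric is one minus the AUC. To make the metric itself concrete, here is a small sketch that computes a one-vs-one ROC AUC with scikit-learn on made-up probabilities. The labels and numbers are illustrative assumptions, not model output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Class labels in sorted order; columns of y_proba follow this order
labels = ["Anxiety", "Happiness", "Neutral"]
y_true = np.array(
    ["Anxiety", "Happiness", "Neutral", "Happiness", "Anxiety", "Neutral"]
)

# Toy class probabilities; each row sums to 1
y_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.5, 0.2],
    [0.6, 0.1, 0.3],
    [0.1, 0.3, 0.6],
])

# "ovo" averages the AUC over every pair of classes
auc_ovo = roc_auc_score(y_true, y_proba, multi_class="ovo", labels=labels)
loss = 1 - auc_ovo
print(f"roc_auc_ovo = {auc_ovo:.3f}, loss = {loss:.3f}")
```

Because the metric is computed from predicted probabilities rather than hard labels, it can differ noticeably from accuracy on the same model.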

In [None]:
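# Feature importance of the best estimator found by FLAML
estimator = autoML.model.estimator
# LightGBM exposes feature_name_; scikit-learn style estimators (including
# XGBoost) expose feature_names_in_; otherwise fall back to the training columns
feature_names = getattr(estimator, 'feature_name_', None)
if feature_names is None:
    feature_names = getattr(estimator, 'feature_names_in_', X_train.columns)

plt.figure(figsize=(20, 15), facecolor='yellow')  # Set background color to yellow
plt.barh(list(feature_names), estimator.feature_importances_)
plt.show()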

## Evaluation with FLAML

We will evaluate the best model found by FLAML on the validation set to assess its performance.


In [None]:
from sklearn.metrics import accuracy_score

# Predict on the validation set
y_val_pred = autoML.predict(X_val)

# Calculate accuracy
final_accuracy = accuracy_score(y_val, y_val_pred)

print("Final Accuracy:", final_accuracy)
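
Accuracy alone does not show which emotions are confused with each other. A minimal sketch of a confusion table and per-class recall using pandas only, with toy label series standing in for `y_val` and `y_val_pred`:

```python
import pandas as pd

# Toy stand-ins for y_val and y_val_pred
y_true = pd.Series(
    ["Happiness", "Happiness", "Anxiety", "Neutral", "Anxiety"], name="actual"
)
y_pred = pd.Series(
    ["Happiness", "Neutral", "Anxiety", "Neutral", "Happiness"], name="predicted"
)

# Rows are actual classes, columns are predicted classes
confusion = pd.crosstab(y_true, y_pred)
print(confusion)

# Per-class recall: share of each actual class that was predicted correctly
correct = y_true == y_pred
recall = correct.groupby(y_true).mean()
accuracy = correct.mean()
print(recall)
print(f"Accuracy: {accuracy:.2f}")
```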


## Optimal model according to FLAML

In [None]:
import xgboost as xgb

# Extract the best model and its configuration
best_model = autoML.model
best_config = autoML.best_config

# Recreate the model using the best configuration. Both branches of the
# original if/else built the same object, so a single call is enough.
recreated_model = best_model.estimator_class(**best_config)

# Fit the recreated model
recreated_model.fit(X_train, y_train)

# Evaluate the recreated model
y_val_pred_recreated = recreated_model.predict(X_val)
recreated_model_accuracy = accuracy_score(y_val, y_val_pred_recreated)

print(f"Best AutoML Model: {autoML.best_estimator}")
print(f"Best Configuration: {best_config}")
print(f"Recreated Model Accuracy: {recreated_model_accuracy}")

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import numpy as np

# Predict on the test set
y_test_pred = recreated_model.predict(X_test)

# Mapping emotions to colors
emotion_color_map = {
    'Neutral': 'blue',
    'Anxiety': 'red',
    'Happiness': 'green',
    'Boredom': 'purple',
    'Sadness': 'orange',
    'Anger': 'brown'
}

# Convert predicted labels to colors
y_test_pred_colors = np.array([emotion_color_map[emotion] for emotion in y_test_pred])

# List of features to plot on the y-axis
features_to_plot = ['Posts_Per_Day', 'Likes_Received_Per_Day', 'Comments_Received_Per_Day', 'Messages_Sent_Per_Day']

# Create subplots in a 2x2 grid
fig, axs = plt.subplots(2, 2, figsize=(15, 10), constrained_layout=True)

# Generate scatter plots with Daily Usage Time on the x-axis and other features on the y-axis
for i, feature in enumerate(features_to_plot):
    ax = axs[i // 2, i % 2]
    scatter = ax.scatter(X_test['Daily_Usage_Time (minutes)'], X_test[feature], c=y_test_pred_colors, alpha=0.7)
    ax.set_title(f'Predicted Emotional Well-Being vs {feature}')
    ax.set_xlabel('Daily Usage Time (minutes)')
    ax.set_ylabel(feature)

# Create a custom legend
handles = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=color, markersize=10, label=emotion) for emotion, color in emotion_color_map.items()]
fig.legend(handles=handles, title="Emotional States", loc='center')

plt.show()


### Overall Insights

- Higher daily usage time is generally associated with more social media activity (posts, likes, comments, messages).
- The model predicts positive emotional states, particularly happiness, more often for users who are more active on social media.
- Predicted negative emotional states (such as anxiety and sadness) cluster less distinctly around high activity metrics, which is consistent with higher activity being linked to positive emotional well-being.
- Taken together, the plots suggest that users with higher engagement (likes, comments, and messages) tend to have a better emotional state as predicted by the model. This is an association in the model's predictions, not evidence of causation.


## Preprocessing for Clustering

In [None]:
from sklearn.preprocessing import StandardScaler

# Assuming train_df, val_df, and test_df are already cleaned
# Combine the datasets if needed or use only the training set for clustering
combined_df = pd.concat([train_df, val_df, test_df], ignore_index=True)

# Select features for clustering
features = ['Daily_Usage_Time (minutes)', 'Posts_Per_Day', 'Likes_Received_Per_Day', 'Comments_Received_Per_Day', 'Messages_Sent_Per_Day']
X = combined_df[features]

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
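
For reference, standardization subtracts each column's mean and divides by its population standard deviation (`ddof=0`), which is what `StandardScaler` computes. A small NumPy sketch on toy values (the array is an illustrative assumption):

```python
import numpy as np

# Toy matrix: column 0 like usage minutes, column 1 like posts per day
X_toy = np.array([[30.0, 1.0], [60.0, 2.0], [90.0, 3.0], [120.0, 6.0]])

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)  # population standard deviation (ddof=0)
X_toy_scaled = (X_toy - mean) / std

print(X_toy_scaled.mean(axis=0))  # close to 0 for every column
print(X_toy_scaled.std(axis=0))   # close to 1 for every column
```

Without this step, K-Means distances would be dominated by whichever feature has the largest numeric range.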


## Determining number of clusters

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score

# Elbow Method
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    sse.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), sse, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

# Silhouette Score
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))

plt.figure(figsize=(10, 6))
plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.title('Silhouette Scores for Different Clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()
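
The plots above are read by eye. As a complementary sketch, the number of clusters with the highest silhouette score can also be picked programmatically. The example runs on synthetic blobs rather than the notebook's data, and `candidate_ks` and `best_k` are hypothetical names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: four well-separated blobs (not the notebook's data)
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X_toy, _ = make_blobs(
    n_samples=200, centers=centers, cluster_std=0.5, random_state=0
)

candidate_ks = list(range(2, 8))
scores = []
for k in candidate_ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
    scores.append(silhouette_score(X_toy, labels))

# Pick the k whose clustering has the highest mean silhouette score
best_k = candidate_ks[int(np.argmax(scores))]
print(best_k)
```

On real data the maximum is often less clear-cut, so the elbow plot and domain knowledge remain useful cross-checks.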


## Clustering Data

In [None]:
# Apply K-Means with the optimal number of clusters
optimal_clusters = 4
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# Add cluster labels to the original dataframe
combined_df['Cluster'] = cluster_labels


# Profile each cluster
cluster_profiles = combined_df.groupby('Cluster')[features].mean()
print(cluster_profiles)

# Evaluate cluster quality
silhouette_avg = silhouette_score(X_scaled, cluster_labels)
print(f'Silhouette Score: {silhouette_avg}')


## Visualizing Clusters

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Use PCA to reduce dimensions for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Define colors for clusters
colors = ['blue', 'orange', 'green', 'red']
cluster_labels_unique = np.unique(cluster_labels)

# Plot the clusters with a legend
plt.figure(figsize=(10, 7))
for cluster in cluster_labels_unique:
    plt.scatter(X_pca[cluster_labels == cluster, 0], X_pca[cluster_labels == cluster, 1], 
                label=f'Cluster {cluster} - {colors[cluster]}', color=colors[cluster], alpha=0.6)

plt.title('Clusters of Users Based on Social Media Usage Patterns')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.legend(title="Clusters")
plt.show()
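
A two-component PCA plot shows only part of the variance in the scaled features. A minimal sketch of checking how much is retained, on synthetic data standing in for `X_scaled` (the random matrix is an assumption for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for X_scaled: 5 features, the first two strongly correlated
base = rng.normal(size=(200, 1))
X_toy = np.hstack([
    base,
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 3)),
])

pca = PCA(n_components=2)
pca.fit(X_toy)

# Fraction of total variance captured by each of the two components
ratios = pca.explained_variance_ratio_
retained = ratios.sum()
print(f"Variance retained by 2 components: {retained:.2%}")
```

If the retained share is low, clusters that overlap in the plot may still be well separated in the full feature space.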



In [None]:
# Assuming 'combined_df' is the DataFrame that contains the cluster labels and emotions
emotion_distribution = combined_df.groupby(['Cluster', 'Dominant_Emotion']).size().unstack(fill_value=0)
print(emotion_distribution)

import matplotlib.pyplot as plt

# Plot the emotion distribution for each cluster
fig, axs = plt.subplots(2, 2, figsize=(15, 10), constrained_layout=True)

# Define the clusters and the color used for each cluster
clusters = [0, 1, 2, 3]
colors = ['blue', 'orange', 'green', 'red']

# Create bar plots for each cluster
for i, cluster in enumerate(clusters):
    ax = axs[i // 2, i % 2]
    emotion_counts = emotion_distribution.loc[cluster]
    bars = ax.bar(emotion_counts.index, emotion_counts.values, color=colors[i])
    ax.set_title(f'Emotion Distribution in Cluster {cluster}')
    ax.set_xlabel('Emotion')
    ax.set_ylabel('Count')
    ax.tick_params(axis='x', rotation=45)  # rotate labels without resetting them

    # Label bars with the counts
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')

plt.suptitle('Emotion Distribution Across Clusters', fontsize=16)
plt.show()



### Cluster Characteristics and Emotion Distribution

#### Cluster 0 (Blue):
- **Average Daily Usage Time**: 50 minutes
- **Average Posts Per Day**: 1
- **Average Likes Received Per Day**: 10
- **Average Comments Received Per Day**: 5
- **Average Messages Sent Per Day**: 3

*Analysis*: Users in this cluster have lower daily usage time and social media activity, indicating they are less engaged.

*Emotion Distribution Assumptions*:
- This cluster is predominantly associated with neutral emotions and boredom.
- There is a significant presence of anxiety and sadness, indicating that less engaged users might experience more negative emotions.


#### Cluster 1 (Orange):
- **Average Daily Usage Time**: 90 minutes
- **Average Posts Per Day**: 3
- **Average Likes Received Per Day**: 30
- **Average Comments Received Per Day**: 15
- **Average Messages Sent Per Day**: 10

*Analysis*: Users in this cluster exhibit moderate social media activity and usage patterns, representing average engagement.

*Emotion Distribution Assumptions*:
- This cluster shows a high occurrence of happiness, suggesting that moderately engaged users tend to be happier.
- There are also instances of anger, anxiety, and sadness, but they are less prevalent compared to happiness.


#### Cluster 2 (Green):
- **Average Daily Usage Time**: 130 minutes
- **Average Posts Per Day**: 5
- **Average Likes Received Per Day**: 50
- **Average Comments Received Per Day**: 25
- **Average Messages Sent Per Day**: 20

*Analysis*: This group includes users with relatively high social media usage and engagement, indicating they are highly active.

*Emotion Distribution Assumptions*:
- The predominant emotion in this cluster is happiness, indicating that highly engaged users are mostly happy.
- Negative emotions like anxiety are minimally present.


#### Cluster 3 (Red):
- **Average Daily Usage Time**: 180 minutes
- **Average Posts Per Day**: 8
- **Average Likes Received Per Day**: 80
- **Average Comments Received Per Day**: 35
- **Average Messages Sent Per Day**: 30

*Analysis*: Users in this cluster show the highest levels of social media activity and daily usage time, representing the most engaged users.

*Emotion Distribution Assumptions*:
- This cluster has a diverse emotional distribution with significant occurrences of anger, anxiety, and sadness.
- Despite high engagement, these users do not predominantly experience positive emotions, indicating a possible link between high engagement and negative emotional states.
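
Clusters differ in size, so raw counts are hard to compare across them. A small sketch of converting the cluster-by-emotion counts into within-cluster shares, using a toy frame in place of `combined_df` (the values are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in for combined_df with cluster labels and emotions
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1, 1, 1, 1],
    "Dominant_Emotion": [
        "Neutral", "Neutral", "Anxiety",
        "Happiness", "Happiness", "Happiness", "Anxiety", "Neutral",
    ],
})

counts = toy.groupby(["Cluster", "Dominant_Emotion"]).size().unstack(fill_value=0)

# Divide each row by its total so every cluster sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(2))
```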


## Distribution across platforms

In [None]:
# Calculate the distribution of emotions for each platform
platform_emotion_distribution = combined_df.groupby(['Platform', 'Dominant_Emotion']).size().unstack(fill_value=0)
print(platform_emotion_distribution)

# Define platforms
platforms = platform_emotion_distribution.index

# Create bar plots for each platform in a 4x2 grid layout (4 rows, 2 columns)
fig, axs = plt.subplots(4, 2, figsize=(15, 20), constrained_layout=True)

for i, platform in enumerate(platforms):
    ax = axs[i // 2, i % 2]
    emotion_counts = platform_emotion_distribution.loc[platform]
    bars = ax.bar(emotion_counts.index, emotion_counts.values)
    ax.set_title(f'Emotion Distribution on {platform}')
    ax.set_xlabel('Emotion')
    ax.set_ylabel('Count')
    ax.tick_params(axis='x', rotation=45)  # rotate labels without resetting them

    # Label bars with the counts
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')

plt.suptitle('Emotion Distribution Across Platforms', fontsize=16)
plt.savefig('emotion_distribution_across_platforms.png')
plt.show()



### Examining the Impact of Different Platforms on Users' Emotions

#### Key Takeaways

- **Facebook**: High in neutral and anxiety; moderate boredom and sadness.
- **Instagram**: High happiness; moderate anxiety, neutral, and sadness; low anger.
- **LinkedIn**: High boredom; moderate neutral and anxiety; low sadness and anger.
- **Snapchat**: High anxiety; moderate neutral and happiness.
- **Telegram**: High neutral and sadness; moderate anger and anxiety.
- **Twitter**: High anger and sadness; moderate neutral and anxiety; low happiness.
- **WhatsApp**: High anger; moderate neutral and anxiety; low happiness.
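
Labels such as "high" and "moderate" can be backed by numbers. One way, sketched on a toy frame rather than `combined_df`, is a row-normalized cross-tabulation that also reports the most frequent emotion per platform. The values below are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for combined_df
toy = pd.DataFrame({
    "Platform": [
        "Instagram", "Instagram", "Instagram", "Twitter", "Twitter", "LinkedIn",
    ],
    "Dominant_Emotion": [
        "Happiness", "Happiness", "Anxiety", "Anger", "Anger", "Boredom",
    ],
})

# normalize="index" makes each platform's row sum to 1
shares = pd.crosstab(toy["Platform"], toy["Dominant_Emotion"], normalize="index")

# Most frequent emotion for each platform
top_emotion = shares.idxmax(axis=1)
print(shares.round(2))
print(top_emotion)
```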
