# 1. Problem Definition
We want to predict a user's addiction level on social media from the other features in the dataset.

# 2. Exploring our data

## Data Dictionary


* UserID: Unique identifier assigned to each user.
* Age: The user's age.
* Gender: The user's gender (e.g., male, female, non-binary).
* Location: Geographic location of the user.
* Income: The user's income level.
* Debt: Whether the user has debt (stored as True/False in this dataset).
* Owns Property: Indicates whether the user owns property.
* Profession: The user's occupation or job.
* Demographics: The type of area the user lives in (e.g., Rural, Urban).
* Platform: The social media platform the user is using (e.g., Instagram, Facebook, TikTok, YouTube).
* Total Time Spent: The total time the user spends on the platform.
* Number of Sessions: The number of times the user has logged into the platform.
* Video ID: Unique identifier for a video.
* Video Category: The category or genre of the video.
* Video Length: Duration of the video.
* Engagement: User interaction with the video (e.g., likes, comments, shares).
* Importance Score: A score indicating how important the video is to the user.
* Time Spent On Video: The amount of time the user spends watching a video.
* Number of Videos Watched: The total number of videos watched by the user.
* Scroll Rate: The rate at which the user scrolls through content.
* Frequency: How often the user engages with the platform.
* Productivity Loss: The impact of platform usage on the user's productivity.
* Satisfaction: The user's satisfaction level with the platform or content.
* Watch Reason: The reason why the user is watching a video (e.g., entertainment, education).
* Device Type: The type of device the user is using (e.g., smartphone, tablet, desktop).
* OS: The operating system of the user's device (e.g., iOS, Android, Windows).
* Watch Time: The time of day when the user watches videos.
* Self Control: The user's ability to control their usage of the platform.
* Addiction Level: The user's level of dependency on the platform.
* Current Activity: What the user is doing while watching the video.
* Connection Type: The type of internet connection the user has (e.g., Wi-Fi, cellular).

## Importing necessary tools

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import the models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Data splitting, cross-validation and hyperparameter search
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# Regression metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_squared_log_error

# Classification metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve



## Reading the data

In [2]:
df = pd.read_csv("Time_Wasters_on_Social_Media.csv")
df.head()

Unnamed: 0,UserID,Age,Gender,Location,Income,Debt,Owns Property,Profession,Demographics,Platform,...,ProductivityLoss,Satisfaction,Watch Reason,DeviceType,OS,Watch Time,Self Control,Addiction Level,CurrentActivity,ConnectionType
0,1,56,Male,Pakistan,82812,True,True,Engineer,Rural,Instagram,...,3,7,Procrastination,Smartphone,Android,9:00 PM,5,5,Commuting,Mobile Data
1,2,46,Female,Mexico,27999,False,True,Artist,Urban,Instagram,...,5,5,Habit,Computer,Android,5:00 PM,7,3,At school,Wi-Fi
2,3,32,Female,United States,42436,False,True,Engineer,Rural,Facebook,...,6,4,Entertainment,Tablet,Android,2:00 PM,8,2,At home,Mobile Data
3,4,60,Male,Barzil,62963,True,False,Waiting staff,Rural,YouTube,...,3,7,Habit,Smartphone,Android,9:00 PM,5,5,Commuting,Mobile Data
4,5,25,Male,Pakistan,22096,False,True,Manager,Urban,TikTok,...,8,2,Boredom,Smartphone,iOS,8:00 AM,10,0,At home,Mobile Data


In [3]:
# Checking for missing values
df.isna().sum()

UserID                      0
Age                         0
Gender                      0
Location                    0
Income                      0
Debt                        0
Owns Property               0
Profession                  0
Demographics                0
Platform                    0
Total Time Spent            0
Number of Sessions          0
Video ID                    0
Video Category              0
Video Length                0
Engagement                  0
Importance Score            0
Time Spent On Video         0
Number of Videos Watched    0
Scroll Rate                 0
Frequency                   0
ProductivityLoss            0
Satisfaction                0
Watch Reason                0
DeviceType                  0
OS                          0
Watch Time                  0
Self Control                0
Addiction Level             0
CurrentActivity             0
ConnectionType              0
dtype: int64

In [4]:
df.describe()

Unnamed: 0,UserID,Age,Income,Total Time Spent,Number of Sessions,Video ID,Video Length,Engagement,Importance Score,Time Spent On Video,Number of Videos Watched,Scroll Rate,ProductivityLoss,Satisfaction,Self Control,Addiction Level
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,500.5,40.986,59524.213,151.406,10.013,4891.738,15.214,4997.159,5.129,14.973,25.248,49.774,5.136,4.864,7.094,2.906
std,288.819436,13.497852,23736.212925,83.952637,5.380314,2853.144258,8.224953,2910.053701,2.582834,8.200092,14.029159,29.197798,2.122265,2.122265,2.058495,2.058495
min,1.0,18.0,20138.0,10.0,1.0,11.0,1.0,15.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,0.0
25%,250.75,29.0,38675.25,78.0,6.0,2542.0,8.0,2415.75,3.0,8.0,14.0,23.0,3.0,4.0,5.0,2.0
50%,500.5,42.0,58805.0,152.0,10.0,4720.5,15.0,5016.0,5.0,15.0,25.0,50.0,5.0,5.0,7.0,3.0
75%,750.25,52.0,79792.25,223.0,15.0,7346.0,22.0,7540.25,7.0,22.0,37.0,74.0,6.0,7.0,8.0,5.0
max,1000.0,64.0,99676.0,298.0,19.0,9997.0,29.0,9982.0,9.0,29.0,49.0,99.0,9.0,9.0,10.0,7.0


In [5]:

# Group by 'Platform' and calculate the summary statistics
platform_analysis = (
    df
    .groupby('Platform')
    .agg(
        Avg_Productivity_Loss=('ProductivityLoss', lambda x: round(x.mean(skipna=True), 2)),
        Avg_Time_Spent=('Total Time Spent', lambda x: round(x.mean(skipna=True), 2)),
        Avg_Satisfaction=('Satisfaction', lambda x: round(x.mean(skipna=True), 2))
    )
    .reset_index()  # Convert the grouped DataFrame into a flat DataFrame
)

# The result is now a pandas DataFrame with the summarized data
print(platform_analysis)


    Platform  Avg_Productivity_Loss  Avg_Time_Spent  Avg_Satisfaction
0   Facebook                   5.07          155.18              4.93
1  Instagram                   5.08          146.91              4.92
2     TikTok                   5.14          151.27              4.86
3    YouTube                   5.26          152.82              4.74
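The same per-platform summary can also be written without lambdas, since `mean` skips missing values by default and `round` can be applied once to the whole result. A minimal sketch on a small made-up frame (the `toy` values are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for df; the numbers are invented for illustration only.
toy = pd.DataFrame({
    "Platform": ["Facebook", "Facebook", "TikTok", "TikTok"],
    "ProductivityLoss": [3, 6, 8, 5],
    "Total Time Spent": [100, 200, 150, 151],
    "Satisfaction": [7, 4, 2, 5],
})

summary = (
    toy.groupby("Platform")
    .agg(
        Avg_Productivity_Loss=("ProductivityLoss", "mean"),
        Avg_Time_Spent=("Total Time Spent", "mean"),
        Avg_Satisfaction=("Satisfaction", "mean"),
    )
    .round(2)        # round every aggregated column in one step
    .reset_index()   # turn the Platform index back into a column
)
print(summary)
```

Named aggregation keeps the output column names explicit, which is the same idea the cell above relies on.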


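### Make a copy of the data


In [6]:

# Note: plain assignment does not copy. df_tmp and df are two names for the same
# DataFrame, so every change made through df_tmp below is also visible through df.
# Use df.copy() instead if df must stay untouched.
df_tmp = df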
df_tmp.head()

Unnamed: 0,UserID,Age,Gender,Location,Income,Debt,Owns Property,Profession,Demographics,Platform,...,ProductivityLoss,Satisfaction,Watch Reason,DeviceType,OS,Watch Time,Self Control,Addiction Level,CurrentActivity,ConnectionType
0,1,56,Male,Pakistan,82812,True,True,Engineer,Rural,Instagram,...,3,7,Procrastination,Smartphone,Android,9:00 PM,5,5,Commuting,Mobile Data
1,2,46,Female,Mexico,27999,False,True,Artist,Urban,Instagram,...,5,5,Habit,Computer,Android,5:00 PM,7,3,At school,Wi-Fi
2,3,32,Female,United States,42436,False,True,Engineer,Rural,Facebook,...,6,4,Entertainment,Tablet,Android,2:00 PM,8,2,At home,Mobile Data
3,4,60,Male,Barzil,62963,True,False,Waiting staff,Rural,YouTube,...,3,7,Habit,Smartphone,Android,9:00 PM,5,5,Commuting,Mobile Data
4,5,25,Male,Pakistan,22096,False,True,Manager,Urban,TikTok,...,8,2,Boredom,Smartphone,iOS,8:00 AM,10,0,At home,Mobile Data
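The difference between assigning a DataFrame to a second name and calling `.copy()` can be seen in a small sketch. The frame and the names `original`, `alias` and `independent` are hypothetical and only serve the illustration:

```python
import pandas as pd

original = pd.DataFrame({"Platform": ["TikTok", "YouTube"]})

alias = original               # second name for the same object
independent = original.copy()  # separate object with its own data

# Change one value through the alias
alias.loc[0, "Platform"] = "Facebook"

print(original.loc[0, "Platform"])     # changed, because alias is original
print(independent.loc[0, "Platform"])  # unchanged
```

With plain assignment, later edits made through either name affect both, which matters when the raw data is needed again further down the notebook.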


## Convert Strings to numbers

In [7]:
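# Find the columns which contain strings
# Note: content.astype(str) casts every column to text before the check,
# so the test is true for all columns and every label is printed.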
for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content.astype(str)):
        print(label)

UserID
Age
Gender
Location
Income
Debt
Owns Property
Profession
Demographics
Platform
Total Time Spent
Number of Sessions
Video ID
Video Category
Video Length
Engagement
Importance Score
Time Spent On Video
Number of Videos Watched
Scroll Rate
Frequency
ProductivityLoss
Satisfaction
Watch Reason
DeviceType
OS
Watch Time
Self Control
Addiction Level
CurrentActivity
ConnectionType
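Every column is listed above because the check is applied after casting to `str`. A minimal sketch of the difference, using a hypothetical three-column frame (`toy`) rather than the real data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [56, 46],
    "Debt": [True, False],
    "Platform": ["Instagram", "Facebook"],
})

# Casting to str first makes every column look like a string column.
flagged_after_cast = [
    label for label in toy.columns
    if pd.api.types.is_string_dtype(toy[label].astype(str))
]

# Checking the column as stored flags only the text column.
flagged_as_stored = [
    label for label in toy.columns
    if pd.api.types.is_string_dtype(toy[label])
]

print(flagged_after_cast)
print(flagged_as_stored)
```

Dropping the cast would leave numeric columns such as `Age` untouched by the later category conversion; whether that is wanted depends on how the rest of the notebook uses them.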


In [8]:
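# Turn all of the string values into category values
# Note: because of the astype(str) cast, the check is true for every column,
# so all columns (numeric ones included) become ordered categories.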

for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content.astype(str)):
        df_tmp[label] = content.astype("category").cat.as_ordered()

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 31 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   UserID                    1000 non-null   category
 1   Age                       1000 non-null   category
 2   Gender                    1000 non-null   category
 3   Location                  1000 non-null   category
 4   Income                    1000 non-null   category
 5   Debt                      1000 non-null   category
 6   Owns Property             1000 non-null   category
 7   Profession                1000 non-null   category
 8   Demographics              1000 non-null   category
 9   Platform                  1000 non-null   category
 10  Total Time Spent          1000 non-null   category
 11  Number of Sessions        1000 non-null   category
 12  Video ID                  1000 non-null   category
 13  Video Category            1000 non-null   categor

In [10]:
df_tmp.head()

Unnamed: 0,UserID,Age,Gender,Location,Income,Debt,Owns Property,Profession,Demographics,Platform,...,ProductivityLoss,Satisfaction,Watch Reason,DeviceType,OS,Watch Time,Self Control,Addiction Level,CurrentActivity,ConnectionType
0,1,56,Male,Pakistan,82812,True,True,Engineer,Rural,Instagram,...,3,7,Procrastination,Smartphone,Android,9:00 PM,5,5,Commuting,Mobile Data
1,2,46,Female,Mexico,27999,False,True,Artist,Urban,Instagram,...,5,5,Habit,Computer,Android,5:00 PM,7,3,At school,Wi-Fi
2,3,32,Female,United States,42436,False,True,Engineer,Rural,Facebook,...,6,4,Entertainment,Tablet,Android,2:00 PM,8,2,At home,Mobile Data
3,4,60,Male,Barzil,62963,True,False,Waiting staff,Rural,YouTube,...,3,7,Habit,Smartphone,Android,9:00 PM,5,5,Commuting,Mobile Data
4,5,25,Male,Pakistan,22096,False,True,Manager,Urban,TikTok,...,8,2,Boredom,Smartphone,iOS,8:00 AM,10,0,At home,Mobile Data


In [11]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 31 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   UserID                    1000 non-null   category
 1   Age                       1000 non-null   category
 2   Gender                    1000 non-null   category
 3   Location                  1000 non-null   category
 4   Income                    1000 non-null   category
 5   Debt                      1000 non-null   category
 6   Owns Property             1000 non-null   category
 7   Profession                1000 non-null   category
 8   Demographics              1000 non-null   category
 9   Platform                  1000 non-null   category
 10  Total Time Spent          1000 non-null   category
 11  Number of Sessions        1000 non-null   category
 12  Video ID                  1000 non-null   category
 13  Video Category            1000 non-null   categor

In [12]:
df_tmp["Platform"].cat.codes

0      1
1      1
2      0
3      3
4      2
      ..
995    2
996    0
997    2
998    3
999    3
Length: 1000, dtype: int8
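The codes are positions in the sorted list of categories. A minimal sketch using the five `Platform` values shown by `df.head()` (the variable names are illustrative):

```python
import pandas as pd

# Platform values of the first five rows shown earlier
platform = pd.Series(["Instagram", "Instagram", "Facebook", "YouTube", "TikTok"])
as_cat = platform.astype("category").cat.as_ordered()

# Categories are sorted; each code is the position of the value in that list.
code_to_label = dict(enumerate(as_cat.cat.categories))
codes = as_cat.cat.codes.tolist()

print(code_to_label)
print(codes)
```

Keeping a mapping like `code_to_label` makes it possible to translate encoded values back to readable labels later.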

In [13]:
df_tmp.to_csv("prepocessed data.csv", index=False)

In [14]:
df_tmp.head()

Unnamed: 0,UserID,Age,Gender,Location,Income,Debt,Owns Property,Profession,Demographics,Platform,...,ProductivityLoss,Satisfaction,Watch Reason,DeviceType,OS,Watch Time,Self Control,Addiction Level,CurrentActivity,ConnectionType
0,1,56,Male,Pakistan,82812,True,True,Engineer,Rural,Instagram,...,3,7,Procrastination,Smartphone,Android,9:00 PM,5,5,Commuting,Mobile Data
1,2,46,Female,Mexico,27999,False,True,Artist,Urban,Instagram,...,5,5,Habit,Computer,Android,5:00 PM,7,3,At school,Wi-Fi
2,3,32,Female,United States,42436,False,True,Engineer,Rural,Facebook,...,6,4,Entertainment,Tablet,Android,2:00 PM,8,2,At home,Mobile Data
3,4,60,Male,Barzil,62963,True,False,Waiting staff,Rural,YouTube,...,3,7,Habit,Smartphone,Android,9:00 PM,5,5,Commuting,Mobile Data
4,5,25,Male,Pakistan,22096,False,True,Manager,Urban,TikTok,...,8,2,Boredom,Smartphone,iOS,8:00 AM,10,0,At home,Mobile Data


In [15]:
df_tmp["Addiction Level"].value_counts()

Addiction Level
2    248
5    228
0    180
3    159
1     60
7     55
4     36
6     34
Name: count, dtype: int64

In [16]:
# Making Correlation Matrix a little prettier
# Note: corr() needs numeric data; df_tmp still holds text categories here,
# hence the ValueError shown under this cell.

corr_matrix = df_tmp.corr()
fig, ax = plt.subplots(figsize=(15, 10))
ax = sns.heatmap(corr_matrix,
                 annot=True,
                 linewidths=0.5,
                 fmt=".2f",
                 cmap="YlGnBu")
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)

ValueError: could not convert string to float: 'Mobile Data'
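One way around this error is to encode the non-numeric columns before calling `corr()`. A minimal sketch on three columns from the first five rows shown earlier (`toy` and `encoded` are illustrative names; the integer codes of an unordered text column carry no real order, so correlations involving them should be read with care):

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [56, 46, 32, 60, 25],
    "Income": [82812, 27999, 42436, 62963, 22096],
    "ConnectionType": ["Mobile Data", "Wi-Fi", "Mobile Data",
                       "Mobile Data", "Mobile Data"],
})

# Replace each non-numeric column with integer category codes.
encoded = toy.copy()
for label in encoded.columns:
    if not pd.api.types.is_numeric_dtype(encoded[label]):
        encoded[label] = pd.Categorical(encoded[label]).codes

corr_matrix = encoded.corr()
print(corr_matrix.round(2))
```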

In [None]:
# Find the columns which contain strings
for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content.astype(str)):
        print(label)

In [24]:
# Turn categorical variables into numbers 
for label, content in df_tmp.items():
    if not pd.api.types.is_numeric_dtype(content):
        # Turn categories into numbers and add +1
        df_tmp[label] = pd.Categorical(content).codes+1
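The `+1` matters when a column has missing values: pandas assigns them the code `-1`, and the shift turns that into `0`. A minimal sketch with a hypothetical column containing one missing entry (the dataset itself reported no missing values):

```python
import pandas as pd

col = pd.Series(["Wi-Fi", None, "Mobile Data"])
codes = pd.Categorical(col).codes

raw_codes = codes.tolist()            # missing value gets code -1
shifted_codes = (codes + 1).tolist()  # shifted so that missing becomes 0

print(raw_codes)
print(shifted_codes)
```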

# 3. Modelling 

In [27]:
# Split data into X and y
X = df_tmp.drop("Addiction Level", axis=1)
y = df_tmp["Addiction Level"]

In [None]:
X

In [None]:
y

In [85]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
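The classes of `Addiction Level` are unevenly sized, so a stratified split may be worth considering for the classification models. A minimal sketch, assuming the class counts from the `value_counts()` output earlier and a placeholder feature (`X_demo`, `y_demo` and `test_counts` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class counts copied from the value_counts() output earlier in the notebook.
counts = {2: 248, 5: 228, 0: 180, 3: 159, 1: 60, 7: 55, 4: 36, 6: 34}
y_demo = np.repeat(list(counts.keys()), list(counts.values()))
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# stratify=y_demo keeps the class proportions similar in both parts.
_, _, _, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

labels, sizes = np.unique(y_test_demo, return_counts=True)
test_counts = {int(k): int(v) for k, v in zip(labels, sizes)}
print(test_counts)
```

Without `stratify`, the rarest classes can end up over- or under-represented in the test set purely by chance.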

## Fitting the models and scoring them

### Defining a Stacking Classifier

In [40]:

from sklearn.svm import SVC

# Define the base estimators
estimators = [
    ('dt', DecisionTreeClassifier()),
    ('svc', SVC(probability=True))
]

# Initialize the StackingClassifier with the estimators
stacking_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5
)

# stacking_clf is added to the models dictionary in the next cell

### Experimenting with Classification models

In [42]:
## Put models in a dictionary

models = {"Logistic Regression": LogisticRegression(),
          "Linear Regression": LinearRegression(),  # a regressor: .score() returns R^2, not accuracy
          "Random Forest": RandomForestClassifier(),
          "DecisionTreeClassifier": DecisionTreeClassifier(),
          "StackingClassifier": stacking_clf,
          "svc": SVC(probability=True),
          "KNeighborsClassifier": KNeighborsClassifier()}

# Create a function to fit and score models

def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data (no labels)
    X_test : testing data (no labels)
    y_train : training  labels
    y_test : testing  labels
    """
    # set random seed
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores= {}
    #Loop through models
    for name, model in models.items():
        #fit the model to the data
        model.fit(X_train,y_train)
        # Evaluate the model and append its score to model_scores
        model_scores[name] = model.score(X_test,y_test)
    return model_scores


In [None]:


# Ensure data is in the correct format
X_train = np.array(X_train) if not isinstance(X_train, np.ndarray) else X_train
X_test = np.array(X_test) if not isinstance(X_test, np.ndarray) else X_test
y_train = np.array(y_train).ravel() if not isinstance(y_train, np.ndarray) else y_train.ravel()
y_test = np.array(y_test).ravel() if not isinstance(y_test, np.ndarray) else y_test.ravel()

# Verify the shapes
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

# Now call the fit_and_score function
model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)

In [None]:
model_scores = fit_and_score(models,
                             X_train=X_train,
                             X_test=X_test,
                             y_train=y_train,
                             y_test=y_test)

model_scores

In [None]:
model_compare = pd.DataFrame(model_scores, index=["accuracy"])
model_compare.T.plot.bar()
plt.title("Comparison of model performances");
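The bars in this chart are not all on the same scale: `.score()` returns accuracy for classifiers and R^2 for regressors such as `LinearRegression`. A minimal sketch on synthetic data that checks both facts (the data and variable names are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)  # synthetic binary target

clf = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
reg = LinearRegression().fit(X_demo, y_demo)

# Classifier.score is accuracy; regressor.score is R^2.
clf_matches = np.isclose(clf.score(X_demo, y_demo),
                         accuracy_score(y_demo, clf.predict(X_demo)))
reg_matches = np.isclose(reg.score(X_demo, y_demo),
                         r2_score(y_demo, reg.predict(X_demo)))
print(clf_matches, reg_matches)
```

Comparing the regressor separately, with regression metrics, avoids reading an R^2 value as if it were an accuracy.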

### Saving our model scores

In [None]:
# Convert the dictionary to a DataFrame with an index
model_scores_f = pd.DataFrame(model_scores, index=[0])
model_scores_f
# Save the DataFrame to a CSV file
#model_scores_f.to_csv('model_scores.csv', index=False)

### Evaluating our Regression model

In [None]:
# Linear Regression model

Reg_model = LinearRegression()
Reg_model.fit(X_train, y_train)  # Fit the model with training data
y_preds = Reg_model.predict(X_test)  # Predict using the test data

In [54]:
def rmsle(y_test, y_preds):
    """
    Calculates root mean squared log error between predictions and true labels.
    """
    return np.sqrt(mean_squared_log_error(y_test, y_preds))

# Create function to evaluate model on a few different levels
def show_scores(model):
    train_preds = model.predict(X_train)
    val_preds = model.predict(X_test)
    scores = {"Training MAE": mean_absolute_error(y_train, train_preds),
              "Valid MAE": mean_absolute_error(y_test, val_preds),
              "Training RMSLE": rmsle(y_train, train_preds),
              "Valid RMSLE": rmsle(y_test, val_preds),
              "Training R^2": r2_score(y_train, train_preds),
              "Valid R^2": r2_score(y_test, val_preds)}
    return scores



In [None]:


show_scores(Reg_model)

In [None]:
rmsle(y_test, y_preds)
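For reference, RMSLE is the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + true value)`. A minimal sketch that computes it by hand and compares it with scikit-learn, using made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up labels and predictions for illustration only.
y_true = np.array([5, 3, 2, 5, 1])
y_pred = np.array([4.8, 3.1, 2.0, 5.3, 1.2])

# RMSLE = sqrt(mean((log(1 + y_pred) - log(1 + y_true))^2))
manual = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
from_sklearn = np.sqrt(mean_squared_log_error(y_true, y_pred))

print(manual, from_sklearn)
```

Because of the logarithm, the metric is only defined for non-negative targets and predictions.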

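### Summary of our findings with the Regression model
On our data, the `LinearRegression` model reached an RMSLE (root mean squared log error) of `0.000318478`, which means its predictions match the true labels almost exactly. An error this small more likely reflects a near-deterministic link between a feature and the target than exceptional modelling; for example, `Self Control` and `Addiction Level` sum to 10 in every row shown by `df.head()`.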
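A near-zero error is worth checking against the possibility that a single feature determines the target. A minimal sketch using only the five rows printed by `df.head()`, where `Self Control` and `Addiction Level` are added row by row (running the same sum on the full, un-encoded data is left as an assumption about how the check would be extended):

```python
import pandas as pd

# The five rows shown by df.head() earlier in the notebook.
sample = pd.DataFrame({
    "Self Control": [5, 7, 8, 5, 10],
    "Addiction Level": [5, 3, 2, 5, 0],
})

total = sample["Self Control"] + sample["Addiction Level"]
print(total.tolist())
print(total.nunique())  # a single distinct total suggests a fixed relationship
```

The `df.describe()` output is consistent with this: the two columns have means of 7.094 and 2.906 and the same standard deviation.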

## Feature Importance

Now we want to find out which features matter most for predicting the target variable.

In [73]:


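# X_train was converted to a NumPy array earlier, so take a DataFrame copy of df
# to get the column names back (this overwrites X_train and includes the target column)
X_train = df.copy()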

# Now you can access the columns attribute
feature_names = X_train.columns

In [74]:
# Assuming X_train is your NumPy array
if isinstance(X_train, np.ndarray):
    column_names = [
        "UserID", "Age", "Location", "Income", "Debt", "Owns Property", "Profession", 
        "Demographics", "Platform", "Total Time Spent", "Number of Sessions", "Video ID", 
        "Video Category", "Video Length", "Engagement", "Importance Score", 
        "Time Spent On Video", "Number of Videos Watched", "Scroll Rate", "Frequency", 
        "Productivity Loss", "Satisfaction", "Watch Reason", "Device Type", "OS", 
        "Watch Time", "Self Control", "Addiction Level", "Current Activity", "Connection Type"
    ]
    X_train = pd.DataFrame(X_train, columns=column_names)

# Now you can access the columns attribute
feature_names = X_train.columns

In [75]:

# Convert X_train to a DataFrame if it is a NumPy array
if isinstance(X_train, np.ndarray):
    X_train = pd.DataFrame(X_train, columns= [
        "UserID", "Age", "Location", "Income", "Debt", "Owns Property", "Profession", 
        "Demographics", "Platform", "Total Time Spent", "Number of Sessions", "Video ID", 
        "Video Category", "Video Length", "Engagement", "Importance Score", 
        "Time Spent On Video", "Number of Videos Watched", "Scroll Rate", "Frequency", 
        "Productivity Loss", "Satisfaction", "Watch Reason", "Device Type", "OS", 
        "Watch Time", "Self Control", "Addiction Level", "Current Activity", "Connection Type"
    ])

# Now you can access the columns attribute
feature_names = X_train.columns

In [None]:
# Ensure X_train only contains feature columns
X_train_features = X_train.drop(columns=["Addiction Level"])

# Extract the coefficients
coefficients = Reg_model.coef_

# Pair coefficients with feature names
feature_importance = pd.DataFrame({
    'Feature': X_train_features.columns,
    'Importance': coefficients
})

print(feature_importance)

In [None]:
# Compare the number of coefficients with the number of columns
# (X_train still includes the target column here)
len(Reg_model.coef_), len(X_train.columns)

In [None]:
# Helper function for plotting feature importance
def plot_features(columns, importances, n=20):
    # Sort features by importance, largest first
    importance_df = (pd.DataFrame({"features": columns,
                                   "feature_importances": importances})
                     .sort_values("feature_importances", ascending=False)
                     .reset_index(drop=True))

    # Plot the top n features
    fig, ax = plt.subplots()
    ax.barh(importance_df["features"][:n], importance_df["feature_importances"][:n])
    ax.set_ylabel("Features")
    ax.set_xlabel("Feature Importance")
    ax.invert_yaxis()
    ax.set_title("Feature Importance chart")

# Ensure X_train only contains feature columns
X_train_features = X_train.drop(columns=["Addiction Level"])

# Plot the feature importances
plot_features(X_train_features.columns, Reg_model.coef_)
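Sorting raw coefficients puts large negative values at the bottom even though they may have the strongest effect. A minimal sketch that ranks by absolute value instead, with hypothetical feature names and coefficients (coefficients also depend on each feature's scale, so this is only a rough ranking):

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients for illustration only.
features = ["Self Control", "Age", "Income"]
coefs = np.array([-1.0, 0.02, -0.3])

ranked = (
    pd.DataFrame({"features": features, "coef": coefs})
    .assign(abs_coef=lambda d: d["coef"].abs())   # magnitude of each coefficient
    .sort_values("abs_coef", ascending=False)     # strongest effect first
    .reset_index(drop=True)
)
print(ranked)
```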

In [None]:
type(X_train)

In [None]:
len(X_train.columns)

In [None]:
X_train

## Saving Models


In [None]:
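# Saving the Classification model
import os
import joblib

# joblib.dump does not create missing folders, so make sure models/ exists
os.makedirs('models', exist_ok=True)

# Save the model to a file
joblib.dump(stacking_clf, 'models/stacking_clf_model.pkl')

# Saving the Regression model

joblib.dump(Reg_model, 'models/LinearRegression_model.pkl')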
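A saved model can be checked by loading it back and comparing predictions. A minimal sketch with a tiny stand-in model and a temporary folder (the path `demo_model.pkl` and the data are hypothetical):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model fitted on made-up data
X_demo = np.array([[0.0], [1.0], [2.0]])
y_demo = np.array([1.0, 3.0, 5.0])
model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "models", "demo_model.pkl")  # hypothetical path
    os.makedirs(os.path.dirname(path), exist_ok=True)     # create the folder first
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should predict exactly like the original one.
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same)
```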

### Evaluating the `StackingClassifier` classification model

In [None]:
stacking_clf.final_estimator_.coef_

In [None]:
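# Pair the final estimator's coefficients with column names
# Note: the final estimator is trained on the base estimators' predictions, so these
# coefficients describe those predictions, not the columns of df; zip() simply pairs
# them with the first column names in order.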

feature_dict = dict(zip(df.columns,list(stacking_clf.final_estimator_.coef_[0])))
feature_dict
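The coefficients of a stacking classifier's final estimator are indexed by the base estimators' outputs rather than by the input columns. A minimal sketch on synthetic data that compares the shapes; it swaps in `KNeighborsClassifier` for the second base estimator to keep the example fast, and all data and names are made up:

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 10))     # 10 input features
y_demo = rng.integers(0, 3, size=150)   # 3 classes

stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
).fit(X_demo, y_demo)

# The final estimator sees 2 base estimators x 3 class probabilities = 6 inputs,
# not the 10 original features.
meta_features = stack.transform(X_demo)
print(X_demo.shape[1], meta_features.shape[1], stack.final_estimator_.coef_.shape)
```

For importances tied to the original columns, a model fitted directly on them (for example a tree ensemble's `feature_importances_`) would be a more natural source.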

### Feature Importance Chart for Stacking Classifier Model

In [None]:
# Visualize feature importance

feature_df = pd.DataFrame(feature_dict, index=[0])
feature_df.T.plot.bar(title="Feature Importance for Classification model", legend = False)

# Final Findings

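From the feature importance charts of the two paths (regression and classification), it can be deduced that this is a regression problem.

* Why would UserID rank high in classification but low in regression? A likely reason is that the classification chart pairs the final estimator's coefficients with `df.columns` in order, although those coefficients belong to the base estimators' predictions rather than to the original columns, so the bar labelled UserID does not measure the UserID feature.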