# Students Social Media Addiction Score Prediction

## Problem Statement:
##### To predict the Social Media Addiction Score of students based on their usage patterns, lifestyle, and academic behavior.

#### Target variable: 
##### is in numeric so this is regression model

### About Dataset
##### Dataset source --- student social media addiction dataset from KAGGEL

##### The dataset contains multiple features that reflect a student’s online activity, health habits, and academic engagement.
### Feature Name	    |   Description
##### Age -->	               Age of the student
##### Gender	-->           Gender of the student (encoded numerically)
##### Daily_Usage	-->       Average daily time spent on social media (in hours)
##### Sleep_Hours	 -->      Average number of hours the student sleeps per day
##### Study_Hours	  -->     Number of hours spent studying daily
##### Stress_Level	    -->    Stress level of the student on a scale of 1–10
##### Academic_Performance-->	Academic performance score (1–10)
##### Family_Time	      -->  Time spent with family per day
##### Exercise_Hours     -->	Daily time spent on physical activity
##### Social_Interaction	-->Level of real-life social interaction
##### Screen_Time	      -->  Total daily screen usage (hours)
##### Addiction_Score      --> (Target Variable)	Numeric value representing the level of social media addiction

### IMPORTING LIBRARIES

In [63]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as pyplot
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")


### READING CSV FILE
##### dataset can be stored in a variable data

In [64]:
data=pd.read_csv('Students Social Media Addiction.csv')

### CHECKING HOW MANY COLUMNS IN THE DATASET

In [65]:
data.columns

Index(['Student_ID', 'Age', 'Gender', 'Academic_Level', 'Country',
       'Avg_Daily_Usage_Hours', 'Most_Used_Platform',
       'Affects_Academic_Performance', 'Sleep_Hours_Per_Night',
       'Mental_Health_Score', 'Relationship_Status',
       'Conflicts_Over_Social_Media', 'Addicted_Score'],
      dtype='object')

### SHAPE OF THE DATASET

In [66]:
data.shape

(705, 13)

##### To print first five rows of the dataset

In [67]:
data.head()

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
0,1,19,Female,Undergraduate,Bangladesh,5.2,Instagram,Yes,6.5,6,In Relationship,3,8
1,2,22,Male,Graduate,India,2.1,Twitter,No,7.5,8,Single,0,3
2,3,20,Female,Undergraduate,USA,6.0,TikTok,Yes,5.0,5,Complicated,4,9
3,4,18,Male,High School,UK,3.0,YouTube,No,7.0,7,Single,1,4
4,5,21,Male,Graduate,Canada,4.5,Facebook,Yes,6.0,6,In Relationship,2,7


##### To print last five rows of the dataset

In [68]:
data.tail()

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
700,701,20,Female,Undergraduate,Italy,4.7,TikTok,No,7.2,7,In Relationship,2,5
701,702,23,Male,Graduate,Russia,6.8,Instagram,Yes,5.9,4,Single,5,9
702,703,21,Female,Undergraduate,China,5.6,WeChat,Yes,6.7,6,In Relationship,3,7
703,704,24,Male,Graduate,Japan,4.3,Twitter,No,7.5,8,Single,2,4
704,705,19,Female,Undergraduate,Poland,6.2,Facebook,Yes,6.3,5,Single,4,8


### DATAPREPROCESSING 

##### Checking null values in the dataset

In [69]:
data.isnull().sum()

Student_ID                      0
Age                             0
Gender                          0
Academic_Level                  0
Country                         0
Avg_Daily_Usage_Hours           0
Most_Used_Platform              0
Affects_Academic_Performance    0
Sleep_Hours_Per_Night           0
Mental_Health_Score             0
Relationship_Status             0
Conflicts_Over_Social_Media     0
Addicted_Score                  0
dtype: int64

#### Checking duplicates in the dataset

In [70]:
data.duplicated().sum()


np.int64(0)

##### To print duplicate rows

In [71]:
data[data.duplicated()]


Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score


##### Dropping  the Duplicated rows

In [72]:
data = data.drop_duplicates()

In [73]:
data.duplicated().sum()


np.int64(0)

### CHECKING DUPLICATES IN COLUMNS

In [74]:
data.duplicated(subset=['Student_ID', 'Age', 'Gender','Academic_Level', 'Country',
       'Avg_Daily_Usage_Hours', 'Most_Used_Platform',
       'Affects_Academic_Performance', 'Sleep_Hours_Per_Night',
       'Mental_Health_Score', 'Relationship_Status',
       'Conflicts_Over_Social_Media', 'Addicted_Score']).sum()


np.int64(0)

### DESCRIBING DATA

In [75]:
data.describe()

Unnamed: 0,Student_ID,Age,Avg_Daily_Usage_Hours,Sleep_Hours_Per_Night,Mental_Health_Score,Conflicts_Over_Social_Media,Addicted_Score
count,705.0,705.0,705.0,705.0,705.0,705.0,705.0
mean,353.0,20.659574,4.918723,6.868936,6.22695,2.849645,6.436879
std,203.660256,1.399217,1.257395,1.126848,1.105055,0.957968,1.587165
min,1.0,18.0,1.5,3.8,4.0,0.0,2.0
25%,177.0,19.0,4.1,6.0,5.0,2.0,5.0
50%,353.0,21.0,4.8,6.9,6.0,3.0,7.0
75%,529.0,22.0,5.8,7.7,7.0,4.0,8.0
max,705.0,24.0,8.5,9.6,9.0,5.0,9.0


In [76]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 705 entries, 0 to 704
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Student_ID                    705 non-null    int64  
 1   Age                           705 non-null    int64  
 2   Gender                        705 non-null    object 
 3   Academic_Level                705 non-null    object 
 4   Country                       705 non-null    object 
 5   Avg_Daily_Usage_Hours         705 non-null    float64
 6   Most_Used_Platform            705 non-null    object 
 7   Affects_Academic_Performance  705 non-null    object 
 8   Sleep_Hours_Per_Night         705 non-null    float64
 9   Mental_Health_Score           705 non-null    int64  
 10  Relationship_Status           705 non-null    object 
 11  Conflicts_Over_Social_Media   705 non-null    int64  
 12  Addicted_Score                705 non-null    int64  
dtypes: fl

### FEATURE TRANSFORMATION
##### -- One hot encoding
##### --Label encoding
##### converting categorial values in to numerical values

In [77]:
from sklearn.preprocessing import LabelEncoder

categorical_cols = [
    'Gender', 'Academic_Level', 'Country',
    'Most_Used_Platform', 'Affects_Academic_Performance',
    'Relationship_Status'
]

encoder = LabelEncoder()
for col in categorical_cols:
    data[col] = encoder.fit_transform(data[col])


##### Dropping unnecessary columns 

In [78]:
data.drop('Student_ID', axis=1, inplace=True)


### FEATURE SELECTION
##### All input feature are assinging to X and output feature is assinging to y ,so that X=independent values y=dependent value

In [79]:
X = data.drop('Addicted_Score', axis=1)
y = data['Addicted_Score']


### SPLITTING THE DATA
##### Training-80% and Testing-20%

In [80]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

### FEATURE SCALING
##### -- Standardisation
##### -- Normalisation
##### converting input range between 0-1 and to equalize priorities among all features

In [81]:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

scaler = MinMaxScaler()

# Fit only on training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform test or new data
X_test_scaled = scaler.transform(X_test)

# Check scaling


##### checking that feature scaling done or not

In [82]:
print(pd.DataFrame(X_train_scaled).describe())


               0           1           2           3           4           5   \
count  564.000000  564.000000  564.000000  564.000000  564.000000  564.000000   
mean     0.444740    0.496454    0.523050    0.534062    0.451909    0.320600   
std      0.231584    0.500431    0.491853    0.280352    0.197729    0.310473   
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   
25%      0.333333    0.000000    0.000000    0.284404    0.323077    0.090909   
50%      0.500000    0.000000    1.000000    0.541284    0.430769    0.090909   
75%      0.666667    1.000000    1.000000    0.825688    0.584615    0.545455   
max      1.000000    1.000000    1.000000    1.000000    1.000000    1.000000   

               6           7           8           9           10  
count  564.000000  564.000000  564.000000  564.000000  564.000000  
mean     0.648936    0.528430    0.556294    0.744681    0.571631  
std      0.477727    0.198054    0.276845    0.288113    0.190848 

### MODEL SELECTION
##### selecting with five model means taking five models and checking five models accuracy and taking best accuracy model among them

##### 1 DecisionTreeRegressor

In [83]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
from sklearn.metrics import r2_score
DTR_Accuracy=round(r2_score(y_test, model.predict(X_test)) * 100)
DTR_Accuracy

98

##### 2 support vector regression

In [84]:
from sklearn.svm import SVR
model = SVR()
model.fit(X_train, y_train)
from sklearn.metrics import r2_score
SVR_Accuracy=round(r2_score(y_test, model.predict(X_test)) * 100)
SVR_Accuracy

59

##### 3 Random forest regression

In [85]:
from sklearn.ensemble import RandomForestRegressor
RF_model = RandomForestRegressor(n_estimators=10)
RF_model.fit(X_train, y_train)
from sklearn.metrics import r2_score
RFR_Accuracy=round(r2_score(y_test, RF_model.predict(X_test)) * 100)
RFR_Accuracy

98

##### 4 LinearRegression

In [86]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

LR_model = LinearRegression()
LR_model.fit(X_train, y_train)

LR_Accuracy = round(r2_score(y_test, LR_model.predict(X_test)) * 100)


##### 5 polynomial regression

In [87]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
from sklearn.linear_model import LinearRegression
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
from sklearn.metrics import r2_score
PR_Accuracy=round(r2_score(y_test, poly_model.predict(X_test_poly)) * 100)
PR_Accuracy

97

##### printing all models accuracy

In [88]:
import pandas as pd

# Create a table
model_performance = pd.DataFrame({
    "Model Name": [
        "Linear Regression",
        "Polynomial Regression",
        "Decision Tree Regression",
        "Support Vector Regression",
        "Random Forest Regression"
    ],
    "R² Score": [
        LR_Accuracy,
        PR_Accuracy,
        DTR_Accuracy,
        SVR_Accuracy,
        RFR_Accuracy
    ]
})

# Display table
print(model_performance)



                  Model Name  R² Score
0          Linear Regression        96
1      Polynomial Regression        97
2   Decision Tree Regression        98
3  Support Vector Regression        59
4   Random Forest Regression        98


##### selecting model -- Randomforest regression and MODEL TRAINING

In [89]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=10)
model.fit(X_train, y_train)


0,1,2
,n_estimators,10
,criterion,'squared_error'
,max_depth,
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,1.0
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,True


In [90]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model.predict(X_test)

print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)
print("R² Score:", r2_score(y_test, y_pred))


MAE: 0.07375886524822692
MSE: 0.04695035460992907
RMSE: 0.21668030508084732
R² Score: 0.9812370346546595


In [None]:
# Example: Create a dictionary with all feature values
# You can modify these values to test different predictions
print("Available features:", list(X.columns))
print("\n")

input_data = {
    'Age': [18],
    'Gender': [2],
    'Avg_Daily_Usage_Hours': [2],
    'Most_Used_Platform': [4],
    'Affects_Academic_Performance': [1],
    'Sleep_Hours_Per_Night': [9],
    'Mental_Health_Score': [4],
    'Relationship_Status': [1],
    'Conflicts_Over_Social_Media': [1]
}

# Create DataFrame with user inputs
new_student = pd.DataFrame(input_data)

# Convert categorical features if any
new_student = pd.get_dummies(new_student)

# Match training features
new_student = new_student.reindex(columns=X.columns, fill_value=0)

# Scale the input
new_scaled = scaler.transform(new_student)

# Predict addiction score
predicted_addiction = model.predict(new_scaled)

print("Input values:")
print(new_student)
print("\nPredicted Social Media Addiction Score:", predicted_addiction[0])

Available features: ['Age', 'Gender', 'Academic_Level', 'Country', 'Avg_Daily_Usage_Hours', 'Most_Used_Platform', 'Affects_Academic_Performance', 'Sleep_Hours_Per_Night', 'Mental_Health_Score', 'Relationship_Status', 'Conflicts_Over_Social_Media']


Input values:
   Age  Gender  Academic_Level  Country  Avg_Daily_Usage_Hours  \
0   18       2               0        0                      2   

   Most_Used_Platform  Affects_Academic_Performance  Sleep_Hours_Per_Night  \
0                   4                             1                      9   

   Mental_Health_Score  Relationship_Status  Conflicts_Over_Social_Media  
0                    4                    1                            1  

Predicted Social Media Addiction Score: 7.4


##### selecting some columns to predict the addiction score


##### saving


In [92]:
import pickle
pickle.dump(RF_model, open('RF_model.pkl', 'wb'))

In [93]:
pickle.dump(scaler,open('scalar.pkl','wb'))