We first train one model on the provided train/val/test datasets. Next, we split sentimentdataset into train/val/test sets to train a second model, and then compare the performance of the two models.
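One way the train/val/test split of sentimentdataset could be done is sketched below. The helper name `split_train_val_test`, the 70/15/15 proportions, the seed, and the toy rows are all assumptions for illustration and are not taken from the project.

```python
import pandas as pd


def split_train_val_test(df, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle the rows once, then slice them into train/val/test parts."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled.iloc[:n_test]
    val = shuffled.iloc[n_test:n_test + n_val]
    train = shuffled.iloc[n_test + n_val:]
    return train, val, test


# Toy frame standing in for sentimentdataset.csv (values are made up).
toy = pd.DataFrame({"Text": [f"post {i}" for i in range(20)],
                    "Likes": range(20)})
train, val, test = split_train_val_test(toy)
print(len(train), len(val), len(test))
```

Splitting by row keeps the sketch short, but posts from the same user can then land in more than one partition. Splitting by user would avoid that if it matters for the comparison.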

In [27]:
import pandas as pd

file_path = "/Users/kellyg/csds312-project/train.csv" 
df = pd.read_csv(file_path)

df["Addicted"] = (df["Daily_Usage_Time (minutes)"] > 120).astype(int)

count_more_than_120 = df["Addicted"].sum()
count_less_than_120 = (df["Addicted"] == 0).sum()

print(df[["Daily_Usage_Time (minutes)", "Addicted"]].head())
print(f"Number of people with Daily Usage Time greater than 120 minutes: {count_more_than_120}")
print(f"Number of people with Daily Usage Time less than or equal to 120 minutes: {count_less_than_120}")


   Daily_Usage_Time (minutes)  Addicted
0                       120.0         0
1                        90.0         0
2                        60.0         0
3                       200.0         1
4                        45.0         0
Number of people with Daily Usage Time greater than 120 minutes: 220
Number of people with Daily Usage Time less than or equal to 120 minutes: 781
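The two counts above imply an imbalanced label, which matters when the models are compared later. A quick check of the class shares, using only the counts printed above, might look like this:

```python
# Counts copied from the cell output above.
n_addicted = 220
n_not_addicted = 781

total = n_addicted + n_not_addicted
addicted_share = n_addicted / total
majority_baseline = n_not_addicted / total

print(f"Addicted share: {addicted_share:.1%}")                         # 22.0%
print(f"Accuracy of always predicting 0: {majority_baseline:.1%}")     # 78.0%
```

A model would need to beat roughly 78% accuracy to do better than always predicting the majority class, so a metric such as recall or F1 on the Addicted class may be more informative than accuracy alone.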


In [21]:
import pandas as pd

df = pd.read_csv("/Users/kellyg/csds312-project/sentimentdataset.csv")

df["Timestamp"] = pd.to_datetime(df["Timestamp"])
df["Date"] = df["Timestamp"].dt.date

#group by user and date
user_daily_activity = df.groupby(["User", "Date"]).agg(
    total_posts=("Text", "count"),  
    total_likes=("Likes", "sum"),
    total_retweets=("Retweets", "sum"),
    first_post=("Timestamp", "min"),  
    last_post=("Timestamp", "max"),   
).reset_index()

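#active window in hours = timestamp of last post - timestamp of first post (0 when there is a single post)
user_daily_activity["Active_Hours"] = (user_daily_activity["last_post"] - user_daily_activity["first_post"]).dt.total_seconds() / 3600
#note: this chained inplace fillna is the line that raises the pandas warning shown in the cell output
user_daily_activity["Active_Hours"].fillna(0, inplace=True)
#estimated daily usage in hours, assuming each post takes 15 minutes
user_daily_activity["Daily_Usage_Time"] = user_daily_activity["total_posts"] * 15 / 60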

#total interactions = likes + retweets
user_daily_activity["Total_Interactions"] = user_daily_activity["total_likes"] + user_daily_activity["total_retweets"]

#addiction = over 2 hours of usage OR over 100 interactions per day
user_daily_activity["Addicted"] = ((user_daily_activity["Daily_Usage_Time"] > 2) | (user_daily_activity["Total_Interactions"] > 100)).astype(int)

df = df.merge(user_daily_activity[["User", "Date", "Addicted"]], on=["User", "Date"], how="left")
print(df[["User", "Date", "Addicted"]].head())

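#note: df has one row per post, so this sum counts posts made on addicted user-days, not distinct users
print("Number of addicted users:", df["Addicted"].sum())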
total_users = df["User"].nunique()
print("Total number of users:", total_users)

             User        Date  Addicted
0   User123        2023-01-15         0
1   CommuterX      2023-01-15         0
2   FitnessFan     2023-01-15         0
3   AdventureX     2023-01-15         0
4   ChefCook       2023-01-15         0
Number of addicted users: 105
Total number of users: 685
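Because `df` holds one row per post, `df["Addicted"].sum()` counts flagged rows rather than distinct users, so the 105 above is not directly comparable to the 685 unique users. A user-level count could be sketched as follows. The toy frame is made up, and counting a user as addicted if any of their days is flagged is an assumption, not the notebook's stated definition.

```python
import pandas as pd

# Toy post-level frame (made-up values): one row per post.
posts = pd.DataFrame({
    "User": ["a", "a", "b", "c", "c", "c"],
    "Addicted": [1, 1, 0, 0, 1, 0],
})

# Row-level count: what df["Addicted"].sum() measures.
rows_flagged = int(posts["Addicted"].sum())

# User-level count: distinct users with at least one flagged row.
users_flagged = int(posts.loc[posts["Addicted"] == 1, "User"].nunique())

print(rows_flagged, users_flagged)  # 3 2
```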


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  user_daily_activity["Active_Hours"].fillna(0, inplace=True)
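The warning above suggests two replacements for the chained `fillna(..., inplace=True)` call. A minimal sketch of both on a toy frame (the values are made up, and only the column name is borrowed from the notebook):

```python
import pandas as pd

# Toy frame with one missing value (made-up data).
activity = pd.DataFrame({"Active_Hours": [1.5, float("nan"), 0.0]})

# Option 1: assign the filled column back to the frame.
activity["Active_Hours"] = activity["Active_Hours"].fillna(0)

# Option 2: fill through the frame itself, keyed by column name.
other = pd.DataFrame({"Active_Hours": [float("nan"), 2.0]})
other.fillna({"Active_Hours": 0}, inplace=True)

print(activity["Active_Hours"].tolist())  # [1.5, 0.0, 0.0]
print(other["Active_Hours"].tolist())     # [0.0, 2.0]
```

Both forms operate on the original DataFrame instead of on an intermediate column object, which is what the warning asks for.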
