<a href="https://colab.research.google.com/github/mazinkamal134/DS_MRP_2024/blob/main/TensiStrength/1_TensiSterngth_Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook reads the control and treatment tweet datasets and prepares them for the stress score calculation.

Because running the TensiStrength Java API is time-consuming, this notebook splits the tweets for each disorder into multiple chunks so that several instances of the API can run in parallel.

In [None]:
import pandas as pd
import pickle
from datetime import datetime
import os
import json

In [None]:
# mount the Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Global Params

In [None]:
tweetsDir = "/content/drive/MyDrive/Master-2024/MRP/Data/Tweets"
tensiStrengthDir = "/content/drive/MyDrive/Master-2024/MRP/Data/TensiStrength"

## Ingest the control tweets

In [None]:
disorders = ["anxiety", "depression", "ptsd"]

In [None]:
# Read the control tweets CSV file
fileName = os.path.join(tweetsDir, "control_tweets.csv")
controlTweetsDf = pd.read_csv(fileName)
print("Shape of the original control tweets file:", controlTweetsDf.shape)
# Filter: selected disorders, English timeline tweets with non-null cleaned_text
# (.copy() avoids a SettingWithCopyWarning when the group column is added below)
controlTweetsDf = controlTweetsDf[
    (controlTweetsDf.disorder.isin(disorders))
    & (controlTweetsDf.tweet_type == "timeline")
    & (controlTweetsDf.lang == "en")
    & (controlTweetsDf.cleaned_text.notna())
].copy()
# Add the group (0 = control)
controlTweetsDf["group"] = 0
print("Filtered Control dataset shape:", controlTweetsDf.shape)

## Ingest the treatment tweets

In [None]:
# Read the treatment tweets CSV file
fileName = os.path.join(tweetsDir, "treatment_tweets.csv")
treatmentTweetsDf = pd.read_csv(fileName)
print("Shape of the original treatment tweets file:", treatmentTweetsDf.shape)
# Filter: selected disorders, English timeline tweets with non-null cleaned_text
# (.copy() avoids a SettingWithCopyWarning when the group column is added below)
treatmentTweetsDf = treatmentTweetsDf[
    (treatmentTweetsDf.disorder.isin(disorders))
    & (treatmentTweetsDf.tweet_type == "timeline")
    & (treatmentTweetsDf.lang == "en")
    & (treatmentTweetsDf.cleaned_text.notna())
].copy()
# Add the group (1 = treatment)
treatmentTweetsDf["group"] = 1
print("Filtered Treatment dataset shape:", treatmentTweetsDf.shape)

## Combine control and treatment tweets
Only English timeline tweets with a non-null cleaned_text are kept.

Disorders: anxiety, depression, and PTSD
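
The control and treatment files go through the same selection, so the filter could be factored into a shared helper. The following is a minimal sketch under that assumption; `filterTweets` and the toy rows (including the `"other"` tweet type and the `"adhd"` disorder) are hypothetical and only illustrate the selection rule described above.

```python
import pandas as pd


def filterTweets(df, disorders, group):
    """Keep English timeline tweets with non-null cleaned_text for the
    given disorders and tag them with the group label."""
    mask = (
        df["disorder"].isin(disorders)
        & (df["tweet_type"] == "timeline")
        & (df["lang"] == "en")
        & df["cleaned_text"].notna()
    )
    # assign() returns a new DataFrame, so the input frame is not modified
    return df[mask].assign(group=group)


# Toy rows for illustration only (not taken from the real datasets)
toyDf = pd.DataFrame({
    "disorder": ["anxiety", "ptsd", "adhd", "depression"],
    "tweet_type": ["timeline", "timeline", "timeline", "other"],
    "lang": ["en", "fr", "en", "en"],
    "cleaned_text": ["some text", "du texte", "more text", None],
})
filteredDf = filterTweets(toyDf, ["anxiety", "depression", "ptsd"], group=0)
print(filteredDf.shape)
```

Only the first toy row satisfies all four conditions, so a single row remains, with the extra `group` column attached.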

In [None]:
# Combine
tweetsDf = pd.concat([controlTweetsDf, treatmentTweetsDf])
print("Shape combined:", tweetsDf.shape)

# Fix the data types
tweetsDf["created_at"] = pd.to_datetime(tweetsDf.created_at).dt.tz_convert(None)
tweetsDf["author_id"] = tweetsDf["author_id"].astype("str")

# Reorder the columns ("group" is listed once so the column is not duplicated)
cols = ["id", "tweet_type", "referenced_tweet_type", "created_at", "lang", "disorder", "group", "author_id", "text", "cleaned_text", "retweet_count", "reply_count", "like_count", "quote_count", "source"]
tweetsDf = tweetsDf[cols]

# Check the counts, and use to find the proper chunk size
tweetsDf.groupby("disorder")["id"].count().reset_index()

Shape combined: (3232475, 19)


| | disorder | id |
|---|---|---|
| 0 | anxiety | 616143 |
| 1 | depression | 2070766 |
| 2 | ptsd | 545566 |


## Split the data based on disorder and chunk size
To run the TensiStrength Java API in parallel, split each disorder's tweets into multiple files. Each file serves as the input for one of the code instances that calculate the stress score.

In [None]:
# Define the chunk (file) size based on the total number of tweets in each disorder (previous cell)
# The values below are for reference only!
disorderChunkSizes = {"anxiety": 125000, "depression": 420000, "ptsd": 110000}
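
For the counts shown above, the reference values yield five files per disorder. As a sketch of how such sizes could be derived rather than chosen by hand, the snippet below uses ceiling division. `chunkSizeFor` and `numInstances` are hypothetical names, and the value 5 is an assumption about how many API instances run in parallel.

```python
def chunkSizeFor(totalRows, numChunks):
    """Smallest chunk size that splits totalRows rows into at most numChunks files."""
    return -(-totalRows // numChunks)  # ceiling division using integers only


# Tweet counts per disorder, copied from the output of the previous cell
disorderCounts = {"anxiety": 616143, "depression": 2070766, "ptsd": 545566}

# Assumed number of parallel TensiStrength instances per disorder
numInstances = 5

suggestedChunkSizes = {
    disorder: chunkSizeFor(count, numInstances)
    for disorder, count in disorderCounts.items()
}
print(suggestedChunkSizes)
```

Rounding these sizes up to convenient numbers, as the reference values do, keeps the number of files the same as long as the rounded size still fits the row count into five chunks.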

In [None]:
# Loop through the disorders to create the file splits
for disorder, chunkSize in disorderChunkSizes.items():
  # Isolate the disorder
  disorderDf = tweetsDf[tweetsDf.disorder == disorder]
  print(f"{disorder} df shape:", disorderDf.shape)

  # Number of chunks: ceiling of (row count / chunk size); the boolean adds 1 when there is a remainder
  chunks = disorderDf.shape[0] // chunkSize + (disorderDf.shape[0] % chunkSize != 0)

  # Create a list of DataFrames
  dfs = [disorderDf.iloc[i * chunkSize:(i+1) * chunkSize] for i in range(chunks)]

  # Save the resulting DataFrames (create the Chunks folder if it does not exist yet)
  chunksDir = os.path.join(tensiStrengthDir, "Chunks")
  os.makedirs(chunksDir, exist_ok=True)
  for i, df in enumerate(dfs):
    print(f"{disorder} data frame {i + 1} shape: {df.shape}\n")
    # Pickle
    fileName = f"{disorder}TweetsDfWithTensiStrength{i}.pickle"
    df.to_pickle(os.path.join(chunksDir, fileName))
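
The splitting logic can be checked on a small example. The sketch below applies the same slicing to a toy DataFrame (the data and the chunk size of 4 are made up for illustration) and confirms that every row lands in exactly one chunk and that only the last chunk is shorter.

```python
import pandas as pd

# Toy frame standing in for one disorder's tweets (illustrative only)
toyDf = pd.DataFrame({"id": range(10)})
chunkSize = 4

# Same ceiling logic as in the loop: add one chunk when there is a remainder
numChunks = len(toyDf) // chunkSize + (len(toyDf) % chunkSize != 0)
toyChunks = [toyDf.iloc[i * chunkSize:(i + 1) * chunkSize] for i in range(numChunks)]

chunkLengths = [len(chunk) for chunk in toyChunks]
print(chunkLengths)  # [4, 4, 2]

# Concatenating the chunks restores the original rows in their original order
restoredIds = pd.concat(toyChunks)["id"].tolist()
print(restoredIds == toyDf["id"].tolist())  # True
```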

The next step is to run the next notebook(s) in the pipeline to calculate the stress score for each file, followed by a final step that combines the file chunks into one file per disorder.
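
The combining step lives in a later notebook that is not shown here. The following is only a rough sketch of what it might look like, assuming the scored chunks keep the `{disorder}TweetsDfWithTensiStrength{i}.pickle` naming used above. `combineChunks` is a hypothetical helper, and the demo writes toy data to a temporary folder instead of Google Drive.

```python
import glob
import os
import re
import tempfile

import pandas as pd


def combineChunks(chunksDir, disorder):
    """Concatenate all chunk pickles for one disorder, in chunk-index order."""
    pattern = os.path.join(chunksDir, f"{disorder}TweetsDfWithTensiStrength*.pickle")

    def chunkIndex(path):
        # Trailing number in the file name, so that chunk 10 sorts after chunk 2
        return int(re.search(r"(\d+)\.pickle$", path).group(1))

    paths = sorted(glob.glob(pattern), key=chunkIndex)
    if not paths:
        raise FileNotFoundError(f"No chunk files match {pattern}")
    return pd.concat([pd.read_pickle(path) for path in paths])


# Toy round trip in a temporary folder (illustrative data only)
with tempfile.TemporaryDirectory() as demoDir:
    toyDf = pd.DataFrame({"id": range(5), "disorder": "ptsd"})
    for i, start in enumerate(range(0, len(toyDf), 2)):
        toyDf.iloc[start:start + 2].to_pickle(
            os.path.join(demoDir, f"ptsdTweetsDfWithTensiStrength{i}.pickle")
        )
    combinedDf = combineChunks(demoDir, "ptsd")

print(combinedDf.shape)
```

Sorting by the numeric suffix rather than by file name keeps the rows in their original order even when there are ten or more chunks.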