## <em>A Big Data Mining Approach Project</em>
## <b>Stress Detection from Social Media Interaction</b>
## Group name: The Enigma Ensemble

### <em>(*) First author:</em>
##### Tri Quan Do (tdo22@uic.edu) - Group Leader
##### Mosrour Tafadar (mtafad2@uic.edu)
##### Hina Khali (hkhali21@uic.edu)
##### Safiya Mustafa (smust3@uic.edu)


## Project Abstract:
Emotional and mental stress are serious issues that can significantly affect well-being. Although an emotional experience usually starts as a personal, internal process, it frequently leads to sharing that emotion with others. Emotions that a person verbally expresses to others are referred to as socially shared. People share their emotions with others in more than 80% of all emotional events, regardless of age, gender, personality type, or culture (Bazarova, Choi, Sosik, Cosley, Whitlock 1). Because social media is so widely used, people are accustomed to posting about their everyday activities and connecting with acquaintances on these platforms, which makes it possible to use data from online social networks to identify stress.

## Project Introduction

The first step of this project is to identify a set of words commonly associated with emotional stress. Using this set, the models compute an overall stress score for each individual under investigation. Because some words carry a higher intensity than others, the project separates the identified words into three intensity categories: high, moderate, and low. In parallel, a word frequency analysis identifies words or phrases that occur frequently, specifically those that pertain to emotions or stress. This approach is expected to provide insight into the patterns and associations between language use and emotional stress, contributing to the existing knowledge on the topic.<br><br>
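
The scoring idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual lexicon or weighting: the words in `STRESS_LEXICON`, the numeric values in `INTENSITY_WEIGHTS`, and the helper names `tokenize` and `stress_score` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical lexicon: the words and their intensity tiers are illustrative only.
STRESS_LEXICON = {
    "overwhelmed": "high",
    "panic": "high",
    "anxious": "moderate",
    "worried": "moderate",
    "tired": "low",
    "busy": "low",
}

# Assumed weight per intensity tier; the project does not specify these values.
INTENSITY_WEIGHTS = {"high": 3, "moderate": 2, "low": 1}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def stress_score(posts):
    """Return the weighted stress score and the word counts for a list of posts."""
    counts = Counter(tok for post in posts for tok in tokenize(post))
    score = sum(
        INTENSITY_WEIGHTS[STRESS_LEXICON[word]] * n
        for word, n in counts.items()
        if word in STRESS_LEXICON
    )
    return score, counts


posts = [
    "So tired and worried about exams",
    "Feeling overwhelmed, total panic before the deadline",
    "Busy day but not worried anymore",
]
score, counts = stress_score(posts)

# Word frequency analysis restricted to the lexicon words, most frequent first.
top_stress_words = [(w, n) for w, n in counts.most_common() if w in STRESS_LEXICON]
print(score)
print(top_stress_words)
```

Note that a plain lexicon count also scores negated phrases such as "not worried", so a real pipeline would likely need some handling of negation and context.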

Support Vector Machines (SVM) and MapReduce are robust technologies for processing and analyzing massive amounts of social media data, and they can be used to predict stress levels from social media posts. SVM is a supervised machine learning algorithm that, given labeled examples, identifies the hyperplane that best separates the classes. MapReduce is a programming model and software framework for processing large datasets in parallel on a distributed computing system.
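
To make the MapReduce model concrete, here is a small single-process sketch of its three phases applied to word counting. It only imitates the programming model: nothing here is distributed, and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical. In PySpark, roughly the same steps would be expressed with `flatMap` and `reduceByKey` on an RDD, with the work spread across executors.

```python
from collections import defaultdict
from functools import reduce


def map_phase(post):
    """Map: emit a (word, 1) pair for every word in one post."""
    return [(word, 1) for word in post.lower().split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single total."""
    return {
        word: reduce(lambda a, b: a + b, counts)
        for word, counts in grouped.items()
    }


# Made-up posts used only to exercise the three phases.
posts = ["stress at work", "work work work", "no stress today"]
mapped = [pair for post in posts for pair in map_phase(post)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Because each post is mapped independently and each key is reduced independently, both phases can run in parallel on separate machines, which is what makes the model suitable for large collections of posts.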

Full project information can be found here: <"add link to document">

In [None]:
#######################################################
###########   ENVIRONMENT SETTING UP   ################
!pip install pandas
!pip install numpy
!pip install -U scikit-learn
!pip install seaborn
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq

In [None]:
###########################################################################################
## The code below may raise errors outside a Google Colab environment.                   ##
## Keep it commented out when working on a local machine.                                ##
###########################################################################################

# from pydrive.auth import GoogleAuth
# from pydrive.drive import GoogleDrive
# from google.colab import auth
# from oauth2client.client import GoogleCredentials
#
# # Authenticate and create the PyDrive client
# auth.authenticate_user()
# gauth = GoogleAuth()
# gauth.credentials = GoogleCredentials.get_application_default()
# drive = GoogleDrive(gauth)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Modeling for Machine Learning Task
from sklearn.linear_model import LinearRegression 
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import RFECV

import pyspark
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

## Data Description

In [3]:
#First reading the data
data = pd.read_csv('Training Data/twitter_content.csv', encoding='ISO-8859-1')
data


Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,gender,gender:confidence,profile_yn,profile_yn:confidence,created,...,profileimage,retweet_count,sidebar_color,text,tweet_coord,tweet_count,tweet_created,tweet_id,tweet_location,user_timezone
0,815719226,False,finalized,3,10/26/15 23:24,male,1.0000,yes,1.0,12/5/13 1:48,...,https://pbs.twimg.com/profile_images/414342229...,0,FFFFFF,Robbie E Responds To Critics After Win Against...,,110964,10/26/15 12:40,6.587300e+17,main; @Kan1shk3,Chennai
1,815719227,False,finalized,3,10/26/15 23:30,male,1.0000,yes,1.0,10/1/12 13:51,...,https://pbs.twimg.com/profile_images/539604221...,0,C0DEED,ÛÏIt felt like they were my friends and I was...,,7471,10/26/15 12:40,6.587300e+17,,Eastern Time (US & Canada)
2,815719228,False,finalized,3,10/26/15 23:33,male,0.6625,yes,1.0,11/28/14 11:30,...,https://pbs.twimg.com/profile_images/657330418...,1,C0DEED,i absolutely adore when louis starts the songs...,,5617,10/26/15 12:40,6.587300e+17,clcncl,Belgrade
3,815719229,False,finalized,3,10/26/15 23:10,male,1.0000,yes,1.0,6/11/09 22:39,...,https://pbs.twimg.com/profile_images/259703936...,0,C0DEED,Hi @JordanSpieth - Looking at the url - do you...,,1693,10/26/15 12:40,6.587300e+17,"Palo Alto, CA",Pacific Time (US & Canada)
4,815719230,False,finalized,3,10/27/15 1:15,female,1.0000,yes,1.0,4/16/14 13:23,...,https://pbs.twimg.com/profile_images/564094871...,0,0,Watching Neighbours on Sky+ catching up with t...,,31462,10/26/15 12:40,6.587300e+17,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20045,815757572,True,golden,259,,female,1.0000,yes,1.0,8/5/15 21:16,...,https://pbs.twimg.com/profile_images/656793310...,0,C0DEED,"@lookupondeath ...Fine, and I'll drink tea too...",,783,10/26/15 13:20,6.587400e+17,Verona ªÁ,
20046,815757681,True,golden,248,,male,1.0000,yes,1.0,8/15/12 21:17,...,https://pbs.twimg.com/profile_images/639815429...,0,0,Greg Hardy you a good player and all but don't...,,13523,10/26/15 12:40,6.587300e+17,"Kansas City, MO",
20047,815757830,True,golden,264,,male,1.0000,yes,1.0,9/3/12 1:17,...,https://pbs.twimg.com/profile_images/655473271...,0,C0DEED,You can miss people and still never want to se...,,26419,10/26/15 13:20,6.587400e+17,Lagos Nigeria,
20048,815757921,True,golden,250,,female,0.8489,yes,1.0,11/6/12 23:46,...,https://pbs.twimg.com/profile_images/657716093...,0,0,@bitemyapp i had noticed your tendency to pee ...,,56073,10/26/15 12:40,6.587300e+17,Texas Hill Country,


In [4]:
#Let's check overall info of the data
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20050 entries, 0 to 20049
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   _unit_id               20050 non-null  int64  
 1   _golden                20050 non-null  bool   
 2   _unit_state            20050 non-null  object 
 3   _trusted_judgments     20050 non-null  int64  
 4   _last_judgment_at      20000 non-null  object 
 5   gender                 19953 non-null  object 
 6   gender:confidence      20024 non-null  float64
 7   profile_yn             20050 non-null  object 
 8   profile_yn:confidence  20050 non-null  float64
 9   created                20050 non-null  object 
 10  description            16306 non-null  object 
 11  fav_number             20050 non-null  int64  
 12  gender_gold            50 non-null     object 
 13  link_color             20050 non-null  object 
 14  name                   20050 non-null  object 
 15  pr

From the `data.info()` output we can see that there are a total of 20050 rows and 26 columns. Each column is described below.

The data contains the following fields:

**_unit_id:** a unique id for the user  
**_golden:**  whether the user was included in the gold standard for the model; TRUE or FALSE  


**_unit_state:** state of the observation; one of finalized (for contributor-judged) or golden (for gold standard observations)  

**_trusted_judgments:** number of trusted judgments (int); always 3 for non-golden, and what may be a unique id for gold standard observations  

**_last_judgment_at:** date and time of last contributor judgment; blank for gold standard observations  

**gender:** one of male, female, or brand (for non-human profiles)  

**gender:confidence:**  a float representing confidence in the provided gender  

**profile_yn:** “no” here seems to mean that the profile was meant to be part of the dataset but was not available when contributors went to judge it


**profile_yn:confidence:** confidence in the existence/non-existence of the profile  

**created:** date and time when the profile was created  

**description:** the user’s profile description  

**fav_number:** number of tweets the user has favourited


**gender_gold:** if the profile is golden, what is the gender?  

**link_color:** the link colour on the profile, as a hex value


**name:** the user’s name  

**profile_yn_gold:** whether the profile y/n value is golden  


**profileimage:** a link to the profile image  

**retweet_count:** number of times the user has retweeted (or possibly, been retweeted)  

**sidebar_color:** color of the profile sidebar, as a hex value  

**text:** text of a random one of the user’s tweets  

**tweet_coord:** if the user has location turned on, the coordinates as a string with the format “[latitude, longitude]”  

**tweet_count:** number of tweets that the user has posted  

**tweet_created:**  when the random tweet (in the text column) was created  

**tweet_id:** the tweet id of the random tweet 

**tweet_location:** location of the tweet; seems to not be particularly normalized  

**user_timezone:** the timezone of the user
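
Several of these fields arrive as plain strings and need parsing before they can be analysed. The sketch below shows one possible way to do that with the standard library. The date format is an assumption inferred from sample values such as `10/26/15 23:24`, the coordinate value is made up, and `parse_timestamp` and `parse_coord` are hypothetical helper names.

```python
import json
from datetime import datetime

# Assumed format, inferred from sample values such as "10/26/15 23:24".
DATE_FORMAT = "%m/%d/%y %H:%M"


def parse_timestamp(value):
    """Parse a month/day/two-digit-year timestamp; return None for an empty string or None."""
    if not value:
        return None
    return datetime.strptime(value, DATE_FORMAT)


def parse_coord(value):
    """Parse a "[latitude, longitude]" string into a tuple of floats; return None for an empty string or None."""
    if not value:
        return None
    lat, lon = json.loads(value)
    return float(lat), float(lon)


# Sample `created` and `tweet_created` values taken from the first row shown earlier.
created = parse_timestamp("12/5/13 1:48")
tweeted = parse_timestamp("10/26/15 12:40")
account_age_days = (tweeted - created).days
print(created, account_age_days)

# Made-up coordinates in the documented "[latitude, longitude]" format.
print(parse_coord("[41.87, -87.65]"))
```

These helpers expect strings. Missing values read by pandas show up as NaN rather than as empty strings, so on a whole column something like `pd.to_datetime` with `errors="coerce"` is probably the more convenient route.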

## Data Cleaning Description

In [None]:
def read_many(file_path="", expect_col=[], file_list=[], encode="", read_many=False):
  """
  Reads and returns multiple CSV files as pandas dataframes.

  Args:
      file_path (str): Currently unused; each entry of file_list is read as given.
      expect_col (list): Optional list of expected column names to extract from each CSV file.
      file_list (list): List of CSV file names to read.
      encode (str): Optional encoding for reading CSV files; an empty string uses the pandas default.
      read_many (bool): Currently unused; kept so the signature matches data_import.

  Returns:
      list: A list of pandas dataframes, where each dataframe corresponds to a CSV file in file_list.
  """

  # Read every file; fall back to the pandas default encoding when none is given
  dataFrame_list = [pd.read_csv(f_name, encoding=encode or None) for f_name in file_list]

  # Without expected columns, return the frames unchanged
  if not expect_col:
    return dataFrame_list

  # Otherwise keep only the expected columns of each frame
  return [frame.loc[:, expect_col] for frame in dataFrame_list]

In [None]:
def read_one(file_path="", expect_col=[], encode="", drop_NaN=False):
  """
  Reads a single CSV file as a pandas dataframe, optionally drops rows with NaN values, and returns the resulting dataframe.

  Args:
      file_path (str): Path of the CSV file to read.
      expect_col (list): Optional list of expected column names to extract from the CSV file.
      encode (str): Optional encoding for reading the CSV file; an empty string uses the pandas default.
      drop_NaN (bool): Optional boolean to indicate whether to drop rows that contain NaN values.

  Returns:
      pandas.DataFrame: A pandas dataframe that corresponds to the CSV file in file_path, after cleaning.

  Notes:
      If drop_NaN is True, every row with a NaN value in any column is dropped and the result is
      written to "twitter_content_wb.csv". If expect_col is non-empty, only the specified columns
      are retained. If both options are used, NaN rows are dropped first, and then the specified
      columns are retained.
  """

  # Read the file; fall back to the pandas default encoding when none is given
  data_csv = pd.read_csv(file_path, encoding=encode or None)

  # Drop incomplete rows and write the cleaned data back to disk
  if drop_NaN:
    data_csv = data_csv.dropna()
    data_csv.to_csv("twitter_content_wb.csv", index=False)

  # Keep only the expected columns, if any were requested
  if expect_col:
    data_csv = data_csv.loc[:, expect_col]

  return data_csv
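
One thing to watch with `drop_NaN=True`: the NaN check runs over every column before the column selection. The `data.info()` output earlier lists `gender_gold` with only 50 non-null values, so dropping rows that have a NaN in any column would keep at most 50 of the 20050 rows. The sketch below illustrates the difference on a small made-up frame, using the `subset` argument of `DataFrame.dropna` to restrict the check to the columns of interest. The frame contents and the `wanted` list are assumptions for illustration, not the project's data.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the Twitter frame: `gender_gold` is almost entirely missing.
frame = pd.DataFrame(
    {
        "_unit_id": [1, 2, 3, 4],
        "gender": ["male", "female", np.nan, "brand"],
        "text": ["a", "b", "c", "d"],
        "gender_gold": [np.nan, np.nan, np.nan, "brand"],
    }
)
wanted = ["_unit_id", "gender", "text"]  # hypothetical columns of interest

# dropna over all columns keeps only rows that are complete in every column.
all_columns = frame.dropna()

# Restricting the check to the wanted columns keeps every row that is
# complete in those columns, then selects just those columns.
wanted_only = frame.dropna(subset=wanted).loc[:, wanted]

print(len(all_columns), len(wanted_only))
```

Whether rows with missing values in unused columns should be kept is a design decision for the project; the sketch only shows how much the two orders of operation can differ.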

In [None]:
# Import the data and extract the expected columns
def data_import(file_path="", expect_col=[], file_list=[], encode="", read_many=False):
  """
    This function imports data from a CSV file or
    a list of CSV files. It drops unnecessary columns
    from the data and returns the result.

    Parameters:
    file_path (str): the path to the CSV file to import
      (default: "")
    expect_col (list): a list of column names to keep in the data
      (default: [])
    file_list (list): a list of file paths to import if read_many is True
      (default: [])
    encode (str): the encoding used to read the file(s)
      (default: "")
    read_many (bool): True if importing multiple files, False if importing a single file
      (default: False)

    Returns:
    A Pandas DataFrame (or a list of DataFrames when read_many is True),
    or None if loading fails.
  """
  try:
    # When reading multiple files, return a list of frames
    if read_many is True:
      # The read_many flag shadows the module-level read_many() helper here,
      # so look the helper up explicitly before calling it
      read_many_files = globals()["read_many"]
      return read_many_files(file_path, expect_col, file_list, encode, read_many)

    # Case when importing a single file only
    return read_one(file_path, expect_col, encode)

  # Internal error occurred
  except Exception as e:
    try:
      # Fall back to importing the single file given by file_path
      return read_one(file_path, expect_col, encode)
    except Exception as e:
      print("Internal error occurred while loading the CSV file. Try again.", str(e))
      return None

In [None]:
# Data file storage - change CONST_PATHDIR to your own file location
CONST_PATHDIR = "Training Data/twitter_content.csv"
CONST_ENCODES = 'ISO-8859-1'
signi_columns = ['_unit_id', 'gender', 'created', 'description', 'name', 'retweet_count','text']
twitter_Frame = data_import(file_path=CONST_PATHDIR, expect_col=signi_columns, encode=CONST_ENCODES)
twitter_Frame.shape

In [None]:
twitter_Frame.head(10)