## <em>A Big Data Mining Approach Project</em>
## <b>Stress Detection from Social Media Interaction</b>
## Group name: The Enigma Ensemble

### <em>(*) First author:</em>
##### Tri Quan Do (tdo22@uic.edu) - Group Leader
##### Mosrour Tafadar (mtafad2@uic.edu)
##### Hina Khali (hkhali21@uic.edu)
##### Safiya Mustafa (smust3@uic.edu)


## Project Abstract:
Emotional and mental stress are serious issues that can have a significant impact on our well-being. Although an emotional experience usually starts as a personal, internal process, it frequently leads to sharing emotions with others. Emotions that an individual verbally expresses to others are said to be socially shared. People share their emotions with others in more than 80% of all emotional events, regardless of their age, gender, personality type, or culture (Bazarova, Choi, Sosik, Cosley, Whitlock 1). Because social media is so widely used, people are accustomed to posting about their everyday activities and connecting with acquaintances on these platforms, which makes it possible to use information from online social networks to identify stress.

## Project Introduction

The initial step of this research project involves identifying a set of words that are commonly associated with emotional stress. Using this set of words, the models aim to compute an overall stress score for each individual under investigation. However, some words carry a higher intensity than others. Hence, the project will group the identified words into three categories based on their intensity level: high, moderate, and low. In parallel, it will conduct a word frequency analysis to identify words or phrases that occur frequently, specifically those that pertain to emotions or stress. This approach is expected to provide insight into the patterns and associations between language use and emotional stress, thereby contributing to the existing knowledge base on the topic.
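
The scoring idea described above can be sketched as follows. This is only an illustration under stated assumptions: the lexicon words, the three tier weights (3, 2, 1), and the helper names `tokenize`, `word_frequencies`, and `stress_score` are hypothetical and are not the project's actual word list or model.

```python
from collections import Counter
import re

# Hypothetical lexicon: words and tiers are illustrative, not the project's actual list
STRESS_LEXICON = {
    "high": {"panic", "hopeless", "overwhelmed"},
    "moderate": {"anxious", "stressed", "worried"},
    "low": {"tired", "busy", "annoyed"},
}
# Assumed weight for each intensity tier
TIER_WEIGHTS = {"high": 3, "moderate": 2, "low": 1}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(posts):
    """Count how often each lexicon word appears across all posts."""
    lexicon_words = set().union(*STRESS_LEXICON.values())
    counts = Counter()
    for post in posts:
        counts.update(tok for tok in tokenize(post) if tok in lexicon_words)
    return counts


def stress_score(posts):
    """Sum the tier weight of every lexicon word occurrence in a user's posts."""
    counts = word_frequencies(posts)
    score = 0
    for tier, words in STRESS_LEXICON.items():
        score += TIER_WEIGHTS[tier] * sum(counts[w] for w in words)
    return score


posts = [
    "So stressed and anxious about finals, totally overwhelmed",
    "Busy day, feeling tired but fine",
]
print(word_frequencies(posts))
print(stress_score(posts))  # 2 + 2 + 3 + 1 + 1 = 9
```

A weighted sum is only one possible way to combine the tiers; normalizing by the number of posts or tokens would make scores comparable across users who post at different rates.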

Robust technologies for processing and analyzing massive amounts of social media data include Support Vector Machines (SVM) and MapReduce, which can be used to forecast stress levels based on social media posts. SVM is a supervised machine learning algorithm that classifies data by finding the hyperplane that best separates the classes, that is, the one with the largest margin. MapReduce is a programming model and software framework for processing large datasets in parallel on a distributed computing system.
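
To make the MapReduce model concrete, here is a minimal single-machine sketch of its three phases (map, shuffle, reduce) applied to a word count. It uses only the Python standard library and does not run on a cluster; the function names and the sample posts are assumptions made for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(post):
    """Map: emit a (word, 1) pair for every word in one post."""
    return [(word, 1) for word in post.lower().split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine each key's values into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}


def word_count(posts):
    # On a real cluster each post (or partition) would be mapped on a different worker
    mapped = [pair for post in posts for pair in map_phase(post)]
    return reduce_phase(shuffle_phase(mapped))


counts = word_count(["so stressed today", "stressed and tired", "tired today"])
print(counts)
```

In PySpark, the same phases correspond roughly to `flatMap`, `map`, and `reduceByKey` on an RDD, with the shuffle handled by the framework.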

Full project information can be found here: <"add link to document">

In [None]:
#######################################################
###########   ENVIRONMENT SETTING UP   ################
!pip install pandas
!pip install numpy
!pip install -U scikit-learn
!pip install seaborn
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq

In [None]:
###########################################################################################
## The code below may raise errors outside the Google Colab environment.                 ##
## Keep it commented out when working on a local machine.                                ##
###########################################################################################

# from pydrive.auth import GoogleAuth
# from pydrive.drive import GoogleDrive
# from google.colab import auth
# from oauth2client.client import GoogleCredentials
#
# # Authenticate and create the PyDrive client
# auth.authenticate_user()
# gauth = GoogleAuth()
# gauth.credentials = GoogleCredentials.get_application_default()
# drive = GoogleDrive(gauth)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Modeling for Machine Learning Task
from sklearn.linear_model import LinearRegression 
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import RFECV

import pyspark
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

## Data Description

## Data Cleaning Description

In [None]:
def read_many(file_path="", expect_col=None, file_list=None, encode="", read_many=False):
  """
  Reads and returns multiple CSV files as pandas dataframes.

  Args:
      file_path (str): Currently unused; kept so existing callers keep working.
      expect_col (list): Optional list of expected column names to extract from each CSV file.
      file_list (list): List of CSV file names to read.
      encode (str): Optional encoding type for reading CSV files.
      read_many (bool): Currently unused; kept so existing callers keep working.

  Returns:
      list: A list of pandas dataframes, where each dataframe corresponds to a CSV file in file_list.
  """

  # Avoid mutable default arguments
  expect_col = expect_col or []
  file_list = file_list or []

  # Read every file with the requested encoding (None lets pandas use its default)
  dataFrame_list = [pd.read_csv(f_name, encoding=encode or None) for f_name in file_list]

  # Keep only the expected columns when a column list is given
  if len(expect_col) > 0:
    return [frame.loc[:, expect_col] for frame in dataFrame_list]

  return dataFrame_list

In [None]:
def read_one(file_path="", expect_col=None, encode="", drop_NaN=False):
  """
  Reads a single CSV file as a pandas dataframe, optionally drops rows with NaN values, and returns the resulting dataframe.

  Args:
      file_path (str): Path of the CSV file to read.
      expect_col (list): Optional list of expected column names to extract from the CSV file.
      encode (str): Optional encoding type for reading the CSV file.
      drop_NaN (bool): Optional boolean to indicate whether to drop rows that contain NaN values.

  Returns:
      pandas.DataFrame: A pandas dataframe that corresponds to the CSV file in file_path, after cleaning.

  Notes:
      If drop_NaN is True, rows with NaN values are dropped and the cleaned data is written back to
      "twitter_content_wb.csv". If expect_col is non-empty, only the specified columns are retained.
      If both options are used, NaN rows are dropped first, and then the specified columns are retained.
  """

  # Case when importing 1 single file only (an empty encoding falls back to the pandas default)
  data_csv = pd.read_csv(file_path, encoding=encode or None)

  # Drop unclean data or incomplete data rows
  if drop_NaN:
    data_csv.dropna(inplace=True)           # drop rows with missing values
    data_csv.to_csv("twitter_content_wb.csv", index=False) # Write back

  # Keep only the expected columns
  if expect_col:
    data_csv = data_csv.loc[:, expect_col]

  return data_csv
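
The two cleaning steps used by `read_one` (dropping incomplete rows, then keeping the expected columns) can be tried on a tiny in-memory CSV. The three sample rows below are made up for illustration; the column names simply mirror a few of those selected later in this notebook, plus a hypothetical `extra` column.

```python
import io

import pandas as pd

# Hypothetical three-row sample, not taken from the real dataset
sample_csv = io.StringIO(
    "_unit_id,gender,text,extra\n"
    "1,male,feeling fine,a\n"
    "2,,so stressed,b\n"
    "3,female,need a break,c\n"
)

frame = pd.read_csv(sample_csv)

# Step 1: drop rows with any missing value (row 2 has no gender)
cleaned = frame.dropna()

# Step 2: keep only the expected columns
selected = cleaned.loc[:, ["_unit_id", "text"]]

print(selected)
```

Note that `dropna()` with its default arguments removes rows, not columns; dropping columns would require `axis=1`.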

In [None]:
# Clean the data to extract expected columns
# Alias the helper: inside data_import the boolean flag `read_many` shadows the function name
_read_many_files = read_many

def data_import(file_path="", expect_col=[], file_list=[], encode="", read_many=False):
  """
    This function imports Twitter data from a CSV file or
    a list of CSV files. It drops unnecessary columns
    from the data and returns the result.

    Parameters:
    file_path (str): the path to the CSV file to import
      (default: "")
    expect_col (list): a list of column names to keep in the data
      (default: [])
    file_list (list): a list of file paths to import if read_many is True
      (default: [])
    encode (str): the encoding used to read the CSV file(s)
      (default: "")
    read_many (bool): True if importing multiple files, False if importing a single file
      (default: False)

    Returns:
    A Pandas DataFrame (or a list of DataFrames if read_many is True),
    or None if the file(s) could not be loaded.
  """
  try:
    # When reading multiple files, return a list of frames
    if read_many is True:
      return _read_many_files(file_path, expect_col, file_list, encode, read_many)

    # Case when importing 1 single file only
    return read_one(file_path, expect_col, encode)

  # Loading failed (missing file, wrong encoding, unknown column, ...)
  except Exception as e:
    print("An internal error occurred while loading the CSV file. Try again:", str(e))
    return None
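
One pitfall worth keeping in mind with a signature like `data_import(..., read_many=False)`: a parameter that shares its name with a module-level function hides that function inside the body. The sketch below reproduces the effect with hypothetical stand-in names (`load_all`, `importer_shadowed`, `importer_fixed`); it is not the notebook's own code.

```python
def load_all(names):
    """Stand-in helper that pretends to load several files."""
    return [f"loaded {name}" for name in names]


def importer_shadowed(names, load_all=False):
    # Inside this body `load_all` is the boolean flag, not the helper above
    if load_all:
        return load_all(names)  # raises TypeError: 'bool' object is not callable
    return names[:1]


def importer_fixed(names, load_many=False):
    # A distinct flag name leaves the helper reachable
    if load_many:
        return load_all(names)
    return names[:1]


try:
    importer_shadowed(["a.csv", "b.csv"], load_all=True)
except TypeError as err:
    print("shadowed:", err)

print(importer_fixed(["a.csv", "b.csv"], load_many=True))
```

Renaming the flag or binding the helper to a second name both avoid the clash. A broad `except Exception` around such a call would hide the `TypeError`, so the failure can go unnoticed.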

In [None]:
# Data file storage - users can change CONST_PATHDIR to their own location
CONST_PATHDIR = "Training Data/twitter_content.csv"
CONST_ENCODES = 'ISO-8859-1'
signi_columns = ['_unit_id', 'gender', 'created', 'description', 'name', 'retweet_count','text']
twitter_Frame = data_import(file_path=CONST_PATHDIR, expect_col=signi_columns, encode=CONST_ENCODES)
twitter_Frame.shape

In [None]:
twitter_Frame.head(10)