## <em>A Big Data Mining Approach Project</em>
## <b>Stress Detection from Social Media Interaction</b>
## Group name: The Enigma Ensemble

### <em>(*) First author:</em>
##### Tri Quan Do (tdo22@uic.edu) - Group Leader
##### Mosrour Tafadar (mtafad2@uic.edu)
##### Hina Khali (hkhali21@uic.edu)
##### Safiya Mustafa (smust3@uic.edu)


## Project Abstract:
Emotional and mental stress are serious issues that 
can significantly affect our well-being. 
Although an emotional experience usually starts as a 
personal, internal process, it frequently leads to 
sharing emotions with others. Emotions that the person 
who experienced them expresses verbally to others are 
referred to as socially shared. People share their 
emotions with others in more than 80% of all emotional 
events, regardless of age, gender, personality 
type, or culture (Bazarova, Choi, Sosik, Cosley, 
Whitlock 1). Because social media is so widely used, 
people are accustomed to posting about their everyday 
activities and connecting with acquaintances on these 
platforms, which makes it possible to use information 
from online social networks to identify stress.

## Project Introduction

The first step of this project is to identify a set of 
words that are commonly associated with emotional 
stress. Using this set of words, the models compute an 
overall stress score for each individual under 
investigation. Some words, however, carry a higher 
intensity than others. The project therefore separates 
the identified words into three categories based on 
intensity (high, moderate, and low) and, in parallel, 
conducts a word frequency analysis to identify 
frequently occurring words or phrases, specifically 
those that pertain to emotions or stress. This approach 
is expected to provide insight into the patterns and 
associations between language use and emotional stress, 
contributing to the existing knowledge base on the 
topic.<br><br>
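
As a rough illustration of the scoring idea described above, the sketch below computes an intensity-weighted stress score and the frequency of matched words for one user's posts. The lexicon entries, the tier weights, and the function names (`tokenize`, `stress_score`) are hypothetical placeholders chosen for this example, not the project's actual word list or scoring rule.

```python
import re
from collections import Counter

# Hypothetical lexicon: the words and tiers are illustrative placeholders,
# not the project's actual stress vocabulary.
STRESS_LEXICON = {
    "high": {"panic", "hopeless", "overwhelmed"},
    "moderate": {"anxious", "stressed", "worried"},
    "low": {"tired", "busy", "nervous"},
}

# Assumed weight for each intensity tier.
TIER_WEIGHTS = {"high": 3.0, "moderate": 2.0, "low": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def stress_score(posts):
    """Return (score, term_counts) for one user's list of posts.

    The score is the weighted number of lexicon hits divided by the
    total number of tokens, so long posting histories are not favored.
    """
    tokens = [tok for post in posts for tok in tokenize(post)]
    counts = Counter(tokens)
    weighted_hits = 0.0
    term_counts = Counter()
    for tier, words in STRESS_LEXICON.items():
        for word in words:
            if counts[word]:
                weighted_hits += TIER_WEIGHTS[tier] * counts[word]
                term_counts[word] = counts[word]
    score = weighted_hits / len(tokens) if tokens else 0.0
    return score, term_counts
```

Calling `stress_score(["I feel overwhelmed and anxious today", "So tired, so busy"])` returns the normalized score together with a `Counter` of the matched words, which doubles as a simple word frequency table. Normalizing by token count is one possible design choice; a raw weighted sum would work as well if post volume should count toward the score.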

Support Vector Machines (SVM) and MapReduce are robust 
technologies for processing and analyzing massive 
amounts of social media data, and they can be used to 
predict stress levels from social media posts. 
SVM is a supervised machine learning algorithm that 
classifies data by finding the hyperplane that best 
separates the classes, that is, the one with the largest 
margin. MapReduce is a programming model and software 
framework for processing large datasets in parallel on 
a distributed computing system.
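
To make the MapReduce model more concrete, here is a minimal single-process sketch of a word count, the kind of job a word frequency analysis could be built on. It only imitates the three phases (map, shuffle, reduce) in plain Python; in a real deployment a framework such as Hadoop or Spark would run these phases across many workers. The function names are assumptions made for this example.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def map_phase(post):
    """Map: emit a (word, 1) pair for every word in one post."""
    return [(word, 1) for word in post.lower().split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the grouped values to get one total per word."""
    return {
        word: reduce(lambda a, b: a + b, counts)
        for word, counts in grouped.items()
    }


def word_count(posts):
    """Run map, shuffle, and reduce in sequence on a single machine."""
    mapped = chain.from_iterable(map_phase(post) for post in posts)
    return reduce_phase(shuffle_phase(mapped))
```

For instance, `word_count(["so stressed", "so tired so busy"])` groups the three `("so", 1)` pairs under one key before summing them. Because each post is mapped independently and each key is reduced independently, both phases can be distributed, which is what makes the model suitable for large datasets.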

Full project information can be found here: <"add link to document">

In [None]:
#######################################################
###########   ENVIRONMENT SETTING UP   ################
!pip install pandas
!pip install numpy
!pip install -U scikit-learn
!pip install seaborn
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq

In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Modeling for Machine Learning Task
from sklearn.linear_model import LinearRegression 
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import RFECV

import pyspark
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

## Data Description

## Data Cleaning Description

In [None]:
# Clean the data to extract expected columns
def data_import(file_path="", sig_col=[]):
  try:
    dataFrame = pd.read_csv(file_path)
    
  except Exception as e:
    print("Internal error when loading dataFrame, try again:", str(e))
    return None
