# Mood Journal 

### Description 
As part of my 2020 New Year's resolution, and in an effort to better remember my life, I began writing a mood journal. Every day, I wrote a few paragraphs about my day and rated my mood with a number between -3 and +3. While taking Introduction to Data Mining that spring semester, I realized I could train a supervised machine learning model on my journal and observe how the description of my day relates to my mood.

### Method 
Before diving into the details and the code, here is a brief overview of my approach. I used a private Instagram account as my mood journal, attaching a photo and caption for each day. I first needed to extract the text. Once I achieved this, I used the 'bag of words' approach to summarize the textual data. And finally, I used various supervised machine learning classifiers to model and predict my mood. 

#### Data Extraction
I downloaded a JSON file with all the relevant data from my Instagram account and loaded it into a Python dictionary. As I had rated my mood on a scale of -3 to +3, I decided to collapse the ratings into three labels: positive, zero, and negative. It may have been better to ignore zero and have only two labels, but as shown below, the share of zeros was quite significant at 35%.

In [7]:
import os
import json
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
import re
from nltk.stem.porter import PorterStemmer

#extract json file into python 
notebook_path = os.path.abspath("MoodJournal.ipynb")
with open(os.path.dirname(notebook_path) + '/../media.json') as f:
    data = json.load(f)
    
arr = data['photos']
arr.reverse()

mood = []
time = []
text = []
data = {'Text': [],
        'Label': [] }
#extracting features and labels
for con in arr:
    caption = con['caption']
    if len(caption) != 0:
        if caption.find('[') != -1 and caption.find(']') != -1 and con['taken_at'] not in time:
            rank = caption[caption.index('[') + 1 : caption.index(']')]
            description = caption[caption.index(']') + 1 : ]
            time.append(con['taken_at'])
            # parse the mood label into an integer; int() handles '+2', '-1' and '0'
            if '+' in rank or '-' in rank or rank.isnumeric():
                mood.append(int(rank))
            else:
                continue

            # label as positive, zero, negative
            rank = int(rank)
            if rank < 0:
                rank = -1
            elif rank > 0:
                rank = 1
            else:
                rank = 0
            data['Label'].append(int(rank))
            text.append(description)
data['Text'] = text
df = pd.DataFrame(data)
texts = df['Text'].astype(str)
y = df['Label']


# print("number of rows, columns: ",df.shape)
pd.set_option('display.max_rows', None)
#checking number of zeroes
print(df[df['Label'] == 0].count())
print("shape: " + str(df.shape))
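n_zero = int((df['Label'] == 0).sum())  # count the zero labels instead of hard-coding them
print("As shown, there are " + str(n_zero) + " zero values which is " + str(n_zero / len(df)) + " of the dataset")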



Text     47
Label    47
dtype: int64
shape: (134, 2)
As shown, there are 47 zero values which is 0.35074626865671643 of the dataset
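
To make the labelling step concrete, here is a minimal sketch of how a single caption could be split into a score and a description and then collapsed into one of the three labels. The `[+2] text` caption layout is inferred from the parsing code above, and `parse_caption` and `to_label` are hypothetical helper names that do not appear in the notebook.

```python
import re

# Assumed caption layout: "[<score>] <description>", e.g. "[+2] Went hiking".
CAPTION_RE = re.compile(r"\[\s*([+-]?\d+)\s*\](.*)", re.DOTALL)

def parse_caption(caption):
    """Return (score, description), or None if no bracketed score is found."""
    match = CAPTION_RE.search(caption)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()

def to_label(score):
    """Collapse a -3..+3 score to -1 (negative), 0 (zero), or 1 (positive)."""
    return (score > 0) - (score < 0)

samples = ["[+2] Went hiking with friends", "[-1] Missed the bus", "[0] Quiet day", "no score here"]
for caption in samples:
    parsed = parse_caption(caption)
    if parsed is not None:
        score, description = parsed
        print(to_label(score), description)
```

Captions without a bracketed score are skipped, which mirrors the way the extraction loop ignores entries it cannot parse.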


### Vectorizing words
Given the text of each entry, I could use the 'bag of words' approach. Like a Naive Bayes model, it treats each word as independent of the others and assumes that meaningful conclusions can be drawn from word frequencies alone, ignoring word order. Thus, I can 'vectorize' my mood journal by creating, for each entry, a 1D array holding the frequency of each word in that entry. I used CountVectorizer for this step.
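
As a rough illustration of the idea (a plain-Python sketch, not what `CountVectorizer` does internally), each entry becomes a row of word counts over a shared vocabulary. The two sample entries and the `bag_of_words` helper are made up for the example.

```python
from collections import Counter

def bag_of_words(entries):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in entries[i]."""
    tokenized = [entry.lower().split() for entry in entries]
    vocabulary = sorted({word for words in tokenized for word in words})
    rows = []
    for words in tokenized:
        counts = Counter(words)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = bag_of_words(["good food good friends", "bad food"])
print(vocabulary)  # ['bad', 'food', 'friends', 'good']
for row in rows:
    print(row)     # [0, 1, 1, 2] then [1, 1, 0, 0]
```

Word order is lost in this representation: only the count of each word per entry survives.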

An important part of this step is filtering the words to reduce redundancy, as otherwise the vectors would become too large and too specific. One method is 'stemming', which trims a word down to its root and thereby groups similar words into one. Additionally, I used the default English 'stop words' list, which keeps any word on the list out of the vector. Common stop words include 'the' and 'a'.

__play around with different settings of preprocessing__ 

In [8]:
#using stemming to reduce word counts 
porter_stemmer = PorterStemmer()
def my_preprocessor(text):
    text = re.sub("\\W", " ", text)
    words = re.sub(r"[^A-Za-z0-9\-]", " ", text).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return ' '.join(words)

#vectorize the text (bag of words)
#min_df=0.01 with 134 documents (rows) gives a cut-off of 1.34, so a word is kept only if it appears in at least two entries
#preprocessor = my_preprocessor -> lower accuracy
vectorizer = CountVectorizer(stop_words= 'english', min_df=0.01, preprocessor=my_preprocessor)
X = vectorizer.fit_transform(texts)

#filtering out words
# print(vectorizer.get_feature_names())
print("size: " + str(X.shape))



size: (134, 1263)
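
The `min_df=0.01` setting in the cell above is a proportion of documents, so with 134 entries the cut-off is 1.34 documents and a word has to occur in at least two entries to be kept. The sketch below reproduces that arithmetic and the filtering rule in plain Python. `min_document_count`, `filter_by_min_df`, and the toy entries are hypothetical and only illustrate the rule.

```python
import math

def min_document_count(min_df, n_documents):
    """Smallest whole number of documents that satisfies a fractional min_df."""
    return math.ceil(min_df * n_documents)

def filter_by_min_df(entries, min_df):
    """Keep words whose document frequency is at least min_df * len(entries)."""
    doc_freq = {}
    for entry in entries:
        # set() so that a word repeated within one entry is counted once
        for word in set(entry.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    threshold = min_df * len(entries)
    return sorted(word for word, freq in doc_freq.items() if freq >= threshold)

print(min_document_count(0.01, 134))  # 2
print(filter_by_min_df(["good food", "bad food", "good day"], 0.5))  # ['food', 'good']
```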


### Supervised Modelling 
I chose four different supervised modelling techniques: a linear support vector machine, a decision tree, Naive Bayes, and logistic regression. As the number of words was high, I expected the decision tree to suffer from the curse of dimensionality and thus give the lowest accuracy.

I estimated each model's accuracy with 5-fold cross-validation, so that every score is measured on entries the model did not see during training and no single train/test split dominates the result.
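
For readers unfamiliar with cross-validation, the following simplified sketch shows how k-fold splitting partitions the entries so that each one is used for testing exactly once. It is plain, unstratified k-fold written for illustration. The hypothetical `k_fold_indices` helper is not expected to reproduce the exact splits that scikit-learn's `cross_val_score` chooses for classifiers.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

for train, test in k_fold_indices(134, 5):
    print(len(train), len(test))
```

Each model is fitted on the training indices of a fold and scored on its test indices, and the five scores are then averaged.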

In [9]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

model = LinearSVC(class_weight='balanced', dual=False, tol=1e-2, max_iter=100000)  # max_iter must be an integer
print(cross_val_score(model, X, y, cv=5))
print("SVC: ", cross_val_score(model, X, y, cv=5).mean())

tree = DecisionTreeClassifier(criterion = 'entropy')
print(cross_val_score(tree, X, y, cv=5))
print("tree: ", cross_val_score(tree, X, y, cv=5).mean())

#GaussianNB needs a dense array, so convert the sparse count matrix
gaussian = GaussianNB()
print(cross_val_score(gaussian, X.toarray(), y, cv=5))
print("naive: ", cross_val_score(gaussian, X.toarray(), y, cv=5).mean())

regression = LogisticRegression()
print(cross_val_score(regression, X, y, cv=5))
print("regression: ", cross_val_score(regression, X, y, cv = 5).mean())


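#checking the words most strongly tied to a bad mood
#coef_[0] holds the weights for classes_[0], i.e. the negative label (-1),
#so large positive weights push towards "negative" and large negative weights push away from it
final_model = regression
final_model.fit(X, y)
#get_feature_names() is get_feature_names_out() in scikit-learn 1.0 and later
feature_to_coef = {word: coef for word, coef in zip(vectorizer.get_feature_names(), final_model.coef_[0])}
for toward_negative in sorted(feature_to_coef.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(toward_negative)

for away_from_negative in sorted(feature_to_coef.items(), key=lambda x: x[1])[:5]:
    print(away_from_negative)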

[0.62068966 0.62962963 0.65384615 0.38461538 0.61538462]
SVC:  0.5808330877296395
[0.48275862 0.40740741 0.38461538 0.42307692 0.5       ]
tree:  0.42310639552018864
[0.44827586 0.48148148 0.5        0.42307692 0.42307692]
naive:  0.4551822379408586
[0.62068966 0.59259259 0.69230769 0.34615385 0.57692308]
regression:  0.5657333726299243
('wasn', 0.6771447822713577)
('everyth', 0.4893144472959386)
('forgot', 0.43602633992652334)
('food', 0.3903594171681731)
('bad', 0.3742116809889241)
('did', -0.45665752715382263)
('good', -0.4153777740518988)
('algo', -0.41021722698426893)
('went', -0.40572422379352713)
('thi', -0.34306087368847205)
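
When reading these coefficients, it helps to remember that for a multiclass `LogisticRegression` each row of `coef_` belongs to the class at the same position in `classes_`, which is sorted, so row 0 belongs to the label -1. This would explain why words such as 'bad' and 'forgot' carry the largest weights in the first list. The toy data below is invented purely to show the ordering.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy count matrix: column 0 = "bad", column 1 = "okay", column 2 = "good" (made-up words).
X_toy = np.array([[2, 0, 0], [1, 0, 0], [0, 2, 0], [0, 1, 0], [0, 0, 2], [0, 0, 1]])
y_toy = np.array([-1, -1, 0, 0, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)
print(clf.classes_)     # labels in sorted order: -1, 0, 1
print(clf.coef_.shape)  # one row of weights per class: (3, 3)

# The largest weight in each row points at the column most tied to that class.
for label, row in zip(clf.classes_, clf.coef_):
    print(label, int(np.argmax(row)))
```

To list the words associated with a good mood instead, one would read the row whose position matches the label 1 in `classes_`.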




### Conclusion 
As predicted, the decision tree gave the worst average accuracy. Surprisingly, however, the Naive Bayes results were almost as low as the decision tree's. Perhaps this is because the words are not truly independent of each other, which distorts the results. The linear SVM performed best, at 0.58, with logistic regression close behind at 0.57. A plausible explanation is that linear decision boundaries cope better with the high dimensionality than the other two models do.
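
As a sanity check on these averages, the per-fold scores printed in the modelling output can be summarised in plain Python. The spread is worth reading alongside the mean, since one fold scores far lower than the rest for both linear models. The decision tree is left out because its printed mean does not equal the average of its printed fold scores, which suggests the two `cross_val_score` calls built different randomized trees.

```python
from statistics import mean, stdev

# Per-fold accuracies copied from the cross-validation output.
fold_scores = {
    "SVC": [0.62068966, 0.62962963, 0.65384615, 0.38461538, 0.61538462],
    "naive": [0.44827586, 0.48148148, 0.5, 0.42307692, 0.42307692],
    "regression": [0.62068966, 0.59259259, 0.69230769, 0.34615385, 0.57692308],
}

for name, scores in fold_scores.items():
    # Mean accuracy and sample standard deviation across the five folds.
    print(name, round(mean(scores), 4), round(stdev(scores), 4))
```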