# Mood Journal 

### Description 
As part of my 2020 New Year's resolution, and in an effort to better remember my life, I began writing a mood journal. Every day, I wrote a few paragraphs about my day and ranked my mood on a scale from -3 to +3. While taking Introduction to Data Mining that spring semester, I realized I could train a supervised machine learning model on my journal and observe how the description of my day relates to my mood.

### Method 
Before diving into the details and the code, here is a brief overview of my approach. I used a private Instagram account as my mood journal, attaching a photo and a caption for each day, so I first needed to extract the caption text. Once I had the text, I used the 'bag of words' approach to summarize it. Finally, I used various supervised machine learning classifiers to model and predict my mood.

#### Data Extraction
I downloaded a JSON file containing all the relevant data from my Instagram account and loaded it into a Python dictionary. Because I had ranked my mood on a scale from -3 to +3, I decided to collapse the ranks into three labels: positive, zero, and negative. It might have been better to ignore zero and keep only two labels, but as shown below, zeros made up 35% of the entries, which is too large a share to discard.

In [24]:
import os
import json
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
import re
from nltk.stem.porter import PorterStemmer

# load the Instagram export (media.json sits one directory above the notebook)
notebook_path = os.path.abspath("MoodJournal.ipynb")
with open(os.path.join(os.path.dirname(notebook_path), '..', 'media.json')) as f:
    data = json.load(f)
    
arr = data['photos']
arr.reverse()

mood = []
time = []
text = []
data = {'Text': [],
        'Label': [] }
# extract the text (features) and mood labels from each caption
for con in arr:
    caption = con['caption']
    # keep captions of the form "[rank] description", at most one per timestamp
    if '[' in caption and ']' in caption and con['taken_at'] not in time:
        rank = caption[caption.index('[') + 1 : caption.index(']')]
        description = caption[caption.index(']') + 1 : ]
        # skip captions whose bracketed rank is not a signed or unsigned number
        if not ('+' in rank or '-' in rank or rank.isnumeric()):
            continue
        rank = int(rank)
        time.append(con['taken_at'])
        mood.append(rank)

        # collapse the -3..+3 rank into a label: negative (-1), zero (0), positive (1)
        if rank < 0:
            label = -1
        elif rank > 0:
            label = 1
        else:
            label = 0
        data['Label'].append(label)
        text.append(description)
data['Text'] = text
df = pd.DataFrame(data)
texts = df['Text'].astype(str)
y = df['Label']


# print("number of rows, columns: ",df.shape)
pd.set_option('display.max_rows', None)
# check how many entries are labeled zero
n_zero = int((df['Label'] == 0).sum())
print(df[df['Label'] == 0].count())
print(df.shape)
print("As shown, there are " + str(n_zero) + " zero values which is " + str(n_zero / len(df)) + " of the dataset")



Text     47
Label    47
dtype: int64
(134, 2)
As shown, there are 47 zero values which is 0.35074626865671643 of the dataset
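
The caption parsing in the cell above can also be sketched in isolation. The version below is a minimal alternative, not the code used for the results: it assumes every caption has the layout `[rank] description`, and the names `CAPTION_RE` and `parse_caption` as well as the sample captions are made up for illustration.

```python
import re

# Assumed caption layout: "[rank] description", e.g. "[+2] Went hiking".
CAPTION_RE = re.compile(r"\[([+-]?\d+)\]\s*(.*)", re.DOTALL)

def parse_caption(caption):
    """Return (label, description), or None if the caption has no rank."""
    match = CAPTION_RE.search(caption)
    if match is None:
        return None
    rank = int(match.group(1))
    # Collapse the -3..+3 rank to its sign: -1, 0, or 1.
    label = (rank > 0) - (rank < 0)
    return label, match.group(2)

# Hypothetical captions, not taken from the journal.
print(parse_caption("[+2] Went hiking with friends."))
print(parse_caption("[-3] Failed my exam."))
print(parse_caption("[0] Nothing special."))
print(parse_caption("No rank here"))
```

Unlike the cell above, this sketch strips the whitespace after the closing bracket and silently ignores captions without a bracketed number.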


#### Vectorizing Words
Given the text of each entry, I could use the 'bag of words' approach, which represents each entry by how often each word appears in it and ignores word order. Like Naive Bayes, it assumes that words are independent of one another and that meaningful conclusions can be drawn from the frequency of each word. An important step in this process is filtering the words. I chose to use 'stemming', which trims a word down to its root (for example, 'running' becomes 'run'), thereby merging redundant variants of the same word.
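
As a toy illustration of the idea, the sketch below builds a bag-of-words count matrix by hand using only the standard library. The two entries are invented for the example, and the whitespace tokenization is a simplification of what `CountVectorizer` does.

```python
from collections import Counter

# Hypothetical journal-style entries, used only for illustration.
entries = [
    "good day good food",
    "long day at work",
]

# Tokenize by whitespace and count how often each word occurs per entry.
counts = [Counter(entry.split()) for entry in entries]

# The vocabulary is every distinct word, in a fixed (sorted) column order.
vocab = sorted(set(word for c in counts for word in c))

# One row per entry, one column per vocabulary word; word order is discarded.
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Each row is the feature vector a classifier would see for that entry, so two entries that use the same words in a different order become indistinguishable.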

In [28]:
# stem each word so that inflected forms share a single count
porter_stemmer = PorterStemmer()
def my_preprocessor(text):
    # replace everything except ASCII letters and digits with a space, then lowercase and split
    words = re.sub(r"[^A-Za-z0-9]", " ", text).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return ' '.join(words)

# vectorize the text (bag of words)
# min_df=0.01 with 134 documents (rows) gives a threshold of 1.34, so a word is kept only if it appears in at least two documents
# note: passing preprocessor=my_preprocessor resulted in lower accuracy
vectorizer = CountVectorizer(stop_words='english', min_df=0.01, preprocessor=my_preprocessor)
# X = vectorizer.fit_transform(texts)

#filtering out words
# print(vectorizer.get_feature_names())
# print("size: " + str(X.shape))
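
To see what a fractional `min_df` does, here is a small sketch on a made-up corpus. The texts, the explicit stop-word list, and the `toy_` names are assumptions for illustration only; the journal itself uses `stop_words='english'` and `min_df=0.01`. With a float, scikit-learn multiplies `min_df` by the number of documents and drops any term whose document frequency falls below that value.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the journal entries.
toy_texts = [
    "went running with friends",
    "running late for class",
    "studied for the exam",
    "friends came over for dinner",
]

# min_df=0.5 with 4 documents gives a threshold of 2.0, so only terms
# found in at least 2 documents survive. The stop-word list is a small
# stand-in for the built-in English list.
toy_vectorizer = CountVectorizer(stop_words=["for", "the", "with", "over"], min_df=0.5)
toy_X = toy_vectorizer.fit_transform(toy_texts)

# vocabulary_ maps each kept term to its column index.
vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)
print(toy_X.toarray())
```

The same rule applied to the journal gives 0.01 × 134 = 1.34, so a stemmed word has to occur in at least two entries to become a feature.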

