## SONAR: Emotionally Intelligent Music Recommendations

Draws heavily from [this tutorial](https://ahmedbesbes.com/sentiment-analysis-on-twitter-using-word2vec-and-keras.html)

Welcome to Sonar! This project is what first inspired me to explore machine learning. The premise is simple: we often want different music depending on our mood, but finding that music can be difficult. Sonar removes that difficulty: you enter a short mood message, and Sonar generates Spotify recommendations using both your old favorites and the mood it detects.

Enjoy!

In [35]:
import re

import bokeh
import gensim
from gensim.models.word2vec import Word2Vec
import numpy as np
from nltk.tokenize import TweetTokenizer
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm_notebook, tqdm

Various document-wide utilities are set up below.

In [37]:
nlp = spacy.load("en")
tqdm_notebook().pandas(desc="progress-bar")
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
TaggedDocument = gensim.models.doc2vec.TaggedDocument

### Preprocessing

- Load the tweets into a usable format. Binarize sentiment from {0, 1, 2, 3, 4} to {0, 1}, treating scores of 2 or higher as "positive". Remove null values and reindex.

- Tokenize the tweets and keep only the first n rows that we aim to use in training.
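
The binarization rule in the first step can be sketched in isolation. This is a minimal illustration only: `binarize` and its `cutoff` parameter are hypothetical names, not part of the notebook's code.

```python
# Hypothetical helper: map a raw polarity score to a binary label.
def binarize(polarity, cutoff=2):
    """Return 1 when polarity is at or above the cutoff, else 0."""
    return 1 if polarity >= cutoff else 0


raw_scores = [0, 1, 2, 3, 4]
labels = [binarize(score) for score in raw_scores]
print(labels)  # [0, 0, 1, 1, 1]
```

With a cutoff of 2, neutral scores are grouped with the positive class; a stricter cutoff would change that choice.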

In [13]:
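def load_tweets():
    # No header is specified, so pandas uses the first record as column names;
    # that is why the polarity column is addressed as "0".
    tweets_train = pd.read_csv("/home/malits/data/twitter_sentiment_analysis/training.1600000.processed.noemoticon.csv", 
                               encoding="ISO-8859-1")
    data = pd.DataFrame()
    # Binarize polarity: scores of 2 or higher become 1 (positive), the rest 0.
    data["sentiment"] = tweets_train["0"].apply(lambda v: 1 if (v >= 2) else 0)
    data["sentiment"] = data["sentiment"].map(int)
    # The tweet text is the sixth column.
    data["tweet"] = tweets_train.iloc[:, 5]
    # Drop nulls before casting to str; after the cast, NaN would become the string "nan".
    data = data[data["tweet"].notnull()].copy()
    data["tweet"] = data["tweet"].astype(str)
    data = data.reset_index(drop=True)

    return data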

In [14]:
data = load_tweets()
data.head()

| | sentiment | tweet |
|---|---|---|
| 0 | 0 | is upset that he can't update his Facebook by ... |
| 1 | 0 | @Kenichan I dived many times for the ball. Man... |
| 2 | 0 | my whole body feels itchy and like its on fire |
| 3 | 0 | @nationwideclass no, it's not behaving at all.... |
| 4 | 0 | @Kwesidei not the whole crew |


In [40]:
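def tokenize(tweet):
    # Strip hashtags, URLs, and the punctuation marks ? ! . ,
    # (alternation instead of a single character class, which would not compile as intended)
    sub_regex = r'#[A-Za-z0-9]+|https?\S+|[?!.,]'
    tweet = str(tweet)
    tokens = tokenizer.tokenize(tweet)
    tokens = [re.sub(sub_regex, '', t) for t in tokens]
    # Drop tokens that are empty after cleaning.
    tokens = [t for t in tokens if t]
    return tokens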
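
To make the cleanup step concrete, here is a small standalone sketch that uses only the standard library. It assumes a plain whitespace split in place of `TweetTokenizer`, and the names `CLEAN_PATTERN` and `clean_tokens` are hypothetical; the sample tweet is made up.

```python
import re

# Assumed cleanup pattern: a hashtag, a URL, or one of the marks ? ! . ,
CLEAN_PATTERN = re.compile(r"#[A-Za-z0-9]+|https?\S+|[?!.,]")


def clean_tokens(text):
    """Split on whitespace, strip unwanted pieces, and drop empty tokens."""
    cleaned = [CLEAN_PATTERN.sub("", token) for token in text.split()]
    return [token for token in cleaned if token]


sample = "Loving this song!!! #nowplaying https://t.co/abc, so good."
print(clean_tokens(sample))  # ['Loving', 'this', 'song', 'so', 'good']
```

A whitespace split is cruder than `TweetTokenizer` (it does not handle emoticons or handles), so this only illustrates the regex, not the full tokenizer.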

In [41]:
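def postprocess(data, n=1000000):
    # Copy the slice so that adding a column does not raise SettingWithCopyWarning.
    data = data.head(n).copy()
    data["tokens"] = data["tweet"].progress_apply(tokenize)
    # tokenize() above never returns 'NC', so this filter currently keeps every row.
    data = data[data.tokens != 'NC']
    data = data.reset_index(drop=True)
    return data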

In [42]:
processed = postprocess(data.head(500))


### Create training data

Once the tweets are preprocessed, split them into train/test sets and vectorize them with Word2Vec for use in Keras.
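
As a rough sketch of what "vectorize" can mean here, one common approach (an assumption, not necessarily what this notebook does downstream) is to average the word vectors of a tweet's tokens into one fixed-length vector. The `toy_vectors` dictionary and the `tweet_vector` helper are hypothetical stand-ins for a trained Word2Vec model, and the numbers are made up.

```python
# Toy 3-dimensional embeddings; the values are invented for illustration.
toy_vectors = {
    "happy": [1.0, 0.0, 2.0],
    "song": [3.0, 2.0, 0.0],
}


def tweet_vector(tokens, vectors, n_dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * n_dim
    # zip(*known) groups the values of each dimension together.
    return [sum(dim) / len(known) for dim in zip(*known)]


vec = tweet_vector(["happy", "song", "unseen"], toy_vectors, n_dim=3)
print(vec)  # [2.0, 1.0, 1.0]
```

Tokens missing from the vocabulary are skipped, which matters when a high `min_count` prunes most words.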

In [49]:
# Size of our embedding space

N_DIM = 200

In [44]:
X_train, X_test, y_train, y_test = train_test_split(np.array(processed['tokens']),
                                                    np.array(processed['sentiment']),
                                                    test_size=0.2)

In [50]:
def labelizeTweets(tweets, label_type):
    labelized = []
    for i, v in tqdm_notebook(enumerate(tweets)):
        label = '{}_{}'.format(label_type, i)
        # Use gensim's TaggedDocument to pair each tweet's tokens with a unique label, which makes the tweet easier to retrieve later.
        labelized.append(TaggedDocument(v, [label]))
    return labelized

In [46]:
X_train = labelizeTweets(X_train, 'TRAIN')
X_test = labelizeTweets(X_test, 'TEST')

In [47]:
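train_size = len(X_train)
# `size` is the gensim 3.x name for the embedding dimension (gensim 4 renamed it to `vector_size`).
# min_count=10 ignores words that occur fewer than 10 times in the training tweets.
tweet_w2v = Word2Vec(size=N_DIM, min_count=10)
tweet_w2v.build_vocab([x.words for x in tqdm_notebook(X_train)])
tweet_w2v.train([x.words for x in tqdm_notebook(X_train)], 
                total_examples=train_size, epochs=tweet_w2v.epochs)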

W0726 08:29:00.642506 139739017860928 base_any2vec.py:686] under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay


(4686, 27600)

In [57]:
tweet_w2v.wv.most_similar("good")

[('of', 0.3715263605117798),
 ('I', 0.3577388525009155),
 ('the', 0.34806740283966064),
 ('that', 0.34661033749580383),
 ('"', 0.3401591181755066),
 ('got', 0.3271946907043457),
 ('just', 0.3248497545719147),
 ('my', 0.322878360748291),
 ("I'm", 0.3183445930480957),
 ('to', 0.3160049319267273)]
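
The neighbours of "good" above are mostly very common words. One plausible contributor is that this run trains on only a few hundred tweets with `min_count=10`, so few words survive in the vocabulary. The sketch below illustrates frequency-based vocabulary pruning in plain Python; `build_vocab` and the toy corpus are hypothetical and are not gensim's implementation.

```python
from collections import Counter


def build_vocab(tokenized_tweets, min_count):
    """Keep only tokens that occur at least min_count times in the corpus."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    return {tok for tok, count in counts.items() if count >= min_count}


toy_corpus = [
    ["i", "love", "the", "song"],
    ["i", "hate", "the", "rain"],
    ["the", "song", "is", "good"],
]
vocab = build_vocab(toy_corpus, min_count=2)
print(sorted(vocab))  # ['i', 'song', 'the']
```

On a small corpus, the surviving words tend to be function words, so training on more tweets or lowering the threshold may give more meaningful neighbours.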