# Analyzing how millennial women spend money via Refinery29
**Objective**   
Use NLP to analyze how millennial women spend their time and money, then build a recommender that takes user input and returns the 3 Refinery29 Money Diaries most similar to that input.  

**Data**   
The data was scraped from the [Refinery29 Money Diaries](https://refinery29.com/en-us/money-diary) and covers January 18, 2019 through June 3, 2020.

## **Load packages**

In [18]:

import pandas as pd
import pickle
import re
import string
import numpy as np
import operator
from collections import OrderedDict

from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

from gensim import matutils, models

import scipy.sparse

import matplotlib.pyplot as plt
%matplotlib inline

import nltk
from nltk import word_tokenize, pos_tag

## **Read in pickled cleaned text data**

In [19]:
# Open the pickle in a context manager so the file handle is closed after loading
with open("data_clean.pkl", "rb") as f:
    data_clean = pickle.load(f)

## **Create Document Term Matrix**

In [3]:
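# Unigrams and bigrams; drop terms found in fewer than 10% or more than 80% of diaries
cv = CountVectorizer(stop_words='english', min_df=0.1, max_df=0.8, ngram_range=(1, 2))
data_cv = cv.fit_transform(data_clean.diary_text_string)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())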
data_dtm.index = data_clean.index
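
To make the effect of `min_df` and `max_df` concrete, here is a minimal sketch on a toy corpus. The four sentences and the thresholds are hypothetical stand-ins chosen for illustration, not the diary data or the settings used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the diary text
docs = [
    "coffee before work then coffee again",
    "coffee with friends after yoga",
    "walk the dog then coffee",
    "yoga class then dinner with friends",
]

# min_df=0.5 drops terms found in fewer than 50% of documents;
# max_df=0.7 drops terms found in more than 70% of documents.
toy_cv = CountVectorizer(stop_words="english", min_df=0.5, max_df=0.7)
toy_counts = toy_cv.fit_transform(docs)

# "coffee" appears in 3 of 4 documents, so max_df removes it;
# words seen in a single document are removed by min_df.
kept_terms = sorted(toy_cv.vocabulary_)
print(kept_terms)
print(toy_counts.toarray())
```

Only terms that are neither too rare nor too common survive, which is the same filtering applied to the diaries above at the 10% and 80% thresholds.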

## **Create helper functions to evaluate models**

In [4]:
def display_topics(model, feature_names, no_top_words, topic_names=None):
    "Given a fitted model, print the top words for each topic"
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix)
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
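
The slice `topic.argsort()[:-no_top_words - 1:-1]` is the least obvious part of `display_topics`. The sketch below walks through it on one hypothetical topic; the vocabulary and weights are made up for illustration.

```python
import numpy as np

# Hypothetical weights for one topic over a five-word vocabulary
feature_names = ["coffee", "dog", "office", "wine", "yoga"]
topic = np.array([0.10, 0.90, 0.05, 0.40, 0.70])
no_top_words = 3

# argsort() returns indices ordered from smallest to largest weight.
# The slice [:-no_top_words - 1:-1] walks backwards from the end,
# so the indices of the largest weights come first.
top_idx = topic.argsort()[:-no_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(", ".join(top_words))  # dog, yoga, wine
```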

In [None]:
def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)

In [5]:
# Updated list of stop words
additional_stop_words = ['pm', 'im', 'day', 'week', 'total', 'daily', 'like', 'today', 'dont', 'just', 'women',
                        'way', 'really', 'ive', 'diaries', 'got', 'gets', 'ill', 'things', 'bit', 'sevenday',
                        'hes', 'let', 'shes', 'lot', 'little', 'decide', 'ready', 'feel', 'goes', 'stop',
                        'finally', 'welcome', 'money', 'diaries', 'period', 'job', 'share', 'today', 'millennials',
                        'occupation', 'paycheck', 'diaries', 'taboo', 'dollar','period', 'year', 'month', 'worth',
                         'rent', 'bonus', 'expenses', 'today', 'mortgage', 'debt', 'expensesmortgage', 'industry',
                        'expensesrent', 'salary', 'savings', 'insurance', 'loan', 'loans', 'income', 'student',
                        'dollartoday', 'cash', 'room', 'account', 'woman', 'living','house', 'housing','mortgagepaycheck',
                        'half','split', 'apartment', 'totaldebt', 'checksavingscd', 'stuff', 'hour', 'hours',
                        'people', 'place', 'years', 'minutes', 'tomorrow']

# Convert to a list: recent scikit-learn versions reject a frozenset for stop_words
stop_words = list(text.ENGLISH_STOP_WORDS.union(additional_stop_words))

## **Topic Modeling**

Topic modeling exploration covered combinations of the following:
- LDA, LSA, NMF
- CountVectorizer and TfidfVectorizer
- Hyperparameters including stop words, max_df, min_df, and ngram_range
- Lemmatization and stemming
- Parts of speech: nouns only, verbs only, and nouns + verbs  

Best-performing configuration, used below:
- NMF with CountVectorizer
- Parts of speech: nouns only
- Additional stop words
- `max_df` = 0.9, `min_df` = 0.1
- Unigrams and bigrams

In [7]:
data_nouns = pd.DataFrame(data_clean.diary_text_string.apply(nouns))

In [8]:
# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=stop_words, min_df=0.1, max_df=0.9, ngram_range=(1, 2))
data_cvn = cvn.fit_transform(data_nouns.diary_text_string)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index

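| | access | accounts | activity harmful | activity step | addition | adult | advantage | advice | afternoon | afterward | ... | yesterdays | yoga | yoga class | yogurt | york | youd | youd questions | youre | youtube | zucchini |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 4 | 0 | ... | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 2 | 1 | 1 | 1 | 0 | 1 |
| 3 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 0 | 0 |
| 4 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 475 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 476 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 2 | 1 |
| 477 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 478 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |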


In [9]:
# Factorize the noun-only document-term matrix into 8 topics
nmf_model = NMF(n_components=8)
doc_topic = nmf_model.fit_transform(data_dtmn)

In [10]:
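# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
display_topics(nmf_model, cvn.get_feature_names_out(), 10)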


Topic  0
mom, parents, shower, food, hair, school, phone, breakfast, face, cream

Topic  1
husband, husbands, breakfast, chicken, water, food, tea, car, store, days

Topic  2
dog, dogs, walk, car, business, dog walk, home dog, friends, breakfast, food

Topic  3
kids, school, breakfast, partner, coffee, paper, couple, sister, family, chicken

Topic  4
coffee, class, cup, butter, salad, milk, shop, wine, days, partner

Topic  5
friends, friend, bar, wine, card, class, couple, drinks, food, car

Topic  6
office, breakfast, meeting, desk, salad, alarm, door, chicken, food, boss

Topic  7
baby, coffee, mom, milk, shower, couple, car, dad, family, cup


Based on these results, the topics were labeled as follows (in order, Topic 0 through Topic 7):
- Self-care 
- Husband
- Dogs
- Family
- Cooking/food
- Friends/socializing
- Work
- Baby
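
With labels in hand, each diary can be tagged with its strongest topic. The sketch below is one way to do that, assuming the label order matches the component order of the fitted model. The three rows of weights are copied from the document-topic output shown later in this notebook; `topic_labels`, `weights`, and `dominant` are illustrative names, not part of the original code.

```python
import numpy as np

# Labels in component order (Topic 0 through Topic 7), as listed above
topic_labels = ["Self-care", "Husband", "Dogs", "Family",
                "Cooking/food", "Friends/socializing", "Work", "Baby"]

# Document-topic weights for diaries 0-2, copied from the notebook output
weights = np.array([
    [0.05022, 0.07722, 0.07060, 0.00000, 0.34956, 0.78826, 0.73385, 0.00435],
    [0.32859, 0.06349, 0.00000, 0.00000, 0.50430, 0.68546, 0.17643, 0.05327],
    [0.25277, 0.00616, 0.00000, 0.03418, 0.64394, 0.27775, 1.00502, 0.01977],
])

# argmax along axis 1 picks the highest-weighted topic for each diary
dominant = [topic_labels[i] for i in weights.argmax(axis=1)]
print(dominant)  # ['Friends/socializing', 'Friends/socializing', 'Work']
```

Note that diary 0 loads almost as heavily on Work as on Friends/socializing, so a single dominant label hides some of the mix.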

In [11]:
# Topic-word matrix: the weight of each word within each topic
topic_word = pd.DataFrame(nmf_model.components_.round(3),
             index = ["component_1","component_2", 'component_3', "component_4", 'component_5',
                     'component_6', 'component_7', 'component_8'],
             columns = cvn.get_feature_names_out())

| | access | accounts | activity harmful | activity step | addition | adult | advantage | advice | afternoon | afterward | ... | yesterdays | yoga | yoga class | yogurt | york | youd | youd questions | youre | youtube | zucchini |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| component_1 | 0.22 | 0.199 | 0.369 | 0.279 | 0.055 | 0.107 | 0.049 | 0.051 | 0.761 | 0.199 | ... | 0.189 | 0.95 | 0.079 | 0.35 | 0.0 | 0.663 | 0.583 | 0.41 | 0.481 | 0.244 |
| component_2 | 0.0 | 0.219 | 0.056 | 0.166 | 0.076 | 0.0 | 0.0 | 0.015 | 0.376 | 0.0 | ... | 0.062 | 0.029 | 0.005 | 0.173 | 0.078 | 0.146 | 0.131 | 0.225 | 0.0 | 0.142 |
| component_3 | 0.005 | 0.09 | 0.0 | 0.171 | 0.037 | 0.007 | 0.015 | 0.057 | 0.135 | 0.0 | ... | 0.087 | 0.0 | 0.0 | 0.037 | 0.0 | 0.104 | 0.055 | 0.0 | 0.007 | 0.053 |
| component_4 | 0.033 | 0.009 | 0.069 | 0.06 | 0.0 | 0.042 | 0.084 | 0.0 | 0.098 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.174 | 0.0 | 0.089 | 0.085 | 0.0 | 0.0 | 0.0 |
| component_5 | 0.044 | 0.0 | 0.112 | 0.136 | 0.02 | 0.0 | 0.159 | 0.013 | 0.284 | 0.096 | ... | 0.085 | 0.391 | 0.092 | 0.517 | 0.475 | 0.178 | 0.24 | 0.043 | 0.152 | 0.056 |
| component_6 | 0.06 | 0.075 | 0.0 | 0.488 | 0.077 | 0.167 | 0.035 | 0.189 | 0.605 | 0.0 | ... | 0.0 | 0.432 | 0.087 | 0.275 | 0.452 | 0.365 | 0.166 | 0.104 | 0.068 | 0.09 |
| component_7 | 0.001 | 0.061 | 0.0 | 0.289 | 0.058 | 0.039 | 0.234 | 0.0 | 1.035 | 0.004 | ... | 0.0 | 0.0 | 0.034 | 0.58 | 0.0 | 0.408 | 0.112 | 0.18 | 0.0 | 0.0 |
| component_8 | 0.0 | 0.0 | 0.0 | 0.023 | 0.003 | 0.065 | 0.001 | 0.0 | 0.077 | 0.0 | ... | 0.044 | 0.179 | 0.014 | 0.582 | 0.006 | 0.036 | 0.0 | 0.0 | 0.028 | 0.0 |


In [12]:
# Document-topic matrix: how strongly each diary loads on each of the 8 topics
# (the variable is named term_doc_matrix, but rows are diaries and columns are topics)
term_doc_matrix = pd.DataFrame(doc_topic.round(5),
             columns = ["component_1","component_2", 'component_3', "component_4", 'component_5',
                     'component_6', 'component_7', 'component_8'])

| | component_1 | component_2 | component_3 | component_4 | component_5 | component_6 | component_7 | component_8 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.05022 | 0.07722 | 0.07060 | 0.00000 | 0.34956 | 0.78826 | 0.73385 | 0.00435 |
| 1 | 0.32859 | 0.06349 | 0.00000 | 0.00000 | 0.50430 | 0.68546 | 0.17643 | 0.05327 |
| 2 | 0.25277 | 0.00616 | 0.00000 | 0.03418 | 0.64394 | 0.27775 | 1.00502 | 0.01977 |
| 3 | 0.37495 | 0.56710 | 0.01586 | 0.17394 | 0.20806 | 0.93754 | 0.28334 | 0.20011 |
| 4 | 0.11923 | 0.02350 | 0.07624 | 0.11870 | 0.25552 | 0.54420 | 0.64287 | 0.27697 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 475 | 0.27668 | 0.05135 | 0.28625 | 0.14870 | 0.05454 | 0.11753 | 0.01380 | 0.01412 |
| 476 | 0.49950 | 0.10542 | 0.80907 | 0.00000 | 0.09379 | 0.51250 | 0.00000 | 0.39321 |
| 477 | 0.98235 | 0.23150 | 0.00000 | 0.22776 | 0.78942 | 0.00000 | 0.02617 | 0.08606 |
| 478 | 0.68684 | 0.65626 | 0.00000 | 0.38844 | 0.00000 | 0.22062 | 0.09168 | 0.05935 |
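
The document-topic weights are what a recommender can compare against. The sketch below shows one possible way to pick the 3 most similar diaries with `cosine_similarity`, and is not necessarily the method used later in this project. It uses the first five rows of the output above, and `user_topics` is a hypothetical user profile invented for illustration. In the full pipeline the user vector would presumably come from running the user's text through `nouns`, `cvn.transform`, and `nmf_model.transform`, which is an assumption here.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Topic weights for diaries 0-4, copied from the output above
diary_topics = np.array([
    [0.05022, 0.07722, 0.07060, 0.00000, 0.34956, 0.78826, 0.73385, 0.00435],
    [0.32859, 0.06349, 0.00000, 0.00000, 0.50430, 0.68546, 0.17643, 0.05327],
    [0.25277, 0.00616, 0.00000, 0.03418, 0.64394, 0.27775, 1.00502, 0.01977],
    [0.37495, 0.56710, 0.01586, 0.17394, 0.20806, 0.93754, 0.28334, 0.20011],
    [0.11923, 0.02350, 0.07624, 0.11870, 0.25552, 0.54420, 0.64287, 0.27697],
])

# Hypothetical user profile in the same 8-topic space (mostly the Work topic)
user_topics = np.array([[0.0, 0.0, 0.0, 0.0, 0.2, 0.1, 1.0, 0.0]])

# One similarity score per diary; higher means a closer topic mix
similarity = cosine_similarity(user_topics, diary_topics)[0]

# Indices of the 3 most similar diaries, best match first
top3 = np.argsort(similarity)[::-1][:3]
print(top3, similarity[top3].round(3))
```

Cosine similarity compares the direction of the topic mix rather than its magnitude, so a long diary and a short one with the same balance of topics score as similar.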
