# Analyzing how millennial women spend money via Refinery29
**Objective**   
Analyze how millennial women spend their time and money using NLP. Build a recommender that takes user input and returns the 3 Refinery29 Money Diaries most similar to that input.  

**Data**   
The data was scraped from the [Refinery29 Money Diaries](https://refinery29.com/en-us/money-diary) published between January 18, 2019 and June 3, 2020.


**Load packages**

In [5]:
import pickle
import re
import string
import operator

import pandas as pd
import numpy as np

from collections import OrderedDict

import nltk
from nltk import word_tokenize, pos_tag

from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

from gensim import matutils, models

import scipy.sparse

import matplotlib.pyplot as plt
%matplotlib inline

## **Load in data**

In [7]:
# Load in pickled dataframes
with open("clustered_data_scaled.pkl", "rb") as f:
    clustered_diarist_df = pickle.load(f)
with open("text_df.pkl", "rb") as f:
    text_df = pickle.load(f)

In [8]:
# Merge dataframes to append clusters to text
updated_text_df = pd.merge(clustered_diarist_df, text_df, on='story_title')
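
A small sketch of what this merge does, using made-up rows (the titles and values below are hypothetical, not taken from the scraped data). With the default `how='inner'`, only stories present in both frames survive, so it may be worth checking which rows fail to match.

```python
import pandas as pd

# Hypothetical stand-ins for the two pickled dataframes
clusters = pd.DataFrame({"story_title": ["a", "b", "c"], "Cluster": [0, 1, 2]})
texts = pd.DataFrame({"story_title": ["a", "b", "d"],
                      "diary_text": [["x"], ["y"], ["z"]]})

# Default inner join keeps only titles found in both frames
merged = pd.merge(clusters, texts, on="story_title")

# An outer join with indicator=True shows which rows did not match
audit = pd.merge(clusters, texts, on="story_title", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
```

Here `merged` holds the two shared titles, while `unmatched` lists the titles that appear in only one frame.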

In [9]:
# Drop unnecessary columns from new dataframe
updated_text_df.drop(columns=['occupation', 'age', 'salary', 'nomad', 'international', 'high_cost_of_living_area'], inplace=True)

## **Clean text data**

In [10]:
# Convert the diary text column from a list to a string
updated_text_df['diary_text_string'] = [', '.join(map(str, l)) for l in updated_text_df['diary_text']]

In [11]:
# Drop the diary_text as list column
del updated_text_df['diary_text']

In [12]:
# Apply a first round of text cleaning techniques
def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation, remove words 
    containing numbers, remove additional punctuation and other nonsensical text.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                              # text in square brackets
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # ASCII punctuation
    text = re.sub(' — ', ' ', text)                                  # spaced long dashes
    text = re.sub(r'\w*\d\w*', '', text)                             # words containing digits
    text = re.sub('[‘’“”…]', '', text)                               # curly quotes and ellipses
    text = re.sub('\n', '', text)                                    # newlines
    return text

round1 = lambda x: clean_text_round1(x)

In [13]:
updated_text_df.diary_text_string = updated_text_df.diary_text_string.apply(round1)
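
To see what the cleaning pass does to a diary entry, here is a condensed sketch of its main steps applied to an invented sentence. It is an illustration rather than the full function above: it covers lowercasing, bracket removal, punctuation removal, and dropping tokens that contain digits.

```python
import re
import string

def clean_sketch(text):
    """Condensed version of the cleaning steps, for illustration only."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                              # drop [bracketed] text
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # drop ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)                             # drop tokens containing digits
    return text

sample = "7:30 a.m. [Ed. note] I don't buy coffee, $4 saved!"
cleaned = clean_sketch(sample)
tokens = cleaned.split()
```

Because apostrophes are stripped, contractions such as "don't" become "dont", which is why forms like `dont`, `im`, and `ive` show up in the custom stop word list later on.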

## **Create separate dataframes for each cluster**

In [14]:
# Cluster 0: Late 20s, average earners
cluster_0 = updated_text_df[updated_text_df['Cluster'] == 0]

# Drop the cluster column
del cluster_0['Cluster']

In [15]:
# Cluster 1: Late 30s, average earners
cluster_1 = updated_text_df[updated_text_df['Cluster'] == 1]

# Drop the cluster column
del cluster_1['Cluster']

In [16]:
# Cluster 2: Early 20s, average starting salary
cluster_2 = updated_text_df[updated_text_df['Cluster'] == 2]

# Drop the cluster column
del cluster_2['Cluster']

In [17]:
# Cluster 3: Mid-30s high earners 
cluster_3 = updated_text_df[updated_text_df['Cluster'] == 3]

# Drop the cluster column
del cluster_3['Cluster']

In [18]:
# Cluster 4: Late 20s, high earners
cluster_4 = updated_text_df[updated_text_df['Cluster'] == 4]

# Drop the cluster column
del cluster_4['Cluster']
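
The five cells above repeat the same filter-and-drop pattern. As an alternative sketch (not what this notebook runs), the same split could be produced in one pass with `groupby`. The miniature dataframe below is hypothetical.

```python
import pandas as pd

# Hypothetical miniature of updated_text_df
df = pd.DataFrame({
    "story_title": ["a", "b", "c", "d"],
    "Cluster": [0, 1, 0, 2],
    "diary_text_string": ["coffee office", "kids school", "dog walk", "class gym"],
})

# One dataframe per cluster label, each without the Cluster column
cluster_frames = {
    label: group.drop(columns="Cluster")
    for label, group in df.groupby("Cluster")
}
```

`cluster_frames[0]` then plays the role of `cluster_0`, and so on for the other labels.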

## **Create helper functions to evaluate models**

In [19]:
def display_topics(model, feature_names, no_top_words, topic_names=None):
    "Given a fitted model, print the top words for each topic"
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix)
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
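
The slice `[:-no_top_words - 1:-1]` in `display_topics` is compact, so here is a minimal sketch of what it selects. The feature names and weights are made up for the example.

```python
import numpy as np

feature_names = ["coffee", "dog", "office", "walk", "gym"]
topic_weights = np.array([0.1, 0.9, 0.3, 0.7, 0.0])  # invented weights for one topic
no_top_words = 3

# argsort is ascending, so the slice walks backwards from the largest weight
# and stops after no_top_words indices
top_idx = topic_weights.argsort()[:-no_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
```

The result lists the words in descending order of weight, which is the order printed for each topic.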

In [20]:
def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)

In [21]:
# Updated list of stop words
additional_stop_words = ['pm', 'im', 'day', 'week', 'total', 'daily', 'like', 'today', 'dont', 'just', 'women',
                        'way', 'really', 'ive', 'diaries', 'got', 'gets', 'ill', 'things', 'bit', 'sevenday',
                        'hes', 'let', 'shes', 'lot', 'little', 'decide', 'ready', 'feel', 'goes', 'stop',
                        'finally', 'welcome', 'money', 'diaries', 'period', 'job', 'share', 'today', 'millennials',
                        'occupation', 'paycheck', 'diaries', 'taboo', 'dollar','period', 'year', 'month', 'worth',
                         'rent', 'bonus', 'expenses', 'today', 'mortgage', 'debt', 'expensesmortgage', 'industry',
                        'expensesrent', 'salary', 'savings', 'insurance', 'loan', 'loans', 'income', 'student',
                        'dollartoday', 'cash', 'room', 'account', 'woman', 'living','house', 'housing','mortgagepaycheck',
                        'half','split', 'apartment', 'totaldebt', 'checksavingscd', 'stuff', 'hour', 'hours',
                        'people', 'place', 'years', 'minutes', 'tomorrow', 'couple', 'head', 'meeting']

# Convert to a list: recent scikit-learn releases reject a frozenset for stop_words
stop_words = list(text.ENGLISH_STOP_WORDS.union(additional_stop_words))
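
The custom list repeats a few entries (for example `today` and `diaries`). A short sketch with a made-up list shows why that is harmless: `union` returns a set, so repeats collapse. The final conversion to a list is an assumption about what the installed scikit-learn version expects for `stop_words`.

```python
from sklearn.feature_extraction import text

extra = ["today", "diaries", "today", "diaries", "period"]  # repeats on purpose

# union() returns a frozenset, so repeated entries collapse to one
combined = text.ENGLISH_STOP_WORDS.union(extra)

# A sorted list is easy to inspect and can be passed to a vectorizer
stop_word_list = sorted(combined)
```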

## **Topic modeling cluster 0 (late 20s, average earners)**

Topic modeling exploration included all variations of the following.
- LDA, LSA, NMF
- CountVectorizer and TfidfVectorizer
- Hyperparameters including stop words, max_df, min_df, and ngram_range
- Lemmatization and stemming
- Parts of speech: nouns only, verbs only, and nouns + verbs  

Optimal condition for cluster 0:
- NMF with CountVectorizer
- Parts of speech: nouns only
- Additional stop words
- Max_df = .88, min_df = .15
- Unigrams and bigrams
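
When `min_df` and `max_df` are floats, `CountVectorizer` treats them as proportions of documents. The sketch below uses ten invented documents to show which terms survive the cluster 0 thresholds. It uses unigrams only to keep the vocabulary small.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Ten invented documents: 'coffee' is in all of them, 'yoga' in only one
docs = ["coffee office"] * 5 + ["coffee dog"] * 4 + ["coffee yoga"]

# Keep terms that appear in at least 15% and at most 88% of documents
cv = CountVectorizer(min_df=0.15, max_df=0.88)
cv.fit(docs)
kept = sorted(cv.vocabulary_)
```

`coffee` is dropped for being too common (10 of 10 documents) and `yoga` for being too rare (1 of 10), leaving the mid-frequency terms that tend to separate topics.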

In [21]:
data_nouns_cluster_0 = pd.DataFrame(cluster_0.diary_text_string.apply(nouns))

In [22]:
cvn0 = CountVectorizer(stop_words=stop_words, min_df = 0.15, max_df=.88, ngram_range=(1, 2))
data_cvn0 = cvn0.fit_transform(data_nouns_cluster_0.diary_text_string)
data_dtmn0 = pd.DataFrame(data_cvn0.toarray(), columns=cvn0.get_feature_names_out())
data_dtmn0.index = data_nouns_cluster_0.index
data_dtmn0.shape

(167, 651)

In [23]:
nmf_model_cluster_0 = NMF(6)
doc_topic0 = nmf_model_cluster_0.fit_transform(data_dtmn0)
display_topics(nmf_model_cluster_0, cvn0.get_feature_names_out(), 10)


Topic  0
food, breakfast, walk, days, school, routine, break, phone, kids, hair

Topic  1
dog, walk, dogs, breakfast, office, home dog, coffee, dog walk, friends, bar

Topic  2
husband, chicken, breakfast, office, food, husbands, salad, shower, pizza, car

Topic  3
baby, milk, coffee, downstairs, granola, bowl, mom, yogurt, cup, shower

Topic  4
coffee, office, cup, days, class, cat, breakfast, store, butter, episode

Topic  5
friend, friends, office, bar, food, alarm, class, card, water, party


In [24]:
cluster_0_component_matrix = pd.DataFrame(doc_topic0.round(5),
             columns = ["component_1", "component_2", "component_3", "component_4", "component_5",
                     "component_6"])

Based on these results, the topics were determined to be:
- Home routines
- Dogs
- Husband
- Baby/mom
- Work
- Friends/socializing
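
Once topics have names, the component matrix can be used to tag each diary with its strongest topic. The sketch below uses invented weights and a shortened, hypothetical set of labels. It is one possible way to read the matrix, not a step this notebook performs.

```python
import numpy as np
import pandas as pd

# Made-up document-topic weights for three diaries and three topics
doc_topic = np.array([[0.8, 0.1, 0.0],
                      [0.0, 0.2, 0.9],
                      [0.3, 0.5, 0.1]])
topic_names = ["home_routines", "dogs", "work"]  # hypothetical labels

component_matrix = pd.DataFrame(doc_topic.round(5), columns=topic_names)

# The column with the largest weight in each row is that diary's dominant topic
component_matrix["dominant_topic"] = component_matrix[topic_names].idxmax(axis=1)
```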

In [67]:
# # Pickle important things for recommender
# with open('cvn0.pkl', 'wb') as f:
#     pickle.dump(cvn0, f)

# with open('doc_topic0.pkl', 'wb') as f:
#     pickle.dump(doc_topic0, f)
    
# with open('nmf_model_cluster_0.pkl', 'wb') as f:
#     pickle.dump(nmf_model_cluster_0, f)
    
# with open('cluster_0.pkl', 'wb') as f:
#     pickle.dump(cluster_0, f)
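
The recommender itself is not part of this notebook, so the following is only a sketch of how the pickled vectorizer, NMF model, and document-topic matrix might be combined: project the user's text into topic space and rank diaries by cosine similarity. The corpus, the `recommend` helper, and the NMF settings (`init`, `random_state`, `max_iter`) are all assumptions made for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

# Invented corpus standing in for one cluster's noun-only diary text
diaries = [
    "dog walk dog park coffee",
    "office meeting desk coffee email",
    "baby milk daycare bottle",
    "dog walk vet dog food",
    "office desk boss email train",
    "baby daycare milk stroller",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(diaries)
nmf = NMF(n_components=3, init="nndsvd", random_state=0, max_iter=500)
doc_topic = nmf.fit_transform(dtm)

def recommend(user_text, n=3):
    """Return indices of the n diaries whose topic mix is closest to user_text."""
    user_topics = nmf.transform(vectorizer.transform([user_text]))
    scores = cosine_similarity(user_topics, doc_topic).ravel()
    return np.argsort(scores)[::-1][:n]

top3 = recommend("I walk my dog before coffee")
```

In practice the user's text would presumably go through the same cleaning and noun extraction as the diaries before being vectorized, so that it shares their vocabulary.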

## **Topic modeling cluster 1 (late 30s, average earners)**

Topic modeling exploration included all variations of the following.
- LDA, LSA, NMF
- CountVectorizer and TfidfVectorizer
- Hyperparameters including stop words, max_df, min_df, and ngram_range
- Lemmatization and stemming
- Parts of speech: nouns only, verbs only, and nouns + verbs  

Optimal condition for cluster 1:
- NMF with CountVectorizer
- Parts of speech: nouns only
- Additional stop words specific to cluster 1
- Max_df = .7, min_df = .12
- Unigrams and bigrams

In [61]:
data_nouns_cluster_1 = pd.DataFrame(cluster_1.diary_text_string.apply(nouns))

In [62]:
additional_stop_words_cluster_1 = ['pm', 'im', 'day', 'week', 'total', 'daily', 'like', 'today', 'dont', 'just', 'women',
                        'way', 'really', 'ive', 'diaries', 'got', 'gets', 'ill', 'things', 'bit', 'sevenday',
                        'hes', 'let', 'shes', 'lot', 'little', 'decide', 'ready', 'feel', 'goes', 'stop',
                        'finally', 'welcome', 'money', 'diaries', 'period', 'job', 'share', 'today', 'millennials',
                        'occupation', 'paycheck', 'diaries', 'taboo', 'dollar','period', 'year', 'month', 'worth',
                         'rent', 'bonus', 'expenses', 'today', 'mortgage', 'debt', 'expensesmortgage', 'industry',
                        'expensesrent', 'salary', 'savings', 'insurance', 'loan', 'loans', 'income', 'student',
                        'dollartoday', 'cash', 'room', 'account', 'woman', 'living','house', 'housing','mortgagepaycheck',
                        'half','split', 'apartment', 'totaldebt', 'checksavingscd', 'stuff', 'hour', 'hours',
                        'people', 'place', 'years', 'minutes', 'tomorrow', 'couple', 'head', 'morning',
                        'night', 'guy']

# Convert to a list: recent scikit-learn releases reject a frozenset for stop_words
stop_words_cluster_1 = list(text.ENGLISH_STOP_WORDS.union(additional_stop_words_cluster_1))

In [63]:
cvn1 = CountVectorizer(stop_words=stop_words_cluster_1, min_df = 0.12, max_df=.7, ngram_range=(1, 2))
data_cvn1 = cvn1.fit_transform(data_nouns_cluster_1.diary_text_string)
data_dtmn1 = pd.DataFrame(data_cvn1.toarray(), columns=cvn1.get_feature_names_out())
data_dtmn1.index = data_nouns_cluster_1.index
data_dtmn1.shape

(46, 904)

In [64]:
nmf_model_cluster_1 = NMF(6)
doc_topic1 = nmf_model_cluster_1.fit_transform(data_dtmn1)
display_topics(nmf_model_cluster_1, cvn1.get_feature_names_out(), 10)


Topic  0
wine, bar, desk, alarm, train, class, chicken, yogurt, date, kitchen

Topic  1
kids, partner, school, boys, daycare, commute, paper, daughter, game, lunches

Topic  2
husband, boys, son, husbands, rice, desk, email, yogurt, meeting, bar

Topic  3
girls, school, hair, gym, chicken, husband, meeting, parents, project, guest

Topic  4
dog, business, cats, boyfriend, airport, town, dogs, tea, lots, tacos

Topic  5
baby, banana, pizza, chicken, milk, sister, strawberries, cheese, daycare, butter


In [65]:
cluster_1_component_matrix = pd.DataFrame(doc_topic1.round(5),
             columns = ["component_1", "component_2", "component_3", "component_4", "component_5",
                     "component_6"])

Based on these results, the topics were determined to be:
- Work
- Kid routines
- Family
- Adult routines
- Pets
- Food

In [66]:
# # Pickle important things for recommender
# with open('cvn1.pkl', 'wb') as f:
#     pickle.dump(cvn1, f)

# with open('doc_topic1.pkl', 'wb') as f:
#     pickle.dump(doc_topic1, f)
    
# with open('nmf_model_cluster_1.pkl', 'wb') as f:
#     pickle.dump(nmf_model_cluster_1, f)
    
# with open('cluster_1.pkl', 'wb') as f:
#     pickle.dump(cluster_1, f)

## **Topic modeling for cluster 2 (early 20s, entry level salary)**

Topic modeling exploration included all variations of the following.
- LDA, LSA, NMF
- CountVectorizer and TfidfVectorizer
- Hyperparameters including stop words, max_df, min_df, and ngram_range
- Lemmatization and stemming
- Parts of speech: nouns only, verbs only, and nouns + verbs  

Optimal condition for cluster 2:
- NMF with CountVectorizer
- Parts of speech: nouns only
- Additional stop words 
- Max_df = .8, min_df = .1
- Unigrams and bigrams

In [22]:
data_nouns_cluster_2 = pd.DataFrame(cluster_2.diary_text_string.apply(nouns))

In [23]:
# Recreate a document-term matrix with only nouns
cvn2 = CountVectorizer(stop_words=stop_words, min_df = 0.1, max_df=.8, ngram_range=(1, 2))
data_cvn2 = cvn2.fit_transform(data_nouns_cluster_2.diary_text_string)
data_dtmn2 = pd.DataFrame(data_cvn2.toarray(), columns=cvn2.get_feature_names_out())
data_dtmn2.index = data_nouns_cluster_2.index
data_dtmn2.shape

(212, 960)

In [24]:
nmf_model_cluster_2 = NMF(8)
doc_topic2 = nmf_model_cluster_2.fit_transform(data_dtmn2)
display_topics(nmf_model_cluster_2, cvn2.get_feature_names_out(), 10)


Topic  0
mom, parents, dad, car, family, weekend, brother, water, pay, door

Topic  1
sister, school, card, routine, book, credit, kids, credit card, kitchen, bacon

Topic  2
dog, walk, dogs, car, cream, business, school, dog walk, office, ice cream

Topic  3
husband, kids, water, tea, chicken, table, media, café, cream, pajamas

Topic  4
class, school, bus, studio, classes, yoga, gym, alarm, workout, rest

Topic  5
partner, airport, hotel, restaurant, flight, shop, water, card, pay, flights

Topic  6
office, bar, chicken, salad, train, boss, desk, door, tonight, alarm

Topic  7
hair, walk, episode, gym, water, face, teeth, routine, butter, makeup


Based on these results, the topics were determined to be:
- Self-care
- Purchases
- Dogs/pets
- Food
- Fitness
- Travel
- Work
- Family

In [53]:
# # Pickle important things for recommender
# with open('cvn2.pkl', 'wb') as f:
#     pickle.dump(cvn2, f)

# with open('doc_topic2.pkl', 'wb') as f:
#     pickle.dump(doc_topic2, f)
    
# with open('nmf_model_cluster_2.pkl', 'wb') as f:
#     pickle.dump(nmf_model_cluster_2, f)
    
# with open('cluster_2.pkl', 'wb') as f:
#     pickle.dump(cluster_2, f)

## **Topic modeling for cluster 3 (mid 30s, high earners)**

Topic modeling exploration included all variations of the following.
- LDA, LSA, NMF
- CountVectorizer and TfidfVectorizer
- Hyperparameters including stop words, max_df, min_df, and ngram_range
- Lemmatization and stemming
- Parts of speech: nouns only, verbs only, and nouns + verbs  

Optimal condition for cluster 3:
- NMF with CountVectorizer
- Parts of speech: nouns only
- Additional stop words specific to cluster 3
- Min_df = .26
- Unigrams and bigrams

In [69]:
data_nouns_cluster_3 = pd.DataFrame(cluster_3.diary_text_string.apply(nouns))

In [70]:
additional_stop_words_cluster3 = ['pm', 'im', 'day', 'week', 'total', 'daily', 'like', 'today', 'dont', 'just', 'women',
                        'way', 'really', 'ive', 'diaries', 'got', 'gets', 'ill', 'things', 'bit', 'sevenday',
                        'hes', 'let', 'shes', 'lot', 'little', 'decide', 'ready', 'feel', 'goes', 'stop',
                        'finally', 'welcome', 'money', 'diaries', 'period', 'job', 'share', 'today', 'millennials',
                        'occupation', 'paycheck', 'diaries', 'taboo', 'dollar','period', 'year', 'month', 'worth',
                         'rent', 'bonus', 'expenses', 'today', 'mortgage', 'debt', 'expensesmortgage', 'industry',
                        'expensesrent', 'salary', 'savings', 'insurance', 'loan', 'loans', 'income', 'student',
                        'dollartoday', 'cash', 'room', 'account', 'woman', 'living','house', 'housing','mortgagepaycheck',
                        'half','split', 'apartment', 'totaldebt', 'checksavingscd', 'stuff', 'hour', 'hours',
                        'people', 'place', 'years', 'minutes', 'tomorrow', 'couple', 'head', 'morning',
                        'night', 'guy', 'lunch', 'dinner', 'breakfast', 'time']

# Convert to a list: recent scikit-learn releases reject a frozenset for stop_words
stop_words_cluster3 = list(text.ENGLISH_STOP_WORDS.union(additional_stop_words_cluster3))

In [71]:
cvn3 = CountVectorizer(stop_words=stop_words_cluster3,  min_df = .26, ngram_range=(1, 2))
data_cvn3 = cvn3.fit_transform(data_nouns_cluster_3.diary_text_string)
data_dtmn3 = pd.DataFrame(data_cvn3.toarray(), columns=cvn3.get_feature_names_out())
data_dtmn3.index = data_nouns_cluster_3.index
data_dtmn3.shape

(4, 327)
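
With only four diaries in this cluster, a fractional `min_df` behaves coarsely: 0.26 × 4 = 1.04, so a term has to appear in at least two of the four documents to be kept. A minimal sketch with four invented documents illustrates the cutoff.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four invented documents: 'coffee' and 'dog' each appear in two of them
docs = ["coffee dog", "coffee wine", "school bed", "dog gym"]

# 0.26 * 4 documents = 1.04, so terms found in a single document are dropped
cv = CountVectorizer(min_df=0.26)
cv.fit(docs)
kept = sorted(cv.vocabulary_)
```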

In [72]:
nmf_model_cluster_3 = NMF(5)
doc_topic3 = nmf_model_cluster_3.fit_transform(data_dtmn3)
display_topics(nmf_model_cluster_3, cvn3.get_feature_names_out(), 10)


Topic  0
home, school, husband, work, coffee, order, door, dog, shopping, bed

Topic  1
kids, business, order, coffee, estate, meeting, phone, emails, amazon, city

Topic  2
work, home, coffee, wedding, family, dress, bed, mom, kitchen, service

Topic  3
work, dogs, walk, wine, order, break, list, delivery, city, online

Topic  4
home, dogs, school, work, bed, wine, life, sister, gym, mom


In [73]:
cluster_3_component_matrix = pd.DataFrame(doc_topic3.round(5),
             columns = ["component_1", "component_2", "component_3", "component_4", "component_5"])

Based on these results, the topics were determined to be:
- Family routines
- Work
- Parenting
- Self-care
- Family 

In [74]:
# # Pickle important things for recommender
# with open('cvn3.pkl', 'wb') as f:
#     pickle.dump(cvn3, f)

# with open('doc_topic3.pkl', 'wb') as f:
#     pickle.dump(doc_topic3, f)
    
# with open('nmf_model_cluster_3.pkl', 'wb') as f:
#     pickle.dump(nmf_model_cluster_3, f)
    
# with open('cluster_3.pkl', 'wb') as f:
#     pickle.dump(cluster_3, f)

## **Topic modeling for cluster 4 (late 20s, high earners)**

Topic modeling exploration included all variations of the following.
- LDA, LSA, NMF
- CountVectorizer and TfidfVectorizer
- Hyperparameters including stop words, max_df, min_df, and ngram_range
- Lemmatization and stemming
- Parts of speech: nouns only, verbs only, and nouns + verbs  

Optimal condition for cluster 4:
- NMF with CountVectorizer
- Parts of speech: nouns only
- Additional stop words 
- Max_df = .92, min_df = .19
- Unigrams and bigrams

In [76]:
data_nouns_cluster_4 = pd.DataFrame(cluster_4.diary_text_string.apply(nouns))

In [77]:
cvn4 = CountVectorizer(stop_words=stop_words, min_df = 0.19,max_df=.92, ngram_range=(1, 2))
data_cvn4 = cvn4.fit_transform(data_nouns_cluster_4.diary_text_string)
data_dtmn4 = pd.DataFrame(data_cvn4.toarray(), columns=cvn4.get_feature_names_out())
data_dtmn4.index = data_nouns_cluster_4.index
data_dtmn4.shape

(51, 556)

In [78]:
nmf_model_cluster_4 = NMF(6)
doc_topic4 = nmf_model_cluster_4.fit_transform(data_dtmn4)
display_topics(nmf_model_cluster_4, cvn4.get_feature_names_out(), 10)


Topic  0
coffee, friends, friend, bar, card, salad, book, date, store, drinks

Topic  1
office, baby, shower, husband, company, emails, party, food, mom, car

Topic  2
butter, parents, cream, meetings, walk, chocolate, ice, break, weekend, ice cream

Topic  3
husband, dog, food, office, wine, store, coffee, family, car, husbands

Topic  4
gym, office, friend, water, breakfast, desk, food, workout, days, card

Topic  5
kids, clothes, door, coffee, team, friends, breakfast, chicken, office, family




In [79]:
cluster_4_component_matrix = pd.DataFrame(doc_topic4.round(5),
             columns = ["component_1", "component_2", "component_3", "component_4", "component_5", "component_6"])

Based on these results, the topics were determined to be:
- Dating
- Parties
- Food
- Home
- Fitness
- Work

In [80]:
# # Pickle important things for recommender
# with open('cvn4.pkl', 'wb') as f:
#     pickle.dump(cvn4, f)

# with open('doc_topic4.pkl', 'wb') as f:
#     pickle.dump(doc_topic4, f)
    
# with open('nmf_model_cluster_4.pkl', 'wb') as f:
#     pickle.dump(nmf_model_cluster_4, f)
    
# with open('cluster_4.pkl', 'wb') as f:
#     pickle.dump(cluster_4, f)