# Analyzing how millennial women spend money via Refinery29
**Objective**   
Use NLP to analyze how millennial women spend their time and money, then build a recommender that takes user input and returns the three Refinery29 Money Diaries most similar to that user.  

**Data**   
The data was scraped from the [Refinery29 Money Diaries](https://refinery29.com/en-us/money-diary) and covers January 18, 2019 through June 3, 2020.

**Load packages.**

In [2]:
import time, os
import pickle
import re

from bs4 import BeautifulSoup
import requests

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

import pandas as pd
from scipy import stats
import numpy as np

import seaborn as sns

**Use Selenium to collect the unique link identifiers for 500 money diaries, then pickle the links so they can be passed to Beautiful Soup.**

In [4]:
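from selenium.webdriver.chrome.service import Service

# Path to the local chromedriver binary
chromedriver = "/Applications/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver

website = 'https://www.refinery29.com/en-us/money-diary'
# Selenium 4 takes the driver path through a Service object;
# Selenium 3 accepted it positionally as webdriver.Chrome(chromedriver).
driver = webdriver.Chrome(service=Service(chromedriver))
driver.get(website)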

In [5]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [106]:
find_href_links = [i['href'] for i in soup.find_all('a', href=True)]

In [10]:
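# Drop the first 22 and last 213 hrefs, which are not diary links.
# These offsets are specific to the page as it was scraped.
links_to_follow = find_href_links[22:-213]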
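
The positional slice above depends on the exact page layout at scrape time. As a minimal sketch of an alternative, the links could be filtered by their content instead. This assumes, hypothetically, that diary links are relative paths sharing a recognizable substring; the `money-diary` marker, the `filter_diary_links` helper, and the sample hrefs are illustrative and not taken from the site.

```python
def filter_diary_links(hrefs, marker="money-diary"):
    """Keep relative links containing `marker`, dropping duplicates but preserving order."""
    seen = set()
    kept = []
    for href in hrefs:
        # Only relative paths can be appended to a base URL later on.
        if not href.startswith("/"):
            continue
        # Skip links that do not look like diaries (assumed naming pattern).
        if marker not in href:
            continue
        # Skip links that were already collected.
        if href in seen:
            continue
        seen.add(href)
        kept.append(href)
    return kept


# Made-up hrefs standing in for `find_href_links`.
sample_hrefs = [
    "/en-us/fashion",
    "/en-us/example-city-money-diary",
    "https://example.com/money-diary",
    "/en-us/example-city-money-diary",
    "/en-us/another-example-money-diary",
]
diary_links = filter_diary_links(sample_hrefs)
```

A filter like this would also match the index page itself if its path contains the marker, so the result may still need a quick manual check.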

**Pickle links to save for future use.**

In [14]:
# with open('r29links.pkl', 'wb') as f:
#     pickle.dump(links_to_follow, f)

In [15]:
# Open the file in a context manager so the handle is closed after loading
with open("r29links.pkl", "rb") as f:
    r29_links = pickle.load(f)
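
The notebook avoids re-running the scrape by commenting out the dump and loading the saved pickle. One possible way to express the same idea without toggling comments is a small cache helper. This is a sketch only: `load_or_build` and `fake_scrape` are hypothetical names introduced here, and the temporary file stands in for `r29links.pkl`.

```python
import os
import pickle
import tempfile


def load_or_build(path, build_fn):
    """Return the pickled object at `path`, or build it with `build_fn` and cache it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = build_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


# Demonstration with a temporary directory and a made-up link list.
calls = []


def fake_scrape():
    # Stand-in for the Selenium step; records how often it runs.
    calls.append(1)
    return ["/en-us/example-money-diary"]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "links.pkl")
    first = load_or_build(cache_path, fake_scrape)   # builds and writes the cache
    second = load_or_build(cache_path, fake_scrape)  # reads the cache, no second build
```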

**Define a helper function to get information about each diarist.**

In [91]:
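def get_diarist_value(soup, field_name):
    '''Return the value that follows a field label in a Money Diary.

    Takes the parsed page and a label such as 'Age:' and returns the text
    immediately after that label, or 'BLANK' if the label or value is missing.'''
    try:
        # Find the text node that contains the label
        # (`string=` is the current name for the older `text=` argument)
        obj = soup.find(string=re.compile(field_name))

        if not obj:
            return 'BLANK'

        # The value is expected to be the element right after the label
        individual_info = obj.next_element
        if individual_info:
            return individual_info.strip()
        return 'BLANK'
    except TypeError:
        # Raised when the next element cannot be stripped like a plain string
        return 'BLANK'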
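
To make the lookup concrete, here is a small sketch of the label-then-next-element pattern on a hand-written snippet. The HTML is invented for illustration and is only assumed to resemble how a diary page lays out its labels; the real markup may differ.

```python
import re

from bs4 import BeautifulSoup

# Invented markup: each label sits in its own tag, followed by the value as plain text.
html = (
    "<div><strong>Occupation:</strong> Analyst</div>"
    "<div><strong>Age:</strong> 27</div>"
)
soup = BeautifulSoup(html, "html.parser")

# Find the text node holding the label, then step to the node parsed right after it.
label = soup.find(string=re.compile("Age:"))
value = label.next_element.strip()

# A label that is not on the page yields None, which the helper maps to 'BLANK'.
missing = soup.find(string=re.compile("Salary:"))
```

Because `next_element` follows parse order, the pattern relies on the value being the very next node after the label. If a tag comes next instead of text, `.strip()` is not available on it, which is presumably why the helper catches `TypeError`.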

**Create a dictionary to hold the scraped data.**

In [93]:
def get_money_diaries_dict(id_link):
    
    '''Scrape one Money Diary and return its fields (diarist details and diary text) as a dictionary'''
    
    #Develop base URL
    base_url = "https://www.refinery29.com"
    
    #Create full URL to scrape
    url = base_url + id_link
    
    #Request HTML and parse
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page,"lxml")
    
    headers = ["story_title", 'occupation', 'age', 'location', 'salary', 'net_worth', 
               'debt', 'rent', 'mortgage','loans', 'savings', 'diary_text']
    
    # Story title
    story_title = id_link
    
    # Occupation
    occupation = get_diarist_value(soup, 'Occupation:')
    
    # Age
    age = get_diarist_value(soup, 'Age:')
    
    # Location
    location = get_diarist_value(soup, 'Location:')
    
    # Salary
    salary = get_diarist_value(soup, 'Salary:')
    
    # Net Worth
    net_worth = get_diarist_value(soup, 'Net Worth:')
    
    # Debt
    debt = get_diarist_value(soup, 'Debt:')
    
    # Rent
    rent = get_diarist_value(soup, 'Rent:')
    
    # Mortgage
    mortgage = get_diarist_value(soup, 'Mortgage:')
    
    # Loans
    loans = get_diarist_value(soup, 'Loans:')
    
    # Savings
    savings = get_diarist_value(soup, 'Savings:')
    
    # Diary text
    diary_text = []
    for div in soup.find_all('div', class_='section-text'):
        diary_text.append(div.text)
    
    data_dict = dict(zip(headers, [story_title, occupation, age, location, 
                                   salary, net_worth, debt, rent, mortgage, 
                                   loans, savings, diary_text]))
    
    return data_dict
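
The per-field lookups in the function above all follow the same shape, so they could also be driven by a mapping from column name to page label. The sketch below shows that idea in isolation; `FIELD_LABELS`, `build_diarist_fields`, and `fake_page` are hypothetical names, and the lambda stands in for `lambda label: get_diarist_value(soup, label)`.

```python
# Column name -> label text as it appears on the page (labels copied from the function above).
FIELD_LABELS = {
    "occupation": "Occupation:",
    "age": "Age:",
    "location": "Location:",
    "salary": "Salary:",
    "net_worth": "Net Worth:",
    "debt": "Debt:",
    "rent": "Rent:",
    "mortgage": "Mortgage:",
    "loans": "Loans:",
    "savings": "Savings:",
}


def build_diarist_fields(lookup, field_labels=FIELD_LABELS):
    """Call `lookup(label)` once per field and collect the results by column name."""
    return {key: lookup(label) for key, label in field_labels.items()}


# Made-up page content; any label not present falls back to 'BLANK'.
fake_page = {"Occupation:": "Analyst", "Age:": "27"}
fields = build_diarist_fields(lambda label: fake_page.get(label, "BLANK"))
```

Adding a new field then means adding one entry to the mapping rather than another assignment and another position in the `zip` call.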

In [96]:
# Uncomment to re-scrape every diary (one HTTP request per link).
# money_diary_list = []

# for link in r29_links:
#     money_diary_list.append(get_money_diaries_dict(link))
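
A loop over several hundred pages can fail partway through and sends requests back to back. As a hedged sketch of one way to make it more forgiving, the loop below pauses between requests and records failures instead of stopping. `scrape_all` and `fake_scrape_one` are hypothetical names, the delay value is arbitrary, and the sleep function is injectable so the example runs without waiting or touching the network.

```python
import time


def scrape_all(links, scrape_one, delay_seconds=1.0, sleep=time.sleep):
    """Scrape every link, pausing between requests and collecting failures."""
    results = []
    failures = []
    for i, link in enumerate(links):
        try:
            results.append(scrape_one(link))
        except Exception as exc:
            # Keep going; remember which link failed and why.
            failures.append((link, repr(exc)))
        # Pause between requests, but not after the last one.
        if i < len(links) - 1:
            sleep(delay_seconds)
    return results, failures


def fake_scrape_one(link):
    # Stand-in for get_money_diaries_dict; fails on one made-up link.
    if "broken" in link:
        raise ValueError("could not parse page")
    return {"story_title": link}


pauses = []  # records requested delays instead of actually sleeping
rows, failed = scrape_all(
    ["/en-us/a", "/en-us/broken", "/en-us/c"],
    fake_scrape_one,
    delay_seconds=0.5,
    sleep=pauses.append,
)
```

In the notebook this would be called as `scrape_all(r29_links, get_money_diaries_dict)`, with the first returned list playing the role of `money_diary_list`.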

**Convert the list of scraped data to a dataframe.**

In [107]:
money_df = pd.DataFrame(money_diary_list)  

**Create two dataframes: one with the diary text and one with the diarist information.**

In [4]:
text_df = money_df[['story_title','diary_text']]
diarist_df = money_df.drop('diary_text',axis=1)

**Pickle dataframes for future use.**

In [108]:
# # money_df
# with open('money_df.pkl', 'wb') as f:
#     pickle.dump(money_df, f)

In [5]:
# # text_df
# with open('text_df.pkl', 'wb') as f:
#     pickle.dump(text_df, f)

In [110]:
# # diarist_df
# with open('diarist_df.pkl', 'wb') as f:
#     pickle.dump(diarist_df, f)