# MSDS 631 - Lecture 11 (April 10, 2019)

## End-to-End Analysis

Now that we're comfortable with the basic functionality of Python and Pandas, I want to go through an example of what you can do with all of the tools we've discussed (plus a few extra that we haven't covered yet).

We have three data files that we're mostly going to read from:
- Users: Describes users' overall characteristics
- Reviews: Details about individual reviews
- Businesses: Details about the businesses our reviewers rated

The reviews in these files are primarily focused on reviews in Las Vegas, Toronto, and Phoenix.

In [None]:
import os
import re
import json
import pandas as pd
import numpy as np
from datetime import datetime as dt

#### These are my file path fetchers

In [None]:
#Code to help me get all of the files in a folder so that I can open them
def get_file_paths(folder, full_path=True):
    file_paths = []
    for (dirpath, dirnames, filenames) in os.walk(folder):
        for file_ in filenames:
            # Skip hidden files (names starting with '.') and the _SUCCESS marker file
            if (not file_.startswith('.')) and (file_ != '_SUCCESS'):
                if full_path:
                    file_paths.append(os.path.join(dirpath, file_))
                else:
                    file_paths.append((dirpath, file_))
    return file_paths


def get_all_paths(subfolder):
    folder = os.path.join('/Users/jasonshu/Documents/yelp', subfolder)
    files = get_file_paths(folder)
    return files

In [None]:
file_paths = {}
file_paths['reviews'] = get_all_paths('reviews')
file_paths['users'] = get_all_paths('users')
file_paths['businesses'] = ['/Users/jasonshu/Documents/yelp/yelp_businesses.json']

In [None]:
for key in file_paths:
    print(key, len(file_paths[key]))

#### These are my file openers

In [None]:
def open_single_review_file(filepath):
    data = []
    with open(filepath,  'r') as f:
        reviews = f.read().strip().split('\n')
    for review in reviews:
        temp = json.loads(review)
        del(temp['text'])
        data.append(temp)
    return data


def open_single_user_file(filepath):
    data = []
    with open(filepath,  'r') as f:
        users = f.read().strip().split('\n')
    for user in users:
        temp = json.loads(user)
        data.append(temp)
    return data


def open_business_data(filepath):
    data = []
    with open(filepath,  'r') as f:
        businesses = f.read().strip().split('\n')
    for business in businesses:
        temp = json.loads(business)
        data.append(temp)
    return data

In [None]:
preview = {}
preview['reviews'] = open_single_review_file(file_paths['reviews'][0])
preview['users'] = open_single_user_file(file_paths['users'][0])
preview['businesses'] = open_business_data(file_paths['businesses'][0])

The first thing you ALWAYS do when opening data for the first time is to look at the data type that you have loaded. JSON parsers can return lists or dictionaries, and this will be what allows you to know what you are dealing with and how to access the data.

Once we've done that, we look at a few values to figure out what we have that could be of use for an analysis. I'm doing that for each of our data sources.
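
As a small illustration of why the type check matters, here is a sketch using two made-up JSON strings (they are not taken from the Yelp files). A file with one JSON object per line gives you a dictionary for each line, while a file holding a single JSON array gives you a list.

```python
import json

# Hypothetical sample data, made up for illustration only
json_lines = '{"review_id": "r1", "stars": 5}\n{"review_id": "r2", "stars": 2}'
json_array = '[{"review_id": "r1", "stars": 5}, {"review_id": "r2", "stars": 2}]'

# One JSON object per line: parse each line separately
records = [json.loads(line) for line in json_lines.strip().split('\n')]
# A single JSON array: parse the whole string at once
parsed_array = json.loads(json_array)

print(type(records[0]))    # a dict, so access by key: records[0]['stars']
print(type(parsed_array))  # a list, so access by position: parsed_array[0]
```

Both routes end up with the same list of dictionaries here, but you only know which parsing route to take after looking at the type.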

##### Reviews

In [None]:
type(preview['reviews'])

In [None]:
len(preview['reviews'])

In [None]:
preview['reviews'][0]

##### Users

In [None]:
type(preview['users'])

In [None]:
len(preview['users'])

In [None]:
preview['users'][0]

##### Businesses

In [None]:
type(preview['businesses'])

In [None]:
len(preview['businesses'])

In [None]:
preview['businesses'][0]

##### Estimated number of observations per type

In [None]:
for key in file_paths:
    num_files = len(file_paths[key])
    obs_per_file = len(preview[key])
    print(key, num_files * obs_per_file)

Now that I know what the data looks like, I'm going to write some functions to get exactly the data I want out of each dictionary from each data source. I don't want to carry around all of the data because it's overwhelming and unnecessary. This is where you need a specific problem in mind to solve so that you can guess ahead of time what you think will be necessary and what you think will never be used.

Sometimes you will want to loop through each item and get only the data you need, then create your DataFrames. Other times (when you are lucky) you can create the DataFrame directly from the data, then get rid of what you don't want. In this case, I have to parse the user data, but I can load the review data directly from the list of dictionaries I have opened. The business data is a little wonky because one of its values is itself a dictionary, which does not fit cleanly into a single DataFrame column. However, since I am not going to keep that field, I will load the data directly as a DataFrame and then drop the columns I don't need.

NOTE: Loading data as a DataFrame and dropping unwanted fields is much faster than using for-loops and parsing along the way. However, it is much more resource intensive for your computer to hold all of the data in a DataFrame, so your computer may not actually be able to do it every time. This is a trial and error thing until you do it enough times to know what you can and cannot load.
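
To make the two approaches concrete, here is a minimal sketch on made-up business records (the field values are hypothetical). It only shows that both routes give the same table; it says nothing about speed or memory on the real files.

```python
import pandas as pd

# Hypothetical records, made up for illustration only
records = [
    {'business_id': 'b1', 'stars': 4.5, 'city': 'Toronto', 'hours': {'Mon': '9-5'}},
    {'business_id': 'b2', 'stars': 3.0, 'city': 'Phoenix', 'hours': {'Mon': '8-4'}},
]
wanted = ['business_id', 'stars', 'city']

# Approach 1: parse each record in a loop, keeping only the wanted fields
rows = [[record[field] for field in wanted] for record in records]
looped_df = pd.DataFrame(rows, columns=wanted)

# Approach 2: load everything, then keep only the wanted columns
direct_df = pd.DataFrame(records)[wanted].copy()

print(looped_df)
print(direct_df)
```

Approach 2 briefly holds the nested `hours` dictionaries in memory before they are dropped, which is the extra cost the note above describes.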

#### These are my file parsers for users and businesses

For users, I want some raw data, but I also want to "create" new data from the data I am seeing. For instance, I don't care about who the individual user's friends are - that's not helpful in this context. I do, however, think it could be useful to know *how many* friends they have. I also want to know the years they were "Elite" because I think that may be interesting down the road. Since a user can be Elite in zero, one, or many years, that data does not align 1-to-1 with users, and there is no easy way to store it as a single attribute of the user. Thus, I store the users' elite years separately, with one row per user per year.

In [None]:
def parse_single_user_dict(user):
    base_fields = ['user_id', 'name', 'review_count', 'average_stars', 'yelping_since', 'fans', 'cool', 'funny', 'useful']
    base_data = [user[field] for field in base_fields]
    num_friends = len(user['friends'].split(', '))
    base_data.append(num_friends)
    if user['elite']:
        years_elite = user['elite'].split(',')
        user_id = user['user_id']
        elite_list = []
        for year in years_elite:
            elite_list.append([user_id, year])
    else:
        elite_list = []
    return base_data, elite_list

def parse_single_business_dict(business):
    base_fields = ['business_id', 'name', 'review_count', 'stars', 'city', 'state', 'postal_code']
    base_data = [business[field] for field in base_fields]
    return base_data

In [None]:
parse_single_user_dict(preview['users'][0])
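
One thing worth checking in the preview: `str.split` always returns at least one element, so a user with an empty `friends` string would be counted as having one friend. The sketch below is one possible guard. Treating an empty string or the literal `'None'` as "no friends" is an assumption about how the data encodes missing values, and `count_friends` is a hypothetical helper, not part of the parser above.

```python
def count_friends(friends_field):
    """Count comma-separated friend IDs, treating empty markers as zero.

    Assumption: an empty string or the literal 'None' means "no friends".
    Check a few raw records before relying on this.
    """
    cleaned = friends_field.strip()
    if cleaned in ('', 'None'):
        return 0
    return len(cleaned.split(', '))


print(len(''.split(', ')))          # 1, even though there are no friends
print(count_friends(''))            # 0
print(count_friends('u1, u2, u3'))  # 3
```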

Now that I have my parser and file loading strategy, I will write a function that allows me to convert each of the files I open into a DataFrame. This is ultimately where we want 90% of our analyses to wind up (in a DataFrame). I will combine the knowledge I have of the structure of the file with the user parser I just wrote to do this.

##### DataFrame loaders

In [None]:
def load_reviews_from_file_as_df(reviews_list_of_dicts):
    kept_columns = ['review_id', 'user_id', 'business_id', 'stars', 'date']
    reviews_df = pd.DataFrame(reviews_list_of_dicts)
    subset_df = reviews_df[kept_columns].copy()
    return subset_df

In [None]:
def load_users_from_file_as_df(users_list_of_dicts):
    user_data = []
    elite_data = []
    for user_dict in users_list_of_dicts:
        user, elite = parse_single_user_dict(user_dict)
        user_data.append(user)
        elite_data += elite
    user_columns = ['user_id', 'name', 'review_count', 'average_stars', 'yelping_since', 'fans', 'cool', 
                    'funny', 'useful', 'num_friends']
    user_df = pd.DataFrame(user_data, columns=user_columns)
    # Name the columns at creation time; this also works if elite_data is empty
    elite_df = pd.DataFrame(elite_data, columns=['user_id', 'year'])
    return user_df, elite_df

In [None]:
def load_businesses_from_file_as_df(businesses_list_of_dicts):
    kept_columns = ['business_id', 'name', 'review_count', 'stars', 'city', 'state', 'postal_code']
    businesses_df = pd.DataFrame(businesses_list_of_dicts)
    subset_df = businesses_df[kept_columns].copy()
    return subset_df

ALMOST THERE!

Now that we can open a single file, for the data types that are being read from multiple files, I need to open each file, load it as a DataFrame, put each DataFrame into a list, then concatenate them into a single DataFrame.

For businesses, there is only one file, so I don't strictly need another function. I'm writing one anyway so that the naming convention is consistent and I can keep track of things better.
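
As a quick sketch of the concatenation step, here are two made-up per-file DataFrames being stacked with `pd.concat`. By default each piece keeps its own row labels, so the combined index repeats; passing `ignore_index=True` renumbers the rows.

```python
import pandas as pd

# Two hypothetical per-file DataFrames, made up for illustration only
file_1 = pd.DataFrame({'review_id': ['r1', 'r2'], 'stars': [5, 3]})
file_2 = pd.DataFrame({'review_id': ['r3', 'r4'], 'stars': [1, 4]})

# Default: each piece keeps its own 0..n-1 index, so labels repeat
stacked = pd.concat([file_1, file_2])
print(stacked.index.tolist())      # [0, 1, 0, 1]

# ignore_index=True renumbers the rows from 0 after stacking
renumbered = pd.concat([file_1, file_2], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```

Repeated labels are harmless for grouping and pivoting, but they can surprise you if you later select rows with `.loc` by label.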

In [None]:
def load_all_reviews(list_of_filepaths):
    all_reviews = []
    num_files = len(list_of_filepaths)
    print('Loading reviews')
    for n,f in enumerate(list_of_filepaths):
        print(f'\t{n + 1} of {num_files}')
        reviews_list_of_dicts = open_single_review_file(f)
        reviews_df = load_reviews_from_file_as_df(reviews_list_of_dicts)
        all_reviews.append(reviews_df)
    all_reviews_df = pd.concat(all_reviews)
    return all_reviews_df

In [None]:
def load_all_users(list_of_filepaths):
    all_users = []
    all_elite = []
    num_files = len(list_of_filepaths)
    print('Loading users')
    for n, f in enumerate(list_of_filepaths):
        print(f'\t{n + 1} of {num_files}')
        users_list_of_dicts = open_single_user_file(f)
        users_df, elite_df = load_users_from_file_as_df(users_list_of_dicts)
        all_users.append(users_df)
        all_elite.append(elite_df)
    all_users_df = pd.concat(all_users)
    all_elite_df = pd.concat(all_elite)
    return all_users_df, all_elite_df

In [None]:
def load_all_businesses(filepath):
    print('Loading businesses')
    businesses_list_of_dicts = open_business_data(filepath)
    all_businesses_df = load_businesses_from_file_as_df(businesses_list_of_dicts)
    return all_businesses_df

In [None]:
t1 = dt.now()
reviews = load_all_reviews(file_paths['reviews'])
t2 = dt.now()
print('\t', t2-t1)
users, elite = load_all_users(file_paths['users'])
t3 = dt.now()
print('\t', t3-t2)
businesses = load_all_businesses(file_paths['businesses'][0])
t4 = dt.now()
print('\t', t4-t3)

In [None]:
reviews.shape

In [None]:
users.shape

In [None]:
businesses.shape

#### Success!!!

Loading and cleaning data is the worst part of every analysis. It took me 6 hours to write the code just to **open** the files, let alone parse them and transform them into DataFrames. Don't think for a second that any of this comes easily, or that you are doing something wrong if you are spinning your wheels trying to get the data ready to be analyzed. Some estimates suggest that 80% of a data scientist's time is spent just loading and cleaning data. The analysis is the **easy** part! This is why we spent so much of our class looking at how to do things in Python. Now that we have the data loaded, it's time to have fun!!

## Analyses

In this section, I will perform three analyses that answer different questions about Yelp users and businesses, including:
1. Can I make up "personas" for Yelp reviewers and find them?
2. Do "Elite" reviewers tend to take their responsibility more seriously and judge businesses more critically?

### Analysis 1a - Finding personas of Yelp users

Yelp data is one of my favorite data sets because there are so many things you can glean about people. One of the things that intrigues me the most is regarding the different types of "personalities" that exist amongst reviewers. A few include:
- Hot-Cold Harry: Tends to either love a place or hate it (mostly 5-star and 1-star reviews)
- Normal Norman: Tends to fit the distribution of reviews in a bell-shaped curve
- Negative Nancy: Tends to be mostly a complainer with more 1- and 2-star reviews
- Positive Patricia: Tends to give mostly 4- and 5-star reviews. She is either easily satisfied or only goes places that she knows she will like (reinforcement bias)
- Content Cassandra: Tends to give mostly positive reviews, centered mostly around 4-star with some 5-star and some 3-star as well
- Even Evan: Tends to give every star rating about equally often

So that you might know what these distributions look like, let's just set those up here.

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

In [None]:
distributions = {}
distributions['Hot-Cold Harry'] = pd.Series([.3, .1, .05, .15, .4])
distributions['Normal Norman'] = pd.Series([.05, .15, .4, .25, .15])
distributions['Negative Nancy'] = pd.Series([.3, .4, .1, .1, .1])
distributions['Positive Patricia'] = pd.Series([.05, .05, .15, .25, .5])
distributions['Content Cassandra'] = pd.Series([0, 0, .2, .5, .3])
distributions['Even Evan'] = pd.Series([.2, .2, .2, .2, .2])

In [None]:
for name in distributions:
    distributions[name].plot(kind='bar')
    print(name)
    plt.show()

Now let's use some techniques used for measuring "closeness" or "similarity" to assign a reviewer a persona.

In order to get each reviewer's star-rating distribution, we need to aggregate their review data. Let's refresh our memories about what the review DataFrame looks like.

In [None]:
reviews.head()

I'm going to convert the star ratings to integers to make things a little easier for me. Then I'm going to group by the user IDs and star ratings to get the users' review count by rating.

In [None]:
reviews['stars'] = reviews['stars'].astype(int)
reviewer_stars = reviews.groupby(['user_id', 'stars'])[['review_id']].count()
reviewer_stars = reviewer_stars.reset_index()
reviewer_stars.columns = ['user_id', 'stars', 'review_count']
reviewer_stars.head(10)

Now we want the data in a form that's more easily digestible. We're going to use the handy dandy pivot table to do this.

In [None]:
reviewer_stars_table = reviewer_stars.pivot_table(index='user_id', columns='stars', values='review_count')
reviewer_stars_table.head(10)
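
Here is the same group-then-pivot pattern on a tiny made-up reviews table (all IDs and ratings are hypothetical), so you can see what each step produces without scrolling through the full data.

```python
import pandas as pd

# Hypothetical reviews, made up for illustration only
toy_reviews = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r3', 'r4', 'r5'],
    'user_id':   ['u1', 'u1', 'u1', 'u2', 'u2'],
    'stars':     [5, 5, 1, 4, 3],
})

# Long format: one row per (user, star rating) pair with a count
counts = toy_reviews.groupby(['user_id', 'stars'])[['review_id']].count().reset_index()
counts.columns = ['user_id', 'stars', 'review_count']

# Wide format: one row per user, one column per star rating
table = counts.pivot_table(index='user_id', columns='stars', values='review_count')
print(table)

# Ratings a user never gave show up as NaN rather than 0
print(table.loc['u1', 4])
```

Note that the toy table has no 2-star column at all, because nobody gave a 2-star review. On the full data every rating appears, but the NaN values are still there and need handling before any math.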

We can see that several of our users only have a few reviews. Let's only focus on ones that have at least 10 reviews (this is an arbitrary cutoff).

In [None]:
reviewer_stars_table['total_reviews'] = reviewer_stars_table.sum(axis=1)

In [None]:
reviewer_stars_table.shape

In [None]:
reviewer_stars_table_10 = reviewer_stars_table[reviewer_stars_table['total_reviews'] >= 10]
reviewer_stars_table_10.head(10)

Out of curiosity, how many reviewers did we have, and how many do we have left after filtering?

In [None]:
print(reviewer_stars_table.shape[0])
print(reviewer_stars_table_10.shape[0])
print(reviewer_stars_table_10.shape[0]/reviewer_stars_table.shape[0])

We lost 92.5% of our reviewers!!! Let's see what percentage of reviews were actually written by those who met our threshold.

In [None]:
count_analysis = reviewer_stars_table.copy()
count_analysis = count_analysis.sort_values('total_reviews', ascending=False)
total_reviews = count_analysis['total_reviews'].sum()
count_analysis['cum_reviews'] = count_analysis['total_reviews'].cumsum()
count_analysis['cum_pct'] = count_analysis['cum_reviews'] / total_reviews
count_analysis.head(10)

In [None]:
count_analysis[count_analysis['total_reviews'] >= 10].tail()

Based on this, it looks like our reviewers with 10 or more reviews wrote 52.1% of the total reviews in our data set. I can live with that. OK, now let's look at a few random users' star distributions and see what they look like.

In [None]:
sample = reviewer_stars_table_10.sample(5)
sample

In [None]:
for user in sample.index:
    row = sample.loc[user, [1,2,3,4,5]]
    row.plot(kind='bar')
    plt.show()

OK, now let's see if we can automatically find which of our personas each user is closest to. We're going to use the idea of "cosine similarity."

<img src="https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/assets/2b4a7a82-ad4c-4b2a-b808-e423a334de6f.png" width="350" height="350"/>

The formula for this is:
<img src="https://neo4j.com/docs/graph-algorithms/current/images/cosine-similarity.png" width="400" height="400"/>
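
In case the image does not load, the standard definition of cosine similarity for two vectors $A$ and $B$ of length $n$ can be written out as follows (the symbols $A$, $B$, and $n$ are my notation here, not necessarily the image's):

$$
\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

The numerator is the dot product and the denominator is the product of the two vectors' magnitudes.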

So, if I had two identical vectors listed below, their similarity should be perfect. Let's see the math work.

In [None]:
a = pd.Series([1, 2, 1, 4, 2])
b = pd.Series([1, 2, 1, 4, 2])

In [None]:
a.dot(b)

In [None]:
a_magnitude = np.sqrt((a ** 2).sum())
b_magnitude = np.sqrt((b ** 2).sum())

In [None]:
(a.dot(b)) / (a_magnitude * b_magnitude)

Sweet!

Let's change b a tiny bit and see what happens.

In [None]:
b = pd.Series([1, 2, 1, 3, 2])
b_magnitude = np.sqrt((b ** 2).sum())

In [None]:
(a.dot(b)) / (a_magnitude * b_magnitude)

This shouldn't be interpreted as "99% similar", but if that helps you internalize similarity, then go ahead and think that way.

Let's go ahead and write the function so that we can compute the similarity of any two vectors (assuming they are the same length).

In [None]:
def compute_cosine_similarity(vec1, vec2):
    if isinstance(vec1, list):
        vec1 = pd.Series(vec1)
    if isinstance(vec2, list):
        vec2 = pd.Series(vec2)
    dot = vec1.dot(vec2)
    magnitude1 = np.sqrt((vec1 ** 2).sum())
    magnitude2 = np.sqrt((vec2 ** 2).sum())
    similarity = dot / (magnitude1 * magnitude2)
    return similarity

In [None]:
#Let's test our function
compute_cosine_similarity(a, b)
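
One property worth knowing before we apply this to reviewers: cosine similarity only depends on the direction of a vector, not its length. That means a reviewer's raw counts and their relative percentages give the same score. The sketch below checks this in plain Python with made-up review counts; `cosine_similarity` here is a standalone re-implementation for illustration, not the notebook's function.

```python
import math


def cosine_similarity(vec1, vec2):
    """Plain-Python cosine similarity for two equal-length sequences."""
    dot = sum(x * y for x, y in zip(vec1, vec2))
    magnitude1 = math.sqrt(sum(x ** 2 for x in vec1))
    magnitude2 = math.sqrt(sum(y ** 2 for y in vec2))
    return dot / (magnitude1 * magnitude2)


counts = [2, 1, 3, 10, 4]                    # hypothetical review counts per star rating
shares = [c / sum(counts) for c in counts]   # the same reviewer as proportions
persona = [0.05, 0.05, 0.15, 0.25, 0.5]      # Positive Patricia from above

from_counts = cosine_similarity(counts, persona)
from_shares = cosine_similarity(shares, persona)
print(math.isclose(from_counts, from_shares))  # True
```

A vector of all zeros has magnitude 0 and would cause a division by zero, so this only works for reviewers with at least one review.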

To help ourselves out, we're going to do a couple of things now:
1. Create a DataFrame of values for our personas
2. Drop the total_reviews column from our table
3. Write a function that computes the similarity of a user to each of our personas

In [None]:
personas_df = pd.DataFrame.from_dict(distributions, orient='index')
personas_df = personas_df.sort_index()
personas_df

In [None]:
#Looks like we need to modify the column names
personas_df.columns = [1,2,3,4,5]
personas_df

In [None]:
reviewer_stars_table_10 = reviewer_stars_table_10.drop('total_reviews', axis=1)

In [None]:
reviewer_stars_table_10.head()

In [None]:
#We need to fill in the null values with 0 so that we can do our math
reviewer_stars_table_10 = reviewer_stars_table_10.fillna(0)

In [None]:
#Let's just see an example of a row
one_reviewer_row = reviewer_stars_table_10.loc['---1lKK3aKOuomHnwAkAow']
one_reviewer_row

In [None]:
one_reviewer_distribution = one_reviewer_row / one_reviewer_row.sum()
one_reviewer_distribution

In [None]:
one_reviewer_distribution.plot(kind='bar')

In [None]:
def plot_against_personas(one_reviewer_distribution, distributions):
    axes = []
    fig = plt.figure(figsize=(10,8))  
    order = sorted(list(distributions.keys()))
    for i, persona in enumerate(order):
        axes.append(plt.subplot2grid((6,6),(i,5)))
    axes.append(plt.subplot2grid((6,6),(0,0),rowspan=6,colspan=5))
    for i, persona in enumerate(order):
        distributions[persona].plot(kind='bar', ax=axes[i])
        axes[i].xaxis.set_visible(False)
        axes[i].yaxis.set_visible(False)
        axes[i].set_title(persona, fontdict={'fontsize': 8})
        axes[i].set_ylim((0,.6))
    # The large reviewer plot is the last axis added above
    one_reviewer_distribution.plot(kind='bar', ax=axes[-1])
    plt.show()
plot_against_personas(one_reviewer_distribution, distributions)    

In [None]:
def measure_persona_similarity(user_id, user_review_count, personas_df):
    all_scores = [user_id]
    # Cosine similarity ignores vector length, so raw counts and relative
    # percentages give the same score; no rescaling is needed here.
    # Use the personas passed in rather than the global distributions dict.
    for persona in sorted(personas_df.index):
        persona_distribution = personas_df.loc[persona]
        similarity = compute_cosine_similarity(user_review_count, persona_distribution)
        all_scores.append(similarity)
    return all_scores

In [None]:
measure_persona_similarity('---1lKK3aKOuomHnwAkAow', one_reviewer_distribution, personas_df)

In [None]:
reviewer_stars_table_10.head()

In [None]:
def scale_rows(df):
    # Divide each row by its own total so that every row sums to 1
    total = df.sum(axis=1)
    scaled_df = df.div(total, axis=0)
    return scaled_df

In [None]:
reviewer_stars_table_10_scaled = scale_rows(reviewer_stars_table_10)

In [None]:
reviewer_stars_table_10_scaled.head()

In [None]:
all_similarities = []
t1 = dt.now()
# Only loop over the first 10,000 reviewers; the row-by-row approach is slow
subset = reviewer_stars_table_10.head(10000)
for i, user_id in enumerate(subset.index):
    if i % 1000 == 0:
        # Progress is measured against the subset, not the full table
        print('{:.1%}'.format(i / subset.shape[0]))
    row = subset.loc[user_id]
    similarities = measure_persona_similarity(user_id, row, personas_df)
    all_similarities.append(similarities)
t2 = dt.now()
print(t2-t1)

In [None]:
ids = reviewer_stars_table_10_scaled.head(10000).index.tolist()
columns = ['user_id'] + personas_df.index.tolist()
similarities_df = pd.DataFrame(all_similarities, columns=columns, index=ids)
similarities_df.head(10)

In [None]:
def compute_similarities_linalg(reviewer_relative_stars_df, personas_df):
    reviewer_matrix = reviewer_relative_stars_df.values.T
    personas_matrix = personas_df.values
    dot = pd.DataFrame(personas_matrix.dot(reviewer_matrix).T)
    user_magnitude = pd.DataFrame(np.sqrt((reviewer_relative_stars_df ** 2).sum(axis=1)))
    persona_magnitude = pd.DataFrame(np.sqrt((personas_df ** 2).sum(axis=1)))
    magnitude_product = user_magnitude.values.dot(persona_magnitude.values.T)
    similarity_scores = dot / magnitude_product
    similarity_scores.columns = personas_df.index.tolist()
    similarity_scores.set_index(reviewer_relative_stars_df.index, inplace=True)
    return similarity_scores

In [None]:
user_persona_scores_df = compute_similarities_linalg(reviewer_stars_table_10_scaled, personas_df)

In [None]:
user_persona_scores_df.head()
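
The matrix version relies on two facts: a single matrix product computes every reviewer-persona dot product at once, and an outer product of the two magnitude vectors gives every denominator. Here is a minimal NumPy sketch on made-up numbers (3 hypothetical reviewers, 2 personas), checked against a pair-by-pair loop. It illustrates the idea rather than reproducing the function above line for line.

```python
import numpy as np

# Hypothetical data: 3 reviewers and 2 personas over 5 star ratings
reviewers = np.array([
    [0.1, 0.1, 0.2, 0.3, 0.3],
    [0.5, 0.3, 0.1, 0.1, 0.0],
    [0.2, 0.2, 0.2, 0.2, 0.2],
])
personas = np.array([
    [0.05, 0.05, 0.15, 0.25, 0.50],   # Positive Patricia
    [0.30, 0.40, 0.10, 0.10, 0.10],   # Negative Nancy
])

# All dot products at once: shape (3 reviewers, 2 personas)
dots = reviewers @ personas.T
# Outer product of the row magnitudes gives every denominator
magnitudes = np.outer(np.linalg.norm(reviewers, axis=1),
                      np.linalg.norm(personas, axis=1))
vectorized = dots / magnitudes

# Same calculation one pair at a time, for comparison
looped = np.array([
    [r @ p / (np.linalg.norm(r) * np.linalg.norm(p)) for p in personas]
    for r in reviewers
])
print(np.allclose(vectorized, looped))  # True
```

Because the work happens inside NumPy instead of a Python loop, this approach is what lets us score every reviewer rather than only the first 10,000.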

In [None]:
user_personas = user_persona_scores_df.idxmax(axis=1)

In [None]:
user_personas.head(10)

In [None]:
user_personas_df = pd.DataFrame(user_personas, columns=['persona'])
user_personas_df.head(10)

In [None]:
def plot_random_user(persona_type, user_personas, reviewer_relative_distributions_table, persona_distributions):
    # Pick one random user who was assigned the requested persona
    id_ = user_personas[user_personas==persona_type].sample().index[0]
    row = reviewer_relative_distributions_table.loc[id_]
    plot_against_personas(row, persona_distributions)
    # user_persona_scores_df is read from the notebook's global scope
    print(user_persona_scores_df.loc[id_])

In [None]:
plot_random_user('Negative Nancy', user_personas, reviewer_stars_table_10_scaled, distributions)

Terrific!

One last thing! Let's plot a few different plot types showing the distribution of personas.

We want to do the following things:
- Plot a bar chart (in descending order) of the distribution of personas
- Plot a pie chart (in the same chart as the bar chart) of the distribution of personas
- Move the legend for the pie chart
- Give each subplot a title
- Give the overall graphic a title
- Plot small versions of each persona on the right side of the chart

In [None]:
persona_sizes = user_personas_df.groupby('persona').size().sort_values(ascending=False)

In [None]:
persona_sizes.plot(kind='bar')

### Analysis 2 - Are Elite reviewers more critical than non-elite users?

In [None]:
reviews.head()

In [None]:
elite.head()