Google Analytics Customer Revenue Analysis

Table of Contents

  • Group Members
  • Executive Summary
  • Introduction
  • Results and Conclusions
  • Challenges
  • Future Work
  • Takeaways From the Course

Group Members

  • Hsin-Yu Chen
  • Zili Bu
  • Huang-Chin Yen
  • Kexin Zhang

Executive Summary

Overview

We obtained this dataset from a Kaggle competition. The original challenge was to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset and predict revenue per customer. However, we set ourselves some new challenges for this Working with Large Datasets class.

The problems

  • Can we handle this large dataset with methods similar to those we used on small datasets?
  • What should we do with the JSON columns?
  • Which models are appropriate for this project?
  • Should we include all the variables in the dataset when predicting total revenue?

The solutions

  • We will draw on both what we learned in class and our experience with small datasets.
  • We will try to convert the JSON columns into regular data frame columns. If that does not work, we will drop them, since they contain many NAs.
  • We will first fit a linear regression model, because the dependent variable is numerical. We will also try models such as XGBoost, Random Forest, and Gradient Boosting (a brief sketch follows this list).
  • We will decide later, after reviewing the coefficient matrix and the feature importances.
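
A minimal sketch of this modeling plan, assuming the converted data already sits in a pandas DataFrame named `df` with a numeric totals.transactionRevenue column; the feature subset, variable names, and hyperparameters below are illustrative assumptions, not our actual code.

```python
# Sketch of the planned model comparison; `df`, the feature subset, and the
# hyperparameters are assumptions for illustration.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

numeric_features = ["visitNumber", "totals.hits", "totals.pageviews"]  # assumed subset
X = df[numeric_features].fillna(0)
y = df["totals.transactionRevenue"].fillna(0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "gradient_boost": GradientBoostingRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test R^2:", model.score(X_test, y_test))

# The linear coefficients and tree feature importances guide variable selection.
print(dict(zip(numeric_features, models["linear_regression"].coef_)))
print(dict(zip(numeric_features, models["random_forest"].feature_importances_)))
```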

Introduction

We found the Google Analytics Customer Revenue Prediction dataset on Kaggle. Dataset Link

Our dataset is sourced from the Google Analytics platform. It contains 1,708,337 rows and 12 columns, 4 of which are in JSON format. After the conversion, it has 60 columns instead (a quick check follows the list):

  • bool(1)
  • int64(4)
  • object(55)
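
A quick way to confirm this breakdown, assuming the flattened data is in a pandas DataFrame named `df_flat` (an illustrative name, not our actual variable):

```python
# Confirm the row count and column-type breakdown of the flattened data.
# `df_flat` is an assumed name for the converted DataFrame.
print(df_flat.shape)                  # expected: (1708337, 60)
print(df_flat.dtypes.value_counts())  # expected: object 55, int64 4, bool 1
```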

The Column Information

  • fullVisitorId - A unique identifier for each user of the Google Merchandise Store.
  • channelGrouping - The channel via which the user came to the Store.
  • date - The date on which the user visited the Store.
  • device - The specifications for the device used to access the Store.
  • geoNetwork - This section contains information about the geography of the user.
  • socialEngagementType - Engagement type, either "Socially Engaged" or "Not Socially Engaged".
  • totals - This section contains aggregate values across the session.
  • trafficSource - This section contains information about the Traffic Source from which the session originated.
  • visitId - An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId (see the sketch after this list).
  • visitNumber - The session number for this user. If this is the first session, then this is set to 1.
  • visitStartTime - The timestamp (expressed as POSIX time).
  • hits - This row and nested fields are populated for any and all types of hits. Provides a record of all page visits.
  • customDimensions - This section contains any user-level or session-level custom dimensions that are set for a session. This is a repeated field and has an entry for each dimension that is set.
  • totals - This set of columns mostly includes high-level aggregate data.
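
As the visitId description above notes, a fully unique session identifier requires combining fullVisitorId with visitId. A one-line sketch, again using the assumed `df_flat` DataFrame:

```python
# Build a fully unique session key from fullVisitorId and visitId, as the
# visitId description suggests. `df_flat` is an assumed variable name.
df_flat["sessionId"] = df_flat["fullVisitorId"].astype(str) + "_" + df_flat["visitId"].astype(str)
```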

As the competition page explains, the 80/20 rule holds for many businesses in the real world. It is important to understand our target customers and encourage them to purchase products at the Google Store by coordinating proper marketing promotions.

Here are some important business questions:

  • Which region/continent has the most customers or the highest revenue?
  • Which device types/operating systems have the most visits?
  • Which channel grouping has the most visits or the highest revenue?
  • Which weekdays and months have the most visits or the highest revenue?
  • Can we build models to predict the revenue from our customers?

After the initial exploration of our data, we would like to focus our analysis on the following columns (an exploratory query sketch follows the list):

  • totals.transactionRevenue: the revenue generated by the visit.
  • device.browser: the browser used for the visit.
  • device.deviceCategory: the kind of device used for the visit (tablet, mobile, or desktop).
  • channelGrouping: the channel via which the user came to the Store (organic search, display ads, direct, etc.).
  • device.operatingSystem: the operating system used for the visit.
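
The sketch below shows the kind of spark.sql exploration we run over these focus columns to address the business questions above; the DataFrame and view names are illustrative. Note the backticks around column names that contain periods, a point we return to under Challenges.

```python
# Illustrative exploratory query over the focus columns; `df` and the view name
# are assumed. Backticks are needed because the flattened column names contain periods.
df.createOrReplaceTempView("sessions")

revenue_by_device_channel = spark.sql("""
    SELECT `device.deviceCategory` AS device_category,
           channelGrouping,
           COUNT(*) AS visits,
           SUM(CAST(`totals.transactionRevenue` AS DOUBLE)) AS total_revenue
    FROM sessions
    GROUP BY `device.deviceCategory`, channelGrouping
    ORDER BY total_revenue DESC
""")
revenue_by_device_channel.show(10)
```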

Results and Conclusions

  • From our analysis, we found that total hits, visit hour, and visit number are important features for predicting whether a visitor's total transaction revenue is high or low.

  • In conclusion, the majority of visitors in this dataset have fewer than 100 total hits and spend more than 10 hours. This is reasonable, since most people think twice and browse around the website before buying items online. Moreover, most of the visitors who spend more are not in their first session. This might be because higher-spending visitors have already been through the website, or have browsed around it many times, before making a transaction.

  • We randomly split the data into training and test sets and use test accuracy (or test error) to evaluate model performance, as sketched below.
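
A minimal sketch of this split-and-evaluate step in Spark ML, assuming a DataFrame `data` that already has a "features" vector column and a three-class "label" column; the names and the classifier choice are illustrative.

```python
# Random 80/20 split and accuracy evaluation; `data`, "features", and "label"
# are assumed names, and the classifier choice is illustrative.
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

train, test = data.randomSplit([0.8, 0.2], seed=42)

model = RandomForestClassifier(labelCol="label", featuresCol="features").fit(train)
predictions = model.transform(test)

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test accuracy:", accuracy, "| Test error:", 1.0 - accuracy)
```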

Challenges

  1. The very first challenge was how to handle the JSON-typed columns in our CSV file. Fortunately, we found code provided by another analyst on Kaggle to help us transform the data. reference However, the kernel failed every time we converted the JSON columns, so we finished the conversion on a local machine and saved the converted file to S3 for further analysis (see the data-preparation sketch after this list).

  2. The description of this dataset on Kaggle.com is not very clear, so we spent some time researching it online to understand the meaning of each column, especially the new columns converted from JSON.

  3. Our dataset is very large and takes a long time to process (the kernel even died several times). We therefore used a smaller sample of the original dataset to verify the correctness of our code, and only then applied the code to the full dataset to save time.

  4. The converted JSON columns have a period in their names, which breaks our spark.sql commands later on. We realized that we need to wrap those column names in backticks to use them in SQL.

  5. The column totals.transactionRevenue has many null values after conversion from JSON. We filled all the nulls with zero, since a null means no transaction occurred during that visit.

  6. Our dataset contains many null values, most of the features are categorical, and many numerical features are constant, which makes model building challenging. After cleaning up the null values and label-encoding the categorical features, our model accuracy improved from 28% to 54%.

  7. Most of the values of our original prediction target, totals.transactionRevenue, are 0, which makes regression largely uninformative. Therefore, instead of building regression models, we split the revenue into three classes and built classification models (see the sketch below).
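
The sketch below pulls together the data-preparation steps from challenges 1, 5, and 7: flattening the JSON columns locally with pandas (in the spirit of the referenced Kaggle kernel), filling null revenue with zero, and bucketing revenue into three classes. The file paths, class threshold, and variable names are assumptions for illustration, not our exact code.

```python
# Consolidated data-preparation sketch for challenges 1, 5, and 7.
# Paths, the class threshold, and variable names are illustrative assumptions.
import json
import pandas as pd

JSON_COLUMNS = ["device", "geoNetwork", "totals", "trafficSource"]

# 1. Flatten the JSON columns locally (public Kaggle kernels take a similar approach).
df = pd.read_csv("train_v2.csv",                         # assumed local path
                 converters={c: json.loads for c in JSON_COLUMNS},
                 dtype={"fullVisitorId": str})
for col in JSON_COLUMNS:
    flattened = pd.json_normalize(df[col].tolist())
    flattened.columns = [f"{col}.{sub}" for sub in flattened.columns]
    df = df.drop(columns=[col]).join(flattened)

# 5. A null revenue means no transaction happened during the visit, so fill with zero.
df["totals.transactionRevenue"] = (
    pd.to_numeric(df["totals.transactionRevenue"], errors="coerce").fillna(0)
)

# 7. Bucket revenue into three classes instead of regressing on a mostly-zero target.
#    Splitting the positive revenues at their median is an assumed choice.
threshold = df.loc[df["totals.transactionRevenue"] > 0, "totals.transactionRevenue"].median()
df["revenueClass"] = pd.cut(
    df["totals.transactionRevenue"],
    bins=[-1, 0, threshold, float("inf")],
    labels=[0, 1, 2],            # 0 = no revenue, 1 = low, 2 = high
)

df.to_csv("train_flattened.csv", index=False)    # then uploaded to S3
```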

Future Work

  1. We do not have much information about the channel groupings (their differences, their return on investment, etc.). With more data about these channels, we could evaluate which social platforms or referral methods generate more visits to the GStore. We could then reallocate resources across the social and referral channels to optimize results, for example by improving the poorly performing channels or expanding the well-performing ones.

  2. Our best model for classifying customer revenue levels reaches 54% accuracy, which leaves a lot of room for improvement. Instead of directly deleting the null values and constant columns, we could gather more information and consult more references to handle the missing values better, which might improve predictions by bringing those deleted features back in. Also, instead of label-encoding the categorical variables, we could try one-hot encoding to see whether it performs better (a sketch follows).
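
A sketch of the one-hot-encoding alternative in Spark ML, assuming the flattened column names have been renamed to replace the periods (e.g. device_deviceCategory) and that a DataFrame `data` is available; all names here are illustrative.

```python
# One-hot encode the categorical features instead of plain label encoding.
# `data` and the column names (periods replaced by underscores) are assumed.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

categorical_cols = ["channelGrouping", "device_deviceCategory", "device_operatingSystem"]

indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in categorical_cols]
encoder = OneHotEncoder(inputCols=[c + "_idx" for c in categorical_cols],
                        outputCols=[c + "_vec" for c in categorical_cols])
assembler = VectorAssembler(
    inputCols=[c + "_vec" for c in categorical_cols] + ["visitNumber", "totals_hits"],
    outputCol="features")

feature_pipeline = Pipeline(stages=indexers + [encoder, assembler])
encoded = feature_pipeline.fit(data).transform(data)
```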

Takeaways From the Course

  • We became familiar with GitHub: how to connect to it, create and clone repositories, and push our work to a repo. We also benefited from GitHub's ability to record the history (changes) of each commit we made.
  • Practicing regular expressions helped us extract the desired strings from text files.
  • Working with the AWS platform was a great experience. We learned how to properly connect to a virtual machine, navigate it, and inspect our files on HDFS or S3 via shell commands.
  • We got a brief introduction to map and reduce functions for dealing with large real-world data, and experienced how parallelized computation on a cluster speeds up processing tasks.
  • We set up a Spark environment to run spark.sql commands directly in Python (Jupyter notebook).
  • When using Spark DataFrames, we realized that the .cache() method keeps a DataFrame in memory, which saves a lot of time when running repeated SQL commands against it, especially with data of this size.
  • We built machine learning models through pipelines (see the sketch below).
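
A minimal sketch tying the last three takeaways together: creating a Spark session so spark.sql works inside a Jupyter notebook, caching a large DataFrame, and fitting a model through a Pipeline. The S3 path, feature columns, and label column are illustrative assumptions.

```python
# Spark session + cache + spark.sql + Pipeline, as described in the takeaways above.
# The S3 path, feature columns, and label column are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("gstore-revenue").getOrCreate()

df = spark.read.csv("s3://our-bucket/train_flattened.csv", header=True, inferSchema=True)
df.cache()                                   # keep in memory for repeated SQL queries
df.createOrReplaceTempView("sessions")
spark.sql("SELECT COUNT(*) AS n_rows FROM sessions").show()

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="channelGrouping", outputCol="channel_idx", handleInvalid="keep"),
    VectorAssembler(inputCols=["channel_idx", "visitNumber"], outputCol="features"),
    RandomForestClassifier(labelCol="revenueClass", featuresCol="features"),
])
train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
```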

