<img src='./Resources/img/logo_footer_kly.png' alt='Kapanlagi Youniverse'>

<h1 style='text-align:center'>Kapanlagi Youniverse Sites Analysis</h1>
<p style='text-align:center'>Test submission for Data Analyst at Kapanlagi Youniverse</p>
<hr>

<h2>Content:</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#Introduction">Introduction</a>
            <ul>
                <li><a href="#Business-Understanding">Business Understanding</a></li>
                <li><a href="#Problem-Defining">Problem Defining</a></li>
                <li><a href="#Necessary-Libraries">Necessary Libraries</a></li>
            </ul></li><br>
        <li><a href="#Data-Understanding">Data Understanding</a>
            <ul>
                <li><a href="#Dataset-Import">Dataset Import</a></li>
                <li><a href="#Dataset-Information">Dataset Information</a></li>
            </ul></li><br>
        <li><a href="#Data-Preprocessing">Data Preprocessing</a>
            <ul>
                <li><a href="#Handling-Missing-Values">Handling Missing Values</a></li>
                <li><a href="#Feature-Engineering">Feature Engineering</a></li>
                <li><a href="#Unnecessary-Features">Unnecessary Features</a></li>
            </ul></li><br>
        <li><a href="#Exploratory-Data-Analysis">Exploratory Data Analysis</a></li><br>
        <li><a href="#Questions">Questions</a>
            <ul>
                <li><a href="#Domain-Retention">How could we know the retention % of the users on each domain?</a></li>
                <li><a href="#Favorable-Used">Can you give insight on favorable used browsers, devices, OS and platform?</a></li>
                <li><a href="#Overlapping-Users">Could you show the method on how to get the numbers & describe the overlapping users between each domain?</a></li>
            </ul></li><br>
        <li><a href="#Reference">Reference</a></li>
    </ol>
</div>

<h2><a name='Introduction'>1. Introduction</a></h2>

<h3><a name='Business-Understanding'>1.1. Business Understanding</a></h3>

Site analysis, also known as web analytics, is the measurement, collection, analysis, and reporting of web data to understand and optimize web usage. Web analytics is not only a way to measure web traffic; it can also serve as a tool for business and market research and for assessing and improving website effectiveness<a href='#1'>[1]</a>.

Kapanlagi Youniverse (KLY) is a digital media network that covers a wide range of content, including breaking news, lifestyle, entertainment, health, sport, and social-media-hyped content<a href='#2'>[2]</a>.

KLY is the result of a merger between PT. Liputan Enam Dot Com and PT. Kapanlagi Dot Com Networks (KLN), and is currently part of the Emtek Group, described as the largest media group in Indonesia. The media that belong to KLY are Liputan6.com, Bola.com, Kapanlagi.com, Merdeka.com, Bola.net, Fimela.com, Brilio.com, Famous.id, Otosia.com and Dream.co.id<a href='#3'>[3]</a>.

As a digital media network, KLY wants to improve the effectiveness of its websites through site analysis of a visit dataset, which contains a sample of user visit activity on KLY sites over 7 days. Dataset sources:
<ul>
    <li>sampling 1% visit dataset: <code>https://drive.google.com/drive/folders/1DxWkG29Kv7ozU2l5Xm7VDpLGhKdM9elV?usp=sharing</code></li>
    <li>sampling 2% visit dataset: <code>https://drive.google.com/drive/folders/1f1dYIBMSqzsrjrawuJMClUoAbHHl4S5q?usp=sharing</code></li>
    <li>sampling 5% visit dataset: <code>https://drive.google.com/drive/folders/1dcacIQeZAltjFvKnuzKqq3kwwYmp3eQE?usp=sharing</code></li>
    <li>sampling 10% visit dataset: <code>https://drive.google.com/drive/folders/1_6KSTkYCC3FGJnpP09rw5NceFdtsiGoL?usp=sharing</code></li>
    <li>Full Dataset: <code>https://drive.google.com/drive/folders/1CZxFknarzDLLmHt5VnHJ97Yro6QxWxFn?usp=sharing</code></li>
</ul>

Working with the entire dataset is not required, but a larger dataset is more challenging and more interesting.


<h3><a name='Problem-Defining'>1.2. Problem Defining</a></h3>

The analysis of the KLY sites aims to answer the following questions:
<ol>
    <li>How could we know the retention % of the users on each domain?</li>
    <li>Can you give insight on favorable used browsers, devices, OS and platform?</li>
    <li>Could you show the method on how to get the numbers & describe the overlapping users between each domain?</li>
</ol>

<h3><a name='Necessary-Libraries'>1.3. Necessary Libraries</a></h3>

This analysis uses several Python packages:
<ul>
    <li><a href='https://pandas.pydata.org/pandas-docs/stable/index.html'>Pandas</a> for data processing
        <ul>
            <li>Pip: <code>pip install pandas</code></li>
            <li>Conda: <code>conda install pandas</code></li>
        </ul></li>
    <li><a href='https://matplotlib.org/stable/contents.html'>Matplotlib</a> for visualization
        <ul>
            <li>Pip: <code>pip install matplotlib</code></li>
            <li>Conda: <code>conda install -c conda-forge matplotlib</code></li>
        </ul></li>
    <li><a href='https://seaborn.pydata.org/index.html'>Seaborn</a> for visualization
        <ul>
            <li>Pip: <code>pip install seaborn</code></li>
            <li>Conda: <code>conda install seaborn</code></li>
        </ul></li>
    <li><a href='https://plotly.com/python/'>Plotly</a> for visualization
        <ul>
            <li>Pip: <code>pip install plotly==4.14.3</code></li>
            <li>Conda: <code>conda install -c plotly plotly=4.14.3</code></li>
        </ul></li>
</ul>

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

<h2><a name='Data-Understanding'>2. Data Understanding</a></h2>

<h3><a name='Dataset-Import'>2.1. Dataset Import</a></h3>

This project uses the 1% sample of the visit dataset, downloaded from <a href='https://drive.google.com/drive/folders/1DxWkG29Kv7ozU2l5Xm7VDpLGhKdM9elV?usp=sharing'>Google Drive</a> and stored in the <code>./Resources</code> folder. The dataset consists of 7 CSV files, each containing the visit history for one day from 8 to 14 February 2021.

A <code>for</code> loop loads every dataset file from the <code>./Resources</code> folder. Before each file is imported, its column names are obtained by reading the first row with Python's <code>open()</code> function and splitting it on commas with <code>.split()</code>. The file is then imported with <code>pd.read_csv()</code>, passing <code>names</code> for the list of column names and <code>skiprows=1</code> to skip the header row. Finally, all the datasets are merged with <code>pd.concat()</code> and stored in the <code>all_data</code> dataframe.

In [2]:
# List the daily CSV files in the data directory (sorted by name, other entries skipped)
data_dir = './Resources'
files = sorted(file for file in os.listdir(data_dir) if file.endswith('.csv'))
frames = []

# Loop over the files and read each dataset
for file in files:
    path = os.path.join(data_dir, file)

    # Get the column names from the header row (strip the trailing newline)
    with open(path, 'r') as f:
        col_names = f.readline().strip().split(',')

    # Load the dataset, skipping the header row
    df = pd.read_csv(path, names=col_names, skiprows=1, low_memory=False)
    print(f'Shape of {file}', df.shape)
    frames.append(df)

# Merge all datasets into a single dataframe with a fresh index
all_data = pd.concat(frames, ignore_index=True)
print('Shape of all_data:', all_data.shape)

Shape of 10.csv (56250, 19)
Shape of 11.csv (56250, 19)
Shape of 12.csv (56250, 19)
Shape of 13.csv (56250, 19)
Shape of 14.csv (56250, 19)
Shape of 8.csv (56250, 19)
Shape of 9.csv (56250, 19)
Shape of all_data: (393750, 19)


<h3><a name='Dataset-Information'>2.2. Dataset Information</a></h3>

In [3]:
# Check the first five rows of all dataset
all_data.head()

Unnamed: 0,id,browser_id,os_id,domain_id,device_info_id,visit_id,visitor_id,user_id,device_id,login_status,user_agent,platform,referrer,time,event_time,connection,year,month,day
0,cc8d7394-8fa2-480d-bc5f-2362177e9061,pale moon;28.16.0,windows;7,bukalapak.com,;other,162eb0cb-92d4-4bfd-a49f-666e8c8761ff,8eca0aa0-f83d-4f37-92f4-fecd5169c4f0,,,False,Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68...,web-desktop,https://www.bukalapak.com/payment/transactions...,2021-02-10T09:12:45.000Z,2021-02-10T09:12:45.000Z,,2021,2,10
1,0beb15cb-ff11-4e05-9d12-9c06b4d7acd3,chrome mobile;80.0.3987,android;10,m.liputan6.com,generic;generic smartphone,a00d0354-d67a-4c0c-89fc-3f3b0faceb10,f81a4b87-af9b-4e97-9ebc-b3c1f81e6280,,,False,Mozilla/5.0 (Linux; Android 10; V2026) AppleWe...,web-mobile,https://m.liputan6.com/showbiz/read/4479253/al...,2021-02-10T23:23:55.000Z,2021-02-10T23:23:55.000Z,,2021,2,10
2,ccad1699-30c9-40b3-8a33-cda4df0613ec,pale moon;28.15.0,windows;10,bukalapak.com,;other,1065b1eb-8b54-4b91-9d32-ac9238809f32,59166c52-b4e8-40b2-8641-6a445394e257,,,False,Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:6...,web-desktop,https://www.bukalapak.com/payment/purchases/BL...,2021-02-10T10:13:53.000Z,2021-02-10T10:13:53.000Z,,2021,2,10
3,e53eac79-8bf5-4508-8925-d056b281060b,chrome mobile;87.0.4280,android;10,m.bukalapak.com,generic;generic smartphone,68dcbff1-123d-4e80-a887-13dc76e28077,1c74afb2-6893-4167-9d1e-c1f4ec90f0d6,,,False,Mozilla/5.0 (Linux; Android 10; RMX1821) Apple...,web-mobile,https://m.bukalapak.com/p-gs/rumah-tangga/furn...,2021-02-10T22:18:02.000Z,2021-02-10T22:18:02.000Z,,2021,2,10
4,7fac143e-9556-4d1b-bfa9-9ea5e31fab2f,chrome mobile;88.0.4324,android;10,m.merdeka.com,samsung;samsung sm-g965f,119694d8-86e7-457a-8083-8a4d4b3a424e,81613709-e89d-4796-8260-89170bd40fa0,,,False,Mozilla/5.0 (Linux; Android 10; SM-G965F) Appl...,web-mobile,https://m.merdeka.com/foto/artis/1272931/20210...,2021-02-10T14:14:59.000Z,2021-02-10T14:14:59.000Z,,2021,2,10


Glossary of the fields, in column order:
<ul>
    <li><code>id</code> : record ID</li>
    <li><code>browser_id</code> : browser type</li>
    <li><code>os_id</code> : operating system type</li>
    <li><code>domain_id</code> : domain or subdomain of the webpage</li>
    <li><code>device_info_id</code> : detailed device information</li>
    <li><code>visit_id</code> : user session ID</li>
    <li><code>visitor_id</code> : unique user ID</li>
    <li><code>user_id</code> : user login ID</li>
    <li><code>device_id</code> : device ID</li>
    <li><code>login_status</code> : boolean status of the user login</li>
    <li><code>user_agent</code> : browser user agent details</li>
    <li><code>platform</code> : device platform, i.e. desktop or mobile</li>
    <li><code>referrer</code> : attribution of the visit, i.e. the source the visitor came from</li>
    <li><code>time</code> : user visit time</li>
    <li><code>event_time</code> : logging time</li>
    <li><code>connection</code> : type of user connection</li>
    <li><code>year</code> : year number</li>
    <li><code>month</code> : month number</li>
    <li><code>day</code> : day number</li>
</ul>
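
In the rows printed by <code>all_data.head()</code>, the <code>browser_id</code>, <code>os_id</code> and <code>device_info_id</code> fields each pack two values separated by a semicolon. The sketch below shows one way such fields could be split into separate columns. It runs on a few values copied from the rows above. The helper <code>split_pair</code> and the new column names are hypothetical, and it assumes every value contains a semicolon.

```python
import pandas as pd

# Toy rows copied from the all_data.head() output; not the full dataset
sample = pd.DataFrame({
    "browser_id": ["pale moon;28.16.0", "chrome mobile;80.0.3987"],
    "os_id": ["windows;7", "android;10"],
    "device_info_id": [";other", "samsung;samsung sm-g965f"],
})

def split_pair(series, left, right):
    """Split 'a;b' strings into two columns named `left` and `right` (hypothetical helper)."""
    parts = series.str.split(";", n=1, expand=True)
    parts.columns = [left, right]
    return parts

# Column names below are illustrative assumptions, not part of the dataset
browser = split_pair(sample["browser_id"], "browser_name", "browser_version")
os_info = split_pair(sample["os_id"], "os_name", "os_version")
device = split_pair(sample["device_info_id"], "device_brand", "device_model")

split_sample = pd.concat([sample, browser, os_info, device], axis=1)
print(split_sample[["browser_name", "browser_version", "os_name", "os_version"]])
```

An empty left-hand part, as in <code>;other</code>, comes out as an empty string rather than a missing value, so it may need separate handling.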

The <code>.info()</code> method prints information about the <code>all_data</code> dataframe, including the index dtype, the columns, the non-null counts, and the memory usage.

In [4]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 393750 entries, 0 to 393749
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   id              393750 non-null  object 
 1   browser_id      393750 non-null  object 
 2   os_id           393750 non-null  object 
 3   domain_id       393735 non-null  object 
 4   device_info_id  393750 non-null  object 
 5   visit_id        393750 non-null  object 
 6   visitor_id      393750 non-null  object 
 7   user_id         16898 non-null   float64
 8   device_id       0 non-null       float64
 9   login_status    393750 non-null  bool   
 10  user_agent      393750 non-null  object 
 11  platform        393750 non-null  object 
 12  referrer        393736 non-null  object 
 13  time            393750 non-null  object 
 14  event_time      393750 non-null  object 
 15  connection      14 non-null      object 
 16  year            393750 non-null  object 
 17  month     

As the output above shows, the dataset has missing values in the <code>domain_id</code>, <code>user_id</code>, <code>device_id</code>, <code>referrer</code>, and <code>connection</code> columns. These are handled, along with other processing, in the next step: data preprocessing.
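
The missing values per column can also be tabulated directly instead of being read off the <code>.info()</code> output. The sketch below does this on a small stand-in dataframe called <code>demo</code>, whose values are invented for illustration. The same expressions could be applied to <code>all_data</code>.

```python
import numpy as np
import pandas as pd

# Small stand-in for all_data; values are illustrative only
demo = pd.DataFrame({
    "domain_id": ["m.liputan6.com", None, "m.merdeka.com", "bukalapak.com"],
    "user_id": [np.nan, np.nan, 12.0, np.nan],
    "device_id": [np.nan, np.nan, np.nan, np.nan],
    "platform": ["web-mobile", "web-mobile", "web-mobile", "web-desktop"],
})

# Count missing values per column and express them as a share of all rows
missing = demo.isna().sum()
summary = pd.DataFrame({
    "missing": missing,
    "missing_pct": (missing / len(demo) * 100).round(2),
})

# Keep only the columns that actually have gaps, largest first
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

A column that is entirely empty, like <code>device_id</code> in the <code>.info()</code> output, shows up with a missing share of 100%.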

<h2><a name='Data-Preprocessing'>3. Data Preprocessing</a></h2>

<h3><a name='Handling-Missing-Values'>3.1. Handling Missing Values</a></h3>

<h3><a name='Feature-Engineering'>3.2. Feature Engineering</a></h3>

<h3><a name='Unnecessary-Features'>3.3. Unnecessary Features</a></h3>

<h2><a name='Exploratory-Data-Analysis'>4. Exploratory Data Analysis</a></h2>

<h2><a name='Questions'>5. Questions</a></h2>

<h3><a name='Domain-Retention'>5.1. How could we know the retention % of the users on each domain?</a></h3>
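
The dataset does not fix a definition of retention, so the sketch below assumes one possible definition: a visitor counts as retained on a domain if the same <code>visitor_id</code> appears on that domain on at least two different days. The domain names and visitor IDs are made-up toy values, not results from the dataset.

```python
import pandas as pd

# Toy visits; domains and visitor IDs are hypothetical
visits = pd.DataFrame({
    "domain_id": ["a.com", "a.com", "a.com", "a.com", "b.com", "b.com"],
    "visitor_id": ["v1", "v1", "v2", "v3", "v1", "v2"],
    "day": [8, 9, 8, 10, 8, 8],
})

# Number of distinct days each visitor was seen on each domain
days_seen = visits.groupby(["domain_id", "visitor_id"])["day"].nunique()

# Assumed definition: retained = seen on at least two different days
retained_share = days_seen.ge(2).groupby(level="domain_id").mean() * 100
retention_pct = retained_share.round(2).rename("retention_pct")
print(retention_pct)
```

Other definitions, such as returning on the day after the first visit, would need a different grouping and could give quite different percentages.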

<h3><a name='Favorable-Used'>5.2. Can you give insight on favorable used browsers, devices, OS and platform?</a></h3>
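
One simple way to approach this question is to rank the values of each column by their share of visits. The sketch below is a minimal illustration on toy rows. The helper <code>top_share</code> is hypothetical, and the numbers it prints are not results from the dataset.

```python
import pandas as pd

# Toy visits; values are illustrative only
visits = pd.DataFrame({
    "platform": ["web-mobile", "web-mobile", "web-desktop", "web-mobile"],
    "os_id": ["android;10", "android;10", "windows;7", "android;9"],
})

def top_share(frame, column, n=3):
    """Return the n most frequent values of a column with their share in % (hypothetical helper)."""
    share = frame[column].value_counts(normalize=True).mul(100).round(2)
    return share.head(n)

platform_share = top_share(visits, "platform")
os_share = top_share(visits, "os_id")
print(platform_share)
print(os_share)
```

The same helper could be called on <code>browser_id</code> and <code>device_info_id</code>. Counting rows measures visits, so counting unique <code>visitor_id</code> values instead would measure users.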

<h3><a name='Overlapping-Users'>5.3. Could you show the method on how to get the numbers & describe the overlapping users between each domain?</a></h3>
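
A possible method for counting overlapping users is to build a visitor-by-domain indicator matrix and multiply it by its own transpose, which gives the number of shared visitors for every pair of domains. The sketch below assumes that <code>visitor_id</code> identifies a user across domains. The domains and IDs are toy values.

```python
import pandas as pd

# Toy visits; domains and visitor IDs are hypothetical
visits = pd.DataFrame({
    "domain_id": ["a.com", "a.com", "b.com", "b.com", "c.com"],
    "visitor_id": ["v1", "v2", "v1", "v3", "v1"],
})

# Visitor-by-domain indicator matrix: 1 if the visitor was seen on the domain
seen = pd.crosstab(visits["visitor_id"], visits["domain_id"]).clip(upper=1)

# seen.T @ seen counts shared visitors for every pair of domains;
# the diagonal holds each domain's own unique-visitor count
overlap = seen.T.dot(seen)
print(overlap)
```

On the full dataset the indicator matrix has one row per visitor, so restricting it to the domains of interest first may be needed to keep memory use manageable.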

<h2><a name='Reference'>6. Reference</a></h2>

<a name='1' href='https://en.wikipedia.org/wiki/Web_analytics'>[1] Web Analytics</a><br>
<a name='2' href='https://www.kly.id/product'>[2] Kapanlagi Youniverse</a><br>
<a name='3' href='https://www.bola.com/asian-para-games/read/3658308/kapanlagi-youniverse-official-online-media-asian-para-games-2018'>[3] KLY Media Online</a><br>