# Introduction

This notebook works with funnel data from a freelancer marketplace platform. The service uses machine learning algorithms to 1) recommend new products to users and 2) detect personal transactions made outside the platform. Before the data can be used for machine learning, it needs to be cleaned. Here we clean the data so that it can be used for training.

In [1]:
import pandas as pd # most commonly used python package to manipulate data
pd.options.display.max_columns = 100 # the data contains lots of columns. It's easier to explore data this way.

## Data Loading
### conversion.csv 


conversion.csv contains the activity log of the website and mobile service. We can analyze it to find user usage patterns.


- eventcategory: event categories. Values are
    - install
    - launch
    - deeplinkLaunch
    - goal
    - exit
    - foreground, background
    - launchlnSession
- isfirstactivity: whether this is the user's first time performing the activity. boolean
- apppackagename: unique application package name. On Android it is the applicationId; on iOS it is the Bundle ID
- appversion: the application version
- devicetype: user's device name
- devicemanufacturer: device manufacturer name
- osversion: the device OS version
- canonicaldeviceuuid: unique device ID. it can be used for user identification
- sourcetype: How the user joined the service
- channel: detailed version of sourcetype
    - unattributed
    - WEB
    - google-play, m_naver, google, (not set), google.adwords...
- params_campaign: campaign name parameter entered by the marketer
- params_medium: campaign media parameter entered by the marketer
- params_term: campaign term parameter entered by the marketer
- inappeventcategory: in-app-event value. hierarchical (category > action > label)
    - can only access when eventcategory equals goal
    - foreign key for funnel dataset
    - ex) seller_selling_history.view, gig_detail.view
- inappeventlabel: 
    - foreign key for category data set. 
- eventdatetime: event occurrence time
- isfirstgoalactivity: for goal events, indicates whether the goal event happened more than once. Events are considered the same only if Goal Label, Description, Key, and Category all match. boolean
- event_rank: used for sorting the data.

In [2]:
raw_log = pd.read_csv("./data/freelance/conversion.csv")
print(raw_log.shape)
raw_log.head()

(434244, 19)


Unnamed: 0,eventcategory,isfirstactivity,apppackagename,appversion,devicetype,devicemanufacturer,osversion,canonicaldeviceuuid,sourcetype,channel,params_campaign,params_medium,params_term,inappeventcategory,inappeventlabel,eventdatetime,rowuuid,isfirstgoalactivity,event_rank
0,goal,False,com.kmong.iOS,4.0.4,iPhone,Apple,iOS11.4.1,F36FAA62-ADAC-4AA5-9B00-1FD6CB7EE957,unattributed,unattributed,,,,home.view,,2018-09-28T00:00:00+09:00,fd2a188c-bc9b-4702-9c47-b546b2614817,False,True
1,goal,False,com.kmong.kmong,3.3.5,SM-N935S,samsung,Android7.0,8a871e50-0717-4aed-9bad-04ac3c3793be,unattributed,unattributed,,,,gig_detail.view,41201.0,2018-09-28T00:00:00+09:00,e62dccef-dd70-4415-8a33-c8324ddaed38,False,True
2,goal,False,com.kmong.iOS,4.0.4,iPhone,Apple,iOS12.0,A9E5778A-8F3D-4597-9718-74BF953A9F64,unattributed,unattributed,,,,inbox_detail.view,,2018-09-28T00:00:00+09:00,14eb3197-db83-493a-b7be-83582960c40b,False,True
3,foreground,,com.kmong.iOS,4.0.4,iPhone,Apple,iOS11.4.1,168761CB-CB67-4592-867D-52780D651297,,,,,,,,2018-09-28T00:00:01+09:00,f9bb91af-248b-44dc-9f5c-1c00b37ea97b,,True
4,goal,False,com.kmong.iOS,4.0.4,iPhone,Apple,iOS11.4.1,ACABB7C0-4C76-413A-B314-E5D6DA0D0E5D,viral,WEB,,,,buyer_order_track.view,,2018-09-28T00:00:02+09:00,236e9946-7801-4898-b609-06c8ab1139dc,False,True


### funnel.csv

A funnel is the path a user takes up to the point of purchasing a product.

<img src="https://cdn-images-1.medium.com/max/1600/0*voRGTKciwKuIb2HS.png" width=480 />
<br />
<center><small>Acquisition to revenue.
<br />    
(Reference: <a href="https://medium.com/the-school-of-mobile/app-marketing-metrics-for-pirates-growth-hacking-the-purchase-funnel-b4f1219c5945">App Marketing Metrics for Pirates: Growth Hacking the Purchase Funnel</a>)</small></center>

With the funnel data we can measure conversion and churn rates.
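
As a rough illustration of what "conversion" means here, the sketch below counts how many distinct users reached each funnel step and divides each count by the count of the previous step. The toy events table, the step order, and all column names are assumptions made for this example; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy events: one row per (user, funnel step reached)
events = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b", "c", "d"],
    "funnel_name": ["home", "gig", "order", "home", "gig", "home", "home"],
})

# Assumed funnel order, from entry to purchase
steps = ["home", "gig", "order"]

# Distinct users that reached each step, in funnel order
users_per_step = (
    events.groupby("funnel_name")["user_id"].nunique().reindex(steps, fill_value=0)
)

# Step-to-step conversion: users at this step / users at the previous step
conversion = users_per_step / users_per_step.shift(1)
churn = 1 - conversion

print(pd.DataFrame({"users": users_per_step, "conversion": conversion, "churn": churn}))
```

The first step has no previous step, so its conversion and churn are left as missing values.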

In [3]:
raw_funnel = pd.read_csv("./data/freelance/funnel.csv")
print(raw_funnel.shape)
raw_funnel

(53, 6)


Unnamed: 0,Lv2,viewid,viewid desc,Lv1,funnel name,funnel desc
0,1100,home,홈 (탭),11,home,홈
1,1210,category_list,카테고리 목록 (탭),12,category,카테고리
2,1200,category_gig,카테고리-상품목록,12,category,카테고리
3,1300,search,검색,13,search,검색
4,1301,search_gig,검색-상품목록,13,search,검색
5,1302,search_seller,검색-전문가,13,search,검색
6,1400,gig_detail,상품상세,14,gig,상품
7,1401,gig_detail_option,상품상세-상품선택,14,gig,상품
8,1420,profile,전문가프로필,14,gig,상품
9,1520,login_sns,간편로그인,15,login,로그인


### category.csv

Product category data set.

In [4]:
raw_category = pd.read_csv("./data/freelance/category.csv")
print(raw_category.shape)
raw_category

(245, 9)


Unnamed: 0,depth,categoryid,categoryname,cat1_id,cat2_id,cat3_id,cat1,cat2,cat3
0,1,1,디자인,1,,,디자인,,
1,1,2,마케팅,2,,,마케팅,,
2,1,3,번역·통역,3,,,번역·통역,,
3,1,4,문서작성,4,,,문서작성,,
4,1,6,IT·프로그래밍,6,,,IT·프로그래밍,,
5,1,7,콘텐츠 제작,7,,,콘텐츠 제작,,
6,1,9,상담·컨설팅,9,,,상담·컨설팅,,
7,1,10,레슨,10,,,레슨,,
8,1,11,핸드메이드,11,,,핸드메이드,,
9,1,9901,크몽 인쇄소,99,,,특별 카테고리,,


## Data Cleansing

Essentially, this is log data, so it is difficult to analyze as is. We will clean the data and save it.

### 1. Change canonicaldeviceuuid to userid

canonicaldeviceuuid is important because it identifies users. However, the name is not intuitive, so we will copy it to a new column named userid.

In [5]:
raw_log["userid"] = raw_log["canonicaldeviceuuid"]
raw_log["userid"].head()

0    F36FAA62-ADAC-4AA5-9B00-1FD6CB7EE957
1    8a871e50-0717-4aed-9bad-04ac3c3793be
2    A9E5778A-8F3D-4597-9718-74BF953A9F64
3    168761CB-CB67-4592-867D-52780D651297
4    ACABB7C0-4C76-413A-B314-E5D6DA0D0E5D
Name: userid, dtype: object

### 2. Convert eventdatetime to datetime type and extract the date and time



In [6]:
print(raw_log["eventdatetime"].dtype)
raw_log["eventdatetime"] = pd.to_datetime(raw_log["eventdatetime"]) # convert the object type to datetime type
print(raw_log["eventdatetime"].dtype)

object
datetime64[ns]


In [7]:
raw_log["eventdatetime_year"] = raw_log["eventdatetime"].dt.year
raw_log["eventdatetime_month"] = raw_log["eventdatetime"].dt.month
raw_log["eventdatetime_day"] = raw_log["eventdatetime"].dt.day
raw_log["eventdatetime_hour"] = raw_log["eventdatetime"].dt.hour
raw_log["eventdatetime_minute"] = raw_log["eventdatetime"].dt.minute
raw_log["eventdatetime_second"] = raw_log["eventdatetime"].dt.second

columns = ['eventdatetime', 'eventdatetime_year','eventdatetime_month', 'eventdatetime_day'
          ,'eventdatetime_hour','eventdatetime_minute','eventdatetime_second']

raw_log[columns].head()

Unnamed: 0,eventdatetime,eventdatetime_year,eventdatetime_month,eventdatetime_day,eventdatetime_hour,eventdatetime_minute,eventdatetime_second
0,2018-09-27 15:00:00,2018,9,27,15,0,0
1,2018-09-27 15:00:00,2018,9,27,15,0,0
2,2018-09-27 15:00:00,2018,9,27,15,0,0
3,2018-09-27 15:00:01,2018,9,27,15,0,1
4,2018-09-27 15:00:02,2018,9,27,15,0,2
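
Note that the raw value `2018-09-28T00:00:00+09:00` appears above as `2018-09-27 15:00:00`, which suggests the timestamps were normalized to UTC during conversion. If the analysis should use local time, one option is to parse explicitly as UTC and then convert. The sketch below uses a two-row sample and assumes `Asia/Seoul` is the intended time zone (inferred from the `+09:00` offset, not stated in the data).

```python
import pandas as pd

sample = pd.Series(["2018-09-28T00:00:00+09:00", "2018-09-28T00:00:01+09:00"])

# Parse as timezone-aware UTC timestamps
utc_times = pd.to_datetime(sample, utc=True)

# Convert to the assumed local time zone before extracting date parts
local_times = utc_times.dt.tz_convert("Asia/Seoul")

print(utc_times.dt.hour.tolist())
print(local_times.dt.day.tolist(), local_times.dt.hour.tolist())
```

Extracting the day and hour from `local_times` rather than from the UTC values keeps events on the calendar day the user actually experienced.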


### 3. osversion

osversion really contains two different pieces of information: the OS type (Android or iOS) and the OS version.

In [8]:
raw_log["osversion"].head()

0     iOS11.4.1
1    Android7.0
2       iOS12.0
3     iOS11.4.1
4     iOS11.4.1
Name: osversion, dtype: object

In [9]:
iOS = raw_log["osversion"].str.startswith('iOS')
android = raw_log["osversion"].str.startswith('Android')
raw_log.loc[iOS, "ostype(clean)"] = "iOS"
raw_log.loc[android, "ostype(clean)"] = "Android"

In [10]:
# function to extract OS version
def extract_osversion(osversion):
    return osversion.replace("iOS","").replace("Android","")

In [11]:
# apply extract_osversion function to every value in the column
raw_log["osversion(clean)"] = raw_log["osversion"].apply(extract_osversion)
columns = ["osversion", "ostype(clean)", "osversion(clean)"]
raw_log[columns].head()

Unnamed: 0,osversion,ostype(clean),osversion(clean)
0,iOS11.4.1,iOS,11.4.1
1,Android7.0,Android,7.0
2,iOS12.0,iOS,12.0
3,iOS11.4.1,iOS,11.4.1
4,iOS11.4.1,iOS,11.4.1
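
The same split can be sketched without a helper function by using a regular expression with `str.extract`. The pattern below assumes every value is a run of letters followed by a dotted version number, which holds for the five rows shown but has not been checked against the full column.

```python
import pandas as pd

osversion = pd.Series(["iOS11.4.1", "Android7.0", "iOS12.0"])

# Named groups become the column names of the resulting DataFrame
parts = osversion.str.extract(r"^(?P<os_type>[A-Za-z]+)(?P<os_version>[\d.]+)$")
print(parts)
```

Values that do not match the pattern come back as missing in both columns, which makes unexpected formats easy to spot.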


### 4. devicemanufacturer

There are three major smartphone manufacturers: Samsung, Apple, and LG. However, there are others too, such as Xiaomi, Foxconn, Pantech, Huawei, and so on.

In [12]:
raw_log["devicemanufacturer"].value_counts()

samsung           230313
Apple             163566
LGE                32565
Xiaomi              3688
Foxconn             1248
PANTECH              706
HUAWEI               615
Google               417
Sony                 378
BlackBerry           185
TCL                  138
vivo                 122
LG Electronics        84
HOMTOM                60
OPPO                  34
asus                  28
CUBOT                 17
Huawei                17
ZUK                   15
HTC                   15
blackberry            13
SHARP                 12
UMIDIGI                8
Name: devicemanufacturer, dtype: int64

Looking at the data, some manufacturers appear under more than one name. For example, LGE and LG Electronics are the same company.

In [13]:
mfg = ['samsung', 'LGE', 'LG Electronics', 'Apple']
sum(raw_log["devicemanufacturer"].value_counts()[mfg])/raw_log["devicemanufacturer"].count()

0.9822311879956891

Only about 2% of the data comes from manufacturers other than Samsung, Apple, and LG. It makes sense to group them as Others for efficient analysis.

In [14]:
def device_clean(devicemfg):
    if "Samsung".upper() in devicemfg.upper():
        return "Samsung"
    elif "Apple".upper() in devicemfg.upper():
        return "Apple"
    elif "LG".upper() in devicemfg.upper():
        return "LG"
    else:
        return "Others"
    

In [15]:
raw_log["devicemanufacturer(clean)"] = raw_log["devicemanufacturer"].apply(device_clean)
raw_log["devicemanufacturer(clean)"].value_counts()

Samsung    230313
Apple      163566
LG          32649
Others       7716
Name: devicemanufacturer(clean), dtype: int64
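
For larger logs, the row-by-row `apply` could be replaced with vectorized string matching. The following is a minimal sketch on a small sample, using `numpy.select` with case-insensitive `str.contains`; the sample values are copied from the counts above.

```python
import numpy as np
import pandas as pd

mfg = pd.Series(["samsung", "Apple", "LGE", "LG Electronics", "Xiaomi", "HUAWEI"])

# One boolean mask per manufacturer group; na=False treats missing values as no match
conditions = [
    mfg.str.contains("samsung", case=False, na=False),
    mfg.str.contains("apple", case=False, na=False),
    mfg.str.contains("lg", case=False, na=False),
]
labels = ["Samsung", "Apple", "LG"]

# The first matching condition wins; everything else becomes "Others"
cleaned = pd.Series(np.select(conditions, labels, default="Others"), index=mfg.index)
print(cleaned.value_counts())
```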

### 5. channel

The channel column shows which channel brought the user in. However, the values are not uniform, so the column needs a little cleaning to give a clearer picture.

In [16]:
raw_log["channel"].unique()

array(['unattributed', nan, 'WEB', 'google-play', 'm_naver', 'google',
       '(not set)', 'google.adwords', 'm_naverpowercontents', 'pc_naver',
       'apple.searchads', 'facebook', 'm_daum'], dtype=object)

The result shows several services that belong to the same company. For example, google-play, google, and google.adwords could be merged into one. There is also a **NaN** value, which we will handle as well.

In [17]:
def clean_channel(channel):
    # Missing channels are labelled "NA" so they remain visible in value_counts()
    if pd.isnull(channel):
        return "NA"
    channel = channel.lower()
    # Group channels by keyword; the first matching keyword wins
    keywords = ['unattributed', 'web', 'google', '(not set)',
                'naver', 'daum', 'apple', 'facebook']
    for keyword in keywords:
        if keyword in channel:
            return keyword
    # No keyword matched: keep the lower-cased value instead of returning None
    return channel

In [18]:
raw_log["channel(clean)"] = raw_log["channel"].apply(clean_channel)
raw_log["channel(clean)"].value_counts()

unattributed    270834
NA              103079
web              27639
google           27235
(not set)         2444
naver             1224
daum               966
apple              697
facebook           126
Name: channel(clean), dtype: int64

The set of values is clearer now. However, a sizable share of the rows (23.7%) has no channel information.

In [19]:
raw_log["channel(clean)"].value_counts()["NA"]/raw_log["channel(clean)"].count()

0.2373757610928418
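
An equivalent check can be made on the raw column, before missing values are relabelled as "NA": the mean of `isna()` gives the share of missing rows directly. A minimal sketch on hypothetical sample values:

```python
import pandas as pd

channel = pd.Series(
    ["unattributed", None, "WEB", "google", None, "facebook", "unattributed", "WEB"]
)

# isna() yields booleans; their mean is the fraction of missing values
missing_share = channel.isna().mean()
print(missing_share)

# normalize=True returns shares instead of counts; dropna=False keeps the missing group
shares = channel.value_counts(normalize=True, dropna=False)
print(shares)
```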

### 6. inappeventcategory

This column shows what the user was doing in the application, for example whether the user was looking at a product page or processing an order. It is also the foreign key used to join with the funnel data. However, the values are actually a combination of several different pieces of information.

In [20]:
raw_log["inappeventcategory"].unique()[:5]

array(['home.view', 'gig_detail.view', 'inbox_detail.view', nan,
       'buyer_order_track.view'], dtype=object)

We are going to separate the values into smaller pieces.

In [21]:
def appViewcategoryClean(inappeventcategory):
    if pd.isnull(inappeventcategory):
        return inappeventcategory
    else:
        if '_' in inappeventcategory:
            return inappeventcategory.split('_')[0]
        else:
            return inappeventcategory.split('.')[0]
        
def appViewidClean(inappeventcategory):
    if pd.isnull(inappeventcategory):
        return inappeventcategory
    else:
        return inappeventcategory.split('.')[0]
    
def appViewActionClean(inappeventcategory):
    if pd.isnull(inappeventcategory):
        return inappeventcategory
    else:
        return inappeventcategory.split('.')[1]

In [22]:
raw_log["viewcategory"] = raw_log["inappeventcategory"].apply(appViewcategoryClean)
raw_log["viewid"] = raw_log["inappeventcategory"].apply(appViewidClean)
raw_log["viewaction"] = raw_log["inappeventcategory"].apply(appViewActionClean)

columns = ["inappeventcategory","viewcategory","viewid","viewaction"]
raw_log[columns].head()

Unnamed: 0,inappeventcategory,viewcategory,viewid,viewaction
0,home.view,home,home,view
1,gig_detail.view,gig,gig_detail,view
2,inbox_detail.view,inbox,inbox_detail,view
3,,,,
4,buyer_order_track.view,buyer,buyer_order_track,view
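
The three helper functions could also be expressed with pandas string methods, which propagate missing values automatically. A sketch on the sample values above, assuming each non-missing value contains exactly one `.`:

```python
import pandas as pd

events = pd.Series(["home.view", "gig_detail.view", None, "buyer_order_track.view"])

# Split once on "." into view id and action; missing values stay missing
parts = events.str.split(".", n=1, expand=True)
view_id = parts[0]
view_action = parts[1]

# The category is the part of the view id before the first "_"
view_category = view_id.str.split("_").str[0]

print(pd.DataFrame({
    "viewcategory": view_category,
    "viewid": view_id,
    "viewaction": view_action,
}))
```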


### Dropping unnecessary columns

After the cleansing process, we drop the columns that have already been processed: osversion, devicemanufacturer, canonicaldeviceuuid, channel, and event_rank.


In [23]:
columns = ['eventcategory', 'isfirstactivity', 'apppackagename', 'appversion',
       'devicetype', 'sourcetype', 'params_campaign', 'params_medium',
       'params_term', 'inappeventcategory', 'inappeventlabel', 'eventdatetime',
       'rowuuid', 'isfirstgoalactivity', 'userid', 'eventdatetime_year', 'eventdatetime_month',
       'eventdatetime_day', 'eventdatetime_hour', 'eventdatetime_minute',
       'eventdatetime_second', 'ostype(clean)', 'osversion(clean)',
       'viewcategory', 'viewid', 'viewaction', 'devicemanufacturer(clean)',
       'channel(clean)']
raw_log = raw_log[columns]
raw_log.head()

Unnamed: 0,eventcategory,isfirstactivity,apppackagename,appversion,devicetype,sourcetype,params_campaign,params_medium,params_term,inappeventcategory,inappeventlabel,eventdatetime,rowuuid,isfirstgoalactivity,userid,eventdatetime_year,eventdatetime_month,eventdatetime_day,eventdatetime_hour,eventdatetime_minute,eventdatetime_second,ostype(clean),osversion(clean),viewcategory,viewid,viewaction,devicemanufacturer(clean),channel(clean)
0,goal,False,com.kmong.iOS,4.0.4,iPhone,unattributed,,,,home.view,,2018-09-27 15:00:00,fd2a188c-bc9b-4702-9c47-b546b2614817,False,F36FAA62-ADAC-4AA5-9B00-1FD6CB7EE957,2018,9,27,15,0,0,iOS,11.4.1,home,home,view,Apple,unattributed
1,goal,False,com.kmong.kmong,3.3.5,SM-N935S,unattributed,,,,gig_detail.view,41201.0,2018-09-27 15:00:00,e62dccef-dd70-4415-8a33-c8324ddaed38,False,8a871e50-0717-4aed-9bad-04ac3c3793be,2018,9,27,15,0,0,Android,7.0,gig,gig_detail,view,Samsung,unattributed
2,goal,False,com.kmong.iOS,4.0.4,iPhone,unattributed,,,,inbox_detail.view,,2018-09-27 15:00:00,14eb3197-db83-493a-b7be-83582960c40b,False,A9E5778A-8F3D-4597-9718-74BF953A9F64,2018,9,27,15,0,0,iOS,12.0,inbox,inbox_detail,view,Apple,unattributed
3,foreground,,com.kmong.iOS,4.0.4,iPhone,,,,,,,2018-09-27 15:00:01,f9bb91af-248b-44dc-9f5c-1c00b37ea97b,,168761CB-CB67-4592-867D-52780D651297,2018,9,27,15,0,1,iOS,11.4.1,,,,Apple,
4,goal,False,com.kmong.iOS,4.0.4,iPhone,viral,,,,buyer_order_track.view,,2018-09-27 15:00:02,236e9946-7801-4898-b609-06c8ab1139dc,False,ACABB7C0-4C76-413A-B314-E5D6DA0D0E5D,2018,9,27,15,0,2,iOS,11.4.1,buyer,buyer_order_track,view,Apple,web


In [24]:
# Rename the columns to snake_case
print(raw_log.columns)
columns = ['event_category', 'is_first_activity', 'app_package_name', 'app_version',
       'device_type', 'source_type', 'params_campaign', 'params_medium',
       'params_term', 'in_app_event_category', 'in_app_event_label', 'event_date_time',
       'row_uuid', 'is_first_goal_activity', 'user_id',
       'event_datetime_year', 'event_datetime_month', 'event_datetime_day',
       'event_datetime_hour', 'event_datetime_minute', 'event_datetime_second',
       'os_type', 'os_version', 'view_category', 'view_id',
       'view_action', 'device_manufacturer', 'channel']
raw_log.columns = columns
raw_log.head()

Index(['eventcategory', 'isfirstactivity', 'apppackagename', 'appversion',
       'devicetype', 'sourcetype', 'params_campaign', 'params_medium',
       'params_term', 'inappeventcategory', 'inappeventlabel', 'eventdatetime',
       'rowuuid', 'isfirstgoalactivity', 'userid', 'eventdatetime_year',
       'eventdatetime_month', 'eventdatetime_day', 'eventdatetime_hour',
       'eventdatetime_minute', 'eventdatetime_second', 'ostype(clean)',
       'osversion(clean)', 'viewcategory', 'viewid', 'viewaction',
       'devicemanufacturer(clean)', 'channel(clean)'],
      dtype='object')


Unnamed: 0,event_category,is_first_activity,app_package_name,app_version,device_type,source_type,params_campaign,params_medium,params_term,in_app_event_category,in_app_event_label,event_date_time,row_uuid,is_first_goal_activity,user_id,event_datetime_year,event_datetime_month,event_datetime_day,event_datetime_hour,event_datetime_minute,event_datetime_second,os_type,os_version,view_category,view_id,view_action,device_manufacturer,channel
0,goal,False,com.kmong.iOS,4.0.4,iPhone,unattributed,,,,home.view,,2018-09-27 15:00:00,fd2a188c-bc9b-4702-9c47-b546b2614817,False,F36FAA62-ADAC-4AA5-9B00-1FD6CB7EE957,2018,9,27,15,0,0,iOS,11.4.1,home,home,view,Apple,unattributed
1,goal,False,com.kmong.kmong,3.3.5,SM-N935S,unattributed,,,,gig_detail.view,41201.0,2018-09-27 15:00:00,e62dccef-dd70-4415-8a33-c8324ddaed38,False,8a871e50-0717-4aed-9bad-04ac3c3793be,2018,9,27,15,0,0,Android,7.0,gig,gig_detail,view,Samsung,unattributed
2,goal,False,com.kmong.iOS,4.0.4,iPhone,unattributed,,,,inbox_detail.view,,2018-09-27 15:00:00,14eb3197-db83-493a-b7be-83582960c40b,False,A9E5778A-8F3D-4597-9718-74BF953A9F64,2018,9,27,15,0,0,iOS,12.0,inbox,inbox_detail,view,Apple,unattributed
3,foreground,,com.kmong.iOS,4.0.4,iPhone,,,,,,,2018-09-27 15:00:01,f9bb91af-248b-44dc-9f5c-1c00b37ea97b,,168761CB-CB67-4592-867D-52780D651297,2018,9,27,15,0,1,iOS,11.4.1,,,,Apple,
4,goal,False,com.kmong.iOS,4.0.4,iPhone,viral,,,,buyer_order_track.view,,2018-09-27 15:00:02,236e9946-7801-4898-b609-06c8ab1139dc,False,ACABB7C0-4C76-413A-B314-E5D6DA0D0E5D,2018,9,27,15,0,2,iOS,11.4.1,buyer,buyer_order_track,view,Apple,web
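
Assigning a list to `raw_log.columns` relies on the list being in exactly the same order as the existing columns. A more defensive alternative, sketched below on a three-column stand-in frame, is `DataFrame.rename` with an explicit old-to-new mapping, so that each new name is tied to the column it replaces.

```python
import pandas as pd

# Stand-in frame with three of the column names used in this notebook
df = pd.DataFrame({"eventcategory": ["goal"], "ostype(clean)": ["iOS"], "userid": ["u1"]})

# Explicit mapping: the order of the columns no longer matters
new_names = {
    "eventcategory": "event_category",
    "ostype(clean)": "os_type",
    "userid": "user_id",
}
df = df.rename(columns=new_names)
print(df.columns.tolist())
```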


### Rearrange the order of the columns

Reorder the columns so that 1) more important columns come before less important ones and 2) similar columns sit next to each other.

In [25]:
# Select the columns in the new order
raw_log = raw_log[["row_uuid","app_package_name","user_id","event_datetime_year","event_datetime_month","event_datetime_day","event_datetime_hour","event_datetime_minute","event_datetime_second","device_manufacturer","device_type","os_type","os_version","app_version","event_category","view_category","view_id","view_action","in_app_event_category","in_app_event_label","source_type","channel","params_campaign","params_medium","params_term","is_first_activity","is_first_goal_activity"]]
#raw_log.columns

In [26]:
# Reorder the funnel columns and rename them to snake_case
raw_funnel = raw_funnel[['Lv1','Lv2', 'viewid', 'viewid desc',  'funnel name', 'funnel desc']]
raw_funnel.columns = ['lv1', 'lv2', 'view_id', 'view_desc', 'funnel_name', 'funnel_desc']
raw_funnel.columns

Index(['lv1', 'lv2', 'view_id', 'view_desc', 'funnel_name', 'funnel_desc'], dtype='object')

In [27]:
# Rename the category columns to snake_case
raw_category.columns = ['depth', 'category_id', 'category_name', 'category1_id', 'category2_id', 'category3_id',
       'category1', 'category2', 'category3']


## Merge datasets into one

To make the data easier to analyze or to use for training a machine learning algorithm, we are going to merge the data sets together. First we merge the log and funnel data using merge. The key column is view_id.

In [28]:
print(raw_log.shape)
print(raw_funnel.shape)
raw_log.head()

(434244, 27)
(53, 6)


Unnamed: 0,row_uuid,app_package_name,user_id,event_datetime_year,event_datetime_month,event_datetime_day,event_datetime_hour,event_datetime_minute,event_datetime_second,device_manufacturer,device_type,os_type,os_version,app_version,event_category,view_category,view_id,view_action,in_app_event_category,in_app_event_label,source_type,channel,params_campaign,params_medium,params_term,is_first_activity,is_first_goal_activity
0,fd2a188c-bc9b-4702-9c47-b546b2614817,com.kmong.iOS,F36FAA62-ADAC-4AA5-9B00-1FD6CB7EE957,2018,9,27,15,0,0,Apple,iPhone,iOS,11.4.1,4.0.4,goal,home,home,view,home.view,,unattributed,unattributed,,,,False,False
1,e62dccef-dd70-4415-8a33-c8324ddaed38,com.kmong.kmong,8a871e50-0717-4aed-9bad-04ac3c3793be,2018,9,27,15,0,0,Samsung,SM-N935S,Android,7.0,3.3.5,goal,gig,gig_detail,view,gig_detail.view,41201.0,unattributed,unattributed,,,,False,False
2,14eb3197-db83-493a-b7be-83582960c40b,com.kmong.iOS,A9E5778A-8F3D-4597-9718-74BF953A9F64,2018,9,27,15,0,0,Apple,iPhone,iOS,12.0,4.0.4,goal,inbox,inbox_detail,view,inbox_detail.view,,unattributed,unattributed,,,,False,False
3,f9bb91af-248b-44dc-9f5c-1c00b37ea97b,com.kmong.iOS,168761CB-CB67-4592-867D-52780D651297,2018,9,27,15,0,1,Apple,iPhone,iOS,11.4.1,4.0.4,foreground,,,,,,,,,,,,
4,236e9946-7801-4898-b609-06c8ab1139dc,com.kmong.iOS,ACABB7C0-4C76-413A-B314-E5D6DA0D0E5D,2018,9,27,15,0,2,Apple,iPhone,iOS,11.4.1,4.0.4,goal,buyer,buyer_order_track,view,buyer_order_track.view,,viral,web,,,,False,False


In [29]:
raw_log = raw_log.merge(raw_funnel, on='view_id', how = 'left')

In [30]:
print(raw_log.shape)

(434244, 32)


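The row count is unchanged (434,244) and the column count grew from 27 to 32 (the five funnel columns other than the key), so the data appears to have merged without an issue.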
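
Comparing shapes is a quick check; `merge` can also enforce the expectation itself. In the sketch below (toy frames, hypothetical values), `validate="many_to_one"` raises an error if `view_id` is duplicated in the funnel table, and `indicator=True` adds a `_merge` column showing which log rows found no match.

```python
import pandas as pd

# Toy stand-ins for the log and funnel tables
log = pd.DataFrame({"view_id": ["home", "gig_detail", None, "unknown_view"]})
funnel = pd.DataFrame({"view_id": ["home", "gig_detail"], "funnel_name": ["home", "gig"]})

merged = log.merge(
    funnel,
    on="view_id",
    how="left",
    validate="many_to_one",  # fail loudly if the funnel table has duplicate keys
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

print(len(merged) == len(log))
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are log events whose `view_id` is missing or not present in the funnel table.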

In [31]:
print(raw_log.shape)
print(raw_category.shape)

(434244, 32)
(245, 9)


Next we merge the log and category data, using 'in_app_event_label' and 'category_id' as the keys.

In [32]:
raw_log = raw_log.merge(raw_category, left_on='in_app_event_label', right_on = 'category_id',how = 'left')

In [33]:
print(raw_log.shape)
print(raw_category.shape)
raw_log.head()

(434244, 41)
(245, 9)


Unnamed: 0,row_uuid,app_package_name,user_id,event_datetime_year,event_datetime_month,event_datetime_day,event_datetime_hour,event_datetime_minute,event_datetime_second,device_manufacturer,device_type,os_type,os_version,app_version,event_category,view_category,view_id,view_action,in_app_event_category,in_app_event_label,source_type,channel,params_campaign,params_medium,params_term,is_first_activity,is_first_goal_activity,lv1,lv2,view_desc,funnel_name,funnel_desc,depth,category_id,category_name,category1_id,category2_id,category3_id,category1,category2,category3
0,fd2a188c-bc9b-4702-9c47-b546b2614817,com.kmong.iOS,F36FAA62-ADAC-4AA5-9B00-1FD6CB7EE957,2018,9,27,15,0,0,Apple,iPhone,iOS,11.4.1,4.0.4,goal,home,home,view,home.view,,unattributed,unattributed,,,,False,False,11.0,1100.0,홈 (탭),home,홈,,,,,,,,,
1,e62dccef-dd70-4415-8a33-c8324ddaed38,com.kmong.kmong,8a871e50-0717-4aed-9bad-04ac3c3793be,2018,9,27,15,0,0,Samsung,SM-N935S,Android,7.0,3.3.5,goal,gig,gig_detail,view,gig_detail.view,41201.0,unattributed,unattributed,,,,False,False,14.0,1400.0,상품상세,gig,상품,3.0,41201.0,자기소개서,4.0,412.0,41201.0,문서작성,자기소개서·이력서,자기소개서
2,14eb3197-db83-493a-b7be-83582960c40b,com.kmong.iOS,A9E5778A-8F3D-4597-9718-74BF953A9F64,2018,9,27,15,0,0,Apple,iPhone,iOS,12.0,4.0.4,goal,inbox,inbox_detail,view,inbox_detail.view,,unattributed,unattributed,,,,False,False,16.0,1610.0,메시지목록-상세,inbox,메시지,,,,,,,,,
3,f9bb91af-248b-44dc-9f5c-1c00b37ea97b,com.kmong.iOS,168761CB-CB67-4592-867D-52780D651297,2018,9,27,15,0,1,Apple,iPhone,iOS,11.4.1,4.0.4,foreground,,,,,,,,,,,,,,,,,,,,,,,,,,
4,236e9946-7801-4898-b609-06c8ab1139dc,com.kmong.iOS,ACABB7C0-4C76-413A-B314-E5D6DA0D0E5D,2018,9,27,15,0,2,Apple,iPhone,iOS,11.4.1,4.0.4,goal,buyer,buyer_order_track,view,buyer_order_track.view,,viral,web,,,,False,False,24.0,2410.0,메뉴목록-구매관리-거래메시지,transaction_history,거래관리,,,,,,,,,


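The row count is still 434,244 and the column count grew from 32 to 41 (the nine category columns), so the data appears to have merged without an issue.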

### Drop unnecessary columns and rearrange for finalizing

The following columns will not be used, so they are dropped:

- in_app_event_label
- source_type
- lv1, lv2
- funnel_name, depth
- category_id, category1_id, category2_id, category3_id

In [34]:
raw_log.columns

raw_log = raw_log[['row_uuid', 'app_package_name', 'user_id', 'event_datetime_year',
       'event_datetime_month', 'event_datetime_day', 'event_datetime_hour',
       'event_datetime_minute', 'event_datetime_second', 'device_manufacturer',
       'device_type', 'os_type', 'os_version', 'app_version', 'event_category',
       'view_category', 'view_id', 'view_action', 'in_app_event_category',
       'channel', 'params_campaign',
       'params_medium', 'params_term', 'is_first_activity',
       'is_first_goal_activity', 'view_desc', 
       'funnel_desc', 'category_name', 'category1', 'category2', 'category3']]

raw_log.shape

(434244, 31)

### Reindex the data frame

row_uuid is the unique id of each row. Let's make it the index and then drop the column, since it would otherwise be redundant.

In [35]:
raw_log.index = raw_log["row_uuid"]
raw_log = raw_log.drop(columns = "row_uuid")
raw_log.head()

row_uuid,app_package_name,user_id,event_datetime_year,event_datetime_month,event_datetime_day,event_datetime_hour,event_datetime_minute,event_datetime_second,device_manufacturer,device_type,os_type,os_version,app_version,event_category,view_category,view_id,view_action,in_app_event_category,channel,params_campaign,params_medium,params_term,is_first_activity,is_first_goal_activity,view_desc,funnel_desc,category_name,category1,category2,category3
fd2a188c-bc9b-4702-9c47-b546b2614817,com.kmong.iOS,F36FAA62-ADAC-4AA5-9B00-1FD6CB7EE957,2018,9,27,15,0,0,Apple,iPhone,iOS,11.4.1,4.0.4,goal,home,home,view,home.view,unattributed,,,,False,False,홈 (탭),홈,,,,
e62dccef-dd70-4415-8a33-c8324ddaed38,com.kmong.kmong,8a871e50-0717-4aed-9bad-04ac3c3793be,2018,9,27,15,0,0,Samsung,SM-N935S,Android,7.0,3.3.5,goal,gig,gig_detail,view,gig_detail.view,unattributed,,,,False,False,상품상세,상품,자기소개서,문서작성,자기소개서·이력서,자기소개서
14eb3197-db83-493a-b7be-83582960c40b,com.kmong.iOS,A9E5778A-8F3D-4597-9718-74BF953A9F64,2018,9,27,15,0,0,Apple,iPhone,iOS,12.0,4.0.4,goal,inbox,inbox_detail,view,inbox_detail.view,unattributed,,,,False,False,메시지목록-상세,메시지,,,,
f9bb91af-248b-44dc-9f5c-1c00b37ea97b,com.kmong.iOS,168761CB-CB67-4592-867D-52780D651297,2018,9,27,15,0,1,Apple,iPhone,iOS,11.4.1,4.0.4,foreground,,,,,,,,,,,,,,,,
236e9946-7801-4898-b609-06c8ab1139dc,com.kmong.iOS,ACABB7C0-4C76-413A-B314-E5D6DA0D0E5D,2018,9,27,15,0,2,Apple,iPhone,iOS,11.4.1,4.0.4,goal,buyer,buyer_order_track,view,buyer_order_track.view,web,,,,False,False,메뉴목록-구매관리-거래메시지,거래관리,,,,
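
The two steps above can be combined: `set_index` moves a column into the index and removes it from the columns in one call. Since the index is meant to identify rows, it may also be worth confirming that the ids are unique. A sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical stand-in for the merged log
df = pd.DataFrame({"row_uuid": ["id-1", "id-2"], "user_id": ["u1", "u2"]})

# verify_integrity=True raises ValueError if row_uuid contains duplicates
df = df.set_index("row_uuid", verify_integrity=True)

print(df.index.is_unique)
print(df.columns.tolist())
```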
