### Goals and what we do
1. Learn to load JSON-formatted data into a DataFrame
2. Learn to convert datetime formats
3. Learn subplots
4. Learn a lazy and quick way to check data quality


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import json

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
color = sns.color_palette()
sns.set_context("notebook", font_scale=1.2)

import datetime as datetime
from datetime import timedelta, date

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
train = pd.read_csv("../input/train.csv")
print(train.dtypes)
print(train.shape)

In [None]:
# To keep computation fast, I work on a random sample rather than the whole dataset.
df = train.sample(frac=0.3)  # choose whatever sampling method or fraction you like
df.head(3)

The dataset mixes datetime columns and JSON-formatted columns.

So let's address the elephant in the room first.

1. Load the JSON-formatted columns into DataFrames
2. Convert the datetime formats

In [None]:
#cols = ['device','totals','trafficSource','geoNetwork']

tmp = df['device'].apply(json.loads).tolist()
device = pd.DataFrame(tmp)
    
tmp = df['totals'].apply(json.loads).tolist()
totals = pd.DataFrame(tmp)

tmp = df['trafficSource'].apply(json.loads).tolist()
Source = pd.DataFrame(tmp)

tmp = df['geoNetwork'].apply(json.loads).tolist()
geonetwork = pd.DataFrame(tmp)

print(device.shape, totals.shape,Source.shape,geonetwork.shape)
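The four near-identical blocks above could also be written as one loop over the commented-out `cols` list. The helper below is only a sketch on made-up rows: the name `expand_json_columns` and the toy data are assumptions for illustration, not part of the competition files. It also passes the original index through, so each expanded frame keeps the same row labels as the sampled `df`.

```python
import json

import pandas as pd


def expand_json_columns(frame, cols):
    """Return {column name: DataFrame} with one expanded frame per JSON column."""
    expanded = {}
    for col in cols:
        records = frame[col].apply(json.loads).tolist()
        # Reuse the source index so rows can be matched back to `frame` later.
        expanded[col] = pd.DataFrame(records, index=frame.index)
    return expanded


# Toy rows standing in for the real data (hypothetical values).
sample = pd.DataFrame(
    {
        "device": ['{"browser": "Chrome", "isMobile": false}',
                   '{"browser": "Safari", "isMobile": true}'],
        "totals": ['{"hits": "1"}',
                   '{"hits": "4", "transactionRevenue": "1000"}'],
    },
    index=[7, 42],  # non-contiguous labels, like a sampled frame
)

parts = expand_json_columns(sample, ["device", "totals"])
print(parts["device"])
print(parts["totals"])
```

Keys that are absent from a row, such as `transactionRevenue` in the first toy row, simply become `NaN` in the expanded frame.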


In [None]:
df.date = pd.to_datetime(df.date, format='%Y%m%d')
df.visitStartTime = pd.to_datetime(df.visitStartTime,unit='s')
#print(df.date[2],df.visitStartTime[2])
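To make the two conversions concrete, here is a small self-contained sketch on made-up values (the toy frame is an assumption, the column names mirror the dataset). `format='%Y%m%d'` parses day stamps such as 20160902, and `unit='s'` treats the number as seconds since the Unix epoch.

```python
import pandas as pd

# Toy values in the same shape as the dataset columns (hypothetical data).
toy = pd.DataFrame(
    {
        "date": [20160902, 20170101],
        "visitStartTime": [1472830385, 1483228800],
    }
)

# Day stamps: cast to string first so the format string applies unambiguously.
toy["date"] = pd.to_datetime(toy["date"].astype(str), format="%Y%m%d")

# POSIX timestamps: seconds since 1970-01-01 UTC.
toy["visitStartTime"] = pd.to_datetime(toy["visitStartTime"], unit="s")

print(toy.dtypes)
print(toy)
```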

Now the data looks quite tidy, with no obviously weird values at least.
Next, let's look at the variables one by one.

In [None]:
df.channelGrouping.nunique() #------answer is 8
df.channelGrouping.value_counts().plot(kind ='barh',color ='c',figsize =(12,6),title = 'Channel Distribution')
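The bar chart shows absolute counts. If shares are easier to read, `value_counts(normalize=True)` returns fractions instead. A minimal sketch on made-up channel labels (the toy series is an assumption, not the real distribution):

```python
import pandas as pd

# Hypothetical channel labels standing in for df.channelGrouping.
channels = pd.Series(
    ["Organic Search"] * 5 + ["Social"] * 3 + ["Direct"] * 2,
    name="channelGrouping",
)

# normalize=True divides each count by the total number of rows.
share_pct = (channels.value_counts(normalize=True) * 100).round(1)
print(share_pct)
```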

In [None]:
# A lazy but efficient way to inspect the data and pick out the valuable features.
for c in device.columns:
    print(c , '+',device[c].nunique())

In [None]:
# Check missing values
device = device[['browser','deviceCategory','isMobile','operatingSystem']]
for c in device.columns:
    print(device[c].isnull().sum(), device[c].isnull().sum()/len(device) * 100)
# very good, no missing values
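The unique-value loop and the missing-value loop can be folded into one summary table, which is easier to scan than printed lines. This is a sketch only: `quality_summary` is a hypothetical helper name and the toy frame is made up.

```python
import pandas as pd


def quality_summary(frame):
    """One row per column: distinct values, missing count, missing percentage."""
    return pd.DataFrame(
        {
            "n_unique": frame.nunique(),          # NaN is not counted as a value
            "n_missing": frame.isnull().sum(),
            "pct_missing": (frame.isnull().mean() * 100).round(2),
        }
    )


# Toy frame with one missing browser (hypothetical data).
toy = pd.DataFrame(
    {
        "browser": ["Chrome", "Chrome", "Safari", None],
        "isMobile": [False, False, True, True],
    }
)

summary = quality_summary(toy)
print(summary)
```

Calling `quality_summary(device)` or `quality_summary(totals)` would give the same overview for the frames built earlier.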

In [None]:
fig, axes = plt.subplots(2,2,figsize = (16,16))
# `legend` expects a boolean, so the panel names are passed as titles instead.
device.browser.value_counts()[:10].plot(kind='barh', color='c', title='browser', ax=axes[0][0])
device.deviceCategory.value_counts().plot(kind='barh', color='c', title='deviceCategory', ax=axes[0][1])
device.isMobile.value_counts().plot(kind='barh', color='c', title='isMobile', ax=axes[1][0])
device.operatingSystem.value_counts().plot(kind='bar', color='c', title='operatingSystem', ax=axes[1][1])
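When every panel follows the same pattern, the grid can be filled in a loop over `axes.flat` instead of one line per panel. A minimal sketch, assuming a toy frame with the same four column names (the values are made up):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows with the same columns as the `device` frame.
toy = pd.DataFrame(
    {
        "browser": ["Chrome", "Chrome", "Safari", "Firefox"],
        "deviceCategory": ["desktop", "desktop", "mobile", "tablet"],
        "isMobile": [False, False, True, True],
        "operatingSystem": ["Windows", "Macintosh", "iOS", "Android"],
    }
)
cols = ["browser", "deviceCategory", "isMobile", "operatingSystem"]

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# axes is a 2x2 array; .flat walks it row by row, pairing one axis per column.
for ax, col in zip(axes.flat, cols):
    toy[col].value_counts().head(10).plot(kind="barh", color="c", ax=ax, title=col)

fig.tight_layout()
```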

In [None]:
# I prefer percentage to absolute value
for c in totals.columns:
    print(c , '+',totals[c].nunique(),'\n'
          '+ missing value number is', totals[c].isnull().sum(),'+ the percentage is',round(totals[c].isnull().sum()/len(totals) *100,2))
    

In [None]:
for c in totals.columns:
    totals[c] = totals[c].astype(float)
totals[totals.transactionRevenue > 0].count()/len(totals)

By the 80/20 principle, a small share of people always takes the biggest slice of the cake.

In the real world, the competition can be even more cruel.

In this case, only 12.5% of customers generate positive revenue. That means over 80% of customers generate no revenue at all.
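Both statements can be checked numerically: the share of rows with positive revenue, and the share of total revenue that comes from the top 20% of rows. The sketch below uses made-up revenue values, so the printed numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical transactionRevenue values; None mimics sessions with no purchase.
raw = pd.Series([None, None, None, None, None, None, "10", "40", "50", "900"])
revenue = raw.astype(float).fillna(0)

# Share of rows that generated any revenue.
share_positive = (revenue > 0).mean()

# Share of total revenue contributed by the top 20% of rows.
top_n = max(1, int(len(revenue) * 0.2))
top_share = revenue.nlargest(top_n).sum() / revenue.sum()

print(f"rows with revenue: {share_positive:.0%}")
print(f"revenue from top 20% of rows: {top_share:.0%}")
```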


In [None]:
# An intuitive way to visualize missing values. See how pretty it is!
import missingno as msno
msno.bar(totals,color = color)

In [None]:
print('Percentage of unique visitors in sample dataset : ', round(df.fullVisitorId.nunique()/len(df)*100,2))      
print('Percentage of unique visitors in train dataset : ', round(train.fullVisitorId.nunique()/len(train)*100,2))     
# We randomly pick the dataset, so there could be some difference

In [None]:
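# df keeps the row labels of the sample while totals has a fresh 0..n-1 index,
# so copy the values by position rather than by index label.
totals['date'] = df.date.values
totals['Id'] = df.visitId.values
totals['VisitNumber'] = df.visitNumber.values
# 'date' stays a regular column because the cells below group by it.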
totals.head()
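One thing to watch when copying columns from `df` into `totals`: pandas aligns a Series on index labels, not on row position. `df` is a random sample and keeps its original labels, while `totals` was built from a list and is labelled 0..n-1. The toy frames below (made-up values) show the difference between assigning by label and by position.

```python
import pandas as pd

# df_toy mimics a sampled frame: shuffled, non-contiguous labels.
df_toy = pd.DataFrame(
    {"date": ["2017-01-01", "2017-01-02", "2017-01-03"]},
    index=[10, 0, 5],
)
# totals_toy mimics a frame built from a list: labels 0, 1, 2.
totals_toy = pd.DataFrame({"hits": [3, 1, 7]})

# Assigning a Series matches on labels: only label 0 exists in both frames.
by_label = totals_toy.assign(date=df_toy["date"])

# Assigning a plain array matches on position: row i gets value i.
by_position = totals_toy.assign(date=df_toy["date"].to_numpy())

print(by_label)
print(by_position)
```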

In [None]:
fig, axes = plt.subplots(1,1,figsize=(20,10))
Revenue= totals.groupby(['date'])['transactionRevenue'].sum()
Revenue.plot(color ='c',title = 'Daily Revenue Plot')

In [None]:
fig, axes = plt.subplots(1,1,figsize=(20,10))
Visit_no= totals.groupby(['date'])['VisitNumber'].sum()
Visit_no.plot(color ='c',title = 'Daily VisitNumber Plot')

They look quite similar, huh?

Let's put them together and see what happens.

In [None]:
revenue_visits = pd.concat([Revenue,Visit_no],axis=1)
#revenue_visits.head(2)
fig, ax1 = plt.subplots(figsize=(20,10))
t = revenue_visits.index
s1 = revenue_visits['VisitNumber']
ax1.plot(t, s1, 'c')
ax1.set_xlabel('day')

ax1.set_ylabel('VisitNumber', color='c')
ax1.tick_params('y', colors='c')

ax2 = ax1.twinx()
s2 = revenue_visits['transactionRevenue']
ax2.plot(t, s2, 'pink')
ax2.set_ylabel('Revenue', color='pink')
ax2.tick_params('y', colors='pink')
fig.tight_layout()
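The visual similarity of the two curves can also be expressed as a number with the Pearson correlation coefficient. A minimal sketch on made-up daily totals (the toy frame is an assumption; on the real data the same call would be `revenue_visits.corr()`):

```python
import pandas as pd

# Hypothetical daily aggregates with the same column names as revenue_visits.
daily = pd.DataFrame(
    {
        "VisitNumber": [10, 20, 30, 40],
        "transactionRevenue": [1.0, 2.0, 3.0, 5.0],
    },
    index=pd.date_range("2017-01-01", periods=4, freq="D"),
)

# Pearson correlation: +1 means a perfect increasing linear relationship.
r = daily["VisitNumber"].corr(daily["transactionRevenue"])
print(round(r, 3))
```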

In [None]:
fig, axes = plt.subplots(1,1,figsize=(20,8))
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.hist(Revenue,color ='c')

Another typical long-tail distribution!
Read more about the long tail here: [Long tail (Wikipedia)](https://en.wikipedia.org/wiki/Long_tail)
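A long tail can be quantified with the sample skewness, and a common way to make such a variable easier to plot is a `log1p` transform, which maps 0 to 0 and compresses large values. This is only a sketch on made-up numbers and is not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical daily revenue: mostly zero or small, with one very large day.
revenue = pd.Series([0, 0, 0, 0, 10, 20, 50, 1000], dtype=float)

raw_skew = revenue.skew()

# log1p(x) = log(1 + x), so zero revenue stays at zero.
log_revenue = np.log1p(revenue)
log_skew = log_revenue.skew()

print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```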

Next, let's speed up and look at another pair: hits and pageviews.

In [None]:
x = np.arange(1,16,1)
y1 = totals.hits.value_counts()[:15]
y2 = totals.pageviews.value_counts()[:15]
fig, ax1 = plt.subplots(figsize=(12,8))
plt.barh(x,y1,color='c',label ='Hits')
plt.barh(x,-y2,color ='pink',label ='PageViews')
plt.legend(loc=[1, 0])
plt.show()
# They look quite similar. That makes sense: a hit comes first, then a pageview.

In [None]:
totals.pageviews.describe()
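The quartiles reported by `describe()` can also be pulled out directly, together with the share of sessions that stop at a single pageview. A small sketch on made-up counts (the toy series is an assumption, so the numbers differ from the real data):

```python
import pandas as pd

# Hypothetical pageview counts per session.
pageviews = pd.Series([1, 1, 1, 1, 2, 3, 4, 10], dtype=float)

# 75th percentile: three quarters of sessions are at or below this value.
q75 = pageviews.quantile(0.75)

# Fraction of sessions with exactly one pageview.
share_single = (pageviews == 1).mean()

# Fraction of sessions with more than 4 pageviews.
share_above_4 = (pageviews > 4).mean()

print(q75, share_single, share_above_4)
```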

In [None]:
# now let's see all pairs

#fig, axes = plt.subplots(1,1,figsize=(16,16))
#sns.pairplot(totals)

In [None]:
geonetwork.head()

In [None]:
geonetwork.head()
fig,axes = plt.subplots(3,2,figsize =(15,20))
geonetwork.continent.value_counts().plot(kind = 'bar', ax = axes[0][0], title = 'Continent Contribution',color = 'c',alpha =.5)
geonetwork[geonetwork.continent == 'Americas'].subContinent.value_counts().plot(kind = 'bar', ax = axes[0][1], title = 'America Contribution',color = 'c',alpha =.5)
geonetwork[geonetwork.continent == 'Asia'].subContinent.value_counts().plot(kind = 'bar', ax = axes[1][0], title = 'Asia Contribution',color = 'c',alpha =.5,rot =90)
geonetwork[geonetwork.continent == 'Europe'].subContinent.value_counts().plot(kind = 'bar', ax = axes[1][1], title = 'Europe Contribution',color = 'c',alpha =.5,rot = 90)
geonetwork[geonetwork.continent == 'Africa'].subContinent.value_counts().plot(kind = 'bar', ax = axes[2][0], title = 'Africa Contribution',color = 'c',alpha =.5,rot =90)
geonetwork[geonetwork.continent == 'Oceania'].subContinent.value_counts().plot(kind = 'bar', ax = axes[2][1], title = 'Oceania Contribution',color = 'c',alpha =.5)
#geonetwork.country.value_counts()[:10].plot(kind = 'bar', ax = axes[3][0], title = 'Top 10 Country Contribution',color = 'c',alpha =.5)
#geonetwork.city.value_counts()[:10].plot(kind = 'bar', ax = axes[3][1], title = 'Top 10 City Contribution',color = 'c',alpha =.5)


## What do we find so far?
1. More than half of the channel traffic comes from organic search, meaning search engine results that are earned rather than paid.
   Social comes next.
   The definitions of these channels and the differences between them are explained here: [What is the difference between direct and organic search traffic sources?](https://www.smartbugmedia.com/blog/what-is-the-difference-between-direct-and-organic-search-traffic-sources)
2. In Google's case, the most popular browser is Chrome.
3. Device category and `isMobile` validate each other: most devices are desktops, which are of course not mobile.
4. The most popular operating system is Windows, followed by Macintosh.
5. Revenue and visits are highly positively correlated.
6. Most sessions have only 1 hit and 1 pageview. If your pageviews exceed 4, you have already beaten 75% of your competitors!
7. Most of the traffic comes from the Americas, Asia and Europe, but in this case I don't think the region information is an important feature that can significantly affect the prediction result.






## In the next chapter, let's load the full data and try some prediction.
See ya!