# BBC Task
## 1. Business Understanding
### 1.1 Business Objectives
### Web usage data

These datasets represent traffic to the BBC’s websites from users during June and July of 2016. They contain information such as the BBC product visited (News, Sport, TV & iPlayer, Weather, etc.), platform, date/time of the visit, type of app and region. We encourage you to use this data to find insights and patterns of behaviour of our users, or use it to build a predictive model.

The small dataset is simply a sample from the large dataset based on 10k users, in case you are not able to load the large file. We would expect you to attempt using the large file if possible.
 
This data could answer many questions. My main focus in this project is to forecast future traffic to the BBC products using Prophet.

## 2. Data
### 2.1 Import libraries
I always try to automate the repetitive parts of tasks like this, so I keep a ["useful imports" gist](https://gist.github.com/joezcool02/c98669a3a74a8ae21e5db23b02f6057f).

In [1]:
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Handle table-like data and matrices
import numpy as np
import pandas as pd

# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns

# Configure visualisations
%matplotlib inline
mpl.style.use( 'ggplot' )
sns.set_style( 'whitegrid' )
pylab.rcParams[ 'figure.figsize' ] = 8 , 6

### 2.2 Import Helpers

In [2]:
def plot_histograms( df , variables , n_rows , n_cols ):
    fig = plt.figure( figsize = ( 16 , 12 ) )
    for i, var_name in enumerate( variables ):
        ax=fig.add_subplot( n_rows , n_cols , i+1 )
        df[ var_name ].hist( bins=10 , ax=ax )
        ax.set_title( 'Skew: ' + str( round( float( df[ var_name ].skew() ) , ) ) ) # + ' ' + var_name ) #var_name+" Distribution")
        ax.set_xticklabels( [] , visible=False )
        ax.set_yticklabels( [] , visible=False )
    fig.tight_layout()  # Improves appearance a bit.
    plt.show()

def plot_distribution( df , var , target , **kwargs ):
    row = kwargs.get( 'row' , None )
    col = kwargs.get( 'col' , None )
    facet = sns.FacetGrid( df , hue=target , aspect=4 , row = row , col = col )
    facet.map( sns.kdeplot , var , shade= True )
    facet.set( xlim=( 0 , df[ var ].max() ) )
    facet.add_legend()

def plot_categories( df , cat , target , **kwargs ):
    row = kwargs.get( 'row' , None )
    col = kwargs.get( 'col' , None )
    facet = sns.FacetGrid( df , row = row , col = col )
    facet.map( sns.barplot , cat , target )
    facet.add_legend()

def plot_correlation_map( df ):
    corr = df.corr()  # use the frame passed in, not an undefined global
    _ , ax = plt.subplots( figsize =( 12 , 10 ) )
    cmap = sns.diverging_palette( 220 , 10 , as_cmap = True )
    _ = sns.heatmap(
        corr, 
        cmap = cmap,
        square=True, 
        cbar_kws={ 'shrink' : .9 }, 
        ax=ax, 
        annot = True, 
        annot_kws = { 'fontsize' : 12 }
    )

def describe_more( df ):
    var = [] ; l = [] ; t = []
    for x in df:
        var.append( x )
        l.append( df[ x ].nunique() )  # number of distinct non-null values
        t.append( df[ x ].dtypes )
    levels = pd.DataFrame( { 'Variable' : var , 'Levels' : l , 'Datatype' : t } )
    levels.sort_values( by = 'Levels' , inplace = True )
    return levels

from sklearn.tree import DecisionTreeClassifier  # required by plot_variable_importance

def plot_variable_importance( X , y ):
    tree = DecisionTreeClassifier( random_state = 99 )
    tree.fit( X , y )
    plot_model_var_imp( tree , X , y )
    
def plot_model_var_imp( model , X , y ):
    imp = pd.DataFrame( 
        model.feature_importances_  , 
        columns = [ 'Importance' ] , 
        index = X.columns 
    )
    imp = imp.sort_values( [ 'Importance' ] , ascending = True )
    imp[ : 10 ].plot( kind = 'barh' )
    print( model.score( X , y ) )  # accuracy on the data the model was fitted on

### 2.3 Read in Data

In [3]:
# Read in full dataset
df = pd.read_csv('../my_data/raw/web_usage_big.csv')

# Check data is read in okay
df.head()

| | user_id | date_time | search_term | platform | app_type | product | name_page | page_url | region |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 468660be9c3ccd8f10e4b4778b85b076 | 2016-06-17 21:28:49.139 | | Mobile | responsive | sport | sport.football.european_championship.story.365... | | london |
| 1 | 468660be9c3ccd8f10e4b4778b85b076 | 2016-06-26 22:41:01.491 | | Mobile | responsive | sport | sport.football.european_championship.story.364... | | london |
| 2 | 468660be9c3ccd8f10e4b4778b85b076 | 2016-06-16 21:16:01.162 | | Mobile | responsive | sport | sport.football.european_championship.story.365... | | london |
| 3 | 468660be9c3ccd8f10e4b4778b85b076 | 2016-06-20 14:40:35.117 | | Mobile | responsive | sport | sport.football.european_championship.story.364... | | london |
| 4 | 468660be9c3ccd8f10e4b4778b85b076 | 2016-06-17 23:36:28.386 | | Mobile | responsive | sport | sport.football.european_championship.2016.medi... | | london |


### 2.4 Statistical summaries
I always run `info` and `describe` first: they give rapid insight into the quality of the data and make it easy to spot missing or outlying values.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15853552 entries, 0 to 15853551
Data columns (total 9 columns):
user_id        object
date_time      object
search_term    object
platform       object
app_type       object
product        object
name_page      object
page_url       object
region         object
dtypes: object(9)
memory usage: 1.1+ GB
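
Every column is loaded as `object`, which is a large part of why the frame exceeds 1 GB. Below is a minimal sketch of one possible way to reduce this, assuming low-cardinality columns such as `platform` and `product` can be stored as `category`. The toy frame is invented for illustration and is not taken from the BBC file.

```python
import pandas as pd

# Hypothetical miniature of the web usage data (values invented for illustration).
sample = pd.DataFrame({
    'user_id': ['a1', 'a1', 'b2'] * 1000,
    'platform': ['Mobile', 'Computer', 'Mobile'] * 1000,
    'product': ['sport', 'news', 'sport'] * 1000,
})

# Memory footprint with plain string columns.
before = sample.memory_usage(deep=True).sum()

# A category column stores each distinct string once plus small integer codes.
compact = sample.astype({'platform': 'category', 'product': 'category'})
after = compact.memory_usage(deep=True).sum()

print(f'strings: {before:,} bytes, category: {after:,} bytes')
```

The same idea could be applied at load time through the `usecols` and `dtype` arguments of `pd.read_csv`, so that unused columns are never read at all.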


In [5]:
df.describe()

| | user_id | date_time | search_term | platform | app_type | product | name_page | page_url | region |
|---|---|---|---|---|---|---|---|---|---|
| count | 15853552 | 15853552 | 9037 | 15852835 | 15835468 | 15734873 | 15853552 | 14875049 | 15853552 |
| unique | 200000 | 15583310 | 3561 | 8 | 6 | 24 | 98514 | 183560 | 124 |
| top | 5a2ea0350fdb99577da344be85e475b1 | 2016-06-24 19:09:37.385 | glastonbury | Computer | responsive | sport | sport.page | https://www.bbc.co.uk/ | london |
| freq | 4183 | 6 | 126 | 5433941 | 14169494 | 9873222 | 2074774 | 1826367 | 6147141 |


### Quick Insights
As we can see from this, there are 200,000 unique users, as expected. There are 8 unique platforms, 6 unique app_types, 24 unique products and 124 unique regions. The most popular search term is "glastonbury" and the most popular product is sport.
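
The `count` row also shows how many values are missing in each column, since `describe` counts only non-null entries. A small sketch of that arithmetic, using the figures from the `describe` output (the variable names are mine):

```python
# Non-null counts copied from the df.describe() output.
total_rows = 15853552
non_null = {
    'search_term': 9037,
    'platform': 15852835,
    'app_type': 15835468,
    'product': 15734873,
    'page_url': 14875049,
}

# Missing values are simply the rows that were not counted.
missing = {col: total_rows - n for col, n in non_null.items()}

for col, n_missing in missing.items():
    print(f'{col}: {n_missing:,} missing ({n_missing / total_rows:.2%})')
```

On the loaded frame, `df.isna().sum()` should report the same numbers directly. Of the columns used later, `platform` is missing 717 values and `product` is missing 118,679.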


### Variable Description
* user_id: User identifier.
* date_time: A timestamp to indicate when the event occurred.
* search_term: If a search term was entered by the user on the BBC website, it will appear here.
* platform: How the content was accessed: Mobile, Computer, Tablet, Big screen.
* app_type: How the content is delivered to the browser: web, mobile-web, mobile-app or responsive.
* product: The BBC product that the content is part of, such as sport, news, cbbc, etc.
* name_page: The content identifier for the page viewed (e.g. home.page, news.page).
* page_url: The web address of the page visited.
* region: Geographical region where the browser appears to have arrived from.

To keep things simple, my investigation only uses user_id, date_time, platform and product.

In [6]:
# Trim down to the columns of interest; .copy() avoids a SettingWithCopyWarning when columns are added later
df = df[['user_id', 'date_time', 'platform', 'product']].copy()
df.head()

| | user_id | date_time | platform | product |
|---|---|---|---|---|
| 0 | 468660be9c3ccd8f10e4b4778b85b076 | 2016-06-17 21:28:49.139 | Mobile | sport |
| 1 | 468660be9c3ccd8f10e4b4778b85b076 | 2016-06-26 22:41:01.491 | Mobile | sport |
| 2 | 468660be9c3ccd8f10e4b4778b85b076 | 2016-06-16 21:16:01.162 | Mobile | sport |
| 3 | 468660be9c3ccd8f10e4b4778b85b076 | 2016-06-20 14:40:35.117 | Mobile | sport |
| 4 | 468660be9c3ccd8f10e4b4778b85b076 | 2016-06-17 23:36:28.386 | Mobile | sport |


In [7]:
# Convert to datetime so the .dt accessor is available, then keep only the calendar date
df['date_time'] = pd.to_datetime(df['date_time'])
df['date'] = df['date_time'].dt.date
df.head()

| | user_id | date_time | platform | product | date |
|---|---|---|---|---|---|
| 0 | 468660be9c3ccd8f10e4b4778b85b076 | 2016-06-17 21:28:49.139 | Mobile | sport | 2016-06-17 |
| 1 | 468660be9c3ccd8f10e4b4778b85b076 | 2016-06-26 22:41:01.491 | Mobile | sport | 2016-06-26 |
| 2 | 468660be9c3ccd8f10e4b4778b85b076 | 2016-06-16 21:16:01.162 | Mobile | sport | 2016-06-16 |
| 3 | 468660be9c3ccd8f10e4b4778b85b076 | 2016-06-20 14:40:35.117 | Mobile | sport | 2016-06-20 |
| 4 | 468660be9c3ccd8f10e4b4778b85b076 | 2016-06-17 23:36:28.386 | Mobile | sport | 2016-06-17 |
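
The calendar date can be pulled out of a timestamp column in more than one way. This is a small sketch comparing three options on timestamps copied from the rows above. Which one suits the later forecasting step is a judgement call, not something the data dictates.

```python
import datetime
import pandas as pd

# Timestamps copied from the first rows of the output.
stamps = pd.Series([
    '2016-06-17 21:28:49.139',
    '2016-06-26 22:41:01.491',
    '2016-06-16 21:16:01.162',
])
date_time = pd.to_datetime(stamps)

# Element-wise extraction with a Python lambda (slow on millions of rows).
via_apply = date_time.apply(lambda x: x.date())

# Vectorised accessor that yields the same datetime.date objects.
via_dt = date_time.dt.date

# Alternative that keeps a datetime64 dtype by setting the time part to midnight.
normalized = date_time.dt.normalize()

print(via_dt.tolist())
print(normalized.dtype)
```

The first two produce an `object` column of `datetime.date` values, while `normalize` keeps a native datetime column, which tends to be more convenient for grouping and resampling.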


In [8]:
df = df.drop(['date_time'], axis=1)

In [10]:
# So we don't need to process every time
df.to_csv('../my_data/processed/simple_date.csv')
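
One thing to be aware of when reading this file back: by default `to_csv` also writes the index, which reappears as an `Unnamed: 0` column on load. A minimal sketch of the round trip, using an in-memory buffer and a made-up two-row frame in place of the real file:

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the processed frame (values invented).
small = pd.DataFrame({
    'user_id': ['a1', 'b2'],
    'platform': ['Mobile', 'Computer'],
    'product': ['sport', 'news'],
    'date': ['2016-06-17', '2016-06-26'],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
small.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# With index=False only the original columns are written.
without_index = io.StringIO()
small.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index, parse_dates=['date'])

print(list(reloaded.columns))
print(list(reloaded_clean.columns))
```

Either pass `index=False` when saving or `index_col=0` when loading. `parse_dates` restores the datetime dtype, which a CSV cannot store.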

I will move over to another notebook as the preprocessing is now done.
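
As a pointer to what that next step might look like: Prophet is fitted on a frame with a `ds` (date) column and a `y` (value) column. The sketch below shows one way the processed rows could be reshaped into that form. It assumes the target is daily page views per product, which is my choice of metric; the sample rows are invented.

```python
import pandas as pd

# Hypothetical rows in the shape of simple_date.csv (values invented).
events = pd.DataFrame({
    'user_id': ['a1', 'a1', 'b2', 'b2', 'c3'],
    'platform': ['Mobile', 'Mobile', 'Computer', 'Tablet', 'Mobile'],
    'product': ['sport', 'sport', 'news', 'sport', 'news'],
    'date': ['2016-06-17', '2016-06-17', '2016-06-17', '2016-06-18', '2016-06-18'],
})
events['date'] = pd.to_datetime(events['date'])

# One row per (product, day) holding the number of page views on that day.
daily = (
    events.groupby(['product', 'date'])
          .size()
          .reset_index(name='y')
          .rename(columns={'date': 'ds'})
)

# A single product's series, in the two-column layout Prophet expects.
sport = daily.loc[daily['product'] == 'sport', ['ds', 'y']].reset_index(drop=True)
print(sport)
```

Counting distinct `user_id` values per day instead of rows would give daily unique visitors, should that turn out to be the more useful quantity to forecast.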