## Walkthrough of Data Science process - Traveler

### * Goal: Predict the country that users will make their first booking in, based on some basic user profile data.

#### [1] Pre-processing: Assessing and analyzing data, cleaning, transforming and adding new features
#### [2] Learning model: Constructing and testing learning model
#### [3] Post-processing: Creating final predictions

### Milestone 1: Understanding the Data

#### Formulate a range of questions, including (but not limited to):

    1. What features (columns) does the dataset contain?
    2. How many records (rows) have been provided?
    3. What format is the data in (e.g. what format are the dates provided, are there numerical values, what do the different categorical values look like)?
    4. Are there missing values?
    5. How do the different features relate to each other?
    
    Note: Look into the csv files provided.
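
The first few questions can usually be answered with a handful of pandas calls. The sketch below runs on a tiny hand-made sample rather than the real file; the `sample` frame and its contents are illustrative assumptions, and the same calls would apply to the frame loaded from train_users_2.csv.

```python
import pandas as pd

# Toy stand-in for train_users_2.csv (an assumption, for illustration only)
sample = pd.DataFrame({
    "id": ["gxn3p5htnn", "820tgsjxq7", "4ft3gnwmtx"],
    "date_account_created": ["2010-06-28", "2011-05-25", "2010-09-28"],
    "gender": ["-unknown-", "MALE", "FEMALE"],
    "age": [None, 38.0, 56.0],
})

n_rows, n_cols = sample.shape          # questions 1 and 2: number of rows and columns
column_types = sample.dtypes           # question 3: the type pandas inferred per column
missing_counts = sample.isna().sum()   # question 4: missing values per column

print(n_rows, n_cols)
print(column_types)
print(missing_counts)
```

Note that the date column is still plain text at this point, which is one of the format issues to look out for.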

### Reviewing the Dataset

    1. train_users_2.csv – This dataset contains data on Traveler users, including the destination countries. Each row represents one user, with columns containing information such as the user's age and sign-up date. This is the primary dataset used to train the model.
    
    2. test_users.csv – This dataset also contains data on Traveler users, in the same format as train_users_2.csv, but without the destination country. These are the users against which the final prediction model needs to be tested.
    
    3. sessions.csv – This supplementary dataset can be used to train the model and make the final predictions. It contains information about the actions (e.g. clicked on a listing, updated a wish list, ran a search) taken by users in both the training and test datasets.



In [4]:
##Exploring Traveler data
import pandas as pd
import matplotlib.pyplot as plt
%pylab inline 
data = pd.read_csv("/home/b2/Downloads/train_users_2.csv")

Populating the interactive namespace from numpy and matplotlib


In [5]:
data.head().transpose()

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| id | gxn3p5htnn | 820tgsjxq7 | 4ft3gnwmtx | bjjt8pjhuk | 87mebub9p4 |
| date_account_created | 2010-06-28 | 2011-05-25 | 2010-09-28 | 2011-12-05 | 2010-09-14 |
| timestamp_first_active | 20090319043255 | 20090523174809 | 20090609231247 | 20091031060129 | 20091208061105 |
| date_first_booking | | | 2010-08-02 | 2012-09-08 | 2010-02-18 |
| gender | -unknown- | MALE | FEMALE | FEMALE | -unknown- |
| age | | 38 | 56 | 42 | 41 |
| signup_method | facebook | facebook | basic | facebook | basic |
| signup_flow | 0 | 0 | 3 | 0 | 0 |
| language | en | en | en | en | en |
| affiliate_channel | direct | seo | direct | direct | direct |


## Observations from the sample data (not exhaustive)

    1. There are missing values in the age and date_first_booking columns
        - These will need to be filled in, or the rows excluded altogether

    2. Most of the columns provided contain categorical data
        - 11 of the 16 columns provided appear to be categorical
        - What's the problem? Most algorithms used in classification do not handle categorical data well.
        - Solution: During data transformation, find ways to convert the data into forms more suitable for classification.

    3. The timestamp_first_active column looks to be a full timestamp
        - For example, 20090609231247 looks like it should be 2009-06-09 23:12:47
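
The guess about the timestamp layout in the third observation can be checked with the standard library. This is only a sanity check on a single value; the format string is the assumption being tested.

```python
from datetime import datetime

# One raw value as it appears in the timestamp_first_active column
raw = 20090609231247

# Assumed layout: year, month, day, hour, minute, second with no separators
parsed = datetime.strptime(str(raw), "%Y%m%d%H%M%S")
print(parsed)  # 2009-06-09 23:12:47
```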


### Look at the structure of the data
 ### 1. Country Destination Values
   ##### country_destination (Most important column) 
   ##### Why? - Looking at the number of records that fall into each category can help provide some insights into how the model should be constructed as well as pitfalls to avoid.
<img src="./images/User_by_Destination.png" height="400" width="500"/>

##### Looking at the breakdown of the data, one thing that immediately stands out is that almost 90% of users fall into two categories, that is, they are either yet to make a booking (NDF) or they made their first booking in the US. 

##### What’s more, breaking down these percentage splits by year reveals that the percentage of users yet to make a booking increases each year and reached over 60% in 2014.

<img src="./images/User_by_Destination_and_Year.png" height="400" width="500"/>


### Summary for building a learning model:
   ##### [1] The analysis suggests that the spread of categories has changed from year to year.
   ##### Since the final predictions will be made against user data from 2014 onwards, we can focus on more recent data for training the learning model, as it is more likely to resemble the test data.
   ##### [2] Since the vast majority of users fall into two categories ('NDF' and 'US'), there is a risk that if the learning model is too generalized, it will select one of these two categories for every prediction.
   ##### It is important to ensure that the training data has enough information to build a learning model that will predict the other categories as well.
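
A quick way to quantify this kind of imbalance is to look at the normalized class counts. The labels below are made up to mimic a skewed distribution; they are not the real proportions from the dataset.

```python
import pandas as pd

# Hypothetical labels, chosen only to mimic a skewed class distribution
labels = pd.Series(["NDF"] * 6 + ["US"] * 3 + ["FR"], name="country_destination")

# Share of records per class, largest first
class_share = labels.value_counts(normalize=True)

# Combined share of the two largest classes
top_two_share = class_share.iloc[:2].sum()

print(class_share)
print(f"Share covered by the two largest classes: {top_two_share:.0%}")
```

A model that only ever predicts the two largest classes would already match that share of records, which is why plain accuracy can look deceptively good here.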

### 2. Account Creation Dates
##### date_account_created column – how have the values changed over time?
<img src="./images/Accounts_Created_Over_Time.png" height="400" width="500"/>


### Summary:
   ##### [1] The analysis shows explosive growth, averaging over 10% growth in new accounts created per month.
   ##### [2] 2014 shows a rapid increase over the year before.
   ##### In fact, we can limit the training data to accounts created from Jan 2013 onwards (70% of the data will still be included)
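
Restricting the training data to recent accounts comes down to a date comparison once the column has been parsed. A minimal sketch on invented rows (the `accounts` frame and `cutoff` name are assumptions):

```python
import pandas as pd

# Invented rows; only the column name matches the real dataset
accounts = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "date_account_created": pd.to_datetime(
        ["2011-05-25", "2012-12-31", "2013-01-01", "2014-03-10"]
    ),
})

# Keep accounts created on or after 1 Jan 2013
cutoff = pd.Timestamp("2013-01-01")
recent = accounts[accounts["date_account_created"] >= cutoff]

# Fraction of rows that survive the filter
kept_fraction = len(recent) / len(accounts)
print(recent)
print(kept_fraction)
```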
   


### 3. Age Breakdown
#### Data Quality issues 
##### - A significant number of users reported ages well over 100.
##### - A significant number of users reported ages over 1000.
<img src="./images/Reported_Ages_of_Users.png" height="400" width="500"/>


### Summary:
   ##### [1] It appears that a number of users have reported their birth year instead of their age.
   ##### [2] A significant number of users reported their age as 105 or 110.
   ###### Why? - Some users might have intentionally entered their age incorrectly for privacy reasons.
   ###### These are errors and need to be addressed.
   ##### [3] Another issue with the age column is that sometimes the age has not been reported at all.
    
   #### Check missing ages? 
   ##### Large number of missing values in all years.
   <img src="./images/Missing_Ages.png" height="400" width="500"/>
   #### Note: While cleaning the data, we need to decide what to do with these missing values.
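
Before deciding how to clean the column, it can help to count how many values fall into each problem group. The sketch below uses invented ages and the thresholds mentioned above (over 100, over 1000); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Invented ages covering each kind of issue described above
ages = pd.Series([25, 41, 105, 110, 1985, 2014, np.nan, 33])

# Not reported at all
missing = ages.isna().sum()

# Values over 1000 are most plausibly birth years
likely_birth_year = (ages > 1000).sum()

# Values over 100 but not in the birth-year range
implausibly_old = ((ages > 100) & (ages <= 1000)).sum()

print(missing, likely_birth_year, implausibly_old)
```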
        

### 4. First Device Type
#### first_device_type column
<img src="./images/First_Device_Used.png" height="400" width="500"/>

### Summary:
##### [1] Windows users have increased significantly as a percentage of all users.
##### [2] iPhone users have tripled.
##### [3] Users of ‘Other/unknown’ devices have gone from the second largest group to less than 5% of users.

#### Again we can notice that the recent data is likely to be most useful for building the learning model.

## Other columns
### HW - Take a look at the other columns and see how they can also help in building an accurate classification learning model.


## Part - Focus on Cleaning Data

### [1] Fixing up formats - 
   ##### The timestamp_first_active column contains numbers like 20090609231247 instead of timestamps in the expected format: 2009-06-09 23:12:47
### [2] Filling in missing values 
### [3] Correcting erroneous values - 
   ##### For example, a 'gender' column where someone has entered a number, or an 'age' column where someone has entered a value well over 100. 
### [4] Standardizing categories (correcting erroneous values) - 
   ##### Spelling mistakes, language differences or other factors can result in a given answer being provided in multiple ways.
   ###### E.g. for data on country of birth, if users are not provided with a standardized list of countries, the data will inevitably contain multiple spellings of the same country (e.g. USA, United States, U.S. and so on)
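
One common way to standardize categories is to normalize the text and then map it through a lookup table. The sketch below is a generic illustration; the `canonical` mapping and the sample values are assumptions, and the Traveler dataset itself has no country-of-birth column.

```python
import pandas as pd

# Free-text answers with several spellings of the same country
raw = pd.Series(["USA", "United States", "U.S.", " usa ", "France"])

# Hypothetical lookup from a normalized spelling to one canonical label
canonical = {"usa": "US", "united states": "US", "u.s.": "US", "france": "FR"}

# Trim whitespace and lowercase so the lookup keys stay few
normalized = raw.str.strip().str.lower()

# Map to canonical labels; anything not in the lookup keeps its original value
standardized = normalized.map(canonical).fillna(raw)
print(standardized.tolist())
```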


## Dealing with Missing Data - Solutions
### [1] Deleting/Ignoring rows with missing values
   ##### [a] If more than 10% of the data would be deleted, reconsider this approach.
   ##### [b] Be confident that the rows being deleted do not contain information that is not contained in other rows.
   ##### For example, in the dataset we observe that many users have not provided their age. 
   ##### Can we assume that the people who chose not to provide their age are the same as the users who did? 
   ##### Or are they likely to represent a different type of user, perhaps an older and more privacy-conscious user, and therefore a user that is likely to make different choices about which countries to visit? 
   ##### If the answer is the latter, we probably do not want to just delete the records.


### [2] Filling in the Values
##### What value to use?
##### This depends on a range of factors, including the type of data we are trying to fill.
##### Categorical: If the data is categorical (e.g. countries, device types), it may make sense to simply create a new category that represents ‘unknown’.
##### Another option may be to fill the values with the most common value for that column (the mode).
#### Since these are broad methods for filling the missing values, they may oversimplify the data and/or make the final learning model less accurate.

##### Numerical: For a numerical column such as age, we could use the mean or median to fill the values.
##### Or, take an average based on some other criterion – for example, filling the missing age values with the average age of users that selected the same country_destination.

##### Note: For both types of data, far more complicated methods can be used to impute the missing values. There are many possible approaches.
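
The filling options described above can each be written in a line or two of pandas. This is a sketch on an invented frame, not the approach used later in this walkthrough; the `users` frame and its values are assumptions.

```python
import numpy as np
import pandas as pd

# Invented rows with one missing categorical and one missing numerical value
users = pd.DataFrame({
    "first_browser": ["Chrome", None, "Chrome", "IE"],
    "age": [30.0, np.nan, 50.0, 40.0],
    "country_destination": ["US", "US", "FR", "US"],
})

# Categorical, option 1: an explicit 'unknown' category
browser_unknown = users["first_browser"].fillna("unknown")

# Categorical, option 2: the most common value (the mode)
browser_mode = users["first_browser"].fillna(users["first_browser"].mode()[0])

# Numerical, option 1: the column median
age_median = users["age"].fillna(users["age"].median())

# Numerical, option 2: the mean age of users with the same country_destination
group_mean = users.groupby("country_destination")["age"].transform("mean")
age_by_group = users["age"].fillna(group_mean)

print(age_median.tolist(), age_by_group.tolist())
```

The group-based fill only works where country_destination is known, so it could be applied to the training rows but not to the test users.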

## Cleaning efforts on two files –
 train_users_2.csv and test_users.csv 

In [8]:

#Loading the data
import pandas as pd

print("Reading data...")
train_file = "/home/b2/Downloads/train_users_2.csv"
df_train = pd.read_csv(train_file, header = 0,index_col=None)

test_file = "/home/b2/Downloads/test_users.csv"
df_test = pd.read_csv(test_file, header = 0,index_col=None)

# Combining into one dataset for cleaning
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)
print("Reading data...completed")


Reading data...
Reading data...completed


## Cleaning the timestamps - Fixing up formats of dates

#### Why convert? - Proper datetime values let us, for example, subtract one date from another or extract the month of the year from each date.

#### In the next exercise, we will see why this matters when we start adding new features to the training data based on this date information.
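
To make the benefit concrete, here is a small sketch of the two operations mentioned above, using the first two rows of the sample data. The derived names (`lag`, `lag_days`, `signup_month`) are illustrative and not features defined by this walkthrough.

```python
import pandas as pd

# First two users from the sample output, already parsed as datetimes
dates = pd.DataFrame({
    "date_account_created": pd.to_datetime(["2010-06-28", "2011-05-25"]),
    "timestamp_first_active": pd.to_datetime(
        ["2009-03-19 04:32:55", "2009-05-23 17:48:09"]
    ),
})

# Subtracting one date from another gives a timedelta
lag = dates["date_account_created"] - dates["timestamp_first_active"]
lag_days = lag.dt.days  # whole days between first activity and account creation

# Extracting the month of the year from each date
signup_month = dates["date_account_created"].dt.month

print(lag_days.tolist(), signup_month.tolist())
```

Neither operation is available while the columns are stored as strings or integers.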


In [9]:
# Fixing date formats in pandas with to_datetime
# Parse both date columns using their explicit formats
print("Fixing timestamps...")
df_all['date_account_created'] = pd.to_datetime(df_all['date_account_created'], format='%Y-%m-%d')
# timestamp_first_active is read as an integer (e.g. 20090609231247), so cast it to str before parsing
df_all['timestamp_first_active'] = pd.to_datetime(df_all['timestamp_first_active'].astype(str), format='%Y%m%d%H%M%S')
print("Fixing timestamps...completed")
df_all.head()


Fixing timestamps...
Fixing timestamps...completed


| | affiliate_channel | affiliate_provider | age | country_destination | date_account_created | date_first_booking | first_affiliate_tracked | first_browser | first_device_type | gender | id | language | signup_app | signup_flow | signup_method | timestamp_first_active |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | direct | direct | | NDF | 2010-06-28 | | untracked | Chrome | Mac Desktop | -unknown- | gxn3p5htnn | en | Web | 0 | facebook | 2009-03-19 04:32:55 |
| 1 | seo | google | 38.0 | NDF | 2011-05-25 | | untracked | Chrome | Mac Desktop | MALE | 820tgsjxq7 | en | Web | 0 | facebook | 2009-05-23 17:48:09 |
| 2 | direct | direct | 56.0 | US | 2010-09-28 | 2010-08-02 | untracked | IE | Windows Desktop | FEMALE | 4ft3gnwmtx | en | Web | 3 | basic | 2009-06-09 23:12:47 |
| 3 | direct | direct | 42.0 | other | 2011-12-05 | 2012-09-08 | untracked | Firefox | Mac Desktop | FEMALE | bjjt8pjhuk | en | Web | 0 | facebook | 2009-10-31 06:01:29 |
| 4 | direct | direct | 41.0 | US | 2010-09-14 | 2010-02-18 | untracked | Chrome | Mac Desktop | -unknown- | 87mebub9p4 | en | Web | 0 | basic | 2009-12-08 06:11:05 |


## Removing booking date field

#### Why? Notice how many date fields there are.

#### We converted two of the date fields to a proper format above.

#### Which one was not covered? date_first_booking.

#### Why? Reason - 
#### In train_users_2.csv, all the users that have a first booking country have a value in the date_first_booking column, while for those who have not made a booking (country_destination = NDF) the value is missing. 
#### In test_users.csv, the date_first_booking column is empty for all the records.


## Summary: 
#### This column is not going to be useful for predicting which country a booking will be made in. What is more, if we leave it in the training dataset when building the model, it will likely push the model towards predicting NDF, since those are the records without dates in the training dataset and every record in the test dataset is missing the date.
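
The relationship between the missing booking date and the NDF label can be verified directly. The sketch below uses the five rows from the sample output above; `leaks_target` is an illustrative name, and on the real data the same comparison would be run against df_train.

```python
import pandas as pd

# The five sample rows shown earlier: destination and first booking date
train = pd.DataFrame({
    "country_destination": ["NDF", "NDF", "US", "other", "US"],
    "date_first_booking": [None, None, "2010-08-02", "2012-09-08", "2010-02-18"],
})

is_ndf = train["country_destination"] == "NDF"
has_no_booking_date = train["date_first_booking"].isna()

# True when 'date missing' and 'label is NDF' coincide on every row
leaks_target = bool((is_ndf == has_no_booking_date).all())
print(leaks_target)
```

When the two conditions coincide on every row, the column effectively encodes the label, which is the reason for dropping it.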

In [10]:
## Removing date_first_booking column
df_all.drop('date_first_booking', axis=1, inplace=True)
print("Dropped date_first_booking column...")

Dropped date_first_booking column...


In [None]:
df_all.head()

## Clean the Age column - Correcting erroneous values 
#### [1] Outliers - there are several age values that are clearly incorrect (unreasonably high or too low)
#### Solution: replace these incorrect values with 'NaN' (changing incorrect values into missing values)
#### [2] Missing values - a significant number of users did not provide their age at all; these show up as NaN in the dataset
#### Solution: let's change all the NaN values to -1

In [11]:
import numpy as np

## Remove outliers function - [1]
def remove_outliers(df, column, min_val, max_val):
    # Values at or outside the bounds (<= min_val or >= max_val) become NaN
    col_values = df[column].values
    # np.nan is used because the np.NaN alias was removed in NumPy 2.0
    df[column] = np.where(np.logical_or(col_values <= min_val, col_values >= max_val), np.nan, col_values)
    return df

## Fixing age column - [2]
print("Fixing age column...")
df_all = remove_outliers(df=df_all, column='age', min_val=15, max_val=90)
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df_all['age'] = df_all['age'].fillna(-1)
print("Fixing age column...completed")


Fixing age column...
Fixing age column...completed
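
To see what the outlier rule does to individual values, the same bounds can be applied to a handful of invented ages. This sketch uses `Series.where` as an alternative way of writing the rule; it is not the function from the cell above, but it keeps the same cut-offs (ages of exactly 15 or 90 are also treated as outliers).

```python
import numpy as np
import pandas as pd

# Invented ages: boundary values, plausible values, outliers and a missing value
ages = pd.Series([14.0, 15.0, 16.0, 42.0, 90.0, 105.0, 2014.0, np.nan])

# Keep only ages strictly between 15 and 90; everything else becomes NaN
cleaned = ages.where((ages > 15) & (ages < 90))

# Then mark every missing value, original or newly created, with -1
filled = cleaned.fillna(-1)
print(filled.tolist())
```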


  
  


## HW - There are several more ways to fill in the missing values in the age column; try to list them


## Identify and fill additional columns with missing values - Filling in missing values 
#### One such column is first_affiliate_tracked, which has missing values
#### Solution: follow the same procedure as above (change all the NaN values to -1)


In [None]:
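# Fill first_affiliate_tracked column
print("Filling first_affiliate_tracked column...")
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df_all['first_affiliate_tracked'] = df_all['first_affiliate_tracked'].fillna(-1)
print("Filling first_affiliate_tracked column...completed")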
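
The "identify" half of this step can be done by filtering the per-column missing counts. A minimal sketch on an invented frame (the `df` frame and its values are assumptions; on the real data it would be df_all):

```python
import numpy as np
import pandas as pd

# Invented frame with gaps in two of its three columns
df = pd.DataFrame({
    "age": [38.0, np.nan, 56.0],
    "first_affiliate_tracked": ["untracked", None, "untracked"],
    "language": ["en", "en", "en"],
})

# Count missing values per column, then keep only columns that have any
missing_per_column = df.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_missing)
```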

In [None]:
## Breaking Stage - Output
## What does the data look like after all these changes? 
## Sample of some rows from cleaned dataset
df_all.head(20)

## Is that all?
#### Not really - this is just a small part of the cleaning work 

### What Next?
### Aim: Focus on transforming the data and feature extraction
##### Why? To build a better prediction model.