This script takes data on iOS apps in the App Store and uses various features to predict their rating. The features and the target have been formatted as described below. We then train neural networks to produce the predictions.

We run a variety of neural network architectures. The first is a very simple network that applies only a linear transformation. The second is a dynamically sized one-layer network using a sigmoid activation function, and the third is a larger two-layer network using the same function. The last is similar to the one-layer network but uses the 'dropout' method to reduce overfitting.

# Initialization
---

In [1]:
# Import libraries needed for model

import os

import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F
import copy

import numpy as np
import pandas as pd

import matplotlib.pylab as plt
%matplotlib inline

from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn import preprocessing
import sklearn

from itertools import *




## Parameters

Change the following values as needed to alter the model's output or to run the notebook on a different machine.

In [2]:
# Root data directory
data_dir = "/processing-task"

# Input data
data_file = "IOS_app_data.csv"

# Set seed for random number generator
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)

# Set outcome to predict, either "ver" or "tot"
to_predict = "ver"

In [3]:
if(to_predict == "tot"):
    outcomeVal = 'user_rating'
if(to_predict == "ver"):
    outcomeVal = 'user_rating_ver'

# Data Preprocessing
---

In [4]:
# Load data 
dataset = pd.read_csv(os.path.join(data_dir, data_file), sep = ",")

**Check Data Shape and Values**

In [5]:
# Check shape of dataset
dataset.shape

(5000, 17)

In [6]:
# Check for missing values in dataset
dataset.isnull().values.any()

False

In [7]:
# Show different values of user ratings and their frequencies
dataset[outcomeVal].value_counts(dropna = False)

4.5    1660
4.0     871
0.0     856
5.0     694
3.5     378
3.0     199
2.5     129
2.0      85
1.0      79
1.5      49
Name: user_rating_ver, dtype: int64

In [8]:
# Show column names in dataset
dataset.columns

Index(['id', 'track_name', 'size_bytes', 'currency', 'price',
       'rating_count_tot', 'rating_count_ver', 'user_rating',
       'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
       'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic',
       'app_desc'],
      dtype='object')

## Feature Selection

In our attempt to determine rating we will use the following features:
* Free/Paid
* Number of Screenshots
* Number of Languages
* Number of Supporting Devices
* Length of Title
* Length of Description
* Content Rating
* Size
* Vpp Licensing
* Rating Count (for relevant rating)
* Genre (broken up by category)
* Currency (broken up by category)

To preserve the integrity of the measurement, we do not use the other rating as a feature: whatever influences the current version's rating also influences the total rating, and vice versa.

In [9]:
columns_to_keep = ['ipadSc_urls.num', 'lang.num','sup_devices.num', 'size_bytes', 'vpp_lic']

### Transform Outcomes

In [10]:
# Function for transforming ratings to preferred values
def rating_transformer(input_rating):
    if(0 <= input_rating <= 1.5):
        return -1
    elif(2.0 <= input_rating <= 3.5):
        return 0
    elif(4.0 <= input_rating <= 5.0):
        return 1
    

In [11]:
# Transform user_rating and user_rating_ver to preferred values.
# Instead of "high", "medium", and "low" we will code as 1, 0, and -1 respectively

dataset['user_rating'] = dataset['user_rating'].apply(rating_transformer)
dataset['user_rating_ver'] = dataset['user_rating_ver'].apply(rating_transformer)
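
As a quick check of the binning, the three ranges in `rating_transformer` can be applied to the raw `user_rating_ver` frequencies printed earlier. The sketch below is plain Python; `raw_counts` is copied from that output, and `bin_rating` is a hypothetical stand-in that mirrors the function above.

```python
from collections import Counter

# Frequencies of the raw user_rating_ver values, copied from the value_counts output
raw_counts = {4.5: 1660, 4.0: 871, 0.0: 856, 5.0: 694, 3.5: 378,
              3.0: 199, 2.5: 129, 2.0: 85, 1.0: 79, 1.5: 49}

def bin_rating(rating):
    """Mirror of rating_transformer: low -> -1, medium -> 0, high -> 1."""
    if 0 <= rating <= 1.5:
        return -1
    elif 2.0 <= rating <= 3.5:
        return 0
    elif 4.0 <= rating <= 5.0:
        return 1

# Sum the raw frequencies that fall into each bin
binned = Counter()
for rating, count in raw_counts.items():
    binned[bin_rating(rating)] += count

print(dict(binned))
```

This gives 3225 high, 791 medium and 984 low ratings. Apps with a raw rating of 0.0 land in the low bin and account for 856 of its 984 members.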

In [12]:
if(to_predict == "tot"):
    columns_to_keep += ['rating_count_tot', 'user_rating']
if(to_predict == "ver"):
    columns_to_keep += ['rating_count_ver', 'user_rating_ver']

### Transform Inputs

In [13]:
# Function for transforming price to paid/free binary
def price_transformer(input_price):
    if(0 < input_price):
        return 1
    else:
        return 0

In [14]:
dataset['price'] = dataset['price'].apply(price_transformer)
columns_to_keep += ['price']

In [15]:
dataset['track_name'] = dataset['track_name'].apply(lambda desc: len(desc))
columns_to_keep += ['track_name']

In [16]:
dataset['app_desc'] = dataset['app_desc'].apply(lambda desc: len(desc))
columns_to_keep += ['app_desc']

In [17]:
dataset['cont_rating'] = dataset['cont_rating'].apply(lambda r: int(''.join(x for x in r if x.isdigit())))
columns_to_keep += ['cont_rating']
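
The content-rating step keeps only the digit characters of each label and converts them to an integer. A minimal illustration, assuming labels of the form `'12+'` (the sample strings are assumptions, not values read from the data file, and `content_rating_to_int` is a hypothetical name for the lambda above):

```python
def content_rating_to_int(label):
    """Keep only the digit characters of a label and parse them as an integer."""
    return int(''.join(ch for ch in label if ch.isdigit()))

samples = ['4+', '9+', '12+', '17+']  # assumed example labels
parsed = [content_rating_to_int(s) for s in samples]
print(parsed)  # [4, 9, 12, 17]
```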

In [18]:
genre_dummies = pd.get_dummies(dataset['prime_genre'])
dataset = pd.concat([dataset, genre_dummies], axis=1)
columns_to_keep += np.ndarray.tolist(genre_dummies.columns.values)
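
`pd.get_dummies` creates one indicator column per distinct genre. The toy frame below is made up for illustration; `astype(int)` is added only so the result prints as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy data, not taken from the app dataset
toy = pd.DataFrame({'prime_genre': ['Games', 'Music', 'Games', 'Weather']})

# One indicator column per genre, in sorted order of the genre names
toy_dummies = pd.get_dummies(toy['prime_genre']).astype(int)
toy_encoded = pd.concat([toy, toy_dummies], axis=1)

print(list(toy_dummies.columns))
print(toy_encoded)
```

Each row has exactly one indicator set to 1, so the genre columns together carry the same information as the original `prime_genre` column.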

In [19]:
dataset = dataset.loc[:,columns_to_keep]

**Check Data Shape and Values**

In [20]:
# Show outcome values
dataset[outcomeVal].value_counts(dropna = False)

 1    3225
-1     984
 0     791
Name: user_rating_ver, dtype: int64

In [21]:
# Show shape of data
dataset.shape

(5000, 34)

In [22]:
# Show columns kept in dataset
columns_to_keep

['ipadSc_urls.num',
 'lang.num',
 'sup_devices.num',
 'size_bytes',
 'vpp_lic',
 'rating_count_ver',
 'user_rating_ver',
 'price',
 'track_name',
 'app_desc',
 'cont_rating',
 'Book',
 'Business',
 'Catalogs',
 'Education',
 'Entertainment',
 'Finance',
 'Food & Drink',
 'Games',
 'Health & Fitness',
 'Lifestyle',
 'Medical',
 'Music',
 'Navigation',
 'News',
 'Photo & Video',
 'Productivity',
 'Reference',
 'Shopping',
 'Social Networking',
 'Sports',
 'Travel',
 'Utilities',
 'Weather']

## Split Data into Training and Test Datasets

In [23]:
# Stratified split on outcome
# 25% testing
# 15% validation

split = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)

for train_index, test_index in split.split(dataset, dataset[outcomeVal]):
    strat_operating_set = dataset.iloc[train_index]
    strat_test_set = dataset.iloc[test_index]
    
split = StratifiedShuffleSplit(n_splits=1, test_size=0.20, random_state=42)

for train_index, test_index in split.split(strat_operating_set, strat_operating_set[outcomeVal]):
    strat_train_set = strat_operating_set.iloc[train_index]
    strat_val_set = strat_operating_set.iloc[test_index]

# Print shape of the three datasets
print(strat_train_set.shape)
print(strat_val_set.shape)
print(strat_test_set.shape)


(3000, 34)
(750, 34)
(1250, 34)
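
The two `test_size` values compose multiplicatively: 25% of the data is held out first, and 20% of the remaining 75% (15% of the full data) becomes the validation set. A short arithmetic check, using the 5000 rows reported earlier:

```python
n_total = 5000            # rows in the dataset
test_fraction = 0.25      # first split
val_fraction = 0.20       # second split, applied to what is left

n_test = round(n_total * test_fraction)
n_operating = n_total - n_test
n_val = round(n_operating * val_fraction)
n_train = n_operating - n_val

print(n_train, n_val, n_test)   # 3000 750 1250
print(n_val / n_total)          # 0.15
```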


## Format Data for Model

In [24]:
train_target = strat_train_set.loc[:,outcomeVal].values.reshape(-1, 1)
train_data = strat_train_set.drop(outcomeVal, axis=1).values
val_target = strat_val_set.loc[:,outcomeVal].values.reshape(-1, 1)
val_data = strat_val_set.drop(outcomeVal, axis=1).values
test_target = strat_test_set.loc[:,outcomeVal].values.reshape(-1, 1)
test_data = strat_test_set.drop(outcomeVal, axis=1).values

In [25]:
train_target.shape

(3000, 1)

## Standardize Data

In [26]:
train_data = preprocessing.scale(train_data)
val_data = preprocessing.scale(val_data)
test_data = preprocessing.scale(test_data)
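
The cell above scales each split with its own mean and standard deviation. A common alternative is to compute the statistics on the training split only and reuse them for the validation and test data, so that all three splits go through the same transformation. The sketch below does this with NumPy on made-up arrays; `fit_scaler` and `apply_scaler` are hypothetical helper names, and scikit-learn's `StandardScaler` offers the same fit/transform pattern.

```python
import numpy as np

def fit_scaler(train):
    """Column means and standard deviations of the training split."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return mean, std

def apply_scaler(data, mean, std):
    """Standardize any split with statistics taken from the training split."""
    return (data - mean) / std

# Made-up data standing in for train_data and test_data
rng = np.random.RandomState(42)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
toy_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

mean, std = fit_scaler(toy_train)
toy_train_scaled = apply_scaler(toy_train, mean, std)
toy_test_scaled = apply_scaler(toy_test, mean, std)

print(toy_train_scaled.mean(axis=0).round(6))  # approximately zero per column
print(toy_test_scaled.mean(axis=0).round(6))   # near zero, but generally not exactly
```

Switching the notebook to this scheme would change the inputs the networks see, so the accuracies printed later would not necessarily be reproduced.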



# Neural Network Modelling
---

In [27]:
features = train_data.shape[1]

## Define Function to Train Networks

In [28]:
def train_model(model, loss_func, optimizer, num_epochs=25):
    
    best_model = model
    best_acc = 0.0

    for epoch in range(num_epochs):
        
#         if(epoch%100 == 0):
#             print()
#             print('Epoch {}/{}'.format(epoch, num_epochs - 1))
#             print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                inputs = Variable(torch.Tensor(train_data))
                labels = Variable(torch.Tensor(train_target))
                model.train(True)  # Set model to training mode
                
            else:
                inputs = Variable(torch.Tensor(val_data))
                labels = Variable(torch.Tensor(val_target))
                model.train(False)  # Set model to evaluate mode

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward
            outputs = model(inputs)
            loss = loss_func(outputs, labels)

            # backward + optimize only if in training phase
            if phase == 'train':
                loss.backward()
                optimizer.step()

            # statistics
            preds = model(Variable(torch.Tensor(inputs))).data.numpy()
            preds = np.squeeze(preds)
            corrects = np.count_nonzero(np.abs(np.rint(preds)) == np.int32(labels.data.numpy().ravel()))

            epoch_acc = corrects / float(inputs.shape[0])
            
            # Plot epoch accuracy
            
#             if(epoch%100 == 0):
#                 print('{} Acc: {:.4f}'.format(phase, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model = copy.deepcopy(model)


    # Best prediction rate on validation set throughout training
    print('Best val Acc: {:4f}'.format(best_acc))
    
    # Accuracy on the holdout set: each output is rounded to the nearest integer
    # and its absolute value is compared with the label
    best_model.eval()
    pred = best_model(Variable(torch.Tensor(test_data))).data.numpy()
    pred = np.squeeze(pred)
    print('Final holdout Acc: {:4f}'
          .format(np.count_nonzero(np.abs(np.rint(pred)) == np.int32(test_target.ravel())) / float(test_data.shape[0])))
    
    print('-' * 30)
    print()
    
    return best_model
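
One detail of the accuracy calculation in `train_model` is worth illustrating: `np.abs(np.rint(preds))` is never negative, so an output near -1 is counted as a prediction of 1, and a label of -1 can never be matched. The sketch below contrasts that with a variant that rounds and clips to the three label values. `abs_round_accuracy` and `clipped_round_accuracy` are hypothetical names, and the arrays are made up.

```python
import numpy as np

def abs_round_accuracy(preds, labels):
    """Accuracy as computed in train_model: round, take the absolute value, compare."""
    return np.count_nonzero(np.abs(np.rint(preds)) == labels) / float(len(labels))

def clipped_round_accuracy(preds, labels):
    """Variant that keeps the sign: round, clip to [-1, 1], compare."""
    return np.count_nonzero(np.clip(np.rint(preds), -1, 1) == labels) / float(len(labels))

# Made-up outputs that are all close to their labels
toy_preds = np.array([-0.9, -1.2, 0.1, 1.2])
toy_labels = np.array([-1, -1, 0, 1])

print(abs_round_accuracy(toy_preds, toy_labels))      # 0.5
print(clipped_round_accuracy(toy_preds, toy_labels))  # 1.0
```

In this toy case every output is close to its label, yet the absolute-value version scores only the two non-negative labels as correct.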

## Build Networks and Output Final Prediction

In [29]:
#Create instances of models
toynet = nn.Linear(features, 1)

onelayernet = nn.Sequential(
    nn.Linear(features, features//2), 
    nn.Sigmoid(), 
    nn.Linear(features//2, 1)
)

twolayernet = nn.Sequential(
    nn.Linear(features, features//2),
    nn.Sigmoid(), 
    nn.Linear(features//2, features//6), 
    nn.Sigmoid(), 
    nn.Linear(features//6, 1)
)

# One layer neural network with a dropout layer
dropoutnet = nn.Sequential(
    nn.Linear(features, features//2),
    nn.Dropout(0.5),  # randomly zero 50% of the preceding layer's outputs during training
    nn.Sigmoid(),
    nn.Linear(features//2, 1)    
)

nn_models = [(toynet, 'simple linear transformation'),
             (onelayernet , 'dynamically sized one-layer network'),
             (twolayernet, 'dynamically sized large two-layer network'),
             (dropoutnet, 'one-layer network with a dropout feature')]
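
With 34 columns in the dataset and one of them the outcome, `features` is 33, so the 'dynamic' layer widths are `33 // 2 = 16` and `33 // 6 = 5`. The plain-Python sketch below tallies the trainable parameters of each architecture; `linear_params` is a hypothetical helper that counts the weights and biases of one `nn.Linear` layer.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer with n_in inputs and n_out outputs."""
    return n_in * n_out + n_out

features = 34 - 1  # dataset columns minus the outcome column

# (inputs, outputs) of every linear layer in each network
layer_shapes = {
    'toynet': [(features, 1)],
    'onelayernet': [(features, features // 2), (features // 2, 1)],
    'twolayernet': [(features, features // 2),
                    (features // 2, features // 6),
                    (features // 6, 1)],
    'dropoutnet': [(features, features // 2), (features // 2, 1)],  # dropout adds no parameters
}

param_counts = {name: sum(linear_params(i, o) for i, o in shapes)
                for name, shapes in layer_shapes.items()}
print(param_counts)
```

The counts come to 34, 561, 635 and 561 parameters respectively, so even the largest network is small relative to the 3000 training rows.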

In [30]:
# Select optimization criterion for all models
# We use gradient descent with momentum (torch.optim.SGD) and an MSE loss function
loss_func = torch.nn.MSELoss()  
l_rate = 0.02 #learning rate 
mom = 0.9 #momentum
num_epochs = 6000

for model, m_name in nn_models:
    print(m_name)
    
    optimizer = torch.optim.SGD(model.parameters(), lr = l_rate, momentum = mom)
    model = train_model(model, loss_func, optimizer, num_epochs)

simple linear transformation
Best val Acc: 0.560000
Final holdout Acc: 0.576800
------------------------------

dynamically sized one-layer network
Best val Acc: 0.645333
Final holdout Acc: 0.644800
------------------------------

dynamically sized large two-layer network
Best val Acc: 0.645333
Final holdout Acc: 0.644800
------------------------------

one-layer network with a dropout feature
Best val Acc: 0.645333
Final holdout Acc: 0.644800
------------------------------
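
For context, the three nonlinear networks report identical accuracies, and those values match what a constant prediction of the most frequent class would score. The check below is only arithmetic on the counts printed earlier. It assumes the stratified splits keep the class share to the nearest whole row, which is an assumption rather than something the notebook prints.

```python
# Counts taken from the outputs above
n_total, n_high = 5000, 3225   # all rows, rows with transformed rating 1
n_val, n_test = 750, 1250      # validation and holdout sizes

# Assumed number of high-rated apps in each stratified split (nearest whole row)
high_in_val = round(n_high * n_val / n_total)
high_in_test = round(n_high * n_test / n_total)

print('{:.6f}'.format(high_in_val / n_val))    # 0.645333
print('{:.6f}'.format(high_in_test / n_test))  # 0.644800
```

Agreement with these figures suggests, but does not prove, that the networks have collapsed to predicting the majority class.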

