# Google Play Store Data

We will predict the rating of a Google Play Store app as a function of its attributes. The data comes from Kaggle: https://www.kaggle.com/lava18/google-play-store-apps

Our first step will be to import all of the modules we need, and then load the data.

In [2]:
import Pkg
Pkg.add("CSV")
Pkg.add("Plots")
Pkg.add("DataFrames")
Pkg.add("Statistics")

using Random
Random.seed!(13)

using CSV
using Plots
using DataFrames
using Statistics
using LinearAlgebra

[32m[1m   Updating[22m[39m registry at `~/.julia/registries/General`
######################################################################### 100.0%
[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Project.toml`
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Manifest.toml`
[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Project.toml`
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Manifest.toml`
[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Project.toml`
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Manifest.toml`
[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Project.toml`
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Manifest.toml`


In [45]:
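# Recent CSV.jl versions require the sink type (DataFrame) as the second argument
df = CSV.read("googleplaystore.csv", DataFrame)
# names! has been removed from DataFrames.jl; rename! takes the full vector of new names
rename!(df, Symbol.(replace.(string.(names(df)), Ref(r"\s"=>"")))) # remove whitespace from column names
feature_names = names(df)
# Print the index, name, and element type of every column
for (i, name) in enumerate(feature_names)
    println(string(i), "\t", string(name), "\t\t\t", string(eltype(df[!, i])))
end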

1	App			String
2	Category			String
3	Rating			Float64
4	Reviews			String
5	Size			String
6	Installs			String
7	Type			String
8	Price			String
9	ContentRating			Union{Missing, String}
10	Genres			String
11	LastUpdated			String
12	CurrentVer			Union{Missing, String}
13	AndroidVer			Union{Missing, String}


## Clean Data

In [46]:
#Clean Ratings column 

new_ratings = Float64[]
for idx=1:size(df,1)
    if !isnan(df[idx, :Rating])
        push!(new_ratings, df[idx, :Rating])
    end
end 
        
for idx=1:size(df,1)
    if isnan(df[idx, :Rating])
        df[idx, :Rating]=round(mean(new_ratings), digits=1)   #set NaN Rating values equal to the mean 
    end
end
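
The same mean imputation can be written without explicit loops. The following is a minimal sketch on a toy vector rather than the notebook's data; the names `ratings` and `fill_value` and the sample values are illustrative assumptions.

```julia
using Statistics

# Toy ratings vector standing in for df.Rating (illustrative values only)
ratings = [4.0, NaN, 5.0, NaN, 3.0]

# Mean of the non-NaN entries, rounded to one decimal place like the loop above
fill_value = round(mean(filter(!isnan, ratings)), digits=1)

# Replace every NaN in place with the rounded mean
ratings[isnan.(ratings)] .= fill_value
```

Computing the mean before any replacement matters: filling values while averaging would let imputed entries influence the mean.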

In [51]:
#Clean Size column

"""
Convert a string to a floating point value.
Strings that cannot be parsed as a number (like "NA") are converted to 0.0.
"""
function string_to_float(str)
    try
        parse(Float64, str)
    catch
        0.0
    end
end

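for idx=1:size(df,1)
    # Sizes in megabytes: drop the trailing 'M' and keep the number
    df[idx, :Size] = rstrip(df[idx, :Size], 'M')
    # Sizes in kilobytes: drop the trailing 'k' and convert to megabytes
    if rstrip(df[idx, :Size], 'k') != df[idx, :Size]
        df[idx, :Size] = string((round(string_to_float(rstrip(df[idx, :Size], 'k'))/1000, digits=1)))
    end
    # Entries starting with 'V' ("Varies with device") have no numeric size
    if lstrip(df[idx, :Size], 'V') != df[idx, :Size]
        df[idx, :Size] = ""
    end
end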
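
As a sketch of an alternative, the size conversion can be collected into one function that returns a number directly instead of a cleaned string. The helper `size_in_mb` is hypothetical, the sample strings are illustrative, and returning `missing` for unparseable sizes is a different choice from the empty string used above.

```julia
# Hypothetical helper: convert a size string to megabytes.
# Returns `missing` for anything that is not "<number>M" or "<number>k".
function size_in_mb(s::AbstractString)
    if endswith(s, "M")
        value = tryparse(Float64, chop(s))   # chop drops the unit character
        return value === nothing ? missing : value
    elseif endswith(s, "k")
        value = tryparse(Float64, chop(s))
        # Same kilobyte-to-megabyte rule as the loop above: divide by 1000, one decimal
        return value === nothing ? missing : round(value / 1000, digits=1)
    else
        return missing
    end
end

# Illustrative inputs, not rows taken from the dataset
sizes = size_in_mb.(["19M", "201k", "Varies with device"])
```

Using `tryparse` avoids the `try`/`catch` block, and `missing` keeps unknown sizes distinguishable from a genuine size of zero.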

In [53]:
#Clean Installs column

for idx=1:size(df,1)
    df[idx, :Installs] = rstrip(df[idx, :Installs], '+')  # strip the trailing '+' from the install count
end
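
Stripping the `+` still leaves a string. A minimal sketch of converting install counts to integers follows, assuming the values look like `10,000+` with comma thousands separators; the helper `installs_to_int` and the sample inputs are hypothetical.

```julia
# Hypothetical helper: turn an install-count string such as "10,000+" into an Int.
# Returns `missing` when the cleaned string is not an integer.
function installs_to_int(s::AbstractString)
    cleaned = replace(rstrip(s, '+'), "," => "")   # drop the '+' and the commas
    value = tryparse(Int, cleaned)
    return value === nothing ? missing : value
end

# Illustrative inputs, not rows taken from the dataset
installs = installs_to_int.(["10,000+", "500+", "Free"])
```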

## Train/Test Split

To make the 80/20 train/test split, we shuffle the data and then select the first 80% as the training data, with the remaining 20% held out as the test set.

In [57]:
feature_names = names(df)

df = df[shuffle(1:end), :] # we shuffle the data so that our train/test split will be truly random

train_proportion = 0.8
n = size(df, 1)
println("Size of dataset: ", string(n))

# Put the first ntrain observations in the DataFrame df into the training set, and the rest into the test set
ntrain = convert(Int, round(train_proportion*n))
ntest = n-ntrain

target = df[:, :Rating]
data = df[:, Not(:Rating)] # every column except the target; works whether names(df) returns Symbols or Strings

# the following variable records the features of examples in the training set
train_x = data[1:ntrain,:]
# the following variable records the features of examples in the test set
test_x = data[ntrain+1:n,:]
# the following variable records the labels of examples in the training set
train_y = target[1:ntrain,:]
# the following variable records the labels of examples in the test set
test_y = target[ntrain+1:n,:]

Size of dataset: 10841


2168×1 Array{Float64,2}:
 4.6
 4.6
 4.0
 5.0
 4.2
 4.3
 4.7
 3.8
 5.0
 4.3
 4.2
 3.9
 4.7
 ⋮
 4.1
 4.1
 3.9
 4.3
 4.3
 4.8
 4.2
 4.4
 4.3
 4.5
 4.6
 4.5
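
The 2168 rows in the test labels above follow from the rounding used for `ntrain`. A quick check of the arithmetic, using the dataset size printed by the cell:

```julia
n = 10841                                   # size of dataset printed above
train_proportion = 0.8
ntrain = round(Int, train_proportion * n)   # 8672.8 rounds to 8673
ntest = n - ntrain                          # 10841 - 8673 = 2168
println("ntrain = ", ntrain, ", ntest = ", ntest)
```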