## Google Play Store Apps Data Analysis

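This notebook uses PySpark to load the Google Play Store apps dataset, drop the columns we do not need, clean the remaining ones, and aggregate ratings per app.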

### Set up your Spark session

In [151]:
# Import SparkSession from the pyspark library
from pyspark.sql import SparkSession

# This will initialize our spark session
# which is the main entry point for our
# spark application

# The appName() method specifies the name of our application
# The getOrCreate() method will try to get
# an existing session, and if there is no existing session,
# creates a new one
spark = SparkSession.builder \
    .appName('GooglePlayStoreAppsAnalysis') \
    .getOrCreate()

### Load the dataset

The code below does the following:

- Creates a variable named **path** and assigns it the location of the file to be loaded
- Creates a variable named **options** and assigns it the option(s) to use when reading the file
- Creates a **DataFrame** from the file

In [278]:
# Specifies the location of the file
# we are going to read
path = 'data/googleplaystore.csv'

# Specifies the input options to be applied
# when reading the file
options = {
    'delimiter': ',',
    'header': True
}

# Creates a dataframe
df = spark.read \
    .options(**options) \
    .format('csv') \
    .load(path)

### View the data

***Tip**: show() displays the first 20 rows by default. Pass a number to change that (e.g. df.show(10)).*

In [279]:
df.show()

+--------------------+--------------+------+-------+----+-----------+----+-----+--------------+--------------------+------------------+------------------+------------+
|                 App|      Category|Rating|Reviews|Size|   Installs|Type|Price|Content Rating|              Genres|      Last Updated|       Current Ver| Android Ver|
+--------------------+--------------+------+-------+----+-----------+----+-----+--------------+--------------------+------------------+------------------+------------+
|Photo Editor & Ca...|ART_AND_DESIGN|   4.1|    159| 19M|    10,000+|Free|    0|      Everyone|        Art & Design|   January 7, 2018|             1.0.0|4.0.3 and up|
| Coloring book moana|ART_AND_DESIGN|   3.9|    967| 14M|   500,000+|Free|    0|      Everyone|Art & Design;Pret...|  January 15, 2018|             2.0.0|4.0.3 and up|
|U Launcher Lite –...|ART_AND_DESIGN|   4.7|  87510|8.7M| 5,000,000+|Free|    0|      Everyone|        Art & Design|    August 1, 2018|             1.2.4|4.0.3 

### Count the total number of rows

In [280]:
# Returns the total number of rows 
# your DataFrame has
df.count()

10841

### Remove unnecessary and irrelevant columns

This step keeps only the columns that are relevant to our analysis. Fewer columns make the output easier to read and can reduce the amount of data Spark has to carry through the later steps.

In [281]:
# Drops the listed column(s) from our dataframe
# and returns a new dataframe without them
df = df.drop(
    'Size',
    'Installs',
    'Type',
    'Price',
    'Content Rating',
    'Last Updated',
    'Current Ver',
    'Android Ver'
)

df.show()

+--------------------+--------------+------+-------+--------------------+
|                 App|      Category|Rating|Reviews|              Genres|
+--------------------+--------------+------+-------+--------------------+
|Photo Editor & Ca...|ART_AND_DESIGN|   4.1|    159|        Art & Design|
| Coloring book moana|ART_AND_DESIGN|   3.9|    967|Art & Design;Pret...|
|U Launcher Lite –...|ART_AND_DESIGN|   4.7|  87510|        Art & Design|
|Sketch - Draw & P...|ART_AND_DESIGN|   4.5| 215644|        Art & Design|
|Pixel Draw - Numb...|ART_AND_DESIGN|   4.3|    967|Art & Design;Crea...|
|Paper flowers ins...|ART_AND_DESIGN|   4.4|    167|        Art & Design|
|Smoke Effect Phot...|ART_AND_DESIGN|   3.8|    178|        Art & Design|
|    Infinite Painter|ART_AND_DESIGN|   4.1|  36815|        Art & Design|
|Garden Coloring Book|ART_AND_DESIGN|   4.4|  13791|        Art & Design|
|Kids Paint Free -...|ART_AND_DESIGN|   4.7|    121|Art & Design;Crea...|
|Text on Photo - F...|ART_AND_DESIGN| 

### Perform data cleansing

#### Renaming the columns

In [282]:
# Display column(s)
df.columns

['App', 'Category', 'Rating', 'Reviews', 'Genres']

In [283]:
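from pyspark.sql.types import StringType

# Registers a UDF that lowercases a string value (None is passed through).
# The renaming cell below does not use it.
to_lower = spark.udf.register('to_lower', lambda x: x.lower() if x is not None else None, StringType())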

In [284]:
col_renamed_df = df.withColumnRenamed('App', 'app') \
    .withColumnRenamed('Category', 'category') \
    .withColumnRenamed('Rating', 'rating') \
    .withColumnRenamed('Reviews', 'reviews') \
    .withColumnRenamed('Genres', 'genres')

In [285]:
col_renamed_df.columns

['app', 'category', 'rating', 'reviews', 'genres']

#### Replace rating NaN/NULL with appropriate values

In [301]:
# Shows all the rows that contain a NaN rating
col_renamed_df.filter('rating = "NaN"').show()

+---------------------+-------------------+------+-------+--------------------+
|                  app|           category|rating|reviews|              genres|
+---------------------+-------------------+------+-------+--------------------+
| Mcqueen Coloring ...|     ART_AND_DESIGN|   NaN|     61|Art & Design;Acti...|
| Wrinkles and reju...|             BEAUTY|   NaN|    182|              Beauty|
| Manicure - nail d...|             BEAUTY|   NaN|    119|              Beauty|
| Skin Care and Nat...|             BEAUTY|   NaN|    654|              Beauty|
| Secrets of beauty...|             BEAUTY|   NaN|     77|              Beauty|
| Recipes and tips ...|             BEAUTY|   NaN|     35|              Beauty|
| Lady adviser (bea...|             BEAUTY|   NaN|     30|              Beauty|
| Anonymous caller ...|BOOKS_AND_REFERENCE|   NaN|    161|   Books & Reference|
| SH-02J Owner's Ma...|BOOKS_AND_REFERENCE|   NaN|      2|   Books & Reference|
| URBANO V 02 instr...|BOOKS_AND_REFEREN

In [319]:
# Replaces the string 'NaN' with '0.0' in the rating column only
nan_filled_df = col_renamed_df.replace(['NaN'], ['0.0'], 'rating')

In [320]:
# Checks that an app that had a NaN rating now shows 0.0
nan_filled_df.filter('app like "Mcqueen Coloring%"').show()

+--------------------+--------------+------+-------+--------------------+
|                 app|      category|rating|reviews|              genres|
+--------------------+--------------+------+-------+--------------------+
|Mcqueen Coloring ...|ART_AND_DESIGN|   0.0|     61|Art & Design;Acti...|
|Mcqueen Coloring ...|        FAMILY|   0.0|     65|Art & Design;Acti...|
+--------------------+--------------+------+-------+--------------------+



In [321]:
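# The original df still contains NaN: replace() returned
# a new dataframe and left the original one unchanged
df.filter('rating = "NaN"').show()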

+---------------------+-------------------+------+-------+--------------------+
|                  App|           Category|Rating|Reviews|              Genres|
+---------------------+-------------------+------+-------+--------------------+
| Mcqueen Coloring ...|     ART_AND_DESIGN|   NaN|     61|Art & Design;Acti...|
| Wrinkles and reju...|             BEAUTY|   NaN|    182|              Beauty|
| Manicure - nail d...|             BEAUTY|   NaN|    119|              Beauty|
| Skin Care and Nat...|             BEAUTY|   NaN|    654|              Beauty|
| Secrets of beauty...|             BEAUTY|   NaN|     77|              Beauty|
| Recipes and tips ...|             BEAUTY|   NaN|     35|              Beauty|
| Lady adviser (bea...|             BEAUTY|   NaN|     30|              Beauty|
| Anonymous caller ...|BOOKS_AND_REFERENCE|   NaN|    161|   Books & Reference|
| SH-02J Owner's Ma...|BOOKS_AND_REFERENCE|   NaN|      2|   Books & Reference|
| URBANO V 02 instr...|BOOKS_AND_REFEREN

In [322]:
nan_filled_df.show()

+--------------------+--------------+------+-------+--------------------+
|                 app|      category|rating|reviews|              genres|
+--------------------+--------------+------+-------+--------------------+
|Photo Editor & Ca...|ART_AND_DESIGN|   4.1|    159|        Art & Design|
| Coloring book moana|ART_AND_DESIGN|   3.9|    967|Art & Design;Pret...|
|U Launcher Lite –...|ART_AND_DESIGN|   4.7|  87510|        Art & Design|
|Sketch - Draw & P...|ART_AND_DESIGN|   4.5| 215644|        Art & Design|
|Pixel Draw - Numb...|ART_AND_DESIGN|   4.3|    967|Art & Design;Crea...|
|Paper flowers ins...|ART_AND_DESIGN|   4.4|    167|        Art & Design|
|Smoke Effect Phot...|ART_AND_DESIGN|   3.8|    178|        Art & Design|
|    Infinite Painter|ART_AND_DESIGN|   4.1|  36815|        Art & Design|
|Garden Coloring Book|ART_AND_DESIGN|   4.4|  13791|        Art & Design|
|Kids Paint Free -...|ART_AND_DESIGN|   4.7|    121|Art & Design;Crea...|
|Text on Photo - F...|ART_AND_DESIGN| 

#### Enforcing appropriate data types

In [323]:
nan_filled_df.printSchema()

root
 |-- app: string (nullable = true)
 |-- category: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- reviews: string (nullable = true)
 |-- genres: string (nullable = true)



In [330]:
from pyspark.sql.types import StringType, DecimalType, IntegerType
import pyspark.sql.functions as func

# Casts each column to its intended type;
# rating and reviews were read as strings
type_casted_df = nan_filled_df.select(
    func.col('app').cast(StringType()),
    func.col('category').cast(StringType()),
    func.col('rating').cast(DecimalType(10, 2)),
    func.col('reviews').cast(IntegerType()),
    func.col('genres').cast(StringType())
)

In [331]:
type_casted_df.printSchema()

root
 |-- app: string (nullable = true)
 |-- category: string (nullable = true)
 |-- rating: decimal(10,2) (nullable = true)
 |-- reviews: integer (nullable = true)
 |-- genres: string (nullable = true)



In [332]:
type_casted_df.filter('app like "Mcqueen Coloring%"').show()

+--------------------+--------------+------+-------+--------------------+
|                 app|      category|rating|reviews|              genres|
+--------------------+--------------+------+-------+--------------------+
|Mcqueen Coloring ...|ART_AND_DESIGN|  0.00|     61|Art & Design;Acti...|
|Mcqueen Coloring ...|        FAMILY|  0.00|     65|Art & Design;Acti...|
+--------------------+--------------+------+-------+--------------------+



In [357]:
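# Sums the rating per app name and shows the 10 largest totals;
# an app that appears on several rows accumulates a larger sum
type_casted_df.groupBy('app') \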
    .sum('rating') \
    .sort(func.desc('sum(rating)')) \
    .limit(10) \
    .show(truncate=False)

+-------------------------------------------------+-----------+
|app                                              |sum(rating)|
+-------------------------------------------------+-----------+
|ROBLOX                                           |40.50      |
|CBS Sports App - Scores, News, Stats & Watch Live|34.40      |
|Duolingo: Learn Languages Free                   |32.90      |
|8 Ball Pool                                      |31.50      |
|Candy Crush Saga                                 |30.80      |
|ESPN                                             |29.40      |
|Bowmasters                                       |28.20      |
|Zombie Catchers                                  |28.20      |
|Sniper 3D Gun Shooter: Free Shooting Games - FPS |27.60      |
|Subway Surfers                                   |27.00      |
+-------------------------------------------------+-----------+

