# 1. <a id='toc1_'></a>[AppStore Exploratory Data Analysis](#toc0_)
As of 2022, Apple's App Store was home to some 1.76 million apps and over 460,000 games. For this effort, app, rating, and review data were obtained from the Apple [App Store](https://www.apple.com/app-store/) for the following nine search terms:
1. business
2. education
3. entertainment
4. health
5. lifestyle
6. medical
7. productivity
9. social_networking

Three datasets comprise the App Store data collection: 
- **AppData**: the core dataset containing app name, description, category, the number of ratings, and average ratings;
- **Rating**: rating histogram and review count data, used to prioritize the targeting and collection of review data; and
- **Review**: customer reviews of selected apps available in the Apple App Store.

We kick off the exploratory data analysis with an examination of the AppData and Rating datasets. With this foundation, an exploratory text analysis of the Review dataset will give a more nuanced picture of the voice of the mobile app customer: their satisfaction, their sentiment, and their needs, both met and unmet. After some dependency housekeeping, the remainder of this section is organized as follows.

**Table of contents**<a id='toc0_'></a>    
- 1. [AppStore Exploratory Data Analysis](#toc1_)    
  - 1.1. [AppData](#toc1_1_)    
    - 1.1.1. [AppData Overview](#toc1_1_1_)    
    - 1.1.2. [AppData Prep](#toc1_1_2_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

**Dependencies** 

In [1]:
import os

import pandas as pd
from IPython.display import HTML

from aimobile.container import AIMobileContainer
from aimobile.data.analysis.eda import EDA

container = AIMobileContainer()
container.init_resources()
container.wire(packages=["aimobile.data.acquisition.appstore"])


<a id='appdata'></a>

## 1.1. <a id='toc1_1_'></a>[AppData](#toc0_)
AppData comprises the core descriptive and aggregate rating data for each app, as follows:

| #  | attribute     | type  | description                                  | API Field         |
|----|---------------|-------|----------------------------------------------|-------------------|
| 1  | id:           | int   | Unique Apple App Identifier                  | trackId           |
| 2  | name:         | str   | Name of the app.                             | trackName         |
| 3  | description:  | str   | Description                                  | description       |
| 4  | category_id:  | int   | Four digit category identifier               | primaryGenreId    |
| 5  | category:     | str   | Category name                                | primaryGenreName  |
| 6  | price:        | float | Cost of the app                              | price             |
| 7  | rating:       | float | The user average rating                      | averageUserRating |
| 8  | ratings:      | int   | The rating count                             | userRatingCount   |
| 9  | developer_id: | int   | The app developer identifier                 | artistId          |
| 10 | developer:    | str   | The app developer name                       | artistName        |
| 11 | released:     | str   | The date of initial release                  | releaseDate       |
| 12 | source:       | str   | The host from which the data were obtained.  | itunes.apple.com  |



### 1.1.1. <a id='toc1_1_1_'></a>[AppData Overview](#toc0_)
Let's instantiate an EDA object with the appdata from the appdata repository, and get a sense of the overall profile of the data.

In [3]:
uow = container.data.uow()
appdata = uow.appdata_repo.getall()
appdata_eda = EDA(data=appdata)
appdata_eda.overview

| Metric                 | Value        |
|------------------------|--------------|
| Number of Variables    | 12.0         |
| Number of Observations | 513183.0     |
| Number of Cells        | 6158196.0    |
| Missing Cells          | 0.0          |
| Missing Cells (%)      | 0.0          |
| Duplicate Rows         | 0.0          |
| Duplicate Rows (%)     | 0.0          |
| Size (Bytes)           | 1576810257.0 |
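
The `EDA` class is project-specific, so its internals are not shown here. As a rough sketch of how figures like those above could be derived with plain pandas, the hypothetical `overview` helper below computes the same kinds of metrics on a toy frame. The function name, the toy data, and the choice to express the percentage rows on a 0 to 100 scale are assumptions, not the `EDA` implementation.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: profile a DataFrame with overview-style metrics."""
    n_rows, n_cols = df.shape
    n_cells = n_rows * n_cols                      # observations x variables
    missing = int(df.isna().sum().sum())           # NaN/None cells across all columns
    duplicates = int(df.duplicated().sum())        # rows identical to an earlier row
    return pd.Series(
        {
            "Number of Variables": n_cols,
            "Number of Observations": n_rows,
            "Number of Cells": n_cells,
            "Missing Cells": missing,
            "Missing Cells (%)": 100 * missing / n_cells if n_cells else 0.0,
            "Duplicate Rows": duplicates,
            "Duplicate Rows (%)": 100 * duplicates / n_rows if n_rows else 0.0,
            "Size (Bytes)": int(df.memory_usage(deep=True).sum()),
        }
    )


# Toy data (illustrative only): the last two rows are exact duplicates.
toy = pd.DataFrame(
    {"id": [1, 2, 2], "name": ["a", "b", "b"], "price": [0.0, 1.99, 1.99]}
)
profile = overview(toy)
print(profile)
```

Note that "Duplicate Rows" here counts rows that match on every column, so a dataset can report zero duplicate rows while still containing repeated ids.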


The AppData contains 513,183 observations described by 12 variables, for a total of just over 6.1 million data cells. Let's examine the variables and their data types, validity, and cardinality.

In [4]:
appdata_eda.summary

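|   | Column       | Dtype   | Valid  | Missing | Validity | Unique | Cardinality | Size       |
|---|--------------|---------|--------|---------|----------|--------|-------------|------------|
| 0 | id           | int64   | 513183 | 0       | 1.0      | 461878 | 0.9         | 4105464    |
| 1 | name         | object  | 513183 | 0       | 1.0      | 461358 | 0.9         | 43714521   |
| 2 | description  | object  | 513183 | 0       | 1.0      | 451349 | 0.88        | 1356704636 |
| 3 | category_id  | int64   | 513183 | 0       | 1.0      | 26     | 0.0         | 4105464    |
| 4 | category     | object  | 513183 | 0       | 1.0      | 26     | 0.0         | 34058664   |
| 5 | price        | float64 | 513183 | 0       | 1.0      | 116    | 0.0         | 4105464    |
| 6 | developer_id | int64   | 513183 | 0       | 1.0      | 258212 | 0.5         | 4105464    |
| 7 | developer    | object  | 513183 | 0       | 1.0      | 257297 | 0.5         | 39955896   |
| 8 | rating       | float64 | 513183 | 0       | 1.0      | 52917  | 0.1         | 4105464    |
| 9 | ratings      | int64   | 513183 | 0       | 1.0      | 20026  | 0.04        | 4105464    |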
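
The Validity and Cardinality columns above are consistent with simple ratios: valid values over total rows, and unique values over total rows, rounded to two decimals (for example, 461,878 / 513,183 ≈ 0.90 for `id`). The sketch below reproduces that arithmetic with plain pandas on toy data. The `summarize` function is a hypothetical stand-in, not the `EDA.summary` implementation.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column validity and cardinality ratios."""
    n = len(df)
    valid = df.notna().sum()        # non-missing values per column
    unique = df.nunique()           # distinct non-missing values per column
    return pd.DataFrame(
        {
            "Dtype": df.dtypes.astype(str),
            "Valid": valid,
            "Missing": n - valid,
            "Validity": (valid / n).round(2),
            "Unique": unique,
            "Cardinality": (unique / n).round(2),
            "Size": df.memory_usage(deep=True, index=False),
        }
    )


# Toy data (illustrative only): one id is repeated, two categories overall.
toy = pd.DataFrame(
    {
        "id": [10, 11, 11, 12],
        "category": ["Health", "Health", "Health", "Medical"],
    }
)
summary = summarize(toy)
print(summary)
```

A cardinality near 1.0 marks a near-unique identifier, while a value near 0.0 marks a variable with few distinct levels, such as `category`.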


### 1.1.2. <a id='toc1_1_2_'></a>[AppData Prep](#toc0_)
The AppData summary yields several observations as we prepare for the univariate analysis:

1. Data validity is 100%, indicating no missing data.
2. The cardinality of the id, name, and description variables is below 1.0, suggesting some degree of duplication in these variables.
3. Similarly, developer and developer_id have different unique value counts (257,297 vs. 258,212), hinting at data quality or cleaning issues.
4. Our nine search terms returned apps across 26 categories.
5. Category id and category share the same cardinality.
6. Source has a cardinality of 1 and can be ignored.
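
One way to follow up on the mismatch between developer ids and developer names is to count distinct names per id. The sketch below uses plain pandas on toy data. The `labels_per_key` helper and the sample values are hypothetical, and the same check could be pointed at `category_id` and `category` to confirm that they map one-to-one.

```python
import pandas as pd


def labels_per_key(df: pd.DataFrame, key: str, label: str) -> pd.Series:
    """Hypothetical helper: keys that map to more than one distinct label."""
    counts = df.groupby(key)[label].nunique()
    return counts[counts > 1]


# Toy data (illustrative only): developer 1 appears under two spellings.
toy = pd.DataFrame(
    {
        "developer_id": [1, 1, 2, 3],
        "developer": ["Acme Inc", "Acme Inc.", "Beta LLC", "Gamma"],
    }
)
conflicts = labels_per_key(toy, key="developer_id", label="developer")
print(conflicts)
```

An empty result would indicate a clean one-to-one mapping. Any rows returned point to ids whose names need standardizing during cleaning.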

Before exploring further, it is essential that each variable has an appropriate data type. To that end, the following variables will be converted to categorical:

- id
- name
- category_id
- category 
- developer_id
- developer

The description variable will be converted to pandas 'string' dtype.

In [5]:
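# Identifier-like and label variables are cast to pandas 'category'.
category_vars = ['id', 'name', 'category_id', 'category', 'developer_id', 'developer']
# Free text is cast to the pandas 'string' dtype.
str_vars = ['description']
appdata_eda.astype(vars=category_vars, dtype='category')
appdata_eda.astype(vars=str_vars, dtype='string')
# Discard the stored summary so the next access reflects the new dtypes
# (this assumes summary is cached on the EDA object).
del appdata_eda.summary
appdata_eda.summary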

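|   | Column       | Dtype    | Valid  | Missing | Validity | Unique | Cardinality | Size       |
|---|--------------|----------|--------|---------|----------|--------|-------------|------------|
| 0 | id           | category | 513183 | 0       | 1.0      | 461878 | 0.9         | 22656084   |
| 1 | name         | category | 513183 | 0       | 1.0      | 461358 | 0.9         | 58286267   |
| 2 | description  | string   | 513183 | 0       | 1.0      | 451349 | 0.88        | 1356704636 |
| 3 | category_id  | category | 513183 | 0       | 1.0      | 26     | 0.0         | 514463     |
| 4 | category     | category | 513183 | 0       | 1.0      | 26     | 0.0         | 515995     |
| 5 | price        | float64  | 513183 | 0       | 1.0      | 116    | 0.0         | 4105464    |
| 6 | developer_id | category | 513183 | 0       | 1.0      | 258212 | 0.5         | 12572612   |
| 7 | developer    | category | 513183 | 0       | 1.0      | 257297 | 0.5         | 30932069   |
| 8 | rating       | float64  | 513183 | 0       | 1.0      | 52917  | 0.1         | 4105464    |
| 9 | ratings      | int64    | 513183 | 0       | 1.0      | 20026  | 0.04        | 4105464    |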
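
The Size column shows that the categorical conversion cuts memory sharply for low-cardinality variables (category drops from about 34 MB to about 0.5 MB) but increases it for near-unique variables such as id and name. This is expected: a pandas categorical stores one integer code per row plus a lookup of the distinct values, so it only pays off when values repeat often. The sketch below illustrates the effect with plain pandas on synthetic data. The series contents are made up for illustration.

```python
import pandas as pd

n = 10_000
# Low cardinality: only two distinct values across all rows.
low = pd.Series(["Health", "Medical"] * (n // 2))
# High cardinality: every value is distinct.
high = pd.Series([f"app-{i}" for i in range(n)])


def sizes(s: pd.Series) -> tuple:
    """Return (bytes before, bytes after) converting to 'category'."""
    before = s.memory_usage(deep=True, index=False)
    after = s.astype("category").memory_usage(deep=True, index=False)
    return before, after


low_before, low_after = sizes(low)
high_before, high_after = sizes(high)
print(f"low cardinality:  {low_before:>9,} -> {low_after:>9,} bytes")
print(f"high cardinality: {high_before:>9,} -> {high_after:>9,} bytes")
```

The categorical dtype may still be worth using for identifiers when the goal is to signal that the values are labels rather than quantities, as long as the memory cost is acceptable.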


### AppData Univariate Analysis
#### AppData Id

Here, we are interested in the cardinality of the id variable.

In [15]:
appdata_eda.describe(x="id")
id_value_counts = appdata_eda.value_counts(x="id", threshold=2)
id_value_counts['count'].value_counts().to_frame()

|    | count  | unique | top        | freq |
|----|--------|--------|------------|------|
| id | 513183 | 461878 | 1059124601 | 6    |


|   | count |
|---|-------|
| 2 | 44499 |
| 3 | 3189  |
| 4 | 129   |
| 5 | 9     |
| 6 | 1     |


In total, 47,827 ids are present in two or more observations. The counts of value counts are summarized above, and the per-id value counts are stored for later data cleaning operations.
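
The "count of counts" pattern used above can be sketched in plain pandas as follows. The toy ids are illustrative, and the `threshold=2` argument of the project's `value_counts` method is assumed to mean "keep values occurring at least twice".

```python
import pandas as pd

# Toy data (illustrative only): 101 appears twice, 103 three times.
ids = pd.Series([101, 101, 102, 103, 103, 103, 104], name="id")

per_id = ids.value_counts()              # occurrences of each id
repeated = per_id[per_id >= 2]           # ids seen at least twice
count_of_counts = repeated.value_counts().sort_index()  # how many ids occur 2x, 3x, ...

n_repeated_ids = int(count_of_counts.sum())   # distinct ids with duplicates
extra_rows = int((repeated - 1).sum())        # surplus rows beyond one per id

print(count_of_counts)
print(n_repeated_ids, extra_rows)
```

Applied to the table above, the same arithmetic gives 44,499 + 3,189 + 129 + 9 + 1 = 47,827 repeated ids, which account for 51,305 surplus rows, matching the gap between 513,183 observations and 461,878 unique ids.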

#### AppData Name

In [16]:
appdata_eda.describe(x="name")
name_value_counts = appdata_eda.value_counts(x="name", threshold=2)
name_value_counts['count'].value_counts().to_frame()

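|      | count  | unique | top                     | freq |
|------|--------|--------|-------------------------|------|
| name | 513183 | 461358 | Birthday Wishes & Cards | 6    |


|   | count |
|---|-------|
| 2 | 44788 |
| 3 | 3250  |
| 4 | 153   |
| 5 | 17    |
| 6 | 2     |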


The distribution of duplicate counts for the name variable is conspicuously similar to that of the id variable.
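
A natural follow-up is to check whether the repeated names belong to the same rows as the repeated ids, or whether some distinct apps simply share a name. A minimal sketch with plain pandas, using made-up rows, might look like this:

```python
import pandas as pd

# Toy data (illustrative only): id 1 is repeated, and "Timer" is shared
# by two different apps (ids 2 and 3).
toy = pd.DataFrame(
    {
        "id": [1, 1, 2, 3, 4],
        "name": ["Notes", "Notes", "Timer", "Timer", "Maps"],
    }
)

dup_id = toy["id"].duplicated(keep=False)      # True for every row of a repeated id
dup_name = toy["name"].duplicated(keep=False)  # True for every row of a repeated name

both = int((dup_id & dup_name).sum())          # repeated id and repeated name
name_only = int((~dup_id & dup_name).sum())    # same name, different ids
overlap = pd.crosstab(dup_id, dup_name)        # 2x2 table of the two flags

print(overlap)
print(both, name_only)
```

Rows flagged in both columns are likely the same record returned by more than one search term, while rows flagged for name only are distinct apps that happen to share a name.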

#### AppData Description

In [17]:
appdata_eda.describe(x="description")
desc_value_counts = appdata_eda.value_counts(x="description", threshold=2)
desc_value_counts['count'].value_counts().to_frame()

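|             | count  | unique | top                                                | freq |
|-------------|--------|--------|----------------------------------------------------|------|
| description | 513183 | 451349 | "accounting firm and business consultancy, whic..." | 412  |


|     | count |
|-----|-------|
| 2   | 45402 |
| 3   | 3519  |
| 4   | 373   |
| 5   | 115   |
| 6   | 56    |
| ... | ...   |
| 26  | 1     |
| 46  | 1     |
| 45  | 1     |
| 31  | 1     |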
