## Summarize dataset characteristics

The dataset, titled "Google Play Store Apps," was obtained from Kaggle: https://www.kaggle.com/datasets/lava18/google-play-store-apps/data.

It contains data on roughly 10,000 Play Store applications, collected by web scraping the Google Play Store and intended for analysis of the Android market. The dataset's publisher believes that Play Store app data has enormous potential to drive app-making businesses to success: developers can extract actionable insights from it to improve their app creation strategies and tap into the Android market.

The dataset is licensed under the Creative Commons Attribution 3.0 Unported License; a copy of the license is available at http://creativecommons.org/licenses/by/3.0/. Users are free to:

* Share — copy and redistribute the material in any medium or format for any purpose, even commercially.
* Adapt — remix, transform, and build upon the material for any purpose, even commercially.

The download (9.03 MB in total) contains two CSV files:
1. googleplaystore.csv (1.36 MB)
    Dataset size: 10,841 data rows (1 app/row, excluding the header line) x 13 columns
    - App Name(Text) - Application Name
    - Category(Categorical Text) - Category the app belongs to
    - Rating(Float) - Overall user rating of the app
    - Reviews(Integer) - Number of user reviews for the app
    - Size(Numeric) - Size of the app
    - Installs(Numeric) - Number of user downloads/installs for the app
    - Type(Categorical Text) - Paid/Free
    - Price(Float/Integer) - Price of the app
    - Content Rating(Categorical Text) - Age group the app is targeted at
    - Genres(Categorical Text) - Different types of genres that an app belongs to
    - Last Updated(Date) - Date of the last update
    - Current Ver(Text) - Current version number
    - Android Ver(Text) - Android version number
2. googleplaystore_user_reviews.csv (7.67 MB)
    - App(Text)  - Name of app
    - Translated_Review(Text)  - User review (Preprocessed and translated to English)
    - Sentiment(Categorical Text) - Positive/Negative/Neutral (Preprocessed)
    - Sentiment_Polarity(Float & None Type) - Sentiment polarity score
    - Sentiment_Subjectivity(Float & None Type) - Sentiment subjectivity score

For this study, we focus mainly on the first file, googleplaystore.csv, and restrict the analysis to its first 10 columns.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px

import ipywidgets as widgets

In [2]:
full_df=pd.read_csv("./archive/googleplaystore.csv")
full_df=full_df.iloc[:,:10]
full_df.shape

(10841, 10)

### Preprocessing step 1: remove the unexpected Category value

In [3]:
full_df=full_df[full_df['Category']!='1.9'] #reason: Category cannot be a float
full_df.shape

(10840, 10)
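
Rather than hard-coding `'1.9'`, one possible way to spot such rows is to flag every Category value that parses as a number. The following is only a sketch on a toy frame; `toy` and `is_numeric_category` are illustrative names, not part of the notebook, and the toy values are made up.

```python
import pandas as pd

# Toy stand-in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Category": ["EDUCATION", "1.9", "GAME"],
})

# A Category that parses as a number is treated as a malformed row.
# errors="coerce" turns non-numeric labels into NaN instead of raising.
is_numeric_category = pd.to_numeric(toy["Category"], errors="coerce").notna()

bad_rows = toy[is_numeric_category]   # rows to inspect before dropping
clean = toy[~is_numeric_category]     # rows with a proper category label

print(bad_rows)
print(clean.shape)
```

Inspecting `bad_rows` before dropping them makes it easier to confirm that only genuinely malformed rows are removed.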

### Preprocessing step 2: drop rows containing NaN (mainly NaN in the Rating column)

In [4]:
full_df=full_df.dropna(axis='rows')
full_df.shape

(9366, 10)
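
To check which columns are responsible for the dropped rows, the missing values can be counted per column before calling `dropna`. This is a minimal sketch on made-up data; the `toy` frame and its values are assumptions for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Drop every row that has at least one missing value.
dropped = toy.dropna(axis="rows")

print(missing_per_column)
print(len(toy) - len(dropped), "rows dropped")
```

Running the same two lines on `full_df` before step 2 would show how much of the reduction comes from the Rating column.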

In [5]:
full_df['Reviews']=full_df['Reviews'].astype(float)
full_df['Rating']=full_df['Rating'].astype(float)
full_df['Price']=full_df['Price'].str.replace('$','',regex=False).astype(float)  # '$' is a literal character here, not a regex anchor
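
In a regular expression, `$` is the end-of-string anchor, so it is safer to request a literal replacement explicitly when stripping the currency symbol. A small sketch on a made-up Series (the `prices` values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative price strings in the "$x.xx" style.
prices = pd.Series(["0", "$4.99", "$0.99"])

# regex=False treats "$" as a literal character, not the end-of-string anchor.
as_float = prices.str.replace("$", "", regex=False).astype(float)

print(as_float.tolist())
```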


### Analysis

In [6]:
# Collect the unique Category values across all apps
c_ls=[x for x in list(full_df.Category) if str(x) != 'nan']
c_ls=list(set(c_ls))

# Add a select multiple ipywidget for multiple categorical selection
input_c=widgets.SelectMultiple(
    options=c_ls,
    value=['EDUCATION'],
    rows=10,
    description='Category:',
    disabled=False
)

In [16]:

def get_rr(df, ls):
    '''Filter df to the categories selected in the widget ls and show three
    Plotly figures (the function displays them and returns nothing):
        pie chart of Genres
        scatter plot of Rating by Type, faceted by Content Rating
        histogram of Rating
    '''

    print("These are the categories that you have selected:\n", ls.value)
    # ls.value is a tuple of the selected options; isin() expects a list-like
    new_ls=list(ls.value)
    cur_df=df[df['Category'].isin(new_ls)]
    fig=px.scatter(cur_df, x='Type', y='Rating', facet_col='Content Rating', color='Genres',
                   title='Selected Category Interactive Description')
    fig2=px.histogram(cur_df, x='Rating', color='Genres', barmode='overlay',
                     title='Selected Category Histogram - Rating Distribution')
    fig3=px.pie(cur_df, names='Genres', title='Distribution of Different Genre Types')

    # Display order: pie chart, scatter plot, histogram
    fig3.show()
    fig.show()
    fig2.show()
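
The filtering step used in `get_rr` can be checked in isolation, without the widget or Plotly. The sketch below assumes a plain tuple standing in for the widget's `.value`; `filter_by_category` and the `toy` frame are hypothetical names and data introduced only for illustration.

```python
import pandas as pd

def filter_by_category(df, selected):
    """Return the rows of df whose Category is in selected (hypothetical helper)."""
    return df[df["Category"].isin(list(selected))]

# Toy stand-in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Category": ["EDUCATION", "GAME", "EDUCATION"],
    "Rating": [4.5, 3.8, 4.1],
})

# A tuple stands in for the widget's .value attribute.
subset = filter_by_category(toy, ("EDUCATION",))
print(subset["App"].tolist())
```

Separating the filter from the plotting calls makes it easier to verify that the selected categories produce the expected rows before any figure is drawn.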
    


### SelectMultiple Widget

In [17]:
#Multiple values can be selected with shift and/or ctrl (or command) pressed and mouse clicks or arrow keys.
input_c 

SelectMultiple(description='Category:', index=(16,), options=('SPORTS', 'SOCIAL', 'ENTERTAINMENT', 'COMMUNICAT…

In [20]:
get_rr(full_df, input_c)

These are the categories that you have selected:
 ('EDUCATION',)


Because these three interactive plots are created with plotly.express, they will not render after the notebook is uploaded to GitHub. I have submitted another .ipynb copy in the Canvas window for your reference. Thank you!