<a href="https://www.kaggle.com/code/mayanklad/google-play-store-apps-case-study-and-prediction?scriptVersionId=143686599" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Data Analysis and Ratings Prediction for Apps on Google Play Store**

# Importing Libraries

In [None]:
# Data
import re
import numpy as np
import pandas as pd
from collections import defaultdict

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import missingno as msn

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# Regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

# Classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error


# Hide warnings
import warnings
warnings.filterwarnings('ignore')

## Setting theme for plots

In [None]:
sns.set_theme(style='darkgrid', palette='bright')
sns.set_context('paper')

# Reading data

* **App** : The name of the app

* **Category** : The category of the app

* **Rating** : The rating of the app in the Play Store

* **Reviews** : The number of reviews of the app

* **Size** : The size of the app

* **Installs** : The number of installs of the app

* **Type** : The type of the app (Free/Paid)

* **Price** : The price of the app (0 if it is Free)

* **Content Rating** : The appropriate target audience of the app

* **Genres**: The genre of the app

* **Last Updated** : The date when the app was last updated

* **Current Ver** : The current version of the app

* **Android Ver** : The minimum Android version required to run the app

In [None]:
data_path = '/kaggle/input/google-play-store-apps/googleplaystore.csv'

df = pd.read_csv(data_path)
df.head()

In [None]:
df.info()

As we can see, the dataset contains **10841** applications, each described by 13 attributes.

In [None]:
df.describe()

# Data Cleaning

## Missing Values

### Checking for missing values

In [None]:
fig = msn.bar(df=df, color='#27aeef')
fig.set_title('Non-null rows count', fontdict= {'fontsize': 25})
plt.show()

In [None]:
df.isna().sum()

It's clear that we have missing values in **Rating**, **Type**, **Content Rating**, **Current Ver** and **Android Ver**.

### Handling missing values

We are dropping all the rows with null values

In [None]:
df.dropna(inplace=True)
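Before dropping rows, it can help to know how many rows contain at least one null value. The sketch below illustrates this on a small hand-made dataframe (`toy` and its values are invented for illustration and are not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Invented toy data: three of the four rows contain at least one null value
toy = pd.DataFrame({
    'Rating': [4.1, np.nan, 3.9, np.nan],
    'Type': ['Free', 'Paid', None, 'Free'],
})

# Number of rows that dropna() would remove
rows_with_nulls = int(toy.isna().any(axis=1).sum())

# dropna() keeps only the rows without any null value
remaining = toy.dropna()

print(rows_with_nulls, len(remaining))
```

The same two expressions can be applied to `df` before the `dropna` call to see how much of the data is discarded.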

## Duplicates

Removing the duplicate rows

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
print(f'After removing the rows with Null values and the duplicate entries, {df.shape[0]} applications remained for further analysis.')

## Data Types

In [None]:
df.head()

In [None]:
df.dtypes

As most of the features are set to data type **object** and have **suffixes**, each feature's data type must be converted into a suitable format for analysis.

### Preprocessing: Reviews

In [None]:
df['Reviews'].unique()

The feature **Reviews** must be of **integer** type.

In [None]:
df['Reviews'] = df['Reviews'].astype('int')


In [None]:
df['Reviews'].dtype

### Preprocessing: Size

* The feature **Size** must be of **floating** type.
* The suffix, which is a size unit, must be removed.
  \
  Example: '19.2M' to 19.2
* If size is given as **'Varies with device'** we replace it with 0
* The converted floating-point values of **Size** are expressed in **megabytes**.

In [None]:
df['Size'][0]

In [None]:
def size_prep(size):
    size = size.strip()

    if 'M' in size: # Size in Megabytes
        size = size.replace('M', '')
        size = float(size)

    elif 'k' in size: # Size in Kilobytes
        size = size.replace('k', '')
        size = float(size)
        size /= 1024  # 1 Megabyte = 1024 Kilobytes

    elif size == 'Varies with device': # 'Varies with device' is represented by the value 0
        size = float(0)

    return size

In [None]:
df['Size'] = df['Size'].map(size_prep)
df['Size'].dtype
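As a quick sanity check of the conversion rules above, the sketch below applies an equivalent standalone function to a few hand-written values instead of the real column. The name `parse_size` and the sample strings are assumptions made for this illustration.

```python
def parse_size(size: str) -> float:
    """Convert a Play Store size string to megabytes (hypothetical helper)."""
    size = size.strip()
    if size.endswith('M'):               # already in megabytes
        return float(size[:-1])
    if size.endswith('k'):               # kilobytes -> megabytes
        return float(size[:-1]) / 1024
    if size == 'Varies with device':     # same placeholder value as used in the notebook
        return 0.0
    raise ValueError(f'Unrecognised size string: {size!r}')

# Hand-written sample values, one per rule
samples = ['19.2M', '512k', 'Varies with device']
converted = [parse_size(s) for s in samples]
print(converted)  # [19.2, 0.5, 0.0]
```

Unlike `size_prep`, this variant raises an error on an unexpected format instead of returning the string unchanged. That is a design choice of the sketch, not of the notebook.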

### Preprocessing: Installs

* The feature **Installs** must be of **integer** type.
* The characters '**,**' and '**+**' must be removed.
  \
  Example: '10,000+' to 10000

In [None]:
def installs_prep(installs):
    installs = installs.replace(',', '')
    installs = installs.replace('+', '').strip()
    
    return int(installs)

In [None]:
df['Installs'] = df['Installs'].map(installs_prep)
df['Installs'].dtype
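The same cleaning can also be expressed with pandas string methods instead of a row-wise function. The sketch below shows this on a few invented values (`installs_raw` is a stand-in for the real column).

```python
import pandas as pd

# Invented sample values in the same format as the Installs column
installs_raw = pd.Series(['10,000+', '500+', '1,000,000+'])

installs_clean = (
    installs_raw
    .str.replace(r'[,+]', '', regex=True)  # drop thousands separators and the trailing '+'
    .str.strip()
    .astype('int64')
)

print(installs_clean.tolist())  # [10000, 500, 1000000]
```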

### Preprocessing: Price

* The feature **Price** must be of **floating** type.
* The prefix '**\$**' must be removed if **Price** is non-zero.
  \
  Example: '$4.99' to 4.99

In [None]:
df['Price'].unique()

In [None]:
def price_prep(price):
    return float(0) if price == '0' else float(price[1:])

In [None]:
df['Price'] = df['Price'].map(price_prep)
df['Price'].dtype

### Preprocessing: Last Updated

* Updating the **Last Updated** column's **datatype** from **string** to **pandas datetime**.
* **Extracting** new columns **Last Updated Year** and **Last Updated Month**.
* The **dtype** of **Last Updated Month** is changed to **category** later, just before standardization.

In [None]:
df['Last Updated'] = pd.to_datetime(df['Last Updated'])

df['Last Updated Year'] = df['Last Updated'].apply(lambda d: int(d.strftime('%Y')))

df['Last Updated Month'] = df['Last Updated'].apply(lambda d: int(d.strftime('%m')))
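As an alternative to formatting each date with `strftime`, the year and month can be read directly through the pandas `.dt` accessor. The sketch below uses a few invented dates and assumes they follow the 'January 7, 2018' style.

```python
import pandas as pd

# Invented dates, assumed to follow the '<Month name> <day>, <year>' format
raw_dates = pd.Series(['January 7, 2018', 'August 1, 2018', 'June 20, 2016'])
dates = pd.to_datetime(raw_dates, format='%B %d, %Y')

# The .dt accessor exposes the date components as integer columns
year = dates.dt.year
month = dates.dt.month

print(year.tolist(), month.tolist())  # [2018, 2018, 2016] [1, 8, 6]
```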

In [None]:
df.head()

### Preprocessing: Android Ver

* The feature **Android Ver** must be of **floating** type.
* All the values **'Varies with device'** are replaced with **0.0**
* Only the **minimum** supported version is considered
  \
  Example: '4.1 and up' to 4.1

In [None]:
df['Android Ver'].unique()

In [None]:
def android_ver_prep(android_ver):
    android_ver = android_ver.strip()
    
    if android_ver == 'Varies with device': # When the version is 'Varies with device'
        android_ver = float(0)
    else:
        android_ver = float(android_ver[:3])
    
    return android_ver

In [None]:
df['Android Ver'] = df['Android Ver'].map(android_ver_prep)
df['Android Ver'].dtype

### Data after handling types and formats

In [None]:
df.head()

# EDA and Visualization

## Correlation

### Correlogram

In [None]:
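g = sns.pairplot(df)

# pairplot builds a grid of axes, so the title is set on the figure instead of on the last subplot
g.figure.suptitle('Correlogram', y=1.02)
plt.show()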

### Heatmap

In [None]:
fig, axes = plt.subplots(figsize=(8, 6))

sns.heatmap(data=df.select_dtypes(include='number').corr(),
            annot=True, linewidths=.5, fmt='.2f',
            cmap=sns.color_palette('light:#5A9', as_cmap=True),
            square=True)

plt.title('Correlation Heatmap')
plt.show()

We can infer that **Installs** is **highly correlated** with **Reviews**.
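Reading the strongest pair off a heatmap works for a handful of columns. A ranked list of column pairs makes the same information explicit. The sketch below does this on synthetic data (the `toy` dataframe is generated here so that Reviews tracks Installs; it is not the Play Store data).

```python
import numpy as np
import pandas as pd

# Synthetic data: Reviews is constructed to follow Installs, Rating is independent
rng = np.random.default_rng(0)
installs = rng.integers(1_000, 1_000_000, size=200)
toy = pd.DataFrame({
    'Installs': installs,
    'Reviews': installs * 0.03 + rng.normal(0, 500, size=200),
    'Rating': rng.uniform(1, 5, size=200),
})

corr = toy.corr()
cols = corr.columns

# Collect each unordered column pair once (upper triangle of the matrix)
pairs = pd.Series({
    (cols[i], cols[j]): corr.iloc[i, j]
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
}).sort_values(key=abs, ascending=False)  # strongest correlation first, ignoring sign

print(pairs)
```

Replacing `toy` with `df.select_dtypes(include='number')` would give the ranking for the notebook's own features.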

## Box and Whisker Plot

In [None]:
count = 0
col_names = df.select_dtypes(include='number').columns

fig = make_subplots(
    rows=2,
    cols=4,
    subplot_titles=col_names,
)

for row in range(1, 3):
    for col in range(1, 5):
        box = go.Box(y=df[col_names[count]])
        fig.add_trace(box, row=row, col=col)
        count += 1

fig.update_layout(height=800, width=1000, title_text='Box and Whisker Plots', showlegend=False)
fig.show()

## Number of Apps per Category

In [None]:
plt.figure(figsize=(14, 10))

sns.countplot(data=df, y='Category', order=df['Category'].value_counts().index)

plt.title('Number of Apps per Category')
plt.show()

As we can see, most apps in the Google Play Store belong to the **Family**, **Game** and **Tools** categories, with **Family** having the highest number of apps.

## Distribution of Ratings

In [None]:
sns.histplot(data=df, x='Rating', kde=True)

plt.title('Distribution of Rating')
plt.show()

The distribution of **Rating** is **left-skewed** and most of the apps have a rating between **4** and **5**.
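Skewness can also be checked numerically: a negative sample skewness indicates a longer tail on the left. The sketch below uses a handful of invented ratings and is not a result from the dataset.

```python
import pandas as pd

# Invented ratings: most values sit between 4 and 5, with a few low outliers
ratings = pd.Series([4.5, 4.4, 4.6, 4.3, 4.7, 4.5, 4.2, 3.0, 1.5, 4.8])

# Series.skew() returns the sample skewness; values below 0 indicate left skew
skewness = ratings.skew()
print(skewness)
```

Calling `df['Rating'].skew()` would give the corresponding value for the real column.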

## Distribution of Reviews, Size, Installs and Price as per Rating

In [None]:
cols = ['Reviews', 'Size', 'Installs', 'Price']
colors = ['r', 'g', 'b', 'k']

# Summing only the numeric columns of interest for each distinct rating value
groupby_rating_sum = df.groupby('Rating')[cols].sum().reset_index()

fig, axs = plt.subplots(1, 4, figsize=(14, 4))

for ax, col, color in zip(axs, cols, colors):
    sns.lineplot(ax=ax, data=groupby_rating_sum, x='Rating', y=col, color=color)
    ax.set_xlabel('Rating')
    ax.set_ylabel(col)
    ax.set_title(f'{col} Per Rating')

fig.tight_layout(pad=2)
plt.show()

The graphs above suggest that apps rated between **4.0** and **5.0** account for most of the total **reviews**, **size** and **installs**. **Price** **fluctuates** even among **highly rated** apps, so it **does not** appear to have a **direct** **correlation** with **rating**.

## Rating vs Size as per App Type

In [None]:
plt.figure(figsize=(12, 6))

sns.scatterplot(data=df, x='Size', y='Rating', hue='Type', s=40)

plt.title('Rating vs. Size as per Type')
plt.show()

We can infer from this scatter plot that the bulk of the **free** apps are **small** in **size** and have **excellent** **ratings**, while **paid** apps show an **even** mix of **sizes** and **ratings**.

## Distribution of Reviews

In [None]:
sns.histplot(data=df, x='Reviews', bins=60, kde=True)

plt.title('Distribution of Reviews')
plt.show()

## Top 20 Apps as per Reviews count

In [None]:
total_reviews_per_app_desc = df.drop(['Last Updated', 'Last Updated Year', 'Last Updated Month'], axis=1).groupby(['App'])['Reviews'].sum().sort_values(ascending=False)

plt.figure(figsize=(14, 10))

sns.barplot(x=total_reviews_per_app_desc.head(20).values, y=total_reviews_per_app_desc.head(20).index)

plt.title('Number of Reviews per App (Top 20)')
plt.show()

The **top 20** most-reviewed apps are mostly **games** and **social media** apps, with **Instagram** and **Facebook** at the top.

## Type (Paid or Free)

In [None]:
colors = ["#00EE76","#7B8895"]
plt.figure(figsize=(7, 7))

plt.pie(x=df['Type'].value_counts(), labels=['Free', 'Paid'], colors=colors, autopct='%.0f%%', explode=(0, 0.1))

plt.title('Distribution of Paid and Free apps')
plt.legend()
plt.show()

The Google Play Store has **93% free** apps.

## Content Rating

In [None]:
data = df['Content Rating'].value_counts().reset_index()

In [None]:
fig = px.pie(values=data.iloc[:, 1],
             names=data.iloc[:, 0],
             title='Pie chart of App Content Rating',
             width=800, height=600)

fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

Almost **80%** of the apps are made for **Everyone**.

## Android Version

In [None]:
plt.figure(figsize=(10, 5))

sns.countplot(data=df, x='Android Ver')

plt.title('Count of Apps with minimum supported Android Version')
plt.show()

It can be seen that the **majority** of the apps support **Android Version 4.0 and above**.

## Distribution of App Update over the Year

In [None]:
groupby_year_free_count = df[df['Type'] == 'Free'].drop(['Last Updated'], axis=1).groupby('Last Updated Year')['Last Updated Year'].count()
groupby_year_paid_count = df[df['Type'] == 'Paid'].drop(['Last Updated'], axis=1).groupby('Last Updated Year')['Last Updated Year'].count()

In [None]:
sns.lineplot(x=groupby_year_free_count.index,
             y=groupby_year_free_count.values,
             marker='o', label='Free')

sns.lineplot(x=groupby_year_paid_count.index,
             y=groupby_year_paid_count.values,
             marker='o', label='Paid')

plt.title('Distribution of App Update over the Year as per Type')
plt.legend()
plt.show()

As seen in the above plot, **no paid** apps in the dataset were last updated **before 2011**. Since then, the number of updated **free** apps has **increased** far more than that of paid apps: from **2011 to 2018**, **free** apps have **grown exponentially**, whereas there is **no** significant **increase** in **paid** apps. We may infer that the **majority** of **users** choose **free** apps.

## Distribution of App Update over the Month

In [None]:
groupby_month_free_count = df[df['Type'] == 'Free'].drop(['Last Updated'], axis=1).groupby('Last Updated Month')['Last Updated Month'].count()
groupby_month_paid_count = df[df['Type'] == 'Paid'].drop(['Last Updated'], axis=1).groupby('Last Updated Month')['Last Updated Month'].count()

In [None]:
sns.lineplot(x=groupby_month_free_count.index,
             y=groupby_month_free_count.values,
             marker='o', label='Free')

sns.lineplot(x=groupby_month_paid_count.index,
             y=groupby_month_paid_count.values,
             marker='o', label='Paid')

plt.title('Distribution of App Update over the Month as per Type')
plt.legend()
plt.show()

The **majority** of the apps are **updated** around **June**, **July** and **August**.

## Further Analysis

### Apps with a 5.0 Rating

In [None]:
df_rating_5 = df[df.Rating == 5.]
print(f'There are {df_rating_5.shape[0]} apps having rating of 5.0')

#### Installs

In [None]:
sns.histplot(data=df_rating_5, x='Installs', kde=True, bins=50)

plt.title('Distribution of Installs with 5.0 Rating Apps')
plt.show()

Despite the **full ratings**, the number of **installations** for the majority of the apps is **low**. Hence, those apps **cannot** be considered the **best** products.

#### Reviews

In [None]:
sns.histplot(data=df_rating_5, x='Reviews', kde=True)
plt.title('Distribution of Reviews with 5.0 Rating Apps')
plt.show()

The distribution is **right-skewed** which shows applications with **few reviews** having **5.0 ratings**, which is **misleading**.

#### Category

In [None]:
df_rating_5_cat =  df_rating_5['Category'].value_counts().reset_index()

In [None]:
fig = px.pie(values=df_rating_5_cat.iloc[:, 1],
             names=df_rating_5_cat.iloc[:, 0],
             title='Pie chart of App Categories with 5.0 Rating',
             width=800, height=600)

fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

**Family**, **Lifestyle** and **Medical** apps receive the **most 5.0 ratings** on Google Play Store, with **Family** representing about a **quarter** of the whole.

#### Type

In [None]:
df_rating_5_type =  df_rating_5['Type'].value_counts().reset_index()

In [None]:
fig = px.pie(values=df_rating_5_type.iloc[:, 1],
             names=df_rating_5_type.iloc[:, 0],
             title='Pie chart of App Types with 5.0 Rating',
             color_discrete_sequence=["#00EE76","#7B8895"],
             width=800, height=600)

fig.update_traces(textposition='inside', textinfo='percent+label', pull=[0, 0.1])
fig.show()

Almost **90%** of the **5.0 rating** apps are **free** on Google Play Store.

# Feature Pruning

We decide to **prune** the following features:
* **App** : App names are of no value for the model
* **Genres** : The information it stores is the same as that of the feature **Category**
* **Last Updated** : Since feature extraction is already carried out to make **year** and **month** features
* **Current Ver** : Current Version of an app doesn't hold significant value.
    

In [None]:
df.head()

In [None]:
pruned_features = ['App', 'Genres', 'Last Updated', 'Current Ver']

# Data Splitting for Modeling

We split the dataset into **80%** train and **20%** test.

In [None]:
target = 'Rating'

In [None]:
X = df.copy().drop(pruned_features+[target], axis=1)
y = df.copy()[target]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# Label Encoding

In [None]:
le_dict = defaultdict()

In [None]:
features_to_encode = X_train.select_dtypes(include=['category', 'object']).columns

for col in features_to_encode:
    le = LabelEncoder()

    X_train[col] = le.fit_transform(X_train[col]) # Fitting and transforming the train data
    X_train[col] = X_train[col].astype('category') # Converting the label encoded features from numerical back to categorical dtype in pandas

    X_test[col] = le.transform(X_test[col]) # Only transforming the test data
    X_test[col] = X_test[col].astype('category') # Converting the label encoded features from numerical back to categorical dtype in pandas

    le_dict[col] = le # Saving the label encoder for individual features
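One caveat of fitting the encoder on the training split only is that a category present only in the test split cannot be transformed. The sketch below shows this behaviour on invented labels, together with one possible workaround based on scikit-learn's `OrdinalEncoder`. The arrays and the `demo_` names are assumptions made for illustration, and the workaround is not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Invented labels: 'Unrated' appears only in the test split
train_cats = np.array(['Everyone', 'Teen', 'Everyone'])
test_cats = np.array(['Teen', 'Unrated'])

# LabelEncoder raises a ValueError when it meets a label it was not fitted on
demo_le = LabelEncoder().fit(train_cats)
try:
    demo_le.transform(test_cats)
    label_encoder_failed = False
except ValueError:
    label_encoder_failed = True

# OrdinalEncoder can map unseen categories to a sentinel value instead
demo_oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
demo_oe.fit(train_cats.reshape(-1, 1))
encoded_test = demo_oe.transform(test_cats.reshape(-1, 1)).ravel()

print(label_encoder_failed, encoded_test.tolist())  # True [1.0, -1.0]
```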

# Standardization

In [None]:
# Converting and adding "Last Updated Month" to categorical features
# features_to_encode is a pandas Index, so it is converted to a list before concatenating
categorical_features = list(features_to_encode) + ['Last Updated Month']
X_train['Last Updated Month'] = X_train['Last Updated Month'].astype('category')
X_test['Last Updated Month'] = X_test['Last Updated Month'].astype('category')

# Listing numeric features to scale
numeric_features = X_train.select_dtypes(exclude=['category', 'object']).columns

In [None]:
numeric_features

In [None]:
scaler = StandardScaler()

# Fitting and transforming the Training data
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])
# X_train = scaler.fit_transform(X_train)

# Only transforming the Test data
X_test[numeric_features] = scaler.transform(X_test[numeric_features])
# X_test = scaler.transform(X_test)
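The reason for calling `fit_transform` on the training data and only `transform` on the test data is that the mean and standard deviation must come from the training split alone. The sketch below illustrates this with a tiny invented array (the `demo_` names and values are not from the notebook).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented one-feature data
demo_train = np.array([[1.0], [2.0], [3.0]])
demo_test = np.array([[4.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(demo_train)  # learns mean=2 and the std from the train data
test_scaled = demo_scaler.transform(demo_test)        # reuses the train statistics

# The scaled train data has zero mean; the test value is expressed in train standard deviations
print(train_scaled.mean(), test_scaled[0, 0])
```

Here the test value 4.0 lies about 2.45 training standard deviations above the training mean, which is the information the scaled feature carries.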

# Modeling

## Regression

### Creating dataframe for metrics

In [None]:
models = ['Linear', 'KNN', 'Random Forest']
datasets = ['train', 'test']
metrics = ['RMSE', 'MAE', 'R2']

multi_index = pd.MultiIndex.from_product([models, datasets, metrics],
                                         names=['model', 'dataset', 'metric'])

df_metrics_reg = pd.DataFrame(index=multi_index,
                          columns=['value'])

In [None]:
df_metrics_reg
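For reference, the three metrics stored in this dataframe follow the standard definitions below, where $y_i$ is the true rating, $\hat{y}_i$ the predicted rating, $\bar{y}$ the mean of the true ratings and $n$ the number of apps in the split. The symbols are notation introduced here for illustration and do not appear in the notebook's code.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|,
\qquad
R^2 = 1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}
$$

RMSE and MAE are in the same units as **Rating** (lower is better), while $R^2$ is at most 1 and equals 0 for a model that always predicts $\bar{y}$. The `score` method of scikit-learn regressors returns $R^2$.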

### Linear Regressor

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
df_metrics_reg.loc['Linear', 'train', 'R2'] = lr.score(X_train, y_train)
df_metrics_reg.loc['Linear', 'test', 'R2'] = lr.score(X_test, y_test)

In [None]:
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

df_metrics_reg.loc['Linear', 'train', 'MAE'] = mean_absolute_error(y_train, y_train_pred)
df_metrics_reg.loc['Linear', 'test', 'MAE'] = mean_absolute_error(y_test, y_test_pred)

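# RMSE is computed as the square root of the MSE, which does not rely on the `squared` argument being available in the installed scikit-learn version
df_metrics_reg.loc['Linear', 'train', 'RMSE'] = np.sqrt(mean_squared_error(y_train, y_train_pred))
df_metrics_reg.loc['Linear', 'test', 'RMSE'] = np.sqrt(mean_squared_error(y_test, y_test_pred))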

### KNeighbors Regressor

In [None]:
knn = KNeighborsRegressor()
knn.fit(X_train, y_train)

In [None]:
df_metrics_reg.loc['KNN', 'train', 'R2'] = knn.score(X_train, y_train)
df_metrics_reg.loc['KNN', 'test', 'R2'] = knn.score(X_test, y_test)

In [None]:
y_train_pred = knn.predict(X_train)
y_test_pred = knn.predict(X_test)

df_metrics_reg.loc['KNN', 'train', 'MAE'] = mean_absolute_error(y_train, y_train_pred)
df_metrics_reg.loc['KNN', 'test', 'MAE'] = mean_absolute_error(y_test, y_test_pred)

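# RMSE is computed as the square root of the MSE, which does not rely on the `squared` argument being available in the installed scikit-learn version
df_metrics_reg.loc['KNN', 'train', 'RMSE'] = np.sqrt(mean_squared_error(y_train, y_train_pred))
df_metrics_reg.loc['KNN', 'test', 'RMSE'] = np.sqrt(mean_squared_error(y_test, y_test_pred))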

### Random Forest Regressor

In [None]:
rf = RandomForestRegressor(max_depth=2, random_state=0)
rf.fit(X_train, y_train)

In [None]:
df_metrics_reg.loc['Random Forest', 'train', 'R2'] = rf.score(X_train, y_train)
df_metrics_reg.loc['Random Forest', 'test', 'R2'] = rf.score(X_test, y_test)

In [None]:
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)

df_metrics_reg.loc['Random Forest', 'train', 'MAE'] = mean_absolute_error(y_train, y_train_pred)
df_metrics_reg.loc['Random Forest', 'test', 'MAE'] = mean_absolute_error(y_test, y_test_pred)

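# RMSE is computed as the square root of the MSE, which does not rely on the `squared` argument being available in the installed scikit-learn version
df_metrics_reg.loc['Random Forest', 'train', 'RMSE'] = np.sqrt(mean_squared_error(y_train, y_train_pred))
df_metrics_reg.loc['Random Forest', 'test', 'RMSE'] = np.sqrt(mean_squared_error(y_test, y_test_pred))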

### Regression Evaluation

In [None]:
# Rounding the values

df_metrics_reg['value'] = df_metrics_reg['value'].apply(lambda v: round(v, ndigits=3))
df_metrics_reg

In [None]:
data = df_metrics_reg.reset_index()

g = sns.catplot(col='dataset', data=data, kind='bar', x='model', y='value', hue='metric')

# Adding annotations to bars
# iterate through axes
for ax in g.axes.ravel():
    # add annotations
    for c in ax.containers:
        ax.bar_label(c, label_type='edge')

    ax.margins(y=0.2)

plt.show()

**The Regression predictions don't hold up very well!**

**This suggests that the dataset is not well suited to a regression problem.**

## Classification

Let's frame it as a **classification** problem statement.

### Converting the Rating from continuous to discrete

In [None]:
y_train_int = y_train.astype(int)
y_test_int = y_test.astype(int)
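Note that `astype(int)` truncates the decimal part instead of rounding, so a rating of 4.9 ends up in class 4. The sketch below shows the effect on a few invented ratings.

```python
import pandas as pd

# Invented ratings to illustrate the conversion
demo_ratings = pd.Series([4.9, 4.1, 3.5, 5.0, 1.2])

# astype(int) drops the fractional part: 4.9 -> 4, 3.5 -> 3
demo_classes = demo_ratings.astype(int)

print(demo_classes.tolist())  # [4, 4, 3, 5, 1]
```

With this scheme, class 5 contains only apps rated exactly 5.0, and every rating from 4.0 to 4.9 falls into class 4.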

### Creating dataframe for metrics

In [None]:
models = ['Logistic Regression', 'KNN', 'Random Forest']
datasets = ['train', 'test']

multi_index = pd.MultiIndex.from_product([models, datasets],
                                         names=['model', 'dataset'])

df_metrics_clf = pd.DataFrame(index=multi_index,
                          columns=['accuracy %'])

In [None]:
df_metrics_clf

### Logistic Regression Classifier

In [None]:
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train_int)

In [None]:
df_metrics_clf.loc['Logistic Regression', 'train'] = lr_clf.score(X_train, y_train_int)
df_metrics_clf.loc['Logistic Regression', 'test'] = lr_clf.score(X_test, y_test_int)

### KNeighbors Classifier

In [None]:
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train_int)

In [None]:
df_metrics_clf.loc['KNN', 'train'] = knn_clf.score(X_train, y_train_int)
df_metrics_clf.loc['KNN', 'test'] = knn_clf.score(X_test, y_test_int)

### Random Forest Classifier

In [None]:
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train_int)

In [None]:
df_metrics_clf.loc['Random Forest', 'train'] = rf_clf.score(X_train, y_train_int)
df_metrics_clf.loc['Random Forest', 'test'] = rf_clf.score(X_test, y_test_int)

### Classification Evaluation

In [None]:
# Rounding and converting the accuracies to percentages
df_metrics_clf['accuracy %'] = df_metrics_clf['accuracy %'].apply(lambda v: round(v*100, ndigits=2))
df_metrics_clf

In [None]:
data = df_metrics_clf.reset_index()

g = sns.catplot(col='dataset', data=data, kind='bar', x='model', y='accuracy %')

# Adding annotations to bars
# iterate through axes
for ax in g.axes.ravel():
    # add annotations
    for c in ax.containers:
        ax.bar_label(c, label_type='edge')

    ax.margins(y=0.2)

plt.show()

**After comparing with Regression models, it's clear that we would get better results from Classification!**
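One way to put classification accuracies in context is to compare them with a baseline that always predicts the most frequent class. The sketch below does this with scikit-learn's `DummyClassifier` on invented labels (the `demo_` arrays and the 80/20 class mix are assumptions, not figures from the dataset).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented, imbalanced labels: 8 apps in class 4 and 2 apps in class 3
demo_y = np.array([4] * 8 + [3] * 2)
demo_X = np.zeros((len(demo_y), 1))  # the features are ignored by this baseline

# The baseline always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(demo_X, demo_y)
baseline_accuracy = baseline.score(demo_X, demo_y)

print(baseline_accuracy)  # 0.8
```

Fitting the same baseline on `X_train` and `y_train_int` would show how much of each model's accuracy goes beyond simply predicting the dominant rating class.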

# Conclusion

- In conclusion, the dataset from Google Play Store apps has been explored and analyzed using various data visualization techniques with the help of Matplotlib, Seaborn and Plotly libraries. 
- The preliminary analysis, visualization methods and EDA provided insights into the data and helped in understanding the underlying patterns and relationships among the variables. 
- The analysis of the Google Play Store dataset has shown that there is only a weak correlation between the rating and other app attributes such as size, installs, reviews, and price. The strongest relationship we found was the high correlation between the number of installs and the number of reviews.
- We also observed that free apps have higher ratings than paid apps, and that app size does not seem to have a significant impact on rating.