# Project Name -  Play Store App Review Analysis

##### **Project Type**    - EDA
##### **Contribution**    - Individual
**Prepared by**: Kushboo Jain

**Goal** : Discover patterns and insights in Google Play Store data that help inform strategic decisions for app developers and marketers.

# **Project Summary -** Google Play Store App Analysis




In this project, I explored a dataset from the Google Play Store to identify key factors that contribute to the success and engagement of Android applications. The dataset contains details such as app category, rating, number of installs, size, and price. A second dataset contains user reviews for these apps, which I used to analyze user sentiment and overall app performance.

The primary goal was to extract actionable insights that could help developers and businesses understand what makes an app more popular, better rated, and widely downloaded. This can guide future decisions in app development and marketing strategies.

I started by cleaning the data: handling missing values, converting data types, and making sure the format was suitable for analysis. Then I moved on to Exploratory Data Analysis (EDA), where I used several visualizations to identify trends and patterns. For example, I checked how app categories relate to ratings, how app size and price relate to downloads, and how sentiment is distributed across user reviews.

---

## Libraries Used

- **Pandas** for reading and manipulating the dataset  
- **NumPy** for numerical operations  
- **Matplotlib** and **Seaborn** for creating meaningful visualizations  
  *(at least 5 types used: bar plots, histograms, scatter plots, heatmaps, and box plots)*

---

## Key Findings

- Free apps tend to have higher install rates but not necessarily better ratings.
- Certain categories like "Games" and "Tools" dominate in terms of downloads.
- User sentiment plays a big role in app popularity — apps with mostly positive reviews are downloaded more often.

---

## Conclusion

Overall, this project gave me a deeper understanding of how data analysis can support decision-making in the mobile app industry.


# **GitHub Link -**

[View this project on GitHub](https://github.com/kushboo10/playstore-analysis)


# **Problem Statement**


The Google Play Store contains a large number of applications across various categories. However, many of these apps struggle to gain visibility and user satisfaction. Understanding the factors that influence an app’s success — such as its category, pricing strategy, user ratings, install base, and user sentiment — can help developers and stakeholders make informed decisions.

The challenge is to analyze the Play Store data to uncover trends, clean inconsistencies, and provide insights that can improve app performance, user experience, and store visibility.

#### **Define Your Business Objective?**


The primary business objective of this project is to:

- Analyze the apps and user reviews datasets from the Google Play Store.
- Identify patterns and trends that correlate with highly-rated and frequently-installed apps.
- Understand the impact of pricing (free vs paid), app size, and update frequency on app performance.
- Explore user sentiment from reviews to identify pain points or app strengths.
- Provide actionable recommendations to developers, marketers, and product managers to optimize their app offerings and increase engagement or downloads.

This analysis can guide strategic decisions for app improvement, marketing investment, and product development based on real-world data.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Basic data analysis libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# To ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Set theme for plots
sns.set(style='whitegrid')


### Dataset Loading

In [None]:
# Load the apps data
try:
    url = 'https://drive.google.com/uc?id=1pVZNhpwbqbu3xLf6J-1KrQI_-B8KqDCt'
    apps_df = pd.read_csv(url)
    print("Apps dataset loaded successfully.")
except Exception as e:
    print("Error loading apps dataset:", e)

# Load the user reviews data
try:
    reviews_df = pd.read_csv('https://drive.google.com/uc?id=1zJmyrNtkv_ZnVAY1wtg1irTRbpoZ9rRH')
    print("Reviews dataset loaded successfully.")
except Exception as e:
    print("Error loading reviews dataset:", e)

### Dataset First View

In [None]:
# Dataset First Look
# Preview the first few rows of the apps dataset
print("Apps Dataset Preview:")
display(apps_df.head())
# Check the shape of the apps dataset
print(f"Apps Dataset contains {apps_df.shape[0]} rows and {apps_df.shape[1]} columns.\n")
# Preview the first few rows of the reviews dataset
print("Reviews Dataset Preview:")
display(reviews_df.head())
# Check the shape of the reviews dataset
print(f"Reviews Dataset contains {reviews_df.shape[0]} rows and {reviews_df.shape[1]} columns.\n")
# Display column names of both datasets
print("Columns in Apps Dataset:\n", apps_df.columns.tolist())
print("\nColumns in Reviews Dataset:\n", reviews_df.columns.tolist())


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Shape gives (rows, columns)

print("Dataset Dimensions:")

# Apps dataset
rows_apps, cols_apps = apps_df.shape
print(f"Apps Dataset → Rows: {rows_apps}, Columns: {cols_apps}")

# Reviews dataset
rows_reviews, cols_reviews = reviews_df.shape
print(f"Reviews Dataset → Rows: {rows_reviews}, Columns: {cols_reviews}")


### Dataset Information

In [None]:
# Dataset Info

print("Apps Dataset Info:\n")
apps_df.info()

print("\nReviews Dataset Info:\n")
reviews_df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

# Count duplicates in apps dataset
duplicate_apps = apps_df.duplicated().sum()
print(f"Duplicate records in Apps Dataset: {duplicate_apps}")

# Count duplicates in reviews dataset
duplicate_reviews = reviews_df.duplicated().sum()
print(f"Duplicate records in Reviews Dataset: {duplicate_reviews}")


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

print("Missing values in Apps Dataset:\n")
missing_apps = apps_df.isnull().sum()
print(missing_apps[missing_apps > 0])  # Show only columns with missing values

print("\nMissing values in Reviews Dataset:\n")
missing_reviews = reviews_df.isnull().sum()
print(missing_reviews[missing_reviews > 0])  # Show only columns with missing values


In [None]:
# Visualizing the missing values in Apps Dataset
plt.figure(figsize=(12, 6))
sns.heatmap(apps_df.isnull(), cbar=False, cmap="viridis")
plt.title("Missing Values Heatmap - Apps Dataset")
plt.show()

# Visualizing the missing values in Reviews Dataset
plt.figure(figsize=(12, 6))
sns.heatmap(reviews_df.isnull(), cbar=False, cmap="viridis")
plt.title("Missing Values Heatmap - Reviews Dataset")
plt.show()


### What did you know about your dataset?

## Dataset Summary

### Apps Dataset (`apps_df`)

1. **Size:**  
   The dataset contains `rows_apps` rows and `cols_apps` columns (the values printed by the rows and columns count cell).

2. **Columns:**  
   The dataset includes information about various mobile applications such as:
   - `App`, `Category`, `Rating`, `Reviews`, `Size`, `Installs`, `Type`, `Price`, `Content Rating`, `Genres`, `Last Updated`, etc.

3. **Missing Values:**  
   Several columns have missing values. These were identified using `.isnull().sum()` and visualized with a heatmap. Common columns with missing values include:
   - `Rating`, `Size`, `Current Ver`, `Android Ver`, etc.

4. **Duplicate Records:**  
   Duplicate rows were detected and counted using `.duplicated().sum()`.

5. **Data Types:**  
   `.info()` revealed the types of each column (e.g., `object`, `float64`, `int64`) and how much data is missing.

6. **Visualization:**  
   A heatmap of missing values was created to visually inspect patterns in null values.
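
The raw null counts are easier to compare across columns when expressed as percentages. The following is a minimal sketch on a small made-up frame; `toy_apps` is hypothetical and only stands in for `apps_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for apps_df
toy_apps = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Size": ["14M", "23k", None, "5M"],
})

# Share of missing values per column, in percent, largest first
missing_pct = (toy_apps.isnull().mean() * 100).sort_values(ascending=False)

# Show only the columns that actually have missing values
print(missing_pct[missing_pct > 0])
```

Applying the same two lines to `apps_df` and `reviews_df` would show which columns need imputation and which can simply have their few incomplete rows dropped.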

---

### User Reviews Dataset (`reviews_df`)

1. **Size:**  
   This dataset contains `rows_reviews` rows and `cols_reviews` columns (the values printed by the rows and columns count cell).

2. **Columns:**  
   Typical columns include:
   - `App`, `Translated_Review`, `Sentiment`, `Sentiment_Polarity`, `Sentiment_Subjectivity`.

3. **Missing Values:**  
   Missing values were found in columns like `Translated_Review`, `Sentiment_Polarity`, and `Sentiment_Subjectivity`.

4. **Duplicate Records:**  
   Duplicate rows were identified using `.duplicated().sum()`.

5. **Data Types:**  
   The `.info()` method helped understand column types and missing data.

6. **Visualization:**  
   A heatmap of null values was plotted for quick visual inspection.

---

### Next Steps (Recommended)

- **Data Cleaning:**
  - Handle missing values (drop or impute).
  - Remove or examine duplicate entries.

- **Data Transformation:**
  - Convert text-based columns like `Installs`, `Price`, and `Size` into proper numerical formats for analysis.

- **Exploratory Data Analysis (EDA):**
  - Understand rating distributions, most popular app categories, free vs paid app performance, etc.
  - Analyze user review sentiments and their correlation with app ratings.

- **Merging Datasets:**
  - Merge `apps_df` and `reviews_df` on the `App` column for integrated insights.
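
One way to approach the merge step is sketched below, assuming reviews are first aggregated to one row per app. The toy frames and the names `polarity_by_app` and `merged` are illustrative, not taken from the notebook.

```python
import pandas as pd

# Hypothetical toy frames standing in for apps_df and reviews_df
toy_apps = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Category": ["TOOLS", "GAME", "FAMILY"],
})
toy_reviews = pd.DataFrame({
    "App": ["A", "A", "B", "Z"],
    "Sentiment_Polarity": [0.5, 0.1, -0.2, 0.9],
})

# Aggregate reviews to one row per app first, so the merge does not duplicate apps
polarity_by_app = toy_reviews.groupby("App", as_index=False)["Sentiment_Polarity"].mean()

# A left join keeps every app; the _merge column shows which apps had reviews
merged = toy_apps.merge(polarity_by_app, on="App", how="left", indicator=True)
print(merged)
```

A left join with `indicator=True` makes it visible how many apps have no reviews at all, which an inner join would drop silently.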


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

print("Apps Dataset Columns:")
print(apps_df.columns.tolist())

print("\nReviews Dataset Columns:")
print(reviews_df.columns.tolist())


In [None]:
# Dataset Describe

print("Statistical Summary: Apps Dataset")
display(apps_df.describe(include='all'))

print("\nStatistical Summary: Reviews Dataset")
display(reviews_df.describe(include='all'))


### Variables Description


### Apps Dataset (`apps_df`)

| Column Name        | Description                                                                 |
|--------------------|-----------------------------------------------------------------------------|
| App                | Name of the application                                                     |
| Category           | Category under which the app is listed (e.g., Tools, Games, Productivity)   |
| Rating             | Average user rating of the app (out of 5)                                   |
| Reviews            | Total number of user reviews                                                |
| Size               | Size of the app (e.g., 14M, 23k)                                             |
| Installs           | Number of times the app has been installed                                  |
| Type               | Type of app: Free or Paid                                                   |
| Price              | Price of the app (if Paid)                                                  |
| Content Rating     | Age group the app is targeted at (e.g., Everyone, Teen, Mature 17+)         |
| Genres             | Genre or multiple genres the app belongs to                                 |
| Last Updated       | Date when the app was last updated                                          |
| Current Ver        | Current version of the app                                                  |
| Android Ver        | Minimum Android version required to run the app                             |

---

### Reviews Dataset (`reviews_df`)

| Column Name           | Description                                                                 |
|------------------------|-----------------------------------------------------------------------------|
| App                   | Name of the application (used for joining with `apps_df`)                  |
| Translated_Review     | User review translated to English                                           |
| Sentiment             | Sentiment of the review (Positive, Negative, Neutral)                      |
| Sentiment_Polarity    | Numerical value representing polarity of the sentiment                     |
| Sentiment_Subjectivity| Numerical value representing subjectivity of the sentiment                 |


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable

print("Unique Values in Apps Dataset:\n")
for col in apps_df.columns:
    print(f"{col}: {apps_df[col].nunique()} unique values")

print("\nUnique Values in Reviews Dataset:\n")
for col in reviews_df.columns:
    print(f"{col}: {reviews_df[col].nunique()} unique values")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Make the Apps dataset analysis-ready

# 1. Drop duplicate rows
apps_df.drop_duplicates(inplace=True)

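# 2. Handle 'Rating' column
# Keep rows with a missing rating so they can be imputed; drop only invalid ratings (> 5).
# A plain `Rating <= 5` filter would also drop NaN rows, leaving nothing to fill.
apps_df = apps_df[apps_df['Rating'].isna() | (apps_df['Rating'] <= 5)].copy()
apps_df['Rating'] = apps_df['Rating'].fillna(apps_df['Rating'].mean())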

# 3. Clean and convert 'Installs' column
apps_df['Installs'] = apps_df['Installs'].astype(str).str.replace('[+,]', '', regex=True)
apps_df['Installs'] = pd.to_numeric(apps_df['Installs'], errors='coerce')

# 4. Clean and convert 'Price' column
# '$' is a regex end-of-string anchor, so remove it as a literal character (regex=False)
apps_df['Price'] = apps_df['Price'].astype(str).str.replace('$', '', regex=False)
apps_df['Price'] = pd.to_numeric(apps_df['Price'], errors='coerce')

# 5. Convert 'Reviews' column to numeric
apps_df['Reviews'] = pd.to_numeric(apps_df['Reviews'], errors='coerce')

# 6. Drop rows with missing values in key columns
apps_df.dropna(subset=['Type', 'Content Rating', 'Genres', 'Installs'], inplace=True)

# 7. Handle 'Android Ver' and 'Size' columns
# Assign the results back instead of using inplace=True on a column selection
apps_df['Android Ver'] = apps_df['Android Ver'].fillna("Varies with device")
apps_df['Size'] = apps_df['Size'].replace('Varies with device', np.nan).ffill()

# 8. Convert 'Last Updated' to datetime
apps_df['Last Updated'] = pd.to_datetime(apps_df['Last Updated'], errors='coerce')

print("Apps dataset wrangled successfully!")

# -----------------------------------------

# Make the Reviews dataset analysis-ready

# 1. Drop duplicate rows
reviews_df.drop_duplicates(inplace=True)

# 2. Drop rows with missing critical fields
reviews_df.dropna(subset=['Translated_Review', 'Sentiment', 'Sentiment_Polarity', 'Sentiment_Subjectivity'], inplace=True)

# 3. Clean text in 'Translated_Review' (optional step for NLP or modeling)
# Lowercase and remove extra spaces
reviews_df['Translated_Review'] = reviews_df['Translated_Review'].astype(str).str.strip().str.lower()

print("Reviews dataset wrangled successfully!")


### What all manipulations have you done and insights you found?

## Data Cleaning and Manipulation Summary

To prepare the dataset for analysis, several data wrangling and transformation steps were performed on the Apps dataset:

### Manipulations Performed:

1. **Removed Duplicate Records:**
   - Dropped duplicate rows to ensure no redundancy in the dataset.

2. **Handled Missing and Invalid Ratings:**
   - Removed apps with invalid ratings (greater than 5).
   - Filled missing ratings with the mean value.

3. **Cleaned and Converted the 'Installs' Column:**
   - Removed unwanted characters like commas and plus signs.
   - Converted the column to numeric type for analysis.

4. **Cleaned and Converted the 'Price' Column:**
   - Removed the dollar symbol.
   - Converted the column to numeric type.

5. **Converted 'Reviews' to Numeric:**
   - Ensured that the 'Reviews' column is in proper numeric format.

6. **Dropped Rows with Critical Missing Values:**
   - Removed rows with missing values in essential columns like `Type`, `Content Rating`, `Genres`, and `Installs`.

7. **Handled 'Size' and 'Android Ver' Columns:**
   - Replaced 'Varies with device' with NaN in the 'Size' column and forward-filled missing values.
   - Filled missing values in 'Android Ver' with a default message.

8. **Parsed Date Format:**
   - Converted the 'Last Updated' column to proper datetime format for time-based analysis.
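
The text-to-number conversions in steps 3 and 4 can be checked on a few hand-written values. This is a minimal sketch; the `raw` frame is made up and only mimics the format of the `Installs` and `Price` columns.

```python
import pandas as pd

# Hypothetical raw values in the same text format as the Play Store columns
raw = pd.DataFrame({
    "Installs": ["10,000+", "500+", "1,000,000+"],
    "Price": ["0", "$4.99", "$0.99"],
})

# Strip '+' and ',' with a regex character class, then convert to numbers
installs = pd.to_numeric(
    raw["Installs"].str.replace("[+,]", "", regex=True), errors="coerce"
)

# '$' is a regex anchor, so it is removed as a literal string with regex=False
price = pd.to_numeric(
    raw["Price"].str.replace("$", "", regex=False), errors="coerce"
)

print(installs.tolist())
print(price.tolist())
```

Because `errors="coerce"` turns anything unparseable into `NaN`, it is worth counting the nulls after conversion to catch a cleaning pattern that did not match.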

---

## Key Insights Discovered

1. **Free vs Paid Apps:**
   - Free apps dominate the Play Store and generally have a higher number of installs compared to paid apps.

2. **Top Categories:**
   - Categories such as **Family**, **Game**, and **Tools** contain the most apps (Chart 3).

3. **User Ratings:**
   - Many apps have ratings between 4.0 and 4.5, indicating overall user satisfaction.
   - Ratings above 4.5 are less frequent, showing a higher standard required for top ratings.

4. **Impact of Price:**
   - Higher-priced apps tend to have lower install counts, indicating that users prefer free or low-cost apps.

5. **Size Matters:**
   - Very large apps (>100MB) tend to have fewer installs, possibly due to storage limitations on devices.

6. **Sentiment Analysis (from Reviews Dataset):**
   - Apps with more **positive reviews** generally correlate with higher install numbers and better ratings.
   - Negative sentiment highlights issues like bugs, poor UI, or frequent crashes.

---

These insights can guide developers and businesses in making data-driven decisions regarding app development, pricing strategies, and user experience improvements.


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:

# Chart 1 - Apps Dataset: Distribution of App Ratings

plt.figure(figsize=(8, 5))
sns.histplot(apps_df['Rating'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of App Ratings')
plt.xlabel('Rating')
plt.ylabel('Number of Apps')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
# Chart 1 - Reviews Dataset: Distribution of Sentiments

plt.figure(figsize=(6, 4))
sns.countplot(data=reviews_df, x='Sentiment', palette='pastel')
plt.title('Distribution of User Sentiments in Reviews')
plt.xlabel('Sentiment')
plt.ylabel('Number of Reviews')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

- **Apps Dataset – Rating Distribution:**
  The histogram of app ratings was chosen to understand how users perceive and rate apps on the Play Store. Rating is a critical success metric that directly influences app visibility, user trust, and download count.

- **Reviews Dataset – Sentiment Distribution:**
  A countplot of user sentiments was selected to summarize the tone of user feedback (positive, neutral, negative). This chart helps assess the general user experience and satisfaction levels.


##### 2. What is/are the insight(s) found from the chart?

- **From App Ratings:**
  - Most apps have ratings between **3.5 and 4.5**, indicating generally good user experiences.
  - Very few apps receive a perfect 5.0 rating, suggesting that maintaining user satisfaction is challenging at scale.

- **From User Sentiments:**
  - The majority of reviews are **positive**, which suggests that users are mostly satisfied with their app experiences.
  - A noticeable number of **negative reviews** also exist, which highlights areas for improvement (e.g., bugs, crashes, poor UI/UX).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, absolutely.**

- Businesses can focus on **improving average app ratings** by acting on feedback, which can:
  - Increase visibility in Play Store rankings.
  - Improve user trust and conversion rates.
  - Drive more organic installs.

- Positive sentiment analysis helps identify **what users like**, allowing businesses to reinforce these features in future updates or marketing campaigns.

- Negative sentiment patterns point to **specific pain points**, giving clear areas for enhancement that can reduce churn.

---
**Yes, some insights do indicate potential for negative growth if left unaddressed:**

- Apps with **ratings below 3.0** or with a **large share of negative reviews** risk being:
  - Removed from top search results.
  - Avoided by new users due to poor perception.
  - Criticized publicly, leading to brand damage.

- **Overlooking user sentiment trends** may result in recurring issues going unresolved, leading to user frustration and app uninstallations.

> **Conclusion:** Addressing these negative signals proactively can prevent growth stagnation and improve the app's long-term performance on the Play Store.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Chart 2A: Count of Free vs Paid Apps (Apps Dataset)
plt.figure(figsize=(6, 4))
sns.countplot(x='Type', data=apps_df, palette='Set2')
plt.title('Count of Free vs Paid Apps')
plt.ylabel('Number of Apps')
plt.xlabel('App Type')
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

# Chart 2B: Distribution of User Sentiments (Reviews Dataset)
plt.figure(figsize=(6, 4))
sns.countplot(x='Sentiment', data=reviews_df, palette='pastel')
plt.title('Distribution of User Sentiments in Reviews')
plt.xlabel('Sentiment')
plt.ylabel('Number of Reviews')
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This chart type (countplot) helps compare categories clearly.  
- In the **Apps dataset**, it shows how apps are priced (Free vs Paid).
- In the **Reviews dataset**, it shows how users feel about apps (Positive, Neutral, Negative).

These are simple yet powerful charts to understand trends quickly.

##### 2. What is/are the insight(s) found from the chart?

- Most apps in the Play Store are **Free** and only a small share are **Paid**. This shows that developers favor the free model; Chart 4 looks at whether users also install free apps more often.
- Most reviews are **Positive**, which means users are generally happy.
- There are still a good number of **Negative** reviews, showing areas that need improvement.
- **Neutral** reviews show mixed feelings or lack of strong opinions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes.**
- Developers should **launch free apps** to attract more users and use in-app purchases or ads to earn revenue.
- Positive user sentiment helps **build trust**, improves ratings, and increases installs.
- Negative reviews help identify problems that can be fixed to improve the app.

Understanding how users feel and how apps are priced helps make better business decisions.

---
**Yes.**
- If a developer chooses a **paid model** without enough value, users may avoid downloading the app.
- Ignoring **negative feedback** can cause users to uninstall the app, leave bad ratings, or discourage others from downloading.

To grow successfully, it's important to respond to what users want and how they feel.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(12,6))
top_categories = apps_df['Category'].value_counts().head(10)
sns.barplot(x=top_categories.index, y=top_categories.values, palette='mako')
plt.xticks(rotation=45)
plt.title('Top 10 App Categories')
plt.ylabel('Number of Apps')
plt.xlabel('Category')
plt.show()


##### 1. Why did you pick the specific chart?

This chart shows which categories have the most apps, helping identify market saturation and competition.

##### 2. What is/are the insight(s) found from the chart?

Top categories include FAMILY, GAME, and TOOLS.

These are highly populated, suggesting either popularity or over-saturation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Will the gained insights help creating a positive business impact?

Yes.

New developers can use this data to decide which category to target — avoid saturated ones or compete with a niche twist.

Are there any insights that lead to negative growth?

Yes.

Publishing an app in an oversaturated category without a strong USP may result in poor visibility.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(8,5))
sns.boxplot(x='Type', y='Installs', data=apps_df, palette='Set3')
plt.yscale('log')
plt.title('Installs by App Type')
plt.xlabel('Type')
plt.ylabel('Installs (log scale)')
plt.show()


##### 1. Why did you pick the specific chart?

To compare how app installs vary between free and paid types, using a boxplot to show distribution and outliers.

##### 2. What is/are the insight(s) found from the chart?

Free apps consistently get more installs.

Paid apps show both a lower median and a lower maximum install count.
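
The boxplot comparison can be backed with summary numbers. The sketch below assumes a cleaned numeric `Installs` column and uses made-up values in `toy_apps`, so the figures themselves are not results from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the cleaned apps_df
toy_apps = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Installs": [1_000_000, 50_000, 10_000, 1_000, 500],
})

# The median is more robust than the mean for heavily skewed install counts
install_summary = toy_apps.groupby("Type")["Installs"].agg(["median", "max", "count"])
print(install_summary)
```

Reporting the median next to the maximum separates the typical app from the few outliers that dominate a mean.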

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Will the gained insights help creating a positive business impact?

Yes.

It encourages developers to use free models with monetization inside the app (e.g., freemium, ads).

Are there any insights that lead to negative growth?

Yes.

If apps are priced upfront, they face download resistance unless brand trust exists.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10,5))
paid_apps = apps_df[apps_df['Type'] == 'Paid']
sns.histplot(paid_apps['Price'], bins=30, color='coral')
plt.title('Price Distribution of Paid Apps')
plt.xlabel('Price ($)')
plt.ylabel('Number of Apps')
plt.show()


##### 1. Why did you pick the specific chart?

To examine how paid apps are priced, and check for outliers or unusual pricing patterns.

##### 2. What is/are the insight(s) found from the chart?

Most paid apps cost below $10.

A few apps are priced above $100, which is extremely rare and potentially mispriced.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Will the gained insights help creating a positive business impact?

Yes.

Helps developers set realistic prices under $10 to attract buyers. Shows the pricing comfort zone.

Are there any insights that lead to negative growth?

Yes.

Overpricing without justification can severely hurt sales and user reviews.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(7,4))
sns.countplot(data=apps_df, x='Content Rating', palette='coolwarm')
plt.title('Content Rating Distribution')
plt.xlabel('Content Rating')
plt.ylabel('Number of Apps')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

To understand the age suitability of most apps and how developers are targeting user age groups.

##### 2. What is/are the insight(s) found from the chart?

Most apps are rated "Everyone".

Very few are restricted to Teen, Mature 17+, or Adults only.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**
Yes. Targeting “Everyone” widens the potential audience, which can increase downloads and ad revenue.

**Negative Growth?**
Yes. If an app truly requires age restrictions but is marked “Everyone,” it may be flagged or penalized.



#### Chart - 7

In [None]:
# Chart - 7 visualization code
top_installed = apps_df[['App', 'Installs']].sort_values(by='Installs', ascending=False).drop_duplicates('App').head(10)
plt.figure(figsize=(10,6))
sns.barplot(x='Installs', y='App', data=top_installed, palette='viridis')
plt.title('Top 10 Most Installed Apps')
plt.xlabel('Install Count')
plt.ylabel('App Name')
plt.show()


##### 1. Why did you pick the specific chart?

To identify the most popular apps by install count.

##### 2. What is/are the insight(s) found from the chart?

Apps with utility or entertainment functions dominate. Install counts reach 1B+ for the top apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**
Helps understand market-winning app types and where new entrants can focus.

**Negative Growth?**
An app that enters a space with a dominant player and offers no differentiation will struggle to compete.



#### Chart - 8

In [None]:
# Chart - 8 visualization code
top_cats = apps_df['Category'].value_counts().nlargest(5).index
plt.figure(figsize=(10,6))
sns.boxplot(x='Category', y='Rating', data=apps_df[apps_df['Category'].isin(top_cats)], palette='Pastel1')
plt.title('App Ratings by Top Categories')
plt.show()



##### 1. Why did you pick the specific chart?

To compare user satisfaction across major categories.

##### 2. What is/are the insight(s) found from the chart?

Most categories have consistent ratings ~4.2–4.5. GAME category has slightly more variation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**
New apps in "stable" categories must maintain quality; games need exceptional polish to stand out.

**Negative Growth?**
High competition + lower rating = less visibility in stores.


#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(8,6))
sns.scatterplot(data=apps_df, x='Reviews', y='Installs', alpha=0.5)
plt.title('Reviews vs Installs')
plt.xlabel('Reviews')
plt.ylabel('Installs')
plt.xscale('log')
plt.yscale('log')
plt.show()


##### 1. Why did you pick the specific chart?

To check if more installs mean more user reviews.

##### 2. What is/are the insight(s) found from the chart?

Strong positive correlation between installs and reviews.
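
The visual impression from the log-log scatter plot can be quantified with a correlation coefficient on the log-transformed values. This is a sketch under the assumption that `Reviews` and `Installs` are already numeric; the `toy_apps` values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the cleaned apps_df
toy_apps = pd.DataFrame({
    "Reviews": [10, 100, 1_000, 10_000, 0],
    "Installs": [1_000, 10_000, 100_000, 1_000_000, 50],
})

# Zero counts have no logarithm, so keep only strictly positive rows
positive = toy_apps[(toy_apps["Reviews"] > 0) & (toy_apps["Installs"] > 0)]

# Pearson correlation of the log10 values mirrors the log-log scatter plot
log_corr = np.log10(positive["Reviews"]).corr(np.log10(positive["Installs"]))
print(round(log_corr, 3))
```

Correlating the log values rather than the raw counts keeps a handful of very large apps from dominating the coefficient.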

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**
User reviews reflect an active user base. Apps should prompt users to review.

**Negative Growth?**
Low review count despite high installs may indicate poor engagement.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(8,6))
sns.scatterplot(data=apps_df[apps_df['Type']=='Paid'], x='Price', y='Rating')
plt.title('Price vs Rating for Paid Apps')
plt.xlabel('Price ($)')
plt.ylabel('Rating')
plt.show()


##### 1. Why did you pick the specific chart?

To evaluate if higher price means better quality.

##### 2. What is/are the insight(s) found from the chart?

No significant correlation. Some highly priced apps have average or low ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**
Pricing must match perceived value. High price doesn’t guarantee satisfaction.

**Negative Growth?**
Overpriced apps with low ratings will see higher uninstall or refund rates.


#### Chart - 11

In [None]:
# Chart - 11 visualization code
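# Convert 'Size' to megabytes: 'M' values are already in MB, 'k' values are kilobytes
size_str = apps_df['Size'].astype(str)
size_num = pd.to_numeric(size_str.str.replace('[Mk]', '', regex=True), errors='coerce')
apps_df['Size_MB'] = np.where(size_str.str.endswith('k'), size_num / 1024, size_num)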
plt.figure(figsize=(8,5))
sns.histplot(apps_df['Size_MB'], bins=30, color='teal')
plt.title('Distribution of App Sizes (MB)')
plt.xlabel('Size (MB)')
plt.ylabel('Number of Apps')
plt.show()


##### 1. Why did you pick the specific chart?

To understand app size trends for user accessibility.

##### 2. What is/are the insight(s) found from the chart?

Most apps fall under 30 MB. Heavy apps (>100MB) are rare.
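
The "most apps fall under 30 MB" reading can be backed by a direct calculation instead of being estimated from the histogram bars. The sketch below assumes a numeric `Size_MB` column in megabytes with missing values stored as NaN. The sample values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sizes in MB; the notebook would use apps_df['Size_MB'].
size_mb = pd.Series([3.5, 12.0, 25.0, 48.0, 0.2, 99.0, np.nan, 8.1], name='Size_MB')

valid = size_mb.dropna()                  # ignore apps with unknown size
share_under_30 = (valid < 30).mean()      # fraction of apps below 30 MB
summary = valid.describe(percentiles=[0.5, 0.9])

print(f"Share of apps under 30 MB: {share_under_30:.1%}")
print(summary)
```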

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**
Optimize app size to suit devices with limited space and slower connections.

**Negative Growth?**
Large app sizes can discourage installs due to data/storage limitations.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
sns.countplot(data=reviews_df, x='Sentiment', palette='Set1')
plt.title('Sentiment Distribution in App Reviews')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

To measure user perception of apps.

##### 2. What is/are the insight(s) found from the chart?

Most reviews are Positive, followed by Neutral. Negative is the least.
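
The ordering of the sentiment classes is easy to verify numerically. A minimal sketch, using a made-up sample in place of `reviews_df` (the proportions here are not the real ones):

```python
import pandas as pd

# Hypothetical sample; the notebook would use reviews_df['Sentiment'].
sample_reviews = pd.DataFrame({
    'Sentiment': ['Positive', 'Positive', 'Positive',
                  'Negative', 'Negative', 'Neutral', None],
})

# value_counts ignores missing values by default and sorts by frequency.
counts = sample_reviews['Sentiment'].value_counts()
shares = sample_reviews['Sentiment'].value_counts(normalize=True).round(3)

print(counts)
print(shares)
```

Running the same two calls on the real `reviews_df` gives the exact counts and shares behind the bar chart, which makes the ranking of Neutral and Negative unambiguous.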

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**
Strong positive sentiment indicates a good user experience.

**Negative Growth?**
High negative sentiment can lead to poor ratings and churn.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
sentiment_mean = reviews_df.groupby('App')['Sentiment_Polarity'].mean().reset_index()
merged_df = pd.merge(apps_df, sentiment_mean, on='App')
avg_polarity = merged_df.groupby('Category')['Sentiment_Polarity'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(10,6))
avg_polarity.plot(kind='bar', color='orchid')
plt.title('Top 10 Categories by Average Sentiment Polarity')
plt.ylabel('Average Sentiment Polarity')
plt.xlabel('Category')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

To see which categories generate the most positive feedback.

##### 2. What is/are the insight(s) found from the chart?

Utility and lifestyle categories have the highest average polarity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**
Useful for targeting emotionally satisfied users and categories with high word-of-mouth potential.
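
A category average can be dominated by one or two reviewed apps, and duplicate app rows in the apps table would be counted more than once after the merge. The sketch below shows one possible safeguard on made-up data: deduplicate apps before merging and keep only categories with a minimum number of reviewed apps. The threshold `min_apps` and the sample values are assumptions.

```python
import pandas as pd

# Hypothetical samples standing in for apps_df and reviews_df.
sample_apps = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D', 'A'],            # 'A' appears twice on purpose
    'Category': ['TOOLS', 'TOOLS', 'GAME', 'EVENTS', 'TOOLS'],
})
sample_reviews = pd.DataFrame({
    'App': ['A', 'A', 'B', 'C', 'C', 'D'],
    'Sentiment_Polarity': [0.2, 0.4, 0.1, 0.5, 0.3, 0.9],
})

# Mean polarity per app, then one row per app before merging.
sentiment_mean = sample_reviews.groupby('App', as_index=False)['Sentiment_Polarity'].mean()
unique_apps = sample_apps.drop_duplicates(subset='App')
merged = unique_apps.merge(sentiment_mean, on='App', how='inner')

# Number of reviewed apps and average polarity per category.
category_polarity = merged.groupby('Category')['Sentiment_Polarity'].agg(
    apps='count', avg_polarity='mean'
)

min_apps = 2  # illustrative threshold
reliable = (
    category_polarity[category_polarity['apps'] >= min_apps]
    .sort_values('avg_polarity', ascending=False)
)
print(reliable)
```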

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(8,6))
corr = apps_df[['Rating', 'Reviews', 'Installs', 'Price']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

To understand numeric relationships between key variables.

##### 2. What is/are the insight(s) found from the chart?

Strong correlation: Installs <-> Reviews

Weak correlation: Rating vs. each of the other variables (Reviews, Installs, Price)
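
`DataFrame.corr()` computes Pearson correlation by default, which is sensitive to the heavy right skew of `Reviews` and `Installs`. As a robustness check, the same matrix can be computed on ranks (equivalent to Spearman correlation) or on log-transformed counts. The sketch below uses a small invented sample. The variable names `spearman` and `pearson_log` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same numeric columns used in the heatmap.
sample = pd.DataFrame({
    'Rating': [4.1, 3.9, 4.5, 4.3, 4.0, 4.6],
    'Reviews': [120, 3_000, 45_000, 900, 2_000_000, 15],
    'Installs': [10_000, 100_000, 1_000_000, 50_000, 100_000_000, 1_000],
    'Price': [0.0, 0.0, 0.0, 2.99, 0.0, 4.99],
})

# Default Pearson correlation on the raw values.
pearson_raw = sample.corr()

# Spearman correlation, computed as Pearson correlation on ranks.
spearman = sample.rank().corr()

# Pearson correlation after compressing the skewed counts with log1p.
pearson_log = sample.assign(
    Reviews=np.log1p(sample['Reviews']),
    Installs=np.log1p(sample['Installs']),
).corr()

print(spearman.round(2))
```

Either matrix can be passed to `sns.heatmap` in place of `corr`. If the Installs-Reviews relationship stays strong under ranks and logs, it is not just an artifact of a few very large apps.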

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(apps_df[['Rating', 'Reviews', 'Installs', 'Price']])
plt.suptitle('Pairplot: Relationship Between Numeric Features', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

To explore pairwise relationships between all numeric fields.

##### 2. What is/are the insight(s) found from the chart?

Confirms the trends seen in the heatmap. Price shows minimal relationship with the other variables.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

The primary business objective of this project was to identify key factors that influence the success, popularity, and user engagement of mobile applications on the Google Play Store.

After performing extensive data cleaning, visualization, and analysis on both the apps dataset and the user reviews dataset, the following actionable solutions were derived:

### 1. Focus on Free App Model with Monetization Strategy
- The majority of successful apps are free.
- To maximize user downloads, developers should launch apps as **free** and monetize using **ads** or **in-app purchases**.

### 2. Improve App Ratings to Enhance Visibility
- Most top-rated apps fall between **4.0 and 4.5** stars.
- **User ratings** directly impact app ranking and discoverability.
- Encourage users to rate apps positively through improved UI/UX and feature updates.

### 3. Pay Attention to User Sentiment in Reviews
- Positive reviews correlate with high installs and good ratings.
- **Negative sentiment** often mentions bugs, performance issues, or poor design.
- Use sentiment analysis to identify pain points and address them in updates.
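
As a rough illustration of the last point, apps can be ranked by their share of negative reviews so that the worst pain points are reviewed first. This is a sketch on made-up data, assuming the `App` and `Sentiment` columns of the reviews dataset. The `min_reviews` cutoff is a hypothetical guard against apps with too few reviews.

```python
import pandas as pd

# Hypothetical sample; the notebook would use reviews_df.
sample_reviews = pd.DataFrame({
    'App': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'Sentiment': ['Negative', 'Negative', 'Positive', 'Neutral',
                  'Positive', 'Positive', 'Negative', 'Negative'],
})

min_reviews = 3  # illustrative cutoff

# Per app: number of reviews and fraction labelled Negative.
per_app = (
    sample_reviews
    .assign(is_negative=sample_reviews['Sentiment'].eq('Negative'))
    .groupby('App')['is_negative']
    .agg(reviews='count', negative_share='mean')
)

pain_points = (
    per_app[per_app['reviews'] >= min_reviews]
    .sort_values('negative_share', ascending=False)
)
print(pain_points)
```

The review text of the top-ranked apps can then be read or keyword-searched to find recurring complaints such as crashes or slow performance.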

### 4. Target Popular Yet Underserved Categories
- Categories like **Games**, **Tools**, and **Productivity** dominate, but are also highly competitive.
- Explore niche or emerging categories (like Parenting, Events) to launch apps with less competition and high visibility.

### 5. Optimize App Size and Compatibility
- Users prefer smaller apps that run on a wider range of Android versions.
- Optimize the app’s **size** and **minimum Android requirements** to maximize reach.

---

### Final Takeaway

By aligning app development strategies with user preferences and market trends identified in this analysis, businesses and developers can:
- Improve user satisfaction
- Increase download rates
- Boost retention and revenue
- Gain competitive advantage in the crowded mobile app marketplace


# **Conclusion**

In this project, I explored and analyzed data from the Google Play Store to understand what makes an app successful in terms of ratings, downloads, and user sentiment.

I worked with two datasets:
- The **Apps Dataset**, which included app details like category, rating, installs, price, etc.
- The **User Reviews Dataset**, which captured user feedback and sentiment.

### Key Steps Taken:
- Cleaned and prepared both datasets for analysis.
- Visualized key trends using charts (e.g., rating distribution, free vs paid apps, sentiment analysis).
- Extracted useful insights by combining app metrics and user sentiments.

### Key Takeaways:
- **Free apps** dominate the store and tend to have higher install rates.
- **High ratings** and **positive reviews** play a vital role in an app's visibility and success.
- **Negative reviews** provide critical feedback that should be acted upon quickly.
- Choosing the right **app category**, pricing model, and maintaining app quality are essential for growth.

### Final Thought:
This analysis highlights how data-driven decisions can help developers and businesses:
- Understand user needs,
- Improve app performance,
- Build better products, and
- Stay competitive in the dynamic mobile app market.

By continuously listening to user feedback and tracking app performance, long-term app success can be achieved.



In [None]:
# Final check after cleaning - Apps Dataset
print("\nCleaned Apps Dataset Info:")
apps_df.info()

# Final check after cleaning - Reviews Dataset
print("\nCleaned Reviews Dataset Info:")
reviews_df.info()
