<a href="https://colab.research.google.com/github/muhammed-jaseef/EDA-GooglePlaystore/blob/main/Copy_of_Sample_EDA_Submission_Template2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** Muhammed Jaseef K


# **Project Summary -**

The Google Play Store is a massive digital marketplace, hosting a vast array of applications catering to different user needs. The dataset under analysis provides insights into various app attributes, including category, rating, size, number of installs, price, and user reviews. By analyzing these factors, developers and businesses can identify the key determinants of app engagement and success. The analysis also incorporates user review data to gain deeper insights into customer sentiments and preferences.

**Objectives**

The primary objectives of this analysis are:

* To identify the characteristics of successful apps.

* To explore user engagement through ratings and reviews.

* To determine the impact of pricing and app size on downloads.

* To analyze sentiment in customer reviews to understand user satisfaction and complaints.

# **GitHub Link -**

# **Problem Statement**


The Google Play Store hosts a vast ecosystem of mobile applications, each competing for user attention and engagement. Understanding the factors that drive app success—such as category, rating, size, pricing model, and user reviews—can provide valuable insights for developers and businesses aiming to optimize their apps for higher downloads, better user retention, and increased revenue.

This project aims to analyze Play Store app data and customer reviews to identify key determinants of app success. By leveraging data-driven techniques, we will explore how various attributes impact app engagement and popularity, ultimately helping developers make informed decisions to enhance their apps and capture a larger market share.

#### **Define Your Business Objective?**

The objective of this analysis is to extract actionable insights from Google Play Store app data and customer reviews to help developers and businesses optimize their applications for better engagement, higher downloads, and increased revenue.

By identifying key factors influencing app success (such as app category, user ratings, pricing models, review sentiments, and feature offerings), we aim to:

* Enhance App Performance: Understand what makes top-performing apps successful and apply those learnings to improve app features, usability, and design.

* Optimize User Engagement: Analyze user reviews and ratings to identify common pain points, feature requests, and areas for improvement.

* Increase Market Competitiveness: Help developers position their apps effectively in the marketplace by identifying trends and consumer preferences.

* Monetization Strategy: Determine how pricing models (free, paid, in-app purchases) influence app popularity and revenue generation.

* Improve Marketing Strategies: Provide data-driven recommendations for targeted advertising and user acquisition strategies.

This analysis will help developers and businesses make informed, strategic decisions to maximize app engagement, customer satisfaction, and profitability.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each chart the following questions are answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
reviews_df = pd.read_csv('/content/User Reviews (1).csv')
data_df = pd.read_csv('/content/Play Store Data (1).csv')

### Dataset First View

In [None]:
# Dataset First Look
reviews_df.head(5)

In [None]:
data_df.head(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, cols = reviews_df.shape
print(f"Review Dataset contains {rows} rows and {cols} columns.")

In [None]:
rows, cols = data_df.shape
print(f"Play Store dataset contains {rows} rows and {cols} columns.")

### Dataset Information

In [None]:
# Dataset Info
data_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = data_df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = data_df.isnull().sum()
print(missing_values)

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(data_df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

This dataset contains information about mobile applications available on the Google Play Store.
It provides details such as the application name, category, user ratings, number of reviews, size of the app, number of installs, type (Free or Paid), price, content rating (age group suitability), and genres.
This dataset is useful for generating business insights related to app performance, user behavior, and market trends on the Google Play Store.

After exploring the dataset, I observed the following:

* The Rating column contains a significant number of null (missing) values, which indicates incomplete user feedback for many apps.

* The dataset has 483 duplicate rows, which need to be handled during data cleaning to avoid incorrect analysis.
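Raw counts of nulls and duplicates are easier to judge as percentages of the dataset. The following is a minimal sketch on a made-up `toy_df` (a hypothetical stand-in for `data_df`, not the real data) showing one way to express both as shares:

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "App": ["A", "B", "C", "C", "D"],
    "Rating": [4.1, np.nan, 3.9, 3.9, np.nan],
    "Type": ["Free", "Free", "Paid", "Paid", "Free"],
})

# Percentage of missing values per column, largest first.
missing_pct = (toy_df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)

# Percentage of rows that are exact duplicates of an earlier row.
duplicate_pct = toy_df.duplicated().mean() * 100
print(f"Duplicate rows: {duplicate_pct:.1f}%")
```

Swapping `toy_df` for `data_df` should give the same summary for the Play Store data, assuming the column layout shown earlier.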

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data_df.columns

In [None]:
# Dataset Describe
data_df.describe()

### Variables Description

* App: Name of the mobile application.

* Category: Primary category of the app (e.g., Games, Education, Productivity).

* Rating: Overall user rating of the app given by users (numeric value).

* Reviews: Total number of user reviews received by the app.

* Size: Size of the app file (example: 15MB, 50MB).

* Installs: Total number of times the app has been installed by users.

* Type: Indicates whether the app is Free or Paid.

* Price: Price of the app (0 for Free apps and numeric value for Paid apps).

* Content Rating: Age group targeted by the app (e.g., Everyone, Teen, Adult).

* Genres: Additional genres/categories of the app apart from the main category.

* Translated_Review: User review of the app (cleaned, tokenized, and translated into English).

* Sentiment: Sentiment of the review categorized as Positive, Negative, or Neutral.

* Sentiment_Polarity: Numeric score showing sentiment polarity ranging from -1 (most negative) to 1 (most positive).

* Sentiment_Subjectivity: Numeric score indicating how subjective or opinion-based the review is, ranging from 0 (objective) to 1 (subjective).



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = data_df.nunique()
print(unique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

Now let's build a per-app sentiment DataFrame from reviews_df using the following steps:

* Remove rows where Sentiment_Polarity or Sentiment_Subjectivity is null.
* Group by the App column.
* Calculate the mean of Sentiment_Polarity and Sentiment_Subjectivity.


In [None]:
# Remove rows with NaN values in Sentiment_Polarity or Sentiment_Subjectivity
reviews_df_sub = reviews_df.dropna(subset=['Sentiment_Polarity', 'Sentiment_Subjectivity'])

# Group by 'App' and compute the average Sentiment_Polarity and Sentiment_Subjectivity
df_avg_sentiment = reviews_df_sub.groupby('App').agg({
    'Sentiment_Polarity': 'mean',
    'Sentiment_Subjectivity': 'mean'
}).reset_index()

In [None]:
df_avg_sentiment

In [None]:
common_apps = set(df_avg_sentiment['App']).intersection(set(data_df['App']))

print("Number of common Apps:", len(common_apps))


Now let's make a DataFrame combining df_avg_sentiment and data_df.

In [None]:
# Merge both DataFrames on 'App' (outer join keeps apps that appear in either DataFrame)
df = pd.merge(data_df, df_avg_sentiment, on='App', how='outer')
df
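To see how many apps actually matched in an outer merge, pandas offers the `indicator=True` option, which adds a `_merge` column. The sketch below uses two small made-up frames (`apps` and `sentiment` are hypothetical stand-ins for `data_df` and `df_avg_sentiment`). It also illustrates that dropping NaN rows after an outer merge leaves only the apps present on both sides, much like an inner merge would, assuming the source frames have no other missing values.

```python
import pandas as pd

# Hypothetical stand-ins for data_df and df_avg_sentiment.
apps = pd.DataFrame({"App": ["A", "B", "C"], "Rating": [4.1, 3.8, 4.5]})
sentiment = pd.DataFrame({"App": ["B", "C", "D"],
                          "Sentiment_Polarity": [0.2, 0.5, -0.1]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(apps, sentiment, on="App", how="outer", indicator=True)
print(merged["_merge"].value_counts())

# Outer merge followed by dropna keeps only the apps found in both frames.
outer_dropna = merged.drop(columns="_merge").dropna().reset_index(drop=True)
inner = pd.merge(apps, sentiment, on="App", how="inner")
print(sorted(outer_dropna["App"]) == sorted(inner["App"]))
```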

(1)  Removing Missing Data

In [None]:
df_cleaned= df.dropna().reset_index(drop=True)
print(df_cleaned.isnull().sum())  # Check if any NaNs remain
print(df_cleaned.shape)  # Verify the new shape

In [None]:
data_df_cleaned= data_df.dropna().reset_index(drop=True)
print(data_df_cleaned.isnull().sum())  # Check if any NaNs remain
print(data_df_cleaned.shape)  # Verify the new shape

(2) Handling duplicates

In [None]:
duplicate_count = df_cleaned.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

In [None]:
duplicate_count = data_df_cleaned.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

Removing the duplicates

In [None]:
data_df_final = data_df_cleaned.drop_duplicates().reset_index(drop=True)
print(data_df_final.shape)

In [None]:
df_final = df_cleaned.drop_duplicates().reset_index(drop=True)
print(df_final.shape)

(3) Fixing data types and cleaning values

In [None]:
df_final.dtypes

For df_final

In [None]:
# 1. Reviews to int
df_final['Reviews'] = df_final['Reviews'].astype(int)

# 2. Clean Size Column
def convert_size(size):
    if isinstance(size, str):
        if 'M' in size:
            return float(size.replace('M', ''))
        elif 'k' in size:
            return float(size.replace('k', '')) / 1024
        else:
            return np.nan
    else:
        return size  # If already float or nan


df_final['Size'] = df_final['Size'].apply(convert_size)
df_final['Size'] = df_final['Size'].fillna(df_final['Size'].median())


# 3. Clean Installs Column
df_final['Installs'] = df_final['Installs'].astype(str)  # Convert all values to string
df_final['Installs'] = df_final['Installs'].str.replace('[+,]', '', regex=True)  # Remove + and ,
df_final['Installs'] = df_final['Installs'].astype(int)  # Convert to int


# 4. Clean Price Column
df_final['Price'] = (
    df_final['Price']
    .astype(str)
    .str.replace(r'[^\d.]', '', regex=True)  # Remove all non-numeric chars
    .replace('', '0')  # Handle empty strings
    .pipe(pd.to_numeric, errors='coerce')
    .fillna(0)  # Only if you truly want NaN → 0
)


# 5. Convert Last Updated to Datetime
df_final['Last Updated'] = pd.to_datetime(df_final['Last Updated'])

# Verify Datatypes
print(df_final.dtypes)

for data_df_final

In [None]:
# 1. Reviews to int
data_df_final['Reviews'] = data_df_final['Reviews'].astype(int)

# 2. Clean Size Column
def convert_size(size):
    if isinstance(size, str):
        if 'M' in size:
            return float(size.replace('M', ''))
        elif 'k' in size:
            return float(size.replace('k', '')) / 1024
        else:
            return np.nan
    else:
        return size  # If already float or nan

data_df_final['Size'] = data_df_final['Size'].apply(convert_size)
data_df_final['Size'] = data_df_final['Size'].fillna(data_df_final['Size'].median())

# 3. Clean Installs Column
data_df_final['Installs'] = data_df_final['Installs'].astype(str)  # Convert all values to string
data_df_final['Installs'] = data_df_final['Installs'].str.replace('[+,]', '', regex=True)  # Remove + and ,
data_df_final['Installs'] = data_df_final['Installs'].astype(int)  # Convert to int

# 4. Clean Price Column
data_df_final['Price'] = (
    data_df_final['Price']
    .astype(str)
    .str.replace(r'[^\d.]', '', regex=True)  # Remove all non-numeric chars
    .replace('', '0')  # Handle empty strings
    .pipe(pd.to_numeric, errors='coerce')
    .fillna(0)  # Only if you truly want NaN → 0
)

# 5. Convert Last Updated to Datetime
data_df_final['Last Updated'] = pd.to_datetime(data_df_final['Last Updated'])

# Verify Datatypes
print(data_df_final.dtypes)

In [None]:
data_df_final.describe()

In [None]:
df_final.describe()

In [None]:
data_df_final['Price'].unique()

In [None]:
df_final['Price'].unique()

In [None]:
data_df_final['Type'].value_counts()

### What all manipulations have you done and insights you found?

To preprocess the dataset, I performed the following steps:

* Removed missing values to ensure data completeness.

* Removed duplicate records to avoid redundancy and ensure data accuracy.

* Converted the 'Reviews' column to a numeric format for accurate analysis.

* Converted the 'Last Updated' column to a standardized datetime format.

* Cleaned the 'Price' column by removing currency symbols and non-numeric characters, and converted it to a float type for numerical analysis.

* Cleaned the 'Installs' column by removing special characters like + and , and converted it to a numeric format.

* Cleaned the 'Size' column by converting values with an 'M' (megabytes) or 'k' (kilobytes) suffix into a numeric size in MB, and filled the remaining non-numeric entries with the median size.
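The cleaning rules listed above can also be written as small standalone helpers that are easy to test in isolation. This is a sketch, not the notebook's actual implementation: the function names `parse_size_mb`, `parse_installs`, and `parse_price` are hypothetical, and the sample strings are assumed to resemble the raw column values.

```python
import math
import re


def parse_size_mb(size):
    """Return the size in MB, or NaN when the text is not a plain size."""
    if not isinstance(size, str):
        return size  # already numeric or NaN
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([Mk])\s*", size)
    if match is None:
        return math.nan  # any text that is not '<number>M' or '<number>k'
    value, unit = float(match.group(1)), match.group(2)
    return value if unit == "M" else value / 1024


def parse_installs(installs):
    """Strip '+' and ',' from an install count and return an int."""
    return int(re.sub(r"[+,]", "", str(installs)))


def parse_price(price):
    """Strip everything except digits and '.', treating empty text as 0."""
    cleaned = re.sub(r"[^\d.]", "", str(price))
    return float(cleaned) if cleaned else 0.0


print(parse_size_mb("19M"), parse_size_mb("512k"))
print(parse_installs("1,000,000+"), parse_price("$4.99"))
```

Each helper can then be applied column-wise, for example with `Series.apply`, which keeps the parsing logic in one place instead of repeating it per DataFrame.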





The dataset shows that most apps are free, lightweight, and have good ratings with generally positive reviews. However, there is a huge variation in app installs and reviews, highlighting the presence of both very popular and less-known apps.

In [None]:
#from google.colab import files


#data_df_final.to_csv('data_df_final.csv', index=False)
#df_cleaned.to_csv('df_cleaned.csv', index=False)


#files.download('data_df_final.csv')
#files.download('df_cleaned.csv')


In [None]:
data_df_final.shape

In [None]:
df_final.shape

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Create the histogram with KDE
plt.figure(figsize=(8, 5))
sns.histplot(data=data_df_final, x='Rating', bins=8, kde=True, edgecolor='black')

# Labels and title
plt.xlabel('Ratings')
plt.ylabel('Count')
plt.title('Distribution of App Ratings with KDE')

plt.show()

##### 1. Why did you pick the specific chart?

For analysing the distribution of app ratings, I have used a histogram combined with a KDE (Kernel Density Estimation) plot. This type of visualization is most suitable because it helps in understanding the overall distribution pattern of ratings across all apps. The histogram shows the frequency of different rating values while the KDE curve provides a smooth estimation of the density. Since ratings are continuous numerical data, this chart allows us to easily observe the skewness, concentration, and spread of the ratings, giving a clear picture of how users are rating apps on the Play Store.

##### 2. What is/are the insight(s) found from the chart?

From the chart, it is evident that the majority of apps on the Play Store have ratings clustered between 4.0 and 4.5. Very few apps have ratings lower than 3.0, indicating that users generally rate apps positively unless there is a significant issue. The distribution is left-skewed (negatively skewed): most of the mass sits at the high end, with a long tail toward low ratings. The peak density is around 4.2 to 4.4, which suggests that most successful apps maintain ratings within this range.
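The direction of skew can be checked numerically rather than read off the plot. A minimal sketch, assuming a made-up `ratings` series that mimics the shape described (many high ratings, a few low ones): a negative skewness value means the long tail points toward low ratings.

```python
import pandas as pd

# Invented ratings: most values are high, a couple are low.
ratings = pd.Series([4.5, 4.3, 4.4, 4.1, 4.6, 4.2, 2.0, 3.0])

# Sample skewness: negative means a longer tail on the low side.
skewness = ratings.skew()
print(f"Skewness: {skewness:.2f}")

# For a left-skewed distribution the mean is pulled below the median.
print(f"Mean: {ratings.mean():.2f}, Median: {ratings.median():.2f}")
```

Running `data_df_final['Rating'].skew()` would give the corresponding figure for the real data.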

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this chart can help create a positive business impact. They highlight the importance of maintaining a rating above 4.0 for better visibility and user trust in the Play Store. App developers should focus on improving app performance, user interface, and overall customer satisfaction to keep ratings high. The chart also shows that very few apps are rated below 3.0, so an app in that range stands out negatively and likely suffers from poor quality, a bad user experience, or unresolved issues. Such low ratings can hurt an app's growth because they directly affect downloads, engagement, and retention. Developers should therefore monitor customer feedback and address issues promptly to avoid falling into the low-rating category, which could hinder their success in a highly competitive market.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Count the number of apps in each category
category_count = data_df_final['Category'].value_counts().head(10).reset_index()
category_count.columns = ['Category', 'App Count']

# Set the figure size
plt.figure(figsize=(12, 6))

# Create the bar plot
sns.barplot(data=category_count, x='Category', y='App Count')
# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Add chart title and labels
plt.title('Top 10 Categories by App Count', fontsize=16)
plt.xlabel('Category', fontsize=12)
plt.ylabel('Number of Apps', fontsize=12)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This bar chart was chosen because it is the most suitable visualization for comparing categorical data — in this case, the number of apps in each category. A bar chart clearly shows the distribution and market share of different app categories, making it easy to identify which categories dominate the Google Play Store and which ones have fewer apps. This visual representation helps us quickly understand the competitive landscape.

##### 2. What is/are the insight(s) found from the chart?

From the chart, the insight gained is that the 'FAMILY' and 'GAME' categories have the highest number of apps. These two categories account for the largest share of apps among the top 10, indicating very high competition. On the other hand, categories like 'MEDICAL', 'PHOTOGRAPHY', and 'LIFESTYLE' have significantly fewer apps compared to the top categories. This suggests that while popular categories attract more users, they are also harder to penetrate due to market saturation.
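How dominant the leading categories are can be quantified as a percentage share rather than judged from bar heights. The sketch below uses an invented `categories` series (not the real counts) and `value_counts(normalize=True)`:

```python
import pandas as pd

# Invented category labels, one per app, for illustration only.
categories = pd.Series(
    ["FAMILY"] * 4 + ["GAME"] * 3 + ["TOOLS", "MEDICAL", "LIFESTYLE"]
)

# Share of apps per category, in percent, largest first.
share = categories.value_counts(normalize=True).mul(100).round(1)
print(share)

# Combined share of the two largest categories.
top_two_share = share.head(2).sum()
print(f"Top two categories: {top_two_share:.1f}% of apps")
```

Applied to `data_df_final['Category']`, the same two lines would show whether the top two categories really hold most of the apps or only the largest slices.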

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can create a positive business impact because they guide app developers or businesses to identify potential opportunities. For example, entering a category with fewer apps may offer a better chance of visibility and user acquisition due to less competition. However, it also depends on the user demand within that category.

There is no direct insight that leads to negative growth from this chart, but there is an indirect risk — entering a highly saturated category like 'FAMILY' or 'GAME' without a unique selling proposition could result in lower visibility and higher marketing costs. Therefore, businesses must balance between choosing high-demand categories and exploring less competitive, niche categories based on their strengths and target audience.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# Count of Free vs Paid apps
app_type = data_df_final['Type'].value_counts()

# Plotting Pie Chart
plt.figure(figsize=(7,7))
plt.pie(app_type, labels=app_type.index, autopct='%1.1f%%')
plt.title('Free vs Paid Apps Distribution')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()


##### 1. Why did you pick the specific chart?

I selected a Pie Chart because it is the most suitable visualization for showing the proportion or percentage distribution of a categorical variable. In this case, the chart effectively displays the market share of Free and Paid apps in a simple and easily understandable format.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the majority of the apps in the market are Free (93.1%), while only a small percentage of apps are Paid (6.9%).
This highlights the clear preference of developers and businesses towards offering free apps, possibly due to higher downloads, user acquisition, and monetization through ads or in-app purchases.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights will help businesses make better monetization decisions.
Since most apps are free, launching a free app initially can help capture a larger user base. Businesses can later monetize through ads, premium features, or in-app purchases.

No negative growth is indicated directly from this insight. However, launching a Paid app in a market dominated by Free apps may result in lower downloads unless the app provides high value, unique features, or targets a niche audience willing to pay.
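The claim that paid apps may see fewer downloads can be checked by comparing install counts per type. A minimal sketch on made-up numbers (the `toy` frame is hypothetical); the median is used because install counts are heavily skewed:

```python
import pandas as pd

# Invented install counts for a handful of free and paid apps.
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Installs": [1_000_000, 50_000, 10_000, 1_000, 5_000],
})

# Number of apps and median installs for each type.
summary = toy.groupby("Type")["Installs"].agg(["count", "median"])
print(summary)
```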

#### Chart - 4

In [None]:
# Chart - 4 visualization code

# Filter only Paid apps
paid_apps = data_df_final[data_df_final['Type'] == 'Paid']

plt.figure(figsize=(10, 6))
sns.boxplot(y='Price', data=paid_apps)

plt.title('Price Distribution of Paid Apps')
plt.ylabel('Price (USD)')  # Price is plotted on the y-axis
plt.show()


##### 1. Why did you pick the specific chart?

The Box Plot is chosen here because it effectively displays the distribution of app prices, highlighting the central tendency, spread, and the presence of outliers. It provides a clear visual representation of the most common pricing range for paid apps, making it easier to identify the recommended pricing strategy.

##### 2. What is/are the insight(s) found from the chart?

From the chart, it is evident that the majority of paid apps are priced between 0 and 10 USD, with the median price being quite low. This indicates that users generally prefer affordable apps, and developers usually price their apps competitively to attract a larger user base. However, there are a few extreme outliers with very high prices, going up to 400 USD. These are rare cases and might belong to niche categories or offer specialized features.
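The outliers a box plot draws are, by the usual convention, points beyond 1.5 times the interquartile range (IQR) from the quartiles. The sketch below applies that rule to an invented `prices` series so the extreme values can be listed explicitly instead of read from the chart:

```python
import pandas as pd

# Invented paid-app prices in USD, with one extreme value.
prices = pd.Series([0.99, 1.99, 2.99, 2.99, 4.99, 5.99, 9.99, 399.99])

# Quartiles and interquartile range.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Standard box-plot fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(f"Median: {prices.median():.2f}, upper fence: {upper_fence:.2f}")
print(outliers.tolist())
```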

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights are valuable for business decision-making. Setting app prices within the common range of 0 to 10 USD can lead to a positive business impact by increasing downloads and user engagement. On the other hand, pricing an app too high without offering significant value or uniqueness could result in negative growth, as users may not be willing to pay a premium for standard features. Therefore, understanding the price distribution helps in setting a competitive yet profitable pricing strategy.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# Grouping the data by Category and summing the Installs
top_categories = data_df_final.groupby('Category')['Installs'].sum().sort_values(ascending=False).head(10)

# Plotting
plt.figure(figsize=(10,6))
sns.barplot(x=top_categories.values, y=top_categories.index)  # horizontal bars: installs on x, categories on y

plt.title('Top 10 Categories by Install Count')
plt.xlabel('Total Installs')
plt.ylabel('Category')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I used a bar chart because it is the most effective way to represent the total install count across different categories. A bar chart provides a clear visual comparison of the install counts, making it easy to identify which categories have the highest and lowest user reach. It helps in ranking the categories based on their popularity.



##### 2. What is/are the insight(s) found from the chart?

The chart reveals that the Game category has the highest install count, indicating that gaming apps are the most popular among users. The Communication and Social categories follow, suggesting that users highly engage with messaging, social media, and networking apps. Categories like Productivity, Tools, Family, and Photography also have a considerable user base, but their install counts are relatively lower.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights will definitely help create a positive business impact. Businesses and app developers can target the most popular categories like Games, Communication, and Social to attract more users and generate higher revenues.

There are no direct insights from this chart that indicate negative growth. However, the lower install counts in categories like News and Magazines or Video Players may indicate lower user demand or market saturation in those segments. Businesses should either innovate in these categories to revive user interest or focus their resources on high-growth categories for better returns.
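Total installs per category depend on how many apps a category contains, so a category can rank high simply because it is crowded. A sketch with invented numbers (the `toy` frame is hypothetical) that puts the total, the app count, and the average per app side by side:

```python
import pandas as pd

# Invented install counts; not taken from the Play Store data.
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "COMMUNICATION", "TOOLS", "TOOLS"],
    "Installs": [1_000_000, 500_000, 500_000, 5_000_000, 100_000, 300_000],
})

# Total installs, number of apps, and mean installs per app by category.
summary = (
    toy.groupby("Category")["Installs"]
    .agg(total="sum", apps="count", per_app="mean")
    .sort_values("total", ascending=False)
)
print(summary)
```

Comparing the `total` and `per_app` columns helps separate categories that are popular per app from those that are merely large.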

#### Chart - 6

In [None]:
# Chart - 6 visualization code

plt.figure(figsize=(10,6))

# Scatter Plot
sns.scatterplot(data=data_df_final, x='Reviews', y='Rating')

# Apply Log scale to x-axis (Reviews)
plt.xscale('log')

plt.title('Reviews Count vs Rating (Log Scale)')
plt.xlabel('Reviews Count (log scale)')
plt.ylabel('App Rating')

plt.show()


##### 1. Why did you pick the specific chart?

I have used a Scatter Plot with a log scale because it helps to visualize the relationship between App Ratings and the number of Reviews. Since the number of reviews varies widely (from very low to very high), a log scale makes the visualization more readable and highlights the distribution effectively.

##### 2. What is/are the insight(s) found from the chart?

Most of the highly-rated apps (Rating between 4.0 to 5.0) have varying review counts, indicating that good apps do not always receive a high number of reviews.

There are some apps with very high ratings but a very low number of reviews — these are potential hidden gems (quality apps that are not very popular).

Apps with lower ratings (below 3.0) are scattered across various review counts, indicating that more reviews do not guarantee higher ratings.
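The "hidden gem" idea, and its opposite (heavily reviewed but poorly rated apps), can be turned into explicit filters. This is a sketch on invented data; the threshold names and values (`MIN_RATING`, `MAX_REVIEWS`, `LOW_RATING`, `MIN_POPULAR_REVIEWS`) are assumptions chosen for illustration and would need tuning on the real dataset.

```python
import pandas as pd

# Invented apps with ratings and review counts.
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E"],
    "Rating": [4.8, 4.7, 2.5, 4.2, 2.9],
    "Reviews": [40, 250_000, 120_000, 900, 15],
})

# Hypothetical thresholds, chosen only for this example.
MIN_RATING = 4.5
MAX_REVIEWS = 1_000
LOW_RATING = 3.0
MIN_POPULAR_REVIEWS = 100_000

# High rating but few reviews: candidates for promotion.
hidden_gems = toy[(toy["Rating"] >= MIN_RATING) & (toy["Reviews"] <= MAX_REVIEWS)]

# Many reviews but low rating: candidates for investigation.
at_risk = toy[(toy["Rating"] < LOW_RATING) & (toy["Reviews"] >= MIN_POPULAR_REVIEWS)]

print(hidden_gems["App"].tolist(), at_risk["App"].tolist())
```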

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Identifying hidden gem apps (high ratings but low reviews) is an opportunity for marketing teams to promote these apps and increase their visibility, leading to potential user growth. It also helps businesses focus on improving the quality of their apps rather than just increasing the number of reviews.



Apps with a very high number of reviews but low ratings might indicate user dissatisfaction. This could lead to a negative reputation, uninstallations, and bad word of mouth. Such apps should be investigated, and product improvements should be prioritized.

In [None]:
reviews_df['Sentiment'].value_counts()

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# Count of each Sentiment
sentiment_counts = reviews_df['Sentiment'].value_counts()

# Pie Chart - Overall Customer Happiness
plt.figure(figsize=(7,7))
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%')
plt.title('Overall Customer Sentiment Distribution')
plt.axis('equal')
plt.show()


##### 1. Why did you pick the specific chart?

I selected the Pie Chart to visualize the Overall Customer Sentiment Distribution because it is one of the best ways to represent the percentage share of different categories within a whole. In this case, it clearly shows the proportion of Positive, Negative, and Neutral reviews in a simple and easy-to-understand format.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe that the majority of reviews, about 64.1%, are positive. This indicates a strong base of satisfied users, which is a good sign for the business. However, 22.1% of the reviews are negative, which highlights that there are areas where users are not happy. Additionally, 13.8% of the reviews are neutral, showing that some users had an average experience.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this analysis can help create a positive business impact. The large percentage of positive reviews reflects customer satisfaction, which can be used to build brand reputation and customer loyalty. At the same time, the negative reviews act as a warning sign for the business. If this negative feedback is not addressed properly, it may lead to customer dissatisfaction and loss of future customers, which can negatively affect business growth. Therefore, by taking corrective actions based on customer feedback, the company can reduce risks, improve customer experience, and create long-term positive growth.
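To act on negative feedback, it helps to know which apps receive the largest share of it. The sketch below computes a per-app negative-review share on invented data; `MIN_REVIEWS` is a hypothetical cut-off so that apps with only one or two reviews do not dominate the ranking.

```python
import pandas as pd

# Invented reviews; stands in for the App and Sentiment columns of reviews_df.
reviews = pd.DataFrame({
    "App": ["A", "A", "A", "A", "B", "B", "C"],
    "Sentiment": ["Positive", "Negative", "Negative", "Neutral",
                  "Positive", "Positive", "Negative"],
})

# Review count and share of negative reviews per app.
per_app = (
    reviews.assign(is_negative=reviews["Sentiment"].eq("Negative"))
    .groupby("App")
    .agg(n_reviews=("Sentiment", "size"), negative_share=("is_negative", "mean"))
)

# Hypothetical minimum number of reviews for an app to be ranked.
MIN_REVIEWS = 2
flagged = (
    per_app[per_app["n_reviews"] >= MIN_REVIEWS]
    .sort_values("negative_share", ascending=False)
)
print(flagged)
```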

#### Chart - 8

In [None]:
# Chart - 8 visualization code

# Scatter Plot
plt.figure(figsize=(10,6))
sns.scatterplot(x='Size', y='Rating', data=data_df_final)

plt.title('App Size vs Rating')
plt.xlabel('App Size (MB)')
plt.ylabel('Rating')
plt.show()



##### 1. Why did you pick the specific chart?

The scatter plot titled "App Size vs Rating" was chosen because it effectively visualizes the relationship between two continuous variables: the size of an app in megabytes and its corresponding user rating. This type of chart helps reveal patterns, trends, and anomalies that might not be immediately apparent from raw data or other chart types. In this case, it gives a visual overview of how app size may—or may not—impact user satisfaction as reflected in app ratings.

##### 2. What is/are the insight(s) found from the chart?

From the chart, several insights emerge. There is a clear concentration of highly rated apps (ratings between 4.0 and 5.0) across all app sizes, but particularly among mid to larger-sized apps. This suggests that app size doesn’t negatively impact user ratings, and in many cases, larger apps may even be associated with better user experiences, possibly due to more robust features or smoother functionality. However, smaller apps (under 20MB) show a much wider spread in ratings, including a significant number of low-rated apps. This indicates that while small apps can be excellent, they are also more likely to suffer from poor ratings, possibly due to limited functionality or performance issues.
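The visual impression that small apps have a wider spread of ratings can be checked by binning app size and comparing the standard deviation of ratings per bin. A sketch on invented data; the bin edges are assumptions for illustration. Note also that in this notebook missing sizes were filled with the median, so many apps share exactly that size value.

```python
import pandas as pd

# Invented sizes (MB) and ratings.
toy = pd.DataFrame({
    "Size": [3, 8, 15, 18, 25, 40, 60, 90],
    "Rating": [2.0, 4.5, 3.0, 4.8, 4.2, 4.4, 4.3, 4.5],
})

# Hypothetical size bins in MB.
bins = [0, 20, 50, 100]
labels = ["<=20 MB", "20-50 MB", "50-100 MB"]
toy["size_bin"] = pd.cut(toy["Size"], bins=bins, labels=labels)

# Count, mean and spread of ratings within each size bin.
summary = toy.groupby("size_bin", observed=True)["Rating"].agg(["count", "mean", "std"])
print(summary)
```

A clearly larger `std` in the smallest bin would support the observation; similar values across bins would suggest the pattern is mostly an effect of how many small apps there are.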

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can definitely help drive positive business impact. Understanding that users are generally more satisfied with apps that maintain a certain quality—regardless of size—can guide developers to focus less on minimizing size at the cost of user experience, and more on delivering value. However, there is a potential negative takeaway: businesses that overly prioritize small app size to cater to users with limited storage might unintentionally sacrifice quality, leading to lower ratings. This can harm long-term growth, particularly in competitive app markets. Balancing size with functionality and user satisfaction is crucial for sustainable success.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

# Required Libraries
from wordcloud import WordCloud

# Remove null values from Translated_Review column
all_reviews = reviews_df['Translated_Review'].dropna()

# Combine all reviews into one string
text = ' '.join(all_reviews)

# Generate the WordCloud
wordcloud = WordCloud(width=1000, height=500, background_color='white').generate(text)

# Plot the WordCloud
plt.figure(figsize=(15, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Frequent Keywords in Reviews')
plt.show()


##### 1. Why did you pick the specific chart?

The reason for selecting a word cloud for this analysis is that it is one of the most effective visualizations when dealing with large volumes of textual data such as customer reviews. Word clouds provide a quick and intuitive way to understand the most frequently occurring words in the dataset. The larger the word in the cloud, the more often it appears in the reviews. This visual appeal helps in identifying common themes, customer sentiments, and frequently discussed topics without going into the complexity of detailed textual analysis.

##### 2. What is/are the insight(s) found from the chart?

From the word cloud, words like “game”, “app”, “good”, “great”, “love”, “work”, “update”, “need”, and “phone” appear most prominently, which means reviewers mention them most often. Words like “good”, “great”, “love”, and “best” reflect positive sentiment and satisfaction with the app. Terms like “problem”, “issue”, “update”, “need”, and “fix” point to areas where users face difficulties or expect improvements, although “update” and “need” can appear in both positive and negative reviews, so their frequency alone does not indicate polarity.
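A word cloud shows relative frequency only visually, so it can help to count the keywords directly. The following is a minimal sketch using the standard library; the example reviews and the two keyword sets are hypothetical and are not a real sentiment lexicon.

```python
import re
from collections import Counter

# Hypothetical reviews; in the notebook these would come from
# reviews_df['Translated_Review'].dropna()
reviews = [
    "Great game, love it",
    "Good app but needs an update to fix a login problem",
    "Love the app, works great on my phone",
    "Crashes after update, please fix this issue",
]

# Small illustrative keyword sets (assumptions, not a real sentiment lexicon)
positive_words = {"good", "great", "love", "best"}
negative_words = {"problem", "issue", "fix", "crashes"}

# Lowercase, split into alphabetic tokens, and count occurrences
tokens = [tok for text in reviews for tok in re.findall(r"[a-z']+", text.lower())]
counts = Counter(tokens)

# Keep only the keywords that actually occur in the reviews
positive_hits = {w: counts[w] for w in positive_words if counts[w]}
negative_hits = {w: counts[w] for w in negative_words if counts[w]}

print(counts.most_common(5))
print("positive:", positive_hits)
print("negative:", negative_hits)
```

Exact counts make it easier to compare how often praise words and complaint words occur than judging font sizes by eye.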

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this analysis are definitely useful for creating a positive business impact. Positive feedback words such as “love”, “great”, and “good” highlight the strengths of the product or app — these can be used for marketing purposes to build a positive brand image. However, the words indicating negative feedback or issues like “problem”, “issue”, “update”, and “need” help the business to identify areas of improvement. Addressing these user concerns by fixing bugs, improving updates, and enhancing user experience will help retain customers and attract new users.

There are no direct insights that suggest negative growth, but if the frequently occurring negative terms are ignored by the business, it may lead to customer dissatisfaction, bad reviews, and eventually a drop in app ratings and user retention. Therefore, the real risk lies in not acting upon these insights. The smart approach would be to leverage positive feedback for branding and simultaneously prioritize solving customer complaints to improve overall user satisfaction and avoid negative growth.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

# Count of Content Rating
content_rating_count = data_df_final['Content Rating'].value_counts()

# Bar Chart
plt.figure(figsize=(8,5))
plt.bar(content_rating_count.index, content_rating_count.values)

plt.xlabel('Content Rating')
plt.ylabel('Number of Apps')
plt.title('Content Rating Distribution (Target Audience Analysis)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()



##### 1. Why did you pick the specific chart?

This bar chart titled "Content Rating Distribution (Target Audience Analysis)" provides a clear view of how apps are categorized based on their target audience. It visually represents the number of apps assigned to each content rating category.

##### 2. What is/are the insight(s) found from the chart?

From the chart, it's immediately evident that the vast majority of apps—over 7,000—are rated “Everyone,” making it the dominant category by a huge margin. This suggests that most app developers prioritize inclusivity, aiming to create products that appeal to the broadest possible user base, including children, families, and casual users. Following that, apps rated “Teen” come next, though with a much smaller count, around 1,000. This shows that while there's a notable presence of content targeting teenagers, it’s still a distant second in comparison. Other categories like “Mature 17+” and “Everyone 10+” have relatively minor shares, and ratings like “Adults only 18+” and “Unrated” are nearly negligible.
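Raw counts such as "over 7,000" are easier to compare when expressed as shares of the total. A small sketch, assuming a `Content Rating` column like the one used in the chart code; the values below are made up for illustration.

```python
import pandas as pd

# Hypothetical content ratings; in the notebook this is data_df_final['Content Rating']
content_rating = pd.Series(
    ["Everyone"] * 8 + ["Teen"] * 1 + ["Mature 17+"] * 1,
    name="Content Rating",
)

# Share of each category as a percentage of all apps
rating_share = (content_rating.value_counts(normalize=True) * 100).round(1)
print(rating_share)
```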

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from this distribution are valuable for business strategy. The dominance of the “Everyone” category indicates a strong market orientation towards general, family-friendly content. For app developers or businesses planning new products, this suggests that creating broadly accessible apps could increase reach and market penetration. However, the lower representation of mature content could also highlight an untapped niche market. Businesses targeting adult users with specialized or premium content might face less competition in that space, offering a potential growth opportunity.

There aren’t any direct indicators of negative growth from this chart alone, but one potential concern could be oversaturation in the “Everyone” category. With so many apps competing in that space, discoverability becomes a challenge. Apps in that category may need stronger differentiation or marketing efforts to stand out. On the other hand, exploring underrepresented categories with appropriate and valuable content might help apps capture attention in less crowded markets.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

data_df_final['Year_Month'] = data_df_final['Last Updated'].dt.to_period('M')
monthly_launches = data_df_final['Year_Month'].value_counts().sort_index()

plt.figure(figsize=(12,6))

plt.plot(monthly_launches.index.astype(str), monthly_launches.values)

plt.xlabel('Year-Month')
plt.ylabel('Number of Apps Launched')
plt.title('Monthly Trend of New App Launches')

# Rotate x-axis labels
plt.xticks(rotation=60)

# Limit the x-axis to at most 15 tick labels so they stay readable
plt.gca().xaxis.set_major_locator(plt.MaxNLocator(15))  # Adjust 15 to suit your data

plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()

plt.show()

##### 1. Why did you pick the specific chart?

The chart selected is a line chart, which is ideal for visualizing trends over time. The code groups apps by the month of their `Last Updated` date, so the series is a proxy for launch activity: it records each app's most recent update rather than its original release. With that caveat, the format clearly shows how activity has changed across months and years. The continuous line makes patterns, spikes, and declines easy to detect, which would be harder to spot in other chart types like bar charts.

##### 2. What is/are the insight(s) found from the chart?

From the chart, the number of new app launches (as approximated by `Last Updated` dates) remained relatively modest from 2010 through 2016, with only gradual increases over time. However, starting in late 2017 and accelerating into 2018, there is a significant surge, culminating in a dramatic peak. This suggests a period of intense activity in the app development space, potentially driven by increased investor interest, technological advances, or shifts in consumer demand. Right after this peak, there is an equally sharp drop. The drop may indicate that the growth was not sustainable, but it could also partly reflect an incomplete final month in the data, so it should be interpreted with caution.
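A sharp drop at the very end of a monthly series can be an artifact of a partial final month rather than a real decline. The sketch below shows one way to check for this, using hypothetical dates; `last_updated` stands in for the notebook's `Last Updated` column.

```python
import pandas as pd

# Hypothetical 'Last Updated' dates; the final month only covers its first days
last_updated = pd.to_datetime(pd.Series([
    "2018-05-03", "2018-05-20",
    "2018-06-11", "2018-06-25", "2018-06-30",
    "2018-07-02", "2018-07-15", "2018-07-28", "2018-07-31",
    "2018-08-01", "2018-08-03",
]))

# Number of apps per calendar month, in chronological order
monthly_counts = last_updated.dt.to_period("M").value_counts().sort_index()

# If the latest date is not the last day of its month, that month is partial
latest = last_updated.max()
last_month_is_partial = latest.day < latest.days_in_month

# Drop the partial month before reading a trend from the series
complete_months = monthly_counts.iloc[:-1] if last_month_is_partial else monthly_counts
print(monthly_counts)
print("final month partial:", last_month_is_partial)
```

If the final month turns out to be partial, comparing only complete months gives a fairer view of whether activity actually fell.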

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can be valuable from a business perspective. Understanding when and why these spikes occurred can help businesses time their product launches, recognize market saturation points, and anticipate industry cycles. The period of rapid growth could highlight opportunities that were successfully capitalized on, while the sudden decline that follows may signal underlying issues such as market saturation, quality control by app stores, or changing developer incentives.

While the initial rise reflects positive momentum and potential opportunity, the steep fall hints at possible negative growth trends. This might have resulted from a combination of factors such as app store policy changes, removal of low-performing or duplicate apps, or a simple overproduction of apps leading to diminished user engagement. Understanding both the positive and negative sides of this trend is crucial for strategic planning and long-term growth in the app market.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

from matplotlib.ticker import FuncFormatter

# Group by Genres → Get Avg Sentiment Polarity & Total Installs
genre_df = df_final.groupby('Genres').agg({
    'Sentiment_Polarity': 'mean',
    'Installs': 'sum'
}).reset_index()

# Get Top 10 Genres by Avg Sentiment Polarity
top10_genres = genre_df.sort_values(by='Sentiment_Polarity', ascending=False).head(10)

# Plot
fig, ax1 = plt.subplots(figsize=(12,6))

# Barplot for Avg Sentiment Polarity
sns.barplot(x='Genres', y='Sentiment_Polarity', data=top10_genres, ax=ax1, color='skyblue')
ax1.set_ylabel('Average Sentiment Polarity', color='blue')
ax1.set_xlabel('Genres')
plt.setp(ax1.get_xticklabels(), rotation=45, ha='right')  # Rotate genre labels without resetting them

# Secondary Y-axis for Total Installs
ax2 = ax1.twinx()
sns.lineplot(x='Genres', y='Installs', data=top10_genres, ax=ax2, color='red', marker='o')
ax2.set_ylabel('Total Installs', color='red')

# Formatting Installs axis
def format_installs(x, pos):
    if x >= 1_000_000_000:
        return f'{x/1_000_000_000:.1f}B'
    elif x >= 1_000_000:
        return f'{x/1_000_000:.1f}M'
    else:
        return f'{x:,.0f}'  # Show smaller values as whole numbers, e.g. 500,000

ax2.yaxis.set_major_formatter(FuncFormatter(format_installs))

plt.title('Top 10 Genres by Avg Sentiment Polarity & Their Total Installs')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The chosen dual-axis chart effectively compares two key performance indicators—average sentiment polarity and total installs—for the top 10 app genres by sentiment. This format was selected because it allows for clear visual correlation (or lack thereof) between how positively users feel about apps in a genre (sentiment polarity) and how widely those apps are adopted (installs). A simple bar or line chart alone wouldn't capture this multidimensional relationship as clearly.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we observe that genres like Comics and Educational; Creativity have the highest sentiment polarity, indicating highly positive user feedback. However, their install numbers remain relatively low. On the other hand, Health & Fitness exhibits both high sentiment polarity and significantly high install volume, suggesting a genre that is not only well-liked but also highly adopted. Conversely, genres like Parenting; Music & Video and Art & Design sit at the lower end of this top-10 group on both polarity and installs, hinting at lower overall engagement or satisfaction relative to the other leading genres.
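The contrast between well-liked genres with few installs and well-liked genres with many installs can be made explicit by splitting both metrics at a threshold. This is a rough sketch under stated assumptions: the figures are invented, "Puzzle" is a placeholder genre, and splitting at the median is an arbitrary choice rather than the notebook's method.

```python
import pandas as pd

# Hypothetical per-genre aggregates, shaped like genre_df in the chart code
genre_stats = pd.DataFrame({
    "Genres": ["Comics", "Health & Fitness", "Art & Design", "Puzzle"],
    "Sentiment_Polarity": [0.45, 0.35, 0.20, 0.15],
    "Installs": [2_000_000, 900_000_000, 5_000_000, 400_000_000],
})

# Split each metric at its median (an arbitrary threshold chosen for illustration)
high_sentiment = genre_stats["Sentiment_Polarity"] >= genre_stats["Sentiment_Polarity"].median()
high_installs = genre_stats["Installs"] >= genre_stats["Installs"].median()

# Well-liked but little-adopted genres: candidates for "untapped potential"
untapped = genre_stats.loc[high_sentiment & ~high_installs, "Genres"].tolist()

# Well-liked and widely adopted genres
strong = genre_stats.loc[high_sentiment & high_installs, "Genres"].tolist()

print("untapped:", untapped)
print("strong:", strong)
```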

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can guide strategic focus. For instance, the Comics genre shows strong user sentiment—an opportunity for growth if more marketing or feature development is directed there. Meanwhile, Health & Fitness stands out as a high-performing genre that could benefit from continued investment. Identifying genres with low installs despite high sentiment can help uncover untapped potential, while those with both low sentiment and adoption may require reevaluation or innovation.

Such insights can drive positive business impact by highlighting where to scale and where to improve. At the same time, genres with low sentiment despite high installs could pose reputational risks if left unaddressed—potentially leading to negative growth due to user dissatisfaction.

#### Chart - 13

In [None]:
# Chart - 13 visualization code

# Filter only Paid apps
paid_apps = data_df_final[data_df_final['Type'] == 'Paid']

plt.figure(figsize=(10, 6))

sns.scatterplot(
    data=paid_apps,
    x='Price',
    y='Installs',
    alpha=0.7,
    color='green'
)

plt.xscale('log')  # Optional if wide price range
plt.yscale('log')  # Optional if wide install range

plt.xlabel('App Price')
plt.ylabel('Total Installs')
plt.title('Install Counts vs App Price (Paid Apps Only)', fontsize=16)
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()

plt.show()



##### 1. Why did you pick the specific chart?

The scatter plot of Install Counts vs App Price for Paid apps was chosen because it effectively visualizes the relationship between the price of an app and the total number of installs it receives. Both these variables show a wide range of values, and using a scatter plot with a logarithmic scale for both axes helps to handle this variation and gives a clearer picture of how app pricing impacts user downloads.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe that most paid apps are priced in a lower range, typically under 10, and these apps also tend to have more installs than the more expensive ones. As the price increases, the number of installs decreases significantly, indicating that users are highly price-sensitive when it comes to paid apps. Very few apps priced above 50 have gained a large number of installs, which suggests that only apps offering very niche or premium features can afford to charge high prices.
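The price-sensitivity reading can be backed with two simple numbers: median installs per price band, and a rank correlation between price and installs. The sketch below assumes `Price` and `Installs` columns as in the chart code; the data and the band edges (10 and 50, taken from the text) are illustrative only.

```python
import pandas as pd

# Hypothetical paid apps; column names follow the chart code
paid = pd.DataFrame({
    "Price":    [0.99, 1.99, 2.99, 4.99, 9.99, 14.99, 29.99, 79.99],
    "Installs": [1_000_000, 500_000, 100_000, 100_000, 10_000, 5_000, 1_000, 100],
})

# Median installs per price band (band edges are illustrative choices)
paid["price_band"] = pd.cut(
    paid["Price"],
    bins=[0, 10, 50, float("inf")],
    labels=["under 10", "10 to 50", "over 50"],
)
median_installs = paid.groupby("price_band", observed=True)["Installs"].median()

# Spearman rank correlation, computed as the Pearson correlation of the ranks
rank_corr = paid["Price"].rank().corr(paid["Installs"].rank())

print(median_installs)
print(f"rank correlation: {rank_corr:.2f}")
```

A rank correlation is used here because both variables span several orders of magnitude, which is the same reason the chart uses logarithmic axes.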

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can certainly help businesses make better decisions, especially in terms of pricing strategy. Keeping the price affordable is crucial if the goal is to maximize user base and installs. Overpricing an app without offering unique or exceptional value could lead to negative growth, as users are unlikely to pay a high price when there are cheaper or free alternatives available. Therefore, companies should carefully analyze their app’s value proposition before deciding to charge a premium price. The insights gained from this chart can guide app developers to strike a balance between price and value to ensure profitability without compromising user adoption.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

plt.figure(figsize=(12, 8))

# Select only numeric columns
numeric_df = df_final.select_dtypes(include=['float64', 'int64'])

# Calculate correlation
corr = numeric_df.corr()

# Plot heatmap
sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5)

plt.title("Correlation Heatmap", fontsize=16)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

* It visually shows the strength and direction of relationships between multiple numerical variables in a single view.

* Helps quickly identify which features have strong or weak correlations.

* Makes it easier for business or non-technical people to interpret relationships without going through numeric correlation tables.

* Ideal for feature selection in machine learning or data analysis

##### 2. What is/are the insight(s) found from the chart?

* There is a moderate positive correlation between Reviews and Installs (0.44) — Apps that receive more user reviews are generally installed more often.

* A weak positive correlation between Reviews and Size (0.28) suggests that larger apps may attract somewhat more user feedback, possibly due to richer features.

* A weak positive correlation exists between Rating and Sentiment Polarity (0.21) — Apps with more positive sentiments tend to have slightly higher ratings.

* Rating and Sentiment Subjectivity (0.22) also show a weak positive relationship, implying that subjective feedback might influence ratings modestly.

* Sentiment Polarity and Subjectivity (0.27) share a weak positive correlation — more subjective reviews are often mildly positive.

* Size and Sentiment Polarity (-0.30) have a weak-to-moderate negative correlation, indicating larger apps might face more negative sentiment, possibly due to performance or space concerns.

* Reviews and Sentiment Polarity (-0.18) suggest that a higher number of reviews may include more criticism or mixed feedback.

* Installs and Sentiment Polarity (-0.13) implies popular apps may attract a diverse user base with varied (including negative) opinions.

* Price is almost uncorrelated with all other variables, which suggests that price has little linear relationship with ratings, reviews, installs, or sentiment in this data.
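Reading coefficients off a heatmap one cell at a time is error-prone, so it can help to list each variable pair once, ranked by absolute strength. A minimal sketch, assuming a numeric DataFrame similar to the one passed to `corr()` above; the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns standing in for df_final's numeric subset
numeric_sample = pd.DataFrame({
    "Rating":   [4.1, 3.8, 4.5, 4.0, 4.7, 3.5],
    "Reviews":  [120, 80, 900, 300, 1500, 40],
    "Installs": [1_000, 500, 50_000, 10_000, 100_000, 100],
    "Price":    [0.0, 0.99, 0.0, 2.99, 0.0, 4.99],
})

corr = numeric_sample.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (variable, variable) -> coefficient and sort by absolute strength
pairs = upper.stack().dropna()
ranked_pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked_pairs)
```

Sorting by absolute value puts strong negative relationships next to strong positive ones, which makes it easier to label each pair consistently as weak, moderate, or strong.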




#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Selecting relevant numerical columns for pairplot
numeric_df = df_final.select_dtypes(include=['float64', 'int64'])

# sns.pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(data=numeric_df)
plt.suptitle("Pair Plot of Numerical Features", fontsize=16, y=1.02)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I selected a pair plot because it helps in visualizing the pairwise relationships between multiple variables in a dataset. It also shows the distribution of individual variables, making it useful for detecting patterns, trends, and potential correlations.

##### 2. What is/are the insight(s) found from the chart?

The diagonal histograms show how each variable is distributed. The right-skewed pattern (most apps with low values and only a few with very high values) fits count-like variables such as Reviews and Installs rather than Rating. Rating instead clusters toward the high end of the scale, consistent with the concentration between 4.0 and 5.0 seen in the App Size vs Rating scatter plot.

The histogram for Sentiment_Subjectivity is approximately bell-shaped, indicating that most reviews fall near the middle of the subjectivity range.

The scatterplots between Reviews and Installs show a clear positive trend: apps with more installs generally have more reviews.

Size does not show any strong visible trend with Rating or Installs, which is consistent with the weak correlations seen in the heatmap.
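The direction of skew can be confirmed numerically instead of judged from the histograms. The sketch below uses pandas' `skew()` on hypothetical values; the column names mirror the notebook, but the numbers are invented to show one left-skewed and one right-skewed column.

```python
import pandas as pd

# Hypothetical values: ratings bunch near the top, review counts have a long upper tail
features = pd.DataFrame({
    "Rating":  [4.5, 4.4, 4.6, 4.3, 4.7, 4.5, 3.0, 1.5],
    "Reviews": [10, 25, 40, 60, 90, 150, 5_000, 250_000],
})

# Positive skew = long right tail; negative skew = long left tail
skewness = features.skew()
print(skewness)
```

A positive value means a few very large observations pull the tail to the right, and a negative value means a few very small observations pull it to the left.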

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

* Since we observed a positive correlation between number of reviews and installs, growing installs is likely to go hand in hand with more user engagement and feedback. Installs can be increased through marketing campaigns, promotional offers, and improved app visibility.
* Sentiment Polarity has shown a slight positive correlation with ratings. Therefore, maintaining a positive user experience by addressing user feedback, improving app performance, and providing customer support will help in improving app ratings.
* There is a weak negative correlation between App Size and Sentiment Polarity. Larger app sizes may lead to user dissatisfaction. The client should focus on optimizing the app size for faster downloads and better performance.
* As price does not have a strong relationship with ratings or installs, keeping the app free for basic features and charging for premium features would attract more users without affecting user sentiment.
* Since no single factor is highly dominant on ratings, the client should continuously monitor user feedback, reviews, and performance metrics to make regular improvements and updates in the app.


# **Conclusion**

The overall analysis highlights that user perception towards the apps is largely positive, with a majority of reviews being favorable. However, a noticeable portion of negative and neutral feedback indicates the need for continuous improvement. Genres like Comics, Educational, and Events stand out with higher user satisfaction based on sentiment analysis. Moreover, apps with higher installs are generally backed by good ratings and active user engagement. This reinforces the importance of maintaining quality, addressing user concerns, and focusing on content that aligns with user preferences to sustain growth and competitiveness in the market.

In conclusion, focusing on increasing installs, improving user sentiment, optimizing app performance, and regularly monitoring user feedback will help the client achieve their business objectives effectively.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***