<a href="https://colab.research.google.com/github/poojarani0301/EDA-on-On-Google-Play-Store-Apps-Data-and-User-Reviews/blob/main/Pooja_Rani_Google_play_store_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - Google Play Store :dataset 1, dataset 2
##### **Contribution**    - Individual


By: Pooja Rani

# **Project Summary -**

The primary objective of the Google Play Store analysis project is to explore various dimensions of apps available on the platform and extract meaningful insights that can help businesses and researchers make data-driven decisions. Using Python libraries such as Pandas, NumPy, Seaborn, and Matplotlib, the project aims to:

*   Ensure the data is clean, complete, and structured, eliminating any inconsistencies or irrelevant information
*   Understand the relationships between different variables, identify patterns, and visualize trends
*   Analyze user reviews to gauge customer sentiment about various apps and their features
*   Identify which app categories are performing well based on rating scores and installation counts
*   Detect trends over time, such as shifts in popular categories, user preferences, or regional variations


# Data Collection:

Dataset 1 includes:

1.   Application Name: The name of the app as displayed on the Google Play Store.
2.   Category: The genre or type of the app (e.g., games, health & fitness, education).
3.   Rating: The average user rating, represented as a numeric value.
4.   Reviews: The total number of user reviews for the app, as recorded when the data was scraped.
5.   Installs: The total number of times the app has been installed.
6.   Price: The cost of the app, if any.
7.   Content Rating: The age group the app is targeted at.
8.   Genres: The additional categories the app belongs to.

Dataset 2 includes:

1. App: The name of the app as displayed on the Google Play Store.
2. Translated_Review: User reviews translated to a consistent language (e.g., English) for analysis.
3. Sentiment: Categorical label (Positive, Neutral, Negative) classifying the review's emotional tone.
4. Sentiment_Polarity: Numeric score (-1 to +1) quantifying sentiment strength (e.g., +0.8 = strong positivity).
5. Sentiment_Subjectivity: Score (0 to 1) measuring how opinionated (1) vs. factual (0) the review is.


# Data Cleaning and Preprocessing:

Once we have the data, the next step is to clean and preprocess it. Data preprocessing in Python is commonly done using Pandas and NumPy. Missing values are imputed, and duplicates and irrelevant columns are removed. Data normalization techniques are also applied to standardize certain columns, such as ratings, ensuring the data is ready for analysis.

# Exploratory Data Analysis (EDA):

After cleaning the data, EDA is performed to uncover patterns, relationships, and insights within the dataset. This involves generating statistical summaries and visualizations. Advanced Python libraries such as Matplotlib and Seaborn are commonly used for this purpose.

# Conclusion

By utilizing advanced Python tools and techniques, this project provides valuable insights into the Google Play Store’s ecosystem, helping developers optimize their apps for better user experience, predict future trends, and ultimately improve their app performance. Through data collection, cleaning, and advanced analysis methods, the project sheds light on the dynamics of the mobile app market, offering actionable insights for stakeholders across various industries.


# **GitHub Link -**

https://github.com/poojarani0301

# Problem statement

This project aims to address the challenge of analyzing and extracting meaningful insights from the vast data of the Google Play Store. It focuses on collecting, cleaning, and analyzing app data, including ratings, reviews, and installs, to identify trends, assess performance, and predict future app success using advanced Python techniques.

# Business objective



The business objective of this project is to empower app developers, marketers, and businesses with data-driven insights from the Google Play Store. By analyzing app performance, user sentiment, and trends, the project aims to help optimize app strategies, improve user experiences, and drive higher downloads, ratings, and profitability.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Dataset Loading

In [None]:
# Load Dataset

# This code will import drive module from 'google.colab'

from google.colab import drive
drive.mount('/content/drive')


### Dataset First View

In [None]:
# Dataset1 First Look

# Read dataset1 directly from its Google Drive download URL

data1 = pd.read_csv('https://drive.usercontent.google.com/download?id=1pVZNhpwbqbu3xLf6J-1KrQI_-B8KqDCt&export=download&authuser=0')
data1.head()  # To display first 5 rows of the dataset

In [None]:
# Dataset2 first look

data2 = pd.read_csv('https://drive.usercontent.google.com/download?id=1zJmyrNtkv_ZnVAY1wtg1irTRbpoZ9rRH&export=download&authuser=0')
data2.head()

### Dataset Rows & Columns count

In [None]:
# Dataset1 Rows & Columns count

data1.shape  # to find the number of rows and columns in the dataset

In [None]:
# Dataset2 Rows & Columns count

data2.shape

### Dataset Information

In [None]:
# Dataset1 Info

data1.info()    # To display the information(datatypes and non-null values in each column)

In [None]:
# Dataset2 info

data2.info()

#### Duplicate Values

In [None]:
# Dataset1 Duplicate Value Count

len(data1[data1.duplicated()])


In [None]:
# Dataset2 Duplicate Value Count

len(data2[data2.duplicated()])    # To count duplicated records in the dataset

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count in dataset1:

print(data1.isnull().sum())

In [None]:
# Missing Values/Null Values Count in dataset2:

print(data2.isnull().sum())

In [None]:
# Visualizing the missing values in dataset1:

# Missing values can be visualized with a heatmap of the null mask:

sns.heatmap(data1.isnull(), cbar = False)

# The graph below shows that the 'Rating' column has the most missing values.

In [None]:
# Visualizing the missing values in dataset2:
# we can visualize the missing values by plotting heatmap graph using the below code:

sns.heatmap(data2.isnull(), cbar = False)

# The graph below shows that all columns except 'App' have an equal number of missing values, which need to be replaced.

### What did you know about your dataset?

# About Dataset1:

*   Dataset1 has 13 columns and 10841 rows. Twelve columns are of type 'object' and one is of type 'float'.
*   Dataset1 contains 483 duplicated records, which need to be dropped.
*   The dataset has several missing values, mostly in the 'Rating' column.
*   All missing values need to be replaced with the mean, median, or mode of the respective column.
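
Because most columns in dataset1 are stored as 'object', numeric analysis of columns such as 'Installs', 'Price' or 'Reviews' would first need a type conversion. The sketch below shows one way to do that on a hypothetical sample. The string formats (`10,000+`, `$4.99`) and the helper name `to_number` are assumptions for illustration and should be checked against the actual data.

```python
import pandas as pd

# Hypothetical sample; these string formats are assumed, not taken from the notebook output
sample = pd.DataFrame({
    "Installs": ["10,000+", "500+", "Free"],
    "Price": ["0", "$4.99", "Everyone"],
    "Reviews": ["159", "967", "3.0M"],
})

def to_number(series: pd.Series) -> pd.Series:
    """Strip '+', ',' and '$', then convert; unparseable entries become NaN."""
    cleaned = series.astype(str).str.replace(r"[+,$]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Apply the conversion column by column
converted = sample.apply(to_number)
print(converted)
```

Using `errors="coerce"` turns malformed entries into NaN instead of raising, so they show up in the missing-value counts and can be reviewed there.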

# About dataset2:

*   Dataset2 has 5 columns and 64295 rows. Three columns are of type 'object' and two are of type 'float'.
*   Dataset2 contains 33616 duplicated records, which need to be dropped.
*   The dataset has several missing values, in all the columns except 'App'.
*   All missing values need to be replaced with the mean, median, or mode of the respective column.
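
The missing-value counts above can be easier to compare as percentages. The following is a minimal sketch on a hypothetical toy frame; `missing_summary` is an illustrative helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only columns with at least one missing value, largest count first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Hypothetical toy frame standing in for the review data
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Sentiment": ["Positive", None, None, "Negative"],
    "Sentiment_Polarity": [0.5, np.nan, np.nan, -0.2],
})
print(missing_summary(toy))
```

The same function could be called on `data1` and `data2` to decide, column by column, whether imputing or dropping rows is more appropriate.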



## ***2. Understanding Your Variables***

In [None]:
# Dataset1 Columns

data1.columns

In [None]:
# Dataset1 Describe

# for numerical columns

data1.describe(include = ['int64', 'float64'])     # Statistical description of a numerical column.


In [None]:
# Dataset1 Describe

# statistical description of categorical columns

data1.describe(include = 'object')

In [None]:
# Dataset2 columns:

data2.columns

In [None]:
# Dataset Describe
# for numerical columns

data2.describe(include = ['int64', 'float64'])

In [None]:
# Dataset Describe
# for categorical columns

data2.describe(include = 'object')

### Variables Description

# About dataset1:
*   The 'Rating' column in dataset1 has 9367 non-null values; the remaining 1474 records are missing.
*   The mean of the column is 4.193338.
*   The standard deviation of the column is 0.5374, which measures how far ratings typically spread around the mean.
*   The minimum value in the column is 1.
*   The maximum value in the column is 19, which lies outside the 1 to 5 rating scale and indicates a malformed record.
*   The number of unique values is shown for each categorical column.

# About dataset2:
*   Both 'Sentiment_Polarity' and 'Sentiment_Subjectivity' in dataset2 have 37432 non-null records; the remaining 26863 records are missing in both.
*   The maximum value in both numerical columns is 1.
*   The minimum values are 0 for 'Sentiment_Subjectivity' and -1 for 'Sentiment_Polarity'.
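
Summary statistics like these can be turned into an explicit range check. The sketch below flags ratings outside the 1 to 5 scale on a hypothetical toy frame; the column names mirror dataset1, but the values are made up.

```python
import pandas as pd

# Hypothetical rows; "B" carries an impossible rating
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, 19.0, 3.8],
})

# Ratings outside 1 to 5 are treated as invalid; NaN compares False and is left alone
invalid = toy[(toy["Rating"] < 1) | (toy["Rating"] > 5)]
print(invalid)
```

Rows caught this way can then be inspected individually before deciding whether to correct or drop them.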






### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable in dataset1:

for i in data1.columns.tolist():
  print("No. of unique values in ",i,"is",data1[i].nunique())

In [None]:
# Check Unique Values for each variable in dataset2:

for i in data2.columns.tolist():
  print("No. of unique values in ",i,"is",data2[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# firstly, we need to drop the duplicate records in the dataset1

data1.drop_duplicates(inplace = True)

# now, we can check if the duplicate record have been removed.

len(data1[data1.duplicated()])


In [None]:
# Write your code to make your dataset analysis ready.

# firstly, drop the duplicate records in the dataset2

data2.drop_duplicates(inplace = True)

# now, we can check if the duplicate record have been removed.

len(data2[data2.duplicated()])

In [None]:
# Now we need to replace the numerical missing values with the mean or median
# dataset1 has only one numerical column: 'Rating'
# If the data is roughly symmetric, replace the missing values with the mean; if it is skewed, replace them with the median

data1['Rating'].skew()

In [None]:
# The column is skewed, so we will replace the missing values with the median
# The below code will calculate the median for the column.

data1['Rating'].median()

In [None]:
# Replace the missing values with the median, which is 4.3
# (assigning the result back avoids the chained in-place fill that newer pandas versions warn about)

data1['Rating'] = data1['Rating'].fillna(data1['Rating'].median())


In [None]:
# check if any missing value is left or not

data1['Rating'].isnull().sum()

In [None]:
# Now, we can replace the missing values of the categorical columns in dataset1 one by one by replacing it with mode of the particular column
# we have 4 columns that have missing values: Type, Content Rating, Current ver, Android ver

print(data1['Type'].mode())
print('\n')

print(data1['Content Rating'].mode())
print('\n')

print(data1['Current Ver'].mode())
print('\n')

print(data1['Android Ver'].mode())

In [None]:
# Replace the missing values with the calculated mode
# (results are assigned back instead of using inplace=True on a single column)

data1['Type'] = data1['Type'].replace(0, pd.NA)
data1['Type'] = data1['Type'].fillna('Free')

data1['Content Rating'] = data1['Content Rating'].fillna('Everyone')

data1['Current Ver'] = data1['Current Ver'].fillna('Varies with device')

data1['Android Ver'] = data1['Android Ver'].fillna('4.1 and up')


In [None]:
print(data1['Type'].isnull().sum())

print(data1['Content Rating'].isnull().sum())

print(data1['Current Ver'].isnull().sum())

print(data1['Android Ver'].isnull().sum())

In [None]:
# Now, we can do the all the similar operations with dataset2 as well.
# Since we have already dropped the duplicated records, now, we can replace the null or missing values.

# we have 2 numerical columns in the data: 'Sentiment_polarity' and 'Sentiment_subjectivity'

data2['Sentiment_Polarity'].skew()

# The result below suggests that 'Sentiment_Polarity' is roughly symmetric

In [None]:
# Calculate the mean of the column-'Sentiment_Polarity':

data2['Sentiment_Polarity'].mean()


In [None]:
# Replace the missing values with the mean calculated above:

data2['Sentiment_Polarity'] = data2['Sentiment_Polarity'].fillna(data2['Sentiment_Polarity'].mean())


In [None]:
# Now, we can check again if there are any missing values left or not:

data2['Sentiment_Polarity'].isnull().sum()

In [None]:
# Now, we can do the same with the other numerical column i.e., 'Sentiment_Subjectivity':

data2['Sentiment_Subjectivity'].skew()

# The result below suggests that 'Sentiment_Subjectivity' is skewed

In [None]:
# Calculate the median of the column:

data2['Sentiment_Subjectivity'].median()

In [None]:
# Replace the missing values of the column with median calculated above:

data2['Sentiment_Subjectivity'] = data2['Sentiment_Subjectivity'].fillna(data2['Sentiment_Subjectivity'].median())

In [None]:
# We can check now if there is any null values left:

data2['Sentiment_Subjectivity'].isnull().sum()

In [None]:
# Now that all the numerical columns are sorted, 2 categorical columns are left with some missing values, 'App' has 0 missing values.

# The 2 columns are : 'Translated_Review' and 'Sentiment'

# calculate mode of both the columns

data2['Translated_Review'].mode()



In [None]:
data2['Sentiment'].mode()

In [None]:
# Replace the missing values in both columns with the mode calculated above:

data2['Translated_Review'] = data2['Translated_Review'].fillna('Good')

data2['Sentiment'] = data2['Sentiment'].fillna('Positive')

In [None]:
# check for the missing values now:

print(data2['Sentiment'].isnull().sum())

print(data2['Translated_Review'].isnull().sum())


In [None]:
# We can merge the dataset1 and dataset2 on a column 'App':

data = pd.merge(data1, data2, on = 'App', how = 'inner')
data

In [None]:
# Check for the duplicates in the merged data:

data.duplicated().sum()

In [None]:
# There are 81001 duplicate records, we need to drop them all.

data.drop_duplicates(inplace = True)

# Now we can again check if there is any duplicate record left

In [None]:
data.duplicated().sum()

In [None]:
print(data1['Category'].unique())

In [None]:
data1.drop(data1[data1['Category'] == '1.9'].index, inplace = True)


### What all manipulations have you done and insights you found?

# Step by step manipulations done with the given dataset1:

*   Deleted all the duplicated records in the data.
*   Replaced all the missing values in the data using the mean, median, or mode, as appropriate.
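
The mean-versus-median decision used for the numerical columns can be captured in a small helper. This is a sketch only: `impute_numeric` is a hypothetical function, and the skewness threshold of 0.5 is an assumed rule of thumb rather than a value taken from the notebook.

```python
import numpy as np
import pandas as pd

def impute_numeric(series: pd.Series, skew_threshold: float = 0.5) -> pd.Series:
    """Fill NaN with the mean if |skew| is small, otherwise with the median."""
    use_mean = abs(series.skew()) < skew_threshold
    fill_value = series.mean() if use_mean else series.median()
    # Return a new Series so the caller assigns it back explicitly
    return series.fillna(fill_value)

# Hypothetical ratings with one low outlier and one missing value
ratings = pd.Series([4.5, 4.4, 4.3, 4.6, 1.0, np.nan])
filled = impute_numeric(ratings)
print(filled)
```

Here the low outlier makes the sample strongly skewed, so the missing entry is filled with the median rather than the mean.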

# Insights:

*   Extract insights about the most popular apps based on category, rating, reviews, etc.
*   Identify the most common categories for apps
*   Determine how many apps are free versus paid and analyze the price range of paid apps
*   List the highest rated apps overall

# Step by step manipulations done with the given dataset2:

*   Deleted all the duplicated records in the data.
*   Replaced all the missing values in the data using the mean, median, or mode, as appropriate.

# Insights:

* We can determine whether users generally feel positive, neutral, or negative about each app using the 'Sentiment' column.
* The sentiment_subjectivity reveals how opinionated the reviews are, while sentiment_polarity helps gauge the intensity of users' feelings.

This data can help identify user satisfaction trends, highlight areas for improvement, and support app development strategies.
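
When two tables are merged on 'App', the row count can grow if the same app name appears more than once on both sides. The sketch below illustrates this on hypothetical toy frames and shows one way to guard against it with the `validate` argument of `pd.merge`. Whether this is what inflated the merged data in this notebook is an assumption that would need checking.

```python
import pandas as pd

# Hypothetical app table: "A" appears twice with different review counts
apps = pd.DataFrame({
    "App": ["A", "A", "B"],
    "Reviews": ["100", "105", "50"],
})
reviews = pd.DataFrame({
    "App": ["A", "A", "B"],
    "Sentiment": ["Positive", "Negative", "Positive"],
})

# Each "A" row in apps pairs with each "A" row in reviews: 2 x 2 + 1 = 5 rows
merged = pd.merge(apps, reviews, on="App", how="inner")

# Keeping one row per app first avoids the multiplication;
# validate raises an error if the left keys are not unique
apps_unique = apps.drop_duplicates(subset="App", keep="first")
merged_unique = pd.merge(apps_unique, reviews, on="App", how="inner",
                         validate="one_to_many")

print(len(merged), len(merged_unique))
```

Which duplicate to keep (first, last, or the one with the most reviews) is a design choice that depends on how the data was scraped.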

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Import all the necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Chart - 1 visualization code
# Histogram

plt.figure(figsize = (5,3))
sns.histplot(x = data1['Rating'], bins = 15, kde = True, color = 'Blue')
plt.title("Distribution of Rating")
plt.show()


##### 1. Why did you pick the specific chart?

* To check the distribution of the column 'Rating'.

##### 2. What is/are the insight(s) found from the chart?

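* Most ratings fall between 4 and 5, consistent with the median of 4.3 computed earlier.
* The rating distribution is left-skewed: most values sit at the high end, with a tail toward low ratings.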
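
The direction of skew can be confirmed numerically rather than read off the plot: a negative skewness means the tail extends toward low values, a positive one toward high values. A minimal sketch on hypothetical ratings:

```python
import pandas as pd

# Hypothetical ratings: most values are high, a few are low
ratings = pd.Series([4.5, 4.4, 4.3, 4.6, 4.2, 4.7, 2.0, 1.5])

skewness = ratings.skew()
# Negative skewness: long tail on the left (low ratings)
direction = "left-skewed" if skewness < 0 else "right-skewed"
print(round(skewness, 2), direction)
```

In the notebook, the same check would be `data1['Rating'].skew()`, which is already computed during data wrangling.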

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* A histogram concentrated at the high end means there are more high ratings than low ratings, which indicates strong user satisfaction with a few low-rated outliers.
* This insight helps businesses identify which features are highly appreciated, enabling them to focus on these strengths in marketing or future development.
* It also highlights areas needing attention by addressing outlier complaints.
* By leveraging positive feedback and addressing issues, businesses can enhance user retention, attract new users, and improve overall app performance, boosting growth.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Count plot

plt.figure(figsize=(10,3))   # to fix the size of the figure/chart/visualisation
plt.title("Category Count")
sns.countplot(x=data1['Category'],color='orange')
plt.xticks(rotation=90)   # to rotate the x-axis labels by 90 degrees
plt.show()


##### 1. Why did you pick the specific chart?

* To visualize the number of apps in each category of the 'Category' column


##### 2. What is/are the insight(s) found from the chart?

* Most apps belong to the 'Family', 'Game' and 'Tools' categories, in that order
* The fewest apps belong to the 'Beauty', 'Comics', 'Events' and 'Parenting' categories
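
The counts behind a count plot can also be listed directly, which makes the largest and smallest categories easy to read off. A minimal sketch with made-up category labels:

```python
import pandas as pd

# Hypothetical category column
categories = pd.Series(
    ["FAMILY", "GAME", "FAMILY", "TOOLS", "FAMILY", "GAME", "BEAUTY"],
    name="Category",
)

# value_counts sorts from most to least frequent
counts = categories.value_counts()
top = counts.head(2)      # categories with the most apps
bottom = counts.tail(2)   # categories with the fewest apps
print(top)
print(bottom)
```

Note that this counts apps per category, not installs; ranking categories by downloads would need the 'Installs' column converted to numbers first.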

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* By identifying the categories with the most apps, such as Family, Game and Tools, businesses can focus on enhancing successful aspects, resolve recurring issues, and tailor marketing strategies. This data-driven approach improves customer satisfaction, boosts user engagement, and supports informed decisions, fostering business growth.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

plt.figure(figsize=(6,3))
sns.boxplot(x=data1['Rating'])
plt.show()



##### 1. Why did you pick the specific chart?

* To identify the outliers

##### 2. What is/are the insight(s) found from the chart?

* There are outliers in 'Rating', most of them on the lower end of the scale.
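
A box plot marks points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The sketch below applies the same rule by hand to hypothetical ratings, so the flagged values can be listed rather than just seen.

```python
import pandas as pd

# Hypothetical ratings with one very low value
ratings = pd.Series([4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 1.0])

q1, q3 = ratings.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the whisker limits are the outliers drawn by the box plot
outliers = ratings[(ratings < lower) | (ratings > upper)]
print(outliers.tolist())
```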

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* A box plot with outliers on the left suggests that a portion of users are dissatisfied or facing issues with the app. These outliers represent potential areas for improvement, such as bugs or features that aren't meeting user expectations.
* By addressing these main points through updates or fixes, businesses can enhance the user experience, increase satisfaction, and reduce churn. This targeted approach can lead to better ratings, higher retention, and positive business growth

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(15,3))
plt.title("Category wise rating")
barplot=sns.barplot(data=data1 ,x ='Category' , y='Rating',color='pink')
plt.xticks(rotation=90)
plt.ylim(bottom = 4, top = 4.5)
plt.show()

##### 1. Why did you pick the specific chart?

* To check which category is given the highest or lowest rating

##### 2. What is/are the insight(s) found from the chart?

* The highest rated categories are 'Events', 'Education' and 'Art and design', in that order, with average ratings between 4.3 and 4.4
* The lowest rated category is 'Dating', with an average rating of 4.0
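
The bar heights in this chart are per-category mean ratings, which can be computed and ranked directly. A minimal sketch on a hypothetical toy frame, also reporting how many apps each mean is based on:

```python
import pandas as pd

# Hypothetical apps and ratings
toy = pd.DataFrame({
    "Category": ["EVENTS", "EVENTS", "DATING", "DATING", "EDUCATION"],
    "Rating": [4.5, 4.3, 4.0, 3.8, 4.3],
})

# Mean rating and number of apps per category, best first
mean_rating = (
    toy.groupby("Category")["Rating"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(mean_rating)
```

Showing the count next to the mean helps judge whether a high average rests on many apps or only a handful.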

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The highest-rated categories reveal features that users value. Businesses can replicate that success, boost marketing, and increase monetization. High ratings build trust, improving downloads and retention.

* For the lowest-rated categories, feedback exposes critical flaws. Fixing these prevents user loss and restores app store rankings. Addressing complaints shows responsiveness, strengthening brand reputation.

* By leveraging ratings data, businesses reduce churn, increase conversions, and maximize long-term growth.

* Visualizations such as this bar plot make trends clear, which in turn helps businesses make data-driven decisions for sustained success.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Pie chart
sentiment_counts = data2['Sentiment'].value_counts()

# Create pie chart
plt.figure(figsize=(4, 3))
plt.pie(
    sentiment_counts,
    labels=sentiment_counts.index,
    autopct='%1.1f%%',  # Show percentages
    colors=['green', 'yellow', 'grey'],  # Custom colors
    startangle=90,  # Rotate for better readability
    wedgeprops={'edgecolor': 'white'}  # Add white edges for clarity
)

# Add title and style
plt.title('Distribution of App Reviews by Sentiment', fontsize=14)
plt.axis('equal')  # Ensures pie is a circle (not oval)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

* To check the share of each sentiment among the reviews

##### 2. What is/are the insight(s) found from the chart?

* Most of the reviews are positive
* About 22% of the reviews have negative sentiment
* Nearly 14% of the reviews have neutral sentiment

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Dominant Positivity (64.1%): Leverage this in marketing campaigns to attract new users by highlighting strengths (e.g., "64% of reviews are positive!").

* Negative Reviews (22.1%): Address these urgently—analyze feedback for common pain points (bugs, UX) to reduce churn and improve ratings.

* Neutral Segment (13.8%): Engage these users with targeted surveys to convert them into promoters by addressing unmet needs.

Action Plan:

* Fix top complaints from negative reviews.
* Amplify positive sentiment in ads.
* Optimize features for neutral users.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

plt.figure(figsize=(8,3))
sns.histplot(x=data2['Sentiment_Polarity'],bins = 10,kde=True,color='red')
plt.title("Distribution of Sentiment polarity")
plt.show()

##### 1. Why did you pick the specific chart?

* To check the distribution of Sentiment Polarity: how is the distribution of reviews.

##### 2. What is/are the insight(s) found from the chart?

* The distribution is neither clearly right-skewed nor left-skewed, as it shows both strongly positive and strongly negative sentiment.
* There are more positive reviews than negative ones.
* Counts are lowest near 0.0, indicating fewer neutral reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Few neutral reviews suggest users don't mildly engage; they love or hate the app.
* The -0.75 spike exposes urgent pain points (fix bugs/poor UX).
* Polarized Users: Extreme scores indicate emotions drive reviews—capitalize on positives, address negatives to reduce churn.
* Businesses can prioritize fixes for negative clusters, amplify positive feedback in ads, and monitor shifts post-updates.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

plt.figure(figsize=(10,3))
sns.histplot(x=data['Category'],bins = 10,hue=data['Sentiment'],palette='rainbow')
plt.title("Distribution of sentiment over categories")
plt.xticks(rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

* To check how sentiment (Positive/Neutral/Negative) varies across app categories

##### 2. What is/are the insight(s) found from the chart?

* The dominance of positive reviews across all app categories suggests that, on average, apps in the Google Play Store meet or exceed user expectations.
* The 'Games' category shows a notable share of neutral reviews, which suggests businesses can focus on converting these neutral users into promoters by addressing subtle pain points (e.g., adding customization options)
* Dominant positivity may reflect Google Play's effective app review policies or developer responsiveness. However, if negative reviews are rare but severe (e.g., one-star complaints about crashes), they warrant prioritization to prevent reputational damage.
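
Because categories differ in how many reviews they have, raw counts can hide differences in sentiment mix. One option is to compare the share of each sentiment within a category using `pd.crosstab` with row normalization. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical merged rows: one row per review
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
    "Sentiment": ["Positive", "Positive", "Negative", "Neutral",
                  "Positive", "Negative"],
})

# normalize="index" makes each category row sum to 1
shares = pd.crosstab(toy["Category"], toy["Sentiment"], normalize="index")
print(shares)
```

Sentiments that never occur in a category appear as 0.0, so the table stays rectangular and easy to plot as a stacked bar chart.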

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 * Businesses can identify best practices from top-performing categories (e.g., Gaming apps) and replicate them in other apps.
 * Address critical pain points in low-performing categories (e.g., intrusive ads in Gaming) to reduce churn and improve ratings.
 * Engage neutral users with surveys or incentives to convert them into promoters.
 * Respond to negative reviews publicly to demonstrate responsiveness, boosting trust.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize = (6,3))
sns.violinplot(x = data1['Rating'], color = 'red')
plt.title("Distribution of Rating and outliers")
plt.xlim(left = 0, right = 20)
plt.show()

##### 1. Why did you pick the specific chart?

* To show how the ratings are distributed (i.e., whether most ratings are clustered around a particular range) and to identify the outliers

##### 2. What is/are the insight(s) found from the chart?

The violin plot highlights the presence of outliers in the data, with ratings extending beyond the typical range (e.g., beyond 5.0 up to 20.0). These extreme values are likely anomalies or errors in data entry. The majority of the data is concentrated around ratings between 4.0 and 5.0, as indicated by the bulge near this range. The dense, narrow central region of the plot suggests that most ratings are clustered within a small range, demonstrating a consistent level of customer satisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize = (15,4))
plt.title("Paid/unpaid apps across categories and ratings")
sns.barplot(x = data1['Category'] , y = data1['Rating'] , hue = data1['Type'])
plt.tick_params(axis = 'x' , rotation = 90)
plt.ylim(bottom = 4, top = 5)
plt.show()

##### 1. Why did you pick the specific chart?

* To show the average ratings of applications across various categories, split into "Free" and "Paid".
* It provides a clear comparison of the average ratings for different app categories.

##### 2. What is/are the insight(s) found from the chart?

* Categories like "GAME" and "SOCIAL" are likely high-traffic areas, indicating strong user engagement and competition.

* Lesser-known categories ("PARENTING", "WEATHER") may represent underserved markets with growth potential.

* The tiered structure (e.g., "ENTERTAINMENT" vs. "EVENTS" as sub-items) suggests logical grouping, aiding in targeted analysis or marketing.
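
The grouped bars in this chart correspond to the mean rating for each combination of category and app type. A pivot table gives the same numbers in tabular form. The sketch below uses hypothetical values:

```python
import pandas as pd

# Hypothetical apps with category, type and rating
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "SOCIAL", "SOCIAL"],
    "Type": ["Free", "Paid", "Free", "Free", "Paid"],
    "Rating": [4.2, 4.6, 4.4, 4.1, 4.3],
})

# Rows: category, columns: Free/Paid, cells: mean rating
table = toy.pivot_table(index="Category", columns="Type",
                        values="Rating", aggfunc="mean")
print(table)
```

Subtracting the 'Free' column from the 'Paid' column would show, per category, how much higher or lower paid apps are rated on average.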

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Focus on high-traffic categories like GAME or SOCIAL to capitalize on existing demand, ensuring feature-rich apps that stand out in competitive markets.
* Explore underserved segments like PARENTING or WEATHER, where lower competition may yield higher visibility and user loyalty.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(6,3))
sns.countplot(x=data1['Content Rating'])
plt.title('Comparison of content rating' , color = 'Black' , fontsize = 15 )
plt.tick_params(axis = 'x' , rotation = 45, colors = 'black')
plt.show()

##### 1. Why did you pick the specific chart?

* To compare app distribution across content ratings.

##### 2. What is/are the insight(s) found from the chart?

* A high count for "Everyone" suggests broader market appeal, while niche ratings like "Adults only 18+" indicate specialized demand.
* Low representation in "Everyone 10+" or "Unrated" may reveal untapped segments or compliance issues.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The dominance of "Everyone" ratings highlights mass-market appeal, guiding developers to prioritize family-friendly features (e.g., parental controls, educational content).
* Niche categories like "Adults only 18+" can inspire targeted apps (e.g., dating, gambling) with tailored monetization.
* For underrepresented ratings (e.g., "Everyone 10+"), create campaigns to attract overlooked demographics.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

plt.figure(figsize = (8,3))
sns.kdeplot(x=data2['Sentiment_Subjectivity'] , color = 'green')
plt.title('KDE of Sentiment Subjectivity' , fontsize =15 , color = 'black')
plt.show()


##### 1. Why did you pick the specific chart?

* To visualize the distribution of sentiment subjectivity in Google Play Store reviews. It reveals how factual (near 0) or opinionated (near 1) user reviews are.

##### 2. What is/are the insight(s) found from the chart?

* High density at the extremes (near 0.0 and 1.0) indicates that reviews tend to be either largely factual or strongly opinionated (e.g., "Love this app!" or "Worst experience!").
* Low density in the middle (0.4 to 0.6) suggests that comparatively few reviews mix facts and opinions evenly.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Subjectivity measures how opinionated a review is, not whether it is positive or negative, so it should be read together with polarity: strongly opinionated negative reviews signal churn risk, while strongly opinionated praise (near 1.0) strengthens marketing.
* Prioritizing features that reduce frustration (e.g., clearer onboarding) can shift reviews toward more balanced (0.4 to 0.6) feedback.
* Targeting strongly opinionated negative feedback (e.g., fixing pain points raised in negative reviews) improves average ratings and app store visibility.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

plt.figure(figsize = (10,3))
sns.scatterplot(x =data['Sentiment_Polarity'], y=data['Rating'] )
plt.title('Relationship between Sentiment_Polarity and Rating with Outlier' , color = 'black' , fontsize = 12)
plt.show()

##### 1. Why did you pick the specific chart?

* To show the relationship between sentiment polarity (the emotional tone of a review) and the app's rating, and to check whether positive reviews go with high ratings or whether outliers skew the picture.

##### 2. What is/are the insight(s) found from the chart?

* Positive sentiment polarity (near +1.0) tends to align with higher ratings, while negative polarity (closer to -1.0) tends to appear with lower ratings (1 to 2 stars). Because `Rating` is an app's average rating rather than a per-review score, this is an association, not proof that positive reviews cause 5-star ratings.
* Some highly polarized reviews (extreme -1.0 or +1.0) deviate from the typical rating pattern, suggesting niche issues or exaggerated feedback that need investigation.
* Few reviews cluster near 0.0 polarity, indicating that users rarely leave emotionally neutral reviews; they either love or dislike the app.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Fixing pain points in negative outlier reviews (-1.0 polarity with low ratings) can reduce churn and improve average ratings.
* Doubling down on aspects that drive +1.0 polarity (e.g., intuitive UI) can enhance user satisfaction and encourage more 5-star reviews.
* Highlighting emotionally positive feedback in ads (e.g., "Rated 5/5 for ease of use!") can increase conversion rates.
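
A scatter plot shows the direction of a relationship but not its strength, so it helps to compute a correlation coefficient alongside it. The sketch below assumes a merged frame with `Sentiment_Polarity` and `Rating` columns, as used in the chart; the function name `polarity_rating_correlation` and the `sample` rows are hypothetical. Spearman is computed here as the Pearson correlation of ranks, which avoids any dependency beyond pandas.

```python
import pandas as pd


def polarity_rating_correlation(df: pd.DataFrame) -> dict:
    """Pearson and Spearman correlation between polarity and rating."""
    # Keep only rows where both values are present.
    pairs = df[["Sentiment_Polarity", "Rating"]].dropna()
    pearson = pairs["Sentiment_Polarity"].corr(pairs["Rating"])
    # Spearman correlation equals the Pearson correlation of the ranks.
    spearman = pairs["Sentiment_Polarity"].rank().corr(pairs["Rating"].rank())
    return {"pearson": pearson, "spearman": spearman, "n": len(pairs)}


# Hypothetical rows standing in for the merged data.
sample = pd.DataFrame({
    "Sentiment_Polarity": [-0.8, -0.2, 0.0, 0.3, 0.9, None],
    "Rating": [3.9, 4.1, 4.0, 4.4, 4.6, 4.2],
})
result = polarity_rating_correlation(sample)
print(result)
```

A coefficient close to 0 on the real data would mean the visual trend is weak, even if individual points look suggestive.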

#### Chart - 13

In [None]:
# Chart - 13 visualization code

plt.figure(figsize = (7,3))
sns.countplot(x=data2['Sentiment'] , color = 'yellow')
plt.tick_params(axis = 'x', colors = 'orange' , labelsize = 10)
plt.tick_params(axis = 'y', colors = 'orange' ,labelsize = 10)
plt.xlabel('Sentiment' , fontsize = 12, color = 'black')
plt.ylabel('Count', fontsize = 12, color = 'black' )
plt.show()

##### 1. Why did you pick the specific chart?

* To compare sentiment distribution (Positive, Neutral, Negative) in Google Play Store reviews.
* To reveal dominant trends.

##### 2. What is/are the insight(s) found from the chart?

* A high count of Positive reviews (nearly 20,000) implies strong user satisfaction, validating successful features that can be highlighted in marketing.

* Fewer Negative reviews (<5,000) indicate isolated pain points; fixing these can further boost ratings.

* The low Neutral count signals that users rarely leave lukewarm feedback; they either love or dislike the app.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Fix pain points raised in negative reviews (<5,000), potentially winning back some dissatisfied users.
* Target neutral users with surveys to uncover hidden issues.
* Leverage the nearly 20,000 positive reviews in ads to support conversions; the size of any uplift is not measured in this analysis and would need to be tested.
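
The counts read off the bars can be stated exactly, together with each label's share of all labelled reviews. This is a minimal sketch under the assumption that the `Sentiment` column holds the labels Positive, Neutral, and Negative with some missing values; `sentiment_summary` and the `sample` series are hypothetical stand-ins for `data2['Sentiment']`.

```python
import pandas as pd


def sentiment_summary(sentiment: pd.Series) -> pd.DataFrame:
    """Count each sentiment label and express it as a percentage."""
    # Ignore reviews without a sentiment label.
    labels = sentiment.dropna()
    counts = labels.value_counts()
    share_pct = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "share_pct": share_pct})


# Hypothetical stand-in for data2['Sentiment'].
sample = pd.Series(["Positive"] * 6 + ["Negative"] * 3 + ["Neutral"] + [None] * 2)
summary = sentiment_summary(sample)
print(summary)
```

Reporting shares next to raw counts makes the comparison independent of how many reviews were scraped.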

#### Chart - 14 - Pair Plot

In [None]:
# Pair Plot visualization code

data = data.select_dtypes(['int64','float64'])        # data: merged data
data

In [None]:
sns.pairplot(data)

##### 1. Why did you pick the specific chart?

* To explore the relationships between multiple numerical variables in a dataset.
* Ideal for performing exploratory data analysis (EDA).
* It allows a quick examination of both individual distributions (diagonal plots) and relationships between variables (scatter plots in the off-diagonal grid).

##### 2. What is/are the insight(s) found from the chart?

**Individual insights (diagonal plots):**

* Rating: Most ratings are concentrated between 3.5 and 5, with very few low ratings (<3.5). The distribution is skewed towards higher ratings.
* Sentiment Polarity: Values range from -1 (negative sentiment) to 1 (positive sentiment) and are roughly centered around 0, with denser values above 0, so positive sentiment dominates.
* Sentiment Subjectivity: Most values lean towards subjectivity (closer to 1), indicating that users often express personal opinions rather than objective statements.

**Pairwise insights (scatter plots):**

* Rating vs. Sentiment Polarity: A positive association exists; higher sentiment polarity aligns with higher ratings, and negative polarity is associated with lower ratings.
* Sentiment Subjectivity vs. Rating: No strong correlation; higher ratings occur across varying levels of subjectivity.


#### Chart - 15 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

data.corr()

In [None]:
sns.heatmap(data.corr(), annot = True, cmap = 'coolwarm')

##### 1. Why did you pick the specific chart?

* To display the degree of correlation between multiple numerical variables in a dataset.
* Provides a concise, visual representation of the relationships (correlations) between variables.
* Ideal for identifying patterns, dependencies, or a lack of relationship between variables in exploratory data analysis (EDA).

##### 2. What is/are the insight(s) found from the chart?

* Rating vs. Sentiment Polarity: correlation of 0.074 (weak positive). Higher sentiment polarity is only weakly associated with higher ratings, so review sentiment and app rating move together slightly at most.
* Sentiment Polarity vs. Sentiment Subjectivity: correlation of 0.24 (weak positive). More subjective reviews (personal opinions) tend to be slightly more positive in sentiment.
* General: No strong correlations (values close to ±1) are observed, indicating that these variables are only weakly related to each other.
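
Reading coefficients off a heatmap gets harder as the number of columns grows, so it can help to list each pair once, ordered by strength. The sketch below is one possible way to do that and is not taken from the notebook: `ranked_correlations` and the `sample` frame are hypothetical, and the numeric columns are selected into a new variable rather than overwriting the input.

```python
import pandas as pd


def ranked_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """List each pair of numeric columns once, strongest correlation first."""
    # Select numeric columns into a new frame so the input is left untouched.
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()  # Pearson correlation by default
    cols = list(corr.columns)
    # Walk the upper triangle so each pair appears once, without self-pairs.
    rows = [
        (a, b, corr.loc[a, b])
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
    ]
    pairs = pd.DataFrame(rows, columns=["var_a", "var_b", "corr"])
    pairs["abs_corr"] = pairs["corr"].abs()
    return pairs.sort_values("abs_corr", ascending=False).reset_index(drop=True)


# Hypothetical rows standing in for the merged data.
sample = pd.DataFrame({
    "App": ["a", "b", "c", "d", "e"],
    "Rating": [3.9, 4.1, 4.0, 4.4, 4.6],
    "Sentiment_Polarity": [-0.8, -0.2, 0.0, 0.3, 0.9],
    "Sentiment_Subjectivity": [0.9, 0.3, 0.1, 0.5, 1.0],
})
pairs = ranked_correlations(sample)
print(pairs)
```

Keeping the original frame intact means non-numeric columns such as the app name remain available for later cells.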

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

1. Monitor and Improve Sentiment Polarity
   * Why?
      * Positive sentiment polarity is weakly but positively correlated with higher ratings.
   * Action:
      * Analyze user reviews to identify common issues or requests, then use this feedback to enhance features or fix bugs.
      * Create positive sentiment through prompt responses to customer feedback.

2. Encourage Quality Reviews
   * Why?
      * Higher numbers of reviews are not strongly linked to ratings, but reviews provide valuable qualitative data.
   * Action:
      * Encourage satisfied users to leave positive reviews by offering incentives or in-app prompts.
      * Respond to negative reviews to showcase active customer care and potentially improve ratings.

3. Focus on Consistency
   * Why?
      * Variability in ratings and reviews can harm reputation.
   * Action:
      * Maintain consistent service or product quality by conducting periodic quality assurance.
      * Ensure all app versions deliver similar performance across devices.

4. Leverage Sentiment Subjectivity
   * Why?
      * Subjective reviews are often linked with positive sentiment (correlation: 0.24).
   * Action:
      * Highlight positive, subjective reviews in marketing campaigns to build trust and attract new users.
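
The first action above, identifying common issues in user reviews, can be started with a simple word count over negative reviews. This is a rough sketch only: the text column name `Translated_Review`, the small `STOPWORDS` set, the function `top_terms_in_negative_reviews`, and the `sample` rows are all assumptions for illustration, and a real analysis would likely need a fuller stop-word list and phrase handling.

```python
import re
from collections import Counter

import pandas as pd

# Small illustrative stop-word list (assumption, not exhaustive).
STOPWORDS = {"the", "and", "this", "for", "not", "too", "many", "after"}


def top_terms_in_negative_reviews(reviews, text_col="Translated_Review", n=5):
    """Return the n most frequent words in reviews labelled Negative."""
    # Keep only negative reviews that actually contain text.
    negative = reviews.loc[reviews["Sentiment"] == "Negative", text_col].dropna()
    counts = Counter()
    for text in negative:
        words = re.findall(r"[a-z']+", str(text).lower())
        # Skip stop words and very short tokens.
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(n)


# Hypothetical rows standing in for the reviews dataset.
sample = pd.DataFrame({
    "Translated_Review": [
        "App keeps crashing after the update",
        "Too many ads and crashing on login",
        "Love it, great design",
        "Ads everywhere, login fails",
        None,
    ],
    "Sentiment": ["Negative", "Negative", "Positive", "Negative", "Negative"],
})
top = top_terms_in_negative_reviews(sample, n=3)
print(top)
```

Terms that recur across many negative reviews are candidates for the bug fixes and feature changes suggested in the list.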

# **Conclusion**

* To achieve the business objective, the client should focus on:
    * frequent updates
    * user feedback-driven improvements
    * active engagement with reviews
    * highlighting positive user sentiment to drive customer satisfaction and loyalty
