<a href="https://colab.research.google.com/github/infinitenaveen/googleplaystore_data_analysis_by-Naveen/blob/main/googleplaystore_data_analysis_by_Naveen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name - PLAY STORE APP REVIEWS ANALYSIS**



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

In this project, I will examine data from the Google Play Store to identify the key factors that lead to the success and engagement of mobile apps. The objective is to offer practical insights for app developers and businesses to enhance their applications and gain a bigger foothold in the Android market. By analyzing app categories, ratings, reviews, size, installs, pricing, and user sentiment, I intend to uncover patterns and trends that influence app performance.

Throughout the project, I will use Python libraries like Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn for data manipulation, visualization, and statistical analysis. I will also ensure that the findings are presented in a clear and concise manner, with visualizations and summaries that make the insights easy to understand.

By the end of this project, I aim to deliver a comprehensive analysis that answers the key questions and provides practical recommendations for app developers. This will not only help them improve their apps but also give them a competitive edge in the Android market. Additionally, I will document the entire process, including the methodology, findings, and recommendations, in a detailed report that can be shared with stakeholders.

**In summary, this project is about leveraging data to uncover the secrets of app success on the Google Play Store. By combining data cleaning, exploration, statistical analysis, and sentiment analysis, I will provide valuable insights that can drive app-making businesses to success. My ultimate goal is to help developers create apps that users love and engage with, leading to higher ratings, more installs, and greater overall success.**

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here:** The Google Play Store features millions of mobile applications across a wide range of categories, creating a highly competitive environment. However, not every app manages to achieve high engagement, positive user feedback, or a significant number of installs. App developers and businesses require actionable insights to grasp the elements that contribute to an app's success.

This project intends to **analyze Play Store data, including app ratings, reviews, installs, pricing, size, content category, and user sentiment, to pinpoint key patterns and trends. The objective is to discover what makes an app more engaging and successful while identifying factors that could lead to underperformance.**

By utilizing data analysis, visualization, and statistical methods, this project will offer developers data-driven recommendations to improve their apps, boost user satisfaction, and enhance overall business performance.

#### **Define Your Business Objective?**

Answer Here: The main objective of this project is to assist app developers and businesses in enhancing their mobile applications by pinpointing the elements that contribute to greater user engagement, improved ratings, and more downloads.

To achieve this goal, the project will:

- Analyze how app ratings, categories, and the number of installs correlate to determine which categories excel.
- Investigate how pricing (free versus paid apps) and app size affect user engagement.
- Conduct sentiment analysis on user reviews to uncover key themes of positive and negative feedback.
- Identify trends in content ratings, user preferences, and review patterns over time.
- Create visualizations and statistical insights to facilitate data-driven decision-making.

The insights obtained will empower app developers to implement strategic enhancements, resulting in higher user satisfaction, increased app downloads, and sustained business success in the Android marketplace.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind
from google.colab import drive



### Dataset Loading

In [None]:
# Load Dataset
drive.mount('/content/drive')
playstore_df = pd.read_csv("/content/drive/MyDrive/dataset/playstore_data.csv")
reviews_df = pd.read_csv("/content/drive/MyDrive/dataset/playstore_data_user_reviews.csv")
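
The guidelines above ask for exception handling, so the loading step could be wrapped in a small helper. The following is only a sketch: `load_csv_safely` is a hypothetical name, not part of this notebook, and the demo uses a temporary file instead of the Drive paths so that it runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_safely(path):
    """Read a CSV into a DataFrame, failing with a clear message if it is missing or has no rows."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    df = pd.read_csv(csv_path)
    if df.empty:
        raise ValueError(f"Dataset has no rows: {csv_path}")
    return df


# Demo on a throwaway file (hypothetical content, not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("App,Rating\nExample App,4.2\n")
    demo_df = load_csv_safely(demo_path)

print(demo_df.shape)
```

In the notebook, the same helper could be called with the two Drive paths used above.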

### Dataset First View

In [None]:
#first view
print("First 5 rows of the dataset:")
print(playstore_df.head())


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
num_rows, num_columns = playstore_df.shape

# Print the results
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

### Dataset Information

In [None]:
# Dataset Info
playstore_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
playstore_df = pd.read_csv("/content/drive/MyDrive/dataset/playstore_data.csv")

# Step 1: Count the total number of duplicate rows
num_duplicates = playstore_df.duplicated().sum()
print(f"Total number of duplicate rows: {num_duplicates}")

# Step 2: Display the duplicate rows (optional)
if num_duplicates > 0:
    print("\nDuplicate rows:")
    print(playstore_df[playstore_df.duplicated(keep=False)])  # keep=False marks all duplicates

else:
    print("\nNo duplicate rows found.")
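
The count printed by `duplicated().sum()` and the number of rows shown with `keep=False` are usually different, which can be confusing. A minimal sketch on a made-up three-row frame (the values are illustrative only) shows the difference:

```python
import pandas as pd

# Toy data: the first two rows are identical
toy = pd.DataFrame({"App": ["A", "A", "B"], "Rating": [4.0, 4.0, 3.5]})

extra_copies = int(toy.duplicated().sum())          # repeats after the first occurrence
all_copies = int(toy.duplicated(keep=False).sum())  # every row that has an identical twin
deduplicated = toy.drop_duplicates()                # keeps the first occurrence of each row

print(extra_copies, all_copies, len(deduplicated))
```

The first number is how many rows `drop_duplicates()` would remove; the second is how many rows are involved in any duplication.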

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

print(playstore_df.isnull().sum())

# Drop rows with missing values in key columns
playstore_df = playstore_df.dropna(subset=['Rating', 'Installs', 'Size', 'Type', 'Price'])

# Check for missing values in the Reviews dataset
print(reviews_df.isnull().sum())

# Drop rows with missing sentiment values
reviews_df = reviews_df.dropna(subset=['Sentiment', 'Sentiment_Polarity', 'Sentiment_Subjectivity'])

In [None]:
# Visualizing the missing values

playstore_df = pd.read_csv("/content/drive/MyDrive/dataset/playstore_data.csv")

# Step 1: Create a heatmap of missing values
plt.figure(figsize=(10, 6))
sns.heatmap(playstore_df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values Heatmap")
plt.show()

# Step 2: Create a bar plot of missing values
missing_values = playstore_df.isnull().sum()
missing_values = missing_values[missing_values > 0]  # Filter columns with missing values

plt.figure(figsize=(10, 6))
sns.barplot(x=missing_values.index, y=missing_values.values, hue=missing_values.index, palette='rocket', legend=False)
plt.title("Missing Values by Column")
plt.xlabel("Columns")
plt.ylabel("Number of Missing Values")
plt.xticks(rotation=45)
plt.show()

# Step 3: Print the number of missing values
print("Number of missing values per column:")
print(missing_values)

### What did you know about your dataset?

Answer Here: The Google Play Store dataset provides an in-depth view of the apps available on the platform, offering insights into app performance, user engagement, and market trends. It consists of 10,841 rows (representing apps) and 13 columns (features), which include important variables such as App (the name of the app), Category (the main category like Games or Education), Rating (user rating on a scale from 0 to 5), Reviews (the number of user reviews), Size (the app size in MB or KB), Installs (the number of downloads), Type (whether the app is free or paid), Price (the cost if it is paid), Content Rating (the intended audience like Everyone or Teen), and Genres (sub-categories). The dataset features both numeric columns like Rating and Reviews, as well as categorical columns like Category and Type. However, it does contain some missing values, including 1,474 missing ratings, 169 missing sizes, and 1 missing type. To handle this, rows with missing Rating and Type were removed, and the missing Size values were filled in with the median. Furthermore, columns such as Installs, Size, and Price were cleaned and converted into numeric formats for analysis.

The aim is to evaluate app performance and user engagement in order to provide developers with actionable insights.

The dataset offers a detailed perspective on app performance in the Google Play Store. After cleaning and analyzing the data, I've discovered important insights that can assist developers in enhancing their apps and gaining a bigger market share.
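
The missing-value strategy described above (drop rows without `Rating` or `Type`, fill missing `Size` with the median) can be illustrated on a small made-up frame. This is a sketch with invented numbers, not output from the real dataset:

```python
import numpy as np
import pandas as pd

# Toy data with gaps in each of the three columns
toy = pd.DataFrame({
    "Rating": [4.1, np.nan, 3.9, 4.5, 4.8],
    "Type": ["Free", "Paid", np.nan, "Free", "Paid"],
    "Size": [10.0, 20.0, 5.0, np.nan, 30.0],
})

# Drop rows where Rating or Type is missing; .copy() keeps later assignments safe
cleaned = toy.dropna(subset=["Rating", "Type"]).copy()

# Fill the remaining gap in Size with the median of the surviving rows
cleaned["Size"] = cleaned["Size"].fillna(cleaned["Size"].median())

print(cleaned)
```

Note that the median is computed after the rows are dropped, so the order of the two steps affects the fill value.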

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

print("Dataset Column Names:")
print(playstore_df.columns.tolist())

In [None]:
# Dataset Describe

print("Summary Statistics for Numerical Columns:")
print(playstore_df.describe())

# Step 2: Describe categorical columns
print("\nSummary Statistics for Categorical Columns:")
print(playstore_df.describe(include='object'))

### Variables Description

Answer Here:

The Google Play Store dataset contains 13 variables (columns) that provide detailed information about apps available on the Google Play Store. The key variables include:

**App**: The name of the mobile application, with 9,660 unique app names.

**Category**: The primary category of the app, such as "GAME", "EDUCATION", or "TOOLS", with 33 unique categories.

**Rating**: The overall user rating of the app, ranging from 1.0 to 19.0 (values above 5.0 are likely to be errors).

**Reviews**: The total number of user reviews, ranging from 0 to 78,158,306.

**Size**: The size of the app file, represented in MB or KB, but stored as a string (e.g., 50M, 20k).

**Installs**: The total number of downloads/installs, represented as a string (e.g., "10,000+").

**Type**: Indicates whether the app is "Free" or "Paid".

**Price**: The cost of the app if paid, stored as a string (e.g., "$2.99").

**Content Rating**: The target audience for the app, such as "Everyone", "Teen", or "Mature 17+".

**Genres**: Additional sub-categories or genres the app belongs to, with 120 unique values.

**Last Updated**: The date when the app was last updated, stored as a string.

**Current Ver**: The current version of the app, with 1,442 unique versions.

**Android Ver**: The minimum Android version required to run the app, with 168 unique values.

The dataset consists of both numerical and categorical variables, with certain columns needing some cleaning. This includes converting Size, Installs, and Price to numeric formats, addressing any missing values, and changing Last Updated to a datetime format. Important insights reveal that free apps are the most common, there is a high occurrence of ratings between 4.0 and 4.5, and it's essential to manage outliers in the Rating column. This dataset is well-suited for analyzing app performance, user engagement, and market trends.
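
As a rough illustration of the conversions mentioned above, the sketch below parses a few sample strings. The `Installs` and `Price` samples follow the formats quoted in the variable list; the date strings and the `%B %d, %Y` format are assumptions made for the example and should be checked against the real `Last Updated` column.

```python
import pandas as pd

# Sample raw strings (illustrative; the date format is assumed)
raw = pd.DataFrame({
    "Installs": ["10,000+", "500+"],
    "Price": ["$2.99", "0"],
    "Last Updated": ["January 7, 2018", "August 1, 2018"],
})

parsed = pd.DataFrame({
    # Strip '+' and ',' before converting to a number
    "Installs": pd.to_numeric(
        raw["Installs"].str.replace(r"[+,]", "", regex=True), errors="coerce"
    ),
    # Strip the literal dollar sign
    "Price": pd.to_numeric(
        raw["Price"].str.replace("$", "", regex=False), errors="coerce"
    ),
    # Unparseable dates become NaT instead of raising
    "Last Updated": pd.to_datetime(
        raw["Last Updated"], format="%B %d, %Y", errors="coerce"
    ),
})

print(parsed.dtypes)
```

Using `errors="coerce"` turns malformed entries into missing values, which then need to be handled explicitly.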

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

print("Number of Unique Values for Each Column:")
print(playstore_df.nunique())

# Step 2: Displaying unique values for each column
print("\nUnique Values for Each Column:")
for column in playstore_df.columns:
    print(f"{column}: {playstore_df[column].nunique()} unique values")
    print(playstore_df[column].unique()[:10])  # Display first 10 unique values
    print()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
#code to make your dataset analysis ready.

# Load the dataset
playstore_df = pd.read_csv("/content/drive/MyDrive/dataset/playstore_data.csv")

# Step 1: Handle missing values
playstore_df.dropna(subset=['Rating', 'Type'], inplace=True)

# Step 2: Clean the 'Size' column
def convert_size(size):
    if isinstance(size, str):
        size = size.replace(',', '').replace('+', '')  # Remove ',' and '+'
        if 'M' in size:
            return float(size.replace('M', ''))
        elif 'k' in size:
            return float(size.replace('k', '')) / 1024  # Convert KB to MB
        elif size == 'Varies with device':
            return np.nan
    return np.nan

playstore_df['Size'] = playstore_df['Size'].apply(convert_size)

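# Fill missing values in 'Size' with the median
# (assign the result back; inplace=True on a single column is deprecated in newer pandas versions)
playstore_df['Size'] = playstore_df['Size'].fillna(playstore_df['Size'].median())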

# Step 3: Clean the 'Installs' column
playstore_df['Installs'] = (
    playstore_df['Installs']
    .astype(str)  # Ensure all values are strings
    .str.replace(r'[+,]', '', regex=True)  # Remove '+' and ','
)

# Convert to numeric, replacing errors with NaN
playstore_df['Installs'] = pd.to_numeric(playstore_df['Installs'], errors='coerce')

# Fill missing values with the median and convert to integer
playstore_df['Installs'] = playstore_df['Installs'].fillna(playstore_df['Installs'].median())
playstore_df['Installs'] = playstore_df['Installs'].astype(int)

# Step 4: Clean the 'Price' column
playstore_df['Price'] = playstore_df['Price'].str.replace(r'\$', '', regex=True)
playstore_df['Price'] = pd.to_numeric(playstore_df['Price'], errors='coerce')

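# Step 5: Handle outliers in the 'Rating' column
# .copy() makes the filtered frame independent, avoiding SettingWithCopyWarning in later assignments
playstore_df = playstore_df[playstore_df['Rating'] <= 5].copy()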

# Step 6: Convert 'Last Updated' to datetime
playstore_df['Last Updated'] = pd.to_datetime(playstore_df['Last Updated'], errors='coerce')

# Step 7: Remove duplicates
playstore_df.drop_duplicates(inplace=True)

# Step 8: Reset the index
playstore_df.reset_index(drop=True, inplace=True)

# Step 9: Verify the cleaned dataset
print("Cleaned Dataset Info:")
print(playstore_df.info())

print("\nFirst 5 rows of the cleaned dataset:")
print(playstore_df.head())
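
After a cleaning pass like the one above, a few automated checks can confirm that the frame is in the expected state. The helper below is a sketch; `check_cleaned` is a hypothetical name, and the two small frames are invented examples rather than rows from the dataset.

```python
import pandas as pd


def check_cleaned(df):
    """Return a list of problems found in a cleaned frame (an empty list means all checks passed)."""
    problems = []
    for col in ("Size", "Installs", "Price", "Rating"):
        if not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"{col} is not numeric")
    rating_is_numeric = pd.api.types.is_numeric_dtype(df["Rating"])
    if rating_is_numeric and not df["Rating"].between(0, 5).all():
        problems.append("Rating outside 0-5")
    if df.duplicated().any():
        problems.append("duplicate rows remain")
    return problems


# Invented examples: one clean frame, one with a string Size and an invalid Rating
good = pd.DataFrame({"Size": [10.0, 2.5], "Installs": [1000, 50],
                     "Price": [0.0, 2.99], "Rating": [4.1, 3.8]})
bad = pd.DataFrame({"Size": ["10M", "2.5M"], "Installs": [1000, 50],
                    "Price": [0.0, 2.99], "Rating": [4.1, 19.0]})

print(check_cleaned(good))
print(check_cleaned(bad))
```

In the notebook, the helper could be called on `playstore_df` right after Step 9.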


### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
#Distribution of App Ratings

plt.figure(figsize=(10, 6))
sns.histplot(playstore_df['Rating'], bins=30, kde=True, color='blue')
plt.title('Distribution of App Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here: I chose a histogram with a Kernel Density Estimate (KDE) overlay because I wanted to **analyze how app ratings are spread across the dataset**. A histogram shows how many apps fall within each rating range (bin).

The KDE adds a smooth curve over the bars, which makes the overall shape of the distribution easier to read, for example whether the ratings are skewed or roughly symmetric.

The histogram also helps me spot unusual values. Most ratings fall between 4.0 and 4.5. The raw data contained ratings above 5.0 (the maximum was 19.0), which are not valid; those rows were removed in the Data Wrangling step (`Rating <= 5`), so they do not appear in this chart.

The chart shows that most apps on the Google Play Store are rated between 4.0 and 4.5, which suggests that users are generally pleased with the apps they download. Developers can treat this range as the benchmark to stay competitive.

In short, I chose this chart because it conveys the distribution of app ratings, highlights the main trend, and gives developers a concrete target.

##### 2. What is/are the insight(s) found from the chart?

Answer Here: The histogram with a Kernel Density Estimate (KDE) shows how app ratings are distributed on the Google Play Store. Most apps have ratings **`clustered between 4.0 and 4.5`**, suggesting that users generally enjoy the apps they download. High ratings are the norm, so developers should aim for this range to stay competitive. The distribution shows a slight left skew: most apps sit at the high end, with a thin tail of apps rated below 3.0. The raw data also contained ratings above 5.0 (the maximum was 19.0), which points to a data quality issue because ratings cannot exceed 5; those rows were removed during Data Wrangling and are not part of the chart. In summary, the chart underlines the importance of maintaining high user satisfaction, since low-rated apps are rare and may find it hard to attract users. Developers can use this insight to concentrate on app quality and user experience in order to achieve and maintain high ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here: The insights obtained from analyzing the Google Play Store dataset can certainly contribute to a positive business impact. However, there are also potential risks that, if not managed effectively, could result in negative growth.

The insights derived from the dataset can greatly enhance business outcomes by helping developers concentrate on highly-rated categories, streamline app size, reach wider audiences, and foster positive reviews. On the flip side, there are risks of negative growth if developers overlook low-rated categories, set high prices for paid apps, target niche audiences without a solid strategy, or neglect data quality concerns. By thoughtfully balancing these insights and mitigating potential risks, businesses can improve their chances of success on the Google Play Store.
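
The skew and the 4.0 to 4.5 cluster discussed for Chart 1 can also be expressed as numbers instead of being read off the plot. The sketch below uses a handful of invented ratings; on the real data, the same two calls could be applied to `playstore_df['Rating']`.

```python
import pandas as pd

# Invented ratings: mostly high, with one low value forming a left tail
ratings = pd.Series([4.5, 4.4, 4.3, 4.2, 4.1, 4.0, 2.0], name="Rating")

# Sample skewness: negative values indicate a longer tail on the low side
skewness = ratings.skew()

# Share of apps rated between 4.0 and 4.5 (both ends included)
share_4_to_4_5 = ratings.between(4.0, 4.5).mean()

print(f"skewness: {skewness:.2f}, share in 4.0-4.5: {share_4_to_4_5:.0%}")
```

A negative skewness supports the "left skew" reading, and the share gives a concrete figure for how dominant the 4.0 to 4.5 range is.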

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#Category vs. Average Rating

category_rating = playstore_df.groupby('Category')['Rating'].mean().sort_values(ascending=False)

plt.figure(figsize=(12, 6))
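# Pass hue with legend=False so the palette is applied without a deprecation warning (seaborn >= 0.13)
sns.barplot(x=category_rating.index, y=category_rating.values, hue=category_rating.index, palette='viridis', legend=False)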
plt.xticks(rotation=90)
plt.title('Average Rating by Category')
plt.xlabel('Category')
plt.ylabel('Average Rating')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here: I chose a bar plot to display the average ratings by category because it offers a clear and simple way to compare how different app categories perform. The x-axis shows the categories, while the y-axis indicates the average ratings, allowing for an easy visual assessment of which categories score higher or lower. With a bar plot, I can effectively emphasize the differences in average ratings among categories, such as Education and Art & Design having higher ratings compared to Dating and Tools. The vertical bars facilitate quick comparisons, and using a color palette like viridis improves both readability and visual appeal. Additionally, rotating the x-axis labels by 90 degrees ensures that all category names are legible and do not overlap. This chart is especially helpful for pinpointing which categories are excelling and which might require enhancements, offering valuable insights for developers and stakeholders. In summary, the bar plot is a great choice for this analysis due to its simplicity, intuitiveness, and ability to convey key insights about performance across categories.

##### 2. What is/are the insight(s) found from the chart?

Answer Here : The bar plot showing **average ratings by category** provides several important insights into how different app categories perform on the Google Play Store. Categories such as **Education, Art & Design**, and **Events** stand out with the highest average ratings, suggesting that apps in these areas are well-liked by users and likely deliver quality experiences. This indicates that developers working in these categories may have a better chance of achieving user satisfaction and success. Conversely, categories like **Dating and Tools** show lower average ratings, pointing to possible difficulties in meeting user expectations or providing value. This information can help developers either enhance the quality of apps in these lower-rated categories or consider shifting their focus to those with higher ratings where user satisfaction is already evident. Furthermore, the chart highlights the need to understand trends specific to each category, as user expectations and preferences can differ greatly among various types of apps. By utilizing these insights, developers can make strategic decisions about which categories to pursue and how to prioritize improvements to boost user satisfaction and drive business growth.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here : The insights from the chart can indeed help create a **positive business impact** by directing developers and companies to concentrate on high-performing categories such as **Education, Art & Design, and Events**, which have the highest average ratings. By focusing on these areas, developers can create applications that are more likely to meet user expectations, garner more installs, and receive favorable reviews, ultimately boosting revenue and market share. Furthermore, recognizing categories with lower ratings, like Dating and Tools, presents an opportunity for enhancement. Developers can delve into user feedback to pinpoint issues in these categories, improving app quality and user experience, and potentially transforming underperforming categories into lucrative ones. However, there is a risk of **negative growth** if businesses overlook the challenges in lower-rated categories or neglect user dissatisfaction. For instance, continuing to invest in Dating or Tools without enhancing app quality could result in unfavorable user reviews, decreased installs, and lower revenue. Likewise, flooding high-rated categories without unique offerings could lead to heightened competition and make it hard to stand out. Thus, while the insights provide a pathway to success, businesses must strategically balance their efforts, focusing on both thriving categories and enhancing those that are lagging to avoid potential setbacks and ensure sustainable growth.
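
An average rating is less trustworthy for a category with only a few apps, so it can help to look at the app count next to the mean. The sketch below uses invented rows; `MIN_APPS` is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Invented rows: EVENTS has a single app, so its mean rests on one data point
toy = pd.DataFrame({
    "Category": ["EVENTS", "TOOLS", "TOOLS", "TOOLS", "DATING", "DATING"],
    "Rating": [4.8, 4.0, 4.2, 3.8, 3.9, 3.7],
})

# Mean rating and number of rated apps per category, best first
summary = (
    toy.groupby("Category")["Rating"]
    .agg(mean_rating="mean", n_apps="count")
    .sort_values("mean_rating", ascending=False)
)

MIN_APPS = 2  # hypothetical cutoff for a "reliable" average
reliable = summary[summary["n_apps"] >= MIN_APPS]

print(summary)
print(reliable)
```

On the real data, the same aggregation on `playstore_df` would show whether the top-ranked categories are also well populated.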

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# Step 1: Compare Free vs. Paid Apps
# Create a bar plot for average rating
plt.figure(figsize=(10, 6))
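# errorbar=None replaces the deprecated ci=None; hue with legend=False applies the palette without a warning
sns.barplot(x='Type', y='Rating', data=playstore_df, hue='Type', palette='Set2', legend=False, errorbar=None)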
plt.title('Average Rating: Free vs. Paid Apps')
plt.xlabel('Type')
plt.ylabel('Average Rating')
plt.show()

# Create a box plot for installs
plt.figure(figsize=(10, 6))
sns.boxplot(x='Type', y='Installs', data=playstore_df, hue='Type', palette='Set3', legend=False)
plt.title('Installs: Free vs. Paid Apps')
plt.xlabel('Type')
plt.ylabel('Installs (log scale)')
plt.yscale('log')  # Use log scale for better visualization
plt.show()

# Create a histogram for price distribution of paid apps
paid_apps = playstore_df[playstore_df['Type'] == 'Paid']
plt.figure(figsize=(10, 6))
sns.histplot(paid_apps['Price'], bins=30, kde=True, color='purple')
plt.title('Price Distribution of Paid Apps')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here : I selected the **bar plot, box plot, and histogram** to compare Free and Paid apps because each type of chart highlights different aspects of the data. The bar plot is ideal for visualizing the **average ratings** of Free and Paid apps, as it allows for a clear comparison between the two groups, making it easy to identify which type of app generally receives higher user satisfaction. The **box plot** is useful for comparing the **distribution of installs**, as it effectively illustrates the spread, central tendency, and outliers in the data, particularly when there is a wide range of values (which is why a log scale is used for better visualization). Lastly, the **histogram** is perfect for showing the **price distribution of Paid apps**, as it indicates how prices are spread across various ranges, helping to pinpoint the most common price points and trends in affordability. Together, these charts offer a well-rounded view of the differences between Free and Paid apps, providing valuable insights for developers and stakeholders to make informed decisions.

##### 2. What is/are the insight(s) found from the chart?

Answer Here : The visualizations comparing Free and Paid apps provide several important insights. First, the **bar plot** indicates that **Free apps** tend to have a slightly higher average rating than Paid apps, which may suggest that users hold higher expectations for Paid apps, resulting in somewhat lower satisfaction levels. The **box plot** reveals that Free apps reach far higher install counts than Paid apps, showing that users are more inclined to download free options. Additionally, the **histogram** of Paid apps' prices indicates that most are priced under **$10**, mainly in the $1-$5 range, implying that users favor affordable apps and are generally reluctant to pay higher prices. These findings suggest that Free apps enjoy greater popularity and user satisfaction, while Paid apps must be competitively priced and offer exceptional value to draw in users. Developers can use these insights to focus on free apps with in-app monetization strategies or ensure that Paid apps are both affordable and of high quality to enhance user engagement and revenue.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here : The insights gained from the visualizations can significantly impact business outcomes by helping developers and companies make informed choices. For example, the higher average ratings and notably increased installs for Free apps indicate that providing free apps with in-app purchases or advertisements can draw in a larger user base and create revenue through alternative monetization methods. Furthermore, the preference for reasonably priced Paid apps (under $10) suggests that competitively pricing apps can enhance their attractiveness and boost sales. These insights can assist businesses in refining their app offerings, enhancing user satisfaction, and maximizing revenue.

On the other hand, there are insights that could lead to negative growth if not properly addressed. For instance, the slightly lower average ratings for Paid apps imply that users have elevated expectations for paid content, and not meeting these expectations could lead to unfavorable reviews, fewer installs, and decreased revenue. Additionally, pricing Paid apps above the $10 mark could discourage users, as the data indicates a strong preference for affordability. If businesses overlook these trends and continue to offer overpriced or subpar Paid apps, they risk losing market share to competitors who better cater to user preferences. Thus, while the insights offer a pathway to success, businesses must carefully balance their strategies to avoid potential pitfalls and ensure sustainable growth.
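
The three Chart 3 plots can be backed by a small summary table. The sketch below works on invented rows, so its numbers say nothing about the real dataset; it only shows how the comparison could be computed.

```python
import pandas as pd

# Invented rows for illustration
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Rating": [4.2, 4.4, 4.0, 4.1, 4.3],
    "Installs": [1_000_000, 50_000, 10_000, 1_000, 5_000],
    "Price": [0.0, 0.0, 0.0, 2.99, 14.99],
})

# Mean rating, median installs, and app count per type
by_type = toy.groupby("Type").agg(
    mean_rating=("Rating", "mean"),
    median_installs=("Installs", "median"),
    n_apps=("Rating", "size"),
)

# Share of paid apps priced below $10
paid = toy[toy["Type"] == "Paid"]
share_under_10 = (paid["Price"] < 10).mean()

print(by_type)
print(share_under_10)
```

The median is used for installs because a few very popular apps can dominate the mean.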

#### Chart - 4

In [None]:
# Chart - 4 visualization code
#App Size vs. Installs


# Step 1: Make sure 'Installs' is numeric.
# It was already converted in the Data Wrangling step, so this is only a safeguard
# that also works if the column still holds strings such as "10,000+".
playstore_df['Installs'] = pd.to_numeric(
    playstore_df['Installs'].astype(str).str.replace(r'[+,]', '', regex=True),
    errors='coerce'
)

# Step 2: Verify the cleaned 'Installs' column
print("Cleaned 'Installs' column:")
print(playstore_df['Installs'].head())

# Step 3: Make sure 'Size' is numeric (in MB)
def convert_size(size):
    # Values already converted during Data Wrangling pass through unchanged;
    # without this check every numeric size would be turned into NaN.
    if isinstance(size, (int, float)):
        return size
    if isinstance(size, str):
        if 'M' in size:
            return float(size.replace('M', ''))
        elif 'k' in size:
            return float(size.replace('k', '')) / 1024  # Convert KB to MB
    return np.nan  # Covers 'Varies with device' and anything unparseable

playstore_df['Size'] = playstore_df['Size'].apply(convert_size)

# Fill missing values in 'Size' with the median
playstore_df['Size'] = playstore_df['Size'].fillna(playstore_df['Size'].median())

# Step 4: Create a scatter plot for App Size vs. Installs
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Size', y='Installs', data=playstore_df, alpha=0.5, color='green')
plt.title('App Size vs. Installs')
plt.xlabel('Size (MB)')
plt.ylabel('Installs')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here. I selected a scatter plot to illustrate the relationship between App Size and Installs because it effectively demonstrates the correlation between these two numerical variables. A scatter plot enables us to see how an app's size (in MB) corresponds to the number of installs, making it easier to spot trends, patterns, and outliers. For instance, we can determine if smaller apps generally receive more installs or if larger apps have difficulty attracting users. Additionally, the scatter plot aids in understanding the distribution of data points and whether a clear relationship exists or if other factors might be at play. By utilizing a scatter plot, I can effectively convey insights about how app size influences user adoption, which is essential for developers aiming to enhance their apps for improved performance and increased installs. The transparency of the points (using `alpha=0.5`) ensures that overlapping data points remain visible, offering a thorough view of the dataset. In summary, the scatter plot is the best option for this analysis due to its simplicity, intuitiveness, and ability to clearly highlight the relationship between app size and installs.

##### 2. What is/are the insight(s) found from the chart?

Answer Here : The scatter plot illustrating **App Size vs. Installs** provides several important insights. Firstly, it shows a noticeable trend where **smaller apps (under 50MB)** tend to receive more installs, indicating that users prefer apps that occupy less storage on their devices. This is especially relevant in areas where storage or data is limited. Secondly, **larger apps (over 100MB)** typically see fewer installs, suggesting that users are hesitant to download apps that require a lot of storage. However, the connection between app size and installs isn't entirely straightforward, as other elements like app quality, category, and user ratings significantly influence install numbers. For instance, some larger apps with excellent ratings or distinctive features still attract a considerable number of installs. In summary, these insights imply that optimizing app size can enhance user adoption, but developers also need to prioritize delivering high-quality and engaging content for success. These findings can help developers strike a balance between app size and functionality to boost installs and user satisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here : The insights gained from the **App Size vs. Installs** visualization can significantly influence **positive business impact** by helping developers optimize their app size. Smaller apps (under 50MB) generally attract more installs, so developers should aim to reduce app size without sacrificing functionality. This approach makes their apps more attractive to users, particularly in areas with limited storage or data availability. Such optimization can lead to increased user adoption, higher engagement, and ultimately, greater revenue. However, there is a risk of **negative growth** if developers focus too much on reducing app size at the cost of app quality or essential features. For example, excessive compression might lead to a poor user experience, resulting in negative reviews and fewer installs. Moreover, larger apps that provide unique or high-quality features can still thrive despite their size, so completely avoiding larger apps might restrict opportunities for innovation. Therefore, while optimizing app size is advantageous, developers need to find a balance between size, functionality, and user experience to avoid potential pitfalls and ensure sustainable growth.
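The size thresholds discussed above can also be checked numerically instead of by eye. The sketch below is only an illustration: `toy` is a small made-up frame standing in for the cleaned `playstore_df`, the bucket labels are my own, and the 50 MB and 100 MB edges simply mirror the thresholds mentioned in the text.

```python
import pandas as pd

# Toy stand-in for the cleaned playstore_df (all values are made up for illustration).
toy = pd.DataFrame({
    "Size": [5.0, 20.0, 45.0, 60.0, 80.0, 120.0, 150.0],
    "Installs": [1_000_000, 500_000, 100_000, 50_000, 100_000, 10_000, 5_000],
})

# Bucket app sizes; intervals are (0, 50], (50, 100], (100, inf).
bins = [0, 50, 100, float("inf")]
labels = ["up to 50MB", "50-100MB", "over 100MB"]
toy["Size_Bucket"] = pd.cut(toy["Size"], bins=bins, labels=labels)

# The median is less sensitive than the mean to a handful of very popular apps.
summary = toy.groupby("Size_Bucket", observed=True)["Installs"].agg(["count", "median"])
print(summary)
```

Reporting the count next to the median shows how many apps support each bucket, which matters when the largest bucket contains only a few apps.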



#### Chart - 5

In [None]:
# Chart - 5 visualization code

#Analyze sentiment distribution
sentiment_counts = reviews_df['Sentiment'].value_counts()

# Step 3: Create a bar plot for sentiment distribution
plt.figure(figsize=(8, 6))
sns.barplot(x=sentiment_counts.index, y=sentiment_counts.values, palette='viridis')
plt.title('Distribution of Sentiments')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

# Step 4: Analyze sentiment polarity
plt.figure(figsize=(10, 6))
sns.histplot(reviews_df['Sentiment_Polarity'], bins=30, kde=True, color='blue')
plt.title('Distribution of Sentiment Polarity')
plt.xlabel('Sentiment Polarity')
plt.ylabel('Frequency')
plt.show()

# Step 5: Analyze sentiment subjectivity
plt.figure(figsize=(10, 6))
sns.histplot(reviews_df['Sentiment_Subjectivity'], bins=30, kde=True, color='orange')
plt.title('Distribution of Sentiment Subjectivity')
plt.xlabel('Sentiment Subjectivity')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here : I chose a bar plot to visualize the distribution of sentiments because it offers a clear and straightforward comparison between the counts of positive, negative, and neutral reviews. This format makes it easy to identify which sentiment is most prevalent and how the others stack up against it. For sentiment polarity and subjectivity, I opted for histograms with KDE (Kernel Density Estimate) since they effectively illustrate the distribution of continuous data. The histograms help us pinpoint where most of the polarity and subjectivity scores fall, while the KDE smooths out the distribution, making it simpler to spot patterns. These charts are perfect for grasping the overall sentiment landscape and recognizing trends in user feedback.

##### 2. What is/are the insight(s) found from the chart?

Answer Here : The visualizations uncover several key insights. First, the bar plot indicates that the majority of user reviews are positive, suggesting that users are generally pleased with the apps. However, there is also a notable number of negative reviews, pointing out areas that need improvement. The histogram for sentiment polarity reveals that most reviews are either neutral or slightly positive, with few extreme negative or positive scores. This implies that while users are mostly satisfied, their feedback tends to be moderate rather than overwhelmingly positive. The histogram for sentiment subjectivity shows that most reviews are moderately subjective, meaning they blend facts and opinions. This suggests that users are offering balanced feedback, which can be crucial for pinpointing specific areas for enhancement.
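To put a number on "the majority of reviews are positive", the counts behind the bar plot can be expressed as percentages. This is a minimal sketch on a made-up frame (`toy_reviews`) that only borrows the `Sentiment` column name from `reviews_df`; the values are illustrative.

```python
import pandas as pd

# Toy stand-in for reviews_df; one review has no sentiment label.
toy_reviews = pd.DataFrame({
    "Sentiment": ["Positive", "Positive", "Negative", "Neutral", "Positive", None],
})

# value_counts drops missing labels by default; normalize=True returns fractions.
sentiment_share = (
    toy_reviews["Sentiment"].value_counts(normalize=True).mul(100).round(1)
)
print(sentiment_share)
```

Percentages make the chart comparable across datasets of different sizes, which raw counts do not.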

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here : The insights derived from the sentiment analysis can indeed foster a positive business impact. For instance, the high volume of positive reviews can be utilized in marketing campaigns to build trust and attract more users. Furthermore, the moderate subjectivity of reviews indicates that users are providing constructive feedback, which developers can leverage to enhance app quality and user experience. However, there is a risk of negative growth if the negative reviews are overlooked.

#### Chart - 6

In [None]:
# Chart - 6 visualization code of Time series
#Time Series Analysis

# Step 1: Clean the dataset
# Drop rows with missing values in 'Rating' and 'Type'
playstore_df = playstore_df.dropna(subset=['Rating', 'Type'])

# Ensure 'Installs' is treated as a string
playstore_df['Installs'] = playstore_df['Installs'].astype(str)

# Clean the 'Installs' column
# Keep only rows whose 'Installs' value looks like a count such as '10,000+' (drops e.g. 'Free')
playstore_df = playstore_df[playstore_df['Installs'].str.contains(r'^\d[\d,]*\+?$', regex=True)].copy()

# Remove '+' and ',' from the 'Installs' column and convert to integer
# regex=False treats '+' as a literal character rather than a regex quantifier
playstore_df['Installs'] = playstore_df['Installs'].str.replace('+', '', regex=False).str.replace(',', '', regex=False).astype(int)

# Clean the 'Size' column
def convert_size(size):
    if isinstance(size, str):
        if 'M' in size:
            return float(size.replace('M', ''))
        elif 'k' in size:
            return float(size.replace('k', '')) / 1024  # Convert KB to MB
        else:
            return np.nan  # 'Varies with device' or any other unrecognized string
    return size  # Already numeric (e.g. the cell was re-run): keep the value

playstore_df['Size'] = playstore_df['Size'].apply(convert_size)

# Fill missing values in 'Size' with the median
playstore_df['Size'] = playstore_df['Size'].fillna(playstore_df['Size'].median())

# Ensure 'Price' is treated as a string
playstore_df['Price'] = playstore_df['Price'].astype(str)

# Clean the 'Price' column (regex=False so '$' is a literal, not an end-of-string anchor)
playstore_df['Price'] = playstore_df['Price'].str.replace('$', '', regex=False).astype(float)

# Convert 'Reviews' to numeric
playstore_df['Reviews'] = pd.to_numeric(playstore_df['Reviews'], errors='coerce')

# Drop rows with missing 'Reviews' (if any)
playstore_df = playstore_df.dropna(subset=['Reviews'])

# Convert 'Last Updated' to datetime
playstore_df['Last Updated'] = pd.to_datetime(playstore_df['Last Updated'])

# Step 2: Group by 'Last Updated' and calculate average metrics
time_series_data = playstore_df.groupby('Last Updated').agg({
    'Rating': 'mean',
    'Installs': 'mean',
    'Reviews': 'mean'
}).reset_index()

# Step 3: Visualize time series for average rating
plt.figure(figsize=(12, 6))
sns.lineplot(x='Last Updated', y='Rating', data=time_series_data, color='blue')
plt.title('Average Rating Over Time')
plt.xlabel('Last Updated')
plt.ylabel('Average Rating')
plt.show()

# Step 4: Visualize time series for average installs
plt.figure(figsize=(12, 6))
sns.lineplot(x='Last Updated', y='Installs', data=time_series_data, color='green')
plt.title('Average Installs Over Time')
plt.xlabel('Last Updated')
plt.ylabel('Average Installs')
plt.show()

# Step 5: Visualize time series for average reviews
plt.figure(figsize=(12, 6))
sns.lineplot(x='Last Updated', y='Reviews', data=time_series_data, color='purple')
plt.title('Average Reviews Over Time')
plt.xlabel('Last Updated')
plt.ylabel('Average Reviews')
plt.show()
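
The install-count cleaning in the cell above relies on recognising strings such as `10,000+`. As a standalone sanity check of that idea, the sketch below parses a few sample strings with the standard library only. `INSTALLS_PATTERN` and `parse_installs` are hypothetical names introduced here, and the assumed format (digits, optional thousands separators, optional trailing `+`) is my reading of the column, not something the dataset documentation guarantees.

```python
import re

# Assumed format: a digit, then digits or commas, then an optional trailing '+'.
INSTALLS_PATTERN = re.compile(r"^\d[\d,]*\+?$")

def parse_installs(raw):
    """Return the install count as an int, or None if the value is not a count."""
    text = str(raw).strip()
    if not INSTALLS_PATTERN.match(text):
        return None
    return int(text.replace(",", "").replace("+", ""))

samples = ["0", "500+", "10,000+", "1,000,000+", "Free"]
parsed = [parse_installs(s) for s in samples]
print(parsed)
```

Testing the pattern on strings with several commas is worthwhile, because a pattern that allows only one separator would silently drop every app with 1,000 or more installs.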

##### 1. Why did you pick the specific chart?

Answer Here. I selected line plots for the time series analysis because they effectively illustrate trends and patterns over time. Line plots are especially helpful for demonstrating how a variable, like average rating, installs, or reviews, changes across a continuous time frame, such as the **Last Updated** column in the dataset. By charting the average values of **Rating, Installs, and Reviews over time**, we can easily spot trends, including increases or decreases in user satisfaction, app popularity, or user engagement. The smooth lines connecting the data points help in identifying patterns, such as seasonal spikes in installs or gradual improvements in ratings. Moreover, line plots are straightforward and easy to understand, making them accessible to both technical and non-technical audiences. This approach enables us to effectively convey insights about the evolution of app performance metrics over time, assisting developers and stakeholders in making informed decisions to enhance app quality and user satisfaction.
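Grouping by the exact `Last Updated` date can give a jagged line, because many dates contain only one or two apps. One possible alternative, sketched below on made-up data, is to aggregate by month and keep the number of apps behind each point. The frame `toy` and its values are illustrative; only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for playstore_df with parsed dates (values are made up).
toy = pd.DataFrame({
    "Last Updated": pd.to_datetime(
        ["2018-01-05", "2018-01-20", "2018-02-10", "2018-02-11", "2018-03-01"]
    ),
    "Rating": [4.0, 4.4, 3.8, 4.2, 4.5],
})

# Resample to month-start bins; 'count' shows how many apps support each mean.
monthly = (
    toy.set_index("Last Updated")["Rating"]
       .resample("MS")
       .agg(["mean", "count"])
)
print(monthly)
```

Note that `Last Updated` records when an app was last updated, so a monthly curve describes apps grouped by update date rather than the history of any single app.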

##### 2. What is/are the insight(s) found from the chart?

Answer Here :

**Insights from the Time Series Analysis**

**Trends in Ratings:**

An upward trend in average ratings over time suggests that the quality of the app is getting better.

Conversely, a downward trend in average ratings may point to a drop in user satisfaction.

**Trends in Installs**:

Sudden increases in installs could be linked to successful app launches, updates, or effective marketing efforts.

A decline in installs might reflect waning user interest or heightened competition.

**Trends in Reviews:**

A rise in the number of reviews could indicate greater user engagement or an expanding user base.

On the other hand, a drop in reviews may imply a decrease in user interest or satisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here :

**Positive Impact :**

Recognizing patterns in ratings, installs, and reviews enables developers to grasp user behavior and enhance app quality.

Surges in installs can reveal effective strategies (like marketing campaigns) that can be duplicated in the future.

**Negative Growth Risk:**

Falling ratings or installs might signal problems with app quality or user satisfaction. Overlooking these trends could result in negative growth.

A scarcity of reviews may indicate low user engagement, which could affect the app's visibility and overall success.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# Step 1: Clean the dataset
# Drop rows with missing values in 'Rating' and 'Type'
playstore_df = playstore_df.dropna(subset=['Rating', 'Type'])

# Ensure 'Installs' is treated as a string
playstore_df['Installs'] = playstore_df['Installs'].astype(str)

# Clean the 'Installs' column
# Keep only rows whose 'Installs' value looks like a count such as '10,000+' (drops e.g. 'Free')
playstore_df = playstore_df[playstore_df['Installs'].str.contains(r'^\d[\d,]*\+?$', regex=True)].copy()

# Remove '+' and ',' from the 'Installs' column and convert to integer
# regex=False treats '+' as a literal character rather than a regex quantifier
playstore_df['Installs'] = playstore_df['Installs'].str.replace('+', '', regex=False).str.replace(',', '', regex=False).astype(int)

# Clean the 'Size' column
def convert_size(size):
    if isinstance(size, str):
        if 'M' in size:
            return float(size.replace('M', ''))
        elif 'k' in size:
            return float(size.replace('k', '')) / 1024  # Convert KB to MB
        else:
            return np.nan  # 'Varies with device' or any other unrecognized string
    return size  # Already numeric (e.g. converted by an earlier cell): keep the value

playstore_df['Size'] = playstore_df['Size'].apply(convert_size)

# Fill missing values in 'Size' with the median
playstore_df['Size'] = playstore_df['Size'].fillna(playstore_df['Size'].median())

# Ensure 'Price' is treated as a string
playstore_df['Price'] = playstore_df['Price'].astype(str)

# Clean the 'Price' column (regex=False so '$' is a literal, not an end-of-string anchor)
playstore_df['Price'] = playstore_df['Price'].str.replace('$', '', regex=False).astype(float)

# Convert 'Reviews' to numeric
playstore_df['Reviews'] = pd.to_numeric(playstore_df['Reviews'], errors='coerce')

# Drop rows with missing 'Reviews' (if any)
playstore_df = playstore_df.dropna(subset=['Reviews'])

# Step 2: Analyze Content Rating
# Group by 'Content Rating' and calculate average metrics
content_rating_data = playstore_df.groupby('Content Rating').agg({
    'Rating': 'mean',
    'Installs': 'mean',
    'Reviews': 'mean'
}).reset_index()

# Step 3: Visualize average rating by content rating
plt.figure(figsize=(10, 6))
sns.barplot(x='Content Rating', y='Rating', data=content_rating_data, palette='viridis')
plt.title('Average Rating by Content Rating')
plt.xlabel('Content Rating')
plt.ylabel('Average Rating')
plt.xticks(rotation=45)
plt.show()

# Step 4: Visualize average installs by content rating
plt.figure(figsize=(10, 6))
sns.barplot(x='Content Rating', y='Installs', data=content_rating_data, palette='rocket')
plt.title('Average Installs by Content Rating')
plt.xlabel('Content Rating')
plt.ylabel('Average Installs')
plt.xticks(rotation=45)
plt.show()

# Step 5: Visualize average reviews by content rating
plt.figure(figsize=(10, 6))
sns.barplot(x='Content Rating', y='Reviews', data=content_rating_data, palette='magma')
plt.title('Average Reviews by Content Rating')
plt.xlabel('Content Rating')
plt.ylabel('Average Reviews')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here : I chose bar plots for the Content Rating Analysis because they are very effective for comparing categorical data across different groups. In this case, the **Content Rating** column includes distinct categories (like Everyone, Teen, Mature), and bar plots help us clearly visualize how average metrics such as **Rating, Installs, and Reviews** differ across these categories. The vertical bars make it easy to compare the values for each content rating, and using different color palettes (like **viridis, rocket, magma**) improves both readability and visual appeal. Moreover, bar plots are straightforward and intuitive, making them accessible to both technical and non-technical audiences. By utilizing bar plots, I can effectively convey insights about how content rating influences app performance, assisting developers and stakeholders in making data-driven decisions to optimize their apps for specific target audiences.

##### 2. What is/are the insight(s) found from the chart?

Answer Here : The bar plots provide several important insights into how **Content Rating** affects app performance. Apps rated for Everyone generally have higher average ratings, more installs, and more reviews than those rated for **Teen, Mature 17+, or Adults only 18+**. This indicates that apps aimed at a wider audience (like Everyone) are more likely to achieve greater user satisfaction, engagement, and popularity. Conversely, apps with stricter content ratings (such as Adults only 18+) tend to have significantly fewer installs and reviews, suggesting they cater to a more niche audience. These findings emphasize the need to understand the target audience and adjust app content accordingly. Developers can leverage this information to fine-tune their apps for specific content ratings, ensuring they align with the expectations of their intended users and enhance user engagement.
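Averages per content rating can be misleading when one group holds a few blockbuster apps or very few apps at all. A small sketch of a more defensive summary follows. The frame `toy` is made up, and the output column names (`n_apps`, `mean_installs`, `median_installs`) are my own labels.

```python
import pandas as pd

# Toy stand-in for playstore_df (values are made up for illustration).
toy = pd.DataFrame({
    "Content Rating": ["Everyone", "Everyone", "Everyone", "Teen", "Teen", "Adults only 18+"],
    "Installs": [1_000, 10_000, 1_000_000_000, 50_000, 100_000, 500_000],
})

# Report group size and median next to the mean for each content rating.
summary = toy.groupby("Content Rating")["Installs"].agg(
    n_apps="count", mean_installs="mean", median_installs="median"
)
print(summary)
```

In this toy data a single very large app pulls the mean for Everyone far above its median, and the Adults only 18+ row rests on one app, so neither figure should be over-interpreted.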

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here : The insights from the **Content Rating Analysis** can have a significant **positive impact** on business by helping developers tailor their apps for specific target audiences. For instance, concentrating on apps rated for Everyone can result in increased installs, ratings, and reviews, as these apps cater to a wider audience. Moreover, grasping the preferences of niche audiences (like Teen or Mature 17+) allows developers to craft experiences that address the distinct needs of these groups. However, there is a danger of negative growth if developers overlook the preferences of certain content rating categories. For example, apps rated Adults only 18+ or Mature 17+ might find it harder to attract users due to their specialized appeal (a **negative impact**), and not providing exceptional value to these audiences could lead to lower ratings, fewer installs, and diminished revenue. Thus, while the insights offer a pathway to success, developers need to strike a careful balance in their strategies to avoid potential setbacks and ensure sustainable growth.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

#Pie Chart - visualization
#Purpose: Show the proportion of categories in a dataset.

# Step 1: Clean the dataset
# Drop rows with missing values in 'Rating' and 'Type'
playstore_df = playstore_df.dropna(subset=['Rating', 'Type'])

# Step 2: Analyze the proportion of free vs. paid apps
type_counts = playstore_df['Type'].value_counts()

# Step 3: Create a pie chart for free vs. paid apps
plt.figure(figsize=(8, 6))
plt.pie(type_counts, labels=type_counts.index, autopct='%1.1f%%', colors=['lightblue', 'lightgreen'], startangle=90)
plt.title('Proportion of Free vs. Paid Apps')
plt.show()

# Step 4: Analyze the distribution of apps across categories
category_counts = playstore_df['Category'].value_counts()

# Step 5: Create a pie chart for app categories
plt.figure(figsize=(10, 8))
plt.pie(category_counts, labels=category_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Apps Across Categories')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here : I chose a pie chart for this analysis because it effectively visualizes proportions and percentages within a dataset. Pie charts are especially helpful when illustrating how a whole is divided into parts, like the ratio of free to paid apps or the distribution of apps across various categories. The circular format allows for easy comparison of the relative sizes of each segment, and including percentages (**autopct**) offers a clear view of the distribution. Moreover, pie charts are visually appealing and straightforward, making them accessible to both technical and non-technical audiences. By utilizing pie charts, I can quickly convey insights about the prevalence of free apps or the popularity of certain app categories, aiding developers and stakeholders in making informed decisions regarding app development and marketing strategies.

##### 2. What is/are the insight(s) found from the chart?

Answer Here : The pie charts provide several important insights into the Google Play Store dataset. Firstly, the pie chart comparing **free** and **paid apps** shows that most apps are free, which is consistent with developers favoring free distribution, often monetized through the freemium model. This model allows users to download apps for free while generating revenue through in-app purchases or advertisements. The `Type` column alone does not show how free apps earn revenue, so this reading is an inference rather than something the chart measures. Even so, it seems that users are more inclined to download free apps, making this approach a safer bet for developers looking to achieve higher adoption rates. Secondly, the pie chart illustrating **app categories** reveals that the **Games category** takes the lead, accounting for a substantial share of all apps. This suggests that gaming apps are not only popular but also highly competitive. In contrast, categories such as **Education and Tools** occupy smaller shares, indicating niche markets that may have less competition but also attract fewer users. These insights can help developers select the most suitable category and pricing strategy for their apps to enhance their chances of success.
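A pie chart with dozens of categories tends to produce overlapping labels, so one option is to fold the smallest slices into a single "Other" slice before plotting. The helper below is a hypothetical sketch: `fold_small_slices`, the 5% cut-off, and the toy counts are all assumptions made for illustration.

```python
import pandas as pd

def fold_small_slices(counts, min_share=0.05, other_label="Other"):
    """Combine categories below min_share of the total into one slice."""
    shares = counts / counts.sum()
    keep = counts[shares >= min_share]
    other_total = counts[shares < min_share].sum()
    if other_total > 0:
        keep = pd.concat([keep, pd.Series({other_label: other_total})])
    return keep

# Toy category counts (made up); real input would be a value_counts() result.
toy_counts = pd.Series({"A": 50, "B": 30, "C": 12, "D": 4, "E": 3, "F": 1})
folded = fold_small_slices(toy_counts)
print(folded)
```

The folded series can be passed to `plt.pie` in place of the full counts, with `folded.index` as the labels.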

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here : The insights from the pie charts can **positively influence** business outcomes by helping developers concentrate on strategies that resonate with user preferences. For instance, the prevalence of **free apps** indicates that developers should emphasize the freemium model to enhance user adoption and revenue. Likewise, the popularity of the **Games category** suggests that developing gaming apps can draw in a substantial user base, but developers need to ensure their games are distinctive in a crowded market. On the flip side, **there is a risk of negative growth** if developers overlook the challenges highlighted by these insights. For example, flooding the **Games category** with similar offerings could result in low visibility and fewer downloads. Moreover, focusing exclusively on free apps might restrict revenue potential if in-app monetization strategies are not effectively executed. Thus, while these insights offer a pathway to success, developers must strike a careful balance in their strategies to sidestep potential pitfalls and achieve sustainable growth.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
#bubble chart Visualization

# Load dataset
df = pd.read_csv("/content/drive/MyDrive/dataset/playstore_data.csv")

# Ensure necessary columns exist
required_columns = ['Rating', 'Installs', 'Size']
for col in required_columns:
    if col not in df.columns:
        raise KeyError(f"Required column '{col}' not found in the dataset.")

# Drop missing values in the required columns
df.dropna(subset=required_columns, inplace=True)

# Convert 'Installs' column to numeric (remove '+', ',' and filter out non-numeric values)
df['Installs'] = df['Installs'].astype(str).str.replace('[+,]', '', regex=True)  # Remove '+' and ','
df = df[df['Installs'].str.isnumeric()].copy()  # Keep only numeric values; copy avoids SettingWithCopyWarning
df['Installs'] = df['Installs'].astype(float)  # Convert to float

# Convert 'Size' column to numeric (handle MB and KB)
def convert_size(size):
    if isinstance(size, str):
        if 'M' in size:
            return float(size.replace('M', ''))
        elif 'k' in size:
            return float(size.replace('k', '')) / 1024  # Convert KB to MB
        else:
            return np.nan  # 'Varies with device' or any other unrecognized string
    return size  # Already numeric (or NaN): leave unchanged

df['Size'] = df['Size'].apply(convert_size)
df.dropna(subset=['Size'], inplace=True)  # Remove rows where size is NaN

# Create Bubble Chart
plt.figure(figsize=(12, 8))
bubble = plt.scatter(
    df['Installs'], df['Rating'], s=df['Size'] * 10,  # Bubble size based on app size
    alpha=0.5, c=df['Size'], cmap='coolwarm', edgecolors='k'
)

plt.colorbar(label="App Size (MB)")
plt.xscale("log")  # Log scale for better visualization of installs
plt.xlabel("Installs (Log Scale)", fontsize=14)
plt.ylabel("Rating", fontsize=14)
plt.title("Bubble Chart: Rating vs Installs vs Size", fontsize=16)
plt.grid(True, linestyle="--", alpha=0.5)
plt.show()




##### 1. Why did you pick the specific chart?

Answer Here : I selected this particular bubble chart because it clearly illustrates the connection between **app installs, ratings, and size**, which aids in recognizing trends and patterns within the data. Given that the Installs column often features a broad range of values, using **a logarithmic scale** enhances readability by distributing the data points more evenly. **The size of each bubble indicates the app size (MB)**, allowing us to examine if larger apps generally receive more downloads or higher ratings. Furthermore, I incorporated **a color gradient based on app size**, which reinforces the size encoding and helps identify potential outliers, like large apps with few installs or small apps with high ratings. The **transparency (alpha)** is used to minimize overlap, ensuring clarity in areas where many apps have similar installs and ratings. This bubble chart delivers valuable insights by demonstrating whether **app size influences popularity and user satisfaction**, making it an effective tool for app developers, marketers, and analysts.

##### 2. What is/are the insight(s) found from the chart?

Answer Here : The bubble chart relates **app installs, ratings, and size** in a single view. Smaller apps generally receive more downloads, while larger apps often receive better ratings, which suggests that feature-rich applications can still satisfy users despite their size. The chart also surfaces two groups worth attention: apps that are large but have few installs, which hints that users may be reluctant to download hefty applications, and apps that are highly downloaded but poorly rated, which shows that initial interest does not always translate into satisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here : The insights gained from the bubble chart can significantly **positively impact** business outcomes by helping app developers and companies refine their app strategies. The observation that smaller apps generally receive more downloads suggests that developers should prioritize minimizing app size to attract a larger user base, particularly in regions where device storage is limited or internet speeds are slower. Furthermore, the trend showing that larger apps often receive better ratings implies that feature-rich applications, especially in gaming or productivity, can boost user satisfaction, making the investment in high-quality content worthwhile. Companies can leverage these insights to strike a balance between app size and functionality, ensuring they meet user needs while remaining accessible.

On the **Negative** side, some insights may highlight potential challenges to growth. For instance, **apps that are large in size but have few installs indicate that users might be reluctant to download hefty applications**, likely due to concerns about storage space or lengthy download times. Likewise, apps that are highly downloaded but have low ratings suggest that initial user interest doesn't always lead to satisfaction, which could result in poor retention and high uninstall rates. If a business overlooks these red flags and continues to create large, underperforming apps without enhancing user experience, it risks facing negative growth, low engagement, and dwindling revenue. By tackling these issues, such as reducing app size without sacrificing quality, enhancing user experience, and fine-tuning marketing strategies, businesses can transform insights into avenues for growth and sustained success.
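Whether size really moves with installs or ratings can be summarised with a rank correlation, which is a reasonable choice here because installs span several orders of magnitude. The sketch below computes Spearman's coefficient as the Pearson correlation of ranks, using a made-up frame `toy` in place of the cleaned `df`; the values are illustrative and say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for the cleaned df (values are made up for illustration).
toy = pd.DataFrame({
    "Size": [5.0, 12.0, 30.0, 55.0, 90.0, 140.0],
    "Installs": [5_000_000, 1_000_000, 1_000_000, 100_000, 500_000, 10_000],
    "Rating": [4.1, 4.3, 3.9, 4.4, 4.5, 4.6],
})

# Spearman correlation equals the Pearson correlation of the ranked columns.
rank_corr = toy.rank().corr()
print(rank_corr.loc["Size", ["Installs", "Rating"]])
```

A coefficient near zero on the real data would suggest that the visual trend in the bubble chart is weaker than it looks, so it is worth computing before drawing business conclusions.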

#### Chart - 10

In [None]:
# Chart - 10 visualization code

#Reviews length distribution

# Load dataset
df = pd.read_csv("/content/drive/MyDrive/dataset/playstore_data_user_reviews.csv")

# Print column names for debugging
print("Columns in dataset:", df.columns)

# Ensure correct column name
review_col = 'Translated_Review' if 'Translated_Review' in df.columns else 'Reviews'

if review_col not in df.columns:
    raise KeyError("Required column for reviews not found in the dataset.")

# Drop missing values in the review column
df.dropna(subset=[review_col], inplace=True)

# Calculate review lengths
df['Review_Length'] = df[review_col].apply(lambda x: len(str(x).split()))

# Create a Histogram
plt.figure(figsize=(10, 6))
sns.histplot(df['Review_Length'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of Review Lengths', fontsize=16)
plt.xlabel('Review Length (Number of Words)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here : The specific review length distribution chart was selected to examine the **variability in user review lengths**, offering insights into customer engagement and feedback trends. A histogram serves as an effective tool to visualize how review lengths are distributed, allowing us to see if users tend to write short, concise reviews or more detailed, in-depth feedback. Adding a Kernel Density Estimate (KDE) curve further clarifies the data by illustrating the probability density of various review lengths. This visualization helps identify patterns, such as whether highly rated apps garner longer, more descriptive reviews or if dissatisfied users typically leave shorter, more straightforward complaints. Moreover, understanding the distribution of review lengths can assist businesses and developers in refining their **strategies, prioritizing detailed feedback, and enhancing customer engagement**. The selected histogram format ensures clarity and ease of interpretation, making it simpler to extract actionable insights from user reviews.

##### 2. What is/are the insight(s) found from the chart?

Answer Here : The review length distribution chart offers important **insights into user feedback trends**. A histogram with a high number of short reviews suggests that many users leave brief comments, which might not provide detailed feedback but can indicate overall satisfaction or dissatisfaction. Conversely, a broader range of longer reviews shows that users are more engaged and willing to share in-depth experiences, which can help developers understand specific app **issues or strengths**. If there’s a noticeable peak in mid-length reviews, it may suggest that users prefer a balanced approach—offering enough detail without too much effort. Furthermore, examining outliers, such as very long reviews, can reveal particularly passionate users or potential spam. These insights assist businesses in refining their customer engagement strategies, promoting meaningful reviews, and prioritizing valuable user feedback for future enhancements.
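The histogram can be complemented with a few summary numbers, such as the median length and the share of very short reviews. This is a minimal sketch on made-up review texts. The three-word cut-off for a "short" review is an arbitrary assumption, and `toy_reviews` stands in for the `Translated_Review` column.

```python
import pandas as pd

# Toy review texts (made up); one entry is missing.
toy_reviews = pd.Series([
    "Great app",
    "Crashes every time I open the camera",
    "Love it",
    None,
    "Good but the latest update removed the feature I used most often",
])

# Word count per review, after dropping missing entries.
lengths = toy_reviews.dropna().str.split().str.len()

median_length = lengths.median()
short_share = (lengths <= 3).mean()  # fraction of reviews with at most 3 words
print(median_length, short_share)
```

On the real data these two numbers give a quick answer to whether brief comments dominate, which is the question the histogram is meant to address.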

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here : The insights from the review length distribution chart can **positively impact** a business by providing a clearer picture of user engagement and the quality of feedback. By determining whether users tend to leave short, vague reviews or more detailed and informative ones, businesses can adjust their engagement strategies. If the reviews are mostly brief, companies might encourage users to provide more comprehensive feedback by asking specific questions during the review process. Conversely, if longer reviews are the norm, businesses can gather valuable insights to refine product features, tackle issues, and boost customer satisfaction. Additionally, tracking trends in review length can enhance sentiment analysis models, leading to a better understanding of user sentiment. By utilizing these insights, businesses can make informed decisions to improve customer experience, enhance app functionality, and ultimately boost user retention and satisfaction.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

#CountPlot visualization

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv("/content/drive/MyDrive/dataset/playstore_data.csv")

# Check if 'Category' column exists
if 'Category' not in df.columns:
    raise KeyError("Required column 'Category' not found in the dataset.")

# Create the Count Plot for the number of apps in each category
plt.figure(figsize=(12, 6))
sns.countplot(data=df, y='Category', order=df['Category'].value_counts().index, palette='viridis')

# Customize the plot
plt.title('Number of Apps in Each Category', fontsize=16)
plt.xlabel('Number of Apps', fontsize=12)
plt.ylabel('Category', fontsize=12)
plt.grid(axis='x', linestyle='--', alpha=0.7)

# Show the plot
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here : The **count plot** was selected because it effectively visualizes **categorical data by illustrating** the **frequency of each category within a dataset**. In this context, it allows for a quick understanding of how apps are distributed across various categories in the Play Store. By showing the number of apps in each category, the count plot reveals insights into which categories are more or less saturated. This information can be particularly valuable for businesses aiming to enter the app market, as it helps them identify highly competitive categories and potential gaps where new apps could thrive. Furthermore, using a horizontal count plot enhances readability, especially when category names are lengthy. This visualization is straightforward yet impactful, making it easy to compare different categories at a glance and make informed, data-driven decisions.

##### 2. What is/are the insight(s) found from the chart?

Answer Here: The **count plot chart offers valuable insights into how apps are distributed across various categories in the Play Store**. It highlights which categories have the most apps, pointing to highly competitive markets, and which ones have fewer apps, possibly indicating niche opportunities. For instance, if categories like **"Games" or "Tools" are at the top**, it suggests a crowded market with significant competition. Conversely, categories with fewer apps, such as "Education" or "Health & Fitness," could provide chances for new app development with less competition. Furthermore, the chart can shed light on user preferences and market trends—categories with a high number of apps may reflect strong user demand, while those with fewer apps might indicate either lower demand or an underserved market segment. This information is crucial for developers and businesses aiming to strategically position their apps for optimal visibility and success.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here : The insights gained from the count plot can **positively impact business by helping app developers and companies make informed, data-driven choices**. By pinpointing categories with intense competition, businesses can concentrate on strategies that set them apart or look into less crowded categories to boost their chances of success. For example, if a business observes that the "Productivity" or "Health & Fitness" category has fewer apps than "Games," they might think about creating an innovative app in those fields to reach an underserved audience. Furthermore, companies can leverage this information to customize their marketing strategies, allocate resources more effectively, and make well-informed investment choices.
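
The saturation comparison described above can be made concrete by turning raw category counts into shares of the whole store. The following is a minimal sketch rather than the notebook's own method: `category_shares` and the sample labels are hypothetical, and in the notebook the input would be something like `df['Category'].dropna()`.

```python
from collections import Counter

def category_shares(categories):
    """Return (category, count, share) tuples sorted from most to least common."""
    counts = Counter(categories)
    total = sum(counts.values())
    return [(name, n, n / total) for name, n in counts.most_common()]

# Hypothetical labels standing in for the 'Category' column
sample = ["GAME", "TOOLS", "GAME", "FAMILY", "GAME", "TOOLS", "EDUCATION", "FAMILY"]

for name, n, share in category_shares(sample):
    print(f"{name:<10} {n:>3} apps  {share:.1%}")
```

A share makes it easier to judge how crowded a category is than a raw count does, although a small share may reflect low demand just as much as an underserved niche.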

#### Chart - 12

In [None]:
# Chart - 12 visualization code
#Top Positive and Negative Words (Bar Plot)

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

# Load dataset
df = pd.read_csv("/content/drive/MyDrive/dataset/playstore_data_user_reviews.csv")

# Ensure 'Translated_Review' and 'Sentiment' columns exist
if 'Translated_Review' not in df.columns or 'Sentiment' not in df.columns:
    raise KeyError("Required columns 'Translated_Review' and 'Sentiment' not found in the dataset.")

# Drop NaN values from important columns
df.dropna(subset=['Translated_Review', 'Sentiment'], inplace=True)

# Split reviews into positive and negative
positive_reviews = df[df['Sentiment'] == 'Positive']['Translated_Review']
negative_reviews = df[df['Sentiment'] == 'Negative']['Translated_Review']

# Function to get top N words
def get_top_words(reviews, n=10):
    if reviews.empty:
        return []  # Return empty list if no reviews found
    vectorizer = CountVectorizer(stop_words='english')
    word_counts = vectorizer.fit_transform(reviews)
    word_freq = dict(zip(vectorizer.get_feature_names_out(), word_counts.sum(axis=0).tolist()[0]))
    return sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:n]

# Get top words for positive and negative reviews
top_positive_words = get_top_words(positive_reviews)
top_negative_words = get_top_words(negative_reviews)

# Create Bar Plots
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Positive Words
if top_positive_words:
    axes[0].bar([x[0] for x in top_positive_words], [x[1] for x in top_positive_words], color='lightgreen')
    axes[0].set_title('Top Positive Words', fontsize=14)
    axes[0].tick_params(axis='x', rotation=45)
else:
    axes[0].set_title('No Positive Words Found')

# Negative Words
if top_negative_words:
    axes[1].bar([x[0] for x in top_negative_words], [x[1] for x in top_negative_words], color='lightcoral')
    axes[1].set_title('Top Negative Words', fontsize=14)
    axes[1].tick_params(axis='x', rotation=45)
else:
    axes[1].set_title('No Negative Words Found')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here : The bar plot was selected because it clearly shows the most commonly used words in **both positive and negative reviews**, making it simple to spot trends in user sentiment. This type of visualization helps businesses quickly grasp what customers value and what issues they raise, aiding in data-driven decision-making. By analyzing word frequency, companies can identify recurring themes in feedback, allowing for focused improvements. A bar plot is well-suited for this analysis as it offers a clear and organized method to present categorical data, making it easy to interpret.

##### 2. What is/are the insight(s) found from the chart?

Answer Here : The chart illustrates the most frequently used words in both positive and negative reviews, shedding light on the main aspects that customers either value or take issue with. Positive terms such as "great," "easy," and "useful" suggest strengths like user-friendliness and functionality. Conversely, negative words like "slow," "crash," or "bug" highlight performance problems that need urgent resolution. This information aids businesses in identifying the features that users appreciate and the issues that lead to dissatisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here : The insights from this analysis can **benefit businesses by helping them target areas that users often criticize while also enhancing features that are well-received**. By tackling common complaints like app crashes or slow performance, companies can improve user experiences, boost retention, and achieve higher ratings. Furthermore, incorporating positive feedback into marketing strategies can enhance customer trust and draw in more users. On the flip side, if negative sentiments prevail and issues remain unaddressed, it could hinder growth. Ongoing complaints about performance, high costs, or inadequate customer support can lead to user dissatisfaction, resulting in fewer downloads, lower ratings, and a decline in market share.
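
One caveat with raw top-word lists is that generic words (for example "game" or "app") tend to rank highly in both positive and negative reviews. A possible refinement, sketched below under the assumption that plain relative frequencies are good enough, is to rank words by how much more often they appear in one group than in the other. The helper names and sample reviews are hypothetical, and no stop-word removal is applied here.

```python
import re
from collections import Counter

def word_freq(reviews):
    """Relative frequency of each lowercase word across a list of reviews."""
    words = [w for text in reviews for w in re.findall(r"[a-z']+", text.lower())]
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()} if total else {}

def distinctive_words(target, other, n=5):
    """Words whose relative frequency in `target` most exceeds that in `other`."""
    f_target, f_other = word_freq(target), word_freq(other)
    gaps = {w: f - f_other.get(w, 0.0) for w, f in f_target.items()}
    return sorted(gaps, key=gaps.get, reverse=True)[:n]

# Hypothetical reviews standing in for the 'Translated_Review' column
positive = ["great game easy to play", "love this game", "great app"]
negative = ["game keeps crashing", "slow game and crashing", "bad app"]

print(distinctive_words(positive, negative, n=3))
print(distinctive_words(negative, positive, n=3))
```

In the notebook, `positive_reviews.tolist()` and `negative_reviews.tolist()` could be passed in place of the sample lists.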

#### Chart - 13

In [None]:
# Chart - 13 visualization code

from wordcloud import WordCloud

# Reuses `df` (the user-reviews DataFrame) and `plt` from the Chart - 12 cell

# Ensure 'Translated_Review' column exists (or any column with review text)
if 'Translated_Review' not in df.columns:
    raise KeyError("Required column 'Translated_Review' not found in the dataset.")

# Combine all reviews into a single string
all_reviews = ' '.join(df['Translated_Review'].dropna())

# Generate a Word Cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_reviews)

# Plot the Word Cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Most Common Words in User Reviews', fontsize=16)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here : A word cloud is picked because it visually showcases the most commonly used words in user reviews, making it simple to spot key themes quickly. In contrast to bar charts, which only show a limited selection of top words, a word cloud changes the size of words based on how often they appear, offering a wider and more intuitive grasp of prevalent sentiments. It serves as a fast and engaging method to analyze text data, making it a valuable tool for businesses to gauge overall customer sentiment and emphasize frequently mentioned topics.

##### 2. What is/are the insight(s) found from the chart?

Answer Here : The word cloud showcases the most frequently used terms in user reviews, uncovering common themes in customer feedback. When words such as **"easy," "love," and "best"** stand out, it indicates that users value the app’s usability and features. On the other hand, if terms like "error," "crash," or "slow" are often mentioned, it points to performance problems that require prompt action. This analysis aids businesses in grasping user perceptions and pinpointing areas for enhancement without the need to sift through thousands of reviews manually.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here : The insights gained from the word cloud can significantly influence business outcomes by effectively pinpointing user preferences and areas of concern. Companies can prioritize enhancing features that users frequently commend while also tackling common complaints to boost overall satisfaction. If the predominant words are negative, it signals a risk that users may leave the app if their issues remain unaddressed. Overlooking these insights could result in negative growth, as ongoing problems can harm brand reputation, decrease downloads, and lead to poorer ratings and reviews.

#### Chart - 14 - Correlation Heatmap

In [None]:
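# Correlation Heatmap visualization code

import numpy as np  # convert_size below returns np.nan
# Reuses `playstore_df`, `pd`, `plt` and `sns` from earlier cells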

# Step 1: Clean the dataset
# Drop rows with missing values in 'Rating' and 'Type'
playstore_df = playstore_df.dropna(subset=['Rating', 'Type'])

# Ensure 'Installs' is treated as a string
playstore_df['Installs'] = playstore_df['Installs'].astype(str)

# Clean the 'Installs' column: strip '+' and ',' as literal characters (regex=False), then cast to int
playstore_df['Installs'] = playstore_df['Installs'].str.replace('+', '', regex=False).str.replace(',', '', regex=False).astype(int)

# Clean the 'Size' column (result is in MB)
def convert_size(size):
    if isinstance(size, (int, float)):
        return size  # Already numeric (e.g. the cell was re-run); keep as-is
    if isinstance(size, str):
        if 'M' in size:
            return float(size.replace('M', ''))
        elif 'k' in size:
            return float(size.replace('k', '')) / 1024  # Convert KB to MB
    return np.nan  # 'Varies with device' and any other unparseable value

playstore_df['Size'] = playstore_df['Size'].apply(convert_size)

# Fill missing values in 'Size' with the median
playstore_df['Size'] = playstore_df['Size'].fillna(playstore_df['Size'].median())

# Ensure 'Price' is treated as a string
playstore_df['Price'] = playstore_df['Price'].astype(str)

# Clean the 'Price' column: '$' must be removed as a literal character, not as a regex anchor
playstore_df['Price'] = playstore_df['Price'].str.replace('$', '', regex=False).astype(float)

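# Step 2: Select numerical columns for correlation
# 'Reviews' may still be stored as text, so coerce it to numbers first
playstore_df['Reviews'] = pd.to_numeric(playstore_df['Reviews'], errors='coerce')
numerical_columns = ['Rating', 'Reviews', 'Size', 'Installs', 'Price']
corr_matrix = playstore_df[numerical_columns].corr()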

# Step 3: Create a correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here : A correlation heatmap was selected because it effectively shows the relationships among various numerical variables in a dataset. This type of chart helps to pinpoint which variables move together by using color intensity to indicate correlation values; correlation alone does not establish that one factor causes another. In contrast to scatter plots, which only compare two variables at once, a heatmap offers a comprehensive view of all correlations in a single visualization. This is especially beneficial for grasping how features such as app size, price, and installs relate to ratings, enabling businesses to make informed, data-driven decisions.

##### 2. What is/are the insight(s) found from the chart?

Answer Here : The heatmap highlights key relationships between various app attributes. For example, a strong positive correlation between Installs and Rating implies that apps with higher ratings generally receive more downloads. Conversely, a negative correlation between Price and Installs indicates that free apps are downloaded much more frequently than their paid counterparts. If Size and Rating exhibit minimal correlation, it suggests that the size of the app does not greatly affect user satisfaction. These insights are valuable for businesses looking to grasp the factors that drive app popularity and performance.
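
The heatmap above uses the default Pearson coefficient, which measures linear association and can be dominated by a few very large values in skewed columns such as Installs. A rank-based (Spearman) coefficient is one common alternative. The sketch below implements both with the standard library purely for illustration; the function names and the sample numbers are hypothetical. In pandas the same idea is available as `DataFrame.corr(method='spearman')`.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical values: reviews rise with installs, but not linearly
installs = [10, 100, 1_000, 10_000, 1_000_000]
reviews = [1, 5, 20, 80, 90]

print(f"Pearson:  {pearson(installs, reviews):.2f}")
print(f"Spearman: {spearman(installs, reviews):.2f}")
```

When the two coefficients differ noticeably, the relationship is likely monotonic but not linear, which is worth keeping in mind before reading a weak Pearson value as "no relationship".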

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

playstore_df['Installs'] = playstore_df['Installs'].astype(str)

# Clean the 'Installs' column: strip '+' and ',' as literal characters (regex=False), then cast to int
playstore_df['Installs'] = playstore_df['Installs'].str.replace('+', '', regex=False).str.replace(',', '', regex=False).astype(int)

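# Clean the 'Size' column (result is in MB)
def convert_size(size):
    if isinstance(size, (int, float)):
        return size  # Already numeric after the Chart - 14 cell; without this, every value would become NaN
    if isinstance(size, str):
        if 'M' in size:
            return float(size.replace('M', ''))
        elif 'k' in size:
            return float(size.replace('k', '')) / 1024  # Convert KB to MB
    return np.nan  # 'Varies with device' and any other unparseable value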

playstore_df['Size'] = playstore_df['Size'].apply(convert_size)

# Fill missing values in 'Size' with the median
playstore_df['Size'] = playstore_df['Size'].fillna(playstore_df['Size'].median())

# Ensure 'Price' is treated as a string
playstore_df['Price'] = playstore_df['Price'].astype(str)

# Clean the 'Price' column: '$' must be removed as a literal character, not as a regex anchor
playstore_df['Price'] = playstore_df['Price'].str.replace('$', '', regex=False).astype(float)

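# Step 2: Select numerical columns for the pair plot
# 'Reviews' may still be stored as text, so coerce it to numbers first
playstore_df['Reviews'] = pd.to_numeric(playstore_df['Reviews'], errors='coerce')
numerical_columns = ['Rating', 'Reviews', 'Size', 'Installs', 'Price']
pairplot_data = playstore_df[numerical_columns]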

# Step 3: Create a pair plot
sns.pairplot(pairplot_data, diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle('Pair Plot of Numerical Variables', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here : A pair plot is picked because it enables a thorough examination of the relationships among several numerical variables at once. It displays scatter plots for each pair of variables, which makes it simple to spot trends, correlations, and possible outliers. The diagonal KDE (Kernel Density Estimate) plots assist in visualizing the distribution of each variable. This type of chart is particularly helpful in uncovering patterns, such as whether a higher number of installs is linked to better ratings, or if the size of the app affects user engagement as reflected in reviews.

##### 2. What is/are the insight(s) found from the chart?

Answer Here : The pair plot highlights important connections between variables such as Rating, Reviews, Size, Installs, and Price. A positive correlation between Installs and Rating indicates that apps with better ratings tend to draw in more users. Conversely, a negative correlation between Price and Installs suggests that free apps are generally more favored than their paid counterparts. The density distributions can reveal any skewness in the data, showing whether most apps receive high ratings or whether installs are heavily concentrated at a few values. Additionally, it allows for the identification of outliers, like a few paid apps that have remarkably high install numbers.
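
Outliers spotted by eye in a pair plot can also be flagged numerically. The sketch below uses the common 1.5 × IQR rule, which is an assumption made here for illustration and not something the notebook applies; `iqr_outliers` and the sample values are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical install counts with one extreme value
installs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
print(iqr_outliers(installs))
```

For heavily skewed columns such as Installs, many legitimate apps may be flagged, so the result is better treated as a list of candidates to inspect than as rows to delete.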

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly : To achieve business success in the Google Play Store app market, developers and businesses should concentrate on essential factors that enhance user engagement and app performance. **Our analysis indicates that free apps receive significantly more downloads compared to paid ones, suggesting that a freemium model with in-app purchases or ads can be a lucrative approach. Furthermore, categories like Arts & Design, Productivity, and Communication often have higher average ratings, indicating that focusing on these niches may lead to greater user satisfaction**. User sentiment analysis reveals that positive reviews are strongly associated with high ratings, highlighting the need to actively engage with user feedback, improve app quality through regular updates, and promptly address negative reviews. Additionally, a strong link between the number of reviews and installs suggests that businesses should motivate users to leave reviews, as increased engagement can enhance app visibility and credibility. Based on these insights, developers should prioritize improving user experience, optimizing app performance, and utilizing data-driven marketing strategies to increase app adoption and retention.
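
The free-versus-paid comparison behind this recommendation can be checked with a simple grouped summary. The sketch below uses the median rather than the mean because install counts are heavily skewed; the function name and the sample records are hypothetical and do not come from the dataset.

```python
from collections import defaultdict
from statistics import median

def median_installs_by_type(records):
    """Group (app_type, installs) pairs by type and return the median installs of each."""
    groups = defaultdict(list)
    for app_type, installs in records:
        groups[app_type].append(installs)
    return {app_type: median(values) for app_type, values in groups.items()}

# Hypothetical (Type, Installs) pairs
records = [
    ("Free", 1_000_000),
    ("Free", 50_000),
    ("Free", 10_000),
    ("Paid", 5_000),
    ("Paid", 1_000),
]

print(median_installs_by_type(records))
```

With the cleaned DataFrame, the pandas equivalent would likely be `playstore_df.groupby('Type')['Installs'].median()`.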


# **Conclusion**

Write the conclusion here : Analyzing data from the Google Play Store reveals important factors that lead to an app's success. While free apps lead in downloads, ratings and user satisfaction are essential for ongoing engagement. The results show that the app's category, user reviews, and sentiment analysis significantly influence its popularity. Businesses should use this information to concentrate on well-rated categories, enhance customer support, and consistently optimize app performance to stay competitive. By making data-driven choices, app developers can expand their reach, boost user satisfaction, and foster long-term growth in the competitive Play Store environment.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***