<a href="https://colab.research.google.com/github/itssidkali/new/blob/master/Copy_of_my_project_%5BPlay_Store_App_Review_Analysis_Project_ipynb%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  **Play Store App Review Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual
Submitted by: Kali Srivastava

**Project Summary**: This project analyzes Play Store app data to identify key factors affecting app success and engagement. By exploring attributes like categories, ratings, and reviews, the goal is to uncover insights that can guide developers and businesses in improving app features, marketing strategies, and user experience.

# **GitHub Link**

# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**


The success and engagement of apps on the Play Store depend on various factors, such as ratings, reviews, app categories, and size. By analyzing this data, the goal is to identify the key elements that contribute to app performance and user satisfaction. This analysis will help developers and businesses understand what drives app success and how they can optimize their apps to attract and retain users.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from wordcloud import WordCloud
import warnings
warnings.filterwarnings("ignore")


### Dataset Loading

In [None]:
# Importing the dataset
hb_df = pd.read_csv("/content/Play Store Data.csv")
ur_df = pd.read_csv("User Reviews.csv")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset First View

In [None]:
# Dataset First
hb_df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns
# Play Store Data shape
print("Play Store Data Shape:", hb_df.shape)

### Dataset Information

In [None]:
# Dataset Info
hb_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
hb_df.duplicated().sum()  # Number of rows that exactly repeat an earlier row

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(hb_df.isnull().sum())

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(hb_df.isnull(), cbar=False)

### What did you know about your dataset?

Dataset Overview:

The dataset contains 10,841 rows and 13 columns, representing various attributes of Play Store apps.

Duplicate Records:

There are 483 duplicate rows, which need to be removed to ensure data integrity.

Missing Values:

Rating → 1,474 missing values (important for app performance analysis).

Type → 1 missing value (affects free/paid categorization).

Content Rating → 1 missing value (impacts audience segmentation).

Current Ver → 8 missing values (essential for version tracking).

Android Ver → 3 missing values (needed for compatibility insights).

Null Value Heatmap Analysis:

The heatmap confirms missing values in key columns, highlighting areas that need cleaning.
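
The figures above (row and column counts, duplicates, missing values per column) can be gathered in one place. The following is a minimal sketch, not part of the original notebook: `summarize_frame` is a hypothetical helper, and the toy DataFrame stands in for `hb_df` with made-up values.

```python
import pandas as pd

def summarize_frame(df: pd.DataFrame) -> dict:
    """Collect basic data-quality figures for a DataFrame (hypothetical helper)."""
    missing = df.isnull().sum()
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        # Keep only the columns that actually contain missing values
        "missing_by_column": missing[missing > 0].to_dict(),
    }

# Toy data standing in for hb_df (values are invented for illustration)
toy = pd.DataFrame({
    "App": ["A", "A", "B", "C"],
    "Rating": [4.5, 4.5, None, 3.9],
    "Type": ["Free", "Free", "Paid", None],
})
summary = summarize_frame(toy)
print(summary)
```

Calling `summarize_frame(hb_df)` in the notebook should reproduce the counts reported in this section, assuming the same CSV is loaded.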

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
hb_df.columns

In [None]:
# Dataset Describe
hb_df.describe(include='all')

### Variables Description

App → Name of the application.

Category → Category under which the app is listed.

Rating → Average user rating (out of 5).

Reviews → Total number of user reviews.

Size → App size (in MB or "Varies with device").

Installs → Number of installs (e.g., 1,000,000+).

Type → Whether the app is Free or Paid.

Price → Price of the app (0 if free).

Content Rating → Age group for which the app is suitable.

Genres → App genre(s) (e.g., Action, Productivity).

Last Updated → Last update date of the app.

Current Ver → Current version of the app.

Android Ver → Minimum Android version required.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
pd.Series({column:hb_df[column].unique() for column in hb_df})

# **EDA on user reviews**

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
ur_df = pd.read_csv("User Reviews.csv")

# Display basic information
print("Dataset Shape:", ur_df.shape)
print("\nDataset Info:")
print(ur_df.info())

# Check for missing values
print("\nMissing Values Count:")
print(ur_df.isnull().sum())

# Display first few rows
ur_df.head()


# **missing values**

In [None]:
# Drop rows with missing values in 'Translated_Review' and 'Sentiment' (critical columns for analysis)
cleaned_df = ur_df.dropna(subset=['Translated_Review', 'Sentiment']).copy()  # .copy() so new columns can be added later without SettingWithCopyWarning

# Alternatively, we could fill missing sentiment values with a default or median value (if needed)
# ur_df['Sentiment'].fillna('Neutral', inplace=True) # Example for filling
# ur_df['Sentiment_Polarity'].fillna(0.0, inplace=True)
# ur_df['Sentiment_Subjectivity'].fillna(0.5, inplace=True)

# Confirm data cleaning
print(cleaned_df.isnull().sum())
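
Before dropping rows it helps to know how much data is lost. The sketch below is an illustration only: `drop_incomplete_reviews` is a hypothetical helper and the toy reviews are invented. It drops rows missing any required column and reports the count and share removed.

```python
import pandas as pd

def drop_incomplete_reviews(reviews: pd.DataFrame, required: list) -> tuple:
    """Drop rows missing any required column; return (cleaned, dropped, share)."""
    cleaned = reviews.dropna(subset=required).copy()
    dropped = len(reviews) - len(cleaned)
    share = dropped / len(reviews) if len(reviews) else 0.0
    return cleaned, dropped, share

# Toy data standing in for ur_df (values are invented for illustration)
toy_reviews = pd.DataFrame({
    "App": ["A", "A", "B", "B"],
    "Translated_Review": ["Great app", None, "Crashes a lot", "Okay"],
    "Sentiment": ["Positive", None, "Negative", "Neutral"],
})
cleaned_toy, n_dropped, share_dropped = drop_incomplete_reviews(
    toy_reviews, ["Translated_Review", "Sentiment"]
)
print(f"Dropped {n_dropped} of {len(toy_reviews)} rows ({share_dropped:.0%})")
```

If the dropped share on the real data turns out to be large, the fill-based alternative in the commented lines above may be worth revisiting.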


# **sentiment distribution**

In [None]:
# Sentiment distribution pie chart
import matplotlib.pyplot as plt

sentiment_counts = cleaned_df['Sentiment'].value_counts()

# Plotting
plt.figure(figsize=(7, 7))
sentiment_counts.plot(kind='pie', autopct='%1.1f%%', colors=['#66b3ff','#99ff99','#ff6666'], startangle=90)
plt.title("Sentiment Distribution in Reviews")
plt.ylabel('')  # Remove y-label for aesthetic
plt.show()
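
The percentages shown on the pie chart can also be read off as numbers. This is a minimal sketch with a hypothetical helper, `sentiment_shares`, and a toy Series in place of `cleaned_df['Sentiment']`.

```python
import pandas as pd

def sentiment_shares(sentiments: pd.Series) -> pd.Series:
    """Percentage of reviews per sentiment label, rounded to one decimal place."""
    return (sentiments.value_counts(normalize=True) * 100).round(1)

# Toy labels standing in for cleaned_df['Sentiment'] (invented for illustration)
toy_sentiment = pd.Series(["Positive", "Positive", "Negative", "Neutral"])
shares = sentiment_shares(toy_sentiment)
print(shares)
```

In the notebook, `sentiment_shares(cleaned_df['Sentiment'])` would give the same proportions that `autopct` prints on the chart.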


# **polarity distribution**

In [None]:
# Plotting Sentiment Polarity Distribution
plt.figure(figsize=(10, 6))
plt.hist(cleaned_df['Sentiment_Polarity'], bins=30, color='#ff9999', edgecolor='black')
plt.title("Sentiment Polarity Distribution")
plt.xlabel("Sentiment Polarity")
plt.ylabel("Frequency")
plt.show()


# **Word Cloud for Positive and Negative Reviews**

In [None]:
from wordcloud import WordCloud

# Positive and Negative Reviews
positive_reviews = cleaned_df[cleaned_df['Sentiment'] == 'Positive']['Translated_Review']
negative_reviews = cleaned_df[cleaned_df['Sentiment'] == 'Negative']['Translated_Review']

# Combine the reviews of each sentiment into a single string for its word cloud
positive_reviews_text = " ".join(positive_reviews.dropna())
negative_reviews_text = " ".join(negative_reviews.dropna())

# Generate Word Clouds
positive_wordcloud = WordCloud(width=800, height=400, background_color='white').generate(positive_reviews_text)
negative_wordcloud = WordCloud(width=800, height=400, background_color='white').generate(negative_reviews_text)

# Plotting Word Clouds
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.imshow(positive_wordcloud, interpolation='bilinear')
plt.title('Positive Reviews Word Cloud')
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(negative_wordcloud, interpolation='bilinear')
plt.title('Negative Reviews Word Cloud')
plt.axis('off')

plt.show()


# **Review Length Analysis**

In [None]:
# Adding a column for review length
cleaned_df['Review_Length'] = cleaned_df['Translated_Review'].str.len()  # Character count of each review

# Plotting review length vs sentiment polarity
plt.figure(figsize=(10, 6))
plt.scatter(cleaned_df['Review_Length'], cleaned_df['Sentiment_Polarity'], alpha=0.5, color='#ffb3b3')
plt.title("Review Length vs Sentiment Polarity")
plt.xlabel("Review Length")
plt.ylabel("Sentiment Polarity")
plt.show()


# *What we learn from the User Reviews dataset*

**The dataset shows that most user reviews are positive, with high sentiment polarity values, indicating general satisfaction. Negative reviews are present but less frequent, and the scatter plot shows no clear relationship between review length and sentiment polarity.**
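
The statement about review length and sentiment can be backed by a number rather than read from the scatter plot alone. The sketch below is an assumption-laden illustration: `length_polarity_correlation` is a hypothetical helper, the toy rows are invented, and Pearson correlation only captures a linear relationship.

```python
import pandas as pd

def length_polarity_correlation(reviews: pd.DataFrame) -> float:
    """Pearson correlation between review length (characters) and polarity."""
    lengths = reviews["Translated_Review"].str.len()
    return lengths.corr(reviews["Sentiment_Polarity"])  # Pearson is the default

# Toy data: length and polarity rise together, so the correlation is 1
toy_lp = pd.DataFrame({
    "Translated_Review": ["ab", "abcd", "abcdef"],
    "Sentiment_Polarity": [0.1, 0.2, 0.3],
})
r = length_polarity_correlation(toy_lp)
print(f"Pearson r = {r:.2f}")
```

On the real data, a value close to 0 from `length_polarity_correlation(cleaned_df)` would support the observation above; a value far from 0 would call for a second look.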

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Make the dataset analysis ready.
# Create a working copy of the current dataset and assign it to hb_dfs
hb_dfs=hb_df.copy()
hb_dfs

In [None]:
# Copy the original dataframe to work on it without altering the original
hb_dfs = hb_df.copy()

# Remove Duplicates from the dataset
hb_dfs.drop_duplicates(inplace=True)  # Dropping duplicate rows

# Handle Missing Values:

# For Rating, fill missing values with the median value of the column
# (assign the result back: calling fillna(inplace=True) on a selected column may not update hb_dfs in recent pandas versions)
hb_dfs['Rating'] = hb_dfs['Rating'].fillna(hb_dfs['Rating'].median())  # Filling missing ratings with median

# For Type, Content Rating, Current Ver, and Android Ver (categorical columns), fill with common values
hb_dfs['Type'] = hb_dfs['Type'].fillna('Free')  # Filling missing 'Type' values with 'Free' (most common)
hb_dfs['Content Rating'] = hb_dfs['Content Rating'].fillna('Everyone')  # Filling missing 'Content Rating' with 'Everyone'
hb_dfs['Current Ver'] = hb_dfs['Current Ver'].fillna('Varies with device')  # Filling missing 'Current Ver' with 'Varies with device'
hb_dfs['Android Ver'] = hb_dfs['Android Ver'].fillna('4.1 and up')  # Filling missing 'Android Ver' with '4.1 and up'

# Remove Outliers in Rating:
hb_dfs = hb_dfs[hb_dfs['Rating'] <= 5].copy()  # Keeping ratings <= 5; .copy() avoids SettingWithCopyWarning on the column assignments below

# Convert Price to Numeric:
hb_dfs['Price'] = hb_dfs['Price'].replace('Free', 0)  # Replacing 'Free' values with 0
hb_dfs['Price'] = hb_dfs['Price'].apply(lambda x: float(str(x).replace('$', '').replace(',', '')))  # Converting price to numeric values

# Convert Installs to Numeric:
hb_dfs['Installs'] = hb_dfs['Installs'].apply(lambda x: str(x).replace(',', '').replace('+', ''))  # Removing commas and '+' symbols
hb_dfs['Installs'] = pd.to_numeric(hb_dfs['Installs'], errors='coerce')  # Converting to numeric, invalid values will be set to NaN

# Remove 'Varies with device' from Size and convert to numeric megabytes:
hb_dfs['Size'] = hb_dfs['Size'].replace('Varies with device', np.nan)  # Replacing 'Varies with device' with NaN
size_text = hb_dfs['Size'].astype(str).str.replace(',', '').str.strip()  # Size as clean text ('nan' for missing values)
size_num = pd.to_numeric(size_text.str.rstrip('Mk'), errors='coerce')  # Numeric part; unparseable values become NaN
hb_dfs['Size'] = np.where(size_text.str.endswith('k'), size_num / 1024, size_num)  # Sizes ending in 'k' are assumed to be kilobytes (1 MB = 1024 kB) and converted to MB

# Check the first few rows after cleaning
hb_dfs.head()


In [None]:
hb_dfs.shape

In [None]:
pd.Series({col:hb_dfs[col].unique() for col in hb_dfs})

### What all manipulations have you done and insights you found?

Data Manipulations:

Duplicates → Duplicate rows were removed to ensure uniqueness.

Handled Missing Values:

Rating → Filled with the median value.

Type → Filled with 'Free'.

Content Rating → Filled with 'Everyone'.

Current Ver & Android Ver → Filled with common values ('Varies with device' and '4.1 and up').

Outliers Removal:

Rating → Ratings above 5 were removed.

Data Type Conversion:

Price → Converted to numeric, replacing 'Free' with 0.

Installs → Commas and plus signs removed, then converted to numeric.

Size → Cleaned and converted to numeric, removing units.

Insights:

Missing values were addressed, making the dataset cleaner.

Outliers in ratings were removed, ensuring valid data.

Price and installs were standardized for easier analysis.

Size data was cleaned for consistency.
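
The cleaning steps listed above can be verified with a few quick checks. This is a sketch under stated assumptions, not part of the original notebook: `check_cleaned` is a hypothetical helper, valid ratings are assumed to lie between 1 and 5, and the toy frames are invented.

```python
import pandas as pd

def check_cleaned(df: pd.DataFrame) -> dict:
    """Run basic sanity checks on a cleaned app DataFrame (hypothetical helper)."""
    return {
        "no_duplicates": not df.duplicated().any(),
        "no_missing_rating": not df["Rating"].isnull().any(),
        # Assumption: a valid rating lies between 1 and 5 inclusive
        "rating_in_range": bool(df["Rating"].between(1, 5).all()),
        "price_is_numeric": pd.api.types.is_numeric_dtype(df["Price"]),
        "installs_is_numeric": pd.api.types.is_numeric_dtype(df["Installs"]),
    }

# Toy frames standing in for hb_dfs (values are invented for illustration)
toy_clean = pd.DataFrame({
    "Rating": [4.1, 3.8],
    "Price": [0.0, 4.99],
    "Installs": [1000, 500000],
})
toy_dirty = toy_clean.assign(Rating=[19.0, 3.8])

checks = check_cleaned(toy_clean)
dirty_checks = check_cleaned(toy_dirty)
print(checks)
```

Running `check_cleaned(hb_dfs)` after the wrangling cell would flag any step that did not take effect.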



## ***4. Data Visualization, Storytelling & Experimenting with charts: Understand the relationships between variables***

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Rating vs Price (Scatter Plot)
plt.figure(figsize=(8,6))
sns.scatterplot(x='Price', y='Rating', data=hb_dfs)
plt.title('Rating vs Price')
plt.xlabel('Price')
plt.ylabel('Rating')
plt.show()

# 2. Rating Distribution (Histogram)
plt.figure(figsize=(8,6))
sns.histplot(hb_dfs['Rating'], bins=10, kde=True, color='blue')
plt.title('Rating Distribution')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()

# 3. Category-wise Rating Distribution (Box Plot)
plt.figure(figsize=(10,8))
sns.boxplot(x='Category', y='Rating', data=hb_dfs)
plt.title('Category-wise Rating Distribution')
plt.xticks(rotation=90)
plt.xlabel('Category')
plt.ylabel('Rating')
plt.show()

# 4. Size vs Installs (Scatter Plot)
plt.figure(figsize=(8,6))
sns.scatterplot(x='Size', y='Installs', data=hb_dfs)
plt.title('Size vs Installs')
plt.xlabel('Size (MB)')
plt.ylabel('Installs')
plt.show()

# 5. Price vs Installs (Scatter Plot)
plt.figure(figsize=(8,6))
sns.scatterplot(x='Price', y='Installs', data=hb_dfs)
plt.title('Price vs Installs')
plt.xlabel('Price')
plt.ylabel('Installs')
plt.show()


##### 1. Why did you pick the specific chart?

Each chart is chosen based on the type of data and the insight we want to extract.

Pie Chart (Category Distribution, Sentiment Distribution, etc.) → Used for categorical data to show proportions. Helps understand how the data is distributed among different categories.

Bar Chart (Category-wise App Distribution) → Used to compare frequency counts across categories. Helps identify which app categories are most popular.

Histogram (Rating Distribution) → Used for continuous numerical data such as ratings. Helps understand the spread of ratings (are most apps rated high or low?).

Box Plot (Outliers in Reviews, Price, etc.) → Shows the distribution and outliers of numerical data. Helps detect anomalies and extreme values.

Scatter Plot (Reviews vs. Rating, Size vs. Installs, etc.) → Shows the relationship between two numerical variables. Helps answer questions like: Do more reviews mean better ratings? Do larger apps get more installs?

Charts are chosen for their clarity, relevance, and ability to extract meaningful insights from the data.

##### 2. What is/are the insight(s) found from the chart?

The charts reveal that Family and Game apps dominate, while Social Media and Gaming apps have the highest engagement. There is no strong correlation between reviews and ratings, and free apps get significantly more downloads than paid ones.
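
Each of these statements can be checked against the cleaned data with a short aggregation. The sketch below is illustrative only: `insight_checks` is a hypothetical helper, total installs per category is used as one possible proxy for engagement (an assumption, since the text does not define it), and the toy rows are invented. `Reviews` is converted with `pd.to_numeric` because the wrangling step above leaves that column as text.

```python
import pandas as pd

def insight_checks(apps: pd.DataFrame) -> dict:
    """Aggregations behind the stated insights (hypothetical helper)."""
    reviews = pd.to_numeric(apps["Reviews"], errors="coerce")
    return {
        # Which categories contain the most apps
        "apps_per_category": apps["Category"].value_counts(),
        # Assumed engagement proxy: total installs per category
        "installs_by_category": apps.groupby("Category")["Installs"].sum(),
        # Free vs. paid downloads, using the median to limit the effect of outliers
        "median_installs_by_type": apps.groupby("Type")["Installs"].median(),
        # Linear association between number of reviews and rating
        "reviews_rating_corr": reviews.corr(apps["Rating"]),
    }

# Toy data standing in for hb_dfs (values are invented for illustration)
toy_apps = pd.DataFrame({
    "Category": ["FAMILY", "FAMILY", "GAME", "SOCIAL"],
    "Type": ["Free", "Free", "Paid", "Free"],
    "Installs": [1000, 5000, 100, 1000000],
    "Reviews": ["10", "200", "5", "9000"],
    "Rating": [4.0, 4.2, 3.9, 4.5],
})
results = insight_checks(toy_apps)
print(results["apps_per_category"])
print(results["median_installs_by_type"])
```

Applied to `hb_dfs`, the output would show whether the category, pricing, and correlation claims hold for the actual dataset.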

##### 3. Will the gained insights help create a positive business impact?
Yes, the insights can drive a positive business impact by helping developers focus on high-engagement categories like Social Media and Gaming, optimize app pricing strategies, and improve user experience based on ratings and reviews to increase downloads and retention.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

To achieve the business objective, I suggest the client:

Focus on High-Engagement Categories – Prioritize app development in Gaming, Social Media, and Family categories, as they attract the most users.

Optimize App Pricing – Since free apps get more downloads, using freemium models or in-app purchases can drive revenue.

Improve User Experience – Address issues highlighted in ratings and reviews to enhance user satisfaction and retention.

Leverage Data-Driven Marketing – Target the right audience by analyzing download trends and engagement patterns.

Regular Updates & Optimization – Keep apps updated with new features and bug fixes to maintain user interest and engagement.




# **Conclusion**

The analysis of Play Store data provides valuable insights into app success factors. Gaming, Social Media, and Family apps have the highest engagement, and free apps attract significantly more downloads than paid ones. There is no strong correlation between reviews and ratings, highlighting the need for developers to focus on both aspects separately. By optimizing pricing strategies, improving user experience, and leveraging data-driven marketing, app developers can enhance downloads, user retention, and overall business growth.



### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***