<a href="https://colab.research.google.com/github/kanak85/kanak-chokhani/blob/main/kanak_chokhani.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Exploratory Data Analysis of Google Play Store App Reviews**


#   **Name-** Kanak Chokhani

# **Project Summary -**

Exploratory Data Analysis (EDA) is a crucial step in any data analysis or data science project. The aim of EDA is to understand the data better, uncover patterns and anomalies, and develop hypotheses based on the data. The EDA process involves several steps: defining the problem statement, generating hypotheses, conducting univariate, bivariate, and multivariate analyses, cleaning the data, and testing hypotheses. The purpose of EDA is to summarize the main characteristics of the data and make it more attractive and appealing through data visualization methods.

The Google Play Store is a platform for mobile apps. With the ease of app creation and potential profitability, a large number of apps are being developed and made available on the platform. This analysis aims to provide a comprehensive understanding of the Android app market by examining the characteristics of over 10,000 apps in various categories. The insights gained from the data analysis will be used to identify opportunities for growth and improvement of the app market.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Two datasets are provided: one with basic information about each app and another with user reviews for those apps.
We must examine both datasets to identify the key characteristics that influence app engagement and success.

# **Business Objective** : -

The business objective in this case is to analyze the Google Play Store app data to uncover key factors that drive app engagement and success. By analyzing the data, the goal is to draw actionable insights for developers to capture the Android market and grow their app-making businesses. The objective is to answer questions such as: what are the top categories on the Play Store, what percentage of apps are paid or free, how important is the rating of the app, what category the app should be based on, which category has the most installations, how the last update affects the rating, how the ratings are affected when the app is paid, and how the ratings and reviews are related.

# ***Let's Begin !***

## ***1. Know Your Data***

Let's take a look at the data, which consists of two files:

* playstore data.csv: contains all the details of the applications on Google Play. There are 13 features that describe a given app.
* user_reviews.csv: contains 100 reviews for each app, most helpful first. The text in each review has been pre-processed and attributed with three new features: Sentiment (Positive, Negative or Neutral), Sentiment Polarity and Sentiment Subjectivity.

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Load Dataset
df_app = pd.read_csv("/content/drive/MyDrive/Play Store Data.csv")
df_reviews = pd.read_csv("/content/drive/MyDrive/User Reviews.csv")

### Dataset First View

In [None]:
# Dataset First Look
df_app.head()

In [None]:
df_reviews.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df_app.shape

The app dataset has 10841 rows and 13 columns.

In [None]:
df_reviews.shape

The reviews dataset has 64295 rows and 5 columns.

### Dataset Information

In [None]:
# Dataset Info
df_app.info()

In [None]:
df_reviews.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df_app.duplicated().sum()

We found 483 duplicate rows in the app data. We have to remove these duplicates from the dataset; otherwise they will distort our analysis and any later prediction.

In [None]:
df_reviews.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df_app.isnull().sum()

**Null value overview of app data:**


1.   The Rating column has 1474 null values.
2.   The Content Rating column has 1 null value.
3.   The Current Ver column has 8 null values.
4.   The Type column has 1 null value.
5.   The Android Ver column has 3 null values.


In [None]:
df_reviews.isnull().sum()

**Null value overview of reviews dataset:**


1.   The Translated_Review column has 26868 null values.
2.   The Sentiment, Sentiment_Polarity and Sentiment_Subjectivity columns each have 26863 null values.
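Raw null counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up stand-in for the reviews table (`toy_reviews` and its values are illustrative assumptions, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_reviews; the values are invented for illustration.
toy_reviews = pd.DataFrame({
    "App": ["A", "A", "B", "B"],
    "Translated_Review": ["Great", np.nan, "Bad", np.nan],
    "Sentiment": ["Positive", np.nan, "Negative", np.nan],
})

# isnull().mean() is the fraction of missing entries per column;
# multiplying by 100 turns it into a percentage.
missing_share = toy_reviews.isnull().mean().mul(100).round(1)
print(missing_share)
```

Applying the same expression to `df_reviews` would show what percentage of each column is missing, which helps when deciding between dropping and imputing.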




In [None]:
# Visualizing the missing values
sns.displot(
    data=df_app.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    aspect=2
)

### What did you know about your dataset?

In the described dataset, the column with the most missing values is the Rating column, which has 1474 missing values. There are also a few missing values in other columns, such as the Content Rating, Current Ver, Type, and Android Ver columns.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df_app.columns

In [None]:
df_reviews.columns

In [None]:
# Dataset Describe
df_app.describe()

**From the above table we can read the basic summary statistics of the Rating column. But the Reviews, Size, Installs and Price columns are missing from this table, so they must be stored with an inappropriate (non-numeric) data type. We can fix their data types in the data cleaning section!**
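One way to confirm why a column is missing from `describe()` is to look at its dtype and count how many entries fail a numeric conversion. This is a rough sketch on invented values (`toy_app` is an assumption for illustration only):

```python
import pandas as pd

# Toy stand-in for df_app; the strings mimic the raw formatting but are invented.
toy_app = pd.DataFrame({
    "Rating": [4.1, 3.9, 4.7],
    "Reviews": ["159", "967", "87510"],
    "Installs": ["10,000+", "500,000+", "5,000,000+"],
    "Price": ["0", "$4.99", "0"],
})

# describe() summarises only numeric columns by default,
# so columns stored as text are silently left out.
print(toy_app.dtypes)

# pd.to_numeric(errors="coerce") turns unparseable strings into NaN,
# so counting NaN tells us how many entries need cleaning first.
unparseable = {
    col: int(pd.to_numeric(toy_app[col], errors="coerce").isna().sum())
    for col in ["Reviews", "Installs", "Price"]
}
print(unparseable)
```

Columns with a count of zero only need a dtype conversion; the others need characters stripped before converting.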

In [None]:
df_reviews.describe()

### Variables Description

1. App - The name of the application.
2. Category - The category to which the application belongs.
3. Rating - The rating given by users for the application.
4. Reviews - The total number of user reviews for the application.
5. Size - The storage space the application occupies on the mobile phone.
6. Installs - The total number of installs/downloads of the application.
7. Type - Whether the application is free or paid.
8. Price - The price of the application.
9. Content Rating - The target audience for the application.
10. Genres - The other categories to which the application can belong.
11. Last Updated - When the application was last updated.
12. Current Ver - The current version of the application.
13. Android Ver - The Android version that supports the application.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df_app.columns

In [None]:
df_app['App'].unique()

In [None]:
df_app['Category'].unique()

In [None]:
df_app['Rating'].unique()

In [None]:
df_app['Reviews'].unique()

In [None]:
df_app['Size'].unique()

In [None]:
df_app['Installs'].unique()

In [None]:
df_app['Type'].unique()

In [None]:
df_app['Price'].unique()

In [None]:
df_app['Content Rating'].unique()

In [None]:
df_app['Genres'].unique()

In [None]:
df_app['Last Updated'].unique()

In [None]:
df_app['Current Ver'].unique()

In [None]:
df_app['Android Ver'].unique()

## 3. ***Data Wrangling***

### Data Wrangling Code

#### 1. We have to clean our data first

**Removing Duplicates Values**

In [None]:
# Write your code to make your dataset analysis ready.
df_app.drop_duplicates(inplace=True)

In [None]:
df_reviews.drop_duplicates(inplace=True)

**Dropping Null Values from both Data sets**

In [None]:
df_app.dropna(axis=0,inplace=True)

In [None]:
df_reviews.dropna(axis=0,inplace=True)

**checking info**

In [None]:
df_app.info()

In [None]:
df_reviews.info()

In [None]:
df_app.shape

In [None]:
df_reviews.shape

#### Fixing data types of columns and converting them into an appropriate form

**1. Size column**

Removing the k and M suffixes, recording non-numeric entries as 0, and converting the string dtype to float (size in KB)

In [None]:
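# Convert size strings such as '19M' or '201k' to a float number of kilobytes
def get_app_size(size):
  try:
    # Megabytes: strip the suffix and convert to KB
    if 'M' in size:
      return float(size.replace('M',''))*1000
    # Kilobytes: strip the suffix
    elif 'k' in size:
      return float(size.replace('k',''))
    # Any other text label is recorded as 0
    else:
      return 0
  # Non-string or unparseable entries are also recorded as 0
  except (TypeError, ValueError):
    return 0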


In [None]:
df_app['Size'] = df_app['Size'].apply(get_app_size)
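Recording unknown sizes as 0 keeps the column numeric, but those zeros then count as real values in a histogram or a mean. As an alternative, here is a hedged sketch of a vectorized helper that maps anything it cannot parse to NaN instead (`size_to_kb` and `toy_sizes` are hypothetical names, not part of the notebook):

```python
import numpy as np
import pandas as pd

def size_to_kb(sizes: pd.Series) -> pd.Series:
    """Convert strings such as '19M' or '201k' to kilobytes; anything else becomes NaN."""
    # Capture the numeric part and the unit suffix; non-matching rows yield NaN in both columns.
    parts = sizes.astype(str).str.extract(r"^([\d.]+)([Mk])$")
    number = pd.to_numeric(parts[0], errors="coerce")
    # Same convention as get_app_size: 1M is treated as 1000 KB.
    factor = parts[1].map({"M": 1000.0, "k": 1.0})
    return number * factor

toy_sizes = pd.Series(["19M", "201k", "Varies with device", "1.5M"])
sizes_kb = size_to_kb(toy_sizes)
print(sizes_kb)
```

With NaN in place of 0, pandas skips the unknown entries in `mean()`, `describe()` and similar calls, so the size distribution is not pulled toward zero.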

**2. Removing + and , chars from Installs column. Also converting its  dtype**

In [None]:
df_app['Installs'] = df_app['Installs'].apply(lambda x : x.replace('+' , ''))
df_app['Installs'] = df_app['Installs'].apply(lambda x : x.replace(',' , ''))

Converting Dtype to int

In [None]:
df_app['Installs'] = df_app['Installs'].astype(int)

**3. Price Column --> removing $**

In [None]:
df_app['Price'] = df_app['Price'].apply(lambda x : x.replace('$',''))

Converting Dtype into float

In [None]:
df_app["Price"] = df_app["Price"].astype(float)

**4.Reviews columns --> fixing Dtype**

In [None]:
df_app['Reviews'] = df_app['Reviews'].astype(int)
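The three conversions above follow the same pattern: strip formatting characters, then cast. A small sketch of that pattern as one reusable helper is shown below; `strip_and_convert` and the toy values are assumptions for illustration, and the last row is deliberately malformed to show the fallback to NaN:

```python
import pandas as pd

toy_app = pd.DataFrame({
    "Installs": ["10,000+", "500,000+", "Free"],
    "Price": ["0", "$4.99", "Everyone"],
    "Reviews": ["159", "967", "3.0M"],
})

def strip_and_convert(column: pd.Series, chars: str) -> pd.Series:
    """Remove each character in `chars`, then convert to numbers (NaN if not parseable)."""
    cleaned = column.astype(str)
    for ch in chars:
        # regex=False treats '+' and '$' as literal characters.
        cleaned = cleaned.str.replace(ch, "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

toy_app["Installs"] = strip_and_convert(toy_app["Installs"], "+,")
toy_app["Price"] = strip_and_convert(toy_app["Price"], "$")
toy_app["Reviews"] = strip_and_convert(toy_app["Reviews"], "")
print(toy_app)
```

Unlike `astype(int)`, which raises on the first bad value, `errors="coerce"` lets the conversion finish so the bad rows can be inspected afterwards.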

**5.Last Updated column --> string to datetime**

In [None]:
# Pandas to_datetime() function applied to the values in the last updated column helps to convert string Date time into Python Date time object.
df_app["Last Updated"] = pd.to_datetime(df_app['Last Updated'])
df_app.head()

### What all manipulations have you done and insights you found?

The manipulations done above are data wrangling techniques used to clean and prepare the data for further analysis. The following insights were found:

1. Duplicates: The code removes duplicates from the app and reviews dataframes using the drop_duplicates() method. This ensures that the data is unique and no insights are skewed by duplicate entries.

2. Null Values: The code removes any rows with missing values (null values) from the app and reviews dataframes using the dropna() method. This ensures that the data is complete and no insights are skewed by missing values.

3. Data Types: The code manipulates the data types of several columns in the app dataframe to ensure that they are in the appropriate format for analysis. This includes:

* Size: The code converts the size of the apps from strings to float values.

* Installs: The code removes the '+' and ',' characters from the entries in the 'Installs' column and converts the data type to integer.

* Price: The code removes the '$' character from the entries in the 'Price' column and converts the data type to float.

* Reviews: The code converts the data type of the 'Reviews' column to integer.

* Last Updated: Converts the "Last Updated" column from a string representation of a date and time to a pandas datetime object
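For the datetime conversion in the last bullet, passing an explicit format makes the parsing stricter and easier to audit. The sketch below assumes the dates look like "January 7, 2018"; that format and the toy values are assumptions, so check a few raw entries before relying on it:

```python
import pandas as pd

toy_dates = pd.Series(["January 7, 2018", "June 20, 2018", "not a date"])

# format="%B %d, %Y" expects "Month day, year" with an English month name;
# errors="coerce" turns anything that does not match into NaT instead of raising.
parsed = pd.to_datetime(toy_dates, format="%B %d, %Y", errors="coerce")
print(parsed)
print(parsed.dt.year)
```

Counting `parsed.isna().sum()` afterwards shows how many entries failed to parse.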

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# **Chart - 1- KDE Plot**

**Graph for Rating distribution using Kde plot**

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(20,10))
sns.kdeplot(df_app['Rating'],color='blue')
plt.xlabel("Rating")
# A KDE plot shows an estimated density, not raw counts
plt.ylabel("Density")
plt.title('Rating Distribution',size=25)

##### 1. Why did you pick the specific chart?

Because we are looking at a single variable (Rating), and a KDE plot is suited to visualizing the distribution of one continuous variable.

##### 2. What is/are the insight(s) found from the chart?

App ratings on the Google Play Store are predominantly in the higher range, with most ratings falling between 3-5. The majority of ratings are densely concentrated in the 4.2-4.7 range, indicating that users tend to give positive ratings to apps.
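A reading taken from a density curve can be cross-checked with a few summary numbers. This is a minimal sketch on invented ratings (`toy_ratings` is an assumption, not the real column):

```python
import pandas as pd

toy_ratings = pd.Series([3.0, 3.8, 4.0, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 5.0])

# Quartiles show where the middle half of the ratings lies.
quartiles = toy_ratings.quantile([0.25, 0.5, 0.75])

# between() is inclusive on both ends by default.
share_in_band = toy_ratings.between(4.2, 4.7).mean()

print(quartiles)
print(f"Share of ratings in [4.2, 4.7]: {share_in_band:.0%}")
```

Running the same two lines on `df_app['Rating']` would put a number on how concentrated the ratings are in the 4.2-4.7 range.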

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

A low rating on the Google Play Store can hurt app downloads as potential users may choose alternative apps with better ratings. Maintaining a positive rating is crucial for app success.

# **Chart - 2- Histogram**

In [None]:
plt.figure(figsize=(20,10))
plt.hist(df_app['Size'])
# The x-axis shows app size in KB (see the Size conversion above)
plt.xlabel('Size (KB)')
plt.ylabel("Frequency")
plt.title('Size Distribution',size=25)

##### 1. Why did you pick the specific chart?

Because we are looking at a single variable (Size), and a histogram shows how its values are distributed.

##### 2. What is/are the insight(s) found from the chart?

The app size on the Google Play Store is skewed towards smaller sizes, with most apps falling in the range of 1KB to 20,000KB. This suggests that a majority of the apps are relatively small in size and do not take up a significant amount of storage on a user's mobile device.
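The skew seen in the histogram can also be expressed numerically. The following sketch uses invented sizes in KB (`toy_sizes_kb` is an assumption) to show two simple checks:

```python
import pandas as pd

toy_sizes_kb = pd.Series([1_000, 2_000, 3_000, 4_000, 90_000], dtype=float)

# Positive skewness means a long tail toward large values.
skewness = toy_sizes_kb.skew()

# Share of apps at or below 20,000 KB.
share_small = (toy_sizes_kb <= 20_000).mean()

print(f"skewness: {skewness:.2f}")
print(f"mean: {toy_sizes_kb.mean():.0f} KB, median: {toy_sizes_kb.median():.0f} KB")
print(f"share <= 20,000 KB: {share_small:.0%}")
```

A mean well above the median is another sign of right skew, since a few very large apps pull the mean upward.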

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Having a smaller app size can positively impact an app's downloads, as it requires less storage space on the user's device. This can be particularly important for users with limited storage space or for those who prefer to minimize the amount of storage their apps consume. By having a smaller app size, developers can appeal to a wider audience and potentially increase their app's visibility and downloads.

# **Chart - 3- Bar Plot**


**Most Downloaded Categories**

In [None]:
# Chart - 3 visualization code
category_installs = df_app.groupby('Category')['Installs'].sum().reset_index().sort_values(by='Installs', ascending=False)
category_installs['Installs'] = category_installs['Installs'].astype(int)

In [None]:
plt.figure(figsize=(20,10))
chart = sns.barplot(data=category_installs, x='Category', y='Installs')
plt.xlabel('Category Names',size=15)
plt.ylabel('Number of Installs',size=15)
plt.title('Top most downloaded Categories on Playstore',fontsize=25)
chart.set_xticklabels(chart.get_xticklabels(),rotation=30, horizontalalignment='right')

##### 1. Why did you pick the specific chart?

We are comparing one categorical variable with one numerical variable, and the categorical column has many distinct values, so a bar chart is the natural choice for this visualization.

##### 2. What is/are the insight(s) found from the chart?

Game, Communication, Social and Productivity have the highest numbers of downloads. On the other hand, Medical, Parenting, Beauty and Events have the lowest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from the chart can have a positive impact on a business by guiding it toward app categories with high downloads, such as Games, Communication, Social, and Productivity. Choosing categories with low downloads, like Medical, Parenting, Beauty, and Events, may mean a smaller audience and lower profits in the short term. Install counts alone do not show revenue or growth over time, however, so a niche category may still be profitable and a popular one may be slow to pay off. Ultimately, the business should consider market trends and consumer demand when making decisions on app categories.

# **Chart - 4 - Pie Chart**

**Pie Chart for Type of Apps**

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(20,10))
type_values = df_app.groupby('Type')['App'].count().values
type_index = df_app.groupby('Type')['App'].count().index
plt.pie(type_values, labels=type_index, autopct='%1.2f%%')
plt.title('Pie chart for Type of apps',size=25)

##### 1. Why did you pick the specific chart?

The pie chart is used to show the distribution of free and paid apps in the Google Play Store.

##### 2. What is/are the insight(s) found from the chart?

The insight gained is that there are more free apps available than paid apps.
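The percentages shown on the pie chart can also be computed directly, which is handy for quoting them in text. A minimal sketch with invented values (`toy_types` is an assumption):

```python
import pandas as pd

toy_types = pd.Series(["Free"] * 9 + ["Paid"], name="Type")

# normalize=True returns proportions instead of counts.
type_share = toy_types.value_counts(normalize=True).mul(100).round(2)
print(type_share)
```

On the real data, `df_app['Type'].value_counts(normalize=True)` would give the same split that the pie chart displays.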

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

To have a positive business impact, companies can offer some services for free to attract users and then charge for other premium features. This strategy can create user loyalty and increase revenue.

# **Chart - 5- Bar Plot**

**Top Categories on Playstore**

In [None]:
# Chart - 5 visualization code
x = df_app['Category'].value_counts().index
y = df_app['Category'].value_counts()

In [None]:
plt.figure(figsize=(20,10))
chart = sns.barplot(x=x,y=y)
plt.xlabel('Category Names',size=15)
plt.ylabel('Number of Apps',size=15)
plt.title('Top Categories on Playstore',fontsize=25)
chart.set_xticklabels(chart.get_xticklabels(),rotation=30, horizontalalignment='right')

##### 1. Why did you pick the specific chart?

We are comparing one categorical variable with one numerical variable, and the categorical column has many distinct values, so a bar chart is the natural choice for this visualization.

##### 2. What is/are the insight(s) found from the chart?



*   Family has the largest number of apps on the Play Store, followed by Games, Tools, Productivity and Finance.
*   Events and Beauty have the fewest apps.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

According to the graph, companies could build applications in categories like Family, Games, Tools, Productivity and Finance. That does not mean companies should not try other categories: a great app idea can work in any of them.

# **Chart - 6 -Joint Plot**

**Relation between Rating and Price**

In [None]:
# Chart - 6 visualization code
sns.jointplot(data=df_app, x = 'Price', y = 'Rating')


##### 1. Why did you pick the specific chart?

A joint plot can be used to quickly visualize the relationship between two variables while showing their individual distributions on the same plot.

##### 2. What is/are the insight(s) found from the chart?

Low-priced apps are far more numerous, and their ratings are generally good.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Some apps are high in price and few in number, yet rated above 3.5. These may be apps in categories such as FAMILY and HEALTH that provide good services to users, although the joint plot itself does not show categories.

# **Chart 7**

 **Correlation between Year of last Update and Rating.**

In [None]:
plt.figure(figsize = (20,10))
grouped = df_app.groupby(df_app["Last Updated"].dt.year)["Rating"].mean()
plt.plot(grouped.index, grouped.values, "o-g")
plt.title("Year of last update vs Average rating", fontsize=25)
plt.xlabel("Year of last update", size=15)
plt.ylabel("Average rating", size=15)


1. Why did you pick the specific chart?

A line plot with markers shows how the average rating changes with the year of the last update, which makes trends easy to spot.

2. What is/are the insight(s) found from the chart?

The average app rating gradually increases after 2016. This suggests that apps updated more recently tend to have higher average ratings.


3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The average rating falls from 2010 to 2012. This may be due to poor user experience or to users giving lower ratings to apps that are no longer updated, among other possible reasons.
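A yearly average can swing a lot when only a handful of apps were last updated in that year, so it helps to look at the count next to the mean. Here is a rough sketch on invented rows (`toy_app` and its values are assumptions):

```python
import pandas as pd

toy_app = pd.DataFrame({
    "Last Updated": pd.to_datetime(["2010-05-21", "2018-06-15", "2018-07-01", "2018-08-01"]),
    "Rating": [3.0, 4.2, 4.4, 4.6],
})

# Mean rating and number of apps per year of last update.
by_year = toy_app.groupby(toy_app["Last Updated"].dt.year)["Rating"].agg(["mean", "count"])
print(by_year)
```

If the early years turn out to contain very few apps in the real data, the dip in the line may reflect small samples rather than a real trend.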


# **Chart 8**

**What are the count of Apps in different genres?**

In [None]:
topAppsinGenres = df_app['Genres'].value_counts().head(40)

In [None]:
# Genre names and their app counts as plain lists for plotting
x3sis = topAppsinGenres.index.tolist()
y3sis = topAppsinGenres.values.tolist()

In [None]:
plt.figure(figsize = (20,10))
plt.ylabel('Genres(App Count)')
plt.xlabel('Genres')
graph = sns.barplot(x=x3sis,y=y3sis,palette="deep")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90, fontsize=12)
graph.set_title("Top Genres in the Playstore", fontsize = 20);

In [None]:
top_genres = df_app['Genres'].value_counts().head(40)

plt.figure(figsize = (20,10))
plt.bar(top_genres.index, top_genres.values, color="y")
plt.xlabel('Genres', fontsize=12)
plt.ylabel('App Count', fontsize=12)
plt.xticks(rotation=90, fontsize=12)
plt.title('Top Genres in the Playstore', fontsize=20)
plt.show()


1.Why did you pick this?

A bar chart makes it easy to compare the number of apps across many genres.

2.What is/are the insight(s) found from the chart?

From the above visualization we can see that the highest numbers of apps are found in the Tools and Entertainment genres, followed by Education, Productivity, Finance and many more.


# **Chart 9**

 **What is the Relation between app category and app price?**

So now comes the hard part. How are companies and developers supposed to make ends meet? What monetization strategies can companies use to maximize profit? The costs of apps are largely based on features, complexity, and platform. Let's plot a graph and see.

Here we select some of the most popular app categories from the dataset for our analysis, i.e. GAME, FAMILY, PHOTOGRAPHY, MEDICAL, TOOLS, FINANCE, LIFESTYLE and BUSINESS.

In [None]:
graph, plot = plt.subplots()
graph.set_size_inches(20,10)
popular_app= df_app[df_app.Category.isin(['GAME', 'FAMILY', 'PHOTOGRAPHY','MEDICAL', 'TOOLS', 'FINANCE','LIFESTYLE','BUSINESS'])]
plot = sns.stripplot(x = popular_app['Price'], y = popular_app['Category'], jitter=True, linewidth=1)
plot.set_title('App pricing trend across categories',size=25)


1. Why did you pick the specific chart?

A strip plot is used to visualise the distribution of many individual one-dimensional values.
It is a good complement to a boxplot or violinplot in cases where all observations are shown along with some representation of the underlying distribution.

2. What is/are the insight(s) found from the chart?

Here we can see that apps in different categories command different price ranges. Many simple apps are free, whereas some apps in the FAMILY, LIFESTYLE, FINANCE and MEDICAL categories are high in price.

3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

Game apps are all comparatively low in price, which may be one reason game apps have more downloads, as we saw earlier.

# **Chart 10**

 **Content rating vs no of apps**

In [None]:
plt.figure(figsize = (20,10))
index = df_app['Content Rating'].value_counts().index
values = df_app['Content Rating'].value_counts().values
sns.barplot(x=index,y=values)
plt.title('Content rating',fontsize=25)
plt.xlabel('Rating',size=15)
plt.ylabel('Apps Count',size=15)

1.Why did you pick the specific chart?

The bar plot is a good choice for this type of data because it allows for easy comparison of the frequency of each category.

2.What is/are the insight(s) found from the chart?

Roughly 90% of all apps are rated for every age group and are hence open to everyone.
Very few (fewer than 500 apps) cater only to the adult population, i.e. Mature 17+.

3.Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Since a large majority of apps are targeted towards all age groups, a business could decide to focus on creating apps that are family-friendly and appealing to a wide range of users.

# **Chart 11**

**How Installs and Rating are correlated?**

In [None]:
# Sum only the Installs column; summing every column fails on text and datetime data
df_rats=df_app.groupby('Rating')['Installs'].sum().reset_index()
plt.figure(figsize = (20,10))
plt.plot(df_rats['Rating'],df_rats['Installs'])
plt.xlabel('Rating',size=15)
plt.ylabel('Installs (e+10)',size=15)
plt.title('Installs per Rating',size=25 )
plt.show()


1.Why did you pick the specific chart?

It shows how the total number of installs varies with the rating.



2.What is/are the insight(s) found from the chart?

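Here, we can observe that the higher the rating, the greater the number of installs, but this pattern changes slightly after a rating of about 4.5. Because the chart sums installs over all apps with a given rating, it also reflects how many apps have that rating. One possible reason for the dip could be bad user experience.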
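Since total installs per rating mix two effects (how many apps have that rating and how popular each one is), it can help to view them side by side. A minimal sketch on invented numbers (`toy_app` is an assumption):

```python
import pandas as pd

toy_app = pd.DataFrame({
    "Rating":   [3.5, 4.0, 4.0, 4.0, 4.5, 4.5, 4.9],
    "Installs": [1_000, 10_000, 50_000, 100_000, 1_000_000, 5_000, 100],
})

# Per rating value: number of apps, total installs and the install count of a typical app.
per_rating = toy_app.groupby("Rating")["Installs"].agg(
    app_count="count", total_installs="sum", median_installs="median"
)
print(per_rating)
```

A rating with a high total but a modest median is being carried by a few very popular apps, which is worth knowing before reading the line chart as a correlation.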

3.Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

Not directly, because installs dip slightly at the highest ratings. One possible reason is price, but this chart does not include price, so that explanation would need to be checked separately.

# **Chart 12**

**Let us try to understand and co-relate rating, reviews and price columns together.**

In [None]:
f,(ax1,ax2,ax3) = plt.subplots(ncols=3,sharey=False)
sns.boxplot(x='Rating',data=df_app,ax=ax1)
sns.boxplot(x='Reviews',data=df_app,ax=ax2)
sns.boxplot(x='Price',data=df_app,ax=ax3)
f.set_size_inches(20, 8)

1.Why did you pick the specific chart?


Box plots divide the data into sections that each contain approximately 25% of the data in that set. They provide a visual summary of the data, enabling researchers to quickly identify the median, the dispersion of the data set, and signs of skewness.
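The quantities a box plot draws can be computed by hand, which makes the picture easier to interpret. The sketch below uses invented review counts (`toy_review_counts` is an assumption) and the common 1.5 × IQR rule for whiskers, which is also the default in matplotlib and seaborn:

```python
import pandas as pd

toy_review_counts = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 5_000])

# The box spans Q1 to Q3 and the line inside it is the median.
q1, median, q3 = toy_review_counts.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = toy_review_counts[(toy_review_counts < lower) | (toy_review_counts > upper)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```

A single very large value, like the last one here, is what compresses the box toward the left edge in a heavily skewed column.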

2.What is/are the insight(s) found from the chart?

We can see that most of the ratings lie between about 4 and 5, clustering around 4.5.

3.Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.


As far as Reviews are concerned, most apps have very few reviews compared with a small number of heavily reviewed apps.
Also, for Price, most of the apps are free.


# **Chart 13**

**Let's merge the two datasets to examine the correlations between their columns.**


In [None]:
df_merge= pd.merge(df_app, df_reviews, on="App", how="inner")
df_merge.shape


In [None]:
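# Restrict to numeric columns; recent pandas versions raise an error on text columns otherwise
df_merge.corr(numeric_only=True)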

# **Pie Plot- Comparison between sentiment factors.**

In [None]:
df_reviews['Sentiment'].value_counts()

In [None]:
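# Take the labels from the same value_counts() result so they match the slice sizes
sentiment_counts = df_reviews['Sentiment'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%')
plt.show()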


1. Why did you pick the specific chart?

Because it is a simple and effective way to show the proportion of each sentiment category in the data set.

2. What is/are the insight(s) found from the chart?

The chart shows that the majority of sentiments are Positive, while the Negative and Neutral sentiments account for a smaller proportion.

# **Chart 14**

# **Correlation between Sentiment_Polarity and Sentiment_Subjectivity.**


In [None]:

plt.figure(figsize=(20, 10))

sns.scatterplot(
    x=df_reviews['Sentiment_Subjectivity'],
    y=df_reviews['Sentiment_Polarity'],
    hue=df_reviews['Sentiment'],
    palette="BuGn"
)

plt.title("Google Play Store Reviews Sentiment Analysis", fontsize=25)
plt.xlabel("Sentiment Subjectivity")
plt.ylabel("Sentiment Polarity")

plt.show()


1.Why did you pick the specific chart?



Scatter plots are useful when you want to understand the relationship between two numerical variables.

##### 2. What is/are the insight(s) found from the chart?

   The concentration of points in the range of 0.5 to 0.8 for sentiment subjectivity suggests that most of the reviews are expressing opinions and experiences rather than factual information.

# **Chart - 15**

# **Correlation Heatmap on merged dataframe**

In [None]:
plt.figure(figsize = (20,10))
sns.heatmap(df_merge.corr(numeric_only=True),annot= True,cmap=sns.color_palette('Pastel1'),square=True)
plt.title('Correlation Heatmap for Playstore data and User review data', size=25)

##### 1. Why did you pick the specific chart?

   Correlation heatmaps can be used to find potential relationships between variables and to understand the strength of these relationships

##### 2. What is/are the insight(s) found from the chart?

1. Size and sentiment polarity are negatively correlated (-0.19). One possible reason is that as an app gets larger, people start disliking it because it consumes more storage, takes more RAM and needs a high-speed connection.
2. There is a positive correlation between reviews and number of installs (0.56). This may be because more reviews make an app more visible, or simply because apps with more installs have more users who can leave reviews.
3. There is a slight positive correlation (0.24) between sentiment polarity and sentiment subjectivity, which suggests that users who share positive reviews (sentiment polarity) are somewhat more likely to be sharing a personal opinion rather than factual information (sentiment subjectivity).
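One caveat when reading these numbers: an inner merge on App repeats each app's row once per review, so heavily reviewed apps weigh more in the correlation. A hedged sketch of an alternative, averaging the reviews per app before merging, is shown below on invented data (`toy_app`, `toy_reviews` and their values are assumptions):

```python
import pandas as pd

toy_app = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Size": [1000.0, 20000.0, 50000.0],
    "Installs": [100, 10_000, 1_000_000],
})
toy_reviews = pd.DataFrame({
    "App": ["A", "A", "A", "B", "C"],
    "Sentiment_Polarity": [0.8, 0.6, 0.7, 0.1, -0.2],
})

# Row-level inner merge: app A appears three times, once per review.
merged = toy_app.merge(toy_reviews, on="App", how="inner")

# Alternative: average the reviews per app first, so each app counts once.
per_app = toy_reviews.groupby("App", as_index=False)["Sentiment_Polarity"].mean()
merged_per_app = toy_app.merge(per_app, on="App", how="inner", validate="one_to_one")

print(merged.corr(numeric_only=True).round(2))
print(merged_per_app.corr(numeric_only=True).round(2))
```

Comparing the two matrices on the real data would show how much the reported coefficients depend on the number of reviews per app.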

# **Chart - 16 - Pair Plot**

In [None]:
# Pair Plot visualization code
sns.pairplot(df_app)

##### 1. Why did you pick the specific chart?

The Seaborn Pairplot allows us to plot pairwise relationships between variables within a dataset.

##### 2. What is/are the insight(s) found from the chart?

  It creates a nice visualization and helps us to understand the data by summarising a large amount of data in a single figure.

## **5. Solution to Business Objective**


1. Target the Art and Design or Family and Lifestyle categories for app development, as these categories have a high number of installs.
2. Offer a free version of the app to increase the number of installs, as 92% of the apps on the play store are free.
3. Develop apps that are lightweight and suitable for all ages, as the majority of apps on the play store are of Everyone content rating and have a positive sentiment.
4. Focus on the Game, Tools, Entertainment, Education, Business, and Medical genres, as these categories have a high number of installs and positive sentiments.
5. Regularly update the app and keep it optimally sized to maintain a high rating, as there is a positive correlation between installs and rating.
6. Consider offering a paid version of the app, but ensure that it is of a reasonable size, as paid apps that are higher in size may not perform well in the market.
7. Encourage users to review the app, as a high number of reviews can increase the number of installs. However, be prepared for harsher reviews for paid apps.



# **Conclusion**


1. The app rating matters: it is positively correlated with the number of installs. Additionally, a negative correlation exists between app price and rating.

2. Larger-sized paid apps may not perform well in the market. Users tend to prefer lightweight paid apps.

3. "Art and Design," "Family," and "Lifestyle" categories are leading in terms of potential for installs and revenue.

4. The majority of apps (92.12%) are free, while 7.81% are paid.

5. The "Game" category has the highest number of installs and is a potential market for developers.

6. Offering a free app with a content rating suitable for everyone is a viable strategy.

7. Regular updates and an optimal app size are key factors for achieving a high rating.

8. The "Family," "Game," and "Tools" categories have the maximum number of apps.

9. Positive sentiments dominate, with 61% of users expressing positive feelings.

10. Paid apps tend to receive more critical reviews compared to free apps, indicating a difference in sentiment polarity.
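The comparison between paid and free apps in the last point can be checked with a simple group summary on the merged data. This is a minimal sketch on invented values (`toy_merged` is an assumption standing in for the merged app and review table):

```python
import pandas as pd

toy_merged = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Sentiment_Polarity": [0.5, 0.3, 0.1, 0.2, -0.4],
})

# Mean, median and number of reviews per app type.
polarity_by_type = toy_merged.groupby("Type")["Sentiment_Polarity"].agg(["mean", "median", "count"])
print(polarity_by_type)
```

The `count` column matters here: if there are far fewer paid reviews than free ones, a difference in mean polarity should be read with caution.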

###   ***! THANK YOU !!!***



