# **Project Name**    -  Play Store App Review Analysis.



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

# The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market.

# Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps.

# Explore and analyze the data to discover key factors responsible for app engagement and success.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statements**
1. What are the top categories on the Play Store?
2. Are the majority of the apps paid or free?
3. How important is the rating of an application?
4. Which audience category (content rating) should an app target?
5. Which category has the highest number of installs?
6. How does the number of apps vary by genre?
7. How does the last update affect the rating?
8. How are ratings affected when the app is a paid one?
9. How are reviews and ratings correlated?
10. Let us discuss sentiment subjectivity.
11. Are subjectivity and polarity proportional to each other?
12. What is the percentage of each review sentiment?
13. How does sentiment polarity vary between paid and free apps?
14. How does Content Rating affect an app?
15. Does the Last Updated date have an effect on the rating?
16. Distribution of app updates over the years.
17. Distribution of paid and free app updates over the months.

#### **Define Your Business Objective?**

### **📱 Hi everybody !**

In this notebook, I analyze Google Play Store data using Python. This is my first data analysis study.

**Google Play Store apps and reviews**

Mobile apps are everywhere. They are easy to create and can be lucrative. Because of these two factors, more and more apps are being developed. In this notebook, we will do a comprehensive analysis of the Android app market by comparing over ten thousand apps in Google Play across different categories. We'll look for insights in the data to devise strategies to drive growth and retention.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
data=pd.read_csv('play_store_data.csv')
rev=pd.read_csv('User Reviews.csv')

### Dataset First View

In [None]:
# Dataset First Look
pd.set_option('display.max_columns', None)
#this is main database
data.sample(5)

In [None]:
#this review database
rev.sample(5)

### Dataset Rows & Columns count

In [None]:
# Number of rows and columns in play_store_data
r_c=data.shape
print(f"Number of rows in play_store_data: {r_c[0]}, number of columns in play_store_data: {r_c[1]}")

In [None]:
# Number of rows and columns in User Reviews
r_c=rev.shape
print(f"Number of rows in User Reviews: {r_c[0]}, number of columns in User Reviews: {r_c[1]}")

### Dataset Information

In [None]:
#this is about play_store_data
data.info()

In [None]:
#this is about user Reviews
rev.info()

In [None]:
#here we describe the play_store_data
data.describe()

In [None]:
#here we describe the user Reviews
rev.describe()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicateRows_data = data[data.duplicated()]

In [None]:
# 483 rows in play_store_data are exact duplicates of an earlier row.
len(duplicateRows_data)

In [None]:
# Keep only the first row for each app name; this also removes repeated apps whose rows differ in other columns.
data.drop_duplicates(subset='App',inplace=True)
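
The count above and the drop above use different definitions of "duplicate": `duplicated()` with no arguments flags rows that match in every column, while `drop_duplicates(subset='App')` removes every repeated app name. The sketch below illustrates the difference on a small made-up frame (the `toy` data and its values are assumptions for illustration, not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are identical, row 2 repeats only the app name.
toy = pd.DataFrame({
    "App": ["A", "A", "A", "B"],
    "Reviews": [10, 10, 12, 5],
})

# Rows identical to an earlier row in every column
full_row_dupes = toy.duplicated().sum()

# Rows that repeat an earlier app name, whatever the other columns contain
name_dupes = toy.duplicated(subset="App").sum()

# Keeps the first row seen for each app name
deduped = toy.drop_duplicates(subset="App")

print(full_row_dupes, name_dupes, len(deduped))
```

Because of this difference, the number of rows removed can be larger than the number of exact duplicates counted.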

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
temp=pd.DataFrame(index=data.columns)
temp["datatype"]=data.dtypes
temp["not null values"]=data.count()
temp["null value"]=data.isnull().sum()
temp["% of the null value"]=data.isnull().mean()*100
temp["unique count"]=data.nunique()
temp

In [None]:
# Visualizing the missing values
sns.heatmap(data.isnull())

### What did you know about your dataset?

Thousands of new applications are uploaded to the Google Play Store regularly, and a huge number of developers work independently on designing apps and making them successful. With competition from all over the globe, it is important for a developer to know whether they are on the right track. Since most Play Store applications are free, the revenue model is opaque: it is hard to tell how in-app purchases, in-app adverts and subscriptions contribute to the success of an application. An application's success is therefore usually judged by the number of installs and the user ratings it has received over its lifetime rather than by the revenue it generates. The objective of this analysis is to deliver insights that help us understand customer demands better and thus help developers popularize their product. We have tried to discover the relationships among various attributes, such as whether an application is free or paid, its user reviews, and its rating.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
#this the feature of main dataset
data.columns

In [None]:
#data type of main dataset
data.dtypes

In [None]:
#data type of reviews dataset
rev.dtypes

In [None]:
#this is feature of reviews dataset.
rev.columns

In [None]:
# Dataset Describe
data.describe()

### Variables Description 

## The contents of Play Store Data are:
1. App: It contains the name of the app with a short description (optional).
2. Category: This section gives the category to which an app belongs. In this dataset, the apps are divided among 33 categories.
3. Size: The disk space required to install the respective app.
4. Rating: The average rating given by the users for the respective app. It can be in between 1 and 5.
5. Reviews: The number of users that have dropped a review for the respective app.
6. Installs: The approximate number of times the respective app was installed.
7. Type: It states whether an app is free to use or paid.
8. Price: It gives the price payable to install the app. For free type apps, the price is zero.
9. Content Rating: It states which age group the content of the respective app is suitable for.
10. Genres: It gives the genre(s) to which the respective app belongs.
11. Last Updated: It gives the date on which the latest update for the respective app was released.
12. Current Ver: It gives the current version of the respective app.
13. Android Ver: It gives the minimum Android version required to run the respective app.

## The contents of User Reviews are:
1. App: It contains the name of the app with a short description (optional).
2. Translated_Review: It contains the English translation of the review dropped by the user of the app.
3. Sentiment: It gives the attitude/emotion of the writer. It can be ‘Positive’, ‘Negative’, or ‘Neutral’.
4. Sentiment_Polarity: It gives the polarity of the review. Its range is [-1,1], where 1 means ‘Positive statement’ and -1 means a ‘Negative statement’.
5. Sentiment_Subjectivity: This value measures how much the review expresses personal opinion rather than fact. Its range is [0,1]. The higher the subjectivity, the more the review reflects the reviewer's own opinion; lower subjectivity indicates that the review is more factual.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in data.columns:
  print(f"This is the unique values in {i} \n")
  print(data[i].value_counts())
  print('\n \n \n')
  

## 3. ***Data Wrangling*** 

 **Data Wrangling Code**

### Filling null values.

The numbers of null values are:

* Rating has 1474 null values, which is 13.60% of the data.
* Type has 1 null value, which is 0.01% of the data.
* Content Rating has 1 null value, which is 0.01% of the data.
* Current Ver has 8 null values, which is 0.07% of the data.
* Android Ver has 3 null values, which is 0.03% of the data.

In [None]:
# Missing Values/Null Values Count
temp=pd.DataFrame(index=data.columns)
temp["datatype"]=data.dtypes
temp["not null values"]=data.count()
temp["null value"]=data.isnull().sum()
temp["% of the null value"]=data.isnull().mean()*100
temp["unique count"]=data.nunique()
temp

Let's first deal with the columns that contain fewer NaN values. For each of them, we must either find a sensible value to replace the NaN values with or explain why the values are missing.

 **`1). Android Ver: There are a total of 3 NaN values in this column.`**

---

In [None]:
# The rows containing NaN values in the Android Ver column
data[data["Android Ver"].isnull()]

In [None]:
# Finding the different values the 'Android Ver' column takes
data["Android Ver"].value_counts()

Since the NaN values in the Android Ver column cannot be replaced by any particular value, and since only 3 rows contain NaN values in this column (less than 0.03% of the rows in the dataset), these rows can be dropped.

In [None]:
# Dropping the rows that contain NaN values in the 'Android Ver' column.
data=data[data['Android Ver'].notna()]
# Shape of the updated dataframe
data.shape

We successfully handled the NaN values in the `Android Ver` column.

 **`2). Current Ver: There are a total of 8 NaN values in this column.`**

In [None]:
# The rows containing NaN values in the Current Ver column
data[data["Current Ver"].isnull()]

In [None]:
# Finding the different values the 'Current Ver' column takes
data['Current Ver'].value_counts()

Since only 8 rows contain NaN values in the Current Ver column (around 0.07% of the rows in the dataset), and there is no particular value with which we can replace them, these rows can be dropped.

In [None]:
# Dropping the rows that contain NaN values in the 'Current Ver' column.
data=data[data["Current Ver"].notna()]
# Shape of the updated dataframe
data.shape

 **`3). Type: There is only one NaN value in this column.`** 

In [None]:
# The row containing NaN values in the Type column
data[data["Type"].isnull()]

In [None]:
# Finding the different values the 'Type' column takes
data["Type"].value_counts()

The `Type` column contains only two values, namely `Free` and `Paid`. If an app is of type `Paid`, its price appears in the corresponding `Price` column; otherwise the price is shown as '0'. In this case the price of the app is '0', which means the app is free. Hence we can replace this NaN value with `Free`.

In [None]:
# Replacing the NaN value in 'Type' column corresponding to row index 9148 with 'Free'
data.loc[9148,'Type']='Free'
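
The fill above relies on a hard-coded row index. A possible alternative, sketched here on a made-up frame, selects the rows by condition instead, so it keeps working if the index changes. The `toy` frame is hypothetical, and the sketch assumes that `Price` is still stored as text at this stage, as it is in this notebook before the type conversion.

```python
import pandas as pd

# Hypothetical toy frame mirroring the 'Type' and 'Price' columns (Price still stored as text).
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Type": ["Free", None, "Paid"],
    "Price": ["0", "0", "$2.99"],
})

# Rows with a missing Type whose price is zero must be free apps.
missing_free = toy["Type"].isna() & (toy["Price"] == "0")
toy.loc[missing_free, "Type"] = "Free"

print(toy["Type"].tolist())
```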


 **`4). Rating: This column contains 1470 NaN values.`**

In [None]:
# The rows containing NaN values in the Rating column
data[data['Rating'].isnull()]

We also know that the rating of any app in the Play Store lies between 1 and 5. Let's check whether there are any ratings outside this range.

In [None]:
data[(data['Rating'] <1) | (data['Rating']>5)]

* The `Rating` column contains 1470 NaN values, which account for approximately 13.5% of the rows in the dataset. It is not practical to drop these rows, because doing so would lose a large amount of data and could reduce the quality of the analysis.
* The NaN values in this case can be imputed with an aggregate (mean or median) of the remaining values in the Rating column.

In [None]:
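# Fill each missing rating with the mean rating of the app's category, rounded to one decimal.
# .loc is used so the assignment modifies 'data' itself rather than a temporary copy.
for i in data['Category'].unique():
  in_category=data['Category']==i
  data.loc[in_category & data['Rating'].isnull(),'Rating']=round(data.loc[in_category,'Rating'].mean(),1)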
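
The per-category fill can also be written without a loop. The following is a minimal sketch on made-up data (the `toy` frame and its values are assumptions): `groupby(...).transform("mean")` returns, for every row, the mean rating of that row's category, and `fillna` then uses it only where the rating is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing rating per category.
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
    "Rating": [4.0, 4.4, np.nan, 3.0, np.nan],
})

# Mean rating of each row's category (NaN values are ignored by mean), rounded to one decimal.
category_mean = toy.groupby("Category")["Rating"].transform("mean").round(1)

# Use the category mean only where the rating is missing.
toy["Rating"] = toy["Rating"].fillna(category_mean)

print(toy)
```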

We can see that no null values remain in play_store_data.

In [None]:
sns.heatmap(data.isnull())

**Now we handle the null values in the User Reviews data.**

In [None]:
temp1=pd.DataFrame(index=rev.columns)
temp1["datatype"]=rev.dtypes
temp1["not null values"]=rev.count()
temp1["null value"]=rev.isnull().sum()
temp1["% of the null value"]=rev.isnull().mean().round(4)*100
temp1["unique count"]=rev.nunique()
temp1

The number of null values are:
* **Translated_Review** has 26868 null values which contributes **41.79%** of the data.
* **Sentiment** has 26863 null values which contributes **41.78%** of the data.
* **Sentiment_Polarity**  has 26863 null values which contributes **41.78%** of the data.
* **Sentiment_Subjectivity** has 26863 null values which contributes **41.78%** of the data.

In [None]:
#heatmap of null value of user Reviews
sns.heatmap(rev.isnull())

In [None]:
# Finding the total no of NaN values in each column.
rev.isnull().sum()

There are a lot of NaN values. We need to analyse these values and see how we can handle them.

In [None]:
# Checking the NaN values in the Translated_Review column
rev[rev['Translated_Review'].isnull()]

There are a total of 26868 rows containing NaN values in the Translated_Review column.

In the majority of cases, rows that do not have a review (NaN instead) also have NaN values in the columns `Sentiment, Sentiment_Polarity, and Sentiment_Subjectivity`.

Let's check whether there are any exceptions.

In [None]:
# The rows corresponding to the NaN values in the translated_review column, where the rest of the columns are non null.
rev[rev['Translated_Review'].isnull() & rev['Sentiment'].notna()]

In the few exceptional cases where the remaining columns are non-null although Translated_Review is null, the values appear to be errors, because the sentiment, sentiment polarity and sentiment subjectivity of a review can only be determined if there is a corresponding review.

Hence these values are wrong and can be deleted altogether.

In [None]:
# Deleting the rows containing NaN values
rev = rev.dropna()

In [None]:
# The shape of the updated df
rev.shape

In [None]:
#heatmap of null value of updated user Reviews
sns.heatmap(rev.isnull())

### Converting the datatype of features

**Changing the data types of play_store_data**

These are the original data types of the play_store_data dataset.

In [None]:
data.dtypes

In [None]:
# Convert Reviews to integer type
data.Reviews=data.Reviews.astype(int)

 1. `Converting the dtype of the Size column from object to float`

**Note:** here we change **Varies with device** to **0.0**, because we want to convert the Size feature from object to float.

In [None]:
# Replace 'Varies with device' with '0.0' so that the column can be converted to float.
# .loc is used so the assignment modifies 'data' itself rather than a temporary copy.
data.loc[data.Size=='Varies with device','Size']='0.0'

Sizes are stored in both megabytes and kilobytes, each with a unit suffix. Text cannot be converted to a number directly, so we remove the suffix and convert the kilobyte values to megabytes by dividing by 1024.

In [None]:
# Remove the unit suffix, convert kilobyte values to megabytes, and collect the results in the list 's'
s=[]
for i in data.Size:
   if 'K' in i:
     i=i.replace('K','')
     i=str(round(float(i)/1024,2))
     s.append(i)
   elif 'k' in i:
     i=i.replace('k','')
     i=str(round(float(i)/1024,2))
     s.append(i)
   elif 'm' in i:
     i=i.replace('m','')
     s.append(i)
   else:
     i=i.replace('M','')
     s.append(i)
     

In [None]:
# Write the converted values from the list 's' back into data.Size.
data.Size=s

In [None]:
# Change the data type of the Size column from object to float
data.Size=data.Size.astype(float)
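
The size conversion described above can be packaged as a small helper. This is a sketch, not the notebook's own code: `size_to_mb` is a hypothetical name, it follows the notebook's convention of mapping 'Varies with device' to 0.0, and it assumes every other value ends in a `k`/`K` or `m`/`M` suffix.

```python
def size_to_mb(text):
    """Convert a size string such as '19M' or '512k' to megabytes."""
    text = text.strip()
    # Same convention as the notebook: unknown sizes become 0.0.
    if text == "Varies with device":
        return 0.0
    unit = text[-1].upper()
    number = float(text[:-1])
    if unit == "K":
        # Kilobytes to megabytes, rounded to two decimals as in the loop above.
        return round(number / 1024, 2)
    if unit == "M":
        return number
    # Fail loudly on anything unexpected instead of guessing.
    raise ValueError(f"Unrecognized size value: {text!r}")


sizes_mb = [size_to_mb(s) for s in ["19M", "512k", "8.5M", "Varies with device"]]
print(sizes_mb)
```

With pandas, the same helper could be applied to a column through `Series.map`. Using 0.0 for unknown sizes keeps the column numeric, but it also pulls averages down, so treating those values as missing is an option worth considering.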

In [None]:
data.Size.tail(10)

`2. Converting the dtype of Installs from object to integer.`

The Installs column has the object data type and its values contain a '+' sign and commas. We have to remove these characters and change the data type to integer.

In [None]:
#first view of Installs column
data.Installs.sample(15)

In [None]:
#here we remove '+' sign from Installs
data.Installs=[i.replace('+','') for i in data.Installs]

In [None]:
#here we remove ',' from Installs
data.Installs=[i.replace(',','') for i in data.Installs]

In [None]:
# Change the datatype of Installs to integer
data.Installs=data.Installs.astype(int)

`3. Converting the data type of the Price column from object to float`

**Note:** here we remove the '$' sign; all price values are in dollars.

In [None]:
#here we replace '$' sign from Price
data.Price=[i.replace('$','') for i in data.Price]

In [None]:
# Change the datatype of Price to float
data.Price=data.Price.astype(float)
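
The Installs and Price clean-up steps can also be done with pandas string methods instead of list comprehensions. The sketch below uses a made-up frame (the `toy` values are assumptions) and shows one way to strip the symbols and convert the types in a single chained expression per column.

```python
import pandas as pd

# Hypothetical toy frame with the same text formats as the Installs and Price columns.
toy = pd.DataFrame({
    "Installs": ["10,000+", "500+", "1,000,000+"],
    "Price": ["0", "$4.99", "$0.99"],
})

# Remove '+' and ',' in one pass, then convert to integer.
toy["Installs"] = toy["Installs"].str.replace("[+,]", "", regex=True).astype(int)

# '$' is a special character in regular expressions, so it is treated as plain text here.
toy["Price"] = toy["Price"].str.replace("$", "", regex=False).astype(float)

print(toy.dtypes)
```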

`4. Converting the data type of the Last Updated column from object to datetime.`

In [None]:
# Change the datatype of the Last Updated column from object to datetime
data['Last Updated']=pd.to_datetime(data['Last Updated'])
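
Once the column is a datetime, the year and month needed for the update-distribution questions can be read through the `.dt` accessor. The sketch below uses made-up dates and assumes they are written in a "Month day, year" layout; if the real column uses a different layout, the `format` string would need to change.

```python
import pandas as pd

# Hypothetical sample values; the "Month day, year" layout is an assumption.
last_updated = pd.Series(["January 7, 2018", "August 1, 2017"])

# An explicit format raises an error on unexpected values instead of guessing.
parsed = pd.to_datetime(last_updated, format="%B %d, %Y")

# Year and month of each update, ready for grouping or plotting.
update_year = parsed.dt.year
update_month = parsed.dt.month

print(update_year.tolist(), update_month.tolist())
```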

Now we can see all the changes in the data types of the play_store_data dataset.

In [None]:
#updated data
data.dtypes

## Merging the User Reviews into play_store_data

Each app has many reviews and sentiments, so merging them directly into play_store_data would create many null and duplicate values. Therefore we first compute the mean of Sentiment_Polarity and Sentiment_Subjectivity for each app and then store the results in play_store_data. We also create columns in play_store_data to store the counts of positive, negative and neutral reviews from User Reviews.

In [None]:
#this is some example of number of reviews in each app.
for i in rev.App.unique()[:10]:
  print(i,':-',len(rev[rev.App==i]))

In [None]:
#here we get the intersection of APP names from both database
inter=set(rev.App.unique()) & set(data.App)
inter=list(inter)

Here we can see the number of apps that are common to both play_store_data and User Reviews.

In [None]:
# Total number of apps common to both datasets == 816.
len(inter)

`1. Now we create 'number_positive_R', 'number_negative_R' and 'number_neutral_R' to store the counts of Positive, Negative and Neutral reviews for each app.`

**Note:** here **'0.0'** means **data is not present in the User Reviews dataset for that app.**

In [None]:
#here we collect number of all positive , negative and neutral reviews 
Positive=[]
Negative=[]
Neutral=[]
for i in data.App:
  if i in inter:
    p=len(rev[(rev.App==i)&(rev['Sentiment']=='Positive')])
    n=len(rev[(rev.App==i)&(rev['Sentiment']=='Negative')])
    ne=len(rev[(rev.App==i)&(rev['Sentiment']=='Neutral')])
    Positive.append(p)
    Negative.append(n)
    Neutral.append(ne)
  else:
    Positive.append(0.0)
    Negative.append(0.0)
    Neutral.append(0.0)

In [None]:
#here we create new columns which show the number of positive, negative and neutral reviews of each app.
data['number_positive_R']=Positive
data['number_negative_R']=Negative
data['number_neutral_R']=Neutral

In [None]:
#here we clearly see that 'number_positive_R', 'number_negative_R' and 'number_neutral_R' are created.
data.dtypes

In [None]:
#here we can see that all columns of play_store_Data
data.sample(5)

`2. Here we create 'mean_sentiment_polarity' and 'mean_sentiment_subjectivity' to store the per-app mean of 'Sentiment_Polarity' and 'Sentiment_Subjectivity' from User Reviews in play_store_data.`

**Note:** here **'0.0'** means **data is not present in the User Reviews dataset for that app.**

In [None]:
data['mean_sentiment_polarity']=[round(rev.Sentiment_Polarity[rev.App==i].mean(),3)  if i in inter else 0.0 for i in data.App]
data['mean_sentiment_subjectivity']=[round(rev.Sentiment_Subjectivity[rev.App==i].mean(),3)  if i in inter else 0.0 for i in data.App]
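
The per-app counts and means above filter the reviews once for every app, which gets slow as the data grows. One possible alternative, sketched on made-up frames (`apps` and `reviews` and their values are assumptions), aggregates the reviews once with `crosstab` and `groupby` and then joins the summary onto the app table. Apps without reviews get 0.0, matching the convention used in this notebook.

```python
import pandas as pd

# Hypothetical toy versions of the two datasets.
apps = pd.DataFrame({"App": ["A", "B", "C"]})
reviews = pd.DataFrame({
    "App": ["A", "A", "A", "B"],
    "Sentiment": ["Positive", "Negative", "Positive", "Neutral"],
    "Sentiment_Polarity": [0.5, -0.2, 0.3, 0.0],
    "Sentiment_Subjectivity": [0.6, 0.4, 0.5, 0.0],
})

# One row per app, one column per sentiment label, filled with review counts.
counts = pd.crosstab(reviews["App"], reviews["Sentiment"])
counts = counts.reindex(columns=["Positive", "Negative", "Neutral"], fill_value=0)
counts.columns = ["number_positive_R", "number_negative_R", "number_neutral_R"]

# Mean polarity and subjectivity per app, rounded as in the cells above.
means = reviews.groupby("App")[["Sentiment_Polarity", "Sentiment_Subjectivity"]].mean().round(3)
means.columns = ["mean_sentiment_polarity", "mean_sentiment_subjectivity"]

# Attach the summary to the app table; apps with no reviews get 0.0.
summary = counts.join(means)
merged = apps.join(summary, on="App").fillna(0.0)

print(merged)
```

A drawback shared with the original approach is that 0.0 cannot be told apart from a genuine mean of zero; keeping the missing values as NaN would avoid that ambiguity.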

`3. Now we can see how many apps in the play_store_data dataset have reviews and how many have no reviews in the User Reviews dataset.`

In [None]:
# An app has reviews if its name appears in both datasets (the names collected in 'inter').
# Testing mean_sentiment_polarity>0 instead would leave out apps whose mean polarity is negative or zero.
has_reviews=data.App.isin(inter)
rev_present=has_reviews.sum()
rev_not_present=(~has_reviews).sum()

Reviews are present only for the apps found in both datasets (816, as counted above); every other app in play_store_data has no reviews in the User Reviews dataset.

Of the apps with reviews, 740 have a positive mean sentiment polarity.

In [None]:
#number of reviews present and not present in play_store_data.
print('Reviews present :-',rev_present)
print('Reviews not Present:-',rev_not_present)

In [None]:
data.isnull().sum().sum()

In [None]:
data.dtypes

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1  `Top categories on Google Playstore?`

In [None]:
y=data.Category.value_counts().values
x=data.Category.value_counts().index

In [None]:
# Chart - 1 visualization code
# Number of apps belonging to each category in the Play Store
fig, ax=plt.subplots(figsize=(20,8))
sns.barplot(x=x,y=y,palette= "tab10")
ax.set_title('Top categories on Playstore',fontsize=20)
ax.set_xlabel('Name of categories',fontsize=20)
ax.set_ylabel('Number of Apps',fontsize=20)
ax.set_xticklabels(ax.get_xticklabels(), rotation= 45, horizontalalignment='right',fontsize=13)
plt.show()

In [None]:
#Percentage of apps belonging to each category in the playstore
plt.figure(figsize=(17,17))
plt.pie(y,labels=x,autopct='%1.2f%%')
plt.title('% of apps share in each Category', fontsize = 25)
plt.show()

##### 1. Why did you pick the specific chart?

1. The bar chart compares the number of apps belonging to each category in the Play Store.
2. The pie chart shows the percentage of apps belonging to each category.

##### 2. What is/are the insight(s) found from the chart?

1. There are 33 categories in the dataset. Most apps fall under the `FAMILY` and `GAME` categories, and the fewest under `EVENTS` and `BEAUTY`.
2. The largest shares of apps belong to the GAME, FAMILY and TOOLS categories.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The chart shows that most apps belong to the GAME, FAMILY and TOOLS categories. Knowing where apps are concentrated can have a positive impact on business decisions.

#### Chart - 2 `Which category of apps has the most installs?`

In [None]:
# Chart - 2 visualization code
# Total installs per category, drawn as a horizontal bar chart
a = data.groupby(['Category'])['Installs'].sum().sort_values()
a.plot.barh(figsize=(15,10), color = 'c')
# In a horizontal bar chart the categories are on the y-axis and the totals on the x-axis
plt.xlabel('Total app Installs', fontsize = 15)
plt.ylabel('App Categories', fontsize = 15)
plt.title('Total app installs in each category', fontsize = 20)

In [None]:
# Top 10 apps by installs in each of the top 10 categories.
cat=data.groupby(['Category'])['Installs'].sum().sort_values(ascending=False).head(10).index
for i in cat:
  df=data[['App','Installs']][data.Category==i].sort_values(by='Installs',ascending=False).head(10)
  fig,ax=plt.subplots(figsize=(12,5))
  graph=sns.barplot(x = df.App, y = df.Installs, palette= "icefire")
  ax.set_xlabel('Name of Apps',fontsize=15)
  ax.set_ylabel('number of Installations',fontsize=15)
  ax.set_title(f'Top 10 Apps of {i} Category',fontsize=15,fontweight='bold')
  graph.set_xticklabels(graph.get_xticklabels(),rotation= 90, horizontalalignment='right')
  plt.show()
  print('\n')

##### 1. Why did you pick the specific chart?

1. Because it shows which category of apps has the highest number of installs.
2. It shows the top 10 apps by installs in each of the top 10 categories.

##### 2. What is/are the insight(s) found from the chart?

This tells us which categories of apps have the most installs. The `GAME`, `COMMUNICATION` and `TOOLS` categories have more installs than the other categories.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The highest numbers of installs are in the `GAME`, `COMMUNICATION` and `TOOLS` categories, so these are the categories to focus on.

#### Chart - 3 `Which Content Rating group has the most apps on the Play Store?`

In [None]:
# Chart - 3 visualization code
#create pie chart
plt.figure(figsize=(10,10))
explode=(0,0.1,0.1,0.1,0.0,1.3)
colors = ['C4', 'r', 'c', 'g', 'm', 'k']
plt.pie(data['Content Rating'].value_counts().values, labels = data['Content Rating'].value_counts().index, colors = colors, autopct='%.2f%%',explode=explode,textprops={'fontsize': 15})
plt.title('Content Rating',size=20,loc='center')
plt.legend()

##### 1. Why did you pick the specific chart?

This chart shows which Content Rating group has the most apps on the Play Store.

##### 2. What is/are the insight(s) found from the chart?

A majority of the apps (82%) in the Play Store can be used by everyone. The remaining apps have various age restrictions.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Most apps in the Play Store market are rated for everyone, and this group dominates the market.

#### Chart - 4 ` What is the ratio of number of Paid apps and Free apps?`

In [None]:
# Chart - 4 visualization code
# create pie chart
plt.figure(figsize=(10,10))
colors = ["c","m"]
explode=(0,0.1)
plt.pie(data['Type'].value_counts().values, labels = data['Type'].value_counts().index, colors = colors, autopct='%.2f%%',explode=explode,textprops={'fontsize': 15})
plt.title('Distribution of Paid and Free apps',size=15,loc='center')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

This shows the ratio of paid apps to free apps.

##### 2. What is/are the insight(s) found from the chart?

From the above graph we can see that 92% of the apps in the Google Play Store are free and 8% are paid.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Here we can see that the great majority of apps are free.

#### Chart -5  `Average rating of the apps`

In [None]:
# Chart - 5 visualization code
# Average app ratings
data['Rating'].value_counts().plot.bar(figsize=(20,8), color = 'm' )
plt.xlabel('Average rating',fontsize = 15 )
plt.ylabel('Number of apps', fontsize = 15)
plt.title('Average rating of apps in Playstore', fontsize = 20)
plt.legend()
plt.show()

We can represent the ratings in a better way if we group the ratings between certain intervals. Here, we can group the rating as follows:

* 4-5: Top rated
* 3-4: Above average
* 2-3: Average
* 1-2: Below average

In [None]:
A_R_name=['Top rated','Above average','Average','Below average']
# Each group includes its upper boundary, so ratings of exactly 1, 2, 3 or 4 are not left out
A_R_value=[(data.Rating>4).sum(),
           ((data.Rating>3)&(data.Rating<=4)).sum(),
           ((data.Rating>2)&(data.Rating<=3)).sum(),
           ((data.Rating>=1)&(data.Rating<=2)).sum()]
plt.figure(figsize=(10,8))
plt.pie(A_R_value,labels=A_R_name,autopct='%1.2f%%')
plt.title('% of Categories of Ratings', fontsize = 25)
plt.show()
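
Grouping ratings into intervals can also be expressed with `pd.cut`, which makes the interval boundaries explicit. The sketch below runs on made-up ratings and assumes that a rating on a boundary belongs to the lower group (so 4.0 counts as "Above average"), which is a choice the interval list above leaves open.

```python
import pandas as pd

# Hypothetical ratings, including values that sit exactly on the boundaries.
ratings = pd.Series([1.0, 1.8, 2.0, 2.5, 3.0, 3.7, 4.0, 4.3, 5.0])

labels = ["Below average", "Average", "Above average", "Top rated"]

# Bins are [1, 2], (2, 3], (3, 4], (4, 5]; include_lowest keeps a rating of exactly 1.
groups = pd.cut(ratings, bins=[1, 2, 3, 4, 5], labels=labels, include_lowest=True)

# Number of ratings and percentage share per group.
group_counts = groups.value_counts()
group_share = groups.value_counts(normalize=True).mul(100).round(2)

print(group_counts)
print(group_share)
```

The resulting counts can be passed to `plt.pie` in the same way as the manually built list.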

In [None]:
# Category-wise mode ratings
import matplotlib.patches as mpatches
key=[]
value=[]
color=[]
for i in data.Category.unique():
  key.append(i)
  value.append(data.Rating[data.Category==i].mode()[0])
  if data.Rating[data.Category==i].mode()[0]>4.2:
    color.append('green')
  elif data.Rating[data.Category==i].mode()[0]<=4.2 and data.Rating[data.Category==i].mode()[0]>4:
    color.append('grey')
  else:
    color.append('red')
plt.figure(figsize=(17,6))
plt.bar(key,value,color=color)
plt.xticks(rotation=90, ha='right',color='green')
grey = mpatches.Patch(color='grey', label='Rating is greater than 4 and at most 4.2')
green = mpatches.Patch(color='green', label='Rating is greater than 4.2')
red = mpatches.Patch(color='red', label='Rating is 4 or less')
plt.legend(handles=[grey,green,red])
plt.ylim(3.5,5)
plt.title('Category wise MODE Rating',fontsize=15,fontweight='bold')
plt.ylabel('Ratings',fontsize=15)
plt.xlabel('Name of Categories',fontsize=15)
plt.show()


##### 1. Why did you pick the specific chart?

1. The bar chart shows the number of apps at each rating.
2. The pie chart shows the percentage of apps in each rating group.
3. The last chart shows the mode rating of each category.

##### 2. What is/are the insight(s) found from the chart?

1. The highest number of apps have a rating of 4.3, followed by 4.1, 4.2 and so on.
2. Since the most common ratings are all above 4, the 'Top rated' group holds the largest share of apps.
3. ART_AND_DESIGN, COMICS, EDUCATION and EVENTS have the highest mode rating, while DATING, TOOLS and VIDEO_PLAYERS have the lowest.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

These charts have a positive impact because they show how many apps fall into each rating group and which categories have the highest mode rating.

#### Chart - 6 `Percentage of large-size apps in the top 10 categories`

In [None]:
# Chart - 6 visualization code
# Percentage of large-size apps in the top 10 categories (among the 500 largest apps).
top=data[['Category','Size']].sort_values('Size', ascending=False).head(500)
plt.figure(figsize=(10,8))
plt.pie(top['Category'].value_counts().head(10).values,labels=top['Category'].value_counts().head(10).index,autopct='%1.2f%%')
plt.title('% of large size Apps in top Categories',fontsize=15,fontweight='bold')
plt.show()

##### 1. Why did you pick the specific chart?

It shows the percentage of large-size apps in the top 10 categories.

##### 2. What is/are the insight(s) found from the chart?

The GAME and FAMILY categories together account for more than 80% of the large apps shown.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The GAME and FAMILY categories have the heaviest apps compared with the other categories.



#### Chart - 7  `Number of paid apps by Category/Genres, and number of top apps by installs in each category.`

In [None]:
# Chart - 7 visualization code
paid=data.Genres[data.Type=='Paid'].value_counts().head(20)
paid2=data.Category[data.Type=='Paid'].value_counts().head(20)
plt.figure(figsize=(15,7))
plt.bar(paid.index,paid.values,color='c')
plt.bar(paid2.index,paid2.values,color='m')
c = mpatches.Patch(color='c', label='GENRES')
m = mpatches.Patch(color='m', label='CATEGORIES')
plt.xticks(rotation = 90)
plt.legend(handles=(c,m))
plt.ylabel('Number of paid Apps',fontsize=15)
plt.xlabel('Name of Category/Genres',fontsize=15)
plt.title('Number of Paid Apps by Category/Genres',fontsize=15,fontweight='bold')
plt.show()

In [None]:
#Number of apps in each Category in top 100 Apps by Installs. 
tp_I=data[['Category','Installs']].sort_values(by='Installs',ascending=False).head(100)
tp_I['Category'].value_counts().plot(kind='bar')
plt.xlabel('Name of Categories',fontsize=15)
plt.ylabel('Number of Apps out of 100 Apps',fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

These charts show the number of paid apps by category and genre, and the number of top apps by installs in each category.

##### 2. What is/are the insight(s) found from the chart?

1. By genre, Medical, Personalization and Tools have the most paid apps.
2. By category, FAMILY has the most paid apps, followed by MEDICAL, GAME, PERSONALIZATION and TOOLS.
3. The top 100 apps by installs are free apps, with install counts reaching over one billion.
   The top categories in which these apps fall are Tools (20), Game (17),
   Communication (16), Photography (9) and Social (7).

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Here we can see how many paid apps there are in each category and genre.

#### Chart - 8 `Percentage of Positive, Negative and Neutral Reviews in the Dataset`

In [None]:
# Chart - 8 visualization code
s_key=['Positive Reviews', 'Negative Reviews','Neutral Reviews']
s_value=[data.number_positive_R.sum(),data.number_negative_R.sum(),data.number_neutral_R.sum()]
plt.figure(figsize=(11,11))
plt.pie(s_value,labels=s_key,autopct="%.2f%%",explode=[0.01, 0.05, 0.05], shadow=True)
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

This chart shows the percentage of positive, negative and neutral reviews.

##### 2. What is/are the insight(s) found from the chart?

1. Positive reviews are **64.22%**
2. Negative reviews are **22.28%**
3. Neutral reviews are **13.50%**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Here we can see the sentiments of the reviews and work according to them.
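
The percentages in the pie chart come from dividing each sentiment count by the total. A small sketch of that arithmetic follows; the helper name `sentiment_percentages` and the counts are hypothetical and only chosen to make the calculation easy to follow.

```python
def sentiment_percentages(counts):
    """Convert raw review counts per sentiment into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reviews to summarise")
    return {label: round(100 * count / total, 2) for label, count in counts.items()}

# Hypothetical counts, not the totals from the notebook.
toy_counts = {"Positive": 640, "Negative": 220, "Neutral": 140}
percentages = sentiment_percentages(toy_counts)
print(percentages)
```

With the notebook's data, the three values would be the column sums `number_positive_R`, `number_negative_R` and `number_neutral_R` used in the chart code.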

#### Chart - 9 `Density of Mean Sentiment Polarity in Top Categories and by Range`

In [None]:
# Chart - 9 visualization code

1. Density of mean sentiment polarity in the top 10 categories

In [None]:
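# Density of mean sentiment polarity for the top 10 categories by total installs
for j in data.groupby('Category')['Installs'].sum().sort_values(ascending=False).head(10).index:
  p=[]   # positive polarity values
  n=[]   # negative polarity values
  ne=[]  # neutral (zero) polarity values
  # dropna() keeps missing values from being counted as neutral
  for i in data.mean_sentiment_polarity[data.Category==j].dropna():
    if i >0:
      p.append(i)
    elif i<0:
      n.append(i)
    else:
      ne.append(i)
  fig, ax=plt.subplots(figsize=(10,6))
  # Pass x and y explicitly so the plot matches the axis labels on every seaborn version:
  # x is the polarity, y is the running count within each sentiment group
  sns.scatterplot(x=p,y=list(range(len(p))),color='green',ax=ax)
  sns.scatterplot(x=ne,y=list(range(len(ne))),color='blue',ax=ax)
  sns.scatterplot(x=n,y=list(range(len(n))),color='red',ax=ax)
  a=mpatches.Patch(color='green',label='POSITIVE')
  b=mpatches.Patch(color='red',label='NEGATIVE')
  c=mpatches.Patch(color='blue',label='NEUTRAL')
  ax.set_title(f'Density of Sentiment Polarity of {j} Category',fontsize=15,fontweight='bold')
  ax.set_xlabel('Sentiment Polarity',fontsize=15)
  ax.set_ylabel('Number of Sentiments',fontsize=15)
  plt.legend(handles=[a,b,c])
  plt.show()
  print()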


2. Density of mean sentiment polarity in the lowest 10 categories

In [None]:
# Density of mean sentiment polarity for the lowest 10 categories by total installs
for j in data.groupby('Category')['Installs'].sum().sort_values(ascending=True).head(10).index:
  p=[]   # positive polarity values
  n=[]   # negative polarity values
  ne=[]  # neutral (zero) polarity values
  # dropna() keeps missing values from being counted as neutral
  for i in data.mean_sentiment_polarity[data.Category==j].dropna():
    if i >0:
      p.append(i)
    elif i<0:
      n.append(i)
    else:
      ne.append(i)
  fig, ax=plt.subplots(figsize=(10,6))
  # Pass x and y explicitly so the plot matches the axis labels on every seaborn version:
  # x is the polarity, y is the running count within each sentiment group
  sns.scatterplot(x=p,y=list(range(len(p))),color='green',ax=ax)
  sns.scatterplot(x=ne,y=list(range(len(ne))),color='blue',ax=ax)
  sns.scatterplot(x=n,y=list(range(len(n))),color='red',ax=ax)
  a=mpatches.Patch(color='green',label='POSITIVE')
  b=mpatches.Patch(color='red',label='NEGATIVE')
  c=mpatches.Patch(color='blue',label='NEUTRAL')
  ax.set_title(f'Density of Sentiment Polarity of {j} Category',fontsize=15,fontweight='bold')
  ax.set_xlabel('Sentiment Polarity',fontsize=15)
  ax.set_ylabel('Number of Sentiments',fontsize=15)
  plt.legend(handles=[a,b,c])
  plt.show()
  print()


3. Number of sentiments in sentiment polarity range.

In [None]:
c_sp_n=['-1 to -0.75','-0.75 to -0.5','-0.5 to -0.25','-0.25 to 0.0','0.0 to 0.25','0.25 to 0.50','0.50 to 0.75','0.75 to 1']
s=data.mean_sentiment_polarity
# Count the values in each range. Negative ranges are [low, high) and positive ranges are (low, high],
# so a polarity of exactly 0.0 is not counted in any bar.
c_sp_v=[((s>=-1)&(s<-0.75)).sum(),
        ((s>=-0.75)&(s<-0.5)).sum(),
        ((s>=-0.5)&(s<-0.25)).sum(),
        ((s>=-0.25)&(s<0.0)).sum(),
        ((s>0.0)&(s<=0.25)).sum(),
        ((s>0.25)&(s<=0.50)).sum(),
        ((s>0.50)&(s<=0.75)).sum(),
        ((s>0.75)&(s<=1)).sum()]

fig,ax=plt.subplots(figsize=(15,8))
sns.barplot(x=c_sp_n,y=c_sp_v)
ax.set_title('Range of Mean Sentimental Polarity',fontsize=15,fontweight='bold')
ax.set_ylabel('Number of Sentiments',fontsize=15)
ax.set_xlabel('Range of sentiments',fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

These charts show the density of mean sentiment polarity and the number of sentiments in each polarity range.

##### 2. What is/are the insight(s) found from the chart?

1. For the top 10 categories, the density of mean sentiment polarity is high.
2. For the lowest 10 categories, the density of mean sentiment polarity is low.
3. 90% of mean sentiment polarity values lie between -0.25 and 0.50.



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

1. We can clearly tell the lowest categories from the top categories by the density of sentiment polarity.
2. The largest number of sentiments lies between -0.25 and 0.50.
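
A claim such as "most values lie between -0.25 and 0.50" can be checked directly. The sketch below bins a polarity series with `pd.cut` and computes the share inside a range with `Series.between`. The helper names and the toy series are assumptions for illustration. Note that `pd.cut` uses right-closed intervals, so a value of exactly 0.0 lands in the `(-0.25, 0.0]` bin, which differs slightly from the hand-written ranges in the chart code.

```python
import pandas as pd

EDGES = [-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0]

def polarity_bin_counts(polarity):
    """Count polarity values per 0.25-wide bin, ignoring missing values."""
    binned = pd.cut(polarity.dropna(), bins=EDGES, include_lowest=True)
    return binned.value_counts().sort_index()

def share_in_range(polarity, low, high):
    """Fraction of non-missing polarity values with low <= value <= high."""
    clean = polarity.dropna()
    return clean.between(low, high).mean()

# Made-up polarity values, including one missing entry.
toy = pd.Series([-0.9, -0.3, -0.1, 0.0, 0.1, 0.2, 0.3, 0.45, 0.6, 0.8, None])

counts = polarity_bin_counts(toy)
share = share_in_range(toy, -0.25, 0.5)
print(counts)
print(f"share between -0.25 and 0.50: {share:.0%}")
```

On the notebook's data the input would be `data.mean_sentiment_polarity`.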

#### Chart - 10 `Number of App Updates in Each Month`

In [None]:
# Chart - 10 visualization code
#Number of updated paid apps in each month
#value_counts().sort_index() orders the counts by month and avoids relying on the
#name of the counts column, which differs between pandas versions.
m=data['Last Updated'].dt.strftime('%m')[data.Type=='Paid'].value_counts().sort_index()
plt.figure(figsize=(13,7))
plt.bar(m.index,m.values,color='c')
plt.title("Paid Apps update over the month", fontsize=15,fontweight='bold')
plt.xlabel('Months',fontsize=15)
plt.ylabel('Number of Updated App',fontsize=15)
s=mpatches.Patch(color='c',label='Paid Apps')
plt.legend(handles=[s])
plt.show()

In [None]:
#Number of updated free apps in each month
#value_counts().sort_index() orders the counts by month and avoids relying on the
#name of the counts column, which differs between pandas versions.
m=data['Last Updated'].dt.strftime('%m')[data.Type=='Free'].value_counts().sort_index()
plt.figure(figsize=(13,7))
plt.bar(m.index,m.values,color='m')
plt.title("Free Apps update over the month", fontsize=15,fontweight='bold')
plt.xlabel('Months',fontsize=15)
plt.ylabel('Number of Updated App',fontsize=15)
s=mpatches.Patch(color='m',label='Free Apps')
plt.legend(handles=[s])
plt.show()

##### 1. Why did you pick the specific chart?

These charts show the number of app updates in each month.



##### 2. What is/are the insight(s) found from the chart?

Both paid and free apps are most often updated in July.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

A large number of apps are updated in July.
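
The two bar charts can also be summarised in one table. As a hedged sketch, `pd.crosstab` counts apps per month of last update with one column per app type. The helper name `updates_per_month` and the toy rows are made up; the notebook's `data` is assumed to have the same two columns, with `Last Updated` already parsed as datetime.

```python
import pandas as pd

def updates_per_month(df):
    """Count apps per month of last update, one column per app Type."""
    table = pd.crosstab(df["Last Updated"].dt.month, df["Type"])
    # Make sure all 12 months appear, even those with no updates.
    return table.reindex(range(1, 13), fill_value=0)

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Paid", "Free", "Paid", "Free", "Paid"],
    "Last Updated": pd.to_datetime([
        "2018-07-03", "2018-07-21", "2018-07-15", "2018-06-30",
        "2017-07-09", "2018-08-01", "2017-01-09",
    ]),
})

updates = updates_per_month(toy)
print(updates)
print(updates.idxmax())  # month with the most updates for each Type
```

Because the months are grouped across years, this answers "which month" rather than "which month of which year", the same as the charts.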

#### Chart - 11 `Density of Mean Sentiment Polarity in Each Month`

In [None]:
# Chart - 11 visualization code

In [None]:
#Density of sentiment polarity of free apps in each month (number of free apps per month of last update).
m_p=['JAN','FEB','MARCH','APRIL','MAY','JUNE','JULY','AUG','SEP','OCT','NOV','DEC']
month=data['Last Updated'].dt.strftime('%m')  # computed once, reused in the loop
d_p=[]
for i in [f'{k:02d}' for k in range(1,13)]:  # always 12 months, so d_p lines up with m_p
  l=len(data['mean_sentiment_polarity'][(data.Type=='Free') & (month==i)])
  d_p.append(l)
l_sp_f=len(data['mean_sentiment_polarity'][(data.Type=='Free')])

plt.figure(figsize=(10,7))
plt.plot(m_p,d_p,color='green',marker='o')
plt.fill_between(m_p, d_p, color='green', alpha=0.5)
t_sp_f=mpatches.Patch(color='green',label=f'Total Sentimental Polarity= {l_sp_f}')
plt.xlabel('Months',fontsize=15)
plt.ylabel('Density of Sentiment Polarity', fontsize=15)
plt.title('Density of Sentiment Polarity of Free Apps in each months.',fontsize=15,fontweight='bold')
plt.legend(handles=[t_sp_f])
plt.show()



In [None]:
#Density of sentiment polarity of paid apps in each month (number of paid apps per month of last update).
m_p_p=['JAN','FEB','MARCH','APRIL','MAY','JUNE','JULY','AUG','SEP','OCT','NOV','DEC']
month=data['Last Updated'].dt.strftime('%m')  # computed once, reused in the loop
d_p_p=[]
for i in [f'{k:02d}' for k in range(1,13)]:  # always 12 months, so d_p_p lines up with m_p_p
  l=len(data['mean_sentiment_polarity'][(data.Type=='Paid') & (month==i)])
  d_p_p.append(l)
l_sp_p=len(data['mean_sentiment_polarity'][(data.Type=='Paid')])

plt.figure(figsize=(10,7))
plt.plot(m_p_p,d_p_p,color='red',marker='o')
plt.fill_between(m_p_p, d_p_p, color='red', alpha=0.5)
t_sp_p=mpatches.Patch(color='red',label=f'Total Sentimental Polarity= {l_sp_p}')
plt.xlabel('Months',fontsize=15)
plt.ylabel('Density of Sentiment Polarity', fontsize=15)
plt.title('Density of Sentiment Polarity of Paid Apps in each months.',fontsize=15,fontweight='bold')
plt.legend(handles=[t_sp_p])
plt.show()

##### 1. Why did you pick the specific chart?

These charts show the density of sentiment polarity per month for each type of app.

##### 2. What is/are the insight(s) found from the chart?

1. The density of sentiment polarity is higher for free apps than for paid apps.
2. For free apps, about 4800 of 8896 sentiment values fall in June, July and August, which is more than 50%. For paid apps, about 300 of 753 fall in those months, which is approximately 40%.
3. The density of sentiment polarity is high in June, July and August. This suggests that app installation is higher in June and July than in other months.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This suggests that app installation is higher in June and July than in other months.
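
The summer shares quoted in the insights can be checked with simple arithmetic. The sketch below uses the approximate counts stated above (4800 of 8896 for free apps, 300 of 753 for paid apps); the helper name `share_percent` is made up for illustration.

```python
def share_percent(part, total):
    """Return part/total as a percentage rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * part / total, 2)

# Approximate counts quoted in the insights for June, July and August.
free_share = share_percent(4800, 8896)
paid_share = share_percent(300, 753)
print(f"Free apps: {free_share}%  Paid apps: {paid_share}%")
```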

#### Chart - 12 `Number of Apps in the Top 10 Android Versions and in the Top 10 Current Versions`

In [None]:
# Number of apps for each of the top 10 Android versions
plt.figure(figsize=(10,6))
data['Android Ver'].value_counts().head(10).plot.bar()
plt.title('Count of top 10 android version', fontsize=15,fontweight='bold')
plt.ylabel('Number of Apps',fontsize=15)
plt.xlabel('Name of Android version',fontsize=15)
plt.show()

In [None]:
# Number of apps for each of the top 10 current app versions
plt.figure(figsize=(10,6))
data['Current Ver'].value_counts().head(10).plot.bar()
plt.title('Count of top 10 current version', fontsize=15,fontweight='bold')
plt.ylabel('Number of Apps',fontsize=15)
plt.xlabel('Name of current version',fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

These charts show the number of apps for each of the top 10 Android versions and for each of the top 10 current app versions.

##### 2. What is/are the insight(s) found from the chart?

1. Most apps require Android version 4.1 and up.
2. A lot of apps have a version that varies with device.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Most apps require Android version 4 or above, which indicates that most apps on the Play Store are older.

#### Chart - 13 `Revenue Generated by Paid Apps`

In [None]:
#Revenue of the top 100 paid apps by installs, summed per category
new_df=data.loc[data.Type=='Paid',['Category','Price','Installs']].sort_values(by='Installs',ascending=False).head(100).copy()
new_df['revenue']=new_df.Price*new_df.Installs
# sort=False keeps the categories in order of first appearance
revenue=new_df.groupby('Category',sort=False)['revenue'].sum()
plt.bar(revenue.index,revenue.values,color='#e8d741')
plt.xticks(rotation = 90)
plt.title('Revenue Generated by Paid Apps',fontsize=15)
plt.ylabel('Revenue (multiple of 10^8)',fontsize=15)
plt.xlabel('Name of Categories',fontsize=15)
plt.show()

**Revenue of paid apps = Price multiplied by Installs**
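
To make the formula concrete, here is a small worked sketch on made-up numbers. The helper name `revenue_by_category` and the toy rows are assumptions for illustration; the column names follow the notebook. Since the result only multiplies price by installs, it is an estimate of installation-fee revenue and ignores in-app purchases, refunds and price changes.

```python
import pandas as pd

def revenue_by_category(df):
    """Estimate revenue per category as the sum of Price * Installs over paid apps."""
    paid = df[df["Type"] == "Paid"].copy()
    paid["revenue"] = paid["Price"] * paid["Installs"]
    return paid.groupby("Category")["revenue"].sum().sort_values(ascending=False)

# Made-up rows, not taken from the Play Store dataset.
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "FAMILY", "LIFESTYLE", "TOOLS"],
    "Type": ["Paid", "Paid", "Paid", "Paid", "Free"],
    "Price": [5.0, 1.0, 3.0, 400.0, 0.0],
    "Installs": [10_000_000, 1_000_000, 500_000, 10_000, 50_000_000],
})

revenue = revenue_by_category(toy)
print(revenue)
```

For GAME in this toy table, the estimate is 5.0 × 10,000,000 + 1.0 × 1,000,000 = 51,000,000.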

##### 1. Why did you pick the specific chart?

This chart shows the revenue of the top 100 paid apps by category.

##### 2. What is/are the insight(s) found from the chart?

Most of the revenue is generated by the FAMILY, LIFESTYLE and GAME categories.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

From this graph we understand that, if we want to design a paid app, we should focus on these categories.

#### Chart - 14 - `Correlation Heatmap`

In [None]:
# Correlation Heatmap visualization code

In [None]:
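fig,ax=plt.subplots(figsize=(15,10))
# numeric_only=True (pandas 1.5+) restricts the correlation to numeric columns;
# recent pandas versions raise an error on text columns otherwise.
sns.heatmap(data.corr(numeric_only=True),annot=True,cmap='Greens',ax=ax)
plt.show()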

##### 1. Why did you pick the specific chart?

A correlation heatmap shows the pairwise correlation between the numeric columns at a glance.

##### 2. What is/are the insight(s) found from the chart?

1. number_positive_R, number_negative_R, number_neutral_R, mean_sentiment_polarity and mean_sentiment_subjectivity are highly correlated with each other.
2. Installs is highly correlated with Reviews.
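
A single value from the heatmap can be read off the correlation matrix directly. The sketch below builds the matrix from the numeric columns only and looks up one pair. The toy table is made up, so the resulting numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d", "e"],
    "Installs": [1_000, 5_000, 10_000, 50_000, 100_000],
    "Reviews": [10, 60, 90, 600, 900],
    "Rating": [4.1, 3.9, 4.4, 4.0, 4.5],
})

# Keep numeric columns only; text columns such as 'App' cannot be correlated.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()  # Pearson correlation by default

print(corr.round(2))
print("Installs vs Reviews:", round(corr.loc["Installs", "Reviews"], 2))
```

Pearson correlation measures linear association and is sensitive to a few very large values, which matters for a heavy-tailed column like Installs.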

#### Chart - 15 - `Pair Plot `

In [None]:
pair=data.loc[:,['Category','Rating','Size','Installs','mean_sentiment_polarity']]

In [None]:
# Pair Plot visualization code
sns.pairplot(pair)
plt.show()

##### 1. Why did you pick the specific chart?

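A pair plot shows the pairwise relationships between Rating, Size, Installs and mean_sentiment_polarity in a single figure.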

##### 2. What is/are the insight(s) found from the chart?

Rating, Size and Installs values are most densely concentrated where mean_sentiment_polarity is above 0.0.


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

# **Analysis Summary**
In this project of analyzing play store applications, we have worked on several parameters which would help AlmaBetter to do well in launching their apps on the play store.

In the initial phase, we focused more on the problem statements and data cleaning, in order to ensure that we give them the best results out of our analysis.

The client needs to focus on:
1. Developing apps in the least-populated categories, such as Events and Beauty, as they are not explored much.
2. Free apps, since most apps on the store are free.
3. Content rated for Everyone, which increases the chances of getting the highest installs.
4. Updating their apps regularly, which will attract more users.
5. Users' needs and features, keeping in mind that user sentiment keeps changing as people continue to use the app.

* Percentage of free apps = ~92%
* Percentage of apps with no age restrictions = ~82%
* Most competitive category: Family
* Category with the highest average app installs: Game
* Percentage of apps that are top rated = ~80%
* Family, Game and Tools are the top three categories, with 1906, 926 and 829 apps respectively.
* Tools, Entertainment, Education, Business and Medical are the top genres.
* 8783 apps are smaller than 50 MB. 7749 apps have a rating above 4.0, counting both free and paid apps.
* There are 20 free apps that have been installed over a billion times.
* Minecraft is the only app in the paid category with over 10M installs. This app has also produced the most revenue only from the installation fee.
* Category in which the paid apps have the highest average installation fee: Finance
* The median size of all apps in the play store is 12 MB.
* Apps whose size varies with device have the highest average number of installs.
* Apps larger than 90 MB have the highest average number of user reviews, i.e. they are more popular than the rest.
* Helix Jump has the highest number of positive reviews and Angry Birds Classic has the highest number of negative reviews.
* Overall sentiment in the merged dataset: 64.22% positive, 22.28% negative and 13.50% neutral.

**1.Rating**

Most of the apps have rating in between 4 and 5.

Most numbers of apps are rated at 4.3

App categories have an average rating above 4.

**2.Size**

Most of the applications in the dataset are small in size.

**3.Installs**

The majority of apps fall into three categories: Family, Game and Tools.

Most apps on the Google Play Store fall under Family, Game and Tools, but the installs plot tells a different story: the most installed apps fall under Game, Communication, Productivity and Social.

Subway Surfers, Facebook, Messenger and Google Drive are the most installed apps.

**4.Type(Free/Paid)**

About 92% of apps are free and 8% are paid.

The category ‘Family’ has the highest number of paid apps.

Free apps are installed more than paid apps.

The app “I’m Rich — Trump Edition” from the category ‘Lifestyle’ is the most expensive app, priced at $400.

**5.Content Rating**

Apps rated Everyone have the most installs, while Unrated and Adults only 18+ apps have fewer installs.

**6.Reviews**

Number of installs is positively correlated with reviews with correlation 0.64.

**7.Sentiment** 

Most of the reviews have a Positive sentiment, while Negative and Neutral reviews are fewer in number.

**8.Sentiment Polarity / Sentiment Subjectivity**

The reviews show a wide range of subjectivity, and most fall within [-0.50, 0.75] on the polarity scale, implying that extremely negative or extremely positive sentiments are rare.
Most of the reviews show a mid-range of negative and positive sentiments.

Sentiment subjectivity is not always proportional to sentiment polarity, but in most cases it behaves proportionally when the variance is very high or very low.

Sentiment Polarity is not highly correlated with Sentiment Subjectivity.

# **Conclusion**

The dataset has the potential to deliver insights into customer demands and thus help developers popularize their products. It can also be used to check whether an app's original rating matches its predicted rating, to judge whether the app is performing better or worse than other apps on the Play Store.



### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***