<a href="https://colab.research.google.com/github/lakshmi-rsl/Project1/blob/main/Shenbagalakshmi__EDA_Submission_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Exploratory Data Analysis on Google Playstore App data






##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

The Play Store is a key player in digital markets, influencing the app ecosystem for both users and developers. Conducting an exploratory data analysis (EDA) on Play Store data is crucial to uncover patterns, trends, and dynamics within this vast collection of mobile applications. This analysis aims to do three things: first, show how popular apps are across different categories; second, show which factors affect user reviews and ratings; and third, investigate the relationships between app features such as price, size, and user engagement. The analysis attempts to answer important research questions, such as identifying the most popular app categories and determining how many installations there are for each rating. Through correlation analysis, data visualisation, and descriptive statistics, the study aims to provide insight into how the dynamics of the Play Store ecosystem are changing. The expected results can guide developers, inform users, and aid strategic decision-making in the constantly changing field of mobile applications. We hope that this investigation will help clarify the Play Store's current situation, its potential future directions, and the implications for the various parties involved in the online market.






# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The Play Store has a significant impact on the app ecosystem in the ever-changing world of digital markets. However, it is critical to identify subtle patterns and connections in the vast universe of mobile applications, which necessitates a thorough exploratory data analysis (EDA). This approach examines the relationships between app features, identifies factors that affect user reviews and ratings, and assesses the popularity of apps across categories. Key research questions, such as identifying preferred app categories and measuring installations per rating, are addressed via correlation analysis, data visualisation, and descriptive statistics. The resulting information could help consumers, developers, and strategists make informed decisions in the constantly changing field of mobile applications.






#### **Define Your Business Objective?**

To use exploratory data analysis findings to enhance app popularity, user satisfaction, and revenue streams, ensuring sustained success and competitiveness within the Play Store ecosystem.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')



In [None]:
df_app = pd.read_csv('/content/drive/My Drive/Project_Module2/Play Store Data.csv')

In [None]:
df_rev = pd.read_csv('/content/drive/My Drive/Project_Module2/User Reviews.csv')
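
Since the guidelines ask for exception handling, one hedged way to make the loading step fail with a clearer message is sketched below. The helper name `load_csv` is hypothetical and not part of the original notebook; it only wraps `pd.read_csv` with an explicit existence check.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Hypothetical helper: read a CSV file or raise a descriptive error."""
    path = Path(path)
    if not path.is_file():
        # Fail early with the offending path instead of a long traceback
        raise FileNotFoundError(f"Dataset not found: {path}")
    return pd.read_csv(path)


# Usage (paths as used elsewhere in this notebook):
# df_app = load_csv('/content/drive/My Drive/Project_Module2/Play Store Data.csv')
```

With this sketch, a wrong Drive path is reported immediately at load time, rather than surfacing later as a confusing error.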

### Dataset First View

In [None]:
# Dataset First Look
df_app.head()

In [None]:
df_rev.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df_app.shape

In [None]:
df_rev.shape

### Dataset Information

In [None]:
# Dataset Info
df_app.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df_app[df_app.duplicated()])

Dropping the duplicate Apps from the dataset

In [None]:
df_app.drop_duplicates(subset="App", inplace = True)

In [None]:
# Checking whether the duplicate apps were dropped from the dataset
df_app.duplicated(subset="App").sum()
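
The duplicate count depends on which columns are compared: `duplicated()` with no arguments flags only rows that are identical in every column, while `duplicated(subset="App")` flags repeated app names even when other columns differ. The small frame below is made-up data for illustration, not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: "Maps" appears twice with different review counts,
# "Chat" appears twice as an exact copy
toy = pd.DataFrame({
    "App": ["Maps", "Maps", "Chat", "Chat"],
    "Reviews": [100, 120, 50, 50],
})

full_row_dupes = int(toy.duplicated().sum())               # identical rows only
app_name_dupes = int(toy.duplicated(subset="App").sum())   # repeated app names

# Keep the first occurrence of each app name
deduped = toy.drop_duplicates(subset="App")
print(full_row_dupes, app_name_dupes, len(deduped))
```

After dropping on `subset="App"`, checking with the same subset is the more direct confirmation that no app name is repeated.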

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df_app.isnull().sum())

It is understood that columns such as "Rating", "Type", "Content Rating", "Current Ver" and "Android Ver" have missing values, which are addressed in later sections.


In [None]:
# Visualizing the missing values as bar chart
df_app.isnull().sum().plot(kind='bar')

In [None]:
#Visualizing the missing values using heatmap
sns.heatmap(df_app.isnull(),cbar=False)
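
Alongside the charts, a compact table of counts and percentages can make the extent of missing data easier to compare across columns. The sketch below uses a made-up frame; the column names only mirror the dataset for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two columns
toy = pd.DataFrame({
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
    "Installs": [1000, 500, 100, 10],
})

# Count and percentage of missing values per column, largest first
missing = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean() * 100,
}).sort_values("missing_count", ascending=False)
print(missing)
```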

### What did you know about your dataset?

The dataset consists of two CSV files: Play Store Data.csv and User Reviews.csv.

Play Store Data.csv has 10,841 observations and 13 variables describing the applications on Google Play.

User Reviews.csv has 64,295 observations and 5 variables covering the 100 most relevant reviews for each app, along with sentiment information for each review.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df_app.columns

In [None]:
df_rev.columns

In [None]:
# Dataset Describe
df_app.describe()

### Variables Description

**Play store data.csv**

**App**: Application name

**Category**: Main category the app belongs to

**Rating**: Overall user rating of the app

**Reviews**: Number of user reviews for the app

**Size**: Size of the app

**Price**: Price of the app

**Installs**: Number of user downloads/installs for the app

**Type**: Paid or Free

**Content Rating**: Age group the app is targeted at - Children / Mature 21+ / Adult

**Genres**: An app can belong to multiple genres (apart from its main category)

**Last Updated**: Date of the last app update.

**Current Ver**: Current version of the app.

**Android Ver**: Minimum Android version required.

**User reviews.csv**


**App**: Name of app

**Translated_Review**: User review (Preprocessed and translated to English)

**Sentiment**: Positive/Negative/Neutral (Preprocessed)

**Sentiment_Polarity**: Sentiment polarity score (>0 - positive, <0 - negative)

**Sentiment_Subjectivity**: Numerical score indicating the subjectivity of the review

**Quick Check for Outliers**

**On studying the dataset further, an anomalous record was found. Let us locate that row in the data and remove it.**

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for "Rating"
df_app["Rating"].unique()

In [None]:
#Checking the unique value for "Size"
df_app["Size"].unique()

In [None]:
# Checking the unique value for the column "Installs"
df_app["Installs"].unique()

In [None]:
# Checking the unique value for the column "Price"
df_app["Price"].unique()

## 3. ***Data Wrangling***

**Rating column**

We have an app rating of 19 which is out of range and needs to be dropped

In [None]:
df_app[df_app["Rating"]==19]

As we can see, this entry has a Rating of 19.0, which is far higher than the maximum rating of 5.0. Also, the value in its Reviews column contains a letter, and it is the only entry where this occurs. Hence we remove this row to make the analysis easier. **Dropping the row that has incorrect values for our features.**

In [None]:
index_to_drop = 10472
df_app = df_app.drop(index = index_to_drop)

In [None]:
# To check whether the row is dropped or not
df_app[df_app['Rating']==19]
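
Dropping by a fixed index label works only as long as the row keeps that label. As a hedged alternative, the out-of-range row can be filtered by condition instead. The sketch below assumes the valid maximum rating is 5.0, as stated in this section, and keeps rows whose rating is missing so they can be handled later. The data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one out-of-range rating and one missing rating
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.5, 19.0, np.nan],
})

# Keep rows whose rating is missing or does not exceed the maximum of 5.0
valid = toy[toy["Rating"].isna() | (toy["Rating"] <= 5)]
print(valid)
```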

**Installs Column**

**Now that we have dropped the row with a rating of 19, we next have to remove '+' and ',' from 'Installs' and make the column numeric.**

In [None]:
# Strip the '+' suffix and the ',' thousands separators from "Installs"
df_app["Installs"] = (
    df_app["Installs"]
    .str.replace("+", "", regex=False)
    .str.replace(",", "", regex=False)
)

In [None]:
# Converting the "Installs" column to integer
df_app["Installs"] = df_app["Installs"].astype(int)
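
The cleaning of install counts can be checked on a few sample strings before applying it to the whole column. The sketch below uses made-up values and, as an assumption beyond the original notebook, uses `pd.to_numeric` with `errors="coerce"` so that any leftover non-numeric entry becomes `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values; the last one is deliberately non-numeric
raw = pd.Series(["10,000+", "500+", "1,000,000+", "Free"])

cleaned = (
    raw.str.replace("+", "", regex=False)   # drop the '+' suffix
       .str.replace(",", "", regex=False)   # drop thousands separators
)

# errors="coerce" turns anything that is still non-numeric into NaN
installs = pd.to_numeric(cleaned, errors="coerce")
print(installs.tolist())
```

Passing `regex=False` matters here because `+` is a special character in regular expressions.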

**Size Column**

**We need to strip the 'M' and 'k' suffixes from the column and convert the values into bytes. We also need to replace the term "Varies with device" with a missing value.**

In [None]:
def convert_size(size):
  # Convert size strings such as "19M" or "201k" to bytes
  if isinstance(size, str):
    if "M" in size:
      return float(size.replace("M", "")) * 1024 * 1024
    elif "k" in size:
      return float(size.replace("k", "")) * 1024
    elif "Varies with device" in size:
      return np.nan
  return size

In [None]:
df_app["Size"]=df_app["Size"].apply(convert_size)
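
A quick way to sanity-check the unit conversion is to run it on a handful of known values. The sketch below is a standalone variant with the hypothetical name `size_to_bytes`; it assumes, like the notebook, that 1 k = 1024 bytes and 1 M = 1024 × 1024 bytes, and it treats any other string as missing.

```python
import numpy as np
import pandas as pd


def size_to_bytes(size):
    """Hypothetical helper: convert strings like '19M' or '201k' to bytes."""
    if isinstance(size, str):
        size = size.strip()
        if size.endswith("M"):
            return float(size[:-1]) * 1024 * 1024
        if size.endswith("k"):
            return float(size[:-1]) * 1024
        return np.nan  # e.g. "Varies with device"
    return size


sizes = pd.Series(["19M", "201k", "Varies with device"]).apply(size_to_bytes)
print(sizes.tolist())
```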

**Now we have removed the anomalies from the three columns, viz. Rating, Installs and Size, and converted them to numeric. Let's check them.**

In [None]:
df_app.describe()

**Price Column**

**Now we need to remove the "$" symbol from the Price column and convert it to the float data type.**

In [None]:
# Strip the "$" symbol from "Price"
df_app["Price"] = df_app["Price"].apply(lambda x: x.replace("$", "") if "$" in str(x) else x)

In [None]:
# Converting the "Price" column to float
df_app["Price"] = df_app["Price"].astype(float)

**Reviews Column**

In [None]:
# Converting the Reviews column to int data type
df_app["Reviews"] = df_app["Reviews"].apply(lambda x: int(x))

**We have to check whether all the corresponding columns have been converted into the required data types.**

In [None]:
df_app.dtypes

In [None]:
df_app.describe()

**Now it is found that the columns Rating, Reviews, Size, Installs and Price are converted to numeric columns of required datatypes.**
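
Instead of reading `dtypes` by eye, the type check can be expressed as a small programmatic test. The sketch below uses `pandas.api.types.is_numeric_dtype` on a made-up frame in which one column is deliberately left as strings; the list `expected_numeric` is an illustrative assumption.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy frame; "Price" is deliberately left as strings
toy = pd.DataFrame({
    "Rating": [4.1, 3.9],
    "Reviews": [159, 967],
    "Price": ["0", "4.99"],
})

expected_numeric = ["Rating", "Reviews", "Price"]

# Collect every column that should be numeric but is not
not_numeric = [c for c in expected_numeric if not is_numeric_dtype(toy[c])]
print(not_numeric)
```

An empty list would indicate that every expected column has a numeric data type.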

**Dealing with the Missing Values**

In [None]:
df_app.isnull().sum()

**It is understood that there are 1463 null values in the Rating column, 1227 null values in the Size column, 1 null value in Type column, 8 null values in Current Ver column and finally 2 null values in Android Ver column**

In [None]:
# Let's remove the rows with missing values in the Current Ver, Android Ver and Type columns, as there are very few of them
df_app.dropna(subset=['Current Ver','Android Ver','Type'], inplace = True)

In [None]:
# Let's check whether the null values are dropped in the corresponding columns
df_app.isnull().sum()

**Now only two columns, Rating and Size, have null values. We can replace them with the mean of Rating and the median of Size.**

In [None]:
# Replacing null values in the 'Rating' column with the mean of the column
df_app['Rating'] = df_app['Rating'].fillna(df_app['Rating'].mean())

In [None]:
# Replacing null values in the 'Size' column with the median of the column
df_app['Size'] = df_app['Size'].fillna(df_app['Size'].median())

In [None]:
#To check whether there are no null values in the dataset
df_app.isnull().sum()
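
The choice between mean and median imputation matters most when a column contains a few very large values. The made-up series below illustrates this: one large value pulls the mean well above the typical entry, while the median stays close to it. Whether this applies to a given column of the dataset is an assumption that should be checked against its distribution.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one large entry and one missing entry
values = pd.Series([1.0, 2.0, 3.0, 100.0, np.nan])

mean_value = values.mean()      # pulled upward by the large entry
median_value = values.median()  # stays near the typical entries

mean_filled = values.fillna(mean_value)
median_filled = values.fillna(median_value)
print(mean_value, median_value)
```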

### What all manipulations have you done and insights you found?

**The columns Installs, Price, Rating and Size contained symbols and anomalous values. We identified them, removed them, and converted the columns to integer and float data types as appropriate. In the Size column in particular, we converted the values to bytes to make them uniform. We replaced the missing values in the Rating column with the column mean and those in the Size column with the column median.**


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Pie Chart (How many apps are free and how many are paid?)

In [None]:
# Chart - 1 visualization code
apps_count = df_app["Type"].value_counts()

In [None]:
plt.figure(figsize=(8, 8))
plt.pie(apps_count, labels=apps_count.index,autopct='%1.1f%%', startangle=90, colors=['skyblue', 'lightcoral'])
plt.title('Distribution of Free and Paid Apps')
plt.show()
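
The percentages that the pie chart prints through `autopct` can also be computed directly, which is convenient when quoting exact shares in the write-up. The sketch below uses a made-up series of app types; the real proportions come from `df_app["Type"]`.

```python
import pandas as pd

# Hypothetical app types for illustration only
types = pd.Series(["Free", "Free", "Free", "Paid"])

# normalize=True returns fractions; multiply by 100 for percentages
share = types.value_counts(normalize=True) * 100
print(share.round(1))
```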

##### 1. Why did you pick the specific chart?

The aim is to find the distribution of paid and free apps in the dataset. I have chosen a pie chart since it effectively shows the percentage distribution of free and paid apps. It gives a comparative visualization highlighting their relative sizes. Moreover, it is quite straightforward, enhances readability and is well suited for the

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***