In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
eda_df = pd.read_csv('data/cleaned_data.csv')

In [3]:
eda_df.drop(['Unnamed: 0', 'index', 'Last Updated'], axis=1, inplace=True)

In [4]:
eda_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8387 entries, 0 to 8386
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8387 non-null   object 
 1   Category        8387 non-null   object 
 2   Rating          8387 non-null   float64
 3   Reviews         8387 non-null   float64
 4   Size            8387 non-null   float64
 5   Installs        8387 non-null   int64  
 6   Type            8387 non-null   object 
 7   Content Rating  8387 non-null   object 
dtypes: float64(3), int64(1), object(4)
memory usage: 524.3+ KB


In [None]:
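# Restrict to numeric columns; object columns such as App cannot be correlated
corr = eda_df.corr(numeric_only=True)
# Boolean mask that hides the upper triangle (including the diagonal)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True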

f, ax = plt.subplots(figsize=(15, 20))

sns.heatmap(corr, mask=mask, cmap='coolwarm', vmax=1, center=0,
            square=True, linewidths=.5,annot=True, cbar_kws={"shrink": .5});

In [None]:
pd.plotting.scatter_matrix(eda_df, alpha=0.1, figsize=(12,12));

# Which category is dominating the market based on Installs?

In [None]:
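# Sum only the numeric columns, then rank categories by total installs
top_installs_cat = eda_df.groupby(by='Category').sum(numeric_only=True)
top_installs_cat = top_installs_cat.sort_values(by='Installs', ascending=False).head(10)
top_installs_cat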

## Does the number of reviews affect the number of installs?

In [None]:
eda_df['log_installs'] = np.log(eda_df['Installs'])
eda_df['log_reviews'] = np.log(eda_df['Reviews'])

In [None]:
sns.jointplot(x='log_reviews', y='log_installs', data=eda_df,
                 kind='reg', truncate=False,
                 color='m')

**There's a positive correlation between Installs and Reviews.** This suggests that customers gravitate towards apps with higher review counts. It can also mean that the more installs an app has, the more likely a user is to leave a review; the correlation alone does not tell us which way the effect runs. So with the right marketing, you can get more people to download your app and, hopefully (depending on how good the app is), make your app more popular.
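
To put a number on the relationship seen in the plot, one option is to compute the Pearson correlation of the log-transformed columns. This is only a sketch: the helper name `log_correlation` and the toy frame are hypothetical, and `np.log1p` is used in place of `np.log` on the assumption that some counts could be zero.

```python
import numpy as np
import pandas as pd


def log_correlation(df, x="Reviews", y="Installs"):
    """Pearson correlation between log1p(x) and log1p(y)."""
    # log1p keeps rows where the count is 0, which np.log would map to -inf
    log_x = np.log1p(df[x])
    log_y = np.log1p(df[y])
    return log_x.corr(log_y)


# Toy frame with made-up values, only to show the call;
# in the notebook you would pass eda_df instead.
toy_df = pd.DataFrame({
    "Reviews": [0, 10, 100, 1000],
    "Installs": [10, 100, 1000, 10000],
})
r = log_correlation(toy_df)
print("Pearson r on log scale:", r)
```

A value close to 1 would back up the positive trend, but it still says nothing about which variable drives the other.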

## Does the size of an app affect the number of installs?

# What is the average rating of Apps?
Does Google have more bad or good apps?

In [None]:
fig = px.histogram(eda_df, x='Rating')
print('Average app rating = ', eda_df.Rating.mean())
fig.show()

**Out of 5, an average of 4.17 is pretty great for the apps in the Google store. You can see that the majority of apps are rated between 4 and 4.6. Only 3% of the selected apps get 5 out of 5 stars.**
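
The shares quoted above can be read off the histogram, but they can also be computed directly. A minimal sketch, assuming the ratings live in a numeric `Rating` column; the helper name `rating_shares` and the toy series are hypothetical.

```python
import pandas as pd


def rating_shares(ratings):
    """Return the mean rating and the share of apps in two rating bands."""
    return {
        "mean": ratings.mean(),
        # between() is inclusive on both ends by default
        "share_4_to_4_6": ratings.between(4.0, 4.6).mean(),
        "share_5": (ratings == 5.0).mean(),
    }


# Toy series with made-up ratings, only to show the call;
# in the notebook you would pass eda_df['Rating'] instead.
toy_ratings = pd.Series([3.5, 4.0, 4.3, 4.6, 5.0])
summary = rating_shares(toy_ratings)
print(summary)
```

Taking the mean of a boolean series gives the fraction of `True` values, which is why no explicit division by the row count is needed.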

## Does the number of reviews have any effect on ratings?

In [None]:
plt.figure(figsize=(12,10))
sns.regplot(x='Reviews', y='Rating', data=eda_df[eda_df['Reviews'] <= 5000000]);

**I filtered out apps with more than 5,000,000 reviews to get a better look at the relationship, so we're missing about 80 data points. Here you can see a slight positive correlation between the two. We can assume that the more reviews an app has, the more people are using that particular app. Though we all know that one bad rating can bring a 5 rating down to a 4 in an instant.**
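
To check how many rows a review cap removes, the filter can be wrapped in a small helper that also reports the number of dropped rows. This is a sketch under the assumption that `Reviews` is numeric; `REVIEW_CAP`, `split_by_review_cap`, and the toy frame are hypothetical names and values.

```python
import pandas as pd

REVIEW_CAP = 5_000_000  # same threshold as the regplot filter


def split_by_review_cap(df, cap=REVIEW_CAP):
    """Return the rows at or below the cap and the number of rows dropped."""
    kept = df[df["Reviews"] <= cap]
    dropped = len(df) - len(kept)
    return kept, dropped


# Toy frame with made-up values, only to show the call;
# in the notebook you would pass eda_df instead.
toy_reviews = pd.DataFrame({
    "Reviews": [100, 6_000_000, 4_000_000],
    "Rating": [4.0, 4.5, 4.2],
})
kept, dropped = split_by_review_cap(toy_reviews)
print("rows kept:", len(kept), "rows dropped:", dropped)
```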

## Rating By Category

Trying to get the average rating of each category
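
One way to get there is a `groupby` on `Category` followed by a mean of `Rating`. The sketch below also keeps the app count per category, since an average over very few apps is less reliable; the helper name `mean_rating_by_category` and the toy frame are hypothetical.

```python
import pandas as pd


def mean_rating_by_category(df):
    """Mean rating and app count per category, highest mean first."""
    stats = df.groupby("Category")["Rating"].agg(["mean", "count"])
    return stats.sort_values("mean", ascending=False)


# Toy frame with made-up values, only to show the call;
# in the notebook you would pass eda_df instead.
toy_apps = pd.DataFrame({
    "Category": ["A", "A", "B"],
    "Rating": [4.0, 5.0, 3.0],
})
category_stats = mean_rating_by_category(toy_apps)
print(category_stats)
```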

# What are the top ten categories?

In [None]:
top_ten_cat = eda_df.Category.value_counts().head(10)
plt.figure(figsize=(20,12))
sns.barplot(x=top_ten_cat.values, y=top_ten_cat.index)
plt.xlabel('Number of Apps')
plt.ylabel('Category')
plt.title('Top Ten Google App Categories In 2018 By Number of Apps')
plt.show()

In [None]:
top_ten_cat.index

In [None]:
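f, ax = plt.subplots(figsize=(11, 6))
# Assumed intent: rating distribution within each top-ten category (top_ten_cat itself only holds counts)
top_ten_df = eda_df[eda_df['Category'].isin(top_ten_cat.index)]
sns.violinplot(x='Rating', y='Category', data=top_ten_df, palette="Set3", bw=.2, cut=1, linewidth=1)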