<p><img src="https://www.pngitem.com/pimgs/m/114-1143971_google-play-logo-google-play-logo-transparent-background.png"></p>

A small project focused on apps available on Google Play.<br>
The data was downloaded from [Kaggle](https://www.kaggle.com/datasets/lava18/google-play-store-apps).

#### 1. Importing and inspecting the data

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

apps1 = pd.read_csv('googleplaystore.csv')
print(apps1.sample(7))
print(apps1.shape)

                                 App          Category  Rating Reviews  \
8745                   World Webcams           WEATHER     3.7    7896   
4290            K keyboard - Myanmar             TOOLS     4.8    1955   
4790                      PLAYBULB X         LIFESTYLE     2.9    2160   
6924  Best Western e-Concierge Hotel  TRAVEL_AND_LOCAL     4.0      41   
8052                      CX_WiFiUFO            SPORTS     3.7     818   
8328               Guide to Nikon Df       PHOTOGRAPHY     NaN       1   
3359                        Launcher   PERSONALIZATION     4.5  102923   

                    Size    Installs  Type   Price Content Rating  \
8745  Varies with device  1,000,000+  Free       0       Everyone   
4290                 14M    100,000+  Free       0       Everyone   
4790                 15M    100,000+  Free       0       Everyone   
6924                 23M     10,000+  Free       0       Everyone   
8052                5.3M    100,000+  Free       0       Every

The data in the "Installs", "Price" and "Size" columns needs to be cleaned. The first two contain special characters, and the last one uses the letters "M" and "k", which presumably indicate size in megabytes and kilobytes.

In [3]:
before_drop_dup = sum(apps1.App.duplicated())
#print(apps1[apps1.App.duplicated() == True].describe())
print('Number of duplicated apps:', before_drop_dup)

Number of duplicated apps: 1181


In [4]:
apps = apps1.drop_duplicates('App')
print('Number of unique apps:', len(apps))

Number of unique apps: 9660
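
As a quick sanity check, the number of duplicated rows plus the number of unique apps should equal the total row count (here 1181 + 9660 = 10841, which should match the first value printed by `apps1.shape`). A minimal sketch of that relationship on a made-up frame (the `toy` data is hypothetical and not part of the dataset):

```python
import pandas as pd

# Toy stand-in for the real dataset; the app names and review counts are made up.
toy = pd.DataFrame({
    "App": ["A", "B", "A", "C", "B", "A"],
    "Reviews": [10, 20, 11, 30, 20, 12],
})

n_dup = toy["App"].duplicated().sum()      # rows repeating an earlier App name
unique_apps = toy.drop_duplicates("App")   # keeps the first row for each App

# Every row is either a first occurrence or a duplicate of one.
assert n_dup + len(unique_apps) == len(toy)
print(n_dup, len(unique_apps))
```

Note that `drop_duplicates('App')` keeps the first occurrence by default, so which copy of a duplicated app survives depends on row order.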


#### 2. Data cleaning
As seen previously, the "Installs" and "Price" columns contain characters that prevent us from converting them to a numeric type, so these need to be removed first. The "Size" column also needs to be cleaned.<br>


In [5]:
print(apps['Installs'].unique())
print(apps['Price'].unique())
print(apps['Size'].unique())

['10,000+' '500,000+' '5,000,000+' '50,000,000+' '100,000+' '50,000+'
 '1,000,000+' '10,000,000+' '5,000+' '100,000,000+' '1,000,000,000+'
 '1,000+' '500,000,000+' '50+' '100+' '500+' '10+' '1+' '5+' '0+' '0'
 'Free']
['0' '$4.99' '$3.99' '$6.99' '$1.49' '$2.99' '$7.99' '$5.99' '$3.49'
 '$1.99' '$9.99' '$7.49' '$0.99' '$9.00' '$5.49' '$10.00' '$24.99'
 '$11.99' '$79.99' '$16.99' '$14.99' '$1.00' '$29.99' '$12.99' '$2.49'
 '$10.99' '$1.50' '$19.99' '$15.99' '$33.99' '$74.99' '$39.99' '$3.95'
 '$4.49' '$1.70' '$8.99' '$2.00' '$3.88' '$25.99' '$399.99' '$17.99'
 '$400.00' '$3.02' '$1.76' '$4.84' '$4.77' '$1.61' '$2.50' '$1.59' '$6.49'
 '$1.29' '$5.00' '$13.99' '$299.99' '$379.99' '$37.99' '$18.99' '$389.99'
 '$19.90' '$8.49' '$1.75' '$14.00' '$4.85' '$46.99' '$109.99' '$154.99'
 '$3.08' '$2.59' '$4.80' '$1.96' '$19.40' '$3.90' '$4.59' '$15.46' '$3.04'
 '$4.29' '$2.60' '$3.28' '$4.60' '$28.99' '$2.95' '$2.90' '$1.97'
 '$200.00' '$89.99' '$2.56' '$30.99' '$3.61' '$394.99' '$1.26' 'Everyone'

Above we can see that we need to remove all special characters such as dollar signs and plus signs. Additionally, two words need to be handled: in "Price", the word "Everyone" will be changed to NaN, since it does not indicate whether the app is free or something else, and the same will be done with the word "Free" in "Installs". The "Size" column shows the size of an application in both megabytes (values ending with M) and kilobytes (values ending with k), so this also needs to be taken care of.
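
One possible way to do this cleanup, shown here only as a sketch on made-up values, is to strip the special characters with a single regular expression and let `pd.to_numeric` with `errors="coerce"` turn the leftover words into NaN. This is an alternative to a character-by-character loop, not the method used in this notebook; the `raw` frame is hypothetical.

```python
import pandas as pd

# Made-up sample mirroring the raw formats seen above.
raw = pd.DataFrame({
    "Installs": ["10,000+", "1,000,000+", "0", "Free"],
    "Price": ["0", "$4.99", "$399.99", "Everyone"],
})

# Strip "$", "," and "+" in one pass; inside [...] these characters are literal.
installs = raw["Installs"].str.replace(r"[$,+]", "", regex=True)
price = raw["Price"].str.replace(r"[$,+]", "", regex=True)

# errors="coerce" turns anything non-numeric ("Free", "Everyone") into NaN.
raw["Installs"] = pd.to_numeric(installs, errors="coerce")
raw["Price"] = pd.to_numeric(price, errors="coerce")

print(raw)
print(raw.dtypes)
```

The trade-off is that `errors="coerce"` silently converts every unparseable value to NaN, so it is worth checking the unique values first, as done above.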

In [8]:
print(apps['Size'].value_counts())

Varies with device    1227
11M                    182
12M                    181
13M                    177
14M                    177
                      ... 
430k                     1
429k                     1
200k                     1
460k                     1
619k                     1
Name: Size, Length: 462, dtype: int64


Additionally, "Size" has the value "Varies with device". It accounts for 1,227 records, which is too many to simply delete, yet it would be useful to do some analysis with app sizes. To get around that we will have two dataframes: a full one without sizes and a reduced one with sizes.
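
The two-dataframe idea could look roughly like this. It is a sketch on made-up rows, and the names `apps_full` and `apps_with_size` are hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Size": ["19M", "Varies with device", "201k"],
    "Rating": [4.1, 3.9, 4.5],
})
toy["Size"] = toy["Size"].replace("Varies with device", np.nan)

apps_full = toy.drop(columns="Size")           # all apps, no Size column
apps_with_size = toy.dropna(subset=["Size"])   # only apps with a known size

print(apps_full.shape, apps_with_size.shape)
```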

In [5]:
#Below we remove the special characters "$", "," and "+" from the Installs, Price and Size columns
char_remove = ['$', ',', '+']
col_cl = ['Installs', 'Price', 'Size']

for col in col_cl:
    for char in char_remove:
        apps[col] = apps[col].apply(lambda x: x.replace(char, ''))

#To later convert price to float type we need to fill empty cells with 0's or NaN's
#apps = apps.replace('', np.nan, regex= True)
#Changing "Free" to nan
apps['Installs'] = apps['Installs'].replace('Free', np.nan)
#Changing word "everyone" to nan
apps['Price'] = apps['Price'].replace('Everyone', np.nan)
#Changing "Varies with device" to nan
apps['Size'] = apps['Size'].replace('Varies with device', np.nan)

print(apps.info)

<bound method DataFrame.info of                                                      App             Category  \
0         Photo Editor & Candy Camera & Grid & ScrapBook       ART_AND_DESIGN   
1                                    Coloring book moana       ART_AND_DESIGN   
2      U Launcher Lite – FREE Live Cool Themes, Hide ...       ART_AND_DESIGN   
3                                  Sketch - Draw & Paint       ART_AND_DESIGN   
4                  Pixel Draw - Number Art Coloring Book       ART_AND_DESIGN   
...                                                  ...                  ...   
10836                                   Sya9a Maroc - FR               FAMILY   
10837                   Fr. Mike Schmitz Audio Teachings               FAMILY   
10838                             Parkinson Exercices FR              MEDICAL   
10839                      The SCP Foundation DB fr nn5n  BOOKS_AND_REFERENCE   
10840      iHoroscope - 2018 Daily Horoscope & Astrology            LIFESTYLE

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  apps[col] = apps[col].apply(lambda x: x.replace(char, ''))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  apps['Installs'] = apps['Installs'].replace('Free', np.nan)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  apps['Price'] = apps['Price'].replace('Everyone', np.nan)
A value is trying to be set 
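
The warning above is a `SettingWithCopyWarning`: `apps` was derived from `apps1`, and pandas cannot tell whether the assignment is meant to affect the original frame. A common way to avoid it, sketched here on made-up data, is to take an explicit copy right after de-duplicating (the names `toy` and `dedup` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"App": ["A", "B", "A"], "Price": ["$1.99", "0", "$1.99"]})

# .copy() makes the de-duplicated frame an independent object,
# so later column assignments are unambiguous.
dedup = toy.drop_duplicates("App").copy()
dedup["Price"] = dedup["Price"].str.replace("$", "", regex=False)

print(dedup)
print(toy)  # the original frame still holds the "$" signs
```

In this notebook the equivalent change would presumably be `apps = apps1.drop_duplicates('App').copy()`.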

In [6]:
print(apps['Installs'].unique())
print(apps['Price'].unique())
print(apps['Size'].unique())

['10000' '500000' '5000000' '50000000' '100000' '50000' '1000000'
 '10000000' '5000' '100000000' '1000000000' '1000' '500000000' '50' '100'
 '500' '10' '1' '5' '0' nan]
['0' '4.99' '3.99' '6.99' '1.49' '2.99' '7.99' '5.99' '3.49' '1.99' '9.99'
 '7.49' '0.99' '9.00' '5.49' '10.00' '24.99' '11.99' '79.99' '16.99'
 '14.99' '1.00' '29.99' '12.99' '2.49' '10.99' '1.50' '19.99' '15.99'
 '33.99' '74.99' '39.99' '3.95' '4.49' '1.70' '8.99' '2.00' '3.88' '25.99'
 '399.99' '17.99' '400.00' '3.02' '1.76' '4.84' '4.77' '1.61' '2.50'
 '1.59' '6.49' '1.29' '5.00' '13.99' '299.99' '379.99' '37.99' '18.99'
 '389.99' '19.90' '8.49' '1.75' '14.00' '4.85' '46.99' '109.99' '154.99'
 '3.08' '2.59' '4.80' '1.96' '19.40' '3.90' '4.59' '15.46' '3.04' '4.29'
 '2.60' '3.28' '4.60' '28.99' '2.95' '2.90' '1.97' '200.00' '89.99' '2.56'
 '30.99' '3.61' '394.99' '1.26' nan '1.20' '1.04']
['19M' '14M' '8.7M' '25M' '2.8M' '5.6M' '29M' '33M' '3.1M' '28M' '12M'
 '20M' '21M' '37M' '2.7M' '5.5M' '17M' '39M' '31M' '4.2M' '

Now we need to fix the Size column. For that we will turn the column into a new dataframe, then extract the last letter, which stands for megabytes or kilobytes, into a second column. After that we delete the letters from the Size column and convert it to float type. Lastly, using the second column, we divide the values that have k next to them by 1000 to unify the table and show all sizes in megabytes.
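
The steps just described can be sketched on a small made-up series as follows. The variable names are hypothetical, and the factor of 1000 follows the text rather than any documented definition of the units:

```python
import numpy as np
import pandas as pd

# Made-up sizes in the formats seen in the Size column.
size = pd.Series(["19M", "8.7M", "201k", np.nan])

unit = size.str[-1]                     # "M", "k", or NaN for missing sizes
value = size.str[:-1].astype("float")   # numeric part

# Keep megabyte values as they are; divide kilobyte values by 1000.
size_mb = value.where(unit != "k", value / 1000)

print(size_mb)
```

`Series.where` keeps the original value where the condition holds and takes the replacement elsewhere, so only the rows with a `k` suffix are rescaled.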

In [19]:
#Strip the trailing M or k from Size and keep it in a separate column; values with k still need to be divided by 1000 so that all sizes are in megabytes

apps_siz = pd.DataFrame(apps['Size'])
apps_siz['Measure'] = [str(x).strip()[-1] for x in apps['Size']]
apps_siz['Size'] = apps_siz['Size'].str[:-1].astype('float')
print(apps_siz.head())
print('The unique values are:', apps_siz['Measure'].unique())
print(apps_siz.shape)

   Size Measure
0  19.0       M
1  14.0       M
2   8.7       M
3  25.0       M
4   2.8       M
The unique values are: ['M' 'n' 'k' '0']
(9660, 2)


As can be seen, the first part is done, but we have spotted another issue. Apart from M and k we also have n and 0; the zero originally comes from a single 1000 value. In this circumstance we have two main options: <br>
* If the number of occurrences is not big, we could just delete all the rows that end in anything other than M or k. <br>
* We could assume that the 1000 value is in kilobytes, and that the values ending in "n" are in megabytes and were perhaps mistyped.
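
The first option can be sketched like this, on a made-up frame shaped like `apps_siz` (the names `toy`, `unexpected` and `cleaned` are hypothetical):

```python
import pandas as pd

# Made-up frame shaped like apps_siz: numeric Size plus the extracted suffix.
toy = pd.DataFrame({
    "Size": [19.0, 201.0, None, 100.0],
    "Measure": ["M", "k", "n", "0"],
})

# Flag rows whose suffix is neither M nor k, count them, then drop them.
unexpected = ~toy["Measure"].isin(["M", "k"])
print("Rows with unexpected suffix:", unexpected.sum())
cleaned = toy[~unexpected]
print(cleaned)
```

Counting before dropping makes it easy to judge whether deletion is acceptable or whether too many rows would be lost.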

In [17]:
print(apps_siz.groupby('Measure').count())
print('---------------')
print(apps_siz['Measure'].value_counts())

         Size
Measure      
0           1
M        8118
k         314
n           0
---------------
M    8118
n    1227
k     314
0       1
Name: Measure, dtype: int64


In [24]:
check = apps_siz[apps_siz['Measure']=='n']
check

       Size Measure
37      NaN       n
42      NaN       n
52      NaN       n
67      NaN       n
68      NaN       n
...     ...     ...
10713   NaN       n
10725   NaN       n
10765   NaN       n
10826   NaN       n


As we can see, the two results above differ. In the first stage of cleaning we replaced the size "Varies with device" with NaN. The count method excludes missing data when counting, whereas value_counts counts every label; this is also why the n appeared (we took the last character of each cell, and for missing values that is the last letter of "nan"). <br> <br>About 1.2 thousand records is too many to drop, but at the same time it would be interesting to explore the characteristics of apps given their size. We will create two dataframes: one with the size specified, where we drop the NaNs, and a full one without the Size column.
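
The difference between the two counts, and the origin of the "n" label, can be reproduced on a tiny made-up example (the data below is hypothetical):

```python
import numpy as np
import pandas as pd

s = pd.Series(["19M", np.nan, "201k", np.nan])

# str(np.nan) is the text "nan", so its last character is "n".
measure = pd.Series([str(x).strip()[-1] for x in s])
toy = pd.DataFrame({"Size": s, "Measure": measure})

# count() skips missing Size values, so the "n" group reports 0.
print(toy.groupby("Measure")["Size"].count())
# value_counts() counts the Measure labels themselves, so "n" shows up twice.
print(toy["Measure"].value_counts())
```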


In [9]:
#apps_siz2 = np.where(apps_siz['Measure']=='k', apps_siz['Size']/1000, apps_siz['Size'])
#apps_siz2
#print(apps_siz['Measure'].unique())

In [10]:
#apps['Size'] = apps['Size'].astype('float')
#apps['Installs'] = apps['Installs'].astype('float')
#apps['Price'] = apps['Price'].astype('float')

#apps.dtypes

In [None]:
apps
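#Apps with a known size ("Varies with device" was replaced with NaN earlier)
apps_w_size = apps.dropna(subset=['Size'])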

In [11]:
#plotting numbers by category