<p><img src="https://www.pngitem.com/pimgs/m/114-1143971_google-play-logo-google-play-logo-transparent-background.png"></p>

A small project focused on apps available on Google Play.<br>
The data was downloaded from [Kaggle](https://www.kaggle.com/datasets/lava18/google-play-store-apps).

#### 1. Importing and inspecting the data

In [60]:
import pandas as pd
import numpy as np

apps1 = pd.read_csv('googleplaystore.csv')
print(apps1.sample(7))
print(apps1.shape)

                                            App             Category  Rating  \
3635                       Weather Radar Widget              WEATHER     4.3   
10388                     Field Goal Tournament               SPORTS     3.7   
9799             ES Summer Chill Theme for Free                TOOLS     4.3   
5279                               AK-47 sounds               FAMILY     3.9   
3856   GPS Speedometer - Trip Meter - Altimeter  MAPS_AND_NAVIGATION     4.3   
1521                    Aviary Effects: Classic   LIBRARIES_AND_DEMO     3.8   
5667                   loans.com.au Smart Money              FINANCE     3.5   

      Reviews  Size    Installs  Type Price Content Rating             Genres  \
3635    18194  3.2M  1,000,000+  Free     0       Everyone            Weather   
10388    2980   21M    100,000+  Free     0       Everyone             Sports   
9799      940  387k    100,000+  Free     0       Everyone              Tools   
5279        7  3.8M      1,000+  Fr

The "Installs", "Price" and "Size" columns need to be cleaned. The first two contain special characters; the last one carries the suffixes "M" and "k", which presumably indicate size in megabytes and kilobytes.

In [61]:
before_drop_dup = sum(apps1.App.duplicated())
#print(apps1[apps1.App.duplicated() == True].describe())
print('Number of duplicated apps:', before_drop_dup)

Number of duplicated apps: 1181


In [62]:
apps = apps1.drop_duplicates('App')
print('Number of unique apps:', len(apps))

Number of unique apps: 9660


#### 2. Data cleaning
As seen previously, the "Installs" and "Price" columns contain characters that prevent converting them to a numeric type, so those characters have to be removed first. The "Size" column also needs to be cleaned.<br>


In [63]:
print(apps['Installs'].unique())
print(apps['Price'].unique())
print(apps['Size'].unique())

['10,000+' '500,000+' '5,000,000+' '50,000,000+' '100,000+' '50,000+'
 '1,000,000+' '10,000,000+' '5,000+' '100,000,000+' '1,000,000,000+'
 '1,000+' '500,000,000+' '50+' '100+' '500+' '10+' '1+' '5+' '0+' '0'
 'Free']
['0' '$4.99' '$3.99' '$6.99' '$1.49' '$2.99' '$7.99' '$5.99' '$3.49'
 '$1.99' '$9.99' '$7.49' '$0.99' '$9.00' '$5.49' '$10.00' '$24.99'
 '$11.99' '$79.99' '$16.99' '$14.99' '$1.00' '$29.99' '$12.99' '$2.49'
 '$10.99' '$1.50' '$19.99' '$15.99' '$33.99' '$74.99' '$39.99' '$3.95'
 '$4.49' '$1.70' '$8.99' '$2.00' '$3.88' '$25.99' '$399.99' '$17.99'
 '$400.00' '$3.02' '$1.76' '$4.84' '$4.77' '$1.61' '$2.50' '$1.59' '$6.49'
 '$1.29' '$5.00' '$13.99' '$299.99' '$379.99' '$37.99' '$18.99' '$389.99'
 '$19.90' '$8.49' '$1.75' '$14.00' '$4.85' '$46.99' '$109.99' '$154.99'
 '$3.08' '$2.59' '$4.80' '$1.96' '$19.40' '$3.90' '$4.59' '$15.46' '$3.04'
 '$4.29' '$2.60' '$3.28' '$4.60' '$28.99' '$2.95' '$2.90' '$1.97'
 '$200.00' '$89.99' '$2.56' '$30.99' '$3.61' '$394.99' '$1.26' 'Everyone'

The output above shows that the special characters (dollar signs, commas and plus signs) have to be removed. Two stray words also need handling: "Everyone" in "Price" will be changed to NaN, since it does not indicate whether the app is free or something else, and "Free" in "Installs" will be treated the same way. Additionally, the "Size" column gives the application size in both megabytes (values ending in "M") and kilobytes (values ending in "k").
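As a rough illustration of the cleaning described above, the sketch below strips the special characters and converts the result to numbers in one pass. It runs on a handful of values copied from the `unique()` output, not on the full dataset, and the helper name `to_number` is made up for this example. It is one possible approach, not necessarily the one used in the cells that follow.

```python
import pandas as pd

# Toy values copied from the unique() output above; not the full dataset.
installs = pd.Series(['10,000+', '1,000,000+', '0', 'Free'])
price = pd.Series(['0', '$4.99', '$399.99', 'Everyone'])


def to_number(column: pd.Series) -> pd.Series:
    """Hypothetical helper: strip '$', ',' and '+' and convert to float.

    Text that is still non-numeric afterwards (e.g. 'Free', 'Everyone')
    becomes NaN because of errors='coerce'.
    """
    stripped = column.str.replace(r'[$,+]', '', regex=True)
    return pd.to_numeric(stripped, errors='coerce')


installs_num = to_number(installs)
price_num = to_number(price)
print(installs_num.tolist())
print(price_num.tolist())
```

With `errors='coerce'` the stray words do not need to be replaced one by one, at the cost of silently turning any other unexpected text into NaN as well.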

In [65]:
#Below we remove the characters '$', ',' and '+' from the "Installs" and "Price" columns
char_remove = ['$', ',', '+']
col_cl = ['Installs', 'Price']

for col in col_cl:
    for char in char_remove:
        apps[col] = apps[col].apply(lambda x: x.replace(char, ''))

#To later convert price to float type we need to fill empty cells with 0's or NaN's
#apps = apps.replace('', np.nan, regex= True)
#Changing "Free" to nan
apps['Installs'] = apps['Installs'].replace('Free', np.nan)
#Changing word "everyone" to nan
apps['Price'] = apps['Price'].replace('Everyone', np.nan)
#Changing "Varies with device" to nan
apps['Size'] = apps['Size'].replace('Varies with device', np.nan)

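#Note: without parentheses this prints the bound method (the frame repr), not the column summary; apps.info() gives the summary
print(apps.info)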

<bound method DataFrame.info of                                                      App             Category  \
0         Photo Editor & Candy Camera & Grid & ScrapBook       ART_AND_DESIGN   
1                                    Coloring book moana       ART_AND_DESIGN   
2      U Launcher Lite – FREE Live Cool Themes, Hide ...       ART_AND_DESIGN   
3                                  Sketch - Draw & Paint       ART_AND_DESIGN   
4                  Pixel Draw - Number Art Coloring Book       ART_AND_DESIGN   
...                                                  ...                  ...   
10836                                   Sya9a Maroc - FR               FAMILY   
10837                   Fr. Mike Schmitz Audio Teachings               FAMILY   
10838                             Parkinson Exercices FR              MEDICAL   
10839                      The SCP Foundation DB fr nn5n  BOOKS_AND_REFERENCE   
10840      iHoroscope - 2018 Daily Horoscope & Astrology            LIFESTYLE

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  apps[col] = apps[col].apply(lambda x: x.replace(char, ''))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  apps['Installs'] = apps['Installs'].replace('Free', np.nan)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  apps['Price'] = apps['Price'].replace('Everyone', np.nan)
A value is trying to be set 
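The warning above most likely appears because `apps` was derived from `apps1` with `drop_duplicates`, so pandas cannot tell whether the assignments are meant to affect the original frame. A minimal sketch of one common way around it, assuming an explicit `.copy()` is acceptable here, is shown on a toy frame with illustrative values:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for apps1; the values are illustrative only.
apps1 = pd.DataFrame({
    'App': ['A', 'A', 'B'],
    'Installs': ['1,000+', '1,000+', 'Free'],
})

# .copy() makes the de-duplicated frame independent of apps1,
# so later column assignments modify only `apps`.
apps = apps1.drop_duplicates('App').copy()
apps['Installs'] = apps['Installs'].replace('Free', np.nan)

print(apps)
print(apps1)  # the original frame is left unchanged
```

The extra copy costs some memory, which should be negligible for a dataset of this size.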

In [67]:
print(apps['Installs'].unique())
print(apps['Price'].unique())
print(apps['Size'].unique())

['10000' '500000' '5000000' '50000000' '100000' '50000' '1000000'
 '10000000' '5000' '100000000' '1000000000' '1000' '500000000' '50' '100'
 '500' '10' '1' '5' '0' nan]
['0' '4.99' '3.99' '6.99' '1.49' '2.99' '7.99' '5.99' '3.49' '1.99' '9.99'
 '7.49' '0.99' '9.00' '5.49' '10.00' '24.99' '11.99' '79.99' '16.99'
 '14.99' '1.00' '29.99' '12.99' '2.49' '10.99' '1.50' '19.99' '15.99'
 '33.99' '74.99' '39.99' '3.95' '4.49' '1.70' '8.99' '2.00' '3.88' '25.99'
 '399.99' '17.99' '400.00' '3.02' '1.76' '4.84' '4.77' '1.61' '2.50'
 '1.59' '6.49' '1.29' '5.00' '13.99' '299.99' '379.99' '37.99' '18.99'
 '389.99' '19.90' '8.49' '1.75' '14.00' '4.85' '46.99' '109.99' '154.99'
 '3.08' '2.59' '4.80' '1.96' '19.40' '3.90' '4.59' '15.46' '3.04' '4.29'
 '2.60' '3.28' '4.60' '28.99' '2.95' '2.90' '1.97' '200.00' '89.99' '2.56'
 '30.99' '3.61' '394.99' '1.26' nan '1.20' '1.04']
['19M' '14M' '8.7M' '25M' '2.8M' '5.6M' '29M' '33M' '3.1M' '28M' '12M'
 '20M' '21M' '37M' '2.7M' '5.5M' '17M' '39M' '31M' '4.2M' '

In [68]:
#TODO: strip the "M" and "k" suffixes from "Size" and divide the "k" values by 1000 so that all sizes are in megabytes
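One possible way to do the conversion described in the comment above is a small parsing function. This is only a sketch: the name `size_to_mb` is hypothetical, and it assumes that "M" means megabytes, "k" means kilobytes, and that 1000 kB make up one MB (the divisor mentioned in the comment).

```python
import math


def size_to_mb(size):
    """Convert a size string such as '19M' or '387k' to megabytes.

    Assumptions: 'M' = megabytes, 'k' = kilobytes, 1 MB = 1000 kB.
    Anything else (e.g. 'Varies with device', NaN) is returned as NaN.
    """
    if not isinstance(size, str):
        return math.nan
    text = size.strip()
    try:
        if text.endswith('M'):
            return float(text[:-1])
        if text.endswith('k'):
            return float(text[:-1]) / 1000
    except ValueError:
        # A suffix was present but the rest was not a number
        pass
    return math.nan


print(size_to_mb('19M'))
print(size_to_mb('387k'))
print(size_to_mb('Varies with device'))
```

On the DataFrame this could be applied with something like `apps['Size'] = apps['Size'].map(size_to_mb)`, after which the column would hold floats and NaN only.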

In [33]:
apps.dtypes

App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [41]:
#"Size" still contains the "M" and "k" suffixes, so it cannot be converted yet
#apps['Size'] = apps['Size'].astype('float')
apps['Installs'] = apps['Installs'].astype('float')
apps['Price'] = apps['Price'].astype('float')

apps.dtypes

App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs          float64
Type               object
Price             float64
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [69]:
#plotting numbers by category
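The cell above is only a stub. As a sketch of what counting apps per category might look like, `value_counts` can be used on the "Category" column. The frame below is a toy one: the category names come from the sample output earlier, but the counts are made up.

```python
import pandas as pd

# Toy frame; the rows are illustrative, not taken from the dataset.
apps = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd'],
    'Category': ['TOOLS', 'FAMILY', 'TOOLS', 'SPORTS'],
})

# Number of apps per category, largest first
per_category = apps['Category'].value_counts()
print(per_category)

# With matplotlib installed, a bar chart could follow:
# per_category.plot(kind='bar')
```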