# GOOGLE PLAY STORE - Transforming Raw to Clean Data
_______________________

* Consideration: source data was scraped from the web

## Objectives:

* Create a cleaned-up version of the Google Play Store source data by:

 - Filtering out apps with no reviews
 - Removing duplicates
 - Converting ratings, reviews, installs, and price to a uniform type and format per column
 

* Then make sure there are no duplicate app names and no double counting or aggregation: organize by app, remove exact duplicates, and where two entries for the same app differ, keep the one with the higher review count


* The final product should be a cleaned Google Play Store (GPS) source dataset that we'll use to create charts




In [1]:
# Import Dependencies
%matplotlib notebook
import os 
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Load the original CSV into a DataFrame
gps_sourcedata_df = pd.read_csv("./resources/original_raw_data/googleplaystore.csv")

In [3]:
# Run to see count
gps_sourcedata_df.count()

App               10841
Category          10841
Rating             9367
Reviews           10841
Size              10841
Installs          10841
Type              10840
Price             10841
Content Rating    10840
Genres            10841
Last Updated      10841
Current Ver       10833
Android Ver       10838
dtype: int64
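
The counts above differ by column because `count()` skips missing values. A minimal sketch, on a toy frame with made-up values, of listing the gaps directly with `isna().sum()`:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only, not the real dataset)
toy_df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, np.nan, 3.9],
    "Type": ["Free", "Paid", None],
})

# count() reports non-null cells; isna().sum() reports the missing ones
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```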

In [4]:
# To see count and type

#type(gps_sourcedata_df.head())

# Identify Columns we want to remove and keep
gps_sourcedata_df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
5,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
6,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up
7,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
8,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up
9,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up


In [5]:
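# Sort by Reviews, descending. Reviews is still a text column here, so the order is lexicographic (9992, 999, 9975, ...)
# Earlier variant that also dropped rows with missing values, kept for reference: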
#gps_sourcedata_df = gps_sourcedata_df.sort_values(by= ["Reviews"], ascending=True).dropna(how="any")

gps_sourcedata_df = gps_sourcedata_df.sort_values(by= ["Reviews"], ascending=False)

gps_sourcedata_df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2989,GollerCepte Live Score,SPORTS,4.2,9992,31M,"1,000,000+",Free,0,Everyone,Sports,"May 23, 2018",6.5,4.1 and up
4970,Ad Block REMOVER - NEED ROOT,TOOLS,3.3,999,91k,"100,000+",Free,0,Everyone,Tools,"December 17, 2013",3.2,2.2 and up
2723,SnipSnap Coupon App,SHOPPING,4.2,9975,18M,"1,000,000+",Free,0,Everyone,Shopping,"January 22, 2018",1.4,4.3 and up
2705,SnipSnap Coupon App,SHOPPING,4.2,9975,18M,"1,000,000+",Free,0,Everyone,Shopping,"January 22, 2018",1.4,4.3 and up
3079,US Open Tennis Championships 2018,SPORTS,4.0,9971,33M,"1,000,000+",Free,0,Everyone,Sports,"June 5, 2018",7.1,5.0 and up
3229,DreamTrips,TRAVEL_AND_LOCAL,4.7,9971,22M,"500,000+",Free,0,Teen,Travel & Local,"August 6, 2018",1.28.1,5.0 and up
3049,US Open Tennis Championships 2018,SPORTS,4.0,9971,33M,"1,000,000+",Free,0,Everyone,Sports,"June 5, 2018",7.1,5.0 and up
7002,Adult Color by Number Book - Paint Mandala Pages,FAMILY,4.3,997,Varies with device,"100,000+",Free,0,Everyone,Entertainment,"June 27, 2018",2.4,4.1 and up
6724,BSPlayer ARMv7 VFP CPU support,VIDEO_PLAYERS,4.3,9966,5.5M,"1,000,000+",Free,0,Everyone,Video Players & Editors,"March 31, 2017",1.23,2.1 and up
7982,"Easy Resume Builder, Resume help, Curriculum v...",TOOLS,4.3,996,10M,"50,000+",Free,0,Everyone,Tools,"September 28, 2017",2.3,4.0.3 and up
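
The order above (9992, 999, 9975, ...) is what a descending sort produces when `Reviews` is still stored as text. A small sketch, using a toy Series with illustrative values, of how converting with `pd.to_numeric` first changes the order:

```python
import pandas as pd

# Review counts stored as text, as in the raw CSV (toy values)
reviews_text = pd.Series(["9992", "999", "9975", "87510"])

# Sorting strings compares character by character, so "9992" outranks "87510"
text_order = reviews_text.sort_values(ascending=False).tolist()

# Converting to numbers first gives the intended ordering
numeric_order = pd.to_numeric(reviews_text).sort_values(ascending=False).tolist()

print(text_order)
print(numeric_order)
```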


In [6]:
gps_sourcedata_df.count()

App               10841
Category          10841
Rating             9367
Reviews           10841
Size              10841
Installs          10841
Type              10840
Price             10841
Content Rating    10840
Genres            10841
Last Updated      10841
Current Ver       10833
Android Ver       10838
dtype: int64

### Only run the `.drop` call once. If you have to restart the kernel, uncomment it and run it again.

### Running it twice raises an error because there is nothing left to drop.

#### It may take a few tries. Display `gps_sourcedata_df` a few times to make sure the columns are gone.
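
One way to sidestep the run-once restriction is to make the drop tolerant of columns that are already gone. A sketch on a toy frame (values are illustrative), using the `errors="ignore"` option of `DataFrame.drop`:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "App": ["A", "B"],
    "Size": ["19M", "14M"],
    "Current Ver": ["1.0", "2.0"],
    "Android Ver": ["4.0 and up", "4.2 and up"],
})
to_drop = ["Android Ver", "Current Ver", "Size"]

# errors="ignore" skips labels that no longer exist, so re-running is harmless
toy_df = toy_df.drop(columns=to_drop, errors="ignore")
toy_df = toy_df.drop(columns=to_drop, errors="ignore")  # second run: no KeyError

print(list(toy_df.columns))
```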

In [7]:
# create a list to drop unwanted columns and store into a new dataframe


#to_drop =['Android Ver', 'Current Ver', 'Size']

#gps_sourcedata_df.drop(to_drop, inplace=True, axis=1)
gps_sourcedata_df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2989,GollerCepte Live Score,SPORTS,4.2,9992,31M,"1,000,000+",Free,0,Everyone,Sports,"May 23, 2018",6.5,4.1 and up
4970,Ad Block REMOVER - NEED ROOT,TOOLS,3.3,999,91k,"100,000+",Free,0,Everyone,Tools,"December 17, 2013",3.2,2.2 and up
2723,SnipSnap Coupon App,SHOPPING,4.2,9975,18M,"1,000,000+",Free,0,Everyone,Shopping,"January 22, 2018",1.4,4.3 and up
2705,SnipSnap Coupon App,SHOPPING,4.2,9975,18M,"1,000,000+",Free,0,Everyone,Shopping,"January 22, 2018",1.4,4.3 and up
3079,US Open Tennis Championships 2018,SPORTS,4.0,9971,33M,"1,000,000+",Free,0,Everyone,Sports,"June 5, 2018",7.1,5.0 and up
3229,DreamTrips,TRAVEL_AND_LOCAL,4.7,9971,22M,"500,000+",Free,0,Teen,Travel & Local,"August 6, 2018",1.28.1,5.0 and up
3049,US Open Tennis Championships 2018,SPORTS,4.0,9971,33M,"1,000,000+",Free,0,Everyone,Sports,"June 5, 2018",7.1,5.0 and up
7002,Adult Color by Number Book - Paint Mandala Pages,FAMILY,4.3,997,Varies with device,"100,000+",Free,0,Everyone,Entertainment,"June 27, 2018",2.4,4.1 and up
6724,BSPlayer ARMv7 VFP CPU support,VIDEO_PLAYERS,4.3,9966,5.5M,"1,000,000+",Free,0,Everyone,Video Players & Editors,"March 31, 2017",1.23,2.1 and up
7982,"Easy Resume Builder, Resume help, Curriculum v...",TOOLS,4.3,996,10M,"50,000+",Free,0,Everyone,Tools,"September 28, 2017",2.3,4.0.3 and up


In [8]:
gps_sourcedata_df['Reviews'].value_counts().head()
#gps_sourcedata_df['Reviews'].describe()

0    596
1    272
2    214
3    175
4    137
Name: Reviews, dtype: int64
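
The counts above show that the most common value is zero reviews, which the objectives ask us to filter out. A minimal sketch on a toy frame, assuming `Reviews` is still text at this point:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Reviews": ["0", "159", "0"],  # stored as text, as in the raw CSV
})

# Convert to numbers (unparseable cells become NaN), then keep apps with at least one review
review_counts = pd.to_numeric(toy_df["Reviews"], errors="coerce")
reviewed_df = toy_df[review_counts > 0]

print(reviewed_df)
```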

In [9]:
# Sort by Reviews, descending (still a text sort, because Reviews has not been converted yet)

gps_sourcedata_df = gps_sourcedata_df.sort_values(['Reviews'], ascending=False)
#gps_sourcedata_df.max()
gps_sourcedata_df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2989,GollerCepte Live Score,SPORTS,4.2,9992,31M,"1,000,000+",Free,0,Everyone,Sports,"May 23, 2018",6.5,4.1 and up
4970,Ad Block REMOVER - NEED ROOT,TOOLS,3.3,999,91k,"100,000+",Free,0,Everyone,Tools,"December 17, 2013",3.2,2.2 and up
2723,SnipSnap Coupon App,SHOPPING,4.2,9975,18M,"1,000,000+",Free,0,Everyone,Shopping,"January 22, 2018",1.4,4.3 and up
2705,SnipSnap Coupon App,SHOPPING,4.2,9975,18M,"1,000,000+",Free,0,Everyone,Shopping,"January 22, 2018",1.4,4.3 and up
3079,US Open Tennis Championships 2018,SPORTS,4.0,9971,33M,"1,000,000+",Free,0,Everyone,Sports,"June 5, 2018",7.1,5.0 and up
3229,DreamTrips,TRAVEL_AND_LOCAL,4.7,9971,22M,"500,000+",Free,0,Teen,Travel & Local,"August 6, 2018",1.28.1,5.0 and up
3049,US Open Tennis Championships 2018,SPORTS,4.0,9971,33M,"1,000,000+",Free,0,Everyone,Sports,"June 5, 2018",7.1,5.0 and up
7002,Adult Color by Number Book - Paint Mandala Pages,FAMILY,4.3,997,Varies with device,"100,000+",Free,0,Everyone,Entertainment,"June 27, 2018",2.4,4.1 and up
6724,BSPlayer ARMv7 VFP CPU support,VIDEO_PLAYERS,4.3,9966,5.5M,"1,000,000+",Free,0,Everyone,Video Players & Editors,"March 31, 2017",1.23,2.1 and up
7982,"Easy Resume Builder, Resume help, Curriculum v...",TOOLS,4.3,996,10M,"50,000+",Free,0,Everyone,Tools,"September 28, 2017",2.3,4.0.3 and up


In [10]:
gps_sourcedata_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10841 entries, 2989 to 4177
Data columns (total 13 columns):
App               10841 non-null object
Category          10841 non-null object
Rating            9367 non-null float64
Reviews           10841 non-null object
Size              10841 non-null object
Installs          10841 non-null object
Type              10840 non-null object
Price             10841 non-null object
Content Rating    10840 non-null object
Genres            10841 non-null object
Last Updated      10841 non-null object
Current Ver       10833 non-null object
Android Ver       10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.2+ MB


In [11]:


#create a list to drop unwanted columns and store into a new dataframe
#only run the drop once
to_drop =['Android Ver', 'Current Ver', 'Size']

gps_sourcedata_df = gps_sourcedata_df.drop(columns=to_drop)
#gps_cleaning_df

#gps_sourcedata_df.count()
gps_sourcedata_df

Unnamed: 0,App,Category,Rating,Reviews,Installs,Type,Price,Content Rating,Genres,Last Updated
2989,GollerCepte Live Score,SPORTS,4.2,9992,"1,000,000+",Free,0,Everyone,Sports,"May 23, 2018"
4970,Ad Block REMOVER - NEED ROOT,TOOLS,3.3,999,"100,000+",Free,0,Everyone,Tools,"December 17, 2013"
2723,SnipSnap Coupon App,SHOPPING,4.2,9975,"1,000,000+",Free,0,Everyone,Shopping,"January 22, 2018"
2705,SnipSnap Coupon App,SHOPPING,4.2,9975,"1,000,000+",Free,0,Everyone,Shopping,"January 22, 2018"
3079,US Open Tennis Championships 2018,SPORTS,4.0,9971,"1,000,000+",Free,0,Everyone,Sports,"June 5, 2018"
3229,DreamTrips,TRAVEL_AND_LOCAL,4.7,9971,"500,000+",Free,0,Teen,Travel & Local,"August 6, 2018"
3049,US Open Tennis Championships 2018,SPORTS,4.0,9971,"1,000,000+",Free,0,Everyone,Sports,"June 5, 2018"
7002,Adult Color by Number Book - Paint Mandala Pages,FAMILY,4.3,997,"100,000+",Free,0,Everyone,Entertainment,"June 27, 2018"
6724,BSPlayer ARMv7 VFP CPU support,VIDEO_PLAYERS,4.3,9966,"1,000,000+",Free,0,Everyone,Video Players & Editors,"March 31, 2017"
7982,"Easy Resume Builder, Resume help, Curriculum v...",TOOLS,4.3,996,"50,000+",Free,0,Everyone,Tools,"September 28, 2017"


In [12]:
#gps_sourcedata_df['App'].value_counts()
gps_sourcedata_df['App'].unique()

array(['GollerCepte Live Score', 'Ad Block REMOVER - NEED ROOT',
       'SnipSnap Coupon App', ..., 'Toronto Dating',
       'i miss you quotes and photos', 'G-NetReport Pro'], dtype=object)
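
`unique()` lists the distinct names but does not show which ones repeat. A sketch, on a toy frame with illustrative rows, of flagging every copy of a repeated app name with `duplicated(keep=False)`:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "App": ["SnipSnap Coupon App", "DreamTrips", "SnipSnap Coupon App"],
    "Reviews": [9975, 9971, 9975],
})

# keep=False flags every copy of a repeated name, not just the later ones
repeated = toy_df[toy_df["App"].duplicated(keep=False)]

print(toy_df["App"].nunique(), "unique names out of", len(toy_df))
print(repeated)
```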

In [13]:
# Clean up Installs: strip the trailing '+' sign (rstrip leaves values without a '+' untouched)
# Khaled's tip for next time: just remove the commas between the digits as well

gps_sourcedata_df['Installs'] = gps_sourcedata_df['Installs'].str.rstrip('+')


gps_sourcedata_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10841 entries, 2989 to 4177
Data columns (total 10 columns):
App               10841 non-null object
Category          10841 non-null object
Rating            9367 non-null float64
Reviews           10841 non-null object
Installs          10841 non-null object
Type              10840 non-null object
Price             10841 non-null object
Content Rating    10840 non-null object
Genres            10841 non-null object
Last Updated      10841 non-null object
dtypes: float64(1), object(9)
memory usage: 931.6+ KB
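
When removing the `+` suffix, slicing off the last character and `str.rstrip("+")` behave differently for values that lack the sign. A sketch on toy values:

```python
import pandas as pd

installs = pd.Series(["10,000+", "500,000+", "0"])

# Slicing drops the last character unconditionally: "0" becomes ""
sliced = installs.map(lambda x: str(x)[:-1]).tolist()

# rstrip only removes trailing '+' characters, leaving "0" intact
stripped = installs.str.rstrip("+").tolist()

print(sliced)
print(stripped)
```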


In [14]:
# Drop rows with missing values, then remove the commas between digits in Installs
# (the values stay as strings; no int64 conversion happens in this cell)

gps_sourcedata_df = gps_sourcedata_df.dropna(how="any")
gps_sourcedata_df['Installs'] = [x.replace(",","") for x in gps_sourcedata_df['Installs']]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
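
The warning above is commonly triggered when a column is assigned on a frame that pandas treats as derived from another frame. A sketch, on a toy frame, of one common way to avoid it: take an explicit `.copy()` after filtering, then assign with vectorised string methods. The frame and column values here are illustrative, and whether this silences the warning in the notebook's exact pandas version is an assumption.

```python
import numpy as np
import pandas as pd

raw_df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, np.nan, 3.9],
    "Installs": ["10,000", "500", "1,000,000"],
})

# .copy() makes the filtered frame independent, so later assignments are unambiguous
clean_df = raw_df.dropna(how="any").copy()

# Remove thousands separators and convert to int64 in one vectorised step
clean_df["Installs"] = (
    clean_df["Installs"].str.replace(",", "", regex=False).astype(np.int64)
)

print(clean_df)
```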


In [15]:
# Remove commas between digits in Reviews. The astype() call below only previews the
# int64 conversion; the column itself is converted in a later cell

gps_sourcedata_df['Reviews'] = [x.replace(",","") for x in gps_sourcedata_df['Reviews']]

gps_sourcedata_df['Reviews'].astype(np.int64).head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


2989    9992
4970     999
2723    9975
2705    9975
3079    9971
Name: Reviews, dtype: int64

In [16]:
# Remove the $ sign from Price (the float64 conversion is left commented out below)

gps_sourcedata_df['Price'] = [x.replace("$","") for x in gps_sourcedata_df['Price']]

#gps_sourcedata_df['Price'] = gps_sourcedata_df['Price'].astype(np.float64)
gps_sourcedata_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,App,Category,Rating,Reviews,Installs,Type,Price,Content Rating,Genres,Last Updated
2989,GollerCepte Live Score,SPORTS,4.2,9992,1000000,Free,0,Everyone,Sports,"May 23, 2018"
4970,Ad Block REMOVER - NEED ROOT,TOOLS,3.3,999,100000,Free,0,Everyone,Tools,"December 17, 2013"
2723,SnipSnap Coupon App,SHOPPING,4.2,9975,1000000,Free,0,Everyone,Shopping,"January 22, 2018"
2705,SnipSnap Coupon App,SHOPPING,4.2,9975,1000000,Free,0,Everyone,Shopping,"January 22, 2018"
3079,US Open Tennis Championships 2018,SPORTS,4.0,9971,1000000,Free,0,Everyone,Sports,"June 5, 2018"
3229,DreamTrips,TRAVEL_AND_LOCAL,4.7,9971,500000,Free,0,Teen,Travel & Local,"August 6, 2018"
3049,US Open Tennis Championships 2018,SPORTS,4.0,9971,1000000,Free,0,Everyone,Sports,"June 5, 2018"
7002,Adult Color by Number Book - Paint Mandala Pages,FAMILY,4.3,997,100000,Free,0,Everyone,Entertainment,"June 27, 2018"
6724,BSPlayer ARMv7 VFP CPU support,VIDEO_PLAYERS,4.3,9966,1000000,Free,0,Everyone,Video Players & Editors,"March 31, 2017"
7982,"Easy Resume Builder, Resume help, Curriculum v...",TOOLS,4.3,996,50000,Free,0,Everyone,Tools,"September 28, 2017"
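
The `Price` column is still text after the `$` is stripped. A sketch, on toy values, of finishing the conversion with `pd.to_numeric`; `regex=False` is passed so that `$` is treated as a literal character rather than a regular-expression anchor, and `errors="coerce"` turns anything unparseable into `NaN` instead of raising:

```python
import pandas as pd

# The last value simulates a cell that landed in the wrong column (illustrative only)
prices = pd.Series(["0", "$4.99", "$0.99", "Everyone"])

# Strip the currency sign, then coerce anything non-numeric to NaN
price_numbers = pd.to_numeric(
    prices.str.replace("$", "", regex=False), errors="coerce"
)

print(price_numbers)
```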


In [17]:
# Sort Ratings - right way below:
gps_sourcedata_df = gps_sourcedata_df.sort_values(['Rating'], ascending=False)
gps_sourcedata_df.head()

#gps_sourcedata_df = gps_sourcedata_df.sort_values(['Reviews'], ascending=False)

# this is wrong: gps_sourcedata_df['Rating'].sort_values(gps_sourcedata_df['Rating'], ascending=False)

Unnamed: 0,App,Category,Rating,Reviews,Installs,Type,Price,Content Rating,Genres,Last Updated
9002,DW Security,BUSINESS,5.0,6,100,Free,0,Everyone,Business,"July 25, 2018"
2455,FoothillsVet,MEDICAL,5.0,2,50,Free,0,Everyone,Medical,"July 11, 2018"
2477,Basics of Orthopaedics,MEDICAL,5.0,1,10,Free,0,Everyone,Medical,"July 27, 2018"
7245,Overcomers CF - GA,LIFESTYLE,5.0,7,100,Free,0,Everyone,Lifestyle,"March 20, 2017"
10721,Mad Dash Fo' Cash,GAME,5.0,14,100,Free,0,Everyone,Arcade,"June 19, 2017"


In [18]:
#gps_sourcedata_df.sort_values(['Category'], ascending=True)

category_list = np.sort(gps_sourcedata_df['Category'].unique())

# for x in category_list:
#     print(x)

In [19]:
#gps_sourcedata_df = gps_sourcedata_df.sort_values(['Content Rating'], ascending=False)

#gps_sourcedata_df.sort_values(['Content Rating'], ascending=True)

category_list = np.sort(gps_sourcedata_df['Content Rating'].unique())

for x in category_list:
     print(x)

#gps_sourcedata_df.head()

Adults only 18+
Everyone
Everyone 10+
Mature 17+
Teen
Unrated


In [20]:
gps_sourcedata_df['Content Rating'] = [x.replace("Everyone 10+","Everyone") for x in gps_sourcedata_df['Content Rating']]
                                      
gps_sourcedata_df['Content Rating'].head()

9002     Everyone
2455     Everyone
2477     Everyone
7245     Everyone
10721    Everyone
Name: Content Rating, dtype: object

In [21]:
gps_sourcedata_df = gps_sourcedata_df.sort_values(['Category'], ascending=True)
x_list = gps_sourcedata_df['Category'].unique()

for x in x_list:
    print(x)

ART_AND_DESIGN
AUTO_AND_VEHICLES
BEAUTY
BOOKS_AND_REFERENCE
BUSINESS
COMICS
COMMUNICATION
DATING
EDUCATION
ENTERTAINMENT
EVENTS
FAMILY
FINANCE
FOOD_AND_DRINK
GAME
HEALTH_AND_FITNESS
HOUSE_AND_HOME
LIBRARIES_AND_DEMO
LIFESTYLE
MAPS_AND_NAVIGATION
MEDICAL
NEWS_AND_MAGAZINES
PARENTING
PERSONALIZATION
PHOTOGRAPHY
PRODUCTIVITY
SHOPPING
SOCIAL
SPORTS
TOOLS
TRAVEL_AND_LOCAL
VIDEO_PLAYERS
WEATHER


In [None]:
gps_sourcedata_df.head()

In [None]:
gps_sourcedata_df['Reviews'] = gps_sourcedata_df['Reviews'].astype(np.int64)

gps_sourcedata_df = gps_sourcedata_df.sort_values(['Reviews'], ascending=False)

gps_filterdata_df = gps_sourcedata_df.drop_duplicates(['App']).sort_values(['Reviews'], ascending=False)

gps_filterdata_df
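
The sort-then-deduplicate pattern in this cell keeps, for each app, the row with the most reviews. A sketch on a toy frame (the second "Sketch" row is invented for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "App": ["Sketch", "Sketch", "DreamTrips"],
    "Reviews": [215644, 215700, 9971],
})

# After a descending sort, the first row per app is the one with the most reviews
deduped_df = (
    toy_df.sort_values("Reviews", ascending=False)
          .drop_duplicates(subset="App", keep="first")
)

print(deduped_df)
```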

In [None]:
gps_filterdata_df.describe()

In [None]:
gps_filterdata_df.head()

In [None]:
top_quartile = np.percentile(gps_filterdata_df['Reviews'], 75)
top_quartile
# for notes. don't use.
# top_quartile = int((gps_filterdata_df['Reviews'].max()*.5))


In [None]:
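# Keep apps above the 75th percentile of reviews; .copy() makes the result an independent frame for later edits
top_quartile_data_df = gps_filterdata_df.loc[gps_filterdata_df['Reviews'] > top_quartile].copy()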
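
A sketch of the percentile filter on a toy frame, to make the cutoff concrete. The values are illustrative; `np.percentile` uses linear interpolation by default.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E"],
    "Reviews": [10, 20, 30, 40, 1000],
})

# 75th percentile of the review counts (linear interpolation by default)
threshold = np.percentile(toy_df["Reviews"], 75)

# Strictly greater than the threshold, so apps exactly at the cutoff are excluded
top_df = toy_df.loc[toy_df["Reviews"] > threshold].copy()

print(threshold)
print(top_df)
```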

In [None]:
top_quartile_data_df.head(20)

In [None]:
top_quartile_data_df['Category'].unique()

top_quartile_data_df = top_quartile_data_df.sort_values(['Category'], ascending=True)
x_list = top_quartile_data_df['Category'].unique()

for x in x_list:
    print(x)

### Clean up the Category column to set up pie and bar charts:

In [None]:

# Map raw Play Store categories onto broader chart groups.
# Keys are the exact category names printed earlier; Series.replace with a dict
# matches whole values, so "AUTO_AND_VEHICLES" cannot be partially rewritten.
# Categories not listed here (e.g. PHOTOGRAPHY, SHOPPING) are left unchanged.
category_map = {
    "FINANCE": "Business",
    "LIBRARIES_AND_DEMO": "Education",
    "GAME": "Games",
    "COMICS": "Games",
    "HEALTH_AND_FITNESS": "Health and Fitness",
    "MEDICAL": "Health and Fitness",
    "DATING": "Life Style",
    "BEAUTY": "Life Style",
    "PARENTING": "Life Style",
    "LIFESTYLE": "Life Style",
    "BOOKS_AND_REFERENCE": "Productivity",
    "TRAVEL_AND_LOCAL": "Productivity",
    "AUTO_AND_VEHICLES": "Productivity",
    "PRODUCTIVITY": "Productivity",
    "FOOD_AND_DRINK": "Social Networking",
    "FAMILY": "Social Networking",
    "ENTERTAINMENT": "Social Networking",
    "COMMUNICATION": "Social Networking",
    "NEWS_AND_MAGAZINES": "Social Networking",
    "PERSONALIZATION": "Social Networking",
    "SOCIAL": "Social Networking",
    "SPORTS": "Sports",
    "VIDEO_PLAYERS": "Utility",
    "MAPS_AND_NAVIGATION": "Utility",
    "TOOLS": "Utility",
    "WEATHER": "Weather",
}

top_quartile_data_df["Category"] = top_quartile_data_df["Category"].replace(category_map)

top_quartile_data_df
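
Python's `str.replace` substitutes substrings, whereas `Series.replace` with a dict matches whole values. A sketch on toy values of how the two differ when one category name is contained in another:

```python
import pandas as pd

categories = pd.Series(["AUTO_AND_VEHICLES", "GAME", "TOOLS"])

# Substring replacement leaves the trailing "S" behind
substring_result = [x.replace("AUTO_AND_VEHICLE", "Productivity") for x in categories]

# Dict-based Series.replace only swaps values that match a key exactly
exact_result = categories.replace(
    {"AUTO_AND_VEHICLES": "Productivity", "GAME": "Games", "TOOLS": "Utility"}
).tolist()

print(substring_result)
print(exact_result)
```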

In [None]:
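# Tidy any "ProductivityS" left behind by a partial match on "AUTO_AND_VEHICLES"; a no-op if none exist
top_quartile_data_df["Category"] = [x.replace("ProductivityS", "Productivity") for x in top_quartile_data_df["Category"]]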
top_quartile_data_df

In [None]:
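# NOTE: this cell appears to be a template carried over from a Pyber fares chart (see the output filename).
# urb_fares, rural_fares and suburb_fares are not defined in this notebook and must be
# replaced with Play Store values before the cell will run.
plt.figure(2, figsize=(6,8))
values_list = [urb_fares, rural_fares, suburb_fares]
labels = ['Urban', 'Rural', 'Suburban']
colors = ['lightcoral', 'gold', 'lightskyblue']
explode = (0.1, 0, 0)

# Draw the pie once from values_list
plt.pie(values_list, labels=labels, colors=colors, explode=explode, autopct="%1.1f%%", shadow=True, startangle=270)

plt.axis('equal')
plt.title('% of Total Fares by City Type')

# Save before show() so the file is written while the figure is still populated
plt.savefig('./Pyber_TotalFares_CityType.png')
plt.show()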
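
For the Play Store data, the pie would be driven by category counts rather than fares. A minimal sketch under stated assumptions: the toy frame, the chart title, and the output filename `apps_by_category.png` are all hypothetical, and the `Agg` backend is selected only so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame standing in for the cleaned, regrouped data (illustrative values)
toy_df = pd.DataFrame({
    "Category": ["Games", "Games", "Utility", "Sports", "Games", "Utility"],
})

# One slice per category, sized by the number of apps in it
category_counts = toy_df["Category"].value_counts()

fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(category_counts.values, labels=category_counts.index.tolist(),
       autopct="%1.1f%%", startangle=270)
ax.axis("equal")
ax.set_title("% of Apps by Category")

# Hypothetical output path; save before closing the figure
fig.savefig("apps_by_category.png")
plt.close(fig)
```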

In [None]:
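# Slice out the important data: sort by Installs and Rating and drop incomplete rows to condense the dataset
# NOTE: data_file_pd is not defined in this notebook; substitute the loaded DataFrame (e.g. gps_sourcedata_df)


data_file_pd.sort_values( by= ["Installs","Rating"], ascending=False).dropna(how="any")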

In [None]:
#  get top 20 apps 
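
A sketch of one way to pull the top apps, assuming `Installs` and `Rating` have already been converted to numbers. The toy frame and the `top_n` name are illustrative.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Installs": [1000000, 500000, 1000000, 100],
    "Rating": [4.2, 4.7, 4.5, 5.0],
})

# Rank by installs first, then by rating to break ties; head(n) keeps the top n
top_n = 2  # the notes call for 20; 2 keeps the toy output short
top_apps = toy_df.sort_values(["Installs", "Rating"], ascending=False).head(top_n)

print(top_apps)
```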

In [None]:
# Target chart categories:
#   business, education, games, health_fitness, lifestyle, photo_video,
#   productivity, social networking, sports, travel, utility, weather

# NOTE: template placeholders; urb_fares, rural_fares and suburb_fares are not defined in this notebook
plt.figure(1, figsize=(6,6))
fare_city_list = [urb_fares, rural_fares, suburb_fares] # rename: with all categories
labels = ['Urban', 'Rural', 'Suburban'] # list of all category names
colors = ['lightcoral', 'gold', 'lightskyblue']
explode = (0.1, 0, 0)

# Draw the pie once; looping over category_list would only redraw the same chart
plt.pie(fare_city_list, labels=labels, colors=colors, explode=explode, autopct="%1.1f%%", shadow=True, startangle=270)

plt.axis('equal')
plt.title('% of Total Fares by City Type')

# Save before show() so the file is written while the figure is still populated
plt.savefig('./Pyber_TotalFares_CityType.png')
plt.show()

In [None]:
data_file_pd.drop_duplicates(["App"], keep="first")

In [None]:
data_file_pd.sort_values( by= ["Installs","Rating"], ascending=False).dropna(how="any")

In [None]:
data_file_pd.sort_values( by= ["Installs","Rating"], ascending=False).dropna(how="any")

data_file_pd.drop_duplicates(["App"], keep="first")

* Top 20 for installs and ratings: easy
* Remove the `+` suffix from Installs during data scrubbing so we can chart without issues
* List of 4 apps to check success
* Factor in categories when working out how we can compare the two
* Get rid of duplicates: sort by app name and omit rows where the app name is flagged as a duplicate
* Remove rows and append the findings to a clean, sorted DataFrame
* Investigate unique instances and removal