## 1 Profitable App Profiles for the App Store and Google Play Markets

We've come a long way in these first three weeks and learned:

- The basics of programming in Python (arithmetical operations, variables, common data types, etc.)
- Lists and for loops
- Conditional statements
- Dictionaries and frequency tables
- Functions

To make learning smoother and more efficient, we learned about each of these topics in isolation. In this project, we'll learn to combine these skills to perform practical data analysis. Projects are a bit more involved compared to regular lessons, so **you should expect to spend a little more time working on them**.

For this project, **we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps**. We make our apps available on **Google Play** and the **App Store**.

We only build apps that are free to download and install, and our main source of revenue consists of **in-app ads**. This means our **revenue** for any given app is **mostly influenced by the number of users who use our app**: the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.


**Exercise**
<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ" /></left>


- To help readers gain context into your project, use the first Markdown cell of the notebook to:
  - Add a title
  - Write a short introduction where you explain in no more than two paragraphs:
    - What the project is about
    - What your goal is in this project
  - The title and the introduction are tentative at this point, so don't spend too much time here — you can come back at the end of your work to refine them.

## 2 Opening and Exploring the Data

In the previous section, we outlined that our aim is to help our developers understand what kinds of apps are likely to attract more users on Google Play and the App Store. To do this, we'll need to collect and analyze data about mobile apps available on Google Play and the App Store.

As of [September 2018](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/), there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

<left><img width="600" src="https://drive.google.com/uc?export=view&id=1_XFp1RvtqpxF1GnTEF-N0n0SbBTxzcQq" /></left>

Collecting data for over four million apps requires a significant amount of time and money, so we'll try to analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first try to see whether we can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for our goals:

  - A [data set](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data about approximately ten thousand Android apps from Google Play — the data was collected in August 2018
  - A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately seven thousand iOS apps from the App Store — the data was collected in July 2017
  
We'll start by opening and exploring these two data sets. To make it easier for you to explore them, we created a function named **explore_data()** you can use repeatedly to print rows in a more readable way.


```python
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
```

- The **explore_data()** function:
  - Takes in four parameters:
    - **dataset**, which is expected to be a **list of lists**
    - **start** and **end**, which are both expected to be **integers** and represent the starting and the ending indices of a slice from the data set
    - **rows_and_columns**, which is expected to be a **Boolean** and has **False** as a default argument.
  - Slices the data set using **dataset[start:end]**
  - Loops through the slice, and for each iteration, it prints a row and adds a new line after that row using **print('\n')**
    - The **\n** in **print('\n')** is a special character and won't be printed. Instead, the **\n** character adds a new line, and we use **print('\n')** to add some blank space between rows.
  - Prints the number of rows and columns if **rows_and_columns** is **True**
    - **dataset** shouldn't have a header row, otherwise the function will print the wrong number of rows (one more row compared to the actual length)
    
To help you better understand what **print('\n')** does, below we printed three rows from the **AppleStore.csv** data set. In the first code cell, we don't use **print('\n')** between rows, while in the second one we do:

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1XMdtF6pGYv1osQL24aP1Ay7hmD1X2LNi" /></left>

Now let's open the two data sets and explore them.


**Exercise**
<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ" /></left>


- Open the two data sets we mentioned above, and save both as **lists of lists**.
  - The App Store data set is stored in a CSV file named **AppleStore.csv**, and the Google Play data set is stored in a CSV file named **googleplaystore.csv**.
  - If you run into an error named **UnicodeDecodeError**, then add **encoding="utf8"** to the **open()** function (for instance, use **open('AppleStore.csv', encoding='utf8')**).
- Explore both data sets using the **explore_data()** function.
  - Print the first few rows of each data set.
  - Find the number of rows and columns of each data set (recall that the function assumes the argument for the dataset parameter doesn't have a header row).
- Print the column names and try to identify the columns that could help us with our analysis. Use the documentation of the data sets if you're having trouble understanding what a column describes. Also, add a link to the documentation for readers if you think the column names are not descriptive enough.
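
As a rough illustration of the first step, here is a minimal sketch of turning CSV content into a list of lists with the standard **csv** module. To stay self-contained it reads from an in-memory string rather than from **AppleStore.csv**, so the sample rows and the names `sample_csv`, `header`, and `rows` are made up for this example.

```python
import csv
import io

# Made-up sample standing in for a real CSV file such as AppleStore.csv.
sample_csv = "id,track_name,price\n1,App One,0.0\n2,App Two,1.99\n"

# With a real file, open('AppleStore.csv', encoding='utf8') would take
# the place of io.StringIO(sample_csv).
with io.StringIO(sample_csv) as opened_file:
    data = list(csv.reader(opened_file))

header = data[0]   # the first row holds the column names
rows = data[1:]    # the remaining rows hold the apps

print(header)
print('Number of rows:', len(rows))
print('Number of columns:', len(rows[0]))
```

Keeping the header in its own variable means the remaining rows can be passed to **explore_data()** without skewing the row count.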

## 3 Deleting Wrong Data

In the previous section, we opened the two data sets and performed a brief exploration of the data. Before beginning our analysis, we need to make sure the data we analyze is accurate, otherwise the results of our analysis will be wrong. This means that we'll need to:

- Detect inaccurate data and correct (or remove) it
- Detect duplicate data and remove the duplicates

Recall that at our company we only build apps that are free to download and install, and that are directed toward an English-speaking audience. 

This means that we'll need to:

- Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播
- Remove non-free apps

This process of preparing our data for analysis is called **data cleaning**. **Data cleaning is done before beginning the analysis**, and it includes removing or correcting wrong data, removing duplicate data, modifying the data to fit the purpose of our analysis, etc.

It's often said that data scientists spend around 80 percent of their time cleaning data, and only about 20 percent actually analyzing (cleaned) data. In this project, we'll see that this is not far from the truth.

Let's begin by detecting and deleting wrong data. For this project, **we'll guide you throughout the entire data cleaning process**. 


**Exercise**
<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ" /></left>

- The Google Play data set has a dedicated [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error for a certain row.
  - Read the discussion and find out what the index of the row is.
  - Print the row at that index to check whether it's indeed incorrect. Take into account that the user reporting the error might or might not have removed the header row, so the index number might vary.
  - If the row has an error, remove the row using the [del statement](https://docs.python.org/3/reference/simple_stmts.html?highlight=del#the-del-statement). For instance, to remove the row with the index 149 from a data set named **data** that is stored as a list of lists, you can use the code **del data[149]**.
  - **Make sure** you don't run the **del statement** more than once, otherwise you'll delete more than one row.
- Read the [discussion](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion) section for the App Store data set, and see whether you can find any reports of wrong data.
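
One possible way to make the deletion safe to re-run is to check the row before deleting it. The sketch below uses a toy data set; the rows, the `bad_index` value, and the `expected_columns` name are hypothetical and are not taken from the real data.

```python
# Toy data set without a header row; the last row is one value short.
data = [
    ['App A', 'GAME', '4.5'],
    ['App B', 'TOOLS', '4.1'],
    ['App C', '1.9'],          # malformed row
]
expected_columns = 3
bad_index = 2  # hypothetical index, as if taken from an error report

# Only delete when the row at that index really looks wrong, so running
# this cell a second time cannot remove a valid row.
if bad_index < len(data) and len(data[bad_index]) != expected_columns:
    del data[bad_index]

print(len(data))
```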



## 4 Removing Duplicate Entries: Part One

In the last step, we started the **data cleaning** process and deleted a row with incorrect data from the Google Play data set. If you explore the Google Play data set long enough, or look at the [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, you'll notice some apps have duplicate entries. For instance, Instagram has four entries.


<left><img width="500" src="https://drive.google.com/uc?export=view&id=17DGJTsFdn80lFMknGKEpjFuf98iNoyiy" /></left>

In total, there are 1,181 cases where an app occurs more than once:

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1R-wTWQpdj9gMJgbpBRC28PlPtfkwgoPx" /></left>


Above, we:

- Created two lists: one for storing the name of duplicate apps, and one for storing the name of unique apps
- Looped through the **android** data set (the Google Play data set), and for each iteration:
  - We saved the app name to a variable named **name**
  - If **name** was already in the **unique_apps** list, we appended **name** to the **duplicate_apps** list
  - Else (if **name** wasn't already in the **unique_apps** list), we appended **name** to the **unique_apps** list
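
In text form, the steps above could be sketched as follows. The **android** list here is a toy stand-in with invented review counts, and it assumes the app name sits in the first column of each row.

```python
# Toy rows standing in for the Google Play data: [app name, reviews].
android = [
    ['Instagram', '66577313'],
    ['Slack', '51507'],
    ['Instagram', '66577446'],
    ['Instagram', '66509917'],
]

duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:      # membership check on a list
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('Unique apps:', unique_apps)
```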
  
  
As a side note, you might have noticed above that we used the **in** operator to check for membership in a list. We only learned to use **in** to check for membership in dictionaries, but **in** also works with lists:

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1tFkD1kz-3DFIIYyrs-oJqYU5E-Cl4_gO" /></left>

Returning to our discussion, we don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

If you examine the rows we printed for the Instagram app, the main difference happens on the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times.

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1eafrb14zY1VINRxluU8gViMPHDCcBvUR" /></left>


We could use this information to build a criterion for removing the duplicates. The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.

We'll remove the rows in the next section. Now it's your turn to write some code and confirm the data has duplicate entries.



**Exercise**
<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ" /></left>


- Using a combination of narrative and code, explain to the reader that the Google Play data set has duplicate entries. Print a few duplicate rows to confirm.
- Count the number of duplicates using the technique we learned above.
- Explain that you won't remove the duplicates randomly. Describe the criterion you're going to use to remove the duplicates.
  - We already suggested a criterion above, but you can come up with another criterion if you want. Make sure you support your criterion with at least one argument.

## 5 Removing Duplicate Entries: Part Two

In the previous section, we looped through the Google Play data set and found that there are 1,181 duplicates. After we remove the duplicates, we should be left with 9,659 rows:

<left><img width="400" src="https://drive.google.com/uc?export=view&id=1ITU79LAApO8-lRqUCKikKS0KJHK39_FD" /></left>

To remove the duplicates, we will:

- Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app
- Use the information stored in the dictionary to create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews)

To be able to turn the steps above into code, we'll need to use the **not in** operator. The **not in** operator is the opposite of the **in** operator. For instance, **'z' in ['a', 'b', 'c']** returns **False** because **'z'** is **not in ['a', 'b', 'c']**, but **'z' not in ['a', 'b', 'c']** returns **True** because it's true that **'z' is not in** the **list ['a', 'b', 'c']**.
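
A short sketch of both operators side by side. The `letters` and `reviews` names and the review count are made up for illustration; the dictionary lines show that a membership check on a dictionary looks only at its keys.

```python
letters = ['a', 'b', 'c']
print('z' in letters)        # False
print('z' not in letters)    # True

# On a dictionary, both operators check the keys, never the values.
reviews = {'Instagram': 66577446}   # invented value
print('Instagram' in reviews)       # True
print(66577446 in reviews)          # False
print('Slack' not in reviews)       # True
```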

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1A8Ejn3fKlyq7xXz8NJO3FVnXEY3Y7wJF" /></left>

Essentially, we use both the **in** and **not in** operators to check for membership — we want to know whether a value belongs to some group of values or not. We can also use the **not in** operator to check for membership in a dictionary. Just like in the case of the in operator, the membership check is only done over the dictionary keys:

<left><img width="500" src="https://drive.google.com/uc?export=view&id=19Iv9zOmFtf-ldAD3q_v0nMZgWFBYWDd7" /></left>

Now let's write the code to remove the duplicate entries.



**Exercise**
<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ" /></left>


- Create a dictionary where each key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
  - Start by creating an empty dictionary named **reviews_max**.
  - Loop through the Google Play data set (make sure you don't include the header row), and for each iteration:
    - Assign the app name to a variable named **name**.
    - Convert the number of reviews to float, and assign it to a variable named **n_reviews.**
    - If **name** already exists as a key in the **reviews_max** dictionary and **reviews_max[name] < n_reviews**, then update the number of reviews for that entry in the **reviews_max** dictionary.
    - If **name** is **not in** the **reviews_max** dictionary as a key, then create a new entry in the dictionary where the key is the app name, and the value is the number of reviews. Make sure you don't use an **else** clause here, otherwise the number of reviews will be incorrectly updated whenever **reviews_max[name] < n_reviews** evaluates to **False.**
  - Inspect the dictionary to make sure everything went as expected. Measure the length of the dictionary — remember that the expected length is 9,659 entries.
- Use the dictionary you created above to remove the duplicate rows:
  - Start by creating two empty lists: **android_clean** (which will store our new cleaned data set) and **already_added** (which will just store app names).
  - Loop through the Google Play data set (make sure you don't include the header row), and for each iteration:
    - Assign the app name to a variable named **name**.
    - Convert the number of reviews to **float**, and assign it to a variable named **n_reviews**.
    - If **n_reviews** is the same as the maximum number of reviews of the app **name** (the number can be found in the **reviews_max** dictionary) **and name** is not already in the list **already_added**:
      - Append the entire row to the **android_clean** list (which will eventually be a list of lists and store our cleaned data set).
      - Append the name of the app **name** to the **already_added** list. This helps us keep track of the apps we already added.
- Explore the **android_clean** data set to ensure everything went as expected. The data set should have 9,659 rows. The two steps above are a bit more involved, so make sure you use Markdown to explain to readers the steps you took.
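
The two steps can be tried out on a toy data set first. This is only a sketch under stated assumptions: the rows and review counts are invented, and the `NAME` and `REVIEWS` column positions are hypothetical and will differ in the real Google Play data.

```python
# Toy rows: [app name, number of reviews].
android = [
    ['Instagram', '66577313'],
    ['Slack', '51507'],
    ['Instagram', '66577446'],
    ['Instagram', '66577446'],   # exact duplicate of the row above
]
NAME, REVIEWS = 0, 1  # assumed column positions in this toy data

# Step 1: record the highest review count seen for each app name.
reviews_max = {}
for app in android:
    name = app[NAME]
    n_reviews = float(app[REVIEWS])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

# Step 2: keep one row per app, the one that matches the maximum.
android_clean = []
already_added = []
for app in android:
    name = app[NAME]
    n_reviews = float(app[REVIEWS])
    if reviews_max[name] == n_reviews and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

print(android_clean)
```

The **already_added** list matters when two rows share the same maximum number of reviews, as the last two Instagram rows do here: without it, both rows would be kept.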

## 6 Removing Non-English Apps: Part One

In the previous section, we managed to remove the duplicate app entries in the Google Play data set. Remember that the language we use for the apps we develop at our company is English, and we'd like to analyze only the apps that are directed toward an English-speaking audience. However, if we explore the data long enough, we'll find that both data sets have apps whose name suggests that they are not directed toward an English-speaking audience.

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1Qb99YPYkc2_V_BYHYhVfpCtEmFwleuHx" /></left>

We're not interested in keeping these kinds of apps, so we'll remove them. One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).

Behind the scenes, each character we use in a string has a corresponding number associated with it. For instance, the corresponding number for character 'a' is 97, for character 'A' is 65, and for character '爱' is 29,233. We can get the corresponding number of each character using the [ord() built-in function](https://docs.python.org/3/library/functions.html#ord).
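
A quick way to check these numbers yourself is sketched below. The `codes` dictionary is just a convenience for this example, and the **™** character is included because it appears in one of the app names used later.

```python
codes = {}
for character in ['a', 'A', '爱', '™']:
    codes[character] = ord(character)
    print(character, '->', codes[character])
```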


<left><img width="500" src="https://drive.google.com/uc?export=view&id=1cTEm9DS2OKgFGPKP5PbTikg-rHELubRo" /></left>

The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters, otherwise it doesn't.

So if an app name contains a character whose corresponding number is greater than 127, then the app probably has a non-English name. Our app names, however, are stored as strings, so how could we take each individual character of a string and check its corresponding number?

In Python, strings are indexable and iterable, which means we can use indexing to select an individual character, and we can also iterate on the string using a for loop.

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1s-jzLgQdROM4cALy_gFYt2c_MhRzVm2m" /></left>

Let's first try to write the function we talked about above, and in the next section we'll remove the rows corresponding to the non-English apps.


**Exercise**
<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ" /></left>

- Write a function that takes in a string and returns **False** if there's any character in the string that doesn't belong to the set of common English characters, otherwise it returns **True**.
  - Inside the function, iterate over the input string, and for each iteration check whether the number associated with the character is greater than 127. When it is, the function should immediately **return False** because it means the app name is probably non-English since it contains a character that doesn't belong to the set of common English characters.
  - If the loop finishes running without the return statement being executed, then it means no character had a corresponding number over 127. This means the app name is probably English, so the function should **return True**.
- Use your function to check whether these app names are detected as English or non-English:
  - 'Instagram'
  - '爱奇艺PPS -《欢乐颂2》电视剧热播'
  - 'Docs To Go™ Free Office Suite'
  - 'Instachat 😜'
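
One possible shape for such a function is sketched below. The name `is_english` is a hypothetical choice, not something the project prescribes, and the list comprehension at the end is only a compact way to collect the four results.

```python
def is_english(string):
    for character in string:
        if ord(character) > 127:
            return False   # found a character outside the ASCII range
    return True            # the loop ended without finding one

names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]
results = [is_english(name) for name in names]
print(results)
```

With this strict rule, only the first name is detected as English.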



## 7 Removing Non-English Apps: Part Two

In the previous section, we wrote a function that detects non-English app names, but we saw that the function couldn't correctly identify certain English app names like **'Docs To Go™ Free Office Suite'** and **'Instachat 😜'**. This is because emojis and some characters like **™** fall outside the ASCII range and have corresponding numbers that are over 127.

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1HLwWmLSRHXGeOF8py_mNGLsxH_VM-wIh" /></left>

If we're going to use the function we've created, we'll lose useful data since many English apps will be incorrectly labeled as non-English. To minimize the impact of data loss, **we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range**. This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.

Let's edit the function we created in the previous section, and then use it to filter out the non-English apps.


**Exercise**
<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ" /></left>


- Change the function you created in the previous section. If the input string has more than three characters that fall outside the ASCII range (0 - 127), then the function should return **False** (identify the string as non-English), otherwise it should return **True**.
- Use the new function to check whether these app names are detected as English or non-English:
  - 'Docs To Go™ Free Office Suite'
  - 'Instachat 😜'
  - '爱奇艺PPS -《欢乐颂2》电视剧热播'
- Use the new function to filter out non-English apps from both data sets. Loop through each data set, and if an app name is identified as English, then append the whole row to a separate list.
- Explore the data sets and see how many rows you have remaining for each data set.
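
A minimal sketch of the relaxed rule and of the filtering loop follows. The function name `is_english`, the `max_non_ascii` parameter, and the toy `apps` list are all hypothetical, and the sketch assumes the app name is the first element of each row.

```python
def is_english(string, max_non_ascii=3):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    # English if at most max_non_ascii characters fall outside ASCII
    return non_ascii <= max_non_ascii

# Toy rows: [app name, some other column].
apps = [
    ['Docs To Go™ Free Office Suite', 'x'],
    ['Instachat 😜', 'y'],
    ['爱奇艺PPS -《欢乐颂2》电视剧热播', 'z'],
]

english_apps = []
for app in apps:
    name = app[0]
    if is_english(name):
        english_apps.append(app)

print(len(english_apps))
```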



## 8 Isolating the Free Apps

So far in the data cleaning process, we:

- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis.

Isolating the free apps will be our last step in the data cleaning process. In the next section, we're going to start analyzing the data.

**Exercise**
<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ" /></left>


- Loop through each data set to isolate the free apps in separate lists. Make sure you identify the columns describing the app price correctly.
- After you isolate the free apps, check the length of each data set to see how many apps you have remaining.
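
The sketch below shows the general pattern on toy data. It assumes, without checking the real files, that one data set stores a free price as the string **'0'** and the other as **'0.0'**; the rows, the `PRICE` position, and the list names are made up, so verify the actual price columns and formats before reusing it.

```python
# Toy rows: [app name, price as a string].
android_english = [['App A', '0'], ['App B', '$4.99']]
ios_english = [['App C', '0.0'], ['App D', '1.99']]
PRICE = 1  # assumed position of the price column in these toy rows

android_free = []
for app in android_english:
    if app[PRICE] == '0':            # compare strings directly
        android_free.append(app)

ios_free = []
for app in ios_english:
    if float(app[PRICE]) == 0.0:     # or convert to a number first
        ios_free.append(app)

print(len(android_free), len(ios_free))
```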

## 9 Most Common Apps by Genre: Part One

So far, we spent a good amount of time on cleaning data, and:

- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps
- Isolated the free apps

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we then develop it further.
- If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our data sets.

**Exercise**
<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ" /></left>

- Give readers more context into why we want to find an app profile that fits both the App Store and Google Play. Explain our validation strategy for an app idea.
- Inspect both data sets and identify the columns you could use to generate frequency tables to find out what the most common genres in each market are.

## 10 Most Common Apps by Genre: Part Two

In the previous section, we looked at our validation strategy for an app idea, and then we inspected the data sets to identify the columns that might be useful for finding out what the most common genres in each market are. Our conclusion was that we'll need to build a frequency table for the **prime_genre** column of the App Store data set, and for the **Genres** and **Category** columns of the Google Play data set.

We'll build two functions we can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function we can use to display the percentages in a descending order

We already learned to generate frequency tables that show percentages, and we're going to build a function for that in the exercise below. However, a dictionary isn't sorted by its values, so the raw frequency table would be very difficult to analyze. We'll need to build a second function to display the entries of the frequency table in descending order.

To do that, we'll need to make use of the built-in [sorted() function](https://docs.python.org/3/library/functions.html#sorted). This function takes in an iterable data type (like a list, dictionary, tuple, etc.), and returns a list of the elements of that iterable sorted in ascending or descending order (the **reverse** parameter controls whether the order is ascending or descending).

<left><img width="500" src="https://drive.google.com/uc?export=view&id=18sPIrZDlqFpUzmFU5W5nlrs51B_uLPiS" /></left>

The **sorted()** function doesn't work too well with dictionaries because it only considers and returns the dictionary keys.

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1sFTjDWqsBX8zUvJIYbo_k3kvsEgelP2u" /></left>

However, the **sorted()** function works well if we transform the dictionary into a list of tuples, where each tuple contains a dictionary key along with its corresponding dictionary value. To ensure the sorting works right, the dictionary value comes first, and the dictionary key comes second:
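
In text form, the workaround might look like the sketch below. The `freq` dictionary and its counts are invented for illustration.

```python
freq = {'Genre_1': 12, 'Genre_2': 8, 'Genre_3': 20}   # made-up counts

keys_only = sorted(freq)   # sorting a dictionary returns only its keys
print(keys_only)

# Put the value first so the tuples are ordered by value, then by key.
as_tuples = []
for key in freq:
    as_tuples.append((freq[key], key))

by_value = sorted(as_tuples, reverse=True)
print(by_value)
```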

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1gA1cdI-ptaDGTL44iKwT-LNqptMwb6JJ" /></left>


This is a bit overcomplicated to just sort a dictionary, but we'll see there are much simpler ways to do this once we learn more advanced techniques. Using the workaround above, we wrote a helper function for you named **display_table()**, which you'll be able to combine with the function you're going to write in the next exercise. The **display_table()** function you see below:

- Takes in two parameters: **dataset** and **index**. **dataset** is expected to be a **list of lists**, and **index** is expected to be an integer
- Generates a frequency table using the **freq_table()** function (which you're going to write as an exercise)
- Transforms the frequency table into a list of tuples, and then it sorts the list in a descending order
- Prints the entries of the frequency table in descending order

```python
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
```

Let's now create a function for generating frequency tables, and then use it in combination with the **display_table()** function.


**Exercise**
<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ" /></left>

- Create a function named **freq_table()** that takes in two inputs: **dataset** (which is expected to be a list of lists) and **index** (which is expected to be an integer).
  - The function should return the frequency table (as a dictionary) for any column we want. The frequencies should also be expressed as percentages.
  - We already learned how to build frequency tables in the lesson on dictionaries (lesson #03).
- Copy the **display_table()** function we wrote above, and use it to display the frequency table of the columns **prime_genre**, **Genres**, and **Category**. We'll analyze the resulting tables in the next section.
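
One possible version of **freq_table()** is sketched here on a toy data set. The rows and the `genre_table` name are made up, and your own implementation may reasonably differ.

```python
def freq_table(dataset, index):
    counts = {}
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        if value in counts:
            counts[value] += 1
        else:
            counts[value] = 1

    # Convert the raw counts to percentages of all rows.
    percentages = {}
    for key in counts:
        percentages[key] = (counts[key] / total) * 100
    return percentages

toy = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Education'],
    ['App D', 'Games'],
]
genre_table = freq_table(toy, 1)
print(genre_table)
```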

## 11 Most Common Apps by Genre: Part Three

In the previous section, we generated frequency tables for the columns **prime_genre**, **Genres**, and **Category**, and we'll now focus on analyzing these frequency tables.

Remember our data set only contains free English apps, so you should be careful not to extend your conclusions beyond the scope of free English apps. If you find that gaming apps are the most numerous among the free English apps on Google Play, this doesn't mean that we'll see the same pattern on Google Play as a whole.

**Exercise**
<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ" /></left>


- Analyze the frequency table you generated for the **prime_genre** column of the App Store data set.
  - What is the most common genre? What is the runner-up?
  - What other patterns can you see?
  - What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) or more for fun (games, entertainment, photo and video, social networking, sports, music, etc.)?
  - Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?
- Analyze the frequency table you generated for the **Category** and **Genres** columns of the Google Play data set.
  - What are the most common genres?
  - What other patterns can you see?
  - Compare the patterns you see for the Google Play market with those you saw for the App Store market.
  - Can you recommend an app profile based on what you found so far? Do the frequency tables you generated tell you what are the most frequent app genres or what genres have the most users?


## 12 Most Popular Apps by Genre on the App Store

The frequency tables we analyzed in the previous section showed us that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now, we'd like to get an idea about the kind of apps with the most users.

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the **Installs** column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the **rating_count_tot** column.

Let's start with calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to:
- Isolate the apps of each genre
- Sum up the user ratings for the apps of that genre
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps)

To calculate the average number of user ratings for each genre, we'll need to use a for loop inside of another for loop. This is an example of a for loop used inside another for loop:

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1Vx5Y0f28CzGhLCVEbBtBUEtoHcLUmaQH" /></left>

Above, we can see that:

- We first iterate over the **some_strings** list, and for each iteration:
  - We print **string** (iteration variable)
  - We start another iteration over the list **some_integers**, and for each iteration over this list:
    - We print **integer** (iteration variable)
    
We can see that for each of the two iterations over the list **some_strings** (there are two iterations because **some_strings** only has two list elements) there's another inner iteration happening over the list **some_integers**.

The second iteration over **some_strings** begins only when the iteration over **some_integers** is done completely. Notice also that all the elements of the list **some_integers** are printed for each of the two iterations over the list **some_strings**.

A loop that is inside another loop is called a **nested loop**. We'll use a nested loop to compute the averages we mentioned above.
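
The behavior described above can be reproduced with a minimal sketch like the one below. The list names **some_strings** and **some_integers** come from the example above, but the values are hypothetical, and the **printed** list is an extra helper added only to record the order of the printed values.

```python
# Hypothetical values, chosen only to illustrate the iteration order.
some_strings = ['alpha', 'beta']
some_integers = [1, 2, 3]

printed = []  # helper list that records the order in which values are printed

for string in some_strings:          # outer loop: runs twice
    print(string)
    printed.append(string)
    for integer in some_integers:    # inner (nested) loop: runs fully each time
        print(integer)
        printed.append(integer)
```

Inspecting **printed** afterwards shows that every integer appears once after each string, which is the pattern we'll rely on when computing per-genre averages.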


**Exercise**
<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ" /></left>


- Start by generating a frequency table for the **prime_genre** column to get the unique app genres (below, we'll need to loop over the unique genres). You can use the **freq_table()** function you wrote in a previous section.
- Loop over the unique genres of the App Store data set. For each iteration (below, we'll assume that the iteration variable is named **genre**):
  - Initialize a variable named **total** with a value of 0. This variable will store the sum of user ratings (the number of ratings, not the actual ratings) specific to each genre.
  - Initialize a variable named **len_genre** with a value of 0. This variable will store the number of apps specific to each genre.
  - Loop over the App Store data set, and for each iteration:
    - Save the app genre to a variable named **genre_app**.
    - If **genre_app** is the same as **genre** (the iteration variable of the main loop), then:
      - Save the number of user ratings of the app as a float.
      - Add the number of user ratings to the **total** variable.
      - Increment the **len_genre** variable by 1.
  - Compute the average number of user ratings by dividing **total** by **len_genre**. This should be done outside the nested loop.
  - Print the app genre and the average number of user ratings. This should also be done outside the nested loop.
- Analyze the results and try to come up with at least one app profile recommendation for the App Store. Note that there's no fixed answer here, and it's perfectly fine if the app profile you recommend is different from that of other students.
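
If the steps above feel abstract, here is a rough sketch of the same logic on a tiny made-up data set. The rows in **toy_ios**, their column positions, and the **avg_ratings** dictionary are all assumptions made for illustration. The real App Store data set has more columns and different index positions, and the unique genres would come from your own **freq_table()** function rather than the small dictionary built here.

```python
# Hypothetical rows: [app name, genre, total number of user ratings]
toy_ios = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Weather', '50'],
]

# Stand-in for a frequency table: we only need its keys (the unique genres).
genres = {}
for app in toy_ios:
    genres[app[1]] = genres.get(app[1], 0) + 1

avg_ratings = {}  # extra dictionary so the results can be inspected later
for genre in genres:
    total = 0       # sum of rating counts for this genre
    len_genre = 0   # number of apps in this genre
    for app in toy_ios:
        genre_app = app[1]
        if genre_app == genre:
            n_ratings = float(app[2])
            total += n_ratings
            len_genre += 1
    # Computed outside the nested loop, once per genre.
    avg_n_ratings = total / len_genre
    avg_ratings[genre] = avg_n_ratings
    print(genre, ':', avg_n_ratings)
```

Note that the division uses **len_genre**, not the length of the whole data set, so each average reflects only the apps of that genre.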


## 13 Most Popular Apps by Genre on Google Play

In the previous section, we came up with an app profile recommendation for the App Store based on the number of user ratings. We have data about the number of installs for the Google Play market, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1_g3jQoVEwPbRd7px9F9leNNqnIMA5Fbz" /></left>

This means the data is imprecise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes: we only want to find out which app genres attract the most users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from string to float. This means we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error.
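
To see why the cleanup is needed, the short sketch below tries to convert an uncleaned install string directly. The value of **raw_installs** is a hypothetical example, and **conversion_failed** is a helper flag added only for illustration.

```python
raw_installs = '100,000+'  # hypothetical install value, as stored in the data set

try:
    float(raw_installs)        # commas and plus signs are not valid in a float literal
    conversion_failed = False
except ValueError as error:
    conversion_failed = True
    print('Conversion failed:', error)
```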

To remove characters from strings, we can use the [str.replace(old, new)](https://docs.python.org/3/library/stdtypes.html?#str.replace) method (just like **list.append()** or **list.copy()**, **str.replace()** is a special kind of function called a method). **str.replace()** takes in two parameters, **old** and **new**, and replaces all occurrences of **old** within a string with **new**:

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1Hg08kLcn0XZ2JhATe81VZ-rU1PiFtjiB" /></left>

To remove certain characters we can replace them with the empty string ''.

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1ENyEkXk2sxsibuHkDSPeulJIUAQXeUZ6" /></left>

Note that we'll need to reassign to **n_installs** if we want our changes saved:

<left><img width="500" src="https://drive.google.com/uc?export=view&id=1eufsuamKxLEX_eSOH1MsJuoosLoVmMFl" /></left>
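
Putting these pieces together, a minimal sketch of the cleanup might look like the following. The starting value of **n_installs** is a hypothetical example, and **unchanged** and **n_installs_float** are helper names introduced here.

```python
n_installs = '100,000+'  # hypothetical example value

# str.replace() returns a new string; without reassignment the original is untouched.
n_installs.replace(',', '')
unchanged = n_installs  # still holds the original string

# Reassign to keep each change, replacing the characters with the empty string.
n_installs = n_installs.replace(',', '')
n_installs = n_installs.replace('+', '')

n_installs_float = float(n_installs)  # conversion now succeeds
print(n_installs_float)
```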

Now let's calculate the average number of installs per app genre for the Google Play data set. We'll need to use a nested loop, just like in the previous section.


**Exercise**
<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ" /></left>


- Start by generating a frequency table for the **Category** column of the Google Play data set to get the unique app genres (below, we'll need to loop over the unique genres). You can use the **freq_table()** function you wrote in a previous section.
- Loop over the unique genres of the Google Play data set. For each iteration (below, we'll assume that the iteration variable is named **category**):
  - Initialize a variable named **total** with a value of 0. This variable will store the sum of installs specific to each genre.
  - Initialize a variable named **len_category** with a value of 0. This variable will store the number of apps specific to each genre.
  - Loop over the Google Play data set, and for each iteration:
    - Save the app genre to a variable named **category_app**.
    - If **category_app** is the same as **category** (the iteration variable of the main loop), then:
      - Save the number of **installs**.
      - Remove any + or , character, and then convert the string to a **float**.
      - Add the number of installs to the **total** variable.
      - Increment the **len_category** variable by 1.
  - Compute the average number of installs by dividing **total** by **len_category**. This should be done outside the nested loop.
  - Print the app genre and the average number of installs. This should also be done outside the nested loop.
- Analyze the results and try to come up with at least one app profile recommendation for Google Play. Remember, our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play. Note that there's no fixed answer here, and it's perfectly fine if the app profile you recommend is different from that of other students.
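
As a rough guide, the sketch below applies these steps to a tiny made-up data set. The rows in **toy_android**, their column positions, and the **avg_installs** dictionary are assumptions for illustration only. In your notebook, the categories would come from **freq_table()** and the column indexes would match the real Google Play data set.

```python
# Hypothetical rows: [app name, category, installs]
toy_android = [
    ['App A', 'GAME', '1,000,000+'],
    ['App B', 'GAME', '500,000+'],
    ['App C', 'TOOLS', '10,000+'],
]

# Stand-in for a frequency table: we only need the unique categories.
categories = {}
for app in toy_android:
    categories[app[1]] = categories.get(app[1], 0) + 1

avg_installs = {}  # extra dictionary so the results can be inspected later
for category in categories:
    total = 0
    len_category = 0
    for app in toy_android:
        category_app = app[1]
        if category_app == category:
            n_installs = app[2]
            # Remove ',' and '+' so the string can be converted to a float.
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    # Computed outside the nested loop, once per category.
    avg_installs[category] = total / len_category
    print(category, ':', avg_installs[category])
```

Keep in mind that these averages treat values such as 1,000,000+ as exactly 1,000,000, so they are lower bounds rather than exact figures.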

## 14 Next Steps

In this project, we went through a complete data science workflow:

- We started by clarifying the goal of our project.
- We collected relevant data.
- We cleaned the data to prepare it for analysis.
- We analyzed the cleaned data.


These are a few next steps you could take (optional):

- Analyze the frequency table for the **Genres** column of the Google Play data set, and see whether you can find useful patterns.
- Assume we could also make revenue via in-app purchases and subscriptions, and try to find out which genres seem to be liked the most by users — you could examine app ratings here.