<a href="https://colab.research.google.com/github/melbow2424/Data-602-Assignment-7/blob/main/07_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Assignment 7**

# **Weeks 8 & 9 - Pandas**
* In this homework assignment, you will explore and analyze a public dataset of your choosing. Since this assignment is “open-ended” in nature, you are free to expand upon the requirements below. However, you must meet the minimum requirements as indicated in each section. 

* You must use Pandas as the **primary tool** to process your data.

* The preferred method for this analysis is in a .ipynb file. Feel free to use whichever platform you choose.  
 * https://www.youtube.com/watch?v=inN8seMm7UI (Getting started with Colab).

* Your data should need some "work", or be considered "dirty".  You must show your skills in data cleaning/wrangling.

### **Some data examples:**
•	https://www.data.gov/

•	https://opendata.cityofnewyork.us/

•	https://datasetsearch.research.google.com/

•	https://archive.ics.uci.edu/ml/index.php

### **Resources:**

•	https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html 

•	https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html


### **Headings or comments**
**You are required to make use of comments, or headings for each section.  You must explain what your code is doing, and the results of running your code.**  Act as if you were giving this assignment to your manager - you must include clear and descriptive information for each section.

### **You may work as a group or individually on this assignment.**


# Introduction

In this section, please describe the dataset you are using.  Include a link to the source of this data.  You should also provide some explanation on why you chose this dataset.

______________
# Data Exploration
Import your dataset into your .ipynb, create dataframes, and explore your data.  

Include: 

* Summary statistics: means, medians, quartiles
* Missing value information
* Any other relevant information about the dataset.  



In [18]:
import pandas as pd
import numpy as np

In [19]:
#read from csv: Play Store Apps Web scraped data of 10k Play Store apps (https://www.kaggle.com/datasets/whenamancodes/play-store-apps?select=googleplaystore.csv)
df = pd.read_csv('https://raw.githubusercontent.com/melbow2424/Data-602-Assignment-7/main/googleplaystore.csv')

In [None]:
# Using head function to make sure csv data frame was created
df.head()

In [None]:
# Summary statistics means, medians, quartiles for all the columns. 
df.describe(include='all') 

In [None]:
# Counting the NaN values in all the columns. Overview of missing value information
display(df.isnull().sum())
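
The raw counts above are easier to judge next to the share of rows they represent. The following is a minimal sketch on a small made-up frame (the `sample` data and the `missing_count` / `missing_pct` column names are assumptions for illustration, not values from the Play Store dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with a few missing entries.
sample = pd.DataFrame({
    "Rating": [4.1, np.nan, 4.7, np.nan],
    "Type": ["Free", "Free", None, "Paid"],
})

# isnull() gives a boolean frame; sum() counts the True values per column
# and mean() gives the fraction of True values per column.
missing = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": sample.isnull().mean() * 100,
})
print(missing)
```

The same two expressions should work on the full data frame by replacing `sample` with `df`.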

# Data Wrangling
Create a subset of your original data and perform the following.  

1. Modify multiple column names.

2. Look at the structure of your data – are any variables improperly coded? Such as strings or characters? Convert to correct structure if needed.

3. Fix missing and invalid values in data.

4. Create new columns based on existing columns or calculations.

5. Drop column(s) from your dataset.

6. Drop a row(s) from your dataset.

7. Sort your data based on multiple variables. 

8. Filter your data based on some condition. 

9. Convert all the string values to upper or lower cases in one column.

10. Check whether numeric values are present in a given column of your dataframe.

11. Group your dataset by one column, and get the mean, min, and max values by group. 
  * Groupby()
  * agg() or .apply()

12. Group your dataset by two columns and then sort the aggregated results within the groups. 

**You are free (and encouraged) to add on to these questions.  Please clearly indicate in your assignment your answers to these questions.**

In [None]:
# Modify multiple column names. df.rename renames specific columns in a pandas DataFrame
df.rename(columns = {'App':'Application_Name', 
                     'Size' : 'Size (M)',
                     'Content Rating':'Content_Rating', 
                     'Android Ver':'Android_Version',
                     'Current Ver':'Current_Version'}, inplace = True)

list(df)

In [None]:
# Look at the structure of your data to see if variables improperly coded
print("Data Types of The Columns in Data Frame")
display(df.dtypes)

In [None]:
"""
While trying to convert a string column to integer values, I found a row that was missing its Category value, so all of its scraped info was shifted one column to the left. 
The row is displayed below and the error was as follows: 
ValueError: Unable to parse string "3.0M" at position 10472
"""

print("Value of row at position 10472")
display(df.iloc[10472])

In [None]:
#Because of this, I will be dropping a row from the dataset here: 
#Drop a row(s) from your dataset.

df1 = df.drop(index=[10472])
# After the drop, position 10472 holds the row that used to follow the malformed one
display(df1.iloc[10472])
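
Instead of reading the bad position out of the error message, malformed values can also be located programmatically. This is a minimal sketch, assuming a small made-up `sample` frame in which `"3.0M"` stands in for the shifted row; `pd.to_numeric` with `errors="coerce"` turns unparseable strings into NaN so they can be selected:

```python
import pandas as pd

# Hypothetical miniature of the Reviews column; "3.0M" mimics the shifted row.
sample = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Reviews": ["159", "967", "3.0M", "87510"],
})

# errors="coerce" replaces strings that cannot be parsed with NaN instead of raising.
parsed = pd.to_numeric(sample["Reviews"], errors="coerce")

# Values that were present originally but failed to parse are the malformed ones.
bad_index = sample[parsed.isna() & sample["Reviews"].notna()].index.tolist()
print(bad_index)

# Drop the malformed rows, then convert the remaining values to integers.
clean = sample.drop(index=bad_index)
clean["Reviews"] = clean["Reviews"].astype(int)
```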

In [None]:
# On review of the data types, Reviews and Installs should both be integer values. These will need to be corrected. 
# Convert to correct data type structure

# Reviews 
# Convert string to an integer for column Reviews
df1['Reviews'] = df1['Reviews'].astype(int)

display(df1.dtypes)


In [None]:
# Fix invalid values in data.
# Installs  
# Installs values contain the characters + and , which need to be removed before the Installs data type can be changed 

#df2 = df1['Installs'].str.extract('(\d+)', expand=False) 

# Removing + and , from Installs (regex=False treats '+' as a literal character)
df1['Installs'] = df1['Installs'].str.replace('+','', regex=False)
df1['Installs'] = df1['Installs'].str.replace(',','', regex=False)
#Displaying it to show they were removed. 
df1.head()

In [None]:
# Installs
# Convert string to an integer for column Installs
df1['Installs'] = df1['Installs'].astype(int)

display(df1.dtypes)
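
The two-step clean-up and conversion of Installs can also be written as one chained expression. A minimal sketch, assuming a small made-up `installs` series in the same format as the Installs column:

```python
import pandas as pd

# Hypothetical values in the same format as the Installs column.
installs = pd.Series(["10,000+", "500,000+", "5,000,000+", "0"])

# The character class [+,] matches either symbol, so one regex pass removes both
# before the strings are converted to integers.
cleaned = installs.str.replace(r"[+,]", "", regex=True).astype(int)
print(cleaned.tolist())
```

Passing `regex=True` explicitly avoids depending on the default, which differs between pandas versions.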

In [None]:
#Create new columns based on existing columns or calculations.

#Wanted to see which apps had at least 100,000 downloads, thus a column named Over 100000 was created. 
#If an app has 100,000 or more downloads, a yes is placed in the column, else a no is placed in the column. 
df1['Over 100000'] = np.where(df1['Installs']>= 100000, 'yes', 'no')
df1.head()

In [None]:
#Drop column(s) from your dataset.
df1.drop('Android_Version', axis=1, inplace=True)
df1.head()

In [61]:
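#Sort your data based on multiple variables.
# Sorting by Category value in ascending order then Content_Rating in descending order
# sort_values returns a new DataFrame and leaves df1 unchanged, so the sorted result is kept in df1_sorted
df1_sorted = df1.sort_values(by = ['Category', 'Content_Rating'], ascending = [True, False])
# df1 itself keeps its original row order, which is what the output below shows
df1.head()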

|  | Application_Name | Category | Rating | Reviews | Size (M) | Installs | Type | Price | Content_Rating | Genres | Last Updated | Current_Version | Over 100000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10000 | Free | 0 | Everyone | Art & Design | 7-Jan-18 | 1.0.0 | no |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500000 | Free | 0 | Everyone | Art & Design;Pretend Play | 15-Jan-18 | 2.0.0 | yes |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5000000 | Free | 0 | Everyone | Art & Design | 1-Aug-18 | 1.2.4 | yes |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50000000 | Free | 0 | Teen | Art & Design | 8-Jun-18 | Varies with device | yes |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100000 | Free | 0 | Everyone | Art & Design;Creativity | 20-Jun-18 | 1.1 | yes |
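
`sort_values` returns a new DataFrame rather than sorting in place, so the result has to be assigned (or displayed directly) to be seen. A minimal sketch on a made-up `apps` frame that reuses the Category and Content_Rating column names:

```python
import pandas as pd

# Hypothetical rows; only the two sort columns are included.
apps = pd.DataFrame({
    "Category": ["TOOLS", "ART_AND_DESIGN", "ART_AND_DESIGN", "TOOLS"],
    "Content_Rating": ["Everyone", "Everyone", "Teen", "Teen"],
})

# Category ascending, then Content_Rating descending within each category.
apps_sorted = apps.sort_values(
    by=["Category", "Content_Rating"], ascending=[True, False]
)

print(apps_sorted)          # sorted copy
print(apps.index.tolist())  # the original frame keeps its row order
```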


In [None]:
# Filter your data based on some condition.
# Filtered data set based on Rating column being greater than 3.5 
df1[df1['Rating'] > 3.5] 

In [None]:
# Convert all the string values to upper or lower cases in one column.
df1.Content_Rating.str.upper()

In [None]:
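# Check whether numeric values are present in a given column of your dataframe.
# str.isnumeric() is True only when every character in the string is numeric, so values such as '19M' return False
df1['Size (M)'].str.isnumeric()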
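
Because `str.isnumeric()` works character by character, it also rejects decimal strings such as `"4.1"`. A minimal sketch comparing it with `pd.to_numeric(..., errors="coerce")`, which tests whether the whole string parses as a number (the `values` series is made up for illustration):

```python
import pandas as pd

# Hypothetical mix of strings similar to those in the Size column.
values = pd.Series(["19M", "201k", "Varies with device", "42", "4.1"])

# True only where every character is numeric, so "4.1" is False.
all_digits = values.str.isnumeric()

# True wherever the whole string can be parsed as a number, decimals included.
parses = pd.to_numeric(values, errors="coerce").notna()

print(pd.DataFrame({"value": values, "all_digits": all_digits, "parses": parses}))
```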

In [None]:
# Group your dataset by one column, and get the mean, min, and max values by group.
# Mean, min, and max values of Installs grouped by app Category 
df1.groupby('Category').agg({'Installs': ['mean', 'min', 'max']})

In [None]:
# Group your dataset by two columns and then sort the aggregated results within the groups.
# Note: this code groups by Category only and takes the mean of Installs.
# That is why the mean values match the Category means from the one-column groupby above.
df1.groupby('Category')['Installs'].mean() 
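
Grouping by two columns and sorting within the groups could look like the sketch below. It assumes Type (Free or Paid) as the second grouping column and uses a small made-up `apps` frame; the `Mean_Installs` column name is introduced here only for illustration:

```python
import pandas as pd

# Hypothetical rows using the Category, Type and Installs column names.
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
    "Type": ["Free", "Paid", "Free", "Free", "Paid"],
    "Installs": [1000, 50, 3000, 500, 5000],
})

mean_installs = (
    apps.groupby(["Category", "Type"])["Installs"]
    .mean()                                # one mean per (Category, Type) pair
    .reset_index(name="Mean_Installs")     # back to ordinary columns
    # Keep categories together, largest mean first inside each category.
    .sort_values(["Category", "Mean_Installs"], ascending=[True, False])
)
print(mean_installs)
```

Sorting on the group column first keeps each category's rows together, and the second sort key orders the aggregated values inside each category.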

# Conclusions  

After exploring your dataset, provide a short summary of what you noticed from this dataset.  What would you explore further with more time?



Pandas is a powerful package that can manipulate data frames in many ways. It can filter rows, change the values inside a column, summarize data, and add new columns based on existing ones. In this dataset, I noticed that a lot of tidying was needed before any manipulation, especially because most numeric values were read in as strings and had to be converted.

If I had more time, I would convert the Size column from a string to a float. The code I started for this is shown below, but it became complex. The Size values have an M after the number, and I thought I could simply drop it, but I then ran into other issues: some values have a k after the number, and others say “Varies with device”. To correct everything, I would need to change the “Varies with device” values into NaN values and find a way to scale the M and k values by a million and a thousand respectively (or something like that). 


In [None]:
# Size 
# Size values have the letter M after the number. That needs to be dropped before the Size data type can be changed 

#df2 = df1['Installs'].str.extract('(\d+)', expand=False) 

#Dropping M from Size
#df1['Size (M)'] = df1['Size (M)'].str.replace('M','')
#Displaying it to show it was dropped. 
#df1.head()

#df1['Size (M)'] = df1['Size (M)'].replace('Varies with device', np.nan, regex=True)
#df1.head()

#df1['Size (M)'] = df1['Size (M)'].str.replace('201k', '0.201')
#df1['Size (M)'] = df1['Size (M)'].str.replace('23k', '0.023')
#df1.head()

# Convert string to a float for column Size
#df1['Size (M)'] = df1['Size (M)'].astype(float)

#display(df1.dtypes)
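
One possible way to finish the Size conversion is to split each value into its number and its unit, then scale by the unit. This is a minimal sketch on made-up values, not the completed notebook code; it assumes the target unit is megabytes and that 1 M equals 1000 k, which matches the `'201k'` to `'0.201'` replacement attempted above:

```python
import pandas as pd

# Hypothetical values in the same formats found in the Size column.
sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device"])

# Capture the number and the unit letter; rows that do not match
# (such as "Varies with device") become NaN in both captured columns.
parts = sizes.str.extract(r"^([\d.]+)([Mk])$")
number = pd.to_numeric(parts[0], errors="coerce")

# Assumption: sizes are expressed in megabytes and 1 M = 1000 k.
divisor = parts[1].map({"M": 1, "k": 1000})

size_mb = number / divisor
print(size_mb)
```

Handling the unit through a lookup avoids one `str.replace` call per distinct k value.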