# Guided Project: Visualizing Real World Data
- Use a dataset from Kaggle that is composed of several files; join and clean up the data (the requirement is to open at least 2 files; it doesn't matter whether they are CSV or JSON)
- At least 3 histograms on different aggregated data. Analyze which bin parameter is optimal (the one that maximizes clarity for the specified insight). Keep an eye on the dataset you choose: it must have at least 3 numeric columns to operate on.
- Plot a scatter distribution of the data for a joined column against any column you like
- Create a combined scatterplot with two series of your choice; it should contain a legend entry for each scatterplot
- Create a plot for a category distribution (using a seaborn violin plot or another kind of graph that better fits your data, via catplot): https://seaborn.pydata.org/generated/seaborn.catplot.html
- Do a comparison with a 3x3 subplot matrix. The plots can be anything you liked about the dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Step 1: Select and Upload Dataset
I went to Kaggle and selected the [Mobile App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) dataset. This dataset contains details on more than 7,000 Apple iOS mobile applications.

In [32]:
data1 = pd.read_csv('data/AppleStore.csv')
data2 = pd.read_csv('data/appleStore_description.csv')
data = pd.merge(data1, data2, on="id")
data.head()

Unnamed: 0.1,Unnamed: 0,id,track_name_x,size_bytes_x,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,track_name_y,size_bytes_y,app_desc
0,1,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1,PAC-MAN Premium,100788224,"SAVE 20%, now only $3.99 for a limited time!\n..."
1,2,281796108,Evernote - stay organized,158578688,USD,0.0,161065,26,4.0,3.5,8.2.2,4+,Productivity,37,5,23,1,Evernote - stay organized,158578688,Let Evernote change the way you organize your ...
2,3,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0.0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,Download the most popular free weather app pow...
3,4,282614216,"eBay: Best App to Buy, Sell, Save! Online Shop...",128512000,USD,0.0,262241,649,4.0,4.5,5.10.0,12+,Shopping,37,5,9,1,"eBay: Best App to Buy, Sell, Save! Online Shop...",128512000,The eBay app is the best way to find anything ...
4,5,282935706,Bible,92774400,USD,0.0,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1,Bible,92774400,On more than 250 million devices around the wo...
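
As a sanity check on a join like the one above, the sketch below merges two toy frames on `id` and confirms that every row found a partner. The frames and their values are made up for illustration and stand in for the two CSV files; `validate` and `indicator` are standard `pd.merge` parameters, but using them here is my suggestion, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-ins for AppleStore.csv and appleStore_description.csv
apps = pd.DataFrame({"id": [1, 2, 3], "track_name": ["A", "B", "C"]})
descriptions = pd.DataFrame({"id": [1, 2, 3], "app_desc": ["a", "b", "c"]})

# validate="one_to_one" raises MergeError if "id" is duplicated on either side.
# indicator=True adds a "_merge" column saying where each row came from.
merged = pd.merge(apps, descriptions, on="id", how="outer",
                  validate="one_to_one", indicator=True)

# Rows present in only one of the two frames
unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

If `unmatched` is empty, an inner join (the `pd.merge` default) drops no rows.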


## Step 2: Data Cleaning
- I started exploring the dataset using the info command to get an idea of the data types and the number of nulls
    - Since there were no nulls in the dataset, I kept all columns and rows at this stage

In [33]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7197 entries, 0 to 7196
Data columns (total 20 columns):
Unnamed: 0          7197 non-null int64
id                  7197 non-null int64
track_name_x        7197 non-null object
size_bytes_x        7197 non-null int64
currency            7197 non-null object
price               7197 non-null float64
rating_count_tot    7197 non-null int64
rating_count_ver    7197 non-null int64
user_rating         7197 non-null float64
user_rating_ver     7197 non-null float64
ver                 7197 non-null object
cont_rating         7197 non-null object
prime_genre         7197 non-null object
sup_devices.num     7197 non-null int64
ipadSc_urls.num     7197 non-null int64
lang.num            7197 non-null int64
vpp_lic             7197 non-null int64
track_name_y        7197 non-null object
size_bytes_y        7197 non-null int64
app_desc            7197 non-null object
dtypes: float64(3), int64(10), object(7)
memory usage: 1.2+ MB
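
The non-null counts above can also be summarized as one missing-value count per column. The following is a minimal sketch on a toy frame with made-up values (not the real dataset), assuming only the standard `isnull` method:

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value per column
toy = pd.DataFrame({
    "price": [0.0, 3.99, np.nan],
    "prime_genre": ["Games", None, "Weather"],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Single boolean: does the frame contain any missing value at all?
has_nulls = bool(toy.isnull().values.any())
print(has_nulls)
```

On the merged dataset the same calls would be expected to report zero for every column, in line with the `info` output.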


- I proceeded to verify whether the following pairs of columns contained the same information:
    - track_name_x and track_name_y
    - size_bytes_x and size_bytes_y
        - Since both pairs contained the same information:
            - I dropped columns track_name_y and size_bytes_y
            - I renamed columns track_name_x and size_bytes_x to track_name and size_bytes
- I verified the changes were executed correctly using the info command

In [57]:
# Count the rows where both copies of each column agree (expected: all 7197)
print((data.track_name_x == data.track_name_y).sum())
print((data.size_bytes_x == data.size_bytes_y).sum())

7197
7197
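
An alternative to counting matching rows is `Series.equals`, which returns a single boolean. The sketch below uses a toy frame (values copied from two rows of the preview above) rather than the full dataset; it is one possible way to express the check, not the method used in the notebook.

```python
import pandas as pd

# Toy frame mimicking the duplicated columns produced by the merge
toy = pd.DataFrame({
    "track_name_x": ["Bible", "PAC-MAN Premium"],
    "track_name_y": ["Bible", "PAC-MAN Premium"],
    "size_bytes_x": [92774400, 100788224],
    "size_bytes_y": [92774400, 100788224],
})

# equals() is True only if both Series have the same length and the same values
names_match = toy["track_name_x"].equals(toy["track_name_y"])
sizes_match = toy["size_bytes_x"].equals(toy["size_bytes_y"])
print(names_match, sizes_match)
```

When both checks pass, dropping the `_y` copies loses no information.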


In [59]:
data = data.drop(['size_bytes_y', 'track_name_y'], axis=1)

In [63]:
data = data.rename(columns={'size_bytes_x':'size_bytes',
                            'track_name_x':'track_name'})

In [64]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7197 entries, 0 to 7196
Data columns (total 18 columns):
Unnamed: 0          7197 non-null int64
id                  7197 non-null int64
track_name          7197 non-null object
size_bytes          7197 non-null int64
currency            7197 non-null object
price               7197 non-null float64
rating_count_tot    7197 non-null int64
rating_count_ver    7197 non-null int64
user_rating         7197 non-null float64
user_rating_ver     7197 non-null float64
ver                 7197 non-null object
cont_rating         7197 non-null object
prime_genre         7197 non-null object
sup_devices.num     7197 non-null int64
ipadSc_urls.num     7197 non-null int64
lang.num            7197 non-null int64
vpp_lic             7197 non-null int64
app_desc            7197 non-null object
dtypes: float64(3), int64(9), object(6)
memory usage: 1.0+ MB


- I changed the data types of the following columns:
    - prime_genre to category
    - vpp_lic to category
    - cont_rating to int64
        - In order to change the data type from object to int64, I had to remove the + from the values
- I used the dtypes command to verify the data types were changed

In [69]:
data['prime_genre'] = data['prime_genre'].astype('category')
data['vpp_lic'] = data['vpp_lic'].astype('category')

In [75]:
# regex=False treats '+' as a literal character rather than a regex quantifier
data['cont_rating'] = data['cont_rating'].str.replace('+', '', regex=False)
data['cont_rating'] = data['cont_rating'].astype('int64')
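
To illustrate the conversion on its own, here is a small sketch using a toy Series of rating labels in the same "N+" format as cont_rating (the exact set of labels is an assumption). It shows two equivalent ways to strip the trailing `+` before casting; `str.rstrip` is an alternative I am suggesting, not what the notebook uses.

```python
import pandas as pd

# Toy labels in the same "N+" format as the cont_rating column
ratings = pd.Series(["4+", "12+", "9+", "17+"])

# Option 1: literal replacement (regex=False avoids treating '+' as a regex)
via_replace = ratings.str.replace("+", "", regex=False).astype("int64")

# Option 2: strip the trailing '+' only
via_rstrip = ratings.str.rstrip("+").astype("int64")

print(via_replace.tolist())
print(via_replace.equals(via_rstrip))
```

Either way, `astype('int64')` raises a `ValueError` if any value still contains a non-numeric character, which makes a missed label easy to spot.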

In [77]:
data.dtypes

Unnamed: 0             int64
id                     int64
track_name            object
size_bytes             int64
currency              object
price                float64
rating_count_tot       int64
rating_count_ver       int64
user_rating          float64
user_rating_ver      float64
ver                   object
cont_rating            int64
prime_genre         category
sup_devices.num        int64
ipadSc_urls.num        int64
lang.num               int64
vpp_lic             category
app_desc              object
dtype: object