<a id='Top'></a>
# First Python project - Data query on CSV file

In this exercise, I will:
- Using 2 different sets of data "AppleStore.csv" and "googleplaystore.csv".
- Remove duplicated apps based on app name and number of ratings. 
- Remove apps that have non-English characters.
- Isolated the free apps.
- Analyze the data to print out frequency tables.

*This is a project from Dataquest.io class of "Python for Data Science: Fundamentals", called "Profitable App Profiles for the App Store and Google Play Markets"*

**Jump to each section:**

[**DATA_CLEANING_1**](#DATA_CLEANING_1)
[**DATA_CLEANING_2**](#DATA_CLEANING_2)

[**DATA_QUERY**](#DATA_QUERY)

[**ANALYZE_DATA**](#ANALYZE_DATA)

**This project used basics of programming in Python:**
- Arithmetical operations, variables, common data types, etc.
- List and for loops
- Conditional statements
- Dictionaries and frequency tables
- Functions
- Jupyter Notebook


In [1]:
open_file_1 = open('/Users/spare/Downloads/projects_python/AppleStore.csv')
open_file_2 = open('/Users/spare/Downloads/projects_python/googleplaystore.csv')
from csv import reader
read_file_1 = reader(open_file_1)
read_file_2 = reader(open_file_2)
apps_data_1 = list(read_file_1)
apps_data_2 = list(read_file_2)


Function explore_data is used to print out the row between 'start' and 'end'. 
- It also counts the number of rows and columns.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each print-out

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
#         print('\n')

Function del_row is used to delete the row between 'start' and 'end' from the data set.

In [3]:
def del_row (dataset, start, end):
    print('\nRows to be deleted:')
    print(dataset[start:end+1])
    del dataset[start:end+1]
    
    print('\nNow the rows are:')
    print(dataset[start:end+1])
    print('\n')

In [4]:
explore_data(apps_data_1, 0, 2, True)
explore_data(apps_data_2, 0, 2, True)

print('------------------------')
del_row (apps_data_2, 10473, 10473)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


Number of rows: 7198
Number of columns: 17
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13
------------------------

Rows to be deleted:
[['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']]

## DATA_CLEANING_1
### Remove duplicate apps in the list, based on app name & number of reviews.

apps_data_2 (from googleplaystore.csv) is a list of lists, each element is a row in the data set.
For each element (row) of the list:
- Index 0 is an app name.
- Index 3 is an reviews number.

To remove duplicate elements out of the list. 
- Need to keep track of number of current row when traverse the the "master" list. 
    - When using 'del' on an index of a list, it will affect the list directly. 
- Thus, what we can do are: 
    - First, for each row, put index 0 and 3 from each row to a dictionary. Then, move on to the next row.
    - Second, if index 0 of the current row (app name) exists in the dictionary, compare index 3 (rating) with the one in the dictionary.
        - If the existed one (in the dictionary) is bigger, insert nothing to the new list. The rating value in the dictionary doesn't change. 
        - If the existed one is smaller, replace the data in the dictionary with the data in the current row for the next comparison. Also, add the current row to the new list.
    - But, how can we keep track of the existed row number of the existed apps in current dictionary (since we need to know which row to delete)? By using list as 'value' of each key in the dictionary. Each list has:
        - 1st element is the reviews number.
        - 2nd element is the row number. 
        
    - Instead of using 'del' to delete the row from the apps_data_2, I should create a new 'new_apps_data_2' because 'del' will remove the actual row and make it hard to keep track of the current row. 
        - Build a new list instead.

In [5]:
row_dict={} #Dictionary to keep track of unquique app_names
row_num=0   
duplicates=0 #keep track of number of duplicates
new_apps_data_2=[] #new list that doesn't contains duplicates

print('Number of element in apps_data_2:',len(apps_data_2))

for row in apps_data_2[1:]:  
    row_num +=1 #keep track of current row number in the list

    if row[0] in row_dict: #if an app_name exists in the dict
        duplicates +=1 #counting number of duplicates
        print('\nFound duplicate app at row:', row_num)
        print('Dictionary data:', row_dict[row[0]][0])
        print('Data on current row:', row[3])
        
        if row_dict[row[0]][0] >= row[3]:
            #If the existed one (in the dictionary) is bigger, do nothing to the list.
            #The rating value in the dictionary doesn't change.
            print ('Do nothing to the new list.')
            print ('Current row:',row_num, ' with data:', row[0], row[3])
            print ('Current rating value in the dictionary:', row_dict[row[0]])  
            print ('\n')
        else: 
            #If the existed one is smaller, replace the data in the dictionary with 
            #the data in the current row for the next comparison. 
            print('Update the row number and rating value to the dictionary:')
            print('New data', row[3],' on row:', row_num )
            row_dict[row[0]] = [row[3],row_num]
            
            #Also, add the current row to the new list.
            new_apps_data_2.append(row)
    else: 
        #Put row to dictionary
        #print('Put data to dictionary and new list.')
        row_dict[row[0]] = [row[3],row_num]
        new_apps_data_2.append(row)
        #print(row_dict)
        #print(apps_data_2[row_num-1])
    

print('\nTotal of duplicate:', duplicates)
print('Number of element in new_apps_data_2:',len(new_apps_data_2))

Number of element in apps_data_2: 10841

Found duplicate app at row: 230
Dictionary data: 80805
Data on current row: 80805
Do nothing to the new list.
Current row: 230  with data: Quick PDF Scanner + OCR FREE 80805
Current rating value in the dictionary: ['80805', 223]



Found duplicate app at row: 237
Dictionary data: 159872
Data on current row: 159872
Do nothing to the new list.
Current row: 237  with data: Box 159872
Current rating value in the dictionary: ['159872', 205]



Found duplicate app at row: 240
Dictionary data: 70991
Data on current row: 70991
Do nothing to the new list.
Current row: 240  with data: Google My Business 70991
Current rating value in the dictionary: ['70991', 194]



Found duplicate app at row: 257
Dictionary data: 31614
Data on current row: 31614
Do nothing to the new list.
Current row: 257  with data: ZOOM Cloud Meetings 31614
Current rating value in the dictionary: ['31614', 214]



Found duplicate app at row: 262
Dictionary data: 6989
Data on current r

Dictionary data: 97684
Data on current row: 97699
Update the row number and rating value to the dictionary:
New data 97699  on row: 698

Found duplicate app at row: 739
Dictionary data: 85375
Data on current row: 85375
Do nothing to the new list.
Current row: 739  with data: Khan Academy 85375
Current rating value in the dictionary: ['85375', 703]



Found duplicate app at row: 740
Dictionary data: 181893
Data on current row: 181893
Do nothing to the new list.
Current row: 740  with data: TED 181893
Current rating value in the dictionary: ['181893', 701]



Found duplicate app at row: 747
Dictionary data: 215301
Data on current row: 215301
Do nothing to the new list.
Current row: 747  with data: Lumosity: #1 Brain Games & Cognitive Training App 215301
Current rating value in the dictionary: ['215301', 735]



Found duplicate app at row: 758
Dictionary data: 99020
Data on current row: 99020
Do nothing to the new list.
Current row: 758  with data: Udemy - Online Courses 99020
Current rat

Current row: 993  with data: Nick 123279
Current rating value in the dictionary: ['123279', 918]



Found duplicate app at row: 994
Dictionary data: 243747
Data on current row: 243747
Do nothing to the new list.
Current row: 994  with data: Fandango Movies - Times + Tickets 243747
Current rating value in the dictionary: ['243747', 900]



Found duplicate app at row: 1097
Dictionary data: 347838
Data on current row: 347838
Do nothing to the new list.
Current row: 1097  with data: Google Pay 347838
Current rating value in the dictionary: ['347838', 1084]



Found duplicate app at row: 1101
Dictionary data: 283
Data on current row: 283
Do nothing to the new list.
Current row: 1101  with data: Wells Fargo Daily Change 283
Current rating value in the dictionary: ['283', 1092]



Found duplicate app at row: 1115
Dictionary data: 706301
Data on current row: 706302
Update the row number and rating value to the dictionary:
New data 706302  on row: 1115

Found duplicate app at row: 1137
Dictiona

Data on current row: 10434
Update the row number and rating value to the dictionary:
New data 10434  on row: 1735

Found duplicate app at row: 1736
Dictionary data: 5234162
Data on current row: 5234825
Update the row number and rating value to the dictionary:
New data 5234825  on row: 1736

Found duplicate app at row: 1741
Dictionary data: 5566669
Data on current row: 5566805
Update the row number and rating value to the dictionary:
New data 5566805  on row: 1741

Found duplicate app at row: 1745
Dictionary data: 1295557
Data on current row: 1295606
Update the row number and rating value to the dictionary:
New data 1295606  on row: 1745

Found duplicate app at row: 1749
Dictionary data: 4447388
Data on current row: 4448791
Update the row number and rating value to the dictionary:
New data 4448791  on row: 1749

Found duplicate app at row: 1750
Dictionary data: 1497361
Data on current row: 1498648
Update the row number and rating value to the dictionary:
New data 1498648  on row: 1750



Current row: 2030  with data: Dog Run - Pet Dog Simulator 48615
Current rating value in the dictionary: ['48615', 2014]



Found duplicate app at row: 2034
Dictionary data: 967
Data on current row: 974
Update the row number and rating value to the dictionary:
New data 974  on row: 2034

Found duplicate app at row: 2042
Dictionary data: 148990
Data on current row: 59843
Update the row number and rating value to the dictionary:
New data 59843  on row: 2042

Found duplicate app at row: 2046
Dictionary data: 68057
Data on current row: 68286
Update the row number and rating value to the dictionary:
New data 68286  on row: 2046

Found duplicate app at row: 2051
Dictionary data: 10216538
Data on current row: 10216997
Update the row number and rating value to the dictionary:
New data 10216997  on row: 2051

Found duplicate app at row: 2055
Dictionary data: 235496
Data on current row: 235906
Update the row number and rating value to the dictionary:
New data 235906  on row: 2055

Found duplicate

Dictionary data: 83488
Data on current row: 83488
Do nothing to the new list.
Current row: 2636  with data: Text Free: WiFi Calling App 83488
Current rating value in the dictionary: ['83488', 2587]



Found duplicate app at row: 2638
Dictionary data: 382120
Data on current row: 382121
Update the row number and rating value to the dictionary:
New data 382121  on row: 2638

Found duplicate app at row: 2642
Dictionary data: 79129
Data on current row: 79130
Update the row number and rating value to the dictionary:
New data 79130  on row: 2642

Found duplicate app at row: 2643
Dictionary data: 1175794
Data on current row: 1175815
Update the row number and rating value to the dictionary:
New data 1175815  on row: 2643

Found duplicate app at row: 2644
Dictionary data: 1259894
Data on current row: 1259894
Do nothing to the new list.
Current row: 2644  with data: MeetMe: Chat & Meet New People 1259894
Current rating value in the dictionary: ['1259894', 2618]



Found duplicate app at row: 2645

Data on current row: 122283
Do nothing to the new list.
Current row: 3062  with data: Bleacher Report: sports news, scores, & highlights 122283
Current rating value in the dictionary: ['122283', 3051]



Found duplicate app at row: 3063
Dictionary data: 82882
Data on current row: 82883
Update the row number and rating value to the dictionary:
New data 82883  on row: 3063

Found duplicate app at row: 3064
Dictionary data: 133833
Data on current row: 133833
Do nothing to the new list.
Current row: 3064  with data: theScore: Live Sports Scores, News, Stats & Videos 133833
Current rating value in the dictionary: ['133833', 3056]



Found duplicate app at row: 3065
Dictionary data: 91033
Data on current row: 91033
Do nothing to the new list.
Current row: 3065  with data: CBS Sports App - Scores, News, Stats & Watch Live 91033
Current rating value in the dictionary: ['91033', 3057]



Found duplicate app at row: 3067
Dictionary data: 32386
Data on current row: 32386
Do nothing to the new lis

Found duplicate app at row: 3910
Dictionary data: 66577446
Data on current row: 66509917
Do nothing to the new list.
Current row: 3910  with data: Instagram 66509917
Current rating value in the dictionary: ['66577446', 2605]



Found duplicate app at row: 3911
Dictionary data: 9883367
Data on current row: 9876369
Do nothing to the new list.
Current row: 3911  with data: My Talking Angela 9876369
Current rating value in the dictionary: ['9883367', 1891]



Found duplicate app at row: 3912
Dictionary data: 470713
Data on current row: 469851
Do nothing to the new list.
Current row: 3912  with data: YouTube Kids 469851
Current rating value in the dictionary: ['470713', 2209]



Found duplicate app at row: 3914
Dictionary data: 685981
Data on current row: 685450
Do nothing to the new list.
Current row: 3914  with data: PAC-MAN 685450
Current rating value in the dictionary: ['685981', 1674]



Found duplicate app at row: 3916
Dictionary data: 787177
Data on current row: 787107
Do nothing to 




Found duplicate app at row: 4502
Dictionary data: 313633
Data on current row: 313403
Do nothing to the new list.
Current row: 4502  with data: Quora 313403
Current rating value in the dictionary: ['313633', 2580]



Found duplicate app at row: 4528
Dictionary data: 4450890
Data on current row: 4443407
Do nothing to the new list.
Current row: 4528  with data: ROBLOX 4443407
Current rating value in the dictionary: ['4450890', 2207]



Found duplicate app at row: 4567
Dictionary data: 7790693
Data on current row: 7775146
Do nothing to the new list.
Current row: 4567  with data: SHAREit - Transfer & Share 7775146
Current rating value in the dictionary: ['7790693', 3256]



Found duplicate app at row: 4587
Dictionary data: 2955326
Data on current row: 2953886
Do nothing to the new list.
Current row: 4587  with data: Tumblr 2953886
Current rating value in the dictionary: ['2955326', 2549]



Found duplicate app at row: 4589
Dictionary data: 175110
Data on current row: 174827
Do nothing to



Found duplicate app at row: 6599
Dictionary data: 7
Data on current row: 7
Do nothing to the new list.
Current row: 6599  with data: Free Blood Pressure 7
Current rating value in the dictionary: ['7', 2357]



Found duplicate app at row: 6627
Dictionary data: 531
Data on current row: 46
Do nothing to the new list.
Current row: 6627  with data: High Blood Pressure Symptoms 46
Current rating value in the dictionary: ['531', 2509]



Found duplicate app at row: 6644
Dictionary data: 8226
Data on current row: 8211
Do nothing to the new list.
Current row: 6644  with data: QR Scanner & Barcode Scanner 2018 8211
Current rating value in the dictionary: ['8226', 3456]



Found duplicate app at row: 6653
Dictionary data: 16320
Data on current row: 16317
Do nothing to the new list.
Current row: 6653  with data: Camera FV-5 16317
Current rating value in the dictionary: ['16320', 2914]



Found duplicate app at row: 6655
Dictionary data: 244371
Data on current row: 244302
Do nothing to the new li

New data 200214  on row: 9629

Found duplicate app at row: 9636
Dictionary data: 206532
Data on current row: 207294
Update the row number and rating value to the dictionary:
New data 207294  on row: 9636

Found duplicate app at row: 9642
Dictionary data: 23772
Data on current row: 1375988
Do nothing to the new list.
Current row: 9642  with data: Chess Free 1375988
Current rating value in the dictionary: ['23772', 2157]



Found duplicate app at row: 9643
Dictionary data: 363934
Data on current row: 364452
Update the row number and rating value to the dictionary:
New data 364452  on row: 9643

Found duplicate app at row: 9672
Dictionary data: 288523
Data on current row: 288606
Update the row number and rating value to the dictionary:
New data 288606  on row: 9672

Found duplicate app at row: 9681
Dictionary data: 1574204
Data on current row: 1574546
Update the row number and rating value to the dictionary:
New data 1574546  on row: 9681

Found duplicate app at row: 9690
Dictionary data:

[**Back to DATA_CLEANING_1**](#DATA_CLEANING_1)

[**Back to Top**](#Top)


## DATA_CLEANING_2 
### Only keep apps directed toward English-speaking audience.
- The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system.

In [6]:
eng_apps_data_2 = []
non_eng_count = 0
row_count=0
non_eng_app_name_dict={}

for row in new_apps_data_2:  
    non_eng=0
    row_count +=1
    
    for letter in row[0]:
        if 0 <= ord(letter) <= 127: #checking condition
            #do nothing to count the non-eng letter
            non_eng +=0
        else:
            non_eng +=1
            print('The non-ENG letter is:', letter,',with ASCII:', ord(letter))
            
    if non_eng > 0:
        print('App name constains non-ENG letters:', row[0],'on row:', row_count)
        print('Do nothing to the new list.')
        non_eng_app_name_dict[row_count] = row[0] #put the app in the dictionary
        non_eng_count +=1
        print('\n')
    else:
#       COMMENTS FOR DEBUGS  
#         print('App name constains ENG letters:', row[0],'on row:', row_count)
#         print('Append current line to the new list.')
        eng_apps_data_2.append(row)
#         print('App name in current line:', row[0])
#         print('\n')

print('Number of non-ENG apps:', non_eng_count)
#print(eng_apps_data_2)


    

The non-ENG letter is: – ,with ASCII: 8211
App name constains non-ENG letters: U Launcher Lite – FREE Live Cool Themes, Hide Apps on row: 3
Do nothing to the new list.


The non-ENG letter is: – ,with ASCII: 8211
App name constains non-ENG letters: CarMax – Cars for Sale: Search Used Car Inventory on row: 86
Do nothing to the new list.


The non-ENG letter is: – ,with ASCII: 8211
App name constains non-ENG letters: AutoScout24 Switzerland – Find your new car on row: 89
Do nothing to the new list.


The non-ENG letter is: á ,with ASCII: 225
The non-ENG letter is: ã ,with ASCII: 227
App name constains non-ENG letters: Zona Azul Digital Fácil SP CET - OFFICIAL São Paulo on row: 90
Do nothing to the new list.


The non-ENG letter is: 📖 ,with ASCII: 128214
App name constains non-ENG letters: Wattpad 📖 Free Books on row: 140
Do nothing to the new list.


The non-ENG letter is: – ,with ASCII: 8211
App name constains non-ENG letters: ReadEra – free ebook reader on row: 162
Do nothing to the ne

The non-ENG letter is: ® ,with ASCII: 174
App name constains non-ENG letters: Geocaching® on row: 1193
Do nothing to the new list.


The non-ENG letter is: 🏠 ,with ASCII: 127968
App name constains non-ENG letters: Homes.com 🏠 For Sale, Rent on row: 1248
Do nothing to the new list.


The non-ENG letter is: · ,with ASCII: 183
The non-ENG letter is: · ,with ASCII: 183
The non-ENG letter is: · ,with ASCII: 183
App name constains non-ENG letters: At home - rental · real estate · room finding application such as apartment · apartment on row: 1277
Do nothing to the new list.


The non-ENG letter is: 乐 ,with ASCII: 20048
The non-ENG letter is: 屋 ,with ASCII: 23627
The non-ENG letter is: 网 ,with ASCII: 32593
App name constains non-ENG letters: 乐屋网: Buying a house, selling a house, renting a house on row: 1284
Do nothing to the new list.


The non-ENG letter is: ® ,with ASCII: 174
App name constains non-ENG letters: ColorSnap® Visualizer on row: 1287
Do nothing to the new list.


The non-ENG let

The non-ENG letter is: é ,with ASCII: 233
App name constains non-ENG letters: Buscapé - Offers and discounts on row: 2343
Do nothing to the new list.


The non-ENG letter is: 🏆 ,with ASCII: 127942
App name constains non-ENG letters: CheckPoints 🏆 Rewards App on row: 2354
Do nothing to the new list.


The non-ENG letter is: – ,with ASCII: 8211
App name constains non-ENG letters: Coupons.com – Grocery Coupons & Cash Back Savings on row: 2376
Do nothing to the new list.


The non-ENG letter is: – ,with ASCII: 8211
App name constains non-ENG letters: Modcloth – Unique Indie Women's Fashion & Style on row: 2382
Do nothing to the new list.


The non-ENG letter is: – ,with ASCII: 8211
App name constains non-ENG letters: Zappos – Shoe shopping made simple on row: 2391
Do nothing to the new list.


The non-ENG letter is: – ,with ASCII: 8211
App name constains non-ENG letters: Overstock – Home Decor, Furniture Shopping on row: 2399
Do nothing to the new list.


The non-ENG letter is: – ,with ASC

The non-ENG letter is: 포 ,with ASCII: 54252
The non-ENG letter is: 인 ,with ASCII: 51064
The non-ENG letter is: 트 ,with ASCII: 53944
The non-ENG letter is: 포 ,with ASCII: 54252
The non-ENG letter is: 인 ,with ASCII: 51064
The non-ENG letter is: 트 ,with ASCII: 53944
The non-ENG letter is: 멤 ,with ASCII: 47716
The non-ENG letter is: 버 ,with ASCII: 48260
The non-ENG letter is: 십 ,with ASCII: 49901
The non-ENG letter is: 적 ,with ASCII: 51201
The non-ENG letter is: 립 ,with ASCII: 47549
The non-ENG letter is: 사 ,with ASCII: 49324
The non-ENG letter is: 용 ,with ASCII: 50857
The non-ENG letter is: 모 ,with ASCII: 47784
The non-ENG letter is: 바 ,with ASCII: 48148
The non-ENG letter is: 일 ,with ASCII: 51068
The non-ENG letter is: 카 ,with ASCII: 52852
The non-ENG letter is: 드 ,with ASCII: 46300
The non-ENG letter is: 쿠 ,with ASCII: 53216
The non-ENG letter is: 폰 ,with ASCII: 54256
The non-ENG letter is: 롯 ,with ASCII: 47215
The non-ENG letter is: 데 ,with ASCII: 45936
App name constains non-ENG lette

App name constains non-ENG letters: Bubbu – My Virtual Pet on row: 4770
Do nothing to the new list.


The non-ENG letter is: ™ ,with ASCII: 8482
App name constains non-ENG letters: Weaphones™ Gun Sim Free Vol 1 on row: 4777
Do nothing to the new list.


The non-ENG letter is: ★ ,with ASCII: 9733
App name constains non-ENG letters: Live Camera Viewer ★ World Webcam & IP Cam Streams on row: 4779
Do nothing to the new list.


The non-ENG letter is: ® ,with ASCII: 174
App name constains non-ENG letters: AP® Guide on row: 4782
Do nothing to the new list.


The non-ENG letter is: ™ ,with ASCII: 8482
App name constains non-ENG letters: AP App for Android™ on row: 4786
Do nothing to the new list.


The non-ENG letter is: 🔥 ,with ASCII: 128293
App name constains non-ENG letters: Guess the Class 🔥 AQW on row: 4835
Do nothing to the new list.


The non-ENG letter is: 中 ,with ASCII: 20013
The non-ENG letter is: 国 ,with ASCII: 22269
The non-ENG letter is: 語 ,with ASCII: 35486
The non-ENG letter is:


The non-ENG letter is: á ,with ASCII: 225
App name constains non-ENG letters: BH Táxi on row: 5559
Do nothing to the new list.


The non-ENG letter is: ç ,with ASCII: 231
The non-ENG letter is: í ,with ASCII: 237
App name constains non-ENG letters: BH Açaí on row: 5563
Do nothing to the new list.


The non-ENG letter is: í ,with ASCII: 237
App name constains non-ENG letters: Bi en Línea on row: 5565
Do nothing to the new list.


The non-ENG letter is: – ,with ASCII: 8211
App name constains non-ENG letters: Microsoft Power BI–Business data analytics on row: 5567
Do nothing to the new list.


The non-ENG letter is: – ,with ASCII: 8211
App name constains non-ENG letters: Bitmoji – Your Personal Emoji on row: 5572
Do nothing to the new list.


The non-ENG letter is: – ,with ASCII: 8211
App name constains non-ENG letters: happn – Local dating app on row: 5592
Do nothing to the new list.


The non-ENG letter is: – ,with ASCII: 8211
App name constains non-ENG letters: Zalo – Video Call on ro

Do nothing to the new list.


The non-ENG letter is: ã ,with ASCII: 227
App name constains non-ENG letters: Meu Cartão BV on row: 6160
Do nothing to the new list.


The non-ENG letter is: ü ,with ASCII: 252
App name constains non-ENG letters: BW Mobilbanking für Smartphone und Tablet on row: 6192
Do nothing to the new list.


The non-ENG letter is: İ ,with ASCII: 304
The non-ENG letter is: İ ,with ASCII: 304
App name constains non-ENG letters: C3-C4-PİCASSO-ELYSEE RACİNG on row: 6247
Do nothing to the new list.


The non-ENG letter is: ó ,with ASCII: 243
App name constains non-ENG letters: Casa de Bolsa Bx+ Móvil on row: 6252
Do nothing to the new list.


The non-ENG letter is: – ,with ASCII: 8211
App name constains non-ENG letters: Color by Number – New Coloring Book on row: 6271
Do nothing to the new list.


The non-ENG letter is: – ,with ASCII: 8211
App name constains non-ENG letters: No.Color – Color by Number on row: 6285
Do nothing to the new list.


The non-ENG letter is: ‼ ,wit

App name constains non-ENG letters: DF 司機 on row: 7588
Do nothing to the new list.


The non-ENG letter is: á ,with ASCII: 225
App name constains non-ENG letters: Rádio DF FM on row: 7599
Do nothing to the new list.


The non-ENG letter is: – ,with ASCII: 8211
App name constains non-ENG letters: DF Wall Plus – Droid Firewall on row: 7605
Do nothing to the new list.


The non-ENG letter is: í ,with ASCII: 237
App name constains non-ENG letters: Digital FM Brasília DF on row: 7607
Do nothing to the new list.


The non-ENG letter is: é ,with ASCII: 233
The non-ENG letter is: â ,with ASCII: 226
App name constains non-ENG letters: Técnico Legislativo Câmara Legislativa DF on row: 7612
Do nothing to the new list.


The non-ENG letter is: ç ,with ASCII: 231
App name constains non-ENG letters: Cargo de Praça PM DF on row: 7614
Do nothing to the new list.


The non-ENG letter is: é ,with ASCII: 233
App name constains non-ENG letters: México City D.F News on row: 7617
Do nothing to the new list.

The non-ENG letter is: 드 ,with ASCII: 46300
App name constains non-ENG letters: EG SIM CARD (EGSIMCARD, 이지심카드) on row: 8569
Do nothing to the new list.


The non-ENG letter is: í ,with ASCII: 237
App name constains non-ENG letters: qEG APP / Química EG SRL on row: 8585
Do nothing to the new list.


The non-ENG letter is: ™ ,with ASCII: 8482
App name constains non-ENG letters: EG Classroom Decimals™ on row: 8586
Do nothing to the new list.


The non-ENG letter is: ™ ,with ASCII: 8482
App name constains non-ENG letters: Doomsday Preppers™ on row: 8629
Do nothing to the new list.


The non-ENG letter is: í ,with ASCII: 237
App name constains non-ENG letters: Esporte Interativo - Notícias e Resultados Ao Vivo on row: 8676
Do nothing to the new list.


The non-ENG letter is: 国 ,with ASCII: 22269
The non-ENG letter is: 际 ,with ASCII: 38469
App name constains non-ENG letters: EI国际 on row: 8689
Do nothing to the new list.


The non-ENG letter is: í ,with ASCII: 237
App name constains non-ENG l

[**Back to DATA_CLEANING_2**](#DATA_CLEANING_2)

[**Back to Top**](#Top)

### Debugging the previous code
- Some of the name seems to be OK, such as in "3" and "86"
- Create a separate function to check "non-ENG" letter

In [7]:
print('Number of non-English app:', len(non_eng_app_name_dict))
#print('List of non-ENG app name:\n',non_eng_app_name_dict,'\n')

Number of non-English app: 563


#### Some app_name seam to be OK, but were still considered as non-English.

In [8]:
def check_non_eng(app_name):
    non_eng = 0
    print(app_name)
    for letter in app_name:
        if 0 <= ord(letter) <= 127:
            #do nothing to count the non-eng letter            
            non_eng +=0
            #print('Current ENG letter:', letter, ',ASCII code:', ord(letter))
        else:
            non_eng +=1
            print('Current non-ENG letter:', letter, ',ASCII code:', ord(letter))
    return non_eng

In [9]:
debug_1 = non_eng_app_name_dict[3]
debug_2 = non_eng_app_name_dict[86]
check_non_eng(debug_1)
#check_non_eng(debug_2)


U Launcher Lite – FREE Live Cool Themes, Hide Apps
Current non-ENG letter: – ,ASCII code: 8211


1

As shown above, these apps seem "legit". This shows that the check of ASCII code in the range of (0-127) is not sufficient enough. However, I will not exploit this further as this is not the focus of this exercise. 
- The purpose of this exercise is to practice basic data cleaning on Python. 

To minimize the impact of data loss, we could potentially only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. 
- This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.

In [10]:
print(ord('–'))
print(ord('-'))
print(ord('😍'))

8211
45
128525


[**Back to DATA_CLEANING_2**](#DATA_CLEANING_2)

[**Back to Top**](#Top)

### DATA_QUERY
### Isolate free apps. 
Free apps have value in 'Price'=='0' or 'Type'=='Free'. In each row:
- The 'Price' column is in index 7.
- The 'Type' column is in index 6.

This step, we continue using the list created above, eng_apps_data_2.


In [11]:
apps_data_2_free = [] #create empty list

for row in eng_apps_data_2:
  #  if row[7] == '0':
  #      apps_data_2_free.append(row)
    if row[6] == 'Free':
        apps_data_2_free.append(row)

#print(apps_data_2_free)
print('Only print out the frist 10 free apps:\n')
for i in range(0,10):
    print(apps_data_2_free[i])
    print('\n')

print('Numbers of free apps:')
print(len(apps_data_2_free))

Only print out the frist 10 free apps:

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '17

[**Back to DATA_QUERY**](#DATA_QUERY)

[**Back to Top**](#Top)


### ANALYZE_DATA
We'll need to build a frequency table for the prime_genre column of the App Store data set, and for the Genres and Category columns of the Google Play data set.

We'll build two functions we can use to analyze the frequency tables:
- One function to generate frequency tables that show percentages.
- Another function we can use to display the percentages in a descending order.

In [12]:
#Create a function named freq_table() that takes in two inputs: 
# - dataset (which is expected to be a list of lists) 
# - index (which is expected to be an integer).
#The function should return the frequency table (as a dictionary) for any column we want. 
#The frequencies should also be expressed as percentages.

def freq_table (dataset, index):
    col_freq = {}
    percentage = 0.0
    
    for row in dataset:
        if row[index] in col_freq:
            col_freq[row[index]] +=1 
        else: 
            col_freq[row[index]] =1
    
    #Calculate the frequencies
    total_apps = len(dataset)
  #  print('Total of apps in the list is:', total_apps)
    for ele in col_freq
        col_freq[ele] = round(col_freq[ele]/total_apps*100,2)
    
    return col_freq

SyntaxError: invalid syntax (<ipython-input-12-81c6ec19e3f2>, line 20)

In [None]:
print('Create freq_table for \'Category\', index=1:\n')
cat_table = freq_table(apps_data_2_free,1)
print(cat_table)
print(type(cat_table))


print('\n-----------------------------------------------')
print('\n\nCreate freq_table for \'Genre\', index=9:\n')
genre_table = freq_table(apps_data_2_free,9)
print(genre_table)
print(type(genre_table))

**Transforms the frequency table into a list of tuples, then sorts the list in a descending order.**
- Prints the entries of the frequency table in descending order.

In [None]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])


In [None]:
print('Apps sorted by Category:')
display_table (apps_data_2_free,1)

print("\nApps sorted by Genre:")
display_table (apps_data_2_free,9)

[**Back to ANALYZE_DATA**](#ANALYZE_DATA)

[**Back to Top**](#Top)