## 1. Build Seed Lexicon

- Total: 3292 apps
- Level 1: 2076
- Level 2: 351
- Level 3: 731
- Level 4: 97

### 1.1 Generate a frequency distribution for the entire collection
- Keep only nouns, verbs, adjectives, and adverbs
- Select the 5000 most frequent words as seed-word candidates
- Go through the candidates by hand and delete unreasonable words

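The steps above can be illustrated on a toy input. This is a minimal sketch, not the notebook's pipeline: the `tagged` list is made-up data standing in for the output of `nltk.pos_tag`, and `collections.Counter` stands in for `FreqDist`, which subclasses it.

```python
from collections import Counter

# Hypothetical pre-tagged tokens (Penn Treebank tags), standing in for nltk.pos_tag output
tagged = [('game', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('best', 'JJS'),
          ('game', 'NN'), ('and', 'CC'), ('really', 'RB'), ('fun', 'NN'),
          ('game', 'NN'), ('fun', 'NN')]

# Keep nouns (N*), verbs (V*), adjectives (J*), and adverbs (R*)
kept = [(word, tag) for word, tag in tagged if tag.startswith(('N', 'V', 'J', 'R'))]

# Count (word, tag) pairs and take the most frequent ones as seed candidates
freq = Counter(kept)
top2 = freq.most_common(2)
print(top2)
```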
In [170]:
import nltk
import pandas as pd
from nltk import FreqDist
import numpy as np

In [6]:
df = pd.read_excel('3296_Andriod_Apps.xlsx')
df.shape

(3296, 13)

In [18]:
df1 = df[df['content_rating']=='Everyone']
df2 = df[df['content_rating']=='Everyone 10+']
df3 = df[df['content_rating']=='Teen']
df4 = df[df['content_rating']=='Mature 17+']

In [23]:
def process(inputdf):
    """Return ((word, tag), count) pairs sorted by descending count."""

    ## Tokenization ##

    tokens = []
    for text in inputdf['description']:
        # skip missing or non-text descriptions
        if not isinstance(text, str):
            continue
        tokens.extend(nltk.word_tokenize(text))

    ## POS Tagging ##

    tags = nltk.pos_tag(tokens)
    ##only keep n, v, adj, adv (Penn Treebank tags starting with N, V, J, R)
    tag_list = [(word, tag) for (word, tag) in tags
                if tag.startswith(('N', 'V', 'J', 'R'))]

    ## Frequency distribution ##
    tag_freq = FreqDist(tag_list)

    sorted_tag_freq = sorted(tag_freq.items(), key=lambda k: k[1], reverse=True)

    return sorted_tag_freq




In [26]:
whole_set = process(df)

In [28]:
len(whole_set)

54257

In [34]:
whole_set_top5000 = whole_set[:5000]

In [35]:
with open('whole_set_top5000.txt','w') as f:
    for (word,tag),frequency in whole_set_top5000:
        f.write(str(word)+'\t'+str(tag)+'\t'+str(frequency)+'\n')

### 1.2 After selecting the seed lexicon, generate Set1-Set4

In [40]:
seed = pd.read_csv('whole_set_top5000_after.csv')
seed.head()

Unnamed: 0,word,tag,frequency,keepornot
0,game,NN,23332,0
1,is,VBZ,21373,0
2,http,NN,11093,0
3,are,VBP,10911,0
4,be,VB,8195,0


In [47]:
seed = seed[seed['keepornot']==1]
seed_list = seed['word'].str.lower().tolist()
seed_list[:10]

['new',
 'free',
 'friends',
 'levels',
 'players',
 'fun',
 'facebook',
 'up',
 'different',
 'best']

In [50]:
len(seed_list)

3206

In [60]:
#remove duplicates in seed_list
seed_list2 = list(set(seed_list))

In [101]:
# remove stop words in seed_list
from nltk.corpus import stopwords
# keep the NLTK list under its own name so the imported module is not shadowed
nltk_stopwords = set(stopwords.words('english'))

stop_df = pd.read_csv('stop_list.csv')
stop_list = set(stop_df['word'].tolist())
seed_list3 = [w for w in seed_list2 if w not in stop_list and w not in nltk_stopwords]

In [149]:
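def set_generate(inputdf):
    """Return a FreqDist of seed words, normalized to percentages."""

    ## Tokenization ##

    tokens = []
    nrow = len(inputdf.index)
    for i in range(nrow):
        try:
            # .loc looks the row up by index label i; labels that are missing
            # from inputdf raise KeyError and are skipped
            tokens.extend(nltk.word_tokenize(inputdf.loc[i, 'description']))
        except (KeyError, TypeError):
            # KeyError: no row labelled i; TypeError: description is not text
            pass

    #lowercase
    tokens = [w.lower() for w in tokens]

    #only keep the words in seed_list
    seed_words = set(seed_list3)
    words = [w for w in tokens if w in seed_words]

    ## Freq Distribution
    freq = FreqDist(words)

    ## Normalization: each value becomes a percentage of all seed-word occurrences
    total_freq = sum(freq.values())
    if total_freq != 0:
        for k in list(freq):
            freq[k] = float(freq[k])/total_freq*100

    return freq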

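The filtering and normalization inside `set_generate` can be checked by hand on a small input. The sketch below uses made-up tokens and a made-up three-word seed lexicon, with `collections.Counter` in place of `FreqDist`; the resulting values are percentages that sum to 100.

```python
from collections import Counter

seed_words = {'fun', 'free', 'levels'}   # hypothetical seed lexicon
tokens = ['Fun', 'free', 'game', 'fun', 'levels', 'fun', 'the']

# Lowercase every token, then keep only seed words
words = [w.lower() for w in tokens if w.lower() in seed_words]
counts = Counter(words)

# Convert raw counts to percentages of all seed-word occurrences
total = sum(counts.values())
shares = {w: 100.0 * c / total for w, c in counts.items()} if total else {}
print(shares)
```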
In [150]:
#set = dict(word:freq)

set1 = set_generate(df1)
set2 = set_generate(df2)
set3 = set_generate(df3)
set4 = set_generate(df4)

In [151]:
# write each set to its own tab-separated file: set1.txt ... set4.txt
for name, freq_set in [('set1', set1), ('set2', set2), ('set3', set3), ('set4', set4)]:
    with open(name + '.txt', 'w') as f:
        f.write('word\tfrequency\n')
        for (word, frequency) in freq_set.items():
            f.write(str(word)+'\t'+str(frequency)+'\n')

In [153]:
len(set1),len(set2),len(set3),len(set4)

(490, 483, 503, 0)

In [152]:
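#get top 500 words from set1,set3 by dropping entries below a hand-tuned threshold
#iterate over a copy of the items so entries can be removed safely
for k,v in list(set1.items()):
    if v<0.05:   ##change
        set1.pop(k)
for k,v in list(set3.items()):
    if v<0.044:  ##change
        set3.pop(k)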
        

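The cutoffs `0.05` and `0.044` are tuned by hand to leave roughly 500 words. As an alternative sketch, assuming the goal is simply the N highest-valued entries, `most_common` (which `FreqDist` inherits from `collections.Counter`) selects them directly. The helper `keep_top` and the toy values are illustrative only and are not part of the notebook.

```python
from collections import Counter

# Hypothetical normalized frequencies (percent), standing in for set1
toy_set = Counter({'fun': 0.30, 'free': 0.21, 'levels': 0.08, 'army': 0.04, 'art': 0.01})

def keep_top(freq, n):
    """Return a new Counter holding only the n highest-valued entries."""
    return Counter(dict(freq.most_common(n)))

top3 = keep_top(toy_set, 3)
print(sorted(top3))
```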
Why is len(set4) 0?

- Maybe weights need to be added when generating whole_set
- Add an offensive-word list

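One further possibility worth checking, offered as a hypothesis rather than a confirmed diagnosis: `set_generate` looks each description up by an integer from `range(nrow)`. On a frame with an integer index that lookup goes by label, not by position, and `df4` keeps the row labels it had in `df`. If none of the labels `0..nrow-1` belong to a "Mature 17+" app, every lookup fails and the error is swallowed by the `except`, leaving no tokens. The sketch below reproduces the effect on made-up data, using `.loc` (label-based) and `.iloc` (position-based).

```python
import pandas as pd

# Made-up data: the two "Mature 17+" rows carry index labels 3 and 4
toy_df = pd.DataFrame({
    'content_rating': ['Everyone', 'Everyone', 'Teen', 'Mature 17+', 'Mature 17+'],
    'description': ['a b', 'c d', 'e f', 'g h', 'i j'],
})
toy_df4 = toy_df[toy_df['content_rating'] == 'Mature 17+']

# Label-based lookup with 0..n-1: labels 0 and 1 are absent from toy_df4
found_by_label = []
for i in range(len(toy_df4.index)):
    try:
        found_by_label.append(toy_df4.loc[i, 'description'])
    except KeyError:
        pass

# Position-based lookup reaches every row of the filtered frame
found_by_position = [toy_df4['description'].iloc[i] for i in range(len(toy_df4.index))]

print(found_by_label, found_by_position)
```

If this is the cause, the other three sets are affected as well, since they would be built only from rows whose labels happen to fall inside `0..nrow-1`.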
## 2. Assign initial weights to seed terms 

### 2.1 Create Positive and Negative sets

In [154]:
P1 = set1
P2 = set2
P3 = set3
P4 = set4

P = [P1,P2,P3,P4]

N1 = FreqDist()
N2 = set1
N3 = set1+set2        
N4 = set1+set2+set3

N = [N1,N2,N3,N4]

Set = [set1,set2,set3,set4]

In [155]:
len(N1),len(N2),len(N3),len(N4)

(0, 490, 754, 938)

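`FreqDist` inherits from `collections.Counter`, so `set1+set2` adds the values of words that appear in both sets and keeps the words that appear in only one. This is why `len(N3)` above is smaller than `len(set1) + len(set2)`. A small sketch with made-up values, using `Counter` directly:

```python
from collections import Counter

# Hypothetical percentages for two levels
toy_set1 = Counter({'fun': 0.5, 'free': 0.25})
toy_set2 = Counter({'fun': 0.25, 'army': 0.125})

# Adding two counters sums the values of shared keys and keeps the rest
toy_n3 = toy_set1 + toy_set2
print(toy_n3)
```

Note that `Counter` addition drops any entry whose summed value is not positive, which does not matter here because all frequencies are positive.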
### 2.2 Create weight_dictionary for each level

In [156]:
def weight_calculation(n):
    '''
    generate a weight_dict for each level 
    weight_dict = {word:weight}
    
    '''
    weight_dict = dict()
    # Set[n] and P[n] are the same object, so every word here is in P[n]
    for word in Set[n]:
        in_p = word in P[n]
        in_n = word in N[n]
        if in_p and not in_n:
            # only in the positive set: use its frequency
            weight_dict[word] = P[n][word]
        elif in_n and not in_p:
            # only in the negative set: use its negated frequency
            weight_dict[word] = -N[n][word]
        elif in_p and in_n:
            # in both sets: ratio of positive to negative frequency
            weight_dict[word] = P[n][word]/N[n][word]
        else:
            weight_dict[word] = 0
    
    return weight_dict

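The four cases of the weight rule can be traced on toy numbers. The helper `toy_weight` and the values below are made up for illustration. In the notebook itself the loop runs over `Set[n]`, which is the same object as `P[n]`, so only the "positive only" and "in both" cases can occur there.

```python
from collections import Counter

def toy_weight(word, positive, negative):
    """Apply the weight rule from weight_calculation to a single word."""
    in_p, in_n = word in positive, word in negative
    if in_p and not in_n:
        return positive[word]
    if in_n and not in_p:
        return -negative[word]
    if in_p and in_n:
        return positive[word] / negative[word]
    return 0

# Hypothetical percentages for one level (positive) and the lower levels (negative)
positive = Counter({'adult': 0.25, 'fun': 0.5})
negative = Counter({'fun': 2.0, 'kids': 1.0})

weights = {w: toy_weight(w, positive, negative) for w in ['adult', 'fun', 'kids', 'zebra']}
print(weights)
```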
In [157]:
weight_dict1 = weight_calculation(0)
weight_dict2 = weight_calculation(1)
weight_dict3 = weight_calculation(2)
weight_dict4 = weight_calculation(3)

In [158]:
weight_dict1

{'abilities': 0.060404172730447014,
 'ability': 0.0649918314188354,
 'able': 0.06957949010722378,
 'access': 0.13533593130745722,
 'account': 0.08436194588091966,
 'achievements': 0.12106321538802671,
 'action': 0.09736031216468675,
 'add': 0.06983436003435647,
 'added': 0.05734573360485476,
 'addictive': 0.11239763786551533,
 'addition': 0.09098856398636956,
 'additional': 0.1386492403601822,
 'ads': 0.11647555669963834,
 'adventure': 0.21026768988446745,
 'adventures': 0.05224833506220101,
 'advertising': 0.1190242559709652,
 'age': 0.09557622267475793,
 'agreement': 0.0565811238234567,
 'allows': 0.08538142558945042,
 'along': 0.060659042657579695,
 'already': 0.07161844952428528,
 'always': 0.13508106138032455,
 'amazing': 0.13992358999584562,
 'android': 0.2538504474241571,
 'angry': 0.07314766908708141,
 'animals': 0.09736031216468675,
 'application': 0.17050798125176816,
 'apps': 0.13992358999584562,
 'arcade': 0.06116878251184508,
 'around': 0.14986351715402044,
 'art': 0.06473

In [159]:
weight_dict2

{'ability': 0.5023359131442325,
 'acceptance': 0.45706823375775385,
 'access': 0.964939042197961,
 'account': 3.0959675613723094,
 'accounts': 0.1305909239307868,
 'achievements': 1.348375345808203,
 'acquired': 0.1305909239307868,
 'action': 0.6706578945119335,
 'ad': 0.1305909239307868,
 'adjusting': 0.1305909239307868,
 'ads': 1.1211873772584622,
 'adventure': 0.4658023921882884,
 'adventures': 1.2497137351393102,
 'advertisements': 0.1305909239307868,
 'advertising': 2.1943581644844423,
 'age': 0.3415884209380781,
 'agree': 0.1305909239307868,
 'agreement': 19.04124643742665,
 'allows': 1.5295003922600514,
 'along': 0.538217049797392,
 'amazing': 2.333254241380315,
 'android': 1.6719312771818582,
 'animations': 0.0979431929480901,
 'areas': 0.0979431929480901,
 'army': 0.0652954619653934,
 'around': 0.6535492747539761,
 'art': 1.5129408407690468,
 'aspects': 0.1305909239307868,
 'assistance': 0.4244205027750571,
 'associated': 0.1305909239307868,
 'attack': 0.1305909239307868,
 'au

In [160]:
weight_dict3

{'absolutely': 0.0792910447761194,
 'access': 0.1753933088216672,
 'account': 0.31720497065823816,
 'achievements': 0.2706945091225163,
 'action': 0.4229584972251048,
 'active': 0.055970149253731345,
 'add': 2.1372535134546244,
 'added': 1.2403491397180764,
 'addictive': 0.7988197035655058,
 'additional': 0.7400829606562773,
 'adult': 0.24953358208955226,
 'advanced': 0.0804570895522388,
 'adventure': 0.2950950067091721,
 'age': 0.845724695627492,
 'agreement': 0.05655638204605792,
 'along': 0.5248695103939921,
 'already': 0.8954740449221862,
 'always': 0.7941610637848494,
 'amazing': 0.4600167771706258,
 'american': 0.05713619402985075,
 'amounts': 0.04547574626865671,
 'amusement': 0.0804570895522388,
 'and/or': 0.10261194029850745,
 'android': 0.4641686179538868,
 'anytime': 0.17257462686567165,
 'anywhere': 0.10960820895522388,
 'applicable': 0.058302238805970144,
 'application': 0.889024782198871,
 'apply': 0.08512126865671642,
 'apps': 0.6250079647867222,
 'around': 0.80933926853

In [161]:
weight_dict4

{}

In [162]:
# write each weight dictionary to its own tab-separated file: weight_dict1.txt ... weight_dict4.txt
weight_dicts = [weight_dict1, weight_dict2, weight_dict3, weight_dict4]
for level, weight_dict in enumerate(weight_dicts, start=1):
    with open("weight_dict%d.txt" % level, 'w') as f:
        f.write('word\tweight\n')
        for k,v in weight_dict.items():
            f.write(str(k)+'\t'+str(v)+'\n')

## 3. Classification

### 3.1 Calculate the summed weight for each level for each app

In [165]:
def rating(row):  
    des = str(row['description'])
    
    tokens = nltk.word_tokenize(des)
    
    weight_dicts = [weight_dict1, weight_dict2, weight_dict3, weight_dict4]
    for level, weight_dict in enumerate(weight_dicts, start=1):
        # the membership test uses the raw token, so only tokens that are
        # already lowercase match the lowercase dictionary keys
        words = [w.lower() for w in tokens if w in weight_dict]
        
        ##normalization: relative frequency of each matched word
        freq = FreqDist(words)
        for k in list(freq):
            freq[k] = float(freq[k])/len(words)
        
        # sum over matched tokens (repeats included) of frequency * weight * 100
        s = 0
        for word in words:
            s += freq[word]*weight_dict[word]*100
        
        # store the score in column s1, s2, s3 or s4
        row['s%d' % level] = s
    
    return row
    

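A worked example of the score for one level, with made-up weights and tokens. Because the sum runs over matched tokens rather than distinct words, a word that occurs c times among n matched tokens contributes c * (c / n) * weight * 100.

```python
from collections import Counter

# Hypothetical weights for one level and a hypothetical tokenized description
weights = {'fun': 0.5, 'free': 0.25}
tokens = ['fun', 'free', 'fun', 'game', 'fun']

# Keep tokens that have a weight, then compute relative frequencies
words = [w for w in tokens if w in weights]
counts = Counter(words)
rel_freq = {w: c / len(words) for w, c in counts.items()}

# Sum over tokens (repeats included), scaled by 100 as in rating()
score = sum(rel_freq[w] * weights[w] * 100 for w in words)
print(score)
```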
In [166]:
df_rating = df.apply(rating,axis = 1)

In [167]:
df_rating

Unnamed: 0,Subcategory,name,apppurl,review_title,review_content,content_rating,content_rating_reason+,icon_url,rating_score,rating_num,current_version,Developer,description,s1,s2,s3,s4
0,Educational,ABC Kids - Tracing & Phonics,https://play.google.com/store/apps/details?id=...,"Perfectly as stated!;Loved it, until;Lovely;Ex...",Perfectly as stated! No ads!! My 3yo can't ac...,Everyone,,http://lh3.googleusercontent.com/tcbtYwJSxe_J3...,4.3,8703.0,1.1.1,RV AppStudios,"Looking for a fun, free, and simple educationa...",209.402769,1415.275474,696.349809,0
1,Educational,Educational Games 4 Kids,https://play.google.com/store/apps/details?id=...,;;love;;;,My kids like to play ; My son love it ; l...,Everyone,,http://lh3.googleusercontent.com/I-3t7gsJoI411...,4.2,475.0,,pescAPPs,New pescAPPs game! This fun application contai...,93.328270,325.644911,319.180208,0
2,Educational,Peppa Pig: Paintbox,https://play.google.com/store/apps/details?id=...,My 4yrs old sister loves this game;Great;Great...,My 4yrs old sister loves this game Download i...,Everyone,,http://lh3.googleusercontent.com/n5BfP13c0JQc7...,3.8,60649.0,1.2.6,Entertainment One,Peppa's Paintbox is a drawing application desi...,74.643260,323.558003,255.318237,0
3,Educational,Preschool Adventures-2,https://play.google.com/store/apps/details?id=...,Can't go to level 4;So fun!;Diggin the work lo...,"Can't go to level 4 Even if i paid, i cant un...",Everyone,,http://lh3.googleusercontent.com/U8rfm5VAWM49H...,4.1,1529.0,1.6.4,forqan smart tech,Education puzzles for 4-5 years old children!A...,155.554901,478.249567,917.841639,0
4,Educational,Fun Kids Cars,https://play.google.com/store/apps/details?id=...,Greaat;;Good for kids;Fun cars games for kids ...,Greaat Finally a game where he is not constan...,Everyone,,http://lh3.googleusercontent.com/4H5TckrAjiNSa...,4,271.0,1.2,razmobi,Race through the city and beach with these hap...,185.902443,1290.831064,610.958273,0
5,Educational,Learning games For babies,https://play.google.com/store/apps/details?id=...,Thank you for this. Plan on purchasing full ve...,Thank you for this. Plan on purchasing full v...,Everyone,,http://lh3.googleusercontent.com/q8TsqAtJFSuCU...,3.9,778.0,1,Preschool & Kindergarten Learning Kids Games,"* Babies game for 1,2,3 years old.* 10 Educati...",86.590114,836.042406,258.641963,0
6,Educational,ABC 123 Tracing for Toddlers,https://play.google.com/store/apps/details?id=...,ABC 123 tracing for toddlers good for kids;Gre...,ABC 123 tracing for toddlers good for kids Le...,Everyone,,http://lh3.googleusercontent.com/ZLmnCT4DggGcK...,4.4,31.0,1,Gameitech - Kids Education Games,Are you Looking for free educational game for ...,278.018340,1092.137466,1361.599975,0
7,Educational,Monster Trucks Game for Kids 2,https://play.google.com/store/apps/details?id=...,The BEST!!;Good game;Fine;;Love it;Best racing...,The BEST!! This is the BEST racing game for m...,Everyone,,http://lh3.googleusercontent.com/_yQHJEWudjm1r...,4.1,2829.0,2,razmobi,"If your kids love all things monster trucks, T...",275.284567,1997.011173,783.147449,0
8,Educational,Baby puzzles,https://play.google.com/store/apps/details?id=...,Very good!;Good work;My one year old;Great unt...,Very good! 2 yr old granddaughter absolutely ...,Everyone,,http://lh3.googleusercontent.com/pPAxW-vh1vyaY...,4.3,4585.0,3.4,AppQuiz,Try out this fun educational learning app with...,36.934650,178.216879,196.581605,0
9,Educational,Piano and music games for kids,https://play.google.com/store/apps/details?id=...,Paying is worth it;Awesome;;Lost purchase;Why;,Paying is worth it You just pay 2.99 for all ...,Everyone,,http://lh3.googleusercontent.com/IXaYgVYRkhFof...,4.7,1372.0,2.4,Bimi Boo Kids - Games for boys and girls LLC,Piano and games for kids free music games for...,138.439060,986.490656,360.380706,0


In [168]:
df_rating.describe()

Unnamed: 0,rating_num,s1,s2,s3,s4
count,3296.0,3296.0,3296.0,3296.0,3296.0
mean,252252.7,146.155002,852.663178,500.587687,0.0
std,1218517.0,132.727112,802.474302,589.748657,0.0
min,4.3,0.0,0.0,0.0,0.0
25%,4461.0,59.963281,322.016603,195.634253,0.0
50%,31490.5,109.607554,628.255798,359.550659,0.0
75%,148881.0,189.199156,1108.931867,630.386513,0.0
max,40703810.0,1702.339495,11204.168735,12567.361136,0.0



### 3.2 Set threshold, generate model rating 

In [173]:
#set each level's threshold at the 80th percentile of its scores
a1 = np.percentile(df_rating['s1'],80)
a2 = np.percentile(df_rating['s2'],80)
a3 = np.percentile(df_rating['s3'],80)
a4 = np.percentile(df_rating['s4'],80)

In [175]:
def ma(row):
    # assign the highest level whose score exceeds its threshold; default to level 1
    if row['s4']>a4:
        level = 4
    elif row['s3']>a3:
        level = 3
    elif row['s2']>a2:
        level = 2
    else:
        level = 1
    
    
    row['ma'] = level
    return row

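The threshold cascade can be traced on toy scores. In this sketch the `percentile` helper is a plain-Python stand-in for `np.percentile` with its default linear interpolation, and `classify` mirrors the branches of `ma`; both names and all scores are made up. With every `s4` equal to 0, as in the run recorded above, the level 4 threshold is also 0 and the strict `>` comparison never succeeds, so no app is assigned level 4.

```python
def percentile(values, q):
    """Linear-interpolation percentile (plain-Python stand-in for np.percentile)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def classify(s2, s3, s4, a2, a3, a4):
    """Highest level whose score exceeds its threshold, else level 1."""
    if s4 > a4:
        return 4
    if s3 > a3:
        return 3
    if s2 > a2:
        return 2
    return 1

# Hypothetical per-app scores; s4 is all zeros
s2 = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
s3 = [5.0, 50.0, 15.0, 25.0, 35.0, 45.0]
s4 = [0.0] * 6

a2, a3, a4 = (percentile(s, 80) for s in (s2, s3, s4))
levels = [classify(x2, x3, x4, a2, a3, a4) for x2, x3, x4 in zip(s2, s3, s4)]
print(a2, a3, a4, levels)
```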
In [176]:
df_ma = df_rating.apply(ma,axis=1)

In [177]:
df_ma

Unnamed: 0,Subcategory,name,apppurl,review_title,review_content,content_rating,content_rating_reason+,icon_url,rating_score,rating_num,current_version,Developer,description,s1,s2,s3,s4,ma
0,Educational,ABC Kids - Tracing & Phonics,https://play.google.com/store/apps/details?id=...,"Perfectly as stated!;Loved it, until;Lovely;Ex...",Perfectly as stated! No ads!! My 3yo can't ac...,Everyone,,http://lh3.googleusercontent.com/tcbtYwJSxe_J3...,4.3,8703.0,1.1.1,RV AppStudios,"Looking for a fun, free, and simple educationa...",209.402769,1415.275474,696.349809,0,2
1,Educational,Educational Games 4 Kids,https://play.google.com/store/apps/details?id=...,;;love;;;,My kids like to play ; My son love it ; l...,Everyone,,http://lh3.googleusercontent.com/I-3t7gsJoI411...,4.2,475.0,,pescAPPs,New pescAPPs game! This fun application contai...,93.328270,325.644911,319.180208,0,1
2,Educational,Peppa Pig: Paintbox,https://play.google.com/store/apps/details?id=...,My 4yrs old sister loves this game;Great;Great...,My 4yrs old sister loves this game Download i...,Everyone,,http://lh3.googleusercontent.com/n5BfP13c0JQc7...,3.8,60649.0,1.2.6,Entertainment One,Peppa's Paintbox is a drawing application desi...,74.643260,323.558003,255.318237,0,1
3,Educational,Preschool Adventures-2,https://play.google.com/store/apps/details?id=...,Can't go to level 4;So fun!;Diggin the work lo...,"Can't go to level 4 Even if i paid, i cant un...",Everyone,,http://lh3.googleusercontent.com/U8rfm5VAWM49H...,4.1,1529.0,1.6.4,forqan smart tech,Education puzzles for 4-5 years old children!A...,155.554901,478.249567,917.841639,0,3
4,Educational,Fun Kids Cars,https://play.google.com/store/apps/details?id=...,Greaat;;Good for kids;Fun cars games for kids ...,Greaat Finally a game where he is not constan...,Everyone,,http://lh3.googleusercontent.com/4H5TckrAjiNSa...,4,271.0,1.2,razmobi,Race through the city and beach with these hap...,185.902443,1290.831064,610.958273,0,2
5,Educational,Learning games For babies,https://play.google.com/store/apps/details?id=...,Thank you for this. Plan on purchasing full ve...,Thank you for this. Plan on purchasing full v...,Everyone,,http://lh3.googleusercontent.com/q8TsqAtJFSuCU...,3.9,778.0,1,Preschool & Kindergarten Learning Kids Games,"* Babies game for 1,2,3 years old.* 10 Educati...",86.590114,836.042406,258.641963,0,1
6,Educational,ABC 123 Tracing for Toddlers,https://play.google.com/store/apps/details?id=...,ABC 123 tracing for toddlers good for kids;Gre...,ABC 123 tracing for toddlers good for kids Le...,Everyone,,http://lh3.googleusercontent.com/ZLmnCT4DggGcK...,4.4,31.0,1,Gameitech - Kids Education Games,Are you Looking for free educational game for ...,278.018340,1092.137466,1361.599975,0,3
7,Educational,Monster Trucks Game for Kids 2,https://play.google.com/store/apps/details?id=...,The BEST!!;Good game;Fine;;Love it;Best racing...,The BEST!! This is the BEST racing game for m...,Everyone,,http://lh3.googleusercontent.com/_yQHJEWudjm1r...,4.1,2829.0,2,razmobi,"If your kids love all things monster trucks, T...",275.284567,1997.011173,783.147449,0,3
8,Educational,Baby puzzles,https://play.google.com/store/apps/details?id=...,Very good!;Good work;My one year old;Great unt...,Very good! 2 yr old granddaughter absolutely ...,Everyone,,http://lh3.googleusercontent.com/pPAxW-vh1vyaY...,4.3,4585.0,3.4,AppQuiz,Try out this fun educational learning app with...,36.934650,178.216879,196.581605,0,1
9,Educational,Piano and music games for kids,https://play.google.com/store/apps/details?id=...,Paying is worth it;Awesome;;Lost purchase;Why;,Paying is worth it You just pay 2.99 for all ...,Everyone,,http://lh3.googleusercontent.com/IXaYgVYRkhFof...,4.7,1372.0,2.4,Bimi Boo Kids - Games for boys and girls LLC,Piano and games for kids free music games for...,138.439060,986.490656,360.380706,0,1


In [178]:
df_ma.to_csv('3296_Android_Apps_ma_result.csv')