# Lab 1: Information Retrieval

### Step 1: Retrieve app names from Google Play

First we get the links for the apps that we want to collect information about.

In [1]:
import requests
import bs4 as bs
import re
import pandas as pd

In [142]:
base_url = 'https://play.google.com'
category_urls = [
    'https://play.google.com/store/apps?hl=en',
    'https://play.google.com/store/apps/top?hl=en',
    'https://play.google.com/store/apps/new?hl=en',
    'https://play.google.com/store/apps/category/FAMILY?hl=en',
    'https://play.google.com/store/apps/category/GAME?hl=en',
]

In [143]:
links = list()
for url in category_urls:
    resp = requests.get(url)
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    links += [link['href'] for link in soup.find_all("a", class_="see-more", href=True)]

In [144]:
def get_game_urls(sub_url):
    resp = requests.get(base_url + sub_url)
    appreg = r'href=\"(/store/apps/details.*?)\"'
    appre = re.compile(appreg)
    app_url_list = re.findall(appre, resp.text)
    return list(set(app_url_list))
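
To see what the regular expression in `get_game_urls` picks up, here is a minimal sketch run against a made-up HTML fragment instead of a live response. The fragment and the `com.example.*` app ids are hypothetical. The non-greedy `.*?` stops at the first closing quote, and `set` removes links that occur more than once.

```python
import re

# Hypothetical HTML fragment standing in for resp.text
sample_html = (
    '<a href="/store/apps/details?id=com.example.one">One</a>'
    '<a href="/store/apps/details?id=com.example.two">Two</a>'
    '<a href="/store/apps/details?id=com.example.one">One again</a>'
    '<a href="/store/apps/category/GAME">Games</a>'
)

# Same pattern as in get_game_urls: capture the href value of app detail links
app_pattern = re.compile(r'href="(/store/apps/details.*?)"')

matches = app_pattern.findall(sample_html)  # includes the repeated link
unique_urls = sorted(set(matches))          # deduplicated, sorted for display

print(len(matches), 'matches,', len(unique_urls), 'unique')
for url in unique_urls:
    print(url)
```

The category link is ignored because it does not start with `/store/apps/details`.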

In [145]:
app_suburls = list()
for link in links:
    app_suburls += get_game_urls(link)
app_suburls = list(set(app_suburls))

In [146]:
print('we have collected {} unique app urls'.format(len(app_suburls)))

we have collected 1054 unique app urls


Now we can gather the app title and description by following the collected links.

In [147]:
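def get_app_name_and_desc(app_url):
    """Return [title, description] for an app page, or [] if fetching or parsing fails"""
    try:
        resp = requests.get(base_url + app_url)
        soup = bs.BeautifulSoup(resp.text, 'lxml')
        title = soup.find('h1', {'itemprop' : 'name'}).text
        desc = soup.find('div', {'jsname' : 'sngebd'}).text
        return [title, desc]
    except Exception:
        # e.g. the request failed, or an element was missing and find() returned None
        return []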

In [148]:
data = list()
for counter, url in enumerate(app_suburls, start=1):
    title_and_desc = get_app_name_and_desc(url)
    if counter % 100 == 0:
        print(f'processed {counter} apps')
    if title_and_desc:
        data.append(title_and_desc)

In [149]:
df = pd.DataFrame(columns=['title', 'desc'], data=data)

### Step 2: Clean our collected app descriptions

We lowercase, tokenize and stem all the description strings for our apps, dropping stop words and non-alphabetic tokens along the way.

In [3]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/simja114/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /home/simja114/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [79]:
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import PorterStemmer

In [80]:
ps = PorterStemmer()

In [90]:
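# Build the stop word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

def clean(desc):
    """Lowercase, tokenize, drop stop words and non-alphabetic tokens, then stem a string"""
    desc = desc.lower()
    words = word_tokenize(desc)
    words = ' '.join([ps.stem(w) for w in words if w not in stop_words and w.isalpha()])
    return words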

In [91]:
df['desc'] = df['desc'].apply(clean)

Some investigation showed that we have a few duplicate titles, so we remove those.

In [94]:
df = df.drop_duplicates(subset='title')

### Step 3: Create tfidf model

The inverse document frequency (idf) is calculated as:

$$idf(t, D, N) = \log{\frac{N}{|\{d \in D : t \in d\}|}}$$

where N is the total number of documents in the corpus D, and the denominator is the number of documents d that contain the term t.

The term frequency is calculated as:

$$\frac{tf_d(t)}{\max(tf_{d^*}(t))}$$

The weight vectors, tfidf, are now calculated as:

$$tfidf(t, d, D, N) = tf_d(t) \cdot idf(t, D, N)$$
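
As a worked example of the formulas above, here is a minimal pure-Python sketch on a hypothetical three-document toy corpus. The corpus and the helper names (`toy_idf`, `toy_tf`, `toy_tfidf`) are made up for illustration. One assumption to flag: the maximum in the term frequency is read here as running over all documents d* in the corpus. Another common convention divides by the count of the most frequent term in the same document instead.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-cleaned descriptions
toy_corpus = [
    "solitair card game",
    "dragon game dragon hero",
    "card game card card",
]
toy_docs = [Counter(doc.split()) for doc in toy_corpus]
n_docs = len(toy_docs)

def toy_idf(term):
    """log(N / number of documents that contain the term)."""
    doc_freq = sum(1 for doc in toy_docs if term in doc)
    return math.log(n_docs / doc_freq)

def toy_tf(term, doc):
    """Count of the term in doc, divided by its largest count in any document
    (assumed reading of the max in the formula)."""
    max_count = max(d[term] for d in toy_docs)
    return doc[term] / max_count

def toy_tfidf(term, doc):
    return toy_tf(term, doc) * toy_idf(term)

print(round(toy_tfidf("dragon", toy_docs[1]), 3))  # 1.099
print(toy_tfidf("game", toy_docs[0]))              # 0.0
```

The term "game" occurs in every document, so its idf is log(1) = 0 and it contributes nothing to any weight vector. Both helpers are only defined for terms that occur somewhere in the corpus.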

By calculating the cosine similarity of a query vector to the tfidf weight vectors, we can find the most similar documents to our query.

In [96]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
    use_idf=True, 
    smooth_idf=True,
    analyzer='word'
)
corpus = df['desc'].values
tfidf = vectorizer.fit_transform(corpus)
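
Note that `TfidfVectorizer` does not apply exactly the textbook formulas. According to the scikit-learn documentation, with `smooth_idf=True` the idf is ln((1 + N) / (1 + df(t))) + 1, the term frequency is the raw count, and each document vector is scaled to unit L2 norm by default. A minimal pure-Python sketch of that weighting, on a hypothetical toy corpus with made-up helper names:

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-cleaned descriptions
sample_corpus = ["solitair card game", "dragon game dragon hero", "card game card card"]
term_counts = [Counter(doc.split()) for doc in sample_corpus]
vocab = sorted(set().union(*term_counts))
n_samples = len(term_counts)

def smoothed_idf(term):
    """idf with add-one smoothing: ln((1 + N) / (1 + df)) + 1."""
    doc_freq = sum(1 for counts in term_counts if term in counts)
    return math.log((1 + n_samples) / (1 + doc_freq)) + 1

def weight_vector(counts):
    """Raw count times smoothed idf, scaled to unit L2 norm."""
    raw = [counts[term] * smoothed_idf(term) for term in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

vectors = [weight_vector(counts) for counts in term_counts]
```

With this smoothing, a term that occurs in every document keeps an idf of 1 rather than 0, so it is down-weighted but not removed.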

In [97]:
print(tfidf.shape)

(1032, 11997)


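Let's try to calculate the cosine similarity of the first document with all the others. Since `TfidfVectorizer` L2-normalizes each document vector by default, the plain dot product computed by `linear_kernel` equals the cosine similarity.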

In [98]:
from sklearn.metrics.pairwise import linear_kernel

In [99]:
cosine_similarities = linear_kernel(tfidf[0], tfidf).flatten()

If we sort the similarities, we can find the indices of the most similar documents. The last index in the sorted order should be 0, since the similarity of the first document with itself is one.
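
The two ideas used here, the dot product of unit-length vectors as cosine similarity and the argsort-based top-k selection, can be sketched without scikit-learn. The weight vectors below are hypothetical, and `top_k` is a made-up helper that mirrors `argsort()[-k:]` followed by `reversed`.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k(scores, k):
    """Indices of the k largest scores, best first."""
    ascending = sorted(range(len(scores)), key=lambda i: scores[i])
    return list(reversed(ascending[-k:]))

# Hypothetical weight vectors, one row per document
weights = [[1.0, 2.0, 0.0], [2.0, 1.0, 0.0], [0.0, 1.0, 3.0], [3.0, 0.0, 0.0]]
unit_weights = [l2_normalize(w) for w in weights]

# Similarity of document 0 to every document, via dot products of unit vectors
similarities = [dot(unit_weights[0], w) for w in unit_weights]

print(top_k(similarities, 3))  # [0, 1, 3]
```

Document 0 comes first because its similarity with itself is one, just as in the Solitaire example.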

In [100]:
idxs = cosine_similarities.argsort()[-5:]

In [101]:
for i in reversed(idxs):
    display(df.iloc[i])

title                                            Solitaire
desc     solitair solitair game world solitair card gam...
Name: 0, dtype: object

title                                           Solitaire!
desc     play free solitair klondik solitair patienc ca...
Name: 392, dtype: object

title                                     Spider Solitaire
desc     spider solitair mobilitywar fun challeng card ...
Name: 533, dtype: object

title                                    Pyramid Solitaire
desc     play classic free card game pyramid solitair a...
Name: 883, dtype: object

title    Solitaire TriPeaks: Play Free Solitaire Card G...
desc     win play compet solitair tripeak fun free card...
Name: 448, dtype: object

Looks really good! The most similar documents are all versions of Solitaire. Let's pack this into a function.

In [102]:
def find_k_similar_docs(query, vectorizer, tfidf, k):
    """Prints the k most similar documents to the input query"""
    query = clean(query)
    query_vector = vectorizer.transform([query])
    cosine_similarities = linear_kernel(query_vector, tfidf).flatten()
    idxs = cosine_similarities.argsort()[-k:]
    for i in reversed(idxs):
        display(df.iloc[i])

### Query Process

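We try two example queries, a keyword list and a full sentence:

e.g. "Dragon, Control, hero, running"

e.g. "The hero controls the dragon to run."

After cleaning, both reduce to the same stems (dragon, control, hero, run), so they map to the same query vector and retrieve the same documents.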



In [104]:
find_k_similar_docs("Dragon, Control, hero, running", vectorizer, tfidf, 5)

title                       Baby Dragons: Ever After High™
desc     babi dragon arriv ever high time dragon game h...
Name: 352, dtype: object

title                                          Dragon City
desc     readi take hottest battl game train dragon one...
Name: 149, dtype: object

title                                    School of Dragons
desc     join hiccup toothless astrid rest vike school ...
Name: 965, dtype: object

title                                       Merge Dragons!
desc     discov dragon legend magic quest secret land e...
Name: 82, dtype: object

title                                 Dragon Mania Legends
desc     dragon mania legend anyon want pet dragon obvi...
Name: 853, dtype: object

In [105]:
find_k_similar_docs("The hero controls the dragon to run.", vectorizer, tfidf, 5)

title                       Baby Dragons: Ever After High™
desc     babi dragon arriv ever high time dragon game h...
Name: 352, dtype: object

title                                          Dragon City
desc     readi take hottest battl game train dragon one...
Name: 149, dtype: object

title                                    School of Dragons
desc     join hiccup toothless astrid rest vike school ...
Name: 965, dtype: object

title                                       Merge Dragons!
desc     discov dragon legend magic quest secret land e...
Name: 82, dtype: object

title                                 Dragon Mania Legends
desc     dragon mania legend anyon want pet dragon obvi...
Name: 853, dtype: object