<div class="alert alert-info">
    
➡️ Before you start, make sure that you are familiar with the **[study guide](https://liu-nlp.ai/text-mining/logistics/)**, in particular the rules around **cheating and plagiarism** (found in the course memo).

➡️ If you use code from external sources (e.g. StackOverflow, ChatGPT, ...) as part of your solutions, don't forget to add a reference to these source(s) (for example as a comment above your code).

➡️ Make sure you fill in all cells that say **`YOUR CODE HERE`** or **YOUR ANSWER HERE**. You normally shouldn't need to modify any of the other cells.

</div>

# L1: Information Retrieval

In this lab you will apply basic techniques from information retrieval to implement the core of a minimalistic search engine. The data for this lab consists of a collection of app descriptions scraped from the [Google Play Store](https://play.google.com/store/apps?hl=en). From this collection, your search engine should retrieve those apps whose descriptions best match a given query under the vector space model.

In [1]:
# Define some helper functions that are used in this notebook
from IPython.display import display, HTML

def success():
    display(HTML('<div class="alert alert-success"><strong>Checks have passed!</strong></div>'))

## Dataset

The app descriptions come in the form of a compressed [JSON](https://en.wikipedia.org/wiki/JSON) file. Start by loading this file into a [Pandas](https://pandas.pydata.org) [DataFrame](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe).

In [2]:
import bz2
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', 500)

with bz2.open('app-descriptions.json.bz2', mode='rt', encoding='utf-8') as source:
    df = pd.read_json(source, encoding='utf-8')

In Pandas, a DataFrame is a table with indexed rows and labelled columns of potentially different types. You can access data in a DataFrame in various ways, including by row and column. To give an example, the code in the next cell shows the rows with index labels 200–205 (label-based slicing with `.loc` includes both endpoints):

In [3]:
df.loc[200:205]

Unnamed: 0,name,description
200,Brick Breaker Star: Space King,"Introducing the best Brick Breaker game that everyone can enjoy.\nEnjoy various missions and addictively simple play control.\n\n[Features]\n- Hundreds of stages and various missions\n- No limit to play such as Heart, play as much as you can!\n- 5 kinds of various items and items reinforcement system\n- No network required\n- game file is as low as 20M, light-weight download!\n- supports tablet screen\n- supports Google Play Leaderboards, Achievement, Multiplay\n- supports 14 languages\n\nHo..."
201,Brick Classic - Brick Game,"Classic Brick Game!\n\nBrick Classic is a popular and addictive puzzle game!\n\nHow to play?\n- Simply drag the bricks to move them.\n- Create full lines on the grid vertically or horizontally to break bricks.\n\nTips:\n- Classic brick game without time limits.\n- Place the bricks in a reasonable position.\n- The more brick break, the more scores you have.\n- Bricks can't be rotated.\n\nWho's the best brick breaker? Challenge it now!!!"
202,Bricks Breaker - Glow Balls,"Bricks Breaker - Glow Balls is a addictive and challenging brick game.\nJust play it to relax your brain. Be focus on breaking bricks and you will find it more funny and exciting.\n\nHow to play\n- Hold the screen with your finger and move to aim.\n- Find best positions and angles to hit all bricks.\n- When the durability of brick reaches 0, destroyed.\n- Never let bricks reach the bottom or game is over.\n\nFeatures\n- Colorful glow skins.\n- Free to play.\n- Easy game controls with one fin..."
203,Bricks Breaker Quest,"How to play\n- The ball flies to wherever you touched.\n- Clear the stages by removing bricks on the board.\n- Break the bricks and never let them hit the bottom.\n- Find best positions and angles to hit every brick.\n\nFeature\n- Free to play\n- Tons of stages\n- Various types of balls\n- Easy to play, Simplest game system, Designed for one handheld gameplay.\n- Off-line (without internet connection) gameplay supported \n- Multi-play supported\n- Tablet device supported\n- Achievement & lea..."
204,Brothers in Arms® 3,"Fight brave soldiers from around the globe on the frenzied multiplayer battlegrounds of World War 2 or become Sergeant Wright and experience a dramatic, life-changing single-player journey, in the aftermath of the D-Day invasion.\n\nCLIMB THE ARMY RANKS IN MULTIPLAYER \n> 4 maps to master and enjoy. \n> 2 gameplay modes to begin with: Free For All and Team Deathmatch.\n> Unlock game-changing perks by playing with each weapon class!\n> A soldier’s only as deadly as his weapon. Be sure to upgr..."
205,Brown Dust - Tactical RPG,"The Empire has fallen, and the Age of Great Mercenaries Now Begins!\nCreate Your Ultimate Team And Strike Down Your Enemies!\n\nCAPTIVATING AND STUNNING ARTWORK\n- Experience the high-quality anime illustrations you have never seen before.\n- Meet Brown Dust's charming Mercenaries now.\n\nASSEMBLE LEGENDARY MERCENARIES\n- Over 300 Mercenaries and a Variety of Skills.\n- Discover the Unique Mercenaries, 6 Devils and Dominus Octo.\n- All Mercenaries can reach max level and the highest rank.\n\..."


As you can see, there are two labelled columns: `name` (the name of the app) and `description` (a textual description). The next cell shows how to access only the description field from row 200:

In [4]:
df.loc[200, 'description']

'Introducing the best Brick Breaker game that everyone can enjoy.\nEnjoy various missions and addictively simple play control.\n\n[Features]\n- Hundreds of stages and various missions\n- No limit to play such as Heart, play as much as you can!\n- 5 kinds of various items and items reinforcement system\n- No network required\n- game file is as low as 20M, light-weight download!\n- supports tablet screen\n- supports Google Play Leaderboards, Achievement, Multiplay\n- supports 14 languages\n\nHomepage:\nhttps://play.google.com/store/apps/dev?id=4931745640662708567\n\nFacebook: \nhttps://www.facebook.com/spcomesgames/'
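
The difference between label-based and position-based access is easy to trip over, so here is a minimal sketch on a made-up table (`toy_df` and its contents are hypothetical and not part of the lab data). It shows that a `.loc` slice includes its end label, whereas an `.iloc` slice excludes its end position:

```python
import pandas as pd

# Hypothetical toy table whose index labels start at 200, like the rows shown above
toy_df = pd.DataFrame(
    {
        "name": ["App A", "App B", "App C", "App D"],
        "description": ["first", "second", "third", "fourth"],
    },
    index=[200, 201, 202, 203],
)

by_label = toy_df.loc[200:202]      # label-based: both 200 and 202 are included
by_position = toy_df.iloc[0:2]      # position-based: position 2 is excluded
single_cell = toy_df.loc[201, "description"]  # one row label, one column label

print(len(by_label), len(by_position), single_cell)
```

With these toy values the slice by label has three rows, the slice by position has two, and the single cell is the string `second`.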

## Problem 1: What's in a vector?

We start by vectorising the data; more specifically, we map each app description to a tf–idf vector. This is very simple with a library like [scikit-learn](https://scikit-learn.org/stable/), which provides a [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class for exactly this purpose. If we instantiate this class and call `fit_transform()` on all of our app descriptions, scikit-learn will preprocess and tokenize each app description, compute tf–idf values for each of them, and return a vectorised representation:

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['description'])
X

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 267110 stored elements and shape (1614, 27877)>
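
To make the numbers inside such a matrix less opaque, the following sketch recomputes tf–idf by hand on a tiny made-up corpus (`toy_docs` is hypothetical). It assumes scikit-learn's default settings as documented for `TfidfVectorizer`: raw term counts, the smoothed weight idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalisation of every row. It is meant as an illustration of those defaults, not as a description of everything the class does internally.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus (not the app descriptions)
toy_docs = ["stack the pancakes", "stack the bricks", "break the bricks"]

# Term counts: one row per document, one column per vocabulary token
counts = CountVectorizer().fit_transform(toy_docs).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # in how many documents each token occurs
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed idf (assumed default)

manual = counts * idf                             # tf * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalise each row

toy_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, toy_tfidf))
```

Tokens that occur in every document (here `the`) receive the smallest idf weight, which is why frequent function words end up with low tf–idf values.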

Let’s pick the app "Pancake Tower", which has a rather short description text, to see how it has been vectorised:

In [6]:
# We can use 'toarray' to convert the sparse matrix object into a "normal" array
vec = X[1032].toarray()[0]

# The app description & its corresponding vector
df.loc[1032, 'description'], vec

("Let's see how many pancakes you can pile up!!",
 array([0., 0., 0., ..., 0., 0., 0.], shape=(27877,)))

That's not very informative yet. We know that the vector contains tf–idf values, and that each dimension of the vector corresponds to a token in the vectorizer’s vocabulary; let's extract these for this specific example.

Your **first task** is to find out how to access the `vectorizer`’s vocabulary, for example by [checking the documentation of `TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), and print all the tokens that are represented in the vector with a tf–idf value greater than zero (i.e., only the tokens that are actually part of this app’s description) _in descending order of the tf–idf values_. In other words, the token with the highest tf–idf value should be at the top of your output, and the token with the lowest tf–idf value at the bottom. Before you implement this, think about what you would expect the output to look like, for example which words you would expect to have the highest/lowest tf–idf values in this example.

Your final output should look something like this:

```
<token 1>: <tf-idf value 1>
<token 2>: <tf-idf value 2>
...
```

In [7]:
"""Print the tokens and their tf–idf values, in descending order."""

# Get the tf-idf vector for the document with index 1032
vec = X[1032].toarray()[0]

def sorted_tokens(vectorizer, vec):
    """Print the tokens and their tf–idf values, in descending order."""
    # Map each column index back to its token
    feature_names = vectorizer.get_feature_names_out()
    # Keep only the dimensions with a non-zero tf-idf value
    nonzero_idx = vec.nonzero()[0]
    tokens = [(feature_names[i], vec[i]) for i in nonzero_idx]
    # Sort by tf-idf value in descending order and print
    tokens_sorted = sorted(tokens, key=lambda x: x[1], reverse=True)
    for tok, val in tokens_sorted:
        print(f"{tok}: {val}")

sorted_tokens(vectorizer, vec)

pancakes: 0.6539332651185913
pile: 0.5304701435508047
let: 0.2615287714771797
see: 0.2557630827415271
many: 0.23491959669849022
how: 0.21153246225085887
up: 0.17216837691451817
can: 0.13047602895910532
you: 0.10276923239718011


## Problem 2: Finding the nearest vectors

To build a small search engine, we need to be able to turn _queries_ (for example the string "pile up pancakes") into _query vectors_, and then find out which of our app description vectors are closest to the query vector.

For the first part (turning queries into query vectors), we can simply re-use the `vectorizer` that we used for the app descriptions. For the second part, an easy way to find the closest vectors is to use scikit-learn’s [NearestNeighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html) class. This class needs to be _fit_ on a set of vectors (the "training set"; in our case the app descriptions) and can then be used with any vector to find its _nearest neighbors_ in the vector space.

**First,** instantiate and fit a class that returns the _ten (10)_ nearest neighbors:

In [8]:
"""Instantiate and fit a class that returns the 10 nearest neighboring vectors."""

from sklearn.neighbors import NearestNeighbors
# Fit a NearestNeighbors model on the tf-idf matrix X
nn = NearestNeighbors(n_neighbors=10)
nn.fit(X)

parameter,value
n_neighbors,10
radius,1.0
algorithm,'auto'
leaf_size,30
metric,'minkowski'
p,2
metric_params,None
n_jobs,None


**Second,** implement a function that uses the vectorizer and the fitted class to find the nearest neighbours for a given query string:

In [9]:
def search(query):
    """Find the nearest neighbors in `df` for a query string.

    Arguments:
      query (str): A query string.

    Returns:
      The 10 apps (with name and description) most similar (in terms of
      cosine similarity) to the given query as a Pandas DataFrame.
    """
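    # Transform the query to a tf-idf vector and find its nearest neighbors.
    # Note: `nn` uses the default Euclidean metric, so `score` below is a
    # distance (smaller = closer). TfidfVectorizer L2-normalises its vectors by
    # default, so this ranking agrees with the ranking by cosine similarity.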
    qv = vectorizer.transform([query])
    distances, indices = nn.kneighbors(qv)
    res = df.iloc[indices[0]][['name', 'description']].copy()
    res['score'] = distances[0]
    return res.reset_index(drop=True)
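
The docstring of `search()` speaks of cosine similarity, while the fitted `NearestNeighbors` model reports `metric='minkowski'` with `p=2`, i.e. Euclidean distance. For unit-length vectors the two are linked by d² = 2 · (1 − cos), so both give the same ranking. The sketch below checks this relation on a hypothetical toy corpus; it assumes the default L2 normalisation of `TfidfVectorizer`, and all names in it are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy corpus (not the app descriptions)
toy_docs = [
    "pile up the pancakes",
    "pile up bricks",
    "eat pancakes",
    "drive the train",
]
toy_vectorizer = TfidfVectorizer()          # rows are L2-normalised by default
toy_X = toy_vectorizer.fit_transform(toy_docs)
toy_query = toy_vectorizer.transform(["pile up pancakes"])

k = len(toy_docs)
euclid_dist, euclid_idx = NearestNeighbors(n_neighbors=k).fit(toy_X).kneighbors(toy_query)
cosine_dist, cosine_idx = (
    NearestNeighbors(n_neighbors=k, metric="cosine").fit(toy_X).kneighbors(toy_query)
)

# Cosine *distance* is 1 - cosine similarity, hence d_euclid^2 == 2 * d_cosine
print(np.allclose(euclid_dist ** 2, 2 * cosine_dist))
print(toy_docs[euclid_idx[0][0]])
```

A practical consequence is that the `score` column can be read either as a distance (smaller is better) or converted into a cosine similarity via 1 − d² / 2.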

### 🤞 Test your code

Test your implementation by running the following cell, which will sanity-check your return value and show the 10 best search results for the query _"pile up pancakes"_:

In [10]:
"""Check that searching for "pile up pancakes" returns a DataFrame with ten results,
   and that the top result is "Pancake Tower"."""

result = search('pile up pancakes')
display(result)
assert isinstance(result, pd.DataFrame), "search() function should return a Pandas DataFrame"
assert len(result) == 10, "search() function should return 10 search results"
assert result.iloc[0]["name"] == "Pancake Tower", "Top search result should be 'Pancake Tower'"
success()

Unnamed: 0,name,description,score
0,Pancake Tower,Let's see how many pancakes you can pile up!!,0.530172
1,Cooking School: Games for Girls,"Children like to help their parents. They especially like to help with cooking . When there is a cooking in the kitchen, it is no way to play. But cooking is a complicated process and often it ends up with a huge mess in the kitchen. But what if you are so eager to cook pancakes, cake or cupcakes? How to cook all that without doing a cleaning after? We have a solution! Home Cooking School with our curious Hippo has opened especially for parents and children! We do not only cook food here. We...",1.289785
2,"Hell‚Äôs Cooking ‚Äî crazy chef burger, kitchen fever","‚≠ê ‚≠ê ‚≠ê ‚≠ê ‚≠ê New world of crazy cooking is here. Feel what it means to be a master chef who prepares fantastic fast food in a prominent king kitchen! If you haven't ever tried yourself as a hamburger chef cook, it's possibly the best time for making diner. Download and launch Hell's Cooking ‚Äî crazy chef burger, kitchen fever HD game and get prepared to jump into a fever and adventurous perfect world of burgers.\n\nNew girls game Hell's Cooking gives you lots of opportunities for your crazy cafe...",1.369161
3,Solitaire,"Solitaire Free by Solitaire Card Games is the #1 klondike solitaire games on android. The solitaire Free is popular and classic card games you know and love.\n\nWe carefully designed a fresh solitaire free modern look, woven into the wonderful solitaire classic feel that everyone loves. \n\nExperience the crisp, clear, and easy to read cards, simple and quick animations, and subtle sounds, in either landscape or portrait views. \n\nYou can move cards with a single tap or drag them to their d...",1.371386
4,Rummy - Free,"Play the famous Rummy card game on your Android Smartphone or Tablet !! \n\nPlay rummy with 2, 3, or 4 players against simulated opponents playing with high-level artificial intelligence. \nThere are a number of rules that can be modified, making this game very faithful to the original. \n\n*** MANY VARIATIONS INCLUDED *** \n\nMany rummy variations are included in the application: \n\n- From 2 to 4 players. \n- Choose the AI level of opponents. \n- Number of cards dealt to each player (from ...",1.373759
5,Sago Mini Trucks and Diggers,"Drive a dump truck with Rosie the hamster! Pile dirt high and dig deep in the ground with diggers, cranes and bulldozers. Build a home for a new friend! Choose a barn, a castle or even a cupcake-house. Don’t forget to add the finishing touches for the proud owner.\n\nOn this construction site, kids love being the boss. With six mighty machines and piles of dirt, you can build all day! Part of the award-winning suite of Sago Mini apps, this app puts kids in charge.\n\nSago Mini apps have no i...",1.379902
6,Dr. Panda's Ice Cream Truck,"Chocolate? Vanilla? Strawberry? All three!? You decide! In Dr. Panda‚Äôs Ice Cream Truck you can mix up all sorts of different flavors with cookies, chocolate, nuts and more to make the perfect ice cream‚Äîhundreds of combinations in all.\n\nScoop it!\nThese animals love ice cream, and will eat as much (or little) as you want to serve them. You can make scoops big or small and pile them as high as you want‚Äîusing any of the ice cream you‚Äôve created!\n\nToppings galore!\nUse chocolate syrup, cooki...",1.381918
7,Turbo Dismount™,"The legendary crash simulator is now on Google Play!\n\nPerform death-defying motor stunts, crash into walls, create traffic pile-ups of epic scale - and share the fun!\n\nTurbo Dismount™ is a kinetic tragedy about Mr. Dismount and the cars who love him. It is the official sequel to the wildly popular and immensely successful personal impact simulator - Stair Dismount™. \n\nFEATURES:\n* Flinch-inducing crash physics\n* Crunchy sound effects\n* Delicious slow-mo replay system\n* Multiple vehi...",1.382737
8,UNO!™,"Play the world’s number one card game like never before. UNO!™ has all-new rules, tournaments, adventures and so much more! At home or on the move, jump into games instantly. Whether an UNO!™ veteran or completely new, take on challenges and reap the rewards. UNO!™ is the ultimate competitive family-friendly card game.\n- Play classic UNO!™ or use tons of popular house rules!\n- Connect anytime, anywhere with friends from around the world! \n- Two heads are better than one in 2v2 mode. Use t...",1.382758
9,TO-FU Oh!SUSHI,"You are the veritable sushi master! Prepare your own fun sushi with “Daizu” the skunk!\n\nThis app is designed to allow children to be creative by decorating their original sushi.\n\nServe your delicious, mysterious or impossible sushi to the people of “Tofu Island”! \n\nHow about creating sushi that is totally original and serve it to your beloved guests? Spice it up with tons of wasabi or even sprinkle chocolate and gummy bears for those sweet lovers.\nFeel free to make any kind of sushi y...",1.385036


Before continuing with the next problem, play around a bit with this simple search functionality by trying out different search queries, and see if the results look like what you would expect:

In [11]:
# Example: try out your own queries!
search("dodge trains")

Unnamed: 0,name,description,score
0,Train Conductor World,"Master and manage the chaos of international railway traffic as the ultimate railroad tycoon. Build the rail network of your dreams; lay rails and solve the railroad puzzle with branching and forking roads at every turn. Become the richest manager and pick your path, do you optimise to the micro level, planning routes and managing the timetable, or sit idle letting your business keep earning while you sleep! \n\nGet in the driver's seat and take passengers to their destinations, dropping the...",1.166325
1,Subway Surfers,"DASH as fast as you can! \nDODGE the oncoming trains! \n\nHelp Jake, Tricky & Fresh escape from the grumpy Inspector and his dog. \n\n★ Grind trains with your cool crew! \n★ Colorful and vivid HD graphics! \n★ Hoverboard Surfing! \n★ Paint powered jetpack! \n★ Lightning fast swipe acrobatics! \n★ Challenge and help your friends! \n\nJoin the most daring chase! \n\nA Universal App with HD optimized graphics.\n\nBy Kiloo and Sybo.",1.179596
2,Subway Princess Runner,"Subway princess runner, Bus run, forest rush with addictive endless running game!\nRush as fast as you can, dodge the oncoming trains and buses. Careful the rolling wood in the forest! Intuitive controls to run left or right, jump in the sky to obtain more coins, excited slide to safety!\n\nHelp your loved beautiful princess to escape the police! Use skateboard after double tapping, experience the unique board in the subway. Challenge the highest score of the rank with the world players or s...",1.243451
3,No Humanity - The Hardest Game,"2M+ Downloads All Over The World!\n\n* IGN Nominated Best Aussie/NZ game *\n* Top 5 indie games at PAX 2015 Australia – Mashable *\n* Global Game Jam ""Best Game"" Sydney 2015 *\n* Global Game Jam ""Best Audio"" Sydney 2015 *\n\nIt's the end of the world and you are the lone survivor in a tiny spaceship. Get ready to dodge everything that is trying to kill you! Your reaction time and precision is key! No Humanity is the hardest bullet hell dodge game. Compare your score with friends and watch as...",1.312795
4,Bus Rush 2,"Bus Rush 2 is one of the most complete multiplayer runners for Android. \nRun along Rio de Janeiro and other scenarios. Drag to jump or slide and to move left or right, avoid hitting obstacles like trucks, buses and subway trains among others!\nPlay races with other users around the world in the multiplayer mode. Run around and gather all the coins you can in different scenarios from Rio city like downtown, subway, sewer, forest, different beaches, and an amazing jungle!\n\nIn Bus Rush 2, yo...",1.34752
5,Virus War - Space Shooting Game,"Warning! Virus invasion! Destroy them with your fingertip! \nis a free casual shooting game. Using only your fingertip, destroy all sorts of viruses. Remember to dodge, don’t let those filthy things hit your ship!\n*Simple and engaging gameplay. Play Virus War anywhere and anytime; get the most fun out of your breaks!\n*Equip your ship with different weapons and blast through swarms of enemies!\n*Surpass your friends in the ranking; set new records!",1.351082
6,Dancing Road: Color Ball Run!,Try out the most exciting Running - Sliding - Matching Music Game!\n\nThe rolling ball starts simply and ramps up shortly. \n\n★ Hold and drag your rolling ball to match other balls of the same color!\n★ Dodge different color balls!\n★ Try to collect all the coins and Gift Boxes on the dancing road!\n\nEnjoy the catchy music and challenges designed for each dancing road. \n\nLet's roll the ball and feel the beat in this Color Matching Game!,1.351424
7,Bob - jigsaw puzzles free games for kids & parents,"Free jigsaw puzzles for kids, hundreds of puzzles for toddlers to assemble. Try now kids puzzle games for toddlers.\nJigsaw puzzles are great game for your toddler to play in waiting room or anywhere while you have to wait.\n\nFeatures:\n- kid puzzle game for free\n- unlimited number of pictures, many colorful pictures to choose from by your children\n- Perfect kid game when you are waiting in line with your children\n- 4 to 100 puzzles, various difficulty levels of jigsaw puzzles \n- jigsaw...",1.359047
8,Blocky Highway: Traffic Racing,"Blocky Highway is about racing traffic, avoiding trains, collecting cars and most importantly having fun. Collect coins, open prize boxes to get new cars and complete collections! Drive at full speed to score big and be the #1. \n\nCrash time! Control your car after crash, hit traffic cars for extra score!\n\nKey Features\n- Gorgeous voxel art graphics\n- 4 worlds to choose from\n- 55 different vehicles to drive : Taxi, Tank, Ufo, Police Car, Army 4x4, Dragster, Monster, Space Shuttle, Motor...",1.361677
9,Cat Runner: Decorate Home,"Cat runner is the best cat running game. Decorate your home for free! From the Living to bedroom or many other rooms, you can design and decorate everything with you loving!\n\nEnjoy hours of fun with your loved cat, run to collect gold coins after being robbed in this endless runner game! Explore new worlds, only racing with fast speed. go on a running adventure, dodge fast cars and trains as you go after the robber.\n\nIt is very easy to control, run as fast as you can, rush in the endless...",1.364435


## Problem 3: Custom preprocessing & tokenization

In Problem 1, you should have seen that `TfidfVectorizer` already performs some preprocessing by default and also does its own tokenization of the input data. This is great for getting started, but often we want to have more control over these steps. We can customize some aspects of the preprocessing through arguments when instantiating `TfidfVectorizer`, but for this exercise, we want to do _all_ of our preprocessing & tokenizing outside of scikit-learn.

Concretely, we want to use [spaCy](https://spacy.io), a library that we will make use of in later labs as well.  Here is a brief example of how to load and use a spaCy model:

In [12]:
import spacy
# Load the small English model, disabling some components that we don't need right now
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner', 'textcat'])

# Take an example sentence and print every token from it separately
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


**Your task** is to write a preprocessing function that uses spaCy to perform the following steps:
- tokenization
- lemmatization
- stop word removal
- removing tokens containing non-alphabetical characters

We recommend that you go through the [Linguistic annotations](https://spacy.io/usage/spacy-101#annotations) section of the spaCy&nbsp;101, which demonstrates how you can get the relevant kind of information via the spaCy library.
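
As a small illustration of the token attributes described in that section, the sketch below uses a blank English pipeline (`spacy.blank("en")`). This is an assumption made here so that the example runs without downloading a trained model. A blank pipeline only tokenizes, so the lexical flags `is_stop` and `is_alpha` are available, whereas meaningful values for `lemma_` and `pos_` require a trained pipeline such as `en_core_web_sm`.

```python
import spacy

# Blank English pipeline: tokenizer and lexical attributes only (assumption, see above)
blank_nlp = spacy.blank("en")
doc = blank_nlp("Apple is looking at buying U.K. startup for $1 billion")

# Inspect the two lexical flags that are relevant for filtering
for token in doc:
    print(token.text, token.is_stop, token.is_alpha)

# Keep tokens that are neither stop words nor contain non-alphabetical characters
kept = [token.text for token in doc if not token.is_stop and token.is_alpha]
print(kept)
```

Note that this sketch covers only the two filtering steps; the kept tokens are still surface forms (`looking`, `buying`) rather than lemmas.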

Implement your preprocessor by completing the following function:

In [13]:
def preprocess(text):
    """Preprocess the given text by tokenising it, removing any stop words, 
    replacing each remaining token with its lemma (base form), and discarding 
    all lemmas that contain non-alphabetical characters. Proper nouns (PROPN)
    are kept in their original surface form and capitalization to preserve names."""
    doc = nlp(text)
    lemmas = []
    for token in doc:
        # skip stop words and non-alphabetic tokens
        if token.is_stop:
            continue
        if not token.is_alpha:
            continue
        if token.pos_ == 'PROPN': # if it is a proper noun, such as a name, keep original form  
            lemmas.append(token.text)
        else:
            lemmas.append(token.lemma_.lower())
    return lemmas

### 🤞 Test your code

Test your implementation by running the following cell:

In [14]:
"""Check that the preprocessing returns the correct output for a number of test cases."""

assert (
    preprocess('Apple is looking at buying U.K. startup for $1 billion') ==
    ['Apple', 'look', 'buy', 'startup', 'billion']
)
assert (
    preprocess('"Love Story" is a country pop song written and sung by Taylor Swift.') ==
    ['Love', 'Story', 'country', 'pop', 'song', 'write', 'sing', 'Taylor', 'Swift']
)
success()

## Problem 4: The effect of preprocessing

To make use of the new `preprocess` function from Problem 3, we need to make sure that we incorporate it into `TfidfVectorizer` and disable all preprocessing & tokenization that `TfidfVectorizer` performs by default. Afterwards, we also need to re-fit the vectorizer and the nearest-neighbors class. To make this a bit easier to handle, let’s take everything we have done so far and put it in a single class `AppSearcher`.

### Task 4.1

**Your first task** is to complete the stub of the `AppSearcher` class given below. Keep in mind:
- The `fit()` function should fit both the vectorizer (from Problem 1) and the nearest-neighbors class (from Problem 2).  Make sure to modify the call to `TfidfVectorizer` to _disable all preprocessing & tokenization_ that it would do by default, and replace it with a call to the `preprocess()` function _defined in `AppSearcher`_.
- For the `preprocess()` function, you can start by copying your solution from Problem 3.
- For the `search()` function, you can copy your solution from Problem 2.
- Make sure to adapt your code to store everything (data, vectorizer, nearest-neighbors class) within the `AppSearcher` class, so that your solution is independent of the code you wrote above!

In [15]:
class AppSearcher:
    def fit(self, df):
        """Instantiate and fit all the classes required for the search engine (cf. Problems 1 and 2)."""
        import spacy
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.neighbors import NearestNeighbors
        self.df = df
        # load spacy model once and store it
        self.nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner', 'textcat'])
        # Use custom tokenizer (self.preprocess) and disable TfidfVectorizer's own preprocessing
        self.vectorizer = TfidfVectorizer(preprocessor=lambda x: x, tokenizer=self.preprocess, lowercase=False)
        self.X = self.vectorizer.fit_transform(self.df['description'])
        self.nn = NearestNeighbors(n_neighbors=10, metric='cosine')
        self.nn.fit(self.X)

    def preprocess(self, text):
        """Preprocess the given text (cf. Problem 3)."""
        # Use the stored spacy pipeline
        doc = self.nlp(text)
        tokens = []
        for token in doc:
            if token.is_stop:
                continue
            if not token.is_alpha:
                continue
            if token.pos_ == 'PROPN':
                tokens.append(token.text)
            else:
                tokens.append(token.lemma_.lower())
        return tokens

    def search(self, query):
        """Find the nearest neighbors in `df` for a query string (cf. Problem 2)."""
        qv = self.vectorizer.transform([query])
        distances, indices = self.nn.kneighbors(qv)
        scores = 1 - distances[0]
        res = self.df.iloc[indices[0]][['name', 'description']].copy()
        res['score'] = scores
        return res.reset_index(drop=True)

#### 🤞 Test your code

The following cell demonstrates how your class should be used. Note that fitting it on the data can take a bit longer than before, since we’re now calling spaCy for the preprocessing.

In [16]:
apps = AppSearcher()
apps.fit(df)
apps.search("pile up pancakes")



Unnamed: 0,name,description,score
0,Pancake Tower,Let's see how many pancakes you can pile up!!,0.957827
1,Cooking School: Games for Girls,"Children like to help their parents. They especially like to help with cooking . When there is a cooking in the kitchen, it is no way to play. But cooking is a complicated process and often it ends up with a huge mess in the kitchen. But what if you are so eager to cook pancakes, cake or cupcakes? How to cook all that without doing a cleaning after? We have a solution! Home Cooking School with our curious Hippo has opened especially for parents and children! We do not only cook food here. We...",0.201224
2,Solitaire,"Solitaire Free by Solitaire Card Games is the #1 klondike solitaire games on android. The solitaire Free is popular and classic card games you know and love.\n\nWe carefully designed a fresh solitaire free modern look, woven into the wonderful solitaire classic feel that everyone loves. \n\nExperience the crisp, clear, and easy to read cards, simple and quick animations, and subtle sounds, in either landscape or portrait views. \n\nYou can move cards with a single tap or drag them to their d...",0.113288
3,Sago Mini Trucks and Diggers,"Drive a dump truck with Rosie the hamster! Pile dirt high and dig deep in the ground with diggers, cranes and bulldozers. Build a home for a new friend! Choose a barn, a castle or even a cupcake-house. Don’t forget to add the finishing touches for the proud owner.\n\nOn this construction site, kids love being the boss. With six mighty machines and piles of dirt, you can build all day! Part of the award-winning suite of Sago Mini apps, this app puts kids in charge.\n\nSago Mini apps have no i...",0.102453
4,Spider Solitaire,"Spider Solitaire was built to offer card players a fun way to play their favorite classic in both portrait and landscape mode.\n\nWith large cards and a unique stacking system our Spider card game doesn't have problems fitting your screen like many others do. \n\n* How to play *\n\nTo win a game of spider solitaire, all cards must be removed from the table. Assembling the cards in the tableau allows for cards to be placed in their respective stacks in order. At the beginning of each game, 54...",0.09507
5,"Hell‚Äôs Cooking ‚Äî crazy chef burger, kitchen fever","‚≠ê ‚≠ê ‚≠ê ‚≠ê ‚≠ê New world of crazy cooking is here. Feel what it means to be a master chef who prepares fantastic fast food in a prominent king kitchen! If you haven't ever tried yourself as a hamburger chef cook, it's possibly the best time for making diner. Download and launch Hell's Cooking ‚Äî crazy chef burger, kitchen fever HD game and get prepared to jump into a fever and adventurous perfect world of burgers.\n\nNew girls game Hell's Cooking gives you lots of opportunities for your crazy cafe...",0.070756
6,Rummy - Free,"Play the famous Rummy card game on your Android Smartphone or Tablet !! \n\nPlay rummy with 2, 3, or 4 players against simulated opponents playing with high-level artificial intelligence. \nThere are a number of rules that can be modified, making this game very faithful to the original. \n\n*** MANY VARIATIONS INCLUDED *** \n\nMany rummy variations are included in the application: \n\n- From 2 to 4 players. \n- Choose the AI level of opponents. \n- Number of cards dealt to each player (from ...",0.058788
7,Dr. Panda's Ice Cream Truck,"Chocolate? Vanilla? Strawberry? All three!? You decide! In Dr. Panda’s Ice Cream Truck you can mix up all sorts of different flavors with cookies, chocolate, nuts and more to make the perfect ice cream—hundreds of combinations in all.\n\nScoop it!\nThese animals love ice cream, and will eat as much (or little) as you want to serve them. You can make scoops big or small and pile them as high as you want—using any of the ice cream you’ve created!\n\nToppings galore!\nUse chocolate syrup, cooki...",0.050081
8,Solitaire Free,"Solitaire by Gemego is the card game you know and love for your phone and tablet. Our Solitaire is beautifully designed with a simple interface to help you enjoy this classic game. \n\nOur Solitaire has the best card movement on the market. You don't need to select a specific card in a pile unlike other Solitaire games. \n\nFeatures\n★ Instructions - an overview of the rules of Solitaire\n★ Winning deals (random) - unlike any other Solitaire! \n★ One Card, Three Card and Vegas style games\n★...",0.047518
9,Turbo Dismount™,"The legendary crash simulator is now on Google Play!\n\nPerform death-defying motor stunts, crash into walls, create traffic pile-ups of epic scale - and share the fun!\n\nTurbo Dismount™ is a kinetic tragedy about Mr. Dismount and the cars who love him. It is the official sequel to the wildly popular and immensely successful personal impact simulator - Stair Dismount™. \n\nFEATURES:\n* Flinch-inducing crash physics\n* Crunchy sound effects\n* Delicious slow-mo replay system\n* Multiple vehi...",0.046181


### Task 4.2

**Your second task** is to experiment with the effect of using (or not using) different preprocessing steps.  We always need to _tokenize_ the text, but other preprocessing steps are optional and require a conscious decision whether to use them or not, such as:
- lemmatization
- lowercasing all characters
- removing stop words
- removing tokens containing non-alphabetical characters

**Modify the definition of the `preprocess()` function** of `AppSearcher` to include/exclude individual preprocessing steps, run some searches, and observe if and how the results change.  Which search queries you try out is up to you: you could compare searching for "pile up pancakes" with "pancake piling", for example, or you could try entirely different search queries aimed at different kinds of apps.  (You can modify the class directly by changing the cell above under Task 4.1, or copy the definitions to the cells below, whichever you prefer; there is no separate code to show for this task, but you will use your observations here for the individual reflection.)

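To get a feel for how individual steps change what a query can match, here is a minimal, self-contained sketch that does not use spaCy. Everything in it is an illustrative assumption: `TOY_STOP_WORDS`, `toy_lemma()`, `toy_preprocess()` and `shared_tokens()` are hypothetical stand-ins (a tiny hand-picked stop list and crude suffix stripping instead of real lemmatization) and are not part of `AppSearcher`.

```python
import re

# Hypothetical stand-ins for illustration only; a real pipeline would use
# spaCy's `is_stop` and `lemma_` attributes instead.
TOY_STOP_WORDS = {"up", "the", "a", "you", "can", "how", "many"}


def toy_lemma(token):
    """Crude stand-in for lemmatization: strip a few common suffixes."""
    lowered = token.lower()
    for suffix in ("ing", "s"):
        if lowered.endswith(suffix) and len(lowered) > len(suffix) + 2:
            return token[: len(token) - len(suffix)]
    return token


def toy_preprocess(text, lemmatize=True, lowercase=True,
                   remove_stop=True, remove_nonalpha=True):
    """Tokenize `text` and apply the selected optional steps."""
    tokens = re.findall(r"\w+|[^\w\s]", text)  # words or single punctuation marks
    out = []
    for tok in tokens:
        if remove_stop and tok.lower() in TOY_STOP_WORDS:
            continue
        if remove_nonalpha and not tok.isalpha():
            continue
        if lemmatize:
            tok = toy_lemma(tok)
        if lowercase:
            tok = tok.lower()
        out.append(tok)
    return out


def shared_tokens(query_a, query_b, **flags):
    """Tokens that two queries have in common under the given flags."""
    return set(toy_preprocess(query_a, **flags)) & set(toy_preprocess(query_b, **flags))


print(shared_tokens("pile up pancakes", "pancake piling"))
print(shared_tokens("pile up pancakes", "pancake piling", lemmatize=False))
```

With the toy suffix stripping switched on, the two queries share the token `pancake`; with it switched off, they share nothing. The crude stripper turns "piling" into "pil" rather than "pile", which is one reason a proper lemmatizer may behave differently in your own experiments.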
In [17]:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
# Load a spaCy pipeline for these experiments; the parser and NER are not needed here
nlp_local = spacy.load('en_core_web_sm', disable=['parser', 'ner', 'textcat'])

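def make_tokenizer(lemmatize=True, lowercase=True, remove_stop=True, remove_nonalpha=True):
    """Build a tokenizer function that applies the selected preprocessing steps."""
    def tokenizer(text):
        doc = nlp_local(text)
        toks = []
        for t in doc:
            # Optionally skip stop words and tokens with non-alphabetic characters
            if remove_stop and t.is_stop:
                continue
            if remove_nonalpha and not t.is_alpha:
                continue
            if t.pos_ == 'PROPN':
                # Proper nouns are kept verbatim: never lemmatized or lowercased
                tok = t.text
            else:
                tok = t.lemma_ if lemmatize else t.text
                if lowercase:
                    tok = tok.lower()
            toks.append(tok)
        return toks
    return tokenizer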

def run_experiment(query='pile up pancakes', **flags):
    """Fit a tf-idf index with the given preprocessing flags and show the best matches."""
    tok = make_tokenizer(**flags)
    # The identity preprocessor and lowercase=False leave all normalization to `tok`
    vect = TfidfVectorizer(preprocessor=lambda x: x, tokenizer=tok, lowercase=False)
    Xexp = vect.fit_transform(df['description'])
    nnexp = NearestNeighbors(n_neighbors=20, metric='cosine').fit(Xexp)
    qv = vect.transform([query])

    # Show a small vocab sample
    print('vocab sample:', list(vect.get_feature_names_out())[:20])

    # Optional: print the query's non-zero tf-idf entries, sorted by score
    # feature_names = vect.get_feature_names_out()
    # q_tokens = [(feature_names[i], qv[0, i]) for i in qv.nonzero()[1]]
    # q_tokens_sorted = sorted(q_tokens, key=lambda x: x[1], reverse=True)
    # print('query token scores:')
    # for tname, val in q_tokens_sorted:
    #     print(f'{tname}: {val:.6f}')

    # Cosine distance is 1 - cosine similarity, so convert back to a similarity score
    distances, indices = nnexp.kneighbors(qv)
    scores = 1 - distances[0]
    res = df.iloc[indices[0]][['name', 'description']].copy()
    res['score'] = scores
    display(res)
    return Xexp, vect


# Run a few variations (these re-fit the vectorizer each time)
# 1: all steps enabled
Xexp1, vect1 = run_experiment('pile up pancakes', lemmatize=True, lowercase=True, remove_stop=True, remove_nonalpha=True)
# 2: keep tokens containing non-alphabetical characters
Xexp2, vect2 = run_experiment('pile up pancakes', lemmatize=True, lowercase=True, remove_stop=True, remove_nonalpha=False)
# 3: keep stop words
Xexp3, vect3 = run_experiment('pile up pancakes', lemmatize=True, lowercase=True, remove_stop=False, remove_nonalpha=True)
# 4: no lowercasing
Xexp4, vect4 = run_experiment('pile up pancakes', lemmatize=True, lowercase=False, remove_stop=True, remove_nonalpha=True)
# 5: no lemmatization
Xexp5, vect5 = run_experiment('pile up pancakes', lemmatize=False, lowercase=True, remove_stop=True, remove_nonalpha=True)



vocab sample: ['AA', 'AAA', 'AAC', 'AAGBI', 'AAT', 'AB', 'ABC', 'ABCD', 'ABCmouse', 'ABCs', 'ABEL', 'ABG', 'ABI', 'ABS', 'ABSOLUTELY', 'AC', 'ACA', 'ACADEMY', 'ACHIEVEMENT', 'ACLS']


Unnamed: 0,name,description,score
1032,Pancake Tower,Let's see how many pancakes you can pile up!!,0.957827
326,Cooking School: Games for Girls,"Children like to help their parents. They especially like to help with cooking . When there is a cooking in the kitchen, it is no way to play. But cooking is a complicated process and often it ends up with a huge mess in the kitchen. But what if you are so eager to cook pancakes, cake or cupcakes? How to cook all that without doing a cleaning after? We have a solution! Home Cooking School with our curious Hippo has opened especially for parents and children! We do not only cook food here. We...",0.201224
1235,Solitaire,"Solitaire Free by Solitaire Card Games is the #1 klondike solitaire games on android. The solitaire Free is popular and classic card games you know and love.\n\nWe carefully designed a fresh solitaire free modern look, woven into the wonderful solitaire classic feel that everyone loves. \n\nExperience the crisp, clear, and easy to read cards, simple and quick animations, and subtle sounds, in either landscape or portrait views. \n\nYou can move cards with a single tap or drag them to their d...",0.113288
1181,Sago Mini Trucks and Diggers,"Drive a dump truck with Rosie the hamster! Pile dirt high and dig deep in the ground with diggers, cranes and bulldozers. Build a home for a new friend! Choose a barn, a castle or even a cupcake-house. Don’t forget to add the finishing touches for the proud owner.\n\nOn this construction site, kids love being the boss. With six mighty machines and piles of dirt, you can build all day! Part of the award-winning suite of Sago Mini apps, this app puts kids in charge.\n\nSago Mini apps have no i...",0.102453
1263,Spider Solitaire,"Spider Solitaire was built to offer card players a fun way to play their favorite classic in both portrait and landscape mode.\n\nWith large cards and a unique stacking system our Spider card game doesn't have problems fitting your screen like many others do. \n\n* How to play *\n\nTo win a game of spider solitaire, all cards must be removed from the table. Assembling the cards in the tableau allows for cards to be placed in their respective stacks in order. At the beginning of each game, 54...",0.09507
656,"Hell’s Cooking — crazy chef burger, kitchen fever","⭐ ⭐ ⭐ ⭐ ⭐ New world of crazy cooking is here. Feel what it means to be a master chef who prepares fantastic fast food in a prominent king kitchen! If you haven't ever tried yourself as a hamburger chef cook, it's possibly the best time for making diner. Download and launch Hell's Cooking — crazy chef burger, kitchen fever HD game and get prepared to jump into a fever and adventurous perfect world of burgers.\n\nNew girls game Hell's Cooking gives you lots of opportunities for your crazy cafe...",0.070756
1164,Rummy - Free,"Play the famous Rummy card game on your Android Smartphone or Tablet !! \n\nPlay rummy with 2, 3, or 4 players against simulated opponents playing with high-level artificial intelligence. \nThere are a number of rules that can be modified, making this game very faithful to the original. \n\n*** MANY VARIATIONS INCLUDED *** \n\nMany rummy variations are included in the application: \n\n- From 2 to 4 players. \n- Choose the AI level of opponents. \n- Number of cards dealt to each player (from ...",0.058788
436,Dr. Panda's Ice Cream Truck,"Chocolate? Vanilla? Strawberry? All three!? You decide! In Dr. Panda’s Ice Cream Truck you can mix up all sorts of different flavors with cookies, chocolate, nuts and more to make the perfect ice cream—hundreds of combinations in all.\n\nScoop it!\nThese animals love ice cream, and will eat as much (or little) as you want to serve them. You can make scoops big or small and pile them as high as you want—using any of the ice cream you’ve created!\n\nToppings galore!\nUse chocolate syrup, cooki...",0.050081
1245,Solitaire Free,"Solitaire by Gemego is the card game you know and love for your phone and tablet. Our Solitaire is beautifully designed with a simple interface to help you enjoy this classic game. \n\nOur Solitaire has the best card movement on the market. You don't need to select a specific card in a pile unlike other Solitaire games. \n\nFeatures\n★ Instructions - an overview of the rules of Solitaire\n★ Winning deals (random) - unlike any other Solitaire! \n★ One Card, Three Card and Vegas style games\n★...",0.047518
1442,Turbo Dismount™,"The legendary crash simulator is now on Google Play!\n\nPerform death-defying motor stunts, crash into walls, create traffic pile-ups of epic scale - and share the fun!\n\nTurbo Dismount™ is a kinetic tragedy about Mr. Dismount and the cars who love him. It is the official sequel to the wildly popular and immensely successful personal impact simulator - Stair Dismount™. \n\nFEATURES:\n* Flinch-inducing crash physics\n* Crunchy sound effects\n* Delicious slow-mo replay system\n* Multiple vehi...",0.046181




vocab sample: ['\n', '\n\n', '\n\n\n', '\n\n\n\n', '\n\n\n\n\n', '\n\n\n\xa0', '\n\n\n\xa0 ', '\n\n\n\xa0\xa0\n', '\n\n\xa0', '\n\n\xa0\n', '\n\n\xa0\xa0\n', '\n\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0', '\n\n\u2003', '\n\n\u2028', '\n\n\u2028\u2028', '\n\xa0', '\n\xa0\n', '\n\xa0\xa0', '\n\xa0\xa0\n', '\n\xa0\xa0\n\xa0\xa0\n']


Unnamed: 0,name,description,score
1032,Pancake Tower,Let's see how many pancakes you can pile up!!,0.928676
326,Cooking School: Games for Girls,"Children like to help their parents. They especially like to help with cooking . When there is a cooking in the kitchen, it is no way to play. But cooking is a complicated process and often it ends up with a huge mess in the kitchen. But what if you are so eager to cook pancakes, cake or cupcakes? How to cook all that without doing a cleaning after? We have a solution! Home Cooking School with our curious Hippo has opened especially for parents and children! We do not only cook food here. We...",0.192624
1235,Solitaire,"Solitaire Free by Solitaire Card Games is the #1 klondike solitaire games on android. The solitaire Free is popular and classic card games you know and love.\n\nWe carefully designed a fresh solitaire free modern look, woven into the wonderful solitaire classic feel that everyone loves. \n\nExperience the crisp, clear, and easy to read cards, simple and quick animations, and subtle sounds, in either landscape or portrait views. \n\nYou can move cards with a single tap or drag them to their d...",0.093308
1181,Sago Mini Trucks and Diggers,"Drive a dump truck with Rosie the hamster! Pile dirt high and dig deep in the ground with diggers, cranes and bulldozers. Build a home for a new friend! Choose a barn, a castle or even a cupcake-house. Don’t forget to add the finishing touches for the proud owner.\n\nOn this construction site, kids love being the boss. With six mighty machines and piles of dirt, you can build all day! Part of the award-winning suite of Sago Mini apps, this app puts kids in charge.\n\nSago Mini apps have no i...",0.092048
1263,Spider Solitaire,"Spider Solitaire was built to offer card players a fun way to play their favorite classic in both portrait and landscape mode.\n\nWith large cards and a unique stacking system our Spider card game doesn't have problems fitting your screen like many others do. \n\n* How to play *\n\nTo win a game of spider solitaire, all cards must be removed from the table. Assembling the cards in the tableau allows for cards to be placed in their respective stacks in order. At the beginning of each game, 54...",0.090204
656,"Hell’s Cooking — crazy chef burger, kitchen fever","⭐ ⭐ ⭐ ⭐ ⭐ New world of crazy cooking is here. Feel what it means to be a master chef who prepares fantastic fast food in a prominent king kitchen! If you haven't ever tried yourself as a hamburger chef cook, it's possibly the best time for making diner. Download and launch Hell's Cooking — crazy chef burger, kitchen fever HD game and get prepared to jump into a fever and adventurous perfect world of burgers.\n\nNew girls game Hell's Cooking gives you lots of opportunities for your crazy cafe...",0.064144
1164,Rummy - Free,"Play the famous Rummy card game on your Android Smartphone or Tablet !! \n\nPlay rummy with 2, 3, or 4 players against simulated opponents playing with high-level artificial intelligence. \nThere are a number of rules that can be modified, making this game very faithful to the original. \n\n*** MANY VARIATIONS INCLUDED *** \n\nMany rummy variations are included in the application: \n\n- From 2 to 4 players. \n- Choose the AI level of opponents. \n- Number of cards dealt to each player (from ...",0.044724
436,Dr. Panda's Ice Cream Truck,"Chocolate? Vanilla? Strawberry? All three!? You decide! In Dr. Panda’s Ice Cream Truck you can mix up all sorts of different flavors with cookies, chocolate, nuts and more to make the perfect ice cream—hundreds of combinations in all.\n\nScoop it!\nThese animals love ice cream, and will eat as much (or little) as you want to serve them. You can make scoops big or small and pile them as high as you want—using any of the ice cream you’ve created!\n\nToppings galore!\nUse chocolate syrup, cooki...",0.043907
1442,Turbo Dismount™,"The legendary crash simulator is now on Google Play!\n\nPerform death-defying motor stunts, crash into walls, create traffic pile-ups of epic scale - and share the fun!\n\nTurbo Dismount™ is a kinetic tragedy about Mr. Dismount and the cars who love him. It is the official sequel to the wildly popular and immensely successful personal impact simulator - Stair Dismount™. \n\nFEATURES:\n* Flinch-inducing crash physics\n* Crunchy sound effects\n* Delicious slow-mo replay system\n* Multiple vehi...",0.040676
427,Dr. Panda Ice Cream Truck Free,"Dr. Panda Ice Cream Truck is FREE for you to play!\n\nChocolate? Vanilla? Strawberry? All three!? You decide! In Dr. Panda Ice Cream Truck you can mix up all sorts of different flavors with cookies, chocolate, nuts and more to make the perfect ice cream—hundreds of combinations in all.\n\nScoop it!\nThese animals love ice cream, and will eat as much (or little) as you want to serve them. You can make scoops big or small and pile them as high as you want—using any of the ice cream you’ve crea...",0.035208




vocab sample: ['A', 'AA', 'AAA', 'AAC', 'AAGBI', 'AAT', 'AB', 'ABC', 'ABCD', 'ABCmouse', 'ABCs', 'ABEL', 'ABG', 'ABI', 'ABS', 'ABSOLUTELY', 'AC', 'ACA', 'ACADEMY', 'ACHIEVEMENT']


Unnamed: 0,name,description,score
1032,Pancake Tower,Let's see how many pancakes you can pile up!!,0.861703
326,Cooking School: Games for Girls,"Children like to help their parents. They especially like to help with cooking . When there is a cooking in the kitchen, it is no way to play. But cooking is a complicated process and often it ends up with a huge mess in the kitchen. But what if you are so eager to cook pancakes, cake or cupcakes? How to cook all that without doing a cleaning after? We have a solution! Home Cooking School with our curious Hippo has opened especially for parents and children! We do not only cook food here. We...",0.17474
1235,Solitaire,"Solitaire Free by Solitaire Card Games is the #1 klondike solitaire games on android. The solitaire Free is popular and classic card games you know and love.\n\nWe carefully designed a fresh solitaire free modern look, woven into the wonderful solitaire classic feel that everyone loves. \n\nExperience the crisp, clear, and easy to read cards, simple and quick animations, and subtle sounds, in either landscape or portrait views. \n\nYou can move cards with a single tap or drag them to their d...",0.109519
1181,Sago Mini Trucks and Diggers,"Drive a dump truck with Rosie the hamster! Pile dirt high and dig deep in the ground with diggers, cranes and bulldozers. Build a home for a new friend! Choose a barn, a castle or even a cupcake-house. Don’t forget to add the finishing touches for the proud owner.\n\nOn this construction site, kids love being the boss. With six mighty machines and piles of dirt, you can build all day! Part of the award-winning suite of Sago Mini apps, this app puts kids in charge.\n\nSago Mini apps have no i...",0.090947
1263,Spider Solitaire,"Spider Solitaire was built to offer card players a fun way to play their favorite classic in both portrait and landscape mode.\n\nWith large cards and a unique stacking system our Spider card game doesn't have problems fitting your screen like many others do. \n\n* How to play *\n\nTo win a game of spider solitaire, all cards must be removed from the table. Assembling the cards in the tableau allows for cards to be placed in their respective stacks in order. At the beginning of each game, 54...",0.086616
656,"Hell’s Cooking — crazy chef burger, kitchen fever","⭐ ⭐ ⭐ ⭐ ⭐ New world of crazy cooking is here. Feel what it means to be a master chef who prepares fantastic fast food in a prominent king kitchen! If you haven't ever tried yourself as a hamburger chef cook, it's possibly the best time for making diner. Download and launch Hell's Cooking — crazy chef burger, kitchen fever HD game and get prepared to jump into a fever and adventurous perfect world of burgers.\n\nNew girls game Hell's Cooking gives you lots of opportunities for your crazy cafe...",0.062785
1164,Rummy - Free,"Play the famous Rummy card game on your Android Smartphone or Tablet !! \n\nPlay rummy with 2, 3, or 4 players against simulated opponents playing with high-level artificial intelligence. \nThere are a number of rules that can be modified, making this game very faithful to the original. \n\n*** MANY VARIATIONS INCLUDED *** \n\nMany rummy variations are included in the application: \n\n- From 2 to 4 players. \n- Choose the AI level of opponents. \n- Number of cards dealt to each player (from ...",0.05847
1442,Turbo Dismount™,"The legendary crash simulator is now on Google Play!\n\nPerform death-defying motor stunts, crash into walls, create traffic pile-ups of epic scale - and share the fun!\n\nTurbo Dismount™ is a kinetic tragedy about Mr. Dismount and the cars who love him. It is the official sequel to the wildly popular and immensely successful personal impact simulator - Stair Dismount™. \n\nFEATURES:\n* Flinch-inducing crash physics\n* Crunchy sound effects\n* Delicious slow-mo replay system\n* Multiple vehi...",0.048696
436,Dr. Panda's Ice Cream Truck,"Chocolate? Vanilla? Strawberry? All three!? You decide! In Dr. Panda’s Ice Cream Truck you can mix up all sorts of different flavors with cookies, chocolate, nuts and more to make the perfect ice cream—hundreds of combinations in all.\n\nScoop it!\nThese animals love ice cream, and will eat as much (or little) as you want to serve them. You can make scoops big or small and pile them as high as you want—using any of the ice cream you’ve created!\n\nToppings galore!\nUse chocolate syrup, cooki...",0.047733
1245,Solitaire Free,"Solitaire by Gemego is the card game you know and love for your phone and tablet. Our Solitaire is beautifully designed with a simple interface to help you enjoy this classic game. \n\nOur Solitaire has the best card movement on the market. You don't need to select a specific card in a pile unlike other Solitaire games. \n\nFeatures\n★ Instructions - an overview of the rules of Solitaire\n★ Winning deals (random) - unlike any other Solitaire! \n★ One Card, Three Card and Vegas style games\n★...",0.042597




vocab sample: ['AA', 'AAA', 'AAC', 'AAGBI', 'AAT', 'AB', 'ABC', 'ABCD', 'ABCmouse', 'ABCs', 'ABEL', 'ABG', 'ABI', 'ABS', 'ABSOLUTELY', 'AC', 'ACA', 'ACADEMY', 'ACHIEVE', 'ACHIEVEMENT']


Unnamed: 0,name,description,score
1032,Pancake Tower,Let's see how many pancakes you can pile up!!,0.95773
326,Cooking School: Games for Girls,"Children like to help their parents. They especially like to help with cooking . When there is a cooking in the kitchen, it is no way to play. But cooking is a complicated process and often it ends up with a huge mess in the kitchen. But what if you are so eager to cook pancakes, cake or cupcakes? How to cook all that without doing a cleaning after? We have a solution! Home Cooking School with our curious Hippo has opened especially for parents and children! We do not only cook food here. We...",0.201034
1235,Solitaire,"Solitaire Free by Solitaire Card Games is the #1 klondike solitaire games on android. The solitaire Free is popular and classic card games you know and love.\n\nWe carefully designed a fresh solitaire free modern look, woven into the wonderful solitaire classic feel that everyone loves. \n\nExperience the crisp, clear, and easy to read cards, simple and quick animations, and subtle sounds, in either landscape or portrait views. \n\nYou can move cards with a single tap or drag them to their d...",0.113278
1181,Sago Mini Trucks and Diggers,"Drive a dump truck with Rosie the hamster! Pile dirt high and dig deep in the ground with diggers, cranes and bulldozers. Build a home for a new friend! Choose a barn, a castle or even a cupcake-house. Don’t forget to add the finishing touches for the proud owner.\n\nOn this construction site, kids love being the boss. With six mighty machines and piles of dirt, you can build all day! Part of the award-winning suite of Sago Mini apps, this app puts kids in charge.\n\nSago Mini apps have no i...",0.102524
1263,Spider Solitaire,"Spider Solitaire was built to offer card players a fun way to play their favorite classic in both portrait and landscape mode.\n\nWith large cards and a unique stacking system our Spider card game doesn't have problems fitting your screen like many others do. \n\n* How to play *\n\nTo win a game of spider solitaire, all cards must be removed from the table. Assembling the cards in the tableau allows for cards to be placed in their respective stacks in order. At the beginning of each game, 54...",0.095058
656,"Hell’s Cooking — crazy chef burger, kitchen fever","⭐ ⭐ ⭐ ⭐ ⭐ New world of crazy cooking is here. Feel what it means to be a master chef who prepares fantastic fast food in a prominent king kitchen! If you haven't ever tried yourself as a hamburger chef cook, it's possibly the best time for making diner. Download and launch Hell's Cooking — crazy chef burger, kitchen fever HD game and get prepared to jump into a fever and adventurous perfect world of burgers.\n\nNew girls game Hell's Cooking gives you lots of opportunities for your crazy cafe...",0.070745
1164,Rummy - Free,"Play the famous Rummy card game on your Android Smartphone or Tablet !! \n\nPlay rummy with 2, 3, or 4 players against simulated opponents playing with high-level artificial intelligence. \nThere are a number of rules that can be modified, making this game very faithful to the original. \n\n*** MANY VARIATIONS INCLUDED *** \n\nMany rummy variations are included in the application: \n\n- From 2 to 4 players. \n- Choose the AI level of opponents. \n- Number of cards dealt to each player (from ...",0.058786
436,Dr. Panda's Ice Cream Truck,"Chocolate? Vanilla? Strawberry? All three!? You decide! In Dr. Panda’s Ice Cream Truck you can mix up all sorts of different flavors with cookies, chocolate, nuts and more to make the perfect ice cream—hundreds of combinations in all.\n\nScoop it!\nThese animals love ice cream, and will eat as much (or little) as you want to serve them. You can make scoops big or small and pile them as high as you want—using any of the ice cream you’ve created!\n\nToppings galore!\nUse chocolate syrup, cooki...",0.050076
1245,Solitaire Free,"Solitaire by Gemego is the card game you know and love for your phone and tablet. Our Solitaire is beautifully designed with a simple interface to help you enjoy this classic game. \n\nOur Solitaire has the best card movement on the market. You don't need to select a specific card in a pile unlike other Solitaire games. \n\nFeatures\n★ Instructions - an overview of the rules of Solitaire\n★ Winning deals (random) - unlike any other Solitaire! \n★ One Card, Three Card and Vegas style games\n★...",0.047515
1442,Turbo Dismount™,"The legendary crash simulator is now on Google Play!\n\nPerform death-defying motor stunts, crash into walls, create traffic pile-ups of epic scale - and share the fun!\n\nTurbo Dismount™ is a kinetic tragedy about Mr. Dismount and the cars who love him. It is the official sequel to the wildly popular and immensely successful personal impact simulator - Stair Dismount™. \n\nFEATURES:\n* Flinch-inducing crash physics\n* Crunchy sound effects\n* Delicious slow-mo replay system\n* Multiple vehi...",0.046177




vocab sample: ['AA', 'AAA', 'AAC', 'AAGBI', 'AAT', 'AB', 'ABC', 'ABCD', 'ABCmouse', 'ABCs', 'ABEL', 'ABG', 'ABI', 'ABS', 'ABSOLUTELY', 'AC', 'ACA', 'ACADEMY', 'ACHIEVEMENT', 'ACLS']


Unnamed: 0,name,description,score
1032,Pancake Tower,Let's see how many pancakes you can pile up!!,0.954772
326,Cooking School: Games for Girls,"Children like to help their parents. They especially like to help with cooking . When there is a cooking in the kitchen, it is no way to play. But cooking is a complicated process and often it ends up with a huge mess in the kitchen. But what if you are so eager to cook pancakes, cake or cupcakes? How to cook all that without doing a cleaning after? We have a solution! Home Cooking School with our curious Hippo has opened especially for parents and children! We do not only cook food here. We...",0.196506
656,"Hell’s Cooking — crazy chef burger, kitchen fever","⭐ ⭐ ⭐ ⭐ ⭐ New world of crazy cooking is here. Feel what it means to be a master chef who prepares fantastic fast food in a prominent king kitchen! If you haven't ever tried yourself as a hamburger chef cook, it's possibly the best time for making diner. Download and launch Hell's Cooking — crazy chef burger, kitchen fever HD game and get prepared to jump into a fever and adventurous perfect world of burgers.\n\nNew girls game Hell's Cooking gives you lots of opportunities for your crazy cafe...",0.070053
1235,Solitaire,"Solitaire Free by Solitaire Card Games is the #1 klondike solitaire games on android. The solitaire Free is popular and classic card games you know and love.\n\nWe carefully designed a fresh solitaire free modern look, woven into the wonderful solitaire classic feel that everyone loves. \n\nExperience the crisp, clear, and easy to read cards, simple and quick animations, and subtle sounds, in either landscape or portrait views. \n\nYou can move cards with a single tap or drag them to their d...",0.065502
1164,Rummy - Free,"Play the famous Rummy card game on your Android Smartphone or Tablet !! \n\nPlay rummy with 2, 3, or 4 players against simulated opponents playing with high-level artificial intelligence. \nThere are a number of rules that can be modified, making this game very faithful to the original. \n\n*** MANY VARIATIONS INCLUDED *** \n\nMany rummy variations are included in the application: \n\n- From 2 to 4 players. \n- Choose the AI level of opponents. \n- Number of cards dealt to each player (from ...",0.060067
436,Dr. Panda's Ice Cream Truck,"Chocolate? Vanilla? Strawberry? All three!? You decide! In Dr. Panda’s Ice Cream Truck you can mix up all sorts of different flavors with cookies, chocolate, nuts and more to make the perfect ice cream—hundreds of combinations in all.\n\nScoop it!\nThese animals love ice cream, and will eat as much (or little) as you want to serve them. You can make scoops big or small and pile them as high as you want—using any of the ice cream you’ve created!\n\nToppings galore!\nUse chocolate syrup, cooki...",0.052653
1181,Sago Mini Trucks and Diggers,"Drive a dump truck with Rosie the hamster! Pile dirt high and dig deep in the ground with diggers, cranes and bulldozers. Build a home for a new friend! Choose a barn, a castle or even a cupcake-house. Don’t forget to add the finishing touches for the proud owner.\n\nOn this construction site, kids love being the boss. With six mighty machines and piles of dirt, you can build all day! Part of the award-winning suite of Sago Mini apps, this app puts kids in charge.\n\nSago Mini apps have no i...",0.052349
1245,Solitaire Free,"Solitaire by Gemego is the card game you know and love for your phone and tablet. Our Solitaire is beautifully designed with a simple interface to help you enjoy this classic game. \n\nOur Solitaire has the best card movement on the market. You don't need to select a specific card in a pile unlike other Solitaire games. \n\nFeatures\n★ Instructions - an overview of the rules of Solitaire\n★ Winning deals (random) - unlike any other Solitaire! \n★ One Card, Three Card and Vegas style games\n★...",0.050069
1442,Turbo Dismount™,"The legendary crash simulator is now on Google Play!\n\nPerform death-defying motor stunts, crash into walls, create traffic pile-ups of epic scale - and share the fun!\n\nTurbo Dismount™ is a kinetic tragedy about Mr. Dismount and the cars who love him. It is the official sequel to the wildly popular and immensely successful personal impact simulator - Stair Dismount™. \n\nFEATURES:\n* Flinch-inducing crash physics\n* Crunchy sound effects\n* Delicious slow-mo replay system\n* Multiple vehi...",0.048783
427,Dr. Panda Ice Cream Truck Free,"Dr. Panda Ice Cream Truck is FREE for you to play!\n\nChocolate? Vanilla? Strawberry? All three!? You decide! In Dr. Panda Ice Cream Truck you can mix up all sorts of different flavors with cookies, chocolate, nuts and more to make the perfect ice cream—hundreds of combinations in all.\n\nScoop it!\nThese animals love ice cream, and will eat as much (or little) as you want to serve them. You can make scoops big or small and pile them as high as you want—using any of the ice cream you’ve crea...",0.042521

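Because the result tables are long, it can be hard to see at a glance how two configurations differ. One possible way to summarize the difference is sketched below: it compares two ranked lists of row ids by their overlap and by how far each shared result moved. The helper `compare_rankings()` is a hypothetical name introduced here for illustration; the two id lists are copied from the rows displayed for the first run (all steps enabled) and the last run (no lemmatization).

```python
def compare_rankings(baseline, variant):
    """Return (jaccard_overlap, rank_shifts) for two ranked lists of ids.

    rank_shifts maps each shared id to (variant rank - baseline rank),
    so a negative value means the result moved up in the variant.
    """
    base_pos = {doc: i for i, doc in enumerate(baseline)}
    var_pos = {doc: i for i, doc in enumerate(variant)}
    shared = base_pos.keys() & var_pos.keys()
    union = base_pos.keys() | var_pos.keys()
    jaccard = len(shared) / len(union) if union else 1.0
    shifts = {doc: var_pos[doc] - base_pos[doc] for doc in shared}
    return jaccard, shifts


# Row ids in the order displayed for two of the runs
all_steps = [1032, 326, 1235, 1181, 1263, 656, 1164, 436, 1245, 1442]
no_lemmas = [1032, 326, 656, 1235, 1164, 436, 1181, 1245, 1442, 427]

jaccard, shifts = compare_rankings(all_steps, no_lemmas)
print(f"overlap: {jaccard:.3f}")
for doc, shift in sorted(shifts.items(), key=lambda item: item[1]):
    print(doc, shift)
```

A summary like this only describes which results changed position; judging whether a change is an improvement still requires reading the descriptions.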

## Individual reflection

<div class="alert alert-info">
    <strong>After you have solved the lab,</strong> write a <em>brief</em> reflection (max. one A4 page) on the question(s) below.  Remember:
    <ul>
        <li>You are encouraged to discuss this part with your lab partner, but you should each write up your reflection <strong>individually</strong>.</li>
        <li><strong>Do not put your answers in the notebook</strong>; upload them in the separate submission opportunity for the reflections on Lisam.</li>
    </ul>
</div>

1. In Problem 1, which token had the highest tf–idf score, and which had the lowest?  Based on your knowledge of how tf–idf works, how would you explain this result?
2. Based on your observations in Problem 4, which preprocessing steps do you think are the most appropriate for this "search engine" example?  Why?

**Congratulations on finishing this lab! 👍**

<div class="alert alert-info">
    
➡️ Before you submit, **make sure the notebook can be run from start to finish** without errors.  For this, _restart the kernel_ and _run all cells_ from top to bottom. In Jupyter Notebook version 7 or higher, you can do this via "Run$\rightarrow$Restart Kernel and Run All Cells..." in the menu (or the "⏩" button in the toolbar).

</div>

In [18]:
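# Row 656 of Xexp1 (a sparse row vector), converted to a dense 1-D array
Y = Xexp1[656]
Y_ = Y.toarray()[0]
# Show this document's tokens with their weights, highest first
sorted_tokens(vect1, Y_)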

burger: 0.38946512226495467
chef: 0.32588773772090207
kitchen: 0.2405834747191089
fever: 0.22667690449790226
hamburger: 0.2240581409086694
restaurant: 0.2033978950161184
cafe: 0.19473256113247733
Hell: 0.17924651272693554
diner: 0.17353320467919559
Cooking: 0.1471711033876947
girl: 0.14411168675965014
serve: 0.13959598980158408
crazy: 0.13338209120755382
cooking: 0.13337078055872112
visitor: 0.12112451771529135
hell: 0.1150093143379695
food: 0.09924597905367138
kitchenette: 0.09849683425007463
dash: 0.09849104129468418
tasty: 0.09639410910714163
ingredient: 0.09450378405547079
fast: 0.09236009306785811
cook: 0.09070497582834393
pancake: 0.08962325636346777
frenzy: 0.08676660233959779
game: 0.08436857869537219
coffee: 0.07667287622531299
craze: 0.07667287622531299
queen: 0.07453427208306297
prepare: 0.07422461553225629
guest: 0.07270231414209745
prepared: 0.07109999225623523
bake: 0.07036825785507392
meal: 0.06967609648947963
order: 0.06588986130332049
rush: 0.06382873625549057
real: 0.

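The cell above ranks the tokens of one document by their weight. As a rough illustration of that ranking step, here is a minimal sketch. It is not the notebook's own `sorted_tokens` helper, whose definition is not shown here. The function name `sorted_tokens_sketch`, its signature, and the toy values are all assumptions made for this example.

```python
def sorted_tokens_sketch(feature_names, weights):
    """Return (token, weight) pairs with non-zero weight, heaviest first.

    Hypothetical helper: it takes the vocabulary as a sequence of strings
    and one document's weight vector of the same length.
    """
    pairs = [(tok, float(w)) for tok, w in zip(feature_names, weights) if w > 0]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Toy vocabulary and weight vector (made-up values, for illustration only)
vocab = ["burger", "chef", "game", "the"]
row = [0.39, 0.33, 0.08, 0.0]

ranked = sorted_tokens_sketch(vocab, row)
for token, weight in ranked:
    print(f"{token}: {weight}")
```

If the vectorizer is a fitted scikit-learn vectorizer (an assumption), its vocabulary in column order is available via `get_feature_names_out()`, so a call might look like `sorted_tokens_sketch(vect1.get_feature_names_out(), Y_)`.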
In [19]:
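# Same document (row 656), now taken from Xexp5 for comparison
Y = Xexp5[656]
Y_ = Y.toarray()[0]
# Show this document's tokens with their weights, highest first
sorted_tokens(vect5, Y_)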

burger: 0.339918401332665
chef: 0.3279992254143046
kitchen: 0.24478587803829457
diner: 0.23477667948243566
fever: 0.2281455866200747
cafe: 0.199388146094794
hamburger: 0.18782134358594854
restaurant: 0.1808544982719971
Hell: 0.1804078844568298
cooking: 0.15753560199151306
Cooking: 0.14812465253256532
crazy: 0.13569971963156507
girls: 0.1270583917088482
hell: 0.11575448122749286
serving: 0.10519131145407251
fast: 0.10306897912311283
food: 0.10257266777068216
dash: 0.1006761861203253
kitchenette: 0.09913501368822292
tasty: 0.09839976762429138
ingredients: 0.09572970581107275
prepared: 0.09283967327745231
pancakes: 0.0902039422284149
game: 0.09010845795632269
frenzy: 0.087328779408056
burgers: 0.08497960033316625
visitors: 0.08497960033316625
craze: 0.0797552584379176
coffee: 0.0771696541516619
queen: 0.0771696541516619
meals: 0.07501719359260098
making: 0.07160655324486731
stand: 0.06559984508286092
cook: 0.0646791097301573
rush: 0.0646791097301573
hot: 0.061893115518301546
serve: 0.0618