# Apps: Webpage + Text Classification

In this example we check whether webpages match the ad texts that link to them. We will run the project using Toloka's new Ready-to-go solutions (App Services).

### Challenge

We picked a dataset from Yandex.Direct as an example. It is an XLSX file (in attachments) that contains information about advertising campaigns.


### Context

Advertising campaigns usually consist of several thousand ads that are linked to many web pages. Web pages may change over time and become irrelevant to the ads. If a web page does not match the ad snippet, click-to-order conversion drops dramatically and ad budgets are wasted.


### Description

With this project we aim to check that web pages are relevant to the ads that link to them. We will use the new App projects (Ready-to-go solutions) feature of Toloka.

### Set up the environment

First, you'll need to register in Toloka as a requester. Learn more about this step in the [documentation](https://toloka.ai/docs/guide/concepts/access.html).


The second step is to obtain your [OAuth token](https://toloka.ai/docs/api/concepts/access.html#access__token).
A detailed description of these actions can be found in the example [learn the basics](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/0.getting_started/0.learn_the_basics/learn_the_basics.ipynb).

To access the dataset from the code, keep the file in the same directory as this notebook. The original file is called 'direct_example.xlsx'; the download cell further down fetches a CSV copy named 'direct_example.csv'.


In [None]:
import os
import random

import datetime
import time

import pandas as pd

from ipywidgets import widgets
from IPython.display import display, clear_output

In [None]:
!pip install toloka-kit==0.1.23 #--upgrade
clear_output()
print('Packages are installed!')

In [None]:
import toloka.client as toloka

### App project settings

It's time to think about initial project settings, instructions, and examples for performers.

In [None]:
# App name
app_name = 'Webpage + Text Classification'

# App project settings
project_name = 'Does the ad snippet match the webpage?'
project_parameters = {
    'name': project_name,
    'option_other': False,  # Option disabled. Assume there are NO pairs of text + webpage that don't apply to any class
    'option_multiple_choice': False,  # Option disabled. Assume there are NO pairs of text + webpage that can apply to multiple classes
    'default_language': 'en',  # Choose the language of the content you’ll be using in the application. Possible values: en/ru
    'instruction_intro': '''You will see an ad and a link to the webpage that represents the ad.
Read the ad carefully and then follow the link – study the webpage carefully.
Answer the question: Does the ad snippet match the webpage?''',  # Set the instruction. What do performers need to do?
    'instruction_question': 'Does the ad snippet match the webpage?',
    'instruction_classes': [
        # Add at least two classes, and describe each one
        {'label': 'Matches', 'value': 'OK', 'description': 'The ad text is appropriate and relevant to the webpage. Meaning of the ad and main idea of the webpage are similar, the words can converge.'},
        {'label': 'Does not match', 'value': 'BAD', 'description': 'The ad text does not match the webpage. Meaning of the ad and main idea of the webpage are not similar at all. Ad is clearly about something different.'}
    ],
    'instruction_examples': [
        # Example 1
        {'text': 'Yandex.Direct — contextual ads on Yandex', 'label': 'Matches', 'description': 'The ad is about the website for placing contextual advertising. Link goes to the website for placing the ads. Answer should be "Matches"', 'website_url': 'https://direct.yandex.com/'},
        # Example 2
        {'text': 'Auto.ru: buy, sell, and trade your auto', 'label': 'Matches', 'description': 'The ad is about a web service where you can buy, sell and exchange a car. Link goes to a service where you can do all these. Answer should be "Matches"', 'website_url': 'https://auto.ru/'},
        # Example 3
        {'text': 'Yandex.Services — find services and specialists of your need', 'label': 'Does not match', 'description': 'The ad is about a service. Link goes to a catalog with Porsche cars. Answer should be "Does not match"', 'website_url': 'https://auto.ru/moskva/cars/porsche/new/'},
        # Example 4
        {'text': 'Buy new Porsche auto', 'label': 'Does not match', 'description': 'The ad is about new Porsche cars. Link goes to Toyota cars with mileage. Answer should be "Does not match"', 'website_url': 'https://auto.ru/moskva/cars/toyota/used/'}
    ]
}
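
Before sending the settings to Toloka, it can help to sanity-check them locally. The sketch below assumes the dictionary layout used in this notebook (`instruction_classes` and `instruction_examples` as lists of dicts with a `label` key). `check_project_parameters` is a hypothetical helper written for this example and is not part of toloka-kit.

```python
def check_project_parameters(params):
    """Return a list of problems found in an App project parameters dict.

    Hypothetical helper, not part of toloka-kit. It only checks the
    layout used in this notebook.
    """
    problems = []
    classes = params.get('instruction_classes', [])
    if len(classes) < 2:
        problems.append('at least two classes are required')

    # Every example must point at one of the declared class labels
    known_labels = {cls['label'] for cls in classes}
    for number, example in enumerate(params.get('instruction_examples', []), start=1):
        if example['label'] not in known_labels:
            problems.append(f"example {number} uses unknown label {example['label']!r}")
    return problems


# Small made-up parameters dict; the second example has a label that no class declares
demo_parameters = {
    'instruction_classes': [
        {'label': 'Matches', 'value': 'OK'},
        {'label': 'Does not match', 'value': 'BAD'},
    ],
    'instruction_examples': [
        {'text': 'Buy new Porsche auto', 'label': 'Does not match'},
        {'text': 'Some ad', 'label': 'Maybe'},
    ],
}
print(check_project_parameters(demo_parameters))
```

Calling `check_project_parameters(project_parameters)` on the settings above should return an empty list if every example label matches a declared class.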


### Download the file with information about ads

For this example we will use a small dataset with different variants of an ad description leading to a single page.

In [None]:
!curl https://tlk.s3.yandex.net/dataset/direct_example.csv --output direct_example.csv

df_direct = pd.read_csv('direct_example.csv')
df_direct = df_direct.sample(frac=1).reset_index(drop=True)

### Connect to Toloka

Create an instance and check your account.

In [None]:
# Create a Toloka client instance
# All API calls will pass through it
toloka_client = toloka.TolokaClient(input("Enter your token:"), 'PRODUCTION')  # or switch to SANDBOX

# Check the money available on your account, and also the validity of your OAuth token
requester = toloka_client.get_requester()
print('You have enough money on your account - ', requester.balance > 3.0)
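
If you run the notebook non-interactively, typing the token into `input()` is not possible. One option, sketched below, is to read it from an environment variable. The variable name `TOLOKA_TOKEN` and the helper `read_token` are assumptions made for this example; toloka-kit does not define them.

```python
import os


def read_token(environ=os.environ, variable='TOLOKA_TOKEN'):
    """Return the token stored in an environment variable, or None.

    TOLOKA_TOKEN is a made-up variable name for this sketch.
    """
    token = environ.get(variable, '').strip()
    return token or None


token = read_token()
if token is None:
    print('TOLOKA_TOKEN is not set; fall back to input() as in the cell above')
```

The returned value can then be passed as the first argument of `toloka.TolokaClient` in place of the `input(...)` call.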

### Run the App project

Set up several objects related to your project: AppProject and AppBatch.

In [None]:
# Getting Webpage + Text Classification App (Ready-to-go solution)
wtc_app = next(app for app in toloka_client.get_apps() if app.name == app_name)
app_id = wtc_app.id

In [None]:
# Create an App project
wtc_project_obj = toloka.app.AppProject(
    app_id=app_id,
    name=project_name,
    parameters=project_parameters
)

wtc_project = toloka_client.create_app_project(wtc_project_obj)

In [None]:
# Wait until the moderation of project is completed
def wait_project_moderation(project):
    sleep_time = 30
    while project.status.value == 'CREATING':
        project = toloka_client.get_app_project(project.id)
        print(
            f'{datetime.datetime.now().strftime("%H:%M:%S")} '
            f'Project {project.name} has status {project.status.value}.'
        )
        time.sleep(sleep_time)
    print(project.status.value)
wait_project_moderation(wtc_project)
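
The loop above waits indefinitely if the status never changes. A minimal sketch of the same polling idea with a timeout is shown below. `wait_while` is a hypothetical helper, the status `'READY'` in the demo is invented, and the default intervals are arbitrary.

```python
import time


def wait_while(fetch_status, waiting_statuses, sleep_time=30, timeout=3600,
               sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it leaves waiting_statuses or the timeout expires."""
    deadline = clock() + timeout
    status = fetch_status()
    while status in waiting_statuses:
        if clock() >= deadline:
            raise TimeoutError(f'still {status} after {timeout} seconds')
        sleep(sleep_time)
        status = fetch_status()
    return status


# Demo with a fake status source, so no Toloka call is made
fake_statuses = iter(['CREATING', 'CREATING', 'READY'])
final_status = wait_while(lambda: next(fake_statuses), {'CREATING'},
                          sleep=lambda seconds: None)
print(final_status)
```

With a real project, the status source could be `lambda: toloka_client.get_app_project(wtc_project.id).status.value`, which reuses the call from the cell above.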

### Add your data

Data is uploaded to the project in batches.

In [None]:
# Create a batch
wtc_batch_obj = toloka.app.AppBatch(name=app_name+str(random.randint(1,100)))
wtc_batch = toloka_client.create_app_batch(app_project_id=wtc_project.id,
                             request = wtc_batch_obj)

In [None]:
# Add tasks to batch
for ind, row in df_direct.iterrows():
    wtc_item_obj = toloka.app.AppItem(batch_id=wtc_batch.id,
                   input_data={'id': str(row['Ad ID']),
                               'text': row['Ad Text'],
                               'website_url': row['Link']}
                        )
    toloka_client.create_app_item(app_project_id=wtc_project.id,
                          app_item = wtc_item_obj)
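
Rows with a missing ad text or link would produce unusable tasks. The sketch below builds the same `input_data` dictionaries as the loop above and skips incomplete rows. `row_to_input_data` is a hypothetical helper, and the demo rows are made up; only the column names `Ad ID`, `Ad Text` and `Link` come from the notebook.

```python
def row_to_input_data(row):
    """Map one dataset row to the input_data layout used above, or None if incomplete."""
    required = ('Ad ID', 'Ad Text', 'Link')
    values = [row.get(column) for column in required]
    # Treat None, empty strings and NaN (which is not equal to itself) as missing
    if any(value is None or value == '' or value != value for value in values):
        return None
    ad_id, text, link = values
    return {'id': str(ad_id), 'text': text, 'website_url': link}


# Made-up rows: the second one has a missing ad text
rows = [
    {'Ad ID': 1, 'Ad Text': 'Buy new Porsche auto', 'Link': 'https://auto.ru/'},
    {'Ad ID': 2, 'Ad Text': float('nan'), 'Link': 'https://auto.ru/'},
]
payloads = [data for data in map(row_to_input_data, rows) if data is not None]
print(payloads)
```

Each surviving dictionary can be passed as `input_data` when creating an `AppItem`, as in the loop above.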

In [None]:
# Start data labeling
toloka_client.start_app_batch(app_project_id = wtc_project.id,
                              app_batch_id = wtc_batch.id)
print("Labeling started. Please, wait...")

In [None]:
# Wait for batch
def wait_batch_for_close(batch):
    sleep_time = 60
    while batch.status.value in ['PROCESSING', 'NEW']:
        batch = toloka_client.get_app_batch(wtc_project.id, batch.id)
        print(
            f'{datetime.datetime.now().strftime("%H:%M:%S")} '
            f'Batch {batch.name} has status {batch.status.value}.'
        )
        time.sleep(sleep_time)
    print(batch.status.value)

wait_batch_for_close(wtc_batch)

### Get your results

Load the results into a data frame.

In [None]:
# DataFrame.append was removed in pandas 2.0, so normalize all items in one call
result_df = pd.json_normalize(
    [item.output_data for item in toloka_client.get_app_items(app_project_id=wtc_project.id)]
)

In [68]:
result_df.sample(5)

| | text | result | confidence | website_url |
|---|---|---|---|---|
| 24 | Over 6000 games available on Yandex.Games. No ... | [OK] | 0.984802520945325 | https://yandex.com/games/?utm_medium=rsya&utm_... |
| 36 | Over 6000 games available on Yandex.Games. No ... | [OK] | 0.984802520945325 | https://yandex.com/games/?utm_medium=rsya&utm_... |
| 2 | Over 6000 games available on Yandex.Games. No ... | [OK] | 0.984802520945325 | https://yandex.com/games/?utm_medium=rsya&utm_... |
| 20 | Over 6000 games available on Yandex.Games. No ... | [OK] | 0.984802520945325 | https://yandex.com/games/?utm_medium=rsya&utm_... |
| 3 | Over 6000 games available on Yandex.Games. No ... | [OK] | 0.984802520945325 | https://yandex.com/games/?utm_medium=rsya&utm_... |
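
Once the results are loaded, a natural next step is to count the labels and set aside low-confidence items for manual review. The sketch below assumes each record has a `result` list of labels and a numeric `confidence`, as the sample output suggests. The helper `summarize`, the 0.9 threshold, and the demo records are illustrative assumptions.

```python
from collections import Counter


def summarize(results, min_confidence=0.9):
    """Count labels and collect items whose confidence is below the threshold."""
    label_counts = Counter()
    low_confidence = []
    for item in results:
        # 'result' is assumed to hold a list of labels, e.g. ['OK']
        label_counts.update(item['result'])
        if item['confidence'] < min_confidence:
            low_confidence.append(item)
    return label_counts, low_confidence


# Made-up records in the same shape as the rows of result_df
demo_results = [
    {'result': ['OK'], 'confidence': 0.98},
    {'result': ['BAD'], 'confidence': 0.61},
    {'result': ['OK'], 'confidence': 0.95},
]
counts, to_review = summarize(demo_results)
print(dict(counts))
print(len(to_review))
```

With the real data frame, the records could be obtained with `result_df.to_dict('records')`.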


### Summary

In this project, we built a fast, automated labeling pipeline using a new Toloka feature: Ready-to-go solutions (or simply Apps).