# Apps Classification link + text
This example shows how to classify pairs of a web page link and a text, using ads and the pages they link to.

### Task
An XLSX file for managing Yandex.Direct ad campaigns (https://direct.yandex.ru/public/direct_example.xlsx?cmd=exportSampleCamp&new_format=1) is used as the dataset for link + text classification.

### Inspiration
Ad campaigns managed through XLSX files usually consist of many ads, possibly several thousand, that link to many web pages. Those pages can change over time and stop matching the ads. If a web page does not match an ad's snippet, the conversion rate drops sharply, and <b>the advertising budget is spent ineffectively</b>.

### Description
To solve this problem we will use Apps: for each ad we will check whether its snippet matches the web page the ad links to.

### Setting up the environment
First of all, you need to register with Toloka as a requester. Find out more at https://yandex.ru/support/toloka-requester/concepts/access.html

The second step is to get your OAuth token: https://doc.yandex-team.ru/toloka/doc/concepts/access.html?lang=ru

To run the check, save the XLSX file for managing your Yandex.Direct ad campaigns in the folder with this notebook. In this example the file is called 'direct_example.xlsx'; a sample is available at https://direct.yandex.ru/public/direct_example.xlsx
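Once you have a token, one way to keep it out of the notebook itself is to read it from an environment variable. The sketch below is only an illustration: the variable name `TOLOKA_TOKEN` and the helper `load_token` are hypothetical and not part of the original example.

```python
import os


def load_token(env_var: str = "TOLOKA_TOKEN") -> str:
    """Return the OAuth token stored in an environment variable.

    The variable name is an assumption made for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your OAuth token")
    return token
```

With this helper, the `token = ''` line in the settings cell could become `token = load_token()`.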

In [None]:
import os
import random
import shutil
import csv

import datetime
import time

import pandas as pd
import json

import logging

from ipywidgets import widgets
from IPython.display import display, clear_output

In [None]:
!pip install toloka-kit==0.1.9 #--upgrade
clear_output()
print('Packages are installed!')

In [None]:
# ipyplot is not installed above and is not used in this notebook, so only the Toloka client is imported
import toloka.client as toloka

### Settings

In [1]:
#toloka settings
token = ''

#app name
app_name = 'Webpage + Text Classification'

#toloka text setting
project_name = 'Does the advertising snippet match the website link?' #project title
project_parameters={
'name': project_name, 
'option_other': False, #Are there pairs of text and a webpage that might not apply to any class?
'default_language': 'en', #Choose the language of the content you’ll be using in the application en/ru
'instruction_intro': 'You will be shown an advertisement and a link to the page it advertises. Read the advertisement carefully, follow the link, and study the page. Then answer the question: does the ad match the page?', #What do performers need to do?
'instruction_classes': 
[#Add at least two classes and describe each one
{'label': 'Relevant', 
 'value': 'OK', 
 'description': 'The text of the ad is appropriate and relevant to the website. The meaning of the ad and the meaning of the website are similar, and the wording may overlap.'}, 
{'label': 'Doesnt match', 
 'value': 'BAD', 
 'description': 'The text of the ad does not match and is not relevant to the site. The meaning of the ad and the meaning of the website are not at all similar, the ad is clearly about something other than the website in the link'}
], 
'instruction_examples': [
#Example 1
{'text': 'Yandex.Direct - contextual advertising on Yandex', 
 'label': 'Relevant', 
 'description': 'The ad refers to a contextual advertising site, the link leads to a contextual advertising site, so the verdict is "Relevant".', 
 'website_url': 'https://direct.yandex.ru/'}, 
#Example 2
{'text': 'Auto.ru: buy, sell and exchange cars', 
 'label': 'Relevant', 
 'description': 'The ad advertises a service where you can buy, sell and exchange cars, the link leads to a service where you can do this, so the verdict should be "Relevant"', 
 'website_url': 'https://auto.ru/'}, 
#Example 3
{'text': 'Yandex.Services - search for services and specialists', 
 'label': 'Doesnt match', 
 'description': 'The ad advertises a service, the link leads to a catalogue of Porsche cars, so the verdict should be "Doesnt match"', 
 'website_url': 'https://auto.ru/moskva/cars/porsche/new/'}, 
#Example 4
{'text': 'Buy a new Porsche', 
 'label': 'Doesnt match', 
 'description': 'The ad is talking about new Porsche cars, the link is talking about used Toyota cars, the verdict should be "Doesnt match"', 
 'website_url': 'https://auto.ru/moskva/cars/toyota/used/'}
], 
'instruction_question': 'Is the ad relevant to the page?', 
'option_multiple_choice': False #Are there pairs of text and a webpage that could apply to multiple classes?
}

#advertising XLSX file
adv_file = "direct_example.xlsx"
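Because the parameters are a plain nested dictionary, a small consistency check can catch mistakes before the project is created. The sketch below assumes that every example's `label` should be one of the class labels; that requirement is an assumption of this sketch, not something the source confirms, and `find_unknown_example_labels` is a hypothetical helper.

```python
def find_unknown_example_labels(parameters: dict) -> list:
    """Return example labels that do not appear among the class labels."""
    class_labels = {cls["label"] for cls in parameters.get("instruction_classes", [])}
    return [
        example["label"]
        for example in parameters.get("instruction_examples", [])
        if example["label"] not in class_labels
    ]


# A reduced, made-up parameter set used only to demonstrate the check
sample_parameters = {
    "instruction_classes": [
        {"label": "Relevant", "value": "OK"},
        {"label": "Doesnt match", "value": "BAD"},
    ],
    "instruction_examples": [
        {"text": "Buy a new Porsche", "label": "Match"},
        {"text": "Used Toyota cars", "label": "Doesnt match"},
    ],
}

unknown = find_unknown_example_labels(sample_parameters)
print(unknown)
```

Running the same function on `project_parameters` returns an empty list when all example labels are consistent with the classes.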

### Downloading the Yandex.Direct feed file

In [None]:
if not os.path.exists("./"+adv_file):
    os.system("curl -H 'Accept-Language: en-US,en;q=0.9,it;q=0.8' -d 'cmd=exportSampleCamp&lang=en&new_format=1' https://direct.yandex.net/public/direct_example.xlsx --output direct_example.xlsx")
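If `curl` is not available, the same request can be sketched with the standard library. This is an approximation of the command above under the assumption that the form fields and URL are the same as in the `curl` call; `build_sample_request` and `download_sample` are hypothetical helper names.

```python
import shutil
import urllib.parse
import urllib.request

SAMPLE_URL = "https://direct.yandex.net/public/direct_example.xlsx"


def build_sample_request(url: str = SAMPLE_URL, lang: str = "en") -> urllib.request.Request:
    """Build the POST request that asks for the sample campaign file."""
    form = urllib.parse.urlencode(
        {"cmd": "exportSampleCamp", "lang": lang, "new_format": 1}
    ).encode("ascii")
    # Passing data makes urllib send a POST, like curl's -d option
    return urllib.request.Request(
        url, data=form, headers={"Accept-Language": "en-US,en;q=0.9"}
    )


def download_sample(path: str, timeout: float = 30.0) -> None:
    """Download the sample file and write it to path."""
    request = build_sample_request()
    with urllib.request.urlopen(request, timeout=timeout) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)
```

Calling `download_sample(adv_file)` would then replace the `os.system` call.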

### Loading the advertisements

In [None]:
pd_direct = pd.read_excel(adv_file, sheet_name='Texts', header=9)
pd_direct.sample(3)

In [None]:
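# keep only rows that contain an ad; copy so that adding a column does not modify a view
pd_direct = pd_direct.loc[pd_direct["Title 1"].notnull()].copy()
# combine both titles and the ad text into the snippet shown to performers
pd_direct["text"] = pd_direct["Title 1"] + " | " + pd_direct["Title 2"] + " | " + pd_direct["Ad Text"]
col = ['Ad ID', 'text', 'Link']
df_direct = pd_direct[col]
df_direct.sample(3)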
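Concatenating columns with `+` yields a missing value whenever any of the parts is missing, for example when an ad has no second title. A possible alternative, sketched here on made-up data, joins only the parts that are present; `join_ad_text` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def join_ad_text(frame, columns=("Title 1", "Title 2", "Ad Text"), sep=" | "):
    """Join the given columns row by row, skipping missing or empty parts."""

    def join_row(row):
        parts = [str(value).strip() for value in row if pd.notna(value) and str(value).strip()]
        return sep.join(parts)

    return frame[list(columns)].apply(join_row, axis=1)


# Made-up ads; the second one has no "Title 2"
ads = pd.DataFrame(
    {
        "Title 1": ["Buy a new Porsche", "Auto.ru"],
        "Title 2": ["Official dealer", None],
        "Ad Text": ["Test drive today", "Buy, sell and exchange cars"],
    }
)
ads["text"] = join_ad_text(ads)
print(ads["text"].tolist())
```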

In [None]:
# Create a Toloka client instance
# All API calls will pass through it
toloka_client = toloka.TolokaClient(token, 'PRODUCTION')  # or switch to 'SANDBOX'

# Request the account details, which also checks that the OAuth token is valid
# (print is used because logging.info is hidden at the default log level)
print(toloka_client.get_requester())

### Launching a project

In [None]:
# look up the app by name
app_id = ''
# get_apps returns a generator, so materialize it once before counting and indexing
wtc_apps = list(toloka_client.get_apps(name_lte=app_name, name_gte=app_name))
if len(wtc_apps) == 1:
    app_id = wtc_apps[0].id
else:
    print("This app does not exist")
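A search that returns a generator can be consumed only once, so counting the results and then asking for the first one needs a materialized list. The sketch below shows the pattern on a stand-in generator; `find_single` and `fake_search` are hypothetical names, and the dictionaries only imitate app records.

```python
def find_single(items, what="item"):
    """Return the only element of an iterable, or raise if there is not exactly one."""
    matches = list(items)  # consume the iterable exactly once
    if len(matches) != 1:
        raise LookupError(f"Expected exactly one {what}, found {len(matches)}")
    return matches[0]


def fake_search():
    """Stand-in for a client search call that yields results lazily."""
    yield {"id": "app-1", "name": "Webpage + Text Classification"}


app = find_single(fake_search(), what="app")
print(app["id"])
```

Raising an error instead of printing a message stops the notebook before a project is created with an empty `app_id`.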

In [None]:
#create project
wtc_project_obj = toloka.app.AppProject(
    app_id = app_id,
    name = project_name, 
    parameters= project_parameters)

wtc_project = toloka_client.create_app_project(wtc_project_obj)

In [None]:
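# wait until the project is no longer being created
def wait_project_moderation(project):
    sleep_time = 30
    while project.status.value == 'CREATING':
        print(
            f'{datetime.datetime.now().strftime("%H:%M:%S")} '
            f'Project {project.name} has status {project.status.value}.'
        )
        time.sleep(sleep_time)
        # re-fetch the project; without this the status never changes and the loop never ends
        project = toloka_client.get_app_project(project.id)
    print(project.status.value)
    return project

wtc_project = wait_project_moderation(wtc_project)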
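Waiting loops like this one are easier to reason about when the status is fetched anew on every pass and there is an upper time limit. The following is a generic sketch, not the library's own helper: `wait_while` is a hypothetical function that takes any callable returning the current status.

```python
import time


def wait_while(fetch_status, busy_statuses, interval=30.0, timeout=3600.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it leaves busy_statuses, then return the final status."""
    deadline = clock() + timeout
    status = fetch_status()
    while status in busy_statuses:
        if clock() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout} seconds")
        sleep(interval)
        status = fetch_status()  # ask again instead of reusing a stale value
    return status


# Demonstration with a scripted sequence of statuses and no real sleeping
statuses = iter(["CREATING", "CREATING", "READY"])
final_status = wait_while(lambda: next(statuses), {"CREATING"}, sleep=lambda seconds: None)
print(final_status)
```

In the notebook, `fetch_status` would be a small function that re-requests the project (or batch) from Toloka and returns its `status.value`.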

In [None]:
#create batch
wtc_batch_obj = toloka.app.AppBatch(name=app_name+str(random.randint(1,100)))
wtc_batch = toloka_client.create_app_batch(app_project_id=wtc_project.id,
                             request = wtc_batch_obj)

In [None]:
#add tasks
new_tasks = [
    {
        'id': str(row['Ad ID']), 
        'text': row['text'], 
        'website_url': row['Link']
    }
    for _, row in df_direct.iterrows()
]

toloka_client.create_app_items(
    app_project_id=wtc_project.id,
    batch_id=wtc_batch.id,
    items=new_tasks)
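Rows with a missing text or link cannot be judged by performers, so it may be worth dropping them before the upload. A minimal sketch on made-up data, using the same column names as above; `build_items` is a hypothetical helper.

```python
import pandas as pd


def build_items(frame):
    """Convert ad rows to task items, skipping rows with a missing ID, text or link."""
    complete = frame.dropna(subset=["Ad ID", "text", "Link"])
    return [
        {"id": str(row["Ad ID"]), "text": row["text"], "website_url": row["Link"]}
        for _, row in complete.iterrows()
    ]


# Made-up rows: the second has no text, the third has no link
sample_ads = pd.DataFrame(
    {
        "Ad ID": [101, 102, 103],
        "text": ["Buy a new Porsche", None, "Auto.ru: buy, sell and exchange cars"],
        "Link": ["https://auto.ru/moskva/cars/porsche/new/", "https://auto.ru/", None],
    }
)
items = build_items(sample_ads)
print(items)
```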

In [None]:
# start labeling the batch
toloka_client.start_app_batch(wtc_project.id, wtc_batch.id)
print("Labeling started. Please wait...")

In [None]:
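# wait until the batch is no longer being processed
def wait_batch_for_close(app_project_id, batch):
    sleep_time = 60
    while batch.status.value == 'PROCESSING':
        print(
            f'{datetime.datetime.now().strftime("%H:%M:%S")} '
            f'Batch {batch.name} has status {batch.status.value}.'
        )
        time.sleep(sleep_time)
        # re-fetch the batch; without this the status never changes and the loop never ends
        batch = toloka_client.get_app_batch(app_project_id, batch.id)
    print(batch.status.value)
    return batch

wtc_batch = wait_batch_for_close(wtc_project.id, wtc_batch)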

In [None]:
# collect the results (pandas is imported as pd; DataFrame.append was removed in pandas 2.0)
rows = [
    item.output_data
    for item in toloka_client.get_app_items(app_project_id=wtc_project.id)
    if item.output_data  # skip items that have no output yet
]
result_df = pd.json_normalize(rows)

result_df.to_csv('result.tsv', sep='\t', encoding='utf-8')
result_df.sample(min(3, len(result_df)))
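A quick way to read the outcome is the share of each verdict. The sketch below works on plain dictionaries and assumes, purely for illustration, that each output record stores its verdict under a `result` key; check the actual column names in `result_df` before relying on it. `verdict_shares` is a hypothetical helper.

```python
from collections import Counter


def verdict_shares(outputs, key="result"):
    """Return the fraction of records per verdict, ignoring records without one."""
    counts = Counter(
        record[key] for record in outputs if record and record.get(key) is not None
    )
    total = sum(counts.values())
    return {verdict: count / total for verdict, count in counts.items()} if total else {}


# Made-up output records; the "result" key is an assumption of this sketch
sample_outputs = [{"result": "OK"}, {"result": "BAD"}, {"result": "OK"}, {"result": "OK"}]
shares = verdict_shares(sample_outputs)
print(shares)
```

A high share of `BAD` verdicts points to ads whose landing pages should be reviewed first.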