# Statistical Analysis of Queries

One of the most common things you'll be asked to do as a data scientist (especially one working anywhere near text) is answer questions like:
- "What is the most common search query, or text input, that this product receives?"
- "What kind of metadata characteristics do these queries have?" (this is a loose stand-in for the kinds of questions that will be specific to your problem domain; more on this later)

This section of the Data Science Crash Course focuses on making general observations about text-based fields.

In [8]:
# first, let's import all the libraries we'll need for this exercise
from collections import Counter
from matplotlib import pyplot as plt
import pandas as pd
import pickle
import re

# as always, we'll need to define the data we'd like to use for this exercise
# i had fun using the pickled russian-language llm query data, let's use it again
file_path = '../data/llm_usage_ru_dataset.pkl'

# we'll need to unpickle our list of dictionaries before declaring our dataframe
# (only unpickle files you trust: pickle.load can execute arbitrary code)
with open(file_path, "rb") as f:
    records = pickle.load(f)

# declare our dataframe
df = pd.DataFrame(records)

# have a quick look at the data we've just loaded
print('first few rows of dataframe:')
print(df.head())
print('------')
print('number of rows in dataframe')
print(len(df))

first few rows of dataframe:
                                     id  \
0  b668ae91-bbdc-491a-bca8-b4a8e4a07d0b   
1  5260c1a0-39ca-433f-9ce4-114d67bcc4ff   
2  ad9f9d9f-f511-49f5-afae-fa98c0b2c78b   
3  80218e6a-dcbd-4d76-a6be-4c36d09454ef   
4  b6ada662-6599-4b60-ba8c-0b944234ca01   

                                               query  \
0                    Зеленый угодный изучить металл.   
1                      Выражаться кузнец коричневый.   
2  Вывести неправда изредка избегать поколение во...   
3               Нож через видимо выгнать советовать.   
4  Аллея призыв космос за монета появление совето...   

                                            response        time  feedback  \
0          Забирать рай пламя. Отдел магазин металл.  1723578498  negative   
1  Точно миг правый необычный тута. Порядок рабоч...  1740303596  positive   
2  Белье спешить другой запустить. Расстегнуть го...  1734326957  positive   
3  Палка болото плавно подробность постоянный вск...  173950102

# Cleaning our dataset

When we use a text-based dataset, there are a few steps we need to take before we start analyzing our data. Here are some of the most common steps that you'll see data scientists and analysts take before working with text:
- Convert all text to lower case: this lets us count similar queries more easily, because a machine treats "this" and "This" as different queries when they are really the same!
- "Strip" leading and trailing whitespace: user error or an upstream tokenization process can leave some queries with " extra spaces" before or "extra spaces " after the core search term.
- Remove stopwords: stopwords are very common words that can muddle the results of a text count. More on this in just a moment.
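
To see why the first two steps matter, here is a minimal sketch using a handful of made-up queries (they are an assumption for illustration, not rows from our dataset). It counts the queries before and after lower-casing and stripping:

```python
from collections import Counter

# toy queries, made up for illustration (not taken from the dataset)
raw_queries = ["This", "this", " this", "this ", "That"]

# without cleaning, every variant is counted as its own query
raw_counts = Counter(raw_queries)

# lower-case and strip each query before counting
clean_counts = Counter(q.lower().strip() for q in raw_queries)

print(len(raw_counts))               # number of distinct raw queries
print(clean_counts.most_common(1))   # most common query after cleaning
```

Before cleaning, all five strings look distinct to the counter; after cleaning, the four variants of "this" collapse into a single query.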

Let's work through the cleaning process and have a look at our results!

In [None]:
# define our stopwords
ru_stopwords = [
    'а', 'и', 'в', 'на', 'не', 'с', 'о', 'к', 'за', 'по',
    'из', 'до', 'как', 'что', 'где', 'когда', 'кто', 'какой',
    'который', 'этот', 'тот', 'свой', 'весь', 'каждый',
    'один', 'два', 'три', 'четыре', 'пять', 'шесть',
    'семь', 'восемь', 'девять', 'десять', 'первый',
    'второй', 'третий', 'четвёртый', 'пятый', 'мой',
    'твой', 'наш', 'ваш', 'его', 'её', 'их', 'был',
    'была', 'было', 'были', 'есть', 'нет', 'да', 'но',
    'или', 'если', 'чтобы', 'потому', 'поэтому', 'тоже',
    'также', 'даже', 'только', 'уже', 'ещё', 'вот',
    'там', 'здесь', 'туда', 'сюда', 'оттуда', 'откуда',
    'всегда', 'никогда', 'сейчас', 'тогда', 'раньше',
    'позже', 'быстро', 'медленно', 'хорошо', 'плохо',
    'можно', 'нужно', 'должен', 'может'
]

# convert all of our queries to lower case and strip leading and trailing whitespace
df['clean_query'] = df['query'].str.lower().str.strip()
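
One way the stopword step could look is sketched below. This is only an illustration under stated assumptions: `remove_stopwords` is a hypothetical helper, `toy_stopwords` and `toy_queries` are made-up stand-ins for `ru_stopwords` and our real queries, and splitting on whitespace after replacing punctuation with spaces is a deliberately simple form of tokenization.

```python
import re

# toy stand-ins (assumptions, not the notebook's data):
# a tiny stopword set and a few made-up queries
toy_stopwords = {"и", "в", "на", "за"}
toy_queries = ["  Кот и собака. ", "Книга на столе", "В лесу"]

def remove_stopwords(text, stopwords):
    # hypothetical helper: replace punctuation with spaces, split on whitespace,
    # then keep only the tokens that are not stopwords
    tokens = re.sub(r"[^\w\s]", " ", text).split()
    return " ".join(token for token in tokens if token not in stopwords)

# lower-case and strip first, exactly as in the cleaning step, then filter
cleaned = [remove_stopwords(q.lower().strip(), toy_stopwords) for q in toy_queries]
print(cleaned)
```

On our dataframe, the same helper could be applied with something like `df['clean_query'].apply(lambda q: remove_stopwords(q, set(ru_stopwords)))`. Converting the list to a set first makes each membership check faster.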

