# Search queries of corsiuniversitari.info

This dataset contains the searches made on the website corsiuniversitari.info. After collecting two months of data, the aim of this analysis is to understand which courses are searched the most, in order to create new pages on the website. The main problem is that most searches are incomplete: users usually type only part of a course name, such as "info" for "informatica".

In [21]:
import re

import polars as pl

In [2]:
ds = pl.read_csv(
    source="queries-sito.csv",
    dtypes=[pl.UInt64, pl.Utf8, pl.Utf8, pl.Datetime],
    quote_char='"',
)

In [3]:
ds.head()

ID,query,page,time
u64,str,str,datetime[μs]
12,"""test""","""corsi-di-laure…",2023-04-17 13:00:52
13,"""Test""","""tutti-i-corsi""",2023-04-17 13:05:14
14,"""Telec""","""tutti-i-corsi""",2023-04-17 13:08:45
15,"""Parma""","""lista-corsi-di…",2023-04-17 13:10:53
16,"""Pa""","""lista-corsi-di…",2023-04-17 13:10:59


Date and time of the first query collected.

In [4]:
ds.select("time").min()

time
datetime[μs]
2023-04-17 13:00:52


Date and time of the last query collected.

In [5]:
ds.select("time").max()

time
datetime[μs]
2023-06-22 08:09:08


Clean the queries (lowercase them and strip surrounding whitespace) so that identical searches are counted together.

In [6]:
# Lowercase and strip whitespace with native string expressions instead of Python lambdas.
ds = ds.with_columns(pl.col("query").str.to_lowercase().str.strip())
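
To see why this normalization matters for counting, here is a minimal plain-Python sketch (standard library only, not polars). The `raw` list is illustrative: only the "test"/"Test" pair mirrors the first rows shown above, the rest is made up.

```python
from collections import Counter

# Illustrative queries (assumption); "test" and "Test" mirror the first rows above.
raw = ["test", "Test", " Parma", "parma ", "Pa"]

# Counting the raw strings treats case and whitespace variants as different queries.
before = Counter(raw)

# Counting after strip + lower merges those variants.
after = Counter(q.strip().lower() for q in raw)

print(before.most_common())
print(after.most_common())
```

Without cleaning, every variant is its own entry; after cleaning, the variants of the same search collapse into one count.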

In [7]:
ds.select("query").groupby("query", maintain_order=False).count().sort(
    "count", descending=True
)

query,count
str,u32
"""economia""",120
"""triennale""",116
"""milano""",89
"""psicologia""",84
"""ingegneria""",81
"""informatica""",72
"""scienze""",65
"""design""",59
"""roma""",51
"""lingue""",49


In [8]:
# Replace empty strings in every string column with null, then drop the rows that contain nulls.
ds = ds.with_columns(
    [
        pl.when(pl.col(pl.Utf8).str.lengths() == 0)
        .then(None)
        .otherwise(pl.col(pl.Utf8))
        .keep_name()
    ]
).drop_nulls()

Some queries, such as "ingegneria edile", contain more than one word, so let's see which single words (unigrams) are searched the most.

In [17]:
query = []
queryLen = []

for i in ds.select("query").iter_rows():
    for j in i[0].split():
        query.append(j)
        queryLen.append(len(j))

data = {"query": query, "queryLen": queryLen}
unigram = pl.DataFrame(data)
unigram.head()

query,queryLen
str,i64
"""test""",4
"""test""",4
"""telec""",5
"""parma""",5
"""pa""",2

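The same word-level split can also be sketched with `collections.Counter` from the standard library, which counts the words directly. This is an alternative illustration, not the notebook's method, and the `queries` list is a small made-up sample of already cleaned queries rather than the real data.

```python
from collections import Counter

# Illustrative, already cleaned queries (assumption, not the real dataset).
queries = ["ingegneria edile", "scienze politiche", "ingegneria", "scienze e tecnologie"]

# Split each query on whitespace and count every resulting word.
unigram_counts = Counter(word for q in queries for word in q.split())

for word, count in unigram_counts.most_common(3):
    print(word, count)
```

Note that short function words such as "e" are counted like any other word, so they may need to be filtered out before interpreting the ranking.
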

Let's check whether many words are truncated or whether most of them are written in full.

In [19]:
unigram.select("queryLen").mean()

queryLen
f64
6.851929


In [20]:
unigram.select("queryLen").median()

queryLen
f64
7.0
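
The mean and median length alone do not say how many words are truncated. A rough complementary check, sketched here in plain Python, is to measure how many words are a strict prefix of a longer word seen in the same data. The `truncated_share` helper and the `words` list are hypothetical; on the real data one would pass the values of the `query` column of `unigram`. This heuristic also flags legitimate short words that happen to be a prefix of a longer one, so it is only an approximation.

```python
def truncated_share(words):
    """Fraction of words that are a strict prefix of a longer observed word."""
    if not words:
        return 0.0
    vocab = set(words)

    def is_truncated(word):
        # A word counts as truncated if some other observed word starts with it.
        return any(other != word and other.startswith(word) for other in vocab)

    return sum(is_truncated(w) for w in words) / len(words)


# Illustrative words (assumption): "info" and "pa" are prefixes of longer entries.
words = ["info", "informatica", "telec", "parma", "pa"]
print(truncated_share(words))
```

The comparison is quadratic in the vocabulary size, which should be acceptable for a few thousand distinct words; a sorted list or a trie would scale better.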


Scienze and ingegneria become the two most searched words.

In [18]:
unigram.select("query").groupby("query", maintain_order=False).count().sort(
    "count", descending=True
)

query,count
str,u32
"""scienze""",412
"""ingegneria""",243
"""e""",213
"""economia""",183
"""triennale""",132
"""informatica""",104
"""psicologia""",100
"""milano""",95
"""design""",88
"""lingue""",74


Next we look at which kinds of ingegneria, scienze and economia are searched the most. The total number of searches for each keyword is high, but it is spread thinly across the individual courses. For scienze, the problem is that the word covers unrelated fields: political science is completely different from biology. For economia, most searches are for the bare word "economia", and only a few refer to a specific course.

In [32]:
ds.with_columns(default_match=pl.col("query").str.contains("ingegneria")).filter(
    pl.col("default_match")
).select("query").groupby("query", maintain_order=False).count().sort(
    "count", descending=True
)

query,count
str,u32
"""ingegneria""",81
"""ingegneria inf…",19
"""ingegneria ges…",14
"""ingegneria mec…",8
"""ingegneria bio…",6
"""ingegneria edi…",6
"""ingegneria aer…",5
"""ingegneria ene…",4
"""ingegneria civ…",4
"""ingegneria del…",4


In [33]:
ds.with_columns(default_match=pl.col("query").str.contains("scienze")).filter(
    pl.col("default_match")
).select("query").groupby("query", maintain_order=False).count().sort(
    "count", descending=True
)

query,count
str,u32
"""scienze""",65
"""scienze motori…",24
"""scienze biolog…",19
"""scienze politi…",17
"""scienze della …",9
"""scienze della""",8
"""scienze dell'e…",7
"""scienze natura…",6
"""scienze biolo""",5
"""scienze moto""",5


In [34]:
ds.with_columns(default_match=pl.col("query").str.contains("economia")).filter(
    pl.col("default_match")
).select("query").groupby("query", maintain_order=False).count().sort(
    "count", descending=True
)

query,count
str,u32
"""economia""",120
"""economia azien…",7
"""economia e man…",6
"""economia e com…",3
"""economia dei""",2
"""economia e man…",2
"""economia magis…",2
"""economia e ges…",2
"""economia azien…",1
"""economia e so""",1

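To put a rough number on how the searches split between the bare keyword and more specific queries, the sketch below computes the share of the bare keyword. It uses only the counts visible in the three outputs above, so if those outputs are truncated the real shares are lower. The `bare_share` helper is hypothetical.

```python
# Counts copied from the rows visible in the three outputs above:
# (count of the bare keyword, counts of the other visible queries containing it).
visible = {
    "ingegneria": (81, [19, 14, 8, 6, 6, 5, 4, 4, 4]),
    "scienze": (65, [24, 19, 17, 9, 8, 7, 6, 5, 5]),
    "economia": (120, [7, 6, 3, 2, 2, 2, 2, 1, 1]),
}


def bare_share(bare, qualified):
    """Share of the bare keyword among all visible queries containing it."""
    return bare / (bare + sum(qualified))


shares = {word: bare_share(bare, rest) for word, (bare, rest) in visible.items()}
for word, share in shares.items():
    print(f"{word}: {share:.2f}")
```

Among the visible rows, economia has the largest share of bare searches and scienze the smallest, which matches the observation that economia searches rarely name a specific course.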