# Prepare a use-case dataset

This notebook loads the original use-case names and their COSMIC sizes and produces a dataset that can be published without revealing confidential information: each translated title is replaced by its lowercased tokens, with punctuation and stop words removed.

The original input file is not provided in this replication package.

## Imports

In [28]:
import pandas as pd

import spacy
import spacy.lang.en.stop_words  # STOP_WORDS is read from this module below

from paths import input_folder

## Load original data

In [29]:
use_cases_raw_df = pd.read_csv(f"{input_folder}use-cases-raw.csv", index_col=0)
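
Because the raw file is not distributed, the sketch below builds a small stand-in with the columns this notebook appears to expect. The column names are inferred from the output shown later plus the `TitleTranslated` column that the notebook reads and then drops; their order, the project and use-case identifiers, and both titles are invented placeholders, not project data.

```python
import io

import pandas as pd

# Hypothetical stand-in for use-cases-raw.csv. Column names are inferred from
# the notebook; the rows are invented placeholders, not real project data.
raw_csv = io.StringIO(
    ",ProjectID,UC,TransTypes,UCType,Cfp,TitleTranslated\n"
    "0,P00,UC0-0-1,C|D|R|U,C|D|R|U,16,Manage the example items (CRUD)\n"
    "1,P00,UC0-0-2,L|R,L,18,Assign example items to groups\n"
)

# index_col=0 mirrors the read_csv call above: the first column becomes the index.
example_raw_df = pd.read_csv(raw_csv, index_col=0)
print(example_raw_df.columns.tolist())
```

A file with this shape, saved as `use-cases-raw.csv` in `input_folder`, should be enough to run the remaining cells end to end.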

## Transform the data

In [30]:
nlp = spacy.load('en_core_web_lg')

# spaCy's English stop-word list and the punctuation characters to drop
stop_words = spacy.lang.en.stop_words.STOP_WORDS
filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

In [31]:
titles = []
verbs = set()
for text in use_cases_raw_df['TitleTranslated']:
    doc = nlp(text)
    # Keep lowercased tokens that are not whitespace, punctuation or stop words
    kept = [token.lower_ for token in doc
            if not token.orth_.isspace()
            and token.orth_ not in filters
            and token.orth_ not in stop_words]
    titles.append(" ".join(kept))
    # Collect the lemmas of all tokens tagged as verbs
    verbs |= {token.lemma_ for token in doc if token.pos_ == "VERB"}
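
To make the filtering rule concrete without loading the spaCy model, here is a minimal pure-Python sketch of the same three checks applied to an already tokenized title. The function `filter_tokens`, the tiny `example_stop_words` set, and the example title are all hypothetical stand-ins; in the notebook, tokenization and the stop-word list come from spaCy.

```python
# Same punctuation string as in the notebook.
filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

# Hypothetical, tiny stop-word set standing in for spaCy's STOP_WORDS.
example_stop_words = {"the", "of", "to", "for", "and", "a"}


def filter_tokens(tokens):
    """Keep lowercased tokens that are not whitespace, punctuation or stop words."""
    kept = [
        tok.lower()
        for tok in tokens
        if not tok.isspace()
        and tok not in filters             # substring test against the filter string
        and tok not in example_stop_words  # tested on the original casing
    ]
    return " ".join(kept)


# Invented, pre-tokenized title (spaCy produces the tokens in the notebook).
example_tokens = ["Manage", "the", "exams", "(", "CRUD", ")"]
example_result = filter_tokens(example_tokens)
print(example_result)  # manage exams crud
```

Two properties of this logic are worth keeping in mind. The punctuation check is a substring test on a string, so a multi-character token such as `...` is kept because it does not occur in `filters`. The stop-word check runs before lowercasing, so a capitalized `The` would pass a lowercase-only stop-word list and end up in the output as `the`.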

In [32]:
use_cases_df = use_cases_raw_df.drop(columns=['TitleTranslated'])

In [33]:
use_cases_df['TitleTokens'] = titles

In [34]:
use_cases_df.head(10)

,ProjectID,UC,TransTypes,UCType,Cfp,TitleTokens
0,P01,UC2-1-1,C|D|R|U,C|D|R|U,16,manage faculties crud
1,P01,UC2-1-10,DL|L|R,L,27,assign science olympiads major specialty edit ...
2,P01,UC2-1-11,CS|R,CS,7,manage ranking algorithms
3,P01,UC2-1-13,C|D|R|U,C|D|R|U,17,manage exams crud
4,P01,UC2-1-14,DL|L|R,L,27,manage assignments exams majors specialties
5,P01,UC2-1-16,C|D|R|U,C|D|R|U,18,manage courses crud
6,P01,UC2-1-18,DL|L|R,L,27,manage assignments courses majors specialties
7,P01,UC2-1-19,C|D|R|U|U,C|D|R|U,25,manage grading scales crud
8,P01,UC2-1-2,C|D|R|U,C|D|R|U,20,manage admissions crud
9,P01,UC2-1-20,L|R,L,18,manage assignments ranking algorithms majors s...


In [35]:
verbs_df = pd.DataFrame({'verb':list(verbs)})
verbs_df.head(10)

,verb
0,add
1,base
2,remind
3,attend
4,settle
5,lock
6,give
7,import
8,search
9,effort


## Save data

In [36]:
use_cases_df.to_csv(f"{input_folder}use-cases.csv")

In [37]:
verbs_df.to_csv(f"{input_folder}verbs.csv")
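
Before publishing, it may be worth reading the saved file back and confirming that the original titles are gone. The sketch below shows one way to do that, using an in-memory buffer and an invented one-row frame in place of the real files; `example_df` and its values are placeholders.

```python
import io

import pandas as pd

# Invented one-row frame with the columns of the published dataset.
example_df = pd.DataFrame(
    {
        "ProjectID": ["P00"],
        "UC": ["UC0-0-1"],
        "TransTypes": ["C|D|R|U"],
        "UCType": ["C|D|R|U"],
        "Cfp": [16],
        "TitleTokens": ["manage example items crud"],
    }
)

# Write to an in-memory buffer instead of f"{input_folder}use-cases.csv".
buffer = io.StringIO()
example_df.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the index that to_csv wrote as the first column.
reloaded_df = pd.read_csv(buffer, index_col=0)

# The confidential column must be absent and the kept columns unchanged.
assert "TitleTranslated" not in reloaded_df.columns
assert reloaded_df.columns.tolist() == example_df.columns.tolist()
```

Reading with `index_col=0` matters here: without it, the index written by `to_csv` comes back as an extra column named `Unnamed: 0`.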