<a href="https://colab.research.google.com/github/ilexistools/kitconc-usage/blob/main/basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Kitconc usage**

Kitconc is a free package for Corpus Linguistics and text analysis with Python.

It contains, among other things, tools for creating:

* Corpora;
* Frequency wordlists;
* Keywords;
* Concordance lines;
* Collocates;
* N-gram lists;
* Dispersion plots;
* Excel data files.

The package builds on established scientific Python libraries: numpy, nltk, pandas, and xlsxwriter.

## **1. Install kitconc**

We can use *pip* to install it:

In [None]:
!pip install kitconc

## **2. Import modules**

These are the main classes we are going to use:


In [3]:
from kitconc.kit_corpus import Corpora, Corpus


## **3. Create a corpus**

Here, we create a new corpus from raw texts:


In [21]:
# creates/sets a workspace folder
corpora = Corpora('workspace')

# downloads raw texts
url = 'https://github.com/ilexistools/kitconc-usage/raw/main/raw_ads.zip'
corpora.download_raw_texts(url) # unzips to 'ads' folder

# creates a new corpus from raw texts, or opens it if it already exists
if 'jobs' not in corpora.list_all():
    corpus = corpora.create('jobs', 'english', 'ads')
else:
    corpus = corpora.open('jobs')


## **4. Import a corpus**

Import a previously created corpus (the same one as in section 3) from a URL:


In [26]:
# creates/sets a workspace folder
corpora = Corpora('workspace')

# imports the corpus from a URL, unless it is already in the workspace
url = 'https://github.com/ilexistools/kitconc-usage/raw/main/jobs.zip'
if 'jobs' not in corpora.list_all():
    corpora.import_corpus_from_url(url)

# opens the imported corpus
corpus = corpora.open('jobs')


## **5. Wordlist - Frequency**

In [30]:
# create the wordlist
wordlist = corpus.wordlist()
# save to Excel
wordlist.save_excel('wordlist.xlsx')
# print the 10 most frequent words on screen
wordlist.df.head(10) 




Unnamed: 0,N,WORD,FREQUENCY,%
0,1,and,1383,5.05
1,2,to,691,2.52
2,3,the,627,2.29
3,4,of,540,1.97
4,5,a,423,1.55
5,6,in,410,1.5
6,7,for,304,1.11
7,8,with,292,1.07
8,9,is,207,0.76
9,10,experience,198,0.72
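
The `%` column appears to be each word's share of all tokens: `and` occurs 1383 times in a corpus of 27378 tokens, which is about 5.05%. As a rough illustration of how such a list can be built, here is a minimal sketch in plain Python. The `frequency_wordlist` helper, its regex tokenizer, and the sample texts are assumptions made for this example and are not part of kitconc, whose own tokenization may differ.

```python
from collections import Counter
import re


def frequency_wordlist(texts):
    """Return (rank, word, frequency, percent) rows, most frequent first."""
    # Hypothetical tokenizer: lowercase runs of ASCII letters only.
    tokens = [t for text in texts for t in re.findall(r"[a-z]+", text.lower())]
    total = len(tokens)
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        # Percent = share of this word among all tokens.
        rows.append((rank, word, freq, round(freq / total * 100, 2)))
    return rows


# Made-up sample texts, not taken from the 'jobs' corpus.
sample_texts = [
    "Experience in sales and marketing is required.",
    "Sales experience and team skills are a plus.",
]
rows = frequency_wordlist(sample_texts)
for row in rows[:3]:
    print(row)
```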


In [35]:
# print corpus info
labels = ['Tokens', 'Types', 'TTR', 'Hapax']
info = corpus.info()  # call once and reuse the result
for i, label in enumerate(labels):
    print(label + ':' + str(info[i]))


Tokens:27378
Types:4467
TTR:16.316020965576172
Hapax:2370
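
The TTR value above matches the type-token ratio expressed as a percentage, that is, types divided by tokens times 100. The short check below recomputes it from the printed counts. The variable names are chosen for this example only, and reading `Hapax` as the number of words that occur exactly once follows the usual meaning of the term rather than anything stated here.

```python
# Counts copied from the corpus info printed above.
tokens = 27378
types = 4467
hapax = 2370

# Type-token ratio as a percentage.
ttr = types / tokens * 100
print(f"TTR: {ttr:.2f}")

# Share of types that are hapax (assuming hapax = words occurring once).
hapax_share = hapax / types * 100
print(f"Hapax share of types: {hapax_share:.2f}")
```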


## **6. Keywords**

In [31]:
# extract the keywords
keywords = corpus.keywords()
# save to Excel
keywords.save_excel('keywords.xlsx')
# print the 10 highest-ranked keywords on screen
keywords.df.head(10)

Unnamed: 0,N,WORD,FREQUENCY,KEYNESS
0,1,experience,198,970.11
1,2,skills,115,632.1
2,3,s,113,584.49
3,4,ability,83,403.57
4,5,and,1383,330.95
5,6,sales,72,312.4
6,7,team,85,299.72
7,8,corporation,50,286.42
8,9,requirements,56,274.56
9,10,management,83,267.33


In [37]:
# using a stoplist
stopwords = ['s','and']
keywords = corpus.keywords(stoplist=stopwords)
keywords.save_excel('keywords_stop.xlsx')
keywords.df.head(10)  

Unnamed: 0,N,WORD,FREQUENCY,KEYNESS
0,1,experience,198,970.11
1,2,skills,115,632.1
2,3,ability,83,403.57
3,4,sales,72,312.4
4,5,team,85,299.72
5,6,corporation,50,286.42
6,7,requirements,56,274.56
7,8,management,83,267.33
8,9,marketing,46,221.21
9,10,degree,54,209.3
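
Keyness compares how often a word occurs in the study corpus with how often it occurs in a reference corpus. A statistic commonly used for this is log-likelihood; this notebook does not state which measure kitconc applies, so the sketch below is only an illustration of the general idea. The `log_likelihood` function and the reference-corpus figures are hypothetical and chosen for the example.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness for one word across two corpora."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


# 'experience' occurs 198 times in the 27378-token corpus above;
# the reference figures (100 in 1,000,000 tokens) are invented.
score = log_likelihood(198, 27378, 100, 1_000_000)
print(round(score, 2))
```

A word with the same relative frequency in both corpora scores 0, and the score grows as the two relative frequencies diverge.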


## **7. Range**

In [39]:
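# count the number of files each word occurs in
# (named word_range to avoid shadowing Python's built-in range)
word_range = corpus.wfreqinfiles()
# save to Excel
word_range.save_excel('range.xlsx')
# print the 10 most widely distributed words on screen
word_range.df.head(10)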

Unnamed: 0,N,WORD,RANGE,%
0,1,to,94,98.95
1,2,in,89,93.68
2,3,and,88,92.63
3,4,of,87,91.58
4,5,for,87,91.58
5,6,the,86,90.53
6,7,a,84,88.42
7,8,with,79,83.16
8,9,experience,76,80.0
9,10,is,73,76.84
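
Range is the number of files in which a word occurs at least once, and the `%` column is that number as a share of all files. The figures above are consistent with a corpus of 95 files (94 / 95 is about 98.95%). The following is a minimal sketch of the calculation, assuming the texts are already tokenized; the `word_ranges` helper and the file names are made up for this example and are not kitconc functions.

```python
from collections import Counter


def word_ranges(files):
    """Return (word, range, percent) rows, widest range first."""
    n_files = len(files)
    in_files = Counter()
    for tokens in files.values():
        # set() makes each word count once per file, however often it repeats.
        in_files.update(set(tokens))
    ranked = sorted(in_files.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, n, round(n / n_files * 100, 2)) for word, n in ranked]


# Hypothetical, pre-tokenized files.
files = {
    "ad1.txt": ["experience", "in", "sales"],
    "ad2.txt": ["sales", "team", "in", "in"],
    "ad3.txt": ["experience", "in", "marketing"],
}
ranges = word_ranges(files)
for row in ranges:
    print(row)
```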


## **8. Frequency of words and tags (POS)**

In [40]:
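# count each word together with its part-of-speech tag
wt = corpus.wtfreq()
# save to Excel
wt.save_excel('wt.xlsx')
# print the 10 most frequent word/tag pairs on screen
wt.df.head(10)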

Unnamed: 0,N,WORD,TAG,FREQUENCY,%
0,1,and,CC,1370,5.0
1,2,",",",",1366,4.99
2,3,.,.,1053,3.85
3,4,the,AT,611,2.23
4,5,of,IN,518,1.89
5,6,a,AT,415,1.52
6,7,in,IN,406,1.48
7,8,to,TO,355,1.3
8,9,:,:,334,1.22
9,10,to,IN,327,1.19
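
Because the count is made per word/tag pair, the same word can appear on several rows: `to` is listed above once as `TO` and once as `IN`. The sketch below shows the idea with a hand-tagged token list. The sample sentence and its tags are invented for illustration, and no tagger is run here.

```python
from collections import Counter

# Hypothetical, hand-tagged tokens as (word, tag) pairs.
tagged = [
    ("to", "TO"), ("apply", "VB"), (",", ","),
    ("go", "VB"), ("to", "IN"), ("the", "AT"), ("office", "NN"),
    ("to", "TO"), ("register", "VB"), (".", "."),
]

# Count each (word, tag) pair separately.
pair_counts = Counter(tagged)
total = len(tagged)

rows = [
    (word, tag, freq, round(freq / total * 100, 2))
    for (word, tag), freq in pair_counts.most_common()
]
for row in rows[:3]:
    print(row)
```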
