<a href="https://colab.research.google.com/github/ms624atyale/Scratch/blob/main/LexicalAnalysis101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🔧 🔨 👀 Essential Steps before Text Analysis

## 🐹 🐾 [Text Corpus <font size='1.8'>코퍼스/말뭉치</font>](https://en.wikipedia.org/wiki/Text_corpus)  
- In linguistics, a corpus (plural corpora) or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, such as checking occurrences or validating linguistic rules within a specific language territory.

- The **corpus-toolkit** package grew out of courses in corpus linguistics and learner corpus research. The toolkit attempts to balance simplicity of use, broad application, and scalability. Common corpus analyses such as the <font color = 'red'>_calculation of word and n-gram frequency and range, keyness, and collocation_</font> are included. More advanced analyses, such as the identification of <font color = 'red'>_dependency bigrams (e.g., verb-direct object combinations) and their frequency, range, and strength of association_</font>, are also included. (https://pypi.org/project/corpus-toolkit/)

Several prerequisites must be in place before you run a corpus analysis.

>1. Know how to read and write files through Python's operating-system interface.
>2. 🆘 Import the **[os](https://docs.python.org/3/library/os.html)** module.
>3. Have the text you want to analyze (e.g., an HTML document at a URL (uniform resource locator), or text files under the Files directory of Google Colab).
>4. Text ➡️ Words: **tokenization**
>5. Inflected or derived word forms ↔️ a single base form that groups the variants of the same word (i.e., **lemmatization**)
>6. POS (part-of-speech) tagging, i.e., pairing each word with its grammatical category
>7. 🆘 Install the **corpus-toolkit** and **nltk** (Natural Language Toolkit) packages.
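
To make the tokenization and frequency steps in the list above concrete, here is a minimal sketch that uses only the Python standard library. The `simple_tokenize` helper, its regular expression, and the sample sentence are illustrative assumptions; this is not how corpus-toolkit works internally.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Lowercase the text and return runs of letters (and apostrophes) as tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Hypothetical sample sentence, not taken from the fable.
sample = "The Fox lost his tail. The other foxes kept their tails."
tokens = simple_tokenize(sample)
freq = Counter(tokens)  # maps each token to the number of times it occurs

print(tokens)
print(freq["the"], freq["fox"], freq["foxes"])
```

Note that `fox` and `foxes` (and `tail` and `tails`) are counted separately here. Grouping such variants under one base form is what lemmatization is for.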
 


In [1]:
#@markdown 📌 Import the os module (it ships with Python, so nothing needs to be downloaded)
import os

In [18]:
#@markdown 📌 Make a new working directory named "txtfolder". 📎 <Module name: os> <function: makedirs>

os.makedirs("txtfolder", exist_ok=True)  # exist_ok=True avoids an error when the cell is re-run

In [19]:
#@markdown 📌 Install the corpus-toolkit package
!pip install corpus-toolkit

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [20]:
#@markdown 📌 Get working directory. <code line: print working directory>
%pwd

'/content'

In [21]:
#@markdown 📌 Open a txt file. <Use a pair of double quotation marks "" and assign the URL to the _url_ variable>
url = "https://raw.githubusercontent.com/ms624atyale/Data_Misc/main/TheAesop_theFoxwithoutaTail.txt"

# Download the text at the URL and save it as foxtail.txt in the current working directory.
os.system("curl " + url + " > foxtail.txt")

# Read the file; the with-block closes it automatically.
with open("foxtail.txt", encoding="utf-8") as file:
    text = file.read().replace("\n", " ")  # Replace each newline with a space.

#@markdown 📎 When you see foxtail.txt under the Files directory, move it into the txtfolder folder you created by drag & drop.
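
As an alternative to drag & drop, the file can also be moved with code. The sketch below uses `shutil.move` inside a temporary directory so it is safe to run anywhere; the stand-in file content is an assumption for illustration only.

```python
import os
import shutil
import tempfile

# Work inside a temporary directory so the sketch does not touch real files.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "foxtail.txt")     # stand-in for the downloaded file
dest_dir = os.path.join(workdir, "txtfolder")  # stand-in for the corpus folder

with open(src, "w", encoding="utf-8") as f:
    f.write("A Fox lost his tail in a trap.")  # placeholder text

os.makedirs(dest_dir, exist_ok=True)     # no error if the folder already exists
moved_path = shutil.move(src, dest_dir)  # returns the new path of the file

print(os.listdir(dest_dir))
```

In the notebook itself, the equivalent call would be `shutil.move("foxtail.txt", "txtfolder")`, assuming both exist in the current working directory.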

In [22]:
#@markdown 📌 i) Tokenize your text and ii) Get frequency.

from corpus_toolkit import corpus_tools as ct
txt = ct.ldcorpus("txtfolder") #load and read the 'txtfolder' folder (cf., NOT "foxtail.txt")
tok_corp = ct.tokenize(txt) #tokenize corpus - by default this lemmatizes as well
txt_freq = ct.frequency(tok_corp) #creates a frequency dictionary from the tokenized corpus

Processing foxtail.txt (1 of 1 files)


In [24]:
#@markdown 📌 Tagging (i.e., associating each token with a grammatical category, e.g., mountain_NOUN)
ct.write_corpus("tagged_txt", ct.tag(ct.ldcorpus("txtfolder"))) #tag the corpus and write the result to the 'tagged_txt' folder

Writing files to existing folder
Processing foxtail.txt (1 of 1 files)


In [25]:
#@markdown 📌 Get frequency of your tagged tokens. 'hits=10' means you want to get the top 10 words. 

tagged_freq = ct.frequency(ct.reload("tagged_txt"))
ct.head(tagged_freq, hits = 10)

Processing 1.txt (1 of 1 files)
he_PRON	16
the_DET	13
of_ADP	13
a_DET	12
to_PART	9
and_CCONJ	9
Fox_PROPN	8
tail_NOUN	8
be_AUX	7
have_AUX	6
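
The idea behind `hits = 10` can be sketched in plain Python: sort the frequency dictionary by count and keep the first few entries. The `top_hits` helper below is a hypothetical illustration, not the actual implementation of `ct.head`; the sample counts are copied from the output above.

```python
# A few frequencies copied from the output above, deliberately out of order.
sample_freq = {
    "tail_NOUN": 8, "he_PRON": 16, "a_DET": 12, "the_DET": 13,
    "of_ADP": 13, "Fox_PROPN": 8, "to_PART": 9,
}

def top_hits(freq, hits=10):
    """Return the `hits` most frequent (item, count) pairs, highest count first."""
    return sorted(freq.items(), key=lambda kv: kv[1], reverse=True)[:hits]

for item, count in top_hits(sample_freq, hits=3):
    print(f"{item}\t{count}")
```

Because Python's sort is stable, items with equal counts keep the order in which they appear in the dictionary.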


## 💡 Now, let's save tagged data as a dataframe and get word clouds!

In [26]:
#@markdown 📌  Tagged data is in a dictionary format (e.g., {key:value}).
type(tagged_freq)

dict

In [27]:
#@markdown 📌 Import the pandas package to work with dataframes.

import pandas as pd

In [28]:
#@markdown 📌 Generate a dataframe with tagged words (e.g., word_POS) and their frequencies.

data_dict = tagged_freq
data_items = data_dict.items()
data_list = list(data_items)
df = pd.DataFrame(data_list)
df.columns = ["Tagged","Freq"]
print(df)

          Tagged  Freq
0        the_DET    13
1     Æsop_PROPN     1
2        for_ADP     4
3     child_NOUN     1
4      Fox_PROPN     8
..           ...   ...
156  advice_NOUN     1
157    seek_VERB     1
158   lower_VERB     1
159      own_ADJ     1
160   level_NOUN     1

[161 rows x 2 columns]
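
The same conversion can be written more compactly by passing the column names to the `DataFrame` constructor. This is a minimal sketch on a small sample; `sample_freq` and `df_sample` are illustrative names, and the five entries are copied from the output above.

```python
import pandas as pd

# A few entries copied from the output above; the full dictionary has 161 items.
sample_freq = {"the_DET": 13, "Æsop_PROPN": 1, "for_ADP": 4, "child_NOUN": 1, "Fox_PROPN": 8}

# Build the two-column dataframe directly from the (key, value) pairs.
df_sample = pd.DataFrame(list(sample_freq.items()), columns=["Tagged", "Freq"])
print(df_sample)
```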


## 💡 Splitting the tagged column into Words and POS <font size = '2.3'> part of speech (i.e., grammatical categories)</font>
  - e.g., 
          Tagged column         column 1 (Word)    column 2 (POS)
          yesterday_ADV   ➡️    yesterday          ADV
          rain_NOUN       ➡️    rain               NOUN
          yellow_ADJ      ➡️    yellow             ADJ

In [29]:
#@markdown 📌 Code to split the tagged column into words and POS

tagged = df["Tagged"]
pos = []
word = []

for w in tagged:  # iterate over each "word_POS" string
  ws = w.rsplit("_", 1)  # split on the last underscore only, in case the word itself contains "_"
  word.append(ws[0])
  pos.append(ws[1])

print(len(tagged))
print(word[:10])
print(pos[:10])

161
['the', 'Æsop', 'for', 'child', 'Fox', 'without', 'a', 'Tail', 'that', 'have']
['DET', 'PROPN', 'ADP', 'NOUN', 'PROPN', 'ADP', 'DET', 'PROPN', 'PRON', 'AUX']
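
One detail worth checking when splitting `word_POS` strings is what happens if the word itself contains an underscore. The sketch below compares splitting on every underscore with splitting on the last one only; `split_tagged` is a hypothetical helper and `well_known_ADJ` is a made-up token used purely for illustration.

```python
def split_tagged(tagged):
    """Split a 'word_POS' string on the LAST underscore and return (word, pos)."""
    word, pos = tagged.rsplit("_", 1)
    return word, pos

print(split_tagged("tail_NOUN"))

# 'well_known_ADJ' is a hypothetical token that itself contains an underscore.
print(split_tagged("well_known_ADJ"))

# Splitting on every underscore and taking the first two parts loses the tag.
print("well_known_ADJ".split("_")[:2])
```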


In [30]:
#@markdown 📌 Add new columns to the dataframe.

df["POS"] = pos
df["Word"] = word

# Rearranging column order (remove Tagged column)
cols = ["POS","Word","Freq"]
df = df[cols]

# Sort by POS and Freq
df = df.sort_values(by=['POS', 'Freq'], ascending = False)
print("Total rows: ", len(df))
df.head()

Total rows:  161


     POS   Word  Freq
25  VERB   have     4
69  VERB    say     4
11  VERB  catch     2
21  VERB    get     2
40  VERB   know     2
