
#😱💦 Before you analyze a text corpus, <font size = '3.0'>you should complete the **essential steps** for further text analysis such as concordance and collocation.</font>

#⏬ ⬇️ ⏬ ⬇️ ⏬ ⬇️ ⏬ ⬇️ ⏬ ⬇️ ⏬ ⬇️ ⏬ ⬇️ ⏬ ⬇️ ⏬ ⬇️ ⏬ ⬇️  

#🐹 🐾 Essential Steps for Text Analysis

## 📚 👀 [Text Corpus <font size='1.8'>코퍼스/말뭉치</font>](https://en.wikipedia.org/wiki/Text_corpus)  
- In linguistics, a corpus (plural corpora) or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, they are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.


## 📚 👀 Text analysis all starts with tokenization!
Before you conduct corpus-related analysis, your text should be **tokenized** (i.e., broken into individual units such as words). Words with conjugation, derivation, or inflection can additionally be reduced to their base forms (i.e., **lemmatized**). Tokenized words can be further associated with their grammatical categories (e.g., NOUN, VERB, ADJ), which we call **POS (part of speech)**. Pairwise units (e.g., word_POS) can be stored in a **dictionary**, where each key is paired with its value.  
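
As a rough illustration of these ideas, the sketch below tokenizes one sentence with a regular expression and stores word_POS pairs in a dictionary. The `toy_pos` lookup table is a hand-made assumption for this example only; in practice a trained tagger supplies the POS labels.

```python
import re

sentence = "The Fox lost his tail"

# Tokenize: lowercase the text and pull out runs of letters.
tokens = re.findall(r"[a-z]+", sentence.lower())

# Hypothetical hand-made POS lookup (a real tagger would supply these labels).
toy_pos = {"the": "DET", "fox": "NOUN", "lost": "VERB", "his": "PRON", "tail": "NOUN"}

# Pair each token with its POS label in word_POS form ("X" if unknown).
tagged = [f"{tok}_{toy_pos.get(tok, 'X')}" for tok in tokens]

# A dictionary maps each key (tagged token) to a value (its count).
tagged_counts = {}
for item in tagged:
    tagged_counts[item] = tagged_counts.get(item, 0) + 1

print(tokens)
print(tagged_counts)
```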

- 1. Read and write a file using an operating system package.
    - 🆘 import the **[os](https://docs.python.org/3/library/os.html)** module.
- 2. Analyze text corpus.
    - 🆘 Install **corpus-toolkit** 
- 3. Analyze natural language.
    - 🆘 Install **nltk** (i.e., Natural Language Toolkit) packages.
    - 🆘 Import **re** (i.e., regular expression module in Python)
- 4. Arrange your data into a structured data frame.
    - 🆘 Install **pandas**

## 📚 👀 Don't miss this!
>1. Whatever operating system you use, your computer organizes folders, files, and subdirectories hierarchically in a directory tree. Python's [os module](https://docs.python.org/3/library/os.html) lets you work with that tree (e.g., create folders and list their contents) once you import it, and the built-in `open()` function lets you **read** and **write** files.
>2. Text files can be made ready for use in the following ways. 
  - Read the text from a URL (uniform resource locator) that points to a plain-text or HTML document. 
    - <font size='2.0'> https://raw.githubusercontent.com/ms624atyale/Data_Misc/main/TheAesop_theFoxwithoutaTail.txt</font>
  - Upload text files to the Files directory of Google Colab.
  - Paste the text into a Code cell as a triple-quoted string:
    txt = """_copy and paste a text of your interest from a website or an HTML (HyperText Markup Language) document_""" 
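
The pasted-string route and the file route can be combined, as in the minimal sketch below. It writes a short sample text to a file and reads it back with line breaks replaced by spaces. The temporary folder and the file name `sample.txt` are assumptions made for illustration; in Colab you would use your own folder instead.

```python
import os
import tempfile

# A short sample text pasted in as a triple-quoted string.
txt = """A Fox that had been caught in a trap
succeeded at last in getting away."""

# Write the string to a file inside a temporary folder (a stand-in for your own folder).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample.txt")  # hypothetical file name
with open(path, "w", encoding="utf-8") as f:
    f.write(txt)

# Read the file back and join the lines with spaces.
with open(path, encoding="utf-8") as f:
    text = f.read().replace("\n", " ")

print(os.listdir(folder))
print(text)
```

Using `with open(...)` closes the file automatically, so no separate `close()` call is needed.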
 


In [None]:
#@markdown 📌 Import the os module (it ships with Python, so nothing needs to be installed)
import os

In [None]:
#@markdown 📌 Make a new working directory named "txtfolder". 📎 <Module name: os> <function: makedirs>

os.makedirs("txtfolder", exist_ok=True) # exist_ok=True avoids an error if the folder already exists

In [1]:
#@markdown 📌 Install the corpus-toolkit package
!pip install corpus-toolkit

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting corpus-toolkit
  Downloading corpus_toolkit-0.32-py3-none-any.whl (1.7 MB)
Installing collected packages: corpus-toolkit
Successfully installed corpus-toolkit-0.32


In [None]:
!pip install lexical-diversity


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lexical-diversity
  Downloading lexical_diversity-0.1.1-py3-none-any.whl (117 kB)
Installing collected packages: lexical-diversity
Successfully installed lexical-diversity-0.1.1


In [None]:
from lexical_diversity import lex_div as ld

In [None]:
import re

In [None]:
#@markdown 📌 Get working directory. <code line: print working directory>
%pwd

'/content'

In [6]:
#@markdown 📌 Make a folder (e.g., brown_single_folder) under the Files directory of Colab, and upload files from your machine.

from corpus_toolkit import corpus_tools as ct

brown_corp = ct.ldcorpus("brown_single_folder") #load and read corpus: ca~cd from brown_single original folder for class use

## 📌 Text ➡️ Words: **Tokenization**

📎  Words grouped under one base form by collapsing inflected or variant forms of the same word (i.e., **lemmatization**) ↔️ Words kept in their conjugated, inflected, or derived forms
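
The contrast can be sketched with a toy lookup table. The `toy_lemmas` dictionary below is a made-up assumption for illustration; real lemmatizers rely on much larger lexical resources.

```python
# Hypothetical mini lemma table; real lemmatizers use much larger resources.
toy_lemmas = {"ran": "run", "running": "run", "runs": "run", "foxes": "fox"}

tokens = ["the", "foxes", "ran", "and", "one", "fox", "runs"]

# Replace each token with its lemma when one is listed; otherwise keep the token.
lemmas = [toy_lemmas.get(tok, tok) for tok in tokens]

print(tokens)  # surface forms as they appear in the text
print(lemmas)  # the same tokens grouped under their base forms
```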

In [7]:
#@markdown 📌 Lemmatization
tok_corp = ct.tokenize(brown_corp) #tokenize corpus - by default this lemmatizes as well

In [8]:
#@markdown 📌 Word Frequency
brown_freq = ct.frequency(tok_corp) #creates a frequency dictionary
##note that range can be calculated instead of frequency using the argument calc = "range"
ct.head(brown_freq, hits = 10) #print top 10 items

Processing ca_ca16.txt (1 of 88 files)
Processing cb_cb25.txt (2 of 88 files)
Processing cc_cc03.txt (3 of 88 files)
Processing cc_cc16.txt (4 of 88 files)
Processing cb_cb04.txt (5 of 88 files)
Processing cc_cc05.txt (6 of 88 files)
Processing ca_ca39.txt (7 of 88 files)
Processing cb_cb03.txt (8 of 88 files)
Processing ca_ca13.txt (9 of 88 files)
Processing ca_ca19.txt (10 of 88 files)
Processing cb_cb20.txt (11 of 88 files)
Processing cc_cc06.txt (12 of 88 files)
Processing cc_cc17.txt (13 of 88 files)
Processing cc_cc09.txt (14 of 88 files)
Processing ca_ca11.txt (15 of 88 files)
Processing ca_ca30.txt (16 of 88 files)
Processing ca_ca02.txt (17 of 88 files)
Processing cb_cb06.txt (18 of 88 files)
Processing cc_cc12.txt (19 of 88 files)
Processing cb_cb18.txt (20 of 88 files)
Processing cb_cb17.txt (21 of 88 files)
Processing ca_ca33.txt (22 of 88 files)
Processing ca_ca27.txt (23 of 88 files)
Processing ca_ca44.txt (24 of 88 files)
Processing ca_ca21.txt (25 of 88 files)
Processin
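
The difference between frequency and range (the `calc = "range"` option mentioned above) can be sketched in plain Python. Here range is taken to mean the number of files in which a word occurs at least once, which is the usual corpus-linguistic sense; the three tiny "files" are made-up data, and this is not the corpus-toolkit implementation.

```python
from collections import Counter

# Three tiny "files", each already tokenized (made-up data).
docs = {
    "a.txt": ["fox", "tail", "fox"],
    "b.txt": ["fox", "trap"],
    "c.txt": ["tail", "tail", "hunt"],
}

frequency = Counter()  # total occurrences across all files
doc_range = Counter()  # number of files containing the word at least once

for tokens in docs.values():
    frequency.update(tokens)
    doc_range.update(set(tokens))  # set() counts each word once per file

for word in frequency:
    print(word, frequency[word], doc_range[word])
```

A word can be frequent but narrowly distributed, so comparing the two numbers shows whether a high count comes from many files or from just a few.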

In [10]:
brown_freq = ct.frequency(ct.tokenize(ct.ldcorpus("brown_single_folder")))
ct.head(brown_freq, hits = 10)

Processing ca_ca16.txt (1 of 88 files)
Processing cb_cb25.txt (2 of 88 files)
Processing cc_cc03.txt (3 of 88 files)
Processing cc_cc16.txt (4 of 88 files)
Processing cb_cb04.txt (5 of 88 files)
Processing cc_cc05.txt (6 of 88 files)
Processing ca_ca39.txt (7 of 88 files)
Processing cb_cb03.txt (8 of 88 files)
Processing ca_ca13.txt (9 of 88 files)
Processing ca_ca19.txt (10 of 88 files)
Processing cb_cb20.txt (11 of 88 files)
Processing cc_cc06.txt (12 of 88 files)
Processing cc_cc17.txt (13 of 88 files)
Processing cc_cc09.txt (14 of 88 files)
Processing ca_ca11.txt (15 of 88 files)
Processing ca_ca30.txt (16 of 88 files)
Processing ca_ca02.txt (17 of 88 files)
Processing cb_cb06.txt (18 of 88 files)
Processing cc_cc12.txt (19 of 88 files)
Processing cb_cb18.txt (20 of 88 files)
Processing cb_cb17.txt (21 of 88 files)
Processing ca_ca33.txt (22 of 88 files)
Processing ca_ca27.txt (23 of 88 files)
Processing ca_ca44.txt (24 of 88 files)
Processing ca_ca21.txt (25 of 88 files)
Processin

In [11]:
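# Concordance lines for the forms of "run"; lemma = False keeps the inflected forms so that each one can be searched for.
conc_results1 = ct.concord(ct.tokenize(ct.ldcorpus("brown_single_folder"),lemma = False),["run","ran","running","runs"],nhits = 10)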
for x in conc_results1:
	print(x)

Processing ca_ca16.txt (1 of 88 files)
Processing cb_cb25.txt (2 of 88 files)
Processing cc_cc03.txt (3 of 88 files)
Processing cc_cc16.txt (4 of 88 files)
Processing cb_cb04.txt (5 of 88 files)
Processing cc_cc05.txt (6 of 88 files)
Processing ca_ca39.txt (7 of 88 files)
Processing cb_cb03.txt (8 of 88 files)
Processing ca_ca13.txt (9 of 88 files)
Processing ca_ca19.txt (10 of 88 files)
Processing cb_cb20.txt (11 of 88 files)
Processing cc_cc06.txt (12 of 88 files)
Processing cc_cc17.txt (13 of 88 files)
Processing cc_cc09.txt (14 of 88 files)
Processing ca_ca11.txt (15 of 88 files)
Processing ca_ca30.txt (16 of 88 files)
Processing ca_ca02.txt (17 of 88 files)
Processing cb_cb06.txt (18 of 88 files)
Processing cc_cc12.txt (19 of 88 files)
Processing cb_cb18.txt (20 of 88 files)
Processing cb_cb17.txt (21 of 88 files)
Processing ca_ca33.txt (22 of 88 files)
Processing ca_ca27.txt (23 of 88 files)
Processing ca_ca44.txt (24 of 88 files)
Processing ca_ca21.txt (25 of 88 files)
Processin
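
What a concordance does can be sketched as a simple keyword-in-context (KWIC) search. The `kwic` function below is a hypothetical helper written for this illustration, not part of corpus-toolkit: it returns each hit together with a few words of left and right context.

```python
def kwic(tokens, targets, window=3):
    """Return (left context, node word, right context) for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok in targets:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits

tokens = "he ran away and then she runs to the hedge".split()
for line in kwic(tokens, {"run", "ran", "running", "runs"}, window=2):
    print(line)
```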

In [12]:
# Same search, restricted with the collocates argument to the words "suddenly" and "just".
conc_results2 = ct.concord(ct.tokenize(ct.ldcorpus("brown_single_folder"),lemma = False),["run","ran","running","runs"],collocates = ["suddenly", 'just'], nhits = 10)
for x in conc_results2:
	print(x)

Processing ca_ca16.txt (1 of 88 files)
Processing cb_cb25.txt (2 of 88 files)
Processing cc_cc03.txt (3 of 88 files)
Processing cc_cc16.txt (4 of 88 files)
Processing cb_cb04.txt (5 of 88 files)
Processing cc_cc05.txt (6 of 88 files)
Processing ca_ca39.txt (7 of 88 files)
Processing cb_cb03.txt (8 of 88 files)
Processing ca_ca13.txt (9 of 88 files)
Processing ca_ca19.txt (10 of 88 files)
Processing cb_cb20.txt (11 of 88 files)
Processing cc_cc06.txt (12 of 88 files)
Processing cc_cc17.txt (13 of 88 files)
Processing cc_cc09.txt (14 of 88 files)
Processing ca_ca11.txt (15 of 88 files)
Processing ca_ca30.txt (16 of 88 files)
Processing ca_ca02.txt (17 of 88 files)
Processing cb_cb06.txt (18 of 88 files)
Processing cc_cc12.txt (19 of 88 files)
Processing cb_cb18.txt (20 of 88 files)
Processing cb_cb17.txt (21 of 88 files)
Processing ca_ca33.txt (22 of 88 files)
Processing ca_ca27.txt (23 of 88 files)
Processing ca_ca44.txt (24 of 88 files)
Processing ca_ca21.txt (25 of 88 files)
Processin

In [14]:
collocates = ct.collocator(ct.tokenize(ct.ldcorpus("brown_single_folder")),"go",stat = "MI")
#stat options include: "MI", "T", "freq", "left", and "right"

ct.head(collocates, hits = 10)

Processing ca_ca16.txt (1 of 88 files)
Processing cb_cb25.txt (2 of 88 files)
Processing cc_cc03.txt (3 of 88 files)
Processing cc_cc16.txt (4 of 88 files)
Processing cb_cb04.txt (5 of 88 files)
Processing cc_cc05.txt (6 of 88 files)
Processing ca_ca39.txt (7 of 88 files)
Processing cb_cb03.txt (8 of 88 files)
Processing ca_ca13.txt (9 of 88 files)
Processing ca_ca19.txt (10 of 88 files)
Processing cb_cb20.txt (11 of 88 files)
Processing cc_cc06.txt (12 of 88 files)
Processing cc_cc17.txt (13 of 88 files)
Processing cc_cc09.txt (14 of 88 files)
Processing ca_ca11.txt (15 of 88 files)
Processing ca_ca30.txt (16 of 88 files)
Processing ca_ca02.txt (17 of 88 files)
Processing cb_cb06.txt (18 of 88 files)
Processing cc_cc12.txt (19 of 88 files)
Processing cb_cb18.txt (20 of 88 files)
Processing cb_cb17.txt (21 of 88 files)
Processing ca_ca33.txt (22 of 88 files)
Processing ca_ca27.txt (23 of 88 files)
Processing ca_ca44.txt (24 of 88 files)
Processing ca_ca21.txt (25 of 88 files)
Processin
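
To give a feel for the "MI" statistic, the sketch below computes a common textbook form of mutual information: the base-2 logarithm of the observed co-occurrence count divided by the count expected by chance. The sentence, the window size of 1, and the exact formula are assumptions for illustration; corpus-toolkit may compute its score differently.

```python
import math
from collections import Counter

tokens = "we go home then we go out and they stay home".split()
node, window = "go", 1

freq = Counter(tokens)
n_tokens = len(tokens)

# Count words that appear within `window` tokens of the node word.
co_freq = Counter()
for i, tok in enumerate(tokens):
    if tok == node:
        start = max(0, i - window)
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        co_freq.update(neighbours)

def mutual_information(collocate):
    """log2 of observed co-occurrences over those expected by chance."""
    expected = freq[node] * freq[collocate] / n_tokens
    return math.log2(co_freq[collocate] / expected)

for word in co_freq:
    print(word, co_freq[word], round(mutual_information(word), 2))
```

Here "out" scores higher than "home" even though both co-occur with "go" once, because "home" is more frequent overall and is therefore more expected by chance.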

# 🔨🔧 🔍 👀 Under construction

Think about writing code lines that build a data frame table using pandas. 

Note that the code lines below come from EssentialSteps4TextAnalysis.

In [None]:
#@markdown 📌 Tagging (i.e., associating each token with a grammatical category (e.g., mountain - N) )
ct.write_corpus("tagged_txt",ct.tag(ct.ldcorpus("txtfolder")))

Processing foxtail.txt (1 of 1 files)


In [None]:
#@markdown 📌 Get frequency of your tagged tokens (e.g., POS). 'hits=10' means you want to get the top 10 words. 

tagged_freq = ct.frequency(ct.reload("tagged_txt"))
ct.head(tagged_freq, hits = 10)

Processing 1.txt (1 of 1 files)
he_PRON	16
the_DET	13
of_ADP	13
a_DET	12
to_PART	9
and_CCONJ	9
Fox_PROPN	8
tail_NOUN	8
be_AUX	7
have_AUX	6


## 💡 Now, let's save tagged data as a dataframe and get word clouds!

In [None]:
#@markdown 📌  Tagged data is in a dictionary format (e.g., {key:value}).
type(tagged_freq)

dict

In [None]:
#@markdown 📌 Import the pandas package to handle dataframes.

import pandas as pd

In [None]:
#@markdown 📌 Generate a dataframe with tagged words (e.g., word_POS) and their frequencies. 

data_list = list(tagged_freq.items()) # (word_POS, frequency) pairs from the dictionary
df = pd.DataFrame(data_list, columns = ["Tagged","Freq"])
print(df)

          Tagged  Freq
0        the_DET    13
1     Æsop_PROPN     1
2        for_ADP     4
3     child_NOUN     1
4      Fox_PROPN     8
..           ...   ...
156  advice_NOUN     1
157    seek_VERB     1
158   lower_VERB     1
159      own_ADJ     1
160   level_NOUN     1

[161 rows x 2 columns]


## 💡 Splitting tagged columns into Words and POS <font size = '2.3'> part of speech (i.e., grammatical categories)</font>
  - e.g., 
              column             column 1     column 2
          yesterday_ADP  ➡️    yesterday    ADP
          rain_NOUN             rain         NOUN
          yellow_ADJ            yellow       ADJ

In [None]:
#@markdown 📌 Codelines to get tagged columns split into words and POS

tagged = df["Tagged"]
pos = []
word = []

for i in range(0, len(tagged)):
  w = tagged[i]
  ws = w.split("_")
  word.append(ws[0])
  pos.append(ws[1])

print(len(tagged))
print(word[:10])
print(pos[:10])

161
['the', 'Æsop', 'for', 'child', 'Fox', 'without', 'a', 'Tail', 'that', 'have']
['DET', 'PROPN', 'ADP', 'NOUN', 'PROPN', 'ADP', 'DET', 'PROPN', 'PRON', 'AUX']
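
The same split can also be written without an explicit loop by using the pandas string methods. The sketch below runs on a small made-up table in the same word_POS format; splitting once from the right is a design choice so that a word that itself contains an underscore would stay intact.

```python
import pandas as pd

# Small made-up frequency table in the same word_POS format as above.
df = pd.DataFrame(
    {"Tagged": ["he_PRON", "the_DET", "tail_NOUN"], "Freq": [16, 13, 8]}
)

# Split once from the right; expand=True returns the two parts as separate columns.
df[["Word", "POS"]] = df["Tagged"].str.rsplit("_", n=1, expand=True)

print(df[["POS", "Word", "Freq"]])
```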


In [None]:
#@markdown 📌 Add new columns to the dataframe.

df["POS"] = pos
df["Word"] = word

# Rearranging column order (remove Tagged column)
cols = ["POS","Word","Freq"]
df = df[cols]

# Sort by POS and Freq
df = df.sort_values(by=['POS', 'Freq'], ascending = False)
print("Total rows: ", len(df))
df.head()

Total rows:  161


     POS   Word  Freq
25  VERB   have     4
69  VERB    say     4
11  VERB  catch     2
21  VERB    get     2
40  VERB   know     2


In [None]:
#@markdown 🔨🔧 🔍 👀 Under construction

foxtail="""The Æsop for Children The Fox Without a Tail A Fox that had been caught in a trap, succeeded at last, after much painful tugging, in getting away. But he had to leave his beautiful bushy tail behind him.
For a long time he kept away from the other Foxes, for he knew well enough that they would all make fun of him and crack jokes and laugh behind his back. But it was hard for him to live alone, and at last he thought of a plan that would perhaps help him out of his trouble.
He called a meeting of all the Foxes, saying that he had something of great importance to tell the tribe.
When they were all gathered together, the Fox Without a Tail got up and made a long speech about those Foxes who had come to harm because of their tails.
This one had been caught by hounds when his tail had become entangled in the hedge. That one had not been able to run fast enough because of the weight of his brush. Besides, it was well known, he said, that men hunt Foxes simply for their tails, which they cut off as prizes of the hunt. With such proof of the danger and uselessness of having a tail, said Master Fox, he would advise every Fox to cut it off, if he valued life and safety.
When he had finished talking, an old Fox arose, and said, smiling:
"Master Fox, kindly turn around for a moment, and you shall have your answer."
When the poor Fox Without a Tail turned around, there arose such a storm of jeers and hooting, that he saw how useless it was to try any longer to persuade the Foxes to part with their tails.
Do not listen to the advice of him who seeks to lower you to his own level."""

text = foxtail.replace("\n", " ") #Replace line breaks with spaces.

shortword = re.compile(r'\W*\b\w{1,3}\b') #Regular expression matching words of 1~3 characters (a rough stand-in for stopwords).
txt = shortword.sub('',text) #Remove the short words.
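
To see what this regular expression does, it helps to run it on a single sentence. The sample sentence is taken from the fable; note that the filter is purely length-based, so it removes short content words such as "Fox" along with function words such as "in" and "a".

```python
import re

# Match optional non-word characters followed by a whole word of 1 to 3 characters.
shortword = re.compile(r'\W*\b\w{1,3}\b')

sample = "A Fox that had been caught in a trap"
cleaned = shortword.sub('', sample)

print(repr(cleaned))  # → ' that been caught trap'
```

A stopword list (e.g., the one shipped with nltk) would be a more precise filter if only function words should be dropped.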

In [None]:
#@markdown 🔨🔧 🔍 👀 Under construction

#@markdown 📌 Open a txt file. <Use a set of double quotation marks "" and assign the url address as the _url_ variable> 
url="https://raw.githubusercontent.com/ms624atyale/Data_Misc/main/TheAesop_theFoxwithoutaTail.txt" 


os.system("curl " + url + " > foxtail.txt") #This generates a txt file in the current working directory and writes the whole text of the url to it (e.g., foxtail.txt).  

file = open("foxtail.txt", encoding="utf-8") #Open the downloaded file for reading.
text = file.read().replace("\n", " ") #Replace line breaks with spaces.
file.close() #Close the file you have been working on.

shortword = re.compile(r'\W*\b\w{1,3}\b') #Regular expression matching words of 1~3 characters (a rough stand-in for stopwords).
txt = shortword.sub('',text) #Remove the short words.

#@markdown 📎 When you see foxtail.txt under the Files directory, move it under the txtfolder folder you've created by drag & drop.