- Points:
    - You can usually do one thing in multiple ways.
        - How do you decide which path to take?
            - The most important factor is the efficiency of your code: how fast can it run?
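
To make the point about speed concrete, here is a minimal sketch that times two ways of building the same string with the standard-library `timeit` module. The word list and both function names are made up for illustration, and the timings will differ from machine to machine.

```python
import timeit

# Hypothetical sample data: 3000 short words
words = ["peace", "security", "development"] * 1000

def join_with_loop(items):
    """Build the string by repeated concatenation."""
    result = ""
    for item in items:
        result += item + " "
    return result.strip()

def join_with_method(items):
    """Build the string with the built-in str.join."""
    return " ".join(items)

# Both paths must give the same answer before their speed is worth comparing.
assert join_with_loop(words) == join_with_method(words)

# Run each version 200 times and report the total time in seconds.
loop_seconds = timeit.timeit(lambda: join_with_loop(words), number=200)
join_seconds = timeit.timeit(lambda: join_with_method(words), number=200)
print(f"loop: {loop_seconds:.4f}s  join: {join_seconds:.4f}s")
```

Checking that both versions agree before timing them is the important habit: a faster path is only useful if it returns the same result.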

In [2]:
import time
from IPython.display import clear_output

def countdown_timer(seconds):
    """Count down from `seconds` to 0, refreshing the cell output every second."""
    for i in range(seconds, -1, -1):
        clear_output(wait=True)
        print(f"⏳ Time remaining: {i} seconds")
        time.sleep(1)
    print("✅ Time's up!")

# Text and its features 

In [35]:
import os
import time
import string
import pandas as pd 
import nltk  # the basic, traditional package for text analysis in Python
import textstat  # a basic package for measuring linguistic features of a text

In [36]:
# Download necessary resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to F:\Mahdi\Dropbox\Dropbox\code
[nltk_data]     \courseTeach\.venv\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [43]:
# Initialize stemmer and lemmatizer
stemmer = nltk.stem.PorterStemmer()
lemmatizer = nltk.stem.WordNetLemmatizer()

```mermaid
flowchart LR
    A{Subject of study} --> B{Text}
    B --> | influence | C[Content of Text]
    B --> | influence | D[Linguistic Features of Text]
```
---
---

## 1.1. Metadata

Each text offers three kinds of useful information: its metadata and two kinds of textual data (linguistic features and content).

- Metadata 
  - Any information about a text: author, type of text (speech, tweet, ...), date of publication, language, etc.
    - Sometimes metadata appears in the text itself.
    
    
- Textual data 
  - Linguistic features of text
  - Content of text 
---
---

**Metadata**

- When designing a project that involves text analysis, think carefully about what information you want to gather:
  - Recovering information that you did not code at the outset takes a huge amount of time.

- Metadata is information beyond the text and its linguistic features, such as author, date, and type.
- Three ways to store and retrieve metadata:
  - Work with an API.
  - Store it in a separate file and retrieve it from there.
  - Store it in the file name and retrieve it from there.
---
---

### Exercise 1.1 

In [3]:
countdown_timer(300)

⏳ Time remaining: 0 seconds
✅ Time's up!


---
---

**Example 1: Retrieve metadata from file names**

In [4]:
# Step 1: read the files and store the texts, keyed by file name, in a dictionary
dictUNSpeech = {}  # create an empty dictionary
# The directory that holds the speeches
fileAddress1 = '../../corpusExample/unSpeeches2000_2010'
# Open the files one by one - remember you need to tell Python each single step; nothing here is automatic.
for file in os.listdir(fileAddress1):
    # 'utf-8-sig' also strips a byte-order mark at the start of a file, if there is one
    with open(os.path.join(fileAddress1, file), 'r', encoding='utf-8-sig', errors='replace') as textFile:
        # keep the original casing (as in the outputs below); lowercasing happens later, during cleaning
        dictUNSpeech[file.replace('.txt', '')] = textFile.read()

# Step 2: convert the dictionary to a dataframe
dfUNSpeech = pd.DataFrame(list(dictUNSpeech.items()), columns=["id", "text"])

# Step 3: split ids such as "AFG_55_2000" once into country code, session, and year
idParts = dfUNSpeech["id"].str.split("_", n=2, expand=True)
dfUNSpeech["isoAlpha"] = idParts[0].astype('str')
dfUNSpeech["session"] = idParts[1].astype('int')
dfUNSpeech["year"] = idParts[2].astype('int')

In [5]:
dfUNSpeech.head(5)

| | id | text | isoAlpha | session | year |
|---|---|---|---|---|---|
| 0 | AFG_55_2000 | On my way to the\nAssembly Hall, I was informe... | AFG | 55 | 2000 |
| 1 | AFG_56_2001 | At the outset, on\nbehalf of the Government o... | AFG | 56 | 2001 |
| 2 | AFG_57_2002 | Not very far from here stood\ntwo towers that... | AFG | 57 | 2002 |
| 3 | AFG_58_2003 | There is no reality more\noppressive than the... | AFG | 58 | 2003 |
| 4 | AFG_59_2004 | Nelson Mandela once\ndescribed his countryís t... | AFG | 59 | 2004 |
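
Splitting the id on `_` assumes that every file name follows the same scheme. A minimal sketch of a check that can run before the split is shown here; the pattern (three capital letters, a session number, a four-digit year) is an assumption based on the ids displayed in the output, and `parse_file_id` is a hypothetical helper, not part of the notebook.

```python
import os
import re

# Assumed naming scheme: three-letter country code, session number, four-digit year
FILE_ID_PATTERN = re.compile(r"(?P<isoAlpha>[A-Z]{3})_(?P<session>\d+)_(?P<year>\d{4})")

def parse_file_id(file_name):
    """Return the metadata encoded in a file name, or None if the name does not fit."""
    stem = os.path.splitext(file_name)[0]  # drop the extension, e.g. ".txt"
    match = FILE_ID_PATTERN.fullmatch(stem)
    if match is None:
        return None
    return {
        "isoAlpha": match["isoAlpha"],
        "session": int(match["session"]),
        "year": int(match["year"]),
    }

print(parse_file_id("AFG_55_2000.txt"))  # a name that fits the scheme
print(parse_file_id("notes.txt"))        # a stray file that does not
```

Files for which the helper returns `None` can be listed and inspected by hand instead of causing an error, or a wrong value, halfway through the conversion.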


**Example 2: Retrieve metadata from a separate file**

In [7]:
fileAddress2 = '../../corpusExample/speakersSession.xlsx'
speakerData = pd.read_excel(fileAddress2)
dfUNSpeechComplete = dfUNSpeech.merge(speakerData, on=['year', 'isoAlpha'])

In [8]:
dfUNSpeechComplete.head(5)

| | id | text | isoAlpha | session | year | Session | cname | speakerName | post |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AFG_55_2000 | On my way to the\nAssembly Hall, I was informe... | AFG | 55 | 2000 | 55 | Afghanistan | Abdullah Abdullah | MFA |
| 1 | AFG_56_2001 | At the outset, on\nbehalf of the Government o... | AFG | 56 | 2001 | 56 | Afghanistan | Ravan Farhâdi | UN_Rep |
| 2 | AFG_57_2002 | Not very far from here stood\ntwo towers that... | AFG | 57 | 2002 | 57 | Afghanistan | Hâmid Karzai | President |
| 3 | AFG_58_2003 | There is no reality more\noppressive than the... | AFG | 58 | 2003 | 58 | Afghanistan | Hâmid Karzai | President |
| 4 | AFG_59_2004 | Nelson Mandela once\ndescribed his countryís t... | AFG | 59 | 2004 | 59 | Afghanistan | Mr. Hamid Karzai | President |
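
`merge` performs an inner join by default, so a speech without a matching row in the speaker file disappears without any warning. The sketch below shows one way to spot such rows using the `how`, `indicator` and `validate` parameters of `DataFrame.merge`. The two small dataframes are made-up stand-ins for `dfUNSpeech` and `speakerData`.

```python
import pandas as pd

# Toy stand-ins for dfUNSpeech and speakerData (values are illustrative only)
speeches = pd.DataFrame({
    "isoAlpha": ["AFG", "AFG", "ZWE"],
    "year": [2000, 2001, 2000],
    "text": ["speech a", "speech b", "speech c"],
})
speakers = pd.DataFrame({
    "isoAlpha": ["AFG", "AFG"],
    "year": [2000, 2001],
    "post": ["MFA", "UN_Rep"],
})

merged = speeches.merge(
    speakers,
    on=["year", "isoAlpha"],
    how="left",              # keep every speech, matched or not
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
    validate="many_to_one",  # raise if a (year, isoAlpha) pair repeats in speakers
)

# Speeches that found no speaker information
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["isoAlpha", "year"]])
```

Whether unmatched speeches should be kept with missing values or dropped depends on the analysis; the point is to make the choice visible.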


---
---
**Some abbreviations**
- Prime Minister: PM
- Deputy Prime Minister: DPM
- Vice President: V-President
- Head of Government: HOG
- Head of State: HOS
- Minister for Foreign Affairs: MFA
- UN Representative: UN_Rep
---
---

**Example 3: Working with an API**

---
---

### Exercise 1.2

In [10]:
countdown_timer(300)


---
---

## 1.2. Linguistic Features

**How do linguistic features help us?**

Sometimes we do not need a complex method to achieve our goal; simple measurements are often enough.

Some linguistic features:
- Length of text
- Number of sentences 
- Length of sentences 
- Complexity of text
- ...

---
---

**Example: counting the number of words**

In [9]:
# the length of each text, measured in whitespace-separated words
dfUNSpeechComplete['word_count'] = dfUNSpeechComplete['text'].str.split().str.len()
dfUNSpeechComplete

| | id | text | isoAlpha | session | year | Session | cname | speakerName | post | word_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFG_55_2000 | On my way to the\nAssembly Hall, I was informe... | AFG | 55 | 2000 | 55 | Afghanistan | Abdullah Abdullah | MFA | 2873 |
| 1 | AFG_56_2001 | At the outset, on\nbehalf of the Government o... | AFG | 56 | 2001 | 56 | Afghanistan | Ravan Farhâdi | UN_Rep | 2073 |
| 2 | AFG_57_2002 | Not very far from here stood\ntwo towers that... | AFG | 57 | 2002 | 57 | Afghanistan | Hâmid Karzai | President | 1700 |
| 3 | AFG_58_2003 | There is no reality more\noppressive than the... | AFG | 58 | 2003 | 58 | Afghanistan | Hâmid Karzai | President | 1617 |
| 4 | AFG_59_2004 | Nelson Mandela once\ndescribed his countryís t... | AFG | 59 | 2004 | 59 | Afghanistan | Mr. Hamid Karzai | President | 1096 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2072 | ZWE_61_2006 | Let me begin my statement \nby echoing the sen... | ZWE | 61 | 2006 | 61 | Zimbabwe | Mr. Robert Gabriel Mugabe | President | 2371 |
| 2073 | ZWE_62_2007 | Allow me to congratulate \nMr. Kerim on his el... | ZWE | 62 | 2007 | 62 | Zimbabwe | Robert G. Mugabe | President | 2052 |
| 2074 | ZWE_63_2008 | I wish to begin by joining \nthose who have co... | ZWE | 63 | 2008 | 63 | Zimbabwe | Robert Mugabe | President | 1800 |
| 2075 | ZWE_64_2009 | Let me begin by extending \nour warmest congra... | ZWE | 64 | 2009 | 64 | Zimbabwe | Robert G. Mugabe | President | 1722 |


---
---

**Example: counting the number of sentences**

In [11]:
# Apply sentence tokenizer
dfUNSpeechComplete['sentence_count'] = dfUNSpeechComplete['text'].apply(lambda x: len(nltk.tokenize.sent_tokenize(x)))

- You can count the number of sentences using spaCy as well. We will come to that later.

---
---

**Example: looking at the frequency of speeches from different officials**

In [12]:
dfUNSpeechComplete['post'].value_counts()

post
MFA            955
President      533
PM             221
UN_Rep         187
DPM            109
V-President     34
HOS             24
HOG             14
Name: count, dtype: int64

---
---

**Features of Text that could be interesting:**
- Length 
- Readability 
- Entropy
- Lexical diversity
- Tenses of verbs 
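
Two of these features can be sketched with the standard library alone. The notebook does not define them, so the choices here are assumptions: lexical diversity is measured as the type-token ratio (distinct tokens divided by all tokens), and entropy as the Shannon entropy, in bits, of the token frequencies. Both function names are hypothetical.

```python
import math
from collections import Counter

def type_token_ratio(tokens):
    """Share of distinct tokens among all tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def shannon_entropy(tokens):
    """Entropy in bits of the token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Tiny made-up example: 5 tokens, 3 distinct ones
tokens = "peace and security and peace".split()
print(type_token_ratio(tokens))           # 3 distinct tokens out of 5
print(round(shannon_entropy(tokens), 4))  # higher when tokens are spread more evenly
```

The type-token ratio tends to fall as texts get longer, so comparing speeches of very different lengths with it calls for some caution.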

In [16]:
# Flesch reading ease for each speech (higher scores mean easier text)
dfUNSpeechComplete['flesch_reading_ease'] = dfUNSpeechComplete['text'].apply(textstat.flesch_reading_ease)

AttributeError: module 'textstat' has no attribute 'flesch_reading_ease'
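
To see what the score measures independently of any package, here is a rough sketch of the Flesch reading ease formula, 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The sentence, word and syllable counters are deliberately naive assumptions (syllables are approximated by groups of vowels), so the values will not match those of a dedicated library exactly.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels (at least 1)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease_sketch(text):
    """Flesch reading ease with naive sentence, word and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        return None  # the score is undefined for empty input
    syllables = sum(count_syllables(word) for word in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

print(flesch_reading_ease_sketch("The cat sat on the mat. It was happy."))
```

Short sentences made of short words push the score up; long sentences with many polysyllabic words, which are common in diplomatic speeches, pull it down.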

---
---

### Exercise 1.3

In [None]:
countdown_timer(300)

---
---

**Tokenization**
- The next step in analyzing text is to break the documents down into elements that can be quantified.
  - The basic element of a document is called a *token*.
  - Does token mean word? Not necessarily, but a token usually corresponds to a word.
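
The difference between a word and a token shows up even with a very simple tokenizer. The sketch below compares a plain whitespace split with a regular expression that separates punctuation marks. This regex is only an illustration and not the algorithm NLTK uses; NLTK makes different choices, for example keeping `Mr.` as one token, as the output further down shows.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Mr. Kerim, we don't agree."

# Whitespace split: punctuation stays glued to the words
print(sentence.split())
# Regex tokenizer: every punctuation mark becomes a token of its own
print(simple_tokenize(sentence))
```

The same sentence yields 5 items with the whitespace split and 10 with the regex, so any count based on tokens depends on which tokenizer produced them.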

**NLTK**
- In the good old days (almost 8 years ago), *NLTK* was the main package for basic NLP tasks.
- Here we use it to break our documents down into words.

**Example: Tokenization**

In [17]:
dfUNSpeechComplete['tokens'] = dfUNSpeechComplete['text'].apply(nltk.tokenize.word_tokenize)

In [18]:
dfUNSpeechComplete

| | id | text | isoAlpha | session | year | Session | cname | speakerName | post | word_count | sentence_count | tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFG_55_2000 | On my way to the\nAssembly Hall, I was informe... | AFG | 55 | 2000 | 55 | Afghanistan | Abdullah Abdullah | MFA | 2873 | 67 | [On, my, way, to, the, Assembly, Hall, ,, I, w... |
| 1 | AFG_56_2001 | At the outset, on\nbehalf of the Government o... | AFG | 56 | 2001 | 56 | Afghanistan | Ravan Farhâdi | UN_Rep | 2073 | 81 | [At, the, outset, ,, on, behalf, of, the, Gov... |
| 2 | AFG_57_2002 | Not very far from here stood\ntwo towers that... | AFG | 57 | 2002 | 57 | Afghanistan | Hâmid Karzai | President | 1700 | 62 | [Not, very, far, from, here, stood, two, towe... |
| 3 | AFG_58_2003 | There is no reality more\noppressive than the... | AFG | 58 | 2003 | 58 | Afghanistan | Hâmid Karzai | President | 1617 | 83 | [There, is, no, reality, more, oppressive, th... |
| 4 | AFG_59_2004 | Nelson Mandela once\ndescribed his countryís t... | AFG | 59 | 2004 | 59 | Afghanistan | Mr. Hamid Karzai | President | 1096 | 56 | [Nelson, Mandela, once, described, his, countr... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2072 | ZWE_61_2006 | Let me begin my statement \nby echoing the sen... | ZWE | 61 | 2006 | 61 | Zimbabwe | Mr. Robert Gabriel Mugabe | President | 2371 | 100 | [Let, me, begin, my, statement, by, echoing, t... |
| 2073 | ZWE_62_2007 | Allow me to congratulate \nMr. Kerim on his el... | ZWE | 62 | 2007 | 62 | Zimbabwe | Robert G. Mugabe | President | 2052 | 121 | [Allow, me, to, congratulate, Mr., Kerim, on, ... |
| 2074 | ZWE_63_2008 | I wish to begin by joining \nthose who have co... | ZWE | 63 | 2008 | 63 | Zimbabwe | Robert Mugabe | President | 1800 | 65 | [I, wish, to, begin, by, joining, those, who, ... |
| 2075 | ZWE_64_2009 | Let me begin by extending \nour warmest congra... | ZWE | 64 | 2009 | 64 | Zimbabwe | Robert G. Mugabe | President | 1722 | 63 | [Let, me, begin, by, extending, our, warmest, ... |


**spaCy**
- You can also use *spaCy* to do the tokenization for you.
- It is generally more efficient and more accurate.
- We will come back to spaCy in our last session.

---
---

### Exercise 1.4

In [19]:
countdown_timer(300)


---
---

**Removing**
- Not all tokens are meaningful.
- We want to drop the tokens that do not carry any meaning.
- The most obvious ones are punctuation marks and stopwords:
    - Stopwords: pronouns, articles, prepositions

**Example: removing words that are not meaningful.**

In [22]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to F:\Mahdi\Dropbox\Dropbox\
[nltk_data]     code\courseTeach\.venv\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [23]:
# English stopwords and punctuation
stop_words = set(nltk.corpus.stopwords.words('english'))
punctuations = set(string.punctuation)

In [24]:
# Function to clean tokens
def text_cleaner(text):
    tokens = nltk.tokenize.word_tokenize(text)
    return [word.lower() for word in tokens if word.lower() not in stop_words and word not in punctuations]

# Apply cleaning
dfUNSpeechComplete['tokens_clean'] = dfUNSpeechComplete['text'].apply(text_cleaner)

In [26]:
# Access the first row and two specific variables (columns)
dfUNSpeechComplete.iloc[0][['tokens', 'tokens_clean']]

tokens          [On, my, way, to, the, Assembly, Hall, ,, I, w...
tokens_clean    [way, assembly, hall, informed, supreme, state...
Name: 0, dtype: object
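
One detail of the cleaning step deserves attention: `set(string.punctuation)` contains single characters only, so a token made of several punctuation characters is not removed by the `word not in punctuations` test. Tokenizers can emit such tokens, for example for quotation marks or ellipses. A minimal sketch of a stricter filter follows; the token list is made up and `has_alphanumeric` is a hypothetical helper.

```python
import string

punctuations = set(string.punctuation)

def has_alphanumeric(token):
    """True if the token contains at least one letter or digit."""
    return any(character.isalnum() for character in token)

# Made-up tokens, including multi-character punctuation
tokens = ["peace", "``", "...", "--", ",", "21st"]

# Set membership only removes single-character punctuation
kept_by_set_check = [t for t in tokens if t not in punctuations]
# Requiring a letter or digit removes the multi-character cases as well
kept_by_content_check = [t for t in tokens if has_alphanumeric(t)]

print(kept_by_set_check)
print(kept_by_content_check)
```

Whether the stricter filter is appropriate depends on the corpus; it would also drop tokens such as emoticons, which may matter for tweets but hardly for UN speeches.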

In [33]:
print("length of tokens:", len(dfUNSpeechComplete.loc[0, 'tokens']), "\n",
      "length of clean tokens:", len(dfUNSpeechComplete.loc[0, 'tokens_clean'])
      )

length of tokens: 3173 
 length of clean tokens: 1622


In [41]:
# set() drops duplicates, so these are counts of unique tokens
print("number of unique tokens:", len(set(dfUNSpeechComplete.loc[0, 'tokens'])), "\n",
      "number of unique clean tokens:", len(set(dfUNSpeechComplete.loc[0, 'tokens_clean']))
      )

number of unique tokens: 1057 
 number of unique clean tokens: 932


---
---

### Exercise 1.5

In [None]:
countdown_timer(300)

---
---

**Words that are similar but not identical**
- What is the difference between run, running and runs?
    - They share the same core meaning.
    - But from the perspective of the tokenizer they are separate words.
- Solutions to the problem:
    - Stemming
    - Lemmatization

**Stemming**
- We cut a few letters off the end of words:
    - run, running and runs all become run: we get the same token three times instead of three different tokens once each.
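
The idea of cutting letters off the end of a word can be sketched in a few lines. This toy function is deliberately naive and is not the Porter algorithm used in the notebook: the suffix list and the rule for doubled consonants are assumptions chosen only to handle the run / running / runs example.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical, far from a complete rule set

def naive_stem(word):
    """Strip one known suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling: "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

print([naive_stem(word) for word in ["run", "running", "runs"]])  # → ['run', 'run', 'run']
```

Rules this crude quickly produce non-words or merge unrelated words, which is why established stemmers such as Porter's use a longer, carefully ordered set of rules.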

**Example: Stemming**

In [44]:
# Function to clean and stem tokens
def text_cleaner(text):
    tokens = nltk.tokenize.word_tokenize(text)
    cleaned_tokens = [word.lower() for word in tokens if word.lower() not in stop_words and word not in punctuations]
    stemmed_tokens = [stemmer.stem(token) for token in cleaned_tokens]
    return cleaned_tokens, stemmed_tokens

# Apply cleaning: each row of `results` holds a (cleaned, stemmed) tuple
results = dfUNSpeechComplete['text'].apply(text_cleaner)
# .str[i] picks element i of each tuple, giving one column per token list
dfUNSpeechComplete['tokens_clean'] = results.str[0]
dfUNSpeechComplete['tokens_clean_stemmed'] = results.str[1]

In [39]:
# Access the first row and three specific variables (columns)
dfUNSpeechComplete.iloc[0][['tokens', 'tokens_clean', 'tokens_clean_stemmed']]

tokens                  [On, my, way, to, the, Assembly, Hall, ,, I, w...
tokens_clean            [way, assembly, hall, informed, supreme, state...
tokens_clean_stemmed    [way, assembl, hall, inform, suprem, state, co...
Name: 0, dtype: object

In [42]:
print("number of unique tokens:", len(set(dfUNSpeechComplete.loc[0, 'tokens'])), "\n",
      "number of unique clean tokens:", len(set(dfUNSpeechComplete.loc[0, 'tokens_clean'])), "\n",
      "number of unique stemmed and clean tokens:", len(set(dfUNSpeechComplete.loc[0, 'tokens_clean_stemmed'])), "\n",
      )

number of unique tokens: 1057 
 number of unique clean tokens: 932 
 number of unique stemmed and clean tokens: 818 



**Lemmatization**
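- It goes one step further and tries to reduce each word to its root, that is, its dictionary form (lemma). NLTK's WordNet lemmatizer treats every token as a noun unless it is given a part-of-speech tag.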
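
What sets lemmatization apart from stemming is that it looks words up instead of cutting them. The sketch below uses a tiny hand-written dictionary to illustrate the idea. The dictionary and the function are hypothetical; a real lemmatizer such as the WordNet one relies on a large lexical database and, ideally, on part-of-speech information.

```python
# Hypothetical miniature lemma dictionary; real lemmatizers ship far larger resources
LEMMA_LOOKUP = {
    "ran": "run",
    "running": "run",
    "runs": "run",
    "mice": "mouse",
    "better": "good",
}

def lookup_lemma(word):
    """Return the dictionary form of a word, or the lowercased word if it is unknown."""
    lowered = word.lower()
    return LEMMA_LOOKUP.get(lowered, lowered)

print([lookup_lemma(word) for word in ["ran", "Mice", "better", "peace"]])
```

Forms such as *ran* or *mice* cannot be reached by removing a suffix, which is where a lookup has the advantage; the price is that words missing from the dictionary are left unchanged.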

**Example: Lemmatization**

In [45]:
# Function to clean, stem, and lemmatize tokens
def text_cleaner(text):
    tokens = nltk.tokenize.word_tokenize(text)
    cleaned_tokens = [word.lower() for word in tokens if word.lower() not in stop_words and word not in punctuations]
    stemmed_tokens = [stemmer.stem(token) for token in cleaned_tokens]
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in cleaned_tokens]
    return cleaned_tokens, stemmed_tokens, lemmatized_tokens

# Apply cleaning: each row of `results` holds a (cleaned, stemmed, lemmatized) tuple
results = dfUNSpeechComplete['text'].apply(text_cleaner)
# .str[i] picks element i of each tuple, giving one column per token list
for position, column in enumerate(['tokens_clean', 'tokens_clean_stemmed', 'tokens_clean_lemmatized']):
    dfUNSpeechComplete[column] = results.str[position]

In [46]:
# Access the first row and four specific variables (columns)
dfUNSpeechComplete.iloc[0][['tokens', 'tokens_clean', 'tokens_clean_stemmed', 'tokens_clean_lemmatized']]

tokens                     [On, my, way, to, the, Assembly, Hall, ,, I, w...
tokens_clean               [way, assembly, hall, informed, supreme, state...
tokens_clean_stemmed       [way, assembl, hall, inform, suprem, state, co...
tokens_clean_lemmatized    [way, assembly, hall, informed, supreme, state...
Name: 0, dtype: object

In [47]:
print("number of unique tokens:", len(set(dfUNSpeechComplete.loc[0, 'tokens'])), "\n",
      "number of unique clean tokens:", len(set(dfUNSpeechComplete.loc[0, 'tokens_clean'])), "\n",
      "number of unique stemmed and clean tokens:", len(set(dfUNSpeechComplete.loc[0, 'tokens_clean_stemmed'])), "\n",
      "number of unique lemmatized and clean tokens:", len(set(dfUNSpeechComplete.loc[0, 'tokens_clean_lemmatized'])), "\n",
      )

number of unique tokens: 1057 
 number of unique clean tokens: 932 
 number of unique stemmed and clean tokens: 818 
 number of unique lemmatized and clean tokens: 897 



### Exercise 1.6