In [10]:
import time  # needed for time.sleep below
from IPython.display import clear_output
def countdown_timer(seconds):
    for i in range(seconds, -1, -1):
        clear_output(wait=True)
        print(f"⏳ Time remaining: {i} seconds")
        time.sleep(1)
    print("✅ Time's up!")

# Text and its features 

In [42]:
import os
import pandas as pd
import nltk  # the basic, traditional package for text analysis in Python
import textstat  # a basic package for measuring linguistic features of a text

```mermaid
flowchart LR
    A{Subject of study} --> B{Text}
    B --> | influence | C[Content of Text]
    B --> | influence | D[Linguistic Features of Text]
```
---
---

## 1.1. Metadata

Each text carries three types of data that can be useful: metadata and two kinds of textual data (linguistic features and content).

- Metadata
  - Any information about a text: author, type of text (speech, tweet, ...), date of publication, language, etc.
    - Sometimes metadata appears in the text itself.

- Textual data
  - Linguistic features of the text
  - Content of the text
---
---

**Metadata**

- When designing a text-analysis project, think carefully about what kind of information you want to gather:
  - Recovering information that you did not code at the outset takes a huge amount of time.

- Metadata is information beyond the text and its linguistic features, such as author, date, and type.
- Three ways to store and retrieve metadata:
  - Work with an API.
  - Store it in a separate file and retrieve it from there.
  - Store it in the file name and retrieve it from there.
---
---

### Exercise 1.1 

In [3]:
countdown_timer(10)

⏳ Time remaining: 0 seconds
✅ Time's up!


---
---

**Example 1: Retrieve metadata from file names**

In [69]:
# Step 1: read the files and store each text under its file name (without extension)
dictUNSpeech = {}  # empty dictionary: id -> text
# Directory that holds the speeches
fileAddress1 = '../../corpusExample/unSpeeches2000_2010'
# Open the files one by one. Python needs every step spelled out; nothing here is automatic.
for file in os.listdir(fileAddress1):
    with open(os.path.join(fileAddress1, file), 'r', encoding='utf-8', errors='replace') as textFile:
        dictUNSpeech[file.replace('.txt', '')] = textFile.read()

# Step 2: convert the dictionary to a dataframe
dfUNSpeech = pd.DataFrame(list(dictUNSpeech.items()), columns=["id", "text"])

# Step 3: ids look like AFG_55_2000 (isoAlpha_session_year); split once and reuse the parts
idParts = dfUNSpeech["id"].str.split("_", n=2, expand=True)
dfUNSpeech["isoAlpha"] = idParts[0].astype('str')
dfUNSpeech["session"] = idParts[1].astype('int')
dfUNSpeech["year"] = idParts[2].astype('int')

In [70]:
dfUNSpeech.head(5)

Unnamed: 0,id,text,isoAlpha,session,year
0,AFG_55_2000,"On my way to the\nAssembly Hall, I was informe...",AFG,55,2000
1,AFG_56_2001,"﻿At the outset, on\nbehalf of the Government o...",AFG,56,2001
2,AFG_57_2002,﻿Not very far from here stood\ntwo towers that...,AFG,57,2002
3,AFG_58_2003,﻿There is no reality more\noppressive than the...,AFG,58,2003
4,AFG_59_2004,Nelson Mandela once\ndescribed his countryís t...,AFG,59,2004
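The `astype('int')` casts above fail as soon as one file name does not follow the `isoAlpha_session_year` pattern (a stray system file, for instance). A minimal sketch of checking ids before casting, using only the standard library. The pattern `ID_PATTERN` and the helper `parse_speech_id` are hypothetical names introduced for illustration, and the assumption that the country code consists of letters only may not hold for every corpus.

```python
import re

# Assumed pattern: letters, underscore, digits (session), underscore, four digits (year)
ID_PATTERN = re.compile(r'^(?P<isoAlpha>[A-Za-z]+)_(?P<session>\d+)_(?P<year>\d{4})$')

def parse_speech_id(speech_id):
    """Return the metadata encoded in an id, or None if the id does not match."""
    match = ID_PATTERN.match(speech_id)
    if match is None:
        return None
    parts = match.groupdict()
    return {
        "isoAlpha": parts["isoAlpha"],
        "session": int(parts["session"]),
        "year": int(parts["year"]),
    }

# Two well-formed ids and two names that would break a blind split-and-cast
ids = ["AFG_55_2000", "ZWE_64_2009", ".DS_Store", "notes"]
parsed = {speech_id: parse_speech_id(speech_id) for speech_id in ids}
bad_ids = [speech_id for speech_id, meta in parsed.items() if meta is None]
print(bad_ids)
```

Listing the ids that fail to parse makes it easy to decide whether to skip those files or to fix their names.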


**Example 2: Retrieve from a separate file**

In [71]:
fileAddress2 = '../../corpusExample/speakersSession.xlsx'
speakerData = pd.read_excel(fileAddress2)
dfUNSpeechComplete = dfUNSpeech.merge(speakerData, on=['year', 'isoAlpha'])

In [72]:
dfUNSpeechComplete.head(5)

Unnamed: 0,id,text,isoAlpha,session,year,Session,cname,speakerName,post
0,AFG_55_2000,"On my way to the\nAssembly Hall, I was informe...",AFG,55,2000,55,Afghanistan,Abdullah Abdullah,MFA
1,AFG_56_2001,"﻿At the outset, on\nbehalf of the Government o...",AFG,56,2001,56,Afghanistan,Ravan Farhâdi,UN_Rep
2,AFG_57_2002,﻿Not very far from here stood\ntwo towers that...,AFG,57,2002,57,Afghanistan,Hâmid Karzai,President
3,AFG_58_2003,﻿There is no reality more\noppressive than the...,AFG,58,2003,58,Afghanistan,Hâmid Karzai,President
4,AFG_59_2004,Nelson Mandela once\ndescribed his countryís t...,AFG,59,2004,59,Afghanistan,Mr. Hamid Karzai,President
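By default `merge` performs an inner join, so speeches without a matching row in the speaker file disappear silently, and the output above carries both a `session` and a `Session` column. The sketch below shows one way to make such issues visible, using the `how`, `validate`, and `indicator` parameters of `DataFrame.merge`. The two small dataframes are toy data invented for illustration, not rows from the corpus.

```python
import pandas as pd

# Toy stand-ins for dfUNSpeech and speakerData (invented values)
speeches = pd.DataFrame({
    "isoAlpha": ["AFG", "AFG", "ZWE"],
    "year": [2000, 2001, 2009],
    "session": [55, 56, 64],
})
speakers = pd.DataFrame({
    "isoAlpha": ["AFG", "ZWE"],
    "year": [2000, 2009],
    "Session": [55, 64],
    "post": ["MFA", "President"],
})

checked = speeches.merge(
    speakers.drop(columns=["Session"]),  # avoid a duplicate session column
    on=["year", "isoAlpha"],
    how="left",                # keep every speech, matched or not
    validate="one_to_one",     # raise if a key appears more than once on either side
    indicator=True,            # adds a _merge column: 'both' or 'left_only'
)

# Speeches for which no speaker metadata was found
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["isoAlpha", "year"]])
```

Inspecting `unmatched` before dropping anything shows how much metadata is missing and for which countries and years.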


---
---
**Some abbreviations**
- Prime Minister: PM
- Deputy Prime Minister: DPM
- Head of Government: HOG
- Head of State: HOS
- Minister for Foreign Affairs: MFA
- UN Representative: UN_Rep
---
---

**Example 3: Working with an API**

---
---

### Exercise 1.2 

In [73]:
countdown_timer(10)

⏳ Time remaining: 0 seconds
✅ Time's up!


## 1.2. Linguistic Features

**How do linguistic features help us?**

Sometimes we do not need a complex method to achieve our goal. Simple measurements are useful.

Some linguistic features:
- Length of text
- Number of sentences 
- Length of sentences 
- Complexity of text
- ... 

In [74]:
# Flesch–Kincaid Grade Level
dfUNSpeechComplete['flesch_kincaid_grade'] = dfUNSpeechComplete['text'].apply(textstat.flesch_kincaid_grade)

KeyError: 'selfevident'

In [75]:
# Length of each text, counted in whitespace-separated words
dfUNSpeechComplete['word_count'] = dfUNSpeechComplete['text'].str.split().str.len()
dfUNSpeechComplete

Unnamed: 0,id,text,isoAlpha,session,year,Session,cname,speakerName,post,word_count
0,AFG_55_2000,"On my way to the\nAssembly Hall, I was informe...",AFG,55,2000,55,Afghanistan,Abdullah Abdullah,MFA,2873
1,AFG_56_2001,"﻿At the outset, on\nbehalf of the Government o...",AFG,56,2001,56,Afghanistan,Ravan Farhâdi,UN_Rep,2073
2,AFG_57_2002,﻿Not very far from here stood\ntwo towers that...,AFG,57,2002,57,Afghanistan,Hâmid Karzai,President,1700
3,AFG_58_2003,﻿There is no reality more\noppressive than the...,AFG,58,2003,58,Afghanistan,Hâmid Karzai,President,1617
4,AFG_59_2004,Nelson Mandela once\ndescribed his countryís t...,AFG,59,2004,59,Afghanistan,Mr. Hamid Karzai,President,1096
...,...,...,...,...,...,...,...,...,...,...
2072,ZWE_61_2006,Let me begin my statement \nby echoing the sen...,ZWE,61,2006,61,Zimbabwe,Mr. Robert Gabriel Mugabe,President,2371
2073,ZWE_62_2007,Allow me to congratulate \nMr. Kerim on his el...,ZWE,62,2007,62,Zimbabwe,Robert G. Mugabe,President,2052
2074,ZWE_63_2008,I wish to begin by joining \nthose who have co...,ZWE,63,2008,63,Zimbabwe,Robert Mugabe,President,1800
2075,ZWE_64_2009,Let me begin by extending \nour warmest congra...,ZWE,64,2009,64,Zimbabwe,Robert G. Mugabe,President,1722


In [None]:
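# Apply the sentence tokenizer and count the sentences in each text.
# sent_tokenize needs the Punkt models: run nltk.download('punkt') once if they are
# missing (recent NLTK releases ask for 'punkt_tab' instead).
dfUNSpeechComplete['sentence_count'] = dfUNSpeechComplete['text'].apply(lambda x: len(nltk.tokenize.sent_tokenize(x)))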

- You can count the number of sentences using spaCy as well. We will come to that later.

In [None]:
dfUNSpeechComplete['post'].value_counts()

**Text features that could be interesting**
- Length 
- Readability 
- Entropy
- Lexical diversity
- Tenses of verbs 
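Two of the features listed above, lexical diversity and entropy, can be approximated with the standard library alone. The sketch below uses the type-token ratio as a simple measure of lexical diversity and the Shannon entropy of the word-frequency distribution. The function name `lexical_features` and the regex tokenizer are assumptions made for illustration; other definitions of both measures exist.

```python
import math
import re
from collections import Counter

def lexical_features(text):
    """Type-token ratio and Shannon entropy (in bits) of the word distribution."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    total = len(tokens)
    if total == 0:
        return {"type_token_ratio": float("nan"), "entropy_bits": float("nan")}
    counts = Counter(tokens)
    # Distinct words divided by all words
    type_token_ratio = len(counts) / total
    # H = -sum(p * log2(p)) over the relative frequency p of each distinct word
    entropy_bits = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {"type_token_ratio": type_token_ratio, "entropy_bits": entropy_bits}

features = lexical_features("peace and security and peace")
print(features)
```

The type-token ratio tends to fall as texts get longer, so it is safest to compare it across texts of similar length.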

In [49]:
import re

def preprocess_text(text):
    # Insert a space wherever a lowercase letter is directly followed by an
    # uppercase one, e.g. "AssemblyHall" -> "Assembly Hall".
    # This cannot split all-lowercase joins such as "selfevident".
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', text)

dfUNSpeechComplete['clean_text'] = dfUNSpeechComplete['text'].apply(preprocess_text)
from textstat import textstat

dfUNSpeechComplete['flesch_reading_ease'] = dfUNSpeechComplete['clean_text'].apply(textstat.flesch_reading_ease)


KeyError: 'selfevident'

In [50]:
from textstat.textstat import textstatistics

ts = textstatistics()
words = dfUNSpeechComplete['text'].iloc[0].split()

In [52]:
problem_words = [word for word in words if word.lower() not in ts._textstatistics__cmudict]
print(problem_words)

AttributeError: 'textstatistics' object has no attribute '_textstatistics__cmudict'

In [53]:
import nltk
nltk.download('cmudict')
from nltk.corpus import cmudict

cmu_dict = cmudict.dict()


[nltk_data] Downloading package cmudict to
[nltk_data]     C:\Users\bax1408\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\cmudict.zip.


In [54]:
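def get_unrecognized_words(text):
    # Lowercase the text, extract word tokens, and keep those missing from the CMU dictionary
    words = re.findall(r'\b\w+\b', text.lower())
    return [word for word in words if word not in cmu_dict]

# Store the unrecognized words per speech (the unknown_words column in the table further down)
dfUNSpeechComplete['unknown_words'] = dfUNSpeechComplete['text'].apply(get_unrecognized_words)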


In [57]:
from nltk.corpus import cmudict
cmu_dict = cmudict.dict()

def filter_recognized_words(text):
    words = re.findall(r'\b\w+\b', text)
    good_words = [w for w in words if w.lower() in cmu_dict]
    return ' '.join(good_words)

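def safe_readability(text):
    # Assumption: the original definition of safe_readability is not shown in this
    # notebook; this stand-in returns NaN instead of raising (e.g. a KeyError).
    try:
        return textstat.flesch_reading_ease(text)
    except Exception:
        return float('nan')

dfUNSpeechComplete['filtered_text'] = dfUNSpeechComplete['text'].apply(filter_recognized_words)
# Note: filtered_text contains words only, so sentence punctuation is lost before scoring
dfUNSpeechComplete['flesch_reading_ease'] = dfUNSpeechComplete['filtered_text'].apply(safe_readability)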



In [58]:
dfUNSpeechComplete

Unnamed: 0,id,text,isoAlpha,session,year,Session,cname,speakerName,post,word_count,sentence_count,flesch_reading_ease,clean_text,unknown_words,filtered_text
0,AFG_55_2000,"On my way to the\nAssembly Hall, I was informe...",AFG,55,2000,55,Afghanistan,Abdullah Abdullah,MFA,2873,67,-2832.758458,"On my way to the\nAssembly Hall, I was informe...","[chitral, badakhshan, gurirab, talibanism, 4, ...",On my way to the Assembly Hall I was informed ...
1,AFG_56_2001,"﻿At the outset, on\nbehalf of the Government o...",AFG,56,2001,56,Afghanistan,Ravan Farhâdi,UN_Rep,2073,81,-2022.997687,"﻿At the outset, on\nbehalf of the Government o...","[857, seung, 11, 9, 33, massoud, 23, xvii, 70,...",At the outset on behalf of the Government of t...
2,AFG_57_2002,﻿Not very far from here stood\ntwo towers that...,AFG,57,2002,57,Afghanistan,Hâmid Karzai,President,1700,62,-1649.637857,﻿Not very far from here stood\ntwo towers that...,"[buddhas, qaida, honour, isaf, lakhdar, brahim...",Not very far from here stood two towers that s...
3,AFG_58_2003,﻿There is no reality more\noppressive than the...,AFG,58,2003,58,Afghanistan,Hâmid Karzai,President,1617,83,-1579.975372,﻿There is no reality more\noppressive than the...,"[enthusing, isaf, isaf, isaf, 30, 7, framework...",There is no reality more oppressive than the s...
4,AFG_59_2004,Nelson Mandela once\ndescribed his countryís t...,AFG,59,2004,59,Afghanistan,Mr. Hamid Karzai,President,1096,56,-1040.981580,Nelson Mandela once\ndescribed his countryís t...,"[countryís, 18, afghanistanís, 3, 5, 10, 5, ó,...",Nelson Mandela once described his transition t...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2072,ZWE_61_2006,Let me begin my statement \nby echoing the sen...,ZWE,61,2006,61,Zimbabwe,Mr. Robert Gabriel Mugabe,President,2371,100,-2307.199807,Let me begin my statement \nby echoing the sen...,"[eliasson, 10, 2000, endeavours, 2005, mdgs, p...",Let me begin my statement by echoing the senti...
2073,ZWE_62_2007,Allow me to congratulate \nMr. Kerim on his el...,ZWE,62,2007,62,Zimbabwe,Robert G. Mugabe,President,2052,121,-1988.871556,Allow me to congratulate \nMr. Kerim on his el...,"[kerim, haya, rashed, neighbours, ezulwini, 18...",Allow me to congratulate Mr on his election to...
2074,ZWE_63_2008,I wish to begin by joining \nthose who have co...,ZWE,63,2008,63,Zimbabwe,Robert Mugabe,President,1800,65,-1766.109783,I wish to begin by joining \nthose who have co...,"[kerim, 11, 08, 51851, hiv, mdgs, mdgs, progra...",I wish to begin by joining those who have cong...
2075,ZWE_64_2009,Let me begin by extending \nour warmest congra...,ZWE,64,2009,64,Zimbabwe,Robert G. Mugabe,President,1722,63,-1678.747998,Let me begin by extending \nour warmest congra...,"[honour, 192, fulfil, 09, 52463, 2, 2009, alia...",Let me begin by extending our warmest congratu...
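The strongly negative `flesch_reading_ease` values in this table are what the Flesch formula produces when a text has no sentence punctuation: the score is 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words), so a speech of a few thousand words counted as a single sentence lands in the negative thousands. The sketch below illustrates the effect with a hand-written version of the formula. The syllable counter is a rough vowel-group heuristic and the sentence splitter is a simple regex; both are assumptions for illustration and not how textstat works internally.

```python
import re

def count_syllables(word):
    # Rough heuristic (assumption): one syllable per group of consecutive vowels
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def flesch_reading_ease(text):
    # Approximate sentences by splitting on ., ! and ?
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return float('nan')
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# 150 short sentences, 650 words in total
sample = "We met in the hall. The talks were long. We hope for peace. " * 50
# Keeping word tokens only removes the full stops, as a word-level filter does
no_punct = ' '.join(re.findall(r'\b\w+\b', sample))

with_score = flesch_reading_ease(sample)
without_score = flesch_reading_ease(no_punct)
print(round(with_score, 1), round(without_score, 1))
```

One possible remedy is to filter words sentence by sentence and rejoin the sentences with full stops, so that the sentence count survives the cleaning step.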
