In the following we go through some essential topics that form the basis of data analysis on textual data.

In [16]:
# Importing basic libraries
import pandas as pd
import numpy as np

# The main library we will start with is nltk, which offers numerous tools for natural language processing
# Before importing it, you have to install it, as with the other packages before
# conda install -c anaconda nltk

import nltk

# A very extensive treatment of nltk is available in a free online book
# If you plan to work with textual data extensively, I recommend checking it out
# There are numerous other libraries that you can use, such as spaCy
# You can get an introduction to it in the DataCamp course Feature Engineering for NLP in Python

### Basic text processing
We start with some basic techniques that you want to perform on any piece of textual information: text preprocessing.
You are already familiar with strings in Python and can perform some basic operations on them, but here we focus on the tasks that help us extract information in a meaningful way. I will only cover the most important tasks; for a more extensive discussion you may consult the resources mentioned above.

In [2]:
# Some simple things we already know about preprocessing text data (strings)
# include, for example, changing text to lowercase letters

# Take an example string
text_ex = "Text mining usually involves the process of structuring the input text. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods."

# We can change it to lowercase 
# (this can be important, as from the perspective of meaning, Text and text should count as the same)

text_ex.lower()

'text mining usually involves the process of structuring the input text. the overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (nlp) and analytical methods.'

In [3]:
# Note: an important task in working with text is matching patterns
# For example, if you want to extract email addresses from a large text, you need to find
# things of the form (something)@(something).(something), where (something) can be anything
# For these types of tasks, you want to use so-called regular expressions
# We do not have the time to cover those here, but I would definitely recommend checking out
# the course 'Regular Expressions in Python' on DataCamp, as this can be very useful in the future
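
To make the (something)@(something).(something) idea concrete, here is a minimal sketch using Python's built-in re module. The pattern is deliberately simple and will not cover every valid email address, and the sample text and the name EMAIL_PATTERN are made up for this illustration.

```python
import re

# A deliberately simple pattern of the form (something)@(something).(something).
# It is an illustration only and does not cover every valid email address.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Hypothetical sample text, not part of the course data
sample = "Contact anna@example.com or the team at support@mail.example.org for help."

# findall returns every non-overlapping match as a string
emails = EMAIL_PATTERN.findall(sample)
print(emails)
```

The same compiled pattern could be applied to a pandas column with the .str.findall accessor if the texts are stored in a dataframe.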

In [4]:
# The first thing we often want to do is to take a text and cut it into meaningful units.
# The unit of analysis is typically the sentence or the word

from nltk.tokenize import sent_tokenize, word_tokenize

# When we use nltk, in most cases we need to download extra resources to perform text processing
# You only have to run this once; the resource is stored locally and found by nltk afterwards
# (on recent nltk releases the tokenizer data may instead be named 'punkt_tab')

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jmezei\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
# We can create a list of words

word_text = word_tokenize(text_ex)
print(word_text)

['Text', 'mining', 'usually', 'involves', 'the', 'process', 'of', 'structuring', 'the', 'input', 'text', '.', 'The', 'overarching', 'goal', 'is', ',', 'essentially', ',', 'to', 'turn', 'text', 'into', 'data', 'for', 'analysis', ',', 'via', 'application', 'of', 'natural', 'language', 'processing', '(', 'NLP', ')', 'and', 'analytical', 'methods', '.']


In [6]:
# We can also create a Series with the number of occurrences of each word

print(pd.Series(dict(nltk.FreqDist(word_text))))

Text           1
mining         1
usually        1
involves       1
the            2
process        1
of             2
structuring    1
input          1
text           2
.              2
The            1
overarching    1
goal           1
is             1
,              3
essentially    1
to             1
turn           1
into           1
data           1
for            1
analysis       1
via            1
application    1
natural        1
language       1
processing     1
(              1
NLP            1
)              1
and            1
analytical     1
methods        1
dtype: int64
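
If we only want the most frequent tokens, the counting idea can also be sketched with collections.Counter from the standard library (nltk's FreqDist is built on it). The token list below is a shortened, lowercased version typed in by hand for illustration.

```python
from collections import Counter

# A shortened, lowercased token list (illustrative, typed in by hand)
tokens = ["text", "mining", "involves", "the", "input", "text", ".",
          "the", "goal", "is", "to", "turn", "text", "into", "data", "."]

counts = Counter(tokens)

# most_common(n) returns the n most frequent tokens with their counts
top_three = counts.most_common(3)
print(top_three)
```

Lowercasing before counting is what merges 'Text' and 'text' into a single entry here.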


In [7]:
# Or we can create a list of sentences
# The algorithm correctly finds the two sentences

sent_text = sent_tokenize(text_ex)
print(sent_text)

['Text mining usually involves the process of structuring the input text.', 'The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.']
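
Sentence splitting looks trivial, but a naive rule breaks quickly. The sketch below splits after every '.', '!' or '?' followed by whitespace; the sample sentence is made up, and the rule is only meant to show why a trained tokenizer such as the one behind sent_tokenize is useful.

```python
import re

# Hypothetical sample with abbreviations; it contains two real sentences
sample = "Dr. Smith arrived at 5 p.m. He gave a talk."

# Naive rule: split on whitespace that follows '.', '!' or '?'
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)
print(naive_sentences)
```

The naive rule cuts after 'Dr.' as well, so it returns three pieces instead of two sentences.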


In [8]:
# Another important step is to filter out stopwords
# The stopwords of a language are its most commonly used words, which you typically want to exclude
# from any further analysis: they carry little information, as almost any text contains them

# We also need to download this resource before using it
nltk.download("stopwords")
from nltk.corpus import stopwords

# We can check what the stopwords in the English language are
stopwords.words("english")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jmezei\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [9]:
# Other languages are also available
stopwords.words("finnish")

['olla',
 'olen',
 'olet',
 'on',
 'olemme',
 'olette',
 'ovat',
 'ole',
 'oli',
 'olisi',
 'olisit',
 'olisin',
 'olisimme',
 'olisitte',
 'olisivat',
 'olit',
 'olin',
 'olimme',
 'olitte',
 'olivat',
 'ollut',
 'olleet',
 'en',
 'et',
 'ei',
 'emme',
 'ette',
 'eivät',
 'minä',
 'minun',
 'minut',
 'minua',
 'minussa',
 'minusta',
 'minuun',
 'minulla',
 'minulta',
 'minulle',
 'sinä',
 'sinun',
 'sinut',
 'sinua',
 'sinussa',
 'sinusta',
 'sinuun',
 'sinulla',
 'sinulta',
 'sinulle',
 'hän',
 'hänen',
 'hänet',
 'häntä',
 'hänessä',
 'hänestä',
 'häneen',
 'hänellä',
 'häneltä',
 'hänelle',
 'me',
 'meidän',
 'meidät',
 'meitä',
 'meissä',
 'meistä',
 'meihin',
 'meillä',
 'meiltä',
 'meille',
 'te',
 'teidän',
 'teidät',
 'teitä',
 'teissä',
 'teistä',
 'teihin',
 'teillä',
 'teiltä',
 'teille',
 'he',
 'heidän',
 'heidät',
 'heitä',
 'heissä',
 'heistä',
 'heihin',
 'heillä',
 'heiltä',
 'heille',
 'tämä',
 'tämän',
 'tätä',
 'tässä',
 'tästä',
 'tähän',
 'tallä',
 'tältä',
 'täl

In [10]:
# For our example we can set the stopwords

stop_words = list(stopwords.words("english"))

# We can change our string to lowercase and create a list of words

word_list = word_tokenize(text_ex.lower())

word_list_filtered = []

# We collect every word which is not a stopword in a new list

for word in word_list:
    if word not in stop_words:
        word_list_filtered.append(word)
        
print(word_list)
        
print(word_list_filtered)

['text', 'mining', 'usually', 'involves', 'the', 'process', 'of', 'structuring', 'the', 'input', 'text', '.', 'the', 'overarching', 'goal', 'is', ',', 'essentially', ',', 'to', 'turn', 'text', 'into', 'data', 'for', 'analysis', ',', 'via', 'application', 'of', 'natural', 'language', 'processing', '(', 'nlp', ')', 'and', 'analytical', 'methods', '.']
['text', 'mining', 'usually', 'involves', 'process', 'structuring', 'input', 'text', '.', 'overarching', 'goal', ',', 'essentially', ',', 'turn', 'text', 'data', 'analysis', ',', 'via', 'application', 'natural', 'language', 'processing', '(', 'nlp', ')', 'analytical', 'methods', '.']
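
The filtered list above still contains punctuation tokens such as '.' and ','. A possible variant, sketched below, stores the stopwords in a set (membership tests are faster than on a list) and also drops tokens without any letter or digit. The tiny stop_set is hand-picked for illustration; in the notebook it would be built from stopwords.words("english").

```python
# A tiny hand-picked stop-word set for illustration only;
# in the notebook you would build it from stopwords.words("english")
stop_set = {"the", "of", "is", "to", "into", "for", "and"}

# Shortened token list, typed in by hand
tokens = ["text", "mining", "involves", "the", "process", "of",
          "structuring", "the", "input", "text", "."]

# Keep a token only if it is not a stop word and contains at least one letter or digit
filtered = [
    tok for tok in tokens
    if tok not in stop_set and any(ch.isalnum() for ch in tok)
]
print(filtered)
```

Whether punctuation should be removed depends on the analysis; for word counts it is usually just noise.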


In [11]:
# We can also extend the list of stopwords depending on the text sources we are analysing. 
# For example, if we have a lot of movie reviews, we probably want to remove the word 'movie' from all the reviews,
# as it carries no new information: we already know that we are analysing movie reviews
# In our example we can decide that 'text' is also a stopword

# We append 'text' to the list of stopwords
stop_words += ['text']

word_list_filtered_new = []

# We can collect the list again 

for word in word_list:
    if word not in stop_words:
        word_list_filtered_new.append(word)
        
print(word_list_filtered)
        
print(word_list_filtered_new)

['text', 'mining', 'usually', 'involves', 'process', 'structuring', 'input', 'text', '.', 'overarching', 'goal', ',', 'essentially', ',', 'turn', 'text', 'data', 'analysis', ',', 'via', 'application', 'natural', 'language', 'processing', '(', 'nlp', ')', 'analytical', 'methods', '.']
['mining', 'usually', 'involves', 'process', 'structuring', 'input', '.', 'overarching', 'goal', ',', 'essentially', ',', 'turn', 'data', 'analysis', ',', 'via', 'application', 'natural', 'language', 'processing', '(', 'nlp', ')', 'analytical', 'methods', '.']


In [None]:
# The next important step is to make sure that different forms of the same word are combined in further analysis
# Typically, if we have the words 'car' and 'cars', we would like them to be combined, as they refer to the same concept,
# just one is in plural form
# There are different ways to solve this problem; the two most widely used ones are stemming and lemmatization

# The idea of stemming is very simple: just cut off the end of the word
# This is appropriate in many situations, but in many cases it will not give you the root of the word

from nltk.stem import PorterStemmer

# We always need to initialize a stemmer object first
stemmer = PorterStemmer()


# For example let's consider the following list of words

word_list = ['run','runner','running','ran','runs','easily','fairly']

# Stemming would result in the following
# As you can see, it at least manages to differentiate between a verb (run) 
# and a noun (runner), and does not map them to the same word

[stemmer.stem(word) for word in word_list]
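
To see what "cutting off the end of the word" means in its crudest form, here is a toy suffix stripper. It is not the Porter algorithm (which applies many more rules in several steps); the function name toy_stem and the suffix list are assumptions made for this sketch.

```python
def toy_stem(word, suffixes=("ing", "ly", "s")):
    """Strip the first matching suffix; a toy rule, not the Porter algorithm."""
    for suffix in suffixes:
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["run", "runner", "running", "ran", "runs", "easily", "fairly"]
print([toy_stem(w) for w in words])
```

The result contains non-words such as 'runn' and 'easi', and 'ran' is not connected to 'run' at all, which is the kind of limitation that motivates lemmatization.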

In [None]:
# For our original example sentence:
print(word_text)
print([stemmer.stem(word) for word in word_text])

In [None]:
# Lemmatization is a much more complex process
# It does not simply cut off the end of the word, but considers properties of the language and its grammar,
# and also how the word is used. For example, the lemma of 'meeting' 
# might be 'meet' or 'meeting' depending on its use in a sentence.

from nltk.stem import WordNetLemmatizer

# The lemmatizer relies on the WordNet data, which has to be downloaded once
nltk.download('wordnet')

# We initialize the object

lemmatizer = WordNetLemmatizer()

# As an example, we can see how it works on the word 'mice' compared to stemming
# As you can see, it understands that this is the plural of 'mouse' and returns that as the base form of the word, 
# which would be used in the rest of the analysis, while stemming has no effect on the word, as it does not find anything to cut

print(stemmer.stem('mice'))
print(lemmatizer.lemmatize('mice'))

In [None]:
# For the above list of words
# (without a part-of-speech argument, the lemmatizer treats every word as a noun)

[lemmatizer.lemmatize(word) for word in word_list]

In [None]:
# And our original sentences

lemmatizer_1 = WordNetLemmatizer()

print(word_text)
print([lemmatizer_1.lemmatize(word) for word in word_text])

In [None]:
# Another task that is important in many applications is tagging parts of speech,
# i.e. using an algorithm that identifies whether a word is a noun, verb etc.

# We need to download another component first
# (on recent nltk releases it may instead be named 'averaged_perceptron_tagger_eng')
nltk.download('averaged_perceptron_tagger')

nltk.pos_tag(word_text)

In [None]:
# If we want to understand the meaning of the abbreviations
nltk.download('tagsets')
nltk.help.upenn_tagset()

In [None]:
# Using this we can now perform lemmatization that takes the part of speech into account
# We still need an extra function, as the POS tags above cannot be passed to the lemmatizer directly
# We extract the first character of the tag and map it to the corresponding WordNet constant


def get_wordnet_pos(word):
    pos_tag = nltk.pos_tag([word])[0][1][0].upper()
    if pos_tag == 'J':
        return nltk.corpus.wordnet.ADJ
    elif pos_tag == 'N':
        return nltk.corpus.wordnet.NOUN
    elif pos_tag == 'V':
        return nltk.corpus.wordnet.VERB
    elif pos_tag == 'R':
        return nltk.corpus.wordnet.ADV
    else:
        return nltk.corpus.wordnet.NOUN

print(word_text)
print([lemmatizer_1.lemmatize(word, get_wordnet_pos(word)) for word in word_text])
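
The function above tags every word on its own, so the tagger never sees the surrounding sentence. One possible alternative, sketched here without nltk so that it runs anywhere, is to map tags that were already produced for the whole sentence. The single-letter codes are the string values of the WordNet constants (ADJ, NOUN, VERB, ADV); the tagged pairs are typed in by hand and are not real tagger output, and penn_to_wordnet is a name chosen for this sketch.

```python
# WordNet part-of-speech codes as plain strings; these are the values of
# nltk.corpus.wordnet.ADJ, NOUN, VERB and ADV
PENN_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag such as 'VBG' to a WordNet POS code, defaulting to noun."""
    return PENN_TO_WORDNET.get(tag[:1].upper(), "n")

# Hand-written (token, tag) pairs in the format returned by nltk.pos_tag;
# hypothetical, not actual tagger output
tagged = [("Text", "NN"), ("mining", "NN"), ("usually", "RB"),
          ("involves", "VBZ"), ("analytical", "JJ"), ("the", "DT")]

wordnet_tags = [(token, penn_to_wordnet(tag)) for token, tag in tagged]
print(wordnet_tags)
```

In the notebook this could be combined as lemmatizer_1.lemmatize(token, penn_to_wordnet(tag)) for each pair in nltk.pos_tag(word_text), so that the sentence is tagged once and in context.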

In [None]:
# Note: We can check what components of nltk we have already downloaded

nltk.download()

### Representing text
After getting familiar with some basic tools, we will now look at how to transform text data into a numeric representation that can be used in more complex analysis. We will look at two ways that will offer different perspectives.

In [17]:
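# Let's start by loading a dataset
# In the dataset we have 104 articles
# The separator 'delimiter' is a string that should not occur in the text, so each line is read as a single column;
# a multi-character separator needs the Python parsing engine, so we request it explicitly

articles = pd.read_csv('articles.txt', header = None, sep='delimiter', engine='python')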
articles.head()



Unnamed: 0,0
0,Image copyright EPA Image caption Uber has bee...
1,Ride-sharing firm Uber is facing a criminal in...
2,The scrutiny has started because the firm is a...
3,"The software, called ""greyball"", helped it ide..."
4,A spokesman for Uber declined to comment on th...


In [18]:
# As we will see, the vectorizer functions can perform the basic transformations for us,
# but we can also do them ourselves
# First we change everything to lowercase

articles[0] = articles[0].str.lower()
articles

Unnamed: 0,0
0,image copyright epa image caption uber has bee...
1,ride-sharing firm uber is facing a criminal in...
2,the scrutiny has started because the firm is a...
3,"the software, called ""greyball"", helped it ide..."
4,a spokesman for uber declined to comment on th...
...,...
99,"the latest study shows that chatbots, driven b..."
100,as the stilted computer interactions of today ...
101,one concern is the potential for technology de...
102,there is also the potential for users to becom...


In [19]:
# The next step is removing the stopwords
# We eventually want to keep the string format rather than lists, so this requires some extra steps
# If we just want a list of the words of each article with the stopwords removed,
# we can do it easily using split

articles['split'] = articles[0].apply(lambda x: x.split())

# And then we can remove the stopwords from the lists

stop_words = list(stopwords.words("english"))

articles['split'] = articles['split'].apply(lambda x: [item for item in x if item not in stop_words])

articles.head()

Unnamed: 0,0,split
0,image copyright epa image caption uber has bee...,"[image, copyright, epa, image, caption, uber, ..."
1,ride-sharing firm uber is facing a criminal in...,"[ride-sharing, firm, uber, facing, criminal, i..."
2,the scrutiny has started because the firm is a...,"[scrutiny, started, firm, accused, using, ""sec..."
3,"the software, called ""greyball"", helped it ide...","[software,, called, ""greyball"",, helped, ident..."
4,a spokesman for uber declined to comment on th...,"[spokesman, uber, declined, comment, investiga..."


In [15]:
from nltk.stem import PorterStemmer

# We always need to initialize a stemmer object first
stemmer = PorterStemmer()

# Apply the stemmer to every word in each article's word list
articles['split_1'] = articles['split'].apply(lambda x: [stemmer.stem(word) for word in x])
articles

Unnamed: 0,0,split,split_1
0,image copyright epa image caption uber has bee...,"[image, copyright, epa, image, caption, uber, ...","[imag, copyright, epa, imag, caption, uber, cr..."
1,ride-sharing firm uber is facing a criminal in...,"[ride-sharing, firm, uber, facing, criminal, i...","[ride-shar, firm, uber, face, crimin, investig..."
2,the scrutiny has started because the firm is a...,"[scrutiny, started, firm, accused, using, ""sec...","[scrutini, start, firm, accus, use, ""secret"", ..."
3,"the software, called ""greyball"", helped it ide...","[software,, called, ""greyball"",, helped, ident...","[software,, call, ""greyball"",, help, identifi,..."
4,a spokesman for uber declined to comment on th...,"[spokesman, uber, declined, comment, investiga...","[spokesman, uber, declin, comment, investigati..."
...,...,...,...
99,"the latest study shows that chatbots, driven b...","[latest, study, shows, chatbots,, driven, mach...","[latest, studi, show, chatbots,, driven, machi..."
100,as the stilted computer interactions of today ...,"[stilted, computer, interactions, today, repla...","[stilt, comput, interact, today, replac, somet..."
101,one concern is the potential for technology de...,"[one, concern, potential, technology, designed...","[one, concern, potenti, technolog, design, sed..."
102,there is also the potential for users to becom...,"[also, potential, users, become, emotionally, ...","[also, potenti, user, becom, emot, dependent,,..."


In [21]:
# The same stemming step written as a named function instead of a lambda
def stem_words(word_list):
    stemmed_list = []
    for word in word_list:
        stemmed_list.append(stemmer.stem(word))
    return stemmed_list

articles['split_1'] = articles['split'].apply(stem_words)
articles       

Unnamed: 0,0,split,split_1
0,image copyright epa image caption uber has bee...,"[image, copyright, epa, image, caption, uber, ...","[imag, copyright, epa, imag, caption, uber, cr..."
1,ride-sharing firm uber is facing a criminal in...,"[ride-sharing, firm, uber, facing, criminal, i...","[ride-shar, firm, uber, face, crimin, investig..."
2,the scrutiny has started because the firm is a...,"[scrutiny, started, firm, accused, using, ""sec...","[scrutini, start, firm, accus, use, ""secret"", ..."
3,"the software, called ""greyball"", helped it ide...","[software,, called, ""greyball"",, helped, ident...","[software,, call, ""greyball"",, help, identifi,..."
4,a spokesman for uber declined to comment on th...,"[spokesman, uber, declined, comment, investiga...","[spokesman, uber, declin, comment, investigati..."
...,...,...,...
99,"the latest study shows that chatbots, driven b...","[latest, study, shows, chatbots,, driven, mach...","[latest, studi, show, chatbots,, driven, machi..."
100,as the stilted computer interactions of today ...,"[stilted, computer, interactions, today, repla...","[stilt, comput, interact, today, replac, somet..."
101,one concern is the potential for technology de...,"[one, concern, potential, technology, designed...","[one, concern, potenti, technolog, design, sed..."
102,there is also the potential for users to becom...,"[also, potential, users, become, emotionally, ...","[also, potenti, user, becom, emot, dependent,,..."


In [None]:
# We can then get back the single string format using join
# (note that this joins the unstemmed words in 'split'; use 'split_1' for the stemmed version)
articles['split_final'] = articles['split'].apply(lambda x: ' '.join(x))
articles.head()

In [None]:
# The first representation we look at is created with CountVectorizer, available in the sklearn package
# This creates a new dataset where each row corresponds to one of the original text documents (articles in this case)
# and each column is a word
# Each value in the data then gives the number of times the word (column) is present in the article (row)

from sklearn.feature_extraction.text import CountVectorizer

# We initialize an object 

vect = CountVectorizer()

# Then create the representation

article_counts = vect.fit_transform(articles['split_final'])

# We can check the size of the resulting data
# We can see that we have 104 rows corresponding to the original articles, and 1442 columns for the words that 
# appear in at least 1 article

article_counts.shape
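
To make the row/column structure concrete, the same kind of count table can be built by hand for a few tiny documents. This is only a sketch of the idea in plain Python: the three documents are made up, and CountVectorizer itself additionally lowercases the text and, by default, ignores single-character tokens.

```python
from collections import Counter

# Three tiny hypothetical documents standing in for the articles
docs = [
    "uber faces criminal investigation",
    "uber software helped uber drivers",
    "chatbots driven by software",
]

# Vocabulary: every distinct word, sorted alphabetically (one column per word)
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one count per vocabulary word
count_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Most entries are zero, which is why the vectorizer returns a sparse matrix rather than a regular array.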

In [None]:
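# We can look at the data after converting it to a dataframe
# (get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions no longer provide)

article_counts_df = pd.DataFrame(article_counts.toarray(), columns=vect.get_feature_names_out())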

article_counts_df.head()

In [None]:
# Using this table we can now do some basic word frequency analysis
# First we sum up the columns to get the frequencies and sort them
# We can see that the most frequent word is 'said', which is most likely not something we care about

word_count = article_counts_df.sum(axis=0).sort_values(ascending = False)
word_count[:20]

In [None]:
# At this point, we can go back to creating the representation with CountVectorizer, as we can actually specify 
# the removal of stopwords there
# We can make use of the parameter stop_words
# If we set the value 'english', it removes the words in scikit-learn's built-in English stopword list
# As we have already removed the nltk stopwords, that would change little
# But we can pass our own extended list, as we have seen before 
stop_words_new = list(stopwords.words("english")) + ['said']

# We now perform the transformation

vect_new = CountVectorizer(stop_words = stop_words_new)

article_counts = vect_new.fit_transform(articles['split_final'])

# We convert it to a dataframe

article_counts_df_new = pd.DataFrame(article_counts.toarray(), columns=vect_new.get_feature_names_out())

# And look at the word count
# As we can see, 'said' disappeared
# We could run this again to remove further stopwords until we are satisfied

word_count_new = article_counts_df_new.sum(axis=0).sort_values(ascending = False)
word_count_new[:20]

In [None]:
# The other representation is the term frequency-inverse document frequency (TF-IDF) mentioned in the lecture
# This calculates a weighted frequency of the words
# If a word appears in many of the articles, it is given a lower weight, as that word does 
# not help in differentiating the articles from each other

from sklearn.feature_extraction.text import TfidfVectorizer

# We can also use stop_words if necessary

tdf_vect = TfidfVectorizer()

article_tdf = tdf_vect.fit_transform(articles['split_final'])

# We convert it to a dataframe

article_tdf_df = pd.DataFrame(article_tdf.toarray(), columns=tdf_vect.get_feature_names_out())

# And we can look at the sum of weights of a word (sum of values in a column)
# As you can see, it adjusted the weight of 'said'
# The algorithm found that it appears in too many articles, so it is actually not that important

word_count_tdf = article_tdf_df.sum(axis=0).sort_values(ascending = False)
word_count_tdf[:20]
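
The weighting itself can be sketched by hand. The example below assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by scaling each row to unit length, which to my understanding corresponds to the default settings of TfidfVectorizer; the documents are made up and the code is plain Python, not what sklearn runs internally.

```python
import math
from collections import Counter

# Three tiny hypothetical documents standing in for the articles
docs = [
    "uber faces criminal investigation",
    "uber software helped uber drivers",
    "chatbots driven by software",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for doc in tokenized for word in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each word occurs
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocabulary}

# Smoothed inverse document frequency (assumed formula, see the text above)
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

tfidf_rows = []
for doc in tokenized:
    counts = Counter(doc)
    weights = [counts[w] * idf[w] for w in vocabulary]
    # Scale the row to unit Euclidean length
    norm = math.sqrt(sum(x * x for x in weights))
    tfidf_rows.append([x / norm for x in weights])

print({w: round(value, 3) for w, value in idf.items()})
```

Words that occur in several documents ('uber', 'software') get a smaller idf than words that occur in only one, which is the adjustment described above.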

### Topic modeling
We will look at one application, topic modeling. The idea is very similar to clustering: we try to group the text documents so that documents that end up in the same group are similar, i.e. they discuss more or less the same topic. We can run a very well-known algorithm for this purpose: Latent Dirichlet Allocation.

In [None]:
# We have already used CountVectorizer, so we can make use of the created object article_counts (the sparse matrix, not the dataframe)

from sklearn.decomposition import LatentDirichletAllocation

# As in clustering, we need to determine the number of topics
# In this case I will not introduce any specific measure
# You can try different values and see whether the resulting topics make sense
# The algorithm also involves some randomness, so we fix the random state to get the same results when running again

LDA = LatentDirichletAllocation(n_components = 3, random_state = 42)

LDA_results = LDA.fit_transform(article_counts)

In [None]:
# The fitted model stores an array with the importance of each word in each topic
# The way we try to understand topics is to print their most important words
# If there are topics present in the data, the most important words should differ
# and characterize the topics in a meaningful way

LDA.components_

In [None]:
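# We can print the 10 most important words for each topic

# Look up the vocabulary once instead of in every loop iteration
feature_names = vect_new.get_feature_names_out()

for topic, component in enumerate(LDA.components_):
    # first we get the indices of the 10 largest weights (argsort sorts in ascending order,
    # so the most important word is printed last)
    words_sorted = np.argsort(component)[-10:]
    
    print([feature_names[i] for i in words_sorted])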
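
If you prefer the most important word first, the same selection can be written with a descending sort. The sketch below uses plain Python and a made-up 2 x 5 weight table in place of LDA.components_; the helper name top_words is an assumption for this example.

```python
# Hypothetical topic-word weights: 2 topics x 5 words (made-up numbers)
feature_names = ["uber", "software", "chatbots", "police", "users"]
components = [
    [9.1, 4.2, 0.3, 6.5, 0.8],
    [0.4, 5.0, 8.7, 0.2, 7.3],
]

def top_words(weights, names, n=3):
    """Return the n highest-weighted words, most important first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [names[i] for i in order[:n]]

for topic_id, weights in enumerate(components):
    print(topic_id, top_words(weights, feature_names))
```

With numpy, the equivalent would be np.argsort(component)[::-1][:10], which reverses the ascending order before taking the first ten indices.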

In [None]:
# We can also add the topic number to the original articles
# In the outcome of LDA, each row tells us to what extent each article belongs to each topic
# Based on the first row we can see that the first article seems to belong to the first topic
LDA_results

In [None]:
# We can simply check which number is the highest and assign that topic
LDA_results.argmax(axis=1)

In [None]:
# Let's add this as a new column
articles['Topic'] = LDA_results.argmax(axis=1)
articles.head()
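
Taking only the largest value hides how clear the assignment is. As a small sketch in plain Python, with made-up document-topic weights in place of LDA_results, we can keep the weight of the winning topic next to its index; dominant_topic is a name chosen for this illustration.

```python
# Hypothetical document-topic weights: each row sums to 1 (made-up numbers)
doc_topic = [
    [0.90, 0.06, 0.04],
    [0.10, 0.15, 0.75],
    [0.40, 0.35, 0.25],
]

def dominant_topic(row):
    """Return (topic index, weight) of the largest entry in the row."""
    best = max(range(len(row)), key=lambda i: row[i])
    return best, row[best]

assignments = [dominant_topic(row) for row in doc_topic]
print(assignments)
```

The third document is assigned to topic 0 with a weight of only 0.40, so it is a mixture of topics rather than a clear member of one. In the notebook, LDA_results.max(axis=1) would give the corresponding weights.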