# **Introduction to text analysis in Python. Day 2 Part 2**

## *Dr Kirils Makarovs*

## *k.makarovs@exeter.ac.uk*

## *University of Exeter Q-Step Centre*

---


# **Welcome to Day 2!**

## **Today, we are going to look at:**

+ `f-strings` in Python
+ `for` loops, `if-else` statements and functions in Python (quick guide)
+ `.apply()` method to deal with multiple text entries in a dataframe
+ Descriptive text analysis

---



# **Preparatory steps for descriptive text analysis**

When you work with multiple text entries, they will most likely be organized as a dataframe.

Therefore, knowing Python libraries such as `pandas` and `numpy` for data management, as well as `matplotlib` and `seaborn` for data visualization, will help you process text data.


In [None]:
# Import the necessary libraries

import pandas as pd # data analysis and management library
import numpy as np # multi-dimensional arrays

# Data visualization

import seaborn as sns # easy-syntax plots
import matplotlib.pyplot as plt # lower-level plotting library used to tweak the details of the seaborn plots


In [None]:
# Let's upload the dataset into the current Google Colab session

from google.colab import files

uploaded = files.upload()


In [None]:
# Getting the dataset

df = pd.read_csv('fakenews.csv')

df


In [None]:
# Some helpful commands to get to know the dataset

type(df) # object type - pandas.core.frame.DataFrame

df.shape # number of rows and columns

df.columns # column names

df.index # row index (row labels)

df.info() # summary of the variables in the dataset

df.head() # get the top 5 rows of the dataset

df.tail() # get the last 5 rows of the dataset


---

# **3. Descriptive text analysis**

Each text entry has certain features that we can extract, summarize, and compare across documents.

The most common descriptive text features are:
+ *Number of characters*
+ *Number of words*
+ *Average word length*
+ *Average sentence length*
+ *Number of uppercase or capitalized words*
+ *Number of numerical or special elements*

We can calculate these statistics for different types of entries (e.g. **fake newspaper articles** vs. **real news articles**) and compare them.

We will first get these statistics for a **single text entry** and then see how to use Python methods to apply functions to the **entire dataset of text entries**.
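Of the features listed above, *average sentence length* is not worked through in the cells that follow. Here is a minimal sketch of one way to estimate it, assuming that sentences end with `.`, `!` or `?` followed by whitespace. The example text is made up, and the regular expression is a rough heuristic rather than a proper sentence tokenizer (abbreviations such as "Dr." would be split incorrectly).

```python
import re

# Hypothetical example text; not taken from the course dataset
text = "Fake news spreads fast. Real news takes time! Can you tell them apart?"

# Split after ., ! or ? when followed by whitespace (rough heuristic)
sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# Number of words in each sentence
words_per_sentence = [len(s.split()) for s in sentences]

# Average sentence length measured in words
avg_sentence_length = sum(words_per_sentence) / len(words_per_sentence)

print(f'Number of sentences: {len(sentences)}')
print(f'Average sentence length (in words): {avg_sentence_length:.2f}')
```

Article titles are usually a single sentence, so this feature is likely to be more informative for longer texts such as article bodies.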



## **Single text entry**

In [None]:
# Let's take one of the article titles as an example

my_title = df['title'][95]

my_title


### **Number of characters**

In [None]:
# Number of characters

len(my_title) # 99 characters


### **Number of words**

In [None]:
# Number of words:

len(my_title.split()) # 18 words


### **Average word length**

In [None]:
# Average word length:

word_length = [] # create an empty list as a container for values derived from a loop

for e in my_title.split():

  w_length = len(e) # get the length of each word

  word_length.append(w_length) # append the value to the word_length object


In [None]:
word_length  # now word_length contains the length of each word and we can calculate the average value

print(f'The average word length in this title is: {np.array(word_length).mean()}')
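The loop above can also be written more compactly as a list comprehension. The sketch below uses a made-up title (not one from the dataset) and plain Python instead of `numpy`. Because `.split()` only splits on whitespace, any punctuation attached to a word counts towards that word's length.

```python
# Hypothetical title used for illustration only
example_title = "Scientists discover water on Mars"

# Same logic as the loop: one length per word
lengths = [len(word) for word in example_title.split()]

# Average = sum of the lengths divided by the number of words
avg_len = sum(lengths) / len(lengths)

print(lengths)
print(f'The average word length in this title is: {avg_len:.2f}')
```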


### **Number of uppercase/capitalized words**

In [None]:
# Number of capitalized words

cap_words = [] # create an empty list as a container for values derived from a loop

for e in my_title.split(): # for each word..

  if e[0].isupper(): # ..check if it starts with a capital letter

    cap_words.append(e) # and if it does, append it to the cap_words list

# to count fully uppercase words (e.g. 'NASA') instead, use e.isupper() in place of e[0].isupper()


In [None]:
# Now you can get the length of the cap_words list to see how many capitalized words there are

cap_words # capitalized words

len(cap_words) # 3
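To make the difference between *capitalized* and fully *uppercase* words concrete, here is a small sketch on a made-up title. `w[0].isupper()` only checks the first character, whereas `w.isupper()` requires every cased character in the word to be uppercase.

```python
# Hypothetical title used for illustration only
example_title = "BREAKING: Senate votes on NASA budget Today"

words = example_title.split()

# Words that start with a capital letter
capitalized = [w for w in words if w[0].isupper()]

# Words written entirely in uppercase (punctuation is ignored by .isupper())
all_upper = [w for w in words if w.isupper()]

print(capitalized)
print(all_upper)
```

Every fully uppercase word is also counted as capitalized, so the two counts overlap.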


### **Number of numerical/special elements**

In [None]:
# Number of numerical elements

# For this, we will use the .isnumeric() method

# .isnumeric() returns True if all characters in the string are numerical

num_elements = [] # create an empty list as a container for values derived from a loop

for e in my_title.split(): # for each word..

  if e.isnumeric(): # ..check if it is a purely numerical element

    num_elements.append(e) # and if it is, append it to the num_elements list


In [None]:
# Now you can get the length of the num_elements list to see how many numerical elements there are

num_elements # numerical elements

len(num_elements) # 1
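Keep in mind that `.isnumeric()` is strict: a token that contains a decimal point, a comma, a sign or a percent symbol is not counted as numerical. The sketch below shows this on a few made-up tokens and contrasts it with a looser check (an assumption for illustration, not part of the approach above) that flags any token containing at least one digit.

```python
# Hypothetical tokens used for illustration only
tokens = ['2016', '3.5', '1,000', '-7', '50%', 'seven']

# Strict check: every character in the token must be numeric
strict = {t: t.isnumeric() for t in tokens}

# Looser check: the token contains at least one digit
loose = {t: any(ch.isdigit() for ch in t) for t in tokens}

print(strict)
print(loose)
```

Which check is more appropriate depends on whether values such as `50%` should count as numerical elements in your analysis.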


## **Multiple text entries**

To perform the same operations on multiple text entries, we will:

+ wrap them up into functions (if necessary)
+ apply these functions to the entire column of the dataset via the `.apply()` method
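The two steps above can be sketched on a tiny, made-up `pandas` Series. The names `titles` and `count_words` are hypothetical and only serve the illustration. `.apply()` calls the function once per entry and returns a new Series of the same length.

```python
import pandas as pd

# Hypothetical titles used for illustration only
titles = pd.Series(['Hello world', 'Python is fun to learn', 'Text'])

def count_words(text):
    """Return the number of whitespace-separated words in a string."""
    return len(text.split())

# Option 1: pass a named function to .apply()
n_words_named = titles.apply(count_words)

# Option 2: pass an anonymous (lambda) function to .apply()
n_words_lambda = titles.apply(lambda x: len(x.split()))

print(n_words_named)
```

Both options give the same result. A named function is easier to reuse and test, while a lambda is handy for one-line operations.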

### **Number of characters**

In [None]:
# Get the number of characters for each of the titles in the dataframe

n_char = df['title'].apply(lambda x: len(x))

# check n_char, check len(n_char)

# If this statement returns True it means that
# you've got as many entries in the n_char object as rows in the 'title' variable
len(n_char) == len(df['title'])

n_char


### **Number of words**

In [None]:
# Get the number of words for each of the titles in the dataframe

n_words = df['title'].apply(lambda x: len(x.split()))

n_words
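As an alternative to `.apply()`, `pandas` offers vectorized string methods through the `.str` accessor. The sketch below uses a made-up Series with one missing value. This matters because `len()` raises a `TypeError` when it is called on a missing value, whereas the `.str` methods propagate the missing value instead.

```python
import pandas as pd

# Hypothetical titles used for illustration only; the second entry is missing
titles = pd.Series(['Hello world', None, 'One two three'])

# Number of characters per entry (missing entries stay missing)
n_char_str = titles.str.len()

# Number of words per entry: split into lists, then take the list lengths
n_words_str = titles.str.split().str.len()

print(n_char_str)
print(n_words_str)
```

If the `title` column contains missing values, another option is to drop or fill them first, for example with `.dropna()` or `.fillna('')`.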


### **Average word length**

In [None]:
# Get the average word length for each of the titles in the dataframe

def get_avg_word_length(text): # define a function called get_avg_word_length that takes some text (string) as an input

  word_length = [] # create an empty list as a container for values derived from a loop

  words = text.split() # split text (string) into words

  for e in words: # for each word..

    w_length = len(e) # get its length

    word_length.append(w_length) # append the value to the word_length object

  avg_length = np.array(word_length).mean() # calculate the average word length

  return avg_length # return it


In [None]:
# Now you can apply the get_avg_word_length function to the 'title' column via the .apply() method

# The result is stored under a different name than the function,
# so the function is not overwritten and this cell can be re-run safely
avg_word_length = df['title'].apply(get_avg_word_length)

avg_word_length


*Now you try!*

Please create two functions:

1. `n_cap_words` that gets the number of *capitalized words* in a string
2. `n_num` that gets the number of *numerical elements* in a string

... and apply them to the `title` column of the dataframe


### **Number of uppercase/capitalized words**

In [None]:
# Get the number of capitalized words for each of the titles in the dataframe




In [None]:
# Now you can apply the n_cap_words function to the 'title' column via .apply() method




### **Number of numerical/special elements**

In [None]:
# Get the number of numerical elements for each of the titles in the dataframe




In [None]:
# Now you can apply the n_num function to the 'title' column via .apply() method




In [None]:
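# Finally, let's add the newly created Series to the original dataset as new columns
# (n_cap_words and n_num must hold the results of the exercise above, not the functions themselves)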

df_2 = pd.DataFrame({'n_char' : n_char,
                     'n_words' : n_words,
                     'avg_word_length' : avg_word_length,
                     'n_cap_words' : n_cap_words,
                     'n_num' : n_num})

df = pd.concat([df, df_2], axis = 1)

df
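Once the features are stored as columns, they can be compared across types of entries, such as fake vs. real articles. The sketch below uses a small made-up dataframe and assumes a hypothetical `label` column that marks each entry as `fake` or `real`. The actual column name and values in `fakenews.csv` may differ.

```python
import pandas as pd

# Made-up data for illustration only; 'label' is a hypothetical column name
toy = pd.DataFrame({
    'label': ['fake', 'fake', 'real', 'real'],
    'n_char': [80, 100, 60, 70],
    'n_words': [12, 16, 9, 11],
})

# Average of each descriptive feature within each group
summary = toy.groupby('label')[['n_char', 'n_words']].mean()

print(summary)
```

The same pattern should work on `df` by replacing `'label'` with the column that identifies the article type and listing the feature columns created above.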


# **That's the end of Part 2!**