# **Introduction to text analysis in Python. Day 3 Part 1**

## *Dr Kirils Makarovs*

## *k.makarovs@exeter.ac.uk*

## *University of Exeter Q-Step Centre*

---


# **Welcome to Day 3 Part 1!**

## **Today, we are going to look at:**

+ Descriptive text analysis (continuation)
+ Text preprocessing

---



# **Descriptive text analysis (continuation)**


In [None]:
# Import the necessary libraries

import pandas as pd # data analysis and management library
import numpy as np # multi-dimensional arrays

# Data visualization

import seaborn as sns # easy-syntax plots
import matplotlib.pyplot as plt # deep-level library used to tweak the details of the seaborn plots


In [None]:
# Let's upload the updated dataset (fakenews_upd.csv) into the current Google Colab session

from google.colab import files

uploaded = files.upload()


In [None]:
# Getting the dataset

df = pd.read_csv('fakenews_upd.csv')

df


In [None]:
# Some helpful commands to get to know the dataset

type(df) # object type - pandas.core.frame.DataFrame

df.shape # number of rows and columns

df.columns # column names

df.index # indices

df.info() # summary of the variables in the dataset

df.head() # get the top 5 rows of the dataset

df.tail() # get the last 5 rows of the dataset


Let's take a look at some examples of **descriptive analysis** that one can run with the existing variables.

We will start by inspecting the characteristics of all the newspaper article titles together, and then scrutinize the differences between the **real** and **fake** newspaper article titles.

---

## **Are the number of characters, number of words, average word length and number of capitalized words correlated with each other?**


<figure>
<left>
<img src=https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/images/R_value.png width="2000">
</figure>

In [None]:
# Import the necessary command first

from scipy.stats import pearsonr # Pearson's r


In [None]:
# Draw a simple scatterplot of number of words vs. average word length

plt.figure(figsize = (14, 9))

sns.scatterplot(data = df, x = 'n_words', y = 'avg_word_length')

plt.title('The relationship between number of words and average word length', fontsize = 20)
plt.xlabel('Number of words', fontsize = 15)
plt.ylabel('Average word length', fontsize = 15)

plt.show()


In [None]:
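# Use sns.lmplot() if you want to have a trend line on top of a scatterplot
# and split by article type

# sns.lmplot() is a figure-level function that creates its own figure,
# so its size is set via height and aspect rather than plt.figure()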

sns.lmplot(data = df,
            x = 'n_words',
            y = 'avg_word_length',
            hue = 'label',
            palette = "Set1",
            height = 8, # height of the graph
            aspect = 1.5) # width = height * aspect

plt.title('The relationship between number of words and average word length\nby newspaper article type', fontsize = 20)
plt.xlabel('Number of words', fontsize = 15)
plt.ylabel('Average word length', fontsize = 15)            

plt.show()


In [None]:
# Obtain Pearson's correlation coefficient and its p-value between these two variables

corr_coef, p_value = pearsonr(df['n_words'], df['avg_word_length'])

corr_coef

p_value

round(corr_coef, 2)

print(f'The Pearson correlation coefficient value is: {round(corr_coef, 2)}')

# Use 3 significant digits so that a very small p-value is not displayed as 0.0
print(f'The Pearson correlation coefficient p-value is: {p_value:.3g}')
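
To see what Pearson's r measures, it can help to compute it by hand once. The sketch below uses made-up toy values (not the fakenews data) and checks the manual result against `np.corrcoef`.

```python
import numpy as np

# Hypothetical toy values standing in for two numeric columns
x = np.array([4, 7, 9, 12, 15], dtype=float)
y = np.array([6.1, 5.4, 5.0, 4.6, 4.2])

# Deviations of each value from its mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r: sum of cross-products divided by the square root
# of the product of the sums of squared deviations
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# np.corrcoef returns a 2x2 correlation matrix; take the off-diagonal entry
r_numpy = np.corrcoef(x, y)[0, 1]

print(round(r_manual, 4), round(r_numpy, 4))
```

Here `y` falls as `x` rises, so the coefficient is negative; the sign gives the direction of the relationship and the absolute value its strength.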


In [None]:
# Correlation matrix: use pandas' .corr() method

# .corr() provides a correlation matrix even if only two variables are provided:

df[['n_words', 'avg_word_length']].corr()



In [None]:
# Let's get a correlation matrix of all the variables of interest

corr_matrix = df[['n_char', 'n_words', 'avg_word_length', 'n_cap_words']].corr()

corr_matrix

round(corr_matrix, 2)


In [None]:
# You can visualize the correlation matrix
# (which is essentially a dataframe - check type(corr_matrix)) by using seaborn's heatmap

type(corr_matrix) # pandas.core.frame.DataFrame

plt.figure(figsize = (14, 9))

sns.heatmap(corr_matrix, annot = True)

plt.show()

# Check out more arguments of the seaborn heatmap here - https://seaborn.pydata.org/generated/seaborn.heatmap.html


If you want to have p-values on top of the correlation coefficient values, you can use one of the community solutions provided [here](https://stackoverflow.com/questions/24432101/correlation-coefficients-and-p-values-for-all-pairs-of-rows-of-a-matrix).
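
As one possible approach, the p-values can also be collected in a loop with `scipy.stats.pearsonr`. This is only a sketch: `corr_pvalues` is a hypothetical helper name, the data are randomly generated toy columns, and the code assumes the columns contain no missing values.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def corr_pvalues(data):
    """Return a DataFrame of pairwise Pearson p-values (hypothetical helper)."""
    cols = data.columns
    # Start with NaN everywhere; the diagonal stays NaN because the p-value
    # of a variable with itself is not informative
    pvals = pd.DataFrame(np.full((len(cols), len(cols)), np.nan),
                         index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            _, p = pearsonr(data[a], data[b])
            pvals.loc[a, b] = p
            pvals.loc[b, a] = p  # the matrix is symmetric
    return pvals

# Toy data: 'b' depends on 'a', 'c' is independent noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({'a': rng.normal(size=50)})
toy['b'] = toy['a'] * 2 + rng.normal(scale=0.5, size=50)
toy['c'] = rng.normal(size=50)

pvals = corr_pvalues(toy)
print(pvals.round(3))
```

With the notebook's data, the same helper could be called on `df[['n_char', 'n_words', 'avg_word_length', 'n_cap_words']]`.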

---

## **Inspecting how real newspaper article titles differ from the fake ones**


In [None]:
# See the share of fake and real newspaper article titles in the dataset
# (normalize = True returns proportions instead of counts)

df['label'].value_counts(normalize = True)


### **Aggregated statistics**

<figure>
<left>
<img src=https://miro.medium.com/max/1400/0*XVlrOuSBNKwIZpPj.png width="500">
</figure>

**[Image source](https://towardsdatascience.com/7-pandas-functions-that-i-use-the-most-b83ddbaf53bf)**

**Aggregated analysis** implies that you get some statistic (e.g. mean or median) of your main variable **separately for groups of observations** defined by some **other variable**

(also known as **Split-Apply-Combine** technique)

Your main variable should be **continuous**, whereas the grouping variable should be **categorical**.

For example:

+ mean income for men and women
+ median level of life satisfaction for young, middle-aged, and elderly people
+ 25th and 75th percentiles of an anxiety scale for 1st, 2nd, and 3rd-year students

The process of getting **aggregated statistics** requires two steps:
 + first group your data via .groupby() method,
 + then get aggregated values
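
The two steps can be sketched on a tiny made-up dataset. The column names `group` and `score` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: 'group' is categorical, 'score' is continuous
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'score': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Split by 'group', apply the mean to 'score', combine into one Series
group_means = toy.groupby('group')['score'].mean()

# Several statistics at once
group_stats = toy.groupby('group')['score'].agg(['mean', 'median'])

# 25th and 75th percentiles per group; .unstack() turns the result into a table
group_quartiles = toy.groupby('group')['score'].quantile([0.25, 0.75]).unstack()

print(group_means)
print(group_stats)
print(group_quartiles)
```

The result has one row per group, which is the "combine" part of Split-Apply-Combine.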

In [None]:
# Let's get the mean number of characters per fake and real newspaper article titles

df.groupby('label')['n_char'].mean()

# Fake titles seem to be longer than the real ones


In [None]:
# Let's visualize the relationship between the newspaper article title type
# and average number of characters by drawing a bar plot

# A bar plot represents an estimate of central tendency (e.g. mean) for a numeric variable
# with the height of each rectangle and provides some indication of the uncertainty
# around that estimate using error bars.

# See more examples of seaborn bar plots here - https://seaborn.pydata.org/generated/seaborn.barplot.html

plt.figure(figsize = (14, 9))

sns.barplot(data = df, x = 'label', y = 'n_char')

plt.title('The average number of characters\nby newspaper article title type', fontsize = 20)
plt.xlabel('Newspaper article title type', fontsize = 15)
plt.ylabel('Average number of characters per title', fontsize = 15)  

plt.show()
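
According to the seaborn documentation, the error bars of `sns.barplot()` show, by default, a bootstrapped 95% confidence interval around the mean. The sketch below illustrates the idea with NumPy only; the values are randomly generated stand-ins for title lengths, and the number of resamples (1000) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical title lengths (in characters) for one group of titles
values = rng.normal(loc=60, scale=15, size=200)

# Resample with replacement many times and record the mean of each resample
boot_means = np.array([
    rng.choice(values, size=values.size, replace=True).mean()
    for _ in range(1000)
])

# The 2.5th and 97.5th percentiles of the resampled means
# form a 95% bootstrap confidence interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f'Mean: {values.mean():.2f}, 95% CI: [{ci_low:.2f}, {ci_high:.2f}]')
```

If the intervals of two groups barely overlap or do not overlap at all, the difference between the group means is less likely to be due to chance alone.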


In [None]:
# Getting the mean and standard deviation of number of words per fake and real newspaper article titles

df.groupby('label')['n_words'].agg(['mean', 'std'])


In [None]:
# What about the average word length?

df.groupby('label')['avg_word_length'].agg(['mean', 'median', 'std'])

# Seems pretty even!


---

## **What about the proportion of titles that include numerical values?**


In [None]:
# Originally, the variable n_num (number of numerical values) has three values - 0, 1, and 2.

df['n_num'].value_counts(normalize = True)

df['n_num'].value_counts()

# However, since only 1% of titles contain 2 numerical values, let's make this variable binary:

# 0 - no numerical values
# 1 - one or more numerical values


In [None]:
# Let's create a new variable called n_num_bin using the np.where() function:
# np.where(condition, value if True, value if False)

df['n_num_bin'] = np.where(df['n_num'] == 0, 'No numerical values', 'Some numerical values')

df['n_num_bin'].value_counts(normalize = True)
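
The behaviour of `np.where()` is easiest to see on a handful of made-up values. The series `counts` below is hypothetical and only mimics a count variable such as `n_num`.

```python
import numpy as np
import pandas as pd

# Hypothetical counts of numerical values in five titles
counts = pd.Series([0, 1, 0, 2, 1])

# Where the condition is True take the first label, otherwise the second
labels = np.where(counts == 0, 'none', 'some')

# An alternative binary coding: a boolean flag instead of text labels
flag = counts > 0

print(labels)
print(flag.tolist())
```

Text labels are convenient for crosstabs and plot legends, whereas a boolean flag is handy when the variable is later used in calculations.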


In [None]:
# Now let's run a crosstab to see the percentage of titles that have at least some numerical values
# among real and fake newspaper article titles

# normalize = 'index' gives row proportions (each row sums to 1), assuming that
# the type of title affects the presence of numerical values; * 100 converts them to percentages

pd.crosstab(df['label'], df['n_num_bin'],
            normalize = 'index', margins = True) * 100

# Is there a relationship between the newspaper article title type and number of numerical values?
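
The choice of the `normalize` argument changes what the numbers mean. The sketch below contrasts row and column proportions on a small made-up table; the column name `has_num` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: four fake and two real titles
toy = pd.DataFrame({
    'label': ['fake', 'fake', 'fake', 'fake', 'real', 'real'],
    'has_num': ['no', 'no', 'no', 'yes', 'no', 'yes'],
})

# Row proportions: within each label, the share of titles with and without numbers
row_props = pd.crosstab(toy['label'], toy['has_num'], normalize='index')

# Column proportions: within each has_num category, the share of fake and real titles
col_props = pd.crosstab(toy['label'], toy['has_num'], normalize='columns')

print(row_props)
print(col_props)
```

Row proportions answer "what share of fake titles contain numbers?", while column proportions answer "what share of titles with numbers are fake?".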


In [None]:
# Let's visualize the crosstab!

# We will visualize this crosstab with the help of base matplotlib (not seaborn)

# We need a crosstab with proportions but with no margins, so let's create it and save as a separate object

crosstab = pd.crosstab(df['label'], df['n_num_bin'], normalize = 'index')

# For other variables, you can recycle this code entirely, but make sure that you
# provide the correct crosstab to the .plot() command.

# Defining a figure with a single plot of size 12 x 7 inches

f, ax = plt.subplots(nrows = 1, ncols = 1, figsize=(12, 7))

# Drawing a plot straight from the crosstab created above

crosstab.plot(kind = 'bar', # bar plot
              stacked = True, # stacking proportions from a crosstab
              rot = 0, # no rotation for X ticks
              ax = ax) # linking a figure created above with the plot of a crosstab


plt.title('Proportion of titles that include numerical values\nby newspaper article title type', fontsize = 20)
plt.xlabel('Newspaper article title type', fontsize = 15)
plt.ylabel('Proportion of titles\n with and without numerical values', fontsize = 15)

plt.show()


## **Exercise**

Get the dataset and try out different types of descriptive analysis that you've just learned!

# **That's the end of Day 3 Part 1!**