# Text analysis I

In this notebook, we will:

- Calculate interrater reliability.
- Export data for coding.
- Import coded data.
- Score text with built-in TextBlob sentiment.

In [1]:
import random

import pandas as pd
import numpy as np

from textblob import TextBlob
from sklearn.metrics import cohen_kappa_score
from nltk import agreement

# Interrater reliability

In [2]:
# Generate some data.
# Note: randrange() excludes its upper bound, so true scores fall in 1-6
#       and each rater's offset is either -1 or 0.
true_scores = [random.randrange(1, 7) for _ in range(200)]
r1_diff = [random.randrange(-1, 1) for _ in range(200)]
r2_diff = [random.randrange(-1, 1) for _ in range(200)]
r3_diff = [random.randrange(-1, 1) for _ in range(200)]

ratings = pd.DataFrame(true_scores, columns=['true_scores'])

ratings['item'] = range(200)
ratings['r1'] = ratings['true_scores'] + r1_diff
ratings['r2'] = ratings['true_scores'] + r2_diff
ratings['r3'] = ratings['true_scores'] + r3_diff
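# Clip only the rating columns; clipping the whole frame would also alter 'item'.
rating_cols = ['true_scores', 'r1', 'r2', 'r3']
ratings[rating_cols] = ratings[rating_cols].clip(1, 7)

ratings.head()

Unnamed: 0,true_scores,item,r1,r2,r3
0,2,0,2,2,2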
1,5,1,5,4,5
2,5,2,4,4,4
3,2,3,2,2,2
4,2,4,1,2,2


In [3]:
# For two raters
print(cohen_kappa_score(ratings['r1'], ratings['r2']))
print(cohen_kappa_score(ratings['r2'], ratings['r3']))
print(cohen_kappa_score(ratings['r1'], ratings['r3']))

0.4511860479297517
0.4083562061604147
0.48082091375519176
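
To see what `cohen_kappa_score` is measuring, here is a minimal sketch of unweighted Cohen's kappa computed by hand: the observed agreement, corrected for the agreement we would expect by chance. The helper `cohen_kappa()` and the toy ratings are made up for illustration, and the sketch assumes the raters do not both use one single label throughout (that would make the denominator zero).

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    n = len(rater_a)

    # Observed agreement: share of items where both raters chose the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement: chance that both raters pick the same label,
    # given how often each rater uses each label.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2

    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy ratings: the raters disagree on one item out of six.
toy_r1 = [1, 2, 3, 3, 2, 1]
toy_r2 = [1, 2, 3, 2, 2, 1]
print(cohen_kappa(toy_r1, toy_r2))  # approximately 0.75
```

Raw agreement here is 5/6, but kappa is lower because a third of that agreement would be expected by chance alone.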


In [4]:
# NLTK
# Note: recoding these is annoying, as you can see.
#       We could reshape the columns and recode some stuff,
#       but it's probably practically worse than this.

ratings2_list = []
rater = 0
for rd in [r1_diff, r2_diff, r3_diff]:
    values = np.array(true_scores) + np.array(rd)
    df = pd.DataFrame(values, columns=['value'])
    df = df.clip(1, 7)
    df['rater'] = rater
    df['item'] = range(200)
    ratings2_list.append(df)
    rater += 1

ratings2 = pd.concat(ratings2_list, ignore_index=True)
ratings_nltk = [tuple(x) for x in ratings2[['rater', 'item', 'value']].values]

nltk_rate = agreement.AnnotationTask(data=ratings_nltk)

print(f'Kappa:   {nltk_rate.kappa()}')
print(f'Fleiss:  {nltk_rate.multi_kappa()}')
print(f'Alpha:   {nltk_rate.alpha()}')

Kappa:   0.4467877226151195
Fleiss:  0.44677215962250344
Alpha:   0.4475293989990099
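
For comparison, the reshape-based alternative mentioned in the comment above could be sketched with `melt()`, which turns one column per rater into one row per rating. The toy dataframe and the names `toy`, `long_ratings`, and `triples` are made up for illustration; the rater labels come out as column names (`'r1'`, `'r2'`, `'r3'`) rather than integers.

```python
import pandas as pd

# Hypothetical wide-format ratings: one row per item, one column per rater.
toy = pd.DataFrame({'item': [0, 1], 'r1': [2, 5], 'r2': [2, 4], 'r3': [3, 5]})

# Reshape to long format: one row per (item, rater) pair.
long_ratings = toy.melt(id_vars='item', value_vars=['r1', 'r2', 'r3'],
                        var_name='rater', value_name='value')

# Build (rater, item, value) tuples in the order used for ratings_nltk above.
triples = list(long_ratings[['rater', 'item', 'value']]
               .itertuples(index=False, name=None))
print(triples[:2])
```

Whether this is easier than the loop is a matter of taste; it mainly helps when the ratings already live in a wide dataframe.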


# Exporting data for coding

In [5]:
msft_prs = pd.read_csv('../data/msft_prs.csv')
msft_prs.head()

Unnamed: 0,article:modified_time,article:published_time,article:section,body,h3,og:description,og:title,og:updated_time,og:url
0,2018-10-12T00:36:34+00:00,2018-10-11T23:16:52+00:00,Press Releases,"REDMOND, Wash. — Oct. 11, 2018 — Microsoft Cor...",Microsoft to host earnings conference call web...,Microsoft to host earnings conference call web...,Microsoft announces quarterly earnings release...,2018-10-12T00:36:34+00:00,https://news.microsoft.com/2018/10/11/microsof...
1,2018-10-11T16:36:30+00:00,2018-10-11T14:00:25+00:00,Press Releases,"INDIANAPOLIS – Oct. 11, 2018 – Today, Eric J. ...","With support provided by Walmart, and in partn...","With support provided by Walmart, and in partn...","Indiana Governor Eric J. Holcomb, the Markle F...",2018-10-11T16:36:30+00:00,https://news.microsoft.com/2018/10/11/indiana-...
2,2018-10-11T07:05:46+00:00,2018-10-11T07:04:28+00:00,Press Releases,"MUMBAI, India — Oct. 11, 2018 — Microsoft Corp...",,"Microsoft, Star India and Anil Kumble redefine...",Anil Kumble’s Microsoft AI-enabled Power Bat p...,2018-10-11T07:05:46+00:00,https://news.microsoft.com/2018/10/11/anil-kum...
3,2018-10-11T20:29:31+00:00,2018-10-10T18:00:22+00:00,Press Releases,"BOYDTON, Va. — Oct. 10, 2018 — Microsoft Corp....",,"BOYDTON, Va. — Oct. 10, 2018 — Microsoft Corp....",Microsoft and Lake Country Satellite roll out ...,2018-10-11T20:29:31+00:00,https://news.microsoft.com/2018/10/10/microsof...
4,,2018-10-10T15:15:58+00:00,Press Releases,"Charleston, S.C. (October 10, 2018) — Blackbau...",Introduce first phase of initiative with joint...,Introduce first phase of initiative with joint...,Blackbaud and Microsoft expand partnership wit...,,https://news.microsoft.com/2018/10/10/blackbau...


In [6]:
msft_prs_export = msft_prs[['og:title']].copy()

# We'd usually recover this from a filename or other metadata.
msft_prs_export.loc[:,'id_ticker'] = 'msft'

# Add empty columns for coding.
_PR_COLS = ['cat_partner', 'cat_dividend', 'cat_award']
for col in _PR_COLS:
    msft_prs_export.loc[:,col] = ''

In [7]:
msft_prs_export.head()

Unnamed: 0,og:title,id_ticker,cat_partner,cat_dividend,cat_award
0,Microsoft announces quarterly earnings release...,msft,,,
1,"Indiana Governor Eric J. Holcomb, the Markle F...",msft,,,
2,Anil Kumble’s Microsoft AI-enabled Power Bat p...,msft,,,
3,Microsoft and Lake Country Satellite roll out ...,msft,,,
4,Blackbaud and Microsoft expand partnership wit...,msft,,,


In [8]:
# Let's reorder to put the title at the end and export
msft_prs_export[['id_ticker', 'cat_partner', 
                 'cat_dividend', 'cat_award',
                 'og:title']].to_csv('../data/msft_pr_coding.csv')

# Import from LIWC

I ran the exported titles through LIWC ahead of time.

Importing human-coded data would work the same way, except that we would keep the coding columns instead of dropping them as we do below.

In [9]:
msft_liwc = pd.read_csv('../data/msft_pr_liwc.csv')
msft_liwc.head()

Unnamed: 0,Source (A),Source (B),Source (C),Source (D),Source (E),Source (F),WC,Analytic,Clout,Authentic,...,Comma,Colon,SemiC,QMark,Exclam,Dash,Quote,Apostro,Parenth,OtherP
0,0,msft,,,,Grab forges strategic cloud partnership with M...,19,99.0,85.38,19.27,...,0.0,0.0,0.0,0.0,0.0,5.26,0.0,0.0,0.0,0.0
1,1,msft,,,,Redline Communications and Microsoft announce ...,16,99.0,96.95,35.37,...,0.0,0.0,0.0,0.0,0.0,6.25,0.0,0.0,0.0,0.0
2,2,msft,,,,"Adobe, Microsoft and SAP announce the Open Dat...",18,99.0,71.09,58.07,...,5.56,0.0,0.0,0.0,0.0,5.56,0.0,0.0,0.0,0.0
3,3,msft,,,,Microsoft fortifies security and brings AI to ...,13,99.0,77.92,17.46,...,0.0,0.0,0.0,0.0,0.0,7.69,0.0,0.0,0.0,0.0
4,4,msft,,,,Shell expands strategic collaboration with Mic...,13,97.77,77.92,17.46,...,0.0,0.0,0.0,0.0,0.0,7.69,0.0,0.0,0.0,0.0


To merge these back together, we need a key that identifies each document.
We could merge on the index, but the index is fragile.
Microsoft will likely publish a new press release between the time I prepare these materials and the time we use them, which would shift the row positions.
So, we want to merge on something that stays attached to the document itself.

Here, for expediency, I'm using the title column.
However, I often apply a hash function to the document body to create a unique document ID.
This is especially helpful in news collections, where many documents are returned by more than one search.
We can calculate hash values, remove duplicates, and export only the unique documents for coding.
This prevents rework.
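
The hashing approach described above might look like the following minimal sketch. The helper `doc_hash()`, the `doc_id` column, and the toy documents are made up for illustration, and SHA-256 is just one reasonable choice of hash function. It assumes every body is a string; missing values would need to be handled first.

```python
import hashlib

import pandas as pd


def doc_hash(text):
    """Return a stable hex digest for one document body."""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()


# Hypothetical collection in which one document was returned by two searches.
docs = pd.DataFrame({'body': ['First release.', 'Second release.', 'First release.']})

docs['doc_id'] = docs['body'].apply(doc_hash)
unique_docs = docs.drop_duplicates(subset='doc_id')

print(len(docs), len(unique_docs))  # 3 2
```

Because the ID is derived from the text itself, it stays the same no matter how the rows are ordered or when the data was collected.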

In [10]:
# Drop the columns we already have.
_DROP_LIWC = [f'Source ({i})' for i in 'ABCDE']
msft_liwc = msft_liwc.drop(_DROP_LIWC, axis=1)
msft_liwc = msft_liwc.rename(columns={'Source (F)': 'og:title'})
msft_liwc.head()

Unnamed: 0,og:title,WC,Analytic,Clout,Authentic,Tone,WPS,Sixltr,Dic,function,...,Comma,Colon,SemiC,QMark,Exclam,Dash,Quote,Apostro,Parenth,OtherP
0,Grab forges strategic cloud partnership with M...,19,99.0,85.38,19.27,25.77,19.0,47.37,68.42,26.32,...,0.0,0.0,0.0,0.0,0.0,5.26,0.0,0.0,0.0,0.0
1,Redline Communications and Microsoft announce ...,16,99.0,96.95,35.37,1.0,16.0,43.75,81.25,25.0,...,0.0,0.0,0.0,0.0,0.0,6.25,0.0,0.0,0.0,0.0
2,"Adobe, Microsoft and SAP announce the Open Dat...",18,99.0,71.09,58.07,25.77,18.0,44.44,66.67,27.78,...,5.56,0.0,0.0,0.0,0.0,5.56,0.0,0.0,0.0,0.0
3,Microsoft fortifies security and brings AI to ...,13,99.0,77.92,17.46,99.0,13.0,30.77,61.54,30.77,...,0.0,0.0,0.0,0.0,0.0,7.69,0.0,0.0,0.0,0.0
4,Shell expands strategic collaboration with Mic...,13,97.77,77.92,17.46,25.77,13.0,61.54,69.23,23.08,...,0.0,0.0,0.0,0.0,0.0,7.69,0.0,0.0,0.0,0.0


In [11]:
msft_prs = msft_prs.merge(msft_liwc, on='og:title')
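# LIWC's output appears to include its own 'body' category column, so the merge
# suffixes the press-release text as 'body_x'; restore the original name.
msft_prs = msft_prs.rename(columns={'body_x': 'body'})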
msft_prs.head()

Unnamed: 0,article:modified_time,article:published_time,article:section,body,h3,og:description,og:title,og:updated_time,og:url,WC,...,Comma,Colon,SemiC,QMark,Exclam,Dash,Quote,Apostro,Parenth,OtherP
0,2018-10-08T23:39:05+00:00,2018-10-08T23:00:25+00:00,Press Releases,"SINGAPORE and REDMOND, Wash. — Oct. 8, 2018 — ...",,Grab will adopt Microsoft Azure as its preferr...,Grab forges strategic cloud partnership with M...,2018-10-08T23:39:05+00:00,https://news.microsoft.com/2018/10/08/grab-for...,19,...,0.0,0.0,0.0,0.0,0.0,5.26,0.0,0.0,0.0,0.0
1,2018-10-04T14:43:59+00:00,2018-10-04T13:00:35+00:00,Press Releases,"REDMOND, Wash. — Oct. 4, 2018 — On Thursday, R...",The partnership will help make broadband more ...,The partnership will help make broadband more ...,Redline Communications and Microsoft announce ...,2018-10-04T14:43:59+00:00,https://news.microsoft.com/2018/10/04/redline-...,16,...,0.0,0.0,0.0,0.0,0.0,6.25,0.0,0.0,0.0,0.0
2,2018-09-24T16:11:34+00:00,2018-09-24T13:30:14+00:00,Press Releases,"\nShantanu Narayen, CEO, Adobe (left), Satya N...",Industry leaders team up to help customers con...,Industry leaders team up to help customers con...,"Adobe, Microsoft and SAP announce the Open Dat...",2018-09-24T16:11:34+00:00,https://news.microsoft.com/2018/09/24/adobe-mi...,18,...,5.56,0.0,0.0,0.0,0.0,5.56,0.0,0.0,0.0,0.0
3,,2018-09-24T13:01:27+00:00,Press Releases,"ORLANDO, Fla. — Sept. 24, 2018 — Monday at Ign...","The company also demonstrated how AI, IoT and ...","The company also demonstrated how AI, IoT and ...",Microsoft fortifies security and brings AI to ...,,https://news.microsoft.com/2018/09/24/microsof...,13,...,0.0,0.0,0.0,0.0,0.0,7.69,0.0,0.0,0.0,0.0
4,2018-09-20T15:19:45+00:00,2018-09-20T13:00:12+00:00,Press Releases,"REDMOND, Wash., and HOUSTON — Sept. 20, 2018 —...",Shell appoints C3 IoT and Microsoft Azure as i...,Shell appoints C3 IoT and Microsoft Azure as i...,Shell expands strategic collaboration with Mic...,2018-09-20T15:19:45+00:00,https://news.microsoft.com/2018/09/20/shell-ex...,13,...,0.0,0.0,0.0,0.0,0.0,7.69,0.0,0.0,0.0,0.0
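
Note that `merge()` defaults to an inner join, so a press release with no matching LIWC row is dropped without warning. One way to check for that, sketched here on made-up toy frames (`left`, `right`, and their contents are hypothetical), is a left join with `indicator=True`, plus `validate` to confirm that the key is unique on both sides.

```python
import pandas as pd

# Hypothetical stand-ins for the press releases and the LIWC output.
left = pd.DataFrame({'og:title': ['A', 'B', 'C'], 'body': ['a', 'b', 'c']})
right = pd.DataFrame({'og:title': ['A', 'B'], 'WC': [19, 16]})

# Keep every press release, flag where each row came from, and raise an
# error if a title appears more than once in either frame.
checked = left.merge(right, on='og:title', how='left',
                     indicator=True, validate='one_to_one')

# Rows marked 'left_only' had no match in the LIWC data.
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched['og:title'].tolist())  # ['C']
```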


# TextBlob

Before applying TextBlob to our press releases, let's look at some examples to get a sense of how it works.

In [12]:
string1 = 'Text analysis is a lot of fun!'
TextBlob(string1).sentiment

Sentiment(polarity=0.375, subjectivity=0.2)

In [13]:
string2 = 'It is not scary.'
TextBlob(string2).sentiment

Sentiment(polarity=0.25, subjectivity=1.0)

In [14]:
string3 = 'Manual coding is the worst.'
TextBlob(string3).sentiment

Sentiment(polarity=-1.0, subjectivity=1.0)

In [15]:
string4 = 'Manual coding is kind of the worst.'
TextBlob(string4).sentiment

Sentiment(polarity=-0.2, subjectivity=0.95)

In [16]:
strings = ' '.join([string1, string2, string3, string4])
TextBlob(strings).sentiment

Sentiment(polarity=-0.155, subjectivity=0.82)

As with our comparison of web scraping and APIs, it's nice when we can work with data entirely inside Python.

We're going to start by defining some simple functions that we can use with pandas.

In [17]:
def tb_subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

def tb_polarity(text):
    return TextBlob(text).sentiment.polarity

In [18]:
msft_prs['tb_subj'] = msft_prs['og:title'].apply(tb_subjectivity)
msft_prs['tb_polar'] = msft_prs['og:title'].apply(tb_polarity)

In [19]:
msft_prs.head()

Unnamed: 0,article:modified_time,article:published_time,article:section,body,h3,og:description,og:title,og:updated_time,og:url,WC,...,SemiC,QMark,Exclam,Dash,Quote,Apostro,Parenth,OtherP,tb_subj,tb_polar
0,2018-10-08T23:39:05+00:00,2018-10-08T23:00:25+00:00,Press Releases,"SINGAPORE and REDMOND, Wash. — Oct. 8, 2018 — ...",,Grab will adopt Microsoft Azure as its preferr...,Grab forges strategic cloud partnership with M...,2018-10-08T23:39:05+00:00,https://news.microsoft.com/2018/10/08/grab-for...,19,...,0.0,0.0,0.0,5.26,0.0,0.0,0.0,0.0,0.0,0.0
1,2018-10-04T14:43:59+00:00,2018-10-04T13:00:35+00:00,Press Releases,"REDMOND, Wash. — Oct. 4, 2018 — On Thursday, R...",The partnership will help make broadband more ...,The partnership will help make broadband more ...,Redline Communications and Microsoft announce ...,2018-10-04T14:43:59+00:00,https://news.microsoft.com/2018/10/04/redline-...,16,...,0.0,0.0,0.0,6.25,0.0,0.0,0.0,0.0,0.0,0.0
2,2018-09-24T16:11:34+00:00,2018-09-24T13:30:14+00:00,Press Releases,"\nShantanu Narayen, CEO, Adobe (left), Satya N...",Industry leaders team up to help customers con...,Industry leaders team up to help customers con...,"Adobe, Microsoft and SAP announce the Open Dat...",2018-09-24T16:11:34+00:00,https://news.microsoft.com/2018/09/24/adobe-mi...,18,...,0.0,0.0,0.0,5.56,0.0,0.0,0.0,0.0,0.477273,0.068182
3,,2018-09-24T13:01:27+00:00,Press Releases,"ORLANDO, Fla. — Sept. 24, 2018 — Monday at Ign...","The company also demonstrated how AI, IoT and ...","The company also demonstrated how AI, IoT and ...",Microsoft fortifies security and brings AI to ...,,https://news.microsoft.com/2018/09/24/microsof...,13,...,0.0,0.0,0.0,7.69,0.0,0.0,0.0,0.0,0.0,0.0
4,2018-09-20T15:19:45+00:00,2018-09-20T13:00:12+00:00,Press Releases,"REDMOND, Wash., and HOUSTON — Sept. 20, 2018 —...",Shell appoints C3 IoT and Microsoft Azure as i...,Shell appoints C3 IoT and Microsoft Azure as i...,Shell expands strategic collaboration with Mic...,2018-09-20T15:19:45+00:00,https://news.microsoft.com/2018/09/20/shell-ex...,13,...,0.0,0.0,0.0,7.69,0.0,0.0,0.0,0.0,0.0,0.0


# Breakout Exercises

Let's do two exercises to reinforce the concepts we learned above.


1. TextBlob
1. functions with `df.apply()`

## EX1: TextBlob

Above, we saw what the TextBlob subjectivity and polarity attributes looked like on some sample text.
In your breakout groups, come up with a few strings to test, and then make minor changes and see how that changes the sentiment scoring.

**Note:** Create cells as needed.

In [20]:
# 1 code


## EX2: loops

When working with pandas dataframes, we generally avoid writing explicit loops to process each item (for technical reasons beyond the scope of the course, but ask about it if time permits).
However, there is a conceptually similar approach, where we write a simple function to handle one case, and we use the `apply()` method to apply it to every item in a column.
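
As a reminder of the pattern, here is a small sketch that is unrelated to the exercise itself: a function that handles one string, applied to every row of a column. The function `word_count()`, the toy dataframe, and the column names are made up for illustration.

```python
import pandas as pd


def word_count(text):
    """Count the whitespace-separated words in one string."""
    return len(text.split())


# Hypothetical titles standing in for a real text column.
titles = pd.DataFrame({'title': ['Microsoft announces earnings',
                                 'Shell expands collaboration with Microsoft']})

# apply() calls word_count() once per row and collects the results.
titles['n_words'] = titles['title'].apply(word_count)
print(titles['n_words'].tolist())  # [3, 5]
```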

1. Create a new function, named `tb_polarity_absolute()`, that returns the absolute value of the TextBlob polarity score.
1. Create a new column, named `'tb_pol_abs'`, in the `msft_prs` dataframe that applies the new function you wrote to the `'og:title'` column.
1. View the first 10 rows of the dataframe and compare the `'tb_polar'` and `'tb_pol_abs'` columns to verify that you get the result you're expecting.

In [21]:
# 2-1 code


In [22]:
# 2-2 code


In [23]:
# 2-3 code
