In [25]:
# Importing packages 

import pandas as pd
import re
import numpy as np

# Advanced Social Data Science 2 (ASDS2) Exercises


## April 19: Overview and regular expressions

### 1: Thinking about text as data

Go to Kaggle’s database of text data sets here: https://www.kaggle.com/datasets?topic=nlpDatasets 

1. Find an interesting data set. (Try searching the data sets or playing around with the sorting rule in the top right). It doesn’t have to be social sciencey, just whatever looks interesting to you.
2. Describe the variables in the data. What’s there in addition to the text itself, if anything?
3. What’s a meaningful latent variable which might vary across the texts? (For example, ‘sentiment’ might plausibly vary across movie reviews).
4. Assume you could measure the latent variable from (3). How might that latent variable correlate with other properties of the units of the data? (These can be observed variables in the data, or other, unobserved properties).


### 2: Importing text data

1. The file mach.csv, available on the course Absalon page, contains part of Machiavelli’s The Prince, subdivided into 188 sections. Download it to your computer.
2. Import the file into Python using read_csv() from pandas 
3. Using the search function from Python’s re module (or a Pandas equivalent), find out in which section(s) the following terms appear:
    - lion
    - flatterers
    - ccmnot
4. Why might a nonsensical term like ‘ccmnot’ be in the corpus?


In [23]:
f = pd.read_csv('mach.csv')

In [29]:
# na=False treats missing text as "no match", which replaces the ==True comparison
f[f.text.str.contains('lion', na=False)]

Unnamed: 0.1,Unnamed: 0,text
26,Mach_122.txt.content,"You should know, then, that there are two way..."
27,Mach_123.txt.content,"recognise traps, and a lion to frighten away w..."
44,Mach_139.txt.content,Severus possessed so much ability that he was...
47,Mach_141.txt.content,true. But when Severus had defeated and killed...
97,Mach_187.txt.content,against infantry that fight as strongly as the...
112,Mach_30.txt.content,Alexander was forced to make a frontal assault...
139,Mach_55.txt.content,"Baglioni, Vitelli and Orsini came to Rome, the..."
166,Mach_8.txt.content,"reconquered a second time, it is less likely t..."
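
Note that `str.contains('lion')` matches substrings, so sections that only contain a word such as 'rebellion' may be returned as well. A minimal sketch of a whole-word search follows. The `demo` DataFrame and its rows are made up for illustration and are not taken from mach.csv.

```python
import pandas as pd

# Hypothetical toy data; not the contents of mach.csv
demo = pd.DataFrame({"text": [
    "a lion to frighten away wolves",
    "the rebellion was put down",
    "Lions and foxes",
]})

# Substring match: also hits 'rebellion'
substring_hits = demo[demo.text.str.contains("lion")]

# Whole-word match: \b marks a word boundary on each side of 'lion'
word_hits = demo[demo.text.str.contains(r"\blion\b", regex=True)]

print(substring_hits.index.tolist())
print(word_hits.index.tolist())
```

Both searches are case-sensitive, so 'Lions' in the last row is skipped either way.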


In [31]:
f[f.text.str.contains('flatterers')]

Unnamed: 0.1,Unnamed: 0,text
74,Mach_166.txt.content,who governs a state' should never think about ...
75,Mach_167.txt.content,"rulers easily make mistakes, unless they are v..."
76,Mach_168.txt.content,decisions. He should so conduct himself with h...


In [32]:
# na=False treats missing text as "no match", which replaces the ==True comparison
f[f.text.str.contains('ccmnot', na=False)]

Unnamed: 0.1,Unnamed: 0,text
53,Mach_147.txt.content,But let us return to our subject. I maintain ...
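
The exercise also allows the `search` function from `re` instead of the pandas string methods. A rough sketch of that route, again on a made-up DataFrame (the `demo` rows and the `hits` dictionary are illustrative names, not part of the original solution):

```python
import re
import pandas as pd

# Hypothetical toy data standing in for the 'text' column of mach.csv
demo = pd.DataFrame({"text": [
    "a prince must be a fox and a lion",
    "how flatterers should be avoided",
    "nothing relevant here",
]})

terms = ["lion", "flatterers", "ccmnot"]

# For each term, collect the row labels where re.search finds a match
hits = {
    term: [idx for idx, text in demo["text"].items() if re.search(term, text)]
    for term in terms
}
print(hits)
```

A term that appears nowhere simply maps to an empty list.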


### 3: Regular expressions

In this exercise, we’re continuing with Python’s re module.
<br>The following tasks can be solved using one or more of these three functions from re: `search`, `split` and `sub`.

1. Define a string and check that it contains only a certain set of characters (in this case a-z, A-Z and 0-9). 
2. Define a string and check that it has an a followed by zero or more b's.
3. Define a string and check that it has an a followed by one or more b's.
4. Using the sample string ‘The quick brown fox jumps over the lazy dog’, search for the words 'fox', 'dog', 'horse'.
5. Define a string with the word ‘Road’ in it, and abbreviate 'Road' as 'Rd.' using sub().
6. Define a string and perform very simple tokenization by splitting at all whitespaces.
7. Define a string and replace whitespaces with an underscore. Afterwards, reverse this by replacing underscores with a whitespace.
8. Define a string with a few cases of multiple spaces between words and remove all those cases.

Hint: Take a look at the documentation for Python's re module to find solutions, and test your regular expression on regextester.com or consult regex101.com 


In [73]:
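#1. Define a string
s = "Ihatecorona123."

# Check that it contains only a-z, A-Z and 0-9.
# The negated class [^...] matches any character outside the set,
# so the string passes only if no such character is found.
print(not re.search(r'[^A-Za-z0-9]', s))


False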
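
The same check can be phrased positively by anchoring an allowed character class to the start and end of the string. This is a sketch only; `only_alphanumeric` is a hypothetical helper name, and unlike a negated-class search it rejects the empty string.

```python
import re

def only_alphanumeric(text):
    # \A and \Z anchor the match to the very start and end of the string,
    # so every character must belong to the allowed set
    return bool(re.search(r'\A[A-Za-z0-9]+\Z', text))

print(only_alphanumeric("Ihatecorona123"))
print(only_alphanumeric("Ihatecorona123."))
```

`\Z` is used rather than `$` because `$` also matches just before a trailing newline.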


In [86]:
#2. 
string = 'ac'

bool(re.search(r'ab*', string))

True

In [85]:
#3. 
bool(re.search(r'ab+', string))

False
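
To see how `b*` (zero or more) and `b+` (one or more) differ, it can help to run both patterns over a few test strings and look at what each one actually matched. The strings in `candidates` are arbitrary examples chosen for this sketch.

```python
import re

candidates = ['ac', 'ab', 'abbb', 'cb']

results = {}
for text in candidates:
    star = re.search(r'ab*', text)   # 'a' followed by zero or more 'b'
    plus = re.search(r'ab+', text)   # 'a' followed by at least one 'b'
    results[text] = (
        star.group() if star else None,
        plus.group() if plus else None,
    )

for text, matched in results.items():
    print(text, matched)
```

For 'ac', `ab*` matches the single 'a' while `ab+` finds nothing, which is the difference shown in the two cells above.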

In [87]:
sample = 'The quick brown fox jumps over the lazy dog'


In [90]:
#4. 
print(re.search(r'(fox)', sample))
print(re.search(r'(dog)', sample))
print(re.search(r'(horse)', sample))

<re.Match object; span=(16, 19), match='fox'>
<re.Match object; span=(40, 43), match='dog'>
None
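
Because `re.search` returns `None` when nothing matches, calling a method on the result directly would fail for 'horse'. A small sketch of a loop that guards against this and records the span of each hit (the `found` dictionary is an illustrative name):

```python
import re

sample = 'The quick brown fox jumps over the lazy dog'

found = {}
for word in ['fox', 'dog', 'horse']:
    match = re.search(word, sample)
    # Test the result before using it: None has no .span() method
    found[word] = match.span() if match else None

print(found)
```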


In [93]:
#5.
string = 'Follow the yellow brick road'

# Include the period so the result is a proper abbreviation, as the task asks
re.sub('road', 'rd.', string)

'Follow the yellow brick rd.'
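
A plain substitution also rewrites 'road' inside longer words. One possible refinement, sketched here on a made-up address string, combines word boundaries with the `re.IGNORECASE` flag so that only the standalone word is abbreviated, whatever its capitalisation.

```python
import re

# Hypothetical example string, chosen to contain 'road' inside other words
address = '100 Broad Road, near the crossroads'

# Naive version: replaces every occurrence of the letters 'road'
naive = re.sub('road', 'Rd.', address, flags=re.IGNORECASE)

# \b restricts the match to 'road' as a whole word
careful = re.sub(r'\broad\b', 'Rd.', address, flags=re.IGNORECASE)

print(naive)
print(careful)
```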

In [95]:
#6. 
# \s+ splits at any run of whitespace (spaces, tabs, newlines), not only single spaces
re.split(r'\s+', string)

['Follow', 'the', 'yellow', 'brick', 'road']
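
Splitting on a single space and splitting on `\s+` give the same result for this string, but they diverge as soon as tabs, newlines or repeated spaces appear. A short sketch with a deliberately messy, made-up string:

```python
import re

# Hypothetical string mixing a tab, a double space and a newline
messy = 'Follow\tthe  yellow\nbrick road'

by_space = re.split(' ', messy)          # splits at each single space only
by_whitespace = re.split(r'\s+', messy)  # splits at any run of whitespace

print(by_space)
print(by_whitespace)
```

The first result keeps the tab and newline inside tokens and produces an empty string for the double space.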

In [98]:
#7.
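string = 'Follow the yellow brick road'

underscored = re.sub(' ', '_', string)
print(underscored)
# Reverse the substitution on the underscored string, not on the original
print(re.sub('_', ' ', underscored))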


Follow_the_yellow_brick_road
Follow the yellow brick road


In [99]:
#8. 
string = 'Follow  the yellow brick   road'

re.sub(r' {2,}',' ', string)

'Follow the yellow brick road'
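
The pattern ` {2,}` only targets runs of literal spaces. If the text may also contain tabs or newlines, or leading and trailing whitespace, a slightly broader variant can be sketched as follows (the input string is made up for illustration):

```python
import re

# Hypothetical input with mixed whitespace and padding at both ends
text = '  Follow \t the   yellow\n\nbrick road '

# Collapse every run of whitespace to one space, then trim the ends
collapsed = re.sub(r'\s+', ' ', text).strip()

print(collapsed)
```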