In [1]:
# Importing relevant packages 

import pandas as pd
import re


# Advanced Social Data Science 2 (ASDS2) Exercises


## Overview and regular expressions

### 1: Thinking about text as data

Go to Kaggle’s database of text data sets here: https://www.kaggle.com/datasets?topic=nlpDatasets 

1. Find an interesting data set. (Try searching the data sets or playing around with the sorting rule in the top right). It doesn’t have to be social sciencey, just whatever looks interesting to you.
2. Describe the variables in the data. What’s there in addition to the text itself, if anything?
3. What’s a meaningful latent variable which might vary across the texts? (For example, ‘sentiment’ might plausibly vary across movie reviews).
4. Assume you could measure the latent variable from (3). How might that latent variable correlate with other properties of the units of the data? (These can be observed variables in the data, or other, unobserved properties).


[*write your answer here*]

### 2: Importing text data

1. The file mach.csv, available at the course Absalon page, contains part of Machiavelli’s The Prince, subdivided into 188 sections. Download it to your computer.
2. Import the file into Python using read_csv() from pandas. 

(Tip: Check the content of the data frame using the .head()-function, to assess whether everything is tidy and ready to go). 

In [2]:
# Importing Machiavelli file

mach = pd.read_csv("mach.csv")
mach = mach.rename(columns = {'Unnamed: 0': 'section'}) #Renaming unnamed column


3. Using the search function from Python’s re module (or a Pandas equivalent), find out in which section(s) the following terms appear:
    - lion
    - flatterers
    - ccmnot

In [3]:
# Solution using re.search

# Looping over the terms and printing each section whose text contains the term.
# Note: re.search matches substrings, so 'lion' also matches words such as 'rebellion'.

for term in ['lion', 'flatterers', 'ccmnot']:
    print('Sections containing the word "%s"' % term)
    for section, text in zip(mach.section, mach.text):
        if re.search(term, text):
            print(section)
    print()

Sections containing the word "lion"
Mach_122.txt.content
Mach_123.txt.content
Mach_139.txt.content
Mach_141.txt.content
Mach_187.txt.content
Mach_30.txt.content
Mach_55.txt.content
Mach_8.txt.content

Sections containing the word "flatterers"
Mach_166.txt.content
Mach_167.txt.content
Mach_168.txt.content

Sections containing the word "ccmnot"
Mach_147.txt.content



In [4]:
# Solution using a Pandas equivalent, str.contains

# Creating one boolean column per term, showing for each section whether the term is present or not
# str.contains is a built-in Pandas string method; by default it treats the term as a regular expression and returns True where the text contains a match.

mach['lion'] = mach['text'].str.contains('lion') 
mach['flatterers'] = mach['text'].str.contains('flatterers')
mach['ccmnot'] = mach['text'].str.contains('ccmnot')

# Extracting and printing the sections containing each of the given terms  

print('Sections containing the word \"lion\"')
print(mach.loc[mach['lion'], 'section'])  # The boolean column can be used directly as a row mask

print('\nSections containing the word \"flatterers\"')
print(mach.loc[mach['flatterers'], 'section'])

print('\nSections containing the word \"ccmnot\"')
print(mach.loc[mach['ccmnot'], 'section'])

Sections containing the word "lion"
26     Mach_122.txt.content
27     Mach_123.txt.content
44     Mach_139.txt.content
47     Mach_141.txt.content
97     Mach_187.txt.content
112     Mach_30.txt.content
139     Mach_55.txt.content
166      Mach_8.txt.content
Name: section, dtype: object

Sections containing the word "flatterers"
74    Mach_166.txt.content
75    Mach_167.txt.content
76    Mach_168.txt.content
Name: section, dtype: object

Sections containing the word "ccmnot"
53    Mach_147.txt.content
Name: section, dtype: object
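
Both solutions match substrings, so the pattern 'lion' also fires on words such as 'rebellion'. If only the whole word is of interest, one option is the word-boundary anchor `\b`. The following is a minimal sketch; the sample sentences and the helper name `contains_word` are made up for illustration and are not taken from mach.csv.

```python
import re

# Hypothetical sample texts (not from mach.csv), used only to illustrate the idea
samples = [
    "A ruler must be a fox and a lion.",
    "The rebellion was quickly suppressed.",
]

def contains_word(word, text):
    """Return True if `word` occurs in `text` as a whole word."""
    # \b matches at a word boundary; re.escape guards against special characters in `word`
    return bool(re.search(r'\b' + re.escape(word) + r'\b', text))

for sample in samples:
    substring_hit = bool(re.search('lion', sample))   # plain substring search
    whole_word_hit = contains_word('lion', sample)    # whole-word search
    print(substring_hit, whole_word_hit, '-', sample)
```

The same pattern string (for example `r'\blion\b'`) could in principle be passed to `str.contains`, since that method interprets its argument as a regular expression by default.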


4. Why might a nonsensical term like ‘ccmnot’ be in the corpus?

Inspecting the section that contains 'ccmnot' shows that the word should read 'cannot'. This looks less like a spelling mistake than a digitization error: perhaps the Machiavelli text was digitized by scanning, and the character recognition misread the letters 'an' as 'cm'.

(Tip: Try printing the content of the text containing 'ccmnot'. Does it contain more text than the notebook displays by default? How could we change this?)

In [5]:
# Displaying and exploring the section containing 'ccmnot'

# Setting the maximum column width to 2000 characters (otherwise pandas truncates long strings with "..." to indicate that there is more text that wasn't shown)
pd.set_option('display.max_colwidth', 2000)

# Print the text for the section 'Mach_147.txt.content'
print(mach['text'][mach.section=="Mach_147.txt.content"])

53     But let us return to our subject. I maintain that anyone who considers what I have written will realise that either hatred or contempt led to the downfall of the emperors I have discussed; he will recognise that some of them acted in one way and others in the opposite way, and that one ruler in each group was successful and the others ended badly. Because Pertinax and Alexander were new rulers, it was useless and harmful for them to act like Marcus, who was an hereditary ruler. Likewise, it was harmful for Caracalla, Commodus and Maximinus to act like Severus, because they lacked the ability required to follow in his footsteps. Therefore, a new ruler in a new principality ccmnot imitate the conduct of Marcus, nor again is it necessary to imitate that of Severus. Rather, he should imitate Severus in the courses of action that are necessary for establishing himself in power, and imitate Marcus in those that are necessary for maintaining power that is already established and secure

### 3: Regular expressions

In this exercise, we’re continuing with Python’s re module. 
<br>The following can be solved using one or more of these three functions in re:
`search`
`split`
`sub`

Hint: Take a look at the documentation for Python's re module to find solutions, and test your regular expression on regextester.com or consult regex101.com 
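
As a quick orientation before the exercises, here is a minimal sketch of what each of the three functions returns. The example sentence is made up for illustration.

```python
import re

sentence = "Regular expressions are fun, aren't they?"

# re.search returns a match object for the first match, or None if there is none
match = re.search(r'fun', sentence)
print(match.group(), match.start())

# re.split cuts the string wherever the pattern matches and returns a list
parts = re.split(r',\s*', sentence)
print(parts)

# re.sub replaces every match of the pattern with the replacement string
replaced = re.sub(r'fun', 'useful', sentence)
print(replaced)
```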

1. Define a function that can check if a string contains a certain set of characters (for this exercise a-z or A-Z or 0-9), and test your function on some strings to confirm that it works. 

In [6]:
# Defining a function to check whether a string contains letters (upper case or lower case) or numbers

def specific_char(string):
    match = re.search(r'[a-zA-Z0-9]', string)  # Character class a-z, A-Z or 0-9 (no commas inside the brackets, since a comma would be matched literally)
    return bool(match)                          # The function returns a boolean indicating whether the pattern is matched or not.

# Testing function for two test strings
test_string1='ABCDEFabcdef123450'
test_string2='*&%@#!}{'

print(('The string "%s" contains numbers or letters ='%(test_string1)),(specific_char(test_string1)))
print(('\nThe string "%s" contains numbers or letters ='%(test_string2)),(specific_char(test_string2)))


The string "ABCDEFabcdef123450" contains numbers or letters = True

The string "*&%@#!}{" contains numbers or letters = False
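
The function above returns True as soon as one allowed character occurs anywhere in the string. If the intended check is instead that the string consists only of such characters, one possible variant uses `re.fullmatch`. This is a sketch, and the function name `only_specific_chars` is hypothetical.

```python
import re

def only_specific_chars(string):
    """Return True if `string` is non-empty and consists solely of a-z, A-Z or 0-9."""
    # fullmatch requires the whole string to match; + means one or more characters
    return bool(re.fullmatch(r'[a-zA-Z0-9]+', string))

print(only_specific_chars('ABCDEFabcdef123450'))  # letters and digits only
print(only_specific_chars('abc,def'))             # contains a comma
print(only_specific_chars('*&%@#!}{'))            # symbols only
```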


2. Define a function that can check if a string contains an _a_ followed by **zero** or more _b_'s.

Examples:

"ac" is a match

"abc" is a match

"bbc" is not a match

In [7]:
# Defining a function to test if an a is followed by zero or more bs 

def ab_match(text):
    if re.search('ab*', text):  # * matches the preceding token "b" zero or more times
        return 'Found a match!'
    else:
        return 'Not matched!'

print(ab_match("ac"))
print(ab_match("abc"))
print(ab_match("abbc"))
print(ab_match("bbc"))

Found a match!
Found a match!
Found a match!
Not matched!
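
Because `b*` also accepts zero b's, the pattern `ab*` matches any string that contains an a. To see which part of the string was actually matched, the match object can be inspected. A minimal sketch, with `matched_part` as a hypothetical helper name:

```python
import re

def matched_part(pattern, text):
    """Return the substring matched by `pattern` in `text`, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group() if match else None

for text in ["ac", "abc", "abbc", "bbc"]:
    # Quantifiers are greedy: the match includes as many b's as possible
    print(text, '->', matched_part('ab*', text))
```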


3. Define a function that can check if a string contains an _a_ followed by **one** or more _b_'s.

(Now "ac" should no longer be a match!)

In [8]:
# Defining a function to test if an a is followed by one or more bs 

def ab_match(text):
    if re.search('ab+', text):  # + matches the preceding token "b" one or more times
        return 'Found a match!'
    else:
        return 'Not matched!'

print(ab_match("ac"))
print(ab_match("abc"))
print(ab_match("abbc"))
print(ab_match("acb"))

Not matched!
Found a match!
Found a match!
Not matched!


4. Using the sample string ‘The quick brown fox jumps over the lazy dog’, search for the words 'fox', 'dog', 'horse'.

In [9]:

patterns = ['fox', 'dog', 'horse'] # Defining patterns to search for with re
text = 'The quick brown fox jumps over the lazy dog'

for pattern in patterns: # Searching for each of the patterns in the text
    print('Searching for "%s" in "%s" ->' % (pattern, text))
    if re.search(pattern,  text):
        print('Found a match!\n')
    else:
        print('Not Matched!\n')
        

Searching for "fox" in "The quick brown fox jumps over the lazy dog" ->
Found a match!

Searching for "dog" in "The quick brown fox jumps over the lazy dog" ->
Found a match!

Searching for "horse" in "The quick brown fox jumps over the lazy dog" ->
Not Matched!
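
Note that `re.search` is case-sensitive by default. If case should be ignored, the `re.IGNORECASE` flag can be passed, and the match object reports where the match was found. A minimal sketch using the same sample string:

```python
import re

text = 'The quick brown fox jumps over the lazy dog'

# Case-sensitive (default): 'FOX' does not match 'fox', so the result is None
case_sensitive = re.search('FOX', text)
print(case_sensitive)

# With re.IGNORECASE, upper and lower case letters are treated as equal
match = re.search('FOX', text, flags=re.IGNORECASE)
print(match.group(), match.start())  # matched substring and its start index
```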



5. Define a string containing a sentence with the word ‘Road’ in it, and use the re.sub()-function to abbreviate 'Road' as 'Rd.'.

(For example: "The quick brown fox jumps over the lazy dog on Hampton Road" --> "The quick brown fox jumps over the lazy dog on Hampton Rd.")

In [10]:
# Using re.sub to substitute words in text

text = 'The quick brown fox jumps over the lazy dog on Hampton Road'

print(re.sub('Road', 'Rd.', text)) # Using re.sub to indicate which word(s) should be replaced


The quick brown fox jumps over the lazy dog on Hampton Rd.
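
The plain pattern 'Road' also replaces the letters inside longer words. A minimal sketch of a stricter version using the word-boundary anchor `\b`; the example sentence is made up for illustration.

```python
import re

text = 'The Roadrunner stopped on Hampton Road'

# Plain pattern: also rewrites the start of 'Roadrunner'
naive = re.sub('Road', 'Rd.', text)
print(naive)

# \b restricts the match to 'Road' as a whole word
strict = re.sub(r'\bRoad\b', 'Rd.', text)
print(strict)
```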


6. Define a string containing a sentence and perform very simple tokenization by splitting at all whitespaces.

(The result should be a list where each element in the list corresponds to a word from the sentence)

In [11]:
# Tokenizing by splitting at whitespaces

text = 'The quick brown fox jumps over the lazy dog.'

print(re.split(r'\s+', text)) # \s+ matches one or more whitespace characters (spaces, tabs, newlines)

# For this sentence the same result is obtained without regex using text.split()

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']
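
In the result above, the final token still carries the full stop ('dog.'). One possible refinement, sketched here with only `re.sub` and `re.split`, is to strip punctuation before splitting. Whether punctuation should be dropped depends on the analysis, so treat this as an illustration rather than a recommended default.

```python
import re

text = 'The quick brown fox jumps over the lazy dog.'

# Remove every character that is neither a word character (\w) nor whitespace (\s)
cleaned = re.sub(r'[^\w\s]', '', text)

# Split at runs of whitespace; strip() avoids empty tokens at the ends
tokens = re.split(r'\s+', cleaned.strip())
print(tokens)

# Optional: lower-case the tokens so that 'The' and 'the' count as the same word
lowered = [token.lower() for token in tokens]
print(lowered)
```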


7. Define a string containing a sentence and replace whitespaces with an underscore. After, reverse this by replacing underscores with a whitespace.

In [12]:
# Using re.sub() to replace whitespaces and underscores

text = "Let's become regexperts!"
text1 ="Let's_become_regexperts!"

print('Replacing whitespaces with underscores')
print(text, '-->',re.sub(" ", "_", text))

print('\nReplacing underscores with whitespaces')
print(text1, '-->', re.sub("_", " ", text1))

#Alternative solution, not using regex
print('\nAlternative solution')
print(text, '-->', text.replace(' ', '_'))


Replacing whitespaces with underscores
Let's become regexperts! --> Let's_become_regexperts!

Replacing underscores with whitespaces
Let's_become_regexperts! --> Let's become regexperts!

Alternative solution
Let's become regexperts! --> Let's_become_regexperts!


8. Define a string containing a sentence with a few cases of multiple spaces between words and remove all those cases.

In [13]:
text = 'Being      a  regexpert'
print("Original string:\t",text)
print("Without extra spaces:\t",re.sub(' +',' ',text)) # replacing one or more (+) whitespaces with one whitespace


Original string:	 Being      a  regexpert
Without extra spaces:	 Being a regexpert
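
The pattern `' +'` only handles space characters. If the text may also contain tabs or line breaks, or extra whitespace at the start and end, a sketch along these lines could be used instead. The sample string is made up for illustration.

```python
import re

text = '  Being \t a\n\n regexpert  '

# \s+ matches runs of any whitespace (spaces, tabs, newlines); strip() trims both ends
normalized = re.sub(r'\s+', ' ', text).strip()
print(normalized)

# Alternative without regex: split() with no argument splits at any whitespace run
alternative = ' '.join(text.split())
print(alternative)
```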
