# Fundamentals of Social Data Science 

Week 1. Day 2. Exercises from Chapter 2 of FSStDS 

Within your study pod, discuss the following questions. Please submit an individual assignment on Canvas by 12:30pm tomorrow, Wednesday October 12, 2022.

These will not be marked and are solely for recordkeeping and review upon request. They will, however, be discussed in the Wednesday tutorial and briefing.

In [101]:
import numpy as np
import pandas as pd
import requests 

# Alice's Adventures in Text

Since Alice's Adventures in Wonderland is in the public domain, we can rightly access it and use it as data with few issues. Below is a code snippet that will download a version of AAiW, load the result as a string, and split that string (on whitespace, which is the default) into a list of words. The rest is up to you. See the questions below.

In [102]:
import requests 

alice_path = "https://www.gutenberg.org/files/11/11-0.txt"

req = requests.get(alice_path)

text = req.content.decode("utf-8") 

words = text.split()

# Block 1. Simple descriptive reporting

First, let's turn that list `words` into a Series and then compute some descriptives on it.

In [103]:
# 1a. How many words are in this text file? 
# 1b. How many unique words are in this text file? 
# 1c. What are the top ten words? Note, you can get this from a value_counts(), but display(ser.value_counts()) will not print just those ten words. How will you find those words and display them? (Hint - a value_counts() itself returns a Series).

alice_ser = pd.Series(words)
len_ser = len(alice_ser)
alice_ser_ucount = len(alice_ser.unique())
top10_words = ', '.join(alice_ser.value_counts().sort_values().tail(10).index)

print(f"There are {len_ser} words in this file.")
print(f"There are {alice_ser_ucount} unique words in this file.")
print(f"The top ten words are as follows: {top10_words}")

There are 29594 words in this file.
There are 5981 unique words in this file.
The top ten words are as follows: was, it, in, said, she, of, a, to, and, the
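
Because `value_counts()` returns a Series sorted from most to least frequent, the words themselves sit in its index. A minimal sketch on a made-up word list (the names `toy_words`, `toy_counts` and `top2_words` are illustrative and not part of the exercise):

```python
import pandas as pd

# Hypothetical toy data standing in for the list of words from the book
toy_words = ["the", "cat", "the", "hat", "the", "cat"]
toy_counts = pd.Series(toy_words).value_counts()

# value_counts() sorts in descending order by default, so head(n) gives the
# n most frequent values; the words are the index, the counts are the values
top2_words = list(toy_counts.head(2).index)
print(top2_words)  # ['the', 'cat']
```

Using `head(10)` lists the words from most to least frequent, whereas `sort_values().tail(10)` lists the same words in the opposite order.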


## Block 1a. Masking and filtering 

So instead of reporting on all the words, let's B more selective. Filter the words down to those that have the letter 'b' in them. Remember to check for case sensitivity.

In [104]:
# 1d. How many 'b' words are in the file? 
# 1e. What percent of words in the file contain the letter 'b'?

# Step 1. Keep only the words containing 'b' using map-lambda (or other ways).
#         Note: 'b' in x is case-sensitive, so words with only a capital 'B'
#         are not counted; use 'b' in x.lower() to include them.
bmask = alice_ser.map(lambda x: x if 'b' in x else float('NaN')).dropna()

# Step 2. Get length of filtered Series
bword_count = len(bmask)
bword_percent = bword_count / len(alice_ser) 

print(f"There are {bword_count} words with a lower-case 'b' in them (case-sensitive)")

print(f"Of the words in the Series, {bword_percent:0.1%} have a lower-case 'b'")



There are 1550 words with a lower-case 'b' in them (case-sensitive)
Of the words in the Series, 5.2% have a lower-case 'b'
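
The same filter can also be written as a Boolean mask, which keeps the original index and makes the case handling explicit. This is a sketch on made-up data (`toy_ser`, `has_b` and `has_b_any_case` are illustrative names); whether to lower-case first is a choice the question leaves to you.

```python
import pandas as pd

# Hypothetical toy data, not taken from the book
toy_ser = pd.Series(["Bat", "rabbit", "Alice", "bottle", "tea"])

# Case-sensitive mask: only a lower-case 'b' counts
has_b = toy_ser.map(lambda w: "b" in w)

# Case-insensitive mask: lower-case each word before testing
has_b_any_case = toy_ser.map(lambda w: "b" in w.lower())

# Summing a Boolean Series counts the True values; the mean is the share
print(has_b.sum(), has_b_any_case.sum())    # 2 3
print(f"{has_b_any_case.mean():0.1%}")      # 60.0%

# Indexing with the mask returns only the matching words
print(list(toy_ser[has_b_any_case]))        # ['Bat', 'rabbit', 'bottle']
```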


# Block 2. Using Map and lambda to clean the series

Create a new series `alice_ser_c` which is a cleaned version of the series. To clean it, for each word,
- Transform it into lower case 
- Remove punctuation from the front and the rear of the word. This can be done in many ways. Some are rolled into the Natural Language Toolkit, but I would like us to think about how to do this by hand. You can try the code from https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string but note that it strips **all** punctuation, so "I'd" and "id" would be treated as the same word. The answer below gives an example of 'the hard way'. Give yourself some limited time, try a simple way, and then check the answer.
- Return to the data and provide some new, comparative, descriptive statistics. 
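
For the 'simple way', one option is `str.strip` with `string.punctuation`, which removes punctuation only from the two ends of a word and leaves internal apostrophes alone. This is a sketch rather than the answer given below. Note that `string.punctuation` covers ASCII punctuation only, so the `extra` string of curly quotes and a long dash is an assumption about which other characters might appear in the file.

```python
import string
import pandas as pd

# ASCII punctuation plus a few typographic marks (assumed, not verified
# against the file): curly single/double quotes and the em dash
extra = "\u2018\u2019\u201c\u201d\u2014"
strip_chars = string.punctuation + extra

# Hypothetical toy words; "\u201c" is an opening curly double quote
toy_ser = pd.Series(["\u201cI'd", "Rabbit-Hole,", "(said", "Alice!)"])

# Lower-case first, then strip the listed characters from both ends only
cleaned = toy_ser.map(lambda w: w.lower().strip(strip_chars))
print(list(cleaned))  # ["i'd", 'rabbit-hole', 'said', 'alice']
```

A word made up entirely of punctuation becomes an empty string under this approach, so those rows would still need to be filtered out afterwards.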

In [105]:
import re

In [106]:
# 2a. Create a map statement that will 
#     return a new series with all letters in lower case.
#     Report on the unique words now. 
# 2b. [Challenge] Create a map statement that will return the lower case word 
#     without punctuation at the front and back of the word.
#     With punctuation stripped and lower case applied, report on unique words
# 2c. Compare the top ten words in this new Series with the original
#     What words appear in the new top ten? What might explain this?

def strip_punct(text):
    '''Trims text to the span from its first to its last ASCII letter.
       Returns NaN if the text contains no letters.'''

    matches = re.finditer('[a-zA-Z]', text)
    
    try: 
        first = next(matches)
        *_, last = matches
    except StopIteration:
        # in this case there were no matches
        return float('NaN')
    except ValueError:
        # in this case there was only one match, so unpacking
        # the remaining (empty) iterator failed
        return text[first.start():first.end()]
        
    return text[first.start():last.end()]

    
alice_ser_lower = alice_ser.map(lambda x: x.lower())
alice_sl_ucount = len(alice_ser_lower.unique())
alice_ser_c = alice_ser.map(lambda x: strip_punct(x.lower()))

alice_ser_c.dropna(inplace=True) # Get rid of rows where there were no
                                 # alphabetic characters (a-z)

alice_sc_ucount = len(alice_ser_c.unique())
c_top10_words = ', '.join(alice_ser_c.value_counts().sort_values().tail(10).index)

print(f"Once converted to lower case, there are {alice_sl_ucount} unique words in the Series")
print(f"Once punctuation is stripped, there are {alice_sc_ucount} unique words in the Series") 

print(f"In the cleaned Series the top ten most common words are:\n{c_top10_words}") 

Once converted to lower case, there are 5650 unique words in the Series
Once punctuation is stripped, there are 3226 unique words in the Series
In the cleaned Series the top ten most common words are:
in, you, said, she, it, of, a, to, and, the


In [107]:
# 2d. Now that we have a cleaner data set, let's compare word lengths.
#     In this cleaned Series, are the words with 'b' on average longer or shorter
#     than the words without a 'b'? 

# Step 1. Get mask for b words - get new series

hasb_ser = alice_ser_c.map(lambda x: x if 'b' in x else float('NaN')).dropna()

# Step 1a. Get new series with the length of these words
len_hasb_ser = hasb_ser.map(len) # A new series of lengths of these words

# Step 1b. Get the average length 
avg_len_hasb = len_hasb_ser.mean()

# Step 2. Get mask for non-b words - get new series
notb_ser = alice_ser_c.map(lambda x: float('NaN') if 'b' in x else x).dropna()

# Step 2a. Get new series with the length of these words
len_notb_ser = notb_ser.map(len) # A new series of lengths of these words

# Step 2b. Get the average length
avg_len_nonb = len_notb_ser.mean()

bword_longer = avg_len_hasb > avg_len_nonb # A Boolean representing if b words are longer on avg.

print(f"It is {bword_longer} that words with 'b' are longer on average.",
      f"'b' words are an average {avg_len_hasb:0.1f} characters in length.",
      f"The others are an average {avg_len_nonb:0.1f} characters in length.")

It is True that words with 'b' are longer on average. 'b' words are an average 5.7 characters in length. The others are an average 4.1 characters in length.
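
The two passes above can also be written with a single Boolean mask and its negation, which avoids building two separate filtered Series by hand. A sketch on made-up words (`toy_c`, `contains_b`, `avg_b` and `avg_not_b` are illustrative names):

```python
import pandas as pd

# Hypothetical cleaned words, not taken from the book
toy_c = pd.Series(["rabbit", "bottle", "tea", "cat", "a"])

lengths = toy_c.map(len)                     # length of every word
contains_b = toy_c.map(lambda w: "b" in w)   # True where the word has a 'b'

# Select lengths with the mask, and with its negation (~) for the rest
avg_b = lengths[contains_b].mean()
avg_not_b = lengths[~contains_b].mean()

print(f"{avg_b:0.1f} vs {avg_not_b:0.1f}")   # 6.0 vs 2.3
```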


# Block 3. From data back to phenomena 

We took the data from Project Gutenberg as is and then acted upon it as if it were Alice's Adventures in Wonderland. But is that true? Look at the file in a browser. It has a header, author names, details at the bottom, etc. Further, you might notice that there is at least one kind of punctuation that links two separate words but has no space in between. 
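
One candidate for that kind of punctuation is a long dash, which can join two words with no space so that a plain `split()` treats them as one token. A sketch on a made-up sentence, assuming the character is the Unicode em dash (U+2014); check the file to see which character it actually uses.

```python
# A made-up sentence; "\u2014" is the em dash character
sample = "Alice was tired\u2014very tired\u2014of sitting"

# Splitting on whitespace leaves the dash-joined words stuck together
naive = sample.split()

# Replace the dash with a space before splitting so the words separate
fixed = sample.replace("\u2014", " ").split()

print(len(naive), len(fixed))  # 5 7
```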

This is partly a question for discussion: how can we clean the data so that the string which is converted to a Series represents the text of the book and not the front and back matter? Later on we will meet techniques called regular expressions which might help, but I think we can get away with stripping that text without regex. 

Below, propose an approach that will strip out the top and bottom text from the .txt file. Consider an approach that is not specific to this one text file but general to text files from Project Gutenberg. For example, would it work with the text of Wilde's The Picture of Dorian Gray? See: https://www.gutenberg.org/cache/epub/174/pg174.txt

In [108]:
# 3a. How many words have we removed by stripping out the license?
#     Note, this assumes the same simple split by space

def remove_gutenberg_license(text):
    '''Removes the licence text from the top and bottom of files 
       downloaded from Project Gutenberg''' 

    lines = pd.Series(text.split('\n'))
    line_bookends = lines.map(lambda x: x[0:9] == '*** START' or x[0:7] == '*** END')
    start_end = line_bookends[line_bookends].index
    
    return '\n'.join(lines[(start_end[0] + 1):start_end[1]])

alice_path = "https://www.gutenberg.org/files/11/11-0.txt"
doriangray_path = "https://www.gutenberg.org/cache/epub/174/pg174.txt"

req = requests.get(alice_path)
text = req.content.decode("utf-8")
new_text = remove_gutenberg_license(text)
less_words_alice = len(text.split()) - len(new_text.split())

req = requests.get(doriangray_path)
text = req.content.decode("utf-8")
new_text = remove_gutenberg_license(text)
less_words_dg = len(text.split()) - len(new_text.split())

print(f"By removing the license from the text we process, we have removed {less_words_alice} words from the Alice file prior to processing.")
print(f"By removing the license from the text we process, we have removed {less_words_dg} words from the Dorian Gray file prior to processing.")

By removing the license from the text we process, we have removed 3069 words from the Alice file prior to processing.
By removing the license from the text we process, we have removed 3048 words from the Dorian Gray file prior to processing.
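
The function above assumes that the file contains exactly one `*** START` line followed by one `*** END` line, and raises an error if either is missing. A sketch of a more defensive variant without pandas follows. The name `strip_gutenberg_matter` is hypothetical, the marker prefixes are taken from the code above, and whether every Project Gutenberg file uses them is an assumption.

```python
def strip_gutenberg_matter(text):
    """Return the lines between the first '*** START' line and the next
    '*** END' line, or the original text if either marker is missing."""
    lines = text.splitlines()

    # Index of the first line that opens the book body, if any
    start = next((i for i, ln in enumerate(lines)
                  if ln.startswith("*** START")), None)
    if start is None:
        return text

    # Index of the first closing marker after the opening one, if any
    end = next((i for i, ln in enumerate(lines)
                if i > start and ln.startswith("*** END")), None)
    if end is None:
        return text

    return "\n".join(lines[start + 1:end])


# Made-up miniature file with the same marker layout
toy = ("Header line\n*** START OF THE BOOK ***\nChapter I\nThe end\n"
       "*** END OF THE BOOK ***\nLicense")
print(strip_gutenberg_matter(toy))
```

Returning the text unchanged when a marker is missing is one design choice; raising an informative error would be another, and may be preferable if silently keeping the license text would distort the word counts.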
