# Welcome

For this workshop, we've set up a coding environment, with the necessary tools and data already preloaded. The interface that you're using is called Jupyter - which allows you to interact with code in your browser, though a format called a 'notebook'. In addition to being in the browser, Jupyter is pleasant because it's interactive, so you can interact with your code. The traditional way of running code involves writing a script and running the whole thing; the interactive approach used by Jupyter is better for data analysis, because you explore, tinker, and converse with your data.

In this workshop, we're working in the browser, with Jupyter, Python and all the corresponding data installed on somebody else's computer, but it can be run on your own computers. While you're in the browser, keep in mind that your custom code is ephemeral - it won't stay saved for the long term.

Welcome to a Jupyer notebook. Let's get comfortable with Jupyter before we dive into the fun bits. 

Jupyter is one of many execution environtments for Python and R. It allows you to run code in the cloud, without having to install dependencies and all the other stuff.

"Script-like" execution means that you write down all the code in a file, and that entire file is run in order.

"Interpreter-like" execution means that you type in commands one at a time. The session pauses after each, waiting for the next command. This is really similar to how the command line is run.

Jupyter is a hybrid of both those things. Notebooks are composed of cells. Then the cells are executed (almost like mini scripts). This gives you the advantage of keeping the session alive so you don't have to repeat loading data, etc., and the advantage of being able to execute multiple lines of code at the same time. It's also really easy to iterate through problems in a Jupyter environment — try something, meet the error head on, resolve it, and continue, cell by cell.

Jupyter is extremely powerful, but there are a few traps.

Let's get comfortable with cells first.

# Basics

Let's spend some time getting to know basic Python features. 

The print() function asks the computer to return a specific value (literally asking the computer to print what you want to the terminal). 

Click on the code chunk below and press SHIFT + ENTER. 

In [85]:
print("This is a string. It's being returned to your console. Cool!")

This is a string. It's being returned to your console. Cool!


SHIFT + ENTER executes a chunk of code. Try changing the text inside the quotation marks and press SHIFT + ENTER again. The output will be rendered in the subsequent chunk. Try it again in the following chunk.

In [86]:
name = "John Smith"
age = "27"
print(name + ', ' +  age)

John Smith, 27


The previous chunk of code uses variables. Variables represent other things (in this case, the variable 'name' is a standin for "John Smith"; the variable 'age' is a standin for "27") The advantage of variables is that they can be assigned once and used multiple times throughout. They can also be reassigned.

Run the final chunk below.

In [87]:
name = input('What is your name?: ')
age = input('What is your age?: ')
city = input('Where are you from?: ')

print('Your name is: ' + name + '. You are ' + age + ' years old.' + ' You are from ' + city + '.')

What is your name?: Cal
What is your age?: 27
Where are you from?: Windsor
Your name is: Cal. You are 27 years old. You are from Windsor.


Can you explain what happened in the previous chunk? 

If you said that we are saving an input to a variable and then requesting the computer to return those variables, you got it! 

# Importing our tools

Okay, so we now have a bit of an understanding about how Python works. While Python is a powerful language that can do many things, developers have made it easier for us to carry out specific tasks by creating 'libraries' that enable us to carry out tasks, like language processing, more easily. 

These libraries are additions to vanilla Python, so we have to import them. Press SHIFT + ENTER in the following chunk to import a few libraries that we'll be using throughout this lesson.


In [35]:
import matplotlib.pyplot as plt # visualizing library
import nltk # natural language processing library
import textify # text cleaning library

Great. Now we have the tools that we need to carry out some basic text analysis and visualize it. By the way, the "#" above means that those lines are commented out — they're just there to explain what we're doing. 

# Importing our data

Now that we have our tools we can get to work, but first, we need to import our data. The following lines of code  will load *OUR SAMPLE* data into a variable called data. If you want to use your own data for this lesson, let us know.

In [57]:
text = open('dss-sample-data.txt','r').read() # change 'dss-sample-data.txt' to the name of your sample data.
print(text[:1000])

ADVICE TO YOUNG MEN ON THEIR DUTIES AND CONDUCT IN LIFE

BY T. S. ARTHUR, AUTHOR OF "THE MAIDEN, " "WIFE," AND "MOTHER."



Phillips, Sampson & Company's Publications.

ADVICE TO YOUNG LADIES ON THEIR DUTIES AND CONDUCT IN LIFE. By T. S. ARTHUR. Price 75 cents.

Right modes of thinking are the basis of all correct action. It is from this cause that we shall, in addressing our young friends on their duties and conduct in life, appeal at once to their rational faculty. To learn to think right is, therefore, a matter of primary concern. If there be right modes of thinking, right actions will follow as a natural consequence.--Extract from the author's introduction.

ADVICE TO YOUNG MEN ON THEIR DUTIES AND CONDUCT IN LIFE. By. T. S. ARTHUR. Price 75 cents.

The aim of the author of this volume has been to lead young men to just conclusions, from reflections upon what they are, and what are their duties in society as integral parts of the common body. Satisfied that those who read it as it s

This line of code assigns the sample data to the variable 'data'. This is cool, because we can now analyze our text without having to type 'open('dss-sample-data.txt', 'r')' every time we want to do something with it. That's one of the cool things about variables.

Let's first review our text document to make sure it's what we want. In pseudo-code, the following chunk asks the computer to return the text of the data file. Then we save that output to a variable so that we can call on it whenver we want.


Press SHIFT + ENTER in the following chunk.

Great. It's displaying properly, which means that we have successfully loaded the file to the data variable. 

# Cleaning

Our text is pretty clean (it was transcribed and annotated by hand, so it better be!) but before we work on it we need to do a few things. Notice how there are random characters, like \n\n (new line symbols)? These are all hidden characters that you don't see when you're typing, but that exist to format a document to your liking. We need to get rid of them, as well as a few other things:

1) Remove capitalizations
2) Remove punctuation
3) Remove numbers
4) Apply stopwords

We'll do that by using one of the tools (libraries) we imported previously! Textify! There are a lot of different ways to do this work, and there are better (more thorough) ways to do it, but textify makes it really easy for us. Execute the chunk of code below by pressing SHIFT + ENTER.

In [64]:
from textify import TextCleaner # we import a particular function from the library, Textify
textcleaner = TextCleaner() # 
textcleaner.text = text # Taking the text and cleaning it with the utility function
cleantext = textcleaner.clean_text() # printing the first 1000 characters from the above.
print(cleantext[0:1000])

advice to young men on their duties and conduct in lifeby t s arthur author of the maiden  wife and motherphillips sampson  companys publicationsadvice to young ladies on their duties and conduct in life by t s arthur price  centsright modes of thinking are the basis of all correct action it is from this cause that we shall in addressing our young friends on their duties and conduct in life appeal at once to their rational faculty to learn to think right is therefore a matter of primary concern if there be right modes of thinking right actions will follow as a natural consequenceextract from the authors introductionadvice to young men on their duties and conduct in life by t s arthur price  centsthe aim of the author of this volume has been to lead young men to just conclusions from reflections upon what they are and what are their duties in society as integral parts of the common body satisfied that those who read it as it should be read cannot fail to have their good purposes strengt

Notice what happened? The Textify library manipulated your text. Name a few of the changes that you notice. 

Your text can definitely be more clean, but for now, this is good enough. Let's visualize word frequencies.

# Word frequencies

Our clean text is ready to go (somewhat normalized). We need to split up this mass of prose into individual words, or tokens, and then count the number of times they appear. Let's do that next.

In [66]:
# split into words
from nltk.tokenize import word_tokenize  # import a particular library and specific functions
tokens = word_tokenize(cleantext) # setting the output of word_tokenize() to the variable tokens
print(words[0:500]) # printing the first 1000 tokenized words!

['advice', 'to', 'young', 'men', 'on', 'their', 'duties', 'and', 'conduct', 'in', 'lifeby', 't', 's', 'arthur', 'author', 'of', 'the', 'maiden', 'wife', 'and', 'motherphillips', 'sampson', 'companys', 'publicationsadvice', 'to', 'young', 'ladies', 'on', 'their', 'duties', 'and', 'conduct', 'in', 'life', 'by', 't', 's', 'arthur', 'price', 'centsright', 'modes', 'of', 'thinking', 'are', 'the', 'basis', 'of', 'all', 'correct', 'action', 'it', 'is', 'from', 'this', 'cause', 'that', 'we', 'shall', 'in', 'addressing', 'our', 'young', 'friends', 'on', 'their', 'duties', 'and', 'conduct', 'in', 'life', 'appeal', 'at', 'once', 'to', 'their', 'rational', 'faculty', 'to', 'learn', 'to', 'think', 'right', 'is', 'therefore', 'a', 'matter', 'of', 'primary', 'concern', 'if', 'there', 'be', 'right', 'modes', 'of', 'thinking', 'right', 'actions', 'will', 'follow', 'as', 'a', 'natural', 'consequenceextract', 'from', 'the', 'authors', 'introductionadvice', 'to', 'young', 'men', 'on', 'their', 'duties', '

In [67]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/MurguDev/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [82]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [83]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]
print(words[:100])

['ADVICE', 'TO', 'YOUNG', 'MEN', 'ON', 'THEIR', 'DUTIES', 'AND', 'CONDUCT', 'IN', 'LIFE', 'BY', 'ARTHUR', 'AUTHOR', 'OF', 'THE', 'MAIDEN', 'WIFE', 'AND', 'MOTHER', 'Phillips', 'Sampson', 'Company', 'Publications', 'ADVICE', 'TO', 'YOUNG', 'LADIES', 'ON', 'THEIR', 'DUTIES', 'AND', 'CONDUCT', 'IN', 'LIFE', 'By', 'ARTHUR', 'Price', 'cents', 'Right', 'modes', 'thinking', 'basis', 'correct', 'action', 'It', 'cause', 'shall', 'addressing', 'young', 'friends', 'duties', 'conduct', 'life', 'appeal', 'rational', 'faculty', 'To', 'learn', 'think', 'right', 'therefore', 'matter', 'primary', 'concern', 'If', 'right', 'modes', 'thinking', 'right', 'actions', 'follow', 'natural', 'Extract', 'author', 'introduction', 'ADVICE', 'TO', 'YOUNG', 'MEN', 'ON', 'THEIR', 'DUTIES', 'AND', 'CONDUCT', 'IN', 'LIFE', 'By', 'ARTHUR', 'Price', 'cents', 'The', 'aim', 'author', 'volume', 'lead', 'young', 'men', 'conclusions', 'reflections']
