# Lab Session 1 - Getting started with Python: working with text (strings), lists and Counters

## Objectives

* **Become familiar with using Jupyter Notebooks** 
    * navigating your notebook server, creating new notebooks and folders, uploading files, etc. 
    * creating a text file
    * using cells - adding, setting type, executing
    * documenting your code with Markdown cells, formatting, using images and links
    * running code cells
     
      
* **Working with text in Python**
    * creating a text object (string) and assigning a named pointer (variable)
    * viewing an object vs using `print()` function
    * using the `len()` function
    * using string methods/functions
    * indexing and slicing
    
    
* **Lists of text**
    * using `split()` on text to create a list of tokens
    * indexing and slicing
    * counting and searching within a list of words
    * using `set()` to get the unique tokens in a list


* **Using `Counter` to create frequency lists**

## 1. Become familiar with using Jupyter Notebooks

### Tasks

* __Using text files__
    * Create an empty text file (New > Text file)
    * Rename it
    * Add some text
    * Save it
    
    
* __Using folders__
    * Create a new folder (New > Folder)
    * Rename it (select checkbox and use rename icon)
    * Move the text file you created into the folder
    * It will be a good idea to have a folder called **data** in your top level directory to keep corpus files
    
    
* __Uploading__
    * Create or find a text file and an image file on your computer
    * Upload them to your Jupyter server space (using the Upload icon)
    
    
    
* __Create a Python notebook__
    * Create a new Python notebook
    * Rename it
    * Play around adding and deleting cells (try using both the menu items and icons but also the keyboard shortcuts)
    * Create a Markdown cell and use Markdown to produce the following:
    
    ![](markdown_ex1.png)

      **N.B.** the horizontal rule between the Level 1 and Level 2 headings
      
      These links could help: 
      
         * https://daringfireball.net/projects/markdown
         * http://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html
      
    * Add a Markdown cell and:
        * Include the image file you uploaded above
        * Add a hyperlink to this news story:
            * https://www.theguardian.com/us-news/2018/jan/17/trump-fake-news-awards-winners
            * While you are at it, copy the text of that news story and save it in a text file in your **data** folder

## 2. Working with text in Python

* Text can be used in Python by creating a string object. A piece of text is surrounded by matching quotes--either
    * single `' some text '` 
    * or double `" some more "`
    * but not `" like this '`

In [39]:
'some text'

'some text'

In [38]:
"some more"

'some more'

In [40]:
"like this'

SyntaxError: EOL while scanning string literal (<ipython-input-40-6d1922882682>, line 1)
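
Having two kinds of quote is handy when the text itself contains one of them. A minimal sketch of the common cases follows. The variable names and example strings are made up for illustration:

```python
# Double quotes outside let you use an apostrophe inside
with_apostrophe = "It's fine"

# Single quotes outside let you use double quotes inside
with_quotes = 'She said "hello"'

# A backslash escapes a quote that matches the outer ones
escaped = 'It\'s fine'

# Triple quotes allow a string to run over several lines
multi_line = '''first line
second line'''

print(with_apostrophe)
print(with_quotes)
print(multi_line)
```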

* To make it easier to access and work on the text data we can assign a _named pointer_ (similar to a variable in other languages but not exactly the same thing).

In [42]:
text = "some text"
text

'some text'

* The name you use for a pointer should remind you (and others who look at your code) what the data is. 
* So 
    * single character names are not a great idea
    * but neither are really long ones
    * the convention is to use lowercase and underscores between words, e.g.
        * `some_text = "abc"`
        * instead of `someText = "abc"`

### TASK

* create or copy 3 or 4 sentences and put them into Python objects with named pointers `sent1`, `sent2`, etc.

* __DISPLAY THEM__

    1. by entering the name and executing the cell
    2. using the `print()` function  
    
  _What is the difference?_
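
One way to explore the difference is to compare `repr()`, which is roughly what the notebook shows when a cell ends with a bare name, with what `print()` writes. This is only a sketch, and `sent1` here is an invented example sentence:

```python
# An illustrative sentence containing a line break
sent1 = "First line\nSecond line"

# repr() gives the representation: quotes included, line break shown as \n
shown = repr(sent1)
print(shown)

# print() writes the text itself: no quotes, and the line break is real
print(sent1)
```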

### TASK

* use the `len()` function on your sentence objects to count the number of characters
* change one of your sentences into all uppercase using the `.upper()` function
* change another into all lowercase using the `.lower()` function

In [44]:
sent="This is a SENTENCE"
sent.lower()

'this is a sentence'

In [45]:
sent

'This is a SENTENCE'

* **N.B.** how the case of the object has not been permanently changed
* _How would you store the result of the case change function?_
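
A possible answer, sketched under the assumption that you want to keep the original as well: string methods return a new string, so give the result its own named pointer. The names `sent_lower` and `sent_upper` are just illustrative:

```python
sent = "This is a SENTENCE"

# .lower() and .upper() return new strings; sent itself is not modified
sent_lower = sent.lower()
sent_upper = sent.upper()

print(sent)
print(sent_lower)
print(sent_upper)
```

Assigning the result back to the same name (`sent = sent.lower()`) also works if you no longer need the original.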

### TASK

* Try using some other string functions like:
    * `.title()`
    * `.swapcase()`
    * `.capitalize()`
    * `.islower()`
    * `.startswith()`
    * `.count()`
    * _What's the difference between_ `.find()` _and_ `.index()` _and_ `.rfind()`?

  
* Use `dir(str)` to get a list of the functions/methods that are available for strings and try and figure out what some of them do and how to use them
* The `help()` function, or putting a `?` before or after a name in a notebook cell, will give you access to the documentation, e.g.
    * `help(sent.islower)`
    * `sent.islower?`
    * `?sent.title`
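
To compare `.find()`, `.index()` and `.rfind()`, it can help to search once for a substring that is present and once for one that is not. A small sketch, where the search strings are chosen only for illustration:

```python
sent = "This is a SENTENCE"

# "is" occurs twice: inside "This" and as the word "is"
first_is = sent.find("is")       # position of the first match
last_is = sent.rfind("is")       # position of the last match
same_as_find = sent.index("is")  # same as .find() when there is a match

# When there is no match, .find() returns -1 ...
missing = sent.find("xyz")

# ... but .index() raises a ValueError instead
raised = False
try:
    sent.index("xyz")
except ValueError:
    raised = True

print(first_is, last_is, same_as_find, missing, raised)
```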

### Indexing

* You can select a character from a string using its index
* To do this use the square bracket notation:
    * `sent[0]` - will get the first character
    * `sent[1]` - the second character
    * `sent[-1]` - the last
    * `sent[-2]` - the second from last

* **N.B.** the first character has an index of **ZERO** not **ONE**

In [52]:
print(sent)
print(sent[1], '<-', 'oh! I thought this would be a T')

This is a SENTENCE
h <- oh! I thought this would be a T


### Slicing

* You can access a sequence of characters in a string using **slicing**
* The syntax is:
    * `sent[start_index:end_index]`

In [53]:
sent[0:2]

'Th'

In [54]:
sent[0:3]

'Thi'

* PRACTICE THIS!!
* **N.B.** the slice includes characters up to, but not including, the character at the end index

In [57]:
# index 3 in "This is a SENTENCE" is the character 's'
sent[3] 

's'

In [59]:
# sent[0:3] should give back 'This' ... right?
sent[0:3]

'Thi'
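
A quick way to check your understanding of the end index: the number of characters in a slice is the end index minus the start index (as long as both are within the string). A minimal sketch:

```python
sent = "This is a SENTENCE"

# To get 'This' the end index has to be 4: indices 0, 1, 2 and 3 are included
first_word = sent[0:4]

# The length of a slice is end minus start
second_word = sent[5:7]
slice_length = len(second_word)

# Because the end is excluded, two slices that meet at 4 rebuild the string
rejoined = sent[0:4] + sent[4:18]

print(first_word, second_word, slice_length)
```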

### TASKS 
* What happens when you leave the first number blank?
    * e.g. `sent[:4]`
* What happens when you leave the second number blank?
    * e.g. `sent[4:]`
* There is a third number you can include when slicing
    * e.g. `sent[0:10:2]` 
    
  _What does it do?_
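
If you want to check your answers, the sketch below runs each variant on the example sentence. The variable names are illustrative only:

```python
sent = "This is a SENTENCE"

# A blank start means "from the beginning"
start_omitted = sent[:4]

# A blank end means "to the end of the string"
end_omitted = sent[4:]

# The third number is a step: take every 2nd character from index 0 up to 10
every_second = sent[0:10:2]

# A negative step walks backwards; with both ends blank it reverses the string
reversed_sent = sent[::-1]

print(start_omitted)
print(end_omitted)
print(every_second)
print(reversed_sent)
```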

## 3. Working with lists of text (tokens)

* We can use the `.split()` function to **tokenize** (or split into **tokens**) a piece of text

In [61]:
text2='''
Donald Trump, who has routinely peddled conspiracy theories and mistruths from the office of the presidency, sought to question the accuracy of the media on Wednesday by unveiling the so-called “Fake News Awards”.

The president used his preferred medium of Twitter to announce “the winners”, which ranged from minor errors by journalists on social media to news reports that later invited corrections, with the New York Times and CNN the most frequently named.
'''

In [63]:
tokens=text2.split()

In [64]:
tokens

['Donald',
 'Trump,',
 'who',
 'has',
 'routinely',
 'peddled',
 'conspiracy',
 'theories',
 'and',
 'mistruths',
 'from',
 'the',
 'office',
 'of',
 'the',
 'presidency,',
 'sought',
 'to',
 'question',
 'the',
 'accuracy',
 'of',
 'the',
 'media',
 'on',
 'Wednesday',
 'by',
 'unveiling',
 'the',
 'so-called',
 '“Fake',
 'News',
 'Awards”.',
 'The',
 'president',
 'used',
 'his',
 'preferred',
 'medium',
 'of',
 'Twitter',
 'to',
 'announce',
 '“the',
 'winners”,',
 'which',
 'ranged',
 'from',
 'minor',
 'errors',
 'by',
 'journalists',
 'on',
 'social',
 'media',
 'to',
 'news',
 'reports',
 'that',
 'later',
 'invited',
 'corrections,',
 'with',
 'the',
 'New',
 'York',
 'Times',
 'and',
 'CNN',
 'the',
 'most',
 'frequently',
 'named.']

### TASKS 

* Read about the `.split()` function using the `help()` function
* Assign a named pointer to the result of using the function
    * `tokens = "This is my sentence of words and more words".split()`
* Now you have a **LIST** object which is a sequence of objects
* You can index and slice a list. Practice doing this on your list of tokens.
* Use `dir(list)` to find out what specific functions can be used on a list object and try some out.
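
Indexing and slicing work on lists the same way as on strings, except that the items are whole tokens rather than characters. A short sketch using the sentence from the task follows. The choice of `append()` and `sort()` is only an example of list methods, and `set()` is included because it is mentioned in the objectives:

```python
tokens = "This is my sentence of words and more words".split()

first_token = tokens[0]    # first item
last_token = tokens[-1]    # last item
first_three = tokens[:3]   # a slice of a list is itself a list

# Unlike string methods, many list methods change the list in place,
# so work on a copy if you want to keep the original order
tokens_copy = list(tokens)
tokens_copy.append("again")
tokens_copy.sort()

# set() keeps one copy of each distinct token (order is not preserved)
unique_tokens = set(tokens)

print(first_token, last_token, first_three)
print(len(tokens), len(unique_tokens))
```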

### TASKS 

* Take the two paragraphs from the article above (or another one that interests you) and turn them into a string with a named pointer. (You can use a whole text if you are feeling ambitious)
* Turn your string into lowercase and create a list of tokens
* Use the `.count()` function to tabulate how often common words like _the_ and _on_ occur
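* Use the `.index()` function to find the location of all the instances of a common word like _and_ (note that `.index()` returns only the first match, so you will need its optional start argument or a loop)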
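
One possible approach to these tasks is sketched below on a short made-up sentence rather than the article text. The `enumerate()` loop is just one way of collecting every position:

```python
# A made-up example text, not taken from the article
text = "The cat and the dog and the bird sat on the mat"

# Lowercase first so that 'The' and 'the' are counted together
tokens = text.lower().split()

the_count = tokens.count("the")

# .index() gives the first position only ...
first_and = tokens.index("and")
# ... unless you tell it where to start searching
second_and = tokens.index("and", first_and + 1)

# enumerate() pairs each token with its position, so all matches can be kept
and_positions = [i for i, token in enumerate(tokens) if token == "and"]

print(the_count, first_and, second_and, and_positions)
```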

## 4. Using `Counter` to create frequency lists

### TASKS 

* Take your list of tokens and create a frequency list using the `Counter` object


* List the most common words


* If you haven't already been working with a whole text, load in the contents of one of your text files with the `open()` and `.read()` functions:
    * `a_long_text = open('data/your_text_file.txt').read()`


* Tokenize your text


* Create a frequency list and investigate
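
The steps above can be sketched end to end as follows. So that the sketch runs anywhere, it first writes a small stand-in file to a temporary folder. The file name `sample.txt` and its contents are invented, and in the lab you would open your own file in **data** instead. The `with` statement is an alternative to the one-line `open(...).read()` that also closes the file for you:

```python
import os
import tempfile
from collections import Counter

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for data/your_text_file.txt (hypothetical contents)
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("The awards and the winners and the media")

    # Read the whole file into one string; the file is closed automatically
    with open(path, encoding="utf-8") as f:
        a_long_text = f.read()

# Tokenize (lowercasing first) and build the frequency list
tokens = a_long_text.lower().split()
freq = Counter(tokens)

print(freq.most_common(3))
```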

In [69]:
from collections import Counter

In [70]:
freq = Counter()

In [71]:
freq.update(tokens)

In [72]:
freq.most_common(10)

[('the', 7),
 ('of', 3),
 ('to', 3),
 ('and', 2),
 ('from', 2),
 ('media', 2),
 ('on', 2),
 ('by', 2),
 ('Donald', 1),
 ('Trump,', 1)]

In [73]:
freq['the']

7

In [75]:
freq.get('to')

3
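
When investigating a frequency list it is worth knowing how a `Counter` behaves for words it has not seen. The sketch below uses a tiny made-up token list. It also shows that a list can be passed straight to `Counter()`, which gives the same counts as creating an empty one and calling `.update()`:

```python
from collections import Counter

# A tiny made-up token list for illustration
tokens = "the cat and the hat".split()

# Passing the list directly is equivalent to Counter() then .update(tokens)
freq = Counter(tokens)

# Square brackets give 0 for an unseen word ...
missing_by_index = freq["dog"]
# ... while .get() is the ordinary dictionary method and gives None
missing_by_get = freq.get("dog")

total_tokens = sum(freq.values())   # number of tokens counted
distinct_tokens = len(freq)         # number of different tokens

print(missing_by_index, missing_by_get, total_tokens, distinct_tokens)
```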