<div align="center">
    <h1><a href="index.ipynb">Knowledge Discovery in Digital Humanities</a></h1>
</div>

<div align="center">
    <h2>Class 07. Python IV: Files</h2>
    <img src="img/python.png" width="300">
</div>

###Table of contents

- [Object-oriented programming](#Object-oriented-programming)
- [Text files](#Text-files)
- [CSV files](#CSV-files)
- [Web files](#Web-files)

###Object-oriented programming

####Classes
- A class is the representation of an idea or a concept
- A class is a user-defined type
- A class is defined by features that are common to all objects that belong to the class: attributes and methods
- Syntax:
```
class ClassName(SuperclassName):
    attributes and methods
```

Example:

In [1]:
class Rectangle:
    def __init__(self, base, height):
        self.base = base
        self.height = height
    
    def area(self):
        return self.base * self.height

####Objects
- An object is an instance of the class
- An object is a concrete element that belongs to a class of objects
- Syntax:
```
object_name = ClassName(arguments)
```

Example:

In [2]:
r = Rectangle(8, 5)

####Attributes
- Data
- Take specific values for each object
- Examples: `base`, `height`
- Syntax:
```
object.attribute
```

Example:

In [3]:
r.base

8

####Methods
- Functions that operate with attributes
- Get different results for each object
- Syntax:
```
object.method(arguments)
```

Example:

In [4]:
r.area()

40

*Note: For the purposes of this course, it is not necessary to know how to define classes, but it is necessary to know how to use them.*
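
To illustrate how attributes take specific values for each object and how methods return different results for each object, here is a minimal sketch that reuses the `Rectangle` class from the example above. The object names `small` and `large` are made up for this illustration, and the single-argument `print(...)` form is used so that the cell runs on Python 3 as well as Python 2.

```python
class Rectangle:
    def __init__(self, base, height):
        self.base = base
        self.height = height

    def area(self):
        return self.base * self.height

# Two objects of the same class, each with its own attribute values
small = Rectangle(2, 3)
large = Rectangle(8, 5)

# The same method returns a different result for each object
print(small.area())   # 6
print(large.area())   # 40

# Changing an attribute of one object does not affect the other
small.base = 4
print(small.area())   # 12
print(large.base)     # 8
```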

###Text files

A text file is a sequence of characters stored on a permanent medium such as a hard drive or flash memory.

####Opening files
- Built-in `open` function

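#####Read mode, `'r'`
`'r'` or nothing as the second parameter of `open`. The file must already exist.

#####Write mode, `'w'`
`'w'` as the second parameter of `open`. The file is created if it does not exist, and any existing content is erased.

#####Append mode, `'a'`
`'a'` as the second parameter of `open`. The file is created if it does not exist, and new content is added after the existing content.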

Examples:

In [5]:
f = open('data/knowledge_wikipedia.txt')
f

<open file 'data/knowledge_wikipedia.txt', mode 'r' at 0x7ffb8f54bc00>

In [6]:
fr = open('data/knowledge_wikipedia.txt', 'r')
fr

<open file 'data/knowledge_wikipedia.txt', mode 'r' at 0x7ffb8f54bc90>

In [7]:
fw = open('data/new_file.txt', 'w')
fw

<open file 'data/new_file.txt', mode 'w' at 0x7ffb8f54bd20>

In [8]:
fa = open('data/new_file.txt', 'a')
fa

<open file 'data/new_file.txt', mode 'a' at 0x7ffb8f54bdb0>
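
The value returned by `open` is an object, so it has attributes of its own. The following sketch inspects three of them (`mode`, `name`, and `closed`). It assumes a scratch file in a temporary directory (the name `modes_demo.txt` is hypothetical) so that the course data is not touched, and it uses the single-argument `print(...)` form so that it runs on Python 3 as well as Python 2.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

fw = open(path, 'w')      # write mode creates the file
print(fw.mode)            # w
print(fw.closed)          # False
fw.close()
print(fw.closed)          # True

f = open(path)            # no second argument: read mode
print(f.mode)             # r
print(f.name == path)     # True
f.close()
```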

####Reading files
1. Open the file in *read* mode
2. Read the file:
    - Use the `read` method: it returns the whole content of the file as a single string
    - Iterate over the file: it reads the file line by line

Examples:

In [9]:
f = open('data/knowledge_wikipedia.txt')
f.read()

'Knowledge is a familiarity, awareness or understanding of someone or something,\nsuch as facts, information, descriptions, or skills, which is acquired through\nexperience or education by perceiving, discovering, or learning. Knowledge can\nrefer to a theoretical or practical understanding of a subject. It can be\nimplicit (as with practical skill or expertise) or explicit (as with the\ntheoretical understanding of a subject); it can be more or less formal or\nsystematic.[1] In philosophy, the study of knowledge is called epistemology;\nthe philosopher Plato famously defined knowledge as "justified true belief",\nthough "well-justified true belief" is more complete as it accounts for the\nGettier problems. However, several definitions of knowledge and theories to\nexplain it exist.\n'

Each `'\n'` is a *newline* character, equivalent to pressing *Enter* on a keyboard to start a new line.

In [10]:
f = open('data/knowledge_wikipedia.txt')
for line in f:
    print line.strip()

Knowledge is a familiarity, awareness or understanding of someone or something,
such as facts, information, descriptions, or skills, which is acquired through
experience or education by perceiving, discovering, or learning. Knowledge can
refer to a theoretical or practical understanding of a subject. It can be
implicit (as with practical skill or expertise) or explicit (as with the
theoretical understanding of a subject); it can be more or less formal or
systematic.[1] In philosophy, the study of knowledge is called epistemology;
the philosopher Plato famously defined knowledge as "justified true belief",
though "well-justified true belief" is more complete as it accounts for the
Gettier problems. However, several definitions of knowledge and theories to
explain it exist.


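The `strip` method removes whitespace from both ends of the line, including the newline character at the end. Without it, `print` would add a second newline and the output would be double-spaced.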
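
As a small illustration of the two ways of reading and of what `strip` does, the sketch below writes a two-line scratch file (the name `reading_demo.txt` and its content are made up for this example) and reads it back both ways. `rstrip('\n')` is shown as an alternative for cases where only the trailing newline should go. The single-argument `print(...)` form runs on Python 3 as well as Python 2.

```python
import os
import tempfile

# Hypothetical scratch file with two lines, the second one padded with spaces
path = os.path.join(tempfile.mkdtemp(), 'reading_demo.txt')
f = open(path, 'w')
f.write('first line\n  second line  \n')
f.close()

# read() returns the whole content as one string
f = open(path)
content = f.read()
f.close()
print(repr(content))                 # 'first line\n  second line  \n'

# Iterating over the file yields one line at a time, newline included
f = open(path)
lines = [line for line in f]
f.close()

print(repr(lines[1]))                # '  second line  \n'
print(repr(lines[1].strip()))        # 'second line'
print(repr(lines[1].rstrip('\n')))   # '  second line  '
```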

####Writing files
1. Open the file in *write* mode
2. Use the `write` method

Examples:

In [11]:
fw = open('data/new_write_file.txt', 'w')
fw.write('This is the first line.\n')
fw.write('This is the second line.\n')

####Appending to files
1. Open the file in *append* mode
2. Use the `write` method

Examples:

In [12]:
fa = open('data/new_write_file.txt', 'a')
fa.write('And this is the third line.\n')
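
The difference between the `'w'` and `'a'` modes can be checked with a short experiment. This is only a sketch: the scratch file `append_demo.txt` and the helper `contents` are hypothetical names introduced here, and the single-argument `print(...)` form is used so that it runs on Python 3 as well as Python 2.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'append_demo.txt')

def contents(p):
    # Helper for this sketch: return the whole file as a string
    f = open(p)
    text = f.read()
    f.close()
    return text

fw = open(path, 'w')
fw.write('first\n')
fw.close()

fa = open(path, 'a')        # append mode: existing content is kept
fa.write('second\n')
fa.close()
after_append = contents(path)
print(repr(after_append))   # 'first\nsecond\n'

fw = open(path, 'w')        # write mode: existing content is erased
fw.write('third\n')
fw.close()
after_write = contents(path)
print(repr(after_write))    # 'third\n'
```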

####Closing files
- After you finish processing a file, whether it was opened for reading or writing, close it
- Use the `close` method

Examples:

In [13]:
f.close()
fr.close()
fw.close()
fa.close()
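
Python also offers the `with` statement, which closes the file automatically when the indented block ends. It is not used elsewhere in this notebook, so the following is only a sketch of the alternative, with a hypothetical scratch file named `with_demo.txt`. It runs on Python 3 as well as Python 2.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')

with open(path, 'w') as fw:
    fw.write('This is the first line.\n')
# Leaving the block has already closed the file
print(fw.closed)              # True

with open(path) as f:
    for line in f:
        print(line.strip())   # This is the first line.
print(f.closed)               # True
```

Calling `close` explicitly, as in the cell above, has the same effect; the `with` form simply makes it harder to forget.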

####Exercise 1
Write a function called `copy` that receives two arguments, a source file path and a destination file path, and copies the content of the source to the destination.

In [14]:
def copy(src, dst):
    fs = open(src, 'r')
    fd = open(dst, 'w')
    fd.write(fs.read())
    fs.close()
    fd.close()

In [15]:
copy('exercises/knowledge_clean.txt', 'exercises/knowledge_copy.txt')

####Exercise 2
Write a function that processes the file [knowledge_clean.txt](exercises/knowledge_clean.txt) (a version of *knowledge_wikipedia.txt* with the punctuation marks removed) and prints all the words ending in *ing*.

In [16]:
def ing():
    for line in open('exercises/knowledge_clean.txt'):
        for word in line.split():
            if word.endswith('ing'):
                print word

In [17]:
ing()

understanding
something
perceiving
discovering
learning
understanding
understanding


###CSV files

Data analysis work often involves tabular data. Python includes a library to handle files in CSV (comma-separated values), a common format that stores tabular data in plain-text form.

Example: the file [lexicon.csv](data/lexicon.csv) contains linguistic data.

####CSV reader

In [18]:
import csv

csvfile = open('data/lexicon.csv')
reader = csv.reader(
    csvfile,
    delimiter=',',
    quotechar='"'
)
#reader.next() #if it is necessary to skip the header
for row in reader:
    print row
csvfile.close()

['sleep', 'sli:p', 'v.i', 'a condition of body and mind ...']
['walk', 'wo:k', 'v.intr', 'progress by lifting and setting down each foot ...']


####CSV writer

In [19]:
import csv

csvfile = open('data/lexicon.csv', 'a')
writer = csv.writer(
    csvfile,
    delimiter=',',
    quotechar='"',
    quoting=csv.QUOTE_MINIMAL
)
writer.writerow(['wake', 'weik', 'intrans', 'cease to sleep, stop dreaming'])
csvfile.close()
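
The effect of `quotechar` and `csv.QUOTE_MINIMAL` becomes visible when a field contains the delimiter. The sketch below writes two rows to a hypothetical scratch file (`quoting_demo.csv`) and then looks at the raw text. Passing `lineterminator='\n'` is a choice made for this example so that the text-mode file gets plain newlines; it is not part of the cell above. The code runs on Python 3 as well as Python 2.

```python
import csv
import os
import tempfile

# Hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'quoting_demo.csv')

csvfile = open(path, 'w')
writer = csv.writer(
    csvfile,
    delimiter=',',
    quotechar='"',
    quoting=csv.QUOTE_MINIMAL,
    lineterminator='\n'  # assumption for this sketch: plain newlines
)
writer.writerow(['wake', 'weik', 'intrans', 'cease to sleep, stop dreaming'])
writer.writerow(['walk', 'wo:k', 'v.intr', 'progress by lifting and setting down each foot ...'])
csvfile.close()

# Raw text: only the field that contains a comma is quoted
csvfile = open(path)
raw_lines = [line.strip() for line in csvfile]
csvfile.close()
print(raw_lines[0])   # wake,weik,intrans,"cease to sleep, stop dreaming"

# Reading back with csv.reader restores the original four fields
csvfile = open(path)
rows = [row for row in csv.reader(csvfile, delimiter=',', quotechar='"')]
csvfile.close()
print(len(rows[0]))   # 4
print(rows[0][3])     # cease to sleep, stop dreaming
```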

####Exercise 3
Given the random list of words in the file [random_words.txt](exercises/random_words.txt), create a CSV file called `random_words.csv` that contains, for each word, the word itself, its length, and `long` if the word's length is greater than or equal to `10` (`short` otherwise).

In [20]:
import csv

csvfile = open('exercises/random_words.csv', 'w')
writer = csv.writer(
    csvfile,
    delimiter=',',
    quotechar='"',
    quoting=csv.QUOTE_MINIMAL
)
for line in open('exercises/random_words.txt'):
    word = line.strip()
    length = len(word)
    if length >= 10:
        info = 'long'
    else:
        info = 'short'
    writer.writerow([word, length, info])
csvfile.close()

####Exercise 4
Given the CSV file [random_words.csv](exercises/random_words.csv) created in the last exercise, print the rows that contain a *long* word.

In [21]:
import csv

csvfile = open('exercises/random_words.csv')
reader = csv.reader(
    csvfile,
    delimiter=',',
    quotechar='"'
)
for row in reader:
    if row[2] == 'long':
        print row
csvfile.close()

['erythritol', '10', 'long']
['unstitching', '11', 'long']
['superexpenditure', '16', 'long']
['hypostatically', '14', 'long']
['corollaceous', '12', 'long']
['limnologically', '14', 'long']
['availableness', '13', 'long']
['delamination', '12', 'long']
['corynebacterium', '15', 'long']


###Web files

The module `urllib` provides a high-level interface for accessing data across the web. It is possible to open a file identified by its URL instead of its path and filename.

In [22]:
from urllib import urlopen

f = urlopen('http://www.gutenberg.org/files/2554/2554.txt')
counter = 1
for line in f:
    print line
    if counter == 20:
        break
    counter += 1

The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky



This eBook is for the use of anyone anywhere at no cost and with

almost no restrictions whatsoever.  You may copy it, give it away or

re-use it under the terms of the Project Gutenberg License included

with this eBook or online at www.gutenberg.org





Title: Crime and Punishment



Author: Fyodor Dostoevsky



Release Date: March 28, 2006 [EBook #2554]

[Last updated: November 15, 2011]



Language: English



Character set encoding: ASCII



*** START OF THIS PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***



####Exercise 5
All the ebooks from [Project Gutenberg](http://www.gutenberg.org/) have the same format. Each of them contains metadata about itself, such as its title, author, release date, language, and encoding, in its first 20 lines. Given the ebook at the URL [http://www.gutenberg.org/files/2554/2554.txt](http://www.gutenberg.org/files/2554/2554.txt), print its title and author.

In [23]:
from urllib import urlopen

f = urlopen('http://www.gutenberg.org/files/2554/2554.txt')
counter = 1
for line in f:
    if line.startswith('Title:') or line.startswith('Author:'):
        print line.strip()
    if counter == 20:
        break
    counter += 1

Title: Crime and Punishment
Author: Fyodor Dostoevsky
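
The same idea can be wrapped in a function that returns the metadata instead of printing it. This is only a sketch: `extract_metadata` and its parameters are hypothetical names, and a short list of lines copied from the output shown earlier stands in for the object returned by `urlopen`, so no network access is needed. On Python 3, `urlopen` lives in `urllib.request` and yields bytes, which would have to be decoded before being passed to a function like this one.

```python
def extract_metadata(lines, fields=('Title', 'Author'), limit=20):
    """Return a dict with the requested fields found in the first `limit` lines."""
    metadata = {}
    for number, line in enumerate(lines, 1):
        for field in fields:
            prefix = field + ':'
            if line.startswith(prefix):
                # Keep only the text after 'Field:', without surrounding whitespace
                metadata[field] = line[len(prefix):].strip()
        if number == limit:
            break
    return metadata

# Sample header lines standing in for urlopen(...)
sample = [
    'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\n',
    '\n',
    'Title: Crime and Punishment\n',
    '\n',
    'Author: Fyodor Dostoevsky\n',
    '\n',
    'Release Date: March 28, 2006 [EBook #2554]\n',
    'Language: English\n',
]

info = extract_metadata(sample)
print(info['Title'])    # Crime and Punishment
print(info['Author'])   # Fyodor Dostoevsky
```

Because the function accepts any iterable of lines, it should work equally with a file opened from disk.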
