# Terminal IO

Python's basic terminal output command is print. The examples here use Python 2, where print is a statement rather than a function.

In [1]:
print "Hello World"

Hello World


In [2]:
print 1 + 1

2


In [3]:
print "Hello World " + str(1 + 1)


Hello World 2


In [4]:
a = 696
print a

696


<h1> Reading Files from Disk </h1>

<p>A rule of thumb is that just loading data into the format that you want is going to be harder than expected. Here I'm going
to walk through loading a typical data file step by step, dealing with all the issues as they come up.
The first thing we need to do is open the file with the data, like this:</p>

<table>
    
<tr><th>Function</th><th>Description</th></tr>
<tr><td>format()</td><td>Converts a value to a formatted representation</td></tr>
<tr><td>input()</td><td>Reads input from the console</td></tr>
<tr><td>open()</td><td>Opens a file and returns a file object</td></tr>
<tr><td>print()</td><td>Prints to a text stream or the console</td></tr>

</table>
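
As a rough illustration of two of the functions in the table, the sketch below uses format() and print(). The values are made up for illustration, and print is written with parentheses and a single argument so the same lines also run under Python 3.

```python
# format() turns a value into a string according to a format spec.
price_text = format(3.14159, ".2f")   # two digits after the decimal point
padded_text = format(42, "5d")        # right-aligned in a field 5 characters wide

print(price_text)
print(padded_text)
```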


<h2> Open a file with the <i> open </i> function </h2>

In [5]:
f = open("MovieData.csv")

IOError: [Errno 2] No such file or directory: 'MovieData.csv'
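
The error above is what Python 2 reports when the file is not in the current working directory. A minimal sketch of one way to handle that case, assuming a hypothetical helper named open_or_none (it is not part of the original notebook):

```python
def open_or_none(path):
    """Return an open file object, or None if the file cannot be opened."""
    try:
        return open(path)
    except IOError:  # raised when the file is missing or unreadable
        return None

f_missing = open_or_none("no_such_file_for_this_example.csv")
if f_missing is None:
    print("Could not open the file; check the working directory.")
```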

f now stores our connection to the file. Python assumes that we've opened a text file, and knows that text files are
divided into lines. CSV stands for comma-separated values, but the extension is commonly used for tabular data stored in
text files whether the columns are separated by commas, tabs, or some other character.
Let's see what the contents of the file look like by printing the first few rows:


<h2> Read lines from a file sequentially with the <i>readline</i> method </h2>

In [None]:
for i in range(5):
    print f.readline()


There are a few things going on here. First, notice the table structure of the data: the first row consists of column
headers, and the next rows are data. The columns are separated by whitespace (actually tabs, as we'll see soon). Don't
forget about the column headers -- you don't want to confuse them with the data (for example, by including 'Movie' in
your list of movie titles).


Now look at the code. The file object f internally stores what line it's on. Every time you call readline(), which
is a method of the file object, it reads in the next line, then moves down a line. So if you call the function again:


In [None]:
print f.readline()

<h2>Go back to the beginning of a file with seek </h2>

Each call to readline() goes down to the next line, until it gets to the end of the file. If you want to 'reset' f and return to the top of the file, you
do it with .seek(0), which sets the file reader back to the beginning, like this:

In [None]:
f.seek(0)
print f.readline()


In [None]:
# seek is more general purpose: its argument is a byte offset from the start of the file, much like an index into a list
f.seek(3) # puts the read cursor at the fourth byte (offsets are zero-indexed, like lists)
print f.readline()
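
The cursor behaviour of readline and seek can be tried out without the movie file. The sketch below uses an in-memory file from the standard io module as a stand-in; the contents are made up for illustration.

```python
import io

# An in-memory stand-in for a text file (contents are made up).
sample = io.StringIO(u"first line\nsecond line\nthird line\n")

line_one = sample.readline()    # reads up to and including the first "\n"
line_two = sample.readline()    # continues where the previous call stopped

sample.seek(0)                  # back to the start
again = sample.readline()

sample.seek(3)                  # skip the first three characters
tail = sample.readline()

print(repr(tail))
```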

In [None]:
f.seek(0)
a = f.readline()
print a
print a[3:]

Notice the spacing between the blocks of text -- this suggests that the data is tab-delimited. To check, let's view one
line 'raw', without being formatted by the print function:


In [None]:
f.readline()


See the <b> '\t' </b> that appears where the gaps were before? That's the code for the tab character. Most of the time, columns in
csv files are separated by commas (csv stands for "comma-separated values" after all), but not always. In this case,
tabs were probably a good choice because some movie titles might have commas in the title.
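
The difference between the two views comes from print versus repr: a notebook cell that ends in a bare expression shows the value's repr, which spells out special characters as escape codes. A minimal sketch on a made-up line:

```python
line = "Example Film\t1000\n"   # made-up line with a tab and a newline

print(line)          # the tab shows up as a gap
print(repr(line))    # the tab shows up as \t and the newline as \n

raw_view = repr(line)
```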


We don't want to just print rows one at a time -- we want to store them in a variable, where we can work with them. So
the first thing we want to do is read the rows in and put them in a list, like this:


<h2> Read a whole text file with a <i>for</i> loop </h2>

In [None]:
rows = [] # List to store rows:
f = open("MovieData.csv")
for row in f:   # iterates over all lines in file
    rows.append(row)




In [None]:
print rows[9]

In [None]:
print len(rows) ## how many rows?

In [None]:
print rows[0] # first row is header

In [None]:
print rows[111]

<h2> Parsing Text Lines Read with the string split method </h2>

Right now rows is just a list of strings, with each row just one long piece of raw text -- not data we can work with yet.
The next step is to split each string on the dividing character, the tab. Splitting a string works like this:
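
A minimal sketch of split on made-up strings, first on a space and then on a tab:

```python
sentence = "one two three"
words = sentence.split(" ")     # split on a single space

tabbed = "Hello\tWorld"
fields = tabbed.split("\t")     # split on the tab character

print(words)
print(fields)
```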

For our data, we want to split each row on the tab character, or '\t'. Then we add the split row to our list of rows:

In [None]:
rows = []
f = open("MovieData.csv")
for raw_row in f:
    row = raw_row.split("\t")  # each row is split up into substrings, using the tab as a delimiter.
    rows.append(row)

print rows[0] # first line of file, split on tab results in a list of strings
print rows[0][0] # first element of first line is a string.
print rows[1]


<h2> Converting read string values into numeric variables </h2>

Notice that even the numbers above have quotation marks around them, meaning that they're being read in as text. On
its own, Python doesn't know the difference between 'John Carter' and '66439100' -- both are just raw text. We want to
convert the numeric columns to numbers, using casting, which we touched on last week. We do it like this:


In [None]:
f = open("MovieData.csv")
rows = []
first_row = f.readline() # Save the first row, and move to the next one
print first_row
for raw_row in f: # This will actually start from the second row.
    row = raw_row.split("\t")
    row[3] = int(row[3]) # Convert Budget to number using explicit casting function int()
    row[4] = int(row[4]) # "" US Gross
    row[5] = int(row[5]) # "" Worldwide Gross
    rows.append(row)


We get an error because one of the strings can't be converted to a number: the word Unknown. It looks like when
there's missing data, it isn't just left blank; instead, the table has the word "Unknown" in that space. That's pretty
common; often missing records in a dataset will be filled with "Unknown" or "NA" or something similar.

So how to get around this problem? There are a few things we can do -- we can use an if statement to check whether
each record is equal to "Unknown" or not; or, we can catch the error using the try: ... except: statements.
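
The first option, an explicit check, might look like the sketch below. It assumes "Unknown" is the only missing-value marker in the file, which the text above does not guarantee, and the function name is made up for illustration.

```python
def to_int_or_none(text):
    """Hypothetical helper: return None for the 'Unknown' marker, else an int."""
    if text.strip() == "Unknown":   # strip() removes a trailing newline, if any
        return None
    return int(text)

print(to_int_or_none("Unknown"))
print(to_int_or_none("12\n"))
```

The drawback is that any other non-numeric marker (for example "NA") would still raise a ValueError, which is one reason to prefer try-except.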


<h2> Using try-except to handle unexpected values in data file </h2>

Now let's implement this as we read in the movie dataset. To be on the safe side, let's wrap all the int() conversions in
try-except, since they will probably have the same problem. 

If a string can't be converted to a number, we'll replace it with <b>None</b>, a special value meaning (you guessed it) a missing value.


In [None]:
f = open("MovieData.csv")
rows = []
first_row = f.readline() # Save the first row, and move to the next one
print first_row
for raw_row in f: # This will actually start from the second row.
    row = raw_row.split("\t")
    try:
        row[3] = int(row[3]) # Budget
    except ValueError:
        row[3] = None
    try:
        row[4] = int(row[4]) # US Gross
    except ValueError:
        row[4] = None
    try:
        row[5] = int(row[5]) # Worldwide Gross
    except ValueError:
        row[5] = None
    rows.append(row)


In [None]:
rows[2]

Okay, looking good! However, putting try-except pairs around each line is cumbersome. In data analysis, good code
and lazy code often go together, and the solution that requires the least repetitive typing is probably the more elegant
one too. Notice how we're essentially repeating the same operation several times: converting to int when possible,
otherwise assigning None. So let's wrap these all up in a function, and use it to do the conversion.


In [None]:
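def convert_to_int(text):
    '''
    Convert text to an int, or return None if it isn't a valid integer.
    '''
    try:
        return int(text)
    except ValueError: # e.g. "Unknown"
        return None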


In [None]:
f = open("MovieData.csv")
rows = []
first_row = f.readline() # Save the first row, and move to the next one
print first_row
for raw_row in f: # This will actually start from the second row.
    row = raw_row.split("\t")
    row[3] = convert_to_int(row[3]) # Budget
    row[4] = convert_to_int(row[4]) # US Gross
    row[5] = convert_to_int(row[5]) # Worldwide Gross
    rows.append(row)


In [None]:
rows[1000]


So how do we convert the date strings in our data to datetime objects? First, we have to split them, the same way as
we split the rows. Let's remind ourselves what the strings look like:


In [None]:
import datetime as dt

In [None]:
print rows[1000]

In [None]:
test_date = "03/01/91"
print test_date.split("/")


When we do a split where we know how many values we expect, we can immediately assign each to a variable, like
this:


In [None]:
month, day, year = test_date.split("/") # Assign three variables at once
print year
print month
print day


In a case like this where we know exactly how many characters each element consists of, we can treat the string like a
list of characters:


In [None]:
print test_date[:2] # First two characters, 0 and 1
print test_date[2] # The third character, the '/'
print test_date[3:5] # Characters 3 and 4
print test_date[6:] # Character 6 to the end


Of course, these are still strings. We need to convert them to integers, just like we did above:


In [None]:
month, day, year = test_date.split("/") # Assign three variables at once
# Convert strings to integers
month = int(month)
day = int(day)
year = int(year)

Note that the datetime object takes the year as given, while our data only has two-digit years. If we use only two digits,
Python will assume we mean a date in the first century.


In [None]:
test_date = dt.datetime(year, month, day)
print test_date.year
if test_date.year > 1900:
    print "Date in the near past"
else:
    print "Date in the far past"


The solution is to manually add the century to the date:

In [None]:
year = year + 1900
test_date = dt.datetime(year, month, day)
print test_date.year
if test_date.year > 1990:
    print "Date in the near past"
else:
    print "Date in the far past"

However, a quick examination shows us that the data has dates on both sides of the millennium:

In [None]:
print rows[2]


In [None]:
print rows[1000]


So we need a cutoff to figure out what to add to the year. Let's start with 50, and then check to see whether that works.
Let's also wrap all the conversion in a function, to make the code more readable:

In [None]:
def make_date(date_str):
    '''
    Turn a MM/DD/YY string into a datetime object.
    '''
    m, d, y = date_str.split("/")
    m = int(m)
    d = int(d)
    y = int(y)
    if y > 50:
        y += 1900
    else:
        y += 2000
    return dt.datetime(y, m, d)


In [None]:
# Testing:
print make_date(rows[1000][0])
print make_date(rows[2][0])
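
As an aside, the standard library can parse such strings directly with datetime.strptime. This is an alternative to the hand-written make_date, not what this notebook uses. Its %y directive applies its own fixed cutoff (the Python documentation describes 69-99 as 1969-1999 and 0-68 as 2000-2068), so whether it suits a given dataset still has to be checked. A minimal sketch:

```python
import datetime as dt

# %m = two-digit month, %d = two-digit day, %y = two-digit year
parsed = dt.datetime.strptime("03/01/91", "%m/%d/%y")
print(parsed.year)
```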


Now that we've tested the function, let's integrate it into the data reading. The code below should start to look pretty
familiar by this point:


In [None]:
f = open("MovieData.csv")
rows = []
first_row = f.readline() # Save the first row, and move to the next one
print first_row
for raw_row in f: # This will actually start from the second row.
    row = raw_row.split("\t")
    row[0] = make_date(row[0]) # Convert the date string to the datetime object
    row[3] = convert_to_int(row[3]) # Budget
    row[4] = convert_to_int(row[4]) # US Gross
    row[5] = convert_to_int(row[5]) # Worldwide Gross
    rows.append(row)


In [None]:
# Test on some arbitrary rows.
print rows[1000]
print rows[2022]

<h2>List Comprehension: Slicing Up Tabular Data by Column Using Special [] Notation </h2>

So we actually have all the records in the format we want them; now, how do we do analysis on them?

By and large, we want to do analysis column-wise: for example, finding the average film budget or gross -- or initially in
this case, finding the range of dates. To do this here, we need to create new lists containing only the values we want
from each row -- think of it as selecting columns.

We can do this in a for loop, by creating a new list and then iterating over all the rows, and adding only the value we're
interested in to the list, like this:

In [None]:
all_dates = []
for row in rows:
    all_dates.append(row[0])
print len(all_dates) # Number of records
print all_dates[0] # First record


This sort of thing comes up often enough that Python provides a specific idiom for creating lists from other lists, called
<b>list comprehension</b>. It puts the for loop inside of the square brackets. It works like this:


In [None]:
all_dates = [row[0] for row in rows] # List comprehension
print len(all_dates) # Number of records
print all_dates[0] # First record


As you can see, this produces results that are identical to the for loop method above in more compact code.

<h3>Built-in list functions min and max </h3>
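
A list comprehension can also filter while it selects. The sketch below uses made-up rows, not the movie file, and drops missing values before taking the smallest and largest:

```python
# Made-up rows: [title, budget]; None marks a missing budget.
sample_rows = [
    ["Example Film A", 100],
    ["Example Film B", None],
    ["Example Film C", 300],
]

# Select the budget column, skipping rows where it is missing.
known_budgets = [row[1] for row in sample_rows if row[1] is not None]

print(min(known_budgets))
print(max(known_budgets))
```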

Now that we have our list of dates, we can quickly use built-in functions min and max to find the start and stop points
of the data.
   

In [None]:

print "Earliest date: ", min(all_dates)
print "Latest date: ", max(all_dates)


Uh-oh, it looks like there are pre-1950s movies in the dataset, so splitting on 1950 won't work.

Let's go for something more sure: if the two-digit year is greater than 13, we'll assume it was a 20th-century year. If it's
13 or below, it's a 21st-century year. Let's revise the code accordingly.


In [None]:
def make_date(date_str):
    '''
    Turn a MM/DD/YY string into a datetime object
    '''
    m, d, y = date_str.split("/")
    m = int(m)
    d = int(d)
    y = int(y)
    if y > 13:
        y += 1900
    else:
        y += 2000
    return dt.datetime(y, m, d)


In [None]:
f = open("MovieData.csv")
rows = []
first_row = f.readline() # Save the first row, and move to the next one
print first_row
for raw_row in f: # This will actually start from the second row.
    row = raw_row.split("\t")
    row[0] = make_date(row[0])
    row[3] = convert_to_int(row[3]) # Budget
    row[4] = convert_to_int(row[4]) # US Gross
    row[5] = convert_to_int(row[5]) # Worldwide Gross
    rows.append(row)


Now let's just repeat the list comprehension to get the new dates:

In [None]:
all_dates = [row[0] for row in rows]
print "Earliest date: ", min(all_dates)
print "Latest date: ", max(all_dates)


That 2013-12-12 still looks suspect, since the file was created earlier than that. Are we sure that it should be 2013 and
not 1913? So let's take a closer look at the row it appears in -- but first we need to find it. We can do that by iterating
over all the rows:


In [None]:
target_date = max(all_dates)
for row in rows:
    if row[0] == target_date:
        print row


We know that The Hobbit couldn't have been released in 1913, since the book it's based on was still many years in the
future. In fact, however, a quick Google search will tell us that 'The Hobbit: There and Back Again', the third in the Hobbit
trilogy, won't be released until December of 2014. This row seems to be an earlier estimate, possibly from when The
Hobbit was meant to be only two movies instead of three.

This is a reminder that parsing the data correctly doesn't ensure that the data is a correct reflection of the world. If
something looks wrong or weird in the data, it's worth following up on and examining closely.

For completeness's sake, let's check and see what the earliest movie is in the dataset -- to make sure we aren't
recording the future release date (for example) of a movie slated for 2015:


In [None]:
target_date = min(all_dates)
for row in rows:
    if row[0] == target_date:
        print row

This looks correct; a quick Google search will confirm that 'Birth of a Nation' was indeed released in 1915.
Now that we've roughly verified the dates, we can go on to analyze some of the other fields, such as the budgets:


In [None]:
all_budgets = [row[3] for row in rows]
print len(all_budgets)


If we have all the budgets, we can analyze them like we discussed last week. For example, let's recreate the function
that computes the mean of a list:


In [None]:
def find_mean(num_list):
    '''
    Find the average of num_list
    '''
    total = 0.0
    for x in num_list:
        total += x
    return total / len(num_list)



In [None]:
find_mean(all_budgets)


In [None]:
# We can also use min and max:
print min(all_budgets)
print max(all_budgets)
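
Since convert_to_int returns None for missing values, a column may contain None entries, and adding None to a number raises a TypeError. A minimal sketch of a mean that skips them (the function name is made up for illustration):

```python
def find_mean_skipping_missing(num_list):
    """Hypothetical variant of find_mean that ignores None entries."""
    known = [x for x in num_list if x is not None]
    if not known:
        return None          # no usable values at all
    return float(sum(known)) / len(known)

print(find_mean_skipping_missing([10, None, 20]))
```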

<h1>Writing Files </h1>

Finally, let's see how to output data. The first step is to open a new file to write some output to. We do it like this:

In [None]:
f = open("My_file.txt", "wb") # Open a file
f.close() # And immediately close it.


<p>Notice the 'wb' after the comma: the 'w' means write -- it tells Python that you're opening a file for writing. By default, if
the file doesn't exist, Python will create it. </p>

<p> <b>Warning:</b> If the file does exist, this will overwrite its entire contents. Be careful with your file naming, since you won't get
a warning that you're overwriting an existing file. </p>

<p>The <b>b</b> means binary. On Mac and Linux, this doesn't make a difference in Python 2, but on Windows machines it makes sure the
file only takes the exact input you give it, and doesn't add any additional characters to the end of the line like it would
for a text file. In Python 2, it's good practice to include the b for cross-platform compatibility; in Python 3, a file opened
with b accepts only bytes, so text must be written in text mode or encoded first.</p>
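
To see the overwrite behaviour without risking a real file, the sketch below writes to a temporary directory. It uses plain 'w' (text mode) rather than 'wb' so that it also runs under Python 3, and for contrast it adds the 'a' (append) mode, which the text above does not cover:

```python
import os
import tempfile

# Work in a throwaway directory so no real file is touched.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")

f = open(path, "w")
f.write("first version\n")
f.close()

f = open(path, "w")      # "w" again: the old contents are discarded
f.write("second version\n")
f.close()

f = open(path, "a")      # "a" appends to whatever is already there
f.write("appended line\n")
f.close()

f = open(path)
contents = f.read()
f.close()
print(contents)
```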



Now that you've opened a file, you can write to it. For practice, let's write some sample text to a new file. It's very
important to close a file once you're done with it -- otherwise, the data you wrote to it may not be saved.


In [None]:
f = open("My_file.txt", "wb")
f.write("Hello world.")
f.close()


Open the file up in a text editor. You'll see that it contains the "Hello world." string we put there. Let's try another one:

In [None]:
f = open("My_file.txt", "wb")
f.write("Here is some text.")
f.write("And here is some more text.")
f.close()


Open the file; you'll notice that we've overwritten the previous text, and that there's no line break between the two lines
we wrote to the file; we need to insert them manually, using the '\n' character, the special character for a line break.


In [None]:
f = open("My_file.txt", "wb")
f.write("Here is some text.\n")
f.write("And here is some more text.\n\nThis text should have an empty line above it.")
f.close()
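
Because forgetting close() can lose data, Python also offers the with statement, which closes the file automatically. This notebook does not use it; the sketch below shows the same kind of write in that style, using a temporary path and text mode 'w' so it also runs under Python 3:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "with_example.txt")

# The file is closed automatically when the with block ends,
# even if an error occurs inside it.
with open(path, "w") as f:
    f.write("Here is some text.\n")
    f.write("And here is some more text.\n")

file_was_closed = f.closed

with open(path) as f:
    saved_text = f.read()
print(saved_text)
```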


Let's try to write some actual data. Suppose we have a few rows of coordinates, just some x and y values:


In [None]:
data = [
[1, 2],
[3, 5],
[2, 2],
[4, 3]
]


In [None]:
f = open("my_data.csv", "wb")
for row in data:
    f.write(row)
    f.write("\n")
f.close()


As you can see, that fails with an error, because write() expects a string, not a list. We need to translate the list into a string to be able to write it. Let's try that again:


In [None]:
f = open("my_data.csv", "wb")
for row in data:
    f.write(str(row)) #converts list to a string
    f.write("\n")
f.close()


Open the file and you'll see that it's indeed written the lists, square brackets and all. Almost there, but we don't want
the square brackets, just the data. So instead of converting the entire list to a string at once, we'll convert one record at
a time. And let's add some column headings while we're at it.


In [None]:
f = open("my_data.csv", "wb")
f.write("x,y\n") # The column headings.
for row in data:
    f.write(str(row[0])) # Write the first value
    f.write(",") # Add the comma to separate the values
    f.write(str(row[1])) # Write the second value
    f.write("\n") # Add the line break
f.close()

Now we've got pretty much exactly what we wanted. The downside is that it was cumbersome to code, and would get
more so the more columns we'd add. Fortunately, Python comes with a built-in module for writing and reading csv data,
called csv. Pretty easy name to remember.

First, we need to import it:



In [None]:
import csv



To output using the csv module, we first create a file, then create a CSV Writer with the file object, like this:


In [None]:
f = open("my_data.csv", "wb")
my_writer = csv.writer(f) # Create the CSV writer connected to f


my_writer now holds a CSV writer object; think of it as a translator from Python data to how we want the data stored in
the CSV file. Unlike the basic file object, the CSV writer knows how to write entire lists as a single line of CSV data. So:

In [None]:
for row in data:
    my_writer.writerow(row)
f.close()


You'll see that the file has been written exactly the way we want it to, with fewer lines of code. Just one thing is missing -- the
column headers.


In [None]:
f = open("my_data.csv", "wb")
my_writer = csv.writer(f)
header = ["x", "y"]
my_writer.writerow(header)
for row in data:
    my_writer.writerow(row)
f.close()


The csv module can also be used as a reader, which works similarly; here, the Reader object acts as a translator from
file text to Python lists.


In [None]:
f = open("my_data.csv") # For reading, not writing this time.
my_reader = csv.reader(f)
new_rows = [row for row in my_reader]
f.close()

for row in new_rows:
    print row
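
The reader is not limited to commas. csv.reader accepts a delimiter argument, so a tab-separated file like the movie data could be parsed without calling split by hand. A minimal sketch on made-up lines (csv.reader accepts any iterable of strings, so a list stands in for the open file):

```python
import csv

# Made-up tab-separated lines standing in for an open file.
lines = [
    "Movie\tBudget",
    "Example Film A\t100",
    "Example Film, The Sequel\t200",   # a comma inside a title is harmless here
]

tab_reader = csv.reader(lines, delimiter="\t")
parsed_rows = [row for row in tab_reader]

for row in parsed_rows:
    print(row)
```

Note that the values still come back as strings, so the numeric columns need the same conversion step as before.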


In [None]:
stopWords = [] # List to store rows:
f = open("stopwords.txt")
for stopWord in f:   # iterates over all lines in file
    stopWords.append(stopWord.upper().strip())


In [None]:
print len(stopWords)
print stopWords[0]
print stopWords[2]
print stopWords[5]

if ("I" in stopWords):
    print "hello"

In [None]:
import re
a = re.findall(r"[=]+|[\d]+", "================================== 14826 ==================================")

In [None]:
a[1]

In [None]:
import re

def check_for_new_doc(doc_line):
    '''
    Return the document id if doc_line marks the start of a new document,
    otherwise -1.
    '''
    # The new document string looks like this:
    # ================================== 14826 ==================================
    # where the number is a document id.

    # First we parse the line to see if it looks like a new document string.
    doc_line_tokens = re.findall(r"[=]+|[\d]+", doc_line)
    if len(doc_line_tokens) != 3:
        return -1
    try:
        return int(doc_line_tokens[1])
    except ValueError: # the middle token was not a number
        return -1


In [None]:
z = check_for_new_doc("================================== 09978 ==================================")

In [None]:
current_doc_id = 1
if current_doc_id is None:
    print "None"
else:
    print "Some"

In [None]:
print z

In [None]:
x = "Hello\tWorld"
print x

In [None]:
x.split("\t")

<h2>Specifying a File Path</h2>

<p>There be monsters whenever you specify a directory path. The operating system that Python is running under has a profound impact on how you specify a file path.</p>

<p>In Python, the os.path module has a number of functions that change forward slashes in a string to the appropriate filename separator for the platform that you are on. One of these functions is os.path.normpath(). The trick is to enter all of your filename strings using forward slashes, and then let os.path.normpath() change them to backslashes for you on Windows, like this:</p>

In [6]:
# the code below should work on a Linux or Windows platform.  I haven't verified that this works on Linux but 
# I have verified that it works on Windows.   
import os.path
myDirname = os.path.normpath("c:/test/bob.txt")
print myDirname # notice in the output window that if you ran this on a Windows platform the slashes are reversed.
f = open(myDirname, "wb")
f.write("Here is some text.\n")
f.write("And here is some more text.\n\nThis text should have an empty line above it.")
f.close()


c:\test\bob.txt
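
To see what normpath does on each platform without switching machines, the standard library exposes both flavours of os.path as ntpath (Windows) and posixpath (Linux and Mac). The sketch below is an illustration, not part of the original notebook:

```python
import ntpath      # the Windows flavour of os.path, importable anywhere
import os.path
import posixpath   # the Linux/Mac flavour of os.path

windows_style = ntpath.normpath("c:/test/bob.txt")
posix_style = posixpath.normpath("c:/test/bob.txt")

print(windows_style)   # backslashes
print(posix_style)     # forward slashes left alone

# os.path.join builds a path with the current platform's separator.
joined = os.path.join("test", "bob.txt")
print(joined)
```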
