Contents
---
- [File Input](#input)
- [Delimiters](#delimiters)
- [CSV files](#csv)
- [File Output](#output)
- [Files in other locations](#locations)

File Input
---
<a class="anchor" id="input"></a>
So far, we have just taken in short words and sentences from the user. However, in reality we will want to examine much larger sets of text, including Word documents, spreadsheets, and web pages. In order to do this, we need to learn how to tell Python to look through a file.

If your Python program is in the same folder as the text file you would like to read, then all we need is the open command. Suppose we want to read the lyrics in a file saved as "kanye.txt".

The open command by itself returns a file handle, not the text. If we try to slice the handle as though it were a string, we will get an error:

In [7]:
f = open('kanye.txt')
print(f[0:10])

TypeError: '_io.TextIOWrapper' object is not subscriptable

Instead, we'll need to use the .read command. Remember to always put your file name in quotation marks!

In [1]:
f = open('kanye.txt').read()
print(f)

Oh, when it all, it all falls down
I'm telling you, oh, it all falls down


Oh, when it all, it all falls down
I'm telling you, oh, it all falls down


Man, I promise, she's so self conscious
She has no idea what she's doing in college
That major that she majored in don't make no money
But she won't drop out, her parents will look at her funny


Now, tell me that ain't insecure
The concept of school seems so secure
Sophomore, three years, ain't picked a career
She like, screw it, I'll just stay down here and do hair


'Cause that's enough money to buy her a few pairs of new airs
'Cause her baby daddy don't really care
She's so precious with the peer pressure
Couldn't afford a car so she named her daughter Alexus


She had hair so long that it looked like weave
Then she cut it all off now she look like Eve
And she be dealing with some issues that you can't believe
Single black female, addicted to retail and well


Oh, when it all, it all falls down
I'm telling you oh, it all falls down


If we wanted to read the first 10 lines of the song, we might try typing:

In [12]:
f = open('kanye.txt').read()
print(f[0:10])

Oh, when i


Uh oh, this gives us the first 10 characters, not the first 10 lines. The read command is useful when you want to process the entire contents of a file as one string, but it isn't a good fit for large files because it loads everything into memory at once. To get the lines as a list, we can use the readlines command and then print out the first ten items in that list:


In [7]:
f = open('kanye.txt').readlines()
print(f[0:10])

['Oh, when it all, it all falls down\n', "I'm telling you, oh, it all falls down\n", '\n', '\n', 'Oh, when it all, it all falls down\n', "I'm telling you, oh, it all falls down\n", '\n', '\n', "Man, I promise, she's so self conscious\n", "She has no idea what she's doing in college\n"]


Actually, for very large files, it's really not ideal to read the whole file all at once. Rather, we can read one line at a time using the following code:

In [8]:
for line in open('kanye.txt'):   
   print(line)

Oh, when it all, it all falls down

I'm telling you, oh, it all falls down





Oh, when it all, it all falls down

I'm telling you, oh, it all falls down





Man, I promise, she's so self conscious

She has no idea what she's doing in college

That major that she majored in don't make no money

But she won't drop out, her parents will look at her funny





Now, tell me that ain't insecure

The concept of school seems so secure

Sophomore, three years, ain't picked a career

She like, screw it, I'll just stay down here and do hair





'Cause that's enough money to buy her a few pairs of new airs

'Cause her baby daddy don't really care

She's so precious with the peer pressure

Couldn't afford a car so she named her daughter Alexus





She had hair so long that it looked like weave

Then she cut it all off now she look like Eve

And she be dealing with some issues that you can't believe

Single black female, addicted to retail and well





Oh, when it all, it all falls down

I'm t

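There are a lot of extra blank lines between the lyrics. Each line in the file already ends with a newline character, and print adds a second one. We can use the strip command to remove the whitespace (including the newline) from both ends of each line: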

In [2]:
for line in open('kanye.txt'):   
   print(line.strip())

Oh, when it all, it all falls down
I'm telling you, oh, it all falls down


Oh, when it all, it all falls down
I'm telling you, oh, it all falls down


Man, I promise, she's so self conscious
She has no idea what she's doing in college
That major that she majored in don't make no money
But she won't drop out, her parents will look at her funny


Now, tell me that ain't insecure
The concept of school seems so secure
Sophomore, three years, ain't picked a career
She like, screw it, I'll just stay down here and do hair


'Cause that's enough money to buy her a few pairs of new airs
'Cause her baby daddy don't really care
She's so precious with the peer pressure
Couldn't afford a car so she named her daughter Alexus


She had hair so long that it looked like weave
Then she cut it all off now she look like Eve
And she be dealing with some issues that you can't believe
Single black female, addicted to retail and well


Oh, when it all, it all falls down
I'm telling you oh, it all falls down


Since opening the "kanye.txt" file was successful, the operating system returned us a file handle. The file handle is not the actual data contained in the file, but instead it is a “handle” that we can use to read the data. You are given a handle if the requested file exists and you have the proper permissions to read the file.
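To see the difference between a handle and the data, here is a minimal sketch. It assumes a throwaway file (the name handle_demo.txt and its two lines are made up for illustration) written to a temporary folder, so it does not touch kanye.txt:

```python
import os
import tempfile

# Hypothetical sample file, created in a temporary folder just for this demo.
demo_path = os.path.join(tempfile.mkdtemp(), 'handle_demo.txt')
with open(demo_path, 'w') as fout:
    fout.write('first line\nsecond line\n')

with open(demo_path) as handle:
    print(type(handle))     # the handle is a file object, not a string
    text = handle.read()    # the data only arrives when we ask for it
    print(type(text))       # now we have an ordinary string

print(handle.closed)        # the with block closed the handle for us
```

The handle is what we loop over or call .read() on; the string is what we get back.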

If the file does not exist, open will fail with a traceback and you will not get a handle to access the contents of the file:

In [43]:
f = open('missingfile.text')

FileNotFoundError: [Errno 2] No such file or directory: 'missingfile.text'
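If we would rather show a friendly message than a traceback, one option is to catch the error. This is only a sketch: it assumes that no file named missingfile.text exists in the current folder, and the wording of the message is made up:

```python
filename = 'missingfile.text'   # assumed not to exist in the current folder
message = None

try:
    f = open(filename)
except FileNotFoundError:
    # open failed, so there is no handle to read from
    message = 'Could not find ' + filename
    print(message)
else:
    # only runs when open succeeded
    print(f.read())
    f.close()
```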

Oftentimes, the first line of a data file contains headers, i.e., labels for the columns. For example, consider this football data:

In [7]:
for line in open('football.txt'):   
   print(line.strip())

Team,Games,Wins,Losses,Draws,Goals,Goals Allowed,Points
Arsenal,38,26,9,3,79,36,87
Liverpool,38,24,8,6,67,30,80
Manchester United,38,24,5,9,87,45,77
Newcastle,38,21,8,9,74,52,71
Leeds,38,18,12,8,53,37,66
Chelsea,38,17,13,8,66,38,64
West_Ham,38,15,8,15,48,57,53
Aston_Villa,38,12,14,12,46,47,50
Tottenham,38,14,8,16,49,53,50
Blackburn,38,12,10,16,55,51,46
Southampton,38,12,9,17,46,54,45
Middlesbrough,38,12,9,17,35,47,45
Fulham,38,10,14,14,36,44,44
Charlton,38,10,14,14,38,49,44
Everton,38,11,10,17,45,57,43
Bolton,38,9,13,16,44,62,40
Sunderland,38,10,10,18,29,51,40
Ipswich,38,9,9,20,41,64,36
Derby,38,8,6,24,33,63,30
Leicester,38,5,13,20,30,64,28


If we don't want to store the header row, we can use "next":

In [6]:
with open('football.txt') as f:
    next(f)
    for line in f:
        print(line.strip())

Arsenal,38,26,9,3,79,36,87
Liverpool,38,24,8,6,67,30,80
Manchester United,38,24,5,9,87,45,77
Newcastle,38,21,8,9,74,52,71
Leeds,38,18,12,8,53,37,66
Chelsea,38,17,13,8,66,38,64
West_Ham,38,15,8,15,48,57,53
Aston_Villa,38,12,14,12,46,47,50
Tottenham,38,14,8,16,49,53,50
Blackburn,38,12,10,16,55,51,46
Southampton,38,12,9,17,46,54,45
Middlesbrough,38,12,9,17,35,47,45
Fulham,38,10,14,14,36,44,44
Charlton,38,10,14,14,38,49,44
Everton,38,11,10,17,45,57,43
Bolton,38,9,13,16,44,62,40
Sunderland,38,10,10,18,29,51,40
Ipswich,38,9,9,20,41,64,36
Derby,38,8,6,24,33,63,30
Leicester,38,5,13,20,30,64,28


If you want to store the header for later, you can use readline:

In [10]:
with open('football.txt') as f:
    header = f.readline()
    for line in f:
        print(line.strip())
    print('header:', header)

Arsenal,38,26,9,3,79,36,87
Liverpool,38,24,8,6,67,30,80
Manchester United,38,24,5,9,87,45,77
Newcastle,38,21,8,9,74,52,71
Leeds,38,18,12,8,53,37,66
Chelsea,38,17,13,8,66,38,64
West_Ham,38,15,8,15,48,57,53
Aston_Villa,38,12,14,12,46,47,50
Tottenham,38,14,8,16,49,53,50
Blackburn,38,12,10,16,55,51,46
Southampton,38,12,9,17,46,54,45
Middlesbrough,38,12,9,17,35,47,45
Fulham,38,10,14,14,36,44,44
Charlton,38,10,14,14,38,49,44
Everton,38,11,10,17,45,57,43
Bolton,38,9,13,16,44,62,40
Sunderland,38,10,10,18,29,51,40
Ipswich,38,9,9,20,41,64,36
Derby,38,8,6,24,33,63,30
Leicester,38,5,13,20,30,64,28
header: Team,Games,Wins,Losses,Draws,Goals,Goals Allowed,Points



When would storing the header separately be useful? Suppose we want to find the average number of wins of all the teams. If the header string "Wins" had been stored in the same list as the numerical wins, then we wouldn't be able to take the average of numbers and the word "Wins". First, let us store the team names and their wins in separate lists and print them:

In [13]:
with open('football.txt') as f:
    header = f.readline().split(',')
    teams = []
    wins = []
    for line in f:
        team_info = line.strip().split(',')
        teams.append(team_info[0])
        wins.append(int(team_info[2]))
        print('Team: ', teams[-1], ' \t Wins: ', wins[-1])

Team:  Arsenal  	 Wins:  26
Team:  Liverpool  	 Wins:  24
Team:  Manchester United  	 Wins:  24
Team:  Newcastle  	 Wins:  21
Team:  Leeds  	 Wins:  18
Team:  Chelsea  	 Wins:  17
Team:  West_Ham  	 Wins:  15
Team:  Aston_Villa  	 Wins:  12
Team:  Tottenham  	 Wins:  14
Team:  Blackburn  	 Wins:  12
Team:  Southampton  	 Wins:  12
Team:  Middlesbrough  	 Wins:  12
Team:  Fulham  	 Wins:  10
Team:  Charlton  	 Wins:  10
Team:  Everton  	 Wins:  11
Team:  Bolton  	 Wins:  9
Team:  Sunderland  	 Wins:  10
Team:  Ipswich  	 Wins:  9
Team:  Derby  	 Wins:  8
Team:  Leicester  	 Wins:  5


Notice in the code above, we needed to use the .split(',') command to split each line of team info (which was a single string) into a list of the team's info. We also needed to put "int" around each win to denote that we wanted it treated as an integer, not a string. 
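To see why the conversion matters, here is a small sketch that uses a single row copied from football.txt as a plain string (the variable names as_text and as_number are just for illustration):

```python
line = 'Arsenal,38,26,9,3,79,36,87\n'   # one row copied from football.txt

team_info = line.strip().split(',')      # a list of strings

as_text = team_info[2] + team_info[2]              # strings are glued together
as_number = int(team_info[2]) + int(team_info[2])  # integers are added

print(as_text)
print(as_number)
```

Without int, the plus sign joins the two strings instead of adding the two numbers.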

To find the average wins, we can use the following code. Notice that we want to access team_info[2] and header[2] since Wins is the third column in the list.

In [11]:
with open('football.txt') as f:
    header = f.readline().split(',')
    wins = []
    for line in f:
        team_info = line.strip().split(',')
        wins.append(int(team_info[2]))

print(header[2], sum(wins)/len(wins))

Wins 13.95


### Exercise - Football 1
Find the average number of losses. Print out the word "Losses" by referencing the correct header item and then the average number of losses.

In [None]:
#insert football 1

### Exercise - Football 2
Read the football file in. Use it to find the team that has the maximum wins. Then print "Max Wins: 26 by Arsenal" by referencing:
- the header to produce the word "Wins"
- the max(wins) to produce the number 26
- incorporating wins.index(max(wins)) to reference the team Arsenal

In [None]:
#insert football 2

### Exercise - Football 3
Print the team with the maximum number of losses using the same conditions as above.


In [None]:
#insert football 3

## Counting

Suppose we want to count the number of lines in the Kanye lyrics that contain the word "falls". In this case, we will want to break up each line into a list of words using the split command:

In [32]:
count = 0
for line in open('kanye.txt'):   
    words = line.strip().split()
    if 'falls' in words:
        count = count +1
        
print(count)

16
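Note that the loop above adds 1 for each line that contains the word, no matter how many times the word appears on that line. The sketch below, which uses three made-up lines instead of the lyrics file, shows the two kinds of count side by side:

```python
# Made-up lines for illustration; the second one contains the word twice.
lines = ['it all falls down', 'falls and falls again', 'no match here']

lines_with_word = 0
total_occurrences = 0
for line in lines:
    words = line.split()
    if 'falls' in words:
        lines_with_word += 1                    # one per matching line
    total_occurrences += words.count('falls')   # one per matching word

print(lines_with_word, total_occurrences)
```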


Suppose you wanted to print out which lines of the file contain "falls". In that case, we can use the enumerate function, which iterates through the items in a list and pairs each one with its index. Let's do an easier example first. Let's say I had a list of colors and I wanted to print each color and its index on a separate line. I would type:

In [3]:
colors = ['red', 'blue', 'yellow', 'blue', 'green']
for index, color in enumerate(colors):
    print(index, color)

0 red
1 blue
2 yellow
3 blue
4 green


Now, we can use enumerate to print the lines that "falls" is on. Keep in mind that the index starts at 0, so the first line of the file is line 0:

In [1]:
for index, line in enumerate(open('kanye.txt')):   
    words = line.strip().split()
    if 'falls' in words:
        print('falls is on line', index)


falls is on line 0
falls is on line 1
falls is on line 4
falls is on line 5
falls is on line 32
falls is on line 33
falls is on line 60
falls is on line 61
falls is on line 88
falls is on line 89
falls is on line 90
falls is on line 91
falls is on line 92
falls is on line 93
falls is on line 94
falls is on line 95


Okay, let's put everything together. Suppose we want to break up the kanye file into words. We'll make a dictionary of the words and their corresponding frequencies. Then, we'll print out the list of words in descending order of frequency. Let's do it:

In [6]:
#create the dictionary of words and frequencies:
word_dict = {}
for line in open('kanye.txt'):
    for word in line.split():
        word = word.lower()
        word_dict[word] = word_dict.get(word,0) + 1

#create a list of tuples to sort the words
word_list = []
for key,val in word_dict.items():
    word_list.append((val,key))

#sort the list from the highest count to the lowest
word_list.sort(reverse = True)

#each tuple is (count, word), so unpack in that order
for count,word in word_list:
    print(count, word)

30 it
20 all
17 down
16 i
16 falls
16 a
12 the
12 oh,
12 and
10 i'm
9 we
9 to
9 she
8 when
8 telling
8 buy
8 all,
7 you,
7 that
6 you
6 with
6 so
6 of
6 like
6 'cause
5 that's
5 her
5 can't
4 this
4 they
4 self
4 ohh,
4 in
4 even
4 but
4 ain't
3 up
3 she's
3 on
3 me
3 look
3 just
3 got
3 get
3 don't
3 do
3 be
2 won't
2 why
2 wanna
2 us
2 thou
2 things
2 then
2 some
2 seems
2 really
2 promise,
2 people
2 our
2 off
2 no
2 money
2 man
2 it,
2 how
2 hate
2 hair
2 had
2 for
2 f
2 conscious
2 before
2 at
2 act
2 'em
1 years,
1 workin'
1 without
1 will
1 white
1 what's
1 what
1 went
1 well
1 weave
1 wealth
1 we'a
1 way
1 watches
1 versace
1 us,
1 ugliest
1 trying
1 treat
1 three
1 than
1 terrific
1 tell
1 team
1 store
1 still
1 stay
1 spent
1 spending
1 specific
1 sophomore,
1 slave
1 single
1 shorty's
1 shirt
1 ship)
1 shine
1 shift
1 see
1 secure
1 screw
1 school
1 say
1 rollies
1 road
1 rings
1 riches
1 retail
1 pushing
1 pronounce
1 problem
1 prettiest
1 pressure
1 precious
1 police,
1 pi
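Notice that "oh," and "you," are counted with their commas attached, so "you," and "you" show up as different words. One possible way to tidy this up is sketched below. It is not the method used above: it swaps in Counter from the standard collections module and string.punctuation, and it runs on two lines copied from the lyrics rather than on the whole file:

```python
import string
from collections import Counter

# Two lines copied from the lyrics, used here instead of reading kanye.txt.
lines = ['Oh, when it all, it all falls down',
         "I'm telling you, oh, it all falls down"]

counts = Counter()
for line in lines:
    for word in line.split():
        # lowercase, then trim punctuation from both ends of the word
        cleaned = word.lower().strip(string.punctuation)
        if cleaned:
            counts[cleaned] += 1

# most_common returns (word, count) pairs from the highest count down
for word, count in counts.most_common(3):
    print(count, word)
```

Only punctuation at the ends of a word is removed, so a word like "i'm" keeps its apostrophe.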

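What would have happened if I had created the list of tuples in the order (key,val) instead of (val,key)? The list would have been sorted by word instead of by count. Since we sort with reverse = True, the words come out in reverse alphabetical order: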

In [7]:
#create a list of tuples to sort the words
word_list = []
for key,val in word_dict.items():
    word_list.append((key,val))

#print the list in reverse order
word_list.sort(reverse = True)

for key,value in word_list:
    print(key, value)

you, 7
you 6
years, 1
workin' 1
won't 2
without 1
with 6
will 1
why 2
white 1
when 8
what's 1
what 1
went 1
well 1
weave 1
wealth 1
we'a 1
we 9
way 1
watches 1
wanna 2
versace 1
us, 1
us 2
up 3
ugliest 1
trying 1
treat 1
to 9
three 1
thou 2
this 4
things 2
they 4
then 2
the 12
that's 5
that 7
than 1
terrific 1
telling 8
tell 1
team 1
store 1
still 1
stay 1
spent 1
spending 1
specific 1
sophomore, 1
some 2
so 6
slave 1
single 1
shorty's 1
shirt 1
ship) 1
shine 1
shift 1
she's 3
she 9
self 4
seems 2
see 1
secure 1
screw 1
school 1
say 1
rollies 1
road 1
rings 1
riches 1
retail 1
really 2
pushing 1
pronounce 1
promise, 2
problem 1
prettiest 1
pressure 1
precious 1
police, 1
picked 1
people 2
peer 1
past 1
pass 1
pasha's 1
park 1
parents 1
paper, 1
pairs 1
paid 1
out, 1
out 1
ourself 1
our 2
ones 1
one 1
on 3
ohh, 4
oh, 12
off 2
of 6
now, 1
now 1
nothing, 1
no 2
new 1
need 1
named 1
my 1
money 2
me 3
man, 1
man 2
make 1
majored 1
major 1
made 1
lowest 1
low 1
love 1
lot 1
looked 1
look 3
l

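What would have happened if we had forgotten .split()? Looping over a string gives one character at a time, so it would have counted the frequency of characters (including spaces and newlines) instead of words: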

In [8]:
#create the dictionary of words and frequencies:
word_dict = {}
for line in open('kanye.txt'):
    for word in line:
        word_dict[word] = word_dict.get(word,0) + 1

#create a list to sort the words
word_list = []
for key,val in word_dict.items():
    word_list.append((val,key))

#print the list in reverse order
word_list.sort(reverse = True)

for key,value in word_list:
    print(key, value)

524  
238 e
189 a
179 t
177 l
168 o
142 s
131 n
131 h
129 i
100 

81 r
75 d
71 u
63 w
55 c
53 '
52 ,
49 f
44 y
40 m
39 g
36 p
30 I
27 b
21 k
13 v
10 T
8 O
7 W
6 S
6 C
6 A
5 j
5 B
4 0
3 z
3 J
2 P
2 M
2 E
2 4
2 "
1 x
1 V
1 R
1 N
1 F
1 D
1 ?
1 6
1 5
1 2
1 1
1 )
1 (


### Exercise - Dance 1
Write a program that reads the file dance.txt. It should print the words and counts in descending order of frequency.

In [60]:
#insert dance code

### Exercise - Dance 2
Print out the words and their counts in alphabetically ascending order.

In [61]:
#insert dance again code

### Exercise - Dance 3
Write a program that prints out the LYRICS that contain any variation of the word "present", such as "Present" or "presently".

In [6]:
#insert dance again again code

### Exercise - Dance 4
Write a program that prints out the LINE NUMBERS of the lyrics that contain any variation of the word "present", such as "Present" or "presently".

In [None]:
#insert dance 4

### Exercise - find kanye
Suppose we wanted to create a program that doesn't check whether "kanye" is a word in the line but checks whether the letters in that line could form the word "Kanye." For example, the line "But she won't drop out, her parents will look at her funny" contains the letters k, a, n, y, and e. Write a program that prints the locations of the lines that form the word "Kanye" in the kanye.txt file.

In [None]:
#insert kanye

Delimiters
---
<a class="anchor" id="delimiters"></a>

Up until now, we have mostly been breaking up sentences into words using .split() with the parentheses empty. However, we can choose to break up a line by any delimiter we want. Consider, for example, the students.txt file:

In [34]:
for line in open('students.txt'):
    print(line.strip())

Jane Doe, 2000, 101 Main St
John Doe, 2001, 123 Oak St
Ann Ko, 1999, 57 Tree St
Paul Smith, 2000, 60 Spring St
Sarah McDonald, 2001, 101 MLK Blvd


In this case, we want to break up each line by commas. To do this, type:

In [41]:
for line in open('students.txt'):
    words = line.strip().split(',')
    print(words)

['Jane Doe', ' 2000', ' 101 Main St']
['John Doe', ' 2001', ' 123 Oak St']
['Ann Ko', ' 1999', ' 57 Tree St']
['Paul Smith', ' 2000', ' 60 Spring St']
['Sarah McDonald', ' 2001', ' 101 MLK Blvd']


We could then keep separate name, birth year, and address lists by typing:


In [42]:
name_list = []
birth_year_list = []
address_list = []
for line in open('students.txt'):
    words = line.strip().split(',')
    name_list.append(words[0])
    birth_year_list.append(words[1])
    address_list.append(words[2])
print(name_list)
print(birth_year_list)
print(address_list)

['Jane Doe', 'John Doe', 'Ann Ko', 'Paul Smith', 'Sarah McDonald']
[' 2000', ' 2001', ' 1999', ' 2000', ' 2001']
[' 101 Main St', ' 123 Oak St', ' 57 Tree St', ' 60 Spring St', ' 101 MLK Blvd']


Notice that there is an extra space at the start of each birth year and address, so we may want to strip it:

In [9]:
name_list = []
birth_year_list = []
address_list = []
for line in open('students.txt'):
    words = line.strip().split(',')
    name_list.append(words[0].strip())
    birth_year_list.append(words[1].strip())
    address_list.append(words[2].strip())
print(name_list)
print(birth_year_list)
print(address_list)

['Jane Doe', 'John Doe', 'Ann Ko', 'Paul Smith', 'Sarah McDonald']
['2000', '2001', '1999', '2000', '2001']
['101 Main St', '123 Oak St', '57 Tree St', '60 Spring St', '101 MLK Blvd']
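The stripping can also be done for every field at once. Here is a sketch on a single row copied from students.txt; the list comprehension and the conversion of the birth year to an integer are additions for illustration:

```python
line = 'Jane Doe, 2000, 101 Main St\n'   # one row copied from students.txt

# split on commas, then strip the whitespace from each piece
fields = [field.strip() for field in line.split(',')]

name, birth_year, address = fields   # unpack the three pieces
birth_year = int(birth_year)         # convert so we can do arithmetic with it
print(name, birth_year, address)
```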


When searching through a file, we may only care about lines that begin with certain strings. For example, consider this file:

In [45]:
for line in open('mailbox.txt'):
    print(line.strip())

From: janedoe@gmail.com Sat Jan 5 2008
To: jackdoe@aol.com
Subject: Saturday Party

From: annesmith@gmail.com Sun Jan 6 2008
To: bobpaul@amazon.com
Subject: I’m mad at you

From: jackmac@mac.com Mon Jan 7 2008
To: catdancy@gmail.com
Subject: Not safe for work

From: paullauren@gmail.com Tues Jan 9 2008
To: mikejoy@aol.com
Subject: For sale


Suppose we only wanted to save the email addresses of the people who sent the emails (in the From: lines). We could use the command .startswith():

In [47]:
for line in open('mailbox.txt'):
    if line.startswith('From:'):
        print(line)

From: janedoe@gmail.com Sat Jan 5 2008

From: annesmith@gmail.com Sun Jan 6 2008

From: jackmac@mac.com Mon Jan 7 2008

From: paullauren@gmail.com Tues Jan 9 2008



Then, we could save those email addresses by breaking up each line into words and saving the second word:

In [50]:
names=[]
for line in open('mailbox.txt'):
    if line.startswith('From:'):
        words = line.split()
        names.append(words[1])
print(names)

['janedoe@gmail.com', 'annesmith@gmail.com', 'jackmac@mac.com', 'paullauren@gmail.com']


Or more succinctly:

In [54]:
names=[]
for line in open('mailbox.txt'):
    if line.startswith('From:'):
        names.append(line.split()[1])
print(names)

['janedoe@gmail.com', 'annesmith@gmail.com', 'jackmac@mac.com', 'paullauren@gmail.com']


Suppose you only wanted to save the username part of the email addresses, to the left of the @ sign. You could use a delimiter again:

In [59]:
for line in open('mailbox.txt'):
    if line.startswith('From:'):
        words = line.split()[1].split('@')
        print(words[0])

janedoe
annesmith
jackmac
paullauren


Or more succinctly:

In [57]:
for line in open('mailbox.txt'):
    if line.startswith('From:'):
        print(line.split()[1].split('@')[0])

janedoe
annesmith
jackmac
paullauren


### Exercise - sports 1
Open the file sports.txt. Break up each line by the delimiter "-". Then print a list of what each student plays in the spring. For example, each line should say something like "Brenda plays track in the spring."

In [62]:
#insert sports code

### Exercise - sports 2
Print a count of how many students are enrolled in each sport. For example, your output will contain something like "2 students play soccer"

In [63]:
#insert sports again code

### Exercise - Football 5
Break up the football.txt data using comma delimiters. Store the team names in a list. Make sure to skip the header.

In [11]:
#insert football

### Exercise - Football 6
Store the football.txt data as a list of lists for the different team data. Then, find the team that has the minimum absolute value difference between their goals and goals allowed. (Hint: the answer should be Aston Villa.)

In [1]:
#insert football 2

CSV Files
---
<a class="anchor" id="csv"></a>
The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. For example, you can always store an Excel or Google sheet in CSV format. We can read them as follows:

In [1]:
import csv

with open('faculty.csv') as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    for row in data:
        print(row)

['name', ' degree', ' title', ' email']
['Scarlett L. Bellamy', ' Sc.D.', 'Associate Professor of Biostatistics', 'bellamys@mail.med.upenn.edu']
['Warren B. Bilker', 'Ph.D.', 'Professor of Biostatistics', 'warren@upenn.edu']
['Matthew W Bryan', ' PhD', 'Assistant Professor of Biostatistics', 'bryanma@upenn.edu']
['Jinbo Chen', ' Ph.D.', 'Associate Professor of Biostatistics', 'jinboche@upenn.edu']
['Susan S Ellenberg', ' Ph.D.', 'Professor of Biostatistics', 'sellenbe@upenn.edu']
['Jonas H. Ellenberg', ' Ph.D.', 'Professor of Biostatistics', 'jellenbe@mail.med.upenn.edu']
['Rui Feng', ' Ph.D', 'Assistant Professor of Biostatistics', 'ruifeng@upenn.edu']
['Benjamin C. French', ' PhD', 'Associate Professor of Biostatistics', 'bcfrench@mail.med.upenn.edu']
['Phyllis A. Gimotty', ' Ph.D', 'Professor of Biostatistics', 'pgimotty@upenn.edu']
['Wensheng Guo', ' Ph.D', 'Professor of Biostatistics', 'wguo@mail.med.upenn.edu']
['Yenchih Hsu', ' Ph.D.', 'Assistant Professor of Biostatistics', 'hs

Notice that the first row is a header. If you wanted to skip it, you could type:

In [2]:
import csv

with open('faculty.csv') as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    next(data, None)    # skip the header row
    for row in data:
        print(row)

['Scarlett L. Bellamy', ' Sc.D.', 'Associate Professor of Biostatistics', 'bellamys@mail.med.upenn.edu']
['Warren B. Bilker', 'Ph.D.', 'Professor of Biostatistics', 'warren@upenn.edu']
['Matthew W Bryan', ' PhD', 'Assistant Professor of Biostatistics', 'bryanma@upenn.edu']
['Jinbo Chen', ' Ph.D.', 'Associate Professor of Biostatistics', 'jinboche@upenn.edu']
['Susan S Ellenberg', ' Ph.D.', 'Professor of Biostatistics', 'sellenbe@upenn.edu']
['Jonas H. Ellenberg', ' Ph.D.', 'Professor of Biostatistics', 'jellenbe@mail.med.upenn.edu']
['Rui Feng', ' Ph.D', 'Assistant Professor of Biostatistics', 'ruifeng@upenn.edu']
['Benjamin C. French', ' PhD', 'Associate Professor of Biostatistics', 'bcfrench@mail.med.upenn.edu']
['Phyllis A. Gimotty', ' Ph.D', 'Professor of Biostatistics', 'pgimotty@upenn.edu']
['Wensheng Guo', ' Ph.D', 'Professor of Biostatistics', 'wguo@mail.med.upenn.edu']
['Yenchih Hsu', ' Ph.D.', 'Assistant Professor of Biostatistics', 'hsu9@mail.med.upenn.edu']
['Rebecca A Hubb
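Some of the values above start with a stray space (for example ' degree' and ' Ph.D.'). The csv module has a skipinitialspace option that drops whitespace right after each delimiter, and a DictReader that lets us look up values by column name. The sketch below assumes a made-up two-line file (faculty_demo.csv, with an invented row) that has the same four columns as faculty.csv:

```python
import csv
import os
import tempfile

# Hypothetical file with the same four columns as faculty.csv.
demo_path = os.path.join(tempfile.mkdtemp(), 'faculty_demo.csv')
with open(demo_path, 'w', newline='') as csvfile:
    csvfile.write('name, degree, title, email\n')
    csvfile.write('Jane Doe, Ph.D., Professor of Biostatistics, jane@example.edu\n')

rows = []
with open(demo_path, newline='') as csvfile:
    # DictReader uses the first row as the keys for every later row
    data = csv.DictReader(csvfile, skipinitialspace=True)
    for row in data:
        rows.append(row)
        print(row['name'], '-', row['degree'])
```

Because the header row supplies the keys, there is no need to skip it with next.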

If we wanted to write to a csv file, we could type:

In [5]:
import csv

emails = ['janedoe@gmail.com', 'jackdoe@amazon.com', 'sallysmith@aol.com']
with open('emails.csv', 'w', newline='') as csvfile:
    myfile = csv.writer(csvfile)
    myfile.writerow(['list_of_emails'])    # header row
    for email in emails:
        myfile.writerow([email])           # one email address per row

Notice that we needed to put each string that we wanted to write to the csv file inside brackets. Otherwise, writerow would treat the string as a sequence of characters and write each letter as its own field, separated by commas.
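We can check this without creating a file. The sketch below writes to an in-memory buffer (io.StringIO) as a stand-in for a csv file, once with a bare string and once with the string inside brackets:

```python
import csv
import io

buffer = io.StringIO()        # an in-memory stand-in for a csv file
writer = csv.writer(buffer)

writer.writerow('abc')        # a bare string: each character becomes a field
writer.writerow(['abc'])      # a one-item list: the whole string is one field

written = buffer.getvalue()
print(written)
```

The first row comes out as a,b,c and the second as abc.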

### Exercise - Degrees

Write a program that reads in faculty.csv and creates a dictionary of each degree (standardized to not include periods) and the count of each degree. 

Your program should print: {'0': 1, 'PhD': 31, 'MPH': 2, 'ScD': 6, 'MS': 2, 'BSEd': 1, 'MA': 1, 'MD': 1, 'JD': 1}

Hint: One way to get rid of periods is to use .replace('.' , '')

In [None]:
# insert degrees

### Exercise - Title
Write a program that reads in faculty.csv and creates a dictionary of each title and count. 

Your program should print: {'Professor of Biostatistics': 13, 'Assistant Professor of Biostatistics': 12, 'Associate Professor of Biostatistics': 12}

Note: You'll need to account for a typo in the csv file by using "replace". Note: this is called data cleaning and is a HUUUUUUGE (not necessarily fun) part of data science.

In [None]:
#insert title

### Exercise - email
Write a program that reads in faculty.csv and creates a unique list of the domain names (after the "@" symbol in the email address). Your program should print: {'cceb.med.upenn.edu', 'email.chop.edu', 'upenn.edu', 'mail.med.upenn.edu'}

In [None]:
#insert email

### Exercise - last name
Write a program that reads in faculty.csv and creates a dictionary such that the key is the last name and the value is a list of the degree, title, and email. Be careful to account for duplicate last names. For example:

'Bellamy': [[' Sc.D.', 'Associate Professor of Biostatistics', 'bellamys@mail.med.upenn.edu']]

and

'Li': [[' Ph.D.', 'Assistant Professor of Biostatistics', 'liy3@email.chop.edu'], [' Ph.D.', 'Associate Professor of Biostatistics', 'mingyao@mail.med.upenn.edu'], [' Ph.D', 'Professor of Biostatistics', 'hongzhe@upenn.edu']]

In [None]:
#insert last name

### Exercise - tuple
Write a program that reads in faculty.csv and creates a dictionary such that the key is a tuple of the name and the value is the list of the degree, title, and email. You can assume each tuple name is unique. For example:

('Benjamin', 'C.', 'French'): [' PhD',
  'Associate Professor of Biostatistics',
  'bcfrench@mail.med.upenn.edu']
  
  
 ('Dawei', 'Xie'): [' PhD',
  'Assistant Professor of Biostatistics',
  'dxie@upenn.edu']

In [None]:
#insert tuple

### Exercise - write names

Write a program that reads in faculty.csv and writes the names of the professors to a file called names.csv. You should include a header in the file such that the first line says "Professor Names".

Note: to check that you did it correctly, you can always read your "names.csv" file back in to view it.

In [None]:
#insert write names

File Output
---
<a class="anchor" id="output"></a>

To write a file, you have to open it with mode 'w' as a second parameter:

In [75]:
fout = open('output.txt', 'w')
print(fout)

<_io.TextIOWrapper name='output.txt' mode='w' encoding='UTF-8'>


If the file already exists, opening it in write mode clears out the old data and starts fresh, so be careful! If the file doesn’t exist, a new one is created.

The write method of the file handle object puts data into the file. The file object keeps track of where it is, so if you call write again, it adds the new data to the end.

When you are done writing, you have to close the file to make sure that the last bit of data is physically written to the disk so it will not be lost if the power goes off.

In [1]:
fout = open('output.txt', 'w')

line1 = "Oh, when it all, it all falls down,\n"
fout.write(line1)

line2 = "I'm telling you, oh, it all falls down,\n"
fout.write(line2)

fout.close()

Note in the above code that we had to add newline characters. The print function automatically appends a newline, but the write method does not. If you want each sentence to be on a different line, you'll need to add a "\n".

To make sure that we actually wrote to that file, we can open it and read it:


In [2]:
for line in open('output.txt'):
    print(line)

Oh, when it all, it all falls down,

I'm telling you, oh, it all falls down,
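Since print adds the newline for us, another option is to point print at the file with its file argument. A sketch, assuming a throwaway file named print_demo.txt in a temporary folder so that output.txt is left alone:

```python
import os
import tempfile

# Hypothetical throwaway file in a temporary folder.
demo_path = os.path.join(tempfile.mkdtemp(), 'print_demo.txt')

fout = open(demo_path, 'w')
print('Oh, when it all, it all falls down,', file=fout)   # print adds the newline
fout.write("I'm telling you, oh, it all falls down,\n")     # write needs it by hand
fout.close()

with open(demo_path) as fin:
    lines = fin.readlines()
print(lines)
```

Both lines end up in the file with a newline at the end.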



A good way to ensure that the file is properly closed is to use the "with" keyword. The advantage is that the file is properly closed after its suite finishes, so you don't need to remember to put "f.close()" at the end:

In [3]:
with open('output.txt', 'w') as fout:

    line1 = "Oh, when it all, it all falls down,\n"
    fout.write(line1)

    line2 = "I'm telling you, oh, it all falls down,\n"
    fout.write(line2)

If we want to clear that file and write the numbers 1 through 10 instead, we can type:

In [6]:
fout = open('output.txt', 'w')
for i in range(1,11):
    fout.write('Number:'+str(i)+'\n')
fout.close()

Once again, let's check our work:

In [7]:
for line in open('output.txt'):
    print(line)

Number:1

Number:2

Number:3

Number:4

Number:5

Number:6

Number:7

Number:8

Number:9

Number:10



Note: we needed to use plus signs instead of commas. Unlike print, write takes exactly one string, so we cannot pass it several pieces separated by commas:

In [84]:
fout = open('output.txt', 'w')
for i in range(1,11):
    fout.write('Number:',str(i),'\n')
fout.close()

TypeError: write() takes exactly one argument (3 given)
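Besides plus signs, we can build that single string with an f-string. The sketch below writes to an in-memory buffer (io.StringIO) as a stand-in for the file, and only goes up to 3 to keep it short:

```python
import io

fout = io.StringIO()    # in-memory stand-in for a file opened with 'w'
for i in range(1, 4):
    # the f-string turns i into text and builds one string for write
    fout.write(f'Number:{i}\n')

result = fout.getvalue()
print(result)
```

With an f-string there is no need to call str(i) ourselves.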

Suppose we wanted to continue adding the numbers 11 through 20 to the same text file. If we open the file in write mode again, denoted by "w", we'll end up clearing out the numbers 1-10 and replacing them with 11-20. If we instead open it in append mode, denoted by "a", the new numbers are added after the existing ones, so the file ends up with all of the numbers 1-20:

In [8]:
fout = open('output.txt', 'a')
for i in range(11,21):
    fout.write('Number:'+str(i)+'\n')
fout.close()

Let's check:

In [9]:
for line in open('output.txt'):
    print(line)

Number:1

Number:2

Number:3

Number:4

Number:5

Number:6

Number:7

Number:8

Number:9

Number:10

Number:11

Number:12

Number:13

Number:14

Number:15

Number:16

Number:17

Number:18

Number:19

Number:20
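The difference between the two modes is easy to see on a tiny file. This sketch assumes a throwaway file named modes_demo.txt in a temporary folder:

```python
import os
import tempfile

# Hypothetical throwaway file in a temporary folder.
demo_path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(demo_path, 'w') as fout:
    fout.write('first\n')
with open(demo_path, 'w') as fout:    # 'w' again: 'first' is erased
    fout.write('second\n')
with open(demo_path, 'a') as fout:    # 'a': 'second' is kept
    fout.write('third\n')

with open(demo_path) as fin:
    contents = fin.read()
print(contents)
```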



### Exercise - multiples
Write a program that stores the first 100 multiples of 7, each on a different line, in a file called 7.

In [85]:
#insert multiples

### Exercise - kanye
Write a program that calculates Kanye's most used words in kanye.txt and prints them to a file, in descending order of frequency.


In [86]:
#insert kanye

### Exercise - calendar app
Write a program that takes in a date in the form "MM-DD-YY" and a reminder for that day. For example, a user might input "09-27-17" and "Get Lauren a Birthday Present." The program should add this information to the file calendar.txt each time the user calls the program. One caveat: if the user enters a date that is already in the file, then that reminder should be inserted in the right spot, rather than at the end of the file. For example, two reminders for the date "09-27-17" should be next to each other. 

Hint: to do this, google "Insert line at middle of file with Python" and read the StackOverflow article. 

You don't need to worry about putting all of the dates in chronological order just yet. We'll do that in another program later when we get to the datetime module. You can google it now if you are interested.

In [None]:
#insert calendar app

Files in Other Locations
---
<a class="anchor" id="locations"></a>

We have been reading files that are located in the directory of this program. If we needed to search somewhere else for the file, we would need to add a bit more to our file path name. 

On a Mac, if your username was janedoe and you wanted to print from a "hello.txt" file located in your Documents folder, you would type:

In [None]:
for line in open('/Users/janedoe/Documents/hello.txt'):
    print(line)

On a PC, you would type:

In [None]:
for line in open('C:/Users/janedoe/Documents/hello.txt'):
    print(line)

Of course, neither of these will work, since we don't have a file called "hello.txt" located there.
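If we do not want to hard-code the username or worry about Mac versus PC, one possible approach (not the one used in this unit) is the standard pathlib module, where Path.home() gives the current user's home folder. A sketch that checks for the file before trying to read it:

```python
from pathlib import Path

# Path.home() is the current user's home folder on Mac, Windows, or Linux.
hello_path = Path.home() / 'Documents' / 'hello.txt'
print(hello_path)

if hello_path.exists():
    with open(hello_path) as f:
        for line in f:
            print(line.strip())
else:
    print('No hello.txt found there.')
```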

There's an easier way to build the file path than typing it all in, and that involves using the os module.

To print the working directory of this assignment (meaning the folder that this file lies in), we can type:

In [11]:
import os

print(os.getcwd())

/Users/shareshianl/Documents/CS/Unit 7 - File Input Output


Suppose we wanted to ask the user for the name of a file that they want to create within this working directory, and then print a full file path for where this file is located. We could type:

In [15]:
import os

file_name = input('What file name do you want to create? ')
file_path = os.getcwd()+'/'+file_name
print(file_path)

What file name do you want to create? turtles.txt
/Users/shareshianl/Documents/CS/Unit 7 - File Input Output/turtles.txt
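Gluing the pieces together with '/' works on a Mac. A more portable sketch uses os.path.join, which picks the right separator for whichever system the program runs on. Here the file name is written directly in the code instead of being asked for:

```python
import os

file_name = 'turtles.txt'    # stands in for the user's answer
file_path = os.path.join(os.getcwd(), file_name)
print(file_path)
```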


Suppose we wanted to ask the user for a file. If that file already exists, we want to add the user's words to the file. If the file doesn't exist, we want to create a new one. We need to import the os module to check whether the file exists:

In [16]:
import os

filename = input('What file do you want to add words to? ')
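say = input('What do you want to say? ')

if os.path.exists(filename):
    # the file is already there, so add to the end of it
    with open(filename, 'a') as f:
        f.write(say+'\n')
else:
    # no such file yet, so create it (with closes it for us)
    with open(filename, 'w') as f:
        f.write(say+'\n')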

What file do you want to add words to? journal.txt
What do you want to say? This is another journal entry!
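Append mode also creates the file when it does not exist yet, so for this particular task the check is not strictly needed. A minimal sketch, assuming a throwaway file named journal_demo.txt in a temporary folder and two made-up entries in place of user input:

```python
import os
import tempfile

# Hypothetical file that does not exist yet.
filename = os.path.join(tempfile.mkdtemp(), 'journal_demo.txt')

for say in ['First entry', 'Second entry']:
    # 'a' creates the file the first time and appends every time after that
    with open(filename, 'a') as f:
        f.write(say + '\n')

with open(filename) as f:
    entries = f.read()
print(entries)
```

The check with os.path.exists is still useful when the two cases need different handling, such as printing a different message for a new file.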


What if you wanted to find the full path name of the file? You would type:

In [19]:
import os

filename = input('What file are you looking for? ')

if os.path.exists(filename):
    print(os.path.abspath(filename))
else:
    print('That file does not exist.')

What file are you looking for? journal.txt
/Users/shareshianl/Documents/CS/Unit 7 - File Input Output/journal.txt


### Exercise - Fave Song

Use the os module to create a file path that includes your current working directory and the name of your favorite song in the file name, such as "Ignition.txt".

Write a few lyrics to the song on different lines.


In [20]:
#insert fave song

### Exercise - Fave Song 2

Let's crowdsource this lyric builder. Ask the user which song they want to add lyrics to. If they say something other than "Ignition", then create a new file for that song and have the user write lyrics to that song. If they say "Ignition", then have the user append more lyrics to the Ignition.txt file without overwriting the lyrics you've already saved there.

In [None]:
#insert Fave Song 2