# Problem:
Count the number of times that each of the names in 'names.txt' appears in 'list.txt'.

Retain the order of names as they appear in 'names.txt'

#### --------------------------------

# Basic file i/o, & some data exploration

#### Basic way to open a file in Python: 

In [1]:
with open('names.txt') as f:
    # readlines() reads the whole file into a list, where each line in the file becomes an item in the list
    list_of_names = f.readlines() 

In [2]:
list_of_names

['OLIVER\n',
 'CHARLOTTE\n',
 'LIAM\n',
 'AMELIA\n',
 'BENJAMIN\n',
 'ARIA\n',
 'OWEN\n',
 'OLIVIA\n',
 'JACKSON\n',
 'SCARLETT\n',
 'HENRY\n',
 'AVA\n',
 'DECLAN\n',
 'VIOLET\n',
 'ETHAN\n',
 'NORA\n',
 'NOAH\n',
 'EMMA\n',
 'ALEXANDER\n',
 'AURORA\n',
 'FINN\n',
 'SOPHIA\n',
 'ELIJAH\n',
 'AUDREY\n',
 'CALEB\n',
 'ELLA\n',
 'GRAYSON\n',
 'GRACE\n',
 'EMMETT\n',
 'LILY\n',
 'ELLIOT\n',
 'HARPER\n',
 'GABRIEL\n',
 'VIVIEN\n',
 'AIDEN\n',
 'ABIGAIL\n',
 'LUCAS\n',
 'ISLA\n',
 'LEVI\n',
 'LUCY\n']

In [3]:
with open('names.txt') as f:
    # we can also iterate through the lines of the file object (which we've named 'f')
    # without reading the entirety of the file into memory
    for line in f: 
        print(line)

OLIVER

CHARLOTTE

LIAM

AMELIA

BENJAMIN

ARIA

OWEN

OLIVIA

JACKSON

SCARLETT

HENRY

AVA

DECLAN

VIOLET

ETHAN

NORA

NOAH

EMMA

ALEXANDER

AURORA

FINN

SOPHIA

ELIJAH

AUDREY

CALEB

ELLA

GRAYSON

GRACE

EMMETT

LILY

ELLIOT

HARPER

GABRIEL

VIVIEN

AIDEN

ABIGAIL

LUCAS

ISLA

LEVI

LUCY



#### Notes:
We can see here that each of the names that we're going to check has a newline appended to it. We will need to remove these newlines, as they are not actually a part of the names.

Additionally, we will want to lowercase these capital letters. (standard text processing procedure to ensure that we are comparing apples to apples, so to speak)
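
As a quick illustration of the cleanup described above, here is a minimal sketch on a single hard-coded line. The variable names `raw` and `cleaned` are only for illustration and are not used elsewhere in this notebook.

```python
# a sample line, as it would come out of readlines()
raw = 'OLIVER\n'

# strip() with no arguments removes leading and trailing whitespace, including '\n'
# lower() returns a lowercased copy of the string
cleaned = raw.strip().lower()

print(repr(raw))      # 'OLIVER\n'
print(repr(cleaned))  # 'oliver'
```

Neither method changes `raw` itself: Python strings are immutable, so each call returns a new string.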

# Initial solution: suboptimal performance, but solves problem

#### What we're going to do:

Read the names from 'names.txt' into a list, so that we can check them against the contents of 'list.txt'. Use a dictionary to store the number of times that each of these names appears in the list.txt file. 

#### Program structure in more detail:

We do not need to retain the contents of 'list.txt' - we can actually just iterate through its lines and 
check each to see if a given line is one of the names we're looking for. In this case, we will achieve better performance this way than if we were to read the entire contents of 'list.txt' into memory.

Could we do this the other way around? That is, store 'list.txt' and iterate through the lines of 'names.txt'? We
could indeed! But as we will see below, 'list.txt' has many more lines than 'names.txt', so we will achieve better
performance by storing 'names.txt' and just iterating through the lines of 'list.txt'.

Could we just iterate through the lines of both files to avoid having to store anything? Not efficiently. With nothing stored, we would have to re-read one file from the start for every line of the other. 
So we store the contents of at least one of the files, which also gives us a place to keep track of the number of times we've seen each name.
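
The shape of this approach can be sketched with small in-memory stand-ins for the two files. Everything here is an assumption made for illustration: `small_file` and `large_file` are `io.StringIO` objects with made-up contents, so the sketch runs without 'names.txt' or 'list.txt'.

```python
import io

# hypothetical stand-ins for the two files, so the sketch runs on its own
small_file = io.StringIO("OLIVER\nLIAM\n")
large_file = io.StringIO("a\nliam\noliver\nliam\nzebra\n")

# store the small file in memory...
wanted = [line.strip().lower() for line in small_file]

# ...and stream the large one, holding only one line at a time
counts = {}
for line in large_file:
    word = line.strip().lower()
    if word in wanted:
        counts[word] = counts.get(word, 0) + 1

print(counts)
```

Only `wanted` and `counts` stay in memory, and both are bounded by the size of the small file.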

### Getting names to check

#### Use a list to store names from names.txt:

In [4]:
# this creates a new, empty list
names_take_two = [] 

#this opens 'names.txt', and names our file object (our pathway to the data in the file) 'f'
with open('names.txt') as f:  
    
    for line in f: # means that you will move through each item in f (our open file)
        
        # for each item: 
        # .strip() - removes leading and trailing whitespace, including the newline
        # .lower() - lowercases the letters in each name
        # .append() - adds the item to the end of our list, which we have called names_take_two
        names_take_two.append(line.strip().lower()) 

In [5]:
names_take_two

['oliver',
 'charlotte',
 'liam',
 'amelia',
 'benjamin',
 'aria',
 'owen',
 'olivia',
 'jackson',
 'scarlett',
 'henry',
 'ava',
 'declan',
 'violet',
 'ethan',
 'nora',
 'noah',
 'emma',
 'alexander',
 'aurora',
 'finn',
 'sophia',
 'elijah',
 'audrey',
 'caleb',
 'ella',
 'grayson',
 'grace',
 'emmett',
 'lily',
 'elliot',
 'harper',
 'gabriel',
 'vivien',
 'aiden',
 'abigail',
 'lucas',
 'isla',
 'levi',
 'lucy']

In [6]:
len(names_take_two)

40

### Examination of the list of words we're checking our names against

#### Iterate using a for loop through only the first 10 words in 'list.txt':

In [7]:
with open('list.txt') as f:
    
    # we create a variable to serve as a counter and call it i
    i = 0 # we start i at 0
    for line in f:
        
        i += 1 # every time we go into a new line in f, we add 1 to i (i += 1)
        print(line) # print the line
        
        if i == 10: # we check to see if i is equal to 10 
            break # if so, we exit the for loop

a

aah

aahed

aahing

aahs

aardvark

aardvarks

aardwolf

ab

abaci



#### Notes:

These words also have newline characters appended to them, so we will want to remove those. Additionally, though these appear to be lowercased, we will force that just to be safe. 
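
As an aside, peeking at the first few lines of a file can also be done without a manual counter. This is a minimal sketch using `itertools.islice` from the standard library. The `sample` object is a hypothetical stand-in for 'list.txt' holding only the first five words shown above.

```python
import io
from itertools import islice

# hypothetical stand-in for list.txt, so the sketch runs on its own
sample = io.StringIO("a\naah\naahed\naahing\naahs\n")

# islice(iterable, n) yields at most the first n items, then stops
first_three = [line.strip() for line in islice(sample, 3)]

print(first_three)  # ['a', 'aah', 'aahed']
```

Like the loop with `break`, this stops reading after the requested number of lines.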

### Additional check to get length (number of items) of list.txt

#### Simple 'for' loop:

In [8]:
with open('list.txt') as f:
    
    # we create a variable to serve as a counter and call it i
    i = 0 # we start i at 0
    for line in f:
        i += 1 # every time we go into a new line in f, we add 1 to i (i += 1)

In [9]:
i

110401

#### Notes:

We can see that the list.txt file has many more items than the names.txt file.
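
The same line count can be written more compactly with a generator expression. This is only a sketch: `sample` is a hypothetical three-line stand-in for 'list.txt'.

```python
import io

# hypothetical stand-in for list.txt, so the sketch runs on its own
sample = io.StringIO("a\naah\naahed\n")

# add up a 1 for every line, without keeping any of the lines in memory
line_count = sum(1 for _ in sample)

print(line_count)  # 3
```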

### Get counts

#### What we're going to do:

1) create empty dictionary 'name_counts', to be structured as - {name: count}

2) open list.txt

3) iterate through lines

4) run strip() and lower() on each line and save the result as new variable 'potential_name'

5) check to see if potential_name is in names_take_two

6) if not, continue to next line. but if so:

7) check if potential_name has already been counted. if so:

8) increase count of this name by 1. if this is the first time this name has been seen:

9) create a new key in name_counts with value 1


#### Checking names against the contents of 'list.txt':

In [11]:
name_counts = {} # create a new dictionary
with open('list.txt') as f: # open 'list.txt'
    for line in f: # iterate through lines in f
        potential_name = line.strip().lower() # strip whitespace, lowercase, and call it 'potential_name'
        if potential_name in names_take_two: # is this name one of the names we want to count? 
            if potential_name in name_counts: # is this name a key in our dictionary?
                name_counts[potential_name] += 1 # if so, increase its value by 1
            else:
                name_counts[potential_name] = 1 # otherwise, create it as a key, and set its value to 1

In [12]:
name_counts

{'abigail': 37,
 'aiden': 35,
 'alexander': 20,
 'amelia': 4,
 'aria': 7,
 'audrey': 24,
 'aurora': 21,
 'ava': 12,
 'benjamin': 6,
 'caleb': 25,
 'charlotte': 5,
 'declan': 13,
 'elijah': 24,
 'ella': 26,
 'elliot': 31,
 'emma': 18,
 'emmett': 29,
 'ethan': 15,
 'finn': 22,
 'gabriel': 34,
 'grace': 29,
 'grayson': 27,
 'harper': 33,
 'henry': 12,
 'isla': 38,
 'jackson': 10,
 'levi': 40,
 'liam': 6,
 'lily': 31,
 'lucas': 37,
 'lucy': 41,
 'noah': 18,
 'nora': 16,
 'oliver': 3,
 'olivia': 9,
 'owen': 7,
 'scarlett': 10,
 'sophia': 22,
 'violet': 15,
 'vivien': 34}

#### Notes:
    
We got the number of times that each of our names appears in list.txt! However, we can see that the names are not printing out in the order in which they are listed in the names file.

So now we will iterate through our names list, printing, in order, the number of times that each appears in list.txt.

In [15]:
for name in names_take_two:
    print("{}: {}".format(name, name_counts.get(name, 0))) # print, for each name: 'name: count' (0 if the name never appeared)

oliver: 3
charlotte: 5
liam: 6
amelia: 4
benjamin: 6
aria: 7
owen: 7
olivia: 9
jackson: 10
scarlett: 10
henry: 12
ava: 12
declan: 13
violet: 15
ethan: 15
nora: 16
noah: 18
emma: 18
alexander: 20
aurora: 21
finn: 22
sophia: 22
elijah: 24
audrey: 24
caleb: 25
ella: 26
grayson: 27
grace: 29
emmett: 29
lily: 31
elliot: 31
harper: 33
gabriel: 34
vivien: 34
aiden: 35
abigail: 37
lucas: 37
isla: 38
levi: 40
lucy: 41
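
For comparison, the standard library's `collections.Counter` can do the tallying for us. This is a sketch of an alternative, not the approach used in this notebook. `names_to_check` and `words` are made-up sample data standing in for the two files.

```python
from collections import Counter

# made-up sample data standing in for names.txt and list.txt
names_to_check = ['oliver', 'liam', 'ava']
words = ['a', 'liam', 'oliver', 'liam', 'zebra']

# a set gives fast membership checks for the names we care about
wanted = set(names_to_check)

# Counter tallies every item produced by the generator expression
tally = Counter(word for word in words if word in wanted)

# iterate through the original list to report counts in the original order
for name in names_to_check:
    print("{}: {}".format(name, tally[name]))
```

A `Counter` returns 0 for a key it has never seen, so a name that never appears (here 'ava') prints with a count of 0 instead of raising a `KeyError`.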


# Efficient solution; store procedure as function

#### Notes:
    
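Why was our previous solution suboptimal? Because we were performing lookups in a list. Lists are useful: they preserve order and perform fast - O(1) - appends. But lookups are O(n): in the worst case, Python has to compare against every item in the list. With 40 names and 110,401 lines, that is up to roughly 4.4 million string comparisons.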
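
To get a feel for the difference, here is a rough timing sketch using the standard library's `timeit` module. The sizes and names (`n`, `as_list`, `as_dict`, `target`) are arbitrary choices for illustration, and the exact timings will vary from machine to machine.

```python
import timeit

n = 100000
as_list = list(range(n))
as_dict = dict.fromkeys(as_list, 0)

# worst case for the list: the item we want is at the very end
target = n - 1

# time 100 membership checks against each container
list_time = timeit.timeit(lambda: target in as_list, number=100)
dict_time = timeit.timeit(lambda: target in as_dict, number=100)

print("list: {:.6f}s  dict: {:.6f}s".format(list_time, dict_time))
```

The list check has to walk through the items one by one, while the dictionary check hashes the key and jumps more or less directly to it.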

The Python dictionary is fast when it comes to lookups, clocking in at O(1) on average. But before Python 3.7, plain dictionaries did not guarantee any order. (Which is why we were initially storing all our names in a list) So what do we do? Luckily, the collections library features a special type of dictionary known as the OrderedDict. The OrderedDict, as its name implies, does preserve order. It also performs look-ups at O(1). (On Python 3.7 and later, a plain dict preserves insertion order as well, so either would work here.)

We can use the OrderedDict to kill 2 birds with one stone. It will retain, in order, all the names that we need to check. It will also allow us to keep track of the number of times that we've seen each name. This leads to a more efficient solution and will also lead to cleaner code.  
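
A small sketch of how the two dictionary types compare. It assumes Python 3.7 or later, where a plain dict also keeps insertion order. The sample names and the variables `plain`, `ordered`, `a` and `b` are only for illustration.

```python
from collections import OrderedDict

plain = {}
ordered = OrderedDict()
for name in ['oliver', 'liam', 'ava']:
    plain[name] = 0
    ordered[name] = 0

# both keep the keys in insertion order (Python 3.7+ for the plain dict)
print(list(plain) == list(ordered))  # True

# one difference that remains: OrderedDict equality is order-sensitive
a = OrderedDict([('x', 1), ('y', 2)])
b = OrderedDict([('y', 2), ('x', 1)])
print(a == b)              # False
print(dict(a) == dict(b))  # True
```

Using OrderedDict also makes the intent explicit to the reader: the order of the keys matters.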

#### We are also going to store our procedure for getting the names to check as a function:

This function will take as an argument the filename to read the names out of. It will create a new OrderedDict and populate it with our names.

The keys of an OrderedDict are unique, so it cannot hold duplicate entries, which means we are protected against potential duplicates in our names list. We will then return this OrderedDict of names, so that we can use it later (and store the counts in it).
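
To see the duplicate protection in isolation, here is a minimal sketch using `OrderedDict.fromkeys`, which builds the same kind of name-to-0 mapping in one call. The `raw_names` list is made-up sample data containing one deliberate duplicate.

```python
from collections import OrderedDict

# made-up sample lines, with 'OLIVER' appearing twice
raw_names = ['OLIVER\n', 'LIAM\n', 'OLIVER\n', 'AVA\n']

# clean each line, then use the cleaned names as keys with a starting value of 0
cleaned = (name.strip().lower() for name in raw_names)
unique_names = OrderedDict.fromkeys(cleaned, 0)

print(list(unique_names.items()))  # [('oliver', 0), ('liam', 0), ('ava', 0)]
```

The second 'oliver' simply lands on the key that already exists, so each name is kept once, in the position where it was first seen.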

In [16]:
import collections # generally we put import statements at the top of the notebook (we'll relax that rule this time)

In [17]:
# store our procedure for getting the list of names as a function
def get_ordered_list_of_names(file): # function takes a filename as argument
    names = collections.OrderedDict()
    with open(file) as f:
        for line in f:
            name = line.strip().lower() # store this as variable so we don't repeat strip() and lower() function calls
            if name in names: # check to see if we've already seen this name
                continue # if so, this is a duplicate - skip to the next iteration of the loop
            else: # otherwise
                names[name] = 0 # add name to names with a starting count of 0
    return names

In [18]:
names = get_ordered_list_of_names('names.txt')

In [19]:
names

OrderedDict([('oliver', 0),
             ('charlotte', 0),
             ('liam', 0),
             ('amelia', 0),
             ('benjamin', 0),
             ('aria', 0),
             ('owen', 0),
             ('olivia', 0),
             ('jackson', 0),
             ('scarlett', 0),
             ('henry', 0),
             ('ava', 0),
             ('declan', 0),
             ('violet', 0),
             ('ethan', 0),
             ('nora', 0),
             ('noah', 0),
             ('emma', 0),
             ('alexander', 0),
             ('aurora', 0),
             ('finn', 0),
             ('sophia', 0),
             ('elijah', 0),
             ('audrey', 0),
             ('caleb', 0),
             ('ella', 0),
             ('grayson', 0),
             ('grace', 0),
             ('emmett', 0),
             ('lily', 0),
             ('elliot', 0),
             ('harper', 0),
             ('gabriel', 0),
             ('vivien', 0),
             ('aiden', 0),
             ('abigail', 0),
             ('lucas', 0),
             ('isla', 0),
             ('levi', 0),
             ('lucy', 0)])

#### Notes:
    
We can see that our distinct set (we've protected against duplicates) of names to look for is ready to go! Additionally,
each name already has a value of 0, which will save us a couple lines of code when incrementing our counts. (We won't have to give new names a value of 1)

In [24]:
# store procedure for getting counts of occurrences of names in a file
def get_name_counts(file, names): # takes file_name and dictionary (or OrderedDict) of names
    with open(file) as f:
        for line in f:
            name = line.strip().lower()
            if name in names: # only count the names we're looking for
                names[name] += 1
    return names

In [25]:
names = get_name_counts('list.txt', names)

In [26]:
names

OrderedDict([('oliver', 3),
             ('charlotte', 5),
             ('liam', 6),
             ('amelia', 4),
             ('benjamin', 6),
             ('aria', 7),
             ('owen', 7),
             ('olivia', 9),
             ('jackson', 10),
             ('scarlett', 10),
             ('henry', 12),
             ('ava', 12),
             ('declan', 13),
             ('violet', 15),
             ('ethan', 15),
             ('nora', 16),
             ('noah', 18),
             ('emma', 18),
             ('alexander', 20),
             ('aurora', 21),
             ('finn', 22),
             ('sophia', 22),
             ('elijah', 24),
             ('audrey', 24),
             ('caleb', 25),
             ('ella', 26),
             ('grayson', 27),
             ('grace', 29),
             ('emmett', 29),
             ('lily', 31),
             ('elliot', 31),
             ('harper', 33),
             ('gabriel', 34),
             ('vivien', 34),
             ('aiden', 35),
             ('abigail', 37),
             ('lucas', 37),
             ('isla', 38),
             ('levi', 40),
             ('lucy', 41)])

#### Notes:
    
We now have our names, stored in the order in which they appear in 'names.txt', paired with the number of times they appear in 'list.txt', and this was achieved efficiently. 

#### We can access our results in the same way that we would with any other dictionary:

In [27]:
names['abigail']

37

#### Let's iterate through names' keys and values to find the name that showed up the most:

In [28]:
most_found = ''
max_count = 0

for key, value in names.items():
    if value > max_count:
        max_count = value
        most_found = key

print('The most found name was {}, showing up {} times.'.format(most_found, max_count))

The most found name was lucy, showing up 41 times.
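
The same search can be written with the built-in `max` function. This is a minimal sketch on a three-name sample (`sample_counts` is a stand-in for `names`, using counts taken from the results above).

```python
from collections import OrderedDict

# a small sample of the results above
sample_counts = OrderedDict([('oliver', 3), ('lucy', 41), ('levi', 40)])

# max() walks through the keys and compares them by their counts
most_found = max(sample_counts, key=sample_counts.get)

print(most_found, sample_counts[most_found])  # lucy 41
```

If two names tie for the highest count, `max` returns the first one it encounters, which matches the behaviour of the loop above (it only replaces the current best on a strictly greater count).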


#### We can also write a function to save our results to file:

In [29]:
# function for writing results to a .txt file
def write_output_to_txt(outputfile, list_of_names, name_counts):
    with open(outputfile, 'w') as f: # must specify 'w' to be able to write to open file
        for name in list_of_names: # iterate through the names in their original order
            f.write("{}: {}\n".format(name, name_counts.get(name, 0))) # write 'name: count' (0 if never seen)

In [30]:
write_output_to_txt('name_counts.txt', names, name_counts)
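
To check that a file written in this 'name: count' format can be read back, here is a round-trip sketch. It writes to a temporary directory so that it does not touch any real files. `sample_counts` and `restored` are hypothetical names used only for this illustration.

```python
import os
import tempfile
from collections import OrderedDict

# a small sample of results to write out
sample_counts = OrderedDict([('oliver', 3), ('charlotte', 5)])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'name_counts.txt')

    # write one 'name: count' line per name
    with open(path, 'w') as f:
        for name, count in sample_counts.items():
            f.write("{}: {}\n".format(name, count))

    # read the file back and rebuild the counts
    restored = OrderedDict()
    with open(path) as f:
        for line in f:
            name, count = line.strip().split(': ')
            restored[name] = int(count)

print(restored == sample_counts)  # True
```

Splitting on ': ' works here because none of the names contain a colon. A format such as CSV would be a safer choice for less predictable data.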