# Working with data in Python

## The open function

### Reading Files with Open

In [None]:
#| echo: false
#| warning: false

##Download example data.

#import urllib.request
#url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%204/data/Example1.txt.txt'
#filename = 'Example1.txt.txt'
#urllib.request.urlretrieve(url, filename)

## Download Example file
# !wget -O data/Example1.txt.txt https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%204/data/Example1.txt.txt

One way to read or write a file in Python is to use the built-in open function. The open function provides a File object that contains the methods and attributes you need in order to read, save, and manipulate the file. In this notebook, we will only cover .txt files. 

- The first argument is the filepath and the filename we want to open
- The second parameter is the mode:
    - r: open a file for reading
    - w: open a file for writing
    - a: open a file for appending
    - r+ : Reading and writing. Does not truncate the file when it is opened.
    - w+ : Writing and reading. Truncates the file.
    - a+ : Appending and Reading. Creates a new file, if none exists.
    
- We store the result in a file object, e.g. File1, and we can use the file object to obtain information about the file
- You should always close the file object using the method close.

The file method close() closes the opened file. A closed file cannot be read or written any more. Any operation that requires the file to be open will raise a ValueError after the file has been closed. Calling close() more than once is allowed.
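The behaviour described above can be checked with a small sketch. It assumes a throwaway file created with the standard tempfile module; the name demo.txt is hypothetical and is not one of the notebook's data files.

```python
import os
import tempfile

# Create a small scratch file so the example does not depend on data/Example1.txt
path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical file name
with open(path, "w") as f:
    f.write("This is line 1\n")

File1 = open(path, "r")
File1.close()
File1.close()  # calling close() a second time is allowed

# the attribute "closed" tells us whether the file object is closed
print(File1.closed)

# reading from a closed file raises a ValueError
try:
    File1.read()
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```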

In [None]:
#open a file
File1 = open("data/Example1.txt", "r")

#get the name of the file
print(File1.name)

#see what mode the object is in
print(File1.mode)

#get the file contents
print(File1.read())

#get the type of the file contents
type(File1.read())

#close the file
File1.close()

Since using .close() can be tedious, we can use an alternative, the with statement. This is a better alternative, since the with statement will automatically close the file.

In [None]:
with open("data/Example1.txt", "r") as File1:
    #code will be performed in the indented code block
    #The method "read" stores the values of the file in the variable "file_stuff" as a string
    file_stuff = File1.read()
    print(file_stuff)
    print(File1.mode)

Notice that we didn't have to write File1.close(). It is called automatically when the indented block ends.
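One way to confirm that the with statement closes the file is to inspect the closed attribute inside and outside the block. This is a minimal sketch that assumes a scratch file made with tempfile (the file name is hypothetical).

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical scratch file

with open(path, "w") as File1:
    File1.write("This is line 1\n")
    inside = File1.closed   # still open inside the indented block

outside = File1.closed      # closed as soon as the block ends
print(inside, outside)
```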

We can output every line as an element in a list using the method "readlines."

In [None]:
with open("data/Example1.txt", "r") as File1:
    file_stuff = File1.readlines()
    print(file_stuff)

We don't have to read the entire file. For example, we can read the first four characters by passing 4 as a parameter to the method .read():

In [None]:
# Read first four characters
with open("data/Example1.txt", "r") as file1:
    print(file1.read(4))

Use a loop to print out each line individually:

In [None]:
with open("data/Example1.txt", "r") as File1:
    for i in File1:
        print(i)

We can also read all lines and save them as a list:

In [None]:
# Read all lines and save as a list
with open("data/Example1.txt", "r") as file1:
    FileasList = file1.readlines()
    
#print the first line    
FileasList[0]

### Writing Files with Open

Create a new, empty example as follows:

**Beware**: If we have a file with that filename in our directory, it will be overwritten!

In [None]:
#create example.txt in the specified dir
with open("data/Example2.txt", "w") as File1:
    #add something into our file
    File1.write("This is line A\n")
    File1.write("This is line B\n")

We can also write the elements of a list to a file:

In [None]:
Lines = ["This is line A\n", "This is line B\n", "This is line C\n"]

with open("data/Example2.txt", "w") as File1:
    for i in Lines:
        File1.write(i)

### Appending lines to an existing file

Append mode does not overwrite an existing file; it adds lines to the end of it. If the file does not exist yet, it is created.
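Mode "a" always writes at the end of the file, and in Python it also creates the file when it is missing. The sketch below illustrates both points, assuming a scratch file in a temporary directory (the name append_demo.txt is hypothetical).

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "append_demo.txt")  # hypothetical name
existed_before = os.path.exists(path)

# first open in append mode: the file is created
with open(path, "a") as f:
    f.write("This is line A\n")

# second open in append mode: the new line goes after the existing one
with open(path, "a") as f:
    f.write("This is line B\n")

with open(path, "r") as f:
    lines = f.readlines()

print(existed_before)
print(lines)
```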

In [None]:
with open("data/Example2.txt", "a") as File1:
    File1.write("This is line D\n")

### Copy one file to a new file

In [None]:
with open("data/Example1.txt", "r") as readfile:
    with open("data/Example3.txt", "w") as writefile:
        for line in readfile:
            writefile.write(line)

### Other modes


In [None]:
with open('data/Example2.txt', 'a+') as testwritefile:
    testwritefile.write("This is line E\n")
    #prints an empty string: after writing, the position is at the end of the file
    print(testwritefile.read())

Opening the file in w is akin to opening the .txt file, moving your cursor to the beginning of the text file, writing new text and deleting everything that follows. Opening the file in a, by contrast, is similar to opening the .txt file, moving your cursor to the very end and then adding the new pieces of text. It is often very useful to know where the 'cursor' is in a file and to be able to control it. The following methods allow us to do precisely this:

- .tell() - returns the current position in bytes
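- .seek(offset,from) - changes the position by 'offset' bytes with respect to 'from'. 'from' can take the values 0, 1 or 2, meaning the beginning of the file, the current position and the end of the file respectively. For files opened in text mode, only seeks relative to the beginning are allowed, with the exception of seek(0,2), which jumps to the very end.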


In [None]:
with open('data/Example2.txt', 'a+') as testwritefile:
    print("Initial Location: {}".format(testwritefile.tell()))
    data = testwritefile.read()
    if (not data):  #empty strings evaluate to False in Python
        print('Read nothing') 
    else: 
        print(data)
    
    testwritefile.seek(0,0) # move 0 bytes from beginning.
    
    print("\nNew Location : {}".format(testwritefile.tell()))
    data = testwritefile.read()
    if (not data): 
        print('Read nothing') 
    else: 
        print(data)
    
    print("Location after read: {}".format(testwritefile.tell()) )
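The same cursor movements can be followed on a very small file. This is only a sketch: it assumes a scratch file containing six ASCII characters and no newline (the file name is hypothetical), so that positions are easy to count.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "seek_demo.txt")  # hypothetical name
with open(path, "w") as f:
    f.write("abcdef")  # six ASCII characters, no newline

with open(path, "r") as f:
    start = f.tell()          # position right after opening
    first_three = f.read(3)   # reading moves the cursor forward
    middle = f.tell()
    end = f.seek(0, 2)        # jump to the end; seek returns the new position
    f.seek(0, 0)              # back to the beginning
    everything = f.read()

print(start, first_three, middle, end, everything)
```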

Finally, a note on the difference between w+ and r+. Both of these modes allow access to the read and write methods. However, opening a file in w+ overwrites it and deletes all pre-existing data.
**To work with a file that already contains data, use r+ or a+**. While using r+, it can be useful to call the .truncate() method after writing your data. This cuts the file off at the current position and deletes everything that follows.

In [None]:
with open('data/Example2.txt', 'r+') as testwritefile:
    data = testwritefile.readlines()
    testwritefile.seek(0,0) #write at beginning of file
   
    testwritefile.write("Line 1" + "\n")
    testwritefile.write("Line 2" + "\n")
    testwritefile.write("Line 3" + "\n")
    testwritefile.write("finished\n")
    #truncate removes any old content left over after the newly written lines
    testwritefile.truncate()
    testwritefile.seek(0,0)
    print(testwritefile.read())
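The contrast between r+ and w+ can be seen side by side in the following sketch. It assumes a scratch file with two made-up lines (the file name and contents are hypothetical).

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")  # hypothetical name
with open(path, "w") as f:
    f.write("old line 1\nold line 2\n")

# r+ keeps the existing content; writing starts at position 0 and overwrites in place
with open(path, "r+") as f:
    f.write("NEW")
    f.seek(0)
    after_r_plus = f.read()

# w+ empties the file as soon as it is opened
with open(path, "w+") as f:
    after_w_plus = f.read()

print(repr(after_r_plus))
print(repr(after_w_plus))
```

Only the first three characters are replaced under r+, while w+ leaves nothing to read.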

After reading files, we can also write data into files and save them in different file formats such as .txt, .csv and .xls (for Excel files). You will come across these in further examples.

#### Exercise

Your local university's Raptors fan club maintains a register of its active members on a .txt document. Every month they update the file by removing the members who are not active. You have been tasked with automating this with your Python skills.

Given the file currentMem, remove each member with a 'no' in their Active column. Keep track of each of the removed members and append them to the exMem file. Make sure that the format of the original files is preserved. (Hint: Do this by reading/writing whole lines and ensuring the header remains.)

Run the code block below prior to starting the exercise. The skeleton code has been provided for you. Edit only the cleanFiles function.

In [None]:
#Run this prior to starting the exercise
from random import randint as rnd

memReg = 'data/members.txt'
exReg = 'data/inactive.txt'
fee =('yes','no')

def genFiles(current,old):
    with open(current,'w+') as writefile: 
        writefile.write('Membership No  Date Joined  Active  \n')
        data = "{:^13}  {:<11}  {:<6}\n"

        for rowno in range(20):
            date = str(rnd(2015,2020))+ '-' + str(rnd(1,12))+'-'+str(rnd(1,25))
            writefile.write(data.format(rnd(10000,99999),date,fee[rnd(0,1)]))


    with open(old,'w+') as writefile: 
        writefile.write('Membership No  Date Joined  Active  \n')
        data = "{:^13}  {:<11}  {:<6}\n"
        for rowno in range(3):
            date = str(rnd(2015,2020))+ '-' + str(rnd(1,12))+'-'+str(rnd(1,25))
            writefile.write(data.format(rnd(10000,99999),date,fee[1]))


genFiles(memReg,exReg)

In [None]:
def cleanFiles(currentMem, exMem):
    with open(currentMem, "r+") as writeFile:
        with open(exMem, "a+") as appendFile:
            #get the data
            writeFile.seek(0)
            members = writeFile.readlines()
            #remove header
            header = members[0]
            members.pop(0)
            inactive=[]
            for member in members:
                if 'no' in member:
                    inactive.append(member)
            '''
            inactive = [member for member in members if ('no' in member)]
            The above is the same as 
            
            for member in members:
                if 'no' in member:
                    inactive.append(member)
            '''
            #go to the beginning of the write file
            writeFile.seek(0)
            writeFile.write(header)
            for member in members:
                if (member in inactive):
                    appendFile.write(member)
                else:
                    writeFile.write(member)
            writeFile.truncate()

Test code:

In [None]:
memReg = 'data/members.txt'
exReg = 'data/inactive.txt'
cleanFiles(memReg,exReg)

# code to help you see the files
headers = "Membership No  Date Joined  Active  \n"

with open(memReg,'r') as readFile:
    print("Active Members: \n\n")
    print(readFile.read())
    
with open(exReg,'r') as readFile:
    print("Inactive Members: \n\n")
    print(readFile.read())


Automated test code:

In [None]:
def testMsg(passed):
    if passed:
       return 'Test Passed'
    else :
       return 'Test Failed'

testWrite = "data/testWrite.txt"
testAppend = "data/testAppend.txt" 
passed = True

genFiles(testWrite,testAppend)

with open(testWrite,'r') as file:
    ogWrite = file.readlines()

with open(testAppend,'r') as file:
    ogAppend = file.readlines()

try:
    cleanFiles(testWrite,testAppend)
except Exception as err:
    #report the error and mark the test as failed
    print('Error: {}'.format(err))
    passed = False

with open(testWrite,'r') as file:
    clWrite = file.readlines()

with open(testAppend,'r') as file:
    clAppend = file.readlines()
        
# checking if total no of rows is same, including headers

if (len(ogWrite) + len(ogAppend) != len(clWrite) + len(clAppend)):
    print("The number of rows do not add up. Make sure your final files have the same header and format.")
    passed = False
    
for line in clWrite:
    if  'no' in line:
        passed = False
        print("Inactive members in file")
        break
    else:
        if line not in ogWrite:
            print("Data in file does not match original file")
            passed = False
print ("{}".format(testMsg(passed)))

## Pandas

Pandas is a popular library for data analysis built on top of the Python programming language. Pandas provides two main data structures for manipulating data:

- DataFrame: a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.
- Series: represents a one-dimensional array of indexed data. It has two main components :
    - An array of actual data.
    - An associated array of indexes or data labels.
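The two components of a Series can be inspected directly. The sketch below uses made-up salary values and name labels purely for illustration.

```python
import pandas as pd

# Hypothetical example data, not taken from the notebook's files
s = pd.Series([100000, 80000, 50000], index=["Rose", "John", "Jane"])

print(s.values)        # the underlying array of data
print(list(s.index))   # the associated labels
print(s["John"])       # look up a value by its label
```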

We import a library, a set of pre-written functions, with the import command followed by the name of the library. Since writing out the full library name every time is tedious, we can shorten it with the as keyword. pd is the standard abbreviation used for pandas.

In [None]:
import pandas as pd

This gives us access to a number of pre-built classes and functions. 

### Loading data with pandas

We import a table with read_csv and store it in a dataframe. A dataframe consists of rows and columns. 

In [None]:
#read in a dataframe
df = pd.read_csv("data/file1.csv")

#examine the first 5 rows of a dataframe
df.head()

We can also create a dataframe out of a dictionary. The keys correspond to the column headers, and each value is a list holding the entries of that column.

In [None]:
#Define a dictionary 'x'
x = {'Name': ['Rose','John', 'Jane', 'Mary'], 
        'ID': [1, 2, 3, 4], 
        'Department': ['Architect Group', 'Software Group', 'Design Team', 'Infrastructure'], 
        'Salary':[100000, 80000, 50000, 60000]}

#casting the dictionary to a DataFrame
df = pd.DataFrame(x)

#display the result df
df

We can create a new df consisting of one column.

In [None]:
x = df[["ID"]]
x

Let's use the <code>type()</code> function and check the type of the variable.

In [None]:
#check the type of x
type(x)

We can also do this for multiple columns:

In [None]:
z = df[['Department','Salary','ID']]
z

To view the column as a series, we just use one bracket.

In [None]:
x = df["Name"]
x

In [None]:
#check the type of x
type(x)

### loc() and iloc() functions

loc() is a label-based selection method, which means that we have to pass the name (label) of the row or column that we want to select. Although it is often written as loc(), it is used with square brackets. When a range is passed, the last element of the range is included.

Simple syntax for your understanding:

`loc[row_label, column_label]`

iloc() is an index-based selection method, which means that we have to pass an integer position to select a specific row or column. It is also used with square brackets. When a range is passed, the last element of the range is not included.

Simple syntax for your understanding:

`iloc[row_index, column_index]`


In [None]:
# Access the value on the first row and the first column
df.iloc[0, 0]

In [None]:
# Access the value on the first row and the third column
df.iloc[0,2]

In [None]:
# Access the value in the row labelled 0 and the column named 'Salary'
df.loc[0, 'Salary']

Let us create a new dataframe called 'df1' and assign 'df' to it. Now, let us set the "Name" column as an index column using the method set_index().

In [None]:
df1=df
df1=df1.set_index("Name")

#To display the first 5 rows of new dataframe
df1.head()

In [None]:
#Now, let us access the column using the name
df1.loc['Jane', 'Salary']

### Slicing

Slicing uses the [] operator to select a set of rows and/or columns from a DataFrame.

To slice out a set of rows, you use this syntax: data[start:stop].

Here start is the index of the first row to include, and stop is the index one step BEYOND the last row you want to select. You can perform slicing using both the index and the name of the column.

**NOTE: When slicing in pandas, the start bound is included in the output.**

So if you want to select rows 0, 1, and 2 your code would look like this: df.iloc[0:3].

It means you are telling Python to start at index 0 and select rows 0, 1, 2 up to but not including 3.

**NOTE: Labels must be found in the DataFrame or you will get a KeyError.**

Indexing by labels(i.e. using loc()) differs from indexing by integers (i.e. using iloc()). With loc(), both the start bound and the stop bound are inclusive. When using loc(), integers can be used, but the integers refer to the index label and not the position.

For example, selecting 1:4 with loc() will give a different result than selecting rows 1:4 with iloc().

In [None]:
# let us do the slicing using old dataframe df
df.iloc[0:2, 0:3]

In [None]:
#slice with loc() on the old dataframe df, whose index labels are 0, 1, 2, ...
df.loc[0:2,'ID':'Department']
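The difference between selecting 1:4 with loc() and with iloc(), mentioned above, can be made concrete with a small sketch. It assumes a made-up six-row dataframe with the default integer labels.

```python
import pandas as pd

# Hypothetical dataframe with default index labels 0..5
df_demo = pd.DataFrame({"ID": [1, 2, 3, 4, 5, 6]})

by_label = df_demo.loc[1:4]      # labels 1, 2, 3 and 4 (stop is included)
by_position = df_demo.iloc[1:4]  # positions 1, 2 and 3 (stop is excluded)

print(len(by_label), len(by_position))
```

With the default index, labels and positions coincide, so the only visible difference is the extra row that loc() returns.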

### Working with and Saving Data

#### Make a list of unique elements

In [None]:
# Read data from CSV file
csv_path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%204/data/TopSellingAlbums.csv'
df = pd.read_csv(csv_path)
df.head()

In [None]:
# Access the column Released
x = df[['Released']]
x

In [None]:
#apply the unique method to remove duplicated years
df['Released'].unique()

### Selecting rows by conditions

Let's find all the albums released in 1980 or later:

In [None]:
#find albums after 1980
#the result is a Series of boolean values
print(df["Released"]>=1980)

#select the rows of our dataframe for which the condition is True
df1 = df[df["Released"]>=1980]
df1.head()
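The two steps, building a boolean condition and then using it to select rows, are easier to follow on a tiny table. The sketch below uses made-up album names and years rather than the TopSellingAlbums data.

```python
import pandas as pd

# Hypothetical data for illustration only
albums = pd.DataFrame({
    "Album": ["A", "B", "C", "D"],
    "Released": [1975, 1980, 1992, 1969],
})

mask = albums["Released"] >= 1980   # one True/False value per row
recent = albums[mask]               # keeps only the rows where mask is True

print(mask.tolist())
print(recent["Album"].tolist())
```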

### Save dataframes

In [None]:
#save new df to csv
df.to_csv("data/new_songs.csv")
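By default, to_csv also writes the dataframe's index as the first column. If that column is not wanted, index=False leaves it out. A minimal sketch with a made-up dataframe; calling to_csv without a path returns the CSV text instead of writing a file, which makes the difference easy to see.

```python
import pandas as pd

# Hypothetical dataframe for illustration only
demo = pd.DataFrame({"Name": ["Rose", "John"], "ID": [1, 2]})

with_index = demo.to_csv()                # no path given: returns the CSV text
without_index = demo.to_csv(index=False)  # same data without the index column

# compare the header lines
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```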

## Numpy

Numpy is a package for scientific computing and has many useful functions.

### Basics

A numpy array is similar to a list. It is usually fixed in size and each element is of the same type.
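Because every element shares one type, numpy converts mixed input to a common type when the array is created. A short sketch with made-up values:

```python
import numpy as np

ints = np.array([0, 1, 2])       # all integers: integer array
mixed = np.array([0, 1.5, 2])    # one float: the integers are converted to floats

print(ints.dtype, mixed.dtype)
print(mixed)
```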

In [None]:
#load numpy
import numpy as np

#create an array
a = np.array([0,1,2,3,4])
print(a)

#access individual elements
print(a[2])

#check the type
print(type(a))

#obtain the data type of the array element
print(a.dtype)

#check the attribute size, the number of elements in the array
print(a.size)

#show the array dimensions
print(a.ndim)

#show the size of the array in each dimension
print(a.shape)

### Indexing and slicing

Change elements of an array:

In [None]:
#define an array
c = np.array([20,1,2,3,4])
print(c)

#change the first element to 100
c[0]=100
print(c)

We also can slice a numpy array:

In [None]:
#select the elements at index 1 to 3 (the stop index 4 is not included)
print(c[1:4])

### Basic operations
