# Python File Objects

### Instructions

* Complete the programming exercises below.
* Delete or comment out the line of code in each cell which says `raise NotImplementedError()` and replace it with your own.

### Exercises

## Path to file or directory

Sometimes you want to work with files in a particular directory. To do that, you need the path to that directory. There are a couple of ways to get it.

In Windows you can open a command prompt (in Mac a terminal window) and drag the file or folder to the command prompt and the path will display in the window. You can then copy the path and paste it into your code. It will probably look something like this.

`C:\Users\UserName\Documents`

To use the path in your code, you will want to assign it to a variable.

`file_path = 'C:\Users\UserName\Documents'`

Windows users will quickly find that the path, as Windows formats it, causes an error when used without modification. (Mac users won't have this problem because macOS paths use forward slashes, which Python accepts as-is.) Windows separates folders with the backslash (\) character, and in Python string literals the backslash is the escape character: it starts special sequences such as `\n` (newline). In the path above, `\U` is read as the start of a Unicode escape, so Python raises a `SyntaxError`. To put a literal backslash in a string, you escape it with a second backslash. One option for Windows users is therefore to double every backslash in the path. It will look like this.

`file_path = 'C:\\Users\\UserName\\Documents'`

Another option is to change each backslash (\) to a forward slash (/).

`file_path = 'C:/Users/UserName/Documents'`

Another option is to format the string as a raw string. This option saves a lot of typing. This is done in Python by adding a lowercase `r` in front of the quotation marks for the string. This is a signal to Python to treat the string exactly as it is formatted and consider any special characters included in the string as literal characters.

`file_path = r'C:\Users\UserName\Documents'`
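
As a quick check, the sketch below compares the three spellings of the example path. It assumes the standard-library `pathlib` module is acceptable here (the notebook itself does not use it); `PureWindowsPath` only manipulates the path text, so the code runs on any operating system.

```python
from pathlib import PureWindowsPath

# Three spellings of the same example Windows path.
escaped = 'C:\\Users\\UserName\\Documents'
forward = 'C:/Users/UserName/Documents'
raw = r'C:\Users\UserName\Documents'

# The escaped and raw literals produce the identical string.
print(escaped == raw)

# PureWindowsPath treats forward slashes and backslashes as the same
# separator, and it never touches the disk.
print(PureWindowsPath(forward) == PureWindowsPath(raw))
```

Both comparisons print `True`, so any of the three forms can be assigned to `file_path`.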

Another option is to use a graphical user interface included in Python (called tkinter) to open a dialog window and allow the user to select the file or directory of interest. Run the code block below to use the file dialog to select the files directory included in the repository for this assignment. (Be sure to select the inner files directory which actually contains the files needed for the assignment.)

## Use this code with Google Colab

The code blocks in this section will clone the GitHub repository and extract the files you need from the zip archive. (A zip archive is a compressed folder that contains other files. The files need to be extracted before they can be used.)

In [1]:
!git clone https://github.com/kknippenberg11/exerPythCombineFiles.git # replace the address shown with the address to your own repository

Cloning into 'exerPythCombineFiles'...
remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 42 (delta 16), reused 4 (delta 4), pack-reused 14 (from 1)
Receiving objects: 100% (42/42), 40.15 KiB | 708.00 KiB/s, done.
Resolving deltas: 100% (16/16), done.


In [2]:
zip_path = '/content/exerPythCombineFiles/FileTypes.zip'

In [3]:
from zipfile import ZipFile
# Use the name zf so the built-in zip() function is not shadowed.
with ZipFile(zip_path, 'r') as zf:
  zf.extractall()
  print('Done')

Done


In [4]:
# Get the path to the directory with the extracted files
file_path = '/content/FileTypes'
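
If you want to see what an archive contains before extracting it, `ZipFile.namelist()` returns the member names. The sketch below is only an illustration: it builds a small throwaway archive in a temporary folder, and the file names in it are made up rather than taken from the assignment repository.

```python
import os
import tempfile
from zipfile import ZipFile

# Build a small throwaway archive so the sketch runs anywhere.
# The member names below are hypothetical.
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, 'demo.zip')
with ZipFile(demo_zip, 'w') as zf:
    zf.writestr('demo/a.csv', 'x,y\n1,2\n')
    zf.writestr('demo/b.json', '{"x": 1, "y": 2}')

# List the archive members, then extract them into a chosen folder.
with ZipFile(demo_zip, 'r') as zf:
    members = zf.namelist()
    zf.extractall(work_dir)

print(members)
print(sorted(os.listdir(os.path.join(work_dir, 'demo'))))
```

Passing a folder to `extractall()` controls where the files land; without an argument they go into the current working directory, which is `/content` on Colab.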

## Skip this code block if you're using Google Colab

This code block is used if you're working with files on your own computer. You can skip it if you're using Google Colab and you completed the steps above.

In [None]:
# Have user select the folder with the needed files.
import tkinter as tk
from tkinter import filedialog

root = tk.Tk()
root.lift()
root.withdraw()

# This code block will open a specific file. (Uncomment the lines after this comment to use them.)
#print('Opening dialogue box for file selection. Please choose a file.')
#file_path = filedialog.askopenfilename()
#print('File selected:')

# This code block will get a directory path. (Uncomment the lines after this comment to use them.)
print('Opening dialogue box for folder selection. Please choose a folder.')
file_path = filedialog.askdirectory()
print('Folder selected:')

print(file_path)

## Working with different file types in Python

The purpose of this assignment is to see how to work with different file types in Python. Run the code block below to see the different types of files included in the folder.

In [5]:
# Make sure correct files are recognized.

import os
import pandas as pd

# List the files in the directory.
files = os.listdir(file_path)
print('All files in folder: ', files, '\n')

# List of file types we want to add
file_types = ['xlsx','csv','json','xml']

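# Create a list of files for each file type.
# Matching on the extension including the dot avoids false matches
# on names that merely end in the same letters.
files_csv = [f for f in files if f.endswith('.csv')]
files_xlsx = [f for f in files if f.endswith('.xlsx')]
files_json = [f for f in files if f.endswith('.json')]
files_xml = [f for f in files if f.endswith('.xml')]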

print('CSV files: ', files_csv, '\nExcel files: ', files_xlsx, '\nJSON files: ',files_json, '\nXML files: ', files_xml)

All files in folder:  ['JSONtest3.json', 'Combine multiple file types with same columns into dataframe.ipynb', 'CSVtest.csv', 'EXCELtest2.xlsx', 'EXCELtest.xlsx', 'JSONtest.json', 'Combine multiple file types with same columns into dataframe.py', 'XMLtest.xml', 'CSVtest2.csv', 'XMLtest3.xml', 'XMLtest2.xml', 'JSONtest2.json'] 

CSV files:  ['CSVtest.csv', 'CSVtest2.csv'] 
Excel files:  ['EXCELtest2.xlsx', 'EXCELtest.xlsx'] 
JSON files:  ['JSONtest3.json', 'JSONtest.json', 'JSONtest2.json'] 
XML files:  ['XMLtest.xml', 'XMLtest3.xml', 'XMLtest2.xml']
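
One way to count files by extension is sketched below, using `os.path.splitext` and `collections.Counter` from the standard library. The `names` list is hypothetical and only resembles the listing above; in the notebook you would pass the result of `os.listdir(file_path)` instead.

```python
import os
from collections import Counter

# Hypothetical file names, similar in shape to the listing above.
names = ['CSVtest.csv', 'EXCELtest.xlsx', 'JSONtest.json',
         'XMLtest.xml', 'CSVtest2.csv', 'notes.py']

# os.path.splitext splits 'CSVtest.csv' into ('CSVtest', '.csv').
extensions = [os.path.splitext(n)[1].lower() for n in names]
counts = Counter(extensions)
print(counts)

# Select one file type by its extension, ignoring letter case.
csv_files = [n for n in names if n.lower().endswith('.csv')]
print(csv_files)
```

Lowercasing the extension means files such as `DATA.CSV` are counted together with `data.csv`.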


Write your answers as comments in the cell below.

* What types of files are included in the folder? List the file extensions and how many of each type you see.

In [None]:
# CSV files: 2
# Excel files: 2
# JSON files: 3
# XML files: 3
# IPYNB files: 1
# PY files: 1

## Creating a Pandas dataframe

Run the code block below to read each data file and combine them into a Pandas dataframe.

In [None]:
import json
import xml.etree.ElementTree as ET

# Iterate through the files in the directory and append each one into the dataframe.
# This will only work correctly if the files have the exact same column names.
df_list = []
for f in files_csv:
    data = pd.read_csv(str(file_path) + '/' + str(f), index_col=None, header=0)
    data['Source'] = f
    df_list.append(data)

for f in files_xlsx:
    data = pd.read_excel(str(file_path) + '/' + str(f))
    data['Source'] = f
    df_list.append(data)

# Iterate through the json files and add data from each to a list.
json_list = []
for f in files_json:
    with open(str(file_path) + '/' + str(f)) as json_file:
        json_obj = json.load(json_file)
        json_obj['Source'] = f
        json_list.append(json_obj.copy())
# Turn the combined list into a dataframe.
data = pd.DataFrame(json_list)
# Add the data frame to the list of dataframes.
df_list.append(data)

# Iterate through the xml files and add data from each to a list.
xml_list = []
for f in files_xml:
    # create element tree object
    tree = ET.parse(str(file_path) + '/' + str(f))
    # get root element
    root = tree.getroot()
    # create dictionary from XML tags and values
    itemdict = {}
    for item in root:
        itemdict[item.tag] = item.text
    itemdict['Source'] = f
    xml_list.append(itemdict.copy())
# Turn the combined list into a dataframe.
data = pd.DataFrame(xml_list)
# Add the data frame to the list of dataframes.
df_list.append(data)

# Combine all the data frames in the list into a single data frame.
df = pd.concat(df_list, axis=0, ignore_index=True, sort=False)

# See how many rows the data frame has.
print(len(df.index))

# Show the data in the data frame.
df
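
To see what the JSON and XML loops produce for a single file, the sketch below converts one record of each kind into a dictionary. The record contents and tag names are hypothetical; the real files in the assignment may use different fields. It also assumes a flat layout, where each child of the XML root holds one value.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records; the assignment files may use different tags.
xml_text = '<record><Name>Ada</Name><Age>36</Age></record>'
json_text = '{"Name": "Grace", "Age": 45}'

# Each child element of the root becomes one dictionary entry.
root = ET.fromstring(xml_text)
xml_row = {child.tag: child.text for child in root}
xml_row['Source'] = 'example.xml'

# A JSON object already maps directly onto a dictionary.
json_row = json.loads(json_text)
json_row['Source'] = 'example.json'

print(xml_row)
print(json_row)
```

Note that values read from XML are always strings (`'36'`), while JSON keeps numbers as numbers (`45`), so a combined column can end up with mixed types unless you convert it.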

## Saving a dataframe to csv

Run the code block below to save your new dataframe as a csv file.

In [None]:
# Save the dataframe to a new combined csv file.

# Add today's date to the name of the new file.
from datetime import date
today = date.today()
print(today)

filename = str(file_path) + '/' + 'NewCombinedFile_' + str(today) + '.csv'
print(filename)

df.to_csv(filename, index=False)
print('File saved.')
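
The file name can also be assembled with `os.path.join`, which inserts the correct separator for the operating system. This is a sketch only: the folder name and the fixed date are assumptions chosen so the result is reproducible, and in the notebook you would use `file_path` and `date.today()`.

```python
import os
from datetime import date

# Hypothetical folder and a fixed date so the result is reproducible.
folder = 'FileTypes'
stamp = date(2024, 1, 15)

# isoformat() gives the same YYYY-MM-DD text as str(stamp).
filename = os.path.join(folder, 'NewCombinedFile_' + stamp.isoformat() + '.csv')
print(filename)
```

Because the date is written as year-month-day, the saved files sort chronologically when listed by name.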

## Create additional data files

Create a new data file for each type included in the folder. (You can use Excel to modify the content in the .xlsx and .csv files and save them under a new name in the same folder. You can use VSCode or Notepad++ to modify the content of an .xml and .json file and save those under a new name in the same folder.) Make sure you create at least one additional file for each type of data file in the folder.

Enter code in the cell below.

* Create a variable called `file_path` which contains the path to the folder where all the data files are stored. You can use any of the methods described at the beginning of the notebook (manually copy the path, or copy the code that creates a file dialog and choose the correct folder).

In [None]:

# YOUR CODE HERE
raise NotImplementedError()

Enter code in the cell below.

* Add code in the cell below to list the files in the folder and create lists for each data file type. (Hint - It's OK to reuse code from earlier in the assignment.)

In [None]:

# YOUR CODE HERE
raise NotImplementedError()

Write your answers as comments in the cell below.

* What types of files are included in the folder? List the file extensions and how many of each type you see.

In [None]:

# YOUR CODE HERE
raise NotImplementedError()

Enter code in the cell below.

* Add code in the cell below to read each data file and combine them into a Pandas dataframe and display the dataframe. (Hint - It's OK to reuse code from earlier in the assignment.)

In [None]:

# YOUR CODE HERE
raise NotImplementedError()

Enter code in the cell below.

* Add code in the cell below to save your new dataframe as a csv file. (Hint - It's OK to reuse code from earlier in the assignment.)

In [None]:

# YOUR CODE HERE
raise NotImplementedError()

## Push your updated files to GitHub

If you're running this on Google Colab, you'll need to push your updated data files to GitHub and also upload your Jupyter Notebook with all of your output to your GitHub assignment repository. Instructions can be found at https://github.com/cmcntsh/OpenJupyterNBinGoogleColab

Be sure to turn in the URL of your GitHub repository in Canvas.