# Using Files with Pandas

As always, we are going to need some modules loaded.

In [1]:
import os
import pandas as pd

In this notebook, we will explore several common tasks associated with dealing with files in Pandas.  

Pandas has direct support for many file types.  Consider some of the Pandas methods and supported formats in the following table:

| Method  | Description |
|---------|-----------------------------------------------------------|
| `to_clipboard` | Save data into system clipboard |
| `to_dict` | Convert data into Python dictionary |
| `to_hdf` | Save data into hierarchical data format (HDF) |
| `to_html` | Convert data to HTML table |
| `to_json` | Convert data to JSON string |
| `to_sql` | Save data into SQL database |

However, in this notebook, we will focus on two of the most common: CSV files and Excel files.

- [Navigating the File System](#Navigating-the-File-System)
- [Reading and Writing CSV Files](#Reading-and-Writing-CSV-Files)
- [Handling Missing Values in CSV Files](#Handling-Missing-Values-in-CSV-Files)
- [Reading and Writing Excel Files](#Reading-and-Writing-Excel-Files)

## Navigating the File System

The commands to manipulate files and the conventions for path formats differ between operating systems (e.g., Windows, Linux, MacOS).  Directly including the conventions of a particular operating system in your code (or notebooks) will cause that code to fail if it is moved to a different system.  We want to prevent that failure.

Fortunately, Python's `os` module provides us platform-independent ways to interact with the file system.  We will only explore a few of the capabilities here, but there are many methods.  Consider the list below.

| Method | Description |
|--------|-------------------------|
| `chdir` | Change working directory |
| `rename` | Rename a file |
| `remove` | Delete a file |
| `mkdir` | Create a single level directory |
| `rmdir` | Delete a single level directory |
| `makedirs` | Create multi-layer recursive directory |
| `removedirs` | Delete empty directory |
| `listdir` | Return a list of all files and folders in a directory |
| `system` | Execute command (perhaps OS-specific command) |

What operating system are we currently using?

According to the [`os` documentation](https://docs.python.org/3/library/os.html), there are three values for `os.name`.

![image.png](attachment:22a83df4-c0f6-4d82-bd3a-efbea04b9edf.png)

Windows is "nt".  MacOS and Linux appear as "posix".  Note that the `sys` module (via `sys.platform`) can provide additional granularity if needed.

In [2]:
os.name

'posix'
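
As a small sketch of that extra granularity, the cell below prints `os.name` next to `sys.platform` and `platform.system()`. The variable names are only illustrative, and the values printed depend on the machine running the notebook.

```python
import os
import platform
import sys

# os.name is coarse: 'nt' on Windows, 'posix' on Linux and MacOS.
coarse_name = os.name

# sys.platform is finer grained, e.g. 'linux', 'darwin' (MacOS), or 'win32'.
fine_name = sys.platform

# platform.system() returns a readable name such as 'Linux', 'Darwin', or 'Windows'.
readable_name = platform.system()

print(coarse_name, fine_name, readable_name)
```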

Let's define the location of the data file.

We want to avoid OS-specific strings to specify the file location (e.g., "C:\users\bob\documents\myfile.csv").  

The `os` module provides the `os.path.join` function to construct a path from its component pieces.  `os.path.join` inserts the appropriate directory separators for the particular platform.

In [3]:
datapath = os.path.join('.', 'users', 'bob', 'documents', '')
datapath

'./users/bob/documents/'

This gives the relative path to the directory based upon the current directory.  

To get the absolute path, we can do this...

In [4]:
datapath = os.path.join(os.path.abspath('.'), 'users', 'bob', 'documents', '')
datapath

'/Users/tmt/Dropbox/UCA/Intro-to-DataScience/Class-material/notebooks/additional-lab-notebooks/Pandas/users/bob/documents/'

For this notebook, our example data files are in the current directory, so we can set the datapath as follows...

In [5]:
# Set the path to our data location
datapath = os.path.join(".", "")
datapath

'./'

In [6]:
datafile = 'hogwarts_grades.csv'

Verify that the data file is where we think.  We can use the `exists` method to see if the file exists and the `isfile` method to more specifically verify that this is a normal file (e.g., rather than a special file like a pipe). 

In [7]:
os.path.exists(datapath + datafile)

True

In [8]:
os.path.isfile(datapath + datafile)

True
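
The cells above build the full file name with `datapath + datafile`, which works because `datapath` ends with a separator. A slightly more defensive variation, sketched here, is to let `os.path.join` combine the directory and the file name; the names `full_path` and `also_full_path` are just for illustration.

```python
import os

datapath = os.path.join(".", "")   # same value as in the notebook
datafile = "hogwarts_grades.csv"

# os.path.join adds a separator only when one is needed, so this works
# whether or not the directory part already ends with a separator.
full_path = os.path.join(datapath, datafile)
also_full_path = os.path.join(".", datafile)

print(full_path)
print(full_path == also_full_path)
```

Either form can be passed to `os.path.exists` or `os.path.isfile`.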

It is helpful to know the files exist, but we might want to know other details like the size of the file, when it was modified, etc.  Let's exploit the underlying operating system to tell us more about the file.

However, to do that, we need to use the appropriate command for our current operating system.  Let's set that command.

In [9]:
if os.name == 'nt':
    dircmd = 'dir'
else:
    dircmd = 'ls -l'

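Now we can use the `os.system` function to invoke the command.  The command's output is printed, and the value returned (0 below) is the command's exit status.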

In [10]:
os.system(dircmd + ' ' + datapath + datafile)

-rw-r--r--@ 1 tmt  staff  540 Jul 20 16:07 ./hogwarts_grades.csv


0
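
If the text printed by the command is needed inside Python, the standard library's `subprocess.run` can capture it. The sketch below is only an illustration: it launches a second Python interpreter (via `sys.executable`) so that it runs on any platform, rather than the `ls` or `dir` command used above. On Windows, `dir` is built into the shell, so running it this way would likely also require `shell=True`.

```python
import subprocess
import sys

# Run a small command and capture what it prints instead of letting the
# output go straight to the terminal.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True,
    text=True,
)

print(result.returncode)      # exit status, like the value os.system returns
print(result.stdout.strip())  # the text the command printed
```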

We can also use Jupyter's `!` shell escape to execute the command.  Python expressions written in braces are evaluated and substituted into the command line.

In [11]:
!{dircmd} {datapath + datafile}

-rw-r--r--@ 1 tmt  staff  540 Jul 20 16:07 ./hogwarts_grades.csv


After a bit of Google searching, I didn't immediately find a good portable replacement for the output of `ls` and `dir`.  

The `os.stat` method returns the data, but not in a very user friendly format.

In [12]:
os.stat(datapath + datafile)

os.stat_result(st_mode=33188, st_ino=33440688, st_dev=16777230, st_nlink=1, st_uid=501, st_gid=20, st_size=540, st_atime=1667326920, st_mtime=1658351255, st_ctime=1658352142)

In [13]:
os.stat(datapath)

os.stat_result(st_mode=16877, st_ino=33437672, st_dev=16777230, st_nlink=15, st_uid=501, st_gid=20, st_size=480, st_atime=1667326934, st_mtime=1667326934, st_ctime=1667326934)
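
Individual fields of the `os.stat` result can be pulled out and formatted one at a time. The helper below, `describe_file`, is a hypothetical name used only for this sketch; it reports the size and modification time, and is demonstrated on a throwaway file so it does not depend on the notebook's data.

```python
import datetime
import os
import tempfile


def describe_file(fspec):
    """Return the size and modification time of a file (hypothetical helper)."""
    info = os.stat(fspec)
    modified = datetime.datetime.fromtimestamp(info.st_mtime)
    return {
        "size_bytes": info.st_size,
        # drop the fractional seconds for readability
        "modified": modified.replace(microsecond=0),
    }


# Demonstrate on a temporary file with known contents (24 bytes).
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.csv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("Name,Potions\nHarry,80.0\n")
    details = describe_file(demo_path)

print(details["size_bytes"], details["modified"])
```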

Below, I have attempted some tasks that are operating system specific.  As a result, I need a couple of extra modules.

Unfortunately, it appears Windows does not have an implementation of the `pwd` or `grp` modules, so I have to selectively load modules.

In [14]:
import datetime

if os.name != 'nt':
    import pwd
    import grp

Make a (very) limited version of a function to format the stat results.

In [15]:
def file_ls(fspec):
    """Produce a primitive version of a Linux 'ls -l' command that is platform independent (or at least platform aware).
    
        Args:
            fspec(string)  - file specification
            
        Returns:
            None
    """
    stat_result = os.stat(fspec)
   
    # get the file access permissions, then convert the permissions to a readable version
    perm_translation = {'000': '---',
                        '001': '--x',
                        '010': '-w-',
                        '011': '-wx',
                        '100': 'r--',
                        '101': 'r-x',
                        '110': 'rw-',
                        '111': 'rwx'
                       }
    perms = str(bin(stat_result.st_mode))[-9:]
    perms = perm_translation[perms[0:3]] + perm_translation[perms[3:6]] + perm_translation[perms[6:]]
    
    # get the modification time, then drop the fractional seconds part of the string
    mod_date = str(datetime.datetime.fromtimestamp(stat_result.st_mtime))
    mod_date = mod_date.split('.')[0]
    
    # if we are not on Windows, get a readable user name and group name; otherwise, just use the uid and gid
    if os.name != 'nt':
        name = pwd.getpwuid(stat_result.st_uid).pw_name
        group = grp.getgrgid(stat_result.st_gid).gr_name
    else:
        name = stat_result.st_uid
        group = stat_result.st_gid
        
    print(perms, stat_result.st_size, name, group, mod_date, fspec)

In [16]:
file_ls(datapath + datafile)

rw-r--r-- 540 tmt staff 2022-07-20 16:07:35 ./hogwarts_grades.csv


In [17]:
file_ls('.')

rwxr-xr-x 480 tmt staff 2022-11-01 13:22:14 .
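
As an aside, the standard library's `stat.filemode` function performs a similar permission translation and could stand in for the lookup table in `file_ls`. Unlike `file_ls`, it also includes a leading file-type character (`-` for a regular file, `d` for a directory). The sketch below uses a temporary directory so it runs anywhere; the exact permission bits printed depend on the system.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.txt")
    with open(demo_path, "w") as handle:
        handle.write("hello\n")

    # Convert the numeric st_mode into an 'ls -l' style string.
    file_mode = stat.filemode(os.stat(demo_path).st_mode)
    dir_mode = stat.filemode(os.stat(tmpdir).st_mode)

print(file_mode)  # starts with '-' for a regular file
print(dir_mode)   # starts with 'd' for a directory
```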


## Reading and Writing CSV Files

Let's take a look at our data file.  You can do this by finding the file and opening it in Jupyter notebook.

Here, we show a screen shot of the file as it appears in a Jupyter tab.

![image.png](attachment:ef00450c-321f-4703-b5c7-9d67bde7b4ec.png)

However, the actual file looks a bit different.  We will again do some OS-specific stuff...

In [18]:
if os.name == 'nt':
    list_file = 'type'
else:
    list_file = 'cat'

In [19]:
!{list_file} {datapath + datafile}

Name,Potions,Transfiguration,Runes,Defense,Divination,Data Science,Charms,Herbology
Harry,80.0,92.0,,100.0,71.0,92.0,93.0,83.0
Hermione,100.0,100.0,100.0,100.0,,100.0,100.0,100.0
Ron,70.0,83.0,,92.0,73.0,98.0,95.0,87.0
Draco,100.0,88.0,,72.0,75.0,72.0,92.0,92.0
Crabbe,31.0,15.0,,29.0,6.0,3.0,70.0,52.0
Fred,75.0,93.0,,91.0,,58.0,,
George,75.0,93.0,,91.0,,58.0,,
Goyle,23.0,43.0,,32.0,11.0,21.0,41.0,49.0
Luna,94.0,97.0,100.0,93.0,98.0,100.0,98.0,98.0
Cho,,92.0,,95.0,93.0,98.0,95.0,97.0
Cedric,,98.0,,,,,,
Neville,,84.0,,71.0,78.0,,,100.0
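
A portable alternative to `cat` and `type` is to read the file with Python itself. The helper name `show_file` below is hypothetical, and the sketch writes its own small file rather than assuming the notebook's data file is present.

```python
import os
import tempfile


def show_file(fspec):
    """Print a text file's contents and return them (hypothetical helper)."""
    with open(fspec, "r", encoding="utf-8") as handle:
        text = handle.read()
    print(text, end="")
    return text


# Demonstrate on a throwaway file rather than the notebook's data file.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.csv")
    with open(demo_path, "w", encoding="utf-8", newline="") as handle:
        handle.write("Name,Potions\nHarry,80.0\n")
    shown = show_file(demo_path)
```

In the notebook, the equivalent call would be `show_file(datapath + datafile)`.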


We have seen how to read a CSV file into a Pandas dataframe before.  Let's load our data.

In [20]:
hogwarts = pd.read_csv(datapath + datafile)
hogwarts

,Name,Potions,Transfiguration,Runes,Defense,Divination,Data Science,Charms,Herbology
0,Harry,80.0,92.0,,100.0,71.0,92.0,93.0,83.0
1,Hermione,100.0,100.0,100.0,100.0,,100.0,100.0,100.0
2,Ron,70.0,83.0,,92.0,73.0,98.0,95.0,87.0
3,Draco,100.0,88.0,,72.0,75.0,72.0,92.0,92.0
4,Crabbe,31.0,15.0,,29.0,6.0,3.0,70.0,52.0
5,Fred,75.0,93.0,,91.0,,58.0,,
6,George,75.0,93.0,,91.0,,58.0,,
7,Goyle,23.0,43.0,,32.0,11.0,21.0,41.0,49.0
8,Luna,94.0,97.0,100.0,93.0,98.0,100.0,98.0,98.0
9,Cho,,92.0,,95.0,93.0,98.0,95.0,97.0


Well, that was pretty easy.  However, what we really wanted was for the student names to be the dataframe index.  We can read the file and specify the index column.

In [21]:
hogwarts = pd.read_csv(datapath + datafile, index_col='Name')
hogwarts

Name,Potions,Transfiguration,Runes,Defense,Divination,Data Science,Charms,Herbology
Harry,80.0,92.0,,100.0,71.0,92.0,93.0,83.0
Hermione,100.0,100.0,100.0,100.0,,100.0,100.0,100.0
Ron,70.0,83.0,,92.0,73.0,98.0,95.0,87.0
Draco,100.0,88.0,,72.0,75.0,72.0,92.0,92.0
Crabbe,31.0,15.0,,29.0,6.0,3.0,70.0,52.0
Fred,75.0,93.0,,91.0,,58.0,,
George,75.0,93.0,,91.0,,58.0,,
Goyle,23.0,43.0,,32.0,11.0,21.0,41.0,49.0
Luna,94.0,97.0,100.0,93.0,98.0,100.0,98.0,98.0
Cho,,92.0,,95.0,93.0,98.0,95.0,97.0


That's better.

Now write the dataframe to a CSV.

In [22]:
newdatafile = 'hogwarts_grades_again.csv'

In [23]:
hogwarts.to_csv(datapath + newdatafile)

In [24]:
!{dircmd} {datapath + 'hogwarts*.csv'}

-rw-r--r--@ 1 tmt  staff  540 Jul 20 16:07 ./hogwarts_grades.csv
-rw-r--r--  1 tmt  staff  540 Nov  1 13:25 ./hogwarts_grades_again.csv


The files are the same size.  Verify they are identical.

In [25]:
if os.name == 'nt':
    diff = 'fc'
else:
    diff = 'diff'

In [26]:
!{diff} {datapath + datafile} {datapath + newdatafile}

Yes, they are identical.
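
The same check can be made without an OS-specific command by using the standard library's `filecmp` module. This is a sketch on temporary files with made-up contents; the file names are illustrative only.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    first = os.path.join(tmpdir, "first.csv")
    second = os.path.join(tmpdir, "second.csv")
    third = os.path.join(tmpdir, "third.csv")

    contents = [
        (first, "Name,Potions\nHarry,80.0\n"),
        (second, "Name,Potions\nHarry,80.0\n"),
        (third, "Name,Potions\nHarry,81.0\n"),
    ]
    for path, text in contents:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)

    # shallow=False compares the file contents, not just the os.stat data.
    same = filecmp.cmp(first, second, shallow=False)
    different = filecmp.cmp(first, third, shallow=False)

print(same, different)
```

In the notebook, the equivalent call would be `filecmp.cmp(datapath + datafile, datapath + newdatafile, shallow=False)`.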

Display the contents of the new file.

In [27]:
!{list_file} {datapath + newdatafile}

Name,Potions,Transfiguration,Runes,Defense,Divination,Data Science,Charms,Herbology
Harry,80.0,92.0,,100.0,71.0,92.0,93.0,83.0
Hermione,100.0,100.0,100.0,100.0,,100.0,100.0,100.0
Ron,70.0,83.0,,92.0,73.0,98.0,95.0,87.0
Draco,100.0,88.0,,72.0,75.0,72.0,92.0,92.0
Crabbe,31.0,15.0,,29.0,6.0,3.0,70.0,52.0
Fred,75.0,93.0,,91.0,,58.0,,
George,75.0,93.0,,91.0,,58.0,,
Goyle,23.0,43.0,,32.0,11.0,21.0,41.0,49.0
Luna,94.0,97.0,100.0,93.0,98.0,100.0,98.0,98.0
Cho,,92.0,,95.0,93.0,98.0,95.0,97.0
Cedric,,98.0,,,,,,
Neville,,84.0,,71.0,78.0,,,100.0


Let's read the data back in to a different dataframe and make sure everything is saved correctly.

In [28]:
saved_hogwarts = pd.read_csv(datapath + newdatafile)

In [29]:
saved_hogwarts

,Name,Potions,Transfiguration,Runes,Defense,Divination,Data Science,Charms,Herbology
0,Harry,80.0,92.0,,100.0,71.0,92.0,93.0,83.0
1,Hermione,100.0,100.0,100.0,100.0,,100.0,100.0,100.0
2,Ron,70.0,83.0,,92.0,73.0,98.0,95.0,87.0
3,Draco,100.0,88.0,,72.0,75.0,72.0,92.0,92.0
4,Crabbe,31.0,15.0,,29.0,6.0,3.0,70.0,52.0
5,Fred,75.0,93.0,,91.0,,58.0,,
6,George,75.0,93.0,,91.0,,58.0,,
7,Goyle,23.0,43.0,,32.0,11.0,21.0,41.0,49.0
8,Luna,94.0,97.0,100.0,93.0,98.0,100.0,98.0,98.0
9,Cho,,92.0,,95.0,93.0,98.0,95.0,97.0


In [30]:
if hogwarts.equals(saved_hogwarts):
    print('Everything is cool.')
else:
    print('Hmmm...something is amiss.')

Hmmm...something is amiss.


In [31]:
hogwarts

Name,Potions,Transfiguration,Runes,Defense,Divination,Data Science,Charms,Herbology
Harry,80.0,92.0,,100.0,71.0,92.0,93.0,83.0
Hermione,100.0,100.0,100.0,100.0,,100.0,100.0,100.0
Ron,70.0,83.0,,92.0,73.0,98.0,95.0,87.0
Draco,100.0,88.0,,72.0,75.0,72.0,92.0,92.0
Crabbe,31.0,15.0,,29.0,6.0,3.0,70.0,52.0
Fred,75.0,93.0,,91.0,,58.0,,
George,75.0,93.0,,91.0,,58.0,,
Goyle,23.0,43.0,,32.0,11.0,21.0,41.0,49.0
Luna,94.0,97.0,100.0,93.0,98.0,100.0,98.0,98.0
Cho,,92.0,,95.0,93.0,98.0,95.0,97.0


Ahhh...our original dataframe index is showing up as a column in the dataframe we read from disk.  We need to tell `read_csv` which column to use as the index.

In [69]:
saved_hogwarts = pd.read_csv(datapath + newdatafile, index_col='Name')
saved_hogwarts

Name,Potions,Transfiguration,Runes,Defense,Divination,Data Science,Charms,Herbology
Harry,80.0,92.0,,100.0,71.0,92.0,93.0,83.0
Hermione,100.0,100.0,100.0,100.0,,100.0,100.0,100.0
Ron,70.0,83.0,,92.0,73.0,98.0,95.0,87.0
Draco,100.0,88.0,,72.0,75.0,72.0,92.0,92.0
Crabbe,31.0,15.0,,29.0,6.0,3.0,70.0,52.0
Fred,75.0,93.0,,91.0,,58.0,,
George,75.0,93.0,,91.0,,58.0,,
Goyle,23.0,43.0,,32.0,11.0,21.0,41.0,49.0
Luna,94.0,97.0,100.0,93.0,98.0,100.0,98.0,98.0
Cho,,92.0,,95.0,93.0,98.0,95.0,97.0


In [33]:
if hogwarts.equals(saved_hogwarts):
    print('Everything is cool.')
else:
    print('Hmmm...something is amiss.')

Everything is cool.


## Handling Missing Values in CSV Files

Note that the CSV file has many missing values.  By default, Pandas identifies missing values and replaces those values with NaN.  However, we have considerable control over this process should we have the need.

Missing data can be represented in several ways:
- null values (there is truly no value present)
- NaN (pandas)
- None (Python)
- a designated value, like -999

We can have pandas find missing values and either assign the `NaN` value or keep the source value for missing data when we read the file into memory. By default, `pd.read_csv` replaces empty fields and strings such as NA and NaN with `NaN`. Three arguments to `pd.read_csv` control this behavior:
- `na_values` 
    - This is not used very often, but it allows you to specify additional values that should be considered missing. If your data used -999 as a missing-value marker, for example, you would use `na_values=[-999]` to make sure pandas treated those entries as missing.
    - Note that the right side of the equals sign is a list, so you can include multiple values separated by commas.
- `keep_default_na` 
    - This is used in conjunction with `na_values`. 
    - If `True` (the default), then values like NA and NaN will be treated as missing in addition to the values you specified in the `na_values` parameter. 
    - If `False`, just the listed values are considered missing. 
- `na_filter` 
    - Specifies whether _any_ values will be coded as missing.
    - If `True` (the default), missing values are coded as `NaN`.
    - If `False`, nothing is recorded as missing.
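
The three arguments above can be compared side by side on a tiny in-memory CSV. This is a minimal sketch: the data is made up for illustration (it is not the Hogwarts file), and it assumes -999 is the designated missing-value marker.

```python
import io

import pandas as pd

# Made-up data: one sentinel (-999), one "NA" string, and one empty field.
csv_text = "Name,Potions,Runes\nHarry,80,-999\nCho,NA,100\nRon,,95\n"

# Defaults: "NA" and the empty field become NaN; -999 is kept as a number.
default_read = pd.read_csv(io.StringIO(csv_text))

# na_values adds -999 to the values treated as missing.
with_sentinel = pd.read_csv(io.StringIO(csv_text), na_values=[-999])

# keep_default_na=False: only the listed values (-999) count as missing.
only_sentinel = pd.read_csv(
    io.StringIO(csv_text), na_values=[-999], keep_default_na=False
)

# na_filter=False: nothing is coded as missing.
unfiltered = pd.read_csv(io.StringIO(csv_text), na_filter=False)

for label, frame in [
    ("defaults", default_read),
    ("na_values", with_sentinel),
    ("keep_default_na=False", only_sentinel),
    ("na_filter=False", unfiltered),
]:
    print(label, int(frame.isna().sum().sum()))
```

The printed counts are the number of cells each call treats as missing.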


For fun, load the Hogwarts data without replacing missing values with NaN.

In [34]:
hogwarts_is_missing = pd.read_csv(datapath + datafile, na_filter=False)
hogwarts_is_missing

,Name,Potions,Transfiguration,Runes,Defense,Divination,Data Science,Charms,Herbology
0,Harry,80.0,92.0,,100.0,71.0,92.0,93.0,83.0
1,Hermione,100.0,100.0,100.0,100.0,,100.0,100.0,100.0
2,Ron,70.0,83.0,,92.0,73.0,98.0,95.0,87.0
3,Draco,100.0,88.0,,72.0,75.0,72.0,92.0,92.0
4,Crabbe,31.0,15.0,,29.0,6.0,3.0,70.0,52.0
5,Fred,75.0,93.0,,91.0,,58.0,,
6,George,75.0,93.0,,91.0,,58.0,,
7,Goyle,23.0,43.0,,32.0,11.0,21.0,41.0,49.0
8,Luna,94.0,97.0,100.0,93.0,98.0,100.0,98.0,98.0
9,Cho,,92.0,,95.0,93.0,98.0,95.0,97.0


## Reading and Writing Excel Files

## Augmenting our data ##

We are going to augment our data, but first we will make a copy of our base dataframe.

In [35]:
hogwarts_augmented = hogwarts.copy()

Time for an assessment of all the students across all the classes.

How many total points does each student have across all the classes in which they are enrolled?

In [36]:
total_points = hogwarts_augmented.sum(axis='columns')
total_points

Name
Harry       611.0
Hermione    700.0
Ron         598.0
Draco       591.0
Crabbe      206.0
Fred        317.0
George      317.0
Goyle       220.0
Luna        778.0
Cho         570.0
Cedric       98.0
Neville     333.0
dtype: float64

What is the average (mean) score of each student across all their classes?

In [37]:
average_class_points = hogwarts_augmented.mean(axis='columns')
average_class_points

Name
Harry        87.285714
Hermione    100.000000
Ron          85.428571
Draco        84.428571
Crabbe       29.428571
Fred         79.250000
George       79.250000
Goyle        31.428571
Luna         97.250000
Cho          95.000000
Cedric       98.000000
Neville      83.250000
dtype: float64
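
Both `sum` and `mean` skip `NaN` values by default, which is why Cedric's average above is 98.0 even though he has only one grade. The sketch below illustrates this on a tiny made-up dataframe and shows what happens when `skipna=False` is passed instead.

```python
import numpy as np
import pandas as pd

# A small made-up frame with missing grades.
grades = pd.DataFrame(
    {
        "Potions": [80.0, np.nan],
        "Transfiguration": [92.0, 98.0],
        "Runes": [np.nan, np.nan],
    },
    index=["Harry", "Cedric"],
)

totals = grades.sum(axis="columns")     # NaN entries are skipped
averages = grades.mean(axis="columns")  # mean over the non-missing grades only
strict = grades.sum(axis="columns", skipna=False)  # any NaN makes the row NaN

print(totals)
print(averages)
print(strict)
```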

Add these two overall assessments to our dataframe.

In [38]:
hogwarts_augmented['Total Points'] = total_points
hogwarts_augmented

Name,Potions,Transfiguration,Runes,Defense,Divination,Data Science,Charms,Herbology,Total Points
Harry,80.0,92.0,,100.0,71.0,92.0,93.0,83.0,611.0
Hermione,100.0,100.0,100.0,100.0,,100.0,100.0,100.0,700.0
Ron,70.0,83.0,,92.0,73.0,98.0,95.0,87.0,598.0
Draco,100.0,88.0,,72.0,75.0,72.0,92.0,92.0,591.0
Crabbe,31.0,15.0,,29.0,6.0,3.0,70.0,52.0,206.0
Fred,75.0,93.0,,91.0,,58.0,,,317.0
George,75.0,93.0,,91.0,,58.0,,,317.0
Goyle,23.0,43.0,,32.0,11.0,21.0,41.0,49.0,220.0
Luna,94.0,97.0,100.0,93.0,98.0,100.0,98.0,98.0,778.0
Cho,,92.0,,95.0,93.0,98.0,95.0,97.0,570.0


In [39]:
hogwarts_augmented['Average Class Points'] = average_class_points
hogwarts_augmented

Name,Potions,Transfiguration,Runes,Defense,Divination,Data Science,Charms,Herbology,Total Points,Average Class Points
Harry,80.0,92.0,,100.0,71.0,92.0,93.0,83.0,611.0,87.285714
Hermione,100.0,100.0,100.0,100.0,,100.0,100.0,100.0,700.0,100.0
Ron,70.0,83.0,,92.0,73.0,98.0,95.0,87.0,598.0,85.428571
Draco,100.0,88.0,,72.0,75.0,72.0,92.0,92.0,591.0,84.428571
Crabbe,31.0,15.0,,29.0,6.0,3.0,70.0,52.0,206.0,29.428571
Fred,75.0,93.0,,91.0,,58.0,,,317.0,79.25
George,75.0,93.0,,91.0,,58.0,,,317.0,79.25
Goyle,23.0,43.0,,32.0,11.0,21.0,41.0,49.0,220.0,31.428571
Luna,94.0,97.0,100.0,93.0,98.0,100.0,98.0,98.0,778.0,97.25
Cho,,92.0,,95.0,93.0,98.0,95.0,97.0,570.0,95.0


Our original data is still available.

In [40]:
hogwarts

Name,Potions,Transfiguration,Runes,Defense,Divination,Data Science,Charms,Herbology
Harry,80.0,92.0,,100.0,71.0,92.0,93.0,83.0
Hermione,100.0,100.0,100.0,100.0,,100.0,100.0,100.0
Ron,70.0,83.0,,92.0,73.0,98.0,95.0,87.0
Draco,100.0,88.0,,72.0,75.0,72.0,92.0,92.0
Crabbe,31.0,15.0,,29.0,6.0,3.0,70.0,52.0
Fred,75.0,93.0,,91.0,,58.0,,
George,75.0,93.0,,91.0,,58.0,,
Goyle,23.0,43.0,,32.0,11.0,21.0,41.0,49.0
Luna,94.0,97.0,100.0,93.0,98.0,100.0,98.0,98.0
Cho,,92.0,,95.0,93.0,98.0,95.0,97.0


## Writing an Excel File

All the professors at Hogwarts are going to receive a copy of our data.  

Professor Snape is a big Excel user and wants his copy of the report delivered to him as an Excel spreadsheet.  

While it is true that Professor Snape could use Excel to directly read a CSV file, we find it best not to irritate Professor Snape, so we will directly produce an Excel file for him.

We want to create an Excel file with 2 sheets.  The original data will go on one sheet and our augmented data on another.

To do this, we need to use an `ExcelWriter`.  Here is an example from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html:

``` python
df2 = df1.copy()
with pd.ExcelWriter('output.xlsx') as writer:  
     df1.to_excel(writer, sheet_name='Sheet_name_1')
     df2.to_excel(writer, sheet_name='Sheet_name_2')
```

In [41]:
exceldatafile = 'hogwarts.xlsx'

In [42]:
with pd.ExcelWriter(datapath + exceldatafile) as writer:
    hogwarts.to_excel(writer, sheet_name='Original')
    hogwarts_augmented.to_excel(writer, sheet_name='Augmented')

Did we create an Excel file?

In [43]:
os.path.isfile(datapath + exceldatafile)

True

In [44]:
!{dircmd} {datapath + exceldatafile}

-rw-r--r--@ 1 tmt  staff  7011 Nov  1 13:25 ./hogwarts.xlsx


Open the file in Excel to see if things worked.  Here is a screenshot of the file in Excel.

![image.png](attachment:c275a6e4-5cc6-446f-af77-db83a06b0675.png)

![image.png](attachment:e4fea87c-a209-4515-9b2e-cc3b5b6818a6.png)

## Reading an Excel File

Naturally, we can also read from Excel files.

First, let's read the original data from Excel.

In [45]:
hogwarts_excel = pd.read_excel(datapath + exceldatafile, sheet_name='Original', index_col='Name')
hogwarts_excel

Name,Potions,Transfiguration,Runes,Defense,Divination,Data Science,Charms,Herbology
Harry,80.0,92,,100.0,71.0,92.0,93.0,83.0
Hermione,100.0,100,100.0,100.0,,100.0,100.0,100.0
Ron,70.0,83,,92.0,73.0,98.0,95.0,87.0
Draco,100.0,88,,72.0,75.0,72.0,92.0,92.0
Crabbe,31.0,15,,29.0,6.0,3.0,70.0,52.0
Fred,75.0,93,,91.0,,58.0,,
George,75.0,93,,91.0,,58.0,,
Goyle,23.0,43,,32.0,11.0,21.0,41.0,49.0
Luna,94.0,97,100.0,93.0,98.0,100.0,98.0,98.0
Cho,,92,,95.0,93.0,98.0,95.0,97.0


In [46]:
if hogwarts.equals(hogwarts_excel):
    print('Everything is cool.')
else:
    print('Hmmm...something is amiss.')

Hmmm...something is amiss.


Well, they ***look*** the same...

In [47]:
hogwarts

Name,Potions,Transfiguration,Runes,Defense,Divination,Data Science,Charms,Herbology
Harry,80.0,92.0,,100.0,71.0,92.0,93.0,83.0
Hermione,100.0,100.0,100.0,100.0,,100.0,100.0,100.0
Ron,70.0,83.0,,92.0,73.0,98.0,95.0,87.0
Draco,100.0,88.0,,72.0,75.0,72.0,92.0,92.0
Crabbe,31.0,15.0,,29.0,6.0,3.0,70.0,52.0
Fred,75.0,93.0,,91.0,,58.0,,
George,75.0,93.0,,91.0,,58.0,,
Goyle,23.0,43.0,,32.0,11.0,21.0,41.0,49.0
Luna,94.0,97.0,100.0,93.0,98.0,100.0,98.0,98.0
Cho,,92.0,,95.0,93.0,98.0,95.0,97.0


Check the datatypes for the columns.  The `equals` method checks both the values and the datatypes of the columns for equality.
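
To see the effect of datatypes in isolation, here is a small sketch with two made-up one-column dataframes that hold the same values, one as floats and one as integers. It also shows `pd.testing.assert_frame_equal` with `check_dtype=False`, one possible way to compare values while ignoring datatypes.

```python
import pandas as pd

names = ["Harry", "Hermione"]
as_float = pd.DataFrame({"Transfiguration": [92.0, 100.0]}, index=names)
as_int = pd.DataFrame({"Transfiguration": [92, 100]}, index=names)

# equals() is False because the column datatypes differ ...
print(as_float.equals(as_int))

# ... even though an element-by-element comparison finds no differences.
values_match = (as_float == as_int).all().all()
print(values_match)

# Raises an AssertionError on a mismatch; passes silently here.
pd.testing.assert_frame_equal(as_float, as_int, check_dtype=False)
```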

In [48]:
hogwarts.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, Harry to Neville
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Potions          9 non-null      float64
 1   Transfiguration  12 non-null     float64
 2   Runes            2 non-null      float64
 3   Defense          11 non-null     float64
 4   Divination       8 non-null      float64
 5   Data Science     10 non-null     float64
 6   Charms           8 non-null      float64
 7   Herbology        9 non-null      float64
dtypes: float64(8)
memory usage: 1.1+ KB


In [49]:
hogwarts_excel.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, Harry to Neville
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Potions          9 non-null      float64
 1   Transfiguration  12 non-null     int64  
 2   Runes            2 non-null      float64
 3   Defense          11 non-null     float64
 4   Divination       8 non-null      float64
 5   Data Science     10 non-null     float64
 6   Charms           8 non-null      float64
 7   Herbology        9 non-null      float64
dtypes: float64(7), int64(1)
memory usage: 864.0+ bytes


Ahhh...the 'Transfiguration' column has a different datatype.

In [50]:
hogwarts_excel['Transfiguration'] = hogwarts_excel['Transfiguration'].astype('float64')

In [51]:
hogwarts_excel.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, Harry to Neville
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Potions          9 non-null      float64
 1   Transfiguration  12 non-null     float64
 2   Runes            2 non-null      float64
 3   Defense          11 non-null     float64
 4   Divination       8 non-null      float64
 5   Data Science     10 non-null     float64
 6   Charms           8 non-null      float64
 7   Herbology        9 non-null      float64
dtypes: float64(8)
memory usage: 864.0+ bytes


In [52]:
if hogwarts.equals(hogwarts_excel):
    print('Everything is cool.')
else:
    print('Hmmm...something is amiss.')

Everything is cool.


Now, read the augmented data from Excel.

In [53]:
hogwarts_augmented_excel = pd.read_excel(datapath + exceldatafile, sheet_name='Augmented', index_col='Name')
hogwarts_augmented_excel

Name,Potions,Transfiguration,Runes,Defense,Divination,Data Science,Charms,Herbology,Total Points,Average Class Points
Harry,80.0,92,,100.0,71.0,92.0,93.0,83.0,611,87.285714
Hermione,100.0,100,100.0,100.0,,100.0,100.0,100.0,700,100.0
Ron,70.0,83,,92.0,73.0,98.0,95.0,87.0,598,85.428571
Draco,100.0,88,,72.0,75.0,72.0,92.0,92.0,591,84.428571
Crabbe,31.0,15,,29.0,6.0,3.0,70.0,52.0,206,29.428571
Fred,75.0,93,,91.0,,58.0,,,317,79.25
George,75.0,93,,91.0,,58.0,,,317,79.25
Goyle,23.0,43,,32.0,11.0,21.0,41.0,49.0,220,31.428571
Luna,94.0,97,100.0,93.0,98.0,100.0,98.0,98.0,778,97.25
Cho,,92,,95.0,93.0,98.0,95.0,97.0,570,95.0


In [54]:
hogwarts_augmented

Name,Potions,Transfiguration,Runes,Defense,Divination,Data Science,Charms,Herbology,Total Points,Average Class Points
Harry,80.0,92.0,,100.0,71.0,92.0,93.0,83.0,611.0,87.285714
Hermione,100.0,100.0,100.0,100.0,,100.0,100.0,100.0,700.0,100.0
Ron,70.0,83.0,,92.0,73.0,98.0,95.0,87.0,598.0,85.428571
Draco,100.0,88.0,,72.0,75.0,72.0,92.0,92.0,591.0,84.428571
Crabbe,31.0,15.0,,29.0,6.0,3.0,70.0,52.0,206.0,29.428571
Fred,75.0,93.0,,91.0,,58.0,,,317.0,79.25
George,75.0,93.0,,91.0,,58.0,,,317.0,79.25
Goyle,23.0,43.0,,32.0,11.0,21.0,41.0,49.0,220.0,31.428571
Luna,94.0,97.0,100.0,93.0,98.0,100.0,98.0,98.0,778.0,97.25
Cho,,92.0,,95.0,93.0,98.0,95.0,97.0,570.0,95.0


In [55]:
if hogwarts_augmented.equals(hogwarts_augmented_excel):
    print('Everything is cool.')
else:
    print('Hmmm...something is amiss.')

Hmmm...something is amiss.


Probably a type problem again...

In [56]:
hogwarts_augmented.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, Harry to Neville
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Potions               9 non-null      float64
 1   Transfiguration       12 non-null     float64
 2   Runes                 2 non-null      float64
 3   Defense               11 non-null     float64
 4   Divination            8 non-null      float64
 5   Data Science          10 non-null     float64
 6   Charms                8 non-null      float64
 7   Herbology             9 non-null      float64
 8   Total Points          12 non-null     float64
 9   Average Class Points  12 non-null     float64
dtypes: float64(10)
memory usage: 1.3+ KB


In [57]:
hogwarts_augmented_excel.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, Harry to Neville
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Potions               9 non-null      float64
 1   Transfiguration       12 non-null     int64  
 2   Runes                 2 non-null      float64
 3   Defense               11 non-null     float64
 4   Divination            8 non-null      float64
 5   Data Science          10 non-null     float64
 6   Charms                8 non-null      float64
 7   Herbology             9 non-null      float64
 8   Total Points          12 non-null     int64  
 9   Average Class Points  12 non-null     float64
dtypes: float64(8), int64(2)
memory usage: 1.0+ KB


Convert the datatypes of the different columns.

In [58]:
hogwarts_augmented_excel['Transfiguration'] = hogwarts_augmented_excel['Transfiguration'].astype('float64')

In [59]:
hogwarts_augmented_excel['Total Points'] = hogwarts_augmented_excel['Total Points'].astype('float64')
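
When several columns need converting, `astype` also accepts a dictionary mapping column names to datatypes, so the two cells above could be combined into one call. A minimal sketch on a made-up dataframe:

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "Transfiguration": [92, 100],   # integers, as read back from Excel
        "Total Points": [611, 700],     # integers, as read back from Excel
        "Potions": [80.0, 100.0],       # already floats
    },
    index=["Harry", "Hermione"],
)

# Convert both integer columns in a single call; astype returns a new frame.
converted = frame.astype(
    {"Transfiguration": "float64", "Total Points": "float64"}
)

print(converted.dtypes)
```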

In [60]:
hogwarts_augmented_excel.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, Harry to Neville
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Potions               9 non-null      float64
 1   Transfiguration       12 non-null     float64
 2   Runes                 2 non-null      float64
 3   Defense               11 non-null     float64
 4   Divination            8 non-null      float64
 5   Data Science          10 non-null     float64
 6   Charms                8 non-null      float64
 7   Herbology             9 non-null      float64
 8   Total Points          12 non-null     float64
 9   Average Class Points  12 non-null     float64
dtypes: float64(10)
memory usage: 1.0+ KB


In [61]:
if hogwarts_augmented.equals(hogwarts_augmented_excel):
    print('Everything is cool.')
else:
    print('Hmmm...something is amiss.')

Hmmm...something is amiss.


Well, that didn't fix it.  What is going on here?

In [62]:
hogwarts_augmented.columns

Index(['Potions', 'Transfiguration', 'Runes', 'Defense', 'Divination',
       'Data Science', 'Charms', 'Herbology', 'Total Points',
       'Average Class Points'],
      dtype='object')

In [63]:
hogwarts_augmented_excel.columns

Index(['Potions', 'Transfiguration', 'Runes', 'Defense', 'Divination',
       'Data Science', 'Charms', 'Herbology', 'Total Points',
       'Average Class Points'],
      dtype='object')

This all ***appears*** correct.  Which column is throwing things off?

In [64]:
differences = [hogwarts_augmented[x].equals(hogwarts_augmented_excel[x]) for x in hogwarts_augmented.columns]

In [65]:
differences

[True, True, True, True, True, True, True, True, True, False]
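
A list of booleans has to be matched up with the column names by eye. A small variation, sketched here on made-up data, collects the names of the mismatched columns directly.

```python
import pandas as pd

left = pd.DataFrame({"Charms": [93.0, 100.0], "Average": [87.285714, 100.0]})
right = pd.DataFrame({"Charms": [93.0, 100.0], "Average": [87.285715, 100.0]})

# Keep only the names of the columns whose contents differ.
mismatched = [name for name in left.columns if not left[name].equals(right[name])]

print(mismatched)
```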

Time for a deeper look at 'Average Class Points'...

In [66]:
col = 'Average Class Points'

In [67]:
hogwarts_augmented[col] == hogwarts_augmented_excel[col]

Name
Harry        True
Hermione     True
Ron          True
Draco        True
Crabbe      False
Fred         True
George       True
Goyle       False
Luna         True
Cho          True
Cedric       True
Neville      True
Name: Average Class Points, dtype: bool

In [68]:
hogwarts_augmented[col] - hogwarts_augmented_excel[col]

Name
Harry       0.000000e+00
Hermione    0.000000e+00
Ron         0.000000e+00
Draco       0.000000e+00
Crabbe     -3.552714e-15
Fred        0.000000e+00
George      0.000000e+00
Goyle      -3.552714e-15
Luna        0.000000e+00
Cho         0.000000e+00
Cedric      0.000000e+00
Neville     0.000000e+00
Name: Average Class Points, dtype: float64

So, there is a subtle loss of precision in the conversion back and forth to Excel, but the differences in value are very small.
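
When differences this small do not matter, a tolerance-based comparison is one option. The sketch below uses made-up series and nudges one value by the smallest representable step to imitate the round-trip error; `np.isclose` and `pd.testing.assert_series_equal` then compare within their default tolerances.

```python
import numpy as np
import pandas as pd

original = pd.Series([29.428571428571427, 79.25], index=["Crabbe", "Fred"])

# Imitate a round trip that changes one value by the smallest possible amount.
round_tripped = original.copy()
round_tripped["Crabbe"] = np.nextafter(original["Crabbe"], np.inf)

exact = original == round_tripped            # exact comparison: Crabbe differs
close = np.isclose(original, round_tripped)  # comparison within a tolerance

print(exact)
print(close)

# Raises an AssertionError only if values differ beyond the default tolerance.
pd.testing.assert_series_equal(original, round_tripped)
```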