# Lecture 1031: Reading in Excel files

In [None]:
import pandas as pd

## pd.read_excel()

On October 31, 2022, I downloaded `export.xls` from [Cal-Access](https://cal-access.sos.ca.gov/Campaign/Committees/Detail.aspx?id=1414018&session=2021&view=received) using the download link labeled "DOWNLOAD THESE RESULTS: MICROSOFT EXCEL".

We're going to use the function [`pd.read_excel()`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) to import the data.

In [None]:
# you will get an error when running this
data = pd.read_excel('export.xls')

## Error 1
The error we got was "Excel file format cannot be determined, you must specify an engine manually." Let's go to the [function's documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) and search for `engine`.

```
engine: str, default None
If io is not a buffer or path, this must be set to identify io. Supported engines: “xlrd”, “openpyxl”, “odf”, “pyxlsb”. Engine compatibility :
- “xlrd” supports old-style Excel files (.xls).
- “openpyxl” supports newer Excel file formats.
- “odf” supports OpenDocument file formats (.odf, .ods, .odt).
- “pyxlsb” supports Binary Excel files.
```

This is a .xls file, so we should use the `xlrd` engine.
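
The engine list quoted above can be written down as a small lookup. This is only a sketch: `ENGINE_BY_SUFFIX` and `guess_engine` are hypothetical names of my own, not part of pandas, and the file extension is only a hint about what is really inside a file.

```python
from pathlib import Path

# Mapping based on the engine list in the read_excel documentation quoted above.
# ENGINE_BY_SUFFIX and guess_engine are hypothetical helpers, not pandas APIs.
ENGINE_BY_SUFFIX = {
    ".xls": "xlrd",       # old-style Excel files
    ".xlsx": "openpyxl",  # newer Excel file format
    ".odf": "odf",        # OpenDocument formats
    ".ods": "odf",
    ".odt": "odf",
    ".xlsb": "pyxlsb",    # binary Excel files
}


def guess_engine(filename):
    """Return the engine name suggested by the file extension, or None."""
    suffix = Path(filename).suffix.lower()
    return ENGINE_BY_SUFFIX.get(suffix)


print(guess_engine("export.xls"))
```

The result could be passed along as `pd.read_excel(filename, engine=guess_engine(filename))`, assuming the matching package is installed.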

In [None]:
# you will STILL get an error when running this
data = pd.read_excel('export.xls', engine='xlrd')

## Error 2

The error I got here was "Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd."

So, let's `pip install xlrd`. 

When you put an exclamation point at the start of a line in a code cell, Jupyter runs that line as a terminal command instead of as Python. (We'll learn about the terminal next week!)

In [None]:
!pip install xlrd
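
If you want to confirm that an install worked before re-running the import, one option is to ask Python whether the package can be found. This is a minimal sketch using the standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Check the two optional Excel dependencies used in this lecture.
for name in ("xlrd", "openpyxl"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```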

In [None]:
# you will STILL get an error when running this
data = pd.read_excel('export.xls', engine='xlrd')

## Error 3
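I still got an error: "Unsupported format, or corrupt file: Expected BOF record; found b'"NAME OF'". BOF stands for "beginning of file": `xlrd` expected the binary record that opens a real `.xls` file and instead found the plain text `"NAME OF`.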

OK, so does anyone know what the problem is?

If you have Excel on your computer, try opening up the file.

![Screenshot of error](xls_error.png "Screenshot of error")

Something is wrong with this file. When I got that alert, I clicked "Yes" and then saved the file in Excel as an `.xlsx` file.

(Note: I do not have Excel on my personal laptop, so I did this on my work laptop.)
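
If you don't have Excel handy, you can still check what a file really is by peeking at its first few bytes. A genuine old-style `.xls` file starts with a fixed binary signature, and an `.xlsx` file is a ZIP archive that starts with `PK`. The sketch below is an illustration, not what I did in class: `sniff_format` is a hypothetical helper, and the demo file contains only the eight bytes quoted in the error message, since that is all we know about the real file's contents.

```python
import os
import tempfile

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # container used by binary .xls files
ZIP_MAGIC = b"PK\x03\x04"                        # .xlsx files are ZIP archives


def sniff_format(path):
    """Guess a file's real format from its first bytes (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    return "not an Excel binary"


# Demo: a file named .xls that actually starts with the text from the error.
with tempfile.TemporaryDirectory() as tmp:
    fake_xls = os.path.join(tmp, "fake.xls")
    with open(fake_xls, "wb") as f:
        f.write(b'"NAME OF')
    result = sniff_format(fake_xls)

print(result)
```

On your own machine you would call `sniff_format('export.xls')` directly.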

## Try importing again

In [None]:
# we'll change the engine since `xlrd` doesn't support .xlsx files
data = pd.read_excel('export.xlsx', engine='openpyxl')

ARGH: "ImportError: Missing optional dependency 'openpyxl'.  Use pip or conda to install openpyxl."

### Install openpyxl

In [None]:
!pip install openpyxl

## FINALLY, import data

In [None]:
data = pd.read_excel('export.xlsx', engine='openpyxl')

Yay, that worked! Let's see what's inside and be sure to export a CSV of this!!!

## Explore and export

In [None]:
data

In [None]:
data.info()

POLL 4: Anything you're noticing about the data? [https://pollev.com/soooh](https://pollev.com/soooh)

### Re-import the data, typed

In [None]:
# IDs are labels, not quantities, so stop pandas from inferring a numeric type
data = pd.read_excel(
    'export.xlsx',
    engine='openpyxl',
    dtype={
        'ID NUMBER': object
    }
)

In [None]:
data
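
Here is a small sketch of why the column type matters. It uses a made-up two-row table and `pd.read_csv()` on an in-memory string so it runs without the Excel file. The names and ID values are invented, and it passes `str` rather than `object` so the effect is easy to see; `pd.read_excel()` accepts the same `dtype=` argument. It does not assume anything about what the real `ID NUMBER` values look like.

```python
import io

import pandas as pd

# Made-up data for illustration only; not from the Cal-Access export.
csv_text = "NAME,ID NUMBER\nAlice,00123\nBob,04567\n"

# Default behavior: pandas infers integers, so the leading zeros are lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# With an explicit dtype, the IDs stay exactly as written.
typed = pd.read_csv(io.StringIO(csv_text), dtype={"ID NUMBER": str})

print(inferred["ID NUMBER"].tolist())
print(typed["ID NUMBER"].tolist())
```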

### Export

In [None]:
data.to_csv('newsom_contribs.csv', index=False)
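
To see what `index=False` changes, here is a sketch with a made-up two-row DataFrame. Calling `to_csv()` without a filename returns the CSV as a string, which makes the two versions easy to compare. The column names and values are invented for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({"NAME": ["Alice", "Bob"], "AMOUNT": [100, 250]})

with_index = df.to_csv()                # row labels 0, 1 become an extra first column
without_index = df.to_csv(index=False)  # only the real columns are written

# Compare the header line of each version.
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

Leaving the index out keeps the CSV limited to the columns that were in the original data.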